HPC Weekly Sync Meeting

If any of the terms on this page are unfamiliar, please consult the glossary of terms.

This primer can be used in your internal departmental discussions and to build general familiarity with how the term "HPC" is used.

The current Pomona HPC environment is based on a traditional bare-metal provisioning model in which the physical nodes are two-socket servers with the largest amount of RAM possible and specialized hardware such as GPUs. The specifications are largely based on interviews with faculty and students that we conducted at the end of 2017. If the current specifications are insufficient for your workload, we are open to a dialogue on how we can address your needs with additional hardware.

In addition to physical compute nodes, we also support virtual machines and containers: VMware for workloads that are better suited to the Enterprise network; OpenStack for workloads in the learning or development environment that are not meant to be permanent; and Docker and Singularity for portability, versioning, and reproducibility of workloads that are meant to move between your desktop, the HPC environment, and collaborators on other platforms, including the cloud.

The hardware was chosen to represent the variety of options available in today's HPC environments and to address differing software requirements, such as high core count versus higher GHz per core, as well as the diversity of platforms such as Intel, AMD, and NVIDIA. In the current configuration there are 5 physical servers with the following hardware specifications:

  1. DGX workstation
  2. SuperMicro Xeon with 2 NVLINK Volta cards (expandable to a total of 4)
  3. SuperMicro Xeon 8 core
  4. SuperMicro Xeon 22 core
  5. AMD Epyc 32 core

The current HPC environment supports up to 20 nodes, so the remaining 16 nodes are virtual machines provisioned using the Enterprise VMware environment. They can be customized to have 1-48 cores and 2 GB – 1.5 TB of RAM.

The local disk space on the physical servers can be expanded but is ultimately limited by the expansion capabilities of that particular server. The local disk space on the virtual machines is not constrained in the same way, but we do have quotas in place and may introduce some form of chargeback or showback at some point.

To request access to the HPC environment, please submit a ticket in Footprints, contact help@pomona.edu, or email us at its-hpc@pomona.edu. We ask that you describe in as much detail as possible how you intend to use the environment. We are happy to attend your department meeting or schedule one-on-one meetings as necessary to dive deeper into how we can assist.

The various technologies used in the HPC environment will be discussed in regular blog posts. We are also exploring alternative ways of sharing knowledge, such as video and podcast formats. In addition, we host monthly Research Computing Office Hours where we explore a chosen topic, sometimes with hands-on exercises, and we are available on Slack (please email its-hpc@pomona.edu to be added). We welcome feedback on how you prefer to access information that helps you navigate this relatively complicated terrain.

Once you are added to the HPC Users group, you should get a welcome email with the first steps, as well as pointers to additional resources. Your login will be the same as your regular username for any other ITS-supported computer on campus.

You will typically log in to the HPC environment through a login node, which serves as the gateway to the environment. Once you are logged in, you will be able to submit jobs using the [Slurm] scheduler. The jobs will run on compute nodes. You will be able to [ssh] into the individual compute nodes where your job is running, but only while the job is running. We can assist with profile configuration, adding modules, data transfers, and specialized software as needed.

Once you submit a job you can opt in to get an email when the job starts running and when the job finishes.
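As a minimal sketch of how the email opt-in can look in practice, the helper below builds the text of a Slurm batch script that uses the standard `--mail-type` and `--mail-user` options. The function name `build_batch_script`, the job name, the command, and the email address are hypothetical placeholders for illustration, not settings specific to our cluster.

```python
def build_batch_script(job_name, command, email, cpus=1, time_limit="00:10:00"):
    """Return the text of a Slurm batch script that requests email notifications.

    All argument values are illustrative; adjust them to your own job.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --time={time_limit}",
        # BEGIN and END ask Slurm to send mail when the job starts and finishes
        "#SBATCH --mail-type=BEGIN,END",
        f"#SBATCH --mail-user={email}",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical example values
    print(build_batch_script("demo", "srun hostname", "user@example.edu"))
```

The resulting text can be saved to a file on the login node and submitted with `sbatch`. Adding `FAIL` to the `--mail-type` list is a common choice if you also want to hear about jobs that end with an error.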

The input and output data for your job will typically reside either on the local disk of the compute node where your job is running, a dedicated iSCSI mount from the central All-Flash storage system (Pure Storage), or a shared parallel filesystem in the future (BeeGFS). None of these storage solutions are meant for storing data indefinitely. They are called "scratch space," and we encourage you to transfer the data somewhere it can be stored permanently, such as your home directory or another server. The scratch file systems will be purged in accordance with appropriate policies.
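One way to move results off scratch space before it is purged is sketched below using only the Python standard library. The function name `archive_results` and the directory paths in the usage note are hypothetical; the actual scratch and home locations depend on how your account is set up.

```python
import shutil
from pathlib import Path


def archive_results(scratch_dir, permanent_dir):
    """Copy every file under scratch_dir into permanent_dir, keeping the layout.

    Returns the list of destination paths that were written.
    """
    scratch = Path(scratch_dir)
    target = Path(permanent_dir)
    copied = []
    for src in sorted(scratch.rglob("*")):
        if src.is_file():
            dest = target / src.relative_to(scratch)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)  # copy2 also preserves timestamps
            copied.append(dest)
    return copied
```

For example, `archive_results("/scratch/myjob", Path.home() / "results" / "myjob")` would mirror a (hypothetical) scratch directory into your home directory. The sketch copies rather than moves, so the scratch copy stays in place until you remove it or it is purged.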

While you can log in to every available compute node to explore visualizations, view options for running your job, or interactively test a concept, the length of the session will be limited.

The operating system of choice in the HPC environment is [CentOS 7]. We do not typically support other operating systems, although we make exceptions when warranted.

The NVIDIA DGX workstation is one of those exceptions: we strive to provide a state-of-the-art platform for AI (Artificial Intelligence) research, and this particular hardware and software configuration is built and supported by NVIDIA using Ubuntu. The DGX workstation is accessible via the scheduler in the same way as any other compute node, but its software provisioning and user access are handled differently from the rest of the environment.

We use Bright Computing as the cluster provisioning and management tool. As a part of the Bright Computing platform we provide access to a web-based User Portal where you can see more specifics on the hardware, queues, users, and current utilization. The User Portal view can be modified to show more customized information.

Another tool we use in the HPC environment is XDMoD. This is an accounting tool that processes the accounting data from the scheduler to visualize cluster utilization and assist with grant-based accounting.

We strive to provide the most comfortable experience with the Pomona HPC infrastructure. While CLI (Command Line) and ssh-based access are the standard way to operate in HPC infrastructures everywhere, we also support X-Forwarding, NoMachine, and other remote visualization tools, and eventually plan to offer web-based access to a Scientific Gateway with pre-configured workflows.

We use Active Directory for authentication.

Home directories are shared across the nodes and mounted at login time using NFS. They are not the best place to store data that will be used for jobs, as the IO (Input/Output) performance may be inadequate compared to fast local disk or iSCSI All-Flash storage. We recommend that you consult with us to choose the right solution for each use case.

Most of the time, when we mention cores, we mean physical cores rather than virtual cores. Virtualization and Hyper-Threading are typically disabled on the physical nodes.

We support Docker and Singularity containers and are planning to establish a local repository for Pomona-based containers with the most common software in the near future.

We use git and GitHub for Version Control. If you need assistance with creating a repository or provisioning software from an existing repository, please let us know. This is generally the most expedient way to provision custom software in the HPC environment. Most of the software comes pre-installed as modules in an NFS-mounted directory on each node. We use Ansible for post-provisioning software installations.

We recommend using Anaconda/Conda or virtual environments when working with Python, both to isolate your packages from the system's Python and to keep different Python versions separate from one another.
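As a minimal sketch of the virtual environment approach, the example below uses the standard library `venv` module to create an isolated environment. The function name `create_project_env` and the directory name in the usage note are hypothetical, and `with_pip=False` is chosen here only to keep the sketch quick to run; in day-to-day use you would normally want pip available.

```python
import sys
import venv
from pathlib import Path


def create_project_env(env_dir):
    """Create an isolated virtual environment and return the path to its Python."""
    env_path = Path(env_dir)
    # with_pip=False skips bootstrapping pip; set it to True for real projects
    venv.create(env_path, with_pip=False)
    bin_dir = "Scripts" if sys.platform == "win32" else "bin"
    exe = "python.exe" if sys.platform == "win32" else "python"
    return env_path / bin_dir / exe
```

Calling `create_project_env(Path.home() / "envs" / "myproject")` would create the environment under a (hypothetical) `envs` directory in your home directory. The same result is available from the command line with `python -m venv`, and Conda environments serve a similar purpose when you also need to manage the Python version itself.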