Infer came online in January 2021 and provides 18 nodes, each with an Nvidia T4 GPU. The cluster's name, "Infer", alludes to the AI/ML inference capabilities of the T4 GPUs, which derive from the Tensor Cores on these devices. We think the nodes will also be a great all-purpose resource for researchers making their first forays into GPU-enabled computation of any type.
In the spring of 2021, 40 nodes with two Nvidia P100 GPUs each were migrated from an older ARC system that was being decommissioned.
Technical details are below:
| | T4 Nodes | P100 Nodes |
|---|---|---|
| Chip | Intel Xeon Gold 6130 | Intel Xeon E5-2680v4 2.4GHz |
| GPU Model | Nvidia Tesla T4 | Nvidia Tesla P100 |
| Total Memory (GB) | 3,456 | 20,480 |
| Local Disk | 480GB SSD | 187GB SSD |
The nodes in the Infer cluster have connectivity to network storage systems serving personal "home" and "work" storage space, as well as the multi-user shared storage location "projects".
| Mount | Storage System | Usage Limit (Quota) | Notes |
|---|---|---|---|
| /home | Qumulo | 640GB per person | Same /home as on all other ARC systems |
| /work | BeeGFS | 1TB per person | Same /work as on TinkerCliffs, but different from Cascades and DragonsTooth |
| /projects | BeeGFS | 25TB per PI* | Same /projects as on TinkerCliffs; not available on other clusters |
* "25TB per PI": each researcher at the PI level is allocated 25TB in the free tier to allocate among projects they own.
ARC users can log into Infer at:
Limits on Jobs and Resources Allocated to Jobs
Limits are set on the scale and quantity of jobs at the user and allocation (Slurm account) levels to help ensure availability of resources to a broad set of researchers and applications:
| Node Type | T4 GPU | T4 GPU | P100 GPU | P100 GPU |
|---|---|---|---|---|
| Billing Weight | 0 (no billing) | 0 (no billing) | 0 (no billing) | 0 (no billing) |
| Number of Nodes | 16 | 2 | -coming soon- | -coming soon- |
| Max Job Duration (hours) | 72 | 4 | | |
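As an illustration, a minimal Slurm batch script requesting a single T4 GPU might look like the sketch below. The partition and account names are placeholders, not confirmed values for Infer; run `sinfo` and check your allocation name before submitting.

```bash
#!/bin/bash
#SBATCH --job-name=t4-test
#SBATCH --partition=t4_normal_q    # placeholder partition name; confirm with `sinfo`
#SBATCH --account=yourallocation   # replace with your Slurm allocation (account) name
#SBATCH --nodes=1
#SBATCH --gres=gpu:1               # request one GPU on the node
#SBATCH --time=72:00:00            # at the 72-hour maximum from the table above

# Confirm the GPU assigned to this job is visible
nvidia-smi
```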
Limits on Storage
Infer's module structure is similar to that of TinkerCliffs, but it differs from previous ARC clusters in using a new application stack/module system based on EasyBuild. A video tutorial of module usage under this paradigm is provided here; a longer class on EasyBuild, including how you can use it to build your own modules, is here.
Key differences between EasyBuild and our legacy paradigm from a user perspective include:
Hierarchies are replaced by toolchains. Right now, there are four:
- `foss` ("Free Open Source Software"): GCC compilers, OpenBLAS for linear algebra, OpenMPI for MPI, etc.
- `fosscuda`: `foss` with CUDA support
- `intel`: Intel compilers, Intel MKL for linear algebra, Intel MPI
- `intelcuda`: `intel` with CUDA support
Instead of loading modules individually (e.g., `module load intel mkl impi`), a user can just load the toolchain (e.g., `module load fosscuda/2020b`).
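For instance, a short sketch of loading the CUDA-enabled toolchain and verifying that its compilers are available (using the version shown in the example below):

```bash
# Start from the default environment, then load the full fosscuda toolchain
module reset
module load fosscuda/2020b

# The toolchain provides GCC, OpenMPI, OpenBLAS/ScaLAPACK, FFTW, and CUDA;
# the compiler and MPI wrapper should now be on the PATH
which gcc mpicc nvcc
```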
Modules load their dependencies, e.g.:

```
$ module reset; module load GROMACS/2020.4-fosscuda-2020b; module list

Currently Loaded Modules:
  1) shared                       8) GCCcore/10.2.0                15) numactl/2.0.13-GCCcore-10.2.0     22) GDRCopy/2.1-GCCcore-10.2.0-CUDA-11.1.1  29) FFTW/3.3.8-gompic-2020b
  2) gcc/9.2.0                    9) zlib/1.2.11-GCCcore-10.2.0    16) XZ/5.2.5-GCCcore-10.2.0           23) UCX/1.9.0-GCCcore-10.2.0-CUDA-11.1.1    30) ScaLAPACK/2.1.0-gompic-2020b
  3) slurm/slurm/19.05.5         10) binutils/2.35-GCCcore-10.2.0  17) libxml2/2.9.10-GCCcore-10.2.0     24) libfabric/1.11.0-GCCcore-10.2.0         31) fosscuda/2020b
  4) apps                        11) GCC/10.2.0                    18) libpciaccess/0.16-GCCcore-10.2.0  25) PMIx/3.1.5-GCCcore-10.2.0               32) GROMACS/2020.4-fosscuda-2020b
  5) site/infer/easybuild/setup  12) CUDAcore/11.1.1               19) hwloc/2.2.0-GCCcore-10.2.0        26) OpenMPI/4.0.5-gcccuda-2020b
  6) useful_scripts              13) CUDA/11.1.1-GCC-10.2.0        20) libevent/2.1.12-GCCcore-10.2.0    27) OpenBLAS/0.3.12-GCC-10.2.0
  7) DefaultModules              14) gcccuda/2020b                 21) Check/0.15.2-GCCcore-10.2.0       28) gompic/2020b
```
All modules are visible with `module avail`, so in many cases it is probably better to search with `module spider` rather than printing the whole list.
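For example, to find GROMACS builds without scanning the full `module avail` listing:

```bash
# List all versions of GROMACS known to the module system
module spider GROMACS

# Show details (including any prerequisites) for one specific version
module spider GROMACS/2020.4-fosscuda-2020b
```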
Some key system software, like the Slurm scheduler, is included in the default modules. This means that `module purge` can break important functionality; use `module reset` instead.
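For example, to return a session to a clean default state before loading a new software stack:

```bash
# `module reset` restores the default modules (including Slurm);
# `module purge` would remove them entirely and can break job submission
module reset
```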
Lower-level software is included in the module structure (see, e.g., `binutils` in the GROMACS example above), which should mean less risk of conflicts when adding new versions later.
Environment variables (e.g., `$SOFTWARE_LIB`) available in our previous module system may not be provided. Instead, EasyBuild typically provides `$EBROOTSOFTWARE` to point to the software installation location. So, for example, to link to the NetCDF libraries, one might use `-L$EBROOTNETCDF/lib64` instead of the previous `-L$NETCDF_LIB`.
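A sketch of this pattern in practice is below. The netCDF module name and version are hypothetical; use `module spider netCDF` to see what is actually installed.

```bash
# Load a library installed via EasyBuild (hypothetical module version)
module reset
module load netCDF/4.7.4-gompi-2020b

# EasyBuild sets $EBROOTNETCDF to the installation root of the loaded module
echo $EBROOTNETCDF

# Use it for include and library paths when building your own code
gcc my_prog.c -I$EBROOTNETCDF/include -L$EBROOTNETCDF/lib64 -lnetcdf -o my_prog
```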