Overview
DragonsTooth is a 48-node system designed to support general batch HPC jobs. The table below lists the technical details of each DragonsTooth node. Nodes are connected to each other and to storage via 10 gigabit Ethernet (10GbE), a communication channel with high bandwidth but higher latency than InfiniBand (IB). As a result, DragonsTooth is better suited than NewRiver, which has similar nodes but a low-latency IB interconnect, to jobs that require less internode communication and/or less I/O interaction with non-local storage. To support I/O-intensive jobs, each DragonsTooth node is outfitted with nearly 2 TB of solid-state local disk. DragonsTooth was released to the Virginia Tech research community in August 2016.
In November 2018, DragonsTooth was reprovisioned with Slurm as its scheduler, replacing Moab/Torque.
Technical Specifications
CPU | 2 x Intel Xeon E5-2680v3 (Haswell) 2.5 GHz 12-core |
---|---|
Memory | 256 GB 2133 MHz DDR4 |
Local Storage | 4 x 480 GB SSD Drives |
Theoretical Peak (DP) | 806 GFLOPS |
Policies
Note: DragonsTooth is governed by an allocation manager, meaning that in order to run most jobs on it, you must be an authorized user of an allocation that has been submitted and approved. For more on allocations, click here.
As described above, communication between nodes, and between nodes and storage, has higher latency on DragonsTooth than on other ARC clusters. For this reason, the queue structure is designed to allow more jobs and longer-running jobs than on other ARC clusters.
DragonsTooth has two partitions (queues):
- normal_q for production (research) runs.
- dev_q for short testing, debugging, and interactive sessions. dev_q provides slightly elevated job priority to facilitate code development and job testing prior to production runs.
The settings for the partitions are:
QUEUE | NORMAL_Q | DEV_Q |
---|---|---|
Access to | dt003-dt048 | dt003-dt048 |
Max Jobs | 288 per user, 432 per allocation | 1 per user |
Max Nodes | 12 per user, 18 per allocation | 12 per user |
Max Core-Hours* | 34,560 per user, 51,840 per allocation | 96 per user |
Max Walltime | 30 days | 2 hr |
Other notes:
- Shared node access: more than one job can run on a node.
* A user cannot, at any one time, have more than this many core-hours allocated across all of their running jobs. So you can run long jobs or large/many jobs, but not both. For illustration, the following table shows how many nodes a user can allocate for a given walltime (a worked example follows the table):
Walltime | Max Nodes (per user) | Max Nodes (per allocation) |
---|---|---|
72 hr (3 days) | 12 | 18 |
144 hr (6 days) | 10 | 15 |
360 hr (15 days) | 4 | 6 |
720 hr (30 days) | 2 | 3 |
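Each node has 24 cores (2 x 12-core CPUs, per the specifications above), so these node counts follow from dividing the core-hour cap by cores per node times walltime; at 72 hours the separate Max Nodes limit (12 per user, 18 per allocation) is what binds instead. A quick check of the 144-hour row:

```
# Assumes 24 cores per node (2 x 12-core CPUs)
echo $(( 34560 / (24 * 144) ))   # per-user cap       -> 10 nodes
echo $(( 51840 / (24 * 144) ))   # per-allocation cap -> 15 nodes
```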
Software
For a list of software available on DragonsTooth, as well as a comparison of software available across all ARC systems, click here.
Note that a user will have to load the appropriate module(s) in order to use a given software package on the cluster. The module avail and module spider commands can also be used to find software packages available on a given system.
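For example, a typical sequence looks like the following (the package name is illustrative; substitute the software you actually need):

```
module avail          # list modules available under the currently loaded toolchain
module spider hdf5    # search all toolchains for a package (hdf5 is only an example)
module load hdf5      # load the module once the correct name/version is known
```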
Usage
The cluster is accessed via ssh to one of the login nodes below. Log in using your username (usually your Virginia Tech PID) and password. You will need an SSH client to log in; see here for information on how to obtain and use one. An example login command is shown after the list.
- dragonstooth1.arc.vt.edu
- dragonstooth2.arc.vt.edu
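For example, from a terminal (replace yourPID with your own username; either login node works):

```
ssh yourPID@dragonstooth1.arc.vt.edu
```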
Job Submission
Access to all compute engines is controlled via the Slurm job scheduler. See the Slurm Job Submission page here. The basic flags are:
#SBATCH -p normal_q (or other partition, see Policies)
#SBATCH -A <yourAllocation> (see Policies)
#SBATCH -t dd-hh:mm:ss
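As a minimal sketch of how these flags fit into a batch script (the allocation name, time limit, and workload are placeholders):

```
#!/bin/bash
#SBATCH -p normal_q
#SBATCH -A yourAllocation      # placeholder; use your own allocation name
#SBATCH -t 0-01:00:00          # 1 hour
#SBATCH --ntasks=1

echo "Running on $(hostname)"
```

Submit the script with `sbatch yourscript.sh`; Slurm prints the job number it assigns.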
The DragonsTooth cluster formerly ran a different scheduler (Moab/Torque) that accepted #PBS-style directives. Compatibility settings were put in place during the transition to Slurm so that most of these directives and commands continue to work without modification. In particular, the following PBS environment variables are populated so that jobs which depend on them still work:
PBS_O_WORKDIR=<job submission directory>
PBS_JOBID=<job number>
PBS_NP=<#cpu-cores allocated to job>
PBS_NODEFILE=<file containing list of the job's nodes>
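So a legacy fragment like the following should continue to behave as it did under Moab/Torque (the commands are illustrative):

```
cd $PBS_O_WORKDIR        # move to the directory the job was submitted from
echo "Job $PBS_JOBID is running on $PBS_NP cores"
cat $PBS_NODEFILE        # list the nodes assigned to this job
```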
Shared Node
DragonsTooth compute nodes can be shared by multiple jobs. Resources can be requested by specifying the number of nodes, processes per node (ppn), cores, memory, etc. See example resource requests below:
# Request exclusive access to all resources on 2 nodes
#SBATCH --nodes=2
#SBATCH --exclusive

# Request 4 cores (on any number of nodes)
#SBATCH --ntasks=4

# Request 2 nodes with 12 tasks running on each
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=12

# Request 12 tasks with 20GB memory per core
#SBATCH --ntasks=12
#SBATCH --mem-per-cpu=20G

# Request 5 nodes and spread 50 tasks evenly across them
#SBATCH --nodes=5
#SBATCH --ntasks=50
#SBATCH --spread-job
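Once a job starts, you can confirm what was actually granted by printing Slurm's environment variables from inside the batch script, for example (a sketch; some variables are set only when the corresponding option was requested):

```
echo "Nodes:          $SLURM_JOB_NUM_NODES"
echo "Tasks:          $SLURM_NTASKS"
echo "Tasks per node: $SLURM_NTASKS_PER_NODE"   # set only if --ntasks-per-node was given
echo "Mem per CPU:    $SLURM_MEM_PER_CPU"       # set only if --mem-per-cpu was given
```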
Finding Information
Check status of a job after submission:
squeue
Get detailed information about a running job:
scontrol show job <job-number>
Check status of the cluster's nodes and partitions:
sinfo
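Some commonly useful variants of these commands (the job number is a placeholder):

```
squeue -u $USER             # show only your own jobs
scontrol show job 123456    # 123456 is a placeholder job number
sinfo -p normal_q           # node status for a single partition
```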
Examples
This shell script provides a template for submission of jobs on DragonsTooth. The comments in the script include notes about how to request resources, load modules, submit MPI jobs, etc.
To use this script template, create your own copy and edit it as described here.
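As an illustrative sketch only (not the ARC-provided template; the module names and program are placeholders), a simple MPI job on DragonsTooth might look like:

```
#!/bin/bash
#SBATCH -p normal_q
#SBATCH -A yourAllocation         # placeholder allocation name
#SBATCH -t 0-04:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24      # 24 cores per node (2 x 12-core CPUs)

# Load the software environment (names are placeholders; check module spider)
module purge
module load gcc openmpi

# Launch the MPI program across all allocated tasks
srun ./my_mpi_program
```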