HPC Frequently Asked Questions
Question: Why can't I log in?
Answer: Login problems can occur for a number of reasons. If you cannot log into one of ARC's systems, please check the following:
- Is your PID password expired? Try logging into my.vt.edu. If you cannot log in there, then your PID password has likely expired and needs to be changed. (Contact 4Help for help with this issue.)
- Are you on-campus? If you are not on-campus, you will need to connect to the Virginia Tech VPN in order to access ARC's systems.
- Is the hostname correct? Please check the name of the login node(s) for the system you are trying to access. For example, for login to Cascades, the hostname is not cascades.arc.vt.edu but rather cascades1.arc.vt.edu or cascades2.arc.vt.edu.
- Do you have an account? You must request an account on a system before you can log in.
- Is there a maintenance outage? ARC systems are occasionally taken offline for maintenance purposes. Users are typically notified via email well ahead of maintenance outages.
If you have checked all of the above and are still not sure why you cannot log in, please submit a help ticket.
Question: How much does it cost to use ARC's systems?
Answer: ARC's systems are free, though privileged access can be purchased through the Investment Program. For most systems, this means that Virginia Tech researchers can simply request an account to get access. Use of the clusters (submitting and running jobs) does require an approved allocation, which in turn requires some basic information to be provided, but getting an allocation does not require monetary payment of any kind. More information on how to get started with ARC is here. More information on the Investment Program is here.
Question: Why is my job not starting?
Slurm (non-NewRiver) Clusters:
Answer: Typically the squeue command will provide the reason a job isn't starting. By default it shows information about all jobs in the queue, so it may be helpful to query for only your own jobs with squeue -u <your pid> or for only a particular job with squeue -j <jobid>. For example:
[brownm12@calogin2 ~]$ squeue -u brownm12
 JOBID PARTITION NAME     USER ST TIME NODES NODELIST(REASON)
310926  normal_q bash brownm12 PD 0:00    64 (PartitionNodeLimit)
This job has been submitted with a request for 64 nodes which exceeds the per-job limit on the
Other common reasons:
|Reason|Meaning|
|---|---|
|Priority, Resources|These two are the most common reasons given for a pending (PD) job. They simply mean that the job is waiting in the queue for resources to become available.|
|QOSMaxJobsPerUserLimit|The QOS applied to the partition restricts users to a maximum number of concurrently running jobs. As your jobs complete, queued jobs will be allowed to start.|
|QOSMaxCpuMinutesPerJobLimit|The QOS applied to the partition restricts jobs to a maximum number of CPU-minutes. To run, the job must request either fewer CPUs or less time.|
|PartitionTimeLimit|The requested time limit exceeds the maximum for the partition.|
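As a rough sketch of how these reason codes could be used in a script, the function below maps a reason string to a short explanation. The name explain_reason is made up for illustration and the explanations are paraphrased from this FAQ. The squeue options in the final comment (-h to drop the header, -o %r to print only the reason) are standard Slurm options, but that line needs a live cluster and is left as a comment.

```shell
# Map a Slurm pending reason to a short explanation (hypothetical helper)
explain_reason() {
    case "$1" in
        Priority|Resources)
            echo "waiting in the queue for resources to become available" ;;
        PartitionNodeLimit)
            echo "requested node count exceeds the per-job limit for the partition" ;;
        PartitionTimeLimit)
            echo "requested time limit exceeds the maximum for the partition" ;;
        QOSMaxJobsPerUserLimit)
            echo "too many of your jobs are running at once; queued jobs start as others finish" ;;
        QOSMaxCpuMinutesPerJobLimit)
            echo "job exceeds the CPU-minute limit; request fewer CPUs or less time" ;;
        *)
            echo "reason not covered here; see the Slurm documentation" ;;
    esac
}

explain_reason PartitionNodeLimit

# On a cluster, the reason for one job could be looked up with:
#   explain_reason "$(squeue -h -j <jobid> -o %r)"
```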
NewRiver only (Torque/Moab scheduler):
Answer: Typically the command checkjob -v <job id> will provide the reason. There are many different reasons that a submitted job will not start, though they often fall into one of the following categories:
- The job is missing header information. Job submissions must include a set of flags for the scheduler; if your job does not include one of these flags (or includes incorrect information), it may wind up stuck or rejected outright. Sample job submission scripts are included in the Examples section on each system page; compare your submitted job with the #PBS lines in that script to ensure that you have all of the required information. Examples for some missing flags:
- If the #PBS -q <queue name> flag is missing, it will produce the error message "qsub: Unknown queue MSG=cannot locate queue".
- If the #PBS -W group_list=<group name> flag is missing, it will produce the error message "qsub: Unauthorized Request MSG=group ACL is not satisfied".
- If the #PBS -A flag is incorrect, it will produce the error message "Defer:InvalidAccount" or "Insufficient funds: There are no valid allocations against which to make the lien."
- The job violates system policies. Each system has a set of policies that govern how users may use its resources (e.g., how long jobs may run or how many cores a given user may consume at one time). These policies are described in the Policies section on each system page. If your job violates one of these policies, it may wind up stuck and never run (look for "job violates constraints" or "job violates...limit" in the checkjob output) or it may be rejected with an immediate error message. Please ensure that your job is within the policies for the given system and queue that you are trying to use.
- The required resources are not available (yet). When you submit a job, you request a set of resources. This typically includes the number of nodes and number of cores that you require, but it may also include information about the type of those resources. (BlueRidge and Ithaca, for example, each have some "highmem" nodes with twice as much memory as normal nodes.) If those resources are not available at the time that you submit your job (that is, they are being used by another user), then your job will remain in the queue until the resources that you have requested are available for the amount of time that you require.
- The system will soon have an outage. ARC periodically takes systems offline to perform maintenance. When this occurs, all system resources will be reserved starting on the date and time that the maintenance will begin. So if your job is scheduled to run for 100 hours (4 days, 4 hours) and is submitted four days before the start of a maintenance outage, your job will remain in the queue until after the maintenance is complete. (Note that ARC will occasionally place non-maintenance reservations on a subset of a system's resources, such as for training classes.) The command showres will, on most systems, show information about the size and date/time of any reservations scheduled on a system. When a maintenance outage on a given system is scheduled, ARC sends an email notification to all users of that system to notify them of when the system will be unavailable.
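The arithmetic in the outage example can be sketched as follows. The helper names to_hours and fits_before_outage are hypothetical, and the check is a simplification: it assumes the scheduler will only start a job if its full requested walltime ends before the reservation begins.

```shell
# Convert "days hours" to a number of hours
to_hours() {
    echo $(( $1 * 24 + $2 ))
}

# Succeed if a job's requested walltime (hours) ends before the outage starts
fits_before_outage() {
    [ "$1" -le "$2" ]
}

job_hours=$(to_hours 4 4)    # the 100-hour job from the example
outage_in=$(to_hours 4 0)    # outage begins four days from now

if fits_before_outage "$job_hours" "$outage_in"; then
    echo "job can start before the outage"
else
    echo "job will wait until after the outage"
fi
```

Requesting a shorter walltime (here, 96 hours or less) is one way to let such a job start before the maintenance window.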
Question: When will my job start?
Answer: The command showstart <job id> will provide the system's best guess as to when the job will start. If showstart returns something like "Estimated Rsv based start in INFINITY", then either the system is about to undergo maintenance or something is wrong with the job. See "Why is my job not starting?" for more information.
Question: How do I submit an interactive job?
Answer: A user can request an interactive session on a compute node (e.g., for debugging purposes), using interact, a wrapper on srun. By default, this script will request one core (with one GPU on Infer) for one hour on a default partition (often dev_q, depending on the cluster). An allocation should be provided:
interact -A yourallocation
The request can be customized with standard job submission flags used by sbatch. Examples include:
- Request two hours:
interact -A yourallocation -t 2:00:00
- Request two hours on the
interact -A yourallocation -t 2:00:00 -p normal_q
- Request two hours on one core and one GPU on Infer's
interact -A yourallocation -t 2:00:00 -p t4_dev_q -n 1 --gres=gpu:1
- Get additional details on what
interact -A yourallocation --verbose
(The flags for requesting resources may vary from system to system; please see the documentation for the system that you want to use.)
Once the job has been submitted, the system may print out some information about the defaults that
interact has chosen. Once the resources requested are available, you will then get a prompt on a compute node. You can issue commands on the compute node as you would on the login node or any other system. To exit the interactive session, simply type
Note: As with any other job, if all resources on the requested queue are being used by running jobs at the time an interactive job is submitted, it may take some time for the interactive job to start.
Question: How do I change a job's stack size limit?
Answer: If your MPI code needs a larger stack size, you can set it in the command that you pass to mpirun by wrapping your program in a shell that raises the limit first. For example:
mpirun -bind-to-core -np $SLURM_NTASKS /bin/bash -c "ulimit -s unlimited; ./your_program"
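To see what the ulimit part of that command does, the sketch below prints the current soft and hard stack limits and then raises the soft limit inside a child shell. It raises the limit only as far as the hard limit, which an unprivileged user is always allowed to do; whether unlimited is permitted depends on how the node is configured, so treat that as something to check on the system you use.

```shell
# Current soft and hard stack limits (in KB, or "unlimited")
echo "soft: $(ulimit -Ss)  hard: $(ulimit -Hs)"

# Raise the soft limit to the hard limit inside a child shell and report it.
# The new limit applies to that shell and to any program it starts,
# which is why the mpirun example wraps the program in bash -c.
bash -c 'ulimit -Ss "$(ulimit -Hs)"; ulimit -Ss'
```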
Question: How do I check my job's resource usage?
Answer: The jobload command will report core and memory usage for each node of a given job. Example output is:
[jkrometi@tinkercliffs2 04/06 09:21:13 ~]$ jobload 129722
Basic job information:
JOBID  PARTITION NAME         ACCOUNT     USER     STATE   TIME  TIME_LIMIT NODES NODELIST(REASON)
129722 normal_q  tinkercliffs someaccount someuser RUNNING 43:43 8:00:00    2     tc[082-083]
Job is running on nodes:
tc082 tc083
Node utilization is:
node  cores load  pct   mem     used    pct
tc082 128   128.0 100.0 251.7GB 182.1GB 72.3
tc083 128   47.9  37.4  251.7GB 187.2GB 74.3
This TinkerCliffs job is using all 128 cores on one node but only about 48 cores on the second node. Since the job requested two full nodes, some optimization may be in order to ensure that it uses all of the requested resources. The job is, however, using 70-75% of the memory on both nodes, so the resource request may be intentional. If more information is required about a given node, the command scontrol show node tc083 can provide it.
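The pct column is simply load divided by cores. As a small sketch, the hypothetical helper below reads node rows in the layout of the example output (node, cores, load, ...) and prints any node whose core utilization falls below a threshold. The column positions are taken from the sample output only; the format on other systems may differ.

```shell
# Print "node pct" for rows whose load/cores percentage is below the threshold ($1).
# Expects rows shaped like the node table above: node cores load ...
underused_nodes() {
    awk -v t="$1" '$2 > 0 && ($3 / $2 * 100) < t { printf "%s %.1f\n", $1, $3 / $2 * 100 }'
}

# The two node rows from the example output
printf '%s\n' \
    "tc082 128 128.0 100.0 251.7GB 182.1GB 72.3" \
    "tc083 128 47.9 37.4 251.7GB 187.2GB 74.3" |
    underused_nodes 50
# prints: tc083 37.4
```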
Question: I need a software package for my research. Can you install it for me?
Answer: At any given time, ARC staff is trying to balance many high-priority tasks to improve, refine, or augment our systems. Unfortunately, this means that we typically cannot install all or even most of the software that our users require to do their research. As a result, the set of applications on each system does not typically change unless a new software package is requested by a large number of users. However, users are welcome to install software that they require for their research in their home directory. This generally involves copying the source code into one of your personal or group storage locations and then following the directions provided with the software to build that source code into an executable. If the vendor does not provide source code and just provides an executable (which is true of some commercial software packages), then you need to select the right executable for the system hardware and copy that into your home directory. ARC provides a script called setup_app that helps automate setup of directories and creation of personal modules.
Answer: The key is to create a modulefile for the software and make sure that it is in a location that can be found by MODULEPATH. Starting on TinkerCliffs and later systems, ARC provides a script called setup_app that automates much of this process. See also this video tutorial. Start by providing a software package and version, e.g.,
[jkrometi@tinkercliffs2 ~]$ setup_app julia 1.6.1-foss-2020b
Create directories /home/jkrometi/apps/tinkercliffs-rome/julia/1.6.1-foss-2020b and /home/jkrometi/easybuild/modules/tinkercliffs-rome/all/julia?
Enter y to let it proceed. The script will then set up the directory and the modulefile. It finishes by printing some information about what you need to do to finish off the install:
Done. To finish your build:
1. Install your app in /home/jkrometi/apps/tinkercliffs-rome/julia/1.6.1-foss-2020b/
2. Edit the modulefile in /home/jkrometi/easybuild/modules/tinkercliffs-rome/all/julia/1.6.1-foss-2020b.lua
   - Set or remove modules to load in the load() line
   - Edit description and URL
   - Check the variable names
   - Edit paths (some packages don't have, e.g., an include/)
Note: You may need to refresh the cache, e.g.,
  module --ignore_cache spider julia
to find the module the first time.
Note that setup_app also provides a --base flag that will allow installation somewhere other than the default location, e.g.,
setup_app --base=/projects/myproject julia 1.6.1-foss-2020b
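If a personal module cannot be found, one thing to check is whether its directory is actually listed in MODULEPATH. The sketch below is a minimal, hypothetical helper (on_modulepath) for that check. The directory used is the one from the example output above and is an assumption; substitute your own. If the directory is missing, module use <directory> is the usual way to add it for the current session.

```shell
# Succeed if the given directory is one of the colon-separated entries in MODULEPATH
on_modulepath() {
    case ":${MODULEPATH:-}:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Directory taken from the example output above; adjust for your account and cluster
moddir="$HOME/easybuild/modules/tinkercliffs-rome/all"

if on_modulepath "$moddir"; then
    echo "$moddir is on MODULEPATH"
else
    echo "$moddir is not on MODULEPATH; 'module use $moddir' would add it"
fi
```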
Question: What does a "Disk quota exceeded" error mean?
Answer: This typically means that one of your storage locations has exceeded the maximum allowable size. You will need to reduce the space consumed in order to run jobs successfully again.
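To find out what is taking up the space, standard tools such as du can help. The sketch below defines a hypothetical helper, largest_entries, that lists the biggest items directly under a directory, largest first, with sizes in KB. Note that the * pattern skips hidden files and directories, and that running du over a very large directory tree can take a while.

```shell
# List the N largest entries (size in KB, then path) directly under a directory.
# Defaults: your home directory and the top 10 entries.
largest_entries() {
    dir=${1:-$HOME}
    n=${2:-10}
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$n"
}

# Example usage:
#   largest_entries "$HOME" 5
```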
Question: What does a "Detected 1 oom-kill event(s)" error mean?
Answer: If your job fails with an error like
slurmstepd: error: Detected 1 oom-kill event(s)
then your job triggered Linux's Out of Memory Killer process. This means that it tried to use more memory than was allocated to the job. The seff command (Slurm job efficiency) also provides some information on resource utilization:
[user@infer1 ~]$ seff 1447
Job ID: 1447
Cluster: infer
User/Group: someuser/someuser
State: OUT_OF_MEMORY (exit code 0)
Nodes: 2
Cores per node: 32
CPU Utilized: 02:43:59
CPU Efficiency: 1.56% of 7-07:21:36 core-walltime
Job Wall-clock time: 02:44:24
Memory Utilized: 174.83 GB
Memory Efficiency: 49.11% of 356.00 GB
If your job is requesting a subset of a node, you will need to request more cores (which will give you more memory). If you are already requesting a full node, you will need to either edit your code or problem to use less memory or submit to different hardware that has more memory (e.g., the high memory nodes on TinkerCliffs) -- check the details for each cluster to find an option that might work for you.
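As a worked example of the "request more cores" advice, the sketch below estimates how many cores a job would need in order to be allocated a given amount of memory. It assumes that memory is granted in proportion to the cores requested, and the figure of 4 GB per core is purely hypothetical; check the documentation for your cluster for the real value. The function name cores_for_memory is also made up for illustration. The last line reproduces the Memory Efficiency figure from the seff output above.

```shell
# Cores needed to be allocated at least NEED_GB, given GB_PER_CORE (whole numbers).
# Rounds up, since cores can only be requested in whole units.
cores_for_memory() {
    need_gb=$1
    per_core_gb=$2
    echo $(( (need_gb + per_core_gb - 1) / per_core_gb ))
}

# Hypothetical: the job needs 30 GB and each core comes with 4 GB
cores_for_memory 30 4    # prints 8

# Memory efficiency as reported by seff: utilized / requested
awk 'BEGIN { printf "%.2f%%\n", 174.83 / 356.00 * 100 }'    # prints 49.11%
```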
Question: Why are basic commands like sbatch not recognized?
Answer: Starting with TinkerCliffs and Infer, ARC provides a default set of modules that are automatically loaded when you log in. If basic commands like sbatch are not recognized, it is often because these default modules have been removed (e.g., via module purge). Please run module reset and see if that solves your problem.
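The same check can be scripted, for example near the top of a setup script. This is only a sketch: have_cmd is a hypothetical helper, and the module command is called only if the shell actually defines it, which may not be the case outside of a login session.

```shell
# Succeed if the named command (or shell function) is available
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if ! have_cmd sbatch; then
    echo "sbatch not found; the default modules may have been removed" >&2
    # Restore the default modules if the module command is available
    if have_cmd module; then
        module reset
    fi
fi
```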
Question: How do I add a user to an allocation?
Answer: To add a user to an existing allocation, follow these steps:
- Go to ColdFront. (You may be prompted for a password.)
- You will see a list of your Projects. Click on the one you want to modify.
- Scroll down to "Users" and select "Add Users".
- Under "Search String" enter the user's PID (or a list of PIDs) and click Search.
- Scroll down, select the user whom you want to add, and click "Add Selected Users to Project".
- The page will refresh and the user's PID should be included in the Users table. They are now added to the project and its associated allocations.
Question: How do I attach to my process for debugging?
Short Answer: Attaching to a process for debugging no longer requires any special steps on ARC resources.
Longer Answer: Debuggers like gdb make software development much more efficient. Attaching to a process for debugging requires that the targeted process and the user's current process be in the same group. When ARC used Moab and Torque for scheduling and resource management, processes launched by the scheduler were started under a group other than the user's group. Special steps were then required to switch groups before trying to attach with gdb. However, the Slurm scheduler now used by ARC launches processes under the user's group, so these steps are no longer required. You may simply ssh to the compute node where the process is running, look up the process ID (e.g., with ps), and then attach to it.
Question: How can I submit a job that depends on the completion of another job?
Answer: Sometimes it may be useful to split one large computation into multiple jobs (e.g., due to queue limits), but submit those jobs all at once. Jobs can be made dependent on each other using the --dependency=afterany:job_id flag to sbatch, which holds a job until the listed job has terminated. (The similar-looking after:job_id only waits for the listed job to start, and afterok:job_id waits for it to complete successfully.) Additional dependency options can be found in the [documentation for sbatch](https://slurm.schedmd.com/sbatch.html "documentation for sbatch"). For example, here we submit three jobs, each of which depends on the preceding one:
[johndoe@tinkercliffs2 ~]$ sbatch test.sh
Submitted batch job 126448
[johndoe@tinkercliffs2 ~]$ sbatch --dependency=afterany:126448 test.sh
Submitted batch job 126449
[johndoe@tinkercliffs2 ~]$ sbatch --dependency=afterany:126449 test.sh
Submitted batch job 126450
The first job is eligible to start right away, but the second doesn't start until the first one finishes and the third job doesn't start until the second one finishes. This allows the user to split their job up into multiple pieces, submit them all right away, and then just monitor them as they run one after the other to completion.
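Typing the job IDs by hand gets tedious for longer chains. A minimal sketch of automating it is below. It relies on sbatch's --parsable option, which prints just the job ID (optionally followed by ";" and a cluster name), and it uses the afterany dependency type so that each job waits for the previous one to terminate. The function name submit_chain is hypothetical, and the script only defines the function; nothing is submitted until you call it on a cluster.

```shell
# submit_chain SCRIPT COUNT
# Submit SCRIPT COUNT times; each submission waits for the previous job to end.
# Prints one job ID per line.
submit_chain() {
    script=$1
    count=$2
    prev=""
    i=1
    while [ "$i" -le "$count" ]; do
        if [ -z "$prev" ]; then
            jobid=$(sbatch --parsable "$script") || return 1
        else
            jobid=$(sbatch --parsable --dependency=afterany:"$prev" "$script") || return 1
        fi
        # --parsable may print "jobid;cluster"; keep only the job ID
        jobid=${jobid%%;*}
        echo "$jobid"
        prev=$jobid
        i=$((i+1))
    done
}

# Example usage on a cluster:
#   submit_chain test.sh 3
```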
Question: How can I run multiple serial tasks inside one job?
Answer: Users with serial (sequential) programs may want to "package" multiple serial tasks into a single job submitted to the scheduler. This can be done with third-party tools (GNU Parallel is a good one) or using a loop within the job submission script. (A similar structure can be used to run multiple short, parallel tasks inside a job.) The basic structure is to loop through the number of tasks using while or for, start each task in the background using the & operator, and then use the wait command to wait for the tasks to finish:
# Define variables
numtasks=16
np=1
# Loop through numtasks tasks
while [ $np -le $numtasks ]
do
  # Run the task in the background with input and output depending on the variable np
  ./a.out $np > $np.out &
  # Increment task counter
  np=$((np+1))
done
# Wait for all of the tasks to finish
wait
Please note that the above structure will only work within a single node. To ensure that the same program (with the same inputs) isn't being run multiple times, users should make sure that the loop variable (np, above) is used to specify input files or parameters.
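One thing the loop above does not do is report whether any task failed. A minimal sketch of one way to do that is below: it records each background process ID and waits on them one at a time, so that the exit status of every task can be checked. The run_task function is a hypothetical stand-in for ./a.out so that the sketch runs anywhere, and the output goes to a temporary directory chosen only for illustration.

```shell
# Hypothetical stand-in for the real program (./a.out in the example above)
run_task() {
    echo "result for task $1"
}

outdir=$(mktemp -d)
numtasks=4
np=1
pids=""

# Start each task in the background and remember its process ID
while [ "$np" -le "$numtasks" ]; do
    run_task "$np" > "$outdir/$np.out" &
    pids="$pids $!"
    np=$((np+1))
done

# Wait for each task in turn and count the ones that exited with an error
failed=0
for pid in $pids; do
    wait "$pid" || failed=$((failed+1))
done

echo "$failed of $numtasks task(s) failed"
```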
Question: How can I run multiple short, parallel tasks inside one job?
Answer: Sometimes users have a parallel application that runs quickly, but that they need to run many times. In this case, it may be useful to package multiple parallel runs into a single job. This can be done using a loop within the job submission script. An example structure:
# Specify the list of tasks
tasklist="task1 task2 task3"
# Loop through the tasks
for tsk in $tasklist; do
  # run the task $tsk
  mpirun -np $SLURM_NTASKS ./a.out $tsk
done
To ensure that the same program (with the same inputs) isn't being run multiple times, users should make sure that the loop variable (tsk, above) is used to specify input files or parameters. Note that, unlike when running multiple serial tasks at once, in this case each task will not start until the previous one has finished.