SLURM
About SLURM
SLURM is the scheduler used by the Frontenac cluster. Like Sun Grid Engine (the scheduler used for the M9000 and SW clusters), SLURM is used for submitting, monitoring, and controlling jobs on a cluster. Any jobs or computations done on the Frontenac cluster must be started via SLURM. Reading this tutorial will supply all the information necessary to run jobs on Frontenac.
Although existing users are likely very familiar with Sun Grid Engine (SGE), switching to SLURM offers a number of advantages over the old system. The biggest advantage is that the scheduling algorithm is significantly better than that offered by SGE, allowing more jobs to be run on the same amount of hardware. SLURM also supports new types of jobs: users will now be able to schedule interactive sessions or run individual commands via the scheduler. In terms of administration and accounting, SLURM is also considerably more flexible. Although easier cluster administration does not directly impact users in the short term, CAC will be able to reconfigure its systems more easily over time to meet the changing needs of users and to perform critical system maintenance. All in all, we believe switching to SLURM will offer our users an all-around better experience when using our systems.
How SLURM works
SLURM is the piece of software that allows many users to share a compute cluster. A cluster is a set of networked computers, and each computer represents one "node" of the cluster. When a user submits a job, SLURM will schedule this job on a node (or nodes) that meets the resource requirements indicated by the user. If no resources are currently available, the user's job will wait in a queue until the resources they have requested become available for use.
Nodes in SLURM are divided into distinct "partitions" (similar to queues in SGE) and a node may be part of multiple partitions. Different partitions may have different uses, such as directing users' jobs to nodes with a particular piece of software installed (some software licenses only allow us to install software on a given number of nodes). Generally, the default partition (named "default") will suffice for most uses and encompasses the largest amount of hardware.
All users will have one or more SLURM usage accounts. Accounts are used to record accounting information and may be used to control access to certain partitions (such as those for RAC allocations). For everyday, default use, most users will not need to bother with accounts or accounting details (just be aware that they exist). For a detailed overview of SLURM accounts and accounting, please see our guide to SLURM accounting.
Basic SLURM commands
These commands cover the most common SLURM operations.
sinfo - Check the status of the cluster/partitions
sinfo
sinfo -lNe # same as above, but shows per-node status
Example output of sinfo on a small demonstration cluster. Nodes cac002-cac006 are part of the "standard" partition (jobs are submitted to this partition by default, indicated by the '*' character), and nodes cac007-cac009 are part of the "large" partition. One node in the "large" partition is currently allocated and being used (cac007).
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
standard*    up 2-00:00:00      5   idle cac[002-006]
large        up 14-00:00:0      1  alloc cac007
large        up 14-00:00:0      2   idle cac[008-009]
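As a rough illustration of how this output can be used in a script, the sketch below sums the idle nodes reported by sinfo. The function name count_idle_nodes is made up for this example, and it assumes the default column order shown above (NODES in column 4, STATE in column 5).

```shell
#!/bin/bash
# Sum the NODES column for every sinfo row whose STATE is "idle".
# Assumes the default column order:
# PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
count_idle_nodes() {
    awk 'NR > 1 && $5 == "idle" { total += $4 } END { print total + 0 }'
}

# On the cluster this would be used as:
#   sinfo | count_idle_nodes
```

For the sample output above this prints 7. Note that a node belonging to more than one partition is listed once per partition, so it would be counted more than once.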
squeue - Show status of jobs
squeue # your jobs
squeue -u <username> # show jobs for user <username>
squeue --start # show expected start times of jobs in queue
Example output of squeue on a demonstration cluster. User jeffs has 5 jobs running on nodes cac002-cac006 (in partition "standard"), and 4 jobs in queue. Job 1164 has not started because no resources are available for that job, and jobs 1165-1167 have not started because job 1164 has priority.
$ squeue
JOBID PARTITION     NAME  USER ST  TIME NODES NODELIST(REASON)
 1166  standard long-job jeffs PD  0:00     1 (Priority)
 1167  standard long-job jeffs PD  0:00     1 (Priority)
 1165  standard long-job jeffs PD  0:00     1 (Priority)
 1164  standard long-job jeffs PD  0:00     1 (Resources)
 1161  standard long-job jeffs  R  0:08     1 cac004
 1162  standard long-job jeffs  R  0:08     1 cac005
 1163  standard long-job jeffs  R  0:08     1 cac006
 1160  standard long-job jeffs  R  0:12     1 cac003
 1159  standard long-job jeffs  R  0:16     1 cac002
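When many jobs are queued, a per-state summary can be easier to read than the full listing. The following is a minimal sketch, not part of SLURM itself: count_job_states is a hypothetical helper that assumes the default squeue columns shown above (state in column 5) and job names without spaces.

```shell
#!/bin/bash
# Count jobs per state (PD = pending, R = running) from squeue output.
# Assumes the state code is the 5th whitespace-separated column and
# that job names contain no spaces.
count_job_states() {
    awk 'NR > 1 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# On the cluster this would be used as:
#   squeue -u "$USER" | count_job_states
```

For the sample listing above, this reports 4 jobs in state PD and 5 in state R.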
scancel - Kill a job
You can get job IDs with squeue. Note that you can only kill your own jobs.
scancel <jobID> # kill job <jobID>. (you can get the job IDs with "squeue")
scancel -u <username> # kill all jobs for user <username>.
scancel -t <state> # kill all jobs in state <state>. <state> can be one of: PENDING, RUNNING, SUSPENDED
Running jobs
There are actually 3 methods of submitting jobs under SLURM: sbatch, srun, and an srun interactive job. Although this may initially seem unnecessarily complicated, these commands share the same options and allow users to submit new types of jobs.
sbatch - Submit a job script to be run
sbatch will submit a job script to be run by the cluster. Job scripts under SLURM are simply shell scripts (*.sh) with a set of resource requests at the top of the script. Users of Sun Grid Engine should note that SLURM's sbatch is functionally identical to SGE's qsub.
To submit a job script to SLURM:
sbatch nameOfScript.sh
Example output:
$ sbatch long-job.sh
Submitted batch job 1169
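Scripts that submit jobs often need the job ID for later squeue or scancel calls. As a small sketch, the hypothetical helper extract_job_id below pulls the ID out of the "Submitted batch job" message shown above.

```shell
#!/bin/bash
# Print the job ID from sbatch's "Submitted batch job <id>" message.
extract_job_id() {
    awk '/^Submitted batch job/ { print $4 }'
}

# On the cluster this would be used as:
#   jobid=$(sbatch long-job.sh | extract_job_id)
#   squeue -j "$jobid"
```

sbatch also has a --parsable option that prints only the job ID, which avoids parsing the message altogether.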
Job scripts specify the resources requested and other special considerations with special "#SBATCH" comments at the top of a job script. Although many of these options are optional, directives dealing with resource requests (CPUs, memory, and walltime) are mandatory. All directives should be added to your scripts in the following manner:
#SBATCH <directive>
To specify a job name, for instance, you would add the following to your script:
#SBATCH -J myJobName
For users looking to get started with SLURM as fast as possible, a minimalist template job script is shown below:
#!/bin/bash
#SBATCH -c <cpus> # Number of CPUS requested. If omitted, the default is 1 CPU.
#SBATCH --mem=<megabytes> # Memory requested in megabytes. If omitted, the default is 1024 MB.
#SBATCH -t <days-hours:minutes:seconds> # How long will your job run for? If omitted, the default is 3 hours.

# commands for your job go here
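To make the template concrete, here is one possible filled-in version. The resource values and the report_job function are illustrative assumptions rather than recommended settings; the fallbacks let the script also run outside SLURM for a quick test.

```shell
#!/bin/bash
# Request 2 CPUs, 4096 MB of memory and 1 hour 30 minutes of walltime.
#SBATCH -c 2
#SBATCH --mem=4096
#SBATCH -t 0-01:30:00

# Hypothetical workload: report the job ID and CPU count SLURM assigned.
# The ":-" fallbacks apply when the script runs outside of SLURM.
report_job() {
    echo "Job ${SLURM_JOB_ID:-none} using ${SLURM_CPUS_PER_TASK:-1} CPU(s)"
}

report_job
```

The #SBATCH lines are ordinary comments to bash, so only sbatch interprets them.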
Mandatory directives
Directives in this section are mandatory and are used by SLURM to determine where and when your jobs will run. If you do not specifically request a resource, the scheduler will assign your job the default value for it. Unlike with Sun Grid Engine, jobs that exceed their resource requests will be automatically killed by SLURM. Though this seems harsh, it means that users exceeding the resources that the scheduler has given them will not degrade the experience of other users on the system. Jobs requesting more resources may be harder to schedule (because they have to wait for a larger slot).
-c <cpus> -- This is the number of CPUs your job needs. Note that SLURM is relatively generous with CPUs, and the value specified here is the minimum number of CPUs that your job will be assigned. If additional CPUs are available on a node beyond what was requested, your job will be given those CPUs until they are needed by other jobs. Default value is 1 CPU. Attempting to use more CPUs than you have been allocated will result in your extra processes taking turns on the same CPU (slowing your job down).
--mem=<megabytes> -- This is the amount of memory your job needs to run. Chances are, you may not know how much memory your job will use. If this is the case, a good rule of thumb is 2048 megabytes (2 gigabytes) per processor that your job uses. Note that jobs will be killed if they exceed their memory allocations, so it's best to err on the safe side and request extra memory if you are unsure of things (there is no penalty for requesting too much memory). Default value is 1024 MB.
-t <days-hours:minutes:seconds> -- Walltime for your job. The walltime is the length of time you expect your job to run. Again, your job will be killed if it runs for longer than the requested walltime. If you do not know how long your job will run for, err on the side of requesting too much walltime rather than too little. A typical rule of thumb is to ask for two to three times the amount of time you think you will need. May also follow the format "hours:minutes:seconds". Default value is 3 hours, and the maximum walltime is 2 weeks (please contact us if you need to run longer jobs; this is quite easy to accommodate).
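The rules of thumb above can be turned into numbers with a little shell arithmetic. This is only a sketch: to_walltime and mem_for_cpus are hypothetical helper names, and the 45-minute test run is an assumed measurement.

```shell
#!/bin/bash
# Convert a number of seconds to SLURM's days-hours:minutes:seconds format.
to_walltime() {
    local s=$1
    printf '%d-%02d:%02d:%02d\n' \
        $((s / 86400)) $((s % 86400 / 3600)) $((s % 3600 / 60)) $((s % 60))
}

# Rule of thumb from the text: 2048 MB of memory per CPU.
mem_for_cpus() {
    echo $(( $1 * 2048 ))
}

# Suppose a test run took 45 minutes (2700 s); request twice that.
to_walltime $((2700 * 2))   # prints 0-01:30:00
mem_for_cpus 4              # prints 8192
```

The printed values can be pasted into the "-t" and "--mem" directives of a job script.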
Optional directives
For a list of all directives available, see the SLURM documentation at http://slurm.schedmd.com/sbatch.html. The directives in this article were covered because they were the most relevant for typical use cases.
--mail-type=BEGIN,END,FAIL,ALL and --mail-user=<emailAddress> -- Be emailed when your job starts/finishes/fails. You can specify multiple values for this (separated by commas) if need be.
-p <partition> -- Submit a job to a specific partition. Your submission may be rejected if you do not have permission to run in the requested partition.
-A <account> -- Associate a job with a particular SLURM usage account. Unnecessary unless you wish to submit jobs to a partition that requires the use of a particular account.
-D <directory> -- The working directory you want your job script to execute in. By default, the job's working directory is the location where sbatch <script> was run.
-J <name> -- Specify a name for your job.
-o <STDOUT_log> -- Redirect output to the log file you specify. By default, both STDOUT and STDERR are sent to this file. You can specify %j as part of the log filename to indicate job ID (as an example, "#SBATCH -o output_%j.o" would redirect output to "output_123456.o").
-e <STDERR_log> -- Redirect STDERR to a separate file. Works exactly the same as "-o".
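SLURM expands %j in log filenames itself, but a wrapper script sometimes needs to know the resulting name in advance, for example to follow the log once the job starts. The sketch below mimics only the %j substitution described above; expand_log_name is a hypothetical helper, not a SLURM command.

```shell
#!/bin/bash
# Predict the log filename SLURM will use for a "-o"/"-e" pattern
# by replacing every %j with the given job ID.
expand_log_name() {
    local pattern=$1 jobid=$2
    printf '%s\n' "$pattern" | sed "s/%j/$jobid/g"
}

expand_log_name "output_%j.o" 123456   # prints output_123456.o
```

A typical use would be following a running job's output with tail -f "$(expand_log_name output_%j.o "$jobid")".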
Array jobs
When running hundreds or thousands of jobs, it may be advantageous to run these jobs as an "array job". Array jobs allow you to submit thousands of such jobs (called "job steps") with a single job script. Each job will be assigned a unique value for the environment variable SLURM_ARRAY_TASK_ID. You can use this variable to read parameters for individual steps from a given line of a file, for instance.
A sample array job that creates 7 job steps, with SLURM_ARRAY_TASK_ID running from 0 to 18 in increments of 3. STDOUT and STDERR output streams have been redirected to the same file: arrayJob_%A_%a.out (%A is the job number of the array job itself, %a is the job step).
#!/bin/bash
#SBATCH --array=0-20:3
#SBATCH -o arrayJob_%A_%a.out

echo 'This is job step '${SLURM_ARRAY_TASK_ID}
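Reading per-step parameters from a file, as mentioned above, could look like the following sketch. The helper name param_for_task and the file params.txt are assumptions made for illustration; the only SLURM-provided piece is SLURM_ARRAY_TASK_ID.

```shell
#!/bin/bash
# Print the parameter line that belongs to a given array task ID.
# Task IDs here start at 0, while sed counts lines from 1,
# so task 0 reads line 1, task 1 reads line 2, and so on.
param_for_task() {
    local file=$1 task_id=$2
    sed -n "$((task_id + 1))p" "$file"
}

# Inside an array job script (params.txt is a hypothetical file):
#   PARAM=$(param_for_task params.txt "$SLURM_ARRAY_TASK_ID")
#   ./my-program "$PARAM"
```

With a contiguous range such as "--array=0-6", every one of the first 7 lines of the file is used exactly once. With a stepped range, only the lines matching the generated task IDs are read.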
srun - Run a single command on the cluster
Sometimes it may be advantageous to run a single command on the cluster as a test or to quickly perform an operation with additional resources. srun enables users to do this, and shares all of the same directives as sbatch. STDOUT and STDERR for an srun job will be redirected to the user's screen. Ctrl-C will cancel an srun job.
Basic usage:
srun <someCommand>
Example output (running the command "hostname" to return which computer you are running on):
$ srun hostname
cac003
Submit a command with additional directives (in this case run the program "test" with 12 cpus/20 gigabytes of memory in partition "bigjob"):
srun -c 12 --mem=20000 -p bigjob test
Schedule an interactive job
SLURM can also schedule interactive sessions for a user. An "interactive session" is identical to normal command-line usage of a cluster node with the resources requested. Need to run a program that requires a GUI, or want to test out a program? No problem, this just requires a slight modification to srun's syntax.
To start an interactive shell, use srun in the following manner. The "--pty" option enables an interactive shell, and the "--x11" option enables X11 graphics forwarding. Note that use of X11 forwarding requires that you have connected to the cluster using an SSH client that supports X-forwarding (done using "ssh -X" on logon, requires XQuartz on OSX or MobaXTerm on Windows).
srun [other srun options] --pty --x11 bash
Example usage (use 4 processors and 6 gigabytes of RAM interactively):
[jeffs@cac009 ~]$ srun -c 4 --mem=6000 --pty --x11 bash # start an interactive session with x forwarding for graphics
[jeffs@cac002 ~]$ xeyes # test graphics forwarding
[jeffs@cac002 ~]$ exit # quit interactive session
exit
[jeffs@cac009 ~]$ # we are now back on the node where we started
Parallel Jobs
Many of the jobs running on a production cluster are going to involve more than one processor (CPU, core). Such parallel jobs need to request the number of required resources through additional options. The most common ones are:
-N Number of cluster nodes requested
-n Total number of tasks (processes)
-c Number of cores per task
Different types of parallel jobs require different options. The most common parallel jobs are MPI (distributed memory) jobs, multi-threaded (shared memory) jobs, and so-called hybrids that are a combination of the two. Let's discuss them separately with an example for each.
MPI Jobs
MPI (Message Passing Interface) is the standard communication API for parallel distributed-memory jobs that can be deployed on a cluster. To schedule such a job, it is necessary to specify the number of cluster nodes that will be used, and the number of processes (tasks) that are going to run on each node.
Currently, each MPI job on our cluster is restricted to run on a single node, i.e. all processes are scheduled on different CPUs (cores) and use a so-called shared-memory layer to communicate with each other. The upside of this type of scheduling is that communication is fast and efficient compared with inter-node communication. The downside is that the total number of tasks (processes) used by a program is limited by the size of the node on which it runs. A typical MPI script for such a program looks like this:
#!/bin/bash
#SBATCH --job-name=MPItest
#SBATCH --mail-type=ALL
#SBATCH --mail-user=my.email@whatever.com
#SBATCH -o STD.out
#SBATCH -e STD.err
#SBATCH -N 1
#SBATCH -n 8
#SBATCH -c 1
#SBATCH -t 30:00
#SBATCH --mem=1000

mpirun ./mpi_program
The key option here is "-n 8" which requests enough cores for 8 MPI tasks.
The "-N" and "-c" options need to be kept at 1 to indicate that all processes are to be run on a single node, and that each process is single-threaded (i.e. we are not doing any multi-threading on the MPI processes).
A specification of the number of processes in the mpirun line may be omitted as mpirun interfaces with SLURM and selects the proper number automatically from the "-n" option.
Multi-threaded Jobs
Parallel jobs designed to run on a multi-core (shared-memory) system are usually "multi-threaded". Scheduling such a job requires specifying the number of cores needed to accommodate the threads.
OpenMP is a commonly used set of compiler directives that facilitates the development of multi-threaded programs. A typical SLURM script for such a program looks like this:
#!/bin/bash
#SBATCH --job-name=OMPtest
#SBATCH --mail-type=ALL
#SBATCH --mail-user=my.email@whatever.ca
#SBATCH -o STD.out
#SBATCH -e STD.err
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 4
#SBATCH -t 30:00
#SBATCH --mem=1000

OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK time ./omp-program
When using an OpenMP program, the number of threads (and therefore the required number of cores) is specified via the environment variable OMP_NUM_THREADS, which therefore appears in the script in front of the call to the program. We are setting it to the SLURM variable SLURM_CPUS_PER_TASK, which is set through the "-c" option (to 4 in our case).
The "-N" and "-n" options are kept at 1 to indicate a single main program running on a single node.
Multi-threaded programs that use different multi-threaded techniques (for instance, the Posix thread libraries) use a slightly different approach, but the principle is the same:
Specify the number of required cores through the "-c" option and pass that number to the program through the variable SLURM_CPUS_PER_TASK.
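One way to pass the core count to a program is sketched below. SLURM sets SLURM_CPUS_PER_TASK only when "-c" is given, so the hypothetical helper thread_count falls back to 1 otherwise; the program name and its --threads flag are placeholders, not real software.

```shell
#!/bin/bash
# Use the CPU count granted through "-c"; fall back to 1 when the
# variable is unset (for example, when testing outside SLURM).
thread_count() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# OpenMP programs read the thread count from this variable.
export OMP_NUM_THREADS=$(thread_count)

# Programs using other threading libraries usually take the count as
# an argument instead (hypothetical program and flag):
#   ./my-threaded-program --threads "$(thread_count)"
```

The fallback keeps the same script usable for a quick single-threaded check on a login node.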
Hybrid Jobs
MPI distributed-memory and OpenMP shared-memory parallelism may be combined to obtain a "hybrid" program. This has to be done with great care to avoid race-conditions on the process-to-process communication. However, such programs are particularly useful when it is important to exploit the multi-core nature of the nodes in a cluster.
The following script works for a simple run of a hybrid program on a single node, assuming each MPI process uses the same number of sub-threads:
#!/bin/bash
#SBATCH --job-name=OMPtest
#SBATCH --mail-type=ALL
#SBATCH --mail-user=my.email@whatever.ca
#SBATCH -o STD.out
#SBATCH -e STD.err
#SBATCH -N 1
#SBATCH -n 8
#SBATCH -c 4
#SBATCH -t 30:00
#SBATCH --mem=1000

OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK mpirun ./hybrid-program
This example would run the program "hybrid-program" with 8 MPI processes, each utilizing 4 threads for a total of 32. Note that the number of nodes (i.e. the -N option) is set to one to indicate that all cores need to be allocated on a single node. This setting should not be changed in the current cluster configuration.
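Because the whole job must fit on one node, it can help to check the core arithmetic before submitting. The helpers below are a hypothetical sketch; the node size passed to fits_on_node is an assumption you would replace with the CPU count reported for your target nodes (for instance by "sinfo -lNe").

```shell
#!/bin/bash
# Cores needed by a hybrid job: MPI tasks (-n) times threads per task (-c).
hybrid_cores() {
    echo $(( $1 * $2 ))
}

# Succeeds only if tasks * threads fits within node_cores.
fits_on_node() {
    local tasks=$1 threads=$2 node_cores=$3
    [ "$(hybrid_cores "$tasks" "$threads")" -le "$node_cores" ]
}

hybrid_cores 8 4   # prints 32

# Example check against an assumed 24-core node:
if ! fits_on_node 8 4 24; then
    echo "8 tasks x 4 threads do not fit on a 24-core node"
fi
```

If the check fails, reduce "-n" or "-c" until the product fits on a single node.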
Migrating from other Schedulers
Sun Grid Engine
Most SGE commands (qsub, qstat, etc.) will work on SLURM, although you will need to rewrite your scripts to use #SBATCH directives instead of the #$ directives used by SGE. The command sge2slurm will convert an SGE job script to a SLURM job script.
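To give a feel for what such a rewrite involves, here is a hand-rolled sketch that translates a few common SGE directives. It is not the sge2slurm tool and makes no claim about that tool's behaviour; sge_to_sbatch_sketch is a made-up name, and only four directives (-N, -o, -e, -M) are handled.

```shell
#!/bin/bash
# Translate a handful of "#$" SGE directives to "#SBATCH" equivalents.
# Reads a script on stdin and writes the converted script to stdout.
# All other lines pass through unchanged.
sge_to_sbatch_sketch() {
    sed -E \
        -e 's/^#\$ +-N +/#SBATCH -J /' \
        -e 's/^#\$ +-o +/#SBATCH -o /' \
        -e 's/^#\$ +-e +/#SBATCH -e /' \
        -e 's/^#\$ +-M +/#SBATCH --mail-user=/'
}

# Usage (file names are placeholders):
#   sge_to_sbatch_sketch < old-job.sh > new-job.sh
```

Resource requests differ more between the two schedulers, so CPU, memory, and walltime directives still need to be written by hand or converted with sge2slurm.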
PBS/TORQUE
SLURM can actually run PBS job scripts in many cases. Most PBS commands (qstat, qsub, etc.) will work on SLURM. The "pbs2slurm" script can be used to convert a PBS script to a SLURM one.