SLURM

From CAC Wiki
Revision as of 17:22, 7 July 2016 by Jstaff (Talk | contribs) (create page)


About SLURM

SLURM is the scheduler used by the CLUSTERNAME cluster. Like Sun Grid Engine (the scheduler used for the M9000 and SW clusters), SLURM is used for submitting, monitoring, and controlling jobs on a cluster. Any jobs or computations done on the CLUSTERNAME cluster must be started via SLURM. Reading this tutorial will supply all the information necessary to run jobs on CLUSTERNAME.

Although existing users are likely very familiar with Sun Grid Engine (SGE), switching to SLURM offers a number of advantages over the old system. The biggest advantage is that the scheduling algorithm is significantly better than that offered by SGE, allowing more jobs to be run on the same amount of hardware. SLURM also supports new types of jobs: users will now be able to schedule interactive sessions or run individual commands via the scheduler. In terms of administration and accounting, SLURM is also considerably more flexible. Although easier cluster administration does not directly impact users in the short term, CAC will be able to more easily reconfigure our systems over time to meet the changing needs of users and perform critical system maintenance. All in all, we believe switching to SLURM will offer our users an all-around better experience when using our systems.

How SLURM works

SLURM is the piece of software that allows many users to share a compute cluster. A cluster is a set of networked computers; each computer represents one "node" of the cluster. When a user submits a job, SLURM will schedule this job on a node (or nodes) that meets the resource requirements indicated by the user. If no resources are currently available, the user's job will wait in a queue until the resources they have requested become available for use.

Nodes in SLURM are divided into distinct "partitions" (similar to queues in SGE). Different partitions may have different uses, such as directing users' jobs to nodes with a particular piece of software installed (some software licenses only allow us to install software on a given number of nodes). Generally, the default partition (named "default") will suffice for most uses and encompasses the largest amount of hardware.

All users will have one or more SLURM usage accounts. Accounts are used to record accounting information and may be used to control access to certain partitions (such as those for RAC allocations). For everyday, default use, most users will not need to bother with accounts or accounting details (just be aware that they exist).

Basic SLURM commands

These commands cover most everyday operations with SLURM.


sinfo - Check the status of the cluster/partitions

sinfo 
sinfo -lNe  # same as above, but shows per-node status


squeue - Show status of jobs

squeue                  # show jobs in the queue
squeue -u <username>    # show jobs for user <username>
squeue --start          # show expected start times of jobs in queue


scancel - Kill a job

You can get job IDs with squeue. Note that you can only kill your own jobs.

scancel <jobID>         # kill job <jobID>. (you can get the job IDs with "squeue")
scancel -u <username>   # kill all jobs for user <username>. 
scancel -t <state>      # kill all jobs in state <state>. <state> can be one of: PENDING, RUNNING, SUSPENDED


Running jobs

There are three methods of submitting jobs under SLURM: sbatch, srun, and salloc. Although this may initially seem unnecessarily complicated, the three commands share the same options, and together they allow users to submit new types of jobs.

sbatch - Submit a job script to be run

sbatch will submit a job script to be run by the cluster. Job scripts under SLURM are simply shell scripts (*.sh) with a set of resource requests at the top of the script. Users of Sun Grid Engine should note that SLURM's sbatch is functionally equivalent to SGE's qsub.

To submit a job script to SLURM:

sbatch nameOfScript.sh

Job scripts specify the resources requested and other special considerations with special "#SBATCH" comments at the top of a job script. Although many of these directives are optional, those dealing with resource requests (CPUs, memory, and walltime) are mandatory. All directives should be added to your scripts in the following manner:

#SBATCH <directive>

To specify a job name, for instance, you would add the following to your script:

#SBATCH -J myJobName


For users looking to get started with SLURM as fast as possible, a minimalist template job script is shown below:

#!/bin/bash
#SBATCH -c cpus                            # Number of CPUs requested. If omitted, the default is 1 CPU.
#SBATCH --mem=megabytes                    # Memory requested in megabytes. If omitted, the default is 1024 MB.
#SBATCH -t days-hours:minutes:seconds      # How long will your job run for? If omitted, the default is 3 hours.

# commands for your job go here
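
As a hedged illustration, the template might be filled in as follows for a small job. The resource values and the describe_job helper are assumptions chosen for this example, not site recommendations. SLURM_JOB_ID and SLURM_CPUS_PER_TASK are environment variables that SLURM sets inside a running job (the latter only when -c is given).

```shell
#!/bin/bash
#SBATCH -c 2                  # request 2 CPUs (example value)
#SBATCH --mem=4096            # request 4096 MB of memory (example value)
#SBATCH -t 0-01:30:00         # request 1 hour 30 minutes of walltime (example value)

# Hypothetical helper: report what the scheduler granted this job.
# Outside of SLURM the variables are unset, so fall back to placeholders.
describe_job() {
    echo "job ${SLURM_JOB_ID:-none} running with ${SLURM_CPUS_PER_TASK:-1} CPU(s)"
}

describe_job
# commands for your job go here
```

Saved as, say, nameOfScript.sh, this would be submitted with "sbatch nameOfScript.sh" as described above.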

Mandatory directives

Directives in this section are mandatory, and are used by SLURM to determine where and when your jobs will run. If you do not assign a value for one of these directives, the scheduler will assign your job the default value for that resource. Unlike with Sun Grid Engine, jobs that exceed their resource requests will be automatically killed by SLURM. Though this seems harsh, it means that users exceeding the resources that the scheduler has given them will not degrade the experiences of other users on the system. Jobs requesting more resources may be harder to schedule (because they have to wait for a larger slot).

-c <cpus> -- This is the number of CPUs your job needs. Note that SLURM is relatively generous with CPUs, and the value specified here is the minimum number of CPUs that your job will be assigned. If additional CPUs are available on a node beyond what was requested, your job will be given those CPUs until they are needed by other jobs. Default value is 1 CPU.

--mem=<megabytes> -- This is the amount of memory your job needs to run. Chances are, you may not know how much memory your job will use. If this is the case, a good rule of thumb is 2048 megabytes (2 gigabytes) per processor that your job uses. Note that jobs will be killed if they exceed their memory allocations, so it's best to err on the safe side and request extra memory if you are unsure of things (there is no penalty for requesting too much memory). Default value is 1024 MB.

-t <days-hours:minutes:seconds> -- Walltime for your job. The walltime is the length of time you expect your job to run. Again, your job will be killed if it runs for longer than the requested walltime. If you do not know how long your job will run for, err on the side of requesting too much walltime, rather than too little. May also follow the format "hours:minutes:seconds". Default value is 3 hours, and the maximum walltime is 2 weeks (please contact us if you need to run longer jobs; this is quite easy to accommodate).
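
To make the memory rule of thumb and the walltime format concrete, the following sketch computes a memory request from a CPU count and converts a walltime string to seconds. All three function names are hypothetical helpers written for this page, not SLURM commands, and the parser only handles the two walltime formats described above.

```shell
#!/bin/bash
# Rule of thumb from the text: 2048 MB of memory per CPU.
mem_for_cpus() {
    local cpus=$1
    echo $(( cpus * 2048 ))
}

# Convert a walltime string to seconds.
# Accepts "days-hours:minutes:seconds" or "hours:minutes:seconds".
walltime_to_seconds() {
    local spec=$1 days=0 hours minutes seconds
    if [[ $spec == *-* ]]; then
        days=${spec%%-*}      # text before the first "-"
        spec=${spec#*-}       # text after the first "-"
    fi
    IFS=: read -r hours minutes seconds <<< "$spec"
    # The 10# prefix forces base 10 so values like "08" are not read as octal.
    echo $(( 10#$days * 86400 + 10#$hours * 3600 + 10#$minutes * 60 + 10#$seconds ))
}

# Succeed only if the request fits within the 2-week maximum walltime.
walltime_within_limit() {
    local max_seconds=$(( 14 * 86400 ))
    (( $(walltime_to_seconds "$1") <= max_seconds ))
}

mem_for_cpus 4                      # prints 8192
walltime_to_seconds 0-03:00:00      # prints 10800 (the 3-hour default)
```

For a 4-CPU job, this suggests "--mem=8192"; a request such as "-t 15-00:00:00" would fail the walltime_within_limit check because it exceeds 2 weeks.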

Optional directives

--mail-type=BEGIN,END,FAIL,ALL and --mail-user=<emailAddress> -- Be emailed when your job starts/finishes/fails. You can specify multiple values for this (separated by commas) if need be.

-p <partition> -- Submit a job to a specific partition. Your submission may be rejected if you do not have permission to run in the requested partition.

-A <account> -- Associate a job with a particular SLURM usage account. Unnecessary unless you wish to submit jobs to a partition that requires the use of a particular account.

-D <directory> -- The working directory you want your job script to execute in. By default, the job's working directory is the location where sbatch <script> was run.

-J <name> -- Specify a name for your job.

-o <STDOUT_log> -- Redirect output to the log file you specify. By default, both STDOUT and STDERR are sent to this file. You can specify %j as part of the log filename to indicate job ID (as an example, "#SBATCH -o output_%j.o" would redirect output to "output_123456.o").

-e <STDERR_log> -- Redirect STDERR to a separate file. Works exactly the same as "-o".
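
The optional directives above could be combined in a job script along these lines. This is only a sketch: the job name, the log file names, the email address, and the log_names helper are placeholders invented for illustration. SLURM_JOB_ID is the environment variable SLURM sets to the job ID inside a running job, which is the same value that %j expands to.

```shell
#!/bin/bash
#SBATCH -J exampleJob                     # job name (hypothetical)
#SBATCH -o exampleJob_%j.out              # STDOUT log; %j becomes the job ID
#SBATCH -e exampleJob_%j.err              # STDERR goes to a separate log
#SBATCH --mail-type=END,FAIL              # email when the job finishes or fails
#SBATCH --mail-user=user@example.com      # placeholder address

# Hypothetical helper: show which log file names %j expands to for this job.
# Outside of SLURM the job ID is unset, so fall back to "unknown".
log_names() {
    local id=${SLURM_JOB_ID:-unknown}
    echo "exampleJob_${id}.out exampleJob_${id}.err"
}

echo "Logs for this job: $(log_names)"
echo "this line goes to the -e log" >&2
```

Because the "#SBATCH" lines are ordinary shell comments, the script can also be run directly with bash as a quick syntax check before it is submitted with sbatch.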

Array jobs

When running hundreds or thousands of jobs, it may be advantageous to run these jobs as an "array job". Array jobs allow you to submit thousands of such jobs (called "job steps") with a single job script. Each job will be assigned a unique value for the environment variable SLURM_ARRAY_TASK_ID. You can use this variable to read parameters for individual steps from a given line of a file, for instance.

The sample array job below creates 7 job steps, with SLURM_ARRAY_TASK_ID running from 0 to 18 in increments of 3 (0, 3, 6, 9, 12, 15, 18). The STDOUT and STDERR output streams are redirected to the same file: arrayJob_%A_%a.out (%A is the job number of the array job itself, %a is the job step).

#!/bin/bash
#SBATCH -J array
#SBATCH --array=0-20:3
#SBATCH -o arrayJob_%A_%a.out
#SBATCH -e arrayJob_%A_%a.out

echo "This is job step ${SLURM_ARRAY_TASK_ID}"
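
Reading per-step parameters from a given line of a file, as mentioned above, might look like the following sketch. The file name params.txt (one parameter set per line) and the param_for_task helper are assumptions made for illustration; the array range 1-3 is chosen so that each SLURM_ARRAY_TASK_ID matches a line number.

```shell
#!/bin/bash
#SBATCH -J arrayParams
#SBATCH --array=1-3
#SBATCH -o arrayParams_%A_%a.out

# Hypothetical helper: print line number <task_id> of <file>.
param_for_task() {
    local file=$1 task_id=$2
    sed -n "${task_id}p" "$file"
}

# params.txt is an assumed input file with one parameter set per line.
# Outside of an array job SLURM_ARRAY_TASK_ID is unset, so default to line 1.
if [[ -r params.txt ]]; then
    param=$(param_for_task params.txt "${SLURM_ARRAY_TASK_ID:-1}")
    echo "Job step ${SLURM_ARRAY_TASK_ID:-1} uses parameter: ${param}"
fi
```

With this layout, adding more parameter sets only requires appending lines to the file and widening the --array range to match.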

srun - Run a single command on the cluster

Sometimes it may be advantageous to run a single command on the cluster as a test or to quickly perform an operation with additional resources. srun enables users to do this, and shares all of the same directives as sbatch. STDOUT and STDERR for an srun job will be redirected to the user's screen. Ctrl-C will cancel an srun job.

Basic usage:

srun <someCommand>     # try it out with "srun hostname"

Submit a command with additional directives (in this case, run the program "test" with 12 CPUs and 20000 megabytes, roughly 20 gigabytes, of memory in partition "bigjob"):

srun -c 12 --mem=20000 -p bigjob test

salloc - Schedule an interactive job

Migrating from other Schedulers

Sun Grid Engine

PBS/TORQUE

SLURM can actually run PBS job scripts in many cases. The "pbs2slurm" script will also convert a PBS script to a SLURM one.