SLURM Accounting

From CAC Wiki
Revision as of 15:39, 2 August 2016

There are a number of partitions and settings on the cluster reserved for special purposes, such as RAC resource allocations or controlling usage of specific software packages. To use these, you will need to understand the basics of how SLURM performs access control and accounting.

For users, a SLURM account is simply an association between your user name and a particular usage account. These usage accounts may grant access to special partitions or otherwise give a user's jobs a higher priority. A user can be a member of multiple accounts, and can choose the account for a job at submission time. To view the accounts available to you, use the following command:

sacctmgr show associations where user=yourUserName

Example usage and output:

[jeffs@cac009 ~]$ sacctmgr show associations where user=jeffs
   Cluster    Account       User  Partition     Share GrpJobs       GrpTRES GrpSubmit     GrpWall   GrpTRESMins MaxJobs       MaxTRES MaxTRESPerNode MaxSubmit     MaxWall   MaxTRESMins                  QOS   Def QOS GrpTRESRunMin
---------- ---------- ---------- ---------- --------- ------- ------------- --------- ----------- ------------- ------- ------------- -------------- --------- ----------- ------------- -------------------- --------- -------------
cac_workup       rac1      jeffs                    1                                                                                                                                              privileged
cac_workup        rac      jeffs                    1                                                                                                                                              privileged
cac_workup  snowflake      jeffs                    1                                                                                                                                                  normal
cac_workup    default      jeffs                    1                                                                                                                                                  normal

Here we can see that user "jeffs" has 4 usage accounts: rac1, rac, snowflake, and default. Accounts rac1 and rac have "privileged" queueing priority.
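If you only need the account names themselves (for example, in a script), sacctmgr's output can be trimmed down with its format option. A minimal sketch, assuming you are logged in to the cluster:

```shell
# Print only the account names associated with the current user;
# -n suppresses the header line and "format=" selects the columns.
sacctmgr -n show associations where user=$USER format=Account%20
```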

== Hidden partitions ==

Upon first logging in to the cluster, you are given the ability to run jobs under a default account. This default account has no special privileges or restrictions, and can submit to all partitions whose access has not been restricted to specific groups. These partitions are visible with sinfo.

[jeffs@cac009 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
standard*    up 2-00:00:00      5   idle cac[002-006]
large        up 14-00:00:0      3   idle cac[007-009]

However, the cluster also has a number of "hidden" partitions as well. These partitions have been hidden to avoid cluttering the output of commands like sinfo (most users will not want to view them). You can view these hidden partitions with sinfo -a.

[jeffs@cac009 ~]$ sinfo -a
PARTITION         AVAIL  TIMELIMIT  NODES  STATE NODELIST
standard*            up 2-00:00:00      5   idle cac[002-006]
large                up 14-00:00:0      3   idle cac[007-009]
rac-jobs             up   infinite      8   idle cac[002-009]
special-snowflake    up   infinite      8   idle cac[002-009]
debug                up   infinite      8   idle cac[002-009]

There are 3 hidden partitions on the test cluster: rac-jobs, special-snowflake, and debug. In this particular case, these are simply test partitions for the finalized cluster, but you can see that the 3 hidden partitions have a larger number of nodes available and no job length limit. Note that jobs submitted to these partitions will queue alongside jobs submitted by users under the default resource allocations. No special queue priority is gained by submitting to one of these partitions; queue priority is instead controlled by a user's account. If you are given access to one of these special "hidden" partitions, we will tell you which partitions you are able to submit to.
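To see the full configuration of one of these partitions, including its time limit and which groups are allowed to use it, scontrol can be used (a sketch using a partition name from the test cluster above):

```shell
# Show the complete configuration of a hidden partition,
# including AllowGroups, MaxTime, and the node list
scontrol show partition rac-jobs
```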

== Submitting jobs with a usage account ==

All jobs are submitted to SLURM under a particular usage account. If no account is specified, the user's default account is used. To submit a job using a particular account, add the "-A <accountName>" option to your job script. To select a particular partition, submit the job with "-p <partitionName>". A job cannot be scheduled if the account it is submitted under does not have permission to run in the requested partition. Note that jobs in hidden partitions will not show up in squeue's output unless the "-a" option is used.

An example job to be submitted under the rac-jobs partition:

#!/bin/bash

#SBATCH -A rac
#SBATCH -p rac-jobs

#SBATCH -c 1
#SBATCH --mem=4000
#SBATCH -t 6:0:0

<actual job commands would go here>
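Assuming the script above is saved as job.sh (a hypothetical filename), it can be submitted as-is, or the account and partition can instead be given on the command line:

```shell
# Submit the script with the account and partition set inside it
sbatch job.sh

# Equivalently, override the account and partition at submission time
sbatch -A rac -p rac-jobs job.sh

# Jobs in hidden partitions only appear in squeue when -a is given
squeue -a -u $USER
```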

== Account usage and resource limits ==

Some usage accounts may have resource limits set. Once these limits are exhausted, the account is deactivated and the user will need to submit jobs under another account instead (such as the "default" account). To view an account's characteristics and resource limits, use the command "sacctmgr show associations where account=AccountName". Here is an example account:

   Cluster    Account       User  Partition     Share GrpJobs       GrpTRES GrpSubmit     GrpWall   GrpTRESMins MaxJobs       MaxTRES MaxTRESPerNode MaxSubmit     MaxWall   MaxTRESMins                 QOS   Def QOS GrpTRESRunMin
---------- ---------- ---------- ---------- --------- ------- ------------- --------- ----------- ------------- ------- ------------- -------------- --------- ----------- ------------- -------------------- --------- -------------
cac_workup       rac1                               1                                                   cpu=100                                                                                    privileged
cac_workup       rac1    hpc3293                    1                                                                                                                                              privileged
cac_workup       rac1      jeffs                    1                                                                                                                                              privileged

In this particular case, the "rac1" account has a limit of 100 CPU minutes of usage ("TRES" stands for "Trackable RESource"). Once the limit is reached, a user may no longer schedule jobs under this account (the "default" account is always available). If a job would exceed an account's maximum utilization, it will not be scheduled. Such jobs appear in the output of "squeue -a" with the pending reason "(AssocGrpCPUMinutesLimit)".
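One way to inspect pending reasons is squeue's output format option, where "%a" prints a job's account and "%R" the reason it is not running (a sketch, assuming you have jobs queued):

```shell
# List your jobs in all partitions, with account and pending reason;
# a job blocked by an exhausted account shows (AssocGrpCPUMinutesLimit)
squeue -a -u $USER -o "%.10i %.12P %.10a %.2t %R"
```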

Utilization is tracked in CPU minutes: using 1 CPU for 1 minute consumes 1 CPU minute. By the same logic, either 32 CPUs for 1 minute or 1 CPU for 32 minutes results in a usage of 32 CPU minutes. Account utilization is viewed with the following command:

sreport cluster AccountUtilizationByUser user=UserName start=YYYY-MM-DD

Example output (show jeffs's usage since June 1, 2016):

[jeffs@cac009 ~]$ sreport cluster AccountUtilizationByUser start=2016-06-01 user=jeffs
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2016-06-01T00:00:00 - 2016-08-01T23:59:59 (5356800 secs)
Use reported in TRES Minutes
--------------------------------------------------------------------------------
  Cluster         Account     Login     Proper Name     Used   Energy
--------- --------------- --------- --------------- -------- --------
cac_work+            rac1     jeffs                       16        0
cac_work+             rac     jeffs                       40        0
cac_work+       snowflake     jeffs                     3944        0
cac_work+         default     jeffs                      523        0
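The CPU-minute bookkeeping described above can be sanity-checked with plain shell arithmetic (a sketch; SLURM's own accounting is authoritative):

```shell
# CPU minutes consumed = CPUs allocated x wall-clock minutes used
cpus=32
minutes=1
echo "$(( cpus * minutes )) CPU minutes"   # prints "32 CPU minutes"
```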