Allocation

From CAC Wiki

Resource Allocations on the Frontenac Cluster

This wiki entry explains how resources are shared on the CAC Frontenac cluster. It covers default allocations of compute time as well as extended resources that were allocated by Compute Canada or that come from contributed systems. We also point out differences between the current Frontenac allocation scheme and the older scheme that was used on the now-decommissioned SW/CAC clusters.

Partitions

The Frontenac cluster is partitioned to enable the efficient and fair allocation of jobs. There are three main partitions. These are:

  • standard partition : This partition comprises about 20% of the AVX2 based cluster nodes (cores). The purpose of this partition is to serve users who don't have a specific allocation of resources. If no such allocation exists and no partition is specified, the system will default to this partition.
  • reserved partition : About 80% of the cluster is reserved for users who have an explicit allocation. This allocation may either be awarded through a "Resource Allocation Competition" (RAC) or result from a contribution.
  • sse3 partition (decommissioned) : This partition comprises nodes that are based on the SSE3 instruction set and are kept separate from the AVX2-based nodes to avoid failing runs. These nodes may be accessed by any user from any account, but require a specific setup. They are not suitable for software that is optimized for running on AVX2-based hardware, but work well with many commercial software packages and with software that was compiled generically. These nodes were decommissioned in 2021.

AVX2 vs SSE3

The default hardware setting on the cluster is based on the AVX2 instruction set. Most of the software running on our cluster employs the Compute Canada CVMFS software stack, and the default compilation of that stack is optimized (whenever possible) for AVX2.

However, many software packages do not require AVX2 and can be compiled for the older SSE3 instruction set. Such software can be run on our "sse3" partition. To enable scheduling on that partition, users need to execute the script

load-sse3

before submitting their job. The advantage of using the sse3 partition is that successful scheduling becomes much more likely because of the limited use of this partition. The downside is that jobs that require the newer AVX2 will fail. This is of course the reason why we have separated the SSE3 nodes.

Default accounts

Our job scheduler on Frontenac is SLURM. All resource allocations and limitations are applied through this scheduler. For a basic intro on how to use it, please see our scheduler help file.

Every user on our systems has at least one SLURM account, the default account. Users with access to extended resources have additional accounts corresponding to these allocations. These SLURM accounts have intrinsic restrictions and allow scheduling of jobs up to these limits.

The limitations of a default account are :

  • Jobs are scheduled to the standard partition
  • This partition comprises about 20-25% of our standard nodes/cores
  • This partition excludes most high-memory and large-core-count nodes
  • The default priority is low
  • No core limits
  • Continued usage lowers the relative priority with respect to other jobs
  • Maximum time limit: 14 days (2 weeks)
  • Default time limit : 3 hrs (all accounts)
  • Default memory limit : 1 GB (all accounts)
  • Default number of cores : 1 (all accounts)
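As an illustration of overriding these defaults, the following is a minimal sketch of a job script for a default account. The requested values (6 hours, 4 GB) and the `payload` function are hypothetical placeholders, not recommendations; only the partition name and the default limits come from the list above.

```shell
#!/bin/bash
# Default accounts are scheduled to the standard partition.
#SBATCH --partition=standard
# Request 6 hours instead of the 3-hour default (format: days-hh:mm:ss).
#SBATCH --time=0-06:00:00
# Request 4 GB instead of the 1 GB default.
#SBATCH --mem=4G
# One core, which is also the default.
#SBATCH -c 1

# Hypothetical payload; replace this function with the real program call.
payload() {
    echo "hello from a default-account job"
}

payload
```

Such a script would typically be submitted with `sbatch`. Outside the scheduler the `#SBATCH` lines are ordinary comments, so the script can also be tested directly with bash.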

Contributed accounts

Users who have applied for and received a RAC allocation from Compute Canada access this allocation through a special RAC account.

The limitations of a contributed account are :

  • contributed allocation limit : 1 core year per core contributed
  • contributed memory limit : equal to memory contributed
  • Jobs are scheduled to the reserved partition
  • No scheduling on the standard partition.
  • This partition includes all high-memory and large-core-count nodes
  • The contributed priority is high
  • Core limit : number of cores contributed
  • Continued usage lowers the relative priority with respect to other contributed jobs
  • Maximum time limit : 28 days (4 weeks)
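To make the "1 core year per core contributed" rule more concrete, here is a small sketch that converts a number of contributed cores into a yearly budget of core hours. It assumes a 365-day year (8760 hours per core); the function name and the 64-core example are hypothetical.

```shell
#!/bin/bash
# Convert contributed cores into a yearly core-hour budget.
# Assumption: one core year = 365 days x 24 hours = 8760 core hours.
core_hours_per_year() {
    local cores=$1
    echo $(( cores * 365 * 24 ))
}

# Example: a hypothetical contribution of 64 cores.
core_hours_per_year 64   # prints 560640
```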

Time and memory limits

The SLURM scheduler uses time limits to predict when a given resource will become available (at the latest). This allows it to fill small fragmented resources with small short-running jobs without forcing larger allocations to wait unduly long. Such waiting periods would result in very inefficient scheduling, wasting valuable resources. In order for time limits to have a beneficial effect, they have to be enforced stringently. The scheduler will not function properly without time limits.

Time limits are "hard" limits. Jobs that exceed their time limit are terminated.

To avoid having a job terminated, you must specify a time limit in excess of your maximum expected run time. We strongly recommend checkpointing your jobs. The system cannot do this automatically, so it must be done through your application, and it is the user's responsibility. Checkpointing also safeguards against losing all your work when running into time-limit terminations.

Please specify a reasonable time limit and checkpoint your jobs. The default time limit is short (3 hrs).

If you don't specify a time limit, a short default will be assigned. Time limits may not be changed once a job is running. The maximum time limit is 14 days for a standard job and 28 days for an allocated (RAC or contributed) job.
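The combination of an explicit time limit and checkpointing can be sketched as follows. This is only an illustration of the idea: real checkpointing has to be implemented by the application itself, and the file name `checkpoint.txt`, the step count, and the function name are hypothetical.

```shell
#!/bin/bash
# Request more time than the expected run time (format: days-hh:mm:ss).
#SBATCH --time=0-06:00:00

# Hypothetical checkpoint file and number of work steps, for illustration only.
CKPT=${CKPT:-checkpoint.txt}
TOTAL_STEPS=${TOTAL_STEPS:-5}

run_with_checkpoints() {
    local step=0
    # Resume from the last completed step if a checkpoint exists.
    if [ -f "$CKPT" ]; then
        step=$(cat "$CKPT")
    fi
    while [ "$step" -lt "$TOTAL_STEPS" ]; do
        step=$(( step + 1 ))
        # ... one unit of real work would go here ...
        # Write to a temporary file first so that a termination in the
        # middle of the write cannot corrupt the existing checkpoint.
        echo "$step" > "$CKPT.tmp" && mv "$CKPT.tmp" "$CKPT"
    done
    echo "finished at step $step"
}

run_with_checkpoints
```

If the job is terminated at its time limit, resubmitting the same script continues after the last recorded step instead of starting over.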

For similar reasons, the scheduler uses memory limits. Each hardware node has a fixed amount of physical memory. If the jobs running on a node exceed this amount, the node will "swap", i.e., start writing memory content to disk and read it back in when needed. This has a dramatically negative effect on node performance, affecting all jobs that run on it. To avoid this situation, memory needs to be pre-allocated. This can only be done if memory limits are specified.

Memory limits are as "hard" as time limits. Jobs that exceed their memory limit are terminated.


To avoid having a job terminated, you must request memory in excess of your maximum expected memory usage. Checkpoint your jobs to avoid losing all your work when running into memory-limit terminations. This is especially important if your job has unpredictable and dynamic memory allocation and usage.

Please specify a reasonable memory limit based on test runs. The default memory limit is small (1 GB).

If you don't specify a memory limit, a small default will be assigned. Memory limits may not be changed once a job is running. We do not impose maximum memory limits but keep in mind that the largest of our nodes has 2 TB of memory. The maximum memory of nodes accessible from a default account is 256 GB. Jobs that request too much memory cannot be scheduled. SLURM will warn you when you make such a request.
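One way to turn a test run into a memory request is to add some headroom to the observed peak usage. The sketch below assumes the peak is known in MB (for example from the `MaxRSS` field reported by `sacct`, if job accounting is available) and uses a 20% margin; both the margin and the function name are illustrative assumptions, not site policy.

```shell
#!/bin/bash
# Suggest a --mem value from the peak memory of a test run.
# Assumptions: the peak is given in MB, and 20% headroom is sufficient.
mem_request_mb() {
    local peak_mb=$1
    local headroom_pct=${2:-20}
    echo $(( peak_mb * (100 + headroom_pct) / 100 ))
}

# A hypothetical test run that peaked at 3000 MB:
echo "#SBATCH --mem=$(mem_request_mb 3000)M"   # prints: #SBATCH --mem=3600M
```

Jobs with strongly varying memory usage may need a larger margin, passed as the second argument.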

Core and job limits

The main purpose of the scheduler is to allocate cores on the nodes of the cluster. If you are running a parallel job, you will have to specify how many nodes, processes, and cores you need to actually run the job in parallel. This is done through three SLURM parameters:

  • -N specifies the number of nodes for your job. Currently this should be kept at 1 (single-node scheduling).
  • -n specifies the number of processes. For serial and shared-memory jobs, this is 1. For MPI and similar jobs, it is greater than 1.
  • -c specifies the number of cores per process. For serial and MPI jobs, this is 1. For shared-memory and hybrid jobs, it is greater than 1.
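The three parameters above can be summarized per job type. The following helper is a hypothetical sketch (the function name and the job-type labels are ours) that prints the corresponding options, keeping -N at 1 as recommended.

```shell
#!/bin/bash
# Print sbatch options for the job types described above.
# Usage: slurm_flags <type> [count] [cores-per-process]
slurm_flags() {
    local kind=$1 count=${2:-1} threads=${3:-1}
    case "$kind" in
        serial) echo "-N 1 -n 1 -c 1" ;;
        # Shared-memory: one process using several cores.
        shared) echo "-N 1 -n 1 -c $count" ;;
        # MPI: several processes with one core each.
        mpi)    echo "-N 1 -n $count -c 1" ;;
        # Hybrid: several processes with several cores each.
        hybrid) echo "-N 1 -n $count -c $threads" ;;
        *)      echo "unknown job type: $kind" >&2; return 1 ;;
    esac
}

slurm_flags shared 8     # prints: -N 1 -n 1 -c 8
slurm_flags mpi 16       # prints: -N 1 -n 16 -c 1
slurm_flags hybrid 4 2   # prints: -N 1 -n 4 -c 2
```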

The system will not let you use more cores than you have requested. Instead, it "contains" your job within the core limit. If your job starts more processes or threads than it has cores, they have to share the allocated cores and performance drops dramatically.

Core limits must be specified for parallel jobs. Jobs that exceed their core limits are "contained" and slow down dramatically.

The default core limit is 1.
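To avoid running into containment, the number of threads a program starts should match the number of cores requested. The sketch below assumes a shared-memory program that reads the standard OpenMP variable `OMP_NUM_THREADS`; whether your application honours this variable is an assumption you need to check.

```shell
#!/bin/bash
# One node, one process, eight cores for that process.
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 8

# SLURM sets SLURM_CPUS_PER_TASK when -c is given; fall back to 1 otherwise
# (for example when the script is run outside the scheduler).
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
echo "using $OMP_NUM_THREADS thread(s)"
```

With this pattern, changing the `-c` value is enough to keep the thread count and the core request in step.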

On the decommissioned SW/CAC cluster we imposed "job limits" to keep the number of jobs a user can get scheduled within bounds to ensure a fairer usage of resources.

The present Frontenac cluster does not have any job limits.

Note that this means we do not artificially limit the number of jobs that will be scheduled.

However, the scheduler employs a fair-share scheme: the more resources you use, the lower your relative priority with respect to other users' jobs will be. This acts both on the level of individual users and on the level of their user groups, and will limit the number of jobs you can actually run simultaneously. For the same reason, we do not impose a maximum number of cores you can request from a default or a RAC account.

Rule of thumb: The more resources you use, the lower your priority will be with respect to your peers.

Summary: Allocation on Frontenac (SLURM)

Allocation feature                  Frontenac (SLURM)

Default allocation (compute)        • no core or job limit
                                    • standard partition
                                    • low priority
                                    • no scheduling to large/high-memory nodes

RAC allocation (compute)            • core-years allocation over one year
                                    • allocation from Compute Canada
                                    • scheduled to "reserved" partition at enhanced priority
                                    • no dedicated resources, reserved partition
                                    • compete with other users of same priority

Contributed allocation (compute)    • fixed core-years allocation over one year
                                    • memory allocation based on contributed memory
                                    • allocation depends on contributed-system size
                                    • scheduled to "reserved" partition at high priority
                                    • no dedicated resources, reserved partition
                                    • compete with other users of same priority

Limits                              • usage lowers priority (fair share)
                                    • RAC/contributed allocations increase priority
                                    • no global job limits