= How to run Jobs: The Grid Engine =
  
 
This is an introduction to the '''Sun Grid Engine''' (SGE) scheduling software that is used to submit batch jobs to our production clusters. Note that the use of this software is '''mandatory'''. Please familiarize yourself with Grid Engine by reading this file, and refer to the documentation listed in it for details.
 
Note that the usage of SGE on the production systems of the Centre for Advanced Computing will be phased out in the course of 2016. We will replace this scheduler with a newer one, in all likelihood "SLURM".
  
==What is Grid Engine?==
  
Sun Grid Engine (SGE) is a Load Management System that allocates resources such as processors (CPUs), memory, disk space, and computing time. Grid Engine, like other schedulers, enables transparent load sharing, controls the sharing of resources, and implements utilization and site policies. It has many features, including batch queuing and load balancing, and gives users the ability to suspend/resume jobs and check the status of their jobs.
Grid Engine can be used through the command line or through a Graphical User Interface (GUI) called "qmon", both with the same set of commands.

Additional information about Grid Engine features will follow in the next sections and the documents referenced in this FAQ.
==Which version of Grid Engine is currently in use on HPCVL machines?==

The version of Grid Engine on our systems is '''Sun Grid Engine 6.1 (update 4)'''.
==How do I set up my environment to use Grid Engine?==

When you first log in you will already have the proper setup for using Grid Engine, because Grid Engine is included in the default settings for ''usepackage''. If for some reason Grid Engine is not part of your environment setup, you can add it by issuing the

<pre>
use sge6
</pre>

command. Part of the setup that is done automatically by ''usepackage'' is to source a setup script that is located in the directory

<pre>
/opt/n1ge6/default/common/
</pre>

You can also "source" that script manually:

<pre>
source /opt/n1ge6/default/common/settings.sh
</pre>

The setup script modifies your search PATH and sets other environment variables that are required to get Grid Engine running. One of those variables is SGE_ROOT, which contains the directory in which the Grid Engine-related programs are located.

== Using Grid Engine ==

Jobs are submitted to Grid Engine through the '''qsub''' command. If the job is simple and consists of only a few commands, the submission can be done via the command line. If the job requires the setup of many options and requests, the job is written in the form of a script.

Here is a '''sample script''' that must be modified to fit your use case:

<pre>
#!/bin/bash
#$ -S /bin/bash
#$ -V
#$ -cwd
#$ -M my.email@some.address.com
#$ -m be
#$ -o STD.out
#$ -e STD.err
./program < input
</pre>

Such a script is then submitted through the '''qsub''' command:

<pre>qsub test.sh</pre>
==How do I start using Grid Engine?==

Grid Engine provides two ways to run your jobs: directly from the command line using the '''qsub''' command, or through the '''qmon''' GUI; it is up to the user to choose whichever is convenient. If the job is simple and consists of only a few commands, submission is more easily done via the command line. If the job requires the setup of many options and special requests, the GUI is helpful (at least the first time, when you are writing your script) and facilitates navigation through the available options.
==What are the most commonly used Grid Engine Commands?==

Sun Grid Engine has a large set of programs that let the user submit and delete jobs, check job status, and obtain information about available queues and environments. For most users, knowledge of the following basic commands is sufficient to get started with Grid Engine and have full control of their jobs:

* '''qconf''': Shows (-s) the configurations and access permissions.
* '''qdel''': Gives the user the ability to delete their own jobs.
* '''qhost''': Displays status information about Sun Grid Engine execution hosts.
* '''qmod''': Modifies the status of your jobs (e.g. suspend/resume).
* '''qmon''': Provides the X-windows GUI interface.
* '''qstat''': Provides a status listing of all jobs and queues associated with the cluster.
* '''qsub''': The user interface for submitting a job to Grid Engine.

All these commands come with many options and switches and are also available through the QMON GUI. They all have detailed man pages (e.g. "man qsub"), and are documented in the Sun Grid Engine 6 User's Guide (about 2.2 MB).
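As a quick illustration of these commands, a typical session might look like the following sketch (the job ID and script name are hypothetical):

<pre>
qsub test.sh
qstat -j 12345
qdel 12345
</pre>

Here, qsub reports the ID of the newly submitted job (e.g. 12345), qstat -j shows details about that job, and qdel removes it.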
==What are the different kinds of jobs that I can run with Grid Engine?==

Grid Engine uses the notion of a queue to distinguish between the different types of jobs and the different components of the HPCVL clusters. Grid Engine queues can allow execution of many jobs concurrently, and Grid Engine tries to start new jobs in the queue that is most suitable and least loaded.

Note that a job is always associated with its queue and depends on the status of this queue, but users do not need to submit jobs directly to a queue. You only need to specify the requirement profile of the job, which includes memory, available software, and the type of job (parallel or not, MPI, ...).

Although you don't submit jobs directly to a queue, you still need to know which queue is handling your job and what the characteristics of this queue are. On the HPCVL system, we presently have three different queues that are used for different purposes. If you type

<pre>qconf -sql</pre>

you will see a list of all available queues. In particular, you'll find the following:

* '''m9k.q''': This is the default queue. All jobs other than simple short test jobs are sent to this queue automatically. It is associated with the [[Hardware:M9000|M9000 Cluster]] of Fujitsu Sparc64-VII based Sun Enterprise M9000 servers. It is used to schedule serial and parallel jobs to these high-memory dually-threaded nodes '''m9k00[1-8]'''.
* '''abaqus.q''': This queue is associated with the Intel Xeon based [[Hardware:SW|SW (Linux) Cluster]], which is used to schedule software that does not run on the Solaris/Sparc platform and requires Linux x86 based servers instead. Such software includes commercial packages such as Fluent and Abaqus (hence the name), as well as open-source packages developed on Linux. The queue schedules serial and parallel jobs to nodes '''sw00[11-51]'''.
* '''vf.q''': This queue is associated with the [[Hardware:VictoriaFalls|Victoria Falls Cluster]] of Niagara-2 based Sun T5140 servers. It is used to schedule serial and parallel jobs to these highly multi-threaded nodes '''v[01-73]'''.
===How do I write and submit batch jobs?===

To run a job with Grid Engine you have to submit it from the command line or the GUI. But first, you have to write a batch script file that contains all the commands and environment requests that you want for this job. If, for example, test.sh is the name of the script file, use the command qsub to submit the job:

<pre>qsub test.sh</pre>
 
And, if the submission of the job is successful, you will see this message:

<pre>your job 12345 ("test.sh") has been submitted</pre>
  
After that, you can monitor the status of your job with the command '''qstat''' or the GUI '''qmon'''. When the job is finished you may have two output files called "test.sh.o1" and "test.sh.e1".

Now, let's take a look at the structure of a Grid Engine batch job script.
  
 
We first recall that a batch job is a UNIX shell script: a sequence of UNIX command-line instructions (or interpreted scripts, e.g. Perl) assembled in a file. In Grid Engine, it is a batch script that contains, in addition to normal UNIX commands, special comment lines marked by the leading prefix "#$".

The first two lines usually specify the shell:
  
<pre>
#!/bin/bash
#$ -S /bin/bash
</pre>

Here, we force Grid Engine to use a ''bash'' shell interpreter (''csh'' is the default).
 
 
To tell SGE to run the job from the current working directory, add this script line:

<pre>#$ -cwd</pre>
  
 
If you want to pass some environment variable VAR (or a list of variables separated by commas), use the "-v" option like this:

<pre>#$ -v VAR</pre>
  
 
The "-V" option passes all variables listed in env:

<pre>#$ -V</pre>
  
Insert the name of the files to which you want to redirect the standard output and standard error, respectively (the full pathname is not necessary if the "#$ -cwd" option was used):

<pre>
#$ -o file
#$ -e file
</pre>

Here is a '''serial sample script''' that has to be modified to fit your case:
 
<pre>
#!/bin/bash
#$ -S /bin/bash
#$ -V
#$ -cwd
#$ -M hpcXXXX@localhost
#$ -m be
#$ -o STD.out
#$ -e STD.err
./program < input
</pre>
  
The '''-M''' option is for email notification. It is best to use ''hpcXXXX@localhost'' (with ''hpcXXXX'' replaced by your actual user name) and to place a file named ".forward" that contains your real email address into your home directory. This way, your email address remains private and invisible to other users. With the '''-m''' option you let the system know when you want to be notified (in this case at the '''b'''eginning and at the '''e'''nd of the run).

Note that qsub usually expects shell scripts, not executable files. To submit the job (here, we call it "script.sh") you simply type

<pre>qsub script.sh</pre>

Note that you can also add options from the command line, for instance:

<pre>qsub -cwd -v VAR=value -o /home/tmp -e /home/tmp script.sh</pre>
===How do I submit jobs to queues other than the default?===

Our main production environment consists of 8 [[Hardware:M9000|Sun Enterprise M9000 servers]]. When you submit jobs, by default this is the set of machines on which your job will run. The associated queue is '''m9k.q'''.

However, we have two other clusters, namely the [[Hardware:VictoriaFalls|Victoria Falls cluster]] and the [[Hardware:SW|Software (SW, Linux) cluster]]. Both of these have their own queues, '''vf.q''' and '''abaqus.q''', respectively.

The Victoria Falls cluster consists of highly multi-threaded nodes with 16 cores and up to 128 hardware-supported threads. It runs on the Solaris/Sparc platform. The software cluster consists of x86 machines running Linux, and requires re-compilation of user software, or a specific version of pre-compiled applications.

It is possible that code compiled on the login node (and therefore optimized for the US IV+ chip) will not run efficiently on the Niagara 2 chips of the VF cluster, or on the Sparc64-VII chips of the M9000 cluster. See our [[FAQ:Parallel|Parallel Programming FAQ]] for suggestions on how to optimize code for architectures other than US IV+.

Let's say you want to include the SW (Linux) cluster nodes in the list of possible machines to run your job. Here is what you can do:

* (simplest) the job can run on either machine:

<pre>
#$ ... other directives ...
#$ -q abaqus.q
</pre>

* ensure the job runs only on the newly added cluster (in this case, the SW cluster):

<pre>
#$ ... other directives ...
#$ -q abaqus.q
#$ -l qname=abaqus.q
</pre>

The -l option selects the specified queue as the only acceptable choice, and ensures that the job is scheduled there.

Note that jobs for the SW (Linux) cluster are best submitted from the swlogin1 Linux login node, not from sflogin0 (Solaris). This is because scripts often assume that settings are inherited (the "#$ -V" line), so the settings have to be appropriate for Linux in the first place.
  
== Array Jobs ==

An array of jobs is a job consisting of a range of independent, near-identical tasks. Rather than making a separate submission script for each of these tasks, it is preferable to make only one script with all the information that is identical among the tasks, and then use a "counter" to vary the parts that differ.
 
In an array job, there is usually a line like this:

<pre>#$ -t 2-1000:2</pre>
  
 
which instructs Grid Engine to dynamically (and internally) create copies of the current job script that differ from each other in a counter variable '''SGE_TASK_ID''', which is counted up from 2 to 1000 in steps of 2. This variable can be used to distinguish between the tasks. For instance, if we want to run the same program "runme.exe" with various different input and output files, we may have a line

<pre>runme.exe < input$SGE_TASK_ID > output$SGE_TASK_ID</pre>
  
in our script. Note that it is also possible to use SGE_TASK_ID in a script that does not explicitly contain the '''#$ -t''' line. You can then just submit your job (let's call it "array.sh") with the corresponding option '''-t''', like this:

<pre>qsub -t 2-1000:2 array.sh</pre>
  
 
Check '''page 71 [http://www.hpcvl.org/sites/default/files/hpvcl_sge_manual.pdf of the manual]''' for more details.
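Putting the pieces above together, a complete array-job script might look like the following sketch (program and file names are placeholders from the example above):

<pre>
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -o STD.out
#$ -e STD.err
#$ -t 2-1000:2
./runme.exe < input$SGE_TASK_ID > output$SGE_TASK_ID
</pre>

Submitted once with "qsub", this runs 500 tasks, each reading from its own input file and writing to its own output file.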
  
== Monitoring Jobs ==

After submitting your job to Grid Engine you may track its status by using either the '''qstat''' command, the GUI interface '''qmon''', or by '''email'''.
  
=== With qstat ===

The qstat command provides the status of all jobs and queues in the cluster. The most useful options are:
 
* '''qstat''': Displays the list of all jobs of the current user with no queue status information.
* '''qstat -u hpc1234''': Displays the list of all jobs belonging to user hpc1234.
* '''qstat -u "*"''': Displays the list of all jobs belonging to all users (note the double quotes around the asterisk).
* '''qstat -f''': Gives full information about jobs and queues.
* '''qstat -j 1234567''': Gives details about the pending or running job 1234567, including the reason why a pending job is not being scheduled.
 
You can refer to the man pages for a complete description of all the options of the qstat command.
  
=== By electronic mail ===

Another way to monitor your jobs is to have Grid Engine notify you by email about the status of the job.
 
In your batch script or from the command line, use the -m option to request that an email be sent, and the -M option to specify the email address where it should be sent. This will look like:

<pre>
#$ -M email@address.com
#$ -m be
</pre>
  
The -m option selects after which events you want to receive your email. In particular, you can select to be notified at the '''b'''eginning/'''e'''nd of the job (see the sample script lines above), or when the job is '''a'''borted/'''s'''uspended. The -M option specifies the email address at which you want to be notified. If you specify ''hpcXXXX@localhost'' (where ''hpcXXXX'' stands for your username) and place a file ".forward" that contains your email address into your home directory, then the email will be sent there without another user being able to see your address.
  
 
From the command line you can use the options (for example):

<pre>qsub -M email@address.com job.sh</pre>
  
=== With qmon ===

You can also use the GUI '''qmon''', which provides a convenient window dialog specifically designed for monitoring and controlling jobs; the buttons are self-explanatory.
  
==How do I control my jobs?==

Based on the status of the job displayed, you can control it through the following actions:

'''Modify a job''': As a user, you have certain rights that apply exclusively to your own jobs. The Grid Engine command for this is qmod. Check the man pages for the options that you are allowed to use.

'''Suspend or resume a job''': This uses the UNIX kill command and applies only to running jobs. In practice, you type

<pre>qmod -s job_id</pre>

to suspend a job, or

<pre>qmod -r job_id</pre>

to resume it, where job_id is given by qstat or qsub. Note that this works reliably only with serial jobs and should not be used with multi-threaded or MPI jobs.

'''Delete a job''': You can delete a job that is running or spooled in the queue by using the qdel command like this:

<pre>qdel 1234567</pre>

which removes job number 1234567. Note that if your job is not on the waiting queue but is already executing, you might need to issue the '''-f''' (force) option with the qdel command to terminate the job.
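For a job that is already executing, the forced deletion mentioned above would look like this (job number hypothetical):

<pre>qdel -f 1234567</pre>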
== Requesting Memory ==
Sometimes your job requires additional resources to run, for instance you may need a minimum amount of memory. This is particularly relevant when you are running jobs on the SW (Linux) cluster, using '''abaqus.q'''. Since this cluster consists of nodes with different available physical memory (see this table for a list), it is important to be aware of whether the node you are running your job on has enough memory to execute properly. To this end, Grid Engine provides a simple resource specification of "free memory":
 
<pre>#$ -l mf=35G</pre>
  
This line would be included in your submission script if you are running a program that requires up to 35 GB of physical memory. SGE checks before scheduling whether this amount is available on a node, and does not send the job to nodes with less remaining memory. Note that this is checked only at the time of scheduling; it cannot provide a safeguard against "running out" later. However, specifying memory this way makes it much less likely that the job ends up "swapping", i.e. using disk to store data. Swapping usually slows down execution by a huge factor, often leading to unacceptable execution times.
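The same resource request can also be made on the command line instead of in the script, for instance (script name hypothetical):

<pre>qsub -l mf=35G job.sh</pre>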
  
== Parallel Jobs ==

A Parallel Environment is a programming environment designed for parallel computing in a network of computers, which allows execution of shared-memory and distributed-memory parallel applications. The most commonly used parallel environments are the Message Passing Interface (MPI) for distributed-memory machines, and OpenMP for shared-memory machines.

* For MPI there is an implementation called HPC ClusterTools. It is located in the /opt/SUNWhpc directory (check the HPCVL Parallel Programming FAQ for more details).
* For OpenMP, no separate runtime environment is required. Details about shared-memory programming and multi-threading with OpenMP may be found in the HPCVL Parallel Programming FAQ.

Grid Engine provides an interface to handle parallel jobs running on top of these parallel environments. For the users' convenience, HPCVL has predefined parallel environment interfaces. These are:

* '''shm.pe''': Intended for shared-memory applications. Grid Engine will assign the processors within a single node to take advantage of the fastest connection available between the slots. It is permissible to use '''shm.pe''' for distributed-memory (e.g. MPI) jobs if the intention is to keep them within a single node. Note that this might speed up communication, but can also lead to longer waiting periods.
* '''dist.pe''': Intended for distributed-memory applications using MPI. Grid Engine will assign '''dist.pe''' jobs to the production queue and try to use the fastest connection available between the slots and nodes. Although the system will try to allocate processes on as few nodes as possible, it is allowed to spread them out over the cluster, since this parallel environment is meant to handle distributed-memory jobs. Note that currently, '''dist.pe''' is functionally equivalent to '''shm.pe''', i.e. no inter-node scheduling takes place.
* '''vfdist.pe''': Serves the same purpose as dist.pe, but is designed for the Victoria Falls cluster, and restricts the scheduling of processes to a 40-node sub-cluster that is internally connected through 10 Gig Ethernet.
* '''abaqus.pe''' and '''fluent.pe''': Specialized environments used for parallel runs of the application software packages Abaqus, Fluent, and Matlab. These applications need their own parallel environments to keep track of available licenses and to run auxiliary commands.

=== Multi-threaded Jobs ===
 
 
You need to specify the parallel environment to use, which is shm.pe in our case, and how many processors are going to be used. This is done via the script line:
  
<pre>#$ -pe shm.pe 16</pre>

if you want to use 16 processors. This sets an environment variable '''NSLOTS''' and requests the corresponding number of processes.
  
There is no request for parallel queues or special complexes, but, as in an interactive run of a multi-threaded program, you need to set the variable '''OMP_NUM_THREADS''' (in the case of OpenMP applications) to the number of processors to be used. Add the following line to your script file (bash syntax):

<pre>export OMP_NUM_THREADS=$NSLOTS</pre>

Here is a '''multi-threading sample script''' that has to be modified to fit your case:
  
 
<pre>
 
<pre>
Line 270: Line 184:
 
#$ -V
 
#$ -V
 
#$ -cwd
 
#$ -cwd
#$ -M hpcXXXX@localhost
+
#$ -M email@address.com
 
#$ -m be
 
#$ -m be
 
#$ -o STD.out
 
#$ -o STD.out
 
#$ -e STD.err
 
#$ -e STD.err
#$ -pe shm.pe Nthreads
+
#$ -pe shm.pe 16
export PARALLEL=$NSLOTS
+
 
export OMP_NUM_THREADS=$NSLOTS
 
export OMP_NUM_THREADS=$NSLOTS
./program < input
+
./omp_program < input
 
</pre>
 
</pre>
"Nthreads" should be replaced by the actual number of threads you want to use. Note that this number should not exceed the number of cores available on the node you are running this, because otherwise the job won't get scheduled.
 
  
===How do I submit a parallel MPI job?===
+
We're assuming 16 threads in this example, you'll have to change that if you're using a different number. Note that this number should not exceed the number of cores available on the nodes you are planning to run this, because otherwise the job won't get scheduled.
 +
 
 +
=== Distributed (MPI) Jobs ===
 +
 
 
A specific parallel environment needs to be specified, to let the system know which environment and how many processors are going to be used. This is done via the script line:
 
A specific parallel environment needs to be specified, to let the system know which environment and how many processors are going to be used. This is done via the script line:
  
: #$ -pe dist.pe 16
+
<pre>#$ -pe dist.pe 16</pre>
  
 
where the number of processors is 16 in this case.
 
where the number of processors is 16 in this case.
  
In the standard mpirun command, you do '''not''' have to specify the number of processes through the -np option, because the Cluster Tools runtime system knows that resource allocation will be done by Grid Engine and determines the number of processes from the -pe directive.
+
In the standard '''mpirun''' command, you specify the number of processes through the '''-np''' option, because the Cluster Tools runtime system knows that resource allocation will be done by Grid Engine and determines the number of processes from the '''-pe''' directive.
  
Here is an '''MPI sample script''' that has to be modified to fit your case. You will of course have to change the appropriate settings to make this usable for your case:
+
Here is an '''MPI sample script''' that has to be modified to fit your case:
  
 
<pre>
 
<pre>
Line 297: Line 212:
 
#$ -V
 
#$ -V
 
#$ -cwd
 
#$ -cwd
#$ -M hpcXXXX@localhost
+
#$ -M email@address.com
 
#$ -m be
 
#$ -m be
 
#$ -o STD.out
 
#$ -o STD.out
 
#$ -e STD.err
 
#$ -e STD.err
#$ -pe dist.pe Nprocs
+
#$ -pe dist.pe 16
mpirun ./program < input
+
mpirun -np $NSLOTS  ./mpi_program < input
 
</pre>
 
</pre>
  
Nprocs should be replaced by the number of MPI processes you want to use. Since by default we are operating our cluster nodes "in box", i.e. no internode communication is used, the number of processes shouldl not exceed the number of cores for the node the job is running on. Otherwise the job will not be scheduled.
+
16 should be replaced by the number of MPI processes you actually want to use. Since by default we are operating our cluster nodes "in box", i.e. no internode communication is used, the number of processes should not exceed the number of cores for the node the job is running on. Otherwise the job will not be scheduled.
  
 
To run this job you simply type
 
To run this job you simply type
  
: qsub mpi_job.sh
+
<pre>qsub mpi_job.sh</pre>
 +
 
 +
'''Note:''' Presently the dist.pe and shm.pe parallel environments are configured the same way. This means that an MPI job will only be scheduled on a single node. This is done for reasons of efficiency.
 +
 
 +
== Other Commands ==
 +
 
 +
Sun Grid Engine allows the user to submit/delete jobs, check job status, and have information about available queues and environments. For most users the knowledge of the following basic commands should be sufficient:
 +
 
 +
* '''qconf''': Shows (-s) the user the configurations and access permissions only.
 +
* '''qdel''': Gives the user the ability to delete his own jobs only.
 +
* '''qhost''': Displays status information about Sun Grid Engine execution hosts.
 +
* '''qmod''': Modify the status of your jobs (like suspend/resume).
 +
* '''qmon''': Provides the X-windows GUI command interface.
 +
* '''qstat''': Provides a status listing of all jobs and queues associated with the cluster.
 +
* '''qsub''': Is the user interface for submitting a job to Grid Engine.
 +
 
 +
All these commands come with many options and switches and are also available with the GUI '''qmon'''. They all have detailed man pages (e.g. "man qsub"), and are documented in the [http://www.hpcvl.org/sites/default/files/hpvcl_sge_manual.pdf Sun Grid Engine 6 User's Guide].
 +
 
 +
== Environment Setup ==
 +
 
 +
When you first log in you will already have the proper setup for using Gridengine. This is because Gridengine is included in the default settings for ''usepackage''. If for some reason Gridengine is not part of your environment setup, you can add it by issuing the
 +
 
 +
<pre>
 +
use sge6
 +
</pre>
 +
 
 +
command. Part of the setup that is done automatically by ''usepackage'' is to source a setup-script that is located in the directory
 +
 
 +
<pre>
 +
/opt/n1ge6/default/common/
 +
</pre>
 +
 
 +
You can also "source" those scripts manually:
 +
 
 +
<pre>
 +
source /opt/n1ge6/default/common/settings.sh
 +
</pre>
 +
 
 +
The setup script modifies your search PATH and sets other environment variables that are required to get Grid Engine running. One of those variables is SGE_ROOT which contains the directory in which the Grid Engine-related programs are located.
 +
 
 +
== Help and documentation ==
  
==Where can I get more help and documentation?==
 
 
Grid Engine has a lot more options and possibilities for every kind of jobs. Here, we gave the user only the basic steps to get started using GE. Detailed documentation is available. First, there is [http://www.hpcvl.org/sites/default/files/hpvcl_sge_manual.pdf the User's Guide] which should answer almost all of your questions.
 
Grid Engine has a lot more options and possibilities for every kind of jobs. Here, we gave the user only the basic steps to get started using GE. Detailed documentation is available. First, there is [http://www.hpcvl.org/sites/default/files/hpvcl_sge_manual.pdf the User's Guide] which should answer almost all of your questions.
  
 
For specific commands, the '''man pages''' are very comprehensive and should be consulted. For instance "man qstat" explains the meaning of the qstat command options.
 
For specific commands, the '''man pages''' are very comprehensive and should be consulted. For instance "man qstat" explains the meaning of the qstat command options.
  
HPCVL also offers user support; for questions about this FAQ and the usage of Grid Engine in HPCVL machines [[Contacts:UserSupport|contact us directly]] or [mailto:help@hpcvl.org send email to help@hpcvl.org].
+
The Centre for Advanced Computing offers user support; for questions about this help file and the usage of Grid Engine on our machines [[contacts:UserSupport|contact us]].

Latest revision as of 14:45, 30 June 2020

Redirect to: SLURM

How to run Jobs: The Grid Engine

This is an introduction to the Sun Grid Engine (SGE) scheduling software that is used to submit batch jobs to our production clusters. Note that the use of this software is mandatory. Please familiarize yourself with Grid Engine by reading this file, and refer to the documentation listed in it for details.

Note that the usage of SGE on the production systems of the Centre for Advanced Computing will be phased out in the course of 2016. We will replace this scheduler with a newer one, in all likelihood "SLURM".

What is Grid Engine?

Sun Grid Engine (SGE) is a Load Management System that allocates resources such as processors (CPUs), memory, disk space, and computing time. Grid Engine, like other schedulers, enables transparent load sharing, controls the sharing of resources, and implements utilization and site policies. Among its features are batch queuing and load balancing, as well as the ability for users to suspend/resume jobs and check their status.

Grid Engine can be used through the command line or through a Graphical User Interface (GUI) called "qmon", both with the same set of commands.

The version of Grid Engine on our systems is Sun Grid Engine 6.1 (update 4).

Using Grid Engine

Jobs are submitted to Grid Engine through the qsub command.

If the job is simple and consists of only a few commands, it can be submitted directly from the command line. If the job requires many options and resource requests, it is written in the form of a script.

Here is a sample script that must be modified to fit your use case:

#!/bin/bash
#$ -S /bin/bash
#$ -V
#$ -cwd
#$ -M my.email@some.address.com
#$ -m be
#$ -o STD.out
#$ -e STD.err
./program < input

Such a script is then submitted through the qsub command:

qsub test.sh 

And, if the submission of the job is successful, you will see this message:

Your job 12345 ("test.sh") has been submitted.

After that, you can monitor the status of your job with the command qstat or the GUI qmon.

Now, let's take a look at the structure of the Grid Engine batch job script.

We first recall that a batch job is a UNIX shell script, i.e. a sequence of UNIX command-line instructions (or interpreted scripts such as Perl) assembled in a file. In Grid Engine, a batch script contains, in addition to the normal UNIX commands, special comment lines marked by the leading prefix "#$".

The first two lines usually specify the shell

#! /bin/bash
#$ -S /bin/bash

We force Grid Engine to use a bash shell interpreter (csh is the default).

To tell SGE to run the job from the current working directory add this script line

#$ -cwd 

If you want to pass an environment variable VAR (or a list of variables separated by commas), use the "-v" option like this:

#$ -v VAR 

The "-V" option passes all environment variables (everything listed by the env command):

#$ -V 

Insert the names of the files to which you want to redirect the standard output and standard error, respectively:

#$ -o STD.out 
#$ -e STD.err

The -M option is for email notification. It is best to use hpcXXXX@localhost (hpcXXXX stands for your actual user name) and place a file named ".forward" that contains your real email address into your home directory. This way, your email address remains private and invisible to other users. With the -m option you let the system know when you want to be notified (beginning and end of the job).
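As an illustration, the ".forward" setup amounts to the following sketch. The email address is a placeholder, and a scratch directory stands in for your home directory so the example does not touch real files; in practice you would write the file into $HOME.

```shell
#!/bin/bash
# Demonstrate the ".forward" trick in a scratch directory; in practice
# the file goes into your real home directory ($HOME).
DEMO_HOME=$(mktemp -d)

# The address below is a placeholder -- substitute your real one.
echo "my.real.address@example.com" > "$DEMO_HOME/.forward"
chmod 600 "$DEMO_HOME/.forward"

cat "$DEMO_HOME/.forward"
```

Mail sent to hpcXXXX@localhost is then delivered to the address listed in the file.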

Note that qsub usually expects shell scripts, not executable files.

Note that you can also add options from the command line, for instance

$ qsub -cwd -v VAR=value -o /home/tmp -e /home/tmp test.sh 

The default Linux (SW) cluster queue is called abaqus.q. Note that jobs for the SW (Linux) cluster are best submitted from the swlogin1 Linux login node, not from sflogin0 (Solaris). This is because scripts often assume that settings are inherited (the "#$ -V" line), and those settings have to be appropriate for Linux.

Array Jobs

An array job consists of a range of independent, near-identical tasks. Rather than writing a separate submission script for each of these tasks, it is preferable to write only one script with all the information that is identical among the tasks, and then use a "counter" to vary the parts that differ.

In an array job, there is usually a line like this:

#$ -t 2-1000:2

which instructs Grid Engine to dynamically (and internally) create copies of the current job script that differ from each other in a counter variable SGE_TASK_ID which gets counted up from 2 to 1000 in steps of 2. This variable can be used to distinguish between the tasks. For instance, if we want to run the same program "runme.exe" with various different input and output files, we may have a line

runme.exe < input$SGE_TASK_ID > output$SGE_TASK_ID

in our script. Note that it is also possible to use SGE_TASK_ID in a script that does not explicitly contain the #$ -t line. You can then just submit your job (let's call it "array.sh") with the corresponding option -t, like this

qsub -t 2-1000:2 array.sh

Check page 71 of the manual for more details.
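Putting these pieces together, a complete array-job script might look like the following sketch. Here "runme.exe" and the input/output file names are placeholders, and the script echoes the command (with a default SGE_TASK_ID) so it can be dry-run outside the scheduler.

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -o STD.out
#$ -e STD.err
#$ -t 2-1000:2
# Under Grid Engine, SGE_TASK_ID runs through 2, 4, ..., 1000.  Default
# to 2 so the script can be dry-run outside the scheduler.
: "${SGE_TASK_ID:=2}"

# "runme.exe" and the input/output names are placeholders; echo the
# command instead of executing it so the dry run is harmless.
echo "would run: ./runme.exe < input${SGE_TASK_ID} > output${SGE_TASK_ID}"
```

Submitted with qsub, each task picks its own input and output file from its task id; replace the echo line with the actual program call.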

Monitoring Jobs

After submitting your job to Grid Engine you may track its status by using either the qstat command, the GUI interface qmon, or by email.

With qstat

The qstat command provides the status of all jobs and queues in the cluster. The most useful options are:

  • qstat: Displays a list of all jobs of the current user, with no queue status information.
  • qstat -u hpc1234: Displays a list of all jobs belonging to user hpc1234.
  • qstat -u "*": Displays a list of all jobs belonging to all users (note the double quotes around the asterisk).
  • qstat -f: Gives full information about jobs and queues.
  • qstat -j 1234567: Gives details about the pending or running job 1234567.

You can refer to the man pages for a complete description of all the options of the qstat command.
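The exit status of qstat can also be used from a script, for instance to wait until a job has finished. The sketch below assumes a placeholder job id; "qstat -j" exiting non-zero (including when Grid Engine is not available at all) ends the loop.

```shell
#!/bin/bash
# Block until the given job has left the queue.  "qstat -j <id>" exits
# non-zero once the job is gone (or if qstat is unavailable), which
# terminates the loop.  1234567 is a placeholder job id.
JOB_ID=1234567
while qstat -j "$JOB_ID" >/dev/null 2>&1; do
    sleep 60
done
echo "job $JOB_ID is no longer queued or running"
```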

By electronic mail

Another way to monitor your jobs is to have Grid Engine notify you by email about the status of the job.

In your batch script or on the command line, use the -m option to request that an email be sent and the -M option to specify the address to which it should be sent. This will look like:

#$ -M email@address.com
#$ -m be

The -m option selects the events after which you want to receive email. In particular, you can choose to be notified at the beginning/end of the job (see the sample script lines above), or when the job is aborted/suspended. The -M option specifies the email address at which you want to be notified.

From the command line you can use the options (for example):

qsub -M email@address.com job.sh

With qmon

You can also use the GUI qmon, which provides a convenient window dialog specifically designed for monitoring and controlling jobs; its buttons are self-explanatory.

Deleting Jobs

You can delete a job that is running or spooled in the queue by using the qdel command like this

qdel 1234567

which removes job number 1234567. Note that if your job is not waiting in the queue but is already executing, you might need to add the -f (force) option to the qdel job_id command to terminate it.

Requesting Memory

Sometimes your job requires additional resources to run; for instance, you may need a minimum amount of memory. This is particularly relevant when you are running jobs on the SW (Linux) cluster, using abaqus.q. Since this cluster consists of nodes with different amounts of physical memory (see this table for a list), it is important to be aware of whether the node your job runs on has enough memory to execute it properly. To this end, Grid Engine provides a simple resource specification of "free memory":

#$ -l mf=35G

In this example, the program requires up to 35 GB of physical memory. SGE checks before scheduling whether this amount is available on a node and avoids nodes with less. Note that this is checked only at the time of scheduling; it does not provide a safeguard against "running out" later. However, it makes it less likely that the job ends up "swapping", i.e. using disk to store data. Swapping usually slows down execution by a huge factor, often leading to unacceptable execution times.

Parallel Jobs

A Parallel Environment is a programming environment designed for parallel computing in a network of computers, which allows execution of shared memory and distributed memory parallel applications. The most commonly used parallel environments are Message Passing Interface (MPI) for distributed-memory machines, and OpenMP for shared-memory machines.

Grid Engine provides an interface to handle parallel jobs running on top of these parallel environments. For the users' convenience we have predefined parallel environment interfaces. These are:

  • shm.pe: This environment is intended for shared-memory applications. Grid Engine will assign the processors in a single node to take advantage of the fastest connection available between the slots. It is permissible to use shm.pe for distributed-memory (e.g. MPI) jobs, if the intention is to keep them within a single node. Note that this might speed up communication, but can also lead to longer waiting periods.
  • dist.pe: This environment is intended for distributed-memory applications using MPI. Grid Engine assigns dist.pe jobs to the production queue and tries to use the fastest connection available between the slots and nodes. Although the system will try to allocate processes on as few nodes as possible, it is allowed to spread them out over the cluster, since this parallel environment is meant to handle distributed-memory jobs. Note that currently, dist.pe is functionally equivalent to shm.pe, i.e. no inter-node scheduling takes place.

Multi-threaded Jobs

You need to specify the parallel environment to use, which is shm.pe in our case, and how many processors are going to be used. This is done via the script line:

#$ -pe shm.pe 16

if you want to use 16 processors. This sets an environment variable NSLOTS and requests the corresponding number of processes.

In the case of an OpenMP-based multi-threaded program, you need to set the variable OMP_NUM_THREADS to the number of processors to be used. Add the following line to your script file:

export OMP_NUM_THREADS=$NSLOTS

Here is a multi-threading sample script that has to be modified to fit your case:

#!/bin/bash
#$ -S /bin/bash
#$ -V
#$ -cwd
#$ -M email@address.com
#$ -m be
#$ -o STD.out
#$ -e STD.err
#$ -pe shm.pe 16
export OMP_NUM_THREADS=$NSLOTS
./omp_program < input

We are assuming 16 threads in this example; change that number if you are using a different one. Note that it should not exceed the number of cores available on the node you plan to run on, because otherwise the job will not get scheduled.

Distributed (MPI) Jobs

A specific parallel environment needs to be specified, to let the system know which environment and how many processors are going to be used. This is done via the script line:

#$ -pe dist.pe 16

where the number of processors is 16 in this case.

In the standard mpirun command, specify the number of processes through the -np option using the NSLOTS variable; Grid Engine sets NSLOTS according to the -pe directive, so the process count stays consistent with the scheduler's resource allocation.

Here is an MPI sample script that has to be modified to fit your case:

#!/bin/bash
#$ -S /bin/bash
#$ -V
#$ -cwd
#$ -M email@address.com
#$ -m be
#$ -o STD.out
#$ -e STD.err
#$ -pe dist.pe 16
mpirun -np $NSLOTS  ./mpi_program < input

16 should be replaced by the number of MPI processes you actually want to use. Since by default we are operating our cluster nodes "in box", i.e. no internode communication is used, the number of processes should not exceed the number of cores for the node the job is running on. Otherwise the job will not be scheduled.

To run this job you simply type

qsub mpi_job.sh

Note: Presently the dist.pe and shm.pe parallel environments are configured the same way. This means that an MPI job will only be scheduled on a single node. This is done for reasons of efficiency.

Other Commands

Sun Grid Engine allows the user to submit/delete jobs, check job status, and have information about available queues and environments. For most users the knowledge of the following basic commands should be sufficient:

  • qconf: Shows (with its -s... options) the cluster configuration and access permissions; for regular users this access is read-only.
  • qdel: Gives users the ability to delete their own jobs.
  • qhost: Displays status information about Sun Grid Engine execution hosts.
  • qmod: Modifies the status of your jobs (e.g. suspend/resume).
  • qmon: Provides the X-windows GUI interface.
  • qstat: Provides a status listing of all jobs and queues associated with the cluster.
  • qsub: Is the user interface for submitting a job to Grid Engine.

All these commands come with many options and switches and are also available with the GUI qmon. They all have detailed man pages (e.g. "man qsub"), and are documented in the Sun Grid Engine 6 User's Guide.

Environment Setup

When you first log in you will already have the proper setup for using Grid Engine, because Grid Engine is included in the default settings for usepackage. If for some reason it is not part of your environment setup, you can add it by issuing the

use sge6

command. Part of the setup that usepackage performs automatically is to source a setup script located in the directory

/opt/n1ge6/default/common/

You can also "source" those scripts manually:

source /opt/n1ge6/default/common/settings.sh

The setup script modifies your search PATH and sets other environment variables that are required to get Grid Engine running. One of those variables is SGE_ROOT which contains the directory in which the Grid Engine-related programs are located.
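A quick sanity check of this setup can be scripted; the sketch below only reports what it finds, and assumes nothing beyond the SGE_ROOT variable and the qsub command described above.

```shell
#!/bin/bash
# Report whether the Grid Engine environment appears to be set up:
# SGE_ROOT should be defined and qsub should be on the PATH.
if [ -n "$SGE_ROOT" ] && command -v qsub >/dev/null 2>&1; then
    STATUS="Grid Engine environment is ready (SGE_ROOT=$SGE_ROOT)"
else
    STATUS="Grid Engine environment is not set up; try: use sge6"
fi
echo "$STATUS"
```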

Help and documentation

Grid Engine has many more options and possibilities for every kind of job. Here we have given only the basic steps to get started with GE. Detailed documentation is available: first, there is the User's Guide, which should answer almost all of your questions.

For specific commands, the man pages are very comprehensive and should be consulted. For instance "man qstat" explains the meaning of the qstat command options.

The Centre for Advanced Computing offers user support; for questions about this help file and the usage of Grid Engine on our machines contact us.