Difference between revisions of "Local Scratch"

From CAC Wiki

Revision as of 19:36, 2 October 2023

Using Node Local Storage

When Slurm starts a job, it creates a temporary directory on each node assigned to the job and stores the full path of that directory in the environment variable TMPDIR, e.g. TMPDIR=/lscratch/slurm-job-6366591.

Because this directory resides on local disk, input and output (I/O) to it is almost always faster than I/O to network storage (/global/project, /global/scratch, or /global/home). Local disk is particularly well suited to frequent small I/O transactions. Any job doing a lot of I/O (which is most jobs!) can expect to run more quickly if it uses $TMPDIR instead of network storage.

The temporary nature of $TMPDIR makes it more troublesome to use than network storage. Input must be copied from network storage to $TMPDIR before it can be read, and output must be copied from $TMPDIR back to network storage before the job ends to preserve it for later use.
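The copy-in, compute, copy-out pattern described above might look roughly like the following job script. This is only a sketch: PROJECT_DIR, WORK_DIR, the file names, and the time limit are hypothetical, and wc stands in for a real application. The fallbacks to mktemp exist only so the script can also be tried outside a Slurm job.

```shell
#!/bin/bash
#SBATCH --time=01:00:00   # assumed time limit; adjust to your job

# PROJECT_DIR is a hypothetical directory on network storage. Outside the
# cluster it falls back to a throwaway directory so the sketch still runs.
PROJECT_DIR="${PROJECT_DIR:-$(mktemp -d)}"

# Demo input only: create a small file if none exists yet.
[ -e "$PROJECT_DIR/input.dat" ] || printf 'a\nb\nc\n' > "$PROJECT_DIR/input.dat"

# Work in a subdirectory of $TMPDIR (set by Slurm inside a job, /tmp otherwise).
WORK_DIR="$(mktemp -d "${TMPDIR:-/tmp}/work.XXXXXX")"

# 1. Stage in: copy the input from network storage to local disk.
cp "$PROJECT_DIR/input.dat" "$WORK_DIR/"

# 2. Compute on local disk (wc is a stand-in for the real application).
wc -l < "$WORK_DIR/input.dat" > "$WORK_DIR/output.dat"

# 3. Stage out: copy the results back before the job ends.
cp "$WORK_DIR/output.dat" "$PROJECT_DIR/"
```

Using a subdirectory of $TMPDIR is a design choice rather than a requirement; it keeps the job's files separate from anything else that writes to the same temporary directory.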


Transferring Data

To read data from $TMPDIR, you must first copy it there. In the simplest case, you can do this with cp or rsync:

cp /global/project/<your groupname>/<username>/input.files.* $TMPDIR/

This may not work if the input is too large, or if it must be read by processes on different nodes.
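For the multi-node case, one possible approach is to have srun launch one copy task per node, so that each node fills its own local directory. The sketch below makes several assumptions: stage_in is a hypothetical helper name, the value of $TMPDIR is assumed to be the same path on every node of the job (as the job-ID-based example path suggests), and the fallback to plain cp is there only so the function can be tried outside a Slurm job.

```shell
# stage_in FILE... : copy the given files into $TMPDIR on every node of the job.
# Hypothetical helper; assumes $TMPDIR expands to the same path on all nodes.
stage_in() {
    if [ -n "${SLURM_JOB_ID:-}" ] && command -v srun >/dev/null 2>&1; then
        # Inside a job: run exactly one cp per allocated node.
        srun --ntasks="$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 \
            cp -- "$@" "$TMPDIR/"
    else
        # Outside a job: a plain local copy.
        cp -- "$@" "${TMPDIR:-/tmp}/"
    fi
}
```

In a job script this would be called as, for example, stage_in /global/project/<your groupname>/<username>/input.files.* before the application starts.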

Output File

Once the application has finished, output data must be copied from $TMPDIR back to permanent storage before the job ends. If a job times out, the last few lines of the job script might not be executed. This can be addressed in three ways:

  • request enough runtime to let the application finish, although we understand that this isn't always possible;
  • write checkpoints to network storage, not to $TMPDIR;
  • write a signal trapping function.
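A signal trapping function, the third option, could be sketched as follows. The sketch assumes the job asks Slurm for an early warning with the sbatch option --signal=B:USR1@120, which sends SIGUSR1 to the batch shell about 120 seconds before the time limit. RESULTS_DIR, WORK_DIR, save_output, on_signal, and the stand-in application are all hypothetical, and the mktemp fallbacks exist only so the script can be tried outside a job.

```shell
#!/bin/bash
#SBATCH --signal=B:USR1@120   # assumed: warn the batch shell 120 s before the limit

# Hypothetical destination on network storage (throwaway directory if unset).
RESULTS_DIR="${RESULTS_DIR:-$(mktemp -d)}"
WORK_DIR="$(mktemp -d "${TMPDIR:-/tmp}/work.XXXXXX")"

# Copy everything written to local disk back to permanent storage.
save_output() {
    cp -r "$WORK_DIR"/. "$RESULTS_DIR"/
}

# Handler: stop the application, save whatever output exists, then exit.
on_signal() {
    kill "$APP_PID" 2>/dev/null
    save_output
    exit 1
}
trap on_signal USR1 TERM

# Stand-in for the real application. It runs in the background because the
# shell only runs a trap handler once the current foreground command returns;
# 'wait' can be interrupted by the signal.
( echo "partial result" > "$WORK_DIR/out.txt"; sleep 1 ) &
APP_PID=$!
wait "$APP_PID"

# Normal completion: save the output as usual.
save_output
```

The 120-second lead time is a guess; it needs to be long enough to copy the job's output back to network storage.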