Local Scratch

From CAC Wiki

Latest revision as of 19:53, 2 October 2023

Using Node Local Storage

When Slurm starts a job, it creates a temporary directory on each node assigned to the job and stores the full path of that directory in the environment variable TMPDIR, e.g. TMPDIR=/lscratch/slurm-job-6366591.

Because this directory resides on local disk, input and output (I/O) to it is almost always faster than I/O to network storage (/global/project, /global/scratch, or /global/home). In particular, local disk handles frequent small I/O transactions better than network storage does. Any job doing a lot of input and output (which is most jobs!) can expect to run more quickly if it uses $TMPDIR instead of network storage.

The temporary character of $TMPDIR makes it more troublesome to use than network storage. Input must be copied from network storage to $TMPDIR before it can be read, and output must be copied from $TMPDIR back to network storage before the job ends to preserve it for later use. Note: Slurm automatically cleans up $TMPDIR once the job has completed, so anything left there is lost.

Transferring Data

In order to read data from $TMPDIR, you must first copy the data there. In the simplest case, you can do this with cp or rsync:

cp /global/project/<your groupname>/<username>/input.files.* $TMPDIR/

This may not work if the input is too large to fit on the local disk, or if it must be read by processes on different nodes, since each node has its own $TMPDIR.
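To guard against the "too large" case, a job script could check the free space in $TMPDIR before copying. The following is a minimal sketch, not a site-provided tool: stage_in is a hypothetical helper name, and it only handles plain files on the node where the script runs.

```shell
#!/bin/bash
# Sketch only: stage_in is a hypothetical helper, not part of Slurm.

# stage_in FILE...
# Copy the given files to $TMPDIR, but only if they fit in the free space there.
stage_in() {
    # Total size of the inputs, in kilobytes.
    needed_kb=$(du -sk "$@" | awk '{ total += $1 } END { print total + 0 }')
    # Free space on the file system holding $TMPDIR, in kilobytes
    # (df -P guarantees one header line followed by one data line).
    free_kb=$(df -Pk "$TMPDIR" | awk 'NR == 2 { print $4 }')

    if [ "$needed_kb" -ge "$free_kb" ]; then
        echo "stage_in: need ${needed_kb} KB but only ${free_kb} KB free in $TMPDIR" >&2
        return 1
    fi
    cp "$@" "$TMPDIR"/
}
```

In a job script, this would replace the plain cp command, with the same file arguments, followed by "|| exit 1" so that the job stops early instead of running without its input.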

Output File

Once the application has finished, output data must be copied from $TMPDIR back to permanent storage before the job ends. If a job times out, the last few lines of the job script might not be executed, and that output is lost. This can be addressed in three ways:

  • request enough runtime to let the application finish, although we understand that this isn't always possible;
  • write checkpoints to network storage, not to $TMPDIR;
  • write a signal trapping function.
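
The third option, a signal trapping function, could look roughly like the sketch below. It assumes that the application writes its results under $TMPDIR/output and that they belong in OUTPUT_DIR on network storage. Those two locations, and the function names save_output and run_with_trap, are hypothetical. The application is started in the background because bash delays a trap handler until the current foreground command has returned, whereas the wait builtin is interrupted immediately.

```shell
#!/bin/bash
# Sketch only: OUTPUT_DIR, $TMPDIR/output and the function names are hypothetical.

# Where results should end up on network storage.
OUTPUT_DIR="${OUTPUT_DIR:-$HOME/job-output}"

# Copy everything under $TMPDIR/output back to network storage.
save_output() {
    mkdir -p "$OUTPUT_DIR"
    if [ -d "$TMPDIR/output" ]; then
        cp -r "$TMPDIR/output/." "$OUTPUT_DIR/"
    fi
}

# run_with_trap COMMAND [ARGS...]
# Run the command, and save the output whether it finishes or a signal arrives.
run_with_trap() {
    # On SIGTERM or SIGUSR1: stop the application, rescue the output, and exit.
    trap 'kill "$app_pid" 2>/dev/null; save_output; exit 1' TERM USR1

    "$@" &              # run in the background ...
    app_pid=$!
    status=0
    wait "$app_pid" || status=$?   # ... so that a signal can interrupt this wait

    trap - TERM USR1    # normal completion: restore default signal handling
    save_output
    return "$status"
}
```

In a job script, the application would then be started as run_with_trap ./myapplication followed by its arguments. For the handler to run before the time limit, Slurm has to deliver a signal to the batch shell early enough. The sbatch option --signal (for example --signal=B:USR1@120, which asks for SIGUSR1 120 seconds before the limit) is intended for this, but how much warning a job needs depends on the amount of output to copy.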

Sample Job

A minimal Slurm job script looks like this:

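#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --account=def-hpcggroup
# Stage the input onto node-local disk.
cp "$HOME/myInputFile.txt" "$TMPDIR/"
# Read input from, and write output to, $TMPDIR.
./myapplication "$TMPDIR/myInputFile.txt" > "$TMPDIR/myOutputFile.txt"
# Copy the output back to network storage before the job ends.
cp "$TMPDIR/myOutputFile.txt" "$HOME/"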