Difference between revisions of "About:FileSystems"

From CAC Wiki
(Are there disk quotas?)
 
(29 intermediate revisions by 2 users not shown)
Line 1: Line 1:
=About Our File Systems=
+
=Disk Space and Quotas=
  
This document is intended as a quick reference on basic questions about the file systems used on HPCVL clusters. It includes information on home directories, work and scratch space, disk quota, and our tape library backup system.
+
This document is intended as a quick reference on basic questions about our file systems. It includes information on home directories, work and scratch space, disk quota, and our tape library backup system.
  
 +
{|  style="border-spacing: 8px;"
 +
| valign="top" width="50%" style="padding:1em; border:1px solid #aaaaaa; background-color:#e1eaf1; border-radius:7px" |
 
==The /home file system==
 
==The /home file system==
  
The '''/home''' file system is the main area where our users keep their data. Each user's home directory resides there. It is called /home/hpcXXXX where hpcXXXX denotes the user name. This system is '''shared''', i.e. it is visible from all our servers and the workup/login nodes.
+
The '''/home''' file system is the main area where our users keep their data. Each user's home directory resides there. It is called /global/home/hpcXXXX where hpcXXXX denotes the user name. This system is '''shared''', i.e. it is visible from all our servers and the workup/login nodes.
  
 
Physically, this file system resides in 4 racks containing approximately 1 PB of raw disk space. The configuration is designed to tolerate the failure of multiple disks without the loss of data or disruption in service. This is achieved with the implementation of a spare pool of disks. If one (or several) member drives of an array fails, the global spare drive joins the logical drive and automatically starts to rebuild. Our disk arrays are both read and write cached through flash memory to increase speed. Access speed is homogeneous throughout the file system.
 
Physically, this file system resides in 4 racks containing approximately 1 PB of raw disk space. The configuration is designed to tolerate the failure of multiple disks without the loss of data or disruption in service. This is achieved with the implementation of a spare pool of disks. If one (or several) member drives of an array fails, the global spare drive joins the logical drive and automatically starts to rebuild. Our disk arrays are both read and write cached through flash memory to increase speed. Access speed is homogeneous throughout the file system.
Line 11: Line 13:
 
The file system is '''NFS''' at the front end and based on '''ZFS''' at the back. A high degree of redundancy is built into the system, both on the level of the head nodes (which are dual active/active), and on the level of connectivity (10 Gig Ethernet). Part of the older SAM-QFS based storage systems serve as a backup management system that connects our disk arrays to our tape library. This allows the '''regular backup of files'''.
 
The file system is '''NFS''' at the front end and based on '''ZFS''' at the back. A high degree of redundancy is built into the system, both on the level of the head nodes (which are dual active/active), and on the level of connectivity (10 Gig Ethernet). Part of the older SAM-QFS based storage systems serve as a backup management system that connects our disk arrays to our tape library. This allows the '''regular backup of files'''.
  
==Are there disk quotas?==
+
== Temporary Scratch Space: /scratch ==
  
Disk quotas are active on the /home, /u1, and /scratch sub-sytems. They are enforced automatically. Once a user exceeds a quota, no further data can be written to the file system by that user, making it impossible to log in in some cases. If this happens, you need to contact us and arrange for freeing up disk space.
+
'''Note! Scratch space is not backed up. Any files that are kept in the /global/scratch/hpcXXXX area cannot be retrieved if they are deleted.'''
 +
 
 +
'''Note: Scratch is regularly purged. Any untouched data older than 60 days will be deleted.'''
 +
 
 +
Scratch space is supplied in the '''/global/scratch''' area of the file system. This space is intended for transitory data that are generated during a calculation and are usually deleted shortly after the calculation has finished. However, it is worthwhile to consider keeping other intermediate results that are only needed for a short time on scratch space if there is a danger of exceeding disk quota in /global/home/hpcXXXX or /global/project/hpcgXXXX where hpcgXXXX denotes your CAC groupname.
 +
 
 +
/scratch is subject to a '''quota of 5 TB''' per user. If you require more, please contact us.
 +
 
 +
Note that currently our scratch space is '''global''', i.e. accessible from all nodes. While this implies slower access than local scratch, it allows data to be used from different nodes within a program run, and it simplifies maintenance.
 +
 
 +
A directory '''/global/scratch/hpcXXXX''' is automatically created when you receive an account. By default, it is only accessible by the owner.
 +
 
 +
To use the scratch, you will often have to set an application specific environment variable, for instance for the chemistry code "Gaussian":
 +
 
 +
<pre>$ export GAUSS_SCRDIR=/global/scratch/hpcXXXX</pre>
 +
|}
 +
{|  style="border-spacing: 8px;"
 +
| valign="top" width="50%" style="padding:1em; border:1px solid #aaaaaa; background-color:#f7f7f7; border-radius:7px" |
 +
 
 +
==Disk quotas==
 +
 
 +
Disk quotas are active on the /home, and /scratch filesets. They are enforced automatically. Once a user exceeds a quota, no further data can be written to the file system by that user, making it impossible to log in in some cases. If this happens, you need to contact us and arrange for freeing up disk space.
 +
 
 +
You can check your disk usage by using the ''''myquota'''' command.
  
 
{| class="wikitable" style="float:right; margin-left: 10px;"
 
{| class="wikitable" style="float:right; margin-left: 10px;"
Line 22: Line 47:
 
|'''Backed Up ?'''  
 
|'''Backed Up ?'''  
 
|-
 
|-
| /home  
+
| /global/home/hpcXXXX
| 500 GB
+
| 3 TB
 
| yes  
 
| yes  
 
|-
 
|-
| /u1/work
+
| /global/scratch/hpcXXXX
| 2 TB
+
| yes
+
|-
+
| /scratch  
+
 
| 5 TB  
 
| 5 TB  
 
| no
 
| no
 
|}
 
|}
  
'''Note''': Some of our users currently exceed our disk quotas on one or several of these areas, due to previous negotiated arrangements. For groups that require such larger quotas we can temporarily raise them to allow continuing work. We will contact these users and arrange for moving data to bring them back within the standard quota. '''We do not provide long-term data storage as a default'''.
+
'''Note''': Some of our users currently exceed our disk quotas on one or several of these areas. For groups that require such larger quotas we can '''temporarily''' raise them to allow continuing work. We contact these users regularly to check if such extensions are still required. '''We do not provide long-term data storage as a default'''.
  
If you have special needs concerning disk usage that exceeds the above quotas, you can contact us and make a temporary arrangement for more. However, this arrangement will be periodically reviewed and has to be time-limited.
+
If you have special needs concerning disk usage that exceeds the above quotas, please contact us.
  
Files in /home and /u1 are automatically backed up. Users do not have to do anything for these activities to occur.
+
'''Files in /home are automatically backed up.''' Users do not have to do anything for these activities to occur.
 +
|}
  
==What to do if I need additional disk space?==
+
{|  style="border-spacing: 8px;"
 +
| valign="top" width="50%" style="padding:1em; border:1px solid #aaaaaa; background-color:#e1eaf1; border-radius:7px" |
 +
==Large Datasets==
 +
If you plan to use large datasets (>1 TB) that are relatively common and could be useful to other researchers, CAC can create a shared data repository in /global/project/public.datasets, this will help minimize disk usage.
 +
Note: If the dataset requires licensing or a signed User Agreement, the dataset will be LDAP protected
 +
|}
  
Experience shows that some users need disk space exceeding the 500 GB disk quota for their home directory, sometimes over an extended time period. Examples would be the trajectory files of a molecular dynamics run, or the results of large fluid dynamics simulations. Files that contain information which is occasionally accessed may be moved away from the /home file system into an '''alternative area denoted /u1'''. This file system is subject to considerably increased disk quota ('''2 TB per user''').
+
{|  style="border-spacing: 8px;"
 +
| valign="top" width="50%" style="padding:1em; border:1px solid #aaaaaa; background-color:#e1eaf1; border-radius:7px" |
 +
==Backups==
  
Data residing in /u1 are '''backed up''' by default. When you receive a user account, a directory '''/u1/work/hpcXXXX''' is automatically created, and access is restricted to to the owner. The structure of the files and directories below this is left to the user.
+
By default, we maintain backups for the purpose of securing user data (disaster recovery) only, not for permanent storage or external use. This means that it is the responsibility of the individual user to remove data that are to be kept permanently from the cluster and store them on external media, such as disks, tapes, or DVD's.
  
==What do I do if I need a lot of disk space?==
+
User data that reside in the '''/home file system are backed up locally and off-site'''. This happens automatically.
  
If you need more disk space than the disk quota on /home and /u1 allows, you should consider the following options, preferably in that order:
+
The backup cycle is the same as for /home. '''No user data residing outside these directories are backed up. This holds specifically for /tmp and /scratch'''.
  
* The easiest way to keep data safe and within easy access is to keep them in /home. Our disk quota are comparably high, and the presence of more than 500 GB of data should lead a user to consider "cleaning up" the /home area. We strongly encourage all our users to '''download their permanent data and back them up on external media''' such as external drives and arrays, Tapes, or DVD. This is the safest way to make sure your data cannot be lost. In many cases the only reason why disk space usage approaches the limit is that unprocessed data are kept around unnecessarily. We do not supply individual backups or archiving facilities. This is the responsibility of the user.
+
==How to get lost data back==
* If the data are only '''temporary''' (eg, they are results that serve as input for other computations, but can then be discarded), they might be written out on '''scratch space'''. See this question below for further details about scratch space. Data on scratch space '''will not be archived or backed up''', and and should be removed as soon as they are not needed anymore. This is only a solution if the data will be removed within a reasonably short time period.
+
* If the data are needed for a longer period of time, but can eventually be moved off the system or deleted, they can be brought to the afore-mentioned '''/u1 file system'''. Data of this kind will be backed up for the (rather unlikely) event of a multiple disk failure. We recommend to consider this solution for data that need to be available for several weeks or months, but cannot be kept in the /home area to avoid exceeding the quota.
+
* Finally, if you need to keep large amounts of data that are accessed rather frequently, exceed the disk quota, and need to be backed up, you need to contact us and '''make special arrangements'''.
+
  
==How is scratch space handled on the cluster?==
+
[[Contacts:UserSupport|'''contact us''']]. The system administrators will retrieve data from the regular backup. Keep in mind that changes that you made to the data before the loss occurred might be lost since the copy of your file may be outdated. Likewise, if you made accidental changes to your files, you might be able to revert to an earlier version by retrieval from a backup copy. However, if the changes were already committed the earlier file could be lost. To avoid such problems, consider a version control system.
  
Scratch space is supplied in the '''/scratch''' area of the file system. This space is intended for transitory data that are generated during a calculation and are usually deleted shortly after the calculation has finished. However, it is worthwhile to consider keeping other intermediate results that are only needed for a short time on scratch space if there is a danger of exceeding disk quota in /home or /u1.
+
If the loss is the consequence of a general disk failure, the part of the file system that was affected will be restored from safety backups, and it is not necessary (nor useful) to contact the administrator for the retrieval of individual files. In that case, you will have to wait until the file system is restored to normal. This may take several days in the case of a severe failure.
 +
|}
 +
{|  style="border-spacing: 8px;"
 +
| valign="top" width="50%" style="padding:1em; border:1px solid #aaaaaa; background-color:#f7f7f7; border-radius:7px" |
 +
== What to do if a lot of disk space is needed ==
  
/scratch is subject to a '''quota of 5 TB''' per user to avoid sudden overflow on a disk array. If you require more, please [[Contacts:UserSupport|contact us]].
+
If you need more disk space than the disk quota on /home and /u1 allows, you should consider the following options, preferably in that order:
 
+
Note that our scratch space is '''global''', i.e. accessible from all nodes. While this implies somewhat slower access than local scratch, it allows data to be used from different nodes within a program run (e.g. of an MPI program), and it simplifies maintenance.
+
  
Scratch space is accessed via the /scratch directory. A directory '''/scratch/hpcXXXX''' is automatically created when you receive and HPCVL account. By default, it is only accessible by the owner.
+
* '''Clean up''' the /home. Delete data that are no longer needed.
 
+
* '''Download permanent data''' and back them up on external media, such as external drives and arrays, Tapes, or DVD. This is the safest way to make sure your data cannot be lost. We do not supply individual backups or archiving facilities by default. This is the responsibility of the user.
To use the scratch, you will often have to set an application specific environment variable, which can then be given the name /scratch/hpcXXXX to work on all nodes, eg. for the quantum-chemistry code "Gaussian", one would set (bash):
+
* If the data are '''temporary''' (eg, they can then be discarded eventually), they might be written out on '''/scratch'''. Data on scratch space '''are not backed up''', and should be removed as soon as they are not needed any more.  
: export GAUSS_SCRDIR=/scratch/hpcXXXX
+
* Finally, if you need to keep large amounts of data that are accessed rather frequently, exceed the disk quota, and need to be backed up, you need to contact us and '''make special arrangements'''.
Note that the above setting is automatically applied when you issue a
+
|}
: use g03
+
command.
+
 
+
==Which files are backed up, which aren't?==
+
 
+
By default, HPCVL maintains backups for the purpose of securing user data (disaster recovery) only, not for permanent storage or external use. This means that it is the responsibility of the individual user to remove data that are to be kept permanently from the cluster and store them on external media, such as disks, tapes, or DVD's.
+
 
+
User data that reside in the '''/home file system are backed up''' on a short cycle (in the order of days) locally to our '''L1400 tape library''', and '''off-site'''. This happens automatically as soon as a file appears on the files system. Whenever a file changes, the change will be committed to the backup as well.
+
 
+
Data that reside in '''/u1 are also backed up'''. The backup cycle is the same as for /home. '''No user data residing outside these directories are backed up. This holds specifically for /tmp and /scratch'''.
+
 
+
==I lost data, how can I get them back?==
+
 
+
The general answer is [[Contacts:UserSupport|'''contact us''']]. The system administrators may be able to retrieve the lost data from the regular backup on the L1400 tape library. Keep in mind that changes that you made to the data before the loss occurred might be lost since the copy of your file may be outdated. Likewise, if you made accidental changes to your files, you might be able to revert to an earlier version by retrieval from a backup copy. However, if the changes were already committed the earlier file could be lost. To avoid such problems, consider a version control system.
+
 
+
If the loss is the consequence of a general disk failure, the part of the file system that was affected will be restored from safety backups, and it is not necessary (nor useful) to contact the administrator for the retrieval of individual files. In that case, you will have to wait until the file system is restored to normal. This may take several days in the case of a severe failure.
+

Latest revision as of 14:50, 4 March 2024

Disk Space and Quotas

This document is a quick reference for basic questions about our file systems. It covers home directories, work and scratch space, disk quotas, and our tape library backup system.

The /home file system

The /home file system is the main area where our users keep their data. Each user's home directory resides there, at /global/home/hpcXXXX, where hpcXXXX denotes the user name. This file system is shared, i.e. it is visible from all our servers and the workup/login nodes.

Physically, this file system resides in 4 racks containing approximately 1 PB of raw disk space. The configuration is designed to tolerate the failure of multiple disks without loss of data or disruption of service. This is achieved with a pool of spare disks: if one or several member drives of an array fail, a global spare drive joins the logical drive and a rebuild starts automatically. Our disk arrays are both read and write cached through flash memory to increase speed. Access speed is homogeneous throughout the file system.

The file system is NFS at the front end and based on ZFS at the back. A high degree of redundancy is built into the system, both at the level of the head nodes (which are dual active/active) and at the level of connectivity (10 Gig Ethernet). Part of the older SAM-QFS-based storage system serves as a backup management system that connects our disk arrays to our tape library. This allows the regular backup of files.

Temporary Scratch Space: /scratch

Note! Scratch space is not backed up. Any files that are kept in the /global/scratch/hpcXXXX area cannot be retrieved if they are deleted.

Note: Scratch is regularly purged. Any untouched data older than 60 days will be deleted.
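
To see which files may be affected before a purge, a sketch like the following can help. It assumes that "untouched" refers to the modification time, which the note above does not state. The function name list_stale_files and the 50-day default (chosen to leave a margin before the 60-day limit) are illustrative only.

```shell
# List regular files under a directory whose modification time is older
# than a given number of days (default: 50).
# Assumption: the purge is based on modification time. If it is based on
# access time instead, replace -mtime with -atime.
list_stale_files() {
    find "$1" -type f -mtime +"${2:-50}" -print
}

# Example call (hypothetical path; substitute your own user name):
#   list_stale_files /global/scratch/hpcXXXX 50
```

Files that are still needed should be copied elsewhere rather than merely touched, since scratch is not backed up.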

Scratch space is supplied in the /global/scratch area of the file system. This space is intended for transitory data that are generated during a calculation and are usually deleted shortly after the calculation has finished. However, it is worth considering scratch space for other intermediate results that are only needed for a short time if there is a danger of exceeding the disk quota in /global/home/hpcXXXX or /global/project/hpcgXXXX, where hpcgXXXX denotes your CAC group name.

/scratch is subject to a quota of 5 TB per user. If you require more, please contact us.

Note that currently our scratch space is global, i.e. accessible from all nodes. While this implies slower access than local scratch, it allows data to be used from different nodes within a program run, and it simplifies maintenance.

A directory /global/scratch/hpcXXXX is automatically created when you receive an account. By default, it is only accessible by the owner.

To use scratch space, you will often have to set an application-specific environment variable, for instance for the chemistry code "Gaussian":

$ export GAUSS_SCRDIR=/global/scratch/hpcXXXX
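
When several calculations run at the same time, it can be useful to give each run its own subdirectory instead of sharing the top-level scratch directory. The following is a minimal sketch of that idea, not a site-provided procedure. The variable SCRATCH_ROOT, the "gaussian.XXXXXX" naming pattern and the fallback to a temporary directory are assumptions made for illustration; only GAUSS_SCRDIR and the /global/scratch/<user> layout come from the text above.

```shell
# Create a private per-run subdirectory in scratch and point Gaussian at it.
# Assumption: SCRATCH_ROOT is an illustrative variable, defaulting to the
# per-user scratch directory described above.
SCRATCH_ROOT="${SCRATCH_ROOT:-/global/scratch/$(id -un)}"

# Fall back to a temporary directory when the scratch area is not
# available, e.g. when trying this sketch away from the cluster.
if [ ! -d "$SCRATCH_ROOT" ] || [ ! -w "$SCRATCH_ROOT" ]; then
    SCRATCH_ROOT="${TMPDIR:-/tmp}"
fi

# mktemp -d creates a uniquely named directory accessible only by the owner.
GAUSS_SCRDIR="$(mktemp -d "$SCRATCH_ROOT/gaussian.XXXXXX")"
export GAUSS_SCRDIR

# Remove the run directory when the shell exits, so that scratch does not
# fill up with leftovers.
trap 'rm -rf "$GAUSS_SCRDIR"' EXIT
```

Placed at the top of a job script, this keeps the scratch files of concurrent runs apart and cleans up after each run.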

Disk quotas

Disk quotas are active on the /home and /scratch filesets. They are enforced automatically. Once a user exceeds a quota, that user can write no further data to the file system, which in some cases makes it impossible to log in. If this happens, you need to contact us to arrange for freeing up disk space.

You can check your disk usage by using the 'myquota' command.
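
The output format of myquota is not described here. To find out which subdirectories account for most of the usage, the standard du utility can be combined with sort, as in the following sketch. The function name largest_dirs is illustrative and not a command provided on the cluster.

```shell
# Print the size of a directory and of its immediate subdirectories,
# largest first. Sizes are in kilobytes (du -k).
# $1: directory to inspect, $2: number of lines to show (default: 10).
largest_dirs() {
    du -k -d 1 "$1" 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# Example call:
#   largest_dirs "$HOME" 5
```

The first line of the output is the total for the directory itself, followed by its subdirectories in descending order of size.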

Disk Quota on User File Systems

File System               Default Quota   Backed Up?
/global/home/hpcXXXX      3 TB            yes
/global/scratch/hpcXXXX   5 TB            no

Note: Some of our users currently exceed our disk quotas in one or several of these areas. For groups that require larger quotas, we can temporarily raise them so that work can continue. We contact these users regularly to check whether such extensions are still required. We do not provide long-term data storage by default.

If you have special needs concerning disk usage that exceeds the above quotas, please contact us.

Files in /home are automatically backed up. Users do not have to do anything for this to happen.

Large Datasets

If you plan to use large datasets (>1 TB) that are relatively common and could be useful to other researchers, CAC can create a shared data repository in /global/project/public.datasets. This helps minimize disk usage. Note: If the dataset requires licensing or a signed User Agreement, the dataset will be LDAP protected.

Backups

By default, we maintain backups for the purpose of securing user data (disaster recovery) only, not for permanent storage or external use. This means that it is the responsibility of the individual user to move data that are to be kept permanently off the cluster and store them on external media, such as disks, tapes, or DVDs.

User data that reside in the /home file system are backed up locally and off-site. This happens automatically.

No user data residing outside /home are backed up. This holds specifically for /tmp and /scratch.

How to get lost data back

If you have lost data, contact us. The system administrators will retrieve the data from the regular backup. Keep in mind that changes you made shortly before the loss may be missing, since the backup copy of your file may be outdated. Likewise, if you made accidental changes to your files, you might be able to revert to an earlier version by retrieving it from a backup copy. However, if the changes were already committed to the backup, the earlier version could be lost. To avoid such problems, consider using a version control system.

If the loss is the consequence of a general disk failure, the part of the file system that was affected will be restored from safety backups, and it is not necessary (nor useful) to contact the administrator for the retrieval of individual files. In that case, you will have to wait until the file system is restored to normal. This may take several days in the case of a severe failure.

What to do if a lot of disk space is needed

If you need more disk space than the disk quota on /home allows, you should consider the following options, preferably in this order:

  • Clean up /home. Delete data that are no longer needed.
  • Download permanent data and back them up on external media, such as external drives and arrays, tapes, or DVDs. This is the safest way to make sure your data cannot be lost. We do not supply individual backups or archiving facilities by default. This is the responsibility of the user.
  • If the data are temporary (e.g. they can eventually be discarded), they can be written to /scratch. Data on scratch space are not backed up, and should be removed as soon as they are no longer needed.
  • Finally, if you need to keep large amounts of data that are accessed rather frequently, exceed the disk quota, and need to be backed up, you need to contact us and make special arrangements.
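
Before downloading permanent data, it is often convenient to pack a directory into a single archive and check that the archive is readable before deleting the original. The following is one possible sketch using the standard tar utility. The helper name archive_dir and the file names in the example are illustrative and not part of any cluster tooling.

```shell
# Pack a directory into a compressed archive and verify that the archive
# can be read back. Returns non-zero if either step fails.
# archive_dir is an illustrative helper, not a cluster-provided command.
archive_dir() {
    src="$1"    # directory to archive
    out="$2"    # archive file to create, e.g. results.tar.gz
    tar -czf "$out" -C "$(dirname "$src")" "$(basename "$src")" &&
        tar -tzf "$out" >/dev/null
}

# Example call (hypothetical directory and archive names):
#   archive_dir "$HOME/old_results" "$HOME/old_results.tar.gz"
```

The resulting archive can then be copied to external storage with a tool such as scp or rsync, and the original directory removed once the copy has been verified.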