
Data Storage

Each cluster has one or more shared RAID6 arrays, as well as scratch storage on each node. Users' home folders are on the RAID6 array and are automounted on demand on any node in the cluster.

This RAID6 storage is NOT BACKED UP IN ANY WAY. While RAID6 storage is inherently more reliable than a single hard drive, it is far from failure-proof, and there is always the risk of hardware failure and TOTAL DATA LOSS. This storage is for use during active compute jobs. It is not intended for long-term storage of your results. You should always copy your completed runs back to a local machine in your lab and practice safe backup strategies there. While we do all we can to prevent catastrophic data loss on the clusters, if it happens, complaining "but I forgot to copy my data" will not get your data back.

Reliable (backed-up RAID6) storage is also available for data that is still needed for processing but not immediately needed for active jobs. As of late 2017, this storage is accessible from all cluster nodes, but its performance will not match that of the built-in RAID arrays. The normal shared home folders should be used for active jobs run on the cluster. When a job completes, however, you may move the results to the reliable storage until you need them again. Typically users are permitted only 1-2 TB on the active scratch storage. Allocations on the reliable storage may currently extend to 4-5 TB. You must request an allocation to have access to this storage.

Each node also has ~1 TB of local scratch space (space specific to that node), mounted as /scratch. Data in these folders is available only from the node that holds it. To use this space, simply create a subfolder /scratch/username and use it as you like during the job. When the job is complete, it is polite to clean up any data you created there. The queuing system will not clean up for you. While data in these folders will not be erased immediately after your job completes, there is no guarantee that it will persist for any specific amount of time, and it may be freely deleted as needed once your job is done. If you will access a data file only once during a run, there is little point in using the scratch disk. However, if you will be reading the same file over and over again, or are producing scratch files which exist only during the run, the /scratch folder may dramatically improve the performance of your job, as well as other people's jobs.
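The staging pattern described above could look roughly like the following Python sketch. The function names stage_to_scratch and clean_scratch are hypothetical helpers invented for this illustration, and the example input path is made up. Only the /scratch/username layout comes from the text above.

```python
import getpass
import shutil
from pathlib import Path
from typing import Optional


def stage_to_scratch(source: Path,
                     scratch_root: Path = Path("/scratch"),
                     user: Optional[str] = None) -> Path:
    """Copy `source` into <scratch_root>/<user>/ and return the local copy's path.

    The helper name and signature are assumptions for this sketch.
    """
    user = user or getpass.getuser()
    workdir = scratch_root / user
    # Create /scratch/username if it does not exist yet.
    workdir.mkdir(parents=True, exist_ok=True)
    local_copy = workdir / source.name
    shutil.copy2(source, local_copy)
    return local_copy


def clean_scratch(path: Path) -> None:
    """Delete a file we created in scratch; the queuing system will not do it."""
    path.unlink(missing_ok=True)


# Possible use inside a job script (the input file name is hypothetical):
#
#     local = stage_to_scratch(Path.home() / "run42" / "particles.dat")
#     try:
#         ...  # read `local` as many times as the job needs
#     finally:
#         clean_scratch(local)
```

Wrapping the work in try/finally is one way to make sure the cleanup step runs even when the job fails partway through.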

On both Torus and Prism, the RAID6 is shared via a QDR InfiniBand connection, capable of a total bandwidth over all nodes of 800 MB/sec. That is, if only one node is requesting data, it can get the full 800 MB/sec, but if 44 nodes are all reading at once, each node will get only ~18 MB/sec. If all 704 cores on Prism are trying to read at once, each will get a paltry ~1 MB/sec. The scratch disks are standard desktop hard drives and can read ~150 MB/sec, but this bandwidth is dedicated to each node.
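The arithmetic above can be reproduced with a few lines of Python. This is only a back-of-the-envelope model: it assumes the 800 MB/sec is split evenly among simultaneous readers, which real workloads will only approximate. The function names are made up for this sketch.

```python
# Figures taken from the paragraph above; the even-split model is an assumption.
RAID_TOTAL_MB_PER_SEC = 800.0
SCRATCH_MB_PER_SEC = 150.0


def shared_share(readers: int, total: float = RAID_TOTAL_MB_PER_SEC) -> float:
    """Bandwidth each reader gets if the shared array is divided evenly."""
    if readers < 1:
        raise ValueError("readers must be at least 1")
    return total / readers


def break_even_readers(total: float = RAID_TOTAL_MB_PER_SEC,
                       local: float = SCRATCH_MB_PER_SEC) -> float:
    """Number of simultaneous readers at which the even share of the
    RAID6 drops to the speed of one local scratch disk."""
    return total / local


if __name__ == "__main__":
    for readers in (1, 44, 704):
        print(f"{readers:4d} readers: {shared_share(readers):6.1f} MB/sec each")
    print(f"break-even at about {break_even_readers():.1f} readers")
```

Under this simple model, the local scratch disk becomes the faster option once more than about five nodes are reading from the shared array at the same time.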

One last subtlety which can impact disk performance is caching. Any unused memory on each compute node is used automatically for file caching on a rotating basis. Say a node has 20 GB of free RAM, and you read in your 2 GB data file. This data will automatically be cached in that 20 GB of space, and if you read the same file again, it will be read almost instantaneously. There are all sorts of limitations to this, particularly when reading very large files or when there is little free RAM on the node, but in many situations it will automatically provide much better performance than you would expect given the specifications above.
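One way to see the cache at work is to time two consecutive reads of the same file, as in the Python sketch below. The helper names are hypothetical. Whether the second read is actually faster depends on file size, free RAM on the node, and whether the file was already cached before the first read, so no particular result is implied.

```python
import time
from pathlib import Path
from typing import Tuple


def timed_read(path: Path, chunk_size: int = 1 << 20) -> float:
    """Read the whole file in 1 MiB chunks and return the elapsed seconds."""
    start = time.perf_counter()
    with open(path, "rb") as handle:
        while handle.read(chunk_size):
            pass  # discard the data; only the read time matters here
    return time.perf_counter() - start


def compare_reads(path: Path) -> Tuple[float, float]:
    """Time two back-to-back reads of the same file.

    If the file fits in the node's free RAM, the second read is likely
    to be served from the cache rather than from disk.
    """
    first = timed_read(path)
    second = timed_read(path)
    return first, second


# Possible use (the file name is hypothetical):
#
#     first, second = compare_reads(Path("mydata.dat"))
#     print(f"first read: {first:.2f} s, second read: {second:.2f} s")
```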

Caching also affects benchmarking. If you are trying to estimate how long a job will take by running a small test job over and over, and that job reads the same file each time, you likely aren't going to get accurate numbers, since every run after the first may be served from the cache rather than from disk.
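When benchmarking this way, it can help to report the first run separately from the repeats, so that a cache-assisted speedup is at least visible. The following Python sketch shows one possible approach. The benchmark helper is hypothetical, and note that even the first run may be cached if the file was read recently on that node.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(job: Callable[[], object], repeats: int = 5) -> Dict[str, float]:
    """Run `job` several times, keeping the first timing apart from the rest.

    A large gap between the two numbers suggests the repeats benefited
    from cached input and are not representative of a fresh run.
    """
    if repeats < 2:
        raise ValueError("need at least 2 runs to compare first and repeat timings")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        job()
        timings.append(time.perf_counter() - start)
    return {
        "first_run": timings[0],
        "median_of_repeats": statistics.median(timings[1:]),
    }


# Possible use (run_test_job is a hypothetical function wrapping your test job):
#
#     print(benchmark(run_test_job, repeats=5))
```

Pointing each repeat at a different input file of similar size is another option when such files are available.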