
Moving Data to/from the Clusters

Before you can run jobs on the cluster, you first need to get your data onto the cluster. The primary storage on each cluster is designed for temporary use while running jobs, not for permanent storage of your data/results, and this cluster storage is not backed up in any way, so over a period of years there is a measurable risk of total data loss. In addition to getting your data TO the cluster, it is therefore important to copy your results back FROM the cluster to your local computer (which should, of course, have its own proper backup procedures).

In addition to the temporary RAID storage on each node, there is also a data archive shared among the clusters. This storage is only available from the head-node, so it cannot be used to run jobs, but unlike the primary storage, this storage is backed up on a daily basis to a second identical storage unit, so the risk of total data loss is much smaller. You can use this to back up results of your computations, or to make space in your account on the primary cluster storage. You will find this storage mounted on the head-node of the clusters as /store1_a-j. To use this storage, you must request an allocation from the system administrator.
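
For example, once an allocation has been set up, archiving a results folder from your primary storage is just a local rsync run on the head-node. The destination below is only illustrative; the administrator will tell you where your actual allocation lives:

rsync -av ~/data/myproject /store1_a/stevel/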

When moving data to/from the cluster, there are three tools which can be used: sftp, scp and rsync. Of these, rsync is by far the best choice in virtually all situations, because it allows you to synchronize directories and only copies files that have changed. That is, if you copy a project folder to the cluster and then run a job that produces a few new files in that folder, rsyncing it back to your local machine will copy only the new files, not the originals.

Quick Summary

rsync -avr mylocalfolder user@prism:data
--- run job ---
rsync -avr user@prism:data/mylocalfolder .

The first command copies the entire contents of mylocalfolder to prism as data/mylocalfolder. The second command copies data/mylocalfolder on prism back to your local machine as mylocalfolder, and it will ONLY copy files which have changed. This means that if your folder has a 100GB data file in it, the first command will copy it to the cluster, but the second command will only copy the new files created during your job back to your local machine!
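
A few notes on these options (standard rsync behavior): -a preserves permissions and timestamps and already implies recursion, and -v lists each file as it is transferred. Two variations are often useful; the folder names below are just placeholders:

rsync -avn mylocalfolder user@prism:data     # -n (--dry-run) shows what would be copied without transferring anything
rsync -av mylocalfolder/ user@prism:data     # a trailing slash on the source copies the contents of mylocalfolder directly into data, rather than creating data/mylocalfolder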

More detailed example

In this example, id is the name of the local computer. The id> prompt indicates commands run locally, while the [stevel@prism ...]$ prompt indicates commands run on the cluster.

id> pwd
/home/stevel/test

id> ls -l
total 28
drwxr-xr-x   2 stevel stevel  4096 Apr 20 13:41 images

id> ls -l images
total 55340
-rw-r--r-- 1 stevel stevel 28331520 Apr 20 13:41 r_04_03.hdf
-rw-r--r-- 1 stevel stevel 28324008 Apr 20 13:41 r_09_04.hdf

id> rsync -avr images stevel@prism:data
sending incremental file list
images/
images/r_04_03.hdf
images/r_09_04.hdf

sent 56662627 bytes  received 54 bytes  37775120.67 bytes/sec
total size is 56655528  speedup is 1.00

id> ssh prism
Last login: Sun Apr 20 13:03:37 2014 from id.grid.bcm.edu
[stevel@prism ~]$ cd data
[stevel@prism data]$ ls -l
total 4259920
drwxr-xr-x.  2 stevel stevel         42 Apr 20 13:41 images

[stevel@prism data]$ ls -l images
total 55336
-rw-r--r--.  1 stevel stevel 28331520 Apr 20 13:41 r_04_03.hdf
-rw-r--r--.  1 stevel stevel 28324008 Apr 20 13:41 r_09_04.hdf

--- RUN JOB ---

[stevel@prism data]$ pwd
/home/stevel/data

[stevel@prism data]$ ls -l images
total 110672
-rw-r--r--.  1 stevel stevel 28331520 Apr 20 13:41 r_04_03.hdf
-rw-r--r--.  1 stevel stevel 28324008 Apr 20 13:41 r_09_04.hdf
-rw-rw-r--.  1 stevel stevel 28331800 Apr 20 13:45 result1.hdf
-rw-rw-r--.  1 stevel stevel 28324056 Apr 20 13:45 result2.hdf

[stevel@prism data]$ logout
Connection to prism closed.

id> pwd
/home/stevel/test

id> rsync -avr stevel@prism:data/images .
receiving incremental file list
images/
images/result1.hdf
images/result2.hdf
sent 72 bytes  received 56663326 bytes  37775598.67 bytes/sec
total size is 113311638  speedup is 2.00
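
For comparison, the initial upload could also be done with scp (or interactively with sftp), for example:

scp -r images stevel@prism:data/

Unlike rsync, scp re-copies every file each time it is run, so using it to bring the results back would also transfer the original images again.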
