
Moving Data to/from the Clusters

Before you can run jobs on the cluster, you first need to get your data onto it. The primary storage on each cluster is designed for temporary use while jobs are running, not for permanent storage of your data or results. This storage is not backed up in any way, so over a period of years there is a measurable risk of total data loss. In addition to getting your data TO the cluster, it is therefore just as important to copy your results back FROM the cluster to your local computer (which, ideally, is itself covered by proper backup procedures).

In addition to the temporary RAID storage on each node, there is also a data archive shared among the clusters. This storage is only available from the head-node, so it cannot be used to run jobs. Unlike the primary storage, however, it is backed up daily to a second identical storage unit, so the risk of total data loss is much smaller. You can use it to back up the results of your computations, or to free up space in your account on the primary cluster storage. You will find this storage mounted on the head-node of the clusters as /store1_a-j. To use it, you must request an allocation from the system administrator.

When moving data to/from the cluster, there are three tools you can use: sftp, scp and rsync. Of these, rsync is by far the best choice in virtually all situations, because it synchronizes directories and only copies files that have changed. That is, if you copy a project folder to the cluster and then run a job that produces a few new files in that folder, rsyncing it back to your local machine will copy only the new files, not the originals.

Quick Summary

rsync -avr mylocalfolder user@prism:data
--- run job ---
rsync -avr user@prism:data/mylocalfolder .

The first command copies the entire contents of mylocalfolder to prism as data/mylocalfolder. The second command copies data/mylocalfolder on prism back into the current directory on your local machine as mylocalfolder. It will ONLY copy files which have changed. This means that if your folder has a 100 GB data file in it, the first command will copy it to the cluster, but the second command will only copy the new files created during your job back to the local machine! (The -r flag is redundant here, since -a already implies it, but it is harmless.)

More detailed example

In this example id is the name of the local computer. The id> prompt indicates commands run locally.

id> pwd
/home/stevel/test

id> ls -l
total 28
drwxr-xr-x   2 stevel stevel  4096 Apr 20 13:41 images

id> ls -l images
total 55340
-rw-r--r-- 1 stevel stevel 28331520 Apr 20 13:41 r_04_03.hdf
-rw-r--r-- 1 stevel stevel 28324008 Apr 20 13:41 r_09_04.hdf

id> rsync -avr images stevel@prism:data
sending incremental file list
images/
images/r_04_03.hdf
images/r_09_04.hdf

sent 56662627 bytes  received 54 bytes  37775120.67 bytes/sec
total size is 56655528  speedup is 1.00

id> ssh prism
Last login: Sun Apr 20 13:03:37 2014 from id.grid.bcm.edu
[stevel@prism ~]$ cd data
[stevel@prism data]$ ls -l
total 4259920
drwxr-xr-x.  2 stevel stevel         42 Apr 20 13:41 images

[stevel@prism data]$ ls -l images
total 55336
-rw-r--r--.  1 stevel stevel 28331520 Apr 20 13:41 r_04_03.hdf
-rw-r--r--.  1 stevel stevel 28324008 Apr 20 13:41 r_09_04.hdf

--- RUN JOB ---

[stevel@prism data]$ pwd
/home/stevel/data

[stevel@prism data]$ ls -l images
total 110672
-rw-r--r--.  1 stevel stevel 28331520 Apr 20 13:41 r_04_03.hdf
-rw-r--r--.  1 stevel stevel 28324008 Apr 20 13:41 r_09_04.hdf
-rw-rw-r--.  1 stevel stevel 28331800 Apr 20 13:45 result1.hdf
-rw-rw-r--.  1 stevel stevel 28324056 Apr 20 13:45 result2.hdf

[stevel@prism data]$ logout
Connection to prism closed.

id> pwd
/home/stevel/test

id> rsync -avr stevel@prism:data/images .
receiving incremental file list
images/
images/result1.hdf
images/result2.hdf

sent 72 bytes  received 56663326 bytes  37775598.67 bytes/sec
total size is 113311638  speedup is 2.00

The first rsync above copies the images folder from /home/stevel/test on the local machine to /home/stevel/data on the remote machine.
The second rsync copies the same folder back to its original location. Note that the two input files are not copied back. Only the two new output files are copied.
It is very important that you are in the correct directory when issuing each of the above commands. Pay careful attention to the pwd output and the way the rsync command is structured. Otherwise you may end up copying the entire folder to a location you did not intend.
It is also possible to run rsync on prism and push the data back to your local computer, provided your local machine is running an SSH daemon.
All of these copy operations tunnel the data transfer through an SSH connection, which means the data is encrypted in both directions. While this may be a good thing in some situations, it can also slow the transfer down somewhat. If you have very large data files to transfer (100 GB or more), it is possible to do direct transfers without encryption. Doing this requires special configuration on the server and a running rsync daemon. Please contact the system administrator if you think you need to do this.
For more details and options on rsync, simply type man rsync on the cluster.
Note that working this way leaves you with a complete copy of your entire data set on both the cluster and the local machine! This has the advantage of giving you a live backup of your data and results at all times. You are still expected to remove the folder from the cluster when you are done processing, but while you are processing, this gives you valuable redundancy.