Moving Data to/from the Clusters

Before you can run jobs on the cluster, you first need to get your data from your local computer onto storage on the cluster. As noted in the other documentation, cluster storage is designed for temporary use while running jobs, not for permanent storage of your data/results, and it is not backed up in any way, so over a period of years there is a real risk of total data loss. In addition to getting your data TO the cluster, it is therefore just as important to copy your results back FROM the cluster to your local computer (on which, hopefully, you also follow proper backup procedures).

There are three tools that can be used to get data to and from the cluster: sftp, scp, and rsync. Of these, rsync is by far the best choice in virtually all situations, because it synchronizes directories and only copies files that have changed.
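For comparison, copying a folder with scp is a one-shot transfer: it re-sends every file each time it is run. A minimal sketch, using the same host and folder names as the Quick Summary below:

scp -r mylocalfolder user@prism:data

rsync, by contrast, compares the two sides first and skips anything that is already present and unchanged, which is what makes the round trip shown below cheap.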

Quick Summary

rsync -avr mylocalfolder user@prism:data
--- run job ---
rsync -avr user@prism:data/mylocalfolder .

The first command copies the entire contents of mylocalfolder to prism as data/mylocalfolder. The second command copies data/mylocalfolder on prism back to your local machine as mylocalfolder, and it ONLY copies files that have changed. This means that if your folder contains a 100 GB data file, the first command will copy it to the cluster, but the second command will only copy the new files created during your job back to your local machine.
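A few notes on the flags: -a (archive mode) preserves permissions and timestamps, -v lists each file as it is transferred, and -r recurses into directories (archive mode already implies this, so it is harmless here). If you want to preview what would be transferred without actually copying anything, rsync's -n (--dry-run) option can be added; this is just a sketch reusing the example names above:

rsync -avrn user@prism:data/mylocalfolder .

The file list is printed, but nothing is copied until the command is run again without -n.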

More Detailed Example

In this example, id is the name of the local computer; the id> prompt indicates commands run locally.

id> pwd
/home/stevel/test

id> ls -l
total 28
drwxr-xr-x   2 stevel stevel  4096 Apr 20 13:41 images

id> ls -l images
total 55340
-rw-r--r-- 1 stevel stevel 28331520 Apr 20 13:41 r_04_03.hdf
-rw-r--r-- 1 stevel stevel 28324008 Apr 20 13:41 r_09_04.hdf

id> rsync -avr images stevel@prism:data
sending incremental file list
images/
images/r_04_03.hdf
images/r_09_04.hdf

sent 56662627 bytes  received 54 bytes  37775120.67 bytes/sec
total size is 56655528  speedup is 1.00

id> ssh prism
Last login: Sun Apr 20 13:03:37 2014 from id.grid.bcm.edu
[stevel@prism ~]$ cd data
[stevel@prism data]$ ls -l
total 4259920
drwxr-xr-x.  2 stevel stevel         42 Apr 20 13:41 images

[stevel@prism data]$ ls -l images
total 55336
-rw-r--r--.  1 stevel stevel 28331520 Apr 20 13:41 r_04_03.hdf
-rw-r--r--.  1 stevel stevel 28324008 Apr 20 13:41 r_09_04.hdf

--- RUN JOB ---

[stevel@prism data]$ pwd
/home/stevel/data

[stevel@prism data]$ ls -l images
total 110672
-rw-r--r--.  1 stevel stevel 28331520 Apr 20 13:41 r_04_03.hdf
-rw-r--r--.  1 stevel stevel 28324008 Apr 20 13:41 r_09_04.hdf
-rw-rw-r--.  1 stevel stevel 28331800 Apr 20 13:45 result1.hdf
-rw-rw-r--.  1 stevel stevel 28324056 Apr 20 13:45 result2.hdf

[stevel@prism data]$ logout
Connection to prism closed.

id> pwd
/home/stevel/test

id> rsync -avr stevel@prism:data/images .
receiving incremental file list
images/
images/result1.hdf
images/result2.hdf
sent 72 bytes  received 56663326 bytes  37775598.67 bytes/sec
total size is 113311638  speedup is 2.00
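
If you only want the new result files back and would rather not have rsync examine the large input files at all, the --exclude option skips anything matching a pattern. The pattern below is just an illustration based on the input file names in this example:

id> rsync -avr --exclude 'r_*.hdf' stevel@prism:data/images .

Since rsync only copies changed files anyway, this mainly saves the time spent checking files you already have, which matters more as the amount of input data grows.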