Moving Data to/from the Clusters

Before you can run jobs on the cluster, you first need to get your data from your local computer onto the cluster's storage. As the other documentation explains, cluster storage is designed for temporary use while running jobs, not for permanent storage of your data/results, and it is not backed up in any way, so over a period of years there is a real risk of total data loss. So, in addition to getting your data TO the cluster, it is equally important to copy your results back FROM the cluster to your local computer (which, ideally, follows proper backup procedures of its own).

There are three tools that can be used to get data to and from the cluster: sftp, scp, and rsync. Of these, rsync is by far the best choice in virtually all situations, because it synchronizes directories and copies only the files that have changed!
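
For comparison, a one-off recursive copy with scp looks like the first command below (hostnames match the summary that follows). Unlike rsync, running scp a second time re-copies every file:

scp -r mylocalfolder user@prism:data        # copies everything, every time
rsync -avr mylocalfolder user@prism:data    # copies only new/changed files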

Quick Summary

rsync -avr mylocalfolder user@prism:data
--- run job ---
rsync -avr user@prism:data/mylocalfolder .

The first command copies the entire contents of mylocalfolder to prism as data/mylocalfolder. The second copies data/mylocalfolder on prism back to your local machine as mylocalfolder, and it will ONLY copy files which have changed. This means that if your folder contains a 100GB data file, the first command will copy it to the cluster, but the second will copy back only the new files created during your job!
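
If you are unsure what a given rsync command will actually transfer, the standard -n (dry run) flag lists the files that would be copied without sending anything, and -P shows progress and keeps partial files so an interrupted transfer of a large file can resume:

rsync -avrn mylocalfolder user@prism:data    # dry run: list what would be sent
rsync -avrP mylocalfolder user@prism:data    # show progress; resume interrupted transfers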

More detailed example

In this example id is the name of the local computer. The id> prompt indicates commands run locally.

id> pwd
/home/stevel/test

id> ls -l
total 28
drwxr-xr-x   2 stevel stevel  4096 Apr 20 13:41 images

id> ls -l images
total 55340
-rw-r--r-- 1 stevel stevel 28331520 Apr 20 13:41 r_04_03.hdf
-rw-r--r-- 1 stevel stevel 28324008 Apr 20 13:41 r_09_04.hdf

id> rsync -avr images stevel@prism:data
sending incremental file list
images/
images/r_04_03.hdf
images/r_09_04.hdf

sent 56662627 bytes  received 54 bytes  37775120.67 bytes/sec
total size is 56655528  speedup is 1.00

id> ssh prism
Last login: Sun Apr 20 13:03:37 2014 from id.grid.bcm.edu
[stevel@prism ~]$ cd data
[stevel@prism data]$ ls -l
total 4259920
drwxr-xr-x.  2 stevel stevel         42 Apr 20 13:41 images

[stevel@prism data]$ ls -l images
total 55336
-rw-r--r--.  1 stevel stevel 28331520 Apr 20 13:41 r_04_03.hdf
-rw-r--r--.  1 stevel stevel 28324008 Apr 20 13:41 r_09_04.hdf

--- RUN JOB ---

[stevel@prism data]$ pwd
/home/stevel/data

[stevel@prism data]$ ls -l images
total 110672
-rw-r--r--.  1 stevel stevel 28331520 Apr 20 13:41 r_04_03.hdf
-rw-r--r--.  1 stevel stevel 28324008 Apr 20 13:41 r_09_04.hdf
-rw-rw-r--.  1 stevel stevel 28331800 Apr 20 13:45 result1.hdf
-rw-rw-r--.  1 stevel stevel 28324056 Apr 20 13:45 result2.hdf

[stevel@prism data]$ logout
Connection to prism closed.

id> pwd
/home/stevel/test

id> rsync -avr stevel@prism:data/images .
receiving incremental file list
images/
images/result1.hdf
images/result2.hdf
sent 72 bytes  received 56663326 bytes  37775598.67 bytes/sec
total size is 113311638  speedup is 2.00
  • The first rsync above copies the images folder from /home/stevel/test on the local machine to /home/stevel/data on the remote machine.

  • The second rsync copies the same folder back to its original location. Note that the two input files are not copied back. Only the two new output files are copied.

  • It is very important that you are in the correct directory when issuing each of the above commands. Pay careful attention to the pwd output and the way the rsync command is structured; otherwise you may end up copying the entire folder to a location you didn't intend. (A related trailing-slash gotcha is sketched after this list.)

  • It is also possible to initiate the rsync on prism itself, pushing data back to your local computer, assuming you have an SSH daemon running on your local machine.

  • All of these copy operations tunnel the data transfer through an SSH connection. This means the data is encrypted in both directions. While this may be a good thing in some situations, it can also slow the transfer down somewhat. If you have very large data files to transfer (100GB+), it is possible to do direct transfers without encryption. Doing this requires special configuration on the server and running an rsync daemon (see the syntax sketch after this list). Please contact the sysop if you think you need to do this.

  • For more details and options on rsync simply type man rsync on the cluster.

  • Note that if you operate this way, you will have a cloned copy of your entire data set on both the cluster and the local machine! This has the advantage of giving you a live backup of your data and results at all times. You are still expected to remove the folder from the cluster when you are done processing, but while you are processing this gives you valuable redundancy.
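
Two of the points above deserve a short sketch. First, the trailing-slash gotcha mentioned in the directory warning: a trailing slash on the rsync source copies the contents of the folder rather than the folder itself. This is standard rsync behavior, shown here with the same names as the detailed example:

rsync -avr images stevel@prism:data     # creates data/images on prism (as in the example)
rsync -avr images/ stevel@prism:data    # copies the CONTENTS of images directly into data

Second, a daemon-based (unencrypted) transfer uses rsync's host::module syntax instead of SSH. The module name below is purely hypothetical; the sysop will provide the real one after setting the daemon up:

rsync -av images prism::bigdata/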
