
This page is obsolete

Using EMAN2 with shared filesystems (like NFS)

In many large groups, computers are configured so that users can log in to multiple machines, with each user's home directory cross-mounted from wherever it lives. The same is true on Linux clusters, where the user's home directory is typically cross-mounted on all of the compute nodes. This complicates any image processing in which more than one computer may be in use.

The problem

NFS (and other shared filesystems) do not generally guarantee that two different computers will see identical versions of the same file at all times. That is, if you write to a file called x.txt on workstationA, then read x.txt on workstationB, workstationB may not see the entire x.txt file until some undefined amount of time after you finish writing. Usually this delay is small, but it does exist. This problem can be observed on any image processing system, not just EMAN1/2. Say you are appending images to the end of a stack on 3 different computers at the same time. Most of the images will be appended correctly, but there is a small, nonzero risk of image data being written out of phase, causing odd shifts and jumbling of the images. For systems like embedded databases (the BDB system used in EMAN2) there is ZERO tolerance for writing information to a file and then not seeing exactly what you wrote (unreliable storage).
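The stack-append hazard described above can be illustrated with a toy simulation. This is a minimal sketch, not EMAN code, and it does not touch NFS at all: a bytearray stands in for the shared file, and each "client" appends at the end-of-file offset it last saw. The names append_image and run_demo, and the 4-byte "images", are made up for illustration.

```python
# Simulation only: no real NFS is involved. Each "client" remembers the file
# length it last saw and appends at that offset, as a client with a stale
# view of the file might.

def append_image(shared: bytearray, offset_seen: int, image: bytes) -> None:
    """Write image at the offset this client believes is the end of the file."""
    end = offset_seen + len(image)
    if len(shared) < end:
        shared.extend(b"\0" * (end - len(shared)))  # grow the "file" if needed
    shared[offset_seen:end] = image


def run_demo() -> bytes:
    stack = bytearray(b"AAAA")   # one toy image already in the stack
    seen_by_a = len(stack)       # both clients check the length...
    seen_by_b = len(stack)       # ...before either one has written
    append_image(stack, seen_by_a, b"BBBB")
    append_image(stack, seen_by_b, b"CCCC")  # lands on top of client A's image
    return bytes(stack)


if __name__ == "__main__":
    print(run_demo())
```

In this simulation the stack ends up holding two images instead of three, because the second client overwrites the first client's image. With real image data and partially overlapping writes, the result would be the shifted or jumbled images described above.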

The solution

EMAN1's solution

In EMAN1, the user was expected to be careful and ensure that the same file was never written from multiple nodes at the same time. For situations like parallel computing on clusters, this was handled by a fileserver built into the 'runpar' command. That is, when runpar was used to spawn jobs, it also took care of all reading and writing of files. Whenever a node needed image data, instead of reading it over NFS, it would ask runpar, on the first node in the job, for the data. If this mechanism was intentionally bypassed in EMAN1, you would often see corrupted images as a result.
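The general idea of routing all file access through a single process can be sketched as follows. This is not runpar's actual implementation; it is a minimal illustration using threads and queues within one Python process, with a dict standing in for the image files. All names (file_server, request, run_demo) are hypothetical.

```python
import queue
import threading


def file_server(requests: queue.Queue, images: dict) -> None:
    """The only code that touches the image store; handles one request at a time."""
    while True:
        item = requests.get()
        if item is None:  # sentinel: shut down the server
            break
        op, key, payload, reply = item
        if op == "write":
            images[key] = payload
            reply.put(True)
        else:  # "read"
            reply.put(images.get(key))


def request(requests: queue.Queue, op: str, key: int, payload=None):
    """Called by a worker: send a request and wait for the server's answer."""
    reply = queue.Queue(maxsize=1)
    requests.put((op, key, payload, reply))
    return reply.get()


def run_demo(n_workers: int = 4) -> dict:
    requests = queue.Queue()
    images = {}
    server = threading.Thread(target=file_server, args=(requests, images))
    server.start()
    workers = [
        threading.Thread(target=request, args=(requests, "write", i, f"image-{i}"))
        for i in range(n_workers)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    first = request(requests, "read", 0)  # read back through the same server
    requests.put(None)
    server.join()
    return {"count": len(images), "first": first}


if __name__ == "__main__":
    print(run_demo())
```

Because only the server thread ever modifies the store, the workers cannot interleave their writes, whatever order their requests arrive in. The cost is that all I/O is serialized through one place.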

EMAN2's solution

In EMAN2, in addition to the image data itself, we also have to consider the database cache, which is used to give BDB access much better performance and to enhance reliability. The rules are as follows (they are also partly described in EMAN2/DatabaseWarning):

  • You may put the EMAN2 binaries anywhere you like, though there are a number of advantages in having each user have their own copy in their home directory. This installation can safely be on NFS mounted filesystems.
  • The EMAN2 BDB cache is normally automatically put into /tmp, which is almost always a local filesystem. It ABSOLUTELY MUST NOT be on an NFS mounted filesystem. It must be on a local hard drive on the machine you are presently using. Diskless workstations (which are very uncommon) may be a problem.
  • EMAN2 project directories MAY be on NFS mounted filesystems, but this will require caution on the user's part. If this is done, the following additional precautions must be taken:
    • Say you have two workstations, A and B. If you have been working on A and wish to switch to B, you should exit all EMAN2 jobs on A and run ' -c' before starting work on B.
    • If you fail to do this, the cache on machine A may get out of sync with the cache on machine B. The bottom line: only one machine should be accessing the project at any given time.
    • One caveat - If you are working in project directory X on machine A, it should be safe to work on project directory Y simultaneously on machine B, as long as the two machines never try to access the same databases at the same time.
  • Clusters
    • When running a job on a cluster, you should use either MPI or Distributed Computing. Both of these mechanisms will handle the BDB issues for you while running a job.

    • HOWEVER, while a job is running on the nodes of a large cluster, the first cluster node is granted BDB access for the duration of the job. You must not modify any project data from the head-node while the job is still running.
    • It is relatively safe to use the file browser to look at files and monitor the progress of a job while it is running (read-only access). However, you may run into cases where the head-node does not see the most recent version of a specific file because of the cache.
      • If you run into problems on the head-node while the job is still running, try ' -c'; it will likely clear up the issue.
      • If you accidentally write data to one of the databases in an actively running project from the head node, there is a possibility of irretrievably corrupting one or more of the databases. Be VERY careful.
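Since the rules above depend on knowing whether a given directory is local or NFS-mounted, a quick check can help. The following is a minimal sketch, assuming a Linux system where the mount table is available in /proc/mounts; it is not part of EMAN2, and the function names and sample paths are hypothetical. It does not resolve symlinks, and mount points containing spaces (which /proc/mounts escapes) are not handled.

```python
import posixpath


def filesystem_type(path: str, mounts_text: str) -> str:
    """Return the filesystem type of the longest mount point containing path."""
    path = posixpath.normpath(path)
    best_point, best_type = "", "unknown"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Compare with trailing slashes so /home does not match /homework.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_point):
            best_point, best_type = mount_point, fs_type
    return best_type


def is_on_nfs(path: str, mounts_text=None) -> bool:
    """True if path appears to live on an NFS mount (types nfs, nfs4, ...)."""
    if mounts_text is None:
        with open("/proc/mounts") as handle:  # Linux-specific assumption
            mounts_text = handle.read()
    return filesystem_type(path, mounts_text).startswith("nfs")


# Sample mount table in /proc/mounts format, for illustration only.
SAMPLE_MOUNTS = (
    "/dev/sda1 / ext4 rw 0 0\n"
    "server:/export/home /home nfs4 rw 0 0\n"
    "tmpfs /tmp tmpfs rw 0 0\n"
)

if __name__ == "__main__":
    for candidate in ("/home/user/project", "/tmp/cache"):
        print(candidate, is_on_nfs(candidate, SAMPLE_MOUNTS))
```

With the sample table, the project directory under /home is reported as NFS and the directory under /tmp as local, which is the layout the rules above call for. Calling is_on_nfs(path) without the second argument reads the real mount table instead.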

Eman2NFS (last edited 2013-08-11 18:08:27 by SteveLudtke)