MPI Parallelism

YES, THIS IS A LONG PAGE, BUT IT IS CRITICAL THAT YOU READ THE ENTIRE THING. NO SHORTCUTS FOR MPI.

MPI stands for 'Message Passing Interface', and over the last decade it has become the de-facto standard for running large scale computations on Linux clusters around the world. In most supercomputing centers this will be the ONLY option you have for running in parallel, and administrators may be actively hostile to trying to make use of any non-MPI software on their clusters.

PLEASE NOTE: Using MPI on any cluster is not a task for linux/unix novices. You must have a fair bit of education to understand what's involved in using MPI with any program (not just EMAN). You should be comfortable with running MPI jobs before attempting this with EMAN2. If necessary you may need to consult a cluster administrator for assistance. There is enough variation between different specific linux clusters that we cannot provide specific advice for every situation. We have tried to provide as much generic advice as possible, but this is often not going to be a cookie-cutter operation.

ALSO NOTE: This parallelism system has a bit of overhead associated with it. It is efficient for large jobs, but if you can run a refinement cycle on your desktop in 5 minutes, you won't gain much by using 128 MPI cores. Data transfer will eat up all of your potential gains. However if you have a big job with a large box size and a lot of particles (something that would take, say, 12 hours on your desktop), you can get extremely efficient speedups. ie - if you run a small test job on MPI don't be disappointed when it doesn't give you the speedup you were hoping for. Try a bigger test.

For users with MPI already configured

If you already have MPI configured and tested with EMAN2 on your system, you can skip to the Using MPI section below. The basic option you will use in your scripts is:

--parallel=mpi:<n>:/path/to/scratch[:nocache]

The 'nocache' option is new as of EMAN2.04, and allows you to rely on your local filesysem sharing rather than caching data on each node. If you are using a shared Lustre or similar filesystem to store your data, or if your cluster doesn't have any significant local scratch space on the compute nodes, this may be beneficial.

MPI setup

Linux EMAN2 MPI library installation

On linux clusters you will need to compile one small module directly on the cluster in question. In most cases this will be straightforward, and the setup will be largely automatic. However, in some situations it may require you to do some research about your cluster and/or consult your cluster documentation or administrator.

The EMAN2 binary and source distributions both include a subdirectory called mpi_eman. Go to this directory. Inside you will find a 0README text file you may consult for details, but in many cases simply running the Makefile.linux2 script at the command line as follows will do everything that is necessary:

make -f Makefile.linux2 install

If it does not, the first thing to do is to edit Makefile.linux2 and see if there are any obvious changes that need to be made. If you can't figure out what to do, first try consulting a local expert for the specific cluster you're using. If this approach doesn't work, feel free to email sludtke@bcm.edu.

When running the Make File, errors might arise from having installed the wrong version of EMAN2. PLEASE verify that your version of EMAN2 is correct for your system (for example, 32bit vs 64bit).

Specific MPI systems

Specific Batch Queuing systems

Note: If you don't understand the difference between a batch queuing system (this section) and an MPI system (previous section), you may wish to consider invoking your local cluster guru. People with the necessary linux/cluster expertise should not be confused by this distinction.

How to create a script to run jobs on the cluster

You need to create your own script (which is basically a TEXT FILE) to run specific jobs on the cluster (you could call it, for example myscript.txt). The format for the instructions inside this script will depend on the "batch queuing system" of the cluster you're using. ASK if you don't know which one your cluster uses.

You do not need to write a script from scratch; rather, you can edit the examples provided in the /EMAN2/mpi_eman/ directory. Read the details on the next two sections "OpenPBS/Torque" and "SGE".

OpenPBS/Torque

If your cluster uses openPBS/Torque, there is an example batch file (that is, a "sample script") called pbs.example inside the /EMAN2/mpi_eman/ directory. You can edit this file to create your own script and use it for testing. There are also a couple of simple python test scripts which could be executed with mpirun. You will need to learn and understand how you are expected to launch MPI jobs on your specific cluster before trying any of these things! If you just naively run some of these scripts you could do things which in some installations will make the system administrator very angry, so please, learn what you're supposed to do and how before proceeding past this point. If you do not know what you're doing, showing the pbs.example script to a knowledgeable user should tell them what they need to know before offering you advice on what to do.

SGE (Sun Grid Engine)

This is another popular queuing system, which uses 'qsub' and 'qstat' commands much like OpenPBS/Torque does. Configuration, however, is completely different.

Here is an example of an SGE script to run a refinement with e2refine.py using mpich:

#$ -S /bin/bash
#$ -V
#$ -N refine4
#$ -cwd
#$ -j y
#$ -pe mpich 40

e2refine.py --input=bdb:sets#set2-allgood_phase_flipped-hp --mass=1200.0 --apix=2.9 --automask3d=0.7,24,9,9,24 --iter=1 --sym=c1 --model=bdb:refine_02#threed_filt_05 --path=refine_sge --orientgen=eman:delta=3:inc_mirror=0 --projector=standard --simcmp=frc:snrweight=1:zeromask=1 --simalign=rotate_translate_flip --simaligncmp=ccc --simralign=refine --simraligncmp=frc:snrweight=1 --twostage=2 --classcmp=frc:snrweight=1:zeromask=1 --classalign=rotate_translate_flip --classaligncmp=ccc --classralign=refine --classraligncmp=frc:snrweight=1 --classiter=1 --classkeep=1.5 --classnormproc=normalize.edgemean --classaverager=ctf.auto --sep=5 --m3diter=2 --m3dkeep=0.9 --recon=fourier --m3dpreprocess=normalize.edgemean --m3dpostprocess=filter.lowpass.gauss:cutoff_freq=.1 --pad=256 --lowmem --classkeepsig --classrefsf --m3dsetsf -v 2 --parallel=mpi:40:/scratch/username

e2bdb.py -cF

To create your own script, you can just copy and paste the text above into a text file and introduce the modifications you need. For example, whatever you type after "$ -N" will be the "name" of your job.

You can replace e2refine.py and all the options that follow (denoted by the double dashes --) with a call to any other program in eman2 that is subject to parallelism, for example, e2classaverage3d.py

NOTE in particular the #$ -pe mpich 40 statement which specifies the number of MPI cpus, and the --parallel=mpi:40:/scratch/username option which should match in the number of CPUs (you may actually need to specify one less here if you run into problems), and also must point to a valid scratch directory present as a local (non shared) drive on each compute node.

This LOCAL STORAGE drive for each node may literally be called "scratch". Ask your cluster administrator about it if you cannot find it.

Testing Your MPI Setup

If you're ready to run the test scripts (and even if you're not but want to try it anyway), you can do so through the pbs.example script provided in the ~/EMAN2/mpi_eman/ directory. You can change the number of nodes and processors per node you want to run the test job by modifying the line that says #PBS -l nodes=30:ppn=8 (just change the numbers after the equal sign following the words 'nodes' and 'ppn', which stands for 'processors per node').

If you are using a binary EMAN2 distribution (if you did not "compile from source"), you will need to change the first line of the test scripts to correspond to the EMAN2 provided python interpreter (look at the first line of any of the 'e2*' programs, all of which are in the /EMAN2/bin/ directory). If the mpi_test.py and mpi_test_basic.py scripts do not work properly when run as indicated above, then you either have not installed the library properly, or you need to learn more about how to run MPI jobs on your cluster. If these scripts run to completion and the output looks sensible, you are ready to proceed with an actual EMAN2 job.

Using MPI

Once you have verified that your MPI support is installed and working, making actual use of MPI to run your jobs is quite straightforward, with a couple of caveats.

  1. Make sure you read this warning

  2. Prepare the batch file appropriate for your cluster. Do not try to use 'mpirun' or 'mpiexec' on any EMAN programs. Instead, add the '--parallel=mpi:<n>:/path/to/scratch[:nocache]' option to an EMAN2 command like e2refine.py. Some commands do not support the --parallel option, and trying to run them using mpirun will not accomplish anything useful.

    • replace <n> with the total number of processors you have requested (these number must match exactly)

    • replace /path/to/scratch, with the path to a scratch storage directory available on each node of the cluster. Note that this directory must be local storage on each node, not a directory shared between nodes. If you use the path to a shared directory, like $HOME/scratch, you will have very very serious problems. You must use a filesystem local to the specific node. If you don't have this information, check your cluster documentation and/or consult with your system administrator.
    • Make sure that after the last e2* command in your batch script you put an 'e2bdb.py -cF' command to make sure all of the output image files have been flushed to disk.
  3. Immediately before submitting your job, run 'e2bdb.py -c'. This will require you to exit all running EMAN2 jobs (if any) before proceeding. Do this.
  4. Submit your job.
  5. IMPORTANT : While the job is running, you have effectively ceded control of that specific project to the cluster nodes using MPI. You MUST NOT modify any of the files in that project in any way while the job is running, or you will risk a variety of bad things. While the bad things will not always happen, there is a large risk, and the bad things are VERY bad, including corruption of your entire project. Wait until the job is complete before you do anything that could possibly change files in that directory.

  6. When you run into problems (note I say when, not if), and you have exhausted any local MPI experts, please feel free to email me (sludtke@bcm.edu). Once you have things properly configured, you should be able to use MPI routinely, but getting there may be a painful process on some clusters. Don't get too discouraged.

  7. The 'nocache' option is new as of EMAN2.04, and allows you to rely on your local filesysem sharing rather than caching data on each node. If you are using a shared Lustre or similar filesystem

to store your data, or if your cluster doesn't have any significant local scratch space on the compute nodes, this may be beneficial.

Note about use of shared clusters

EMAN2 can make use of MPI very efficiently, however, as this type of image processing is VERY data intensive, in some situations, your jobs may be limited by data transfer between nodes rather than by the computational capacity of the cluster. The inherent scalability of your job will depend quite strongly on the parameters of your reconstruction. In general larger projects will scale better than smaller projects, but projects can be 'large' in several different ways (eg- large box size, large number of particles, low symmetry,...). If your cluster administrator complains that your jobs aren't using the CPUs that you have allocated for your jobs sufficiently, you can try A) running on fewer processors, which will increase the efficiency (but also increase run-times), or you can refer them to me, and I will explain the issues involved. We are also developing tools to help better measure how efficiently your jobs are taking advantage of the CPUs you are allocating, but this will be an ongoing process.