Differences between revisions 24 and 51 (spanning 27 versions)
Revision 24 as of 2012-12-18 19:24:04
Size: 17781
Editor: jgalaz
Comment:
Revision 51 as of 2022-09-08 21:34:00
Size: 15604
Editor: SteveLudtke
Comment:
= MPI Parallelism =

MPI stands for 'Message Passing Interface', and over the last decade it has become the de-facto standard for running large scale computations on Linux clusters around the world. In most supercomputing centers this will be the ONLY option you have for running in parallel, and administrators may be actively hostile to trying to make use of any non-MPI software on their clusters.
'''PLEASE NOTE:''' Unfortunately, MPI was designed by computer scientists for computer scientists, and there can be a steep learning curve. That is, __using MPI on any cluster is not a task for linux/unix novices__. You must have a fair bit of education to understand what's involved in using MPI with any program (not just EMAN). You should be comfortable with running MPI jobs before attempting this with EMAN2. If necessary you may need to consult a cluster administrator for assistance. There is enough variation between different specific linux clusters that we cannot provide specific advice for every situation. We have tried to provide as much generic advice as possible, but this is often not going to be a cookie-cutter operation.
== Installing MPI Support in EMAN2/SPARX ==
SPARX and EMAN2 have merged their MPI support efforts, and as of 4/19/2013, the legacy EMAN2 MPI system has been retired. Pydusa used to require separate installation; it is now installed as part of the Anaconda-based installation system, so no further action should be required.
== Testing EMAN2 MPI installation ==
There is a special script in ''EMAN2/examples'' called ''mpi_test.py''. Once you have EMAN2 and Pydusa installed, you can use this script to test the basic MPI setup. Make a batch queuing script (described below), and in the script, put:
{{{
mpirun <path to python> $EMAN2DIR/examples/mpi_test.py
}}}
<path to python> can be found on the first line of any of the EMAN2 programs, eg - "head -1 $EMAN2DIR/bin/e2proc2d.py"
This program will produce console output describing the success or failure of the MPI job. If this script succeeds, then it is very likely that an EMAN2 program using the --parallel option will also succeed.
If you encounter errors, please see the Debugging section below.
== Using MPI in EMAN2 ==
Once you have EMAN2 and pydusa installed, usage should be straightforward. EMAN2 has a modular parallelism system, supporting several types of parallel computing, not just MPI. All of these parallelism systems use a common syntax. For EMAN2 commands which can run in parallel, to use MPI parallelism, the basic syntax is:
{{{
--parallel=mpi:<nproc>:/path/to/scratch
}}}
for example:
{{{
e2refine.py ... --parallel=mpi:64:/scratch/stevel
}}}
 * the number of processors MUST match your request to the batch system (see below)
 * ideally, the compute nodes should have a local scratch directory (on a non-shared hard drive). However, since EMAN2.1, this is no longer a strict requirement.
 * Do NOT use 'mpirun' directly. This will be called for you by the program you give the --parallel option to.
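To make the first bullet concrete, here is a small shell sketch showing how the number in --parallel is derived from a batch request; the node and core counts are hypothetical and chosen to match the mpi:64 example above:

```shell
# Hypothetical batch request: 8 nodes with 8 cores each.
# The number after "mpi:" must equal the total cores requested.
NODES=8
PPN=8
NPROC=$((NODES * PPN))
echo "e2refine.py ... --parallel=mpi:${NPROC}:/scratch/$USER"
```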
=== Special exception for e2refine_easy and other select programs ===
When using MPI parallelism, certain tasks become I/O limited, and are simply not efficient to run in parallel on multiple nodes. To handle this situation, we have added a --threads option to e2refine_easy and other select programs. If you specify --threads, then these i/o-limited tasks will all be run on the first assigned node using threads. For example, you might say:
{{{
--parallel=mpi:128:/scratch --threads=12
}}}
Which would run most of the refinement using the full 128 cores and MPI. In cases where this was inefficient, 12 cores on the first assigned node would be used instead.
One word of warning, however. On some clusters, with very tight management and job control, having a single process make use of all 12 cores could result in jobs being terminated. If you run into this problem, please let me know so we can try to come up with a solution.
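As a sketch of the rule of thumb (node and core counts hypothetical), --threads should normally be kept at or below the core count of a single node, while --parallel covers all requested cores:

```shell
# Hypothetical cluster: 10 nodes x 12 cores. MPI uses all 120 cores,
# while I/O-limited stages run as threads on the first node only.
NODES=10
CORES_PER_NODE=12
TOTAL=$((NODES * CORES_PER_NODE))
OPTS="--parallel=mpi:${TOTAL}:/scratch/$USER --threads=${CORES_PER_NODE}"
echo "$OPTS"
```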
=== Special Options to mpirun ===
By default, EMAN2 programs ''internally'' run mpirun with the -n <ncpus> option and get the list of available nodes from the batch queuing system (e.g., PBS or SGE). If you need to specify a different set of options (for example, if you aren't using PBS or SGE and you want to specify a nodefile), you can set the environment variable "EMANMPIOPTS" (this variable will replace -n <ncpus> on the command line) by adding the following line to your ''.bashrc'' file:
{{{
export EMANMPIOPTS="-hostfile myhosts.txt"
}}}
''myhosts.txt'' should be a text file in your home directory listing the names of the nodes you want to use on the cluster.
For example:
{{{
 n1
 n2
 n3
 n4
}}}
You can then supply the --parallel option as usual, but instead of writing 'n' for the ''number of cpus'' to use, you need to write the ''number of nodes'' (--parallel=mpi:#nodes:/scratch/username) listed in ''myhosts.txt''. In the example above, there are 4 nodes listed; therefore, in an EMAN2 command run at the command line, the --parallel parameter would become:
{{{
eman2whateverprogram.py --input=<input> --output=<output> --parallel=mpi:4:<scratch_directory>
}}}
'''IMPORTANT''': Note that you usually cannot run a command from the ''head node''; rather, you should log into any one node, for example
{{{
ssh n1
}}}
and then run your eman2 command from there. This is usually the way to do it since, typically, the head node will NOT have a "scratch" directory.
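Putting the pieces above together, here is a shell sketch (hostnames hypothetical) of counting the nodes listed in ''myhosts.txt'' to get the right --parallel value when EMANMPIOPTS is in use:

```shell
# Build a hostfile and count its entries; with EMANMPIOPTS set,
# the number after "mpi:" is the number of nodes listed, not cpus.
printf '%s\n' n1 n2 n3 n4 > myhosts.txt
NNODES=$(grep -c . myhosts.txt)   # count non-empty lines
echo "--parallel=mpi:${NNODES}:/scratch/$USER"
```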
== Batch Queuing systems ==
=== How to create a script to run jobs on the cluster ===
A cluster is a shared computing environment. Limited resources must be allocated for many different users and jobs all at the same time. To run a job on most clusters, you must formulate a request in the form of a text file containing the details about your job: the number of processors you want, how long the job is expected to run, perhaps the amount of RAM you will need, etc. This text file is called a 'batch script' and is submitted to the 'Batch Queuing System' (BQS) on your cluster. The BQS then decides when and where your job will be run, and communicates information about which specific processors to use to your job when launched.
The BQS (allocates resources and launches jobs) is independent of MPI (runs jobs on allocated resources). There are several common BQS systems you may encounter. We cannot cover every possibility here, so you need to consult with your local cluster policy information for details on how to submit jobs using your BQS. We will provide examples for OpenPBS and SGE, which are two of the more common BQS systems out there. Even then, the details may vary a little from cluster to cluster. These examples just give you someplace to start.
==== OpenPBS/Torque ====
Here is an example of a batch script for PBS-based systems:

{{{#!/bin/bash
# All lines starting with "#PBS" are PBS commands
#
# The following line asks for 10 nodes, each of which has 12 processors, for a total of 120 CPUs.
# The walltime is the maximum length of time your job will be permitted to run. If it is too small, your
# job will be killed before it's done. If it's too long, however, your job may have to wait a looong
# time before the cluster starts running it (depends on local policy).
#PBS -l nodes=10:ppn=12
#PBS -l walltime=120:00:00
#PBS -q dque

# This prints the list of nodes your job is running on to the output file
cat $PBS_NODEFILE

# cd to your project directory
cd /home/stevel/data/myproject

# Now the actual EMAN2 command(s). Note the --parallel option at the end. The number of CPUs must match the number specified above
e2refine_easy.py --input=sets/all__ctf_flip_hp.lst --model=initial_models/model_00_02.hdf --targetres=6.0 --sym=d7 --iter=3 --mass=800.0 --apix=2.1 --classkeep=0.9 --m3dkeep=0.8 --parallel=mpi:120:/scratch/stevel --threads 12 --speed 5
}}}
If this file were called, for example, test.pbs, you would then submit the job to the cluster by saying

{{{
e2bdb.py -c
qsub test.pbs
}}}
There are additional options you can use with the qsub command as well. See your local cluster documentation for details on what is required/allowed. The e2bdb.py -c command is a good idea to make sure that the compute nodes will see any recent changes you've made to images in the project.

==== SGE (Sun Grid Engine) ====
{{{#!/bin/bash
cd /home/stevel/data/myproject
e2refine_easy.py --input=sets/all__ctf_flip_hp.lst --model=initial_models/model_00_02.hdf --targetres=6.0 --sym=d7 --iter=3 --mass=800.0 --apix=2.1 --classkeep=0.9 --m3dkeep=0.8 --parallel=mpi:40:/scratch/stevel --threads 4 --speed 5
}}}

=== Summary ===
 1. Prepare the batch file appropriate for your cluster. Do not try to use 'mpirun' or 'mpiexec' directly on any EMAN programs. Instead, add the '--parallel=mpi:<n>:/path/to/scratch[:nocache]' option to an EMAN2 command like e2refine.py. Some commands do not support the --parallel option, and trying to run them using mpirun will not accomplish anything useful.
  * replace <n> with the total number of processors you have requested (these numbers must match exactly)
 1. If you need to pass special options to mpirun (like a hostfile), you can use the '''EMANMPIOPTS''' shell variable, but most users should not need this. A typical usage would be '''export EMANMPIOPTS="-hostfile myhosts.txt"'''. You should only do this if necessary, though __(note that when supplying the parameter ''--parallel=mpi:n:scratch_directory'', 'n' is no longer the number of cpus to use, but rather the number of nodes listed in ''myhosts.txt'')__
 1. When you run into problems (note I say when, not if), and you have exhausted any local MPI experts, please feel free to email me ( sludtke@bcm.edu ). Once you have things properly configured, you should be able to use MPI routinely, but getting there may be a painful process on some clusters. Don't get too discouraged.
 1. The 'nocache' option is new as of EMAN2.04, and allows you to rely on your local filesystem sharing rather than caching data on each node. If you are using a shared Lustre or similar filesystem to store your data, or if your cluster doesn't have any significant local scratch space on the compute nodes, this may be beneficial, but the option is still experimental.
 1. EMAN2 can make use of MPI very efficiently, however, as this type of image processing is VERY data intensive, in some situations, your jobs may be limited by data transfer between nodes rather than by the computational capacity of the cluster. The inherent scalability of your job will depend quite strongly on the parameters of your reconstruction. In general larger projects will scale better than smaller projects, but projects can be 'large' in several different ways (eg- large box size, large number of particles, low symmetry,...). If your cluster administrator complains that your jobs aren't using the CPUs that you have allocated for your jobs sufficiently, you can try A) running on fewer processors, which will increase the efficiency (but also increase run-times), or you can refer them to me, and I will explain the issues involved. We are also developing tools to help better measure how efficiently your jobs are taking advantage of the CPUs you are allocating, but this will be an ongoing process.
== Debugging Problems ==
'''''Every Linux cluster is different, and normally the best source of help will be your cluster support staff. If you show them this web page, they should be able to get the information they need to help you debug the problem. If not, you are always free to email for help on the EMAN2 list or directly to sludtke@bcm.edu.'''''

 * If you get an error about being unable to find libmpi.so.1, try this (may not work exactly like this on every cluster):

{{{
$ locate libmpi.so.1
/usr/lib64/openmpi/lib/libmpi.so.1
/usr/lib64/openmpi/lib/libmpi.so.1.0.2
/usr/local/lib/libmpi.so.1
/usr/local/lib/libmpi.so.1.0.8
/usr/local/openmpi-1.6.5/lib/libmpi.so.1
/usr/local/openmpi-1.6.5/lib/libmpi.so.1.0.8
/usr/local/openmpi-1.6.5/ompi/.libs/libmpi.so.1
/usr/local/openmpi-1.6.5/ompi/.libs/libmpi.so.1.0.8

$ which mpirun
/usr/local/bin/mpirun

$ echo $LD_LIBRARY_PATH
/home/stevel/lib:/home/stevel/EMAN2/lib

}}}
This would prompt me to add this to my .bashrc:
{{{
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
}}}
 * If you get errors like "Undefined symbol", there are two major causes:
  * If you are using OpenMPI, make sure it is compiled with the '--disable-dlopen' option. If you are using a system-provided OpenMPI, and they cannot do this for you, you may have to install your own copy of OpenMPI (in your home directory).
  * Some systems have more than one MPI installed. You may have your $PATH pointing to one MPI and your LD_PRELOAD or LD_LIBRARY_PATH pointing at a different MPI. See the section above on libmpi.so.1 to debug this.

To set the LD_PRELOAD variable, add a line similar to the one below to your ''.bashrc'' file (a hidden file in your home directory) by following these steps:

1) Open your invisible ''.bashrc file'' by typing at the command line:

{{{
vi .bashrc
}}}
(''vi'' is an obnoxious text editor, but you can look up how to use it on Google).

2) Add the following line anywhere in the ''.bashrc'' file: ''export LD_PRELOAD=<mpi_directory>/lib/libmpi.so'' making sure to replace <mpi_directory> with your mpi directory.

In my case, the directory of my mpi installation is ''/raid/home/jgalaz/openmpi-1.4.3''

Thus, the line I added to my .bashrc file was:

''export LD_PRELOAD=/raid/home/jgalaz/openmpi-1.4.3/lib/libmpi.so''

To find what mpi you're using (what the installation directory is for your local version of mpi), type at the command line:

{{{
which mpirun
}}}
I get:

''/raid/home/jgalaz/openmpi-1.4.3/bin/mpirun''

Therefore, the path before '/bin' corresponds to <mpi_directory> (''/raid/home/jgalaz/openmpi-1.4.3'').
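The two steps above can be combined in a small shell sketch; the openmpi path is just the example from this page, so substitute your own ''which mpirun'' output:

```shell
# Derive <mpi_directory> by stripping /bin/mpirun from the path
# reported by `which mpirun`, then build the LD_PRELOAD line.
MPIRUN=/raid/home/jgalaz/openmpi-1.4.3/bin/mpirun
MPI_DIR=${MPIRUN%/bin/mpirun}
echo "export LD_PRELOAD=$MPI_DIR/lib/libmpi.so"
```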

 * Problems with the scratch directory specified in ''--parallel=mpi:N:/path/to/scratch''. Previous versions of EMAN2 required local scratch space on each node of the cluster, and this space could not be shared. This requirement has been relaxed in EMAN2.1, and it should now be possible to run even on clusters that have no local scratch space, by specifying a shared scratch space. In any case, whatever directory you specify here must be writable by you, and should have a reasonable amount (perhaps a few GB) of free space.
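Before launching a long job, it is worth checking the scratch path with a quick sketch like this (the path here is a stand-in; use your cluster's actual scratch directory):

```shell
# Confirm the scratch directory exists and is writable, and report
# free space; --parallel jobs will fail if this path is not writable.
SCRATCH="${TMPDIR:-/tmp}/eman2_scratch_check"
mkdir -p "$SCRATCH"
touch "$SCRATCH/.writetest" && rm -f "$SCRATCH/.writetest" && echo "scratch OK: $SCRATCH"
df -h "$SCRATCH" | tail -1
```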

MPI Parallelism

MPI stands for 'Message Passing Interface', and over the last decade it has become the de-facto standard for running large scale computations on Linux clusters around the world. In most supercomputing centers this will be the ONLY option you have for running in parallel, and administrators may be actively hostile to trying to make use of any non-MPI software on their clusters.

PLEASE NOTE: Unfortunately, MPI was designed by computer scientists for computer scientists, and there can be a steep learning curve. That is, using MPI on any cluster is not a task for linux/unix novices. You must have a fair bit of education to understand what's involved in using MPI with any program (not just EMAN). You should be comfortable with running MPI jobs before attempting this with EMAN2. If necessary you may need to consult a cluster administrator for assistance. There is enough variation between different specific linux clusters that we cannot provide specific advice for every situation. We have tried to provide as much generic advice as possible, but this is often not going to be a cookie-cutter operation.

== Installing MPI Support in EMAN2/SPARX ==

SPARX and EMAN2 have merged their MPI support efforts, and as of 4/19/2013, the legacy EMAN2 MPI system has been retired. Pydusa used to require separate installation. It is now installed as part of the Anaconda based installation system, so no further action should be required.

== Testing EMAN2 MPI installation ==

There is a special script in EMAN2/examples called mpi_test.py. Once you have EMAN2 and Pydusa installed, you can use this script to test the basic MPI setup. Make a batch queuing script (described below), and in the script, put:

{{{
mpirun <path to python> $EMAN2DIR/examples/mpi_test.py
}}}

<path to python> can be found on the first line of any of the EMAN2 programs, eg - "head -1 $EMAN2DIR/bin/e2proc2d.py"
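To make the "first line" trick concrete, here is a minimal sketch of extracting the interpreter path from an EMAN2 program's shebang line. The path shown is a hypothetical stand-in for whatever `head -1 $EMAN2DIR/bin/e2proc2d.py` actually prints on your system:

```shell
# Hypothetical stand-in for the output of: head -1 $EMAN2DIR/bin/e2proc2d.py
SHEBANG='#!/home/user/miniconda3/bin/python'
# Strip the leading '#!' to recover the interpreter path
EMAN_PYTHON=${SHEBANG#"#!"}
# This is the command you would place in your batch script
echo "mpirun $EMAN_PYTHON \$EMAN2DIR/examples/mpi_test.py"
```

On a real cluster you would capture the shebang with `head -1` instead of hard-coding it.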

This program will produce console output describing the success or failure of the MPI job. If this script succeeds, then it is very likely that an EMAN2 program using the --parallel option will also succeed.

If you encounter errors, please see the Debugging section below.

== Using MPI in EMAN2 ==

Once you have EMAN2 and pydusa installed, usage should be straightforward. EMAN2 has a modular parallelism system, supporting several types of parallel computing, not just MPI. All of these parallelism systems use a common syntax. For EMAN2 commands which can run in parallel, to use MPI parallelism, the basic syntax is:

{{{
--parallel=mpi:<nproc>:/path/to/scratch
}}}
for example:

{{{
e2refine.py ... --parallel=mpi:64:/scratch/stevel
}}}
 * the number of processors MUST match your request to the batch system (see below)
 * ideally, the compute nodes should have a local scratch directory (on a non-shared hard drive); however, since EMAN2.1, this is no longer a strict requirement.
 * Do NOT use 'mpirun' directly. This will be called for you by the program you give the --parallel option to.
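As a concrete illustration of the first point, the processor count passed to --parallel must equal the total core count requested from the batch system (nodes times processors per node). A minimal sketch, with illustrative numbers:

```shell
# Illustrative values: they would match e.g. '#PBS -l nodes=10:ppn=12'
NODES=10
PPN=12
# Total CPUs = nodes x processors-per-node; this is the <nproc> for --parallel
NPROC=$((NODES * PPN))
echo "--parallel=mpi:${NPROC}:/scratch/username"
```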

=== Special exception for e2refine_easy and other select programs ===

When using MPI parallelism, certain tasks become I/O-limited, and are simply not efficient to run in parallel on multiple nodes. To handle this situation, we have added a --threads option to e2refine_easy and other select programs. If you specify --threads, then these I/O-limited tasks will all be run on the first assigned node using threads. For example, you might say:

{{{
--parallel=mpi:128:/scratch --threads=12
}}}
which would run most of the refinement using the full 128 cores and MPI. In cases where this was inefficient, 12 cores on the first assigned node would be used instead.
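A reasonable rule of thumb (an assumption, not an official EMAN2 recommendation) is to set --threads no larger than the core count of a single compute node, which you can query portably with getconf:

```shell
# Number of cores on the machine where this runs (POSIX getconf)
CORES=$(getconf _NPROCESSORS_ONLN)
# Cap --threads at the cores available on one node
THREADS=$CORES
echo "--parallel=mpi:128:/scratch --threads=${THREADS}"
```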

One word of warning, however. On some clusters, with very tight management and job control, having a single process make use of all 12 cores could result in jobs being terminated. If you run into this problem, please let me know so we can try to come up with a solution.

=== Special Options to mpirun ===

By default, EMAN2 programs *internally* run mpirun with the -n <ncpus> option, and get the list of available nodes from the batch queuing system (e.g., PBS or SGE). If you need to specify a different set of options (for example, if you aren't using PBS or SGE and you want to specify a nodefile), you can set the environment variable "EMANMPIOPTS" (its contents will replace -n <ncpus> on the mpirun command line) by adding the following line to your .bashrc file:

{{{
export EMANMPIOPTS="-hostfile myhosts.txt"
}}}

myhosts.txt should be a text file in your home directory listing the names of the nodes you want to use on the cluster.

For example:

{{{
n1
n2
n3
n4
}}}

You can then supply the --parallel option as usual, but instead of 'n' being the number of CPUs to use, it becomes the number of nodes listed in myhosts.txt (--parallel=mpi:#nodes:/scratch/username). In the example above, there are 4 nodes listed; therefore, in an EMAN2 command run at the command line, the --parallel parameter would become:

{{{
eman2whateverprogram.py --input=<input> --output=<output> --parallel=mpi:4:<scratch_directory>
}}}
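The node count can be derived directly from the hostfile rather than counted by hand. A minimal sketch (the hostfile is created here purely for illustration):

```shell
# Create an illustrative hostfile; on a real cluster this already exists
printf 'n1\nn2\nn3\nn4\n' > myhosts.txt
export EMANMPIOPTS="-hostfile myhosts.txt"
# Count non-empty lines to get the number of nodes
NNODES=$(grep -c . myhosts.txt)
echo "--parallel=mpi:${NNODES}:<scratch_directory>"
```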

IMPORTANT: Note that you usually cannot run a command from the head node; rather, you should log into any one node, for example

{{{
ssh n1
}}}
and then run your eman2 command from there. This is usually the way to do it since, typically, the head node will NOT have a "scratch" directory.

== Batch Queuing systems ==

=== How to create a script to run jobs on the cluster ===

A cluster is a shared computing environment. Limited resources must be allocated for many different users and jobs all at the same time. To run a job on most clusters, you must formulate a request in the form of a text file containing the details about your job. These details include the number of processors you want, how long the job is expected to run, perhaps the amount of RAM you will need, etc. This text file is called a 'batch script' and is submitted to the 'Batch Queuing System' (BQS) on your cluster. The BQS then decides when and where your job will be run, and communicates information about which specific processors to use to your job when launched.

The BQS (allocates resources and launches jobs) is independent of MPI (runs jobs on allocated resources). There are several common BQS systems you may encounter. We cannot cover every possibility here, so you need to consult with your local cluster policy information for details on how to submit jobs using your BQS. We will provide examples for OpenPBS and SGE, which are two of the more common BQS systems out there. Even then, the details may vary a little from cluster to cluster. These examples just give you someplace to start.

=== OpenPBS/Torque ===

Here is an example of a batch script for PBS-based systems:

{{{
#!/bin/bash
# All lines starting with "#PBS" are PBS commands
#
# The following line asks for 10 nodes, each of which has 12 processors, for a total of 120 CPUs.
# The walltime is the maximum length of time your job will be permitted to run. If it is too small, your
# job will be killed before it's done. If it's too long, however, your job may have to wait a looong
# time before the cluster starts running it (depends on local policy).
#PBS -l nodes=10:ppn=12
#PBS -l walltime=120:00:00
#PBS -q dque

# This prints the list of nodes your job is running on to the output file
cat $PBS_NODEFILE

# cd to your project directory
cd /home/stevel/data/myproject

# Now the actual EMAN2 command(s). Note the --parallel option at the end.
# The number of CPUs must match the number specified above.
e2refine_easy.py --input=sets/allctf_flip_hp.lst --model=initial_models/model_00_02.hdf --targetres=6.0 --sym=d7 --iter=3 --mass=800.0 --apix=2.1 --classkeep=0.9 --m3dkeep=0.8 --parallel=mpi:120:/scratch/stevel --threads 12 --speed 5
}}}
If this file were called, for example, test.pbs, you would then submit the job to the cluster by saying

{{{
e2bdb.py -c
qsub test.pbs
}}}

There are additional options you can use with the qsub command as well. See your local cluster documentation for details on what is required/allowed. The e2bdb.py -c command is a good idea to make sure that the compute nodes will see any recent changes you've made to images in the project.

=== SGE (Sun Grid Engine) ===

This is another popular queuing system, which uses 'qsub' and 'qstat' commands much like OpenPBS/Torque does. Configuration, however, is completely different.

Here is an example of an SGE script to run a refinement with e2refine.py using mpich:

{{{
#!/bin/bash
#$ -S /bin/bash
#$ -V
#$ -N refine4
#$ -cwd
#$ -j y
#$ -pe mpich 40

cd /home/stevel/data/myproject

e2refine_easy.py --input=sets/allctf_flip_hp.lst --model=initial_models/model_00_02.hdf --targetres=6.0 --sym=d7 --iter=3 --mass=800.0 --apix=2.1 --classkeep=0.9 --m3dkeep=0.8 --parallel=mpi:40:/scratch/stevel --threads 4 --speed 5
}}}

== Summary ==

 1. Prepare the batch file appropriate for your cluster. Do not try to use 'mpirun' or 'mpiexec' directly on any EMAN programs. Instead, add the '--parallel=mpi:<n>:/path/to/scratch[:nocache]' option to an EMAN2 command like e2refine.py. Some commands do not support the --parallel option, and trying to run them using mpirun will not accomplish anything useful.
  * replace <n> with the total number of processors you have requested (this number must match your batch request exactly)
  * replace /path/to/scratch with the path to a scratch storage directory available on each node of the cluster. Ideally this should be storage local to each node rather than a directory shared between nodes; since EMAN2.1, a shared scratch space is permitted, but local storage will generally perform better. If you don't have this information, check your cluster documentation and/or consult with your system administrator.
  * Make sure that after the last e2* command in your batch script you put an 'e2bdb.py -cF' command to make sure all of the output image files have been flushed to disk.
 1. If you need to pass special options to mpirun (like a hostfile), you can use the EMANMPIOPTS shell variable, but most users should not need this. A typical usage would be export EMANMPIOPTS="-hostfile myhosts.txt". You should only do this if necessary; note that in this case the 'n' in --parallel=mpi:n:scratch_directory is no longer the number of CPUs to use, but rather the number of nodes listed in myhosts.txt.
 1. Submit your job.
 1. When you run into problems (note I say when, not if), and you have exhausted any local MPI experts, please feel free to email me ( sludtke@bcm.edu ). Once you have things properly configured, you should be able to use MPI routinely, but getting there may be a painful process on some clusters. Don't get too discouraged.
 1. The 'nocache' option is new as of EMAN2.04, and allows you to rely on your local filesystem sharing rather than caching data on each node. If you are using a shared Lustre or similar filesystem to store your data, or if your cluster doesn't have any significant local scratch space on the compute nodes, this may be beneficial, but the option is still experimental.
 1. EMAN2 can make use of MPI very efficiently; however, since this type of image processing is VERY data intensive, in some situations your jobs may be limited by data transfer between nodes rather than by the computational capacity of the cluster. The inherent scalability of your job will depend quite strongly on the parameters of your reconstruction. In general, larger projects scale better than smaller projects, but projects can be 'large' in several different ways (e.g. large box size, large number of particles, low symmetry). If your cluster administrator complains that your jobs aren't making sufficient use of the CPUs you have allocated, you can try running on fewer processors, which will increase efficiency (but also increase run times), or you can refer them to me and I will explain the issues involved. We are also developing tools to help better measure how efficiently your jobs take advantage of the CPUs you allocate, but this will be an ongoing process.

== Debugging Problems ==

Every Linux cluster is different, and normally the best source of help will be your cluster support staff. If you show them this web page, they should be able to get the information they need to help you debug the problem. If not, you are always free to email for help on the EMAN2 list or directly to sludtke@bcm.edu .

 * If you get an error about being unable to find libmpi.so.1, try this (may not work exactly like this on every cluster):

{{{
$ locate libmpi.so.1
/usr/lib64/openmpi/lib/libmpi.so.1
/usr/lib64/openmpi/lib/libmpi.so.1.0.2
/usr/local/lib/libmpi.so.1
/usr/local/lib/libmpi.so.1.0.8
/usr/local/openmpi-1.6.5/lib/libmpi.so.1
/usr/local/openmpi-1.6.5/lib/libmpi.so.1.0.8
/usr/local/openmpi-1.6.5/ompi/.libs/libmpi.so.1
/usr/local/openmpi-1.6.5/ompi/.libs/libmpi.so.1.0.8

$ which mpirun
/usr/local/bin/mpirun

$ echo $LD_LIBRARY_PATH
/home/stevel/lib:/home/stevel/EMAN2/lib
}}}
This would prompt me to add this to my .bashrc:

{{{
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
}}}
 * If you get errors like "Undefined symbol", there are two major causes:
  * If you are using OpenMPI, make sure it is compiled with the '--disable-dlopen' option. If you are using a system-provided OpenMPI, and they cannot do this for you, you may have to install your own copy of OpenMPI (in your home directory).
  * Some systems have more than one MPI installed. You may have your $PATH pointing to one MPI and your LD_PRELOAD or LD_LIBRARY_PATH pointing at a different MPI. See the section above on libmpi.so.1 to debug this.
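One way to spot the mixed-installation problem is to compare the installation prefix of the mpirun on your $PATH with the prefix of the libmpi your loader finds. A minimal sketch, where the two paths are illustrative stand-ins for the output of `which mpirun` and `locate libmpi.so.1`:

```shell
# Illustrative stand-ins for: which mpirun / locate libmpi.so.1
MPIRUN_PATH=/usr/local/openmpi-1.6.5/bin/mpirun
LIBMPI_PATH=/usr/local/openmpi-1.6.5/lib/libmpi.so.1
# Strip the bin/... and lib/... components to get each installation prefix
MPI_PREFIX=$(dirname "$(dirname "$MPIRUN_PATH")")
LIB_PREFIX=$(dirname "$(dirname "$LIBMPI_PATH")")
if [ "$MPI_PREFIX" = "$LIB_PREFIX" ]; then
    echo "consistent MPI installation: $MPI_PREFIX"
else
    echo "MISMATCH: mpirun from $MPI_PREFIX, libmpi from $LIB_PREFIX"
fi
```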

To set the LD_PRELOAD variable, add a line like the one below to your .bashrc file (which is an invisible file in your home directory) by following these steps:

1) Open your invisible .bashrc file by typing at the command line:

{{{
vi .bashrc
}}}
(vi is an obnoxious text editor, but you can look up how to use it on Google).

2) Add the following line anywhere in the .bashrc file, making sure to replace <mpi_directory> with your mpi directory:

{{{
export LD_PRELOAD=<mpi_directory>/lib/libmpi.so
}}}

In my case, the directory of my mpi installation is /raid/home/jgalaz/openmpi-1.4.3

Thus, the line I added to my .bashrc file was:

{{{
export LD_PRELOAD=/raid/home/jgalaz/openmpi-1.4.3/lib/libmpi.so
}}}

To find which mpi you're using (i.e. the installation directory of your local version of mpi), type at the command line:

{{{
which mpirun
}}}
I get:

{{{
/raid/home/jgalaz/openmpi-1.4.3/bin/mpirun
}}}
Therefore the path before '/bin' corresponds to <mpi_directory> (/raid/home/jgalaz/openmpi-1.4.3).

 * Problems with the scratch directory specified in --parallel=mpi:N:/path/to/scratch. Previous versions of EMAN2 required local scratch space on each node of the cluster, and this space could not be shared. This requirement has been relaxed in EMAN2.1, and it should now be possible to run even on clusters that have no local scratch space, by specifying a shared scratch space. In any case, whatever directory you specify here must be writable by you, and should have a reasonable amount (perhaps a few GB) of free space.
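A quick way to verify those two requirements (writable, with some free space) before launching a long job is sketched below; a temporary directory stands in for the real scratch path on your cluster:

```shell
# Stand-in for your real scratch path, e.g. /scratch/username
SCRATCH=$(mktemp -d)
# The directory must exist and be writable by you
if [ -d "$SCRATCH" ] && [ -w "$SCRATCH" ]; then
    SCRATCH_OK=yes
else
    SCRATCH_OK=no
fi
echo "scratch '$SCRATCH' writable: $SCRATCH_OK"
# Eyeball the free space (a few GB recommended)
df -k "$SCRATCH" | tail -1
```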

EMAN2/Parallel/Mpi (last edited 2022-09-08 21:34:00 by SteveLudtke)