Diff for "EMAN2/Parallel"

Differences between revisions 15 and 16

Parallel Processing in EMAN2

EMAN2 uses a modular strategy for running commands in parallel. That is, you can choose different ways to run EMAN2 programs in parallel, depending on your environment. Unfortunately, as of April, 2010, there is still only one available parallelism strategy. This should be gradually fleshed out over 2010. Also unfortunately, it isn't trivial to use for simple multithreaded execution (but it does work). We hope to rectify this soon.

Programs with parallelism support will take the --parallel command line option as follows:

--parallel=<type>:<option>=<value>:<option>=<value>:...

For example, for the distributed parallelism model: --parallel=dc:localhost:9990

For the local multicore threaded model: --parallel=thread:4 (where 4 is the number of cores to use)
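
As a rough illustration of how such a specifier breaks down, the sketch below splits a --parallel value on colons. The name parse_parallel_spec is hypothetical and this is not EMAN2's own parser; it only assumes the colon-separated layout shown above, keeping bare fields (as in dc:localhost:9990) positionally and collecting <option>=<value> fields into a dictionary.

```python
def parse_parallel_spec(spec):
    """Split a --parallel value into (type, positional fields, keyword options).

    Hypothetical helper for illustration only; not part of EMAN2.
    """
    fields = spec.split(":")
    kind, rest = fields[0], fields[1:]
    positional = []
    options = {}
    for field in rest:
        if "=" in field:
            # <option>=<value> form
            key, value = field.split("=", 1)
            options[key] = value
        else:
            # bare field, e.g. a host name, port or core count
            positional.append(field)
    return kind, positional, options


print(parse_parallel_spec("dc:localhost:9990"))
print(parse_parallel_spec("thread:4"))
```

All values stay as strings here; converting a port or core count to an integer is left to the caller.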

Note that not all programs will run in parallel. If a program does not accept the --parallel option, then it is not parallelized.

Local Machine (multiple cores)

As of 7/15/2010: if you only want to use multiple cores on your local machine (and not simultaneously use processors from other computers), just put 'thread:<ncpu>' in the 'Parallel' box in e2workflow, or specify the '--parallel=thread:<ncpu>' option on the command line. Replace <ncpu> with the number of cores you wish to use.
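
A minimal sketch of building that option from a script, assuming you want to cap the request at the number of cores Python reports for the machine. thread_option is a hypothetical helper, not an EMAN2 function; only the 'thread:<ncpu>' syntax comes from the text above.

```python
import os


def thread_option(requested):
    """Return a --parallel=thread:<ncpu> string, capped at the detected core count.

    Hypothetical helper for illustration only; not part of EMAN2.
    """
    available = os.cpu_count() or 1  # cpu_count() may return None
    ncpu = max(1, min(requested, available))
    return "--parallel=thread:%d" % ncpu


print(thread_option(4))
```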

MPI

Sorry, we haven't had a chance to finish this yet. For the moment you will have to use the Distributed Computing mode on clusters, which may or may not be possible depending on your cluster's network configuration. Direct MPI support is planned by fall 2010.

Distributed Computing

Quickstart

For those not wanting to read or understand the parallelism method, here are the basic required steps:

1. On the machine with the data, make a scratch directory on a local hard drive, cd to it, and run e2parallel.py dcserver --port=9990 --verbose=2
2. Make another scratch directory on a local hard drive, cd to it, and run e2parallel.py dcclient --host=<server hostname>
3. Repeat #2 for each core or machine you want to run tasks on
4. Run your parallel job, like 'e2refine.py', with the --parallel=dc:localhost:9990 option
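
The steps above can be collected into a small planning script. This is only a sketch: it prints the commands rather than running them, quickstart_commands is a hypothetical helper, and the client command is copied from step 2 as written (other sections of this page spell the same option --server, so check the e2parallel.py help output on your installation).

```python
def quickstart_commands(server_host, n_clients, port=9990):
    """Return the quickstart commands in the order they would be run.

    Hypothetical helper for illustration only; not part of EMAN2.
    The server and the clients are each meant to be started from their own
    scratch directory on a local hard drive.
    """
    commands = ["e2parallel.py dcserver --port=%d --verbose=2" % port]
    # One client per core or machine that should accept tasks.
    commands += ["e2parallel.py dcclient --host=%s" % server_host] * n_clients
    # The job itself; any other e2refine.py options are omitted here.
    commands.append("e2refine.py --parallel=dc:localhost:%d" % port)
    return commands


for command in quickstart_commands("localhost", n_clients=2):
    print(command)
```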

Notes

If you need to restart the server for some reason, that's fine. As long as it is restarted within about 5 minutes, it should be harmless to stop it with ^c and restart it.
Make sure the same version of EMAN2 is installed on all machines, if multiple machines are being used as clients.
If you need to stop the 'e2refine' program, you can run 'e2parallel.py killall' to cancel any pending jobs on the server after stopping e2refine.
You can add or remove clients at any time during a run
When you are done running jobs, exit the server (^c), then run 'e2parallel.py dckillclients' from the server directory, and let it run for a minute or two. This will tell the clients to shut down. If you plan to do another run relatively soon, you can just leave the server and clients running.

You should really consider reading the detailed instructions below :^)

Introduction

This is the sort of parallelism made famous by projects like SETI@home and Folding@home. The general idea is that you have a list of small jobs to do, and a bunch of computers with spare cycles willing to help out with the computation. The number of computers willing to do computations may vary with time, and a computer may agree to do a computation but then fail to complete it. This is a very flexible parallelism model, which can be adapted to individual computers with multiple cores as well as Linux clusters or sets of workstations lying around the lab.

There are 3 components to this system:

User Application (customer) <==> Server <==> Compute Nodes (client)

The user application (e2refine.py for example) builds a list of computational tasks that it needs to have completed, then sends the list to the server. Compute nodes with nothing to do then contact the server and request tasks to compute. The server sends the tasks out to the clients. When the client finishes the requested computation, results are sent back to the server. The user application then requests the results from the server and completes processing. As long as the number of tasks to complete is larger than the number of clients servicing requests, this is an extremely efficient infrastructure.

Internally things are somewhat more complicated and tackle issues such as data caching on the clients, how to handle clients that die in the middle of processing, etc., but the basic concept is quite straightforward.
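
The customer/server/client flow described above can be illustrated with a toy, single-process model. This is a sketch under simplifying assumptions, not EMAN2's implementation: threads stand in for compute nodes, a queue stands in for the server, and there is no networking, data caching, or handling of clients that die. The name run_clients is hypothetical.

```python
import queue
import threading


def run_clients(tasks, n_clients, compute):
    """Toy in-process model: clients pull tasks from a shared server queue."""
    pending = queue.Queue()
    for task_id, payload in enumerate(tasks):
        pending.put((task_id, payload))
    results = {}
    lock = threading.Lock()

    def client():
        while True:
            try:
                task_id, payload = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            value = compute(payload)
            with lock:
                results[task_id] = value

    workers = [threading.Thread(target=client) for _ in range(n_clients)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    # The "customer" collects results in the original task order.
    return [results[i] for i in range(len(tasks))]


print(run_clients([1, 2, 3, 4, 5], n_clients=2, compute=lambda x: x * x))
```

Because each client simply asks for the next task whenever it is idle, the work balances itself as long as there are more tasks than clients, which is the efficiency argument made above.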

With any of the e2parallel.py commands below, you may consider adding the --verbose=1 (or 2) option to see more of what it's doing.

How to use Distributed Computing in EMAN2

To use distributed computing, there are three basic steps:

Run a server on a machine that the clients can communicate with
Run some number of clients pointing at the server
Run an EMAN2 program with the --parallel=dc:host:port option

What follows are specific instructions for doing this under 2 different scenarios.

Using DC on a Linux cluster

This can be a bit tricky, as there are several possible approaches, depending on the configuration of your cluster:

If the individual compute nodes can communicate directly (through the head node) with your workstation, you may consider running the server and the e2refine.py command directly on your workstation, and launching only clients on the cluster. The clients will communicate data among themselves using the high-performance internal network on the cluster, so this approach doesn't require much more network bandwidth than copying the data to the cluster and copying the results back when you're done. It also has the convenience that all data and results remain on your computer, where you can monitor them.
If the individual compute nodes cannot communicate outside the cluster, then you will need to use e2scp.py to copy your project data to the disk on the cluster. If you are permitted to run small single-CPU commands directly on the storage/head node (attached to the physical storage), then running the server and e2refine command on that node is the best option.
If that isn't allowed on your cluster either, then things become a bit more difficult. You will need to launch the server, e2refine and the clients all from the queuing system script. Given the diversity of different cluster configurations, it is difficult to give specific details on this process, but the general comments below should give you something to start with.

General method of using DC computing:

The server is run with the e2parallel.py dcserver --port=9990 command.
The clients are run with the e2parallel.py dcclient --port=9990 --server=<server hostname> command.
The actual refinement is run with the 'e2refine.py --parallel=dc:<server hostname>:9990' command.

Notes:

The server MUST be run from a directory on a hard drive physically attached to the computer (not a network mounted drive). This directory should not require large amounts of disk space. This need not be the same drive that stores the data.
The clients MUST similarly be run from a directory on a physically attached drive. If you are running multiple clients on a single cluster node with multiple cores, all of the clients should be run from the SAME directory so they can share a data cache. This directory may get quite large, as it will be used to cache data during processing to reduce network load.
If you need to stop the server, do so nicely with '^c' or 'kill <pid>'. Do NOT 'kill -9 <pid>'. You may stop and restart the server without disturbing the running refinement job, so long as it isn't down for more than 5-10 minutes.
Clients should also be killed 'nicely'. Clients may be started or stopped at any time without disturbing the refinement run.
If you decide to kill the refinement in the middle, you may also wish to run the 'e2parallel.py killall' command from the server directory to remove any incomplete tasks from the server.
If you are forced to run the server on a compute-node with the data stored on a network mounted drive, then additional precautions MUST be taken:
- When you finish the job, nicely kill the server, then immediately run 'e2bdb.py -c' on the same node. After this, it will be safe to access the files from the head-node again.
- While the job is running, you must not access any of the project files, or database corruption may result. On a shared filesystem, only one node may have read/write access to databases at one time. This means if you need to check the progress of the running job, you must be very careful not to do anything that causes data to be written to the project. A safer alternative, which may be possible on your cluster, is to log in to the node running the server and check the files from there. See the warning about the database for more info on this topic.
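
The advice in the notes above about stopping servers and clients 'nicely' can be sketched from a Python launcher script as follows. This is an illustration only: stop_nicely is a hypothetical helper, and a sleeping Python process stands in for a real server or client. It sends SIGTERM, the signal a plain 'kill <pid>' sends, and never escalates to SIGKILL ('kill -9').

```python
import signal
import subprocess
import sys


def stop_nicely(process, timeout=30):
    """Ask a process to exit with SIGTERM and wait for it to finish.

    Raises subprocess.TimeoutExpired if the process has not exited in time;
    it deliberately never falls back to SIGKILL.
    """
    process.send_signal(signal.SIGTERM)
    return process.wait(timeout=timeout)


# Stand-in for a long-running server or client process.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
print("exit status:", stop_nicely(child))
```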

Using DC on a set of workstations

The server should run on a computer with a direct physical connection to the storage
All of the clients must be able to make a network connection to the server machine
Run a server on the desired machine with e2parallel.py dcserver, from an empty directory on the local hard drive
The server will print a message saying what port it's running on. This will usually be 9990. If it is something else, make a note of it.
Run one client for each core you want to use for processing on each computer: e2parallel.py dcclient --server=<server> --port=9990 (replace the server hostname and port with the correct values)
Run your EMAN2 programs with the option --parallel=dc:<server>:9990 (again, use the right port number and server hostname)
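
For a lab with several workstations, the "one client per core on each computer" rule can be written down as a small plan. The sketch below only builds the command strings given in the steps above; client_commands, the host names, and the core counts are all hypothetical, and how the commands are actually started on each machine (by hand, over ssh, or otherwise) is left open.

```python
def client_commands(server, cores_per_host, port=9990):
    """Map each host to the dcclient commands to start there, one per core.

    Hypothetical helper for illustration only; not part of EMAN2.
    """
    command = "e2parallel.py dcclient --server=%s --port=%d" % (server, port)
    return {host: [command] * cores for host, cores in cores_per_host.items()}


# Hypothetical host names and core counts.
plan = client_commands("server1", {"ws1": 4, "ws2": 2})
for host, commands in plan.items():
    print(host, len(commands), "x", commands[0])
```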

For all of the above, once you have finished running your jobs, kill the server, then run 'e2parallel.py dckillclients' from the same directory. When it stops spewing out 'client killed' messages, you can kill this server.

IF THIS IS NOT WORKING FOR YOU, PLEASE FOLLOW THESE DEBUGGING INSTRUCTIONS

- Revision 15 as of 2010-07-15 19:18:18
  Size: 8639
  Editor: SteveLudtke
  Comment:
+ Revision 16 as of 2010-07-19 14:06:02
  Size: 10261
  Editor: SteveLudtke
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 15:
-=== GPGPU Computing ===
While not precisely a parallelism methodology, this technique makes use of the GPU (graphics processing unit) common in most modern PC's, to dramatically
accelerate many image processing algorithms. At present (summer 2009) we are at the initial stages of implementing GPGPU support using Nvidia's CUDA 
infrastructure. We will likely move to OpenCL in future as it becomes a stable platform. We have only implemented a few algorithms using this methodology
to date, and we will need to implement and optimize virtually all of them before this becomes a viable platform for day-to-day use. However, we have demonstrated
speedups of as much as 100x in select algorithms, meaning a desktop PC with a GPU could easily become the equivalent of a small Linux cluster. While all of
the GPGPU code is available in the nightly source snapshots, you are encouraged to contact sludtke@bcm.edu if you are interested in experimenting with this
technology.
+Note that not all programs will run in parallel. If a program does not accept the --parallel option, then it is not parallelized.
-Line 26:
+Line 18:
-As of 7/15/2010 this is now supported !  If you ''only'' want to use multiple cores on your local machine, just
put 'thread:<ncpu>' in the 'Parallel' box, or specify the '--parallel=thread:<ncpu> option on the command line. <ncpu> should, of course, be replaced
with the number of cores you wish to use.
+''As of 7/15/2010''
If you ''only'' want to use multiple cores on your local machine (and not simultaneously use processors from other computers), just
put 'thread:<ncpu>' in the 'Parallel' box in e2workflow, or specify the '--parallel=thread:<ncpu> option on the command line. <ncpu> should, of course, be replaced
with the number of cores you wish to use.
-Line 30:
+Line 23:
+=== MPI ===
Sorry, we haven't had a chance to finish this yet. For the moment you will have to use the Distributed Computing mode on clusters, which may or may
not be possible depending on your cluster's network configuration. Direct MPI support is planned by fall 2010.
-Line 42:
+Line 38:
- * If you need to kill the server and restart it for some reason, that's fine. As long as it is restarted within about 5 minutes, it should be harmless
+ * If you need to restart the server for some reason, that's fine. As long as it is restarted within about 5 minutes, it should be harmless to stop it with ^c and restart it
-Line 46:
+Line 42:
- * When you are done running jobs, kill the server, then run 'e2parallel.py dckillclients' from the server directory, and let it run for a minute or two. This will tell the clients to shut down. If you plan to do another run relatively soon, you can just leave the server and clients running.
+ * When you are done running jobs, exit the server (^c), then run 'e2parallel.py dckillclients' from the server directory, and let it run for a minute or two. This will tell the clients to shut down. If you plan to do another run relatively soon, you can just leave the server and clients running.
-Line 48:
+Line 44:
-You should really consider reading the detailed instructions below, though :^)
+You should really consider reading the detailed instructions below :^)
-Line 68:
+Line 64:
-With any of the e2parallel.py commands below, you may consider adding the --verbose=1 option to see more of what it's doing.
+With any of the e2parallel.py commands below, you may consider adding the --verbose=1 (or 2) option to see more of what it's doing.
-Line 74:
+Line 70:
- * run an EMAN2 program with the --parallel option
+ * run an EMAN2 program with the --parallel=dc:host:port option
-Line 76:
+Line 72:
-What follows are specific instructions for doing this under 3 different scenarios.

===== Using DC on a single multi-core workstation =====
 * Ideally your data will be stored on a hard drive physically connected to the workstation (not on a shared network drive)
 * make an empty directory on a local hard drive
 * Run a server on the workstation ''e2parallel.py dcserver'' from the empty directory you just created
 * The server will print a message saying what port it's running on. This will usually be 9990. If it is something else, make a note of it.
 * Run one client for each core you want to use for processing : ''e2parallel.py dcclient --server=localhost --port=9990'' (replace the port with the correct number if necessary)
 * Run your EMAN2 programs with the option ''--parallel=dc:localhost:9990'' (again, use the right port number)
+What follows are specific instructions for doing this under 2 different scenarios.
-Line 87:
+Line 75:
- * The server should run on the node (often the head node or a specialized 'storage node') with a direct physical connection to the storage
 * If you want to use clients from multiple clusters, then remember all of the clients must be able to make a network connection to the server machine
 * Run a server on the head-node ''e2parallel.py dcserver'' in an empty directory on the local hard drive
 * The server will print a message saying what port it's running on. This will usually be 9990. If it is something else, make a note of it.
 * Run one client for each core you want to use for processing on each node : ''e2parallel.py dcclient --server=<server> --port=9990'' (replace the server hostname and port with the correct values)
 * Run your EMAN2 programs with the option ''--parallel=dc:<server>:9990'' (again, use the right port number and server hostname)
+This can be a bit tricky, as there are several possible configurations, depending on the configuration of your cluster:
 * If the individual compute nodes can communicate directly (through the head node) to your workstation, you may consider running the server and the e2refine.py command directly on your workstation, and launch only clients on the cluster. The clients will communicate data among themselves using the high-performance internal network on the cluster, so this approach doesn't require much more network bandwidth than copying the data to the cluster, and copying the results back when you're done, and has the convenience that all data and results remain on your computer where you can monitor them.
 * If the individual compute nodes cannot communicate outside the cluster, then you will need to use e2scp.py to copy your project data to the disk on the cluster. If you are permitted to run small single-CPU commands directly on the storage/head node (attached to the physical storage), then running the server and e2refine command on that node is the best option.
 * If that isn't allowed on your cluster either, then things become a bit more difficult. You will need to launch the server, e2refine and the clients all from the queuing system script. Given the diversity of different cluster configurations, it is difficult to give specific details on this process, but the general comments below should give you something to start with.

General method of using DC computing:

 * The server is run with the ''e2parallel.py dcserver --port=9990'' command. 
 * The clients are run with the ''e2parallel.py dcclient --port=9990 --server=<server hostname>'' command. 
 * The actual refinement is run with the 'e2refine.py --parallel=dc:<server hostname>:9990' command. 

Notes:
 * The server MUST be run from a directory on a hard drive physically attached to the computer (not a network mounted drive). This directory should not require large amounts of disk space. This need not be the same drive that stores the data.
 * The clients MUST similarly be run from a directory on a physically attached drive. If you are running multiple clients on a single cluster node with multiple cores, all of the clients should be run from the SAME directory so they can share a data cache. This directory may get quite large, as it will be used to cache data during processing to reduce network load. 
 * If you need to stop the server, do so nicely with '^c' or 'kill <pid>'. Do NOT 'kill -9 <pid>'. You may stop and restart the server without disturbing the running refinement job, so long as it isn't down for more than 5-10 minutes.
 * Clients should also be killed 'nicely'. Clients may be started or stopped at any time without disturbing the refinement run.
 * If you decide to kill the refinement in the middle, you may also wish to run the 'e2parallel.py killall' command from the server directory to remove any incomplete tasks from the server.
 * If you are forced to run the server on a compute-node with the data stored on a network mounted drive, then additional precautions MUST be taken:
  * When you finish the job, nicely kill the server, then immediately run 'e2bdb.py -c' on the same node. After this, it will be safe to access the files from the head-node again.
  * While the job is running, you must not access any of the project files, or database corruption may result. On a shared filesystem, only one node may have read/write access to databases at one time. This means if you need to check the progress of the running job, you must be very careful not to do anything that causes data to be written to the project. A safer alternative which may be possible on your cluster is to log in to the node running the server, and check the files from there. see the [[EMAN2/DatabaseWarning|warning about the database]] for more info on this topic.
-Line 106:
+Line 108:
-=== MPI ===
Sorry, we haven't had a chance to finish this yet. For the moment you will have to use the Distributed Computing mode on clusters.