THIS METHOD IS NO LONGER WORKING IN EMAN2.1 - we are considering retiring it. It was quite powerful and useful many years ago, but rarely makes sense in the current linux cluster ecosystem. If you are actively using it, we'd very much like to discuss the issue, and understand what motivates this in your setup.

Distributed Computing

This is by far the most flexible parallelism mechanism in EMAN2. It can permit you to pool individual multi-core computers, sets of workstations around the lab, and even linux clusters into a single ad-hoc computing resource, and the set of participating machines can change while jobs are running.

However, it is also the most difficult method to use, requires a bit of effort to set up, and may not work in all environments. If you are running on a cluster at a shared supercomputing center, you will probably want to use MPI instead. If you are trying to use multiple workstations in your lab as a sort of ad-hoc cluster, this method works well (it can be used on clusters too, if you have issues with MPI).

Quickstart

For those who do not want to read about how this parallelism method actually works, here are the basic required steps:

  1. on the machine with the data, make a scratch directory on a local hard drive, cd into it, and run: e2parallel.py dcserver --port=9990 --verbose=2
  2. on a machine you want to do the computing, make another scratch directory on a local hard drive, cd into it, and run: e2parallel.py dcclient --host=<server hostname>
  3. repeat step 2 once for each core, on every machine you want to run tasks on
  4. run your parallel job, e.g. e2refine.py, with the option --parallel=dc:localhost:9990 (see the example commands below)
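
A concrete, copy-and-paste version of the steps above (a sketch only: the hostname 'bigdata' and the /scratch paths are placeholders for your own machines and directories):

    # Step 1 - on the machine holding the data (here called 'bigdata'):
    mkdir -p /scratch/dcserver && cd /scratch/dcserver
    e2parallel.py dcserver --port=9990 --verbose=2

    # Steps 2 and 3 - on each compute machine, once per core you want to use,
    # each client in its own scratch directory:
    mkdir -p /scratch/dcclient1 && cd /scratch/dcclient1
    e2parallel.py dcclient --host=bigdata

    # Step 4 - back on 'bigdata', run the actual job, pointing it at the server:
    e2refine.py <your usual options> --parallel=dc:localhost:9990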

Notes

You should really consider reading the detailed instructions below :^)

Introduction

This is the sort of parallelism made famous by projects like SETI@home and Folding@home. The general idea is that you have a list of small jobs to do and a bunch of computers with spare cycles willing to help out with the computation. The set of computers willing to do computations may vary with time, and a computer may agree to do a computation but then fail to complete it. This is a very flexible parallelism model, which can be adapted to individual computers with multiple cores as well as to linux clusters or sets of workstations lying around the lab.

There are 3 components to this system:

User Application (customer) <==> Server <==> Compute Nodes (client)

The user application (e2refine.py for example) builds a list of computational tasks that it needs to have completed, then sends the list to the server. Compute nodes with nothing to do then contact the server and request tasks to compute. The server sends the tasks out to the clients. When the client finishes the requested computation, results are sent back to the server. The user application then requests the results from the server and completes processing. As long as the number of tasks to complete is larger than the number of clients servicing requests, this is an extremely efficient infrastructure.

Internally things are somewhat more complicated, dealing with issues such as caching data on the clients and handling clients that die in the middle of processing, but the basic concept is quite straightforward.

With any of the e2parallel.py commands below, you may consider adding the --verbose=1 (or 2) option to see more of what it's doing.
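
For example (the hostname myserver is a placeholder for your own server machine), to run a chattier client:

    e2parallel.py dcclient --host=myserver --verbose=2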

How to use Distributed Computing in EMAN2

To use distributed computing, there are three basic steps:

  1. launch a server (e2parallel.py dcserver) on the machine where your data is stored
  2. launch one client (e2parallel.py dcclient) for each core you want to use, on any machines that can reach the server over the network
  3. run your EMAN2 program (e2refine.py, for example) with the --parallel=dc:<server hostname>:<port> option

What follows are specific instructions for doing this under 2 different scenarios.

Using DC on a linux cluster

This can be a bit tricky, as there are several possible approaches, depending on how your cluster is configured:

General method of using DC computing:
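
As one concrete illustration of the general pattern (a sketch only, not part of EMAN2: the node names node01..node04, the /scratch/$USER paths, the core count, and the use of passwordless ssh are assumptions about your environment), the Quickstart commands can be scripted across the nodes of a cluster like this:

    # A sketch only: assumes passwordless ssh to nodes node01..node04, a local
    # /scratch/$USER directory on every node, and a dcserver already running on
    # this machine (started with: e2parallel.py dcserver --port=9990)
    SERVER=$(hostname)        # the machine running dcserver (and holding the data)
    CORES_PER_NODE=8          # assumption: set this to match your hardware
    for NODE in node01 node02 node03 node04 ; do
        for i in $(seq 1 $CORES_PER_NODE) ; do
            # one client per core, each in its own scratch directory (see Quickstart)
            ssh $NODE "mkdir -p /scratch/$USER/dc_$i && cd /scratch/$USER/dc_$i && nohup e2parallel.py dcclient --host=$SERVER >client.log 2>&1 </dev/null &"
        done
    done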

Notes:

Using DC on a set of workstations

For all of the above, once you have finished running your jobs, kill the server, then run 'e2parallel.py dckillclients' from the same directory. This process responds to clients as they check in for work and tells them to exit; when it stops printing 'client killed' messages, all of the clients have shut down and you can kill it as well.
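
A minimal sketch of that shutdown sequence (run in the same scratch directory the dcserver was started from; how you stop the two server processes, Ctrl-C or kill, is up to you):

    # 1. stop the dcserver process itself (Ctrl-C in its terminal, or kill its PID)
    # 2. in the dcserver's scratch directory, run the kill command; as each client
    #    checks in for work it is told to exit, and a 'client killed' message appears
    e2parallel.py dckillclients
    # 3. when the 'client killed' messages stop appearing, Ctrl-C this process too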

IF THIS IS NOT WORKING FOR YOU, PLEASE FOLLOW THESE DEBUGGING INSTRUCTIONS
