Differences between revisions 1 and 2
Revision 1 as of 2009-10-08 02:22:13
Size: 2501
Editor: SteveLudtke
Comment:
Revision 2 as of 2009-12-11 02:39:35
Size: 2480
Editor: SteveLudtke
Comment:

If you are having difficulties with parallelism in a setup where the nodes of a Linux cluster run the clients and the server runs on a separate workstation (in this example, called frodo.bu.edu), work through the following steps.

  1. When you run the server, you should see a message like "server running on port 9990". As clients (the cluster nodes) connect, you will see a spinning '-/|\' sequence, along with a count of the number of nodes the server has seen so far. If you don't see 'server running on port 9990', you'll have to contact me.
  2. The first thing to do is to make sure that the workstation can talk to itself. While the server is running, run e2parallel.py dcclient --server=frodo.bu.edu --port=9990 --verbose=1 in another window on the workstation itself.

    1. If this produces a slowly spinning response and a (1) on the server, then at least this part is working; you can kill the client and proceed to step 3.
    2. If THAT doesn't work, you can try e2parallel.py dcclient --server=localhost --port=9990 --verbose=1

      1. If THAT works, then you likely have a firewall configuration problem on your machine preventing outside connections to port 9990, which needs to be resolved.
      2. If even that doesn't work, you'll have to contact me again, because something mysterious is going on.
  3. OK, we've established that your machine can accept connections on port 9990 from itself. Next, log in to the cluster's head node and run e2parallel.py dcclient --server=frodo.bu.edu --port=9990 --verbose=1

    1. If that works, proceed to step 4.
    2. If that doesn't work, then either your workstation's firewall is blocking connections to port 9990 from other machines, or something on the network between the two machines is blocking the connection.
  4. Now, manually log in to one of the cluster's compute nodes and run e2parallel.py dcclient --server=frodo.bu.edu --port=9990 --verbose=1

    1. If that works, then you should be all set, and you can run clients on all of your cluster nodes.
    2. If that fails, most likely your cluster's head node isn't configured to forward network connections from the compute nodes to the outside world. The head node probably needs something like  iptables -t nat -A POSTROUTING -o eth1 -j MASQUERADE  added to its /etc/rc.local file, and IP forwarding must be enabled on it (for example with  sysctl -w net.ipv4.ip_forward=1 ). The compute nodes need something like  route add default gw 192.168.0.132  in their rc.local files. Note that 'eth1' (the head node's interface to the outside network) and 192.168.0.132 (the head node's address on the internal cluster network) will need to be adapted to your specific configuration.
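
When a dcclient test in steps 2 to 4 fails, it can help to check whether a bare TCP connection to the server's port can be opened at all, independently of EMAN2. The following is a minimal Python 3 sketch of such a check, assuming the server accepts ordinary TCP connections on port 9990; the function name can_connect is hypothetical and not part of EMAN2. It only tests name resolution, routing and firewalls, not the e2parallel protocol itself.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper, not part of EMAN2: it checks plain network
    reachability only and does not speak the e2parallel protocol.
    """
    try:
        # create_connection resolves the host name and tries each
        # resulting address in turn until one accepts the connection
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, timeouts and unresolvable host names
        # are all subclasses of OSError in Python 3
        return False
```

For example, calling can_connect("frodo.bu.edu", 9990) on the workstation, then on the head node, then on a compute node mirrors steps 2, 3 and 4: the first location where it returns False is roughly where the connection is being blocked.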

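The branching in steps 2 to 4 can also be written out as a small decision function, which may make it easier to see which test rules out which cause. This is a hypothetical Python 3 sketch that simply restates the steps above; the function name diagnose and its parameters are illustrative, and each argument is the observed outcome of one dcclient test (True if the client connected, False if not, None if the test has not been run yet).

```python
from typing import Optional


def diagnose(self_by_name: bool,
             self_by_localhost: Optional[bool] = None,
             from_head_node: Optional[bool] = None,
             from_compute_node: Optional[bool] = None) -> str:
    """Map the outcome of each dcclient test to the likely cause."""
    # Step 2: client on the workstation, using its own host name
    if not self_by_name:
        if self_by_localhost is None:
            return "retry step 2 with --server=localhost"
        if self_by_localhost:
            # Works via localhost but not via the host name
            return "workstation firewall is blocking outside connections to the port"
        # Neither form works on the workstation itself
        return "something else is wrong: ask for help"
    # Step 3: client on the cluster head node
    if from_head_node is None:
        return "run step 3 on the head node"
    if not from_head_node:
        return "workstation firewall or the network between the machines is blocking the port"
    # Step 4: client on a single compute node
    if from_compute_node is None:
        return "run step 4 on a compute node"
    if not from_compute_node:
        return "head node is not forwarding (NAT) connections from the compute nodes"
    return "all set: start clients on all cluster nodes"
```

For instance, diagnose(True, from_head_node=True, from_compute_node=False) points at the head node's NAT configuration described in step 4.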
EMAN2/Parallel/Debug (last edited 2012-03-20 16:45:00 by JohnFlanagan)