Diff for "CIBRClusters/Queue"

Differences between revisions 4 and 5

Batch Queuing System

A Batch Queuing System (BQS) manages all of the shared compute resources on a cluster. It allocates processors (and sometimes memory and other resources) to specific jobs in a specific order based on Queuing policies. When you have a job to run on the cluster, you prepare a short script file, describing the resources you need, and the commands to run, then submit it to the BQS. The BQS will then launch the job as soon as the necessary resources are available, and collect the standard output and error output into files in the folder where the job was launched from.

The system we use (Maui/Torque) is a descendent of a very traditional BQS known as PBS (and later OpenPBS). Some variant of this system is used on a majority of the Linux clusters around the world.

IMPORTANT: The BQS must be used for all jobs run on the cluster! If you are found running jobs directly on nodes without using PBS, your account may be temporarily or permanently disabled. In some limited cases (see below) you can run small jobs on the head node.

Programs for Job Management

The main programs used to access the queuing systems are:

Program	Function
qsub <your job file>	Submit a new job
qdel <job id>	Kill or remove a queued job
qstat	Info on running jobs
qstat -a	Info on ALL jobs in the queue
qstat -an	All jobs with node allocation information
qstat -q	List all available queues
pbsnodes -a	List status of all system nodes
checkjob <job id>	Gives useful details about a job
showq	Another way of looking at the queues

Running jobs

This is the sequence of events that takes place when you run a job on the cluster:

Run showq or qstat -a to see the current cluster load level and determine how many cores you will request for your job.
Create a ".pbs" file containing your job (see below)
qsub file.pbs -q queuename
wait for your job to run
showq or qstat -a can be used to monitor.
- "Q" means the job is waiting to run
- "R" means the job is running
- "C" or "A" mean the job has failed for some reason
Look at the file.pbs.eXXXXX and file.pbs.oXXXXX to see the output generated by your program

NOTE: The .eXXXXX and .oXXXXX collect all of the information printed to the terminal if the job were run interactively. Some programs generate massive amounts of output like this, and in some cases this can cause your jobs to crash, or the output files not to get copied back to the head-node. If your program generates massive console output which you need to keep, use '>logfile.txt' to redirect the output directly to a file instead of using the .oXXXXX mechanism. If you do not need to keep the output, you may just '>/dev/null' which will simply discard it.

Making a .pbs File

The job file submitted with qsub looks like this:

#!/bin/sh

#PBS -N NAME_OF_THE_JOB
#PBS -l nodes=10:ppn=4      
#PBS -l walltime=24:00:00
#PBS -q longjob

# This job's temporary working directory. You may also work in any of your
# home directories
echo Working directory is $PBS_O_WORKDIR

echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
echo This job runs on the following processors:
echo "PBS_NODEFILE=" $PBS_NODEFILE

cd <some other folder>
my_first_program
my_second_program
mpirun my_mpi_program

This example script asks for 10 nodes with 4 processors on each, for a total of 40 processors. Both CIBR clusters are equipped with 4 GB of RAM for each processor
my_first_program and my_second_program are the command-lines you would normally type if running a program interactively. These programs will be executed sequentially on the first processor you are assigned.
If your program supports MPI, then you can mpirun my_mpi_program, and the program will make use of all of the assigned processors. Otherwise, my_first_program would only run on a single processor (unless it has some internal mechanism for using more processors, as, for example, EMAN2 does).
You need to take memory requirements into account. If your program will require more than ~3.5 GB of RAM per core, you will need to allocate additional cores (which go unused) so your job is provided the requisite amount of RAM. If you do this, you will be "charged" for the additional CPU time which you are not using. For example, if you have a single threaded job which needs 11 GB of RAM, you need to allocate 3 cores for your job, even though you are using only 1 of them. If the job runs for 10 hours, your account will be "charged" for 30 CPU-hours of time.

Interactive Jobs (eg - matlab)

Sometimes a user will want to run an interactive session on one of the cluster nodes, taking advantage of the large memory or large number of cores available on these machines. Matlab is one software package people commonly use this way. To do this, you still must use the batch queuing system, but you make a special request for an interactive session. Requesting such an interactive batch job is quite straightforward. For example to allocate all 16 cores on one node interactively:

qsub -I -l nodes=1:ppn=16 -q dque

After typing this command, the job will be queued. If a node is available immediately, you will get a terminal prompt directly on the node you've been allocated. If the cluster is busy, the command will sit and wait until a node is free, then return with a prompt when available.

Be warned, however, that with the example above, for instance, you will be charged for 16 CPU-hr for every hour you leave this shell running. If you accidentally forget to log out of this shell and go home for the weekend, you will come back monday having used ~1000 CPU-hr of your allocation. For this reason, you may wish to consider adding a :walltime=4:00:00 (for example) to automatically close the process (killing anything you might be doing) after 4 hours.

You can, of course, also just request a single processor with :ppn=1. If you do this, then you are responsible for insuring that your job uses only a single core. Some software may detect that there are 16 cores available on the machine and attempt to automatically use more. If you request all (16 on Prism, 12 on Torus) of the cores on a single node, then you are free to use the full RAM, and all of the cores, including hyperthreading on the node.

Hyperthreading

For jobs that can run in parallel using threads, you can gain an additional performance boost by taking advantage of hyperthreading. If the node has 16 physical compute cores, you can actually pull out 10-20% more performance from each one if you run, for example, 32 threads rather than 16 threads on that node. However, if you want to do this there are a few things you need to keep in mind:

You MUST allocate ALL of the cores on a node to do this. If you ask for 4 cores, then use 8, you will be competing with another user who was allocated the other 12 cores on that node, unfairly slowing his job down. So, if you want to hyperthread, you must allocate all 16 (or 12) cores.
If your software doesn't make use of shared memory, keep in mind that doing this will decrease the amount of memory available per thread.

Running Jobs Directly on the Head Node

In some limited cases it is acceptable (and even encouraged) to run small jobs directly on the head-node, or in the case of Torus, on the 'store' node. The node with direct connection to the RAID array can access the data much more rapidly than the compute nodes, and without saturating the network and interfering with other jobs on the cluster. Here are examples of cases where it is permissible to use the head/store nodes:

Compiling/configuring software
Small pre-processing runs. If you need to, for example, reformat the data in a large file, such that very little CPU is used, but large files must be processed, you may do this on the head-node, provided the job takes <15 minutes. You are encouraged to do such pre-processing on your local workstation instead, before copying the data to the cluster in the first place, though.
SHORT test runs of software to estimate runtime (<15 minutes on no more than 4 cores). It is usually a better idea to do this using qsub -I as described above.
Data visualization. The cluster is not really configured for graphical applications, and it is usually a good idea to rsync your results back to your local machine for visualization. In some cases, however, it is much more convenient to just take a quick look from the head-node. This is permitted, but do not leave any graphical applications running on the head-node for extended periods of time.

-  ⇤ ← Revision 4 as of 2014-04-20 04:01:02 → 
  Size: 7087
  Editor: SteveLudtke
  Comment:
+   ← Revision 5 as of 2014-04-20 18:23:36 → ⇥
  Size: 8828
  Editor: SteveLudtke
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 6:
+'''IMPORTANT:''' The BQS must be used for all jobs run on the cluster!  If you are found running jobs directly on nodes without using PBS, your account may be temporarily or permanently disabled. In some limited cases (see below) you can run small jobs on the head node.
-Line 87:
+Line 89:
+=== Running Jobs Directly on the Head Node ===
In some limited cases it is acceptable (and even encouraged) to run small jobs directly on the head-node, or in the case of Torus, on the 'store' node. The node with direct connection to the RAID array can access the data much more rapidly than the compute nodes, and without saturating the network and interfering with other jobs on the cluster. Here are examples of cases where it is permissible to use the head/store nodes:

 * Compiling/configuring software
 * Small pre-processing runs. If you need to, for example, reformat the data in a large file, such that very little CPU is used, but large files must be processed, you may do this on the head-node, provided the job takes <15 minutes. You are encouraged to do such pre-processing on your local workstation instead, before copying the data to the cluster in the first place, though.
 * SHORT test runs of software to estimate runtime (<15 minutes on no more than 4 cores). It is usually a better idea to do this using qsub -I as described above.
 * Data visualization. The cluster is not really configured for graphical applications, and it is usually a good idea to ''rsync'' your results back to your local machine for visualization. In some cases, however, it is much more convenient to just take a quick look from the head-node. This is permitted, but do not leave any graphical applications running on the head-node for extended periods of time.