CIBRClusters/Queue

Batch Queuing System

A Batch Queuing System (BQS) manages all of the shared compute resources on a cluster. It allocates processors (and sometimes memory and other resources) to specific jobs in a specific order based on Queuing policies. When you have a job to run on the cluster, you prepare a short script file, describing the resources you need, and the commands to run, then submit it to the BQS. The BQS will then launch the job as soon as the necessary resources are available, and collect the standard output and error output into files in the folder where the job was launched from.

The system we use (Maui/Torque) is a descendent of a very traditional BQS known as PBS (and later OpenPBS). Some variant of this system is used on a majority of the Linux clusters around the world.

Programs for Job Management

The main programs used to access the queuing systems are:

qsub <your job file>	Submit a new job
qdel <job id>	Kill or remove a queued job
qstat	Info on running jobs
qstat -a	Info on ALL jobs in the queue
qstat -an	All jobs with node allocation information
qstat -q	List all available queues
pbsnodes -a	List status of all system nodes
checkjob <job id>	Gives useful details about a job
showq	Another way of looking at the queues

Running jobs

This is the sequence of events that takes place when you run a job on the cluster:

Run showq or qstat -a to see the current cluster load level and determine how many cores you will request for your job.
Create a ".pbs" file containing your job (see below)
qsub file.pbs -q queuename
wait for your job to run
showq or qstat -a can be used to monitor.
- "Q" means the job is waiting to run
- "R" means the job is running
- "C" or "A" mean the job has failed for some reason
Look at the file.pbs.eXXXXX and file.pbs.oXXXXX to see the output generated by your program

Making a .pbs File

The job file submitted with qsub looks like this:

#!/bin/sh

#PBS -N NAME_OF_THE_JOB
#PBS -l nodes=10:ppn=4      
#PBS -l walltime=24:00:00
#PBS -q longjob

# This job's temporary working directory. You may also work in any of your
# home directories
echo Working directory is $PBS_O_WORKDIR

echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
echo This job runs on the following processors:
echo "PBS_NODEFILE=" $PBS_NODEFILE

cd <some other folder>
my_first_program
my_second_program
mpirun my_mpi_program

This example script asks for 10 nodes with 4 processors on each, for a total of 40 processors. Both CIBR clusters are equipped with 4 GB of RAM for each processor
my_first_program and my_second_program are the command-lines you would normally type if running a program interactively. These programs will be executed sequentially on the first processor you are assigned.
If your program supports MPI, then you can mpirun my_mpi_program, and the program will make use of all of the assigned processors. Otherwise, my_first_program would only run on a single processor (unless it has some internal mechanism for using more processors, as, for example, EMAN2 does).
You need to take memory requirements into account. If your program will require more than ~3.5 GB of RAM per core, you will need to allocate additional cores (which go unused) so your job is provided the requisite amount of RAM. If you do this, you will be "charged" for the additional CPU time which you are not using. For example, if you have a single threaded job which needs 11 GB of RAM, you need to allocate 3 cores for your job, even though you are using only 1 of them. If the job runs for 10 hours, your account will be "charged" for 30 CPU-hours of time.

Interactive Jobs (eg - matlab)

Sometimes a user will want to run an interactive session on one of the cluster nodes, taking advantage of the large memory or large number of cores available on these machines. Matlab is one software package people commonly use this way. To do this, you still must use the batch queuing system, but you make a special request for an interactive session. Requesting such an interactive batch job is quite straightforward. For example to allocate all 16 cores on one node interactively:

qsub -I -l nodes=1:ppn=16 -q dque

After typing this command, the job will be queued. If a node is available immediately, you will get a terminal prompt directly on the node you've been allocated. If the cluster is busy, the command will sit and wait until a node is free, then return with a prompt when available.

Be warned, however, that with the example above, for instance, you will be charged for 16 CPU-hr for every hour you leave this shell running. If you accidentally forget to log out of this shell and go home for the weekend, you will come back monday having used ~1000 CPU-hr of your allocation. For this reason, you may wish to consider adding a :walltime=4:00:00 (for example) to automatically close the process (killing anything you might be doing) after 4 hours.

You can, of course, also just request a single processor with :ppn=1. If you do this, then you are responsible for insuring that your job uses only a single core. Some software may detect that there are 16 cores available on the machine and attempt to automatically use more. If you request all (16 on Prism, 12 on Torus) of the cores on a single node, then you are free to use the full RAM, and all of the cores, including hyperthreading on the node.

Hyperthreading

For jobs that can run in parallel using threads, you can gain an additional performance boost by taking advantage of hyperthreading. If the node has 16 physical compute cores, you can actually pull out 10-20% more performance from each one if you run, for example, 32 threads rather than 16 threads on that node. However, if you want to do this there are a few things you need to keep in mind:

You MUST allocate ALL of the cores on a node to do this. If you ask for 4 cores, then use 8, you will be competing with another user who was allocated the other 12 cores on that node, unfairly slowing his job down. So, if you want to hyperthread, you must allocate all 16 (or 12) cores.
If your software doesn't make use of shared memory, keep in mind that doing this will decrease the amount of memory available per thread.