CIBRClusters/Queue

Batch Queuing System

A Batch Queuing System (BQS) manages all of the shared compute resources on a cluster. It allocates processors (and sometimes memory and other resources) to specific jobs in a specific order based on Queuing policies. When you have a job to run on the cluster, you prepare a short script file, describing the resources you need, and the commands to run, then submit it to the BQS. The BQS will then launch the job as soon as the necessary resources are available, and collect the standard output and error output into files in the folder where the job was launched from.

The system we use (Maui/Torque) is a descendent of a very traditional BQS known as PBS (and later OpenPBS). Some variant of this system is used on a majority of the Linux clusters around the world.

Running Jobs Directly on the Head Node

In some limited cases it is acceptable (and even encouraged) to run small tasks directly on the head-node, or in the case of Torus, on the 'store' node. The node with direct connection to the RAID array can access the data much more rapidly than the compute nodes, and without saturating the network and interfering with other jobs on the cluster. Here are examples of cases where it is permissible to use the head/store nodes:

Compiling/configuring software
Small pre-processing runs. If you need to, for example, reformat the data in a large file, such that very little CPU is used, but large files must be processed, you may do this on the head-node, provided the job takes <15 minutes. You are encouraged to do such pre-processing on your local workstation instead, before copying the data to the cluster in the first place, though.
SHORT test runs of software to estimate runtime (<15 minutes on no more than 4 cores). It is usually a better idea to do this using qsub -I as described below.
Data visualization. The cluster is not really configured for graphical applications, and it is usually a good idea to rsync your results back to your local machine for visualization. In some cases, however, it is much more convenient to just take a quick look from the head-node. This is permitted, but do not leave any graphical applications running on the head-node for extended periods of time.

On Torus (and Gauss) it is important to note that the head node (the one you log into when ssh'ing to the cluster) does NOT have improved access to the disk. On these machines there are separate storage nodes. On Torus, this node is called 'store'. To use it, ssh to torus, and from there, 'ssh store', and run your small disk-intensive tasks there. On Gauss, there are 2: "store1" and "store2".

IMPORTANT: The BQS must be used for all jobs run on the cluster! If you are found running compute jobs directly on nodes without using PBS, your account may be temporarily or permanently disabled.

Programs for Job Management

The main programs used to access the queuing systems are:

Program	Function
qsub <your job file> -q <queuename>	Submit a new job
qdel <job id>	Kill or remove a queued job
qstat	Info on running jobs
qstat -a	Info on ALL jobs in the queue
qstat -an	All jobs with node allocation information
qstat -q	List all available queues
pbsnodes -a	List status of all system nodes
checkjob <job id>	Gives useful details about a job
showq	Another way of looking at the queues

Running jobs

This is the sequence of events that takes place when you run a job on the cluster:

Run showq or qstat -a to see the current cluster load level and determine how many cores you will request for your job.
Create a ".pbs" file containing your job (see below)
qsub file.pbs -q queuename
wait for your job to run
showq or qstat -a can be used to monitor.
- "Q" means the job is waiting to run
- "R" means the job is running
- "C" or "A" mean the job has failed for some reason
Look at the file.pbs.eXXXXX and file.pbs.oXXXXX to see the output generated by your program

NOTE: The .eXXXXX and .oXXXXX collect all of the information printed to the terminal if the job were run interactively. Some programs generate massive amounts of output like this, and in some cases this can cause your jobs to crash, or the output files not to get copied back to the head-node. If your program generates massive console output which you need to keep, use '>logfile.txt' to redirect the output directly to a file instead of using the .oXXXXX mechanism. If you do not need to keep the output, you may just '>/dev/null' which will simply discard it.

How Many Nodes to Request

At large supercomputing centers, if you have 1000 jobs to run, you just dump them all into the queue, and they will run when they can (assuming you have time left in your allocation). Usually when you submit jobs in these environments you will have to wait hours or days (depending on the request) before your job(s) actually start running. The goal at such centers is to keep the computer as occupied as possible to keep their efficiency ratings up and justify purchasing more computers. However, this can be very frustrating if you are running jobs that, for example, run for an hour, then require some human interaction, then run for another hour, etc. The job start delays make it hard to get anything accomplished.

This is the major role of the small local clusters like ours. In these clusters, while of course we would like them to be heavily used, we also try to arrange things so most of the time there are at least a few (~10%) nodes free, so when someone starts a new job (at least a small one) they rarely have to wait at all. For this reason, even if you use the low priority queue, you are NOT allowed to swamp the cluster with 500 job requests:

If the cluster is mostly idle
- it is acceptable for individual users/groups to occupy up to ~2/3 of the cluster (for jobs lasting <1 day)
- perhaps ~1/2 of the cluster for longer jobs.
If (as is usually the case) the cluster is heavily used
- you should not use more than ~1/2 of the UNUSED nodes for small job requests
- for larger/longer requests, which will not be able to start running immediately anyway, limit yourself to ~1/3 of the total nodes on the cluster
If you need to run many small jobs, it is absolutely not acceptable to simply put them all in the queue, even the low priority queue. Instead, you should group a few commands together into each PBS file sequentially, such that the number of nodes your jobs will use does not exceed the above guidelines.

Making a .pbs File

The job file submitted with qsub looks like this:

#!/bin/sh

#PBS -N NAME_OF_THE_JOB
#PBS -l nodes=10:ppn=4      
#PBS -l walltime=24:00:00
#PBS -q longjob

# This job's temporary working directory. You may also work in any of your
# home directories
echo Working directory is $PBS_O_WORKDIR

echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
echo This job runs on the following processors:
echo "PBS_NODEFILE=" $PBS_NODEFILE

cd <some other folder>
my_first_program
my_second_program
mpirun my_mpi_program

This example script asks for 10 nodes with 4 processors on each, for a total of 40 processors. Both CIBR clusters are equipped with 4 GB of RAM for each processor
my_first_program and my_second_program are the command-lines you would normally type if running a program interactively. These programs will be executed sequentially on the first processor you are assigned.
It is very important to realize that you CAN put more than one program in a PBS file. The second command will run after the first finishes, and so on. In that way you can turn 100 little PBS files into a handful of large PBS files, which will take longer to run, but will not use an unfair fraction of the available resources.
If your program supports MPI, then you can mpirun my_mpi_program, and the program will make use of all of the assigned processors. Otherwise, my_first_program would only run on a single processor (unless it has some internal mechanism for using more processors, as, for example, EMAN2 does).
You need to take memory requirements into account. If your program will require more than ~3.5 GB of RAM per core, you will need to allocate additional cores (which go unused) so your job is provided the requisite amount of RAM. If you do this, you will be "charged" for the additional CPU time which you are not using. For example, if you have a single threaded job which needs 11 GB of RAM, you need to allocate 3 cores for your job, even though you are using only 1 of them. If the job runs for 10 hours, your account will be "charged" for 30 CPU-hours of time.

Which Queue to Use ?

You can find a list of the available queues and limits for each with:

qstat -q

Generally the 'dque' (default queue) will be the correct choice for most jobs.
If your job needs to run for longer than 48 hours, you will need to use the 'longjob' queue, which restricts the number of nodes you can use, but allows much longer runtimes.
If you have an urgent deadline, you can use the high priority queue, which will leapfrog ahead of any jobs waiting in the normal priority queues, but will "charge" you for 1.5x as many CPU-hours as you actually use.
If you have some jobs you want to run, but you don't care how soon they get run, you can use the low priority queue, which charges you only 75% of the hours you use.
If your lab has exhausted your quarterly hour allocation, at present we don't "cut you off" immediately. However, we do ask that you use only a limited number of CPUs and that you only run in the low priority queue when you are "over allocation".

Interactive Jobs (eg - matlab)

Sometimes a user will want to run an interactive session on one of the cluster nodes, taking advantage of the large memory or large number of cores available on these machines. Matlab is one software package people commonly use this way. To do this, you still must use the batch queuing system, but you make a special request for an interactive session. Requesting such an interactive batch job is quite straightforward. For example to allocate all 16 cores on one node interactively:

qsub -I -l nodes=1:ppn=16 -q dque

After typing this command, the job will be queued. If a node is available immediately, you will get a terminal prompt directly on the node you've been allocated. If the cluster is busy, the command will sit and wait until a node is free, then return with a prompt when available.

Be warned, however, that with the example above, for instance, you will be charged for 16 CPU-hr for every hour you leave this shell running. If you accidentally forget to log out of this shell and go home for the weekend, you will come back monday having used ~1000 CPU-hr of your allocation. For this reason, you may wish to consider adding a :walltime=4:00:00 (for example) to automatically close the process (killing anything you might be doing) after 4 hours.

You can, of course, also just request a single processor with :ppn=1. If you do this, then you are responsible for insuring that your job uses only a single core. Some software may detect that there are 16 cores available on the machine and attempt to automatically use more. If you request all (16 on Prism, 12 on Torus) of the cores on a single node, then you are free to use the full RAM, and all of the cores, including hyperthreading on the node.

Hyperthreading

For jobs that can run in parallel using threads, you can gain an additional performance boost by taking advantage of hyperthreading. If the node has 16 physical compute cores, you can actually pull out 10-20% more performance from each one if you run, for example, 32 threads rather than 16 threads on that node. However, if you want to do this there are a few things you need to keep in mind:

You MUST allocate ALL of the cores on a node to do this. If you ask for 4 cores, then use 8, you will be competing with another user who was allocated the other 12 cores on that node, unfairly slowing his job down. So, if you want to hyperthread, you must allocate all 16 (or 12) cores.
If your software doesn't make use of shared memory, keep in mind that doing this will decrease the amount of memory available per thread.