Batch Queuing System

A Batch Queuing System (BQS) manages all of the shared compute resources on a cluster. It allocates processors (and sometimes memory and other resources) to specific jobs in a specific order based on Queuing policies. When you have a job to run on the cluster, you prepare a short script file, describing the resources you need, and the commands to run, then submit it to the BQS. The BQS will then launch the job as soon as the necessary resources are available, and collect the standard output and error output into files in the folder where the job was launched from.

The system we use (Maui/Torque) is a descendent of a very traditional BQS known as PBS (and later OpenPBS). Some variant of this system is used on a majority of the Linux clusters around the world.

Running Jobs Directly on the Head Node

In some limited cases it is acceptable (and even encouraged) to run small tasks directly on the head-node, or in the case of Torus, on the 'store' node. The node with direct connection to the RAID array can access the data much more rapidly than the compute nodes, and without saturating the network and interfering with other jobs on the cluster. Here are examples of cases where it is permissible to use the head/store nodes:

On Torus (and Gauss) it is important to note that the head node (the one you log into when ssh'ing to the cluster) does NOT have improved access to the disk. On these machines there are separate storage nodes. On Torus, this node is called 'store'. To use it, ssh to torus, and from there, 'ssh store', and run your small disk-intensive tasks there. On Gauss, there are 2: "store1" and "store2".

IMPORTANT: The BQS must be used for all jobs run on the cluster! If you are found running compute jobs directly on nodes without using PBS, your account may be temporarily or permanently disabled.

Programs for Job Management

The main programs used to access the queuing systems are:

Program

Function

qsub <your job file> -q <queuename>

Submit a new job

qdel <job id>

Kill or remove a queued job

qstat

Info on running jobs

qstat -a

Info on ALL jobs in the queue

qstat -an

All jobs with node allocation information

qstat -q

List all available queues

pbsnodes -a

List status of all system nodes

checkjob <job id>

Gives useful details about a job

showq

Another way of looking at the queues

Running jobs

This is the sequence of events that takes place when you run a job on the cluster:

  1. Run showq or qstat -a to see the current cluster load level and determine how many cores you will request for your job.

  2. Create a ".pbs" file containing your job (see below)
  3. qsub file.pbs -q queuename

  4. wait for your job to run
  5. showq or qstat -a can be used to monitor.

    • "Q" means the job is waiting to run
    • "R" means the job is running
    • "C" or "A" mean the job has failed for some reason
  6. Look at the file.pbs.eXXXXX and file.pbs.oXXXXX to see the output generated by your program

NOTE: The .eXXXXX and .oXXXXX collect all of the information printed to the terminal if the job were run interactively. Some programs generate massive amounts of output like this, and in some cases this can cause your jobs to crash, or the output files not to get copied back to the head-node. If your program generates massive console output which you need to keep, use '>logfile.txt' to redirect the output directly to a file instead of using the .oXXXXX mechanism. If you do not need to keep the output, you may just '>/dev/null' which will simply discard it.

How Many Nodes to Request

At large supercomputing centers, if you have 1000 jobs to run, you just dump them all into the queue, and they will run when they can (assuming you have time left in your allocation). Usually when you submit jobs in these environments you will have to wait hours or days (depending on the request) before your job(s) actually start running. The goal at such centers is to keep the computer as occupied as possible to keep their efficiency ratings up and justify purchasing more computers. However, this can be very frustrating if you are running jobs that, for example, run for an hour, then require some human interaction, then run for another hour, etc. The job start delays make it hard to get anything accomplished.

This is the major role of the small local clusters like ours. In these clusters, while of course we would like them to be heavily used, we also try to arrange things so most of the time there are at least a few (~10%) nodes free, so when someone starts a new job (at least a small one) they rarely have to wait at all. For this reason, even if you use the low priority queue, you are NOT allowed to swamp the cluster with 500 job requests:

Making a .pbs File

The job file submitted with qsub looks like this:

#!/bin/sh

#PBS -N NAME_OF_THE_JOB
#PBS -l nodes=10:ppn=4      
#PBS -l walltime=24:00:00
#PBS -q longjob

# This job's temporary working directory. You may also work in any of your
# home directories
echo Working directory is $PBS_O_WORKDIR

echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
echo This job runs on the following processors:
echo "PBS_NODEFILE=" $PBS_NODEFILE

cd <some other folder>
my_first_program
my_second_program
mpirun my_mpi_program

Which Queue to Use ?

You can find a list of the available queues and limits for each with:

qstat -q

Interactive Jobs (eg - matlab)

Sometimes a user will want to run an interactive session on one of the cluster nodes, taking advantage of the large memory or large number of cores available on these machines. Matlab is one software package people commonly use this way. To do this, you still must use the batch queuing system, but you make a special request for an interactive session. Requesting such an interactive batch job is quite straightforward. For example to allocate all 16 cores on one node interactively:

qsub -I -l nodes=1:ppn=16 -q dque

After typing this command, the job will be queued. If a node is available immediately, you will get a terminal prompt directly on the node you've been allocated. If the cluster is busy, the command will sit and wait until a node is free, then return with a prompt when available.

Be warned, however, that with the example above, for instance, you will be charged for 16 CPU-hr for every hour you leave this shell running. If you accidentally forget to log out of this shell and go home for the weekend, you will come back monday having used ~1000 CPU-hr of your allocation. For this reason, you may wish to consider adding a :walltime=4:00:00 (for example) to automatically close the process (killing anything you might be doing) after 4 hours.

You can, of course, also just request a single processor with :ppn=1. If you do this, then you are responsible for insuring that your job uses only a single core. Some software may detect that there are 16 cores available on the machine and attempt to automatically use more. If you request all (16 on Prism, 12 on Torus) of the cores on a single node, then you are free to use the full RAM, and all of the cores, including hyperthreading on the node.

Hyperthreading

For jobs that can run in parallel using threads, you can gain an additional performance boost by taking advantage of hyperthreading. If the node has 16 physical compute cores, you can actually pull out 10-20% more performance from each one if you run, for example, 32 threads rather than 16 threads on that node. However, if you want to do this there are a few things you need to keep in mind: