=== General Policy ===
* Please try to be considerate of other cluster users.
* If the cluster is heavily occupied, we request that no single user have more than 240 cores (10 nodes) in the queue (running or waiting). The debug queue is an exception.
* If the cluster has been running below capacity for 8 hours or more, you may submit up to an additional 240 cores' worth of jobs, but only to the 'shortjob' queue (maximum runtime of 8 hours). Note that this queue allocates only whole nodes, like the bynode queue.
* Please also try to ensure that your jobs get distributed among as few nodes as possible, e.g., if running many small jobs, try to group jobs with similar runtimes together.
* You are not allowed to write batch scripts which conditionally submit additional jobs to the cluster. Small bugs in such scripts can easily lead to runaway behavior, and may result in your account being banned from the cluster.
* Similarly, use of cron/at to submit jobs at a later time is not permitted.
* If you need to run a sequence of commands, with the second command running only after the first completes, simply put all of the commands into a single batch script and make sure the requested runtime covers the whole sequence (see the examples at the end of this page).
* Similarly, if you have, say, 480 jobs to run, simply put 2 commands in sequence in each of 240 batch scripts. This way all of the jobs will run and you won't over-utilize the cluster. One way to generate such scripts is sketched in the examples below.
* When using the bynode, longjob or shortjob queues, you may use the full RAM available on each node.
* If running other jobs, you are responsible for ensuring that each 'task' uses no more than 5 GB/core and runs only a single thread of execution. If you have a single-threaded process which requires (for example) 20 GB of RAM, you must request 4 CPUs for each process.

=== Launching Jobs on the Clusters using SLURM ===
SLURM is a much more flexible queuing system than the previous Torque/Maui system (used on the other CIBR clusters). Some general tips to get you started:
* Partition - this was called a Queue under the old system.
* Unlike the old system, where it was difficult to monitor jobs, with SLURM STDOUT is written to slurm-<jobid>.out and updated in real time, so you can see exactly what is going on.
* sinfo -al - gives a detailed list of the available partitions
* squeue -l - gives a detailed list of running and queued jobs
* sbatch <scriptfile> - submits a batch script to the queue
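
=== Examples ===
As a minimal sketch of the "sequence of commands" and "memory per core" rules above, a batch script might look like the following. The partition, time limit, and program names (first_step, second_step) are placeholders for illustration, not site requirements:

<pre>
#!/bin/bash
# Hypothetical partition and program names; adjust for your own jobs.
#SBATCH --partition=shortjob
#SBATCH --ntasks=1
# A single-threaded process needing 20 GB must request 4 cores
# under the 5 GB/core limit.
#SBATCH --cpus-per-task=4
# The time limit must cover BOTH commands run in sequence.
#SBATCH --time=08:00:00

# The second command starts only after the first completes, so no
# conditional job submission is needed.
./first_step input.dat
./second_step input.dat
</pre>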
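
For the "480 jobs in 240 scripts" advice, one way to pair commands up is a short shell loop run before submission. The commands.txt file and the 16-hour time limit are assumptions for illustration only:

<pre>
#!/bin/bash
# Sketch: pack 480 commands (one per line in a hypothetical commands.txt)
# into 240 batch scripts of 2 commands each.
split -l 2 commands.txt chunk_
n=0
for f in chunk_*; do
    n=$((n+1))
    {
        echo '#!/bin/bash'
        # Request enough time for two commands run back to back.
        echo '#SBATCH --time=16:00:00'
        cat "$f"
    } > "job_$n.sh"
    rm "$f"
done
</pre>

The resulting scripts are then submitted by hand with sbatch, in keeping with the policy against automated or conditional job submission.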
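
Putting the SLURM commands above together, a typical submit-and-monitor session looks like this (the script name and job ID are illustrative):

<pre>
$ sbatch job_1.sh
Submitted batch job 12345
$ squeue -l -u $USER
$ tail -f slurm-12345.out
</pre>

The squeue -u flag restricts the listing to your own jobs, and tail -f lets you follow the job's STDOUT in real time as it runs.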