Sphere

Overview

Sphere is a medium-scale high-performance Linux cluster purchased through a collaborative arrangement between CIBR and the Ludtke, Barth, Waterland and Guan labs. Each group that contributed financially to the purchase is entitled to a proportional share of the cluster's overall compute capacity. In theory, the cluster can provide up to 6M CPU-hr of computation annually. However, it is impossible to keep such a cluster fully loaded all of the time; typical clusters run at ~60-80% of capacity, and allocations reflect this.
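As a rough illustration of the arithmetic above, the following sketch applies the 60-80% utilization range to the 6M CPU-hr figure and scales the result by an ownership share. The 25% share is a hypothetical value chosen only for the example; it is not any lab's actual share, and the function name is made up for this page.

```python
from typing import Tuple

# Theoretical annual capacity quoted above, in CPU-hours.
THEORETICAL_CPU_HR = 6_000_000

# Typical utilization range quoted above.
UTILIZATION_LOW, UTILIZATION_HIGH = 0.60, 0.80


def usable_cpu_hr(share: float) -> Tuple[float, float]:
    """Return (low, high) usable CPU-hr per year for a given ownership share."""
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be between 0 and 1")
    low = THEORETICAL_CPU_HR * UTILIZATION_LOW * share
    high = THEORETICAL_CPU_HR * UTILIZATION_HIGH * share
    return low, high


# Hypothetical example: a group that funded 25% of the cluster.
low, high = usable_cpu_hr(0.25)
print(f"{low:,.0f} to {high:,.0f} CPU-hr/year")
```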

CIBR faculty can acquire time allocations on this cluster simply by requesting them via email to sludtke@bcm.edu. Initial allocations of 25,000 CPU-hr/quarter do not require any formal application, simply a request. The request must come from the PI, not from students or postdocs, because allocations are made on a per-faculty basis; the PI should discuss relative allocations with the members of their group. Faculty must be members of CIBR to receive free time on the cluster (membership is free). If the initial allocation is exhausted, larger allocations may be possible, depending on usage levels and the number of requests in that quarter.
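To get a feel for what 25,000 CPU-hr/quarter buys, here is a minimal sketch that converts an allocation into node-hours and into a number of jobs of a given size. It assumes that one CPU-hr means one core occupied for one hour, which this page does not define explicitly, and it uses the 24 cores per node listed under Hardware & Software. The function names and the example job size are hypothetical.

```python
CORES_PER_NODE = 24          # from the hardware list on this page
INITIAL_ALLOCATION = 25_000  # CPU-hr per quarter, from the text above


def node_hours(cpu_hr: float, cores_per_node: int = CORES_PER_NODE) -> float:
    """Whole-node hours covered by an allocation (assumes 1 CPU-hr = 1 core-hour)."""
    return cpu_hr / cores_per_node


def max_jobs(cpu_hr: float, cores_per_job: int, hours_per_job: float) -> int:
    """Number of complete jobs of the given size that fit in the allocation."""
    if cores_per_job <= 0 or hours_per_job <= 0:
        raise ValueError("job size must be positive")
    return int(cpu_hr // (cores_per_job * hours_per_job))


print(f"{node_hours(INITIAL_ALLOCATION):.1f} node-hours per quarter")
# Hypothetical job: 2 full nodes (48 cores) for 12 hours.
print(max_jobs(INITIAL_ALLOCATION, cores_per_job=48, hours_per_job=12), "such jobs")
```

A budget check like this is useful when a PI divides the group's allocation among several people.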

There is a CIBR Cluster Google Group/Mailing List (https://groups.google.com/forum/#!forum/cibrcluster) which is used for announcements of outages, problems, policy changes, etc. We strongly recommend all CIBR Cluster users join.

Hardware & Software

1 Head node for job management
1 Storage node for home directories and mid-term data storage with a 40 TB RAID array
40 compute nodes, each with:
- 24 cores (dual 12-core CPUs @ 2.5 GHz, E5-2680v3)
- 128 GB of RAM (about 5.3 GB/core)
- 2 TB hard drive (1 TB local scratch)
10 Gb interconnect between nodes for high-performance MPI and storage
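The per-node figures above can be turned into cluster-wide totals and a memory budget per task, which helps when sizing jobs. This is only a sketch of the arithmetic: the constants come from the list above, the helper name is hypothetical, and the memory actually available to jobs will be somewhat lower than the installed RAM because the operating system uses part of it.

```python
COMPUTE_NODES = 40
CORES_PER_NODE = 24
RAM_PER_NODE_GB = 128
SCRATCH_PER_NODE_TB = 1

total_cores = COMPUTE_NODES * CORES_PER_NODE
total_ram_gb = COMPUTE_NODES * RAM_PER_NODE_GB
total_scratch_tb = COMPUTE_NODES * SCRATCH_PER_NODE_TB


def ram_per_task_gb(tasks_per_node: int) -> float:
    """Installed RAM per task if a node is shared evenly among tasks_per_node tasks."""
    if not 1 <= tasks_per_node <= CORES_PER_NODE:
        raise ValueError("tasks_per_node must be between 1 and CORES_PER_NODE")
    return RAM_PER_NODE_GB / tasks_per_node


print(f"{total_cores} cores, {total_ram_gb} GB RAM, {total_scratch_tb} TB local scratch")
print(f"{ram_per_task_gb(CORES_PER_NODE):.2f} GB per task with one task per core")
```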

Please note: The primary RAID array for active storage of data is not backed up in any way. RAID6 provides some redundancy and protection against routine drive failures, but if multiple drives fail at once, or some other hardware problem occurs, it is possible to lose everything stored on the primary RAID. For this reason we also offer more reliable backup storage for data that is not actively being processed but will still be needed on the cluster. Any user can request an allocation on this reliable backup space.

Software Configuration

The cluster runs CentOS 7.1, a Linux distribution equivalent to RHEL 7.1.

We do not provide any commercial software, but most standard open-source tools are installed, and users are welcome to install commercial or free software within their own accounts. BCM's site license for Matlab is usable on the cluster, but Matlab must be installed and licensed in the user's account. Contact Larry Mayran for details. We may also be able to install other software system-wide; if you have such a need, please ask.

A list of some of the open-source software we have made available for users is maintained on the Software page.

Detailed Information on Using the Cluster

Important Note: Unlike the other CIBR clusters, which use the older Torque/Maui batch system based on OpenPBS, Sphere uses the newer SLURM system. SLURM provides a number of advantages, but it is used quite differently from the older system. See below for details. We will gradually be transitioning to this system on all CIBR resources.
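For users coming from the Torque/Maui clusters, the sketch below lists the commonly cited SLURM counterparts of the basic Torque commands. The mapping is approximate (options and output formats differ between the two systems), and the helper function is hypothetical, written only to illustrate the lookup.

```python
# Approximate command equivalents; options and output differ between systems.
TORQUE_TO_SLURM = {
    "qsub": "sbatch",    # submit a batch script
    "qstat": "squeue",   # list queued and running jobs
    "qdel": "scancel",   # cancel a job
    "pbsnodes": "sinfo", # show node and partition status
}


def slurm_equivalent(torque_command: str) -> str:
    """Return the SLURM command that roughly corresponds to a Torque command."""
    words = torque_command.split()
    if not words:
        raise ValueError("empty command")
    try:
        return TORQUE_TO_SLURM[words[0]]
    except KeyError:
        raise ValueError(f"no known SLURM equivalent for {words[0]!r}") from None


for old in TORQUE_TO_SLURM:
    print(f"{old:10s} -> {slurm_equivalent(old)}")
```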

The cluster is administered in a fairly laissez-faire fashion. Generally speaking, all users, paid and free alike, share the same queuing system. While there are specific queues for high and low priority as well as long-running jobs, there is no mechanism to preempt a running job. That is, once a job is started, it will run (occupying its assigned resources) until complete. The priority system only affects the order in which jobs are started. Please see the documentation below for more details. Under no circumstances should compute jobs be executed on the head node, or directly on any compute node, without going through the queuing system. It is permissible to run short I/O-intensive jobs on the head node (pre-processing data and such), since the head node has more efficient storage access.
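Going through the queuing system means handing SLURM a batch script rather than launching the program directly. The following is a minimal sketch of how such a script could be generated and submitted from Python. The `#SBATCH` options and the `sbatch` command are standard SLURM, but the partition name, job name, and program path are placeholders: this page does not list Sphere's actual queue names, so check the detailed documentation before using it.

```python
import subprocess
import tempfile


def build_sbatch_script(command, job_name="example", partition="PARTITION_NAME",
                        nodes=1, tasks_per_node=24, time_limit="01:00:00"):
    """Return the text of a SLURM batch script (partition name is a placeholder)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        f"#SBATCH --time={time_limit}",
        "#SBATCH --output=slurm-%j.out",  # %j expands to the job ID
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


def submit(script_text):
    """Write the script to a temporary file and pass it to sbatch."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as handle:
        handle.write(script_text)
    result = subprocess.run(["sbatch", handle.name], check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()


if __name__ == "__main__":
    # Only prints the script; call submit() on the cluster to queue it.
    print(build_sbatch_script("srun ./my_program"))
```

Because running jobs are never preempted, a realistic `--time` limit matters: it is the only bound on how long the job can hold its assigned resources.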

It is the user's responsibility to follow all cluster policies. While we try to be understanding of mistakes, we would much prefer to answer a question rather than spend two days fixing an accidental problem. Users who intentionally abuse policy may have their accounts temporarily or permanently suspended. In such situations, the user's PI will be consulted.

For Assistance

In general, if you don’t understand something about how to use the cluster effectively, or have any questions/issues, don’t hesitate to email Steve Ludtke ( sludtke@bcm.edu ).

Last modified on July 20, 2015