Prism Policy

updated 12/31/2013

Overview

Prism is a medium-scale, high-performance Linux cluster purchased through a collaborative arrangement between CIBR and the Chiu, Barth, Wensel, Ludtke, and Guan labs. Each group that contributed financially to the purchase is entitled to a proportional share of the cluster's overall compute capacity. In theory the cluster can provide up to 6,170,000 CPU-hr of computation annually (44 nodes × 16 cores/node × 8,760 hr/yr ≈ 6,170,000 CPU-hr). In practice it is impossible to keep such a cluster fully loaded all of the time; typical clusters run at roughly 60-80% of capacity, and allocations reflect this.

CIBR faculty can acquire time allocations on this cluster simply by requesting them via email to sludtke@bcm.edu . Initial allocations of 25,000 CPU-hr do not require a formal application, only a request. The request must come from the PI, not from students or postdocs, since allocations are made on a per-faculty basis; the PI should discuss how the time is divided among the members of his or her group. Faculty must be members of CIBR to receive free time on the cluster (membership is free). If the initial allocation is exhausted, larger allocations may be possible, depending on usage levels and the number of requests in that quarter.

Hardware & Software

  • 1 head node for job management
  • 1 storage node for home directories and mid-term data storage, with a 40 TB RAID array
  • 44 compute nodes, each with:
    • 16 cores (dual 8-core 2.6 GHz Intel Xeon E5-2670 CPUs)
    • 64 GB of RAM (4 GB/core)
    • 2 TB hard drive (1 TB local scratch)
  • QDR InfiniBand (40 Gb/sec) interconnect between nodes for high-performance MPI and storage

Software Configuration

  • The cluster runs CentOS 6.5, a Linux distribution functionally equivalent to RHEL 6.5.

  • Job scheduling uses the Torque queuing system with the Maui scheduler, which together are essentially equivalent to PBS. This is how users submit jobs for execution on the compute nodes; see the example job script after this list.

  • OpenMPI 1.6.5 is available (in /usr/local), and users are free to compile and use other MPI distributions from their own accounts.

  • The cluster has a wide range of open-source programs and libraries installed on it as part of the CentOS distribution. Within limits, new packages requested by users can be added.
  • There is, at present, no commercial software available on Prism. If you require a specific commercial software package, you may install it yourself in your own account. If it requires system privileges, we suggest discussing it with us prior to purchase.
  • BCM's Matlab license does not cover cluster use. It is possible to install a license in your own account for use on one node at a time, which still gives access to 16 cores. To do this you must request an interactive session through PBS (see the sketch after this list); you may not log in to a node directly and run Matlab.
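
A minimal sketch of a Torque/PBS batch script that launches an MPI program with the OpenMPI installation in /usr/local. The resource requests, walltime, and the program name ./my_mpi_app are placeholders for illustration only; see the "Running Jobs on the Cluster" page listed below for the actual queue names and site-specific details.

{{{
#!/bin/bash
# example.pbs -- illustrative Torque/PBS job script (placeholder values)
#PBS -N mpi_example           # job name
#PBS -l nodes=2:ppn=16        # 2 nodes x 16 cores/node = 32 MPI ranks
#PBS -l walltime=04:00:00     # wall-clock time limit
#PBS -j oe                    # merge stdout and stderr into one output file

cd $PBS_O_WORKDIR             # start in the directory qsub was run from

# $PBS_NODEFILE lists the cores assigned by Torque; pass it to mpirun
# explicitly in case OpenMPI was not built with Torque integration.
NP=$(wc -l < $PBS_NODEFILE)
mpirun -np $NP -hostfile $PBS_NODEFILE ./my_mpi_app
}}}

Submit the script with "qsub example.pbs" and monitor it with "qstat".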
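
For Matlab (or other single-node interactive work), an interactive session can be requested through PBS as sketched below. The walltime is just an example, and this assumes you have installed your own single-node Matlab license in your account as described above.

{{{
# Ask Torque for one full node (16 cores) interactively; the prompt returns
# once you have been placed on a compute node.
qsub -I -l nodes=1:ppn=16 -l walltime=08:00:00

# Then start Matlab on that node, e.g. without the desktop GUI:
matlab -nodisplay -nosplash
}}}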

Detailed Information on Using the Cluster

The cluster is administered in a fairly laissez-faire fashion. Generally speaking, all users, paid and free alike, share the same queuing system. While there are specific queues for high priority, low priority, and long-running jobs, there is no mechanism to preempt a running job: once a job starts, it runs (occupying its assigned resources) until it completes. The priority system only affects the order in which jobs are started. Please see the documentation below for more details. Under no circumstances should compute jobs be run on the head node, or directly on any compute node without going through the queuing system. It is permissible to run short I/O-intensive jobs (pre-processing data and such) on the head node, since it has more efficient access to storage.
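
Because priority only affects start order, it is worth checking the state of the queue before submitting a large batch of jobs. A few commands that may help (qstat ships with Torque, showq with Maui):

{{{
qstat -a         # all jobs, with requested nodes/cores and wall times
qstat -u $USER   # only your own jobs
showq            # Maui's view: running, idle (queued), and blocked jobs
}}}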

It is the user's responsibility to follow all cluster policies. While we try to be understanding of mistakes, we would much rather answer a question than spend two days fixing an accidental problem. Users who intentionally abuse policy may have their accounts temporarily or permanently suspended; in such situations the user's PI will be consulted. Detailed documentation:

  • Running Jobs on the Cluster (using the queuing system): CIBRClusters/Queue
  • Disk storage policies: CIBRClusters/Storage
  • How to use, not abuse, the cluster: CIBRClusters/EffectiveUse

For Assistance

  • The cluster is maintained by Dwight Noel, who works for the NCMI and maintains the clusters only part-time. He will be happy to help you and answer your questions. If you don't understand something about how to use the cluster effectively, or have any questions or issues, don't hesitate to email him ( dwight.noel@bcm.edu ) or Steve Ludtke ( sludtke@bcm.edu ).

Contact

SysOp: Dwight Noel dwight.noel@bcm.edu

Last modified on April 18, 2014
