====== CIBR Cluster Co-op ======
  
**WARNING:** The existing CIBR clusters are now >10 years old, and we no longer know how long the room housing them will remain operational due to BCM cuts. If you have material on the CIBR Clusters, we strongly suggest moving it to local storage at your earliest convenience. Note also that [[https://www.bcm.edu/academic-centers/dan-l-duncan-comprehensive-cancer-center/research/cancer-center-shared-resources/biostatistics-and-informatics/biomedical-informatics-group/high-performance-computer-cluster|BISR]] operates a fee-for-service cluster which is more actively maintained and developed.
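If you do need to move material off the clusters, a transfer tool such as rsync run from a machine in your lab is one option. The sketch below is only an illustration: the username and both paths are placeholders, and it assumes rsync is installed on both ends and that the cluster head node is reachable from your machine.

<code>
# Copy a directory from the cluster to local lab storage (username and paths are placeholders)
rsync -av --progress username@prism.grid.bcm.edu:/home/username/mydata/  /local/backup/mydata/
</code>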
  
----
  
**NOTE:** The CIBR Cluster Co-op strives to maximize CPU/cost, and BCM no longer provides any staff support for cluster operations. Dr. Ludtke's group is handling issues as a community service. We will still provide cluster access to users meeting the requirements (faculty CIBR membership), but software configuration within your account and running jobs are your responsibility. If you need more inclusive service (which costs $), please see [[https://www.bcm.edu/academic-centers/dan-l-duncan-comprehensive-cancer-center/research/cancer-center-shared-resources/biostatistics-and-informatics/biomedical-informatics-group/high-performance-computer-cluster|BISR]].
  
==== Overview ====
[[https://www.bcm.edu/research/centers/computational-integrative-biomedical-research|CIBR]] and a number of specific faculty have sponsored the purchase of several clusters managed for shared use by faculty. CIBR faculty may request moderate allocations for free as a benefit of CIBR membership (which is itself free). Faculty who contributed funds directly to the purchase have larger guaranteed allocations.
  
The clusters currently in operation are:
||**Name**||**Cores**||**Storage**||**Purchased**||**Free CIBR Access**||**Paid Access**||**Notes**||
||[[CIBRClusters:Sphere|sphere.grid.bcm.edu]]||960||60 TB + 80 TB||2015||25,000 CPU-hr/qtr||Ludtke, Guan, Waterland|| ||
||[[CIBRClusters:Prism|prism.grid.bcm.edu]]||704||63 TB + 88 TB||2013||25,000 CPU-hr/qtr||Wensel, Ludtke, Guan|| ||
||[[CIBRClusters:Store|store]]|| - ||350 TB RAID6 x2||2013||upon request|| - ||For inactive data which will still be needed on the clusters. This storage array is available upon request to users of any of the clusters, but is directly accessible only from the clusters, to discourage people from using it for general lab backup purposes. It is a RAID6 volume, which gives it good reliability, but every 4-5 years we experience some sort of failure. At one time this storage was backed up in addition to being a RAID6 array, but that is no longer true. If the RAID suffers a critical failure, all data could be lost.||
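As a rough illustration of what the free 25,000 CPU-hr/qtr allocation in the table amounts to, the arithmetic below assumes a hypothetical job that keeps 64 cores busy continuously; the 64-core figure is arbitrary and not a statement of any per-job limit.

<code>
# 25,000 CPU-hr per quarter, if a hypothetical 64-core job ran continuously:
echo $(( 25000 / 64 ))        # ~390 wall-clock hours
echo $(( 25000 / 64 / 24 ))   # ~16 days of continuous 64-core use per quarter
</code>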
  
==== Requesting Access ====
Free cluster accounts must be requested directly by a PI (tenure-track faculty) who must be an established member of CIBR, and such accounts will charge against that PI's allocation. Contact lmayran@bcm.edu (Larry Mayran) about CIBR membership. We may ask to discuss your intended use for the cluster before granting accounts, simply to ensure that it will not interfere with paid cluster users or violate policy. Please note that cluster security does not meet HIPAA requirements, so no identified patient data may be stored or processed on the clusters.
  
To request an account, email sludtke@bcm.edu with:
  * Full name
  * Requested username
  * Position (student, postdoc, etc.)
  * Cell phone number (for emergencies and initial password)
  * An email address which must be checked at least daily
  
==== Cluster Usage ====
Each cluster has independent user accounts, as different faculty contributed different amounts to each. Please click on one of the links above for detailed policies and procedures for each cluster.
  
Individual users are expected to be familiar with the use of Linux clusters in general, and if not, to obtain assistance from their acquaintances rather than learning by trial and error. Occasionally CIBR offers a workshop to help familiarize people with cluster computing, which will be advertised via normal CIBR mechanisms.
  
All of the clusters run some version of CentOS Linux. **We do not provide any commercial software**, but most standard open-source tools are installed, and users are welcome to install commercial or free software within their own accounts. BCM's site license for Matlab is usable on the cluster, but it must be installed/licensed in the user's account, and there are some setup issues; contact Larry Mayran about this for details. We may also be able to install other open-source software system-wide; if you have such a need, please ask.
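As one example of managing software entirely within your own account, a per-user Miniconda installation works on most Linux clusters. This is a generic sketch rather than a supported procedure for these specific machines: it assumes the login node has outbound network access and a new enough system for the installer, and the packages shown are arbitrary examples.

<code>
# Install Miniconda into your home directory; no administrator access is required
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3

# Activate it for the current shell and install packages into your own environment
source $HOME/miniconda3/bin/activate
conda install -y numpy scipy
</code>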
  
Details on some of the open-source software we have made available for users can be found on the [[CIBRClusters:cibrclusters_Software|Software page]].
  
There is a CIBR Cluster Google Group/mailing list (https://groups.google.com/forum/#!forum/cibrcluster) which is used for announcements of outages, problems, policy changes, etc. We strongly recommend that all CIBR Cluster users join.
  
==== In manuscripts, grant proposals, presentations, and other publications ====
Please acknowledge the assistance from the CIBR Center by including a statement similar to the following:
"We gratefully acknowledge the assistance and computing resources provided by the CIBR Center for Computational and Integrative Biomedical Research of Baylor College of Medicine in the completion of this work."
  
----
CIBR hosted a Mini-Workshop on cluster computing in 2018.
The slides can be downloaded here: [[http://blake.bcm.edu/dl/EMAN2/workshop2018.pdf]]
Archived video of the presentation is here: [[http://blake.grid.bcm.edu/dl/CIBR/cluster_computing.mp4|Cluster Computing]]
  