e2gmm - A semi-friendly GUI for running GMM dynamics in EMAN2

This tutorial discusses the new (2022) GUI tool for making use of the Gaussian Mixture Model (GMM) based variability tools in EMAN2. These tools are still under development, but are now in a usable form, and the GUI makes them considerably more approachable for typical CryoEM/ET investigators. We recommend this tutorial as a good starting point for understanding the method, even if you ultimately plan to use the command-line tools directly.

First, a quick overview of the programs:

This tutorial covers e2gmm.py, the GUI interface, which currently makes use of e2gmm_refine_point.py behind the scenes.

Quick Theory Overview

e2gmm is one of several emerging tools in the CryoEM community which make use of a mathematical concept known as manifold embedding to characterize the compositional and conformational variability of a macromolecular system. So, what does that mean, exactly? The concept is not as complicated or intimidating as it may sound. If you think about a large biomolecule in solution, it should be clear that the picture of a single, absolutely static, high-resolution structure simply does not reflect reality. At a minimum, the structure is continuously buffeted by solvent molecules, producing motion at the level of individual atoms and side chains. For the vast majority of biomolecules, however, the variability goes far beyond this, with large domain-scale motions and assembly/disassembly processes going on continuously as part of macromolecular function.
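To make the "Gaussian mixture" half of the name concrete, here is a minimal sketch in plain NumPy (not EMAN2's implementation; the toy model and all parameter values are hypothetical). A structure is described by a set of Gaussians, each with a center, amplitude, and width, and the corresponding density map is simply their sum:

    import numpy as np

    def gmm_density(centers, amps, widths, grid_size=32):
        """Evaluate a sum-of-Gaussians density on a cubic grid (toy example,
        not EMAN2 code). Coordinates live in a [-1,1]^3 box."""
        ax = np.linspace(-1, 1, grid_size)
        x, y, z = np.meshgrid(ax, ax, ax, indexing="ij")
        pts = np.stack([x, y, z], axis=-1)            # (N, N, N, 3)
        vol = np.zeros((grid_size,) * 3)
        for c, a, w in zip(centers, amps, widths):
            r2 = np.sum((pts - c) ** 2, axis=-1)      # squared distance to center
            vol += a * np.exp(-r2 / (2 * w ** 2))     # isotropic Gaussian
        return vol

    # Hypothetical 3-Gaussian "structure": moving a center models a
    # conformational change; scaling an amplitude models a compositional
    # (occupancy) change.
    centers = np.array([[0.0, 0.0, 0.0], [0.3, 0.1, -0.2], [-0.3, -0.1, 0.2]])
    amps = np.array([1.0, 0.8, 0.8])
    widths = np.array([0.10, 0.08, 0.08])
    vol = gmm_density(centers, amps, widths)
    print(vol.shape, vol.max())

In the actual method, these Gaussian parameters are produced by a neural network from each particle's low-dimensional latent coordinate, so that each point on the learned manifold corresponds to a slightly different structure.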

Why, then, are most macromolecules represented in the PDB as "the high resolution structure of X"? This convention came from X-ray crystallography, where, to solve a structure, the molecules must be virtually identically configured and packed into a crystal lattice. The concept has since been carried over to CryoEM, where practitioners routinely discard 90% of their raw particle data to achieve "the high resolution structure of X". In CryoEM, the next step beyond this, towards reality, is the traditional classification approach, where a large heterogeneous data set is classified into N more homogeneous subsets, which are then processed (often again discarding large portions of each subset) to produce "N high resolution structures of X".
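The sketch below (NumPy plus scikit-learn, not EMAN2 code; all data are synthetic) contrasts the two views. Particles embedded along a continuous 1-D motion in a 2-D latent space are chopped into N discrete classes by clustering, which is essentially what traditional classification does, whereas a manifold embedding keeps the continuous coordinate itself:

    import numpy as np
    from sklearn.cluster import KMeans

    # Synthetic latent coordinates: each particle embedded as a 2-D point along
    # a curved trajectory, standing in for one continuous conformational change.
    rng = np.random.default_rng(0)
    t = rng.uniform(0, 1, 2000)                   # hidden conformational coordinate
    latent = np.stack([np.cos(np.pi * t), np.sin(np.pi * t)], axis=1)
    latent += rng.normal(scale=0.05, size=latent.shape)

    # Traditional classification: bin the continuum into N discrete classes,
    # each of which would then be refined to its own "structure of X".
    N = 3
    labels = KMeans(n_clusters=N, n_init=10).fit_predict(latent)
    for k in range(N):
        print(f"class {k}: {np.sum(labels == k)} particles")

    # A manifold-embedding approach instead works with the continuous
    # coordinate directly, ordering particles along the motion rather than
    # discarding the within-class variability.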