THIS PAGE IS UNDER CONSTRUCTION

TODO - indicates incomplete tutorial sections
The e2gmm program is under active development. We STRONGLY encourage using an EMAN2 snapshot or source version dated 12/2022 or later. Even with a current version you may notice some small differences in the display, etc.
e2gmm makes extensive use of TensorFlow, and training, even on a GPU, can take several hours in some cases. This tutorial is intended to be run on a Linux machine with an NVIDIA GPU.

e2gmm - A semi-friendly GUI for running GMM dynamics in EMAN2

This tutorial discusses the new (2022) GUI tool for making use of the Gaussian Mixture Model based variability tools in EMAN2. These tools are still under development, but are now in a usable form. With the GUI these tools are more approachable for typical CryoEM/ET investigators. We recommend this as a good starting point for understanding the method even if you plan to use command line tools manually in the end.

First, a quick overview of the programs:

e2gmm.py - A graphical interface for GMM analysis of single particle or subtomogram averaging data sets. Makes use of e2gmm_refine_point.py
e2gmm_refine.py - The original GMM program as described in the first GMM paper (PMC8363932), largely superseded now
e2gmm_refine_point.py - Dr. Ludtke's new variant, used by the GUI. Significant mathematical changes from the original, but requires substantially less RAM, and in many cases produces better particle classification
e2gmm_refine_new.py - Dr. Chen's new variant, where he is exploring new refinement methods. He is developing a separate tutorial for this new tool.
e2gmm_analysis.py - Ancillary program used to analyze the output of GMM runs. Related functionality to some of the GUI tools.

This tutorial covers e2gmm.py, the GUI interface, which currently makes use of e2gmm_refine_point.py.

Quick Theory Overview

e2gmm is one of several emerging tools in the CryoEM community which make use of a mathematical concept known as manifold embedding to characterize the compositional and conformational variability of a macromolecular system. So, what does that mean, exactly? The concept is not as complicated or intimidating as it may sound. If you think about a large biomolecule in solution, it should be obvious that the picture of a single absolutely static high resolution structure simply does not reflect reality. At the very least, the structure is being continuously impacted by solvent molecules causing motion at least on the level of individual atoms or side-chains. However, for the vast majority of biomolecules it goes far beyond this, with large domain scale motions and assembly/disassembly processes going on continuously as part of the macromolecular function.

Why then are most macromolecules represented in the PDB as "the high resolution structure of X"? This really came from X-ray crystallography where, to solve the structure, the molecules have to be identically configured and packed into a crystal lattice. However, this concept has now been extended to CryoEM, where practitioners routinely discard 90% of their raw particle data to achieve "the high resolution structure of X". In CryoEM, the next step beyond this, towards reality, is traditional classification, where a large heterogeneous data set is classified into N more homogeneous subsets, which are then processed (often again discarding large portions of each subset) to produce "N high resolution structures of X". Clearly this is an improvement, and is a reasonable way to represent discrete events such as association/dissociation/ligand binding, but it still won't adequately capture continuous changes from state A to state B.

When we do normal single particle analysis, each particle already has (at least) 5 values associated with it: the x-y shift needed to center the particle and the 3 Euler angles defining its 3-D orientation. The goal of manifold methods is to associate several additional numbers with each particle, each associated with some particular, possibly independent, motion of the system. If (TODO)

Getting Started

The GMM requires a set of single particle or subtomogram averaging data to use as input. It also requires particle orientations, generally determined with respect to a single "neutral" structure. In step 1, we create a gmm_XX folder, which is attached to a specific existing refinement run (refine_XX, r3d_XX or spt_XX). At the moment if you want to work with a Relion refinement, you will need to import the STAR file into an EMAN2 project first, then open the resulting folder (TODO).

We suggest using the results of the Subtomogram Averaging Tutorial as the starting point for this tutorial. The single particle tutorial uses beta-galactosidase, which has minimal flexibility and high symmetry, and isn't really suitable for the GMM. The subtomogram averaging tutorial, however, uses ribosomes, which exhibit a clear ratcheting motion and generally include a subset of "bad" particles.

Step 1

Press the "New GMM" button, select one of the supported folder types, and press OK. There will be a moderate delay after pressing OK while the GMM folder is constructed and populated.

Step 2

The dynamics/variability in a complex macromolecule becomes more and more complicated, with more and more degrees of freedom, as you increase the level of detail. At the extreme, you could consider the anisotropic motions of the individual atoms, as reflected in crystallography by anisotropic B-factors. But even taking a step back from that, most sidechains in solution are almost certainly not rigidly oriented. With this in mind, when looking at dynamics, it is important to decide which lengthscale you wish to examine. Normally, it would make sense to start with the coarsest lengthscales and look for large scale motions like the ratcheting mechanism in the ribosome or the association/dissociation of components of the macromolecular system. If those are characterized and understood (or do not exist in a significant way), then moving down to examine domain motions, then motions of individual secondary structures, etc. makes sense.

So, while the GMM tool is automated, the user still has to make decisions about the level of detail to explore, how many degrees of freedom to permit, whether to focus the analysis using a mask, etc. That is, you will almost certainly want to do more than one run of the GMM system with different parameters as part of your explorations.

In Step 2, we create (named) GMM runs. Each run will have its own set of parameters, and will exist in the context of the gmm_XX folder selected in Step 1.

Press Create Run
Enter "15_25_100"
Press Enter or click OK

You may use whatever name you like for these runs, BUT the name will be used as part of the filename of several files, so please do not include spaces or special characters other than "_","-" or "+" in the name. You may also consider keeping the name relatively short.
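If you generate run names from a script or a spreadsheet, the naming rule above can be checked before typing the name into the GUI. The following is a minimal sketch and is not part of EMAN2: the helper name is_safe_run_name is hypothetical, and it assumes that only ASCII letters, digits, "_", "-" and "+" are acceptable.

```python
import re

# Hypothetical helper, not part of EMAN2. Accepts only ASCII letters, digits,
# "_", "-" and "+", the characters the tutorial describes as safe in run names.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_+-]+")

def is_safe_run_name(name):
    """Return True if name is non-empty and contains only safe characters."""
    return _SAFE_NAME.fullmatch(name) is not None

print(is_safe_run_name("15_25_100"))   # the name used in this tutorial
print(is_safe_run_name("my run #2"))   # contains a space and "#"
```

The first call accepts the name used in this tutorial; the second is rejected because of the space and the "#".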

Depending on the version you are running, you may see various warning messages appear in the console after creating your new run. These can be safely ignored if present. You should see an entry for your newly created run appear in the box shown under Step 2. Make sure this new entry is selected.

Step 3

In Step 3 we create the neutral Gaussian representation for our neutral map. The "neutral map" represents the solved structure from the folder selected in step 1. It's the map produced when you include all or most of the particles in the data set, and should be a sort of average structure for the system. When we look for dynamics/variability, it will be with respect to this neutral map.

To proceed with the GMM, we need to have a Gaussian model for this neutral map. This can be generated automatically, but the user needs to provide two key parameters: 1) the level of detail at which to make the representation and 2) how strong densities need to be to be included in the representation. Item 1) is defined as a resolution (in Å), but do not confuse this with the resolution you might achieve in final classified maps. This number defines the resolution level at which variability will be studied. If you specify 30 Å here, it will be looking for changes/motions in features with a 30 Å lengthscale, but successfully isolating a more self-consistent set of particles could still readily achieve near atomic resolution maps. When starting a new project we suggest beginning with 20-30 Å here. If no useful variability is detected, or when you are ready to look for finer motions, reduce the number incrementally by ~2x. You will want to make a new run (Step 2) each time you try new parameters.
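To plan a series of runs from coarse to fine, the "reduce by ~2x" suggestion can be written out as a short list of target resolutions. This is only an illustrative sketch: the function name resolution_schedule and the stopping value passed as finest are assumptions, and nothing here calls EMAN2.

```python
def resolution_schedule(start, finest, factor=2.0):
    """List target resolutions in Angstroms, coarsest first.

    Each entry is the previous one divided by factor, stopping before the
    value would drop below finest. Hypothetical planning helper only.
    """
    if start <= 0 or finest <= 0 or factor <= 1:
        raise ValueError("start and finest must be > 0 and factor must be > 1")
    res = float(start)
    schedule = []
    while res >= finest:
        schedule.append(res)
        res /= factor
    return schedule

# One new run (Step 2) per entry, each with its own Resolution setting.
for res in resolution_schedule(30, 7):
    print(f"new run: Resolution -> {res:g}")
```

Starting at 30 with a floor of 7 gives three runs, at 30, 15 and 7.5 Å. In practice you would stop earlier if a coarse run already reveals the variability you are interested in.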

Item 2) specifies a relative amplitude threshold for Gaussian modeling. This is specified as a resolution-dependent threshold. Larger values -> fewer Gaussians. If there are highly dynamic regions of the structure, specifying too high a threshold may fail to produce any Gaussians in these weaker regions of the map, which will make them less likely to be characterized when training the GMM. Still, there is a balance, as a very large number of Gaussians may be difficult to train, will make the GMM run for a long time, and make the results more difficult to interpret. We suggest selecting a threshold value to produce the fewest Gaussians which seem to include at least some representation of all regions of the map.
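The trade-off described above can be seen in a toy example. This sketch assumes the threshold is applied relative to the strongest candidate peak; the actual resolution-dependent threshold used by the program is not described in this tutorial, and the peak amplitudes below are invented for illustration.

```python
def keep_peaks(amplitudes, threshold):
    """Keep candidate peaks with amplitude >= threshold * strongest peak.

    Simplified stand-in for a relative amplitude threshold; this is an
    assumption for illustration, not the rule the program implements.
    """
    if not amplitudes:
        return []
    cutoff = threshold * max(amplitudes)
    return [a for a in amplitudes if a >= cutoff]

# Invented amplitudes: a rigid core (strong) and a flexible domain (weak).
peaks = [1.0, 0.9, 0.8, 0.75, 0.4, 0.35, 0.3]
for thr in (0.25, 0.5, 0.85):
    print(f"threshold {thr}: {len(keep_peaks(peaks, thr))} Gaussians")
```

With these made-up numbers, raising the threshold from 0.25 to 0.5 removes every peak from the weak "flexible" group, which is exactly the situation the text warns about.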

Resolution -> 15
(box below Resolution) -> 0.5,0.7
press Resolution

You should see your neutral map shown in the panel on the right, with some number of spheres visible inside it. The number appearing below the Resolution button is the number of Gaussians in the current model. You can drag the Sphere Size slider to adjust the visual size of the spheres. This has no impact on any computations, it is purely for visualization.

There is one additional complexity. In addition to producing Gaussians representing the strongest densities in the map, a smaller number of Gaussians representing the strongest negative features in the map (relative to the solvent) will also be produced. These are generally placed in residual CTF artifacts or otherwise just outside the structure, and are useful in accurately characterizing individual particles. If (box below Resolution) is specified as a single value, a reasonable automatic value is selected for the negative threshold. Optionally, (box below Resolution) can be specified as a comma-separated pair, with the first number as the threshold for positive Gaussians and the second number as the threshold for negative Gaussians. If the second value is larger, fewer negative Gaussians will be created. If you look at the console after pressing Resolution, you should see output like "pos: 244", "neg: 38", indicating that the model has 244 positive Gaussians and 38 negative Gaussians.
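The two accepted forms of the threshold box can be summarized in code. This is a sketch of the convention described above, not the GUI's actual parsing code; the function name parse_thresholds is hypothetical, and None stands in for "let the program choose the negative threshold automatically".

```python
def parse_thresholds(text):
    """Split threshold text into (positive, negative) thresholds.

    "0.5,0.7" -> (0.5, 0.7)
    "0.5"     -> (0.5, None), where None means "choose automatically"
    Hypothetical helper for illustration only.
    """
    parts = [p.strip() for p in text.split(",")]
    if len(parts) not in (1, 2) or any(p == "" for p in parts):
        raise ValueError(f"expected 'pos' or 'pos,neg', got {text!r}")
    pos = float(parts[0])
    neg = float(parts[1]) if len(parts) == 2 else None
    return pos, neg

print(parse_thresholds("0.5,0.7"))  # explicit positive and negative thresholds
print(parse_thresholds("0.5"))      # negative threshold left to the program
```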

You can play with the parameters above and press Resolution again as many times as you like to achieve a representation you are happy with, but note that large numbers of Gaussians (>1000) may significantly increase the amount of time required to train the GMM and will increase the GPU memory requirements for the network, and in most cases won't actually accomplish anything useful, unless you are using a large number of latent dimensions. Indeed, useful results can sometimes be obtained with as few as 20 - 30 Gaussians. A few hundred is a reasonable target for typical projects.

Once the neutral Gaussian model has been defined, the neural network needs to be initialized. It's critical that the network initialization use parameters compatible with the final GMM training, so we fill these in now:

Input Res -> 25,100
Mask -> nothing, clear the box
Latent Dim -> 4
Train Iter -> 20
Model Reg -> 0
Model Perturb -> 0.05
Convolutional -> not selected
Position -> selected
Amplitude -> selected
Press Train Neutral

A number of these parameters can be altered for the actual dynamics run. The ones which absolutely cannot be changed without rerunning Train Neutral are Latent Dim, Mask, and the selection state of Position and Amplitude. If you change any of these, you will need to repeat all of Step 3 before Train GMM.
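If you keep notes on the settings used for each run, a quick comparison can tell you whether a planned change forces you to rerun Train Neutral. The sketch below is bookkeeping only and does not read or write any EMAN2 files; the dictionary keys are hypothetical names chosen to mirror the GUI labels.

```python
# Settings the tutorial says cannot change without rerunning Train Neutral.
# Key names are hypothetical and simply mirror the GUI labels.
LOCKED = ("latent_dim", "mask", "position", "amplitude")

def changed_locked_params(neutral, dynamics):
    """Return the locked settings that differ between the two parameter sets."""
    return [key for key in LOCKED if neutral.get(key) != dynamics.get(key)]

# Values entered in Step 3 of this tutorial.
neutral = {
    "input_res": "25,100", "mask": "", "latent_dim": 4, "train_iter": 20,
    "model_reg": 0, "model_perturb": 0.05, "convolutional": False,
    "position": True, "amplitude": True,
}

# A planned dynamics run that changes Train Iter and Latent Dim.
planned = dict(neutral, train_iter=40, latent_dim=6)

problems = changed_locked_params(neutral, planned)
if problems:
    print("Repeat Step 3 first; changed:", ", ".join(problems))
```

Here only latent_dim is reported, since Train Iter is not among the settings listed above as locked.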