= Particle orientation refinement using GMM representation =

 * Most programs are available in EMAN2 builds after 2023-03, but some are still under continuous development. Newer versions are typically better.
 * The tutorial is only tested on Linux with Nvidia GPU and CUDA. 
 * For reference of the method, see the [[https://www.nature.com/articles/s41592-023-02082-9 | paper]].

Here we use particles of SARS-COV-2 from [[https://www.ebi.ac.uk/empiar/EMPIAR-10492/ | EMPIAR-10492]] as an example. Starting from particles with assigned orientation, i.e. the '''Polished''' folder (13.5GB) from EMPIAR, as well as '''job096_run_data.star'''.

== Import existing refinement ==

Here we will need a .lst file with the location of all particles and their initial orientation assignment. Since here we start from a Relion star file, run 
{{{
e2convertrelion.py job096_run_data.star --voltage 300 --cs 2.7 --apix 1.098 --amp 10 --skipheader 26 --onestack particles/particles_all.hdf --make3d --sym c3
}}}

Note that we need to phase flip the particles before the refinement, so this may take a while. Also make sure to provide the correct CTF related information to the program, including voltage, cs, amp, apix, since the program does not read those from the star file automatically. Check `--help` for more details. After importing the particles, with the `--make3d` option, the program will create a '''r3d_00''' folder and reconstruct the 3D maps. You should see the structure of Covid spike with FSC at ~3.9Å at this point. Note the resolution number here is different from the one reported, because the pixel size used for processing is 1.098, which is then calibrated to 1.061. We still use the pixel size of 1.098 here, since otherwise the CTF information from the star file would be incorrect. Just keep in mind the actual resolution number is slightly better than the reported value. 

To start from other formats:
 * From classical EMAN2 refinement (e2refine_easy), run {{{e2evalrefine.py refine_XX --extractorientptcl particles.lst}}}
 * From the new EMAN2 refinement (e2spa_refine), simply use the {{{ptcls_XX.lst}}} file from the last iteration. 
 * From CryoSPARC or others, convert it to a relion star file using [[https://github.com/asarnow/pyem | pyem]], then follow the relion conversion. 

To convert an EMAN2 lst back to the star file, use the `--lst2star` option. Note the two files need to be in the same order, so the even/odd lst files need to be merged and sorted first. For example, run
{{{
e2proclst.py gmm_00/ptcls_04_even.lst gmm_00/ptcls_04_odd.lst --create gmm_00/ptcls_04_all.lst --mergeref sets/all_relion.lst
}}}
to merge the particles while make sure the order is the same as the initial input, then run the same conversion command with the `--lst2star` option to generate a new star file.
{{{
e2convertrelion.py job096_run_data.star --voltage 300 --cs 2.7 --apix 1.098 --amp 10 --skipheader 26 --onestack particles/particles_all.hdf --lst2star sets/all_relion.lst
}}}
The new star file will be saved to `job096_run_data_from_eman.star`.

== Global orientation refinement ==

To start the GMM-based refinement, we first need to determine the number of Gaussian to represent the volume. Often it is convenient to just use the number of non-H atoms in the molecule. Alternatively, we can guess the number given an existing map, isosurface threshold, and target resolution. Ideally, use a map with the correct masking, sharpening, and filtration, and pick an isosurface threshold where all features of interest are visible and most of the noisy dust is hidden. `--maxres` should be the target resolution, which can be slightly higher (0.4Å here) than the initially determined resolution. Having a slightly larger/smaller number normally does not hurt.
{{{
e2gmm_guess_n.py r3d_00/threed_00.hdf --thr 4 --maxres 3.5 --startn 10000
}}}
Here the number we get is 18000, and the program should also generate a file called '''r3d_00/threed_00_seg.pdb''' which can be used to visualize the coordinates of the Gaussian in the density map, and also used to initialize the GMM for refinement. The actual file name could be different depending on the `e2gmm_guess_n` command, and will be print out at the end of that command. Now we can run the GMM based global refinement.
{{{
e2gmm_refine_iter.py r3d_00/threed_00.hdf --startres 3.9 --initpts r3d_00/threed_00_seg.pdb --sym c3
}}}

The program will start from the initial orientation assignment and run five iterations of refinement using GMMs as references. It should create a folder called '''gmm_00''', and you can find all files related to the refinement inside. '''threed_xx.hdf''' are the reconstructed density maps, '''fsc_masked_xx.txt''' are the FSC curves, and '''model_xx_even/odd.txt''' are the GMM parameters after each iteration. After the refinement, the resolution should reach ~3.4Å.

{{attachment:global_refinement.png | structure after global refinement |width=600}}

Note when displaying the volume through the '''Show3D''' button in '''e2display''' browser, the isosurface is automatically colored by the local resolution. Here we can see the tip of the spike is not as well-resolved as the central parts due to structual flexibility. We will address this in the following sections. 

== Focused refinement ==

Here we target the Receptor binding domain (RBD) using focused refinement. First, we need to make a mask for the target region using '''Filtertool'''. In the e2display browser, select '''gmm_00/threed_05.hdf''' and click '''Filtertool''' to start the program. Hold Shift while clicking the button will enter a "safe mode" of Filtertool, which might be useful if the program crashes often. To craft a mask for the RBD, we use three processors:
{{{
mask.soft:dx=15.0:dy=15.0:dz=80.0:outer_radius=20.0:width=20.0
filter.lowpass.gauss:cutoff_abs=0.2
mask.auto3d.thresh:nshells=4:nshellsgauss=6:return_mask=True:threshold1=6:threshold2=3.0
}}}

`mask.soft` locates the rough location of one of the RBD, and `filter.lowpass.gauss` lowpass filters the density map. `mask.auto3d.thresh` creates the final mask based on the filtered density. Basically, it starts from a high threshold indicated by `threshold1`, at which only the density of the target domain is visible, and gets down to a lower `threshold2`, where the density in the target domain is connected, but the density outside is not. The processor will then include all densities at `threshold2` that are connected to the visible voxels at `threshold1`, pad a few layers and add a soft Gaussian falloff as indicated by `nshells` and `nshellsgauss`, then return the mask. 

{{attachment:craft_mask.png | craft mask using filtertool |width=500}}

Clicking '''File -> Save''' will save the results to '''processed_map.hdf''', and we rename it to '''mask_rbd.hdf''' for better bookkeeping. 

Then run `e2gmm_refine_iter.py` again using the previous refinement and the new mask. 

{{{
e2gmm_refine_iter.py gmm_00/threed_05.hdf --startres 3.5 --initpts gmm_00/model_04.txt --mask mask_rbd.hdf --masksigma --expandsym c3
}}}

Note here `gmm_00/model_04.txt` does not actually exist, but the program will look for `gmm_00/model_04_even.txt` and the `odd` version automatically, and keep the "gold-standard" split from the previous refinement. Since the global refinement is performed with c3 symmetry imposed, here we need to expand it to c1 to focus on one of the RBDs, i.e., making 3 copies of each particle at the symmetrical orientations. 

The refinement will write related files in another `gmm_xx` folder. The final output will have a lower overall FSC curve, but better-resolved features at the target domain. 

{{attachment:focus_refine.png | Focused refinement of RBD |width=500}}


== Refinement from a DNN heterogeneity analysis ==

For this dataset, we use the N terminal domain (NTD) as an example to show refinement starting from a heterogeneity analysis. The movement of NTD is actually relatively subtle, and there might not be a significant difference between DNN-based heterogeneity analysis and the focused alignment. Ideally, this function is more useful when the target domain involves large-scale (with a range of 10Å or more) continuous movement. Nonetheless, we use this as a demonstration since it is easier to show all functionalities in one small dataset. 

Again, we start by making the mask for the NTD using the same method, and name the output '''mask_ntd.hdf'''

{{{
mask.soft:dx=30.0:dy=70.0:dz=70.0:outer_radius=20.0:width=30.0
filter.lowpass.gauss:cutoff_abs=0.1
mask.auto3d.thresh:nshells=4:nshellsgauss=4:return_mask=True:threshold1=3:threshold2=1
}}}

Then run the heterogeneous refinement starting from the previous global refinement given the mask for NTD. 
{{{
e2gmm_heter_refine.py gmm_00/threed_05.hdf --mask mask_ntd.hdf --expandsym c3 --maxres 3.5 --minres 80
}}}

The program will first perform the DNN heterogeneity analysis on that domain, and the trajectories can be viewed in `gmm_xx/ptcls_even_cls_00.hdf`. Since we keep the "gold-standard" data separation through the pipeline, there will be one volume stack for each subset. The exact trajectory can be different but the overall movement pattern from the subsets should be similar. 

{{attachment:dnn_heter.gif | Hetergeneous analysis of NTD |width=500}}

Then the program will convert conformation to particle orientation and perform a few rounds of focused alignment starting from the new orientation. This will produce a final map that is better resolved at one of the NTDs, and smeared out everywhere else. 

{{attachment:focus_refine2.png | Focused refinement of NTD |width=500}}

== Morphing molecular models ==

 * (Updated 2024-09) Now it is recommended to get the model series through the [[https://blake.bcm.edu/emanwiki/EMAN2/e2gmm_model | gmm_model protocol]], which should generate models that match the heterogeneity analysis result and also have good geometry. 

We can also visualize the same movement via molecular models. Here we use the existing PDB file (PDB:6zwv) and morph it using the trained decoder. 

{{{
e2gmm_eval.py --pts gmm_04/midall_00_even.txt --pcaout gmm_04/mid_pca.txt --ptclsin gmm_04/ptcls_00_even.lst --ptclsout gmm_04/ptcls_even_cls_10.lst --mode regress --ncls 4 --nptcl 5000 --axis 0 --decoder gmm_04/dec_even.h5 --pdb 6zwv.pdb --apix 1.061 --selgauss mask_ntd.hdf --skip3d --model00 gmm_04/model_00_even.txt
}}}

Here '''gmm_04''' was produced in the previous step of heterogeneity analysis with '''e2gmm_heter_refine'''. '''gmm_04/midall_00_even.txt''' is the latent space particle distribution of all particles from the even subset, focusing on the NTD, and '''gmm_04/dec_even.h5''' is the corresponding trained decoder from the analysis. Due to the apix issue mentioned in the beginning, here we need to overwrite it using the `--apix` options so the model matches the density. The program should produce multiple pdb files corresponding to the individual frames of the 3D movie, which can be visualized in UCSF Chimera.

{{attachment:morph_model.png | morphing models |width=500}}


== Merge multiple refinements ==

We also provide a simple way to merge the results from multiple refinement runs. For example, run

{{{
e2gmm_merge_patch.py gmm_01 gmm_02 --base gmm_00 --sym c3
}}}

if you have the RBD and NTD refinement results in the `gmm_01` and `gmm_02` folders. It will take the results from `gmm_00` as the base and replace the regions of focus using results from `gmm_01` and `gmm_02` and their corresponding focusing mask files. The program will use unmasked maps for the merge and recompute the FSC to filter the results accordingly. The result will be written as iteration 99 in the base folder, `gmm_00`.

== Patch-by-patch refinement ==

Finally, we show the patch-by-patch refinement. This process is slow but automatic. In fact, since the scale of movement is relatively small in this particular dataset, the patch-by-patch refinement can render the steps after the global refinement quite useless. I.e., you can simply run this after the global orientation refinement, and get about as good or better results without the manual mask crafting and heterogeneity analysis. However, it is still worth trying the DNN-based analysis if the protein is undergoing large-scale continuous movement, or some domains are not as well resolved after the patch-by-patch refinement. 

Starting from the global refinement, run
{{{
e2gmm_refine_patch.py gmm_00/threed_05.hdf --npatch 9 --startres 3.5 --masktight --expandsym c3
}}}

The program will divide the GMM into 9 patches and focus refine them individually. This will take 9 times longer than the focused refinement of one domain. The results of the patches will be merged together at the end and the final results will be shown as iteration 99. 

{{attachment:patch_refine.png | Patch-by-patch refinement |width=600}}

== Building motion trajectories from refinements ==

(After 2023-06-15) From the result of patch-by-patch refinement, or in fact any two or more refinement runs from the tutorial, we can build eigen-motion trajectories based on the difference of alignment of the same particles. For example, to assemble all results from the even half set of the patch-by-patch refinement, run
{{{
e2spa_trajfromrefine.py gmm_05/ptcls_*3_even.lst --baseali gmm_05/ptcls_00_even.lst
}}}

{{attachment:patch_traj.gif | Trajectory from patch refinement |width=500}}

The movement is a bit subtle since we take information from all patches. You can also view trajectories from a selected local region by providing only the ptcls_xx.lst file focused on that region.