Differences between revisions 4 and 6 (spanning 2 versions)

Particle orientation refinement using GMM representation

Most programs are available in EMAN2 builds after 2023-03, but some are still under continuous development. Newer versions are typically better.
It is recommended to add the "examples/" folder of the EMAN2 binary installation to $PATH, as some new programs have not been moved to "bin/" yet.
This tutorial has only been tested on Linux with an Nvidia GPU and CUDA.
For a reference on the method, see the arXiv paper (https://arxiv.org/abs/2303.18241).

Here we use particles of the SARS-CoV-2 spike from EMPIAR-10492 as an example. We start from particles with assigned orientations, i.e., the Polished folder (13.5GB) from EMPIAR, together with job096_run_data.star.

Import existing refinement

Here we will need a .lst file with the location of all particles and their initial orientation assignment. Since here we start from a Relion star file, run

e2convertrelion.py job096_run_data.star --voltage 300 --cs 2.7 --apix 1.098 --amp 10 --skipheader 26 --onestack particles/particles_all.hdf --make3d --sym c3

Note that the particles need to be phase flipped before the refinement, so this step may take a while. Also make sure to provide the correct CTF-related information to the program (voltage, cs, amp, apix), since the program does not read those values from the star file automatically. Check --help for more details. With the --make3d option, the program creates an r3d_00 folder after importing the particles and reconstructs the 3D maps. You should see the structure of the SARS-CoV-2 spike with an FSC resolution of ~3.9Å at this point. Note that this resolution differs from the reported one, because the pixel size used for processing is 1.098, which is then calibrated to 1.061. We still use the pixel size of 1.098 here, since otherwise the CTF information from the star file would be incorrect.
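To make the pixel size remark concrete, here is a minimal sketch of how a resolution measured at one pixel size maps to another. It assumes the FSC crossing sits at a fixed frequency in units of 1/pixel, so the value in Å scales linearly with the Å/pixel value. The function name is our own and is not part of EMAN2.

```python
def rescale_resolution(resolution_a, apix_used, apix_calibrated):
    """Convert a resolution measured with one pixel size to another pixel size.

    ASSUMPTION: the FSC crossing is fixed in units of 1/pixel, so the
    resolution in Angstrom scales linearly with the Angstrom/pixel value.
    """
    if apix_used <= 0 or apix_calibrated <= 0:
        raise ValueError("pixel sizes must be positive")
    return resolution_a * apix_calibrated / apix_used


if __name__ == "__main__":
    # Values quoted in this tutorial: ~3.9 A measured at 1.098 A/pixel,
    # with the pixel size later calibrated to 1.061 A/pixel.
    print(round(rescale_resolution(3.9, 1.098, 1.061), 2))
```

With the numbers above this gives about 3.77Å. It is only a rough conversion: it ignores any change in the CTF fit that a different pixel size would cause.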

To start from other formats:

From classical EMAN2 refinement (e2refine_easy), run e2evalrefine.py refine_XX --extractorientptcl particles.lst
From the new EMAN2 refinement (e2spa_refine), simply use the ptcls_XX.lst file from the last iteration.
From CryoSPARC or others, convert the results to a Relion star file using pyem, then follow the Relion conversion above.

Global orientation refinement

We first need to determine the number of Gaussians used to represent the volume. Often it is convenient to just use the number of non-H atoms in the molecule. Alternatively, we can guess the number given an existing map, an isosurface threshold, and a target resolution.

e2gmm_guess_n.py r3d_00/threed_00.hdf --thr 4 --maxres 3.5 --startn 10000

Here the number we get is 18000. The program should also generate a file called threed_seg.pdb, which can be used to visualize the coordinates of the Gaussians in the density map and to initialize the GMM for refinement. Now we can run the GMM-based global refinement.

e2gmm_refine_iter.py r3d_00/threed_00.hdf --startres 3.9 --initpts threed_seg.pdb --sym c3

The program will start from the initial orientation assignment and run five iterations of refinement using GMMs as references. It should create a folder called gmm_00, and you can find all files related to the refinement inside. threed_xx.hdf are the reconstructed density maps, fsc_masked_xx.txt are the FSC curves, and model_xx_even/odd.txt are the GMM parameters after each iteration. After the refinement, the resolution should reach ~3.4Å.
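To read a resolution off one of the FSC curves, a sketch like the following may help. It assumes the fsc_masked_xx.txt files are plain text with two columns (spatial frequency in 1/Å, then FSC) and uses the common 0.143 "gold-standard" threshold. Both are assumptions on our side, and the helper names are hypothetical.

```python
def fsc_resolution(rows, threshold=0.143):
    """Return the resolution (A) where the FSC first drops below threshold.

    rows: (spatial_frequency_in_1_per_A, fsc) pairs sorted by increasing
    frequency. Returns None if the curve never crosses the threshold.
    """
    prev_s, prev_f = None, None
    for s, f in rows:
        if prev_s is not None and prev_f >= threshold and f < threshold:
            # Linear interpolation between the two bracketing samples.
            frac = (prev_f - threshold) / (prev_f - f)
            s_cross = prev_s + frac * (s - prev_s)
            return 1.0 / s_cross if s_cross > 0 else None
        prev_s, prev_f = s, f
    return None


def read_fsc(path):
    """Parse a text file, ASSUMED to hold two columns: frequency, FSC."""
    rows = []
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) >= 2 and not line.startswith("#"):
                rows.append((float(parts[0]), float(parts[1])))
    return rows
```

For example, fsc_resolution(read_fsc("gmm_00/fsc_masked_05.txt")) would give the resolution of the last iteration, provided the file layout matches the assumption above.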

structure after global refinement
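For the first way of choosing the number of Gaussians mentioned in this section (the number of non-H atoms), the count can be taken from an atomic model if one is available. The sketch below is a hypothetical helper that assumes a standard fixed-column PDB file, with the element symbol in columns 77-78. It is not an EMAN2 program.

```python
def count_non_h_atoms(pdb_lines):
    """Count ATOM/HETATM records whose element is not hydrogen."""
    n = 0
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        element = line[76:78].strip().upper()
        if not element:
            # Fall back to the first letter of the atom name (columns 13-16)
            # when the element column is missing; leading digits as in "1HB"
            # are dropped. This is a crude guess and treats every name that
            # starts with H as hydrogen.
            element = line[12:16].strip().lstrip("0123456789")[:1].upper()
        if element not in ("H", "D"):
            n += 1
    return n
```

A call such as count_non_h_atoms(open("model.pdb")) (hypothetical file name) returns a number that can be used as the Gaussian count.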

Focused refinement

Here we target the receptor-binding domain (RBD) using focused refinement. First, we need to make a mask for the target region using Filtertool. In the e2display browser, select gmm_00/threed_05.hdf and click Filtertool to start the program. Holding Shift while clicking the button starts Filtertool in a "safe mode", which might be useful if the program crashes often. To craft a mask for the RBD, we use three processors:

mask.soft:dx=10.0:dy=15.0:dz=70.0:outer_radius=20.0:width=30.0
filter.lowpass.gauss:cutoff_abs=0.1
mask.auto3d.thresh:nshells=4:nshellsgauss=4:return_mask=True:threshold1=5.5:threshold2=3.0

mask.soft marks the rough location of one of the RBDs, and filter.lowpass.gauss lowpass filters the density map. mask.auto3d.thresh creates the final mask based on the filtered density. Basically, it starts from a high threshold given by threshold1, at which only the density of the target domain is visible, and goes down to a lower threshold2, at which the density in the target domain is connected but the density outside is not. The processor then includes all densities at threshold2 that are connected to the visible voxels at threshold1, pads a few layers and adds a soft Gaussian falloff as specified by nshells and nshellsgauss, then returns the mask.
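The two-threshold idea described above can be sketched in a few lines of plain Python. This is not the EMAN2 implementation of mask.auto3d.thresh. It only illustrates the "seed at threshold1, grow at threshold2" step on a small nested-list volume, and it leaves out the padding and Gaussian falloff controlled by nshells and nshellsgauss. The function name and the 6-connectivity are our assumptions.

```python
from collections import deque


def two_threshold_mask(vol, threshold1, threshold2):
    """Binary mask of voxels >= threshold2 connected to a voxel >= threshold1.

    vol is a nested list indexed as vol[z][y][x]; 6-connectivity is assumed,
    and threshold1 is expected to be >= threshold2.
    """
    nz, ny, nx = len(vol), len(vol[0]), len(vol[0][0])
    mask = [[[0] * nx for _ in range(ny)] for _ in range(nz)]
    queue = deque()
    # Seed with every voxel above the high threshold.
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                if vol[z][y][x] >= threshold1:
                    mask[z][y][x] = 1
                    queue.append((z, y, x))
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    # Grow the seeds through neighbours that pass the low threshold.
    while queue:
        z, y, x = queue.popleft()
        for dz, dy, dx in steps:
            a, b, c = z + dz, y + dy, x + dx
            if 0 <= a < nz and 0 <= b < ny and 0 <= c < nx:
                if not mask[a][b][c] and vol[a][b][c] >= threshold2:
                    mask[a][b][c] = 1
                    queue.append((a, b, c))
    return mask
```

In a single row of voxels with values 6, 4, 0, 4, 0 and thresholds 5.5 and 3.0, only the first two voxels end up in the mask: the second 4 passes threshold2 but is not connected to any seed.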

craft mask using filtertool

Clicking File -> Save will save the results to processed_map.hdf, and we rename it to mask_rbd.hdf for better bookkeeping.

Then run e2gmm_refine_iter.py again using the previous refinement and the new mask.

e2gmm_refine_iter.py gmm_00/threed_05.hdf --startres 3.5 --initpts gmm_00/model_04.txt --mask mask_rbd.hdf --masksigma --expandsym c3

Note here gmm_00/model_04.txt does not actually exist, but the program will look for gmm_00/model_04_even.txt and the odd version automatically, and keep the "gold-standard" split from the previous refinement. Since the global refinement is performed with c3 symmetry imposed, here we need to expand it to c1 to focus on one of the RBDs, i.e., making 3 copies of each particle at the symmetrical orientations.
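As an illustration of the symmetry expansion, the sketch below makes n copies of each particle, one per Cn operator. It assumes that orientations are stored as 3x3 rotation matrices, that the symmetry axis is z, and that the operator is applied on the left. The convention used inside EMAN2 may differ, and all names here are hypothetical.

```python
import math


def rot_z(angle_deg):
    """3x3 rotation matrix for a rotation about the z axis."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def expand_cn(particles, n):
    """Return n copies of each (particle_id, rotation) pair, one per Cn operator.

    ASSUMPTION: the symmetry axis is z and the operator multiplies the
    orientation matrix from the left; the real convention may differ.
    """
    expanded = []
    for pid, rot in particles:
        for k in range(n):
            expanded.append((pid, matmul(rot_z(360.0 * k / n), rot)))
    return expanded
```

For c3, every particle appears three times, with orientations 0, 120 and 240 degrees apart about the symmetry axis, so the expanded particle list is three times as long as the original.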

The refinement will write related files in another gmm_xx folder. The final output will have a lower overall FSC curve, but better resolved features at the target domain.

Focused refinement of RBD

Refinement from a DNN heterogeneity analysis

For this dataset, we use the N-terminal domain (NTD) as an example to show refinement starting from a heterogeneity analysis. The movement of the NTD is actually relatively subtle, and there might not be a significant difference between the DNN-based heterogeneity analysis and the focused alignment. Still, we use this as a demonstration since it is easier to show all functionalities in one small dataset.

Again, we start by making the mask for the NTD using the same method, and name the output mask_ntd.hdf.

mask.soft:dx=30.0:dy=70.0:dz=70.0:outer_radius=20.0:width=30.0
filter.lowpass.gauss:cutoff_abs=0.1
mask.auto3d.thresh:nshells=4:nshellsgauss=4:return_mask=True:threshold1=3:threshold2=1

Then run the heterogeneous refinement starting from the previous global refinement given the mask for NTD.

e2gmm_heter_refine.py gmm_00/threed_05.hdf --mask mask_ntd.hdf --expandsym c3 --maxres 3.5 --minres 80

The program will first perform the DNN heterogeneity analysis on that domain, and the trajectories can be viewed in gmm_xx/ptcls_even_cls_00.hdf. Since we keep the "gold-standard" data separation through the pipeline, there will be one volume stack for each subset. The exact trajectory can be different but the overall movement pattern from the subsets should be similar.
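The "gold-standard" separation simply means that the particles are divided into two halves that never share data. A minimal sketch, assuming the split is made by position parity in the particle list (an assumption on our side, with a hypothetical function name):

```python
def split_even_odd(particle_ids):
    """Split a list of particles into two independent halves by position parity."""
    even = particle_ids[0::2]  # positions 0, 2, 4, ...
    odd = particle_ids[1::2]   # positions 1, 3, 5, ...
    return even, odd
```

Each half is then analyzed on its own, which is why one trajectory stack is produced per subset and why the two trajectories need not match frame by frame.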

Heterogeneous analysis of NTD

Then the program will convert conformation to particle orientation and perform a few rounds of focused alignment starting from the new orientation. This will produce a final map that is better resolved at one of the NTDs, and smeared out everywhere else.

Focused refinement of NTD

Merge multiple refinements

We also provide a simple way to merge the results from multiple refinement runs. For example, run

e2gmm_merge_patch.py gmm_01 gmm_02 --base gmm_00 --sym c3

if you have the RBD and NTD refinement results in the gmm_01 and gmm_02 folders. It will take the results from gmm_00 as the base and replace the regions of focus using results from gmm_01 and gmm_02 and their corresponding focusing mask files. The program will use unmasked maps for the merge and recompute the FSC to filter the results accordingly. The result will be written as iteration 99 in the base folder, gmm_00.
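The "replace the regions of focus" step can be pictured as a mask-weighted blend. The sketch below is only an illustration of that idea on flat lists of voxel values. It assumes soft masks with values in [0, 1] and a simple linear blend, and it leaves out the FSC recomputation and filtering that the actual program performs. The function name is hypothetical.

```python
def merge_with_masks(base, focused_maps, masks):
    """Blend focused maps into a base map using soft masks in [0, 1].

    All inputs are flat lists of voxel values with equal length. Where masks
    overlap, later maps in the list take precedence over earlier ones.
    """
    merged = list(base)
    for fmap, mask in zip(focused_maps, masks):
        if not (len(fmap) == len(mask) == len(merged)):
            raise ValueError("all maps and masks must have the same size")
        # Weight w selects the focused map; (1 - w) keeps the current value.
        merged = [m * (1.0 - w) + f * w for m, f, w in zip(merged, fmap, mask)]
    return merged
```

Outside every mask the base map is kept unchanged, inside a mask the focused map takes over, and the soft edge gives a gradual transition between the two.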

Patch-by-patch refinement

Finally, we show the patch-by-patch refinement. This process is slow but largely automatic. In fact, since the scale of movement is relatively small in this particular dataset, the patch-by-patch refinement can make the steps after the global refinement unnecessary: you can simply run it after the global orientation refinement and get results that are about as good or better, without the manual mask crafting and heterogeneity analysis. However, it is still worth trying the DNN-based analysis if the protein is undergoing large-scale continuous movement, or if some domains are not as well resolved after the patch-by-patch refinement.

Starting from the global refinement, run

e2gmm_refine_patch.py gmm_00/threed_05.hdf --npatch 9 --startres 3.5 --masktight --expandsym c3

The program will divide the GMM into 9 patches and run a focused refinement on each of them individually. This takes about 9 times as long as the focused refinement of one domain. The results from the patches are merged at the end, and the final results are written as iteration 99.
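How the Gaussians are grouped into patches is not described in this tutorial. One plausible way to picture it is a clustering of the Gaussian coordinates, for instance with plain k-means as sketched below. The use of k-means, the farthest-point initialization, and the function name are all assumptions made for illustration, not a description of e2gmm_refine_patch.py.

```python
def kmeans_patches(coords, npatch, iterations=20):
    """Assign each 3D point to one of npatch clusters with plain k-means.

    Centers are initialised by farthest-point sampling so the result is
    deterministic. Returns a list of patch labels, one per point.
    """
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    if not 0 < npatch <= len(coords):
        raise ValueError("npatch must be between 1 and the number of points")

    centers = [tuple(coords[0])]
    while len(centers) < npatch:
        # Next center: the point farthest from all current centers.
        far = max(coords, key=lambda p: min(dist2(p, c) for c in centers))
        centers.append(tuple(far))

    labels = [0] * len(coords)
    for _ in range(iterations):
        # Assignment step: nearest center for every point.
        labels = [min(range(npatch), key=lambda k: dist2(p, centers[k]))
                  for p in coords]
        # Update step: move each center to the mean of its members.
        for k in range(npatch):
            members = [p for p, lab in zip(coords, labels) if lab == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels
```

Each label then defines one group of Gaussians that is refined separately, which is consistent with the run time growing with the number of patches.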

Patch-by-patch refinement

-  ⇤ ← Revision 4 as of 2023-04-03 19:50:24 → 
  Size: 5762
  Editor: MuyuanChen
  Comment:
+   ← Revision 6 as of 2023-04-04 00:38:59 → ⇥
  Size: 9792
  Editor: MuyuanChen
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 6:
+ * For reference of the method, see the [[https://arxiv.org/abs/2303.18241 | Arxiv paper]].
-Line 41:
+Line 42:
-Here we target the RBD using focused refinement. First, we need to make a mask for the target region using '''Filtertool'''. In the e2display browser, select '''gmm_00/threed_05.hdf''' and click '''Filtertool''' to start the program. Hold Shift while clicking the button will enter a "safe mode" of Filtertool, which might be useful if the program crashes often. To craft a mask for the RBD, we use three processors:
+Here we target the Receptor binding domain (RBD) using focused refinement. First, we need to make a mask for the target region using '''Filtertool'''. In the e2display browser, select '''gmm_00/threed_05.hdf''' and click '''Filtertool''' to start the program. Hold Shift while clicking the button will enter a "safe mode" of Filtertool, which might be useful if the program crashes often. To craft a mask for the RBD, we use three processors:
-Line 54:
+Line 55:
-== Refine from a GMM heterogeneity analysis ==
+Then run `e2gmm_refine_iter.py` again using the previous refinement and the new mask.
-Line 57:
+Line 58:
-e2gmm_heter_refine.py gmm_XX/threed_XX.hdf --maxres X --mask mask.hdf
+e2gmm_refine_iter.py gmm_00/threed_05.hdf --startres 3.5 --initpts gmm_00/model_04.txt --mask mask_rbd.hdf --masksigma --expandsym c3
-Line 59:
+Line 60:
-Here we also start from the global refinement. `--maxres` defines the resolution for the heterogeneity analysis, and it is typically safer to use a lower resolution (7Å by default), since the flexible parts are often not well resolved in the first place. The target region is specified with `mask.hdf`.
 Line 61:
+Note here `gmm_00/model_04.txt` does not actually exist, but the program will look for `gmm_00/model_04_even.txt` and the `odd` version automatically, and keep the "gold-standard" split from the previous refinement. Since the global refinement is performed with c3 symmetry imposed, here we need to expand it to c1 to focus on one of the RBDs, i.e., making 3 copies of each particle at the symmetrical orientations. 

The refinement will write related files in another `gmm_xx` folder. The final output will have a lower overall FSC curve, but better resolved features at the target domain. 

{{attachment:focus_refine.png | Focused refinement of RBD |width=500}}


== Refinement from a DNN heterogeneity analysis ==

For this dataset, we use the N terminal domain (NTD) as an example to show refinement starting from a heterogeneity analysis. The movement of NTD is actually relatively subtle, and there might not be a significant difference between DNN-based heterogeneity analysis and the focused alignment. Still, we use this as a demonstration since it is easier to show all functionalities in one small dataset. 

Again, we start by making the mask for the NTD using the same method, and name the output '''mask_ntd.hdf'''

{{{
mask.soft:dx=30.0:dy=70.0:dz=70.0:outer_radius=20.0:width=30.0
filter.lowpass.gauss:cutoff_abs=0.1
mask.auto3d.thresh:nshells=4:nshellsgauss=4:return_mask=True:threshold1=3:threshold2=1
}}}

Then run the heterogeneous refinement starting from the previous global refinement given the mask for NTD. 
{{{
e2gmm_heter_refine.py gmm_00/threed_05.hdf --mask mask_ntd.hdf --expandsym c3 --maxres 3.5 --minres 80
}}}

The program will first perform the DNN heterogeneity analysis on that domain, and the trajectories can be viewed in `gmm_xx/ptcls_even_cls_00.hdf`. Since we keep the "gold-standard" data separation through the pipeline, there will be one volume stack for each subset. The exact trajectory can be different but the overall movement pattern from the subsets should be similar. 

{{attachment:dnn_heter.gif | Hetergeneous analysis of NTD |width=500}}

Then the program will convert conformation to particle orientation and perform a few rounds of focused alignment starting from the new orientation. This will produce a final map that is better resolved at one of the NTDs, and smeared out everywhere else. 

{{attachment:focus_refine2.png | Focused refinement of NTD |width=500}}

== Merge multiple refinements ==

We also provide a simple way to merge the results from multiple refinement runs. For example, run

{{{
e2gmm_merge_patch.py gmm_01 gmm_02 --base gmm_00 --sym c3
}}}

if you have the RBD and NTD refinement results in the `gmm_01` and `gmm_02` folders. It will take the results from `gmm_00` as the base and replace the regions of focus using results from `gmm_01` and `gmm_02` and their corresponding focusing mask files. The program will use unmasked maps for the merge and recompute the FSC to filter the results accordingly. The result will be written as iteration 99 in the base folder, `gmm_00`.
-Line 64:
+Line 105:
-Starting from a finished global refinement, run
+Finally, we show the patch-by-patch refinement. This process is slow but quite automatic. In fact, since the scale of movement is relatively small in this particular dataset, the patch-by-patch refinement can render the steps after the global refinement quite useless. I.e., you can simply run this after the global orientation refinement, and get about as good or better results without the manual mask crafting and heterogeneity analysis. However, it is still worth trying the DNN-based analysis if the protein is undergoing large-scale continuous movement, or some domains are not as well resolved after the patch-by-patch refinement. 

Starting from the global refinement, run
-Line 66:
+Line 109:
-e2gmm_refine_patch.py gmm_XX/threed_XX.hdf --startres X --npatch N
+e2gmm_refine_patch.py gmm_00/threed_05.hdf --npatch 9 --startres 3.5 --masktight --expandsym c3
-Line 68:
+Line 111:
+The program will divide the GMM into 9 patches and focus refine them individually. This will take 9 times longer than the focused refinement of one domain. The results of the patches will be merged together at the end and the final results will be shown as iteration 99. 

{{attachment:patch_refine.png | Patch-by-patch refinement |width=600}}