Differences between revisions 3 and 4
Revision 3 as of 2019-03-16 23:47:25
Size: 4255
Editor: SteveLudtke
Comment:
Revision 4 as of 2021-09-06 23:23:31
Size: 4381
Editor: SteveLudtke
Comment:

HDF5

HDF5 is a general-purpose scientific data storage file format adopted by numerous scientific disciplines. In the CryoEM community, it is used by EMAN2/SPARX/SPHIRE and Chimera (two different variants), and may have some support in recent IMOD versions.

The advantage of HDF5 over other formats used in CryoEM is that it can store arbitrary metadata (header information) with every image, and can store stacks of images of any dimensionality using any number format (8- to 32-bit integer, floating point, etc.).

If you convert images from some other file format into HDF5 in EMAN2, it will retain most of the header information from the other format, following these specifications: http://eman2.org/Eman2Metadata

EMAN2 HDF5 Specifications

Note: this is still a very rough draft, provided so that other implementers have at least something to go on.

HDF5 is an interdisciplinary file format standard used by a wide range of scientific communities to represent N-dimensional data efficiently and accurately. EMAN2 uses this format by default for all data storage. However, the format is extremely flexible, so it is necessary to define the conventions used within the file.

When we implemented this, we followed a draft standard for interdisciplinary image storage in HDF. My hope was that it would have become some sort of official or semi-official standard by now, but I don't think that has happened yet. However, the specifications are pretty simple. HDF files are structured much like a filesystem, with a GROUP representing a folder, an ATTRIBUTE representing a single piece of metadata, and a DATASET containing the actual data:

HDF5 "refine_05/classes_01_even.hdf" {
GROUP "/" {
   GROUP "MDF" {
      GROUP "images" {
         ATTRIBUTE "imageid_max" {
            DATATYPE  H5T_STD_I32LE
            DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
            DATA {
            (0): 70
            }
         }
         GROUP "0" {
            ATTRIBUTE "EMAN.apix_x" {
               DATATYPE  H5T_IEEE_F64LE
               DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
               DATA {
               (0): 2.1
               }
            }
            ...  other attributes

            DATASET "image" {
               DATATYPE  H5T_IEEE_F32LE
               DATASPACE  SIMPLE { ( 168, 168 ) / ( 168, 168 ) }
               DATA {
               (0,0): -0.0727458, -0.0415057, -0.040645, -0.0567185,
               (0,4): -0.0407567, -0.0624736, -0.0837779, -0.048533,
               ... rest of image
               }
            }
         }
         GROUP "1" {
         ...

That is:

  • there is a top-level GROUP called "MDF", identifying the standard we're following
    • inside that is a GROUP called "images"
      • which contains a single integer attribute, "imageid_max", giving the largest integer image identifier in the file
        • 0 is always the lowest-numbered image
        • there is no guarantee that every image from 0 to n (= imageid_max) is present at all times, but no image numbered above n will be present
      • following this are the actual images
      • each image is a GROUP with an integer name
        • that group contains a list of named attributes
          • all of the attributes used in EMAN2 should be listed here: http://eman2.org/Eman2Metadata
          • the image attributes defined by EMAN are prefixed with "EMAN."
          • we request that others making use of this specification update this page if they add their own metadata items (ask for edit permission or email updates)
          • at least apix_x, apix_y, and apix_z are recommended
          • "special tag"s are not required in the file, as they can be computed from the data, but EMAN will generally store these values for convenience
        • followed by a DATASET, called "image" in the dump above, containing the actual image data
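
For implementers who want to produce a file with this layout, the list above can be sketched in Python. This is only a rough illustration, assuming the third-party h5py and numpy packages are installed; write_image, prefixed_attributes and updated_imageid_max are hypothetical helper names, not part of EMAN2. The shape and type used for "imageid_max" imitate the dump above, and whether EMAN2 accepts other shapes or types has not been verified here.

```python
EMAN_PREFIX = "EMAN."  # prefix for EMAN-defined attributes, per the list above


def prefixed_attributes(header):
    """Add the "EMAN." prefix to every header name."""
    return {EMAN_PREFIX + name: value for name, value in header.items()}


def updated_imageid_max(current, image_id):
    """Return the new imageid_max; it may never be smaller than any image id."""
    return image_id if current is None else max(current, image_id)


def write_image(path, image_id, pixels, header):
    """Store one image under /MDF/images/<image_id>.

    Hypothetical helper for illustration only, not an EMAN2 API.
    """
    # Imported here so the two helpers above can be used without h5py/numpy.
    import h5py
    import numpy as np

    with h5py.File(path, "a") as f:
        images = f.require_group("MDF").require_group("images")
        group = images.require_group(str(image_id))

        # One ATTRIBUTE per header item, stored with the "EMAN." prefix.
        for name, value in prefixed_attributes(header).items():
            group.attrs[name] = value

        # The pixel data lives in a DATASET named "image", as in the dump.
        if "image" in group:
            del group["image"]  # replace pixel data written earlier
        group.create_dataset("image", data=pixels)

        # Keep imageid_max consistent with the images now in the file.
        stored = images.attrs.get("imageid_max")
        current = None if stored is None else int(np.ravel(stored)[0])
        new_max = updated_imageid_max(current, image_id)
        # Assumption: a one-element 32-bit integer array, imitating the dump.
        images.attrs["imageid_max"] = np.array([new_max], dtype=np.int32)
```

A call such as write_image("stack.hdf", 0, pixels, {"apix_x": 2.1, "apix_y": 2.1, "apix_z": 2.1}) would then store image 0 with the three recommended attributes, where "stack.hdf" and pixels are placeholders for your own file name and array.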

If something I said above is ambiguous, you can take an EMAN2 HDF5 file and run h5dump on it, and it will give you human-readable output.
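
The same structure can also be inspected from Python. The following is a minimal sketch, assuming the third-party h5py and numpy packages are installed; read_image, list_image_ids and the two small helpers are hypothetical names for illustration, not EMAN2 functions. Attribute values are returned exactly as h5py delivers them (often one-element arrays, judging by the dump above).

```python
EMAN_PREFIX = "EMAN."  # prefix for EMAN-defined attributes


def strip_eman_prefix(attrs):
    """Return the EMAN-defined attributes with the "EMAN." prefix removed."""
    return {
        name[len(EMAN_PREFIX):]: value
        for name, value in attrs.items()
        if name.startswith(EMAN_PREFIX)
    }


def present_image_ids(group_names, imageid_max):
    """List the image ids that exist; gaps in 0..imageid_max are allowed."""
    names = set(group_names)
    return [n for n in range(imageid_max + 1) if str(n) in names]


def list_image_ids(path):
    """Return the ids of all images present in the file at path."""
    # Imported here so the helpers above can be used without h5py/numpy.
    import h5py
    import numpy as np

    with h5py.File(path, "r") as f:
        images = f["MDF"]["images"]
        # np.ravel copes with both scalar and one-element array attributes.
        imageid_max = int(np.ravel(images.attrs["imageid_max"])[0])
        return present_image_ids(images.keys(), imageid_max)


def read_image(path, image_id):
    """Return (header dict, pixel array) for one image."""
    import h5py

    with h5py.File(path, "r") as f:
        group = f["MDF"]["images"][str(image_id)]
        header = strip_eman_prefix(group.attrs)
        pixels = group["image"][()]  # read the whole DATASET into memory
    return header, pixels
```

Iterating over list_image_ids(path) rather than over range(imageid_max + 1) avoids a KeyError when some image numbers are missing from the file.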

Also, please note that EMAN1 used an earlier HDF convention for a while in the early 2000s, which did not follow this standard. EMAN2 will still read the old format, but always writes the new format. Chimera is capable of reading this format, but also supports another, simpler HDF structure, which it will write by default.
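
Since several HDF structures are in circulation, a reader may want to check for the layout described on this page before relying on it. A minimal sketch follows, assuming the third-party h5py package is installed; follows_mdf_layout and file_follows_mdf_layout are hypothetical names. The check only looks for the MDF/images group and does not try to identify which other convention a non-matching file uses.

```python
def follows_mdf_layout(root):
    """True if root has an "images" entry inside a top-level "MDF" entry.

    root may be an open h5py.File or any nested mapping with the same shape.
    """
    try:
        return "images" in root["MDF"]
    except (KeyError, TypeError):
        return False


def file_follows_mdf_layout(path):
    """Open path with h5py and apply follows_mdf_layout to it."""
    # Imported here so follows_mdf_layout can be used without h5py.
    import h5py

    try:
        with h5py.File(path, "r") as f:
            return follows_mdf_layout(f)
    except OSError:
        # h5py raises OSError when the file cannot be opened as HDF5.
        return False
```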

Please let me know if you need any more information (sludtke@bcm.edu)...

Eman2HDF (last edited 2021-09-07 01:43:57 by SteveLudtke)