Differences between revisions 13 and 14
Revision 13 as of 2011-02-22 16:46:31
Size: 5566
Editor: SteveLudtke
Comment:
Revision 14 as of 2011-02-24 20:14:20
Size: 6403
Editor: SteveLudtke
Comment:
WARNING !

EMAN2 stores much of the data and other information generated during processing in an embedded database system based on BerkeleyDB. These databases live in directories called EMAN2DB. You may be tempted to rename, delete or otherwise manipulate the files in these directories. If so, you need to be aware of a number of limitations and restrictions that come with the flexibility and convenience of such database systems. Failure to heed these warnings could result in data loss and a variety of apparently bizarre behavior. Regular flat files (HDF5, MRC, SPIDER, IMAGIC, etc.) saved by EMAN2 are completely safe, of course, and you can do what you want with them. This warning applies strictly to the EMAN2DB directory and its contents:

  • The e2bdb.py program can help to manipulate databases in certain ways. 'e2bdb.py -c' is an important command to be aware of (see below).
  • Do NOT move files within an EMAN2DB directory around. These are not normal image files that you can access or transport between machines. They are the internal files generated by an embedded database system. Don't mess with them !

  • Exception to the above statement: if you need to remove files from an EMAN2DB directory (because they take up too much space and are no longer needed, etc.), you can do so, but ONLY after running 'e2bdb.py -c' on the machine first. 'e2bdb.py --delete' is a simpler method (though it empties the files rather than completely deleting them).

  • Do NOT delete directories containing EMAN2DB directories without running 'e2bdb.py -c' first. If you do so anyway, exiting all EMAN2 programs and running 'e2bdb.py -c' after the inevitable problem occurs will generally fix the issue.

  • If you use a shared filesystem and wish to run EMAN2 jobs on one machine after running jobs on another machine, you must run 'e2bdb.py -c' on the first machine and ensure that all EMAN2 programs there are closed before opening programs on the other machine. If you access these files from 2 machines at once, you may corrupt databases or see inconsistent results.

  • If you DO get a message saying there is a database error and corruption may have resulted: first try running 'e2bdb.py -c'. 90% of the time that will fix the problem. If that doesn't work, then you may have to resort to removing the cache directory in /tmp. This may be a risky operation which could result in data loss, and is only a last resort.

  • To use EMAN2 images with other programs: most files are stored in the internal database by default. If you need to use EMAN2 images with another program, you can simply export them into any of the standard cryoEM formats. You can get files out of the database using the 'e2display.py' GUI or the browser in 'e2workflow.py' by right-clicking on the file and selecting 'save as', or by using 'e2proc2d.py' or 'e2proc3d.py' from the command line.

  • Beware of network-mounted filesystems. If your home directory is on a network volume rather than the local machine, you need to be very cautious. This CAN be done safely, but only with care. The EMAN2 database is safe for running multiple programs on a single machine. It is NOT safe for simultaneous access by multiple machines. For example, if you run an EMAN2 program accessing a particular database on one machine and simultaneously access the same database from another machine via NFS, you may get very unpredictable results, and if you write to the database from both machines, you could cause corruption. Note that this problem is not unique to EMAN2 database files. If you write to a regular file (SPIDER, IMAGIC, etc.) from 2 different machines at once, you will also often cause corruption.
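
The rule in the list above (run 'e2bdb.py -c' before deleting a directory that contains an EMAN2DB directory) can be wrapped in a small script. This is only a minimal sketch, assuming e2bdb.py is on the PATH and that all EMAN2 programs have already been closed. The helper names find_eman2db_dirs and safe_remove_project, and the injectable run argument, are hypothetical and not part of EMAN2; the only thing taken from this page is the 'e2bdb.py -c' command itself.

```python
import os
import shutil
import subprocess


def find_eman2db_dirs(root):
    """Return sorted paths of every EMAN2DB directory below root."""
    found = []
    for dirpath, dirnames, _ in os.walk(root):
        if "EMAN2DB" in dirnames:
            found.append(os.path.join(dirpath, "EMAN2DB"))
    return sorted(found)


def safe_remove_project(root, run=subprocess.check_call):
    """Close the local database cache, then delete root.

    Hypothetical helper: only the 'e2bdb.py -c' command comes from the
    warning above. Assumes all EMAN2 programs are already closed.
    """
    dbs = find_eman2db_dirs(root)
    if dbs:
        # Close the cache before touching any database files.
        run(["e2bdb.py", "-c"])
    shutil.rmtree(root)
    return dbs
```

Passing a different run callable (for example, one that only records the command) makes it possible to check what the helper would do without executing e2bdb.py.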

Brief technical explanation

Details on the database are discussed in Eman2DataStorage

Q: Why in the blankety-blank-blank did you decide to switch to this? It's a pain to deal with!

A: This system is an embedded database and stores information about a project, as well as much of the actual image data. Originally, we planned to use just HDF, a convenient cross-platform interdisciplinary format. However, upon testing, we found that A) its performance for large image sets was awful and B) the sorts of issues that often happen with shared filesystems (multiple processes accessing the same file, etc.) caused corruption in the HDF files somewhat frequently, and the files could not be recovered. If you do the sorts of things that cause problems for BDB with normal flat files such as MRC, IMAGIC or SPIDER, you will get file corruption in the form of mispositioned or out-of-order images and other strange phenomena. The BDB system permits you to safely use as many threads/processes as you like within a single computer, and has specific procedures to follow when you need to go between multiple machines. It also gives a dramatic performance boost to many tasks. So, that's why we use it...

However, it comes with a few limitations. Like most databases, it uses a memory & disk cache to give faster access to information and to coordinate access to the data from multiple programs (on the same machine). This cache consists of a set of files stored in /tmp (which must be physically attached to the local machine). If you try to access the same database from two different machines at the same time via a shared network filesystem, each machine establishes an independent cache in /tmp, and both think they have exclusive access to the files. This produces a situation where the machines can easily disagree about the contents of a file, which can cause database corruption. The 'e2bdb.py -c' program will safely close the cache on one machine, so the database can be reliably accessed from another machine. It is also possible in some cases to open the databases read-only from multiple machines at once, with no cache. However, this is a special case used in some situations on clusters, and not a general rule.
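
The disagreement described above can be illustrated with a toy model. This is a sketch only and not how BerkeleyDB or EMAN2 actually implement their cache: a plain dict stands in for the files on the shared volume, each ToyMachine (a hypothetical class) keeps a private dict standing in for its cache in /tmp, and close_cache plays the role that 'e2bdb.py -c' plays on a real machine.

```python
class ToyMachine:
    """A machine with a private write-back cache over a shared store."""

    def __init__(self, shared):
        self.shared = shared   # stands in for files on the network volume
        self.cache = {}        # stands in for the local cache in /tmp
        self.dirty = set()     # keys written locally but not yet flushed

    def read(self, key):
        # Serve from the private cache; fall back to shared storage once.
        if key not in self.cache:
            self.cache[key] = self.shared.get(key)
        return self.cache[key]

    def write(self, key, value):
        # The new value is visible only on this machine until flushed.
        self.cache[key] = value
        self.dirty.add(key)

    def close_cache(self):
        """Flush pending writes and drop the cache."""
        for key in self.dirty:
            self.shared[key] = self.cache[key]
        self.cache.clear()
        self.dirty.clear()


def demo():
    shared = {"img0": "v1"}
    a, b = ToyMachine(shared), ToyMachine(shared)
    b.read("img0")                  # B caches the old value
    a.write("img0", "v2")           # A updates only its own cache
    a_view, b_view = a.read("img0"), b.read("img0")  # the two now disagree

    a.close_cache()                 # safe handoff: A closes its cache first
    c = ToyMachine(shared)          # a machine that opens the data afterwards
    return a_view, b_view, c.read("img0")
```

In this model, machine B keeps seeing its stale copy while A is active, whereas a machine that opens the data only after A has closed its cache sees A's update.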

The files in the EMAN2DB directories are not normal flat image files, but are actually proprietary database files. Moving them around or otherwise messing with them will confuse the database. Just as you wouldn't create a MySQL database and then go moving its database files around willy-nilly, you shouldn't mess with files in the EMAN2DB directory, particularly if there is an active cache.

EMAN2/DatabaseWarning (last edited 2014-04-22 14:49:36 by SteveLudtke)