Differences between revisions 26 and 27
Revision 26 as of 2011-09-30 13:31:53
Size: 7601
Editor: SteveLudtke
Comment:
Revision 27 as of 2011-12-05 17:00:35
Size: 7810
Editor: SteveLudtke
Comment:

WARNING !

If you look around the subdirectories in an EMAN2 project, you will find directories (folders) named EMAN2DB. EMAN2 stores much of the data and other information generated during processing in an embedded database system based on BerkeleyDB, and these databases live in the EMAN2DB directories. While the files may look like ordinary image files, the situation is much more complicated than this, and failing to understand the contents of this warning page may lead to data loss, image corruption and other bizarre things. While the database is now very stable (unlike 2010 and earlier), you must still follow the rules.

It IS safe to do things like 'rsync -avr' an entire project to your cluster so you can run a big EMAN2 job, but only if you are careful. Specifically, you MUST run e2bdb.py -c on both computers before the copy, which also requires that you not have any EMAN2 jobs running on either machine at that time. Similarly, you should run this command before trying to delete or rename any files inside an EMAN2DB directory.
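As a minimal sketch of the copy procedure described above, the sequence might look like the following. The function name `sync_project` and the `E2BDB`, `SSH` and `RSYNC` variables are illustrative and not part of EMAN2. The sketch assumes that ssh access to the cluster is already set up, that `e2bdb.py` is on the PATH of both machines (including in non-interactive ssh sessions), and that no EMAN2 jobs are running on either machine.

```shell
#!/bin/sh
# Hypothetical helper, not part of EMAN2.
# The commands can be overridden through these variables, e.g. for a dry run.
E2BDB="${E2BDB:-e2bdb.py}"
SSH="${SSH:-ssh}"
RSYNC="${RSYNC:-rsync}"

# Usage: sync_project LOCAL_DIR REMOTE_HOST REMOTE_DIR
sync_project() {
    local_dir="$1"
    remote_host="$2"
    remote_dir="$3"

    # Close the database cache on the local machine.
    $E2BDB -c || return 1

    # Close the database cache on the remote machine.
    $SSH "$remote_host" "e2bdb.py -c" || return 1

    # Only now copy the whole project, EMAN2DB directories included.
    $RSYNC -avr "$local_dir" "$remote_host:$remote_dir"
}
```

For example, `sync_project /home/me/myproject cluster /scratch/me/` (host and remote path are placeholders) would close both caches and then copy the project. If either `e2bdb.py -c` call fails, the function stops before copying anything.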

Here are some specific rules to follow:

  • When upgrading: BEFORE you install the new version, run 'e2bdb.py -c' using the old version. Most of the time this isn't necessary, but if the BDB library bundled with EMAN2 has been upgraded in the new version, you will have problems if you don't.

  • Do NOT do a 'hard kill' on any EMAN2 program. Closing windows normally in the GUI is fine, and on Unix, 'kill' is fine, but 'kill -9' is not. Doing a 'force close' of a GUI window is potentially very bad.
  • Use e2bdb.py -c whenever you plan to do something involving an EMAN2DB directory, even if it seems harmless. For example, say you have a project directory called /home/me/myproject, and this directory contains refine01/EMAN2DB. If you do something like 'mv /home/me/myproject /home/me/myproject2', you will cause all sorts of strange problems, UNLESS you run e2bdb.py -c first!

  • It is safe to run e2bdb.py -c multiple times. It should never cause any damage. If you are running an EMAN2 program at the same time, it will warn you on most platforms. In some cases (Windows?) it may cause the EMAN2 program that is running to crash, but it will not corrupt the database.

  • If you forget to do this, and experience an error, or apparently vanished files, etc., about 90% of the time running e2bdb.py -c after the error will correct the problem.

  • If that doesn't work, please see the debugging FAQ, which has a step-by-step guide to maximize your chances of recovery.

  • To use EMAN2 images with other programs: Most files are stored in the internal database by default. If you need to use EMAN2 images with another program, you can simply export them into any of the standard cryoEM formats. You can get files out of the database using the 'e2display.py' GUI or the browser in 'e2workflow.py' by right-clicking on the file and selecting 'save as', or by using 'e2proc2d.py' or 'e2proc3d.py' from the command line. Note that, other than the bdb itself, .hdf is the only file format which can store all of the image metadata generated in EMAN2.

  • Beware of network-mounted filesystems. If your home directory is on a network volume rather than the local machine, you need to be very cautious. This CAN be done safely, but only with care. The EMAN2 database is safe for running multiple programs on a single machine. It is NOT safe for simultaneous access by multiple machines. For example, if you run an EMAN2 program accessing a particular database on one machine, and simultaneously access the same database from another machine via NFS, you may get very unpredictable results, and if you write to the database from both machines, you could cause corruption. Note that this problem is not unique to EMAN2 database files. If you write to a regular file (SPIDER, IMAGIC, etc.) from 2 different machines at once, you will also often cause corruption.

  • Do not put spaces in any filenames or components of your path. This is good advice for computers in general. While most modern operating systems support spaces in filenames, a wide range of software (scientific software in particular) handles them badly. For example, avoid paths like '/home/user/EMAN Project/testdata'; use an '_' instead of the space.

  • External hard drives: We strongly recommend NOT using external hard drives for processing, as convenient as it may seem to be. There are many reasons for this. If you do it anyway, make very sure that you run 'e2bdb.py -c' on the computer BEFORE you properly unmount/eject the disk and unplug it from the computer! One mistake one time and you may have serious recovery problems.

    • If you make a mistake and eject the hard drive before running 'e2bdb.py -c', you may be able to salvage the situation by immediately plugging it back in and running e2bdb.py -c once it is mounted.
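The renaming rule above (run e2bdb.py -c first, and keep spaces out of paths) can be sketched as a small wrapper around 'mv'. This is only an illustration under stated assumptions: `rename_project` and the `E2BDB` variable are hypothetical names, not part of EMAN2, and the sketch assumes `e2bdb.py` is on your PATH and that no EMAN2 jobs are running.

```shell
#!/bin/sh
# Hypothetical helper, not part of EMAN2.
# The command used to close the cache can be overridden through E2BDB.
E2BDB="${E2BDB:-e2bdb.py}"

# Usage: rename_project OLD_DIR NEW_DIR
rename_project() {
    old="$1"
    new="$2"

    # Refuse destination paths that contain a space.
    case "$new" in
        *" "*)
            echo "refusing: path contains a space: $new" >&2
            return 2
            ;;
    esac

    # Close the database cache BEFORE touching anything that contains EMAN2DB.
    $E2BDB -c || return 1

    # Only rename once the cache has been closed successfully.
    mv -- "$old" "$new"
}
```

For example, `rename_project /home/me/myproject /home/me/myproject2` performs the rename from the rule above only after the cache has been closed. If 'e2bdb.py -c' fails, nothing is moved.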

Brief technical explanation

Details on the database are discussed in Eman2DataStorage

There are also some FAQ questions dealing with problem solving

Q: Why in the blankety-blank-blank did you decide to switch to this? It's a pain to deal with!

A: Desktop computers with up to 12 cores are now becoming very common, meaning that to get full performance you need to run things in parallel. Did you know that the normal flat files you deal with (Spider, MRC, IMAGIC, etc.) don't work very well in parallel environments? If you have multiple cores trying to write to a file at the same time, you can get images corrupted in many different ways. This is particularly true if you use a shared filesystem (like NFS or Windows/Mac file sharing).

Traditional databases run a server process which requires maintenance, and has to be running all the time to use any programs. BDB, however, is an embedded database, which doesn't have a server, and permits multiple jobs within a single computer all to safely read and write to the database at the same time. It stores information about a project, as well as much of the actual image data. It also gives a dramatic performance boost to many tasks and permits arbitrary information to be stored with each image. So, that's why we use it...

However, it comes with a few limitations. Like most databases, it uses a memory & disk cache to give faster access to information and coordinate access to the data from multiple programs (on the same machine). This cache consists of a set of files stored in /tmp (which must be physically attached to the local machine). If you try to access the same database from two different computers at the same time via a shared network filesystem, each machine establishes an independent cache in /tmp, and both think they have exclusive access to the files. This produces a situation where the machines can easily disagree about the contents of a file, and can cause database corruption. The 'e2bdb.py -c' program will safely close the cache on one machine, so it can be reliably accessed from another machine. It is also possible in some cases to open the databases read-only from multiple machines at once, with no cache, however this is a special case used in some situations on clusters, and not a general rule.

The files in the EMAN2DB directories are not normal flat image files, but are actually proprietary database files. Moving them around or otherwise messing with them will confuse the database. If you run 'e2bdb.py -c', which closes and removes the cache, then and only then is it safe to do things like copy databases between machines, or remove or rename EMAN2DB contents. Note that while a cache is active, you must also NOT rename directories containing EMAN2DB directories.

EMAN2/DatabaseWarning (last edited 2014-04-22 14:49:36 by SteveLudtke)