WARNING!

If you look around the subdirectories in an EMAN2 project, you will find directories (folders) named EMAN2DB. EMAN2 stores much of its data and other information during processing in an embedded database system based on BerkeleyDB, and these databases live in the EMAN2DB directories. While the files may look like ordinary image files, the situation is much more complicated than that, and failing to understand the contents of this warning page may lead to data loss, image corruption and other bizarre things. While the database is now very stable (unlike in 2010 and earlier), you must still follow the rules.

It IS safe to do things like 'rsync -avr' an entire project to your cluster so you can run a big EMAN2 job, but only if you are careful. Specifically, you MUST run e2bdb.py -c on both computers before the copy, which also requires that you not have any EMAN2 jobs running on either machine at that time. Similarly, you should run this command before trying to delete or rename any files inside an EMAN2DB directory.
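The copy procedure above can be sketched as a small Python helper. This is only an illustration: 'e2bdb.py -c' and 'rsync -avr' come from the text, while the function names, the use of ssh to run the command on the cluster, and the host and path arguments are assumptions. By default it only prints the commands, so nothing is executed unless you ask for it.

```python
import subprocess


def plan_project_copy(project_dir, remote_host, remote_dir):
    """Return the commands, in order, for a safe project copy.

    Hypothetical helper: assumes ssh access to remote_host and that
    e2bdb.py is on the PATH of both machines.
    """
    src = project_dir.rstrip("/") + "/"  # trailing slash: copy the contents
    return [
        ["e2bdb.py", "-c"],                      # close the cache locally
        ["ssh", remote_host, "e2bdb.py", "-c"],  # close the cache remotely
        ["rsync", "-avr", src, f"{remote_host}:{remote_dir}"],
    ]


def run_plan(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failure


if __name__ == "__main__":
    # Example paths and host name are placeholders.
    run_plan(plan_project_copy("/data/myproject", "cluster", "/scratch/myproject"))
```

The sketch does not check whether EMAN2 jobs are still running on either machine; that remains your responsibility before running it with dry_run=False.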

Here are some specific rules to follow:

Brief technical explanation

Details on the database are discussed in Eman2DataStorage

There are also some FAQ questions dealing with problem solving

Q: Why in the blankety-blank-blank did you decide to switch to this? It's a pain to deal with!

A: Desktop computers with up to 12 cores are now becoming very common, meaning that to get full performance you need to run things in parallel. Did you know that the normal flat files you deal with (Spider, MRC, IMAGIC, etc.) don't work very well in parallel environments? If you have multiple cores trying to write to a file at the same time, images can be corrupted in many different ways. This is particularly true if you use a shared filesystem (like NFS or Windows/Mac file sharing).

Traditional databases run a server process, which requires maintenance and has to be running all the time for any of the programs to work. BDB, however, is an embedded database: it has no server, and it permits multiple jobs within a single computer to safely read and write to the database at the same time. It stores information about a project, as well as much of the actual image data. It also gives a dramatic performance boost to many tasks and permits arbitrary information to be stored with each image. So, that's why we use it...

However, it comes with a few limitations. Like most databases, it uses a memory and disk cache to give faster access to information and to coordinate access to the data from multiple programs (on the same machine). This cache consists of a set of files stored in /tmp (which must be physically attached to the local machine). If you try to access the same database from two different computers at the same time via a shared network filesystem, each machine establishes an independent cache in its own /tmp, and both think they have exclusive access to the files. The machines can then easily disagree about the contents of a file, which can corrupt the database. Running 'e2bdb.py -c' safely closes the cache on one machine, so the database can be reliably accessed from another machine. It is also possible in some cases to open the databases read-only from multiple machines at once, with no cache; however, this is a special case used in some situations on clusters, and not a general rule.
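The disagreement between two machines can be illustrated with a toy model. This is not how BerkeleyDB is implemented: the CachedView class, the dictionary standing in for the shared filesystem, and the assumption that writes stay in the local cache until it is closed are all invented for illustration.

```python
class CachedView:
    """Toy write-back cache over a shared store (not BerkeleyDB itself)."""

    def __init__(self, shared_disk):
        self.disk = shared_disk  # stands in for the shared filesystem
        self.cache = {}          # stands in for this machine's /tmp cache
        self.dirty = set()       # keys changed locally but not yet flushed

    def read(self, key):
        if key not in self.cache:  # first access: load from the shared disk
            self.cache[key] = self.disk.get(key)
        return self.cache[key]

    def write(self, key, value):
        self.cache[key] = value    # held locally until close()
        self.dirty.add(key)

    def close(self):
        """Flush changes and drop the cache, loosely like 'e2bdb.py -c'."""
        for key in self.dirty:
            self.disk[key] = self.cache[key]
        self.cache.clear()
        self.dirty.clear()


def demo():
    disk = {"image_0": "original"}
    machine_a, machine_b = CachedView(disk), CachedView(disk)

    machine_b.read("image_0")              # B caches the old value
    machine_a.write("image_0", "updated")  # A changes only its own cache
    before = (machine_a.read("image_0"), machine_b.read("image_0"))

    machine_a.close()                      # A flushes and drops its cache
    machine_b.close()                      # B drops its stale cache
    after = (machine_a.read("image_0"), machine_b.read("image_0"))
    return before, after


if __name__ == "__main__":
    print(demo())
```

In this model the two views disagree while both caches are open, and agree again once each cache has been closed, which is the reason for closing the cache before switching machines.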

The files in the EMAN2DB directories are not normal flat image files, but are actually BerkeleyDB database files. Moving them around or otherwise messing with them will confuse the database. If you run 'e2bdb.py -c', which closes and removes the cache, then and only then is it safe to do things like copy databases between machines, or remove or rename EMAN2DB contents. Note that while a cache is active, you must also NOT rename directories containing EMAN2DB directories.
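Before renaming or moving a directory, it can help to check whether the rule above applies to it. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and it relies solely on the directory name EMAN2DB mentioned in the text.

```python
import os


def find_eman2db_dirs(root):
    """Return sorted paths of all directories named EMAN2DB under root."""
    found = []
    for dirpath, dirnames, _filenames in os.walk(root):
        if "EMAN2DB" in dirnames:
            found.append(os.path.join(dirpath, "EMAN2DB"))
            dirnames.remove("EMAN2DB")  # do not descend into the database
    return sorted(found)


def rename_needs_cache_closed(path):
    """True if path is, is inside, or contains an EMAN2DB directory."""
    parts = os.path.normpath(os.path.abspath(path)).split(os.sep)
    return "EMAN2DB" in parts or bool(find_eman2db_dirs(path))
```

If rename_needs_cache_closed returns True for a path, run 'e2bdb.py -c' (with no EMAN2 jobs running) before touching it.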