Using the database from Python (for programmers or advanced users)

The usual way to access image data on disk is through the read_image, read_images and write_image methods, for example:

    # e2.py
    img=EMData()
    img.read_image("test.hdf",5)  # reads the 6th image from test.hdf (first image is 0)
    img.write_image("test2.hdf",-1)   # appends (-1) the image to the end of test2.hdf
    img_list=EMData.read_images("test.hdf",range(50))   # reads the first 50 images from test.hdf into a list of EMData objects
    n=EMUtil.get_image_count("test.hdf")   # counts the number of images in test.hdf

When writing to a file format that typically stores 8 bits per pixel, such as JPEG, PNG or PGM, the floating point values in the image must be converted to an 8 bit scale. By default this is done with an algorithm that excludes outliers (i.e., the mapped range does not span the full range of the image). To override this behavior, set the dictionary elements "render_min" and "render_max" on the image to be saved, and the specified range will be used instead. Here is a simple example:

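    a=test_image()
    a["render_min"]=a["minimum"]   # map the image's true minimum to the bottom of the 8 bit scale
    a["render_max"]=a["maximum"]   # map the image's true maximum to the top of the 8 bit scale
    a.write_image("a.png")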
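
The effect of choosing a render range can be sketched in plain Python. This is only an illustration of linear scaling with clipping, not EMAN2's actual conversion code; the function name to_8bit and the sample values are hypothetical.

```python
def to_8bit(values, render_min, render_max):
    """Linearly map floats in [render_min, render_max] to integers 0..255.

    Values outside the range are clipped to 0 or 255.
    """
    if render_max <= render_min:
        raise ValueError("render_max must be greater than render_min")
    scale = 255.0 / (render_max - render_min)
    out = []
    for v in values:
        clipped = min(max(v, render_min), render_max)
        out.append(int(round((clipped - render_min) * scale)))
    return out

# Hypothetical pixel values; 10.0 plays the role of an outlier.
pixels = [-1.0, 0.0, 0.5, 1.0, 10.0]

# Spanning the full range: the outlier compresses everything else
# into the dark end of the scale.
full_range = to_8bit(pixels, min(pixels), max(pixels))

# Excluding the outlier: ordinary values use the whole scale and the
# outlier saturates at 255.
narrow_range = to_8bit(pixels, -1.0, 1.0)
```

Comparing full_range with narrow_range shows why excluding outliers usually gives better contrast, and why setting "render_min" and "render_max" to the true minimum and maximum is useful when every value must stay distinguishable.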

File I/O can also be performed on databases by using a "bdb:" prefix in place of a file name, for example:

    img.read_image("bdb:test",5)
    img.write_image("bdb:test2",-1)

However, this is not the preferred mechanism for using the database interface, since many more powerful operations are available. For example:

    # Start e2.py. This implicitly performs a 'from EMAN2db import *', which opens
    # the local environment: DB=EMAN2DB.open_db()
    testdb = db_open_dict("bdb:test")   # opens a specific database called "test" in the local directory
    testdb[0]=test_image()              # stores an EMData object in the 'test' database
    img=testdb[0]                       # reads the EMData object back from the database
    testdb.set_attr(0,"mykey",5.5)      # sets an attribute "mykey" on the EMData keyed 0 in database 'test'.
                                        # This operation is MUCH faster than doing the same thing with any
                                        # flat file
    testdb.get_attr(0,"mykey")          # retrieves an attribute of image 0 from database 'test' without
                                        # loading the image data
    testdb["testimg"]=test_image()      # Keys in the database need not be integers, though the
                                        # read_image, etc. methods can only access integer keys
    testdb["alist"]=[1,2,3,4,5]         # You can also use the 'test' database to store arbitrary other
                                        # metadata, not just images. This assigns a list to key 'alist'
    db_close_dict("bdb:test")           # Databases are closed cleanly and automatically, except when
                                        # python is forcibly terminated (^c is ok), but it isn't
                                        # a bad idea to close them if you know you won't use them again

Basically, each database object can be treated as a Python dictionary. Any Python object that can be pickled (almost any Python object) can be stored as a value in these dictionaries. It is even possible to mix images of different sizes within a single database.
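
To check in advance whether a value could be stored, you can try pickling it yourself. The following is a small sketch using only the standard pickle module; the helper is_picklable and the sample record are hypothetical and are not part of EMAN2.

```python
import pickle

def is_picklable(obj):
    """Return True if pickle.dumps accepts obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

# Ordinary containers of numbers, strings, tuples, None, etc. round-trip cleanly.
record = {"alist": [1, 2, 3, 4, 5], "apix": 2.0, "note": ("a", None)}
restored = pickle.loads(pickle.dumps(record))

# A lambda has no importable name, so it is one of the few things
# that cannot be pickled.
lambda_ok = is_picklable(lambda x: x)
```

Note that restored is an independent copy of record, not the same object. That copy semantics is what any pickle-based store gives you when a value is read back.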

The attribute mechanism (set_attr, get_attr) is tied into the EMData object attribute dictionary. That is, the following operations are functionally equivalent, but the second version is MUCH faster.

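    # Version 1: read the whole image, change the attribute, write the whole image back
    img=testdb[0]
    img.set_attr("mykey",5.5)
    testdb[0]=img
    # OR
    # Version 2: update only the attribute, without touching the image data
    testdb.set_attr(0,"mykey",5.5)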

Unlike with Python dictionaries, if a value in the database is a mutable object, changing the object in place does not write the change back to the database; you must explicitly store it again. For example:

    # With a dictionary
    test={1:["a","b","c"],2:3}
    test[1][1]="c"
    print(test[1])      # prints ['a', 'c', 'c']
    # With a database
    testdb = db_open_dict("bdb:test")
    testdb[1]=["a","b","c"]
    testdb[2]=3
    testdb[1][1]="c"    # This effectively does nothing: it changes a temporary copy
    print(testdb[1])    # prints ['a', 'b', 'c']
    # To make the above actually work
    d=testdb[1]
    d[1]="c"
    testdb[1]=d
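
The same behavior can be reproduced without EMAN2 using Python's standard shelve module, which is also a pickle-backed, dictionary-like store. This is an analogy only, not the EMAN2 implementation. Note that shelve requires string keys, and the temporary path used here is arbitrary.

```python
import os
import shelve
import shutil
import tempfile

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "demo")

db = shelve.open(path)          # writeback defaults to False
db["1"] = ["a", "b", "c"]       # shelve keys must be strings

# db["1"] unpickles a fresh copy; the assignment changes that copy only,
# and the copy is then discarded.
db["1"][1] = "c"
after_in_place = db["1"]

# Read, modify, then store again to make the change persistent.
d = db["1"]
d[1] = "c"
db["1"] = d
after_write_back = db["1"]

db.close()
shutil.rmtree(tmpdir)           # remove the temporary files
```

Here after_in_place still holds the original list, while after_write_back holds the modified one, mirroring the testdb example above.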

You can write/read the full header for an EMData object inexpensively with:

    testdb[2]=test_image()
    hdr=testdb.get_header(2)   # returns the equivalent of get_attr_dict on an EMData object
    # For a database on disk, get_header requires an argument (the image number).
    hdr["apix_x"]=2.0
    testdb.set_header(2, hdr)    # hdr can be either a dictionary or an EMData object

There is a small cost associated with opening each database, so for performance it is generally a good idea to open a database once, keep it open, and close it only if you don't expect to use it again for some time.

Clusters/MPI

Multiple processes/threads on a single machine can safely have the same database open at the same time (reading and writing). The databases (based on BerkeleyDB) support record-level locking. If one process is writing to a record and another process simultaneously tries to read the record, the read operation will block until the write completes. On a single machine the databases coordinate with each other using the database cache in /tmp which MUST be on a locally mounted filesystem (not NFS).
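
The blocking behavior described above can be illustrated with a toy, in-memory model that uses Python threads and one lock per record. This is not how BerkeleyDB or EMAN2 implement locking; the class LockedStore and everything in it are hypothetical, and the sketch only shows what "a read blocks until the write completes" means in practice.

```python
import threading

class LockedStore:
    """Toy in-memory store with one lock per record (key)."""

    def __init__(self):
        self._data = {}
        self._locks = {}
        self._guard = threading.Lock()   # protects the lock table itself

    def lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def write(self, key, value):
        with self.lock_for(key):
            self._data[key] = value

    def read(self, key):
        with self.lock_for(key):
            return self._data.get(key)

store = LockedStore()
store.write(0, "old")
store.write(1, "other")

results = {}
record_lock = store.lock_for(0)
record_lock.acquire()                 # simulate a write to record 0 in progress

reader = threading.Thread(target=lambda: results.update(value=store.read(0)))
reader.start()
other_record = store.read(1)          # a different record is not blocked
reader.join(timeout=0.2)
blocked_while_writing = reader.is_alive()   # reader is still waiting on record 0

store._data[0] = "new"                # the simulated write finishes...
record_lock.release()                 # ...and releases the record
reader.join()
```

The reader thread only returns after the lock on record 0 is released, so it sees the new value and never a half-written one, while access to record 1 proceeds without waiting.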

Accessing (reading and writing) a single file from multiple machines over a network-mounted filesystem IS NOT SAFE, and may result in unpredictable errors.

If you need to directly read a database on a node other than the node which is coordinating all write operations, you must open the database with caching disabled, or you may observe strange inconsistencies in the data where the local cache disagrees with the current database contents. To open the database with caching disabled:

    # YOU MUST DO THIS BEFORE OPENING ANY DATABASES ON NON-WRITE NODES
    import EMAN2db
    EMAN2db.BDB_CACHE_DISABLE=True

In addition, any database that is currently open on the writing node and has recently been written to may not be properly readable on the other nodes. All write operations must be completed and the databases closed before they are opened on other nodes.

The 'standard' parallelism mechanism in EMAN2 will be an encapsulation and distribution approach where reads/writes are synchronized through a single 'master' node. Finer grained MPI processing will also be supported, but less generally. SPARX is using a different approach.

Eman2BdbStorage (last edited 2011-07-05 04:04:02 by SteveLudtke)