EMEN2 Index Accelerator

EMEN2 includes an optional C module, bulk.c, that speeds up index reads by using Berkeley DB features that the bsddb3 Python module does not expose. The module is not required and can be enabled or disabled at any time.

What it does

Berkeley DB provides a "bulk" interface for retrieving large amounts of data quickly, which is especially useful for reading indexes. However, this feature is not available in the bsddb3 Python module. I have not confirmed this, but I believe the reason is that the caller must create a buffer, and every item retrieved must fit in that buffer. This is not an issue for EMEN2 index reads, because the values are record IDs, which are small.
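
The buffer constraint can be illustrated with a toy model in plain Python. This is only a sketch of the idea (fill a fixed-size buffer with as many items as fit, and fail if a single item is larger than the whole buffer); it does not reproduce Berkeley DB's actual buffer layout, and the function name fill_batches and the 8-byte IDs are assumptions made for the example.

```python
import struct


def fill_batches(items, buffer_size):
    """Yield lists of byte strings, each list fitting in buffer_size bytes.

    Toy model only: real bulk buffers also carry bookkeeping data.
    """
    batch, used = [], 0
    for item in items:
        if len(item) > buffer_size:
            # A single item that cannot fit makes a bulk read impossible.
            raise ValueError("item larger than the whole buffer")
        if used + len(item) > buffer_size:
            # Buffer is full: hand back what we have and start a new one.
            yield batch
            batch, used = [], 0
        batch.append(item)
        used += len(item)
    if batch:
        yield batch


# Hypothetical record IDs, packed as 8-byte integers for illustration.
record_ids = [struct.pack(">q", i) for i in range(10)]
batches = list(fill_batches(record_ids, buffer_size=32))
```

With small fixed-size values such as these, every batch is filled predictably and the oversize case never arises, which is why the constraint does not matter for index reads.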

The module simply exposes this interface, in a limited fashion, to Python. If the module is not present, EMEN2 falls back to using normal cursor operations to read the indexes.
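
The fallback behaviour described above follows the usual optional-import pattern. The following is a minimal sketch of that pattern, not EMEN2's actual code: the helper names load_optional, read_index_slow, and pick_reader are hypothetical, and so is the read_index attribute, which stands in for whatever function bulk.c really exports.

```python
import importlib


def load_optional(name):
    """Return the named module, or None if it is missing or fails to import."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def read_index_slow(pairs):
    # Stand-in for the normal cursor loop: visit one (key, value) pair at a time.
    return [value for _key, value in pairs]


def pick_reader(bulk_module):
    # Use the accelerator only if it imported; otherwise fall back.
    # "read_index" is a hypothetical attribute name, not bulk.c's real API.
    fast = getattr(bulk_module, "read_index", None)
    return fast if fast is not None else read_index_slow


bulk = load_optional("emen2.indexwrapper.bulk")
reader = pick_reader(bulk)
```

Because the choice is made when the module is imported, removing or rebuilding bulk.so only changes which reader is picked the next time EMEN2 starts.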

Building the Module

The module is simple to build. The C source is emen2/indexwrapper/bulk.c, and compiling it requires the header files for Python, Berkeley DB, and the bsddb3 module. On my Mac OS X development machine, the following commands are sufficient when run from the emen2 source directory (you will of course have to adjust all the paths):

export BDBVERSION=4.8 BDBMODULEPATH=$HOME/emen2/src/bsddb3-4.8.2/Modules/

gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -g -fwrapv -Os -Wall -Wstrict-prototypes  -pipe -I/usr/local/BerkeleyDB.$BDBVERSION/include -I/System/Library/Frameworks/Python.framework/Versions/2.6/include/python2.6 -I$BDBMODULEPATH -c indexwrapper/bulk.c -o indexwrapper/bulk.o

gcc-4.2 -Wl,-F. -bundle -undefined dynamic_lookup -L/usr/local/BerkeleyDB.$BDBVERSION/lib -ldb-$BDBVERSION indexwrapper/bulk.o -o indexwrapper/bulk.so

On my production servers, which run Linux:

gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DPYBSDDB_STANDALONE=1 -I/usr/local/BerkeleyDB.4.8/include -I/usr/local/include/python2.6 -I/home/emen2/src/bsddb3-4.8.2/Modules  -c indexwrapper/bulk.c -o indexwrapper/bulk.o

gcc -pthread -shared indexwrapper/bulk.o -L/usr/local/BerkeleyDB.4.8/lib -Wl,-R/usr/local/BerkeleyDB.4.8/lib -ldb-4.8 -o indexwrapper/bulk.so
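
Because the paths in these commands are machine-specific, it can help to let Python report its own header directory rather than typing it by hand. The sketch below assembles the Linux compile command shown above from two inputs; the function name compile_command and its parameters are made up for this example, and it assumes the script is run with the same Python interpreter that EMEN2 will use.

```python
import sysconfig


def compile_command(bdb_prefix, bsddb3_modules, source="indexwrapper/bulk.c"):
    """Return the gcc compile step as an argument list (nothing is executed)."""
    # Directory containing Python.h for the running interpreter.
    python_include = sysconfig.get_paths()["include"]
    obj = source[:-2] + ".o"  # bulk.c -> bulk.o
    return [
        "gcc", "-pthread", "-fno-strict-aliasing", "-DNDEBUG", "-g",
        "-fwrapv", "-O3", "-Wall", "-Wstrict-prototypes", "-fPIC",
        "-DPYBSDDB_STANDALONE=1",
        "-I" + bdb_prefix + "/include",
        "-I" + python_include,
        "-I" + bsddb3_modules,
        "-c", source, "-o", obj,
    ]


# Example paths taken from the Linux commands above; adjust for your system.
cmd = compile_command("/usr/local/BerkeleyDB.4.8",
                      "/home/emen2/src/bsddb3-4.8.2/Modules")
print(" ".join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.check_call to run the compile step directly.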

EMEN2 detects and loads the bulk.so file automatically. To confirm that it compiled and imports correctly, run:

python -c "import emen2.indexwrapper.bulk"

EMEN2/bulk.so (last edited 2010-04-21 05:56:49 by root)