JSON Files
(BDB Replacement)
JSON files replace the much despised BDB database mechanism for storing metadata in EMAN2. These files offer a number of advantages over BDB, but there are also a few tradeoffs.
Advantages:
- Human-readable, and human-editable.
- Can be renamed, deleted, copied, etc, just like any other file.
- Standard file format, a subset of JavaScript, so it interfaces easily with the web.
- Persistence and thread safety. Through the use of file locking, it should be safe to read a single JSON file from multiple threads/processes.
Tradeoffs:
- Speed vs. Immediacy. Making any change to a JSON file requires re-writing the entire file. If writing is deferred (which it can be), then other processes won't see the changes until the write actually happens.
- Speed. Not really designed for very large 'databases' of information. While you _could_ have a JSON file with 1,000,000 items in it, deferred writing would be critical to maintaining any sort of performance.
- Images are not stored directly in the JSON file, though it appears this way to a user. If you put an image in a JSON file, a related HDF file is created and the JSON file stores a reference to the HDF file. This is transparent to the user/programmer.
- While multiple writers should technically be safe, if multiple processes are all writing to .json files at high speed, it is conceivable that some corruption could occur. We have never observed this happening, but it isn't completely impossible. Also, if conflicting writes DO happen, the last writer wins.
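To make the first and last tradeoffs concrete, here is a minimal sketch that uses only Python's standard json module. It is not EMAN2's implementation, and the helper name set_key is hypothetical. It illustrates that changing a single value means reading, modifying, and re-writing the whole file, and that when two writers set the same key, the later write is the one that remains.

```python
import json
import os
import tempfile


def set_key(path, key, value):
    """Hypothetical helper: change one key by re-writing the whole file."""
    data = {}
    if os.path.exists(path):
        with open(path) as f:
            data = json.load(f)
    data[key] = value
    with open(path, "w") as f:
        json.dump(data, f, sort_keys=True)
    return os.path.getsize(path)  # size of the file after this single change


path = os.path.join(tempfile.mkdtemp(), "demo.json")
size_small = set_key(path, "a", 1)
size_big = set_key(path, "b", list(range(1000)))
size_after = set_key(path, "a", 2)   # tiny change, but the full file is written again

set_key(path, "owner", "process 1")
set_key(path, "owner", "process 2")  # conflicting write: the last writer wins
with open(path) as f:
    final = json.load(f)
```

The cost of each change grows with the size of the file, not with the size of the change, which is why deferred writing matters for larger files.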
Command Line Program
e2procjson.py can be used to perform a range of manipulations on JSON files. Use --help for a list of options.
Basic Python Usage
The main object is the JSDict class, which provides dictionary-like access to a .json file. Each instance of this class represents a single file on disk with a '.json' extension.
js_open_dict(filename)
- Opens a JSON file as a dict-like database object (JSDict). Writes to JSDict objects are inefficient unless deferred writing is used: the default behavior is to write the entire dictionary to disk whenever any element is changed. File locking is attempted to avoid conflicts, but may not work in all situations. While it is possible to store images in JSON files, it is not recommended, because it is inefficient and produces files that are difficult to read.
js_close_dict(filename)
- This does not need to be called explicitly, but calling it will free some resources associated with the database. It is not associated with closing a file pointer.
js_remove_dict(filename)
- Closes and deletes a database using the same specification as js_open_dict. Unlike the BDB functions, this will actually remove the associated file on disk.
js_check_dict(filename, readonly=True)
- Checks for the existence of the named JSON file and ensures that it can be opened for reading [and writing]. It does not check the contents of the file, just its existence and permissions.
js_list_dicts(path)
- Gives a list of readable json files at a given path.
Once opened, the JSDict class acts much like a standard Python dictionary. The default behavior is to sync with the file on disk (only if necessary) on each read or write, giving it a high level of persistence and making it feasible to use in multi-process and shared-filesystem environments. However, this scheme can be extremely inefficient, so mechanisms exist for deferred writing of changes and for reading without checking the file for changes (though this second task is fairly inexpensive anyway). In addition to all of the standard dictionary methods, the following are available:
get(self,key,noupdate=False)
- This will retrieve a value from the dictionary, exactly like dict[key], but permits skipping the check to see if the file has changed on disk since the last access.
setval(self,key,val,deferupdate=False)
- This will set a value in the dictionary. This is identical to dict[key]=value, unless deferupdate is set, in which case the change is made in memory, but not immediately committed back to the JSON file on disk. To commit changes made with deferupdate set, either call sync(self) or make another change without deferupdate set. All changes are committed at the next sync(self).
update(self,otherdict)
- Just like the normal dictionary update method, but will only do one sync(self) at the end of the update.
delete(self,key,deferupdate=False)
- As with setval, you can defer the actual key deletion in the file on disk.
- Note: Keys in JSON files are strings, and ONLY strings. Integer, float, and other non-string keys are not supported. This also applies to embedded contents, such as a dictionary within an element. See below for an example.
- Note: remember that while any pickleable object can be stored in a JSDict, storing a stack of 10,000 images probably isn't a very good idea. Use HDF files for anything other than incidental image storage.
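The "sync only if necessary" behavior and the noupdate option of get described above can be pictured with a small standard-library sketch. This is not the JSDict implementation: the class name CachedJSON is hypothetical, and detecting changes by comparing the file's modification time and size is an assumption made purely for illustration.

```python
import json
import os
import tempfile


class CachedJSON:
    """Hypothetical read-side cache; not the JSDict implementation."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        self.stamp = None      # (mtime_ns, size) seen at the last load
        self.reloads = 0

    def _refresh(self):
        st = os.stat(self.path)
        stamp = (st.st_mtime_ns, st.st_size)
        if stamp != self.stamp:          # assumption: change detected via stat()
            with open(self.path) as f:
                self.data = json.load(f)
            self.stamp = stamp
            self.reloads += 1

    def get(self, key, noupdate=False):
        if not noupdate:
            self._refresh()              # cheap check, reload only when needed
        return self.data[key]


path = os.path.join(tempfile.mkdtemp(), "cache.json")
with open(path, "w") as f:
    json.dump({"apix": 1.5}, f)

c = CachedJSON(path)
first = c.get("apix")                    # loads the file
second = c.get("apix")                   # file unchanged, no reload

with open(path, "w") as f:               # stands in for another process writing
    json.dump({"apix": 2.25, "note": "updated"}, f)

stale = c.get("apix", noupdate=True)     # skips the check, returns the cached value
fresh = c.get("apix")                    # the check notices the change and reloads
```

Skipping the check saves only a stat call per read in this sketch, which is consistent with the remark that the read-side check is fairly inexpensive; the trade is that a read with noupdate set may return a stale value.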
Examples
Write some metadata to a JSON file
    from EMAN2 import *

    js = js_open_dict("info/mytest.json")
    js["key1"] = 123.5
    js["key2"] = "alphabet"
    js["key3"] = test_image()
    print(js["key1"])
    display(js["key3"])
But see below for possible efficiency issues.
Convert a BDB to a JSON file
Pretty trivial:
    a = db_open_dict("bdb:refine_01#register")
    b = js_open_dict("refine01/register.json")
    b.update(a)
Now try looking at register.json in a text editor. You'll see that it is nicely formatted text, and can be edited by hand. Formatting is not required for a valid file, and if you make changes by hand that break the formatting, the next time a program changes something it will get automatically reformatted for you.
Gotcha - All keys must be strings!
    # In an e2.py interactive session:
    js = js_open_dict("test.json")
    js[5] = "testing"                 # this will work, but 5 will be converted to a string
    print(js[5])                      # this will also work at the top level
    js["5"] = "new test"              # this will replace the original!
    print(js[5])                      # now "new test"
    js["test"] = {1: 2, 2: 3, 3: 4}   # this is where the real danger lies!
    print(js["test"])                 # exactly what we put in, BUT
    js[5] = 1                         # this change triggers a re-sync with the JSON file
    print(js["test"])                 # note that all of the keys have been converted from integers to strings!
Bottom line: if you are storing dictionaries in JSON files, make sure you have converted all keys to strings yourself, or you will get very odd behavior. This applies only to dictionary keys. Dictionary values, list items, and any other Python class stored in a JSON file will be preserved without change!
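One way to follow this advice is to normalize nested dictionaries before storing them. The sketch below uses only the standard library; the helper name stringify_keys is hypothetical and is not part of EMAN2. It also uses the standard json module to show the same key conversion that a plain JSON round trip performs.

```python
import json


def stringify_keys(obj):
    """Recursively convert dictionary keys to strings; values are untouched."""
    if isinstance(obj, dict):
        return {str(k): stringify_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [stringify_keys(v) for v in obj]
    if isinstance(obj, tuple):
        return tuple(stringify_keys(v) for v in obj)
    return obj


original = {1: 2, 2: {3.5: "x"}, "items": [{7: 8}]}
safe = stringify_keys(original)

# A plain JSON round trip turns the numeric keys into strings on its own,
# so the object that comes back no longer matches the one that went in.
round_trip = json.loads(json.dumps(original))

# Once the keys are strings, the round trip leaves the structure unchanged.
stable = json.loads(json.dumps(safe))
```

Something like js["test"] = stringify_keys({1: 2, 2: 3, 3: 4}) would then behave the same before and after a re-sync, at the cost of having to look values up with string keys.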
JSON write performance (large files)
Consider this:
    from EMAN2 import *

    a = list(range(1000))
    d = js_open_dict("tst.json")
    for i in range(500):
        d[i] = a
        print(i)
Now consider this:
    from EMAN2 import *

    a = list(range(1000))
    d = js_open_dict("tst.json")
    for i in range(500):
        d.setval(i, a, deferupdate=True)
        print(i)
Both produce exactly the same tst.json file in the end, but the first version takes almost 2 minutes to run as compared to 0.5 seconds for the second version. Of course, if another program were to try and access the file during that 0.5 seconds, it wouldn't see any of the changes...
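The gap between the two versions can be reasoned about without EMAN2 by counting how much data is written. The following is a rough standard-library sketch: TinyJSONDict is a hypothetical stand-in, not JSDict, and it ignores file locking and re-reading entirely. It only assumes, as described above, that a non-deferred change re-writes the entire file.

```python
import json
import os
import tempfile


class TinyJSONDict:
    """Hypothetical stand-in for a file-backed dictionary; not JSDict itself."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        self.writes = 0           # number of full-file rewrites
        self.bytes_written = 0    # total characters written across all rewrites

    def sync(self):
        text = json.dumps(self.data)
        with open(self.path, "w") as f:
            f.write(text)
        self.writes += 1
        self.bytes_written += len(text)

    def setval(self, key, val, deferupdate=False):
        self.data[str(key)] = val         # JSON keys are always strings
        if not deferupdate:
            self.sync()


tmp = tempfile.mkdtemp()
a = list(range(1000))

eager = TinyJSONDict(os.path.join(tmp, "eager.json"))
for i in range(50):
    eager.setval(i, a)                    # re-writes the whole file every time

lazy = TinyJSONDict(os.path.join(tmp, "lazy.json"))
for i in range(50):
    lazy.setval(i, a, deferupdate=True)   # change is kept in memory only
lazy.sync()                               # one write at the end

print(eager.writes, lazy.writes)
print(eager.bytes_written // lazy.bytes_written)
```

Because each immediate write includes everything stored so far, the total amount written grows roughly with the square of the number of items, while the deferred version writes the final file once. Both files end up with the same contents.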