FileDict – a Persistent Dictionary in Python

Python’s dictionary is possibly the most useful construct in the language.  And I argue that for some purposes, mapping it to a file (in real-time) can be even more useful.

Why?

The dictionary resides in memory, and so has three main “faults”:

  1. It only lasts as long as your program does.
  2. It occupies memory that might be useful for other, more commonly accessed, data.
  3. It is limited to how much memory your machine has.

The first can be solved by pickling and unpickling the dictionary, but will not survive an unexpected shutdown (even putting the pickling in a try-finally block won’t protect it against all errors).

FileDict

FileDict is a dictionary interface I wrote, that saves and loads its data from a file using keys. Current version uses Sqlite3 to provide consistency, and as a by-product, acidity.

The result is a dictionary which at all-times exists as a file, has virtually no size limit, and can be accessed by several processes concurrently.

It is meant as a quick-and-simple general-purpose solution. It is rarely the best solution, but it is usually good enough.

Performance obviously cannot compare to the builtin dictionary, but it is reasonable and of low complexity (refer to sqlite for more details on that).

Uses

FileDict can be used for many purposes, including:

  • Saving important data in a convinient manner
  • Managing large amounts of data in dictionary form, without the mess of implementing paging or other complex solutions
  • Communication between processes (sqlite supports multiple connections and implements ACID)

Examples

$ python
>>> import filedict
>>> d=filedict.FileDict(filename="example.dict")
>>> d['bla'] = 10
>>> d[(2,1)] = ['hello', (1,2) ]
-- exit --
$ python
>>> import filedict
>>> d=filedict.FileDict(filename="example.dict")
>>> print d['bla']
10
>>> print d.items()
[['bla', 10], [(2, 1), ['hello', (1, 2)]]]
>>> print dict(d)
{'bla': 10, (2, 1): ['hello', (1, 2)]}
>>> d=filedict.FileDict(filename="try.dict")
>>> with d.batch:  # using .batch suspend commits, making a batch of changes quicker
>>>    for i in range(100000):
>>>            d[i] = i**2
(takes about 8 seconds on my comp)
>>> print len(d)
100000
>>> del d[103]
>>> print len(d)
99999

Limitations

  • All data (keys and values) must be pickle-able
  • Keys must be hashable (perhaps this should be removed by hashing the pickled key)

Source Code

Is availible in here

Future

Additions in the future may include:

  • An LRU-cache for fetching entries
  • A storage strategy different than Sqlite

Other suggestions?

13 Responses to “FileDict – a Persistent Dictionary in Python”

  1. Erez's Programming Stuff: Filedict Says:

    [...] my blog post for [...]

  2. Calvin Spealman Says:

    Why not use shelve, from the stdlib?

  3. erezsh Says:

    I’ll be honest – I forgot about shelve, and did not take it into account.

    However, I can still try and answer your question:
    1. Shelve does not implement ACID
    2. Sqlite seems to use more compact storage
    3. There are good tools to examine Sqlite DBs
    4. My implementation doesn’t limit the type of your keys (tho this can be mostly fixed with a wrapper)
    5. My implementation allows using the file for inter-process communication

    Thanks for pointing it out, there are certainly situations where shelve would be a better choice.

  4. R Says:

    Hashing the pickled key won’t work because pickle is not consistent. See http://bugs.python.org/issue5518

  5. erezsh Says:

    Very interesting! I may have to adjust my code to avoid possible bugs. Thanks R.

  6. lorg Says:

    Re: “Python’s dictionary is possibly the most useful construct in the language.”

    Well, it has been said that just as lisp is based on lists, Python is based on dicts :)

  7. Jeremy Grosser Says:

    I played around with writing a few UserDicts with similar intent a while ago and ended up building them into a module.

    http://github.com/synack/ncore/blob/dd41752b16b544aa240e8d4f2409d80d674ae643/ncore/data.py

  8. Mike Says:

    This is very interesting, definitely going to be considering this for a few projects. :)

  9. ErikW Says:

    For these kind of dictionary-emulating classes, I often use the UserDict.DictMixin class. I find that you can get full dictionary emulation with little effort. An alternative implementation using DictMixin can be found here http://python.pastebin.com/f2d208b4

    I’ve shamelessly stolen the key concept of hash(key)/pickle(key) and indexing on the hash. However I have not optimized the pickling (using binary pickle).

  10. FileDict – bug-fixes and updates « Stories For Sad Robots Says:

    [...] – bug-fixes and updates In my previous post I introduced FileDict. I did my best to get it right the first time, but as we all know, this is [...]

  11. Felipe Cruz Says:

    Check out copycat.

    Transparent, pure OO persistence tool.

    I’ts a very different concept but has a lot more options than just work with dicts.

  12. code43 Says:

    Erez, for another project with similar objectives — with compression — plus support for regular expression on the key, see

    http://yserial.sourceforge.net

    a timestamp upon insert is also useful. It seems to me that your code currently will deadlock if used in concurrent situations.

    The general idea is definitely worth pursuing: persistance of schema-less data. Thanks very much.

  13. gls Says:

    Hey, this was just the “ready to go” implementation of sqlite i was searching! I needed a very high level abstraction layer to SQL instruction (I do not understand SQL thing, and I dont want to). Thanks, I am very enthusiast!

Leave a Reply