Python’s dictionary is possibly the most useful construct in the language. And I argue that for some purposes, mapping it to a file (in real-time) can be even more useful.
Why?
The dictionary resides in memory, and so has three main “faults”:
- It only lasts as long as your program does.
- It occupies memory that might be useful for other, more commonly accessed, data.
- It is limited to how much memory your machine has.
The first can be solved by pickling and unpickling the dictionary, but will not survive an unexpected shutdown (even putting the pickling in a try-finally block won’t protect it against all errors).
FileDict
FileDict is a dictionary interface I wrote, that saves and loads its data from a file using keys. Current version uses Sqlite3 to provide consistency, and as a by-product, acidity.
The result is a dictionary which at all-times exists as a file, has virtually no size limit, and can be accessed by several processes concurrently.
It is meant as a quick-and-simple general-purpose solution. It is rarely the best solution, but it is usually good enough.
Performance obviously cannot compare to the builtin dictionary, but it is reasonable and of low complexity (refer to sqlite for more details on that).
Uses
FileDict can be used for many purposes, including:
- Saving important data in a convinient manner
- Managing large amounts of data in dictionary form, without the mess of implementing paging or other complex solutions
- Communication between processes (sqlite supports multiple connections and implements ACID)
Examples
$ python
>>> import filedict
>>> d=filedict.FileDict(filename="example.dict")
>>> d['bla'] = 10
>>> d[(2,1)] = ['hello', (1,2) ]
-- exit --
$ python
>>> import filedict
>>> d=filedict.FileDict(filename="example.dict")
>>> print d['bla']
10
>>> print d.items()
[['bla', 10], [(2, 1), ['hello', (1, 2)]]]
>>> print dict(d)
{'bla': 10, (2, 1): ['hello', (1, 2)]}
>>> d=filedict.FileDict(filename="try.dict") >>> with d.batch: # using .batch suspend commits, making a batch of changes quicker >>> for i in range(100000): >>> d[i] = i**2 (takes about 8 seconds on my comp) >>> print len(d) 100000 >>> del d[103] >>> print len(d) 99999
Limitations
- All data (keys and values) must be pickle-able
- Keys must be hashable (perhaps this should be removed by hashing the pickled key)
Source Code
Is availible in here
Future
Additions in the future may include:
- An LRU-cache for fetching entries
- A storage strategy different than Sqlite
Other suggestions?
May 24, 2009 at 18:15 |
[...] my blog post for [...]
May 24, 2009 at 19:40 |
Why not use shelve, from the stdlib?
May 24, 2009 at 20:31 |
I’ll be honest – I forgot about shelve, and did not take it into account.
However, I can still try and answer your question:
1. Shelve does not implement ACID
2. Sqlite seems to use more compact storage
3. There are good tools to examine Sqlite DBs
4. My implementation doesn’t limit the type of your keys (tho this can be mostly fixed with a wrapper)
5. My implementation allows using the file for inter-process communication
Thanks for pointing it out, there are certainly situations where shelve would be a better choice.
May 24, 2009 at 21:05 |
Hashing the pickled key won’t work because pickle is not consistent. See http://bugs.python.org/issue5518
May 24, 2009 at 21:46 |
Very interesting! I may have to adjust my code to avoid possible bugs. Thanks R.
May 24, 2009 at 21:55 |
Re: “Python’s dictionary is possibly the most useful construct in the language.”
Well, it has been said that just as lisp is based on lists, Python is based on dicts
May 24, 2009 at 22:26 |
I played around with writing a few UserDicts with similar intent a while ago and ended up building them into a module.
http://github.com/synack/ncore/blob/dd41752b16b544aa240e8d4f2409d80d674ae643/ncore/data.py
May 25, 2009 at 4:56 |
This is very interesting, definitely going to be considering this for a few projects.
May 25, 2009 at 11:46 |
For these kind of dictionary-emulating classes, I often use the UserDict.DictMixin class. I find that you can get full dictionary emulation with little effort. An alternative implementation using DictMixin can be found here http://python.pastebin.com/f2d208b4
I’ve shamelessly stolen the key concept of hash(key)/pickle(key) and indexing on the hash. However I have not optimized the pickling (using binary pickle).
May 31, 2009 at 19:15 |
[...] – bug-fixes and updates In my previous post I introduced FileDict. I did my best to get it right the first time, but as we all know, this is [...]
October 16, 2009 at 23:47 |
Check out copycat.
Transparent, pure OO persistence tool.
I’ts a very different concept but has a lot more options than just work with dicts.