Lazier Copy-On-Write

June 21, 2009

Copy-on-write (COW) is a popular mechanism of lazy evaluation, that helps improve running speed and reduce memory requirements by transparently delaying the copying of data. Essentially, this is how it works: When you try to copy an object, you are instead given a fake object. Attempts to read that new object will instead read from the original object. Writing to the object will create a new copy (as originally requested) and refer to it in the future for reads and writes.

It allows to write simpler and safer programs. For instance, a programmer can pass “copies” of his data with minimal performance impact, and not have to worry about others changing his original data.

It’s great, but can it be more?

I present here two proposals for extending this idea for even greater optimization. They are far from my areas of expertise, so I hope they still make sense.

1. Copy-On-Write + Fragmentation

Fragmentation of data is a mechanism that allows different parts (blocks) of data to reside in different locations in memory while appearing intact (that is, sequential).  This mechanism is often accused of  slowing the computer down. However, its a critical feature of virtual memory, which is a basis to all modern operating systems.

Introducing fragmentation into your data structures has many benefits, but let’s discuss the benefits regarding COW, which may be obvious by now: On write, you don’t have to copy all of the data, just the blocks which are being changed. This can be a big difference, if you’re changing a few bytes in a 50mb string.

You still have to copy the meta-data (such as, where are the blocks, what is the next block, etc.), but that’s a small price to pay, and a reasonable design requirement.

Now instead of copying the entire data, you copy only a fragment of it. How big is that fragment? Perhaps a fixed size, such as 64k. But assuming you have no real restriction on the size of these data blocks, the next logical step, in my eyes, is to ask: Why not make it as small as possible? That is, why not intentionally fragment the block into three smaller blocks: Before the area that is to be written, the area that is to be written, and after the area to be written. At this point we continue as we originally planned: We copy only the block which is to be written, which is, of course, exactly as small as it can be.

Eventually, we have a model in which writing n bytes into a COWed data of m bytes takes O(n) time and memory, instead of the original O(m+n) time and O(m) memory. I argue that in the common case, n is significantly smaller than m, and so the win is big.

Of course, fragmentation has a habit of slowing down reading times. When fragmentation is “too high”, it is possible to defragment the memory (an occasional O(m) process). The optimal balance of fragmentation depends heavily on the frequency of reads vs of writes, but I argue that even a sub-optimal, common-case balance, will produce an improvement in performance.

Edit: I’ve been unclear about how it affects look-ups. Fragmentation to blocks of fixed size remains O(1) for getting and setting items. However, for variable-size blocks it’s not so simple. A search tree can achieve a look-up of  O(logn) where n is number of fragments, which is a lot slower than the original array peformance. It is probably only a good idea if you have access to the operating system’s memory management, or if the use of look-ups is rare (and then an occasional defragmentation would still be necessary). Still, fixed-size fragments are good enough, and they can be dynamically resized with little cost, as long as the resize is uniform.

2. Copy-On-Write-And-ReaD

Or in short, COWARD, is a mechanism to even further delay copying, to only after the written data is also read. That is, when the programmer requests to write data, this mechanism will instead journal the data, producing sort of a “diff”. Only when the programmer attempts to read the result, the original data is copied and the diff is applied. A diff structure is provided by any implementation of lazy evaluation, by definition, but perhaps there are other more suitable diff structures for this purpose.

This starts to make more sense with fragmentation: Then the diff can be applied only to the block that is read. And so, a block will be copied only if it is both written and read. In some cases, there may be very little intersection between the two (and so, very little copying).

So basically, COWARD is just a (non-)fancy name for an array of promises (not to be confused with a field of dreams). The (possible) novelty is in the way this array is created and used: transparently, and relatively efficiently. Note that, like the previous proposal, it provides little value in situations where the all of the data is altered or read. However, I argue it will significantly improve performance in cases where only part of the data is read and written.

It can, for instance, be useful in cases where an algorithm works on COWed data (which happens quite often) and provides more processing than the user requires. Using this method, only blocks that the user requests are copied — and if the calculations themselves are lazy — processed. And all of it transparent to both the user and the implementer of the algorithm .

Here’s to lazier COWs!


PySnippets – improving code reuse

June 2, 2009

For a long time now, I’ve been hindered by the issue of utilities, or snippets. These are convenience functions and classes that are too small or too incomplete to justify a library, yet are useful enough to be used.
I’ve posted a few on my blog: Namespaces, X and now FileDict. Others I didn’t post, and include a priority queue, an A* implementation, a lazy-list, an LRU memoizer, etc. You probably have a few of those. I know because I see them on snippet sites.

However, I rarely actually use these in my code. I really want to. But once your code spans more than one file, you usually need to make a proper installation, or at least trouble your “users” a bit more. Usually saving a few lines just isn’t worth the trouble. Copy-pasting the snippet into the file is sometimes the solution, but it really pains me that I’ll have to re-paste it every time I improve my snippet.

I’m sure some of you looked at my snippets, or other people’s, thought “cool”, but never used them, simply because it was too much trouble.

Paradoxically, this is especially true when writing snippets. They are just one small file, and using another snippet would probably make them too hard to distribute. This is a magic-circle, for-ever limiting our snippets to a low level of sophistication, and discouraging re-use.

I want to break that circle. I want to create an economy of snippets, increasingly building on each other, eventually creating a “standard library” of their own. But how would one do that? I have one suggestion, along with a proof-of-concept, which I will present here.

PySnippets

PySnippets is my attempt of making snippets usable. It’s comprised of two solutions – a server and a client.

  1. Server - A website for uploading snippets. Simple enough. You can rate them, tag them, discuss them, offer some documentation and of-course post newer versions.
  2. Client - A python module that automagically imports snippets from the web. Essentially, it downloads the snippets you request to a cache, imports them if they’re already there, and periodically searches for updates.

The server is structured in a predictable way, so that the client knows how to fetch a snippet just by its name.

The Client

Here’s a usage example with my current client implementation, I creatively call “snippets”:

import snippets
antigravity = snippets.get('antigravity')  # "snippet-import"
antigravity.start(mode='xkcd')

Easy as that!

The snippets.get function looks for the module in the local snippets-cache. If it’s there, get just imports it and returns the module. If it’s not, it queries the server for a snippet called “antigravity” (names are unique), stores it in the cache, and the imports it. What the user notices is a 2-second pause the first time he ever imports that snippet, and nothing else from then on.

You can specify to download a specific version, like this:

filedict = snippets.get('filedict', version=0.1)

Auto-Updating Snippets

The current implementation also includes an “auto-update” feature: Periodically, before importing a module, the client surveys the server for a newer version of it. If a newer version exists, it downloads it to the cache and continues with the import.

Auto-updates can be disabled in a parameter to get.

The Server

The server is yet another service to upload snippets, however it has a slightly unusual design (which no other snippet site I know of has):

  • A URL to a snippet is easy to deduce given its name.
  • There is a conscious (though simple) support for versions.
  • To increase reliability and trust (more on that later), uploaded snippets cannot be altered (but a new version can be issued)

Since I know very little about administration and server-maintenance, I chose wikidot.com to host my POC web-site. They have an elaborate support for permissions and most of the features I need, such as the ability to rate, tag and discuss snippets.

Trust

Perhaps the biggest issue with such a system is trust. Since you’re running code which resides online, you have to trust me not to maliciously alter the snippets, and also you have to trust the author of the snippet not to do so.

As a partial solution, uploaded files cannot be altered: Not edited, nor renamed, nor deleted, etc. So if specify a particular snippet version, it is guaranteed that it will never change (I may commit changes by request, but I will audit them myself).
If you decide to use the latest version of a snippet (that is, not specify a version), please make sure you trust its author.

Perhaps higher-ups in the Python community would like to take some sponsorship of the project, removing the remaining trust-issues with the administrator (that’s me).

Implications

  • To distribute your snippets, all you need is for the reciever to have an internet connection, and the snippets client.
  • If you’re sending someone code, you can attach the client (it’s rather small, too), and just import away. The reciever will benefit from improvements and bugfixes to your snippets.
  • You can use other people’s snippets just as easily, as long as you trust them.
  • Snippets can now build on each other without worrying too much.

What if my user is offline?

Then probably PySnippets isn’t for him.

However, I do have some ideas, and might implement them if there is sufficient demand.

Afterword

PySnippets is my humble attempt at solving the utility/snippet reuse problems. I hope you like it and find it useful.

Please try it!


FileDict – bug-fixes and updates

May 31, 2009

In my previous post I introduced FileDict. I did my best to get it right the first time, but as we all know, this is impossible for any non-trivial piece of code.
I want to thank everyone for their comments and remarks. It’s been very helpful.

The Unreliable Pickle

A special thanks goes to the mysterious commenter “R”, for pointing out that pickling identical objects may produce different strings (!), which are therefor inadequate to be used as keys. And my FileDict indeed suffered from this bug, as this example shows:

>>> key = (1, u'foo')
>>> d[(1, u'foo')] = 4
>>> d[(1, u'foo')]
4
>>> d[key]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "filedict.py", line 64, in __getitem__
    raise KeyError(key)
KeyError: (1, u'foo')

And if that’s not bad enough:

>>> d[key] = 5
>>> list(d.items())
[['a', 3], [(1, 2), 3], [(1, u'foo'), 4], [(1, u'foo'), 5]]

Ouch.
I’ve rewritten the entire storing mechanism to poll only on hash and compare keys after unpickling. This may be a bit slower, but I don’t (and shouldn’t) expect many colliding hashes anyway.
Bug is fixed.

DictMixin

Under popular demand, I’m now inheriting from DictMixin. It’s made my code a bit shorter, and was not at all painful.

Copy and Close

I no longer close the database on __del__, and instead I rely on the garbage collector. It seems to close the database on time, and it allows to one copy the dictionary (which, of course, will all be always have the same keys, but doesn’t have to have the same behavior or attributes).

New Source Code

Is available here


FileDict – a Persistent Dictionary in Python

May 24, 2009

Python’s dictionary is possibly the most useful construct in the language.  And I argue that for some purposes, mapping it to a file (in real-time) can be even more useful.

Why?

The dictionary resides in memory, and so has three main “faults”:

  1. It only lasts as long as your program does.
  2. It occupies memory that might be useful for other, more commonly accessed, data.
  3. It is limited to how much memory your machine has.

The first can be solved by pickling and unpickling the dictionary, but will not survive an unexpected shutdown (even putting the pickling in a try-finally block won’t protect it against all errors).

FileDict

FileDict is a dictionary interface I wrote, that saves and loads its data from a file using keys. Current version uses Sqlite3 to provide consistency, and as a by-product, acidity.

The result is a dictionary which at all-times exists as a file, has virtually no size limit, and can be accessed by several processes concurrently.

It is meant as a quick-and-simple general-purpose solution. It is rarely the best solution, but it is usually good enough.

Performance obviously cannot compare to the builtin dictionary, but it is reasonable and of low complexity (refer to sqlite for more details on that).

Uses

FileDict can be used for many purposes, including:

  • Saving important data in a convinient manner
  • Managing large amounts of data in dictionary form, without the mess of implementing paging or other complex solutions
  • Communication between processes (sqlite supports multiple connections and implements ACID)

Examples

$ python
>>> import filedict
>>> d=filedict.FileDict(filename="example.dict")
>>> d['bla'] = 10
>>> d[(2,1)] = ['hello', (1,2) ]
-- exit --
$ python
>>> import filedict
>>> d=filedict.FileDict(filename="example.dict")
>>> print d['bla']
10
>>> print d.items()
[['bla', 10], [(2, 1), ['hello', (1, 2)]]]
>>> print dict(d)
{'bla': 10, (2, 1): ['hello', (1, 2)]}
>>> d=filedict.FileDict(filename="try.dict")
>>> with d.batch:  # using .batch suspend commits, making a batch of changes quicker
>>>    for i in range(100000):
>>>            d[i] = i**2
(takes about 8 seconds on my comp)
>>> print len(d)
100000
>>> del d[103]
>>> print len(d)
99999

Limitations

  • All data (keys and values) must be pickle-able
  • Keys must be hashable (perhaps this should be removed by hashing the pickled key)

Source Code

Is availible in here

Future

Additions in the future may include:

  • An LRU-cache for fetching entries
  • A storage strategy different than Sqlite

Other suggestions?


Raphael and a Prophecy

March 7, 2009

More than once I have wasted my time in attempt to introduce graphics into the webpage, some of these attempts are recorded in this very blog: First were my lines in CSS, followed by rotating rectangles. I say “wasted” because I have now found Raphaël, and I’m in love.

Raphaël

What is it? Their site is too modest, claiming “Raphaël is a small JavaScript library that should simplify your work with vector graphics on the web”.
While true, it is also a most ambitious project to change the way the web looks, whether they mean to or not.
All (popular) browsers support vector graphics: IE supports VML, and all other browsers support SVG. By implementing both languages and switching between them, you can create complex vector graphics that work in almost every browser.
This idea was new to me, at least, and I think it’s great. Now Raphaël comes along and does it for you. Nice.

My Prophecy
Vector graphics (VG) will become more popular. As VG libraries mature (and Raphaël is already in good shape), VG on the web would become so easy that anyone could use them, and many would. Soon enough you will have beautiful interactive GUIs. At fiirst snippets and demos, then complete libraries for everyone to use. What more can it do? Maybe interactive animations, maybe just add some spice to websites.
It probably won’t bury Flash, but I expect it would be a competition.