Lazier Copy-On-Write

June 21, 2009

Copy-on-write (COW) is a popular mechanism of lazy evaluation, that helps improve running speed and reduce memory requirements by transparently delaying the copying of data. Essentially, this is how it works: When you try to copy an object, you are instead given a fake object. Attempts to read that new object will instead read from the original object. Writing to the object will create a new copy (as originally requested) and refer to it in the future for reads and writes.

It allows to write simpler and safer programs. For instance, a programmer can pass “copies” of his data with minimal performance impact, and not have to worry about others changing his original data.

It’s great, but can it be more?

I present here two proposals for extending this idea for even greater optimization. They are far from my areas of expertise, so I hope they still make sense.

1. Copy-On-Write + Fragmentation

Fragmentation of data is a mechanism that allows different parts (blocks) of data to reside in different locations in memory while appearing intact (that is, sequential).  This mechanism is often accused of  slowing the computer down. However, its a critical feature of virtual memory, which is a basis to all modern operating systems.

Introducing fragmentation into your data structures has many benefits, but let’s discuss the benefits regarding COW, which may be obvious by now: On write, you don’t have to copy all of the data, just the blocks which are being changed. This can be a big difference, if you’re changing a few bytes in a 50mb string.

You still have to copy the meta-data (such as, where are the blocks, what is the next block, etc.), but that’s a small price to pay, and a reasonable design requirement.

Now instead of copying the entire data, you copy only a fragment of it. How big is that fragment? Perhaps a fixed size, such as 64k. But assuming you have no real restriction on the size of these data blocks, the next logical step, in my eyes, is to ask: Why not make it as small as possible? That is, why not intentionally fragment the block into three smaller blocks: Before the area that is to be written, the area that is to be written, and after the area to be written. At this point we continue as we originally planned: We copy only the block which is to be written, which is, of course, exactly as small as it can be.

Eventually, we have a model in which writing n bytes into a COWed data of m bytes takes O(n) time and memory, instead of the original O(m+n) time and O(m) memory. I argue that in the common case, n is significantly smaller than m, and so the win is big.

Of course, fragmentation has a habit of slowing down reading times. When fragmentation is “too high”, it is possible to defragment the memory (an occasional O(m) process). The optimal balance of fragmentation depends heavily on the frequency of reads vs of writes, but I argue that even a sub-optimal, common-case balance, will produce an improvement in performance.

Edit: I’ve been unclear about how it affects look-ups. Fragmentation to blocks of fixed size remains O(1) for getting and setting items. However, for variable-size blocks it’s not so simple. A search tree can achieve a look-up of  O(logn) where n is number of fragments, which is a lot slower than the original array peformance. It is probably only a good idea if you have access to the operating system’s memory management, or if the use of look-ups is rare (and then an occasional defragmentation would still be necessary). Still, fixed-size fragments are good enough, and they can be dynamically resized with little cost, as long as the resize is uniform.

2. Copy-On-Write-And-ReaD

Or in short, COWARD, is a mechanism to even further delay copying, to only after the written data is also read. That is, when the programmer requests to write data, this mechanism will instead journal the data, producing sort of a “diff”. Only when the programmer attempts to read the result, the original data is copied and the diff is applied. A diff structure is provided by any implementation of lazy evaluation, by definition, but perhaps there are other more suitable diff structures for this purpose.

This starts to make more sense with fragmentation: Then the diff can be applied only to the block that is read. And so, a block will be copied only if it is both written and read. In some cases, there may be very little intersection between the two (and so, very little copying).

So basically, COWARD is just a (non-)fancy name for an array of promises (not to be confused with a field of dreams). The (possible) novelty is in the way this array is created and used: transparently, and relatively efficiently. Note that, like the previous proposal, it provides little value in situations where the all of the data is altered or read. However, I argue it will significantly improve performance in cases where only part of the data is read and written.

It can, for instance, be useful in cases where an algorithm works on COWed data (which happens quite often) and provides more processing than the user requires. Using this method, only blocks that the user requests are copied — and if the calculations themselves are lazy — processed. And all of it transparent to both the user and the implementer of the algorithm .

Here’s to lazier COWs!


PySnippets – improving code reuse

June 2, 2009

For a long time now, I’ve been hindered by the issue of utilities, or snippets. These are convenience functions and classes that are too small or too incomplete to justify a library, yet are useful enough to be used.
I’ve posted a few on my blog: Namespaces, X and now FileDict. Others I didn’t post, and include a priority queue, an A* implementation, a lazy-list, an LRU memoizer, etc. You probably have a few of those. I know because I see them on snippet sites.

However, I rarely actually use these in my code. I really want to. But once your code spans more than one file, you usually need to make a proper installation, or at least trouble your “users” a bit more. Usually saving a few lines just isn’t worth the trouble. Copy-pasting the snippet into the file is sometimes the solution, but it really pains me that I’ll have to re-paste it every time I improve my snippet.

I’m sure some of you looked at my snippets, or other people’s, thought “cool”, but never used them, simply because it was too much trouble.

Paradoxically, this is especially true when writing snippets. They are just one small file, and using another snippet would probably make them too hard to distribute. This is a magic-circle, for-ever limiting our snippets to a low level of sophistication, and discouraging re-use.

I want to break that circle. I want to create an economy of snippets, increasingly building on each other, eventually creating a “standard library” of their own. But how would one do that? I have one suggestion, along with a proof-of-concept, which I will present here.

PySnippets

PySnippets is my attempt of making snippets usable. It’s comprised of two solutions – a server and a client.

  1. Server – A website for uploading snippets. Simple enough. You can rate them, tag them, discuss them, offer some documentation and of-course post newer versions.
  2. Client – A python module that automagically imports snippets from the web. Essentially, it downloads the snippets you request to a cache, imports them if they’re already there, and periodically searches for updates.

The server is structured in a predictable way, so that the client knows how to fetch a snippet just by its name.

The Client

Here’s a usage example with my current client implementation, I creatively call “snippets”:

import snippets
antigravity = snippets.get('antigravity')  # "snippet-import"
antigravity.start(mode='xkcd')

Easy as that!

The snippets.get function looks for the module in the local snippets-cache. If it’s there, get just imports it and returns the module. If it’s not, it queries the server for a snippet called “antigravity” (names are unique), stores it in the cache, and the imports it. What the user notices is a 2-second pause the first time he ever imports that snippet, and nothing else from then on.

You can specify to download a specific version, like this:

filedict = snippets.get('filedict', version=0.1)

Auto-Updating Snippets

The current implementation also includes an “auto-update” feature: Periodically, before importing a module, the client surveys the server for a newer version of it. If a newer version exists, it downloads it to the cache and continues with the import.

Auto-updates can be disabled in a parameter to get.

The Server

The server is yet another service to upload snippets, however it has a slightly unusual design (which no other snippet site I know of has):

  • A URL to a snippet is easy to deduce given its name.
  • There is a conscious (though simple) support for versions.
  • To increase reliability and trust (more on that later), uploaded snippets cannot be altered (but a new version can be issued)

Since I know very little about administration and server-maintenance, I chose wikidot.com to host my POC web-site. They have an elaborate support for permissions and most of the features I need, such as the ability to rate, tag and discuss snippets.

Trust

Perhaps the biggest issue with such a system is trust. Since you’re running code which resides online, you have to trust me not to maliciously alter the snippets, and also you have to trust the author of the snippet not to do so.

As a partial solution, uploaded files cannot be altered: Not edited, nor renamed, nor deleted, etc. So if specify a particular snippet version, it is guaranteed that it will never change (I may commit changes by request, but I will audit them myself).
If you decide to use the latest version of a snippet (that is, not specify a version), please make sure you trust its author.

Perhaps higher-ups in the Python community would like to take some sponsorship of the project, removing the remaining trust-issues with the administrator (that’s me).

Implications

  • To distribute your snippets, all you need is for the reciever to have an internet connection, and the snippets client.
  • If you’re sending someone code, you can attach the client (it’s rather small, too), and just import away. The reciever will benefit from improvements and bugfixes to your snippets.
  • You can use other people’s snippets just as easily, as long as you trust them.
  • Snippets can now build on each other without worrying too much.

What if my user is offline?

Then probably PySnippets isn’t for him.

However, I do have some ideas, and might implement them if there is sufficient demand.

Afterword

PySnippets is my humble attempt at solving the utility/snippet reuse problems. I hope you like it and find it useful.

Please try it!