Contracts and protocols as a substitute to types and interfaces

December 8, 2011

I am a big fan of assertions. Whenever I reach a point in my code where I say “that pointer can’t possibly be null”, I immediately write – assert( p != NULL ); - and whenever I say “this list can’t possibly be longer than 256″ I write assert len(l) <= 256. If you wonder why I keep doing this, it’s because very often I’m wrong. It’s not that I’m a particularly bad programmer, but sometimes I make mistakes, and even when I don’t, sometimes I get very unexpected input, and even when I don’t, sometimes other pieces of code conspire against me. Assertions save me from mythical bug hunts on a regular basis.

So, it’s not a big surprise that I’m a big fan of contracts too. If you don’t know what contracts are, they’re essentially asserts that run at the beginning and end of each function, and check that the parameters and the return values meet certain expectations. In a way, function type declarations, as can be found in C or Java, are a special case of contracts. (Would you like to know more?)

Why not just use duck-typing?

Duck typing is great, but in my experience it becomes a burden as the system grows in size and complexity. Sometimes objects aren’t fully used right away; they are stored as an instance variable, pickled for later use, or sent to another process or another computer. When you finally get the AttributeError, it’s in another execution stack, or in another thread, or in another computer, and debugging it becomes very unpleasant! And what happens when you get the correct object, but it’s in the wrong state? You won’t even get an exception until something somewhere gets corrupted.

In my experience, using an assertion system is the best way to find the subtle bugs and incongruities of big and complex systems.

Why do we need something new?

Types are very confining, even in “typeless” dynamic languages. Take Python: If your API has to verify that it’s getting a file object, the only way is to call isinstance(x, file). That forces the caller to inherit from file, even if he’s writing a mock object (say, as an RPC proxy) that makes no disk access. In any static-type language the I know, it’s impossible to say that you accept either int or float, and you’re forced to either write the same function twice, or use a template and just define it twice.

Today’s interfaces are ridiculous. In C#, an interface with a method that returns a IList<int> will be very upset if you try to implement it as returning List<int>! And don’t even try to return a List<int> when you’re expected to return List. Note that C# will gladly cast between these types in the code, but when dealing with interfaces and function signatures it just goes nuts. It gets very annoying when you’re implementing an ITree inteface and can’t use your own class as nodes‘ type because the signatures collide, and instead you have to explicitly cast from ITree at every method. But I digress.

Even if today’s implementations were better, types are just not enough. They tell you very little about the input or the output. You want to be able to test its values, lengths, states, and maybe to even interact with it to some degree. What we have just doesn’t cut it.

What should we do instead?

Contracts are already pretty good: they have a lot of flexibility and power, they’re self-documenting, and they can be reasoned upon by the compiler/interpreter (“Oh it only accepts a list[int<256]? Time to use my optimized string functions instead!”). But they only serve as a band-aid to existing type systems. They don’t give you the wholesome experience of abstract classes and methods. But, they can.

To me, contracts are much bigger than just assertions. I see them as stepping-stones to a completely new paradigm, that will replace our current system of interfaces, abstract methods, and needless inheritance, with “Contract Protocols”.

How? These are the steps that we need to take to get there:
  1.  Be able to state your assertions about a function, in a declarative manner. Treat these assertions as an entity called a “contract”.  We’re in the middle of this step, and some contract implementations (such as the wonderful PyContracts for python) have already taken the declarative entity route, which is essential for the next step.
  2. Be able to compare contracts. Basically, I want to be able to tell if a contract is contained within another contract, so if C1⊂C2 and x∊C1 then x∊C2. I suspect it’s easier said then done, but I believe that the following (much easier) steps render it as worth doing.
  3. Be able to bundle contracts in a “contract protocol”, and use it to test a class. A protocol is basically just a mapping of {method-name: contract}, and applying it to a class tests that each method exists in the class, and that its contract is a subset of the protocol’s corresponding contract. If these terms are met, it can be said that the class implements the protocol. A class can implement several protocols, obviously.
  4. Be able to compare protocols. Similarly to contracts, we want to check if a protocol is a subset of another protocol. Arguably, it’s the same as step 3.
  5. Contracts can also check if an instance implements a protocol. Making a full circle, we can now use protocols to check for protocols and so on, allowing layers of complexity. We can now write infinitely detailed demands about what a variable should be, but very concisely.

When we finish point 5, we have a complete and very powerful system in our hands. We don’t need to ever discuss types, except for the most basic ones. Inheritance is now only needed to gain functionality, not identity. We can use it for debug-only purposes, but also for run-time decisions in production (For example, in a Strategy pattern).

Example

As a last attempt to get my point across, here is vaguely how I imagine the file protocol to look in pseudo-code.

It doesn’t do the idea any justice, but hopefully it’s enough to get you started.

protocol Closeable:
<pre>    close()

protocol _File < Closeable:
    [str] name
    [int,>0] tell()
    seek( [int,in (0,1,2)] )

protocol ReadOnlyFile < _File:
    [str,!=''] read( [int,>0 or None]? )
    [Iterable[str]] readlines( )
    [Iterable[str]] __iter__( )

protocol WriteOnlyfile < _File:
    [int,>0] write( [str,!=''] )
    writelines( [Iterable[str]] )
    flush()

protocol RWFile < ReadOnlyFile | WriteOnlyFile:
    pass

>>> print ReadOnlyFile < RWFile
True
>>> print implements( open('bla'), ReadOnlyFile )
True
>>> print implements( open('bla'), Iterable )  # has __iter__ function,
True
>>> print implements( open('bla'), Iterable[int] )
False
>>> print implements( open('bla'), WriteOnlyFile )  # default is 'r'
False
>>> print implements( open('bla'), RWFile )
False
>>> print implements( open('bla', 'w+'), RWFile )
True


Baker – Expose Python to the Shell

February 17, 2010

It’s been a long time since my last post, and it would be appropriate that I post about whatever it is that I’ve been working on. But I won’t. I’m writing this post only to tell you about an interesting new python library I stumbled upon.

Baker, in their own words, “lets you easily add a command line interface”.

In other words, it lets you expose your python utility functions to your favorite shell.

The only requirements are that:

  • Your function must accept string arguments (an exception: it accepts ints/floats if you provide a default argument)
  • Your function must print its output to stdout

Okay, so these are a little limiting. But the interesting part about Baker is not its implementation (which is still a bit clunky and basic), but rather its concept. Here’s a small piece of code I wrote:

import baker

@baker.command
def substr(text, start, end, step=1):
    print text[int(start):int(end):step]

if __name__ == '__main__':
    baker.run()

And here’s how I use it from the command line:

> baketest.py substr "Hello World!" 1 10
ello Worl

> baketest.py substr "Hello World!" 1 10 --step 2
el ol

The simplicity and intuitiveness of this interface really appealed to me. Hopefully this will catch on, and we’ll see more python scripts providing command-line interface, just because it’s very easy.


Lazier Copy-On-Write

June 21, 2009

Copy-on-write (COW) is a popular mechanism of lazy evaluation, that helps improve running speed and reduce memory requirements by transparently delaying the copying of data. Essentially, this is how it works: When you try to copy an object, you are instead given a fake object. Attempts to read that new object will instead read from the original object. Writing to the object will create a new copy (as originally requested) and refer to it in the future for reads and writes.

It allows to write simpler and safer programs. For instance, a programmer can pass “copies” of his data with minimal performance impact, and not have to worry about others changing his original data.

It’s great, but can it be more?

I present here two proposals for extending this idea for even greater optimization. They are far from my areas of expertise, so I hope they still make sense.

1. Copy-On-Write + Fragmentation

Fragmentation of data is a mechanism that allows different parts (blocks) of data to reside in different locations in memory while appearing intact (that is, sequential).  This mechanism is often accused of  slowing the computer down. However, its a critical feature of virtual memory, which is a basis to all modern operating systems.

Introducing fragmentation into your data structures has many benefits, but let’s discuss the benefits regarding COW, which may be obvious by now: On write, you don’t have to copy all of the data, just the blocks which are being changed. This can be a big difference, if you’re changing a few bytes in a 50mb string.

You still have to copy the meta-data (such as, where are the blocks, what is the next block, etc.), but that’s a small price to pay, and a reasonable design requirement.

Now instead of copying the entire data, you copy only a fragment of it. How big is that fragment? Perhaps a fixed size, such as 64k. But assuming you have no real restriction on the size of these data blocks, the next logical step, in my eyes, is to ask: Why not make it as small as possible? That is, why not intentionally fragment the block into three smaller blocks: Before the area that is to be written, the area that is to be written, and after the area to be written. At this point we continue as we originally planned: We copy only the block which is to be written, which is, of course, exactly as small as it can be.

Eventually, we have a model in which writing n bytes into a COWed data of m bytes takes O(n) time and memory, instead of the original O(m+n) time and O(m) memory. I argue that in the common case, n is significantly smaller than m, and so the win is big.

Of course, fragmentation has a habit of slowing down reading times. When fragmentation is “too high”, it is possible to defragment the memory (an occasional O(m) process). The optimal balance of fragmentation depends heavily on the frequency of reads vs of writes, but I argue that even a sub-optimal, common-case balance, will produce an improvement in performance.

Edit: I’ve been unclear about how it affects look-ups. Fragmentation to blocks of fixed size remains O(1) for getting and setting items. However, for variable-size blocks it’s not so simple. A search tree can achieve a look-up of  O(logn) where n is number of fragments, which is a lot slower than the original array peformance. It is probably only a good idea if you have access to the operating system’s memory management, or if the use of look-ups is rare (and then an occasional defragmentation would still be necessary). Still, fixed-size fragments are good enough, and they can be dynamically resized with little cost, as long as the resize is uniform.

2. Copy-On-Write-And-ReaD

Or in short, COWARD, is a mechanism to even further delay copying, to only after the written data is also read. That is, when the programmer requests to write data, this mechanism will instead journal the data, producing sort of a “diff”. Only when the programmer attempts to read the result, the original data is copied and the diff is applied. A diff structure is provided by any implementation of lazy evaluation, by definition, but perhaps there are other more suitable diff structures for this purpose.

This starts to make more sense with fragmentation: Then the diff can be applied only to the block that is read. And so, a block will be copied only if it is both written and read. In some cases, there may be very little intersection between the two (and so, very little copying).

So basically, COWARD is just a (non-)fancy name for an array of promises (not to be confused with a field of dreams). The (possible) novelty is in the way this array is created and used: transparently, and relatively efficiently. Note that, like the previous proposal, it provides little value in situations where the all of the data is altered or read. However, I argue it will significantly improve performance in cases where only part of the data is read and written.

It can, for instance, be useful in cases where an algorithm works on COWed data (which happens quite often) and provides more processing than the user requires. Using this method, only blocks that the user requests are copied — and if the calculations themselves are lazy — processed. And all of it transparent to both the user and the implementer of the algorithm .

Here’s to lazier COWs!


PySnippets – improving code reuse

June 2, 2009

For a long time now, I’ve been hindered by the issue of utilities, or snippets. These are convenience functions and classes that are too small or too incomplete to justify a library, yet are useful enough to be used.
I’ve posted a few on my blog: Namespaces, X and now FileDict. Others I didn’t post, and include a priority queue, an A* implementation, a lazy-list, an LRU memoizer, etc. You probably have a few of those. I know because I see them on snippet sites.

However, I rarely actually use these in my code. I really want to. But once your code spans more than one file, you usually need to make a proper installation, or at least trouble your “users” a bit more. Usually saving a few lines just isn’t worth the trouble. Copy-pasting the snippet into the file is sometimes the solution, but it really pains me that I’ll have to re-paste it every time I improve my snippet.

I’m sure some of you looked at my snippets, or other people’s, thought “cool”, but never used them, simply because it was too much trouble.

Paradoxically, this is especially true when writing snippets. They are just one small file, and using another snippet would probably make them too hard to distribute. This is a magic-circle, for-ever limiting our snippets to a low level of sophistication, and discouraging re-use.

I want to break that circle. I want to create an economy of snippets, increasingly building on each other, eventually creating a “standard library” of their own. But how would one do that? I have one suggestion, along with a proof-of-concept, which I will present here.

PySnippets

PySnippets is my attempt of making snippets usable. It’s comprised of two solutions – a server and a client.

  1. Server - A website for uploading snippets. Simple enough. You can rate them, tag them, discuss them, offer some documentation and of-course post newer versions.
  2. Client - A python module that automagically imports snippets from the web. Essentially, it downloads the snippets you request to a cache, imports them if they’re already there, and periodically searches for updates.

The server is structured in a predictable way, so that the client knows how to fetch a snippet just by its name.

The Client

Here’s a usage example with my current client implementation, I creatively call “snippets”:

import snippets
antigravity = snippets.get('antigravity')  # "snippet-import"
antigravity.start(mode='xkcd')

Easy as that!

The snippets.get function looks for the module in the local snippets-cache. If it’s there, get just imports it and returns the module. If it’s not, it queries the server for a snippet called “antigravity” (names are unique), stores it in the cache, and the imports it. What the user notices is a 2-second pause the first time he ever imports that snippet, and nothing else from then on.

You can specify to download a specific version, like this:

filedict = snippets.get('filedict', version=0.1)

Auto-Updating Snippets

The current implementation also includes an “auto-update” feature: Periodically, before importing a module, the client surveys the server for a newer version of it. If a newer version exists, it downloads it to the cache and continues with the import.

Auto-updates can be disabled in a parameter to get.

The Server

The server is yet another service to upload snippets, however it has a slightly unusual design (which no other snippet site I know of has):

  • A URL to a snippet is easy to deduce given its name.
  • There is a conscious (though simple) support for versions.
  • To increase reliability and trust (more on that later), uploaded snippets cannot be altered (but a new version can be issued)

Since I know very little about administration and server-maintenance, I chose wikidot.com to host my POC web-site. They have an elaborate support for permissions and most of the features I need, such as the ability to rate, tag and discuss snippets.

Trust

Perhaps the biggest issue with such a system is trust. Since you’re running code which resides online, you have to trust me not to maliciously alter the snippets, and also you have to trust the author of the snippet not to do so.

As a partial solution, uploaded files cannot be altered: Not edited, nor renamed, nor deleted, etc. So if specify a particular snippet version, it is guaranteed that it will never change (I may commit changes by request, but I will audit them myself).
If you decide to use the latest version of a snippet (that is, not specify a version), please make sure you trust its author.

Perhaps higher-ups in the Python community would like to take some sponsorship of the project, removing the remaining trust-issues with the administrator (that’s me).

Implications

  • To distribute your snippets, all you need is for the reciever to have an internet connection, and the snippets client.
  • If you’re sending someone code, you can attach the client (it’s rather small, too), and just import away. The reciever will benefit from improvements and bugfixes to your snippets.
  • You can use other people’s snippets just as easily, as long as you trust them.
  • Snippets can now build on each other without worrying too much.

What if my user is offline?

Then probably PySnippets isn’t for him.

However, I do have some ideas, and might implement them if there is sufficient demand.

Afterword

PySnippets is my humble attempt at solving the utility/snippet reuse problems. I hope you like it and find it useful.

Please try it!


FileDict – bug-fixes and updates

May 31, 2009

In my previous post I introduced FileDict. I did my best to get it right the first time, but as we all know, this is impossible for any non-trivial piece of code.
I want to thank everyone for their comments and remarks. It’s been very helpful.

The Unreliable Pickle

A special thanks goes to the mysterious commenter “R”, for pointing out that pickling identical objects may produce different strings (!), which are therefor inadequate to be used as keys. And my FileDict indeed suffered from this bug, as this example shows:

>>> key = (1, u'foo')
>>> d[(1, u'foo')] = 4
>>> d[(1, u'foo')]
4
>>> d[key]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "filedict.py", line 64, in __getitem__
    raise KeyError(key)
KeyError: (1, u'foo')

And if that’s not bad enough:

>>> d[key] = 5
>>> list(d.items())
[['a', 3], [(1, 2), 3], [(1, u'foo'), 4], [(1, u'foo'), 5]]

Ouch.
I’ve rewritten the entire storing mechanism to poll only on hash and compare keys after unpickling. This may be a bit slower, but I don’t (and shouldn’t) expect many colliding hashes anyway.
Bug is fixed.

DictMixin

Under popular demand, I’m now inheriting from DictMixin. It’s made my code a bit shorter, and was not at all painful.

Copy and Close

I no longer close the database on __del__, and instead I rely on the garbage collector. It seems to close the database on time, and it allows to one copy the dictionary (which, of course, will all be always have the same keys, but doesn’t have to have the same behavior or attributes).

New Source Code

Is available here


Follow

Get every new post delivered to your Inbox.