How To Write A Calculator in 70 Python Lines, By Writing a Recursive-Descent Parser

February 24, 2013

Three months ago, I wrote a post detailing the process of writing a calculator using a parsing library. The popular response, however, was that readers are far more curious about seeing a calculator written from scratch, with the batteries included but nothing else. I figured, why not?

Writing a calculator is simple, if you use hacks specific to arithmetic expressions, but the effect of hacks is nearly always the same: the solution isn’t elegant, it isn’t extensible, and it’s hard to understand intuitively. Out of appreciation for a good challenge, and in the hope of writing a useful post, I decided to write it using a mostly generic recursive-descent parser. In the same spirit as last time, I wanted to do it in as few lines as I reasonably could, so the code is full of hacks and tricks, but they’re superficial and not specific to the task at hand.

This post is a detailed, step-by-step explanation of my implementation. If you want to jump straight to the code and figure it out by yourself, just scroll to the end of this post. Hopefully when you’re done you’ll have a better understanding of how parsing works internally, and you’ll be inspired to use a proper parsing library to avoid this entire bloody mess.

To understand this post, you should have a strong understanding of Python, and it’s recommended to have some understanding of what parsing is and what it’s for. If you’re not sure, I recommend that you read my previous post, in which I thoroughly explain the grammar that I will be using in this post.

Step 1: Tokenize

The first step of processing the expression is to turn it into a list of individual symbols. This is the easiest part, and not the point of this exercise, so I allowed myself to cheat here quite a lot.

First, I defined the tokens (Numbers are notably absent; they’re the default) and a Token type:

import re
from collections import namedtuple

token_map = {'+':'ADD', '-':'ADD',
             '*':'MUL', '/':'MUL',
             '(':'LPAR', ')':'RPAR'}

Token = namedtuple('Token', ['name', 'value'])

And here’s the code I used to tokenize an expression `expr`:

split_expr = re.findall('[\d.]+|[%s]' % ''.join(token_map), expr)
tokens = [Token(token_map.get(x, 'NUM'), x) for x in split_expr]

The first line is a trick that splits the expression into the basic tokens, so

'1.2 / ( 11+3)' --> ['1.2', '/', '(', '11', '+', '3', ')']

The next line names the tokens, so that the parser can recognize them by category:

['1.2', '/', '(', '11', '+', '3', ')']
->
[Token(name='NUM', value='1.2'), Token(name='MUL', value='/'), Token(name='LPAR', value='('), Token(name='NUM', value='11'), Token(name='ADD', value='+'), Token(name='NUM', value='3'), Token(name='RPAR', value=')')]

Any token that is not in the token_map is assumed to be a number. Our tokenizer doesn’t validate its input, so non-numbers like '1.2.3' can slip through at this stage; luckily, the evaluator will reject them later, when it tries to convert them with float().
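If we wanted the tokenizer to validate instead, a minimal sketch (my own addition, not part of the post’s 70 lines) could look like this:

def validate_tokens(tokens):
    # Anything tagged NUM must actually parse as a number,
    # so garbage like '1.2.3' is rejected here rather than at evaluation.
    for token in tokens:
        if token.name == 'NUM':
            try:
                float(token.value)
            except ValueError:
                raise ValueError('Not a valid number: %r' % token.value)
    return tokens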

That’s it. Now that we have a list of tokens, our next step is to parse it into an AST.

Step 2: Define the grammar

The parser I chose to implement is a naive recursive descent parser, which is a simpler version of LL parsing. It’s the simplest parser to implement, and in fact mine takes only 14 lines. It’s a kind of top-down parser, which means that it starts by matching the highest rule (like: expression), and recursively tries to match its sub-rules until it matches the lowest rules (like: number). To put it another way, while a bottom-up (LR) parser will gradually fold tokens and rules into other rules, until there’s only one rule left, a top-down (LL) parser like ours will gradually expand the rules into less abstract rules, until they completely match the input-tokens.

Before we get to the actual parser, let’s talk about the grammar. In my previous post, I used an LR parser, and I defined the calculator grammar like this (caps are tokens):

add: add ADD mul | mul;
mul: mul MUL atom | atom;
atom: NUM | '(' add ')' | neg;
neg: '-' atom;

(If you don’t understand this grammar, you should read my previous post)

This time I’m using an LL parser, instead of LR, and here’s how I defined the grammar:

rule_map = {
    'add' : ['mul ADD add', 'mul'],
    'mul' : ['atom MUL mul', 'atom'],
    'atom': ['NUM', 'LPAR add RPAR', 'neg'],
    'neg' : ['ADD atom'],
}

There is a subtle change here. The recursive definitions of add and mul are reversed. This is a very important detail, and I need to explain it.

The LR version of this grammar uses something called left-recursion. When LL parsers see recursion, they just dive in there in an attempt to match the rule. So when faced with left-recursion, they enter infinite recursion. Even smart LL-parsers such as ANTLR suffer from this issue, though they probably emit a friendly error instead of looping infinitely like our toy parser would.

Left-recursion is easily solved by changing it to right-recursion, and that is what I did. But because nothing is easy with parsers, it created another problem: While left-recursion parses 3-2-1 correctly as (3-2)-1, right-recursion parses it
incorrectly as 3-(2-1). I don’t know of an easy solution to this problem, so to keep things short and simple for you and me both, I decided to keep the incorrect form and deal with it in post-processing (see step 4).
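To see the difference concretely:

>>> (3 - 2) - 1    # left-associative: the correct reading
0
>>> 3 - (2 - 1)    # right-associative: what our grammar will produce
2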

Step 3: Parse into an AST

The algorithm is simple. We’re going to define a recursive function that receives two parameters: The first is the name of the rule that we’re trying to match, and the second is the list of tokens we have left. We’ll start with add (which is the highest rule) and with the entire list of tokens, and have the recursive calls become increasingly more specific. The function returns a tuple: The current match, and a list of the tokens that are left to match. For the purpose of short code, we’ll make it capable of also matching tokens (they’re both strings; one is UPPER-CASE and the other lower-case).

Here’s the code for the parser:

RuleMatch = namedtuple('RuleMatch', ['name', 'matched'])

def match(rule_name, tokens):
    if tokens and rule_name == tokens[0].name:      # Match a token?
        return tokens[0], tokens[1:]
    for expansion in rule_map.get(rule_name, ()):   # Match a rule?
        remaining_tokens = tokens
        matched_subrules = []
        for subrule in expansion.split():
            matched, remaining_tokens = match(subrule, remaining_tokens)
            if not matched:
                break   # no such luck. next expansion!
            matched_subrules.append(matched)
        else:
            return RuleMatch(rule_name, matched_subrules), remaining_tokens
    return None, None   # match not found

The first condition checks whether rule_name is actually a token, and whether it matches the current token. If it does, the function returns the match, along with the tokens that are still left to consume.

The for loop iterates over the sub-rules of rule_name, so each can be matched recursively. If rule_name is a token, the get() call returns an empty tuple and the flow falls through to the final return of None, None.

The inner loop runs over every element of the current sub-rule and tries to match them sequentially. Each iteration tries to consume as many matching tokens as possible. If one element fails to match, we discard the entire sub-rule. However, if all elements matched, we reach the else clause and return our match for rule_name, along with the remaining tokens to match.

Let’s run it and see what we get for 1.2 / ( 11+3).

>>> tokens = [Token(name='NUM', value='1.2'), Token(name='MUL', value='/'), Token(name='LPAR', value='('), Token(name='NUM', value='11'), Token(name='ADD', value='+'), Token(name='NUM', value='3'), Token(name='RPAR', value=')')]

>>> match('add', tokens)

(RuleMatch(name='add', matched=[RuleMatch(name='mul', matched=[RuleMatch(name='atom', matched=[Token(name='NUM', value='1.2')]), Token(name='MUL', value='/'), RuleMatch(name='mul', matched=[RuleMatch(name='atom', matched=[Token(name='LPAR', value='('), RuleMatch(name='add', matched=[RuleMatch(name='mul', matched=[RuleMatch(name='atom', matched=[Token(name='NUM', value='11')])]), Token(name='ADD', value='+'), RuleMatch(name='add', matched=[RuleMatch(name='mul', matched=[RuleMatch(name='atom', matched=[Token(name='NUM', value='3')])])])]), Token(name='RPAR', value=')')])])])]), [])

The result is a tuple, of course, and we can see there are no remaining tokens. The actual match is not easy to read, so let me draw it for you:

    add
        mul
            atom
                NUM '1.2'
            MUL '/'
            mul
                atom
                    LPAR    '('
                    add
                        mul
                            atom
                                NUM '11'
                        ADD '+'
                        add
                            mul
                                atom
                                    NUM '3'
                    RPAR    ')'

This is what the AST looks like, in concept. It’s good practice to run the parser in your mind, or on a piece of paper. I dare say it’s necessary to do so if you want to grok it. You can use this AST as a reference to make sure you got it right.
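The drawing above is done by hand. If you’d rather have the computer draw it, here’s a throwaway helper I’d use (print_tree is my own addition, not part of the post’s 70 lines):

def print_tree(node, indent=0):
    # Tokens are leaves; RuleMatch nodes recurse into their sub-matches.
    if isinstance(node, Token):
        print('    ' * indent + '%s %r' % (node.name, node.value))
    else:
        print('    ' * indent + node.name)
        for child in node.matched:
            print_tree(child, indent + 1)

print_tree(match('add', tokens)[0])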

So far we’ve written a parser capable of correctly parsing binary operations, unary operations, brackets and precedence.

There’s only one thing it does incorrectly, and we’re going to fix it in the next step.

Step 4: Post Processing

My parser is imperfect in many ways. The most important one is that it cannot handle left-recursion, which forced me to write the grammar as right-recursive. As a result, parsing 8/4/2 results in the following AST:

    add
        mul
            atom
                NUM 8
            MUL '/'
            mul
                atom
                    NUM 4
                MUL '/'
                mul
                    atom
                        NUM 2

If we try to solve the expression using this AST, we’ll have to calculate 4/2 first, which is wrong. Some LL-parsers choose to fix the associativity in the tree. That takes too many lines ;). Instead, we’re going to flatten it. The algorithm is simple: for each rule in the AST that 1) needs fixing, 2) is a binary operation (has three sub-rules), and 3) whose right-hand operand is the same rule: flatten the latter into the former. By “flatten”, I mean replace a node with its children, in the context of its parent. Since our traversal is DFS post-order, meaning it starts from the edges of the tree and works its way to the root, the effect accumulates. Here’s the code:

    fix_assoc_rules = 'add', 'mul'

    def _recurse_tree(tree, func):
        return map(func, tree.matched) if tree.name in rule_map else tree[1]

    def flatten_right_associativity(tree):
        new = _recurse_tree(tree, flatten_right_associativity)
        if tree.name in fix_assoc_rules and len(new)==3 and new[2].name==tree.name:
            new[-1:] = new[-1].matched
        return RuleMatch(tree.name, new)

This code will turn any structural sequence of additions or multiplications into a flat list (without mixing the two). Parentheses break the sequence, of course, so they won’t be affected.
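For example, flattening the 8/4/2 tree from above produces:

    add
        mul
            atom
                NUM 8
            MUL '/'
            atom
                NUM 4
            MUL '/'
            atom
                NUM 2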

From this point I could re-build the structure as left-associative, using code such as

    def build_left_associativity(tree):
        new_nodes = _recurse_tree(tree, build_left_associativity)
        if tree.name in fix_assoc_rules:
            while len(new_nodes)>3:
                new_nodes[:3] = [RuleMatch(tree.name, new_nodes[:3])]
        return RuleMatch(tree.name, new_nodes)

But I won’t. I’m pressed for lines of code, and changing the evaluation code to handle lists takes far fewer lines than rebuilding the tree.

Step 5: Evaluate

Evaluating the tree is very simple. All that’s required is to traverse the tree in a similar fashion to the post-processing code (namely DFS post-order), and to evaluate each rule in it. At the point of evaluation, because we recurse first, each rule should be made of nothing more than numbers and operations. Here’s the code:

    from operator import add, sub, mul, div

    bin_calc_map = {'*':mul, '/':div, '+':add, '-':sub}
    def calc_binary(x):
        while len(x) > 1:
            x[:3] = [ bin_calc_map[x[1]](x[0], x[2]) ]
        return x[0]

    calc_map = {
        'NUM' : float,
        'atom': lambda x: x[len(x)!=1],
        'neg' : lambda (op,num): (num,-num)[op=='-'],
        'mul' : calc_binary,
        'add' : calc_binary,
    }

    def evaluate(tree):
        solutions = _recurse_tree(tree, evaluate)
        return calc_map.get(tree.name, lambda x:x)(solutions)

I wrote calc_binary to evaluate both addition and multiplication (and their counterparts). It evaluates lists of either, in a left-associative fashion, thus bringing our little LL-grammar annoyance to a conclusion.
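For example, hand-running calc_binary on an already-evaluated, flattened mul list:

    >>> x = [8.0, '/', 4.0, '/', 2.0]
    >>> calc_binary(x)   # [8.0,'/',4.0,'/',2.0] -> [2.0,'/',2.0] -> [1.0]
    1.0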

Step 6: The REPL

The plainest REPL possible:

    if __name__ == '__main__':
        while True:
            print( calc(raw_input('> ')) )

Please don’t make me explain it :)

Appendix: Tying it all together: A calculator in 70 lines

    '''A Calculator Implemented With A Top-Down, Recursive-Descent Parser'''
    # Author: Erez Shinan, Dec 2012

    import re, collections
    from operator import add,sub,mul,div

    Token = collections.namedtuple('Token', ['name', 'value'])
    RuleMatch = collections.namedtuple('RuleMatch', ['name', 'matched'])

    token_map = {'+':'ADD', '-':'ADD', '*':'MUL', '/':'MUL', '(':'LPAR', ')':'RPAR'}
    rule_map = {
        'add' : ['mul ADD add', 'mul'],
        'mul' : ['atom MUL mul', 'atom'],
        'atom': ['NUM', 'LPAR add RPAR', 'neg'],
        'neg' : ['ADD atom'],
    }
    fix_assoc_rules = 'add', 'mul'

    bin_calc_map = {'*':mul, '/':div, '+':add, '-':sub}
    def calc_binary(x):
        while len(x) > 1:
            x[:3] = [ bin_calc_map[x[1]](x[0], x[2]) ]
        return x[0]

    calc_map = {
        'NUM' : float,
        'atom': lambda x: x[len(x)!=1],
        'neg' : lambda (op,num): (num,-num)[op=='-'],
        'mul' : calc_binary,
        'add' : calc_binary,
    }

    def match(rule_name, tokens):
        if tokens and rule_name == tokens[0].name:      # Match a token?
            return tokens[0], tokens[1:]
        for expansion in rule_map.get(rule_name, ()):   # Match a rule?
            remaining_tokens = tokens
            matched_subrules = []
            for subrule in expansion.split():
                matched, remaining_tokens = match(subrule, remaining_tokens)
                if not matched:
                    break   # no such luck. next expansion!
                matched_subrules.append(matched)
            else:
                return RuleMatch(rule_name, matched_subrules), remaining_tokens
        return None, None   # match not found

    def _recurse_tree(tree, func):
        return map(func, tree.matched) if tree.name in rule_map else tree[1]

    def flatten_right_associativity(tree):
        new = _recurse_tree(tree, flatten_right_associativity)
        if tree.name in fix_assoc_rules and len(new)==3 and new[2].name==tree.name:
            new[-1:] = new[-1].matched
        return RuleMatch(tree.name, new)

    def evaluate(tree):
        solutions = _recurse_tree(tree, evaluate)
        return calc_map.get(tree.name, lambda x:x)(solutions)

    def calc(expr):
        split_expr = re.findall('[\d.]+|[%s]' % ''.join(token_map), expr)
        tokens = [Token(token_map.get(x, 'NUM'), x) for x in split_expr]
        tree = match('add', tokens)[0]
        tree = flatten_right_associativity( tree )
        return evaluate(tree)

    if __name__ == '__main__':
        while True:
            print( calc(raw_input('> ')) )
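A quick sanity run should look roughly like this (float formatting varies between Python versions):

    > 1.2 / (11+3)
    0.0857142857143
    > 8/4/2
    1.0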

How To Write A Calculator in 50 Python Lines (Without Eval)

November 18, 2012

Introduction

In this post I will demonstrate how to parse and calculate an arithmetic expression using a general-purpose parser. When we’re done, we’ll be able to handle expressions such as 1 + 2 * -(-3+2)/5.6 + 3, and hopefully you’ll have gained the tools to handle much more.

My motivation is to provide a simple and fun lesson in parsing and formal grammars, as well as to showcase PlyPlus, a parser interface I’ve been working on, on and off, for the past few years. As a bonus, the result is a safe arithmetic alternative to eval().

If you want to follow the examples on your computer, you can install PlyPlus with pip install plyplus.

Working knowledge of Python is required for the implementation section.

Grammars

For those of you who don’t know how parsing and formal grammars work, here is a quick overview: Formal grammars are a hierarchy of rules for parsing text. Each rule matches some portion of the input text, by describing the rules that it’s made of.

Here is an example of a pseudo-grammar for parsing 1 + 2 + 3 + 4:

Rule #1 - add  IS MADE OF  add + number 
                       OR  number + number


Or in EBNF:

add: add '+' number
   | number '+' number
   ;

Each pass of the parser will look for either add+number or number+number, and if it finds one, convert it to add. Basically, every parser aims to climb the hierarchy as much as possible.

Here are the steps the parser will take:

  1. number + number + number + number
    first pass turns all numbers into a ‘number’ rule
  2. [number + number] + number + number
    the parser found its first pattern!
  3. [add + number] + number
    after converting the pattern, it finds the next one
  4. [add + number]
  5. add

The sequence of symbols has turned into a hierarchy of two simple rules: number+number and add+number, and if we tell the computer how to solve each of them, it can solve the entire expression for us. In fact, it can solve any sequence of additions, no matter how long! That is the strength of formal grammars.

Operator Precedence

Arithmetic expressions are not just a linear progression of symbols. Their operators create an implicit hierarchy, which makes them the perfect target for formal grammars:

1 + 2 * 3 / 4 - 5 + 6

Is equivalent to:

1 + (2 * 3 / 4) - 5 + 6

We can express this structure in the grammar by nesting the rules:

add: add '+' mul
   | mul '+' mul
   ;
mul: mul '*' number
   | number '*' number
   ;

By telling add that it operates on mul, and not on numbers, we are giving multiplications the precedence.
Let’s pretend-run this grammar on 1 + 2 * 3 * 4 with our magical parser that is in my head:

  1. number + number * number * number
  2. number + [number * number] * number
    the parser doesn’t know what a number+number is, so this is its next pick
  3. number + [mul * number]
  4. number + mul
  5. ???

Now we are in a bit of a pickle! The parser doesn’t know what to do with number+mul. We can tell it, but if we keep looking we’ll find that there are many possibilities we didn’t cover, such as mul+number, add+number, add+add, etc.

So what do we do?

Luckily, we have a trick up our sleeve: We can say that a number by itself is a multiplication, and a multiplication by itself is an addition!

This method might seem strange at first, but it makes total sense:

add: add '+' mul
   | mul '+' mul
   | mul
   ;
mul: mul '*' number
   | number '*' number
   | number
   ;

But if mul can become add, and number can become mul, we have extra lines that do nothing. Removing them, we get:

add: add '+' mul
   | mul
   ;
mul: mul '*' number
   | number
   ;

Let’s pretend-run on 1 + 2 * 3 * 4 again with this new grammar:

  1. number + number * number * number
    There’s no rule for number*number now, but the parser can “get creative”

  2. number + [number] * number * number
  3. number + [mul * number] * number
  4. number + [mul * number]
  5. [number] + mul
  6. [mul] + mul
  7. [add + mul]
  8. add

Success!!!

If this looks like magic to you, try pretend-running on different arithmetic expressions, and see how the expression resolves itself in the correct order every time. Or wait for the next section and let the computer run it for you!

Running the parser

By now we have a fairly good idea of how we want our grammar to work. Let’s apply it and write an actual grammar:

start: add;             // This is the top of the hierarchy
add: add add_symbol mul | mul;
mul: mul mul_symbol number | number;
number: '[\d.]+';       // Regular expression for a decimal number
mul_symbol: '\*' | '/'; // Match * or /
add_symbol: '\+' | '-'; // Match + or -

You might want to brush up on regular expressions a bit, but otherwise this grammar is pretty straightforward. Let’s run it on an expression!

>>> from plyplus import Grammar
>>> g = Grammar("""...""")
>>> print g.parse('1+2*3-5').pretty()
start
  add
    add
      add
        mul
          number
            1
      add_symbol
        +
      mul
        mul
          number
            2
        mul_symbol
          *
        number
          3
    add_symbol
      -
    mul
      number
        5

So pretty!

Study the tree and see what hierarchy the parser chose.

If you want to play with the parser, and feed it expressions by yourself, you can! All you need is Python. Run pip install plyplus and paste the above commands inside python (make sure to put the actual grammar instead of ‘…’ ;-) ).
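For convenience, here is the whole thing in one copy-pasteable block (the same grammar as above, just inlined):

from plyplus import Grammar

g = Grammar("""
start: add;             // This is the top of the hierarchy
add: add add_symbol mul | mul;
mul: mul mul_symbol number | number;
number: '[\d.]+';       // Regular expression for a decimal number
mul_symbol: '\*' | '/'; // Match * or /
add_symbol: '\+' | '-'; // Match + or -
""")

print g.parse('1+2*3-5').pretty()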

Shaping the tree

Plyplus automagically creates a tree, but the tree it creates is not optimal. While putting number inside mul and mul inside add was useful for creating a hierarchy, now that we already have the hierarchy they are just a burden. We can tell Plyplus to “expand” (i.e. remove) rules by prefixing them. A @ will always expand a rule, a # will flatten it, and a ? will expand it if and only if it has one child. In this case, ? is what we want.

start: add;
?add: add add_symbol mul | mul;       // Expand add if it's just a mul
?mul: mul mul_symbol number | number; // Expand mul if it's just a number
number: '[\d.]+';
mul_symbol: '\*' | '/';
add_symbol: '\+' | '-';

Here’s how the tree looks with the new grammar:

>>> g = Grammar("""...""")
>>> print g.parse('1+2*3-5').pretty()
start
  add
    add
      number
        1
      add_symbol
        +
      mul
        number
          2
        mul_symbol
          *
        number
          3
    add_symbol
      -
    number
      5

Ooh, that is so much cleaner, and I dare say, quite optimal!

Parentheses and Other Features

We are missing some obvious features: parentheses, unary operators (like -(1+2)), and the ability to put spaces inside the expression. These are all so easy to add at this point that it would be a shame not to.

The important concept is to add a new rule, which we’ll call atom. Everything inside an atom (namely parentheses and unary operations) happens before any additions or multiplications (aka binary operations). Since atom is only a hierarchical construct with no semantic significance, we’ll make sure it’s always expanded, by prefixing it with @.

The obvious way to allow spaces is with something like add: add SPACE add_symbol SPACE mul | mul;, but that’s tedious and unreadable. Instead, we will tell Plyplus to always ignore whitespace.

Here’s the final grammar, with all of these features:

start: add;
?add: (add add_symbol)? mul;
?mul: (mul mul_symbol)? atom;
@atom: neg | number | '\(' add '\)';
neg: '-' atom;
number: '[\d.]+';
mul_symbol: '\*' | '/';
add_symbol: '\+' | '-';
WHITESPACE: '[ \t]+' (%ignore);

Make sure you understand it, so we can proceed to the next step: Calculating!

Calculating!

We can already turn an expression into a hierarchical tree. To calculate it, all we need is to collapse the tree into a number. The way to do it is branch by branch.

This is the part where we start writing code, so I need to explain two things about the tree.

  1. Each branch is an instance with two attributes: head, which is the name of the rule (say, add or number), and tail, which is the list of sub-rules that it matched.
  2. By default, Plyplus removes unnecessary tokens. In our example, the '(' and ')' will already be removed from the tree, as well as neg’s '-'. Those of add and mul won’t be removed, because they have their own rule, so Plyplus knows they’re important. This feature can be turned off to keep all tokens in the tree, but in my experience it’s always more elegant to leave it on and change the grammar accordingly.
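To make that concrete, here’s roughly what branch access looks like (a sketch based on the description above; exact details may vary between Plyplus versions):

>>> tree = g.parse('1+2*3')
>>> tree.head
'start'
>>> tree.tail[0].head
'add'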

Okay, we are ready to write some code! We will collapse the tree using a transformer, which is very simple. It traverses the tree, starting with the outermost branches, until it reaches the root. It’s your job to tell it how to collapse each branch. If you do it right, you will always run on an outermost branch, riding the wave of its collapse. Dramatic! Let’s see how it’s done.

>>> import operator as op
>>> from plyplus import STransformer

class Calc(STransformer):

    def _bin_operator(self, exp):
        arg1, operator_symbol, arg2 = exp.tail

        operator_func = { '+': op.add, 
                          '-': op.sub, 
                          '*': op.mul, 
                          '/': op.div }[operator_symbol]

        return operator_func(arg1, arg2)

    number      = lambda self, exp: float(exp.tail[0])
    neg         = lambda self, exp: -exp.tail[0]
    __default__ = lambda self, exp: exp.tail[0]

    add = _bin_operator
    mul = _bin_operator

Each method corresponds to a rule name. If a method doesn’t exist, __default__ is called. In our implementation, we left out start, add_symbol, and mul_symbol, all of which should do nothing but return their only sub-branch.

I use float() to parse numbers because I’m lazy, but I could implement it using the parser as well.

I use the operator module for syntactic beauty. operator.add is basically ‘lambda x,y: x+y’, etc.

Alright, let’s run the transformer and see how it turns out.

>>> Calc().transform( g.parse('1 + 2 * -(-3+2) / 5.6 + 30'))
31.357142857142858

What does eval() think?

>>> eval('1 + 2 * -(-3+2) / 5.6 + 30')
31.357142857142858

Success!

Final Step: The REPL

For aesthetic reasons, let’s wrap it up in a nice calculator REPL:

def main():
    calc = Calc()
    while True:
        try:
            s = raw_input('> ')
        except EOFError:
            break
        if s == '':
            break
        tree = calc_grammar.parse(s)
        print calc.transform(tree)

You can see the full source code here: https://github.com/erezsh/plyplus/blob/master/examples/calc.py


Contracts and protocols as a substitute to types and interfaces

December 8, 2011

I am a big fan of assertions. Whenever I reach a point in my code where I say “that pointer can’t possibly be null”, I immediately write assert(p != NULL); and whenever I say “this list can’t possibly be longer than 256” I write assert len(l) <= 256. If you wonder why I keep doing this, it’s because very often I’m wrong. It’s not that I’m a particularly bad programmer, but sometimes I make mistakes, and even when I don’t, sometimes I get very unexpected input, and even when I don’t, sometimes other pieces of code conspire against me. Assertions save me from mythical bug hunts on a regular basis.

So, it’s not a big surprise that I’m a big fan of contracts too. If you don’t know what contracts are, they’re essentially asserts that run at the beginning and end of each function, and check that the parameters and the return values meet certain expectations. In a way, function type declarations, as can be found in C or Java, are a special case of contracts. (Would you like to know more?)
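To illustrate with a toy sketch (my own, not any particular library’s API):

def contract(pre, post):
    # A toy contract decorator: check the arguments before the call,
    # and the return value after it.
    def decorator(f):
        def wrapper(*args):
            assert pre(*args), 'pre-condition failed'
            result = f(*args)
            assert post(result), 'post-condition failed'
            return result
        return wrapper
    return decorator

@contract(pre=lambda l: len(l) <= 256, post=lambda n: n >= 0)
def process(l):
    return len(l)

A real contract library does this declaratively and with far richer checks, but the principle is the same.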

Why not just use duck-typing?

Duck typing is great, but in my experience it becomes a burden as the system grows in size and complexity. Sometimes objects aren’t fully used right away; they are stored as an instance variable, pickled for later use, or sent to another process or another computer. When you finally get the AttributeError, it’s in another execution stack, or in another thread, or in another computer, and debugging it becomes very unpleasant! And what happens when you get the correct object, but it’s in the wrong state? You won’t even get an exception until something somewhere gets corrupted.

In my experience, using an assertion system is the best way to find the subtle bugs and incongruities of big and complex systems.

Why do we need something new?

Types are very confining, even in “typeless” dynamic languages. Take Python: if your API has to verify that it’s getting a file object, the only way is to call isinstance(x, file). That forces the caller to inherit from file, even if he’s writing a mock object (say, an RPC proxy) that makes no disk access. And in any statically-typed language that I know, it’s impossible to say that you accept either int or float, so you’re forced to either write the same function twice, or use a template and just define it twice.

Today’s interfaces are ridiculous. In C#, an interface with a method that returns an IList<int> will be very upset if you try to implement it as returning List<int>! And don’t even try to return a List<int> when you’re expected to return List. Note that C# will gladly cast between these types in regular code, but when dealing with interfaces and function signatures it just goes nuts. It gets very annoying when you’re implementing an ITree interface and can’t use your own class as the nodes’ type because the signatures collide, and instead you have to explicitly cast from ITree in every method. But I digress.

Even if today’s implementations were better, types are just not enough. They tell you very little about the input or the output. You want to be able to test its values, lengths, states, and maybe to even interact with it to some degree. What we have just doesn’t cut it.

What should we do instead?

Contracts are already pretty good: they have a lot of flexibility and power, they’re self-documenting, and they can be reasoned upon by the compiler/interpreter (“Oh it only accepts a list[int<256]? Time to use my optimized string functions instead!”). But they only serve as a band-aid to existing type systems. They don’t give you the wholesome experience of abstract classes and methods. But, they can.

To me, contracts are much bigger than just assertions. I see them as stepping-stones to a completely new paradigm, that will replace our current system of interfaces, abstract methods, and needless inheritance, with “Contract Protocols”.

How? These are the steps that we need to take to get there:
  1.  Be able to state your assertions about a function, in a declarative manner. Treat these assertions as an entity called a “contract”.  We’re in the middle of this step, and some contract implementations (such as the wonderful PyContracts for python) have already taken the declarative entity route, which is essential for the next step.
  2. Be able to compare contracts. Basically, I want to be able to tell if a contract is contained within another contract, so that if C1⊂C2 and x∊C1 then x∊C2. I suspect it’s easier said than done, but I believe that the following (much easier) steps make it worth doing.
  3. Be able to bundle contracts in a “contract protocol”, and use it to test a class. A protocol is basically just a mapping of {method-name: contract}, and applying it to a class tests that each method exists in the class, and that its contract is a subset of the protocol’s corresponding contract. If these terms are met, it can be said that the class implements the protocol. A class can implement several protocols, obviously.
  4. Be able to compare protocols. Similarly to contracts, we want to check if a protocol is a subset of another protocol. Arguably, it’s the same as step 3.
  5. Contracts can also check if an instance implements a protocol. Making a full circle, we can now use protocols to check for protocols and so on, allowing layers of complexity. We can now write infinitely detailed demands about what a variable should be, but very concisely.

When we finish point 5, we have a complete and very powerful system in our hands. We don’t need to ever discuss types, except for the most basic ones. Inheritance is now only needed to gain functionality, not identity. We can use it for debug-only purposes, but also for run-time decisions in production (For example, in a Strategy pattern).

Example

As a last attempt to get my point across, here is vaguely how I imagine the file protocol to look in pseudo-code.

It doesn’t do the idea any justice, but hopefully it’s enough to get you started.

protocol Closeable:
    close()

protocol _File < Closeable:
    [str] name
    [int,>0] tell()
    seek( [int,in (0,1,2)] )

protocol ReadOnlyFile < _File:
    [str,!=''] read( [int,>0 or None]? )
    [Iterable[str]] readlines( )
    [Iterable[str]] __iter__( )

protocol WriteOnlyFile < _File:
    [int,>0] write( [str,!=''] )
    writelines( [Iterable[str]] )
    flush()

protocol RWFile < ReadOnlyFile | WriteOnlyFile:
    pass

>>> print ReadOnlyFile < RWFile
True
>>> print implements( open('bla'), ReadOnlyFile )
True
>>> print implements( open('bla'), Iterable )  # has an __iter__ method
True
>>> print implements( open('bla'), Iterable[int] )
False
>>> print implements( open('bla'), WriteOnlyFile )  # default is 'r'
False
>>> print implements( open('bla'), RWFile )
False
>>> print implements( open('bla', 'w+'), RWFile )
True

Baker – Expose Python to the Shell

February 17, 2010

It’s been a long time since my last post, and it would be appropriate that I post about whatever it is that I’ve been working on. But I won’t. I’m writing this post only to tell you about an interesting new python library I stumbled upon.

Baker, in their own words, “lets you easily add a command line interface”.

In other words, it lets you expose your python utility functions to your favorite shell.

The only requirements are that:

  • Your function must accept string arguments (an exception: it accepts ints/floats if you provide a default argument)
  • Your function must print its output to stdout

Okay, so these are a little limiting. But the interesting part about Baker is not its implementation (which is still a bit clunky and basic), but rather its concept. Here’s a small piece of code I wrote:

import baker

@baker.command
def substr(text, start, end, step=1):
    print text[int(start):int(end):step]

if __name__ == '__main__':
    baker.run()

And here’s how I use it from the command line:

> baketest.py substr "Hello World!" 1 10
ello Worl

> baketest.py substr "Hello World!" 1 10 --step 2
el ol

The simplicity and intuitiveness of this interface really appealed to me. Hopefully this will catch on, and we’ll see more python scripts providing a command-line interface, just because it’s so easy.


Lazier Copy-On-Write

June 21, 2009

Copy-on-write (COW) is a popular mechanism of lazy evaluation, that helps improve running speed and reduce memory requirements by transparently delaying the copying of data. Essentially, this is how it works: When you try to copy an object, you are instead given a fake object. Attempts to read that new object will instead read from the original object. Writing to the object will create a new copy (as originally requested) and refer to it in the future for reads and writes.

It allows one to write simpler and safer programs. For instance, a programmer can pass “copies” of his data around with minimal performance impact, without worrying about others changing the original.
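Here’s a toy sketch of the mechanism in Python (my own illustration, not modeled on any real implementation):

class COWList(object):
    '''A minimal copy-on-write wrapper around a list.'''
    def __init__(self, data, shared=False):
        self._data = data
        self._shared = shared       # is the underlying list shared?

    def copy(self):
        # "Copying" is free: both objects now share the same list.
        self._shared = True
        return COWList(self._data, shared=True)

    def __getitem__(self, i):
        return self._data[i]        # reads go to the shared list

    def __setitem__(self, i, value):
        if self._shared:
            self._data = list(self._data)   # the real copy, on first write
            self._shared = False
        self._data[i] = value

Reads never pay for the copy; only the first write after a copy() does.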

It’s great, but can it be more?

I present here two proposals for extending this idea for even greater optimization. They are far from my areas of expertise, so I hope they still make sense.

1. Copy-On-Write + Fragmentation

Fragmentation of data is a mechanism that allows different parts (blocks) of data to reside in different locations in memory, while appearing intact (that is, sequential). This mechanism is often accused of slowing the computer down. However, it’s a critical feature of virtual memory, which is the basis of all modern operating systems.

Introducing fragmentation into your data structures has many benefits, but let’s discuss the benefits regarding COW, which may be obvious by now: On write, you don’t have to copy all of the data, just the blocks which are being changed. This can be a big difference, if you’re changing a few bytes in a 50mb string.

You still have to copy the meta-data (such as, where are the blocks, what is the next block, etc.), but that’s a small price to pay, and a reasonable design requirement.

Now instead of copying the entire data, you copy only a fragment of it. How big is that fragment? Perhaps a fixed size, such as 64k. But assuming you have no real restriction on the size of these data blocks, the next logical step, in my eyes, is to ask: Why not make it as small as possible? That is, why not intentionally fragment the block into three smaller blocks: Before the area that is to be written, the area that is to be written, and after the area to be written. At this point we continue as we originally planned: We copy only the block which is to be written, which is, of course, exactly as small as it can be.
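In code, the three-way split might look like this (a sketch; the names and offsets are made up):

def split_for_write(block, start, end):
    # Split one block into (before, target, after), so that only
    # 'target' has to be copied when the write happens.
    # (A real implementation would keep references into the original
    # buffer; Python slices actually copy, which defeats the purpose.)
    return block[:start], block[start:end], block[end:]

before, target, after = split_for_write('x' * 1000, 500, 504)
fragments = [before, 'ABCD', after]   # only a 4-byte block was replaced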

Eventually, we have a model in which writing n bytes into a COWed data of m bytes takes O(n) time and memory, instead of the original O(m+n) time and O(m) memory. I argue that in the common case, n is significantly smaller than m, and so the win is big.

Of course, fragmentation has a habit of slowing down reading times. When fragmentation gets “too high”, it is possible to defragment the memory (an occasional O(m) process). The optimal balance of fragmentation depends heavily on the frequency of reads vs. writes, but I argue that even a sub-optimal, common-case balance will produce an improvement in performance.

Edit: I’ve been unclear about how this affects look-ups. Fragmentation into blocks of fixed size remains O(1) for getting and setting items. However, for variable-size blocks it’s not so simple. A search tree can achieve a look-up of O(log n), where n is the number of fragments, which is a lot slower than the original array performance. It is probably only a good idea if you have access to the operating system’s memory management, or if look-ups are rare (and then an occasional defragmentation would still be necessary). Still, fixed-size fragments are good enough, and they can be dynamically resized at little cost, as long as the resize is uniform.

2. Copy-On-Write-And-ReaD

Or in short COWARD, a mechanism to delay copying even further: until after the written data is also read. That is, when the programmer requests to write data, this mechanism will instead journal the data, producing a sort of “diff”. Only when the programmer attempts to read the result is the original data copied and the diff applied. A diff structure is provided by any implementation of lazy evaluation, by definition, but perhaps there are other diff structures more suitable for this purpose.

This starts to make more sense with fragmentation: Then the diff can be applied only to the block that is read. And so, a block will be copied only if it is both written and read. In some cases, there may be very little intersection between the two (and so, very little copying).

So basically, COWARD is just a (non-)fancy name for an array of promises (not to be confused with a field of dreams). The (possible) novelty is in the way this array is created and used: transparently, and relatively efficiently. Note that, like the previous proposal, it provides little value in situations where all of the data is altered or read. However, I argue it will significantly improve performance in cases where only part of the data is read and written.

It can, for instance, be useful in cases where an algorithm works on COWed data (which happens quite often) and does more processing than the user requires. Using this method, only the blocks that the user requests are copied (and, if the calculations themselves are lazy, processed). And all of it is transparent to both the user and the implementer of the algorithm.

Here’s to lazier COWs!


PySnippets – improving code reuse

June 2, 2009

For a long time now, I’ve been hindered by the issue of utilities, or snippets. These are convenience functions and classes that are too small or too incomplete to justify a library, yet are useful enough to be used.
I’ve posted a few on my blog: Namespaces, X and now FileDict. Others I didn’t post, and include a priority queue, an A* implementation, a lazy-list, an LRU memoizer, etc. You probably have a few of those. I know because I see them on snippet sites.

However, I rarely actually use these in my code. I really want to. But once your code spans more than one file, you usually need to make a proper installation, or at least trouble your “users” a bit more. Usually saving a few lines just isn’t worth the trouble. Copy-pasting the snippet into the file is sometimes the solution, but it really pains me that I’ll have to re-paste it every time I improve my snippet.

I’m sure some of you looked at my snippets, or other people’s, thought “cool”, but never used them, simply because it was too much trouble.

Paradoxically, this is especially true when writing snippets. They are just one small file, and using another snippet would probably make them too hard to distribute. This is a vicious circle, forever limiting our snippets to a low level of sophistication, and discouraging re-use.

I want to break that circle. I want to create an economy of snippets, increasingly building on each other, eventually creating a “standard library” of their own. But how would one do that? I have one suggestion, along with a proof-of-concept, which I will present here.

PySnippets

PySnippets is my attempt at making snippets usable. It consists of two parts: a server and a client.

  1. Server - A website for uploading snippets. Simple enough. You can rate them, tag them, discuss them, offer some documentation and, of course, post newer versions.
  2. Client - A python module that automagically imports snippets from the web. Essentially, it downloads the snippets you request to a cache, imports them if they’re already there, and periodically searches for updates.

The server is structured in a predictable way, so that the client knows how to fetch a snippet just by its name.

The Client

Here’s a usage example with my current client implementation, which I creatively call “snippets”:

import snippets
antigravity = snippets.get('antigravity')  # "snippet-import"
antigravity.start(mode='xkcd')

Easy as that!

The snippets.get function looks for the module in the local snippets-cache. If it’s there, get just imports it and returns the module. If it’s not, it queries the server for a snippet called “antigravity” (names are unique), stores it in the cache, and then imports it. All the user notices is a 2-second pause the first time he ever imports that snippet, and nothing else from then on.
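A minimal sketch of how such a client might work (the URL scheme and cache location here are invented for illustration):

import os, imp, urllib2

CACHE = os.path.expanduser('~/.pysnippets')
SERVER_URL = 'http://pysnippets.example.com/snippets/%s.py'  # hypothetical

def get(name):
    path = os.path.join(CACHE, name + '.py')
    if not os.path.exists(path):            # not cached yet: fetch it
        code = urllib2.urlopen(SERVER_URL % name).read()
        if not os.path.isdir(CACHE):
            os.makedirs(CACHE)
        with open(path, 'w') as f:
            f.write(code)
    return imp.load_source(name, path)      # import from the cache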

You can specify to download a specific version, like this:

filedict = snippets.get('filedict', version=0.1)

Auto-Updating Snippets

The current implementation also includes an “auto-update” feature: periodically, before importing a module, the client checks the server for a newer version of it. If a newer version exists, it downloads it to the cache and continues with the import.

Auto-updates can be disabled in a parameter to get.

The Server

The server is yet another service to upload snippets, however it has a slightly unusual design (which no other snippet site I know of has):

  • A URL to a snippet is easy to deduce given its name.
  • There is a conscious (though simple) support for versions.
  • To increase reliability and trust (more on that later), uploaded snippets cannot be altered (but a new version can be issued)

Since I know very little about administration and server-maintenance, I chose wikidot.com to host my POC web-site. They have an elaborate support for permissions and most of the features I need, such as the ability to rate, tag and discuss snippets.

Trust

Perhaps the biggest issue with such a system is trust. Since you’re running code which resides online, you have to trust me not to maliciously alter the snippets, and also you have to trust the author of the snippet not to do so.

As a partial solution, uploaded files cannot be altered: not edited, nor renamed, nor deleted, etc. So if you specify a particular snippet version, it is guaranteed never to change (I may commit changes by request, but I will audit them myself).
If you decide to use the latest version of a snippet (that is, not specify a version), please make sure you trust its author.

Perhaps higher-ups in the Python community would like to take some sponsorship of the project, removing the remaining trust-issues with the administrator (that’s me).

Implications

  • To distribute your snippets, all you need is for the receiver to have an internet connection, and the snippets client.
  • If you’re sending someone code, you can attach the client (it’s rather small, too), and just import away. The receiver will benefit from improvements and bugfixes to your snippets.
  • You can use other people’s snippets just as easily, as long as you trust them.
  • Snippets can now build on each other without worrying too much.

What if my user is offline?

Then probably PySnippets isn’t for him.

However, I do have some ideas, and might implement them if there is sufficient demand.

Afterword

PySnippets is my humble attempt at solving the utility/snippet reuse problems. I hope you like it and find it useful.

Please try it!


FileDict – bug-fixes and updates

May 31, 2009

In my previous post I introduced FileDict. I did my best to get it right the first time, but as we all know, this is impossible for any non-trivial piece of code.
I want to thank everyone for their comments and remarks. It’s been very helpful.

The Unreliable Pickle

A special thanks goes to the mysterious commenter “R”, for pointing out that pickling identical objects may produce different strings (!), which are therefore inadequate to be used as keys. My FileDict indeed suffered from this bug, as this example shows:

>>> key = (1, u'foo')
>>> d[(1, u'foo')] = 4
>>> d[(1, u'foo')]
4
>>> d[key]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "filedict.py", line 64, in __getitem__
    raise KeyError(key)
KeyError: (1, u'foo')

And if that’s not bad enough:

>>> d[key] = 5
>>> list(d.items())
[['a', 3], [(1, 2), 3], [(1, u'foo'), 4], [(1, u'foo'), 5]]

Ouch.
I’ve rewritten the entire storing mechanism to look keys up by their hash alone, and to compare the keys themselves after unpickling. This may be a bit slower, but I don’t (and shouldn’t) expect many colliding hashes anyway.
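In sketch form, the new lookup is something like this (my reconstruction, not FileDict’s actual code; the table layout is a guess):

import sqlite3
import cPickle as pickle

def lookup(conn, key):
    # Select candidate rows by the key's hash, then resolve collisions
    # by unpickling the stored keys and comparing the objects themselves.
    rows = conn.execute('SELECT key, value FROM d WHERE hash=?',
                        (hash(key),))
    for pickled_key, pickled_value in rows:
        if pickle.loads(str(pickled_key)) == key:
            return pickle.loads(str(pickled_value))
    raise KeyError(key)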
Bug is fixed.

DictMixin

By popular demand, I’m now inheriting from DictMixin. It’s made my code a bit shorter, and was not at all painful.

Copy and Close

I no longer close the database on __del__; instead I rely on the garbage collector. It seems to close the database on time, and it allows one to copy the dictionary (which, of course, will always have the same keys, but doesn’t have to have the same behavior or attributes).

New Source Code

Is available here

