How To Write A Calculator in 70 Python Lines, By Writing a Recursive-Descent Parser

February 24, 2013

Three months ago, I wrote a post detailing the process of writing a calculator using a parsing library. The popular response, however, was that readers are far more curious about seeing a calculator written from scratch, with the batteries included but nothing else. I figured, why not?

Writing a calculator is simple, if you use hacks specific to arithmetic expressions, but the effect of hacks is nearly always the same: the solution isn’t elegant, it’s not extensible, and it’s hard to understand intuitively. Since I appreciate a good challenge, and I want this post to be useful, I decided to write it using a mostly generic recursive-descent parser. In the same spirit as last time, I wanted to do it in as few lines as I reasonably could, so it’s filled with hacks and tricks, but they’re superficial and not specific to the task at hand.

This post is a detailed, step-by-step explanation of my implementation. If you want to jump straight to the code and figure it out by yourself, just scroll to the end of this post. Hopefully when you’re done you’ll have a better understanding of how parsing works internally, and you’ll be inspired to use a proper parsing library to avoid this entire bloody mess.

To understand this post, you should have a strong grasp of Python, and it helps to know what parsing is and what it’s for. If you’re not sure, I recommend that you read my previous post, in which I thoroughly explain the grammar that I will be using here.

Step 1: Tokenize

The first step of processing the expression is to turn it into a list of individual symbols. This is the easiest part, and not the point of this exercise, so I allowed myself to cheat here quite a lot.

First, I defined the tokens (Numbers are notably absent; they’re the default) and a Token type:

import re                               # the imports the snippets below need
from collections import namedtuple

token_map = {'+':'ADD', '-':'ADD',
             '*':'MUL', '/':'MUL',
             '(':'LPAR', ')':'RPAR'}

Token = namedtuple('Token', ['name', 'value'])

And here’s the code I used to tokenize an expression `expr`:

split_expr = re.findall('[\d.]+|[%s]' % ''.join(token_map), expr)
tokens = [Token(token_map.get(x, 'NUM'), x) for x in split_expr]

The first line is a trick that splits the expression into the basic tokens, so

'1.2 / ( 11+3)' --> ['1.2', '/', '(', '11', '+', '3', ')']

The next line names the tokens, so that the parser can recognize them by category:

['1.2', '/', '(', '11', '+', '3', ')']
->
[Token(name='NUM', value='1.2'), Token(name='MUL', value='/'), Token(name='LPAR', value='('), Token(name='NUM', value='11'), Token(name='ADD', value='+'), Token(name='NUM', value='3'), Token(name='RPAR', value=')')]

Any token that is not in the token_map is assumed to be a number. Our tokenizer does no real validation, so non-numbers can slip through as NUM tokens, but luckily the evaluator will catch them later on.
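
For example, a malformed number like 1..2 tokenizes happily as a NUM and only blows up in step 5, when the evaluator calls float() on it. A quick illustration in the interpreter (Python 2):

>>> float('1..2')
Traceback (most recent call last):
  ...
ValueError: invalid literal for float(): 1..2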

That’s it. Now that we have a list of tokens, our next step is to parse it into an AST.

Step 2: Define the grammar

The parser I chose to implement is a naive recursive descent parser, which is a simpler version of LL parsing. It’s the simplest parser to implement, and in fact mine takes only 14 lines. It’s a kind of top-down parser, which means that it starts by matching the highest rule (like: expression), and recursively tries to match its sub-rules until it matches the lowest rules (like: number). To put it another way, while a bottom-up (LR) parser will gradually fold tokens and rules into other rules, until there’s only one rule left, a top-down (LL) parser like ours will gradually expand the rules into less abstract rules, until they completely match the input-tokens.
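
To make the top-down direction concrete, here is roughly how our parser will expand its rules for the tiny input 1+2, using the grammar we’re about to define (a sketch, not actual output):

    add
    -> mul ADD add        (first expansion of add)
    -> atom ADD add       (mul falls back to its 'atom' alternative)
    -> NUM ADD add        (atom matches the token NUM '1')
    -> NUM ADD mul        (ADD '+' is consumed; add falls back to 'mul')
    -> NUM ADD atom
    -> NUM ADD NUM        (atom matches the token NUM '2'; all input consumed)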

Before we get to the actual parser, let’s talk about the grammar. In my previous post, I used an LR parser, and I defined the calculator grammar like this (caps are tokens):

add: add ADD mul | mul;
mul: mul MUL atom | atom;
atom: NUM | '(' add ')' | neg;
neg: '-' atom;

(If you don’t understand this grammar, you should read my previous post)

This time I’m using an LL parser, instead of LR, and here’s how I defined the grammar:

rule_map = {
    'add' : ['mul ADD add', 'mul'],
    'mul' : ['atom MUL mul', 'atom'],
    'atom': ['NUM', 'LPAR add RPAR', 'neg'],
    'neg' : ['ADD atom'],
}

There is a subtle change here. The recursive definitions of add and mul are reversed. This is a very important detail, and I need to explain it.

The LR version of this grammar uses something called left-recursion. When LL parsers see recursion, they just dive in there in an attempt to match the rule. So when faced with left-recursion, they enter infinite recursion. Even smart LL-parsers such as ANTLR suffer from this issue, though they will probably emit a friendly error instead of looping infinitely like our toy parser would.

Left-recursion is easily solved by changing it to right-recursion, and that is what I did. But because nothing is easy with parsers, it created another problem: while left-recursion parses 3-2-1 correctly as (3-2)-1, right-recursion parses it incorrectly as 3-(2-1). I don’t know of an easy solution to this problem, so to keep things short and simple for you and me both, I decided to keep the incorrect form and deal with it in post-processing (see step 4).

Step 3: Parse into an AST

The algorithm is simple. We’re going to define a recursive function that receives two parameters: The first is the name of the rule that we’re trying to match, and the second is the list of tokens we have left. We’ll start with add (which is the highest rule) and with the entire list of tokens, and have the recursive calls become increasingly more specific. The function returns a tuple: The current match, and a list of the tokens that are left to match. For the purpose of short code, we’ll make it capable of also matching tokens (they’re both strings; one is UPPER-CASE and the other lower-case).

Here’s the code for the parser:

RuleMatch = namedtuple('RuleMatch', ['name', 'matched'])

def match(rule_name, tokens):
    if tokens and rule_name == tokens[0].name:      # Match a token?
        return RuleMatch(tokens[0], tokens[1:])
    for expansion in rule_map.get(rule_name, ()):   # Match a rule?
        remaining_tokens = tokens
        matched_subrules = []
        for subrule in expansion.split():
            matched, remaining_tokens = match(subrule, remaining_tokens)
            if not matched:
                break   # no such luck. next expansion!
            matched_subrules.append(matched)
        else:
            return RuleMatch(rule_name, matched_subrules), remaining_tokens
    return None, None   # match not found

The first check in match asks whether rule_name is actually a token, and whether it matches the current token. If it does, the function returns the match, along with the tokens that are still left to consume.

The for loop iterates over the sub-rules of rule_name, so each can be matched recursively. If rule_name is a token, the get() call returns an empty tuple and the flow falls through to the final return of (None, None).

The inner loop iterates over every element of the current sub-rule, and tries to match them sequentially. Each iteration tries to consume as many matching tokens as possible. If one element did not match, we discard the entire sub-rule. However, if all elements matched, we reach the else clause and return our match for rule_name, with the remaining tokens to match.

Let’s run it and see what we get for 1.2 / ( 11+3).

>>> tokens = [Token(name='NUM', value='1.2'), Token(name='MUL', value='/'), Token(name='LPAR', value='('), Token(name='NUM', value='11'), Token(name='ADD', value='+'), Token(name='NUM', value='3'), Token(name='RPAR', value=')')]

>>> match('add', tokens)

(RuleMatch(name='add', matched=[RuleMatch(name='mul', matched=[RuleMatch(name='atom', matched=[Token(name='NUM', value='1.2')]), Token(name='MUL', value='/'), RuleMatch(name='mul', matched=[RuleMatch(name='atom', matched=[Token(name='LPAR', value='('), RuleMatch(name='add', matched=[RuleMatch(name='mul', matched=[RuleMatch(name='atom', matched=[Token(name='NUM', value='11')])]), Token(name='ADD', value='+'), RuleMatch(name='add', matched=[RuleMatch(name='mul', matched=[RuleMatch(name='atom', matched=[Token(name='NUM', value='3')])])])]), Token(name='RPAR', value=')')])])])]), [])

The result is a tuple, of course, and we can see there are no remaining tokens. The actual match is not easy to read, so let me draw it for you:

    add
        mul
            atom
                NUM '1.2'
            MUL '/'
            mul
                atom
                    LPAR    '('
                    add
                        mul
                            atom
                                NUM '11'
                        ADD '+'
                        add
                            mul
                                atom
                                    NUM '3'
                    RPAR    ')'

This is what the AST looks like, in concept. It’s good practice to run the parser in your mind, or on a piece of paper. I dare say it’s necessary to do so if you want to grok it. You can use this AST as a reference to make sure you got it right.

So far we’ve written a parser capable of correctly parsing binary operations, unary operations, brackets and precedence.
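
For instance, here is (sketched in the same style) the tree it builds for the input -3; the neg rule kicks in through atom:

    add
        mul
            atom
                neg
                    ADD '-'
                    atom
                        NUM '3'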

There’s only one thing it does incorrectly, and we’re going to fix it in the next step.

Step 4: Post Processing

My parser is not perfect in many ways. The important one is that it cannot handle left-recursion, which forced me to write the grammar as right-recursive. As a result, parsing 8/4/2 results in the following AST:

    add
        mul
            atom
                NUM 8
            MUL '/'
            mul
                atom
                    NUM 4
                MUL '/'
                mul
                    atom
                        NUM 2

If we try to solve the expression using this AST, we’ll have to calculate 4/2 first, which is wrong. Some LL-parsers choose to fix the associativity in the tree. That takes too many lines ;). Instead, we’re going to flatten it. The algorithm is simple: For each rule in the AST that 1) needs fixing, and 2) is a binary operation (has three sub-rules), and 3) whose right-hand operand is the same rule: flatten the latter into the former. By “flatten”, I mean replace a node with its children, in the context of its parent. Since our traversal is DFS post-order, meaning it starts at the leaves of the tree and works its way up to the root, the effect accumulates. Here’s the code:

    fix_assoc_rules = 'add', 'mul'

    def _recurse_tree(tree, func):
        return map(func, tree.matched) if tree.name in rule_map else tree[1]

    def flatten_right_associativity(tree):
        new = _recurse_tree(tree, flatten_right_associativity)
        if tree.name in fix_assoc_rules and len(new)==3 and new[2].name==tree.name:
            new[-1:] = new[-1].matched
        return RuleMatch(tree.name, new)

This code will turn any structural sequence of additions or multiplications into a flat list (without mixing the two). Parentheses break the sequence, of course, so they won’t be affected.
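
For example, after flattening, the 8/4/2 tree from above should look roughly like this (the add and atom wrappers stay; only the nested muls are folded into their parent):

    add
        mul
            atom
                NUM 8
            MUL '/'
            atom
                NUM 4
            MUL '/'
            atom
                NUM 2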

From this point I could re-build the structure as left-associative, using code such as

    def build_left_associativity(tree):
        new_nodes = _recurse_tree(tree, build_left_associativity)
        if tree.name in fix_assoc_rules:
            while len(new_nodes)>3:
                new_nodes[:3] = [RuleMatch(tree.name, new_nodes[:3])]
        return RuleMatch(tree.name, new_nodes)

But I won’t. I’m pressed for lines of code, and changing the evaluation code to handle lists takes a lot less lines than rebuilding the tree.
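
(For the record, here is roughly what build_left_associativity would do to the flattened mul from above: repeatedly fold the three leftmost items into a new node, until only one binary operation is left.)

    mul: [8, '/', 4, '/', 2]
     ->  [ mul:[8, '/', 4], '/', 2 ]    # length is now 3, so the while loop stops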

Step 5: Evaluate

Evaluating the tree is very simple. All that’s required is to traverse the tree in a similar fashion to the post-processing code (namely DFS post-order), and to evaluate each rule in it. At the point of evaluation, because we recurse first, each rule should be made of nothing more than numbers and operations. Here’s the code:

    bin_calc_map = {'*':mul, '/':div, '+':add, '-':sub}
    def calc_binary(x):
        while len(x) > 1:
            x[:3] = [ bin_calc_map[x[1]](x[0], x[2]) ]
        return x[0]

    calc_map = {
        'NUM' : float,
        'atom': lambda x: x[len(x)!=1],
        'neg' : lambda (op,num): (num,-num)[op=='-'],
        'mul' : calc_binary,
        'add' : calc_binary,
    }

    def evaluate(tree):
        solutions = _recurse_tree(tree, evaluate)
        return calc_map.get(tree.name, lambda x:x)(solutions)

I wrote calc_binary to evaluate both addition and multiplication (and their counterparts, subtraction and division). It evaluates lists of either, in a left-associative fashion, thus bringing our little LL-grammar annoyance to a conclusion.
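
Here’s a quick trace (hypothetical, with the sub-rules already evaluated to floats) of calc_binary chewing through the flattened 8/4/2:

    x = [8.0, '/', 4.0, '/', 2.0]
    # iteration 1: x[:3] = [div(8.0, 4.0)]  ->  x == [2.0, '/', 2.0]
    # iteration 2: x[:3] = [div(2.0, 2.0)]  ->  x == [1.0]
    # len(x) == 1, so the loop stops and 1.0 is returned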

Step 6: The REPL

The plainest REPL possible:

    if __name__ == '__main__':
        while True:
            print( calc(raw_input('> ')) )

Please don’t make me explain it 🙂
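
A session might look something like this (a hypothetical run; note that division and subtraction come out left-associative, thanks to step 4):

    > 8/4/2
    1.0
    > 3-2-1
    0.0
    > 2*(3+4)
    14.0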

Appendix: Tying it all together: A calculator in 70 lines

    '''A Calculator Implemented With A Top-Down, Recursive-Descent Parser'''
    # Author: Erez Shinan, Dec 2012

    import re, collections
    from operator import add,sub,mul,div

    Token = collections.namedtuple('Token', ['name', 'value'])
    RuleMatch = collections.namedtuple('RuleMatch', ['name', 'matched'])

    token_map = {'+':'ADD', '-':'ADD', '*':'MUL', '/':'MUL', '(':'LPAR', ')':'RPAR'}
    rule_map = {
        'add' : ['mul ADD add', 'mul'],
        'mul' : ['atom MUL mul', 'atom'],
        'atom': ['NUM', 'LPAR add RPAR', 'neg'],
        'neg' : ['ADD atom'],
    }
    fix_assoc_rules = 'add', 'mul'

    bin_calc_map = {'*':mul, '/':div, '+':add, '-':sub}
    def calc_binary(x):
        while len(x) > 1:
            x[:3] = [ bin_calc_map[x[1]](x[0], x[2]) ]
        return x[0]

    calc_map = {
        'NUM' : float,
        'atom': lambda x: x[len(x)!=1],
        'neg' : lambda (op,num): (num,-num)[op=='-'],
        'mul' : calc_binary,
        'add' : calc_binary,
    }

    def match(rule_name, tokens):
        if tokens and rule_name == tokens[0].name:      # Match a token?
            return tokens[0], tokens[1:]
        for expansion in rule_map.get(rule_name, ()):   # Match a rule?
            remaining_tokens = tokens
            matched_subrules = []
            for subrule in expansion.split():
                matched, remaining_tokens = match(subrule, remaining_tokens)
                if not matched:
                    break   # no such luck. next expansion!
                matched_subrules.append(matched)
            else:
                return RuleMatch(rule_name, matched_subrules), remaining_tokens
        return None, None   # match not found

    def _recurse_tree(tree, func):
        return map(func, tree.matched) if tree.name in rule_map else tree[1]

    def flatten_right_associativity(tree):
        new = _recurse_tree(tree, flatten_right_associativity)
        if tree.name in fix_assoc_rules and len(new)==3 and new[2].name==tree.name:
            new[-1:] = new[-1].matched
        return RuleMatch(tree.name, new)

    def evaluate(tree):
        solutions = _recurse_tree(tree, evaluate)
        return calc_map.get(tree.name, lambda x:x)(solutions)

    def calc(expr):
        split_expr = re.findall('[\d.]+|[%s]' % ''.join(token_map), expr)
        tokens = [Token(token_map.get(x, 'NUM'), x) for x in split_expr]
        tree = match('add', tokens)[0]
        tree = flatten_right_associativity( tree )
        return evaluate(tree)

    if __name__ == '__main__':
        while True:
            print( calc(raw_input('> ')) )

How To Write A Calculator in 50 Python Lines (Without Eval)

November 18, 2012

Introduction

In this post I will demonstrate how to parse and calculate an arithmetic expression using a general-purpose parser. When we’re done, we’ll be able to handle expressions such as 1 + 2 * -(-3+2)/5.6 + 3, and hopefully you’ll have gained the tools to handle much more.

My motivation is to provide a simple and fun lesson in parsing and formal grammars, as well as to showcase PlyPlus, a parser interface I’ve been working on, on and off, for the past few years. As a bonus, the result is a safe arithmetic alternative to eval().

If you want to follow the examples on your computer, you can install PlyPlus with pip install plyplus.

Working knowledge of Python is required for the implementation section.

Grammars

For those of you who don’t know how parsing and formal grammars work, here is a quick overview: Formal grammars are a hierarchy of rules for parsing text. Each rule matches some portion of the input text by describing the sub-rules that it’s made of.

Here is an example of a pseudo-grammar for parsing 1 + 2 + 3 + 4:

Rule #1 - add  IS MADE OF  add + number 
                       OR  number + number


Or in EBNF:

add: add '+' number
   | number '+' number
   ;

Each pass of the parser will look for either add+number or number+number, and if it finds one, convert it to add. Basically, every parser aims to climb the hierarchy as much as possible.

Here are the steps the parser will take:

  1. number + number + number + number
    first pass turns all numbers into a ‘number’ rule
  2. [number + number] + number + number
    the parser found its first pattern!
  3. [add + number] + number
    after converting the pattern, it finds the next one
  4. [add + number]
  5. add

The sequence of symbols has turned into a hierarchy of two simple rules: number+number and add+number, and if we tell the computer how to solve each of them, it can solve the entire expression for us. In fact, it can solve any sequence of additions, no matter how long! That is the strength of formal grammars.

Operator Precedence

Arithmetic expressions are not just a linear progression of symbols. Their operators create an implicit hierarchy, which makes them the perfect target for formal grammars:

1 + 2 * 3 / 4 - 5 + 6

Is equivalent to:

1 + (2 * 3 / 4) - 5 + 6

We can express this structure in the grammar by nesting the rules:

add: add '+' mul
   | mul '+' mul
   ;
mul: mul '*' number
   | number '*' number
   ;

By telling add that it operates on mul, and not on numbers, we are giving multiplication higher precedence.
Let’s pretend-run this grammar on 1 + 2 * 3 * 4 with our magical parser that is in my head:

  1. number + number * number * number
  2. number + [number * number] * number
    the parser doesn’t know what a number+number is, so this is its next pick
  3. number + [mul * number]
  4. number + mul
  5. ???

Now we are in a bit of a pickle! The parser doesn’t know what to do with number+mul. We can tell it, but if we keep looking we’ll find that there are many possibilities we didn’t cover, such as mul+number, add+number, add+add, etc.

So what do we do?

Luckily, we have a trick up our sleeve: We can say that a number by itself is a multiplication, and a multiplication by itself is an addition!

This method might seem strange at first, but it makes total sense:

add: add '+' mul
   | mul '+' mul
   | mul
   ;
mul: mul '*' number
   | number '*' number
   | number
   ;

But if mul can become add, and number can become mul, we have extra lines that do nothing. Removing them, we get:

add: add '+' mul
   | mul
   ;
mul: mul '*' number
   | number
   ;

Let’s pretend-run on 1 + 2 * 3 * 4 again with this new grammar:

  1. number + number * number * number
    There’s no rule for number*number now, but the parser can “get creative”

  2. number + [number] * number * number
  3. number + [mul * number] * number
  4. number + [mul * number]
  5. [number] + mul
  6. [mul] + mul
  7. [add] + mul
  8. [add + mul]
  9. add

Success!!!

If this looks like magic to you, try pretend-running on different arithmetic expressions, and see how the expression resolves itself in the correct order every time. Or wait for the next section and let the computer run it for you!

Running the parser

By now we have a fairly good idea of how we want our grammar to work. Let’s apply it and write an actual grammar:

start: add;             // This is the top of the hierarchy
add: add add_symbol mul | mul;
mul: mul mul_symbol number | number;
number: '[\d.]+';       // Regular expression for a decimal number
mul_symbol: '\*' | '/'; // Match * or /
add_symbol: '\+' | '-'; // Match + or -

You might want to brush up on regular expressions a bit, but otherwise this grammar is pretty straight-forward. Let’s run it on an expression!

>>> from plyplus import Grammar
>>> g = Grammar("""...""")
>>> print g.parse('1+2*3-5').pretty()
start
  add
    add
      add
        mul
          number
            1
      add_symbol
        +
      mul
        mul
          number
            2
        mul_symbol
          *
        number
          3
    add_symbol
      -
    mul
      number
        5

So pretty!

Study the tree and see what hierarchy the parser chose.

If you want to play with the parser, and feed it expressions by yourself, you can! All you need is Python. Run pip install plyplus and paste the above commands inside python (make sure to put the actual grammar instead of ‘…’ 😉 ).
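
For copy-paste convenience, here is the whole thing assembled into one snippet (a sketch, assuming PlyPlus is installed; the variable name calc_grammar is just my choice):

from plyplus import Grammar

calc_grammar = Grammar(r"""
start: add;             // This is the top of the hierarchy
add: add add_symbol mul | mul;
mul: mul mul_symbol number | number;
number: '[\d.]+';       // Regular expression for a decimal number
mul_symbol: '\*' | '/'; // Match * or /
add_symbol: '\+' | '-'; // Match + or -
""")

print calc_grammar.parse('1+2*3-5').pretty()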

Shaping the tree

Plyplus automagically creates a tree, but it’s not ideal. While putting number inside mul and mul inside add was useful for creating a hierarchy, now that we already have the hierarchy they are just a burden. We can tell Plyplus to “expand” (i.e. remove) rules by prefixing their names. A @ will always expand a rule, a # will flatten it, and a ? will expand it if and only if it has one child. In this case, ? is what we want.

start: add;
?add: add add_symbol mul | mul;       // Expand add if it's just a mul
?mul: mul mul_symbol number | number; // Expand mul if it's just a number
number: '[\d.]+';
mul_symbol: '\*' | '/';
add_symbol: '\+' | '-';

Here’s how the tree looks with the new grammar:

>>> g = Grammar("""...""")
>>> print g.parse('1+2*3-5').pretty()
start
  add
    add
      number
        1
      add_symbol
        +
      mul
        number
          2
        mul_symbol
          *
        number
          3
    add_symbol
      -
    number
      5

Ooh, that is so much cleaner, and I dare say, quite optimal!

Parentheses and Other Features

We are missing some obvious features: parentheses, unary operators (-(1+2)), and the ability to put spaces inside the expression. These are all so easy to add at this point that it would be a shame not to.

The important concept is to add a new rule, which we’ll call atom. Everything inside the atom (namely parentheses and unary operations) happens before any additions or multiplications (aka binary operations). Since the atom is only a hierarchical construct with no semantic significance, we’ll make sure it’s always expanded, by adding @ to it.

The obvious way to allow spaces is with something like add: add SPACE add_symbol SPACE mul | mul;, but that’s tedious and unreadable. Instead, we will tell Plyplus to always ignore whitespace.

Here’s the final grammar, with all of these features:

start: add;
?add: (add add_symbol)? mul;
?mul: (mul mul_symbol)? atom;
@atom: neg | number | '\(' add '\)';
neg: '-' atom;
number: '[\d.]+';
mul_symbol: '\*' | '/';
add_symbol: '\+' | '-';
WHITESPACE: '[ \t]+' (%ignore);

Make sure you understand it, so we can proceed to the next step: Calculating!

Calculating!

We can already turn an expression into a hierarchical tree. To calculate it, all we need is to collapse the tree into a number. The way to do it is branch by branch.

This is the part where we start writing code, so I need to explain two things about the tree.

  1. Each branch is an instance with two attributes: head, which is the name of the rule (say, add or number), and tail, which is the list of sub-rules that it matched. (See the sketch just after this list.)
  2. By default, Plyplus removes unnecessary tokens. In our example, the ‘(’ and ‘)’ will already be removed from the tree, as well as neg’s ‘-’. Those of add and mul won’t be removed, because they have their own rule, so Plyplus knows they’re important. This feature can be turned off to keep all tokens in the tree, but in my experience it’s always more elegant to leave it on and change the grammar accordingly.
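
To make this concrete, here is a rough sketch (illustrative notation, not actual PlyPlus output) of the branch the final grammar builds for 2*3:

branch.head == 'mul'
branch.tail == [number_branch, mul_symbol_branch, number_branch]
branch.tail[0].head == 'number'
branch.tail[0].tail == ['2']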

Okay, we are ready to write some code! We will collapse the tree using a transformer, which is very simple. It traverses the tree, starting with the outermost branches, until it reaches the root. It’s your job to tell it how to collapse each branch. If you do it right, you will always run on an outermost branch, riding the wave of its collapse. Dramatic! Let’s see how it’s done.

import operator as op
from plyplus import STransformer

class Calc(STransformer):

    def _bin_operator(self, exp):
        arg1, operator_symbol, arg2 = exp.tail

        operator_func = { '+': op.add, 
                          '-': op.sub, 
                          '*': op.mul, 
                          '/': op.div }[operator_symbol]

        return operator_func(arg1, arg2)

    number      = lambda self, exp: float(exp.tail[0])
    neg         = lambda self, exp: -exp.tail[0]
    __default__ = lambda self, exp: exp.tail[0]

    add = _bin_operator
    mul = _bin_operator

Each method corresponds to a rule name. If a method doesn’t exist, __default__ is called. In our implementation, we left out start, add_symbol, and mul_symbol, all of which should do nothing but return their only sub-branch.

I use float() to parse numbers, because I’m lazy, but I could implement it using the parser as well.

I use the operator module for syntactic beauty. operator.add is basically ‘lambda x,y: x+y’, etc.

Alright, let’s run the transformer and see how it turns out.

>>> Calc().transform( g.parse('1 + 2 * -(-3+2) / 5.6 + 30'))
31.357142857142858

What does eval() think?

>>> eval('1 + 2 * -(-3+2) / 5.6 + 30')
31.357142857142858

Success!

Final Step: The REPL

For aesthetic reasons, let’s wrap it up in a nice calculator REPL:

def main():
    calc = Calc()
    while True:
        try:
            s = raw_input('> ')
        except EOFError:
            break
        if s == '':
            break
        tree = calc_grammar.parse(s)
        print calc.transform(tree)

You can see the full source code here: https://github.com/erezsh/plyplus/blob/master/examples/calc.py


Python Parsing Ep. II: Antlr Issues

July 17, 2008

Continuing that post.

Before I go on, I wish to make 2 corrections to the code provided in the last post:

  1. I think that the grammar for COMMENT is incorrect (in the original too). It requires comments to always end with a new-line. I changed that requirement to be optional, and now it seems more agreeable with end-of-files. For your consideration.
  2. The code for INDENT/DEDENT cannot work because ANTLR’s Emit cannot be called twice from the same symbol: Only the last call registers. That is fixable by overriding Emit and NextToken (I found this code somewhere on the internet, cannot remember where). And so, here is my code for the lexer:
    @lexer::members {
    int implicitLineJoiningLevel = 0;
    int startPos=0;
    System.Collections.Generic.List<int> IndentStack = new System.Collections.Generic.List<int>() { 0 };

    System.Collections.Generic.List<IToken> tokens = new System.Collections.Generic.List<IToken>();
    public override void Emit(IToken token) {
        this.token = token;
        token.Line = this.tokenStartLine;
        token.CharPositionInLine = this.tokenStartCharPositionInLine;
        tokens.Add(token);
    }
    public override IToken NextToken() {
        base.NextToken();
        if ( tokens.Count==0 ) {
            return Token.EOF_TOKEN;
        }
        Token t = (Token)tokens[0];
        tokens.RemoveAt(0);
        return t;
    }
    }

Now that we’re done with the lexing…

We have a lexer working, which inputs python code and outputs tokens. So now all we have to do is feed the formal grammar into ANTLR and our work is done, right?

Well, ANTLR’s tree API is very unfriendly; I spent hours figuring it out and I still don’t really get it. And since the parser just throws a sequence of tokens at you, I had no obvious way of constructing some hierarchy out of the code.
But even if I could somehow untangle the mysteries of ANTLR’s trees, I had a bigger problem which is only relevant if you want to display the code (which I did): the output from the parser pays no regard to the whitespace or comments that were originally there. I had to find each symbol’s start and end in the original text, and copy the text from there. But symbols didn’t end where they “ought to”: since comments were removed before reaching the parser, symbols didn’t include them even in their position markers, so:

class A:
    hello = "world"
    # just some class

… would leave the comment out of the scope of the class, which is bad. It would be a lot easier to just write an interpreter…

But I conjured a solution to both of my problems at once: I will insert markers — fake symbols — at the start and end of all definitions. I took the chance to insert other markers which would ease my life. So my class definition looked something like this:

classdef:
	{ParserEmit($classdef.start, $classdef.stop, ParserToken.TokenType.ClassBegin, "");}
	CLASS classname
	{ParserEmit($classname.start, $classname.stop, ParserToken.TokenType.Name, $classname.text);}
	(LPAREN testlist? RPAREN
		{ParserEmit($testlist.start, $testlist.stop, ParserToken.TokenType.Args, $testlist.text);}
	)? colon
	(suite {ParserEmit($suite.start, $suite.stop, ParserToken.TokenType.ClassEnd, "");}
	| NEWLINE
		{ParserEmit($colon.start, $colon.stop, ParserToken.TokenType.Error, "Empty Class Def");
		 ParserEmit($colon.start, $colon.stop, ParserToken.TokenType.ClassEnd, "");
		}
	)
	;

(If you are familiar with ANTLR you’ll notice that I’m using the ‘Emit’ function “incorrectly”. I will explain soon enough)

Observant readers might notice that I only delayed the problem – the resulting stream of fake symbols is itself a production of a formal grammar, and needs to be parsed in order to construct a hierarchy. True, but the result is a lot simpler than Python code; I am now dealing with symbols (=classes) and not with text (=strings).
However, this is not my only problem. As I said earlier, ‘Emit’ is only designed to be called once for each symbol. I didn’t like Emit very much anyway, and so I created my own emit function (I called it ‘ParserEmit’). It simply saves everything it receives in a list, which I can later iterate over in my code. Just for the record, here’s how it looks:

public System.Collections.Generic.List<ParserToken> parser_toks =
	new System.Collections.Generic.List<ParserToken>();

void ParserEmit(IToken start, IToken stop, ParserToken.TokenType t, string text)
{
	ParserToken tok = new ParserToken() { type=t, start=start, stop=stop, text=text };
	parser_toks.Add(tok);
}

(ParserToken is a class I made up. You can figure out what it looks like)

Now I had a list of symbols telling me exactly where each class/function starts/ends.
Next post will be Ep. 3 and last, where I will describe how I turned those symbols into a tree, rant about text-obsession in parsers, and provide my complete ANTLR grammar for Python (which just has too many changes/additions to describe each individually).


Python Parsing Ep. I: Lexing Python (with ANTLR)

July 12, 2008

For a while now, I’ve been writing a code editor for Python. I know there are quite enough of these, but I wanted something a bit different. I decided to write it in C#, and I’m glad I did – C#’s GUI is really easy and fun to write, probably the best GUI-writing experience I’ve had since VB.

So, at some point of the project I made the horrible mistake of attempting to parse Python. Now, had I been trying to parse C, or maybe Perl, or actually almost any other language (except for C++, which is a parser’s worst nightmare), it would’ve been a piece of cake. But no, I had to try Python. And so, it began.

The Parser
Unlike for some other languages, deciding on a parser generator for C# was easy: There’s ANTLR, and there’s the rest. No competition. (Later I would find out about Grammatica, which is pretty good. More on that some other time).

ANTLR’s website provides a variety of grammars for the lazy programmer, and among them a grammar for Python. Thinking that I evaded real work, I tried to compile it:

F:\antlr_play> Antlr3.Tool.exe Python.g
ANTLR Parser Generator  Version 3.0 (May 17, 2007)  1989-2007
warning(105): Python.g:289:17: no lexer rule corresponding to
                               token: INDENT
warning(105): Python.g:289:32: no lexer rule corresponding to
                               token: DEDENT
... (some more warnings about how the grammar is not
     deterministic) ...

Ouch. Looks like someone did not implement INDENT and DEDENT in the lexer.
Turns out that this grammar was written for Java, which has some kind of a special pre-processor (or post?) for Python (called PythonTokenStream.java) to produce the right indentation tokens.

C# does not have one.

Googling produced no results that could help me, and so I set out to write the lexer grammar myself.

Lexing Python
Now, a few things about lexing Python.

  1. Python’s indentation cannot be solved with a DFA. (I’m still perplexed at whether it can even be solved with a context-free grammar).
  2. PyPy produced an interesting post about lexing Python (they intend to solve it by post-processing the lexer output)
  3. CPython’s tokenizer is written in C. It’s ad-hoc, hand-written, and complex. It is the only official implementation of Python lexing that I know of.

But the best part is that I wasn’t aware of any of these points when I started writing my lexer grammar.

What I did find was the state-machine algorithm for the indentation, described in Python’s official documentation:

Before the first line of the file is read, a single zero is pushed on the stack; this will never be popped off again. The numbers pushed on the stack will always be strictly increasing from bottom to top. At the beginning of each logical line, the line’s indentation level is compared to the top of the stack. If it is equal, nothing happens. If it is larger, it is pushed on the stack, and one INDENT token is generated. If it is smaller, it must be one of the numbers occurring on the stack; all numbers on the stack that are larger are popped off, and for each number popped off a DEDENT token is generated. At the end of the file, a DEDENT token is generated for each number remaining on the stack that is larger than zero.

Or in Python pseudo-code:

stack = [0]
for line in code_lines:
    indent_level = get_indent_level( line )
    if stack[-1] < indent_level:
        stack.append( indent_level )
        emit_token( INDENT )
    else:
        while stack[-1] > indent_level:
            stack.pop()
            if stack[-1] < indent_level:
                raise IndentationError()
            emit_token( DEDENT )

    process_line( line )

for dummy in stack[1:]:
    emit_token( DEDENT )

Now, this is clean and pretty. That is because it's a post-processor, and is not part of the lexer. But when you deal with indentation while tokenizing, you have to take into account Python's quirks: line joinings, both explicit (backslash) and implicit (parentheses and friends). Also, indents and dedents are not counted for empty lines, so you have to take whitespace into account only if it is followed by text (using lookahead). And of course the whitespace must be thrown away after processing it (or your parser grammar will be awful).

So what person in his right mind wouldn't do it as post-processing? Well, me, and anyone else who doesn't know better. But, despite the complications, I managed to get the grammar to work, and I would like to share it with you. Only four disclaimers:
  1. It's written for C#. But with very little tweaking it should work for Java just the same (not like you need it).
  2. It's kinda ugly. Admittedly, it's not my best piece of code. But ugly code for an ugly task. Anyway, at least it's working. Oh, see 3.
  3. I've only tested it on a few Python files, and it's not impossible that it would fail on some eccentric piece of code.
  4. I had to paste it because WordPress doesn't seem to allow uploading non-media files, and it doesn't fit the width of the page. Just copy-paste it if you want to read it properly.

With that in mind, here's the relevant part changed in the lexer grammar (again, on top of this grammar: http://www.antlr.org/grammar/1200715779785/Python.g):



/** Treat a sequence of blank lines as a single blank line.  If
 *  nested within a (..), {..}, or [..], then ignore newlines.
 *  If the first newline starts in column one, they are to be ignored.
 *
 *  Frank Wierzbicki added: Also ignore FORMFEEDS (\u000C).
 */
NEWLINE
@init {
    int spaces = 0;
}
    :   ((('\u000C')?('\r')? '\n' ) | '\t' | ' ' )* (('\u000C')?('\r')? '\n' )
		leading_space = (' '  { spaces++; } | '\t' { spaces += 8; spaces -= (spaces % 8); } )*
        {
			if ( tokenStartCharPositionInLine==0 || implicitLineJoiningLevel>0 )
				//$channel=HIDDEN;
				Emit(new ClassicToken(NEWLINE,this.Text, HIDDEN));
			else {
				Emit(new ClassicToken(NEWLINE,this.Text));
				//newLine = true;
			}

			if (implicitLineJoiningLevel==0)
			{
				if ( spaces > IndentStack[IndentStack.Count - 1] ) {
					IndentStack.Add(spaces);
					Emit(new ClassicToken(INDENT,">"));

				}
				else
				{
					while (spaces < IndentStack[IndentStack.Count - 1]) {
						IndentStack.RemoveAt(IndentStack.Count - 1);
						Emit(new ClassicToken(DEDENT,"<"));
					}
				}

			}

        }
    ;

WS  :    {tokenStartCharPositionInLine>0}?=> (' '|'\t'|'\u000C')+ {$channel=HIDDEN;}
    ;

/** Grab everything before a real symbol.  Then if newline, kill it
 *  as this is a blank line.  If whitespace followed by comment, kill it
 *  as it's a comment on a line by itself.
 *
 *  Ignore leading whitespace when nested in [..], (..), {..}.
 */
LEADING_WS
@init {
    int spaces = 0;
}
    :   {tokenStartCharPositionInLine==0}?=>
        (   {implicitLineJoiningLevel>0}? ( ' ' | '\t' )+ {$channel=HIDDEN;}
           |    (     ' '  { spaces++; }
                |     '\t' { spaces += 8; spaces -= (spaces % 8); }
                )+
            {
            // make a string of n spaces where n is column number - 1
            char[] indentation = new char[spaces];
            for (int i=0; i<spaces; i++) {
                indentation[i] = ' ';
            }
            String s = new String(indentation);
            Emit(new ClassicToken(LEADING_WS,new String(indentation)));

            }
            // kill trailing newline if present and then ignore
            ( ('\r')? '\n' {if (token!=null) token.Channel = HIDDEN; else $channel=HIDDEN;})*
           // {token.Channel = 99; }
        )
    ;

/** Comments not on line by themselves are turned into newlines.

    b = a # end of line comment

    or

    a = [1, # weird
         2]

    This rule is invoked directly by nextToken when the comment is in
    first column or when comment is on end of nonwhitespace line.

    Only match \n here if we didn't start on left edge; let NEWLINE return that.
    Kill if newlines if we live on a line by ourselves

    Consume any leading whitespace if it starts on left edge.
 */
COMMENT
@init {
    $channel=HIDDEN;
}
    :    {tokenStartCharPositionInLine==0}?=> (' '|'\t')* '#' (~'\n')* '\n'+
    |    {tokenStartCharPositionInLine>0}?=> '#' (~'\n')* // let NEWLINE handle \n unless char pos==0 for '#'
    ;

So, this is it so far. I hope it was useful.
The next part of the saga will discuss ANTLR specifically, explain why using it to parse Python can also be complicated sometimes, and of course provide a solution.