## Introduction

In this post I will demonstrate how to parse and calculate an arithmetic expression a general-purpose parser. When we’re done, we’ll be able to handle expressions such as `1 + 2 * -(-3+2)/5.6 + 3`

, and hopefully you’ll have gained the tools to handle much more.

My motivation is to provide a simple and fun lesson in parsing and formal grammars, as well as to show-case PlyPlus, a parser interface I’ve been working on-and-off on for the past few years. As a bonus, the result is a safe arithmetic alternative to eval().

If you want to follow the examples on your computer, you can install PlyPlus with `pip install plyplus`

.

Working knowledge of Python is required for the implementation section.

## Grammars

For those of you who don’t know how parsing and formal grammars work, here is a quick overview: Formal grammars are a hierarchy of rules for parsing text. Each rule matches some portion of the input text, by describing the rules that it’s made of.

Here an example of a pseudo-grammar for parsing `1 + 2 + 3 + 4`

:

```
```

```
Rule #1 - add IS MADE OF add + number
OR number + number
```

```
```

Or in EBNF:

add: add '+' number | number '+' number ;

Each pass of the parser will look for either *add+number* or *number+number*, and if it finds one, convert it to add. Basically, every parser aims to climb the hierarchy as much as possible.

Here are the steps the parser will take:

- number + number + number + number

*first pass turns all numbers into a ‘number’ rule* - [number + number] + number + number

*the parser found its first pattern!* - [
**add**+ number] + number

*after converting the pattern, it finds the next one* - [
**add +**number] **add**

The sequence of symbols has turned into a hierarchy of two simple rules: `number+number`

and `add+number`

, and if we tell the computer how to solve each of them, it can solve the entire expression for us. In fact, it can solve any sequence of additions, no matter how long! That is the strength of formal grammars.

## Operator Precedence

Arithmetic expressions are not just a linear progression of symbols. Their operators create an implicit hierarchy, which makes them the perfect target for formal grammars:

`1 + 2 * 3 / 4 - 5 + 6`

Is equivalent to:

`1 + (2 * 3 / 4) - 5 + 6`

We can express this structure in the grammar by nesting the rules:

add: add + mul | mul '+' mul ; mul: mul '*; number | number '*' number ;

By telling *add* that it operates on *mul*, and not on numbers, we are giving multiplications the precedence.

Let’s pretend-run this grammar on `1 + 2 * 3 * 4`

with our magical parser that is in my head:

- number + number * number * number
- number + [number * number] * number

*the parser doesn’t know what a*`number+number`

is, so this is his next pick - number + [
**mul*** number] - number +
**mul** - ???

Now we are in a bit of a pickle! The parser doesn’t know what to do with *number+mul*. We can tell it, but if we keep looking we’ll find that there are many possibilities we didn’t cover, such as *mul+number*, *add+number*, *add+add*, etc.

So what do we do?

Luckily, we have a trick up our sleeve: We can say that a number by itself is a multiplication, and a multiplication by itself is an addition!

This method might seem strange at first, but it makes total sense:

add: add '+' mul | mul '+' mul | mul ; mul: mul '*' number | number '*' number | number ;

But if *mul* can become *add*, and *number* can become *mul*, we have extra lines that do nothing. Removing them, we get:

add: add '+' mul | mul ; mul: mul '*' number | number ;

Let’s pretend-run on `1 + 2 * 3 * 4`

again with this new grammar:

- number + number * number * number

*There’s no rule for*`number*number`

now, but the parser can “get creative” - number + [number] * number * number
- number + [
**mul*** number] * number - number + [
**mul*** number] - [number] +
**mul** - [
**mul**] + mul - [
**add**+ mul] **add**

Success!!!

If this looks like magic to you, try pretend-running on different arithmetic expressions, and see how the expression resolves itself in the correct order every time. Or wait for the next section and let the computer run it for you!

## Running the parser

By now we have a fairly good idea of how we want our grammar to work. Let’s apply it and write an actual grammar:

start: add; // This is the top of the hierarchy add: add add_symbol mul | mul; mul: mul mul_symbol number | number; number: '[\d.]+'; // Regular expression for a decimal number mul_symbol: '\*' | '/'; // Match * or / add_symbol: '\+' | '-'; // Match + or -

You might want to brush up on regular expressions a bit, but otherwise this grammar is pretty straight-forward. Let’s run it on an expression!

>>> from plyplus import Grammar >>> g = Grammar("""...""") >>> print g.parse('1+2*3-5').pretty() start add add add mul number 1 add_symbol + mul mul number 2 mul_symbol * number 3 add_symbol - mul number 5

So pretty!

Study the tree and see what hierarchy the parser chose.

If you want to play with the parser, and feed it expressions by yourself, you can! All you need is Python. Run `pip install plyplus`

and paste the above commands inside python (make sure to put the actual grammar instead of ‘…’ ;-) ).

## Shaping the tree

Plyplus automagically creates a tree, but it’s not very optimal. While putting *number* inside *mul* and *mul* inside *add* was useful for creating a hierarchy, now that we already have a hierarchy they are just a burden. We can tell Plyplus to “expand” (i.e. remove) rules by prefixing them. A `@`

will always expand a rule, a `#`

will flatten it, and a `?`

will expand it if and only if it has one child. In this case, `?`

is what we want.

start: add; ?add: add add_symbol mul | mul; // Expand add if it's just a mul ?mul: mul mul_symbol number | number; // Expand mul if it's just a number number: '[\d.]+'; mul_symbol: '\*' | '/'; add_symbol: '\+' | '-';

Here’s how the tree looks with the new grammar:

>>> g = Grammar("""...""") >>> print g.parse('1+2*3-5').pretty() start add add number 1 add_symbol + mul number 2 mul_symbol * number 3 add_symbol - number 5

Ooh, that is so much cleaner, and I dare say, quite optimal!

## Parenthesis and Other Features

We are missing some obvious features: Parenthesis, unary operators `(-(1+2))`

, and the ability to put spaces inside the expression. These are all so easy to add at this point that it would be a shame not to.

The important concept is to add a new rule, we’ll call *atom*. Everything inside the atom (namely parenthesis and unary operations) happens before any additions or multiplications (aka binary operations). Since the atom is only a hierarchical construct with no semantic significance, we’ll make sure it’s always expanded, by adding `@`

to it.

The obvious way to allow spaces is with something like `add: add SPACE add_symbol SPACE mul | mul;`

, but that’s tedious and unreadable. Instead, we will tell Plyplus to always ignore whitespace.

Here’s the final grammar, with all of these features:

start: add; ?add: (add add_symbol)? mul; ?mul: (mul mul_symbol)? atom; @atom: neg | number | '\(' add '\)'; neg: '-' atom; number: '[\d.]+'; mul_symbol: '\*' | '/'; add_symbol: '\+' | '-'; WHITESPACE: '[ \t]+' (%ignore);

Make sure you understand it, so we can proceed to the next step: Calculating!

## Calculating!

We can already turn an expression into a hierarchical tree. To calculate it, all we need is to collapse the tree into a number. The way to do it is branch by branch.

This is the part we start writing code, so I need to explain two things about the tree.

- Each branch is an instance with two attributes:
*head*, which is the name of the rule (say,*add*or*number*), and*tail*, which is the list of sub-rules that it matched. - By default, Plyplus removes unnecessary tokens. In our example, the ‘(‘ and ‘)’ will already be removed from the tree, as well as neg’s ‘-‘. Those of
*add*and*mul*won’t be removed, because they have their own rule, so Plyplus knows they’re important. This feature can be turned off to keep all tokens in the tree, but in my experience it’s always more elegant to leave it on and change the grammar accordingly.

Okay, we are ready to write some code! We will collapse the tree using a transformer, which is very simple. It traverses the tree, starting with the outermost branches, until it reaches the root. It’s your job to tell it how to collapse each branch. If you do it right, you will always run on an outermost branch, riding the wave of its collapse. Dramatic! Let’s see how it’s done.

>>> import operator as op >>> from plyplus import STransformer class Calc(STransformer): def _bin_operator(self, exp): arg1, operator_symbol, arg2 = exp.tail operator_func = { '+': op.add, '-': op.sub, '*': op.mul, '/': op.div }[operator_symbol] return operator_func(arg1, arg2) number = lambda self, exp: float(exp.tail[0]) neg = lambda self, exp: -exp.tail[0] __default__ = lambda self, exp: exp.tail[0] add = _bin_operator mul = _bin_operator

Each method corresponds to a rule name. If a method doesn’t exist, __default__ is called. In our implementation, we left out start, add_symbol, and mul_symbol, all of which should do nothing but return their only sub-branch.

I use float() to parse numbers, because I’m lazy, but I can implement it using the parser as well.

I use the operator module for syntactic beauty. operator.add is basically ‘lambda x,y: x+y’, etc.

Alright, let’s run the transformer and see how it turns out.

>>> Calc().transform( g.parse('1 + 2 * -(-3+2) / 5.6 + 30')) 31.357142857142858

What does eval() think?

>>> eval('1 + 2 * -(-3+2) / 5.6 + 30') 31.357142857142858

Success!

## Final Step: The REPL

For aesthetic reasons, let’s wrap it up in a nice calculator REPL:

def main(): calc = Calc() while True: try: s = raw_input('> ') except EOFError: break if s == '': break tree = calc_grammar.parse(s) print calc.transform(tree)

You can see the full source code here: https://github.com/erezsh/plyplus/blob/master/examples/calc.py

Fascinating. How is this different from the Lex-Yacc combo, if you have used that? For reference, PLY :http://www.dabeaz.com/ply/ply.html

Thank you. Actually, PlyPlus uses the excellent PLY as its underlying engine.

The difference is in the features PlyPlus offers, such as an advanced grammar, and automatic tree construction.

For comparison, see PLY’s calc implementation: http://www.dabeaz.com/ply/example.html

So, it’s complaining about the unused WHITESPACE definition in the grammar. How do we put that to use?

WHITESPACE really isn’t used, but it’s not a helpful warning. I consider it a minor bug, which I’ll fix if people show interest.

Anyway, it only prints it the first time you compile the grammar.

The title is kinda misleading. Its not really 50 lines if you have to import a non standard library.

I have a question about parsing

number + number * number * number

with the new grammer. As far as I can see, parsing is done from left to right. So, what makes our grammer to skip first “number” while searching for a pattern? Also, after creating first “mul”, what makes grammer to skip transforming “mul -> add” and check for “mul + number”?

What am I missing?

That what makes a parser different than a simple pattern-matcher.

Parsers will only match a rule if it leads to a full solution of the text. An error occurs if there are no solutions, or if there is more than one solution (usually). How do they do it? That’s complex parser theory, but in simple terms, an LR parser keeps a state, and it will only match a rule if it maintains a ‘good’ state. A pattern that leads it into a ‘bad’ state is ignored.

So I guess that is where PLY comes in. Thanks :)

actually the grammar is case-sensitive, right? e.g. ‘mulSymbol’, or ‘whitespace’, will cause an error. BTW the error information is really weird;(

anyway, it’s much more simpler and convenient than the ply so far. thanks for your hard work.

Yes, it’s case sensitive. I’m glad to hear it’s convenient for you. I want PlyPlus to be very easy, so if you run into anything confusing or difficult, I would like to hear about it!

Hi again, I have been used both ply and plyplus for the same task. I would like to share some ideas from my side:

1. lack documents about the details. I must read the source code and samples over and over during my work. most stuff are straightforward but not for some tricky part.

2. the auto generated AST is very useful to test if the grammar works or not. but to me I don’t like the head & tail style in the generated node, instead I prefer using the subscript, which is more simpler. OK, this is not a big deal in fact.

3. use plyplus to create a prototype for AST generator is so much simpler! I also tried the PyParsing but honestly I think separated grammar is more clear, simple and powerful when facing the complex problem. for simple problem I prefer hand made parser. once the grammar works, the rest become simple.

5. both ply and plyplus are weak in error prompting. but plyplus is more easy to test the grammar, which will help a little.

6. plyplus is more natural for non-yacc users! big win!

IMO, plyplus is far from production quality compared with ply. but it’s a good start and more over it has its own highlights, so I think you should continue the work.

Thanks.

Hi Shawn,

If there are details that are missing from the documentation, especially if they are tricky, please let me know what they are so I can add them.

Anyway, I’m glad you find plyplus useful.