Nick Mudge Ignition Software Consulting & Development

I recently wrote a parser in C that parses arthmetic expressions. It's a stepping stone towards creating the parser for the compiler I'm working on. It takes as input a string that consists of an equation. It reads the string once while creating a graph data structure. The data structure can then be traversed with the "parse" function to make all the calculations in the correct order and arrive at the answer.

After it is compiled, it is ran like this: ./parser "var = 77 + 3 / 2 + 90 * 100;"

I made a little web interface for it that just takes an expression, passes the expression to the parser and then outputs the result from the parser to the webpage:

Here's the C code:

Next Steps

  • Right now the parser reads its input starting from the right, moving to the left. I want to rewrite some code to make it read from left to right. Providing syntax error information will be easier to do if it reads from left to right.
  • Add parentheses to the language.
  • Start adding more programming language constructs.

I think the compiler is going to implement ANSI C. This is part of the operating system I said I was going to write.


Derek Elkins
30 June 2008

Before I looked at any of these files I thought to myself, "This is going to be godawful." Then I looked at them and they were godawful. I'll ignore the poor code related to general issues (e.g. the 'token_tracker' whose sole purpose in life seems to be to add an array overflow bug to your code) and focus on the aspects just related to parsing.

As your code is written it will not scale. Simply adding support for parentheses will already require you to go, at least conceptually, go beyond the structure of your current code. (You can actually hack them in without trying too hard.) The problem is that your code is currently written as a state machine. State machines can only parse regular languages (i.e. what basic regexes can parse). Even the language of arithmetic expressions is context free (the next step up from regular.) C is actually context-sensitive (the -next- step up), but usually the context sensitivities are handled in an ad-hoc manner and the language is primarily parsed with a CFG. There is no way you could continuously hack your state machine to handle all of C, you'd eventually fail; every change you try to add breaking other things and not quite working right itself.

The only way of solving that would be to find a "general hack." That "hack" in this situation is to add a stack producing what is called a Pushdown Automata which is capable of parsing CFGs. With great discipline you could write a C parser using this approach. However what this stack is holding is states to return to, that is where to go, and this should immediately raise alarm bells in your head. Doing this is manually simulating the C control stack. If we write the code to use the actual control stack, we end up with the recommended way of writing parsers by hand: the recursive descent parser.

This is what code to parse arithmetic expressions should look like (well closer to it) [using a slightly C-ish C++, definitely not what I would write using C++]:

enum TokenType {

struct Token {
TokenType type;
int value;

// assume some lexing functions
// returns NULL for no lex
extern Token *peek();
extern void next();

struct ASTNode {
Token token;
int numChildren;
ASTNode **children;

ASTNode *makeASTNode(Token *token, int numChildren, ASTNode **children) {
// I'm ignoring error handling throughout
ASTNode *result = (ASTNode *)malloc(sizeof(ASTNode));
result->token.type = token->type;
result->token.value = token->value;
result->numChildren = numChildren;
result->children = children;
return result;

void freeASTNode(ASTNode *node) {
for(int i = 0; i < node->numChildren; --i)

/* The motivating BNF
term ::= factor termTail
termTail ::= add factor termTail | epsilon
factor ::= atom factorTail
factorTail ::= mul atom factorTail | epsilon
atom ::= NUM | ( term )
add ::= + | -
mul ::= * | /

ASTNode *parseTerm(){
return parseTermTail(parseFactor());

ASTNode *parseTermTail(ASTNode *left) {
Token *current = peek();
if(current != NULL && (current->type == PLUS || current->type == MINUS)) {
ASTNode **children = (ASTNode **)malloc(2*sizeof(ASTNode *));
children[0] = left;
children[1] = parseFactor();
return parseTermTail(makeASTNode(current, 2, children));
} else if(current == NULL || current->type == CLOSE_PAREN)
return left;
error("no parse");

ASTNode *parseFactor(){
return parseFactorTail(parseAtom());

ASTNode *parseTermTail(ASTNode *left) {
Token *current = peek();
if(current != NULL && (current->type == MULTIPLY || current->type == DIVIDE)) {
ASTNode **children = (ASTNode **)malloc(2*sizeof(ASTNode *));
children[0] = left;
children[1] = parseAtom();
return parseFactorTail(makeASTNode(current, 2, children));
} else if(current == NULL || current->type == CLOSE_PAREN)
return left;
error("no parse");

ASTNode *parseAtom() {
Token *current = peek();
if(current != NULL && current->type == NUM)
return makeASTNode(current, 0, NULL);
else if(current != NULL && current->type == OPEN_PAREN) {
ASTNode *result = parseTerm();
current = peek();
if(current != NULL && current->type == CLOSE_PAREN) {
return result;
} else
error("no parse");
} else
error("no parse");

I don't know how this will come out formatting wise. If it is a mess, just copy and paste it into an editor and run indent (or an equivalent) over it.

This was all assuming that you are interested in writing a parser "by hand". Otherwise I'd recommend using a parser generator or an existing C parser (of which many exist.)
5 July 2008

Derek, good comments, cut the poor guy some slack though, hehe, he seems like he's new to all of this (=

Nick, almost every programmer has considered writing their own compiler (and most of them do, in college), and their own OS (most of them don't, they just implement parts of the system in college). Its a fun and very educating experience. Try to read (a lot) before you start with projects of this caliber. Learn from others, and try to focus on some interesting new feature you'd like to add to your language or OS instead of simply re-implementing whats already out there (unless its purely for educational reasons).

Most of all, enjoy! (=
18 December 2008

The way ahead is the use of a stack in two contexts. One is a stack holding operands loaded and combined during execution: imagine that machine code is being produced for a stack machine. Thus 1 + 2*3 + 4 should compile to something like
Load 1
Load 2
Load 3
Load 4
And if you want the expression evaluated, you have an interpreter that performs the operations in the above sequence.
To generate the operations in the above sequence by parsing the given text string left to right in which they do not appear in that order, you need a parser that uses a stack to temporarily hold operators (their text symbols will do) in accordance with precedence rules. Each operator has an associated precedence, carefully chosen. With the just scanned operator symbol in hand, pop from the stack all waiting operators whose precedence is greater than or equal to the incoming operator, then, place that operator on the operator stack and scan on. (The parse alternates between expecting an operand and an operator)
The precedence table I use is as follows:
TYPE SYMB !To recognise symbols and carry associated information.
CHARACTER*2 IS !Its text.
INTEGER*1 PRECEDENCE !Controls the order of evaluation.
INTEGER*1 POPCOUNT !Stack activity: a+b means + requires two in.
CHARACTER*48 USAGE !Description.
END TYPE SYMB !The cross-linkage of precedences is tricky.
TYPE(SYMB) SYMBOL(OPSYMBOLS) !Righto, I'll have one.
PARAMETER (SYMBOL =(/ !Note that "*" is not to be seen as a match to "**".
1 SYMB(" ", 1,0,"separates symbols and aids legibility."),
2 SYMB(") ", 2,0,"opened with ( to bracket a sub-expression."),
3 SYMB("] ", 2,0,"opened with [ to bracket a sub-expression."),
4 SYMB("} ", 2,0,"opened with { to bracket a sub-expression."),
5 SYMB(", ", 3,0,"continues a list of parameters to a function."),
C SYMB(":=", 4,0,"marks an on-the-fly assignment of a result"), Identified differently...
6 SYMB("| ", 5,2,"logical OR, similar to addition."),
7 SYMB("& ", 6,2,"logical AND, similar to multiplication."),
8 SYMB("= ", 7,2,"tests for equality (beware decimal fractions)"),
9 SYMB("< ", 7,2,"tests strictly less than."),
A SYMB("> ", 7,2,"tests strictly greater than."),
B SYMB("<>", 7,2,"tests not equal (there is no 'not' sign!)"),
C SYMB("<=", 7,2,"tests less than or equal."),
D SYMB(">=", 7,2,"tests greater than or equal."),
E SYMB("+ ", 8,2,"addition, and unary + to no effect."),
F SYMB("- ", 8,2,"subtraction, and unary - for neg. numbers."),
G SYMB("* ", 9,2,"multiplication."),
H SYMB("/ ", 9,2,"division."),
I SYMB("\ ", 9,2,"remainder a = a - truncate(a/b)*b; 11 = 2"),
J SYMB("^ ",10,2,"raise to power: also recognised is **."),
K SYMB("**",10,2,"raise to power: also recognised is ^."), !Uses the next precedence level also!
C 11 is used so that incoming ** will have lower priority than stacked **, thus delivering right-to-left evaluation.
L SYMB("! ",12,1,"factorial, sortof, just for fun.")/))

The algol-style of assignment within an arithmetic expression might be a surprise (and is specially recognised), and I allow three types of brackets as per mathematics. There are tricks in the parser so that exponentiation is treated right-to-left (thus 6**6**6 is 6**(6**6)) and a further trick so that factorial works as expected.
But otherwise, it is all straightforward, once the operator stack is understood.
Name: (required)
Email: (required)
What has four legs, rhymes with bat and says, "Meow?" (One word answer.)
Spam Filter: