Pages

Thursday, January 3, 2013

ANTLR, second chance and rivals

This post continues the story from the introduction and first impression.

It has been long month and it took me quite a while to gather the material for this post. Long story short ANTLR is no go for C#, it's rival in alpha stage is lacking and the third option finally proved worthy. Now the long story:

Last time I tried ANTLR with separate files for lexer, parser and tree processor grammars which produced "mouth feeding" process in the user code using those grammars. This time i tried ANTLR with combined grammar file, basically all three grammars in one file.

grammar MyLogoCombined;

options {
language=CSharp3;
output=AST;
}

@header {
using System;
}

fragment SIGN
: '+' | '-';
fragment SPACE
: ' ' | '\t';

NUMBER : '0' | SIGN? '1'..'9' '0'..'9'*;
FORWARD : 'FD';
ROTATE : 'RT';

NEWLINE : ('\r'? '\n')+ { Skip(); };
WHITESPACE
: SPACE+ { Skip(); };

public script : statement* EOF!;

statement
: FORWARD v=NUMBER {
double length = toDouble($v);
angle = Math.PI * this.angle / 180;
x += Math.Cos(angle) * length;
y += Math.Sin(angle) * length;
Console.WriteLine("Moved to {0}, {1}", x.ToString("0.#"), y.ToString("0.#"));
}
| ROTATE v=NUMBER {
angle += toDouble($v);
Console.WriteLine("Facing {0}°", angle.ToString("0.#"));
};

It doesn't look bad or crammed and generated code is better too. Generated classes are lexer and merged parser and tree processor. As an effect a user code required to utilize those classes got considerably simpler.

static void CombinedLogo()
{
using (var input = new StreamReader("input.txt")) {
MyLogoCombinedLexer lexer = new MyLogoCombinedLexer(new ANTLRReaderStream(input));
MyLogoCombinedParser parser = new MyLogoCombinedParser(new CommonTokenStream(lexer));
parser.script();
}
}

Now we have something usable. But I wasn't satisfied, generated code requires 100 kB run-time library (may look weird compared to executable weighting 30 kB) and is riddled with comments mentioning full path to original grammar file. And there is no way to generate code without those comments. Well, not exactly critical issues, more like unnecessary vice. There is another issue I haven't mentioned earlier, documentation is not quite there. It exists, that's good, there are a lot of examples, that's good too but you won't get your question answered on the single page. You'll have to look in the official docs, unofficial examples and still do a few experiments to get definite answers.

Irony.Net

On the Stack Overflow webpage that led me to ANTLR next recommended thing was Irony.Net. It's the project hosted on CodePlex with professional looking home page. I've downloaded source code and started looking around. The nicest thing about Irony.Net is the way of defining a grammar, there is no code generations from specially formatted text file, it's defined as ordinary C# class. I usually post code as plain text but for the sake of presentation you'll get an image this time.


That's as awesome as ANTLR's state diagram for grammar rules. Operator overloading is usually syntax evil but in some cases it's OK thing to do and Irony.Net has found one. As you can see plus and bitwise OR operators are overloaded to operate on non-terminals so the grammar rules can resemble BNF notation as closely as possible. At the same time the image presents three issues with the Irony.Net: piece of code commented out, first parameter of MakeStarRule method and lack of rule attributes (tree processor actions).

Commented out code is left on purpose to demonstrate the existence of various flags and properties. While trying to make it work I read somewhere that I should explicitly specify creation of abstract syntax tree which is usually normal process in parsers so I added that line. It turned out that Irony.Net creates syntax tree by default and that flag actually confuses it. Alright, it confused me too because it didn't work with educational material, it required extra information that I couldn't figure out within the time I had at the moment. I figured it is supposed to indicate that custom tree node objects are being used and that creation of those objects is mysterious art.

That leads us to the lack of rule attributes problem. In order to process the tree you have to either traverse the tree on your own or use the mysterious art. I really tried to figure out the mysterious art but the lack of education materials and limited time bested me (a year later after writing this post I did learn how to use visitor pattern but still, there better ways). The biggest problem with Irony.Net is the lack of documentation. It has so many features and only way to figure them out is to dig in their source code.

The weirdness of MakeStarRule method may not be obvious but why does it require destination non-terminal and why is it class member? In my opinion it should be static and have only one mandatory parameter, the repeatable BNF term. The reason why it requires destination non-terminal is because it secretly modifies the rule instead of just building the BNF term. The reason it is non-static is because it uses certain flag from the grammar object. That is actually OK because the grammar object is used to build the "language" object, not for direct parsing.

Now that I mentioned it, using the grammar is quite simple, build parser using the grammar, feed the input and get the tree:


Parser parser = new Parser(new IronyLogo());
ParseTree tree;
using (var input = new StreamReader("input.txt"))
tree = parser.Parse(input.ReadToEnd());

For processing the tree I opted for no mysterious art approach, doing it in the user code and it's not a big hassle:

double x = 0, y = 0, angle = 0;

foreach (ParseTreeNode statement in tree.Root.ChildNodes) {
ParseTreeNode node = statement.ChildNodes[0];
int value = (int)node.ChildNodes[1].Token.Value;

switch (node.Term.Name) {
case "forward":
x += value * Math.Cos(angle);
y += value * Math.Sin(angle);
break;

case "rotate":
angle += Math.PI * value / 180.0;
break;
}

Console.WriteLine("Turtle at {0}, {1}", x.ToString("0.#"), y.ToString("0.#"));
}

Neat thing is that NumberLiteral type of literals handles string to number conversion on it's own. In fact Irony.Net has a lot of features for building programming languages, custom literals for number and identifiers, methods for defining operator precedence to name a few.

So why is Irony.Net on go for C#? If you are building a compiler or interpreter, it's good but if you are build parser for domain specific language it's simply not the right tool. You can chop wood with hammer but using an axe would be more practical. First of all it requires 160 kB of run-time library, that's more than ANTLR, doesn't have elegant grammar attributes and is undocumented. Oh my God, it is so undocumented that I have to involve the God to the issue. Author himself claims that source code is enough and is unwilling to write decent documentation (well, the project is still in alpha phase so I can forgive him) and sometimes doesn't give an answer beyond "look in the code of a certain example shipped with Irony". There is a moderate number of examples included with Irony.Net source code but they are either grammars without attributes or proofs of concept (such as fully featured interpreter with a lot of layers between grammar and attributes) that are hard to follow while learning stuff.

Coco/R

Goggling for third solution led me back to Stack Overflow but to a different page this time. Answers on that page presented various new solutions and I've checked the first one (actually the second one because the first is for F#), Coco/R. The link led me to a clean black on white page where I've quickly found my way to the tutorial that hooked me up. At first I was skeptical, zip file with 5 years old Powerpoint presentation but after skipping language processing theory, presentation arrived at very simple example describing whole process, from writing a grammar to running the parser. Few slides further presented grammar attributes and gimmicks such as how to ignore new line characters, make grammar case insensitive and so on. The post is already long, but heck, I'll publish my first class Logo grammar anyway:

COMPILER Logo
double turtleX = 0, turtleY = 0, angle = 0;

IGNORECASE

CHARACTERS
digit = "0123456789".

TOKENS
number = digit {digit}.

IGNORE '\t' + '\r' + '\n'

PRODUCTIONS
Logo = {Statement}.
Statement = Forward | Rotate.

Forward (. double length; .)
= "fd" Parameter<out length>
(. turtleX += Math.Cos(angle) * length;
turtleY += Math.Sin(angle) * length;
Console.WriteLine("Turtle at {0}, {1}", 
turtleX.ToString("0.#"), 
turtleY.ToString("0.#"));
.)
.

Rotate (. double ang; .)
= "rt" Parameter<out ang>
(. angle += Math.PI * ang / 180; .)
.

Parameter<out double n>
= number (. n = Convert.ToDouble(t.val); .)


.
END Logo.

You've already seen something similar with ANTLR grammar, lexical tokens, extenden BNF production, c# header and attributed code. This one has a period character after almost everything. One thing I should point out is that parts of definition have to strictly follow an order of appearance. For instance IGNORE part can't appear before TOKENS or CHARACTERS blocks. Correct order of appearance can be found in the documentation that exists and is informative. Before I start praising it's documentation, let's conclude the usage of Coco/R. This is what user code looks like:

static void Main(string[] args)
{
Scanner scanner = new Scanner("input.txt");
Parser parser = new Parser(scanner);
parser.Parse();
}

As simple as that. Scanner can also accept Stream object instead of file name, for those insisting on abstraction (which is good requirement).

Conclusion


Unlike ANTLR or Irony.Net, Coco/R didn't catch me with nasty surprises, it's simple, it works, requires no run-time files and it is well documented. Though it doesn't have fancy grammar editor or grammar visualisation it's a tool that I would use in my projects.