Pages

Monday, December 3, 2012

ANTLR, first impression

This post continues the story from the previous one.

So, following Stack Overflow link I landed at ANTLR homepage. Psychological scam alarms went off, bunch of text riddled with hyperlinks and varying typefaces, prominent button accompanied with "Oh come on, download me now", testimonials column and a guy making silly face. It took me a few seconds to actually start comprehending the content. Image below silly guy encouraged me to stay, it looked like FSA visualization I once saw during the college and FSA are crucial part of formal language processors. I started to search for something that identifies what the site is about and I read the first few lines:
What is ANTLR?
ANTLR, ANother Tool for Language Recognition,
... lot of technical terms that ordinary eye skips ...
 target language.
"OK, that looks like what I'm looking for", said to myself. I'll stop with through storytelling now and the silly guy is Terence Parr, the guy behind the ANTLR project. Anyway, next thing was to have a taste how the tool works. I was disappointed at first with a fact that it is a command line tool but that wouldn't stop me from learning it. Soon I've found out that it has additional tool with GUI, ANTLRWorks, so I downloaded it and started looking for some kind of tutorial. Thing is with this kind of software that you have to learn a metalanguage for defining grammar rules and examples are best way to start. I've found a good one on a different web site. It's an example of not so basic calculator/interpreter for Java. I copy-pasted all three grammars ANTLRWorks, pressed "generate code" button and it worked. I had a clue how it works but wasn't really ready to work on my own so I've tried different approach, make a parser for a very simple language.


If you remember old days in LOGO, you'll smile. My first informatics class in elementary school (5th grade, I remember it as if it was yesterday :) ) was about LOGO and moving the turtle with FD and RT commands. FD 100 moved the turtle forward 100 pixels and as it moved it left a straight line. RT 90 rotated turtle 90° clockwise. Repeat that four times and you'd draw a rectangle. That's the simple a language I've decided to parse. Also, I've decided to split the grammar to three parts because last example did so and I believed it's a proper way to do and this was my grammar file for lexer:

lexer grammar MyLogoLexer;

options {
language=CSharp3;
}

fragment SIGN
: '+' | '-';
fragment SPACE
: ' ' | '\t';

NUMBER : '0' | SIGN? '1'..'9' '0'..'9'*;
FORWARD : 'FD';
ROTATE : 'RT';

NEWLINE : ('\r'? '\n')+ { Skip(); };
WHITESPACE
: SPACE+ { Skip(); };

Pretty simple vocabulary, white spaces are skipped, digits are treated as numbers and keywords are FD and RT. Second grammar describes what is formally called the parser, a syntax analyzer and a part that build a syntax tree. For the language I was building, the parser was even simpler than lexer:

parser grammar MyLogoParser;

options {
language=CSharp3;
output = AST;
tokenVocab = MyLogoLexer;
}

script : statement* EOF!;

statement
: FORWARD NUMBER
| ROTATE NUMBER;

Basicaly the root of the tree is a script. Script contains zero or more statements and statements are either "forward" or "rotate" commands. ANTLRWorks is full of little features that help writing and validating grammar and graphical representation of a rule is one of them. Look at how nice it is, feast your eyes on it's awesomeness:


And now the last grammar, so called tree grammar, not because it builds the tree but because it traverses the tree and performs node actions:

tree grammar MyLogoTree;

options {
language=CSharp3;
ASTLabelType = CommonTree;
tokenVocab = MyLogoParser;
}

@header {
using System;
}

@members {
double x = 0, y = 0, angle = 0;

private double toDouble(CommonTree node) {
double value = 0.0;
   String text = node.Text;
   value = Double.Parse(text);
   return value;
  }
}
script : statement*;

statement 
: FORWARD v=NUMBER {
double length = toDouble($v);
angle = Math.PI * this.angle / 180;
x += Math.Cos(angle) * length;
y += Math.Sin(angle) * length;
Console.WriteLine("Moved to {0}, {1}", x.ToString("0.#"), y.ToString("0.#"));
}
| ROTATE v=NUMBER {
angle += toDouble($v);
Console.WriteLine("Facing {0}°", angle.ToString("0.#"));
};

When I wrote grammars, I've hit "generate code" button, got errors, corrected them, retried and after a few tries it succeeded. And than I got my self in from of the weird wall, how to run the generated code? It was midnight already and I was trying to break the habit staying up late. I should have been trivial matter but it turned out that it took two hours of my life. I expected that generated source code looked like this:


From the perspective of the user code, a single class that does the job. But instead it turned out that user's code has to mouth feed EVERY phase on it's own. See:



Notice the absence of gray arrows. Thing are even worse, there are additional objects the user code has to take care of besides generated classes. Mimicking the the Java example ("not so basic calculator" mentioned at the beginning), I figured the code should look like this:

static void Main(string[] args)
{
using (var input = new StreamReader("input.txt")) {
MyLogoLexer lexer = new MyLogoLexer(new ANTLRReaderStream(input));

MyLogoParser tokenParser = new MyLogoParser(new CommonTokenStream(lexer));
tokenParser.TreeAdaptor = new CommonTreeAdaptor();
var parserResult = tokenParser.script();

CommonTree tree = (CommonTree)parserResult.Tree;
MyLogoTree treeProcessor = new MyLogoTree(new CommonTreeNodeStream(tree));

treeProcessor.script();
}
}

But there were two big problems: parser's script method was private and tree processor's script method didn't exist at all. For method visibility issue the Google led me back to the Stack Overflow. Each parser and tree grammar generates a method and there is an undocumented feature that defines visibility of such methods. When generating Java code, the feature is ignored and all methods are public but when generating C# code methods are private unless the undocumented feature is used. Since the example that I used as learning material was meant to generate Java code, it didn't mention this issue. The lack of script method in tree parser was my fault entirely, I didn't noticed that the grammar rules has to be repeated in the tree processor even if they have no action. I mean, not all rules have to be repeated, only those that should generate a method.

To summarize, half of the workflow is good, enjoyable and looks mature but the other half is drastically unpolished. That contrast is the problem too, at first you'd live in high hopes that you are working in mature IDE such as Eclipse and Visual Studio and than BAM, this doesn't work, that doesn't work the way you thought and so on.

As I woke up next morning I looked at other options, there aren't looking better than ANTLR. There is one with nice premise but it's in alpha phase. So I decided to try that one and to give ANTLR another chance.

No comments:

Post a Comment