Monday, December 3, 2012

ANTLR, first impression

This post continues the story from the previous one.

So, following the Stack Overflow link, I landed at the ANTLR homepage. Psychological scam alarms went off: a bunch of text riddled with hyperlinks and varying typefaces, a prominent button practically shouting "Oh come on, download me now", a testimonials column and a guy making a silly face. It took me a few seconds to actually start comprehending the content. The image below the silly guy encouraged me to stay; it looked like an FSA visualization I once saw in college, and FSAs are a crucial part of formal language processors. I started searching for something that identified what the site was about and read the first few lines:
What is ANTLR?
ANTLR, ANother Tool for Language Recognition,
... lot of technical terms that ordinary eye skips ...
 target language.
"OK, that looks like what I'm looking for", I said to myself. I'll stop with the thorough storytelling now; the silly guy is Terence Parr, the man behind the ANTLR project. Anyway, the next thing was to get a taste of how the tool works. I was disappointed at first by the fact that it is a command line tool, but that wouldn't stop me from learning it. Soon I found out that it has an additional tool with a GUI, ANTLRWorks, so I downloaded it and started looking for some kind of tutorial. The thing with this kind of software is that you have to learn a metalanguage for defining grammar rules, and examples are the best way to start. I found a good one on a different web site: an example of a not-so-basic calculator/interpreter for Java. I copy-pasted all three grammars into ANTLRWorks, pressed the "generate code" button and it worked. I had a clue how it worked but wasn't really ready to work on my own, so I tried a different approach: making a parser for a very simple language.


If you remember the old days of LOGO, you'll smile. My first informatics class in elementary school (5th grade, I remember it as if it was yesterday :) ) was about LOGO and moving the turtle with FD and RT commands. FD 100 moved the turtle forward 100 pixels and as it moved it left a straight line behind. RT 90 rotated the turtle 90° clockwise. Repeat those two commands four times and you'd draw a square. That's the simple language I decided to parse. Also, I decided to split the grammar into three parts because the last example did so and I believed it was the proper way to do it. This was my grammar file for the lexer:

lexer grammar MyLogoLexer;

options {
language=CSharp3;
}

fragment SIGN
: '+' | '-';
fragment SPACE
: ' ' | '\t';

NUMBER : '0' | SIGN? '1'..'9' '0'..'9'*;
FORWARD : 'FD';
ROTATE : 'RT';

NEWLINE : ('\r'? '\n')+ { Skip(); };
WHITESPACE
: SPACE+ { Skip(); };

A pretty simple vocabulary: white spaces are skipped, digits are treated as numbers and the keywords are FD and RT. The second grammar describes what is formally called the parser, a syntax analyzer and the part that builds a syntax tree. For the language I was building, the parser was even simpler than the lexer:

parser grammar MyLogoParser;

options {
language=CSharp3;
output = AST;
tokenVocab = MyLogoLexer;
}

script : statement* EOF!;

statement
: FORWARD NUMBER
| ROTATE NUMBER;

Basically, the root of the tree is a script. A script contains zero or more statements and a statement is either a "forward" or a "rotate" command. ANTLRWorks is full of little features that help with writing and validating a grammar, and the graphical representation of a rule is one of them. Look at how nice it is, feast your eyes on its awesomeness:


And now the last grammar, the so-called tree grammar, named not because it builds the tree but because it traverses the tree and performs node actions:

tree grammar MyLogoTree;

options {
language=CSharp3;
ASTLabelType = CommonTree;
tokenVocab = MyLogoParser;
}

@header {
using System;
}

@members {
double x = 0, y = 0, angle = 0;

private double toDouble(CommonTree node) {
    double value = 0.0;
    String text = node.Text;
    value = Double.Parse(text);
    return value;
}
}
script : statement*;

statement 
: FORWARD v=NUMBER {
double length = toDouble($v);
angle = Math.PI * this.angle / 180;
x += Math.Cos(angle) * length;
y += Math.Sin(angle) * length;
Console.WriteLine("Moved to {0}, {1}", x.ToString("0.#"), y.ToString("0.#"));
}
| ROTATE v=NUMBER {
angle += toDouble($v);
Console.WriteLine("Facing {0}°", angle.ToString("0.#"));
};

Once I had written the grammars, I hit the "generate code" button, got errors, corrected them, retried, and after a few tries it succeeded. And then I found myself in front of a weird wall: how to run the generated code? It was midnight already and I was trying to break the habit of staying up late. It should have been a trivial matter but it turned out to take two hours of my life. I expected the generated source code to look like this:


From the perspective of the user code, a single class does the job. But instead it turned out that the user's code has to mouth-feed EVERY phase on its own. See:



Notice the absence of gray arrows. Things are even worse: there are additional objects the user code has to take care of besides the generated classes. Mimicking the Java example (the "not so basic calculator" mentioned at the beginning), I figured the code should look like this:

// needs: using System.IO; using Antlr.Runtime; using Antlr.Runtime.Tree;
static void Main(string[] args)
{
    using (var input = new StreamReader("input.txt")) {
        MyLogoLexer lexer = new MyLogoLexer(new ANTLRReaderStream(input));

        MyLogoParser tokenParser = new MyLogoParser(new CommonTokenStream(lexer));
        tokenParser.TreeAdaptor = new CommonTreeAdaptor();
        var parserResult = tokenParser.script();

        CommonTree tree = (CommonTree)parserResult.Tree;
        MyLogoTree treeProcessor = new MyLogoTree(new CommonTreeNodeStream(tree));

        treeProcessor.script();
    }
}

But there were two big problems: the parser's script method was private and the tree processor's script method didn't exist at all. For the method visibility issue, Google led me back to Stack Overflow. Each rule in a parser or tree grammar generates a method and there is an undocumented feature that defines the visibility of such methods. When generating Java code the feature is ignored and all methods are public, but when generating C# code the methods are private unless the undocumented feature is used. Since the example I used as learning material was meant to generate Java code, it didn't mention this issue. The lack of a script method in the tree parser was entirely my fault: I hadn't noticed that the grammar rules have to be repeated in the tree processor even if they have no actions. Well, not all rules have to be repeated, only those that should generate a method.
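If I understood the Stack Overflow answer correctly, the undocumented feature is an access modifier written in front of the rule name. A sketch of the fixed rule (I'm not certain this is the exact syntax the answer used):

```
public script : statement* EOF!;
```

With that, the C# target should generate a public script method, matching what the Java target does by default.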

To summarize, half of the workflow is good, enjoyable and looks mature but the other half is drastically unpolished. That contrast is a problem in itself: at first you live in high hopes that you are working in a mature IDE such as Eclipse or Visual Studio and then BAM, this doesn't work, that doesn't work the way you thought, and so on.

When I woke up the next morning I looked at other options, but none of them looked better than ANTLR. There is one with a nice premise but it's in the alpha phase. So I decided to try that one and to give ANTLR another chance.

Saturday, December 1, 2012

ANTLR, introduction

I did a little research on libraries for processing formal languages, both for myself and for the project I'm working on, the Stareater. Previous versions of Stareater had a feature that allowed whole expressions instead of plain constant numbers in data files. For instance, for farmer efficiency you could write something that evaluates to 3 + 0.125 * level of farming technology in a single attribute instead of having multiple attributes: one for the constant part, one for the technology identifier and one for the technology level influence factor. And on top of that, you can change the operators in an expression any way you want. But there is a catch: you have to write it in either normal or reverse Polish notation (prefix and postfix notation, respectively), so the last example would look like this (highlighted part):


It's bearable when expressions are simple but take a look at the image below. That is the expression for the maximum number of starports on a planet. Floor is a function for rounding down, "^" is the power operator, PLANET_VELICINA is the size of a planet and SVEMIRSKI_PROGRAM_LVL is the level of the relevant technology.


It translates to Floor( (1 + 0.1 * (tech_level - 1)) * (planet_size ^ 1.5) / 100 ). So, why Polish notation instead of "plain English", the infix notation? I wrote about it before but here is a summary: it's easy to parse. The parser implementation is straightforward because there is no operator precedence issue. For instance, an evaluator for a postfix notation expression can be explained in a few sentences (knowledge of basic data structures assumed): read the input from left to right, push operands onto the top of the stack, apply operators to the operands on the top of the stack, remove the used operands and push the result onto the top of the stack. If the expression was valid, in the end the stack will contain only one element and that is the result of the expression. A parser is very similar to the described evaluator but instead of calculating, it builds the so-called syntax tree. What is a syntax tree, you might wonder? Well, now we are stepping into the area of computer science and the rest of the post will assume you have some degree of familiarity with the field of formal language processing.
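The evaluator described above can be sketched in C# like this (a minimal evaluator for space-separated postfix tokens; the class and method names are mine, not from Stareater, and it handles only the four basic operators):

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class PostfixEvaluator
{
    // Evaluates a space-separated postfix expression, e.g. "3 0.125 4 * +".
    public static double Evaluate(string expression)
    {
        var stack = new Stack<double>();

        // Read the input from left to right.
        foreach (string token in expression.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            if (token == "+" || token == "-" || token == "*" || token == "/")
            {
                // An operator consumes the two topmost operands and pushes the result back.
                double right = stack.Pop();
                double left = stack.Pop();
                switch (token)
                {
                    case "+": stack.Push(left + right); break;
                    case "-": stack.Push(left - right); break;
                    case "*": stack.Push(left * right); break;
                    case "/": stack.Push(left / right); break;
                }
            }
            else
            {
                // Everything else is an operand and goes on top of the stack.
                stack.Push(double.Parse(token, CultureInfo.InvariantCulture));
            }
        }

        // A valid expression leaves exactly one value on the stack: the result.
        return stack.Pop();
    }
}
```

For the farmer efficiency example with a farming technology level of 4, Evaluate("3 0.125 4 * +") returns 3.5.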


This, the picture above, is the syntax tree for the 3 + 0.125 * level of farming expression. The arrow at the top points to the root of the tree (knowledge of trees from graph theory assumed). To get the value of the expression, the user code (the one using the tree) simply has to ask the root to evaluate itself. The rest is the magic of the composite design pattern: the root sees that it is an addition, asks its children to evaluate themselves and sums the results. Since the first child is a constant, it simply returns 3. The second child, on the other hand, asks its own children to evaluate themselves and multiplies their values. As you can see, a nice utilization of recursion and dynamic polymorphism.
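The composite pattern just described can be sketched like this (the class names are my own, made up for illustration; a real ANTLR tree uses generic node types instead):

```csharp
// Every node in the syntax tree knows how to evaluate itself.
abstract class ExpressionNode
{
    public abstract double Evaluate();
}

// Leaf node: a constant simply returns its value.
class Constant : ExpressionNode
{
    private readonly double value;
    public Constant(double value) { this.value = value; }
    public override double Evaluate() { return this.value; }
}

// Inner node: asks its children to evaluate themselves, then sums the results.
class Sum : ExpressionNode
{
    private readonly ExpressionNode left, right;
    public Sum(ExpressionNode left, ExpressionNode right) { this.left = left; this.right = right; }
    public override double Evaluate() { return left.Evaluate() + right.Evaluate(); }
}

// Inner node: same idea, but multiplies the results.
class Product : ExpressionNode
{
    private readonly ExpressionNode left, right;
    public Product(ExpressionNode left, ExpressionNode right) { this.left = left; this.right = right; }
    public override double Evaluate() { return left.Evaluate() * right.Evaluate(); }
}
```

The tree from the picture (with level = 4) would be built as new Sum(new Constant(3), new Product(new Constant(0.125), new Constant(4))), and calling Evaluate() on it returns 3.5.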

Building a syntax tree for an infix expression is a complicated thing. At first there are binary operators (assuming the reader knows what that means) which have one set of not-so-easy-to-implement rules, then add unary operators to the mix and you get stuff complicated by a factor of 1.5, and things can be complicated even further with brackets and functions. On top of that you have to decide whether a / b / c should be interpreted as (a / b) / c, as a / (b / c), or rejected as invalid. It would be hell to code and test all of it from scratch. Fortunately our fathers and grandfathers solved that for us when they invented compilers. Programmers use compilers for transforming source code into executable files but compilers are more general than that: they translate one language to another. In the case of C++, source code written by the rules of C++ is translated to the machine code of the target hardware. There is a fair amount of computer science theory involved here and it goes under the name of formal languages.

Main phases of compilation process

Every programming language is a formal language and as such has a vocabulary (a set of lexemes), grammar rules (syntax) and semantic rules. The vocabulary usually contains a list of keywords (if, for and default, for example) and regular expressions for variable names, constant numeric and textual values and other simple stuff, though regular expressions are formal languages of their own. Grammar rules define more complicated stuff, such as "an assignment has the form of a variable, followed by the '=' character, followed by an expression and ended with ';'". Semantic rules cover stuff that is hard to express in the grammar, such as checking whether a value of some type can be stored in a variable. There are tools for generating source code for a compiler's analysis phases, an inception one might say.
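The assignment example from the paragraph above could be written in an ANTLR-like notation roughly like this (the rule and token names are made up for illustration):

```
assignment : VARIABLE '=' expression ';';
```

A generator then turns rules like this into the source code of the corresponding analysis phase.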

The problem of parsing text containing infix expressions is basically that: lexical, syntax and semantic analysis. So I googled a little bit for C# solutions and found that most queries lead to ANTLR, a Java tool that can generate C# source code from grammar rules. More about my impressions of the tool in the following post(s).