Thursday, January 3, 2013

ANTLR, second chance and rivals

This post continues the story from the introduction and first impression.

It has been a long month and it took me quite a while to gather the material for this post. Long story short: ANTLR is a no-go for C#, its rival in the alpha stage is lacking, and the third option finally proved worthy. Now the long story:

Last time I tried ANTLR with separate grammar files for the lexer, parser and tree processor, which forced a "mouth feeding" process on the user code using those grammars. This time I tried ANTLR with a combined grammar file, basically all three grammars in one file.

grammar MyLogoCombined;

options {
language=CSharp3;
output=AST;
}

@header {
using System;
}

fragment SIGN
: '+' | '-';
fragment SPACE
: ' ' | '\t';

NUMBER : '0' | SIGN? '1'..'9' '0'..'9'*;
FORWARD : 'FD';
ROTATE : 'RT';

NEWLINE : ('\r'? '\n')+ { Skip(); };
WHITESPACE
: SPACE+ { Skip(); };

public script : statement* EOF!;

statement
	: FORWARD v=NUMBER {
		double length = toDouble($v);
		// convert the heading to radians locally instead of overwriting the stored degrees
		double radians = Math.PI * this.angle / 180;
		x += Math.Cos(radians) * length;
		y += Math.Sin(radians) * length;
		Console.WriteLine("Moved to {0}, {1}", x.ToString("0.#"), y.ToString("0.#"));
	}
	| ROTATE v=NUMBER {
		angle += toDouble($v);
		Console.WriteLine("Facing {0}°", angle.ToString("0.#"));
	};

It doesn't look bad or crammed, and the generated code is better too. The generated classes are a lexer and a merged parser and tree processor. As a result, the user code required to utilize those classes got considerably simpler.

static void CombinedLogo()
{
	using (var input = new StreamReader("input.txt")) {
		MyLogoCombinedLexer lexer = new MyLogoCombinedLexer(new ANTLRReaderStream(input));
		MyLogoCombinedParser parser = new MyLogoCombinedParser(new CommonTokenStream(lexer));
		parser.script();
	}
}

Now we have something usable. But I wasn't satisfied: the generated code requires a 100 kB run-time library (which may look weird compared to an executable weighing 30 kB) and is riddled with comments mentioning the full path to the original grammar file. And there is no way to generate code without those comments. Well, not exactly critical issues, more like unnecessary vices. There is another issue I haven't mentioned earlier: the documentation is not quite there. It exists, that's good, and there are a lot of examples, that's good too, but you won't get your question answered on a single page. You'll have to look in the official docs and unofficial examples, and still do a few experiments to get definite answers.

Irony.Net

On the Stack Overflow page that led me to ANTLR, the next recommended thing was Irony.Net. It's a project hosted on CodePlex with a professional-looking home page. I've downloaded the source code and started looking around. The nicest thing about Irony.Net is the way of defining a grammar: there is no code generation from a specially formatted text file, the grammar is defined as an ordinary C# class. I usually post code as plain text, but for the sake of presentation you'll get an image this time.


That's as awesome as ANTLR's state diagram for grammar rules. Operator overloading is usually syntax evil, but in some cases it's an OK thing to do and Irony.Net has found one. As you can see, the plus and bitwise OR operators are overloaded to operate on non-terminals so that the grammar rules can resemble BNF notation as closely as possible. At the same time, the image presents three issues with Irony.Net: a piece of code commented out, the first parameter of the MakeStarRule method, and the lack of rule attributes (tree processor actions).

The commented-out code is left in on purpose to demonstrate the existence of various flags and properties. While trying to make it work, I read somewhere that I should explicitly specify the creation of an abstract syntax tree, which is usually a normal step in parsers, so I added that line. It turned out that Irony.Net creates a syntax tree by default and that flag actually confuses it. Alright, it confused me too, because it didn't work as in the educational material; it required extra information that I couldn't figure out in the time I had at the moment. I figured it is supposed to indicate that custom tree node objects are being used, and that the creation of those objects is a mysterious art.

That leads us to the lack-of-rule-attributes problem. In order to process the tree you have to either traverse it on your own or use the mysterious art. I really tried to figure out the mysterious art, but the lack of educational materials and limited time bested me (a year after writing this post I did learn how to use the visitor pattern, but still, there are better ways). The biggest problem with Irony.Net is the lack of documentation. It has so many features, and the only way to figure them out is to dig into the source code.

The weirdness of the MakeStarRule method may not be obvious, but why does it require a destination non-terminal and why is it a class member? In my opinion it should be static and have only one mandatory parameter, the repeatable BNF term. The reason it requires a destination non-terminal is that it secretly modifies the rule instead of just building the BNF term. The reason it is non-static is that it uses a certain flag from the grammar object. That is actually OK, because the grammar object is used to build the "language" object, not for direct parsing.

Now that I've mentioned it, using the grammar is quite simple: build a parser from the grammar, feed it the input and get the tree:


Parser parser = new Parser(new IronyLogo());
ParseTree tree;
using (var input = new StreamReader("input.txt"))
tree = parser.Parse(input.ReadToEnd());

For processing the tree I opted for the no-mysterious-art approach, doing it in the user code, and it's not a big hassle:

double x = 0, y = 0, angle = 0;

foreach (ParseTreeNode statement in tree.Root.ChildNodes) {
	ParseTreeNode node = statement.ChildNodes[0];
	int value = (int)node.ChildNodes[1].Token.Value;

	switch (node.Term.Name) {
		case "forward":
			x += value * Math.Cos(angle);
			y += value * Math.Sin(angle);
			break;

		case "rotate":
			angle += Math.PI * value / 180.0;
			break;
	}

	Console.WriteLine("Turtle at {0}, {1}", x.ToString("0.#"), y.ToString("0.#"));
}

A neat thing is that the NumberLiteral type of literal handles string-to-number conversion on its own. In fact, Irony.Net has a lot of features for building programming languages: custom literals for numbers and identifiers and methods for defining operator precedence, to name a few.

So why is Irony.Net a no-go for C#? If you are building a compiler or interpreter it's good, but if you are building a parser for a domain-specific language it's simply not the right tool. You can chop wood with a hammer, but using an axe would be more practical. First of all, it requires a 160 kB run-time library, which is more than ANTLR; it doesn't have elegant grammar attributes; and it is undocumented. Oh my God, it is so undocumented that I have to involve God in the issue. The author himself claims that the source code is enough and is unwilling to write decent documentation (well, the project is still in the alpha phase, so I can forgive him), and sometimes doesn't give an answer beyond "look in the code of a certain example shipped with Irony". There is a moderate number of examples included with the Irony.Net source code, but they are either grammars without attributes or proofs of concept (such as a fully featured interpreter with a lot of layers between grammar and attributes) that are hard to follow while learning.

Coco/R

Googling for a third solution led me back to Stack Overflow, but to a different page this time. Answers on that page presented various new solutions and I checked the first one (actually the second one, because the first is for F#): Coco/R. The link led me to a clean black-on-white page where I quickly found my way to the tutorial that hooked me. At first I was skeptical (a zip file with a 5-year-old PowerPoint presentation), but after skipping the language processing theory, the presentation arrived at a very simple example describing the whole process, from writing a grammar to running the parser. A few slides further it presented grammar attributes and gimmicks such as how to ignore newline characters, make the grammar case insensitive and so on. The post is already long, but heck, I'll publish my first-class Logo grammar anyway:

COMPILER Logo
double turtleX = 0, turtleY = 0, angle = 0;

IGNORECASE

CHARACTERS
digit = "0123456789".

TOKENS
number = digit {digit}.

IGNORE '\t' + '\r' + '\n'

PRODUCTIONS
Logo = {Statement}.
Statement = Forward | Rotate.

Forward (. double length; .)
= "fd" Parameter<out length>
(. turtleX += Math.Cos(angle) * length;
turtleY += Math.Sin(angle) * length;
Console.WriteLine("Turtle at {0}, {1}", 
turtleX.ToString("0.#"), 
turtleY.ToString("0.#"));
.)
.

Rotate (. double ang; .)
= "rt" Parameter<out ang>
(. angle += Math.PI * ang / 180; .)
.

Parameter<out double n>
= number (. n = Convert.ToDouble(t.val); .)
.

END Logo.

You've already seen something similar in the ANTLR grammar: lexical tokens, extended BNF productions, a C# header and attributed code. This one has a period character after almost everything. One thing I should point out is that the parts of the definition have to strictly follow an order of appearance. For instance, the IGNORE part can't appear before the TOKENS or CHARACTERS blocks. The correct order of appearance can be found in the documentation, which exists and is informative. Before I start praising its documentation, let's conclude the usage of Coco/R. This is what the user code looks like:

static void Main(string[] args)
{
	Scanner scanner = new Scanner("input.txt");
	Parser parser = new Parser(scanner);
	parser.Parse();
}

As simple as that. The Scanner can also accept a Stream object instead of a file name, for those insisting on abstraction (which is a good requirement).

Conclusion


Unlike ANTLR or Irony.Net, Coco/R didn't catch me with nasty surprises: it's simple, it works, it requires no run-time files and it is well documented. Though it doesn't have a fancy grammar editor or grammar visualisation, it's a tool that I would use in my projects.

Monday, December 3, 2012

ANTLR, first impression

This post continues the story from the previous one.

So, following the Stack Overflow link, I landed at the ANTLR homepage. Psychological scam alarms went off: a bunch of text riddled with hyperlinks and varying typefaces, a prominent button accompanied by "Oh come on, download me now", a testimonials column and a guy making a silly face. It took me a few seconds to actually start comprehending the content. The image below the silly guy encouraged me to stay; it looked like an FSA visualization I once saw during college, and FSAs are a crucial part of formal language processors. I started to search for something that identifies what the site is about and I read the first few lines:
What is ANTLR?
ANTLR, ANother Tool for Language Recognition,
... lot of technical terms that ordinary eye skips ...
 target language.
"OK, that looks like what I'm looking for", I said to myself. I'll stop with the thorough storytelling now; the silly guy is Terence Parr, the guy behind the ANTLR project. Anyway, the next thing was to get a taste of how the tool works. At first I was disappointed by the fact that it is a command line tool, but that wouldn't stop me from learning it. Soon I found out that it has an additional tool with a GUI, ANTLRWorks, so I downloaded it and started looking for some kind of tutorial. The thing with this kind of software is that you have to learn a metalanguage for defining grammar rules, and examples are the best way to start. I found a good one on a different web site: an example of a not-so-basic calculator/interpreter for Java. I copy-pasted all three grammars into ANTLRWorks, pressed the "generate code" button and it worked. I had a clue how it works but wasn't really ready to work on my own, so I tried a different approach: make a parser for a very simple language.


If you remember the old days of LOGO, you'll smile. My first informatics class in elementary school (5th grade, I remember it as if it was yesterday :) ) was about LOGO and moving the turtle with FD and RT commands. FD 100 moved the turtle forward 100 pixels, and as it moved it left a straight line. RT 90 rotated the turtle 90° clockwise. Repeat that four times and you'd draw a rectangle. That's the simple language I decided to parse. Also, I decided to split the grammar into three parts because the last example did so and I believed it's the proper way to do it. This was my grammar file for the lexer:

lexer grammar MyLogoLexer;

options {
language=CSharp3;
}

fragment SIGN
: '+' | '-';
fragment SPACE
: ' ' | '\t';

NUMBER : '0' | SIGN? '1'..'9' '0'..'9'*;
FORWARD : 'FD';
ROTATE : 'RT';

NEWLINE : ('\r'? '\n')+ { Skip(); };
WHITESPACE
: SPACE+ { Skip(); };

A pretty simple vocabulary: white spaces are skipped, digits are treated as numbers and the keywords are FD and RT. The second grammar describes what is formally called the parser, a syntax analyzer and the part that builds a syntax tree. For the language I was building, the parser was even simpler than the lexer:

parser grammar MyLogoParser;

options {
language=CSharp3;
output = AST;
tokenVocab = MyLogoLexer;
}

script : statement* EOF!;

statement
: FORWARD NUMBER
| ROTATE NUMBER;

Basically, the root of the tree is a script. A script contains zero or more statements, and statements are either "forward" or "rotate" commands. ANTLRWorks is full of little features that help with writing and validating a grammar, and the graphical representation of a rule is one of them. Look at how nice it is, feast your eyes on its awesomeness:


And now the last grammar, the so-called tree grammar, not because it builds the tree but because it traverses the tree and performs node actions:

tree grammar MyLogoTree;

options {
language=CSharp3;
ASTLabelType = CommonTree;
tokenVocab = MyLogoParser;
}

@header {
using System;
}

@members {
	double x = 0, y = 0, angle = 0;

	private double toDouble(CommonTree node) {
		return Double.Parse(node.Text);
	}
}

script : statement*;

statement
	: FORWARD v=NUMBER {
		double length = toDouble($v);
		// convert the heading to radians locally instead of overwriting the stored degrees
		double radians = Math.PI * this.angle / 180;
		x += Math.Cos(radians) * length;
		y += Math.Sin(radians) * length;
		Console.WriteLine("Moved to {0}, {1}", x.ToString("0.#"), y.ToString("0.#"));
	}
	| ROTATE v=NUMBER {
		angle += toDouble($v);
		Console.WriteLine("Facing {0}°", angle.ToString("0.#"));
	};

When I had written the grammars, I hit the "generate code" button, got errors, corrected them, retried, and after a few tries it succeeded. And then I found myself in front of a weird wall: how do I run the generated code? It was already midnight and I was trying to break the habit of staying up late. It should have been a trivial matter, but it turned out to take two hours of my life. I expected the generated source code to look like this:


From the perspective of the user code, a single class that does the job. But instead it turned out that the user code has to mouth-feed EVERY phase on its own. See:



Notice the absence of gray arrows. Things are even worse: there are additional objects the user code has to take care of besides the generated classes. Mimicking the Java example (the "not so basic calculator" mentioned at the beginning), I figured the code should look like this:

static void Main(string[] args)
{
	using (var input = new StreamReader("input.txt")) {
		MyLogoLexer lexer = new MyLogoLexer(new ANTLRReaderStream(input));

		MyLogoParser tokenParser = new MyLogoParser(new CommonTokenStream(lexer));
		tokenParser.TreeAdaptor = new CommonTreeAdaptor();
		var parserResult = tokenParser.script();

		CommonTree tree = (CommonTree)parserResult.Tree;
		MyLogoTree treeProcessor = new MyLogoTree(new CommonTreeNodeStream(tree));

		treeProcessor.script();
	}
}

But there were two big problems: the parser's script method was private and the tree processor's script method didn't exist at all. For the method visibility issue, Google led me back to Stack Overflow. Each parser and tree grammar rule generates a method, and there is an undocumented feature that defines the visibility of those methods. When generating Java code the feature is ignored and all methods are public, but when generating C# code the methods are private unless the undocumented feature is used. Since the example I used as learning material was meant to generate Java code, it didn't mention this issue. The lack of a script method in the tree parser was entirely my fault: I didn't notice that grammar rules have to be repeated in the tree processor even if they have no action. I mean, not all rules have to be repeated, only those that should generate a method.

To summarize, half of the workflow is good, enjoyable and looks mature, but the other half is drastically unpolished. That contrast is a problem in itself: at first you live in high hopes that you are working in a mature IDE such as Eclipse or Visual Studio, and then BAM, this doesn't work, that doesn't work the way you thought, and so on.

As I woke up the next morning I looked at other options, and they aren't looking better than ANTLR. There is one with a nice premise, but it's in the alpha phase. So I decided to try that one and to give ANTLR another chance.

Saturday, December 1, 2012

ANTLR, introduction

I did a little research on libraries for processing formal languages, both for myself and for the project I'm working on, the Stareater. Previous versions of Stareater had a feature that allowed having whole expressions instead of plain constant numbers in data files. For instance, for farmer efficiency you could write something that evaluates to 3 + 0.125 * level of farming technology in a single attribute instead of having multiple attributes: one for the constant part, one for the technology identifier and one for the technology level influence factor. And on top of that, you can change the operators in an expression any way you want. But there is a catch: you have to write it in either normal or reverse Polish notation (prefix and postfix notation, respectively), so the last example would look like this (highlighted part):


It's bearable when expressions are simple, but take a look at the image below. That is the expression for the maximum number of starports on a planet. Floor is a function for rounding down, "^" is the power operator, PLANET_VELICINA is the size of a planet and SVEMIRSKI_PROGRAM_LVL is the level of the relevant technology.


It translates to Floor( (1 + 0.1 * (tech_level - 1)) * (planet_size ^ 1.5) / 100 ). So, why Polish notation instead of "plain English", the infix notation? I wrote about it before, but here is a summary: it's easy to parse. The parser implementation is straightforward and there is no operator precedence issue. For instance, an evaluator of a postfix notation expression can be explained in a few sentences (knowledge of basic data structures assumed): read the input from left to right, push operands onto the top of a stack, apply operators to the operands on the top of the stack, remove the used operands and push the result onto the top of the stack. If the expression was valid, in the end the stack will contain only one element and that is the result of the expression. A parser is very similar to the described evaluator, but instead of calculating, the so-called syntax tree is built. What is a syntax tree, you might wonder? Well, now we are stepping into the area of computer science, and the rest of the post will assume that you have some degree of familiarity with the field of formal language processing.
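The postfix evaluation steps described above really do fit in a few lines. A Python sketch of that stack discipline (the token list and operator set are illustrative):

```python
def eval_postfix(tokens):
    """Evaluate a postfix (reverse Polish) expression given as a token list."""
    stack = []
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b,
           "/": lambda a, b: a / b}
    for token in tokens:                    # read the input left to right
        if token in ops:
            b = stack.pop()                 # top of stack is the right operand
            a = stack.pop()
            stack.append(ops[token](a, b))  # push the result back
        else:
            stack.append(float(token))      # operands go on the stack
    if len(stack) != 1:
        raise ValueError("malformed expression")
    return stack[0]

# "3 + 0.125 * 8" written in postfix:
result = eval_postfix(["3", "0.125", "8", "*", "+"])  # → 4.0
```

A parser would do the same walk but push tree nodes instead of numbers.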


This, the picture above, is the syntax tree for the 3 + 0.125 * level of farming expression. The arrow at the top points to the root of the tree (knowledge of trees from graph theory assumed). To get the value of the expression, the user code (the one using the tree) simply has to ask the root to evaluate itself. The rest is the magic of the composite design pattern: the root sees that it has two operands to sum, so it asks its children to evaluate themselves. Since the first child is a constant, it simply returns 3. The second child, on the other hand, asks its own children to evaluate themselves and multiplies their values. As you can see, a nice utilization of recursion and dynamic polymorphism.
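The composite evaluation just described can be sketched in a few classes. This is a minimal Python sketch of the pattern (the class and variable names are illustrative, not from Stareater's code):

```python
class Constant:
    """Leaf node: evaluates to a fixed value."""
    def __init__(self, value):
        self.value = value
    def evaluate(self, variables):
        return self.value

class Variable:
    """Leaf node: looks its value up in the supplied environment."""
    def __init__(self, name):
        self.name = name
    def evaluate(self, variables):
        return variables[self.name]

class Operator:
    """Inner node: asks its children to evaluate themselves, then combines them."""
    def __init__(self, apply, *children):
        self.apply = apply
        self.children = children
    def evaluate(self, variables):
        return self.apply(*(child.evaluate(variables) for child in self.children))

# syntax tree for: 3 + 0.125 * level_of_farming
tree = Operator(lambda a, b: a + b,
                Constant(3),
                Operator(lambda a, b: a * b,
                         Constant(0.125),
                         Variable("level_of_farming")))
value = tree.evaluate({"level_of_farming": 8})  # → 4.0
```

Asking the root to evaluate itself recurses down exactly as the paragraph describes: the sum asks the product, the product asks the constant and the variable.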

Building a syntax tree for an infix expression is a complicated thing. At first there are binary operators (it is assumed the reader knows what that means), which come with one set of not-so-easy-to-implement rules; then add unary operators to the mix and you get stuff complicated by a factor of 1.5; and things can be complicated even further with brackets and functions. On top of that, you have to decide whether a / b / c should be interpreted as (a / b) / c, as a / (b / c), or rejected as invalid. It would be hell to code and test all of it from scratch. Fortunately, our fathers and grandfathers solved that for us when they invented compilers. Programmers use compilers for transforming source code into executable files, but compilers are more general than that: they translate one language into another. In the case of C++, source code written by the rules of C++ is translated into machine code for the target hardware. There is a fair amount of computer science theory involved here, and it is called formal language theory.
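The precedence and associativity decisions above are what classic algorithms like Dijkstra's shunting yard take care of. A minimal Python sketch (no brackets, functions or unary operators, just to show the mechanism) that treats every operator as left-associative, so a / b / c comes out as (a / b) / c:

```python
def infix_to_postfix(tokens):
    """Shunting-yard sketch: infix tokens to postfix, left-associative operators only."""
    precedence = {"+": 1, "-": 1, "*": 2, "/": 2}
    output, operators = [], []
    for token in tokens:
        if token in precedence:
            # pop waiting operators of equal or higher precedence first;
            # popping on *equal* precedence is what makes a / b / c
            # come out as (a / b) / c rather than a / (b / c)
            while operators and precedence[operators[-1]] >= precedence[token]:
                output.append(operators.pop())
            operators.append(token)
        else:
            output.append(token)    # operands go straight to the output
    while operators:
        output.append(operators.pop())
    return output

postfix = infix_to_postfix(["a", "/", "b", "/", "c"])  # → ['a', 'b', '/', 'c', '/']
```

Changing that `>=` to `>` for a given operator would make it right-associative instead, which is the whole associativity decision in one character.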

Main phases of compilation process

Every programming language is a formal language and as such has a vocabulary (a set of lexemes), grammar rules (syntax) and semantic rules. The vocabulary usually contains a list of keywords (if, for and default, for example) and regular expressions for variable names, constant numeric and textual values and other simple stuff, though regular expressions are formal languages of their own. Grammar rules define more complicated stuff, such as: an assignment has the form of a variable, followed by the '=' character, followed by an expression, and ended with ';'. Semantic rules cover things that are hard to cover with the grammar, such as checking whether a value's type can be stored in a variable. There are tools for generating source code for a compiler's analysis phases, an inception one might say.

The problem of parsing text containing an infix expression is basically that: lexical, syntax and semantic analysis. So I googled a little bit for C# solutions and found that most queries lead to ANTLR, a Java tool that can generate C# source code from grammar rules. More about my impressions of the tool in the following post(s).

Sunday, August 12, 2012

A blog for Zvjezdojedac

I've decided to separate the "philosophical" posts from Zvjezdojedac's progress reports. From now on, posts related to that project will go to a separate blog:

http://stareater4x.blogspot.com/

The blog will be written in English, but the first few posts will be in Croatian since they are posts from this blog. And yes, version 0.4 is done.

Monday, June 4, 2012

Zvjezdojedac 0.3.3


http://code.google.com/p/zvjezdojedac/downloads/list

Last week I put a new version of Zvjezdojedac up for download. As I wrote in previous posts, the new version implements space combat. Along with space combat itself I've added a bit more content, a warp drive and a particle shield, though those components were added purely to test space combat.

Thursday, May 24, 2012

Space combat committed

Whoohoo! After five months the source code is finally stable. Admittedly, there is one place where it breaks: the old ship design interface. Once I fix that, test the battles a bit more and polish the battle interface a little, a new build goes to the download section.

Monday, May 21, 2012

Space combat works

Space combat in Zvjezdojedac is implemented. I haven't fully tested it yet, but chance to hit, damage reduction and ship destruction work. As soon as battles can be resolved, I'll upload to the repository. Currently the combat turns count down but nothing happens when the "last" one passes, and complete destruction of the opposing side also doesn't currently resolve the battle.

I found out why the bug on game load happened, the one that suddenly gave players a few thousand ships. The old decimal point problem. The amount of construction remainder was stored with a decimal comma (Windows regional settings), while loading expected a decimal point (an explicitly specified format). MICROSOFT, WHY DO YOU MIX MACHINE AND CULTURE-DEPENDENT NUMBER REPRESENTATIONS? In C#, the double data type stores real numbers (with more or less precision), and a to-text conversion method can be called on every double variable and constant (so to speak). My problem arises because the number is converted to text according to the operating system's regional settings. Personally, I think it shouldn't work that way, because a double value by itself doesn't belong to any country or culture. When a double value is asked for its textual representation, some universal format should be used, universal in the sense that it depends on nothing but the numeric value itself. A Turk, an Arab, a Canadian and a Frenchman should get the same textual representation for the same double value. For displaying a number in a culture-dependent format, a method with an explicitly specified number-to-text conversion should be used.

Back on topic. The last few days I've been grappling with game balancing. I wanted to concentrate on testing and finishing space combat, but the numbers I set long ago for attack and defense strength are such that a battle lasts forever. Since I had previously been testing technologies and automatic upgrade mechanisms, it turned out that players start with fully upgraded tier 3 armor, so an ordinary colonizer has 12k hit points, while the attacker I tested with had 8 machine guns doing about 9 damage each against that armor. With numbers like that, a battle lasted 170 rounds, while I'm aiming for a battle round limit of around 20. The problem with colonizers is that they have to be huge on one hand and vulnerable on the other. I had already thought about adding special equipment for expanding cargo space and I think I'll do that in this version. Something in the style of battle pods from Master of Orion 2, but with hit points reduced along with the capacity increase. That's the easy part of the job; the black hole that sucks up hours of my work is choosing the numbers for weapon damage. Partly empirically and partly exactly, I arrived at principles for determining the numbers for the capacity and durability of hull types and all the components affecting hit chance, but the damage numbers just won't come. The principles I'd stick to are that ships of the same size and technology level should destroy each other within 5 rounds, and that more or less every military ship should be able to destroy a colonizer within a reasonable time.

But I'll figure something out, finish testing and upload to the repository. After that, polish the GUI a bit, package a release and put it up for download.