
Saturday, December 1, 2012

ANTLR, introduction

I did a little research on libraries for processing formal languages, both for myself and for the project I'm working on, the Stareater. Previous versions of Stareater had a feature that allowed whole expressions instead of plain constant numbers in data files. For instance, for farmer efficiency you could write something that evaluates to 3+0.125*level of farming technology in a single attribute instead of having multiple attributes: one for the constant part, one for the technology identifier and one for the technology level influence factor. And on top of that, you can change the operators in an expression any way you want. But there is a catch: you have to write it in either normal or reverse Polish notation (prefix and postfix notation respectively), so the last example would look like this (highlighted part):


It's bearable when expressions are simple, but take a look at the image below. That is the expression for the maximum number of starports on a planet. Floor is a function for rounding down, "^" is the power operator, PLANET_VELICINA is the size of the planet and SVEMIRSKI_PROGRAM_LVL is the level of the relevant technology.


It translates to Floor( (1 + 0.1 * (tech_level - 1)) * (planet_size ^ 1.5) / 100 ). So, why Polish notation instead of "plain English", the infix notation? I wrote about it before, but here is a summary: it's easy to parse. The parser implementation is straightforward since there is no operator precedence issue. For instance, an evaluator for postfix notation expressions can be explained in a few sentences (knowledge of basic data structures assumed): read the input from left to right, push operands onto the top of the stack, apply each operator to the operands on the top of the stack, remove the used operands and push the result onto the top of the stack. If the expression was valid, in the end the stack contains only one element, and that is the value of the expression. A parser is very similar to the described evaluator, but instead of calculating it builds a so-called syntax tree. What is a syntax tree, you might wonder? Well, now we are stepping into the area of computer science, and the rest of the post assumes some degree of familiarity with the field of formal language processing.
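
For illustration, here is a minimal sketch of such a postfix evaluator in C# (my own example, not Stareater's actual code), assuming tokens are separated by whitespace and only the four basic binary operators are supported:

    // A minimal postfix (RPN) evaluator sketch - illustrative only.
    // Assumes tokens are separated by whitespace and operators are binary.
    using System;
    using System.Collections.Generic;
    using System.Globalization;

    static class PostfixEvaluator
    {
        public static double Evaluate(string expression)
        {
            var stack = new Stack<double>();

            foreach (var token in expression.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            {
                switch (token)
                {
                    case "+": case "-": case "*": case "/":
                        // Apply the operator to the top two operands and push the result back.
                        double right = stack.Pop();
                        double left = stack.Pop();
                        stack.Push(token == "+" ? left + right :
                                   token == "-" ? left - right :
                                   token == "*" ? left * right : left / right);
                        break;
                    default:
                        // Anything else is treated as a numeric operand.
                        stack.Push(double.Parse(token, CultureInfo.InvariantCulture));
                        break;
                }
            }

            // A valid expression leaves exactly one value on the stack.
            return stack.Pop();
        }
    }

Feeding it "3 0.125 4 * +" (the farmer efficiency example in postfix form, with technology level 4) returns 3.5.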


This, the picture above, is the syntax tree for the 3+0.125*level of farming expression. The arrow at the top points to the root of the tree (knowledge of trees from graph theory assumed). To get the value of the expression, the user code (the one using the tree) simply has to ask the root to evaluate itself. The rest is the magic of the composite design pattern: the root sees that it needs to sum two operands, so it asks its children to evaluate themselves. Since the first child is a constant, it simply returns 3. The second child, on the other hand, asks its own children to evaluate themselves and multiplies their values. As you can see, a nice use of recursion and dynamic polymorphism.
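
In code, that composite structure could be sketched roughly like this (class names are mine, this is not Stareater's actual implementation):

    // Illustrative composite-pattern sketch of a syntax tree.
    abstract class Node
    {
        public abstract double Evaluate();
    }

    class Constant : Node
    {
        private readonly double value;
        public Constant(double value) { this.value = value; }
        public override double Evaluate() { return value; }
    }

    class Sum : Node
    {
        private readonly Node left, right;
        public Sum(Node left, Node right) { this.left = left; this.right = right; }
        // Recursively asks both children to evaluate themselves.
        public override double Evaluate() { return left.Evaluate() + right.Evaluate(); }
    }

    class Product : Node
    {
        private readonly Node left, right;
        public Product(Node left, Node right) { this.left = left; this.right = right; }
        public override double Evaluate() { return left.Evaluate() * right.Evaluate(); }
    }

    // The tree from the picture: 3 + 0.125 * farming level (level shown as a constant here).
    // Node root = new Sum(new Constant(3), new Product(new Constant(0.125), new Constant(4)));
    // double result = root.Evaluate(); // 3.5

Adding a new operator only means adding a new Node subclass; the user code never does anything other than call Evaluate on the root.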

Building a syntax tree for an infix expression is a complicated thing. First there are binary operators (it is assumed the reader knows what that means), which come with one set of not-so-easy-to-implement rules; add unary operators to the mix and things get roughly 1.5 times more complicated, and brackets and functions complicate them even further. On top of that, you have to decide whether a / b / c should be interpreted as (a / b) / c, as a / (b / c), or rejected as invalid. It would be hell to code and test all of it from scratch. Fortunately, our fathers and grandfathers solved that for us when they invented compilers. Programmers use compilers for transforming source code into executable files, but compilers are more general than that: they translate one language into another. In the case of C++, source code written by the rules of C++ is translated into machine code for the target hardware. There is a fair amount of computer science theory involved here, and it goes by the name of formal languages.
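
Just to illustrate the associativity decision: a hand-written recursive-descent parser typically makes '/' left-associative with a loop like the one below. This is only a sketch, reusing the Node and Constant classes from the previous example; it handles nothing but numbers and division.

    // A sketch of how a hand-written parser makes '/' left-associative,
    // so that "a / b / c" is built as (a / b) / c.
    using System.Collections.Generic;
    using System.Globalization;

    class Division : Node
    {
        private readonly Node left, right;
        public Division(Node left, Node right) { this.left = left; this.right = right; }
        public override double Evaluate() { return left.Evaluate() / right.Evaluate(); }
    }

    class DivisionChainParser
    {
        private readonly IList<string> tokens;
        private int position;

        public DivisionChainParser(IList<string> tokens) { this.tokens = tokens; }

        public Node Parse()
        {
            // Start with the leftmost operand...
            Node result = ParseNumber();

            // ...and fold every following "/ operand" into the tree from the left,
            // so earlier divisions become children of later ones.
            while (position < tokens.Count && tokens[position] == "/")
            {
                position++; // consume the '/'
                result = new Division(result, ParseNumber());
            }
            return result;
        }

        private Node ParseNumber()
        {
            return new Constant(double.Parse(tokens[position++], CultureInfo.InvariantCulture));
        }
    }

And that is only one operator with no precedence, brackets, functions or error handling; multiply this by a full expression language and "hell to code and test from scratch" is not an exaggeration.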

Main phases of the compilation process

Every programming language is a formal language and as such has a vocabulary (a set of lexemes), grammar rules (syntax) and semantic rules. The vocabulary usually contains a list of keywords (if, for and default, for example) and regular expressions for variable names, constant numeric and textual values and other simple stuff, though regular expressions are formal languages of their own. Grammar rules define more complicated stuff, such as: an assignment has the form of a variable, followed by the '=' character, followed by an expression, and ended with ';'. Semantic rules cover things that are hard to express with the grammar, such as checking whether the type of a value can be stored in a variable. There are tools for generating the source code of a compiler's analysis phases, an inception one might say.
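
As a toy illustration of the lexical part (my own sketch, nothing to do with what such tools actually generate), keywords can be listed explicitly while names and numbers are described by regular expressions; here every lexeme is assumed to be separated by whitespace:

    // A toy lexer sketch: classifies whitespace-separated lexemes into
    // keyword, number, name and symbol categories.
    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    static class ToyLexer
    {
        private static readonly HashSet<string> Keywords =
            new HashSet<string> { "if", "for", "default" };

        private static readonly Regex Name = new Regex(@"^[A-Za-z_]\w*$");
        private static readonly Regex Number = new Regex(@"^\d+(\.\d+)?$");

        public static IEnumerable<string> Tokenize(string source)
        {
            foreach (var lexeme in source.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            {
                if (Keywords.Contains(lexeme)) yield return "keyword: " + lexeme;
                else if (Number.IsMatch(lexeme)) yield return "number: " + lexeme;
                else if (Name.IsMatch(lexeme)) yield return "name: " + lexeme;
                else yield return "symbol: " + lexeme;
            }
        }
    }

For example, ToyLexer.Tokenize("if farmer_bonus = 3.5 ;") classifies "if" as a keyword, "farmer_bonus" as a name, "3.5" as a number and '=' and ';' as symbols; the syntax analysis phase then works with those categorized tokens instead of raw characters.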

The problem of parsing text containing an infix expression is basically that: lexical, syntax and semantic analysis. So I googled a little bit for C# solutions and found that most queries lead to ANTLR, a Java tool that can generate C# source code from grammar rules. More about my impressions of the tool in the following post(s).
