
Wednesday, March 27, 2013

Coco/R and parsing strings

Writing a parser for Stareater with Coco/R as the parser generator was easy. It requires a slightly different way of thinking (all nonterminals "return void", so output is passed through method parameters marked with the out keyword), but the overhead imposed by Coco/R was minimal. And then I wanted to test the parser. I figured unit tests would be appropriate, and while writing the first unit test I got stuck at the scanner constructor. The scanner is the part that converts input to tokens (lexemes), and the default Coco/R scanner can be constructed either from a file name (or relative path) string or from a Stream object. The problem was that my parser was intended for in-memory strings.

A quick search showed that a string can be written to a memory stream in a few lines[1]. If you want to be quick and don't mind the bloat, you can stop reading; here is your solution:

var stream = new MemoryStream(Encoding.UTF8.GetBytes(text));

Parser parser = new Parser(new Scanner(stream));

Why do I call this bloated? Because:

Under the hood, the default scanner implementation wraps the stream in a buffer. The buffer does what the name says, buffering, plus conversion of bytes to characters. I'm not going to discuss their design decisions, such as why they didn't use BufferedStream. When implementing the IKON reader for .NET I opted for TextReader as the base class for input, because conversion from input to characters is guaranteed by the interface, and there are StreamReader and StringReader classes that implement TextReader and can read from any stream or from an in-memory string. Maybe they had some reason to insist on Stream, but that is not the point of this post. The point is that everything above describes the default Coco/R scanner.
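
To make the TextReader remark concrete, here is a minimal sketch that has nothing to do with Coco/R itself. ReadAllChars and ReaderDemo are hypothetical names invented for this illustration; only TextReader, StringReader, StreamReader and MemoryStream come from .NET. The same character-consuming code works on an in-memory string and on a byte stream.

```csharp
using System;
using System.IO;
using System.Text;

static class ReaderDemo
{
    // Hypothetical helper: consumes characters one at a time, the way a scanner would.
    public static string ReadAllChars(TextReader reader)
    {
        var sb = new StringBuilder();
        int c;
        while ((c = reader.Read()) != -1)   // TextReader.Read() returns -1 at end of input
            sb.Append((char)c);
        return sb.ToString();
    }

    public static void Main()
    {
        string text = "a = 1\nb = 2";

        // In-memory string: no byte conversion involved.
        using (var fromString = new StringReader(text))
            Console.WriteLine(ReadAllChars(fromString) == text);

        // Any stream: StreamReader does the byte-to-character decoding.
        var stream = new MemoryStream(Encoding.UTF8.GetBytes(text));
        using (var fromStream = new StreamReader(stream, Encoding.UTF8))
            Console.WriteLine(ReadAllChars(fromStream) == text);
    }
}
```

Both comparisons hold, so the consumer never needs to know where the characters came from.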

Both the Coco/R parser and scanner are based on frame files. Those files are a sort of blueprint: C# code interleaved with placeholders for generated code. When generating the parser and scanner, the command-line tool must be supplied with those files along with the grammar specification file, though they don't have to be named explicitly if they are in the same folder as the grammar file. By customizing the scanner's frame file, the parsing bloat can be reduced. Below are the steps for making a frame for a scanner that only accepts a string as input.



Buffers can be ditched altogether, since an in-memory string is already buffered and already a collection of characters. Feel free to completely delete the Buffer and UTF8Buffer classes from the scanner's frame, but keep in mind that the generated scanner code depends on the Buffer.EOF constant. I prefer to hide it in a private static nested class inside the Scanner class, but it's also valid to just leave the original Buffer as is.

static class Buffer
{
  public const int EOF = char.MaxValue + 1;
}
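
A small sketch of why that value works as an end marker, assuming the scanner keeps its current character in an int field (as the assignments in the NextCh method further down suggest). EofDemo and IsEof are hypothetical names used only for this illustration.

```csharp
using System;

static class EofDemo
{
    // Same value as the Buffer.EOF constant above: one past the largest char.
    public const int EOF = char.MaxValue + 1;

    // Hypothetical helper: true only for the sentinel, never for a real character.
    public static bool IsEof(int ch)
    {
        return ch == EOF;
    }

    public static void Main()
    {
        int ch = char.MaxValue;            // largest possible character, 65535
        Console.WriteLine(IsEof(ch));      // False
        ch = EOF;                          // 65536 fits because ch is an int, not a char
        Console.WriteLine(IsEof(ch));      // True
    }
}
```

Since no char can ever hold 65536, the sentinel cannot collide with anything read from the input string.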



Next, add a field that references the input string and initialize it in the constructor.


public string input; // scanner input
public Scanner (string input) {
    this.input = input;
    Init();
}



Then clean up the Init method. Aside from initialization, that method detects whether the stream is encoded in ASCII or UTF-8. Since the new scanner works directly with characters, encoding detection can be omitted.


void Init() {
    pos = -1; line = 1; col = 0; charPos = -1;
    oldEols = 0;
    NextCh();
    pt = tokens = new Token();  // first token is a dummy
}



And finally, modify the NextCh method to raise "end of file" when the end of the string is reached.


    void NextCh() {
        if (oldEols > 0) { ch = EOL; oldEols--; }
        else {
            pos = charPos;
            charPos++;

            if (charPos >= input.Length)
                ch = Buffer.EOF;
            else
            {
                ch = input[charPos]; col++;
                // replace isolated '\r' by '\n' in order to make
                // eol handling uniform across Windows, Unix and Mac
                // (a '\r' at the very end of the input counts as isolated too)
                if (ch == '\r' && (charPos + 1 >= input.Length || input[charPos + 1] != '\n')) ch = EOL;
                if (ch == EOL) { line++; col = 0; }
            }
        }
-->casing1
    }



That's it! If you want to support both strings and streams, you could make similar modifications with TextReader instead of string. I haven't tried that yet, since the idea occurred to me while writing this post.
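
A rough sketch of what the TextReader idea might look like, written as a stand-alone class rather than as a real Coco/R frame, and not tested against generated scanner code. ReaderCharSource is a hypothetical name, and the pos, charPos and oldEols bookkeeping from the real NextCh is left out for brevity. It relies only on TextReader.Read() and TextReader.Peek(), both of which return -1 when no character is available.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for the character-reading part of the scanner only.
class ReaderCharSource
{
    public const int EOF = char.MaxValue + 1;  // same sentinel as Buffer.EOF
    const char EOL = '\n';

    readonly TextReader reader;
    public int ch;                 // current character, or EOF
    public int line = 1, col = 0;  // position of the current character

    public ReaderCharSource(TextReader reader)
    {
        this.reader = reader;
        NextCh();
    }

    public void NextCh()
    {
        int next = reader.Read();              // -1 at end of input
        if (next == -1) { ch = EOF; return; }

        ch = next; col++;
        // Peek() returns -1 at the end, so a trailing '\r' is treated as isolated.
        if (ch == '\r' && reader.Peek() != '\n') ch = EOL;
        if (ch == EOL) { line++; col = 0; }
    }
}

static class Program
{
    public static void Main()
    {
        // One isolated '\r' and one "\r\n" pair: two line breaks in total.
        var source = new ReaderCharSource(new StringReader("ab\rc\r\nd"));
        while (source.ch != ReaderCharSource.EOF) source.NextCh();
        Console.WriteLine(source.line);        // prints 3
    }
}
```

Passing a StringReader covers in-memory strings and a StreamReader covers files and other streams. One caveat: StreamReader.Peek() is documented to return -1 also when the underlying stream does not support seeking, so the '\r' handling may need more care for such streams.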
