Miles to go ...

Goparsec: Parsing by composing

By far the easiest way to write parser is by matching input text as terminal nodes and then composing a tree of terminal and non-terminal nodes. Terminal nodes, the leaf nodes, actually match the input text, while Non-Terminal nodes are composed using Terminal nodes.

If values can be composed together to create higher ordered values, as in Terminal nodes and Non-Terminal nodes, then, the same can be modeled with functions. That is, we define simple functions and compose them together to create a higher order function. The simple functions, called terminal parsers, match input text.

I guess that is enough of theory, let us look at an example:

64.242.88.10 [07/Mar/2004:16:05:49] GET,/index.html,401,1234

Above line is pulled out of a log file. There is a format to this line, with following components: (ip, timestamp, comma-separated-values)

parsing ip

Format of ip is: ".""."".", which can be parsed as:

ydelimit = parsec.Atom(".")
yip := parsec.And(
    nil,  // Nodify
    parsec.Int(),
    ydelimit, parsec.Int(),
    ydelimit, parsec.Int(),
    ydelimit, parsec.Int(),
)

Nodify

What happens after all the parsers supplied to the And combinator succeed in matching the input text ?

The first argument, called Nodify callback, if not nil, will be dispatched. The signature of Nodify is:

type Nodify func(nodes []ParsecNode) node ParsecNode

The array nodes carry a node item for each matching parser supplied to the And combinator. In this case, len(nodes) will be 7. Nodify callback is user supplied function and it is up to the user to decide what to do with nodes, and finally a node item returned. This node item, typed as ParsecNode, can be anything; it is just an alias for interface{}.

If our application is going to consume the ip as string, then we can use the Nodify callback to do some interesting stuff:

tostring := func(nodes []ParsecNode) node ParsecNode {
    s := ""
    for _, node := range nodes {
        s += node.(*Terminal).Value
    }
    return s
}
yip := parsec.And(
    tostring,
    parsec.Int(),
    ydelimit, parsec.Int(),
    ydelimit, parsec.Int(),
    ydelimit, parsec.Int(),
)

Here we are extracting the value of each terminal and concatenating it to create the full ip address and return the string as ParsecNode.

timestamp

For time-stamp, we can’t match the string as it is because other log lines, in the same file, will have different time-stamp values. But the time-stamp has a pattern: \[[0-9]{2}/[a-zA-Z]{3}/[0-9]{4}:[0-9]{2}:[0-9]{2}:[0-9]{2}\]

ytm := parsec.Token(
    `\[[0-9]{2}/[a-zA-Z]{3}/[0-9]{4}:[0-9]{2}:[0-9]{2}:[0-9]{2}\]`,
    "TIMESTAMP")

ytm can parse any time-stamp in the log file as long as it follows the same pattern.

comma-separated-values

Parsing comma separated values can be more complex, but that give us the opportunity to learn more about parsec.

vector2scalar := func(nodes []parsec.ParsecNode) parsec.ParsecNode {
    return notes[0]
}
concat := func(nodes []parsec.ParsecNode) parsec.ParsecNode {
    s := nodes[0].(string)
    s = s[1 : len(s)-1]
    return s
}
ystr := parsec.And(concat, parsec.String())
yatom := parsec.Token("[a-zA-Z][a-zA-Z0-9_\.-]+", "ATOM")
yterm := parsec.OrdChoice(vector2scalar, ystr, yatom)
ycomma := parsec.Token(`,`, "FIELDSEP")
ycsv := parsec.Kleene(nil, parsec.Maybe(maybenode, yterm), ycomma)

Values can be one of the following:

In the above parsing logic, values are parsed as yterm. Since a term can be one of the value described above, we use OrdChoice.

Note that there is an ambiguity here, between second type of value, an Integer and third type of value, an Atom. In such cases, we should compose the parser in such a way that, more specific parsers are tried before trying more generic parsers. Although OrdChoice parses one of the value as ParsecNode, it is returned as an array of []ParsecNode with arity one. This is to keep it consistent with rest of the combinators like And, Kleene, Many. To unwrap the single item in the array we are once again using a nodify callback.

And finally we tie them up using the Kleene combinator. Kleene is used because we expect ZERO or more terms in the input. Combinators like Kleene and Many can take two parser, the first one parses the input for the actual value and the second one parses the separator token. In our case the separator token in a COMMA ,. Kleene will repeatedly apply the two parsers until they fail matching the input, at which point all o/p from the first parser will be collected in an array, of []ParsecNode, and dispatched to the nodify callback and/or returned back.

A note on Maybe: If input contains a string like GET,index.html,,, where the term can be empty, we still want the first parser to succeed so the Kleene can continue matching the remaining string. If indeed the input contains an empty term then Maybe combinator that wraps the term will return ParsecNode as MissingNone. And the final []ParsecNode returned by Kleene will contain MissingNone for all missing terms.

text = `64.242.88.10 [07/Mar/2004:16:05:49] GET,"/index.html",401,1234`
y := parsec.And(nil, yip, ytm, ycsv)
scanner := NewScanner([]byte(text))
node, scanner := y(scanner)

So far we have only constructed parser functions, by composing simple parsers to create complex parser, using one of the Combinators like And, OrdChoice, Kleene, Many, Maybe. To actually parse the text we will have to use a scanner, that implement parsec.Scanner{} interface. Goparser provides a simple scanner constructed by calling NewScanner.