Parsing Techniques in Limpid

Parsing Character Input

Unless otherwise set, a character (byte) source is treated as UTF-8. This means that single-byte 7-bit ASCII, as well as multi-byte character encodings of up to 6 bytes, are processed automatically.

The bytes are read from an input stream and processed by a CharacterConverter inside a Reader.
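To make the byte-to-character step concrete, the following is a minimal Python sketch of a decoder that accepts the 1- to 6-byte sequence forms mentioned above. It is not Limpid's CharacterConverter; the function name decode_utf8 is hypothetical, and the sketch assumes the input is already fully in memory rather than arriving from a stream.

```python
def decode_utf8(data: bytes) -> list:
    """Decode bytes into integer code points (illustrative sketch only).

    Accepts the original 1- to 6-byte UTF-8 forms. It does not reject
    overlong encodings or surrogates, which a strict decoder would.
    """
    # (lead-byte prefix mask, expected prefix, payload mask, continuation bytes)
    forms = [
        (0x80, 0x00, 0x7F, 0),  # 0xxxxxxx: 7-bit ASCII
        (0xE0, 0xC0, 0x1F, 1),  # 110xxxxx
        (0xF0, 0xE0, 0x0F, 2),  # 1110xxxx
        (0xF8, 0xF0, 0x07, 3),  # 11110xxx
        (0xFC, 0xF8, 0x03, 4),  # 111110xx (legacy 5-byte form)
        (0xFE, 0xFC, 0x01, 5),  # 1111110x (legacy 6-byte form)
    ]
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        for prefix_mask, prefix, payload_mask, extra in forms:
            if (lead & prefix_mask) == prefix:
                break
        else:
            raise ValueError(f"invalid lead byte 0x{lead:02x} at offset {i}")

        tail = data[i + 1 : i + 1 + extra]
        if len(tail) != extra:
            raise ValueError(f"truncated sequence at offset {i}")

        # Start with the payload bits of the lead byte, then append
        # six bits from each continuation byte (10xxxxxx).
        value = lead & payload_mask
        for byte in tail:
            if (byte & 0xC0) != 0x80:
                raise ValueError(f"invalid continuation byte at offset {i}")
            value = (value << 6) | (byte & 0x3F)

        code_points.append(value)
        i += 1 + extra
    return code_points
```

For example, decode_utf8("a€".encode("utf-8")) returns the two code points 0x61 and 0x20AC, the first from a single byte and the second from a three-byte sequence.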

Forming Characters into String Tokens

A Reader is used by a TokenStream to assemble tokens. This is automatically managed on the basis of defined whitespace and defined single-character delimiters, selected by setWhitespace() and setDelimiters().

Environment      Whitespace Chars    Delimiter Chars
XML Parsing      " \n\t "            "=/\"\'!?"
DTD Parsing      " \n\t "            "<>()=,|*+?!#\"\'%&;"
XPath Parsing    " \n\t "            "|/^[]()=!@:,\"'$+-*<>"
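The token assembly described above can be sketched in a few lines of Python. This is an illustration of the general technique, not Limpid's TokenStream: the function name tokenize and its parameters are hypothetical, and the defaults are taken from the XPath row of the table. Whitespace ends the current token and is discarded; each delimiter ends the current token and is emitted as a single-character token of its own.

```python
def tokenize(text, whitespace=" \n\t", delimiters="|/^[]()=!@:,\"'$+-*<>"):
    """Split text into tokens using whitespace and single-character delimiters."""
    tokens = []
    current = []  # characters of the token being assembled

    def flush():
        # Emit the pending token, if any, and start a new one.
        if current:
            tokens.append("".join(current))
            current.clear()

    for ch in text:
        if ch in whitespace:
            flush()              # whitespace separates tokens and is dropped
        elif ch in delimiters:
            flush()
            tokens.append(ch)    # a delimiter is a token in its own right
        else:
            current.append(ch)
    flush()
    return tokens
```

With the XPath settings, tokenize("/a/b[@id='x']") yields the tokens /, a, /, b, [, @, id, =, ', x, ', and ]. Note that in this sketch quoted strings are not treated specially; the quote characters simply appear as delimiter tokens, and it is left to the consumer to reassemble literals.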

Using the Tokens: XML

Nodes are constructed by XMLParser. This is quite straightforward, involving a loop consisting of:

Using the Tokens: DTD

DTD has a deceptive syntax, superficially like XML, but it is less regular and requires much more decision making during the parse. In Limpid, the information content of the DTD is extracted and used to construct a simple, but W3C-compliant, Schema document.
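As a rough illustration of turning DTD information into a Schema fragment, the Python sketch below maps one element declaration with a plain child sequence, such as <!ELEMENT note (to,from)>, onto an xs:element containing an xs:sequence. It is an assumption-laden example rather than Limpid's conversion: the function name element_decl_to_schema is hypothetical, the declaration is assumed to be already parsed into a name and a list of child names, and occurrence indicators, choices, attributes and entities are ignored.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"  # W3C XML Schema namespace


def element_decl_to_schema(name, children):
    """Build a minimal xs:schema for an element with a fixed child sequence."""
    ET.register_namespace("xs", XS)
    schema = ET.Element(f"{{{XS}}}schema")
    element = ET.SubElement(schema, f"{{{XS}}}element", name=name)
    complex_type = ET.SubElement(element, f"{{{XS}}}complexType")
    sequence = ET.SubElement(complex_type, f"{{{XS}}}sequence")
    for child in children:
        # Each child is referenced by name; its own declaration would be
        # emitted separately when that <!ELEMENT ...> is processed.
        ET.SubElement(sequence, f"{{{XS}}}element", ref=child)
    return schema
```

Calling ET.tostring(element_decl_to_schema("note", ["to", "from"]), encoding="unicode") serializes the resulting tree for inspection.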

Using the Tokens: XPath

XPath has a complicated and flexible syntax. To be W3C-compliant, it is necessary to implement a full LALR (look-ahead LR, that is, left-to-right scan producing a rightmost derivation) parser. Unlike XML and DTD, the grammatical productions for XPath expressions have the following complications:

The steps involved in XPath processing are:

This sequence has the same overall outcome as the stack-based approach in classical texts for LALR parsing, and shows very nicely just how useful the DOM can be for organizing complex data.