Introduction to Streams and Traversers

Introduction to NodeSource and Streams

If I were to nominate a single unifying class inside the Limpid API, apart from the DOM classes themselves, it would probably be NodeSource. This is an abstract class with a simple interface:

Node& nextNode();
void setRoot(Node& root);
NodeStream& copy();

This interface is implemented by several concrete classes, the most useful of which are:

Managing a Stream

A recurrent idiom throughout the Limpid source is a primitive while loop to process a sequence of nodes from a stream:

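NodeStream stream(root, true, true);
XMLWriter xmlWriter;  // construct the writer once, outside the loop

while (true) {
  Node& node = stream.nextNode();
  if (!node.isValid()) break;  // an invalid node marks the end of the stream

  xmlWriter.process(node);
}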
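Because the loop always has the same shape, it can be written once and reused. The following is only a sketch of that idea and does not use the Limpid headers: DemoNode, DemoSource, forEachNode and collectNames are hypothetical stand-ins invented for illustration, and the one assumption carried over from the idiom above is that an exhausted source returns a node for which isValid() is false.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a Limpid node; it carries only what the loop needs.
struct DemoNode {
    std::string name;
    bool valid;
    bool isValid() const { return valid; }
};

// Hypothetical source: hands out its nodes in order, then an invalid
// sentinel node once the sequence is exhausted.
class DemoSource {
public:
    explicit DemoSource(const std::vector<std::string>& names) {
        for (const std::string& n : names) nodes_.push_back(DemoNode{n, true});
    }

    DemoNode& nextNode() {
        if (pos_ < nodes_.size()) return nodes_[pos_++];
        return sentinel_;
    }

private:
    std::vector<DemoNode> nodes_;
    std::size_t pos_ = 0;
    DemoNode sentinel_{"", false};
};

// The while-loop idiom, written once: pull nodes until the sentinel
// appears and hand each valid node to fn. Returns the number processed.
template <typename Source, typename Fn>
std::size_t forEachNode(Source& source, Fn fn) {
    std::size_t count = 0;
    while (true) {
        auto& node = source.nextNode();
        if (!node.isValid()) break;
        fn(node);
        ++count;
    }
    return count;
}

// Example use: collect the names of all nodes delivered by the source.
std::vector<std::string> collectNames(DemoSource& source) {
    std::vector<std::string> names;
    forEachNode(source, [&names](DemoNode& node) { names.push_back(node.name); });
    return names;
}
```

Any type with a compatible nextNode() would work with forEachNode, which is the attraction of keeping the source interface so small.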

Respecting a NodeSource

The need to respect the data underlying a NodeSource is well illustrated by a piece of bad code:

// assume we have constructed a document
// and that the document element has several children

// construct a list stream based on the children of the document element
ListStream stream(document.getDocumentElement());

Document newDocument;
Element newElement("RootElement");
newDocument.appendChild(newElement);

while (true) {
  Node& node = stream.nextNode();
  if (!node.isValid()) break;

  newElement.appendChild(node);
}

The first point to make is that the system would not permit this sort of travesty: the addition of nodes from one document to another document. It would lead to a Wrong Document error.

There is another serious flaw in this code: when we append a node to the new document, or anywhere else for that matter, the node is first detached from its current parent. On this basis, we would expect all the child nodes to be removed from the document element. That is partly true, but the full truth is rather messier: we are removing nodes from a live list (the NodeList owned by the document element, which underlies the ListStream), and the outcome is that only some of the child nodes are transferred.
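The effect is easy to reproduce outside Limpid. The sketch below uses a plain std::vector as the live list and assumes, purely for illustration, that the stream keeps an index into that list; how ListStream actually tracks its position is not stated here. The function name moveWhileIterating is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Walks a live list by index (an assumed strategy) and moves each visited
// item to a new list, which removes it from the live list as a side effect.
std::vector<std::string> moveWhileIterating(std::vector<std::string>& live) {
    std::vector<std::string> target;
    std::size_t index = 0;
    while (index < live.size()) {
        target.push_back(live[index]);
        // Removing the current item shifts every later item down by one...
        live.erase(live.begin() + static_cast<std::ptrdiff_t>(index));
        // ...so advancing the cursor skips the item that slid into its place.
        ++index;
    }
    return target;
}
```

Starting from the five items a, b, c, d, e, this transfers only a, c and e, and leaves b and d behind in the original list.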

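Now this is a general caveat when working with lists and streams: if you want to do something that, directly or indirectly, alters the data underlying a list or stream, first collect the nodes in a separate list that is not live, and then work through that list instead.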

The following code will give the expected result (to repeat, it works because removalList is not live, and this is because it was constructed with a default constructor, without a Node parameter):

// assume we have constructed a document
// and that the document element has several children
ListStream stream(document.getDocumentElement());

Document newDocument;
Element newElement("root-element");
newDocument.appendChild(newElement);

NodeList removalList;

// collect the nodes
while (true) {
  Node& node = stream.nextNode();
  if (!node.isValid()) break;

  removalList.addNode(node);
}

ListStream removalStream(removalList);

while (true) {
  Node& node = removalStream.nextNode();
  if (!node.isValid()) break;

  // make sure we can append the node
  newElement.importNode(node);
  newElement.appendChild(node);
}
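The collect-then-process pattern is not specific to Limpid. As a rough sketch, again with a std::vector standing in for the live list and a hypothetical function name (moveViaSnapshot), the two passes look like this:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Moves every item out of 'live' by first copying the items into a
// separate list that nothing else modifies, then working from that copy.
std::vector<std::string> moveViaSnapshot(std::vector<std::string>& live) {
    // Pass 1: collect. The snapshot is unaffected by later removals.
    const std::vector<std::string> snapshot = live;

    // Pass 2: process. Each removal changes 'live', never 'snapshot'.
    std::vector<std::string> target;
    for (const std::string& item : snapshot) {
        auto it = std::find(live.begin(), live.end(), item);
        if (it != live.end()) live.erase(it);
        target.push_back(item);
    }
    return target;
}
```

Because the loop iterates over the snapshot, every item is visited exactly once, however the live list shrinks underneath it.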

But perhaps we really did not intend to remove the nodes from the first document? In that event, the correct approach is to append a deep clone of the node:

// assume we have constructed a document
// and that the document element has several children
ListStream stream(document.getDocumentElement());

Document newDocument;
Element newElement("root-element");
newDocument.appendChild(newElement);

while (true) {
  Node& node = stream.nextNode();
  if (!node.isValid()) break;

  // the original node stays where it is; only the clone is appended
  newElement.appendChild(node.cloneNode(true));
}

Multiple Processors

To illustrate how many of the Limpid classes interlock, the following code creates a Reader, with default UTF-8 encoding, incorporates it into a TokenStream and uses the TokenStream to construct an XMLParser. This parser is then used to construct a ProcessBox.

Processors are added to the ProcessBox; one Processor is itself a ProcessBox. The different Processors perform their different functions when process(Node& node) is invoked on them:

void doMain(const char* fname) {
  XMLParser parser(fname);
  ProcessBox mainBox(parser);
  
  Builder builder1;
  mainBox.addProcessor(builder1);
  
  ProcessBox childBox;
  Cloner cloner;	// a Processor
  childBox.addProcessor(cloner);
  Builder builder2;
  childBox.addProcessor(builder2);
  
  mainBox.addProcessor(childBox);
  mainBox.process();	// no parameter
  
  Document document1 = builder1.getDocument();
  Document document2 = builder2.getDocument();
  
  DOMWriter().write(document1);
  DOMWriter().write(document2);
}

This example shows the parallel construction of two documents from the same node stream (XMLParser). This is needlessly complex (actually, it is silly!), and shown here only for illustration.

There is an effective split in the stream at the point where the childBox is added to mainBox. Inside the childBox, a cloner is used to make a parallel version of each node before a second Builder incorporates it into a second document. Finally, and this is the trickiest bit, a Cutter (loaded with a specific test) tests the EndNode of each Element. If there is a match, the node is removed from the document. As a result, the only SPEECH elements retained by cutDocument are those with LORD POLONIUS as speaker. (Note that the Cutter comes after the Builder, as it removes selected SPEECH nodes and all their child nodes.)
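The nesting of one ProcessBox inside another resembles the familiar composite pattern. The sketch below shows only that structural idea and makes no claim about how Limpid implements it: DemoProcessor, Recorder, DemoBox and feed are hypothetical names, nodes are reduced to strings, and the box simply passes each node to every member in turn (it does not model a Cloner or a Cutter altering what later members see).

```cpp
#include <string>
#include <vector>

// Hypothetical processor interface: one call per node.
class DemoProcessor {
public:
    virtual ~DemoProcessor() = default;
    virtual void process(const std::string& node) = 0;
};

// Records every node it sees, standing in for a Builder.
class Recorder : public DemoProcessor {
public:
    void process(const std::string& node) override { seen.push_back(node); }
    std::vector<std::string> seen;
};

// A box is itself a processor, so boxes can be nested inside boxes.
// It does not own its members; they must outlive the box.
class DemoBox : public DemoProcessor {
public:
    void addProcessor(DemoProcessor& p) { members_.push_back(&p); }

    // Pass the node to each member in the order the members were added.
    void process(const std::string& node) override {
        for (DemoProcessor* p : members_) p->process(node);
    }

private:
    std::vector<DemoProcessor*> members_;
};

// Drives a box from a sequence of nodes, much as a parser would.
void feed(DemoBox& box, const std::vector<std::string>& nodes) {
    for (const std::string& n : nodes) box.process(n);
}
```

With one Recorder added directly to the outer box and another added to a nested box, both end up seeing the same sequence of nodes, which is the split in the stream described above.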

Other options for streams include selective output of specified nodes on-the-fly. It is even possible to output a document without building it:

void doMain(const char* fname) {
  ProcessBox mainBox(fname);
  
  XMLWriter xmlWriter;
  mainBox.addProcessor(xmlWriter);
  
  mainBox.process();
}

This simple arrangement is successful because the ProcessBox incorporates a default XMLParser and the XMLWriter is smart enough to interpret the nodes in the stream and convert them into correctly formatted XML text. The ProcessBox and NodeStream concept is proving very flexible. Particularly promising is the ability to perform stream editing, on-the-fly output of specified content, or construction of a smaller document from specific nodes in a really huge source document, with obvious practical advantages.

Processing a Document or Subtree

NodeStream is a particularly powerful stream, as it simplifies a depth-first traversal of a tree. It is defined by the root node and two optional parameters:

NodeStream::NodeStream(Node& root,
    bool includeEnds = false, bool includeRoot = true);

A simple example of its use is to output a document using an XMLWriter (which writes nodes to a writer and manages indentation):

NodeStream nodeStream(document, true, true);
XMLWriter xmlWriter(destPath);

while (true) {
  Node& node = nodeStream.nextNode();
  if (!node.isValid()) break;

  xmlWriter.process(node);
}
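To make the role of the end nodes concrete, here is a small self-contained sketch of a pull-style, depth-first stream. It is not the Limpid NodeStream: TreeNode, DemoTreeStream and drain are hypothetical, events are plain strings, an empty string plays the part of the invalid node, and I am assuming that an end node is delivered once all of an element's descendants have been delivered. The includeRoot option is left out for brevity.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical tree node.
struct TreeNode {
    std::string name;
    std::vector<TreeNode> children;
};

// Pull-style depth-first walk over a tree using an explicit stack.
class DemoTreeStream {
public:
    DemoTreeStream(const TreeNode& root, bool includeEnds)
        : includeEnds_(includeEnds) {
        stack_.push_back(Frame{&root, 0, false});
    }

    // Returns the next event, or "" once the walk is finished.
    std::string next() {
        while (!stack_.empty()) {
            Frame& top = stack_.back();
            if (!top.started) {
                top.started = true;
                return "<" + top.node->name + ">";
            }
            if (top.nextChild < top.node->children.size()) {
                const TreeNode* child = &top.node->children[top.nextChild++];
                stack_.push_back(Frame{child, 0, false});  // 'top' is stale now
                continue;
            }
            // All children done: leave this node, optionally reporting its end.
            const std::string name = top.node->name;
            stack_.pop_back();
            if (includeEnds_) return "</" + name + ">";
        }
        return "";
    }

private:
    struct Frame {
        const TreeNode* node;
        std::size_t nextChild;
        bool started;
    };
    bool includeEnds_;
    std::vector<Frame> stack_;
};

// The usual loop: pull events until the stream is exhausted.
std::vector<std::string> drain(DemoTreeStream& stream) {
    std::vector<std::string> out;
    while (true) {
        std::string event = stream.next();
        if (event.empty()) break;
        out.push_back(event);
    }
    return out;
}
```

A consumer such as a writer or a builder needs the end events to know when an element closes, which is presumably why the examples here pass true for includeEnds.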

If we need a full clone of a document or subtree, we code similarly:

// do a deep clone of the document:
Document cloneDocument(Document& document) {
  // we want a stream to return root and endnodes:
  NodeStream nodeStream(document, true, true);
  Builder builder;

  while (true) {
    Node& node = nodeStream.nextNode();
    if (!node.isValid()) break;

    builder.process(node.cloneNode(false)); // shallow clone
  }

  return builder.getDocument();
}

Yes, I know, we could simply have done a deep clone of the document, but I want to demonstrate a code pattern.

Using a Traverser

A Traverser offers another way of achieving what we have been doing with a NodeStream:

XMLWriter xmlWriter(destPath);
// construct a Traverser with xmlWriter as processor
// we want to include end nodes, with no finish() processing
Traverser traverser(xmlWriter, true, false);
traverser.traverse();

This can greatly shorten code; its disadvantage is that it is necessary to have a canned processor to handle all the required processing.
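The difference between the two styles is who owns the loop: with a stream the caller pulls nodes, whereas a traverser pushes them into a processor. A minimal sketch of the push style follows. The names Item and DemoTraverser are hypothetical, the processor is reduced to a callback taking a string, and the root is passed to traverse() as an assumption of this sketch, since the example above does not show how the Limpid Traverser is given its starting node.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tree node.
struct Item {
    std::string name;
    std::vector<Item> children;
};

// Push-style traversal: the traverser owns the loop and calls the
// processor once per event, in depth-first order.
class DemoTraverser {
public:
    DemoTraverser(std::function<void(const std::string&)> processor,
                  bool includeEnds)
        : processor_(std::move(processor)), includeEnds_(includeEnds) {}

    void traverse(const Item& root) const { visit(root); }

private:
    void visit(const Item& node) const {
        processor_("<" + node.name + ">");
        for (const Item& child : node.children) visit(child);
        if (includeEnds_) processor_("</" + node.name + ">");
    }

    std::function<void(const std::string&)> processor_;
    bool includeEnds_;
};
```

The calling code shrinks to a constructor call and traverse(), at the cost that everything to be done per node has to live inside the processor.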