Converting Text to XML Format

Introduction

XML text is a wonderful tool, but it is monstrous to write -- and this is why there are specialist XML editors. During the development of Limpid and Lexxia, I found myself writing lots of web pages and other XML documents using, of course, a text editor. And I concluded that there had to be a BETTER WAY.

This BETTER WAY is the theme of this document, in which I describe just how easy it is to add markup to a text file. But, before I start, I should point out that the XML documents that interested me were primarily XHTML documents -- or, at least, XML documents with XHTML tags. This brings into focus a central division in XML documents:

My emphasis here is on creating display-oriented XML, but the syntax that I will describe could, with less elegance, be used for any form of XML. So this is my central aim: to simplify, and rationalise, the creation of structured text information. And the key word is structured. Structure is important because it provides the basis for analysing, organising, presenting and styling text. And this covers the storage, extraction and manipulation of elements of the text, as well as how the text is transmitted and displayed. Structure is a critically important part of all but the most trivial text.

Text with Structure

The preceding section, in isolation, has some structure: three paragraphs and an unordered list. If it were not for XML, how might we have maintained this structure? Or have incorporated it into the higher structure of the whole document? XML is really the only credible option for structured (and other) text. Any other option is a journey into the past!

Building an XHTML Document

The remainder of this document outlines a simple and powerful approach to converting plain text to a rich XHTML format. This has obvious interest if you want an XHTML document, but it can be adapted more generally to generate any other XML doctype. And, as I will demonstrate, the XHTML doctype (perhaps with some enhancements) provides a very smooth path to LaTeX files and subsequently DVI, PostScript and PDF files.

Text Markup and Style

A styled text document, whether viewed in hard copy or in a browser, appears to have structure if it is punctuated with headings. The headings lead the eye and provide the impression of structure. But the underlying document is not necessarily structured; it may be no more than a simple stream of content with differing styles along the way.

This simplicity limits the informational content of the document -- and consequently its potential value. Consider, for example, scanning a series of documents for information on "credit crisis" in a paragraph after a section of text that is displayed in large italic font (meaning a section heading), which includes "worsening". Clumsy and limiting. Far better to have a block of content that has a specified heading and associated text.

Getting Started

Let us decide that a paragraph is a string of text without any hard line breaks (from the Enter key). And let us decide to ignore blank lines, so that we have a simple visual check on where the paragraphs are.

Consider the following sequence of text. It has five lines (determined by line breaks with the <ENTER> key) and one blank line:

Lexxia uses a dedicated processor (TexConverter) that has good default...
TexConverter also has various options for processing different...
It can be further configured on the fly, using a simple stylesheet with CSS...

It uses part of the limpid XSLT node identification infrastructure so...
Processing of complex structures, such as tables, is handled smoothly...

Text does not come much plainer than this, but we still save it to a file (try1.txt) and process it with Lexxia:

lexxia try1.txt -O*.xml

The resulting XML file is:

<root>
 <p>Lexxia uses a dedicated processor (TexConverter) that has good default...</p>
 <p>TexConverter also has various options for processing different...</p>
 <p>It can be further configured on the fly, using a simple stylesheet with CSS...</p>
 <p>It uses part of the limpid XSLT node identification infrastructure so...</p>
 <p>Processing of complex structures, such as tables, is handled smoothly...</p>
</root>

Here, we have used the most primitive processing conventions: each line of text becomes a paragraph, and blank lines are ignored.

To obey the rules of the XML DOM, all the <p> elements are loaded into a single <root> element. Neat, perhaps, but no structure. This is the next step.
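
The processing so far (one <p> per non-blank line, everything inside a single <root>) can be sketched in a few lines of Python. This is only an illustration of the convention, not Lexxia's own code; the function name lines_to_paragraphs is hypothetical, and passing the text through without escaping is an assumption.

```python
def lines_to_paragraphs(text: str) -> str:
    """Wrap every non-blank line in <p> and collect them under one <root>."""
    # Blank lines are only visual separators, so they are dropped.
    paragraphs = [line.strip() for line in text.splitlines() if line.strip()]
    # Assumption: line content is passed through untouched (no escaping).
    body = "\n".join(f" <p>{p}</p>" for p in paragraphs)
    return f"<root>\n{body}\n</root>"


if __name__ == "__main__":
    print(lines_to_paragraphs("First line...\nSecond line...\n\nThird line..."))
```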

Adding Structure

We now edit our source text to add headings:

.Introduction

..Heading 1
...Subheading 1
Lexxia uses a dedicated processor (TexConverter) that has good default...
TexConverter also has various options for processing different...

...Comment
It can be further configured on the fly, using a simple stylesheet with CSS...

..Heading 2
It uses part of the limpid XSLT node identification infrastructure so...
Processing of complex structures, such as tables, is handled smoothly...

We process as before, to create the following XML document:

<root>
 <div class="L1">
  <h1>Introduction</h1>
  <div class="L2">
   <h2>Heading 1</h2>
   <div class="L3">
    <h3>Subheading 1</h3>
    <p>Lexxia uses a dedicated processor (TexConverter) that has good default...</p>
    <p>TexConverter also has various options for processing different...</p>
   </div>
   <div class="L3">
    <h3>Comment</h3>
    <p>It can be further configured on the fly, using a simple stylesheet with CSS...</p>
   </div>
  </div>
  <div class="L2">
   <h2>Heading 2</h2>
   <p>It uses part of the limpid XSLT node identification infrastructure so...</p>
   <p>Processing of complex structures, such as tables, is handled smoothly...</p>
  </div>
 </div>
</root>

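So, by designating certain lines as headings (by adding one or more dots at the start of the line), we have specified a hierarchy of nested sections: the number of dots sets the heading level (<h1>, <h2>, <h3>), and each heading opens a <div> of the matching class (L1, L2, L3) that holds its content.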

The Dot Rules

Now we are getting real structure into the document, but we are still generating very little XHTML markup. And we need specialist structures, such as lists and tables.
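
The dot convention used in the example above can be sketched with a small stack of open heading levels. This is a minimal Python illustration under my own assumptions (the name dotted_to_xml is hypothetical, and Lexxia's real rules may well be richer): a heading closes every open section at the same or a deeper level before opening its own <div>.

```python
def dotted_to_xml(text: str) -> str:
    """Turn dot-prefixed headings and plain lines into nested <div> sections."""
    out = ["<root>"]
    open_levels = []  # heading levels of the <div> elements currently open
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines carry no meaning
        if line.startswith("."):
            title = line.lstrip(".")
            level = len(line) - len(title)  # number of leading dots
            # Close every section at the same or a deeper level.
            while open_levels and open_levels[-1] >= level:
                out.append(" " * len(open_levels) + "</div>")
                open_levels.pop()
            depth = len(open_levels) + 1
            out.append(" " * depth + f'<div class="L{level}">')
            out.append(" " * (depth + 1) + f"<h{level}>{title}</h{level}>")
            open_levels.append(level)
        else:
            out.append(" " * (len(open_levels) + 1) + f"<p>{line}</p>")
    # Close whatever is still open at the end of the text.
    while open_levels:
        out.append(" " * len(open_levels) + "</div>")
        open_levels.pop()
    out.append("</root>")
    return "\n".join(out)


if __name__ == "__main__":
    print(dotted_to_xml(".Introduction\n..Heading 1\nSome text...\n..Heading 2\nMore text..."))
```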

The Underscore Syntax

This is a device to incorporate a specifically named element. This may be a leaf element (contains text) or a block element (contains other elements):

_span=this is span text and a leaf

This is loaded as:

<span>this is span text and a leaf</span>

A block element is opened by its name alone and closed by a double underscore:

_block
_span=this is span text and a leaf

__ block This text after the double underscore is ignored: a good place for comments.

This is loaded as:

<block>
  <span>this is span text and a leaf</span>
</block>
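
As a rough sketch of how the underscore rules could be read (an assumption on my part, not Lexxia's implementation; the name underscore_to_xml is hypothetical), the following Python treats "_name=text" as a leaf, "_name" as the start of a block, and any line starting with "__" as the end of the innermost block:

```python
def underscore_to_xml(text: str) -> str:
    """Translate _name=text leaves and _name ... __ blocks into XML."""
    out, stack = [], []  # stack holds the names of the blocks still open
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        indent = "  " * len(stack)
        if line.startswith("__"):
            # Close the innermost block; text after "__" is only a comment.
            if stack:
                name = stack.pop()
                out.append("  " * len(stack) + f"</{name}>")
        elif line.startswith("_"):
            name, sep, content = line[1:].partition("=")
            if sep:  # "_name=text" is a leaf element
                out.append(f"{indent}<{name}>{content}</{name}>")
            else:    # "_name" alone opens a block element
                stack.append(name)
                out.append(f"{indent}<{name}>")
        else:
            out.append(indent + line)  # plain text is passed through here
    return "\n".join(out)


if __name__ == "__main__":
    print(underscore_to_xml("_block\n_span=this is span text and a leaf\n\n__ block (comment)"))
```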

Unordered and Ordered Lists

Recall that XHTML has ordered lists (<ol>), unordered lists (<ul>) and definition lists (<dl>). To incorporate these, and any other named element, we also use the underscore syntax:

_ul
Point 1 ...
Point 2 ...
Point 3 ...
__ ul

This loads as:

<ul>
  <li>Point 1 ...</li>
  <li>Point 2 ...</li>
  <li>Point 3 ...</li>
</ul>

Here Lexxia has used some of the smarts in the dottedfile format. And it greatly simplifies the writing of the document.
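
The list shorthand amounts to wrapping each line of the block in <li>. A minimal Python sketch of that idea follows; the function name render_list is hypothetical, and restricting the tag to ul and ol is my assumption rather than a documented Lexxia rule.

```python
def render_list(source: str) -> str:
    """Render a _ul / _ol block, one <li> per content line."""
    lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
    tag = lines[0][1:]  # "_ul" -> "ul"
    if tag not in ("ul", "ol"):
        raise ValueError(f"not a list block: {lines[0]!r}")
    out = [f"<{tag}>"]
    for line in lines[1:]:
        if line.startswith("__"):
            break  # the double underscore closes the list
        out.append(f"  <li>{line}</li>")
    out.append(f"</{tag}>")
    return "\n".join(out)


if __name__ == "__main__":
    print(render_list("_ul\nPoint 1 ...\nPoint 2 ...\nPoint 3 ...\n__ ul"))
```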

More Complicated XHTML Structures

Definition (Property) Lists

A simple example is:

_dl
ONE|We parse, load and correct the XML document as before;
TWO|We then process the XML document with a TexConverter.
THREE|The XML document is <em>not changed</em> by this processing.
__ dl

Lexxia loads this as:

<dl>
 <dt>ONE</dt>
 <dd>We parse, load and correct the XML document as before;</dd>
 <dt>TWO</dt>
 <dd>We then process the XML document with a TexConverter.</dd>
 <dt>THREE</dt>
 <dd>The XML document is <em>not changed</em> by this processing.</dd>
</dl>
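
The bar character separates the term from its definition. A short Python sketch of that split, splitting on the first bar only (the name render_dl is hypothetical and the code is illustrative, not taken from Lexxia):

```python
def render_dl(source: str) -> str:
    """Render a _dl block: each 'term|definition' line gives a <dt>/<dd> pair."""
    out = ["<dl>"]
    for raw in source.splitlines():
        line = raw.strip()
        if not line or line.startswith("_"):
            continue  # skip blank lines and the _dl / __ dl markers
        # Split on the first bar only, so later bars stay in the definition.
        term, _, definition = line.partition("|")
        out.append(f" <dt>{term}</dt>")
        out.append(f" <dd>{definition}</dd>")
    out.append("</dl>")
    return "\n".join(out)


if __name__ == "__main__":
    print(render_dl("_dl\nONE|First definition.\nTWO|Second definition.\n__ dl"))
```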

Tables

Consider the following terse specification for a table:

_table
_caption=MY TABLE CAPTION
_thead=Head1|Head2|Head3|Head4
This is first|This is second and <em>probably</em> the longest <b>field</b>|Third|Fourth
And is first|And is second|Third|Fourth and final field
_tfoot=This the foot|foot 2|foot 3|foot4
__ table

Lexxia loads this as:

<table>
 <caption>MY TABLE CAPTION</caption>
 <thead>
  <tr>
   <td>Head1</td>
   <td>Head2</td>
   <td>Head3</td>
   <td>Head4</td>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td>This is first</td>
   <td>This is second and <em>probably</em> the longest <b>field</b></td>
   <td>Third</td>
   <td>Fourth</td>
  </tr>
  <tr>
   <td>And is first</td>
   <td>And is second</td>
   <td>Third</td>
   <td>Fourth and final field</td>
  </tr>
 </tbody>
 <tfoot>
  <tr>
   <td>This the foot</td>
   <td>foot 2</td>
   <td>foot 3</td>
   <td>foot4</td>
  </tr>
 </tfoot>
</table>
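
To make the table shorthand concrete, here is a hedged Python sketch that reproduces the shape of the output above: bar-separated fields become <td> cells, and unmarked lines are gathered into a single <tbody>. The names render_table and _row are hypothetical, and the handling is my reading of the example rather than Lexxia's actual logic.

```python
def _row(text: str) -> list:
    """One <tr> whose cells come from the bar-separated fields of text."""
    cells = [f"   <td>{cell}</td>" for cell in text.split("|")]
    return ["  <tr>", *cells, "  </tr>"]


def render_table(source: str) -> str:
    caption = head = foot = None
    body = []
    for raw in source.splitlines():
        line = raw.strip()
        if not line or line == "_table" or line.startswith("__"):
            continue  # block markers and blank lines produce no output
        if line.startswith("_caption="):
            caption = line[len("_caption="):]
        elif line.startswith("_thead="):
            head = line[len("_thead="):]
        elif line.startswith("_tfoot="):
            foot = line[len("_tfoot="):]
        else:
            body.append(line)  # every unmarked line is a body row
    out = ["<table>"]
    if caption is not None:
        out.append(f" <caption>{caption}</caption>")
    if head is not None:
        out += [" <thead>", *_row(head), " </thead>"]
    if body:
        out.append(" <tbody>")
        for line in body:
            out += _row(line)
        out.append(" </tbody>")
    if foot is not None:
        out += [" <tfoot>", *_row(foot), " </tfoot>"]
    out.append("</table>")
    return "\n".join(out)


if __name__ == "__main__":
    print(render_table("_table\n_caption=C\n_thead=A|B\n1|2\n_tfoot=x|y\n__ table"))
```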

Dusty Corners

Newline characters in Text

Recall that tag content is normally delimited by the next newline character. If this is not convenient and you really do need to include newlines within the text, use the "||" string as a substitute. When the content is parsed, all "||" strings are replaced with newlines:

_p=tag content before a newline||and after the newline

This is parsed to:

<p>tag content before a newline
and after the newline</p>

(Even though the parsing works correctly, inserting newlines into XHTML text is pretty pointless, because browsers collapse newline and tab characters into ordinary spaces outside preformatted text. But the option is there if you really feel that you need it.)
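
The substitution itself is a one-line string replacement. A tiny Python sketch (the name leaf_to_xml is hypothetical, and this is not Lexxia's code):

```python
def leaf_to_xml(line: str) -> str:
    """Render a '_name=content' leaf, turning each '||' into a newline."""
    name, _, content = line[1:].partition("=")
    content = content.replace("||", "\n")
    return f"<{name}>{content}</{name}>"


if __name__ == "__main__":
    print(leaf_to_xml("_p=tag content before a newline||and after the newline"))
```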

Preformatted & Script Text

Including XML source as preformatted text (<pre> and <script> tags) in (X)HTML can be very messy, and requires that all '<' characters be escaped as the XML &lt; entity. The dottedfile syntax avoids this difficulty by making the replacement automatic. Typically, preformatted text is lengthy and includes many newlines. This means that the parsing of a preformatted tag cannot rely on content delimitation by newlines. Instead, it is delimited by the "##" string, as in:

_pre=
<table>
<caption>This is a Table Caption</caption>
<thead>
 <tr>
  <td>Column 1</td>
  <td>Column 2</td>
  <td>Column 3</td>
  <td>Column 4</td>
 </tr>
</thead>
<tbody>
 <tr>
  <td>R1C1</td>
  <td>R1C2 2</td>
  <td>R1C3</td>
  <td>R1C4</td>
 </tr>
 <tr>
  <td>R2C1</td>
  <td>R2C2 2</td>
  <td>R2C3</td>
  <td>R2C4</td>
 </tr>
</tbody>
</table>
##

For a browser to display this text correctly, all the '<' characters must be escaped (again, this is automatic).
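
The automatic escaping can be illustrated with Python's standard xml.sax.saxutils.escape, which replaces '&', '<' and '>' with their entities. The function render_pre is a hypothetical name, and escaping '>' and '&' as well as '<' is an assumption of this sketch rather than something the text above states about Lexxia.

```python
from xml.sax.saxutils import escape


def render_pre(source: str) -> str:
    """Render a '_pre=' block whose content runs up to the '##' delimiter."""
    start = source.index("_pre=") + len("_pre=")
    end = source.index("##", start)  # raises ValueError if the delimiter is missing
    content = source[start:end]
    # escape() turns & into &amp;, < into &lt; and > into &gt;
    return f"<pre>{escape(content)}</pre>"


if __name__ == "__main__":
    print(render_pre("_pre=\n<td>R1C1</td>\n##\n"))
```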

Comments

We should be able to include explanatory comments in our code. This is very easy, and uses a hash ('#') character at the start, a familiar syntax:

#This is a comment##

which gives, of course:

<!--This is a comment-->

Note that, as with a <pre> tag, the content of a comment is delimited by the "##" string. This is to allow multiline comments.
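
A comment can be sketched in the same way. The Python below is illustrative only (render_comment is a hypothetical name); the check for "--" reflects the XML rule that a double hyphen may not appear inside a comment, which is an XML constraint and not something stated about Lexxia.

```python
def render_comment(source: str) -> str:
    """Render '#text##' as an XML comment; the text may span several lines."""
    if not source.startswith("#"):
        raise ValueError("a comment must start with '#'")
    end = source.index("##", 1)  # the closing delimiter
    text = source[1:end]
    if "--" in text:
        raise ValueError("XML comments may not contain '--'")
    return f"<!--{text}-->"


if __name__ == "__main__":
    print(render_comment("#This is a comment##"))
```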

Processing Instructions and XML Declaration

In a dottedfile, these are signalled by a query ('?') character at the start of a line. The processing instruction target name follows directly after the '?' character. A space follows, and then a string corresponding to the actual instruction:

?mytarget my instructions for this target

This parses to:

<?mytarget my instructions for this target?>

An XML declaration superficially has the same form. It is distinguished by having the "xml" string in place of the target name and requires, at a minimum, a version string, together with optional encoding and standalone values (XML accepts only "yes" or "no" for standalone). In the dottedfile format, these are provided as a bar-separated string:

?xml 1.0|UTF-8|yes

This is parsed and reformed to:

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
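
Both cases can be sketched together in Python. This is an illustration under assumptions, not Lexxia's code: render_pi is a hypothetical name, and the bar-separated values are simply mapped in order onto version, encoding and standalone (for which XML itself accepts only "yes" or "no").

```python
def render_pi(line: str) -> str:
    """Render '?target data' as a processing instruction or XML declaration."""
    target, _, data = line[1:].partition(" ")
    if target != "xml":
        return f"<?{target} {data}?>"
    # XML declaration: map the bar-separated values onto the fixed names.
    names = ("version", "encoding", "standalone")
    attrs = " ".join(f'{n}="{v}"' for n, v in zip(names, data.split("|")))
    return f"<?xml {attrs} ?>"


if __name__ == "__main__":
    print(render_pi("?mytarget my instructions for this target"))
    print(render_pi("?xml 1.0|UTF-8|yes"))
```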

CDATASections

It is just possible that you might need to use one of these, such as to hide a script from a browser. In such an event, it is likely that you may want to hide several lines. To make this easy, signal a CDATASection with a "[" at the start of a line and delimit the content with "##", as with comments and <pre> tags:

[This is the content of a CDATA section.

more stuff.
##

This parses to:

<![CDATA[This is the content of a CDATA section.

more stuff.
]]>
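
A Python sketch of the same idea follows (render_cdata is a hypothetical name, not part of Lexxia). One detail goes beyond the text above: the string "]]>" cannot appear inside a CDATA section, so the sketch uses the usual workaround of splitting it across two adjacent sections.

```python
def render_cdata(source: str) -> str:
    """Render '[content##' as a CDATA section."""
    if not source.startswith("["):
        raise ValueError("a CDATA block must start with '['")
    end = source.index("##", 1)  # the closing delimiter
    content = source[1:end]
    # "]]>" would end the section early, so split it over two sections.
    content = content.replace("]]>", "]]]]><![CDATA[>")
    return f"<![CDATA[{content}]]>"


if __name__ == "__main__":
    print(render_cdata("[if (a < b) { run(); }\n##"))
```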

Forms

Here is the dottedfile format at its worst. We just have to use XML mixed-content markup:

_form
@action=www.mysite.com/scripts/mycgi.cgi
@method=post

_p=New Customer: <input type="radio" name="new-customer" value="C" />
_p=Name: <input type="text" name="customer-name" value="N"/>
__

This parses into:

<form action="www.mysite.com/scripts/mycgi.cgi" method="post">
 <p>New Customer: <input type="radio" name="new-customer" value="C" /></p>
 <p>Name: <input type="text" name="customer-name" value="N"/></p>
</form>
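
The "@name=value" lines supply the attributes of the element that precedes them. A minimal Python sketch of building the opening tag is given below; open_tag is a hypothetical name, and the use of the standard xml.sax.saxutils.quoteattr for quoting and escaping is my choice, not necessarily what Lexxia does.

```python
from xml.sax.saxutils import quoteattr


def open_tag(name: str, attr_lines: list) -> str:
    """Build an opening tag from an element name and its '@key=value' lines."""
    parts = [name]
    for line in attr_lines:
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = line[1:].partition("=")
        # quoteattr() adds the surrounding quotes and escapes &, < and >.
        parts.append(f"{key}={quoteattr(value)}")
    return "<" + " ".join(parts) + ">"


if __name__ == "__main__":
    print(open_tag("form", ["@action=www.mysite.com/scripts/mycgi.cgi", "@method=post"]))
```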

Scripting

Here, the dottedfile format fares very nicely. An extremely simple example is:

_head
_meta=http-equiv=content-type|text/html, charset=UTF-8
_title=Associating scripts with a button
__

_button=What time is it?
@type=button
@name=time
@onclick=alert('Today is ' + Date())
@style=font: 1.5em Arial, sans-serif; background: yellow; color: red; padding: 0.3em; 

This builds into the following document:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" >
<html>
 <head>
  <meta http-equiv="content-type" content="text/html, charset=UTF-8"/>
  <title>Associating scripts with a button</title>
 </head>
 <body>
  <button type="button" name="time" onclick="alert('Today is ' + Date())"
    style="font: 1.5em Arial, sans-serif; background: yellow;
    color: red; padding: 0.3em;">What time is it?</button>
 </body>
</html>

Embedded Scripts

Another very simple example follows. (Remember that the content for a <script> tag, like that for a <pre> tag, is delimited by the "##" string to allow for extended content.)

_head
_meta=http-equiv=content-type|text/html, charset=UTF-8
_title=Using a Menu and Script
__

#This is a very simple dynamic inclusion##
_script=document.write("It is now: " + Date());##
@type=text/javascript
@language=JavaScript
.A FIRST HEADING

This builds to the following document:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" >
<html>
 <head>
  <meta http-equiv="content-type" content="text/html, charset=UTF-8"/>
  <title>Using a Menu and Script</title>
  <link rel="stylesheet" href="css/simple1.css" type="text/css"/>
 </head>
 <body>
  <!--This is a very simple dynamic inclusion-->
  <script type="text/javascript"
    language="JavaScript">document.write("It is now: " + Date());</script>
  <div class="L1">
   <h1>A FIRST HEADING</h1>
  </div>
 </body>
</html>