XML text is a wonderful tool, but it is monstrous to write -- and this is why there are specialist XML editors. During the development of Limpid and Lexxia, I found myself writing lots of web pages and other XML documents using, of course, a text editor. And I concluded that there had to be a BETTER WAY.
This BETTER WAY is the theme of this document, in which I describe just how easy it is to add markup to a text file. But, before I start, I should point out that the XML documents that interested me were primarily XHTML documents -- or, at least XML documents with XHTML tags. This brings into focus a central division in XML documents:
My emphasis here is on creating display-oriented XML, but the syntax that I will describe could, with less elegance, be used for any form of XML. So this is my central aim: to simplify, and rationalise, the creation of structured text information. And the key word is structured. Structure is important because it provides the basis for analysing, organising, presenting and styling text. This covers the storage, extraction and manipulation of elements of the text, as well as how the text is transmitted and displayed. Structure is a critically important part of all but the most trivial text.
The preceding section, in isolation, has some structure: three paragraphs and an unordered list. If it were not for XML, how might we have maintained this structure? Or have incorporated it into the higher structure of the whole document? XML is really the only credible option for structured (and other) text. Any other option is a journey into the past!
The remainder of this document outlines a simple and powerful approach to converting plain text to a rich XHTML format. This has obvious interest if you want an XHTML document, but it can be adapted more generally to generate any other XML doctype. And, as I will demonstrate, the XHTML doctype (perhaps with some enhancements) provides a very smooth path to LaTeX files and subsequently DVI, PostScript and PDF files.
A styled text document, whether viewed in hard copy or in a browser, appears to have structure if it is punctuated with headings. The headings lead the eye and provide the impression of structure. But the underlying document is not necessarily structured; it may be a simple stream of content with differing styles along the way.
This simplicity limits the informational content of the document -- and consequently its potential value. Consider, for example, scanning a series of documents for information on "credit crisis" in a paragraph that follows a section of text displayed in large italic font (meaning a section heading) that includes "worsening". Clumsy and limiting. Far better to have a block of content that has a specified heading and associated text.
Let us decide that a paragraph is a string of text without any hard line breaks (from the Enter key). And let us decide to ignore blank lines, so that we have a simple visual check on where the paragraphs are.
Consider the following sequence of text. It has five lines (determined by line breaks entered with the <ENTER> key) and one blank line:
Lexxia uses a dedicated processor (TexConverter) that has good default... TexConverter also has various options for processing different... It can be further configured on the fly, using a simple stylesheet with CSS... It uses part of the limpid XSLT node identification infrastructure so... Processing of complex structures, such as tables, is handled smoothly...
Text does not come much plainer than this, but we still save it to a file (try1.txt) and process it with Lexxia:
lexxia try1.txt -O*.xml
The resulting XML file is:
<root>
<p>Lexxia uses a dedicated processor (TexConverter) that has good default...</p>
<p>TexConverter also has various options for processing different...</p>
<p>It can be further configured on the fly, using a simple stylesheet with CSS...</p>
<p>It uses part of the limpid XSLT node identification infrastructure so...</p>
<p>Processing of complex structures, such as tables, is handled smoothly...</p>
</root>
Here, we have used the most primitive processing conventions;
To obey the rules of the XML DOM, all the <p> elements are loaded into a single <root> element. Neat, perhaps, but no structure. Adding structure is the next step.
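The convention described so far (one paragraph per non-blank line, everything wrapped in a single root element) can be pictured with a short Python sketch. This is not Lexxia's own code; the function name paragraphs_to_xml is hypothetical and the sketch covers only this most primitive case.

```python
import xml.etree.ElementTree as ET


def paragraphs_to_xml(text: str) -> str:
    """Wrap each non-blank line of `text` in a <p> inside a single <root>.

    Hypothetical helper for illustration; not part of Lexxia.
    """
    root = ET.Element("root")
    for line in text.splitlines():
        if line.strip():  # blank lines are ignored
            ET.SubElement(root, "p").text = line.strip()
    return ET.tostring(root, encoding="unicode")


sample = "First paragraph...\n\nSecond paragraph...\n"
print(paragraphs_to_xml(sample))
```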
We now edit our source text to add headings:
.Introduction
..Heading 1
...Subheading 1
Lexxia uses a dedicated processor (TexConverter) that has good default...
TexConverter also has various options for processing different...
...Comment
It can be further configured on the fly, using a simple stylesheet with CSS...
..Heading 2
It uses part of the limpid XSLT node identification infrastructure so...
Processing of complex structures, such as tables, is handled smoothly...
We process as before, to create the following XML document:
<root>
<div class="L1">
<h1>Introduction</h1>
<div class="L2">
<h2>Heading 1</h2>
<div class="L3">
<h3>Subheading 1</h3>
<p>Lexxia uses a dedicated processor (TexConverter) that has good default...</p>
<p>TexConverter also has various options for processing different...</p>
</div>
<div class="L3">
<h3>Comment</h3>
<p>It can be further configured on the fly, using a simple stylesheet with CSS...</p>
</div>
</div>
<div class="L2">
<h2>Heading 2</h2>
<p>It uses part of the limpid XSLT node identification infrastructure so...</p>
<p>Processing of complex structures, such as tables, is handled smoothly...</p>
</div>
</div>
</root>
So, by designating certain lines as headings (by adding one or more dots at the start of lines), we have specified:
Now we are getting real structure into the document, but we are still generating very little XHTML markup. And we need specialist structures, such as lists and tables.
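The way leading dots translate into nested <div> blocks can be sketched in Python as follows. This is an illustration only, not Lexxia's implementation: the function name dotted_to_xml is hypothetical, and the sketch assumes that every line starting with a dot is a heading (so a paragraph beginning with "..." would be misread).

```python
import xml.etree.ElementTree as ET


def dotted_to_xml(text: str) -> ET.Element:
    """Build nested <div class="Ln"> blocks from dot-prefixed heading lines.

    Hypothetical helper for illustration; not part of Lexxia.
    """
    root = ET.Element("root")
    # Stack of (depth, element) pairs for the currently open containers.
    stack = [(0, root)]
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines are ignored
        depth = len(line) - len(line.lstrip("."))
        if depth:
            # A heading closes every open block at the same depth or deeper.
            while stack[-1][0] >= depth:
                stack.pop()
            div = ET.SubElement(stack[-1][1], "div", {"class": f"L{depth}"})
            ET.SubElement(div, f"h{depth}").text = line[depth:]
            stack.append((depth, div))
        else:
            # Ordinary text becomes a paragraph in the innermost open block.
            ET.SubElement(stack[-1][1], "p").text = line
    return root


source = ".Introduction\n..Heading 1\nSome text...\n..Heading 2\nMore text...\n"
print(ET.tostring(dotted_to_xml(source), encoding="unicode"))
```

Keeping the depth alongside each open element means a new heading only has to pop the stack until it finds its parent, which is what produces the sibling and child relationships seen in the output above.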
The underscore syntax is a device to incorporate a specifically named element. This may be a leaf element (contains text) or a block element (contains other elements):
_span=this is span text and a leaf
This is loaded as:
<span>this is span text and a leaf</span>
_block
_span=this is span text and a leaf
__ block This text after the double underscore is ignored: a good place for comments.
This is loaded as:
<block>
<span>this is span text and a leaf</span>
</block>
Recall that XHTML has ordered lists (<ol>), unordered lists (<ul>) and definition lists (<dl>). To incorporate these, and any other named element, we also use the underscore syntax:
_ul
Point 1 ...
Point 2 ...
Point 3 ...
__ ul
This loads as:
<ul>
<li>Point 1 ...</li>
<li>Point 2 ...</li>
<li>Point 3 ...</li>
</ul>
Here Lexxia has used some of the smarts in the dottedfile format: each line inside the _ul block is loaded as an <li> element without any further markup. This greatly simplifies the writing of the document.
A simple example of a definition list is:
_dl
ONE|We parse, load and correct the XML document as before;
TWO|We then process the XML document with a TexConverter.
THREE|The XML document is <em>not changed</em> by this processing.
__ dl
Lexxia loads this as
<dl>
<dt>ONE</dt>
<dd>We parse, load and correct the XML document as before;</dd>
<dt>TWO</dt>
<dd>We then process the XML document with a TexConverter.</dd>
<dt>THREE</dt>
<dd>The XML document is <em>not changed</em> by this processing.</dd>
</dl>
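The bar-separated convention for definition lists can be mimicked in a few lines of Python. This is a sketch under assumptions, not Lexxia's code: the function name dl_lines_to_xml is hypothetical, and the sketch assumes that the first "|" on each line separates the term from the definition and that inline markup such as <em> is passed through untouched.

```python
def dl_lines_to_xml(lines):
    """Turn 'TERM|definition' lines into a <dl> with <dt>/<dd> pairs.

    Hypothetical helper for illustration; not part of Lexxia.
    """
    parts = ["<dl>"]
    for line in lines:
        # Split on the first bar only, so later bars stay in the definition.
        term, _, definition = line.partition("|")
        parts.append(f"<dt>{term.strip()}</dt>")
        parts.append(f"<dd>{definition.strip()}</dd>")
    parts.append("</dl>")
    return "\n".join(parts)


print(dl_lines_to_xml([
    "ONE|We parse, load and correct the XML document as before;",
    "THREE|The XML document is <em>not changed</em> by this processing.",
]))
```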
Consider the following terse specification for a table:
_table
_caption=MY TABLE CAPTION
_thead=Head1|Head2|Head3|Head4
This is first|This is second and <em>probably</em> the longest <b>field</b>|Third|Fourth
And is first|And is second|Third|Fourth and final field
_tfoot=This the foot|foot 2|foot 3|foot4
__ table
Lexxia loads this as
<table>
<caption>MY TABLE CAPTION</caption>
<thead>
<tr>
<td>Head1</td>
<td>Head2</td>
<td>Head3</td>
<td>Head4</td>
</tr>
</thead>
<tbody>
<tr>
<td>This is first</td>
<td>This is second and <em>probably</em> the longest <b>field</b></td>
<td>Third</td>
<td>Fourth</td>
</tr>
<tr>
<td>And is first</td>
<td>And is second</td>
<td>Third</td>
<td>Fourth and final field</td>
</tr>
</tbody>
<tfoot>
<tr>
<td>This the foot</td>
<td>foot 2</td>
<td>foot 3</td>
<td>foot4</td>
</tr>
</tfoot>
</table>
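The table convention follows the same pattern: named lines for the caption, head and foot, and plain bar-separated lines for the body rows. A rough Python sketch of that mapping is shown here. The function names row and table_block_to_xml are hypothetical, and the sketch emits the result on a single line rather than reproducing Lexxia's layout.

```python
def row(cells_text):
    """Convert 'a|b|c' into a <tr> holding one <td> per field."""
    cells = "".join(f"<td>{c.strip()}</td>" for c in cells_text.split("|"))
    return f"<tr>{cells}</tr>"


def table_block_to_xml(lines):
    """Build a <table> from the lines of a _table ... __ block.

    Hypothetical helper for illustration; not part of Lexxia.
    """
    caption, head, foot, body = "", "", "", []
    for line in lines:
        if line.startswith("_caption="):
            caption = f"<caption>{line[len('_caption='):]}</caption>"
        elif line.startswith("_thead="):
            head = f"<thead>{row(line[len('_thead='):])}</thead>"
        elif line.startswith("_tfoot="):
            foot = f"<tfoot>{row(line[len('_tfoot='):])}</tfoot>"
        elif line.strip() and not line.startswith(("_table", "__")):
            body.append(row(line))  # any other line is a body row
    # Element order follows the example output: caption, thead, tbody, tfoot.
    return f"<table>{caption}{head}<tbody>{''.join(body)}</tbody>{foot}</table>"


print(table_block_to_xml([
    "_table",
    "_caption=MY TABLE CAPTION",
    "_thead=Head1|Head2",
    "This is first|This is second",
    "_tfoot=foot 1|foot 2",
    "__ table",
]))
```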
Recall that tag content is normally delimited by the next newline character. If this is not convenient and you really do need to include newlines within the text, use the "||" string as a substitute. When the content is parsed, all "||" strings are replaced with newlines:
_p=tag content before a newline||and after the newline
This is parsed to:
<p>tag content before a newline
and after the newline</p>
(Even though the parsing works correctly, inserting newlines into XHTML text is pretty pointless, because browsers disregard newline and tab characters. But the option is there if you really feel that you need it.)
Including XML source as preformatted text (<pre> and <script> tags) in (X)HTML can be very messy, and requires that all '<' characters be escaped as the XML &lt; entity. The dottedfile syntax avoids this difficulty by making the replacement automatic. Typically, preformatted text is lengthy and includes many newlines. This means that the parsing of a preformatted tag cannot rely on content delimitation by newlines. Instead, it is delimited by the "##" string, as in:
_pre=
<table>
<caption>This is a Table Caption</caption>
<thead>
<tr>
<td>Column 1</td>
<td>Column 2</td>
<td>Column 3</td>
<td>Column 4</td>
</tr>
</thead>
<tbody>
<tr>
<td>R1C1</td>
<td>R1C2 2</td>
<td>R1C3</td>
<td>R1C4</td>
</tr>
<tr>
<td>R2C1</td>
<td>R2C2 2</td>
<td>R2C3</td>
<td>R2C4</td>
</tr>
</tbody>
</table>
##
For a browser to display this successfully, all the '<' characters are escaped as &lt; (again, this is automatic).
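The automatic escaping can be pictured with a short Python sketch. This is not Lexxia's own code: the function name pre_to_xml is hypothetical, and the standard-library escape helper used here also escapes '&' and '>', which goes slightly beyond the '<' replacement described above.

```python
from xml.sax.saxutils import escape


def pre_to_xml(source: str) -> str:
    """Convert a '_pre=...##' block into an escaped <pre> element.

    Hypothetical helper for illustration; not part of Lexxia.
    """
    if not source.startswith("_pre="):
        raise ValueError("expected a block starting with '_pre='")
    body = source[len("_pre="):]
    # The content runs up to the first "##" delimiter, newlines included.
    content, _, _ = body.partition("##")
    # escape() replaces '&', '<' and '>' with their XML entities.
    return f"<pre>{escape(content)}</pre>"


print(pre_to_xml("_pre=\n<tr>\n<td>R1C1</td>\n</tr>\n##"))
```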
We should be able to include explanatory comments in our code. This is very easy, and uses a hash command, a familiar syntax:
#This is a comment##
which gives, of course:
<!--This is a comment-->
Note that, as with a <pre> tag, the content of a comment is delimited by the "##" string. This is to allow multiline comments.
Processing instructions are signalled in a dottedfile by a query ('?') character at the start of a line. The target name follows directly after the '?' character, then a space, and then the string that forms the actual instruction:
?mytarget my instructions for this target
This parses to:
<?mytarget my instructions for this target?>
An XML declaration appears to take the same form. It is distinguished by having the string "xml" in place of the target name and requires, at a minimum, a version string, together with optional encoding and standalone attributes. In the dottedfile format, these are provided as a bar-separated string:
?xml 1.0|UTF-8|true
This is parsed and reformed to:
<?xml version="1.0" encoding="UTF-8" standalone="true" ?>
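Both forms can be reproduced with a small Python sketch. It is an illustration only: the function name pi_to_xml is hypothetical, and the sketch copies the bar-separated values through unchanged, as in the example above. (The XML specification itself only permits "yes" or "no" as the standalone value, so a real converter might need to map other values.)

```python
def pi_to_xml(line: str) -> str:
    """Convert a '?target data' line into a processing instruction,
    or a '?xml version|encoding|standalone' line into an XML declaration.

    Hypothetical helper for illustration; not part of Lexxia.
    """
    if not line.startswith("?"):
        raise ValueError("expected a line starting with '?'")
    target, _, data = line[1:].partition(" ")
    if target == "xml":
        names = ("version", "encoding", "standalone")
        # zip() stops at the shorter sequence, so encoding and
        # standalone are simply omitted when they are not supplied.
        attrs = " ".join(f'{n}="{v}"' for n, v in zip(names, data.split("|")))
        return f"<?xml {attrs} ?>"
    return f"<?{target} {data}?>"


print(pi_to_xml("?mytarget my instructions for this target"))
print(pi_to_xml("?xml 1.0|UTF-8"))
```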
It is just possible that you might need to use a CDATA section, such as to hide a script from a browser. In that event, you will probably want to hide several lines. To make this easy, signal a CDATASection with a "[" at the start of a line and delimit the content with "##", as with comments and <pre> tags:
[This is the content of a processing instruction.
more stuff.
##
This parses to:
<![CDATA[This is the content of a processing instruction.
more stuff.
]]>
Forms show the dottedfile format at its worst. We just have to use XML mixed-content markup:
_form
@action=www.mysite.com/scripts/mycgi.cgi
@method=post
_p=New Customer: <input type="radio" name="new-customer" value="C" />
_p=Name: <input type="text" name="customer-name" value="N"/>
__
This parses into:
<form action="www.mysite.com/scripts/mycgi.cgi" method="post">
<p>New Customer: <input type="radio" name="new-customer" value="C" /></p>
<p>Name: <input type="text" name="customer-name" value="N"/></p>
</form>
With scripts, the dottedfile format fares very nicely. An extremely simple example is:
_head
_meta=http-equiv=content-type|text/html, charset=UTF-8
_title=Associating scripts with a button
__
_button=What time is it?
@type=button
@name=time
@onclick=alert('Today is ' + Date())
@style=font: 1.5em Arial, sans-serif; background: yellow; color: red; padding: 0.3em;
This builds into the following document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" >
<html>
<head>
<meta http-equiv="content-type" content="text/html, charset=UTF-8"/>
<title>Associating scripts with a button</title>
</head>
<body>
<button type="button" name="time" onclick="alert('Today is ' + Date())"
style="font: 1.5em Arial, sans-serif; background: yellow;
color: red; padding: 0.3em;">What time is it?</button>
</body>
</html>
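The way the "@" lines attach attributes to the element declared just before them can be sketched in Python. This is a simplified illustration, not Lexxia's implementation: the function name element_with_attrs is hypothetical, and the sketch handles only a single leaf element followed by its attribute lines.

```python
from xml.sax.saxutils import quoteattr


def element_with_attrs(lines):
    """Build one element from a '_name=content' line and the
    '@key=value' attribute lines that follow it.

    Hypothetical helper for illustration; not part of Lexxia.
    """
    name, _, content = lines[0][1:].partition("=")
    attrs = ""
    for line in lines[1:]:
        if line.startswith("@"):
            # Split on the first '=' only; the value may contain more.
            key, _, value = line[1:].partition("=")
            # quoteattr() adds the surrounding quotes and escapes as needed.
            attrs += f" {key}={quoteattr(value)}"
    return f"<{name}{attrs}>{content}</{name}>"


print(element_with_attrs([
    "_button=What time is it?",
    "@type=button",
    "@name=time",
    "@onclick=alert('Today is ' + Date())",
]))
```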
Another very simple example follows. (Remember that the content for a <script> tag, like that for a <pre> tag, is delimited by the "##" string to allow for extended content.)
_head
_meta=http-equiv=content-type|text/html, charset=utf-8
_title=Using a Menu and Script
__
#This is a very simple dynamic inclusion##
_script=document.write("It is now: " + Date());##
@type=text/javascript
@language=JavaScript
.A FIRST HEADING
This builds to the following document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" >
<html>
<head>
<meta http-equiv="content-type" content="text/html, charset=UTF-8"/>
<title>Using a Menu and Script</title>
<link rel="stylesheet" href="css/simple1.css" type="text/css"/>
</head>
<body>
<!--This is a very simple dynamic inclusion-->
<script type="text/javascript"
language="JavaScript">document.write("It is now: " + Date());</script>
<div class="L1">
<h1>A FIRST HEADING</h1>
</div>
</body>
</html>