It is no secret that OpenOffice.org files are wrappers for XML documents, and that they are, in fact, zip files. This provides a real opportunity for interfacing the files with other utilities, such as Lexxia. This document should be read in conjunction with the Lexxia editing commands.
We are interested in the content file, content.xml, inside the wrapper document. The easy way to get at it is to make a copy of the document, say source.odt, and rename the copy to source.zip. Now your browser will recognise the file and allow you to open it and make a copy of content.xml. But we can do this, perhaps more elegantly, with a very small script, which I have named (unimaginatively) xtract:
# xtract: Extracts a single content.xml file from an .odt or related file.
# It sets up a temporary directory, makes a copy of the source file in that directory,
# uses unzip to open the copy, moves content.xml back to the main directory
# and deletes the temporary directory.
# The temporary directory name is taken to be a simple name in the current directory.
if [ $# -ne 3 ]; then
    echo "xtract use: ./xtract source.odt result.xml tempdirname"
    exit 1
fi
rm -f -r "$3"
mkdir "$3"
cp "$1" "$3/mycopy.zip"
cd "$3"
unzip -q mycopy.zip
cd ..
mv "$3/content.xml" "$2"
rm -r "$3"
exit 0
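If Info-ZIP's unzip is the one installed, the temporary directory can probably be avoided altogether, because its ''-p'' option writes a named archive member to standard output. What follows is only a minimal sketch under that assumption; the function name ''extract_content'' is my own invention and is not part of xtract:

```shell
# extract_content: write content.xml from an .odt file to a result file.
# Hypothetical helper; assumes Info-ZIP unzip is installed and on the path.
extract_content() {
    if [ $# -ne 2 ]; then
        echo "use: extract_content source.odt result.xml" >&2
        return 1
    fi
    # -p pipes the named member to standard output; nothing else is unpacked
    unzip -p "$1" content.xml > "$2"
}
```

Called as ''extract_content source.odt mycontent.xml'', this should leave the same file that xtract produces, without copying or renaming the source document.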
A first glimpse of this file is a little intimidating, and it is surprisingly difficult to make out the actual text content. This reinforces my view that XML is a wonderful tool, but that humans should avoid mixing with it. But it really allows the Lexxia editing tools to come to the fore. The main problems are:
We can convince ourselves of this with Lexxia. Assume that we have a content file called mycontent.xml. First we load and output the file with Lexxia:
lexxia mycontent.xml -O|less
We observe that there are many elements with a ''text:style-name'' attribute, so we rename these with the ''-A'' command:
lexxia mycontent.xml -Atext:style-name -O|less
We see that we have renamed all of these elements with a name corresponding to the value of these attributes. Things are just a little clearer. But there are still lots of attributes, so we remove all of them with the ''-a'' command:
lexxia mycontent.xml -Atext:style-name -a -O|less
Now the file is dramatically simpler, but there are now some empty elements (these had previously carried attributes), so we decide to remove them with the ''-z'' command:
lexxia mycontent.xml -Atext:style-name -a -z -O|less
We now recognise some of the element names: ''Heading_20_1'', for example, is pretty obviously a level-1 heading -- and there are level-2, level-3, etc. headings. We want to convert these to standard XHTML h1, h2, h3, etc. This is a little long-winded, but we are getting there:
lexxia mycontent.xml -Atext:style-name -a -z -nHeading_20_1+h1 -nHeading_20_2+h2 -nHeading_20_3+h3 -O|less
Similarly, we rename a ''Standard'' element to ''p'' because it seems to apply to normal text:
lexxia mycontent.xml -Atext:style-name -a -z -nHeading_20_1+h1 -nHeading_20_2+h2 -nHeading_20_3+h3 -nStandard+p -O|less
Depending on your file, and the extent of styling, you may have a few other elements to be treated in this way. One that seems fairly common is the ''Emphasis'' element. Just use the ''-n'' command to change it to the usual ''em'' element -- or any other XHTML element that appeals to you.
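To find out which style names a particular file actually uses before choosing the ''-n'' renames, it can help to count the ''text:style-name'' values directly. The sketch below relies only on grep, sed, sort and uniq, and assumes a grep that supports ''-o'' (GNU and BSD grep both do); the function name ''list_styles'' is hypothetical:

```shell
# list_styles: count how often each text:style-name value occurs in a content file.
# Hypothetical helper; assumes a grep that supports -o (print only the matched text).
list_styles() {
    grep -o 'text:style-name="[^"]*"' "$1" |
        sed 's/^text:style-name="//; s/"$//' |
        sort | uniq -c | sort -rn
}
```

Running ''list_styles mycontent.xml'' lists the most frequent names first. Names such as ''Heading_20_1'' appear to be the style name "Heading 1" with the space written as ''_20_'', so the list should correspond closely to the style names shown in OpenOffice.org.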
Again depending on your file, you may find that you have a single root element with a single element child, which in turn has a single element child, etc. This is fixed with the ''-r'' (root) command.
lexxia mycontent.xml -Atext:style-name -a -z -nHeading_20_1+h1 -nHeading_20_2+h2 -nHeading_20_3+h3 -nStandard+p -r -O|less
Finally, you may want to convert the result into an XHTML file with a stylesheet. To get the file ''mycontent.html'', we add:
lexxia mycontent.xml -Atext:style-name -a -z -nHeading_20_1+h1 -nHeading_20_2+h2 -nHeading_20_3+h3 -nStandard+p -r -wpage.css -O*.html
Or you can build a LaTeX file, thence a PDF file:
lexxia mycontent.xml -Atext:style-name -a -z -nHeading_20_1+h1 -nHeading_20_2+h2 -nHeading_20_3+h3 -nStandard+p -r -Elatex-html.css+*.tex
pdflatex mycontent.tex
I have made a feature here of how the Lexxia commands are concatenated to give an overall result, but it is not necessary to work this way. Of course, such formidable commands are best wrapped in a script. Just what commands you need to incorporate into the script will depend on how you have styled the OpenOffice.org document, but these examples should show the way. Obviously, a consistent set of styles will allow you to standardise your script.
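As an illustration of such a wrapper, here is a minimal sketch that assembles the XHTML command shown earlier, generating the heading renames in a loop. The function name ''odt2xhtml'', the optional second argument and the ''DRY_RUN'' variable are all my own assumptions; the Lexxia options are only those already used in this document, and the stylesheet name page.css is carried over from the example:

```shell
# odt2xhtml: build and run the Lexxia command for one content file.
# Hypothetical wrapper; the Lexxia options are only those used in this document.
odt2xhtml() {
    content=$1        # content file, e.g. mycontent.xml
    levels=${2:-3}    # number of heading levels to map to h1, h2, ...
    # Collect the command in the function's own argument list
    set -- lexxia "$content" -Atext:style-name -a -z
    i=1
    while [ "$i" -le "$levels" ]; do
        set -- "$@" "-nHeading_20_${i}+h${i}"
        i=$((i + 1))
    done
    set -- "$@" -nStandard+p -r -wpage.css '-O*.html'
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"    # show the command without running it
    else
        "$@"
    fi
}
```

Setting ''DRY_RUN=1'' before the call prints the assembled command instead of running it, which is a convenient check before pointing the script at a real file. Further ''-n'' renames for your own styles would be added in the same way as the ''Standard'' one.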
It depends on what you need, but it is certainly no bad thing to be able to move content between formats. And there may be some subtle extras, such as superior justification and automatic hyphenation, page headings, a table of contents and page numbering.
It is very instructive to compare printouts from different environments. For example, the printout from a PDF file generated from the LaTeX file we have just built is markedly superior to printouts from the XHTML file we have just built (regardless of the browser) or the .odt file in OpenOffice.org. It is all about the quality of the rendering engines. If you want to do serious work, probably with very large files, LaTeX is definitely the stand-out.