Strings and Characters in Limpid

Preamble on Strings

String handling is of central importance in XML processing. The String class has been part of the Java language from the start, and this was presumably an important factor in the general preference for Java in XML work. On the other hand, the Java String class really shows its age, which may stem from the fact that Java (originally Oak) was not conceived as a general-purpose language. This shows in its interface. For example, a String is immutable: concatenating with concat() or + does not change the String in place but creates a new one, which must be assigned back to the variable, eg:

String s = new String("first");
s = s.concat("second");

Equivalent code in Limpid is:

DOMString s = "first";
s += "second";

While it is fair to criticise the shortcomings of the Java String class, we cannot do so for C++, because it has none! And this is the real point: with Java, you can work almost interchangeably with Strings and hard-coded string literals without worrying about the difference. This is because String is part of the Java language, not just a library add-on. With C++, the only strings available (unless you design your own) are those in the Standard Template Library (STL). These strings are a library add-on, not part of the core C++ language, so a string literal is still a plain array of char.

So, in writing the Limpid system, I have had to consider the user at every point where a string of some sort is involved. This has led me to define interfaces for very many of the classes, for example, an Element constructor:

Element(const DOMString& elementName);

rather than:

Element(DOMString& elementName);

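This exploits a very clever, but non-obvious, feature of C++: implicit parameter conversion. The compiler may build a temporary DOMString from the argument, and a temporary can bind to a const reference but not to a non-const one. In this case, the const reference permits the use of an (ASCII) string literal as the parameter: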

Element element("firstElement");

Now this is possible because there is a constructor for DOMString:

DOMString(const char*);
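The conversion can be tried in isolation with two small stand-in classes. This is only a sketch: MiniString and MiniElement are hypothetical names that mimic the two constructors shown above, not the Limpid classes themselves.

```cpp
#include <string>

// Hypothetical stand-in for DOMString: only the converting constructor.
class MiniString : public std::string {
  public:
    MiniString(const char* str) { if (str) append(str); }
};

// Hypothetical stand-in for Element.
class MiniElement {
  public:
    // const reference: a temporary MiniString built from a string
    // literal is allowed to bind to this parameter.
    MiniElement(const MiniString& elementName) : name(elementName) {}
    const MiniString& getName() const { return name; }
  private:
    MiniString name;
};

// With a non-const reference the same call would not compile:
//   void rename(MiniString& newName);
//   rename("x");  // error: a temporary cannot bind to MiniString&
```

Given these definitions, MiniElement element("firstElement"); compiles because the compiler first builds a temporary MiniString from the literal and then passes it by const reference.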

Characters

When the application or library is compiled for Linux, the developer has the choice of using 7-bit ASCII characters or wide characters. The choice is made before compilation, in the header file definitions.h, through the definition of Char (note case). If Char is narrow, it equates to char; if wide, it equates to wchar_t. The default is wide.
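A compile-time switch of this kind might look like the sketch below. The macro name LIMPID_NARROW_CHARS and the helper bytesFor() are assumptions made for illustration; the actual spelling used in definitions.h may differ.

```cpp
#include <cstddef>

// LIMPID_NARROW_CHARS is a hypothetical macro name, not taken from
// definitions.h. Leaving it undefined selects the wide default.
#ifdef LIMPID_NARROW_CHARS
typedef char Char;      // narrow build: 7-bit ASCII
#else
typedef wchar_t Char;   // wide build (the default)
#endif

// Illustrative helper: bytes needed to store `count` characters
// in whichever build was selected.
inline std::size_t bytesFor(std::size_t count) {
    return count * sizeof(Char);
}
```

Because every other declaration refers only to Char, the rest of the code is unaffected by which branch is taken.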

In the Windows environment, UTF-16 characters are used, regardless of the choice of narrow or wide.

Now, in the interests of efficiency and robustness of code, I have decided to store all characters internally as either ASCII characters or as UTF-32 characters. All input and output uses a Reader or Writer class and conversion is automatic. Readers and Writers use UTF-8 as default, but the UTF-16 variants are selectable. UTF-32 I/O is not implemented, as this is unlikely to be an attractive option.
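The conversion performed on input can be pictured with the following sketch of a UTF-8 to UTF-32 decoder. It is not the Limpid Reader: the function name decodeUtf8 and the use of std::u32string (C++11) are assumptions, and the sketch does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 bytes into UTF-32 code points.
// Returns false on an invalid lead byte, a bad continuation byte
// or a truncated sequence.
bool decodeUtf8(const std::string& in, std::u32string& out) {
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else return false;  // stray continuation or invalid lead byte

        if (in.size() - i - 1 < extra) return false;  // truncated input

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);           // add six payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return true;
}
```

Once decoded, every character occupies one fixed-width unit, so indexing and searching need no knowledge of multi-byte sequences.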

The decision to use UTF-32 internally is not as extravagant as it might seem in terms of memory requirements. Checks have shown that, in the worst case, the memory footprints are increased by 50 - 60% over using ASCII characters (not the increase of 300% that might have been expected).

All character output to the console is UTF-8. Console output uses a static instance of Writer, named Console. This permits display of characters that are ignored by the default C++ cout.
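The output direction, from internal UTF-32 to UTF-8 bytes, can be sketched in the same way. The names appendUtf8 and toUtf8 are hypothetical, and the sketch assumes every input value is a valid Unicode code point; the real Writer class is not shown in this document.

```cpp
#include <string>

// Append the UTF-8 encoding of one code point to `out`.
// Assumes cp is a valid code point (no range or surrogate checks).
void appendUtf8(char32_t cp, std::string& out) {
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {             // 3 bytes
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {                               // 4 bytes
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Encode a whole UTF-32 string as UTF-8.
std::string toUtf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) appendUtf8(cp, out);
    return out;
}
```

The resulting byte string can be written to a UTF-8 terminal with an ordinary narrow stream.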

DOMString, aka String

All string handling in Limpid uses, directly or indirectly, the DOMString class, which is a subclass of std::basic_string<Char>. DOMString consolidates a range of utility functions for manipulating and searching strings. The localisation of all string functions within DOMString simplifies maintenance.

DOMString provides conversion constructors for null-terminated arrays of const char and const Char (note case). In particular, it is used to translate to and from arrays of Char in instances of (eg) TextContent, ElementContent and AttributeContent. It also provides functions for searching and replacement, and for generating the null-terminated char arrays (char*) that are required to interface with the C++ standard library. The declaration of DOMString is:

class DOMString : public std::basic_string<Char> {
  public:
    enum {ConversionError = -1};

    DOMString();
    DOMString(const Char *str) { if (str) append(str); }
    DOMString(const char *str) { if (str) add(str); }
    DOMString(const char c) { push_back(static_cast<Char>(c)); }
    DOMString(size_t len, Char c);
    DOMString(const CharString& s);
    DOMString(const DOMString& refString);
    DOMString& operator=(const DOMString& refString) {
      if (this != &refString) assign(refString); return *this; }
    
    DOMString substring(size_t start, size_t len) const;
    DOMString& operator+=(char c) {
      push_back(static_cast<Char>(c)); return *this; }
    DOMString& operator+=(const DOMString& str) {
      append(str); return *this; }
    DOMString& operator+=(const char* str) {
      return add(str); }
    DOMString operator+(const DOMString& str) const {
      return DOMString(*this) += str; }

    bool operator==(const DOMString& rhs) const {
      return !compare(rhs); }
    bool operator!=(const DOMString& rhs) const {
      return compare(rhs); }
    bool operator<(const DOMString& rhs) const {
      return compare(rhs) < 0; }
    bool operator==(const char* s) const;
    bool operator!=(const char* s) const;

    static DOMString failString;

  //array functions:
    Char *copyChars() const;
    char* getCString() const;

  //query functions:
    size_t getLength() { return size(); }
    size_t indexOf(const DOMString& substring, size_t startPos = 0) const;
    bool contains(const DOMString& substring) const;
    bool startsWith(const DOMString& substring) const;
    bool endsWith(const DOMString& substring) const;

  //manipulation functions:
    DOMString& trim();
    DOMString& normalizeWhitespace();
    DOMString& change(const DOMString& oldString,
      const DOMString& newString);
    DOMString& change(size_t offset,
      size_t count, const DOMString& newString);
    DOMString& insertString(size_t pos, const DOMString& newString);
    DOMString& remove(size_t pos, size_t count);
    DOMString& toLowerCase();
    DOMString& toUpperCase();
    DOMString& capitalize();
    DOMString& reverse();
    DOMString& resolveEntities();
    DOMString getDelimitedString(size_t startPos);
    int toInt(int base = 10) const;
    DOMString& escapeAll();
    DOMString& escapeBraKet();
    DOMString& escapeDelimiters();
    size_t getSize() const {
      return sizeof(*this) + size() * sizeof(Char); }
    bool isValid() { return this != &failString; }

  //utility static functions:
    static size_t strLength(const Char* ptr) {
      size_t size = 0; while (*ptr++) ++size; return size; }
    static bool isWhitespace(int c) {
      return (c == ' ' || c == '\t' || c == '\n' || c == '\r'); }
    static Char *copyChars(Char* source);
    static Char *copyChars(const Char* source, size_t length);
    static Char *copyChars(const char* source);
    
  protected:
    DOMString &add(const char* str);
    DOMString& normalizeGaps();

  private:
    bool doResolve(DOMString& string);
    bool doExtendedChar(DOMString& string);
    int getExtendedChar(DOMString& entityName);
};
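To show how a few of these members could work, here is a minimal sketch built on the same inheritance pattern. SketchString is a hypothetical class; the bodies of add(), trim() and startsWith() are guesses at plausible behaviour, not the Limpid implementations, and the wide build is assumed.

```cpp
#include <cstddef>
#include <string>

typedef wchar_t Char;  // assumes the default wide build

// Hypothetical reduced version of DOMString.
class SketchString : public std::basic_string<Char> {
  public:
    SketchString() {}
    SketchString(const char* str) { if (str) add(str); }

    SketchString& operator+=(const char* str) { return add(str); }

    // True if this string begins with `prefix`.
    bool startsWith(const SketchString& prefix) const {
        return compare(0, prefix.size(), prefix) == 0;
    }

    // Remove leading and trailing whitespace in place.
    SketchString& trim() {
        size_type first = 0;
        while (first < size() && isWhitespace((*this)[first])) ++first;
        size_type last = size();
        while (last > first && isWhitespace((*this)[last - 1])) --last;
        assign(substr(first, last - first));
        return *this;
    }

    static bool isWhitespace(int c) {
        return (c == ' ' || c == '\t' || c == '\n' || c == '\r');
    }

  protected:
    // Widen each narrow character and append it.
    SketchString& add(const char* str) {
        if (str)
            for (; *str; ++str)
                push_back(static_cast<Char>(
                    static_cast<unsigned char>(*str)));
        return *this;
    }
};
```

With this sketch, SketchString s("  first"); s += "second \n"; s.trim(); leaves s holding the eleven characters of firstsecond, and s.startsWith("first") is true because the literal is converted through the const char* constructor.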

Implementation Issues

Two points pertaining to inheritance from std::basic_string should be noted: