[Expat-discuss] Handling pernicious mixed content
Jeremy H. Griffith
jeremy at omsys.com
Sun Sep 20 01:54:21 CEST 2009
As required by the XML spec, expat reports all whitespace
within elements. In most cases, it's easy to determine if
the whitespace should be retained or not. If the element
can contain text (#PCDATA), keep it; if the element cannot
contain text, discard it.
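A minimal sketch of that rule, using Python's standard xml.parsers.expat binding rather than the C API. The TEXT_ELEMENTS table and the collect_text function are hypothetical names for illustration; a real tool would derive the table from the DTD or schema.

```python
import xml.parsers.expat

# Hypothetical content-model table: elements whose content may include #PCDATA.
TEXT_ELEMENTS = {"p", "li", "entry"}

def collect_text(xml_source):
    """Return the character-data chunks that survive the keep/discard rule."""
    stack = []  # names of the currently open elements
    kept = []

    def start(name, attrs):
        stack.append(name)

    def end(name):
        stack.pop()

    def chars(data):
        parent = stack[-1] if stack else None
        if data.strip() == "" and parent not in TEXT_ELEMENTS:
            return  # whitespace in element-only content: discard it
        kept.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver adjacent character data in one callback
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_source, True)
    return kept
```

Here the whitespace between <ul> and <li> is dropped because <ul> is not in the table, while anything inside <li> is kept, which is exactly where the ambiguity described next comes from.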
But there are two elements in DITA (and one in DocBook)
that create ambiguity: the table cell wrapper, <entry>
in DITA, and the list item wrapper, <li> in DITA.
DocBook calls it "pernicious mixed content" in Entry
(in Norm Walsh's O'Reilly DocBook book, on page 211).
For each, the content can include text but also block
elements. So if you have:
<ul>
<li>
<p>Something.</p>
</li>
<li> or something else.</li>
</ul>
and normalize whitespace, you'd expect:
*Something.
* or something else.
not:
* Something.
* or something else.
or:
*Something.
*or something else.
This is acknowledged as an intractable problem in DocBook
for SGML parsers, but clearly anyone using such a parser
must solve it in some way. One answer: if you see a block
element within the wrapper, discard the whitespace that
precedes it. Is there a better one?
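For concreteness, here is a rough sketch of that heuristic, again with Python's xml.parsers.expat binding. Whitespace-only data directly inside a wrapper is held back until the next event shows what follows it. The names WRAPPERS, BLOCKS, and wrapper_texts are hypothetical, the BLOCKS set is only a stand-in for the real list of block elements, and the sketch assumes flat (non-nested) wrappers. It also drops whitespace left pending when the wrapper closes, which goes one step beyond the rule as stated.

```python
import xml.parsers.expat

WRAPPERS = {"li", "entry"}  # the mixed-content wrappers discussed above
BLOCKS = {"p"}              # hypothetical: block elements allowed in a wrapper

def wrapper_texts(xml_source):
    """Return the text of each wrapper, dropping whitespace before blocks."""
    stack = []    # names of the currently open elements
    items = []    # accumulated text, one string per wrapper
    pending = []  # whitespace held back until we see what follows it

    def flush():
        if pending:
            items[-1] += "".join(pending)
            pending.clear()

    def start(name, attrs):
        if name in BLOCKS:
            pending.clear()  # a block follows: discard the preceding whitespace
        else:
            flush()          # an inline element follows: the whitespace is text
        stack.append(name)
        if name in WRAPPERS:
            items.append("")

    def end(name):
        stack.pop()
        if name in WRAPPERS:
            pending.clear()  # assumption: trailing whitespace is also dropped

    def chars(data):
        if not any(n in WRAPPERS for n in stack):
            return  # outside any wrapper, e.g. directly inside <ul>
        if stack[-1] in WRAPPERS and data.strip() == "":
            pending.append(data)  # undecided until the next event
            return
        flush()
        items[-1] += data

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver adjacent character data in one callback
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_source, True)
    return items
```

Fed the <ul> example above, this yields "Something." for the first item and " or something else." for the second, so prefixing each with "*" gives the expected rendering.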
Thanks!
-- Jeremy H. Griffith, at Omni Systems Inc.
<jeremy at omsys.com> http://www.omsys.com/