[Expat-discuss] not well-formed (invalid token) error
lee at novomail.net
Thu Apr 9 00:00:36 CEST 2009
Krishna Kondaka wrote:
> I am trying to parse a very simple HTML file but I am getting 'not
> well-formed (invalid token) error'. Is there any thing I can do to
> make this work without getting errors?
Yes, you can convert your HTML file to XHTML.
Expat is an XML parser. HTML is /not/ XML; rather HTML is based on SGML
syntax. SGML allows for certain tags to be implicitly closed. For
example, in HTML this is allowed:
<p>This is a paragraph
<p>This is another paragraph
In HTML, when a <p> is encountered, it /implicitly/ closes any open <p> tag.
XHTML is a reformulation of HTML with one additional restriction: the
markup must be not only valid HTML but also well-formed XML. So the
foregoing HTML snippet must be written in XHTML as:
<p>This is a paragraph</p>
<p>This is another paragraph</p>
Note the addition of the closing tags.
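To see the difference from Expat's point of view, here is a minimal sketch using Python's standard xml.parsers.expat binding. The helper name is_well_formed and the <body> wrapper element are my own additions for illustration (XML needs a single root element); they are not part of the original example.

```python
import xml.parsers.expat


def is_well_formed(markup):
    """Return True if Expat parses the markup without raising an error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this chunk as the final one.
        parser.Parse(markup, True)
        return True
    except xml.parsers.expat.ExpatError:
        return False


# HTML style: the second <p> is never explicitly closed.
html_style = "<body><p>This is a paragraph\n<p>This is another paragraph</body>"

# XHTML style: every <p> has a matching </p>.
xhtml_style = (
    "<body><p>This is a paragraph</p>\n"
    "<p>This is another paragraph</p></body>"
)

print(is_well_formed(html_style))
print(is_well_formed(xhtml_style))
```

Expat has no notion of implicit closing, so it rejects the first string and accepts the second.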
Also, HTML has a few tags that are self-contained ("empty") and take
no closing tag. Among these are <img>, <br> and <hr>. In XML,
however, every element must be closed explicitly, so the syntax for
these empty tags puts a slash before the final angle bracket, as in <br/>.
Your example is valid HTML, but it is not valid XHTML, because the <hr>
tag is not closed; use <hr/> instead.
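If it helps to locate this kind of problem, the sketch below asks Expat where it gave up. It again uses Python's xml.parsers.expat binding; the function name first_expat_error and both sample documents are hypothetical stand-ins, not your actual file.

```python
import xml.parsers.expat


def first_expat_error(markup):
    """Return (message, line, column) for the first Expat error, or None."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(markup, True)
    except xml.parsers.expat.ExpatError as err:
        # err.code is Expat's numeric error code; ErrorString turns it
        # into the human-readable message. lineno and offset locate it.
        message = xml.parsers.expat.ErrorString(err.code)
        return (message, err.lineno, err.offset)
    return None


# Hypothetical stand-in documents, not the original poster's file.
unclosed = "<html><body><p>Hello</p><hr></body></html>"
closed = "<html><body><p>Hello</p><hr/></body></html>"

print(first_expat_error(unclosed))
print(first_expat_error(closed))
```

Note that the reported position is where Expat noticed the problem, which can be later in the file than the tag that actually needs fixing.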
If you do not control your HTML, you can use a tool like HTMLTidy
(http://tidy.sourceforge.net/) to convert valid (and sometimes invalid)
HTML to XHTML.