[Expat-discuss] & symbol workaround

Régis St-Gelais (Laubrass) regis.st-gelais at laubrass.com
Wed Feb 4 21:57:51 CET 2009


Maybe you could preprocess the XML file as you read it, before passing the buffer to the XML parser.
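
A minimal sketch of that idea in Python, assuming the files are in an ASCII-compatible encoding such as UTF-8 and that the only problem is unescaped ampersands. The helper names (escape_bare_ampersands, parse_text) are hypothetical, not part of expat. The file contents are fixed up in memory only, so the originals on disk stay untouched. Note that this blind substitution would also rewrite a literal & inside a CDATA section or comment, where it is actually legal, and that a reference to an undefined entity (for example &nbsp;) is left alone and would still be rejected by the parser.

```python
import re
import xml.parsers.expat

# Matches an '&' that does NOT start something shaped like a reference
# (&name; or &#38; or &#x26;). Assumption: ASCII-compatible input bytes.
BARE_AMP = re.compile(rb"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(data: bytes) -> bytes:
    """Return a copy of data with every bare '&' replaced by '&amp;'."""
    return BARE_AMP.sub(b"&amp;", data)


def parse_text(data: bytes) -> list:
    """Parse the fixed-up buffer with expat and collect character data."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    # Second argument True marks this buffer as the final chunk.
    parser.Parse(escape_bare_ampersands(data), True)
    return chunks
```

To use it on a file, read the bytes with open(path, "rb").read() and pass them to parse_text; the file itself is never rewritten.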

Regis St-Gelais
www.laubrass.com

----- Original Message ----- 
  From: Brad Causey 
  To: expat-discuss 
  Sent: Wednesday, February 04, 2009 3:40 PM
  Subject: Re: [Expat-discuss] & symbol workaround


  Nick,

  I completely agree. Unfortunately, I don't have control over the code that
  generates these XML files.
  If there isn't a better alternative, I'll have to create a duplicate of
  EVERY file and parse each one at a text level to replace non-standard
  characters with an escaped version. (Doing this for < is nearly impossible.)
  This is something I am trying to avoid for obvious reasons. I don't like
  non-standard XML any more than the next guy. (I've been through 3 different
  Python XML parsers trying to resolve this.) But I'm running out of options.
  Any ideas?



  -Brad


  On Wed, Feb 4, 2009 at 2:30 PM, Nick <nickmacd at xxx.com> wrote:

  > & is NOT valid as a standalone character in XML and needs to be
  > escaped as &amp; otherwise you are not parsing standard (and thus
  > valid) XML files, but in fact parsing some other hybrid thing.
  >
  > Referring to the XML standard ( http://www.w3.org/TR/REC-xml/ ):
  >
  > The ampersand character (&) and the left angle bracket (<) MUST NOT
  > appear in their literal form, except when used as markup delimiters,
  > or within a comment, a processing instruction, or a CDATA section. If
  > they are needed elsewhere, they MUST be escaped using either numeric
  > character references or the strings "&amp;" and "&lt;"
  > respectively. The right angle bracket (>) may be represented using the
  > string "&gt;", and MUST, for compatibility, be escaped using either
  > "&gt;" or a character reference when it appears in the string "]]>"
  > in content, when that string is not marking the end of a CDATA
  > section.
  >
  > So I would argue that you NEED to change the source files, in order to
  > bring them into line with the standard.
  >
  > Nick
  >
  >
  > On Wed, Feb 4, 2009 at 2:56 PM, Brad Causey <bradcausey at xxx.com>
  > wrote:
  > > I am working on a Python script that parses around 6800 small xml files.
  > > My code isn't pretty, as I am just testing a PoC at this point, but I have
  > > run into a problem. When the script hits the Ampersand symbol, it quits with
  > > "xml.parsers.expat.ExpatError: not well-formed (invalid token): line 28,
  > > column 41"
  > >
  > > I am trying to figure out a way to work around this without modifying the
  > > XML files themselves as these need to be preserved in the original format.
  _______________________________________________
  Expat-discuss mailing list
  Expat-discuss at libexpat.org
  http://mail.libexpat.org/mailman/listinfo/expat-discuss

