[Expat-discuss] empty tags

Fred L. Drake, Jr. fdrake@acm.org
Mon, 13 Aug 2001 10:51:58 -0400 (EDT)

Sylvain PRAT writes:
 > yes, but we should be aware of the charset...

  Yes, pretty flaky stuff.  Here's another approach: in the start-tag
handler, record the current source index; then in the end-tag handler,
you know it was an empty-element tag if the position hasn't changed.
That can be optimized a little bit by maintaining a flag:

static int  maybe_empty_element_tag;
static long byte_index;

/* "parser" is the XML_Parser, assumed to be visible at file scope. */
static void
start(void *data, const char *el, const char **attr)
{
    maybe_empty_element_tag = 1;
    byte_index = XML_GetCurrentByteIndex(parser);
}

static void
end(void *data, const char *el)
{
    if (maybe_empty_element_tag) {
        maybe_empty_element_tag = 0;
        if (byte_index == XML_GetCurrentByteIndex(parser)) {
            /* empty-element tag */
        }
    }
}

  All other handlers could optionally clear maybe_empty_element_tag to
avoid the call back into expat for elements like <e>characters
only</e>.  Whether that would be a win depends on how isolated you want
this aspect of the processing to be, and on whether the (very small)
performance improvement is worth the maintenance cost.

 > I think the parser is aware of the empty tag, so why
 > couldn't this be a feature (as reading the input
 > context is one too), especially because start tags are
 > possibly erased before reading the end tags...  

  The parser has to be aware of it, but I'm not sure what would be the
best way to expose it.  Perhaps something similar to the
XML_GetSpecifiedAttributeCount() function, which is only valid during
the start-element callback?  That's certainly possible, but would
incur higher overhead than the approach outlined here (because it
would always result in a function call).  Providing that would not
invalidate this approach, so they could coexist.


Fred L. Drake, Jr.  <fdrake at acm.org>
PythonLabs at Zope Corporation