[Expat-discuss] XML_CharacterDataHandler: can it receive text cut half inside a multibyte character sequence?

Boris Dušek boris.dusek at gmail.com
Sat Mar 14 19:03:06 CET 2009


Hello,

I am using expat with libiconv to convert data to wchar_t, but the
question applies to any other non-UTF-8 target encoding, not just
wchar_t (e.g. ISO-8859-2):

When expat calls the function set by XML_SetCharacterDataHandler, can
the function receive a block of text (parameters const XML_Char *s,
int len) that ends in the middle of a multibyte character? That is,
given a Unicode character encoded as a sequence of 2-4 bytes, can the
block's last byte, s[len-1], belong to that sequence without being its
final byte?

I can think of a workaround: copy the trailing incomplete multibyte
sequence, as indicated by iconv, to the beginning of a buffer, and on
the next call to the data handler append the rest of the data to that
buffer and call iconv again. But this involves copying and possibly
dynamically allocating the buffer, which I want to avoid, so it would
be great if expat never ended a block in the middle of a multibyte
sequence.
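
For illustration, the carry-over workaround described above could be
sketched roughly as follows. This is only a sketch under assumptions:
it assumes XML_Char is char and that the handler receives UTF-8, and
it says nothing about whether expat actually splits characters. The
names utf8_seq_len, utf8_incomplete_tail, Utf8Joiner, joiner_feed and
Sink are hypothetical, and the emit callback merely stands in for the
place where iconv would be called. Since a UTF-8 sequence is at most 4
bytes long, an incomplete tail is at most 3 bytes, so a fixed 4-byte
buffer should be enough and the bulk of each block can be passed on
without copying.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the UTF-8 sequence introduced by lead byte b, judged by its
   bit pattern only; 0 for a continuation byte or an invalid lead byte. */
int utf8_seq_len(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Number of trailing bytes of s[0..len) that form an incomplete UTF-8
   sequence; 0 if the block ends on a character boundary. Malformed
   input also yields 0, leaving error handling to the converter. */
int utf8_incomplete_tail(const char *s, int len)
{
    int i = len, back = 0, need;
    assert(s != NULL && len >= 0);
    /* Step back over at most 3 continuation bytes (10xxxxxx). */
    while (i > 0 && back < 3 && ((unsigned char)s[i - 1] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    if (i == 0) return 0;                 /* no lead byte in this block */
    need = utf8_seq_len((unsigned char)s[i - 1]);
    if (need == 0) return 0;              /* malformed */
    return (back + 1 < need) ? back + 1 : 0;
}

/* Hypothetical per-parser state: a fixed carry buffer, no allocation. */
typedef struct {
    char carry[4];
    int carry_len;
    void (*emit)(const char *s, int len, void *user); /* stand-in for iconv */
    void *user;
} Utf8Joiner;

/* Would be called from the character data handler with (s, len). */
void joiner_feed(Utf8Joiner *j, const char *s, int len)
{
    int tail;
    if (j->carry_len > 0) {
        /* Complete the character left over from the previous call. */
        int need = utf8_seq_len((unsigned char)j->carry[0]);
        while (j->carry_len < need && len > 0) {
            j->carry[j->carry_len++] = *s++;
            len--;
        }
        if (j->carry_len < need) return;  /* still incomplete, wait */
        j->emit(j->carry, j->carry_len, j->user);
        j->carry_len = 0;
    }
    /* Pass on the complete part in place, stash only the short tail. */
    tail = utf8_incomplete_tail(s, len);
    if (len - tail > 0) j->emit(s, len - tail, j->user);
    if (tail > 0) memcpy(j->carry, s + len - tail, (size_t)tail);
    j->carry_len = tail;
}

/* Demo sink that records what was emitted, in place of a real converter. */
typedef struct { char buf[64]; int len; int calls; } Sink;

void sink_emit(const char *s, int len, void *user)
{
    Sink *k = (Sink *)user;
    if (k->len + len <= (int)sizeof k->buf) {
        memcpy(k->buf + k->len, s, (size_t)len);
        k->len += len;
    }
    k->calls++;
}
```

In a real handler, emit would be replaced by a call to iconv, once for
the completed carry buffer and once for the in-place part of the
block. The cost is one extra converter call per split character
rather than a copy of the whole block.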

Thanks for any answer,
Boris Dušek

