From: ludo@gnu.org (Ludovic Courtès)
To: Mark H Weaver <mhw@netris.org>
Cc: Andy Wingo <wingo@pobox.com>, guile-devel@gnu.org
Subject: Re: [PATCHES] Discard BOMs at stream start for UTF-{8, 16, 32} encodings
Date: Thu, 31 Jan 2013 22:42:39 +0100 [thread overview]
Message-ID: <87wqutjagg.fsf@gnu.org> (raw)
In-Reply-To: <87d2wmhsn4.fsf_-_@tines.lan> (Mark H. Weaver's message of "Wed, 30 Jan 2013 23:40:31 -0500")
Hi!
Mark H Weaver <mhw@netris.org> skribis:
> I researched this some more, and discovered that removal of byte-order
> marks (BOMs) is the responsibility of iconv, which discards BOMs from
> the beginning of streams when using the UTF-16 or UTF-32 encodings, but
> *not* for UTF-16LE, UTF-16GE, UTF-32LE, UTF-32GE or any other encoding.
> It uses the BOM to determine the endianness of the stream, but other
> than that does *not* use it to guess the encoding, so there's no
> guesswork involved. (Side note: iconv also inserts a BOM automatically
> when writing a stream using UTF-16 or UTF-32).
Are you talking about GNU iconv or iconv as specified by POSIX?
I can’t see any occurrence of “BOM” at
<http://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv.html>.
> So thanks to iconv, we get UTF-{16,32} BOM removal for free.
> Unfortunately we have a nasty bug in 'get_iconv_codepoint' that leads to
> a buffer overrun and assertion failure when 'iconv' discards a BOM.
Good catch!
> The first patch below fixes this problem. I ended up almost completely
> rewriting that function, partly because it was largely structured around
> a mistaken assumption that iconv will never consume input without
> producing output, and partly because it was quite inefficient (several
> unnecessary conditional branches in the loop) and IMO was rather
> difficult to read.
Great. (I think ‘iconv’ semantics lead to tricky code, no matter what.)
> get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
> char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
> {
[...]
> - for (output_size = 0, output = (char *) utf8_buf,
> - bytes_consumed = 0, err = 0;
> - err == 0 && output_size == 0
> - && (bytes_consumed == 0 || byte_read != EOF);
> - bytes_consumed++)
> + for (;;)
Clarity is in the eye of the beholder, but to me this is a step backwards.
[...]
> + /* NOTE: The following test assumes that the only special values
> + (other than SCM_ICONV_UNINITIALIZED) are for UTF-8. */
> + if (SCM_ICONV_SPECIAL_P (pt->input_cd))
Probably an indication that a more descriptive name is needed, as Andy
noted.
> @@ -2247,16 +2279,15 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
> new_output_cd = iconv_open (encoding, "UTF-8");
> if (new_output_cd == (iconv_t) -1)
Should be SCM_ICONV_UNINITIALIZED?
Thanks again for the research and fixes!
Ludo’.
next prev parent reply other threads:[~2013-01-31 21:42 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-01-28 21:42 byte-order marks Andy Wingo
2013-01-28 22:20 ` Mike Gran
2013-01-29 9:03 ` Andy Wingo
2013-01-29 8:22 ` Mark H Weaver
2013-01-29 9:03 ` Andy Wingo
2013-01-29 13:27 ` Ludovic Courtès
2013-01-29 14:04 ` Andy Wingo
2013-01-29 17:09 ` Mark H Weaver
2013-01-29 19:09 ` Mark H Weaver
2013-01-29 20:52 ` Ludovic Courtès
2013-01-29 20:53 ` Ludovic Courtès
2013-01-30 9:20 ` Andy Wingo
2013-01-30 21:18 ` Ludovic Courtès
2013-01-31 8:52 ` Andy Wingo
2013-01-31 4:40 ` [PATCHES] Discard BOMs at stream start for UTF-{8,16,32} encodings Mark H Weaver
2013-01-31 9:39 ` Andy Wingo
2013-01-31 10:33 ` Andy Wingo
2013-01-31 18:01 ` [PATCHES] Discard BOMs at stream start for UTF-{8, 16, 32} encodings Mark H Weaver
2013-01-31 21:42 ` Ludovic Courtès [this message]
2013-01-29 19:22 ` byte-order marks Neil Jerram
2013-01-29 21:09 ` Andy Wingo
2013-01-29 21:12 ` Neil Jerram
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: https://www.gnu.org/software/guile/
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87wqutjagg.fsf@gnu.org \
--to=ludo@gnu.org \
--cc=guile-devel@gnu.org \
--cc=mhw@netris.org \
--cc=wingo@pobox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).