unofficial mirror of guile-devel@gnu.org 
 help / color / mirror / Atom feed
From: ludo@gnu.org (Ludovic Courtès)
To: Mark H Weaver <mhw@netris.org>
Cc: Andy Wingo <wingo@pobox.com>, guile-devel@gnu.org
Subject: Re: [PATCHES] Discard BOMs at stream start for UTF-{8, 16, 32} encodings
Date: Thu, 31 Jan 2013 22:42:39 +0100	[thread overview]
Message-ID: <87wqutjagg.fsf@gnu.org> (raw)
In-Reply-To: <87d2wmhsn4.fsf_-_@tines.lan> (Mark H. Weaver's message of "Wed,  30 Jan 2013 23:40:31 -0500")

Hi!

Mark H Weaver <mhw@netris.org> skribis:

> I researched this some more, and discovered that removal of byte-order
> marks (BOMs) is the responsibility of iconv, which discards BOMs from
> the beginning of streams when using the UTF-16 or UTF-32 encodings, but
> *not* for UTF-16LE, UTF-16GE, UTF-32LE, UTF-32GE or any other encoding.
> It uses the BOM to determine the endianness of the stream, but other
> than that does *not* use it to guess the encoding, so there's no
> guesswork involved.  (Side note: iconv also inserts a BOM automatically
> when writing a stream using UTF-16 or UTF-32).

Are you talking about GNU iconv or iconv as specified by POSIX?

I can’t see any occurrence of “BOM” at
<http://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv.html>.

> So thanks to iconv, we get UTF-{16,32} BOM removal for free.
> Unfortunately we have a nasty bug in 'get_iconv_codepoint' that leads to
> a buffer overrun and assertion failure when 'iconv' discards a BOM.

Good catch!

> The first patch below fixes this problem.  I ended up almost completely
> rewriting that function, partly because it was largely structured around
> a mistaken assumption that iconv will never consume input without
> producing output, and partly because it was quite inefficient (several
> unnecessary conditional branches in the loop) and IMO was rather
> difficult to read.

Great.  (I think ‘iconv’ semantics lead to tricky code, no matter what.)

>  get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
>  		     char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
>  {

[...]

> -  for (output_size = 0, output = (char *) utf8_buf,
> -	 bytes_consumed = 0, err = 0;
> -       err == 0 && output_size == 0
> -	 && (bytes_consumed == 0 || byte_read != EOF);
> -       bytes_consumed++)
> +  for (;;)

Clarity is in the eye of the beholder, but to me this is a step backwards.

[...]

> +  /* NOTE: The following test assumes that the only special values
> +     (other than SCM_ICONV_UNINITIALIZED) are for UTF-8. */
> +  if (SCM_ICONV_SPECIAL_P (pt->input_cd))

Probably an indication that a more descriptive name is needed, as Andy
noted.

> @@ -2247,16 +2279,15 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
>  	  new_output_cd = iconv_open (encoding, "UTF-8");
>  	  if (new_output_cd == (iconv_t) -1)

Should be SCM_ICONV_UNINITIALIZED?

Thanks again for the research and fixes!

Ludo’.



  parent reply	other threads:[~2013-01-31 21:42 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-01-28 21:42 byte-order marks Andy Wingo
2013-01-28 22:20 ` Mike Gran
2013-01-29  9:03   ` Andy Wingo
2013-01-29  8:22 ` Mark H Weaver
2013-01-29  9:03   ` Andy Wingo
2013-01-29 13:27     ` Ludovic Courtès
2013-01-29 14:04       ` Andy Wingo
2013-01-29 17:09         ` Mark H Weaver
2013-01-29 19:09           ` Mark H Weaver
2013-01-29 20:52             ` Ludovic Courtès
2013-01-29 20:53           ` Ludovic Courtès
2013-01-30  9:20           ` Andy Wingo
2013-01-30 21:18             ` Ludovic Courtès
2013-01-31  8:52               ` Andy Wingo
2013-01-31  4:40             ` [PATCHES] Discard BOMs at stream start for UTF-{8,16,32} encodings Mark H Weaver
2013-01-31  9:39               ` Andy Wingo
2013-01-31 10:33                 ` Andy Wingo
2013-01-31 18:01                   ` [PATCHES] Discard BOMs at stream start for UTF-{8, 16, 32} encodings Mark H Weaver
2013-01-31 21:42               ` Ludovic Courtès [this message]
2013-01-29 19:22 ` byte-order marks Neil Jerram
2013-01-29 21:09   ` Andy Wingo
2013-01-29 21:12     ` Neil Jerram

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/guile/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87wqutjagg.fsf@gnu.org \
    --to=ludo@gnu.org \
    --cc=guile-devel@gnu.org \
    --cc=mhw@netris.org \
    --cc=wingo@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).