From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Andy Wingo <wingo@pobox.com>
Newsgroups: gmane.lisp.guile.devel
Subject: Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)
Date: Thu, 04 Apr 2013 22:50:35 +0200
Message-ID: <8738v69g10.fsf@pobox.com>
References: <87ip43zyf0.fsf@tines.lan> <87r4irq0zp.fsf@gnu.org>
	<874nfnza5g.fsf@tines.lan> <87a9pfml1n.fsf@gnu.org>
	<87zjxfxsl5.fsf@tines.lan>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain
X-Trace: ger.gmane.org 1365108751 8546 80.91.229.3 (4 Apr 2013 20:52:31 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Thu, 4 Apr 2013 20:52:31 +0000 (UTC)
Cc: Ludovic =?utf-8?Q?Court=C3=A8s?= <ludo@gnu.org>, guile-devel@gnu.org
To: Mark H Weaver <mhw@netris.org>
Original-X-From: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Thu Apr 04 22:52:58 2013
Return-path: <guile-devel-bounces+guile-devel=m.gmane.org@gnu.org>
Envelope-to: guile-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <guile-devel-bounces+guile-devel=m.gmane.org@gnu.org>)
	id 1UNr9W-0006iY-L6
	for guile-devel@m.gmane.org; Thu, 04 Apr 2013 22:52:58 +0200
Original-Received: from localhost ([::1]:38055 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <guile-devel-bounces+guile-devel=m.gmane.org@gnu.org>)
	id 1UNr97-0001Tv-Sb
	for guile-devel@m.gmane.org; Thu, 04 Apr 2013 16:52:33 -0400
Original-Received: from eggs.gnu.org ([208.118.235.92]:42417)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <wingo@pobox.com>) id 1UNr8z-0001Jg-9R
	for guile-devel@gnu.org; Thu, 04 Apr 2013 16:52:31 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <wingo@pobox.com>) id 1UNr8t-0000fJ-VZ
	for guile-devel@gnu.org; Thu, 04 Apr 2013 16:52:25 -0400
Original-Received: from a-pb-sasl-quonix.pobox.com ([208.72.237.25]:55533
	helo=sasl.smtp.pobox.com) by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <wingo@pobox.com>)
	id 1UNr8t-0000Da-R6; Thu, 04 Apr 2013 16:52:19 -0400
Original-Received: from sasl.smtp.pobox.com (unknown [127.0.0.1])
	by a-pb-sasl-quonix.pobox.com (Postfix) with ESMTP id 66F04BC00;
	Thu,  4 Apr 2013 16:50:42 -0400 (EDT)
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=pobox.com; h=from:to:cc
	:subject:references:date:in-reply-to:message-id:mime-version
	:content-type; s=sasl; bh=tZffcPlqQPPC35ecy8u24uH5fc4=; b=Df+GQQ
	eB8zbAWvlq5FxrVpxTbvhc3A7cZUQ7zLS3N6JujxoExdNoq2AEFXPBfiijwWF8FL
	1vCbJJ27kWm8KmNSj9vQ4X6jXQ9nNu/SD9O58aXzWYAH64Eu0nufCPk1neRJagYF
	RLp8mKMjeMeWU/DxxpYWa8Nh/LyKAPfGXrXiw=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=pobox.com; h=from:to:cc
	:subject:references:date:in-reply-to:message-id:mime-version
	:content-type; q=dns; s=sasl; b=a18xZ3n8bNwExf1N8gsfe+DyHxT+pvX1
	RWH6xnXlG1e16m+Dys5colYHLHLCE0yfJr704+qGQGsRnJ6JF/+BW5b8OH/NULxh
	u1H5tXBxnEUzay4+2+Nc37bqJ3coYT28gRSmiQ4T1kstOI/9oyVxHSOJ/LEgbQ2n
	jhQYxTO9Pq0=
Original-Received: from a-pb-sasl-quonix.pobox.com (unknown [127.0.0.1])
	by a-pb-sasl-quonix.pobox.com (Postfix) with ESMTP id 5C242BBFF;
	Thu,  4 Apr 2013 16:50:42 -0400 (EDT)
Original-Received: from badger (unknown [88.160.190.192]) (using TLSv1 with cipher
	DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by
	a-pb-sasl-quonix.pobox.com (Postfix) with ESMTPSA id 95365BBFE;
	Thu,  4 Apr 2013 16:50:41 -0400 (EDT)
In-Reply-To: <87zjxfxsl5.fsf@tines.lan> (Mark H. Weaver's message of "Wed, 03
	Apr 2013 16:33:10 -0400")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.2 (gnu/linux)
X-Pobox-Relay-ID: 503653EA-9D69-11E2-8857-D36F0E5B5709-02397024!a-pb-sasl-quonix.pobox.com
X-detected-operating-system: by eggs.gnu.org: Solaris 10
X-Received-From: 208.72.237.25
X-BeenThere: guile-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Developers list for Guile,
	the GNU extensibility library" <guile-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/guile-devel>,
	<mailto:guile-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/guile-devel>
List-Post: <mailto:guile-devel@gnu.org>
List-Help: <mailto:guile-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/guile-devel>,
	<mailto:guile-devel-request@gnu.org?subject=subscribe>
Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org
Original-Sender: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.lisp.guile.devel:16141
Archived-At: <http://permalink.gmane.org/gmane.lisp.guile.devel/16141>

Hi.  The following review applies to the wrong version of this patch.
I'll go ahead and post it anyway.

On Wed 03 Apr 2013 22:33, Mark H Weaver <mhw@netris.org> writes:

> +          /* If we just read a BOM in an encoding that recognizes them,
> +             then silently consume it and read another code point. */
> +          if (SCM_UNLIKELY (*codepoint == SCM_UNICODE_BOM
> +                            && (strcasecmp (pt->encoding, "UTF-8") == 0
> +                                || strcasecmp (pt->encoding, "UTF-16") == 0
> +                                || strcasecmp (pt->encoding, "UTF-32") == 0)))
> +            return get_codepoint (port, codepoint, buf, len);

Don't we have an enumerated value for UTF-8?  Also I thought the
documentation noted that we don't consume BOM for UTF-8.

> +static int
> +looking_at_bytes (SCM port, const unsigned char *bytes, int len)
> +{
> +  scm_t_port *pt = SCM_PTAB_ENTRY (port);
> +  int result;
> +  int i = 0;
> +
> +  while (i < len && scm_peek_byte_or_eof (port) == bytes[i])
> +    {
> +      pt->read_pos++;
> +      i++;
> +    }
> +
> +  result = (i == len);
> +
> +  while (i > 0)
> +    scm_unget_byte (bytes[--i], port);
> +
> +  return result;
> +}

Very subtle ;)  But looks good.

> +static const unsigned char scm_utf8_bom[3]    = {0xEF, 0xBB, 0xBF};
> +static const unsigned char scm_utf16be_bom[2] = {0xFE, 0xFF};
> +static const unsigned char scm_utf16le_bom[2] = {0xFF, 0xFE};
> +static const unsigned char scm_utf32be_bom[4] = {0x00, 0x00, 0xFE, 0xFF};
> +static const unsigned char scm_utf32le_bom[4] = {0xFF, 0xFE, 0x00, 0x00};

Does it not work to leave out the number?  i.e. foo[] instead of
foo[3].  Would be nicer if that works otherwise it's fine.

> +/* Decide what endianness to use for a UTF-16 port.  Return "UTF-16BE"
> +   or "UTF-16LE".  MODE must be either SCM_PORT_READ or SCM_PORT_WRITE,
> +   and specifies which operation is about to be done.  The MODE
> +   determines how we will decide the endianness.  We deliberately avoid
> +   reading from the port unless the user is about to do so.  If the user
> +   is about to read, then we look for a BOM, and if present, we use it
> +   to determine the endianness.  Otherwise we choose big-endian, as
> +   recommended by the Unicode Consortium.  */

I am surprised this does not default to native endianness!  But OK :)

> +static const char *
> +decide_utf16_encoding (SCM port, scm_t_port_rw_active mode)
> +{
> +  if (mode == SCM_PORT_READ
> +      && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read
> +      && looking_at_bytes (port, scm_utf16le_bom, sizeof scm_utf16le_bom))
> +    return "UTF-16LE";
> +  else
> +    return "UTF-16BE";
> +}
> +
> +/* Decide what endianness to use for a UTF-32 port.  Return "UTF-32BE"
> +   or "UTF-32LE".  See the comment above 'decide_utf16_encoding' for
> +   details.  */
> +static const char *
> +decide_utf32_encoding (SCM port, scm_t_port_rw_active mode)
> +{
> +  if (mode == SCM_PORT_READ
> +      && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read
> +      && looking_at_bytes (port, scm_utf32le_bom, sizeof scm_utf32le_bom))
> +    return "UTF-32LE";
> +  else
> +    return "UTF-32BE";
> +}
> +

Why don't these consume the BOM?  They just use the BOM to detect the
endianness but leave the at_stream_start_for_bom_read flag to 1 I guess?
Probably deserves a comment.

> +      /* If the specified encoding is UTF-16 or UTF-32, then make
> +         that more precise by deciding what endianness to use.  */
> +      if (strcasecmp (pt->encoding, "UTF-16") == 0)
> +        precise_encoding = decide_utf16_encoding (port, mode);
> +      else if (strcasecmp (pt->encoding, "UTF-32") == 0)
> +        precise_encoding = decide_utf32_encoding (port, mode);

Ideally these comparisons would not be locale-dependent.  Dunno.

> @@ -2377,28 +2480,27 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
>    pti = SCM_PORT_GET_INTERNAL (port);
>    prev = pti->iconv_descriptors;
>  
> +  /* In order to handle cases where the encoding changes mid-stream
> +     (e.g. within an HTTP stream, or within a file that is composed of
> +     segments with different encodings), we consider this to be "stream
> +     start" for purposes of BOM handling, regardless of our actual file
> +     position. */
> +  pti->at_stream_start_for_bom_read  = 1;
> +  pti->at_stream_start_for_bom_write = 1;
> +

I dunno.  If I receive an HTTP response as a bytevector, parse it to a
Unicode (UTF-{8,16,32}) string successfully, and then convert it back to
bytes, I feel like I should get the same sequence of bytes back.  Is
that unreasonable?

> +  if (SCM_UNLIKELY (pti->at_stream_start_for_bom_write && len > 0))
> +    {
> +      scm_t_port *pt = SCM_PTAB_ENTRY (port);
> +
> +      /* Record that we're no longer at stream start.  */
> +      pti->at_stream_start_for_bom_write = 0;
> +      if (pt->rw_random)
> +        pti->at_stream_start_for_bom_read = 0;
> +
> +      /* Write a BOM if appropriate.  */
> +      if (SCM_UNLIKELY (strcasecmp(pt->encoding, "UTF-16") == 0
> +                        || strcasecmp(pt->encoding, "UTF-32") == 0))
> +        display_character (SCM_UNICODE_BOM, port, iconveh_error);
> +    }

Perhaps only set these flags if we are in UTF-16 or UTF-32 to begin
with instead of pushing this strcasecmp out to print.c.

> +  (pass-if-equal "BOM discarded from start of UTF-8 stream"
> +      "a"
> +    (bv-read-test "Utf-8" #vu8(#xEF #xBB #xBF #x61)))

This test contradicts the docs regarding BOM consumption for UTF-8
streams, no?

Quitting because I realized that you have two new patches (!).

Andy
-- 
http://wingolog.org/