From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Mark H Weaver <mhw@netris.org>
Newsgroups: gmane.lisp.guile.devel
Subject: Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)
Date: Fri, 05 Apr 2013 03:30:53 -0400
Message-ID: <87wqshv3gy.fsf@tines.lan>
References: <87ip43zyf0.fsf@tines.lan> <87r4irq0zp.fsf@gnu.org>
	<874nfnza5g.fsf@tines.lan> <87a9pfml1n.fsf@gnu.org>
	<87zjxfxsl5.fsf@tines.lan> <8738v69g10.fsf@pobox.com>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain
X-Trace: ger.gmane.org 1365147091 20050 80.91.229.3 (5 Apr 2013 07:31:31 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Fri, 5 Apr 2013 07:31:31 +0000 (UTC)
Cc: Ludovic =?utf-8?Q?Court=C3=A8s?= <ludo@gnu.org>, guile-devel@gnu.org
To: Andy Wingo <wingo@pobox.com>
Original-X-From: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Fri Apr 05 09:31:58 2013
Return-path: <guile-devel-bounces+guile-devel=m.gmane.org@gnu.org>
Envelope-to: guile-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <guile-devel-bounces+guile-devel=m.gmane.org@gnu.org>)
	id 1UO17t-0005Tr-Qx
	for guile-devel@m.gmane.org; Fri, 05 Apr 2013 09:31:58 +0200
Original-Received: from localhost ([::1]:54842 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <guile-devel-bounces+guile-devel=m.gmane.org@gnu.org>)
	id 1UO17U-0002Om-UN
	for guile-devel@m.gmane.org; Fri, 05 Apr 2013 03:31:32 -0400
Original-Received: from eggs.gnu.org ([208.118.235.92]:46849)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <mhw@netris.org>) id 1UO17K-0002Gm-46
	for guile-devel@gnu.org; Fri, 05 Apr 2013 03:31:28 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <mhw@netris.org>) id 1UO17B-0001Qs-7l
	for guile-devel@gnu.org; Fri, 05 Apr 2013 03:31:22 -0400
Original-Received: from world.peace.net ([96.39.62.75]:33371)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <mhw@netris.org>)
	id 1UO17B-0001Qo-3J; Fri, 05 Apr 2013 03:31:13 -0400
Original-Received: from 209-6-91-212.c3-0.smr-ubr1.sbo-smr.ma.cable.rcn.com
	([209.6.91.212] helo=tines.lan)
	by world.peace.net with esmtpsa (TLS1.0:DHE_RSA_AES_128_CBC_SHA1:16)
	(Exim 4.72) (envelope-from <mhw@netris.org>)
	id 1UO170-0000v2-Aj; Fri, 05 Apr 2013 03:31:02 -0400
In-Reply-To: <8738v69g10.fsf@pobox.com> (Andy Wingo's message of "Thu, 04 Apr
	2013 22:50:35 +0200")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3 (gnu/linux)
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x
X-Received-From: 96.39.62.75
X-BeenThere: guile-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Developers list for Guile,
	the GNU extensibility library" <guile-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/guile-devel>,
	<mailto:guile-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/guile-devel>
List-Post: <mailto:guile-devel@gnu.org>
List-Help: <mailto:guile-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/guile-devel>,
	<mailto:guile-devel-request@gnu.org?subject=subscribe>
Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org
Original-Sender: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.lisp.guile.devel:16149
Archived-At: <http://permalink.gmane.org/gmane.lisp.guile.devel/16149>

Hi Andy,

Andy Wingo <wingo@pobox.com> writes:

> On Wed 03 Apr 2013 22:33, Mark H Weaver <mhw@netris.org> writes:
>
>> +          /* If we just read a BOM in an encoding that recognizes them,
>> +             then silently consume it and read another code point. */
>> +          if (SCM_UNLIKELY (*codepoint == SCM_UNICODE_BOM
>> +                            && (strcasecmp (pt->encoding, "UTF-8") == 0
>> +                                || strcasecmp (pt->encoding, "UTF-16") == 0
>> +                                || strcasecmp (pt->encoding, "UTF-32") == 0)))
>> +            return get_codepoint (port, codepoint, buf, len);
>
> Don't we have an enumerated value for UTF-8?

Indeed, good point.  I changed this as you suggest, but fwiw the
efficiency impact is negligible, because that 'strcasecmp' is only
called if a BOM is found at the start of a stream.

> Also I thought the
> documentation noted that we don't consume BOM for UTF-8.

No.  For UTF-8, we consume a BOM (if present) but do not produce one.

>> +static const unsigned char scm_utf8_bom[3]    = {0xEF, 0xBB, 0xBF};
>> +static const unsigned char scm_utf16be_bom[2] = {0xFE, 0xFF};
>> +static const unsigned char scm_utf16le_bom[2] = {0xFF, 0xFE};
>> +static const unsigned char scm_utf32be_bom[4] = {0x00, 0x00, 0xFE, 0xFF};
>> +static const unsigned char scm_utf32le_bom[4] = {0xFF, 0xFE, 0x00, 0x00};
>
> Does it not work to leave out the number?  i.e. foo[] instead of
> foo[3].  Would be nicer if that works otherwise it's fine.

It would almost certainly work, but I'm paranoid that some weird
compiler might make the array longer than needed, which would be bad.

>> +/* Decide what endianness to use for a UTF-16 port.  Return "UTF-16BE"
>> +   or "UTF-16LE".  MODE must be either SCM_PORT_READ or SCM_PORT_WRITE,
>> +   and specifies which operation is about to be done.  The MODE
>> +   determines how we will decide the endianness.  We deliberately avoid
>> +   reading from the port unless the user is about to do so.  If the user
>> +   is about to read, then we look for a BOM, and if present, we use it
>> +   to determine the endianness.  Otherwise we choose big-endian, as
>> +   recommended by the Unicode Consortium.  */
>
> I am surprised this does not default to native endianness!  But OK :)

Remember that "network byte order" is big endian.  Uniformity has its
benefits.  Anyway, in the docs I explicitly reserved the right to change
this later.

>> +static const char *
>> +decide_utf16_encoding (SCM port, scm_t_port_rw_active mode)
>> +{
>> +  if (mode == SCM_PORT_READ
>> +      && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read
>> +      && looking_at_bytes (port, scm_utf16le_bom, sizeof scm_utf16le_bom))
>> +    return "UTF-16LE";
>> +  else
>> +    return "UTF-16BE";
>> +}
>> +
>> +/* Decide what endianness to use for a UTF-32 port.  Return "UTF-32BE"
>> +   or "UTF-32LE".  See the comment above 'decide_utf16_encoding' for
>> +   details.  */
>> +static const char *
>> +decide_utf32_encoding (SCM port, scm_t_port_rw_active mode)
>> +{
>> +  if (mode == SCM_PORT_READ
>> +      && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read
>> +      && looking_at_bytes (port, scm_utf32le_bom, sizeof scm_utf32le_bom))
>> +    return "UTF-32LE";
>> +  else
>> +    return "UTF-32BE";
>> +}
>> +
>
> Why don't these consume the BOM?  They just use the BOM to detect the
> endianness but leave the at_stream_start_for_bom_read flag to 1 I guess?

Right.  I did it this way for simplicity and consistency.  When you seek
to the beginning of the file, the BOM will be consumed again even though
this code will not be run.  Therefore, we need logic to consume the BOM
in 'get_codepoint'.  It seems cleaner to do that job in just one place
rather than in two places.

> Probably deserves a comment.

Okay, I added one.

>> +      /* If the specified encoding is UTF-16 or UTF-32, then make
>> +         that more precise by deciding what endianness to use.  */
>> +      if (strcasecmp (pt->encoding, "UTF-16") == 0)
>> +        precise_encoding = decide_utf16_encoding (port, mode);
>> +      else if (strcasecmp (pt->encoding, "UTF-32") == 0)
>> +        precise_encoding = decide_utf32_encoding (port, mode);
>
> Ideally these comparisons would not be locale-dependent.  Dunno.

Yes, that would be preferable.  We talked about adding an
'ascii_strcasecmp' function.  What file do you think it should be
defined in?

>> @@ -2377,28 +2480,27 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
>>    pti = SCM_PORT_GET_INTERNAL (port);
>>    prev = pti->iconv_descriptors;
>>  
>> +  /* In order to handle cases where the encoding changes mid-stream
>> +     (e.g. within an HTTP stream, or within a file that is composed of
>> +     segments with different encodings), we consider this to be "stream
>> +     start" for purposes of BOM handling, regardless of our actual file
>> +     position. */
>> +  pti->at_stream_start_for_bom_read  = 1;
>> +  pti->at_stream_start_for_bom_write = 1;
>> +
>
> I dunno.  If I receive an HTTP response as a bytevector, parse it to a
> Unicode (UTF-{8,16,32}) string successfully, and then convert it back to
> bytes, I feel like I should get the same sequence of bytes back.  Is
> that unreasonable?

It's not an unreasonable wish, but sadly it's impossible in the presence
of automatic BOM removal.  The reason is that information is lost when
you remove the BOM.  This affects the UTF-8, UTF-16, and UTF-32
encodings.

Also, I'm not sure how your comment relates to the quoted code above it.

>> +  if (SCM_UNLIKELY (pti->at_stream_start_for_bom_write && len > 0))
>> +    {
>> +      scm_t_port *pt = SCM_PTAB_ENTRY (port);
>> +
>> +      /* Record that we're no longer at stream start.  */
>> +      pti->at_stream_start_for_bom_write = 0;
>> +      if (pt->rw_random)
>> +        pti->at_stream_start_for_bom_read = 0;
>> +
>> +      /* Write a BOM if appropriate.  */
>> +      if (SCM_UNLIKELY (strcasecmp(pt->encoding, "UTF-16") == 0
>> +                        || strcasecmp(pt->encoding, "UTF-32") == 0))
>> +        display_character (SCM_UNICODE_BOM, port, iconveh_error);
>> +    }
>
> Perhaps only set these flags if we are in UTF-16 or UTF-32 to begin
> with instead of pushing this strcasecmp out to print.c.

I thought of that, but then we'd have to add these strcasecmp's
everywhere that the flags are set, which is currently in three places:
(1) port creation, (2) set-port-encoding!, and (3) seeks.  As it is now,
we only have to do these comparisons in two places (textual read and
write), and often they can be avoided (e.g. on read the comparisons are
only done when a BOM is found at the start of a stream).  So I think
this way is not only cleaner, but also probably a bit more efficient.

I went ahead and pushed it to stable-2.0, with the changes mentioned
above.  Of course we can continue to tweak it.

    Thanks!
      Mark