From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: ludo@gnu.org (Ludovic =?utf-8?Q?Court=C3=A8s?=)
Newsgroups: gmane.lisp.guile.devel
Subject: Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)
Date: Wed, 03 Apr 2013 13:58:50 +0200
Message-ID: <87r4irq0zp.fsf@gnu.org>
References: <87ip43zyf0.fsf@tines.lan>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Trace: ger.gmane.org 1364990350 8949 80.91.229.3 (3 Apr 2013 11:59:10 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Wed, 3 Apr 2013 11:59:10 +0000 (UTC)
To: guile-devel@gnu.org
Original-X-From: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Wed Apr 03 13:59:38 2013
Return-path: <guile-devel-bounces+guile-devel=m.gmane.org@gnu.org>
Envelope-to: guile-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <guile-devel-bounces+guile-devel=m.gmane.org@gnu.org>)
	id 1UNMLo-0001dB-7I
	for guile-devel@m.gmane.org; Wed, 03 Apr 2013 13:59:36 +0200
Original-Received: from localhost ([::1]:45486 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <guile-devel-bounces+guile-devel=m.gmane.org@gnu.org>)
	id 1UNMLP-0004OV-Js
	for guile-devel@m.gmane.org; Wed, 03 Apr 2013 07:59:11 -0400
Original-Received: from eggs.gnu.org ([208.118.235.92]:33463)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <guile-devel@m.gmane.org>) id 1UNMLK-0004O4-I1
	for guile-devel@gnu.org; Wed, 03 Apr 2013 07:59:09 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <guile-devel@m.gmane.org>) id 1UNMLI-0000XN-Vb
	for guile-devel@gnu.org; Wed, 03 Apr 2013 07:59:06 -0400
Original-Received: from plane.gmane.org ([80.91.229.3]:59251)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <guile-devel@m.gmane.org>) id 1UNMLI-0000Wu-LA
	for guile-devel@gnu.org; Wed, 03 Apr 2013 07:59:04 -0400
Original-Received: from list by plane.gmane.org with local (Exim 4.69)
	(envelope-from <guile-devel@m.gmane.org>) id 1UNMLd-0001Vu-SM
	for guile-devel@gnu.org; Wed, 03 Apr 2013 13:59:25 +0200
Original-Received: from reverse-83.fdn.fr ([80.67.176.83])
	by main.gmane.org with esmtp (Gmexim 0.1 (Debian))
	id 1AlnuQ-0007hv-00
	for <guile-devel@gnu.org>; Wed, 03 Apr 2013 13:59:25 +0200
Original-Received: from ludo by reverse-83.fdn.fr with local (Gmexim 0.1 (Debian))
	id 1AlnuQ-0007hv-00
	for <guile-devel@gnu.org>; Wed, 03 Apr 2013 13:59:25 +0200
X-Injected-Via-Gmane: http://gmane.org/
Original-Lines: 141
Original-X-Complaints-To: usenet@ger.gmane.org
X-Gmane-NNTP-Posting-Host: reverse-83.fdn.fr
X-URL: http://www.fdn.fr/~lcourtes/
X-Revolutionary-Date: 14 Germinal an 221 de la =?utf-8?Q?R=C3=A9volution?=
X-PGP-Key-ID: 0xEA52ECF4
X-PGP-Key: http://www.fdn.fr/~lcourtes/ludovic.asc
X-PGP-Fingerprint: 83C4 F8E5 10A3 3B4C 5BEA  D15D 77DD 95E2 EA52 ECF4
X-OS: x86_64-unknown-linux-gnu
User-Agent: Gnus/5.130005 (Ma Gnus v0.5) Emacs/24.3 (gnu/linux)
Cancel-Lock: sha1:9QtGdh1HTrSwtoe5yZPhxtge4AA=
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not
	recognized.
X-Received-From: 80.91.229.3
X-BeenThere: guile-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Developers list for Guile,
	the GNU extensibility library" <guile-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/guile-devel>,
	<mailto:guile-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/guile-devel>
List-Post: <mailto:guile-devel@gnu.org>
List-Help: <mailto:guile-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/guile-devel>,
	<mailto:guile-devel-request@gnu.org?subject=subscribe>
Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org
Original-Sender: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.lisp.guile.devel:16116
Archived-At: <http://permalink.gmane.org/gmane.lisp.guile.devel/16116>

Hello, Mark!

Mark H Weaver <mhw@netris.org> skribis:

> * All kinds of streams are supported in a uniform way: files, pipes,
>   sockets, terminals, etc.
>
> * As specified in Unicode 6.2, BOMs are only handled specially at the
>   start of a stream, and only if the encoding is set to "UTF-16" or
>   "UTF-32".  BOMs are *not* handled specially if the encoding is set to
>   "UTF-16LE", etc.

OK.

> * This code never tries to read a BOM until the user has asked to read.
>   If the user writes before reading, it chooses big-endian and writes a
>   BOM if appropriate (if the encoding is set to "UTF-16" or "UTF-32").
>
> * The encodings "UTF-16" and "UTF-32" are *never* passed to iconv,
>   because BOM handling varies between iconv implementations.  Creation
>   of the iconv descriptors is always postponed until the first read or
>   write, at which point a decision is made about the endianness, and
>   then "UTF-16BE", "UTF-16LE", "UTF-32BE", or "UTF-32LE" is passed to
>   iconv.
>
> * If 'rw_random' is zero, then the input and output streams are
>   considered independent: the first read will consume a BOM if
>   appropriate, *and* the first write will produce a BOM if appropriate.
>
> * If 'rw_random' is non-zero, then the input and output streams are
>   considered linked: if the user reads first, then a BOM will be
>   consumed if appropriate, but later writes will *not* produce a BOM.
>   Similarly, if the user writes first, then later reads will *not*
>   consume a BOM.
>
> * If 'set-port-encoding!' is called in the middle of a stream, it treats
>   it as a new logical "start of stream", i.e. if the encoding is set to
>   "UTF-16" or "UTF-32" then a BOM will be consumed the next time you
>   read and/or produced the next time you write.
>
> * Seeks to the beginning of the file set the "start of stream" flags.
>   Seeks anywhere else clear the "start of stream" flags.

Woow, well thought out.  The semantics seem good.  (It’s interesting to
see how BOMs complicate things, but that’s life, I guess.)

The patch looks good to me.  The test suite is nice.  It doesn’t seem to
cover all the corner cases listed above, but that can be added later on
perhaps?

Perhaps the text above could be added to the manual, in a
@ununnumberedsec or something?

Remarks:

> diff --git a/libguile/ports-internal.h b/libguile/ports-internal.h
> index 73a788f..cd1746b 100644
> --- a/libguile/ports-internal.h
> +++ b/libguile/ports-internal.h
> @@ -48,14 +48,19 @@ struct scm_port_internal
>  {
>    scm_t_port_encoding_mode encoding_mode;
>    scm_t_iconv_descriptors *iconv_descriptors;
> +  int at_stream_start_for_bom_read;
> +  int at_stream_start_for_bom_write;

Add “:1”?

> +#define SCM_UNICODE_BOM  0xFEFF  /* Unicode byte-order mark */

0xfeffUL to be on the safe side.

> +/* If the next LEN bytes from port are equal to those in BYTES, then

s/port/PORT/

> +   return 1, else return 0.  Leave the port position unchanged.  */
> +static int
> +looking_at_bytes (SCM port, unsigned char *bytes, int len)

const unsigned char *bytes

> +{
> +  scm_t_port *pt = SCM_PTAB_ENTRY (port);
> +  int result;
> +  int i = 0;
> +
> +  while (i < len && scm_peek_byte_or_eof (port) == bytes[i])
> +    {
> +      pt->read_pos++;
> +      i++;
> +    }
> +
> +  result = (i == len);
> +
> +  while (i > 0)
> +    scm_unget_byte (bytes[--i], port);
> +
> +  return result;
> +}

Should it be scm_get_byte_or_eof given that scm_unget_byte is used later?

What if pt->read_buf_size == 1?  What if there’s data in saved_read_buf?

> +/* Decide what endianness to use for a UTF-16 port.  Return "UTF-16BE"
> +   or "UTF-16LE".  MODE must be either SCM_PORT_READ or SCM_PORT_WRITE,
> +   and specifies which operation is about to be done.  The MODE
> +   determines how we will decide the endianness.  We deliberately avoid
> +   reading from the port unless the user is about to do so.  If the user
> +   is about to read, then we look for a BOM, and if present, we use it
> +   to determine the endianness.  Otherwise we choose big-endian, as
> +   recommended by the Unicode Consortium.  */
> +static char *
> +decide_utf16_encoding (SCM port, scm_t_port_rw_active mode)

static const char *

> +static char *
> +decide_utf32_encoding (SCM port, scm_t_port_rw_active mode)

Likewise.

> +      /* If the specified encoding is UTF-16 or UTF-32, then make
> +         that more precise by deciding what endianness to use.  */
> +      if (strcmp (pt->encoding, "UTF-16") == 0)
> +        precise_encoding = decide_utf16_encoding (port, mode);
> +      else if (strcmp (pt->encoding, "UTF-32") == 0)
> +        precise_encoding = decide_utf32_encoding (port, mode);

Shouldn’t it be strcasecmp?  (Actually there are other uses of strcmp
already, but I think it’s a mistake.)

> +      if (SCM_UNLIKELY (strcmp(pt->encoding, "UTF-16") == 0
> +                        || strcmp(pt->encoding, "UTF-32") == 0))

Likewise, + space before paren.

Thanks!

Ludo’.