Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)

unofficial mirror of guile-devel@gnu.org 
 help / color / mirror / Atom feed

From: ludo@gnu.org (Ludovic Courtès)
To: guile-devel@gnu.org
Subject: Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)
Date: Wed, 03 Apr 2013 13:58:50 +0200	[thread overview]
Message-ID: <87r4irq0zp.fsf@gnu.org> (raw)
In-Reply-To: 87ip43zyf0.fsf@tines.lan

Hello, Mark!

Mark H Weaver <mhw@netris.org> skribis:

> * All kinds of streams are supported in a uniform way: files, pipes,
>   sockets, terminals, etc.
>
> * As specified in Unicode 6.2, BOMs are only handled specially at the
>   start of a stream, and only if the encoding is set to "UTF-16" or
>   "UTF-32".  BOMs are *not* handled specially if the encoding is set to
>   "UTF-16LE", etc.

OK.

> * This code never tries to read a BOM until the user has asked to read.
>   If the user writes before reading, it chooses big-endian and writes a
>   BOM if appropriate (if the encoding is set to "UTF-16" or "UTF-32").
>
> * The encodings "UTF-16" and "UTF-32" are *never* passed to iconv,
>   because BOM handling varies between iconv implementations.  Creation
>   of the iconv descriptors is always postponed until the first read or
>   write, at which point a decision is made about the endianness, and
>   then "UTF-16BE", "UTF-16LE", "UTF-32BE", or "UTF-32LE" is passed to
>   iconv.
>
> * If 'rw_random' is zero, then the input and output streams are
>   considered independent: the first read will consume a BOM if
>   appropriate, *and* the first write will produce a BOM if appropriate.
>
> * If 'rw_random' is non-zero, then the input and output streams are
>   considered linked: if the user reads first, then a BOM will be
>   consumed if appropriate, but later writes will *not* produce a BOM.
>   Similarly, if the user writes first, then later reads will *not*
>   consume a BOM.
>
> * If 'set-port-encoding!' is called in the middle of a stream, it treats
>   it as a new logical "start of stream", i.e. if the encoding is set to
>   "UTF-16" or "UTF-32" then a BOM will be consumed the next time you
>   read and/or produced the next time you write.
>
> * Seeks to the beginning of the file set the "start of stream" flags.
>   Seeks anywhere else clear the "start of stream" flags.

Woow, well thought out.  The semantics seem good.  (It’s interesting to
see how BOMs complicate things, but that’s life, I guess.)

The patch looks good to me.  The test suite is nice.  It doesn’t seem to
cover all the corner cases listed above, but that can be added later on
perhaps?

Perhaps the text above could be added to the manual, in a
@ununnumberedsec or something?

Remarks:

> diff --git a/libguile/ports-internal.h b/libguile/ports-internal.h
> index 73a788f..cd1746b 100644
> --- a/libguile/ports-internal.h
> +++ b/libguile/ports-internal.h
> @@ -48,14 +48,19 @@ struct scm_port_internal
>  {
>    scm_t_port_encoding_mode encoding_mode;
>    scm_t_iconv_descriptors *iconv_descriptors;
> +  int at_stream_start_for_bom_read;
> +  int at_stream_start_for_bom_write;

Add “:1”?

> +#define SCM_UNICODE_BOM  0xFEFF  /* Unicode byte-order mark */

0xfeffUL to be on the safe side.

> +/* If the next LEN bytes from port are equal to those in BYTES, then

s/port/PORT/

> +   return 1, else return 0.  Leave the port position unchanged.  */
> +static int
> +looking_at_bytes (SCM port, unsigned char *bytes, int len)

const unsigned char *bytes

> +{
> +  scm_t_port *pt = SCM_PTAB_ENTRY (port);
> +  int result;
> +  int i = 0;
> +
> +  while (i < len && scm_peek_byte_or_eof (port) == bytes[i])
> +    {
> +      pt->read_pos++;
> +      i++;
> +    }
> +
> +  result = (i == len);
> +
> +  while (i > 0)
> +    scm_unget_byte (bytes[--i], port);
> +
> +  return result;
> +}

Should it be scm_get_byte_or_eof given that scm_unget_byte is used later?

What if pt->read_buf_size == 1?  What if there’s data in saved_read_buf?

> +/* Decide what endianness to use for a UTF-16 port.  Return "UTF-16BE"
> +   or "UTF-16LE".  MODE must be either SCM_PORT_READ or SCM_PORT_WRITE,
> +   and specifies which operation is about to be done.  The MODE
> +   determines how we will decide the endianness.  We deliberately avoid
> +   reading from the port unless the user is about to do so.  If the user
> +   is about to read, then we look for a BOM, and if present, we use it
> +   to determine the endianness.  Otherwise we choose big-endian, as
> +   recommended by the Unicode Consortium.  */
> +static char *
> +decide_utf16_encoding (SCM port, scm_t_port_rw_active mode)

static const char *

> +static char *
> +decide_utf32_encoding (SCM port, scm_t_port_rw_active mode)

Likewise.

> +      /* If the specified encoding is UTF-16 or UTF-32, then make
> +         that more precise by deciding what endianness to use.  */
> +      if (strcmp (pt->encoding, "UTF-16") == 0)
> +        precise_encoding = decide_utf16_encoding (port, mode);
> +      else if (strcmp (pt->encoding, "UTF-32") == 0)
> +        precise_encoding = decide_utf32_encoding (port, mode);

Shouldn’t it be strcasecmp?  (Actually there are other uses of strcmp
already, but I think it’s a mistake.)

> +      if (SCM_UNLIKELY (strcmp(pt->encoding, "UTF-16") == 0
> +                        || strcmp(pt->encoding, "UTF-32") == 0))

Likewise, + space before paren.

Thanks!

Ludo’.

next prev parent reply	other threads:[~2013-04-03 11:58 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-04-03 10:44 [PATCH] Improve handling of Unicode byte-order marks (BOMs) Mark H Weaver
2013-04-03 11:47 ` Mark H Weaver
2013-04-03 11:58 ` Ludovic Courtès [this message]
2013-04-03 19:28   ` Mark H Weaver
2013-04-03 20:11     ` Ludovic Courtès
2013-04-03 20:33       ` Mark H Weaver
2013-04-03 20:48         ` Mike Gran
2013-04-03 22:24           ` Mark H Weaver
2013-04-04  5:59             ` Mark H Weaver
2013-04-04 20:50         ` Andy Wingo
2013-04-05  7:30           ` Mark H Weaver
2013-04-05  7:42             ` Mike Gran
2013-04-05 10:04               ` Ludovic Courtès
2013-04-05 18:15                 ` Mark H Weaver

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/guile/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87r4irq0zp.fsf@gnu.org \
    --to=ludo@gnu.org \
    --cc=guile-devel@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).