[PATCH] Improve handling of Unicode byte-order marks (BOMs)

unofficial mirror of guile-devel@gnu.org 

* [PATCH] Improve handling of Unicode byte-order marks (BOMs)
@ 2013-04-03 10:44 Mark H Weaver
  2013-04-03 11:47 ` Mark H Weaver
  2013-04-03 11:58 ` Ludovic Courtès
  0 siblings, 2 replies; 14+ messages in thread
From: Mark H Weaver @ 2013-04-03 10:44 UTC
  To: guile-devel

[-- Attachment #1: Type: text/plain, Size: 1963 bytes --]

Hello all,

I've attached a proposed patch to improve our handling of BOMs.
Here are a few notable aspects:

* All kinds of streams are supported in a uniform way: files, pipes,
  sockets, terminals, etc.

* As specified in Unicode 6.2, BOMs are only handled specially at the
  start of a stream, and only if the encoding is set to "UTF-16" or
  "UTF-32".  BOMs are *not* handled specially if the encoding is set to
  "UTF-16LE", etc.

* This code never tries to read a BOM until the user has asked to read.
  If the user writes before reading, it chooses big-endian and writes a
  BOM if appropriate (if the encoding is set to "UTF-16" or "UTF-32").

* The encodings "UTF-16" and "UTF-32" are *never* passed to iconv,
  because BOM handling varies between iconv implementations.  Creation
  of the iconv descriptors is always postponed until the first read or
  write, at which point a decision is made about the endianness, and
  then "UTF-16BE", "UTF-16LE", "UTF-32BE", or "UTF-32LE" is passed to
  iconv.

* If 'rw_random' is zero, then the input and output streams are
  considered independent: the first read will consume a BOM if
  appropriate, *and* the first write will produce a BOM if appropriate.

* If 'rw_random' is non-zero, then the input and output streams are
  considered linked: if the user reads first, then a BOM will be
  consumed if appropriate, but later writes will *not* produce a BOM.
  Similarly, if the user writes first, then later reads will *not*
  consume a BOM.

* If 'set-port-encoding!' is called in the middle of a stream, it treats
  it as a new logical "start of stream", i.e. if the encoding is set to
  "UTF-16" or "UTF-32" then a BOM will be consumed the next time you
  read and/or produced the next time you write.

* Seeks to the beginning of the file set the "start of stream" flags.
  Seeks anywhere else clear the "start of stream" flags.
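
As a rough illustration of the endianness rule described above (not the
patch's actual code), the decision of which precise encoding name to hand
to iconv can be sketched as a pure function over the bytes at the start of
the stream.  The names 'sketch_mode', 'starts_with' and
'sketch_precise_encoding' are hypothetical, and the byte buffer stands in
for peeking at the port, which the real patch does via
'looking_at_bytes'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for SCM_PORT_READ / SCM_PORT_WRITE.  */
enum sketch_mode { SKETCH_READ, SKETCH_WRITE };

static const unsigned char sketch_utf16le_bom[2] = {0xFF, 0xFE};
static const unsigned char sketch_utf32le_bom[4] = {0xFF, 0xFE, 0x00, 0x00};

/* Return 1 if the LEN bytes at BUF begin with the N bytes at PREFIX.  */
static int
starts_with (const unsigned char *buf, size_t len,
             const unsigned char *prefix, size_t n)
{
  return len >= n && memcmp (buf, prefix, n) == 0;
}

/* Map the user-visible ENCODING to the name that would be passed to
   iconv.  Only "UTF-16" and "UTF-32" are refined; every other name,
   including "UTF-16LE" etc., is returned unchanged.  The stream bytes
   are examined only when MODE says a read is about to happen; in all
   other cases big-endian is chosen.  */
static const char *
sketch_precise_encoding (const char *encoding, enum sketch_mode mode,
                         const unsigned char *buf, size_t len)
{
  assert (encoding != NULL);
  assert (buf != NULL || len == 0);

  if (strcmp (encoding, "UTF-16") == 0)
    {
      if (mode == SKETCH_READ
          && starts_with (buf, len, sketch_utf16le_bom,
                          sizeof sketch_utf16le_bom))
        return "UTF-16LE";
      return "UTF-16BE";
    }

  if (strcmp (encoding, "UTF-32") == 0)
    {
      if (mode == SKETCH_READ
          && starts_with (buf, len, sketch_utf32le_bom,
                          sizeof sketch_utf32le_bom))
        return "UTF-32LE";
      return "UTF-32BE";
    }

  return encoding;
}
```

Only the little-endian BOM needs to be tested for, because a big-endian
BOM and a missing BOM both lead to the big-endian default.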

Okay, here's the patch.  Comments and suggestions solicited.

     Mark



[-- Attachment #2: [PATCH] Improve handling of Unicode byte-order marks (BOMs) --]
[-- Type: text/x-diff, Size: 23050 bytes --]

From 008b89c7ba4637e2d6323f02b6b8b6284a533857 Mon Sep 17 00:00:00 2001
From: Mark H Weaver <mhw@netris.org>
Date: Wed, 3 Apr 2013 04:22:04 -0400
Subject: [PATCH] Improve handling of Unicode byte-order marks (BOMs).

* libguile/ports-internal.h (struct scm_port_internal): Add new members
  'at_stream_start_for_bom_read' and 'at_stream_start_for_bom_write'.
  (SCM_UNICODE_BOM): New macro.
  (scm_i_port_iconv_descriptors): Add 'mode' parameter to prototype.

* libguile/ports.c (scm_new_port_table_entry): Initialize
  'at_stream_start_for_bom_read' and 'at_stream_start_for_bom_write'.
  (get_iconv_codepoint): Pass new 'mode' parameter to
  'scm_i_port_iconv_descriptors'.
  (get_codepoint): After reading a codepoint at stream start, record
  that we're no longer at stream start, and consume a BOM where
  appropriate.
  (scm_seek): Set the stream start flags according to the new position.
  (looking_at_bytes): New static function.
  (scm_utf8_bom, scm_utf16be_bom, scm_utf16le_bom, scm_utf32be_bom,
  scm_utf32le_bom): New static const arrays.
  (decide_utf16_encoding, decide_utf32_encoding): New static functions.
  (scm_i_port_iconv_descriptors): Add new 'mode' parameter.  If the
  specified encoding is UTF-16 or UTF-32, make that precise by deciding
  what endianness to use, and construct iconv descriptors based on the
  precise encoding.
  (scm_i_set_port_encoding_x): Record that we are now at stream start.
  Do not open the new iconv descriptors immediately; let them be
  initialized lazily.

* libguile/print.c (display_string_using_iconv): Record that we're no
  longer at stream start.  Write a BOM if appropriate.

* test-suite/tests/ports.test ("set-port-encoding!, wrong encoding"):
  Adapt test to cope with the fact that 'set-port-encoding!' does not
  immediately open the iconv descriptors.
  (bv-read-test): New procedure.
  ("unicode byte-order marks (BOMs)"): New test prefix.
---
 libguile/ports-internal.h   |    7 +-
 libguile/ports.c            |  134 +++++++++++++++++++---
 libguile/print.c            |   18 ++-
 test-suite/tests/ports.test |  259 ++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 399 insertions(+), 19 deletions(-)

diff --git a/libguile/ports-internal.h b/libguile/ports-internal.h
index 73a788f..cd1746b 100644
--- a/libguile/ports-internal.h
+++ b/libguile/ports-internal.h
@@ -48,14 +48,19 @@ struct scm_port_internal
 {
   scm_t_port_encoding_mode encoding_mode;
   scm_t_iconv_descriptors *iconv_descriptors;
+  int at_stream_start_for_bom_read;
+  int at_stream_start_for_bom_write;
   SCM alist;
 };
 
 typedef struct scm_port_internal scm_t_port_internal;
 
+#define SCM_UNICODE_BOM  0xFEFF  /* Unicode byte-order mark */
+
 #define SCM_PORT_GET_INTERNAL(x)                                \
   ((scm_t_port_internal *) (SCM_PTAB_ENTRY(x)->input_cd))
 
-SCM_INTERNAL scm_t_iconv_descriptors *scm_i_port_iconv_descriptors (SCM port);
+SCM_INTERNAL scm_t_iconv_descriptors *
+scm_i_port_iconv_descriptors (SCM port, scm_t_port_rw_active mode);
 
 #endif
diff --git a/libguile/ports.c b/libguile/ports.c
index 51145e6..382867a 100644
--- a/libguile/ports.c
+++ b/libguile/ports.c
@@ -639,6 +639,9 @@ scm_new_port_table_entry (scm_t_bits tag)
     pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
   pti->iconv_descriptors = NULL;
 
+  pti->at_stream_start_for_bom_read  = 1;
+  pti->at_stream_start_for_bom_write = 1;
+
   /* XXX These fields are not what they seem.  They have been
      repurposed, but cannot safely be renamed in 2.0 without breaking
      ABI compatibility.  This will be cleaned up in 2.2.  */
@@ -1306,10 +1309,12 @@ static int
 get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
 		     char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
 {
-  scm_t_iconv_descriptors *id = scm_i_port_iconv_descriptors (port);
+  scm_t_iconv_descriptors *id;
   scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE];
   size_t input_size = 0;
 
+  id = scm_i_port_iconv_descriptors (port, SCM_PORT_READ);
+
   for (;;)
     {
       int byte_read;
@@ -1393,7 +1398,24 @@ get_codepoint (SCM port, scm_t_wchar *codepoint,
     err = get_iconv_codepoint (port, codepoint, buf, len);
 
   if (SCM_LIKELY (err == 0))
-    update_port_lf (*codepoint, port);
+    {
+      if (SCM_UNLIKELY (pti->at_stream_start_for_bom_read))
+        {
+          /* Record that we're no longer at stream start. */
+          pti->at_stream_start_for_bom_read = 0;
+          if (pt->rw_random)
+            pti->at_stream_start_for_bom_write = 0;
+
+          /* If we just read a BOM in an encoding that recognizes them,
+             then silently consume it and read another code point. */
+          if (SCM_UNLIKELY (*codepoint == SCM_UNICODE_BOM
+                            && (strcmp(pt->encoding, "UTF-8") == 0
+                                || strcmp(pt->encoding, "UTF-16") == 0
+                                || strcmp(pt->encoding, "UTF-32") == 0)))
+            return get_codepoint (port, codepoint, buf, len);
+        }
+      update_port_lf (*codepoint, port);
+    }
   else if (pt->ilseq_handler == SCM_ICONVEH_QUESTION_MARK)
     {
       *codepoint = '?';
@@ -2006,6 +2028,7 @@ SCM_DEFINE (scm_seek, "seek", 3, 0, 0,
 
   if (SCM_OPPORTP (fd_port))
     {
+      scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (fd_port);
       scm_t_ptob_descriptor *ptob = scm_ptobs + SCM_PTOBNUM (fd_port);
       off_t_or_off64_t off = scm_to_off_t_or_off64_t (offset);
       off_t_or_off64_t rv;
@@ -2015,6 +2038,11 @@ SCM_DEFINE (scm_seek, "seek", 3, 0, 0,
                         scm_cons (fd_port, SCM_EOL));
       else
 	rv = ptob->seek (fd_port, off, how);
+
+      /* Set stream-start flags according to new position. */
+      pti->at_stream_start_for_bom_read  = (rv == 0);
+      pti->at_stream_start_for_bom_write = (rv == 0);
+
       return scm_from_off_t_or_off64_t (rv);
     }
   else /* file descriptor?.  */
@@ -2265,6 +2293,66 @@ scm_i_default_port_encoding (void)
     }
 }
 
+/* If the next LEN bytes from port are equal to those in BYTES, then
+   return 1, else return 0.  Leave the port position unchanged.  */
+static int
+looking_at_bytes (SCM port, unsigned char *bytes, int len)
+{
+  scm_t_port *pt = SCM_PTAB_ENTRY (port);
+  int result;
+  int i = 0;
+
+  while (i < len && scm_peek_byte_or_eof (port) == bytes[i])
+    {
+      pt->read_pos++;
+      i++;
+    }
+
+  result = (i == len);
+
+  while (i > 0)
+    scm_unget_byte (bytes[--i], port);
+
+  return result;
+}
+
+static unsigned char scm_utf8_bom[3]    = {0xEF, 0xBB, 0xBF};
+static unsigned char scm_utf16be_bom[2] = {0xFE, 0xFF};
+static unsigned char scm_utf16le_bom[2] = {0xFF, 0xFE};
+static unsigned char scm_utf32be_bom[4] = {0x00, 0x00, 0xFE, 0xFF};
+static unsigned char scm_utf32le_bom[4] = {0xFF, 0xFE, 0x00, 0x00};
+
+/* Decide what endianness to use for a UTF-16 port.  Return "UTF-16BE"
+   or "UTF-16LE".  MODE must be either SCM_PORT_READ or SCM_PORT_WRITE,
+   and specifies which operation is about to be done.  The MODE
+   determines how we will decide the endianness.  We deliberately avoid
+   reading from the port unless the user is about to do so.  If the user
+   is about to read, then we look for a BOM, and if present, we use it
+   to determine the endianness.  Otherwise we choose big-endian, as
+   recommended by the Unicode Consortium.  */
+static char *
+decide_utf16_encoding (SCM port, scm_t_port_rw_active mode)
+{
+  if (mode == SCM_PORT_READ
+      && looking_at_bytes (port, scm_utf16le_bom, sizeof scm_utf16le_bom))
+    return "UTF-16LE";
+  else
+    return "UTF-16BE";
+}
+
+/* Decide what endianness to use for a UTF-32 port.  Return "UTF-32BE"
+   or "UTF-32LE".  See the comment above 'decide_utf16_encoding' for
+   details.  */
+static char *
+decide_utf32_encoding (SCM port, scm_t_port_rw_active mode)
+{
+  if (mode == SCM_PORT_READ
+      && looking_at_bytes (port, scm_utf32le_bom, sizeof scm_utf32le_bom))
+    return "UTF-32LE";
+  else
+    return "UTF-32BE";
+}
+
 static void
 finalize_iconv_descriptors (void *ptr, void *data)
 {
@@ -2341,23 +2429,36 @@ close_iconv_descriptors (scm_t_iconv_descriptors *id)
   id->output_cd = (void *) -1;
 }
 
+/* Return the iconv_descriptors, initializing them if necessary.  MODE
+   must be either SCM_PORT_READ or SCM_PORT_WRITE, and specifies which
+   operation is about to be done.  We deliberately avoid reading from
+   the port unless the user was about to do so.  */
 scm_t_iconv_descriptors *
-scm_i_port_iconv_descriptors (SCM port)
+scm_i_port_iconv_descriptors (SCM port, scm_t_port_rw_active mode)
 {
-  scm_t_port *pt;
-  scm_t_port_internal *pti;
-
-  pt = SCM_PTAB_ENTRY (port);
-  pti = SCM_PORT_GET_INTERNAL (port);
+  scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (port);
 
   assert (pti->encoding_mode == SCM_PORT_ENCODING_MODE_ICONV);
 
   if (!pti->iconv_descriptors)
     {
+      scm_t_port *pt = SCM_PTAB_ENTRY (port);
+      char *precise_encoding;
+
       if (!pt->encoding)
         pt->encoding = "ISO-8859-1";
+
+      /* If the specified encoding is UTF-16 or UTF-32, then make
+         that more precise by deciding what endianness to use.  */
+      if (strcmp (pt->encoding, "UTF-16") == 0)
+        precise_encoding = decide_utf16_encoding (port, mode);
+      else if (strcmp (pt->encoding, "UTF-32") == 0)
+        precise_encoding = decide_utf32_encoding (port, mode);
+      else
+        precise_encoding = pt->encoding;
+
       pti->iconv_descriptors =
-        open_iconv_descriptors (pt->encoding,
+        open_iconv_descriptors (precise_encoding,
                                 SCM_INPUT_PORT_P (port),
                                 SCM_OUTPUT_PORT_P (port));
     }
@@ -2377,6 +2478,14 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
   pti = SCM_PORT_GET_INTERNAL (port);
   prev = pti->iconv_descriptors;
 
+  /* In order to handle cases where the encoding changes mid-stream
+     (e.g. within an HTTP stream, or within a file that is composed of
+     segments with different encodings), we consider this to be "stream
+     start" for purposes of BOM handling, regardless of our actual file
+     position. */
+  pti->at_stream_start_for_bom_read  = 1;
+  pti->at_stream_start_for_bom_write = 1;
+
   if (encoding == NULL)
     encoding = "ISO-8859-1";
 
@@ -2387,19 +2496,14 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
     {
       pt->encoding = "UTF-8";
       pti->encoding_mode = SCM_PORT_ENCODING_MODE_UTF8;
-      pti->iconv_descriptors = NULL;
     }
   else
     {
-      /* Open descriptors before mutating the port. */
-      pti->iconv_descriptors =
-        open_iconv_descriptors (encoding,
-                                SCM_INPUT_PORT_P (port),
-                                SCM_OUTPUT_PORT_P (port));
       pt->encoding = scm_gc_strdup (encoding, "port");
       pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
     }
 
+  pti->iconv_descriptors = NULL;
   if (prev)
     close_iconv_descriptors (prev);
 }
diff --git a/libguile/print.c b/libguile/print.c
index 1572690..5795c8e 100644
--- a/libguile/print.c
+++ b/libguile/print.c
@@ -881,8 +881,24 @@ display_string_using_iconv (const void *str, int narrow_p, size_t len,
 {
   size_t printed;
   scm_t_iconv_descriptors *id;
+  scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (port);
 
-  id = scm_i_port_iconv_descriptors (port);
+  id = scm_i_port_iconv_descriptors (port, SCM_PORT_WRITE);
+
+  if (SCM_UNLIKELY (pti->at_stream_start_for_bom_write && len > 0))
+    {
+      scm_t_port *pt = SCM_PTAB_ENTRY (port);
+
+      /* Record that we're no longer at stream start.  */
+      pti->at_stream_start_for_bom_write = 0;
+      if (pt->rw_random)
+        pti->at_stream_start_for_bom_read = 0;
+
+      /* Write a BOM if appropriate.  */
+      if (SCM_UNLIKELY (strcmp(pt->encoding, "UTF-16") == 0
+                        || strcmp(pt->encoding, "UTF-32") == 0))
+        display_character (SCM_UNICODE_BOM, port, iconveh_question_mark);
+    }
 
   printed = 0;
 
diff --git a/test-suite/tests/ports.test b/test-suite/tests/ports.test
index 886ab24..69a4ea7 100644
--- a/test-suite/tests/ports.test
+++ b/test-suite/tests/ports.test
@@ -24,7 +24,8 @@
   #:use-module (ice-9 popen)
   #:use-module (ice-9 rdelim)
   #:use-module (rnrs bytevectors)
-  #:use-module ((rnrs io ports) #:select (open-bytevector-input-port)))
+  #:use-module ((rnrs io ports) #:select (open-bytevector-input-port
+                                          open-bytevector-output-port)))
 
 (define (display-line . args)
   (for-each display args)
@@ -918,7 +919,9 @@
 
   (pass-if-exception "set-port-encoding!, wrong encoding"
     exception:miscellaneous-error
-    (set-port-encoding! (open-input-string "") "does-not-exist"))
+    (let ((p (open-input-string "")))
+      (set-port-encoding! p "does-not-exist")
+      (read p)))
 
   (pass-if-exception "%default-port-encoding, wrong encoding"
     exception:miscellaneous-error
@@ -1149,6 +1152,258 @@
 
 \f
 
+(with-test-prefix "unicode byte-order marks (BOMs)"
+
+  (define (bv-read-test* encoding bv proc)
+    (let ((port (open-bytevector-input-port bv)))
+      (set-port-encoding! port encoding)
+      (proc port)))
+
+  (define (bv-read-test encoding bv)
+    (bv-read-test* encoding bv read-string))
+
+  (define (bv-write-test* encoding proc)
+    (call-with-values
+        (lambda () (open-bytevector-output-port))
+      (lambda (port get-bytevector)
+        (set-port-encoding! port encoding)
+        (proc port)
+        (get-bytevector))))
+
+  (define (bv-write-test encoding str)
+    (bv-write-test* encoding
+                    (lambda (p)
+                      (display str p))))
+
+  (pass-if-equal "BOM not discarded from Latin-1 stream"
+      "\xEF\xBB\xBF\x61"
+    (bv-read-test "ISO-8859-1" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from Latin-2 stream"
+      "\u010F\u0165\u017C\x61"
+    (bv-read-test "ISO-8859-2" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16BE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-16BE" #vu8(#xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16LE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-16LE" #vu8(#xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM not discarded from UTF-32BE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-32BE" #vu8(#x00 #x00 #xFE #xFF
+                                       #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-32LE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-32LE" #vu8(#xFF #xFE #x00 #x00
+                                       #x61 #x00 #x00 #x00)))
+
+  (pass-if-equal "BOM not written to UTF-8 stream"
+      #vu8(#x61)
+    (bv-write-test "UTF-8" "a"))
+
+  (pass-if-equal "BOM discarded from start of UTF-8 stream"
+      "a"
+    (bv-read-test "UTF-8" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM discarded from start of UTF-8 stream after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "UTF-8" #vu8(#xEF #xBB #xBF #x61)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-8 stream"
+      "\uFEFFa"
+    (bv-read-test "UTF-8" #vu8(#xEF #xBB #xBF #xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-8 stream after seek to > 0"
+      "\uFEFFb"
+    (bv-read-test* "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62)
+                   (lambda (p)
+                     (seek p 1 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-8 stream"
+      "a\uFEFFb"
+    (bv-read-test "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62)))
+
+  (pass-if-equal "BOM (BE) written to start of UTF-16 stream"
+      #vu8(#xFE #xFF #x00 #x61 #x00 #x62)
+    (bv-write-test "UTF-16" "ab"))
+
+  (pass-if-equal "BOM (BE) written to UTF-16 stream after set-port-encoding!"
+      #vu8(#xFE #xFF #x00 #x61 #x00 #x62 #xFE #xFF #x00 #x63 #x00 #x64)
+    (bv-write-test* "UTF-16"
+                    (lambda (p)
+                      (display "ab" p)
+                      (set-port-encoding! p "UTF-16")
+                      (display "cd" p))))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (BE)"
+      "a"
+    (bv-read-test "UTF-16" #vu8(#xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (BE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "UTF-16" #vu8(#xFE #xFF #x00 #x61)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (BE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16 stream (BE) after seek to > 0"
+      "\uFEFFa"
+    (bv-read-test* "UTF-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61)
+                   (lambda (p)
+                     (seek p 2 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-16 stream"
+      "a\uFEFFb"
+    (let ((be (bv-read-test "UTF-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62)))
+          (le (bv-read-test "UTF-16" #vu8(#x61 #x00 #xFF #xFE #x62 #x00))))
+      (if (char=? #\a (string-ref be 0))
+          be
+          le)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (LE)"
+      "a"
+    (bv-read-test "UTF-16" #vu8(#xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (LE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "UTF-16" #vu8(#xFF #xFE #x61 #x00)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (LE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-16" #vu8(#xFF #xFE #xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM not discarded from UTF-16 stream (LE) after seek to > 0"
+      "\uFEFFa"
+    (bv-read-test* "UTF-16" #vu8(#xFF #xFE #xFF #xFE #x61 #x00)
+                   (lambda (p)
+                     (seek p 2 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (BE)"
+      "a"
+    (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+                                     #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (BE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+                                      #x00 #x00 #x00 #x61)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (BE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+                                     #x00 #x00 #xFE #xFF
+                                     #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-32 stream (BE) after seek to > 0"
+      "\uFEFFa"
+    (bv-read-test* "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+                                      #x00 #x00 #xFE #xFF
+                                      #x00 #x00 #x00 #x61)
+                   (lambda (p)
+                     (seek p 4 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM discarded within UTF-16 stream (BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-16")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM discarded within UTF-16 stream (LE,BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-16" #vu8(#x00 #x61 #xFF #xFE #x62 #x00)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-16")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM discarded within UTF-32 stream (BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-32" #vu8(#x00 #x00 #x00 #x61
+                                      #x00 #x00 #xFE #xFF
+                                      #x00 #x00 #x00 #x62)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-32")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM discarded within UTF-32 stream (LE,BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-32" #vu8(#x00 #x00 #x00 #x61
+                                      #xFF #xFE #x00 #x00
+                                      #x62 #x00 #x00 #x00)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-32")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-32 stream"
+      "a\uFEFFb"
+    (let ((be (bv-read-test "UTF-32" #vu8(#x00 #x00 #x00 #x61
+                                               #x00 #x00 #xFE #xFF
+                                               #x00 #x00 #x00 #x62)))
+          (le (bv-read-test "UTF-32" #vu8(#x61 #x00 #x00 #x00
+                                               #xFF #xFE #x00 #x00
+                                               #x62 #x00 #x00 #x00))))
+      (if (char=? #\a (string-ref be 0))
+          be
+          le)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (LE)"
+      "a"
+    (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00
+                                     #x61 #x00 #x00 #x00)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (LE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "UTF-32" #vu8(#xFF #xFE #x00 #x00
+                                      #x61 #x00 #x00 #x00)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (LE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00
+                                     #xFF #xFE #x00 #x00
+                                     #x61 #x00 #x00 #x00)))
+
+  )
+
+\f
+
 (define-syntax-rule (with-load-path path body ...)
   (let ((new path)
         (old %load-path))
-- 
1.7.10.4



* Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)
  2013-04-03 10:44 [PATCH] Improve handling of Unicode byte-order marks (BOMs) Mark H Weaver
@ 2013-04-03 11:47 ` Mark H Weaver
  2013-04-03 11:58 ` Ludovic Courtès
  1 sibling, 0 replies; 14+ messages in thread
From: Mark H Weaver @ 2013-04-03 11:47 UTC
  To: guile-devel

[-- Attachment #1: Type: text/plain, Size: 262 bytes --]

Here's an improved version of the patch.  Mainly it adds more tests.
Also, I forgot to mention that binary I/O does not affect the "start of
stream" flags at all.  This is mainly for efficiency reasons, but even
so, I don't feel too bad about it.
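
To make the "start of stream" flag rules easier to follow, here is a
small stand-alone model of them.  It is only a sketch under simplifying
assumptions: 'struct bom_state' and the 'bom_on_*' functions are
hypothetical names, not part of libguile, and the model tracks nothing
but the two flags, the encoding name and 'rw_random'.  It follows the
patch in also dropping a leading U+FEFF when reading "UTF-8", while
writing a BOM only for "UTF-16" and "UTF-32".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified model of the two new flags.  */
struct bom_state
{
  const char *encoding;
  int rw_random;        /* non-zero: input and output are linked */
  int at_start_read;
  int at_start_write;
};

static void
bom_state_init (struct bom_state *s, const char *encoding, int rw_random)
{
  s->encoding = encoding;
  s->rw_random = rw_random;
  s->at_start_read = 1;
  s->at_start_write = 1;
}

static int
encoding_is (const struct bom_state *s, const char *name)
{
  return strcmp (s->encoding, name) == 0;
}

/* Call after decoding CODEPOINT in a textual read.  Return 1 if it
   should be silently dropped as a BOM.  */
static int
bom_on_text_read (struct bom_state *s, unsigned long codepoint)
{
  if (!s->at_start_read)
    return 0;
  s->at_start_read = 0;
  if (s->rw_random)
    s->at_start_write = 0;
  return codepoint == 0xFEFF
    && (encoding_is (s, "UTF-8") || encoding_is (s, "UTF-16")
        || encoding_is (s, "UTF-32"));
}

/* Call before a textual write of LEN characters.  Return 1 if a BOM
   should be written first.  */
static int
bom_on_text_write (struct bom_state *s, size_t len)
{
  if (!s->at_start_write || len == 0)
    return 0;
  s->at_start_write = 0;
  if (s->rw_random)
    s->at_start_read = 0;
  return encoding_is (s, "UTF-16") || encoding_is (s, "UTF-32");
}

/* A seek sets both flags if it lands at position 0, else clears them.  */
static void
bom_on_seek (struct bom_state *s, long new_pos)
{
  s->at_start_read = (new_pos == 0);
  s->at_start_write = (new_pos == 0);
}

/* Changing the encoding starts a new logical stream.  */
static void
bom_on_set_encoding (struct bom_state *s, const char *encoding)
{
  s->encoding = encoding;
  s->at_start_read = 1;
  s->at_start_write = 1;
}

/* Binary I/O deliberately leaves both flags untouched.  */
static void
bom_on_binary_io (struct bom_state *s)
{
  (void) s;
}
```

In this model a binary read at the start of a port followed by a textual
read still counts as "start of stream" for the textual read, which is
the trade-off mentioned above.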

     Mark



[-- Attachment #2: [PATCH] Improve handling of Unicode byte-order marks (BOMs) --]
[-- Type: text/x-diff, Size: 24215 bytes --]

From d8d37d5519ca61961b70cb3051ccca2be7d4affa Mon Sep 17 00:00:00 2001
From: Mark H Weaver <mhw@netris.org>
Date: Wed, 3 Apr 2013 04:22:04 -0400
Subject: [PATCH] Improve handling of Unicode byte-order marks (BOMs).

* libguile/ports-internal.h (struct scm_port_internal): Add new members
  'at_stream_start_for_bom_read' and 'at_stream_start_for_bom_write'.
  (SCM_UNICODE_BOM): New macro.
  (scm_i_port_iconv_descriptors): Add 'mode' parameter to prototype.

* libguile/ports.c (scm_new_port_table_entry): Initialize
  'at_stream_start_for_bom_read' and 'at_stream_start_for_bom_write'.
  (get_iconv_codepoint): Pass new 'mode' parameter to
  'scm_i_port_iconv_descriptors'.
  (get_codepoint): After reading a codepoint at stream start, record
  that we're no longer at stream start, and consume a BOM where
  appropriate.
  (scm_seek): Set the stream start flags according to the new position.
  (looking_at_bytes): New static function.
  (scm_utf8_bom, scm_utf16be_bom, scm_utf16le_bom, scm_utf32be_bom,
  scm_utf32le_bom): New static const arrays.
  (decide_utf16_encoding, decide_utf32_encoding): New static functions.
  (scm_i_port_iconv_descriptors): Add new 'mode' parameter.  If the
  specified encoding is UTF-16 or UTF-32, make that precise by deciding
  what endianness to use, and construct iconv descriptors based on the
  precise encoding.
  (scm_i_set_port_encoding_x): Record that we are now at stream start.
  Do not open the new iconv descriptors immediately; let them be
  initialized lazily.

* libguile/print.c (display_string_using_iconv): Record that we're no
  longer at stream start.  Write a BOM if appropriate.

* test-suite/tests/ports.test ("set-port-encoding!, wrong encoding"):
  Adapt test to cope with the fact that 'set-port-encoding!' does not
  immediately open the iconv descriptors.
  (bv-read-test): New procedure.
  ("unicode byte-order marks (BOMs)"): New test prefix.
---
 libguile/ports-internal.h   |    7 +-
 libguile/ports.c            |  134 +++++++++++++++++---
 libguile/print.c            |   18 ++-
 test-suite/tests/ports.test |  293 ++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 433 insertions(+), 19 deletions(-)

diff --git a/libguile/ports-internal.h b/libguile/ports-internal.h
index 73a788f..cd1746b 100644
--- a/libguile/ports-internal.h
+++ b/libguile/ports-internal.h
@@ -48,14 +48,19 @@ struct scm_port_internal
 {
   scm_t_port_encoding_mode encoding_mode;
   scm_t_iconv_descriptors *iconv_descriptors;
+  int at_stream_start_for_bom_read;
+  int at_stream_start_for_bom_write;
   SCM alist;
 };
 
 typedef struct scm_port_internal scm_t_port_internal;
 
+#define SCM_UNICODE_BOM  0xFEFF  /* Unicode byte-order mark */
+
 #define SCM_PORT_GET_INTERNAL(x)                                \
   ((scm_t_port_internal *) (SCM_PTAB_ENTRY(x)->input_cd))
 
-SCM_INTERNAL scm_t_iconv_descriptors *scm_i_port_iconv_descriptors (SCM port);
+SCM_INTERNAL scm_t_iconv_descriptors *
+scm_i_port_iconv_descriptors (SCM port, scm_t_port_rw_active mode);
 
 #endif
diff --git a/libguile/ports.c b/libguile/ports.c
index 51145e6..99261da 100644
--- a/libguile/ports.c
+++ b/libguile/ports.c
@@ -639,6 +639,9 @@ scm_new_port_table_entry (scm_t_bits tag)
     pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
   pti->iconv_descriptors = NULL;
 
+  pti->at_stream_start_for_bom_read  = 1;
+  pti->at_stream_start_for_bom_write = 1;
+
   /* XXX These fields are not what they seem.  They have been
      repurposed, but cannot safely be renamed in 2.0 without breaking
      ABI compatibility.  This will be cleaned up in 2.2.  */
@@ -1306,10 +1309,12 @@ static int
 get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
 		     char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
 {
-  scm_t_iconv_descriptors *id = scm_i_port_iconv_descriptors (port);
+  scm_t_iconv_descriptors *id;
   scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE];
   size_t input_size = 0;
 
+  id = scm_i_port_iconv_descriptors (port, SCM_PORT_READ);
+
   for (;;)
     {
       int byte_read;
@@ -1393,7 +1398,24 @@ get_codepoint (SCM port, scm_t_wchar *codepoint,
     err = get_iconv_codepoint (port, codepoint, buf, len);
 
   if (SCM_LIKELY (err == 0))
-    update_port_lf (*codepoint, port);
+    {
+      if (SCM_UNLIKELY (pti->at_stream_start_for_bom_read))
+        {
+          /* Record that we're no longer at stream start. */
+          pti->at_stream_start_for_bom_read = 0;
+          if (pt->rw_random)
+            pti->at_stream_start_for_bom_write = 0;
+
+          /* If we just read a BOM in an encoding that recognizes them,
+             then silently consume it and read another code point. */
+          if (SCM_UNLIKELY (*codepoint == SCM_UNICODE_BOM
+                            && (strcmp(pt->encoding, "UTF-8") == 0
+                                || strcmp(pt->encoding, "UTF-16") == 0
+                                || strcmp(pt->encoding, "UTF-32") == 0)))
+            return get_codepoint (port, codepoint, buf, len);
+        }
+      update_port_lf (*codepoint, port);
+    }
   else if (pt->ilseq_handler == SCM_ICONVEH_QUESTION_MARK)
     {
       *codepoint = '?';
@@ -2006,6 +2028,7 @@ SCM_DEFINE (scm_seek, "seek", 3, 0, 0,
 
   if (SCM_OPPORTP (fd_port))
     {
+      scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (fd_port);
       scm_t_ptob_descriptor *ptob = scm_ptobs + SCM_PTOBNUM (fd_port);
       off_t_or_off64_t off = scm_to_off_t_or_off64_t (offset);
       off_t_or_off64_t rv;
@@ -2015,6 +2038,11 @@ SCM_DEFINE (scm_seek, "seek", 3, 0, 0,
                         scm_cons (fd_port, SCM_EOL));
       else
 	rv = ptob->seek (fd_port, off, how);
+
+      /* Set stream-start flags according to new position. */
+      pti->at_stream_start_for_bom_read  = (rv == 0);
+      pti->at_stream_start_for_bom_write = (rv == 0);
+
       return scm_from_off_t_or_off64_t (rv);
     }
   else /* file descriptor?.  */
@@ -2265,6 +2293,66 @@ scm_i_default_port_encoding (void)
     }
 }
 
+/* If the next LEN bytes from port are equal to those in BYTES, then
+   return 1, else return 0.  Leave the port position unchanged.  */
+static int
+looking_at_bytes (SCM port, const unsigned char *bytes, int len)
+{
+  scm_t_port *pt = SCM_PTAB_ENTRY (port);
+  int result;
+  int i = 0;
+
+  while (i < len && scm_peek_byte_or_eof (port) == bytes[i])
+    {
+      pt->read_pos++;
+      i++;
+    }
+
+  result = (i == len);
+
+  while (i > 0)
+    scm_unget_byte (bytes[--i], port);
+
+  return result;
+}
+
+static const unsigned char scm_utf8_bom[3]    = {0xEF, 0xBB, 0xBF};
+static const unsigned char scm_utf16be_bom[2] = {0xFE, 0xFF};
+static const unsigned char scm_utf16le_bom[2] = {0xFF, 0xFE};
+static const unsigned char scm_utf32be_bom[4] = {0x00, 0x00, 0xFE, 0xFF};
+static const unsigned char scm_utf32le_bom[4] = {0xFF, 0xFE, 0x00, 0x00};
+
+/* Decide what endianness to use for a UTF-16 port.  Return "UTF-16BE"
+   or "UTF-16LE".  MODE must be either SCM_PORT_READ or SCM_PORT_WRITE,
+   and specifies which operation is about to be done.  The MODE
+   determines how we will decide the endianness.  We deliberately avoid
+   reading from the port unless the user is about to do so.  If the user
+   is about to read, then we look for a BOM, and if present, we use it
+   to determine the endianness.  Otherwise we choose big-endian, as
+   recommended by the Unicode Consortium.  */
+static char *
+decide_utf16_encoding (SCM port, scm_t_port_rw_active mode)
+{
+  if (mode == SCM_PORT_READ
+      && looking_at_bytes (port, scm_utf16le_bom, sizeof scm_utf16le_bom))
+    return "UTF-16LE";
+  else
+    return "UTF-16BE";
+}
+
+/* Decide what endianness to use for a UTF-32 port.  Return "UTF-32BE"
+   or "UTF-32LE".  See the comment above 'decide_utf16_encoding' for
+   details.  */
+static char *
+decide_utf32_encoding (SCM port, scm_t_port_rw_active mode)
+{
+  if (mode == SCM_PORT_READ
+      && looking_at_bytes (port, scm_utf32le_bom, sizeof scm_utf32le_bom))
+    return "UTF-32LE";
+  else
+    return "UTF-32BE";
+}
+
 static void
 finalize_iconv_descriptors (void *ptr, void *data)
 {
@@ -2341,23 +2429,36 @@ close_iconv_descriptors (scm_t_iconv_descriptors *id)
   id->output_cd = (void *) -1;
 }
 
+/* Return the iconv_descriptors, initializing them if necessary.  MODE
+   must be either SCM_PORT_READ or SCM_PORT_WRITE, and specifies which
+   operation is about to be done.  We deliberately avoid reading from
+   the port unless the user was about to do so.  */
 scm_t_iconv_descriptors *
-scm_i_port_iconv_descriptors (SCM port)
+scm_i_port_iconv_descriptors (SCM port, scm_t_port_rw_active mode)
 {
-  scm_t_port *pt;
-  scm_t_port_internal *pti;
-
-  pt = SCM_PTAB_ENTRY (port);
-  pti = SCM_PORT_GET_INTERNAL (port);
+  scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (port);
 
   assert (pti->encoding_mode == SCM_PORT_ENCODING_MODE_ICONV);
 
   if (!pti->iconv_descriptors)
     {
+      scm_t_port *pt = SCM_PTAB_ENTRY (port);
+      char *precise_encoding;
+
       if (!pt->encoding)
         pt->encoding = "ISO-8859-1";
+
+      /* If the specified encoding is UTF-16 or UTF-32, then make
+         that more precise by deciding what endianness to use.  */
+      if (strcmp (pt->encoding, "UTF-16") == 0)
+        precise_encoding = decide_utf16_encoding (port, mode);
+      else if (strcmp (pt->encoding, "UTF-32") == 0)
+        precise_encoding = decide_utf32_encoding (port, mode);
+      else
+        precise_encoding = pt->encoding;
+
       pti->iconv_descriptors =
-        open_iconv_descriptors (pt->encoding,
+        open_iconv_descriptors (precise_encoding,
                                 SCM_INPUT_PORT_P (port),
                                 SCM_OUTPUT_PORT_P (port));
     }
@@ -2377,6 +2478,14 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
   pti = SCM_PORT_GET_INTERNAL (port);
   prev = pti->iconv_descriptors;
 
+  /* In order to handle cases where the encoding changes mid-stream
+     (e.g. within an HTTP stream, or within a file that is composed of
+     segments with different encodings), we consider this to be "stream
+     start" for purposes of BOM handling, regardless of our actual file
+     position. */
+  pti->at_stream_start_for_bom_read  = 1;
+  pti->at_stream_start_for_bom_write = 1;
+
   if (encoding == NULL)
     encoding = "ISO-8859-1";
 
@@ -2387,19 +2496,14 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
     {
       pt->encoding = "UTF-8";
       pti->encoding_mode = SCM_PORT_ENCODING_MODE_UTF8;
-      pti->iconv_descriptors = NULL;
     }
   else
     {
-      /* Open descriptors before mutating the port. */
-      pti->iconv_descriptors =
-        open_iconv_descriptors (encoding,
-                                SCM_INPUT_PORT_P (port),
-                                SCM_OUTPUT_PORT_P (port));
       pt->encoding = scm_gc_strdup (encoding, "port");
       pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
     }
 
+  pti->iconv_descriptors = NULL;
   if (prev)
     close_iconv_descriptors (prev);
 }
diff --git a/libguile/print.c b/libguile/print.c
index 1572690..b8b13d4 100644
--- a/libguile/print.c
+++ b/libguile/print.c
@@ -881,8 +881,24 @@ display_string_using_iconv (const void *str, int narrow_p, size_t len,
 {
   size_t printed;
   scm_t_iconv_descriptors *id;
+  scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (port);
 
-  id = scm_i_port_iconv_descriptors (port);
+  id = scm_i_port_iconv_descriptors (port, SCM_PORT_WRITE);
+
+  if (SCM_UNLIKELY (pti->at_stream_start_for_bom_write && len > 0))
+    {
+      scm_t_port *pt = SCM_PTAB_ENTRY (port);
+
+      /* Record that we're no longer at stream start.  */
+      pti->at_stream_start_for_bom_write = 0;
+      if (pt->rw_random)
+        pti->at_stream_start_for_bom_read = 0;
+
+      /* Write a BOM if appropriate.  */
+      if (SCM_UNLIKELY (strcmp(pt->encoding, "UTF-16") == 0
+                        || strcmp(pt->encoding, "UTF-32") == 0))
+        display_character (SCM_UNICODE_BOM, port, iconveh_error);
+    }
 
   printed = 0;
 
diff --git a/test-suite/tests/ports.test b/test-suite/tests/ports.test
index 886ab24..f966fc3 100644
--- a/test-suite/tests/ports.test
+++ b/test-suite/tests/ports.test
@@ -24,7 +24,8 @@
   #:use-module (ice-9 popen)
   #:use-module (ice-9 rdelim)
   #:use-module (rnrs bytevectors)
-  #:use-module ((rnrs io ports) #:select (open-bytevector-input-port)))
+  #:use-module ((rnrs io ports) #:select (open-bytevector-input-port
+                                          open-bytevector-output-port)))
 
 (define (display-line . args)
   (for-each display args)
@@ -918,7 +919,9 @@
 
   (pass-if-exception "set-port-encoding!, wrong encoding"
     exception:miscellaneous-error
-    (set-port-encoding! (open-input-string "") "does-not-exist"))
+    (let ((p (open-input-string "")))
+      (set-port-encoding! p "does-not-exist")
+      (read p)))
 
   (pass-if-exception "%default-port-encoding, wrong encoding"
     exception:miscellaneous-error
@@ -1149,6 +1152,292 @@
 
 \f
 
+(with-test-prefix "unicode byte-order marks (BOMs)"
+
+  (define (bv-read-test* encoding bv proc)
+    (let ((port (open-bytevector-input-port bv)))
+      (set-port-encoding! port encoding)
+      (proc port)))
+
+  (define (bv-read-test encoding bv)
+    (bv-read-test* encoding bv read-string))
+
+  (define (bv-write-test* encoding proc)
+    (call-with-values
+        (lambda () (open-bytevector-output-port))
+      (lambda (port get-bytevector)
+        (set-port-encoding! port encoding)
+        (proc port)
+        (get-bytevector))))
+
+  (define (bv-write-test encoding str)
+    (bv-write-test* encoding
+                    (lambda (p)
+                      (display str p))))
+
+  (pass-if-equal "BOM not discarded from Latin-1 stream"
+      "\xEF\xBB\xBF\x61"
+    (bv-read-test "ISO-8859-1" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from Latin-2 stream"
+      "\u010F\u0165\u017C\x61"
+    (bv-read-test "ISO-8859-2" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16BE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-16BE" #vu8(#xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16LE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-16LE" #vu8(#xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM not discarded from UTF-32BE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-32BE" #vu8(#x00 #x00 #xFE #xFF
+                                       #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-32LE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-32LE" #vu8(#xFF #xFE #x00 #x00
+                                       #x61 #x00 #x00 #x00)))
+
+  (pass-if-equal "BOM not written to UTF-8 stream"
+      #vu8(#x61)
+    (bv-write-test "UTF-8" "a"))
+
+  (pass-if-equal "BOM not written to UTF-16BE stream"
+      #vu8(#x00 #x61)
+    (bv-write-test "UTF-16BE" "a"))
+
+  (pass-if-equal "BOM not written to UTF-16LE stream"
+      #vu8(#x61 #x00)
+    (bv-write-test "UTF-16LE" "a"))
+
+  (pass-if-equal "BOM not written to UTF-32BE stream"
+      #vu8(#x00 #x00 #x00 #x61)
+    (bv-write-test "UTF-32BE" "a"))
+
+  (pass-if-equal "BOM not written to UTF-32LE stream"
+      #vu8(#x61 #x00 #x00 #x00)
+    (bv-write-test "UTF-32LE" "a"))
+
+  (pass-if "Don't read from the port unless user asks to"
+    (let* ((p (make-soft-port
+               (vector
+                (lambda (c) #f)           ; write char
+                (lambda (s) #f)           ; write string
+                (lambda () #f)            ; flush
+                (lambda () (throw 'fail)) ; read char
+                (lambda () #f))
+               "rw")))
+      (set-port-encoding! p "UTF-16")
+      (display "abc" p)
+      (set-port-encoding! p "UTF-32")
+      (display "def" p)
+      #t))
+
+  ;; TODO: test that input and output streams are independent when
+  ;; appropriate, and linked when appropriate.
+
+  (pass-if-equal "BOM discarded from start of UTF-8 stream"
+      "a"
+    (bv-read-test "UTF-8" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM discarded from start of UTF-8 stream after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "UTF-8" #vu8(#xEF #xBB #xBF #x61)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-8 stream"
+      "\uFEFFa"
+    (bv-read-test "UTF-8" #vu8(#xEF #xBB #xBF #xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-8 stream after seek to > 0"
+      "\uFEFFb"
+    (bv-read-test* "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62)
+                   (lambda (p)
+                     (seek p 1 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-8 stream"
+      "a\uFEFFb"
+    (bv-read-test "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62)))
+
+  (pass-if-equal "BOM (BE) written to start of UTF-16 stream"
+      #vu8(#xFE #xFF #x00 #x61 #x00 #x62)
+    (bv-write-test "UTF-16" "ab"))
+
+  (pass-if-equal "BOM (BE) written to UTF-16 stream after set-port-encoding!"
+      #vu8(#xFE #xFF #x00 #x61 #x00 #x62 #xFE #xFF #x00 #x63 #x00 #x64)
+    (bv-write-test* "UTF-16"
+                    (lambda (p)
+                      (display "ab" p)
+                      (set-port-encoding! p "UTF-16")
+                      (display "cd" p))))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (BE)"
+      "a"
+    (bv-read-test "UTF-16" #vu8(#xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (BE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "UTF-16" #vu8(#xFE #xFF #x00 #x61)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (BE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16 stream (BE) after seek to > 0"
+      "\uFEFFa"
+    (bv-read-test* "UTF-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61)
+                   (lambda (p)
+                     (seek p 2 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-16 stream"
+      "a\uFEFFb"
+    (let ((be (bv-read-test "UTF-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62)))
+          (le (bv-read-test "UTF-16" #vu8(#x61 #x00 #xFF #xFE #x62 #x00))))
+      (if (char=? #\a (string-ref be 0))
+          be
+          le)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (LE)"
+      "a"
+    (bv-read-test "UTF-16" #vu8(#xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (LE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "UTF-16" #vu8(#xFF #xFE #x61 #x00)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (LE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-16" #vu8(#xFF #xFE #xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM not discarded from UTF-16 stream (LE) after seek to > 0"
+      "\uFEFFa"
+    (bv-read-test* "UTF-16" #vu8(#xFF #xFE #xFF #xFE #x61 #x00)
+                   (lambda (p)
+                     (seek p 2 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (BE)"
+      "a"
+    (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+                                     #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (BE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+                                      #x00 #x00 #x00 #x61)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (BE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+                                     #x00 #x00 #xFE #xFF
+                                     #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-32 stream (BE) after seek to > 0"
+      "\uFEFFa"
+    (bv-read-test* "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+                                      #x00 #x00 #xFE #xFF
+                                      #x00 #x00 #x00 #x61)
+                   (lambda (p)
+                     (seek p 4 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM discarded within UTF-16 stream (BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-16")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM discarded within UTF-16 stream (LE,BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-16" #vu8(#x00 #x61 #xFF #xFE #x62 #x00)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-16")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM discarded within UTF-32 stream (BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-32" #vu8(#x00 #x00 #x00 #x61
+                                      #x00 #x00 #xFE #xFF
+                                      #x00 #x00 #x00 #x62)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-32")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM discarded within UTF-32 stream (LE,BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-32" #vu8(#x00 #x00 #x00 #x61
+                                      #xFF #xFE #x00 #x00
+                                      #x62 #x00 #x00 #x00)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-32")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-32 stream"
+      "a\uFEFFb"
+    (let ((be (bv-read-test "UTF-32" #vu8(#x00 #x00 #x00 #x61
+                                               #x00 #x00 #xFE #xFF
+                                               #x00 #x00 #x00 #x62)))
+          (le (bv-read-test "UTF-32" #vu8(#x61 #x00 #x00 #x00
+                                               #xFF #xFE #x00 #x00
+                                               #x62 #x00 #x00 #x00))))
+      (if (char=? #\a (string-ref be 0))
+          be
+          le)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (LE)"
+      "a"
+    (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00
+                                     #x61 #x00 #x00 #x00)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (LE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "UTF-32" #vu8(#xFF #xFE #x00 #x00
+                                      #x61 #x00 #x00 #x00)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (LE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00
+                                     #xFF #xFE #x00 #x00
+                                     #x61 #x00 #x00 #x00)))
+
+  )
+
+\f
+
 (define-syntax-rule (with-load-path path body ...)
   (let ((new path)
         (old %load-path))
-- 
1.7.10.4



* Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)
  2013-04-03 10:44 [PATCH] Improve handling of Unicode byte-order marks (BOMs) Mark H Weaver
  2013-04-03 11:47 ` Mark H Weaver
@ 2013-04-03 11:58 ` Ludovic Courtès
  2013-04-03 19:28   ` Mark H Weaver
  1 sibling, 1 reply; 14+ messages in thread
From: Ludovic Courtès @ 2013-04-03 11:58 UTC (permalink / raw
  To: guile-devel

Hello, Mark!

Mark H Weaver <mhw@netris.org> skribis:

> * All kinds of streams are supported in a uniform way: files, pipes,
>   sockets, terminals, etc.
>
> * As specified in Unicode 6.2, BOMs are only handled specially at the
>   start of a stream, and only if the encoding is set to "UTF-16" or
>   "UTF-32".  BOMs are *not* handled specially if the encoding is set to
>   "UTF-16LE", etc.

OK.

> * This code never tries to read a BOM until the user has asked to read.
>   If the user writes before reading, it chooses big-endian and writes a
>   BOM if appropriate (if the encoding is set to "UTF-16" or "UTF-32").
>
> * The encodings "UTF-16" and "UTF-32" are *never* passed to iconv,
>   because BOM handling varies between iconv implementations.  Creation
>   of the iconv descriptors is always postponed until the first read or
>   write, at which point a decision is made about the endianness, and
>   then "UTF-16BE", "UTF-16LE", "UTF-32BE", or "UTF-32LE" is passed to
>   iconv.
>
> * If 'rw_random' is zero, then the input and output streams are
>   considered independent: the first read will consume a BOM if
>   appropriate, *and* the first write will produce a BOM if appropriate.
>
> * If 'rw_random' is non-zero, then the input and output streams are
>   considered linked: if the user reads first, then a BOM will be
>   consumed if appropriate, but later writes will *not* produce a BOM.
>   Similarly, if the user writes first, then later reads will *not*
>   consume a BOM.
>
> * If 'set-port-encoding!' is called in the middle of a stream, it treats
>   it as a new logical "start of stream", i.e. if the encoding is set to
>   "UTF-16" or "UTF-32" then a BOM will be consumed the next time you
>   read and/or produced the next time you write.
>
> * Seeks to the beginning of the file set the "start of stream" flags.
>   Seeks anywhere else clear the "start of stream" flags.
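
The rules quoted above amount to a small state machine over the two
"start of stream" flags.  A minimal sketch of it, assuming a hypothetical
'struct bom_state' that stands in for the two new members of
scm_port_internal plus the port's 'rw_random' field (none of these helper
names are part of Guile):

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two flags the patch adds, plus a copy
   of the port's rw_random field.  Not Guile's actual data structure.  */
struct bom_state
{
  bool at_start_for_read;
  bool at_start_for_write;
  bool rw_random;           /* true: input and output are linked */
};

/* New port, or set-port-encoding! called: both directions are at a
   logical start of stream again.  */
static void
bom_state_reset (struct bom_state *s)
{
  s->at_start_for_read = true;
  s->at_start_for_write = true;
}

/* Call on each textual read.  Returns true if this read is at stream
   start, i.e. a leading BOM may be consumed.  */
static bool
bom_state_on_read (struct bom_state *s)
{
  bool was_start = s->at_start_for_read;

  s->at_start_for_read = false;
  if (was_start && s->rw_random)
    s->at_start_for_write = false;  /* linked: later writes emit no BOM */
  return was_start;
}

/* Call on each textual write.  Returns true if this write is at stream
   start, i.e. a BOM may be produced.  */
static bool
bom_state_on_write (struct bom_state *s)
{
  bool was_start = s->at_start_for_write;

  s->at_start_for_write = false;
  if (was_start && s->rw_random)
    s->at_start_for_read = false;   /* linked: later reads eat no BOM */
  return was_start;
}

/* Call after a seek: only position 0 counts as stream start.  */
static void
bom_state_on_seek (struct bom_state *s, long new_position)
{
  s->at_start_for_read = (new_position == 0);
  s->at_start_for_write = (new_position == 0);
}
```

Whether a BOM is actually consumed or produced still depends on the port
encoding; the sketch only tracks the "are we at stream start" half of
the decision.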

Woow, well thought out.  The semantics seem good.  (It’s interesting to
see how BOMs complicate things, but that’s life, I guess.)

The patch looks good to me.  The test suite is nice.  It doesn’t seem to
cover all the corner cases listed above, but that can be added later on
perhaps?

Perhaps the text above could be added to the manual, in a
@unnumberedsec or something?

Remarks:

> diff --git a/libguile/ports-internal.h b/libguile/ports-internal.h
> index 73a788f..cd1746b 100644
> --- a/libguile/ports-internal.h
> +++ b/libguile/ports-internal.h
> @@ -48,14 +48,19 @@ struct scm_port_internal
>  {
>    scm_t_port_encoding_mode encoding_mode;
>    scm_t_iconv_descriptors *iconv_descriptors;
> +  int at_stream_start_for_bom_read;
> +  int at_stream_start_for_bom_write;

Add “:1”?

> +#define SCM_UNICODE_BOM  0xFEFF  /* Unicode byte-order mark */

0xfeffUL to be on the safe side.

> +/* If the next LEN bytes from port are equal to those in BYTES, then

s/port/PORT/

> +   return 1, else return 0.  Leave the port position unchanged.  */
> +static int
> +looking_at_bytes (SCM port, unsigned char *bytes, int len)

const unsigned char *bytes

> +{
> +  scm_t_port *pt = SCM_PTAB_ENTRY (port);
> +  int result;
> +  int i = 0;
> +
> +  while (i < len && scm_peek_byte_or_eof (port) == bytes[i])
> +    {
> +      pt->read_pos++;
> +      i++;
> +    }
> +
> +  result = (i == len);
> +
> +  while (i > 0)
> +    scm_unget_byte (bytes[--i], port);
> +
> +  return result;
> +}

Should it be scm_get_byte_or_eof given that scm_unget_byte is used later?

What if pt->read_buf_size == 1?  What if there’s data in saved_read_buf?

> +/* Decide what endianness to use for a UTF-16 port.  Return "UTF-16BE"
> +   or "UTF-16LE".  MODE must be either SCM_PORT_READ or SCM_PORT_WRITE,
> +   and specifies which operation is about to be done.  The MODE
> +   determines how we will decide the endianness.  We deliberately avoid
> +   reading from the port unless the user is about to do so.  If the user
> +   is about to read, then we look for a BOM, and if present, we use it
> +   to determine the endianness.  Otherwise we choose big-endian, as
> +   recommended by the Unicode Consortium.  */
> +static char *
> +decide_utf16_encoding (SCM port, scm_t_port_rw_active mode)

static const char *

> +static char *
> +decide_utf32_encoding (SCM port, scm_t_port_rw_active mode)

Likewise.

> +      /* If the specified encoding is UTF-16 or UTF-32, then make
> +         that more precise by deciding what endianness to use.  */
> +      if (strcmp (pt->encoding, "UTF-16") == 0)
> +        precise_encoding = decide_utf16_encoding (port, mode);
> +      else if (strcmp (pt->encoding, "UTF-32") == 0)
> +        precise_encoding = decide_utf32_encoding (port, mode);

Shouldn’t it be strcasecmp?  (Actually there are other uses of strcmp
already, but I think it’s a mistake.)

> +      if (SCM_UNLIKELY (strcmp(pt->encoding, "UTF-16") == 0
> +                        || strcmp(pt->encoding, "UTF-32") == 0))

Likewise, + space before paren.

Thanks!

Ludo’.





* Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)
  2013-04-03 11:58 ` Ludovic Courtès
@ 2013-04-03 19:28   ` Mark H Weaver
  2013-04-03 20:11     ` Ludovic Courtès
  0 siblings, 1 reply; 14+ messages in thread
From: Mark H Weaver @ 2013-04-03 19:28 UTC (permalink / raw
  To: Ludovic Courtès; +Cc: guile-devel

[-- Attachment #1: Type: text/plain, Size: 3145 bytes --]

Hi Ludovic,

Thanks for the quick review!  An improved patch is attached below.

ludo@gnu.org (Ludovic Courtès) writes:
> Woow, well thought out.  The semantics seem good.  (It’s interesting to
> see how BOMs complicate things, but that’s life, I guess.)
>
> The patch looks good to me.  The test suite is nice.  It doesn’t seem to
> cover all the corner cases listed above, but that can be added later on
> perhaps?

Yes, the tests are still a work in progress, but I've added quite a few
more since you last looked.

> Perhaps the text above could be added to the manual,

In the attached patch, I've added a new node to the "Input and Output"
section.

> Mark H Weaver <mhw@netris.org> skribis:
>> diff --git a/libguile/ports-internal.h b/libguile/ports-internal.h
>> index 73a788f..cd1746b 100644
>> --- a/libguile/ports-internal.h
>> +++ b/libguile/ports-internal.h
>> @@ -48,14 +48,19 @@ struct scm_port_internal
>>  {
>>    scm_t_port_encoding_mode encoding_mode;
>>    scm_t_iconv_descriptors *iconv_descriptors;
>> +  int at_stream_start_for_bom_read;
>> +  int at_stream_start_for_bom_write;
>
> Add “:1”?

Good idea.

[...more good suggestions that I've incorporated in the new patch...]

>> +{
>> +  scm_t_port *pt = SCM_PTAB_ENTRY (port);
>> +  int result;
>> +  int i = 0;
>> +
>> +  while (i < len && scm_peek_byte_or_eof (port) == bytes[i])
>> +    {
>> +      pt->read_pos++;
>> +      i++;
>> +    }
>> +
>> +  result = (i == len);
>> +
>> +  while (i > 0)
>> +    scm_unget_byte (bytes[--i], port);
>> +
>> +  return result;
>> +}
>
> Should it be scm_get_byte_or_eof given that scm_unget_byte is used later?

Yes.  Bytes are only consumed if they are equal to bytes[i], so an EOF will
never be consumed or passed to scm_unget_byte.

> What if pt->read_buf_size == 1?  What if there’s data in saved_read_buf?

All of those details are handled by 'scm_peek_byte_or_eof', which is
guaranteed to leave 'pt->read_pos' pointing at the byte that's returned
(if not EOF).  Therefore, it's always safe to increment that pointer by
one (but no more than one) after calling 'scm_peek_byte_or_eof' if it
returned non-EOF.

Look at the code for 'scm_peek_byte_or_eof' and this will be clear.
Also note that you did the same thing in 'scm_utf8_codepoint' :)
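
To illustrate the peek-then-advance idea outside of Guile, here is a
rough sketch over a plain in-memory buffer.  The 'struct reader' type and
the 'reader_*' and 'decide_utf16' names are made up for this example and
only mimic the shape of 'looking_at_bytes' and 'decide_utf16_encoding';
restoring a saved position stands in for the scm_unget_byte loop:

```c
#include <stddef.h>

/* Hypothetical in-memory reader, standing in for a Guile port.  */
struct reader
{
  const unsigned char *buf;
  size_t len;
  size_t pos;
};

/* Return the next byte without consuming it, or -1 at end of input.  */
static int
reader_peek (const struct reader *r)
{
  return r->pos < r->len ? r->buf[r->pos] : -1;
}

/* Return 1 if the next LEN bytes equal BYTES, else 0.  A byte is only
   stepped over after it has been peeked and found equal, so end of
   input (-1) is never consumed.  The position is restored on exit.  */
static int
reader_looking_at (struct reader *r, const unsigned char *bytes, size_t len)
{
  size_t start = r->pos;
  size_t i = 0;

  while (i < len && reader_peek (r) == bytes[i])
    {
      r->pos++;
      i++;
    }

  r->pos = start;
  return i == len;
}

/* Pick a precise UTF-16 variant.  Only look at the input if the caller
   is about to read; otherwise default to big-endian.  */
static const char *
decide_utf16 (struct reader *r, int about_to_read)
{
  static const unsigned char le_bom[2] = { 0xFF, 0xFE };

  if (about_to_read && reader_looking_at (r, le_bom, sizeof le_bom))
    return "UTF-16LE";
  return "UTF-16BE";
}
```

A big-endian BOM needs no separate check here, since big-endian is
already the fallback.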

[...more good suggestions, incorporated...]

>> +      /* If the specified encoding is UTF-16 or UTF-32, then make
>> +         that more precise by deciding what endianness to use.  */
>> +      if (strcmp (pt->encoding, "UTF-16") == 0)
>> +        precise_encoding = decide_utf16_encoding (port, mode);
>> +      else if (strcmp (pt->encoding, "UTF-32") == 0)
>> +        precise_encoding = decide_utf32_encoding (port, mode);
>
> Shouldn’t it be strcasecmp?  (Actually there are other uses of strcmp
> already, but I think it’s a mistake.)

Ouch, good catch!  Indeed, we already had some bugs because of this.  I
pushed a fix for the existing bugs to stable-2.0, and updated this patch
accordingly.
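
For reference, the kind of case-insensitive check under discussion could
look roughly like the sketch below.  It uses POSIX strcasecmp from
<strings.h>; the helper names are invented for illustration and are not
necessarily what the updated patch uses:

```c
#include <strings.h>   /* strcasecmp (POSIX) */

/* Return nonzero if ENCODING names the same encoding as NAME,
   ignoring case.  Hypothetical helper, not a Guile function.  */
static int
encoding_is (const char *encoding, const char *name)
{
  return strcasecmp (encoding, name) == 0;
}

/* Return nonzero if a BOM should be written at stream start for
   ENCODING: only the endianness-neutral names qualify, so
   "UTF-16LE" and friends are excluded.  */
static int
encoding_wants_bom_on_write (const char *encoding)
{
  return encoding_is (encoding, "UTF-16")
         || encoding_is (encoding, "UTF-32");
}
```

With a plain strcmp, a port whose encoding was given as "utf-16" would
silently skip the BOM handling.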

Here's the new patch.  Any more suggestions?

    Thanks!
      Mark



[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: [PATCH] Improve handling of Unicode byte-order marks (BOMs) --]
[-- Type: text/x-diff, Size: 28093 bytes --]

From c0d7228824dcaf7edcbc2de2cdef5c091ef2fc2f Mon Sep 17 00:00:00 2001
From: Mark H Weaver <mhw@netris.org>
Date: Wed, 3 Apr 2013 04:22:04 -0400
Subject: [PATCH] Improve handling of Unicode byte-order marks (BOMs).

* libguile/ports-internal.h (struct scm_port_internal): Add new members
  'at_stream_start_for_bom_read' and 'at_stream_start_for_bom_write'.
  (SCM_UNICODE_BOM): New macro.
  (scm_i_port_iconv_descriptors): Add 'mode' parameter to prototype.

* libguile/ports.c (scm_new_port_table_entry): Initialize
  'at_stream_start_for_bom_read' and 'at_stream_start_for_bom_write'.
  (get_iconv_codepoint): Pass new 'mode' parameter to
  'scm_i_port_iconv_descriptors'.
  (get_codepoint): After reading a codepoint at stream start, record
  that we're no longer at stream start, and consume a BOM where
  appropriate.
  (scm_seek): Set the stream start flags according to the new position.
  (looking_at_bytes): New static function.
  (scm_utf8_bom, scm_utf16be_bom, scm_utf16le_bom, scm_utf32be_bom,
  scm_utf32le_bom): New static const arrays.
  (decide_utf16_encoding, decide_utf32_encoding): New static functions.
  (scm_i_port_iconv_descriptors): Add new 'mode' parameter.  If the
  specified encoding is UTF-16 or UTF-32, make that precise by deciding
  what endianness to use, and construct iconv descriptors based on the
  precise encoding.
  (scm_i_set_port_encoding_x): Record that we are now at stream start.
  Do not open the new iconv descriptors immediately; let them be
  initialized lazily.

* libguile/print.c (display_string_using_iconv): Record that we're no
  longer at stream start.  Write a BOM if appropriate.

* doc/ref/api-io.texi (BOM Handling): New node.

* test-suite/tests/ports.test ("set-port-encoding!, wrong encoding"):
  Adapt test to cope with the fact that 'set-port-encoding!' does not
  immediately open the iconv descriptors.
  (bv-read-test): New procedure.
  ("unicode byte-order marks (BOMs)"): New test prefix.
---
 doc/ref/api-io.texi         |   67 ++++++++++
 libguile/ports-internal.h   |    7 +-
 libguile/ports.c            |  136 +++++++++++++++++---
 libguile/print.c            |   18 ++-
 test-suite/tests/ports.test |  291 ++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 500 insertions(+), 19 deletions(-)

diff --git a/doc/ref/api-io.texi b/doc/ref/api-io.texi
index 8c974be..b5c78d0 100644
--- a/doc/ref/api-io.texi
+++ b/doc/ref/api-io.texi
@@ -19,6 +19,7 @@
 * Port Types::                  Types of port and how to make them.
 * R6RS I/O Ports::              The R6RS port API.
 * I/O Extensions::              Using and extending ports in C.
+* BOM Handling::                Handling of Unicode byte order marks.
 @end menu
 
 
@@ -2373,6 +2374,72 @@ Set using
 
 @end table
 
+@node BOM Handling
+@subsection Handling of Unicode byte order marks.
+@cindex BOM
+@cindex byte order mark
+
+This section documents the finer points of Guile's handling of Unicode
+byte order marks (BOMs).  A byte order mark (U+FEFF) is typically found
+at the start of a UTF-16 or UTF-32 stream, so that the reader can
+reliably determine the byte order.  Occasionally, a BOM is found at the
+start of a UTF-8 stream, but this is much less common and not generally
+recommended.
+
+Guile attempts to handle BOMs automatically, and in accordance with the
+recommendations of the Unicode Standard, when the port encoding is set
+to @code{UTF-8}, @code{UTF-16}, or @code{UTF-32}.  In brief, Guile
+automatically writes a BOM at the start of a UTF-16 or UTF-32 stream,
+and automatically consumes one from the start of a UTF-8, UTF-16, or
+UTF-32 stream.
+
+As specified in the Unicode Standard, BOMs are only handled specially at
+the start of a stream, and only if the port encoding is set to
+@code{UTF-16} or @code{UTF-32}.  If the port encoding is set to
+@code{UTF-16BE}, @code{UTF-16LE}, @code{UTF-32BE}, or @code{UTF-32LE},
+then BOMs are @emph{not} handled specially, and none of the special
+handling described in this section applies.
+
+@itemize @bullet
+@item
+Guile looks for a BOM only if the user performs a textual read before
+writing or seeking.  If the user writes to a port before reading,
+big-endian is assumed.
+
+@item
+If @code{set-port-encoding!} is called in the middle of a stream, Guile
+treats this as a new logical ``start of stream'' for purposes of BOM
+handling.  This is intended to support multiple logical text streams
+embedded within a larger binary stream.
+
+@item
+Binary I/O operations are not guaranteed to update Guile's notion of
+whether the port is at the ``start of the stream'', nor are they
+guaranteed to produce or consume BOMs.  More generally, the handling of
+BOMs is unspecified if binary I/O is performed before textual I/O.
+
+@item
+For ports that support seeking (e.g. normal files), the input and output
+streams are considered linked: if the user reads first, then a BOM will
+be consumed (if appropriate), but later writes will @emph{not} produce a
+BOM.  Similarly, if the user writes first, then later reads will
+@emph{not} consume a BOM.
+
+@item
+For ports that do not support seeking (e.g. pipes, sockets, and
+terminals), the input and output streams are considered
+@emph{independent} for purposes of BOM handling: the first read will
+consume a BOM (if appropriate), @emph{and} the first write will produce
+a BOM (if appropriate).
+
+@item
+Seeks to the beginning of the file set the ``start of stream'' flags.
+Seeks anywhere else clear the ``start of stream'' flags.  Note that
+seeking before reading or writing, when the encoding is set to
+@code{UTF-16} or @code{UTF-32}, will generally cause big-endian to be
+used, unless your first read is at a BOM.
+@end itemize
+
 @c Local Variables:
 @c TeX-master: "guile.texi"
 @c End:
diff --git a/libguile/ports-internal.h b/libguile/ports-internal.h
index 73a788f..70e8c45 100644
--- a/libguile/ports-internal.h
+++ b/libguile/ports-internal.h
@@ -46,6 +46,8 @@ typedef struct scm_iconv_descriptors scm_t_iconv_descriptors;
 
 struct scm_port_internal
 {
+  unsigned at_stream_start_for_bom_read  : 1;
+  unsigned at_stream_start_for_bom_write : 1;
   scm_t_port_encoding_mode encoding_mode;
   scm_t_iconv_descriptors *iconv_descriptors;
   SCM alist;
@@ -53,9 +55,12 @@ struct scm_port_internal
 
 typedef struct scm_port_internal scm_t_port_internal;
 
+#define SCM_UNICODE_BOM  0xFEFFUL  /* Unicode byte-order mark */
+
 #define SCM_PORT_GET_INTERNAL(x)                                \
   ((scm_t_port_internal *) (SCM_PTAB_ENTRY(x)->input_cd))
 
-SCM_INTERNAL scm_t_iconv_descriptors *scm_i_port_iconv_descriptors (SCM port);
+SCM_INTERNAL scm_t_iconv_descriptors *
+scm_i_port_iconv_descriptors (SCM port, scm_t_port_rw_active mode);
 
 #endif
diff --git a/libguile/ports.c b/libguile/ports.c
index 61f8006..a19e869 100644
--- a/libguile/ports.c
+++ b/libguile/ports.c
@@ -639,6 +639,9 @@ scm_new_port_table_entry (scm_t_bits tag)
     pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
   pti->iconv_descriptors = NULL;
 
+  pti->at_stream_start_for_bom_read  = 1;
+  pti->at_stream_start_for_bom_write = 1;
+
   /* XXX These fields are not what they seem.  They have been
      repurposed, but cannot safely be renamed in 2.0 without breaking
      ABI compatibility.  This will be cleaned up in 2.2.  */
@@ -1306,10 +1309,12 @@ static int
 get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
 		     char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
 {
-  scm_t_iconv_descriptors *id = scm_i_port_iconv_descriptors (port);
+  scm_t_iconv_descriptors *id;
   scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE];
   size_t input_size = 0;
 
+  id = scm_i_port_iconv_descriptors (port, SCM_PORT_READ);
+
   for (;;)
     {
       int byte_read;
@@ -1393,7 +1398,24 @@ get_codepoint (SCM port, scm_t_wchar *codepoint,
     err = get_iconv_codepoint (port, codepoint, buf, len);
 
   if (SCM_LIKELY (err == 0))
-    update_port_lf (*codepoint, port);
+    {
+      if (SCM_UNLIKELY (pti->at_stream_start_for_bom_read))
+        {
+          /* Record that we're no longer at stream start. */
+          pti->at_stream_start_for_bom_read = 0;
+          if (pt->rw_random)
+            pti->at_stream_start_for_bom_write = 0;
+
+          /* If we just read a BOM in an encoding that recognizes them,
+             then silently consume it and read another code point. */
+          if (SCM_UNLIKELY (*codepoint == SCM_UNICODE_BOM
+                            && (strcasecmp (pt->encoding, "UTF-8") == 0
+                                || strcasecmp (pt->encoding, "UTF-16") == 0
+                                || strcasecmp (pt->encoding, "UTF-32") == 0)))
+            return get_codepoint (port, codepoint, buf, len);
+        }
+      update_port_lf (*codepoint, port);
+    }
   else if (pt->ilseq_handler == SCM_ICONVEH_QUESTION_MARK)
     {
       *codepoint = '?';
@@ -2006,6 +2028,7 @@ SCM_DEFINE (scm_seek, "seek", 3, 0, 0,
 
   if (SCM_OPPORTP (fd_port))
     {
+      scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (fd_port);
       scm_t_ptob_descriptor *ptob = scm_ptobs + SCM_PTOBNUM (fd_port);
       off_t_or_off64_t off = scm_to_off_t_or_off64_t (offset);
       off_t_or_off64_t rv;
@@ -2015,6 +2038,11 @@ SCM_DEFINE (scm_seek, "seek", 3, 0, 0,
                         scm_cons (fd_port, SCM_EOL));
       else
 	rv = ptob->seek (fd_port, off, how);
+
+      /* Set stream-start flags according to new position. */
+      pti->at_stream_start_for_bom_read  = (rv == 0);
+      pti->at_stream_start_for_bom_write = (rv == 0);
+
       return scm_from_off_t_or_off64_t (rv);
     }
   else /* file descriptor?.  */
@@ -2265,6 +2293,68 @@ scm_i_default_port_encoding (void)
     }
 }
 
+/* If the next LEN bytes from PORT are equal to those in BYTES, then
+   return 1, else return 0.  Leave the port position unchanged.  */
+static int
+looking_at_bytes (SCM port, const unsigned char *bytes, int len)
+{
+  scm_t_port *pt = SCM_PTAB_ENTRY (port);
+  int result;
+  int i = 0;
+
+  while (i < len && scm_peek_byte_or_eof (port) == bytes[i])
+    {
+      pt->read_pos++;
+      i++;
+    }
+
+  result = (i == len);
+
+  while (i > 0)
+    scm_unget_byte (bytes[--i], port);
+
+  return result;
+}
+
+static const unsigned char scm_utf8_bom[3]    = {0xEF, 0xBB, 0xBF};
+static const unsigned char scm_utf16be_bom[2] = {0xFE, 0xFF};
+static const unsigned char scm_utf16le_bom[2] = {0xFF, 0xFE};
+static const unsigned char scm_utf32be_bom[4] = {0x00, 0x00, 0xFE, 0xFF};
+static const unsigned char scm_utf32le_bom[4] = {0xFF, 0xFE, 0x00, 0x00};
+
+/* Decide what endianness to use for a UTF-16 port.  Return "UTF-16BE"
+   or "UTF-16LE".  MODE must be either SCM_PORT_READ or SCM_PORT_WRITE,
+   and specifies which operation is about to be done.  The MODE
+   determines how we will decide the endianness.  We deliberately avoid
+   reading from the port unless the user is about to do so.  If the user
+   is about to read, then we look for a BOM, and if present, we use it
+   to determine the endianness.  Otherwise we choose big-endian, as
+   recommended by the Unicode Consortium.  */
+static const char *
+decide_utf16_encoding (SCM port, scm_t_port_rw_active mode)
+{
+  if (mode == SCM_PORT_READ
+      && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read
+      && looking_at_bytes (port, scm_utf16le_bom, sizeof scm_utf16le_bom))
+    return "UTF-16LE";
+  else
+    return "UTF-16BE";
+}
+
+/* Decide what endianness to use for a UTF-32 port.  Return "UTF-32BE"
+   or "UTF-32LE".  See the comment above 'decide_utf16_encoding' for
+   details.  */
+static const char *
+decide_utf32_encoding (SCM port, scm_t_port_rw_active mode)
+{
+  if (mode == SCM_PORT_READ
+      && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read
+      && looking_at_bytes (port, scm_utf32le_bom, sizeof scm_utf32le_bom))
+    return "UTF-32LE";
+  else
+    return "UTF-32BE";
+}
+
 static void
 finalize_iconv_descriptors (void *ptr, void *data)
 {
@@ -2341,23 +2431,36 @@ close_iconv_descriptors (scm_t_iconv_descriptors *id)
   id->output_cd = (void *) -1;
 }
 
+/* Return the iconv_descriptors, initializing them if necessary.  MODE
+   must be either SCM_PORT_READ or SCM_PORT_WRITE, and specifies which
+   operation is about to be done.  We deliberately avoid reading from
+   the port unless the user was about to do so.  */
 scm_t_iconv_descriptors *
-scm_i_port_iconv_descriptors (SCM port)
+scm_i_port_iconv_descriptors (SCM port, scm_t_port_rw_active mode)
 {
-  scm_t_port *pt;
-  scm_t_port_internal *pti;
-
-  pt = SCM_PTAB_ENTRY (port);
-  pti = SCM_PORT_GET_INTERNAL (port);
+  scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (port);
 
   assert (pti->encoding_mode == SCM_PORT_ENCODING_MODE_ICONV);
 
   if (!pti->iconv_descriptors)
     {
+      scm_t_port *pt = SCM_PTAB_ENTRY (port);
+      const char *precise_encoding;
+
       if (!pt->encoding)
         pt->encoding = "ISO-8859-1";
+
+      /* If the specified encoding is UTF-16 or UTF-32, then make
+         that more precise by deciding what endianness to use.  */
+      if (strcasecmp (pt->encoding, "UTF-16") == 0)
+        precise_encoding = decide_utf16_encoding (port, mode);
+      else if (strcasecmp (pt->encoding, "UTF-32") == 0)
+        precise_encoding = decide_utf32_encoding (port, mode);
+      else
+        precise_encoding = pt->encoding;
+
       pti->iconv_descriptors =
-        open_iconv_descriptors (pt->encoding,
+        open_iconv_descriptors (precise_encoding,
                                 SCM_INPUT_PORT_P (port),
                                 SCM_OUTPUT_PORT_P (port));
     }
@@ -2377,6 +2480,14 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
   pti = SCM_PORT_GET_INTERNAL (port);
   prev = pti->iconv_descriptors;
 
+  /* In order to handle cases where the encoding changes mid-stream
+     (e.g. within an HTTP stream, or within a file that is composed of
+     segments with different encodings), we consider this to be "stream
+     start" for purposes of BOM handling, regardless of our actual file
+     position. */
+  pti->at_stream_start_for_bom_read  = 1;
+  pti->at_stream_start_for_bom_write = 1;
+
   if (encoding == NULL)
     encoding = "ISO-8859-1";
 
@@ -2387,19 +2498,14 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
     {
       pt->encoding = "UTF-8";
       pti->encoding_mode = SCM_PORT_ENCODING_MODE_UTF8;
-      pti->iconv_descriptors = NULL;
     }
   else
     {
-      /* Open descriptors before mutating the port. */
-      pti->iconv_descriptors =
-        open_iconv_descriptors (encoding,
-                                SCM_INPUT_PORT_P (port),
-                                SCM_OUTPUT_PORT_P (port));
       pt->encoding = scm_gc_strdup (encoding, "port");
       pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
     }
 
+  pti->iconv_descriptors = NULL;
   if (prev)
     close_iconv_descriptors (prev);
 }
diff --git a/libguile/print.c b/libguile/print.c
index 1572690..3f72810 100644
--- a/libguile/print.c
+++ b/libguile/print.c
@@ -881,8 +881,24 @@ display_string_using_iconv (const void *str, int narrow_p, size_t len,
 {
   size_t printed;
   scm_t_iconv_descriptors *id;
+  scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (port);
 
-  id = scm_i_port_iconv_descriptors (port);
+  id = scm_i_port_iconv_descriptors (port, SCM_PORT_WRITE);
+
+  if (SCM_UNLIKELY (pti->at_stream_start_for_bom_write && len > 0))
+    {
+      scm_t_port *pt = SCM_PTAB_ENTRY (port);
+
+      /* Record that we're no longer at stream start.  */
+      pti->at_stream_start_for_bom_write = 0;
+      if (pt->rw_random)
+        pti->at_stream_start_for_bom_read = 0;
+
+      /* Write a BOM if appropriate.  */
+      if (SCM_UNLIKELY (strcasecmp (pt->encoding, "UTF-16") == 0
+                        || strcasecmp (pt->encoding, "UTF-32") == 0))
+        display_character (SCM_UNICODE_BOM, port, iconveh_error);
+    }
 
   printed = 0;
 
diff --git a/test-suite/tests/ports.test b/test-suite/tests/ports.test
index 886ab24..d81e864 100644
--- a/test-suite/tests/ports.test
+++ b/test-suite/tests/ports.test
@@ -24,7 +24,8 @@
   #:use-module (ice-9 popen)
   #:use-module (ice-9 rdelim)
   #:use-module (rnrs bytevectors)
-  #:use-module ((rnrs io ports) #:select (open-bytevector-input-port)))
+  #:use-module ((rnrs io ports) #:select (open-bytevector-input-port
+                                          open-bytevector-output-port)))
 
 (define (display-line . args)
   (for-each display args)
@@ -918,7 +919,9 @@
 
   (pass-if-exception "set-port-encoding!, wrong encoding"
     exception:miscellaneous-error
-    (set-port-encoding! (open-input-string "") "does-not-exist"))
+    (let ((p (open-input-string "")))
+      (set-port-encoding! p "does-not-exist")
+      (read p)))
 
   (pass-if-exception "%default-port-encoding, wrong encoding"
     exception:miscellaneous-error
@@ -1149,6 +1152,290 @@
 
 \f
 
+(with-test-prefix "unicode byte-order marks (BOMs)"
+
+  (define (bv-read-test* encoding bv proc)
+    (let ((port (open-bytevector-input-port bv)))
+      (set-port-encoding! port encoding)
+      (proc port)))
+
+  (define (bv-read-test encoding bv)
+    (bv-read-test* encoding bv read-string))
+
+  (define (bv-write-test* encoding proc)
+    (call-with-values
+        (lambda () (open-bytevector-output-port))
+      (lambda (port get-bytevector)
+        (set-port-encoding! port encoding)
+        (proc port)
+        (get-bytevector))))
+
+  (define (bv-write-test encoding str)
+    (bv-write-test* encoding
+                    (lambda (p)
+                      (display str p))))
+
+  (pass-if-equal "BOM not discarded from Latin-1 stream"
+      "\xEF\xBB\xBF\x61"
+    (bv-read-test "ISO-8859-1" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from Latin-2 stream"
+      "\u010F\u0165\u017C\x61"
+    (bv-read-test "ISO-8859-2" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16BE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-16BE" #vu8(#xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16LE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-16LE" #vu8(#xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM not discarded from UTF-32BE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-32BE" #vu8(#x00 #x00 #xFE #xFF
+                                       #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-32LE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-32LE" #vu8(#xFF #xFE #x00 #x00
+                                       #x61 #x00 #x00 #x00)))
+
+  (pass-if-equal "BOM not written to UTF-8 stream"
+      #vu8(#x61)
+    (bv-write-test "UTF-8" "a"))
+
+  (pass-if-equal "BOM not written to UTF-16BE stream"
+      #vu8(#x00 #x61)
+    (bv-write-test "UTF-16BE" "a"))
+
+  (pass-if-equal "BOM not written to UTF-16LE stream"
+      #vu8(#x61 #x00)
+    (bv-write-test "UTF-16LE" "a"))
+
+  (pass-if-equal "BOM not written to UTF-32BE stream"
+      #vu8(#x00 #x00 #x00 #x61)
+    (bv-write-test "UTF-32BE" "a"))
+
+  (pass-if-equal "BOM not written to UTF-32LE stream"
+      #vu8(#x61 #x00 #x00 #x00)
+    (bv-write-test "UTF-32LE" "a"))
+
+  (pass-if "Don't read from the port unless user asks to"
+    (let* ((p (make-soft-port
+               (vector
+                (lambda (c) #f)           ; write char
+                (lambda (s) #f)           ; write string
+                (lambda () #f)            ; flush
+                (lambda () (throw 'fail)) ; read char
+                (lambda () #f))
+               "rw")))
+      (set-port-encoding! p "UTF-16")
+      (display "abc" p)
+      (set-port-encoding! p "UTF-32")
+      (display "def" p)
+      #t))
+
+  ;; TODO: test that input and output streams are independent when
+  ;; appropriate, and linked when appropriate.
+
+  (pass-if-equal "BOM discarded from start of UTF-8 stream"
+      "a"
+    (bv-read-test "Utf-8" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM discarded from start of UTF-8 stream after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "uTf-8" #vu8(#xEF #xBB #xBF #x61)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-8 stream"
+      "\uFEFFa"
+    (bv-read-test "UTF-8" #vu8(#xEF #xBB #xBF #xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-8 stream after seek to > 0"
+      "\uFEFFb"
+    (bv-read-test* "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62)
+                   (lambda (p)
+                     (seek p 1 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-8 stream"
+      "a\uFEFFb"
+    (bv-read-test "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62)))
+
+  (pass-if-equal "BOM (BE) written to start of UTF-16 stream"
+      #vu8(#xFE #xFF #x00 #x61 #x00 #x62)
+    (bv-write-test "UTF-16" "ab"))
+
+  (pass-if-equal "BOM (BE) written to UTF-16 stream after set-port-encoding!"
+      #vu8(#xFE #xFF #x00 #x61 #x00 #x62 #xFE #xFF #x00 #x63 #x00 #x64)
+    (bv-write-test* "UTF-16"
+                    (lambda (p)
+                      (display "ab" p)
+                      (set-port-encoding! p "UTF-16")
+                      (display "cd" p))))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (BE)"
+      "a"
+    (bv-read-test "UTF-16" #vu8(#xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (BE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "utf-16" #vu8(#xFE #xFF #x00 #x61)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (BE)"
+      "\uFEFFa"
+    (bv-read-test "Utf-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16 stream (BE) after seek to > 0"
+      "\uFEFFa"
+    (bv-read-test* "uTf-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61)
+                   (lambda (p)
+                     (seek p 2 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-16 stream"
+      "a\uFEFFb"
+    (let ((be (bv-read-test "utf-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62)))
+          (le (bv-read-test "utf-16" #vu8(#x61 #x00 #xFF #xFE #x62 #x00))))
+      (if (char=? #\a (string-ref be 0))
+          be
+          le)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (LE)"
+      "a"
+    (bv-read-test "UTF-16" #vu8(#xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (LE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "Utf-16" #vu8(#xFF #xFE #x61 #x00)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (LE)"
+      "\uFEFFa"
+    (bv-read-test "UTf-16" #vu8(#xFF #xFE #xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM not discarded from UTF-16 stream (LE) after seek to > 0"
+      "\uFEFFa"
+    (bv-read-test* "utF-16" #vu8(#xFF #xFE #xFF #xFE #x61 #x00)
+                   (lambda (p)
+                     (seek p 2 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (BE)"
+      "a"
+    (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+                                     #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (BE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "utF-32" #vu8(#x00 #x00 #xFE #xFF
+                                      #x00 #x00 #x00 #x61)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (BE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+                                     #x00 #x00 #xFE #xFF
+                                     #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-32 stream (BE) after seek to > 0"
+      "\uFEFFa"
+    (bv-read-test* "UtF-32" #vu8(#x00 #x00 #xFE #xFF
+                                      #x00 #x00 #xFE #xFF
+                                      #x00 #x00 #x00 #x61)
+                   (lambda (p)
+                     (seek p 4 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM discarded within UTF-16 stream (BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-16")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM discarded within UTF-16 stream (LE,BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "utf-16" #vu8(#x00 #x61 #xFF #xFE #x62 #x00)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-16")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM discarded within UTF-32 stream (BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-32" #vu8(#x00 #x00 #x00 #x61
+                                      #x00 #x00 #xFE #xFF
+                                      #x00 #x00 #x00 #x62)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-32")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM discarded within UTF-32 stream (LE,BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-32" #vu8(#x00 #x00 #x00 #x61
+                                      #xFF #xFE #x00 #x00
+                                      #x62 #x00 #x00 #x00)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-32")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-32 stream"
+      "a\uFEFFb"
+    (let ((be (bv-read-test "UTF-32" #vu8(#x00 #x00 #x00 #x61
+                                               #x00 #x00 #xFE #xFF
+                                               #x00 #x00 #x00 #x62)))
+          (le (bv-read-test "UTF-32" #vu8(#x61 #x00 #x00 #x00
+                                               #xFF #xFE #x00 #x00
+                                               #x62 #x00 #x00 #x00))))
+      (if (char=? #\a (string-ref be 0))
+          be
+          le)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (LE)"
+      "a"
+    (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00
+                                     #x61 #x00 #x00 #x00)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (LE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "UTf-32" #vu8(#xFF #xFE #x00 #x00
+                                      #x61 #x00 #x00 #x00)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (LE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00
+                                     #xFF #xFE #x00 #x00
+                                     #x61 #x00 #x00 #x00))))
+
+\f
+
 (define-syntax-rule (with-load-path path body ...)
   (let ((new path)
         (old %load-path))
-- 
1.7.10.4
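
For readers following the flag logic in the patch above, here is a
minimal standalone sketch of how the two "start of stream" flags
interact for linked (seekable) and independent (non-seekable) streams.
It is not Guile code: 'struct bom_state' and the 'note_*' helpers are
hypothetical names invented for illustration, and the caller is assumed
to check the port encoding separately before consuming or writing a BOM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two flags the patch adds to
   'struct scm_port_internal', plus the port's 'rw_random' field.  */
struct bom_state
{
  bool at_start_for_read;
  bool at_start_for_write;
  bool rw_random;               /* true for seekable (linked) ports */
};

/* A new port starts with both flags set.  */
static void
bom_state_init (struct bom_state *s, bool rw_random)
{
  assert (s != NULL);
  s->at_start_for_read = true;
  s->at_start_for_write = true;
  s->rw_random = rw_random;
}

/* Record a textual read.  Return true if this read happened at stream
   start, i.e. a leading BOM may be consumed (encoding permitting).  */
static bool
note_read (struct bom_state *s)
{
  bool was_start = s->at_start_for_read;

  s->at_start_for_read = false;
  if (was_start && s->rw_random)
    s->at_start_for_write = false;      /* linked: no BOM on later writes */
  return was_start;
}

/* Record a textual write.  Return true if this write happened at stream
   start, i.e. a BOM may be written first (encoding permitting).  */
static bool
note_write (struct bom_state *s)
{
  bool was_start = s->at_start_for_write;

  s->at_start_for_write = false;
  if (was_start && s->rw_random)
    s->at_start_for_read = false;       /* linked: no BOM on later reads */
  return was_start;
}

/* A seek sets both flags if the new position is 0, else clears them.  */
static void
note_seek (struct bom_state *s, long new_pos)
{
  s->at_start_for_read = (new_pos == 0);
  s->at_start_for_write = (new_pos == 0);
}

/* Changing the encoding starts a new logical stream.  */
static void
note_set_encoding (struct bom_state *s)
{
  s->at_start_for_read = true;
  s->at_start_for_write = true;
}
```

With 'rw_random' true, a read followed by a write reports stream start
only for the read; with 'rw_random' false, both the first read and the
first write report it.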



* Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)
  2013-04-03 19:28   ` Mark H Weaver
@ 2013-04-03 20:11     ` Ludovic Courtès
  2013-04-03 20:33       ` Mark H Weaver
  0 siblings, 1 reply; 14+ messages in thread
From: Ludovic Courtès @ 2013-04-03 20:11 UTC
  To: Mark H Weaver; +Cc: guile-devel

Mark H Weaver <mhw@netris.org> skribis:

> ludo@gnu.org (Ludovic Courtès) writes:
>> Woow, well thought out.  The semantics seem good.  (It’s interesting to
>> see how BOMs complicate things, but that’s life, I guess.)
>>
>> The patch looks good to me.  The test suite is nice.  It doesn’t seem to
>> cover all the corner cases listed above, but that can be added later on
>> perhaps?
>
> Yes, the tests are still a work-in-progress, but I've added quite a few
> more since you last looked.

Nice.

>> Perhaps the text above could be added to the manual,
>
> In the attached patch, I've added a new node to the "Input and Output"
> section.

Perfect.

>>> +{
>>> +  scm_t_port *pt = SCM_PTAB_ENTRY (port);
>>> +  int result;
>>> +  int i = 0;
>>> +
>>> +  while (i < len && scm_peek_byte_or_eof (port) == bytes[i])
>>> +    {
>>> +      pt->read_pos++;
>>> +      i++;
>>> +    }
>>> +
>>> +  result = (i == len);
>>> +
>>> +  while (i > 0)
>>> +    scm_unget_byte (bytes[--i], port);
>>> +
>>> +  return result;
>>> +}
>>
>> Should it be scm_get_byte_or_eof given that scm_unget_byte is used later?
>
> Yes.  Bytes are only consumed if they are equal to bytes[i], so an EOF will
> never be consumed or passed to scm_unget_byte.
>
>> What if pt->read_buf_size == 1?  What if there’s data in saved_read_buf?
>
> All of those details are handled by 'scm_peek_byte_or_eof', which is
> guaranteed to leave 'pt->read_pos' pointing at the byte that's returned
> (if not EOF).  Therefore, it's always safe to increment that pointer by
> one (but no more than one) after calling 'scm_peek_byte_or_eof' if it
> returned non-EOF.
>
> Look at the code for 'scm_peek_byte_or_eof' and this will be clear.
> Also note that you did the same thing in 'scm_utf8_codepoint' :)

Ah yes, indeed.
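
The peek-then-advance pattern discussed above can be tried outside of
Guile with a small sketch like the following.  It is only an
illustration under stated assumptions: 'struct byte_stream' is a
hypothetical in-memory stream rather than Guile's 'scm_t_port', the
position is restored by resetting an index instead of calling
'scm_unget_byte', and the 'decide_*_for_read' helpers are simplified
stand-ins for the patch's 'decide_utf16_encoding' and
'decide_utf32_encoding' in the read case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>   /* strcmp, for callers comparing the returned names */

/* Hypothetical in-memory input stream; not Guile's scm_t_port.  */
struct byte_stream
{
  const unsigned char *buf;
  size_t len;
  size_t pos;
};

/* Return the next byte without consuming it, or -1 at end of input.  */
static int
peek_byte_or_eof (const struct byte_stream *s)
{
  return s->pos < s->len ? s->buf[s->pos] : -1;
}

/* Return 1 if the next LEN bytes of S equal BYTES, else 0.  A byte is
   consumed only after it matched, so EOF (-1) is never consumed.  The
   stream position is restored before returning.  */
static int
looking_at_bytes (struct byte_stream *s, const unsigned char *bytes,
                  size_t len)
{
  size_t start;
  size_t i = 0;

  assert (s != NULL);
  start = s->pos;

  while (i < len && peek_byte_or_eof (s) == bytes[i])
    {
      s->pos++;
      i++;
    }

  s->pos = start;
  return i == len;
}

static const unsigned char utf16le_bom[2] = {0xFF, 0xFE};
static const unsigned char utf32le_bom[4] = {0xFF, 0xFE, 0x00, 0x00};

/* Choose little-endian only if a little-endian BOM is present at stream
   start; otherwise default to big-endian.  */
static const char *
decide_utf16_for_read (struct byte_stream *s, int at_stream_start)
{
  if (at_stream_start
      && looking_at_bytes (s, utf16le_bom, sizeof utf16le_bom))
    return "UTF-16LE";
  return "UTF-16BE";
}

static const char *
decide_utf32_for_read (struct byte_stream *s, int at_stream_start)
{
  if (at_stream_start
      && looking_at_bytes (s, utf32le_bom, sizeof utf32le_bom))
    return "UTF-32LE";
  return "UTF-32BE";
}
```

Because the comparison stops at the first mismatch, a short read or an
EOF in the middle of a candidate BOM simply yields 0 and leaves the
stream where it started.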

[...]

> Here's the new patch.  Any more suggestions?

Not from me!  OK to commit as far as I’m concerned.

Thank you!

Ludo’.




* Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)
  2013-04-03 20:11     ` Ludovic Courtès
@ 2013-04-03 20:33       ` Mark H Weaver
  2013-04-03 20:48         ` Mike Gran
  2013-04-04 20:50         ` Andy Wingo
  0 siblings, 2 replies; 14+ messages in thread
From: Mark H Weaver @ 2013-04-03 20:33 UTC
  To: Ludovic Courtès; +Cc: Andy Wingo, guile-devel

[-- Attachment #1: Type: text/plain, Size: 333 bytes --]

ludo@gnu.org (Ludovic Courtès) writes:

> Mark H Weaver <mhw@netris.org> skribis:
>
>> Here's the new patch.  Any more suggestions?
>
> Not from me!  OK to commit as far as I’m concerned.

Great!  I'd still like to hear what Andy thinks.
I've attached a new version with some more tweaks.

    Thanks,
      Mark



[-- Attachment #2: [PATCH] Improve handling of Unicode byte-order marks (BOMs) --]
[-- Type: text/x-diff, Size: 27990 bytes --]

From f849f9a3f6babd87088d39369442a7f429762cec Mon Sep 17 00:00:00 2001
From: Mark H Weaver <mhw@netris.org>
Date: Wed, 3 Apr 2013 04:22:04 -0400
Subject: [PATCH] Improve handling of Unicode byte-order marks (BOMs).

* libguile/ports-internal.h (struct scm_port_internal): Add new members
  'at_stream_start_for_bom_read' and 'at_stream_start_for_bom_write'.
  (SCM_UNICODE_BOM): New macro.
  (scm_i_port_iconv_descriptors): Add 'mode' parameter to prototype.

* libguile/ports.c (scm_new_port_table_entry): Initialize
  'at_stream_start_for_bom_read' and 'at_stream_start_for_bom_write'.
  (get_iconv_codepoint): Pass new 'mode' parameter to
  'scm_i_port_iconv_descriptors'.
  (get_codepoint): After reading a codepoint at stream start, record
  that we're no longer at stream start, and consume a BOM where
  appropriate.
  (scm_seek): Set the stream start flags according to the new position.
  (looking_at_bytes): New static function.
  (scm_utf8_bom, scm_utf16be_bom, scm_utf16le_bom, scm_utf32be_bom,
  scm_utf32le_bom): New static const arrays.
  (decide_utf16_encoding, decide_utf32_encoding): New static functions.
  (scm_i_port_iconv_descriptors): Add new 'mode' parameter.  If the
  specified encoding is UTF-16 or UTF-32, make that precise by deciding
  what endianness to use, and construct iconv descriptors based on the
  precise encoding.
  (scm_i_set_port_encoding_x): Record that we are now at stream start.
  Do not open the new iconv descriptors immediately; let them be
  initialized lazily.

* libguile/print.c (display_string_using_iconv): Record that we're no
  longer at stream start.  Write a BOM if appropriate.

* doc/ref/api-io.texi (BOM Handling): New node.

* test-suite/tests/ports.test ("set-port-encoding!, wrong encoding"):
  Adapt test to cope with the fact that 'set-port-encoding!' does not
  immediately open the iconv descriptors.
  (bv-read-test): New procedure.
  ("unicode byte-order marks (BOMs)"): New test prefix.
---
 doc/ref/api-io.texi         |   65 ++++++++++
 libguile/ports-internal.h   |    7 +-
 libguile/ports.c            |  146 ++++++++++++++++++----
 libguile/print.c            |   18 ++-
 test-suite/tests/ports.test |  284 ++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 494 insertions(+), 26 deletions(-)

diff --git a/doc/ref/api-io.texi b/doc/ref/api-io.texi
index 8c974be..57afa37 100644
--- a/doc/ref/api-io.texi
+++ b/doc/ref/api-io.texi
@@ -19,6 +19,7 @@
 * Port Types::                  Types of port and how to make them.
 * R6RS I/O Ports::              The R6RS port API.
 * I/O Extensions::              Using and extending ports in C.
+* BOM Handling::                Handling of Unicode byte order marks.
 @end menu
 
 
@@ -2373,6 +2374,70 @@ Set using
 
 @end table
 
+@node BOM Handling
+@subsection Handling of Unicode byte order marks.
+@cindex BOM
+@cindex byte order mark
+
+This section documents the finer points of Guile's handling of Unicode
+byte order marks (BOMs).  A byte order mark (U+FEFF) is typically found
+at the start of a UTF-16 or UTF-32 stream, so that the reader can
+reliably determine the byte order.  Occasionally, a BOM is found at the
+start of a UTF-8 stream, but this is much less common and not generally
+recommended.
+
+Guile attempts to handle BOMs automatically, and in accordance with the
+recommendations of the Unicode Standard, when the port encoding is set
+to @code{UTF-8}, @code{UTF-16}, or @code{UTF-32}.  In brief, Guile
+automatically writes a BOM at the start of a UTF-16 or UTF-32 stream,
+and automatically consumes one from the start of a UTF-8, UTF-16, or
+UTF-32 stream.
+
+As specified in the Unicode Standard, a BOM is only handled specially at
+the start of a stream, and only if the port encoding is set to
+@code{UTF-8}, @code{UTF-16}, or @code{UTF-32}.  If the port encoding is
+set to @code{UTF-16BE}, @code{UTF-16LE}, @code{UTF-32BE}, or
+@code{UTF-32LE}, then BOMs are @emph{not} handled specially, and none of
+the special handling described in this section applies.
+
+@itemize @bullet
+@item
+To ensure that Guile will properly detect the byte order of a
+@code{UTF-16} or @code{UTF-32} stream, you must perform a textual read
+before writing, seeking, or binary I/O.  Guile will not attempt to read
+a BOM until a read is explicitly requested at the start of the stream.
+
+@item
+If @code{set-port-encoding!} is called in the middle of a stream, Guile
+treats this as a new logical ``start of stream'' for purposes of BOM
+handling.  This is intended to support multiple logical text streams
+embedded within a larger binary stream.
+
+@item
+Binary I/O operations are not guaranteed to update Guile's notion of
+whether the port is at the ``start of the stream'', nor are they
+guaranteed to produce or consume BOMs.  More generally, the handling of
+BOMs is unspecified if binary I/O is performed before textual I/O.
+
+@item
+For ports that support seeking (e.g. normal files), the input and output
+streams are considered linked: if the user reads first, then a BOM will
+be consumed (if appropriate), but later writes will @emph{not} produce a
+BOM.  Similarly, if the user writes first, then later reads will
+@emph{not} consume a BOM.
+
+@item
+For ports that do not support seeking (e.g. pipes, sockets, and
+terminals), the input and output streams are considered
+@emph{independent} for purposes of BOM handling: the first read will
+consume a BOM (if appropriate), @emph{and} the first write will produce
+a BOM (if appropriate).
+
+@item
+Seeks to the beginning of the file set the ``start of stream'' flags.
+Seeks anywhere else clear the ``start of stream'' flags.
+@end itemize
+
 @c Local Variables:
 @c TeX-master: "guile.texi"
 @c End:
diff --git a/libguile/ports-internal.h b/libguile/ports-internal.h
index 73a788f..70e8c45 100644
--- a/libguile/ports-internal.h
+++ b/libguile/ports-internal.h
@@ -46,6 +46,8 @@ typedef struct scm_iconv_descriptors scm_t_iconv_descriptors;
 
 struct scm_port_internal
 {
+  unsigned at_stream_start_for_bom_read  : 1;
+  unsigned at_stream_start_for_bom_write : 1;
   scm_t_port_encoding_mode encoding_mode;
   scm_t_iconv_descriptors *iconv_descriptors;
   SCM alist;
@@ -53,9 +55,12 @@ struct scm_port_internal
 
 typedef struct scm_port_internal scm_t_port_internal;
 
+#define SCM_UNICODE_BOM  0xFEFFUL  /* Unicode byte-order mark */
+
 #define SCM_PORT_GET_INTERNAL(x)                                \
   ((scm_t_port_internal *) (SCM_PTAB_ENTRY(x)->input_cd))
 
-SCM_INTERNAL scm_t_iconv_descriptors *scm_i_port_iconv_descriptors (SCM port);
+SCM_INTERNAL scm_t_iconv_descriptors *
+scm_i_port_iconv_descriptors (SCM port, scm_t_port_rw_active mode);
 
 #endif
diff --git a/libguile/ports.c b/libguile/ports.c
index eaa2047..4f042cc 100644
--- a/libguile/ports.c
+++ b/libguile/ports.c
@@ -639,6 +639,9 @@ scm_new_port_table_entry (scm_t_bits tag)
     pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
   pti->iconv_descriptors = NULL;
 
+  pti->at_stream_start_for_bom_read  = 1;
+  pti->at_stream_start_for_bom_write = 1;
+
   /* XXX These fields are not what they seem.  They have been
      repurposed, but cannot safely be renamed in 2.0 without breaking
      ABI compatibility.  This will be cleaned up in 2.2.  */
@@ -1306,10 +1309,12 @@ static int
 get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
 		     char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
 {
-  scm_t_iconv_descriptors *id = scm_i_port_iconv_descriptors (port);
+  scm_t_iconv_descriptors *id;
   scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE];
   size_t input_size = 0;
 
+  id = scm_i_port_iconv_descriptors (port, SCM_PORT_READ);
+
   for (;;)
     {
       int byte_read;
@@ -1393,7 +1398,24 @@ get_codepoint (SCM port, scm_t_wchar *codepoint,
     err = get_iconv_codepoint (port, codepoint, buf, len);
 
   if (SCM_LIKELY (err == 0))
-    update_port_lf (*codepoint, port);
+    {
+      if (SCM_UNLIKELY (pti->at_stream_start_for_bom_read))
+        {
+          /* Record that we're no longer at stream start. */
+          pti->at_stream_start_for_bom_read = 0;
+          if (pt->rw_random)
+            pti->at_stream_start_for_bom_write = 0;
+
+          /* If we just read a BOM in an encoding that recognizes them,
+             then silently consume it and read another code point. */
+          if (SCM_UNLIKELY (*codepoint == SCM_UNICODE_BOM
+                            && (strcasecmp (pt->encoding, "UTF-8") == 0
+                                || strcasecmp (pt->encoding, "UTF-16") == 0
+                                || strcasecmp (pt->encoding, "UTF-32") == 0)))
+            return get_codepoint (port, codepoint, buf, len);
+        }
+      update_port_lf (*codepoint, port);
+    }
   else if (pt->ilseq_handler == SCM_ICONVEH_QUESTION_MARK)
     {
       *codepoint = '?';
@@ -2006,6 +2028,7 @@ SCM_DEFINE (scm_seek, "seek", 3, 0, 0,
 
   if (SCM_OPPORTP (fd_port))
     {
+      scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (fd_port);
       scm_t_ptob_descriptor *ptob = scm_ptobs + SCM_PTOBNUM (fd_port);
       off_t_or_off64_t off = scm_to_off_t_or_off64_t (offset);
       off_t_or_off64_t rv;
@@ -2015,6 +2038,11 @@ SCM_DEFINE (scm_seek, "seek", 3, 0, 0,
                         scm_cons (fd_port, SCM_EOL));
       else
 	rv = ptob->seek (fd_port, off, how);
+
+      /* Set stream-start flags according to new position. */
+      pti->at_stream_start_for_bom_read  = (rv == 0);
+      pti->at_stream_start_for_bom_write = (rv == 0);
+
       return scm_from_off_t_or_off64_t (rv);
     }
   else /* file descriptor?.  */
@@ -2265,6 +2293,68 @@ scm_i_default_port_encoding (void)
     }
 }
 
+/* If the next LEN bytes from PORT are equal to those in BYTES, then
+   return 1, else return 0.  Leave the port position unchanged.  */
+static int
+looking_at_bytes (SCM port, const unsigned char *bytes, int len)
+{
+  scm_t_port *pt = SCM_PTAB_ENTRY (port);
+  int result;
+  int i = 0;
+
+  while (i < len && scm_peek_byte_or_eof (port) == bytes[i])
+    {
+      pt->read_pos++;
+      i++;
+    }
+
+  result = (i == len);
+
+  while (i > 0)
+    scm_unget_byte (bytes[--i], port);
+
+  return result;
+}
+
+static const unsigned char scm_utf8_bom[3]    = {0xEF, 0xBB, 0xBF};
+static const unsigned char scm_utf16be_bom[2] = {0xFE, 0xFF};
+static const unsigned char scm_utf16le_bom[2] = {0xFF, 0xFE};
+static const unsigned char scm_utf32be_bom[4] = {0x00, 0x00, 0xFE, 0xFF};
+static const unsigned char scm_utf32le_bom[4] = {0xFF, 0xFE, 0x00, 0x00};
+
+/* Decide what endianness to use for a UTF-16 port.  Return "UTF-16BE"
+   or "UTF-16LE".  MODE must be either SCM_PORT_READ or SCM_PORT_WRITE,
+   and specifies which operation is about to be done.  The MODE
+   determines how we will decide the endianness.  We deliberately avoid
+   reading from the port unless the user is about to do so.  If the user
+   is about to read, then we look for a BOM, and if present, we use it
+   to determine the endianness.  Otherwise we choose big-endian, as
+   recommended by the Unicode Consortium.  */
+static const char *
+decide_utf16_encoding (SCM port, scm_t_port_rw_active mode)
+{
+  if (mode == SCM_PORT_READ
+      && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read
+      && looking_at_bytes (port, scm_utf16le_bom, sizeof scm_utf16le_bom))
+    return "UTF-16LE";
+  else
+    return "UTF-16BE";
+}
+
+/* Decide what endianness to use for a UTF-32 port.  Return "UTF-32BE"
+   or "UTF-32LE".  See the comment above 'decide_utf16_encoding' for
+   details.  */
+static const char *
+decide_utf32_encoding (SCM port, scm_t_port_rw_active mode)
+{
+  if (mode == SCM_PORT_READ
+      && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read
+      && looking_at_bytes (port, scm_utf32le_bom, sizeof scm_utf32le_bom))
+    return "UTF-32LE";
+  else
+    return "UTF-32BE";
+}
+
 static void
 finalize_iconv_descriptors (void *ptr, void *data)
 {
@@ -2341,23 +2431,36 @@ close_iconv_descriptors (scm_t_iconv_descriptors *id)
   id->output_cd = (void *) -1;
 }
 
+/* Return the iconv_descriptors, initializing them if necessary.  MODE
+   must be either SCM_PORT_READ or SCM_PORT_WRITE, and specifies which
+   operation is about to be done.  We deliberately avoid reading from
+   the port unless the user was about to do so.  */
 scm_t_iconv_descriptors *
-scm_i_port_iconv_descriptors (SCM port)
+scm_i_port_iconv_descriptors (SCM port, scm_t_port_rw_active mode)
 {
-  scm_t_port *pt;
-  scm_t_port_internal *pti;
-
-  pt = SCM_PTAB_ENTRY (port);
-  pti = SCM_PORT_GET_INTERNAL (port);
+  scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (port);
 
   assert (pti->encoding_mode == SCM_PORT_ENCODING_MODE_ICONV);
 
   if (!pti->iconv_descriptors)
     {
+      scm_t_port *pt = SCM_PTAB_ENTRY (port);
+      const char *precise_encoding;
+
       if (!pt->encoding)
         pt->encoding = "ISO-8859-1";
+
+      /* If the specified encoding is UTF-16 or UTF-32, then make
+         that more precise by deciding what endianness to use.  */
+      if (strcasecmp (pt->encoding, "UTF-16") == 0)
+        precise_encoding = decide_utf16_encoding (port, mode);
+      else if (strcasecmp (pt->encoding, "UTF-32") == 0)
+        precise_encoding = decide_utf32_encoding (port, mode);
+      else
+        precise_encoding = pt->encoding;
+
       pti->iconv_descriptors =
-        open_iconv_descriptors (pt->encoding,
+        open_iconv_descriptors (precise_encoding,
                                 SCM_INPUT_PORT_P (port),
                                 SCM_OUTPUT_PORT_P (port));
     }
@@ -2377,28 +2480,27 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
   pti = SCM_PORT_GET_INTERNAL (port);
   prev = pti->iconv_descriptors;
 
+  /* In order to handle cases where the encoding changes mid-stream
+     (e.g. within an HTTP stream, or within a file that is composed of
+     segments with different encodings), we consider this to be "stream
+     start" for purposes of BOM handling, regardless of our actual file
+     position. */
+  pti->at_stream_start_for_bom_read  = 1;
+  pti->at_stream_start_for_bom_write = 1;
+
   if (encoding == NULL)
     encoding = "ISO-8859-1";
 
   /* If ENCODING is UTF-8, then no conversion descriptor is opened
      because we do I/O ourselves.  This saves 100+ KiB for each
      descriptor.  */
+  pt->encoding = scm_gc_strdup (encoding, "port");
   if (strcasecmp (encoding, "UTF-8") == 0)
-    {
-      pti->encoding_mode = SCM_PORT_ENCODING_MODE_UTF8;
-      pti->iconv_descriptors = NULL;
-    }
+    pti->encoding_mode = SCM_PORT_ENCODING_MODE_UTF8;
   else
-    {
-      /* Open descriptors before mutating the port. */
-      pti->iconv_descriptors =
-        open_iconv_descriptors (encoding,
-                                SCM_INPUT_PORT_P (port),
-                                SCM_OUTPUT_PORT_P (port));
-      pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
-    }
-  pt->encoding = scm_gc_strdup (encoding, "port");
+    pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
 
+  pti->iconv_descriptors = NULL;
   if (prev)
     close_iconv_descriptors (prev);
 }
diff --git a/libguile/print.c b/libguile/print.c
index 1572690..3f72810 100644
--- a/libguile/print.c
+++ b/libguile/print.c
@@ -881,8 +881,24 @@ display_string_using_iconv (const void *str, int narrow_p, size_t len,
 {
   size_t printed;
   scm_t_iconv_descriptors *id;
+  scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (port);
 
-  id = scm_i_port_iconv_descriptors (port);
+  id = scm_i_port_iconv_descriptors (port, SCM_PORT_WRITE);
+
+  if (SCM_UNLIKELY (pti->at_stream_start_for_bom_write && len > 0))
+    {
+      scm_t_port *pt = SCM_PTAB_ENTRY (port);
+
+      /* Record that we're no longer at stream start.  */
+      pti->at_stream_start_for_bom_write = 0;
+      if (pt->rw_random)
+        pti->at_stream_start_for_bom_read = 0;
+
+      /* Write a BOM if appropriate.  */
+      if (SCM_UNLIKELY (strcasecmp(pt->encoding, "UTF-16") == 0
+                        || strcasecmp(pt->encoding, "UTF-32") == 0))
+        display_character (SCM_UNICODE_BOM, port, iconveh_error);
+    }
 
   printed = 0;
 
diff --git a/test-suite/tests/ports.test b/test-suite/tests/ports.test
index 886ab24..cb2b698 100644
--- a/test-suite/tests/ports.test
+++ b/test-suite/tests/ports.test
@@ -24,7 +24,8 @@
   #:use-module (ice-9 popen)
   #:use-module (ice-9 rdelim)
   #:use-module (rnrs bytevectors)
-  #:use-module ((rnrs io ports) #:select (open-bytevector-input-port)))
+  #:use-module ((rnrs io ports) #:select (open-bytevector-input-port
+                                          open-bytevector-output-port)))
 
 (define (display-line . args)
   (for-each display args)
@@ -918,7 +919,9 @@
 
   (pass-if-exception "set-port-encoding!, wrong encoding"
     exception:miscellaneous-error
-    (set-port-encoding! (open-input-string "") "does-not-exist"))
+    (let ((p (open-input-string "")))
+      (set-port-encoding! p "does-not-exist")
+      (read p)))
 
   (pass-if-exception "%default-port-encoding, wrong encoding"
     exception:miscellaneous-error
@@ -1149,6 +1152,283 @@
 
 \f
 
+(with-test-prefix "unicode byte-order marks (BOMs)"
+
+  (define (bv-read-test* encoding bv proc)
+    (let ((port (open-bytevector-input-port bv)))
+      (set-port-encoding! port encoding)
+      (proc port)))
+
+  (define (bv-read-test encoding bv)
+    (bv-read-test* encoding bv read-string))
+
+  (define (bv-write-test* encoding proc)
+    (call-with-values
+        (lambda () (open-bytevector-output-port))
+      (lambda (port get-bytevector)
+        (set-port-encoding! port encoding)
+        (proc port)
+        (get-bytevector))))
+
+  (define (bv-write-test encoding str)
+    (bv-write-test* encoding
+                    (lambda (p)
+                      (display str p))))
+
+  (pass-if-equal "BOM not discarded from Latin-1 stream"
+      "\xEF\xBB\xBF\x61"
+    (bv-read-test "ISO-8859-1" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from Latin-2 stream"
+      "\u010F\u0165\u017C\x61"
+    (bv-read-test "ISO-8859-2" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16BE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-16BE" #vu8(#xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16LE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-16LE" #vu8(#xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM not discarded from UTF-32BE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-32BE" #vu8(#x00 #x00 #xFE #xFF
+                                       #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-32LE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-32LE" #vu8(#xFF #xFE #x00 #x00
+                                       #x61 #x00 #x00 #x00)))
+
+  (pass-if-equal "BOM not written to UTF-8 stream"
+      #vu8(#x61)
+    (bv-write-test "UTF-8" "a"))
+
+  (pass-if-equal "BOM not written to UTF-16BE stream"
+      #vu8(#x00 #x61)
+    (bv-write-test "UTF-16BE" "a"))
+
+  (pass-if-equal "BOM not written to UTF-16LE stream"
+      #vu8(#x61 #x00)
+    (bv-write-test "UTF-16LE" "a"))
+
+  (pass-if-equal "BOM not written to UTF-32BE stream"
+      #vu8(#x00 #x00 #x00 #x61)
+    (bv-write-test "UTF-32BE" "a"))
+
+  (pass-if-equal "BOM not written to UTF-32LE stream"
+      #vu8(#x61 #x00 #x00 #x00)
+    (bv-write-test "UTF-32LE" "a"))
+
+  (pass-if "Don't read from the port unless user asks to"
+    (let* ((p (make-soft-port
+               (vector
+                (lambda (c) #f)           ; write char
+                (lambda (s) #f)           ; write string
+                (lambda () #f)            ; flush
+                (lambda () (throw 'fail)) ; read char
+                (lambda () #f))
+               "rw")))
+      (set-port-encoding! p "UTF-16")
+      (display "abc" p)
+      (set-port-encoding! p "UTF-32")
+      (display "def" p)
+      #t))
+
+  ;; TODO: test that input and output streams are independent when
+  ;; appropriate, and linked when appropriate.
+
+  (pass-if-equal "BOM discarded from start of UTF-8 stream"
+      "a"
+    (bv-read-test "Utf-8" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM discarded from start of UTF-8 stream after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "uTf-8" #vu8(#xEF #xBB #xBF #x61)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-8 stream"
+      "\uFEFFa"
+    (bv-read-test "UTF-8" #vu8(#xEF #xBB #xBF #xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-8 stream after seek to > 0"
+      "\uFEFFb"
+    (bv-read-test* "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62)
+                   (lambda (p)
+                     (seek p 1 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-8 stream"
+      "a\uFEFFb"
+    (bv-read-test "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62)))
+
+  (pass-if-equal "BOM (BE) written to start of UTF-16 stream"
+      #vu8(#xFE #xFF #x00 #x61 #x00 #x62)
+    (bv-write-test "UTF-16" "ab"))
+
+  (pass-if-equal "BOM (BE) written to UTF-16 stream after set-port-encoding!"
+      #vu8(#xFE #xFF #x00 #x61 #x00 #x62 #xFE #xFF #x00 #x63 #x00 #x64)
+    (bv-write-test* "UTF-16"
+                    (lambda (p)
+                      (display "ab" p)
+                      (set-port-encoding! p "UTF-16")
+                      (display "cd" p))))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (BE)"
+      "a"
+    (bv-read-test "UTF-16" #vu8(#xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (BE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "utf-16" #vu8(#xFE #xFF #x00 #x61)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (BE)"
+      "\uFEFFa"
+    (bv-read-test "Utf-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16 stream (BE) after seek to > 0"
+      "\uFEFFa"
+    (bv-read-test* "uTf-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61)
+                   (lambda (p)
+                     (seek p 2 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-16 stream"
+      "a\uFEFFb"
+    (let ((be (bv-read-test "utf-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62)))
+          (le (bv-read-test "utf-16" #vu8(#x61 #x00 #xFF #xFE #x62 #x00))))
+      (if (char=? #\a (string-ref be 0))
+          be
+          le)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (LE)"
+      "a"
+    (bv-read-test "UTF-16" #vu8(#xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (LE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "Utf-16" #vu8(#xFF #xFE #x61 #x00)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (LE)"
+      "\uFEFFa"
+    (bv-read-test "UTf-16" #vu8(#xFF #xFE #xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (BE)"
+      "a"
+    (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+                                     #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (BE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "utF-32" #vu8(#x00 #x00 #xFE #xFF
+                                      #x00 #x00 #x00 #x61)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (BE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+                                     #x00 #x00 #xFE #xFF
+                                     #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-32 stream (BE) after seek to > 0"
+      "\uFEFFa"
+    (bv-read-test* "UtF-32" #vu8(#x00 #x00 #xFE #xFF
+                                      #x00 #x00 #xFE #xFF
+                                      #x00 #x00 #x00 #x61)
+                   (lambda (p)
+                     (seek p 4 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM discarded within UTF-16 stream (BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-16")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM discarded within UTF-16 stream (LE,BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "utf-16" #vu8(#x00 #x61 #xFF #xFE #x62 #x00)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-16")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM discarded within UTF-32 stream (BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-32" #vu8(#x00 #x00 #x00 #x61
+                                      #x00 #x00 #xFE #xFF
+                                      #x00 #x00 #x00 #x62)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-32")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM discarded within UTF-32 stream (LE,BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-32" #vu8(#x00 #x00 #x00 #x61
+                                      #xFF #xFE #x00 #x00
+                                      #x62 #x00 #x00 #x00)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-32")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-32 stream"
+      "a\uFEFFb"
+    (let ((be (bv-read-test "UTF-32" #vu8(#x00 #x00 #x00 #x61
+                                               #x00 #x00 #xFE #xFF
+                                               #x00 #x00 #x00 #x62)))
+          (le (bv-read-test "UTF-32" #vu8(#x61 #x00 #x00 #x00
+                                               #xFF #xFE #x00 #x00
+                                               #x62 #x00 #x00 #x00))))
+      (if (char=? #\a (string-ref be 0))
+          be
+          le)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (LE)"
+      "a"
+    (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00
+                                     #x61 #x00 #x00 #x00)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (LE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "UTf-32" #vu8(#xFF #xFE #x00 #x00
+                                      #x61 #x00 #x00 #x00)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (LE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00
+                                     #xFF #xFE #x00 #x00
+                                     #x61 #x00 #x00 #x00))))
+
+\f
+
 (define-syntax-rule (with-load-path path body ...)
   (let ((new path)
         (old %load-path))
-- 
1.7.10.4
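
The endianness decision made by 'decide_utf16_encoding' and 'decide_utf32_encoding' in the patch above can be illustrated outside of Guile. What follows is only a minimal sketch under stated assumptions, not the patch's implementation: it inspects a plain byte buffer instead of a port, the 'about_to_read' argument stands in for the combination of MODE == SCM_PORT_READ and the 'at_stream_start_for_bom_read' flag, and the names 'starts_with_bytes', 'sketch_decide_utf16' and 'sketch_decide_utf32' are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Little-endian BOM byte sequences, as in the patch.  */
static const unsigned char utf16le_bom[2] = {0xFF, 0xFE};
static const unsigned char utf32le_bom[4] = {0xFF, 0xFE, 0x00, 0x00};

/* Hypothetical helper: return 1 if the first BOM_LEN bytes of BUF
   equal BOM, else 0.  Unlike 'looking_at_bytes' in the patch, this
   works on a plain buffer and never touches a port.  */
int
starts_with_bytes (const unsigned char *buf, size_t len,
                   const unsigned char *bom, size_t bom_len)
{
  assert (buf != NULL || len == 0);
  return len >= bom_len && memcmp (buf, bom, bom_len) == 0;
}

/* Choose little-endian only when a read is about to happen at stream
   start and a little-endian BOM is present; otherwise fall back to
   big-endian, mirroring the default chosen in the patch.  */
const char *
sketch_decide_utf16 (const unsigned char *buf, size_t len, int about_to_read)
{
  if (about_to_read
      && starts_with_bytes (buf, len, utf16le_bom, sizeof utf16le_bom))
    return "UTF-16LE";
  return "UTF-16BE";
}

/* Same decision for UTF-32, using the four-byte little-endian BOM.  */
const char *
sketch_decide_utf32 (const unsigned char *buf, size_t len, int about_to_read)
{
  if (about_to_read
      && starts_with_bytes (buf, len, utf32le_bom, sizeof utf32le_bom))
    return "UTF-32LE";
  return "UTF-32BE";
}
```

Called with a buffer that begins FF FE and a nonzero 'about_to_read', 'sketch_decide_utf16' selects "UTF-16LE"; with 'about_to_read' set to zero (the write-first case) it selects "UTF-16BE" without looking at the data. The sketch only picks an encoding name and does not consume the BOM; in the patch, consumption happens later in 'get_codepoint'.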


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)
  2013-04-03 20:33       ` Mark H Weaver
@ 2013-04-03 20:48         ` Mike Gran
  2013-04-03 22:24           ` Mark H Weaver
  2013-04-04 20:50         ` Andy Wingo
  1 sibling, 1 reply; 14+ messages in thread
From: Mike Gran @ 2013-04-03 20:48 UTC (permalink / raw
  To: Mark H Weaver, Ludovic Courtès; +Cc: Andy Wingo, guile-devel@gnu.org

Hi Mark



>>>  Here's the new patch.  Any more suggestions?

There are a couple of lines in your doc patch that aren't quite right.

"@code{UTF-16BE}, @code{UTF-16LE}, @code{UTF-16BE}, or @code{UTF-16LE}"

I assume that two of these should be UTF-32.

Also

"This is intended to multiple logical text streams embedded
within a larger binary stream."

Should probably be

"This is intended to support multiple ..."

-Mike




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)
  2013-04-03 20:48         ` Mike Gran
@ 2013-04-03 22:24           ` Mark H Weaver
  2013-04-04  5:59             ` Mark H Weaver
  0 siblings, 1 reply; 14+ messages in thread
From: Mark H Weaver @ 2013-04-03 22:24 UTC (permalink / raw
  To: Mike Gran; +Cc: Andy Wingo, Ludovic Courtès, guile-devel@gnu.org

[-- Attachment #1: Type: text/plain, Size: 115 bytes --]

Thanks for the review, Mike.  I've attached a new patch with those
problems (and a few others) fixed.

     Mark



[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: [PATCH] Improve handling of Unicode byte-order marks (BOMs) --]
[-- Type: text/x-diff, Size: 27911 bytes --]

From a373927201028915f7b8cd5a1c72c5819cb4797c Mon Sep 17 00:00:00 2001
From: Mark H Weaver <mhw@netris.org>
Date: Wed, 3 Apr 2013 04:22:04 -0400
Subject: [PATCH] Improve handling of Unicode byte-order marks (BOMs).

* libguile/ports-internal.h (struct scm_port_internal): Add new members
  'at_stream_start_for_bom_read' and 'at_stream_start_for_bom_write'.
  (SCM_UNICODE_BOM): New macro.
  (scm_i_port_iconv_descriptors): Add 'mode' parameter to prototype.

* libguile/ports.c (scm_new_port_table_entry): Initialize
  'at_stream_start_for_bom_read' and 'at_stream_start_for_bom_write'.
  (get_iconv_codepoint): Pass new 'mode' parameter to
  'scm_i_port_iconv_descriptors'.
  (get_codepoint): After reading a codepoint at stream start, record
  that we're no longer at stream start, and consume a BOM where
  appropriate.
  (scm_seek): Set the stream start flags according to the new position.
  (looking_at_bytes): New static function.
  (scm_utf8_bom, scm_utf16be_bom, scm_utf16le_bom, scm_utf32be_bom,
  scm_utf32le_bom): New static const arrays.
  (decide_utf16_encoding, decide_utf32_encoding): New static functions.
  (scm_i_port_iconv_descriptors): Add new 'mode' parameter.  If the
  specified encoding is UTF-16 or UTF-32, make that precise by deciding
  what endianness to use, and construct iconv descriptors based on the
  precise encoding.
  (scm_i_set_port_encoding_x): Record that we are now at stream start.
  Do not open the new iconv descriptors immediately; let them be
  initialized lazily.

* libguile/print.c (display_string_using_iconv): Record that we're no
  longer at stream start.  Write a BOM if appropriate.

* doc/ref/api-io.texi (BOM Handling): New node.

* test-suite/tests/ports.test ("set-port-encoding!, wrong encoding"):
  Adapt test to cope with the fact that 'set-port-encoding!' does not
  immediately open the iconv descriptors.
  (bv-read-test): New procedure.
  ("unicode byte-order marks (BOMs)"): New test prefix.
---
 doc/ref/api-io.texi         |   64 ++++++++++
 libguile/ports-internal.h   |    7 +-
 libguile/ports.c            |  146 ++++++++++++++++++----
 libguile/print.c            |   18 ++-
 test-suite/tests/ports.test |  284 ++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 493 insertions(+), 26 deletions(-)

diff --git a/doc/ref/api-io.texi b/doc/ref/api-io.texi
index 8c974be..abf9cbd 100644
--- a/doc/ref/api-io.texi
+++ b/doc/ref/api-io.texi
@@ -19,6 +19,7 @@
 * Port Types::                  Types of port and how to make them.
 * R6RS I/O Ports::              The R6RS port API.
 * I/O Extensions::              Using and extending ports in C.
+* BOM Handling::                Handling of Unicode byte order marks.
 @end menu
 
 
@@ -2373,6 +2374,69 @@ Set using
 
 @end table
 
+@node BOM Handling
+@subsection Handling of Unicode byte order marks.
+@cindex BOM
+@cindex byte order mark
+
+This section documents the finer points of Guile's handling of Unicode
+byte order marks (BOMs).  A byte order mark (U+FEFF) is typically found
+at the start of a UTF-16 or UTF-32 stream, so that the reader can
+reliably determine the byte order.  Occasionally, a BOM is found at the
+start of a UTF-8 stream, but this is much less common and not generally
+recommended.
+
+Guile attempts to handle BOMs automatically, and in accordance with the
+recommendations of the Unicode Standard, when the port encoding is set
+to @code{UTF-8}, @code{UTF-16}, or @code{UTF-32}.  In brief, Guile
+automatically writes a BOM at the start of a UTF-16 or UTF-32 stream,
+and automatically consumes one from the start of a UTF-8, UTF-16, or
+UTF-32 stream.
+
+As specified in the Unicode Standard, a BOM is only handled specially at
+the start of a stream, and only if the port encoding is set to
+@code{UTF-8}, @code{UTF-16} or @code{UTF-32}.  If the port encoding is
+set to @code{UTF-16BE}, @code{UTF-16LE}, @code{UTF-32BE}, or
+@code{UTF-32LE}, then BOMs are @emph{not} handled specially, and none of
+the special handling described in this section applies.
+
+@itemize @bullet
+@item
+To ensure that Guile will properly detect the byte order of a
+@code{UTF-16} or @code{UTF-32} stream, you must perform a textual read
+before writing, seeking, or binary I/O.  Guile will not attempt to read
+a BOM until a read is explicitly requested at the start of the stream.
+
+@item
+If @code{set-port-encoding!} is called in the middle of a stream, Guile
+treats this as a new logical ``start of stream'' for purposes of BOM
+handling.  This is intended to support multiple logical text streams
+embedded within a larger binary stream.
+
+@item
+Binary I/O operations are not guaranteed to update Guile's notion of
+whether the port is at the ``start of the stream'', nor are they
+guaranteed to produce or consume BOMs.
+
+@item
+For ports that support seeking (e.g. normal files), the input and output
+streams are considered linked: if the user reads first, then a BOM will
+be consumed (if appropriate), but later writes will @emph{not} produce a
+BOM.  Similarly, if the user writes first, then later reads will
+@emph{not} consume a BOM.
+
+@item
+For ports that do not support seeking (e.g. pipes, sockets, and
+terminals), the input and output streams are considered
+@emph{independent} for purposes of BOM handling: the first read will
+consume a BOM (if appropriate), @emph{and} the first write will produce
+a BOM (if appropriate).
+
+@item
+Seeks to the beginning of the file set the ``start of stream'' flags.
+Seeks anywhere else clear the ``start of stream'' flags.
+@end itemize
+
 @c Local Variables:
 @c TeX-master: "guile.texi"
 @c End:
diff --git a/libguile/ports-internal.h b/libguile/ports-internal.h
index 73a788f..70e8c45 100644
--- a/libguile/ports-internal.h
+++ b/libguile/ports-internal.h
@@ -46,6 +46,8 @@ typedef struct scm_iconv_descriptors scm_t_iconv_descriptors;
 
 struct scm_port_internal
 {
+  unsigned at_stream_start_for_bom_read  : 1;
+  unsigned at_stream_start_for_bom_write : 1;
   scm_t_port_encoding_mode encoding_mode;
   scm_t_iconv_descriptors *iconv_descriptors;
   SCM alist;
@@ -53,9 +55,12 @@ struct scm_port_internal
 
 typedef struct scm_port_internal scm_t_port_internal;
 
+#define SCM_UNICODE_BOM  0xFEFFUL  /* Unicode byte-order mark */
+
 #define SCM_PORT_GET_INTERNAL(x)                                \
   ((scm_t_port_internal *) (SCM_PTAB_ENTRY(x)->input_cd))
 
-SCM_INTERNAL scm_t_iconv_descriptors *scm_i_port_iconv_descriptors (SCM port);
+SCM_INTERNAL scm_t_iconv_descriptors *
+scm_i_port_iconv_descriptors (SCM port, scm_t_port_rw_active mode);
 
 #endif
diff --git a/libguile/ports.c b/libguile/ports.c
index eaa2047..4f042cc 100644
--- a/libguile/ports.c
+++ b/libguile/ports.c
@@ -639,6 +639,9 @@ scm_new_port_table_entry (scm_t_bits tag)
     pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
   pti->iconv_descriptors = NULL;
 
+  pti->at_stream_start_for_bom_read  = 1;
+  pti->at_stream_start_for_bom_write = 1;
+
   /* XXX These fields are not what they seem.  They have been
      repurposed, but cannot safely be renamed in 2.0 without breaking
      ABI compatibility.  This will be cleaned up in 2.2.  */
@@ -1306,10 +1309,12 @@ static int
 get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
 		     char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
 {
-  scm_t_iconv_descriptors *id = scm_i_port_iconv_descriptors (port);
+  scm_t_iconv_descriptors *id;
   scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE];
   size_t input_size = 0;
 
+  id = scm_i_port_iconv_descriptors (port, SCM_PORT_READ);
+
   for (;;)
     {
       int byte_read;
@@ -1393,7 +1398,24 @@ get_codepoint (SCM port, scm_t_wchar *codepoint,
     err = get_iconv_codepoint (port, codepoint, buf, len);
 
   if (SCM_LIKELY (err == 0))
-    update_port_lf (*codepoint, port);
+    {
+      if (SCM_UNLIKELY (pti->at_stream_start_for_bom_read))
+        {
+          /* Record that we're no longer at stream start. */
+          pti->at_stream_start_for_bom_read = 0;
+          if (pt->rw_random)
+            pti->at_stream_start_for_bom_write = 0;
+
+          /* If we just read a BOM in an encoding that recognizes them,
+             then silently consume it and read another code point. */
+          if (SCM_UNLIKELY (*codepoint == SCM_UNICODE_BOM
+                            && (strcasecmp (pt->encoding, "UTF-8") == 0
+                                || strcasecmp (pt->encoding, "UTF-16") == 0
+                                || strcasecmp (pt->encoding, "UTF-32") == 0)))
+            return get_codepoint (port, codepoint, buf, len);
+        }
+      update_port_lf (*codepoint, port);
+    }
   else if (pt->ilseq_handler == SCM_ICONVEH_QUESTION_MARK)
     {
       *codepoint = '?';
@@ -2006,6 +2028,7 @@ SCM_DEFINE (scm_seek, "seek", 3, 0, 0,
 
   if (SCM_OPPORTP (fd_port))
     {
+      scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (fd_port);
       scm_t_ptob_descriptor *ptob = scm_ptobs + SCM_PTOBNUM (fd_port);
       off_t_or_off64_t off = scm_to_off_t_or_off64_t (offset);
       off_t_or_off64_t rv;
@@ -2015,6 +2038,11 @@ SCM_DEFINE (scm_seek, "seek", 3, 0, 0,
                         scm_cons (fd_port, SCM_EOL));
       else
 	rv = ptob->seek (fd_port, off, how);
+
+      /* Set stream-start flags according to new position. */
+      pti->at_stream_start_for_bom_read  = (rv == 0);
+      pti->at_stream_start_for_bom_write = (rv == 0);
+
       return scm_from_off_t_or_off64_t (rv);
     }
   else /* file descriptor?.  */
@@ -2265,6 +2293,68 @@ scm_i_default_port_encoding (void)
     }
 }
 
+/* If the next LEN bytes from PORT are equal to those in BYTES, then
+   return 1, else return 0.  Leave the port position unchanged.  */
+static int
+looking_at_bytes (SCM port, const unsigned char *bytes, int len)
+{
+  scm_t_port *pt = SCM_PTAB_ENTRY (port);
+  int result;
+  int i = 0;
+
+  while (i < len && scm_peek_byte_or_eof (port) == bytes[i])
+    {
+      pt->read_pos++;
+      i++;
+    }
+
+  result = (i == len);
+
+  while (i > 0)
+    scm_unget_byte (bytes[--i], port);
+
+  return result;
+}
+
+static const unsigned char scm_utf8_bom[3]    = {0xEF, 0xBB, 0xBF};
+static const unsigned char scm_utf16be_bom[2] = {0xFE, 0xFF};
+static const unsigned char scm_utf16le_bom[2] = {0xFF, 0xFE};
+static const unsigned char scm_utf32be_bom[4] = {0x00, 0x00, 0xFE, 0xFF};
+static const unsigned char scm_utf32le_bom[4] = {0xFF, 0xFE, 0x00, 0x00};
+
+/* Decide what endianness to use for a UTF-16 port.  Return "UTF-16BE"
+   or "UTF-16LE".  MODE must be either SCM_PORT_READ or SCM_PORT_WRITE,
+   and specifies which operation is about to be done.  The MODE
+   determines how we will decide the endianness.  We deliberately avoid
+   reading from the port unless the user is about to do so.  If the user
+   is about to read, then we look for a BOM, and if present, we use it
+   to determine the endianness.  Otherwise we choose big-endian, as
+   recommended by the Unicode Consortium.  */
+static const char *
+decide_utf16_encoding (SCM port, scm_t_port_rw_active mode)
+{
+  if (mode == SCM_PORT_READ
+      && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read
+      && looking_at_bytes (port, scm_utf16le_bom, sizeof scm_utf16le_bom))
+    return "UTF-16LE";
+  else
+    return "UTF-16BE";
+}
+
+/* Decide what endianness to use for a UTF-32 port.  Return "UTF-32BE"
+   or "UTF-32LE".  See the comment above 'decide_utf16_encoding' for
+   details.  */
+static const char *
+decide_utf32_encoding (SCM port, scm_t_port_rw_active mode)
+{
+  if (mode == SCM_PORT_READ
+      && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read
+      && looking_at_bytes (port, scm_utf32le_bom, sizeof scm_utf32le_bom))
+    return "UTF-32LE";
+  else
+    return "UTF-32BE";
+}
+
 static void
 finalize_iconv_descriptors (void *ptr, void *data)
 {
@@ -2341,23 +2431,36 @@ close_iconv_descriptors (scm_t_iconv_descriptors *id)
   id->output_cd = (void *) -1;
 }
 
+/* Return the iconv_descriptors, initializing them if necessary.  MODE
+   must be either SCM_PORT_READ or SCM_PORT_WRITE, and specifies which
+   operation is about to be done.  We deliberately avoid reading from
+   the port unless the user was about to do so.  */
 scm_t_iconv_descriptors *
-scm_i_port_iconv_descriptors (SCM port)
+scm_i_port_iconv_descriptors (SCM port, scm_t_port_rw_active mode)
 {
-  scm_t_port *pt;
-  scm_t_port_internal *pti;
-
-  pt = SCM_PTAB_ENTRY (port);
-  pti = SCM_PORT_GET_INTERNAL (port);
+  scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (port);
 
   assert (pti->encoding_mode == SCM_PORT_ENCODING_MODE_ICONV);
 
   if (!pti->iconv_descriptors)
     {
+      scm_t_port *pt = SCM_PTAB_ENTRY (port);
+      const char *precise_encoding;
+
       if (!pt->encoding)
         pt->encoding = "ISO-8859-1";
+
+      /* If the specified encoding is UTF-16 or UTF-32, then make
+         that more precise by deciding what endianness to use.  */
+      if (strcasecmp (pt->encoding, "UTF-16") == 0)
+        precise_encoding = decide_utf16_encoding (port, mode);
+      else if (strcasecmp (pt->encoding, "UTF-32") == 0)
+        precise_encoding = decide_utf32_encoding (port, mode);
+      else
+        precise_encoding = pt->encoding;
+
       pti->iconv_descriptors =
-        open_iconv_descriptors (pt->encoding,
+        open_iconv_descriptors (precise_encoding,
                                 SCM_INPUT_PORT_P (port),
                                 SCM_OUTPUT_PORT_P (port));
     }
@@ -2377,28 +2480,27 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
   pti = SCM_PORT_GET_INTERNAL (port);
   prev = pti->iconv_descriptors;
 
+  /* In order to handle cases where the encoding changes mid-stream
+     (e.g. within an HTTP stream, or within a file that is composed of
+     segments with different encodings), we consider this to be "stream
+     start" for purposes of BOM handling, regardless of our actual file
+     position. */
+  pti->at_stream_start_for_bom_read  = 1;
+  pti->at_stream_start_for_bom_write = 1;
+
   if (encoding == NULL)
     encoding = "ISO-8859-1";
 
   /* If ENCODING is UTF-8, then no conversion descriptor is opened
      because we do I/O ourselves.  This saves 100+ KiB for each
      descriptor.  */
+  pt->encoding = scm_gc_strdup (encoding, "port");
   if (strcasecmp (encoding, "UTF-8") == 0)
-    {
-      pti->encoding_mode = SCM_PORT_ENCODING_MODE_UTF8;
-      pti->iconv_descriptors = NULL;
-    }
+    pti->encoding_mode = SCM_PORT_ENCODING_MODE_UTF8;
   else
-    {
-      /* Open descriptors before mutating the port. */
-      pti->iconv_descriptors =
-        open_iconv_descriptors (encoding,
-                                SCM_INPUT_PORT_P (port),
-                                SCM_OUTPUT_PORT_P (port));
-      pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
-    }
-  pt->encoding = scm_gc_strdup (encoding, "port");
+    pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
 
+  pti->iconv_descriptors = NULL;
   if (prev)
     close_iconv_descriptors (prev);
 }
diff --git a/libguile/print.c b/libguile/print.c
index 1572690..3f72810 100644
--- a/libguile/print.c
+++ b/libguile/print.c
@@ -881,8 +881,24 @@ display_string_using_iconv (const void *str, int narrow_p, size_t len,
 {
   size_t printed;
   scm_t_iconv_descriptors *id;
+  scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (port);
 
-  id = scm_i_port_iconv_descriptors (port);
+  id = scm_i_port_iconv_descriptors (port, SCM_PORT_WRITE);
+
+  if (SCM_UNLIKELY (pti->at_stream_start_for_bom_write && len > 0))
+    {
+      scm_t_port *pt = SCM_PTAB_ENTRY (port);
+
+      /* Record that we're no longer at stream start.  */
+      pti->at_stream_start_for_bom_write = 0;
+      if (pt->rw_random)
+        pti->at_stream_start_for_bom_read = 0;
+
+      /* Write a BOM if appropriate.  */
+      if (SCM_UNLIKELY (strcasecmp (pt->encoding, "UTF-16") == 0
+                        || strcasecmp (pt->encoding, "UTF-32") == 0))
+        display_character (SCM_UNICODE_BOM, port, iconveh_error);
+    }
 
   printed = 0;
 
diff --git a/test-suite/tests/ports.test b/test-suite/tests/ports.test
index 886ab24..cb2b698 100644
--- a/test-suite/tests/ports.test
+++ b/test-suite/tests/ports.test
@@ -24,7 +24,8 @@
   #:use-module (ice-9 popen)
   #:use-module (ice-9 rdelim)
   #:use-module (rnrs bytevectors)
-  #:use-module ((rnrs io ports) #:select (open-bytevector-input-port)))
+  #:use-module ((rnrs io ports) #:select (open-bytevector-input-port
+                                          open-bytevector-output-port)))
 
 (define (display-line . args)
   (for-each display args)
@@ -918,7 +919,9 @@
 
   (pass-if-exception "set-port-encoding!, wrong encoding"
     exception:miscellaneous-error
-    (set-port-encoding! (open-input-string "") "does-not-exist"))
+    (let ((p (open-input-string "")))
+      (set-port-encoding! p "does-not-exist")
+      (read p)))
 
   (pass-if-exception "%default-port-encoding, wrong encoding"
     exception:miscellaneous-error
@@ -1149,6 +1152,283 @@
 
 \f
 
+(with-test-prefix "unicode byte-order marks (BOMs)"
+
+  (define (bv-read-test* encoding bv proc)
+    (let ((port (open-bytevector-input-port bv)))
+      (set-port-encoding! port encoding)
+      (proc port)))
+
+  (define (bv-read-test encoding bv)
+    (bv-read-test* encoding bv read-string))
+
+  (define (bv-write-test* encoding proc)
+    (call-with-values
+        (lambda () (open-bytevector-output-port))
+      (lambda (port get-bytevector)
+        (set-port-encoding! port encoding)
+        (proc port)
+        (get-bytevector))))
+
+  (define (bv-write-test encoding str)
+    (bv-write-test* encoding
+                    (lambda (p)
+                      (display str p))))
+
+  (pass-if-equal "BOM not discarded from Latin-1 stream"
+      "\xEF\xBB\xBF\x61"
+    (bv-read-test "ISO-8859-1" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from Latin-2 stream"
+      "\u010F\u0165\u017C\x61"
+    (bv-read-test "ISO-8859-2" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16BE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-16BE" #vu8(#xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16LE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-16LE" #vu8(#xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM not discarded from UTF-32BE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-32BE" #vu8(#x00 #x00 #xFE #xFF
+                                       #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-32LE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-32LE" #vu8(#xFF #xFE #x00 #x00
+                                       #x61 #x00 #x00 #x00)))
+
+  (pass-if-equal "BOM not written to UTF-8 stream"
+      #vu8(#x61)
+    (bv-write-test "UTF-8" "a"))
+
+  (pass-if-equal "BOM not written to UTF-16BE stream"
+      #vu8(#x00 #x61)
+    (bv-write-test "UTF-16BE" "a"))
+
+  (pass-if-equal "BOM not written to UTF-16LE stream"
+      #vu8(#x61 #x00)
+    (bv-write-test "UTF-16LE" "a"))
+
+  (pass-if-equal "BOM not written to UTF-32BE stream"
+      #vu8(#x00 #x00 #x00 #x61)
+    (bv-write-test "UTF-32BE" "a"))
+
+  (pass-if-equal "BOM not written to UTF-32LE stream"
+      #vu8(#x61 #x00 #x00 #x00)
+    (bv-write-test "UTF-32LE" "a"))
+
+  (pass-if "Don't read from the port unless user asks to"
+    (let* ((p (make-soft-port
+               (vector
+                (lambda (c) #f)           ; write char
+                (lambda (s) #f)           ; write string
+                (lambda () #f)            ; flush
+                (lambda () (throw 'fail)) ; read char
+                (lambda () #f))
+               "rw")))
+      (set-port-encoding! p "UTF-16")
+      (display "abc" p)
+      (set-port-encoding! p "UTF-32")
+      (display "def" p)
+      #t))
+
+  ;; TODO: test that input and output streams are independent when
+  ;; appropriate, and linked when appropriate.
+
+  (pass-if-equal "BOM discarded from start of UTF-8 stream"
+      "a"
+    (bv-read-test "Utf-8" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM discarded from start of UTF-8 stream after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "uTf-8" #vu8(#xEF #xBB #xBF #x61)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-8 stream"
+      "\uFEFFa"
+    (bv-read-test "UTF-8" #vu8(#xEF #xBB #xBF #xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-8 stream after seek to > 0"
+      "\uFEFFb"
+    (bv-read-test* "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62)
+                   (lambda (p)
+                     (seek p 1 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-8 stream"
+      "a\uFEFFb"
+    (bv-read-test "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62)))
+
+  (pass-if-equal "BOM (BE) written to start of UTF-16 stream"
+      #vu8(#xFE #xFF #x00 #x61 #x00 #x62)
+    (bv-write-test "UTF-16" "ab"))
+
+  (pass-if-equal "BOM (BE) written to UTF-16 stream after set-port-encoding!"
+      #vu8(#xFE #xFF #x00 #x61 #x00 #x62 #xFE #xFF #x00 #x63 #x00 #x64)
+    (bv-write-test* "UTF-16"
+                    (lambda (p)
+                      (display "ab" p)
+                      (set-port-encoding! p "UTF-16")
+                      (display "cd" p))))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (BE)"
+      "a"
+    (bv-read-test "UTF-16" #vu8(#xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (BE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "utf-16" #vu8(#xFE #xFF #x00 #x61)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (BE)"
+      "\uFEFFa"
+    (bv-read-test "Utf-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16 stream (BE) after seek to > 0"
+      "\uFEFFa"
+    (bv-read-test* "uTf-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61)
+                   (lambda (p)
+                     (seek p 2 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-16 stream"
+      "a\uFEFFb"
+    (let ((be (bv-read-test "utf-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62)))
+          (le (bv-read-test "utf-16" #vu8(#x61 #x00 #xFF #xFE #x62 #x00))))
+      (if (char=? #\a (string-ref be 0))
+          be
+          le)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (LE)"
+      "a"
+    (bv-read-test "UTF-16" #vu8(#xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (LE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "Utf-16" #vu8(#xFF #xFE #x61 #x00)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (LE)"
+      "\uFEFFa"
+    (bv-read-test "UTf-16" #vu8(#xFF #xFE #xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (BE)"
+      "a"
+    (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+                                     #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (BE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "utF-32" #vu8(#x00 #x00 #xFE #xFF
+                                      #x00 #x00 #x00 #x61)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (BE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+                                     #x00 #x00 #xFE #xFF
+                                     #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-32 stream (BE) after seek to > 0"
+      "\uFEFFa"
+    (bv-read-test* "UtF-32" #vu8(#x00 #x00 #xFE #xFF
+                                      #x00 #x00 #xFE #xFF
+                                      #x00 #x00 #x00 #x61)
+                   (lambda (p)
+                     (seek p 4 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM discarded within UTF-16 stream (BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-16")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM discarded within UTF-16 stream (LE,BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "utf-16" #vu8(#x00 #x61 #xFF #xFE #x62 #x00)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-16")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM discarded within UTF-32 stream (BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-32" #vu8(#x00 #x00 #x00 #x61
+                                      #x00 #x00 #xFE #xFF
+                                      #x00 #x00 #x00 #x62)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-32")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM discarded within UTF-32 stream (LE,BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-32" #vu8(#x00 #x00 #x00 #x61
+                                      #xFF #xFE #x00 #x00
+                                      #x62 #x00 #x00 #x00)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-32")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-32 stream"
+      "a\uFEFFb"
+    (let ((be (bv-read-test "UTF-32" #vu8(#x00 #x00 #x00 #x61
+                                               #x00 #x00 #xFE #xFF
+                                               #x00 #x00 #x00 #x62)))
+          (le (bv-read-test "UTF-32" #vu8(#x61 #x00 #x00 #x00
+                                               #xFF #xFE #x00 #x00
+                                               #x62 #x00 #x00 #x00))))
+      (if (char=? #\a (string-ref be 0))
+          be
+          le)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (LE)"
+      "a"
+    (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00
+                                     #x61 #x00 #x00 #x00)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (LE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "UTf-32" #vu8(#xFF #xFE #x00 #x00
+                                      #x61 #x00 #x00 #x00)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (LE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00
+                                     #xFF #xFE #x00 #x00
+                                     #x61 #x00 #x00 #x00))))
+
+\f
+
 (define-syntax-rule (with-load-path path body ...)
   (let ((new path)
         (old %load-path))
-- 
1.7.10.4



* Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)
  2013-04-03 22:24           ` Mark H Weaver
@ 2013-04-04  5:59             ` Mark H Weaver
  0 siblings, 0 replies; 14+ messages in thread
From: Mark H Weaver @ 2013-04-04  5:59 UTC (permalink / raw
  To: Andy Wingo; +Cc: guile-devel

[-- Attachment #1: Type: text/plain, Size: 109 bytes --]

Here's the latest revision of the patch.  The only thing that has
changed is the documentation.

     Mark
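
For readers following the documentation changes, here is a rough model
of the two "start of stream" flags the patch maintains.  It is only a
sketch: 'struct sketch_port' and the 'sketch_*' functions are
hypothetical names, and 'rw_random' is assumed to be non-zero exactly
for seekable ports.

```c
#include <stdbool.h>

/* Hypothetical model of the per-port BOM state.  */
struct sketch_port
{
  bool rw_random;       /* assumed true for seekable ports */
  bool at_start_read;   /* models at_stream_start_for_bom_read */
  bool at_start_write;  /* models at_stream_start_for_bom_write */
};

static void
sketch_port_init (struct sketch_port *p, bool rw_random)
{
  p->rw_random = rw_random;
  p->at_start_read = true;
  p->at_start_write = true;
}

/* Called on a textual read.  Return true if this read is at stream
   start, i.e. a leading BOM may be consumed (subject to the port
   encoding being UTF-8, UTF-16 or UTF-32).  */
static bool
sketch_note_read (struct sketch_port *p)
{
  if (!p->at_start_read)
    return false;
  p->at_start_read = false;
  if (p->rw_random)
    p->at_start_write = false;  /* linked streams: no BOM on later writes */
  return true;
}

/* Called on a textual write.  Return true if this write is at stream
   start, i.e. a BOM may be produced (subject to the port encoding
   being UTF-16 or UTF-32).  */
static bool
sketch_note_write (struct sketch_port *p)
{
  if (!p->at_start_write)
    return false;
  p->at_start_write = false;
  if (p->rw_random)
    p->at_start_read = false;   /* linked streams: no BOM on later reads */
  return true;
}

/* Called after a seek; NEW_POS is the resulting position.  */
static void
sketch_note_seek (struct sketch_port *p, long new_pos)
{
  p->at_start_read = (new_pos == 0);
  p->at_start_write = (new_pos == 0);
}
```

On a seekable port a read followed by a write therefore yields no BOM
on the write, whereas on a pipe or socket both the first read and the
first write are treated as stream start.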



[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: [PATCH] Improve handling of Unicode byte-order marks (BOMs) --]
[-- Type: text/x-diff, Size: 29210 bytes --]

From a3f2c379f11782f0440d9beb2b40601146ee14ea Mon Sep 17 00:00:00 2001
From: Mark H Weaver <mhw@netris.org>
Date: Wed, 3 Apr 2013 04:22:04 -0400
Subject: [PATCH] Improve handling of Unicode byte-order marks (BOMs).

* libguile/ports-internal.h (struct scm_port_internal): Add new members
  'at_stream_start_for_bom_read' and 'at_stream_start_for_bom_write'.
  (SCM_UNICODE_BOM): New macro.
  (scm_i_port_iconv_descriptors): Add 'mode' parameter to prototype.

* libguile/ports.c (scm_new_port_table_entry): Initialize
  'at_stream_start_for_bom_read' and 'at_stream_start_for_bom_write'.
  (get_iconv_codepoint): Pass new 'mode' parameter to
  'scm_i_port_iconv_descriptors'.
  (get_codepoint): After reading a codepoint at stream start, record
  that we're no longer at stream start, and consume a BOM where
  appropriate.
  (scm_seek): Set the stream start flags according to the new position.
  (looking_at_bytes): New static function.
  (scm_utf8_bom, scm_utf16be_bom, scm_utf16le_bom, scm_utf32be_bom,
  scm_utf32le_bom): New static const arrays.
  (decide_utf16_encoding, decide_utf32_encoding): New static functions.
  (scm_i_port_iconv_descriptors): Add new 'mode' parameter.  If the
  specified encoding is UTF-16 or UTF-32, make that precise by deciding
  what endianness to use, and construct iconv descriptors based on the
  precise encoding.
  (scm_i_set_port_encoding_x): Record that we are now at stream start.
  Do not open the new iconv descriptors immediately; let them be
  initialized lazily.

* libguile/print.c (display_string_using_iconv): Record that we're no
  longer at stream start.  Write a BOM if appropriate.

* doc/ref/api-io.texi (BOM Handling): New node.

* test-suite/tests/ports.test ("set-port-encoding!, wrong encoding"):
  Adapt test to cope with the fact that 'set-port-encoding!' does not
  immediately open the iconv descriptors.
  (bv-read-test*, bv-read-test, bv-write-test*, bv-write-test): New
  procedures.
  ("unicode byte-order marks (BOMs)"): New test prefix.
---
 doc/ref/api-io.texi         |   81 +++++++++++-
 libguile/ports-internal.h   |    7 +-
 libguile/ports.c            |  146 ++++++++++++++++++----
 libguile/print.c            |   18 ++-
 test-suite/tests/ports.test |  284 ++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 509 insertions(+), 27 deletions(-)
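
To illustrate the write side described in the log above, here is a
small sketch of which BOM bytes end up at the start of an output
stream.  'sketch_bom_for_write' and 'sketch_equal_nocase' are
hypothetical helpers, not part of the patch; the patch itself calls
'display_character' with SCM_UNICODE_BOM after opening a big-endian
iconv descriptor.

```c
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive comparison of two ASCII encoding names.  */
static int
sketch_equal_nocase (const char *a, const char *b)
{
  while (*a && *b)
    {
      if (tolower ((unsigned char) *a) != tolower ((unsigned char) *b))
        return 0;
      a++;
      b++;
    }
  return *a == *b;  /* equal only if both strings ended together */
}

/* Store in OUT the BOM bytes that a first write would emit for
   ENCODING, and return how many bytes were stored.  Only the
   endianness-neutral names "UTF-16" and "UTF-32" get a BOM, and it is
   always big-endian when the user writes before reading.  */
static size_t
sketch_bom_for_write (const char *encoding, unsigned char out[4])
{
  if (sketch_equal_nocase (encoding, "UTF-16"))
    {
      out[0] = 0xFE; out[1] = 0xFF;
      return 2;
    }
  if (sketch_equal_nocase (encoding, "UTF-32"))
    {
      out[0] = 0x00; out[1] = 0x00; out[2] = 0xFE; out[3] = 0xFF;
      return 4;
    }
  return 0;  /* UTF-8, UTF-16BE, UTF-16LE, etc.: no BOM is written */
}
```

This matches the byte sequences expected by the "BOM (BE) written to
start of UTF-16 stream" and "BOM not written to ..." tests in
ports.test.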

diff --git a/doc/ref/api-io.texi b/doc/ref/api-io.texi
index 8c974be..3f75a63 100644
--- a/doc/ref/api-io.texi
+++ b/doc/ref/api-io.texi
@@ -1,7 +1,7 @@
 @c -*-texinfo-*-
 @c This is part of the GNU Guile Reference Manual.
 @c Copyright (C)  1996, 1997, 2000, 2001, 2002, 2003, 2004, 2007, 2009,
-@c   2010, 2011  Free Software Foundation, Inc.
+@c   2010, 2011, 2013  Free Software Foundation, Inc.
 @c See the file guile.texi for copying conditions.
 
 @node Input and Output
@@ -19,6 +19,7 @@
 * Port Types::                  Types of port and how to make them.
 * R6RS I/O Ports::              The R6RS port API.
 * I/O Extensions::              Using and extending ports in C.
+* BOM Handling::                Handling of Unicode byte order marks.
 @end menu
 
 
@@ -2373,6 +2374,84 @@ Set using
 
 @end table
 
+@node BOM Handling
+@subsection Handling of Unicode byte order marks
+@cindex BOM
+@cindex byte order mark
+
+This section documents the finer points of Guile's handling of Unicode
+byte order marks (BOMs).  A byte order mark (U+FEFF) is typically found
+at the start of a UTF-16 or UTF-32 stream, to allow readers to reliably
+determine the byte order.  Occasionally, a BOM is found at the start of
+a UTF-8 stream, but this is much less common and not generally
+recommended.
+
+Guile attempts to handle BOMs automatically, and in accordance with the
+recommendations of the Unicode Standard, when the port encoding is set
+to @code{UTF-8}, @code{UTF-16}, or @code{UTF-32}.  In brief, Guile
+automatically writes a BOM at the start of a UTF-16 or UTF-32 stream,
+and automatically consumes one from the start of a UTF-8, UTF-16, or
+UTF-32 stream.
+
+As specified in the Unicode Standard, a BOM is only handled specially at
+the start of a stream, and only if the port encoding is set to
+@code{UTF-8}, @code{UTF-16} or @code{UTF-32}.  If the port encoding is
+set to @code{UTF-16BE}, @code{UTF-16LE}, @code{UTF-32BE}, or
+@code{UTF-32LE}, then BOMs are @emph{not} handled specially, and none of
+the special handling described in this section applies.
+
+@itemize @bullet
+@item
+To ensure that Guile will properly detect the byte order of a UTF-16 or
+UTF-32 stream, you must perform a textual read before any writes, seeks,
+or binary I/O.  Guile will not attempt to read a BOM unless a read is
+explicitly requested at the start of the stream.
+
+@item
+If a textual write is performed before the first read, then an arbitrary
+byte order will be chosen.  Currently, big endian is the default on all
+platforms, but that may change in the future.  If you wish to explicitly
+control the byte order of an output stream, set the port encoding to
+@code{UTF-16BE}, @code{UTF-16LE}, @code{UTF-32BE}, or @code{UTF-32LE},
+and explicitly write a BOM (@code{#\xFEFF}) if desired.
+
+@item
+If @code{set-port-encoding!} is called in the middle of a stream, Guile
+treats this as a new logical ``start of stream'' for purposes of BOM
+handling, and will forget about any BOMs that had previously been seen.
+Therefore, it may choose a different byte order than had been used
+previously.  This is intended to support multiple logical text streams
+embedded within a larger binary stream.
+
+@item
+Binary I/O operations are not guaranteed to update Guile's notion of
+whether the port is at the ``start of the stream'', nor are they
+guaranteed to produce or consume BOMs.
+
+@item
+For ports that support seeking (e.g. normal files), the input and output
+streams are considered linked: if the user reads first, then a BOM will
+be consumed (if appropriate), but later writes will @emph{not} produce a
+BOM.  Similarly, if the user writes first, then later reads will
+@emph{not} consume a BOM.
+
+@item
+For ports that do not support seeking (e.g. pipes, sockets, and
+terminals), the input and output streams are considered
+@emph{independent} for purposes of BOM handling: the first read will
+consume a BOM (if appropriate), and the first write will @emph{also}
+produce a BOM (if appropriate).  However, the input and output streams
+will always use the same byte order at any given time.
+
+@item
+Seeks to the beginning of a file will set the ``start of stream'' flags.
+Therefore, a subsequent textual read or write will consume or produce a
+BOM.  However, unlike @code{set-port-encoding!}, if a byte order had
+already been chosen for the port, it will remain in effect after a seek,
+and cannot be changed by the presence of a BOM.  Seeks anywhere other
+than the beginning of a file clear the ``start of stream'' flags.
+@end itemize
+
 @c Local Variables:
 @c TeX-master: "guile.texi"
 @c End:
diff --git a/libguile/ports-internal.h b/libguile/ports-internal.h
index 73a788f..70e8c45 100644
--- a/libguile/ports-internal.h
+++ b/libguile/ports-internal.h
@@ -46,6 +46,8 @@ typedef struct scm_iconv_descriptors scm_t_iconv_descriptors;
 
 struct scm_port_internal
 {
+  unsigned at_stream_start_for_bom_read  : 1;
+  unsigned at_stream_start_for_bom_write : 1;
   scm_t_port_encoding_mode encoding_mode;
   scm_t_iconv_descriptors *iconv_descriptors;
   SCM alist;
@@ -53,9 +55,12 @@ struct scm_port_internal
 
 typedef struct scm_port_internal scm_t_port_internal;
 
+#define SCM_UNICODE_BOM  0xFEFFUL  /* Unicode byte-order mark */
+
 #define SCM_PORT_GET_INTERNAL(x)                                \
   ((scm_t_port_internal *) (SCM_PTAB_ENTRY(x)->input_cd))
 
-SCM_INTERNAL scm_t_iconv_descriptors *scm_i_port_iconv_descriptors (SCM port);
+SCM_INTERNAL scm_t_iconv_descriptors *
+scm_i_port_iconv_descriptors (SCM port, scm_t_port_rw_active mode);
 
 #endif
diff --git a/libguile/ports.c b/libguile/ports.c
index eaa2047..4f042cc 100644
--- a/libguile/ports.c
+++ b/libguile/ports.c
@@ -639,6 +639,9 @@ scm_new_port_table_entry (scm_t_bits tag)
     pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
   pti->iconv_descriptors = NULL;
 
+  pti->at_stream_start_for_bom_read  = 1;
+  pti->at_stream_start_for_bom_write = 1;
+
   /* XXX These fields are not what they seem.  They have been
      repurposed, but cannot safely be renamed in 2.0 without breaking
      ABI compatibility.  This will be cleaned up in 2.2.  */
@@ -1306,10 +1309,12 @@ static int
 get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
 		     char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
 {
-  scm_t_iconv_descriptors *id = scm_i_port_iconv_descriptors (port);
+  scm_t_iconv_descriptors *id;
   scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE];
   size_t input_size = 0;
 
+  id = scm_i_port_iconv_descriptors (port, SCM_PORT_READ);
+
   for (;;)
     {
       int byte_read;
@@ -1393,7 +1398,24 @@ get_codepoint (SCM port, scm_t_wchar *codepoint,
     err = get_iconv_codepoint (port, codepoint, buf, len);
 
   if (SCM_LIKELY (err == 0))
-    update_port_lf (*codepoint, port);
+    {
+      if (SCM_UNLIKELY (pti->at_stream_start_for_bom_read))
+        {
+          /* Record that we're no longer at stream start. */
+          pti->at_stream_start_for_bom_read = 0;
+          if (pt->rw_random)
+            pti->at_stream_start_for_bom_write = 0;
+
+          /* If we just read a BOM in an encoding that recognizes them,
+             then silently consume it and read another code point. */
+          if (SCM_UNLIKELY (*codepoint == SCM_UNICODE_BOM
+                            && (strcasecmp (pt->encoding, "UTF-8") == 0
+                                || strcasecmp (pt->encoding, "UTF-16") == 0
+                                || strcasecmp (pt->encoding, "UTF-32") == 0)))
+            return get_codepoint (port, codepoint, buf, len);
+        }
+      update_port_lf (*codepoint, port);
+    }
   else if (pt->ilseq_handler == SCM_ICONVEH_QUESTION_MARK)
     {
       *codepoint = '?';
@@ -2006,6 +2028,7 @@ SCM_DEFINE (scm_seek, "seek", 3, 0, 0,
 
   if (SCM_OPPORTP (fd_port))
     {
+      scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (fd_port);
       scm_t_ptob_descriptor *ptob = scm_ptobs + SCM_PTOBNUM (fd_port);
       off_t_or_off64_t off = scm_to_off_t_or_off64_t (offset);
       off_t_or_off64_t rv;
@@ -2015,6 +2038,11 @@ SCM_DEFINE (scm_seek, "seek", 3, 0, 0,
                         scm_cons (fd_port, SCM_EOL));
       else
 	rv = ptob->seek (fd_port, off, how);
+
+      /* Set stream-start flags according to new position. */
+      pti->at_stream_start_for_bom_read  = (rv == 0);
+      pti->at_stream_start_for_bom_write = (rv == 0);
+
       return scm_from_off_t_or_off64_t (rv);
     }
   else /* file descriptor?.  */
@@ -2265,6 +2293,68 @@ scm_i_default_port_encoding (void)
     }
 }
 
+/* If the next LEN bytes from PORT are equal to those in BYTES, then
+   return 1, else return 0.  Leave the port position unchanged.  */
+static int
+looking_at_bytes (SCM port, const unsigned char *bytes, int len)
+{
+  scm_t_port *pt = SCM_PTAB_ENTRY (port);
+  int result;
+  int i = 0;
+
+  while (i < len && scm_peek_byte_or_eof (port) == bytes[i])
+    {
+      pt->read_pos++;
+      i++;
+    }
+
+  result = (i == len);
+
+  while (i > 0)
+    scm_unget_byte (bytes[--i], port);
+
+  return result;
+}
+
+static const unsigned char scm_utf8_bom[3]    = {0xEF, 0xBB, 0xBF};
+static const unsigned char scm_utf16be_bom[2] = {0xFE, 0xFF};
+static const unsigned char scm_utf16le_bom[2] = {0xFF, 0xFE};
+static const unsigned char scm_utf32be_bom[4] = {0x00, 0x00, 0xFE, 0xFF};
+static const unsigned char scm_utf32le_bom[4] = {0xFF, 0xFE, 0x00, 0x00};
+
+/* Decide what endianness to use for a UTF-16 port.  Return "UTF-16BE"
+   or "UTF-16LE".  MODE must be either SCM_PORT_READ or SCM_PORT_WRITE,
+   and specifies which operation is about to be done.  The MODE
+   determines how we will decide the endianness.  We deliberately avoid
+   reading from the port unless the user is about to do so.  If the user
+   is about to read, then we look for a BOM, and if present, we use it
+   to determine the endianness.  Otherwise we choose big-endian, as
+   recommended by the Unicode Consortium.  */
+static const char *
+decide_utf16_encoding (SCM port, scm_t_port_rw_active mode)
+{
+  if (mode == SCM_PORT_READ
+      && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read
+      && looking_at_bytes (port, scm_utf16le_bom, sizeof scm_utf16le_bom))
+    return "UTF-16LE";
+  else
+    return "UTF-16BE";
+}
+
+/* Decide what endianness to use for a UTF-32 port.  Return "UTF-32BE"
+   or "UTF-32LE".  See the comment above 'decide_utf16_encoding' for
+   details.  */
+static const char *
+decide_utf32_encoding (SCM port, scm_t_port_rw_active mode)
+{
+  if (mode == SCM_PORT_READ
+      && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read
+      && looking_at_bytes (port, scm_utf32le_bom, sizeof scm_utf32le_bom))
+    return "UTF-32LE";
+  else
+    return "UTF-32BE";
+}
+
 static void
 finalize_iconv_descriptors (void *ptr, void *data)
 {
@@ -2341,23 +2431,36 @@ close_iconv_descriptors (scm_t_iconv_descriptors *id)
   id->output_cd = (void *) -1;
 }
 
+/* Return the iconv_descriptors, initializing them if necessary.  MODE
+   must be either SCM_PORT_READ or SCM_PORT_WRITE, and specifies which
+   operation is about to be done.  We deliberately avoid reading from
+   the port unless the user was about to do so.  */
 scm_t_iconv_descriptors *
-scm_i_port_iconv_descriptors (SCM port)
+scm_i_port_iconv_descriptors (SCM port, scm_t_port_rw_active mode)
 {
-  scm_t_port *pt;
-  scm_t_port_internal *pti;
-
-  pt = SCM_PTAB_ENTRY (port);
-  pti = SCM_PORT_GET_INTERNAL (port);
+  scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (port);
 
   assert (pti->encoding_mode == SCM_PORT_ENCODING_MODE_ICONV);
 
   if (!pti->iconv_descriptors)
     {
+      scm_t_port *pt = SCM_PTAB_ENTRY (port);
+      const char *precise_encoding;
+
       if (!pt->encoding)
         pt->encoding = "ISO-8859-1";
+
+      /* If the specified encoding is UTF-16 or UTF-32, then make
+         that more precise by deciding what endianness to use.  */
+      if (strcasecmp (pt->encoding, "UTF-16") == 0)
+        precise_encoding = decide_utf16_encoding (port, mode);
+      else if (strcasecmp (pt->encoding, "UTF-32") == 0)
+        precise_encoding = decide_utf32_encoding (port, mode);
+      else
+        precise_encoding = pt->encoding;
+
       pti->iconv_descriptors =
-        open_iconv_descriptors (pt->encoding,
+        open_iconv_descriptors (precise_encoding,
                                 SCM_INPUT_PORT_P (port),
                                 SCM_OUTPUT_PORT_P (port));
     }
@@ -2377,28 +2480,27 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
   pti = SCM_PORT_GET_INTERNAL (port);
   prev = pti->iconv_descriptors;
 
+  /* In order to handle cases where the encoding changes mid-stream
+     (e.g. within an HTTP stream, or within a file that is composed of
+     segments with different encodings), we consider this to be "stream
+     start" for purposes of BOM handling, regardless of our actual file
+     position. */
+  pti->at_stream_start_for_bom_read  = 1;
+  pti->at_stream_start_for_bom_write = 1;
+
   if (encoding == NULL)
     encoding = "ISO-8859-1";
 
   /* If ENCODING is UTF-8, then no conversion descriptor is opened
      because we do I/O ourselves.  This saves 100+ KiB for each
      descriptor.  */
+  pt->encoding = scm_gc_strdup (encoding, "port");
   if (strcasecmp (encoding, "UTF-8") == 0)
-    {
-      pti->encoding_mode = SCM_PORT_ENCODING_MODE_UTF8;
-      pti->iconv_descriptors = NULL;
-    }
+    pti->encoding_mode = SCM_PORT_ENCODING_MODE_UTF8;
   else
-    {
-      /* Open descriptors before mutating the port. */
-      pti->iconv_descriptors =
-        open_iconv_descriptors (encoding,
-                                SCM_INPUT_PORT_P (port),
-                                SCM_OUTPUT_PORT_P (port));
-      pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
-    }
-  pt->encoding = scm_gc_strdup (encoding, "port");
+    pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
 
+  pti->iconv_descriptors = NULL;
   if (prev)
     close_iconv_descriptors (prev);
 }
diff --git a/libguile/print.c b/libguile/print.c
index 1572690..3f72810 100644
--- a/libguile/print.c
+++ b/libguile/print.c
@@ -881,8 +881,24 @@ display_string_using_iconv (const void *str, int narrow_p, size_t len,
 {
   size_t printed;
   scm_t_iconv_descriptors *id;
+  scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (port);
 
-  id = scm_i_port_iconv_descriptors (port);
+  id = scm_i_port_iconv_descriptors (port, SCM_PORT_WRITE);
+
+  if (SCM_UNLIKELY (pti->at_stream_start_for_bom_write && len > 0))
+    {
+      scm_t_port *pt = SCM_PTAB_ENTRY (port);
+
+      /* Record that we're no longer at stream start.  */
+      pti->at_stream_start_for_bom_write = 0;
+      if (pt->rw_random)
+        pti->at_stream_start_for_bom_read = 0;
+
+      /* Write a BOM if appropriate.  */
+      if (SCM_UNLIKELY (strcasecmp(pt->encoding, "UTF-16") == 0
+                        || strcasecmp(pt->encoding, "UTF-32") == 0))
+        display_character (SCM_UNICODE_BOM, port, iconveh_error);
+    }
 
   printed = 0;
 
diff --git a/test-suite/tests/ports.test b/test-suite/tests/ports.test
index 886ab24..cb2b698 100644
--- a/test-suite/tests/ports.test
+++ b/test-suite/tests/ports.test
@@ -24,7 +24,8 @@
   #:use-module (ice-9 popen)
   #:use-module (ice-9 rdelim)
   #:use-module (rnrs bytevectors)
-  #:use-module ((rnrs io ports) #:select (open-bytevector-input-port)))
+  #:use-module ((rnrs io ports) #:select (open-bytevector-input-port
+                                          open-bytevector-output-port)))
 
 (define (display-line . args)
   (for-each display args)
@@ -918,7 +919,9 @@
 
   (pass-if-exception "set-port-encoding!, wrong encoding"
     exception:miscellaneous-error
-    (set-port-encoding! (open-input-string "") "does-not-exist"))
+    (let ((p (open-input-string "")))
+      (set-port-encoding! p "does-not-exist")
+      (read p)))
 
   (pass-if-exception "%default-port-encoding, wrong encoding"
     exception:miscellaneous-error
@@ -1149,6 +1152,283 @@
 
 \f
 
+(with-test-prefix "unicode byte-order marks (BOMs)"
+
+  (define (bv-read-test* encoding bv proc)
+    (let ((port (open-bytevector-input-port bv)))
+      (set-port-encoding! port encoding)
+      (proc port)))
+
+  (define (bv-read-test encoding bv)
+    (bv-read-test* encoding bv read-string))
+
+  (define (bv-write-test* encoding proc)
+    (call-with-values
+        (lambda () (open-bytevector-output-port))
+      (lambda (port get-bytevector)
+        (set-port-encoding! port encoding)
+        (proc port)
+        (get-bytevector))))
+
+  (define (bv-write-test encoding str)
+    (bv-write-test* encoding
+                    (lambda (p)
+                      (display str p))))
+
+  (pass-if-equal "BOM not discarded from Latin-1 stream"
+      "\xEF\xBB\xBF\x61"
+    (bv-read-test "ISO-8859-1" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from Latin-2 stream"
+      "\u010F\u0165\u017C\x61"
+    (bv-read-test "ISO-8859-2" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16BE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-16BE" #vu8(#xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16LE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-16LE" #vu8(#xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM not discarded from UTF-32BE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-32BE" #vu8(#x00 #x00 #xFE #xFF
+                                       #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-32LE stream"
+      "\uFEFF\x61"
+    (bv-read-test "UTF-32LE" #vu8(#xFF #xFE #x00 #x00
+                                       #x61 #x00 #x00 #x00)))
+
+  (pass-if-equal "BOM not written to UTF-8 stream"
+      #vu8(#x61)
+    (bv-write-test "UTF-8" "a"))
+
+  (pass-if-equal "BOM not written to UTF-16BE stream"
+      #vu8(#x00 #x61)
+    (bv-write-test "UTF-16BE" "a"))
+
+  (pass-if-equal "BOM not written to UTF-16LE stream"
+      #vu8(#x61 #x00)
+    (bv-write-test "UTF-16LE" "a"))
+
+  (pass-if-equal "BOM not written to UTF-32BE stream"
+      #vu8(#x00 #x00 #x00 #x61)
+    (bv-write-test "UTF-32BE" "a"))
+
+  (pass-if-equal "BOM not written to UTF-32LE stream"
+      #vu8(#x61 #x00 #x00 #x00)
+    (bv-write-test "UTF-32LE" "a"))
+
+  (pass-if "Don't read from the port unless user asks to"
+    (let* ((p (make-soft-port
+               (vector
+                (lambda (c) #f)           ; write char
+                (lambda (s) #f)           ; write string
+                (lambda () #f)            ; flush
+                (lambda () (throw 'fail)) ; read char
+                (lambda () #f))
+               "rw")))
+      (set-port-encoding! p "UTF-16")
+      (display "abc" p)
+      (set-port-encoding! p "UTF-32")
+      (display "def" p)
+      #t))
+
+  ;; TODO: test that input and output streams are independent when
+  ;; appropriate, and linked when appropriate.
+
+  (pass-if-equal "BOM discarded from start of UTF-8 stream"
+      "a"
+    (bv-read-test "Utf-8" #vu8(#xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM discarded from start of UTF-8 stream after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "uTf-8" #vu8(#xEF #xBB #xBF #x61)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-8 stream"
+      "\uFEFFa"
+    (bv-read-test "UTF-8" #vu8(#xEF #xBB #xBF #xEF #xBB #xBF #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-8 stream after seek to > 0"
+      "\uFEFFb"
+    (bv-read-test* "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62)
+                   (lambda (p)
+                     (seek p 1 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-8 stream"
+      "a\uFEFFb"
+    (bv-read-test "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62)))
+
+  (pass-if-equal "BOM (BE) written to start of UTF-16 stream"
+      #vu8(#xFE #xFF #x00 #x61 #x00 #x62)
+    (bv-write-test "UTF-16" "ab"))
+
+  (pass-if-equal "BOM (BE) written to UTF-16 stream after set-port-encoding!"
+      #vu8(#xFE #xFF #x00 #x61 #x00 #x62 #xFE #xFF #x00 #x63 #x00 #x64)
+    (bv-write-test* "UTF-16"
+                    (lambda (p)
+                      (display "ab" p)
+                      (set-port-encoding! p "UTF-16")
+                      (display "cd" p))))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (BE)"
+      "a"
+    (bv-read-test "UTF-16" #vu8(#xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (BE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "utf-16" #vu8(#xFE #xFF #x00 #x61)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (BE)"
+      "\uFEFFa"
+    (bv-read-test "Utf-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-16 stream (BE) after seek to > 0"
+      "\uFEFFa"
+    (bv-read-test* "uTf-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61)
+                   (lambda (p)
+                     (seek p 2 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-16 stream"
+      "a\uFEFFb"
+    (let ((be (bv-read-test "utf-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62)))
+          (le (bv-read-test "utf-16" #vu8(#x61 #x00 #xFF #xFE #x62 #x00))))
+      (if (char=? #\a (string-ref be 0))
+          be
+          le)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (LE)"
+      "a"
+    (bv-read-test "UTF-16" #vu8(#xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM discarded from start of UTF-16 stream (LE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "Utf-16" #vu8(#xFF #xFE #x61 #x00)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (LE)"
+      "\uFEFFa"
+    (bv-read-test "UTf-16" #vu8(#xFF #xFE #xFF #xFE #x61 #x00)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (BE)"
+      "a"
+    (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+                                     #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (BE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "utF-32" #vu8(#x00 #x00 #xFE #xFF
+                                      #x00 #x00 #x00 #x61)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (BE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+                                     #x00 #x00 #xFE #xFF
+                                     #x00 #x00 #x00 #x61)))
+
+  (pass-if-equal "BOM not discarded from UTF-32 stream (BE) after seek to > 0"
+      "\uFEFFa"
+    (bv-read-test* "UtF-32" #vu8(#x00 #x00 #xFE #xFF
+                                      #x00 #x00 #xFE #xFF
+                                      #x00 #x00 #x00 #x61)
+                   (lambda (p)
+                     (seek p 4 SEEK_SET)
+                     (read-string p))))
+
+  (pass-if-equal "BOM discarded within UTF-16 stream (BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-16")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM discarded within UTF-16 stream (LE,BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "utf-16" #vu8(#x00 #x61 #xFF #xFE #x62 #x00)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-16")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM discarded within UTF-32 stream (BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-32" #vu8(#x00 #x00 #x00 #x61
+                                      #x00 #x00 #xFE #xFF
+                                      #x00 #x00 #x00 #x62)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-32")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM discarded within UTF-32 stream (LE,BE) after set-port-encoding!"
+      "ab"
+    (bv-read-test* "UTF-32" #vu8(#x00 #x00 #x00 #x61
+                                      #xFF #xFE #x00 #x00
+                                      #x62 #x00 #x00 #x00)
+                   (lambda (p)
+                     (let ((a (read-char p)))
+                       (set-port-encoding! p "UTF-32")
+                       (string a (read-char p))))))
+
+  (pass-if-equal "BOM not discarded unless at start of UTF-32 stream"
+      "a\uFEFFb"
+    (let ((be (bv-read-test "UTF-32" #vu8(#x00 #x00 #x00 #x61
+                                               #x00 #x00 #xFE #xFF
+                                               #x00 #x00 #x00 #x62)))
+          (le (bv-read-test "UTF-32" #vu8(#x61 #x00 #x00 #x00
+                                               #xFF #xFE #x00 #x00
+                                               #x62 #x00 #x00 #x00))))
+      (if (char=? #\a (string-ref be 0))
+          be
+          le)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (LE)"
+      "a"
+    (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00
+                                     #x61 #x00 #x00 #x00)))
+
+  (pass-if-equal "BOM discarded from start of UTF-32 stream (LE) after seek to 0"
+      '(#\a "a")
+    (bv-read-test* "UTf-32" #vu8(#xFF #xFE #x00 #x00
+                                      #x61 #x00 #x00 #x00)
+                   (lambda (p)
+                     (let ((c (read-char p)))
+                       (seek p 0 SEEK_SET)
+                       (let ((s (read-string p)))
+                         (list c s))))))
+
+  (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (LE)"
+      "\uFEFFa"
+    (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00
+                                     #xFF #xFE #x00 #x00
+                                     #x61 #x00 #x00 #x00))))
+
+\f
+
 (define-syntax-rule (with-load-path path body ...)
   (let ((new path)
         (old %load-path))
-- 
1.7.10.4



* Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)
  2013-04-03 20:33       ` Mark H Weaver
  2013-04-03 20:48         ` Mike Gran
@ 2013-04-04 20:50         ` Andy Wingo
  2013-04-05  7:30           ` Mark H Weaver
  1 sibling, 1 reply; 14+ messages in thread
From: Andy Wingo @ 2013-04-04 20:50 UTC
  To: Mark H Weaver; +Cc: Ludovic Courtès, guile-devel

Hi.  The following review applies to the wrong version of this patch.
I'll go ahead and post it anyway.

On Wed 03 Apr 2013 22:33, Mark H Weaver <mhw@netris.org> writes:

> +          /* If we just read a BOM in an encoding that recognizes them,
> +             then silently consume it and read another code point. */
> +          if (SCM_UNLIKELY (*codepoint == SCM_UNICODE_BOM
> +                            && (strcasecmp (pt->encoding, "UTF-8") == 0
> +                                || strcasecmp (pt->encoding, "UTF-16") == 0
> +                                || strcasecmp (pt->encoding, "UTF-32") == 0)))
> +            return get_codepoint (port, codepoint, buf, len);

Don't we have an enumerated value for UTF-8?  Also, I thought the
documentation noted that we don't consume a BOM for UTF-8.

> +static int
> +looking_at_bytes (SCM port, const unsigned char *bytes, int len)
> +{
> +  scm_t_port *pt = SCM_PTAB_ENTRY (port);
> +  int result;
> +  int i = 0;
> +
> +  while (i < len && scm_peek_byte_or_eof (port) == bytes[i])
> +    {
> +      pt->read_pos++;
> +      i++;
> +    }
> +
> +  result = (i == len);
> +
> +  while (i > 0)
> +    scm_unget_byte (bytes[--i], port);
> +
> +  return result;
> +}

Very subtle ;)  But looks good.

> +static const unsigned char scm_utf8_bom[3]    = {0xEF, 0xBB, 0xBF};
> +static const unsigned char scm_utf16be_bom[2] = {0xFE, 0xFF};
> +static const unsigned char scm_utf16le_bom[2] = {0xFF, 0xFE};
> +static const unsigned char scm_utf32be_bom[4] = {0x00, 0x00, 0xFE, 0xFF};
> +static const unsigned char scm_utf32le_bom[4] = {0xFF, 0xFE, 0x00, 0x00};

Does it not work to leave out the number, i.e. foo[] instead of
foo[3]?  It would be nicer if that works; otherwise it's fine.

> +/* Decide what endianness to use for a UTF-16 port.  Return "UTF-16BE"
> +   or "UTF-16LE".  MODE must be either SCM_PORT_READ or SCM_PORT_WRITE,
> +   and specifies which operation is about to be done.  The MODE
> +   determines how we will decide the endianness.  We deliberately avoid
> +   reading from the port unless the user is about to do so.  If the user
> +   is about to read, then we look for a BOM, and if present, we use it
> +   to determine the endianness.  Otherwise we choose big-endian, as
> +   recommended by the Unicode Consortium.  */

I am surprised this does not default to native endianness!  But OK :)

> +static const char *
> +decide_utf16_encoding (SCM port, scm_t_port_rw_active mode)
> +{
> +  if (mode == SCM_PORT_READ
> +      && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read
> +      && looking_at_bytes (port, scm_utf16le_bom, sizeof scm_utf16le_bom))
> +    return "UTF-16LE";
> +  else
> +    return "UTF-16BE";
> +}
> +
> +/* Decide what endianness to use for a UTF-32 port.  Return "UTF-32BE"
> +   or "UTF-32LE".  See the comment above 'decide_utf16_encoding' for
> +   details.  */
> +static const char *
> +decide_utf32_encoding (SCM port, scm_t_port_rw_active mode)
> +{
> +  if (mode == SCM_PORT_READ
> +      && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read
> +      && looking_at_bytes (port, scm_utf32le_bom, sizeof scm_utf32le_bom))
> +    return "UTF-32LE";
> +  else
> +    return "UTF-32BE";
> +}
> +

Why don't these consume the BOM?  They just use the BOM to detect the
endianness but leave the at_stream_start_for_bom_read flag set to 1, I
guess?  That probably deserves a comment.

> +      /* If the specified encoding is UTF-16 or UTF-32, then make
> +         that more precise by deciding what endianness to use.  */
> +      if (strcasecmp (pt->encoding, "UTF-16") == 0)
> +        precise_encoding = decide_utf16_encoding (port, mode);
> +      else if (strcasecmp (pt->encoding, "UTF-32") == 0)
> +        precise_encoding = decide_utf32_encoding (port, mode);

Ideally these comparisons would not be locale-dependent.  Dunno.

> @@ -2377,28 +2480,27 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
>    pti = SCM_PORT_GET_INTERNAL (port);
>    prev = pti->iconv_descriptors;
>  
> +  /* In order to handle cases where the encoding changes mid-stream
> +     (e.g. within an HTTP stream, or within a file that is composed of
> +     segments with different encodings), we consider this to be "stream
> +     start" for purposes of BOM handling, regardless of our actual file
> +     position. */
> +  pti->at_stream_start_for_bom_read  = 1;
> +  pti->at_stream_start_for_bom_write = 1;
> +

I dunno.  If I receive an HTTP response as a bytevector, parse it to a
Unicode (UTF-{8,16,32}) string successfully, and then convert it back to
bytes, I feel like I should get the same sequence of bytes back.  Is
that unreasonable?

> +  if (SCM_UNLIKELY (pti->at_stream_start_for_bom_write && len > 0))
> +    {
> +      scm_t_port *pt = SCM_PTAB_ENTRY (port);
> +
> +      /* Record that we're no longer at stream start.  */
> +      pti->at_stream_start_for_bom_write = 0;
> +      if (pt->rw_random)
> +        pti->at_stream_start_for_bom_read = 0;
> +
> +      /* Write a BOM if appropriate.  */
> +      if (SCM_UNLIKELY (strcasecmp(pt->encoding, "UTF-16") == 0
> +                        || strcasecmp(pt->encoding, "UTF-32") == 0))
> +        display_character (SCM_UNICODE_BOM, port, iconveh_error);
> +    }

Perhaps only set these flags if we are in UTF-16 or UTF-32 to begin
with instead of pushing this strcasecmp out to print.c.

> +  (pass-if-equal "BOM discarded from start of UTF-8 stream"
> +      "a"
> +    (bv-read-test "Utf-8" #vu8(#xEF #xBB #xBF #x61)))

This test contradicts the docs regarding BOM consumption for UTF-8
streams, no?

Quitting because I realized that you have two new patches (!).

Andy
-- 
http://wingolog.org/




* Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)
  2013-04-04 20:50         ` Andy Wingo
@ 2013-04-05  7:30           ` Mark H Weaver
  2013-04-05  7:42             ` Mike Gran
  0 siblings, 1 reply; 14+ messages in thread
From: Mark H Weaver @ 2013-04-05  7:30 UTC
  To: Andy Wingo; +Cc: Ludovic Courtès, guile-devel

Hi Andy,

Andy Wingo <wingo@pobox.com> writes:

> On Wed 03 Apr 2013 22:33, Mark H Weaver <mhw@netris.org> writes:
>
>> +          /* If we just read a BOM in an encoding that recognizes them,
>> +             then silently consume it and read another code point. */
>> +          if (SCM_UNLIKELY (*codepoint == SCM_UNICODE_BOM
>> +                            && (strcasecmp (pt->encoding, "UTF-8") == 0
>> +                                || strcasecmp (pt->encoding, "UTF-16") == 0
>> +                                || strcasecmp (pt->encoding, "UTF-32") == 0)))
>> +            return get_codepoint (port, codepoint, buf, len);
>
> Don't we have an enumerated value for UTF-8?

Indeed, good point.  I changed this as you suggested, but FWIW the
efficiency impact is negligible, because that 'strcasecmp' is only
called if a BOM is found at the start of a stream.

> Also, I thought the
> documentation noted that we don't consume a BOM for UTF-8.

No.  For UTF-8, we consume a BOM (if present) but do not produce one.

>> +static const unsigned char scm_utf8_bom[3]    = {0xEF, 0xBB, 0xBF};
>> +static const unsigned char scm_utf16be_bom[2] = {0xFE, 0xFF};
>> +static const unsigned char scm_utf16le_bom[2] = {0xFF, 0xFE};
>> +static const unsigned char scm_utf32be_bom[4] = {0x00, 0x00, 0xFE, 0xFF};
>> +static const unsigned char scm_utf32le_bom[4] = {0xFF, 0xFE, 0x00, 0x00};
>
> Does it not work to leave out the number, i.e. foo[] instead of
> foo[3]?  It would be nicer if that works; otherwise it's fine.

It would almost certainly work, but I'm paranoid that some weird
compiler might make the array longer than needed, which would be bad.
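
For reference, C takes the length of an array declared with an empty
bound from its initializer list, so foo[] and foo[3] should yield
arrays of the same length here.  A minimal sketch, assuming a compiler
that accepts C11's _Static_assert; the bom_* names are made up for
illustration and do not appear in the patch:

```c
#include <stddef.h>

/* Hypothetical names, for illustration only (not from the patch).  */
static const unsigned char bom_utf8_sized[3]  = {0xEF, 0xBB, 0xBF};
static const unsigned char bom_utf8_unsized[] = {0xEF, 0xBB, 0xBF};

/* With an empty bound, the array length is the number of initializers,
   so both declarations have exactly three elements.  */
_Static_assert (sizeof bom_utf8_sized == 3, "explicit bound");
_Static_assert (sizeof bom_utf8_unsized == 3, "bound from initializer");

/* Run-time accessors for the two lengths.  */
size_t
bom_sized_len (void)
{
  return sizeof bom_utf8_sized / sizeof bom_utf8_sized[0];
}

size_t
bom_unsized_len (void)
{
  return sizeof bom_utf8_unsized / sizeof bom_utf8_unsized[0];
}
```

Either spelling therefore works with 'sizeof' in the calls to
'looking_at_bytes'; the explicit bound mainly serves as documentation.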

>> +/* Decide what endianness to use for a UTF-16 port.  Return "UTF-16BE"
>> +   or "UTF-16LE".  MODE must be either SCM_PORT_READ or SCM_PORT_WRITE,
>> +   and specifies which operation is about to be done.  The MODE
>> +   determines how we will decide the endianness.  We deliberately avoid
>> +   reading from the port unless the user is about to do so.  If the user
>> +   is about to read, then we look for a BOM, and if present, we use it
>> +   to determine the endianness.  Otherwise we choose big-endian, as
>> +   recommended by the Unicode Consortium.  */
>
> I am surprised this does not default to native endianness!  But OK :)

Remember that "network byte order" is big endian.  Uniformity has its
benefits.  Anyway, in the docs I explicitly reserved the right to change
this later.
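
As a rough illustration of the write-first case for "UTF-16", the
sketch below emits a BOM followed by big-endian code units, which for
"ab" gives the bytes FE FF 00 61 00 62 that the new test expects.  The
helper name 'write_utf16be_with_bom' is hypothetical, and code points
above U+FFFF are deliberately not handled:

```c
#include <stddef.h>

/* Write the N BMP code points in CPS to OUT as big-endian UTF-16,
   preceded by a BOM.  OUT must have room for 2 * (N + 1) bytes.
   Return the number of bytes written.  Hypothetical helper; it does
   not handle surrogate pairs.  */
size_t
write_utf16be_with_bom (const unsigned short *cps, size_t n,
                        unsigned char *out)
{
  size_t i, k = 0;

  out[k++] = 0xFE;   /* U+FEFF, most significant byte first */
  out[k++] = 0xFF;
  for (i = 0; i < n; i++)
    {
      out[k++] = (unsigned char) (cps[i] >> 8);
      out[k++] = (unsigned char) (cps[i] & 0xFF);
    }
  return k;
}
```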

>> +static const char *
>> +decide_utf16_encoding (SCM port, scm_t_port_rw_active mode)
>> +{
>> +  if (mode == SCM_PORT_READ
>> +      && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read
>> +      && looking_at_bytes (port, scm_utf16le_bom, sizeof scm_utf16le_bom))
>> +    return "UTF-16LE";
>> +  else
>> +    return "UTF-16BE";
>> +}
>> +
>> +/* Decide what endianness to use for a UTF-32 port.  Return "UTF-32BE"
>> +   or "UTF-32LE".  See the comment above 'decide_utf16_encoding' for
>> +   details.  */
>> +static const char *
>> +decide_utf32_encoding (SCM port, scm_t_port_rw_active mode)
>> +{
>> +  if (mode == SCM_PORT_READ
>> +      && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read
>> +      && looking_at_bytes (port, scm_utf32le_bom, sizeof scm_utf32le_bom))
>> +    return "UTF-32LE";
>> +  else
>> +    return "UTF-32BE";
>> +}
>> +
>
> Why don't these consume the BOM?  They just use the BOM to detect the
> endianness but leave the at_stream_start_for_bom_read flag set to 1,
> I guess?

Right.  I did it this way for simplicity and consistency.  When you seek
to the beginning of the file, the BOM will be consumed again even though
this code will not be run.  Therefore, we need logic to consume the BOM
in 'get_codepoint'.  It seems cleaner to do that job in just one place
rather than in two places.
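
This division of labour can be sketched on a plain byte buffer: one
function only peeks at the leading bytes to choose an endianness, and
the read routine is the single place that skips a leading U+FEFF.  This
is a simplified model, not the libguile code; 'struct stream',
'decide_utf16' and 'read_utf16_unit' are hypothetical names, and only
16-bit code units (no surrogate pairs) are handled:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory stream; not a Guile port.  */
struct stream
{
  const unsigned char *buf;
  size_t len;
  size_t pos;
  int at_start;        /* 1 until the first code unit has been read */
  int little_endian;   /* chosen by decide_utf16 */
};

static const unsigned char utf16le_bom[2] = {0xFF, 0xFE};

/* Peek at the first two bytes without advancing POS.  Choose
   little-endian only if a little-endian BOM is present at stream
   start; otherwise default to big-endian.  */
void
decide_utf16 (struct stream *s)
{
  s->little_endian = (s->at_start
                      && s->len - s->pos >= sizeof utf16le_bom
                      && memcmp (s->buf + s->pos, utf16le_bom,
                                 sizeof utf16le_bom) == 0);
}

/* Read one 16-bit code unit, or return -1 at end of input.  This is
   the only place where a BOM at stream start is consumed.  */
long
read_utf16_unit (struct stream *s)
{
  for (;;)
    {
      unsigned int hi, lo;
      long unit;

      if (s->len - s->pos < 2)
        return -1;
      hi = s->buf[s->pos + (s->little_endian ? 1 : 0)];
      lo = s->buf[s->pos + (s->little_endian ? 0 : 1)];
      s->pos += 2;
      unit = (long) ((hi << 8) | lo);

      if (s->at_start)
        {
          s->at_start = 0;
          if (unit == 0xFEFF)
            continue;   /* silently skip the BOM and read again */
        }
      return unit;
    }
}
```

After a seek back to position 0 in this model, setting 'at_start' to 1
is enough for the BOM to be skipped again, without re-running the
endianness decision.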

> That probably deserves a comment.

Okay, I added one.

>> +      /* If the specified encoding is UTF-16 or UTF-32, then make
>> +         that more precise by deciding what endianness to use.  */
>> +      if (strcasecmp (pt->encoding, "UTF-16") == 0)
>> +        precise_encoding = decide_utf16_encoding (port, mode);
>> +      else if (strcasecmp (pt->encoding, "UTF-32") == 0)
>> +        precise_encoding = decide_utf32_encoding (port, mode);
>
> Ideally these comparisons would not be locale-dependent.  Dunno.

Yes, that would be preferable.  We talked about adding an
'ascii_strcasecmp' function.  What file do you think it should be
defined in?

>> @@ -2377,28 +2480,27 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
>>    pti = SCM_PORT_GET_INTERNAL (port);
>>    prev = pti->iconv_descriptors;
>>  
>> +  /* In order to handle cases where the encoding changes mid-stream
>> +     (e.g. within an HTTP stream, or within a file that is composed of
>> +     segments with different encodings), we consider this to be "stream
>> +     start" for purposes of BOM handling, regardless of our actual file
>> +     position. */
>> +  pti->at_stream_start_for_bom_read  = 1;
>> +  pti->at_stream_start_for_bom_write = 1;
>> +
>
> I dunno.  If I receive an HTTP response as a bytevector, parse it to a
> Unicode (UTF-{8,16,32}) string successfully, and then convert it back to
> bytes, I feel like I should get the same sequence of bytes back.  Is
> that unreasonable?

It's not an unreasonable wish, but sadly it's impossible in the presence
of automatic BOM removal.  The reason is that information is lost when
you remove the BOM.  This affects the UTF-8, UTF-16, and UTF-32
encodings.
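
To make the loss concrete: once a leading BOM is dropped, an input with
a BOM and one without decode to the same text, so an encoder cannot
tell which byte sequence to reproduce.  A small sketch for the UTF-8
case, using hypothetical helpers that are not part of libguile:

```c
#include <stddef.h>
#include <string.h>

/* Return the number of leading bytes to skip: 3 if BUF starts with
   the UTF-8 encoding of U+FEFF, else 0.  Hypothetical helper.  */
size_t
skip_utf8_bom (const unsigned char *buf, size_t len)
{
  static const unsigned char bom[3] = {0xEF, 0xBB, 0xBF};

  if (len >= sizeof bom && memcmp (buf, bom, sizeof bom) == 0)
    return sizeof bom;
  return 0;
}

/* Return 1 if the two inputs carry the same payload once a leading
   BOM (if any) has been removed from each, else 0.  */
int
same_text_after_bom_removal (const unsigned char *a, size_t alen,
                             const unsigned char *b, size_t blen)
{
  size_t askip = skip_utf8_bom (a, alen);
  size_t bskip = skip_utf8_bom (b, blen);

  return (alen - askip == blen - bskip
          && memcmp (a + askip, b + bskip, alen - askip) == 0);
}
```

The inputs EF BB BF 61 and 61 both yield the one-character text "a", so
the original bytes cannot be recovered from the string alone.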

Also, I'm not sure how your comment relates to the quoted code above it.

>> +  if (SCM_UNLIKELY (pti->at_stream_start_for_bom_write && len > 0))
>> +    {
>> +      scm_t_port *pt = SCM_PTAB_ENTRY (port);
>> +
>> +      /* Record that we're no longer at stream start.  */
>> +      pti->at_stream_start_for_bom_write = 0;
>> +      if (pt->rw_random)
>> +        pti->at_stream_start_for_bom_read = 0;
>> +
>> +      /* Write a BOM if appropriate.  */
>> +      if (SCM_UNLIKELY (strcasecmp(pt->encoding, "UTF-16") == 0
>> +                        || strcasecmp(pt->encoding, "UTF-32") == 0))
>> +        display_character (SCM_UNICODE_BOM, port, iconveh_error);
>> +    }
>
> Perhaps only set these flags if we are in UTF-16 or UTF-32 to begin
> with instead of pushing this strcasecmp out to print.c.

I thought of that, but then we'd have to add these strcasecmp's
everywhere that the flags are set, which is currently in three places:
(1) port creation, (2) set-port-encoding!, and (3) seeks.  As it is now,
we only have to do these comparisons in two places (textual read and
write), and often they can be avoided (e.g. on read the comparisons are
only done when a BOM is found at the start of a stream).  So I think
this way is not only cleaner, but also probably a bit more efficient.
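
As a rough illustration of that trade-off, here is a simplified model
(a sketch only; 'struct toy_port', 'toy_mark_stream_start' and
'toy_write' are made-up names, and counting BOMs stands in for the real
display_character call).  The flags are set unconditionally wherever a
stream start occurs, and the encoding is compared only once, on the
first non-empty write:

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Toy stand-in for the port structures; not Guile's real layout.  */
struct toy_port
{
  const char *encoding;
  int rw_random;
  int at_stream_start_for_bom_read;
  int at_stream_start_for_bom_write;
  int boms_written;             /* counts BOMs instead of emitting them */
};

/* Called at port creation, on set-port-encoding!, and on seeks.
   No encoding comparison is needed here.  */
static void
toy_mark_stream_start (struct toy_port *p)
{
  assert (p != NULL);
  p->at_stream_start_for_bom_read = 1;
  p->at_stream_start_for_bom_write = 1;
}

/* Textual write of LEN characters: the encoding is inspected only on
   the first non-empty write after a stream start.  */
static void
toy_write (struct toy_port *p, size_t len)
{
  assert (p != NULL);
  if (p->at_stream_start_for_bom_write && len > 0)
    {
      /* Record that we're no longer at stream start.  */
      p->at_stream_start_for_bom_write = 0;
      if (p->rw_random)
        p->at_stream_start_for_bom_read = 0;

      /* Write a BOM if appropriate.  */
      if (strcasecmp (p->encoding, "UTF-16") == 0
          || strcasecmp (p->encoding, "UTF-32") == 0)
        p->boms_written++;
    }
}
```

With this shape, every later write costs a single flag test, and ports
in other encodings pay for the string comparisons at most once per
stream start.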

I went ahead and pushed it to stable-2.0, with the changes mentioned
above.  Of course we can continue to tweak it.

    Thanks!
      Mark




* Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)
  2013-04-05  7:30           ` Mark H Weaver
@ 2013-04-05  7:42             ` Mike Gran
  2013-04-05 10:04               ` Ludovic Courtès
  0 siblings, 1 reply; 14+ messages in thread
From: Mike Gran @ 2013-04-05  7:42 UTC (permalink / raw
  To: Mark H Weaver, Andy Wingo; +Cc: Ludovic Courtès, guile-devel@gnu.org

>>>  +      /* If the specified encoding is UTF-16 or UTF-32, then make
>>>  +         that more precise by deciding what endianness to use.  */
>>>  +      if (strcasecmp (pt->encoding, "UTF-16") == 0)
>>>  +        precise_encoding = decide_utf16_encoding (port, mode);
>>>  +      else if (strcasecmp (pt->encoding, "UTF-32") == 0)
>>>  +        precise_encoding = decide_utf32_encoding (port, mode);
>> 
>>  Ideally these comparisons would not be locale-dependent.  Dunno.
> 
> Yes, that would be preferable.  We talked about adding an
> 'ascii_strcasecmp' function.  What file do you think it should be
> defined in?

It would be a trivial function to write, of course, but there is a
c-strcasecmp func in gnulib.
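
For reference, a hand-written version might look roughly like the
sketch below.  The name 'ascii_strcasecmp' comes from the discussion
above, but this implementation is only an assumption about what such a
function would do; it is not the gnulib code.  It folds only the ASCII
letters A-Z, so the result does not depend on the current locale:

```c
#include <assert.h>
#include <stddef.h>

/* Fold ASCII upper-case letters to lower case; leave every other
   value untouched, regardless of the current locale.  */
static int
ascii_tolower (int c)
{
  return (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;
}

/* Compare A and B, ignoring case for ASCII letters only.  Returns a
   negative, zero, or positive value like strcmp.  */
static int
ascii_strcasecmp (const char *a, const char *b)
{
  const unsigned char *p = (const unsigned char *) a;
  const unsigned char *q = (const unsigned char *) b;
  int x, y;

  assert (a != NULL && b != NULL);
  do
    {
      x = ascii_tolower (*p++);
      y = ascii_tolower (*q++);
    }
  while (x != 0 && x == y);

  return x - y;
}
```

A call such as ascii_strcasecmp (pt->encoding, "UTF-16") == 0 would
then behave the same under any locale setting.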




* Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)
  2013-04-05  7:42             ` Mike Gran
@ 2013-04-05 10:04               ` Ludovic Courtès
  2013-04-05 18:15                 ` Mark H Weaver
  0 siblings, 1 reply; 14+ messages in thread
From: Ludovic Courtès @ 2013-04-05 10:04 UTC (permalink / raw
  To: Mike Gran; +Cc: Andy Wingo, Mark H Weaver, guile-devel@gnu.org

Mike Gran <spk121@yahoo.com> skribis:

>>>>  +      /* If the specified encoding is UTF-16 or UTF-32, then make
>>>>  +         that more precise by deciding what endianness to use.  */
>>>>  +      if (strcasecmp (pt->encoding, "UTF-16") == 0)
>>>>  +        precise_encoding = decide_utf16_encoding (port, mode);
>>>>  +      else if (strcasecmp (pt->encoding, "UTF-32") == 0)
>>>>  +        precise_encoding = decide_utf32_encoding (port, mode);
>>> 
>>>  Ideally these comparisons would not be locale-dependent.  Dunno.
>> 
>> Yes, that would be preferable.  We talked about adding an
>> 'ascii_strcasecmp' function.  What file do you think it should be
>> defined in?
>
> It would be a trivial function to write, of course, but there is a
> c-strcasecmp func in gnulib.

Yes, better use that one.

(Just add ‘c-strcase’ in m4/gnulib-cache.m4, run ‘gnulib-tool --update’
with Gnulib v0.0-7865-ga828bb2, and git add the new files.)

Ludo’.




* Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)
  2013-04-05 10:04               ` Ludovic Courtès
@ 2013-04-05 18:15                 ` Mark H Weaver
  0 siblings, 0 replies; 14+ messages in thread
From: Mark H Weaver @ 2013-04-05 18:15 UTC (permalink / raw
  To: Ludovic Courtès; +Cc: Andy Wingo, guile-devel@gnu.org

ludo@gnu.org (Ludovic Courtès) writes:

> Mike Gran <spk121@yahoo.com> skribis:
>
>> It would be a trivial function to write, of course, but there is a
>> c-strcasecmp func in gnulib.
>
> Yes, better use that one.
>
> (Just add ‘c-strcase’ in m4/gnulib-cache.m4, run ‘gnulib-tool --update’
> with Gnulib v0.0-7865-ga828bb2, and git add the new files.)

Done and pushed.  Thanks!

      Mark




Thread overview: 14+ messages
2013-04-03 10:44 [PATCH] Improve handling of Unicode byte-order marks (BOMs) Mark H Weaver
2013-04-03 11:47 ` Mark H Weaver
2013-04-03 11:58 ` Ludovic Courtès
2013-04-03 19:28   ` Mark H Weaver
2013-04-03 20:11     ` Ludovic Courtès
2013-04-03 20:33       ` Mark H Weaver
2013-04-03 20:48         ` Mike Gran
2013-04-03 22:24           ` Mark H Weaver
2013-04-04  5:59             ` Mark H Weaver
2013-04-04 20:50         ` Andy Wingo
2013-04-05  7:30           ` Mark H Weaver
2013-04-05  7:42             ` Mike Gran
2013-04-05 10:04               ` Ludovic Courtès
2013-04-05 18:15                 ` Mark H Weaver
