From: Mark H Weaver <mhw@netris.org>
To: guile-devel@gnu.org
Subject: Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)
Date: Wed, 03 Apr 2013 07:47:41 -0400
Message-ID: <87bo9vzvhe.fsf@tines.lan>
In-Reply-To: <87ip43zyf0.fsf@tines.lan> (Mark H. Weaver's message of "Wed, 03 Apr 2013 06:44:19 -0400")
[-- Attachment #1: Type: text/plain, Size: 262 bytes --]
Here's an improved version of the patch. Mainly it adds more tests.
Also, I forgot to mention that binary I/O does not affect the "start of
stream" flags at all. This is mainly for efficiency reasons, but even
so, I don't feel too bad about it.
Mark
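
The "start of stream" flag behaviour discussed here can be modelled in a
few lines of C. This is only a rough sketch under stated assumptions, not
code from the patch: the struct and the note_* function names are
hypothetical stand-ins for the real port structures, and the transitions
are the ones the patch's change log describes (text I/O clears a flag,
seek and set-port-encoding! reset both, binary I/O touches neither).

```c
/* Illustrative model only; all names here are hypothetical.  */
struct bom_flags
{
  int at_start_for_read;   /* next text read may consume a BOM */
  int at_start_for_write;  /* next text write may emit a BOM */
  int rw_random;           /* stands in for the port's rw_random field */
};

/* A textual read leaves stream start; on an rw_random port the
   write side is considered to have left it too.  */
static void
note_text_read (struct bom_flags *f)
{
  f->at_start_for_read = 0;
  if (f->rw_random)
    f->at_start_for_write = 0;
}

/* Mirror image of note_text_read for textual output.  */
static void
note_text_write (struct bom_flags *f)
{
  f->at_start_for_write = 0;
  if (f->rw_random)
    f->at_start_for_read = 0;
}

/* Binary I/O deliberately leaves both flags untouched.  */
static void
note_binary_io (struct bom_flags *f)
{
  (void) f;
}

/* Seeking sets both flags according to the new position.  */
static void
note_seek (struct bom_flags *f, long new_pos)
{
  f->at_start_for_read = (new_pos == 0);
  f->at_start_for_write = (new_pos == 0);
}

/* Changing the encoding counts as a fresh stream start.  */
static void
note_set_encoding (struct bom_flags *f)
{
  f->at_start_for_read = 1;
  f->at_start_for_write = 1;
}
```

In this model a binary read followed by a textual read still sees
at_start_for_read set, which is the trade-off mentioned above.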
[-- Attachment #2: [PATCH] Improve handling of Unicode byte-order marks (BOMs) --]
[-- Type: text/x-diff, Size: 24215 bytes --]
From d8d37d5519ca61961b70cb3051ccca2be7d4affa Mon Sep 17 00:00:00 2001
From: Mark H Weaver <mhw@netris.org>
Date: Wed, 3 Apr 2013 04:22:04 -0400
Subject: [PATCH] Improve handling of Unicode byte-order marks (BOMs).
* libguile/ports-internal.h (struct scm_port_internal): Add new members
'at_stream_start_for_bom_read' and 'at_stream_start_for_bom_write'.
(SCM_UNICODE_BOM): New macro.
(scm_i_port_iconv_descriptors): Add 'mode' parameter to prototype.
* libguile/ports.c (scm_new_port_table_entry): Initialize
'at_stream_start_for_bom_read' and 'at_stream_start_for_bom_write'.
(get_iconv_codepoint): Pass new 'mode' parameter to
'scm_i_port_iconv_descriptors'.
(get_codepoint): After reading a codepoint at stream start, record
that we're no longer at stream start, and consume a BOM where
appropriate.
(scm_seek): Set the stream start flags according to the new position.
(looking_at_bytes): New static function.
(scm_utf8_bom, scm_utf16be_bom, scm_utf16le_bom, scm_utf32be_bom,
scm_utf32le_bom): New static const arrays.
(decide_utf16_encoding, decide_utf32_encoding): New static functions.
(scm_i_port_iconv_descriptors): Add new 'mode' parameter. If the
specified encoding is UTF-16 or UTF-32, make that precise by deciding
what endianness to use, and construct iconv descriptors based on the
precise encoding.
(scm_i_set_port_encoding_x): Record that we are now at stream start.
Do not open the new iconv descriptors immediately; let them be
initialized lazily.
* libguile/print.c (display_string_using_iconv): Record that we're no
longer at stream start. Write a BOM if appropriate.
* test-suite/tests/ports.test ("set-port-encoding!, wrong encoding"):
Adapt test to cope with the fact that 'set-port-encoding!' does not
immediately open the iconv descriptors.
(bv-read-test): New procedure.
("unicode byte-order marks (BOMs)"): New test prefix.
---
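The endianness rule described in the log above (pick little-endian only
when a little-endian BOM is seen while reading, otherwise big-endian) can
be sketched on a plain byte buffer. This is a minimal illustration under
assumptions, not the patch's implementation: starts_with, sniff_utf16 and
sniff_utf32 are hypothetical names, and the patch itself peeks at the port
through looking_at_bytes rather than taking a buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return 1 if BUF (BUF_LEN bytes long) begins
   with the LEN bytes in BOM, else 0.  */
static int
starts_with (const unsigned char *buf, size_t buf_len,
             const unsigned char *bom, size_t len)
{
  return buf_len >= len && memcmp (buf, bom, len) == 0;
}

/* Hypothetical helper: choose a precise UTF-16 encoding name for
   reading.  A big-endian BOM and a missing BOM both yield
   "UTF-16BE".  */
static const char *
sniff_utf16 (const unsigned char *buf, size_t buf_len)
{
  static const unsigned char le_bom[2] = {0xFF, 0xFE};

  return starts_with (buf, buf_len, le_bom, sizeof le_bom)
    ? "UTF-16LE" : "UTF-16BE";
}

/* Same rule for UTF-32, using the 4-byte little-endian BOM.  */
static const char *
sniff_utf32 (const unsigned char *buf, size_t buf_len)
{
  static const unsigned char le_bom[4] = {0xFF, 0xFE, 0x00, 0x00};

  return starts_with (buf, buf_len, le_bom, sizeof le_bom)
    ? "UTF-32LE" : "UTF-32BE";
}
```

The sketch only names the encoding and does not skip the BOM bytes. In
the patch, the BOM code point is consumed separately in get_codepoint
once decoding has started.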
libguile/ports-internal.h | 7 +-
libguile/ports.c | 134 +++++++++++++++++---
libguile/print.c | 18 ++-
test-suite/tests/ports.test | 293 ++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 433 insertions(+), 19 deletions(-)
diff --git a/libguile/ports-internal.h b/libguile/ports-internal.h
index 73a788f..cd1746b 100644
--- a/libguile/ports-internal.h
+++ b/libguile/ports-internal.h
@@ -48,14 +48,19 @@ struct scm_port_internal
{
scm_t_port_encoding_mode encoding_mode;
scm_t_iconv_descriptors *iconv_descriptors;
+ int at_stream_start_for_bom_read;
+ int at_stream_start_for_bom_write;
SCM alist;
};
typedef struct scm_port_internal scm_t_port_internal;
+#define SCM_UNICODE_BOM 0xFEFF /* Unicode byte-order mark */
+
#define SCM_PORT_GET_INTERNAL(x) \
((scm_t_port_internal *) (SCM_PTAB_ENTRY(x)->input_cd))
-SCM_INTERNAL scm_t_iconv_descriptors *scm_i_port_iconv_descriptors (SCM port);
+SCM_INTERNAL scm_t_iconv_descriptors *
+scm_i_port_iconv_descriptors (SCM port, scm_t_port_rw_active mode);
#endif
diff --git a/libguile/ports.c b/libguile/ports.c
index 51145e6..99261da 100644
--- a/libguile/ports.c
+++ b/libguile/ports.c
@@ -639,6 +639,9 @@ scm_new_port_table_entry (scm_t_bits tag)
pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
pti->iconv_descriptors = NULL;
+ pti->at_stream_start_for_bom_read = 1;
+ pti->at_stream_start_for_bom_write = 1;
+
/* XXX These fields are not what they seem. They have been
repurposed, but cannot safely be renamed in 2.0 without breaking
ABI compatibility. This will be cleaned up in 2.2. */
@@ -1306,10 +1309,12 @@ static int
get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
{
- scm_t_iconv_descriptors *id = scm_i_port_iconv_descriptors (port);
+ scm_t_iconv_descriptors *id;
scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE];
size_t input_size = 0;
+ id = scm_i_port_iconv_descriptors (port, SCM_PORT_READ);
+
for (;;)
{
int byte_read;
@@ -1393,7 +1398,24 @@ get_codepoint (SCM port, scm_t_wchar *codepoint,
err = get_iconv_codepoint (port, codepoint, buf, len);
if (SCM_LIKELY (err == 0))
- update_port_lf (*codepoint, port);
+ {
+ if (SCM_UNLIKELY (pti->at_stream_start_for_bom_read))
+ {
+ /* Record that we're no longer at stream start. */
+ pti->at_stream_start_for_bom_read = 0;
+ if (pt->rw_random)
+ pti->at_stream_start_for_bom_write = 0;
+
+ /* If we just read a BOM in an encoding that recognizes them,
+ then silently consume it and read another code point. */
+ if (SCM_UNLIKELY (*codepoint == SCM_UNICODE_BOM
+ && (strcmp(pt->encoding, "UTF-8") == 0
+ || strcmp(pt->encoding, "UTF-16") == 0
+ || strcmp(pt->encoding, "UTF-32") == 0)))
+ return get_codepoint (port, codepoint, buf, len);
+ }
+ update_port_lf (*codepoint, port);
+ }
else if (pt->ilseq_handler == SCM_ICONVEH_QUESTION_MARK)
{
*codepoint = '?';
@@ -2006,6 +2028,7 @@ SCM_DEFINE (scm_seek, "seek", 3, 0, 0,
if (SCM_OPPORTP (fd_port))
{
+ scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (fd_port);
scm_t_ptob_descriptor *ptob = scm_ptobs + SCM_PTOBNUM (fd_port);
off_t_or_off64_t off = scm_to_off_t_or_off64_t (offset);
off_t_or_off64_t rv;
@@ -2015,6 +2038,11 @@ SCM_DEFINE (scm_seek, "seek", 3, 0, 0,
scm_cons (fd_port, SCM_EOL));
else
rv = ptob->seek (fd_port, off, how);
+
+ /* Set stream-start flags according to new position. */
+ pti->at_stream_start_for_bom_read = (rv == 0);
+ pti->at_stream_start_for_bom_write = (rv == 0);
+
return scm_from_off_t_or_off64_t (rv);
}
else /* file descriptor?. */
@@ -2265,6 +2293,66 @@ scm_i_default_port_encoding (void)
}
}
+/* If the next LEN bytes from port are equal to those in BYTES, then
+ return 1, else return 0. Leave the port position unchanged. */
+static int
+looking_at_bytes (SCM port, const unsigned char *bytes, int len)
+{
+ scm_t_port *pt = SCM_PTAB_ENTRY (port);
+ int result;
+ int i = 0;
+
+ while (i < len && scm_peek_byte_or_eof (port) == bytes[i])
+ {
+ pt->read_pos++;
+ i++;
+ }
+
+ result = (i == len);
+
+ while (i > 0)
+ scm_unget_byte (bytes[--i], port);
+
+ return result;
+}
+
+static const unsigned char scm_utf8_bom[3] = {0xEF, 0xBB, 0xBF};
+static const unsigned char scm_utf16be_bom[2] = {0xFE, 0xFF};
+static const unsigned char scm_utf16le_bom[2] = {0xFF, 0xFE};
+static const unsigned char scm_utf32be_bom[4] = {0x00, 0x00, 0xFE, 0xFF};
+static const unsigned char scm_utf32le_bom[4] = {0xFF, 0xFE, 0x00, 0x00};
+
+/* Decide what endianness to use for a UTF-16 port. Return "UTF-16BE"
+ or "UTF-16LE". MODE must be either SCM_PORT_READ or SCM_PORT_WRITE,
+ and specifies which operation is about to be done. The MODE
+ determines how we will decide the endianness. We deliberately avoid
+ reading from the port unless the user is about to do so. If the user
+ is about to read, then we look for a BOM, and if present, we use it
+ to determine the endianness. Otherwise we choose big-endian, as
+ recommended by the Unicode Consortium. */
+static char *
+decide_utf16_encoding (SCM port, scm_t_port_rw_active mode)
+{
+ if (mode == SCM_PORT_READ
+ && looking_at_bytes (port, scm_utf16le_bom, sizeof scm_utf16le_bom))
+ return "UTF-16LE";
+ else
+ return "UTF-16BE";
+}
+
+/* Decide what endianness to use for a UTF-32 port. Return "UTF-32BE"
+ or "UTF-32LE". See the comment above 'decide_utf16_encoding' for
+ details. */
+static char *
+decide_utf32_encoding (SCM port, scm_t_port_rw_active mode)
+{
+ if (mode == SCM_PORT_READ
+ && looking_at_bytes (port, scm_utf32le_bom, sizeof scm_utf32le_bom))
+ return "UTF-32LE";
+ else
+ return "UTF-32BE";
+}
+
static void
finalize_iconv_descriptors (void *ptr, void *data)
{
@@ -2341,23 +2429,36 @@ close_iconv_descriptors (scm_t_iconv_descriptors *id)
id->output_cd = (void *) -1;
}
+/* Return the iconv_descriptors, initializing them if necessary. MODE
+ must be either SCM_PORT_READ or SCM_PORT_WRITE, and specifies which
+ operation is about to be done. We deliberately avoid reading from
+ the port unless the user was about to do so. */
scm_t_iconv_descriptors *
-scm_i_port_iconv_descriptors (SCM port)
+scm_i_port_iconv_descriptors (SCM port, scm_t_port_rw_active mode)
{
- scm_t_port *pt;
- scm_t_port_internal *pti;
-
- pt = SCM_PTAB_ENTRY (port);
- pti = SCM_PORT_GET_INTERNAL (port);
+ scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (port);
assert (pti->encoding_mode == SCM_PORT_ENCODING_MODE_ICONV);
if (!pti->iconv_descriptors)
{
+ scm_t_port *pt = SCM_PTAB_ENTRY (port);
+ char *precise_encoding;
+
if (!pt->encoding)
pt->encoding = "ISO-8859-1";
+
+ /* If the specified encoding is UTF-16 or UTF-32, then make
+ that more precise by deciding what endianness to use. */
+ if (strcmp (pt->encoding, "UTF-16") == 0)
+ precise_encoding = decide_utf16_encoding (port, mode);
+ else if (strcmp (pt->encoding, "UTF-32") == 0)
+ precise_encoding = decide_utf32_encoding (port, mode);
+ else
+ precise_encoding = pt->encoding;
+
pti->iconv_descriptors =
- open_iconv_descriptors (pt->encoding,
+ open_iconv_descriptors (precise_encoding,
SCM_INPUT_PORT_P (port),
SCM_OUTPUT_PORT_P (port));
}
@@ -2377,6 +2478,14 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
pti = SCM_PORT_GET_INTERNAL (port);
prev = pti->iconv_descriptors;
+ /* In order to handle cases where the encoding changes mid-stream
+ (e.g. within an HTTP stream, or within a file that is composed of
+ segments with different encodings), we consider this to be "stream
+ start" for purposes of BOM handling, regardless of our actual file
+ position. */
+ pti->at_stream_start_for_bom_read = 1;
+ pti->at_stream_start_for_bom_write = 1;
+
if (encoding == NULL)
encoding = "ISO-8859-1";
@@ -2387,19 +2496,14 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
{
pt->encoding = "UTF-8";
pti->encoding_mode = SCM_PORT_ENCODING_MODE_UTF8;
- pti->iconv_descriptors = NULL;
}
else
{
- /* Open descriptors before mutating the port. */
- pti->iconv_descriptors =
- open_iconv_descriptors (encoding,
- SCM_INPUT_PORT_P (port),
- SCM_OUTPUT_PORT_P (port));
pt->encoding = scm_gc_strdup (encoding, "port");
pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
}
+ pti->iconv_descriptors = NULL;
if (prev)
close_iconv_descriptors (prev);
}
diff --git a/libguile/print.c b/libguile/print.c
index 1572690..b8b13d4 100644
--- a/libguile/print.c
+++ b/libguile/print.c
@@ -881,8 +881,24 @@ display_string_using_iconv (const void *str, int narrow_p, size_t len,
{
size_t printed;
scm_t_iconv_descriptors *id;
+ scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (port);
- id = scm_i_port_iconv_descriptors (port);
+ id = scm_i_port_iconv_descriptors (port, SCM_PORT_WRITE);
+
+ if (SCM_UNLIKELY (pti->at_stream_start_for_bom_write && len > 0))
+ {
+ scm_t_port *pt = SCM_PTAB_ENTRY (port);
+
+ /* Record that we're no longer at stream start. */
+ pti->at_stream_start_for_bom_write = 0;
+ if (pt->rw_random)
+ pti->at_stream_start_for_bom_read = 0;
+
+ /* Write a BOM if appropriate. */
+ if (SCM_UNLIKELY (strcmp(pt->encoding, "UTF-16") == 0
+ || strcmp(pt->encoding, "UTF-32") == 0))
+ display_character (SCM_UNICODE_BOM, port, iconveh_error);
+ }
printed = 0;
diff --git a/test-suite/tests/ports.test b/test-suite/tests/ports.test
index 886ab24..f966fc3 100644
--- a/test-suite/tests/ports.test
+++ b/test-suite/tests/ports.test
@@ -24,7 +24,8 @@
#:use-module (ice-9 popen)
#:use-module (ice-9 rdelim)
#:use-module (rnrs bytevectors)
- #:use-module ((rnrs io ports) #:select (open-bytevector-input-port)))
+ #:use-module ((rnrs io ports) #:select (open-bytevector-input-port
+ open-bytevector-output-port)))
(define (display-line . args)
(for-each display args)
@@ -918,7 +919,9 @@
(pass-if-exception "set-port-encoding!, wrong encoding"
exception:miscellaneous-error
- (set-port-encoding! (open-input-string "") "does-not-exist"))
+ (let ((p (open-input-string "")))
+ (set-port-encoding! p "does-not-exist")
+ (read p)))
(pass-if-exception "%default-port-encoding, wrong encoding"
exception:miscellaneous-error
@@ -1149,6 +1152,292 @@
\f
+(with-test-prefix "unicode byte-order marks (BOMs)"
+
+ (define (bv-read-test* encoding bv proc)
+ (let ((port (open-bytevector-input-port bv)))
+ (set-port-encoding! port encoding)
+ (proc port)))
+
+ (define (bv-read-test encoding bv)
+ (bv-read-test* encoding bv read-string))
+
+ (define (bv-write-test* encoding proc)
+ (call-with-values
+ (lambda () (open-bytevector-output-port))
+ (lambda (port get-bytevector)
+ (set-port-encoding! port encoding)
+ (proc port)
+ (get-bytevector))))
+
+ (define (bv-write-test encoding str)
+ (bv-write-test* encoding
+ (lambda (p)
+ (display str p))))
+
+ (pass-if-equal "BOM not discarded from Latin-1 stream"
+ "\xEF\xBB\xBF\x61"
+ (bv-read-test "ISO-8859-1" #vu8(#xEF #xBB #xBF #x61)))
+
+ (pass-if-equal "BOM not discarded from Latin-2 stream"
+ "\u010F\u0165\u017C\x61"
+ (bv-read-test "ISO-8859-2" #vu8(#xEF #xBB #xBF #x61)))
+
+ (pass-if-equal "BOM not discarded from UTF-16BE stream"
+ "\uFEFF\x61"
+ (bv-read-test "UTF-16BE" #vu8(#xFE #xFF #x00 #x61)))
+
+ (pass-if-equal "BOM not discarded from UTF-16LE stream"
+ "\uFEFF\x61"
+ (bv-read-test "UTF-16LE" #vu8(#xFF #xFE #x61 #x00)))
+
+ (pass-if-equal "BOM not discarded from UTF-32BE stream"
+ "\uFEFF\x61"
+ (bv-read-test "UTF-32BE" #vu8(#x00 #x00 #xFE #xFF
+ #x00 #x00 #x00 #x61)))
+
+ (pass-if-equal "BOM not discarded from UTF-32LE stream"
+ "\uFEFF\x61"
+ (bv-read-test "UTF-32LE" #vu8(#xFF #xFE #x00 #x00
+ #x61 #x00 #x00 #x00)))
+
+ (pass-if-equal "BOM not written to UTF-8 stream"
+ #vu8(#x61)
+ (bv-write-test "UTF-8" "a"))
+
+ (pass-if-equal "BOM not written to UTF-16BE stream"
+ #vu8(#x00 #x61)
+ (bv-write-test "UTF-16BE" "a"))
+
+ (pass-if-equal "BOM not written to UTF-16LE stream"
+ #vu8(#x61 #x00)
+ (bv-write-test "UTF-16LE" "a"))
+
+ (pass-if-equal "BOM not written to UTF-32BE stream"
+ #vu8(#x00 #x00 #x00 #x61)
+ (bv-write-test "UTF-32BE" "a"))
+
+ (pass-if-equal "BOM not written to UTF-32LE stream"
+ #vu8(#x61 #x00 #x00 #x00)
+ (bv-write-test "UTF-32LE" "a"))
+
+ (pass-if "Don't read from the port unless user asks to"
+ (let* ((p (make-soft-port
+ (vector
+ (lambda (c) #f) ; write char
+ (lambda (s) #f) ; write string
+ (lambda () #f) ; flush
+ (lambda () (throw 'fail)) ; read char
+ (lambda () #f))
+ "rw")))
+ (set-port-encoding! p "UTF-16")
+ (display "abc" p)
+ (set-port-encoding! p "UTF-32")
+ (display "def" p)
+ #t))
+
+ ;; TODO: test that input and output streams are independent when
+ ;; appropriate, and linked when appropriate.
+
+ (pass-if-equal "BOM discarded from start of UTF-8 stream"
+ "a"
+ (bv-read-test "UTF-8" #vu8(#xEF #xBB #xBF #x61)))
+
+ (pass-if-equal "BOM discarded from start of UTF-8 stream after seek to 0"
+ '(#\a "a")
+ (bv-read-test* "UTF-8" #vu8(#xEF #xBB #xBF #x61)
+ (lambda (p)
+ (let ((c (read-char p)))
+ (seek p 0 SEEK_SET)
+ (let ((s (read-string p)))
+ (list c s))))))
+
+ (pass-if-equal "Only one BOM discarded from start of UTF-8 stream"
+ "\uFEFFa"
+ (bv-read-test "UTF-8" #vu8(#xEF #xBB #xBF #xEF #xBB #xBF #x61)))
+
+ (pass-if-equal "BOM not discarded from UTF-8 stream after seek to > 0"
+ "\uFEFFb"
+ (bv-read-test* "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62)
+ (lambda (p)
+ (seek p 1 SEEK_SET)
+ (read-string p))))
+
+ (pass-if-equal "BOM not discarded unless at start of UTF-8 stream"
+ "a\uFEFFb"
+ (bv-read-test "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62)))
+
+ (pass-if-equal "BOM (BE) written to start of UTF-16 stream"
+ #vu8(#xFE #xFF #x00 #x61 #x00 #x62)
+ (bv-write-test "UTF-16" "ab"))
+
+ (pass-if-equal "BOM (BE) written to UTF-16 stream after set-port-encoding!"
+ #vu8(#xFE #xFF #x00 #x61 #x00 #x62 #xFE #xFF #x00 #x63 #x00 #x64)
+ (bv-write-test* "UTF-16"
+ (lambda (p)
+ (display "ab" p)
+ (set-port-encoding! p "UTF-16")
+ (display "cd" p))))
+
+ (pass-if-equal "BOM discarded from start of UTF-16 stream (BE)"
+ "a"
+ (bv-read-test "UTF-16" #vu8(#xFE #xFF #x00 #x61)))
+
+ (pass-if-equal "BOM discarded from start of UTF-16 stream (BE) after seek to 0"
+ '(#\a "a")
+ (bv-read-test* "UTF-16" #vu8(#xFE #xFF #x00 #x61)
+ (lambda (p)
+ (let ((c (read-char p)))
+ (seek p 0 SEEK_SET)
+ (let ((s (read-string p)))
+ (list c s))))))
+
+ (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (BE)"
+ "\uFEFFa"
+ (bv-read-test "UTF-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61)))
+
+ (pass-if-equal "BOM not discarded from UTF-16 stream (BE) after seek to > 0"
+ "\uFEFFa"
+ (bv-read-test* "UTF-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61)
+ (lambda (p)
+ (seek p 2 SEEK_SET)
+ (read-string p))))
+
+ (pass-if-equal "BOM not discarded unless at start of UTF-16 stream"
+ "a\uFEFFb"
+ (let ((be (bv-read-test "UTF-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62)))
+ (le (bv-read-test "UTF-16" #vu8(#x61 #x00 #xFF #xFE #x62 #x00))))
+ (if (char=? #\a (string-ref be 0))
+ be
+ le)))
+
+ (pass-if-equal "BOM discarded from start of UTF-16 stream (LE)"
+ "a"
+ (bv-read-test "UTF-16" #vu8(#xFF #xFE #x61 #x00)))
+
+ (pass-if-equal "BOM discarded from start of UTF-16 stream (LE) after seek to 0"
+ '(#\a "a")
+ (bv-read-test* "UTF-16" #vu8(#xFF #xFE #x61 #x00)
+ (lambda (p)
+ (let ((c (read-char p)))
+ (seek p 0 SEEK_SET)
+ (let ((s (read-string p)))
+ (list c s))))))
+
+ (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (LE)"
+ "\uFEFFa"
+ (bv-read-test "UTF-16" #vu8(#xFF #xFE #xFF #xFE #x61 #x00)))
+
+ (pass-if-equal "BOM not discarded from UTF-16 stream (LE) after seek to > 0"
+ "\uFEFFa"
+ (bv-read-test* "UTF-16" #vu8(#xFF #xFE #xFF #xFE #x61 #x00)
+ (lambda (p)
+ (seek p 2 SEEK_SET)
+ (read-string p))))
+
+ (pass-if-equal "BOM discarded from start of UTF-32 stream (BE)"
+ "a"
+ (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+ #x00 #x00 #x00 #x61)))
+
+ (pass-if-equal "BOM discarded from start of UTF-32 stream (BE) after seek to 0"
+ '(#\a "a")
+ (bv-read-test* "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+ #x00 #x00 #x00 #x61)
+ (lambda (p)
+ (let ((c (read-char p)))
+ (seek p 0 SEEK_SET)
+ (let ((s (read-string p)))
+ (list c s))))))
+
+ (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (BE)"
+ "\uFEFFa"
+ (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+ #x00 #x00 #xFE #xFF
+ #x00 #x00 #x00 #x61)))
+
+ (pass-if-equal "BOM not discarded from UTF-32 stream (BE) after seek to > 0"
+ "\uFEFFa"
+ (bv-read-test* "UTF-32" #vu8(#x00 #x00 #xFE #xFF
+ #x00 #x00 #xFE #xFF
+ #x00 #x00 #x00 #x61)
+ (lambda (p)
+ (seek p 4 SEEK_SET)
+ (read-string p))))
+
+ (pass-if-equal "BOM discarded within UTF-16 stream (BE) after set-port-encoding!"
+ "ab"
+ (bv-read-test* "UTF-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62)
+ (lambda (p)
+ (let ((a (read-char p)))
+ (set-port-encoding! p "UTF-16")
+ (string a (read-char p))))))
+
+ (pass-if-equal "BOM discarded within UTF-16 stream (LE,BE) after set-port-encoding!"
+ "ab"
+ (bv-read-test* "UTF-16" #vu8(#x00 #x61 #xFF #xFE #x62 #x00)
+ (lambda (p)
+ (let ((a (read-char p)))
+ (set-port-encoding! p "UTF-16")
+ (string a (read-char p))))))
+
+ (pass-if-equal "BOM discarded within UTF-32 stream (BE) after set-port-encoding!"
+ "ab"
+ (bv-read-test* "UTF-32" #vu8(#x00 #x00 #x00 #x61
+ #x00 #x00 #xFE #xFF
+ #x00 #x00 #x00 #x62)
+ (lambda (p)
+ (let ((a (read-char p)))
+ (set-port-encoding! p "UTF-32")
+ (string a (read-char p))))))
+
+ (pass-if-equal "BOM discarded within UTF-32 stream (LE,BE) after set-port-encoding!"
+ "ab"
+ (bv-read-test* "UTF-32" #vu8(#x00 #x00 #x00 #x61
+ #xFF #xFE #x00 #x00
+ #x62 #x00 #x00 #x00)
+ (lambda (p)
+ (let ((a (read-char p)))
+ (set-port-encoding! p "UTF-32")
+ (string a (read-char p))))))
+
+ (pass-if-equal "BOM not discarded unless at start of UTF-32 stream"
+ "a\uFEFFb"
+ (let ((be (bv-read-test "UTF-32" #vu8(#x00 #x00 #x00 #x61
+ #x00 #x00 #xFE #xFF
+ #x00 #x00 #x00 #x62)))
+ (le (bv-read-test "UTF-32" #vu8(#x61 #x00 #x00 #x00
+ #xFF #xFE #x00 #x00
+ #x62 #x00 #x00 #x00))))
+ (if (char=? #\a (string-ref be 0))
+ be
+ le)))
+
+ (pass-if-equal "BOM discarded from start of UTF-32 stream (LE)"
+ "a"
+ (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00
+ #x61 #x00 #x00 #x00)))
+
+ (pass-if-equal "BOM discarded from start of UTF-32 stream (LE) after seek to 0"
+ '(#\a "a")
+ (bv-read-test* "UTF-32" #vu8(#xFF #xFE #x00 #x00
+ #x61 #x00 #x00 #x00)
+ (lambda (p)
+ (let ((c (read-char p)))
+ (seek p 0 SEEK_SET)
+ (let ((s (read-string p)))
+ (list c s))))))
+
+ (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (LE)"
+ "\uFEFFa"
+ (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00
+ #xFF #xFE #x00 #x00
+ #x61 #x00 #x00 #x00)))
+
+ )
+
+\f
+
(define-syntax-rule (with-load-path path body ...)
(let ((new path)
(old %load-path))
--
1.7.10.4