From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mark H Weaver
Newsgroups: gmane.lisp.guile.devel
Subject: Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)
Date: Wed, 03 Apr 2013 07:47:41 -0400
Message-ID: <87bo9vzvhe.fsf@tines.lan>
References: <87ip43zyf0.fsf@tines.lan>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
To: guile-devel@gnu.org
In-Reply-To:
 <87ip43zyf0.fsf@tines.lan> (Mark H. Weaver's message of "Wed, 03 Apr 2013 06:44:19 -0400")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3 (gnu/linux)
List-Id: "Developers list for Guile, the GNU extensibility library"

--=-=-=
Content-Type: text/plain

Here's an improved version of the patch.  Mainly it adds more tests.

Also, I forgot to mention that binary I/O does not affect the "start of
stream" flags at all.  This is mainly for efficiency reasons, but even
so, I don't feel too bad about it.

    Mark

--=-=-=
Content-Type: text/x-diff
Content-Disposition: inline; filename=0001-Improve-handling-of-Unicode-byte-order-marks-BOMs.patch
Content-Description: [PATCH] Improve handling of Unicode byte-order marks (BOMs)

>From d8d37d5519ca61961b70cb3051ccca2be7d4affa Mon Sep 17 00:00:00 2001
From: Mark H Weaver
Date: Wed, 3 Apr 2013 04:22:04 -0400
Subject: [PATCH] Improve handling of Unicode byte-order marks (BOMs).

* libguile/ports-internal.h (struct scm_port_internal): Add new members
  'at_stream_start_for_bom_read' and 'at_stream_start_for_bom_write'.
  (SCM_UNICODE_BOM): New macro.
  (scm_i_port_iconv_descriptors): Add 'mode' parameter to prototype.

* libguile/ports.c (scm_new_port_table_entry): Initialize
  'at_stream_start_for_bom_read' and 'at_stream_start_for_bom_write'.
  (get_iconv_codepoint): Pass new 'mode' parameter to
  'scm_i_port_iconv_descriptors'.
  (get_codepoint): After reading a codepoint at stream start, record
  that we're no longer at stream start, and consume a BOM where
  appropriate.
  (scm_seek): Set the stream start flags according to the new position.
  (looking_at_bytes): New static function.
  (scm_utf8_bom, scm_utf16be_bom, scm_utf16le_bom, scm_utf32be_bom,
  scm_utf32le_bom): New static const arrays.
  (decide_utf16_encoding, decide_utf32_encoding): New static functions.
  (scm_i_port_iconv_descriptors): Add new 'mode' parameter.  If the
  specified encoding is UTF-16 or UTF-32, make that precise by deciding
  what endianness to use, and construct iconv descriptors based on the
  precise encoding.
  (scm_i_set_port_encoding_x): Record that we are now at stream start.
  Do not open the new iconv descriptors immediately; let them be
  initialized lazily.

* libguile/print.c (display_string_using_iconv): Record that we're no
  longer at stream start.  Write a BOM if appropriate.

* test-suite/tests/ports.test ("set-port-encoding!, wrong encoding"):
  Adapt test to cope with the fact that 'set-port-encoding!' does not
  immediately open the iconv descriptors.
  (bv-read-test): New procedure.
  ("unicode byte-order marks (BOMs)"): New test prefix.
--- libguile/ports-internal.h | 7 +- libguile/ports.c | 134 +++++++++++++++++--- libguile/print.c | 18 ++- test-suite/tests/ports.test | 293 ++++++++++++++++++++++++++++++++++++++++++- 4 files changed, 433 insertions(+), 19 deletions(-) diff --git a/libguile/ports-internal.h b/libguile/ports-internal.h index 73a788f..cd1746b 100644 --- a/libguile/ports-internal.h +++ b/libguile/ports-internal.h @@ -48,14 +48,19 @@ struct scm_port_internal { scm_t_port_encoding_mode encoding_mode; scm_t_iconv_descriptors *iconv_descriptors; + int at_stream_start_for_bom_read; + int at_stream_start_for_bom_write; SCM alist; }; typedef struct scm_port_internal scm_t_port_internal; +#define SCM_UNICODE_BOM 0xFEFF /* Unicode byte-order mark */ + #define SCM_PORT_GET_INTERNAL(x) \ ((scm_t_port_internal *) (SCM_PTAB_ENTRY(x)->input_cd)) -SCM_INTERNAL scm_t_iconv_descriptors *scm_i_port_iconv_descriptors (SCM port); +SCM_INTERNAL scm_t_iconv_descriptors * +scm_i_port_iconv_descriptors (SCM port, scm_t_port_rw_active mode); #endif diff --git a/libguile/ports.c b/libguile/ports.c index 51145e6..99261da 100644 --- a/libguile/ports.c +++ b/libguile/ports.c @@ -639,6 +639,9 @@ scm_new_port_table_entry (scm_t_bits tag) pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV; pti->iconv_descriptors = NULL; + pti->at_stream_start_for_bom_read = 1; + pti->at_stream_start_for_bom_write = 1; + /* XXX These fields are not what they seem. They have been repurposed, but cannot safely be renamed in 2.0 without breaking ABI compatibility. This will be cleaned up in 2.2. 
*/ @@ -1306,10 +1309,12 @@ static int get_iconv_codepoint (SCM port, scm_t_wchar *codepoint, char buf[SCM_MBCHAR_BUF_SIZE], size_t *len) { - scm_t_iconv_descriptors *id = scm_i_port_iconv_descriptors (port); + scm_t_iconv_descriptors *id; scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE]; size_t input_size = 0; + id = scm_i_port_iconv_descriptors (port, SCM_PORT_READ); + for (;;) { int byte_read; @@ -1393,7 +1398,24 @@ get_codepoint (SCM port, scm_t_wchar *codepoint, err = get_iconv_codepoint (port, codepoint, buf, len); if (SCM_LIKELY (err == 0)) - update_port_lf (*codepoint, port); + { + if (SCM_UNLIKELY (pti->at_stream_start_for_bom_read)) + { + /* Record that we're no longer at stream start. */ + pti->at_stream_start_for_bom_read = 0; + if (pt->rw_random) + pti->at_stream_start_for_bom_write = 0; + + /* If we just read a BOM in an encoding that recognizes them, + then silently consume it and read another code point. */ + if (SCM_UNLIKELY (*codepoint == SCM_UNICODE_BOM + && (strcmp(pt->encoding, "UTF-8") == 0 + || strcmp(pt->encoding, "UTF-16") == 0 + || strcmp(pt->encoding, "UTF-32") == 0))) + return get_codepoint (port, codepoint, buf, len); + } + update_port_lf (*codepoint, port); + } else if (pt->ilseq_handler == SCM_ICONVEH_QUESTION_MARK) { *codepoint = '?'; @@ -2006,6 +2028,7 @@ SCM_DEFINE (scm_seek, "seek", 3, 0, 0, if (SCM_OPPORTP (fd_port)) { + scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (fd_port); scm_t_ptob_descriptor *ptob = scm_ptobs + SCM_PTOBNUM (fd_port); off_t_or_off64_t off = scm_to_off_t_or_off64_t (offset); off_t_or_off64_t rv; @@ -2015,6 +2038,11 @@ SCM_DEFINE (scm_seek, "seek", 3, 0, 0, scm_cons (fd_port, SCM_EOL)); else rv = ptob->seek (fd_port, off, how); + + /* Set stream-start flags according to new position. */ + pti->at_stream_start_for_bom_read = (rv == 0); + pti->at_stream_start_for_bom_write = (rv == 0); + return scm_from_off_t_or_off64_t (rv); } else /* file descriptor?. 
*/ @@ -2265,6 +2293,66 @@ scm_i_default_port_encoding (void) } } +/* If the next LEN bytes from port are equal to those in BYTES, then + return 1, else return 0. Leave the port position unchanged. */ +static int +looking_at_bytes (SCM port, const unsigned char *bytes, int len) +{ + scm_t_port *pt = SCM_PTAB_ENTRY (port); + int result; + int i = 0; + + while (i < len && scm_peek_byte_or_eof (port) == bytes[i]) + { + pt->read_pos++; + i++; + } + + result = (i == len); + + while (i > 0) + scm_unget_byte (bytes[--i], port); + + return result; +} + +static const unsigned char scm_utf8_bom[3] = {0xEF, 0xBB, 0xBF}; +static const unsigned char scm_utf16be_bom[2] = {0xFE, 0xFF}; +static const unsigned char scm_utf16le_bom[2] = {0xFF, 0xFE}; +static const unsigned char scm_utf32be_bom[4] = {0x00, 0x00, 0xFE, 0xFF}; +static const unsigned char scm_utf32le_bom[4] = {0xFF, 0xFE, 0x00, 0x00}; + +/* Decide what endianness to use for a UTF-16 port. Return "UTF-16BE" + or "UTF-16LE". MODE must be either SCM_PORT_READ or SCM_PORT_WRITE, + and specifies which operation is about to be done. The MODE + determines how we will decide the endianness. We deliberately avoid + reading from the port unless the user is about to do so. If the user + is about to read, then we look for a BOM, and if present, we use it + to determine the endianness. Otherwise we choose big-endian, as + recommended by the Unicode Consortium. */ +static char * +decide_utf16_encoding (SCM port, scm_t_port_rw_active mode) +{ + if (mode == SCM_PORT_READ + && looking_at_bytes (port, scm_utf16le_bom, sizeof scm_utf16le_bom)) + return "UTF-16LE"; + else + return "UTF-16BE"; +} + +/* Decide what endianness to use for a UTF-32 port. Return "UTF-32BE" + or "UTF-32LE". See the comment above 'decide_utf16_encoding' for + details. 
*/ +static char * +decide_utf32_encoding (SCM port, scm_t_port_rw_active mode) +{ + if (mode == SCM_PORT_READ + && looking_at_bytes (port, scm_utf32le_bom, sizeof scm_utf32le_bom)) + return "UTF-32LE"; + else + return "UTF-32BE"; +} + static void finalize_iconv_descriptors (void *ptr, void *data) { @@ -2341,23 +2429,36 @@ close_iconv_descriptors (scm_t_iconv_descriptors *id) id->output_cd = (void *) -1; } +/* Return the iconv_descriptors, initializing them if necessary. MODE + must be either SCM_PORT_READ or SCM_PORT_WRITE, and specifies which + operation is about to be done. We deliberately avoid reading from + the port unless the user was about to do so. */ scm_t_iconv_descriptors * -scm_i_port_iconv_descriptors (SCM port) +scm_i_port_iconv_descriptors (SCM port, scm_t_port_rw_active mode) { - scm_t_port *pt; - scm_t_port_internal *pti; - - pt = SCM_PTAB_ENTRY (port); - pti = SCM_PORT_GET_INTERNAL (port); + scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (port); assert (pti->encoding_mode == SCM_PORT_ENCODING_MODE_ICONV); if (!pti->iconv_descriptors) { + scm_t_port *pt = SCM_PTAB_ENTRY (port); + char *precise_encoding; + if (!pt->encoding) pt->encoding = "ISO-8859-1"; + + /* If the specified encoding is UTF-16 or UTF-32, then make + that more precise by deciding what endianness to use. */ + if (strcmp (pt->encoding, "UTF-16") == 0) + precise_encoding = decide_utf16_encoding (port, mode); + else if (strcmp (pt->encoding, "UTF-32") == 0) + precise_encoding = decide_utf32_encoding (port, mode); + else + precise_encoding = pt->encoding; + pti->iconv_descriptors = - open_iconv_descriptors (pt->encoding, + open_iconv_descriptors (precise_encoding, SCM_INPUT_PORT_P (port), SCM_OUTPUT_PORT_P (port)); } @@ -2377,6 +2478,14 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding) pti = SCM_PORT_GET_INTERNAL (port); prev = pti->iconv_descriptors; + /* In order to handle cases where the encoding changes mid-stream + (e.g. 
within an HTTP stream, or within a file that is composed of + segments with different encodings), we consider this to be "stream + start" for purposes of BOM handling, regardless of our actual file + position. */ + pti->at_stream_start_for_bom_read = 1; + pti->at_stream_start_for_bom_write = 1; + if (encoding == NULL) encoding = "ISO-8859-1"; @@ -2387,19 +2496,14 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding) { pt->encoding = "UTF-8"; pti->encoding_mode = SCM_PORT_ENCODING_MODE_UTF8; - pti->iconv_descriptors = NULL; } else { - /* Open descriptors before mutating the port. */ - pti->iconv_descriptors = - open_iconv_descriptors (encoding, - SCM_INPUT_PORT_P (port), - SCM_OUTPUT_PORT_P (port)); pt->encoding = scm_gc_strdup (encoding, "port"); pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV; } + pti->iconv_descriptors = NULL; if (prev) close_iconv_descriptors (prev); } diff --git a/libguile/print.c b/libguile/print.c index 1572690..b8b13d4 100644 --- a/libguile/print.c +++ b/libguile/print.c @@ -881,8 +881,24 @@ display_string_using_iconv (const void *str, int narrow_p, size_t len, { size_t printed; scm_t_iconv_descriptors *id; + scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (port); - id = scm_i_port_iconv_descriptors (port); + id = scm_i_port_iconv_descriptors (port, SCM_PORT_WRITE); + + if (SCM_UNLIKELY (pti->at_stream_start_for_bom_write && len > 0)) + { + scm_t_port *pt = SCM_PTAB_ENTRY (port); + + /* Record that we're no longer at stream start. */ + pti->at_stream_start_for_bom_write = 0; + if (pt->rw_random) + pti->at_stream_start_for_bom_read = 0; + + /* Write a BOM if appropriate. 
*/ + if (SCM_UNLIKELY (strcmp(pt->encoding, "UTF-16") == 0 + || strcmp(pt->encoding, "UTF-32") == 0)) + display_character (SCM_UNICODE_BOM, port, iconveh_error); + } printed = 0; diff --git a/test-suite/tests/ports.test b/test-suite/tests/ports.test index 886ab24..f966fc3 100644 --- a/test-suite/tests/ports.test +++ b/test-suite/tests/ports.test @@ -24,7 +24,8 @@ #:use-module (ice-9 popen) #:use-module (ice-9 rdelim) #:use-module (rnrs bytevectors) - #:use-module ((rnrs io ports) #:select (open-bytevector-input-port))) + #:use-module ((rnrs io ports) #:select (open-bytevector-input-port + open-bytevector-output-port))) (define (display-line . args) (for-each display args) @@ -918,7 +919,9 @@ (pass-if-exception "set-port-encoding!, wrong encoding" exception:miscellaneous-error - (set-port-encoding! (open-input-string "") "does-not-exist")) + (let ((p (open-input-string ""))) + (set-port-encoding! p "does-not-exist") + (read p))) (pass-if-exception "%default-port-encoding, wrong encoding" exception:miscellaneous-error @@ -1149,6 +1152,292 @@ +(with-test-prefix "unicode byte-order marks (BOMs)" + + (define (bv-read-test* encoding bv proc) + (let ((port (open-bytevector-input-port bv))) + (set-port-encoding! port encoding) + (proc port))) + + (define (bv-read-test encoding bv) + (bv-read-test* encoding bv read-string)) + + (define (bv-write-test* encoding proc) + (call-with-values + (lambda () (open-bytevector-output-port)) + (lambda (port get-bytevector) + (set-port-encoding! 
port encoding) + (proc port) + (get-bytevector)))) + + (define (bv-write-test encoding str) + (bv-write-test* encoding + (lambda (p) + (display str p)))) + + (pass-if-equal "BOM not discarded from Latin-1 stream" + "\xEF\xBB\xBF\x61" + (bv-read-test "ISO-8859-1" #vu8(#xEF #xBB #xBF #x61))) + + (pass-if-equal "BOM not discarded from Latin-2 stream" + "\u010F\u0165\u017C\x61" + (bv-read-test "ISO-8859-2" #vu8(#xEF #xBB #xBF #x61))) + + (pass-if-equal "BOM not discarded from UTF-16BE stream" + "\uFEFF\x61" + (bv-read-test "UTF-16BE" #vu8(#xFE #xFF #x00 #x61))) + + (pass-if-equal "BOM not discarded from UTF-16LE stream" + "\uFEFF\x61" + (bv-read-test "UTF-16LE" #vu8(#xFF #xFE #x61 #x00))) + + (pass-if-equal "BOM not discarded from UTF-32BE stream" + "\uFEFF\x61" + (bv-read-test "UTF-32BE" #vu8(#x00 #x00 #xFE #xFF + #x00 #x00 #x00 #x61))) + + (pass-if-equal "BOM not discarded from UTF-32LE stream" + "\uFEFF\x61" + (bv-read-test "UTF-32LE" #vu8(#xFF #xFE #x00 #x00 + #x61 #x00 #x00 #x00))) + + (pass-if-equal "BOM not written to UTF-8 stream" + #vu8(#x61) + (bv-write-test "UTF-8" "a")) + + (pass-if-equal "BOM not written to UTF-16BE stream" + #vu8(#x00 #x61) + (bv-write-test "UTF-16BE" "a")) + + (pass-if-equal "BOM not written to UTF-16LE stream" + #vu8(#x61 #x00) + (bv-write-test "UTF-16LE" "a")) + + (pass-if-equal "BOM not written to UTF-32BE stream" + #vu8(#x00 #x00 #x00 #x61) + (bv-write-test "UTF-32BE" "a")) + + (pass-if-equal "BOM not written to UTF-32LE stream" + #vu8(#x61 #x00 #x00 #x00) + (bv-write-test "UTF-32LE" "a")) + + (pass-if "Don't read from the port unless user asks to" + (let* ((p (make-soft-port + (vector + (lambda (c) #f) ; write char + (lambda (s) #f) ; write string + (lambda () #f) ; flush + (lambda () (throw 'fail)) ; read char + (lambda () #f)) + "rw"))) + (set-port-encoding! p "UTF-16") + (display "abc" p) + (set-port-encoding! 
p "UTF-32") + (display "def" p) + #t)) + + ;; TODO: test that input and output streams are independent when + ;; appropriate, and linked when appropriate. + + (pass-if-equal "BOM discarded from start of UTF-8 stream" + "a" + (bv-read-test "UTF-8" #vu8(#xEF #xBB #xBF #x61))) + + (pass-if-equal "BOM discarded from start of UTF-8 stream after seek to 0" + '(#\a "a") + (bv-read-test* "UTF-8" #vu8(#xEF #xBB #xBF #x61) + (lambda (p) + (let ((c (read-char p))) + (seek p 0 SEEK_SET) + (let ((s (read-string p))) + (list c s)))))) + + (pass-if-equal "Only one BOM discarded from start of UTF-8 stream" + "\uFEFFa" + (bv-read-test "UTF-8" #vu8(#xEF #xBB #xBF #xEF #xBB #xBF #x61))) + + (pass-if-equal "BOM not discarded from UTF-8 stream after seek to > 0" + "\uFEFFb" + (bv-read-test* "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62) + (lambda (p) + (seek p 1 SEEK_SET) + (read-string p)))) + + (pass-if-equal "BOM not discarded unless at start of UTF-8 stream" + "a\uFEFFb" + (bv-read-test "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62))) + + (pass-if-equal "BOM (BE) written to start of UTF-16 stream" + #vu8(#xFE #xFF #x00 #x61 #x00 #x62) + (bv-write-test "UTF-16" "ab")) + + (pass-if-equal "BOM (BE) written to UTF-16 stream after set-port-encoding!" + #vu8(#xFE #xFF #x00 #x61 #x00 #x62 #xFE #xFF #x00 #x63 #x00 #x64) + (bv-write-test* "UTF-16" + (lambda (p) + (display "ab" p) + (set-port-encoding! 
p "UTF-16") + (display "cd" p)))) + + (pass-if-equal "BOM discarded from start of UTF-16 stream (BE)" + "a" + (bv-read-test "UTF-16" #vu8(#xFE #xFF #x00 #x61))) + + (pass-if-equal "BOM discarded from start of UTF-16 stream (BE) after seek to 0" + '(#\a "a") + (bv-read-test* "UTF-16" #vu8(#xFE #xFF #x00 #x61) + (lambda (p) + (let ((c (read-char p))) + (seek p 0 SEEK_SET) + (let ((s (read-string p))) + (list c s)))))) + + (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (BE)" + "\uFEFFa" + (bv-read-test "UTF-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61))) + + (pass-if-equal "BOM not discarded from UTF-16 stream (BE) after seek to > 0" + "\uFEFFa" + (bv-read-test* "UTF-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61) + (lambda (p) + (seek p 2 SEEK_SET) + (read-string p)))) + + (pass-if-equal "BOM not discarded unless at start of UTF-16 stream" + "a\uFEFFb" + (let ((be (bv-read-test "UTF-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62))) + (le (bv-read-test "UTF-16" #vu8(#x61 #x00 #xFF #xFE #x62 #x00)))) + (if (char=? 
#\a (string-ref be 0)) + be + le))) + + (pass-if-equal "BOM discarded from start of UTF-16 stream (LE)" + "a" + (bv-read-test "UTF-16" #vu8(#xFF #xFE #x61 #x00))) + + (pass-if-equal "BOM discarded from start of UTF-16 stream (LE) after seek to 0" + '(#\a "a") + (bv-read-test* "UTF-16" #vu8(#xFF #xFE #x61 #x00) + (lambda (p) + (let ((c (read-char p))) + (seek p 0 SEEK_SET) + (let ((s (read-string p))) + (list c s)))))) + + (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (LE)" + "\uFEFFa" + (bv-read-test "UTF-16" #vu8(#xFF #xFE #xFF #xFE #x61 #x00))) + + (pass-if-equal "BOM not discarded from UTF-16 stream (LE) after seek to > 0" + "\uFEFFa" + (bv-read-test* "UTF-16" #vu8(#xFF #xFE #xFF #xFE #x61 #x00) + (lambda (p) + (seek p 2 SEEK_SET) + (read-string p)))) + + (pass-if-equal "BOM discarded from start of UTF-32 stream (BE)" + "a" + (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF + #x00 #x00 #x00 #x61))) + + (pass-if-equal "BOM discarded from start of UTF-32 stream (BE) after seek to 0" + '(#\a "a") + (bv-read-test* "UTF-32" #vu8(#x00 #x00 #xFE #xFF + #x00 #x00 #x00 #x61) + (lambda (p) + (let ((c (read-char p))) + (seek p 0 SEEK_SET) + (let ((s (read-string p))) + (list c s)))))) + + (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (BE)" + "\uFEFFa" + (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF + #x00 #x00 #xFE #xFF + #x00 #x00 #x00 #x61))) + + (pass-if-equal "BOM not discarded from UTF-32 stream (BE) after seek to > 0" + "\uFEFFa" + (bv-read-test* "UTF-32" #vu8(#x00 #x00 #xFE #xFF + #x00 #x00 #xFE #xFF + #x00 #x00 #x00 #x61) + (lambda (p) + (seek p 4 SEEK_SET) + (read-string p)))) + + (pass-if-equal "BOM discarded within UTF-16 stream (BE) after set-port-encoding!" + "ab" + (bv-read-test* "UTF-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62) + (lambda (p) + (let ((a (read-char p))) + (set-port-encoding! p "UTF-16") + (string a (read-char p)))))) + + (pass-if-equal "BOM discarded within UTF-16 stream (LE,BE) after set-port-encoding!" 
+ "ab" + (bv-read-test* "UTF-16" #vu8(#x00 #x61 #xFF #xFE #x62 #x00) + (lambda (p) + (let ((a (read-char p))) + (set-port-encoding! p "UTF-16") + (string a (read-char p)))))) + + (pass-if-equal "BOM discarded within UTF-32 stream (BE) after set-port-encoding!" + "ab" + (bv-read-test* "UTF-32" #vu8(#x00 #x00 #x00 #x61 + #x00 #x00 #xFE #xFF + #x00 #x00 #x00 #x62) + (lambda (p) + (let ((a (read-char p))) + (set-port-encoding! p "UTF-32") + (string a (read-char p)))))) + + (pass-if-equal "BOM discarded within UTF-32 stream (LE,BE) after set-port-encoding!" + "ab" + (bv-read-test* "UTF-32" #vu8(#x00 #x00 #x00 #x61 + #xFF #xFE #x00 #x00 + #x62 #x00 #x00 #x00) + (lambda (p) + (let ((a (read-char p))) + (set-port-encoding! p "UTF-32") + (string a (read-char p)))))) + + (pass-if-equal "BOM not discarded unless at start of UTF-32 stream" + "a\uFEFFb" + (let ((be (bv-read-test "UTF-32" #vu8(#x00 #x00 #x00 #x61 + #x00 #x00 #xFE #xFF + #x00 #x00 #x00 #x62))) + (le (bv-read-test "UTF-32" #vu8(#x61 #x00 #x00 #x00 + #xFF #xFE #x00 #x00 + #x62 #x00 #x00 #x00)))) + (if (char=? #\a (string-ref be 0)) + be + le))) + + (pass-if-equal "BOM discarded from start of UTF-32 stream (LE)" + "a" + (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00 + #x61 #x00 #x00 #x00))) + + (pass-if-equal "BOM discarded from start of UTF-32 stream (LE) after seek to 0" + '(#\a "a") + (bv-read-test* "UTF-32" #vu8(#xFF #xFE #x00 #x00 + #x61 #x00 #x00 #x00) + (lambda (p) + (let ((c (read-char p))) + (seek p 0 SEEK_SET) + (let ((s (read-string p))) + (list c s)))))) + + (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (LE)" + "\uFEFFa" + (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00 + #xFF #xFE #x00 #x00 + #x61 #x00 #x00 #x00))) + + ) + + + (define-syntax-rule (with-load-path path body ...) (let ((new path) (old %load-path)) -- 1.7.10.4 --=-=-=--
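For readers skimming the patch, the endianness rule used by
'decide_utf16_encoding' and 'decide_utf32_encoding' can be sketched as a
standalone C fragment.  This is only an illustration under stated
assumptions: it inspects an in-memory buffer instead of a Guile port, and
the names 'starts_with', 'sketch_decide_utf16' and 'sketch_decide_utf32'
are hypothetical helpers, not part of libguile.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not libguile API): return 1 if BUF, which holds
   BUF_LEN bytes, begins with the LEN bytes in PREFIX, else return 0.  */
static int
starts_with (const unsigned char *buf, size_t buf_len,
             const unsigned char *prefix, size_t len)
{
  assert (buf != NULL || buf_len == 0);
  return buf_len >= len && memcmp (buf, prefix, len) == 0;
}

/* Same byte sequences as the little-endian BOM arrays in the patch.  */
static const unsigned char utf16le_bom[2] = {0xFF, 0xFE};
static const unsigned char utf32le_bom[4] = {0xFF, 0xFE, 0x00, 0x00};

/* Sketch of the rule from the patch: only when READING is nonzero do we
   look at the data, and only a little-endian BOM selects little-endian.
   A big-endian BOM, no BOM at all, or a write all select big-endian.  */
static const char *
sketch_decide_utf16 (const unsigned char *buf, size_t buf_len, int reading)
{
  if (reading && starts_with (buf, buf_len, utf16le_bom, sizeof utf16le_bom))
    return "UTF-16LE";
  return "UTF-16BE";
}

/* The UTF-32 case follows the same rule with a four-byte BOM.  */
static const char *
sketch_decide_utf32 (const unsigned char *buf, size_t buf_len, int reading)
{
  if (reading && starts_with (buf, buf_len, utf32le_bom, sizeof utf32le_bom))
    return "UTF-32LE";
  return "UTF-32BE";
}
```

Given the bytes FF FE 61 00 with READING set, the UTF-16 sketch selects
little-endian; given FE FF 00 61, or any data when writing, it selects
big-endian, which is the default the patch attributes to the Unicode
Consortium's recommendation.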