From: Andy Wingo <wingo@pobox.com>
To: Mark H Weaver <mhw@netris.org>
Cc: "Ludovic Courtès" <ludo@gnu.org>, guile-devel@gnu.org
Subject: Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs)
Date: Thu, 04 Apr 2013 22:50:35 +0200 [thread overview]
Message-ID: <8738v69g10.fsf@pobox.com> (raw)
In-Reply-To: <87zjxfxsl5.fsf@tines.lan> (Mark H. Weaver's message of "Wed, 03 Apr 2013 16:33:10 -0400")
Hi. The following review applies to the wrong version of this patch.
I'll go ahead and post it anyway.
On Wed 03 Apr 2013 22:33, Mark H Weaver <mhw@netris.org> writes:
> + /* If we just read a BOM in an encoding that recognizes them,
> + then silently consume it and read another code point. */
> + if (SCM_UNLIKELY (*codepoint == SCM_UNICODE_BOM
> + && (strcasecmp (pt->encoding, "UTF-8") == 0
> + || strcasecmp (pt->encoding, "UTF-16") == 0
> + || strcasecmp (pt->encoding, "UTF-32") == 0)))
> + return get_codepoint (port, codepoint, buf, len);
Don't we have an enumerated value for UTF-8? Also I thought the
documentation noted that we don't consume BOM for UTF-8.
> +static int
> +looking_at_bytes (SCM port, const unsigned char *bytes, int len)
> +{
> + scm_t_port *pt = SCM_PTAB_ENTRY (port);
> + int result;
> + int i = 0;
> +
> + while (i < len && scm_peek_byte_or_eof (port) == bytes[i])
> + {
> + pt->read_pos++;
> + i++;
> + }
> +
> + result = (i == len);
> +
> + while (i > 0)
> + scm_unget_byte (bytes[--i], port);
> +
> + return result;
> +}
Very subtle ;) But looks good.
> +static const unsigned char scm_utf8_bom[3] = {0xEF, 0xBB, 0xBF};
> +static const unsigned char scm_utf16be_bom[2] = {0xFE, 0xFF};
> +static const unsigned char scm_utf16le_bom[2] = {0xFF, 0xFE};
> +static const unsigned char scm_utf32be_bom[4] = {0x00, 0x00, 0xFE, 0xFF};
> +static const unsigned char scm_utf32le_bom[4] = {0xFF, 0xFE, 0x00, 0x00};
Does it not work to leave out the number? i.e. foo[] instead of
foo[3]. Would be nicer if that works otherwise it's fine.
> +/* Decide what endianness to use for a UTF-16 port. Return "UTF-16BE"
> + or "UTF-16LE". MODE must be either SCM_PORT_READ or SCM_PORT_WRITE,
> + and specifies which operation is about to be done. The MODE
> + determines how we will decide the endianness. We deliberately avoid
> + reading from the port unless the user is about to do so. If the user
> + is about to read, then we look for a BOM, and if present, we use it
> + to determine the endianness. Otherwise we choose big-endian, as
> + recommended by the Unicode Consortium. */
I am surprised this does not default to native endianness! But OK :)
> +static const char *
> +decide_utf16_encoding (SCM port, scm_t_port_rw_active mode)
> +{
> + if (mode == SCM_PORT_READ
> + && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read
> + && looking_at_bytes (port, scm_utf16le_bom, sizeof scm_utf16le_bom))
> + return "UTF-16LE";
> + else
> + return "UTF-16BE";
> +}
> +
> +/* Decide what endianness to use for a UTF-32 port. Return "UTF-32BE"
> + or "UTF-32LE". See the comment above 'decide_utf16_encoding' for
> + details. */
> +static const char *
> +decide_utf32_encoding (SCM port, scm_t_port_rw_active mode)
> +{
> + if (mode == SCM_PORT_READ
> + && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read
> + && looking_at_bytes (port, scm_utf32le_bom, sizeof scm_utf32le_bom))
> + return "UTF-32LE";
> + else
> + return "UTF-32BE";
> +}
> +
Why don't these consume the BOM? They just use the BOM to detect the
endianness but leave the at_stream_start_for_bom_read flag to 1 I guess?
Probably deserves a comment.
> + /* If the specified encoding is UTF-16 or UTF-32, then make
> + that more precise by deciding what endianness to use. */
> + if (strcasecmp (pt->encoding, "UTF-16") == 0)
> + precise_encoding = decide_utf16_encoding (port, mode);
> + else if (strcasecmp (pt->encoding, "UTF-32") == 0)
> + precise_encoding = decide_utf32_encoding (port, mode);
Ideally these comparisons would not be locale-dependent. Dunno.
> @@ -2377,28 +2480,27 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
> pti = SCM_PORT_GET_INTERNAL (port);
> prev = pti->iconv_descriptors;
>
> + /* In order to handle cases where the encoding changes mid-stream
> + (e.g. within an HTTP stream, or within a file that is composed of
> + segments with different encodings), we consider this to be "stream
> + start" for purposes of BOM handling, regardless of our actual file
> + position. */
> + pti->at_stream_start_for_bom_read = 1;
> + pti->at_stream_start_for_bom_write = 1;
> +
I dunno. If I receive an HTTP response as a bytevector, parse it to a
Unicode (UTF-{8,16,32}) string successfully, and then convert it back to
bytes, I feel like I should get the same sequence of bytes back. Is
that unreasonable?
> + if (SCM_UNLIKELY (pti->at_stream_start_for_bom_write && len > 0))
> + {
> + scm_t_port *pt = SCM_PTAB_ENTRY (port);
> +
> + /* Record that we're no longer at stream start. */
> + pti->at_stream_start_for_bom_write = 0;
> + if (pt->rw_random)
> + pti->at_stream_start_for_bom_read = 0;
> +
> + /* Write a BOM if appropriate. */
> + if (SCM_UNLIKELY (strcasecmp(pt->encoding, "UTF-16") == 0
> + || strcasecmp(pt->encoding, "UTF-32") == 0))
> + display_character (SCM_UNICODE_BOM, port, iconveh_error);
> + }
Perhaps only set these flags if we are in UTF-16 or UTF-32 to begin
with instead of pushing this strcasecmp out to print.c.
> + (pass-if-equal "BOM discarded from start of UTF-8 stream"
> + "a"
> + (bv-read-test "Utf-8" #vu8(#xEF #xBB #xBF #x61)))
This test contradicts the docs regarding BOM consumption for UTF-8
streams, no?
Quitting because I realized that you have two new patches (!).
Andy
--
http://wingolog.org/
next prev parent reply other threads:[~2013-04-04 20:50 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-04-03 10:44 [PATCH] Improve handling of Unicode byte-order marks (BOMs) Mark H Weaver
2013-04-03 11:47 ` Mark H Weaver
2013-04-03 11:58 ` Ludovic Courtès
2013-04-03 19:28 ` Mark H Weaver
2013-04-03 20:11 ` Ludovic Courtès
2013-04-03 20:33 ` Mark H Weaver
2013-04-03 20:48 ` Mike Gran
2013-04-03 22:24 ` Mark H Weaver
2013-04-04 5:59 ` Mark H Weaver
2013-04-04 20:50 ` Andy Wingo [this message]
2013-04-05 7:30 ` Mark H Weaver
2013-04-05 7:42 ` Mike Gran
2013-04-05 10:04 ` Ludovic Courtès
2013-04-05 18:15 ` Mark H Weaver
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: https://www.gnu.org/software/guile/
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=8738v69g10.fsf@pobox.com \
--to=wingo@pobox.com \
--cc=guile-devel@gnu.org \
--cc=ludo@gnu.org \
--cc=mhw@netris.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).