From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Andy Wingo Newsgroups: gmane.lisp.guile.devel Subject: Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs) Date: Thu, 04 Apr 2013 22:50:35 +0200 Message-ID: <8738v69g10.fsf@pobox.com> References: <87ip43zyf0.fsf@tines.lan> <87r4irq0zp.fsf@gnu.org> <874nfnza5g.fsf@tines.lan> <87a9pfml1n.fsf@gnu.org> <87zjxfxsl5.fsf@tines.lan> NNTP-Posting-Host: plane.gmane.org Mime-Version: 1.0 Content-Type: text/plain X-Trace: ger.gmane.org 1365108751 8546 80.91.229.3 (4 Apr 2013 20:52:31 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Thu, 4 Apr 2013 20:52:31 +0000 (UTC) Cc: Ludovic =?utf-8?Q?Court=C3=A8s?= , guile-devel@gnu.org To: Mark H Weaver Original-X-From: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Thu Apr 04 22:52:58 2013 Return-path: Envelope-to: guile-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1UNr9W-0006iY-L6 for guile-devel@m.gmane.org; Thu, 04 Apr 2013 22:52:58 +0200 Original-Received: from localhost ([::1]:38055 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1UNr97-0001Tv-Sb for guile-devel@m.gmane.org; Thu, 04 Apr 2013 16:52:33 -0400 Original-Received: from eggs.gnu.org ([208.118.235.92]:42417) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1UNr8z-0001Jg-9R for guile-devel@gnu.org; Thu, 04 Apr 2013 16:52:31 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1UNr8t-0000fJ-VZ for guile-devel@gnu.org; Thu, 04 Apr 2013 16:52:25 -0400 Original-Received: from a-pb-sasl-quonix.pobox.com ([208.72.237.25]:55533 helo=sasl.smtp.pobox.com) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1UNr8t-0000Da-R6; Thu, 04 Apr 2013 16:52:19 -0400 Original-Received: from sasl.smtp.pobox.com (unknown [127.0.0.1]) by a-pb-sasl-quonix.pobox.com (Postfix) with ESMTP id 66F04BC00; Thu, 4 Apr 2013 16:50:42 -0400 (EDT) DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=pobox.com; h=from:to:cc :subject:references:date:in-reply-to:message-id:mime-version :content-type; s=sasl; bh=tZffcPlqQPPC35ecy8u24uH5fc4=; b=Df+GQQ eB8zbAWvlq5FxrVpxTbvhc3A7cZUQ7zLS3N6JujxoExdNoq2AEFXPBfiijwWF8FL 1vCbJJ27kWm8KmNSj9vQ4X6jXQ9nNu/SD9O58aXzWYAH64Eu0nufCPk1neRJagYF RLp8mKMjeMeWU/DxxpYWa8Nh/LyKAPfGXrXiw= DomainKey-Signature: a=rsa-sha1; c=nofws; d=pobox.com; h=from:to:cc :subject:references:date:in-reply-to:message-id:mime-version :content-type; q=dns; s=sasl; b=a18xZ3n8bNwExf1N8gsfe+DyHxT+pvX1 RWH6xnXlG1e16m+Dys5colYHLHLCE0yfJr704+qGQGsRnJ6JF/+BW5b8OH/NULxh u1H5tXBxnEUzay4+2+Nc37bqJ3coYT28gRSmiQ4T1kstOI/9oyVxHSOJ/LEgbQ2n jhQYxTO9Pq0= Original-Received: from a-pb-sasl-quonix.pobox.com (unknown [127.0.0.1]) by a-pb-sasl-quonix.pobox.com (Postfix) with ESMTP id 5C242BBFF; Thu, 4 Apr 2013 16:50:42 -0400 (EDT) Original-Received: from badger (unknown [88.160.190.192]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by a-pb-sasl-quonix.pobox.com (Postfix) with ESMTPSA id 95365BBFE; Thu, 4 Apr 2013 16:50:41 -0400 (EDT) In-Reply-To: <87zjxfxsl5.fsf@tines.lan> (Mark H. Weaver's message of "Wed, 03 Apr 2013 16:33:10 -0400") User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.2 (gnu/linux) X-Pobox-Relay-ID: 503653EA-9D69-11E2-8857-D36F0E5B5709-02397024!a-pb-sasl-quonix.pobox.com X-detected-operating-system: by eggs.gnu.org: Solaris 10 X-Received-From: 208.72.237.25 X-BeenThere: guile-devel@gnu.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: "Developers list for Guile, the GNU extensibility library" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Original-Sender: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.lisp.guile.devel:16141 Archived-At: Hi. The following review applies to the wrong version of this patch. I'll go ahead and post it anyway. On Wed 03 Apr 2013 22:33, Mark H Weaver writes: > + /* If we just read a BOM in an encoding that recognizes them, > + then silently consume it and read another code point. */ > + if (SCM_UNLIKELY (*codepoint == SCM_UNICODE_BOM > + && (strcasecmp (pt->encoding, "UTF-8") == 0 > + || strcasecmp (pt->encoding, "UTF-16") == 0 > + || strcasecmp (pt->encoding, "UTF-32") == 0))) > + return get_codepoint (port, codepoint, buf, len); Don't we have an enumerated value for UTF-8? Also I thought the documentation noted that we don't consume BOM for UTF-8. > +static int > +looking_at_bytes (SCM port, const unsigned char *bytes, int len) > +{ > + scm_t_port *pt = SCM_PTAB_ENTRY (port); > + int result; > + int i = 0; > + > + while (i < len && scm_peek_byte_or_eof (port) == bytes[i]) > + { > + pt->read_pos++; > + i++; > + } > + > + result = (i == len); > + > + while (i > 0) > + scm_unget_byte (bytes[--i], port); > + > + return result; > +} Very subtle ;) But looks good. > +static const unsigned char scm_utf8_bom[3] = {0xEF, 0xBB, 0xBF}; > +static const unsigned char scm_utf16be_bom[2] = {0xFE, 0xFF}; > +static const unsigned char scm_utf16le_bom[2] = {0xFF, 0xFE}; > +static const unsigned char scm_utf32be_bom[4] = {0x00, 0x00, 0xFE, 0xFF}; > +static const unsigned char scm_utf32le_bom[4] = {0xFF, 0xFE, 0x00, 0x00}; Does it not work to leave out the number? i.e. foo[] instead of foo[3]. Would be nicer if that works otherwise it's fine. > +/* Decide what endianness to use for a UTF-16 port. Return "UTF-16BE" > + or "UTF-16LE". MODE must be either SCM_PORT_READ or SCM_PORT_WRITE, > + and specifies which operation is about to be done. The MODE > + determines how we will decide the endianness. We deliberately avoid > + reading from the port unless the user is about to do so. If the user > + is about to read, then we look for a BOM, and if present, we use it > + to determine the endianness. Otherwise we choose big-endian, as > + recommended by the Unicode Consortium. */ I am surprised this does not default to native endianness! But OK :) > +static const char * > +decide_utf16_encoding (SCM port, scm_t_port_rw_active mode) > +{ > + if (mode == SCM_PORT_READ > + && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read > + && looking_at_bytes (port, scm_utf16le_bom, sizeof scm_utf16le_bom)) > + return "UTF-16LE"; > + else > + return "UTF-16BE"; > +} > + > +/* Decide what endianness to use for a UTF-32 port. Return "UTF-32BE" > + or "UTF-32LE". See the comment above 'decide_utf16_encoding' for > + details. */ > +static const char * > +decide_utf32_encoding (SCM port, scm_t_port_rw_active mode) > +{ > + if (mode == SCM_PORT_READ > + && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read > + && looking_at_bytes (port, scm_utf32le_bom, sizeof scm_utf32le_bom)) > + return "UTF-32LE"; > + else > + return "UTF-32BE"; > +} > + Why don't these consume the BOM? They just use the BOM to detect the endianness but leave the at_stream_start_for_bom_read flag to 1 I guess? Probably deserves a comment. > + /* If the specified encoding is UTF-16 or UTF-32, then make > + that more precise by deciding what endianness to use. */ > + if (strcasecmp (pt->encoding, "UTF-16") == 0) > + precise_encoding = decide_utf16_encoding (port, mode); > + else if (strcasecmp (pt->encoding, "UTF-32") == 0) > + precise_encoding = decide_utf32_encoding (port, mode); Ideally these comparisons would not be locale-dependent. Dunno. > @@ -2377,28 +2480,27 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding) > pti = SCM_PORT_GET_INTERNAL (port); > prev = pti->iconv_descriptors; > > + /* In order to handle cases where the encoding changes mid-stream > + (e.g. within an HTTP stream, or within a file that is composed of > + segments with different encodings), we consider this to be "stream > + start" for purposes of BOM handling, regardless of our actual file > + position. */ > + pti->at_stream_start_for_bom_read = 1; > + pti->at_stream_start_for_bom_write = 1; > + I dunno. If I receive an HTTP response as a bytevector, parse it to a Unicode (UTF-{8,16,32}) string successfully, and then convert it back to bytes, I feel like I should get the same sequence of bytes back. Is that unreasonable? > + if (SCM_UNLIKELY (pti->at_stream_start_for_bom_write && len > 0)) > + { > + scm_t_port *pt = SCM_PTAB_ENTRY (port); > + > + /* Record that we're no longer at stream start. */ > + pti->at_stream_start_for_bom_write = 0; > + if (pt->rw_random) > + pti->at_stream_start_for_bom_read = 0; > + > + /* Write a BOM if appropriate. */ > + if (SCM_UNLIKELY (strcasecmp(pt->encoding, "UTF-16") == 0 > + || strcasecmp(pt->encoding, "UTF-32") == 0)) > + display_character (SCM_UNICODE_BOM, port, iconveh_error); > + } Perhaps only set these flags if we are in UTF-16 or UTF-32 to begin with instead of pushing this strcasecmp out to print.c. > + (pass-if-equal "BOM discarded from start of UTF-8 stream" > + "a" > + (bv-read-test "Utf-8" #vu8(#xEF #xBB #xBF #x61))) This test contradicts the docs regarding BOM consumption for UTF-8 streams, no? Quitting because I realized that you have two new patches (!). Andy -- http://wingolog.org/