From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Mark H Weaver Newsgroups: gmane.lisp.guile.devel Subject: Re: [PATCH] Improve handling of Unicode byte-order marks (BOMs) Date: Fri, 05 Apr 2013 03:30:53 -0400 Message-ID: <87wqshv3gy.fsf@tines.lan> References: <87ip43zyf0.fsf@tines.lan> <87r4irq0zp.fsf@gnu.org> <874nfnza5g.fsf@tines.lan> <87a9pfml1n.fsf@gnu.org> <87zjxfxsl5.fsf@tines.lan> <8738v69g10.fsf@pobox.com> NNTP-Posting-Host: plane.gmane.org Mime-Version: 1.0 Content-Type: text/plain X-Trace: ger.gmane.org 1365147091 20050 80.91.229.3 (5 Apr 2013 07:31:31 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Fri, 5 Apr 2013 07:31:31 +0000 (UTC) Cc: Ludovic =?utf-8?Q?Court=C3=A8s?= , guile-devel@gnu.org To: Andy Wingo Original-X-From: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Fri Apr 05 09:31:58 2013 Return-path: Envelope-to: guile-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1UO17t-0005Tr-Qx for guile-devel@m.gmane.org; Fri, 05 Apr 2013 09:31:58 +0200 Original-Received: from localhost ([::1]:54842 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1UO17U-0002Om-UN for guile-devel@m.gmane.org; Fri, 05 Apr 2013 03:31:32 -0400 Original-Received: from eggs.gnu.org ([208.118.235.92]:46849) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1UO17K-0002Gm-46 for guile-devel@gnu.org; Fri, 05 Apr 2013 03:31:28 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1UO17B-0001Qs-7l for guile-devel@gnu.org; Fri, 05 Apr 2013 03:31:22 -0400 Original-Received: from world.peace.net ([96.39.62.75]:33371) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1UO17B-0001Qo-3J; Fri, 05 Apr 2013 03:31:13 -0400 Original-Received: from 209-6-91-212.c3-0.smr-ubr1.sbo-smr.ma.cable.rcn.com ([209.6.91.212] helo=tines.lan) by world.peace.net with esmtpsa (TLS1.0:DHE_RSA_AES_128_CBC_SHA1:16) (Exim 4.72) (envelope-from ) id 1UO170-0000v2-Aj; Fri, 05 Apr 2013 03:31:02 -0400 In-Reply-To: <8738v69g10.fsf@pobox.com> (Andy Wingo's message of "Thu, 04 Apr 2013 22:50:35 +0200") User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3 (gnu/linux) X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Received-From: 96.39.62.75 X-BeenThere: guile-devel@gnu.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: "Developers list for Guile, the GNU extensibility library" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Original-Sender: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.lisp.guile.devel:16149 Archived-At: Hi Andy, Andy Wingo writes: > On Wed 03 Apr 2013 22:33, Mark H Weaver writes: > >> + /* If we just read a BOM in an encoding that recognizes them, >> + then silently consume it and read another code point. */ >> + if (SCM_UNLIKELY (*codepoint == SCM_UNICODE_BOM >> + && (strcasecmp (pt->encoding, "UTF-8") == 0 >> + || strcasecmp (pt->encoding, "UTF-16") == 0 >> + || strcasecmp (pt->encoding, "UTF-32") == 0))) >> + return get_codepoint (port, codepoint, buf, len); > > Don't we have an enumerated value for UTF-8? Indeed, good point. I changed this as you suggest, but fwiw the efficiency impact is negligible, because that 'strcasecmp' is only called if a BOM is found at the start of a stream. > Also I thought the > documentation noted that we don't consume BOM for UTF-8. No. For UTF-8, we consume a BOM (if present) but do not produce one. >> +static const unsigned char scm_utf8_bom[3] = {0xEF, 0xBB, 0xBF}; >> +static const unsigned char scm_utf16be_bom[2] = {0xFE, 0xFF}; >> +static const unsigned char scm_utf16le_bom[2] = {0xFF, 0xFE}; >> +static const unsigned char scm_utf32be_bom[4] = {0x00, 0x00, 0xFE, 0xFF}; >> +static const unsigned char scm_utf32le_bom[4] = {0xFF, 0xFE, 0x00, 0x00}; > > Does it not work to leave out the number? i.e. foo[] instead of > foo[3]. Would be nicer if that works otherwise it's fine. It would almost certainly work, but I'm paranoid that some weird compiler might make the array longer than needed, which would be bad. >> +/* Decide what endianness to use for a UTF-16 port. Return "UTF-16BE" >> + or "UTF-16LE". MODE must be either SCM_PORT_READ or SCM_PORT_WRITE, >> + and specifies which operation is about to be done. The MODE >> + determines how we will decide the endianness. We deliberately avoid >> + reading from the port unless the user is about to do so. If the user >> + is about to read, then we look for a BOM, and if present, we use it >> + to determine the endianness. Otherwise we choose big-endian, as >> + recommended by the Unicode Consortium. */ > > I am surprised this does not default to native endianness! But OK :) Remember that "network byte order" is big endian. Uniformity has its benefits. Anyway, in the docs I explicitly reserved the right to change this later. >> +static const char * >> +decide_utf16_encoding (SCM port, scm_t_port_rw_active mode) >> +{ >> + if (mode == SCM_PORT_READ >> + && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read >> + && looking_at_bytes (port, scm_utf16le_bom, sizeof scm_utf16le_bom)) >> + return "UTF-16LE"; >> + else >> + return "UTF-16BE"; >> +} >> + >> +/* Decide what endianness to use for a UTF-32 port. Return "UTF-32BE" >> + or "UTF-32LE". See the comment above 'decide_utf16_encoding' for >> + details. */ >> +static const char * >> +decide_utf32_encoding (SCM port, scm_t_port_rw_active mode) >> +{ >> + if (mode == SCM_PORT_READ >> + && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read >> + && looking_at_bytes (port, scm_utf32le_bom, sizeof scm_utf32le_bom)) >> + return "UTF-32LE"; >> + else >> + return "UTF-32BE"; >> +} >> + > > Why don't these consume the BOM? They just use the BOM to detect the > endianness but leave the at_stream_start_for_bom_read flag to 1 I guess? Right. I did it this way for simplicity and consistency. When you seek to the beginning of the file, the BOM will be consumed again even though this code will not be run. Therefore, we need logic to consume the BOM in 'get_codepoint'. It seems cleaner to do that job in just one place rather than in two places. > Probably deserves a comment. Okay, I added one. >> + /* If the specified encoding is UTF-16 or UTF-32, then make >> + that more precise by deciding what endianness to use. */ >> + if (strcasecmp (pt->encoding, "UTF-16") == 0) >> + precise_encoding = decide_utf16_encoding (port, mode); >> + else if (strcasecmp (pt->encoding, "UTF-32") == 0) >> + precise_encoding = decide_utf32_encoding (port, mode); > > Ideally these comparisons would not be locale-dependent. Dunno. Yes, that would be preferable. We talked about adding an 'ascii_strcasecmp' function. What file do you think it should be defined in? >> @@ -2377,28 +2480,27 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding) >> pti = SCM_PORT_GET_INTERNAL (port); >> prev = pti->iconv_descriptors; >> >> + /* In order to handle cases where the encoding changes mid-stream >> + (e.g. within an HTTP stream, or within a file that is composed of >> + segments with different encodings), we consider this to be "stream >> + start" for purposes of BOM handling, regardless of our actual file >> + position. */ >> + pti->at_stream_start_for_bom_read = 1; >> + pti->at_stream_start_for_bom_write = 1; >> + > > I dunno. If I receive an HTTP response as a bytevector, parse it to a > Unicode (UTF-{8,16,32}) string successfully, and then convert it back to > bytes, I feel like I should get the same sequence of bytes back. Is > that unreasonable? It's not an unreasonable wish, but sadly it's impossible in the presence of automatic BOM removal. The reason is that information is lost when you remove the BOM. This affects the UTF-8, UTF-16, and UTF-32 encodings. Also, I'm not sure how your comment relates to the quoted code above it. >> + if (SCM_UNLIKELY (pti->at_stream_start_for_bom_write && len > 0)) >> + { >> + scm_t_port *pt = SCM_PTAB_ENTRY (port); >> + >> + /* Record that we're no longer at stream start. */ >> + pti->at_stream_start_for_bom_write = 0; >> + if (pt->rw_random) >> + pti->at_stream_start_for_bom_read = 0; >> + >> + /* Write a BOM if appropriate. */ >> + if (SCM_UNLIKELY (strcasecmp(pt->encoding, "UTF-16") == 0 >> + || strcasecmp(pt->encoding, "UTF-32") == 0)) >> + display_character (SCM_UNICODE_BOM, port, iconveh_error); >> + } > > Perhaps only set these flags if we are in UTF-16 or UTF-32 to begin > with instead of pushing this strcasecmp out to print.c. I thought of that, but then we'd have to add these strcasecmp's everywhere that the flags are set, which is currently in three places: (1) port creation, (2) set-port-encoding!, and (3) seeks. As it is now, we only have to do these comparisons in two places (textual read and write), and often they can be avoided (e.g. on read the comparisons are only done when a BOM is found at the start of a stream). So I think this way is not only cleaner, but also probably a bit more efficient. I went ahead and pushed it to stable-2.0, with the changes mentioned above. Of course we can continue to tweak it. Thanks! Mark