From: Neil Jerram
To: Mike Gran
Cc: Guile Development
Newsgroups: gmane.lisp.guile.devel
Subject: Re: UTF-16 and (ice-9 rdelim)
Date: Mon, 18 Jan 2010 20:13:31 +0000
Message-ID: <87k4vff9xw.fsf@ossau.uklinux.net>
References: <871vho9wk2.fsf@ossau.uklinux.net> <442751.82517.qm@web37908.mail.mud.yahoo.com>
In-Reply-To: <442751.82517.qm@web37908.mail.mud.yahoo.com> (Mike Gran's message of "Sun, 17 Jan 2010 16:11:44 -0800 (PST)")

Hi Mike,

Many thanks for your quick response. I'll hopefully work on these fixes
shortly. A few comments...

Mike Gran writes:

> This should work. BOMs are either 2, 3, or 4 bytes for UTF-16, UTF-8,
> and UTF-32 respectively. And if the port encoding is expected to be
> set correctly in the first place, a BOM should always be the first
> code point returned by read-char.

Thanks. For the moment, I am assuming that the encoding will have
previously been declared correctly, by `set-port-encoding' or by a
`coding:' comment.

> If you already have to go to the trouble of converting to u32, it might
> be simplest to reimplement the non-Latin-1 case in Scheme,
> since read-char and unread-char should work even for UTF-16.
> That might do bad things to speed, though.

I'll have a look; it's nice to prototype that way, at least.

> There are a couple of issues here. If you want a port to automatically
> identify a Unicode encoding by checking its first four bytes for a BOM,
> then you would need some sort of association table. It wouldn't be that
> hard to do.

I'm not thinking of that yet.
(For the future, clearly it must be possible, as Emacs is doing it all
the time.)

> But, if you just want to get rid of a BOM, you can cut it down to
> a rule. If the first code point that a port reads is U+FEFF and if the
> encoding has the string "utf" in it, ignore it. If the first code point
> is U+FFFE and the encoding has "utf" in it, flag an error.

Agreed. Out of interest, does that mean that iconv will auto-detect the
endianness if the encoding does not explicitly say "le" or "be"?

Regards,
Neil
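
The BOM rule quoted above could be sketched roughly as follows. This is an illustration in Python, not Guile code and not the eventual fix. The function name `strip_leading_bom` and its two parameters are hypothetical stand-ins for "the text a port has decoded so far" and "the port's declared encoding name". It assumes, as the message does, that the encoding was declared correctly beforehand.

```python
BOM = "\ufeff"          # U+FEFF: a byte order mark decoded in the right byte order
SWAPPED_BOM = "\ufffe"  # U+FFFE: what a BOM decodes to when the bytes are swapped


def strip_leading_bom(text, encoding):
    """Apply the quoted rule to already-decoded text.

    `text` and `encoding` are hypothetical: they stand in for a port's
    decoded contents and its declared encoding name.
    """
    # The rule only applies when the encoding name contains "utf".
    if "utf" not in encoding.lower() or not text:
        return text
    first = text[0]
    if first == BOM:
        # A leading U+FEFF is a BOM: ignore it.
        return text[1:]
    if first == SWAPPED_BOM:
        # A leading U+FFFE suggests the declared endianness is wrong.
        raise ValueError("byte-swapped BOM; check the declared endianness")
    return text


# Usage: little-endian UTF-16 bytes for "hi", preceded by a BOM.
decoded = b"\xff\xfeh\x00i\x00".decode("utf-16-le")
print(strip_leading_bom(decoded, "UTF-16LE"))
```

Only the first code point is examined, so a U+FEFF appearing later in the stream is left alone, which matches the rule as quoted.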