From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Neil Jerram <neil@ossau.uklinux.net>
Newsgroups: gmane.lisp.guile.devel
Subject: Re: UTF-16 and (ice-9 rdelim)
Date: Mon, 18 Jan 2010 20:13:31 +0000
Message-ID: <87k4vff9xw.fsf@ossau.uklinux.net>
References: <871vho9wk2.fsf@ossau.uklinux.net>
	<442751.82517.qm@web37908.mail.mud.yahoo.com>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: ger.gmane.org 1263845713 21178 80.91.229.12 (18 Jan 2010 20:15:13 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Mon, 18 Jan 2010 20:15:13 +0000 (UTC)
Cc: Guile Development <guile-devel@gnu.org>
To: Mike Gran <spk121@yahoo.com>
Original-X-From: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Mon Jan 18 21:15:06 2010
Return-path: <guile-devel-bounces+guile-devel=m.gmane.org@gnu.org>
Envelope-to: guile-devel@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by lo.gmane.org with esmtp (Exim 4.50)
	id 1NWy09-0001EB-43
	for guile-devel@m.gmane.org; Mon, 18 Jan 2010 21:15:05 +0100
Original-Received: from localhost ([127.0.0.1]:43527 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1NWy09-00083c-RZ
	for guile-devel@m.gmane.org; Mon, 18 Jan 2010 15:15:05 -0500
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1NWxyz-0007rL-Uj
	for guile-devel@gnu.org; Mon, 18 Jan 2010 15:13:54 -0500
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1NWxyv-0007pC-81
	for guile-devel@gnu.org; Mon, 18 Jan 2010 15:13:53 -0500
Original-Received: from [199.232.76.173] (port=35922 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1NWxyu-0007p7-UT
	for guile-devel@gnu.org; Mon, 18 Jan 2010 15:13:49 -0500
Original-Received: from mail3.uklinux.net ([80.84.72.33]:52464)
	by monty-python.gnu.org with esmtp (Exim 4.60)
	(envelope-from <neil@ossau.uklinux.net>) id 1NWxyu-00079r-IR
	for guile-devel@gnu.org; Mon, 18 Jan 2010 15:13:48 -0500
Original-Received: from arudy (host86-183-76-203.range86-183.btcentralplus.com
	[86.183.76.203]) by mail3.uklinux.net (Postfix) with ESMTP
	id 5C86F1F6B90; Mon, 18 Jan 2010 20:13:47 +0000 (GMT)
Original-Received: from arudy (arudy [127.0.0.1])
	by arudy (Postfix) with ESMTP id E850938024;
	Mon, 18 Jan 2010 20:13:31 +0000 (GMT)
In-Reply-To: <442751.82517.qm@web37908.mail.mud.yahoo.com> (Mike Gran's
	message of "Sun, 17 Jan 2010 16:11:44 -0800 (PST)")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.1 (gnu/linux)
X-detected-operating-system: by monty-python.gnu.org: GNU/Linux 2.4-2.6
X-BeenThere: guile-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Developers list for Guile,
	the GNU extensibility library" <guile-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/guile-devel>,
	<mailto:guile-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/guile-devel>
List-Post: <mailto:guile-devel@gnu.org>
List-Help: <mailto:guile-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/guile-devel>,
	<mailto:guile-devel-request@gnu.org?subject=subscribe>
Original-Sender: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org
Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.lisp.guile.devel:9897
Archived-At: <http://permalink.gmane.org/gmane.lisp.guile.devel/9897>

Hi Mike,

Many thanks for your quick response.  I hope to work on these fixes
shortly.

A few comments...

Mike Gran <spk121@yahoo.com> writes:

> This should work.  BOMs are either 2, 3, or 4 bytes for UTF-16, UTF-8,
> and UTF-32 respectively.  And if the port encoding is expected to be
> set correctly in the first place, a BOM should always be the first
> code point returned by read-char.

Thanks.  For the moment, I am assuming that the encoding will already
have been declared correctly, by `set-port-encoding!' or by a
`coding:' comment.
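
To make the byte counts concrete, here is a minimal sketch in Python
rather than Guile (an assumption of convenience; the BOM constants come
from Python's standard codecs module, and the helper names are
hypothetical).  It checks the 2/3/4-byte lengths and shows that, when
the encoding is declared explicitly, the BOM comes back as the first
code point, U+FEFF.

```python
import codecs

# BOM byte sequences, taken from Python's standard codecs module.
BOMS = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16le": codecs.BOM_UTF16_LE,
    "utf-16be": codecs.BOM_UTF16_BE,
    "utf-32le": codecs.BOM_UTF32_LE,
    "utf-32be": codecs.BOM_UTF32_BE,
}


def bom_length(encoding):
    """Return the BOM length in bytes for one of the encodings above."""
    return len(BOMS[encoding])


def first_code_point(data, encoding):
    """Decode data with an explicitly declared encoding and return
    the first code point as an integer.  Codecs that name the byte
    order explicitly (and plain utf-8) do not strip the BOM."""
    return ord(data.decode(encoding)[0])
```

For example, first_code_point(BOMS["utf-16le"] + "A".encode("utf-16le"),
"utf-16le") gives 0xFEFF, which is the situation the port code would
see from read-char.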

> If you already have to go to the trouble of converting to u32, it might
> be simplest to reimplement the non-Latin-1 case in Scheme,
> since read-char and unread-char should work even for UTF-16.
> That might do bad things to speed, though.

I'll have a look; it's nice to prototype that way, at least.

> There are a couple of issues here.  If you want a port to automatically
> identify a Unicode encoding by checking its first four bytes for a BOM, 
> then you would need some sort of association table.  It wouldn't be that
> hard to do.

I'm not thinking of that yet.  (For the future, clearly it must be
possible, as Emacs does it all the time.)
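
For reference, the association table could be sketched roughly as
follows.  This is Python rather than Guile, and the names BOM_TABLE and
sniff_encoding are hypothetical, not anything in Guile or iconv.  The
one subtlety is ordering: the UTF-32LE BOM (FF FE 00 00) begins with
the UTF-16LE BOM (FF FE), so the longer entries must be tried first.

```python
import codecs

# Hypothetical association table: BOM bytes -> encoding name.
# The 4-byte BOMs come first because the UTF-32LE BOM starts with
# the same two bytes as the UTF-16LE BOM.
BOM_TABLE = [
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
]


def sniff_encoding(first_bytes):
    """Look at the first (up to four) bytes of a stream.

    Returns (encoding, bom_length), or (None, 0) if no BOM matches.
    """
    for bom, name in BOM_TABLE:
        if first_bytes.startswith(bom):
            return name, len(bom)
    return None, 0
```

Note that this guess is not watertight: UTF-16LE text whose first real
character is U+0000 would be taken for UTF-32LE.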

> But, if you just want to get rid of a BOM, you can cut it down to 
> a rule.  If the first code point that a port reads is U+FEFF and if the
> encoding has the string "utf" in it, ignore it.  If the first code point
> is U+FFFE and the encoding has "utf" in it, flag an error.

Agreed.
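
As I read it, the rule amounts to something like the sketch below.
Again this is Python, not the eventual Guile code, and
strip_leading_bom is a hypothetical name; it works on text that has
already been decoded from the start of the port.

```python
BOM = "\ufeff"
# U+FFFE is a noncharacter; in UTF-16 it is what U+FEFF decodes to
# when the two bytes are read in the wrong order.
REVERSED_BOM = "\ufffe"


def strip_leading_bom(text, encoding):
    """Apply the rule to decoded text read from the start of a port.

    - encoding name lacks "utf": leave the text alone
    - first code point is U+FEFF: drop it
    - first code point is U+FFFE: flag an error
    """
    if "utf" not in encoding.lower() or not text:
        return text
    if text[0] == BOM:
        return text[1:]
    if text[0] == REVERSED_BOM:
        raise ValueError(
            "byte-swapped BOM (U+FFFE); wrong endianness for " + encoding)
    return text
```

The check on the encoding name is case-insensitive here, on the
assumption that names like "UTF-16LE" and "utf-16le" should be treated
alike.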

Out of interest, does that mean that iconv will auto-detect the
endianness if the encoding does not explicitly say "le" or "be"?

Regards,
        Neil