From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Neil Jerram <neil@ossau.uklinux.net>
Newsgroups: gmane.lisp.guile.devel
Subject: UTF-16 and (ice-9 rdelim)
Date: Sun, 17 Jan 2010 22:49:17 +0000
Message-ID: <871vho9wk2.fsf@ossau.uklinux.net>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
X-Trace: ger.gmane.org 1263768591 19126 80.91.229.12 (17 Jan 2010 22:49:51 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Sun, 17 Jan 2010 22:49:51 +0000 (UTC)
To: Guile Development <guile-devel@gnu.org>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.1 (gnu/linux)
Xref: news.gmane.org gmane.lisp.guile.devel:9890
Archived-At: <http://permalink.gmane.org/gmane.lisp.guile.devel/9890>

--=-=-=

I have a program that processes a UTF-16 input file, using
`with-input-from-file', `set-port-encoding!' and `read-line' in a pattern
like this:

(use-modules (ice-9 rdelim))

(with-input-from-file "rdelim-utf16.txt"
  (lambda ()
    (set-port-encoding! (current-input-port) "UTF16LE")

    ;; let* rather than let, so that the two reads are guaranteed to
    ;; happen in order.
    (let* ((first-line (read-line))
           (second-line (read-line)))

      ...)

    ))

A sample UTF-16 input file is attached.

This hits a couple of problems.

1. It seems that most (all?) UTF-16 files begin with a byte order mark
(BOM), U+FEFF, which readers are conventionally supposed to discard -
but Guile doesn't.  So first-line becomes "\ufeffhello".

2. The internals of (read-line) just search for a '\n' byte to determine
the end of the first line, which means they're assuming that

- a '\n' byte never occurs as part of some other multibyte sequence

- when '\n' occurs as the newline character, it occupies a single
  byte.

Neither assumption holds for UTF-16.  In particular, the second line is
read wrong because a newline in UTF-16LE is actually 2 bytes - \n \0 -
and the first (read-line) leaves the \0 byte unconsumed.
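
To make the failure concrete, here is a small C sketch.  It is not
Guile's actual code; `bytewise_line_end' and `utf16le_line_end' are
names invented for this illustration.  The first function mimics a scan
for a single '\n' byte, the second scans 2-byte UTF-16LE code units.

```c
#include <stddef.h>

/* Byte-oriented scan: returns the number of bytes consumed up to and
   including the first 0x0A byte, or len if there is none.  This models
   a reader that assumes newline is always exactly one byte. */
size_t bytewise_line_end(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] == 0x0A)
            return i + 1;
    return len;
}

/* Code-unit scan for UTF-16LE: looks at aligned 2-byte units and only
   accepts the pair 0x0A 0x00 (U+000A).  Returns the number of bytes
   consumed up to and including that unit, or len if there is none. */
size_t utf16le_line_end(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i + 1 < len; i += 2)
        if (buf[i] == 0x0A && buf[i + 1] == 0x00)
            return i + 2;
    return len;
}
```

For the bytes FF FE 68 00 69 00 0A 00 (BOM, "hi", newline), the
byte-oriented scan stops after 7 bytes and leaves the trailing 00
behind, while the code-unit scan consumes all 8.  For 0A 01 0A 00
(U+010A followed by a newline), the byte-oriented scan stops after the
very first byte, in the middle of a character.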

I think the fixes for these are roughly as follows.

For 1:

- Add a flag to the representation of a file port to say whether we're
  still at the start of the file.  This flag starts off true, and
  becomes false once we've read enough bytes to get past a possible BOM.

- Define a static map from encodings to possible BOMs.

- When reading bytes, and the flag is true, and the port has an
  encoding, and that encoding has a possible BOM, check for and consume
  the BOM.

Or is it too magic for the port to do this automatically?
Alternatively, we could provide something like
`read-line-discarding-bom', and it would be up to the application to
know when to use this instead of `read-line'.
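
As a rough illustration of the port-level option for 1, here is a C
sketch.  Everything in it is hypothetical: `struct sketch_port',
`bom_table', `bom_length' and `maybe_skip_bom' are names made up for
this example and are not part of libguile.  It assumes the encoding
name has already been reduced to a canonical spelling, and that the
first read delivers at least as many bytes as the longest BOM.

```c
#include <stddef.h>
#include <string.h>

/* One row per (encoding, possible BOM) pair.  An encoding without a
   fixed byte order, such as "UTF16", gets one row per possible BOM. */
struct bom_entry {
    const char *encoding;      /* canonical name (assumed spelling) */
    unsigned char bom[2];
    size_t bom_len;
};

static const struct bom_entry bom_table[] = {
    { "UTF16",   { 0xFF, 0xFE }, 2 },
    { "UTF16",   { 0xFE, 0xFF }, 2 },
    { "UTF16LE", { 0xFF, 0xFE }, 2 },
    { "UTF16BE", { 0xFE, 0xFF }, 2 },
};

/* Returns the length of the BOM at the start of buf for the given
   encoding, or 0 if buf does not start with one. */
size_t bom_length(const char *encoding, const unsigned char *buf, size_t len)
{
    size_t n = sizeof bom_table / sizeof bom_table[0];
    for (size_t i = 0; i < n; i++) {
        const struct bom_entry *e = &bom_table[i];
        if (strcmp(encoding, e->encoding) == 0
            && len >= e->bom_len
            && memcmp(buf, e->bom, e->bom_len) == 0)
            return e->bom_len;
    }
    return 0;
}

/* Stand-in for the relevant part of a file port. */
struct sketch_port {
    const char *encoding;      /* NULL when the port has no encoding */
    int at_start;              /* nonzero until the first read is checked */
};

/* Returns how many leading bytes of buf to discard, and clears the
   at-start flag so that later reads are never checked again. */
size_t maybe_skip_bom(struct sketch_port *port,
                      const unsigned char *buf, size_t len)
{
    size_t skip = 0;
    if (port->at_start && port->encoding != NULL)
        skip = bom_length(port->encoding, buf, len);
    port->at_start = 0;
    return skip;
}
```

With a port whose encoding is "UTF16LE" and a buffer starting FF FE,
the first call reports 2 bytes to skip and every later call reports 0.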

For 2:

- In scm_do_read_line(), keep the current (fast) code for the case where
  the port has no encoding.

- When the port has an encoding, use a modified implementation that
  copies raw bytes into an intermediate buffer, calls
  u32_conv_from_encoding to convert those to u32*, and uses u32_strchr
  to look for a newline.

Does that sound about right?  Are there any possible optimizations?
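
The shape of the encoding-aware path for 2 could be sketched like this.
To keep the example self-contained it does not call libunistring: the
hand-written `decode_utf16le' stands in for u32_conv_from_encoding and
`find_newline' stands in for the u32 search.  Both names are
hypothetical, the decoder handles UTF-16LE only, and it passes unpaired
surrogates through unchanged, which a real implementation would
probably not want.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes UTF-16LE bytes into code points.  Writes at most out_cap
   code points to out and returns how many were written.  A trailing
   odd byte is ignored. */
size_t decode_utf16le(const unsigned char *src, size_t srclen,
                      uint32_t *out, size_t out_cap)
{
    size_t n = 0, i = 0;
    while (i + 1 < srclen && n < out_cap) {
        uint32_t u = (uint32_t)src[i] | ((uint32_t)src[i + 1] << 8);
        i += 2;
        /* Combine a high surrogate with a following low surrogate. */
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < srclen) {
            uint32_t lo = (uint32_t)src[i] | ((uint32_t)src[i + 1] << 8);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            }
        }
        out[n++] = u;
    }
    return n;
}

/* Returns the index of the first U+000A in s[0..n), or n if none. */
size_t find_newline(const uint32_t *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (s[i] == 0x0A)
            return i;
    return n;
}
```

Because the search runs over whole code points, a 0x0A byte that is
only half of another character (for example U+010A, encoded as 0A 01)
is no longer mistaken for a newline.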

For the static map, is there a canonical set of possible encoding
strings, or a way to get a single canonical ID for all the strings that
are allowed to mean the same encoding?  For UTF-16, for example, it
seems to me that many of the following encoding strings will work

utf-16
utf-16-le
utf16le
utf16-le
utf-16le
utf16
+ the same with different case

and we don't want a map entry for each one.

I suppose one pseudo-canonical method would be to upcase and remove all
punctuation.  Then we're only left with "UTF16" and "UTF16LE", which
makes sense.
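
The upcase-and-strip idea could be sketched in C as follows.  The names
`canonical_encoding_name' and `same_encoding' are made up for this
example, and the code assumes encoding names are plain ASCII.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Writes an upper-cased copy of name, with every non-alphanumeric
   character dropped, into out (capacity cap, including the NUL).
   Returns the length written, or (size_t)-1 if out is too small. */
size_t canonical_encoding_name(const char *name, char *out, size_t cap)
{
    size_t n = 0;
    if (cap == 0)
        return (size_t)-1;
    for (; *name != '\0'; name++) {
        unsigned char c = (unsigned char)*name;
        if (!isalnum(c))
            continue;               /* drop '-', '_', spaces, ... */
        if (n + 1 >= cap)
            return (size_t)-1;      /* no room for this char plus NUL */
        out[n++] = (char)toupper(c);
    }
    out[n] = '\0';
    return n;
}

/* Nonzero when both names reduce to the same canonical string. */
int same_encoding(const char *a, const char *b)
{
    char ca[64], cb[64];
    if (canonical_encoding_name(a, ca, sizeof ca) == (size_t)-1
        || canonical_encoding_name(b, cb, sizeof cb) == (size_t)-1)
        return 0;
    return strcmp(ca, cb) == 0;
}
```

With this, "utf-16-le", "utf16-le" and "UTF16LE" all reduce to
"UTF16LE", while "utf-16" reduces to "UTF16" and stays distinct.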

Regards,
        Neil


--=-=-=
Content-Disposition: attachment; filename=rdelim-utf16.txt

hello
hello again

--=-=-=--