From mboxrd@z Thu Jan  1 00:00:00 1970
From: Neil Jerram
Newsgroups: gmane.lisp.guile.devel
Subject: UTF-16 and (ice-9 rdelim)
Date: Sun, 17 Jan 2010 22:49:17 +0000
Message-ID: <871vho9wk2.fsf@ossau.uklinux.net>
To: Guile Development
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="

--=-=-=

I have a program that processes a UTF-16 input file, using
`with-input-from-file', `set-port-encoding!' and `read-line' in a
pattern like this:

    (use-modules (ice-9 rdelim))

    (with-input-from-file "rdelim-utf16.txt"
      (lambda ()
        (set-port-encoding! (current-input-port) "UTF16LE")
        ;; let* so that the two reads happen in a defined order
        (let* ((first-line (read-line))
               (second-line (read-line)))
          ...)))

A sample UTF-16 input file is attached.

This hits a couple of problems.

1. It seems that most (all?) UTF-16 files begin with a byte order mark
   (BOM), \ufeff, which readers are conventionally supposed to discard -
   but Guile doesn't.  So first-line becomes "\ufeffhello".

2. The internals of (read-line) just search for a '\n' char to
   determine the end of the first line, which means they're assuming
   that

   - '\n' never occurs as part of some other multibyte sequence

   - when '\n' occurs as part of the newline sequence, it occupies a
     single byte.

   This causes the second line to be read wrong, because newline in
   UTF-16LE is actually 2 bytes - \n \0 - and the first (read-line)
   leaves the \0 byte unconsumed.

I think the fixes for these are roughly as follows.

For 1:

- Add a flag to the representation of a file port to say whether we're
  still at the start of the file.  This flag starts off true, and
  becomes false once we've read enough bytes to get past a possible
  BOM.

- Define a static map from encodings to possible BOMs.

- When reading bytes, and the flag is true, and the port has an
  encoding, and that encoding has a possible BOM, check for and consume
  the BOM.

Or is it too magic for the port to do this automatically?
Alternatively, we could provide something like
`read-line-discarding-bom', and it would be up to the application to
know when to use this instead of `read-line'.

For 2:

- In scm_do_read_line(), keep the current (fast) code for the case
  where the port has no encoding.

- When the port has an encoding, use a modified implementation that
  copies raw bytes into an intermediate buffer, calls
  u32_conv_from_encoding to convert those to u32*, and uses u32_strchr
  to look for a newline.

Does that sound about right?  Are there any possible optimizations?

For the static map, is there a canonical set of possible encoding
strings, or a way to get a single canonical ID for all the strings that
are allowed to mean the same encoding?  For UTF-16, for example, it
seems to me that many of the following encoding strings will work

    utf-16
    utf-16-le
    utf16le
    utf16-le
    utf-16le
    utf16

+ the same with different case, and we don't want a map entry for each
one.  I suppose one pseudo-canonical method would be to upcase and
remove all punctuation.  Then we're only left with "UTF16" and
"UTF16LE", which makes sense.

Regards,
     Neil

--=-=-=
Content-Disposition: attachment; filename=rdelim-utf16.txt

hello
hello again
--=-=-=--
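
For illustration only, here is a rough C sketch of the three ideas in
the message: canonicalizing an encoding name by upcasing and dropping
punctuation, a static map from canonical names to BOMs, and a newline
search that respects 2-byte code units.  All of the names
(canonical_encoding, bom_table, bom_length, utf16le_line_end) are
hypothetical and are not part of libguile.  The sketch scans UTF-16LE
code units directly instead of converting with u32_conv_from_encoding
and searching with u32_strchr, and it leaves plain "UTF16" out of the
table, since there the BOM would also have to select the byte order.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: upcase NAME and drop every character that is not
   a letter or digit, so "utf-16-le", "utf16-le" and "UTF16LE" all become
   "UTF16LE".  OUT receives at most SIZE bytes including the final NUL;
   longer names are truncated.  */
void
canonical_encoding (const char *name, char *out, size_t size)
{
  size_t n = 0;

  assert (size > 0);
  for (; *name != '\0' && n + 1 < size; name++)
    if (isalnum ((unsigned char) *name))
      out[n++] = (char) toupper ((unsigned char) *name);
  out[n] = '\0';
}

/* Hypothetical static map from canonical encoding name to its BOM,
   i.e. U+FEFF serialized in that encoding's byte order.  */
struct bom_entry
{
  const char *encoding;
  unsigned char bom[2];
};

static const struct bom_entry bom_table[] = {
  { "UTF16LE", { 0xFF, 0xFE } },
  { "UTF16BE", { 0xFE, 0xFF } },
};

/* Return the number of leading bytes of BUF that form a BOM for
   ENCODING, or 0 if there is none (or the encoding is not in the map).  */
size_t
bom_length (const char *encoding, const unsigned char *buf, size_t len)
{
  char canon[32];
  size_t i;

  canonical_encoding (encoding, canon, sizeof canon);
  for (i = 0; i < sizeof bom_table / sizeof bom_table[0]; i++)
    if (strcmp (canon, bom_table[i].encoding) == 0
        && len >= 2 && memcmp (buf, bom_table[i].bom, 2) == 0)
      return 2;
  return 0;
}

/* Return the byte offset just past the first newline (U+000A) in the
   UTF-16LE data BUF, or LEN if no newline is found.  Stepping two bytes
   at a time means a 0x0A byte that belongs to another code unit (such as
   U+010A or U+0A00) is not mistaken for a newline, and the trailing 0x00
   of the newline is consumed together with the 0x0A.  */
size_t
utf16le_line_end (const unsigned char *buf, size_t len)
{
  size_t i;

  for (i = 0; i + 1 < len; i += 2)
    if (buf[i] == 0x0A && buf[i + 1] == 0x00)
      return i + 2;
  return len;
}
```

A reader along these lines would call bom_length once on the first
bytes of the port, skip that many bytes, and then call
utf16le_line_end repeatedly to find where each line stops.  Whether the
port should do the BOM step automatically is the open question raised
above; the sketch takes no position on it.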