From: Mike Gran
Newsgroups: gmane.lisp.guile.devel
Subject: Re: UTF-16 and (ice-9 rdelim)
Date: Sun, 17 Jan 2010 16:11:44 -0800 (PST)
Message-ID: <442751.82517.qm@web37908.mail.mud.yahoo.com>
In-Reply-To: <871vho9wk2.fsf@ossau.uklinux.net>
To: Neil Jerram, Guile Development

> From: Neil Jerram
>
> 1. It seems that most (all?)
> UTF-16 files begin with a byte order marker
> (BOM), \ufeff, which readers are conventionally supposed to discard -
> but Guile doesn't.  So first-line becomes "\ufeffhello"
>
> 2. The internals of (read-line) just search for a '\n' char to determine
> the end of the first line, which means they're assuming that
>
> - '\n' never occurs as part of some other multibyte sequence
>
> - when '\n' occurs as part of the newline sequence, it occupies a single
>   byte.
>
> This causes the second line to be read wrong, because newline in UTF-16
> is actually 2 bytes - \n \0 - and the first (read-line) leaves the \0
> byte unconsumed.
>
> I think the fixes for these are roughly as follows.
>
> For 1:
>
> - Add a flag to the representation of a file port to say whether we're
>   still at the start of the file.  This flag starts off true, and
>   becomes false once we've read enough bytes to get past a possible BOM.
>
> - Define a static map from encodings to possible BOMs.
>
> - When reading bytes, and the flag is true, and the port has an
>   encoding, and that encoding has a possible BOM, check for and consume
>   the BOM.

This should work.  BOMs are either 2, 3, or 4 bytes for UTF-16, UTF-8,
and UTF-32 respectively.  And if the port encoding is expected to be set
correctly in the first place, a BOM should always be the first code point
returned by read-char.

> For 2:
>
> - In scm_do_read_line(), keep the current (fast) code for the case where
>   the port has no encoding.
>
> - When the port has an encoding, use a modified implementation that
>   copies raw bytes into an intermediate buffer, calls
>   u32_conv_from_encoding to convert those to u32*, and uses u32_strchr
>   to look for a newline.
>
> Does that sound about right?  Are there any possible optimizations?

If you already have to go to the trouble of converting to u32, it might
be simplest to reimplement the non-Latin-1 case in Scheme, since read-char
and unread-char should work even for UTF-16.  That might do bad things
to speed, though.
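To make the read-line problem concrete, here is a rough sketch in Python
(used only for illustration; this is not Guile code, and the helper names
naive_read_line and read_line_by_char are made up for the example).  It
assumes a little-endian UTF-16 file with a BOM.  The byte-oriented reader
stops at the 0x0A byte and leaves the following 0x00 behind, while the
reader that works one decoded character at a time, which is roughly what a
read-char based read-line would do, stays aligned.  Note that it still
returns the BOM as part of the first line.

```python
import io

# "hello\nworld\n" as UTF-16LE bytes, preceded by a BOM (bytes FF FE).
data = "\ufeffhello\nworld\n".encode("utf-16-le")


def naive_read_line(stream):
    """Read raw bytes up to and including the first 0x0A byte."""
    out = bytearray()
    while True:
        b = stream.read(1)
        if not b:
            break
        out += b
        if b == b"\n":
            break
    return bytes(out)


def read_line_by_char(text_stream):
    """Read decoded characters up to (not including) a newline character."""
    chars = []
    while True:
        ch = text_stream.read(1)
        if ch == "" or ch == "\n":
            break
        chars.append(ch)
    return "".join(chars)


# Byte-oriented search: the first line ends in the middle of a code unit.
raw = io.BytesIO(data)
first = naive_read_line(raw)    # odd number of bytes, stops at the 0x0A byte
second = naive_read_line(raw)   # starts with the stray 0x00 byte

# Character-oriented search: decode first, then look for the newline.
text = io.TextIOWrapper(io.BytesIO(data), encoding="utf-16-le", newline="")
line1 = read_line_by_char(text)  # BOM is still there: "\ufeffhello"
line2 = read_line_by_char(text)

print(len(first), second[:1], repr(line1), repr(line2))
```

The character-oriented version pays a decoding cost on every character,
which is the speed concern mentioned above.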
>
> For the static map, is there a canonical set of possible encoding
> strings, or a way to get a single canonical ID for all the strings that
> are allowed to mean the same encoding?  For UTF-16, for example, it
> seems to me that many of the following encoding strings will work
>
>   utf-16
>   utf-16-le
>   utf16le
>   utf16-le
>   utf-16le
>   utf16
>   + the same with different case
>
> and we don't want a map entry for each one.
>
> I suppose one pseudo-canonical method would be to upcase and remove all
> punctuation.  Then we're only left with "UTF16" and "UTF16LE", which
> makes sense.

There are a couple of issues here.  If you want a port to automatically
identify a Unicode encoding by checking its first four bytes for a BOM,
then you would need some sort of association table.  It wouldn't be that
hard to do.

But, if you just want to get rid of a BOM, you can cut it down to a rule.
If the first code point that a port reads is U+FEFF and the encoding name
has the string "utf" in it, discard that code point.  If the first code
point is U+FFFE and the encoding name has "utf" in it, flag an error,
since U+FFFE is what a BOM looks like when it is read with the wrong byte
order.

> Regards,
> Neil

-Mike
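
P.S. The BOM rule above, together with Neil's "upcase and remove all
punctuation" idea, could be sketched roughly like this.  It is written in
Python purely for illustration, and canonical_encoding and
strip_leading_bom are hypothetical names, not anything in Guile.  It
assumes the port's text has already been decoded with the port's encoding.

```python
import re


def canonical_encoding(name):
    """Upcase and drop punctuation, e.g. "utf-16-le" becomes "UTF16LE"."""
    return re.sub(r"[^A-Z0-9]", "", name.upper())


def strip_leading_bom(text, encoding):
    """Apply the BOM rule to the first decoded code point of the input.

    U+FEFF at the start of a "utf" encoding is discarded; U+FFFE there
    means the BOM was read with the wrong byte order, so it is an error.
    Other encodings are left untouched.
    """
    if not text or "UTF" not in canonical_encoding(encoding):
        return text
    if text[0] == "\ufeff":
        return text[1:]
    if text[0] == "\ufffe":
        raise ValueError("byte-swapped BOM (U+FFFE) for encoding " + encoding)
    return text


print(canonical_encoding("utf-16-le"))
print(repr(strip_leading_bom("\ufeffhello", "utf-16le")))
```

In a real port implementation the check would run once, on the first code
point read, rather than on a whole string.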