From: Neil Jerram
Newsgroups: gmane.lisp.guile.devel
Subject: Re: make check fails if no en_US.iso88591 locale
Date: Wed, 09 Sep 2009 22:53:44 +0100
Message-ID: <873a6v7pjr.fsf@arudy.ossau.uklinux.net>
References: <87pra1djys.fsf@arudy.ossau.uklinux.net> <322965.9784.qm@web37906.mail.mud.yahoo.com>
In-Reply-To: <322965.9784.qm@web37906.mail.mud.yahoo.com> (Mike Gran's message of "Tue, 8 Sep 2009 18:28:52 -0700 (PDT)")
To: Mike Gran
Cc: Guile Development

Mike Gran writes:

> My bad.  Actually, I should have enclosed the 'with-locale' in the
> context of a 'pass-if', which would have caught the exception.

Yes, but at the cost of not running the tests...

>> I can allow make check to complete by changing that line to
>>
>>   (false-if-exception (with-locale "en_US.iso88591"
>>
>> but I doubt that's the best fix.  Is the "en_US.iso88591" locale
>> actually important for the enclosed tests?
>
> It is important.  This is one of the problems with the whole Unicode
> effort.  There is no Unicode-capable regex library.  The regexp.test
> tries matching all bytes from 0 to 255, and it uses scm_to_locale_string
> to prep the string for dispatch to the libc regex calls and
> scm_from_locale_string to send them back.
>
> If the current locale is C or ASCII, bytes above 127 will cause errors.
> If the current locale is UTF-8, bytes above 127 will be converted into
> multibyte sequences that won't be matched by the regular expression
> being tested.  To pass the test in regexp.test, we need to use the
> encoding that maps all of the codepoints 0 to 255 to single-byte
> characters, which is ISO-8859-1.
>
> So until a better regex comes along, wrapping regex in an
> 8-bit-clean-friendly locale like Latin-1 is necessary to avoid encoding
> errors when encoding arbitrary 8-bit data like the test does.
>
> The reason why this problem is cropping up now and didn't occur before
> is because the old scm_to_locale_string was just a stub that passed
> 8-bit data through unmodified.

Thanks for explaining; I think I understand now.  So then Ludovic's
suggestion of with-latin1-locale should work, shouldn't it?

> This regex library actually can be used with arbitrary Unicode data
> but it takes extra care.  UTF-8 can be used as the locale, and then
> the regular expression must be written keeping in mind that each
> non-ASCII character is really a multibyte string.

Can you give an example of what that ("keeping in mind...") means?  Is
it being careful with repetition counts (as in "[a-z]{3}"), for
example?

Thanks,
        Neil
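
For anyone following the thread, the encoding argument quoted above can be
checked with a small sketch.  It is written in Python rather than Guile,
purely to show the byte-level behaviour.  Python's bytes-mode `re` stands in
for a regex engine that counts bytes rather than characters; that is an
assumption made for illustration, not a statement about how libc's regex
behaves in any particular locale.  The names `chars`, `ascii_fails` and
`three_units` are hypothetical.

```python
import re

# Every code point 0..255 as a one-character string, similar in spirit to
# the data that regexp.test is described as exercising.
chars = [chr(cp) for cp in range(256)]

# ISO-8859-1 (Latin-1) encodes each of these code points as exactly one byte.
latin1_lengths = {len(c.encode("latin-1")) for c in chars}

# UTF-8 needs more than one byte for every code point above 127.
utf8_multibyte = [cp for cp in range(256) if len(chr(cp).encode("utf-8")) > 1]


def ascii_fails(cp):
    """Return True if the code point cannot be encoded as ASCII."""
    try:
        chr(cp).encode("ascii")
        return False
    except UnicodeEncodeError:
        return True


# A byte-oriented matcher counts bytes, so a repetition count of 3 means
# "three bytes", not "three characters".
text = "\u00e9\u00e9\u00e9"  # three e-acute characters
three_units = re.compile(rb"^.{3}$", re.DOTALL)

latin1_match = three_units.match(text.encode("latin-1")) is not None
utf8_match = three_units.match(text.encode("utf-8")) is not None
utf8_six = re.match(rb"^.{6}$", text.encode("utf-8"), re.DOTALL) is not None

print(latin1_lengths, len(utf8_multibyte), latin1_match, utf8_match, utf8_six)
```

Under these assumptions the three-character string matches a count of 3 only
when encoded as Latin-1.  Encoded as UTF-8 it is six bytes long, so a
byte-counting pattern would need a count of 6.  This is one possible reading
of the repetition-count concern raised in the question above.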