From mboxrd@z Thu Jan 1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Andy Wingo
Newsgroups: gmane.lisp.guile.bugs
Subject: [bug #31650] ice-9 regexp doesn't work with multibyte chars
Date: Sun, 14 Nov 2010 10:59:14 +0000
Message-ID: <20101114-105913.sv20118.39975@savannah.gnu.org>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain;charset=UTF-8
X-Trace: dough.gmane.org 1289732377 18718 80.91.229.12 (14 Nov 2010 10:59:37 GMT)
X-Complaints-To: usenet@dough.gmane.org
NNTP-Posting-Date: Sun, 14 Nov 2010 10:59:37 +0000 (UTC)
To: Andy Wingo , bug-guile@gnu.org
Original-X-From: bug-guile-bounces+guile-bugs=m.gmane.org@gnu.org Sun Nov 14 11:59:33 2010
Return-path:
Envelope-to: guile-bugs@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by lo.gmane.org with esmtp (Exim 4.69) (envelope-from )
	id 1PHaJ1-0005Au-Dc for guile-bugs@m.gmane.org;
	Sun, 14 Nov 2010 11:59:31 +0100
Original-Received: from localhost ([127.0.0.1]:40475 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1PHaJ0-0005pI-HG for guile-bugs@m.gmane.org;
	Sun, 14 Nov 2010 05:59:30 -0500
Original-Received: from [140.186.70.92] (port=51777 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1PHaIr-0005nD-SM for bug-guile@gnu.org;
	Sun, 14 Nov 2010 05:59:23 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned
	(Exim 4.71) (envelope-from )
	id 1PHaIk-0004st-UC for bug-guile@gnu.org;
	Sun, 14 Nov 2010 05:59:21 -0500
Original-Received: from colonialone.fsf.org ([140.186.70.51]:37626 helo=internal.in.savannah.gnu.org)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1PHaIk-0004sn-Ra for bug-guile@gnu.org;
	Sun, 14 Nov 2010 05:59:14 -0500
Original-Received: from [10.1.0.103] (helo=frontend.in.savannah.gnu.org)
	by internal.in.savannah.gnu.org with esmtp (Exim 4.69) (envelope-from )
	id 1PHaIk-0007Xo-8o; Sun, 14 Nov 2010 10:59:14 +0000
Original-Received: from www-data by frontend.in.savannah.gnu.org with local
	(Exim 4.69) (envelope-from )
	id 1PHaIk-0007k1-7d; Sun, 14 Nov 2010 10:59:14 +0000
X-Savane-Server: savannah.gnu.org:443 [10.1.0.103]
X-Savane-Project: guile
X-Savane-Tracker: bugs
X-Savane-Item-ID: 31650
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-us) AppleWebKit/531.2+ (KHTML, like Gecko) Safari/531.2+ Epiphany/2.30.2
X-Apparently-From: 81.39.161.251 (Savane authenticated user wingo)
Original-References:
In-Reply-To:
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 2)
X-BeenThere: bug-guile@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Bug reports for GUILE, GNU's Ubiquitous Extension Language"
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Original-Sender: bug-guile-bounces+guile-bugs=m.gmane.org@gnu.org
Errors-To: bug-guile-bounces+guile-bugs=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.lisp.guile.bugs:4850
Archived-At:

URL:

                 Summary: ice-9 regexp doesn't work with multibyte chars
                 Project: Guile
            Submitted by: wingo
            Submitted on: Sun 14 Nov 2010 10:59:13 AM GMT
                Category: None
                Severity: 3 - Normal
              Item Group: None
                  Status: None
                 Privacy: Public
             Assigned to: None
             Open/Closed: Open
         Discussion Lock: Any

    _______________________________________________________

Details:

Steps to reproduce:

> (setlocale LC_ALL)
> (match:substring (string-match ".*" "calçot") 0)

Expected results:

"calçot"

Actual results:

ERROR: In procedure substring:
ERROR: Value out of range 0 to 6: 7

I think what is happening is that, as per
http://www.gnu.org/software/libc/manual/html_node/Regexp-Subexpressions.html#Regexp-Subexpressions,
the regmatch_t structure gives us *byte offsets* at which the string matched,
but Guile's match structures need *char offsets*.

So we need a reverse mapping from a byte index into the string, as encoded in
the current locale, to the corresponding character index.
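
To make the byte-versus-character mismatch concrete, here is a minimal C sketch of the kind of reverse mapping described above. It is only an illustration: `utf8_byte_to_char_offset` is a hypothetical helper, not a Guile or libc function, and it assumes the string is valid UTF-8, whereas a real fix would have to respect whatever encoding the current locale uses. In UTF-8, "calçot" is 6 characters but 7 bytes, which would explain the out-of-range end index of 7 in the error.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of Guile or libc): return the number of
   characters encoded in the first BYTE_OFF bytes of the UTF-8 string S.
   Assumes S is valid UTF-8 and BYTE_OFF falls on a character boundary. */
size_t
utf8_byte_to_char_offset (const char *s, size_t byte_off)
{
  size_t chars = 0;

  assert (s != NULL);
  for (size_t i = 0; i < byte_off; i++)
    {
      /* Continuation bytes have the form 10xxxxxx; any other byte
         starts a new character, so count only those. */
      if (((unsigned char) s[i] & 0xC0) != 0x80)
        chars++;
    }
  return chars;
}
```

Under these assumptions, the `rm_so` and `rm_eo` fields of each `regmatch_t` would be passed through such a mapping before being stored in the match structure, so that a byte offset of 7 for "calçot" becomes the character offset 6.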
    _______________________________________________________

Reply to this item at:

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/