From: Kenichi Handa
Newsgroups: gmane.emacs.devel
Subject: regex and case-fold-search problem
Date: Fri, 23 Aug 2002 15:25:42 +0900 (JST)
Message-ID: <200208230625.PAA23426@etlken.m17n.org>
Original-To: emacs-devel@gnu.org

While working on emacs-unicode, I noticed a very difficult
problem which also exists in the current Emacs.

(let ((case-fold-search nil)) (string-match "[Þ-ß]" "Þ")) => 0
(let ((case-fold-search nil)) (string-match "[Þß]" "Þ")) => 0
(let ((case-fold-search t)) (string-match "[Þ-ß]" "Þ")) => nil !!!
(let ((case-fold-search t)) (string-match "[Þß]" "Þ")) => 0

When you see the output of
	M-x list-charset-chars RET latin-iso8859-1 RET,
you'll soon find what's going on. The relevant character
codes are as follows:

	Þ (#x8DE) ß (#x8DF)
	(downcase ?Þ) == ?þ (#x8FE)
	(downcase ?ß) == ?ß (#x8DF)

This problem is not specific to non-ASCII chars; it's just
rarer to run into such a situation with ASCII chars.

(let ((case-fold-search nil)) (string-match "[A-_]" "A")) => 0
(let ((case-fold-search t)) (string-match "[A-_]" "A")) => nil
(let ((case-fold-search t)) (string-match "[A_]" "A")) => 0

In my opinion, specifying ranges by chars is nonsense
because the order of character codes should carry no
semantics. But, anyway, we have to decide what to do.

(1) Regard the above case as a bug, and fix it completely.
    As the current Emacs doesn't support a range striding
    over different charsets, I think the fix is difficult,
    but not extremely so. But in emacs-unicode we can't have
    such a restriction, and thus the fix is very difficult.

(2) Regard the above case as an (unpleasant) feature, and
    document it.

(3) Signal an error for such a regex (and of course document
    it).

---
Ken'ichi HANDA
handa@etl.go.jp
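
The character codes quoted in the message suggest one way the range can fail: if both endpoints of the range are downcased before comparison, the lower endpoint can end up above the upper one, leaving an empty range. The following is a minimal sketch of that reading in Python. It is a model, not Emacs's regex code. The function names are hypothetical, `str.lower` stands in for `downcase`, and Python's code points (#xDE, #xDF, #xFE) differ from the Emacs 21 codes in the message (#x8DE, #x8DF, #x8FE), though their relative order is the same.

```python
def fold(ch: str) -> str:
    """Stand-in for downcase on a single character (assumption: str.lower)."""
    return ch.lower()


def match_folding_endpoints(lo: str, hi: str, ch: str) -> bool:
    """Hypothetical matcher: fold both endpoints and the subject char,
    then compare character codes. Folding can invert the range."""
    return ord(fold(lo)) <= ord(fold(ch)) <= ord(fold(hi))


def match_folding_members(lo: str, hi: str, ch: str) -> bool:
    """Hypothetical alternative: enumerate the original range and fold
    every member, so the folded range is treated as a set."""
    folded = {fold(chr(code)) for code in range(ord(lo), ord(hi) + 1)}
    return fold(ch) in folded


def match_folding_set(members: str, ch: str) -> bool:
    """Explicit alternatives such as [A_]: fold each listed character."""
    return fold(ch) in {fold(m) for m in members}


if __name__ == "__main__":
    # Folded endpoints: 'þ' (0xFE) .. 'ß' (0xDF), an inverted range.
    print(match_folding_endpoints("Þ", "ß", "Þ"))  # False
    # Folded endpoints: 'a' (0x61) .. '_' (0x5F), also inverted.
    print(match_folding_endpoints("A", "_", "A"))  # False
    print(match_folding_set("Þß", "Þ"))            # True
    print(match_folding_members("A", "_", "A"))    # True
```

Folding each member works here because the ranges are small. It does not address the cost of enumerating a range that spans many charsets, which is the difficulty raised under option (1).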