From mboxrd@z Thu Jan 1 00:00:00 1970
From: Richard Stallman
Newsgroups: gmane.emacs.devel
Subject: Re: regex and case-fold-search problem
Date: Mon, 26 Aug 2002 15:51:41 -0600 (MDT)
Message-ID: <200208262151.g7QLpfA12782@wijiji.santafe.edu>
References: <200208230625.PAA23426@etlken.m17n.org>
Reply-To: rms@gnu.org
Cc: emacs-devel@gnu.org
Original-To: handa@etl.go.jp
In-Reply-To: <200208230625.PAA23426@etlken.m17n.org> (message from Kenichi Handa on Fri, 23 Aug 2002 15:25:42 +0900 (JST))
List-Id: Emacs development discussions.
Xref: main.gmane.org gmane.emacs.devel:6940

> In my opinion, specifying ranges by chars are nonsense because
> there should be no semantics in the order of characters codes.

The fact is, people know the character codes and take advantage of
their knowledge. I don't think this is unreasonable. But that
question is academic, since the feature is used and we need to make
it work.

> Does that happen because under case-fold-search non-nil the
> characters on the range specification are downcased?

It looks that way.

> Maybe we can simply use the smallest contiguous
> range of chars that includes all the chars we should match,

That isn't right. The range should be equal to the disjunction of
all characters in it; A-_ should be equivalent to []A.....Z[\^_].
With case folding, that should match A-Z, a-z, and [\]^_.

In other words, the correct behavior is that all character codes
that are equivalent (when you ignore case) to any character in the
originally specified range should match.

Given the whole case table, you can compute this by looping over the
original (non-case-folded) range and finding, for each character,
all the characters that are equivalent to it. Then those could be
assembled into the smallest possible number of ranges.

A faster way, in the usual cases, would be to look for runs of
several consecutive characters that each have just one case-sibling,
where the siblings are consecutive too. Each subrange of this kind
can be turned into two subranges, the original and the
case-converted.
Also identify subranges of characters that have no case-siblings; each subrange of this kind just remains as it is. Finally, any unusual characters that are encountered can be replaced with a list of all the case-siblings. This too requires use of the whole case table.
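
To make the general method concrete, here is a minimal sketch in Python of the loop over the original range followed by assembly into the fewest ranges. It is only an illustration, not the Emacs regex code: the helper names case_siblings and fold_range are hypothetical, and Python's str.lower and str.upper stand in for the real case table, so the equivalence it computes is an approximation outside ASCII.

```python
def case_siblings(ch):
    """Hypothetical stand-in for a case-table lookup.

    Returns ch plus its single-character lower/upper variants.
    A real case table may define larger equivalence classes.
    """
    variants = {ch, ch.lower(), ch.upper()}
    return {v for v in variants if len(v) == 1}


def fold_range(first, last):
    """Ranges matched by first-last when case is ignored.

    Loops over the original (non-case-folded) range, collects every
    equivalent character, then merges adjacent codes into ranges.
    """
    codes = set()
    for code in range(ord(first), ord(last) + 1):
        codes.update(ord(s) for s in case_siblings(chr(code)))

    ranges = []
    for code in sorted(codes):
        if ranges and code == ranges[-1][1] + 1:
            # Extend the current range by one character.
            ranges[-1] = (ranges[-1][0], code)
        else:
            # Start a new range at this character.
            ranges.append((code, code))
    return [(chr(lo), chr(hi)) for lo, hi in ranges]


# The A-_ example from the message: A-Z, [\]^_ and a-z, but not `.
print(fold_range("A", "_"))
```

For the range A-_ this yields two ranges, A-_ and a-z, which matches the behavior described above; the backquote between _ and a is correctly left out, which a single smallest contiguous range would not do.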