From: Richard Stallman
Newsgroups: gmane.emacs.devel
Subject: Re: regex and case-fold-search problem
Date: Fri, 30 Aug 2002 15:19:14 -0400
Reply-To: rms@gnu.org
To: handa@etl.go.jp
Cc: emacs-devel@gnu.org
In-Reply-To: <200208290853.RAA03185@etlken.m17n.org> (message from Kenichi Handa on Thu, 29 Aug 2002 17:53:53 +0900 (JST))
References: <200208230625.PAA23426@etlken.m17n.org> <200208262151.g7QLpfA12782@wijiji.santafe.edu> <200208290853.RAA03185@etlken.m17n.org>
List-Id: Emacs development discussions.

    So, I agree with Stephen that his method is good enough.

It is wrong even for ASCII--we definitely must do something better,
at least for ASCII.  The only question is, how much more than ASCII?

    I think we all know that is the right behaviour, and at least
    for ASCII, the latest code works that way.  Perhaps we should
    make Emacs work correctly for Latin-1 chars as well, because
    they have the same code order in emacs-unicode too.

What about for Latin-2 characters?  Will those regexp ranges change
their meaning in emacs-unicode?  If so, perhaps we only need to make
an effort to support ranges really right for codes 0-256.

    > A faster way, in the usual cases, would be to look for the case where
    > several consecutive characters that have just one case-sibling each,
    > and the siblings are consecutive too.  Each subrange of this kind can
    > be turned into two subranges, the original and the case-converted.
    > Also identify subranges of characters that have no case-siblings; each
    > subrange of this kind just remains as it is.  Finally, any unusual
    > characters that are encountered can be replaced with a list of all the
    > case-siblings.

    > This too requires use of the whole case table.

    Implementing that for any range of characters consumes our
    man-power and makes the running code slower.

It is not a very hard program to write, I think.  I'd guess around 30
lines.  However, you're right about the slowness for large ranges.
If we only do this for codes 0-256 (or, currently, for ASCII and
Latin-1), then it won't be too slow.

    Consider the case where one writes the regexp "[\000-\xffff]"
    to search only Unicode BMP chars in emacs-unicode.

Do you think that is a reasonable kind of range that we should try to
support?  If so, there goes my idea that we only need to support
ranges in 0-256 very well.

On the other hand, if we handle \000-\xffff by doing case conversion
carefully only for ASCII and Latin-1, and treat the rest of the range
in a less smart way, we would get the same results in this case.  Is
that a good solution?
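
To make the subrange idea quoted above concrete, here is a minimal sketch in Python.  It is not the Emacs regex code.  The names case_siblings, fold_range and careful_max are hypothetical, and Python's str.lower/str.upper stand in for the Emacs case table as an assumption.  As a simplification, the original range is kept whole and only the case-converted subranges are added to it.  Only code points up to careful_max (Latin-1 by default) are scanned, which corresponds to the "careful only for ASCII and Latin-1" variant discussed above.

```python
def case_siblings(cp):
    """Other code points that differ from cp only in case.

    Hypothetical helper: uses Python's case mappings as a stand-in
    for a real case table, ignoring multi-character mappings.
    """
    ch = chr(cp)
    found = set()
    for variant in (ch.lower(), ch.upper()):
        if len(variant) == 1 and variant != ch:
            found.add(ord(variant))
    return sorted(found)


def fold_range(lo, hi, careful_max=0xFF):
    """Expand the inclusive range lo..hi so it also matches case variants.

    Returns a list of inclusive (start, end) ranges.  Only code points
    up to careful_max are examined; the rest of the range is left as is.
    """
    result = [(lo, hi)]            # the original range always stays
    run_start = run_end = None     # current run of consecutive siblings

    for cp in range(lo, min(hi, careful_max) + 1):
        sibs = case_siblings(cp)
        # Usual case: one sibling that directly follows the previous one.
        if len(sibs) == 1 and run_end is not None and sibs[0] == run_end + 1:
            run_end = sibs[0]
            continue
        # Otherwise the current run (if any) is finished.
        if run_start is not None:
            result.append((run_start, run_end))
            run_start = run_end = None
        if len(sibs) == 1:
            run_start = run_end = sibs[0]          # start a new run
        else:
            # No siblings: nothing to add.  Several siblings (unusual
            # character): list each sibling on its own.
            result.extend((s, s) for s in sibs)

    if run_start is not None:
        result.append((run_start, run_end))
    return result
```

With this sketch, fold_range(ord('a'), ord('z')) returns [(97, 122), (65, 90)], that is, the original a-z plus the converted A-Z.  A range lying entirely above careful_max, such as fold_range(0x100, 0xFFFF), comes back unchanged, so the cost of the scan does not grow with the size of the range.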