From: Kenichi Handa
Newsgroups: gmane.emacs.devel
Subject: Re: regex and case-fold-search problem
Date: Thu, 29 Aug 2002 17:53:53 +0900 (JST)
Message-ID: <200208290853.RAA03185@etlken.m17n.org>
References: <200208230625.PAA23426@etlken.m17n.org> <200208262151.g7QLpfA12782@wijiji.santafe.edu>
In-Reply-To: <200208262151.g7QLpfA12782@wijiji.santafe.edu> (message from Richard Stallman on Mon, 26 Aug 2002 15:51:41 -0600 (MDT))
Original-To: rms@gnu.org
Cc: emacs-devel@gnu.org

In article <200208262151.g7QLpfA12782@wijiji.santafe.edu>, Richard Stallman writes:
> The fact is, people know the character codes and take advantage of
> their knowledge.  I don't think this is unreasonable.  But that
> question is academic, since the feature is used and we need to make it
> work.

People know the character codes of the charset they are familiar
with.  So they can take advantage of that knowledge only when Emacs
internally uses a character representation whose code order is the
same as in that charset.  For instance, those who are familiar with
the iso-8859-2 charset can take advantage of their knowledge in
Emacs 21.  But if they write such a regular expression, they'll find
that it matches different characters in emacs-unicode.

>> Maybe we can simply use the smallest contiguous
>> range of chars that includes all the chars we should match,

> That isn't right.  The range should be equal to the disjunction of all
> characters in it; A-_ should be equivalent to []A.....Z[\^_].  With
> case folding, that should match A-Z, a-z, and [\]^_.
In other words,

> The correct behavior is that all character codes that are equivalent
> (when you ignore case) to any character in the originally specified
> range should match.

I think we all know that is the right behaviour, and at least for
ASCII, the latest code works that way.  Perhaps we should make Emacs
work correctly for Latin-1 chars too, because they have the same code
order in emacs-unicode as well.  But...

> Given the whole case table, you can compute this by looping over the
> original (non-case-folded) range and finding, for each character, all
> the characters that are equivalent to it.  Then those could be
> assembled into the smallest possible number of ranges.

> A faster way, in the usual cases, would be to look for the case where
> several consecutive characters that have just one case-sibling each,
> and the siblings are consecutive too.  Each subrange of this kind can
> be turned into two subranges, the original and the case-converted.
> Also identify subranges of characters that have no case-siblings; each
> subrange of this kind just remains as it is.  Finally, any unusual
> characters that are encountered can be replaced with a list of all the
> case-siblings.

> This too requires use of the whole case table.

Implementing that for an arbitrary range of characters consumes our
manpower and makes the running code slower.  Consider the situation
where one writes the regexp "[\000-\xffff]" to search only for Unicode
BMP chars in emacs-unicode.  I suspect that, if we implement the above
method, compiling this regexp when case-fold-search is non-nil will
take longer than people usually expect.

So, I agree with Stephen that his method is good enough.

---
Ken'ichi HANDA
handa@etl.go.jp
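
For illustration, the first method quoted above (loop over the
original range, collect every case-equivalent character, then
assemble the result into as few ranges as possible) might be sketched
as follows.  This is only a rough sketch in Python, not the Emacs
regex code: the names case_siblings and fold_range are hypothetical,
and Python's str.upper/str.lower are assumed as a stand-in for the
Emacs case table.

```python
def case_siblings(ch):
    """Return ch plus its direct upper/lower images (hypothetical helper).

    Assumption: Python's Unicode case mapping stands in for the Emacs
    case table.  Mappings that expand to several characters are ignored,
    and full equivalence classes are not computed.
    """
    return {c for c in (ch, ch.lower(), ch.upper()) if len(c) == 1}


def fold_range(lo, hi):
    """Return merged (first, last) code point ranges covering every
    character that is case-equivalent to some character in lo..hi."""
    points = set()
    # One case-table lookup per character in the original range.
    for cp in range(ord(lo), ord(hi) + 1):
        points.update(ord(c) for c in case_siblings(chr(cp)))

    # Assemble the collected code points into the fewest contiguous ranges.
    ranges = []
    for cp in sorted(points):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)
        else:
            ranges.append((cp, cp))
    return ranges


# A-_ covers A-Z and [\]^_ (65..95); folding adds a-z (97..122).
print(fold_range("A", "_"))  # [(65, 95), (97, 122)]
```

Note that the first loop visits every code point in the original
range, so a range such as [\000-\xffff] costs 65536 lookups at
compile time.  That per-character work is the cost discussed above.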