From mboxrd@z Thu Jan  1 00:00:00 1970
Path: main.gmane.org!not-for-mail
From: Kenichi Handa <handa@etl.go.jp>
Newsgroups: gmane.emacs.devel
Subject: Re: regex and case-fold-search problem
Date: Thu, 29 Aug 2002 17:53:53 +0900 (JST)
Sender: emacs-devel-admin@gnu.org
Message-ID: <200208290853.RAA03185@etlken.m17n.org>
References: <200208230625.PAA23426@etlken.m17n.org> <200208262151.g7QLpfA12782@wijiji.santafe.edu>
NNTP-Posting-Host: localhost.gmane.org
Mime-Version: 1.0 (generated by SEMI 1.14.3 - "Ushinoya")
Content-Type: text/plain; charset=US-ASCII
X-Trace: main.gmane.org 1030611236 25954 127.0.0.1 (29 Aug 2002 08:53:56 GMT)
X-Complaints-To: usenet@main.gmane.org
NNTP-Posting-Date: Thu, 29 Aug 2002 08:53:56 +0000 (UTC)
Cc: emacs-devel@gnu.org
Return-path: <emacs-devel-admin@gnu.org>
Original-Received: from quimby.gnus.org ([80.91.224.244])
	by main.gmane.org with esmtp (Exim 3.35 #1 (Debian))
	id 17kL3n-0006kH-00
	for <emacs-devel@main.gmane.org>; Thu, 29 Aug 2002 10:53:51 +0200
Original-Received: from monty-python.gnu.org ([199.232.76.173])
	by quimby.gnus.org with esmtp (Exim 3.12 #1 (Debian))
	id 17kLZz-0006r9-00
	for <emacs-devel@quimby.gnus.org>; Thu, 29 Aug 2002 11:27:08 +0200
Original-Received: from localhost ([127.0.0.1] helo=monty-python.gnu.org)
	by monty-python.gnu.org with esmtp (Exim 4.10)
	id 17kL59-0003Uq-00; Thu, 29 Aug 2002 04:55:15 -0400
Original-Received: from list by monty-python.gnu.org with tmda-scanned (Exim 4.10)
	id 17kL3v-0003Ho-00
	for emacs-devel@gnu.org; Thu, 29 Aug 2002 04:53:59 -0400
Original-Received: from mail by monty-python.gnu.org with spam-scanned (Exim 4.10)
	id 17kL3s-0003Hb-00
	for emacs-devel@gnu.org; Thu, 29 Aug 2002 04:53:59 -0400
Original-Received: from tsukuba.m17n.org ([192.47.44.130])
	by monty-python.gnu.org with esmtp (Exim 4.10)
	id 17kL3s-0003HM-00; Thu, 29 Aug 2002 04:53:56 -0400
Original-Received: from fs.m17n.org (fs.m17n.org [192.47.44.2])
	by tsukuba.m17n.org (8.11.6/3.7W-20010518204228) with ESMTP id g7T8rrl01371;
	Thu, 29 Aug 2002 17:53:53 +0900 (JST)
	(envelope-from handa@m17n.org)
Original-Received: from etlken.m17n.org (etlken.m17n.org [192.47.44.125])
	by fs.m17n.org (8.11.3/3.7W-20010823150639) with ESMTP id g7T8rr917903;
	Thu, 29 Aug 2002 17:53:53 +0900 (JST)
Original-Received: (from handa@localhost)
	by etlken.m17n.org (8.8.8+Sun/3.7W-2001040620) id RAA03185;
	Thu, 29 Aug 2002 17:53:53 +0900 (JST)
Original-To: rms@gnu.org
In-Reply-To: <200208262151.g7QLpfA12782@wijiji.santafe.edu> (message from
	Richard Stallman on Mon, 26 Aug 2002 15:51:41 -0600 (MDT))
User-Agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2 Emacs/21.1.30 (sparc-sun-solaris2.6) MULE/5.0 (SAKAKI)
Errors-To: emacs-devel-admin@gnu.org
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.0.11
Precedence: bulk
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Post: <mailto:emacs-devel@gnu.org>
List-Subscribe: <http://mail.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
List-Id: Emacs development discussions. <emacs-devel.gnu.org>
List-Unsubscribe: <http://mail.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://mail.gnu.org/pipermail/emacs-devel/>
Xref: main.gmane.org gmane.emacs.devel:7099
X-Report-Spam: http://spam.gmane.org/gmane.emacs.devel:7099

In article <200208262151.g7QLpfA12782@wijiji.santafe.edu>, Richard Stallman <rms@gnu.org> writes:
> The fact is, people know the character codes and take advantage of
> their knowledge.  I don't think this is unreasonable.  But that
> question is academic, since the feature is used and we need to make it
> work.

People know the character codes of the charset they are
familiar with.  So, they can take advantage of that
knowledge only when Emacs's internal character
representation orders character codes the same way as that
charset.  For instance, those who are familiar with the
iso-8859-2 charset can take advantage of their knowledge in
Emacs 21.  But if they write such a regular expression,
they'll find that it matches different characters in
emacs-unicode.

> > Maybe we can simply use the smallest contiguous
> > range of chars that includes all the chars we should match,

> That isn't right.  The range should be equal to the disjunction of all
> characters in it; A-_ should be equivalent to []A.....Z[\^_].  With
> case folding, that should match A-Z, a-z, and [\]^_.  In other words,
> The correct behavior is that all character codes that are equivalent
> (when you ignore case) to any character in the originally specified
> range should match.

I think we all know that is the right behaviour, and at
least for ASCII, the latest code works that way.  Perhaps we
should make Emacs work correctly for Latin-1 chars as well,
because they keep the same code order in emacs-unicode.

But...

> Given the whole case table, you can compute this by looping over the
> original (non-case-folded) range and finding, for each character, all
> the characters that are equivalent to it.  Then those could be
> assembled into the smallest possible number of ranges.

> A faster way, in the usual cases, would be to look for the case where
> several consecutive characters that have just one case-sibling each,
> and the siblings are consecutive too.  Each subrange of this kind can
> be turned into two subranges, the original and the case-converted.
> Also identify subranges of characters that have no case-siblings; each
> subrange of this kind just remains as it is.  Finally, any unusual
> characters that are encountered can be replaced with a list of all the
> case-siblings.

> This too requires use of the whole case table.

Implementing that for an arbitrary range of characters costs
us manpower and makes the running code slower.
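
To make the cost concrete, here is a rough sketch of the first
method quoted above (loop over the range, collect the
case-siblings, assemble ranges).  It is written in Python only
for illustration; it is not Emacs code.  The names `siblings`
and `fold_range` are hypothetical, and Python's per-character
lower()/upper() is only a stand-in for a real case table.

```python
def siblings(ch):
    """Stand-in for a case-table lookup (an assumption, not Emacs's table):
    ch plus its single-character lower- and upper-case forms."""
    return {c for c in (ch, ch.lower(), ch.upper()) if len(c) == 1}


def fold_range(lo, hi):
    """Return sorted (first, last) code ranges that match lo..hi when
    case is ignored."""
    codes = set()
    for code in range(lo, hi + 1):          # one lookup per char in the range
        codes.update(ord(c) for c in siblings(chr(code)))
    ranges = []
    for code in sorted(codes):
        if ranges and code == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], code)   # extend the current run
        else:
            ranges.append((code, code))          # start a new run
    return ranges


print(fold_range(ord("A"), ord("_")))  # [(65, 95), (97, 122)], i.e. A-_ and a-z
```

The loop does one lookup per character of the original range, so
the work grows with the width of the range rather than with the
length of the regexp.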

Consider the case where someone writes the regexp
	"[\000-\xffff]"
to search for only Unicode BMP chars in emacs-unicode.  I
suspect that, if we implement the above method, compiling
this regexp when case-fold-search is non-nil will take
longer than people usually expect.

So, I agree with Stephen that his method is good enough.
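
For comparison, here is a small Python sketch of one reading of
the quoted suggestion (use the smallest contiguous range that
covers every char we should match), next to the exact set.  The
names `folded_codes` and `contiguous_hull` are hypothetical, and
lower()/upper() again stands in for a case table.  How the real
code would find the end points of that range is not shown here.

```python
def folded_codes(lo, hi):
    """Exact set of codes matching lo..hi when case is ignored
    (using str.lower/upper as a stand-in for a case table)."""
    codes = set()
    for code in range(lo, hi + 1):
        ch = chr(code)
        codes.update(ord(c) for c in (ch, ch.lower(), ch.upper())
                     if len(c) == 1)
    return codes


def contiguous_hull(lo, hi):
    """Smallest single range that covers every code in the exact set."""
    codes = folded_codes(lo, hi)
    return min(codes), max(codes)


exact = folded_codes(ord("A"), ord("_"))
first, last = contiguous_hull(ord("A"), ord("_"))
# Chars the single range matches although the exact set does not.
extra = [chr(c) for c in range(first, last + 1) if c not in exact]
print(chr(first), chr(last), extra)  # A z ['`']
```

For A-_ the single range is A-z, which over-matches only the
backquote character; that is the inexactness being traded for
simplicity.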

---
Ken'ichi HANDA
handa@etl.go.jp