From mboxrd@z Thu Jan  1 00:00:00 1970
Path: main.gmane.org!not-for-mail
From: Richard Stallman <rms@gnu.org>
Newsgroups: gmane.emacs.devel
Subject: Re: regex and case-fold-search problem
Date: Fri, 30 Aug 2002 15:19:14 -0400
Sender: emacs-devel-admin@gnu.org
Message-ID: <E17krIY-0004ow-00@fencepost.gnu.org>
References: <200208230625.PAA23426@etlken.m17n.org> <200208262151.g7QLpfA12782@wijiji.santafe.edu> <200208290853.RAA03185@etlken.m17n.org>
Reply-To: rms@gnu.org
NNTP-Posting-Host: localhost.gmane.org
X-Trace: main.gmane.org 1030736128 25587 127.0.0.1 (30 Aug 2002 19:35:28 GMT)
X-Complaints-To: usenet@main.gmane.org
NNTP-Posting-Date: Fri, 30 Aug 2002 19:35:28 +0000 (UTC)
Cc: emacs-devel@gnu.org
Return-path: <emacs-devel-admin@gnu.org>
Original-Received: from quimby.gnus.org ([80.91.224.244])
	by main.gmane.org with esmtp (Exim 3.35 #1 (Debian))
	id 17krYF-0006ea-00
	for <emacs-devel@main.gmane.org>; Fri, 30 Aug 2002 21:35:27 +0200
Original-Received: from monty-python.gnu.org ([199.232.76.173])
	by quimby.gnus.org with esmtp (Exim 3.12 #1 (Debian))
	id 17ks59-0005Xk-00
	for <emacs-devel@quimby.gnus.org>; Fri, 30 Aug 2002 22:09:27 +0200
Original-Received: from localhost ([127.0.0.1] helo=monty-python.gnu.org)
	by monty-python.gnu.org with esmtp (Exim 4.10)
	id 17krZf-00068v-00; Fri, 30 Aug 2002 15:36:55 -0400
Original-Received: from list by monty-python.gnu.org with tmda-scanned (Exim 4.10)
	id 17krIb-00048S-00
	for emacs-devel@gnu.org; Fri, 30 Aug 2002 15:19:17 -0400
Original-Received: from mail by monty-python.gnu.org with spam-scanned (Exim 4.10)
	id 17krIY-00048G-00
	for emacs-devel@gnu.org; Fri, 30 Aug 2002 15:19:16 -0400
Original-Received: from fencepost.gnu.org ([199.232.76.164])
	by monty-python.gnu.org with esmtp (Exim 4.10)
	id 17krIY-000486-00
	for emacs-devel@gnu.org; Fri, 30 Aug 2002 15:19:14 -0400
Original-Received: from rms by fencepost.gnu.org with local (Exim 4.10)
	id 17krIY-0004ow-00; Fri, 30 Aug 2002 15:19:14 -0400
Original-To: handa@etl.go.jp
In-Reply-To: <200208290853.RAA03185@etlken.m17n.org> (message from Kenichi
	Handa on Thu, 29 Aug 2002 17:53:53 +0900 (JST))
Errors-To: emacs-devel-admin@gnu.org
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.0.11
Precedence: bulk
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Post: <mailto:emacs-devel@gnu.org>
List-Subscribe: <http://mail.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
List-Id: Emacs development discussions. <emacs-devel.gnu.org>
List-Unsubscribe: <http://mail.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://mail.gnu.org/pipermail/emacs-devel/>
Xref: main.gmane.org gmane.emacs.devel:7181
X-Report-Spam: http://spam.gmane.org/gmane.emacs.devel:7181

    So, I agree with Stephen that his method is good enough.

It is wrong even for ASCII--we definitely must do something better, at
least for ASCII.  The only question is, how much more than ASCII?

    I think we all know that is the right behaviour, and at
    least for ASCII, the latest code works that way.  Perhaps we
    should make Emacs work correctly also for Latin-1 chars,
    because in emacs-unicode also, they have the same code
    order.

What about for Latin-2 characters?  Will those regexp ranges
change their meaning in emacs-unicode?

If so, perhaps we only need to make an effort to support ranges really
right for codes 0-256.

    > A faster way, in the usual cases, would be to look for the case where
    > several consecutive characters have just one case-sibling each,
    > and the siblings are consecutive too.  Each subrange of this kind can
    > be turned into two subranges, the original and the case-converted.
    > Also identify subranges of characters that have no case-siblings; each
    > subrange of this kind just remains as it is.  Finally, any unusual
    > characters that are encountered can be replaced with a list of all the
    > case-siblings.

    > This too requires use of the whole case table.

    Implementing that for any range of characters consumes our
    man-power and makes the running code slower.

It is not a very hard program to write, I think.  I'd guess around 30
lines.  However, you're right about the slowness for large ranges.  If
we only do this for codes 0-256 (or, currently, for ASCII and
Latin-1), then it won't be too slow.
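
As a rough illustration of the subrange splitting quoted above, here is a
minimal sketch in Python rather than in the C of the regexp compiler.  The
names fold_range and ascii_siblings are hypothetical, and the toy case table
covers only ASCII letters; a real version would consult the whole case
table.  It walks the range one code at a time, which is why it would be
slow for large ranges.

```python
def ascii_siblings(code):
    """Toy case table (an assumption): other-case forms of an ASCII letter."""
    ch = chr(code)
    if "a" <= ch <= "z":
        return [code - 32]
    if "A" <= ch <= "Z":
        return [code + 32]
    return []


def fold_range(lo, hi, siblings=ascii_siblings):
    """Expand the inclusive range lo..hi into (start, end) subranges
    that also cover every case-sibling of every character in it."""
    out = []
    c = lo
    while c <= hi:
        sibs = siblings(c)
        end = c
        if len(sibs) == 0:
            # Run of characters with no case-siblings: it remains as it is.
            while end < hi and not siblings(end + 1):
                end += 1
            out.append((c, end))
        elif len(sibs) == 1:
            # Run of characters with one sibling each, where the siblings
            # are consecutive too: emit the original and the converted run.
            while end < hi and siblings(end + 1) == [sibs[0] + (end + 1 - c)]:
                end += 1
            out.append((c, end))
            out.append((sibs[0], sibs[0] + (end - c)))
        else:
            # Unusual character: list it and each of its siblings singly.
            out.append((c, c))
            out.extend((s, s) for s in sibs)
        c = end + 1
    return out
```

For example, fold_range(ord("X"), ord("c")) splits X-c into the run X-Z
plus x-z, the unchanged run from [ to `, and the run a-c plus A-C.  The
sketch does not merge subranges that happen to overlap.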

    Consider the situation that one writes this regexp
	    "[\000-\xffff]"
    to search only Unicode BMP chars in emacs-unicode.

Do you think that is a reasonable kind of range that we
should try to support?  If so, there goes my idea that
we only need to support ranges in 0-256 very well.

On the other hand, if we handle \000-\xffff by doing case conversion
carefully only for ASCII and Latin-1, and treat the rest of the range
in a less smart way, we would get the same results in this case.
Is that a good solution?
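
To make the proposal concrete, here is a minimal sketch in Python of cutting
a range at the end of Latin-1, so that only the low part gets the careful
case conversion.  The name split_at_latin1 and the constant LATIN1_MAX are
hypothetical, and what the "less smart" treatment of the remainder should be
is deliberately left open.

```python
LATIN1_MAX = 0xFF  # last code of Latin-1


def split_at_latin1(lo, hi):
    """Split the inclusive range lo..hi into (careful, rest).

    careful is the part within 0..0xFF, rest is the part above it;
    either one is None when the range does not reach that side.
    """
    careful = (lo, min(hi, LATIN1_MAX)) if lo <= LATIN1_MAX else None
    rest = (max(lo, LATIN1_MAX + 1), hi) if hi > LATIN1_MAX else None
    return careful, rest
```

With this, the range \000-\xffff is cut into 0x00-0xFF, to be case-converted
carefully, and 0x100-0xFFFF, to be handled in the simpler way.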