From mboxrd@z Thu Jan  1 00:00:00 1970
Path: main.gmane.org!not-for-mail
From: "Stefan Monnier" <monnier+gnu/emacs@rum.cs.yale.edu>
Newsgroups: gmane.emacs.devel
Subject: Re: regex and case-fold-search problem
Date: Fri, 23 Aug 2002 13:36:41 -0400
Sender: emacs-devel-admin@gnu.org
Message-ID: <200208231736.g7NHafW02174@rum.cs.yale.edu>
References: <200208230625.PAA23426@etlken.m17n.org>
Cc: emacs-devel@gnu.org
Original-To: Kenichi Handa <handa@etl.go.jp>

> While working on emacs-unicode, I noticed a very difficult
> problem which also exists in the current emacs.
>
> (let ((case-fold-search nil))
>   (string-match "[Þ-ß]" "Þ")) => 0
> (let ((case-fold-search nil))
>   (string-match "[Þß]" "Þ")) => 0
>
> (let ((case-fold-search t))
>   (string-match "[Þ-ß]" "Þ")) => nil !!!
> (let ((case-fold-search t))
>   (string-match "[Þß]" "Þ")) => 0
>
> When you see the output of M-x list-charset-chars RET
> latin-iso8859-1 RET,  you'll soon find what's going on.
>
> The relevant character codes are as follows:
> 	Þ (#x8DE)
> 	ß (#x8DF)
> 	(downcase ?Þ) == ?þ (#x8FE)
> 	(downcase ?ß) == ?ß (#x8DF)
>
> This problem is not specific to non-ASCII chars, it's just
> rarer to face such a situation with ASCII chars.
>
> (let ((case-fold-search nil))
>   (string-match "[A-_]" "A")) => 0
> (let ((case-fold-search t))
>   (string-match "[A-_]" "A")) => nil
> (let ((case-fold-search t))
>   (string-match "[A_]" "A")) => 0
>
> In my opinion, specifying ranges by chars is nonsense
> because there should be no semantics in the order of
> character codes.

Indeed.  POSIX basically says the behavior is unclear (it's
locale-dependent).

But I think that if it works with (case-fold-search nil) it should
also work with (case-fold-search t).  The current behavior is really
counter-intuitive.

> But, anyway, we have to decide what to do.
>
> (1) Regard the above case as a bug, and fix it completely.
>     As we don't support a range striding over different
>     charsets by the current Emacs, I think the fix is
>     difficult but not that much.  But, in emacs-unicode, we
>     can't have such a restriction, and thus the fix is very
>     difficult.

For ASCII it's pretty easy to fix.  But for other charsets, it's
indeed more tricky.  Maybe we can simply use the smallest contiguous
range of chars that includes all the chars we should match,
so the behavior is indeed "implementation-defined" (in the sense
that it's not necessarily obvious to the user what happens) but
it's at least less confusing (in the sense that (case-fold-search t)
matches at least as much as (case-fold-search nil)).
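
To make the "smallest contiguous range" idea concrete, here is a
minimal sketch in Emacs Lisp.  `my-case-fold-range' is a hypothetical
helper, not an existing Emacs function, and it is not how the regexp
compiler works.  It assumes the standard case table and walks the
range one char at a time, which is fine for illustration but would
need something smarter for large ranges.

```elisp
;; Hypothetical helper (an illustration only, not part of Emacs):
;; widen a char range so that it also covers the upper- and lower-case
;; variants of every char in the original range.
(defun my-case-fold-range (from to)
  "Return (LO . HI), the smallest contiguous range covering FROM..TO
together with the `upcase' and `downcase' of each char in that range."
  (let ((lo from)
        (hi to)
        (c from))
    (while (<= c to)
      ;; Extend the bounds to include both case variants of C.
      (setq lo (min lo (downcase c) (upcase c))
            hi (max hi (downcase c) (upcase c))
            c (1+ c)))
    (cons lo hi)))

;; With the standard case table, [A-_] widens to ?A .. ?z:
(my-case-fold-range ?A ?_)   ; => (65 . 122)
```

The widened range always contains the original one, so matching
against it with (case-fold-search t) accepts at least everything that
(case-fold-search nil) accepts.  The price is that it also accepts
chars nobody asked for (here the backquote, which sits between ?_ and
?a), which is the "implementation-defined" part.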

> (2) Regard the above case as an (unpleasant) feature, and
>     document it.

I think we should document the fact that char-ranges shouldn't
be relied upon too much, especially outside of ASCII.  That's
true no matter how we deal with the problem.

> (3) Signal an error for such a regex (and of course document it).

That might be an option as well.


	Stefan