unofficial mirror of emacs-devel@gnu.org 
From: "Stefan Monnier" <monnier+gnu/emacs@rum.cs.yale.edu>
Cc: emacs-devel@gnu.org
Subject: Re: regex and case-fold-search problem
Date: Fri, 23 Aug 2002 13:36:41 -0400
Message-ID: <200208231736.g7NHafW02174@rum.cs.yale.edu>
In-Reply-To: 200208230625.PAA23426@etlken.m17n.org

> While working on emacs-unicode, I noticed a very difficult
> problem which also exists in the current emacs.
> 
> (let ((case-fold-search nil))
>   (string-match "[Þ-ß]" "Þ")) => 0
> (let ((case-fold-search nil))
>   (string-match "[Þß]" "Þ")) => 0
> 
> (let ((case-fold-search t))
>   (string-match "[Þ-ß]" "Þ")) => nil !!!
> (let ((case-fold-search t))
>   (string-match "[Þß]" "Þ")) => 0
> 
> When you see the output of M-x list-charset-chars RET
> latin-iso8859-1 RET,  you'll soon find what's going on.
> 
> The relevant character codes are as follows:
> 	Þ (#x8DE)
> 	ß (#x8DF)
> 	(downcase ?Þ) == ?þ (#x8FE)
> 	(downcase ?ß) == ?ß (#x8DF)
> 
> This problem is not specific to non-ASCII chars; it's just
> rarer to face such a situation with ASCII chars.
> 
> (let ((case-fold-search nil))
>   (string-match "[A-_]" "A")) => 0
> (let ((case-fold-search t))
>   (string-match "[A-_]" "A")) => nil
> (let ((case-fold-search t))
>   (string-match "[A_]" "A")) => 0
> 
> In my opinion, specifying ranges by chars is nonsense
> because there should be no semantics in the order of
> character codes.

Indeed.  POSIX basically says the behavior is unclear (it's locale-dependent).

But I think that if it works with (case-fold-search nil) it should
also work with (case-fold-search t).  The current behavior is really
counter-intuitive.
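
A rough way to test that property for a given regexp and string is
sketched below.  The name `my-fold-matches-superset-p' is made up for
illustration; only `string-match' and `case-fold-search' come from
Emacs itself, and the result for the ranges quoted above depends on
the Emacs version you run it in.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun my-fold-matches-superset-p (regexp string)
  "Return t unless REGEXP matches STRING only with case folding off."
  (let ((strict (let ((case-fold-search nil))
                  (string-match regexp string)))
        (folded (let ((case-fold-search t))
                  (string-match regexp string))))
    ;; The property is violated only when the strict search matches
    ;; and the case-folding search does not.
    (if (or (null strict) folded) t nil)))

;; Try it on the problematic range from the report:
(my-fold-matches-superset-p "[A-_]" "A")
```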

> But, anyway, we have to decide what to do.
> 
> (1) Regard the above case as a bug, and fix it completely.
>     As the current Emacs doesn't support a range spanning
>     different charsets, I think the fix is difficult but not
>     that much.  But, in emacs-unicode, we can't have such a
>     restriction, and thus the fix is very difficult.

For ASCII it's pretty easy to fix.  But for other charsets, it's
indeed more tricky.  Maybe we can simply use the smallest contiguous
range of chars that includes all the chars we should match,
so the behavior is indeed "implementation-defined" (in the sense
that it's not necessarily obvious to the user what happens) but
it's at least less confusing (in the sense that (case-fold-search t)
matches at least as much as (case-fold-search nil)).
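
Something along these lines, as a minimal sketch only: it assumes
"the chars we should match" means every char in the range plus its
`upcase' and `downcase' variants, and `my-case-fold-range' is a
made-up name, not existing regex code.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun my-case-fold-range (from to)
  "Return (LO . HI), the smallest contiguous range covering FROM..TO
together with the upcase and downcase variants of each char in it."
  (let ((lo from)
        (hi to)
        (c from))
    (while (<= c to)
      ;; Widen the range so that both case variants of C fit in it.
      (setq lo (min lo (downcase c) (upcase c))
            hi (max hi (downcase c) (upcase c))
            c (1+ c)))
    (cons lo hi)))

(my-case-fold-range ?A ?_)  ; => (65 . 122), i.e. ?A through ?z
```

With the standard case table the widened range for [A-_] also picks
up the backquote and a-z, so it matches more than the user literally
wrote, but never less than the (case-fold-search nil) version.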

> (2) Regard the above case as an (unpleasant) feature, and
>     document it.

I think we should document the fact that char-ranges shouldn't
be relied upon too much, especially outside of ASCII.  That's
true no matter how we deal with the problem.

> (3) Signal an error for such a regex (and of course document it).

That might be an option as well.


	Stefan
