regex and case-fold-search problem

unofficial mirror of emacs-devel@gnu.org 

* regex and case-fold-search problem
@ 2002-08-23  6:25 Kenichi Handa
  2002-08-23 15:56 ` Eli Zaretskii
                   ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Kenichi Handa @ 2002-08-23  6:25 UTC (permalink / raw)


While working on emacs-unicode, I noticed a very difficult
problem which also exists in the current emacs.

(let ((case-fold-search nil))
  (string-match "[Þ-ß]" "Þ")) => 0
(let ((case-fold-search nil))
  (string-match "[Þß]" "Þ")) => 0

(let ((case-fold-search t))
  (string-match "[Þ-ß]" "Þ")) => nil !!!
(let ((case-fold-search t))
  (string-match "[Þß]" "Þ")) => 0

When you see the output of M-x list-charset-chars RET
latin-iso8859-1 RET,  you'll soon find what's going on.

The relevant character codes are as follows:
	Þ (#x8DE)
	ß (#x8DF)
	(downcase ?Þ) == ?þ (#x8FE)
	(downcase ?ß) == ?ß (#x8DF)

This problem is not specific to non-ASCII chars, it's just
rarer to face such a situation in ASCII chars.

(let ((case-fold-search nil))
  (string-match "[A-_]" "A")) => 0
(let ((case-fold-search t))
  (string-match "[A-_]" "A")) => nil
(let ((case-fold-search t))
  (string-match "[A_]" "A")) => 0
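
To make the ASCII case concrete, here is a minimal C sketch of what seems to go wrong. It is not code from regex.c: the function names are hypothetical, and tolower() merely stands in for the case table. Downcasing only the two endpoints turns A.._ (#x41..#x5F) into a.._ (#x61..#x5F), whose start lies after its end, so the range is empty.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical helper, not taken from regex.c: is CH inside START..END
   when the two endpoints and CH are each passed through tolower()?
   This mimics translating only the endpoints of a range.  */
bool
in_range_endpoints_folded (unsigned char start, unsigned char end,
                           unsigned char ch)
{
  int lo = tolower (start);
  int hi = tolower (end);
  int c = tolower (ch);

  /* If the folded start is after the folded end, the range is empty.  */
  return lo <= hi && lo <= c && c <= hi;
}

/* The two cases discussed above.  */
void
check_endpoint_folding (void)
{
  assert (!in_range_endpoints_folded ('A', '_', 'A'));  /* [A-_] loses A */
  assert (in_range_endpoints_folded ('A', 'Z', 'A'));   /* [A-Z] is fine */
}
```

The same reasoning would apply to Þ-ß, where only the start of the range has a distinct lower-case form.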

In my opinion, specifying ranges by chars is nonsense
because there should be no semantics in the order of
character codes.  But, anyway, we have to decide what to
do.

(1) Regard the above case as a bug, and fix it completely.
    As we don't support a range striding over different
    charsets by the current Emacs, I think the fix is
    difficult but not that much.  But, in emacs-unicode, we
    can't have such a restriction, and thus the fix is very
    difficult.

(2) Regard the above case as an (unpleasant) feature, and
    document it.

(3) Signal an error for such a regex (and of course document
    it).

---
Ken'ichi HANDA
handa@etl.go.jp


* Re: regex and case-fold-search problem
  2002-08-23  6:25 regex and case-fold-search problem Kenichi Handa
@ 2002-08-23 15:56 ` Eli Zaretskii
  2002-08-24  0:51   ` Kenichi Handa
  2002-08-23 17:36 ` Stefan Monnier
  2002-08-26 21:51 ` Richard Stallman
  2 siblings, 1 reply; 40+ messages in thread
From: Eli Zaretskii @ 2002-08-23 15:56 UTC (permalink / raw)
  Cc: emacs-devel

> From: Kenichi Handa <handa@etl.go.jp>
> Date: Fri, 23 Aug 2002 15:25:42 +0900 (JST)
> 
> This problem is not specific to non-ASCII chars, it's just
> rarer to face such a situation in ASCII chars.
> 
> (let ((case-fold-search nil))
>   (string-match "[A-_]" "A")) => 0
> (let ((case-fold-search t))
>   (string-match "[A-_]" "A")) => nil
> (let ((case-fold-search t))
>   (string-match "[A_]" "A")) => 0

Does that happen because under case-fold-search non-nil the
characters on the range specification are downcased?

> In my opinion, specifying ranges by chars is nonsense
> because there should be no semantics in the order of
> character codes.

Sorry, I don't understand: how would one specify a range _except_
with two characters and a dash between them?  What am I missing?


* Re: regex and case-fold-search problem
  2002-08-23  6:25 regex and case-fold-search problem Kenichi Handa
  2002-08-23 15:56 ` Eli Zaretskii
@ 2002-08-23 17:36 ` Stefan Monnier
  2002-08-23 21:52   ` Stefan Monnier
                     ` (2 more replies)
  2002-08-26 21:51 ` Richard Stallman
  2 siblings, 3 replies; 40+ messages in thread
From: Stefan Monnier @ 2002-08-23 17:36 UTC (permalink / raw)
  Cc: emacs-devel

> While working on emacs-unicode, I noticed a very difficult
> problem which also exists in the current emacs.
> 
> (let ((case-fold-search nil))
>   (string-match "[Þ-ß]" "Þ")) => 0
> (let ((case-fold-search nil))
>   (string-match "[Þß]" "Þ")) => 0
> 
> (let ((case-fold-search t))
>   (string-match "[Þ-ß]" "Þ")) => nil !!!
> (let ((case-fold-search t))
>   (string-match "[Þß]" "Þ")) => 0
> 
> When you see the output of M-x list-charset-chars RET
> latin-iso8859-1 RET,  you'll soon find what's going on.
> 
> The relevant character codes are as follows:
> 	Þ (#x8DE)
> 	ß (#x8DF)
> 	(downcase ?Þ) == ?þ (#x8FE)
> 	(downcase ?ß) == ?ß (#x8DF)
> 
> This problem is not specific to non-ASCII chars, it's just
> rarer to face such a situation in ASCII chars.
> 
> (let ((case-fold-search nil))
>   (string-match "[A-_]" "A")) => 0
> (let ((case-fold-search t))
>   (string-match "[A-_]" "A")) => nil
> (let ((case-fold-search t))
>   (string-match "[A_]" "A")) => 0
> 
> In my opinion, specifying ranges by chars is nonsense
> because there should be no semantics in the order of
> character codes.

Indeed.  POSIX basically says the behavior is unclear (it's locale-dependent).

But I think that if it works with (case-fold-search nil) it should
also work with (case-fold-search t).  The current behavior is really
counter-intuitive.

> But, anyway, we have to decide what to do.
> 
> (1) Regard the above case as a bug, and fix it completely.
>     As we don't support a range striding over different
>     charsets by the current Emacs, I think the fix is
>     difficult but not that much.  But, in emacs-unicode, we
>     can't have such a restriction, and thus the fix is very
>     difficult.

For ASCII it's pretty easy to fix.  But for other charsets, it's
indeed more tricky.  Maybe we can simply use the smallest contiguous
range of chars that includes all the chars we should match,
so the behavior is indeed "implementation-defined" (in the sense
that it's not necessarily obvious to the user what happens) but
it's at least less confusing (in the sense that (case-fold-search t)
matches at least as much as (case-fold-search nil)).

> (2) Regard the above case as an (unpleasant) feature, and
>     document it.

I think we should document the fact that char-ranges shouldn't
be relied upon too much, especially outside of ASCII.  That's
true no matter how we deal with the problem.

> (3) Signal an error for such a regex (and of course document it).

That might be an option as well.


	Stefan


* Re: regex and case-fold-search problem
  2002-08-23 17:36 ` Stefan Monnier
@ 2002-08-23 21:52   ` Stefan Monnier
  2002-08-24  1:16   ` Kenichi Handa
  2002-08-24 10:40   ` Kai Großjohann
  2 siblings, 0 replies; 40+ messages in thread
From: Stefan Monnier @ 2002-08-23 21:52 UTC (permalink / raw)
  Cc: emacs-devel

"Stefan Monnier" <monnier+gnu/emacs@rum.cs.yale.edu> wrote:
> For ASCII it's pretty easy to fix.  But for other charsets, it's
> indeed more tricky.  Maybe we can simply use the smallest contiguous
> range of chars that includes all the chars we should match,
> so the behavior is indeed "implementation-defined" (in the sense
> that it's not necessarily obvious to the user what happens) but
> it's at least less confusing (in the sense that (case-fold-search t)
> matches at least as much as (case-fold-search nil)).

How about the patch below ?


	Stefan


Index: regex.c
===================================================================
RCS file: /cvsroot/emacs/emacs/src/regex.c,v
retrieving revision 1.176
diff -u -u -b -r1.176 regex.c
--- regex.c	25 Mar 2002 00:45:48 -0000	1.176
+++ regex.c	23 Aug 2002 21:49:10 -0000
@@ -1914,12 +1914,13 @@
 #define BIT_UPPER	0x10
 #define BIT_MULTIBYTE	0x20
 
-/* Set a range (RANGE_START, RANGE_END) to WORK_AREA.  */
-#define SET_RANGE_TABLE_WORK_AREA(work_area, range_start, range_end)	\
+/* Set a range START..END to WORK_AREA.
+   The range is passed through TRANSLATE, so START and END
+   should be untranslated.  */
+#define SET_RANGE_TABLE_WORK_AREA(work_area, start, end)	\
   do {									\
     EXTEND_RANGE_TABLE_WORK_AREA ((work_area), 2);			\
-    (work_area).table[(work_area).used++] = (range_start);		\
-    (work_area).table[(work_area).used++] = (range_end);		\
+    set_image_of_range (&work_area, start, end, translate);	\
   } while (0)
 
 /* Free allocated memory for WORK_AREA.	 */
@@ -2077,6 +2078,31 @@
 }
 #endif
 
+
+
+/* We need to find the image of the range start..end when passed through
+   TRANSLATE.  This is not necessarily TRANSLATE(start)..TRANSLATE(end)
+   and is not even necessarily contiguous.
+   We approximate it with the smallest contiguous range that contains
+   all the chars we need.  */
+static void
+set_image_of_range (work_area, start, end, translate)
+     RE_TRANSLATE_TYPE translate;
+     struct range_table_work_area *work_area;
+     re_wchar_t start, end;
+{
+  re_wchar_t cmin = TRANSLATE (start), cmax = TRANSLATE (end);
+  if (RE_TRANSLATE_P (translate))
+    for (; start <= end; start++)
+      {
+	re_wchar_t c = TRANSLATE (start);
+	cmin = MIN (cmin, c);
+	cmax = MAX (cmax, c);
+      }
+  work_area->table[work_area->used++] = (cmin);
+  work_area->table[work_area->used++] = (cmax);
+}
+
 /* Explicit quit checking is only used on NTemacs.  */
 #if defined WINDOWSNT && defined emacs && defined QUIT
 extern int immediate_quit;
@@ -2525,14 +2551,18 @@
 
 		if (p == pend) FREE_STACK_RETURN (REG_EBRACK);
 
-		PATFETCH (c);
+		/* Don't translate yet.  The range TRANSLATE(X..Y) cannot
+		   always be determined from TRANSLATE(X) and TRANSLATE(Y)
+		   So the translation is done later in a loop.  Example:
+		   (let ((case-fold-search t)) (string-match "[A-_]" "A"))  */
+		PATFETCH_RAW (c);
 
 		/* \ might escape characters inside [...] and [^...].  */
 		if ((syntax & RE_BACKSLASH_ESCAPE_IN_LISTS) && c == '\\')
 		  {
 		    if (p == pend) FREE_STACK_RETURN (REG_EESCAPE);
 
-		    PATFETCH (c);
+		    PATFETCH_RAW (c);
 		    escaped_char = true;
 		  }
 		else
@@ -2636,10 +2668,10 @@
 		  {
 
 		    /* Discard the `-'. */
-		    PATFETCH (c1);
+		    PATFETCH_RAW (c1);
 
 		    /* Fetch the character which ends the range. */
-		    PATFETCH (c1);
+		    PATFETCH_RAW (c1);
 
 		    if (SINGLE_BYTE_CHAR_P (c))
 		      {
@@ -2653,7 +2685,7 @@
 			       starting at the smallest character in
 			       the charset of C1 and ending at C1.  */
 			    int charset = CHAR_CHARSET (c1);
-			    int c2 = MAKE_CHAR (charset, 0, 0);
+			    re_wchar_t c2 = MAKE_CHAR (charset, 0, 0);
 			    
 			    SET_RANGE_TABLE_WORK_AREA (range_table_work,
 						       c2, c1);
@@ -2672,7 +2704,7 @@
 		  /* ... into bitmap.  */
 		  {
 		    re_wchar_t this_char;
-		    int range_start = c, range_end = c1;
+		    re_wchar_t range_start = c, range_end = c1;
 
 		    /* If the start is after the end, the range is empty.  */
 		    if (range_start > range_end)


* Re: regex and case-fold-search problem
  2002-08-23 15:56 ` Eli Zaretskii
@ 2002-08-24  0:51   ` Kenichi Handa
  2002-08-24  1:03     ` Miles Bader
                       ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Kenichi Handa @ 2002-08-24  0:51 UTC (permalink / raw)
  Cc: emacs-devel

In article <9003-Fri23Aug2002185625+0300-eliz@is.elta.co.il>, "Eli Zaretskii" <eliz@is.elta.co.il> writes:
>>  (let ((case-fold-search nil))
>>    (string-match "[A-_]" "A")) => 0
>>  (let ((case-fold-search t))
>>    (string-match "[A-_]" "A")) => nil
>>  (let ((case-fold-search t))
>>    (string-match "[A_]" "A")) => 0

> Does that happen because under case-fold-search non-nil the
> characters on the range specification are downcased?

Yes.

>>  In my opinion, specifying ranges by chars is nonsense
>>  because there should be no semantics in the order of
>>  character codes.

> Sorry, I don't understand: how would one specify a range _except_
> with two characters and a dash between them?  What am I missing?

I mean that the concept of character range itself is not
good.  A character code is just an identifier of a
character.  We usually don't think about "a range of
identifiers" (e.g. "symbols in the range between t and nil"
is nonsense).

---
Ken'ichi HANDA
handa@etl.go.jp


* Re: regex and case-fold-search problem
  2002-08-24  0:51   ` Kenichi Handa
@ 2002-08-24  1:03     ` Miles Bader
  2002-08-24  9:42       ` Eli Zaretskii
  2002-08-24 16:16       ` Andreas Schwab
  2002-08-24  9:39     ` Eli Zaretskii
  2002-08-25 22:21     ` Kim F. Storm
  2 siblings, 2 replies; 40+ messages in thread
From: Miles Bader @ 2002-08-24  1:03 UTC (permalink / raw)
  Cc: eliz, emacs-devel

On Sat, Aug 24, 2002 at 09:51:46AM +0900, Kenichi Handa wrote:
> I mean that the concept of character range itself is not good.  A character
> code is just an identifier of a character.  We usually don't think about "a
> range of identifiers" (e.g. "symbols in the range between t and nil" is
> nonsense).

Yeah, but character ranges make perfect sense in many local contexts.
E.g., [0-9], or [<0>-<9>] where <0> and <9> are `wide' digits from some
character set.

I think that in cases where the notion of a character range _does_ make
sense, that either both ends will be downcase-able, or that both will not, so
that perhaps the problem won't actually show up in practice if we just say
`only use character ranges when they make sense!'

-Miles
-- 
P.S.  All information contained in the above letter is false,
      for reasons of military security.


* Re: regex and case-fold-search problem
  2002-08-23 17:36 ` Stefan Monnier
  2002-08-23 21:52   ` Stefan Monnier
@ 2002-08-24  1:16   ` Kenichi Handa
  2002-08-25 18:52     ` Stefan Monnier
  2002-08-24 10:40   ` Kai Großjohann
  2 siblings, 1 reply; 40+ messages in thread
From: Kenichi Handa @ 2002-08-24  1:16 UTC (permalink / raw)
  Cc: emacs-devel

In article <200208231736.g7NHafW02174@rum.cs.yale.edu>, "Stefan Monnier" <monnier+gnu/emacs@rum.cs.yale.edu> writes:
> But I think that if it works with (case-fold-search nil) it should
> also work with (case-fold-search t).  The current behavior is really
> counter-intuitive.

I agree.

>>  But, anyway, we have to decide what to do.
>>  
>>  (1) Regard the above case as a bug, and fix it completely.
>>      As we don't support a range striding over different
>>      charsets by the current Emacs, I think the fix is
>>      difficult but not that much.  But, in emacs-unicode, we
>>      can't have such a restriction, and thus the fix is very
>>      difficult.

> For ASCII it's pretty easy to fix.  But for other charsets, it's
> indeed more tricky.  Maybe we can simply use the smallest contiguous
> range of chars that includes all the chars we should match,
> so the behavior is indeed "implementation-defined" (in the sense
> that it's not necessarily obvious to the user what happens) but
> it's at least less confusing (in the sense that (case-fold-search t)
> matches at least as much as (case-fold-search nil)).

Ideally, the range "[A-_]" must be converted to "[a-z[-_]".
But, it seems that your idea is to convert "[A-_]" to
"[_-z]", correct?  I agree that it results in less
counter-intuitive behaviour.

> How about the patch below ?
[...]
?? It seems that the patch handles only non-ASCII chars.

---
Ken'ichi HANDA
handa@etl.go.jp


* Re: regex and case-fold-search problem
  2002-08-24  0:51   ` Kenichi Handa
  2002-08-24  1:03     ` Miles Bader
@ 2002-08-24  9:39     ` Eli Zaretskii
  2002-08-26  1:29       ` Kenichi Handa
  2002-08-25 22:21     ` Kim F. Storm
  2 siblings, 1 reply; 40+ messages in thread
From: Eli Zaretskii @ 2002-08-24  9:39 UTC (permalink / raw)
  Cc: emacs-devel

> Date: Sat, 24 Aug 2002 09:51:46 +0900 (JST)
> From: Kenichi Handa <handa@etl.go.jp>
> 
> > Does that happen because under case-fold-search non-nil the
> > characters on the range specification are downcased?
> 
> Yes.

Then perhaps, instead of downcasing the range, we should do the
comparison in a case-insensitive manner?  Or is that impossible with
the current regex code?

> I mean that the concept of character range itself is not
> good.

As Miles wrote, it does make a perfect sense in a context of a
specific language.  For example, if the characters that designate the
range are all Cyrillic characters, the range is sensible.

It would IMHO be a pity to lose the ability to specify ranges in such
cases.


* Re: regex and case-fold-search problem
  2002-08-24  1:03     ` Miles Bader
@ 2002-08-24  9:42       ` Eli Zaretskii
  2002-08-24 16:16       ` Andreas Schwab
  1 sibling, 0 replies; 40+ messages in thread
From: Eli Zaretskii @ 2002-08-24  9:42 UTC (permalink / raw)
  Cc: emacs-devel

> Date: Fri, 23 Aug 2002 21:03:07 -0400
> From: Miles Bader <miles@gnu.org>
> 
> I think that in cases where the notion of a character range _does_ make
> sense, that either both ends will be downcase-able, or that both will not

Yes, but the problem is that downcasing both ends of a range might
change the range in some (admittedly a bit rare) cases.


* Re: regex and case-fold-search problem
  2002-08-23 17:36 ` Stefan Monnier
  2002-08-23 21:52   ` Stefan Monnier
  2002-08-24  1:16   ` Kenichi Handa
@ 2002-08-24 10:40   ` Kai Großjohann
  2 siblings, 0 replies; 40+ messages in thread
From: Kai Großjohann @ 2002-08-24 10:40 UTC (permalink / raw)
  Cc: Kenichi Handa, emacs-devel

"Stefan Monnier" <monnier+gnu/emacs@rum.cs.yale.edu> writes:

> For ASCII it's pretty easy to fix.  But for other charsets, it's
> indeed more tricky.  Maybe we can simply use the smallest contiguous
> range of chars that includes all the chars we should match,
> so the behavior is indeed "implementation-defined" (in the sense
> that it's not necessarily obvious to the user what happens) but
> it's at least less confusing (in the sense that (case-fold-search t)
> matches at least as much as (case-fold-search nil)).

My first intuition would be to take all the characters in the range
[A-_] (preserving case), then to "double" each character that has an
uppercase and a lowercase variant.

So we are talking about the characters "ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_"
for the given range, and now we make a case-insensitive variant of
this list of characters.

Does this make sense?  Is it feasible to implement?
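
For ASCII at least, the idea can be sketched in a few lines of C. This is only an illustration under stated assumptions: build_folded_set is a hypothetical name, the table covers codes 0..127 only, and tolower()/toupper() stand in for the Emacs case table.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical sketch, ASCII only: mark every character of START..END
   in SET, then also mark the other-case variant of each member.  */
void
build_folded_set (bool set[128], unsigned char start, unsigned char end)
{
  assert (start < 128 && end < 128);
  memset (set, 0, 128 * sizeof set[0]);
  for (int c = start; c <= end; c++)
    {
      set[c] = true;             /* the character itself */
      set[tolower (c)] = true;   /* its lower-case variant, if any */
      set[toupper (c)] = true;   /* its upper-case variant, if any */
    }
}
```

For the range A-_ this marks A-Z, a-z and [\]^_ and nothing else, so the backquote (#x60) that sits between _ and a stays excluded.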

kai
-- 
A large number of young women don't trust men with beards.  (BFBS Radio)


* Re: regex and case-fold-search problem
  2002-08-24  1:03     ` Miles Bader
  2002-08-24  9:42       ` Eli Zaretskii
@ 2002-08-24 16:16       ` Andreas Schwab
  2002-08-26  1:54         ` Miles Bader
  2002-08-26 21:51         ` Richard Stallman
  1 sibling, 2 replies; 40+ messages in thread
From: Andreas Schwab @ 2002-08-24 16:16 UTC (permalink / raw)
  Cc: Kenichi Handa, eliz, emacs-devel

Miles Bader <miles@gnu.org> writes:

|> On Sat, Aug 24, 2002 at 09:51:46AM +0900, Kenichi Handa wrote:
|> > I mean that the concept of character range itself is not good.  A character
|> > code is just an identifier of a character.  We usually don't think about "a
|> > range of identifiers" (e.g. "symbols in the range between t and nil" is
|> > nonsense).
|> 
|> Yeah, but character ranges make perfect sense in many local contexts.
|> E.g., [0-9], or [<0>-<9>] where <0> and <9> are `wide' digits from some
|> character set.

What does [A-Z] mean in EBCDIC?  [0-9] is a special case, because ISO C
requires that 0,1,2,3,4,5,6,7,8,9 are consecutive in the execution
character set.  But in many locales the collating sequence <A> - <Z>
contains more than just the upper case letters from the English alphabet.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux AG, Deutschherrnstr. 15-19, D-90429 Nürnberg
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


* Re: regex and case-fold-search problem
  2002-08-24  1:16   ` Kenichi Handa
@ 2002-08-25 18:52     ` Stefan Monnier
  2002-08-26  1:56       ` Kenichi Handa
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Monnier @ 2002-08-25 18:52 UTC (permalink / raw)
  Cc: monnier+gnu/emacs, emacs-devel

> In article <200208231736.g7NHafW02174@rum.cs.yale.edu>, "Stefan Monnier" <monnier+gnu/emacs@rum.cs.yale.edu> writes:
> > But I think that if it works with (case-fold-search nil) it should
> > also work with (case-fold-search t).  The current behavior is really
> > counter-intuitive.
> 
> I agree.
> 
> >>  But, anyway, we have to decide what to do.
> >>  
> >>  (1) Regard the above case as a bug, and fix it completely.
> >>      As we don't support a range striding over different
> >>      charsets by the current Emacs, I think the fix is
> >>      difficult but not that much.  But, in emacs-unicode, we
> >>      can't have such a restriction, and thus the fix is very
> >>      difficult.
> 
> > For ASCII it's pretty easy to fix.  But for other charsets, it's
> > indeed more tricky.  Maybe we can simply use the smallest contiguous
> > range of chars that includes all the chars we should match,
> > so the behavior is indeed "implementation-defined" (in the sense
> > that it's not necessarily obvious to the user what happens) but
> > it's at least less confusing (in the sense that (case-fold-search t)
> > matches at least as much as (case-fold-search nil)).
> 
> Ideally, the range "[A-_]" must be converted to "[a-z[-_]".

Indeed and the (new) current code does just that for ASCII.

> But, it seems that your idea is to convert "[A-_]" to
> "[_-z]", correct?  I agree that it results in less
> counter-intuitive behaviour.

Not quite: [_-z] would not include [ \ ] and ^.
So instead it's [[-z] which includes all of [a-z[-_]
as well as ` (in this particular case).
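
The arithmetic behind [[-z] can be checked with a small C sketch. It is not the code of set_image_of_range: image_of_range is a hypothetical name, it handles ASCII only, and tolower() stands in for the translate table.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical sketch (ASCII only): compute the smallest contiguous
   range *CMIN..*CMAX that contains tolower (c) for every c in
   START..END.  */
void
image_of_range (unsigned char start, unsigned char end,
                int *cmin, int *cmax)
{
  assert (start <= end);
  *cmin = *cmax = tolower (start);
  for (int c = start; c <= end; c++)
    {
      int t = tolower (c);
      if (t < *cmin)
        *cmin = t;
      if (t > *cmax)
        *cmax = t;
    }
}
```

Called with 'A' and '_', this yields '[' (#x5B) and 'z' (#x7A): the letters map to #x61..#x7A and [\]^_ stay at #x5B..#x5F, so the enclosing range also picks up ` (#x60).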

> > How about the patch below ?
> [...]
> ?? It seems that the patch handles only non-ASCII chars.

Well, that's because the code for ASCII was already there (just
didn't work right because we did PATFETCH instead of PATFETCH_RAW).


	Stefan


* Re: regex and case-fold-search problem
  2002-08-24  0:51   ` Kenichi Handa
  2002-08-24  1:03     ` Miles Bader
  2002-08-24  9:39     ` Eli Zaretskii
@ 2002-08-25 22:21     ` Kim F. Storm
  2 siblings, 0 replies; 40+ messages in thread
From: Kim F. Storm @ 2002-08-25 22:21 UTC (permalink / raw)
  Cc: eliz, emacs-devel

Kenichi Handa <handa@etl.go.jp> writes:

> In article <9003-Fri23Aug2002185625+0300-eliz@is.elta.co.il>, "Eli Zaretskii" <eliz@is.elta.co.il> writes:
> >>  (let ((case-fold-search nil))
> >>    (string-match "[A-_]" "A")) => 0
> >>  (let ((case-fold-search t))
> >>    (string-match "[A-_]" "A")) => nil
> >>  (let ((case-fold-search t))
> >>    (string-match "[A_]" "A")) => 0
> 
> > Does that happen because under case-fold-search non-nil the
> > characters on the range specification are downcased?
> 
> Yes.
> 
> >>  In my opinion, specifying ranges by chars is nonsense
> >>  because there should be no semantics in the order of
> >>  character codes.
> 
> > Sorry, I don't understand: how would one specify a range _except_
> > with two characters and a dash between them?  What am I missing?
> 
> I mean that the concept of character range itself is not
> good.  A character code is just an identifier of a
> character.  We usually don't think about "a range of
> identifiers" (e.g. "symbols in the range between t and nil"
> is nonsense).

Which is why [[:alpha:]] [[:digit:]] etc were invented for regex's.
They are supposed to "look at the locale"...

-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk


* Re: regex and case-fold-search problem
  2002-08-24  9:39     ` Eli Zaretskii
@ 2002-08-26  1:29       ` Kenichi Handa
  2002-08-26  2:31         ` Miles Bader
  0 siblings, 1 reply; 40+ messages in thread
From: Kenichi Handa @ 2002-08-26  1:29 UTC (permalink / raw)
  Cc: emacs-devel

In article <9743-Sat24Aug2002123958+0300-eliz@is.elta.co.il>, "Eli Zaretskii" <eliz@is.elta.co.il> writes:
>>  > Does that happen because under case-fold-search non-nil the
>>  > characters on the range specification are downcased?
>>  
>>  Yes.

> Then perhaps, instead of downcasing the range, we should do the
> comparison in a case-insensitive manner?  Or is that impossible with
> the current regex code?

Of course, it's not impossible.   It's just not easy.

>>  I mean that the concept of character range itself is not
>>  good.

> As Miles wrote, it does make a perfect sense in a context of a
> specific language.  For example, if the characters that designate the
> range are all Cyrillic characters, the range is sensible.

It makes sense only when we assume some character set (or
locale).  For instance, in Emacs 21, Cyrillic characters have
the same code order as that of iso-8859-5.  But, in
emacs-unicode, we use Unicode.  So, a Cyrillic char range
that works well in Emacs 21 won't work in emacs-unicode.

> It would IMHO be a pity to lose the ability to specify ranges in such
> cases.

I'm not suggesting removing that ability.  I'm just wondering
if it is worth spending our time (and perhaps users' time) to
make Emacs behave completely correctly to handle a char
range especially in the case that case-fold-search is t.

I think something like Stefan's compromise method (quoted
below) is good enough.

> For ASCII it's pretty easy to fix.  But for other charsets, it's
> indeed more tricky.  Maybe we can simply use the smallest contiguous
> range of chars that includes all the chars we should match,
> so the behavior is indeed "implementation-defined" (in the sense
> that it's not necessarily obvious to the user what happens) but
> it's at least less confusing (in the sense that (case-fold-search t)
> matches at least as much as (case-fold-search nil)).

---
Ken'ichi HANDA
handa@etl.go.jp


* Re: regex and case-fold-search problem
  2002-08-24 16:16       ` Andreas Schwab
@ 2002-08-26  1:54         ` Miles Bader
  2002-08-26 16:11           ` Stefan Monnier
  2002-08-26 21:51         ` Richard Stallman
  1 sibling, 1 reply; 40+ messages in thread
From: Miles Bader @ 2002-08-26  1:54 UTC (permalink / raw)
  Cc: Kenichi Handa, eliz, emacs-devel

Andreas Schwab <schwab@suse.de> writes:
> |> Yeah, but character ranges make perfect sense in many local contexts.
> |> E.g., [0-9], or [<0>-<9>] where <0> and <9> are `wide' digits from some
> |> character set.
> 
> What does [A-Z] mean in EBCDIC?  [0-9] is a special case, because ISO C
> requires that 0,1,2,3,4,5,6,7,8,9 are consecutive in the execution
> character set.  But in many locales the collating sequence <A> - <Z>
> contains more than just the upper case letters from the English alphabet.

The question is not `does [A-Z] make sense?', but rather: `_if_ [A-Z]
makes sense, does [a-z] make sense too?'

That is, we aren't the ones writing [A-Z], it's lisp authors or users
entering regexps or something.  If they want to enter a less-than-useful
character range, that's their prerogative; however, emacs should avoid
making what they enter _less_ meaningful because of the case-fold-search
setting.

My point was that perhaps in practice, the ranges that would get screwed
up by case-fold-search are even less sensible than normal, meaning it's
likely most people wouldn't (or shouldn't) use them, and we really don't
need to worry about the issue.  [ASCII is probably a special case, since
it's so well known that people actually do tend to specify weird ranges]

[but it looks like maybe it will get fixed properly anyway...]

-miles
-- 
`Life is a boundless sea of bitterness'


* Re: regex and case-fold-search problem
  2002-08-25 18:52     ` Stefan Monnier
@ 2002-08-26  1:56       ` Kenichi Handa
  0 siblings, 0 replies; 40+ messages in thread
From: Kenichi Handa @ 2002-08-26  1:56 UTC (permalink / raw)
  Cc: emacs-devel

In article <200208251852.g7PIqf121329@rum.cs.yale.edu>, "Stefan Monnier" <monnier+gnu/emacs@rum.cs.yale.edu> writes:
>>  But, it seems that your idea is to convert "[A-_]" to
>>  "[_-z]", correct?  I agree that it results in less
>>  counter-intuitive behaviour.

> Not quite: [_-z] would not include [ \ ] and ^.
> So instead it's [[-z] which includes all of [a-z[-_]
> as well as ` (in this particular case).

Ah!  Right.

>>  > How about the patch below ?
>>  [...]
>>  ?? It seems that the patch handles only non-ASCII chars.

> Well, that's because the code for ASCII was already there (just
> didn't work right because we did PATFETCH instead of PATFETCH_RAW).

I see.  I confirmed that with the latest code.  Thank you!

---
Ken'ichi HANDA
handa@etl.go.jp


* Re: regex and case-fold-search problem
  2002-08-26  1:29       ` Kenichi Handa
@ 2002-08-26  2:31         ` Miles Bader
  0 siblings, 0 replies; 40+ messages in thread
From: Miles Bader @ 2002-08-26  2:31 UTC (permalink / raw)
  Cc: eliz, emacs-devel

Kenichi Handa <handa@etl.go.jp> writes:
> > As Miles wrote, it does make a perfect sense in a context of a
> > specific language.  For example, if the characters that designate the
> > range are all Cyrillic characters, the range is sensible.
> 
> It makes sense only when we assume some character set (or locale).
> For instance, in Emacs 21, Cyrillic characters have the same code order
> as that of iso-8859-5.  But, in emacs-unicode, we use Unicode.  So, a
> Cyrillic char range that works well in Emacs 21 won't work in
> emacs-unicode.

I don't think it really matters.

As I said in a previous message, the question is not `does [A-Z] make
sense?', but rather: `_if_ [A-Z] makes sense, does [a-z] make sense
too?'

If someone writes [<cyrillic-char>-<chinese_char>] then they get
what they deserve; it's not emacs' fault.

-Miles
-- 
.Numeric stability is probably not all that important when you're guessing.


* Re: regex and case-fold-search problem
  2002-08-26  1:54         ` Miles Bader
@ 2002-08-26 16:11           ` Stefan Monnier
  0 siblings, 0 replies; 40+ messages in thread
From: Stefan Monnier @ 2002-08-26 16:11 UTC (permalink / raw)
  Cc: Andreas Schwab, Kenichi Handa, eliz, emacs-devel

> Andreas Schwab <schwab@suse.de> writes:
> > |> Yeah, but character ranges make perfect sense in many local contexts.
> > |> E.g., [0-9], or [<0>-<9>] where <0> and <9> are `wide' digits from some
> > |> character set.
> > 
> > What does [A-Z] mean in EBCDIC?  [0-9] is a special case, because ISO C
> > requires that 0,1,2,3,4,5,6,7,8,9 are consecutive in the execution
> > character set.  But in many locales the collating sequence <A> - <Z>
> > contains more than just the upper case letters from the English alphabet.
> 
> The question is not `does [A-Z] make sense?', but rather: `_if_ [A-Z]
> makes sense, does [a-z] make sense too?'
> 
> That is, we aren't the ones writing [A-Z], it's lisp authors or users
> entering regexps or something.  If they want to enter a less-than-useful
> character range, that's their prerogative; however, emacs should avoid
> making what they enter _less_ meaningful because of the case-fold-search
> setting.
> 
> My point was that perhaps in practice, the ranges that would get screwed
> up by case-fold-search are even less sensible than normal, meaning it's
> likely most people wouldn't (or shouldn't) use them, and we really don't
> need to worry about the issue.  [ASCII is probably a special case, since
> it's so well known that people actually do tend to specify weird ranges]
> 
> [but it looks like maybe it will get fixed properly anyway...]

I agree that we shouldn't spend too much time on it.
The patch I installed does the following:
- Fix a few problems such as ``if the case-table mapped ?* to ?o then
  "\\(fo\\)*" used to only match "foo"''.  Luckily such case-tables
  are not very common, so nobody noticed the problem.
- case-fold-search now works correctly for ranges in ASCII.
- case-fold-search still doesn't work correctly for non-ASCII ranges,
  but it matches at least as much as when case-fold-search is nil: i.e.
  the range might include some chars which the user didn't expect, but it
  at least includes the chars which the user expected.  The previous behavior
  was that the range could include some unexpected chars and could
  also omit some expected ones.  The current code matches at least
  as many strings as the previous one.

I think that's good enough for now,


	Stefan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: regex and case-fold-search problem
  2002-08-23  6:25 regex and case-fold-search problem Kenichi Handa
  2002-08-23 15:56 ` Eli Zaretskii
  2002-08-23 17:36 ` Stefan Monnier
@ 2002-08-26 21:51 ` Richard Stallman
  2002-08-29  8:53   ` Kenichi Handa
  2 siblings, 1 reply; 40+ messages in thread
From: Richard Stallman @ 2002-08-26 21:51 UTC (permalink / raw)
  Cc: emacs-devel

    In my opinion, specifying ranges by chars are nonsense
    because there should be no semantics in the order of
    characters codes.

The fact is, people know the character codes and take advantage of
their knowledge.  I don't think this is unreasonable.  But that
question is academic, since the feature is used and we need to make it
work.

    Does that happen because under case-fold-search non-nil the
    characters on the range specification are downcased?

It looks that way.

      Maybe we can simply use the smallest contiguous
    > range of chars that includes all the chars we should match,

That isn't right.  The range should be equal to the disjunction of all
characters in it; A-_ should be equivalent to []A.....Z[\^_].  With
case folding, that should match A-Z, a-z, and [\]^_.  In other words,
the correct behavior is that all character codes that are equivalent
(when you ignore case) to any character in the originally specified
range should match.

Given the whole case table, you can compute this by looping over the
original (non-case-folded) range and finding, for each character, all
the characters that are equivalent to it.  Then those could be
assembled into the smallest possible number of ranges.

A faster way, in the usual cases, would be to look for the case where
several consecutive characters have just one case-sibling each,
and the siblings are consecutive too.  Each subrange of this kind can
be turned into two subranges, the original and the case-converted.
Also identify subranges of characters that have no case-siblings; each
subrange of this kind just remains as it is.  Finally, any unusual
characters that are encountered can be replaced with a list of all the
case-siblings.

This too requires use of the whole case table.
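
As a rough illustration of the brute-force computation described above,
here is a minimal Python sketch.  It is a model, not Emacs code: Python's
str.lower/str.upper stand in for the case table (an assumption; a real
case table can define larger equivalence classes), and the names
case_siblings and fold_range are hypothetical.

```python
def case_siblings(ch):
    """Stand-in for a case-table lookup: ch plus its one-character
    lower/upper forms.  Multi-character mappings (e.g. the uppercase
    of U+00DF) are ignored because a range holds single characters."""
    return {c for c in (ch, ch.lower(), ch.upper()) if len(c) == 1}


def fold_range(lo, hi):
    """Return sorted (start, end) code ranges covering every character
    that is case-equivalent to some character in lo..hi."""
    codes = set()
    # Loop over the original (non-case-folded) range.
    for code in range(ord(lo), ord(hi) + 1):
        codes.update(ord(s) for s in case_siblings(chr(code)))
    # Assemble the collected codes into the fewest contiguous ranges.
    ranges = []
    for code in sorted(codes):
        if ranges and code == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], code)
        else:
            ranges.append((code, code))
    return ranges
```

Under these assumptions, fold_range("A", "_") yields the original range
plus a-z as a second range, and leaves out the backquote that sits
between them.  The cost is linear in the size of the range, which is the
slowness concern for very large ranges.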

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: regex and case-fold-search problem
  2002-08-24 16:16       ` Andreas Schwab
  2002-08-26  1:54         ` Miles Bader
@ 2002-08-26 21:51         ` Richard Stallman
  1 sibling, 0 replies; 40+ messages in thread
From: Richard Stallman @ 2002-08-26 21:51 UTC (permalink / raw)
  Cc: miles, handa, eliz, emacs-devel

    What does [A-Z] mean in EBCDIC?

Fortunately, we don't need to worry about the question.
Emacs always operates on ASCII or extensions of ASCII.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: regex and case-fold-search problem
  2002-08-26 21:51 ` Richard Stallman
@ 2002-08-29  8:53   ` Kenichi Handa
  2002-08-29 12:33     ` Kim F. Storm
  2002-08-30 19:19     ` Richard Stallman
  0 siblings, 2 replies; 40+ messages in thread
From: Kenichi Handa @ 2002-08-29  8:53 UTC (permalink / raw)
  Cc: emacs-devel

In article <200208262151.g7QLpfA12782@wijiji.santafe.edu>, Richard Stallman <rms@gnu.org> writes:
> The fact is, people know the character codes and take advantage of
> their knowledge.  I don't think this is unreasonable.  But that
> question is academic, since the feature is used and we need to make it
> work.

People know the character codes of the charset they are
familiar with.  So, they can take advantage of that knowledge
only when Emacs internally uses a character representation
whose code order is the same as that charset's.
For instance, those who are familiar with the iso-8859-2 charset
can take advantage of their knowledge in Emacs 21.  But, if
they write such a regular expression, they'll find it
matches different characters in emacs-unicode.

>       Maybe we can simply use the smallest contiguous
>>  range of chars that includes all the chars we should match,

> That isn't right.  The range should be equal to the disjunction of all
> characters in it; A-_ should be equivalent to []A.....Z[\^_].  With
> case folding, that should match A-Z, a-z, and [\]^_.  In other words,
> the correct behavior is that all character codes that are equivalent
> (when you ignore case) to any character in the originally specified
> range should match.

I think we all know that is the right behaviour, and at
least for ASCII, the latest code works that way.  Perhaps we
should make Emacs work correctly also for Latin-1 chars,
because in emacs-unicode also, they have the same code
order.

But...

> Given the whole case table, you can compute this by looping over the
> original (non-case-folded) range and finding, for each character, all
> the characters that are equivalent to it.  Then those could be
> assembled into the smallest possible number of ranges.

> A faster way, in the usual cases, would be to look for the case where
> several consecutive characters have just one case-sibling each,
> and the siblings are consecutive too.  Each subrange of this kind can
> be turned into two subranges, the original and the case-converted.
> Also identify subranges of characters that have no case-siblings; each
> subrange of this kind just remains as it is.  Finally, any unusual
> characters that are encountered can be replaced with a list of all the
> case-siblings.

> This too requires use of the whole case table.

Implementing that for any range of characters consumes our
man-power and makes the running code slower.

Consider the situation where one writes this regexp
	"[\000-\xffff]"
to search only Unicode BMP chars in emacs-unicode.  I
suspect that, if we implement the above method, compiling this
regexp when case-fold-search is non-nil takes longer
than people usually expect.

So, I agree with Stefan that his method is good enough.

---
Ken'ichi HANDA
handa@etl.go.jp

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: regex and case-fold-search problem
  2002-08-29  8:53   ` Kenichi Handa
@ 2002-08-29 12:33     ` Kim F. Storm
  2002-08-29 13:38       ` Kenichi Handa
  2002-08-30 19:19     ` Richard Stallman
  1 sibling, 1 reply; 40+ messages in thread
From: Kim F. Storm @ 2002-08-29 12:33 UTC (permalink / raw)
  Cc: rms, emacs-devel

> 
> Consider the situation that one writes this regexp
> 	"[\000-\xffff]"
> to search only Unicode BMP chars in emacs-unicode.  I
> suspect that, if we implement the above method, compiling this
> regexp when case-fold-search is non-nil takes longer
> than people usually expect.
> 
> So, I agree with Stefan that his method is good enough.

IMO, it is wrong to handle case-fold-search for regexp ranges by
trying to modify the interpretation of the regex range.

Instead, the regex matcher should try to upcase and lowercase each
character in the string and see if either of these characters is
within the given range.

-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: regex and case-fold-search problem
  2002-08-29 12:33     ` Kim F. Storm
@ 2002-08-29 13:38       ` Kenichi Handa
  2002-08-29 15:00         ` Kim F. Storm
  2002-08-29 16:00         ` Stefan Monnier
  0 siblings, 2 replies; 40+ messages in thread
From: Kenichi Handa @ 2002-08-29 13:38 UTC (permalink / raw)
  Cc: rms, emacs-devel

In article <5x8z2pj13t.fsf@kfs2.cua.dk>, storm@cua.dk (Kim F. Storm) writes:
> IMO, it is wrong to handle case-fold-search for regexp ranges by
> trying to modify the interpretation of the regex range.

> Instead, the regex matcher should try to upcase and lowercase each
> character in the string and see if either of these characters is
> within the given range.

I also arrived at that idea.  It makes regexp compiling
simpler and faster but makes regexp matching a little bit
slower.  I don't know if that slowdown is tolerable or
not, but it's worth trying.

---
Ken'ichi HANDA
handa@etl.go.jp

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: regex and case-fold-search problem
  2002-08-29 13:38       ` Kenichi Handa
@ 2002-08-29 15:00         ` Kim F. Storm
  2002-08-29 16:00         ` Stefan Monnier
  1 sibling, 0 replies; 40+ messages in thread
From: Kim F. Storm @ 2002-08-29 15:00 UTC (permalink / raw)
  Cc: storm, rms, emacs-devel

Kenichi Handa <handa@etl.go.jp> writes:

> In article <5x8z2pj13t.fsf@kfs2.cua.dk>, storm@cua.dk (Kim F. Storm) writes:
> > IMO, it is wrong to handle case-fold-search for regexp ranges by
> > trying to modify the interpretation of the regex range.
> 
> > Instead, the regex matcher should try to upcase and lowercase each
> > character in the string and see if either of these characters is
> > within the given range.
> 
> I also arrived at that idea.  It makes regexp compiling
> simpler and faster but makes regexp matching a little bit
> slower.  I don't know if that slowdown is tolerable or
> not, but it's worth trying.

Maybe it can be semi-optimized for a char C as follows:

 MATCH = (C in range) ||
          ((UC = uppercase(C)) != C ? (UC in range) : (lowercase(C) in range))
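
The same test can be written out as a small runnable model.  This is only
a sketch in Python: str.upper/str.lower stand in for the Emacs case
tables (an assumption), the range is a plain lo..hi pair, and the helper
names in_range and matches_folded are hypothetical.

```python
def in_range(ch, lo, hi):
    """True if the single character ch lies in the code range lo..hi."""
    return ord(lo) <= ord(ch) <= ord(hi)


def matches_folded(ch, lo, hi):
    """Match-time case folding: leave the range alone and test the
    character itself, then its uppercase form if that differs,
    otherwise its lowercase form."""
    if in_range(ch, lo, hi):
        return True
    up = ch.upper()
    if up != ch:
        # Skip multi-character case mappings; they cannot be in a range.
        return len(up) == 1 and in_range(up, lo, hi)
    low = ch.lower()
    return len(low) == 1 and in_range(low, lo, hi)
```

In this model the pattern compiler does no extra work for a range of any
size, and each candidate character costs at most one additional case
lookup and range test at match time.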
 
-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: regex and case-fold-search problem
  2002-08-29 13:38       ` Kenichi Handa
  2002-08-29 15:00         ` Kim F. Storm
@ 2002-08-29 16:00         ` Stefan Monnier
  2002-08-30  1:11           ` Kenichi Handa
  1 sibling, 1 reply; 40+ messages in thread
From: Stefan Monnier @ 2002-08-29 16:00 UTC (permalink / raw)
  Cc: storm, rms, emacs-devel

> In article <5x8z2pj13t.fsf@kfs2.cua.dk>, storm@cua.dk (Kim F. Storm) writes:
> > IMO, it is wrong to handle case-fold-search for regexp ranges by
> > trying to modify the interpretation of the regex range.
> 
> > Instead, the regex matcher should try to upcase and lowercase each
> > character in the string and see if either of these characters is
> > within the given range.
> 
> I also arrived at that idea.  It makes regexp compiling
> simpler and faster but makes regexp matching a little bit
> slower.  I don't know if that slowdown is tolerable or
> not, but it's worth trying.

Two things:

- Neither `upper(lower(x)) = x' nor `lower(upper(x)) = x' is guaranteed.
- The regexp matcher right now only has access to one of the two tables
  (I believe it's the `lower' but I'm not even sure) and so two chars
  are deemed to match if translate(a) = translate(b).

The first might be a non-issue, I don't know.
The second is more serious because that means that if we want to use
`upper' we'll need to somehow pass that table as well, which requires
changing the interface to the reg-matching functions.


	Stefan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: regex and case-fold-search problem
  2002-08-29 16:00         ` Stefan Monnier
@ 2002-08-30  1:11           ` Kenichi Handa
  2002-08-30 19:19             ` Richard Stallman
  0 siblings, 1 reply; 40+ messages in thread
From: Kenichi Handa @ 2002-08-30  1:11 UTC (permalink / raw)
  Cc: storm, rms, emacs-devel

In article <200208291600.g7TG0NZ11087@rum.cs.yale.edu>, "Stefan Monnier" <monnier+gnu/emacs@rum.cs.yale.edu> writes:
> Two things:

> - Neither `upper(lower(x)) = x' nor `lower(upper(x)) = x' is guaranteed.
> - The regexp matcher right now only has access to one of the two tables
>   (I believe it's the `lower' but I'm not even sure) and so two chars
>   are deemed to match if translate(a) = translate(b).

> The first might be a non-issue, I don't know.

There's an EQUIVALENCES table.  It seems that the documentation
of set-case-table says that:
  X and Y match in case-fold-search if:
      equiv(X) == Y
   or equiv(equiv(X)) == Y
   or equiv(equiv(equiv(X))) == Y
   or ...
Correct?

> The second is more serious because that means that if we want to use
> `upper' we'll need to somehow pass that table as well, which requires
> changing the interface to the reg-matching functions.

TRANSLATE table is passed as the member `translate' of
re_pattern_buffer.  Instead of setting it to lowercase
table, we can set it to the case-table itself that has
upcase, canon, and equiv tables in the extra slots.

Or, if we can use EQUIVALENCES table as above, what we need
is only that table.
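
If the equivalences table does behave as described above, the test for
"X and Y match" amounts to walking the cycle that starts at X.  A minimal
Python sketch follows; the dict-based table, the sample entries in
TOY_EQUIV, and the function names are all hypothetical and only model
the idea.

```python
def equivalence_class(ch, equiv):
    """Follow equiv(ch), equiv(equiv(ch)), ... until a character
    repeats, and return everything visited, starting with ch."""
    members = [ch]
    nxt = equiv.get(ch, ch)  # characters missing from the table map to themselves
    while nxt not in members:
        members.append(nxt)
        nxt = equiv.get(nxt, nxt)
    return members


def fold_match(x, y, equiv):
    """True if y is reachable from x by repeated application of equiv."""
    return y in equivalence_class(x, equiv)


# Toy table: a two-element cycle and a three-element cycle (k, K, KELVIN SIGN).
TOY_EQUIV = {"a": "A", "A": "a", "k": "K", "K": "\u212a", "\u212a": "k"}
```

With a table like this, one lookup structure is enough to enumerate all
case-siblings of a character, so neither a separate `upper' table nor a
`lower' table would be needed for the range test.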

---
Ken'ichi HANDA
handa@etl.go.jp

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: regex and case-fold-search problem
  2002-08-29  8:53   ` Kenichi Handa
  2002-08-29 12:33     ` Kim F. Storm
@ 2002-08-30 19:19     ` Richard Stallman
  2002-08-30 20:08       ` Stefan Monnier
  2002-08-31  6:14       ` regex and case-fold-search problem Eli Zaretskii
  1 sibling, 2 replies; 40+ messages in thread
From: Richard Stallman @ 2002-08-30 19:19 UTC (permalink / raw)
  Cc: emacs-devel

    So, I agree with Stefan that his method is good enough.

It is wrong even for ASCII--we definitely must do something better, at
least for ASCII.  The only question is, how much more than ASCII?

    I think we all know that is the right behaviour, and at
    least for ASCII, the latest code works that way.  Perhaps we
    should make Emacs work correctly also for Latin-1 chars,
    because in emacs-unicode also, they have the same code
    order.

What about for Latin-2 characters?  Will those regexp ranges
change their meaning in emacs-unicode?

If so, perhaps we only need to make an effort to support ranges really
right for codes 0-256.

    > A faster way, in the usual cases, would be to look for the case where
    > several consecutive characters have just one case-sibling each,
    > and the siblings are consecutive too.  Each subrange of this kind can
    > be turned into two subranges, the original and the case-converted.
    > Also identify subranges of characters that have no case-siblings; each
    > subrange of this kind just remains as it is.  Finally, any unusual
    > characters that are encountered can be replaced with a list of all the
    > case-siblings.

    > This too requires use of the whole case table.

    Implementing that for any range of characters consumes our
    man-power and makes the running code slower.

It is not a very hard program to write, I think.  I'd guess around 30
lines.  However, you're right about the slowness for large ranges.  If
we only do this for codes 0-256 (or, currently, for ASCII and
Latin-1), then it won't be too slow.

    Consider the situation that one writes this regexp
	    "[\000-\xffff]"
    to search only Unicode BMP chars in emacs-unicode.

Do you think that is a reasonable kind of range that we
should try to support?  If so, there goes my idea that
we only need to support ranges in 0-256 very well.

On the other hand, if we handle \000-\xffff by doing case conversion
carefully only for ASCII and Latin-1, and treat the rest of the range
in a less smart way, we would get the same results in this case.
Is that a good solution?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: regex and case-fold-search problem
  2002-08-30  1:11           ` Kenichi Handa
@ 2002-08-30 19:19             ` Richard Stallman
  0 siblings, 0 replies; 40+ messages in thread
From: Richard Stallman @ 2002-08-30 19:19 UTC (permalink / raw)
  Cc: monnier+gnu/emacs, storm, emacs-devel

This slot in the case table may be useful:

    CANONICALIZE maps each character to a canonical equivalent;
     any two characters that are related by case-conversion have the same
     canonical equivalent character;

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: regex and case-fold-search problem
  2002-08-30 19:19     ` Richard Stallman
@ 2002-08-30 20:08       ` Stefan Monnier
  2002-09-01 13:15         ` Richard Stallman
  2002-08-31  6:14       ` regex and case-fold-search problem Eli Zaretskii
  1 sibling, 1 reply; 40+ messages in thread
From: Stefan Monnier @ 2002-08-30 20:08 UTC (permalink / raw)
  Cc: handa, emacs-devel

>     So, I agree with Stefan that his method is good enough.
> 
> It is wrong even for ASCII

Do you have any evidence to support that claim ?


	Stefan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: regex and case-fold-search problem
  2002-08-30 19:19     ` Richard Stallman
  2002-08-30 20:08       ` Stefan Monnier
@ 2002-08-31  6:14       ` Eli Zaretskii
  2002-09-01 13:14         ` Richard Stallman
  1 sibling, 1 reply; 40+ messages in thread
From: Eli Zaretskii @ 2002-08-31  6:14 UTC (permalink / raw)
  Cc: emacs-devel

> From: Richard Stallman <rms@gnu.org>
> Date: Fri, 30 Aug 2002 15:19:14 -0400
> 
>     I think we all know that is the right behaviour, and at
>     least for ASCII, the latest code works that way.  Perhaps we
>     should make Emacs work correctly also for Latin-1 chars,
>     because in emacs-unicode also, they have the same code
>     order.
> 
> What about for Latin-2 characters?  Will those regexp ranges
> change their meaning in emacs-unicode?

Yes.  Latin-2 characters have a different order in Unicode than in
8859-2.  Those characters which are common to Latin-2 and Latin-1 are
in the same order, but those which aren't have different places.  The
same goes for all the other Latin-N characters where N != 1.

We could have some code to map a range specified by a Lisp program
into a range of internal character codepoints (in Unicode Emacs, the
latter would be Unicode codepoints).  We could make this code depend
on some user variable that states the external ordering meant by the
application.  For example, Cyrillic users could tell Emacs that [A-Z]
was intended to work as in KOI8-R or as in 8859-5.

Would something like that work?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: regex and case-fold-search problem
  2002-08-31  6:14       ` regex and case-fold-search problem Eli Zaretskii
@ 2002-09-01 13:14         ` Richard Stallman
  0 siblings, 0 replies; 40+ messages in thread
From: Richard Stallman @ 2002-09-01 13:14 UTC (permalink / raw)
  Cc: emacs-devel

    > What about for Latin-2 characters?  Will those regexp ranges
    > change their meaning in emacs-unicode?

    Yes.  Latin-2 characters have a different order in Unicode than in
    8859-2.  Those characters which are common to Latin-2 and Latin-1 are
    in the same order, but those which aren't have different places.  The
    same goes for all the other Latin-N characters where N != 1.

This suggests that perhaps there is no need to be careful about
case-folding of ranges outside of ASCII and Latin-1.

    We could have some code to map a range specified by a Lisp program
    into a range of internal character codepoints (in Unicode Emacs, the
    latter would be Unicode codepoints).  We could make this code depend
    on some user variable that states the external ordering meant by the
    application.  For example, Cyrillic users could tell Emacs that [A-Z]
    was intended to work as in KOI8-R or as in 8859-5.

This is a coherent idea, but since it is a substantial amount of work,
the question is whether it is better to do this or do nothing about
those cases.  I wonder how many programs use ranges of Latin-2 or
KOI8-R and depend on case-folding to work precisely.  Probably few or
none.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: regex and case-fold-search problem
  2002-08-30 20:08       ` Stefan Monnier
@ 2002-09-01 13:15         ` Richard Stallman
  2002-09-01 16:26           ` Stefan Monnier
  0 siblings, 1 reply; 40+ messages in thread
From: Richard Stallman @ 2002-09-01 13:15 UTC (permalink / raw)
  Cc: handa, emacs-devel

    > It is wrong even for ASCII

    Do you have any evidence to support that claim ?

You yourself said it would match characters which
were not case-equivalent to something in the originally specified range.
That means it is wrong.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: regex and case-fold-search problem
  2002-09-01 13:15         ` Richard Stallman
@ 2002-09-01 16:26           ` Stefan Monnier
  2002-09-02 14:54             ` Richard Stallman
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Monnier @ 2002-09-01 16:26 UTC (permalink / raw)
  Cc: monnier+gnu/emacs, handa, emacs-devel

>     > It is wrong even for ASCII
>     Do you have any evidence to support that claim ?
> You yourself said it would match characters which
> were not case-equivalent to something in the originally specified range.

When was it ?  I'd guess that was before I installed my patch.

> That means it is wrong.

Thank you for your trust ;-)


	Stefan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: regex and case-fold-search problem
  2002-09-01 16:26           ` Stefan Monnier
@ 2002-09-02 14:54             ` Richard Stallman
  2002-09-02 16:58               ` Stefan Monnier
  0 siblings, 1 reply; 40+ messages in thread
From: Richard Stallman @ 2002-09-02 14:54 UTC (permalink / raw)
  Cc: monnier+gnu/emacs, handa, emacs-devel

    >     > It is wrong even for ASCII
    >     Do you have any evidence to support that claim ?
    > You yourself said it would match characters which
    > were not case-equivalent to something in the originally specified range.

    When was it ?  I'd guess that was before I installed my patch.

src/ChangeLog does not list any recent changes in regex.c.
Did you install a change and fail to put it in ChangeLog?

Anyway, the change you sent seemed to have the problem of including
excess characters in ASCII ranges.  The change can't stay if it has
that problem.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: regex and case-fold-search problem
  2002-09-02 14:54             ` Richard Stallman
@ 2002-09-02 16:58               ` Stefan Monnier
  2002-09-04 14:13                 ` Richard Stallman
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Monnier @ 2002-09-02 16:58 UTC (permalink / raw)
  Cc: monnier+gnu/emacs, handa, emacs-devel

>     >     > It is wrong even for ASCII
>     >     Do you have any evidence to support that claim ?
>     > You yourself said it would match characters which
>     > were not case-equivalent to something in the originally specified range.
> 
>     When was it ?  I'd guess that was before I installed my patch.
> 
> src/ChangeLog does not list any recent changes in regex.c.
> Did you install a change and fail to put it in ChangeLog?

2002-08-23  Stefan Monnier  <monnier@cs.yale.edu>

	* regex.c (PATFETCH): Remove the translating fetch.
	(PATFETCH_RAW): Rename to PATFETCH.
	(set_image_of_range): New fun.
	(SET_RANGE_TABLE_WORK_AREA): Use it.
	(regex_compile): Don't translate the pattern chars so eagerly.
	Only do it when inserting an `exactn' bytecode or when handling
	a char-range.
	(mutually_exclusive_p): Avoid empty statement.

> Anyway, the change you sent seemed to have the problem of including
> excess characters in ASCII ranges.

No, only in non-ASCII chars.  The excess is introduced in set_image_of_range
which is only used for non-ASCII chars.


	Stefan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: regex and case-fold-search problem
  2002-09-02 16:58               ` Stefan Monnier
@ 2002-09-04 14:13                 ` Richard Stallman
  2002-09-04 16:04                   ` Stefan Monnier
  0 siblings, 1 reply; 40+ messages in thread
From: Richard Stallman @ 2002-09-04 14:13 UTC (permalink / raw)
  Cc: monnier+gnu/emacs, handa, emacs-devel

    No, only in non-ASCII chars.  The excess is introduced in set_image_of_range
    which is only used for non-ASCII chars.

Does that include Latin-1?
The results of our conversation suggest that we need to fix this
at least for Latin-1.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: regex and case-fold-search problem
  2002-09-04 14:13                 ` Richard Stallman
@ 2002-09-04 16:04                   ` Stefan Monnier
  2002-09-05 18:02                     ` Richard Stallman
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Monnier @ 2002-09-04 16:04 UTC (permalink / raw)
  Cc: monnier+gnu/emacs, handa, emacs-devel

>     No, only in non-ASCII chars.  The excess is introduced in set_image_of_range
>     which is only used for non-ASCII chars.
> Does that include Latin-1?

No.

> The results of our conversation suggest that we need to fix this
> at least for Latin-1.

I don't feel an urgent need, so you'll be more quickly served if you ask
someone else to do it.  He'll need to improve set_image_of_range.


	Stefan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: regex and case-fold-search problem
  2002-09-04 16:04                   ` Stefan Monnier
@ 2002-09-05 18:02                     ` Richard Stallman
  2002-09-06  1:00                       ` re-search-forward seems to be broken Miles Bader
  0 siblings, 1 reply; 40+ messages in thread
From: Richard Stallman @ 2002-09-05 18:02 UTC (permalink / raw)
  Cc: monnier+gnu/emacs, handa, emacs-devel

    I don't feel an urgent need, so you'll be more quickly served if you ask
    someone else to do it.  He'll need to improve set_image_of_range.

I did it.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* re-search-forward seems to be broken
  2002-09-05 18:02                     ` Richard Stallman
@ 2002-09-06  1:00                       ` Miles Bader
  2002-09-06 20:03                         ` Richard Stallman
  0 siblings, 1 reply; 40+ messages in thread
From: Miles Bader @ 2002-09-06  1:00 UTC (permalink / raw)
  Cc: monnier+gnu/emacs, handa, emacs-devel

When I do:

  (re-search-forward "[«»{}()]" nil t)

I get:

  Lisp error: (wrong-type-argument arrayp nil)

I presume this is from the `set_image_of_range' changes.

-Miles
-- 
P.S.  All information contained in the above letter is false,
      for reasons of military security.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: re-search-forward seems to be broken
  2002-09-06  1:00                       ` re-search-forward seems to be broken Miles Bader
@ 2002-09-06 20:03                         ` Richard Stallman
  0 siblings, 0 replies; 40+ messages in thread
From: Richard Stallman @ 2002-09-06 20:03 UTC (permalink / raw)
  Cc: monnier+gnu/emacs, handa, emacs-devel


      (re-search-forward "[«»{}()]" nil t)

    I get:

      Lisp error: (wrong-type-argument arrayp nil)

I fixed this.  Thanks.

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2002-09-06 20:03 UTC | newest]

Thread overview: 40+ messages
2002-08-23  6:25 regex and case-fold-search problem Kenichi Handa
2002-08-23 15:56 ` Eli Zaretskii
2002-08-24  0:51   ` Kenichi Handa
2002-08-24  1:03     ` Miles Bader
2002-08-24  9:42       ` Eli Zaretskii
2002-08-24 16:16       ` Andreas Schwab
2002-08-26  1:54         ` Miles Bader
2002-08-26 16:11           ` Stefan Monnier
2002-08-26 21:51         ` Richard Stallman
2002-08-24  9:39     ` Eli Zaretskii
2002-08-26  1:29       ` Kenichi Handa
2002-08-26  2:31         ` Miles Bader
2002-08-25 22:21     ` Kim F. Storm
2002-08-23 17:36 ` Stefan Monnier
2002-08-23 21:52   ` Stefan Monnier
2002-08-24  1:16   ` Kenichi Handa
2002-08-25 18:52     ` Stefan Monnier
2002-08-26  1:56       ` Kenichi Handa
2002-08-24 10:40   ` Kai Großjohann
2002-08-26 21:51 ` Richard Stallman
2002-08-29  8:53   ` Kenichi Handa
2002-08-29 12:33     ` Kim F. Storm
2002-08-29 13:38       ` Kenichi Handa
2002-08-29 15:00         ` Kim F. Storm
2002-08-29 16:00         ` Stefan Monnier
2002-08-30  1:11           ` Kenichi Handa
2002-08-30 19:19             ` Richard Stallman
2002-08-30 19:19     ` Richard Stallman
2002-08-30 20:08       ` Stefan Monnier
2002-09-01 13:15         ` Richard Stallman
2002-09-01 16:26           ` Stefan Monnier
2002-09-02 14:54             ` Richard Stallman
2002-09-02 16:58               ` Stefan Monnier
2002-09-04 14:13                 ` Richard Stallman
2002-09-04 16:04                   ` Stefan Monnier
2002-09-05 18:02                     ` Richard Stallman
2002-09-06  1:00                       ` re-search-forward seems to be broken Miles Bader
2002-09-06 20:03                         ` Richard Stallman
2002-08-31  6:14       ` regex and case-fold-search problem Eli Zaretskii
2002-09-01 13:14         ` Richard Stallman
