unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t
@ 2014-07-30 15:11 Michael Heerdegen
  2016-02-16 14:53 ` Marcin Borkowski
  0 siblings, 1 reply; 7+ messages in thread
From: Michael Heerdegen @ 2014-07-30 15:11 UTC
  To: 18150


Hello,


sorry if this is just a unibyte/multibyte thing I don't understand, but
it makes no sense to me:

  (let ((str "École")
        (case-fold-search t))
    (when (string-match "[[:upper:]]" str)
      (match-string 0 str)))

==> "c"

However,

  (let ((str "École")
        (case-fold-search nil))
    (when (string-match "[[:upper:]]" str)
      (match-string 0 str)))

==> "É"

I would expect "É" in both examples.


Thanks,

Michael.




In GNU Emacs 24.3.92.1 (x86_64-unknown-linux-gnu, GTK+ Version 3.12.2)
 of 2014-07-17 on drachen
Windowing system distributor `The X.Org Foundation', version 11.0.11600000
System Description:	Debian GNU/Linux testing (jessie)

Important settings:
  value of $LC_ALL: de_DE.utf8
  value of $LC_COLLATE: C
  value of $LC_TIME: C
  value of $LANG: de_DE.utf8
  locale-coding-system: utf-8-unix







* bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t
  2014-07-30 15:11 bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t Michael Heerdegen
@ 2016-02-16 14:53 ` Marcin Borkowski
  2016-02-16 18:09   ` Eli Zaretskii
  0 siblings, 1 reply; 7+ messages in thread
From: Marcin Borkowski @ 2016-02-16 14:53 UTC
  To: Michael Heerdegen; +Cc: 18150

Confirmed on emacs -Q (GNU Emacs 25.1.50.2, commit 4ccd268).

Best,
mb









* bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t
  2016-02-16 14:53 ` Marcin Borkowski
@ 2016-02-16 18:09   ` Eli Zaretskii
  2016-02-16 18:38     ` Michael Heerdegen
  0 siblings, 1 reply; 7+ messages in thread
From: Eli Zaretskii @ 2016-02-16 18:09 UTC
  To: Marcin Borkowski; +Cc: michael_heerdegen, 18150


What do we expect the result to be in the variant below?

   (let ((str "ecole")
         (case-fold-search t))
     (when (string-match "[[:upper:]]" str)
       (match-string 0 str)))






* bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t
  2016-02-16 18:09   ` Eli Zaretskii
@ 2016-02-16 18:38     ` Michael Heerdegen
  2016-02-16 18:57       ` Eli Zaretskii
  0 siblings, 1 reply; 7+ messages in thread
From: Michael Heerdegen @ 2016-02-16 18:38 UTC
  To: Eli Zaretskii; +Cc: 18150, Marcin Borkowski

Eli Zaretskii <eliz@gnu.org> writes:

> What do we expect the result to be in the variant below?
>
>    (let ((str "ecole")
>          (case-fold-search t))
>      (when (string-match "[[:upper:]]" str)
>        (match-string 0 str)))

According to the docstring of `case-fold-search', I would expect "e"
(which the expression returns here).

Before having thought about it, 70% of me expected `nil'.


Michael.






* bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t
  2016-02-16 18:38     ` Michael Heerdegen
@ 2016-02-16 18:57       ` Eli Zaretskii
  2016-02-20 11:06         ` Eli Zaretskii
  0 siblings, 1 reply; 7+ messages in thread
From: Eli Zaretskii @ 2016-02-16 18:57 UTC
  To: Michael Heerdegen; +Cc: 18150, mbork

> From: Michael Heerdegen <michael_heerdegen@web.de>
> Cc: Marcin Borkowski <mbork@mbork.pl>,  18150@debbugs.gnu.org
> Date: Tue, 16 Feb 2016 19:38:21 +0100
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > What do we expect the result to be in the variant below?
> >
> >    (let ((str "ecole")
> >          (case-fold-search t))
> >      (when (string-match "[[:upper:]]" str)
> >        (match-string 0 str)))
> 
> According to the docstring of `case-fold-search', I would expect "e"
> (which the expression returns here).
> 
> Before having thought about it, 70% of me expected `nil'.

That's exactly the point.

If, when case-fold-search is non-nil, we want both [:upper:] and
[:lower:] to match any letter that has a case variant, then the patch
below seems to do the job.  Does anyone see a problem with it?

The gotcha here is that regex.c doesn't know what TRANSLATE does, and
no one promises that TRANSLATE downcases characters.  It could fold
them, for example, or, more generally, transform them in any way the
caller wants.  The patch below does the right thing when TRANSLATE
downcases; when it does something else, the question is: do we want
to test the match only on the result of TRANSLATE (which is what the
original code does), or do we want something else?

For the unibyte case, re_compile_pattern sets up a bitmap for
characters _after_ TRANSLATE, so things work as expected.  We cannot
do that for multibyte characters -- there are too many of them -- so
this problem arises.  AFAICS, it has existed since Emacs 20.
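
The unibyte behaviour described above can be illustrated with a small
standalone sketch.  It is not the regex.c code: the names `fold`,
`build_upper_bitmap` and `bitmap_match` are hypothetical, the C library's
ctype functions stand in for Emacs's case tables, and only the simplified
idea "record each class member under its translated value" is modelled.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical translation: downcase, standing in for what a caller
   might pass as TRANSLATE when case folding is requested.  */
int fold(int c) { return tolower(c); }

/* Build a 256-entry bitmap for [[:upper:]].  Each uppercase byte is
   recorded under its value AFTER translation (assumption: this mirrors
   the idea in the message above, not the exact regex.c logic).  */
void build_upper_bitmap(bool bitmap[256], int (*translate)(int))
{
    memset(bitmap, 0, 256 * sizeof bitmap[0]);
    for (int ch = 0; ch < 256; ch++)
        if (isupper(ch))
            bitmap[translate ? translate(ch) : ch] = true;
}

/* Match one input byte: translate it first, then consult the bitmap.  */
bool bitmap_match(const bool bitmap[256], unsigned char input,
                  int (*translate)(int))
{
    int c = translate ? translate(input) : input;
    return bitmap[c];
}
```

In this sketch, a bitmap built with `fold` accepts both 'E' and 'c',
which would be consistent with the "c" result in the original report;
a bitmap built without translation accepts only 'E'.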

diff --git a/src/regex.c b/src/regex.c
index dd3f2b3..27dce8b 100644
--- a/src/regex.c
+++ b/src/regex.c
@@ -5444,7 +5444,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
 	case charset:
 	case charset_not:
 	  {
-	    register unsigned int c;
+	    register unsigned int c, corig;
 	    boolean not = (re_opcode_t) *(p - 1) == charset_not;
 	    int len;
 
@@ -5473,7 +5473,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
 	      }
 
 	    PREFETCH ();
-	    c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte);
+	    corig = c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte);
 	    if (target_multibyte)
 	      {
 		int c1;
@@ -5517,11 +5517,13 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
 	      {
 		int class_bits = CHARSET_RANGE_TABLE_BITS (&p[-1]);
 
-		if (  (class_bits & BIT_LOWER && ISLOWER (c))
+		if (  (class_bits & BIT_LOWER
+		       && (ISLOWER (c) || (corig != c && ISUPPER(c))))
 		    | (class_bits & BIT_MULTIBYTE)
 		    | (class_bits & BIT_PUNCT && ISPUNCT (c))
 		    | (class_bits & BIT_SPACE && ISSPACE (c))
-		    | (class_bits & BIT_UPPER && ISUPPER (c))
+		    | (class_bits & BIT_UPPER
+		       && (ISUPPER (c) || (corig != c && ISLOWER (c))))
 		    | (class_bits & BIT_WORD  && ISWORD  (c))
 		    | (class_bits & BIT_ALPHA && ISALPHA (c))
 		    | (class_bits & BIT_ALNUM && ISALNUM (c))






* bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t
  2016-02-16 18:57       ` Eli Zaretskii
@ 2016-02-20 11:06         ` Eli Zaretskii
  2016-02-20 12:09           ` Michael Heerdegen
  0 siblings, 1 reply; 7+ messages in thread
From: Eli Zaretskii @ 2016-02-20 11:06 UTC
  To: michael_heerdegen; +Cc: 18150-done, mbork

> Date: Tue, 16 Feb 2016 20:57:41 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 18150@debbugs.gnu.org, mbork@mbork.pl
> 
> If, when case-fold-search is non-nil, we want both [:upper:] and
> [:lower:] to match any letter that has a case variant, then the patch
> below seems to do the job.  Does anyone see a problem with it?

No further comment, so I pushed a slightly safer change to emacs-25
branch, and I'm marking this bug done.

Thanks.






* bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t
  2016-02-20 11:06         ` Eli Zaretskii
@ 2016-02-20 12:09           ` Michael Heerdegen
  0 siblings, 0 replies; 7+ messages in thread
From: Michael Heerdegen @ 2016-02-20 12:09 UTC
  To: 18150

Eli Zaretskii <eliz@gnu.org> writes:

> No further comment, so I pushed a slightly safer change to emacs-25
> branch, and I'm marking this bug done.

Thanks, Eli.  I'm too ignorant to judge your C-level patch, but things
behave as I expect now.


Regards,

Michael.




