* bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t
From: Michael Heerdegen @ 2014-07-30 15:11 UTC
To: 18150

Hello,

sorry if this is just a unibyte/multibyte thing I don't understand, but
it makes no sense to me:

(let ((str "École")
      (case-fold-search t))
  (when (string-match "[[:upper:]]" str)
    (match-string 0 str)))

==> "c"

However,

(let ((str "École")
      (case-fold-search nil))
  (when (string-match "[[:upper:]]" str)
    (match-string 0 str)))

==> "É"

I would expect "É" in both examples.

Thanks,

Michael.


In GNU Emacs 24.3.92.1 (x86_64-unknown-linux-gnu, GTK+ Version 3.12.2)
 of 2014-07-17 on drachen
Windowing system distributor `The X.Org Foundation', version 11.0.11600000
System Description: Debian GNU/Linux testing (jessie)

Important settings:
  value of $LC_ALL: de_DE.utf8
  value of $LC_COLLATE: C
  value of $LC_TIME: C
  value of $LANG: de_DE.utf8
  locale-coding-system: utf-8-unix
* bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t
From: Marcin Borkowski @ 2016-02-16 14:53 UTC
To: Michael Heerdegen; +Cc: 18150

Confirmed on emacs -Q (GNU Emacs 25.1.50.2, commit 4ccd268).

Best,
mb

On 2014-07-30, at 18:11, Michael Heerdegen <michael_heerdegen@web.de> wrote:

> Hello,
>
> sorry if this is just a unibyte/multibyte thing I don't understand, but
> it makes no sense to me:
>
> (let ((str "École")
>       (case-fold-search t))
>   (when (string-match "[[:upper:]]" str)
>     (match-string 0 str)))
>
> ==> "c"
>
> However,
>
> (let ((str "École")
>       (case-fold-search nil))
>   (when (string-match "[[:upper:]]" str)
>     (match-string 0 str)))
>
> ==> "É"
>
> I would expect "É" in both examples.
>
> Thanks,
>
> Michael.
>
>
> In GNU Emacs 24.3.92.1 (x86_64-unknown-linux-gnu, GTK+ Version 3.12.2)
>  of 2014-07-17 on drachen
> Windowing system distributor `The X.Org Foundation', version 11.0.11600000
> System Description: Debian GNU/Linux testing (jessie)
>
> Important settings:
>   value of $LC_ALL: de_DE.utf8
>   value of $LC_COLLATE: C
>   value of $LC_TIME: C
>   value of $LANG: de_DE.utf8
>   locale-coding-system: utf-8-unix
* bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t
From: Eli Zaretskii @ 2016-02-16 18:09 UTC
To: Marcin Borkowski; +Cc: michael_heerdegen, 18150

> From: Marcin Borkowski <mbork@mbork.pl>
> Date: Tue, 16 Feb 2016 15:53:41 +0100
> Cc: 18150@debbugs.gnu.org
>
> Confirmed on emacs -Q (GNU Emacs 25.1.50.2, commit 4ccd268).
>
> Best,
> mb
>
> On 2014-07-30, at 18:11, Michael Heerdegen <michael_heerdegen@web.de> wrote:
>
> > Hello,
> >
> > sorry if this is just a unibyte/multibyte thing I don't understand, but
> > it makes no sense to me:
> >
> > (let ((str "École")
> >       (case-fold-search t))
> >   (when (string-match "[[:upper:]]" str)
> >     (match-string 0 str)))
> >
> > ==> "c"
> >
> > However,
> >
> > (let ((str "École")
> >       (case-fold-search nil))
> >   (when (string-match "[[:upper:]]" str)
> >     (match-string 0 str)))
> >
> > ==> "É"
> >
> > I would expect "É" in both examples.

What do we expect the result to be in the variant below?

(let ((str "ecole")
      (case-fold-search t))
  (when (string-match "[[:upper:]]" str)
    (match-string 0 str)))
* bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t
From: Michael Heerdegen @ 2016-02-16 18:38 UTC
To: Eli Zaretskii; +Cc: 18150, Marcin Borkowski

Eli Zaretskii <eliz@gnu.org> writes:

> What do we expect the result to be in the variant below?
>
> (let ((str "ecole")
>       (case-fold-search t))
>   (when (string-match "[[:upper:]]" str)
>     (match-string 0 str)))

According to the docstring of `case-fold-search', I would expect "e"
(which the expression returns here).

Before having thought about it, 70% of me expected `nil'.

Michael.
* bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t
From: Eli Zaretskii @ 2016-02-16 18:57 UTC
To: Michael Heerdegen; +Cc: 18150, mbork

> From: Michael Heerdegen <michael_heerdegen@web.de>
> Cc: Marcin Borkowski <mbork@mbork.pl>, 18150@debbugs.gnu.org
> Date: Tue, 16 Feb 2016 19:38:21 +0100
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> > What do we expect the result to be in the variant below?
> >
> > (let ((str "ecole")
> >       (case-fold-search t))
> >   (when (string-match "[[:upper:]]" str)
> >     (match-string 0 str)))
>
> According to the docstring of `case-fold-search', I would expect "e"
> (which the expression returns here).
>
> Before having thought about it, 70% of me expected `nil'.

That's exactly the point.

If, when case-fold-search is non-nil, we want both [:upper:] and
[:lower:] to match any letter that has a case variant, then the patch
below seems to do the job.  Does anyone see a problem with it?

The gotcha here is that regex.c doesn't know what TRANSLATE does, and
no one promises that TRANSLATE downcases characters.  It could fold
them, for example, or, more generally, transform them in any way the
caller wants.  The patch below is TRT when TRANSLATE downcases; when
it does something else, the question is: do we want to test the match
only on the result of TRANSLATE (which is what the original code
does), or do we want something else?

For the unibyte case, re_compile_pattern sets up a bitmap for
characters _after_ TRANSLATE, so things work as expected.  We cannot
do that for multibyte characters -- there are too many of them -- so
this problem arises.  AFAICS, it existed since Emacs 20.
diff --git a/src/regex.c b/src/regex.c
index dd3f2b3..27dce8b 100644
--- a/src/regex.c
+++ b/src/regex.c
@@ -5444,7 +5444,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
         case charset:
         case charset_not:
           {
-            register unsigned int c;
+            register unsigned int c, corig;
             boolean not = (re_opcode_t) *(p - 1) == charset_not;
             int len;
 
@@ -5473,7 +5473,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
               }
 
             PREFETCH ();
-            c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte);
+            corig = c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte);
             if (target_multibyte)
               {
                 int c1;
@@ -5517,11 +5517,13 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
               {
                 int class_bits = CHARSET_RANGE_TABLE_BITS (&p[-1]);
 
-                if (  (class_bits & BIT_LOWER && ISLOWER (c))
+                if (  (class_bits & BIT_LOWER
+                       && (ISLOWER (c) || (corig != c && ISUPPER(c))))
                     | (class_bits & BIT_MULTIBYTE)
                     | (class_bits & BIT_PUNCT && ISPUNCT (c))
                     | (class_bits & BIT_SPACE && ISSPACE (c))
-                    | (class_bits & BIT_UPPER && ISUPPER (c))
+                    | (class_bits & BIT_UPPER
+                       && (ISUPPER (c) || (corig != c && ISLOWER (c))))
                     | (class_bits & BIT_WORD && ISWORD (c))
                     | (class_bits & BIT_ALPHA && ISALPHA (c))
                     | (class_bits & BIT_ALNUM && ISALNUM (c))
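The effect of the patch on the multibyte class test can be modelled in a few lines of standalone C. This is only a rough sketch, not the regex.c code: the names `CLASS_LOWER`, `CLASS_UPPER`, `translate_fn`, `downcase`, `class_match_old` and `class_match_patched` are hypothetical, ASCII letters and `<ctype.h>` stand in for multibyte characters and Emacs's ISLOWER/ISUPPER, and TRANSLATE is assumed to downcase. It does not model the unibyte bitmap path mentioned above.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical class bits, loosely modelled on BIT_LOWER and BIT_UPPER.  */
enum { CLASS_LOWER = 1 << 0, CLASS_UPPER = 1 << 1 };

/* Hypothetical stand-in for TRANSLATE; NULL means "no translation",
   roughly the case-fold-search = nil situation.  */
typedef int (*translate_fn) (int);

/* Assumption: the translation downcases, as case folding usually does.  */
static int
downcase (int c)
{
  return tolower (c);
}

/* Behaviour before the patch: only the translated character is tested
   against the requested classes.  */
static bool
class_match_old (int class_bits, int corig, translate_fn translate)
{
  int c = translate ? translate (corig) : corig;

  return ((class_bits & CLASS_LOWER) && islower (c))
         || ((class_bits & CLASS_UPPER) && isupper (c));
}

/* Behaviour modelled on the patch: if translation changed the character,
   a character of the opposite case is accepted as well.  */
static bool
class_match_patched (int class_bits, int corig, translate_fn translate)
{
  int c = translate ? translate (corig) : corig;
  bool changed = corig != c;

  return ((class_bits & CLASS_LOWER)
          && (islower (c) || (changed && isupper (c))))
         || ((class_bits & CLASS_UPPER)
             && (isupper (c) || (changed && islower (c))));
}
```

In this model, an uppercase letter tested against the upper class with the downcasing translation fails in `class_match_old`, because the translated character is lowercase, and succeeds in `class_match_patched`, which mirrors the "É" case from the original report. A character that the translation leaves unchanged is tested exactly as before.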
* bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t
From: Eli Zaretskii @ 2016-02-20 11:06 UTC
To: michael_heerdegen; +Cc: 18150-done, mbork

> Date: Tue, 16 Feb 2016 20:57:41 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 18150@debbugs.gnu.org, mbork@mbork.pl
>
> If, when case-fold-search is non-nil, we want both [:upper:] and
> [:lower:] to match any letter that has a case variant, then the patch
> below seems to do the job.  Does anyone see a problem with it?

No further comment, so I pushed a slightly safer change to emacs-25
branch, and I'm marking this bug done.

Thanks.
* bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t
From: Michael Heerdegen @ 2016-02-20 12:09 UTC
To: 18150

Eli Zaretskii <eliz@gnu.org> writes:

> No further comment, so I pushed a slightly safer change to emacs-25
> branch, and I'm marking this bug done.

Thanks, Eli.  I'm too ignorant to evaluate your C-level patch, but things
behave as I expect now.

Regards,

Michael.