From: Eli Zaretskii
Newsgroups: gmane.emacs.bugs
Subject: bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t
Date: Tue, 16 Feb 2016 20:57:41 +0200
Message-ID: <838u2kwkii.fsf@gnu.org>
References: <87oaw6lw56.fsf@web.de> <877fi41zbe.fsf@amu.edu.pl> <83a8n0wmrl.fsf@gnu.org> <87si0s8pr6.fsf@web.de>
In-reply-to: <87si0s8pr6.fsf@web.de> (message from Michael Heerdegen on Tue, 16 Feb 2016 19:38:21 +0100)
To: Michael Heerdegen
Cc: 18150@debbugs.gnu.org, mbork@mbork.pl

> From: Michael Heerdegen
> Cc: Marcin Borkowski, 18150@debbugs.gnu.org
> Date: Tue, 16 Feb 2016 19:38:21 +0100
>
> Eli Zaretskii writes:
>
> > What do we expect the result to be in the variant below?
> >
> >   (let ((str "ecole")
> >         (case-fold-search t))
> >     (when (string-match "[[:upper:]]" str)
> >       (match-string 0 str)))
>
> According to the docstring of `case-fold-search', I would expect "e"
> (which the expression returns here).
>
> Before having thought about it, 70% of me expected `nil'.

That's exactly the point.

If, when case-fold-search is non-nil, we want both [:upper:] and
[:lower:] to match any letter that has a case variant, then the patch
below seems to do the job.  Does anyone see a problem with it?

The gotcha here is that regex.c doesn't know what TRANSLATE does, and
no one promises that TRANSLATE downcases characters.  It could fold
them, for example, or, more generally, transform them in any way the
caller wants.  The patch below is TRT when TRANSLATE downcases; when
it does something else, the question is: do we want to test the match
only on the result of TRANSLATE (which is what the original code
does), or do we want something else?

For the unibyte case, re_compile_pattern sets up a bitmap for
characters _after_ TRANSLATE, so things work as expected.  We cannot
do that for multibyte characters -- there are too many of them -- so
this problem arises.  AFAICS, it existed since Emacs 20.
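
To make the unibyte case concrete, here is a minimal sketch of a class
bitmap built after translation.  It is a simplified model, not the
regex.c code: the names sketch_fold, sketch_compile_upper and
sketch_match_upper are hypothetical, and TRANSLATE is assumed to
downcase ASCII, which (as noted above) nothing guarantees.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_NCHARS 256

/* Hypothetical stand-in for TRANSLATE: downcase ASCII when folding,
   identity otherwise.  This is an assumption of the sketch.  */
static unsigned char
sketch_fold (unsigned char ch, bool fold)
{
  return (fold && ch < 128) ? (unsigned char) tolower (ch) : ch;
}

/* Model of compiling [[:upper:]] for unibyte text: for every byte in
   the class, set the bit of its translated form, not of the byte
   itself.  */
static void
sketch_compile_upper (bool bitmap[SKETCH_NCHARS], bool fold)
{
  memset (bitmap, 0, SKETCH_NCHARS * sizeof bitmap[0]);
  for (int ch = 0; ch < SKETCH_NCHARS; ch++)
    if (ch < 128 && isupper (ch))
      bitmap[sketch_fold ((unsigned char) ch, fold)] = true;
}

/* Model of matching: the input byte is translated, then looked up.  */
static bool
sketch_match_upper (const bool bitmap[SKETCH_NCHARS], unsigned char ch,
                    bool fold)
{
  return bitmap[sketch_fold (ch, fold)];
}
```

In this model, with folding on, the bits for 'a'..'z' are the ones that
get set, so both 'e' and 'E' pass the lookup.  A table of this kind is
only practical because there are 256 possible bytes.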

diff --git a/src/regex.c b/src/regex.c
index dd3f2b3..27dce8b 100644
--- a/src/regex.c
+++ b/src/regex.c
@@ -5444,7 +5444,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
         case charset:
         case charset_not:
           {
-            register unsigned int c;
+            register unsigned int c, corig;
             boolean not = (re_opcode_t) *(p - 1) == charset_not;
             int len;
 
@@ -5473,7 +5473,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
               }
 
             PREFETCH ();
-            c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte);
+            corig = c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte);
             if (target_multibyte)
               {
                 int c1;
@@ -5517,11 +5517,13 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
               {
                 int class_bits = CHARSET_RANGE_TABLE_BITS (&p[-1]);
 
-                if (  (class_bits & BIT_LOWER && ISLOWER (c))
+                if (  (class_bits & BIT_LOWER
+                       && (ISLOWER (c) || (corig != c && ISUPPER(c))))
                     | (class_bits & BIT_MULTIBYTE)
                     | (class_bits & BIT_PUNCT && ISPUNCT (c))
                     | (class_bits & BIT_SPACE && ISSPACE (c))
-                    | (class_bits & BIT_UPPER && ISUPPER (c))
+                    | (class_bits & BIT_UPPER
+                       && (ISUPPER (c) || (corig != c && ISLOWER (c))))
                     | (class_bits & BIT_WORD && ISWORD (c))
                     | (class_bits & BIT_ALPHA && ISALPHA (c))
                     | (class_bits & BIT_ALNUM && ISALNUM (c))
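
The condition changed by the patch can be tried in isolation with the
sketch below.  It is a stand-alone model under stated assumptions, not
code from regex.c: SKETCH_BIT_LOWER, SKETCH_BIT_UPPER, sketch_translate,
class_match_old and class_match_new are hypothetical names, ASCII
letters stand in for the multibyte characters the patch is really
about, and TRANSLATE is assumed to downcase.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the class bits; the values are arbitrary.  */
enum { SKETCH_BIT_LOWER = 1, SKETCH_BIT_UPPER = 2 };

/* Hypothetical TRANSLATE that downcases ASCII (an assumption).  */
static unsigned int
sketch_translate (unsigned int c)
{
  return c < 128 ? (unsigned int) tolower ((int) c) : c;
}

/* Old test: only the translated character C is examined.  */
static bool
class_match_old (int class_bits, unsigned int corig, bool fold)
{
  unsigned int c = fold ? sketch_translate (corig) : corig;
  bool lower = c < 128 && islower ((int) c);
  bool upper = c < 128 && isupper ((int) c);

  return ((class_bits & SKETCH_BIT_LOWER) && lower)
         || ((class_bits & SKETCH_BIT_UPPER) && upper);
}

/* Patched test: if translation changed the character (CORIG != C),
   the opposite case of C is accepted as well.  */
static bool
class_match_new (int class_bits, unsigned int corig, bool fold)
{
  unsigned int c = fold ? sketch_translate (corig) : corig;
  bool lower = c < 128 && islower ((int) c);
  bool upper = c < 128 && isupper ((int) c);

  return ((class_bits & SKETCH_BIT_LOWER)
          && (lower || (corig != c && upper)))
         || ((class_bits & SKETCH_BIT_UPPER)
             && (upper || (corig != c && lower)));
}
```

In this model, class_match_old (SKETCH_BIT_UPPER, 'E', true) is false
because only the downcased 'e' is tested, while class_match_new accepts
it.  A character that the translation leaves unchanged is still tested
only against its own case.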