bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic, Greek

unofficial mirror of bug-gnu-emacs@gnu.org 

* bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic, Greek
@ 2012-04-22 10:11 Aidan Kehoe
  2020-12-07 17:24 ` Lars Ingebrigtsen
  2020-12-07 22:14 ` Mattias Engdegård
  0 siblings, 2 replies; 19+ messages in thread
From: Aidan Kehoe @ 2012-04-22 10:11 UTC (permalink / raw)
  To: 11309


This bug report will be sent to the Bug-GNU-Emacs mailing list
and the GNU bug tracker at debbugs.gnu.org.  Please check that
the From: line contains a valid email address.  After a delay of up
to one day, you should receive an acknowledgement at that address.

Please write in English if possible, as the Emacs maintainers
usually do not have translators for other languages.

Please describe exactly what actions triggered the bug, and
the precise symptoms of the bug.  If you can, give a recipe
starting from `emacs -Q':

The Lisp manual says this when describing character classes:

  `[:lower:]'
       This matches any lower-case letter, as determined by the current
       case table (*note Case Tables::).  If `case-fold-search' is
       non-`nil', this also matches any upper-case letter.

And:

  `[:upper:]'
       This matches any upper-case letter, as determined by the current
       case table (*note Case Tables::).  If `case-fold-search' is
       non-`nil', this also matches any lower-case letter.
  
OK, so let's test this:

(let ((case-fold-search t))
  (string-match "[[:upper:]]" "a\u0686"))
=> 0 ;; As documented

(upcase "\u0430") ;; CYRILLIC SMALL LETTER A
=> "А" ;; "\u0410", so it's in the case table

(let ((case-fold-search t))
  (string-match "[[:upper:]]" "\u0430\u0686"))
=> nil ;; Ah, this is unexpected.

(let ((case-fold-search t))
  (string-match "[[:lower:]]" "\u0410\u0686"))
=> 0 ;; But this works as documented. 

(upcase "\u03b2") ;; GREEK SMALL LETTER BETA
=> "Β" ;; "\u0392", it's in the case table

(let ((case-fold-search t))
  (string-match "[[:upper:]]" "\u03b2\u5357"))
=> nil ;; Oops

(let ((case-fold-search t))
  (string-match "[[:lower:]]" "\u0392\u5357"))
=> 0 ;; But this works, again. 

If Emacs crashed, and you have the Emacs process in the gdb debugger,
please include the output from the following gdb commands:
    `bt full' and `xbacktrace'.
For information about debugging Emacs, please read the file
/Sources/emacs/nextstep/Emacs.app/Contents/Resources/etc/DEBUG.


In GNU Emacs 24.1.50.1 (i386-apple-darwin10.8.0, NS apple-appkit-1038.36)
 of 2012-04-22 on bonbon
Windowing system distributor `Apple', version 10.3.1038
Configured using:
 `configure '--with-ns''

Important settings:
  value of $LC_ALL: nil
  value of $LC_COLLATE: nil
  value of $LC_CTYPE: nil
  value of $LC_MESSAGES: nil
  value of $LC_MONETARY: nil
  value of $LC_NUMERIC: nil
  value of $LC_TIME: nil
  value of $LANG: de_DE.UTF-8
  value of $XMODIFIERS: nil
  locale-coding-system: utf-8-unix
  default enable-multibyte-characters: t

Major mode: Info

Minor modes in effect:
  tooltip-mode: t
  mouse-wheel-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent input:
C-b C-b C-b C-b C-b C-b C-b C-f SPC \ x 7 f C-e C-j 
C-p C-f C-f C-f C-x = C-a ( SPC C-f C-x = C-a C-f s 
t <backspace> <backspace> m u l t <backspace> i b y 
t e - s t r i n g - p C-a C-f C-f C-f C-f t C-e ) C-j 
C-p C-p C-p C-n C-f C-f C-f C-f C-f C-f C-f C-f C-f 
C-f C-f C-f C-f C-f C-f C-f C-f C-f C-f C-f C-f C-f 
C-f C-f C-f C-b C-b C-b C-f C-x = C-x 1 C-f C-f C-f 
C-b C-k <escape> b <left> C-k C-p C-p C-p C-p C-p C-p 
C-p C-p C-p C-p C-p C-p C-p C-p C-p C-p C-p C-p C-p 
C-p C-e C-b C-b C-b C-y C-k ) C-j C-p C-p C-e C-b C-b 
C-b C-b C-d C-e C-j C-p C-p C-e C-b C-b C-b C-t C-e 
C-j C-p C-p C-e C-x C-b C-x o C-n C-n C-n RET C-x 1 
C-x b <return> C-x b * s c <tab> <return> C-n C-p C-n 
C-n e n a b l e - m u l t i b y t e - c h a r a c t 
e r s C-j C-x b <return> C-p C-n RET C-v l C-a C-n 
C-n C-n C-e C-x 2 C-x o C-x b * s c <backspace> <backspace> 
<backspace> C-g C-x C-b C-x o C-n C-n C-n C-n RET C-p 
C-p C-p C-x o C-p C-p C-a C-n C-SPC C-n C-n C-n C-n 
<escape> w <escape> x r e p o r t - e m a c s - b u 
g s <tab> C-g <escape> x r e p o r t - e m a c s - 
b u g <return>

Recent messages:
insert-file-contents-literally: Opening input file: no such file or directory, /Sources/emacs/nextstep/Emacs.app/Contents/Resources/etc/DOC-24.1.50.1
Mark set
Char: ä (228, #o344, #xe4, file ...) point=499 of 612 (81%) column=1 [2 times]
Char: DEL (127, #o177, #x7f) point=466 of 623 (75%) column=3
Char: ä (228, #o344, #xe4, file ...) point=466 of 625 (74%) column=3
Char: DEL (127, #o177, #x7f) point=486 of 647 (75%) column=23
Mark set
Quit
byte-code: Beginning of buffer [2 times]
Mark set
Quit

Load-path shadows:
None found.

Features:
(shadow sort gnus-util mail-extr emacsbug message format-spec rfc822 mml
mml-sec mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev
gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums mm-util
mail-prsvr mail-utils find-func vc-git cc-mode cc-fonts cc-guess
cc-menus cc-cmds cc-styles cc-align cc-engine cc-vars cc-defs mule-util
multi-isearch info help-mode easymenu view help-fns byte-opt warnings cl
compile comint ansi-color ring bytecomp byte-compile cconv macroexp
vc-hg time-date tooltip ediff-hook vc-hooks lisp-float-type mwheel
ns-win tool-bar dnd fontset image regexp-opt fringe lisp-mode register
page menu-bar rfn-eshadow timer select scroll-bar mouse jit-lock
font-lock syntax facemenu font-core frame cham georgian utf-8-lang
misc-lang vietnamese tibetan thai tai-viet lao korean japanese hebrew
greek romanian slovak czech european ethiopic indian cyrillic chinese
case-table epa-hook jka-cmpr-hook help simple abbrev minibuffer loaddefs
button faces cus-face files text-properties overlay sha1 md5 base64
format env code-pages mule custom widget hashtable-print-readable
backquote make-network-process dbusbind ns multi-tty emacs)

-- 
‘Iodine deficiency was endemic in parts of the UK until, through what has been
described as “an unplanned and accidental public health triumph”, iodine was
added to cattle feed to improve milk production in the 1930s.’
(EN Pearce, Lancet, June 2011)





^ permalink raw reply	[flat|nested] 19+ messages in thread

* bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic, Greek
  2012-04-22 10:11 bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic, Greek Aidan Kehoe
@ 2020-12-07 17:24 ` Lars Ingebrigtsen
  2020-12-07 22:14 ` Mattias Engdegård
  1 sibling, 0 replies; 19+ messages in thread
From: Lars Ingebrigtsen @ 2020-12-07 17:24 UTC (permalink / raw)
  To: Aidan Kehoe; +Cc: 11309

Aidan Kehoe <kehoea@parhasard.net> writes:

> (let ((case-fold-search t))
>   (string-match "[[:upper:]]" "a\u0686"))
> => 0 ;; As documented
>
> (upcase "\u0430") ;; CYRILLIC SMALL LETTER A
> => "А" ;; "\u0410", so it's in the case table
>
> (let ((case-fold-search t))
>   (string-match "[[:upper:]]" "\u0430\u0686"))
> => nil ;; Ah, this is unexpected.

I tried this in Emacs 28, and I can confirm that this behaviour is still
present.

> (let ((case-fold-search t))
>   (string-match "[[:lower:]]" "\u0410\u0686"))
> => 0 ;; But this works as documented. 
>
> (upcase "\u03b2") ;; GREEK SMALL LETTER BETA
> => "Β" ;; "\u0392", it's in the case table
>
> (let ((case-fold-search t))
>   (string-match "[[:upper:]]" "\u03b2\u5357"))
> => nil ;; Oops
>
> (let ((case-fold-search t))
>   (string-match "[[:lower:]]" "\u0392\u5357"))
> => 0 ;; But this works, again. 

And this, too.

Anybody have any insight here?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic,  Greek
  2012-04-22 10:11 bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic, Greek Aidan Kehoe
  2020-12-07 17:24 ` Lars Ingebrigtsen
@ 2020-12-07 22:14 ` Mattias Engdegård
  2020-12-08 14:48   ` Mattias Engdegård
  1 sibling, 1 reply; 19+ messages in thread
From: Mattias Engdegård @ 2020-12-07 22:14 UTC (permalink / raw)
  To: Lars Ingebrigtsen, Aidan Kehoe; +Cc: 11309

Not surprising in the least given the broken logic:

	  ((class_bits & BIT_UPPER) &&
	   (ISUPPER (c) || (corig != c &&
			    c == downcase (corig) && ISLOWER (c)))) ||
	  ((class_bits & BIT_LOWER) &&
	   (ISLOWER (c) || (corig != c &&
			    c == upcase (corig) && ISUPPER(c)))) ||

where corig is the character being matched and c is corig after canonicalising, which appears to mean downcasing in practice.
This means that the second case (BIT_LOWER means [:lower:]) works more or less as intended (by accident) but the [:upper:] case is less lucky and doesn't, as observed.

ASCII characters aren't affected by this bug since they are handled by a separate bitmap.

This has probably never worked properly.
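
For illustration, here is a small standalone model of the condition quoted above. It is only a sketch: the toy_* helpers are hypothetical stand-ins for the real case tables and know just one letter pair (CYRILLIC A, U+0410 / U+0430), the case predicates are assumed to be derived from that mapping, and canonicalisation is assumed to be plain downcasing, as suggested above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy case mapping that knows only CYRILLIC A (U+0410 / U+0430).
   Every other character maps to itself.  Hypothetical helper.  */
int toy_downcase (int c) { return c == 0x0410 ? 0x0430 : c; }
int toy_upcase (int c) { return c == 0x0430 ? 0x0410 : c; }

/* Assumed model of the case predicates, derived from the mapping.  */
bool toy_isupper (int c) { return toy_downcase (c) != c; }
bool toy_islower (int c) { return !toy_isupper (c) && toy_upcase (c) != c; }

/* Canonicalisation under case folding, modelled here as downcasing.  */
int toy_canon (int corig, bool fold)
{
  return fold ? toy_downcase (corig) : corig;
}

/* The [:upper:] condition, in the shape quoted above.  */
bool old_upper (int corig, bool fold)
{
  int c = toy_canon (corig, fold);
  return toy_isupper (c)
         || (corig != c && c == toy_downcase (corig) && toy_islower (c));
}

/* The [:lower:] condition, in the shape quoted above.  */
bool old_lower (int corig, bool fold)
{
  int c = toy_canon (corig, fold);
  return toy_islower (c)
         || (corig != c && c == toy_upcase (corig) && toy_isupper (c));
}

void check_old_conditions (void)
{
  /* A lower-case character is its own canonical form, so c == corig
     and neither branch of the [:upper:] condition can fire.  */
  assert (!old_upper (0x0430, true));
  /* An upper-case character differs from its canonical form and matches.  */
  assert (old_upper (0x0410, true));
  /* [:lower:] matches both, because c is always lower case here.  */
  assert (old_lower (0x0410, true));
  assert (old_lower (0x0430, true));
}
```

In this model the asymmetry comes entirely from the canonical form being the lower-case one: [:lower:] succeeds on its first test, while [:upper:] depends on corig differing from c.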


* bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic,  Greek
  2020-12-07 22:14 ` Mattias Engdegård
@ 2020-12-08 14:48   ` Mattias Engdegård
  2020-12-08 16:02     ` Eli Zaretskii
                       ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Mattias Engdegård @ 2020-12-08 14:48 UTC (permalink / raw)
  To: Lars Ingebrigtsen, Aidan Kehoe; +Cc: 11309

[-- Attachment #1: Type: text/plain, Size: 418 bytes --]

tags 11309 patch
stop

The attached patch should fix the bug for all characters except ß, which is still matched by neither [:lower:] nor [:upper:], whatever the value of case-fold-search.

The remaining problem seems to be that the upcase table maps ß to itself, which is wrong -- as long as we don't upcase ß to U+1E9E, it should not have an upcase table entry at all. I'll see what can be done about that.


[-- Attachment #2: 0001-Fix-upper-and-lower-for-Unicode-characters-bug-11309.patch --]
[-- Type: application/octet-stream, Size: 6465 bytes --]

From aead9bce8351477ee29d03d419a8c896a22aec4c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Tue, 8 Dec 2020 12:47:58 +0100
Subject: [PATCH] Fix [:upper:] and [:lower:] for Unicode characters
 (bug#11309)

* src/regex-emacs.c (execute_charset): Add canon_table argument to
allow expression of a correct predicate for [:upper:] and [:lower:].
(mutually_exclusive_p, re_match_2_internal): Pass extra argument.
* test/src/regex-emacs-tests.el (regexp-case-fold, regexp-eszett):
New tests.  Parts of regexp-eszett still fail and are commented out.
---
 src/regex-emacs.c             | 17 ++++++-----
 test/src/regex-emacs-tests.el | 57 +++++++++++++++++++++++++++++++++++
 2 files changed, 66 insertions(+), 8 deletions(-)

diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index 971a5f6374..6b5dded8e5 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -3575,9 +3575,11 @@ skip_noops (re_char *p, re_char *pend)
    opcode.  When the function finishes, *PP will be advanced past that opcode.
    C is character to test (possibly after translations) and CORIG is original
    character (i.e. without any translations).  UNIBYTE denotes whether c is
-   unibyte or multibyte character. */
+   unibyte or multibyte character.
+   CANON_TABLE is the canonicalisation table for case folding or Qnil.  */
 static bool
-execute_charset (re_char **pp, int c, int corig, bool unibyte)
+execute_charset (re_char **pp, int c, int corig, bool unibyte,
+                 Lisp_Object canon_table)
 {
   eassume (0 <= c && 0 <= corig);
   re_char *p = *pp, *rtp = NULL;
@@ -3617,11 +3619,9 @@ execute_charset (re_char **pp, int c, int corig, bool unibyte)
           (class_bits & BIT_BLANK && ISBLANK (c)) ||
 	  (class_bits & BIT_WORD  && ISWORD  (c)) ||
 	  ((class_bits & BIT_UPPER) &&
-	   (ISUPPER (c) || (corig != c &&
-			    c == downcase (corig) && ISLOWER (c)))) ||
+	   (ISUPPER (corig) || (canon_table != Qnil && ISLOWER (corig)))) ||
 	  ((class_bits & BIT_LOWER) &&
-	   (ISLOWER (c) || (corig != c &&
-			    c == upcase (corig) && ISUPPER(c)))) ||
+	   (ISLOWER (corig) || (canon_table != Qnil && ISUPPER (corig)))) ||
 	  (class_bits & BIT_PUNCT && ISPUNCT (c)) ||
 	  (class_bits & BIT_GRAPH && ISGRAPH (c)) ||
 	  (class_bits & BIT_PRINT && ISPRINT (c)))
@@ -3696,7 +3696,8 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, re_char *p1,
 	else if ((re_opcode_t) *p1 == charset
 		 || (re_opcode_t) *p1 == charset_not)
 	  {
-	    if (!execute_charset (&p1, c, c, !multibyte || ASCII_CHAR_P (c)))
+	    if (!execute_charset (&p1, c, c, !multibyte || ASCII_CHAR_P (c),
+                                  Qnil))
 	      {
 		DEBUG_PRINT ("	 No match => fast loop.\n");
 		return true;
@@ -4367,7 +4368,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp,
 	      }
 
 	    p -= 1;
-	    if (!execute_charset (&p, c, corig, unibyte_char))
+	    if (!execute_charset (&p, c, corig, unibyte_char, translate))
 	      goto fail;
 
 	    d += len;
diff --git a/test/src/regex-emacs-tests.el b/test/src/regex-emacs-tests.el
index f9372e37b1..576630aa5a 100644
--- a/test/src/regex-emacs-tests.el
+++ b/test/src/regex-emacs-tests.el
@@ -803,4 +803,61 @@ regexp-multibyte-unibyte
   (should-not (string-match "å" "\xe5"))
   (should-not (string-match "[å]" "\xe5")))
 
+(ert-deftest regexp-case-fold ()
+  "Test case-sensitive and case-insensitive matching."
+  (let ((case-fold-search nil))
+    (should (equal (string-match "aB" "ABaB") 2))
+    (should (equal (string-match "åÄ" "ÅäåäÅÄåÄ") 6))
+    (should (equal (string-match "λΛ" "lΛλλΛ") 3))
+    (should (equal (string-match "шШ" "zШшшШ") 3))
+    (should (equal (string-match "[[:alpha:]]+" ".3aBåÄßλΛшШ中﷽") 2))
+    (should (equal (match-end 0) 12))
+    (should (equal (string-match "[[:alnum:]]+" ".3aBåÄßλΛшШ中﷽") 1))
+    (should (equal (match-end 0) 12))
+    (should (equal (string-match "[[:upper:]]+" ".3aåλшBÄΛШ中﷽") 6))
+    (should (equal (match-end 0) 10))
+    (should (equal (string-match "[[:lower:]]+" ".3BÄΛШaåλш中﷽") 6))
+    (should (equal (match-end 0) 10)))
+  (let ((case-fold-search t))
+    (should (equal (string-match "aB" "ABaB") 0))
+    (should (equal (string-match "åÄ" "ÅäåäÅÄåÄ") 0))
+    (should (equal (string-match "λΛ" "lΛλλΛ") 1))
+    (should (equal (string-match "шШ" "zШшшШ") 1))
+    (should (equal (string-match "[[:alpha:]]+" ".3aBåÄßλΛшШ中﷽") 2))
+    (should (equal (match-end 0) 12))
+    (should (equal (string-match "[[:alnum:]]+" ".3aBåÄßλΛшШ中﷽") 1))
+    (should (equal (match-end 0) 12))
+    (should (equal (string-match "[[:upper:]]+" ".3aåλшBÄΛШ中﷽") 2))
+    (should (equal (match-end 0) 10))
+    (should (equal (string-match "[[:lower:]]+" ".3BÄΛШaåλш中﷽") 2))
+    (should (equal (match-end 0) 10))))
+
+(ert-deftest regexp-eszett ()
+  "Test matching of ß and ẞ."
+  ;; ß is a lower-case letter (Ll); ẞ is an upper-case letter (Lu).
+  (let ((case-fold-search nil))
+    (should (equal (string-match "ß" "ß") 0))
+    (should (equal (string-match "ß" "ẞ") nil))
+    (should (equal (string-match "ẞ" "ß") nil))
+    (should (equal (string-match "ẞ" "ẞ") 0))
+    (should (equal (string-match "[[:alpha:]]" "ß") 0))
+    ;; bug#11309
+    ;;(should (equal (string-match "[[:lower:]]" "ß") 0))
+    ;;(should (equal (string-match "[[:upper:]]" "ß") nil))
+    (should (equal (string-match "[[:alpha:]]" "ẞ") 0))
+    (should (equal (string-match "[[:lower:]]" "ẞ") nil))
+    (should (equal (string-match "[[:upper:]]" "ẞ") 0)))
+  (let ((case-fold-search t))
+    (should (equal (string-match "ß" "ß") 0))
+    (should (equal (string-match "ß" "ẞ") 0))
+    (should (equal (string-match "ẞ" "ß") 0))
+    (should (equal (string-match "ẞ" "ẞ") 0))
+    (should (equal (string-match "[[:alpha:]]" "ß") 0))
+    ;; bug#11309
+    ;;(should (equal (string-match "[[:lower:]]" "ß") 0))
+    ;;(should (equal (string-match "[[:upper:]]" "ß") 0))
+    (should (equal (string-match "[[:alpha:]]" "ẞ") 0))
+    (should (equal (string-match "[[:lower:]]" "ẞ") 0))
+    (should (equal (string-match "[[:upper:]]" "ẞ") 0))))
+
 ;;; regex-emacs-tests.el ends here
-- 
2.21.1 (Apple Git-122.3)
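
As a rough illustration of the predicate shape introduced by the patch above, the following standalone sketch tests the original character directly and uses case folding only to widen the match. The sketch_* helpers and the boolean `fold` parameter are assumptions made for this example; the patch itself passes the canonicalisation table and compares it against Qnil.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy case mapping that knows only CYRILLIC A (U+0410 / U+0430).
   Every other character maps to itself.  Hypothetical helper.  */
int sketch_downcase (int c) { return c == 0x0410 ? 0x0430 : c; }
int sketch_upcase (int c) { return c == 0x0430 ? 0x0410 : c; }

/* Assumed model of the case predicates, derived from the mapping.  */
bool sketch_isupper (int c) { return sketch_downcase (c) != c; }
bool sketch_islower (int c)
{
  return !sketch_isupper (c) && sketch_upcase (c) != c;
}

/* [:upper:]: upper-case characters always match; lower-case ones
   match only when case folding is in effect.  */
bool new_upper (int corig, bool fold)
{
  return sketch_isupper (corig) || (fold && sketch_islower (corig));
}

/* [:lower:]: the mirror image of the above.  */
bool new_lower (int corig, bool fold)
{
  return sketch_islower (corig) || (fold && sketch_isupper (corig));
}

void check_new_conditions (void)
{
  assert (new_upper (0x0430, true));    /* the case from the report */
  assert (!new_upper (0x0430, false));
  assert (new_lower (0x0410, true));
  assert (!new_lower (0x0410, false));
  assert (!new_upper (0x4E2D, true));   /* a caseless character never matches */
  assert (!new_lower (0x4E2D, true));
}
```

Because the translated character no longer takes part in the test, the two classes become symmetric under case folding.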



* bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic,  Greek
  2020-12-08 14:48   ` Mattias Engdegård
@ 2020-12-08 16:02     ` Eli Zaretskii
  2020-12-08 16:57       ` Mattias Engdegård
  2020-12-08 16:10     ` Andreas Schwab
  2020-12-08 17:01     ` Basil L. Contovounesios
  2 siblings, 1 reply; 19+ messages in thread
From: Eli Zaretskii @ 2020-12-08 16:02 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: kehoea, larsi, 11309

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Tue, 8 Dec 2020 15:48:42 +0100
> Cc: 11309@debbugs.gnu.org
> 
> The remaining problem seems to be that the upcase table maps ß to itself, which is wrong -- as long as we don't upcase ß to U+1E9E, it should not have an upcase table entry at all. I'll see what can be done about that.

Why is this a problem?  AFAIR characters that don't have an upper-case
form map to themselves when upcased.  E.g.

  (upcase ?1) => ?1

Why should ß violate this convention?

> * src/regex-emacs.c (execute_charset): Add canon_table argument to
> allow expression of a correct predicate for [:upper:] and [:lower:].
> (mutually_exclusive_p, re_match_2_internal): Pass extra argument.
> * test/src/regex-emacs-tests.el (regexp-case-fold, regexp-eszett):
> New tests.  Parts of regexp-eszett still fail and are commented out.

Thanks, LGTM.






* bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic,  Greek
  2020-12-08 16:02     ` Eli Zaretskii
@ 2020-12-08 16:57       ` Mattias Engdegård
  2020-12-08 17:05         ` Eli Zaretskii
  0 siblings, 1 reply; 19+ messages in thread
From: Mattias Engdegård @ 2020-12-08 16:57 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: kehoea, larsi, 11309

On 8 Dec 2020, at 17:02, Eli Zaretskii <eliz@gnu.org> wrote:

> AFAIR characters that don't have an upper-case
> form map to themselves when upcased.  E.g.
> 
>  (upcase ?1) => ?1

This is not about the Lisp (upcase x) function but the C upcase(x) function, which uses the upcase table directly.
The tables determine the results of the uppercasep and lowercasep functions, which are used in the regexp engine. Thus we get uppercasep(ß)=lowercasep(ß)=false, which is wrong.

The logic of 'lowercasep' may need to be changed because of its use of upcase and downcase, which return their argument if the respective table has no entry for it. Let's see what can be done.


* bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic,  Greek
  2020-12-08 16:57       ` Mattias Engdegård
@ 2020-12-08 17:05         ` Eli Zaretskii
  2020-12-09 14:37           ` Mattias Engdegård
  0 siblings, 1 reply; 19+ messages in thread
From: Eli Zaretskii @ 2020-12-08 17:05 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: kehoea, larsi, 11309

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Tue, 8 Dec 2020 17:57:32 +0100
> Cc: larsi@gnus.org, kehoea@parhasard.net, 11309@debbugs.gnu.org
> 
> This is not about the Lisp (upcase x) function but the C upcase(x) function, which uses the upcase table directly.
> The tables determine the results of the uppercasep and lowercasep functions, which are used in the regexp engine. Thus we get uppercasep(ß)=lowercasep(ß)=false, which is wrong.

Why is it wrong, and what practical problems does this cause?

> The logic of 'lowercasep' may need to be changed because of its use of upcase and downcase, which return their argument if the respective table has no entry for it. Let's see what can be done.

I don't want us to change the logic of such basic functions for the
benefit of a single obscure character.  Let's first see what problems
with this character we have in practice, and then discuss what is the
best way of solving those problems.

TIA






* bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic,  Greek
  2020-12-08 17:05         ` Eli Zaretskii
@ 2020-12-09 14:37           ` Mattias Engdegård
  2020-12-09 15:46             ` Eli Zaretskii
  2020-12-10  9:36             ` Mattias Engdegård
  0 siblings, 2 replies; 19+ messages in thread
From: Mattias Engdegård @ 2020-12-09 14:37 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Aidan Kehoe, Lars Ingebrigtsen, 11309-done

Eli, thanks for looking at the patch, now pushed to master (with Basil's suggested tweak).

> Why is it wrong, and what practical problems does this cause?

ß is a lower-case letter, so lowercasep(ß)=false is wrong. As a consequence, matching ß with [:lower:] and [:upper:] doesn't work correctly: ß should be matched by [:lower:] when case-fold-search is nil, and by both [:lower:] and [:upper:] when case-fold-search is non-nil.

The problem stems from the fact that uppercasep and lowercasep don't use the Unicode case information directly (which perhaps they should) but derive the case indirectly from the upcase and downcase tables, and there is no way to state that a char is lower case but cannot be upcased or downcased. (Below I'm going to use the notation T[C] for the table T indexed by character C.)

Currently, characters missing from or self-mapping in the upcase and downcase tables are considered to be caseless. For instance, upcase[*]=downcase[*]=* and upcase[中]=downcase[中]=nil. However, we also have upcase[ß]=downcase[ß]=ß, causing the incorrect lowercasep result.

The solution that I ended up applying was the simplest possible: set upcase[ß]=ẞ (U+1E9E). The special-uppercase properties ensure that (upcase "ß") => "SS", and now all tests pass.

(An acceptable alternative would have been to set upcase[ß]=nil and adapt lowercasep accordingly. I tried that and it works flawlessly, but involves slightly more changes.)
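
To make the table argument concrete, here is a small standalone model. It is a sketch under stated assumptions: uppercasep(c) is taken to mean downcase[c] != c, and lowercasep(c) to mean "not upper case and upcase[c] != c", which is one way of deriving case from the two tables as described above. The struct and the tbl_* helpers are hypothetical, and the model assumes downcase[ẞ]=ß.

```c
#include <assert.h>
#include <stdbool.h>

enum { SHARP_S = 0x00DF, CAPITAL_SHARP_S = 0x1E9E };

/* Toy case tables: only the upcase entry for ß is configurable;
   every other character maps to itself.  Hypothetical type.  */
struct toy_tables
{
  int up_of_sharp_s;
};

int tbl_upcase (const struct toy_tables *t, int c)
{
  return c == SHARP_S ? t->up_of_sharp_s : c;
}

int tbl_downcase (const struct toy_tables *t, int c)
{
  (void) t;  /* the downcase side is fixed in this model */
  return c == CAPITAL_SHARP_S ? SHARP_S : c;
}

/* Assumed derivation of case from the tables.  */
bool tbl_uppercasep (const struct toy_tables *t, int c)
{
  return tbl_downcase (t, c) != c;
}

bool tbl_lowercasep (const struct toy_tables *t, int c)
{
  return !tbl_uppercasep (t, c) && tbl_upcase (t, c) != c;
}

void check_sharp_s (void)
{
  struct toy_tables before = { SHARP_S };          /* upcase[ß]=ß */
  struct toy_tables after = { CAPITAL_SHARP_S };   /* upcase[ß]=ẞ */

  /* Self-mapping in both tables: ß looks caseless.  */
  assert (!tbl_lowercasep (&before, SHARP_S));
  assert (!tbl_uppercasep (&before, SHARP_S));

  /* With upcase[ß]=ẞ, ß is recognised as lower case.  */
  assert (tbl_lowercasep (&after, SHARP_S));
  assert (tbl_uppercasep (&after, CAPITAL_SHARP_S));
}
```

Under these assumptions a character is classified as lower case only if the upcase table moves it somewhere else, which is why the self-mapping entry made ß look caseless.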

And that concludes the resolution of this bug.


* bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic,  Greek
  2020-12-09 14:37           ` Mattias Engdegård
@ 2020-12-09 15:46             ` Eli Zaretskii
  2020-12-10  9:36             ` Mattias Engdegård
  1 sibling, 0 replies; 19+ messages in thread
From: Eli Zaretskii @ 2020-12-09 15:46 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: kehoea, larsi, 11309-done

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Wed, 9 Dec 2020 15:37:19 +0100
> Cc: Lars Ingebrigtsen <larsi@gnus.org>, Aidan Kehoe <kehoea@parhasard.net>,
>         11309-done@debbugs.gnu.org
> 
> ß is a lower-case letter, so lowercasep(ß)=false is wrong. As a consequence, matching ß with [:lower:] and [:upper:] doesn't work correctly: ß should be matched by [:lower:] when case-fold-search is nil, and by both [:lower:] and [:upper:] when case-fold-search is non-nil.
> 
> The problem stems from the fact that uppercasep and lowercasep don't use the Unicode case information directly (which perhaps they should) but derive the case indirectly from the upcase and downcase tables, and there is no way to state that a char is lower case but cannot be upcased or downcased. (Below I'm going to use the notation T[C] for the table T indexed by character C.)
> 
> Currently, characters missing from or self-mapping in the upcase and downcase tables are considered to be caseless. For instance, upcase[*]=downcase[*]=* and upcase[中]=downcase[中]=nil. However, we also have upcase[ß]=downcase[ß]=ß, causing the incorrect lowercasep result.
> 
> The solution that I ended up applying was the simplest possible: set upcase[ß]=ẞ (U+1E9E). The special-uppercase properties ensure that (upcase "ß") => "SS", and now all tests pass.
> 
> (An acceptable alternative would have been to set upcase[ß]=nil and adapt lowercasep accordingly. I tried that and it works flawlessly, but involves slightly more changes.)
> 
> And that concludes the resolution of this bug.

Thanks.






* bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic,  Greek
  2020-12-09 14:37           ` Mattias Engdegård
  2020-12-09 15:46             ` Eli Zaretskii
@ 2020-12-10  9:36             ` Mattias Engdegård
  2020-12-10 14:17               ` Eli Zaretskii
  1 sibling, 1 reply; 19+ messages in thread
From: Mattias Engdegård @ 2020-12-10  9:36 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Aidan Kehoe, Lars Ingebrigtsen, 11309

As it turns out, I had completely forgotten about Fupcase with a character argument: (upcase ?ß) previously returned ?ß but returns ?ẞ after the change, which was caught by casefiddle-tests. Now, what to do about it?

One solution would be the previous plan B: set upcase[ß]=nil, modify the uppercasep logic, and we will have (upcase ?ß) => ?ß again. However, I would argue that the current state is actually preferable:

Upcasing ß to ß never really makes sense. Words containing ß are written with SS in upper case: groß -> GROSS - which is one reason why the character-to-character use of Fupcase normally cannot be used for text containing the letter. The capital ß, ?ẞ, is still not widely employed but one of its purposes is when it is important to preserve the exact spelling of proper names when written in all caps: Gauß -> GAUẞ, not GAUSS. (I wouldn't be surprised if this will eventually become the general convention for all text, but we are getting ahead of society here.)

For these reasons, I'm adapting casefiddle-tests and calling it a feature.


* bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic,  Greek
  2020-12-10  9:36             ` Mattias Engdegård
@ 2020-12-10 14:17               ` Eli Zaretskii
  2020-12-10 15:48                 ` Mattias Engdegård
  0 siblings, 1 reply; 19+ messages in thread
From: Eli Zaretskii @ 2020-12-10 14:17 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: kehoea, larsi, 11309

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Thu, 10 Dec 2020 10:36:12 +0100
> Cc: Lars Ingebrigtsen <larsi@gnus.org>, Aidan Kehoe <kehoea@parhasard.net>,
>         11309@debbugs.gnu.org
> 
> Upcasing ß to ß never really makes sense. Words containing ß are written with SS in upper case: groß -> GROSS - which is one reason why the character-to-character use of Fupcase normally cannot be used for text containing the letter. The capital ß, ?ẞ, is still not widely employed but one of its purposes is when it is important to preserve the exact spelling of proper names when written in all caps: Gauß -> GAUẞ, not GAUSS. (I wouldn't be surprised if this will eventually become the general convention for all text, but we are getting ahead of society here.)

Wouldn't it be confusing that upcase treats ?ß and "ß" differently?






* bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic,  Greek
  2020-12-10 14:17               ` Eli Zaretskii
@ 2020-12-10 15:48                 ` Mattias Engdegård
  2020-12-10 15:53                   ` Lars Ingebrigtsen
  0 siblings, 1 reply; 19+ messages in thread
From: Mattias Engdegård @ 2020-12-10 15:48 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: kehoea, Lars Ingebrigtsen, 11309

On 10 Dec 2020, at 15:17, Eli Zaretskii <eliz@gnu.org> wrote:

> Wouldn't it be confusing that upcase treats ?ß and "ß" differently?

Well it already did so before (returning ?ß and "SS", respectively) and it's not as if we have much of a choice since
(1) upcase is documented to return a value of the same type as its argument, and
(2) "SS" is definitely the right return value for "ß".







* bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic, Greek
  2020-12-10 15:48                 ` Mattias Engdegård
@ 2020-12-10 15:53                   ` Lars Ingebrigtsen
  2020-12-11  9:18                     ` Mattias Engdegård
  0 siblings, 1 reply; 19+ messages in thread
From: Lars Ingebrigtsen @ 2020-12-10 15:53 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: kehoea, 11309

Mattias Engdegård <mattiase@acm.org> writes:

> Well it already did so before (returning ?ß and "SS", respectively)
> and it's not as if we have much of a choice since
> (1) upcase is documented to return a value of the same type as its argument, and
> (2) "SS" is definitely the right return value for "ß".

I can only vaguely read German, but doesn't that depend on the locale?
That is, whether an upcase of ß should be SS or ẞ depends on...  what
time and place we're at?

So returning either, or both (as after your patch), sounds fine to me --
it's an improvement on what Emacs did before.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic, Greek
  2020-12-10 15:53                   ` Lars Ingebrigtsen
@ 2020-12-11  9:18                     ` Mattias Engdegård
  2020-12-11 15:26                       ` Lars Ingebrigtsen
  0 siblings, 1 reply; 19+ messages in thread
From: Mattias Engdegård @ 2020-12-11  9:18 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: kehoea, 11309

On 10 Dec 2020, at 16:53, Lars Ingebrigtsen <larsi@gnus.org> wrote:

> I can only vaguely read German, but doesn't that depend on the locale?
> That is, whether an upcase of ß should be SS or ẞ depends on...  what
> time and place we're at?

I suppose, but upcasing to ẞ is not standard practice (at least not yet) in any German-speaking country. The Swiss prefer not using ß at all and write ss instead, but that doesn't affect the case-conversion rules.







* bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic, Greek
  2020-12-11  9:18                     ` Mattias Engdegård
@ 2020-12-11 15:26                       ` Lars Ingebrigtsen
  0 siblings, 0 replies; 19+ messages in thread
From: Lars Ingebrigtsen @ 2020-12-11 15:26 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: kehoea, 11309

Mattias Engdegård <mattiase@acm.org> writes:

> On 10 Dec 2020, at 16:53, Lars Ingebrigtsen <larsi@gnus.org> wrote:
>
>> I can only vaguely read German, but doesn't that depend on the locale?
>> That is, whether an upcase of ß should be SS or ẞ depends on...  what
>> time and place we're at?
>
> I suppose, but upcasing to ẞ is not standard practice (at least not
> yet) in any German-speaking country. The Swiss prefer not using ß at
> all and write ss instead, but that doesn't affect the case-conversion
> rules.

I thought I vaguely remembered somebody somewhere making ẞ a standard
upcase, but it seems I remembered wrong.  They only say that it's "also
possible":

"According to the council’s 2017 spelling manual: When writing the
uppercase [of ß], write SS. It’s also possible to use the uppercase
ẞ. Example: Straße — STRASSE — STRAẞE"

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic, Greek
  2020-12-08 14:48   ` Mattias Engdegård
  2020-12-08 16:02     ` Eli Zaretskii
@ 2020-12-08 16:10     ` Andreas Schwab
  2020-12-08 16:19       ` Mattias Engdegård
  2020-12-08 17:01     ` Basil L. Contovounesios
  2 siblings, 1 reply; 19+ messages in thread
From: Andreas Schwab @ 2020-12-08 16:10 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: Aidan Kehoe, Lars Ingebrigtsen, 11309

On Dec 08 2020, Mattias Engdegård wrote:

> diff --git a/src/regex-emacs.c b/src/regex-emacs.c
> index 971a5f6374..6b5dded8e5 100644
> --- a/src/regex-emacs.c
> +++ b/src/regex-emacs.c
> @@ -3575,9 +3575,11 @@ skip_noops (re_char *p, re_char *pend)
>     opcode.  When the function finishes, *PP will be advanced past that opcode.
>     C is character to test (possibly after translations) and CORIG is original
>     character (i.e. without any translations).  UNIBYTE denotes whether c is
> -   unibyte or multibyte character. */
> +   unibyte or multibyte character.
> +   CANON_TABLE is the canonicalisation table for case folding or Qnil.  */

The function uses that only as a boolean, so why not pass it as that?

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."






* bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic, Greek
  2020-12-08 16:10     ` Andreas Schwab
@ 2020-12-08 16:19       ` Mattias Engdegård
  0 siblings, 0 replies; 19+ messages in thread
From: Mattias Engdegård @ 2020-12-08 16:19 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Aidan Kehoe, Lars Ingebrigtsen, 11309

On 8 Dec 2020, at 17:10, Andreas Schwab <schwab@linux-m68k.org> wrote:

> The function uses that only as a boolean, so why not pass it as that?

Thanks for reading the patch! It's a micro-optimisation: passing it as a boolean would entail an unconditional comparison against Qnil, but the value is only needed for [:lower:] and [:upper:], which appear in a small fraction of character alternatives. Maybe there is a cleaner way to do this without making the code slower.
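
The trade-off can be illustrated with a minimal, ASCII-only sketch. It is not the code from regex-emacs.c: `case_class_matches`, `canon_table_t` and the bit values are hypothetical stand-ins, the <ctype.h> predicates replace ISUPPER/ISLOWER, and a null pointer plays the role of Qnil. The point is only that short-circuit evaluation defers the "is folding active?" comparison until a case class is actually present and the direct test has failed.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the real Emacs definitions.  */
enum { BIT_UPPER = 1 << 0, BIT_LOWER = 1 << 1 };

/* A null pointer plays the role of Qnil: no case folding in effect.  */
typedef const unsigned char *canon_table_t;

/* Return true if CORIG (an ASCII character) matches one of the case
   classes selected in CLASS_BITS.  CANON_TABLE is compared against
   NULL only after a class bit is found set and the direct test has
   failed, so a charset without [:upper:] or [:lower:] never performs
   that comparison.  */
static bool
case_class_matches (int class_bits, int corig, canon_table_t canon_table)
{
  return ((class_bits & BIT_UPPER)
          && (isupper (corig)
              || (canon_table != NULL && islower (corig))))
         || ((class_bits & BIT_LOWER)
             && (islower (corig)
                 || (canon_table != NULL && isupper (corig))));
}
```

Passing a ready-made boolean instead would move the comparison to the caller, where it would run for every charset, including those that never look at case classes.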


* bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic, Greek
  2020-12-08 14:48   ` Mattias Engdegård
  2020-12-08 16:02     ` Eli Zaretskii
  2020-12-08 16:10     ` Andreas Schwab
@ 2020-12-08 17:01     ` Basil L. Contovounesios
  2020-12-08 17:04       ` Mattias Engdegård
  2 siblings, 1 reply; 19+ messages in thread
From: Basil L. Contovounesios @ 2020-12-08 17:01 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: Aidan Kehoe, Lars Ingebrigtsen, 11309

Mattias Engdegård <mattiase@acm.org> writes:

> @@ -3617,11 +3619,9 @@ execute_charset (re_char **pp, int c, int corig, bool unibyte)
>            (class_bits & BIT_BLANK && ISBLANK (c)) ||
>  	  (class_bits & BIT_WORD  && ISWORD  (c)) ||
>  	  ((class_bits & BIT_UPPER) &&
> -	   (ISUPPER (c) || (corig != c &&
> -			    c == downcase (corig) && ISLOWER (c)))) ||
> +	   (ISUPPER (corig) || (canon_table != Qnil && ISLOWER (corig)))) ||
>  	  ((class_bits & BIT_LOWER) &&
> -	   (ISLOWER (c) || (corig != c &&
> -			    c == upcase (corig) && ISUPPER(c)))) ||
> +	   (ISLOWER (corig) || (canon_table != Qnil && ISUPPER (corig)))) ||

Just curious: why not NILP?

Thanks,

-- 
Basil






* bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic, Greek
  2020-12-08 17:01     ` Basil L. Contovounesios
@ 2020-12-08 17:04       ` Mattias Engdegård
  0 siblings, 0 replies; 19+ messages in thread
From: Mattias Engdegård @ 2020-12-08 17:04 UTC (permalink / raw)
  To: Basil L. Contovounesios; +Cc: Aidan Kehoe, Lars Ingebrigtsen, 11309

On 8 Dec 2020, at 18:01, Basil L. Contovounesios <contovob@tcd.ie> wrote:

> Just curious: why not NILP?

Momentary amnesia. Will change, thank you!
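
One reason a predicate is generally preferred over a direct comparison can be shown with a toy type. This is a sketch under a stated assumption, namely that the object type may be a struct in some build configurations; `toy_object`, `toy_qnil` and `toy_nilp` are hypothetical names and do not reproduce the Emacs definitions of Lisp_Object, Qnil or NILP.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy object type (hypothetical): a tagged word wrapped in a struct,
   so that the compiler rejects accidental integer arithmetic on it.  */
typedef struct { intptr_t word; } toy_object;

/* Toy nil value; the real Qnil is defined differently.  */
static const toy_object toy_qnil = { 0 };

/* Predicate in the style of NILP: compares the underlying words.
   Writing `obj != toy_qnil' would not compile here, because C
   defines no comparison operators for struct operands.  */
static bool
toy_nilp (toy_object obj)
{
  return obj.word == toy_qnil.word;
}
```

With such a type, `canon_table != Qnil` compiles only when the object type happens to be a plain integer or pointer, whereas the predicate form works in both cases.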







Thread overview: 19+ messages
2012-04-22 10:11 bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic, Greek Aidan Kehoe
2020-12-07 17:24 ` Lars Ingebrigtsen
2020-12-07 22:14 ` Mattias Engdegård
2020-12-08 14:48   ` Mattias Engdegård
2020-12-08 16:02     ` Eli Zaretskii
2020-12-08 16:57       ` Mattias Engdegård
2020-12-08 17:05         ` Eli Zaretskii
2020-12-09 14:37           ` Mattias Engdegård
2020-12-09 15:46             ` Eli Zaretskii
2020-12-10  9:36             ` Mattias Engdegård
2020-12-10 14:17               ` Eli Zaretskii
2020-12-10 15:48                 ` Mattias Engdegård
2020-12-10 15:53                   ` Lars Ingebrigtsen
2020-12-11  9:18                     ` Mattias Engdegård
2020-12-11 15:26                       ` Lars Ingebrigtsen
2020-12-08 16:10     ` Andreas Schwab
2020-12-08 16:19       ` Mattias Engdegård
2020-12-08 17:01     ` Basil L. Contovounesios
2020-12-08 17:04       ` Mattias Engdegård
