unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
@ 2014-02-12 17:29 Jorgen Schaefer
  2014-02-12 17:55 ` Glenn Morris
                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Jorgen Schaefer @ 2014-02-12 17:29 UTC (permalink / raw)
  To: 16731

Hi!
The following seems like a bug:

(string-match "[[:lower:]]" "ß") => nil

`describe-char' for this says:

  name: LATIN SMALL LETTER SHARP S
  general-category: Ll (Letter, Lowercase)
  decomposition: (223) ('ß')

Not sure why it would not be considered a lower-case letter. Umlauts
like ä, ö and ü are matched correctly.

Regards,
        -- Jorgen

Configured using:
 `configure --without-x'

Important settings:
  value of $LC_ALL: 
  value of $LC_COLLATE: de_DE.UTF-8
  value of $LC_CTYPE: de_DE.UTF-8
  value of $LC_MESSAGES: POSIX
  value of $LC_MONETARY: POSIX
  value of $LC_NUMERIC: POSIX
  value of $LC_TIME: POSIX
  value of $LANG: POSIX
  locale-coding-system: utf-8-unix






* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-12 17:29 bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case Jorgen Schaefer
@ 2014-02-12 17:55 ` Glenn Morris
  2014-02-12 19:31   ` Andreas Röhler
  2014-02-14 16:20 ` bug#16731: 24.3.50; , " Paul Eggert
  2021-07-16 12:32 ` bug#10576: Subject: 23.4; char class [:lower:] misses latin small letter sharp s Lars Ingebrigtsen
  2 siblings, 1 reply; 34+ messages in thread
From: Glenn Morris @ 2014-02-12 17:55 UTC (permalink / raw)
  To: Jorgen Schaefer; +Cc: 16731

Jorgen Schaefer wrote:

> Not sure why it would not be considered a lower-case letter. Umlauts
> like ä, ö and ü are matched correctly.

See http://debbugs.gnu.org/10576

(I have no idea whether this is an Emacs bug or not.)






* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-12 17:55 ` Glenn Morris
@ 2014-02-12 19:31   ` Andreas Röhler
  2014-02-12 19:49     ` Eli Zaretskii
  0 siblings, 1 reply; 34+ messages in thread
From: Andreas Röhler @ 2014-02-12 19:31 UTC (permalink / raw)
  To: 16731

On 12.02.2014 18:55, Glenn Morris wrote:
> Jorgen Schaefer wrote:
>
>> Not sure why it would not be considered a lower-case letter. Umlauts
>> like ä, ö and ü are matched correctly.
>
> See http://debbugs.gnu.org/10576
>
> (I have no idea whether this is an Emacs bug or not.)
>
>
>
>

IMO the answer given at link is not valid. Indeed, the implementation in buffer.h checks "&& upcase1 (c) != c" and expects a result, i.e. it ignores the fact that some characters
might not have an upcase variant.

When seeing there is a downcase-table, the check probably should be done against this.






* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-12 19:31   ` Andreas Röhler
@ 2014-02-12 19:49     ` Eli Zaretskii
  2014-02-12 20:10       ` Andreas Röhler
  0 siblings, 1 reply; 34+ messages in thread
From: Eli Zaretskii @ 2014-02-12 19:49 UTC (permalink / raw)
  To: Andreas Röhler; +Cc: 16731

> Date: Wed, 12 Feb 2014 20:31:20 +0100
> From: Andreas Röhler <andreas.roehler@easy-emacs.de>
> 
> > See http://debbugs.gnu.org/10576
> >
> > (I have no idea whether this is an Emacs bug or not.)
> >
> 
> IMO the answer given at link is not valid.

It accurately describes what happens in the code, so it's definitely
valid.

> When seeing there is a downcase-table, the check probably should be done against this.

Not sure what you mean by that, please elaborate.






* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-12 19:49     ` Eli Zaretskii
@ 2014-02-12 20:10       ` Andreas Röhler
  2014-02-12 20:16         ` Eli Zaretskii
  0 siblings, 1 reply; 34+ messages in thread
From: Andreas Röhler @ 2014-02-12 20:10 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 16731

On 12.02.2014 20:49, Eli Zaretskii wrote:
>> Date: Wed, 12 Feb 2014 20:31:20 +0100
>> From: Andreas Röhler <andreas.roehler@easy-emacs.de>
>>
>>> See http://debbugs.gnu.org/10576
>>>
>>> (I have no idea whether this is an Emacs bug or not.)
>>>
>>
>> IMO the answer given at link is not valid.
>
> It accurately describes what happens in the code, so it's definitely
> valid.
>
>> When seeing there is a downcase-table, the check probably should be done against this.
>
> Not sure what you mean by that, please elaborate.
>
>

See buffer.h
IIUC the mentioned lowercasep is implemented as !uppercasep (c) && upcase1 (c) != c;
The upcase1 (c) != c test must fail, as there is no upcased variant of this char.

While upcase1 can't succeed, downcase should - if "ß" is a member of downcase_table.
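
The pair-based predicates discussed here can be sketched in a few lines of C. This is a toy model, not the Emacs source: `toy_downcase` and `toy_upcase1` are hypothetical stand-ins for the case-table lookups and cover only ASCII plus the three umlaut pairs, while the two predicates follow the definitions quoted in this thread. Because ß (code point 0xDF) has no partner in the table, it is reported as not lower case, just like a digit.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the downcase table: only pairs are recorded.  */
int
toy_downcase (int c)
{
  if (c >= 'A' && c <= 'Z')
    return c + ('a' - 'A');
  switch (c)
    {
    case 0xC4: return 0xE4;	/* Ä -> ä */
    case 0xD6: return 0xF6;	/* Ö -> ö */
    case 0xDC: return 0xFC;	/* Ü -> ü */
    default:   return c;	/* no pair: maps to itself */
    }
}

/* Hypothetical stand-in for the upcase table.  ß (0xDF) has no entry,
   so it maps to itself like any caseless character.  */
int
toy_upcase1 (int c)
{
  if (c >= 'a' && c <= 'z')
    return c - ('a' - 'A');
  switch (c)
    {
    case 0xE4: return 0xC4;	/* ä -> Ä */
    case 0xF6: return 0xD6;	/* ö -> Ö */
    case 0xFC: return 0xDC;	/* ü -> Ü */
    default:   return c;
    }
}

/* Same shape as the predicates quoted from buffer.h.  */
bool
toy_uppercasep (int c)
{
  return toy_downcase (c) != c;
}

bool
toy_lowercasep (int c)
{
  return !toy_uppercasep (c) && toy_upcase1 (c) != c;
}
```

With these definitions `toy_lowercasep ('d')` and `toy_lowercasep (0xE4)` hold, while `toy_lowercasep (0xDF)` and `toy_lowercasep ('7')` are both false.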










* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-12 20:10       ` Andreas Röhler
@ 2014-02-12 20:16         ` Eli Zaretskii
  2014-02-12 20:33           ` Andreas Röhler
  0 siblings, 1 reply; 34+ messages in thread
From: Eli Zaretskii @ 2014-02-12 20:16 UTC (permalink / raw)
  To: Andreas Röhler; +Cc: 16731

> Date: Wed, 12 Feb 2014 21:10:57 +0100
> From: Andreas Röhler <andreas.roehler@easy-emacs.de>
> CC: 16731@debbugs.gnu.org
> 
> While upcase1 can't succeed, downcase should - if "ß" is a member of downcase_table.

But which character do you want to downcase in this case?

This whole logic works only for _pairs_ of characters (and the
char-table used here is populated by calls to set-case-syntax-pair).
Such machinery cannot possibly work when there's no pair.

The only way I can see out of this conundrum is to consult the
Lowercase Unicode property of the character as fallback, assuming that
won't slow down regex search too much.






* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-12 20:16         ` Eli Zaretskii
@ 2014-02-12 20:33           ` Andreas Röhler
  2014-02-12 20:57             ` Juanma Barranquero
  2014-02-13  3:46             ` Eli Zaretskii
  0 siblings, 2 replies; 34+ messages in thread
From: Andreas Röhler @ 2014-02-12 20:33 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 16731

On 12.02.2014 21:16, Eli Zaretskii wrote:
>> Date: Wed, 12 Feb 2014 21:10:57 +0100
>> From: Andreas Röhler <andreas.roehler@easy-emacs.de>
>> CC: 16731@debbugs.gnu.org
>>
>> While upcase1 can't succeed, downcase should - if "ß" is a member of downcase_table.
>
> But which character do you want to downcase in this case?
>
> This whole logic works only for _pairs_ of characters (and the
> char-table used here is populated by calls to set-case-syntax-pair).

So populate it differently, resp. allow empty slots.

> Such machinery cannot possibly work when there's no pair.
>
> The only way I can see out of this conundrum is to consult the
> Lowercase Unicode property of the character as fallback, assuming that
> won't slow down regex search too much.
>
>

You can do (downcase "d") for example, which results in "d".

Instead of

upcase1 (c) != c

what about

downcase (c) == c

?










* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-12 20:33           ` Andreas Röhler
@ 2014-02-12 20:57             ` Juanma Barranquero
  2014-02-13  3:46             ` Eli Zaretskii
  1 sibling, 0 replies; 34+ messages in thread
From: Juanma Barranquero @ 2014-02-12 20:57 UTC (permalink / raw)
  To: Andreas Röhler; +Cc: 16731

On Wed, Feb 12, 2014 at 9:33 PM, Andreas Röhler
<andreas.roehler@easy-emacs.de> wrote:

> what about
>
> downcase (c) == c

Won't that be true for characters that have no upcase/downcase
difference, like digits?
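
A small C sketch of this objection, under the assumption of a toy ASCII-only downcase table (`toy_downcase` and `candidate_lowercasep` are hypothetical names, not Emacs functions). The proposed test accepts ß, but it also accepts every character that has no case at all:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the downcase table (ASCII pairs only).  */
int
toy_downcase (int c)
{
  if (c >= 'A' && c <= 'Z')
    return c + ('a' - 'A');
  return c;			/* everything else maps to itself */
}

/* The test proposed in the quoted message: "downcase (c) == c".  */
bool
candidate_lowercasep (int c)
{
  return toy_downcase (c) == c;
}
```

Here `candidate_lowercasep (0xDF)` is true as intended, but so are `candidate_lowercasep ('7')` and `candidate_lowercasep ('!')`; only genuinely upper-case characters such as `'D'` are rejected.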

   J






* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-12 20:33           ` Andreas Röhler
  2014-02-12 20:57             ` Juanma Barranquero
@ 2014-02-13  3:46             ` Eli Zaretskii
  2014-02-13  8:27               ` Andreas Röhler
  2014-02-13 13:37               ` Stefan Monnier
  1 sibling, 2 replies; 34+ messages in thread
From: Eli Zaretskii @ 2014-02-13  3:46 UTC (permalink / raw)
  To: Andreas Röhler; +Cc: 16731

> Date: Wed, 12 Feb 2014 21:33:31 +0100
> From: Andreas Röhler <andreas.roehler@easy-emacs.de>
> CC: 16731@debbugs.gnu.org
> 
> On 12.02.2014 21:16, Eli Zaretskii wrote:
> >> Date: Wed, 12 Feb 2014 21:10:57 +0100
> >> From: Andreas Röhler <andreas.roehler@easy-emacs.de>
> >> CC: 16731@debbugs.gnu.org
> >>
> >> While upcase1 can't succeed, downcase should - if "ß" is a member of downcase_table.
> >
> > But which character do you want to downcase in this case?
> >
> > This whole logic works only for _pairs_ of characters (and the
> > char-table used here is populated by calls to set-case-syntax-pair).
> 
> So populate it differently, resp. allow empty slots.

How will we then be able to distinguish between lower-case characters
that have no upcase variant and characters that are not lower-case
characters at all?

> You can do (downcase "d") for example, which results in "d".
> 
> Instead of
> 
> upcase1 (c) != c
> 
> what about
> 
> downcase (c) == c
> 
> ?

The same is true for any non-letter, like punctuation.






* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-13  3:46             ` Eli Zaretskii
@ 2014-02-13  8:27               ` Andreas Röhler
  2014-02-13 15:53                 ` Eli Zaretskii
  2014-02-13 13:37               ` Stefan Monnier
  1 sibling, 1 reply; 34+ messages in thread
From: Andreas Röhler @ 2014-02-13  8:27 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Juanma Barranquero, 16731

On 13.02.2014 04:46, Eli Zaretskii wrote:
>> Date: Wed, 12 Feb 2014 21:33:31 +0100
>> From: Andreas Röhler <andreas.roehler@easy-emacs.de>
>> CC: 16731@debbugs.gnu.org
>>
>> On 12.02.2014 21:16, Eli Zaretskii wrote:
>>>> Date: Wed, 12 Feb 2014 21:10:57 +0100
>>>> From: Andreas Röhler <andreas.roehler@easy-emacs.de>
>>>> CC: 16731@debbugs.gnu.org
>>>>
>>>> While upcase1 can't succeed, downcase should - if "ß" is a member of downcase_table.
>>>
>>> But which character do you want to downcase in this case?
>>>
>>> This whole logic works only for _pairs_ of characters (and the
>>> char-table used here is populated by calls to set-case-syntax-pair).
>>
>> So populate it differently, resp. allow empty slots.
>
> How will we then be able to distinguish between lower-case characters
> that have no upcase variant and characters that are not lower-case
> characters at all?
>
>> You can do (downcase "d") for example, which results in "d".
>>
>> Instead of
>>
>> upcase1 (c) != c
>>
>> what about
>>
>> downcase (c) == c
>>
>> ?
>
> The same is true for any non-letter, like punctuation.
>
>

Okay, right.

So it seems upcase_table is populated wrongly with "ß"?







* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-13  3:46             ` Eli Zaretskii
  2014-02-13  8:27               ` Andreas Röhler
@ 2014-02-13 13:37               ` Stefan Monnier
  2014-02-13 16:33                 ` Eli Zaretskii
  1 sibling, 1 reply; 34+ messages in thread
From: Stefan Monnier @ 2014-02-13 13:37 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 16731

> How will we then be able to distinguish between lower-case characters
> that have no upcase variant and characters that are not lower-case
> characters at all?

Right: to handle this, we need to distinguish characters that are
lower-case without an uppercase variant from characters which are
neither lowercase nor uppercase.

We could do that by saying that the upcase table should return nil or -1
for ß, to indicate that the upcase version is "missing".  But such
a change will probably require carefully revising "all" the code that
uses those tables.
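
A rough C sketch of the sentinel idea and of why callers would need revising. Everything here is an assumption made for illustration: `NO_UPCASE`, `toy_upcase_entry`, `sentinel_lowercasep` and `safe_upcase` are hypothetical names, and the table covers only ASCII plus ß.

```c
#include <stdbool.h>

/* Hypothetical sentinel meaning "lower case, but no upper-case variant".  */
enum { NO_UPCASE = -1 };

/* Hypothetical upcase table lookup that may return the sentinel.  */
int
toy_upcase_entry (int c)
{
  if (c >= 'a' && c <= 'z')
    return c - ('a' - 'A');
  if (c == 0xDF)		/* ß */
    return NO_UPCASE;
  return c;			/* caseless, or already upper case */
}

/* Hypothetical downcase table lookup (ASCII pairs only).  */
int
toy_downcase (int c)
{
  if (c >= 'A' && c <= 'Z')
    return c + ('a' - 'A');
  return c;
}

/* Lower case: not upper case, and the upcase entry is either a
   different character or the "missing" sentinel.  */
bool
sentinel_lowercasep (int c)
{
  int up = toy_upcase_entry (c);
  return toy_downcase (c) == c && (up == NO_UPCASE || up != c);
}

/* Every caller that upcases a character must now check for the
   sentinel instead of using the table entry directly.  */
int
safe_upcase (int c)
{
  int up = toy_upcase_entry (c);
  return up == NO_UPCASE ? c : up;
}
```

The predicate now separates ß from digits, but a caller that forgets the check in `safe_upcase` would insert -1 as a character, which is the kind of breakage the revision would have to rule out.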


        Stefan






* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-13  8:27               ` Andreas Röhler
@ 2014-02-13 15:53                 ` Eli Zaretskii
  0 siblings, 0 replies; 34+ messages in thread
From: Eli Zaretskii @ 2014-02-13 15:53 UTC (permalink / raw)
  To: Andreas Röhler; +Cc: lekktu, 16731

> Date: Thu, 13 Feb 2014 09:27:43 +0100
> From: Andreas Röhler <andreas.roehler@easy-emacs.de>
> CC: 16731@debbugs.gnu.org, Juanma Barranquero <lekktu@gmail.com>
> 
> So it seems upcase_table is populated wrongly with "ß"?

I see nothing wrong with it: its entry is the character itself, like
any other character that has no up-case variant.






* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-13 13:37               ` Stefan Monnier
@ 2014-02-13 16:33                 ` Eli Zaretskii
  2014-02-13 17:10                   ` Stefan Monnier
  2014-02-13 17:58                   ` Juanma Barranquero
  0 siblings, 2 replies; 34+ messages in thread
From: Eli Zaretskii @ 2014-02-13 16:33 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 16731

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Andreas Röhler <andreas.roehler@easy-emacs.de>,
>   16731@debbugs.gnu.org
> Date: Thu, 13 Feb 2014 08:37:45 -0500
> 
> > How will we then be able to distinguish between lower-case characters
> > that have no upcase variant and characters that are not lower-case
> > characters at all?
> 
> Right: to handle this, we need to distinguish characters that are
> lower-case without an uppercase variant from characters which are
> neither lowercase nor uppercase.
> 
> We could do that by saying that the upcase table should return nil or -1
> for ß, to indicate that the upcase version is "missing".  But such
> a change will probably require carefully revising "all" the code that
> uses those tables.

Right.  I can instead suggest a much less intrusive change below.  Its
only disadvantage is that if some user or Lisp program overrides the
standard case tables, and actually _wants_ some lower-case characters
to behave as if they weren't, looking at the Unicode tables will undo
such customizations.  If this is a concern, perhaps we could compare
the case table with the standard value, and only use the Unicode
attributes when they are equal?

If the approach below is accepted, a related question is how to treat
letters whose category is Lt, i.e. "titlecase" -- do we consider such
letters upper case or don't we?

--- src/buffer.h~0	2014-01-01 09:46:07.000000000 +0200
+++ src/buffer.h	2014-02-13 18:27:32.225839000 +0200
@@ -1349,7 +1349,19 @@ downcase (int c)
 }
 
 /* True if C is upper case.  */
-INLINE bool uppercasep (int c) { return downcase (c) != c; }
+INLINE bool uppercasep (int c)
+{
+  Lisp_Object val;
+
+  if (downcase (c) != c)
+    return true;
+
+  if (NILP (Vunicode_category_table))
+    return false;
+
+  val = CHAR_TABLE_REF (Vunicode_category_table, c);
+  return INTEGERP (val) && XINT (val) == UNICODE_CATEGORY_Lu;
+}
 
 /* Upcase a character C known to be not upper case.  */
 INLINE int
@@ -1364,7 +1376,16 @@ upcase1 (int c)
 INLINE bool
 lowercasep (int c)
 {
-  return !uppercasep (c) && upcase1 (c) != c;
+  Lisp_Object val;
+
+  if (!uppercasep (c) && upcase1 (c) != c)
+    return true;
+
+  if (NILP (Vunicode_category_table))
+    return false;
+
+  val = CHAR_TABLE_REF (Vunicode_category_table, c);
+  return INTEGERP (val) && XINT (val) == UNICODE_CATEGORY_Ll;
 }
 
 /* Upcase a character C, or make no change if that cannot be done.  */
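
The control flow of the patch above can be tried outside Emacs with a toy model. This is only a sketch under stated assumptions: `toy_category_of` is a hypothetical replacement for the `Vunicode_category_table` lookup, the case tables are reduced to ASCII pairs, and only a handful of code points are classified.

```c
#include <stdbool.h>

/* Hypothetical subset of Unicode general categories.  */
enum toy_category { CAT_OTHER, CAT_Lu, CAT_Ll, CAT_Lt };

/* Hypothetical category lookup for a few code points.  */
enum toy_category
toy_category_of (int c)
{
  if (c >= 'A' && c <= 'Z')
    return CAT_Lu;
  if ((c >= 'a' && c <= 'z') || c == 0xDF)	/* 0xDF is ß */
    return CAT_Ll;
  if (c == 0x01C8)				/* Lj, titlecase */
    return CAT_Lt;
  return CAT_OTHER;
}

/* Hypothetical case tables holding ASCII pairs only.  */
int
toy_downcase (int c)
{
  return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

int
toy_upcase1 (int c)
{
  return (c >= 'a' && c <= 'z') ? c - ('a' - 'A') : c;
}

/* Case table first, category as fallback, as in the patch.  */
bool
toy_uppercasep (int c)
{
  if (toy_downcase (c) != c)
    return true;
  return toy_category_of (c) == CAT_Lu;
}

bool
toy_lowercasep (int c)
{
  if (!toy_uppercasep (c) && toy_upcase1 (c) != c)
    return true;
  return toy_category_of (c) == CAT_Ll;
}
```

In this model ß becomes lower case through the fallback, digits stay caseless, and the titlecase letter U+01C8 is neither upper nor lower case, which is the open question raised about category Lt.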






* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-13 16:33                 ` Eli Zaretskii
@ 2014-02-13 17:10                   ` Stefan Monnier
  2014-02-13 17:39                     ` Eli Zaretskii
  2014-02-13 17:58                   ` Juanma Barranquero
  1 sibling, 1 reply; 34+ messages in thread
From: Stefan Monnier @ 2014-02-13 17:10 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 16731

>  /* True if C is upper case.  */
> -INLINE bool uppercasep (int c) { return downcase (c) != c; }
> +INLINE bool uppercasep (int c)
> +{
> +  Lisp_Object val;
> +
> +  if (downcase (c) != c)
> +    return true;
> +
> +  if (NILP (Vunicode_category_table))
> +    return false;
> +
> +  val = CHAR_TABLE_REF (Vunicode_category_table, c);
> +  return INTEGERP (val) && XINT (val) == UNICODE_CATEGORY_Lu;
> +}
 
Doesn't sound too bad.  But it does beg the question: why check
(downcase (c) != c) at all, then?


        Stefan






* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-13 17:10                   ` Stefan Monnier
@ 2014-02-13 17:39                     ` Eli Zaretskii
  2014-02-13 18:02                       ` Andreas Röhler
  2014-02-13 18:10                       ` Stefan Monnier
  0 siblings, 2 replies; 34+ messages in thread
From: Eli Zaretskii @ 2014-02-13 17:39 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 16731

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: andreas.roehler@easy-emacs.de,  16731@debbugs.gnu.org
> Date: Thu, 13 Feb 2014 12:10:49 -0500
> 
> >  /* True if C is upper case.  */
> > -INLINE bool uppercasep (int c) { return downcase (c) != c; }
> > +INLINE bool uppercasep (int c)
> > +{
> > +  Lisp_Object val;
> > +
> > +  if (downcase (c) != c)
> > +    return true;
> > +
> > +  if (NILP (Vunicode_category_table))
> > +    return false;
> > +
> > +  val = CHAR_TABLE_REF (Vunicode_category_table, c);
> > +  return INTEGERP (val) && XINT (val) == UNICODE_CATEGORY_Lu;
> > +}
>  
> Doesn't sound too bad.  But it does beg the question: why check
> (downcase (c) != c) at all, then?

Because it's faster, and for most characters will do the job.






* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-13 16:33                 ` Eli Zaretskii
  2014-02-13 17:10                   ` Stefan Monnier
@ 2014-02-13 17:58                   ` Juanma Barranquero
  2014-02-13 18:18                     ` Eli Zaretskii
  1 sibling, 1 reply; 34+ messages in thread
From: Juanma Barranquero @ 2014-02-13 17:58 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 16731

On Thu, Feb 13, 2014 at 5:33 PM, Eli Zaretskii <eliz@gnu.org> wrote:

> If the approach below is accepted, a related question is how to treat
> letters whose category is Lt, i.e. "titlecase" -- do we consider such
> letters upper case or don't we?

No Unicode expert, but this suggests they are uppercase, sort of:

http://www.unicode.org/faq/casemap_charprop.html

"Q: What is titlecase? How is it different from uppercase?

A: Titlecase takes its name from the case format used when forming a
title, in which the initial letter in a word is capitalized and the
rest are not. Titlecase is also used in forming a sentence by
capitalizing the first word, and for forming proper names. The
titlecase mapping in the Unicode Standard is the mapping applied to
the initial character in a word.

The titlecase mapping in Unicode differs from the uppercase mapping in
that a number of characters require special handling. These are
chiefly ligatures and digraphs such as 'fl', 'dz', and 'lj', plus a
number of polytonic Greek characters. For example, U+01C7 (LJ) maps to
U+01C8 (Lj) rather than to U+01C9 (lj)."






* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-13 17:39                     ` Eli Zaretskii
@ 2014-02-13 18:02                       ` Andreas Röhler
  2014-02-13 18:17                         ` Eli Zaretskii
  2014-02-13 18:10                       ` Stefan Monnier
  1 sibling, 1 reply; 34+ messages in thread
From: Andreas Röhler @ 2014-02-13 18:02 UTC (permalink / raw)
  To: Eli Zaretskii, Stefan Monnier; +Cc: 16731

On 13.02.2014 18:39, Eli Zaretskii wrote:
>> From: Stefan Monnier <monnier@iro.umontreal.ca>
>> Cc: andreas.roehler@easy-emacs.de,  16731@debbugs.gnu.org
>> Date: Thu, 13 Feb 2014 12:10:49 -0500
>>
>>>   /* True if C is upper case.  */
>>> -INLINE bool uppercasep (int c) { return downcase (c) != c; }
>>> +INLINE bool uppercasep (int c)
>>> +{
>>> +  Lisp_Object val;
>>> +
>>> +  if (downcase (c) != c)
>>> +    return true;
>>> +
>>> +  if (NILP (Vunicode_category_table))
>>> +    return false;
>>> +
>>> +  val = CHAR_TABLE_REF (Vunicode_category_table, c);
>>> +  return INTEGERP (val) && XINT (val) == UNICODE_CATEGORY_Lu;
>>> +}
>>
>> Doesn't sound too bad.  But it does beg the question: why check
>> (downcase (c) != c) at all, then?
>
> Because it's faster, and for most characters will do the job.
>

Maybe I'm missing the point: the only change needed is not to store "ß" into the uppercase-table.
Why not store nil there instead?







* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-13 17:39                     ` Eli Zaretskii
  2014-02-13 18:02                       ` Andreas Röhler
@ 2014-02-13 18:10                       ` Stefan Monnier
  2014-02-13 18:16                         ` Eli Zaretskii
  1 sibling, 1 reply; 34+ messages in thread
From: Stefan Monnier @ 2014-02-13 18:10 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 16731

>> Doesn't sound too bad.  But it does beg the question: why check
>> (downcase (c) != c) at all, then?
> Because it's faster,

Is it?  Both lookups look like CHAR_TABLE_REF to me.

> and for most characters will do the job.

But we'll check the unicode table at least for more than half the
characters (i.e. for all the lowercase and non-case characters), so the
fast path can't give us more than a factor of 2 speed up anyway, and the
slow path is made slower by unnecessarily looking up the case table.

I guess what I mean is that without actual measurements it's not obvious
at all that speed is a good justification.


        Stefan






* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-13 18:10                       ` Stefan Monnier
@ 2014-02-13 18:16                         ` Eli Zaretskii
  2014-02-13 19:15                           ` Stefan Monnier
  0 siblings, 1 reply; 34+ messages in thread
From: Eli Zaretskii @ 2014-02-13 18:16 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 16731

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: andreas.roehler@easy-emacs.de,  16731@debbugs.gnu.org
> Date: Thu, 13 Feb 2014 13:10:02 -0500
> 
> >> Doesn't sound too bad.  But it does beg the question: why check
> >> (downcase (c) != c) at all, then?
> > Because it's faster,
> 
> Is it?  Both lookups look like CHAR_TABLE_REF to me.
> 
> > and for most characters will do the job.
> 
> But we'll check the unicode table at least for more than half the
> characters (i.e. for all the lowercase and non-case characters), so the
> fast path can't give us more than a factor of 2 speed up anyway, and the
> slow path is made slower by unnecessarily looking up the case table.
> 
> I guess what I mean is that without actual measurements it's not obvious
> at all that speed is a good justification.

What about custom buffer-local case tables?






* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-13 18:02                       ` Andreas Röhler
@ 2014-02-13 18:17                         ` Eli Zaretskii
  0 siblings, 0 replies; 34+ messages in thread
From: Eli Zaretskii @ 2014-02-13 18:17 UTC (permalink / raw)
  To: Andreas Röhler; +Cc: 16731

> Date: Thu, 13 Feb 2014 19:02:08 +0100
> From: Andreas Röhler <andreas.roehler@easy-emacs.de>
> CC: 16731@debbugs.gnu.org
> 
> Maybe I'm missing the point: the only change needed is not to store "ß" into the uppercase-table.
> Why not store nil there instead?

Because that's not what case tables are documented to hold.  We will
break back compatibility if we put nil there.






* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-13 17:58                   ` Juanma Barranquero
@ 2014-02-13 18:18                     ` Eli Zaretskii
  2014-02-13 18:22                       ` Juanma Barranquero
  2014-02-13 18:47                       ` Glenn Morris
  0 siblings, 2 replies; 34+ messages in thread
From: Eli Zaretskii @ 2014-02-13 18:18 UTC (permalink / raw)
  To: Juanma Barranquero; +Cc: 16731

> From: Juanma Barranquero <lekktu@gmail.com>
> Date: Thu, 13 Feb 2014 18:58:04 +0100
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>, 16731@debbugs.gnu.org
> 
> On Thu, Feb 13, 2014 at 5:33 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> > If the approach below is accepted, a related question is how to treat
> > letters whose category is Lt, i.e. "titlecase" -- do we consider such
> > letters upper case or don't we?
> 
> No Unicode expert, but this suggests they are uppercase, sort of:
> 
> http://www.unicode.org/faq/casemap_charprop.html
> 
> "Q: What is titlecase? How is it different from uppercase?
> 
> A: Titlecase takes its name from the case format used when forming a
> title, in which the initial letter in a word is capitalized and the
> rest are not. Titlecase is also used in forming a sentence by
> capitalizing the first word, and for forming proper names. The
> titlecase mapping in the Unicode Standard is the mapping applied to
> the initial character in a word.
> 
> The titlecase mapping in Unicode differs from the uppercase mapping in
> that a number of characters require special handling. These are
> chiefly ligatures and digraphs such as 'fl', 'dz', and 'lj', plus a
> number of polytonic Greek characters. For example, U+01C7 (LJ) maps to
> U+01C8 (Lj) rather than to U+01C9 (lj)."

The question is whether we want [:upper:] to match titlecase letters.






* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-13 18:18                     ` Eli Zaretskii
@ 2014-02-13 18:22                       ` Juanma Barranquero
  2014-02-13 18:47                       ` Glenn Morris
  1 sibling, 0 replies; 34+ messages in thread
From: Juanma Barranquero @ 2014-02-13 18:22 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 16731

On Thu, Feb 13, 2014 at 7:18 PM, Eli Zaretskii <eliz@gnu.org> wrote:

> The question is whether we want [:upper:] to match titlecase letters.

Yes, I understand. And I'm pointing out that, unless there's a
separate [:title:] matcher, matching them with [:upper:] is not
entirely unreasonable. Whether it is the right thing to do or not will
depend on the uses, I think.

    J






* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-13 18:18                     ` Eli Zaretskii
  2014-02-13 18:22                       ` Juanma Barranquero
@ 2014-02-13 18:47                       ` Glenn Morris
  2014-02-13 20:16                         ` Eli Zaretskii
  1 sibling, 1 reply; 34+ messages in thread
From: Glenn Morris @ 2014-02-13 18:47 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Juanma Barranquero, 16731

Eli Zaretskii wrote:

> The question is whether we want [:upper:] to match titlecase letters.

What does grep do?
(http://debbugs.gnu.org/16631 ?)






* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-13 18:16                         ` Eli Zaretskii
@ 2014-02-13 19:15                           ` Stefan Monnier
  2014-02-13 20:24                             ` Eli Zaretskii
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Monnier @ 2014-02-13 19:15 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 16731

> What about custom buffer-local case tables?

That's what I meant by my question, yes.  Your change will break about half of
the uses of buffer-local case tables.  Using the unicode table all the
time will break them all.
Is it a real issue?  I really don't know.


        Stefan






* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-13 18:47                       ` Glenn Morris
@ 2014-02-13 20:16                         ` Eli Zaretskii
  0 siblings, 0 replies; 34+ messages in thread
From: Eli Zaretskii @ 2014-02-13 20:16 UTC (permalink / raw)
  To: Glenn Morris; +Cc: lekktu, 16731

> From: Glenn Morris <rgm@gnu.org>
> Cc: Juanma Barranquero <lekktu@gmail.com>,  16731@debbugs.gnu.org
> Date: Thu, 13 Feb 2014 13:47:33 -0500
> 
> Eli Zaretskii wrote:
> 
> > The question is whether we want [:upper:] to match titlecase letters.
> 
> What does grep do?
> (http://debbugs.gnu.org/16631 ?)

Grep (like most of other programs) uses locale-dependent tables
provided by libc, so it's not really relevant for us what it does.






* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-13 19:15                           ` Stefan Monnier
@ 2014-02-13 20:24                             ` Eli Zaretskii
  2014-02-14 17:22                               ` Stefan Monnier
  0 siblings, 1 reply; 34+ messages in thread
From: Eli Zaretskii @ 2014-02-13 20:24 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 16731

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: andreas.roehler@easy-emacs.de,  16731@debbugs.gnu.org
> Date: Thu, 13 Feb 2014 14:15:37 -0500
> 
> > What about custom buffer-local case tables?
> 
> That's what I meant by my question, yes.  Your change will break about half of
> the uses of buffer-local case tables.  Using the unicode table all the
> time will break them all.
> Is it a real issue?  I really don't know.

Neither do I.

How about if we use the unicode tables only if the corresponding
buffer's case-table is the standard one (Vascii_*_table)?
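
One way to picture this gating is the following C sketch. It is not the Emacs implementation: `struct toy_case_table`, `toy_standard_table`, `toy_unicode_lower` and `gated_lowercasep` are hypothetical names, and "is the standard one" is modeled as a plain pointer comparison.

```c
#include <stdbool.h>

/* Hypothetical case table: a pair of mapping functions.  */
struct toy_case_table
{
  int (*down) (int);
  int (*up) (int);
};

int
std_down (int c)
{
  return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

int
std_up (int c)
{
  return (c >= 'a' && c <= 'z') ? c - ('a' - 'A') : c;
}

/* Hypothetical stand-in for the standard case table.  */
const struct toy_case_table toy_standard_table = { std_down, std_up };

/* Hypothetical stand-in for a Unicode "Ll" lookup (ASCII plus ß).  */
bool
toy_unicode_lower (int c)
{
  return (c >= 'a' && c <= 'z') || c == 0xDF;
}

/* Consult the Unicode property only when the buffer's table is the
   standard one, so customized tables keep full control.  */
bool
gated_lowercasep (const struct toy_case_table *tbl, int c)
{
  if (tbl->down (c) == c && tbl->up (c) != c)
    return true;
  return tbl == &toy_standard_table && toy_unicode_lower (c);
}
```

With the standard table, ß is treated as lower case; with any other table object, even one holding the same mappings, the result comes from the table alone.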






* bug#16731: 24.3.50; , Latin small letter sharp s is not considered lower-case
  2014-02-12 17:29 bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case Jorgen Schaefer
  2014-02-12 17:55 ` Glenn Morris
@ 2014-02-14 16:20 ` Paul Eggert
  2021-07-16 12:32 ` bug#10576: Subject: 23.4; char class [:lower:] misses latin small letter sharp s Lars Ingebrigtsen
  2 siblings, 0 replies; 34+ messages in thread
From: Paul Eggert @ 2014-02-14 16:20 UTC (permalink / raw)
  To: 16731

Grep doesn't just use glibc's tables; it has its own dfa matcher (also 
shared by awk), and runs into problems in this area as well.  I'm working 
on fixes for this in my limited spare time.

If you want 'uppercasep' to match what glibc and grep mean by 
[[:upper:]], Emacs might need to check not merely for 
UNICODE_CATEGORY_Lu but also for other Unicode categories (mixed case, 
title case).  I haven't investigated the details.






* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-13 20:24                             ` Eli Zaretskii
@ 2014-02-14 17:22                               ` Stefan Monnier
  2014-02-14 18:16                                 ` Eli Zaretskii
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Monnier @ 2014-02-14 17:22 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 16731

>> Is it a real issue?  I really don't know.
> Neither do I.

Maybe it's not a problem.  Someone(TM) should grep to try and figure it
out, and then try it out.

> How about if we use the unicode tables only if the corresponding
> buffer's case-table is the standard one (Vascii_*_table)?

That sounds kludgy.


        Stefan





^ permalink raw reply	[flat|nested] 34+ messages in thread

* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-14 17:22                               ` Stefan Monnier
@ 2014-02-14 18:16                                 ` Eli Zaretskii
  2014-02-14 20:59                                   ` Stefan Monnier
  0 siblings, 1 reply; 34+ messages in thread
From: Eli Zaretskii @ 2014-02-14 18:16 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 16731

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: andreas.roehler@easy-emacs.de,  16731@debbugs.gnu.org
> Date: Fri, 14 Feb 2014 12:22:35 -0500
> 
> > How about if we use the unicode tables only if the corresponding
> > buffer's case-table is the standard one (Vascii_*_table)?
> 
> That sounds kludgy.

Why kludgy?  If the tables were not customized, it is a sign that this
buffer is OK with the default properties, which is what the Unicode
properties are about.





^ permalink raw reply	[flat|nested] 34+ messages in thread

* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-14 18:16                                 ` Eli Zaretskii
@ 2014-02-14 20:59                                   ` Stefan Monnier
  2014-02-15  7:12                                     ` Eli Zaretskii
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Monnier @ 2014-02-14 20:59 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 16731

>> > How about if we use the unicode tables only if the corresponding
>> > buffer's case-table is the standard one (Vascii_*_table)?
>> That sounds kludgy.
> Why kludgy?

Because, if someone were to take the Vascii_*_table, make a little
change to them and use them in a buffer, he suddenly gets different
behavior for some chars he hasn't touched.


        Stefan





^ permalink raw reply	[flat|nested] 34+ messages in thread

* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-14 20:59                                   ` Stefan Monnier
@ 2014-02-15  7:12                                     ` Eli Zaretskii
  2014-02-17  3:09                                       ` Stefan Monnier
  0 siblings, 1 reply; 34+ messages in thread
From: Eli Zaretskii @ 2014-02-15  7:12 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 16731

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: andreas.roehler@easy-emacs.de,  16731@debbugs.gnu.org
> Date: Fri, 14 Feb 2014 15:59:00 -0500
> 
> >> > How about if we use the unicode tables only if the corresponding
> >> > buffer's case-table is the standard one (Vascii_*_table)?
> >> That sounds kludgy.
> > Why kludgy?
> 
> Because, if someone were to take the Vascii_*_table

How could they?  These variables are not exposed to Lisp.  Only
ascii-case-table is, which is not the one I had in mind.





^ permalink raw reply	[flat|nested] 34+ messages in thread

* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-15  7:12                                     ` Eli Zaretskii
@ 2014-02-17  3:09                                       ` Stefan Monnier
  2014-02-17  5:29                                         ` Eli Zaretskii
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Monnier @ 2014-02-17  3:09 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 16731

> How could they?  These variables are not exposed to Lisp.  Only
> ascii-case-table is, which is not the one I had in mind.

Right, I was thinking of standard-case-table.  Still, same problem: take
that standard case table, change it a bit, and suddenly other chars than
the ones you changed are affected.


        Stefan
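
The concern above can be reproduced from Lisp.  A minimal sketch,
assuming the proposed check amounts to an identity comparison against
the standard case table (the real check would live in C); the `my/'
names are hypothetical and not part of Emacs:

```elisp
;; Hypothetical helpers for illustration; not Emacs functions.
(defun my/uses-standard-case-table-p ()
  "Return non-nil if the current buffer still uses the standard case table."
  (eq (current-case-table) (standard-case-table)))

(defun my/demo-case-table-change ()
  "Return (BEFORE . AFTER): the identity check around a one-pair change."
  (with-temp-buffer
    (let ((before (my/uses-standard-case-table-p))
          ;; Work on a copy so the standard case table stays untouched.
          (table (copy-case-table (standard-case-table))))
      ;; Change a single pair: U+0130 (capital I with dot above) and `i'.
      ;; Every other entry is carried over from the standard table.
      (set-case-syntax-pair #x130 ?i table)
      (set-case-table table)
      (cons before (my/uses-standard-case-table-p)))))

(my/demo-case-table-change)
```

The car of the result is the check in a fresh buffer and the cdr is the
same check after installing the modified copy.  Under such an identity
test, the buffer would stop using the Unicode properties for all
characters, including ones like ß that the copy never changed.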





^ permalink raw reply	[flat|nested] 34+ messages in thread

* bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case
  2014-02-17  3:09                                       ` Stefan Monnier
@ 2014-02-17  5:29                                         ` Eli Zaretskii
  0 siblings, 0 replies; 34+ messages in thread
From: Eli Zaretskii @ 2014-02-17  5:29 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 16731

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: andreas.roehler@easy-emacs.de,  16731@debbugs.gnu.org
> Date: Sun, 16 Feb 2014 22:09:32 -0500
> 
> > How could they?  These variables are not exposed to Lisp.  Only
> > ascii-case-table is, which is not the one I had in mind.
> 
> Right, I was thinking of standard-case-table.  Still, same problem: take
> that standard case table, change it a bit, and suddenly other chars than
> the ones you changed are affected.

But customizing case-tables is already a very special use case.  Why
can't we expect such users to deal with these issues?

The only alternative (besides leaving the original problem unsolved)
is to ignore buffer-local case tables.  Is this more acceptable?





^ permalink raw reply	[flat|nested] 34+ messages in thread

* bug#10576: Subject: 23.4; char class [:lower:] misses latin small letter sharp s
  2014-02-12 17:29 bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case Jorgen Schaefer
  2014-02-12 17:55 ` Glenn Morris
  2014-02-14 16:20 ` bug#16731: 24.3.50; , " Paul Eggert
@ 2021-07-16 12:32 ` Lars Ingebrigtsen
  2 siblings, 0 replies; 34+ messages in thread
From: Lars Ingebrigtsen @ 2021-07-16 12:32 UTC (permalink / raw)
  To: Jorgen Schaefer; +Cc: 10576, 16731

Jorgen Schaefer <forcer@forcix.cx> writes:

> The following seems like a bug:
>
> (string-match "[[:lower:]]" "ß") => nil

This has been fixed in Emacs 28.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no





^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2021-07-16 12:32 UTC | newest]

Thread overview: 34+ messages
2014-02-12 17:29 bug#16731: 24.3.50; Latin small letter sharp s is not considered lower-case Jorgen Schaefer
2014-02-12 17:55 ` Glenn Morris
2014-02-12 19:31   ` Andreas Röhler
2014-02-12 19:49     ` Eli Zaretskii
2014-02-12 20:10       ` Andreas Röhler
2014-02-12 20:16         ` Eli Zaretskii
2014-02-12 20:33           ` Andreas Röhler
2014-02-12 20:57             ` Juanma Barranquero
2014-02-13  3:46             ` Eli Zaretskii
2014-02-13  8:27               ` Andreas Röhler
2014-02-13 15:53                 ` Eli Zaretskii
2014-02-13 13:37               ` Stefan Monnier
2014-02-13 16:33                 ` Eli Zaretskii
2014-02-13 17:10                   ` Stefan Monnier
2014-02-13 17:39                     ` Eli Zaretskii
2014-02-13 18:02                       ` Andreas Röhler
2014-02-13 18:17                         ` Eli Zaretskii
2014-02-13 18:10                       ` Stefan Monnier
2014-02-13 18:16                         ` Eli Zaretskii
2014-02-13 19:15                           ` Stefan Monnier
2014-02-13 20:24                             ` Eli Zaretskii
2014-02-14 17:22                               ` Stefan Monnier
2014-02-14 18:16                                 ` Eli Zaretskii
2014-02-14 20:59                                   ` Stefan Monnier
2014-02-15  7:12                                     ` Eli Zaretskii
2014-02-17  3:09                                       ` Stefan Monnier
2014-02-17  5:29                                         ` Eli Zaretskii
2014-02-13 17:58                   ` Juanma Barranquero
2014-02-13 18:18                     ` Eli Zaretskii
2014-02-13 18:22                       ` Juanma Barranquero
2014-02-13 18:47                       ` Glenn Morris
2014-02-13 20:16                         ` Eli Zaretskii
2014-02-14 16:20 ` bug#16731: 24.3.50; , " Paul Eggert
2021-07-16 12:32 ` bug#10576: Subject: 23.4; char class [:lower:] misses latin small letter sharp s Lars Ingebrigtsen

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git