unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#59275: Unexpected return value of `string-collate-lessp' on Mac
@ 2022-11-15  4:08 Ihor Radchenko
  2022-11-15  9:51 ` Robert Pluim
  2022-11-15 13:46 ` Eli Zaretskii
  0 siblings, 2 replies; 24+ messages in thread
From: Ihor Radchenko @ 2022-11-15  4:08 UTC
  To: 59275

Hi,

I am forwarding an issue originally reported on Org mailing list.
https://orgmode.org/list/m2ilkwso8r.fsf@me.com

On Emacs 29 (adaa2fc90e) MacOS build:

(string-collate-lessp "a" "B" "C" t)  ; => nil

On Linux:

(string-collate-lessp "a" "B" "C" t) ; => t

The return value on MacOS is unexpected.

See more information, including locale data, in the Org ML thread.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>






* bug#59275: Unexpected return value of `string-collate-lessp' on Mac
  2022-11-15  4:08 bug#59275: Unexpected return value of `string-collate-lessp' on Mac Ihor Radchenko
@ 2022-11-15  9:51 ` Robert Pluim
  2022-11-16  3:47   ` Ihor Radchenko
  2022-11-15 13:46 ` Eli Zaretskii
  1 sibling, 1 reply; 24+ messages in thread
From: Robert Pluim @ 2022-11-15  9:51 UTC
  To: Ihor Radchenko; +Cc: 59275

>>>>> On Tue, 15 Nov 2022 04:08:13 +0000, Ihor Radchenko <yantar92@posteo.net> said:

    Ihor> Hi,
    Ihor> I am forwarding an issue originally reported on Org mailing list.
    Ihor> https://orgmode.org/list/m2ilkwso8r.fsf@me.com

    Ihor> On Emacs 29 (adaa2fc90e) MacOS build:

    Ihor> (string-collate-lessp "a" "B" "C" t)  ; => nil

    Ihor> On Linux:

    Ihor> (string-collate-lessp "a" "B" "C" t) ; => t

    Ihor> The return value on MacOS is unexpected.

    Ihor> See more information, including locale data, in the Org ML thread.

I think this is expected. See the long thread on emacs-devel back in
July, e.g.
https://lists.gnu.org/archive/html/emacs-devel/2022-07/msg00940.html

(it resulted in the addition of `string-equal-ignore-case')

Robert

* bug#59275: Unexpected return value of `string-collate-lessp' on Mac
  2022-11-15  4:08 bug#59275: Unexpected return value of `string-collate-lessp' on Mac Ihor Radchenko
  2022-11-15  9:51 ` Robert Pluim
@ 2022-11-15 13:46 ` Eli Zaretskii
  2022-11-15 15:05   ` Ihor Radchenko
  1 sibling, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2022-11-15 13:46 UTC
  To: Ihor Radchenko; +Cc: 59275

> From: Ihor Radchenko <yantar92@posteo.net>
> Date: Tue, 15 Nov 2022 04:08:13 +0000
> 
> I am forwarding an issue originally reported on Org mailing list.
> https://orgmode.org/list/m2ilkwso8r.fsf@me.com
> 
> On Emacs 29 (adaa2fc90e) MacOS build:
> 
> (string-collate-lessp "a" "B" "C" t)  ; => nil
> 
> On Linux:
> 
> (string-collate-lessp "a" "B" "C" t) ; => t
> 
> The return value on MacOS is unexpected.

string-collate-lessp is inherently platform- (and locale-) dependent.
Don't use it if you want consistent results across platforms and
locales.
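
As a rough illustration of that dependence (a sketch only; the helper
name `my/sort-with-locale' is made up for this example), the LOCALE
argument can be passed explicitly so that at least the locale part is
pinned.  The platform's collation implementation still decides the
outcome, and a locale name the system does not know may signal an
error:

```elisp
;; Hypothetical helper, not part of Emacs: sort with an explicit locale
;; instead of whatever locale the Emacs process happens to run in.
(defun my/sort-with-locale (strings locale)
  "Return a sorted copy of STRINGS, collated according to LOCALE.
LOCALE is a locale name such as \"POSIX\".  Whether and how it is
honored depends on the operating system."
  ;; `sort' is destructive, so work on a copy of the caller's list.
  (sort (copy-sequence strings)
        (lambda (s1 s2)
          (string-collate-lessp s1 s2 locale))))

;; No expected output is given on purpose: the result is exactly the
;; thing that varies between platforms.
(my/sort-with-locale '("11" "12" "1 1" "1 2" "1.1" "1.2") "POSIX")
```

Comparing the result for "POSIX" with the result for a UTF-8 locale on
the same machine, and then across machines, shows how much of the
order is decided outside Emacs.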






* bug#59275: Unexpected return value of `string-collate-lessp' on Mac
  2022-11-15 13:46 ` Eli Zaretskii
@ 2022-11-15 15:05   ` Ihor Radchenko
  2022-11-15 15:16     ` Eli Zaretskii
  0 siblings, 1 reply; 24+ messages in thread
From: Ihor Radchenko @ 2022-11-15 15:05 UTC
  To: Eli Zaretskii; +Cc: 59275

Eli Zaretskii <eliz@gnu.org> writes:

>> On Emacs 29 (adaa2fc90e) MacOS build:
>> 
>> (string-collate-lessp "a" "B" "C" t)  ; => nil
>> 
>> On Linux:
>> 
>> (string-collate-lessp "a" "B" "C" t) ; => t
>> 
>> The return value on MacOS is unexpected.
>
> string-collate-lessp is inherently platform- (and locale-) dependent.
> Don't use it if you want consistent results across platforms and
> locales.

Is there a better alternative?
Also, do I miss something, or is this pitfall not documented in the
docstring of `string-collate-lessp'?


* bug#59275: Unexpected return value of `string-collate-lessp' on Mac
  2022-11-15 15:05   ` Ihor Radchenko
@ 2022-11-15 15:16     ` Eli Zaretskii
  2022-11-16  1:34       ` Ihor Radchenko
  0 siblings, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2022-11-15 15:16 UTC
  To: Ihor Radchenko; +Cc: 59275

> From: Ihor Radchenko <yantar92@posteo.net>
> Cc: 59275@debbugs.gnu.org
> Date: Tue, 15 Nov 2022 15:05:48 +0000
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > string-collate-lessp is inherently platform- (and locale-) dependent.
> > Don't use it if you want consistent results across platforms and
> > locales.
> 
> Is there a better alternative?

Alternative to do what job?

> Also, do I miss something, or is this pitfall not documented in the
> docstring of `string-collate-lessp'?

It isn't?  Then what is this about:

  This function obeys the conventions for collation order in your
  locale settings.  For example, punctuation and whitespace characters
  might be considered less significant for sorting:

  (sort '("11" "12" "1 1" "1 2" "1.1" "1.2") 'string-collate-lessp)
    => ("11" "1 1" "1.1" "12" "1 2" "1.2")
  [...]
  To emulate Unicode-compliant collation on MS-Windows systems,
  bind ‘w32-collate-ignore-punctuation’ to a non-nil value, since
  the codeset part of the locale cannot be "UTF-8" on MS-Windows.

The ELisp manual says in addition:

     This behavior is system-dependent; e.g., punctuation and whitespace
     are never ignored on Cygwin, regardless of locale.

If this doesn't have a big WARNING sign near it, then what would?






* bug#59275: Unexpected return value of `string-collate-lessp' on Mac
  2022-11-15 15:16     ` Eli Zaretskii
@ 2022-11-16  1:34       ` Ihor Radchenko
  2022-11-16 13:00         ` Eli Zaretskii
  0 siblings, 1 reply; 24+ messages in thread
From: Ihor Radchenko @ 2022-11-16  1:34 UTC
  To: Eli Zaretskii; +Cc: 59275

Eli Zaretskii <eliz@gnu.org> writes:
>> > string-collate-lessp is inherently platform- (and locale-) dependent.
>> > Don't use it if you want consistent results across platforms and
>> > locales.
>> 
>> Is there a better alternative?
>
> Alternative to do what job?

Reliable sorting.
In particular, I am looking for a better PREDICATE argument for
`sort-subr' for case-sensitive and case-insensitive sorting of strings.

>> Also, do I miss something, or is this pitfall not documented in the
>> docstring of `string-collate-lessp'?
>
> It isn't?  Then what is this about:
>
>   This function obeys the conventions for collation order in your
>   locale settings.  For example, punctuation and whitespace characters
>   might be considered less significant for sorting:
>
>   (sort '("11" "12" "1 1" "1 2" "1.1" "1.2") 'string-collate-lessp)
>     => ("11" "1 1" "1.1" "12" "1 2" "1.2")
>   [...]
>   To emulate Unicode-compliant collation on MS-Windows systems,
>   bind ‘w32-collate-ignore-punctuation’ to a non-nil value, since
>   the codeset part of the locale cannot be "UTF-8" on MS-Windows.

The above sounds like we just need to worry about some edge cases where
different approaches may exist to sorting. Like with punctuation,
numbers, and spaces.

Having

  (string-collate-lessp "a" "B" "C" t)  ; => nil

is totally unexpected because case-insensitive "a"<"B"<"C" sounds like
the only reasonable outcome.

I'd like the warning to be even more prominent.

Feel free to disagree.


* bug#59275: Unexpected return value of `string-collate-lessp' on Mac
  2022-11-15  9:51 ` Robert Pluim
@ 2022-11-16  3:47   ` Ihor Radchenko
  0 siblings, 0 replies; 24+ messages in thread
From: Ihor Radchenko @ 2022-11-16  3:47 UTC
  To: Robert Pluim; +Cc: 59275

Robert Pluim <rpluim@gmail.com> writes:

> I think this is expected. See the long thread on emacs-devel back in
> July, e.g.
> https://lists.gnu.org/archive/html/emacs-devel/2022-07/msg00940.html
>
> (it resulted in the addition of `string-equal-ignore-case')

Ok. So, it looks like `compare-strings' is the way to go for
system-independent string comparison.


* bug#59275: Unexpected return value of `string-collate-lessp' on Mac
  2022-11-16  1:34       ` Ihor Radchenko
@ 2022-11-16 13:00         ` Eli Zaretskii
  2022-11-21  7:28           ` Ihor Radchenko
  0 siblings, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2022-11-16 13:00 UTC
  To: Ihor Radchenko; +Cc: 59275

> From: Ihor Radchenko <yantar92@posteo.net>
> Cc: 59275@debbugs.gnu.org
> Date: Wed, 16 Nov 2022 01:34:09 +0000
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> >> > string-collate-lessp is inherently platform- (and locale-) dependent.
> >> > Don't use it if you want consistent results across platforms and
> >> > locales.
> >> 
> >> Is there a better alternative?
> >
> > Alternative to do what job?
> 
> Reliable sorting.
> In particular, I am looking for a better PREDICATE argument for
> `sort-subr' for case-sensitive and case-insensitive sorting of strings.

In the strict order of Unicode codepoints?  Use compare-strings.
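
A minimal sketch of such a predicate, assuming plain code point order
is what is wanted (the name `my/string-codepoint-lessp' is invented
for this example):

```elisp
;; Hypothetical predicate built on `compare-strings'; it does not
;; consult the locale or the platform's collation functions.
(defun my/string-codepoint-lessp (s1 s2 &optional ignore-case)
  "Return non-nil if S1 sorts before S2 by character code.
With non-nil IGNORE-CASE, case is ignored when comparing."
  (let ((result (compare-strings s1 nil nil s2 nil nil ignore-case)))
    ;; `compare-strings' returns t for equal strings, a negative
    ;; number when S1 is less, and a positive number when it is greater.
    (and (integerp result) (< result 0))))

;; Case-sensitive: upper-case ASCII letters sort before lower-case ones.
(sort (list "b" "C" "a") #'my/string-codepoint-lessp)
;; => ("C" "a" "b")

;; Case-insensitive variant, wrapped for use as a two-argument PREDICATE.
(sort (list "b" "C" "a")
      (lambda (x y) (my/string-codepoint-lessp x y t)))
;; => ("a" "b" "C")
```

The order is the same on every platform, but it is not a linguistic
order: accented letters, for instance, sort by their code points rather
than next to their base letters.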

> >> Also, do I miss something, or is this pitfall not documented in the
> >> docstring of `string-collate-lessp'?
> >
> > It isn't?  Then what is this about:
> >
> >   This function obeys the conventions for collation order in your
> >   locale settings.  For example, punctuation and whitespace characters
> >   might be considered less significant for sorting:
> >
> >   (sort '("11" "12" "1 1" "1 2" "1.1" "1.2") 'string-collate-lessp)
> >     => ("11" "1 1" "1.1" "12" "1 2" "1.2")
> >   [...]
> >   To emulate Unicode-compliant collation on MS-Windows systems,
> >   bind ‘w32-collate-ignore-punctuation’ to a non-nil value, since
> >   the codeset part of the locale cannot be "UTF-8" on MS-Windows.
> 
> The above sounds like we just need to worry about some edge cases where
> different approaches may exist to sorting. Like with punctuation,
> numbers, and spaces.
> 
> Having
> 
>   (string-collate-lessp "a" "B" "C" t)  ; => nil
> 
> is totally unexpected because case-insensitive "a"<"B"<"C" sounds like
> the only reasonable outcome.

It is hard to guess what will be unexpected for people.  When the doc
string was written, the example used there was deemed to be the most
striking surprise from using locale-dependent collation, so it was
what we used.

> I'd like the warning to be even more prominent.

You want to make it explicit that for systems where we use
string-lessp the IGNORE-CASE argument is ignored?  Or do you want some
other change?

Anyway, feel free to suggest some text to that effect.






* bug#59275: Unexpected return value of `string-collate-lessp' on Mac
  2022-11-16 13:00         ` Eli Zaretskii
@ 2022-11-21  7:28           ` Ihor Radchenko
  2022-11-21 13:31             ` Eli Zaretskii
  0 siblings, 1 reply; 24+ messages in thread
From: Ihor Radchenko @ 2022-11-21  7:28 UTC
  To: Eli Zaretskii; +Cc: 59275

Eli Zaretskii <eliz@gnu.org> writes:

>> Reliable sorting.
>> In particular, I am looking for a better PREDICATE argument for
>> `sort-subr' for case-sensitive and case-insensitive sorting of strings.
>
> In the strict order of Unicode codepoints?  Use compare-strings.

Thanks for the clarification.
After further consideration, it looks like we should still use
`string-collate-lessp' on Org side as it yields expected results if libc
properly implements the collation.

>> I'd like the warning to be even more prominent.
>
> You want to make it explicit that for systems where we use
> string-lessp the IGNORE-CASE argument is ignored?  Or do you want some
> other change?

Yes, I think.

> Anyway, feel free to suggest some text to that effect.

Maybe change

  If your system does not support a locale environment, this function
  behaves like `string-lessp'.

to

  Some operating systems do not implement correct collation (in specific
  locale environments or at all). Then, this function falls back to
  case-sensitive `string-lessp' and IGNORE-CASE argument is ignored.


* bug#59275: Unexpected return value of `string-collate-lessp' on Mac
  2022-11-21  7:28           ` Ihor Radchenko
@ 2022-11-21 13:31             ` Eli Zaretskii
  2022-11-22  1:24               ` Ihor Radchenko
  0 siblings, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2022-11-21 13:31 UTC
  To: Ihor Radchenko; +Cc: 59275

> From: Ihor Radchenko <yantar92@posteo.net>
> Cc: 59275@debbugs.gnu.org
> Date: Mon, 21 Nov 2022 07:28:55 +0000
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> Reliable sorting.
> >> In particular, I am looking for a better PREDICATE argument for
> >> `sort-subr' for case-sensitive and case-insensitive sorting of strings.
> >
> > In the strict order of Unicode codepoints?  Use compare-strings.
> 
> Thanks for the clarification.
> After further consideration, it looks like we should still use
> `string-collate-lessp' on Org side as it yields expected results if libc
> properly implements the collation.

Is the feature that uses it intended to be used only on glibc platforms
(which basically means GNU/Linux)?  If not, I'm surprised that you arrived
at this conclusion.  It is the 180 deg opposite of what I think you should
have decided.

Once again: locale-specific collation order is inherently unpredictable in
its results, and should only be used when the locale-specific order is a
_must_, like when sorting people's names for a telephone directory.

> Maybe change
> 
>   If your system does not support a locale environment, this function
>   behaves like `string-lessp'.
> 
> to
> 
>   Some operating systems do not implement correct collation (in specific
>   locale environments or at all). Then, this function falls back to
>   case-sensitive `string-lessp' and IGNORE-CASE argument is ignored.

Fine with me.






* bug#59275: Unexpected return value of `string-collate-lessp' on Mac
  2022-11-21 13:31             ` Eli Zaretskii
@ 2022-11-22  1:24               ` Ihor Radchenko
  2022-11-22 12:56                 ` Eli Zaretskii
  0 siblings, 1 reply; 24+ messages in thread
From: Ihor Radchenko @ 2022-11-22  1:24 UTC
  To: Eli Zaretskii; +Cc: 59275

[-- Attachment #1: Type: text/plain, Size: 1697 bytes --]

Eli Zaretskii <eliz@gnu.org> writes:

>> > In the strict order of Unicode codepoints?  Use compare-strings.
>> 
>> Thanks for the clarification.
>> After further consideration, it looks like we should still use
>> `string-collate-lessp' on Org side as it yields expected results if libc
>> properly implements the collation.
>
> Is the feature that uses it intended to be used only on glibc platforms
> (which basically means GNU/Linux)?  If not, I'm surprised that you arrived
> at this conclusion.  It is the 180 deg opposite of what I think you should
> have decided.
>
> Once again: locale-specific collation order is inherently unpredictable in
> its results, and should only be used when the locale-specific order is a
> _must_, like when sorting people's names for a telephone directory.

We use string collation for

1. Sorting bibliographies
2. Sorting lists
3. Sorting table lines
4. Sorting tags
5. Sorting headings
6. Sorting entries in agendas
7. As a criterion for agenda/tag filtering when comparison operator is
   used on string property values (11.3.3 Matching tags and properties)

1-6 should follow the locale. I think we had a bug report in the past
where a user was confused because list sorting did not follow the
conventions of the user's language.

7 is more debatable.

>> Maybe change
>> 
>>   If your system does not support a locale environment, this function
>>   behaves like `string-lessp'.
>> 
>> to
>> 
>>   Some operating systems do not implement correct collation (in specific
>>   locale environments or at all). Then, this function falls back to
>>   case-sensitive `string-lessp' and IGNORE-CASE argument is ignored.
>
> Fine with me.

See the attached patch.


[-- Attachment #2: 0001-src-fns.c-Fstring_collate_lessp-Clarify-docstring.patch --]
[-- Type: text/x-patch, Size: 1349 bytes --]

From d9a67e94547ffeb6d8ac8a1202434fff1117af3f Mon Sep 17 00:00:00 2001
Message-Id: <d9a67e94547ffeb6d8ac8a1202434fff1117af3f.1669080246.git.yantar92@posteo.net>
From: Ihor Radchenko <yantar92@posteo.net>
Date: Tue, 22 Nov 2022 09:21:17 +0800
Subject: [PATCH] * src/fns.c (Fstring_collate_lessp): Clarify docstring

Clarify that the IGNORE-CASE argument might be ignored when the operating
system does not implement string collation for the specified locale.

See bug#59275.
---
 src/fns.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/fns.c b/src/fns.c
index 035fa12935..e337c0958d 100644
--- a/src/fns.c
+++ b/src/fns.c
@@ -596,8 +596,9 @@ DEFUN ("string-collate-lessp", Fstring_collate_lessp, Sstring_collate_lessp, 2,
 bind `w32-collate-ignore-punctuation' to a non-nil value, since
 the codeset part of the locale cannot be \"UTF-8\" on MS-Windows.
 
-If your system does not support a locale environment, this function
-behaves like `string-lessp'.  */)
+Some operating systems do not implement correct collation (in specific
+locale environments or at all).  Then, this function falls back to
+case-sensitive `string-lessp' and IGNORE-CASE argument is ignored.  */)
   (Lisp_Object s1, Lisp_Object s2, Lisp_Object locale, Lisp_Object ignore_case)
 {
 #if defined __STDC_ISO_10646__ || defined WINDOWSNT
-- 
2.35.1



* bug#59275: Unexpected return value of `string-collate-lessp' on Mac
  2022-11-22  1:24               ` Ihor Radchenko
@ 2022-11-22 12:56                 ` Eli Zaretskii
  2022-11-23 10:39                   ` Ihor Radchenko
  2022-11-26  2:03                   ` Ihor Radchenko
  0 siblings, 2 replies; 24+ messages in thread
From: Eli Zaretskii @ 2022-11-22 12:56 UTC
  To: Ihor Radchenko; +Cc: 59275-done

> From: Ihor Radchenko <yantar92@posteo.net>
> Cc: 59275@debbugs.gnu.org
> Date: Tue, 22 Nov 2022 01:24:43 +0000
> 
> > Once again: locale-specific collation order is inherently unpredictable in
> > its results, and should only be used when the locale-specific order is a
> > _must_, like when sorting people's names for a telephone directory.
> 
> We use string collation for
> 
> 1. Sorting bibliographies
> 2. Sorting lists
> 3. Sorting table lines
> 4. Sorting tags
> 5. Sorting headings
> 6. Sorting entries in agendas
> 7. As a criterion for agenda/tag filtering when comparison operator is
>    used on string property values (11.3.3 Matching tags and properties)
> 
> 1-6 should follow the locale.

I think only 1 and 6 are firmly in that category.  For the others it depends
on whether the results of the sorting are immediately displayed, or used for
further processing.  In the former case, using string-collate-lessp is
semi-okay ("semi" because producing different results in different locales
can still confuse users); in the latter case it is wrong, IMO, because you
will cause unexpected results.

> See the attached patch.

Thanks, installed.






* bug#59275: Unexpected return value of `string-collate-lessp' on Mac
  2022-11-22 12:56                 ` Eli Zaretskii
@ 2022-11-23 10:39                   ` Ihor Radchenko
  2022-11-23 14:58                     ` Eli Zaretskii
  2022-11-26  2:03                   ` Ihor Radchenko
  1 sibling, 1 reply; 24+ messages in thread
From: Ihor Radchenko @ 2022-11-23 10:39 UTC
  To: Eli Zaretskii; +Cc: 59275-done

Eli Zaretskii <eliz@gnu.org> writes:

>> See the attached patch.
>
> Thanks, installed.

Should we update the manual as well?
4.5 Comparison of Characters and Strings section contains the old
docstring verbatim.

P.S. I am wondering if there is some automated way to deal with verbatim
docstrings in the manuals. They are so easy to slip through when the
Elisp docstrings get updated.


* bug#59275: Unexpected return value of `string-collate-lessp' on Mac
  2022-11-23 10:39                   ` Ihor Radchenko
@ 2022-11-23 14:58                     ` Eli Zaretskii
  2022-11-24  2:22                       ` Ihor Radchenko
  0 siblings, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2022-11-23 14:58 UTC
  To: Ihor Radchenko; +Cc: 59275

> From: Ihor Radchenko <yantar92@posteo.net>
> Cc: 59275-done@debbugs.gnu.org
> Date: Wed, 23 Nov 2022 10:39:22 +0000
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> See the attached patch.
> >
> > Thanks, installed.
> 
> Should we update the manual as well?
> 4.5 Comparison of Characters and Strings section contains the old
> docstring verbatim.

I see in the manual text that is not a verbatim copy of the doc string, but
an expanded version of it with more detailed explanations.  Which is how it
should be: it is IMNSHO bad documentation-fu to have the manual just copycat
the doc strings.  (We sometimes do it for lack of time, but it is not a Good
Thing.)

The note about case-sensitivity of the fallback was missing from the manual,
so I added it.

> P.S. I am wondering if there is some automated way to deal with verbatim
> docstrings in the manuals. They are so easy to slip through when the
> Elisp docstrings get updated.

There should be no verbatim copies of doc strings in the manual.  So I'm not
interested in making that bad practice easier ;-)

Thanks.






* bug#59275: Unexpected return value of `string-collate-lessp' on Mac
  2022-11-23 14:58                     ` Eli Zaretskii
@ 2022-11-24  2:22                       ` Ihor Radchenko
  2022-11-24  7:23                         ` Eli Zaretskii
  0 siblings, 1 reply; 24+ messages in thread
From: Ihor Radchenko @ 2022-11-24  2:22 UTC
  To: Eli Zaretskii; +Cc: 59275

Eli Zaretskii <eliz@gnu.org> writes:

>> Should we update the manual as well?
>> 4.5 Comparison of Characters and Strings section contains the old
>> docstring verbatim.
>
> I see in the manual text that is not a verbatim copy of the doc string, but
> an expanded version of it with more detailed explanations.  Which is how it
> should be: it is IMNSHO bad documentation-fu to have the manual just copycat
> the doc strings.  (We sometimes do it for lack of time, but it is not a Good
> Thing.)

Fair point.

> The note about case-sensitivity of the fallback was missing from the manual,
> so I added it.

Thanks!

>> P.S. I am wondering if there is some automated way to deal with verbatim
>> docstrings in the manuals. They are so easy to slip through when the
>> Elisp docstrings get updated.
>
> There should be no verbatim copies of doc strings in the manual.  So I'm not
> interested in making that bad practice easier ;-)

What about forgetting to update the manual when important changes are
made to the docstring? I know for certain that it happened many times
with Org manual. Maybe something can be done to auto-check if updates
were done to the docstring but not the manual?


* bug#59275: Unexpected return value of `string-collate-lessp' on Mac
  2022-11-24  2:22                       ` Ihor Radchenko
@ 2022-11-24  7:23                         ` Eli Zaretskii
  0 siblings, 0 replies; 24+ messages in thread
From: Eli Zaretskii @ 2022-11-24  7:23 UTC
  To: Ihor Radchenko; +Cc: 59275

> From: Ihor Radchenko <yantar92@posteo.net>
> Cc: 59275@debbugs.gnu.org
> Date: Thu, 24 Nov 2022 02:22:41 +0000
> 
> > There should be no verbatim copies of doc strings in the manual.  So I'm not
> > interested in making that bad practice easier ;-)
> 
> What about forgetting to update the manual when important changes are
> made to the docstring? I know for certain that it happened many times
> with Org manual. Maybe something can be done to auto-check if updates
> were done to the docstring but not the manual?

That could be a useful feature, suitable for checkdoc.el, perhaps.  But
there are 2 issues here that I'm not sure how such a feature would handle:

 . not every symbol that has a doc string is mentioned in the manuals
 . the doc string and the text in the manual are generally different, and so
   it could be that the update to a doc string doesn't require any update to
   the manual text

So a naïve implementation would probably have too many false positives.  Not
sure if this could render the feature useless.

Bottom line: I'm not sure we can have a good automated way of detecting
updates that were missed, except at patch review time, and that is a
judgment call by the person who does the review, and relies on his/her
vigilance.  But if someone could come up with a good way of doing that, it
will be appreciated.






* bug#59275: Unexpected return value of `string-collate-lessp' on Mac
  2022-11-22 12:56                 ` Eli Zaretskii
  2022-11-23 10:39                   ` Ihor Radchenko
@ 2022-11-26  2:03                   ` Ihor Radchenko
  2022-11-26  8:06                     ` Eli Zaretskii
  1 sibling, 1 reply; 24+ messages in thread
From: Ihor Radchenko @ 2022-11-26  2:03 UTC
  To: Eli Zaretskii; +Cc: 59275-done

Eli Zaretskii <eliz@gnu.org> writes:

>> We use string collation for
>> 
>> 1. Sorting bibliographies
>> 2. Sorting lists
>> 3. Sorting table lines
>> 4. Sorting tags
>> 5. Sorting headings
>> 6. Sorting entries in agendas
>> 7. As a criterion for agenda/tag filtering when comparison operator is
>>    used on string property values (11.3.3 Matching tags and properties)
>> 
>> 1-6 should follow the locale.
>
> I think only 1 and 6 are firmly in that category.  For the others it depends
> on whether the results of the sorting are immediately displayed, or used for
> further processing.  In the former case, using string-collate-lessp is
> semi-okay ("semi" because producing different results in different locales
> can still confuse users); in the latter case it is wrong, IMO, because you
> will cause unexpected results.

1-6 are for interactive use.

As Maxim pointed out in
https://orgmode.org/list/tlle59$pl3$1@ciao.gmane.io,
`string-collate-lessp' generally yields better results for human
consumption:

  (setq lst '("semana" "señor" "sepia"))
  (sort lst #'string-lessp)         ; => ("semana" "sepia" "señor")
  (sort lst #'string-collate-lessp) ; => ("semana" "señor" "sepia")

In the same thread, we also discussed what Org can do about MacOS and
other systems that do not implement string collation.

We concluded that a better fallback when collation is not available
would be using downcase+string-lessp when `string-collate-lessp' is
called with non-nil IGNORE-CASE argument.

Would it be acceptable for Emacs to change the fallback behavior of
`string-collate-lessp' to:

1. If string collation is not available and IGNORE-CASE is nil, fall back
   to `string-lessp';
2. If string collation is not available and IGNORE-CASE is non-nil,
   use `downcase' + `string-lessp'.

This will not compromise consistency and will yield slightly better
fallback results.
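
The proposed fallback could be sketched in Lisp roughly as follows.
The function name is hypothetical, this is not what Emacs does, and
deciding when collation is unavailable is deliberately left out:

```elisp
;; Hypothetical sketch of the proposed fallback only.  It never calls
;; the platform's collation functions.
(defun my/collate-fallback-lessp (s1 s2 &optional ignore-case)
  "Compare S1 and S2 the way the proposed fallback would.
With nil IGNORE-CASE this is plain `string-lessp'; otherwise both
strings are downcased first."
  (if ignore-case
      (string-lessp (downcase s1) (downcase s2))
    (string-lessp s1 s2)))

;; The comparison from the original report, with IGNORE-CASE set:
(my/collate-fallback-lessp "a" "B" t)   ; => t
;; The same strings compared case-sensitively:
(my/collate-fallback-lessp "a" "B")     ; => nil
```

For plain ASCII input this gives the case-insensitive answer the
report expected; it makes no attempt at locale-aware ordering.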

I also do not think that it will be backwards-incompatible. If the call
to `string-collate-lessp' explicitly requests ignoring case, `downcase'
is more expected than bare `string-lessp' that _does not_ ignore case.

WDYT?


* bug#59275: Unexpected return value of `string-collate-lessp' on Mac
  2022-11-26  2:03                   ` Ihor Radchenko
@ 2022-11-26  8:06                     ` Eli Zaretskii
  2022-11-26  8:47                       ` Ihor Radchenko
  0 siblings, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2022-11-26  8:06 UTC
  To: Ihor Radchenko; +Cc: 59275

> From: Ihor Radchenko <yantar92@posteo.net>
> Cc: 59275-done@debbugs.gnu.org
> Date: Sat, 26 Nov 2022 02:03:43 +0000
> 
> We concluded that a better fallback when collation is not available
> would be using downcase+string-lessp when `string-collate-lessp' is
> called with non-nil IGNORE-CASE argument.

This has caveats, see below.  I won't argue about your Org-local decision,
since I don't know enough about the intended uses of what you did, but I do
have something to say about this decision in general.  I suggest at least a
FIXME comment where you do this stuff, based on what I say below.

> Would it be acceptable for Emacs to change the fallback behavior of
> `string-collate-lessp' to:
> 
> 1. If string collation is not available and IGNORE-CASE is nil, fall back
>    to `string-lessp';
> 2. If string collation is not available and IGNORE-CASE is non-nil,
>    use `downcase' + `string-lessp'.

'downcase' uses the buffer-local case table if such is defined for the
buffer that happens to be current when you invoke 'downcase', and that's
another cause of inconsistency and user surprises, especially when the
strings you compare don't really "belong" to the current buffer.  Also, in
some (rarely-used) locales, downcasing has unexpected results, even with the
default case-table.  For example, downcasing "I" produces "ı", not "i" as
expected.  Did you think about these cases when making the above decision?
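
The buffer-local part of this caveat can be sidestepped by binding a
known case table around the call.  A minimal sketch, assuming the
standard case table is acceptable for the strings being compared (the
helper name is hypothetical); it does nothing about case rules that
are themselves language-specific:

```elisp
;; Hypothetical helper: downcase without depending on the case table
;; of whichever buffer happens to be current.
(defun my/downcase-standard (string)
  "Return STRING downcased using the standard case table."
  (with-case-table (standard-case-table)
    (downcase string)))

(my/downcase-standard "ABC")
```

A caller that downcases before comparing could use such a wrapper so
that the result no longer depends on the current buffer.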

> I also do not think that it will be backwards-incompatible. If the call
> to `string-collate-lessp' explicitly requests ignoring case, `downcase'
> is more expected than bare `string-lessp' that _does not_ ignore case.
> 
> WDYT?

See above.  What you suggest is perhaps fine for plain-ASCII text, but not
in general, IMNSHO.

The reason for what Emacs currently does on systems that lack collation
functions is that for such systems collation rules are indeterminate, and so
inventing them by following naïve rules of plain ASCII, in particular the
case-conversion rules, is potentially very wrong.  These are general-purpose
APIs, not something concrete in specific Org contexts, and as such, these
APIs cannot "mostly work", they should work always and for every possible
use case.

And we are talking about a single system where these problems happen, which
is macOS, right?  Wouldn't it be better for "Someone" who uses macOS to just
bite the bullet and write a proper collation function, or find a free
software implementation of one, and include it in Emacs?  This is what I did
for MS-Windows at the time string-collate-lessp was added to Emacs.  Why
cannot macOS users do the same?






* bug#59275: Unexpected return value of `string-collate-lessp' on Mac
  2022-11-26  8:06                     ` Eli Zaretskii
@ 2022-11-26  8:47                       ` Ihor Radchenko
  2022-11-26  9:22                         ` Eli Zaretskii
  0 siblings, 1 reply; 24+ messages in thread
From: Ihor Radchenko @ 2022-11-26  8:47 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 59275

Eli Zaretskii <eliz@gnu.org> writes:

>> We concluded that a better fallback when collation is not available
>> would be using downcase+string-lessp when `string-collate-lessp' is
>> called with non-nil IGNORE-CASE argument.
>
> This has caveats, see below.  I won't argue about your Org-local decision,
> since I don't know enough about the intended uses of what you did, but I do
> have something to say about this decision in general.  I suggest at least a
> FIXME comment where you do this stuff, based on what I say below.

Thanks for the information!

>> Would it be acceptable for Emacs to change the fallback behavior of
>> `string-collate-lessp' to:
>> 
>> 1. If string collation is not available and IGNORE-CASE is nil, fallback
>>    to `string-lessp';
>> 2. If string collation is not available and IGNORE-CASE is non-nil,
>>    use `downcase' + `string-lessp'.
>
> 'downcase' uses the buffer-local case table if such is defined for the
> buffer that happens to be the current when you invoke 'downcase', and that's
> another cause of inconsistency and user surprises, especially when the
> strings you compare don't really "belong" to the current buffer.

Interesting. Is there any reason why this is not mentioned in the
docstring for `downcase'?

I now see 4.10 The Case Table section of the manual, and it looks like
case tables should be set mostly automatically (by Emacs?) according to
the language environment. Are details about this process documented
anywhere? Are these case conversion tables independent of glibc?

> Also, in
> some (rarely-used) locales, downcasing has unexpected results, even with the
> default case-table.  For example, downcasing "I" produces "ı", not "i" as
> expected.  Did you think about these cases when making the above decision?

I did not. However, I recall reading somewhere that it is possible to work
around this kind of issue by calling case conversion several times:
upcase -> downcase -> upcase -> downcase.

Now that you have reminded me about this caveat, I also recall
https://nullprogram.com/blog/2014/06/13/ which mentioned something
similar about caveats with composition. Just mentioning it for your
reference. (I am not sure whether the caveats discussed there have been
raised on Emacs devel.)

>> I also do not think that it will be backwards-incompatible. If the call
>> to `string-collate-lessp' explicitly requests ignoring case, `downcase'
>> is more expected than bare `string-lessp' that _does not_ ignore case.
>> 
>> WDYT?
>
> See above.  What you suggest is perhaps fine for plain-ASCII text, but not
> in general, IMNSHO.
>
> The reason for what Emacs currently does on systems that lack collation
> functions is that for such systems collation rules are indeterminate, and so
> inventing them by following naïve rules of plain ASCII, in particular the
> case-conversion rules, is potentially very wrong.  These are general-purpose
> APIs, not something concrete in specific Org contexts, and as such, these
> APIs cannot "mostly work", they should work always and for every possible
> use case.

I feel that I am missing something. Doesn't Emacs provide Unicode case
conversion tables? Why plain ASCII rules?

> And we are talking about a single system where these problems happen, which
> is macOS, right?  Wouldn't it be better for "Someone" who uses macOS to just
> bite the bullet and write a proper collation function, or find a free
> software implementation of one, and include it in Emacs?  This is what I did
> for MS-Windows at the time string-collate-lessp was added to Emacs.  Why
> cannot macOS users do the same?

It would be. But how can we ask for this? etc/TODO? Or maybe re-open
this bug report?







* bug#59275: Unexpected return value of `string-collate-lessp' on Mac
  2022-11-26  8:47                       ` Ihor Radchenko
@ 2022-11-26  9:22                         ` Eli Zaretskii
  2022-11-27 14:00                           ` Maxim Nikulin
  0 siblings, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2022-11-26  9:22 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: 59275

> From: Ihor Radchenko <yantar92@posteo.net>
> Cc: 59275@debbugs.gnu.org
> Date: Sat, 26 Nov 2022 08:47:13 +0000
> 
> > 'downcase' uses the buffer-local case table if such is defined for the
> > buffer that happens to be the current when you invoke 'downcase', and that's
> > another cause of inconsistency and user surprises, especially when the
> > strings you compare don't really "belong" to the current buffer.
> 
> Interesting. Is there any reason why this is not mentioned in the
> docstring for `downcase'?

Yes: because we are ashamed of that and hope to change it at some point, if
we ever figure out how to do that.  The way to avoid this caveat is simple:
bind the case table explicitly (e.g. with 'with-case-table') when you call
'downcase'.
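
A minimal sketch of pinning the case table around `downcase', assuming
the standard case table is the one wanted; the helper name
`my/downcase-standard' is hypothetical.  Note that this only removes the
dependence on the current buffer: the standard case table itself can
still be tweaked by the language environment.

```elisp
;; Sketch only: `my/downcase-standard' is a hypothetical helper name.
(defun my/downcase-standard (string)
  "Downcase STRING using the standard case table.
Any buffer-local case table of the current buffer is ignored."
  ;; `with-case-table' installs the table temporarily and restores
  ;; the previous one afterwards.
  (with-case-table (standard-case-table)
    (downcase string)))
```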

> I now see 4.10 The Case Table section of the manual, and it looks like
> case tables should be set mostly automatically (by Emacs?) according to
> the language environment.

Yes.  But a buffer can have its local case-table.

> Are details about this process documented anywhere?

No.  But see characters.el and the function I mention below.

> Are these case conversion tables independent of glibc?

Yes.  We build them completely separately and from scratch, as you will see
in characters.el.

> https://nullprogram.com/blog/2014/06/13/ that mentioned something
> similar about caveats with composition.

I don't see there anything about sorting or collation.  What did I miss?

> Just mentioning it for your reference. (I am not sure if the caveats
> discussed have been raised on Emacs devel).

What did you think ought to be discussed?

Btw, that blog fails to distinguish between display-time features and
processing of text without displaying it.  On display, Emacs combines
characters that are combining, so equivalent character sequences should look
the same.  But Emacs doesn't by default consider equivalent character
sequences as equal in all situations, leaving this to the Lisp program.
Considering them always as equal looks sexy in a blog post, because it
raises some brows and has the "whoah!" effect, but isn't a good policy in
general, since some applications definitely need to know about the original
decomposed sequence.  We cannot conceal this from Lisp programs by hiding
the original sequence on some low level that is not exposed to Lisp.  Yes,
this makes Lisp programs more complicated, but that comes with the
territory: you cannot have power without complexity.

> I feel that I am missing something. Doesn't Emacs provide Unicode case
> conversion tables?

The case tables we provide are based on Unicode, but are tweaked by the
language-environment.  See, for example, turkish-case-conversion-enable,
which is run when the Turkish language-environment is turned on.

> Why plain ASCII rules?

Because that is what your logic amounts to.  What you suggest breaks down
if you consider various complications in some locales.

> > And we are talking about a single system where these problems happen, which
> > is macOS, right?  Wouldn't it be better for "Someone" who uses macOS to just
> > bite the bullet and write a proper collation function, or find a free
> > software implementation of one, and include it in Emacs?  This is what I did
> > for MS-Windows at the time string-collate-lessp was added to Emacs.  Why
> > cannot macOS users do the same?
> 
> It would be. But how can we ask for this? etc/TODO? Or maybe re-open
> this bug report?

Anything will be fine with me, but unless the people who are asking you to
do these workarounds are motivated enough to sit down and do the job, we
will never get there.  And guess what effect these workarounds have on their
motivation.






* bug#59275: Unexpected return value of `string-collate-lessp' on Mac
  2022-11-26  9:22                         ` Eli Zaretskii
@ 2022-11-27 14:00                           ` Maxim Nikulin
  2022-11-27 14:23                             ` Eli Zaretskii
  0 siblings, 1 reply; 24+ messages in thread
From: Maxim Nikulin @ 2022-11-27 14:00 UTC (permalink / raw)
  To: Eli Zaretskii, Ihor Radchenko; +Cc: 59275

On 26/11/2022 16:22, Eli Zaretskii wrote:
>> From: Ihor Radchenko Date: Sat, 26 Nov 2022 08:47:13 +0000
>>
>>> 'downcase' uses the buffer-local case table if such is defined for the
>>> buffer that happens to be the current when you invoke 'downcase', and that's
>>> another cause of inconsistency and user surprises, especially when the
>>> strings you compare don't really "belong" to the current buffer.

`downcase' is already used in Org for case-insensitive sorting. I am
unsure whether that predates the introduction of `string-collate-lessp'.
A buffer-local conversion table is not a problem when table rows, list
items (text formatting objects, not elisp structures), or tags local to
the current file are sorted. However, when an agenda is built from
several files, the current buffer should not affect the order of entries.

Concerning Org, my point is that caseless sorting should be uniform.
Currently different functions use distinct approaches, and that is a more
severe inconsistency.

>> https://nullprogram.com/blog/2014/06/13/ that mentioned something
>> similar about caveats with composition.
> 
> I don't see there anything about sorting or collation.  What did I miss?

Doesn't the composed/decomposed representation affect the comparison result?

Emacs-devel thread mentioned earlier in this bug contains a link 
describing enough issues with string comparison:

https://stackoverflow.com/questions/319426/how-do-i-do-a-case-insensitive-string-comparison

>>> And we are talking about a single system where these problems happen, which
>>> is macOS, right?  Wouldn't it be better for "Someone" who uses macOS to just
>>> bite the bullet and write a proper collation function, or find a free
>>> software implementation of one, and include it in Emacs?

My impression was that clang should eventually get better locale
support. If so, I have doubts about a macOS-specific implementation.
I do not have a macOS machine, so I may be wrong in my assumption about
the locale implementation there. However, Emacs may benefit from its own
implementation of collation (based on the built-in Unicode character
database) used on (almost) all OSes. It would allow using several
locales in parallel without switching the libc locale, which is not
thread-safe.

I consider `downcase' a kind of workaround (a poor man's ignore-case)
that allows graceful degradation compared to `string-lessp'. From
my point of view, e.g. the case transformation rule for Turkish I is a
minor issue compared to completely disregarding the IGNORE-CASE argument,
at least when results are presented to users.

My argument against `downcase' in `string-collate-lessp' is that it may
add a noticeable performance penalty.

Interestingly `compare-strings' uses upcase conversion when the 
IGNORE-CASE argument is true. I believed that some implementations 
(unrelated to Emacs) may have problems with e.g. ß and considered 
downcase as a safer option.
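
For reference, a small sketch of a case-insensitive "less than" built on
`compare-strings' with a non-nil IGNORE-CASE argument.  The helper name
`my/string-lessp-ci' is hypothetical.  It relies on the documented
return convention of `compare-strings': t when the strings are equal,
otherwise a negative number when the first string sorts first.  Case
conversion there still goes through the current buffer's case table.

```elisp
;; Sketch only: `my/string-lessp-ci' is a hypothetical helper name.
(defun my/string-lessp-ci (a b)
  "Return non-nil if A sorts before B, ignoring case.
Uses `compare-strings' with IGNORE-CASE set, over the whole strings."
  (let ((result (compare-strings a nil nil b nil nil t)))
    ;; RESULT is t for equal strings, else a non-zero integer whose
    ;; sign tells which string is smaller.
    (and (integerp result) (< result 0))))
```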






* bug#59275: Unexpected return value of `string-collate-lessp' on Mac
  2022-11-27 14:00                           ` Maxim Nikulin
@ 2022-11-27 14:23                             ` Eli Zaretskii
  2022-11-27 15:19                               ` Maxim Nikulin
  0 siblings, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2022-11-27 14:23 UTC (permalink / raw)
  To: Maxim Nikulin; +Cc: yantar92, 59275

> From: Maxim Nikulin <m.a.nikulin@gmail.com>
> Date: Sun, 27 Nov 2022 21:00:50 +0700
> Cc: 59275@debbugs.gnu.org
> 
> Concerning Org, my point is that caseless sorting should be uniform. 

You need to work hard to get that.  Just using 'downcase' is not enough, and
neither is using 'string-collate-equalp'.

> >> https://nullprogram.com/blog/2014/06/13/ that mentioned something
> >> similar about caveats with composition.
> > 
> > I don't see there anything about sorting or collation.  What did I miss?
> 
> Does not composed/decomposed representation affect comparison result?

They are different texts, so yes, they do, and they should.
If you want to treat such strings as equivalent, you need to work even
harder, since Emacs currently doesn't have enough infrastructure to do it
right in all cases.
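
As a rough illustration of what "working harder" could look like for
plain equality (not collation), here is a sketch that normalizes both
strings to NFC before comparing.  The helper name
`my/string-equal-normalized' is hypothetical, and this covers canonical
equivalence only; it is not a claim that it handles every case.

```elisp
(require 'ucs-normalize)

;; Sketch only: `my/string-equal-normalized' is a hypothetical helper.
(defun my/string-equal-normalized (a b)
  "Return non-nil if A and B are equal after NFC normalization."
  (string= (ucs-normalize-NFC-string a)
           (ucs-normalize-NFC-string b)))
```

With a precomposed "señor" and its NFD form, `string=' treats the two as
different texts, while the helper above treats them as equal.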

> 
> Emacs-devel thread mentioned earlier in this bug contains a link 
> describing enough issues with string comparison:
> 
> https://stackoverflow.com/questions/319426/how-do-i-do-a-case-insensitive-string-comparison

This is about Python, no?

> From my point of view e.g. case transformation rule for Turkish I is a
> minor issue

Why, Org doesn't want to support Turkish users?

> My argument against `downcase' in `string-collate-lessp' is that it may 
> add noticeable performance penalty.

I'd worry about correctness before performance.

> Interestingly `compare-strings' uses upcase conversion when the 
> IGNORE-CASE argument is true. I believed that some implementations 
> (unrelated to Emacs) may have problems with e.g. ß and considered 
> downcase as a safer option.

Case conversions always have problems.






* bug#59275: Unexpected return value of `string-collate-lessp' on Mac
  2022-11-27 14:23                             ` Eli Zaretskii
@ 2022-11-27 15:19                               ` Maxim Nikulin
  2022-11-27 15:42                                 ` Eli Zaretskii
  0 siblings, 1 reply; 24+ messages in thread
From: Maxim Nikulin @ 2022-11-27 15:19 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Ihor Radchenko, 59275

On 27/11/2022 21:23, Eli Zaretskii wrote:
>> From: Maxim Nikulin Date: Sun, 27 Nov 2022 21:00:50 +0700
>>
>> Concerning Org, my point is that caseless sorting should be uniform.
> 
> You need to work hard to get that.  Just using 'downcase' is not enough, and
> neither is using 'string-collate-equalp'.

I do not like that some functions use `string-collate-lessp' with the
IGNORE-CASE argument while in other places strings are passed through
`downcase'. When a proper locale implementation is available, I
believe it is better to consistently use IGNORE-CASE. I assume that
text is presented to users, not serialized to be saved or sent as data.

When `string-collate-lessp' disregards IGNORE-CASE, I consider it
acceptable to use `downcase' (`upcase' may be worse since Org currently
uses `downcase'). It provides a reasonable balance between effort
invested and result obtained.

>> Does not composed/decomposed representation affect comparison result?
> 
> They are different texts, so yes, they do, and they should.
> If you want to treat such strings as equivalent, you need to work even
> harder, since Emacs currently doesn't have enough infrastructure to do it
> right in all cases.

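(require 'ucs-normalize)
(setq lst `("semana" "señor" ,(ucs-normalize-NFD-string "señor") "sepia"))
;; `sort' is destructive, so sort a copy each time.
(sort (copy-sequence lst) #'string-lessp)
=> ("semana" "señor" "sepia" "señor")
(sort (copy-sequence lst) #'string-collate-lessp)
=> ("semana" "señor" "señor" "sepia")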

`string-collate-lessp' is able to handle at least some cases, it is 
another argument to use it.

>> https://stackoverflow.com/questions/319426/how-do-i-do-a-case-insensitive-string-comparison
> 
> This is about Python, no?

The value of this link is its collection of examples that are not obvious
to everybody. They are applicable to the behavior of `string-lessp' vs.
`string-collate-lessp' as well.

>>  From my point of view e.g. case transformation rule for Turkish I is a
>> minor issue
> 
> Why, Org doesn't want to support Turkish users?

From my point of view it is a minor issue in comparison to

     (string-collate-lessp "a" "B" "C" t)  ; => nil

that breaks comparison not only for accented letters.

You almost managed to convince Ihor to use `string-lessp' instead of
`string-collate-lessp'. I do not think it would improve the quality of
support for the Turkish language.

My suggestion is to fall back to `downcase' and `string-lessp' only if 
`string-collate-lessp' is unable to provide case insensitive comparison.
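
A sketch of that suggestion, as an Org-side workaround rather than
anything Emacs does.  Both function names are hypothetical, and the
probe is only a heuristic that I assume is good enough to detect the
behavior reported in this bug; the `downcase' branch keeps all the
caveats already raised about case tables and locales.

```elisp
;; Sketch only: both names below are hypothetical helpers.
(defun my/collation-ignores-case-p ()
  "Heuristic probe: non-nil if `string-collate-lessp' honors IGNORE-CASE."
  (and (string-collate-lessp "a" "B" nil t)
       (not (string-collate-lessp "B" "a" nil t))))

(defun my/string-lessp-ignore-case (a b)
  "Compare A and B ignoring case, preferring locale collation.
Fall back to `downcase' plus `string-lessp' only when the probe fails."
  (if (my/collation-ignores-case-p)
      (string-collate-lessp a b nil t)
    ;; Pin the case table so the current buffer does not matter.
    (with-case-table (standard-case-table)
      (string-lessp (downcase a) (downcase b)))))
```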

>> My argument against `downcase' in `string-collate-lessp' is that it may
>> add noticeable performance penalty.
> 
> I'd worry about correctness before performance.

`downcase' with `string-lessp' handles more cases than just
`string-lessp' (leaving aside buffer-local conversion tables), so from
my point of view the former is more correct. Even `downcase' with a fixed
"C" locale may give a result more consistent with user expectations. My
impression is that users may be familiar with widespread problems with
sorting.






* bug#59275: Unexpected return value of `string-collate-lessp' on Mac
  2022-11-27 15:19                               ` Maxim Nikulin
@ 2022-11-27 15:42                                 ` Eli Zaretskii
  0 siblings, 0 replies; 24+ messages in thread
From: Eli Zaretskii @ 2022-11-27 15:42 UTC (permalink / raw)
  To: Maxim Nikulin; +Cc: yantar92, 59275

> From: Maxim Nikulin <m.a.nikulin@gmail.com>
> Date: Sun, 27 Nov 2022 22:19:24 +0700
> Cc: Ihor Radchenko <yantar92@posteo.net>, 59275@debbugs.gnu.org
> 
> I do not like that in some functions `string-collate-lessp' with 
> IGNORE-CASE argument is used while strings are passed through `downcase' 
> in other places. When proper locales implementation is available, I 
> believe, it is better to consistently use IGNORE-CASE.

I already explained up-thread why we ignore IGNORE-CASE when collation order
is not known.  I stand by that reasoning.  I believe your opinion is based
on considering only simple locales, and on the a-priori knowledge what is
the locale's collation to begin with, something that Emacs cannot know in
that case.

> When `string-collate-lessp' disregards IGNORE-CASE, I consider it 
> acceptable to use `downcase' (`upcase' may be worse since Org currently 
> uses `downcase'). It provides reasonable balance of invested efforts and 
> obtained result.

We disagree, sorry.

> `("semana" "señor" ,(ucs-normalize-NFD-string "señor") "sepia")
> (sort lst #'string-lessp)
> => ("semana" "señor" "sepia" "señor")
> (sort lst #'string-collate-lessp)
> => ("semana" "señor" "señor" "sepia")
> 
> `string-collate-lessp' is able to handle at least some cases

On what OS and with which libc?

And I don't think this is evidence of collation knowing about equivalent
sequences.  It is most probably a side effect of collation ignoring
Latin accents altogether.

> >> https://stackoverflow.com/questions/319426/how-do-i-do-a-case-insensitive-string-comparison
> > 
> > This is about Python, no?
> 
> The value of this link is a collection of examples that are not obvious 
> for everybody. They are applicable to behavior `string-lessp' vs. 
> `string-collate-lessp' as well.

Which parts are applicable, in your opinion, and in what way?

> >>  From my point of view e.g. case transformation rule for Turkish I is a
> >> minor issue
> > 
> > Why, Org doesn't want to support Turkish users?
> 
>  From my point of view it is a minor issue in comparison to
> 
>      (string-collate-lessp "a" "B" "C" t)  ; => nil
> 
> that breaks comparison not only for accented letters.

Org is free to make such misguided decisions, but Emacs won't.  We cannot
decide that some locale is "minor" and others are "major".  My suggestion is
to look for a solution that works in any locale.

> You almost managed to convince Ihor to use `string-lessp' instead of
> `string-collate-lessp'. I do not think it would improve quality of 
> support of Turkish language.

I didn't try to convince Ihor of anything, just point out the pitfalls of
using locale-specific collation order in portable programs.  I said back
then that I don't know enough to evaluate your decisions.  Once you
understand the subtle issues with these APIs, it is your call to decide how
to solve your particular problems.

> My suggestion is to fall back to `downcase' and `string-lessp' only if 
> `string-collate-lessp' is unable to provide case insensitive comparison.

You can do that in Org if that's the decision of the Org developers.  Emacs
cannot do that automatically for the reasons I explained up-thread.

> >> My argument against `downcase' in `string-collate-lessp' is that it may
> >> add noticeable performance penalty.
> > 
> > I'd worry about correctness before performance.
> 
> `downcase' with `string-lessp' handles more cases than just 
> `string-lessp' (leaving aside buffer-local conversion tables), so from
> my point of view the former is more correct.

I'm quite sure this is only true for the cases that you considered, not in
general.

> Even `downcase' with fixed "C" locale may give result more consistent with
> user expectations.

How does it help on systems where locale-specific collation is not
accessible to Emacs?

> My impression is that users may be familiar with widespread problems with
> sorting.

Not IME.  But that's a separate issue, and I don't pretend to know Org users
better than you do, so I will defer to you on this one.






end of thread, other threads:[~2022-11-27 15:42 UTC | newest]

Thread overview: 24+ messages
2022-11-15  4:08 bug#59275: Unexpected return value of `string-collate-lessp' on Mac Ihor Radchenko
2022-11-15  9:51 ` Robert Pluim
2022-11-16  3:47   ` Ihor Radchenko
2022-11-15 13:46 ` Eli Zaretskii
2022-11-15 15:05   ` Ihor Radchenko
2022-11-15 15:16     ` Eli Zaretskii
2022-11-16  1:34       ` Ihor Radchenko
2022-11-16 13:00         ` Eli Zaretskii
2022-11-21  7:28           ` Ihor Radchenko
2022-11-21 13:31             ` Eli Zaretskii
2022-11-22  1:24               ` Ihor Radchenko
2022-11-22 12:56                 ` Eli Zaretskii
2022-11-23 10:39                   ` Ihor Radchenko
2022-11-23 14:58                     ` Eli Zaretskii
2022-11-24  2:22                       ` Ihor Radchenko
2022-11-24  7:23                         ` Eli Zaretskii
2022-11-26  2:03                   ` Ihor Radchenko
2022-11-26  8:06                     ` Eli Zaretskii
2022-11-26  8:47                       ` Ihor Radchenko
2022-11-26  9:22                         ` Eli Zaretskii
2022-11-27 14:00                           ` Maxim Nikulin
2022-11-27 14:23                             ` Eli Zaretskii
2022-11-27 15:19                               ` Maxim Nikulin
2022-11-27 15:42                                 ` Eli Zaretskii
