unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
       [not found] <53FAB5F9.9050706@yandex.ru>
@ 2014-08-25  5:48 ` Paul Eggert
  2014-08-25  6:19   ` Dmitry Antipov
  0 siblings, 1 reply; 28+ messages in thread
From: Paul Eggert @ 2014-08-25  5:48 UTC (permalink / raw)
  To: Dmitry Antipov; +Cc: Michael Albinus, 18051

Dmitry Antipov wrote:

> ../../trunk/src/sysdep.c:3527:1: error: no previous prototype for
> ‘str_collate’ [-Werror=missing-prototypes]

I fixed that problem, along with some other minor glitches associated 
with the patch, in trunk bzr 117733.






* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-08-25  5:48 ` bug#18051: [Emacs-diffs] trunk r117726: Add string collation Paul Eggert
@ 2014-08-25  6:19   ` Dmitry Antipov
  2014-08-25  6:41     ` Michael Albinus
  0 siblings, 1 reply; 28+ messages in thread
From: Dmitry Antipov @ 2014-08-25  6:19 UTC (permalink / raw)
  To: Paul Eggert, Michael Albinus; +Cc: 18051

On 08/25/2014 09:48 AM, Paul Eggert wrote:

> I fixed that problem, along with some other minor glitches
> associated with the patch, in trunk bzr 117733.

Thanks.

BTW, I think that collation functions with 3rd optional argument
to specify locale settings will be a bit more versatile, e.g.

(string-collate-lessp a b "es_ES.UTF-8")

Dmitry







* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-08-25  6:19   ` Dmitry Antipov
@ 2014-08-25  6:41     ` Michael Albinus
  2014-08-25 15:03       ` Eli Zaretskii
  0 siblings, 1 reply; 28+ messages in thread
From: Michael Albinus @ 2014-08-25  6:41 UTC (permalink / raw)
  To: Dmitry Antipov; +Cc: Paul Eggert, 18051

Dmitry Antipov <dmantipov@yandex.ru> writes:

> BTW, I think that collation functions with 3rd optional argument
> to specify locale settings will be a bit more versatile, e.g.
>
> (string-collate-lessp a b "es_ES.UTF-8")

We discussed this already, see <http://lists.gnu.org/archive/html/bug-gnu-emacs/2014-08/msg00623.html>

My major reservation about this approach is that it doesn't fit well with using
string-collate-lessp as a predicate of sort. That's why I have proposed a
global variable as an alternative, which could be let-bound.

> Dmitry

Best regards, Michael.






* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-08-25  6:41     ` Michael Albinus
@ 2014-08-25 15:03       ` Eli Zaretskii
  2014-08-25 16:01         ` Eli Zaretskii
  2014-08-27 11:24         ` Michael Albinus
  0 siblings, 2 replies; 28+ messages in thread
From: Eli Zaretskii @ 2014-08-25 15:03 UTC (permalink / raw)
  To: Michael Albinus; +Cc: dmantipov, 18051, eggert

> From: Michael Albinus <michael.albinus@gmx.de>
> Date: Mon, 25 Aug 2014 08:41:03 +0200
> Cc: Paul Eggert <eggert@cs.ucla.edu>, 18051@debbugs.gnu.org
> 
> > BTW, I think that collation functions with 3rd optional argument
> > to specify locale settings will be a bit more versatile, e.g.
> >
> > (string-collate-lessp a b "es_ES.UTF-8")
>
> We discussed this already, see
> <http://lists.gnu.org/archive/html/bug-gnu-emacs/2014-08/msg00623.html>
>
> My major reservation about this approach is that it doesn't fit well with using
> string-collate-lessp as a predicate of sort. That's why I have proposed a
> global variable as an alternative, which could be let-bound.

I think that binding a variable will indeed be cleaner.  Using
process-environment for that purpose should be reserved for the
application level.  Also, what if LC_COLLATE is not set in the
environment, but 'setlocale' does return some value for it?  Shouldn't
we use that?

Here are a few more thoughts about related issues:

1. Why does str_collate return a ptrdiff_t value?  AFAIK, wcscoll
   etc. return int data type, and of rather small values.

2. Should we signal an error if the input strings are not pure-ASCII
   or multibyte?  Unibyte strings will at best cause incorrect
   results.  And what about strings with invalid codepoints,
   e.g. those outside of the Unicode range, which can happen inside
   Lisp strings?

3. What about errors in wcscoll?  The current code ignores them;
   however, the value returned by wcscoll in case of an error is not
   documented, so it could be random.  Should we signal an error if
   errno gets set by wcscoll?

4. How to control the optional features of the collating sequence?  I
   mean, for example, the fact that punctuation characters are ignored
   in the .UTF-8 locales on glibc hosts (or so it seems).  At least on
   Windows, a somewhat higher degree of control is available, but it
   must be specified separately from the locale ID.  E.g., the
   comparison function accepts flags to ignore punctuation and
   symbols, width differences, diacritics, etc. Should we have another
   variable, perhaps w32-specific, to request these features?
   Alternatively, we could use .UTF-8 on Windows to communicate that,
   although that sounds like a kludge.

5. The locale names on Windows are different from Posix: Windows uses
   3-letter abbreviations of the country and the language,
   e.g. "fra_FRA" instead of the Posix "fr_FR".  Do we want the locale
   string values used for let-binding the above-mentioned variable to
   be portable across systems?  Then we'd need some conversion
   database on MS-Windows.

6. I think we will want a case-insensitive version of this function.
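
Point 3 above can be sketched in C.  This is only a minimal
illustration, assuming a POSIX system where wcscoll reports failures
through errno (POSIX allows EINVAL for wide characters outside the
collating sequence).  The helper name `checked_wcscoll' is hypothetical
and is not the Emacs str_collate.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <wchar.h>

/* Hypothetical helper, not the Emacs str_collate.  Compare A and B with
   wcscoll in the current locale.  Return true and store the raw int
   result in *RESULT on success; return false if wcscoll set errno.  */
bool
checked_wcscoll (const wchar_t *a, const wchar_t *b, int *result)
{
  assert (a != NULL && b != NULL && result != NULL);

  /* wcscoll has no reserved error return value, so the only way to
     detect a failure is to clear errno first and inspect it after.  */
  errno = 0;
  int r = wcscoll (a, b);
  if (errno != 0)
    return false;

  *result = r;
  return true;
}
```

A caller would signal a Lisp error when the helper returns false,
instead of trusting whatever value wcscoll happened to return.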






* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-08-25 15:03       ` Eli Zaretskii
@ 2014-08-25 16:01         ` Eli Zaretskii
  2014-08-27 11:24         ` Michael Albinus
  1 sibling, 0 replies; 28+ messages in thread
From: Eli Zaretskii @ 2014-08-25 16:01 UTC (permalink / raw)
  To: michael.albinus; +Cc: dmantipov, 18051, eggert

This is now implemented for MS-Windows as well.






* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-08-25 15:03       ` Eli Zaretskii
  2014-08-25 16:01         ` Eli Zaretskii
@ 2014-08-27 11:24         ` Michael Albinus
  2014-08-27 15:40           ` Eli Zaretskii
  2014-08-27 19:00           ` Paul Eggert
  1 sibling, 2 replies; 28+ messages in thread
From: Michael Albinus @ 2014-08-27 11:24 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dmantipov, 18051, eggert

Eli Zaretskii <eliz@gnu.org> writes:

> Here are a few more thoughts about related issues:
>
> 1. Why does str_collate return a ptrdiff_t value?  AFAIK, wcscoll
>    etc. return int data type, and of rather small values.

Hm, yes. Both wcscoll and w32_compare_strings return int, so I've
changed that for str_collate accordingly.

> 2. Should we signal an error if the input strings are not pure-ASCII
>    or multibyte?  Unibyte strings will at best cause incorrect
>    results.

Maybe we shall convert the strings to multibyte, via string_to_multibyte()?
If the string is already multibyte, it does no harm.

>    And what about strings with invalid codepoints,
>    e.g. those outside of the Unicode range, which can happen inside
>    Lisp strings?

> 3. What about errors in wcscoll?  The current code ignores them;
>    however, the value returned by wcscoll in case of an error is not
>    documented, so it could be random.  Should we signal an error if
>    errno gets set by wcscoll?

wcscoll sets EINVAL when the codepoint is out of range. I've added a
check for this case, returning an error.

(string-collate-equalp (string 1) (string ?\U0020FFFF))
  => error: Non-Unicode character: 0x20ffff

> 4. How to control the optional features of the collating sequence?  I
>    mean, for example, the fact that punctuation characters are ignored
>    in the .UTF-8 locales on glibc hosts (or so it seems).  At least on
>    Windows, a somewhat higher degree of control is available, but it
>    must be specified separately from the locale ID.  E.g., the
>    comparison function accepts flags to ignore punctuation and
>    symbols, width differences, diacritics, etc. Should we have another
>    variable, perhaps w32-specific, to request these features?
>    Alternatively, we could use .UTF-8 on Windows to communicate that,
>    although that sounds like a kludge.

On Posix systems, I'm not aware of a way to configure such optional features
via glibc. The most granular selection is what you do with LC_COLLATE.

If we want to offer more granular settings, we would need to use a library
like libicu (http://icu-project.org/). Could be done, but should be optional.

> 5. The locale names on Windows are different from Posix: Windows uses
>    3-letter abbreviations of the country and the language,
>    e.g. "fra_FRA" instead of the Posix "fr_FR".  Do we want the locale
>    string values used for let-binding the above-mentioned variable to
>    be portable across systems?  Then we'd need some conversion
>    database on MS-Windows.

Here I'm a bit undecided. We could leave it to the users to find the
proper locale name, but this is inconvenient. OTOH it would be much work
to install a mapping system, and we would need to maintain it. What if
there were a new "en_SC" (Scotland) locale? We would need to
maintain such changes in Emacs forever ...

> 6. I think we will want a case-insensitive version of this function.

That's also on my todo list. But I'm a little bit undecided whether we
shall add it to string-collate-* functions, or whether there shall be
further functions.

Maybe we could use sort-fold-case for this as indication? Or is this too
specific?

Best regards, Michael.






* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-08-27 11:24         ` Michael Albinus
@ 2014-08-27 15:40           ` Eli Zaretskii
  2014-08-27 18:12             ` Michael Albinus
  2014-08-27 19:00           ` Paul Eggert
  1 sibling, 1 reply; 28+ messages in thread
From: Eli Zaretskii @ 2014-08-27 15:40 UTC (permalink / raw)
  To: Michael Albinus; +Cc: dmantipov, 18051, eggert

> From: Michael Albinus <michael.albinus@gmx.de>
> Cc: dmantipov@yandex.ru,  eggert@cs.ucla.edu,  18051@debbugs.gnu.org
> Date: Wed, 27 Aug 2014 13:24:48 +0200
> 
> > 2. Should we signal an error if the input strings are not pure-ASCII
> >    or multibyte?  Unibyte strings will at best cause incorrect
> >    results.
> 
> Maybe we shall convert the strings to multibyte, via string_to_multibyte()?

That will not help.

I say code that invokes these functions with unibyte non-ASCII strings
has a bug that should be flagged.

> > 5. The locale names on Windows are different from Posix: Windows uses
> >    3-letter abbreviations of the country and the language,
> >    e.g. "fra_FRA" instead of the Posix "fr_FR".  Do we want the locale
> >    string values used for let-binding the above-mentioned variable to
> >    be portable across systems?  Then we'd need some conversion
> >    database on MS-Windows.
> 
> Here I'm a bit undecided. We could leave it to the users to find the
> proper locale name, but this is inconvenient. OTOH it would be much work
> to install a mapping system, and we would need to maintain it. What if
> there were a new "en_SC" (Scotland) locale? We would need to
> maintain such changes in Emacs forever ...

I think these interfaces will almost always be used with the current
locale.  So with that in mind, I think we can document this issue, and
then safely leave this problem to the code that needs to use
non-default locales.

> > 6. I think we will want a case-insensitive version of this function.
> 
> That's also on my todo list. But I'm a little bit undecided whether we
> shall add it to string-collate-* functions, or whether there shall be
> further functions.
> 
> Maybe we could use sort-fold-case for this as indication? Or is this too
> specific?

See my suggestion in the other message.

Thanks.






* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-08-27 15:40           ` Eli Zaretskii
@ 2014-08-27 18:12             ` Michael Albinus
  2014-08-27 18:26               ` Eli Zaretskii
  0 siblings, 1 reply; 28+ messages in thread
From: Michael Albinus @ 2014-08-27 18:12 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dmantipov, 18051, eggert

Eli Zaretskii <eliz@gnu.org> writes:

>> > 2. Should we signal an error if the input strings are not pure-ASCII
>> >    or multibyte?  Unibyte strings will at best cause incorrect
>> >    results.
>> 
>> Maybe we shall convert the strings to multibyte, via string_to_multibyte()?
>
> That will not help.
>
> I say code that invokes these functions with unibyte non-ASCII strings
> has a bug that should be flagged.

Well, you have much more experience with Unicode than I have.

>> > 5. The locale names on Windows are different from Posix: Windows uses
>> >    3-letter abbreviations of the country and the language,
>> >    e.g. "fra_FRA" instead of the Posix "fr_FR".  Do we want the locale
>> >    string values used for let-binding the above-mentioned variable to
>> >    be portable across systems?  Then we'd need some conversion
>> >    database on MS-Windows.
>> 
>> Here I'm a bit undecided. We could leave it to the users to find the
>> proper locale name, but this is inconvenient. OTOH it would be much work
>> to install a mapping system, and we would need to maintain it. What if
>> there were a new "en_SC" (Scotland) locale? We would need to
>> maintain such changes in Emacs forever ...
>
> I think these interfaces will almost always be used with the current
> locale.  So with that in mind, I think we can document this issue, and
> then safely leave this problem to the code that needs to use
> non-default locales.

I don't get this. What do you propose here? Set the locale specific to
the system Emacs is running on, or do you propose a mapping to something
which is portable across system boundaries?

> Thanks.

Best regards, Michael.






* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-08-27 18:12             ` Michael Albinus
@ 2014-08-27 18:26               ` Eli Zaretskii
  0 siblings, 0 replies; 28+ messages in thread
From: Eli Zaretskii @ 2014-08-27 18:26 UTC (permalink / raw)
  To: Michael Albinus; +Cc: dmantipov, 18051, eggert

> From: Michael Albinus <michael.albinus@gmx.de>
> Cc: dmantipov@yandex.ru,  eggert@cs.ucla.edu,  18051@debbugs.gnu.org
> Date: Wed, 27 Aug 2014 20:12:12 +0200
> 
> >> > 5. The locale names on Windows are different from Posix: Windows uses
> >> >    3-letter abbreviations of the country and the language,
> >> >    e.g. "fra_FRA" instead of the Posix "fr_FR".  Do we want the locale
> >> >    string values used for let-binding the above-mentioned variable to
> >> >    be portable across systems?  Then we'd need some conversion
> >> >    database on MS-Windows.
> >> 
> >> Here I'm a bit undecided. We could leave it to the users to find the
> >> proper locale name, but this is inconvenient. OTOH it would be much work
> >> to install a mapping system, and we would need to maintain it. What if
> >> there were a new "en_SC" (Scotland) locale? We would need to
> >> maintain such changes in Emacs forever ...
> >
> > I think these interfaces will almost always be used with the current
> > locale.  So with that in mind, I think we can document this issue, and
> > then safely leave this problem to the code that needs to use
> > non-default locales.
> 
> I don't get this. What do you propose here? Set the locale specific to
> the system Emacs is running on, or do you propose a mapping to something
> which is portable across system boundaries?

The former.  IOW, the (rare, IMO) Lisp program that wants to override
the default locale will have to figure out how to do that in a way
that works on all the supported platforms.  E.g., one way is

  (let ((locale (if (eq system-type 'windows-nt)
                    "enu_USA"
                  "en_US")))
     ...







* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-08-27 11:24         ` Michael Albinus
  2014-08-27 15:40           ` Eli Zaretskii
@ 2014-08-27 19:00           ` Paul Eggert
  2014-08-27 19:08             ` Paul Eggert
  1 sibling, 1 reply; 28+ messages in thread
From: Paul Eggert @ 2014-08-27 19:00 UTC (permalink / raw)
  To: Michael Albinus, Eli Zaretskii; +Cc: dmantipov, 18051

I found the following issues and installed what I hope are fixes as 
trunk bzr 117751.

First, the code should use wcscoll_l rather than uselocale, as uselocale 
modifies thread state and this is less robust; for example, it wasn't 
safe to call 'error' right after the first call to uselocale.

Second, if the locale is invalid, string-collate-lessp should throw an 
error, the same way it throws an error when the strings are invalid.
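
Both points can be sketched roughly as follows.  This is an
illustration only, not the code installed in trunk: it assumes a
POSIX.1-2008 system such as glibc that provides newlocale, wcscoll_l
and freelocale, and the name `collate_in_locale' is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <locale.h>
#include <wchar.h>

/* Hypothetical helper, not the code installed in trunk.  Compare A and B
   under the collation rules of LOCALE_NAME without changing the
   thread's current locale.  Return 0 and store the comparison in
   *RESULT on success; return -1 if the locale cannot be created or
   wcscoll_l reports an error through errno.  */
int
collate_in_locale (const wchar_t *a, const wchar_t *b,
                   const char *locale_name, int *result)
{
  assert (a != NULL && b != NULL && locale_name != NULL && result != NULL);

  locale_t loc = newlocale (LC_COLLATE_MASK, locale_name, (locale_t) 0);
  if (loc == (locale_t) 0)
    return -1;                  /* invalid or uninstalled locale */

  errno = 0;
  int r = wcscoll_l (a, b, loc);
  int err = errno;
  freelocale (loc);             /* no thread state needs restoring */

  if (err != 0)
    return -1;
  *result = r;
  return 0;
}
```

Because the locale is passed explicitly, an early return on error
leaves nothing behind to undo, which is the robustness problem with
the uselocale approach.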






* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-08-27 19:00           ` Paul Eggert
@ 2014-08-27 19:08             ` Paul Eggert
  2014-08-27 19:54               ` Eli Zaretskii
  0 siblings, 1 reply; 28+ messages in thread
From: Paul Eggert @ 2014-08-27 19:08 UTC (permalink / raw)
  To: Michael Albinus, Eli Zaretskii; +Cc: dmantipov, 18051

A couple more things.

First, the current algorithm looks only at LC_COLLATE, but the usual 
approach is to default LC_COLLATE to LANG if LC_COLLATE isn't set, and 
to have LC_ALL override LC_COLLATE.  Shouldn't Emacs take a similar 
approach, for compatibility?

More generally, it strikes me that string-collate-lessp will be quite 
slow due to the overhead of looking up the locale environment string and 
creating and destroying a locale for each string comparison.  Instead, 
shouldn't Emacs have a locale object that the Emacs Lisp 
programmer can create, an object that encapsulates the low level 
locale_t object, and which can be passed as an optional argument to 
string-collate-lessp?  That way, string-collate-lessp would never have to 
inspect the environment itself, or to create or destroy a locale.
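
The precedence described in the first point (LC_ALL overrides
LC_COLLATE, which falls back to LANG) can be sketched as below.  This
is a minimal illustration assuming a POSIX environment; the function
name `collate_locale_from_env' and its FALLBACK parameter are
hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: return the locale name that the usual POSIX
   precedence selects for collation (LC_ALL, then LC_COLLATE, then
   LANG).  A variable that is unset or empty is skipped.  If none
   applies, return FALLBACK (for example "C").  */
const char *
collate_locale_from_env (const char *fallback)
{
  assert (fallback != NULL);

  static const char *const vars[] = { "LC_ALL", "LC_COLLATE", "LANG" };
  for (size_t i = 0; i < sizeof vars / sizeof vars[0]; i++)
    {
      const char *value = getenv (vars[i]);
      if (value != NULL && value[0] != '\0')
        return value;
    }
  return fallback;
}
```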






* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-08-27 19:08             ` Paul Eggert
@ 2014-08-27 19:54               ` Eli Zaretskii
  2014-08-27 21:27                 ` Paul Eggert
  0 siblings, 1 reply; 28+ messages in thread
From: Eli Zaretskii @ 2014-08-27 19:54 UTC (permalink / raw)
  To: Paul Eggert; +Cc: michael.albinus, dmantipov, 18051

> Date: Wed, 27 Aug 2014 12:08:52 -0700
> From: Paul Eggert <eggert@cs.ucla.edu>
> CC: dmantipov@yandex.ru, 18051@debbugs.gnu.org
> 
> First, the current algorithm looks only at LC_COLLATE, but the usual 
> approach is to default LC_COLLATE to LANG if LC_COLLATE isn't set, and 
> to have LC_ALL override LC_COLLATE.  Shouldn't Emacs take a similar 
> approach, for compatibility?

I think we agreed to have a variable that holds the non-default locale
as a Lisp string.  LANG and LC_COLLATE will then be used internally by
newlocale and/or wcscoll_l, as users expect.  I don't think it's
appropriate for a primitive to take arguments from environment
variables, certainly not those on process-environment.  If some Lisp
application would want to do that, let them.

> More generally, it strikes me that string-collate-lessp will be quite 
> slow due to the overhead of looking up the locale environment string and 
> creating and destroying a locale for each string comparison.

The lookup will no longer be relevant, when we switch to a variable.

As for creating and destroying the locale, I guess you are right.

> Instead, shouldn't Emacs have a locale object that the Emacs
> Lisp programmer can create, an object that encapsulates the low
> level locale_t object, and which can be passed as an optional
> argument to string-collate-lessp?

That's what Guile does.  But it will complicate using these functions
in sorting routines.  Perhaps binding a variable to the object will
do.

Alternatively, a simple one-slot cache internal to string_collate will
probably remove most of the overhead.  (You will see that
w32_compare_strings already employs a similar cache.)
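
A one-slot cache of that kind might look like the sketch below.  It is
only an illustration under stated assumptions: a POSIX.1-2008 system
with newlocale/freelocale, a single thread, and hypothetical names
(`cached_collate_locale', `locales_created').  It is not taken from
w32_compare_strings.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Hypothetical one-slot cache; names and sizes are illustrative.  */
static char cached_name[64];
static locale_t cached_locale;  /* (locale_t) 0 while the slot is empty */
static int locales_created;     /* instrumentation for this sketch only */

/* Return a collation locale for NAME, reusing the cached object when
   NAME equals the name used by the previous successful call.  Return
   (locale_t) 0 on failure.  Not thread-safe.  */
locale_t
cached_collate_locale (const char *name)
{
  assert (name != NULL);

  if (cached_locale != (locale_t) 0 && strcmp (name, cached_name) == 0)
    return cached_locale;       /* cache hit: nothing is created */

  if (strlen (name) >= sizeof cached_name)
    return (locale_t) 0;        /* name too long for the slot */

  locale_t fresh = newlocale (LC_COLLATE_MASK, name, (locale_t) 0);
  if (fresh == (locale_t) 0)
    return (locale_t) 0;        /* keep the old slot on failure */

  if (cached_locale != (locale_t) 0)
    freelocale (cached_locale); /* evict the previous entry */
  strcpy (cached_name, name);
  cached_locale = fresh;
  locales_created++;
  return fresh;
}
```

A sort that calls the comparison thousands of times with one locale
name would then create the locale object once rather than once per
comparison.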






* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-08-27 19:54               ` Eli Zaretskii
@ 2014-08-27 21:27                 ` Paul Eggert
  2014-08-27 21:37                   ` Michael Albinus
  0 siblings, 1 reply; 28+ messages in thread
From: Paul Eggert @ 2014-08-27 21:27 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: michael.albinus, dmantipov, 18051

Eli Zaretskii wrote:

> I think we agreed to have a variable that holds the non-default locale
> as a Lisp string.

Ah, sorry, missed that (it is a long thread...).  Makes sense.  I assume 
this is on someone's TODO list since it's not done that way now.

> Perhaps binding a variable to the object will do.

We could do both: i.e., give the comparison function an optional 
argument that defaults to the value of the bound variable.  I'd think 
the value should be a locale object, though, not a string like "en_US".
And perhaps the object should also record whether the comparison is
case-sensitive, and other stuff like that.

> Alternatively, a simple one-slot cache internal to string_collate will
> probably remove most of the overhead.

It would now, but it would also add another obstacle to adding 
multithreading capabilities, as the locking around the cache would 
inhibit scalability.  So I'd rather avoid such a cache if it's easy.






* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-08-27 21:27                 ` Paul Eggert
@ 2014-08-27 21:37                   ` Michael Albinus
  2014-08-28  2:39                     ` Eli Zaretskii
  2014-08-29  8:59                     ` martin rudalics
  0 siblings, 2 replies; 28+ messages in thread
From: Michael Albinus @ 2014-08-27 21:37 UTC (permalink / raw)
  To: Paul Eggert; +Cc: dmantipov, 18051

Paul Eggert <eggert@cs.ucla.edu> writes:

> Ah, sorry, missed that (it is a long thread...).  Makes sense.  I
> assume this is on someone's TODO list since it's not done that way
> now.

Eli, that means you or me :-)

I do not want to interfere with your work, but in case you are busy with
other tasks, I could do it. Please let me know.

>> Perhaps binding a variable to the object will do.
>
> We could do both: i.e., give the comparison function an optional
> argument that defaults to the value of the bound variable.  I'd think
> the value should be a locale object, though, not a string like
> "en_US". And perhaps the object should also record whether the
> comparison is case-sensitive, and other stuff like that.

Good idea, that would also make Glenn happy. (That's not a joke, I mean
it seriously!)

Best regards, Michael.






* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-08-27 21:37                   ` Michael Albinus
@ 2014-08-28  2:39                     ` Eli Zaretskii
  2014-08-29  8:59                     ` martin rudalics
  1 sibling, 0 replies; 28+ messages in thread
From: Eli Zaretskii @ 2014-08-28  2:39 UTC (permalink / raw)
  To: Michael Albinus; +Cc: eggert, 18051, dmantipov

> From: Michael Albinus <michael.albinus@gmx.de>
> Cc: Eli Zaretskii <eliz@gnu.org>,  dmantipov@yandex.ru,  18051@debbugs.gnu.org
> Date: Wed, 27 Aug 2014 23:37:35 +0200
> 
> Paul Eggert <eggert@cs.ucla.edu> writes:
> 
> > Ah, sorry, missed that (it is a long thread...).  Makes sense.  I
> > assume this is on someone's TODO list since it's not done that way
> > now.
> 
> Eli, that means you or me :-)
> 
> I do not want to interfere with your work, but in case you are busy with
> other tasks, I could do it. Please let me know.

Feel free, and thanks.






* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-08-27 21:37                   ` Michael Albinus
  2014-08-28  2:39                     ` Eli Zaretskii
@ 2014-08-29  8:59                     ` martin rudalics
  2014-08-29  9:59                       ` Michael Albinus
  2014-08-29 10:06                       ` Eli Zaretskii
  1 sibling, 2 replies; 28+ messages in thread
From: martin rudalics @ 2014-08-29  8:59 UTC (permalink / raw)
  To: Michael Albinus, Paul Eggert; +Cc: dmantipov, 18051

 > Good idea, that would also make Glenn happy. (That's not a joke, I mean
 > it seriously!)

It would make me happy as well.  I have not yet started to convert my
fairly insane sorting functions to the new ones because mine are
generally based on case-insensitiveness.  Also I'm not yet sure how the
new predicates will relate to functions like `compare-strings' (which
IIUC is needed until now to make sorting case-insensitive),
`sort-lines', `sort-subr' and the like.  I'd hope that all of these
could profit from the new functions.

In any case, many thanks to you and Eli for the work.

martin






* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-08-29  8:59                     ` martin rudalics
@ 2014-08-29  9:59                       ` Michael Albinus
  2014-08-29 17:21                         ` martin rudalics
  2014-08-29 10:06                       ` Eli Zaretskii
  1 sibling, 1 reply; 28+ messages in thread
From: Michael Albinus @ 2014-08-29  9:59 UTC (permalink / raw)
  To: martin rudalics; +Cc: Paul Eggert, 18051, dmantipov

martin rudalics <rudalics@gmx.at> writes:

> I have not yet started to convert my
> fairly insane sorting functions to the new ones because mine are
> generally based on case-insensitiveness.

I'm just working on this. `string-collate-lessp' will have the signature

(string-collate-lessp S1 S2 &optional LOCALE IGNORE-CASE)

> Also I'm not yet sure how the
> new predicates will relate to functions like `compare-strings' (which
> IIUC is needed until now to make sorting case-insensitive),

Likely, there shall also be `collate-strings'.

> `sort-lines', `sort-subr' and the like.  I'd hope that all of these
> could profit from the new functions.

`sort-subr' has PREDICATE as an argument, so you could pass
`string-collate-lessp'. Maybe with some adaptations in `sort-subr', in
order to also use LOCALE and IGNORE-CASE.

`sort-lines' uses `sort-subr', without PREDICATE. It might also be extended.

> martin

Best regards, Michael.






* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-08-29  8:59                     ` martin rudalics
  2014-08-29  9:59                       ` Michael Albinus
@ 2014-08-29 10:06                       ` Eli Zaretskii
  2014-08-29 18:01                         ` Michael Albinus
  1 sibling, 1 reply; 28+ messages in thread
From: Eli Zaretskii @ 2014-08-29 10:06 UTC (permalink / raw)
  To: martin rudalics; +Cc: michael.albinus, eggert, 18051, dmantipov

> Date: Fri, 29 Aug 2014 10:59:37 +0200
> From: martin rudalics <rudalics@gmx.at>
> Cc: dmantipov@yandex.ru, 18051@debbugs.gnu.org
> 
>  > Good idea, that would also make Glenn happy. (That's not a joke, I mean
>  > it seriously!)
> 
> It would make me happy as well.  I have not yet started to convert my
> fairly insane sorting functions to the new ones because mine are
> generally based on case-insensitiveness.  Also I'm not yet sure how the
> new predicates will relate to functions like `compare-strings' (which
> IIUC is needed until now to make sorting case-insensitive),
> `sort-lines', `sort-subr' and the like.  I'd hope that all of these
> could profit from the new functions.

Case-insensitive versions of the new functions are yet to be written;
stay tuned.

For now, on MS-Windows, you can have that if you use the
NORM_IGNORECASE flag as the second argument of CompareStringW inside
w32_compare_strings.

For Posix, I guess we should run the 2 strings through towupper (or
towupper_l, if it exists), and then compare the results with
wcscoll/wcscoll_l.
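
The Posix approach could be sketched like this.  It is a minimal
illustration in the current locale, using towupper and wcscoll; the
names `upcase_copy' and `collate_ignoring_case' are hypothetical, and
a locale-specific variant would substitute towupper_l and wcscoll_l
where they exist.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Return a freshly allocated upper-cased copy of S, or NULL if memory
   is exhausted.  The caller frees the result.  */
static wchar_t *
upcase_copy (const wchar_t *s)
{
  size_t n = wcslen (s);
  wchar_t *copy = malloc ((n + 1) * sizeof *copy);
  if (copy == NULL)
    return NULL;
  /* i == n handles the terminator: towupper maps L'\0' to itself.  */
  for (size_t i = 0; i <= n; i++)
    copy[i] = (wchar_t) towupper ((wint_t) s[i]);
  return copy;
}

/* Hypothetical helper: collate A and B ignoring case by upper-casing
   both and comparing the results with wcscoll.  Return 0 and store
   the comparison in *RESULT on success, -1 on allocation failure.  */
int
collate_ignoring_case (const wchar_t *a, const wchar_t *b, int *result)
{
  assert (a != NULL && b != NULL && result != NULL);

  wchar_t *ua = upcase_copy (a);
  wchar_t *ub = upcase_copy (b);
  int status = -1;
  if (ua != NULL && ub != NULL)
    {
      *result = wcscoll (ua, ub);
      status = 0;
    }
  free (ua);
  free (ub);
  return status;
}
```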






* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-08-29  9:59                       ` Michael Albinus
@ 2014-08-29 17:21                         ` martin rudalics
  2014-08-29 17:56                           ` Michael Albinus
  0 siblings, 1 reply; 28+ messages in thread
From: martin rudalics @ 2014-08-29 17:21 UTC (permalink / raw)
  To: Michael Albinus; +Cc: Paul Eggert, 18051, dmantipov

 > I'm just working on this. `string-collate-lessp' will have the signature
 >
 > (string-collate-lessp S1 S2 &optional LOCALE IGNORE-CASE)

Fine.  One additional question: Couldn't we also try to fix searching

http://debbugs.gnu.org/cgi/bugreport.cgi?bug=13041

with the new functions?

martin






* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-08-29 17:21                         ` martin rudalics
@ 2014-08-29 17:56                           ` Michael Albinus
  0 siblings, 0 replies; 28+ messages in thread
From: Michael Albinus @ 2014-08-29 17:56 UTC (permalink / raw)
  To: martin rudalics; +Cc: Paul Eggert, 18051, dmantipov

martin rudalics <rudalics@gmx.at> writes:

> Fine.  One additional question: Couldn't we also try to fix searching
>
> http://debbugs.gnu.org/cgi/bugreport.cgi?bug=13041
>
> with the new functions?

Don't know (yet). Pushed on my TODO.

> martin

Best regards, Michael.






* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-08-29 10:06                       ` Eli Zaretskii
@ 2014-08-29 18:01                         ` Michael Albinus
  2014-08-29 19:31                           ` Eli Zaretskii
  0 siblings, 1 reply; 28+ messages in thread
From: Michael Albinus @ 2014-08-29 18:01 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: eggert, dmantipov, 18051

Eli Zaretskii <eliz@gnu.org> writes:

> Case-insensitive versions of the new functions are yet to be written;
> stay tuned.

I've just committed a patch to the trunk which adds optional arguments
LOCALE and IGNORE-CASE to the collation functions.

> For now, on MS-Windows, you can have that if you use the
> NORM_IGNORECASE flag as the second argument of CompareStringW inside
> w32_compare_strings.

As usual, I haven't implemented this. I would leave it to you, Eli.

> For Posix, I guess we should run the 2 strings through towupper (or
> towupper_l, if it exists), and then compare the results with
> wcscoll/wcscoll_l.

Yes.

Best regards, Michael.






* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-08-29 18:01                         ` Michael Albinus
@ 2014-08-29 19:31                           ` Eli Zaretskii
  2014-08-29 21:01                             ` Michael Albinus
  0 siblings, 1 reply; 28+ messages in thread
From: Eli Zaretskii @ 2014-08-29 19:31 UTC (permalink / raw)
  To: Michael Albinus; +Cc: eggert, dmantipov, 18051

> From: Michael Albinus <michael.albinus@gmx.de>
> Cc: martin rudalics <rudalics@gmx.at>,  eggert@cs.ucla.edu,  dmantipov@yandex.ru,  18051@debbugs.gnu.org
> Date: Fri, 29 Aug 2014 20:01:50 +0200
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > Case-insensitive versions of the new functions are yet to be written;
> > stay tuned.
> 
> I've just committed a patch to the trunk which adds optional arguments
> LOCALE and IGNORE-CASE to the collation functions.

Thanks.

> > For now, on MS-Windows, you can have that if you use the
> > NORM_IGNORECASE flag as the second argument of CompareStringW inside
> > w32_compare_strings.
> 
> As usual, I haven't implemented this.  I'll leave it to you, Eli.

As usual, done.

I needed to introduce a w32-specific variable, which needs to be bound
to a non-nil value in order to have UTS#10 (a.k.a. "Unicode Collation
Algorithm", or "UCA") compliant collation order, which ignores
punctuation differences, on MS-Windows.  This is because Windows
doesn't support UTF-8 as a codeset in its locales (and Windows locales
have different names anyway).  This means that if a Lisp program needs
to make sure it gets a UCA-compliant collation order on all platforms,
it will have to pass a "xx_YY.UTF-8" locale on Posix platforms, and on
Windows bind that w32-specific variable to a non-nil value.
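
To illustrate the Posix side only, here is a minimal sketch of collating
under an explicitly named locale with newlocale and wcscoll_l.  The
function name collate_in_locale and the fallback to the current locale
are assumptions made for illustration, not the Emacs implementation.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <wchar.h>

/* Hypothetical helper: compare A and B using the LC_COLLATE rules of
   the locale named LOCALE_NAME (for example "en_US.UTF-8").  If that
   locale cannot be loaded, fall back to the current locale.  */
int
collate_in_locale (const wchar_t *a, const wchar_t *b,
                   const char *locale_name)
{
  assert (a != NULL && b != NULL && locale_name != NULL);
  locale_t loc = newlocale (LC_COLLATE_MASK, locale_name, (locale_t) 0);
  if (loc == (locale_t) 0)
    return wcscoll (a, b);      /* Named locale is not available.  */
  int result = wcscoll_l (a, b, loc);
  freelocale (loc);
  return result;
}
```

Which locale names are installed varies from system to system, so a
caller cannot assume that a given "xx_YY.UTF-8" name will load.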

Btw, I think we will need a lot of verbiage in the ELisp manual to
make sure people understand what to expect from these functions.  In
particular, the results are extremely locale- and platform-specific,
so one cannot expect exactly the same results in all cases, only
something similar.






* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-08-29 19:31                           ` Eli Zaretskii
@ 2014-08-29 21:01                             ` Michael Albinus
  2014-09-01 15:20                               ` Eli Zaretskii
  0 siblings, 1 reply; 28+ messages in thread
From: Michael Albinus @ 2014-08-29 21:01 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: eggert, dmantipov, 18051

Eli Zaretskii <eliz@gnu.org> writes:

> Btw, I think we will need a lot of verbiage in the ELisp manual to
> make sure people understand what to expect from these functions.  In
> particular, the results are extremely locale- and platform-specific,
> so one cannot expect exactly the same results in all cases, only
> something similar.

Oh yes.  I will start on this in the next few days (as usual, I'm short
on time), and will also add test cases to fns-tests.el.

Best regards, Michael.






* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-08-29 21:01                             ` Michael Albinus
@ 2014-09-01 15:20                               ` Eli Zaretskii
  2014-09-01 20:46                                 ` Michael Heerdegen
  0 siblings, 1 reply; 28+ messages in thread
From: Eli Zaretskii @ 2014-09-01 15:20 UTC (permalink / raw)
  To: Michael Albinus, michael_heerdegen; +Cc: 18051

In trunk revision 117797, ls-lisp acquired the ability to sort file
names using the new string-collate-lessp function, thus producing
results that should be similar, if not identical, to what GNU ls does,
at least on GNU/Linux in the same locale.  (On MS-Windows, the
behavior will be similar; it cannot be identical because Windows
doesn't implement UTS#10 (a.k.a. "UCA", the Unicode Collation
Algorithm) to the letter in its locale-dependent collation routines.)

Trunk revision 117798 implements the GNU ls -v switch in ls-lisp.

Michael (Heerdegen), as you were the one who requested these features,
please give them some testing and see if you like them.

Many thanks to Michael Albinus for all the hard work on the
infrastructure that made this possible.






* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-09-01 15:20                               ` Eli Zaretskii
@ 2014-09-01 20:46                                 ` Michael Heerdegen
  2014-10-17 20:26                                   ` Michael Heerdegen
  0 siblings, 1 reply; 28+ messages in thread
From: Michael Heerdegen @ 2014-09-01 20:46 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Michael Albinus, 18051

Eli Zaretskii <eliz@gnu.org> writes:

> Michael (Heerdegen), as you were the one who requested these features,
> please give them some testing and see if you like them.

I'll try and test ASAP.

> Many thanks to Michael Albinus for all the hard work on the
> infrastructure that made this possible.

I have to thank both of you!  Presumably it was not easy.


Michael.






* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-09-01 20:46                                 ` Michael Heerdegen
@ 2014-10-17 20:26                                   ` Michael Heerdegen
  2014-10-18  5:38                                     ` Eli Zaretskii
  0 siblings, 1 reply; 28+ messages in thread
From: Michael Heerdegen @ 2014-10-17 20:26 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 18051, Michael Albinus

Hi Eli and Michael,

> > Michael (Heerdegen), as you were the one who requested these features,
> > please give them some testing and see if you like them.
>
> I'll try and test ASAP.

I used that stuff for a while now, and I think everything worked as
expected.

If I remember correctly, I saw just one tiny inconsistency: with the new
ls-lisp -v switch the sorting position of a backup file named foo~ was
different from ls -v when also numbered backup files foo~n~ of the same
file existed.  Dunno if this is relevant, it's a corner case.

For string collation and locales, I must say that I'm no expert in that
field and don't really know which tests would be useful.  I can only
say that everything seems to be ok with the locales I am using.

Let me know if you think that I could nonetheless be of any help
there.


Thanks again to both of you for your work,

Michael.






* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-10-17 20:26                                   ` Michael Heerdegen
@ 2014-10-18  5:38                                     ` Eli Zaretskii
  2014-10-18 14:27                                       ` Michael Heerdegen
  0 siblings, 1 reply; 28+ messages in thread
From: Eli Zaretskii @ 2014-10-18  5:38 UTC (permalink / raw)
  To: michael_heerdegen; +Cc: 18051-done, michael.albinus

> From: Michael Heerdegen <michael_heerdegen@web.de>
> Cc: 18051@debbugs.gnu.org,  Michael Albinus <michael.albinus@gmx.de>
> Date: Fri, 17 Oct 2014 22:26:32 +0200
> 
> If I remember correctly, I saw just one tiny inconsistency: with the new
> ls-lisp -v switch the sorting position of a backup file named foo~ was
> different from ls -v when also numbered backup files foo~n~ of the same
> file existed.

Is that on Windows or on Unix?

On Windows, this is expected, as only an approximation to the Unicode
Collation Algorithm is available there.

On GNU/Linux, it would be strange, since 'ls' uses the same functions
as Emacs now does in ls-lisp.

> For string collation and locales, I must say that I'm no expert in that
> field and don't really know which tests would be useful.  I can only
> say that everything seems to be ok with the locales I am using.

That's good enough for me, so I'm closing the bug.

Thanks.






* bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
  2014-10-18  5:38                                     ` Eli Zaretskii
@ 2014-10-18 14:27                                       ` Michael Heerdegen
  0 siblings, 0 replies; 28+ messages in thread
From: Michael Heerdegen @ 2014-10-18 14:27 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 18051-done, michael.albinus

Eli Zaretskii <eliz@gnu.org> writes:

> > with the new ls-lisp -v switch the sorting position of a backup file
> > named foo~ was different from ls -v when also numbered backup files
> > foo~n~ of the same file existed.

> On GNU/Linux, it would be strange, since 'ls' uses the same functions
> as Emacs now does in ls-lisp.

GNU/Linux.  But I can't reproduce this anymore; it works as expected,
so I was probably mistaken.


Thanks,

Michael.






Thread overview: 28+ messages
     [not found] <53FAB5F9.9050706@yandex.ru>
2014-08-25  5:48 ` bug#18051: [Emacs-diffs] trunk r117726: Add string collation Paul Eggert
2014-08-25  6:19   ` Dmitry Antipov
2014-08-25  6:41     ` Michael Albinus
2014-08-25 15:03       ` Eli Zaretskii
2014-08-25 16:01         ` Eli Zaretskii
2014-08-27 11:24         ` Michael Albinus
2014-08-27 15:40           ` Eli Zaretskii
2014-08-27 18:12             ` Michael Albinus
2014-08-27 18:26               ` Eli Zaretskii
2014-08-27 19:00           ` Paul Eggert
2014-08-27 19:08             ` Paul Eggert
2014-08-27 19:54               ` Eli Zaretskii
2014-08-27 21:27                 ` Paul Eggert
2014-08-27 21:37                   ` Michael Albinus
2014-08-28  2:39                     ` Eli Zaretskii
2014-08-29  8:59                     ` martin rudalics
2014-08-29  9:59                       ` Michael Albinus
2014-08-29 17:21                         ` martin rudalics
2014-08-29 17:56                           ` Michael Albinus
2014-08-29 10:06                       ` Eli Zaretskii
2014-08-29 18:01                         ` Michael Albinus
2014-08-29 19:31                           ` Eli Zaretskii
2014-08-29 21:01                             ` Michael Albinus
2014-09-01 15:20                               ` Eli Zaretskii
2014-09-01 20:46                                 ` Michael Heerdegen
2014-10-17 20:26                                   ` Michael Heerdegen
2014-10-18  5:38                                     ` Eli Zaretskii
2014-10-18 14:27                                       ` Michael Heerdegen
