From: Eli Zaretskii <eliz@gnu.org>
To: sds@gnu.org
Cc: emacs-devel@gnu.org
Subject: Re: case-insensitive string comparison
Date: Tue, 26 Jul 2022 16:05:50 +0300
Message-ID: <83lesgc9ch.fsf@gnu.org>
In-Reply-To: <lztu750yo9.fsf@3c22fb11fdab.ant.amazon.com> (message from Sam Steingold on Mon, 25 Jul 2022 15:39:34 -0400)

> From: Sam Steingold <sds@gnu.org>
> Date: Mon, 25 Jul 2022 15:39:34 -0400
> 
> > * Eli Zaretskii <ryvm@tah.bet> [2022-07-25 18:58:19 +0300]:
> >
> >> > (string-collate-equalp "a" "A" current-locale-environment t)
> >> > ==> nil
> >> > current-locale-environment
> >> > ==> "en_US.UTF-8"
> >
> > I cannot reproduce this:
> >
> >   (string-collate-equalp "a" "A" current-locale-environment t)
> >     => t
> >   current-locale-environment
> >     => "en_US.UTF-8"
> >
> > What OS is this, and which Emacs version?
> 
> GNU Emacs 29.0.50 (build 5, x86_64-apple-darwin21.5.0, NS appkit-2113.50 Version 12.4 (Build 21F79))
>  of 2022-07-25
> Repository revision: ffe12ff2503917e47c0356195b31430996c148f9
> Repository branch: master
> Windowing system distributor 'Apple', version 10.3.2113
> System Description:  macOS 12.4

Could be something macOS-specific.  Maybe your system doesn't define
the __STDC_ISO_10646__ feature?  In that case, string-collate-equalp
(see the doc string) behaves like string-equal, and that one doesn't
have a case-insensitive variant.

> >> So, how do we do case-insensitive string comparison in Emacs?
> >
> > If you want locale-specific collation, as Stefan said, above.
> 
> Do I?
> Is it really true that "UTF-8" without "en_US" does _not_ define case conversion?

string-collate-equalp relies on the implementation in your libc, so
that's something I cannot answer (although I'd expect any reasonable
libc to work as expected here).

In general, locale-specific comparison is a bad idea in Emacs, unless
you are writing a Lisp program that absolutely _must_ meet the
locale's definitions of collation order and equivalence.  That's
because some locales have unexpected requirements, and because
different libc implementations handle this very differently.  So using
string-collate-equalp and string-collate-lessp makes your program
unpredictable on any machine but your own.

For that reason, I suggest always using compare-strings instead.  That
function uses the Unicode locale-independent case-conversion rules,
and, if needed, you can tailor them predictably with a buffer-local
case table.
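
As a minimal illustration of the calling convention (the sample
strings are arbitrary): compare-strings takes a start and end
position for each string, plus an optional IGNORE-CASE argument, and
returns t only when the compared parts are equal.

```elisp
;; With IGNORE-CASE (the 7th argument) non-nil, the strings compare equal.
(compare-strings "Emacs" 0 nil "EMACS" 0 nil t)   ; returns t

;; Without IGNORE-CASE they differ, so the result is a number, not t.
(compare-strings "Emacs" 0 nil "EMACS" 0 nil)
```

Since the "not equal" result is a number rather than nil, callers
that need a boolean should test the result with (eq t ...).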

> but https://docs.python.org/3/library/stdtypes.html#str.casefold says
> 
> >>>>> The casefolding algorithm is described in section 3.13 of the Unicode Standard.
> 
> this seems to imply that user locale setting is not relevant.

That conclusion is incorrect.  The collation database is usually
tailored for each locale, and at least glibc indeed loads the tailored
collation tables for each locale you request.

> >> Is it okay to add a `string-equal-ignore-case' based on `compare-strings'?
> >> (even though it does not recognize "SS" and "ß" as equal)
> >
> > What's wrong with calling compare-strings directly?
> 
> I want to be able to use `string-equal-ignore-case' as a :test argument
> to things like `cl-find'.

Then write a thin wrapper around compare-strings, and be done.
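
Such a wrapper could look roughly like the sketch below.  The name
my-string-equal-ignore-case is made up for illustration, and the
wrapper inherits the limitations of compare-strings (for example, it
does not treat "SS" and "ß" as equal).

```elisp
(require 'cl-lib)

(defun my-string-equal-ignore-case (s1 s2)
  "Return t if strings S1 and S2 are equal, ignoring case.
Hypothetical helper: a thin wrapper around `compare-strings',
which returns t only when the compared strings are equal."
  (eq t (compare-strings s1 0 nil s2 0 nil t)))

;; The wrapper is a two-argument predicate, so it can serve as a :test.
(cl-find "FOO" '("bar" "foo" "baz")
         :test #'my-string-equal-ignore-case)   ; returns "foo"
```

The (eq t ...) is needed because compare-strings reports a mismatch
as a number, which would otherwise count as true.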

> And I don't want to have to think about encodings and locales.
> So I want the core Emacs maintainers who know about these things to
> provide me with something that works. Thanks in advance! ;-)

There's nothing to think about: see above.  In the Emacs context, you
get the best results by writing code that doesn't depend on the
locale, and that's what compare-strings gives you.  No need to know
anything about encodings or locales.

> The fact that there are ***TWO*** core functions that compare strings -
> `string-collate-equalp' and `compare-strings' - does not look right to me.
> _I_ should not have to decide which function to use.

You can always ask.  But the documentation at least hints that the
locale-specific comparison has many hidden aspects:

  This function obeys the conventions for collation order in your locale
  settings.  For example, characters with different coding points but
  the same meaning might be considered as equal, like different grave
  accent Unicode characters:

  (string-collate-equalp (string ?\uFF40) (string ?\u1FEF))
    => t

> >> Or should we first implement something like casefold in Python?
> >
> > Ha! we already have that:
> >
> >   (get-char-code-property ?ß 'special-uppercase)
> >     => "SS"
> 
> Nice, but how does it help me if
> --8<---------------cut here---------------start------------->8---
> (compare-strings "SS" 0 nil "ß" 0 nil t)
> ==> -1
> (string-collate-equalp "SS" "ß" "en_US.UTF-8" t)
> ==> nil
> --8<---------------cut here---------------end--------------->8---
> instead of `t'?

It depends on what you want to do, and why you care about the ß case
in the first place.  AFAIR, you never explained that, nor described
your goal.


