From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: case-insensitive string comparison
Date: Tue, 26 Jul 2022 16:05:50 +0300
Message-ID: <83lesgc9ch.fsf@gnu.org>
References: <87ilnsq4cr.fsf@gnu.org> <83o7xddw10.fsf@gnu.org>
In-Reply-To: (message from Sam Steingold on Mon, 25 Jul 2022 15:39:34 -0400)
To: sds@gnu.org
Cc: emacs-devel@gnu.org

> From: Sam Steingold
> Date: Mon, 25 Jul 2022 15:39:34 -0400
>
> > * Eli Zaretskii [2022-07-25 18:58:19 +0300]:
> >
> >> > (string-collate-equalp "a" "A" current-locale-environment t)
> >> > ==> nil
> >> > current-locale-environment
> >> > ==> "en_US.UTF-8"
> >
> > I cannot reproduce this:
> >
> >   (string-collate-equalp "a" "A" current-locale-environment t)
> >    => t
> >   current-locale-environment
> >    => "en_US.UTF-8"
> >
> > What OS is this, and which Emacs version?
>
> GNU Emacs 29.0.50 (build 5, x86_64-apple-darwin21.5.0, NS appkit-2113.50 Version 12.4 (Build 21F79))
>  of 2022-07-25
> Repository revision: ffe12ff2503917e47c0356195b31430996c148f9
> Repository branch: master
> Windowing system distributor 'Apple', version 10.3.2113
> System Description: macOS 12.4

Could be something macOS-specific. Maybe your system doesn't define
the __STDC_ISO_10646__ feature? In that case, string-collate-equalp
(see the doc string) behaves like string-equal, and that one doesn't
have a case-insensitive variant.

> >> So, how do we do case-insensitive string comparison in Emacs?
> >
> > If you want locale-specific collation, as Stefan said, above.
>
> Do I?
> Is it really true that "UTF-8" without "en_US" does _not_ define case conversion?

string-collate-equalp relies on the implementation in your libc, so
that's something I cannot answer (although I'd expect any reasonable
libc to work as expected here).

In general, locale-specific comparison is a bad idea in Emacs, unless
you are writing a Lisp program that absolutely _must_ meet the
locale's definitions of collation order and equivalence. That's
because some locales have unexpected requirements, and because
different libc's implement this stuff very differently. So using
string-collate-equalp and string-collate-lessp makes your program
unpredictable on any machine but your own.

For that reason, I suggest always using compare-strings instead. That
function uses the Unicode locale-independent case-conversion rules,
and you can predictably control/tailor that, if you need to, by using
a buffer-local case-table.

> but https://docs.python.org/3/library/stdtypes.html#str.casefold says
>
> >>>>> The casefolding algorithm is described in section 3.13 of the Unicode Standard.
>
> this seems to imply that user locale setting is not relevant.

That conclusion is incorrect. The collation database is usually
tailored for each locale, and at least glibc indeed loads the tailored
collation tables for each locale you request.

> >> It is okay to add a `string-equal-ignore-case' based on `compare-strings'?
> >> (even though it does not recognize "SS" and "ß" as equal)
> >
> > What's wrong with calling compare-strings directly?
>
> I want to be able to use `string-equal-ignore-case' as a :test argument
> to things like `cl-find'.

Then write a thin wrapper around compare-strings, and be done.

> And I don't want to have to think about encodings and locales.
> So I want the core Emacs maintainers who know about these things to
> provide me with something that works. Thanks in advance!
> ;-)

There's nothing to think about: see above. In the Emacs context, the
best results come from writing code that doesn't depend on the locale,
and that's what you get with compare-strings. No need to know
anything about encodings or locales.

> The fact that there are ***TWO*** core functions that compare strings -
> `string-collate-equalp' and `compare-strings' - does not look right to me.
> _I_ should not have to decide which function to use.

You can always ask. But the documentation at least hints that the
locale-specific comparison has many hidden aspects:

  This function obeys the conventions for collation order in your locale
  settings. For example, characters with different coding points but
  the same meaning might be considered as equal, like different grave
  accent Unicode characters:

  (string-collate-equalp (string ?\uFF40) (string ?\u1FEF))
   => t

> >> Or should we first implement something like casefold in Python?
> >
> > Ha! we already have that:
> >
> >   (get-char-code-property ?ß 'special-uppercase)
> >    => "SS"
>
> Nice, but how does it help me if
> --8<---------------cut here---------------start------------->8---
> (compare-strings "SS" 0 nil "ß" 0 nil t)
> ==> -1
> (string-collate-equalp "SS" "ß" "en_US.UTF-8" t)
> ==> nil
> --8<---------------cut here---------------end--------------->8---
> instead of `t'?

It depends on what you want to do, and why you care about the ß case
in the first place. AFAIR, you never explained that, nor described
your goal.
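
For illustration only, the thin wrapper around compare-strings
mentioned above could look roughly like the sketch below. The name
`my-string-equal-ignore-case' is a hypothetical placeholder chosen for
this example, and the sketch assumes that plain case-table folding is
good enough (so "SS" and "ß" still compare as different):

```elisp
;; A minimal sketch, not a definitive implementation.
;; `my-string-equal-ignore-case' is a hypothetical name used only here.
(require 'cl-lib)

(defun my-string-equal-ignore-case (s1 s2)
  "Return t if strings S1 and S2 are equal when case is ignored.
The comparison goes through `compare-strings', so it follows the
current case table and does not consult the locale."
  ;; `compare-strings' returns t on equality and an integer otherwise,
  ;; so the result is normalized to a boolean with `eq'.
  (eq t (compare-strings s1 0 nil s2 0 nil t)))

;; Usable as a :test argument, as requested in the quoted text.
(cl-find "FOO" '("bar" "foo" "baz") :test #'my-string-equal-ignore-case)
;; => "foo"
```

If different folding is needed for some buffer, the wrapper can stay
as it is and the tailoring can be done through a buffer-local case
table, since compare-strings consults the current one.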