From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: Eli Zaretskii Newsgroups: gmane.emacs.bugs Subject: bug#58168: string-lessp glitches and inconsistencies Date: Tue, 04 Oct 2022 08:55:34 +0300 Message-ID: <83wn9gw2sp.fsf@gnu.org> References: <7824372D-8002-4639-8AEE-E80A6D5FEFC6@gmail.com> <83czbef6le.fsf@gnu.org> <6CB805F6-89EE-4D7C-A398-F29698733A42@gmail.com> <83h70oce4k.fsf@gnu.org> <83tu4mais1.fsf@gnu.org> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="17925"; mail-complaints-to="usenet@ciao.gmane.io" Cc: 58168@debbugs.gnu.org To: Mattias =?UTF-8?Q?Engdeg=C3=A5rd?= Original-X-From: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Tue Oct 04 07:56:55 2022 Return-path: Envelope-to: geb-bug-gnu-emacs@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1ofavF-0004VR-FX for geb-bug-gnu-emacs@m.gmane-mx.org; Tue, 04 Oct 2022 07:56:53 +0200 Original-Received: from localhost ([::1]:44772 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1ofavE-0004F2-7i for geb-bug-gnu-emacs@m.gmane-mx.org; Tue, 04 Oct 2022 01:56:52 -0400 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]:32850) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1ofauR-0004E7-TU for bug-gnu-emacs@gnu.org; Tue, 04 Oct 2022 01:56:05 -0400 Original-Received: from debbugs.gnu.org ([209.51.188.43]:53079) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1ofauQ-0008R7-AF for bug-gnu-emacs@gnu.org; Tue, 04 Oct 2022 01:56:03 -0400 Original-Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1ofauP-0002LF-WF for bug-gnu-emacs@gnu.org; Tue, 04 Oct 2022 01:56:02 -0400 X-Loop: help-debbugs@gnu.org Resent-From: Eli Zaretskii Original-Sender: "Debbugs-submit" Resent-CC: bug-gnu-emacs@gnu.org Resent-Date: Tue, 04 Oct 2022 05:56:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 58168 X-GNU-PR-Package: emacs Original-Received: via spool by 58168-submit@debbugs.gnu.org id=B58168.16648629518984 (code B ref 58168); Tue, 04 Oct 2022 05:56:01 +0000 Original-Received: (at 58168) by debbugs.gnu.org; 4 Oct 2022 05:55:51 +0000 Original-Received: from localhost ([127.0.0.1]:52157 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ofauE-0002Kq-Rt for submit@debbugs.gnu.org; Tue, 04 Oct 2022 01:55:51 -0400 Original-Received: from eggs.gnu.org ([209.51.188.92]:57470) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ofauD-0002Kd-J0 for 58168@debbugs.gnu.org; Tue, 04 Oct 2022 01:55:49 -0400 Original-Received: from fencepost.gnu.org ([2001:470:142:3::e]:40912) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1ofau8-0008Q3-9o; Tue, 04 Oct 2022 01:55:44 -0400 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=gnu.org; s=fencepost-gnu-org; h=MIME-version:References:Subject:In-Reply-To:To:From: Date; bh=wt2XaKd2cndHBkBYgt+H0IUgcvHFXZPduq9+TTqedg4=; b=dNbSzZAT63n35jHp6+As KLXj2Oa52RdjkyCqQk+Sf37teMaCNl/rGBpYcYss/b726B7XF7V28u2JpmG3/UDYXVa5pTsSsDDbV D1hBG8Csd7zBj5hn6QYkFYGTrPvaXkTiRaNCaHa3uja+jFc/ePwtZTsSsYU3kBA+R4IJgixt+bBOR tKeh+fO0ryOQXiubsgUT0gdxD/CLq4RwPvq3ZDf0dL6WPlfbS0aFd7qc6jLAGAxe99YO2SBLjdSVX 2cLN1/j2L0kWU5ucdYlKwJr2FzToulph6g4StHhlWTfDhAg/jfhKHu9O8cuu+Wj/m8noFCND6q43r vCPd7ujoIL3w3Q==; Original-Received: from [87.69.77.57] (port=3321 helo=home-c4e4a596f7) by fencepost.gnu.org with esmtpsa (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1ofau6-0008Ou-S0; Tue, 04 Oct 2022 01:55:43 -0400 In-Reply-To: (message from Mattias =?UTF-8?Q?Engdeg=C3=A5rd?= on Mon, 3 Oct 2022 21:48:14 +0200) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list X-BeenThere: bug-gnu-emacs@gnu.org List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Original-Sender: "bug-gnu-emacs" Xref: news.gmane.io gmane.emacs.bugs:244359 Archived-At: > From: Mattias EngdegÄrd > Date: Mon, 3 Oct 2022 21:48:14 +0200 > Cc: 58168@debbugs.gnu.org > > 2 okt. 2022 kl. 07.36 skrev Eli Zaretskii : > > >> Comparison between objects is not only useful when someone cares about their order, as in presenting a sorted list to the user. Often what is important is an ability to impose an order, preferably total, for use in building and searching data structures. I came across this bug when implementing a string set. > > > > Always converting to multibyte handles this case, doesn't it? > > I don't think it does -- string= treats raw bytes in unibyte and multibyte strings as distinct; converting to multibyte does not preserve (in)equality. If the fact that string= says strings are not equal, but string-lessp says they are equal, is what bothers you, we could document that results of comparing unibyte and multibyte strings are unspecified, or document explicitly that string= and string-lessp behave differently in this case. IOW, I see no reason to worry about 100% consistency here: the order is _really_ undefined in these cases, and trying to make it defined will not produce any tangible gains, IMNSHO. It could very well produce bugs and regressions, OTOH. So it sounds like a net loss to me, in practical terms. > >> Actually I was talking about multibyte-multibyte comparisons. > > > > Then why did you mention raw bytes? their multibyte representation > > presents no performance problems > > In a way they do -- the way raw bytes are represented (they start with C0 or C1) causes memcmp to sort them between U+007F and U+0080. If we accept that then comparisons are fast since memcmp will compare many character per data-dependent branch. The current code requires several data-dependent branches for each character. Once again, slowing down string-lessp when raw-bytes are involved shouldn't be a problem. So, if memchr finds a C0 or C1 in a string, fall back to a slower comparison. memchr is fast enough to not slow down the "usual" case. Would that be a good solution? Alternatively, we could introduce a new primitive which could assume multibyte or plain-ASCII unibyte strings without checking, and then code which is sure raw-bytes cannot happen, and needs to compare long strings, could use that for speed. > While we could probably bring down the comparison cost slightly by clever hand-coding, it's unlikely to be even nearly as fast as a memcmp and much messier. Since users are unlikely to care much about the ordering between raw bytes and something else (as long as there is an order), it would be a cheap way to improve performance while at the same time fixing the string< / string= mismatch. The assumption that "users are unlikely to care" is a pure conjecture, and we have no way of validating it. So I don't want us to act on such an assumption. What about one of the alternatives above instead? > > You can compare under the assumption that a unibyte string is > > pure-ASCII until you bump into the first non-ASCII one. If that > > happens, abandon the comparison, convert the unibyte string to its > > multibyte representation, and compare again. > > I don't quite see how that would improve performance but may be missing something. Then maybe I didn't understand the performance problems that you had in mind. Suppose you describe them in more detail, preferably with examples? E.g., are you saying that unibyte strings that are pure-ASCII also cause performance problems?