From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: master def6fa4246 2/2: Speed up string-lessp for multibyte strings
Date: Sat, 08 Oct 2022 21:25:29 +0300
Message-ID: <83sfjyjhpi.fsf@gnu.org>
In-Reply-To: <069A384D-4D27-4787-B6BE-84B43FBDF952@acm.org>
To: Mattias Engdegård
Cc: emacs-devel@gnu.org

> From: Mattias Engdegård
> Date: Sat, 8 Oct 2022 18:49:11 +0200
> Cc: emacs-devel
>
> On 7 Oct 2022 at 21:25, Eli Zaretskii wrote:
> >
> >> +  /* Two arbitrary multibyte strings: we cannot use memcmp because
> >> +     the encoding for raw bytes would sort those between U+007F and U+0080
> >> +     which isn't where we want them.
> >> +     Instead, we skip the longest common prefix and look at
> >> +     what follows.  */
> >
> > I don't think I understand this; please elaborate. Didn't you say
> > that we never need to look beyond the first unequal byte? Then why
> > does the order of raw bytes matter here?
>
> The comment explains why memcmp cannot be used to compare arbitrary
> multibyte strings and it's exactly as it says: a bytewise comparison
> would not produce the same order as string-lessp has used in the past
> because of how we encode raw bytes, that's all.

As long as memcmp reports equality, we don't care, and once it reports
inequality, you can examine the first unequal bytes "by hand". Right?
So I still don't understand the comment and how it led you to the
conclusion.

I also asked about memmem -- did you consider using that?

> > Are you sure about the alignment?
>
> Actually I had asked someone about that before and received the
> answer that string data alignment was guaranteed, and a semi-thorough
> reading of the code seemed to confirm this -- normal allocation
> ensures alignment via struct sdata (q.v.) and while AUTO_STRING does
> not, it only makes unibyte strings which do not concern us in the
> code path in question.

AFAIU, AUTO_STRING can also generate stack-allocated multibyte strings.

> > why no tests for this?
>
> `string-lessp` has much better test coverage than what is typical for
> Emacs primitives

For non-ASCII strings?
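
For concreteness, here is a minimal sketch of the "skip the common
prefix, then compare the first differing characters" approach under
discussion. It is not the code from the commit. The names
`sketch_decode_char` and `sketch_multibyte_lessp` are hypothetical, and
the decoder rests on assumptions: raw bytes 0x80..0xFF are stored as
the two-byte sequences C0/C1 xx and stand for characters
0x3FFF80..0x3FFFFF, only sequences of up to four bytes are handled, and
both inputs are well-formed.

```c
#include <stddef.h>

/* Hypothetical decoder for one character of the internal multibyte
   encoding.  Assumption: C0 xx / C1 xx encode raw bytes 0x80..0xFF,
   which map to characters 0x3FFF80..0x3FFFFF.  Longer (5-byte)
   sequences are deliberately left out of this sketch.  */
unsigned int
sketch_decode_char (const unsigned char *p)
{
  unsigned int b0 = p[0];
  if (b0 < 0x80)
    return b0;                                  /* ASCII */
  if (b0 == 0xC0 || b0 == 0xC1)                 /* raw byte */
    return 0x3FFF00 + (0x80 | ((b0 & 1) << 6) | (p[1] & 0x3F));
  if (b0 < 0xE0)                                /* 2-byte sequence */
    return ((b0 & 0x1F) << 6) | (p[1] & 0x3F);
  if (b0 < 0xF0)                                /* 3-byte sequence */
    return ((b0 & 0x0F) << 12) | ((p[1] & 0x3F) << 6) | (p[2] & 0x3F);
  return ((b0 & 0x07) << 18) | ((p[1] & 0x3F) << 12)   /* 4-byte */
         | ((p[2] & 0x3F) << 6) | (p[3] & 0x3F);
}

/* Return 1 if A sorts before B in character order, else 0.
   Assumes both buffers hold well-formed multibyte text.  */
int
sketch_multibyte_lessp (const unsigned char *a, size_t alen,
                        const unsigned char *b, size_t blen)
{
  size_t n = alen < blen ? alen : blen;
  size_t i = 0;

  /* Skip the longest common byte prefix.  */
  while (i < n && a[i] == b[i])
    i++;

  /* One string is a byte prefix of the other: the shorter one wins.  */
  if (i == n)
    return alen < blen;

  /* The first difference may fall on a continuation byte (10xxxxxx);
     back up to the start of that character.  The prefix is identical,
     so the same offset is a character start in both strings.  */
  while (i > 0 && (a[i] & 0xC0) == 0x80)
    i--;

  /* Compare the first differing characters by code point.  */
  return sketch_decode_char (a + i) < sketch_decode_char (b + i);
}
```

Under those assumptions, a raw byte 0x80 (bytes C0 80) sorts after
U+0080 (bytes C2 80) by code point, whereas memcmp on the same two
buffers orders them the other way around. That is the only place where
the two orders disagree; for everything else the first unequal byte
already decides the result.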