From: Mattias Engdegård <mattias.engdegard@gmail.com>
Newsgroups: gmane.emacs.devel
Subject: Re: master 49e243c0c85: Avoid resizing mutation in
 subst-char-in-string, take two
Date: Wed, 15 May 2024 14:29:22 +0200
Message-ID: <BAB18043-7340-406E-8311-4593455ACB96@gmail.com>
References: <865xvhy4wn.fsf@gnu.org>
 <8AF4F364-9030-4634-91C5-79E297E5335B@gmail.com> <861q65x6yp.fsf@gnu.org>
 <718E190B-3C90-4304-87D8-69E82A1C7AC9@gmail.com> <86eda4wrru.fsf@gnu.org>
Cc: emacs-devel@gnu.org
To: Eli Zaretskii <eliz@gnu.org>

On 14 May 2024, at 13:35, Eli Zaretskii <eliz@gnu.org> wrote:

> I see no problem to document that in-place replacement is possible only
> if it doesn't resize the original string.  That leaks no
> implementation-specific information.

Actually it does, because even if we document the internal byte
requirements of each character in the manual somewhere, they are no less
internal representation details which the user doesn't and shouldn't
need to care about.

>  And if someone is uncertain, the
> way to test this is very simple.

Yes, that is exactly what I would do if uncertain because it's Lisp and
it's easy to test things interactively, and replacing s with S and ü
with Ü works, so surely replacing ß with ẞ, or ℂ with 𝔻, would as well.
Or would it?
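
To make that concrete, here is a small sketch that reports the internal
byte sizes of the pairs above. The helper `my-char-internal-bytes` is a
hypothetical name of my own, not something in Emacs; it only assumes
that `string-bytes` and `char-to-string` behave as documented.

```elisp
;; Hypothetical helper, not part of Emacs: the number of bytes CHAR
;; occupies in Emacs's internal multibyte string representation.
(defun my-char-internal-bytes (char)
  (string-bytes (char-to-string char)))

;; Byte sizes of each (FROM . TO) pair mentioned above.
(mapcar (lambda (pair)
          (cons (my-char-internal-bytes (car pair))
                (my-char-internal-bytes (cdr pair))))
        '((?s . ?S) (?ü . ?Ü) (?ß . ?ẞ) (?ℂ . ?𝔻)))
;; => ((1 . 1) (2 . 2) (2 . 3) (3 . 4))
```

The first two pairs keep the byte length and the last two do not, which
is not something a user could guess from the characters themselves.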

And this is the next problem: although founded in the hard logic of
internal representation, on the user level the rules would appear
arbitrary and treacherous.

We actually already have a primitive that works like this: (fillarray
STRING C) is only permitted if the internal byte size of C exactly
equals the arithmetic mean of those in STRING. For example,

(let ((s (copy-sequence "∀x")))
  (fillarray s ?±)
  s)

works because the string consists of a 3-byte and a 1-byte char, so
we can fill with 2-byte chars. These semantics are of course completely
useless, so in practice this means that either STRING is unibyte and C
is in 0..255, or STRING is ASCII-only multibyte and C is ASCII, which is
close to the proposed `aset` rules.
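
As a rough sketch of that rule for multibyte strings, the check amounts
to comparing total byte lengths. The predicate `my-fillarray-fits-p` is
a hypothetical name used only for illustration; it restates the
condition described above rather than reproducing the C implementation.

```elisp
;; Hypothetical predicate, not part of Emacs.  Assumes STRING is
;; multibyte; unibyte strings follow the simpler 0..255 rule instead.
(defun my-fillarray-fits-p (string char)
  "Non-nil if filling multibyte STRING with CHAR keeps its byte length."
  (= (* (length string) (string-bytes (char-to-string char)))
     (string-bytes string)))

(my-fillarray-fits-p "∀x" ?±)   ; 2 chars * 2 bytes = 4 bytes = 3 + 1
(my-fillarray-fits-p "∀x" ?∀)   ; 2 chars * 3 bytes = 6 bytes, not 4
```

The first call returns non-nil and the second returns nil, matching the
"arithmetic mean" condition.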

Now let's return to `subst-char-in-string` for a moment. It is indeed
perfectly reasonable to document its behaviour with INPLACE=t as
permitted iff the corresponding `aset`s would be allowed, and that is in
fact how it works: try

  (subst-char-in-string ?a ?Ω "a\x80" t)

and you will get an error because `aset` on a unibyte string containing
non-ASCII characters is not allowed if TOCHAR > 255. (Not many people
know this.)

In fact we could even argue that for consistency, the restrictions of
`subst-char-in-string` should not depend on INPLACE, and that would be
fine, too -- we could then revert 49e243c0 and leave the function alone.
But I think that INPLACE=t is a rarity and the current patch will help
avoid user code breakage.

>> Simplicity as in describing and understanding it. The benefit from
>> using the character size in bytes criterion is nil in practice.
>
> It will cease to be nil when and if we actually prohibit resizing
> (which is still the long-term goal, isn't it?).

No, sorry, I meant that it would be extremely rare for code to work with
`aset` obeying precise byte-size rules but not the approximated rules.
It would be code that relies on being able to substitute one non-ASCII
multibyte char for another of exactly the same size, but never a char of
a different size.

>> As a matter of fact `string-replace` is likely to be faster than
>> `subst-char-in-string` for most uses, sometimes radically so,
>
> I'd love to see numbers to go with this.

It's perhaps easier to understand the difference qualitatively:
`string-replace` will generally be faster if there are relatively few
replacements, since it uses the very fast `string-search` primitive to
scan for matches, while `subst-char-in-string` walks the string
character by character in pure Lisp, which is much slower.

In particular, if there is no replacement at all, then `string-replace`
will return the input string without allocating anything at all.

On the other hand, if there are many replacements, then `string-replace`
may be slower because it will cons a list of many small pieces which are
then concatenated to the final string.
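
The no-replacement case is easy to check interactively. This is a
minimal sketch, assuming `string-replace` behaves as described above;
the test string is arbitrary.

```elisp
;; With no occurrence of "a" in S, string-replace is expected to hand
;; back S itself (the same object), so nothing new is allocated.
(let ((s (make-string 10000 ?x)))
  (eq (string-replace "a" "b" s) s))
```

If that evaluates to t, the result is the very same string object as
the input, not a copy.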

Here are some actual numbers:

10000-char unibyte string, no matches:
subst-char-in-string  0.0031 s
string-replace        0.000000080 s

10000-char unibyte string, all match:
subst-char-in-string  0.0045 s
string-replace        0.015 s
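
For anyone who wants to repeat this kind of measurement, here is one
possible sketch using `benchmark-run`. It is not necessarily how the
numbers above were obtained: the helper name `my-bench-replace`, the
repetition count, and the test strings are all assumptions, and the
timings will vary by machine and Emacs build.

```elisp
(require 'benchmark)

;; Hypothetical helper, not part of Emacs.  Returns
;; (SUBST-SECONDS . REPLACE-SECONDS), each averaged per call.
(defun my-bench-replace (string from to &optional reps)
  "Time `subst-char-in-string' and `string-replace' on STRING."
  (let* ((reps (or reps 1000))
         (from-str (char-to-string from))
         (to-str (char-to-string to))
         ;; benchmark-run returns (ELAPSED GC-COUNT GC-ELAPSED).
         (subst-time (car (benchmark-run reps
                            (subst-char-in-string from to string))))
         (replace-time (car (benchmark-run reps
                              (string-replace from-str to-str string)))))
    (cons (/ subst-time reps) (/ replace-time reps))))

(my-bench-replace (make-string 10000 ?x) ?a ?b)   ; no matches
(my-bench-replace (make-string 10000 ?a) ?a ?b)   ; all match
```

Byte-compiling the helper before running it should give more
representative numbers than evaluating it interpreted.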