From: Mattias Engdegård
Newsgroups: gmane.emacs.devel
Subject: Re: master 49e243c0c85: Avoid resizing mutation in subst-char-in-string, take two
Date: Wed, 15 May 2024 14:29:22 +0200
To: Eli Zaretskii
Cc: emacs-devel@gnu.org

On 14 May 2024 at 13:35, Eli Zaretskii wrote:

> I see no problem to document that in-place replacement is possible only
> if it doesn't resize the original string.  That leaks no
> implementation-specific information.

Actually it does, because even if we document the internal byte requirements of each character in the manual somewhere, they are no less internal representation details, which the user doesn't and shouldn't need to care about.

> And if someone is uncertain, the
> way to test this is very simple.

Yes, that is exactly what I would do if uncertain because it's Lisp and it's easy to test things interactively, and replacing s with S and ü with Ü works, so surely replacing ß with ẞ, or ℂ with 𝔻, would as well. Or?

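To make the size difference concrete, here is a minimal sketch that reports how many bytes each of those characters occupies in a one-character string. `my/char-internal-bytes` is a hypothetical helper name, not an Emacs function, and the file is assumed to be read as UTF-8.

```elisp
;; Illustrative only: `my/char-internal-bytes' is a made-up helper.
(defun my/char-internal-bytes (char)
  "Return the number of bytes CHAR occupies in a one-character string."
  (string-bytes (string char)))

;; For each (FROM . TO) pair, list both characters with their byte sizes.
(mapcar (lambda (pair)
          (list (string (car pair)) (my/char-internal-bytes (car pair))
                (string (cdr pair)) (my/char-internal-bytes (cdr pair))))
        '((?s . ?S) (?ü . ?Ü) (?ß . ?ẞ) (?ℂ . ?𝔻)))
;; => (("s" 1 "S" 1) ("ü" 2 "Ü" 2) ("ß" 2 "ẞ" 3) ("ℂ" 3 "𝔻" 4))
```

The first two pairs keep the same size, while the last two grow by one byte each, which is the kind of distinction a user testing interactively would have no reason to anticipate.
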
And this is the next problem: although founded in the hard logic of internal representation, on the user level the rules would appear arbitrary and treacherous.

We actually already have a primitive that works like this: (fillarray STRING C) is only permitted if the internal byte size of C exactly equals the arithmetic mean of those in STRING. For example,

  (let ((s (copy-sequence "∀x")))
    (fillarray s ?±)
    s)

works because the string is composed of a 3-byte and a 1-byte char so we can fill with 2-byte chars. These semantics are of course completely useless, so in practice this means that either STRING is unibyte and C is in 0..255, or STRING is ASCII-only multibyte and C is ASCII, which is close to the proposed `aset` rules.

Now let's return to `subst-char-in-string` for a moment. It is indeed perfectly reasonable to document its behaviour with INPLACE=t as permitted iff the corresponding `aset`s would be allowed, and that is in fact how it works: try

  (subst-char-in-string ?a ?Ω "a\x80" t)

and you will get an error because `aset` on a unibyte string containing non-ASCII characters is not allowed if TOCHAR > 255. (Not many people know this.)

In fact we could even argue that for consistency, the restrictions of `subst-char-in-string` should not depend on INPLACE, and that would be fine, too -- we could then revert 49e243c0 and leave the function alone. But I think that INPLACE=t is a rarity and the current patch will help avoid user code breakage.

>> Simplicity as in describing and understanding it. The benefit from using the character size in bytes criterion is nil in practice.
>
> It will cease to be nil when and if we actually prohibit resizing
> (which is still the long-term goal, isn't it?).

No, sorry, I meant that it would be extremely rare for code to work with `aset` obeying precise byte-size rules but not the approximated rules.
It would be code that relies on being able to substitute one non-ASCII multibyte char for another of exactly the same size, but never a char of a different size.

>> As a matter of fact `string-replace` is likely to be faster than `subst-char-in-string` for most uses, sometimes radically so,
>
> I'd love to see numbers to go with this.

It's perhaps easier to understand the difference qualitatively: `string-replace` will generally be faster if there are relatively few replacements, since it uses the very fast `string-search` primitive to scan for matches while `subst-char-in-string` walks the string byte-by-byte in pure Lisp which is much slower.

In particular, if there is no replacement at all, then `string-replace` will return the input string without allocating anything at all.

On the other hand, if there are many replacements, then `string-replace` may be slower because it will cons a list of many small pieces which are then concatenated to the final string.

Here are some actual numbers:

10000 char unibyte string, no matches:
  subst-char-in-string  0.0031 s
  string-replace        0.000000080 s

10000 char unibyte string, all match:
  subst-char-in-string  0.0045 s
  string-replace        0.015 s
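
For anyone who wants to try a comparison of this kind, here is a rough sketch using `benchmark-run`. It is not necessarily how the figures above were obtained; `my/compare-replacers`, the repetition count of 100 and the choice of ?a and ?b are assumptions made for illustration, and `string-replace` requires Emacs 28 or later.

```elisp
(require 'benchmark)

;; Hypothetical helper: average seconds per call for each function,
;; replacing every ?a with ?b in STR (non-destructively).
(defun my/compare-replacers (str)
  "Return (SUBST-TIME . REPLACE-TIME) in seconds per call on STR."
  (cons (/ (car (benchmark-run 100 (subst-char-in-string ?a ?b str))) 100)
        (/ (car (benchmark-run 100 (string-replace "a" "b" str))) 100)))

(let ((no-match  (make-string 10000 ?x))   ; unibyte, contains no ?a
      (all-match (make-string 10000 ?a)))  ; unibyte, every char matches
  (message "no matches: %S" (my/compare-replacers no-match))
  (message "all match:  %S" (my/compare-replacers all-match)))
```

Absolute timings will vary with the machine and the Emacs build, so only the relative order of magnitude between the two cases is meaningful.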