unofficial mirror of emacs-devel@gnu.org 
From: "Mattias Engdegård" <mattias.engdegard@gmail.com>
To: Eli Zaretskii <eliz@gnu.org>
Cc: emacs-devel@gnu.org
Subject: Re: master 49e243c0c85: Avoid resizing mutation in subst-char-in-string, take two
Date: Wed, 15 May 2024 14:29:22 +0200
Message-ID: <BAB18043-7340-406E-8311-4593455ACB96@gmail.com>
In-Reply-To: <86eda4wrru.fsf@gnu.org>

On 14 May 2024 at 13:35, Eli Zaretskii <eliz@gnu.org> wrote:

> I see no problem to document that in-place replacement is possible only
> if it doesn't resize the original string.  That leaks no
> implementation-specific information.

Actually it does: even if we document the internal byte size of each character somewhere in the manual, those sizes remain internal representation details that the user doesn't, and shouldn't need to, care about.

>  And if someone is uncertain, the
> way to test this is very simple.

Yes, that is exactly what I would do if uncertain, because it's Lisp and it's easy to test things interactively. Replacing s with S and ü with Ü works, so surely replacing ß with ẞ, or ℂ with 𝔻, would work as well. Wouldn't it?
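
To make the trap concrete, here is a small sketch that reports how many bytes each of those characters occupies internally. It uses only `string` and `string-bytes`; `demo-char-bytes` is a hypothetical helper name, not an existing function.

```elisp
(defun demo-char-bytes (char)
  "Return the number of bytes CHAR occupies in a string holding only CHAR.
Hypothetical helper for illustration; not part of Emacs."
  (string-bytes (string char)))

;; Pair each character with its internal byte size.
(mapcar (lambda (c) (cons (string c) (demo-char-bytes c)))
        '(?s ?S ?ü ?Ü ?ß ?ẞ ?ℂ ?𝔻))
;; ⇒ (("s" . 1) ("S" . 1) ("ü" . 2) ("Ü" . 2)
;;    ("ß" . 2) ("ẞ" . 3) ("ℂ" . 3) ("𝔻" . 4))
```

The first two pairs keep their size, while the last two grow by one byte each, which is exactly what an interactive test with s/S or ü/Ü would fail to reveal.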

And this is the next problem: although founded in the hard logic of internal representation, on the user level the rules would appear arbitrary and treacherous.

We actually already have a primitive that works like this: (fillarray STRING C) is only permitted if the internal byte size of C exactly equals the arithmetic mean of the byte sizes of the characters in STRING. For example,

(let ((s (copy-sequence "∀x")))
  (fillarray s ?±)
  s)

works because the string consists of a 3-byte and a 1-byte char, so we can fill it with 2-byte chars. These semantics are of course completely useless, so in practice either STRING is unibyte and C is in 0..255, or STRING is ASCII-only multibyte and C is ASCII, which is close to the proposed `aset` rules.
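
The rule can be written down as a predicate. This is only a sketch of the condition as described above, not the actual C implementation, and `demo-fillarray-fits-p` is a hypothetical name:

```elisp
(defun demo-fillarray-fits-p (string char)
  "Return non-nil if filling STRING with CHAR keeps its byte length unchanged.
Hypothetical helper illustrating the rule described above; not part of Emacs."
  (if (multibyte-string-p string)
      ;; Every element becomes CHAR, so the new byte length would be
      ;; (number of characters) * (bytes per CHAR).
      (= (* (length string) (string-bytes (string char)))
         (string-bytes string))
    ;; A unibyte string holds exactly one byte per element.
    (<= 0 char 255)))

(demo-fillarray-fits-p "∀x" ?±)   ; non-nil: 2 chars * 2 bytes = 3 + 1
(demo-fillarray-fits-p "∀x" ?a)   ; nil: 2 chars * 1 byte is not 4
```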

Now let's return to `subst-char-in-string` for a moment. It is indeed perfectly reasonable to document its behaviour with INPLACE=t as permitted iff the corresponding `aset`s would be allowed, and that is in fact how it works: try

  (subst-char-in-string ?a ?Ω "a\x80" t)

and you will get an error because `aset` on a unibyte string containing non-ASCII characters is not allowed if TOCHAR > 255. (Not many people know this.)
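
A quick way to observe this without landing in the debugger is to wrap the call in `condition-case`. The helper name `demo-try-inplace-subst` is hypothetical, and it works on a copy so that the string literal itself is never mutated:

```elisp
(defun demo-try-inplace-subst (fromchar tochar string)
  "Call `subst-char-in-string' in place on a copy of STRING.
Return the result, or (error . DATA) if the call signals an error.
Hypothetical helper for illustration only."
  (condition-case err
      (subst-char-in-string fromchar tochar (copy-sequence string) t)
    (error (cons 'error err))))

(demo-try-inplace-subst ?a ?b "a\x80")   ; fine: TOCHAR fits in a byte
(demo-try-inplace-subst ?a ?Ω "a\x80")   ; an error, as described above
```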

In fact we could even argue that for consistency, the restrictions of `subst-char-in-string` should not depend on INPLACE, and that would be fine, too -- we could then revert 49e243c0 and leave the function alone. But I think that INPLACE=t is a rarity and the current patch will help avoid user code breakage.

>> Simplicity as in describing and understanding it. The benefit from using the character size in bytes criterion is nil in practice.
> 
> It will cease to be nil when and if we actually prohibit resizing
> (which is still the long-term goal, isn't it?).

No, sorry, I meant that it would be extremely rare for code to work with `aset` obeying precise byte-size rules but not the approximated rules. It would be code that relies on being able to substitute one non-ASCII multibyte char for another of exactly the same size, but never a char of a different size.

>> As a matter of fact `string-replace` is likely to be faster than `subst-char-in-string` for most uses, sometimes radically so,
> 
> I'd love to see numbers to go with this.

It's perhaps easier to understand the difference qualitatively. `string-replace` will generally be faster if there are relatively few replacements, since it uses the very fast `string-search` primitive to scan for matches, while `subst-char-in-string` walks the string character by character in Lisp, which is much slower.

In particular, if there is no replacement at all, then `string-replace` will return the input string without allocating anything at all.
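
This is easy to check with `eq`, which compares object identity rather than contents:

```elisp
(let ((s "no match here"))
  ;; With nothing to replace, the very same string object comes back.
  (eq (string-replace "z" "Z" s) s))
```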

On the other hand, if there are many replacements, then `string-replace` may be slower because it will cons a list of many small pieces which are then concatenated into the final string.

Here are some actual numbers:

10000 char unibyte string, no matches:
subst-char-in-string  0.0031      s
string-replace        0.000000080 s

10000 char unibyte string, all match:
subst-char-in-string  0.0045 s
string-replace        0.015  s
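
For anyone who wants to repeat the comparison, here is a rough sketch using `benchmark-run`. It is not necessarily how the figures above were obtained; `demo-time-replacements` is a hypothetical helper, and the repetition count is an arbitrary choice.

```elisp
(require 'benchmark)

(defun demo-time-replacements (string from to &optional reps)
  "Time replacing char FROM with char TO in STRING, REPS times (default 100).
Return an alist of (FUNCTION . SECONDS-PER-CALL).
Hypothetical helper for illustration only."
  (let ((reps (or reps 100))
        (from-str (string from))
        (to-str (string to)))
    (list (cons 'subst-char-in-string
                ;; `benchmark-run' returns (ELAPSED GC-COUNT GC-TIME).
                (/ (car (benchmark-run reps
                          (subst-char-in-string from to string)))
                   reps))
          (cons 'string-replace
                (/ (car (benchmark-run reps
                          (string-replace from-str to-str string)))
                   reps)))))

(demo-time-replacements (make-string 10000 ?a) ?b ?c)   ; no matches
(demo-time-replacements (make-string 10000 ?a) ?a ?c)   ; all match
```

Absolute timings will vary with the machine and with whether the code is byte-compiled, so only the relative ordering is meaningful.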




