Re: master 49e243c0c85: Avoid resizing mutation in subst-char-in-string, take two

unofficial mirror of emacs-devel@gnu.org 

* Re: master 49e243c0c85: Avoid resizing mutation in subst-char-in-string, take two
From: Eli Zaretskii @ 2024-05-13 17:53 UTC
  To: Mattias Engdegård; +Cc: emacs-devel

> +  (if (and (not inplace)
> +           (if (multibyte-string-p string)
> +               (> (max fromchar tochar) 127)
> +             (> tochar 255)))

Is the above condition correct?  My reading of it is that if INPLACE
is non-nil, we use aset (which will resize a string) even if TOCHAR
needs more bytes than FROMCHAR.  Which seems to be in contradiction
with the goal of the change, as advertised by the log message: "avoid
resizing mutation".  Did I miss something?

Btw, why, in the case of a multibyte STRING, does the code look at the
codepoints of FROMCHAR and TOCHAR and not at the number of bytes they
take in the internal Emacs representation of the characters?


* Re: master 49e243c0c85: Avoid resizing mutation in subst-char-in-string, take two
From: Mattias Engdegård @ 2024-05-13 19:20 UTC
  To: Eli Zaretskii; +Cc: emacs-devel

On 13 May 2024 at 19:53, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> +  (if (and (not inplace)
>> +           (if (multibyte-string-p string)
>> +               (> (max fromchar tochar) 127)
>> +             (> tochar 255)))
> 
> Is the above condition correct?  My reading of it is that if INPLACE
> is non-nil, we use aset (which will resize a string) even if TOCHAR
> needs more bytes than FROMCHAR.  Which seems to be in contradiction
> with the goal of the change, as advertised by the log message: "avoid
> resizing mutation".

I agree that it does look a bit odd, but it's intentional. First of all, the aim is to insulate non-mutating calls to the function from issues arising from mutation in the implementation. If we don't have to mutate and it's faster and/or safer not to, then we shouldn't.

Second, the function is documented to change the string in-place if INPLACE is non-nil, so in that case we have no choice but to mutate, or we might silently break reasonable code.

> why, in the case of a multibyte STRING, does the code look at the
> codepoints of FROMCHAR and TOCHAR and not at the number of bytes they
> take in the internal Emacs representation of the characters?

It's a conservative approximation that is much simpler than computing the size of the internal representation. (It's also the condition proposed in bug#70784.)
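
Read as a standalone predicate, the quoted condition can be sketched as follows. The helper name `my/subst-char-copies-p` is hypothetical and not part of the commit; it only restates the test so that individual cases are easy to try.

```elisp
;; Hypothetical helper, not part of subr.el: restates the condition
;; quoted from commit 49e243c0c85 as a predicate.
(defun my/subst-char-copies-p (fromchar tochar string inplace)
  "Return non-nil if the quoted condition holds for these arguments.
Per the discussion, a nil result means the `aset' loop is used."
  (and (not inplace)
       (if (multibyte-string-p string)
           ;; Multibyte: any non-ASCII char on either side.
           (> (max fromchar tochar) 127)
         ;; Unibyte: TOCHAR cannot be stored as a single byte.
         (> tochar 255))))

(my/subst-char-copies-p ?a ?b "abc" nil)   ; nil: unibyte, TOCHAR fits
(my/subst-char-copies-p ?a ?Ω "abc" nil)   ; t: TOCHAR > 255
(my/subst-char-copies-p ?a ?Ω "abc" t)     ; nil: INPLACE forces mutation
```

Note that the multibyte branch looks only at the ASCII boundary, which is the "conservative approximation" being discussed: two non-ASCII characters of equal size still satisfy the condition.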


* Re: master 49e243c0c85: Avoid resizing mutation in subst-char-in-string, take two
From: Eli Zaretskii @ 2024-05-14  6:06 UTC
  To: Mattias Engdegård; +Cc: emacs-devel

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Mon, 13 May 2024 21:20:24 +0200
> Cc: emacs-devel@gnu.org
> 
> On 13 May 2024 at 19:53, Eli Zaretskii <eliz@gnu.org> wrote:
> > 
> >> +  (if (and (not inplace)
> >> +           (if (multibyte-string-p string)
> >> +               (> (max fromchar tochar) 127)
> >> +             (> tochar 255)))
> > 
> > Is the above condition correct?  My reading of it is that if INPLACE
> > is non-nil, we use aset (which will resize a string) even if TOCHAR
> > needs more bytes than FROMCHAR.  Which seems to be in contradiction
> > with the goal of the change, as advertised by the log message: "avoid
> > resizing mutation".
> 
> I agree that it does look a bit odd, but it's intentional. First of all, the aim is to insulate non-mutating calls to the function from issues arising from mutation in the implementation. If we don't have to mutate and it's faster and/or safer not to, then we shouldn't.
> 
> Second, the function is documented to change the string in-place if INPLACE is non-nil, so in that case we have no choice but to mutate, or we might silently break reasonable code.

So I guess the log message doesn't describe this intent clearly
enough.

> > why, in the case of a multibyte STRING, does the code look at the
> > codepoints of FROMCHAR and TOCHAR and not at the number of bytes they
> > take in the internal Emacs representation of the characters?
> 
> It's a conservative approximation that is much simpler than computing the size of the internal representation. (It's also the condition proposed in bug#70784.)

Which part of bug#70784 suggested that?  (It's a very long discussion,
and the suggestion at the beginning talks only about the unibyte
case.)

More to the point, the length of the multibyte string
deterministically depends on the character's codepoint, so I don't
really understand why you say it's "much simpler".  We could have a
primitive, say, char-bytes, to do that even faster, if we want this to
be as efficient as possible.  This will allow a large subset of calls
(without INPLACE = t) to be much faster than it is now, without
resizing the string.  IOW, we will be able to "avoid resizing
mutation" in many more cases.
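
The byte-size test described above can be approximated in Lisp today. The sketch below is only an illustration under assumptions: `my/char-bytes` and `my/same-size-chars-p` are hypothetical names, and going through `string-bytes` on a one-character string is a stand-in for the faster primitive suggested here.

```elisp
;; Hypothetical helpers; a C primitive would avoid allocating a string.
(defun my/char-bytes (char)
  "Return the number of bytes CHAR occupies in a multibyte string."
  (string-bytes (string char)))

(defun my/same-size-chars-p (fromchar tochar)
  "Return non-nil if replacing FROMCHAR by TOCHAR keeps the byte length."
  (= (my/char-bytes fromchar) (my/char-bytes tochar)))

(my/char-bytes ?a)             ; 1
(my/char-bytes ?ü)             ; 2
(my/char-bytes ?∀)             ; 3
(my/same-size-chars-p ?α ?β)   ; t, both take 2 bytes
(my/same-size-chars-p ?a ?ü)   ; nil, 1 byte versus 2
```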




* Re: master 49e243c0c85: Avoid resizing mutation in subst-char-in-string, take two
From: Mattias Engdegård @ 2024-05-14 10:44 UTC
  To: Eli Zaretskii; +Cc: emacs-devel

On 14 May 2024 at 08:06, Eli Zaretskii <eliz@gnu.org> wrote:

> So I guess the log message doesn't describe this intent clearly
> enough.

Right, sorry about that. Your questions were entirely justified.

>> It's a conservative approximation that is much simpler than computing the size of the internal representation. (It's also the condition proposed in bug#70784.)
> 
> Which part of bug#70784 suggested that?  (It's a very long discussion,
> and the suggestion at the beginning talks only about the unibyte
> case.)

That would have been the warning patch (you are forgiven if you didn't give it a close reading).

But it's not only a useful approximation but quite a defensible one: we do users no service at all if we let internal representation details leak out to the interface. It's much easier for us to describe, and for users to understand and follow, rules based on ASCII vs non-ASCII, for example, than "this character must use the same internal number of bytes as that one".

That is also why I prefer the approximation in this commit: it is about future-proofing, so that subst-char-in-string will keep working as before for non-INPLACE calls if we do away with resizing mutation.

> More to the point, the length of the multibyte string
> deterministically depends on the character's codepoint, so I don't
> really understand why you say it's "much simpler".

Simplicity as in describing and understanding it. The benefit from using the character size in bytes criterion is nil in practice.

As a matter of fact `string-replace` is likely to be faster than `subst-char-in-string` for most uses, sometimes radically so, but
(1) being newer means that fewer people have heard of the function, or that they are reluctant to use it for compatibility reasons, and
(2) when an innocent programmer finds two functions that fill the need, the more restricted or specialised one is usually selected -- of course it must be faster because why else have both.


* Re: master 49e243c0c85: Avoid resizing mutation in subst-char-in-string, take two
From: Eli Zaretskii @ 2024-05-14 11:35 UTC
  To: Mattias Engdegård; +Cc: emacs-devel

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Tue, 14 May 2024 12:44:34 +0200
> Cc: emacs-devel@gnu.org
> 
> But it's not only a useful approximation but quite a defensible one: we do users no service at all if we let internal representation details leak out to the interface. It's much easier for us to describe, and for users to understand and follow, rules based on ASCII vs non-ASCII, for example, than "this character must use the same internal number of bytes as that one".

I see no problem with documenting that in-place replacement is possible only
if it doesn't resize the original string.  That leaks no
implementation-specific information.  And if someone is uncertain, the
way to test this is very simple.

> Simplicity as in describing and understanding it. The benefit from using the character size in bytes criterion is nil in practice.

It will cease to be nil when and if we actually prohibit resizing
(which is still the long-term goal, isn't it?).

> As a matter of fact `string-replace` is likely to be faster than `subst-char-in-string` for most uses, sometimes radically so,

I'd love to see numbers to go with this.

However, the issue here is not just speed, it's also the support for
in-place replacements.




* Re: master 49e243c0c85: Avoid resizing mutation in subst-char-in-string, take two
From: Mattias Engdegård @ 2024-05-15 12:29 UTC
  To: Eli Zaretskii; +Cc: emacs-devel

On 14 May 2024 at 13:35, Eli Zaretskii <eliz@gnu.org> wrote:

> I see no problem with documenting that in-place replacement is possible only
> if it doesn't resize the original string.  That leaks no
> implementation-specific information.

Actually it does: even if we document the internal byte requirements of each character somewhere in the manual, they remain internal representation details that the user doesn't, and shouldn't need to, care about.

>  And if someone is uncertain, the
> way to test this is very simple.

Yes, that is exactly what I would do if uncertain, because it's Lisp and it's easy to test things interactively. Replacing s with S and ü with Ü works, so surely replacing ß with ẞ, or ℂ with 𝔻, would as well. Or would it?

And this is the next problem: although founded in the hard logic of internal representation, on the user level the rules would appear arbitrary and treacherous.
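
The internal sizes behind that question can be inspected with `string-bytes`. The function name `my/char-byte-sizes` below is hypothetical; the sketch only lists, for each character mentioned above, how many bytes a one-character string occupies.

```elisp
;; Hypothetical helper for interactive exploration.
(defun my/char-byte-sizes (chars)
  "Return the byte length of each character in the list CHARS."
  (mapcar (lambda (c) (string-bytes (string c))) chars))

(my/char-byte-sizes '(?s ?S ?ü ?Ü ?ß ?ẞ ?ℂ ?𝔻))
;; => (1 1 2 2 2 3 3 4)
```

The first two pairs keep their size, while ß to ẞ and ℂ to 𝔻 each grow by one byte, so an exact byte-size rule would accept the former and reject the latter.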

We actually already have a primitive that works like this: (fillarray STRING C) is only permitted if the internal byte size of C exactly equals the arithmetic mean of those in STRING. For example,

(let ((s (copy-sequence "∀x")))
  (fillarray s ?±)
  s)

works because the string consists of a 3-byte and a 1-byte char, so we can fill it with 2-byte chars. These semantics are of course completely useless, so in practice either STRING is unibyte and C is in 0..255, or STRING is ASCII-only multibyte and C is ASCII, which is close to the proposed `aset` rules.
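
For contrast, a fill character whose size does not match the mean is refused. The wrapper `my/try-fillarray` below is a hypothetical name; it assumes the refusal is signalled as an ordinary `error` and returns `:refused` in that case.

```elisp
;; Hypothetical wrapper: fill a copy of STRING, or report refusal.
(defun my/try-fillarray (string char)
  "Fill a fresh copy of STRING with CHAR; return :refused on error."
  (let ((s (copy-sequence string)))
    (condition-case nil
        (fillarray s char)
      (error :refused))))

(my/try-fillarray "∀x" ?±)   ; 2 chars of 2 bytes match the 4 bytes
(my/try-fillarray "∀x" ?a)   ; 2 chars of 1 byte do not
```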

Now let's return to `subst-char-in-string` for a moment. It is indeed perfectly reasonable to document its behaviour with INPLACE=t as permitted iff the corresponding `aset`s would be allowed, and that is in fact how it works: try

  (subst-char-in-string ?a ?Ω "a\x80" t)

and you will get an error because `aset` on a unibyte string containing non-ASCII characters is not allowed if TOCHAR > 255. (Not many people know this.)

In fact we could even argue that for consistency, the restrictions of `subst-char-in-string` should not depend on INPLACE, and that would be fine, too -- we could then revert 49e243c0 and leave the function alone. But I think that INPLACE=t is a rarity and the current patch will help avoid user code breakage.

>> Simplicity as in describing and understanding it. The benefit from using the character size in bytes criterion is nil in practice.
> 
> It will cease to be nil when and if we actually prohibit resizing
> (which is still the long-term goal, isn't it?).

No, sorry, I meant that it would be extremely rare for code to work with `aset` obeying precise byte-size rules but not the approximated rules. It would be code that relies on being able to substitute one non-ASCII multibyte char for another of exactly the same size, but never a char of a different size.

>> As a matter of fact `string-replace` is likely to be faster than `subst-char-in-string` for most uses, sometimes radically so,
> 
> I'd love to see numbers to go with this.

It's perhaps easier to understand the difference qualitatively: `string-replace` will generally be faster if there are relatively few replacements, since it uses the very fast `string-search` primitive to scan for matches while `subst-char-in-string` walks the string byte-by-byte in pure Lisp which is much slower.

In particular, if there is no replacement at all, then `string-replace` will return the input string without allocating anything at all.

On the other hand, if there are many replacements, then `string-replace` may be slower because it will cons a list of many small pieces which are then concatenated to the final string.

Here are some actual numbers:

10000 char unibyte string, no matches:
subst-char-in-string  0.0031      s
string-replace        0.000000080 s

10000 char unibyte string, all match:
subst-char-in-string  0.0045 s
string-replace        0.015  s
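
Timings of this kind can be gathered with `benchmark-run`. The following is only a sketch of one possible setup, not the code that produced the figures above: the function name `my/bench-subst-vs-replace`, the string contents and the repetition count are all assumptions.

```elisp
(require 'benchmark)

;; Hypothetical benchmark harness; absolute numbers will vary by
;; machine and Emacs build.
(defun my/bench-subst-vs-replace (len reps)
  "Time both functions on a unibyte string of LEN chars, REPS times each.
Return an alist mapping a label to the total elapsed seconds."
  (let ((s (make-string len ?x)))
    (list
     (cons 'subst-no-match
           (car (benchmark-run reps (subst-char-in-string ?a ?b s))))
     (cons 'replace-no-match
           (car (benchmark-run reps (string-replace "a" "b" s))))
     (cons 'subst-all-match
           (car (benchmark-run reps (subst-char-in-string ?x ?y s))))
     (cons 'replace-all-match
           (car (benchmark-run reps (string-replace "x" "y" s)))))))

(my/bench-subst-vs-replace 10000 100)
```

Dividing each total by the repetition count gives a per-call figure comparable to the ones listed above.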


* Re: master 49e243c0c85: Avoid resizing mutation in subst-char-in-string, take two
From: Eli Zaretskii @ 2024-05-15 12:40 UTC
  To: Mattias Engdegård; +Cc: emacs-devel

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Wed, 15 May 2024 14:29:22 +0200
> Cc: emacs-devel@gnu.org
> 
> >> Simplicity as in describing and understanding it. The benefit from using the character size in bytes criterion is nil in practice.
> > 
> > It will cease to be nil when and if we actually prohibit resizing
> > (which is still the long-term goal, isn't it?).
> 
> No, sorry, I meant that it would be extremely rare for code to work with `aset` obeying precise byte-size rules but not the approximated rules.

One important use case where this is not rare at all is when replacing
characters from the same Unicode block (= "script").

> 10000 char unibyte string, no matches:
> subst-char-in-string  0.0031      s
> string-replace        0.000000080 s
> 
> 10000 char unibyte string, all match:
> subst-char-in-string  0.0045 s
> string-replace        0.015  s

So it's sometimes faster and sometimes slower.




* Re: master 49e243c0c85: Avoid resizing mutation in subst-char-in-string, take two
From: Mattias Engdegård @ 2024-05-15 17:29 UTC
  To: Eli Zaretskii; +Cc: emacs-devel

On 15 May 2024 at 14:40, Eli Zaretskii <eliz@gnu.org> wrote:

> One important use case where this is not rare at all is when replacing
> characters from the same Unicode block (= "script").

It is very rare to see replacement exclusively confined to a single block, except for block 0 (ASCII). Scripts, even Latin, generally transcend blocks and even planes. Text written in one script also tends to include characters from blocks not related to that script, such as symbols, spaces, combining marks, numerals etc.

The usefulness of equal-length multibyte `aset` is very small, and given how rare string mutation is in general, this makes it just not worth taking into account. Clear, simple and predictable rules are far more important.

One reason why single-character multibyte replacement (`aset`, `subst-char-in-string`, `store-substring`, most of the cl-lib functions etc) is so rare is that in the world of Unicode, a 'character' can be a sequence of scalar values (combining chars, modifiers etc) so a one-for-one value replacement is just too inflexible and limiting.


* Re: master 49e243c0c85: Avoid resizing mutation in subst-char-in-string, take two
From: Eli Zaretskii @ 2024-05-15 18:15 UTC
  To: Mattias Engdegård; +Cc: emacs-devel

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Wed, 15 May 2024 19:29:04 +0200
> Cc: emacs-devel <emacs-devel@gnu.org>
> 
> On 15 May 2024 at 14:40, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> > One important use case where this is not rare at all is when replacing
> > characters from the same Unicode block (= "script").
> 
> It is very rare to see replacement exclusively confined to a single block, except for block 0 (ASCII). Scripts, even Latin, generally transcend blocks and even planes.

"Rare" is in the eye of the beholder.  Imagine processing of Arabic or
Greek or Cyrillic text -- these replacements are natural there.
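
As an illustration of the kind of same-script replacement meant here, the sketch below substitutes one Greek letter for another in place and checks that the byte length is unchanged. The function name `my/demo-greek-inplace` and the sample word are illustrative assumptions.

```elisp
;; Hypothetical demo: final sigma to medial sigma, both 2 bytes.
(defun my/demo-greek-inplace ()
  "Replace ς with σ in place; return the string and whether its size held."
  (let* ((s (copy-sequence "λογος"))
         (before (string-bytes s)))
    (subst-char-in-string ?ς ?σ s t)
    (list s (= before (string-bytes s)))))

(my/demo-greek-inplace)
;; => ("λογοσ" t)
```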

> Text written in one script also tends to include characters from blocks not related to that script, such as symbols, spaces, combining marks, numerals etc.

That's true, but replacing punctuation and symbols by letters is
_really_ rare, at least IME.

> The usefulness of equal-length multibyte `aset` is very small,

I guess we'll have to agree to disagree about this.

> One reason why single-character multibyte replacement (`aset`, `subst-char-in-string`, `store-substring`, most of the cl-lib functions etc) is so rare is that in the world of Unicode, a 'character' can be a sequence of scalar values (combining chars, modifiers etc) so a one-for-one value replacement is just too inflexible and limiting.

But Emacs doesn't (yet) support such "characters", except in the
display engine.  If you search for such sequences with Emacs commands,
you will generally not find the corresponding precomposed characters
and other equivalents (we have "character-folding" in search commands,
but that's a trick, really, not a fundamental support of character
equivalence).

So I think these last aspects are not really relevant to the issue at
hand.


* Re: master 49e243c0c85: Avoid resizing mutation in subst-char-in-string, take two
From: Mattias Engdegård @ 2024-05-15 20:19 UTC
  To: Eli Zaretskii; +Cc: emacs-devel

On 15 May 2024 at 20:15, Eli Zaretskii <eliz@gnu.org> wrote:

>> It is very rare to see replacement exclusively confined to a single block, except for block 0 (ASCII). Scripts, even Latin, generally transcend blocks and even planes.
> 
> "Rare" is in the eye of the beholder.  Imagine processing of Arabic or
> Greek or Cyrillic text -- these replacements are natural there.

That can very well be the case, but I haven't found any examples. Maybe you are right and it would not be an unnatural thing to do, but then string mutation itself is just uncommon and is definitely becoming more so.

Anyway, the little mutation I did find wasn't confined to any particular non-Latin scripts or Unicode blocks.

>> The usefulness of equal-length multibyte `aset` is very small,
> 
> I guess we'll have to agree to disagree about this.

Certainly.

> So I think these last aspects are not really relevant to the issue at
> hand.

That is true; it doesn't inform our decisions in this matter, other than perhaps telling us that we shouldn't be surprised if we don't find much code that tries to replace one grapheme cluster (or emoji) with another using `aset`, but that's admittedly not a very big help.

