bug#31837: 26.1; replace-buffer-contents doesn't work if buffer has multibyte characters

unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed

* bug#31837: 26.1; replace-buffer-contents doesn't work if buffer has multibyte characters
@ 2018-06-14 21:34 Milan Stanojević
  2018-06-15  8:34 ` Eli Zaretskii
  0 siblings, 1 reply; 7+ messages in thread
From: Milan Stanojević @ 2018-06-14 21:34 UTC (permalink / raw)
  To: 31837

Here is a small recipe that illustrates the bug.

recipe.el
---------
(setq use-multibyte (< 0 (length argv)))

(switch-to-buffer "file1")
(when use-multibyte (insert-char (char-from-name "SMILE")))
(insert "1234")

(switch-to-buffer "file2")
(when use-multibyte (insert-char (char-from-name "SMILE")))
(insert "5678")

(replace-buffer-contents "file1")

(princ (buffer-substring-no-properties (point-min) (point-max)))
(princ "\n")
-------------

Running the recipe as script

$ emacs -Q --script /tmp/recipe.el
1234

$ emacs -Q --script /tmp/recipe.el multibyte
⌣5234

In the first run, with just ascii characters, everything works as
 expected. 
In the second run, with multibyte characters, the function didn't
 replace '5' with '1' as expected.

I looked at the code and it looks to me like there is a very obvious bug
 in function buffer_chars_equal in editfns.c. It calls
 BUF_FETCH_CHAR_AS_MULTIBYTE passing *character* positions, but the
 macro expects *byte* positions. (it would be nice if these char vs byte
 positions could be distinguished with types, but I'm not sure it is
 possible in C).

The simple fix is to replace BUF_FETCH_CHAR_AS_MULTIBYTE with
 STRING_CHAR (BUF_CHAR_ADDRESS (buf, pos)) and this seems to work.

I'm not sure about performance of the above fix, though, because
 accessing random character position in a buffer is not constant. If
 diffing algorithm is accessing buffer positions in more or less
 localized manner, maybe it makes sense to move the point inside
 buffer_chars_equal so the char position to byte position conversion is
 fast. It probably doesn't matter for small files.

Emacs info:
In GNU Emacs 26.1 (build 4, x86_64-pc-linux-gnu, X toolkit, Xaw scroll bars)
Windowing system distributor 'The X.Org Foundation', version 11.0.11905000
System Description: Linux 

Configured using:
 'configure --with-x-toolkit=lucid --without-gpm --without-gconf --without-selinux --without-imagemagick --with-modules --with-gif=no'

^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#31837: 26.1; replace-buffer-contents doesn't work if buffer has multibyte characters
  2018-06-14 21:34 bug#31837: 26.1; replace-buffer-contents doesn't work if buffer has multibyte characters Milan Stanojević
@ 2018-06-15  8:34 ` Eli Zaretskii
  2018-06-18 17:44   ` Milan Stanojević
  2018-07-01 17:32   ` Philipp Stephani
  0 siblings, 2 replies; 7+ messages in thread
From: Eli Zaretskii @ 2018-06-15  8:34 UTC (permalink / raw)
  To: Milan Stanojević; +Cc: 31837

> From: mstanojevic@janestreet.com (Milan Stanojević)
> Date: Thu, 14 Jun 2018 17:34:27 -0400
> 
> Here is a small recipe that illustrates the bug.
> 
> recipe.el
> ---------
> (setq use-multibyte (< 0 (length argv)))
> 
> (switch-to-buffer "file1")
> (when use-multibyte (insert-char (char-from-name "SMILE")))
> (insert "1234")
> 
> (switch-to-buffer "file2")
> (when use-multibyte (insert-char (char-from-name "SMILE")))
> (insert "5678")
> 
> (replace-buffer-contents "file1")
> 
> (princ (buffer-substring-no-properties (point-min) (point-max)))
> (princ "\n")
> -------------
> 
> Running the recipe as script
> 
> $ emacs -Q --script /tmp/recipe.el
> 1234
> 
> $ emacs -Q --script /tmp/recipe.el multibyte
> ⌣5234
> 
> In the first run, with just ascii characters, everything works as
>  expected. 
> 
> In the second run, with multibyte characters, the function didn't
>  replace '5' with '1' as expected.

Right, thanks for catching this blunder.

> I looked at the code and it looks to me like there is a very obvious bug
>  in function buffer_chars_equal in editfns.c. It calls
>  BUF_FETCH_CHAR_AS_MULTIBYTE passing *character* positions, but the
>  macro expects *byte* positions. (it would be nice if these char vs byte
>  positions could be distinguished with types, but I'm not sure it is
>  possible in C).
> 
> The simple fix is to replace BUF_FETCH_CHAR_AS_MULTIBYTE with
>  STRING_CHAR (BUF_CHAR_ADDRESS (buf, pos)) and this seems to work.

Thanks, I installed a slightly different fix on the emacs-26 branch,
perhaps you could try that.

> I'm not sure about performance of the above fix, though, because
>  accessing random character position in a buffer is not constant. If
>  diffing algorithm is accessing buffer positions in more or less
>  localized manner, maybe it makes sense to move the point inside
>  buffer_chars_equal so the char position to byte position conversion is
>  fast. It probably doesn't matter for small files.

The conversion of character positions to byte positions should be
quite fast.  It got even faster on the master branch, so I think we
are good.  If there are reports about slow operation of this function,
we could in the future try to optimize it more.





^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#31837: 26.1; replace-buffer-contents doesn't work if buffer has multibyte characters
  2018-06-15  8:34 ` Eli Zaretskii
@ 2018-06-18 17:44   ` Milan Stanojević
  2018-06-18 18:16     ` Eli Zaretskii
  2018-07-01 17:32   ` Philipp Stephani
  1 sibling, 1 reply; 7+ messages in thread
From: Milan Stanojević @ 2018-06-18 17:44 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 31837

I haven't had a chance to read this email or try your patch until today.
I just tried it and it seems to work.

Thanks for the quick fix, Eli!

Not sure that [enable_multibyte_characters] check buys us much in
practice. At least in my case, it is always true. But what matters is
whether there are actual multibyte characters in the buffer, and
buf_charpos_to_bytepos already shortcircuits if there are none.

In any case, it probably doesn't matter one way or the other in
practice, but we'll keep a look on any complaints about performance.
We are starting to migrate our code base to use automatic code
formatting and we use this for applying the formatting changes so we
should have a lot of devs using this soon.

On Fri, Jun 15, 2018 at 4:34 AM, Eli Zaretskii <eliz@gnu.org> wrote:
>> From: mstanojevic@janestreet.com (Milan Stanojević)
>> Date: Thu, 14 Jun 2018 17:34:27 -0400
>>
>> Here is a small recipe that illustrates the bug.
>>
>> recipe.el
>> ---------
>> (setq use-multibyte (< 0 (length argv)))
>>
>> (switch-to-buffer "file1")
>> (when use-multibyte (insert-char (char-from-name "SMILE")))
>> (insert "1234")
>>
>> (switch-to-buffer "file2")
>> (when use-multibyte (insert-char (char-from-name "SMILE")))
>> (insert "5678")
>>
>> (replace-buffer-contents "file1")
>>
>> (princ (buffer-substring-no-properties (point-min) (point-max)))
>> (princ "\n")
>> -------------
>>
>> Running the recipe as script
>>
>> $ emacs -Q --script /tmp/recipe.el
>> 1234
>>
>> $ emacs -Q --script /tmp/recipe.el multibyte
>> ⌣5234
>>
>> In the first run, with just ascii characters, everything works as
>>  expected.
>>
>> In the second run, with multibyte characters, the function didn't
>>  replace '5' with '1' as expected.
>
> Right, thanks for catching this blunder.
>
>> I looked at the code and it looks to me like there is a very obvious bug
>>  in function buffer_chars_equal in editfns.c. It calls
>>  BUF_FETCH_CHAR_AS_MULTIBYTE passing *character* positions, but the
>>  macro expects *byte* positions. (it would be nice if these char vs byte
>>  positions could be distinguished with types, but I'm not sure it is
>>  possible in C).
>>
>> The simple fix is to replace BUF_FETCH_CHAR_AS_MULTIBYTE with
>>  STRING_CHAR (BUF_CHAR_ADDRESS (buf, pos)) and this seems to work.
>
> Thanks, I installed a slightly different fix on the emacs-26 branch,
> perhaps you could try that.
>
>> I'm not sure about performance of the above fix, though, because
>>  accessing random character position in a buffer is not constant. If
>>  diffing algorithm is accessing buffer positions in more or less
>>  localized manner, maybe it makes sense to move the point inside
>>  buffer_chars_equal so the char position to byte position conversion is
>>  fast. It probably doesn't matter for small files.
>
> The conversion of character positions to byte positions should be
> quite fast.  It got even faster on the master branch, so I think we
> are good.  If there are reports about slow operation of this function,
> we could in the future try to optimize it more.





^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#31837: 26.1; replace-buffer-contents doesn't work if buffer has multibyte characters
  2018-06-18 17:44   ` Milan Stanojević
@ 2018-06-18 18:16     ` Eli Zaretskii
  2018-06-18 20:29       ` Milan Stanojević
  0 siblings, 1 reply; 7+ messages in thread
From: Eli Zaretskii @ 2018-06-18 18:16 UTC (permalink / raw)
  To: Milan Stanojević; +Cc: 31837

> From: Milan Stanojević <mstanojevic@janestreet.com>
> Date: Mon, 18 Jun 2018 13:44:08 -0400
> Cc: 31837@debbugs.gnu.org
> 
> I haven't had a chance to read this email or try your patch until today.
> I just tried it and it seems to work.
> 
> Thanks for the quick fix, Eli!

Thanks for catching the blunder in the first place.

> Not sure that [enable_multibyte_characters] check buys us much in
> practice. At least in my case, it is always true.

It is easy enough to get a unibyte buffer: just use
find-file-literally or insert-file-literally.

> But what matters is whether there are actual multibyte characters in
> the buffer

If a buffer is unibyte, there could be no multibyte characters in it,
by definition.

> buf_charpos_to_bytepos already shortcircuits if there are none.

Not before it calls some macros.  And it is confusing to call a
conversion function when no conversion is needed, because the
correctness of that can only be established by looking at the called
conversion function.  That gets in the way of code reading, IME.

> We are starting to migrate our code base to use automatic code
> formatting and we use this for applying the formatting changes so we
> should have a lot of devs using this soon.

Thanks, please be sure to report any problems.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#31837: 26.1; replace-buffer-contents doesn't work if buffer has multibyte characters
  2018-06-18 18:16     ` Eli Zaretskii
@ 2018-06-18 20:29       ` Milan Stanojević
  2018-06-19  2:30         ` Eli Zaretskii
  0 siblings, 1 reply; 7+ messages in thread
From: Milan Stanojević @ 2018-06-18 20:29 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 31837

>> But what matters is whether there are actual multibyte characters in
>> the buffer
>
> If a buffer is unibyte, there could be no multibyte characters in it,
> by definition.
>
>> buf_charpos_to_bytepos already shortcircuits if there are none.
>
> Not before it calls some macros.  And it is confusing to call a
> conversion function when no conversion is needed, because the
> correctness of that can only be established by looking at the called
> conversion function.  That gets in the way of code reading, IME.

Right, I get your point regarding readability but maybe I can clarify
what I'm thinking.

The question is whether we need to optimize buffer_chars_equal  or not.

If not, then we can just call buf_charpos_to_bytepos without any other
checks, that seems most readable.

If we do care about optimizing it, then it is better to check whether
the buffer actually contains multibyte characters (I guess with
BUF_Z(buf) == BUF_Z_BYTE(buf)) than using enable_multibyte_characters
since the the former would catch strictly more cases when we don't
need charpos to bytepos conversion.
Also, it occurs to me, the check shouldn't be done in
buffer_chars_equal, since the same check would be done multiple times
for a single character (for every comparison of that character), it
should be done once and the result stored in ctx before calling
compareseq.
Again, I'm not sure this matters at all in practice, all the checks
are just a few memory accesses (presumable in the cache) and integer
comparisons.

Anyway, I could try doing something in the future about this, but the
code as it is now is perfectly fine.

Thanks again!

^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#31837: 26.1; replace-buffer-contents doesn't work if buffer has multibyte characters
  2018-06-18 20:29       ` Milan Stanojević
@ 2018-06-19  2:30         ` Eli Zaretskii
  0 siblings, 0 replies; 7+ messages in thread
From: Eli Zaretskii @ 2018-06-19  2:30 UTC (permalink / raw)
  To: Milan Stanojević; +Cc: 31837

> From: Milan Stanojević <mstanojevic@janestreet.com>
> Date: Mon, 18 Jun 2018 16:29:41 -0400
> Cc: 31837@debbugs.gnu.org
> 
> If we do care about optimizing it, then it is better to check whether
> the buffer actually contains multibyte characters (I guess with
> BUF_Z(buf) == BUF_Z_BYTE(buf)) than using enable_multibyte_characters
> since the the former would catch strictly more cases when we don't
> need charpos to bytepos conversion.

No, testing BUF_Z vs BUF_Z_BYTE cannot possibly catch more cases,
because multibyte characters cannot happen in a buffer whose
enable_multibyte_characters is reset, and any unibyte buffer's Z value
is trivially equal to its Z_BYTE value.

> Also, it occurs to me, the check shouldn't be done in
> buffer_chars_equal, since the same check would be done multiple times
> for a single character (for every comparison of that character), it
> should be done once and the result stored in ctx before calling
> compareseq.

Something to consider if further optimization is needed.





^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#31837: 26.1; replace-buffer-contents doesn't work if buffer has multibyte characters
  2018-06-15  8:34 ` Eli Zaretskii
  2018-06-18 17:44   ` Milan Stanojević
@ 2018-07-01 17:32   ` Philipp Stephani
  1 sibling, 0 replies; 7+ messages in thread
From: Philipp Stephani @ 2018-07-01 17:32 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 31837, Milan Stanojević

[-- Attachment #1: Type: text/plain, Size: 768 bytes --]

Eli Zaretskii <eliz@gnu.org> schrieb am Fr., 15. Juni 2018 um 10:35 Uhr:

>
> > I looked at the code and it looks to me like there is a very obvious bug
> >  in function buffer_chars_equal in editfns.c. It calls
> >  BUF_FETCH_CHAR_AS_MULTIBYTE passing *character* positions, but the
> >  macro expects *byte* positions. (it would be nice if these char vs byte
> >  positions could be distinguished with types, but I'm not sure it is
> >  possible in C).
> >
> > The simple fix is to replace BUF_FETCH_CHAR_AS_MULTIBYTE with
> >  STRING_CHAR (BUF_CHAR_ADDRESS (buf, pos)) and this seems to work.
>
> Thanks, I installed a slightly different fix on the emacs-26 branch,
> perhaps you could try that.


Thanks a lot for finding and fixing this (rather embarrasing) bug!

[-- Attachment #2: Type: text/html, Size: 1107 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2018-07-01 17:32 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-06-14 21:34 bug#31837: 26.1; replace-buffer-contents doesn't work if buffer has multibyte characters Milan Stanojević
2018-06-15  8:34 ` Eli Zaretskii
2018-06-18 17:44   ` Milan Stanojević
2018-06-18 18:16     ` Eli Zaretskii
2018-06-18 20:29       ` Milan Stanojević
2018-06-19  2:30         ` Eli Zaretskii
2018-07-01 17:32   ` Philipp Stephani

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).