unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#37580: 26.3; setting buffer as unibyte temporarily may change buffer contents
@ 2019-10-02  9:43 ynyaaa
  2019-10-02 15:14 ` Eli Zaretskii
  0 siblings, 1 reply; 5+ messages in thread
From: ynyaaa @ 2019-10-02  9:43 UTC (permalink / raw)
  To: 37580


If a multibyte buffer contains eight-bit character sequences,
evaluating the form
 (progn (set-buffer-multibyte nil) (set-buffer-multibyte t))
may convert them to multibyte characters.

Afterwards, the positions recorded in buffer-undo-list may no longer
match the buffer contents.
In the form below, undo puts the character '1' back at the wrong position.

(with-temp-buffer
  (insert "一123")
  (encode-coding-region 1 2 'utf-8)
  (buffer-enable-undo)
  (undo-boundary)
  (progn (goto-char (point-min))
         (search-forward "1")
         (delete-char -1))
  (undo-boundary)
  (progn (set-buffer-multibyte nil)
         (set-buffer-multibyte t))
  (undo)
  (buffer-string))
=>"一231"
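
A minimal sketch of why a recorded undo position can go stale, assuming
the round trip behaves as described above: encode-coding-region leaves
the UTF-8 bytes of "一" in the buffer as eight-bit characters, and the
unibyte/multibyte round trip is expected to merge them back into one
character, so every later position shifts.  The helper name
my-sizes-around-roundtrip is hypothetical.

```elisp
;; Hypothetical helper, not part of Emacs: returns the buffer size
;; (in characters) before and after the unibyte/multibyte round trip.
(defun my-sizes-around-roundtrip ()
  (with-temp-buffer
    (insert "一123")
    ;; Replace "一" with its UTF-8 bytes, kept as eight-bit characters.
    (encode-coding-region 1 2 'utf-8)
    (let ((before (buffer-size)))
      (set-buffer-multibyte nil)
      (set-buffer-multibyte t)
      (list before (buffer-size)))))
```

If the two numbers differ, a position stored in buffer-undo-list before
the round trip (such as the position of the deleted '1') refers to a
different place afterwards.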


In GNU Emacs 26.3 (build 1, x86_64-w64-mingw32)
 of 2019-08-29 built on CIRROCUMULUS
Repository revision: 96dd0196c28bc36779584e47fffcca433c9309cd
Windowing system distributor 'Microsoft Corp.', version 6.3.9600
Recent messages:

Configured using:
 'configure --without-dbus --host=x86_64-w64-mingw32
 --without-compress-install 'CFLAGS=-O2 -static -g3''

Configured features:
XPM JPEG TIFF GIF PNG RSVG SOUND NOTIFY ACL GNUTLS LIBXML2 ZLIB
TOOLKIT_SCROLL_BARS THREADS LCMS2

Important settings:
  value of $LANG: JPN
  locale-coding-system: cp932

Major mode: Lisp Interaction

Minor modes in effect:
  tooltip-mode: t
  global-eldoc-mode: t
  eldoc-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t

Load-path shadows:
None found.

Features:
(network-stream nsm starttls tls gnutls mailalias smtpmail auth-source
cl-seq eieio eieio-core cl-macs eieio-loaddefs misearch multi-isearch pp
shadow sort mail-extr emacsbug message rmc puny seq byte-opt gv bytecomp
byte-compile cconv dired dired-loaddefs format-spec rfc822 mml mml-sec
password-cache epa derived epg epg-config gnus-util rmail rmail-loaddefs
mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev gmm-utils
mailheader sendmail rfc2047 rfc2045 ietf-drums mm-util mail-prsvr
mail-utils cl-extra thingatpt help-fns radix-tree help-mode cl-loaddefs
cl-lib image-mode easymenu elec-pair time-date mule-util japan-util
tooltip eldoc electric uniquify ediff-hook vc-hooks lisp-float-type
mwheel dos-w32 ls-lisp disp-table term/w32-win w32-win w32-vars
term/common-win tool-bar dnd fontset image regexp-opt fringe
tabulated-list replace newcomment text-mode elisp-mode lisp-mode
prog-mode register page menu-bar rfn-eshadow isearch timer select
scroll-bar mouse jit-lock font-lock syntax facemenu font-core
term/tty-colors frame cl-generic cham georgian utf-8-lang misc-lang
vietnamese tibetan thai tai-viet lao korean japanese eucjp-ms cp51932
hebrew greek romanian slovak czech european ethiopic indian cyrillic
chinese composite charscript charprop case-table epa-hook jka-cmpr-hook
help simple abbrev obarray minibuffer cl-preloaded nadvice loaddefs
button faces cus-face macroexp files text-properties overlay sha1 md5
base64 format env code-pages mule custom widget hashtable-print-readable
backquote threads w32notify w32 lcms2 multi-tty make-network-process
emacs)

Memory information:
((conses 16 120602 43655)
 (symbols 48 21333 2)
 (miscs 40 80 287)
 (strings 32 35067 1048)
 (string-bytes 1 899321)
 (vectors 16 16930)
 (vector-slots 8 597369 14750)
 (floats 8 64 249)
 (intervals 56 848 3)
 (buffers 992 21))






* bug#37580: 26.3; setting buffer as unibyte temporarily may change buffer contents
  2019-10-02  9:43 bug#37580: 26.3; setting buffer as unibyte temporarily may change buffer contents ynyaaa
@ 2019-10-02 15:14 ` Eli Zaretskii
  2019-10-05 17:18   ` ynyaaa
  0 siblings, 1 reply; 5+ messages in thread
From: Eli Zaretskii @ 2019-10-02 15:14 UTC (permalink / raw)
  To: ynyaaa; +Cc: 37580

> From: ynyaaa@gmail.com
> Date: Wed, 02 Oct 2019 18:43:45 +0900
> 
> 
> If a multibyte buffer contains eight-bit character sequences,
> evaluating the form
>  (progn (set-buffer-multibyte nil) (set-buffer-multibyte t))
> may convert them to multibyte characters.
> 
> Afterwards, the positions recorded in buffer-undo-list may no longer
> match the buffer contents.
> In the form below, undo puts the character '1' back at the wrong position.

I don't think this is a bug.  Changing the multibyte-ness of a buffer
really does change the contents.  You should only do that where it
makes sense.






* bug#37580: 26.3; setting buffer as unibyte temporarily may change buffer contents
  2019-10-02 15:14 ` Eli Zaretskii
@ 2019-10-05 17:18   ` ynyaaa
  2019-10-05 18:56     ` Eli Zaretskii
  0 siblings, 1 reply; 5+ messages in thread
From: ynyaaa @ 2019-10-05 17:18 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 37580

Eli Zaretskii <eliz@gnu.org> writes:
> I don't think this is a bug.  Changing the multibyte-ness of a buffer
> really does change the contents.  You should only do that where it
> makes sense.

Sometimes I find broken UTF-8 text on the Internet.
Some characters are split into surrogate pairs, and each surrogate
character is encoded as if it were a normal BMP character.

The utf-8 coding system does not decode such sequences.
Changing multibyte-ness converts them to surrogate characters,
and an encode-decode round trip through utf-16be then yields the
intended characters.

Suppose the character is #x10000;
the corresponding surrogate pair is (#xD800 #xDC00).
The mis-encoded sequence is:
  (encode-coding-string "\xD800\xDC00" 'utf-8)
  => "\355\240\200\355\260\200"

It is not decoded with utf-8.
  (decode-coding-string (encode-coding-string "\xD800\xDC00" 'utf-8)
                        'utf-8)
  => "\355\240\200\355\260\200"

After changing multibyte-ness, the sequence is converted into surrogate
characters.
  (with-temp-buffer
    (insert (encode-coding-string "\xD800\xDC00" 'utf-8))
    (set-buffer-multibyte nil)
    (set-buffer-multibyte t)
    (buffer-string))
  => "\xD800\xDC00"

The surrogate pair can be converted into the original character.
  (decode-coding-string (encode-coding-string "\xD800\xDC00" 'utf-16be)
                        'utf-16be)
  => "\x10000"
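
The steps above could be wrapped into one function.  This is only a
sketch built from the forms in this message; the name
my-recover-surrogate-pairs is hypothetical, and the behaviour of
set-buffer-multibyte and utf-16be on surrogate characters is assumed to
match what is reported here for Emacs 26.3.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun my-recover-surrogate-pairs (bytes)
  "Return the text of BYTES with UTF-8-encoded surrogate pairs recombined.
BYTES is a unibyte string holding the broken UTF-8 data."
  (with-temp-buffer
    ;; In a multibyte buffer, the non-ASCII bytes of a unibyte string
    ;; are inserted as eight-bit characters.
    (insert bytes)
    ;; The round trip reinterprets those bytes as characters, which
    ;; (per this report) turns the sequences into surrogate characters.
    (set-buffer-multibyte nil)
    (set-buffer-multibyte t)
    ;; Encoding and decoding with utf-16be joins each surrogate pair.
    (decode-coding-string
     (encode-coding-string (buffer-string) 'utf-16be)
     'utf-16be)))
```

For example, (my-recover-surrogate-pairs "\355\240\200\355\260\200")
corresponds to the sequence of forms shown above.  The temporary buffer
does not record undo information, so the undo problem from the original
report does not arise in it.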






* bug#37580: 26.3; setting buffer as unibyte temporarily may change buffer contents
  2019-10-05 17:18   ` ynyaaa
@ 2019-10-05 18:56     ` Eli Zaretskii
  2019-10-28 23:26       ` Stefan Kangas
  0 siblings, 1 reply; 5+ messages in thread
From: Eli Zaretskii @ 2019-10-05 18:56 UTC (permalink / raw)
  To: ynyaaa; +Cc: 37580

> From: ynyaaa@gmail.com
> Cc: 37580@debbugs.gnu.org
> Date: Sun, 06 Oct 2019 02:18:08 +0900
> 
> Sometimes I find broken UTF-8 text on the Internet.
> Some characters are split into surrogate pairs, and each surrogate
> character is encoded as if it were a normal BMP character.
> 
> The utf-8 coding system does not decode such sequences.
> Changing multibyte-ness converts them to surrogate characters,
> and an encode-decode round trip through utf-16be then yields the
> intended characters.
> 
> Suppose the character is #x10000;
> the corresponding surrogate pair is (#xD800 #xDC00).
> The mis-encoded sequence is:
>   (encode-coding-string "\xD800\xDC00" 'utf-8)
>   => "\355\240\200\355\260\200"
> 
> It is not decoded with utf-8.
>   (decode-coding-string (encode-coding-string "\xD800\xDC00" 'utf-8)
>                         'utf-8)
>   => "\355\240\200\355\260\200"
> 
> After changing multibyte-ness, the sequence is converted into surrogate
> characters.
>   (with-temp-buffer
>     (insert (encode-coding-string "\xD800\xDC00" 'utf-8))
>     (set-buffer-multibyte nil)
>     (set-buffer-multibyte t)
>     (buffer-string))
>   => "\xD800\xDC00"
> 
> The surrogate pair can be converted into the original character.
>   (decode-coding-string (encode-coding-string "\xD800\xDC00" 'utf-16be)
>                         'utf-16be)
>   => "\x10000"

So where's the problem in all this?  AFAIU, you describe a sequence of
actions that successfully recovers text in an obscure situation.

I think the problem is that you enable undo.  So in that case, just
don't do that.
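
One possible reading of this advice, as a sketch only: discard the undo
history around the round trip so that no stale positions survive it.
The function name below is hypothetical; buffer-disable-undo and
buffer-enable-undo are standard Emacs functions.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun my-roundtrip-discarding-undo ()
  "Flip the current buffer to unibyte and back, discarding undo history."
  ;; Drop all recorded undo entries; their positions may become wrong.
  (buffer-disable-undo)
  (set-buffer-multibyte nil)
  (set-buffer-multibyte t)
  ;; Start recording again with an empty undo list.
  (buffer-enable-undo))
```

Note that this also turns undo on in a buffer where it was previously
off, and that earlier edits can no longer be undone afterwards.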






* bug#37580: 26.3; setting buffer as unibyte temporarily may change buffer contents
  2019-10-05 18:56     ` Eli Zaretskii
@ 2019-10-28 23:26       ` Stefan Kangas
  0 siblings, 0 replies; 5+ messages in thread
From: Stefan Kangas @ 2019-10-28 23:26 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 37580, ynyaaa

tags 37580 + notabug
close 37580
thanks

Eli Zaretskii <eliz@gnu.org> writes:
[...]
> So where's the problem in all this?  AFAIU, you describe a sequence of
> actions that successfully recovers text in an obscure situation.
>
> I think the problem is that you enable undo.  So in that case, just
> don't do that.

No further comments, so I'm closing this as notabug.

Best regards,
Stefan Kangas






