* bug#37580: 26.3; setting buffer as unibyte temporarily may change buffer contents
From: ynyaaa @ 2019-10-02 9:43 UTC
To: 37580
If a multibyte buffer contains eight-bit character sequences,
evaluating the form
(progn (set-buffer-multibyte nil) (set-buffer-multibyte t))
may convert them to multibyte characters.
Afterwards, the positions recorded in buffer-undo-list may no longer
match the buffer contents.
In the form below, undo reinserts the character '1' at the wrong position.
(with-temp-buffer
  (insert "一123")
  (encode-coding-region 1 2 'utf-8)
  (buffer-enable-undo)
  (undo-boundary)
  (progn (goto-char (point-min))
         (search-forward "1")
         (delete-char -1))
  (undo-boundary)
  (progn (set-buffer-multibyte nil)
         (set-buffer-multibyte t))
  (undo)
  (buffer-string))
=> "一231"
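One way to read the result above, sketched under the assumption that the round trip recombines the three eight-bit characters produced by encode-coding-region into the single character 一: the buffer gets shorter, while the undo record made earlier still refers to the old position of '1'. The helper name my-roundtrip-sizes is hypothetical and exists only for this illustration.

```elisp
(defun my-roundtrip-sizes ()
  "Return (BEFORE AFTER): buffer sizes around a unibyte round trip.
Hypothetical helper for illustration, not part of Emacs."
  (with-temp-buffer
    ;; "\u4e00" is the character U+4E00 used in the report.
    (insert (concat "\u4e00" "123"))
    ;; Replace the first character with its UTF-8 bytes; in a multibyte
    ;; buffer these are stored as eight-bit characters.
    (encode-coding-region 1 2 'utf-8)
    (let ((before (buffer-size)))
      ;; The round trip discussed in this report.
      (set-buffer-multibyte nil)
      (set-buffer-multibyte t)
      (list before (buffer-size)))))
```

If AFTER comes out smaller than BEFORE, every position stored in buffer-undo-list before the round trip now points at a different place in the text, which would be consistent with '1' ending up after "23".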
In GNU Emacs 26.3 (build 1, x86_64-w64-mingw32)
of 2019-08-29 built on CIRROCUMULUS
Repository revision: 96dd0196c28bc36779584e47fffcca433c9309cd
Windowing system distributor 'Microsoft Corp.', version 6.3.9600
Recent messages:
Configured using:
'configure --without-dbus --host=x86_64-w64-mingw32
--without-compress-install 'CFLAGS=-O2 -static -g3''
Configured features:
XPM JPEG TIFF GIF PNG RSVG SOUND NOTIFY ACL GNUTLS LIBXML2 ZLIB
TOOLKIT_SCROLL_BARS THREADS LCMS2
Important settings:
value of $LANG: JPN
locale-coding-system: cp932
Major mode: Lisp Interaction
Minor modes in effect:
tooltip-mode: t
global-eldoc-mode: t
eldoc-mode: t
electric-indent-mode: t
mouse-wheel-mode: t
tool-bar-mode: t
menu-bar-mode: t
file-name-shadow-mode: t
global-font-lock-mode: t
font-lock-mode: t
auto-composition-mode: t
auto-encryption-mode: t
auto-compression-mode: t
line-number-mode: t
Load-path shadows:
None found.
Features:
(network-stream nsm starttls tls gnutls mailalias smtpmail auth-source
cl-seq eieio eieio-core cl-macs eieio-loaddefs misearch multi-isearch pp
shadow sort mail-extr emacsbug message rmc puny seq byte-opt gv bytecomp
byte-compile cconv dired dired-loaddefs format-spec rfc822 mml mml-sec
password-cache epa derived epg epg-config gnus-util rmail rmail-loaddefs
mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev gmm-utils
mailheader sendmail rfc2047 rfc2045 ietf-drums mm-util mail-prsvr
mail-utils cl-extra thingatpt help-fns radix-tree help-mode cl-loaddefs
cl-lib image-mode easymenu elec-pair time-date mule-util japan-util
tooltip eldoc electric uniquify ediff-hook vc-hooks lisp-float-type
mwheel dos-w32 ls-lisp disp-table term/w32-win w32-win w32-vars
term/common-win tool-bar dnd fontset image regexp-opt fringe
tabulated-list replace newcomment text-mode elisp-mode lisp-mode
prog-mode register page menu-bar rfn-eshadow isearch timer select
scroll-bar mouse jit-lock font-lock syntax facemenu font-core
term/tty-colors frame cl-generic cham georgian utf-8-lang misc-lang
vietnamese tibetan thai tai-viet lao korean japanese eucjp-ms cp51932
hebrew greek romanian slovak czech european ethiopic indian cyrillic
chinese composite charscript charprop case-table epa-hook jka-cmpr-hook
help simple abbrev obarray minibuffer cl-preloaded nadvice loaddefs
button faces cus-face macroexp files text-properties overlay sha1 md5
base64 format env code-pages mule custom widget hashtable-print-readable
backquote threads w32notify w32 lcms2 multi-tty make-network-process
emacs)
Memory information:
((conses 16 120602 43655)
(symbols 48 21333 2)
(miscs 40 80 287)
(strings 32 35067 1048)
(string-bytes 1 899321)
(vectors 16 16930)
(vector-slots 8 597369 14750)
(floats 8 64 249)
(intervals 56 848 3)
(buffers 992 21))
* bug#37580: 26.3; setting buffer as unibyte temporarily may change buffer contents
From: Eli Zaretskii @ 2019-10-02 15:14 UTC
To: ynyaaa; +Cc: 37580
> From: ynyaaa@gmail.com
> Date: Wed, 02 Oct 2019 18:43:45 +0900
>
>
> If a multibyte buffer contains eight-bit character sequences,
> evaluating the form
> (progn (set-buffer-multibyte nil) (set-buffer-multibyte t))
> may convert them to multibyte characters.
>
> Afterwards, the positions recorded in buffer-undo-list may no longer
> match the buffer contents.
> In the form below, undo reinserts the character '1' at the wrong position.
I don't think this is a bug. Changing the multibyte-ness of a buffer
really does change the contents. You should only do that where it
makes sense.
* bug#37580: 26.3; setting buffer as unibyte temporarily may change buffer contents
From: ynyaaa @ 2019-10-05 17:18 UTC
To: Eli Zaretskii; +Cc: 37580
Eli Zaretskii <eliz@gnu.org> writes:
> I don't think this is a bug. Changing the multibyte-ness of a buffer
> really does change the contents. You should only do that where it
> makes sense.
Sometimes I find broken UTF-8 text on the Internet.
Some characters are split into surrogate pairs, and each surrogate
is encoded as if it were a normal BMP character.
The utf-8 coding system does not decode such sequences.
Changing multibyte-ness converts them to surrogate characters,
and an encode-decode round trip with utf-16be then yields the intended
characters.
Suppose the character is #x10000;
the corresponding pair is (#xD800 #xDC00).
The mis-encoded sequence is:
(encode-coding-string "\xD800\xDC00" 'utf-8)
=> "\355\240\200\355\260\200"
It is not decoded with utf-8.
(decode-coding-string (encode-coding-string "\xD800\xDC00" 'utf-8)
                      'utf-8)
=> "\355\240\200\355\260\200"
After changing multibyte-ness, the sequence is converted into surrogate
characters.
(with-temp-buffer
  (insert (encode-coding-string "\xD800\xDC00" 'utf-8))
  (set-buffer-multibyte nil)
  (set-buffer-multibyte t)
  (buffer-string))
=> "\xD800\xDC00"
The surrogate pair can be converted into the original character.
(decode-coding-string (encode-coding-string "\xD800\xDC00" 'utf-16be)
                      'utf-16be)
=> "\x10000"
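The arithmetic behind this last step can also be written out by hand. The sketch below applies the standard UTF-16 rule for combining a high and a low surrogate; the function name my-surrogates-to-char is hypothetical and is not an Emacs API.

```elisp
(defun my-surrogates-to-char (high low)
  "Combine UTF-16 surrogates HIGH and LOW into one code point.
Hypothetical helper, shown only to illustrate the arithmetic."
  (unless (and (<= #xD800 high #xDBFF) (<= #xDC00 low #xDFFF))
    (error "Not a surrogate pair: #x%X #x%X" high low))
  (+ #x10000
     (ash (- high #xD800) 10)   ; upper 10 bits come from the high surrogate
     (- low #xDC00)))           ; lower 10 bits come from the low surrogate

(my-surrogates-to-char #xD800 #xDC00) ; => 65536, i.e. #x10000
```

This only covers the final combination; it does not replace the decoding steps shown above, which are still needed to get from the raw bytes to the two surrogates.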
* bug#37580: 26.3; setting buffer as unibyte temporarily may change buffer contents
From: Eli Zaretskii @ 2019-10-05 18:56 UTC
To: ynyaaa; +Cc: 37580
> From: ynyaaa@gmail.com
> Cc: 37580@debbugs.gnu.org
> Date: Sun, 06 Oct 2019 02:18:08 +0900
>
> Sometimes I find broken UTF-8 text on the Internet.
> Some characters are split into surrogate pairs, and each surrogate
> is encoded as if it were a normal BMP character.
>
> The utf-8 coding system does not decode such sequences.
> Changing multibyte-ness converts them to surrogate characters,
> and an encode-decode round trip with utf-16be then yields the intended
> characters.
>
> Suppose the character is #x10000;
> the corresponding pair is (#xD800 #xDC00).
> The mis-encoded sequence is:
> (encode-coding-string "\xD800\xDC00" 'utf-8)
> => "\355\240\200\355\260\200"
>
> It is not decoded with utf-8.
> (decode-coding-string (encode-coding-string "\xD800\xDC00" 'utf-8)
>                       'utf-8)
> => "\355\240\200\355\260\200"
>
> After changing multibyte-ness, the sequence is converted into surrogate
> characters.
> (with-temp-buffer
>   (insert (encode-coding-string "\xD800\xDC00" 'utf-8))
>   (set-buffer-multibyte nil)
>   (set-buffer-multibyte t)
>   (buffer-string))
> => "\xD800\xDC00"
>
> The surrogate pair can be converted into the original character.
> (decode-coding-string (encode-coding-string "\xD800\xDC00" 'utf-16be)
>                       'utf-16be)
> => "\x10000"
So where's the problem in all this? AFAIU, you describe a sequence of
actions that successfully recovers text in an obscure situation.
I think the problem is that you enable undo. So in that case, just
don't do that.
* bug#37580: 26.3; setting buffer as unibyte temporarily may change buffer contents
From: Stefan Kangas @ 2019-10-28 23:26 UTC
To: Eli Zaretskii; +Cc: 37580, ynyaaa
tags 37580 + notabug
close 37580
thanks
Eli Zaretskii <eliz@gnu.org> writes:
[...]
> So where's the problem in all this? AFAIU, you describe a sequence of
> actions that successfully recovers text in an obscure situation.
>
> I think the problem is that you enable undo. So in that case, just
> don't do that.
No further comments, so I'm closing this as notabug.
Best regards,
Stefan Kangas