From: Eli Zaretskii <eliz@gnu.org>
To: ynyaaa@gmail.com
Cc: 37580@debbugs.gnu.org
Subject: bug#37580: 26.3; setting buffer as unibyte temporarily may change buffer contents
Date: Sat, 05 Oct 2019 21:56:36 +0300
Message-ID: <83tv8n2f8b.fsf@gnu.org>
In-Reply-To: <86lftz157z.fsf@gmail.com> (ynyaaa@gmail.com)
> From: ynyaaa@gmail.com
> Cc: 37580@debbugs.gnu.org
> Date: Sun, 06 Oct 2019 02:18:08 +0900
>
> Sometimes I find broken utf-8 texts on the Internet.
> Some characters are split into surrogate pairs, and each surrogate
> character is encoded as if it is a normal BMP character.
>
> utf-8 coding system does not decode such sequences.
> Changing multibyte-ness converts them to surrogate characters.
> And an encode-decode round trip with utf-16be outputs the intended characters.
>
> Suppose the character is #x10000;
> the corresponding pair is (#xD800 #xDC00).
> The mis-encoded sequence is:
> (encode-coding-string "\xD800\xDC00" 'utf-8)
> => "\355\240\200\355\260\200"
>
> It is not decoded with utf-8.
> (decode-coding-string (encode-coding-string "\xD800\xDC00" 'utf-8)
>                       'utf-8)
> => "\355\240\200\355\260\200"
>
> Changing multibyte-ness, the sequence is converted into surrogate
> characters.
> (with-temp-buffer
>   (insert (encode-coding-string "\xD800\xDC00" 'utf-8))
>   (set-buffer-multibyte nil)
>   (set-buffer-multibyte t)
>   (buffer-string))
> => "\xD800\xDC00"
>
> The surrogate pair can be converted into the original character.
> (decode-coding-string (encode-coding-string "\xD800\xDC00" 'utf-16be)
>                       'utf-16be)
> => "\x10000"

So where's the problem in all this?  AFAIU, you describe a sequence of
actions that successfully recovers text in an obscure situation.

I think the problem is that you enable undo.  So in that case, just
don't do that.

Thread overview: 5+ messages
2019-10-02 9:43 bug#37580: 26.3; setting buffer as unibyte temporarily may change buffer contents ynyaaa
2019-10-02 15:14 ` Eli Zaretskii
2019-10-05 17:18 ` ynyaaa
2019-10-05 18:56 ` Eli Zaretskii [this message]
2019-10-28 23:26 ` Stefan Kangas