all messages for Emacs-related lists mirrored at yhetil.org
From: ynyaaa@gmail.com
To: Eli Zaretskii <eliz@gnu.org>
Cc: 37580@debbugs.gnu.org
Subject: bug#37580: 26.3; setting buffer as unibyte temporarily may change buffer contents
Date: Sun, 06 Oct 2019 02:18:08 +0900
Message-ID: <86lftz157z.fsf@gmail.com>
In-Reply-To: <83h84r89i4.fsf@gnu.org> (Eli Zaretskii's message of "Wed, 02 Oct 2019 18:14:43 +0300")

Eli Zaretskii <eliz@gnu.org> writes:
> I don't think this is a bug.  Changing the multibyte-ness of a buffer
> really does change the contents.  You should only do that where it
> makes sense.

Sometimes I find broken UTF-8 text on the Internet.
Some characters are split into surrogate pairs, and each surrogate
is encoded as if it were a normal BMP character.

The utf-8 coding system does not decode such sequences.
Changing multibyte-ness converts them to surrogate characters,
and an encode-decode round trip with utf-16be then yields the
intended characters.

Suppose the character is #x10000;
the corresponding surrogate pair is (#xD800 #xDC00).
The mis-encoded sequence is:
  (encode-coding-string "\xD800\xDC00" 'utf-8)
  => "\355\240\200\355\260\200"
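
To see where those six bytes come from, here is a minimal sketch of the
three-byte UTF-8 layout applied to each surrogate on its own. The helper
name my-utf8-3byte is hypothetical, not an Emacs function, and it only
handles code points in the range #x800..#xFFFF.

```elisp
;; Hypothetical helper (not part of Emacs): encode one code point in
;; the range #x800..#xFFFF with the 3-byte UTF-8 layout
;;   1110xxxx 10xxxxxx 10xxxxxx
(defun my-utf8-3byte (cp)
  "Return the three UTF-8 bytes for code point CP as a list of integers."
  (unless (<= #x800 cp #xFFFF)
    (error "Code point out of 3-byte range: #x%X" cp))
  (list (logior #xE0 (ash cp -12))                 ; top 4 bits
        (logior #x80 (logand (ash cp -6) #x3F))    ; middle 6 bits
        (logior #x80 (logand cp #x3F))))           ; low 6 bits

(my-utf8-3byte #xD800) ; => (237 160 128), i.e. \355 \240 \200
(my-utf8-3byte #xDC00) ; => (237 176 128), i.e. \355 \260 \200
```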

Decoding it with utf-8 returns the bytes unchanged:
  (decode-coding-string (encode-coding-string "\xD800\xDC00" 'utf-8)
                        'utf-8)
  => "\355\240\200\355\260\200"

Changing multibyte-ness converts the sequence into surrogate
characters.
  (with-temp-buffer
    (insert (encode-coding-string "\xD800\xDC00" 'utf-8))
    (set-buffer-multibyte nil)
    (set-buffer-multibyte t)
    (buffer-string))
  => "\xD800\xDC00"

The surrogate pair can be converted into the original character.
  (decode-coding-string (encode-coding-string "\xD800\xDC00" 'utf-16be)
                        'utf-16be)
  => "\x10000"
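
The same result can be checked by hand with the standard UTF-16
surrogate arithmetic. This is only a sketch: my-surrogates-to-char is a
hypothetical helper, not an Emacs function, and it does not go through
any coding system.

```elisp
;; Hypothetical helper (not part of Emacs): combine a UTF-16 surrogate
;; pair into the code point it stands for.
(defun my-surrogates-to-char (high low)
  "Return the code point encoded by the surrogate pair HIGH and LOW."
  (unless (and (<= #xD800 high #xDBFF) (<= #xDC00 low #xDFFF))
    (error "Not a surrogate pair: #x%X #x%X" high low))
  (+ #x10000
     (ash (- high #xD800) 10)   ; upper 10 bits come from the high surrogate
     (- low #xDC00)))           ; lower 10 bits come from the low surrogate

(my-surrogates-to-char #xD800 #xDC00) ; => 65536  (#x10000)
(my-surrogates-to-char #xD83D #xDE00) ; => 128512 (#x1F600)
```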





