bug#12291: [rev 109796] wrong UTF-8 handling

unofficial mirror of bug-gnu-emacs@gnu.org 

From: Kenichi Handa <handa@gnu.org>
To: Werner LEMBERG <wl@gnu.org>
Cc: 12291@debbugs.gnu.org, smithcu@gvsu.edu
Subject: bug#12291: [rev 109796] wrong UTF-8 handling
Date: Tue, 28 Aug 2012 23:57:39 +0900
Message-ID: <87a9xfdpy4.fsf@gnu.org>
In-Reply-To: <20120828.074720.480105751.wl@gnu.org> (message from Werner LEMBERG on Tue, 28 Aug 2012 07:47:20 +0200 (CEST))

In article <20120828.074720.480105751.wl@gnu.org>, Werner LEMBERG <wl@gnu.org> writes:

> Have a look at the attached file, containing a single character.
> (It's transmitted as binary to avoid e-mail encoding issues).  It
> contains a single, four-byte UTF-8 encoded character (0xF4 0xB5 0x87
> 0x9E, which would map to the non-existent Unicode character code
> U+1351DE).  If I load this file as UTF-8 encoded, Emacs gives this as
> the output of `C-u C-x =':

>                position: 1 of 2 (0%), column: 0
>               character: 二 (displayed as 二) (codepoint 20108, #o47214, #x4e8c)
[...]
> Look what Emacs says about the file code.  If I save this
> one-character file as UTF-8, the character code stays as-is.

> This behaviour is clearly wrong.

Sure.
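
For reference, the arithmetic behind the quoted byte sequence can be
checked with a short sketch.  This is plain Python, not Emacs code; the
helper name decode_utf8_4byte and the constants are hypothetical and
only serve this illustration.  It applies the four-byte UTF-8 bit
layout without any range check, then compares the result with the
Unicode limit U+10FFFF.

```python
UNICODE_MAX = 0x10FFFF  # highest code point defined by Unicode


def decode_utf8_4byte(data: bytes) -> int:
    """Combine the payload bits of a four-byte UTF-8-style sequence.

    Deliberately performs no range check, so it also accepts
    sequences that encode values beyond UNICODE_MAX.
    """
    if len(data) != 4:
        raise ValueError("expected exactly four bytes")
    b0, b1, b2, b3 = data
    if b0 & 0xF8 != 0xF0:  # lead byte must look like 11110xxx
        raise ValueError("not a four-byte lead byte")
    for b in (b1, b2, b3):
        if b & 0xC0 != 0x80:  # continuation bytes must look like 10xxxxxx
            raise ValueError("not a continuation byte")
    # 3 payload bits from the lead byte, 6 from each continuation byte.
    return ((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)


SAMPLE = bytes([0xF4, 0xB5, 0x87, 0x9E])  # the bytes from the report
code = decode_utf8_4byte(SAMPLE)
print(hex(code))           # 0x1351de
print(code > UNICODE_MAX)  # True
```

The bit pattern is well-formed, but the value it encodes lies above
U+10FFFF, so a strict UTF-8 decoder has to reject it.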

> I suspect that Emacs is using such a
> high character code for internal representation of the `emacs-mule'
> encoding.  However, the user must not see this.  

That higher character code area is used for two purposes.

One is for reading CJK characters in legacy encodings (euc,
sjis, big5, etc).  They are decoded into the utf-8-emacs
byte sequence corresponding to the higher character code
area.  When their character codes are retrieved, most of them
are unified into Unicode BMP characters, but a few are left
un-unified.  Those are private characters in each legacy
character set.

Another is for supporting non-Unicode characters.  The
biggest set is GB18030.

In both cases, users will surely see them.

> Instead, such characters must be converted to correct
> UTF-8.

??? I don't understand what you mean by "correct UTF-8".

I think the correct behaviour on reading such a file as
utf-8 is to treat each byte as a raw byte.
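
As an analogy only, the raw-byte idea can be sketched in Python, whose
"surrogateescape" error handler keeps each undecodable byte as its own
placeholder code point so that saving reproduces the original bytes.
This is not how Emacs implements raw bytes; the helper name
strict_decode_fails is hypothetical.

```python
SAMPLE = bytes([0xF4, 0xB5, 0x87, 0x9E])  # the bytes from the report


def strict_decode_fails(data: bytes) -> bool:
    """Return True if strict UTF-8 decoding rejects the data."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# Strict decoding refuses the sequence instead of inventing a character.
print(strict_decode_fails(SAMPLE))

# With surrogateescape, every invalid byte becomes one placeholder
# (0xDC00 + byte value), comparable to keeping it as a raw byte.
text = SAMPLE.decode("utf-8", errors="surrogateescape")
print([hex(ord(ch)) for ch in text])

# Encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode("utf-8", errors="surrogateescape")
print(roundtrip == SAMPLE)
```

The point of the sketch is the round trip: the four bytes are shown as
four separate units rather than one character, and writing the text
back out leaves the file unchanged.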

---
Kenichi Handa
handa@gnu.org


Thread overview: 10+ messages
2012-08-28  5:47 bug#12291: [rev 109796] wrong UTF-8 handling Werner LEMBERG
2012-08-28  9:03 ` Andreas Schwab
2012-08-28 14:57 ` Kenichi Handa [this message]
2012-08-28 19:22   ` Werner LEMBERG
2012-08-31 10:40     ` Eli Zaretskii
2012-09-03  0:59       ` Kenichi Handa
2012-09-03  2:40         ` Eli Zaretskii
2022-01-27 16:32 ` Lars Ingebrigtsen
2022-01-27 16:52   ` Eli Zaretskii
2022-02-25  2:33     ` Lars Ingebrigtsen
