bug#12291: [rev 109796] wrong UTF-8 handling

unofficial mirror of bug-gnu-emacs@gnu.org 

* bug#12291: [rev 109796] wrong UTF-8 handling
@ 2012-08-28  5:47 Werner LEMBERG
  2012-08-28  9:03 ` Andreas Schwab
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Werner LEMBERG @ 2012-08-28  5:47 UTC (permalink / raw)
  To: 12291; +Cc: Curtis Smith

[-- Attachment #1: Type: Text/Plain, Size: 5007 bytes --]


[bzr revision 109796]

Have a look at the attached file, containing a single character.
(It's transmitted as binary to avoid e-mail encoding issues).  It
contains a single, four-byte UTF-8 encoded character (0xF4 0xB5 0x87
0x9E, which would map to the non-existent Unicode character code
U+1351DE).  If I load this file as UTF-8 encoded, Emacs gives this as
the output of `C-u C-x =':

               position: 1 of 2 (0%), column: 0
              character: 二 (displayed as 二) (codepoint 20108, #o47214, #x4e8c)
      preferred charset: unicode (Unicode (ISO10646))
  code point in charset: 0x4E8C
                 syntax: w 	which means: word
               category: .:Base, C:2-byte han, L:Left-to-right (strong), c:Chinese, h:Korean, j:Japanese, |:line breakable
               to input: type "C-x 8 RET HEX-CODEPOINT" or "C-x 8 RET NAME"
            buffer code: #xE4 #xBA #x8C
              file code: #xE4 #xBA #x8C (encoded by coding system utf-8-unix)
                display: by this font (glyph code)
      xft:-unknown-SimSun-normal-normal-normal-*-24-*-*-*-d-0-iso10646-1 (#x460)

  Character code properties: customize what to show
    name: CJK IDEOGRAPH-4E8C
    general-category: Lo (Letter, Other)
    decomposition: (20108) ('二')

Look what Emacs says about the file code.  If I save this
one-character file as UTF-8, the character code stays as-is.

This behaviour is clearly wrong.  I suspect that Emacs is using such a
high character code for internal representation of the `emacs-mule'
encoding.  However, the user must not see this.  Instead, such
characters must be converted to correct UTF-8.
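
The byte arithmetic behind the U+1351DE figure can be checked by hand. The following is only a minimal sketch in Python, not anything taken from Emacs; the helper name `decode_four_byte` is made up for illustration, and it assumes the usual four-byte UTF-8 layout (11110xxx followed by three 10xxxxxx bytes).

```python
MAX_UNICODE = 0x10FFFF  # largest valid Unicode code point


def decode_four_byte(seq: bytes) -> int:
    """Combine the payload bits of a four-byte UTF-8 style sequence.

    Hypothetical helper: it checks only the bit patterns of the lead and
    continuation bytes and deliberately does NOT check the Unicode range.
    """
    if len(seq) != 4 or (seq[0] & 0xF8) != 0xF0:
        raise ValueError("not a four-byte UTF-8 lead byte")
    if any((b & 0xC0) != 0x80 for b in seq[1:]):
        raise ValueError("bad continuation byte")
    return (
        ((seq[0] & 0x07) << 18)
        | ((seq[1] & 0x3F) << 12)
        | ((seq[2] & 0x3F) << 6)
        | (seq[3] & 0x3F)
    )


code = decode_four_byte(bytes([0xF4, 0xB5, 0x87, 0x9E]))
print(hex(code), code > MAX_UNICODE)
```

The result is 0x1351DE (decimal 1266142), which lies above U+10FFFF, so the sequence is not well-formed UTF-8 even though its bit patterns look regular.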


    Werner


======================================================================

In GNU Emacs 24.2.50.1 (i686-pc-linux-gnu, GTK+ Version 2.24.9)
 of 2012-08-28 on linux-nvf0
Windowing system distributor `The X.Org Foundation', version 11.0.11004000
Configured using:
 `configure 'MAKEINFO=/usr/bin/makeinfo' '--with-x-toolkit=gtk''

Important settings:
  value of $LANG: de_DE.UTF-8
  value of $XMODIFIERS: @im=none
  locale-coding-system: utf-8-unix
  default enable-multibyte-characters: t

Major mode: Summary

Minor modes in effect:
  tooltip-mode: t
  mouse-wheel-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  column-number-mode: t
  transient-mark-mode: t

Recent input:
<return> w b u g - e m <tab> <tab> <tab> <tab> <tab> 
<tab> <tab> <backspace> <backspace> <tab> <tab> C-c 
C-q y M-x w r i t e - e m <tab> C-g C-h a b u g <return> 
<M-next> C-x 1 M-x r e p r t <backspace> <backspace> 
o r t - e m <tab> <return>

Recent messages:
Saving file /home/wl/Mail/draft/11...
Wrote /home/wl/Mail/draft/11
Draft is prepared
No matching alias [7 times]
Kill draft message? (y or n)  y
Saving file /home/wl/Mail/draft/11...
Wrote /home/wl/Mail/draft/11
Draft was killed
Quit
Type C-x 4 C-o RET to restore the other window.  

Load-path shadows:
None found.

Features:
(shadow emacsbug message format-spec rfc822 mml mml-sec mm-decode
mm-bodies mm-encode mail-parse rfc2231 mailabbrev gmm-utils mailheader
sendmail rfc2047 rfc2045 ietf-drums mm-util mail-prsvr mail-utils
apropos descr-text latexenc preview prv-emacs byte-opt tex-buf
noutline outline font-latex warnings bytecomp byte-compile cconv
macroexp latex easy-mmode edmacro kmacro tex-style cus-edit wid-edit
cus-start cus-load pp mew-varsx mew-unix cal-menu calendar
cal-loaddefs mew-auth mew-config mew-imap2 mew-imap mew-nntp2 mew-nntp
mew-pop mew-smtp mew-ssl mew-ssh mew-net mew-highlight mew-sort
mew-fib mew-ext mew-refile mew-demo mew-attach mew-draft mew-message
mew-thread mew-virtual mew-summary4 mew-summary3 mew-summary2
mew-summary mew-search mew-pick mew-passwd mew-scan mew-syntax mew-bq
mew-smime mew-pgp mew-header mew-exec mew-mark mew-mime mew-edit
mew-decode mew-encode mew-cache mew-minibuf mew-complete mew-addrbook
mew-local mew-vars3 mew-vars2 mew-vars mew-env mew-mule3 mew-mule
mew-gemacs mew-key mew-func mew-blvs mew-const mew tex advice help-fns
advice-preload tex-site auto-loads quail help-mode easymenu cjktilde
disp-table time-date tooltip ediff-hook vc-hooks lisp-float-type
mwheel x-win x-dnd tool-bar dnd fontset image regexp-opt fringe
tabulated-list newcomment lisp-mode register page menu-bar rfn-eshadow
timer select scroll-bar mouse jit-lock font-lock syntax facemenu
font-core frame cham georgian utf-8-lang misc-lang vietnamese tibetan
thai tai-viet lao korean japanese hebrew greek romanian slovak czech
european ethiopic indian cyrillic chinese case-table epa-hook
jka-cmpr-hook help simple abbrev minibuffer loaddefs button faces
cus-face files text-properties overlay sha1 md5 base64 format env
code-pages mule custom widget hashtable-print-readable backquote
make-network-process dbusbind dynamic-setting system-font-setting
font-render-setting move-toolbar gtk x-toolkit x multi-tty emacs)

[-- Attachment #2: emacs-problem.utf8 --]
[-- Type: Application/Octet-Stream, Size: 5 bytes --]


* bug#12291: [rev 109796] wrong UTF-8 handling
  2012-08-28  5:47 bug#12291: [rev 109796] wrong UTF-8 handling Werner LEMBERG
@ 2012-08-28  9:03 ` Andreas Schwab
  2012-08-28 14:57 ` Kenichi Handa
  2022-01-27 16:32 ` Lars Ingebrigtsen
  2 siblings, 0 replies; 10+ messages in thread
From: Andreas Schwab @ 2012-08-28  9:03 UTC (permalink / raw)
  To: Werner LEMBERG; +Cc: 12291, Curtis Smith

The code points above #x110000 are used for CJK unification.  The utf-8
decoder should probably reject all those codes.
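
As an illustration of what "reject" can look like, here is a small sketch using Python's built-in strict UTF-8 codec. It says nothing about how the Emacs decoder is implemented; the helper name `is_strict_utf8` is an assumption made for this example.

```python
data = bytes([0xF4, 0xB5, 0x87, 0x9E])  # the sequence from the report


def is_strict_utf8(raw: bytes) -> bool:
    """Return True if raw is accepted by Python's strict UTF-8 decoder."""
    try:
        raw.decode("utf-8")  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False


print(is_strict_utf8(data))                      # the reported sequence
print(is_strict_utf8(bytes([0xE4, 0xBA, 0x8C])))  # U+4E8C as regular UTF-8
```

A strict decoder of this kind refuses the four reported bytes, while it accepts the three-byte encoding of U+4E8C shown in the `buffer code' line of the report.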

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."






* bug#12291: [rev 109796] wrong UTF-8 handling
  2012-08-28  5:47 bug#12291: [rev 109796] wrong UTF-8 handling Werner LEMBERG
  2012-08-28  9:03 ` Andreas Schwab
@ 2012-08-28 14:57 ` Kenichi Handa
  2012-08-28 19:22   ` Werner LEMBERG
  2022-01-27 16:32 ` Lars Ingebrigtsen
  2 siblings, 1 reply; 10+ messages in thread
From: Kenichi Handa @ 2012-08-28 14:57 UTC (permalink / raw)
  To: Werner LEMBERG; +Cc: 12291, smithcu

In article <20120828.074720.480105751.wl@gnu.org>, Werner LEMBERG <wl@gnu.org> writes:

> Have a look at the attached file, containing a single character.
> (It's transmitted as binary to avoid e-mail encoding issues).  It
> contains a single, four-byte UTF-8 encoded character (0xF4 0xB5 0x87
> 0x9E, which would map to the non-existent Unicode character code
> U+1351DE).  If I load this file as UTF-8 encoded, Emacs gives this as
> the output of `C-u C-x =':

>                position: 1 of 2 (0%), column: 0
>               character: 二 (displayed as 二) (codepoint 20108, #o47214, #x4e8c)
[...]
> Look what Emacs says about the file code.  If I save this
> one-character file as UTF-8, the character code stays as-is.

> This behaviour is clearly wrong.

Sure.

> I suspect that Emacs is using such a
> high character code for internal representation of the `emacs-mule'
> encoding.  However, the user must not see this.  

That higher character code area is used for two purposes.

One is for reading CJK characters of legacy encodings (euc,
sjis, big5, etc).  They are decoded into the utf-8-emacs
byte sequence corresponding to the higher character code
area.  On getting their character code, most of them
are unified into Unicode BMP characters, but a few are left
un-unified.  Those are private characters in each legacy
character set.

Another is for supporting non-Unicode characters.  The
biggest set is GB18030.

In both cases, the user surely sees them.

> Instead, such characters must be converted to correct
> UTF-8.

??? I don't understand what you mean by "correct UTF-8".

I think the correct behaviour on reading such a file by
utf-8 is to treat each byte as raw-byte.

---
Kenichi Handa
handa@gnu.org


* bug#12291: [rev 109796] wrong UTF-8 handling
  2012-08-28 14:57 ` Kenichi Handa
@ 2012-08-28 19:22   ` Werner LEMBERG
  2012-08-31 10:40     ` Eli Zaretskii
  0 siblings, 1 reply; 10+ messages in thread
From: Werner LEMBERG @ 2012-08-28 19:22 UTC (permalink / raw)
  To: handa; +Cc: 12291, smithcu

> In both cases, the user surely sees them.

OK.  BTW, the real use-case is a bug in emacs 23.x which prevented
correct conversion from emacs-mule encoding to utf-8, creating such
funnily encoded utf-8 files (I can't repeat this problem with my
recently compiled emacs, so it seems that it has been fixed
meanwhile).

>> Instead, such characters must be converted to correct
>> UTF-8.
> 
> ??? I don't understand what you mean by "correct UTF-8".

Sorry, I meant correct Unicode.  U+1351DE is larger than the
largest valid Unicode value.  As my example demonstrates, the Chinese
character in the file is certainly *neither* a private character nor a
character from GB 18030, so it should be converted to a regular
Unicode value.

> I think the correct behaviour on reading such a file by utf-8 is to
> treat each byte as raw-byte.

Maybe.  I'm not sure how Emacs should behave in reading such files.

    Werner


* bug#12291: [rev 109796] wrong UTF-8 handling
  2012-08-28 19:22   ` Werner LEMBERG
@ 2012-08-31 10:40     ` Eli Zaretskii
  2012-09-03  0:59       ` Kenichi Handa
  0 siblings, 1 reply; 10+ messages in thread
From: Eli Zaretskii @ 2012-08-31 10:40 UTC (permalink / raw)
  To: Werner LEMBERG; +Cc: 12291, smithcu

> Date: Tue, 28 Aug 2012 21:22:26 +0200 (CEST)
> From: Werner LEMBERG <wl@gnu.org>
> Cc: 12291@debbugs.gnu.org, smithcu@gvsu.edu
> 
> > I think the correct behaviour on reading such a file by utf-8 is to
> > treat each byte as raw-byte.
> 
> Maybe.  I'm not sure how Emacs should behave in reading such files.

We can either read them as raw bytes, or convert them to u+FFFD.  The
former sounds like a more useful behavior to me, FWIW.






* bug#12291: [rev 109796] wrong UTF-8 handling
  2012-08-31 10:40     ` Eli Zaretskii
@ 2012-09-03  0:59       ` Kenichi Handa
  2012-09-03  2:40         ` Eli Zaretskii
  0 siblings, 1 reply; 10+ messages in thread
From: Kenichi Handa @ 2012-09-03  0:59 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 12291, smithcu

In article <83bohrqr83.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> > Date: Tue, 28 Aug 2012 21:22:26 +0200 (CEST)
> > From: Werner LEMBERG <wl@gnu.org>
> > Cc: 12291@debbugs.gnu.org, smithcu@gvsu.edu
> > 
> > > I think the correct behaviour on reading such a file by utf-8 is to
> > > treat each byte as raw-byte.
> > 
> > Maybe.  I'm not sure how Emacs should behave in reading such files.

> We can either read them as raw bytes, or convert them to u+FFFD.  The
> former sounds like a more useful behavior to me, FWIW.

What to convert to U+FFFD?  Each byte, or the byte sequence?

Anyway, we can't simply convert them to U+FFFD because it
results in a change of file contents just by reading and
writing.  We could add post-read-conversion and
pre-write-conversion functions to the coding system utf-8
to perform the conversion (adding text properties for
reverting) and the reverting (using the text properties
attached at the time of reading).  But is it worth doing that?

I think converting each invalid byte to raw-byte is simpler
and equally useful.
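
The round-trip concern can be made concrete with a rough sketch in Python. This is an analogy only: Python's "replace" error handler stands in for converting to U+FFFD, and its "surrogateescape" handler stands in for keeping each invalid byte as a raw byte. Neither is the Emacs mechanism discussed here.

```python
original = bytes([0xF4, 0xB5, 0x87, 0x9E])

# Option 1: substitute U+FFFD for undecodable input.
# The original byte values are gone after decoding.
replaced = original.decode("utf-8", errors="replace")
lossy = replaced.encode("utf-8")

# Option 2: keep every undecodable byte as its own placeholder code,
# so that encoding can restore the exact byte values.
escaped = original.decode("utf-8", errors="surrogateescape")
lossless = escaped.encode("utf-8", errors="surrogateescape")

print(lossy == original)     # read + write changed the contents
print(lossless == original)  # read + write preserved the contents
```

With the replacement approach, reading and then writing the file alters it; with the per-byte approach the bytes survive unchanged, without any extra bookkeeping such as text properties.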

---
Kenichi Handa
handa@gnu.org


* bug#12291: [rev 109796] wrong UTF-8 handling
  2012-09-03  0:59       ` Kenichi Handa
@ 2012-09-03  2:40         ` Eli Zaretskii
  0 siblings, 0 replies; 10+ messages in thread
From: Eli Zaretskii @ 2012-09-03  2:40 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: 12291, smithcu

> From: Kenichi Handa <handa@gnu.org>
> Cc: wl@gnu.org, 12291@debbugs.gnu.org, smithcu@gvsu.edu
> Date: Mon, 03 Sep 2012 09:59:22 +0900
> 
> > We can either read them as raw bytes, or convert them to u+FFFD.  The
> > former sounds like a more useful behavior to me, FWIW.
> 
> What to convert to U+FFFD?  Each byte, or the byte sequence?

The byte sequence.

> Anyway, we can't simply convert them to U+FFFD because it
> results in change of file contents just by reading and
> writing.

Yes, and that's why I prefer the raw-bytes way.

> I think converting each invalid byte to raw-byte is simpler
> and equally useful.

It's more useful, I think.






* bug#12291: [rev 109796] wrong UTF-8 handling
  2012-08-28  5:47 bug#12291: [rev 109796] wrong UTF-8 handling Werner LEMBERG
  2012-08-28  9:03 ` Andreas Schwab
  2012-08-28 14:57 ` Kenichi Handa
@ 2022-01-27 16:32 ` Lars Ingebrigtsen
  2022-01-27 16:52   ` Eli Zaretskii
  2 siblings, 1 reply; 10+ messages in thread
From: Lars Ingebrigtsen @ 2022-01-27 16:32 UTC (permalink / raw)
  To: Werner LEMBERG; +Cc: 12291, Curtis Smith

Werner LEMBERG <wl@gnu.org> writes:

> Have a look at the attached file, containing a single character.
> (It's transmitted as binary to avoid e-mail encoding issues).  It
> contains a single, four-byte UTF-8 encoded character (0xF4 0xB5 0x87
> 0x9E, which would map to the non-existent Unicode character code
> U+1351DE).  If I load this file as UTF-8 encoded, Emacs gives this as
> the output of `C-u C-x =':
>
>                position: 1 of 2 (0%), column: 0
>               character: 二 (displayed as 二) (codepoint 20108, #o47214, #x4e8c)
>       preferred charset: unicode (Unicode (ISO10646))

(I'm going through old bug reports that unfortunately weren't resolved
at the time.)

This has changed at some point between when this was reported and now:

             position: 1 of 2 (0%), column: 0
            character:  (displayed as ) (codepoint 1266142, #o4650736, #x1351de)
              charset: emacs (Full Emacs charset (excluding eight bit chars))
code point in charset: 0x1351DE
               syntax: w 	which means: word
             category: L:Strong L2R
             to input: type "C-x 8 RET 1351de"

So Emacs now displays more accurate information about the utf-8
sequence.

It was pointed out that this sequence is outside the Unicode range,
which only extends up to U+10FFFF, and that Emacs should perhaps display
this as a number of raw bytes instead.  Is that something we still want
to pursue, or is Emacs behaving like we want to here?  Eli?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#12291: [rev 109796] wrong UTF-8 handling
  2022-01-27 16:32 ` Lars Ingebrigtsen
@ 2022-01-27 16:52   ` Eli Zaretskii
  2022-02-25  2:33     ` Lars Ingebrigtsen
  0 siblings, 1 reply; 10+ messages in thread
From: Eli Zaretskii @ 2022-01-27 16:52 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: 12291, smithcu

> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: 12291@debbugs.gnu.org,  Curtis Smith <smithcu@gvsu.edu>, Eli Zaretskii
>  <eliz@gnu.org>
> Date: Thu, 27 Jan 2022 17:32:53 +0100
> 
>              position: 1 of 2 (0%), column: 0
>             character:  (displayed as ) (codepoint 1266142, #o4650736, #x1351de)
>               charset: emacs (Full Emacs charset (excluding eight bit chars))
> code point in charset: 0x1351DE
>                syntax: w 	which means: word
>              category: L:Strong L2R
>              to input: type "C-x 8 RET 1351de"
> 
> So Emacs now displays more accurate information about the utf-8
> sequence.
> 
> It was pointed out that this sequence is outside the Unicode range,
> which only extends up to U+10FFFF, and that Emacs should perhaps display
> this as a number of raw bytes instead.  Is that something we still want
> to pursue, or is Emacs behaving like we want to here?  Eli?

This is the expected behavior.  The raw bytes start at #x3FFF00, so
#x1351de is some character code reserved for characters not unified with
Unicode (some CJK encodings have them).  Interpreting them as raw
bytes would be counter-productive.
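
The ranges mentioned above can be summarized in a short sketch. The two boundaries are the ones quoted in this thread (U+10FFFF as the Unicode limit, #x3FFF00 as the start of the raw bytes); the function name `classify` and the category labels are made up for illustration, and the upper end of the Emacs character range is not modeled.

```python
MAX_UNICODE = 0x10FFFF    # Unicode limit, as quoted in the thread
RAW_BYTE_BASE = 0x3FFF00  # start of the raw bytes, as stated above


def classify(code: int) -> str:
    """Sort a character code into the three ranges discussed in the thread."""
    if code < 0:
        raise ValueError("character codes are non-negative")
    if code <= MAX_UNICODE:
        return "unicode"
    if code < RAW_BYTE_BASE:
        return "non-unicode character"
    return "raw byte"


for code in (0x4E8C, 0x1351DE, 0x3FFF80):
    print(hex(code), classify(code))
```

Under these assumptions #x1351de falls into the middle range, which is why displaying it as raw bytes would misrepresent it.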

I'm not sure what Werner's problem with this was, so maybe let him
chime in and explain more.






* bug#12291: [rev 109796] wrong UTF-8 handling
  2022-01-27 16:52   ` Eli Zaretskii
@ 2022-02-25  2:33     ` Lars Ingebrigtsen
  0 siblings, 0 replies; 10+ messages in thread
From: Lars Ingebrigtsen @ 2022-02-25  2:33 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 12291, smithcu

Eli Zaretskii <eliz@gnu.org> writes:

> This is the expected behavior.  The raw bytes start at #x3FFF00, so
> #x1351de is some character code reserved for characters not unified with
> Unicode (some CJK encodings have them).  Interpreting them as raw
> bytes would be counter-productive.
>
> I'm not sure what was Werner's problem with this, so maybe let him
> chime in and explain more.

This was a month ago, and there was no followup, so there doesn't seem
to be anything to be done on the Emacs side here, and I'm therefore
closing this bug report.  If there's something that should be changed in
Emacs, please respond to the debbugs address and we'll reopen.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




