unofficial mirror of bug-gnu-emacs@gnu.org 
From: Eli Zaretskii <eliz@gnu.org>
To: Ruijie Yu <yuruijie@sics.ac.cn>
Cc: 66760@debbugs.gnu.org
Subject: bug#66760: 29.1; [BUG] GB18030 Incorrect Encoding
Date: Thu, 26 Oct 2023 16:26:52 +0300
Message-ID: <83v8atfrab.fsf@gnu.org>
In-Reply-To: <1015f5fcf69b9c0656d42932da193bd4@sics.ac.cn> (yuruijie@sics.ac.cn)

> Date: Thu, 26 Oct 2023 19:43:54 +0800
> From: "Ruijie Yu" <yuruijie@sics.ac.cn>
> 
> Hello,
> 
> I have noticed that in GB18030 encoding, certain ranges of characters
> have incorrect encodings.
> 
> One example is U+217A (SMALL ROMAN NUMERAL ELEVEN).  The expected
> encoding is 81 36 C5 30 (as can be seen from the GB18030 standard [1]
> and verified from other programs such as iconv and MySQL), whereas the
> observed encoding within Emacs is 81 36 C4 39, with a 1-codepoint
> offset.
> 
> This behavior can be reproduced by the following recipe under both
> GNU/Linux and Windows:
> 
> --8<---------------cut here---------------start------------->8---
> $ emacs
> C-x h DEL
> C-x C-m f gb18030 RET
> C-x 8 RET 217a RET
> M-<
> C-u C-x =
> ;; observe the "file code":
> ;; file code: #x81 #x36 #xC4 #x39 (encoded by coding system chinese-gb18030-dos)
> --8<---------------cut here---------------end--------------->8---
> 
> In contrast, this is what I get on MySQL (which I have also verified
> against the GB18030 standard):
> 
> --8<---------------cut here---------------start------------->8---
> > CREATE TABLE gb (id INT, c TEXT CHARACTER SET GB18030);
> > INSERT INTO gb VALUES (0, 'ⅺ');
> > SELECT HEX(c) FROM gb;
> 
> +----------+
> | hex(c)   |
> +----------+
> | 8136C530 |
> +----------+
> --8<---------------cut here---------------end--------------->8---
> 
> Beyond this, I also noticed that U+A642 (CYRILLIC CAPITAL LETTER DZELO)
> has the encoding 82 36 B9 36 on Emacs, whereas MySQL has 82 36 BA 35,
> which has an offset of 9 codepoints.
> 
> Could someone with more expertise and time look into why there is a
> mismatch between Emacs' GB18030 data and the standard?

Alas, we don't have such experts on board, not anymore.  So we must do
it on our own somehow.

The mapping of GB18030 to Unicode is taken from glibc, see
etc/charsets/GB180302.map and etc/charsets/GB180304.map.  It is
possible that you are talking about a newer version of the GB18030
standard than the one these two mappings are based on.  It is also
possible that glibc has since updated its mappings, and we failed to
follow suit.  If so, we need either to update the existing mappings
or to add newer ones.  Could you please see what needs to be done in
this regard?
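
For reference, the offsets quoted in the report can be checked with plain
arithmetic, without consulting any mapping table.  The sketch below assumes
only the usual four-byte GB18030 layout (first and third bytes in 0x81..0xFE,
second and fourth bytes in 0x30..0x39) and converts each sequence to a linear
index.  The names gb18030_linear and offset are hypothetical helpers made up
for this illustration; they are not part of Emacs or glibc.

```python
def gb18030_linear(seq: bytes) -> int:
    """Return the linear index of a four-byte GB18030 sequence.

    Assumes the standard four-byte layout: bytes 1 and 3 range over
    0x81..0xFE (126 values), bytes 2 and 4 over 0x30..0x39 (10 values).
    """
    if len(seq) != 4:
        raise ValueError("expected exactly four bytes")
    b1, b2, b3, b4 = seq
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a valid four-byte GB18030 sequence")
    # Mixed-radix number with digit bases 126, 10, 126, 10.
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def offset(expected_hex: str, observed_hex: str) -> int:
    """Distance, in four-byte positions, between two encodings."""
    expected = gb18030_linear(bytes.fromhex(expected_hex))
    observed = gb18030_linear(bytes.fromhex(observed_hex))
    return expected - observed


# Byte sequences taken from the report above.
print(offset("8136C530", "8136C439"))  # U+217A: expected vs. Emacs
print(offset("8236BA35", "8236B936"))  # U+A642: MySQL vs. Emacs
```

The two results are 1 and 9, matching the offsets given in the report.  That
the offset grows between U+217A and U+A642 would be consistent with the two
tables differing at several points in between, but this arithmetic alone does
not show where.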

