unofficial mirror of bug-gnu-emacs@gnu.org 
From: "Ruijie Yu" <yuruijie@sics.ac.cn>
To: 66760@debbugs.gnu.org
Subject: bug#66760: 29.1; [BUG] GB18030 Incorrect Encoding
Date: Thu, 26 Oct 2023 19:43:54 +0800
Message-ID: <1015f5fcf69b9c0656d42932da193bd4@sics.ac.cn>

Hello,

I have noticed that Emacs encodes certain ranges of characters
incorrectly in GB18030.

One example is U+217A (SMALL ROMAN NUMERAL ELEVEN).  The expected
encoding is 81 36 C5 30 (as can be seen from the GB18030 standard [1]
and verified with other programs such as iconv and MySQL), whereas
Emacs produces 81 36 C4 39, which is one code point lower than
expected.

This behavior can be reproduced by the following recipe under both
GNU/Linux and Windows:

--8<---------------cut here---------------start------------->8---
$ emacs
C-x h DEL
C-x C-m f gb18030 RET
C-x 8 RET 217a RET
M-<
C-u C-x =
;; observe the "file code":
;; file code: #x81 #x36 #xC4 #x39 (encoded by coding system chinese-gb18030-dos)
--8<---------------cut here---------------end--------------->8---

In contrast, this is what I get on MySQL (which I have also verified
against the GB18030 standard):

--8<---------------cut here---------------start------------->8---
> CREATE TABLE gb (id INT, c TEXT CHARACTER SET GB18030);
> INSERT INTO gb VALUES (0, 'ⅺ');
> SELECT HEX(c) FROM gb;

+----------+
| hex(c)   |
+----------+
| 8136C530 |
+----------+
--8<---------------cut here---------------end--------------->8---

Beyond this, I also noticed that Emacs encodes U+A642 (CYRILLIC CAPITAL
LETTER DZELO) as 82 36 B9 36, whereas MySQL gives 82 36 BA 35.  Here
the Emacs value is 9 code points lower than expected.
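
The two offsets reported above can be checked with a little arithmetic.
The following is only a sketch, assuming the usual four-byte GB18030
layout (first and third bytes in 0x81-0xFE, second and fourth bytes in
0x30-0x39).  The names gb18030_linear, offset and REPORTED are
illustrative and come from neither Emacs nor MySQL.

```python
def gb18030_linear(seq: bytes) -> int:
    """Return the linear index of a four-byte GB18030 sequence.

    Assumes the four-byte layout: bytes 1 and 3 range over 0x81-0xFE
    (126 values), bytes 2 and 4 range over 0x30-0x39 (10 values).
    """
    if len(seq) != 4:
        raise ValueError("expected a four-byte GB18030 sequence")
    b1, b2, b3, b4 = seq
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("bytes outside the four-byte GB18030 ranges")
    # Mixed-radix number with digits (b1, b2, b3, b4) in bases (126, 10, 126, 10).
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def offset(expected_hex: str, observed_hex: str) -> int:
    """Number of code positions by which the observed bytes fall short."""
    expected = gb18030_linear(bytes.fromhex(expected_hex))
    observed = gb18030_linear(bytes.fromhex(observed_hex))
    return expected - observed


# (expected bytes per iconv/MySQL, bytes observed in Emacs), from the report.
REPORTED = {
    "U+217A": ("8136C530", "8136C439"),
    "U+A642": ("8236BA35", "8236B936"),
}

if __name__ == "__main__":
    for char, (expected_hex, observed_hex) in REPORTED.items():
        print(char, "offset:", offset(expected_hex, observed_hex))
```

Run as a script, this should print an offset of 1 for U+217A and 9 for
U+A642, matching the figures quoted above.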

Could someone with more expertise and time look into why there is a
mismatch between Emacs' GB18030 data and the standard?

[1]:
https://openstd.samr.gov.cn/bzgk/gb/newGbInfo?hcno=A1931A578FE14957104988029B0833D3
(200+MB PDF.  Unfortunately this is the only official source which I can find, and it
requires a captcha.)

-- 

Best,

RY

In GNU Emacs 29.1 (build 2, x86_64-w64-mingw32) of 2023-08-02 built on
 AVALON
Windowing system distributor 'Microsoft Corp.', version 10.0.19045
System Description: Microsoft Windows 10 Enterprise (v10.0.2009.19045.3086)

Configured using:
 'configure --with-modules --without-dbus --with-native-compilation=aot
 --without-compress-install --with-tree-sitter CFLAGS=-O2'

Configured features:
ACL GIF GMP GNUTLS HARFBUZZ JPEG JSON LCMS2 LIBXML2 MODULES NATIVE_COMP
NOTIFY W32NOTIFY PDUMPER PNG RSVG SOUND SQLITE3 THREADS TIFF
TOOLKIT_SCROLL_BARS TREE_SITTER WEBP XPM ZLIB

(NATIVE_COMP present but libgccjit not available)

Important settings:
  value of $LANG: CHS
  locale-coding-system: cp936

Major mode: Lisp Interaction

