From: "Ruijie Yu" <yuruijie@sics.ac.cn>
To: 66760@debbugs.gnu.org
Subject: bug#66760: 29.1; [BUG] GB18030 Incorrect Encoding
Date: Thu, 26 Oct 2023 19:43:54 +0800
Message-ID: <1015f5fcf69b9c0656d42932da193bd4@sics.ac.cn>
Hello,
I have noticed that Emacs encodes certain ranges of characters
incorrectly in GB18030.
One example is U+217A (SMALL ROMAN NUMERAL ELEVEN). The expected
encoding is 81 36 C5 30 (as can be seen from the GB18030 standard [1]
and verified from other programs such as iconv and MySQL), whereas the
observed encoding within Emacs is 81 36 C4 39, which is off by one
code point.
This behavior can be reproduced by the following recipe under both
GNU/Linux and Windows:
--8<---------------cut here---------------start------------->8---
$ emacs
C-x h DEL
C-x C-m f gb18030 RET
C-x 8 RET 217a RET
M-<
C-u C-x =
;; observe the "file code":
;; file code: #x81 #x36 #xC4 #x39 (encoded by coding system chinese-gb18030-dos)
--8<---------------cut here---------------end--------------->8---
In contrast, this is what I get on MySQL (which I have also verified
against the GB18030 standard):
--8<---------------cut here---------------start------------->8---
> CREATE TABLE gb (id INT, c TEXT CHARACTER SET GB18030);
> INSERT INTO gb VALUES (0, 'ⅺ');
> SELECT HEX(c) FROM gb;
+----------+
| hex(c)   |
+----------+
| 8136C530 |
+----------+
--8<---------------cut here---------------end--------------->8---
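For a cross-check that needs neither Emacs nor a database, here is a
minimal sketch that assumes Python's built-in "gb18030" codec is an
acceptable second opinion. The helper name gb18030_hex is hypothetical,
and the result reflects whichever revision of the mapping the local
Python ships, so it should be compared against the standard rather than
taken as authoritative.

```python
def gb18030_hex(ch: str) -> str:
    """Return the GB18030 bytes of `ch` as upper-case hex, per Python's codec."""
    return ch.encode("gb18030").hex().upper()


# The two characters discussed in this report.
for ch in ("\u217a", "\ua642"):
    print(f"U+{ord(ch):04X} -> {gb18030_hex(ch)}")
```

The printed hex strings can be compared directly with the "file code"
shown by C-u C-x = and with the MySQL output above.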
Beyond this, I also noticed that U+A642 (CYRILLIC CAPITAL LETTER DZELO)
is encoded as 82 36 B9 36 in Emacs, whereas MySQL gives 82 36 BA 35,
an offset of 9 code points.
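The two offsets can be checked by arithmetic. The sketch below assumes
the usual four-byte GB18030 layout (first and third bytes in 0x81..0xFE,
second and fourth bytes in 0x30..0x39) and flattens a sequence into a
linear position. The names gb18030_linear and offset are hypothetical
helpers, not part of Emacs or of any library.

```python
def gb18030_linear(seq: bytes) -> int:
    """Map a four-byte GB18030 sequence to its position in the four-byte space."""
    if len(seq) != 4:
        raise ValueError("expected exactly four bytes")
    b1, b2, b3, b4 = seq
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a well-formed four-byte sequence")
    # Bytes 2 and 4 take 10 values each; byte 3 takes 126 values.
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def offset(observed_hex: str, expected_hex: str) -> int:
    """Number of positions by which the observed sequence trails the expected one."""
    return (gb18030_linear(bytes.fromhex(expected_hex))
            - gb18030_linear(bytes.fromhex(observed_hex)))


print(offset("8136C439", "8136C530"))  # U+217A: Emacs vs. standard -> 1
print(offset("8236B936", "8236BA35"))  # U+A642: Emacs vs. MySQL -> 9
```

In both cases the Emacs value is the lower one, and the gap grows from
1 to 9 between U+217A and U+A642.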
Could someone with more expertise and time look into why there is a
mismatch between Emacs' GB18030 data and the standard?
[1]: https://openstd.samr.gov.cn/bzgk/gb/newGbInfo?hcno=A1931A578FE14957104988029B0833D3
(200+ MB PDF. Unfortunately, this is the only official source I can
find, and it requires a captcha.)
--
Best,
RY
In GNU Emacs 29.1 (build 2, x86_64-w64-mingw32) of 2023-08-02 built on
AVALON
Windowing system distributor 'Microsoft Corp.', version 10.0.19045
System Description: Microsoft Windows 10 Enterprise (v10.0.2009.19045.3086)
Configured using:
'configure --with-modules --without-dbus --with-native-compilation=aot
--without-compress-install --with-tree-sitter CFLAGS=-O2'
Configured features:
ACL GIF GMP GNUTLS HARFBUZZ JPEG JSON LCMS2 LIBXML2 MODULES NATIVE_COMP
NOTIFY W32NOTIFY PDUMPER PNG RSVG SOUND SQLITE3 THREADS TIFF
TOOLKIT_SCROLL_BARS TREE_SITTER WEBP XPM ZLIB
(NATIVE_COMP present but libgccjit not available)
Important settings:
value of $LANG: CHS
locale-coding-system: cp936
Major mode: Lisp Interaction
Thread overview: 4+ messages
2023-10-26 11:43 Ruijie Yu [this message]
2023-10-26 13:26 ` bug#66760: 29.1; [BUG] GB18030 Incorrect Encoding Eli Zaretskii
2023-10-26 14:20 ` Andreas Schwab
2023-11-04 8:25 ` Eli Zaretskii