unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#66760: 29.1; [BUG] GB18030 Incorrect Encoding
@ 2023-10-26 11:43 Ruijie Yu
  2023-10-26 13:26 ` Eli Zaretskii
  2023-10-26 14:20 ` Andreas Schwab
From: Ruijie Yu @ 2023-10-26 11:43 UTC
  To: 66760

Hello,

I have noticed that in GB18030 encoding, certain ranges of characters
have incorrect encodings.

One example is U+217A (SMALL ROMAN NUMERAL ELEVEN).  The expected
encoding is 81 36 C5 30 (as can be seen from the GB18030 standard [1]
and verified with other programs such as iconv and MySQL), whereas the
encoding observed in Emacs is 81 36 C4 39, which is one codepoint
lower.

This behavior can be reproduced by the following recipe under both
GNU/Linux and Windows:

--8<---------------cut here---------------start------------->8---
$ emacs
C-x h DEL
C-x C-m f gb18030 RET
C-x 8 RET 217a RET
M-<
C-u C-x =
;; observe the "file code":
;; file code: #x81 #x36 #xC4 #x39 (encoded by coding system chinese-gb18030-dos)
--8<---------------cut here---------------end--------------->8---

In contrast, this is what I get on MySQL (which I have also verified
against the GB18030 standard):

--8<---------------cut here---------------start------------->8---
> CREATE TABLE gb (id INT, c TEXT CHARACTER SET GB18030);
> INSERT INTO gb VALUES (0, 'ⅺ');
> SELECT HEX(c) FROM gb;

+----------+
| hex(c)   |
+----------+
| 8136C530 |
+----------+
--8<---------------cut here---------------end--------------->8---
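
As a further cross-check outside Emacs and MySQL, the same character can
be run through another GB18030 implementation.  The following is only a
sketch using Python's built-in "gb18030" codec; the helper name
gb18030_hex is made up for illustration, and I am not asserting here what
Python prints, only that the result can be compared against the
81 36 C5 30 quoted above.

```python
def gb18030_hex(text: str) -> str:
    """Return the GB18030 encoding of TEXT as an uppercase hex string."""
    return text.encode("gb18030").hex().upper()


if __name__ == "__main__":
    ch = "\u217a"            # SMALL ROMAN NUMERAL ELEVEN
    reported = "8136C530"    # value quoted in this report (MySQL, iconv)
    actual = gb18030_hex(ch)
    print(f"U+{ord(ch):04X} -> {actual} (report expects {reported})")
    print("match" if actual == reported else "MISMATCH")
```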

Beyond this, I also noticed that U+A642 (CYRILLIC CAPITAL LETTER DZELO)
is encoded as 82 36 B9 36 in Emacs, whereas MySQL gives 82 36 BA 35, an
offset of 9 codepoints.
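
The two offsets can be checked by converting each four-byte sequence to
its linear position in the GB18030 four-byte space (bytes 1 and 3 range
over 0x81..0xFE, bytes 2 and 4 over 0x30..0x39).  The following Python is
a minimal sketch of that arithmetic; the function name gb18030_linear is
hypothetical and is not taken from Emacs or any library.

```python
def gb18030_linear(seq: bytes) -> int:
    """Linear index of a four-byte GB18030 sequence (0 for 81 30 81 30)."""
    if len(seq) != 4:
        raise ValueError("expected exactly four bytes")
    b1, b2, b3, b4 = seq
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a valid four-byte GB18030 sequence")
    # Mixed-radix number with radices 126, 10, 126, 10.
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def offset(expected_hex: str, observed_hex: str) -> int:
    """How many positions OBSERVED lies below EXPECTED."""
    return (gb18030_linear(bytes.fromhex(expected_hex))
            - gb18030_linear(bytes.fromhex(observed_hex)))


if __name__ == "__main__":
    print(offset("8136C530", "8136C439"))  # U+217A: prints 1
    print(offset("8236BA35", "8236B936"))  # U+A642: prints 9
```

Both Emacs values come out lower than the expected ones, by 1 and by 9
positions respectively.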

Could someone with more expertise and time look into why there is a
mismatch between Emacs' GB18030 data and the standard?

[1]:
https://openstd.samr.gov.cn/bzgk/gb/newGbInfo?hcno=A1931A578FE14957104988029B0833D3
(200+MB PDF.  Unfortunately, this is the only official source that I can
find, and it requires a captcha.)

-- 

Best,

RY

In GNU Emacs 29.1 (build 2, x86_64-w64-mingw32) of 2023-08-02 built on
 AVALON
Windowing system distributor 'Microsoft Corp.', version 10.0.19045
System Description: Microsoft Windows 10 Enterprise (v10.0.2009.19045.3086)

Configured using:
 'configure --with-modules --without-dbus --with-native-compilation=aot
 --without-compress-install --with-tree-sitter CFLAGS=-O2'

Configured features:
ACL GIF GMP GNUTLS HARFBUZZ JPEG JSON LCMS2 LIBXML2 MODULES NATIVE_COMP
NOTIFY W32NOTIFY PDUMPER PNG RSVG SOUND SQLITE3 THREADS TIFF
TOOLKIT_SCROLL_BARS TREE_SITTER WEBP XPM ZLIB

(NATIVE_COMP present but libgccjit not available)

Important settings:
  value of $LANG: CHS
  locale-coding-system: cp936

Major mode: Lisp Interaction



Thread overview: 4+ messages
2023-10-26 11:43 bug#66760: 29.1; [BUG] GB18030 Incorrect Encoding Ruijie Yu
2023-10-26 13:26 ` Eli Zaretskii
2023-10-26 14:20 ` Andreas Schwab
2023-11-04  8:25   ` Eli Zaretskii
