From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eli Zaretskii
Newsgroups: gmane.emacs.bugs
Subject: bug#66760: 29.1; [BUG] GB18030 Incorrect Encoding
Date: Thu, 26 Oct 2023 16:26:52 +0300
Message-ID: <83v8atfrab.fsf@gnu.org>
References: <1015f5fcf69b9c0656d42932da193bd4@sics.ac.cn>
In-Reply-To: <1015f5fcf69b9c0656d42932da193bd4@sics.ac.cn> (yuruijie@sics.ac.cn)
To: Ruijie Yu
Cc: 66760@debbugs.gnu.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8

> Date: Thu, 26 Oct 2023 19:43:54 +0800
> From: "Ruijie Yu"
>
> Hello,
>
> I have noticed that in GB18030 encoding, certain ranges of characters
> have incorrect encodings.
>
> One example is U+217A (SMALL ROMAN NUMERAL ELEVEN). The expected
> encoding is 81 36 C5 30 (as can be seen from the GB18030 standard [1]
> and verified from other programs such as iconv and MySQL), whereas the
> observed encoding within Emacs is 81 36 C4 39, with a 1-codepoint
> offset.
>
> This behavior can be reproduced by the following recipe under both
> GNU/Linux and Windows:
>
> --8<---------------cut here---------------start------------->8---
> $ emacs
> C-x h DEL
> C-x C-m f gb18030 RET
> C-x 8 RET 217a RET
> M-<
> C-u C-x =
> ;; observe the "file code":
> ;; file code: #x81 #x36 #xC4 #x39 (encoded by coding system chinese-gb18030-dos)
> --8<---------------cut here---------------end--------------->8---
>
> In contrast, this is what I get on MySQL (which I have also verified
> against the GB18030 standard):
>
> --8<---------------cut here---------------start------------->8---
> > CREATE TABLE gb (id INT, c TEXT CHARACTER SET GB18030);
> > INSERT INTO gb VALUES (0, 'ⅺ');
> > SELECT HEX(c) FROM gb;
> +----------+
> | hex(c)   |
> +----------+
> | 8136C530 |
> +----------+
> --8<---------------cut here---------------end--------------->8---
>
> Beyond this, I also noticed that U+A642 (CYRILLIC CAPITAL LETTER DZELO)
> has the encoding 82 36 B9 36 on Emacs, whereas MySQL has 82 36 BA 35,
> which has an offset of 9 codepoints.
>
> Could someone with more expertise and time look into why there is a
> mismatch between Emacs' GB18030 data and the standard?

Alas, we don't have such experts on board, not anymore. So we must do
it on our own somehow.

The mapping of GB18030 to Unicode is taken from glibc, see
etc/charsets/GB180302.map and etc/charsets/GB180304.map.
It is possible that you are talking about a newer version of the
GB18030 standard than the one these two mappings were derived from.
It is also possible that glibc has since updated its mappings, and we
failed to follow suit. If so, we need either to update the existing
mappings or to add newer ones.

Could you please look into what needs to be done in this regard?
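
When comparing any updated mappings against what Emacs produces, it may
help to express the reported mismatches as positions in the four-byte
code space rather than as raw bytes. The following is a minimal Python
sketch that does this for the byte sequences quoted in the report. It
assumes only the standard four-byte GB18030 layout (first and third
bytes in 0x81..0xFE, second and fourth bytes in 0x30..0x39). The names
gb18030_linear, REPORTED and offsets are hypothetical helpers chosen for
illustration; the script does not consult any mapping table and does not
show which side is correct.

```python
def gb18030_linear(seq: bytes) -> int:
    """Return the linear index of a four-byte GB18030 sequence.

    Assumption: standard four-byte layout, where bytes 1 and 3 take
    126 values (0x81..0xFE) and bytes 2 and 4 take 10 values
    (0x30..0x39).
    """
    if len(seq) != 4:
        raise ValueError("expected exactly four bytes")
    b1, b2, b3, b4 = seq
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a valid four-byte GB18030 sequence")
    # Mixed-radix number with digit bases 126, 10, 126, 10.
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


# Byte sequences as quoted in the report:
# (code point, bytes observed in Emacs, bytes reported from other tools).
REPORTED = [
    ("U+217A", bytes.fromhex("8136C439"), bytes.fromhex("8136C530")),
    ("U+A642", bytes.fromhex("8236B936"), bytes.fromhex("8236BA35")),
]


def offsets() -> dict:
    """Map each reported code point to (expected index - Emacs index)."""
    return {
        name: gb18030_linear(expected) - gb18030_linear(emacs)
        for name, emacs, expected in REPORTED
    }


if __name__ == "__main__":
    for name, delta in offsets().items():
        print(f"{name}: Emacs encoding is {delta} position(s) below the expected one")
```

For the two sequences in the report this gives differences of 1 and 9,
matching the offsets described there. An offset that grows with the
code point would be consistent with entries missing from the ranges in
the mapping files, but that is only a guess until the files are
compared with a current mapping.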