From: Eric Abrahamsen
Newsgroups: gmane.emacs.bugs
Subject: bug#34862: 27.0.50; Trying to update pinyin.map
Date: Thu, 14 Mar 2019 14:49:51 -0700
Message-ID: <87zhpxyvls.fsf@ericabrahamsen.net>
To: 34862@debbugs.gnu.org

As discussed in bug#34215, I'm trying to update the romanization-to-Chinese-character mapping in the file ./leim/MISC-DIC/pinyin.map to use the more complete mapping provided by the Google pinyin input method, licensed under Apache 2.0. This expands the number of characters recognized by Emacs from around 7,000 to around 17,000. (And increases the size of the mapping file from 18K to 53K.)

I'm running into encoding problems when adding the new characters -- Emacs says some of the characters can't be written using the existing coding system. The original file has an encoding cookie reading coding: cn-gb-2312, and describing the coding system gives me:

  chinese-iso-8bit-dos (alias: cn-gb-2312-dos euc-china-dos euc-cn-dos cn-gb-dos gb2312-dos)

The characters *can* be encoded using gb18030, and of course utf8. The Wikipedia page for gb18030 describes gb2312 as "legacy"[1], and says gb18030 is a superset of 2312.
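
For anyone who wants to check the encoding claim outside Emacs, here is a rough sketch in Python. It assumes Python's built-in "gb2312" codec is a fair stand-in for the cn-gb-2312 coding system (the two may differ in edge cases), and the helper names are made up for illustration. It tests codepoint 23744, one of the new characters, against each candidate encoding:

```python
# Sketch only: Python's "gb2312" codec is assumed to approximate
# Emacs's cn-gb-2312 (chinese-iso-8bit); it is not the same implementation.
CODECS = ("gb2312", "gb18030", "utf-8")


def encodable(ch, codec):
    """Return True if the single character `ch` can be encoded with `codec`."""
    try:
        ch.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


def report(codepoint):
    """Map each candidate codec name to whether it can encode `codepoint`."""
    ch = chr(codepoint)
    return {codec: encodable(ch, codec) for codec in CODECS}


if __name__ == "__main__":
    # 23744 is the decimal codepoint of one of the newly added characters.
    for codec, ok in report(23744).items():
        print(f"{codec:8} {'ok' if ok else 'cannot encode'}")
```

The same loop could be run over every character in the new mapping to count how many fall outside gb2312.
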
Is there any reason not to go straight to utf8 for this file? If that's not okay, would gb18030 be acceptable?

Codepoint 23744 is an example of a character that can be encoded with 18030 but not 2312. It also exercises my font engine.

I have two other questions, about reducing vc churn, and how to insert the license at the top of the file, but I figured I'd ask this first.

Thanks,
Eric

[1] https://en.wikipedia.org/wiki/GB_18030