unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#34862: 27.0.50; Trying to update pinyin.map
From: Eric Abrahamsen @ 2019-03-14 21:49 UTC
  To: 34862


As discussed in bug#34215, I'm trying to update the
romanization-to-Chinese-character mapping in the
file ./leim/MISC-DIC/pinyin.map to use the more complete mapping
provided by the Google pinyin input method, licensed under Apache 2.0.
This expands the number of characters recognized by Emacs from around
7,000 to around 17,000. (And increases the size of the mapping file from
18K to 53K.)

I'm running into encoding problems when adding the new characters --
Emacs says some of the characters can't be written using the existing
coding system. The original file has an encoding cookie reading coding:
cn-gb-2312, and describing the coding system gives me:

chinese-iso-8bit-dos (alias: cn-gb-2312-dos euc-china-dos euc-cn-dos
  cn-gb-dos gb2312-dos)

The characters *can* be encoded using gb18030, and of course utf-8. The
Wikipedia page for gb18030 describes gb2312 as "legacy"[1], and says
gb18030 is a superset of gb2312.

Is there any reason not to go straight to utf8 for this file? If that's
not okay, would gb18030 be acceptable?

Codepoint 23744 is an example of a character that can be encoded with
gb18030 but not gb2312. It also exercises my font engine.
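
For anyone who wants to reproduce the check outside Emacs, here is a
rough sketch in Python. It assumes that 23744 is a decimal codepoint
(U+5CC0) and that Python's "gb2312" and "gb18030" codecs are close
enough to the Emacs coding systems of the same names; neither
assumption comes from the Emacs sources. The helper names can_encode
and unencodable are made up for this example.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character in text is representable in codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


def unencodable(text: str, codec: str) -> list[str]:
    """Return the distinct characters of text that codec cannot represent."""
    return sorted({ch for ch in text if not can_encode(ch, codec)})


if __name__ == "__main__":
    # Assumption: the codepoint quoted in the message is decimal.
    ch = chr(23744)
    for codec in ("gb2312", "gb18030", "utf-8"):
        print(codec, can_encode(ch, codec))
```

Running unencodable() over the text of a candidate pinyin.map with
"gb2312" as the codec would list the characters that force a move to a
wider coding system.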

I have two other questions, about reducing vc churn, and how to insert
the license at the top of the file, but I figured I'd ask this first.

Thanks,
Eric

[1]  https://en.wikipedia.org/wiki/GB_18030








Thread overview: 12+ messages
2019-03-14 21:49 bug#34862: 27.0.50; Trying to update pinyin.map Eric Abrahamsen
2019-03-15  5:03 ` Eli Zaretskii
2019-03-15  5:58   ` Eric Abrahamsen
2019-03-15  7:04     ` Eli Zaretskii
2019-03-15 18:31       ` Eric Abrahamsen
2019-03-20  9:45         ` Eli Zaretskii
2019-03-20 19:30           ` Eric Abrahamsen
2019-03-20 19:39             ` Eli Zaretskii
2019-03-20 19:41               ` Eric Abrahamsen
2022-02-02 18:59             ` Lars Ingebrigtsen
2022-02-08  0:26               ` Eric Abrahamsen
2022-02-08  6:12                 ` Lars Ingebrigtsen
