unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#34215: 27.0.50; Provide elisp access to Chinese pinyin-to-character mapping
@ 2019-01-27  5:34 Eric Abrahamsen
  2019-01-27 15:41 ` Eli Zaretskii
  2019-02-24 19:12 ` bug#34215: Eric Abrahamsen
  0 siblings, 2 replies; 20+ messages in thread
From: Eric Abrahamsen @ 2019-01-27  5:34 UTC
  To: 34215

[-- Attachment #1: Type: text/plain, Size: 1728 bytes --]


This bug report is apropos of this[1] emacs-devel thread.

The basic idea is that in the Emacs sources there's a file containing a
mapping between pinyin -- the most common Chinese romanization system --
and Chinese characters themselves. The mapping lives in
leim/MISC-DIC/pinyin.map, and is converted into a quail input method by
the `py-converter' function in titdic-cnv.el, which is part of the
"make" process.

I want this mapping to be available to elisp code in general, because
it's useful for all kinds of other language utilities (searching Chinese
characters using ASCII letters, etc.).
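
As a rough illustration of the search use case, here is a minimal
sketch. It assumes the mapping is available as an alist of
(SYLLABLE . CHARACTERS) strings; `my-sample-pinyin-map' and
`my-pinyin-to-regexp' are hypothetical names, and the sample entries
are hand-written rather than copied from pinyin.map.

```elisp
;; Hypothetical hand-written sample in the shape of the proposed alist.
;; The real data would come from pinyin.map.
(defvar my-sample-pinyin-map
  '(("ni" . "你尼泥")
    ("hao" . "好号毫"))
  "Illustrative (SYLLABLE . CHARACTERS) alist; not real pinyin.map data.")

(defun my-pinyin-to-regexp (syllables map)
  "Return a regexp matching Chinese text spelled by SYLLABLES.
SYLLABLES is a list of pinyin strings looked up in MAP.  A syllable
that is missing from MAP is matched literally."
  (mapconcat
   (lambda (syl)
     (let ((chars (cdr (assoc syl map))))
       (if chars
           ;; One bracket expression per syllable: any candidate character.
           (concat "[" chars "]")
         (regexp-quote syl))))
   syllables
   ""))

;; Search for "ni hao" in a string of Chinese text.
(string-match-p (my-pinyin-to-regexp '("ni" "hao") my-sample-pinyin-map)
                "你好")
```

With the sample data above, the generated regexp is "[你尼泥][好号毫]",
so typing plain ASCII syllables is enough to find the characters.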

pinyin.map is a plain text file, each line consisting of a romanized
syllable, a TAB, and a string of the possible corresponding Chinese
characters. `titdic-convert' parses this and feeds it to
`quail-define-rules'.

My first thought was to add an intermediate step, where `titdic-convert'
first composes an alist, then feeds that alist to `quail-define-rules',
which would also allow us access to the alist.

The more I looked at it, the more hacky and awkward that approach
seemed, and it's not like it would save any memory: you still end up
with the data both in a quail method and in a separate alist.

So this proposed patch simply parses the same file in the same way, but
in a different location. I've put it in china-util.el, but chinese.el
would also be a reasonable spot. Both those files are concerned with
encoding, but at least "china-util" gives the impression that it could
be a grab-bag.

I'm not sure this use of `source-directory' is particularly robust, but
I don't know how else to handle it.
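
One possible way to soften that dependency, shown here only as a
sketch and not as part of the patch, is to check that the file is
readable before parsing it. `my-read-pinyin-map' is a hypothetical
helper; the regexp is the one used in the patch below.

```elisp
(defun my-read-pinyin-map (file)
  "Return an alist parsed from FILE, or nil if FILE is not readable.
Each element is (SYLLABLE . CHARACTERS), both strings.  Hypothetical
helper for illustration only."
  (when (file-readable-p file)
    (with-temp-buffer
      (insert-file-contents file)
      (let (alist)
        ;; Same line format as the patch: syllable, TAB, characters.
        (while (re-search-forward
                "^\\([[:ascii:]]+\\)\t\\(\\cc+\\)$" nil t)
          (push (cons (match-string 1) (match-string 2)) alist))
        (nreverse alist)))))

;; Evaluates to nil instead of signaling an error when the Emacs
;; sources are not installed where `source-directory' points.
(my-read-pinyin-map
 (expand-file-name "leim/MISC-DIC/pinyin.map" source-directory))
```

This only avoids an error at load time; callers would still need to
cope with an empty mapping on systems without the sources.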

Hope this will be considered!

Eric

[1]: https://lists.gnu.org/archive/html/emacs-devel/2019-01/msg00306.html

[-- Attachment #2: 0001-New-constant-chinese-pinyin-character-map.patch --]
[-- Type: text/x-patch, Size: 2009 bytes --]

From f63b918057f7eaf6f8eebb28071ac17dd5ab3ff1 Mon Sep 17 00:00:00 2001
From: Eric Abrahamsen <eric@ericabrahamsen.net>
Date: Sat, 26 Jan 2019 20:11:23 -0800
Subject: [PATCH] New constant chinese-pinyin-character-map

* lisp/language/china-util.el (chinese-pinyin-character-map): Constant
  holding an alist built from the pinyin-to-character mapping provided
  in the file pinyin.map.
---
 lisp/language/china-util.el | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/lisp/language/china-util.el b/lisp/language/china-util.el
index 70710bac18..cdbd8e322f 100644
--- a/lisp/language/china-util.el
+++ b/lisp/language/china-util.el
@@ -30,7 +30,7 @@
 
 ;;; Code:
 
-;; Hz/ZW/EUC-TW encoding stuff
+;; Hz/ZW/EUC-TW encoding stuff, also a pinyin-to-character mapping.
 
 ;; HZ is an encoding method for Chinese character set GB2312 used
 ;; widely in Internet.  It is very similar to 7-bit environment of
@@ -202,6 +202,30 @@ pre-write-encode-hz
     (let (last-coding-system-used)
       (encode-hz-region 1 (point-max)))
     nil))
+
+;;; Elisp-accessible version of the pinyin-to-character mapping
+;;; provided in leim/MISC-DIC/pinyin.map, which is otherwise only
+;;; exposed to the quail input method.
+
+(eval-and-compile
+  (defconst chinese-pinyin-character-map
+    (let ((py-file (expand-file-name
+                    "leim/MISC-DIC/pinyin.map"
+                    source-directory))
+          alst)
+      (with-temp-buffer
+        (insert-file-contents py-file)
+        (re-search-forward "^[^%]" (point-max) t)
+        (beginning-of-line)
+        (while (re-search-forward "^\\([[:ascii:]]+\\)\t\\(\\cc+\\)$"
+                                  (point-max) t)
+          (push (cons (match-string-no-properties 1)
+                      (match-string-no-properties 2))
+                alst))
+        (nreverse alst)))
+    "An alist mapping pinyin syllables to Chinese characters.
+Produced from data in pinyin.map."))
+
 ;;
 (provide 'china-util)
 
-- 
2.20.1


Thread overview: 20+ messages
2019-01-27  5:34 bug#34215: 27.0.50; Provide elisp access to Chinese pinyin-to-character mapping Eric Abrahamsen
2019-01-27 15:41 ` Eli Zaretskii
2019-01-27 18:02   ` Eric Abrahamsen
2019-01-27 18:14     ` Eli Zaretskii
2019-01-27 19:18       ` Eric Abrahamsen
2019-01-27 19:48         ` Eli Zaretskii
2019-01-29 17:48           ` Eric Abrahamsen
2019-01-30 17:09             ` Eli Zaretskii
2019-01-30 20:33               ` Eric Abrahamsen
2019-01-30 20:48                 ` Eric Abrahamsen
2019-01-31  8:50                   ` Robert Pluim
2019-01-31 19:35                     ` Eric Abrahamsen
2019-02-01  9:48                       ` Eli Zaretskii
2019-02-01 16:27                         ` Eric Abrahamsen
2019-02-01 18:53                           ` Eli Zaretskii
2019-02-01 19:15                             ` Eric Abrahamsen
2019-02-24  5:36                         ` Eric Abrahamsen
2019-02-24 16:06                           ` Eli Zaretskii
2019-02-24 18:53                             ` Eric Abrahamsen
2019-02-24 19:12 ` bug#34215: Eric Abrahamsen
