unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#34215: 27.0.50; Provide elisp access to Chinese pinyin-to-character mapping
@ 2019-01-27  5:34 Eric Abrahamsen
  2019-01-27 15:41 ` Eli Zaretskii
  2019-02-24 19:12 ` bug#34215: Eric Abrahamsen
  0 siblings, 2 replies; 20+ messages in thread
From: Eric Abrahamsen @ 2019-01-27  5:34 UTC (permalink / raw)
  To: 34215

[-- Attachment #1: Type: text/plain, Size: 1728 bytes --]


This bug report is apropos to this[1] emacs.devel thread.

The basic idea is that in the Emacs sources there's a file containing a
mapping between pinyin -- the most common Chinese romanization system --
and Chinese characters themselves. The mapping lives in
leim/MISC-DIC/pinyin.map, and is converted into a quail input method by
the `py-converter' function in titdic-cnv.el, which is part of the
"make" process.

I want this mapping to be available to elisp code in general, because
it's useful for all kinds of other language utilities (searching Chinese
characters using ascii letters, etc).
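
To make the "searching with ascii letters" use case concrete, here is a minimal sketch of turning pinyin syllables into a regexp over Chinese characters. The names `my-pinyin-sample-map' and `my-pinyin-to-regexp' are hypothetical, and the two-entry alist is only a stand-in for the real data, assumed to have the (SYLLABLE . CHARACTERS) shape proposed in this report.

```elisp
;; Hypothetical sample data; the real alist would be built from pinyin.map.
(defvar my-pinyin-sample-map
  '(("ni" . "你尼泥")
    ("hao" . "好号毫"))
  "Tiny illustrative stand-in for the pinyin-to-character alist.")

(defun my-pinyin-to-regexp (syllables)
  "Return a regexp matching one Chinese character per entry of SYLLABLES.
SYLLABLES is a list of pinyin strings.  A syllable that is not in
`my-pinyin-sample-map' is matched literally."
  (mapconcat
   (lambda (syl)
     (let ((chars (cdr (assoc syl my-pinyin-sample-map))))
       (if chars
           ;; Bracket expression containing every candidate character.
           (concat "[" chars "]")
         (regexp-quote syl))))
   syllables ""))

;; Usage: look for any two-character sequence readable as "ni hao".
(string-match-p (my-pinyin-to-regexp '("ni" "hao")) "你好")
```

A search command could pass the resulting regexp to `re-search-forward'; that part is left out here.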

pinyin.map is a plain text file, each line consisting of a romanized
syllable, a TAB, and a string of the possible corresponding Chinese
characters. `titdic-convert' parses this and feeds it to
`quail-define-rules'.
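
As an illustration of that line format only (not the code in titdic-cnv.el), the following sketch parses a string laid out like pinyin.map into an alist. The function name `my-pinyin-parse-map-string' and the sample text are assumptions made for the example.

```elisp
(defun my-pinyin-parse-map-string (text)
  "Parse TEXT, laid out like pinyin.map, into an alist.
Each data line is assumed to be SYLLABLE, a TAB, then CHARACTERS.
Lines that do not have that shape, such as header lines, are skipped."
  (let (result)
    (dolist (line (split-string text "\n" t))
      (when (string-match "\\`\\([a-z]+\\)\t\\(.+\\)\\'" line)
        (push (cons (match-string 1 line) (match-string 2 line))
              result)))
    ;; Entries were pushed in reverse; restore file order.
    (nreverse result)))

;; Usage with made-up sample text:
(my-pinyin-parse-map-string "% header line\na\t阿啊\nai\t爱哀\n")
```

With the sample text above, the header line is dropped and each remaining line becomes one cons cell.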

My first thought was to add an intermediate step, where `titdic-convert'
first composes an alist, then feeds that alist to `quail-define-rules',
which would also allow us access to the alist.

The more I looked at it, the more hacky and awkward that approach
seemed, and it's not like it would save any memory: you still end up
with the data both in a quail method, and in a separate alist.

So this proposed patch simply parses the same file in the same way, but
in a different location. I've put it in china-util.el, but chinese.el
would also be a reasonable spot. Both those files are concerned with
encoding, but at least "china-util" gives the impression that it could
be a grab-bag.

I'm not sure this use of `source-directory' is particularly robust, but
I don't know how else to handle it.

Hope this will be considered!

Eric

[1]: https://lists.gnu.org/archive/html/emacs-devel/2019-01/msg00306.html

[-- Attachment #2: 0001-New-constant-chinese-pinyin-character-map.patch --]
[-- Type: text/x-patch, Size: 2009 bytes --]

From f63b918057f7eaf6f8eebb28071ac17dd5ab3ff1 Mon Sep 17 00:00:00 2001
From: Eric Abrahamsen <eric@ericabrahamsen.net>
Date: Sat, 26 Jan 2019 20:11:23 -0800
Subject: [PATCH] New constant chinese-pinyin-character-map

* lisp/language/china-util.el (chinese-pinyin-character-map): Constant
  holding an alist built from the pinyin-to-character mapping provided
  in the file pinyin.map.
---
 lisp/language/china-util.el | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/lisp/language/china-util.el b/lisp/language/china-util.el
index 70710bac18..cdbd8e322f 100644
--- a/lisp/language/china-util.el
+++ b/lisp/language/china-util.el
@@ -30,7 +30,7 @@
 
 ;;; Code:
 
-;; Hz/ZW/EUC-TW encoding stuff
+;; Hz/ZW/EUC-TW encoding stuff, also a pinyin-to-character mapping.
 
 ;; HZ is an encoding method for Chinese character set GB2312 used
 ;; widely in Internet.  It is very similar to 7-bit environment of
@@ -202,6 +202,30 @@ pre-write-encode-hz
     (let (last-coding-system-used)
       (encode-hz-region 1 (point-max)))
     nil))
+
+;;; Elisp-accessible version of the pinyin-to-character mapping
+;;; provided in leim/MISC-DIC/pinyin.map, which is otherwise only
+;;; exposed to the quail input method.
+
+(eval-and-compile
+  (defconst chinese-pinyin-character-map
+    (let ((py-file (expand-file-name
+                    "leim/MISC-DIC/pinyin.map"
+                    source-directory))
+          alst)
+      (with-temp-buffer
+        (insert-file-contents py-file)
+        (re-search-forward "^[^%]" (point-max) t)
+        (beginning-of-line)
+        (while (re-search-forward "^\\([[:ascii:]]+\\)\t\\(\\cc+\\)$"
+                                  (point-max) t)
+          (push (cons (match-string-no-properties 1)
+                      (match-string-no-properties 2))
+                alst))
+        (nreverse alst)))
+    "An alist mapping pinyin syllables to Chinese characters.
+Produced from data in pinyin.map."))
+
 ;;
 (provide 'china-util)
 
-- 
2.20.1



* bug#34215: 27.0.50; Provide elisp access to Chinese pinyin-to-character mapping
  2019-01-27  5:34 bug#34215: 27.0.50; Provide elisp access to Chinese pinyin-to-character mapping Eric Abrahamsen
@ 2019-01-27 15:41 ` Eli Zaretskii
  2019-01-27 18:02   ` Eric Abrahamsen
  2019-02-24 19:12 ` bug#34215: Eric Abrahamsen
  1 sibling, 1 reply; 20+ messages in thread
From: Eli Zaretskii @ 2019-01-27 15:41 UTC (permalink / raw)
  To: Eric Abrahamsen; +Cc: 34215

> From: Eric Abrahamsen <eric@ericabrahamsen.net>
> Date: Sat, 26 Jan 2019 21:34:39 -0800
> 
> So this proposed patch simply parses the same file in the same way, but
> in a different location. I've put it in china-util.el, but chinese.el
> would also be a reasonable spot. Both those files are concerned with
> encoding, but at least "china-util" gives the impression that it could
> be a grab-bag.

How much does this add to Emacs memory footprint when loaded?  Since
this will be required only rarely, I doubt that it would be a good
idea to force every user of Chinese language to pay the price, if it
is significant.  It would be better to have this as a separate file
with autoloaded variable or function, IMO.

> I'm not sure this use of `source-directory' is particularly robust, but
> I don't know how else to handle it.

source-directory might not exist in a given installation.

Maybe we should have the data copied into that separate file I
mentioned above.

Thanks.






* bug#34215: 27.0.50; Provide elisp access to Chinese pinyin-to-character mapping
  2019-01-27 15:41 ` Eli Zaretskii
@ 2019-01-27 18:02   ` Eric Abrahamsen
  2019-01-27 18:14     ` Eli Zaretskii
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Abrahamsen @ 2019-01-27 18:02 UTC (permalink / raw)
  To: 34215

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Eric Abrahamsen <eric@ericabrahamsen.net>
>> Date: Sat, 26 Jan 2019 21:34:39 -0800
>> 
>> So this proposed patch simply parses the same file in the same way, but
>> in a different location. I've put it in china-util.el, but chinese.el
>> would also be a reasonable spot. Both those files are concerned with
>> encoding, but at least "china-util" gives the impression that it could
>> be a grab-bag.
>
> How much does this add to Emacs memory footprint when loaded?  

I actually don't know how to measure the memory taken up by the contents
of a variable, but I imagine it's fairly significant. Or maybe I could
do a "before and after" measurement of all of Emacs.
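
One rough way to put a number on it, sketched under the assumption that the value is an alist of (STRING . STRING) pairs: add up `string-bytes' over all keys and values. The function name `my-alist-string-payload-bytes' is hypothetical, and the result is only a lower bound, since it ignores cons cells and string headers.

```elisp
(defun my-alist-string-payload-bytes (alist)
  "Return the total `string-bytes' of all keys and values in ALIST.
ALIST is assumed to hold (STRING . STRING) pairs.  Only the string
contents in Emacs's internal representation are counted, so the
result is a lower bound on the memory the value occupies."
  (let ((total 0))
    (dolist (entry alist total)
      (setq total (+ total
                     (string-bytes (car entry))
                     (string-bytes (cdr entry)))))))
```

Comparing the output of `garbage-collect' before and after loading the data would be the "before and after" alternative.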

> Since this will be required only rarely, I doubt that it would be a
> good idea to force every user of Chinese language to pay the price, if
> it is significant. It would be better to have this as a separate file
> with autoloaded variable or function, IMO.

That sounds fine to me. I agree the data shouldn't be loaded unless it's
been explicitly requested.

>> I'm not sure this use of `source-directory' is particularly robust, but
>> I don't know how else to handle it.
>
> source-directory might not exist in a given installation.
>
> Maybe we should have the data copied into that separate file I
> mentioned above.

I can imagine a few ways of doing that:

1. Just manually copy the data into a new file and add it to the repo
   (pinyin.map hasn't been updated in years).
2. Do the copy at build time. I'm not quite sure where that function
   would live, or how it would get called.
3. Use an `eval-and-compile' form as in the patch I provided. Is working
   back from `load-file-name' more reliable than using
   `source-directory'?
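
For option 3, "working back from `load-file-name'" could look roughly like the sketch below. The helper name `my-library-sibling-file' is hypothetical. Note the assumption it relies on: the data file would have to be installed next to the library, which is not the case for leim/MISC-DIC/pinyin.map today.

```elisp
(defun my-library-sibling-file (name)
  "Return NAME expanded relative to the file currently being loaded.
Fall back to the visited file, then to `default-directory', when
no file is being loaded."
  (expand-file-name
   name
   (file-name-directory
    (or load-file-name buffer-file-name default-directory))))

;; Usage, assuming a data file shipped alongside the library:
(my-library-sibling-file "pinyin.map")
```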

Autoloading a variable seems to copy the value of the variable into the
loaddefs file, so there's no point to that. I figure we can just ask
people who want this value to require the library.

Thanks,
Eric

PS: pinyin.map is ancient and is missing a lot of good correspondences.
Google's pinyin input method uses a much larger map, licensed with
Apache v2.0. This[1] seems to indicate that Apache 2.0 is okay for Gnu
projects, maybe we could consider switching to that map?

Footnotes:
[1]  https://www.gnu.org/licenses/license-list.en.html#apache2








* bug#34215: 27.0.50; Provide elisp access to Chinese pinyin-to-character mapping
  2019-01-27 18:02   ` Eric Abrahamsen
@ 2019-01-27 18:14     ` Eli Zaretskii
  2019-01-27 19:18       ` Eric Abrahamsen
  0 siblings, 1 reply; 20+ messages in thread
From: Eli Zaretskii @ 2019-01-27 18:14 UTC (permalink / raw)
  To: Eric Abrahamsen; +Cc: 34215

> From: Eric Abrahamsen <eric@ericabrahamsen.net>
> Date: Sun, 27 Jan 2019 10:02:48 -0800
> 
> >> I'm not sure this use of `source-directory' is particularly robust, but
> >> I don't know how else to handle it.
> >
> > source-directory might not exist in a given installation.
> >
> > Maybe we should have the data copied into that separate file I
> > mentioned above.
> 
> I can imagine a few ways of doing that:
> 
> 1. Just manually copy the data into a new file and add it to the repo
>    (pinyin.map hasn't been updated in years).
> 2. Do the copy at build time. I'm not quite sure where that function
>    would live, or how it would get called.
> 3. Use an `eval-and-compile' form as in the patch I provided. Is working
>    back from `load-file-name' more reliable than using
>    `source-directory'?

2 is what I had in mind.  I don't think it matters where the code
lives, it's small enough to not matter.  It would be called like the
various *-convert functions we invoke at build time to build the
dictionaries needed for CJK input methods, see the files in the leim/
directory.

> Autoloading a variable seems to copy the value of the variable into the
> loaddefs file, so there's no point to that. I figure we can just ask
> people who want this value to require the library.

Right.

> PS: pinyin.map is ancient and is missing a lot of good correspondences.
> Google's pinyin input method uses a much larger map, licensed with
> Apache v2.0. This[1] seems to indicate that Apache 2.0 is okay for Gnu
> projects, maybe we could consider switching to that map?

Maybe.  Unfortunately, I don't know enough about these input methods
to tell whether replacing the file is a good idea.  I wonder who we
can ask about this.






* bug#34215: 27.0.50; Provide elisp access to Chinese pinyin-to-character mapping
  2019-01-27 18:14     ` Eli Zaretskii
@ 2019-01-27 19:18       ` Eric Abrahamsen
  2019-01-27 19:48         ` Eli Zaretskii
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Abrahamsen @ 2019-01-27 19:18 UTC (permalink / raw)
  To: 34215

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Eric Abrahamsen <eric@ericabrahamsen.net>
>> Date: Sun, 27 Jan 2019 10:02:48 -0800
>> 
>> >> I'm not sure this use of `source-directory' is particularly robust, but
>> >> I don't know how else to handle it.
>> >
>> > source-directory might not exist in a given installation.
>> >
>> > Maybe we should have the data copied into that separate file I
>> > mentioned above.
>> 
>> I can imagine a few ways of doing that:
>> 
>> 1. Just manually copy the data into a new file and add it to the repo
>>    (pinyin.map hasn't been updated in years).
>> 2. Do the copy at build time. I'm not quite sure where that function
>>    would live, or how it would get called.
>> 3. Use an `eval-and-compile' form as in the patch I provided. Is working
>>    back from `load-file-name' more reliable than using
>>    `source-directory'?
>
> 2 is what I had in mind.  I don't think it matters where the code
> lives, it's small enough to not matter.  It would be called like the
> various *-convert functions we invoke at build time to build the
> dictionaries needed for CJK input methods, see the files in the leim/
> directory.

Okay, I'll put that together and add it to one of the Makefiles. I
suppose it could go in leim/Makefile.in, though it technically isn't
part of leim, and I was expecting the resulting file to go to
lisp/language/. But it would be convenient to put the generation
function in titdic-cnv.el.

>> Autoloading a variable seems to copy the value of the variable into the
>> loaddefs file, so there's no point to that. I figure we can just ask
>> people who want this value to require the library.
>
> Right.
>
>> PS: pinyin.map is ancient and is missing a lot of good correspondences.
>> Google's pinyin input method uses a much larger map, licensed with
>> Apache v2.0. This[1] seems to indicate that Apache 2.0 is okay for Gnu
>> projects, maybe we could consider switching to that map?
>
> Maybe.  Unfortunately, I don't know enough about these input methods
> to tell whether replacing the file is a good idea.  I wonder who we
> can ask about this.

It's more or less a drop-in replacement -- the format of the data would
be the same, only a bit more of it. I'm not sure who is "in charge" of
these files, though.

Eric







* bug#34215: 27.0.50; Provide elisp access to Chinese pinyin-to-character mapping
  2019-01-27 19:18       ` Eric Abrahamsen
@ 2019-01-27 19:48         ` Eli Zaretskii
  2019-01-29 17:48           ` Eric Abrahamsen
  0 siblings, 1 reply; 20+ messages in thread
From: Eli Zaretskii @ 2019-01-27 19:48 UTC (permalink / raw)
  To: Eric Abrahamsen; +Cc: 34215

> From: Eric Abrahamsen <eric@ericabrahamsen.net>
> Date: Sun, 27 Jan 2019 11:18:29 -0800
> 
> >> PS: pinyin.map is ancient and is missing a lot of good correspondences.
> >> Google's pinyin input method uses a much larger map, licensed with
> >> Apache v2.0. This[1] seems to indicate that Apache 2.0 is okay for Gnu
> >> projects, maybe we could consider switching to that map?
> >
> > Maybe.  Unfortunately, I don't know enough about these input methods
> > to tell whether replacing the file is a good idea.  I wonder who we
> > can ask about this.
> 
> It's more or less a drop-in replacement -- the format of the data would
> be the same, only a bit more of it.

I understand, but I wonder if someone could try that for a while and
see if it makes better input method(s), before we decide to import it.

> I'm not sure who is "in charge" of these files, though.

No one, I'm afraid.  Not these days.






* bug#34215: 27.0.50; Provide elisp access to Chinese pinyin-to-character mapping
  2019-01-27 19:48         ` Eli Zaretskii
@ 2019-01-29 17:48           ` Eric Abrahamsen
  2019-01-30 17:09             ` Eli Zaretskii
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Abrahamsen @ 2019-01-29 17:48 UTC (permalink / raw)
  To: 34215

[-- Attachment #1: Type: text/plain, Size: 1751 bytes --]

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Eric Abrahamsen <eric@ericabrahamsen.net>
>> Date: Sun, 27 Jan 2019 11:18:29 -0800

I've attached a diff adding the conversion function itself, but I'm not
familiar with makefiles and so far haven't been able to figure out how
to call it. It looks like the invocation I want will look like:

$(AM_V_GEN)${RUN_EMACS} -l titdic-cnv -f pinyin-convert \
  ${srcdir}/MISC-DIC/pinyin.map ${srcdir}/../lisp/language/pinyin.el

Where ${srcdir} is the leim directory, but I don't actually know how to
get this code called by make...

Additionally, I could factor the common code in py-converter and
pinyin-convert out into a separate defsubst.

>> >> PS: pinyin.map is ancient and is missing a lot of good correspondences.
>> >> Google's pinyin input method uses a much larger map, licensed with
>> >> Apache v2.0. This[1] seems to indicate that Apache 2.0 is okay for Gnu
>> >> projects, maybe we could consider switching to that map?
>> >
>> > Maybe.  Unfortunately, I don't know enough about these input methods
>> > to tell whether replacing the file is a good idea.  I wonder who we
>> > can ask about this.
>> 
>> It's more or less a drop-in replacement -- the format of the data would
>> be the same, only a bit more of it.
>
> I understand, but I wonder if someone could try that for a while and
> see if it makes better input method(s), before we decide to import it.

FWIW, that mapping is used by the pyim package, which I believe is the
most popular pinyin-based Chinese input method out there. I also use it
via the system-wide input framework fcitx, and it works very well.

>> I'm not sure who is "in charge" of these files, though.
>
> No one, I'm afraid.  Not these days.

That's too bad.

Eric


[-- Attachment #2: pinyinconvert.diff --]
[-- Type: text/x-patch, Size: 1484 bytes --]

diff --git a/lisp/international/titdic-cnv.el b/lisp/international/titdic-cnv.el
index 2ce2c527b9..54d9fc6211 100644
--- a/lisp/international/titdic-cnv.el
+++ b/lisp/international/titdic-cnv.el
@@ -1203,6 +1203,37 @@ batch-miscdic-convert
 	(miscdic-convert filename dir))))
   (kill-emacs 0))
 
+(defun pinyin-convert ()
+  "Convert text file pinyin.map into an elisp library.
+The library is named pinyin.el, and contains the constant
+`pinyin-character-map'."
+  (let ((src-file (car command-line-args-left))
+        (dst-file (cadr command-line-args-left)))
+    (with-temp-file dst-file
+      (insert ";; This file is automatically generated from pinyin.map,\
+ by the function pinyin-convert.")
+      (insert "(defconst pinyin-character-map\n(")
+      (let ((pos (point)))
+        (insert-file-contents src-file)
+        (goto-char pos)
+        (re-search-forward "^[a-z]")
+        (beginning-of-line)
+        (delete-region pos (point))
+        (while (not (eobp))
+          (insert "(\"")
+          (skip-chars-forward "a-z")
+          (insert "\" \"")
+          (delete-char 1)
+          (end-of-line)
+          (while (= (preceding-char) ?\r)
+	    (delete-char -1))
+          (insert "\")")
+          (forward-line 1)))
+      (insert ")\n\"An alist holding correspondences between pinyin syllables\
+ and Chinese characters.\")\n")
+      (insert "(provide 'pinyin)\n"))
+    (kill-emacs 0)))
+
 ;; Prevent "Local Variables" above confusing Emacs.
 \f
 


* bug#34215: 27.0.50; Provide elisp access to Chinese pinyin-to-character mapping
  2019-01-29 17:48           ` Eric Abrahamsen
@ 2019-01-30 17:09             ` Eli Zaretskii
  2019-01-30 20:33               ` Eric Abrahamsen
  0 siblings, 1 reply; 20+ messages in thread
From: Eli Zaretskii @ 2019-01-30 17:09 UTC (permalink / raw)
  To: Eric Abrahamsen; +Cc: 34215

> From: Eric Abrahamsen <eric@ericabrahamsen.net>
> Date: Tue, 29 Jan 2019 09:48:30 -0800
> 
> I've attached a diff adding the conversion function itself, but I'm not
> familiar with makefiles and so far haven't been able to figure out how
> to call it. It looks like the invocation I want will look like:
> 
> $(AM_V_GEN)${RUN_EMACS} -l titdic-cnv -f pinyin-convert \
>   ${srcdir}/MISC-DIC/pinyin.map ${srcdir}/../lisp/language/pinyin.el
> 
> Where ${srcdir} is the leim directory, but I don't actually know how to
> get this code called by make...

Add a target that is the file produced by this command, then make the
above command the recipe of that target.  Similar to the
${leimdir}/ja-dic/ja-dic.el target.

But if the above doesn't help, someone else could do this part for
you.

> > I understand, but I wonder if someone could try that for a while and
> > see if it makes better input method(s), before we decide to import it.
> 
> FWIW, that mapping is used by the pyim package, which I believe is the
> most popular pinyin-based Chinese input method out there. I also use it
> via the system-wide input framework fcitx, and it works very well.

Then I guess we will be fine importing the new version.

> +(defun pinyin-convert ()
> +  "Convert text file pinyin.map into an elisp library.
> +The library is named pinyin.el, and contains the constant
> +`pinyin-character-map'."

This writes out a .el file, but does it encode that file in UTF-8,
even if the locale's codeset is something other than UTF-8?  If not,
you need to bind coding-system-for-write to UTF-8.
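
A minimal sketch of such a binding, independent of the patch itself: `coding-system-for-write' is let-bound around `with-temp-file', so the write at the end of that form uses UTF-8 whatever the locale says. The helper name `my-write-utf8-file' is hypothetical.

```elisp
(defun my-write-utf8-file (file text)
  "Write TEXT to FILE encoded as UTF-8, regardless of the locale."
  ;; `coding-system-for-write' is a special variable, so this binding
  ;; is in effect when `with-temp-file' writes the buffer out.
  (let ((coding-system-for-write 'utf-8))
    (with-temp-file file
      (insert text))))
```

In `pinyin-convert' the same binding would go in the outer `let', so that it covers the whole `with-temp-file' form.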

> +      (insert ";; This file is automatically generated from pinyin.map,\
> + by the function pinyin-convert.")

This line is too long, suggest to break it in two.

> +      (insert ")\n\"An alist holding correspondences between pinyin syllables\
> + and Chinese characters.\")\n")

Likewise here.

Thanks.






* bug#34215: 27.0.50; Provide elisp access to Chinese pinyin-to-character mapping
  2019-01-30 17:09             ` Eli Zaretskii
@ 2019-01-30 20:33               ` Eric Abrahamsen
  2019-01-30 20:48                 ` Eric Abrahamsen
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Abrahamsen @ 2019-01-30 20:33 UTC (permalink / raw)
  To: 34215

[-- Attachment #1: Type: text/plain, Size: 2200 bytes --]

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Eric Abrahamsen <eric@ericabrahamsen.net>
>> Date: Tue, 29 Jan 2019 09:48:30 -0800
>> 
>> I've attached a diff adding the conversion function itself, but I'm not
>> familiar with makefiles and so far haven't been able to figure out how
>> to call it. It looks like the invocation I want will look like:
>> 
>> $(AM_V_GEN)${RUN_EMACS} -l titdic-cnv -f pinyin-convert \
>>   ${srcdir}/MISC-DIC/pinyin.map ${srcdir}/../lisp/language/pinyin.el
>> 
>> Where ${srcdir} is the leim directory, but I don't actually know how to
>> get this code called by make...
>
> Add a target that is the file produced by this command, then make the
> above command the recipe of that target.  Similar to the
> ${leimdir}/ja-dic/ja-dic.el target.
>
> But if the above doesn't help, someone else could do this part for
> you.

I've attached this as a commit patch -- it seems to work fine but I
would appreciate it if you'd check it.

>> > I understand, but I wonder if someone could try that for a while and
>> > see if it makes better input method(s), before we decide to import it.
>> 
>> FWIW, that mapping is used by the pyim package, which I believe is the
>> most popular pinyin-based Chinese input method out there. I also use it
>> via the system-wide input framework fcitx, and it works very well.
>
> Then I guess we will be fine importing the new version.

Cool -- I'll file another report for this in a bit.

>> +(defun pinyin-convert ()
>> +  "Convert text file pinyin.map into an elisp library.
>> +The library is named pinyin.el, and contains the constant
>> +`pinyin-character-map'."
>
> This writes out a .el file, but does it encode that file in UTF-8,
> even if the locale's codeset is something other than UTF-8?  If not,
> you need to bind coding-system-for-write to UTF-8.
>
>> +      (insert ";; This file is automatically generated from pinyin.map,\
>> + by the function pinyin-convert.")
>
> This line is too long, suggest to break it in two.
>
>> +      (insert ")\n\"An alist holding correspondences between pinyin syllables\
>> + and Chinese characters.\")\n")
>
> Likewise here.

Okay, I've fixed all of the above. Thanks for the pointers.

Eric

[-- Attachment #2: 0001-Make-pinyin-to-Chinese-character-mapping-available-t.patch --]
[-- Type: text/x-patch, Size: 3043 bytes --]

From 0aaa67f9717ae10b1dfd1c7f078400989123acb8 Mon Sep 17 00:00:00 2001
From: Eric Abrahamsen <eric@ericabrahamsen.net>
Date: Wed, 30 Jan 2019 12:31:49 -0800
Subject: [PATCH] Make pinyin to Chinese character mapping available to elisp

* leim/Makefile.in: Build the file pinyin.el from pinyin.map.
* lisp/international/titdic-cnv.el (pinyin-convert): New function that
  writes the library pinyin.el, containing a new constant
  `pinyin-character-map'.
---
 leim/Makefile.in                 |  7 ++++++-
 lisp/international/titdic-cnv.el | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/leim/Makefile.in b/leim/Makefile.in
index c2fc8c41f2..cd693d6d0d 100644
--- a/leim/Makefile.in
+++ b/leim/Makefile.in
@@ -84,7 +84,8 @@ MISC=
 	${leimdir}/quail/PY.el		\
 	${leimdir}/quail/ZIRANMA.el	\
 	${leimdir}/quail/CTLau.el	\
-	${leimdir}/quail/CTLau-b5.el
+	${leimdir}/quail/CTLau-b5.el    \
+	${leimdir}/../lisp/language/pinyin.el
 
 ## All the generated .el files.
 TIT_MISC = ${TIT_GB} ${TIT_BIG5} ${MISC}
@@ -142,6 +143,10 @@ ${leimdir}/ja-dic/ja-dic.el:
 	$(AM_V_GEN)$(RUN_EMACS) -batch -l ja-dic-cnv \
 	  -f batch-skkdic-convert -dir "$(leimdir)/ja-dic" "$<"
 
+${leimdir}/../lisp/language/pinyin.el: ${srcdir}/MISC-DIC/pinyin.map
+	$(AM_V_GEN)${RUN_EMACS} -l titdic-cnv -f pinyin-convert \
+	   ${srcdir}/MISC-DIC/pinyin.map ${srcdir}/../lisp/language/pinyin.el
+
 
 .PHONY: bootstrap-clean distclean maintainer-clean extraclean
 
diff --git a/lisp/international/titdic-cnv.el b/lisp/international/titdic-cnv.el
index 2ce2c527b9..d33e9ff229 100644
--- a/lisp/international/titdic-cnv.el
+++ b/lisp/international/titdic-cnv.el
@@ -1203,6 +1203,38 @@ batch-miscdic-convert
 	(miscdic-convert filename dir))))
   (kill-emacs 0))
 
+(defun pinyin-convert ()
+  "Convert text file pinyin.map into an elisp library.
+The library is named pinyin.el, and contains the constant
+`pinyin-character-map'."
+  (let ((src-file (car command-line-args-left))
+        (dst-file (cadr command-line-args-left))
+        (coding-system-for-write 'utf-8-emacs))
+    (with-temp-file dst-file
+      (insert ";; This file is automatically generated from pinyin.map,\
+ by the\n;; function pinyin-convert.\n\n")
+      (insert "(defconst pinyin-character-map\n'(")
+      (let ((pos (point)))
+        (insert-file-contents src-file)
+        (goto-char pos)
+        (re-search-forward "^[a-z]")
+        (beginning-of-line)
+        (delete-region pos (point))
+        (while (not (eobp))
+          (insert "(\"")
+          (skip-chars-forward "a-z")
+          (insert "\" . \"")
+          (delete-char 1)
+          (end-of-line)
+          (while (= (preceding-char) ?\r)
+	    (delete-char -1))
+          (insert "\")")
+          (forward-line 1)))
+      (insert ")\n\"An alist holding correspondences between pinyin syllables\
+ and\nChinese characters.\")\n\n")
+      (insert "(provide 'pinyin)\n"))
+    (kill-emacs 0)))
+
 ;; Prevent "Local Variables" above confusing Emacs.
 \f
 
-- 
2.20.1



* bug#34215: 27.0.50; Provide elisp access to Chinese pinyin-to-character mapping
  2019-01-30 20:33               ` Eric Abrahamsen
@ 2019-01-30 20:48                 ` Eric Abrahamsen
  2019-01-31  8:50                   ` Robert Pluim
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Abrahamsen @ 2019-01-30 20:48 UTC (permalink / raw)
  To: 34215

Eric Abrahamsen <eric@ericabrahamsen.net> writes:

> Eli Zaretskii <eliz@gnu.org> writes:
>
>>> From: Eric Abrahamsen <eric@ericabrahamsen.net>
>>> Date: Tue, 29 Jan 2019 09:48:30 -0800
>>> 
>>> I've attached a diff adding the conversion function itself, but I'm not
>>> familiar with makefiles and so far haven't been able to figure out how
>>> to call it. It looks like the invocation I want will look like:
>>> 
>>> $(AM_V_GEN)${RUN_EMACS} -l titdic-cnv -f pinyin-convert \
>>>   ${srcdir}/MISC-DIC/pinyin.map ${srcdir}/../lisp/language/pinyin.el
>>> 
>>> Where ${srcdir} is the leim directory, but I don't actually know how to
>>> get this code called by make...
>>
>> Add a target that is the file produced by this command, then make the
>> above command the recipe of that target.  Similar to the
>> ${leimdir}/ja-dic/ja-dic.el target.
>>
>> But if the above doesn't help, someone else could do this part for
>> you.
>
> I've attached this as a commit patch -- it seems to work fine but I
> would appreciate it if you'd check it.

Oh, after reading a couple of "make" tutorials, I see maybe the make
rule could be simplified to:

${leimdir}/../lisp/language/pinyin.el: ${srcdir}/MISC-DIC/pinyin.map
  $(AM_V_GEN)${RUN_EMACS} -l titdic-cnv -f pinyin-convert $< $0







* bug#34215: 27.0.50; Provide elisp access to Chinese pinyin-to-character mapping
  2019-01-30 20:48                 ` Eric Abrahamsen
@ 2019-01-31  8:50                   ` Robert Pluim
  2019-01-31 19:35                     ` Eric Abrahamsen
  0 siblings, 1 reply; 20+ messages in thread
From: Robert Pluim @ 2019-01-31  8:50 UTC (permalink / raw)
  To: Eric Abrahamsen; +Cc: 34215

Eric Abrahamsen <eric@ericabrahamsen.net> writes:

>
> Oh, after reading a couple of "make" tutorials, I see maybe the make
> rule could be simplified to:
>
> ${leimdir}/../lisp/language/pinyin.el: ${srcdir}/MISC-DIC/pinyin.map
>   $(AM_V_GEN)${RUN_EMACS} -l titdic-cnv -f pinyin-convert $< $0

$@ , I think.

Robert






* bug#34215: 27.0.50; Provide elisp access to Chinese pinyin-to-character mapping
  2019-01-31  8:50                   ` Robert Pluim
@ 2019-01-31 19:35                     ` Eric Abrahamsen
  2019-02-01  9:48                       ` Eli Zaretskii
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Abrahamsen @ 2019-01-31 19:35 UTC (permalink / raw)
  To: 34215

[-- Attachment #1: Type: text/plain, Size: 510 bytes --]

Robert Pluim <rpluim@gmail.com> writes:

> Eric Abrahamsen <eric@ericabrahamsen.net> writes:
>
>>
>> Oh, after reading a couple of "make" tutorials, I see maybe the make
>> rule could be simplified to:
>>
>> ${leimdir}/../lisp/language/pinyin.el: ${srcdir}/MISC-DIC/pinyin.map
>>   $(AM_V_GEN)${RUN_EMACS} -l titdic-cnv -f pinyin-convert $< $0
>
> $@ , I think.

Ah, right you are, thanks. I was wondering why that wasn't working. This
version should do the trick; it also gitignores the generated file.

Eric

[-- Attachment #2: 0001-Make-pinyin-to-Chinese-character-mapping-available-t.patch --]
[-- Type: text/x-patch, Size: 3372 bytes --]

From 2395d1a62e66206c04b7069d372fdb4f10787863 Mon Sep 17 00:00:00 2001
From: Eric Abrahamsen <eric@ericabrahamsen.net>
Date: Wed, 30 Jan 2019 12:31:49 -0800
Subject: [PATCH 1/2] Make pinyin to Chinese character mapping available to
 elisp

* leim/Makefile.in: Build the file pinyin.el from pinyin.map.
* lisp/international/titdic-cnv.el (pinyin-convert): New function that
  writes the library pinyin.el, containing a new constant
  `pinyin-character-map'.
* .gitignore: Ignore the generated pinyin.el file.
---
 .gitignore                       |  1 +
 leim/Makefile.in                 |  6 +++++-
 lisp/international/titdic-cnv.el | 32 ++++++++++++++++++++++++++++++++
 3 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/.gitignore b/.gitignore
index 53f41f0f3e..f3d5ccb0f8 100644
--- a/.gitignore
+++ b/.gitignore
@@ -199,6 +199,7 @@ lisp/international/charscript.el
 lisp/international/cp51932.el
 lisp/international/eucjp-ms.el
 lisp/international/uni-*.el
+lisp/language/pinyin.el
 
 # Documentation.
 *.aux
diff --git a/leim/Makefile.in b/leim/Makefile.in
index c2fc8c41f2..4307d50087 100644
--- a/leim/Makefile.in
+++ b/leim/Makefile.in
@@ -84,7 +84,8 @@ MISC=
 	${leimdir}/quail/PY.el		\
 	${leimdir}/quail/ZIRANMA.el	\
 	${leimdir}/quail/CTLau.el	\
-	${leimdir}/quail/CTLau-b5.el
+	${leimdir}/quail/CTLau-b5.el    \
+	${srcdir}/../lisp/language/pinyin.el
 
 ## All the generated .el files.
 TIT_MISC = ${TIT_GB} ${TIT_BIG5} ${MISC}
@@ -142,6 +143,9 @@ ${leimdir}/ja-dic/ja-dic.el:
 	$(AM_V_GEN)$(RUN_EMACS) -batch -l ja-dic-cnv \
 	  -f batch-skkdic-convert -dir "$(leimdir)/ja-dic" "$<"
 
+${srcdir}/../lisp/language/pinyin.el: ${srcdir}/MISC-DIC/pinyin.map
+	$(AM_V_GEN)${RUN_EMACS} -l titdic-cnv -f pinyin-convert $< $@
+
 
 .PHONY: bootstrap-clean distclean maintainer-clean extraclean
 
diff --git a/lisp/international/titdic-cnv.el b/lisp/international/titdic-cnv.el
index 2ce2c527b9..d33e9ff229 100644
--- a/lisp/international/titdic-cnv.el
+++ b/lisp/international/titdic-cnv.el
@@ -1203,6 +1203,38 @@ batch-miscdic-convert
 	(miscdic-convert filename dir))))
   (kill-emacs 0))
 
+(defun pinyin-convert ()
+  "Convert text file pinyin.map into an elisp library.
+The library is named pinyin.el, and contains the constant
+`pinyin-character-map'."
+  (let ((src-file (car command-line-args-left))
+        (dst-file (cadr command-line-args-left))
+        (coding-system-for-write 'utf-8-emacs))
+    (with-temp-file dst-file
+      (insert ";; This file is automatically generated from pinyin.map,\
+ by the\n;; function pinyin-convert.\n\n")
+      (insert "(defconst pinyin-character-map\n'(")
+      (let ((pos (point)))
+        (insert-file-contents src-file)
+        (goto-char pos)
+        (re-search-forward "^[a-z]")
+        (beginning-of-line)
+        (delete-region pos (point))
+        (while (not (eobp))
+          (insert "(\"")
+          (skip-chars-forward "a-z")
+          (insert "\" . \"")
+          (delete-char 1)
+          (end-of-line)
+          (while (= (preceding-char) ?\r)
+	    (delete-char -1))
+          (insert "\")")
+          (forward-line 1)))
+      (insert ")\n\"An alist holding correspondences between pinyin syllables\
+ and\nChinese characters.\")\n\n")
+      (insert "(provide 'pinyin)\n"))
+    (kill-emacs 0)))
+
 ;; Prevent "Local Variables" above confusing Emacs.
 \f
 
-- 
2.20.1

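For readers who want to see the shape of the data, here is a minimal
sketch of the alist format that `pinyin-convert' above is meant to
produce, and of how elisp code might query it. The names
`demo-pinyin-parse', `demo-pinyin-map', `demo-pinyin-chars' and
`demo-pinyin-syllables', the header line, and the two sample entries
are made up for illustration and are not part of the patch. The parser
also uses `split-string' instead of the in-buffer editing the patch
does, so treat it as an approximation, not as the patch's method.

```elisp
;; Illustrative sketch only; none of these names exist in Emacs.

(defun demo-pinyin-parse (text)
  "Parse TEXT in \"SYLLABLE<TAB>CHARS\" line format into an alist.
Lines that do not start with a lowercase ASCII syllable followed by
a TAB (for instance header lines) are skipped."
  (let ((case-fold-search nil)          ; keep [a-z] strictly lowercase
        result)
    (dolist (line (split-string text "[\r\n]+" t))
      (when (string-match "\\`\\([a-z]+\\)\t\\(.+\\)\\'" line)
        (push (cons (match-string 1 line) (match-string 2 line))
              result)))
    (nreverse result)))

;; Hypothetical sample input, standing in for the real pinyin.map.
(defconst demo-pinyin-map
  (demo-pinyin-parse "% hypothetical header line\na\t阿啊\nai\t爱哀\n")
  "Sample alist of (SYLLABLE . CHARACTERS), both strings.")

(defun demo-pinyin-chars (syllable)
  "Return the string of characters for SYLLABLE, or nil if unknown."
  (cdr (assoc syllable demo-pinyin-map)))

(defun demo-pinyin-syllables (char)
  "Return the list of syllables whose character string contains CHAR."
  (let (found)
    (dolist (entry demo-pinyin-map)
      (when (memq char (string-to-list (cdr entry)))
        (push (car entry) found)))
    (nreverse found)))
```

With the real constant one would presumably call `assoc' on
`pinyin-character-map' in the same way. The reverse lookup is a linear
scan; code that needs it often might build a hash table keyed by
character once instead.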

* bug#34215: 27.0.50; Provide elisp access to Chinese pinyin-to-character mapping
  2019-01-31 19:35                     ` Eric Abrahamsen
@ 2019-02-01  9:48                       ` Eli Zaretskii
  2019-02-01 16:27                         ` Eric Abrahamsen
  2019-02-24  5:36                         ` Eric Abrahamsen
  0 siblings, 2 replies; 20+ messages in thread
From: Eli Zaretskii @ 2019-02-01  9:48 UTC (permalink / raw)
  To: Eric Abrahamsen; +Cc: 34215

> From: Eric Abrahamsen <eric@ericabrahamsen.net>
> Date: Thu, 31 Jan 2019 11:35:32 -0800
> 
> +(defun pinyin-convert ()
> +  "Convert text file pinyin.map into an elisp library.
> +The library is named pinyin.el, and contains the constant
> +`pinyin-character-map'."
> +  (let ((src-file (car command-line-args-left))
> +        (dst-file (cadr command-line-args-left))
> +        (coding-system-for-write 'utf-8-emacs))

This should be 'utf-8-unix.  There's no reason to write out stuff in
our internal encoding, as the file is not supposed to have any
characters not representable in UTF-8.

Otherwise, this LGTM.  Let's wait for a few days for more comments,
and then push.

Thanks.





* bug#34215: 27.0.50; Provide elisp access to Chinese pinyin-to-character mapping
  2019-02-01  9:48                       ` Eli Zaretskii
@ 2019-02-01 16:27                         ` Eric Abrahamsen
  2019-02-01 18:53                           ` Eli Zaretskii
  2019-02-24  5:36                         ` Eric Abrahamsen
  1 sibling, 1 reply; 20+ messages in thread
From: Eric Abrahamsen @ 2019-02-01 16:27 UTC (permalink / raw)
  To: 34215

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Eric Abrahamsen <eric@ericabrahamsen.net>
>> Date: Thu, 31 Jan 2019 11:35:32 -0800
>> 
>> +(defun pinyin-convert ()
>> +  "Convert text file pinyin.map into an elisp library.
>> +The library is named pinyin.el, and contains the constant
>> +`pinyin-character-map'."
>> +  (let ((src-file (car command-line-args-left))
>> +        (dst-file (cadr command-line-args-left))
>> +        (coding-system-for-write 'utf-8-emacs))
>
> This should be 'utf-8-unix.  There's no reason to write out stuff in
> our internal encoding, as the file is not supposed to have any
> characters not representable in UTF-8.

Oh, okay. For my information -- is that not platform-dependent? I
noticed titdic-cnv.el has a utf-8-emacs encoding cookie at the top.

> Otherwise, this LGTM.  Let's wait for a few days for more comments,
> and then push.

Sure thing.






* bug#34215: 27.0.50; Provide elisp access to Chinese pinyin-to-character mapping
  2019-02-01 16:27                         ` Eric Abrahamsen
@ 2019-02-01 18:53                           ` Eli Zaretskii
  2019-02-01 19:15                             ` Eric Abrahamsen
  0 siblings, 1 reply; 20+ messages in thread
From: Eli Zaretskii @ 2019-02-01 18:53 UTC (permalink / raw)
  To: Eric Abrahamsen; +Cc: 34215

> From: Eric Abrahamsen <eric@ericabrahamsen.net>
> Date: Fri, 01 Feb 2019 08:27:08 -0800
> 
> >> +        (coding-system-for-write 'utf-8-emacs))
> >
> > This should be 'utf-8-unix.  There's no reason to write out stuff in
> > our internal encoding, as the file is not supposed to have any
> > characters not representable in UTF-8.
> 
> Oh, okay. For my information -- is that not platform-dependent?

No, the defaults are platform-dependent.  utf-8-unix is an explicit
specification of an encoding, so it leaves nothing to the platform.

> I noticed titdic-cnv.el has a utf-8-emacs encoding cookie at the
> top.

utf-8-emacs is the internal representation of characters used by
Emacs; it should only be used when some of the characters might not be
expressible in UTF-8 (i.e. they are beyond the Unicode codespace).
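
The point about an explicit coding system can be illustrated with a
small sketch. It binds `coding-system-for-write' to `utf-8-unix'
around `with-temp-file', the same mechanism the patch uses, and then
reads the file back without decoding so the bytes can be inspected.
The helper names `demo-write-utf-8-unix' and `demo-raw-bytes' are
hypothetical and exist only for this example.

```elisp
;; Illustrative sketch only; the demo-* helpers are not part of Emacs.

(defun demo-write-utf-8-unix (file text)
  "Write TEXT to FILE as UTF-8 with LF line endings.
Binding `coding-system-for-write' overrides the platform-dependent
defaults for the write performed by `with-temp-file'."
  (let ((coding-system-for-write 'utf-8-unix))
    (with-temp-file file
      (insert text))))

(defun demo-raw-bytes (file)
  "Return the contents of FILE as a unibyte string, without decoding."
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert-file-contents-literally file)
    (buffer-string)))

;; Example use with a temporary file:
;; (let ((file (make-temp-file "pinyin-demo-")))
;;   (demo-write-utf-8-unix file "(\"a\" . \"阿\")\n")
;;   (prog1 (demo-raw-bytes file)
;;     (delete-file file)))
```

Decoding the returned bytes with `decode-coding-string' and
`utf-8-unix' should give back the original text, and the bytes should
contain no carriage returns on any platform.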





* bug#34215: 27.0.50; Provide elisp access to Chinese pinyin-to-character mapping
  2019-02-01 18:53                           ` Eli Zaretskii
@ 2019-02-01 19:15                             ` Eric Abrahamsen
  0 siblings, 0 replies; 20+ messages in thread
From: Eric Abrahamsen @ 2019-02-01 19:15 UTC (permalink / raw)
  To: 34215

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Eric Abrahamsen <eric@ericabrahamsen.net>
>> Date: Fri, 01 Feb 2019 08:27:08 -0800
>> 
>> >> +        (coding-system-for-write 'utf-8-emacs))
>> >
>> > This should be 'utf-8-unix.  There's no reason to write out stuff in
>> > our internal encoding, as the file is not supposed to have any
>> > characters not representable in UTF-8.
>> 
>> Oh, okay. For my information -- is that not platform-dependent?
>
> No, the defaults are platform-dependent.  utf-8-unix is an explicit
> specification of an encoding, so it leaves nothing to the platform.
>
>> I noticed titdic-cnv.el has a utf-8-emacs encoding cookie at the
>> top.
>
> utf-8-emacs is the internal representation of characters used by
> Emacs; it should only be used when some of the characters might not be
> expressible in UTF-8 (i.e. they are beyond the Unicode codespace).

Interesting, thank you for this background.






* bug#34215: 27.0.50; Provide elisp access to Chinese pinyin-to-character mapping
  2019-02-01  9:48                       ` Eli Zaretskii
  2019-02-01 16:27                         ` Eric Abrahamsen
@ 2019-02-24  5:36                         ` Eric Abrahamsen
  2019-02-24 16:06                           ` Eli Zaretskii
  1 sibling, 1 reply; 20+ messages in thread
From: Eric Abrahamsen @ 2019-02-24  5:36 UTC (permalink / raw)
  To: 34215


On 02/01/19 11:48 AM, Eli Zaretskii wrote:
>> From: Eric Abrahamsen <eric@ericabrahamsen.net>
>> Date: Thu, 31 Jan 2019 11:35:32 -0800
>> 
>> +(defun pinyin-convert ()
>> +  "Convert text file pinyin.map into an elisp library.
>> +The library is named pinyin.el, and contains the constant
>> +`pinyin-character-map'."
>> +  (let ((src-file (car command-line-args-left))
>> +        (dst-file (cadr command-line-args-left))
>> +        (coding-system-for-write 'utf-8-emacs))
>
> This should be 'utf-8-unix.  There's no reason to write out stuff in
> our internal encoding, as the file is not supposed to have any
> characters not representable in UTF-8.
>
> Otherwise, this LGTM.  Let's wait for a few days for more comments,
> and then push.

Doesn't look like anything more is forthcoming, shall I push to master?





* bug#34215: 27.0.50; Provide elisp access to Chinese pinyin-to-character mapping
  2019-02-24  5:36                         ` Eric Abrahamsen
@ 2019-02-24 16:06                           ` Eli Zaretskii
  2019-02-24 18:53                             ` Eric Abrahamsen
  0 siblings, 1 reply; 20+ messages in thread
From: Eli Zaretskii @ 2019-02-24 16:06 UTC (permalink / raw)
  To: Eric Abrahamsen; +Cc: 34215

> From: Eric Abrahamsen <eric@ericabrahamsen.net>
> Date: Sat, 23 Feb 2019 21:36:10 -0800
> 
> > Otherwise, this LGTM.  Let's wait for a few days for more comments,
> > and then push.
> 
> Doesn't look like anything more is forthcoming, shall I push to master?

Yes, please.





* bug#34215: 27.0.50; Provide elisp access to Chinese pinyin-to-character mapping
  2019-02-24 16:06                           ` Eli Zaretskii
@ 2019-02-24 18:53                             ` Eric Abrahamsen
  0 siblings, 0 replies; 20+ messages in thread
From: Eric Abrahamsen @ 2019-02-24 18:53 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 34215


On 02/24/19 18:06, Eli Zaretskii wrote:
>> From: Eric Abrahamsen <eric@ericabrahamsen.net>
>> Date: Sat, 23 Feb 2019 21:36:10 -0800
>> 
>> > Otherwise, this LGTM.  Let's wait for a few days for more comments,
>> > and then push.
>> 
>> Doesn't look like anything more is forthcoming, shall I push to master?
>
> Yes, please.

Done, thanks.





* bug#34215:
  2019-01-27  5:34 bug#34215: 27.0.50; Provide elisp access to Chinese pinyin-to-character mapping Eric Abrahamsen
  2019-01-27 15:41 ` Eli Zaretskii
@ 2019-02-24 19:12 ` Eric Abrahamsen
  1 sibling, 0 replies; 20+ messages in thread
From: Eric Abrahamsen @ 2019-02-24 19:12 UTC (permalink / raw)
  To: 34215-done






end of thread, other threads:[~2019-02-24 19:12 UTC | newest]

Thread overview: 20+ messages
2019-01-27  5:34 bug#34215: 27.0.50; Provide elisp access to Chinese pinyin-to-character mapping Eric Abrahamsen
2019-01-27 15:41 ` Eli Zaretskii
2019-01-27 18:02   ` Eric Abrahamsen
2019-01-27 18:14     ` Eli Zaretskii
2019-01-27 19:18       ` Eric Abrahamsen
2019-01-27 19:48         ` Eli Zaretskii
2019-01-29 17:48           ` Eric Abrahamsen
2019-01-30 17:09             ` Eli Zaretskii
2019-01-30 20:33               ` Eric Abrahamsen
2019-01-30 20:48                 ` Eric Abrahamsen
2019-01-31  8:50                   ` Robert Pluim
2019-01-31 19:35                     ` Eric Abrahamsen
2019-02-01  9:48                       ` Eli Zaretskii
2019-02-01 16:27                         ` Eric Abrahamsen
2019-02-01 18:53                           ` Eli Zaretskii
2019-02-01 19:15                             ` Eric Abrahamsen
2019-02-24  5:36                         ` Eric Abrahamsen
2019-02-24 16:06                           ` Eli Zaretskii
2019-02-24 18:53                             ` Eric Abrahamsen
2019-02-24 19:12 ` bug#34215: Eric Abrahamsen

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git
