bug#34862: 27.0.50; Trying to update pinyin.map

unofficial mirror of bug-gnu-emacs@gnu.org 

* bug#34862: 27.0.50; Trying to update pinyin.map
@ 2019-03-14 21:49 Eric Abrahamsen
  2019-03-15  5:03 ` Eli Zaretskii
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Abrahamsen @ 2019-03-14 21:49 UTC (permalink / raw)
  To: 34862

As discussed in bug#34215, I'm trying to update the
romanization-to-Chinese-character mapping in the
file ./leim/MISC-DIC/pinyin.map to use the more complete mapping
provided by the Google pinyin input method, licensed under Apache 2.0.
This expands the number of characters recognized by Emacs from around
7,000 to around 17,000. (And increases the size of the mapping file from
18K to 53K.)

I'm running into encoding problems when adding the new characters --
Emacs says some of the characters can't be written using the existing
coding system. The original file has an encoding cookie reading coding:
cn-gb-2312, and describing the coding system gives me:

chinese-iso-8bit-dos (alias: cn-gb-2312-dos euc-china-dos euc-cn-dos
  cn-gb-dos gb2312-dos)

The characters *can* be encoded using gb18030, and of course utf8. The
Wikipedia page for gb18030 describes gb2312 as "legacy"[1], and says
gb18030 is a superset of gb2312.

Is there any reason not to go straight to utf8 for this file? If that's
not okay, would gb18030 be acceptable?

Codepoint 23744 is an example of a character that can be encoded with
gb18030 but not gb2312. It also exercises my font engine.
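
A quick way to double-check this kind of claim outside Emacs is to try
encoding the character with each candidate codec. The sketch below uses
Python's built-in codecs, which are assumed here to be close enough to
the Emacs coding systems for a sanity check; they are not the same
implementation, and the helper name `encodable` is made up for this
illustration.

```python
def encodable(char, codecs=("gb2312", "gbk", "gb18030", "utf-8")):
    """Return {codec_name: bool} saying whether `char` can be encoded."""
    result = {}
    for name in codecs:
        try:
            char.encode(name)
            result[name] = True
        except UnicodeEncodeError:
            # The codec has no mapping for this character.
            result[name] = False
    return result


if __name__ == "__main__":
    ch = chr(23744)  # the code point mentioned above (U+5CC0)
    for name, ok in encodable(ch).items():
        print(f"{name:8} {'ok' if ok else 'cannot encode'}")
```

If the report above holds for Python's codecs as well, gb2312 should be
the only one listed as unable to encode the character.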

I have two other questions, about reducing vc churn, and how to insert
the license at the top of the file, but I figured I'd ask this first.

Thanks,
Eric

[1]  https://en.wikipedia.org/wiki/GB_18030


* bug#34862: 27.0.50; Trying to update pinyin.map
  2019-03-14 21:49 bug#34862: 27.0.50; Trying to update pinyin.map Eric Abrahamsen
@ 2019-03-15  5:03 ` Eli Zaretskii
  2019-03-15  5:58   ` Eric Abrahamsen
  0 siblings, 1 reply; 12+ messages in thread
From: Eli Zaretskii @ 2019-03-15  5:03 UTC (permalink / raw)
  To: Eric Abrahamsen; +Cc: 34862

> From: Eric Abrahamsen <eric@ericabrahamsen.net>
> Date: Thu, 14 Mar 2019 14:49:51 -0700
> 
> 
> As discussed in bug#34215, I'm trying to update the
> romanization-to-Chinese-character mapping in the
> file ./leim/MISC-DIC/pinyin.map to use the more complete mapping
> provided by the Google pinyin input method, licensed under Apache 2.0.
> This expands the number of characters recognized by Emacs from around
> 7,000 to around 17,000. (And increases the size of the mapping file from
> 18K to 53K.)
> 
> I'm running into encoding problems when adding the new characters --
> Emacs says some of the characters can't be written using the existing
> coding system. The original file has an encoding cookie reading coding:
> cn-gb-2312, and describing the coding system gives me:
> 
> chinese-iso-8bit-dos (alias: cn-gb-2312-dos euc-china-dos euc-cn-dos
>   cn-gb-dos gb2312-dos)
> 
> The characters *can* be encoded using gb18030, and of course utf8. The
> wikipedia page for gb18030 describes gb2312 as "legacy"[1], and says
> gb18030 is a superset of 2312.
> 
> Is there any reason not to go straight to utf8 for this file? If that's
> not okay, would gb18030 be acceptable?

I'm not sure I understand: the encoding of which file would you like
to change?  Could you please clarify?






* bug#34862: 27.0.50; Trying to update pinyin.map
  2019-03-15  5:03 ` Eli Zaretskii
@ 2019-03-15  5:58   ` Eric Abrahamsen
  2019-03-15  7:04     ` Eli Zaretskii
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Abrahamsen @ 2019-03-15  5:58 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 34862


On 03/15/19 07:03 AM, Eli Zaretskii wrote:
>> From: Eric Abrahamsen <eric@ericabrahamsen.net>
>> Date: Thu, 14 Mar 2019 14:49:51 -0700
>> 
>> 
>> As discussed in bug#34215, I'm trying to update the
>> romanization-to-Chinese-character mapping in the
>> file ./leim/MISC-DIC/pinyin.map to use the more complete mapping
>> provided by the Google pinyin input method, licensed under Apache 2.0.
>> This expands the number of characters recognized by Emacs from around
>> 7,000 to around 17,000. (And increases the size of the mapping file from
>> 18K to 53K.)
>> 
>> I'm running into encoding problems when adding the new characters --
>> Emacs says some of the characters can't be written using the existing
>> coding system. The original file has an encoding cookie reading coding:
>> cn-gb-2312, and describing the coding system gives me:
>> 
>> chinese-iso-8bit-dos (alias: cn-gb-2312-dos euc-china-dos euc-cn-dos
>>   cn-gb-dos gb2312-dos)
>> 
>> The characters *can* be encoded using gb18030, and of course utf8. The
>> wikipedia page for gb18030 describes gb2312 as "legacy"[1], and says
>> gb18030 is a superset of 2312.
>> 
>> Is there any reason not to go straight to utf8 for this file? If that's
>> not okay, would gb18030 be acceptable?
>
> I'm not sure I understand the encoding of which file would you like to
> change?  Could you please clarify?

Sorry, I'm trying to add more characters to ./leim/MISC-DIC/pinyin.map,
which is encoded as chinese-iso-8bit-dos, and it can't accept the new
characters with that current encoding. That's the file I'd like to
change.

Thanks,
Eric






* bug#34862: 27.0.50; Trying to update pinyin.map
  2019-03-15  5:58   ` Eric Abrahamsen
@ 2019-03-15  7:04     ` Eli Zaretskii
  2019-03-15 18:31       ` Eric Abrahamsen
  0 siblings, 1 reply; 12+ messages in thread
From: Eli Zaretskii @ 2019-03-15  7:04 UTC (permalink / raw)
  To: Eric Abrahamsen; +Cc: 34862

> From: Eric Abrahamsen <eric@ericabrahamsen.net>
> Cc: 34862@debbugs.gnu.org
> Date: Thu, 14 Mar 2019 22:58:14 -0700
> 
> > I'm not sure I understand the encoding of which file would you like to
> > change?  Could you please clarify?
> 
> Sorry, I'm trying to add more characters to ./leim/MISC-DIC/pinyin.map,
> which is encoded as chinese-iso-8bit-dos, and it can't accept the new
> characters with that current encoding. That's the file I'd like to
> change.

That file is imported from an external source, isn't it?  Are you
saying we should stop synchronizing it with that source, and instead
fork it, maintain our own separate copy, and never resync with that
source again?  If so, then I see no reason not to recode it in UTF-8.

Btw, I understand that the Google pinyin method is Apache licensed,
but does this mean we can freely use its data for updating pinyin.map?
IANAL.  Could you perhaps describe how you intend to extract the data
from the Google input method for the purpose of updating our file?  I
think someone will have to audit that process for being legal and
compatible with both the Apache license and the GPL.

(Also, I'm somewhat surprised that gbk isn't capable of covering the
characters you want to add.  Or did you not try using it?)

Thanks.


* bug#34862: 27.0.50; Trying to update pinyin.map
  2019-03-15  7:04     ` Eli Zaretskii
@ 2019-03-15 18:31       ` Eric Abrahamsen
  2019-03-20  9:45         ` Eli Zaretskii
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Abrahamsen @ 2019-03-15 18:31 UTC (permalink / raw)
  To: 34862

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Eric Abrahamsen <eric@ericabrahamsen.net>
>> Cc: 34862@debbugs.gnu.org
>> Date: Thu, 14 Mar 2019 22:58:14 -0700
>> 
>> > I'm not sure I understand the encoding of which file would you like to
>> > change?  Could you please clarify?
>> 
>> Sorry, I'm trying to add more characters to ./leim/MISC-DIC/pinyin.map,
>> which is encoded as chinese-iso-8bit-dos, and it can't accept the new
>> characters with that current encoding. That's the file I'd like to
>> change.
>
> That file is imported from an external source, isn't it?  Are you
> saying we should stop synchronizing it with that source, and instead
> fork it, maintain our own separate copy, and never resync with that
> source again?  If so, then I see no reason not to recode it in UTF-8.

Near as I can tell that file was imported into Emacs in 2001 and not
touched since (apart from copyright and encoding stuff). The Debian
package from which it comes seems to have been orphaned in 2003[1]. So
there's not much to either synchronize or fork!

> Btw, I understand that the Google pinyin method is Apache licensed,
> but does this mean we can freely use its data for updating pinyin.map?
> IANAL.  Could you perhaps describe how you intend to extract the data
> from the Google input method for the purpose of updating our file?  I
> think someone will have to audit that process for being legal and
> compatible with both the Apache license and the GPL.

This[2] is the source file I used. I chopped off all the
multiple-character dictionary entries, and munged the remaining data
into the format we need. I.e., lines like this:

八 6677.54934466 0 ba
把 165484.231697 0 ba
吧 385205.434615 0 ba

Became this:

ba 吧把八

A straight rearrangement, with frequency of use translated into simple
ordering of the characters. While this is obviously pretty manual, and a
bit of work, a file like this really only needs to be updated every five
years or so -- if that. Whenever someone thinks of it.
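
The rearrangement described above can be sketched in a few lines. This
is only an illustration in Python, not the elisp that was actually used:
it assumes each source line has the form
"<text> <frequency> <flag> <pinyin...>", keeps only single-character
entries, and orders the characters for each syllable from most to least
frequent. The function name `build_pinyin_map` is hypothetical.

```python
from collections import defaultdict


def build_pinyin_map(lines):
    """Turn rawdict-style lines into pinyin.map-style lines.

    Assumed input format: "<text> <frequency> <flag> <pinyin...>".
    Multi-character dictionary entries are dropped.
    """
    by_syllable = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Single-character entries have exactly one pinyin syllable,
        # so anything with more (or fewer) fields is skipped.
        if len(fields) != 4:
            continue
        text, freq, _flag, syllable = fields
        if len(text) != 1:
            continue
        by_syllable[syllable].append((float(freq), text))

    out = []
    for syllable in sorted(by_syllable):
        # Highest frequency first; frequency becomes plain ordering.
        entries = sorted(by_syllable[syllable],
                         key=lambda e: e[0], reverse=True)
        out.append(syllable + " " + "".join(ch for _f, ch in entries))
    return out


if __name__ == "__main__":
    sample = [
        "八 6677.54934466 0 ba",
        "把 165484.231697 0 ba",
        "吧 385205.434615 0 ba",
    ]
    print("\n".join(build_pinyin_map(sample)))  # prints: ba 吧把八
```

Malformed frequency fields would raise a ValueError here; a real script
would need to decide whether to skip or report such lines.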

Regarding the license, I'm even less of a lawyer than you, but these[3]
are the terms that cover this data.

> (Also, I'm somewhat surprised that gbk isn't capable of covering the
> characters you want to add.  Or did you not try using it?)

I did not try using it! Mostly because the error message suggested
gb18030 first. gbk also works. I don't have any opinion about encoding,
apart from assuming utf8 unless there's a good reason not to.

Thanks,
Eric

[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=189523;msg=18

[2]  https://android.googlesource.com/platform/packages/inputmethods/PinyinIME/+/refs/heads/master/jni/data/rawdict_utf16_65105_freq.txt

[3]  https://android.googlesource.com/platform/packages/inputmethods/PinyinIME/+/refs/heads/master/NOTICE


* bug#34862: 27.0.50; Trying to update pinyin.map
  2019-03-15 18:31       ` Eric Abrahamsen
@ 2019-03-20  9:45         ` Eli Zaretskii
  2019-03-20 19:30           ` Eric Abrahamsen
  0 siblings, 1 reply; 12+ messages in thread
From: Eli Zaretskii @ 2019-03-20  9:45 UTC (permalink / raw)
  To: Eric Abrahamsen, Richard Stallman; +Cc: 34862

> From: Eric Abrahamsen <eric@ericabrahamsen.net>
> Date: Fri, 15 Mar 2019 11:31:40 -0700
> 
> > That file is imported from an external source, isn't it?  Are you
> > saying we should stop synchronizing it with that source, and instead
> > fork it, maintain our own separate copy, and never resync with that
> > source again?  If so, then I see no reason not to recode it in UTF-8.
> 
> Near as I can tell that file was imported into Emacs in 2001 and not
> touched since (apart from copyright and encoding stuff). The Debian
> package from which it comes seems to have been orphaned in 2003[1]. So
> there's not much to either synchronize or fork!

OK, sounds reasonable.

> > Btw, I understand that the Google pinyin method is Apache licensed,
> > but does this mean we can freely use its data for updating pinyin.map?
> > IANAL.  Could you perhaps describe how you intend to extract the data
> > from the Google input method for the purpose of updating our file?  I
> > think someone will have to audit that process for being legal and
> > compatible with both the Apache license and the GPL.
> 
> This[2] is the source file I used. I chopped off all the
> multiple-character dictionary entries, and munged the remaining data
> into the format we need. Ie, lines like this:
> 
> 八 6677.54934466 0 ba
> 把 165484.231697 0 ba
> 吧 385205.434615 0 ba
> 
> Became this:
> 
> ba 吧把八
> 
> A straight rearrangement, with frequency of use translated into simple
> ordering of the characters. While this is obviously pretty manual, and a
> bit of work, a file like this really only needs to be updated every five
> years or so -- if that. Whenever someone thinks of it.

I think this should be done with a script, and that script should be
in our repository.  The easiest kind of a script is a Lisp program, of
course, but we can also use other kinds, such as Awk scripts.

> Regarding the license, I'm even less of a lawyer than you, but these[3]
> are the terms that cover this data.

Richard, could you please look at that license and tell if we can use
this data file?

> > (Also, I'm somewhat surprised that gbk isn't capable of covering the
> > characters you want to add.  Or did you not try using it?)
> 
> I did not try using it! Mostly because the error message suggested
> gb18030 first. gbk also works. I don't have any opinion about encoding,
> apart from assuming utf8 unless there's a good reason not to.

I see no good reason to use anything other than UTF-8.
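
For illustration, recoding such a file amounts to decoding the bytes
with the old encoding, updating the coding cookie, and writing the
result back out as UTF-8. The sketch below is in Python rather than
Emacs and rests on assumptions: the cookie is taken to be a
"coding: NAME" string near the top of the file, Python's gb2312 codec is
assumed to be able to read the existing data, and the function name
`recode_to_utf8` is made up. Line endings are left untouched.

```python
import re


def recode_to_utf8(data, source_encoding="gb2312"):
    """Decode `data` from `source_encoding` and return UTF-8 bytes.

    The first "coding: NAME" cookie found, if any, is rewritten to
    "coding: utf-8". DOS or Unix line endings pass through unchanged.
    """
    text = data.decode(source_encoding)
    text = re.sub(r"(coding:\s*)[-\w]+", r"\g<1>utf-8", text, count=1)
    return text.encode("utf-8")


if __name__ == "__main__":
    old = ";; -*- coding: cn-gb-2312 -*-\r\nba 八把吧\r\n".encode("gb2312")
    print(recode_to_utf8(old).decode("utf-8"), end="")
```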

> [1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=189523;msg=18
> 
> [2]  https://android.googlesource.com/platform/packages/inputmethods/PinyinIME/+/refs/heads/master/jni/data/rawdict_utf16_65105_freq.txt
> 
> [3]  https://android.googlesource.com/platform/packages/inputmethods/PinyinIME/+/refs/heads/master/NOTICE

Thanks.






* bug#34862: 27.0.50; Trying to update pinyin.map
  2019-03-20  9:45         ` Eli Zaretskii
@ 2019-03-20 19:30           ` Eric Abrahamsen
  2019-03-20 19:39             ` Eli Zaretskii
  2022-02-02 18:59             ` Lars Ingebrigtsen
  0 siblings, 2 replies; 12+ messages in thread
From: Eric Abrahamsen @ 2019-03-20 19:30 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 34862, Richard Stallman


On 03/20/19 11:45 AM, Eli Zaretskii wrote:

[...]

>> > Btw, I understand that the Google pinyin method is Apache licensed,
>> > but does this mean we can freely use its data for updating pinyin.map?
>> > IANAL. Could you perhaps describe how you intend to extract the data
>> > from the Google input method for the purpose of updating our file? I
>> > think someone will have to audit that process for being legal and
>> > compatible with both the Apache license and the GPL.
>> 
>> This[2] is the source file I used. I chopped off all the
>> multiple-character dictionary entries, and munged the remaining data
>> into the format we need. Ie, lines like this:
>> 
>> 八 6677.54934466 0 ba
>> 把 165484.231697 0 ba
>> 吧 385205.434615 0 ba
>> 
>> Became this:
>> 
>> ba 吧把八
>> 
>> A straight rearrangement, with frequency of use translated into simple
>> ordering of the characters. While this is obviously pretty manual, and a
>> bit of work, a file like this really only needs to be updated every five
>> years or so -- if that. Whenever someone thinks of it.
>
> I think this should be done with a script, and that script should be
> in our repository.  The easiest kind of a script is a Lisp program, of
> course, but we can also use other kinds, such as Awk scripts.

Awk seems just right for the problem, but I haven't written much in it;
I did the original munging in elisp. Would this be a script written for
use with -batch and a custom make target? Or something to be loaded into
a running Emacs and called interactively? In either case, should it also
be responsible for downloading a recent copy of the source file, or
should that be done first, and the function pointed at the file?

>> Regarding the license, I'm even less of a lawyer than you, but these[3]
>> are the terms that cover this data.
>
> Richard, could you please look at that license and tell if we can use
> this data file?
>
>> > (Also, I'm somewhat surprised that gbk isn't capable of covering the
>> > characters you want to add.  Or did you not try using it?)
>> 
>> I did not try using it! Mostly because the error message suggested
>> gb18030 first. gbk also works. I don't have any opinion about encoding,
>> apart from assuming utf8 unless there's a good reason not to.
>
> I see no good reason to use anything other than UTF-8.

Excellent. I will think about the script, and look forward to word from
Richard.

Eric






* bug#34862: 27.0.50; Trying to update pinyin.map
  2019-03-20 19:30           ` Eric Abrahamsen
@ 2019-03-20 19:39             ` Eli Zaretskii
  2019-03-20 19:41               ` Eric Abrahamsen
  2022-02-02 18:59             ` Lars Ingebrigtsen
  1 sibling, 1 reply; 12+ messages in thread
From: Eli Zaretskii @ 2019-03-20 19:39 UTC (permalink / raw)
  To: Eric Abrahamsen; +Cc: 34862, rms

> From: Eric Abrahamsen <eric@ericabrahamsen.net>
> Cc: Richard Stallman <rms@gnu.org>,  34862@debbugs.gnu.org
> Date: Wed, 20 Mar 2019 12:30:22 -0700
> 
> > I think this should be done with a script, and that script should be
> > in our repository.  The easiest kind of a script is a Lisp program, of
> > course, but we can also use other kinds, such as Awk scripts.
> 
> Awk seems just right for the problem, but I haven't written much in it;
> I did the original munging in elisp. Would this be a script written for
> use with -batch and a custom make target?

Yes.

> should it also be responsible for downloading a recent copy of the
> source file, or should that be done first, and the function pointed
> at the file?

The latter, I think.  That's what we do with the other data files we
use from external sources, e.g. see admin/unidata/.






* bug#34862: 27.0.50; Trying to update pinyin.map
  2019-03-20 19:39             ` Eli Zaretskii
@ 2019-03-20 19:41               ` Eric Abrahamsen
  0 siblings, 0 replies; 12+ messages in thread
From: Eric Abrahamsen @ 2019-03-20 19:41 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 34862, rms

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Eric Abrahamsen <eric@ericabrahamsen.net>
>> Cc: Richard Stallman <rms@gnu.org>,  34862@debbugs.gnu.org
>> Date: Wed, 20 Mar 2019 12:30:22 -0700
>> 
>> > I think this should be done with a script, and that script should be
>> > in our repository.  The easiest kind of a script is a Lisp program, of
>> > course, but we can also use other kinds, such as Awk scripts.
>> 
>> Awk seems just right for the problem, but I haven't written much in it;
>> I did the original munging in elisp. Would this be a script written for
>> use with -batch and a custom make target?
>
> Yes.
>
>> should it also be responsible for downloading a recent copy of the
>> source file, or should that be done first, and the function pointed
>> at the file?
>
> The latter, I think.  That's what we do with the other data files we
> use from external sources, e.g. see admin/unidata/.

Understood -- thanks for this.






* bug#34862: 27.0.50; Trying to update pinyin.map
  2019-03-20 19:30           ` Eric Abrahamsen
  2019-03-20 19:39             ` Eli Zaretskii
@ 2022-02-02 18:59             ` Lars Ingebrigtsen
  2022-02-08  0:26               ` Eric Abrahamsen
  1 sibling, 1 reply; 12+ messages in thread
From: Lars Ingebrigtsen @ 2022-02-02 18:59 UTC (permalink / raw)
  To: Eric Abrahamsen; +Cc: 34862, Richard Stallman

Eric Abrahamsen <eric@ericabrahamsen.net> writes:

>> I think this should be done with a script, and that script should be
>> in our repository.  The easiest kind of a script is a Lisp program, of
>> course, but we can also use other kinds, such as Awk scripts.
>
> Awk seems just right for the problem, but I haven't written much in it;
> I did the original munging in elisp. Would this be a script written for
> use with -batch and a custom make target?

It's fine to parse the files with Lisp instead of awk (unless they're
needed to boot Emacs, which I don't think is the case here).

Did you get any further with this?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#34862: 27.0.50; Trying to update pinyin.map
  2022-02-02 18:59             ` Lars Ingebrigtsen
@ 2022-02-08  0:26               ` Eric Abrahamsen
  2022-02-08  6:12                 ` Lars Ingebrigtsen
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Abrahamsen @ 2022-02-08  0:26 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: 34862, Richard Stallman


On 02/02/22 19:59 PM, Lars Ingebrigtsen wrote:
> Eric Abrahamsen <eric@ericabrahamsen.net> writes:
>
>>> I think this should be done with a script, and that script should be
>>> in our repository.  The easiest kind of a script is a Lisp program, of
>>> course, but we can also use other kinds, such as Awk scripts.
>>
>> Awk seems just right for the problem, but I haven't written much in it;
>> I did the original munging in elisp. Would this be a script written for
>> use with -batch and a custom make target?
>
> It's fine to parse the files with Lisp instead of awk (unless they're
> needed to boot Emacs, which I don't think is the case here).
>
> Did you get any further with this?

I guess I still didn't know if I should be writing the script as a part
of the make process or not...






* bug#34862: 27.0.50; Trying to update pinyin.map
  2022-02-08  0:26               ` Eric Abrahamsen
@ 2022-02-08  6:12                 ` Lars Ingebrigtsen
  0 siblings, 0 replies; 12+ messages in thread
From: Lars Ingebrigtsen @ 2022-02-08  6:12 UTC (permalink / raw)
  To: Eric Abrahamsen; +Cc: 34862, Richard Stallman

Eric Abrahamsen <eric@ericabrahamsen.net> writes:

> I guess I still didn't know if I should be writing the script as a part
> of the make process or not...

Yes, I think it would be natural to have it be part of the make process.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no





