From: Eric Abrahamsen <eric@ericabrahamsen.net>
To: 34862@debbugs.gnu.org
Subject: bug#34862: 27.0.50; Trying to update pinyin.map
Date: Fri, 15 Mar 2019 11:31:40 -0700
Message-ID: <871s38at0z.fsf@ericabrahamsen.net>
In-Reply-To: <87zhpxyvls.fsf@ericabrahamsen.net>

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Eric Abrahamsen <eric@ericabrahamsen.net>
>> Cc: 34862@debbugs.gnu.org
>> Date: Thu, 14 Mar 2019 22:58:14 -0700
>> 
>> > I'm not sure I understand the encoding of which file would you like to
>> > change?  Could you please clarify?
>> 
>> Sorry, I'm trying to add more characters to ./leim/MISC-DIC/pinyin.map,
>> which is encoded as chinese-iso-8bit-dos, and it can't accept the new
>> characters with that current encoding. That's the file I'd like to
>> change.
>
> That file is imported from an external source, isn't it?  Are you
> saying we should stop synchronizing it with that source, and instead
> fork it, maintain our own separate copy, and never resync with that
> source again?  If so, then I see no reason not to recode it in UTF-8.

Near as I can tell that file was imported into Emacs in 2001 and not
touched since (apart from copyright and encoding stuff). The Debian
package from which it comes seems to have been orphaned in 2003[1]. So
there's not much to either synchronize or fork!

> Btw, I understand that the Google pinyin method is Apache licensed,
> but does this mean we can freely use its data for updating pinyin.map?
> IANAL.  Could you perhaps describe how you intend to extract the data
> from the Google input method for the purpose of updating our file?  I
> think someone will have to audit that process for being legal and
> compatible with both the Apache license and the GPL.

This[2] is the source file I used. I chopped off all the
multiple-character dictionary entries, and munged the remaining data
into the format we need. I.e., lines like this:

八 6677.54934466 0 ba
把 165484.231697 0 ba
吧 385205.434615 0 ba

became this:

ba 吧把八

A straight rearrangement, with frequency of use translated into simple
ordering of the characters (most frequent first). While this is obviously
pretty manual, and a bit of work, a file like this really only needs to be
updated every five years or so -- if that. Whenever someone thinks of it.
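
For illustration only, the rearrangement could be sketched roughly as
below. This is not the script actually used. It assumes the
four-column layout visible in the sample lines (character, frequency,
a third column that is ignored, pinyin) and assumes that
multiple-character entries carry more than one character in the first
column; the function name is hypothetical.

```python
from collections import defaultdict


def rawdict_to_pinyin_map(lines):
    """Group single characters by pinyin syllable, most frequent first.

    Assumed input layout per line: <word> <frequency> <flag> <pinyin...>
    Returns a list of strings such as "ba 吧把八".
    """
    by_pinyin = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        word, freq, syllables = fields[0], float(fields[1]), fields[3:]
        if len(word) != 1 or len(syllables) != 1:
            continue  # drop multiple-character dictionary entries
        by_pinyin[syllables[0]].append((freq, word))

    result = []
    for syllable in sorted(by_pinyin):
        # Higher frequency sorts earlier; the number itself is discarded.
        ranked = sorted(by_pinyin[syllable], key=lambda pair: -pair[0])
        result.append(syllable + " " + "".join(ch for _, ch in ranked))
    return result


sample = [
    "八 6677.54934466 0 ba",
    "把 165484.231697 0 ba",
    "吧 385205.434615 0 ba",
]
print("\n".join(rawdict_to_pinyin_map(sample)))
```

Only the relative order of the frequencies survives in the output,
which matches the "simple ordering" described above.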

Regarding the license, I'm even less of a lawyer than you, but these[3]
are the terms that cover this data.

> (Also, I'm somewhat surprised that gbk isn't capable of covering the
> characters you want to add.  Or did you not try using it?)

I did not try using it! Mostly because the error message suggested
gb18030 first. gbk also works. I don't have any opinion about encoding,
apart from assuming UTF-8 unless there's a good reason not to.
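
As a rough way to check which encoding a given set of characters
needs, here is a minimal sketch using Python's standard codecs. It
assumes that Emacs's chinese-iso-8bit corresponds to Python's "gb2312"
codec; the helper name and the candidate list are hypothetical.

```python
def first_encoding_that_fits(text, candidates=("gb2312", "gbk", "gb18030", "utf-8")):
    """Return the first codec in `candidates` that can encode all of `text`.

    Candidates are tried in order, so list the narrowest encoding first.
    Returns None if no candidate can encode the text.
    """
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # some character is outside this codec's repertoire
        return name
    return None


print(first_encoding_that_fits("吧把八"))  # simplified characters only
print(first_encoding_that_fits("龍"))      # traditional character
```

Running this over the full set of new characters would show whether
gbk really is enough, or whether gb18030 or UTF-8 is needed.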

Thanks,
Eric

[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=189523;msg=18

[2]  https://android.googlesource.com/platform/packages/inputmethods/PinyinIME/+/refs/heads/master/jni/data/rawdict_utf16_65105_freq.txt

[3]  https://android.googlesource.com/platform/packages/inputmethods/PinyinIME/+/refs/heads/master/NOTICE
