From: Eric Abrahamsen
Newsgroups: gmane.emacs.bugs
Subject: bug#34862: 27.0.50; Trying to update pinyin.map
Date: Wed, 20 Mar 2019 12:30:22 -0700
Message-ID: <87wokts5rl.fsf@ericabrahamsen.net>
References: <87zhpxyvls.fsf@ericabrahamsen.net> <83ftro20gt.fsf@gnu.org>
 <87o96cbrwp.fsf@ericabrahamsen.net> <83ef781uuh.fsf@gnu.org>
 <871s38at0z.fsf@ericabrahamsen.net> <83woktswud.fsf@gnu.org>
In-Reply-To: <83woktswud.fsf@gnu.org> (Eli Zaretskii's message of "Wed, 20 Mar 2019 11:45:30 +0200")
To: Eli Zaretskii
Cc: 34862@debbugs.gnu.org, Richard Stallman

On 03/20/19 11:45 AM, Eli Zaretskii wrote:

[...]

>> > Btw, I understand that the Google pinyin method is Apache licensed,
>> > but does this mean we can freely use its data for updating pinyin.map?
>> > IANAL. Could you perhaps describe how you intend to extract the data
>> > from the Google input method for the purpose of updating our file? I
>> > think someone will have to audit that process for being legal and
>> > compatible with both the Apache license and the GPL.
>>
>> This[2] is the source file I used. I chopped off all the
>> multiple-character dictionary entries, and munged the remaining data
>> into the format we need. Ie, lines like this:
>>
>> 八 6677.54934466 0 ba
>> 把 165484.231697 0 ba
>> 吧 385205.434615 0 ba
>>
>> Became this:
>>
>> ba 吧把八
>>
>> A straight rearrangement, with frequency of use translated into simple
>> ordering of the characters. While this is obviously pretty manual, and a
>> bit of work, a file like this really only needs to be updated every five
>> years or so -- if that. Whenever someone thinks of it.
>
> I think this should be done with a script, and that script should be
> in our repository. The easiest kind of a script is a Lisp program, of
> course, but we can also use other kinds, such as Awk scripts.

Awk seems just right for the problem, but I haven't written much in it;
I did the original munging in elisp.
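
For illustration only, the rearrangement described in the quoted text
could be sketched as follows. This is a hypothetical Python sketch, not
the elisp used for the original munging and not the Lisp or Awk script
proposed for the repository. The function name `build_pinyin_lines` is
made up, and the input layout (character, frequency, one unused field,
pinyin, separated by whitespace) and the space-separated output are
assumptions inferred solely from the three sample lines above.

```python
from collections import defaultdict


def build_pinyin_lines(source_lines):
    """Group single characters by pinyin, most frequent first.

    Hypothetical helper: the four-field input layout is an assumption
    based on the sample lines quoted in this thread.
    """
    by_pinyin = defaultdict(list)
    for line in source_lines:
        fields = line.split()
        # Assumed layout: character, frequency, one unused field, pinyin.
        if len(fields) != 4:
            continue
        char, freq, _unused, pinyin = fields
        # Skip multiple-character dictionary entries.
        if len(char) != 1:
            continue
        by_pinyin[pinyin].append((float(freq), char))

    result = []
    for pinyin in sorted(by_pinyin):
        # Higher frequency sorts earlier; sorted() keeps input order on ties.
        ranked = sorted(by_pinyin[pinyin], key=lambda pair: -pair[0])
        result.append(pinyin + " " + "".join(char for _freq, char in ranked))
    return result


sample = [
    "八 6677.54934466 0 ba",
    "把 165484.231697 0 ba",
    "吧 385205.434615 0 ba",
]
print("\n".join(build_pinyin_lines(sample)))
```

Run on the three sample lines, this prints the single line "ba 吧把八",
matching the example in the quoted message. Whether the real source file
always has exactly four fields per single-character entry would need to
be checked against the file itself.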

Would this be a script written for use with -batch and a custom make
target? Or something to be loaded into a running Emacs and called
interactively? In either case, should it also be responsible for
downloading a recent copy of the source file, or should that be done
first, and the function pointed at the file?

>> Regarding the license, I'm even less of a lawyer than you, but these[3]
>> are the terms that cover this data.
>
> Richard, could you please look at that license and tell if we can use
> this data file?
>
>> > (Also, I'm somewhat surprised that gbk isn't capable of covering the
>> > characters you want to add. Or did you not try using it?)
>>
>> I did not try using it! Mostly because the error message suggested
>> gb18030 first. gbk also works. I don't have any opinion about encoding,
>> apart from assuming utf8 unless there's a good reason not to.
>
> I see no good reason to use anything other than UTF-8.

Excellent. I will think about the script, and look forward to word from
Richard.

Eric