From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Eric Abrahamsen <eric@ericabrahamsen.net>
Newsgroups: gmane.emacs.help
Subject: advice on hash tables?
Date: Fri, 04 Jul 2014 14:00:54 -0700
Message-ID: <87a98oj0u1.fsf@ericabrahamsen.net>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Trace: ger.gmane.org 1404507702 7992 80.91.229.3 (4 Jul 2014 21:01:42 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Fri, 4 Jul 2014 21:01:42 +0000 (UTC)
To: help-gnu-emacs@gnu.org
Original-X-From: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Fri Jul 04 23:01:35 2014
Return-path: <help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org>
Envelope-to: geh-help-gnu-emacs@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org>)
	id 1X3Abv-0005hv-CG
	for geh-help-gnu-emacs@m.gmane.org; Fri, 04 Jul 2014 23:01:35 +0200
Original-Received: from localhost ([::1]:38086 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org>)
	id 1X3Abv-0005aC-2N
	for geh-help-gnu-emacs@m.gmane.org; Fri, 04 Jul 2014 17:01:35 -0400
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:45301)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <geh-help-gnu-emacs@m.gmane.org>) id 1X3Abe-0005a1-HG
	for help-gnu-emacs@gnu.org; Fri, 04 Jul 2014 17:01:24 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <geh-help-gnu-emacs@m.gmane.org>) id 1X3AbW-00025s-MI
	for help-gnu-emacs@gnu.org; Fri, 04 Jul 2014 17:01:18 -0400
Original-Received: from plane.gmane.org ([80.91.229.3]:56416)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <geh-help-gnu-emacs@m.gmane.org>) id 1X3AbW-00025l-FL
	for help-gnu-emacs@gnu.org; Fri, 04 Jul 2014 17:01:10 -0400
Original-Received: from list by plane.gmane.org with local (Exim 4.69)
	(envelope-from <geh-help-gnu-emacs@m.gmane.org>) id 1X3AbT-0005BH-Lh
	for help-gnu-emacs@gnu.org; Fri, 04 Jul 2014 23:01:07 +0200
Original-Received: from 63.226.249.211 ([63.226.249.211])
	by main.gmane.org with esmtp (Gmexim 0.1 (Debian))
	id 1AlnuQ-0007hv-00
	for <help-gnu-emacs@gnu.org>; Fri, 04 Jul 2014 23:01:07 +0200
Original-Received: from eric by 63.226.249.211 with local (Gmexim 0.1 (Debian))
	id 1AlnuQ-0007hv-00
	for <help-gnu-emacs@gnu.org>; Fri, 04 Jul 2014 23:01:07 +0200
X-Injected-Via-Gmane: http://gmane.org/
Original-Lines: 62
Original-X-Complaints-To: usenet@ger.gmane.org
X-Gmane-NNTP-Posting-Host: 63.226.249.211
User-Agent: Gnus/5.130012 (Ma Gnus v0.12) Emacs/24.4.50 (gnu/linux)
Cancel-Lock: sha1:+cOfcH0hrEC3eLmTOEybgUBFNrI=
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not
	recognized.
X-Received-From: 80.91.229.3
X-BeenThere: help-gnu-emacs@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Users list for the GNU Emacs text editor <help-gnu-emacs.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/help-gnu-emacs>,
	<mailto:help-gnu-emacs-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/help-gnu-emacs>
List-Post: <mailto:help-gnu-emacs@gnu.org>
List-Help: <mailto:help-gnu-emacs-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/help-gnu-emacs>,
	<mailto:help-gnu-emacs-request@gnu.org?subject=subscribe>
Errors-To: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org
Original-Sender: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.help:98537
Archived-At: <http://permalink.gmane.org/gmane.emacs.help/98537>

I'm (very slowly) chewing on some Chinese-English translation functions
based on the freely-available CEDICT dictionary[1], this is related to a
question about Chinese word boundaries I raised earlier.

The first stage is just slurping the text-file dictionary into an elisp
data structure, for simple dictionary lookups.

This is the first time I've made anything where performance might
actually be an issue, so I'm asking for a general pointer on how to do
this. The issue is that the dictionary provides Chinese words in both
simplified and traditional characters. The typical entry looks like
this:

理性認識 理性认识 [li3 xing4 ren4 shi5] /cognition/rational knowledge/

So that's the traditional characters, simplified characters,
pronunciation in brackets, then an arbitrary number of slash-delimited
definitions. There are 108,296 such entries, one per line.

So I'd like a hash table where characters are keys, and values are
lists holding (pronunciation definition1 ...).

I don't want to have to specify what type of characters I'm using, I'd
like to just treat all types of characters as the same. The brute-force
solution would be redundant hash-table entries, one each for simplified
and traditional characters. That would double the size of the hash table
to 200,000+.

Some character don't differ between traditional/simplified: in the
example above, only the second two characters are different. So I could
also define a hash table test that used string-match-p, and construct
the hash table keys as regexps:

"理性[認认][識识]"

Or I could try using the nested alists from mule-util.el, which I don't
frankly understand. It's possible you're meant to use nested alists
*instead* of something like a hash table. But if not, the keys might
look something like:

("理性" ("認識") ("识认"))

Or perhaps it would be faster to do:

(29702 24615 (35469 35672) (35782 35748))

But again, I'm likely misunderstanding how a nested alist works.

Anyway, dictionary lookups don't need to be super fast, but I'd like to
use the same or similar data structure for finding word boundaries, so
it would be nice to get something that goes fast. In any even, it's a
good opportunity to learn a bit about efficiency.

My actual question is: does anyone have any advice on a clever way to
approach this?

Thanks!
Eric


[1]: http://www.mdbg.net/chindict/chindict.php?page=cc-cedict