unofficial mirror of help-gnu-emacs@gnu.org
* advice on hash tables?
@ 2014-07-04 21:00 Eric Abrahamsen
  2014-07-05  1:45 ` Stefan Monnier
  2014-07-05 17:57 ` Michael Heerdegen
  0 siblings, 2 replies; 8+ messages in thread
From: Eric Abrahamsen @ 2014-07-04 21:00 UTC (permalink / raw)
  To: help-gnu-emacs

I'm (very slowly) chewing on some Chinese-English translation functions
based on the freely-available CEDICT dictionary[1]; this is related to a
question about Chinese word boundaries I raised earlier.

The first stage is just slurping the text-file dictionary into an elisp
data structure, for simple dictionary lookups.

This is the first time I've made anything where performance might
actually be an issue, so I'm asking for a general pointer on how to do
this. The issue is that the dictionary provides Chinese words in both
simplified and traditional characters. The typical entry looks like
this:

理性認識 理性认识 [li3 xing4 ren4 shi5] /cognition/rational knowledge/

So that's the traditional characters, simplified characters,
pronunciation in brackets, then an arbitrary number of slash-delimited
definitions. There are 108,296 such entries, one per line.
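For what it's worth, a line in that format splits cleanly with a single
regexp. This is only a sketch (the function name is mine, not from any
existing package):

```elisp
;; Parse one CEDICT line into its four fields.  The regexp assumes the
;; documented layout: TRAD SIMP [PINYIN] /def1/def2/.../
(defun my/parse-cedict-line (line)
  "Parse LINE into a list (TRADITIONAL SIMPLIFIED PRONUNCIATION DEFINITIONS)."
  (when (string-match
         "^\\([^ ]+\\) \\([^ ]+\\) \\[\\([^]]+\\)\\] /\\(.*\\)/$"
         line)
    (list (match-string 1 line)
          (match-string 2 line)
          (match-string 3 line)
          (split-string (match-string 4 line) "/"))))
```

On the example entry this yields ("理性認識" "理性认识"
"li3 xing4 ren4 shi5" ("cognition" "rational knowledge")).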

So I'd like a hash table where the Chinese words are keys, and the
values are lists of the form (pronunciation definition1 ...).

I don't want to have to specify which character set I'm using; I'd like
to treat both scripts the same. The brute-force solution would be
redundant hash-table entries, one each for the simplified and
traditional forms. That would roughly double the size of the hash table
to 200,000+ entries.
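The brute-force version is at least short to write. A sketch (the names
are mine), which also skips the duplicate entry when the two forms are
identical, so the table ends up somewhat under the full doubled size:

```elisp
;; Store each entry under both the traditional and simplified forms,
;; skipping the second puthash when the two scripts agree.
(defvar my/cedict (make-hash-table :test #'equal :size 110000))

(defun my/cedict-add (trad simp value)
  (puthash trad value my/cedict)
  (unless (equal trad simp)
    (puthash simp value my/cedict)))
```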

Some characters don't differ between traditional and simplified: in the
example above, only the last two characters differ. So I could also
define a hash-table test that uses string-match-p, and construct the
hash-table keys as regexps:

"理性[認认][識识]"
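One catch with the regexp-key route: define-hash-table-test wants a
hash function as well as an equality predicate, and I can't see a hash
that sends "理性认识" and "理性[認认][識识]" to the same bucket. A
constant hash works, but it collapses everything into one bucket, so
every lookup degrades to a linear scan. A sketch of the idea anyway:

```elisp
;; Custom test: keys are equal if either matches the other as a regexp.
;; Hashing everything to 0 keeps the test consistent, at the cost of
;; O(n) lookups -- which likely defeats the point of a hash table.
(define-hash-table-test 'cedict-regexp
  (lambda (a b) (or (string-match-p a b)
                    (string-match-p b a)))
  (lambda (_key) 0))

(defvar my/cedict-re (make-hash-table :test 'cedict-regexp))
```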

Or I could try using the nested alists from mule-util.el, which I
frankly don't understand. It's possible you're meant to use nested
alists *instead* of something like a hash table. But if not, the keys
might look something like:
look something like:

("理性" ("認識") ("认识"))

Or perhaps it would be faster to use the raw character codes:

(29702 24615 (35469 35672) (35748 35782))

But again, I'm likely misunderstanding how a nested alist works.

Anyway, dictionary lookups don't need to be super fast, but I'd like to
use the same or a similar data structure for finding word boundaries,
so it would be nice to get something that goes fast. In any event, it's
a good opportunity to learn a bit about efficiency.
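Whichever structure wins, the built-in benchmark-run macro (from
benchmark.el) should make comparing them straightforward; my/cedict
here stands in for whatever table actually gets built:

```elisp
;; Time 10,000 lookups; returns (ELAPSED GC-COUNT GC-ELAPSED).
(require 'benchmark)
(benchmark-run 10000
  (gethash "理性认识" my/cedict))
```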

My actual question is: does anyone have any advice on a clever way to
approach this?

Thanks!
Eric



[1]: http://www.mdbg.net/chindict/chindict.php?page=cc-cedict





Thread overview: 8+ messages
2014-07-04 21:00 advice on hash tables? Eric Abrahamsen
2014-07-05  1:45 ` Stefan Monnier
2014-07-05  6:48   ` Eric Abrahamsen
     [not found]   ` <mailman.4877.1404542921.1147.help-gnu-emacs@gnu.org>
2014-07-05 13:27     ` Stefan Monnier
2014-07-05 15:58       ` Stefan Monnier
2014-07-05 17:17       ` Eric Abrahamsen
2014-07-05 17:57 ` Michael Heerdegen
2014-07-05 18:22   ` Eric Abrahamsen
