advice on hash tables?

unofficial mirror of help-gnu-emacs@gnu.org
 help / color / mirror / Atom feed

* advice on hash tables?
@ 2014-07-04 21:00 Eric Abrahamsen
  2014-07-05  1:45 ` Stefan Monnier
  2014-07-05 17:57 ` Michael Heerdegen
  0 siblings, 2 replies; 8+ messages in thread
From: Eric Abrahamsen @ 2014-07-04 21:00 UTC (permalink / raw)
  To: help-gnu-emacs

I'm (very slowly) chewing on some Chinese-English translation functions
based on the freely-available CEDICT dictionary[1], this is related to a
question about Chinese word boundaries I raised earlier.

The first stage is just slurping the text-file dictionary into an elisp
data structure, for simple dictionary lookups.

This is the first time I've made anything where performance might
actually be an issue, so I'm asking for a general pointer on how to do
this. The issue is that the dictionary provides Chinese words in both
simplified and traditional characters. The typical entry looks like
this:

理性認識 理性认识 [li3 xing4 ren4 shi5] /cognition/rational knowledge/

So that's the traditional characters, simplified characters,
pronunciation in brackets, then an arbitrary number of slash-delimited
definitions. There are 108,296 such entries, one per line.

So I'd like a hash table where characters are keys, and values are
lists holding (pronunciation definition1 ...).

I don't want to have to specify what type of characters I'm using, I'd
like to just treat all types of characters as the same. The brute-force
solution would be redundant hash-table entries, one each for simplified
and traditional characters. That would double the size of the hash table
to 200,000+.

Some character don't differ between traditional/simplified: in the
example above, only the second two characters are different. So I could
also define a hash table test that used string-match-p, and construct
the hash table keys as regexps:

"理性[認认][識识]"

Or I could try using the nested alists from mule-util.el, which I don't
frankly understand. It's possible you're meant to use nested alists
*instead* of something like a hash table. But if not, the keys might
look something like:

("理性" ("認識") ("识认"))

Or perhaps it would be faster to do:

(29702 24615 (35469 35672) (35782 35748))

But again, I'm likely misunderstanding how a nested alist works.

Anyway, dictionary lookups don't need to be super fast, but I'd like to
use the same or similar data structure for finding word boundaries, so
it would be nice to get something that goes fast. In any even, it's a
good opportunity to learn a bit about efficiency.

My actual question is: does anyone have any advice on a clever way to
approach this?

Thanks!
Eric

[1]: http://www.mdbg.net/chindict/chindict.php?page=cc-cedict

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: advice on hash tables?
  2014-07-04 21:00 advice on hash tables? Eric Abrahamsen
@ 2014-07-05  1:45 ` Stefan Monnier
  2014-07-05  6:48   ` Eric Abrahamsen
       [not found]   ` <mailman.4877.1404542921.1147.help-gnu-emacs@gnu.org>
  2014-07-05 17:57 ` Michael Heerdegen
  1 sibling, 2 replies; 8+ messages in thread
From: Stefan Monnier @ 2014-07-05  1:45 UTC (permalink / raw)
  To: help-gnu-emacs

> I don't want to have to specify what type of characters I'm using, I'd
> like to just treat all types of characters as the same. The brute-force
> solution would be redundant hash-table entries, one each for simplified
> and traditional characters. That would double the size of the hash table
> to 200,000+.

If the mapping from traditional characters to simplified characters
a function?  If so, I suggest you use as key the simplified characters
version, and then when looking things up, you first apply the
simplification function.


        Stefan




^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: advice on hash tables?
  2014-07-05  1:45 ` Stefan Monnier
@ 2014-07-05  6:48   ` Eric Abrahamsen
       [not found]   ` <mailman.4877.1404542921.1147.help-gnu-emacs@gnu.org>
  1 sibling, 0 replies; 8+ messages in thread
From: Eric Abrahamsen @ 2014-07-05  6:48 UTC (permalink / raw)
  To: help-gnu-emacs

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> I don't want to have to specify what type of characters I'm using, I'd
>> like to just treat all types of characters as the same. The brute-force
>> solution would be redundant hash-table entries, one each for simplified
>> and traditional characters. That would double the size of the hash table
>> to 200,000+.
>
> If the mapping from traditional characters to simplified characters
> a function?  If so, I suggest you use as key the simplified characters
> version, and then when looking things up, you first apply the
> simplification function.

Is there a traditional-to-simplified reduction function? That's a good
question, there might be. I will take a look and see, but something
tells me there isn't. That could be a useful intermediate step, though.




^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: advice on hash tables?
       [not found]   ` <mailman.4877.1404542921.1147.help-gnu-emacs@gnu.org>
@ 2014-07-05 13:27     ` Stefan Monnier
  2014-07-05 15:58       ` Stefan Monnier
  2014-07-05 17:17       ` Eric Abrahamsen
  0 siblings, 2 replies; 8+ messages in thread
From: Stefan Monnier @ 2014-07-05 13:27 UTC (permalink / raw)
  To: help-gnu-emacs

>> If the mapping from traditional characters to simplified characters
   ^^
   Is
>> a function?

> Is there a traditional-to-simplified reduction function?

No, I meant "it it", not "is there".  "Function" was meant in the
set-theory sense.  IOW, the question is whether a given traditional
character always maps to the same simplified character.

If it is, then you can write the mapping function.  Such a function would
most naturally be implemented in Elisp as a char-table, which you can
auto-generate from your input data.

        Stefan

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: advice on hash tables?
  2014-07-05 13:27     ` Stefan Monnier
@ 2014-07-05 15:58       ` Stefan Monnier
  2014-07-05 17:17       ` Eric Abrahamsen
  1 sibling, 0 replies; 8+ messages in thread
From: Stefan Monnier @ 2014-07-05 15:58 UTC (permalink / raw)
  To: help-gnu-emacs

> No, I meant "it it", not "is there".  "Function" was meant in the
               ^^
               is

Looks like I'm resisting the "s".


        Stefan




^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: advice on hash tables?
  2014-07-05 13:27     ` Stefan Monnier
  2014-07-05 15:58       ` Stefan Monnier
@ 2014-07-05 17:17       ` Eric Abrahamsen
  1 sibling, 0 replies; 8+ messages in thread
From: Eric Abrahamsen @ 2014-07-05 17:17 UTC (permalink / raw)
  To: help-gnu-emacs

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>> If the mapping from traditional characters to simplified characters
>    ^^
>    Is
>>> a function?
>
>> Is there a traditional-to-simplified reduction function?
>
> No, I meant "it it", not "is there".  "Function" was meant in the
> set-theory sense.  IOW, the question is whether a given traditional
> character always maps to the same simplified character.

I'd need to do a bit of research, but as I recall all traditional
characters are only simplified one way, yes. Sometimes multiple
traditional characters will turn into the same simplified one, but I
don't think one character can be simplified multiple ways.

> If it is, then you can write the mapping function.  Such a function would
> most naturally be implemented in Elisp as a char-table, which you can
> auto-generate from your input data.

Okay, so I'd first "downcase" the string to simplified characters, then
do the lookup. That could be doable...

Thanks!
Eric

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: advice on hash tables?
  2014-07-04 21:00 advice on hash tables? Eric Abrahamsen
  2014-07-05  1:45 ` Stefan Monnier
@ 2014-07-05 17:57 ` Michael Heerdegen
  2014-07-05 18:22   ` Eric Abrahamsen
  1 sibling, 1 reply; 8+ messages in thread
From: Michael Heerdegen @ 2014-07-05 17:57 UTC (permalink / raw)
  To: help-gnu-emacs

Hi Eric,

> 理性認識 理性认识 [li3 xing4 ren4 shi5] /cognition/rational knowledge/

Since the dictionary is organized by lines, consider using simple text
search.  This gives you results nearly instantly (try isearching the
dictionary file, or M-x occur !); OTOH, going through the file and
creating some lisp structure from it will cause a large delay, probably,
lots of seconds.

BTW, here's how helm implements a dict lookup interface for some
dictionaries avoiding building up a Lisp structure:

  https://github.com/emacs-helm/helm-dictionary

Michael.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: advice on hash tables?
  2014-07-05 17:57 ` Michael Heerdegen
@ 2014-07-05 18:22   ` Eric Abrahamsen
  0 siblings, 0 replies; 8+ messages in thread
From: Eric Abrahamsen @ 2014-07-05 18:22 UTC (permalink / raw)
  To: help-gnu-emacs

Michael Heerdegen <michael_heerdegen@web.de> writes:

> Hi Eric,
>
>> 理性認識 理性认识 [li3 xing4 ren4 shi5] /cognition/rational knowledge/
>
> Since the dictionary is organized by lines, consider using simple text
> search.  This gives you results nearly instantly (try isearching the
> dictionary file, or M-x occur !); OTOH, going through the file and
> creating some lisp structure from it will cause a large delay, probably,
> lots of seconds.
>
> BTW, here's how helm implements a dict lookup interface for some
> dictionaries avoiding building up a Lisp structure:
>
>   https://github.com/emacs-helm/helm-dictionary

Actually that's a very good point... I'm doing the manual isearch thing
now, and that has gotten annoying enough that I wanted to automate it,
but you're quite right it would make much more sense just to write a
function to zip through the file and find the definition. At least at
this stage of things. Thanks!




^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2014-07-05 18:22 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-07-04 21:00 advice on hash tables? Eric Abrahamsen
2014-07-05  1:45 ` Stefan Monnier
2014-07-05  6:48   ` Eric Abrahamsen
     [not found]   ` <mailman.4877.1404542921.1147.help-gnu-emacs@gnu.org>
2014-07-05 13:27     ` Stefan Monnier
2014-07-05 15:58       ` Stefan Monnier
2014-07-05 17:17       ` Eric Abrahamsen
2014-07-05 17:57 ` Michael Heerdegen
2014-07-05 18:22   ` Eric Abrahamsen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).