From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Eric Abrahamsen Newsgroups: gmane.emacs.help Subject: advice on hash tables? Date: Fri, 04 Jul 2014 14:00:54 -0700 Message-ID: <87a98oj0u1.fsf@ericabrahamsen.net> NNTP-Posting-Host: plane.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-Trace: ger.gmane.org 1404507702 7992 80.91.229.3 (4 Jul 2014 21:01:42 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Fri, 4 Jul 2014 21:01:42 +0000 (UTC) To: help-gnu-emacs@gnu.org Original-X-From: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Fri Jul 04 23:01:35 2014 Return-path: Envelope-to: geh-help-gnu-emacs@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1X3Abv-0005hv-CG for geh-help-gnu-emacs@m.gmane.org; Fri, 04 Jul 2014 23:01:35 +0200 Original-Received: from localhost ([::1]:38086 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1X3Abv-0005aC-2N for geh-help-gnu-emacs@m.gmane.org; Fri, 04 Jul 2014 17:01:35 -0400 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:45301) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1X3Abe-0005a1-HG for help-gnu-emacs@gnu.org; Fri, 04 Jul 2014 17:01:24 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1X3AbW-00025s-MI for help-gnu-emacs@gnu.org; Fri, 04 Jul 2014 17:01:18 -0400 Original-Received: from plane.gmane.org ([80.91.229.3]:56416) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1X3AbW-00025l-FL for help-gnu-emacs@gnu.org; Fri, 04 Jul 2014 17:01:10 -0400 Original-Received: from list by plane.gmane.org with local (Exim 4.69) (envelope-from ) id 1X3AbT-0005BH-Lh for help-gnu-emacs@gnu.org; Fri, 04 Jul 2014 23:01:07 +0200 Original-Received: from 63.226.249.211 ([63.226.249.211]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Fri, 04 Jul 2014 23:01:07 +0200 Original-Received: from eric by 63.226.249.211 with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Fri, 04 Jul 2014 23:01:07 +0200 X-Injected-Via-Gmane: http://gmane.org/ Original-Lines: 62 Original-X-Complaints-To: usenet@ger.gmane.org X-Gmane-NNTP-Posting-Host: 63.226.249.211 User-Agent: Gnus/5.130012 (Ma Gnus v0.12) Emacs/24.4.50 (gnu/linux) Cancel-Lock: sha1:+cOfcH0hrEC3eLmTOEybgUBFNrI= X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 80.91.229.3 X-BeenThere: help-gnu-emacs@gnu.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Users list for the GNU Emacs text editor List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Original-Sender: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Xref: news.gmane.org gmane.emacs.help:98537 Archived-At: I'm (very slowly) chewing on some Chinese-English translation functions based on the freely-available CEDICT dictionary[1], this is related to a question about Chinese word boundaries I raised earlier. The first stage is just slurping the text-file dictionary into an elisp data structure, for simple dictionary lookups. This is the first time I've made anything where performance might actually be an issue, so I'm asking for a general pointer on how to do this. The issue is that the dictionary provides Chinese words in both simplified and traditional characters. The typical entry looks like this: 理性認識 理性认识 [li3 xing4 ren4 shi5] /cognition/rational knowledge/ So that's the traditional characters, simplified characters, pronunciation in brackets, then an arbitrary number of slash-delimited definitions. There are 108,296 such entries, one per line. So I'd like a hash table where characters are keys, and values are lists holding (pronunciation definition1 ...). I don't want to have to specify what type of characters I'm using, I'd like to just treat all types of characters as the same. The brute-force solution would be redundant hash-table entries, one each for simplified and traditional characters. That would double the size of the hash table to 200,000+. Some character don't differ between traditional/simplified: in the example above, only the second two characters are different. So I could also define a hash table test that used string-match-p, and construct the hash table keys as regexps: "理性[認认][識识]" Or I could try using the nested alists from mule-util.el, which I don't frankly understand. It's possible you're meant to use nested alists *instead* of something like a hash table. But if not, the keys might look something like: ("理性" ("認識") ("识认")) Or perhaps it would be faster to do: (29702 24615 (35469 35672) (35782 35748)) But again, I'm likely misunderstanding how a nested alist works. Anyway, dictionary lookups don't need to be super fast, but I'd like to use the same or similar data structure for finding word boundaries, so it would be nice to get something that goes fast. In any even, it's a good opportunity to learn a bit about efficiency. My actual question is: does anyone have any advice on a clever way to approach this? Thanks! Eric [1]: http://www.mdbg.net/chindict/chindict.php?page=cc-cedict