From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: rushan chen Newsgroups: gmane.lisp.guile.devel Subject: Re: a passionate guy who want to join in as a developer Date: Sun, 12 Aug 2012 22:31:22 +0800 Message-ID: References: <5024F643.8040706@netris.org> NNTP-Posting-Host: plane.gmane.org Mime-Version: 1.0 Content-Type: multipart/alternative; boundary=e89a8f3b9f65653be104c7126e87 X-Trace: dough.gmane.org 1344781892 16120 80.91.229.3 (12 Aug 2012 14:31:32 GMT) X-Complaints-To: usenet@dough.gmane.org NNTP-Posting-Date: Sun, 12 Aug 2012 14:31:32 +0000 (UTC) Cc: guile-devel@gnu.org To: Mark H Weaver Original-X-From: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Sun Aug 12 16:31:32 2012 Return-path: Envelope-to: guile-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1T0ZCV-0000lU-Sq for guile-devel@m.gmane.org; Sun, 12 Aug 2012 16:31:32 +0200 Original-Received: from localhost ([::1]:60730 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1T0ZCU-0001JQ-Sf for guile-devel@m.gmane.org; Sun, 12 Aug 2012 10:31:30 -0400 Original-Received: from eggs.gnu.org ([208.118.235.92]:40829) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1T0ZCQ-0001Fu-MO for guile-devel@gnu.org; Sun, 12 Aug 2012 10:31:28 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1T0ZCO-0006OM-7H for guile-devel@gnu.org; Sun, 12 Aug 2012 10:31:26 -0400 Original-Received: from mail-wg0-f49.google.com ([74.125.82.49]:54672) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1T0ZCN-0006NY-TZ for guile-devel@gnu.org; Sun, 12 Aug 2012 10:31:24 -0400 Original-Received: by wgbez12 with SMTP id ez12so2119074wgb.30 for ; Sun, 12 Aug 2012 07:31:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=1PphRchLxCle4rsZyLTgloYsSRbR8TkkPqOSwSNsQng=; b=FZ7IniC6zdE6trVBMU7GI5db9csBNHCLq24JRtbwoPS1JMQ9M+jMW/MuqgVXZ9zVAO DHbh8iFwhZiTn1P50jlZp8iq5EB3pYXfz1QcqERnH9d2PiTQ5PLpfcB83KV7+P1lmhrh bHAcraS0yg5IJKyKIgyMLirKPpc3FTlFuJMihMYTBBD8jCh/whgLqrgM79OvBmVfJDaJ 67xvlWg65dKT/JqbU1PnDYEp/vZbw1jlYLL/lRB4Q+5nH9lW31MxabDcUfTFyI+vXGR7 ngdsWLX1hhHoHput007FntzsVTPhTKOd03pDG+RW30zowlWw6okF4pwPcRXvcM1ZkLWC szqQ== Original-Received: by 10.180.107.2 with SMTP id gy2mr11014076wib.2.1344781882309; Sun, 12 Aug 2012 07:31:22 -0700 (PDT) Original-Received: by 10.216.0.139 with HTTP; Sun, 12 Aug 2012 07:31:22 -0700 (PDT) In-Reply-To: <5024F643.8040706@netris.org> X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 74.125.82.49 X-BeenThere: guile-devel@gnu.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: "Developers list for Guile, the GNU extensibility library" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Original-Sender: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.lisp.guile.devel:14795 Archived-At: --e89a8f3b9f65653be104c7126e87 Content-Type: text/plain; charset=ISO-8859-1 Hi Mark, Very appreciate for your reply. I see you mention that it's useful to implement a larger library of efficient data structure, and I'm interested in that very much. I used to work on projects which involve complicated but very interesting data structures, implementing them could be challenging, but once done I feel a great sense of achievement. One such project is implementing a language model (LM) which is a core component of speech recognition and machine translation. I don't know if you heard of it before. Unfortunately, I can't cover it too detailed here, that would complicate things too much. Basically, one of the key operations LM supports is it should return a probability associated with any given id sequence. All id sequences are of the same length, and there are a mass amount of such id sequences (a commonly-seen LM may contain billions of them). So it's required to store LM in a concise way, and at the same time make the search for each id sequence very quickly. Trie is finally chosen to be the data structure for LM (there were many papers discussing this issue). All id sequences with the same prefix share the same internal node, for example, for <1, 2, 3, 4> and <1, 2, 3, 5>, only one copy of <1, 2, 3> will be stored in LM, and a search for a id sequence is done by a sequence of binary search until the leaf is met. One extra thing worth mentioning is that I store the whole trie structure in a single large piece of memory (usually around 2 gigabytes), which makes it convenient to write out to disk and load into memory by simply using mmap, and I think it also makes the system faster than if you allocate memory every time it's needed. There are some other projects I worked or working on like Spell Corrector, which also involve complicated data structures, but due to privacy policy, I can't say much about it. All in all, I'm very interested in it, and I really really hope I can help. Looking forward to your reply. Thanks in advance. Have fun! Rushan Chen --e89a8f3b9f65653be104c7126e87 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Hi Mark,

Very appreciate for your reply.=A0
I see you mention that it's useful to implement a larger l= ibrary of efficient data structure, and I'm interested in that very muc= h. I used to work on projects which involve complicated but very interestin= g data structures, implementing them could be challenging, but once done I = feel a great sense of achievement.=A0

One such project is implementing a language model (LM) = which is a core component of=A0speech recognition and machine translation. = I don't know if you heard of it before.=A0Unfortunately, I can't co= ver it too detailed here, that would complicate things too much.=A0

Basically, one of the key operations LM supports is it = should return a probability associated with any given id sequence. All id s= equences are of the same length, and there are a mass amount of such id seq= uences (a commonly-seen LM may contain billions of them). So it's requi= red to store LM in a concise way, and at the same time make the search for = each id sequence very quickly.

Trie is finally=A0chosen=A0to be the data structure for= LM (there were many papers discussing this issue). All id sequences with t= he same prefix share the same internal node, for example, for <1, 2, 3, = 4> and <1, 2, 3, 5>, only one copy of <1, 2, 3> will be stor= ed in LM, and a search for a id sequence is done by a sequence of binary se= arch until the leaf is met. One extra thing worth mentioning is that I stor= e the whole trie structure in a single large piece of memory (usually aroun= d 2 gigabytes), which makes it=A0convenient=A0to write out to disk and load= into memory by simply using mmap, and I think it also makes the system fas= ter than if you allocate memory every time it's needed.

There are some other projects I worked or wo= rking on like Spell Corrector, which also involve complicated data structur= es, but due to privacy policy, I can't say much about it.

All in all, I'm very interested in it, and I really real= ly hope I can help.

Looking forward to your reply.= Thanks in advance.

Have fun!

Rushan Chen
--e89a8f3b9f65653be104c7126e87--