From mboxrd@z Thu Jan 1 00:00:00 1970
From: Amirouche Boubekki
Newsgroups: gmane.lisp.guile.user
Subject: Re: What's next with culturia search engine? (and guile-wiredtiger)
Date: Sun, 14 Jan 2018 11:05:29 +0100
Message-ID: <89c7dc5c588f12ceae8efe078efb15fa@hypermove.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII; format=flowed
Content-Transfer-Encoding: 7bit
User-Agent: Roundcube Webmail/1.1.2
To: Catonano
Cc: Guile User
X-Sender: amirouche@hypermove.net
List-Id: General Guile related discussions
Xref: news.gmane.org gmane.lisp.guile.user:14424

On 2018-01-14 09:12, Catonano wrote:
> 2017-11-26 23:33 GMT+01:00 Amirouche Boubekki:
>>
>> The quering engine will first compute the frequency of both
>> keywords and then lookup the inverted index for the least
>> frequent keyword.
>
> The least frequent keyword ?
>
> Not the most frequent keyword ?

Yes. Imagine you search for serif+font: the most common word, and
the least discriminating one, is "font", because there are (I think)
more pages containing "font". The result of the inverted index lookup
above is used as the seed for the rest of the algorithm, which is
O(n), so I need to minimize 'n', i.e. the count of initial documents.

>
>> That way, there is a 'seed' set of documents
>> that we can filter with a small vm that will interpret the
>> rest of the query for instance. Something like:
>>
>> (filter (hit? (cdr query)) seed)
>>
>> Sort of. I can't make it simpler right now, but you can
>> have a look at the code. The public procedure and the bottom
>> called 'search' [4] is the where the code starts.

This is badly explained. At this point SEED contains the unique
identifiers of the documents that contain the least frequent word.
We remove that word from the query, hence the (cdr query), and
filter SEED with the rest of the query.
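
To make the seed-then-filter idea concrete, here is a minimal sketch.
It is not the code from culturia: the in-memory INDEX alist and the
helpers LOOKUP, FREQUENCY and SORT-BY-FREQUENCY are assumptions made up
for illustration, and HIT? is a plain predicate here, not a state
machine.

```scheme
(use-modules (srfi srfi-1))

;; Hypothetical in-memory inverted index: word -> list of document ids.
;; The real thing lives in wiredtiger; this alist is only an assumption.
(define index
  '(("font"  . (1 2 3 4 5))
    ("serif" . (2 5))
    ("guile" . (3))))

;; Document ids containing WORD, or '() when the word is unknown.
(define (lookup word)
  (or (assoc-ref index word) '()))

;; Number of documents containing WORD.
(define (frequency word)
  (length (lookup word)))

;; Order the query terms from least frequent to most frequent.
(define (sort-by-frequency query)
  (sort query (lambda (a b) (< (frequency a) (frequency b)))))

;; Return a predicate that is true for a document id when the
;; document contains every word of TERMS.
(define (hit? terms)
  (lambda (doc)
    (every (lambda (word) (memv doc (lookup word))) terms)))

;; SEED is the set of documents for the least frequent term; the
;; remaining terms (cdr terms) are only checked against SEED.
(define (search query)
  (if (null? query)
      '()
      (let* ((terms (sort-by-frequency query))
             (seed (lookup (car terms))))
        (filter (hit? (cdr terms)) seed))))
```

With this toy index, (search '("font" "serif")) seeds from "serif",
so only two documents go through the filter instead of five.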
This is a small optimization: we know that the least frequent word
is already in every document found in SEED, so we do not need to
check for its presence again. 'hits?' will return some kind of
state machine that checks whether a given document matches the QUERY
passed as argument. That is what I mean to do; the (cdr query) step
that removes the most discriminating query term is not implemented
yet.

>>
>> [4] https://github.com/a-guile-mind/culturia.one/blob/master/src/wiredtiger/ix.scm#L455
>> [8]
>
> file not fond
>

It's here:

https://github.com/a-guile-mind/culturia.one/blob/master/src/ix.scm#L439

I reworked the thing to use the grf3 graph abstraction to store the
documents. Also, guile-wiredtiger 0.6.4 is in guix.

>
> All this looks pretty interesting but I have to say that I prefer the
> work you're doing on GNUNet ;-)

Thanks for your interest!