unofficial mirror of guile-user@gnu.org 
From: Amirouche Boubekki <amirouche@hypermove.net>
To: Catonano <catonano@gmail.com>
Cc: Guile User <guile-user@gnu.org>
Subject: Re: What's next with culturia search engine? (and guile-wiredtiger)
Date: Sun, 14 Jan 2018 11:05:29 +0100
Message-ID: <89c7dc5c588f12ceae8efe078efb15fa@hypermove.net>
In-Reply-To: <CAJ98PDwDXb8=py4CX6o8GzpcGtN-Z6X6Qp03JCESqpCwCGw_SQ@mail.gmail.com>

On 2018-01-14 09:12, Catonano wrote:
> 2017-11-26 23:33 GMT+01:00 Amirouche Boubekki
> <amirouche@hypermove.net>:
>> 
>> The querying engine will first compute the frequency of both
>> keywords and then look up the inverted index for the least
>> frequent keyword.
> 
> The least frequent keyword ?
> 
> Not the most frequent keyword ?

Yes. Imagine you search for serif+font: the most common
word, and so the least discriminating one, is "font", because
there are (I think) more pages containing "font" than "serif".

The result of the inverted-index lookup above is used as the
seed for the rest of the algorithm, which is O(n), so I need to
minimize 'n', i.e. the number of initial documents.
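
To make this concrete, here is a minimal sketch of the seed
selection. It assumes a toy in-memory inverted index kept in a hash
table; the real index lives in wiredtiger, and the names 'index',
'postings', 'frequency', 'sort-by-frequency' and 'seed' are
hypothetical, not the procedures from ix.scm.

```scheme
;; Hypothetical in-memory inverted index: term -> list of document ids.
;; This only stands in for the wiredtiger-backed index.
(define index (make-hash-table))
(hash-set! index "font" '(1 2 3 4 5))
(hash-set! index "serif" '(2 5))

(define (postings term)
  ;; Ids of the documents containing TERM, or the empty list.
  (hash-ref index term '()))

(define (frequency term)
  ;; Number of documents containing TERM.
  (length (postings term)))

(define (sort-by-frequency query)
  ;; Order the query terms from least to most frequent.
  (sort query (lambda (a b) (< (frequency a) (frequency b)))))

(define (seed query)
  ;; Look up the inverted index for the least frequent term only,
  ;; so that the O(n) part of the search starts from a small 'n'.
  (postings (car (sort-by-frequency query))))

(seed '("font" "serif")) ;; => (2 5)
```

With these toy numbers the seed holds two documents instead of the
five that a lookup of "font" would return.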

> 
>> That way, there is a 'seed' set of documents
>> that we can filter with a small vm that will interpret the
>> rest of the query for instance. Something like:
>> 
>> (filter (hit? (cdr query)) seed)
>> 
>> Sort of. I can't make it simpler right now, but you can
>> have a look at the code. The public procedure at the bottom,
>> called 'search' [4], is where the code starts.

This is badly explained.  At this point SEED contains the unique
identifiers of the documents that contain the least frequent word.
We remove that word from the query, hence the (cdr query), and
filter the SEED with the rest of the query. This is a small
optimization: we know that the least frequent word is already in
the documents found in the SEED, so we do not need to check its
presence in the SEED documents. 'hit?' will return some kind
of state machine that will check that a given document matches
the QUERY passed as argument.

That is what I mean to do; the (cdr query) step that removes the
most discriminating query term is not implemented yet.
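
A rough sketch of that filtering step, under stated assumptions: the
documents are a toy association list from identifier to words, QUERY
is already sorted from least to most frequent term, and 'hit?' is a
plain closure rather than the state machine mentioned above. The
'documents' store and this 'hit?' are hypothetical, not the code
from ix.scm.

```scheme
(use-modules (srfi srfi-1)) ; for 'every'

;; Hypothetical document store: document id -> list of words.
(define documents
  '((2 . ("serif" "font" "bold"))
    (5 . ("serif" "italic"))))

(define (hit? terms)
  ;; Return a predicate that is true when the document with the
  ;; given id contains every term of TERMS.
  (lambda (uid)
    (let ((words (assv-ref documents uid)))
      (every (lambda (term) (member term words)) terms))))

;; QUERY is assumed sorted from least to most frequent term, and
;; SEED holds the ids of the documents containing (car query).
(define query '("serif" "font"))
(define seed '(2 5))

;; Only the rest of the query is checked against the SEED documents.
(filter (hit? (cdr query)) seed) ;; => (2)
```

Document 5 is dropped because it does not contain "font"; "serif" is
never checked again since every SEED document contains it already.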

>> 
>> [4]
>> 
> https://github.com/a-guile-mind/culturia.one/blob/master/src/wiredtiger/ix.scm#L455
>> [8]
> 
> file not found
> 

It's here: 
https://github.com/a-guile-mind/culturia.one/blob/master/src/ix.scm#L439

I reworked the thing to use the grf3 graph abstraction to store
the documents.

Also, guile-wiredtiger 0.6.4 is in Guix.


> 
> All this looks pretty interesting but I have to say that I prefer the
> work you're doing on GNUNet ;-)

Thanks for your interest!


