What's next with culturia search engine? (and guile-wiredtiger)

unofficial mirror of guile-user@gnu.org 

* What's next with culturia search engine? (and guile-wiredtiger)
@ 2017-11-26 22:33 Amirouche Boubekki
  2017-11-27 22:03 ` Tom Jakubowski
  2018-01-14  8:12 ` Catonano
  0 siblings, 2 replies; 6+ messages in thread
From: Amirouche Boubekki @ 2017-11-26 22:33 UTC (permalink / raw)
  To: Guile User

Héllo,

I made some progress on my culturia project,
I wanted to share with you where it's going
with a few bits about guile-wiredtiger itself.

tl;dr:

   $ git clone https://a-guile-mind.github.io/culturia.one
   $ git clone https://framagit.org/a-guile-mind/guile-wiredtiger

In a Guix(SD) environment:

   $ guix environment --ad-hoc guile-wiredtiger

I stopped trying to understand what makes
a concept search engine [1]. Instead I will
focus on a plain old keyword search engine.
I don't even plan to support synonyms [2].

[1] https://www.slideshare.net/lucidworks/implementing-conceptual-search-in-solr-using-lsa-and-word2vec-presented-by-simon-hughes-dicecom
[2] https://blog.algolia.com/inside-the-engine-part-6-handling-synonyms-the-right-way/

But if you have insights about the subject,
don't hesitate to share them.

You might be wondering why I want to build
a search engine. Yes, since I am not working
anymore to reach the Moon, re-inventing the
wheel might sound useless. It's not.
Because Guile, because wiredtiger.

You see, wiredtiger is the latest iteration
(over several decades) of some kind of data engine.
And guile is the _only_ high level language that
has true POSIX threads (and fibers (more on that
later)). Does it ring a bell?

So far, what is dominant in the database space
is RDBMS. Basically tables that you can query with
SQL. This is very neat and stuff. But what if we
could have tables queryable in scheme?

That is it! guile-wiredtiger offers a way to query
tables in scheme. Using minikanren if you want that!

What about performance? Well, according to my
microbenchmarks it can query 1200 documents
at random in seconds. That means that if you
don't use a _dynamic_ schema (like grf3 or
feature-space) and instead use the raw wiredtiger
API found in the (wiredtiger wiredtiger) module, plus
some helpers from (wiredtiger extra), you can
achieve better performance. Also, did I mention
that those are numbers for single-thread access?

Guile has threads! Which means more RPS for your
application, less hardware for more users. That
said I don't think doubling the threads will double
the throughput. This needs to be benchmarked.

The thing with wiredtiger is that it's not
the kind of NoSQL you might think about. Unlike
Redis it's not primarily in memory; unlike Cassandra
it doesn't spread its data across several nodes.
Though there are ideas on how to do that, see for
instance TiKV [3].

[3] https://github.com/pingcap/tikv

According to wiredtiger there are no known limitations
on the size of the database or the number of concurrent
threads, provided the underlying hardware can follow...

One can scale vertically, on a single machine. How far
can we go? That's the question I'd like to answer.

The search engine is a way to have both potentially
a lot of data and a lot of users (compared to my blog).

I am sure some people who tried and switched to
duckduckgo want to give it a try. At least to see what
the technology behind Google and DuckDuckGo really is.
You can't know for sure without having experienced the
old Google or the late AltaVista.

My other point is that for most of my searches on the
web I rarely need to dig deeper than _my_ first
page (even on ddg). On _my_ first page, there is most
of the time wikipedia, stackoverflow and that's it!
Really, there is not much of the web that concerns
me, apart from some rare scholarly articles.

English wikipedia and stackoverflow are already almost
100 GB, so that's bigger than anything one can have as a blog.

Right now culturia.one stores pages using three
tables.

The document table stores information about each document:
uid is the unique identifier of the document, then come the
url and a scheme list of token uids (it is stored like that
for faster comparison):

    key |           value
   -----+-------------+----------------------------
    uid |  url        | document as token uid
   -----+-------------+----------------------------
     1  | gnu.org     | 14 32 51 42 63 74 75 23 113
     2  | hyperdev.fr | 1 22 1 12 23 71 175 323 14

There is another table that stores all the tokens
found in the documents:

    key |          value
   -----+--------+-------
    uid | token  | count
   -----+--------+-------
     1  |  the   |  42

Where count is the number of documents in which the token
appears at least once. This table has an index on the token
column to quickly retrieve the UID of a given TOKEN.

The last table is the so-called inverted index.
It's a bit special because the key part of the table
has two columns, but that's not bizarre if you work with
primary keys:

         key       | value
   -----+----------+-------
   token| document | count
   -----+----------+-------
     42 |   1      |  1

(I just figured out that I never use that count column;
it's supposed to be the number of times token 42
appears in document 1.)

Anyway, pretty simple no?
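
A minimal in-memory sketch of the three tables described above,
assuming plain Guile hash tables in place of wiredtiger tables. The
names (documents, tokens, token-index, inverted, index!) are
hypothetical and are not taken from culturia or guile-wiredtiger:

```scheme
;; In-memory sketch: hash tables stand in for the wiredtiger tables.
;; Every name below is hypothetical, not culturia's API.
(use-modules (srfi srfi-1))

(define documents (make-hash-table))   ; doc uid -> (url . token uids)
(define tokens (make-hash-table))      ; token uid -> (token . doc count)
(define token-index (make-hash-table)) ; token -> token uid
(define inverted (make-hash-table))    ; (token uid . doc uid) -> occurrences
(define next-token-uid 0)

(define (token->uid! token)
  ;; Return the uid of TOKEN, adding a row to the tokens table if needed.
  (or (hash-ref token-index token)
      (begin
        (set! next-token-uid (+ next-token-uid 1))
        (hash-set! token-index token next-token-uid)
        (hash-set! tokens next-token-uid (cons token 0))
        next-token-uid)))

(define (index! doc-uid url words)
  ;; Convert WORDS to token uids from left to right, then fill the tables.
  (let ((uids (let loop ((words words) (uids '()))
                (if (null? words)
                    (reverse uids)
                    (loop (cdr words)
                          (cons (token->uid! (car words)) uids))))))
    (hash-set! documents doc-uid (cons url uids))
    ;; Inverted index: occurrences of each token in this document.
    (for-each (lambda (uid)
                (let ((key (cons uid doc-uid)))
                  (hash-set! inverted key (+ 1 (hash-ref inverted key 0)))))
              uids)
    ;; Tokens table: bump the document count once per distinct token.
    (for-each (lambda (uid)
                (let ((row (hash-ref tokens uid)))
                  (hash-set! tokens uid (cons (car row) (+ 1 (cdr row))))))
              (delete-duplicates uids))))

(index! 1 "gnu.org" '("guile" "manual" "guile"))
(index! 2 "hyperdev.fr" '("guile" "blog"))
```

The pair used as key in the inverted table plays the role of the
two-column key; a real wiredtiger table would keep those keys sorted,
which is what makes a lookup by token alone possible.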

Let's imagine we have a simple query like the following:

    culturia://guile+manual

We will imagine that we indexed some things that contain
those words (or the engine will throw an exception).

The querying engine will first compute the frequency of both
keywords and then look up the inverted index for the least
frequent keyword. That way, there is a 'seed' set of documents
that we can filter with a small vm that will interpret the
rest of the query, for instance. Something like:

   (filter (hit? (cdr query)) seed)

Sort of. I can't make it simpler right now, but you can
have a look at the code. The public procedure at the bottom,
called 'search' [4], is where the code starts.

[4] https://github.com/a-guile-mind/culturia.one/blob/master/src/wiredtiger/ix.scm#L455
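
The seed-then-filter idea can be sketched as follows. This is a toy
version under stated assumptions, not the code behind 'search': the
inverted index is a hard-coded association list, and the names
inverted-index, postings, frequency and sort-by-frequency are
hypothetical.

```scheme
(use-modules (srfi srfi-1))

;; Hypothetical toy data: token -> uids of the documents containing it.
(define inverted-index
  '(("guile"  . (1 2 3 5 8))
    ("manual" . (2 5 9))
    ("scheme" . (1 2 3 4 5 6 7))))

(define (postings token)
  ;; Uids of the documents containing TOKEN, or the empty list.
  (or (assoc-ref inverted-index token) '()))

(define (frequency token)
  ;; Number of documents containing TOKEN.
  (length (postings token)))

(define (sort-by-frequency query)
  ;; Put the least frequent keyword first.
  (sort query (lambda (a b) (< (frequency a) (frequency b)))))

(define (hit? keywords)
  ;; Return a predicate that is true for a document uid when the
  ;; document contains every keyword in KEYWORDS.
  (lambda (doc)
    (every (lambda (token) (memv doc (postings token))) keywords)))

(define (search query)
  ;; The seed is the posting list of the rarest keyword; the other
  ;; keywords only filter that seed.
  (let* ((query (sort-by-frequency query))
         (seed (postings (car query))))
    (filter (hit? (cdr query)) seed)))

(search '("guile" "manual")) ; => (2 5)
```

Here "manual" is the rarest keyword, so only three documents are
checked against "guile". An empty query makes (car query) raise an
error, and a keyword that was never indexed simply yields no results
in this sketch.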

So what is the next iteration:

guile-wiredtiger:

- fix the tests to run with guix

culturia:

- make culturia compatible with guile-wiredtiger found in guix
- write a program that will index whatever wikipedia dump we want
- make a program to index stackoverflow based on the archives.org dump
- make a program that will index news.ycombinator.com and the linked
  articles
- Create a crawler for sitemaps (or find one)
- Create a crawler for RSS/ATOM feeds (or find one)
- Support WARC file format and crawl gnu.org website
- Implement !g and !ddg in the searchbox to redirect the user
   to another search engine.
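
As an illustration of the last item, here is a minimal sketch of
bang handling, assuming the query arrives as a plain string. The
bangs table and bang-redirect are hypothetical names; uri-encode
comes from Guile's (web uri) module.

```scheme
(use-modules (web uri))

;; Hypothetical table: bang -> search URL prefix of the other engine.
(define bangs
  '(("!g"   . "https://www.google.com/search?q=")
    ("!ddg" . "https://duckduckgo.com/?q=")))

(define (bang-redirect query)
  ;; Return a redirect URL when QUERY starts with a known bang
  ;; followed by some text, otherwise #f.
  (let* ((space (string-index query #\space))
         (bang (if space (substring query 0 space) query))
         (base (assoc-ref bangs bang)))
    (and base
         space
         (string-append
          base
          (uri-encode (string-trim-both (substring query space)))))))

(bang-redirect "!g guile manual")
;; => "https://www.google.com/search?q=guile%20manual"
```

A query without a known bang returns #f, so the caller can fall back
to the local index.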

Contact me directly if you want to work on one of the tasks
or some other tasks, or if you want to report a bug.

Tx!

-- 
Amirouche ~ amz3 ~ http://www.hyperdev.fr


* Re: What's next with culturia search engine? (and guile-wiredtiger)
  2017-11-26 22:33 What's next with culturia search engine? (and guile-wiredtiger) Amirouche Boubekki
@ 2017-11-27 22:03 ` Tom Jakubowski
  2017-11-27 23:29   ` Amirouche Boubekki
  2018-01-14  8:12 ` Catonano
  1 sibling, 1 reply; 6+ messages in thread
From: Tom Jakubowski @ 2017-11-27 22:03 UTC (permalink / raw)
  To: Amirouche Boubekki; +Cc: Guile User

Looks very cool! I don't have any comments to add other than that I think
the git URL for culturia.one has an error: it should be
https://github.com/a-guile-mind/culturia.one.git and not
https://a-guile-mind.github.io/culturia.one

In other words:

$ git clone https://github.com/a-guile-mind/culturia.one.git

Tom


* Re: What's next with culturia search engine? (and guile-wiredtiger)
  2017-11-27 22:03 ` Tom Jakubowski
@ 2017-11-27 23:29   ` Amirouche Boubekki
  0 siblings, 0 replies; 6+ messages in thread
From: Amirouche Boubekki @ 2017-11-27 23:29 UTC (permalink / raw)
  To: Tom Jakubowski; +Cc: Guile User

On 2017-11-27 23:03, Tom Jakubowski wrote:
> Looks very cool! I don't have any comments to add other than that I
> think the git URL for culturia.one has an error: it should be
> https://github.com/a-guile-mind/culturia.one.git [12] and not
> https://a-guile-mind.github.io/culturia.one [1]
> 
> In other words:
> 
> $ git clone https://github.com/a-guile-mind/culturia.one.git [12]
> 
> Tom

Tx Tom!

> 
> 
> 
> Links:
> ------
> [1] https://a-guile-mind.github.io/culturia.one
> [2] https://framagit.org/a-guile-mind/guile-wiredtiger
> [3]
> https://www.slideshare.net/lucidworks/implementing-conceptual-search-in-solr-using-lsa-and-word2vec-presented-by-simon-hughes-dicecom
> [4]
> https://blog.algolia.com/inside-the-engine-part-6-handling-synonyms-the-right-way/
> [5] https://github.com/pingcap/tikv
> [6] http://gnu.org
> [7] http://hyperdev.fr
> [8]
> https://github.com/a-guile-mind/culturia.one/blob/master/src/wiredtiger/ix.scm#L455
> [9] http://archives.org
> [10] http://news.ycombinator.com
> [11] http://www.hyperdev.fr
> [12] https://github.com/a-guile-mind/culturia.one.git

-- 
Amirouche ~ amz3 ~ http://www.hyperdev.fr




* Re: What's next with culturia search engine? (and guile-wiredtiger)
  2017-11-26 22:33 What's next with culturia search engine? (and guile-wiredtiger) Amirouche Boubekki
  2017-11-27 22:03 ` Tom Jakubowski
@ 2018-01-14  8:12 ` Catonano
  2018-01-14 10:05   ` Amirouche Boubekki
  1 sibling, 1 reply; 6+ messages in thread
From: Catonano @ 2018-01-14  8:12 UTC (permalink / raw)
  To: Amirouche Boubekki; +Cc: Guile User

2017-11-26 23:33 GMT+01:00 Amirouche Boubekki <amirouche@hypermove.net>:

> The querying engine will first compute the frequency of both
> keywords and then look up the inverted index for the least
> frequent keyword.

The least frequent keyword?
Not the most frequent keyword?

> That way, there is a 'seed' set of documents
> that we can filter with a small vm that will interpret the
> rest of the query for instance. Something like:
>
>   (filter (hit? (cdr query)) seed)
>
> Sort of. I can't make it simpler right now, but you can
> have a look at the code. The public procedure at the bottom,
> called 'search' [4], is where the code starts.
>
> [4] https://github.com/a-guile-mind/culturia.one/blob/master/src/wiredtiger/ix.scm#L455

File not found.

All this looks pretty interesting but I have to say that I prefer the work
you're doing on GNUNet ;-)


* Re: What's next with culturia search engine? (and guile-wiredtiger)
  2018-01-14  8:12 ` Catonano
@ 2018-01-14 10:05   ` Amirouche Boubekki
  2018-01-14 14:12     ` Catonano
  0 siblings, 1 reply; 6+ messages in thread
From: Amirouche Boubekki @ 2018-01-14 10:05 UTC (permalink / raw)
  To: Catonano; +Cc: Guile User

On 2018-01-14 09:12, Catonano wrote:
> 2017-11-26 23:33 GMT+01:00 Amirouche Boubekki
> <amirouche@hypermove.net>:
>> 
>> The querying engine will first compute the frequency of both
>> keywords and then look up the inverted index for the least
>> frequent keyword.
> 
> The least frequent keyword ?
> 
> Not the most frequent keyword ?

Yes. Imagine you search for serif+font: the most common
word, and the least discriminating, is "font", because there
are (I think) more pages containing "font".

The result of the inverted index lookup above is used as the seed
of the rest of the algorithm, which is O(n), so I need to
minimize 'n', i.e. the count of initial documents.
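
To make the choice concrete, here is a small sketch that picks the
seed keyword from per-token document counts. The counts are made up
for illustration, and counts and least-frequent are hypothetical
names:

```scheme
;; Hypothetical document counts, as stored in the "count" column
;; of the tokens table. The numbers are invented.
(define counts '(("serif" . 120) ("font" . 4500)))

(define (least-frequent query)
  ;; Return the keyword of QUERY with the smallest document count.
  (let loop ((best (car query)) (rest (cdr query)))
    (cond ((null? rest) best)
          ((< (assoc-ref counts (car rest)) (assoc-ref counts best))
           (loop (car rest) (cdr rest)))
          (else (loop best (cdr rest))))))

(least-frequent '("serif" "font")) ; => "serif"
```

With these invented numbers the O(n) filter runs over 120 seed
documents instead of 4500.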

> 
>> That way, there is a 'seed' set of documents
>> that we can filter with a small vm that will interpret the
>> rest of the query for instance. Something like:
>> 
>> (filter (hit? (cdr query)) seed)
>> 
>> Sort of. I can't make it simpler right now, but you can
>> have a look at the code. The public procedure at the bottom,
>> called 'search' [4], is where the code starts.

This is badly explained.  At this point SEED contains the unique
identifiers of the documents that contain the least frequent word.
We remove that word from the query, hence the (cdr query), and filter
the SEED with the rest of the query. This is a small optimization:
we know that the least frequent word is already in the
documents found in the SEED, so we do not need to check its
presence in the SEED documents. 'hit?' will return some kind
of state machine that will check that a given document matches
the QUERY passed as argument.

That is what I mean to do; the (cdr query) to remove the most
discriminant query term is not implemented yet.
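
As a rough sketch of the intended behaviour (not the code in ix.scm):
it assumes the query is already sorted with the least frequent keyword
first, and uses a made-up 'documents' alist in place of the real
document store. Here 'hit?' is a plain predicate rather than a state
machine.

```scheme
(use-modules (srfi srfi-1))

;; Hypothetical document store: document id -> list of words.
(define documents
  '((1 . ("font" "rendering"))
    (2 . ("serif" "font" "design"))
    (5 . ("serif" "typography"))))

;; Return a predicate that is true when the document UID
;; contains every keyword in KEYWORDS.
(define (hit? keywords)
  (lambda (uid)
    (let ((words (or (assv-ref documents uid) '())))
      (every (lambda (keyword) (member keyword words)) keywords))))

;; Assumption: QUERY is sorted with the least frequent keyword first,
;; and SEED holds the ids the inverted index returned for that keyword.
(define query '("serif" "font"))
(define seed '(2 5))

;; The first keyword is known to be in every SEED document,
;; so only the rest of the query needs checking.
(define results (filter (hit? (cdr query)) seed))
```

With this toy data only document 2 survives the filter, since
document 5 contains "serif" but not "font".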

>> 
>> [4]
>> 
> https://github.com/a-guile-mind/culturia.one/blob/master/src/wiredtiger/ix.scm#L455
>> [8]
> 
> file not found
> 

It's here: 
https://github.com/a-guile-mind/culturia.one/blob/master/src/ix.scm#L439

I reworked the thing to use grf3 graph abstraction to store
the documents.

Also guile-wiredtiger 0.6.4 is in guix.

> 
> All this looks pretty interesting but I have to say that I prefer the
> work you're doing on GNUNet ;-)

Thanks for your interest!


* Re: What's next with culturia search engine? (and guile-wiredtiger)
  2018-01-14 10:05   ` Amirouche Boubekki
@ 2018-01-14 14:12     ` Catonano
  0 siblings, 0 replies; 6+ messages in thread
From: Catonano @ 2018-01-14 14:12 UTC (permalink / raw)
  To: Amirouche Boubekki; +Cc: Guile User

2018-01-14 11:05 GMT+01:00 Amirouche Boubekki <amirouche@hypermove.net>:

> On 2018-01-14 09:12, Catonano wrote:
>
>> 2017-11-26 23:33 GMT+01:00 Amirouche Boubekki
>> <amirouche@hypermove.net>:
>>
>>>
>>> The querying engine will first compute the frequency of both
>>> keywords and then lookup the inverted index for the least
>>> frequent keyword.
>>>
>>
>> The least frequent keyword ?
>>
>> Not the most frequent keyword ?
>>
>
> Yes. Imagine you search for serif+font: the most common
> word, and the least discriminant, is "font" because there
> are (I think) more pages containing "font".
>
> The result of the inverted index lookup above is used as the seed
> for the rest of the algorithm, which is O(n), so I need to
> minimize 'n', i.e. the count of initial documents.


I see now. Thanks


>
>
>
>>> That way, there is a 'seed' set of documents
>>> that we can filter with a small vm that will interpret the
>>> rest of the query for instance. Something like:
>>>
>>> (filter (hit? (cdr query)) seed)
>>>
>>> Sort of. I can't make it simpler right now, but you can
>>> have a look at the code. The public procedure at the bottom
>>> called 'search' [4] is where the code starts.
>>>
>>
> This is badly explained.  At this point SEED contains the unique
> identifiers of the documents that contain the least frequent word.
> We remove that word from the query, hence the (cdr query), and filter
> the SEED with the rest of the query. This is a small optimization:
> we know that the least frequent word is already in the
> documents found in the SEED, so we do not need to check its
> presence in the SEED documents. 'hit?' will return some kind
> of state machine that will check that a given document matches
> the QUERY passed as argument.
>
> That is what I mean to do; the (cdr query) to remove the most
> discriminant query term is not implemented yet.
>

Ok


>
>>> [4]
>>>
>>> https://github.com/a-guile-mind/culturia.one/blob/master/src/wiredtiger/ix.scm#L455
>>
>>> [8]
>>>
>>
>> file not found
>>
>>
> It's here: https://github.com/a-guile-mind/culturia.one/blob/master/src/ix.scm#L439
>
> I reworked the thing to use grf3 graph abstraction to store
> the documents.
>
> Also guile-wiredtiger 0.6.4 is in guix.


I know ;-)

At the moment I don't feel confident with guile code calling C code

There's guile-squee that needs some love too; that could be a starting
point for me

Also G-golf is very important. GUIs are not optional, they are fundamental,
we should absolutely have a decent integration between Guile and Gnome

When and if I know more, I'll take a look at Culturia too :-)


>
>
>
>
>> All this looks pretty interesting but I have to say that I prefer the
>> work you're doing on GNUNet ;-)
>>
>
> Thanks for your interest!
>

No, thank you !
GNUNet is also very important, probably more than g-golf, I'm not sure

I can't wait to test drive it



Thread overview: 6+ messages
2017-11-26 22:33 What's next with culturia search engine? (and guile-wiredtiger) Amirouche Boubekki
2017-11-27 22:03 ` Tom Jakubowski
2017-11-27 23:29   ` Amirouche Boubekki
2018-01-14  8:12 ` Catonano
2018-01-14 10:05   ` Amirouche Boubekki
2018-01-14 14:12     ` Catonano
