From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!.POSTED!not-for-mail
From: Amirouche Boubekki <amirouche@hypermove.net>
Newsgroups: gmane.lisp.guile.user
Subject: Re: What's next with culturia search engine? (and guile-wiredtiger)
Date: Tue, 28 Nov 2017 00:29:30 +0100
Message-ID: <51004583028c27dffe979ec1be35b8ef@hypermove.net>
References: <d22074be36e396a49011c6eedc8fd0aa@hypermove.net>
	<CAEJvSPLGWTv67JDyxUmSYTY7nK5Bfjcg1v_m_5X=8ek1-YOMMA@mail.gmail.com>
NNTP-Posting-Host: blaine.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8;
 format=flowed
Content-Transfer-Encoding: 8bit
X-Trace: blaine.gmane.org 1511825517 3428 195.159.176.226 (27 Nov 2017 23:31:57 GMT)
X-Complaints-To: usenet@blaine.gmane.org
NNTP-Posting-Date: Mon, 27 Nov 2017 23:31:57 +0000 (UTC)
User-Agent: Roundcube Webmail/1.1.2
Cc: Guile User <guile-user@gnu.org>
To: Tom Jakubowski <tjakubowski@oblong.com>
Original-X-From: guile-user-bounces+guile-user=m.gmane.org@gnu.org Tue Nov 28 00:31:50 2017
Return-path: <guile-user-bounces+guile-user=m.gmane.org@gnu.org>
Envelope-to: guile-user@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by blaine.gmane.org with esmtp (Exim 4.84_2)
	(envelope-from <guile-user-bounces+guile-user=m.gmane.org@gnu.org>)
	id 1eJSsF-0008Ax-5x
	for guile-user@m.gmane.org; Tue, 28 Nov 2017 00:31:39 +0100
Original-Received: from localhost ([::1]:35241 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <guile-user-bounces+guile-user=m.gmane.org@gnu.org>)
	id 1eJSsM-0001Q7-Iq
	for guile-user@m.gmane.org; Mon, 27 Nov 2017 18:31:46 -0500
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:47905)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <amirouche@hypermove.net>) id 1eJSqG-0000D5-5R
	for guile-user@gnu.org; Mon, 27 Nov 2017 18:29:37 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <amirouche@hypermove.net>) id 1eJSqE-0008Sz-3I
	for guile-user@gnu.org; Mon, 27 Nov 2017 18:29:36 -0500
Original-Received: from relay6-d.mail.gandi.net ([2001:4b98:c:538::198]:56529)
	by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32)
	(Exim 4.71) (envelope-from <amirouche@hypermove.net>)
	id 1eJSqD-0008S2-Pz
	for guile-user@gnu.org; Mon, 27 Nov 2017 18:29:34 -0500
Original-Received: from webmail.gandi.net (webmail5-d.mgt.gandi.net [10.58.1.145])
	(Authenticated sender: amirouche@hypermove.net)
	by relay6-d.mail.gandi.net (Postfix) with ESMTPA id A9655FB881;
	Tue, 28 Nov 2017 00:29:30 +0100 (CET)
In-Reply-To: <CAEJvSPLGWTv67JDyxUmSYTY7nK5Bfjcg1v_m_5X=8ek1-YOMMA@mail.gmail.com>
X-Sender: amirouche@hypermove.net
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
X-Received-From: 2001:4b98:c:538::198
X-BeenThere: guile-user@gnu.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: General Guile related discussions <guile-user.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/guile-user>,
	<mailto:guile-user-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/guile-user/>
List-Post: <mailto:guile-user@gnu.org>
List-Help: <mailto:guile-user-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/guile-user>,
	<mailto:guile-user-request@gnu.org?subject=subscribe>
Errors-To: guile-user-bounces+guile-user=m.gmane.org@gnu.org
Original-Sender: "guile-user" <guile-user-bounces+guile-user=m.gmane.org@gnu.org>
Xref: news.gmane.org gmane.lisp.guile.user:14302
Archived-At: <http://permalink.gmane.org/gmane.lisp.guile.user/14302>

On 2017-11-27 23:03, Tom Jakubowski wrote:
> Looks very cool! I don't have any comments to add other than that I
> think the git URL for culturia.one has an error: it should be
> https://github.com/a-guile-mind/culturia.one.git [12] and not
> https://a-guile-mind.github.io/culturia.one [1]
> 
> In other words:
> 
> $ git clone https://github.com/a-guile-mind/culturia.one.git [12]
> 
> Tom

Tx Tom!

> 
> On Sun, Nov 26, 2017 at 2:34 PM Amirouche Boubekki
> <amirouche@hypermove.net> wrote:
> 
>> Héllo,
>> 
>> I made some progress on my culturia project,
>> I wanted to share with you where it's going
>> with a few bits about guile-wiredtiger itself.
>> 
>> tl;dr:
>> 
>> $ git clone https://a-guile-mind.github.io/culturia.one [1]
>> $ git clone https://framagit.org/a-guile-mind/guile-wiredtiger
>> [2]
>> 
>> On guix(sd) environment:
>> 
>> $ guix environment --ad-hoc guile-wiredtiger
>> 
>> I stopped trying to understand what makes
>> a concept search engine [0]. Instead I will
>> focus on plain old keyword search engine.
>> I don't even plan to support synonyms [1].
>> 
>> [1]
>> 
> https://www.slideshare.net/lucidworks/implementing-conceptual-search-in-solr-using-lsa-and-word2vec-presented-by-simon-hughes-dicecom
>> [3]
>> [2]
>> 
> https://blog.algolia.com/inside-the-engine-part-6-handling-synonyms-the-right-way/
>> [4]
>> 
>> But, if you have insights about the suject,
>> don't hesitate to share them.
>> 
>> You might be wondering why I want to build
>> a search engine. Yes, since I am not working
>> anymore to reach the Moon, re-inventing the
>> wheel might sound useless. It's not.
>> Because Guile, because wiredtiger.
>> 
>> You see wiredtiger is the last iteration
>> (over several decades) of somekind of data engine.
>> And guile is the _only_ high level language that
>> has true POSIX threads (and fibers (more on that
>> later)). Does it ring a bell?
>> 
>> So far, what is dominant in the database space
>> is RDBMS. Basically tables that you can query with
>> SQL. This is very neat and stuff. But what if we
>> could have tables queryable in scheme?
>> 
>> That is it! guile-wiredtiger offers a way to query
>> tables in scheme. Using minikanren if you want that!
>> 
>> What about performance? Well according to my
>> microbenchmarks it can query 1200 documents
>> at random in seconds. Which means that if you
>> don't use a _dynamic_ schema (like grf3 or
>> feature-space) and use the raw wiredtiger API
>> found in (wiredtiger wiredtiger) module and
>> some helpers in (wiredtiger extra). You can
>> achieve better performance. Also, did I mention
>> that those are numbers for single thread access?
>> 
>> Guile has threads! Which means more RPS for your
>> application, less hardware for more users. That
>> said I don't think doubling the threads will double
>> the throughput. This needs to be benchmarked.
>> 
>> The thing with NOSQL wiredtiger, is that it's not
>> the kind of NOSQL you might think about. Unlike
>> REDIS it's not primary in memory, unlike Cassandra
>> it doesn't spread it's data accross several nodes.
>> Though there are ideas on how to do that see for
>> instance TiKV [3].
>> 
>> [3] https://github.com/pingcap/tikv [5]
>> 
>> According to wiredtiger there is no known limitations
>> in the size of the database or the number of concurrent
>> threads − provided the underlying hardware can follow...
>> 
>> One can scale vertically, on a single machine. How far
>> can we go? That's the question I'd like to answer.
>> 
>> The search engine is the idea to have both potentially
>> a lot of data and a lot of users (compared to my blog).
>> 
>> I am sure some people who tried and switched to
>> duckduckgo want to give it a try. At least to see what
>> the technology behind Google and DuckDuckGo really is.
>> You can't know for sure without having experienced the
>> old google or feu altavista.
>> 
>> My point other point, is for most of my search on the
>> web I don't need often to dig deeper than _my_ first
>> page (even on ddg). On _my_ first page, there is most
>> of the time wikipedia, stackoverflow and that's it!
>> Really, there is not much of the web that concerns
>> me. Outside some rare scholar articles.
>> 
>> english wikipedia and stackoverflow are already almost
>> 100Go so it's bigger than any one can have as blog.
>> 
>> Right now culturia.one does store pages using three
>> tables:
>> 
>> Document table will store information about the document,
>> uid is unique identifier for the document, url and a scheme
>> list of token uid (This is stored like that for faster
>> comparaison)
>> 
>> key | value
>> -----+-------------+----------------------------
>> uid | url | document as token uid
>> -----+-------------+----------------------------
>> 1 | gnu.org [6] | 14 32 51 42 63 74 75 23 113
>> 2 | hyperdev.fr [7] | 1 22 1 12 23 71 175 323 14
>> 
>> There is another table that stores, all the tokens
>> found in the documents:
>> 
>> key | value
>> -----+--------+-------
>> uid | token | count
>> -----+--------+-------
>> 1 | the | 42
>> 
>> Where count, is the count of document where the token
>> appears at least once. This table as an index on token
>> column to quikcly retrieve the UID of given TOKEN.
>> 
>> The last index, is the so called inverted-index,
>> It's a bit special because the key part of the table
>> has two columns but not bizarre if you work with primary
>> keys:
>> 
>> key | value
>> -----+----------+-------
>> token| document | count
>> -----+------------------
>> 42 | 1 | 1
>> 
>> (I just figured that I never use that count column,
>> it supposed to be the number of times of the token 42
>> appears in document 1)
>> 
>> Anyway, pretty simple no?
>> 
>> Let's imagine we have a simple query like the following:
>> 
>> culturia://guile+manual
>> 
>> We will imagine that we indexed some things that containt
>> those words (or the engine will throw an exception).
>> 
>> The quering engine will first compute the frequency of both
>> keywords and then lookup the inverted index for the least
>> frequent keyword. That way, there is a 'seed' set of documents
>> that we can filter with a small vm that will interpret the
>> rest of the query for instance. Something like:
>> 
>> (filter (hit? (cdr query)) seed)
>> 
>> Sort of. I can't make it simpler right now, but you can
>> have a look at the code. The public procedure and the bottom
>> called 'search' [4] is the where the code starts.
>> 
>> [4]
>> 
> https://github.com/a-guile-mind/culturia.one/blob/master/src/wiredtiger/ix.scm#L455
>> [8]
>> 
>> So what is the next iteration:
>> 
>> guile-wiredtiger:
>> 
>> - fix the tests to run with guix
>> 
>> culturia:
>> 
>> - make culturia compatible with guile-wiredtiger found in guix
>> - write program that will index the whatever wikipedia dump we want
>> - make a program to index stackoverflow based on archives.org [9]
>> dump
>> - make a program that will index news.ycombinator.com [10] and the
>> linked
>> articles
>> - Create a crawler for sitemaps (or find one)
>> - Create a crawler for RSS/ATOM feeds (or find one)
>> - Support WARC file format and crawl gnu.org [6] website
>> - Implement !g and !ddg in the searchbox to redirect the user
>> to another search engine.
>> 
>> Conctact me directly if you want to work on one of the tasks
>> or some other tasks, or if you want to report a bug.
>> 
>> Tx!
>> 
>> --
>> Amirouche ~ amz3 ~ http://www.hyperdev.fr [11]
> 
> 
> Links:
> ------
> [1] https://a-guile-mind.github.io/culturia.one
> [2] https://framagit.org/a-guile-mind/guile-wiredtiger
> [3]
> https://www.slideshare.net/lucidworks/implementing-conceptual-search-in-solr-using-lsa-and-word2vec-presented-by-simon-hughes-dicecom
> [4]
> https://blog.algolia.com/inside-the-engine-part-6-handling-synonyms-the-right-way/
> [5] https://github.com/pingcap/tikv
> [6] http://gnu.org
> [7] http://hyperdev.fr
> [8]
> https://github.com/a-guile-mind/culturia.one/blob/master/src/wiredtiger/ix.scm#L455
> [9] http://archives.org
> [10] http://news.ycombinator.com
> [11] http://www.hyperdev.fr
> [12] https://github.com/a-guile-mind/culturia.one.git

-- 
Amirouche ~ amz3 ~ http://www.hyperdev.fr