From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!.POSTED!not-for-mail
From: Tom Jakubowski <tjakubowski@oblong.com>
Newsgroups: gmane.lisp.guile.user
Subject: Re: What's next with culturia search engine? (and guile-wiredtiger)
Date: Mon, 27 Nov 2017 22:03:50 +0000
Message-ID: <CAEJvSPLGWTv67JDyxUmSYTY7nK5Bfjcg1v_m_5X=8ek1-YOMMA@mail.gmail.com>
References: <d22074be36e396a49011c6eedc8fd0aa@hypermove.net>
NNTP-Posting-Host: blaine.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Trace: blaine.gmane.org 1511820288 29663 195.159.176.226 (27 Nov 2017 22:04:48 GMT)
X-Complaints-To: usenet@blaine.gmane.org
NNTP-Posting-Date: Mon, 27 Nov 2017 22:04:48 +0000 (UTC)
Cc: Guile User <guile-user@gnu.org>
To: Amirouche Boubekki <amirouche@hypermove.net>
In-Reply-To: <d22074be36e396a49011c6eedc8fd0aa@hypermove.net>
X-BeenThere: guile-user@gnu.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: General Guile related discussions <guile-user.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/guile-user>,
	<mailto:guile-user-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/guile-user/>
List-Post: <mailto:guile-user@gnu.org>
List-Help: <mailto:guile-user-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/guile-user>,
	<mailto:guile-user-request@gnu.org?subject=subscribe>
Errors-To: guile-user-bounces+guile-user=m.gmane.org@gnu.org
Original-Sender: "guile-user" <guile-user-bounces+guile-user=m.gmane.org@gnu.org>
Xref: news.gmane.org gmane.lisp.guile.user:14298
Archived-At: <http://permalink.gmane.org/gmane.lisp.guile.user/14298>

Looks very cool! I don't have any comments to add other than that I think
the git URL for culturia.one has an error: it should be
https://github.com/a-guile-mind/culturia.one.git and not
https://a-guile-mind.github.io/culturia.one

In other words:

$ git clone https://github.com/a-guile-mind/culturia.one.git

Tom

On Sun, Nov 26, 2017 at 2:34 PM Amirouche Boubekki <amirouche@hypermove.net>
wrote:

> Héllo,
>
> I made some progress on my culturia project,
> I wanted to share with you where it's going
> with a few bits about guile-wiredtiger itself.
>
> tl;dr:
>
>    $ git clone https://a-guile-mind.github.io/culturia.one
>    $ git clone https://framagit.org/a-guile-mind/guile-wiredtiger
>
> On guix(sd) environment:
>
>    $ guix environment --ad-hoc guile-wiredtiger
>
> I stopped trying to understand what makes
> a concept search engine [1]. Instead I will
> focus on a plain old keyword search engine.
> I don't even plan to support synonyms [2].
>
> [1]
>
> https://www.slideshare.net/lucidworks/implementing-conceptual-search-in-solr-using-lsa-and-word2vec-presented-by-simon-hughes-dicecom
> [2]
>
> https://blog.algolia.com/inside-the-engine-part-6-handling-synonyms-the-right-way/
>
> But, if you have insights about the subject,
> don't hesitate to share them.
>
> You might be wondering why I want to build
> a search engine. Yes, since I am not working
> anymore to reach the Moon, re-inventing the
> wheel might sound useless. It's not.
> Because Guile, because wiredtiger.
>
> You see wiredtiger is the last iteration
> (over several decades) of some kind of data engine.
> And guile is the _only_ high level language that
> has true POSIX threads (and fibers (more on that
> later)). Does it ring a bell?
>
> So far, what is dominant in the database space
> is RDBMS. Basically tables that you can query with
> SQL. This is very neat and stuff. But what if we
> could have tables queryable in scheme?
>
> That is it! guile-wiredtiger offers a way to query
> tables in scheme. Using minikanren if you want that!
>
> What about performance? Well, according to my
> microbenchmarks it can query 1200 documents
> at random in seconds. And if you
> don't use a _dynamic_ schema (like grf3 or
> feature-space) and instead use the raw wiredtiger API
> found in the (wiredtiger wiredtiger) module and
> some helpers in (wiredtiger extra), you can
> achieve better performance. Also, did I mention
> that those are numbers for single thread access?
>
> Guile has threads! Which means more RPS for your
> application, less hardware for more users. That
> said I don't think doubling the threads will double
> the throughput. This needs to be benchmarked.
>
> The thing with wiredtiger is that it's not
> the kind of NoSQL you might think about. Unlike
> Redis it's not primarily in-memory, and unlike Cassandra
> it doesn't spread its data across several nodes.
> Though there are ideas on how to do that, see for
> instance TiKV [3].
>
> [3] https://github.com/pingcap/tikv
>
> According to wiredtiger there are no known limitations
> on the size of the database or the number of concurrent
> threads, provided the underlying hardware can follow...
>
> One can scale vertically, on a single machine. How far
> can we go? That's the question I'd like to answer.
>
> The search engine is a way to have both potentially
> a lot of data and a lot of users (compared to my blog).
>
> I am sure some people who tried and switched to
> duckduckgo want to give it a try. At least to see what
> the technology behind Google and DuckDuckGo really is.
> You can't know for sure without having experienced the
> old Google or the late AltaVista.
>
> My other point is that for most of my searches on the
> web I rarely need to dig deeper than _my_ first
> page (even on ddg). On _my_ first page, there is most
> of the time wikipedia, stackoverflow and that's it!
> Really, there is not much of the web that concerns
> me, apart from some rare scholarly articles.
>
> English wikipedia and stackoverflow are already almost
> 100 GB, so it's bigger than anything one can have as a blog.
>
> Right now culturia.one does store pages using three
> tables:
>
> The document table stores information about each document:
> uid is the unique identifier of the document, followed by the
> url and a scheme list of token uids (it is stored like that
> for faster comparison).
>
>     key |           value
>    -----+-------------+----------------------------
>     uid |  url        | document as token uid
>    -----+-------------+----------------------------
>      1  | gnu.org     | 14 32 51 42 63 74 75 23 113
>      2  | hyperdev.fr | 1 22 1 12 23 71 175 323 14
>
> There is another table that stores, all the tokens
> found in the documents:
>
>     key |          value
>    -----+--------+-------
>     uid | token  | count
>    -----+--------+-------
>      1  |  the   |  42
>
> Where count is the number of documents in which the token
> appears at least once. This table has an index on the token
> column to quickly retrieve the UID of a given TOKEN.
>
> The last table is the so-called inverted index.
> It's a bit special because the key part of the table
> has two columns, but that is not bizarre if you work with
> primary keys:
>
>
>           key        | value
>    -------+----------+-------
>     token | document | count
>    -------+----------+-------
>      42   |    1     |   1
>
> (I just figured that I never use that count column;
> it's supposed to be the number of times the token 42
> appears in document 1)
>
> Anyway, pretty simple no?
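
To make the three tables concrete, here is a minimal in-memory sketch
of them. It is only an illustration under assumptions: the hash
tables stand in for wiredtiger tables, tokenization is plain
whitespace splitting, and every name (documents, tokens, token-index,
inverted, token-uid!, index!) is made up for this example, not taken
from culturia.one.

```scheme
(use-modules (srfi srfi-1))

;; Hypothetical in-memory stand-ins for the three tables; the real
;; project stores them in wiredtiger.
(define documents (make-hash-table))   ; document uid -> (url . list of token uids)
(define tokens (make-hash-table))      ; token uid -> (token . document count)
(define token-index (make-hash-table)) ; token -> token uid (index on the token column)
(define inverted (make-hash-table))    ; (token uid . document uid) -> count

(define next-token-uid 0)
(define next-document-uid 0)

(define (token-uid! token)
  "Return the uid of TOKEN, creating a row in the tokens table if needed."
  (or (hash-ref token-index token)
      (begin
        (set! next-token-uid (+ next-token-uid 1))
        (hash-set! token-index token next-token-uid)
        (hash-set! tokens next-token-uid (cons token 0))
        next-token-uid)))

(define (index! url text)
  "Split TEXT on whitespace, store the document, and update both counts."
  (set! next-document-uid (+ next-document-uid 1))
  (let* ((uid next-document-uid)
         (uids (map-in-order token-uid!
                             (string-tokenize (string-downcase text)))))
    (hash-set! documents uid (cons url uids))
    ;; Inverted index: occurrences of each token in this document.
    (for-each (lambda (token)
                (let ((key (cons token uid)))
                  (hash-set! inverted key (+ 1 (hash-ref inverted key 0)))))
              uids)
    ;; Tokens table: bump the document count once per distinct token.
    (for-each (lambda (token)
                (let ((row (hash-ref tokens token)))
                  (hash-set! tokens token (cons (car row) (+ 1 (cdr row))))))
              (delete-duplicates uids))
    uid))
```

Calling (index! "gnu.org" "some text") returns the uid of the new
document; (hash-ref token-index "some") then gives the uid of that
token, which is the lookup the index on the token column is for.
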
>
> Let's imagine we have a simple query like the following:
>
>     culturia://guile+manual
>
> We will imagine that we indexed some things that contain
> those words (otherwise the engine will throw an exception).
>
> The querying engine will first compute the frequency of both
> keywords and then look up the inverted index for the least
> frequent keyword. That way, there is a 'seed' set of documents
> that we can filter, for instance with a small vm that will
> interpret the rest of the query. Something like:
>
>    (filter (hit? (cdr query)) seed)
>
> Sort of. I can't make it simpler right now, but you can
> have a look at the code. The public procedure at the bottom,
> called 'search' [4], is where the code starts.
>
> [4]
>
> https://github.com/a-guile-mind/culturia.one/blob/master/src/wiredtiger/ix.scm#L455
>
> So what is the next iteration:
>
> guile-wiredtiger:
>
> - fix the tests to run with guix
>
> culturia:
>
> - make culturia compatible with guile-wiredtiger found in guix
> - write a program that will index whatever wikipedia dump we want
> - make a program to index stackoverflow based on the archive.org dump
> - make a program that will index news.ycombinator.com and the linked
> articles
> - Create a crawler for sitemaps (or find one)
> - Create a crawler for RSS/ATOM feeds (or find one)
> - Support WARC file format and crawl gnu.org website
> - Implement !g and !ddg in the searchbox to redirect the user
>    to another search engine.
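
For the !g and !ddg item, one possible shape is a small lookup table
from bang to URL prefix. This is only a sketch: the bangs table, the
redirect procedure and the two URL templates are assumptions made for
the example, not something culturia.one defines.

```scheme
(use-modules (web uri))

;; Hypothetical bang table; the URL templates are assumptions.
(define bangs
  '(("!g" . "https://www.google.com/search?q=")
    ("!ddg" . "https://duckduckgo.com/?q=")))

(define (redirect query)
  "Return a redirect URL if QUERY starts with a known bang, else #f."
  (let* ((words (string-tokenize query))
         (prefix (and (pair? words) (assoc-ref bangs (car words)))))
    (and prefix
         (string-append prefix
                        (uri-encode (string-join (cdr words) " "))))))
```

A query without a known bang makes redirect return #f, so the caller
can fall through to the local search.
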
>
> Contact me directly if you want to work on one of the tasks
> or some other tasks, or if you want to report a bug.
>
> Tx!
>
> --
> Amirouche ~ amz3 ~ http://www.hyperdev.fr
>
>