From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!.POSTED!not-for-mail
From: Catonano <catonano@gmail.com>
Newsgroups: gmane.lisp.guile.user
Subject: Re: What's next with culturia search engine? (and guile-wiredtiger)
Date: Sun, 14 Jan 2018 09:12:00 +0100
Message-ID: <CAJ98PDwDXb8=py4CX6o8GzpcGtN-Z6X6Qp03JCESqpCwCGw_SQ@mail.gmail.com>
References: <d22074be36e396a49011c6eedc8fd0aa@hypermove.net>
NNTP-Posting-Host: blaine.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Trace: blaine.gmane.org 1515917438 10788 195.159.176.226 (14 Jan 2018 08:10:38 GMT)
X-Complaints-To: usenet@blaine.gmane.org
NNTP-Posting-Date: Sun, 14 Jan 2018 08:10:38 +0000 (UTC)
Cc: Guile User <guile-user@gnu.org>
To: Amirouche Boubekki <amirouche@hypermove.net>
In-Reply-To: <d22074be36e396a49011c6eedc8fd0aa@hypermove.net>
Xref: news.gmane.org gmane.lisp.guile.user:14422
Archived-At: <http://permalink.gmane.org/gmane.lisp.guile.user/14422>

2017-11-26 23:33 GMT+01:00 Amirouche Boubekki <amirouche@hypermove.net>:

> Héllo,
>
> I made some progress on my culturia project,
> I wanted to share with you where it's going
> with a few bits about guile-wiredtiger itself.
>
> tl;dr:
>
>   $ git clone https://a-guile-mind.github.io/culturia.one
>   $ git clone https://framagit.org/a-guile-mind/guile-wiredtiger
>
> On guix(sd) environment:
>
>   $ guix environment --ad-hoc guile-wiredtiger
>
> I stopped trying to understand what makes
> a concept search engine [1]. Instead I will
> focus on a plain old keyword search engine.
> I don't even plan to support synonyms [2].
>
> [1] https://www.slideshare.net/lucidworks/implementing-conceptual-search-in-solr-using-lsa-and-word2vec-presented-by-simon-hughes-dicecom
> [2] https://blog.algolia.com/inside-the-engine-part-6-handling-synonyms-the-right-way/
>
> But if you have insights about the subject,
> don't hesitate to share them.
>
> You might be wondering why I want to build
> a search engine. Yes, since I am not working
> anymore to reach the Moon, re-inventing the
> wheel might sound useless. It's not.
> Because Guile, because wiredtiger.
>
> You see, wiredtiger is the latest iteration
> (over several decades) of some kind of data engine.
> And guile is the _only_ high level language that
> has true POSIX threads (and fibers (more on that
> later)). Does it ring a bell?
>
> So far, what is dominant in the database space
> is RDBMS. Basically tables that you can query with
> SQL. This is very neat and stuff. But what if we
> could have tables queryable in scheme?
>
> That is it! guile-wiredtiger offers a way to query
> tables in scheme. Using minikanren if you want that!
>
> What about performance? Well, according to my
> microbenchmarks it can query 1200 documents
> at random in seconds. This means that if you
> don't use a _dynamic_ schema (like grf3 or
> feature-space) and instead use the raw wiredtiger
> API found in the (wiredtiger wiredtiger) module and
> some helpers in (wiredtiger extra), you can
> achieve better performance. Also, did I mention
> that those are numbers for single-thread access?
>
> Guile has threads! Which means more RPS for your
> application, less hardware for more users. That
> said I don't think doubling the threads will double
> the throughput. This needs to be benchmarked.
>
> The thing with NoSQL wiredtiger is that it's not
> the kind of NoSQL you might think of. Unlike
> Redis it's not primarily in memory; unlike Cassandra
> it doesn't spread its data across several nodes.
> Though there are ideas on how to do that, see for
> instance TiKV [3].
>
> [3] https://github.com/pingcap/tikv
>
> According to wiredtiger there are no known limitations
> on the size of the database or the number of concurrent
> threads, provided the underlying hardware can follow...
>
> One can scale vertically, on a single machine. How far
> can we go? That's the question I'd like to answer.
>
> The idea of the search engine is to have both potentially
> a lot of data and a lot of users (compared to my blog).
>
> I am sure some people who tried and switched to
> duckduckgo want to give it a try. At least to see what
> the technology behind Google and DuckDuckGo really is.
> You can't know for sure without having experienced the
> old google or the late altavista.
>
> My other point is that for most of my searches on the
> web I rarely need to dig deeper than _my_ first
> page (even on ddg). On _my_ first page, there is most
> of the time wikipedia, stackoverflow and that's it!
> Really, there is not much of the web that concerns
> me, apart from some rare scholarly articles.
>
> English wikipedia and stackoverflow are already almost
> 100 GB, so that's bigger than anything one can have as a blog.
>
> Right now culturia.one does store pages using three
> tables:
>
> The document table stores information about each document:
> uid (a unique identifier for the document), url, and a scheme
> list of token uids (stored like that for faster
> comparison).
>
>    key |           value
>   -----+-------------+----------------------------
>    uid |  url        | document as token uid
>   -----+-------------+----------------------------
>     1  | gnu.org     | 14 32 51 42 63 74 75 23 113
>     2  | hyperdev.fr | 1 22 1 12 23 71 175 323 14
>
> There is another table that stores all the tokens
> found in the documents:
>
>    key |          value
>   -----+--------+-------
>    uid | token  | count
>   -----+--------+-------
>     1  |  the   |  42
>
> Where count is the number of documents in which the token
> appears at least once. This table has an index on the token
> column to quickly retrieve the UID of a given TOKEN.
>
> The last table is the so-called inverted index.
> It's a bit special because the key part of the table
> has two columns, but that is not bizarre if you work with
> primary keys:
>
>
>          key        | value
>   -------+----------+-------
>    token | document | count
>   -------+----------+-------
>       42 |   1      |  1
>
> (I just realized that I never use that count column;
> it is supposed to be the number of times token 42
> appears in document 1)
>
> Anyway, pretty simple no?
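
To check my reading of the three tables, here is a minimal sketch that
models them with plain Guile hash tables. Nothing in it uses
guile-wiredtiger: every name (corpus, intern!, bump!, index!,
document-frequency) is made up for illustration, and the words in the
toy corpus are invented. It only shows how the document table, the
token table with its per-document count, and the two-column inverted
index relate to each other.

```scheme
(use-modules (srfi srfi-1))

;; Toy corpus: (url . words). The urls come from the quoted tables,
;; the words are invented for illustration.
(define corpus
  '(("gnu.org"     . ("the" "guile" "manual" "the"))
    ("hyperdev.fr" . ("the" "guile" "blog"))))

;; Hypothetical in-memory stand-ins for the three tables.
(define documents   (make-hash-table)) ; document uid -> (url . token uids)
(define token->uid  (make-hash-table)) ; index on the token column
(define token-count (make-hash-table)) ; token uid -> number of documents
(define inverted    (make-hash-table)) ; (token uid . document uid) -> occurrences

(define next-token-uid 0)

(define (intern! token)
  "Return the uid of TOKEN, allocating a fresh one the first time it is seen."
  (or (hash-ref token->uid token)
      (begin
        (set! next-token-uid (+ next-token-uid 1))
        (hash-set! token->uid token next-token-uid)
        next-token-uid)))

(define (bump! table key)
  "Increment the counter stored under KEY in TABLE, starting from zero."
  (hash-set! table key (+ 1 (hash-ref table key 0))))

(define (index! doc-uid url words)
  "Fill the three tables for one document."
  (let ((uids (map intern! words)))
    (hash-set! documents doc-uid (cons url uids))
    ;; Inverted index: one row per (token, document) pair.
    (for-each (lambda (uid) (bump! inverted (cons uid doc-uid))) uids)
    ;; Document frequency: each distinct token counts once per document.
    (for-each (lambda (uid) (bump! token-count uid))
              (delete-duplicates uids))))

;; Index the corpus, numbering documents from 1.
(for-each (lambda (doc-uid entry)
            (index! doc-uid (car entry) (cdr entry)))
          (iota (length corpus) 1)
          corpus)

(define (document-frequency token)
  "Number of documents in which TOKEN appears at least once."
  (hash-ref token-count (hash-ref token->uid token) 0))
```

With this toy corpus, (document-frequency "the") evaluates to 2, while
the inverted index row for "the" in document 1 holds 2 occurrences,
which is the per-document count that the quoted text says is never
used.
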
>
> Let's imagine we have a simple query like the following:
>
>    culturia://guile+manual
>
> We will imagine that we indexed some things that contain
> those words (or the engine will throw an exception).
>
> The querying engine will first compute the frequency of both
> keywords and then look up the inverted index for the least
> frequent keyword.


The least frequent keyword?
Not the most frequent keyword?


> That way, there is a 'seed' set of documents
> that we can filter with a small vm that will interpret the
> rest of the query for instance. Something like:
>
>   (filter (hit? (cdr query)) seed)
>
> Sort of. I can't make it simpler right now, but you can
> have a look at the code. The public procedure at the bottom,
> called 'search' [4], is where the code starts.
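
A minimal sketch of the strategy as I understand it from the
description, not the code from ix.scm. The toy data and the names
postings, frequency, hit? and search are assumptions made for
illustration; only the final filter expression is taken from the
quoted text. Presumably the point of starting from the least frequent
keyword is that the seed set is as small as possible, so the filter
has fewer documents to check.

```scheme
(use-modules (srfi srfi-1))

;; Hypothetical toy data: document uid -> words it contains.
(define documents
  '((1 . ("guile" "manual" "reference"))
    (2 . ("guile" "wiredtiger" "bindings"))
    (3 . ("guile" "manual" "tutorial"))
    (4 . ("guile" "fibers"))))

(define (postings token)
  "Uids of the documents containing TOKEN (stand-in for an inverted index lookup)."
  (filter-map (lambda (doc) (and (member token (cdr doc)) (car doc)))
              documents))

(define (frequency token)
  "Number of documents containing TOKEN."
  (length (postings token)))

(define (hit? keywords)
  "Return a predicate that is true for document uids containing every keyword."
  (lambda (uid)
    (let ((words (assv-ref documents uid)))
      (every (lambda (keyword) (member keyword words)) keywords))))

(define (search keywords)
  "Seed with the documents of the rarest keyword, then filter with the rest.
KEYWORDS is assumed to be a non-empty list of strings."
  (let* ((query (sort keywords
                      (lambda (a b) (< (frequency a) (frequency b)))))
         (seed (postings (car query))))
    (filter (hit? (cdr query)) seed)))

(search '("guile" "manual")) ; => (1 3)
```

Here "manual" appears in two documents and "guile" in four, so the seed
is the two "manual" documents and only those are checked for "guile".
Unlike the engine described above, this sketch returns an empty list
for an unknown keyword instead of throwing an exception.
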
>
> [4] https://github.com/a-guile-mind/culturia.one/blob/master/src/wiredtiger/ix.scm#L455


File not found.

All this looks pretty interesting but I have to say that I prefer the work
you're doing on GNUNet ;-)


>
>
> So what is the next iteration?
>
> guile-wiredtiger:
>
> - fix the tests to run with guix
>
> culturia:
>
> - make culturia compatible with guile-wiredtiger found in guix
> - write a program that will index whatever wikipedia dump we want
> - make a program to index stackoverflow based on the archive.org dump
> - make a program that will index news.ycombinator.com and the linked
>   articles
> - Create a crawler for sitemaps (or find one)
> - Create a crawler for RSS/ATOM feeds (or find one)
> - Support WARC file format and crawl gnu.org website
> - Implement !g and !ddg in the searchbox to redirect the user
>   to another search engine.
>
> Contact me directly if you want to work on one of the tasks
> or some other tasks, or if you want to report a bug.
>
> Tx!
>
> --
> Amirouche ~ amz3 ~ http://www.hyperdev.fr
>
>