From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: Amirouche Boubekki Newsgroups: gmane.lisp.guile.user Subject: Re: What's next with culturia search engine? (and guile-wiredtiger) Date: Tue, 28 Nov 2017 00:29:30 +0100 Message-ID: <51004583028c27dffe979ec1be35b8ef@hypermove.net> References: NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Trace: blaine.gmane.org 1511825517 3428 195.159.176.226 (27 Nov 2017 23:31:57 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Mon, 27 Nov 2017 23:31:57 +0000 (UTC) User-Agent: Roundcube Webmail/1.1.2 Cc: Guile User To: Tom Jakubowski Original-X-From: guile-user-bounces+guile-user=m.gmane.org@gnu.org Tue Nov 28 00:31:50 2017 Return-path: Envelope-to: guile-user@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1eJSsF-0008Ax-5x for guile-user@m.gmane.org; Tue, 28 Nov 2017 00:31:39 +0100 Original-Received: from localhost ([::1]:35241 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1eJSsM-0001Q7-Iq for guile-user@m.gmane.org; Mon, 27 Nov 2017 18:31:46 -0500 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:47905) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1eJSqG-0000D5-5R for guile-user@gnu.org; Mon, 27 Nov 2017 18:29:37 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1eJSqE-0008Sz-3I for guile-user@gnu.org; Mon, 27 Nov 2017 18:29:36 -0500 Original-Received: from relay6-d.mail.gandi.net ([2001:4b98:c:538::198]:56529) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1eJSqD-0008S2-Pz for guile-user@gnu.org; Mon, 27 Nov 2017 18:29:34 -0500 Original-Received: from webmail.gandi.net (webmail5-d.mgt.gandi.net [10.58.1.145]) (Authenticated sender: amirouche@hypermove.net) by relay6-d.mail.gandi.net (Postfix) with ESMTPA id A9655FB881; Tue, 28 Nov 2017 00:29:30 +0100 (CET) In-Reply-To: X-Sender: amirouche@hypermove.net X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Received-From: 2001:4b98:c:538::198 X-BeenThere: guile-user@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: General Guile related discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guile-user-bounces+guile-user=m.gmane.org@gnu.org Original-Sender: "guile-user" Xref: news.gmane.org gmane.lisp.guile.user:14302 Archived-At: On 2017-11-27 23:03, Tom Jakubowski wrote: > Looks very cool! I don't have any comments to add other than that I > think the git URL for culturia.one has an error: it should be > https://github.com/a-guile-mind/culturia.one.git [12] and not > https://a-guile-mind.github.io/culturia.one [1] > > In other words: > > $ git clone https://github.com/a-guile-mind/culturia.one.git [12] > > Tom Tx Tom! > > On Sun, Nov 26, 2017 at 2:34 PM Amirouche Boubekki > wrote: > >> Héllo, >> >> I made some progress on my culturia project, >> I wanted to share with you where it's going >> with a few bits about guile-wiredtiger itself. >> >> tl;dr: >> >> $ git clone https://a-guile-mind.github.io/culturia.one [1] >> $ git clone https://framagit.org/a-guile-mind/guile-wiredtiger >> [2] >> >> On guix(sd) environment: >> >> $ guix environment --ad-hoc guile-wiredtiger >> >> I stopped trying to understand what makes >> a concept search engine [0]. Instead I will >> focus on plain old keyword search engine. >> I don't even plan to support synonyms [1]. >> >> [1] >> > https://www.slideshare.net/lucidworks/implementing-conceptual-search-in-solr-using-lsa-and-word2vec-presented-by-simon-hughes-dicecom >> [3] >> [2] >> > https://blog.algolia.com/inside-the-engine-part-6-handling-synonyms-the-right-way/ >> [4] >> >> But, if you have insights about the suject, >> don't hesitate to share them. >> >> You might be wondering why I want to build >> a search engine. Yes, since I am not working >> anymore to reach the Moon, re-inventing the >> wheel might sound useless. It's not. >> Because Guile, because wiredtiger. >> >> You see wiredtiger is the last iteration >> (over several decades) of somekind of data engine. >> And guile is the _only_ high level language that >> has true POSIX threads (and fibers (more on that >> later)). Does it ring a bell? >> >> So far, what is dominant in the database space >> is RDBMS. Basically tables that you can query with >> SQL. This is very neat and stuff. But what if we >> could have tables queryable in scheme? >> >> That is it! guile-wiredtiger offers a way to query >> tables in scheme. Using minikanren if you want that! >> >> What about performance? Well according to my >> microbenchmarks it can query 1200 documents >> at random in seconds. Which means that if you >> don't use a _dynamic_ schema (like grf3 or >> feature-space) and use the raw wiredtiger API >> found in (wiredtiger wiredtiger) module and >> some helpers in (wiredtiger extra). You can >> achieve better performance. Also, did I mention >> that those are numbers for single thread access? >> >> Guile has threads! Which means more RPS for your >> application, less hardware for more users. That >> said I don't think doubling the threads will double >> the throughput. This needs to be benchmarked. >> >> The thing with NOSQL wiredtiger, is that it's not >> the kind of NOSQL you might think about. Unlike >> REDIS it's not primary in memory, unlike Cassandra >> it doesn't spread it's data accross several nodes. >> Though there are ideas on how to do that see for >> instance TiKV [3]. >> >> [3] https://github.com/pingcap/tikv [5] >> >> According to wiredtiger there is no known limitations >> in the size of the database or the number of concurrent >> threads − provided the underlying hardware can follow... >> >> One can scale vertically, on a single machine. How far >> can we go? That's the question I'd like to answer. >> >> The search engine is the idea to have both potentially >> a lot of data and a lot of users (compared to my blog). >> >> I am sure some people who tried and switched to >> duckduckgo want to give it a try. At least to see what >> the technology behind Google and DuckDuckGo really is. >> You can't know for sure without having experienced the >> old google or feu altavista. >> >> My point other point, is for most of my search on the >> web I don't need often to dig deeper than _my_ first >> page (even on ddg). On _my_ first page, there is most >> of the time wikipedia, stackoverflow and that's it! >> Really, there is not much of the web that concerns >> me. Outside some rare scholar articles. >> >> english wikipedia and stackoverflow are already almost >> 100Go so it's bigger than any one can have as blog. >> >> Right now culturia.one does store pages using three >> tables: >> >> Document table will store information about the document, >> uid is unique identifier for the document, url and a scheme >> list of token uid (This is stored like that for faster >> comparaison) >> >> key | value >> -----+-------------+---------------------------- >> uid | url | document as token uid >> -----+-------------+---------------------------- >> 1 | gnu.org [6] | 14 32 51 42 63 74 75 23 113 >> 2 | hyperdev.fr [7] | 1 22 1 12 23 71 175 323 14 >> >> There is another table that stores, all the tokens >> found in the documents: >> >> key | value >> -----+--------+------- >> uid | token | count >> -----+--------+------- >> 1 | the | 42 >> >> Where count, is the count of document where the token >> appears at least once. This table as an index on token >> column to quikcly retrieve the UID of given TOKEN. >> >> The last index, is the so called inverted-index, >> It's a bit special because the key part of the table >> has two columns but not bizarre if you work with primary >> keys: >> >> key | value >> -----+----------+------- >> token| document | count >> -----+------------------ >> 42 | 1 | 1 >> >> (I just figured that I never use that count column, >> it supposed to be the number of times of the token 42 >> appears in document 1) >> >> Anyway, pretty simple no? >> >> Let's imagine we have a simple query like the following: >> >> culturia://guile+manual >> >> We will imagine that we indexed some things that containt >> those words (or the engine will throw an exception). >> >> The quering engine will first compute the frequency of both >> keywords and then lookup the inverted index for the least >> frequent keyword. That way, there is a 'seed' set of documents >> that we can filter with a small vm that will interpret the >> rest of the query for instance. Something like: >> >> (filter (hit? (cdr query)) seed) >> >> Sort of. I can't make it simpler right now, but you can >> have a look at the code. The public procedure and the bottom >> called 'search' [4] is the where the code starts. >> >> [4] >> > https://github.com/a-guile-mind/culturia.one/blob/master/src/wiredtiger/ix.scm#L455 >> [8] >> >> So what is the next iteration: >> >> guile-wiredtiger: >> >> - fix the tests to run with guix >> >> culturia: >> >> - make culturia compatible with guile-wiredtiger found in guix >> - write program that will index the whatever wikipedia dump we want >> - make a program to index stackoverflow based on archives.org [9] >> dump >> - make a program that will index news.ycombinator.com [10] and the >> linked >> articles >> - Create a crawler for sitemaps (or find one) >> - Create a crawler for RSS/ATOM feeds (or find one) >> - Support WARC file format and crawl gnu.org [6] website >> - Implement !g and !ddg in the searchbox to redirect the user >> to another search engine. >> >> Conctact me directly if you want to work on one of the tasks >> or some other tasks, or if you want to report a bug. >> >> Tx! >> >> -- >> Amirouche ~ amz3 ~ http://www.hyperdev.fr [11] > > > Links: > ------ > [1] https://a-guile-mind.github.io/culturia.one > [2] https://framagit.org/a-guile-mind/guile-wiredtiger > [3] > https://www.slideshare.net/lucidworks/implementing-conceptual-search-in-solr-using-lsa-and-word2vec-presented-by-simon-hughes-dicecom > [4] > https://blog.algolia.com/inside-the-engine-part-6-handling-synonyms-the-right-way/ > [5] https://github.com/pingcap/tikv > [6] http://gnu.org > [7] http://hyperdev.fr > [8] > https://github.com/a-guile-mind/culturia.one/blob/master/src/wiredtiger/ix.scm#L455 > [9] http://archives.org > [10] http://news.ycombinator.com > [11] http://www.hyperdev.fr > [12] https://github.com/a-guile-mind/culturia.one.git -- Amirouche ~ amz3 ~ http://www.hyperdev.fr