From mboxrd@z Thu Jan 1 00:00:00 1970
From: Amirouche Boubekki
Newsgroups: gmane.lisp.guile.user
Subject: What's next with culturia search engine? (and guile-wiredtiger)
Date: Sun, 26 Nov 2017 23:33:33 +0100
To: Guile User
Héllo,

I made some progress on my culturia project. I wanted to share with you
where it's going, with a few bits about guile-wiredtiger itself.

tl;dr:

  $ git clone https://a-guile-mind.github.io/culturia.one
  $ git clone https://framagit.org/a-guile-mind/guile-wiredtiger

In a guix(sd) environment:

  $ guix environment --ad-hoc guile-wiredtiger

I stopped trying to understand what makes a concept search engine [1].
Instead I will focus on a plain old keyword search engine. I don't even
plan to support synonyms [2].

[1] https://www.slideshare.net/lucidworks/implementing-conceptual-search-in-solr-using-lsa-and-word2vec-presented-by-simon-hughes-dicecom
[2] https://blog.algolia.com/inside-the-engine-part-6-handling-synonyms-the-right-way/

But if you have insights about the subject, don't hesitate to share
them.

You might be wondering why I want to build a search engine. Since I am
not working anymore to reach the Moon, re-inventing the wheel might
sound useless. It's not. Because Guile, because wiredtiger. You see,
wiredtiger is the last iteration (over several decades) of some kind of
data engine. And Guile is the _only_ high-level language that has true
POSIX threads (and fibers (more on that later)). Does it ring a bell?

So far, what is dominant in the database space is RDBMS: basically
tables that you can query with SQL. This is very neat. But what if we
could have tables queryable in Scheme? That is it!
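To make that concrete, here is what querying a table from Scheme can
feel like. This is a plain in-memory sketch with an alist-based toy
table and a made-up `document-url` procedure, not the actual
guile-wiredtiger API, which works on real on-disk tables:

```scheme
;; a toy "table": one association list per row
(define documents
  '(((uid . 1) (url . "gnu.org"))
    ((uid . 2) (url . "hyperdev.fr"))))

;; the moral equivalent of: SELECT url FROM documents WHERE uid = ?
(define (document-url table uid)
  (map (lambda (row) (assq-ref row 'url))
       (filter (lambda (row) (= (assq-ref row 'uid) uid))
               table)))

(document-url documents 2)  ; => ("hyperdev.fr")
```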
guile-wiredtiger offers a way to query tables in Scheme, using
minikanren if you want that!

What about performance? Well, according to my microbenchmarks it can
query 1200 documents at random in seconds. Which means that if you skip
the _dynamic_ schema layers (like grf3 or feature-space) and instead use
the raw wiredtiger API found in the (wiredtiger wiredtiger) module, plus
some helpers from (wiredtiger extra), you can achieve better
performance. Also, did I mention that those are numbers for
single-threaded access? Guile has threads! Which means more RPS for your
application, and less hardware for more users. That said, I don't think
doubling the threads will double the throughput. This needs to be
benchmarked.

The thing with NOSQL wiredtiger is that it's not the kind of NOSQL you
might think of. Unlike Redis, it's not primarily in-memory; unlike
Cassandra, it doesn't spread its data across several nodes. Though there
are ideas on how to do that, see for instance TiKV [3].

[3] https://github.com/pingcap/tikv

According to wiredtiger there are no known limitations on the size of
the database or the number of concurrent threads, provided the
underlying hardware can follow... One can scale vertically, on a single
machine. How far can we go? That's the question I'd like to answer.

A search engine is the case where you potentially have both a lot of
data and a lot of users (compared to my blog). I am sure some people who
tried and switched to duckduckgo want to give it a try. At least to see
what the technology behind Google and DuckDuckGo really is. You can't
know for sure without having experienced the old Google or the late
AltaVista.

My other point is that for most of my searches on the web I don't often
need to dig deeper than _my_ first page of results (even on ddg). On
_my_ first page, there is most of the time wikipedia, stackoverflow and
that's it! Really, there is not much of the web that concerns me,
outside some rare scholarly articles.
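On the threads point above: Guile ships (ice-9 threads), whose par-map
spreads independent work across POSIX threads. A toy sketch, where
`lookup` is a placeholder computation standing in for a per-thread
wiredtiger session doing a real table read:

```scheme
(use-modules (ice-9 threads))

;; placeholder for a real per-thread database lookup
(define (lookup query)
  (string-length query))

;; each lookup may run on its own POSIX thread; results keep
;; the order of the input list
(par-map lookup '("guile" "manual" "wiredtiger"))  ; => (5 6 10)
```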
The English wikipedia and stackoverflow are already almost 100 GB, so
that's bigger than anything one can have as a blog.

Right now culturia.one stores pages using three tables.

The document table stores information about the document: uid is a
unique identifier for the document, then the url and a scheme list of
token uids (it is stored like that for faster comparison):

  key | value
 -----+-------------+----------------------------
  uid | url         | document as token uids
 -----+-------------+----------------------------
  1   | gnu.org     | 14 32 51 42 63 74 75 23 113
  2   | hyperdev.fr | 1 22 1 12 23 71 175 323 14

There is another table that stores all the tokens found in the
documents:

  key | value
 -----+-------+-------
  uid | token | count
 -----+-------+-------
  1   | the   | 42

Here count is the number of documents where the token appears at least
once. This table has an index on the token column to quickly retrieve
the UID of a given TOKEN.

The last table is the so-called inverted index. It's a bit special
because the key part of the table has two columns, but that's not
bizarre if you work with primary keys:

   key            | value
 -------+----------+-------
  token | document | count
 -------+----------+-------
  42    | 1        | 1

(I just figured out that I never use that count column; it is supposed
to be the number of times token 42 appears in document 1.)

Anyway, pretty simple, no? Let's imagine we have a simple query like
the following:

  culturia://guile+manual

We will imagine that we indexed some things that contain those words
(or the engine will throw an exception). The query engine will first
compute the frequency of both keywords and then look up the inverted
index for the least frequent keyword. That way, there is a 'seed' set
of documents that we can filter with a small vm that will interpret the
rest of the query, for instance something like:

  (filter (hit? (cdr query)) seed)

Sort of. I can't make it simpler right now, but you can have a look at
the code.
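That walkthrough can be sketched in a few lines of plain Guile, with a
hash table standing in for the on-disk inverted index. The names
(`index!`, `frequency`, `hit?`, `search`) are mine for illustration,
not the culturia API:

```scheme
(use-modules (srfi srfi-1))  ; every

;; inverted index: token -> list of document uids,
;; standing in for the wiredtiger inverted-index table
(define index (make-hash-table))

;; record each distinct token of a document against its uid
(define (index! uid tokens)
  (for-each (lambda (token)
              (let ((docs (hash-ref index token '())))
                (unless (member uid docs)
                  (hash-set! index token (cons uid docs)))))
            tokens))

;; document frequency of a token, ie. the count column
;; of the token table
(define (frequency token)
  (length (hash-ref index token '())))

;; does the document UID contain every remaining token?
(define (hit? tokens)
  (lambda (uid)
    (every (lambda (token) (member uid (hash-ref index token '())))
           tokens)))

;; seed with the least frequent keyword, then filter the seed
(define (search query)
  (let* ((sorted (sort query (lambda (a b)
                               (< (frequency a) (frequency b)))))
         (seed (hash-ref index (car sorted) '())))
    (filter (hit? (cdr sorted)) seed)))

(index! 1 '("guile" "manual" "reference"))
(index! 2 '("guile" "blog"))
(search '("guile" "manual"))  ; => (1)
```

Since "manual" appears in fewer documents than "guile", it is looked up
first, and the filter step only has to scan that small seed set.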
The public procedure at the bottom, called 'search' [4], is where the
code starts.

[4] https://github.com/a-guile-mind/culturia.one/blob/master/src/wiredtiger/ix.scm#L455

So what is the next iteration?

guile-wiredtiger:

- fix the tests to run with guix

culturia:

- make culturia compatible with the guile-wiredtiger found in guix
- write a program that will index whichever wikipedia dump we want
- make a program to index stackoverflow based on the archive.org dump
- make a program that will index news.ycombinator.com and the linked
  articles
- create a crawler for sitemaps (or find one)
- create a crawler for RSS/ATOM feeds (or find one)
- support the WARC file format and crawl the gnu.org website
- implement !g and !ddg in the searchbox to redirect the user to
  another search engine

Contact me directly if you want to work on one of these tasks or some
other task, or if you want to report a bug.

Tx!

--
Amirouche ~ amz3 ~ http://www.hyperdev.fr