From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: Catonano Newsgroups: gmane.lisp.guile.user Subject: Re: What's next with culturia search engine? (and guile-wiredtiger) Date: Sun, 14 Jan 2018 09:12:00 +0100 Message-ID: References: NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Trace: blaine.gmane.org 1515917438 10788 195.159.176.226 (14 Jan 2018 08:10:38 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Sun, 14 Jan 2018 08:10:38 +0000 (UTC) Cc: Guile User To: Amirouche Boubekki Original-X-From: guile-user-bounces+guile-user=m.gmane.org@gnu.org Sun Jan 14 09:10:34 2018 Return-path: Envelope-to: guile-user@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1eadN7-0001zy-B6 for guile-user@m.gmane.org; Sun, 14 Jan 2018 09:10:29 +0100 Original-Received: from localhost ([::1]:48859 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1eadP4-0001p0-Vt for guile-user@m.gmane.org; Sun, 14 Jan 2018 03:12:31 -0500 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:44185) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1eadOe-0001oS-5z for guile-user@gnu.org; Sun, 14 Jan 2018 03:12:07 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1eadOb-0005nm-Kz for guile-user@gnu.org; Sun, 14 Jan 2018 03:12:04 -0500 Original-Received: from mail-yb0-x242.google.com ([2607:f8b0:4002:c09::242]:41008) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1eadOb-0005nA-DB for guile-user@gnu.org; Sun, 14 Jan 2018 03:12:01 -0500 Original-Received: by mail-yb0-x242.google.com with SMTP id v76so4529463ybb.8 for ; Sun, 14 Jan 2018 00:12:01 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc; bh=1uhzsEoywXqckBqNc3gcx9WM7QDq/2iTg+VyKtaaMy4=; b=AZQdi0JExwC6g613Opa0zVVtitJjoBsyL5eTj72vTQiTOsIL+oV4Bk9E/1wS6lP0XV n1qYusnFHGLZ5HKW6/fYFjPj1+GxjuYzneKqu1nOdHqlhBLvb1ONA2zONgV3fOd3f/SP 9oAWnFoLTBnIE+buSzhldf3C6i2UlXBAvM6MWA6/7QVMKVBEtgXjiVfxhRd6DgVquU+5 hgknUIvVYECpzc8jgeEuGcTapiT5XhtG/6Uc3oeU4gcm6iXa5qe+zh5BytW+/+SBgORc aqJCxEiA6i3ea5/vr1D2F3anSczBLTmcUqsxD1j3CXbXsEfVkYvDCX0S9d3uUM9tQ3cx fCpQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=1uhzsEoywXqckBqNc3gcx9WM7QDq/2iTg+VyKtaaMy4=; b=G2eCAeu59t4LqJrfdhS29qkmaztRDHuygRpAbXYie9QxPRvuuN1vPivZG5pA0X99kj BgN/1MqGtEKVVYYuEiZJ5zgYu6tki2uVIaAZT9xMhPVZ96krO15lTRB8/66J9pbrSUQn XXSy80qcSlvh6bj8w/FVXl1UVJfUqrRh4dIpWLza8DdEzHvEkN/pVbMwo5XmeXaQfI5b cK1441se85XiV7Q4i6OLiVbRfcI4xFAJPRy3CmSh+9v15CsukHqRos1gwGxJY/o5lyRD flaMeELaih7w5AAv/4mP8XZ/RIztq5JVdm89BBdHyaDjRuAifu2yP9FJxvLz0SwClfRa I+Jg== X-Gm-Message-State: AKGB3mLPlQ937eey8N02UBXuHZ1mnv1OHJlHIhGQnWeewiJR4kfiopS7 zEs5O/B/GO46oU6bAc4I1uVhHoKGO0UAp89gRA== X-Google-Smtp-Source: ACJfBou4o1cG96Dldg0v8SCk3v+n9FKflr3l7bg8J73Aft9lkoSAw+bAWBb35L0O2l+1KaIH61RisVJwkfd3M4BnsUw= X-Received: by 10.37.8.6 with SMTP id 6mr24403487ybi.203.1515917520655; Sun, 14 Jan 2018 00:12:00 -0800 (PST) Original-Received: by 10.129.40.208 with HTTP; Sun, 14 Jan 2018 00:12:00 -0800 (PST) In-Reply-To: X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 2607:f8b0:4002:c09::242 X-Content-Filtered-By: Mailman/MimeDel 2.1.21 X-BeenThere: guile-user@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: General Guile related discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guile-user-bounces+guile-user=m.gmane.org@gnu.org Original-Sender: "guile-user" Xref: news.gmane.org gmane.lisp.guile.user:14422 Archived-At: 2017-11-26 23:33 GMT+01:00 Amirouche Boubekki : > H=C3=A9llo, > > I made some progress on my culturia project, > I wanted to share with you where it's going > with a few bits about guile-wiredtiger itself. > > tl;dr: > > $ git clone https://a-guile-mind.github.io/culturia.one > $ git clone https://framagit.org/a-guile-mind/guile-wiredtiger > > On guix(sd) environment: > > $ guix environment --ad-hoc guile-wiredtiger > > I stopped trying to understand what makes > a concept search engine [0]. Instead I will > focus on plain old keyword search engine. > I don't even plan to support synonyms [1]. > > [1] https://www.slideshare.net/lucidworks/implementing-conceptua > l-search-in-solr-using-lsa-and-word2vec-presented-by-simon-hughes-dicecom > [2] https://blog.algolia.com/inside-the-engine-part-6-handling- > synonyms-the-right-way/ > > But, if you have insights about the suject, > don't hesitate to share them. > > You might be wondering why I want to build > a search engine. Yes, since I am not working > anymore to reach the Moon, re-inventing the > wheel might sound useless. It's not. > Because Guile, because wiredtiger. > > You see wiredtiger is the last iteration > (over several decades) of somekind of data engine. > And guile is the _only_ high level language that > has true POSIX threads (and fibers (more on that > later)). Does it ring a bell? > > So far, what is dominant in the database space > is RDBMS. Basically tables that you can query with > SQL. This is very neat and stuff. But what if we > could have tables queryable in scheme? > > That is it! guile-wiredtiger offers a way to query > tables in scheme. Using minikanren if you want that! > > What about performance? Well according to my > microbenchmarks it can query 1200 documents > at random in seconds. Which means that if you > don't use a _dynamic_ schema (like grf3 or > feature-space) and use the raw wiredtiger API > found in (wiredtiger wiredtiger) module and > some helpers in (wiredtiger extra). You can > achieve better performance. Also, did I mention > that those are numbers for single thread access? > > Guile has threads! Which means more RPS for your > application, less hardware for more users. That > said I don't think doubling the threads will double > the throughput. This needs to be benchmarked. > > The thing with NOSQL wiredtiger, is that it's not > the kind of NOSQL you might think about. Unlike > REDIS it's not primary in memory, unlike Cassandra > it doesn't spread it's data accross several nodes. > Though there are ideas on how to do that see for > instance TiKV [3]. > > [3] https://github.com/pingcap/tikv > > According to wiredtiger there is no known limitations > in the size of the database or the number of concurrent > threads =E2=88=92 provided the underlying hardware can follow... > > One can scale vertically, on a single machine. How far > can we go? That's the question I'd like to answer. > > The search engine is the idea to have both potentially > a lot of data and a lot of users (compared to my blog). > > I am sure some people who tried and switched to > duckduckgo want to give it a try. At least to see what > the technology behind Google and DuckDuckGo really is. > You can't know for sure without having experienced the > old google or feu altavista. > > My point other point, is for most of my search on the > web I don't need often to dig deeper than _my_ first > page (even on ddg). On _my_ first page, there is most > of the time wikipedia, stackoverflow and that's it! > Really, there is not much of the web that concerns > me. Outside some rare scholar articles. > > english wikipedia and stackoverflow are already almost > 100Go so it's bigger than any one can have as blog. > > Right now culturia.one does store pages using three > tables: > > Document table will store information about the document, > uid is unique identifier for the document, url and a scheme > list of token uid (This is stored like that for faster > comparaison) > > key | value > -----+-------------+---------------------------- > uid | url | document as token uid > -----+-------------+---------------------------- > 1 | gnu.org | 14 32 51 42 63 74 75 23 113 > 2 | hyperdev.fr | 1 22 1 12 23 71 175 323 14 > > There is another table that stores, all the tokens > found in the documents: > > key | value > -----+--------+------- > uid | token | count > -----+--------+------- > 1 | the | 42 > > Where count, is the count of document where the token > appears at least once. This table as an index on token > column to quikcly retrieve the UID of given TOKEN. > > The last index, is the so called inverted-index, > It's a bit special because the key part of the table > has two columns but not bizarre if you work with primary > keys: > > > key | value > -----+----------+------- > token| document | count > -----+------------------ > 42 | 1 | 1 > > (I just figured that I never use that count column, > it supposed to be the number of times of the token 42 > appears in document 1) > > Anyway, pretty simple no? > > Let's imagine we have a simple query like the following: > > culturia://guile+manual > > We will imagine that we indexed some things that containt > those words (or the engine will throw an exception). > > The quering engine will first compute the frequency of both > keywords and then lookup the inverted index for the least > frequent keyword. The least freqent keyword ? Not the most frequent keyword ? That way, there is a 'seed' set of documents > that we can filter with a small vm that will interpret the > rest of the query for instance. Something like: > > (filter (hit? (cdr query)) seed) > > Sort of. I can't make it simpler right now, but you can > have a look at the code. The public procedure and the bottom > called 'search' [4] is the where the code starts. > > [4] https://github.com/a-guile-mind/culturia.one/blob/master/src > /wiredtiger/ix.scm#L455 file not fond All this looks pretty interesting but I have to say that I prefer the work you're doing on GNUNet ;-) > > > So what is the next iteration: > > guile-wiredtiger: > > - fix the tests to run with guix > > culturia: > > - make culturia compatible with guile-wiredtiger found in guix > - write program that will index the whatever wikipedia dump we want > - make a program to index stackoverflow based on archives.org dump > - make a program that will index news.ycombinator.com and the linked > articles > - Create a crawler for sitemaps (or find one) > - Create a crawler for RSS/ATOM feeds (or find one) > - Support WARC file format and crawl gnu.org website > - Implement !g and !ddg in the searchbox to redirect the user > to another search engine. > > Conctact me directly if you want to work on one of the tasks > or some other tasks, or if you want to report a bug. > > Tx! > > -- > Amirouche ~ amz3 ~ http://www.hyperdev.fr > >