From: Amirouche Boubekki
Newsgroups: gmane.lisp.guile.user
Subject: Re: [HELP] a search engine in GNU Guile
Date: Fri, 23 Sep 2016 07:52:36 +0200
To: Guile User
Cc: guile-user

Héllo,

I made some progress regarding culturia. There is now a web interface
available at hypermove.net [0].

[0] http://hypermove.net/?query=guile+algorithms+-wingolog

Now you can:

- search a bunch of websites related to Guile that I quickly selected
- use the minus character "-" to exclude a keyword from the result set
- index small-ish domains

There is no UI yet to add your website to the index. If you want your
website to be part of the experiment, reach me.

I tried two search backends: one based on a graph database, which is more
versatile but much slower, and another based on an inverted index. The
latter is the one currently in use. It's fast enough right now.
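
To illustrate the inverted-index approach and the "-" exclusion operator mentioned above, here is a minimal sketch in Python. It is not culturia's code (which is written in Guile). The function names `tokenize`, `build_index` and `search`, the whitespace tokenizer, the sample documents, and the choice to require every plain term (AND semantics) are all assumptions made for the example.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents matching every plain term and no '-term'."""
    required, excluded = [], []
    for token in tokenize(query):
        if token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        else:
            required.append(token)
    if not required:
        # Assumption: a query made only of exclusions matches nothing.
        return set()
    # Intersect the posting sets of all required terms.
    results = set.intersection(*(index.get(term, set()) for term in required))
    # Drop every document that contains an excluded term.
    for term in excluded:
        results -= index.get(term, set())
    return results


# Hypothetical documents, keyed by made-up ids.
documents = {
    "a": "guile algorithms for graphs",
    "b": "wingolog on guile algorithms",
    "c": "python algorithms",
}
index = build_index(documents)
print(sorted(search(index, "guile algorithms -wingolog")))  # prints ['a']
```

Each lookup touches only the posting sets of the query terms rather than scanning every document, which is one plausible reason an inverted index would be faster than a general-purpose graph store for this kind of query.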

What will be done next:

- Add wikipedia, wiktionary, stackoverflow and hackernews to the database
- Improve how results are displayed
- Remove feeds from results
- Index PDF files
- Remove duplicate entries from the results
- Paginate
- Make use of fibers to run the HTTP server
- Better crawling algorithm that supports updating the index

At some point there will be a 0.1 release, not sure when.

The code is also available at framagit [1]; the web interface is
implemented in a single module [2].

[1] https://framagit.org/a-guile-mind/culturia
[2] https://framagit.org/a-guile-mind/culturia/blob/master/src/web.scm

Happy hacking!

On 2016-08-13 17:25, Amirouche Boubekki wrote:
> Héllo,
>
>
> The goal of Culturia is to create a framework that makes it easy
> to tap into Natural Language Understanding algorithms (and NLP)
> and provide an interface for common tasks.
>
> Culturia is an intelligence augmentation software.
>
> Its primary interface is a search engine. Another important aspect
> of the project is that it wants to be usable offline; as such, it will
> come with infrastructure to dump, load and store datasets for offline
> use.
>
> The current state of the project can be described as a big ball of mud.
> There is a tiny search engine with crawling skills and that's basically
> all of it.
>
> The immediate changes that should happen are, in order of preference:
>
> - offline stackoverflow (cf. sotoki.scm) and use the generated
>   website to create a zim for kiwix [0]. This is a great occasion to
>   show how great GNU Guile is!
> - port whoosh/lucene to guile to improve text search
> - offline hackernews, wikidata, wikipedia, wiktionary
> - implement BM25f
>
> Culturia is a reference to _Culture and Empire_ by Pieter Hintjens.
>
> Sparse documentation is available online [1].
> It's hosted on github [2] (This can change, if contributors
> don't want to use github).
>
> The TODO list is big, here is some stuff that needs to be done:
>
> - finish GrammarLink bindings
> - create sophia [3] bindings
> - implement TextRank
> - implement PageRank
> - create a GUI using sly or html
> - explore ways to easily share a database among several processes
>
> And many other things! Newbies are accepted obviously!
>
> Send me a mail or use #guile @ irc.freenode.net, I am amz3.
>
>
> Happy hacking!
>
>
> [0] http://www.kiwix.org/wiki/Main_Page
> [1] https://amirouche.github.io/Culturia/doc/
> [2] https://github.com/amirouche/Culturia
> [3] http://sophia.systems/

--
Amirouche ~ amz3 ~ http://www.hyperdev.fr