unofficial mirror of guile-user@gnu.org 
* babelia
From: Amirouche Boubekki @ 2019-11-16 10:06 UTC
  To: Guile User

I have resumed work on my personal search engine.

It used to be called culturia [0] and had too many planned features.
At some point I renamed it asylum [1] and focused on the personal
knowledge base aspects. The last iteration was called gotofish [2].

[0] https://framagit.org/a-guile-mind/culturia
[1] https://framagit.org/a-guile-mind/culturia.next
[2] https://git.sr.ht/~amz3/guile-gotofish

I learned a lot from all these projects. In particular, I learned that
it will be a long, long, long project, even if I focus only on the
"personal search engine" line of work.

The last iteration, gotofish, was not too bad, even if it has
bitrotted. Based on my research and practical experiments, it seems
very clear that there is no way around map-reduce, which you might
know as n-par-for-each [3].

[3] https://www.gnu.org/software/guile/manual/html_node/Parallel-Forms.html#index-n_002dpar_002dfor_002deach

I made a prototype similar to n-par-for-each, except that it works
with guile-fibers, is asynchronous, and uses a shared pool of threads
instead of spawning N threads for each incoming query as gotofish
does.
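
To illustrate the map-reduce shape mentioned above, here is a minimal
sketch using the stock n-par-for-each from (ice-9 threads). It is not
the fibers-based prototype: the shards, the count-hits procedure and
the limit of 2 parallel calls are all hypothetical, chosen only to
show a parallel map step followed by a guarded reduce step.

```scheme
(use-modules (ice-9 threads))

;; Hypothetical corpus: each element stands for one shard of documents.
(define shards '(("guix shepherd" "guile") ("shepherd reboot") ("guix")))

;; Map step (hypothetical): count the documents in SHARD mentioning TERM.
(define (count-hits term shard)
  (length (filter (lambda (doc) (string-contains doc term)) shard)))

(define (parallel-count term)
  (let ((total 0)
        (lock (make-mutex)))
    ;; At most 2 shards are processed at the same time.
    (n-par-for-each 2
                    (lambda (shard)
                      (let ((hits (count-hits term shard)))
                        ;; Reduce step: the counter is shared between
                        ;; threads, so guard the update with a mutex.
                        (with-mutex lock
                          (set! total (+ total hits)))))
                    shards)
    total))

(parallel-count "shepherd")
```

Note that n-par-for-each spawns its own threads on every call, which is
the per-query cost a shared pool is meant to avoid.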

Related blog post: https://hyper.dev/blog/on-the-road-to-babelia.html

If you want to help or discuss those matters, do not hesitate to reply
to this message.


Cheers,

Amirouche ~ amz3 ~ https://hyper.dev




* Re: babelia
From: Amirouche Boubekki @ 2019-11-16 10:19 UTC
  To: Guile User


I forgot to add that there are several big-ish tasks that can be
tackled in parallel (see the blog post above). In particular, a parser
for WET or WARC files, see https://en.wikipedia.org/wiki/Web_ARChive.
This is the most common output format of crawlers, e.g.
http://commoncrawl.org/
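
As a starting point for such a parser, here is a minimal sketch that
reads only the header block of one WARC record: a version line,
"Name: value" lines, then an empty line. The procedure names and the
sample record are hypothetical, and the sketch assumes the port is
positioned at the start of a record. Reading the payload
(Content-Length bytes) and handling gzip compression are left out.

```scheme
(use-modules (ice-9 rdelim))

;; WARC lines end with CRLF; read-line strips the LF, this strips the CR.
(define (strip-cr line)
  (string-trim-right line #\return))

;; Read one record header block from PORT. Return an alist of
;; (name . value) strings; the version line is stored under "version".
(define (read-warc-headers port)
  (let loop ((headers (list (cons "version" (strip-cr (read-line port))))))
    (let ((line (read-line port)))
      (if (or (eof-object? line) (string-null? (strip-cr line)))
          (reverse headers)
          (let* ((line (strip-cr line))
                 (colon (string-index line #\:)))
            (if colon
                (loop (cons (cons (string-trim-both (substring line 0 colon))
                                  (string-trim-both (substring line (+ colon 1))))
                            headers))
                ;; Lines without a colon are skipped in this sketch.
                (loop headers)))))))

;; Hypothetical WET-style record used only for illustration.
(define sample
  (string-append "WARC/1.0\r\n"
                 "WARC-Type: conversion\r\n"
                 "WARC-Target-URI: http://example.org/\r\n"
                 "Content-Length: 5\r\n"
                 "\r\n"
                 "hello\r\n\r\n"))

(define headers (call-with-input-string sample read-warc-headers))
(assoc-ref headers "Content-Length")
```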




* Re: babelia
From: Arne Babenhauserheide @ 2019-11-16 12:08 UTC
  To: guile-user


Hi Amirouche,

For the firefox driver you might get a good start from skewer-mode:
https://github.com/skeeto/skewer-mode

Best wishes,
Arne

> Related blog post: https://hyper.dev/blog/on-the-road-to-babelia.html


--
To be apolitical
is to be political
without noticing it


* Re: babelia
From: Amirouche Boubekki @ 2019-11-18 17:12 UTC
  To: Arne Babenhauserheide; +Cc: Guile User

Hello Arne,

On Sat, 16 Nov 2019 at 13:08, Arne Babenhauserheide <arne_bab@web.de> wrote:
>
> Hi Amirouche,
>
> For the firefox driver you might get a good start from skewer-mode:
> https://github.com/skeeto/skewer-mode
>

Thanks for the hint. I am not sure I will get to the point of using a
headless browser, because a) most of what I search for on the web is
static HTML, b) it can be a security problem, c) I do not want to rely
too much on the web, and d) I think https://archive.org/ already takes
snapshots of SPAs.

Higher priority tasks include a WARC file parser and a small crawler
(one that respects robots.txt and optionally reads sitemap.xml).
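
For the robots.txt part, a crawler could start from something like the
sketch below. It is a deliberately simplified reading of the format:
it only collects the Disallow prefixes of groups that apply to every
agent ("User-agent: *") and ignores Allow rules, wildcards and
Crawl-delay. The procedure names are hypothetical.

```scheme
(use-modules (srfi srfi-1))

;; Return the Disallow path prefixes that apply to all agents.
(define (disallowed-prefixes robots-txt)
  (let loop ((lines (string-split robots-txt #\newline))
             (applies? #f)      ; does the current group cover "*"?
             (prev-agent? #f)   ; was the previous line a User-agent line?
             (prefixes '()))
    (if (null? lines)
        (reverse prefixes)
        (let* ((line (string-trim-both (car lines)))
               (colon (string-index line #\:))
               (field (and colon
                           (string-downcase
                            (string-trim-both (substring line 0 colon)))))
               (value (and colon
                           (string-trim-both (substring line (+ colon 1))))))
          (cond
           ;; Consecutive User-agent lines belong to the same group.
           ((and field (string=? field "user-agent"))
            (loop (cdr lines)
                  (or (string=? value "*") (and prev-agent? applies?))
                  #t
                  prefixes))
           ;; An empty Disallow value means "nothing is disallowed".
           ((and field applies? (string=? field "disallow")
                 (not (string-null? value)))
            (loop (cdr lines) applies? #f (cons value prefixes)))
           (else
            (loop (cdr lines) applies? #f prefixes)))))))

;; PATH is allowed when no disallowed prefix matches its beginning.
(define (allowed? path prefixes)
  (not (any (lambda (prefix) (string-prefix? prefix path)) prefixes)))

;; Hypothetical robots.txt content used only for illustration.
(define sample-robots
  "User-agent: *\nDisallow: /private/\nDisallow:\n\nUser-agent: bot\nDisallow: /\n")

(define prefixes (disallowed-prefixes sample-robots))
(allowed? "/public/index.html" prefixes)
```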

Anyway, it will be good for me to read some Elisp. That Emacs mode is impressive.




* Re: babelia
From: Amirouche Boubekki @ 2019-11-22 17:38 UTC
  To: Guile User

On Sat, 16 Nov 2019 at 11:06, Amirouche Boubekki
<amirouche.boubekki@gmail.com> wrote:
>
> I restarted working on my personal search engine.
>

I pushed a v0.1.0 tag in the repository. You can find it at:

  https://git.sr.ht/~amz3/guile-babelia

Only the command line interface works. See `make benchmarks` to learn
how to use it.

Here are the benchmark results over the bug-guix mailing list archive:

;;; ("search for:" "shepherd")

;;; ("query time in milliseconds" 270)

* search: shepherd reboot

;;; ("search for:" "shepherd reboot")

;;; ("query time in milliseconds" 200)

* search: shepherd restart

;;; ("search for:" "shepherd restart")

;;; ("query time in milliseconds" 205)

* search: guix

;;; ("search for:" "guix")

;;; ("query time in milliseconds" 3973)

As you can see in the last result, it is very slow on frequent terms
like "guix" in a Guix-specific dataset. There is a lot of room for
improvement to reduce response time. In particular, results for the
"guix" query will be cached.
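
One simple way to cache query results is a hash table keyed by the
query string. The sketch below only illustrates that idea and is not
taken from the babelia source: cached-search, slow-search and the
unbounded cache are assumptions, and a real cache would also need
invalidation when the index changes.

```scheme
;; Query string -> list of results. Never evicted in this sketch.
(define cache (make-hash-table))

;; Return the cached results for QUERY, calling SEARCH only on a miss.
(define (cached-search search query)
  (or (hash-ref cache query)
      (let ((results (search query)))
        (hash-set! cache query results)
        results)))

;; Hypothetical slow search procedure that counts how often it runs.
(define calls 0)
(define (slow-search query)
  (set! calls (+ calls 1))
  (list (string-append "result for " query)))

(cached-search slow-search "guix")
(cached-search slow-search "guix")
calls
```

The second call is answered from the table, so slow-search runs once.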


Have a good weekend!




* Re: babelia
From: Amirouche Boubekki @ 2019-12-06 13:19 UTC
  To: Guile User

Hello all!

On Sat, 16 Nov 2019 at 11:06, Amirouche Boubekki
<amirouche.boubekki@gmail.com> wrote:

> I restarted working on my personal search engine.
>

After two weeks of work, 41 files changed, 1845 insertions(+), 441
deletions(-) and 97 commits, I tagged a v0.2.0 in the repository at:

  https://git.sr.ht/~amz3/guile-babelia

The `babelia index` and `babelia search` subcommands were removed.
Instead, one has to run `make web` to spawn a server and then request
/api/search?query=foobar to run a search. To index content, one can POST a
file like test.scm to /api/index or rely on the `babelia crawler` subcommands.
The crawler is still a work in progress. Do not expect the index to be
compatible with future releases.
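
As an illustration of the search endpoint, the sketch below builds the
request URI with Guile's (web uri) module. The host and port
(localhost:8080) are assumptions, since the message does not say where
`make web` listens; only the /api/search path and the query parameter
come from the text above. The search-uri name is hypothetical.

```scheme
(use-modules (web uri))

;; Build the URI for a search request against a locally running server.
;; Host and port are assumptions; adjust them to the actual server.
(define (search-uri query)
  (build-uri 'http
             #:host "localhost"
             #:port 8080
             #:path "/api/search"
             ;; uri-encode percent-encodes spaces and reserved characters.
             #:query (string-append "query=" (uri-encode query))))

(uri->string (search-uri "shepherd reboot"))
```

The resulting URI can then be passed to http-get from (web client)
once the server is running.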


>
> The last iteration, gotofish, was not too bad, even if it has
> bitrotted. Based on my research and practical experiments, it seems
> very clear that there is no way around map-reduce, which you might
> know as n-par-for-each [3].
>
> [3]
> https://www.gnu.org/software/guile/manual/html_node/Parallel-Forms.html#index-n_002dpar_002dfor_002deach
>
> I made a prototype similar to that n-par-for-each, except it works
> with guile-fibers, is asynchronous and works with a shared pool of
> threads instead of spawning N threads for each incoming query like
> gotofish does.


Actually, what I need is n-for-each-par-map where map happens in parallel.
The implementation can be found in the babelia/pool.scm file [4].

[4] https://git.sr.ht/~amz3/guile-babelia/tree/v0.2.0/babelia/pool.scm
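
For comparison, here is how the stock n-for-each-par-map from
(ice-9 threads) behaves. This is not the code in babelia/pool.scm,
only a minimal sketch of the same calling shape: the second procedure
runs in parallel on each element, and the first one receives each
result serially, in list order. The score procedure and the documents
are hypothetical.

```scheme
(use-modules (ice-9 threads))

;; Hypothetical stand-in for scoring one document against a query.
(define (score document)
  (string-length document))

(define documents '("shepherd" "guix" "shepherd reboot"))

;; The collecting procedure is called serially, so no lock is needed
;; around this accumulator.
(define scores '())

(n-for-each-par-map 2                          ; at most 2 parallel calls
                    (lambda (result)           ; serial step
                      (set! scores (cons result scores)))
                    score                      ; parallel step
                    documents)

(reverse scores)
```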

The installation process is still a little awkward, because one needs to
change the path to the wiredtiger-3.2.0-0 shared library in the source. Add
my channel [5] and run `make init` to get started.

[5] https://git.sr.ht/~amz3/guix-amz3-channel



Happy hacking!

