unofficial mirror of guile-user@gnu.org 
 help / color / mirror / Atom feed
* [HELP] a search engine in GNU Guile
@ 2016-08-13 15:25 Amirouche Boubekki
  2016-09-04 13:35 ` Amirouche Boubekki
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Amirouche Boubekki @ 2016-08-13 15:25 UTC (permalink / raw)
  To: Guile User

Héllo,


The goal of Culturia is to create a framework that makes it easy
to tape into Natural Language Understanding algorithms (and NLP)
and provide an interface for common tasks.

Culturia is an intelligence augmentation software.

It's primary interface is a search engine. Another important aspect
of the project is that it wants to be useable offline as such it will
come with infrastructure to dump, load and store dataset for offline 
use.

The current state of the project can be described as a big ball of mud.
There is a tiny search engine with crawling skills and that's basically
all of it.

The immediate changes that should happen are in order of preference:

- offline stackoverflow (cf. sotoki.scm) and use the generated
   website to create a zim for kiwix [0]. This is great occasion to
   show how great GNU Guile is!
- port whoosh/lucene to guile to improve text search
- offline hackernews, wikidata, wikipedia, wiktionary
- implement BM25f

Culturia is a reference to _Culture and Empire_ by Pieter Hintjens.

It has a sparse documentation is available online [1].
It's hosted on github [2] (This can change, if contributors
don't want to use github).

The TODO list is big, here is some stuff that needs to be done:

- finish GrammarLink bindings
- create sophia [3] bindings
- implement TextRank
- implement PageRank
- create a GUI using sly or html
- explore ways to easily share database among several processus

And many other things! Newbies are accepted obviously!

Send me a mail or use #guile @ irc.freenode.net, I am amz3.


Happy hacking!


[0] http://www.kiwix.org/wiki/Main_Page
[1] https://amirouche.github.io/Culturia/doc/
[2] https://github.com/amirouche/Culturia
[3] http://sophia.systems/


-- 
Amirouche ~ amz3 ~ http://www.hyperdev.fr



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [HELP] a search engine in GNU Guile
  2016-08-13 15:25 [HELP] a search engine in GNU Guile Amirouche Boubekki
@ 2016-09-04 13:35 ` Amirouche Boubekki
  2016-09-09 14:40   ` Christopher Allan Webber
  2016-09-09 14:39 ` Christopher Allan Webber
  2016-09-23  5:52 ` Amirouche Boubekki
  2 siblings, 1 reply; 9+ messages in thread
From: Amirouche Boubekki @ 2016-09-04 13:35 UTC (permalink / raw)
  To: Guile User

[-- Attachment #1: Type: text/plain, Size: 7297 bytes --]

Héllo,


I'd like to share with you a mini-project on the road of Culturia 0.1
[0] which is a boolean keyword search engine (similar in principle to
xapian, lucene and whoosh (with less polish and features)).

[0] https://github.com/amirouche/Culturia

The dependencies are wiredtiger develop branch and html2text cli 
utility.

The main design choice in this search engine is the use of an *inverted
index* to store documents. It only works with languages that have space
between words like english and french.

The procedure used to index documents is `index`, It takes an `url` as
first argument and `html` as second argument. Actually `url` can be
any string reference and `html` can be plain text.

(I am not happy with the naming in this `wsh.scm` in general)

Here is the wiredtiger table definition:

```scheme
(define-public *wsh* '((urls
                         ((uid . record))
                         ((url . string))
                         ())
                        (terms
                         ((uid . record))
                         ((string . string))
                         ((inverse (string) (uid))))
                        (inverted-index
                         ((term . unsigned-integer)
                          (url-uid . unsigned-integer)
                          (position . unsigned-integer))
                         ((nothing . bytes))
                         ((positions (url-uid position) (term))))))
```

This describes the tables used to store the informations needed by the
search to execute (somewhat) efficently its algorithm.

The first table `urls` associates an unique identifier to urls.

`terms` associate a unique identifier with each term found in the
whole database.

When a document is indexed via `(index url html)`, `html` is
preprocessed through `html2text` and `string->tokens`. In particular,
`string->tokens` will:

- downcase everything
- remove single char words
- remove words found in the stopwords.txt file

`inverted-index` is the gist of this way of building a search engine
will associate each term, with the documents where it's found with its
position.

Here is an example run:

```scheme
(let ((query (search/and (search/term "database")
                          (search/or (search/term "postgresql")
                                     (search/term "pgsql")))))
    (search* query))
```

Which is the scheme expression of `database AND (pogresql OR pgsql)`.
The search vm also support negation via `search/not`.

Scoring is done only using term frequency in the document and not using
tf-idf because it makes the implementation simpler.

If you are interested in how search queries are translated to database
procedure calls the entry point is `search*` obviously but the gist of
the algorithm happens in `search/vm` [1]. Basically it boilds down to
do single term search via `search` procedure [2] and do list set
operations like intersection and difference. I think there is an
optimisation to do here but I did not bother.

[1] 
https://github.com/amirouche/Culturia/blob/wsh-inverted-index/src/wsh.scm#L106
[2] 
https://github.com/amirouche/Culturia/blob/wsh-inverted-index/src/wsh.scm#L66

Have a look at `wsh.scm` at the bottom they are few tests.

Also `inverted-index` table has a `positions` index on the `url-uid`
and `position` to make it possible to do "phrase search", even if it's
not part of the current implementation, it is possible to implement
phrase search thanks to this. Phrase search are queries surrounded by
double quotes like "what is gnu". One can implement phrase search as
follow:

- split phrase into words

- search first word in the `inverted-index` table using `search`, this
   will give the `uid` of documents with the position `index` in that
   document of the first word of the phrase.

- lookup `positions` index with `uid`, and `position + 1` and keep the
   uids for which `position + 1` is the next word in the phrase. Repeat
   that step until all words from the phrase are consumed.

And you have phrase search.

There is another thing that doesn't work right now. If you look up a
term that is not in the databae it will fail. There is a test case for
it in `wsh.scm`.

I pushed this to `wsh-inverted-index` branch:

   git clone https://github.com/amirouche/Culturia --branch 
wsh-inverted-index

And attached the interesting files to this mail.

Regarding **Culturia**, I wish to re-code this on top of a graph
datastructure to have more flexibility regarding how documents are
related with terms and how terms are related together. Right now it's
not possible without using another table to relate terms
together. Culturia will rely on graphdb mainly to make the
implementation simpler. The graph datastructure `grf3` [3] is
implemented on top of ukv tuple space [4] which is a new
implementation of uav database which make use of wiredtigerz new `env`
multithread support.

I also updated the documentation regarding `wiredtiger` and 
`wiredtigerz`.
Have a look at it online github [5][6].

[3] 
https://github.com/amirouche/Culturia/blob/474b5910a6f1a8541280f53bc6ae6b8458929e7f/src/grf3.scm
[4] 
https://github.com/amirouche/Culturia/blob/474b5910a6f1a8541280f53bc6ae6b8458929e7f/src/ukv.scm
[5] https://amirouche.github.io/Culturia/doc/wiredtiger.html
[6] https://amirouche.github.io/Culturia/doc/wiredtigerz.html

Happy hacking!

On 2016-08-13 17:25, Amirouche Boubekki wrote:
> Héllo,
> 
> 
> The goal of Culturia is to create a framework that makes it easy
> to tape into Natural Language Understanding algorithms (and NLP)
> and provide an interface for common tasks.
> 
> Culturia is an intelligence augmentation software.
> 
> It's primary interface is a search engine. Another important aspect
> of the project is that it wants to be useable offline as such it will
> come with infrastructure to dump, load and store dataset for offline 
> use.
> 
> The current state of the project can be described as a big ball of mud.
> There is a tiny search engine with crawling skills and that's basically
> all of it.
> 
> The immediate changes that should happen are in order of preference:
> 
> - offline stackoverflow (cf. sotoki.scm) and use the generated
>   website to create a zim for kiwix [0]. This is great occasion to
>   show how great GNU Guile is!
> - port whoosh/lucene to guile to improve text search
> - offline hackernews, wikidata, wikipedia, wiktionary
> - implement BM25f
> 
> Culturia is a reference to _Culture and Empire_ by Pieter Hintjens.
> 
> It has a sparse documentation is available online [1].
> It's hosted on github [2] (This can change, if contributors
> don't want to use github).
> 
> The TODO list is big, here is some stuff that needs to be done:
> 
> - finish GrammarLink bindings
> - create sophia [3] bindings
> - implement TextRank
> - implement PageRank
> - create a GUI using sly or html
> - explore ways to easily share database among several processus
> 
> And many other things! Newbies are accepted obviously!
> 
> Send me a mail or use #guile @ irc.freenode.net, I am amz3.
> 
> 
> Happy hacking!
> 
> 
> [0] http://www.kiwix.org/wiki/Main_Page
> [1] https://amirouche.github.io/Culturia/doc/
> [2] https://github.com/amirouche/Culturia
> [3] http://sophia.systems/

-- 
Amirouche ~ amz3 ~ http://www.hyperdev.fr

[-- Attachment #2: stopwords.en.txt --]
[-- Type: text/plain, Size: 4872 bytes --]

a
able
about
above
abst
accordance
according
accordingly
across
act
actually
added
adj
affected
affecting
affects
after
afterwards
again
against
ah
ain't
all
allow
allows
almost
alone
along
already
also
although
always
am
among
amongst
an
and
announce
another
any
anybody
anyhow
anymore
anyone
anything
anyway
anyways
anywhere
apart
apparently
appear
appreciate
appropriate
approximately
are
aren
arent
aren't
arise
around
as
a's
aside
ask
asking
associated
at
auth
available
away
awfully
b
back
be
became
because
become
becomes
becoming
been
before
beforehand
begin
beginning
beginnings
begins
behind
being
believe
below
beside
besides
best
better
between
beyond
biol
both
brief
briefly
but
by
c
ca
came
can
cannot
cant
can't
cause
causes
certain
certainly
changes
clearly
c'mon
co
com
come
comes
concerning
consequently
consider
considering
contain
containing
contains
corresponding
could
couldnt
couldn't
course
c's
currently
d
date
definitely
described
despite
did
didn't
different
do
does
doesn't
doing
done
don't
down
downwards
due
during
e
each
ed
edu
effect
eg
eight
eighty
either
else
elsewhere
end
ending
enough
entirely
especially
et
et-al
etc
even
ever
every
everybody
everyone
everything
everywhere
ex
exactly
example
except
f
far
few
ff
fifth
first
five
fix
followed
following
follows
for
former
formerly
forth
found
four
from
further
furthermore
g
gave
get
gets
getting
give
given
gives
giving
go
goes
going
gone
got
gotten
greetings
h
had
hadn't
happens
hardly
has
hasn't
have
haven't
having
he
hed
he'd
he'll
hello
hence
her
here
hereafter
hereby
herein
heres
here's
hereupon
hers
herself
hes
he's
hi
hid
him
himself
his
hither
home
hopefully
how
howbeit
however
how's
hundred
i
I
id
i'd
ie
if
ignored
i'll
im
i'm
immediate
immediately
importance
important
in
inasmuch
inc
indeed
index
indicate
indicated
indicates
information
inner
insofar
instead
into
invention
inward
is
isn't
it
itd
it'd
it'll
its
it's
itself
i've
j
just
k
keep
keep 	keeps
keeps
kept
kg
km
know
known
knows
l
largely
last
lately
later
latter
latterly
least
less
lest
let
lets
let's
like
liked
likely
line
little
'll
look
looking
looks
ltd
m
made
mainly
make
makes
many
may
maybe
me
mean
means
meantime
meanwhile
merely
mg
might
million
miss
ml
more
moreover
most
mostly
mr
mrs
much
mug
must
mustn't
my
myself
n
na
name
namely
nay
nd
near
nearly
necessarily
necessary
need
needs
neither
never
nevertheless
new
next
nine
ninety
no
nobody
non
none
nonetheless
noone
nor
normally
nos
not
noted
nothing
novel
now
nowhere
o
obtain
obtained
obviously
of
off
often
oh
ok
okay
old
omitted
on
once
one
ones
only
onto
or
ord
other
others
otherwise
ought
our
ours
ourselves
out
outside
over
overall
owing
own
p
page
pages
part
particular
particularly
past
per
perhaps
placed
please
plus
poorly
possible
possibly
potentially
pp
predominantly
present
presumably
previously
primarily
probably
promptly
proud
provides
put
q
que
quickly
quite
qv
r
ran
rather
rd
re
readily
really
reasonably
recent
recently
ref
refs
regarding
regardless
regards
related
relatively
research
respectively
resulted
resulting
results
right
run
s
said
same
saw
say
saying
says
sec
second
secondly
section
see
seeing
seem
seemed
seeming
seems
seen
self
selves
sensible
sent
serious
seriously
seven
several
shall
shan't
she
shed
she'd
she'll
shes
she's
should
shouldn't
show
showed
shown
showns
shows
significant
significantly
similar
similarly
since
six
slightly
so
some
somebody
somehow
someone
somethan
something
sometime
sometimes
somewhat
somewhere
soon
sorry
specifically
specified
specify
specifying
still
stop
strongly
sub
substantially
successfully
such
sufficiently
suggest
sup
sure
sure 	t
take
taken
taking
tell
tends
th
than
thank
thanks
thanx
that
that'll
thats
that's
that've
the
their
theirs
them
themselves
then
thence
there
thereafter
thereby
thered
therefore
therein
there'll
thereof
therere
theres
there's
thereto
thereupon
there've
these
they
theyd
they'd
they'll
theyre
they're
they've
think
third
this
thorough
thoroughly
those
thou
though
thoughh
thousand
three
throug
through
throughout
thru
thus
til
tip
to
together
too
took
toward
towards
tried
tries
truly
try
trying
ts
t's
twice
two
u
un
under
unfortunately
unless
unlike
unlikely
until
unto
up
upon
ups
us
use
used
useful
usefully
usefulness
uses
using
usually
v
value
various
've
very
via
viz
vol
vols
vs
w
want
wants
was
wasnt
wasn't
way
we
wed
we'd
welcome
well
we'll
went
were
we're
werent
weren't
we've
what
whatever
what'll
whats
what's
when
whence
whenever
when's
where
whereafter
whereas
whereby
wherein
wheres
where's
whereupon
wherever
whether
which
while
whim
whither
who
whod
whoever
whole
who'll
whom
whomever
whos
who's
whose
why
why's
widely
will
willing
wish
with
within
without
wonder
wont
won't
words
world
would
wouldnt
wouldn't
www
x
y
yes
yet
you
youd
you'd
you'll
your
youre
you're
yours
yourself
yourselves
you've
z
zero

[-- Attachment #3: text.scm --]
[-- Type: text/plain, Size: 1795 bytes --]

(define-module (text))

(use-modules (ice-9 popen))
(use-modules (ice-9 rdelim))
(use-modules (srfi srfi-1))
(use-modules (srfi srfi-26))


;;;
;;; wrapping html2text
;;;
;;
;; inspired from ice-9 popen
;;

(define open-process (@@ (ice-9 popen) open-process))

(define (html2text string)
  (with-error-to-file "/dev/null"
    (lambda ()
      (call-with-values (lambda () (open-process OPEN_BOTH "html2text"))
        (lambda (read-port write-port pid)
          (display string write-port)
          (close-port write-port)
          (let ((str (read-string read-port)))
            (close-port read-port)
            (waitpid pid)
            str))))))

;;;
;;; tokenizing
;;;

;; english stop words
;; XXX: actually not very useful

(define stopwords (make-hash-table))

(with-input-from-file "stopwords.en.txt"  ;; snarffed from http://www.ranks.nl/stopwords/
  (lambda ()
    (let loop ((stopword (read-line)))
      (unless (eof-object? stopword)
        (hash-set! stopwords stopword #t)
        (loop (read-line))))))

(define (filter-stopwords lst)
  (filter (lambda (token) (not (hash-ref stopwords token))) lst))

;; tokens

(define punctuation (string->list "!\"#$%&\\'()*+,-./:;<=>?@[\\]^_`{|}~\n\t"))

(define (clean text)
  "Replace punctuation characters from TEXT with a space character"
  (string-map (lambda (char) (if (list-index (cut equal? char <>) punctuation) #\space char)) text))

(define split (cut string-split <> #\space))

(define (sanitize words)
  "Only keep words that have length bigger than one"
  (filter (lambda (word) (< 1 (string-length word))) words))

;; XXX: compose must be read from right to left
(define string->tokens (compose filter-stopwords sanitize split string-downcase clean))

(define-public html->tokens (compose string->tokens html2text))

[-- Attachment #4: wiredtiger.scm --]
[-- Type: text/plain, Size: 30440 bytes --]

;; guile-wiredtiger - 0.4 - 2016/07/07

;; Copyright © 2014-2016 Amirouche BOUBEKKI <amirouche at hypermove net>

;; guile-wiredtiger is free software: you can redistribute it and/or modify
;; it under the terms of the GNU General Public License as published by
;; the Free Software Foundation, either version 2 of the License, or
;; (at your option) or version 3.

;; guile-wiredtiger is distributed in the hope that it will be useful,
;; but WITHOUT ANY WARRANTY; without even the implied warranty of
;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
;; GNU General Public License for more details.

;; You should have received a copy of the GNU General Public License
;; along with guile-wiredtiger.  If not, see <http://www.gnu.org/licenses/>

;;; Comment:
;;
;; Tested with wiredtiger develop branch
;;

(define-module (wiredtiger))

(use-modules (srfi srfi-9))  ;; records
(use-modules (srfi srfi-9 gnu))  ;; set-record-type-printer!
(use-modules (srfi srfi-26)) ;; cut

(use-modules (rnrs bytevectors))

(use-modules (ice-9 iconv))  ;; string->bytevector
(use-modules (ice-9 match))
(use-modules (ice-9 format))
(use-modules (ice-9 optargs))  ;; lambda*
(use-modules (ice-9 receive))

(use-modules (system foreign))  ;; ffi

;;;
;;; plain
;;;
;;
;; macro to quickly define records
;;
;;
;; Usage:
;;
;;   (define-record-type <car> seats wheels)
;;   (define smart (make-car 2 4))
;;   (car-seats smart) ;; => 2
;;

(define-syntax define-record-type*
  (lambda (x)
    (define (%id-name name) (string->symbol (string-drop (string-drop-right (symbol->string name) 1) 1)))
    (define (id-name ctx name)
      (datum->syntax ctx (%id-name (syntax->datum name))))
    (define (id-append ctx . syms)
      (datum->syntax ctx (apply symbol-append (map syntax->datum syms))))
    (syntax-case x ()
      ((_ rname field ...)
       (and (identifier? #'rname) (and-map identifier? #'(field ...)))
       (with-syntax ((cons (id-append #'rname #'make- (id-name #'rname #'rname)))
                     (pred (id-append #'rname (id-name #'rname #'rname) #'?))
                     ((getter ...) (map (lambda (f)
                                          (id-append f (id-name #'rname #'rname) #'- f))
                                        #'(field ...))))
         #'(define-record-type rname
             (cons field ...)
             pred
             (field getter)
             ...))))))

;;; ffi helpers

(define NULL %null-pointer)
(define POINTER '*)

(define* (dynamic-link* #:optional library-name)
  (let ((shared-object (if library-name (dynamic-link library-name) (dynamic-link))))
    (lambda (return-value function-name . arguments)
      (let ((function (dynamic-func function-name shared-object)))
        (pointer->procedure return-value function arguments)))))

(define (pointer->procedure* return-type function-pointer . args_types)
  (pointer->procedure return-type function-pointer args_types))

(define (make constructor pointer size)
  (let* ((pointer (make-pointer (array-ref pointer 0)))
         (struct (map make-pointer (u64vector->list (pointer->bytevector pointer size 0 'u64)))))
    (apply constructor (cons pointer struct))))

;;;
;;; wiredtiger bindings
;;;

(define wiredtiger (dynamic-link* "libwiredtiger.so"))

(define* wiredtiger-string-error
  (let ((function (wiredtiger POINTER "wiredtiger_strerror" int)))
    (lambda (code)
      (pointer->string (function code)))))

(define (check code)
  (unless (eq? code 0)
    (throw 'wiredtiger (wiredtiger-string-error code))))

;;;
;;; Connection
;;;

(define-record-type* <*connection>
  pointer

  async-flush
  async-new-op
  close
  reconfigure
  get-home
  configure-method
  is-new
  open-session
  load-extension
  add-data-source
  add-collator
  add-compressor
  add-encryptor
  add-extractor
  get-extension-api)

(set-record-type-printer! <*connection>
                          (lambda (record port)
                            (format port
                                    "<connection 0x~x>"
                                    (pointer-address (*connection-pointer record)))))

(define-public connection-open
  (let ((function (wiredtiger int "wiredtiger_open" POINTER POINTER POINTER POINTER)))
    (lambda (home config)
      (let* (;; init a double pointer
             (pointer (u64vector 0))
             (double-pointer (bytevector->pointer pointer))
             ;; convert arguments to c types
             (home (string->pointer home))
             (config (string->pointer config)))
      ;; FIXME: add support for error_handler
      (check (function home NULL config double-pointer))
      (make make-*connection pointer 15)))))


(define*-public (connection-close connection #:optional (config ""))
  (let ((function (pointer->procedure* int (*connection-close connection) POINTER POINTER)))
    (let* (;; init a double pointer
           (pointer (u64vector 0))
           (double-pointer (bytevector->pointer pointer))
           ;; convert arguments to c types
           (config (string->pointer config)))
           ;; FIXME: add support for error_handler
      (check (function (*connection-pointer connection) config)))))

(define (make-collator format proc)
  (lambda (collator session key other cmp)
        (let* ((key (pointer->bytevector key 2 0 'u64))
                   (key (%wiredtiger-struct-unpack session (make-pointer (array-ref key 0)) (array-ref key 1) format)))
          (let* ((other (pointer->bytevector other 2 0 'u64))
                 (other (%wiredtiger-struct-unpack session (make-pointer (array-ref other 0)) (array-ref other 1) format)))
                        (s32vector-set! (pointer->bytevector cmp 1 0 's32) 0 (proc key other))
                        0))))

(define-public (connection-add-collator connection name format proc)
  "add PROC as a collator named NAME against CONNECTION"
  (let* ((function (pointer->procedure* int (*connection-add-collator connection) '* '* '* '*))
                 (collator (pointer-address (procedure->pointer int (make-collator format proc) (list '* '* '* '* '*))))
                 (collator (bytevector->pointer (u64vector collator 0 0))))
        (check (function (*connection-pointer connection)
                                         (string->pointer name)
                                         collator
NULL))))

;;;
;;; Session
;;;

(define-record-type* <*session>
  pointer

  connection
  app-private
  close
  reconfigure
  string-error
  cursor-open
  create
  compact
  drop
  join
  log-flush
  log-printf
  rebalance
  rename
  reset
  salvage
  truncate
  upgrade
  verify
  transaction-begin
  transaction-commit
  transaction-rollback
  checkpoint
  snapshot
  transaction-pinned-range
  transaction-sync)

(set-record-type-printer! <*session>
                          (lambda (record port)
                            (format port
                                    "<session 0x~x>"
                                    (pointer-address (*session-pointer record)))))

(define (session-string-error* session code)
  (let ((function (pointer->procedure* POINTER (*session-string-error session) POINTER int)))
    (pointer->string (function (*session-pointer session) code))))

(define (session-check session code)
  (unless (eq? code 0)
    (throw 'wiredtiger (session-string-error* session code))))

(define*-public (session-open connection #:optional (config ""))
  (let ((function (pointer->procedure* int (*connection-open-session connection) POINTER POINTER POINTER POINTER)))
    (let* (;; init a double pointer
           (pointer (u64vector 0))
           (double-pointer (bytevector->pointer pointer))
           ;; convert arguments to c types
           (config (string->pointer config))
           ;; call the foreign function
           ;; FIXME: add support for error_handler
           (code (function (*connection-pointer connection) NULL config double-pointer)))
      (check  code)
      (make make-*session pointer 26))))

(define-public (session-create session name config)
  (let ((function (pointer->procedure* int (*session-create session) POINTER POINTER POINTER)))
    (let* (;; convert arguments to c types
           (name (string->pointer name))
           (config (string->pointer config))
           ;; call the foreign function
           (code (function (*session-pointer session) name config)))
      (session-check session code))))

(define-public (session-close session)
  (let ((function (pointer->procedure* int (*session-close session) POINTER)))
    (session-check session (function (*session-pointer session)))))

(define*-public (session-transaction-begin session #:optional (config ""))
  (let ((function (pointer->procedure* int (*session-transaction-begin session) POINTER POINTER)))
    (session-check session (function (*session-pointer session) (string->pointer config)))))

(define*-public (session-transaction-commit session #:optional (config ""))
  (let ((function (pointer->procedure* int (*session-transaction-commit session) POINTER POINTER)))
    (session-check session (function (*session-pointer session) (string->pointer config)))))

(define*-public (session-transaction-rollback session #:optional (config ""))
  (let ((function (pointer->procedure* int (*session-transaction-rollback session) POINTER POINTER)))
    (session-check session (function (*session-pointer session) (string->pointer config)))))

;;;
;;; Cursor
;;;

(define-record-type* <*cursor>
  pointer
  session
  config

  session-pointer
  uri
  key-format
  value-format
  key-ref
  value-ref
  key-set
  value-set
  compare
  equals
  next
  previous
  reset
  search
  search-near
  insert
  update
  remove
  close
  reconfigure
  ;; XXX: other fields are defined in the header
  ;;      those are only useful to implement a new cursor type
  ;;      and as such are not part the record
  )

(set-record-type-printer! <*cursor>
                          (lambda (record port)
                            (format port
                                    "<cursor 0x~x uri=~s key=~s value=~s config=~s>"
                                    (pointer-address (*cursor-pointer record))
                                    (pointer->string (*cursor-uri record))
                                    (cursor-key-format record)
                                    (cursor-value-format record)
                                    (*cursor-config record))))

(define (cursor-key-format cursor)
  ;; FIXME: cache this value
  (pointer->string (*cursor-key-format cursor)))

(define (cursor-value-format cursor)
  ;; FIXME: cache this value
  (pointer->string (*cursor-value-format cursor)))

(define*-public (cursor-open session uri #:optional (config ""))
  (let ((function (pointer->procedure* int (*session-cursor-open session) POINTER POINTER POINTER POINTER POINTER)))
    (let* (;; init a double pointer
           (pointer (u64vector 0))
           (double-pointer (bytevector->pointer pointer))
           ;; call the foreign function
           (code (function (*session-pointer session)
                           (string->pointer uri)
                           NULL
                           (string->pointer config)
                           double-pointer)))
      (session-check session code)
      ;; make cursor record
      (let* ((size 20)
             (pointer (make-pointer (array-ref pointer 0)))
             (struct (map make-pointer (u64vector->list (pointer->bytevector pointer size 0 'u64)))))
        (apply make-*cursor (cons* pointer session config struct))))))

;;; key/value set/ref

(define (item->string bv)
  (pointer->string (make-pointer (array-ref bv 0))))

(define (item->integer bv)
  (array-ref bv 0))

(define (item->signed-integer bv)
  (s32vector-ref bv 0))

(define (item->bytes item)
  (let* ((size (s32vector-ref item 2)))
    (if (zero? size)
        #vu8()
        (bytevector-copy (pointer->bytevector (make-pointer (s64vector-ref item 0)) size 0 'u8)))))

(define *item->value* `((#\S . ,item->string)
                        (#\Q . ,item->integer)
                        (#\q . ,item->signed-integer)
                        (#\r . ,item->integer)
                        (#\u . ,item->bytes)))

(define (pointers->scm formats pointers)
  (let loop ((formats (string->list formats))
             (pointers pointers)
             (out '()))
    (cond
     ((and (null? formats) (null? pointers)) out)
     ((or (null? formats) (null? pointers))
      (throw 'wiredtiger "failed to ref cursor value due to format error"))
     (else (loop (cdr formats)
                 (cdr pointers)
                 (append out (list ((assoc-ref *item->value* (car formats)) (car pointers)))))))))

(define-public (cursor-key-ref cursor)
  (let* ((args (map (lambda _ (u64vector 0 0)) (string->list (cursor-key-format cursor))))
         (args* (cons (*cursor-pointer cursor) (map bytevector->pointer args)))
         (signature (map (lambda _ POINTER) args*))
         (function (pointer->procedure int (*cursor-key-ref cursor) signature)))
    (apply function args*)
    (pointers->scm (cursor-key-format cursor) args)))

(define-public (cursor-value-ref cursor)
  (let* ((args (map (lambda _ (u64vector 0 0)) (string->list (cursor-value-format cursor))))
         (args* (cons (*cursor-pointer cursor) (map bytevector->pointer args)))
         (signature (map (lambda _ POINTER) args*))
         (function (pointer->procedure int (*cursor-value-ref cursor) signature)))
    (apply function args*)
    (pointers->scm (cursor-value-format cursor) args)))

;;; set procedures

(define make-string-pointer
  (compose bytevector->pointer
           (cut string->bytevector <> "utf-8")
           (cut string-append <> "\0")))

(define (make-bytes-item bv)
  (let ((item  (u64vector 0 0)))
    (s32vector-set! item 2 (bytevector-length bv))
    (u64vector-set! item 0 (pointer-address (bytevector->pointer bv)))
    (bytevector->pointer item)))


(define *format->pointer* `((#\S . ,make-string-pointer)
                            (#\Q . ,make-pointer)
                            (#\q . ,make-pointer)
                            (#\r . ,make-pointer)
                            (#\u . ,make-bytes-item)))

(define (formats->items formats values)
  (let loop ((formats (string->list formats))
             (values values)
             (out '()))
    (cond
     ((and (null? formats) (null? values)) out)
     ((or (null? formats) (null? values))
      (throw 'wiredtiger "failed to set cursor due to format error"))
     (else (loop (cdr formats)
                 (cdr values)
                 (append out (list ((assoc-ref *format->pointer* (car formats)) (car values)))))))))

(define-public (cursor-key-set cursor . key)
  (let* ((args (cons (*cursor-pointer cursor) (formats->items (cursor-key-format cursor) key)))
         (signature (map (lambda ignore POINTER) args))
         (function (pointer->procedure int (*cursor-key-set cursor) signature)))
    (apply function args)))

(define-public (cursor-value-set cursor . value)
  (let* ((args (cons (*cursor-pointer cursor) (formats->items (cursor-value-format cursor) value)))
         (signature (map (lambda ignore POINTER) args))
         (function (pointer->procedure int (*cursor-value-set cursor) signature)))
    (apply function args)))

(define-public (cursor-reset cursor)
  (let ((function (pointer->procedure* int (*cursor-reset cursor) POINTER)))
    (session-check (*cursor-session cursor) (function (*cursor-pointer cursor)))))

(define-public (cursor-next cursor)
  (let ((function (pointer->procedure* int (*cursor-next cursor) POINTER)))
    (session-check (*cursor-session cursor) (function (*cursor-pointer cursor)))))

(define-public (cursor-previous cursor)
  (let ((function (pointer->procedure* int (*cursor-previous cursor) POINTER)))
    (session-check (*cursor-session cursor) (function (*cursor-pointer cursor)))))

(define-public (cursor-search cursor)
  (let ((function (pointer->procedure* int (*cursor-search cursor) POINTER)))
    (session-check (*cursor-session cursor) (function (*cursor-pointer cursor)))))

(define-public (cursor-search-near cursor)
  (let ((function (pointer->procedure* int (*cursor-search-near cursor) POINTER POINTER)))
    (let* ((integer (s32vector 0))
           (pointer (bytevector->pointer integer)))
      (session-check (*cursor-session cursor) (function (*cursor-pointer cursor) pointer))
      (s32vector-ref integer 0))))

(define-public (cursor-insert cursor)
  (let ((function (pointer->procedure* int (*cursor-insert cursor) POINTER)))
    (session-check (*cursor-session cursor) (function (*cursor-pointer cursor)))))

(define-public (cursor-update cursor)
  (let ((function (pointer->procedure* int (*cursor-update cursor) POINTER)))
    (session-check (*cursor-session cursor) (function (*cursor-pointer cursor)))))

(define-public (cursor-remove cursor)
  (let ((function (pointer->procedure* int (*cursor-remove cursor) POINTER)))
    (session-check (*cursor-session cursor) (function (*cursor-pointer cursor)))))

(define-public (cursor-close cursor)
  (let ((function (pointer->procedure* int (*cursor-close cursor) POINTER)))
    (session-check (*cursor-session cursor) (function (*cursor-pointer cursor)))))

(define-syntax-rule (with-cnx connection e ...)
  (let ((cnx connection)
        (out (begin e ...)))
    (connection-close cnx)
    out))

(export with-cnx)

;;; helpers

(define (wiredtiger-struct-size session format . args)
  (let* ((size (u64vector 0))
         (items (formats->items format args))
         (signature (map (lambda _ '*) args))
         (function (apply wiredtiger (cons* int "wiredtiger_struct_size" '* '* '* signature))))
    (check (apply function (cons* (*session-pointer session)
                                  (bytevector->pointer size)
                                  (string->pointer format)
                                  items)))
    (u64vector-ref size 0)))

(define-public (wiredtiger-struct-pack session format . args) 
  (let* ((size (apply wiredtiger-struct-size (cons* session format args)))
         (buffer (apply u8vector (iota size)))
         (items (formats->items format args))
         (signature (map (lambda _ '*) args))
         (function (apply wiredtiger (cons* int "wiredtiger_struct_pack" '* '* size_t '* signature))))
    (check (apply function (cons* (*session-pointer session)
                                  (bytevector->pointer buffer)
                                  size
                                  (string->pointer format)
                                  items)))
    buffer))
    
(define (%wiredtiger-struct-unpack session buffer size format)
  ;; session must the raw pointer, not the record
  (let* ((args (map (lambda _ (u64vector 0)) (string->list format)))
                 (args* (map bytevector->pointer args))
                 (signature (map (lambda _ POINTER) args*))
                 (function (apply wiredtiger (cons* int "wiredtiger_struct_unpack" '* '* size_t '* signature))))
        (apply function (cons* session buffer size (string->pointer format) args*))
(pointers->scm format args)))

(define-public (wiredtiger-struct-unpack session buffer format)
  (%wiredtiger-struct-unpack (*session-pointer session)
                             (bytevector->pointer buffer)
                             (bytevector-length buffer)
                             format))
;;; tests

(use-modules (test-check))

(when (or (getenv "CHECK") (getenv "CHECK_WIREDTIGER"))
  (format #true "* testing wiredtiger\n")

  (test-check "create and close database"
              (with-cnx (connection-open "/tmp/wt" "create") #true)
              #true)

  (test-check "table with index, insert and index value"
              (let* ((cnx (connection-open "/tmp/wt" "create"))
                     (session (session-open cnx)))
                ;; create a table
                (session-create session "table:nodes" "key_format=Q,value_format=SS,columns=(a,b,c)")
                (session-create session "index:nodes:index" "columns=(b,c)")
                ;; open a cursor over that table
                (let ((cursor (cursor-open session "table:nodes")))
                  (session-transaction-begin session "isolation=\"snapshot\"")
                  (cursor-key-set cursor 42)
                  (cursor-value-set cursor "a" "b")
                  (cursor-insert cursor)
                  (session-transaction-commit session)
                  (let ((index (cursor-open session "index:nodes:index(a)")))
                    (cursor-next index)
                    (with-cnx cnx
                      (list (cursor-key-ref index) (cursor-value-ref index))))))
              (list (list "a" "b") (list 42)))

  (test-check "cursor search on empty table"
              (let* ((cnx (connection-open "/tmp/wt" "create"))
                     (session (session-open cnx)))
                ;; create a table
                (session-create session "table:nodes" "key_format=Q,value_format=SS,columns=(a,b,c)")
                ;; open a cursor over that table
                (let ((cursor (cursor-open session "table:nodes")))
                  (cursor-key-set cursor 42)
                  (with-cnx cnx
                    (catch 'wiredtiger
                      (lambda () (cursor-search cursor) #false)
                      (lambda (key value) #true)))))
              #true)

  (test-check "cursor search"
              (let* ((cnx (connection-open "/tmp/wt" "create"))
                     (session (session-open cnx)))
                ;; create a table
                (session-create session "table:nodes" "key_format=Q,value_format=SS,columns=(a,b,c)")
                ;; open a cursor over that table
                (let ((cursor (cursor-open session "table:nodes")))
                  (cursor-key-set cursor 42)
                  (cursor-value-set cursor "b" "c")
                  (cursor-insert cursor)
                  (cursor-key-set cursor 42)
                  (with-cnx cnx
                    (catch #true
                      (lambda () (cursor-search cursor) #true)
                      (lambda _ #false)))))
              #true)

  (test-check "cursor search near on empty table"
              (let* ((cnx (connection-open "/tmp/wt" "create"))
                     (session (session-open cnx)))
                (session-create session "table:nodes" "key_format=Q,value_format=S,columns=(a,b)")
                (let ((cursor (cursor-open session "table:nodes")))
                  (cursor-key-set cursor 42)
                  (with-cnx cnx
                            (catch #true
                              (lambda () (cursor-search-near cursor) #false)
                              (lambda _ #true)))))
              #true)

  (test-check "cursor search near below"
              (let* ((cnx (connection-open "/tmp/wt" "create"))
                     (session (session-open cnx)))
                (session-create session "table:nodes" "key_format=Q,value_format=S,columns=(a,b)")
                (let ((cursor (cursor-open session "table:nodes")))
                  ;; prepare
                  (cursor-key-set cursor 42)
                  (cursor-value-set cursor "magic number")
                  (cursor-insert cursor)
                  ;; test
                  (cursor-key-set cursor 43)
                  (with-cnx cnx
                    (cursor-search-near cursor))))
              -1)

  (test-check "cursor search near above"
              (let* ((cnx (connection-open "/tmp/wt" "create"))
                     (session (session-open cnx)))
                (session-create session "table:nodes" "key_format=Q,value_format=S,columns=(a,b)")
                (let ((cursor (cursor-open session "table:nodes")))
                  ;; prepare
                  (cursor-key-set cursor 41)
                  (cursor-value-set cursor "another number")
                  (cursor-insert cursor)
                  (cursor-key-set cursor 42)
                  (cursor-value-set cursor "magic number")
                  (cursor-insert cursor)
                  (cursor-key-set cursor 45)
                  (cursor-value-set cursor "random number")
                  (cursor-insert cursor)
                  ;; test
                  (cursor-key-set cursor 43)
                  (with-cnx cnx
                    (< 0 (cursor-search-near cursor)))))
              #true)

  (test-check "cursor search near exact match"
              (let* ((cnx (connection-open "/tmp/wt" "create"))
                     (session (session-open cnx)))
                (session-create session "table:nodes" "key_format=Q,value_format=S,columns=(a,b)")
                (let ((cursor (cursor-open session "table:nodes")))
                  ;; prepare
                  (cursor-key-set cursor 41)
                  (cursor-value-set cursor "another number")
                  (cursor-insert cursor)
                  (cursor-key-set cursor 42)
                  (cursor-value-set cursor "magic number")
                  (cursor-insert cursor)
                  (cursor-key-set cursor 45)
                  (cursor-value-set cursor "random number")
                  (cursor-insert cursor)
                  ;; test
                  (cursor-key-set cursor 42)
                  (with-cnx cnx
                    (cursor-search-near cursor))))
              0)

  (test-check "record table, insert and retrieve key"
              (let* ((cnx (connection-open "/tmp/wt" "create"))
                     (session (session-open cnx)))
                (session-create session "table:terms" "key_format=r,value_format=S")
                (let ((cursor (cursor-open session "table:terms" "append")))
                  (cursor-value-set cursor "timesink")
                  (cursor-insert cursor)
                  (with-cnx cnx (car (cursor-key-ref cursor)))))
              1)

  (test-check "raw item, insert and retrieve value"
              (let* ((cnx (connection-open "/tmp/wt" "create"))
                     (session (session-open cnx)))
                (session-create session "table:terms" "key_format=r,value_format=u")
                (let ((cursor (cursor-open session "table:terms" "append")))
                  (cursor-value-set cursor (u8vector 1 2 3 4))
                  (cursor-insert cursor)
                  (cursor-reset cursor)
                  (cursor-next cursor)
                  (with-cnx cnx
                    (cursor-value-ref cursor))))
              '(#vu8(1 2 3 4)))

  (test-check "wiredtiger-struct-size"
              (let* ((cnx (connection-open "/tmp/wt" "create"))
                     (session (session-open cnx)))
                (with-cnx cnx
                  (wiredtiger-struct-size session "qQS" 1 42 "héllo")))
              '9)

  (test-check "wiredtiger-struct-pack/unpack"
    (let* ((cnx (connection-open "/tmp/wt" "create"))
           (session (session-open cnx)))
      (with-cnx cnx
        (wiredtiger-struct-unpack session 
                                  (wiredtiger-struct-pack session "qQS" 1 42 "héllo")
                                  "qQS")))
    '(1 42 "héllo"))

  ;; (test-check "create table with scheme collator"
  ;;   (receive (cnx ctx) (wiredtiger-open* "/tmp/wt" '(table
  ;;                                                    ((key . record))
  ;;                                                    ((scheme . string))
  ;;                                                    ((reversed (scheme) (key)))))

  ;;     (connection-add-collator cnx "ci" "Sr" (lambda (key other)
  ;;                                              (if (string=? (car key) (car other))
  ;;                                                  0
  ;;                                                  (if (string-ci<? (car key) (car other)) -1 1))))
  ;;     (session-create session "table:terms" "key_format=r,value_format=S,columns=(a,b)")
  ;;     (session-create session "index:terms:reversed" "columns=(b),collator=ci")

  ;;     (let ((cursor (cursor-open session "table:terms" "append")))
  ;;       (cursor-value-set cursor "a")
  ;;       (cursor-insert cursor)
  ;;       (cursor-value-set cursor "A")
  ;;       (cursor-insert cursor)
  ;;       (cursor-value-set cursor "b")
  ;;       (cursor-insert cursor)
  ;;       (cursor-value-set cursor "B")
  ;;       (cursor-insert cursor))

  ;;     (let ((cursor (cursor-open session "index:terms:reversed")))
  ;;       (with-cnx cnx
  ;;         (let loop ((next? (cursor-next cursor))
  ;;                    (out '()))
  ;;           (if next?
  ;;               (let ((key (cursor-key-ref cursor)))
  ;;                 (loop (catch 'wiredtiger
  ;;                         (lambda () (cursor-next cursor) #true)
  ;;                         (lambda ignore #false))
  ;;                       (cons key out)))
  ;;               out)))))
  ;;   '(("B") ("b") ("A") ("a")))
  ;; (test-check "create table with collator"
  ;;   (let* ((cnx (connection-open "/tmp/wt" "create"))
  ;;          (session (session-open cnx)))
  ;;     (connection-add-collator cnx "ci" "Sr" (lambda (key other)
  ;;                                              (if (string=? (car key) (car other))
  ;;                                                  0
  ;;                                                  (if (string-ci<? (car key) (car other)) -1 1))))
  ;;     (session-create session "table:terms" "key_format=r,value_format=S,columns=(a,b)")
  ;;     (session-create session "index:terms:reversed" "columns=(b),collator=ci")

  ;;     (let ((cursor (cursor-open session "table:terms" "append")))
  ;;       (cursor-value-set cursor "a")
  ;;       (cursor-insert cursor)
  ;;       (cursor-value-set cursor "A")
  ;;       (cursor-insert cursor)
  ;;       (cursor-value-set cursor "b")
  ;;       (cursor-insert cursor)
  ;;       (cursor-value-set cursor "B")
  ;;       (cursor-insert cursor))
  ;;     (let ((cursor (cursor-open session "index:terms:reversed")))
  ;;       (with-cnx cnx
  ;;         (let loop ((next? (cursor-next cursor))
  ;;                    (out '()))
  ;;           (if next?
  ;;               (let ((key (cursor-key-ref cursor)))
  ;;                 (loop (catch 'wiredtiger
  ;;                         (lambda () (cursor-next cursor) #true)
  ;;                         (lambda ignore #false))
  ;;                       (cons key out)))
  ;;               out)))))
  ;;   '(("B") ("b") ("A") ("a")))
  )

[-- Attachment #5: wiredtigerz.scm --]
[-- Type: text/plain, Size: 29105 bytes --]

;; guile-wiredtiger - 0.4 - 2016/07/15

;; Copyright © 2014-2016 Amirouche BOUBEKKI <amirouche@hypermove.net>

;; guile-wiredtiger is free software: you can redistribute it and/or modify
;; it under the terms of the GNU General Public License as published by
;; the Free Software Foundation, either version 2 of the License, or
;; (at your option) or version 3.

;; guile-wiredtiger is distributed in the hope that it will be useful,
;; but WITHOUT ANY WARRANTY; without even the implied warranty of
;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
;; GNU General Public License for more details.

;; You should have received a copy of the GNU General Public License
;; along with guile-wiredtiger.  If not, see <http://www.gnu.org/licenses/>

;;; Comment:
;;
;; Tested with wiredtiger develop branch
;;
(define-module (wiredtigerz))

(use-modules (ice-9 match))
(use-modules (ice-9 threads))
(use-modules (ice-9 receive))
(use-modules (ice-9 optargs))

(use-modules (srfi srfi-1))  ;; append-map
(use-modules (srfi srfi-9))  ;; records
(use-modules (srfi srfi-9 gnu))  ;; set-record-type-printer!
(use-modules (srfi srfi-26))  ;; cut

(use-modules (wiredtiger))


;;;
;;; plain records
;;;
;;
;; macro to quickly define records
;;
(define-syntax define-record-type*
  (lambda (x)
    (define (%id-name name) (string->symbol (string-drop (string-drop-right (symbol->string name) 1) 1)))
    (define (id-name ctx name)
      (datum->syntax ctx (%id-name (syntax->datum name))))
    (define (id-append ctx . syms)
      (datum->syntax ctx (apply symbol-append (map syntax->datum syms))))
    (define (id-append! ctx . syms)
      (datum->syntax ctx (symbol-append (apply symbol-append (map syntax->datum syms)) '!)))
    (syntax-case x ()
      ((_ rname field ...)
       (and (identifier? #'rname) (and-map identifier? #'(field ...)))
       (with-syntax ((cons (id-append #'rname #'make- (id-name #'rname #'rname)))
                     (pred (id-append #'rname (id-name #'rname #'rname) #'?))
                     ((getter ...) (map (lambda (f)
                                          (id-append f (id-name #'rname #'rname) #'- f))
                                        #'(field ...)))
                     ((setter ...) (map (lambda (f)
                                          (id-append! f (id-name #'rname #'rname) #'- f))
                                        #'(field ...))))
         #'(define-record-type rname
             (cons field ...)
             pred
             (field getter setter)
             ...))))))

;;; helpers

(define-public (scm->string scm)
  (call-with-output-string
    (lambda (port)
      (write scm port))))

(define-public (string->scm string)
  (call-with-input-string string read))

;;;
;;; wiredtigerz try to explicit main wiredtiger workflow
;;;
;;
;; This modules defines a star procedure version of wiredtiger procedure
;; to help jump into wiredtiger making the main workflow more explicit.
;;
;; The main workflow is the following:
;;
;; 1. define a some table and indices
;; 2. open session per thread
;; 3. open a single cursor per table and indices
;; 4. forward cursor navigation
;;
;; In theory you might open multiple cursors for the same table/index but this
;; leads to extra bookeeping for which I have no good reason to apply.
;;
;; The API defined as star procedure try to remains thin on top of wiredtiger
;; so that you can drop to raw wiredtiger when required, like open multiple cursors.
;;
;; This introduce a declarative API (described below) that both defines the tables
;; and the cursor in single block of code which must be used with `session-create*`
;; and `cursor-open*` which do the job described in 1. and 2.
;;
;; Also wiredtiger mainly rely on statefull API where the cursor is first configured with
;; cursor-key-set and  then an operation is executed on it like cursor-search or
;; cursor-remove. This leaves the door open for many workflows while keeping each
;; procedure signature simple.
;;
;; The purpose of the proposed (star) procedures is to simplify user code by covering
;; the main workflow (I've encountered) while leaving aside some performance concerns.
;;

;;;
;;; Declarative api
;;;
;;
;; Declare the layout of the database and its cursors using list and symbols
;; Here is two example configurations:
;;
;;   (define atoms '(atoms
;;                   ((uid . record))
;;                   ((assoc . raw))
;;                   ()))
;;
;;   (define arrows '(arrows
;;                    ((key . record))
;;                    ((start . unsigned-integer)
;;                     (end . unsigned-integer))
;;                    ;; index
;;                    '((outgoings (start) (uid end))
;;                      (incomings (end) (uid start)))))
;;
;; The format can be described as follow:
;;
;; (table-name
;;  (key assoc as (column-name . column-type))
;;  (value assoc as (column-name . column-type))
;;  (indices as (indexed name (indexed keys) (projection as column names))))
;;
;;
;; If there is no indices, the field MUST NOT be omited but replaced with an empty list
;;
;; The configuration can be used in (session-create* session . configs) to create
;; the tables and indices.
;;
;; And then in (cursor-open* session . configs) to all the table and indices cursors.
;;
;; A <context> record exists which should be associated with a thread. It encapsulates
;; a <session> and cursors.
;; A <context> can be created with (context-open connection . config).
;; Shortcuts exists to execute transaction against a context directly.
;;

;; utils for declarative configuration

(define-record-type* <config> name key value indices)
(define-record-type* <index> name keys values)

;; FIXME: some types are missing
(define (symbol->config symbol)
  (assoc-ref '((record . "r")
               (string . "S")
               (unsigned-integer . "Q")
               (positive-integer . "Q")
               (integer . "q")
               (bytes . "u"))
             symbol))


;;; define session-create*

(define-public (session-create* session . configs)
  ;; XXX: here instead of using `session-create` downstream
  ;; we wait for `session-create` arguments instead.
  ;; This makes the code easier to test...
  (define (create args)
    (apply session-create (cons session args)))
  ;; prepare arguments for every config and apply them
  (for-each create (append-map config-prepare-create configs)))

(define-public (config-prepare-create config)
  ;; a config generate one table and maybe several indices
  (cons (config-prepare-create-table config)
        (config-prepare-create-indices config)))

(define (config-prepare-create-table config)
  ;; transform declarative api into a session-create arguments
  (define (symbols->config symbols)
    (string-concatenate (map (cut symbol->config <>) symbols)))

  (define (symbols->columns symbols)
    (string-join (map (cut symbol->string <>) symbols) ","))

  (let* ((config (apply make-config config))
         (name (string-append "table:" (symbol->string (config-name config))))
         (key (symbols->config (map cdr (config-key config))))
         (value (symbols->config (map cdr (config-value config))))
         (columns (append (config-key config) (config-value config)))
         (columns (symbols->columns (map car columns)))
         (config (format #false
                         "key_format=~a,value_format=~a,columns=(~a)"
                         key value columns)))
    (list name config)))

(define (config-prepare-create-indices config)
  ;; one config may have multiple indices
  (let ((config (apply make-config config)))
    (map (config-prepare-create-index (config-name config)) (config-indices config))))

(define (config-prepare-create-index name)
  ;; convert declarative configuration to session-create arguments
  (define (symbols->columns symbols)
    (string-join (map (cut symbol->string <>) symbols) ","))

  (lambda (index)
    (let* ((index (apply make-index index))
           (name (string-append "index:" (symbol->string name) ":" (symbol->string (index-name index))))
           (columns (format #false "columns=(~a)" (symbols->columns (index-keys index)))))
      (list name columns))))

;;;
;;; define cursor-open*
;;;
;;
;; open cursor for every table and indices in an assoc where the key is
;; the table name for main cursor, '-append prefixed with the name of table
;; for the append cursor when applicable and the name index prefixed with
;; the name of the table.
;; cursor-open* will automatically create a 'append' cursor for tables
;; that have single record column.
;;

(define-public (cursor-open* session . configs)
  "Open cursors against SESSION using CONFIGS and return an assoc
with cursor symbols as key and cursors as value"
  ;; XXX: just like session-open* we expect cursor-open arguments
  ;; but this time we return an assoc made of ('cursor-name . cursor)
  (define (open name+args)
    (cons (car name+args) (apply cursor-open (cons session (cadr name+args)))))
  ;; prepare arguments for every config and apply them
  (map open (append-map config-prepare-open configs)))

(define (config-prepare-open config)
  (append (config-prepare-cursor-open config)
          (config-prepare-cursor-append-open config)
          (config-prepare-cursor-indices-open config)))

(define (config-prepare-cursor-open config)
  (let* ((config (apply make-config config))
         (name (config-name config)))
    ;; XXX: config-prepare-open expect a list of cursor-open arguments
    (list (list name (list (format #false "table:~a" name))))))

(define (config-prepare-cursor-append-open config)
  (define (key-is-record? key)
    (and (eq? (length key) 1) (eq? (cdar key) 'record)))
  (let* ((config (apply make-config config))
         (name (config-name config))
         (cursor-name (symbol-append name '-append)))
    (if (key-is-record? (config-key config))
        ;; add a append cursor over the table
        ;; XXX: config-prepare-open expect a list of cursor-open arguments
        (list (list cursor-name (list (format #false "table:~a" name) "append")))
        ;; no cursor is required
        (list))))

(define (config-prepare-cursor-indices-open config)
  (let ((config (apply make-config config)))
    (map (config-prepare-cursor-index-open (config-name config)) (config-indices config))))

(define (config-prepare-cursor-index-open name)
  (define (symbols->columns symbols)
    (string-join (map (cut symbol->string <>) symbols) ","))

  (lambda (index)
    (let* ((index (apply make-index index))
           (columns (symbols->columns (index-values index)))
           (cursor-name (symbol-append name '- (index-name index))))
      (if (equal? columns "")
          (list cursor-name
                (list (format #false "index:~a:~a" name (index-name index))))
          (list cursor-name
                (list (format #false "index:~a:~a(~a)" name (index-name index) columns)))))))
;;;
;;; <env>
;;;
;;
;; An environment contains the configuration of a given database and
;; its contexts. It's threadsafe.
;;

(define-record-type* <env> connection configs contexts mutex)

(define *context* (make-unbound-fluid))

(define-public (env-open path)
  (make-env (connection-open path "create") '() '() (make-mutex)))

(define-public (env-close env)
  (connection-close (env-connection env)))

(define-public (env-config-add env config)
  (env-configs! env (cons config (env-configs env))))

(define-public (env-open* path configs)
  (let ((env (env-open path)))
    (for-each (cut env-config-add env <>) configs)
    (env-create env)
    env))

(define-public (env-create env)
  (let* ((connection (env-connection env))
         (session (session-open connection)))
    (apply session-create* (cons session (env-configs env)))
    (session-close session)))
  
(define (get-or-create-context env)
  (with-mutex (env-mutex env)
    (let ((contexts (env-contexts env)))
      (if (null? contexts)
          ;; create a new context
          ;; XXX: the number of active context is unbound
          (apply context-open (cons (env-connection env) (env-configs env)))
          ;; re-use an existing context
          (let ((context (car contexts)))
            (env-contexts! env (cdr contexts))
            context)))))

(define-syntax-rule (with-context env body ...)
  (let ((env* env))
    ;; get or create a context and set it as current *context* value
    (let ((context (get-or-create-context env*)))
      (with-fluids ((*context* context))
        ;; execute the body
        (let ((out (begin body ...)))
          ;; push back the context to the context pool
          (with-mutex (env-mutex env*)
            (env-contexts! env* (cons context (env-contexts env*))))
          out)))))

(export with-context)

(define-syntax-rule (with-env env e ...)
  (let* ((env* env)
         (out (with-context env* e ...)))
    (env-close env*)
    out))

(export with-env)

;;;
;;; <context>
;;;
;;
;; A session and cursors assoc
;;

(define-record-type* <context> session cursors)

(export context-session)
(export context-cursors)

(define-public (context-open connection . configs)
  "Open a context using CONNECTION with CONFIGS"
  (let* ((session (session-open connection))
         (cursors (apply cursor-open* (cons session configs))))
    (make-context session cursors)))

(define-public (wiredtiger-open* path . configs)
  "Open or create a database at PATH with CONFIGS and return
a two values: the connection and a context"
  (let* ((connection (connection-open path "create"))
         (session (session-open connection)))
    (apply session-create* (cons session configs))
    (values connection (make-context session (apply cursor-open* (cons session configs))))))

(define-public (context-ref context name)
  "Return the cursor associated with NAME in CONTEXT"
  (assoc-ref (context-cursors context) name))

(define-public (context-begin context)
  "Start a transaction against CONTEXT"
  (session-transaction-begin (context-session context)))

(define-public (context-commit context)
  "Commit transaction against CONTEXT"
  (session-transaction-commit (context-session context)))

(define-public (context-rollback context)
  "Rollback transaction against CONTEXT"
  (session-transaction-rollback (context-session context)))

(define-syntax-rule (with-transaction e ...)
  (catch #true
    (lambda ()
      (context-begin (fluid-ref *context*))
      (let ((out (begin e ...)))
        (context-commit (fluid-ref *context*))
        out))
    (lambda (key . args)
      (context-rollback (fluid-ref *context*))
      (apply throw (cons key args)))))

(export with-transaction)

;;;
;;; Cursor navigation
;;;
;;
;; Quickly operate on cursors
;;

;; helper for reseting cursors after doing some operations
;; @@@: emacs: (put 'with-cursor 'scheme-indent-function 1)
(define-syntax-rule (with-cursor cursor e ...)
  (let ((out (begin e ...)))
    (cursor-reset cursor)
    out))

(export with-cursor)

(define-syntax-rule (call-with-cursor name proc)
  (let* ((context (fluid-ref *context*))
         (cursor (context-ref context name)))
    (let ((out (proc cursor)))
      (cursor-reset cursor)
        out)))

(export call-with-cursor)

(define-public (cursor-next* cursor)
  "Move the cursor to the next result and return #t.
Return #f if there is no next result (end of the table).

This procedure return a boolean instead of throwing an exception"
  (catch 'wiredtiger
    (lambda () (cursor-next cursor) #t)
    (lambda _ #f)))

(define-public (cursor-previous* cursor)
  "Move the cursor to the previous result and return #t.
Return #f if there is no previous result (end of the table).

This procedure return a boolean instead of throwing an exception"
  (catch 'wiredtiger
    (lambda () (cursor-previous cursor) #t)
    (lambda _ #f)))

(define-public (cursor-debug cursor)
  "Prints the whole table starting at cursor position"
  (pk 'cursor-debug cursor)
  (with-cursor cursor
    (let loop ((next (cursor-next* cursor)))
      (when next
        (pk (cursor-key-ref cursor) (cursor-value-ref cursor))
        (loop (cursor-next* cursor))))))

(define-public (cursor-value-ref* cursor . key)
  "Search KEY and return the associated value.
Throw a 'wiredtiger error if the key is not found"
  (with-cursor cursor
    (apply cursor-search* (cons cursor key))
    (cursor-value-ref cursor)))

(define-public (cursor-insert* cursor key value)
  "Insert using CURSOR KEY and VALUE. If KEY is null,
the key will not be set and the key assigned by wiredtiger
will be returned. This is useful in the case of a table with
a single record key column."
  (unless (null? key)
    (apply cursor-key-set (cons cursor key)))
  (apply cursor-value-set (cons cursor value))
  (cursor-insert cursor)
  (when (null? key)  ;; used with single record key column
    (car (cursor-key-ref cursor))))

(define-public (cursor-update* cursor key value)
  "Update KEY with VALUE using CURSOR"
  (apply cursor-key-set (cons cursor key))
  (apply cursor-value-set (cons cursor value))
  (cursor-update cursor))

(define-public (cursor-remove* cursor . key)
  "Remove row having KEY as key using CURSOR"
  (apply cursor-key-set (cons cursor key))
  (cursor-remove cursor))

(define-public (cursor-search* cursor . key)
  "Position the  cursor at KEY. Throw a 'wiredtiger error
if KEY is not found"
  (apply cursor-key-set (cons cursor key))
  (cursor-search cursor))

(define-public (cursor-search-near* cursor . key-prefix)
  "Search near KEY-PREFIX on CURSOR"
  (apply cursor-key-set (cons cursor key-prefix))
  (cursor-search-near cursor))

;; cursor-range

(define-public (cursor-range cursor . key)
  "Return a list made of the values taken by rows having KEY as key"
  (with-cursor cursor
    (catch 'wiredtiger
      (lambda ()
        (apply cursor-search* (cons cursor key))
        (let loop ((out (list (cursor-value-ref cursor))))
          (if (cursor-next* cursor)
              (if (equal? (cursor-key-ref cursor) key)
                  (loop (cons (cursor-value-ref cursor) out))
                  out)
              out)))
      (lambda _ '()))))

;; cursor-range-prefix

(define-public (prefix? prefix other)
  "Return #true if OTHER has KEY as prefix"
  ;; filter "empty" values from the key
  (define (empty? x) (or (eq? x 0) (equal? x "") (eq? x #vu8())))
  (define (predicate? a b) (not (or (empty? a) (equal? a b))))
  (not (any predicate? prefix other)))

(define-public (cursor-range-prefix cursor . key-prefix)
  "Return CURSOR range association where keys match PREFIX"
  (define (next?)
    (catch 'wiredtiger
      (lambda ()
        (cursor-next cursor)
        (prefix? key-prefix (cursor-key-ref cursor)))
      (lambda (key . args)
        #false)))
  (with-cursor cursor
    (let ((code (catch 'wiredtiger
                  (lambda () (apply cursor-search-near* (cons cursor key-prefix)))
                  (lambda (key . args) #f))))
      (if code
          (if (or (eq? code 0) (eq? code 1) (and (eq? code -1) (cursor-next* cursor)))
              (let loop ((out (list))
                         (valid? (prefix? key-prefix (cursor-key-ref cursor))))
                (if valid?
                    (loop (acons (cursor-key-ref cursor) (cursor-value-ref cursor) out)
                          (next?))
                    out))
              '())
          '()))))

(define-public (cursor-count-prefix cursor . key-prefix)
  ;; FIXME: only counting keys would improve performance
  (length (apply cursor-range-prefix (cons cursor key-prefix))))


;;;
;;; generate-uid
;;;

(define (random-id size)
  "Generate and random identifier of length SIZE"
  (define CHARS "0123456789AZERTYUIOPQSDFGHJKLMWXCVBN")
  ;; append SIZE alphanumeric chars from `CHARS`
  (let loop ((count size)
             (id ""))
    (if (eq? count 0)
        id
        (loop (1- count) (format #f "~a~a" id (string-ref CHARS (random 36)))))))

(define*-public (generate-uid exists? #:optional (size 8))
  "Generate a random string made up alphanumeric ascii chars that doesn't exists
   according to `exists?`"
  (let loop ()
    ;; generate a random identifier until it find an one that doesn't already `exists?`
    (let ((id (random-id size)))
      (if (exists? id) (loop) id))))

;;;
;;; tests
;;;

(use-modules (test-check))

(when (or (getenv "CHECK") (getenv "CHECK_WIREDTIGERZ"))
  (format #true "* testing wiredtigerz\n")

  ;; test declarative API

  (test-check "create table config without index"
    (config-prepare-create '(atoms
                             ((uid . record))
                             ((assoc . bytes))
                             ()))
    (list (list "table:atoms" "key_format=r,value_format=u,columns=(uid,assoc)")))

  (test-check "create table config with index and projections"
    (config-prepare-create '(arrows
                             ((key . record))
                             ((start . unsigned-integer)
                              (end . unsigned-integer))
                             ;; index
                             ((outgoings (uid,start) (uid end))
                              (incomings (end) ()))))
    (list (list "table:arrows" "key_format=r,value_format=QQ,columns=(key,start,end)")
          (list "index:arrows:outgoings" "columns=(uid,start)")
          (list "index:arrows:incomings" "columns=(end)")))

  (test-check "create cursor config without index"
    (config-prepare-open '(atoms
                           ((uid . record))
                           ((assoc . bytes))
                           ()))
    (list (list 'atoms (list "table:atoms"))
          (list 'atoms-append (list "table:atoms" "append"))))

  (test-check "create cursor config with index with and without projection"
    (config-prepare-open '(atoms
                           ((uid . record))
                           ((assoc . bytes))
                           ((reversex (assoc) (uid))
                            (reverse (assoc) ()))))
    (list (list 'atoms (list "table:atoms"))
          (list 'atoms-append (list "table:atoms" "append"))
          (list 'atoms-reversex (list "index:atoms:reversex(uid)"))
          (list 'atoms-reverse (list "index:atoms:reverse"))))

  ;; test star API

  (test-check "wiredtiger-open*"
    (receive (cnx ctx)
        (wiredtiger-open* "/tmp/wt"
                          '(table ((key . record)) ((value . integer)) ()))
      (with-cnx cnx #true))
    #true)

  (test-check "cursor-insert* and cursor-search*"
    (receive (cnx ctx)
        (wiredtiger-open* "/tmp/wt"
                          '(terms ((key . record)) ((value . unsigned-integer)) ()))
      (with-cnx cnx
        (let ((append (context-ref ctx 'terms-append)))
          (cursor-insert* append #nil (list 42))
          (cursor-insert* append #nil (list 1337))
          (cursor-insert* append #nil (list 1985)))))
    3)

  (test-check "cursor-range 0"
    (receive (cnx ctx) (wiredtiger-open* "/tmp/wt" '(table ((k . record))
                                                           ((v . integer))
                                                           ((reversed (v) (k)))))
      (with-cnx cnx
        (cursor-range (context-ref ctx 'table-reversed) 42)))
    '())

  (test-check "cursor-range 1"
    (receive (cnx ctx) (wiredtiger-open* "/tmp/wt" '(table ((k . record))
                                                           ((v . integer))
                                                           ((reversed (v) (k)))))
      (cursor-insert* (context-ref ctx 'table-append) #nil '(1))
      (cursor-insert* (context-ref ctx 'table-append) #nil '(42))
      (cursor-insert* (context-ref ctx 'table-append) #nil '(42))
      (cursor-insert* (context-ref ctx 'table-append) #nil '(42))
      (cursor-insert* (context-ref ctx 'table-append) #nil '(1))
      (with-cnx cnx
        (cursor-range (context-ref ctx 'table-reversed) 42)))
    '((4) (3) (2)))

  (test-check "cursor-range 2"
    (receive (cnx ctx) (wiredtiger-open* "/tmp/wt" '(table ((k . record))
                                                           ((v . integer))
                                                           ((reversed (v) (k)))))
      (cursor-insert* (context-ref ctx 'table-append) #nil '(1))
      (cursor-insert* (context-ref ctx 'table-append) #nil '(1))
      (with-cnx cnx
        (cursor-range (context-ref ctx 'table-reversed) 42)))
    '())

  (test-check "cursor-range 3"
    (receive (cnx ctx) (wiredtiger-open* "/tmp/wt" '(table ((k . record))
                                                           ((v . integer))
                                                           ((reversed (v) (k)))))
      (cursor-insert* (context-ref ctx 'table-append) #nil '(1))
      (cursor-insert* (context-ref ctx 'table-append) #nil '(42))
      (cursor-insert* (context-ref ctx 'table-append) #nil '(1))
      (with-cnx cnx
        (cursor-range (context-ref ctx 'table-reversed) 42)))
    '((2)))

  (test-check "cursor-range-prefix"
    (receive (cnx ctx)
        (wiredtiger-open* "/tmp/wt"
                          '(table ((a . integer) (b . integer)) ((c . integer)) ()))
      (let ((cursor (context-ref ctx 'table)))
        (cursor-insert* cursor (list 0 0) (list 0))
        (cursor-insert* cursor (list 1 1) (list 1))
        (cursor-insert* cursor (list 1 2) (list 1))
        (cursor-insert* cursor (list 2 0) (list 2))
        (with-cnx cnx
          (cursor-range-prefix cursor 1 0))))
    '(((1 2) 1)
      ((1 1) 1)))

  (test-check "cursor-range-prefix 2"
    (receive (cnx ctx)
        (wiredtiger-open* "/tmp/wt"
                          '(table ((a . integer) (b . integer)) ((c . integer)) ()))
      (let ((cursor (context-ref ctx 'table)))
        (cursor-insert* cursor (list 1 1) (list 1))
        (cursor-insert* cursor (list 1 2) (list 1))
        (cursor-insert* cursor (list 2 0) (list 2))
        (with-cnx cnx
          (cursor-range-prefix cursor 1 0))))
    '(((1 2) 1)
      ((1 1) 1)))

  (test-check "cursor-range-prefix 3"
    (receive (cnx ctx)
        (wiredtiger-open* "/tmp/wt"
                          '(table ((a . integer) (b . integer)) ((c . integer)) ()))
      (let ((cursor (context-ref ctx 'table)))
        (cursor-insert* cursor (list 2 0) (list 2))
        (with-cnx cnx
          (cursor-range-prefix cursor 1 0))))
    '())

  (test-check "cursor-range-prefix 3"
    (receive (cnx ctx)
        (wiredtiger-open* "/tmp/wt"
                          '(table ((a . integer) (b . integer)) ((c . integer)) ()))
      (let ((cursor (context-ref ctx 'table)))
        (cursor-insert* cursor (list 0 0) (list 0))
        (with-cnx cnx
          (cursor-range-prefix cursor 1 0))))
    '())

  (test-check "cursor-range-prefix 4"
    (receive (cnx ctx)
        (wiredtiger-open* "/tmp/wt"
                          '(table ((a . integer) (b . integer)) ((c . integer)) ()))
      (let ((cursor (context-ref ctx 'table)))
        (cursor-insert* cursor (list 0 0) (list 0))
        (cursor-insert* cursor (list 1 0) (list 0))
        (cursor-insert* cursor (list 1 1) (list 1))
        (cursor-insert* cursor (list 1 2) (list 1))
        (cursor-insert* cursor (list 2 0) (list 2))
        (with-cnx cnx
          (cursor-range-prefix cursor 1 0))))
    '(((1 2) 1)
      ((1 1) 1)
      ((1 0) 0)))

  (test-check "cursor-range-prefix 5"
    (receive (cnx ctx)
        (wiredtiger-open* "/tmp/wt"
                          '(table ((a . integer) (b . integer)) ((c . integer)) ()))
      (let ((cursor (context-ref ctx 'table)))
        (cursor-insert* cursor (list 0 0) (list 0))
        (cursor-insert* cursor (list 1 0) (list 0))
        (cursor-insert* cursor (list 1 1) (list 1))
        (cursor-insert* cursor (list 1 2) (list 1))
        (with-cnx cnx
          (cursor-range-prefix cursor 1 0))))
    '(((1 2) 1)
      ((1 1) 1)
      ((1 0) 0)))

  (test-check "cursor with empty range-prefix"
    (receive (cnx ctx)
        (wiredtiger-open* "/tmp/wt"
                          '(table ((a . integer) (b . integer)) ((c . integer)) ()))
      (let ((cursor (context-ref ctx 'table)))
        (with-cnx cnx
          (cursor-range-prefix cursor 1 0))))
    '())
  (test-check "cursor-value-ref* on empty table"
    (receive (cnx ctx)
        (wiredtiger-open* "/tmp/wt"
                          '(table ((a . integer) (b . integer)) ((c . integer)) ()))
      (let ((cursor (context-ref ctx 'table)))
        (with-cnx cnx
          (catch 'wiredtiger
            (lambda ()
              (cursor-value-ref* cursor 42))
            (lambda (key . args)
              #true)))))
    #true)

  (test-check "with-context"
    (let ((env (env-open "/tmp/wt")))
      (env-config-add env '(counter ((name . string)) ((value . integer)) ()))
      (env-create env)
      (with-cnx (env-connection env)
        (with-context env
          (call-with-cursor 'counter
            (lambda (cursor)
              (cursor-insert* cursor (list "counter") (list 42)))))
        (with-context env
          (call-with-cursor 'counter
            (lambda (cursor)
              (cursor-value-ref* cursor "counter"))))))
    '(42))
  )

[-- Attachment #6: wsh.scm --]
[-- Type: text/plain, Size: 13267 bytes --]

(define-module (wsh))

(use-modules (ice-9 receive))
(use-modules (srfi srfi-26))
(use-modules (srfi srfi-1))
(use-modules (ice-9 match))

(use-modules (wiredtigerz))
(use-modules (wiredtiger))
(use-modules (text))


(define-public *wsh* '((urls
                        ((uid . record))
                        ((url . string))
                        ())
                       (terms
                        ((uid . record))
                        ((string . string))
                        ((inverse (string) (uid))))
                       (inverted-index
                        ((term . unsigned-integer)
                         (url-uid . unsigned-integer)
                         (position . unsigned-integer))
                        ((nothing . bytes))
                        ((positions (url-uid position) (term))))))

(define (index-term-with-position url-uid)
  (match-lambda
    ((term position)
     (let ((uid ;; get or create term
            (call-with-cursor 'terms-inverse
              (lambda (cursor)
                (catch 'wiredtiger
                  (lambda () (car (cursor-value-ref* cursor term)))
                  (lambda _ (call-with-cursor 'terms-append
                              (lambda (cursor)
                                (cursor-insert* cursor '() (list term))))))))))
       (call-with-cursor 'inverted-index
         (lambda (cursor)
           (cursor-insert* cursor (list uid url-uid position) (list #vu8()))))))))

(define-public (index url html)
  (call-with-cursor 'urls-append
    (lambda (cursor)
      ;; insert url and retrieve its uid
      (let ((url-uid (cursor-insert* cursor '() (list url))))
        ;; convert html to tokens and add positions
        (let* ((terms (html->tokens html))
               (terms+positions (zip terms (iota (length terms)))))
          (for-each (index-term-with-position url-uid) terms+positions))))))


(define (uid->url uid)
  (call-with-cursor 'urls
    (lambda (cursor)
      (car (cursor-value-ref* cursor uid)))))

(define (term-uid term)
  (catch 'wiredtiger
    (lambda() (call-with-cursor 'terms-inverse
                (lambda (cursor)
                  (car (cursor-value-ref* cursor term)))))
    (lambda (key . args) #f)
    #f))

(define (search term)
  (let ((uid (term-uid term)))
    (call-with-cursor 'inverted-index
      (lambda (cursor)
        ;; return document-id and position
        (delete-duplicates (map cadar (cursor-range-prefix cursor uid 0 0)))))))

(define-public (debug)
  (call-with-cursor 'inverted-index
    (lambda (cursor)
      (cursor-debug cursor))))

(define-public (search/term term)
  (cons 'term term))

(define-public (search/and . args)
  (cons 'and args))

(define-public (search/or . args)
  (cons 'or args))

(define-public (search/not arg)
  (cons 'not arg))

(define true? (cut eq? #t <>))

(define (search/make-predicate arg)
  (match arg
    (('term . term) (let ((termid (term-uid term)))
                      (lambda (docid)
                        (call-with-cursor 'inverted-index
                          (lambda (cursor)
                            (not (zero? (cursor-count-prefix cursor termid docid 0))))))))
    (('and . args) (let ((predicates (map (cut search/make-predicate <>) args)))
                     (lambda (docid)
                       (every true? (map (cut <> docid) predicates)))))
    (('or . args) (let ((predicates (map (cut search/make-predicate <>) args)))
                    (lambda (docid)
                      (any true? (map (cut <> docid) predicates)))))))

(define (search/vm query)
  (match query
    (('term . term) (search term))
    (('and . args)
     ;; separate 'not' for others
     (let* ((nots (filter (lambda (arg) (eq? (car arg) 'not)) args))
            (others (lset-difference equal? args nots)))
       ;; retrieve hits before applying 'not'
       (let* ((nots (map search/vm nots))
              (hits (apply lset-intersection (cons eq? (map search/vm others)))))
         (let loop ((nots nots)
                    (hits hits))
           (cond
            ((null? nots) hits)
            ((null? hits) '())
            (else (loop (cdr nots) ((car nots) hits))))))))
    (('or . args) (delete-duplicates (append-map search/vm args)))
    (('not . arg) (lambda (hits)
                    (lset-difference eq? hits (search/vm arg))))))

(define (flatten lst)
  (let loop ((lst lst)
             (out '()))
    (if (list? lst)
        (if (null? lst)
            out
            (if (list? (car lst))
                (loop (cdr lst) (append (flatten (car lst)) out))
                (loop (cdr lst) (cons (car lst) out))))
        lst)))

(define (query-terms% query)
  (match query
    (('term . term) (term-uid term))
    (('and . args) (map query-terms% args))
    (('or . args) (map query-terms% args))
    (('not . arg) '())))

(define (query-terms query)
  "convert QUERY to a list of relevant terms for computing the score"
  (flatten (query-terms% query)))

(define (term-frequency term-id doc-id)
  "frequency of TERM-ID in DOC-ID"
  (call-with-cursor 'inverted-index
    (lambda (cursor)
      (cursor-count-prefix cursor term-id doc-id 0))))

(define (score term-ids doc-id)
  "score DOC-ID against TERM-IDS"
  (apply + (map (cut term-frequency <> doc-id) term-ids)))

(define-public (search* query)
  "retrieve sorted urls for QUERY"
  ;; compute hits for query
  (let ((hits (search/vm query)))
    ;; retrieve relevant query terms
    (let ((term-ids (query-terms query)))
      ;; score every hits against terms
      (let ((scores (map (cut score term-ids <>) hits)))
        (let ((urls (map uid->url hits)))
          (sort (map cons urls scores) (lambda (a b) (> (cdr a) (cdr b))) ))))))

;;;
;;; tests
;;;

(use-modules (test-check))

(when (or (getenv "CHECK") (getenv "CHECK_WSH"))
  (format #t "* check wsh\n")
  
  (test-check "index"
    (with-env (env-open* "/tmp/wt" *wsh*)
      (not (null? (index "http://example.net" "foo bar baz"))))
    #t)

  (test-check "query-terms 1"
    (with-env (env-open* "/tmp/wt" *wsh*)
        (index "http://example.net" "database")
        (index "http://example.net" "spam")
        (index "http://example.net" "egg")
        (index "http://example.net" "postgresql")
        (index "http://example.net" "pgsql")
        (query-terms (search/and (search/term "database") (search/term "spam"))))
    '(2 1))

  (test-check "query-terms 2"
    (with-env (env-open* "/tmp/wt" *wsh*)
      (index "http://example.net" "database")
      (index "http://example.net" "spam")
      (index "http://example.net" "egg")
      (index "http://example.net" "postgresql")
      (index "http://example.net" "pgsql")
      (query-terms (search/and (search/term "database") (search/term "spam")
                               (search/or (search/term "pgsql") (search/term "postgresql")))))
    '(4 5 2 1))

  (test-check "query-terms 3"
    (with-env (env-open* "/tmp/wt" *wsh*)
      (index "http://example.net" "database")
      (index "http://example.net" "spam")
      (index "http://example.net" "egg")
      (index "http://example.net" "postgresql")
      (index "http://example.net" "pgsql")
      (query-terms (search/and (search/term "database") (search/term "spam")
                               (search/or (search/term "pgsql") (search/term "postgresql"))
                               (search/not (search/term "spam")))))
    '(4 5 2 1))

  (test-check "search/vm and/or"
    (with-env (env-open* "/tmp/wt" *wsh*)
      (index "http://example.net" "database & postgresql")
      (index "http://example.net" "spam & pgsql")
      (index "http://example.net" "spam & egg")
      (index "http://example.net" "database & egg")
      (index "http://example.net" "database & pgsql")
      (search/vm (search/and (search/term "database") (search/or (search/term "postgresql")
                                                                 (search/term "pgsql")))))
    '(5 1))

  (test-check "search/vm or avoid duplicates"
    (with-env (env-open* "/tmp/wt" *wsh*)
      (index "http://example.net" "database")
      (index "http://example.net" "wiredtiger & database")
      (index "http://example.net" "wiredtiger")
      
      (search/vm (search/or (search/term "database")
                            (search/term "wiredtiger"))))
    '(2 1 3))

  (test-check "search/vm avoid duplicates"
    (with-env (env-open* "/tmp/wt" *wsh*)
      (index "http://example.net" "wiredtiger & wiredtiger")
      (search/vm (search/term "wiredtiger")))
    '(1))

  (test-check "search/vm and/not"
    (with-env (env-open* "/tmp/wt" *wsh*)
      (index "http://example.net" "database & postgresql")
      (index "http://example.net" "spam & pgsql")
      (index "http://example.net" "spam & egg")
      (index "http://example.net" "database & egg")
      (index "http://example.net" "database & pgsql")
      (search/vm (search/and (search/term "database")
                             (search/not (search/term "egg")))))
  '(5 1))

  (test-check "search/vm and/not/and"
    (with-env (env-open* "/tmp/wt" *wsh*)
      (index "http://example.net" "database & egg")
      (index "http://example.net" "database & spam & egg")
      (index "http://example.net" "database & spam & egg")
      (index "http://example.net" "database & egg")
      (index "http://example.net" "database & spam")
      (search/vm (search/and (search/term "database") (search/not (search/and (search/term "egg")
                                                                              (search/term "spam"))))))
    '(5 4 1))

  (test-check "search/vm and/not"
    (with-env (env-open* "/tmp/wt" *wsh*)
      (index "http://example.net" "database & postgresql")
      (index "http://example.net" "spam & pgsql")
      (index "http://example.net" "spam & egg")
      (index "http://example.net" "database & egg")
      (index "http://example.net" "database & pgsql")
      (search/vm (search/and (search/term "database")
                             (search/not (search/or (search/term "egg")
                                                    (search/term "pgsql"))))))
    '(1))
  
  (test-check "search/make-predicate 1"
    (with-env (env-open* "/tmp/wt" *wsh*)
      (index "http://example.net" "database & postgresql")
      (index "http://example.net" "spam & pgsql")
      (index "http://example.net" "spam & egg")
      (index "http://example.net" "database & egg")
      (index "http://example.net" "database & pgsql & spam")
      (let* ((query (search/term "database"))
             (predicate (search/make-predicate query)))
        
        (filter predicate (map (cut + 1 <>) (iota 5)))))
  '(1 4 5))

  (test-check "search/make-predicate 2"
    (with-env (env-open* "/tmp/wt" *wsh*)
      (index "http://example.net" "database & postgresql")
      (index "http://example.net" "spam & pgsql")
      (index "http://example.net" "spam & egg")
      (index "http://example.net" "database & egg")
      (index "http://example.net" "database & pgsql & spam")
      (let* ((query (search/and (search/term "database")
                                (search/term "postgresql")))
             (predicate (search/make-predicate query)))
        
        (filter predicate (map (cut + 1 <>) (iota 5)))))
  '(1))

  (test-check "search/make-predicate 3"
    (with-env (env-open* "/tmp/wt" *wsh*)
      (index "http://example.net" "database & postgresql")
      (index "http://example.net" "spam & pgsql")
      (index "http://example.net" "spam & egg")
      (index "http://example.net" "database & egg")
      (index "http://example.net" "database & pgsql & spam")
      (let* ((query (search/and (search/term "database")
                                (search/or (search/term "postgresql")
                                           (search/term "pgsql"))))
             (predicate (search/make-predicate query)))
        
        (filter predicate (map (cut + 1 <>) (iota 5)))))
  '(1 5))

  (test-check "search*"
    (with-env (env-open* "/tmp/wt" *wsh*)
      (index "http://database.postgresql.pgsql.net" "database & postgresql & pgsql")
      (index "http://database.postgresql.net" "database & postgresql")
      (index "http://spam.egg.net" "spam & egg")
      (index "http://database.egg.net" "database & egg")
      (index "http://database.pgsql.spam.net" "database & pgsql & spam")
      (index "http://database.postgresql.pgsql.net/database" "database & postgresql & pgsql & database again")
      (let ((query (search/and (search/term "database")
                               (search/or (search/term "postgresql")
                                          (search/term "pgsql")))))
        (search* query)))
    '(("http://database.postgresql.pgsql.net/database" . 4) ("http://database.postgresql.pgsql.net" . 3) ("http://database.pgsql.spam.net" . 2) ("http://database.postgresql.net" . 2)))

  (test-check "search/vm unknown keyword"
    (with-env (env-open* "/tmp/wt" *wsh*)
      (index "http://example.net" "database & postgresql")
      (index "http://example.net" "spam & pgsql")
      (index "http://example.net" "spam & egg")
      (index "http://example.net" "database & egg")
      (index "http://example.net" "database & pgsql")
      (search/vm (search/and (search/term "wiredtiger"))))
    '(1))
)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [HELP] a search engine in GNU Guile
  2016-09-09 14:39 ` Christopher Allan Webber
@ 2016-09-09 14:05   ` Ralf Mattes
  2016-09-09 18:10     ` Amirouche Boubekki
  0 siblings, 1 reply; 9+ messages in thread
From: Ralf Mattes @ 2016-09-09 14:05 UTC (permalink / raw)
  To: Christopher Allan Webber; +Cc: Guile User

On Fri, Sep 09, 2016 at 09:39:24AM -0500, Christopher Allan Webber wrote:
> Amirouche Boubekki writes:
> 
> > - port whoosh/lucene to guile to improve text search

Sorry, but I don't see the point of this. At least Lucene has a http-based
interface that can be accessed by any kind of client language. Why reinvent the
wheel (and, in the case of Lucene, a rather well working, extremly mature and
complex wheel)?
 
> This is something I'd love to see generally.  It would be nice to have
> an indexing library, either by writing bindings to Xapian (which
> unfortunately couldn't use the FFI since it's C++),

But almost all of Xapian's bindings are Swig-generated (and that seems to be
the prefered way of generating bindings). IIRC I used the Swig Guile bindings
years ago (I'm pretty shure that code got lost in a harddisk crash, but I'm to
lazy to google it up ...).
 
> or natively porting
> something like Whoosh, for Guile.

I've seen similar approaches for Common Lisp (search for montezuma) but in the
end it seems to be way too much work - remember that not a small part of Lucene's
success is based on the existing ecosystem (Solr, excellent language parsers et al.)

Cheers, Ralf Mattes


> If you write this as an independent library, let me know.  I'm a likely
> user.
> 
>  - Chris
> 



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [HELP] a search engine in GNU Guile
  2016-08-13 15:25 [HELP] a search engine in GNU Guile Amirouche Boubekki
  2016-09-04 13:35 ` Amirouche Boubekki
@ 2016-09-09 14:39 ` Christopher Allan Webber
  2016-09-09 14:05   ` Ralf Mattes
  2016-09-23  5:52 ` Amirouche Boubekki
  2 siblings, 1 reply; 9+ messages in thread
From: Christopher Allan Webber @ 2016-09-09 14:39 UTC (permalink / raw)
  To: Amirouche Boubekki; +Cc: Guile User

Amirouche Boubekki writes:

> - port whoosh/lucene to guile to improve text search

This is something I'd love to see generally.  It would be nice to have
an indexing library, either by writing bindings to Xapian (which
unfortunately couldn't use the FFI since it's C++), or natively porting
something like Whoosh, for Guile.

If you write this as an independent library, let me know.  I'm a likely
user.

 - Chris



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [HELP] a search engine in GNU Guile
  2016-09-04 13:35 ` Amirouche Boubekki
@ 2016-09-09 14:40   ` Christopher Allan Webber
  2016-09-10  6:33     ` Amirouche Boubekki
  0 siblings, 1 reply; 9+ messages in thread
From: Christopher Allan Webber @ 2016-09-09 14:40 UTC (permalink / raw)
  To: Amirouche Boubekki; +Cc: Guile User

Amirouche Boubekki writes:

> Héllo,
>
>
> I'd like to share with you a mini-project on the road of Culturia 0.1
> [0] which is a boolean keyword search engine (similar in principle to
> xapian, lucene and whoosh (with less polish and features)).

... and I didn't read this until after I wrote my last message.  Very
cool! :)  I hope to look more soon.



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [HELP] a search engine in GNU Guile
  2016-09-09 14:05   ` Ralf Mattes
@ 2016-09-09 18:10     ` Amirouche Boubekki
  2017-02-10 10:56       ` amirouche
  0 siblings, 1 reply; 9+ messages in thread
From: Amirouche Boubekki @ 2016-09-09 18:10 UTC (permalink / raw)
  To: Ralf Mattes; +Cc: Guile User, Ralf Mattes,,,

On 2016-09-09 16:05, Ralf Mattes wrote:
> On Fri, Sep 09, 2016 at 09:39:24AM -0500, Christopher Allan Webber 
> wrote:
>> Amirouche Boubekki writes:
>> 
>> > - port whoosh/lucene to guile to improve text search
> 
> Sorry, but I don't see the point of this.

I mean to say "to improve text search of my previous attempt at writing 
a search
engine". The previous iteration of this project does not support boolean 
search.

> At least Lucene has a http-based
> interface that can be accessed by any kind of client language.

That is trivial to do with guile too.

> Why reinvent the wheel

Because it's a hobby.

> (and, in the case of Lucene, a rather well working,

It's not possible to use a custom storage engine with Lucene.

> extremly mature

My theory is that some search engine businesses like algolia forked 
Lucene to build
it on top of something similar to wiredtiger and can now claim 
impressive performance.

What I mean to say basically, is that wsh.scm is innovation. I read here 
and there that
big players are actually using storage engines similar to wiredtiger to 
build search engines...
So, it's not a bad idea it just an idea that is not common.

> and complex wheel)?

How complex? That's what I try to understand. AFAIK it's not as complex 
as opencog
since I can rewrite more features.


> 
>> This is something I'd love to see generally.  It would be nice to have
>> an indexing library, either by writing bindings to Xapian (which
>> unfortunately couldn't use the FFI since it's C++),
> 
> But almost all of Xapian's bindings are Swig-generated (and that seems 
> to be
> the prefered way of generating bindings). IIRC I used the Swig Guile 
> bindings
> years ago (I'm pretty shure that code got lost in a harddisk crash, but 
> I'm to
> lazy to google it up ...).
> 
>> or natively porting
>> something like Whoosh, for Guile.
> 
> I've seen similar approaches for Common Lisp (search for montezuma) but 
> in the
> end it seems to be way too much work - remember that not a small part
> of Lucene's
> success is based on the existing ecosystem (Solr, excellent language
> parsers et al.)

If you think about stemming then it's not supported yet by wsh at all. 
It's an area
I'd like to improve.

I agree that if someone wants to create a business using Guile, they 
would be
up and running faster using ES or solr. It will be a good contribution 
to Guile
ecosystem. I am not building a business, I'm studying free software zoo. 
wsh is
basically a notes in the form of code on the road to what I actually 
want to reach
which is concept search  cf. 
https://en.wikipedia.org/wiki/Concept_search



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [HELP] a search engine in GNU Guile
  2016-09-09 14:40   ` Christopher Allan Webber
@ 2016-09-10  6:33     ` Amirouche Boubekki
  0 siblings, 0 replies; 9+ messages in thread
From: Amirouche Boubekki @ 2016-09-10  6:33 UTC (permalink / raw)
  To: Christopher Allan Webber; +Cc: Guile User

On 2016-09-09 16:40, Christopher Allan Webber wrote:
> Amirouche Boubekki writes:
> 
>> Héllo,
>> 
>> 
>> I'd like to share with you a mini-project on the road of Culturia 0.1
>> [0] which is a boolean keyword search engine (similar in principle to
>> xapian, lucene and whoosh (with less polish and features)).
> 
> ... and I didn't read this until after I wrote my last message.  Very
> cool! :)  I hope to look more soon.

Let me know what you need in terms of features. There is various things
missing in particular a) there is no stemming at all b) there is no 
support
for indexing multiple fields with different boost scoare c) there is no
query parser.

That's what I can think of you will need...



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [HELP] a search engine in GNU Guile
  2016-08-13 15:25 [HELP] a search engine in GNU Guile Amirouche Boubekki
  2016-09-04 13:35 ` Amirouche Boubekki
  2016-09-09 14:39 ` Christopher Allan Webber
@ 2016-09-23  5:52 ` Amirouche Boubekki
  2 siblings, 0 replies; 9+ messages in thread
From: Amirouche Boubekki @ 2016-09-23  5:52 UTC (permalink / raw)
  To: Guile User; +Cc: guile-user

Héllo,


I made some progress regarding culturia. Now there is a web
interface available at hypermove.net [0].

[0] http://hypermove.net/?query=guile+algorithms+-wingolog

Now you can:

- search a bunch of website related to Guile I quickly selected
- use minus char "-" to exclude some keyword from the result set
- index small-ish domains

There is no UI yet to add your website to the index, if you want
your website to be part of the experiment reach me.

I tried two search backends. One based on a graph database which
is more versatile but much slower and another based on inverted
index. The latter is the one that is used currently. It's fast
enough right now.

What will be done next:

- Add wikipedia, wiktionary, stackoverflow and hackernews to the
   database

- Improve how results are displayed

- Remove feeds from results

- Index pdf

- Remove duplicate entries from the results

- Paginate

- Make use of fibers to run the http server

- Better crawling algorithm that support updating the index

At some point there will be a 0.1 release, not sure when.

The code is also available at framagit [1] the web interface
is implemented in a single module [2].

[1] https://framagit.org/a-guile-mind/culturia
[2] https://framagit.org/a-guile-mind/culturia/blob/master/src/web.scm

Happy hacking!

On 2016-08-13 17:25, Amirouche Boubekki wrote:
> Héllo,
> 
> 
> The goal of Culturia is to create a framework that makes it easy
> to tape into Natural Language Understanding algorithms (and NLP)
> and provide an interface for common tasks.
> 
> Culturia is an intelligence augmentation software.
> 
> It's primary interface is a search engine. Another important aspect
> of the project is that it wants to be useable offline as such it will
> come with infrastructure to dump, load and store dataset for offline 
> use.
> 
> The current state of the project can be described as a big ball of mud.
> There is a tiny search engine with crawling skills and that's basically
> all of it.
> 
> The immediate changes that should happen are in order of preference:
> 
> - offline stackoverflow (cf. sotoki.scm) and use the generated
>   website to create a zim for kiwix [0]. This is great occasion to
>   show how great GNU Guile is!
> - port whoosh/lucene to guile to improve text search
> - offline hackernews, wikidata, wikipedia, wiktionary
> - implement BM25f
> 
> Culturia is a reference to _Culture and Empire_ by Pieter Hintjens.
> 
> It has a sparse documentation is available online [1].
> It's hosted on github [2] (This can change, if contributors
> don't want to use github).
> 
> The TODO list is big, here is some stuff that needs to be done:
> 
> - finish GrammarLink bindings
> - create sophia [3] bindings
> - implement TextRank
> - implement PageRank
> - create a GUI using sly or html
> - explore ways to easily share database among several processus
> 
> And many other things! Newbies are accepted obviously!
> 
> Send me a mail or use #guile @ irc.freenode.net, I am amz3.
> 
> 
> Happy hacking!
> 
> 
> [0] http://www.kiwix.org/wiki/Main_Page
> [1] https://amirouche.github.io/Culturia/doc/
> [2] https://github.com/amirouche/Culturia
> [3] http://sophia.systems/

-- 
Amirouche ~ amz3 ~ http://www.hyperdev.fr



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [HELP] a search engine in GNU Guile
  2016-09-09 18:10     ` Amirouche Boubekki
@ 2017-02-10 10:56       ` amirouche
  0 siblings, 0 replies; 9+ messages in thread
From: amirouche @ 2017-02-10 10:56 UTC (permalink / raw)
  To: guile-user

Héllo Ralf,

On 09/09/2016 20:10, Amirouche Boubekki wrote:
> On 2016-09-09 16:05, Ralf Mattes wrote:
>> On Fri, Sep 09, 2016 at 09:39:24AM -0500, Christopher Allan Webber wrote:
>>> Amirouche Boubekki writes:

[...]

>> (and, in the case of Lucene, a rather well working,
>
> It's not possible to use a custom storage engine with Lucene.
>
>> extremly mature
>
> My theory is that some search engine businesses like algolia forked
> Lucene to build
> it on top of something similar to wiredtiger and can now claim
> impressive performance.

Based on the blog post serie algolia has done, they do not use
something like wiredtiger. So I was fully wrong. They have fine
tuned datastructures for every problem they encounter for indexing
and querying.

https://blog.algolia.com/inside-the-algolia-engine-part-1-indexing-vs-search/

Anyway, outside the far reaching goal of achieving concept search,
I find wiredtiger handy+good enough that's why I use it. That's said
if I need to get a living from my free coding I will not use
wiredtiger.

Best regards,


Amirouche



^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2017-02-10 10:56 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-08-13 15:25 [HELP] a search engine in GNU Guile Amirouche Boubekki
2016-09-04 13:35 ` Amirouche Boubekki
2016-09-09 14:40   ` Christopher Allan Webber
2016-09-10  6:33     ` Amirouche Boubekki
2016-09-09 14:39 ` Christopher Allan Webber
2016-09-09 14:05   ` Ralf Mattes
2016-09-09 18:10     ` Amirouche Boubekki
2017-02-10 10:56       ` amirouche
2016-09-23  5:52 ` Amirouche Boubekki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).