zimoun <zimon.toutoune@gmail.com> writes:

> However, the core issue about "search" is not 'find quickly an item'
> but 'find the relevant item'.
> For example, Bloom filter [1] or B-tree could be interesting data
> structure if we are talking about fast retrieval. Note that this part
> should even be delegated to SQLite, IMHO.
>
> [1] https://en.wikipedia.org/wiki/Bloom_filter
> [2] https://en.wikipedia.org/wiki/B-tree

I just read about Bloom filters. They are fascinating indeed!

> Well, the issue is 'scoring' a query.

I think the issue of whether to use an inverted index is orthogonal to
the quest to improve the relevance of the search results. Implementing
tf-idf like you suggested could greatly benefit from having fast
searches.

> Last, I have not read the 2 links you pointed but I think you have a
> bug in your code. The retrieval "game" using the index returns the
> package "r-mgcv" which does not contain the term "game" in its
> description. Well, the query "game" using the brute force does not
> return this very package.

Indeed, I found the bug and fixed it. Please find attached the updated
code. Also, the inverted index hash table stores package objects instead
of package names and all comparisons are now done with eq?. This gets us
an even higher speedup of around 300x.

Built index in 8.556596040725708 seconds
Brute force search took 0.4107058048248291 seconds
Inverted index search took 0.0012979507446289062 seconds

Bengt Richter <bokr@bokr.com> writes:

> If you want to instrument particular calls, as you know, you can probably
> get sub-microsecond resolution (try the following snippet on your platform
> to see how long it takes to get a gettimeofday value -- even on this
> Acer Swift with its Intel(R) Pentium(R) CPU N4200 @ 1.10GHz I get about 10
> calls per microsecond).

gettimeofday is indeed a good idea. I have used it in the latest version
of my proof of concept script. Like you said, I also noticed some
peculiar outliers which probably got averaged out when I was running the
search 1000 times. For the purpose of the discussion above, I have only
reported the most common time.

Pierre Neidhardt <mail@ambrevar.xyz> writes:

> By the way, what about using Xapian in Guix?
>
> https://en.wikipedia.org/wiki/Xapian
>
> If it's relevant, maybe we can follow up with a discussion in a new
> thread.

I feel xapian is too much work (considering that we don't yet have guile
bindings) compared to our own simple implementation of an inverted
index. But, of course, I am biased since I wrote the inverted index
code! :-)

But, on a more serious note, if we move to xapian, we will not be able
to support regular expression based search queries that we support
today. On the other hand, I can extend the inverted index implementation
to support regular expression searches. Personally, I don't use regular
expression based search queries, and don't think they are very useful
especially if we make use of xapian's stemming. What do people think?

On the question of whether xapian is too heavy, I think we should make
it an optional dependency of Guix so that it remains possible to build
and use a more minimalistic Guix for those interested in such things.