zimoun writes: > However, the core issue about "search" is not 'find quickly an item' > but 'find the relevant item'. > For example, Bloom filter [1] or B-tree could be interesting data > structure if we are talking about fast retrieval. Note that this part > should even be delegated to SQLite, IMHO. > > [1] https://en.wikipedia.org/wiki/Bloom_filter > [2] https://en.wikipedia.org/wiki/B-tree I just read about Bloom filters. They are fascinating indeed! > Well, the issue is 'scoring' a query. I think the issue of whether to use an inverted index is orthogonal to the quest to improve the relevance of the search results. Implementing tf-idf like you suggested could greatly benefit from having fast searches. > Last, I have not read the 2 links you pointed but I think you have a > bug in your code. The retrieval "game" using the index returns the > package "r-mgcv" which does not contain the term "game" in its > description. Well, the query "game" using the brute force does not > return this very package. Indeed, I found the bug and fixed it. Please find attached the updated code. Also, the inverted index hash table stores package objects instead of package names and all comparisons are now done with eq?. This gets us an even higher speedup of around 300x. Built index in 8.556596040725708 seconds Brute force search took 0.4107058048248291 seconds Inverted index search took 0.0012979507446289062 seconds Bengt Richter writes: > If you want to instrument particular calls, as you know, you can probably > get sub-microsecond resolution (try the following snippet on your platform > to see how long it takes to get a gettimeofday value -- even on this > Acer Swift with its Intel(R) Pentium(R) CPU N4200 @ 1.10GHz I get about 10 > calls per microsecond). gettimeofday is indeed a good idea. I have used it in the latest version of my proof of concept script. Like you said, I also noticed some peculiar outliers which probably got averaged out when I was running the search 1000 times. For the purpose of the discussion above, I have only reported the most common time. Pierre Neidhardt writes: > By the way, what about using Xapian in Guix? > > https://en.wikipedia.org/wiki/Xapian > > If it's relevant, maybe we can follow up with a discussion in a new > thread. I feel xapian is too much work (considering that we don't yet have guile bindings) compared to our own simple implementation of an inverted index. But, of course, I am biased since I wrote the inverted index code! :-) But, on a more serious note, if we move to xapian, we will not be able to support regular expression based search queries that we support today. On the other hand, I can extend the inverted index implementation to support regular expression searches. Personally, I don't use regular expression based search queries, and don't think they are very useful especially if we make use of xapian's stemming. What do people think? On the question of whether xapian is too heavy, I think we should make it an optional dependency of Guix so that it remains possible to build and use a more minimalistic Guix for those interested in such things.