From mboxrd@z Thu Jan 1 00:00:00 1970 From: zimoun Subject: Re: Inverted index to accelerate guix package search Date: Mon, 13 Jan 2020 18:54:18 +0100 Message-ID: References: <87a76r68u6.fsf@ambrevar.xyz> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Return-path: Received: from eggs.gnu.org ([2001:470:142:3::10]:46042) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1ir3v6-0000Nt-5z for guix-devel@gnu.org; Mon, 13 Jan 2020 12:54:33 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1ir3v4-00070i-Pa for guix-devel@gnu.org; Mon, 13 Jan 2020 12:54:32 -0500 Received: from mail-qt1-x833.google.com ([2607:f8b0:4864:20::833]:38112) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1ir3v4-0006zz-L7 for guix-devel@gnu.org; Mon, 13 Jan 2020 12:54:30 -0500 Received: by mail-qt1-x833.google.com with SMTP id n15so9864660qtp.5 for ; Mon, 13 Jan 2020 09:54:30 -0800 (PST) In-Reply-To: List-Id: "Development of GNU Guix and the GNU System distribution." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guix-devel-bounces+gcggd-guix-devel=m.gmane-mx.org@gnu.org Sender: "Guix-devel" To: Arun Isaac Cc: Guix Devel Hi Arun, For sure, it is a good direction... but it is not that simple. :-) On Mon, 13 Jan 2020 at 16:08, Arun Isaac wrote: > > Those measures don't seem precise enough to draw a good conclusion. > > Could you increase the sample size (or maybe just loop?) so that all > > times reach over a second or so? > > Indeed, my bad! Here are better timing results. I have repeated each of > the searches a 1000 times, and I'm getting around 90x speedup! :-) > > Building index > clock utime stime cutime cstime gctime > 9.76 12.50 0.18 0.00 0.00 6.16 > > Brute force search (repeated 1000 times) > clock utime stime cutime cstime gctime > 195.95 264.13 0.90 0.00 0.00 142.62 `fold-packages` traverses all the packages so yes it is "slow". > Inverted index search (repeated 1000 times) > clock utime stime cutime cstime gctime > 2.18 2.48 0.00 0.00 0.00 0.60 Yes, looking up into an hash table is really really faster. :-) However, the core issue about "search" is not 'find quickly an item' but 'find the relevant item'. For example, Bloom filter [1] or B-tree could be interesting data structure if we are talking about fast retrieval. Note that this part should even be delegated to SQLite, IMHO. [1] https://en.wikipedia.org/wiki/Bloom_filter [2] https://en.wikipedia.org/wiki/B-tree Well, the issue is 'scoring' a query. For example, try: guix search youtube | recsel -p name,relevance --8<---------------cut here---------------start------------->8--- name: youtube-dl-gui relevance: 15 name: youtube-viewer relevance: 11 name: youtube-dl relevance: 11 name: mps-youtube relevance: 11 name: emacs-youtube-dl relevance: 11 --8<---------------cut here---------------end--------------->8--- Why the package youtube-dl-gui is more relevant? Because the function that computes the relevance score not enough "smart". The function guix/ui.scm(relevance) is too simple: the score is the number of ocurences of the pattern (weighted by the field where the pattern matches). Therefore, if you compare "guix show" the packages "youtube-dl-gui" and "youtube-dl", you will see that the term "youtube" (case-insentive) appears 5 times for youtube-dl-gui against only 3 times for youtube-dl. Does the difference (5 vs 3) provide more information? I do not think so... The issue is: 1. the description is not well-written (for the current scoring system) and 2. the scoring function is too simple. I have tried to described such issues/views here [3] [4] and since I have not read so much about recommender systems: state-of-the-art, NLP tools in Guile, etc. [3] https://lists.gnu.org/archive/html/guix-devel/2019-07/msg00252.html [4] https://lists.gnu.org/archive/html/guix-devel/2019-12/msg00160.html Last, I have not read the 2 links you pointed but I think you have a bug in your code. The retrieval "game" using the index returns the package "r-mgcv" which does not contain the term "game" in its description. Well, the query "game" using the brute force does not return this very package. Then, in the function "guix/scripts/package.scm(find-packages-by-description)", you could try to replace the `fold-packages' by something using the inverted index. (Note that the name `find-packages-by-descripton' is odd because it uses synopsis, name, etc. too.). Hope that helps. Chhers, simon