From mboxrd@z Thu Jan 1 00:00:00 1970 From: Bengt Richter Subject: Re: Inverted index to accelerate guix package search Date: Mon, 13 Jan 2020 19:53:06 -0800 Message-ID: <20200114035306.GA18362@Evo25c2ArchGx4.localdomain> References: <87a76r68u6.fsf@ambrevar.xyz> Reply-To: Bengt Richter Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Return-path: Received: from eggs.gnu.org ([2001:470:142:3::10]:60751) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1irDGb-0006HW-Ou for guix-devel@gnu.org; Mon, 13 Jan 2020 22:53:23 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1irDGZ-00034w-N2 for guix-devel@gnu.org; Mon, 13 Jan 2020 22:53:21 -0500 Received: from imta-38.everyone.net ([216.200.145.38]:53140) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1irDGZ-00033Z-Di for guix-devel@gnu.org; Mon, 13 Jan 2020 22:53:19 -0500 Content-Disposition: inline In-Reply-To: List-Id: "Development of GNU Guix and the GNU System distribution." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guix-devel-bounces+gcggd-guix-devel=m.gmane-mx.org@gnu.org Sender: "Guix-devel" To: zimoun Cc: Guix Devel Hi simoun, Guix, On +2020-01-13 18:54:18 +0100, zimoun wrote: > Hi Arun, >=20 > For sure, it is a good direction... but it is not that simple. :-) >=20 >=20 > On Mon, 13 Jan 2020 at 16:08, Arun Isaac w= rote: >=20 > > > Those measures don't seem precise enough to draw a good conclusion. > > > Could you increase the sample size (or maybe just loop?) so that al= l > > > times reach over a second or so? > > > > Indeed, my bad! Here are better timing results. I have repeated each = of > > the searches a 1000 times, and I'm getting around 90x speedup! :-) > > > > Building index > > clock utime stime cutime cstime gctime > > 9.76 12.50 0.18 0.00 0.00 6.16 > > > > Brute force search (repeated 1000 times) > > clock utime stime cutime cstime gctime > > 195.95 264.13 0.90 0.00 0.00 142.62 >=20 > `fold-packages` traverses all the packages so yes it is "slow". >=20 >=20 > > Inverted index search (repeated 1000 times) > > clock utime stime cutime cstime gctime > > 2.18 2.48 0.00 0.00 0.00 0.60 >=20 > Yes, looking up into an hash table is really really faster. :-) > If you want to instrument particular calls, as you know, you can probably get sub-microsecond resolution (try the following snippet on your platfor= m to see how long it takes to get a gettimeofday value -- even on this Acer Swift with its Intel(R) Pentium(R) CPU N4200 @ 1.10GHz I get about 1= 0 calls per microsecond). At that kind of resolution, timing at various milestones in your program = will show interesting regularities and outliers. You may even detect timing su= bsets caused by such things as the CPU taking a shortcut when multiplying by ze= ro. If you time a thousand calls of A in a loop followed by the same for B, and then do a loop that calls them in succession inside the loop, you wil= l most likely discover cache or other resource competitions between A and B. My point is that benchmarking meaningfully is subtle, and it is easy to be fooled by statistical summary measures of distributions that are reall= y combinations of separate distributions (that are easily, sometimes beauti= fully, seen in scattergrams). --8<---------------cut here---------------start------------->8--- #!/usr/bin/guile -s !# ;;;; ttime -- see how fast gettimeofday is ;;;; 2017-08-24 20:44:21 -- bokr AT bokr DOT com (use-modules (ice-9 pretty-print)) (use-modules (ice-9 format)) (define loop-count 0) (catch #t (lambda () (set! loop-count (string->number (cadr (command-line= ))))) (lambda ( k . rest ) (begin=20 (display "\nFirst arg should be integer loop count\n") (display "Suggested usage from bash: ./ttime 500|uniq -c\n") (display "to see calls/useq\n\n") (exit)))) (define start-loop (gettimeofday)) (define end-loop start-loop) (display (let lp ((x loop-count) (tl `())) (if (positive? x)=20 (lp (- x 1) (cons (gettimeofday) tl)) (begin (set! end-loop (gettimeofday)) (format #f "~y" tl))))) (format #t "start loop: ~a\n" start-loop) (format #t " end loop: ~a\n" end-loop) (format #t " end disp: ~a\n" (gettimeofday)) (newline) --8<---------------cut here---------------end--------------->8--- (pls excuse the odd use of catch -- I was playing around, meaning later to catch other errors like bad count etc., but never got a round tuit) So just chmod 755 ttime and then ./ttime 500|uniq -c as it will suggest if you run it without args. >=20 >=20 > However, the core issue about "search" is not 'find quickly an item' > but 'find the relevant item'. > For example, Bloom filter [1] or B-tree could be interesting data > structure if we are talking about fast retrieval. Note that this part > should even be delegated to SQLite, IMHO. >=20 >=20 > [1] https://en.wikipedia.org/wiki/Bloom_filter > [2] https://en.wikipedia.org/wiki/B-tree >=20 >=20 > Well, the issue is 'scoring' a query. For example, try: >=20 > guix search youtube | recsel -p name,relevance >=20 > --8<---------------cut here---------------start------------->8--- > name: youtube-dl-gui > relevance: 15 >=20 > name: youtube-viewer > relevance: 11 >=20 > name: youtube-dl > relevance: 11 >=20 > name: mps-youtube > relevance: 11 >=20 > name: emacs-youtube-dl > relevance: 11 > --8<---------------cut here---------------end--------------->8--- >=20 > Why the package youtube-dl-gui is more relevant? Because the function > that computes the relevance score not enough "smart". The function > guix/ui.scm(relevance) is too simple: the score is the number of > ocurences of the pattern (weighted by the field where the pattern > matches). Therefore, if you compare "guix show" the packages > "youtube-dl-gui" and "youtube-dl", you will see that the term > "youtube" (case-insentive) appears 5 times for youtube-dl-gui against > only 3 times for youtube-dl. Does the difference (5 vs 3) provide more > information? I do not think so... The issue is: 1. the description is > not well-written (for the current scoring system) and 2. the scoring > function is too simple. >=20 > I have tried to described such issues/views here [3] [4] and since I > have not read so much about recommender systems: state-of-the-art, NLP > tools in Guile, etc. >=20 > [3] https://lists.gnu.org/archive/html/guix-devel/2019-07/msg00252.html >From [3]: =E2=94=8C=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94= =80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94= =80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94= =80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=90 =E2=94=82 From my opinion, text mining or search engine strategies seem a= better =E2=94=82 =E2=94=82 approach to improve the `guix search` than rigidify the filenam= e tree. =E2=94=82 =E2=94=82 And some data mining to see how the packages cluster (depending= on the =E2=94=82 =E2=94=82 metric) should be helpful to first understand how to improve `g= uix =E2=94=82 =E2=94=82 search`. = =E2=94=82 =E2=94=82 I do not know if my words make sense. = =E2=94=82 =E2=94=94=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94= =80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94= =80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94= =80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=98 They make sense to me :) > [4] https://lists.gnu.org/archive/html/guix-devel/2019-12/msg00160.html More stuff, skimmed, but mostly LGTM. I would note that librarians have been helping people find relevant information for centuries, so IWT there is something to learn from them? Maybe not a Dewey Decimal System book classification for guix, but I am not sure it's good to exclude tagging ideas altogether. Another aspect of search is _who_ is doing the searching. A guix develope= r wanting to find implementation examples for bytecode jit is different from a newbie struggling to form an overview of what makes his app stutter. I note that papers and dissertations etc often contain prefatory paragraphs indicating intended audience, and noting parts that are only for specialists vs no-previous-knowledge-required. If synopses and such contained an audience hint in a standard syntax, might that serve as a basis for on-the-fly derivation of a search filter, maybe with a --audience guix-dev,specialist:guile:cps option? For object-oriented oops goops stuff, where you might like to find a collection of methods that might serve as mixin methods for an object of your own, how could you search for that? (think provides/ensures/requires or other type contraints ??) >=20 > Last, I have not read the 2 links you pointed but I think you have a > bug in your code. The retrieval "game" using the index returns the > package "r-mgcv" which does not contain the term "game" in its > description. Well, the query "game" using the brute force does not > return this very package. >=20 > Then, in the function > "guix/scripts/package.scm(find-packages-by-description)", you could > try to replace the `fold-packages' by something using the inverted > index. (Note that the name `find-packages-by-descripton' is odd > because it uses synopsis, name, etc. too.). >=20 >=20 > Hope that helps. >=20 > Chhers, > simon >=20 HTH in some way :) --=20 Regards, Bengt Richter