From: zimoun
Subject: Re: Inverted index to accelerate guix package search
Date: Mon, 20 Jan 2020 20:14:11 +0100
References: <87a76r68u6.fsf@ambrevar.xyz> <87sgkgxwir.fsf@elephly.net> <87a76ncvg0.fsf@gnu.org>
To: Arun Isaac
Cc: Guix Devel

Hi Arun,

On Fri, 17 Jan 2020 at 20:29, Arun Isaac wrote:

> > 1.
> > How to update the index.
> > Give a look at the "pull" code and the ~/.cache/guix folder.
>
> We don't "update" the index. At every guix pull we create it
> anew. Currently, generate-package-cache in gnu/packages.scm does
> this. generate-package-cache is called by package-cache-file in
> guix/channels.scm. package-cache-file is a channel profile hook listed
> under %channel-profile-hooks.

I would like to be able to search packages across the whole commit
history, not only the packages of one specific commit.

> Now, what I am unclear about is how to test my sqlite index building
> code without actually pushing to master and running a guix pull. I will
> go through the various tests in Guix and see if I can figure something
> out, but any pointers would be much appreciated.

To test "guix pull", simply run "make as-derivation". Disclaimer: it can
take some time. :-)

Then the issue is more to avoid polluting your ~/.cache/guix and
~/.config/guix. :-)

 1. Update Guix with the result in /tmp/test

      guix pull -p /tmp/test --url=/path/to/guix/repo

 2. Create your SQL index

      /tmp/test/bin/guix pull -p /tmp/trash

    Now your index should be created with all the packages currently in
    master. To have something reproducible (and faster), I suggest adding
    --commit= and always pulling against the same commit.

 3. Test the index

      /tmp/test/bin/guix search foo

I mean something along these lines. ;-)

> > 2.
> > How to deal with regexp.
> > It is more or less clear to me how to deal with using the trigram keys
> > but I do not know with SQLite; I have not thought about yet.
>
> I think it is not possible to search using regular expressions in sqlite

I think it is possible. I imagine something using multiple queries. I
will have a look at the Guile module.

> I think we should remove regex support altogether. I don't think a good
> search interface should expect the user to provide regexes for
> search. Certainly, it will be a lot less useful if and when we have
> xapian. However, just to keep backward compatibility, we can fall back
> to brute force fold-packages search for regexes. As Ludo pointed out, we
> can't remove the brute force code since we need to support cases when
> the cache is not authoritative.

I disagree. We should keep the regexps. Otherwise the index cannot be
used under "guix search" or "guix package --search=", because of
backward compatibility. The end-user interface (CLI) has to be exactly
the same whether brute force or the index is used. And so do the
results.
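
To illustrate why I think regexps and SQLite can work together, here is
a minimal sketch. SQLite itself ships no implementation of the REGEXP
operator, but an application can register one. The sketch is in Python
only because its sqlite3 module makes a self-contained demo easy; I have
not checked whether the Guile module exposes the same hook, so take that
part as an assumption. The table name, the columns and the data are
hypothetical, not the real package cache.

```python
import re
import sqlite3


def regexp(pattern, text):
    """Back SQLite's REGEXP operator: 'X REGEXP Y' calls regexp(Y, X)."""
    if text is None:
        return 0
    return int(re.search(pattern, text) is not None)


def search(conn, pattern):
    """Return the names of packages whose name or synopsis matches PATTERN."""
    rows = conn.execute(
        "SELECT name FROM packages"
        " WHERE name REGEXP ? OR synopsis REGEXP ?"
        " ORDER BY name",
        (pattern, pattern),
    )
    return [name for (name,) in rows]


# Hypothetical schema and toy data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE packages (name TEXT, synopsis TEXT)")
conn.executemany(
    "INSERT INTO packages VALUES (?, ?)",
    [
        ("emacs", "Extensible text editor"),
        ("guile", "Scheme implementation"),
        ("guile-sqlite3", "Guile bindings to SQLite"),
    ],
)

print(search(conn, "^guile"))
```

Note that such a query still scans every row, so by itself it is brute
force moved into SQLite. A first query that narrows the candidates (for
example with the trigram keys) followed by the regexp check could be one
way to do the "multiple queries" idea, but only timings can tell.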

> About sqlite versus an inverted index using vhashes, I don't know if it
> is possible to serialize a vhash onto disk. Even if that were possible,
> we'll have to load the entire vhash based inverted index into memory for
> every invocation of guix search, and that could hit
> performance. Something like guile-gdbm could have helped, but that's
> another story.

And your first test was not fair ;-) because you compared against a
hash table that was already in memory. I mean, to know the real
performance, only timings can tell. :-)

> I didn't know about sets.scm when I wrote my first proof of concept
> inverted index script. That is why I reinvented the set using hash
> tables. I don't know how hash tables are different from VHashes or which
> is better.

VHashes are a bit confusing to me too. ;-)

https://www.gnu.org/software/guile/manual/html_node/VHashes.html

Cheers,
simon