From mboxrd@z Thu Jan 1 00:00:00 1970 From: zimoun Subject: Re: Inverted index to accelerate guix package search Date: Thu, 16 Jan 2020 20:53:27 +0100 Message-ID: References: <87a76r68u6.fsf@ambrevar.xyz> <87sgkgxwir.fsf@elephly.net> <87a76ncvg0.fsf@gnu.org> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Return-path: Received: from eggs.gnu.org ([2001:470:142:3::10]:52050) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1isBD4-0006c3-3A for guix-devel@gnu.org; Thu, 16 Jan 2020 14:53:43 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1isBD1-0006jT-RW for guix-devel@gnu.org; Thu, 16 Jan 2020 14:53:41 -0500 Received: from mail-qv1-xf35.google.com ([2607:f8b0:4864:20::f35]:45001) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1isBD1-0006io-NC for guix-devel@gnu.org; Thu, 16 Jan 2020 14:53:39 -0500 Received: by mail-qv1-xf35.google.com with SMTP id n8so9650296qvg.11 for ; Thu, 16 Jan 2020 11:53:39 -0800 (PST) In-Reply-To: List-Id: "Development of GNU Guix and the GNU System distribution." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guix-devel-bounces+gcggd-guix-devel=m.gmane-mx.org@gnu.org Sender: "Guix-devel" To: Arun Isaac Cc: Guix Devel Hi Arun, On Thu, 16 Jan 2020 at 20:11, Arun Isaac wrote: > > zimoun writes: > > > About (1), let implement something experimental and time it to compare > > apples with apples. :-) > > I mean I am working on it. > > I am interested in working on an experimental sqlite based > implementation too. So, do let me know your plans so we don't duplicate > too much work. :-) Why not directly go for SQLite. But some details are not clear to me and they are clearer with a Guile hash table or VHash. What is not clear to me right now in both implementations are. 1. How to update the index. Give a look at the "pull" code and the ~/.cache/guix folder. 2. How to deal with regexp. It is more or less clear to me how to deal with using the trigram keys but I do not know with SQLite; I have not thought about yet. Basically, the current search works like that: the CLI arguments are transformed in 'patterns', then transformed in 'regexps' which is basically a list of terms; then 'find-packages-by-description' is applied. --8<---------------cut here---------------start------------->8--- (let* ((patterns (filter-map (match-lambda (('query 'search rx) rx) (_ #f)) opts)) (regexps (map (cut make-regexp* <> regexp/icase) patterns)) (matches (find-packages-by-description regexps))) --8<---------------cut here---------------end--------------->8--- The 'find-packages-by-description' applies a 'fold-packages' where each term of the regexps list is lookup in each package (name, synopsis, description, location, etc.) computing the relevance with 'package-relevance'. --8<---------------cut here---------------start------------->8--- (let ((matches (fold-packages (lambda (package result) (if (package-superseded package) result (match (package-relevance package regexps) ((? zero?) result) (score (cons (cons package score) result))))) '()))) --8<---------------cut here---------------end--------------->8--- Therefore, this call to 'fold-packages' needs to be replaced. Using the trigram keys, I see more or less how to output the list of packages where the key matches one term from the list of regexps. Some False Positive will be filtered out then by the 'package-relevance' function. Using the SQLite, I do not know now. I need to read a bit about SQL query. :-) If you want to implement it, go ahead. :-) Otherwise, I will try to finish next week what I started yesterday evening using VHash. :-) (note that to avoid duplicate , the file sets.scm can be relevant) All the best, simon