From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mp2 ([2001:41d0:2:4a6f::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by ms11 with LMTPS id y1ZUI39aMV+MUAAA0tVLHw (envelope-from ) for ; Mon, 10 Aug 2020 14:32:31 +0000 Received: from aspmx1.migadu.com ([2001:41d0:2:4a6f::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by mp2 with LMTPS id qLcIH39aMV+vVwAAB5/wlQ (envelope-from ) for ; Mon, 10 Aug 2020 14:32:31 +0000 Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by aspmx1.migadu.com (Postfix) with ESMTPS id 2C4319400BF for ; Mon, 10 Aug 2020 14:32:31 +0000 (UTC) Received: from localhost ([::1]:40162 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1k58qk-0005em-3j for larch@yhetil.org; Mon, 10 Aug 2020 10:32:30 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:35854) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1k58qX-0005dV-Pr for guix-devel@gnu.org; Mon, 10 Aug 2020 10:32:17 -0400 Received: from relay6-d.mail.gandi.net ([217.70.183.198]:38081) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1k58qU-0000AY-22 for guix-devel@gnu.org; Mon, 10 Aug 2020 10:32:17 -0400 X-Originating-IP: 86.246.37.13 Received: from bababa (lfbn-idf2-1-572-13.w86-246.abo.wanadoo.fr [86.246.37.13]) (Authenticated sender: mail@ambrevar.xyz) by relay6-d.mail.gandi.net (Postfix) with ESMTPSA id 46201C0002 for ; Mon, 10 Aug 2020 14:32:08 +0000 (UTC) From: Pierre Neidhardt To: guix-devel@gnu.org Subject: File search progress: database review and question on triggers Date: Mon, 10 Aug 2020 16:32:08 +0200 Message-ID: <87sgcuh8rb.fsf@ambrevar.xyz> MIME-Version: 1.0 Content-Type: multipart/signed; boundary="=-=-="; micalg=pgp-sha256; protocol="application/pgp-signature" Received-SPF: pass client-ip=217.70.183.198; envelope-from=mail@ambrevar.xyz; helo=relay6-d.mail.gandi.net X-detected-operating-system: by eggs.gnu.org: First seen = 2020/08/10 09:25:54 X-ACL-Warn: Detected OS = Linux 3.11 and newer X-Spam_score_int: 4 X-Spam_score: 0.4 X-Spam_bar: / X-Spam_report: (0.4 / 5.0 requ) BAYES_00=-1.9, FROM_SUSPICIOUS_NTLD=1, FROM_SUSPICIOUS_NTLD_FP=1, PDS_OTHER_BAD_TLD=1, RCVD_IN_DNSWL_LOW=-0.7, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001 autolearn=no autolearn_force=no X-Spam_action: no action X-BeenThere: guix-devel@gnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Development of GNU Guix and the GNU System distribution." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guix-devel-bounces+larch=yhetil.org@gnu.org Sender: "Guix-devel" X-Scanner: scn0 Authentication-Results: aspmx1.migadu.com; dkim=none; dmarc=none; spf=pass (aspmx1.migadu.com: domain of guix-devel-bounces@gnu.org designates 209.51.188.17 as permitted sender) smtp.mailfrom=guix-devel-bounces@gnu.org X-Spam-Score: -3.11 X-TUID: c4xDRS7AZ24I --=-=-= Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Hi! After much delay I finally got down to work on file search support for Guix. By "file search", I mean the ability to find which package contains files matching the queried pattern. If we want to be able to know which package to install, we need file search= to be able to work for packages that have not yet been installed nor built on the system. As we previously discussed, a good approach, mostly for performance reasons, would be to store all files in a database that's populated on demand from t= he substitute servers. What I've done so far: 1. An SQLite database with the following schema: =2D-8<---------------cut here---------------start------------->8--- create table if not exists Packages ( name text not null, output text default "out", system text not null, path text primary key not null, -- store path, e.g. /gnu/store/a= bcd...-foo version text not null, guix text not null -- The Guix version in which the package can be found. ); create table if not exists Files ( subpath text not null, package text not null, primary key (subpath, package), -- Same subpath can occur in multiple p= ackages. foreign key (package) references Packages(path) on delete cascade ); =2D-8<---------------cut here---------------end--------------->8--- I'm not very good with SQL, so thanks in advance for reviewing this carefully; let me know if we can do better. 2. A procedure that persists the filepaths of a given package in the databa= se. 3. Size of the database: I've persisted all locally-present store items for my current Guix versi= on and it produced a database of 72=C2=A0MiB. It compresses down to 8=C2= =A0MiB in zstd. But since we can have multiple Guix versions, this means that the packages have one entry per store path, so we might end up with more entries than that as the number of Guix generations grows. The worse case is around (number of guix generations) x ~100 MiB. If we compress, it would be at least 10x less, maybe way less. To be sustainable, I suggest that when we remove a Guix generation we "garbage-collect" the corresponding database entries. Thoughts? 4. Indexing speed: The above items took some 20 minutes to complete (on my rather powerful = machine). A single store path takes a fraction of a second to index (on an SSD). The storage device is the bottleneck here. Not sure we can do better th= an the following procedure: =2D-8<---------------cut here---------------start------------->8--- (define (directory-files path) "Return a list of all files within PATH, recursively. Each file is returned as the path relative to PATH, starting with a '/'. It's important that the first character be the directory separator because = it gives more expressive power for search. For instance, searching \"/bin\" matches both \"/bin/foo\" and \"/usr/bin/foo\" but not \"barbin\"." ;; TODO: This does not include empty directories. Should we? ;; REVIEW: Use vlist for performance? Big packages take a fraction of a ;; second on a hot cache, so it's probably not worth it. (let ((file-list '())) (ftw path (lambda (filename statinfo flag) (when (eq? flag 'regular) (set! file-list (cons (string-drop filename (string-length pat= h)) file-list))) #t)) file-list)) =2D-8<---------------cut here---------------end--------------->8--- Most of the indexing will be done by the substitute servers however, so = this is of little concern for the end user. Question: Should we include empty directories in the database? I'm temp= ted to answer no. 5. Search speed: It completes in a fraction of a second and supports SQLite patterns. Example: =2D-8<---------------cut here---------------start------------->8--- > (format-search (search-file-package "%libio%")) samba:out@4.12.3 /lib/libiov-buf-samba4.so guix:out@1.1.0-18.218a67d /share/guile/site/3.0/gnu/packages/patches/m4-gnu= lib-libio.patch =2D-8<---------------cut here---------------end--------------->8--- Question: This bounds us to the SQLite syntax for pattern matching. Is = it a problem? It seems powerful enough in practice. But maybe we can use regular expression in SQLite as well? Next points I'd like to address: 6. Automatically persist the database entry when building a package. Any idea where I should plug that in? 7. Have substitute servers distribute database content. When the user perf= orms a file search, Guix asks the substitute server for a database update. O= nly the diff should be sent over the network, not the whole thing since it m= ight be very large. Question 1: If the substitute server does not have data corresponding to= the Guix server of the user, shall we send data of the version that's the cl= osest to that of the user? Locally, if there are not many entries for the current Guix version, but= many for an older version, shall perform the search on the older version as w= ell? Multiple older versions? Question 2: How do we send "SQLite diffs"? I know xdelta for binary dif= fs, but I don't think that what we want to do here. Or else can we export t= he right SELECT result as text and send it compressed over the network? I'll send a patch soon, in the meantime let me know about your thoughts! :) Cheers! =2D-=20 Pierre Neidhardt https://ambrevar.xyz/ --=-=-= Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQEzBAEBCAAdFiEEUPM+LlsMPZAEJKvom9z0l6S7zH8FAl8xWmgACgkQm9z0l6S7 zH+z1Qf8DdD06wJVdyvkjCc3CBU0T2eC6LhHJPDX0NMXKznKzvjyL2J6oCQO+VBU eManOTSOfM5bsVpsWo4t3IUONFItDb7CSYQWYnBaa2r1in1ViNZL8tYgIgRGdg6r Ei6SXiKTA1HiiBPshoM08PPVs4+aMO0nWdSdfK3d15N1IhE9LCp2yUrJE8ok0L2S jwXQbGGdp1njE1JnT5fsTU+HgP6UYa+F/TcIz5OS4XRlp4EXcv771ryNX3vnVyMb r5qbB8eEMggoq/unjmJvn5z6IybWNUkOswgy5RNV5iZQ6H6x/k1mdG3/n1H0UWZn +R20o7e+R8aTEOzC67hdLCPNgsJk/A== =oK2k -----END PGP SIGNATURE----- --=-=-=--