unofficial mirror of guix-devel@gnu.org 
* Re: File search
@ 2022-12-02 17:58 antoine.romain.dumont
  2022-12-02 18:22 ` Antoine R. Dumont (@ardumont)
  0 siblings, 1 reply; 33+ messages in thread
From: antoine.romain.dumont @ 2022-12-02 17:58 UTC (permalink / raw)
  To: guix-devel

[-- Attachment #1: Type: text/plain, Size: 3099 bytes --]

Hello Guix!

Guix is great, so thanks for the awesome work!

Just to give some feedback on this thread: it's good news that file
search functionality is on the radar.

> Lately I found myself going several times to
> <https://packages.debian.org> to look for packages providing a given
> file and I thought it’s time to do something about it.

I've finally started to set up my machine with Guix System (and
Guix Home). Finding out which package provides a given program or CLI
is definitely something I need in order to port my existing setup
(mainly nixified Debian or NixOS machines) to Guix.

To answer such questions, I used existing "offline" programs on my
machines. I bounced back and forth between `nix-locate` and `apt-file
search` to work out approximately which Guix packages I wanted (names
usually aren't that different).

Hence, as a user, one of my expectations is that the Guix CLI provides
an equivalent program to look up the package that provides a file ;).

> The script below creates an SQLite database for the current set of
> packages, but only for those already in the store:
>
>   guix repl file-database.scm populate
>
> That creates /tmp/db; it took about 25 min on berlin, for 18K packages.
> Then you can run, say:
>
>   guix repl file-database.scm search boot-9.scm
>
> to find which packages provide a file named ‘boot-9.scm’.  That part is
> instantaneous.
>
> The database for 18K packages is quite big:
>
> --8<---------------cut here---------------start------------->8---
> $ du -h /tmp/db*
> 389M    /tmp/db
> 82M     /tmp/db.gz
> 61M     /tmp/db.zst
> --8<---------------cut here---------------end--------------->8---

For information, in a more recent implementation (which @civodul gave
me on #guix-devel), I noticed that running the indexing step several
times duplicates information (at all levels: packages, files,
directories). That may have inflated the sizes above (if Ludo ran the
script several times back then).

Just so you know, I have started iterating on that implementation: I
fixed the duplication caveat mentioned above and added a help message.
I'll follow up with it shortly (in this thread) to get more feedback
on it.
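
To illustrate the duplication caveat, here is a minimal sketch of one
way to make the package insertion idempotent. It is not necessarily the
fix used in the follow-up implementation: the `unique (name, version)`
constraint and the `insert-package-once` procedure are assumptions made
up for this example, and it only relies on guile-sqlite3 calls already
used by the script.

```scheme
(use-modules (sqlite3) (ice-9 match))

;; Hypothetical variant of the Packages table: the UNIQUE constraint lets
;; INSERT OR IGNORE skip a (name, version) pair that is already recorded.
(define schema
  "
create table if not exists Packages (
  id        integer primary key autoincrement not null,
  name      text not null,
  version   text not null,
  unique (name, version)
);")

(define (insert-package-once db name version)
  "Record NAME at VERSION in DB unless already present; return its row id."
  (define insert-stmt
    (sqlite-prepare db "\
INSERT OR IGNORE INTO Packages(name, version)
VALUES (:name, :version);"
                    #:cache? #t))

  (define lookup-stmt
    (sqlite-prepare db "\
SELECT id FROM Packages
WHERE name = :name AND version = :version;"
                    #:cache? #t))

  ;; Insert the row; this is a no-op when the pair already exists.
  (sqlite-reset insert-stmt)
  (sqlite-bind-arguments insert-stmt #:name name #:version version)
  (sqlite-fold (const #t) #t insert-stmt)

  ;; Look the id up explicitly rather than via last_insert_rowid(), which
  ;; would not refer to this row when the insert was ignored.
  (sqlite-reset lookup-stmt)
  (sqlite-bind-arguments lookup-stmt #:name name #:version version)
  (match (sqlite-fold cons '() lookup-stmt)
    ((#(id)) id)))
```

Calling `insert-package-once` twice with the same name and version
should then return the same id instead of creating a second row. The
Directories and Files tables would need similar treatment, or the
caller could simply skip packages that are already recorded.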

> How do we expose that information?  There are several criteria I can
> think of: accuracy, freshness, privacy, responsiveness, off-line
> operation.
>
> I think accuracy (making sure you get results that correspond precisely
> to, say, your current channel revisions and your current system) is not
> a high priority: some result is better than no result.

I definitely agree with this, at least from the offline-use
perspective. I did not focus at all on the second part of the problem
("online" and distribution use).

> Likewise for freshness: results for an older version of a given
> package may still be valid now.

Indeed.

Cheers,
--
tony / Antoine R. Dumont (@ardumont)

-----------------------------------------------------------------
gpg fingerprint BF00 203D 741A C9D5 46A8 BE07 52E2 E984 0D10 C3B8


* File search
@ 2022-01-21  9:03 Ludovic Courtès
  2022-01-21 10:35 ` Mathieu Othacehe
                   ` (4 more replies)
  0 siblings, 5 replies; 33+ messages in thread
From: Ludovic Courtès @ 2022-01-21  9:03 UTC (permalink / raw)
  To: Guix Devel

[-- Attachment #1: Type: text/plain, Size: 2372 bytes --]

Hello Guix!

Lately I found myself going several times to
<https://packages.debian.org> to look for packages providing a given
file and I thought it’s time to do something about it.

The script below creates an SQLite database for the current set of
packages, but only for those already in the store:

  guix repl file-database.scm populate

That creates /tmp/db; it took about 25 min on berlin, for 18K packages.
Then you can run, say:

  guix repl file-database.scm search boot-9.scm

to find which packages provide a file named ‘boot-9.scm’.  That part is
instantaneous.

The database for 18K packages is quite big:

--8<---------------cut here---------------start------------->8---
$ du -h /tmp/db*
389M    /tmp/db
82M     /tmp/db.gz
61M     /tmp/db.zst
--8<---------------cut here---------------end--------------->8---

How do we expose that information?  There are several criteria I can
think of: accuracy, freshness, privacy, responsiveness, off-line
operation.

I think accuracy (making sure you get results that correspond precisely
to, say, your current channel revisions and your current system) is not
a high priority: some result is better than no result.  Likewise for
freshness: results for an older version of a given package may still be
valid now.

In terms of privacy, I think it’s better if we can avoid making one
request per file searched for.  Off-line operation would be sweet, and
it comes with responsiveness; fast off-line search is necessary for
things like ‘command-not-found’ (where the shell tells you what package
to install when a command is not found).

Based on that, it is tempting to just distribute a full database from
ci.guix, say, that the client command would regularly fetch.  The
downside is that that’s quite a lot of data to download; if you use the
file search command infrequently, you might find yourself spending more
time downloading the database than actually searching it.

We could have a hybrid solution: distribute a database that contains
only files in /bin and /sbin (it should be much smaller), and for
everything else, resort to a web service (the Data Service could be
extended to include file lists).  That way, we’d have fast
privacy-respecting search for command names, and on-line search for
everything else.
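
As a rough sketch of the selection step for the small database, the
following keeps only files that sit directly under bin/ or sbin/. The
procedure names `command-file?` and `command-files` are made up for
this example, and it assumes file names relative to the store item, as
stored in the Files table of the script below.

```scheme
(use-modules (srfi srfi-1))

(define (command-file? file)
  "Return true if FILE, a name relative to a store item such as
\"bin/guile\", sits directly under bin/ or sbin/."
  (any (lambda (prefix)
         (and (string-prefix? prefix file)
              ;; Exclude files in deeper sub-directories, such as
              ;; \"bin/sub/tool\".
              (not (string-index file #\/ (string-length prefix)))))
       '("bin/" "sbin/")))

(define (command-files files)
  "Return the elements of FILES that would go into the small database."
  (filter command-file? files))
```

For instance, `(command-files '("bin/guile" "sbin/fsck"
"share/doc/README" "bin/sub/tool"))` returns `("bin/guile"
"sbin/fsck")`.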

Thoughts?

Ludo’.


[-- Attachment #2: The file database tool --]
[-- Type: text/plain, Size: 7549 bytes --]

;;; GNU Guix --- Functional package management for GNU
;;; Copyright © 2022 Ludovic Courtès <ludo@gnu.org>
;;;
;;; This file is part of GNU Guix.
;;;
;;; GNU Guix is free software; you can redistribute it and/or modify it
;;; under the terms of the GNU General Public License as published by
;;; the Free Software Foundation; either version 3 of the License, or (at
;;; your option) any later version.
;;;
;;; GNU Guix is distributed in the hope that it will be useful, but
;;; WITHOUT ANY WARRANTY; without even the implied warranty of
;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
;;; GNU General Public License for more details.
;;;
;;; You should have received a copy of the GNU General Public License
;;; along with GNU Guix.  If not, see <http://www.gnu.org/licenses/>.

(define-module (file-database)
  #:use-module (sqlite3)
  #:use-module (ice-9 match)
  #:use-module (guix store)
  #:use-module (guix monads)
  #:autoload   (guix grafts) (%graft?)
  #:use-module (guix derivations)
  #:use-module (guix packages)
  #:autoload   (guix build utils) (find-files)
  #:autoload   (gnu packages) (fold-packages)
  #:use-module (srfi srfi-1)
  #:use-module (srfi srfi-9)
  #:export (file-database))

(define schema
  "
create table if not exists Packages (
  id        integer primary key autoincrement not null,
  name      text not null,
  version   text not null
);

create table if not exists Directories (
  id        integer primary key autoincrement not null,
  name      text not null,
  package   integer not null,
  foreign key (package) references Packages(id) on delete cascade
);

create table if not exists Files (
  name      text not null,
  basename  text not null,
  directory integer not null,
  foreign key (directory) references Directories(id) on delete cascade
);

create index if not exists IndexFiles on Files(basename);")

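(define (call-with-database file proc)
  "Open the SQLite database FILE, creating the schema if needed, call PROC
with the database handle, and close the database when PROC returns or exits
non-locally."
  (let ((db (sqlite-open file)))
    (dynamic-wind
      (lambda () #t)
      (lambda ()
        (sqlite-exec db schema)
        (proc db))
      (lambda ()
        (sqlite-close db)))))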

(define (insert-files db package version directories)
  "Insert the files contained in DIRECTORIES as belonging to PACKAGE at
VERSION."
  (define last-row-id-stmt
    (sqlite-prepare db "SELECT last_insert_rowid();"
                    #:cache? #t))

  (define package-stmt
    (sqlite-prepare db "\
INSERT OR REPLACE INTO Packages(name, version)
VALUES (:name, :version);"
                    #:cache? #t))

  (define directory-stmt
    (sqlite-prepare db "\
INSERT INTO Directories(name, package) VALUES (:name, :package);"
                    #:cache? #t))

  (define file-stmt
    (sqlite-prepare db "\
INSERT INTO Files(name, basename, directory)
VALUES (:name, :basename, :directory);"
                    #:cache? #t))

  (sqlite-exec db "begin immediate;")
  (sqlite-bind-arguments package-stmt
                         #:name package
                         #:version version)
  (sqlite-fold (const #t) #t package-stmt)
  (match (sqlite-fold cons '() last-row-id-stmt)
    ((#(package-id))
     ;; Debugging aid: print each package as it is indexed.
     (pk 'package package-id package)
     (for-each (lambda (directory)
                 (define (strip file)
                   (string-drop file (+ (string-length directory) 1)))

                 (sqlite-reset directory-stmt)
                 (sqlite-bind-arguments directory-stmt
                                        #:name directory
                                        #:package package-id)
                 (sqlite-fold (const #t) #t directory-stmt)

                 (match (sqlite-fold cons '() last-row-id-stmt)
                   ((#(directory-id))
                    (for-each (lambda (file)
                                ;; If DIRECTORY is a symlink, (find-files
                                ;; DIRECTORY) returns the DIRECTORY singleton.
                                (unless (string=? file directory)
                                  (sqlite-reset file-stmt)
                                  (sqlite-bind-arguments file-stmt
                                                         #:name (strip file)
                                                         #:basename
                                                         (basename file)
                                                         #:directory
                                                         directory-id)
                                  (sqlite-fold (const #t) #t file-stmt)))
                              (find-files directory)))))
               directories)
     (sqlite-exec db "commit;"))))

(define (insert-package db package)
  "Insert all the files of PACKAGE into DB."
  (mlet %store-monad ((drv (package->derivation package #:graft? #f)))
    (match (derivation->output-paths drv)
      (((labels . directories) ...)
       (when (every file-exists? directories)
         (insert-files db (package-name package) (package-version package)
                       directories))
       (return #t)))))

(define (insert-packages db)
  "Insert all the current packages into DB."
  (with-store store
    (parameterize ((%graft? #f))
      (fold-packages (lambda (package _)
                       (run-with-store store
                         (insert-package db package)))
                     #t
                     #:select? (lambda (package)
                                 (and (not (hidden-package? package))
                                      (not (package-superseded package))
                                      (supported-package? package)))))))

(define-record-type <package-match>
  (package-match name version file)
  package-match?
  (name      package-match-name)
  (version   package-match-version)
  (file      package-match-file))

(define (matching-packages db file)
  "Return a list of <package-match> corresponding to packages containing
FILE."
  (define lookup-stmt
    (sqlite-prepare db "\
SELECT Packages.name, Packages.version, Directories.name, Files.name
FROM Packages
INNER JOIN Files, Directories
ON files.basename = :file AND directories.id = files.directory AND packages.id = directories.package;"))

  (sqlite-bind-arguments lookup-stmt #:file file)
  (sqlite-fold (lambda (result lst)
                 (match result
                   (#(package version directory file)
                    (cons (package-match package version
                                         (string-append directory "/" file))
                          lst))))
               '() lookup-stmt))

(define (file-database . args)
  (match args
    ((_ "populate")
     (call-with-database "/tmp/db"
       (lambda (db)
         (insert-packages db))))
    ((_ "search" file)
     (let ((matches (call-with-database "/tmp/db"
                      (lambda (db)
                        (matching-packages db file)))))
       (for-each (lambda (result)
                   (format #t "~20a ~a~%"
                           (string-append (package-match-name result)
                                          "@" (package-match-version result))
                           (package-match-file result)))
                 matches)
       (exit (pair? matches))))
    (_
     (format (current-error-port)
             "usage: file-database [populate|search] args ...~%")
     (exit 1))))

(apply file-database (command-line))


end of thread, other threads:[~2022-12-20 11:23 UTC | newest]

Thread overview: 33+ messages
2022-12-02 17:58 File search antoine.romain.dumont
2022-12-02 18:22 ` Antoine R. Dumont (@ardumont)
2022-12-03 18:19   ` Ludovic Courtès
2022-12-04 16:35     ` Antoine R. Dumont (@ardumont)
2022-12-06 10:01       ` Ludovic Courtès
2022-12-06 12:59         ` zimoun
2022-12-06 18:27         ` (
2022-12-08 15:41           ` Ludovic Courtès
2022-12-09 10:05         ` Antoine R. Dumont (@ardumont)
2022-12-09 18:05           ` zimoun
2022-12-11 10:22           ` Ludovic Courtès
2022-12-15 17:03             ` Antoine R. Dumont (@ardumont)
2022-12-19 21:25               ` Ludovic Courtès
2022-12-19 22:44                 ` zimoun
2022-12-20 11:13                 ` Antoine R. Dumont (@ardumont)
  -- strict thread matches above, loose matches on Subject: below --
2022-01-21  9:03 Ludovic Courtès
2022-01-21 10:35 ` Mathieu Othacehe
2022-01-22  0:35   ` Ludovic Courtès
2022-01-21 19:00 ` Vagrant Cascadian
2022-01-22  0:37   ` Ludovic Courtès
2022-01-22  2:53     ` Maxim Cournoyer
2022-01-25 11:15       ` Ludovic Courtès
2022-01-25 11:20         ` Oliver Propst
2022-01-25 11:22           ` Oliver Propst
2022-01-22  4:46 ` raingloom
2022-01-22  7:55   ` Ricardo Wurmus
2022-01-24 15:48     ` Ludovic Courtès
2022-01-24 17:03       ` Ricardo Wurmus
2022-02-02 16:14         ` Maxim Cournoyer
2022-02-05 11:15           ` Ludovic Courtès
2022-01-25 23:45 ` Ryan Prior
2022-02-05 11:18   ` Ludovic Courtès
2022-02-06 13:27 ` André A. Gomes
