From: "Ludovic Courtès" <ludo@gnu.org>
To: Maxim Cournoyer <maxim.cournoyer@gmail.com>
Cc: Vagrant Cascadian <vagrant@debian.org>, Guix Devel <guix-devel@gnu.org>
Subject: Re: File search
Date: Tue, 25 Jan 2022 12:15:43 +0100
Message-ID: <87fspcoylc.fsf@gnu.org>
In-Reply-To: <87pmok8orm.fsf@gmail.com> (Maxim Cournoyer's message of "Fri, 21 Jan 2022 21:53:17 -0500")

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

> I also had the idea of making it a package... this way only the people
> who opt to install the database locally would incur the cost (in
> bandwidth).
>
> Perhaps a question for Vagrant: talking about size, is this SQLite
> database file comparable or smaller in size to the apt-file database
> that needs to be downloaded?  With the Debian software catalog being
> about 30% bigger, I'd expect a similarly bigger file size.
>
> If Debian is doing better in terms of database file size, we could look
> at how they're doing it.

As a back-of-the-envelope estimate, here’s the amount of text that needs
to be available in the database:

--8<---------------cut here---------------start------------->8---
ludo@berlin ~/src$ sqlite3 -csv  /tmp/db 'select name,version from packages; select name from directories;select name from files;'|wc -c
197689978
ludo@berlin ~/src$ guile -c '(pk (/ 197689978 (expt 2. 20)))'

;;; (188.5318546295166)
ludo@berlin ~/src$ du -h /tmp/db
389M    /tmp/db
--8<---------------cut here---------------end--------------->8---

So SQLite, with this particular schema, ends up taking roughly twice as
much space as the lower bound.
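
The "twice as much" figure can be double-checked from the two numbers in
the transcript above.  A small Python sketch, using nothing but those
figures (and assuming, as is the default for du -h, that "389M" means
MiB):

```python
# Figures copied from the transcript above.
text_bytes = 197689978          # output of `wc -c` on the CSV dump
db_mib = 389                    # `du -h /tmp/db`, assumed to be MiB

text_mib = text_bytes / 2**20   # same conversion as the Guile one-liner
overhead = db_mib / text_mib    # on-disk size relative to the raw text

print(f"{text_mib:.1f} MiB of text, overhead factor {overhead:.2f}")
# prints: 188.5 MiB of text, overhead factor 2.06
```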

We can do a bit better (I’m not an expert, so I’m just trying things
naively) by dropping the index and cleaning up the database:

--8<---------------cut here---------------start------------->8---
ludo@berlin ~/src$ cp /tmp/db{,.without-index}
ludo@berlin ~/src$ sqlite3  /tmp/db.without-index
SQLite version 3.32.3 2020-06-18 14:00:33
Enter ".help" for usage hints.
sqlite> drop index IndexFiles;
sqlite> .quit
ludo@berlin ~/src$ du -h /tmp/db.without-index 
389M    /tmp/db.without-index
ludo@berlin ~/src$ sqlite3  /tmp/db.without-index 
SQLite version 3.32.3 2020-06-18 14:00:33
Enter ".help" for usage hints.
sqlite> vacuum;
sqlite> .quit
ludo@berlin ~/src$ du -h /tmp/db.without-index 
290M    /tmp/db.without-index
--8<---------------cut here---------------end--------------->8---
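
The same behaviour can be reproduced on a toy database.  The sketch
below uses Python's sqlite3 module rather than the sqlite3 shell.  The
index name comes from the transcript; the assumption that IndexFiles
covers the "name" column of "files", the row contents, and the row
count are all made up for illustration.  It is meant to show why du
reports the same size right after DROP INDEX: the freed pages stay in
the file until VACUUM rewrites it.

```python
import os
import sqlite3
import tempfile

def size_after_each_step(rows=20000):
    """Return (with_index, after_drop, after_vacuum) file sizes in bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "db")
        # Autocommit mode, so that VACUUM never runs inside a transaction.
        db = sqlite3.connect(path, isolation_level=None)
        # Make sure freed pages are not reclaimed automatically.
        db.execute("PRAGMA auto_vacuum = NONE")
        db.execute("CREATE TABLE files (name TEXT)")
        db.execute("BEGIN")
        db.executemany(
            "INSERT INTO files VALUES (?)",
            ((f"/gnu/store/item-{i}/bin/tool-{i}",) for i in range(rows)))
        db.execute("COMMIT")
        # Hypothetical index definition; the real schema may differ.
        db.execute("CREATE INDEX IndexFiles ON files (name)")
        with_index = os.path.getsize(path)

        db.execute("DROP INDEX IndexFiles")
        after_drop = os.path.getsize(path)    # freed pages are still in the file

        db.execute("VACUUM")
        after_vacuum = os.path.getsize(path)  # file rewritten without them
        db.close()
        return with_index, after_drop, after_vacuum

with_index, after_drop, after_vacuum = size_after_each_step()
print(with_index, after_drop, after_vacuum)
```

The exact byte counts depend on the SQLite build and page size, so none
are shown here; the point is only that the last number is smaller than
the first.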

With compression:

--8<---------------cut here---------------start------------->8---
ludo@berlin ~/src$ zstd -19 < /tmp/db.without-index > /tmp/db.without-index.zst
ludo@berlin ~/src$ du -h /tmp/db.without-index.zst 
37M     /tmp/db.without-index.zst
--8<---------------cut here---------------end--------------->8---

(Down from 61MB.)  For comparison, this is smaller than guile, perl,
and gtk+, and roughly the same size as glibc:out.

For the record, with compression, the lower bound is about 12 MiB:

--8<---------------cut here---------------start------------->8---
ludo@berlin ~/src$ sqlite3 -csv  /tmp/db 'select name,version from packages; select name from directories;select name from files;'|zstd -19|wc -c
12128674
ludo@berlin ~/src$ guile -c '(pk (/ 12128674 (expt 2. 20)))'

;;; (11.566804885864258)
--8<---------------cut here---------------end--------------->8---

All this to say that we could distribute the database in a form that
gets closer to the optimal size, at the expense of extra processing on
the client side upon reception to put it into shape (creating an index,
etc.).
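
To make that trade-off concrete, here is a rough Python sketch of both
sides.  Everything in it is an assumption made for illustration: the
function names publish and install are hypothetical, the index is
assumed to cover the "name" column of "files", and lzma from the Python
standard library stands in for zstd.  It is not a proposal for the
actual implementation.

```python
import lzma
import os
import sqlite3
import tempfile

def publish(path):
    """Server side: drop the index, compact the file, and compress it."""
    db = sqlite3.connect(path, isolation_level=None)
    db.execute("DROP INDEX IF EXISTS IndexFiles")
    db.execute("VACUUM")
    db.close()
    with open(path, "rb") as f:
        return lzma.compress(f.read())   # stand-in for `zstd -19`

def install(blob, dest):
    """Client side: decompress, then rebuild the index locally."""
    with open(dest, "wb") as f:
        f.write(lzma.decompress(blob))
    db = sqlite3.connect(dest, isolation_level=None)
    # Hypothetical index definition; the real schema may differ.
    db.execute("CREATE INDEX IF NOT EXISTS IndexFiles ON files (name)")
    return db

with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny database that plays the role of /tmp/db.
    src = os.path.join(tmp, "server.db")
    db = sqlite3.connect(src, isolation_level=None)
    db.execute("CREATE TABLE files (name TEXT)")
    db.executemany("INSERT INTO files VALUES (?)",
                   [("bin/guile",), ("bin/perl",)])
    db.execute("CREATE INDEX IndexFiles ON files (name)")
    db.close()

    client = install(publish(src), os.path.join(tmp, "client.db"))
    indexes = [row[0] for row in client.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index'")]
    hits = client.execute(
        "SELECT name FROM files WHERE name = ?", ("bin/perl",)).fetchall()
    client.close()

print(indexes, hits)
# prints: ['IndexFiles'] [('bin/perl',)]
```

The client pays for the CREATE INDEX step once, on reception, and in
exchange the download carries no index pages at all.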

Ludo’.

