unofficial mirror of guix-devel@gnu.org 
* File search progress: database review and question on triggers
@ 2020-08-10 14:32 Pierre Neidhardt
  2020-08-11  9:43 ` Mathieu Othacehe
                   ` (3 more replies)
  0 siblings, 4 replies; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-10 14:32 UTC (permalink / raw)
  To: guix-devel

[-- Attachment #1: Type: text/plain, Size: 5575 bytes --]

Hi!

After much delay I finally got down to work on file search support for Guix.
By "file search", I mean the ability to find which package contains files
matching the queried pattern.

If we want to be able to know which package to install, we need file search to
work for packages that have been neither installed nor built on the system.

As we previously discussed, a good approach, mostly for performance reasons,
would be to store all files in a database that's populated on demand from the
substitute servers.

What I've done so far:

1. An SQLite database with the following schema:

--8<---------------cut here---------------start------------->8---
create table if not exists Packages (
    name        text not null,
    output      text default 'out',
    system      text not null,
    path        text primary key not null, -- store path, e.g. /gnu/store/abcd...-foo
    version     text not null,
    guix        text not null  -- The Guix version in which the package can be found.
);

create table if not exists Files (
    subpath     text not null,
    package     text not null,
    primary key (subpath, package), -- Same subpath can occur in multiple packages.
    foreign key (package) references Packages(path) on delete cascade
);
--8<---------------cut here---------------end--------------->8---

   I'm not very good with SQL, so thanks in advance for reviewing this
   carefully; let me know if we can do better.
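To make the intended queries concrete, here is a sketch of the schema in
action, using Python's sqlite3 module purely for illustration (the Guile
bindings would look different); the store path and sample rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # needed for the cascade to take effect
conn.executescript("""
create table if not exists Packages (
    name    text not null,
    output  text default 'out',
    system  text not null,
    path    text primary key not null,
    version text not null,
    guix    text not null
);
create table if not exists Files (
    subpath text not null,
    package text not null,
    primary key (subpath, package),
    foreign key (package) references Packages(path) on delete cascade
);
""")
conn.execute("insert into Packages values "
             "('samba', 'out', 'x86_64-linux', "
             "'/gnu/store/abcd-samba-4.12.3', '4.12.3', '1.1.0')")
conn.execute("insert into Files values "
             "('/lib/libiov-buf-samba4.so', '/gnu/store/abcd-samba-4.12.3')")

# The file search boils down to a join plus a LIKE pattern on the subpath.
rows = conn.execute("""
    select p.name, p.output, p.version, f.subpath
    from Files f join Packages p on f.package = p.path
    where f.subpath like '%libio%'
""").fetchall()
print(rows)  # → [('samba', 'out', '4.12.3', '/lib/libiov-buf-samba4.so')]
```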

2. A procedure that persists the filepaths of a given package in the database.

3. Size of the database:
   I've persisted all locally-present store items for my current Guix version
   and it produced a database of 72 MiB.  It compresses down to 8 MiB in zstd.

   But since we can have multiple Guix versions, this means that the
   packages have one entry per store path, so we might end up with more
   entries than that as the number of Guix generations grows.

   The worst case is around (number of Guix generations) x ~100 MiB.

   If we compress, it would be at least 10x smaller, maybe much smaller.

   To be sustainable, I suggest that when we remove a Guix generation we
   "garbage-collect" the corresponding database entries.

   Thoughts?

4. Indexing speed:
   The above items took some 20 minutes to complete (on my rather powerful machine).
   A single store path takes a fraction of a second to index (on an SSD).
   The storage device is the bottleneck here.  Not sure we can do better than
   the following procedure:

--8<---------------cut here---------------start------------->8---
(use-modules (ice-9 ftw))  ; for 'ftw'

(define (directory-files path)
  "Return a list of all files within PATH, recursively.
Each file is returned as the path relative to PATH, starting with a '/'.

It's important that the first character be the directory separator because it
gives more expressive power for search.  For instance, searching \"/bin\"
matches both \"/bin/foo\" and \"/usr/bin/foo\" but not \"barbin\"."
  ;; TODO: This does not include empty directories.  Should we?
  ;; REVIEW: Use vlist for performance?  Big packages take a fraction of a
  ;; second on a hot cache, so it's probably not worth it.
  (let ((file-list '()))
    (ftw path
         (lambda (filename statinfo flag)
           (when (eq? flag 'regular)
             (set! file-list (cons (string-drop filename (string-length path))
                                   file-list)))
           #t))  ; return #t so the traversal continues
    file-list))
--8<---------------cut here---------------end--------------->8---

   Most of the indexing will be done by the substitute servers, however, so this
   is of little concern to the end user.

   Question: Should we include empty directories in the database?  I'm tempted
   to answer no.

5. Search speed: It completes in a fraction of a second and supports
   SQLite patterns.  Example:

--8<---------------cut here---------------start------------->8---
> (format-search (search-file-package "%libio%"))
samba:out@4.12.3	/lib/libiov-buf-samba4.so
guix:out@1.1.0-18.218a67d	/share/guile/site/3.0/gnu/packages/patches/m4-gnulib-libio.patch
--8<---------------cut here---------------end--------------->8---

   Question: This binds us to the SQLite syntax for pattern matching.  Is that a
   problem?
   It seems powerful enough in practice.  But maybe we can use regular
   expressions in SQLite as well?


Next points I'd like to address:

6. Automatically persist the database entry when building a package.
   Any idea where I should plug that in?

7. Have substitute servers distribute database content.  When the user performs
   a file search, Guix asks the substitute server for a database update.  Only
   the diff should be sent over the network, not the whole thing since it might
   be very large.

   Question 1: If the substitute server does not have data corresponding to the
   user's Guix version, shall we send the data of the version closest to the
   user's?
   Locally, if there are few entries for the current Guix version but many
   for an older version, shall we perform the search on the older version as
   well?  On multiple older versions?

   Question 2: How do we send "SQLite diffs"?  I know xdelta for binary diffs,
   but I don't think that's what we want here.  Alternatively, could we export
   the right SELECT result as text and send it compressed over the network?

I'll send a patch soon, in the meantime let me know about your thoughts!
:)

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-10 14:32 File search progress: database review and question on triggers Pierre Neidhardt
@ 2020-08-11  9:43 ` Mathieu Othacehe
  2020-08-11 12:35   ` Pierre Neidhardt
  2020-08-11 15:43 ` Ricardo Wurmus
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 45+ messages in thread
From: Mathieu Othacehe @ 2020-08-11  9:43 UTC (permalink / raw)
  To: Pierre Neidhardt; +Cc: guix-devel


Hello Pierre,

Thanks for sharing your progress. A few remarks below.

> 3. Size of the database:
>    I've persisted all locally-present store items for my current Guix version
>    and it produced a database of 72 MiB.  It compresses down to 8 MiB in zstd.
>
>    But since we can have multiple Guix versions, this means that the
>    packages have one entry per store path, so we might end up with more
>    entries than that as the number of Guix generations grows.

I'm not sure we actually need to save the full history. I think we could
just store that the package X produces [Y1, Y2, ...]  executable files. Then
on package X update, the executable files list could be refreshed.

>    Question: Should we include empty directories in the database?  I'm tempted
>    to answer no.

I would also say no, and also exclude non-executable files.

>    Question: This bounds us to the SQLite syntax for pattern matching.  Is it a
>    problem?
>    It seems powerful enough in practice.  But maybe we can use regular
>    expression in SQLite as well?

From the UI perspective, we already have "guix search" that expects a
regex. If we were to include a "guix file-search" command, then I think
it would make sense that it uses the same regex syntax.

> Next points I'd like to address:
>
> 6. Automatically persist the database entry when building a package.
>    Any idea where I should plug that in?
>
> 7. Have substitute servers distribute database content.  When the user performs
>    a file search, Guix asks the substitute server for a database update.  Only
>    the diff should be sent over the network, not the whole thing since it might

If I understand correctly, you are proposing to create local databases
that would be kept in sync with a master database populated by the CI
server. This seems a bit complex.

What about extending Cuirass database to add the two tables you are
proposing. Then, each time a package is built, if the version is
updated, the "Files" table would be updated.

Then we could add an HTTP interface such as "/search/file?query=libxxx"
to Cuirass, that would directly query the database. In Guix itself, we
could add the counterpart in the (guix ci) module.

WDYT?

Thanks,

Mathieu



* Re: File search progress: database review and question on triggers
  2020-08-11  9:43 ` Mathieu Othacehe
@ 2020-08-11 12:35   ` Pierre Neidhardt
  2020-08-15 12:48     ` Hartmut Goebel
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-11 12:35 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 3454 bytes --]

Hi Mathieu,

Thanks for your comments!
Answers below.

>> 3. Size of the database:
>>    I've persisted all locally-present store items for my current Guix version
>>    and it produced a database of 72 MiB.  It compresses down to 8 MiB in zstd.
>>
>>    But since we can have multiple Guix versions, this means that the
>>    packages have one entry per store path, so we might end up with more
>>    entries than that as the number of Guix generations grows.
>
> I'm not sure we actually need to save the full history. I think we could
> just store that the package X produces [Y1, Y2, ...]  executable files. Then
> on package X update, the executable files list could be refreshed.

Maybe you are missing some context.  The original discussion is there:
https://lists.gnu.org/archive/html/guix-devel/2020-01/msg00019.html.

Unlike Nix, we would like to do more than just index executable files.
Indeed, it's very useful to know where to find, say, a C header, a .so
library, a TeXlive .sty file, etc.

However, as you hinted, maybe it's unnecessary to save the file listings of
the packages for every Guix version.  Maybe we could store only the
"diffs" between Guix generations.  I don't know whether SQLite supports
this; if not, it sounds like a rather complex thing to do.

But really, if the compressed database over multiple Guix generations is
<100 MiB, then size is not a big problem.

>>    Question: Should we include empty directories in the database?  I'm tempted
>>    to answer no.
>
> I would also say no, and also exclude non-executable files.

See above; I think we would lose a lot by not including non-executable files.

>>    Question: This bounds us to the SQLite syntax for pattern matching.  Is it a
>>    problem?
>>    It seems powerful enough in practice.  But maybe we can use regular
>>    expression in SQLite as well?
>
> From the UI perspective, we already have "guix search" that expects a
> regex. If we were to include a "guix file-search" command, then I think
> it would make sense that it uses the same regex syntax.

I found out that SQLite has a REGEXP operator, I'll see if it works well enough.

>> 7. Have substitute servers distribute database content.  When the user performs
>>    a file search, Guix asks the substitute server for a database update.  Only
>>    the diff should be sent over the network, not the whole thing since it might
>
> If I understand correctly, you are proposing to create local databases
> that would be kept in sync with a master database populated by the CI
> server. This seems a bit complex.
>
> What about extending Cuirass database to add the two tables you are
> proposing. Then, each time a package is built, if the version is
> updated, the "Files" table would be updated.
>
> Then we could add an HTTP interface such as "/search/file?query=libxxx"
> to Cuirass, that would directly query the database. In Guix itself, we
> could add the counterpart in the (guix ci) module.

The problem with this approach is that it would not work offline.  In my
opinion, this is a big limitation.  I'd rather have a local database.

Besides, we need a local database anyway for non-official, locally-built
packages (which Cuirass would not know about).  Since this is a
requirement, the only missing piece is database synchronization.

Thoughts?

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]


* Re: File search progress: database review and question on triggers
  2020-08-10 14:32 File search progress: database review and question on triggers Pierre Neidhardt
  2020-08-11  9:43 ` Mathieu Othacehe
@ 2020-08-11 15:43 ` Ricardo Wurmus
  2020-08-11 17:54   ` Pierre Neidhardt
  2020-08-18 14:58 ` File search progress: database review and question on triggers OFF TOPIC PRAISE Joshua Branson
  2020-08-27 10:00 ` File search progress: database review and question on triggers zimoun
  3 siblings, 1 reply; 45+ messages in thread
From: Ricardo Wurmus @ 2020-08-11 15:43 UTC (permalink / raw)
  To: Pierre Neidhardt; +Cc: guix-devel


Pierre Neidhardt <mail@ambrevar.xyz> writes:

> 3. Size of the database:
>    I've persisted all locally-present store items for my current Guix version
>    and it produced a database of 72 MiB.  It compresses down to 8 MiB
>    in zstd.

For comparison, my laptop’s store contains 1,103,543 files (691,994
excluding .links).  The updatedb database for all of them is 86 MB and
takes ~6 seconds to generate:

    time updatedb \
         --localpaths=/gnu/store \
         --findoptions='( -path /gnu/store/.links -o -name *.drv -o -name *.chroot ) -prune -o -type f -print' \
         --output=/tmp/dbfile

    locate -d /tmp/dbfile ecxc0800

(This could be further tweaked to exclude links…)

>    The worse case is around (number of guix generations) x ~100 MiB.

This seems a little excessive.

> 4. Indexing speed:
>    The above items took some 20 minutes to complete (on my rather
>    powerful machine).

Oof.  The updatedb hack above takes 6 seconds on my i7-6500U CPU @
2.50GHz with SSD.

I’m not suggesting to use updatedb, but I think it can be instructive to
look at how the file database is implemented there.  We don’t have to
use SQLite if it is much slower and heavier than a custom inverted
index.

-- 
Ricardo



* Re: File search progress: database review and question on triggers
  2020-08-11 15:43 ` Ricardo Wurmus
@ 2020-08-11 17:54   ` Pierre Neidhardt
  2020-08-11 17:58     ` Pierre Neidhardt
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-11 17:54 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 573 bytes --]

Ricardo Wurmus <rekado@elephly.net> writes:

> Oof.  The updatedb hack above takes 6 seconds on my i7-6500U CPU @
> 2.50GHz with SSD.
>
> I’m not suggesting to use updatedb, but I think it can be instructive to
> look at how the file database is implemented there.  We don’t have to
> use SQlite if it is much slower and heavier than a custom inverted
> index.

Good call, I'll benchmark against an inverted index.

Some cost may also be induced by the Guix store queries, not sure if we
can optimize these.

--
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]


* Re: File search progress: database review and question on triggers
  2020-08-11 17:54   ` Pierre Neidhardt
@ 2020-08-11 17:58     ` Pierre Neidhardt
  2020-08-11 20:08       ` Ricardo Wurmus
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-11 17:58 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 942 bytes --]

Pierre Neidhardt <mail@ambrevar.xyz> writes:

> Ricardo Wurmus <rekado@elephly.net> writes:
>
>> I’m not suggesting to use updatedb, but I think it can be instructive to
>> look at how the file database is implemented there.  We don’t have to
>> use SQlite if it is much slower and heavier than a custom inverted
>> index.
>
> Good call, I'll benchmark against an inverted index.
>
> Some cost may also be induced by the Guix store queries, not sure if we
> can optimize these.

With an s-exp based file, or a trivial text-based format, the downside
is that it takes a bit of extra work to load only selected entries,
e.g. just the entries matching a specific Guix version.

Would you happen to know a serialization library that allows for loading
only a select portion of a file?

Otherwise a trivial workaround would be to persist one index file per
Guix generation.

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]


* Re: File search progress: database review and question on triggers
  2020-08-11 17:58     ` Pierre Neidhardt
@ 2020-08-11 20:08       ` Ricardo Wurmus
  2020-08-12 19:10         ` Pierre Neidhardt
  0 siblings, 1 reply; 45+ messages in thread
From: Ricardo Wurmus @ 2020-08-11 20:08 UTC (permalink / raw)
  To: Pierre Neidhardt; +Cc: guix-devel


Pierre Neidhardt <mail@ambrevar.xyz> writes:

> Pierre Neidhardt <mail@ambrevar.xyz> writes:
>
>> Ricardo Wurmus <rekado@elephly.net> writes:
>>
>>> I’m not suggesting to use updatedb, but I think it can be instructive to
>>> look at how the file database is implemented there.  We don’t have to
>>> use SQlite if it is much slower and heavier than a custom inverted
>>> index.
>>
>> Good call, I'll benchmark against an inverted index.
>>
>> Some cost may also be induced by the Guix store queries, not sure if we
>> can optimize these.
>
> With an s-exp based file, or a trivial text-based format, the downside
> is that it needs a bit of extra work to only load select entries,
> e.g. just the entries matching a specific Guix version.
>
> Would you happen to know a serialization library that allows for loading
> only a select portion of a file?

I don’t know of any suitable file format, but a generated offset index
at the beginning of the file could be useful.  You’d read the first
expression and then seek to the specified byte offset (after the
position of the index expression) where you then read the target
expression.  This can easily be generated and it can be extended without
having to rewrite the whole file.
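A minimal sketch of that idea (in Python for illustration; the format and
helper names are hypothetical): a one-line index at the start of the file maps
each Guix generation to a byte offset and length, so a reader seeks straight
to one generation's file list without parsing the rest:

```python
import io
import json

def write_indexed(records):
    """records: dict mapping generation -> list of file names.
    Returns the serialized blob: an index line, then the concatenated bodies."""
    bodies = {gen: ("\n".join(files) + "\n").encode()
              for gen, files in records.items()}
    index, offset = {}, 0
    for gen, body in bodies.items():
        index[gen] = [offset, len(body)]  # offset relative to the header end
        offset += len(body)
    header = (json.dumps(index) + "\n").encode()
    return header + b"".join(bodies.values())

def read_generation(blob, gen):
    buf = io.BytesIO(blob)
    index = json.loads(buf.readline())  # read only the index expression
    start, size = index[gen]
    base = buf.tell()                   # position just after the header
    buf.seek(base + start)              # seek directly to the target entry
    return buf.read(size).decode().splitlines()

files = {"g1": ["/bin/a"], "g2": ["/bin/b", "/lib/c"]}
blob = write_indexed(files)
print(read_generation(blob, "g2"))  # → ['/bin/b', '/lib/c']
```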

But perhaps that’s premature optimization.

-- 
Ricardo



* Re: File search progress: database review and question on triggers
  2020-08-11 20:08       ` Ricardo Wurmus
@ 2020-08-12 19:10         ` Pierre Neidhardt
  2020-08-12 20:13           ` Julien Lepiller
                             ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-12 19:10 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 1764 bytes --]

I've done some benchmarking.

1. I tried to fine-tune the SQL a bit:
  - Open/close the database only once for the whole indexing.
  - Use "insert" instead of "insert or replace".
  - Use numeric ID as key instead of path.

  Result: Still around 15-20 minutes to build.  Switching to numeric
  indices shrank the database by half.

2. I've tried with the following naive 1-file-per-line format:

--8<---------------cut here---------------start------------->8---
"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man5/acl.5.gz"
"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man3/acl_add_perm.3.gz"
"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man3/acl_calc_mask.3.gz"
...
--8<---------------cut here---------------end--------------->8---

  Result: Takes between 20 and 2 minutes to complete and the result is
  32 MiB big.  (I don't know why the timing varies.)

  A string-contains filter takes less than 1 second.

  A string-match (regex) search takes some 3 seconds (Ryzen 5 @ 3.5
  GHz).  I'm not sure if we can go faster.  I need to measure the time
  SQL takes for a regexp match.

Question: Any idea how to load the database as fast as possible?  I
tried the following, it takes 1.5s on my machine:

--8<---------------cut here---------------start------------->8---
(use-modules (ice-9 textual-ports))  ; for 'get-line'

(define (load-textual-database)
  (call-with-input-file %textual-db
    (lambda (port)
      (let loop ((line (get-line port))
                 (result '()))
        (if (string? line)
            (loop (get-line port) (cons line result))
            result)))))
--8<---------------cut here---------------end--------------->8---

Cheers!

--
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]


* Re: File search progress: database review and question on triggers
  2020-08-12 19:10         ` Pierre Neidhardt
@ 2020-08-12 20:13           ` Julien Lepiller
  2020-08-12 20:43             ` Pierre Neidhardt
  2020-08-12 20:32           ` Pierre Neidhardt
  2020-08-13  0:17           ` Arun Isaac
  2 siblings, 1 reply; 45+ messages in thread
From: Julien Lepiller @ 2020-08-12 20:13 UTC (permalink / raw)
  To: guix-devel, Pierre Neidhardt, Ricardo Wurmus

[-- Attachment #1: Type: text/plain, Size: 2612 bytes --]

Have you tried something more structured? I have some code for creating a binary search tree and even compressing/decompressing strings with Huffman coding, as well as code to serialize all that (my deserialization is in Java though, so not very useful to you): https://framagit.org/nani-project/nani-website

See modules/nani/trie.scm for instance.

The idea is to have a binary search tree whose keys are filenames and whose values are pointers into the file to a structure that holds the data for this filename (packages and versions of Guix, for instance). It's super fast to look up, because you don't load the whole file; you only seek to the right position. It's a lot longer to build, and not so easy to update though.
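For illustration, a toy in-memory trie of the kind described, keyed on file
names (Python rather than the Scheme in modules/nani/trie.scm, and the store
path below is made up):

```python
# A lookup walks the trie one character at a time, so a miss is detected
# as soon as a prefix diverges -- no need to scan every path.
class Trie:
    def __init__(self):
        self.children = {}
        self.value = None  # payload stored at the end of a key

    def insert(self, key, value):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.value = value

    def lookup(self, key):
        node = self
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.value

index = Trie()
index.insert("libcord.so", ["/gnu/store/...-libgc-8.0.4/lib/libcord.so"])
print(index.lookup("libcord.so"))
```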

On 2020年8月12日 15:10:08 GMT-04:00, Pierre Neidhardt <mail@ambrevar.xyz> wrote:
>I've done some benchmarking.
>
>1. I tried to fine-tune the SQL a bit:
>  - Open/close the database only once for the whole indexing.
>  - Use "insert" instead of "insert or replace".
>  - Use numeric ID as key instead of path.
>
>  Result: Still around 15-20 minutes to build.  Switching to numeric
>  indices shrank the database by half.
>
>2. I've tried with the following naive 1-file-per-line format:
>
>--8<---------------cut here---------------start------------->8---
>"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man5/acl.5.gz"
>"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man3/acl_add_perm.3.gz"
>"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man3/acl_calc_mask.3.gz"
>...
>--8<---------------cut here---------------end--------------->8---
>
>  Result: Takes between 20 and 2 minutes to complete and the result is
>  32 MiB big.  (I don't know why the timing varies.)
>
>  A string-contains filter takes less than 1 second.
>
>  A string-match (regex) search takes some 3 seconds (Ryzen 5 @ 3.5
>  GHz).  I'm not sure if we can go faster.  I need to measure the time
>  SQL takes for a regexp match.
>
>Question: Any idea how to load the database as fast as possible?  I
>tried the following, it takes 1.5s on my machine:
>
>--8<---------------cut here---------------start------------->8---
>(define (load-textual-database)
>  (call-with-input-file %textual-db
>    (lambda (port)
>      (let loop ((line (get-line port))
>                 (result '()))
>        (if (string? line)
>            (loop (get-line port) (cons line result))
>            result)))))
>--8<---------------cut here---------------end--------------->8---
>
>Cheers!
>
>--
>Pierre Neidhardt
>https://ambrevar.xyz/

[-- Attachment #2: Type: text/html, Size: 3040 bytes --]


* Re: File search progress: database review and question on triggers
  2020-08-12 19:10         ` Pierre Neidhardt
  2020-08-12 20:13           ` Julien Lepiller
@ 2020-08-12 20:32           ` Pierre Neidhardt
  2020-08-13  0:17           ` Arun Isaac
  2 siblings, 0 replies; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-12 20:32 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 319 bytes --]

Pierre Neidhardt <mail@ambrevar.xyz> writes:

>   Result: Takes between 20 and 2 minutes to complete and the result is
>   32 MiB big.  (I don't know why the timing varies.)

Typo: 20 _seconds_ to 2 minutes!  So it's faster than SQL by 1 or 2 orders of magnitude.

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]


* Re: File search progress: database review and question on triggers
  2020-08-12 20:13           ` Julien Lepiller
@ 2020-08-12 20:43             ` Pierre Neidhardt
  2020-08-12 21:29               ` Julien Lepiller
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-12 20:43 UTC (permalink / raw)
  To: Julien Lepiller, guix-devel, Ricardo Wurmus

[-- Attachment #1: Type: text/plain, Size: 1417 bytes --]

Julien Lepiller <julien@lepiller.eu> writes:

> Have you tried something more structured? I have some code for creating a binary search tree and even compressing/decompressing strings with huffman, as well as code to serialize all that (my deserialization is in Java though, so not very useful to you): https://framagit.org/nani-project/nani-website
>
> See modules/nani/trie.scm for instance.
>
> The idea is to have a binary search tree whose keys are filenames and value is a pointer in the file to a structure that holds data for this filerame (packages and versions of guix for instance). It's super fast to look it up, because you don't load the whole file, you only seek to the right position. It's a lot longer to build, and not so easy to update though.

Assuming we'd have only 1 Guix generation per file, I'm not sure a trie
would help because we _always_ need to search _all_ file paths, so in
all cases we've got some 10,000+ entries to load in memory and loop over
them.

The total number of entries is the bottleneck here, both for the
database load and the individual search queries.

An obvious optimization is to memoize the database load.  This has a
significant memory cost though.

The trie could be helpful for multiple Guix generations in the same
database, but I'm not sure it warrants the increased complexity.

Thoughts?

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]


* Re: File search progress: database review and question on triggers
  2020-08-12 20:43             ` Pierre Neidhardt
@ 2020-08-12 21:29               ` Julien Lepiller
  2020-08-12 22:29                 ` Ricardo Wurmus
  2020-08-13  6:52                 ` Pierre Neidhardt
  0 siblings, 2 replies; 45+ messages in thread
From: Julien Lepiller @ 2020-08-12 21:29 UTC (permalink / raw)
  To: Pierre Neidhardt, guix-devel, Ricardo Wurmus

[-- Attachment #1: Type: text/plain, Size: 1880 bytes --]

Why wouldn't it help? Can't you make it a trie from basename -> complete name? If I'm looking for "libcord.so" (which is a key in the trie), I don't think I need to look at every path. I only need to follow the trie until I find a pointer to some structure that contains the data I'm looking for (e.g. a list of complete filenames).

On 2020年8月12日 16:43:37 GMT-04:00, Pierre Neidhardt <mail@ambrevar.xyz> wrote:
>Julien Lepiller <julien@lepiller.eu> writes:
>
>> Have you tried something more structured? I have some code for
>creating a binary search tree and even compressing/decompressing
>strings with huffman, as well as code to serialize all that (my
>deserialization is in Java though, so not very useful to you):
>https://framagit.org/nani-project/nani-website
>>
>> See modules/nani/trie.scm for instance.
>>
>> The idea is to have a binary search tree whose keys are filenames and
>value is a pointer in the file to a structure that holds data for this
>filerame (packages and versions of guix for instance). It's super fast
>to look it up, because you don't load the whole file, you only seek to
>the right position. It's a lot longer to build, and not so easy to
>update though.
>
>Assuming we'd have only 1 Guix generation per file, I'm not sure a trie
>would help because we _always_ need to search _all_ file paths, so in
>all cases we've got some 10,000+ entries to load in memory and loop
>over
>them.
>
>The total number of entries is the bottleneck here, both for the
>database load and the individual search queries.
>
>An obvious optimization is to memoize the database load.  This has a
>significant memory cost though.
>
>The trie could be helpful for multiple Guix generations in the same
>database, but I'm not sure it warrant the increased complexity.
>
>Thoughts?
>
>-- 
>Pierre Neidhardt
>https://ambrevar.xyz/

[-- Attachment #2: Type: text/html, Size: 2275 bytes --]


* Re: File search progress: database review and question on triggers
  2020-08-12 21:29               ` Julien Lepiller
@ 2020-08-12 22:29                 ` Ricardo Wurmus
  2020-08-13  6:55                   ` Pierre Neidhardt
  2020-08-13  6:52                 ` Pierre Neidhardt
  1 sibling, 1 reply; 45+ messages in thread
From: Ricardo Wurmus @ 2020-08-12 22:29 UTC (permalink / raw)
  To: Julien Lepiller; +Cc: guix-devel


Julien Lepiller <julien@lepiller.eu> writes:

> Why wouldn't it help? Can't you make it a trie from basename -> complete name? If I'm looking for "libcord.so" (which is a key in the trie), I don't think I need to look for every path. I only need to follow the trie until I find a pointer to some structure that contains the data I look for (ex: a list of complete filenames).

Exactly!  It’s somewhat less useful (without pre-processing) for base
names and more useful for absolute file names, where you can quickly
toss out mismatches while traversing the trie.

You could also optimize the data structure by swapping the base name and
the directory name, assuming that users would want to search for either
absolute file names or base names (and not for a directory in between).

-- 
Ricardo



* Re: File search progress: database review and question on triggers
  2020-08-12 19:10         ` Pierre Neidhardt
  2020-08-12 20:13           ` Julien Lepiller
  2020-08-12 20:32           ` Pierre Neidhardt
@ 2020-08-13  0:17           ` Arun Isaac
  2020-08-13  6:58             ` Pierre Neidhardt
  2020-08-13  9:40             ` Pierre Neidhardt
  2 siblings, 2 replies; 45+ messages in thread
From: Arun Isaac @ 2020-08-13  0:17 UTC (permalink / raw)
  To: Pierre Neidhardt, Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 1756 bytes --]


Hi,

> 1. I tried to fine-tune the SQL a bit:
>   - Open/close the database only once for the whole indexing.
>   - Use "insert" instead of "insert or replace".
>   - Use numeric ID as key instead of path.
>
>   Result: Still around 15-20 minutes to build.  Switching to numeric
>   indices shrank the database by half.

sqlite insert statements can be very fast. sqlite.org claims 50000 or
more insert statements per second. But in order to achieve that speed
all insert statements have to be grouped together in a single
transaction. See https://www.sqlite.org/faq.html#q19
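
For illustration, here is what that batching looks like; this sketch
uses Python's sqlite3 module rather than Guile's bindings (table and
sample rows invented), but the BEGIN/COMMIT mechanism is the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Files (subpath TEXT, package INTEGER)")

rows = [("/bin/git", 1), ("/bin/gitk", 1), ("/bin/perl", 2)]

# One explicit transaction around all the inserts: without it, every
# INSERT becomes its own transaction and forces a sync to disk.
with conn:  # BEGIN ... COMMIT
    conn.executemany("INSERT INTO Files VALUES (?, ?)", rows)

print(conn.execute("SELECT count(*) FROM Files").fetchone()[0])  # 3
```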

>   A string-contains filter takes less than 1 second.

Guile's string-contains function uses a naive O(nk) implementation,
where 'n' is the length of string s1 and 'k' is the length of string
s2. If it was implemented using the Knuth-Morris-Pratt algorithm, it
could cost only O(n+k). So, there is some scope for improvement here. In
fact, a comment on line 2007 of libguile/srfi-13.c in the guile source
tree makes this very point.

> I need to measure the time SQL takes for a regexp match.

sqlite, by default, does not come with regexp support. You might have to
load some external library. See
https://www.sqlite.org/lang_expr.html#the_like_glob_regexp_and_match_operators

--8<---------------cut here---------------start------------->8---
The REGEXP operator is a special syntax for the regexp() user
function. No regexp() user function is defined by default and so use of
the REGEXP operator will normally result in an error message. If an
application-defined SQL function named "regexp" is added at run-time,
then the "X REGEXP Y" operator will be implemented as a call to
"regexp(Y,X)".
--8<---------------cut here---------------end--------------->8---
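
For what it's worth, the quoted mechanism is easy to try from a binding
that does expose create_function; a sketch using Python's sqlite3 module
(sample table and data invented):

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# Register an application-defined regexp(pattern, string) function;
# SQLite then evaluates "X REGEXP Y" as regexp(Y, X).
conn.create_function("regexp", 2,
                     lambda pat, s: re.search(pat, s) is not None)

conn.execute("CREATE TABLE Files (subpath TEXT)")
conn.executemany("INSERT INTO Files VALUES (?)",
                 [("/include/foo.h",), ("/bin/foo",)])

rows = conn.execute(
    "SELECT subpath FROM Files WHERE subpath REGEXP ?",
    ["/include/.*foo.*"]).fetchall()
print(rows)  # [('/include/foo.h',)]
```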

Regards,
Arun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-12 21:29               ` Julien Lepiller
  2020-08-12 22:29                 ` Ricardo Wurmus
@ 2020-08-13  6:52                 ` Pierre Neidhardt
  2020-08-13  9:34                   ` Ricardo Wurmus
  1 sibling, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-13  6:52 UTC (permalink / raw)
  To: Julien Lepiller, guix-devel, Ricardo Wurmus

[-- Attachment #1: Type: text/plain, Size: 1058 bytes --]

Julien Lepiller <julien@lepiller.eu> writes:

> Why wouldn't it help? Can't you make it a trie from basename ->
> complete name? If I'm looking for "libcord.so" (which is a key in the
> trie), I don't think I need to look for every path. I only need to
> follow the trie until I find a pointer to some structure that contains
> the data I look for (ex: a list of complete filenames).

Fair enough, but it's a more limited scope: here we assume the user
knows the exact basename.

It's a bit too limited in my opinion:

- It's all too common for shared objects to end with a .X.Y extension
  (where X.Y is a version); the version-less file is not always
  present, which means a lot of trial and error on the user's end just
  to find the right file.

- It does not cover the case where I don't know the basename, e.g. if I'm
  looking for a FOO header file my query would look like "/include/.*foo.*".

I believe it's important that the search be as general as possible.

Thoughts?

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-12 22:29                 ` Ricardo Wurmus
@ 2020-08-13  6:55                   ` Pierre Neidhardt
  0 siblings, 0 replies; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-13  6:55 UTC (permalink / raw)
  To: Ricardo Wurmus, Julien Lepiller; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 1185 bytes --]

Hi Ricardo,

Ricardo Wurmus <rekado@elephly.net> writes:

>> Why wouldn't it help? Can't you make it a trie from basename -> complete name? If I'm looking for "libcord.so" (which is a key in the trie), I don't think I need to look for every path. I only need to follow the trie until I find a pointer to some structure that contains the data I look for (ex: a list of complete filenames).
>
> Exactly!  It’s somewhat less useful (without pre-processing) for base
> names and more useful for absolute file names, where you can quickly
> toss out mismatches while traversing the trie.
>
> You could also optimize the data structure by swapping the base name and
> the directory name, assuming that users would want to search for either
> absolute file names and also for base names (and not for a directory in
> between).

By absolute file name, you mean relative to the Guix store item?  As I
mentioned in my reply to Julien, this would be a specific use-case
optimization; as you said, it won't work when you don't know the exact
basename or full path, so I'm not sure it's worth going down that road.

Thoughts?

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13  0:17           ` Arun Isaac
@ 2020-08-13  6:58             ` Pierre Neidhardt
  2020-08-13  9:40             ` Pierre Neidhardt
  1 sibling, 0 replies; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-13  6:58 UTC (permalink / raw)
  To: Arun Isaac, Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 1950 bytes --]

Hi Arun,

Arun Isaac <arunisaac@systemreboot.net> writes:

> sqlite insert statements can be very fast. sqlite.org claims 50000 or
> more insert statements per second. But in order to achieve that speed
> all insert statements have to be grouped together in a single
> transaction. See https://www.sqlite.org/faq.html#q19

Very good point, thanks a lot!  I'll try again!

>>   A string-contains filter takes less than 1 second.
>
> Guile's string-contains function uses a naive O(nk) implementation,
> where 'n' is the length of string s1 and 'k' is the length of string
> s2. If it was implemented using the Knuth-Morris-Pratt algorithm, it
> could cost only O(n+k). So, there is some scope for improvement here. In
> fact, a comment on line 2007 of libguile/srfi-13.c in the guile source
> tree makes this very point.

Thanks for this low-level insight, this is very useful!
Then if we can get Knuth-Morris-Pratt in the next Guile version, a
textual database could be an ideal option in terms of performance.

>> I need to measure the time SQL takes for a regexp match.
>
> sqlite, by default, does not come with regexp support. You might have to
> load some external library. See
> https://www.sqlite.org/lang_expr.html#the_like_glob_regexp_and_match_operators
>
> --8<---------------cut here---------------start------------->8---
> The REGEXP operator is a special syntax for the regexp() user
> function. No regexp() user function is defined by default and so use of
> the REGEXP operator will normally result in an error message. If an
> application-defined SQL function named "regexp" is added at run-time,
> then the "X REGEXP Y" operator will be implemented as a call to
> "regexp(Y,X)".
> --8<---------------cut here---------------end--------------->8---

I saw that.  Now I need to see how regexp search performs compared to
pattern search.

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13  6:52                 ` Pierre Neidhardt
@ 2020-08-13  9:34                   ` Ricardo Wurmus
  2020-08-13 10:04                     ` Pierre Neidhardt
  0 siblings, 1 reply; 45+ messages in thread
From: Ricardo Wurmus @ 2020-08-13  9:34 UTC (permalink / raw)
  To: Pierre Neidhardt; +Cc: guix-devel


Pierre Neidhardt <mail@ambrevar.xyz> writes:

> Julien Lepiller <julien@lepiller.eu> writes:
>
>> Why wouldn't it help? Can't you make it a trie from basename ->
>> complete name? If I'm looking for "libcord.so" (which is a key in the
>> trie), I don't think I need to look for every path. I only need to
>> follow the trie until I find a pointer to some structure that contains
>> the data I look for (ex: a list of complete filenames).
>
> Fair enough, but it's a more limited scope: here we assume the user
> knows the exact basename.
>
> It's a bit too limited in my opinion:
>
> - It's only too common to have shared objects ending with a .X.Y extension
>   (where X.Y is a version), the version-less file is not always present
>   which means a lot of trial and error on the user end just to search
>   the right file.

We can always abort the radix trie traversal early and print all leaves
that can be reached from the node that we arrived at.  This allows us to
search for “libc” and find “libcord.so.1.2”, “libc.so”, “libcaca.a”,
etc.

> - It does not cover the case where I don't know the basename, e.g. if I'm
>   looking for a FOO header file my query would look like "/include/.*foo.*".

I think this is a rather rare use case, which in my opinion doesn’t
justify forgoing the use of a smart data structure.  It would *still* be
possible to use the prefix tree for this kind of search, but it would
certainly not be optimal. (E.g. by searching up to the first wildcard,
and then searching each resulting sub-tree up to the first non-wildcard,
and resuming the search in all remaining sub-trees.)

> I believe it's important that the search be as general as possible.

Search should be cheap and fast.  Depending on how we arrange the data,
users can easily filter the results with grep, if necessary.  For
example, we could have a radix tree of the base names and have the
directory prefix as the leaf nodes.  The search would then involve two
steps: culling the set of directory prefixes by following the base name
trie (that’s fast), then do a simple linear search over the results
(that’s slow).

The trie can easily be serialized as a bunch of offsets that we only
need to repeatedly seek to, so we never need to hold it in memory.
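
A minimal sketch of that two-step scheme, using plain nested dicts
instead of a compressed radix tree (sample store paths invented, hashes
elided):

```python
# Step 1: a trie over base names; each terminal node carries the list
# of directory prefixes in which that base name occurs.
def trie_insert(trie, basename, directory):
    node = trie
    for ch in basename:
        node = node.setdefault(ch, {})
    node.setdefault("$dirs", []).append(directory)

def trie_search(trie, prefix):
    """All (basename, dirs) pairs whose base name starts with prefix."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    # Abort the traversal early: emit every leaf reachable from here.
    results, stack = [], [(node, prefix)]
    while stack:
        node, name = stack.pop()
        if "$dirs" in node:
            results.append((name, node["$dirs"]))
        for ch, child in node.items():
            if ch != "$dirs":
                stack.append((child, name + ch))
    return results

trie = {}
trie_insert(trie, "libc.so", "/gnu/store/...-glibc/lib")
trie_insert(trie, "libcord.so.1.2", "/gnu/store/...-gc/lib")
print(sorted(name for name, _ in trie_search(trie, "libc")))
# ['libc.so', 'libcord.so.1.2']
```

Step 2 is then an ordinary linear filter over the directory lists
attached to the matching leaves.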

-- 
Ricardo


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13  0:17           ` Arun Isaac
  2020-08-13  6:58             ` Pierre Neidhardt
@ 2020-08-13  9:40             ` Pierre Neidhardt
  2020-08-13 10:08               ` Pierre Neidhardt
                                 ` (2 more replies)
  1 sibling, 3 replies; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-13  9:40 UTC (permalink / raw)
  To: Arun Isaac, Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 1534 bytes --]

Arun Isaac <arunisaac@systemreboot.net> writes:

> sqlite insert statements can be very fast. sqlite.org claims 50000
> or...

I tried it, and got it down to... 30s!  Turns out that Guix already
provides `call-with-transaction' which I've used here.

So SQLite is back on the table.

>> I need to measure the time SQL takes for a regexp match.
>
> sqlite, by default, does not come with regexp support. You might have to
> load some external library. See
> https://www.sqlite.org/lang_expr.html#the_like_glob_regexp_and_match_operators
>
> --8<---------------cut here---------------start------------->8---
> The REGEXP operator is a special syntax for the regexp() user
> function. No regexp() user function is defined by default and so use of
> the REGEXP operator will normally result in an error message. If an
> application-defined SQL function named "regexp" is added at run-time,
> then the "X REGEXP Y" operator will be implemented as a call to
> "regexp(Y,X)".
> --8<---------------cut here---------------end--------------->8---

After more reading, I see the following options:

- Either we use the sqlite-pcre extension (which is not packaged in
  Guix).

- Or we extend guile-sqlite to add a binding to create_function
  (https://www.sqlite.org/appfunc.html).

- Or do you think SQLite patterns (using "%") would do for now?  As
  Mathieu pointed out, it's an unfortunate inconsistency with the rest of
  Guix.  But maybe regexp support can be added in a second stage.

Thoughts?

--
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13  9:34                   ` Ricardo Wurmus
@ 2020-08-13 10:04                     ` Pierre Neidhardt
  2020-08-15 12:47                       ` Hartmut Goebel
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-13 10:04 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 2149 bytes --]

Hi Ricardo,

See my recent email with the new SQLite benchmark: I can now generate
the whole database in under 30 seconds, about the same order of
magnitude as a naive text database (which holds less data).

SQLite pattern search queries are extremely fast (<0.1s) and cover all
examples named so far:

- exact basename match
- partial path match
- pattern match (e.g. "/include/%foo%")

The only thing that's missing is regexp support.  I'll see what we can do.

Considering this, I find that leveraging SQLite is very attractive at
this point:

- Fast (fastest?).
- Low on memory.
- No need to come up with our own data format.
- Less work, no need to reinvent the wheel.

>> - It does not cover the case where I don't know the basename, e.g. if I'm
>>   looking for a FOO header file my query would look like "/include/.*foo.*".
>
> I think this is a rather rare use case, which in my opinion doesn’t
> justify forgoing the use of a smart data structure.

I don't find it so rare, actually.  I've used filesearch (on other OSes)
to find

- Emacs packages where I didn't know the exact name of the .el.
- TeXlive packages (every time you get a font or a .sty error)
- Shared libraries (.so of which I didn't know the exact name)
- include files as above
- Your favourite programming language package...

> It would *still* be possible to use the prefix tree for this kind of
> search, but it would certainly not be optimal. (E.g. by searching up
> to the first wildcard, and then searching each resulting sub-tree up
> to the first non-wildcard, and resuming the search in all remaining
> sub-trees.)

But it's not a wildcard, it's a regexp.  Can we run full-path regexp queries
over a trie?

>> I believe it's important that the search be as general as possible.
>
> Search should be cheap and fast.

While very important, I would put more priority on exactness.  A search
which gives unreliable results (e.g. returns nothing while there exists
a result) is a search that would deter many users from using it, in my
experience.

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13  9:40             ` Pierre Neidhardt
@ 2020-08-13 10:08               ` Pierre Neidhardt
  2020-08-13 11:47               ` Ricardo Wurmus
  2020-08-13 12:20               ` Arun Isaac
  2 siblings, 0 replies; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-13 10:08 UTC (permalink / raw)
  To: Arun Isaac, Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 221 bytes --]

An alternative to regexp would be "full-text search":

https://www.sqlite.org/fts5.html

I don't know how fast that is.  Might also induce more disk usage.  I'll test.

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13  9:40             ` Pierre Neidhardt
  2020-08-13 10:08               ` Pierre Neidhardt
@ 2020-08-13 11:47               ` Ricardo Wurmus
  2020-08-13 13:44                 ` Pierre Neidhardt
  2020-08-13 12:20               ` Arun Isaac
  2 siblings, 1 reply; 45+ messages in thread
From: Ricardo Wurmus @ 2020-08-13 11:47 UTC (permalink / raw)
  To: Pierre Neidhardt; +Cc: guix-devel


Pierre Neidhardt <mail@ambrevar.xyz> writes:

> - Or do you think SQLite patterns (using "%") would do for now?  As
>   Mathieu pointed out, it's an unfortunate inconsistency with the rest of
>   Guix.  But maybe regexp support can be added in a second stage.

These patterns could be generated from user input that looks more like
what we already use elsewhere.  We don’t have to directly pass the
SQLite interface to users.

-- 
Ricardo


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13  9:40             ` Pierre Neidhardt
  2020-08-13 10:08               ` Pierre Neidhardt
  2020-08-13 11:47               ` Ricardo Wurmus
@ 2020-08-13 12:20               ` Arun Isaac
  2020-08-13 13:53                 ` Pierre Neidhardt
  2 siblings, 1 reply; 45+ messages in thread
From: Arun Isaac @ 2020-08-13 12:20 UTC (permalink / raw)
  To: Pierre Neidhardt, Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 1025 bytes --]


>> sqlite insert statements can be very fast. sqlite.org claims 50000
>> or...
>
> I tried it, and got it down to... 30s!

That's great! :-)

> - Or do you think SQLite patterns (using "%") would do for now?  As
>   Mathieu pointed out, it's an unfortunate inconsistency with the rest of
>   Guix.  But maybe regexp support can be added in a second stage.

The inconsistency is unfortunate. Personally, I am in favor of dropping
regexp support everywhere in Guix, and only having literal string
search. But that is backward incompatible, and may be controversial.

In this specific case of file search, we could use the sqlite like
patterns, but not expose them to the user. For example, if the search
query is "<query>", we search for the LIKE pattern "%<query>%". I think
this addresses how users normally search for files. I don't think
regexps add much value.
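
A sketch of that wrapping, with the query passed as a bound parameter
and LIKE's metacharacters escaped so the user's string is matched
literally (Python's sqlite3 for illustration; table and helper name
invented):

```python
import sqlite3

def search(conn, query):
    # Escape LIKE's metacharacters in the literal query, then wrap it
    # in "%" so it matches anywhere in the subpath.
    escaped = (query.replace("\\", "\\\\")
                    .replace("%", "\\%")
                    .replace("_", "\\_"))
    return conn.execute(
        "SELECT subpath FROM Files WHERE subpath LIKE ? ESCAPE '\\'",
        ["%" + escaped + "%"]).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Files (subpath TEXT)")
conn.executemany("INSERT INTO Files VALUES (?)",
                 [("/bin/transmission-gtk",), ("/lib/libm.so",)])
print(search(conn, "libm.so"))  # [('/lib/libm.so',)]
```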

Full text search may not be relevant to file search. Full text search is
more suited for natural language search involving such things as
stemming algorithms.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13 11:47               ` Ricardo Wurmus
@ 2020-08-13 13:44                 ` Pierre Neidhardt
  0 siblings, 0 replies; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-13 13:44 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 693 bytes --]

Ricardo Wurmus <rekado@elephly.net> writes:

> Pierre Neidhardt <mail@ambrevar.xyz> writes:
>
>> - Or do you think SQLite patterns (using "%") would do for now?  As
>>   Mathieu pointed out, it's an unfortunate inconsistency with the rest of
>>   Guix.  But maybe regexp support can be added in a second stage.
>
> These patterns could be generated from user input that looks more like
> what we already use elsewhere.  We don’t have to directly pass the
> SQLite interface to users.

True, an obvious fix to the problem of syntax familiarity would be to
replace "*" with "%" in the user query, so it looks like shell globbing.
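
That substitution, plus escaping LIKE's own metacharacters so they
match literally, is only a few lines; a sketch (helper name invented,
and the resulting pattern must be used with a matching ESCAPE clause):

```python
def glob_to_like(query: str) -> str:
    """Translate a shell-glob-style query into a SQLite LIKE pattern.
    "*" and "?" become wildcards; "%", "_" and "\\" are escaped so
    they match themselves (use with: LIKE ? ESCAPE '\\')."""
    out = []
    for ch in query:
        if ch == "*":
            out.append("%")
        elif ch == "?":
            out.append("_")
        elif ch in ("%", "_", "\\"):
            out.append("\\" + ch)
        else:
            out.append(ch)
    return "".join(out)

print(glob_to_like("/include/*foo*"))  # /include/%foo%
```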

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13 12:20               ` Arun Isaac
@ 2020-08-13 13:53                 ` Pierre Neidhardt
  2020-08-13 15:14                   ` Arun Isaac
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-13 13:53 UTC (permalink / raw)
  To: Arun Isaac, Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 1915 bytes --]

Arun Isaac <arunisaac@systemreboot.net> writes:

>> - Or do you think SQLite patterns (using "%") would do for now?  As
>>   Mathieu pointed out, it's an unfortunate inconsistency with the rest of
>>   Guix.  But maybe regexp support can be added in a second stage.
>
> The inconsistency is unfortunate. Personally, I am in favor of dropping
> regexp support everywhere in Guix, and only having literal string
> search. But that is backward incompatible, and may be controversial.
>
> In this specific case of file search, we could use the sqlite like
> patterns, but not expose them to the user. For example, if the search
> query is "<query>", we search for the LIKE pattern "%<query>%". I think
> this addresses how users normally search for files. I don't think
> regexps add much value.

I agree.

> Full text search may not be relevant to file search. Full text search is
> more suited for natural language search involving such things as
> stemming algorithms.

Yes, but full text search brings us a few niceties here:

- Wildcards using the `*` character.  This fixes the unfamiliarity of `%`.

- "Automatic word permutations" (maybe not the right term).  "foo bar" and
  "bar foo" both match the same results!  

- Case insensitive, diacritic insensitive (e.g. "e" matches "É").

- Logic: we can do "(OR foo bar) AND (OR qux quuz)".

- Relevance ranking: results can be sorted by relevance, another problem
  we don't have to fix ourselves ;)

All the above is arguably more powerful and easier to use than regexp.
But even if no user ever bothers with the logic operators, the default
behaviour "just works" in the fashion of a search engine.

The main thing I don't know how to do is suffix matches (like "%foo").
With FTS, looking up "foo*" won't match "libfoo", which is problematic.
Any clue how to fix that?

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13 13:53                 ` Pierre Neidhardt
@ 2020-08-13 15:14                   ` Arun Isaac
  2020-08-13 15:36                     ` Pierre Neidhardt
  0 siblings, 1 reply; 45+ messages in thread
From: Arun Isaac @ 2020-08-13 15:14 UTC (permalink / raw)
  To: Pierre Neidhardt, Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 1228 bytes --]


> Yes, but full text search brings us a few niceties here:

These are nice features, but I don't know if all of them are useful for
file search. Normally, with Arch's pkgfile, I search for some missing
header file, shared library, etc. Usually, I know the exact filename I
am looking for, or I know some prefix or suffix of the exact filename.

> - Case insensitive, diacritic insensitive (e.g. "e" matches "É").

Case insensitivity is quite useful. Most filenames are in lower case,
but there is always that one odd filename out there.

But filenames usually don't have diacritics. So, I'm not sure if
diacritic insensitivity is useful.

> All the above is arguably more powerful and easier to use than regexp.
> But even if no user ever bothers with the logic operators, the default
> behaviour "just works" in the fashion of a search engine.
>
> The main thing I don't know how to do is suffix matches (like "%foo").
> With FTS, looking up "foo*" won't match "libfoo", which is problematic.
> Any clue how to fix that?

This is handled by stemming. We'll need a custom stemmer that normalizes
libfoo to foo. Xapian has a nice page on stemming. See
https://xapian.org/docs/stemming.html

Cheers!

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13 15:14                   ` Arun Isaac
@ 2020-08-13 15:36                     ` Pierre Neidhardt
  2020-08-13 15:56                       ` Pierre Neidhardt
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-13 15:36 UTC (permalink / raw)
  To: Arun Isaac, Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 868 bytes --]

Arun Isaac <arunisaac@systemreboot.net> writes:

> But filenames usually don't have diacritics. So, I'm not sure if
> diacritic insensitivity is useful.

Probably not, but if there ever is this odd file name with an accent,
then we won't have to worry about it, it will be handled.  Better too
much than too little!

> This is handled by stemming. We'll need a custom stemmer that normalizes
> libfoo to foo. Xapian has a nice page on stemming. See
> https://xapian.org/docs/stemming.html

I think I gave a confusing example; stemming is too specific.
I meant searching for a pattern that does not start at the beginning
of the word: if I search "bar", I'd like to match "foobar".

The reverse is possible ("foo" matches "foobar").

Well, I'll share my FTS demo right away; we can always think about this
case later.

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13 15:36                     ` Pierre Neidhardt
@ 2020-08-13 15:56                       ` Pierre Neidhardt
  2020-08-15 19:33                         ` Arun Isaac
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-13 15:56 UTC (permalink / raw)
  To: Arun Isaac, Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 1400 bytes --]

I've pushed my experiment to the `wip-filesearch' branch.

As of this writing it is not automatically triggered by "guix build".
To test it:

- Load the module from a REPL.

- Run

--8<---------------cut here---------------start------------->8---
  (test-index-git)
--8<---------------cut here---------------end--------------->8---

  You should get a database at ~/.cache/guix/files.db.

- Run

--8<---------------cut here---------------start------------->8---
  (format-search (search-file-package "git perl5"))
--8<---------------cut here---------------end--------------->8---

  to get some search results.

- Run

--8<---------------cut here---------------start------------->8---
  (persist-all-local-packages)
--8<---------------cut here---------------end--------------->8---

  to build the database for all your local files.  Warning: this takes
  some 30 seconds on my machine, and could take minutes if you have a
  spinning drive or an old processor.

I'll be away for a couple of days, feel free to experiment and send me
some feedback in the meantime.

What remains to be done:

- Automatically index "guix build" result in database.
  Any idea where I can trigger the indexing from a "guix build"?

- Sync the database data from a substitute server.  Any pointer would be
  appreciated :)

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13 10:04                     ` Pierre Neidhardt
@ 2020-08-15 12:47                       ` Hartmut Goebel
  2020-08-15 21:20                         ` Bengt Richter
  0 siblings, 1 reply; 45+ messages in thread
From: Hartmut Goebel @ 2020-08-15 12:47 UTC (permalink / raw)
  To: guix-devel

Am 13.08.20 um 12:04 schrieb Pierre Neidhardt:
> SQLite pattern search queries are extremely fast (<0.1s) and cover all
> examples named so far:
>
> - exact basename match
> - partial path match
> - pattern match (e.g. "/include/%foo%")


For comparison: These are the options Debian Package search
<https://www.debian.org/distrib/packages.en#search_packages> supports:

  * paths ending with the keyword  -> e.g. file-extensions
  * packages that contain files named like this -> basename
  * packages that contain files whose names contain the keyword  ->
    LIKE  %<keyword>%

-- 
Regards
Hartmut Goebel

| Hartmut Goebel          | h.goebel@crazy-compilers.com               |
| www.crazy-compilers.com | compilers which you thought are impossible |



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-11 12:35   ` Pierre Neidhardt
@ 2020-08-15 12:48     ` Hartmut Goebel
  0 siblings, 0 replies; 45+ messages in thread
From: Hartmut Goebel @ 2020-08-15 12:48 UTC (permalink / raw)
  To: Pierre Neidhardt, Mathieu Othacehe; +Cc: guix-devel

Am 11.08.20 um 14:35 schrieb Pierre Neidhardt:
> Unlike Nix, we would like to do more than just index executable files.
> Indeed, it's very useful to know where to find, say, a C header, a .so
> library, a TeXlive .sty file, etc.

+1

Most of the time I'm searching for non-executable files.

-- 
Regards
Hartmut Goebel

| Hartmut Goebel          | h.goebel@crazy-compilers.com               |
| www.crazy-compilers.com | compilers which you thought are impossible |



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13 15:56                       ` Pierre Neidhardt
@ 2020-08-15 19:33                         ` Arun Isaac
  2020-08-24  8:29                           ` Pierre Neidhardt
  0 siblings, 1 reply; 45+ messages in thread
From: Arun Isaac @ 2020-08-15 19:33 UTC (permalink / raw)
  To: Pierre Neidhardt, Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 1191 bytes --]


Hi Pierre,

I tried the wip-filesearch branch. Nice work! :-)
persist-all-local-packages takes around 350 seconds on my machine (slow
machine with spinning disk) and the database is 50 MB. Some other
comments follow.

- Maybe, we shouldn't index hidden files, particularly all the .xxx-real
  files created by our wrap phases.

- You should use SQL prepared statements with sqlite-prepare,
  sqlite-bind, etc. That would correctly handle escaping special
  characters in the search string. Currently, searching for
  "transmission-gtk", "libm.so", etc. errors out.

- Searching for "git perl5" works as expected, but searching for "git
  perl" returns no results. I think this is due to the tokenizer used by
  the full text search indexer. The tokenizer sees the word "perl5" as
  one indivisible token and does not realize that "perl" is a prefix of
  "perl5". Unfortunately, I think this is a fundamental problem with FTS
  -- one that can only be fixed by using simple LIKE patterns. FTS is
  meant for natural language search where this kind of thing would be
  normal.

- I guess you are only indexing local packages now, but will include all
  packages later by some means.

Cheers!

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-15 12:47                       ` Hartmut Goebel
@ 2020-08-15 21:20                         ` Bengt Richter
  2020-08-16  8:18                           ` Hartmut Goebel
  0 siblings, 1 reply; 45+ messages in thread
From: Bengt Richter @ 2020-08-15 21:20 UTC (permalink / raw)
  To: Hartmut Goebel; +Cc: guix-devel

Hi Hartmut et al.,

On +2020-08-15 14:47:12 +0200, Hartmut Goebel wrote:
> Am 13.08.20 um 12:04 schrieb Pierre Neidhardt:
> > SQLite pattern search queries are extremely fast (<0.1s) and cover all
> > examples named so far:
> >
> > - exact basename match
> > - partial path match
> > - pattern match (e.g. "/include/%foo%")
> 
> 
> For comparison: These are the options Debian Package search
> <https://www.debian.org/distrib/packages.en#search_packages> supports:
> 
>   * paths ending with the keyword  -> e.g. file-extensions
>   * packages that contain files named like this -> basename
>   * packages that contain files whose names contain the keyword  ->
>     LIKE  %<keyword>%
>

If you are on debian, have you tried
    dpkg -l '*your*globbed*name*here*'
or
    dpkg -l '*'|egrep your-regex-here
or the former, preserving the header and excluding non-installed packages, e.g.:
    dpkg -l '*your*globbed*name*here*'|grep -v ^un

> -- 
> Regards
> Hartmut Goebel
> 
> | Hartmut Goebel          | h.goebel@crazy-compilers.com               |
> | www.crazy-compilers.com | compilers which you thought are impossible |
> 
> 
-- 
Regards,
Bengt Richter



* Re: File search progress: database review and question on triggers
  2020-08-15 21:20                         ` Bengt Richter
@ 2020-08-16  8:18                           ` Hartmut Goebel
  0 siblings, 0 replies; 45+ messages in thread
From: Hartmut Goebel @ 2020-08-16  8:18 UTC (permalink / raw)
  To: Bengt Richter; +Cc: guix-devel

On 15.08.20 at 23:20, Bengt Richter wrote:
> If you are on debian, have you tried
>     dpkg -l '*your*globbed*name*here*'

No, since for Debian I'm not using a command-line tool, but the
Web-Interface - which allows querying even packages I have not
installed.  (And the latter is my specific use case: finding which
package I need to install to get a specific file.)

-- 
Regards
Hartmut Goebel

| Hartmut Goebel          | h.goebel@crazy-compilers.com               |
| www.crazy-compilers.com | compilers which you thought are impossible |




* Re: File search progress: database review and question on triggers OFF TOPIC PRAISE
  2020-08-10 14:32 File search progress: database review and question on triggers Pierre Neidhardt
  2020-08-11  9:43 ` Mathieu Othacehe
  2020-08-11 15:43 ` Ricardo Wurmus
@ 2020-08-18 14:58 ` Joshua Branson
  2020-08-27 10:00 ` File search progress: database review and question on triggers zimoun
  3 siblings, 0 replies; 45+ messages in thread
From: Joshua Branson @ 2020-08-18 14:58 UTC (permalink / raw)
  To: guix-devel


Thanks for working on this!  This is a super awesome feature!  Best of luck!

-- 
Joshua Branson
Sent from Emacs and Gnus



* Re: File search progress: database review and question on triggers
  2020-08-15 19:33                         ` Arun Isaac
@ 2020-08-24  8:29                           ` Pierre Neidhardt
  2020-08-24 10:53                             ` Pierre Neidhardt
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-24  8:29 UTC (permalink / raw)
  To: Arun Isaac, Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 2150 bytes --]

Hi Arun,

thanks for your feedback!

350 seconds does not seem too bad.  Let's keep in mind that this
operation will mostly be used by the build farm / substitute server, not
by the user.

> - Maybe we shouldn't index hidden files, particularly all the .xxx-real
>   files created by our wrap phases.

I thought about this, but for now I don't think it's a good idea:

- Hidden files are rather rare and won't take up much space in the
  database.
- What if the user actually searches for a hidden file?
- It's easy to exclude hidden files from the search results.

> - You should use SQL prepared statements with sqlite-prepare,
>   sqlite-bind, etc. That would correctly handle escaping special
>   characters in the search string. Currently, searching for
>   "transmission-gtk", "libm.so", etc. errors out.

Thanks for pointing this out, I'll look into it.

> - Searching for "git perl5" works as expected, but searching for "git
>   perl" returns no results. I think this is due to the tokenizer used by
>   the full text search indexer. The tokenizer sees the word "perl5" as
>   one indivisible token and does not realize that "perl" is a prefix of
>   "perl5". Unfortunately, I think this is a fundamental problem with FTS
>   -- one that can only be fixed by using simple LIKE patterns. FTS is
>   meant for natural language search where this kind of thing would be
>   normal.

Indeed, but I think "git perl*" would have worked.  We can always
pre-process the query to add stars everywhere.

At the moment, the only downside I see with FTS is that it seems to be
impossible to match words for which we don't know the beginning.

Anyway, switching from FTS to LIKE patterns is rather easy.  In fact, I
could implement both with a tiny abstraction so that we can choose which
one we want in the end.
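
The tokenizer behaviour and the LIKE fallback can be checked directly; a
quick sketch using Python's sqlite3 for brevity (the behaviour is
SQLite's own; assumes an FTS5-enabled build, which most are):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE files USING fts5(subpath)")
db.execute("INSERT INTO files VALUES ('/gnu/store/abc-perl5/bin/perl5')")

# The default tokenizer indexes "perl5" as a single token, so a plain
# MATCH on "perl" finds nothing...
exact = db.execute(
    "SELECT count(*) FROM files WHERE files MATCH 'perl'").fetchone()[0]
# ...whereas the prefix query "perl*" does match it,
prefix = db.execute(
    "SELECT count(*) FROM files WHERE files MATCH 'perl*'").fetchone()[0]
# as does a plain LIKE pattern.
like = db.execute(
    "SELECT count(*) FROM files WHERE subpath LIKE '%perl%'").fetchone()[0]
```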

> - I guess you are only indexing local packages now, but will include all
>   packages later by some means.

Indeed, I want the substitute server / build farm to expose the database
for syncing.  I'd need some help to get started.  Anyone?

-- 
Pierre Neidhardt
https://ambrevar.xyz/



* Re: File search progress: database review and question on triggers
  2020-08-24  8:29                           ` Pierre Neidhardt
@ 2020-08-24 10:53                             ` Pierre Neidhardt
  2020-09-04 19:15                               ` Arun Isaac
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-24 10:53 UTC (permalink / raw)
  To: Arun Isaac, Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 875 bytes --]

>> - You should use SQL prepared statements with sqlite-prepare,
>>   sqlite-bind, etc. That would correctly handle escaping special
>>   characters in the search string. Currently, searching for
>>   "transmission-gtk", "libm.so", etc. errors out.
>
> Thanks for pointing this out, I'll look into it.


I've looked into it.
Actually, the issue is not the lack of sqlite-bind-arguments, but
"-" and "." triggering an error in FTS.

I've fixed it in
e08f913d20428a9a925cc46d177c7446f55e6443.  The downside is that we can't
use any special character like boolean operators.  I'm not sure how we
can get the best of both worlds.  Maybe add a command line flag that
would enable "extended search syntax" in which the user can use all of
FTS but must make sure to quote "libm.so" for instance.
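
For reference, both the error and the quoting workaround can be
reproduced like this (a Python sqlite3 sketch, since the behaviour is
SQLite's own; assumes an FTS5-enabled build):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE files USING fts5(subpath)")
db.execute("INSERT INTO files VALUES ('/gnu/store/abc-glibc/lib/libm.so')")

# A bare "libm.so" is an FTS query syntax error: "." is not a valid
# character in an unquoted query term.
try:
    db.execute("SELECT * FROM files WHERE files MATCH 'libm.so'").fetchall()
    errored = False
except sqlite3.OperationalError:
    errored = True

# Double-quoting turns it into a phrase query (tokens "libm" then "so"),
# which works -- but inside the quotes, operators like OR lose their
# special meaning, hence the trade-off described above.
hits = db.execute("SELECT count(*) FROM files WHERE files MATCH ?",
                  ('"libm.so"',)).fetchone()[0]
```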

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/



* Re: File search progress: database review and question on triggers
  2020-08-10 14:32 File search progress: database review and question on triggers Pierre Neidhardt
                   ` (2 preceding siblings ...)
  2020-08-18 14:58 ` File search progress: database review and question on triggers OFF TOPIC PRAISE Joshua Branson
@ 2020-08-27 10:00 ` zimoun
  2020-08-27 11:15   ` Pierre Neidhardt
  3 siblings, 1 reply; 45+ messages in thread
From: zimoun @ 2020-08-27 10:00 UTC (permalink / raw)
  To: Pierre Neidhardt, guix-devel

Hi Pierre,

Cool to revive the topic. :-)

(disclaimer: I have not yet processed the whole thread)


On Mon, 10 Aug 2020 at 16:32, Pierre Neidhardt <mail@ambrevar.xyz> wrote:

> 1. An SQLite database with the following schema:

If you are going to use a local SQL database, my two questions are:

a)
Which part would update it?  “guix pull”?  Something else?  Even using
substitutes, channels and the like could add an extra cost, so what
is acceptable and what is not?

b)
Could you also include other fields, such as “synopsis” and
“description”?  Because it could speed up “guix search” without adding
(or modifying) the current cache
(~/.config/guix/current/lib/package.cache); discussed at length in
#39258 <http://issues.guix.gnu.org/issue/39258>.



Cheers,
simon




* Re: File search progress: database review and question on triggers
  2020-08-27 10:00 ` File search progress: database review and question on triggers zimoun
@ 2020-08-27 11:15   ` Pierre Neidhardt
  2020-08-27 12:56     ` zimoun
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-27 11:15 UTC (permalink / raw)
  To: zimoun, guix-devel

[-- Attachment #1: Type: text/plain, Size: 1217 bytes --]

Hi Simon!

zimoun <zimon.toutoune@gmail.com> writes:

> If you are going to use a local SQL database, my two questions are:
>
> a)
> Which part would update it?  “guix pull”?  Something else?  Even using
> substitutes, channels and the like could add an extra cost, so what
> is acceptable and what is not?

I suggest fetching database updates when performing the filesearch, just
like `guix size` does.

The induced cost would be little in my opinion.  A compressed database
for a whole generation is below 10 MiB.  This operation would need to be
done only once per Guix generation.  If we are a bit smart with timestamping
and only send a diff, it could even be much, much smaller.

> b)
> Could you also include other fields, such as “synopsis” and
> “description”?  Because it could speed up “guix search” without adding
> (or modifying) the current cache
> (~/.config/guix/current/lib/package.cache); discussed at length in
> #39258 <http://issues.guix.gnu.org/issue/39258>.

I think this is a bit beyond the scope of this patch set.  I'd rather
focus on files exclusively for now and proceed one step at a time :)

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/



* Re: File search progress: database review and question on triggers
  2020-08-27 11:15   ` Pierre Neidhardt
@ 2020-08-27 12:56     ` zimoun
  2020-08-27 13:19       ` Pierre Neidhardt
  0 siblings, 1 reply; 45+ messages in thread
From: zimoun @ 2020-08-27 12:56 UTC (permalink / raw)
  To: Pierre Neidhardt, guix-devel

Hi Pierre,

On Thu, 27 Aug 2020 at 13:15, Pierre Neidhardt <mail@ambrevar.xyz> wrote:

> zimoun <zimon.toutoune@gmail.com> writes:
>
>> If you are going to use a local SQL database, my two questions are:
>>
>> a)
>> Which part would update it?  “guix pull”?  Something else?  Even using
>> substitutes, channels and the like could add an extra cost, so what
>> is acceptable and what is not?
>
> I suggest fetching database updates when performing the filesearch, just
> like `guix size` does.

I am not sure I see how.  One needs the whole database to search in,
and cannot know in advance which packages are relevant.  This is unlike
“guix size”, which manipulates the graph and then downloads the missing
parts.

Therefore, your suggestion is to download the whole database the first
time the user runs “guix filesearch”, i.e., download ~10 MiB.

Then each time the user runs “guix pull” and then “guix filesearch”,
there are two options: either download the new database for the latest
Guix generation, or download a diff (not sure the complexity is worth it
for ~10 MiB).

Right?


And what about the channels?  Because if I read correctly, “guix size”
fails when there is “no available substitute information”.


>> b)
>> Could you also include other fields, such as “synopsis” and
>> “description”?  Because it could speed up “guix search” without adding
>> (or modifying) the current cache
>> (~/.config/guix/current/lib/package.cache); discussed at length in
>> #39258 <http://issues.guix.gnu.org/issue/39258>.
>
> I think this is a bit beyond the scope of this patch set.  I'd rather
> focus on files exclusively for now and proceed one step at a time :)

I do not think it is beyond the scope, because Arun introduced an SQLite
database to improve “guix search”, but that path was abandoned for
“introducing complexity” [1].  Therefore, if “guix filesearch”
introduces an SQL cache, it seems a good idea to make it also usable by
“guix search”.

Well, if the table is extended with the fields “synopsis” and
“description”, then what is the size of the database?  Does it kill the
lookup performance?

[1] http://issues.guix.gnu.org/issue/39258#7


Cheers,
simon



* Re: File search progress: database review and question on triggers
  2020-08-27 12:56     ` zimoun
@ 2020-08-27 13:19       ` Pierre Neidhardt
  0 siblings, 0 replies; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-27 13:19 UTC (permalink / raw)
  To: zimoun, guix-devel

[-- Attachment #1: Type: text/plain, Size: 1945 bytes --]

zimoun <zimon.toutoune@gmail.com> writes:

> I am not sure I see how.  One needs the whole database to search in,
> and cannot know in advance which packages are relevant.  This is unlike
> “guix size”, which manipulates the graph and then downloads the missing
> parts.
>
> Therefore, your suggestion is to download the whole database the first
> time the user runs “guix filesearch”, i.e., download ~10 MiB.

Yes.
If the user is not using substitutes, they can also compute the database
from their local store items (like I'm doing in the graft).  It's less
useful but still a bit more convenient than running find/locate on
/gnu/store.

> Then each time the user runs “guix pull” and then “guix filesearch”,
> there are two options: either download the new database for the latest
> Guix generation, or download a diff (not sure the complexity is worth it
> for ~10 MiB).
>
> Right?

Yes.

> And what about the channels?  Because if I read correctly, “guix size”
> fails when there is “no available substitute information”.

Just like `guix size', it would work with local items.  But if there is
no substitute server for channels, then there is no way around it.

>> I think this is a bit beyond the scope of this patch set.  I'd rather
>> focus on files exclusively for now and proceed one step at a time :)
>
> I do not think it is beyond the scope, because Arun introduced an SQLite
> database to improve “guix search”, but that path was abandoned for
> “introducing complexity” [1].  Therefore, if “guix filesearch”
> introduces an SQL cache, it seems a good idea to make it also usable by
> “guix search”.
>
> Well, if the table is extended with the fields “synopsis” and
> “description”, then what is the size of the database?  Does it kill the
> lookup performance?

Good point, I'll test and report my measures.

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/



* Re: File search progress: database review and question on triggers
  2020-08-24 10:53                             ` Pierre Neidhardt
@ 2020-09-04 19:15                               ` Arun Isaac
  2020-09-05  7:48                                 ` Pierre Neidhardt
  0 siblings, 1 reply; 45+ messages in thread
From: Arun Isaac @ 2020-09-04 19:15 UTC (permalink / raw)
  To: Pierre Neidhardt; +Cc: guix-devel


Hi Pierre!

Sorry for the very late response.

> I've fixed it in
> e08f913d20428a9a925cc46d177c7446f55e6443.  The downside is that we can't
> use any special character like boolean operators.  I'm not sure how we
> can get the best of both worlds.  Maybe add a command line flag that
> would enable "extended search syntax" in which the user can use all of
> FTS but must make sure to quote "libm.so" for instance.

Where is this commit so I may test it? I couldn't find the commit in the
wip-filesearch branch. The HEAD of the wip-filesearch branch I see is
49b52c2c7be03caf3636632c31f4451d5bc88125.

Regards,
Arun



* Re: File search progress: database review and question on triggers
  2020-09-04 19:15                               ` Arun Isaac
@ 2020-09-05  7:48                                 ` Pierre Neidhardt
  2020-09-06  9:25                                   ` Arun Isaac
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-09-05  7:48 UTC (permalink / raw)
  To: Arun Isaac; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 183 bytes --]

Oops!  Forgot to push.

It's actually commit 25147f983bdf432b03e8271abe0318f4812f94ba on
wip-filesearch.

Thanks for the notice!

-- 
Pierre Neidhardt
https://ambrevar.xyz/



* Re: File search progress: database review and question on triggers
  2020-09-05  7:48                                 ` Pierre Neidhardt
@ 2020-09-06  9:25                                   ` Arun Isaac
  2020-09-06 10:05                                     ` Pierre Neidhardt
  0 siblings, 1 reply; 45+ messages in thread
From: Arun Isaac @ 2020-09-06  9:25 UTC (permalink / raw)
  To: Pierre Neidhardt; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 698 bytes --]


> Oops!  Forgot to push.
>
> It's actually commit 25147f983bdf432b03e8271abe0318f4812f94ba on
> wip-filesearch.

I checked and it works now. Just a tiny nitpick:

The function signature of search-file-package,

--8<---------------cut here---------------start------------->8---
(define (search-file-package pattern . more-patterns) ...)
--8<---------------cut here---------------end--------------->8---

can be rewritten as

--8<---------------cut here---------------start------------->8---
(define (search-file-package . patterns) ...)
--8<---------------cut here---------------end--------------->8---

No need to cons pattern onto more-patterns in the function body.

Thanks for working on this!



* Re: File search progress: database review and question on triggers
  2020-09-06  9:25                                   ` Arun Isaac
@ 2020-09-06 10:05                                     ` Pierre Neidhardt
  2020-09-06 10:33                                       ` Arun Isaac
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-09-06 10:05 UTC (permalink / raw)
  To: Arun Isaac; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 890 bytes --]

Arun Isaac <arunisaac@systemreboot.net> writes:

> The function signature of search-file-package,
>
> --8<---------------cut here---------------start------------->8---
> (define (search-file-package pattern . more-patterns) ...)
> --8<---------------cut here---------------end--------------->8---
>
> can be rewritten as
>
> --8<---------------cut here---------------start------------->8---
> (define (search-file-package . patterns) ...)
> --8<---------------cut here---------------end--------------->8---
>
> No need to cons pattern onto more-patterns in the function body.

There is a subtle difference: with the latter, (search-file-package) is
allowed and won't trigger a compile-time error, while with the former it
does.  This "foo . more-foo" pattern is a way to say "1 or more
arguments" instead of "0 or more".
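
For illustration, the same "1 or more" vs. "0 or more" distinction in
Python (hypothetical names; the branch itself is Guile, and in Python
the error shows up at call time rather than compile time):

```python
def search_file_package(pattern, *more_patterns):
    """'1 or more' patterns: a zero-argument call raises TypeError."""
    return [pattern, *more_patterns]

def search_any(*patterns):
    """'0 or more' patterns: a zero-argument call silently succeeds."""
    return list(patterns)

one_or_more = search_file_package("libm.so")
zero_ok = search_any()          # no error, though probably a user mistake
try:
    search_file_package()       # missing the required 'pattern' argument
    rejected = False
except TypeError:
    rejected = True
```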

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/



* Re: File search progress: database review and question on triggers
  2020-09-06 10:05                                     ` Pierre Neidhardt
@ 2020-09-06 10:33                                       ` Arun Isaac
  0 siblings, 0 replies; 45+ messages in thread
From: Arun Isaac @ 2020-09-06 10:33 UTC (permalink / raw)
  To: Pierre Neidhardt; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 285 bytes --]


> There is a subtle difference: with the latter, (search-file-package) is
> allowed and won't trigger a compile-time error, while with the former it
> does.  This "foo . more-foo" pattern is a way to say "1 or more
> arguments" instead of "0 or more".

Ah, ok, that makes sense!

Regards,
Arun



end of thread, other threads:[~2020-09-06 10:33 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-10 14:32 File search progress: database review and question on triggers Pierre Neidhardt
2020-08-11  9:43 ` Mathieu Othacehe
2020-08-11 12:35   ` Pierre Neidhardt
2020-08-15 12:48     ` Hartmut Goebel
2020-08-11 15:43 ` Ricardo Wurmus
2020-08-11 17:54   ` Pierre Neidhardt
2020-08-11 17:58     ` Pierre Neidhardt
2020-08-11 20:08       ` Ricardo Wurmus
2020-08-12 19:10         ` Pierre Neidhardt
2020-08-12 20:13           ` Julien Lepiller
2020-08-12 20:43             ` Pierre Neidhardt
2020-08-12 21:29               ` Julien Lepiller
2020-08-12 22:29                 ` Ricardo Wurmus
2020-08-13  6:55                   ` Pierre Neidhardt
2020-08-13  6:52                 ` Pierre Neidhardt
2020-08-13  9:34                   ` Ricardo Wurmus
2020-08-13 10:04                     ` Pierre Neidhardt
2020-08-15 12:47                       ` Hartmut Goebel
2020-08-15 21:20                         ` Bengt Richter
2020-08-16  8:18                           ` Hartmut Goebel
2020-08-12 20:32           ` Pierre Neidhardt
2020-08-13  0:17           ` Arun Isaac
2020-08-13  6:58             ` Pierre Neidhardt
2020-08-13  9:40             ` Pierre Neidhardt
2020-08-13 10:08               ` Pierre Neidhardt
2020-08-13 11:47               ` Ricardo Wurmus
2020-08-13 13:44                 ` Pierre Neidhardt
2020-08-13 12:20               ` Arun Isaac
2020-08-13 13:53                 ` Pierre Neidhardt
2020-08-13 15:14                   ` Arun Isaac
2020-08-13 15:36                     ` Pierre Neidhardt
2020-08-13 15:56                       ` Pierre Neidhardt
2020-08-15 19:33                         ` Arun Isaac
2020-08-24  8:29                           ` Pierre Neidhardt
2020-08-24 10:53                             ` Pierre Neidhardt
2020-09-04 19:15                               ` Arun Isaac
2020-09-05  7:48                                 ` Pierre Neidhardt
2020-09-06  9:25                                   ` Arun Isaac
2020-09-06 10:05                                     ` Pierre Neidhardt
2020-09-06 10:33                                       ` Arun Isaac
2020-08-18 14:58 ` File search progress: database review and question on triggers OFF TOPIC PRAISE Joshua Branson
2020-08-27 10:00 ` File search progress: database review and question on triggers zimoun
2020-08-27 11:15   ` Pierre Neidhardt
2020-08-27 12:56     ` zimoun
2020-08-27 13:19       ` Pierre Neidhardt
