unofficial mirror of guix-devel@gnu.org 
 help / color / mirror / code / Atom feed
From: Julien Lepiller <julien@lepiller.eu>
To: guix-devel@gnu.org, Pierre Neidhardt <mail@ambrevar.xyz>,
	Ricardo Wurmus <rekado@elephly.net>
Subject: Re: File search progress: database review and question on triggers
Date: Wed, 12 Aug 2020 16:13:50 -0400	[thread overview]
Message-ID: <683CAE3A-8B38-4A41-9B3C-18D1284D3EFA@lepiller.eu> (raw)
In-Reply-To: <87r1sbel4f.fsf@ambrevar.xyz>

[-- Attachment #1: Type: text/plain, Size: 2612 bytes --]

Have you tried something more structured? I have some code for creating a binary search tree and even compressing/decompressing strings with huffman, as well as code to serialize all that (my deserialization is in Java though, so not very useful to you): https://framagit.org/nani-project/nani-website

See modules/nani/trie.scm for instance.

The idea is to have a binary search tree whose keys are filenames and value is a pointer in the file to a structure that holds data for this filerame (packages and versions of guix for instance). It's super fast to look it up, because you don't load the whole file, you only seek to the right position. It's a lot longer to build, and not so easy to update though.

On 2020年8月12日 15:10:08 GMT-04:00, Pierre Neidhardt <mail@ambrevar.xyz> wrote:
>I've done some benchmarking.
>
>1. I tried to fine-tune the SQL a bit:
>  - Open/close the database only once for the whole indexing.
>  - Use "insert" instead of "insert or replace".
>  - Use numeric ID as key instead of path.
>
>  Result: Still around 15-20 minutes to build.  Switching to numeric
>  indices shrank the database by half.
>
>2. I've tried with the following naive 1-file-per-line format:
>
>--8<---------------cut here---------------start------------->8---
>"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man5/acl.5.gz"
>"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man3/acl_add_perm.3.gz"
>"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man3/acl_calc_mask.3.gz"
>...
>--8<---------------cut here---------------end--------------->8---
>
>  Result: Takes between 20 and 2 minutes to complete and the result is
>  32 MiB big.  (I don't know why the timing varies.)
>
>  A string-contains filter takes less than 1 second.
>
>  A string-match (regex) search takes some 3 seconds (Ryzen 5 @ 3.5
>  GHz).  I'm not sure if we can go faster.  I need to measure the time
>  SQL takes for a regexp match.
>
>Question: Any idea how to load the database as fast as possible?  I
>tried the following, it takes 1.5s on my machine:
>
>--8<---------------cut here---------------start------------->8---
>(define (load-textual-database)
>  (call-with-input-file %textual-db
>    (lambda (port)
>      (let loop ((line (get-line port))
>                 (result '()))
>        (if (string? line)
>            (loop (get-line port) (cons line result))
>            result)))))
>--8<---------------cut here---------------end--------------->8---
>
>Cheers!
>
>--
>Pierre Neidhardt
>https://ambrevar.xyz/

[-- Attachment #2: Type: text/html, Size: 3040 bytes --]

  reply	other threads:[~2020-08-12 20:14 UTC|newest]

Thread overview: 73+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-10 14:32 File search progress: database review and question on triggers Pierre Neidhardt
2020-08-11  9:43 ` Mathieu Othacehe
2020-08-11 12:35   ` Pierre Neidhardt
2020-08-15 12:48     ` Hartmut Goebel
2020-08-11 15:43 ` Ricardo Wurmus
2020-08-11 17:54   ` Pierre Neidhardt
2020-08-11 17:58     ` Pierre Neidhardt
2020-08-11 20:08       ` Ricardo Wurmus
2020-08-12 19:10         ` Pierre Neidhardt
2020-08-12 20:13           ` Julien Lepiller [this message]
2020-08-12 20:43             ` Pierre Neidhardt
2020-08-12 21:29               ` Julien Lepiller
2020-08-12 22:29                 ` Ricardo Wurmus
2020-08-13  6:55                   ` Pierre Neidhardt
2020-08-13  6:52                 ` Pierre Neidhardt
2020-08-13  9:34                   ` Ricardo Wurmus
2020-08-13 10:04                     ` Pierre Neidhardt
2020-08-15 12:47                       ` Hartmut Goebel
2020-08-15 21:20                         ` Bengt Richter
2020-08-16  8:18                           ` Hartmut Goebel
2020-08-12 20:32           ` Pierre Neidhardt
2020-08-13  0:17           ` Arun Isaac
2020-08-13  6:58             ` Pierre Neidhardt
2020-08-13  9:40             ` Pierre Neidhardt
2020-08-13 10:08               ` Pierre Neidhardt
2020-08-13 11:47               ` Ricardo Wurmus
2020-08-13 13:44                 ` Pierre Neidhardt
2020-08-13 12:20               ` Arun Isaac
2020-08-13 13:53                 ` Pierre Neidhardt
2020-08-13 15:14                   ` Arun Isaac
2020-08-13 15:36                     ` Pierre Neidhardt
2020-08-13 15:56                       ` Pierre Neidhardt
2020-08-15 19:33                         ` Arun Isaac
2020-08-24  8:29                           ` Pierre Neidhardt
2020-08-24 10:53                             ` Pierre Neidhardt
2020-09-04 19:15                               ` Arun Isaac
2020-09-05  7:48                                 ` Pierre Neidhardt
2020-09-06  9:25                                   ` Arun Isaac
2020-09-06 10:05                                     ` Pierre Neidhardt
2020-09-06 10:33                                       ` Arun Isaac
2020-08-18 14:58 ` File search progress: database review and question on triggers OFF TOPIC PRAISE Joshua Branson
2020-08-27 10:00 ` File search progress: database review and question on triggers zimoun
2020-08-27 11:15   ` Pierre Neidhardt
2020-08-27 12:56     ` zimoun
2020-08-27 13:19       ` Pierre Neidhardt
2020-09-26 14:04         ` Pierre Neidhardt
2020-09-26 14:12           ` Pierre Neidhardt
2020-10-05 12:35           ` Ludovic Courtès
2020-10-05 18:53             ` Pierre Neidhardt
2020-10-09 21:16               ` zimoun
2020-10-10  8:57                 ` Pierre Neidhardt
2020-10-10 14:58                   ` zimoun
2020-10-12 10:16                   ` Ludovic Courtès
2020-10-12 11:18                     ` Pierre Neidhardt
2020-10-13 13:48                       ` Ludovic Courtès
2020-10-13 13:59                         ` Pierre Neidhardt
2020-10-10 16:03               ` zimoun
2020-10-11 11:19                 ` Pierre Neidhardt
2020-10-11 13:02                   ` zimoun
2020-10-11 14:25                     ` Pierre Neidhardt
2020-10-11 16:05                       ` zimoun
2020-10-12 10:20               ` Ludovic Courtès
2020-10-12 11:21                 ` Pierre Neidhardt
2020-10-13 13:45                   ` Ludovic Courtès
2020-10-13 13:56                     ` Pierre Neidhardt
2020-10-13 21:22                       ` Ludovic Courtès
2020-10-14  7:50                         ` Pierre Neidhardt
2020-10-16 10:30                           ` Ludovic Courtès
2020-10-17  9:14                             ` Pierre Neidhardt
2020-10-17 19:17                               ` Pierre Neidhardt
2020-10-21  9:53                               ` Ludovic Courtès
2020-10-21  9:58                                 ` Pierre Neidhardt
2020-10-12 11:23                 ` zimoun

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://guix.gnu.org/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=683CAE3A-8B38-4A41-9B3C-18D1284D3EFA@lepiller.eu \
    --to=julien@lepiller.eu \
    --cc=guix-devel@gnu.org \
    --cc=mail@ambrevar.xyz \
    --cc=rekado@elephly.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).