Have you tried something more structured? I have some code for building a binary search tree, and even for compressing and decompressing strings with Huffman coding, as well as code to serialize all of that (my deserialization code is in Java though, so it's not very useful to you): https://framagit.org/nani-project/nani-website

See modules/nani/trie.scm for instance.

The idea is to have a binary search tree whose keys are filenames and whose values are pointers into the file, each pointing to a structure that holds the data for that filename (packages and versions of guix for instance). Lookup is super fast, because you don't load the whole file, you only seek to the right position. It takes a lot longer to build, though, and it's not so easy to update.
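
To make the "seek instead of load" idea concrete, here is a minimal Guile sketch. It is not the code from nani: it swaps the tree for a simpler stand-in, a binary search over sorted fixed-width records, and the names (%key-size, write-index, lookup) and the record layout are made up for illustration. A real record would also carry the offset of the associated data.

```scheme
(use-modules (rnrs bytevectors)
             (ice-9 binary-ports))

;; Hypothetical layout: sorted records of %key-size bytes each, holding a
;; UTF-8 key padded with NUL bytes.  Keys longer than %key-size bytes are
;; not handled in this sketch.
(define %key-size 256)

(define (write-index file keys)
  "Write KEYS (a list of strings) to FILE, sorted, one record per key."
  (call-with-output-file file
    (lambda (port)
      (for-each
       (lambda (key)
         (let ((record (make-bytevector %key-size 0))
               (bytes (string->utf8 key)))
           (bytevector-copy! bytes 0 record 0 (bytevector-length bytes))
           (put-bytevector port record)))
       (sort keys string<?)))
    #:binary #t))

(define (read-key port index)
  "Seek to record INDEX on PORT and return its key without the padding."
  (seek port (* index %key-size) SEEK_SET)
  (let* ((record (get-bytevector-n port %key-size))
         (len (let loop ((i 0))
                (if (or (= i %key-size)
                        (zero? (bytevector-u8-ref record i)))
                    i
                    (loop (+ i 1)))))
         (key (make-bytevector len)))
    (bytevector-copy! record 0 key 0 len)
    (utf8->string key)))

(define (lookup file key)
  "Return the record index of KEY in FILE, or #f if it is absent.
Only about log2(n) records are read; the file is never loaded whole."
  (let ((count (quotient (stat:size (stat file)) %key-size)))
    (call-with-input-file file
      (lambda (port)
        (let loop ((low 0) (high (- count 1)))
          (if (> low high)
              #f
              (let* ((mid (quotient (+ low high) 2))
                     (candidate (read-key port mid)))
                (cond ((string=? key candidate) mid)
                      ((string<? key candidate) (loop low (- mid 1)))
                      (else (loop (+ mid 1) high)))))))
      #:binary #t)))
```

The sketch also shows the trade-off: the whole file has to be sorted and rewritten whenever a key is added.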

On August 12, 2020 15:10:08 GMT-04:00, Pierre Neidhardt <mail@ambrevar.xyz> wrote:
I've done some benchmarking.

1. I tried to fine-tune the SQL a bit:
- Open/close the database only once for the whole indexing.
- Use "insert" instead of "insert or replace".
- Use numeric ID as key instead of path.

Result: Still around 15-20 minutes to build. Switching to numeric
indices shrank the database by half.

2. I've tried with the following naive 1-file-per-line format:

--8<---------------cut here---------------start------------->8---
"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man5/acl.5.gz"
"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man3/acl_add_perm.3.gz"
"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man3/acl_calc_mask.3.gz"
...
--8<---------------cut here---------------end--------------->8---

Result: Takes between 2 and 20 minutes to complete and the result is
32 MiB big. (I don't know why the timing varies.)

A string-contains filter takes less than 1 second.

A string-match (regex) search takes some 3 seconds (Ryzen 5 @ 3.5
GHz). I'm not sure if we can go faster. I need to measure the time
SQL takes for a regexp match.

Question: Any idea how to load the database as fast as possible? I
tried the following, it takes 1.5s on my machine:

--8<---------------cut here---------------start------------->8---
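(use-modules (ice-9 textual-ports))     ; provides get-line

;; %textual-db is the path of the 1-file-per-line database.
;; Note: the lines are accumulated with cons, so the result is in
;; reverse file order.
(define (load-textual-database)
  (call-with-input-file %textual-db
    (lambda (port)
      (let loop ((line (get-line port))
                 (result '()))
        (if (string? line)              ; get-line returns EOF at the end
            (loop (get-line port) (cons line result))
            result)))))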
--8<---------------cut here---------------end--------------->8---
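
One variant that might be worth timing is to read the whole file in a single call and split it afterwards, instead of calling get-line once per entry. This is only a sketch: the procedure name and the FILE argument are made up for illustration, and I have not measured whether it actually beats the loop above.

```scheme
(use-modules (ice-9 textual-ports))

(define (load-textual-database/slurp file)
  "Return the lines of FILE as a list of strings, in file order."
  (let ((content (call-with-input-file file get-string-all)))
    ;; A trailing newline yields an empty string after the last line,
    ;; so empty strings are dropped.
    (filter (lambda (line) (not (string-null? line)))
            (string-split content #\newline))))
```

Unlike the loop version, this one returns the entries in file order.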

Cheers!

--
Pierre Neidhardt
https://ambrevar.xyz/