From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mp2 ([2001:41d0:2:4a6f::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by ms11 with LMTPS id oIoRGZxNNF8dLwAA0tVLHw (envelope-from ) for ; Wed, 12 Aug 2020 20:14:20 +0000 Received: from aspmx1.migadu.com ([2001:41d0:2:4a6f::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by mp2 with LMTPS id OJNnE5xNNF9xAQAAB5/wlQ (envelope-from ) for ; Wed, 12 Aug 2020 20:14:20 +0000 Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by aspmx1.migadu.com (Postfix) with ESMTPS id E97819404D3 for ; Wed, 12 Aug 2020 20:14:19 +0000 (UTC) Received: from localhost ([::1]:51302 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1k5x8c-0001MZ-Li for larch@yhetil.org; Wed, 12 Aug 2020 16:14:18 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:41450) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1k5x8T-0001MG-CD for guix-devel@gnu.org; Wed, 12 Aug 2020 16:14:09 -0400 Received: from lepiller.eu ([89.234.186.109]:43312) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1k5x8Q-0006K6-Uq for guix-devel@gnu.org; Wed, 12 Aug 2020 16:14:09 -0400 Received: from lepiller.eu (localhost [127.0.0.1]) by lepiller.eu (OpenSMTPD) with ESMTP id 02bcc05d; Wed, 12 Aug 2020 20:14:01 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed; d=lepiller.eu; h=date :in-reply-to:references:mime-version:content-type :content-transfer-encoding:subject:to:from:message-id; s=dkim; bh=w20zI4OadViJ3yY1XjwbRVfoTJuBbvMKrkHL22dY5s8=; b=DwGKgaR2JNft 4P3xgoBAp9bHeNOV0jLgciLbfJ2lsCOqD5EwcOESpMMycNIkxiUOEvemW6XtUcKb 4XSPBfEjs7zTvV3oO4LTLyoCnjlJYvgfGlUBrrV+JMBJec6l5fqBo6uejQWSRQmV +bVsoUK8DPDDPAMWHo/a2Sk0Z2gPt+NkIEcYNIbHlCbbW8PVEW3mgeW6ZPIwH1Ol 8lWLtmphnaWrrCYjtRutU5WSqsxTfKmgWOl4gOM4suEQSTtkqi+p7hlN2bhRBP37 Wu1aS9bb0XKI1UfCXnTxOtwJrkRzjeXRbAR/2Ypw8Hvuxo7ZlofpVyKP9NjhSz+L gN9YiAtbWQ== Received: by lepiller.eu (OpenSMTPD) with ESMTPSA id 45244f3d (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256:NO); Wed, 12 Aug 2020 20:14:01 +0000 (UTC) Date: Wed, 12 Aug 2020 16:13:50 -0400 User-Agent: K-9 Mail for Android In-Reply-To: <87r1sbel4f.fsf@ambrevar.xyz> References: <87sgcuh8rb.fsf@ambrevar.xyz> <87y2ml429i.fsf@elephly.net> <87364tgja3.fsf@ambrevar.xyz> <87y2mlf4jw.fsf@ambrevar.xyz> <87pn7x3pyw.fsf@elephly.net> <87r1sbel4f.fsf@ambrevar.xyz> MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----BP9QXZO22ACPYIVBW5FTLT4YZBL9VH" Content-Transfer-Encoding: 7bit Subject: Re: File search progress: database review and question on triggers To: guix-devel@gnu.org, Pierre Neidhardt , Ricardo Wurmus From: Julien Lepiller Message-ID: <683CAE3A-8B38-4A41-9B3C-18D1284D3EFA@lepiller.eu> Received-SPF: none client-ip=89.234.186.109; envelope-from=julien@lepiller.eu; helo=lepiller.eu X-detected-operating-system: by eggs.gnu.org: First seen = 2020/08/12 16:14:03 X-ACL-Warn: Detected OS = Linux 2.2.x-3.x [generic] [fuzzy] X-Spam_score_int: -10 X-Spam_score: -1.1 X-Spam_bar: - X-Spam_report: (-1.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, HTML_MESSAGE=0.001, PDS_OTHER_BAD_TLD=1, SPF_HELO_NONE=0.001, SPF_NONE=0.001, URIBL_BLOCKED=0.001 autolearn=no autolearn_force=no X-Spam_action: no action X-BeenThere: guix-devel@gnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Development of GNU Guix and the GNU System distribution." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guix-devel-bounces+larch=yhetil.org@gnu.org Sender: "Guix-devel" X-Scanner: scn0 Authentication-Results: aspmx1.migadu.com; dkim=fail (rsa verify failed) header.d=lepiller.eu header.s=dkim header.b=DwGKgaR2; dmarc=fail reason="SPF not aligned (relaxed)" header.from=lepiller.eu (policy=none); spf=pass (aspmx1.migadu.com: domain of guix-devel-bounces@gnu.org designates 209.51.188.17 as permitted sender) smtp.mailfrom=guix-devel-bounces@gnu.org X-Spam-Score: 0.09 X-TUID: eMezdy4TokgT ------BP9QXZO22ACPYIVBW5FTLT4YZBL9VH Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Have you tried something more structured? I have some code for creating a b= inary search tree and even compressing/decompressing strings with huffman, = as well as code to serialize all that (my deserialization is in Java though= , so not very useful to you): https://framagit=2Eorg/nani-project/nani-webs= ite See modules/nani/trie=2Escm for instance=2E The idea is to have a binary search tree whose keys are filenames and valu= e is a pointer in the file to a structure that holds data for this filerame= (packages and versions of guix for instance)=2E It's super fast to look it= up, because you don't load the whole file, you only seek to the right posi= tion=2E It's a lot longer to build, and not so easy to update though=2E On 2020=E5=B9=B48=E6=9C=8812=E6=97=A5 15:10:08 GMT-04:00, Pierre Neidhardt= wrote: >I've done some benchmarking=2E > >1=2E I tried to fine-tune the SQL a bit: > - Open/close the database only once for the whole indexing=2E > - Use "insert" instead of "insert or replace"=2E > - Use numeric ID as key instead of path=2E > > Result: Still around 15-20 minutes to build=2E Switching to numeric > indices shrank the database by half=2E > >2=2E I've tried with the following naive 1-file-per-line format: > >--8<---------------cut here---------------start------------->8--- >"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2=2E2=2E53//share/man/ma= n5/acl=2E5=2Egz" >"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2=2E2=2E53//share/man/ma= n3/acl_add_perm=2E3=2Egz" >"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2=2E2=2E53//share/man/ma= n3/acl_calc_mask=2E3=2Egz" >=2E=2E=2E >--8<---------------cut here---------------end--------------->8--- > > Result: Takes between 20 and 2 minutes to complete and the result is > 32=C2=A0MiB big=2E (I don't know why the timing varies=2E) > > A string-contains filter takes less than 1 second=2E > > A string-match (regex) search takes some 3 seconds (Ryzen 5 @ 3=2E5 > GHz)=2E I'm not sure if we can go faster=2E I need to measure the tim= e > SQL takes for a regexp match=2E > >Question: Any idea how to load the database as fast as possible? I >tried the following, it takes 1=2E5s on my machine: > >--8<---------------cut here---------------start------------->8--- >(define (load-textual-database) > (call-with-input-file %textual-db > (lambda (port) > (let loop ((line (get-line port)) > (result '())) > (if (string? line) > (loop (get-line port) (cons line result)) > result))))) >--8<---------------cut here---------------end--------------->8--- > >Cheers! > >-- >Pierre Neidhardt >https://ambrevar=2Exyz/ ------BP9QXZO22ACPYIVBW5FTLT4YZBL9VH Content-Type: text/html; charset=utf-8 Content-Transfer-Encoding: quoted-printable Have you tried something more structured? I have s= ome code for creating a binary search tree and even compressing/decompressi= ng strings with huffman, as well as code to serialize all that (my deserial= ization is in Java though, so not very useful to you): https://framagit=2Eorg/nani-projec= t/nani-website

See modules/nani/trie=2Escm for instance=2E
The idea is to have a binary search tree whose keys are filenames and val= ue is a pointer in the file to a structure that holds data for this fileram= e (packages and versions of guix for instance)=2E It's super fast to look i= t up, because you don't load the whole file, you only seek to the right pos= ition=2E It's a lot longer to build, and not so easy to update though=2E
On 2020=E5=B9=B48=E6=9C=8812=E6=97=A5 15:10= :08 GMT-04:00, Pierre Neidhardt <mail@ambrevar=2Exyz> wrote:
I've done some benchmarking=2E

1=2E I tried t= o fine-tune the SQL a bit:
- Open/close the database only once for the= whole indexing=2E
- Use "insert" instead of "insert or replace"=2E - Use numeric ID as key instead of path=2E

Result: Still around= 15-20 minutes to build=2E Switching to numeric
indices shrank the da= tabase by half=2E

2=2E I've tried with the following naive 1-file-pe= r-line format:

--8<---------------cut here---------------start---= ---------->8---
"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2=2E= 2=2E53//share/man/man5/acl=2E5=2Egz"
"/gnu/store/97p5gvb7jglmn9jpgwwf5al= 1798wi61f-acl-2=2E2=2E53//share/man/man3/acl_add_perm=2E3=2Egz"
"/gnu/st= ore/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2=2E2=2E53//share/man/man3/acl_cal= c_mask=2E3=2Egz"
=2E=2E=2E
--8<---------------cut here------------= ---end--------------->8---

Result: Takes between 20 and 2 minut= es to complete and the result is
32 MiB big=2E (I don't know why= the timing varies=2E)

A string-contains filter takes less than 1 = second=2E

A string-match (regex) search takes some 3 seconds (Ryze= n 5 @ 3=2E5
GHz)=2E I'm not sure if we can go faster=2E I need to me= asure the time
SQL takes for a regexp match=2E

Question: Any id= ea how to load the database as fast as possible? I
tried the following,= it takes 1=2E5s on my machine:

--8<---------------cut here------= ---------start------------->8---
(define (load-textual-database)
= (call-with-input-file %textual-db
(lambda (port)
(let loop= ((line (get-line port))
(result '()))
(if (= string? line)
(loop (get-line port) (cons line result))
= result)))))
--8<---------------cut here---------------end-= -------------->8---

Cheers!

--
Pierre Neidhardt
https://ambrevar=2Exyz/
------BP9QXZO22ACPYIVBW5FTLT4YZBL9VH--