From: Pierre Neidhardt
To: Ricardo Wurmus
Cc: guix-devel@gnu.org
Subject: Re: File search progress: database review and question on triggers
Date: Wed, 12 Aug 2020 21:10:08 +0200
Message-ID: <87r1sbel4f.fsf@ambrevar.xyz>
In-Reply-To: <87pn7x3pyw.fsf@elephly.net>
References: <87sgcuh8rb.fsf@ambrevar.xyz> <87y2ml429i.fsf@elephly.net> <87364tgja3.fsf@ambrevar.xyz> <87y2mlf4jw.fsf@ambrevar.xyz> <87pn7x3pyw.fsf@elephly.net>

I've done some benchmarking.

1. I tried to fine-tune the SQL a bit:

   - Open/close the database only once for the whole indexing.
   - Use "insert" instead of "insert or replace".
   - Use a numeric ID as key instead of the path.

   Result: Still around 15-20 minutes to build.  Switching to numeric
   indices shrank the database by half.

2. I tried the following naive one-file-per-line format:

--8<---------------cut here---------------start------------->8---
"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man5/acl.5.gz"
"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man3/acl_add_perm.3.gz"
"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man3/acl_calc_mask.3.gz"
...
--8<---------------cut here---------------end--------------->8---

   Result: Takes between 2 and 20 minutes to complete, and the result
   is 32 MiB big.  (I don't know why the timing varies.)

   A string-contains filter takes less than 1 second.

   A string-match (regex) search takes some 3 seconds (Ryzen 5 @ 3.5 GHz).
   I'm not sure if we can go faster.  I need to measure the time SQL
   takes for a regexp match.

Question: Any idea how to load the database as fast as possible?  I
tried the following; it takes 1.5s on my machine:

--8<---------------cut here---------------start------------->8---
(use-modules (ice-9 textual-ports))     ; provides get-line

;; %textual-db is the path to the textual database, defined elsewhere.
(define (load-textual-database)
  (call-with-input-file %textual-db
    (lambda (port)
      ;; Accumulate lines until get-line returns the EOF object.
      ;; The resulting list is in reverse file order.
      (let loop ((line (get-line port))
                 (result '()))
        (if (string? line)
            (loop (get-line port) (cons line result))
            result)))))
--8<---------------cut here---------------end--------------->8---

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/