From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mp0 ([2001:41d0:2:4a6f::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by ms11 with LMTPS id MInRKX5fNF+AIAAA0tVLHw (envelope-from ) for ; Wed, 12 Aug 2020 21:30:38 +0000 Received: from aspmx1.migadu.com ([2001:41d0:2:4a6f::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by mp0 with LMTPS id aNEfJH5fNF9edAAA1q6Kng (envelope-from ) for ; Wed, 12 Aug 2020 21:30:38 +0000 Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by aspmx1.migadu.com (Postfix) with ESMTPS id 27953940142 for ; Wed, 12 Aug 2020 21:30:38 +0000 (UTC) Received: from localhost ([::1]:34612 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1k5yKT-0002vp-0l for larch@yhetil.org; Wed, 12 Aug 2020 17:30:37 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:52936) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1k5yJy-0002uz-D4 for guix-devel@gnu.org; Wed, 12 Aug 2020 17:30:06 -0400 Received: from lepiller.eu ([89.234.186.109]:43346) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1k5yJv-0005l4-Iz for guix-devel@gnu.org; Wed, 12 Aug 2020 17:30:06 -0400 Received: from lepiller.eu (localhost [127.0.0.1]) by lepiller.eu (OpenSMTPD) with ESMTP id ad45d0c4; Wed, 12 Aug 2020 21:29:58 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed; d=lepiller.eu; h=date :in-reply-to:references:mime-version:content-type :content-transfer-encoding:subject:to:from:message-id; s=dkim; bh=4LAysQTii4S4XnQqbt57cqYi2Qrx3hts5TH8lqZQXDE=; b=fTrBVKsmg2/k XZ7VA/u+7sfuKFatsTkUbo5yENLuiY5NDfwJ7hIfECMEATYpq+VEg+8u3gLraTbo j5qDEQ59u8dJ4c/azqvJiQa0UtjjYaXFMyYcWi6xOskNRC0tYzGYU9DqfQqyjXym 2w30KnGbBwodRuJ+mxhBTkKLi2GQ6D6Y/GIK0Z07/cKE6VuwiPUg1Kmo1M8cHwJg VbOEx6zIHZ09D+8pAJ9+n0AktL5/76cJc7uatymKY9sLDDfBN52Di0MHxpYqkU1f oL9EoW4E6iWP0okPI4xuolciJyoWHbbiQw3qnvd2cXHbB9r+2fmKwSd6UvC77OTd OcNee3Bw1w== Received: by lepiller.eu (OpenSMTPD) with ESMTPSA id e20df182 (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256:NO); Wed, 12 Aug 2020 21:29:57 +0000 (UTC) Date: Wed, 12 Aug 2020 17:29:39 -0400 User-Agent: K-9 Mail for Android In-Reply-To: <87o8nfy4qu.fsf@ambrevar.xyz> References: <87sgcuh8rb.fsf@ambrevar.xyz> <87y2ml429i.fsf@elephly.net> <87364tgja3.fsf@ambrevar.xyz> <87y2mlf4jw.fsf@ambrevar.xyz> <87pn7x3pyw.fsf@elephly.net> <87r1sbel4f.fsf@ambrevar.xyz> <683CAE3A-8B38-4A41-9B3C-18D1284D3EFA@lepiller.eu> <87o8nfy4qu.fsf@ambrevar.xyz> MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----O43MTP2EFDPWNZ046JN39P6M0YNZ1W" Content-Transfer-Encoding: 7bit Subject: Re: File search progress: database review and question on triggers To: Pierre Neidhardt , guix-devel@gnu.org, Ricardo Wurmus From: Julien Lepiller Message-ID: <72845E8B-35E1-4A27-95E6-452D1D1F626B@lepiller.eu> Received-SPF: none client-ip=89.234.186.109; envelope-from=julien@lepiller.eu; helo=lepiller.eu X-detected-operating-system: by eggs.gnu.org: First seen = 2020/08/12 16:14:03 X-ACL-Warn: Detected OS = Linux 2.2.x-3.x [generic] [fuzzy] X-Spam_score_int: -10 X-Spam_score: -1.1 X-Spam_bar: - X-Spam_report: (-1.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, HTML_MESSAGE=0.001, PDS_OTHER_BAD_TLD=1, SPF_HELO_NONE=0.001, SPF_NONE=0.001, URIBL_BLOCKED=0.001 autolearn=no autolearn_force=no X-Spam_action: no action X-BeenThere: guix-devel@gnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Development of GNU Guix and the GNU System distribution." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guix-devel-bounces+larch=yhetil.org@gnu.org Sender: "Guix-devel" X-Scanner: scn0 Authentication-Results: aspmx1.migadu.com; dkim=fail (rsa verify failed) header.d=lepiller.eu header.s=dkim header.b=fTrBVKsm; dmarc=fail reason="SPF not aligned (relaxed)" header.from=lepiller.eu (policy=none); spf=pass (aspmx1.migadu.com: domain of guix-devel-bounces@gnu.org designates 209.51.188.17 as permitted sender) smtp.mailfrom=guix-devel-bounces@gnu.org X-Spam-Score: 0.09 X-TUID: haV2UHQndop2 ------O43MTP2EFDPWNZ046JN39P6M0YNZ1W Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Why wouldn't it help? Can't you make it a trie from basename -> complete na= me? If I'm looking for "libcord=2Eso" (which is a key in the trie), I don't= think I need to look for every path=2E I only need to follow the trie unti= l I find a pointer to some structure that contains the data I look for (ex:= a list of complete filenames)=2E On 2020=E5=B9=B48=E6=9C=8812=E6=97=A5 16:43:37 GMT-04:00, Pierre Neidhardt= wrote: >Julien Lepiller writes: > >> Have you tried something more structured? I have some code for >creating a binary search tree and even compressing/decompressing >strings with huffman, as well as code to serialize all that (my >deserialization is in Java though, so not very useful to you): >https://framagit=2Eorg/nani-project/nani-website >> >> See modules/nani/trie=2Escm for instance=2E >> >> The idea is to have a binary search tree whose keys are filenames and >value is a pointer in the file to a structure that holds data for this >filerame (packages and versions of guix for instance)=2E It's super fast >to look it up, because you don't load the whole file, you only seek to >the right position=2E It's a lot longer to build, and not so easy to >update though=2E > >Assuming we'd have only 1 Guix generation per file, I'm not sure a trie >would help because we _always_ need to search _all_ file paths, so in >all cases we've got some 10,000+ entries to load in memory and loop >over >them=2E > >The total number of entries is the bottleneck here, both for the >database load and the individual search queries=2E > >An obvious optimization is to memoize the database load=2E This has a >significant memory cost though=2E > >The trie could be helpful for multiple Guix generations in the same >database, but I'm not sure it warrant the increased complexity=2E > >Thoughts? > >--=20 >Pierre Neidhardt >https://ambrevar=2Exyz/ ------O43MTP2EFDPWNZ046JN39P6M0YNZ1W Content-Type: text/html; charset=utf-8 Content-Transfer-Encoding: quoted-printable Why wouldn't it help? Can't you make it a trie fro= m basename -> complete name? If I'm looking for "libcord=2Eso" (which is= a key in the trie), I don't think I need to look for every path=2E I only = need to follow the trie until I find a pointer to some structure that conta= ins the data I look for (ex: a list of complete filenames)=2E

On 2020=E5=B9=B48=E6=9C=8812=E6=97=A5 16:43:37 GMT-04:= 00, Pierre Neidhardt <mail@ambrevar=2Exyz> wrote:
Julien Lepiller <julien@lepiller=2Eeu> writes:=


Have you tried so= mething more structured? I have some code for creating a binary search tree= and even compressing/decompressing strings with huffman, as well as code t= o serialize all that (my deserialization is in Java though, so not very use= ful to you): h= ttps://framagit=2Eorg/nani-project/nani-website

See modules/nan= i/trie=2Escm for instance=2E

The idea is to have a binary search tr= ee whose keys are filenames and value is a pointer in the file to a structu= re that holds data for this filerame (packages and versions of guix for ins= tance)=2E It's super fast to look it up, because you don't load the whole f= ile, you only seek to the right position=2E It's a lot longer to build, and= not so easy to update though=2E

Assuming we'd have onl= y 1 Guix generation per file, I'm not sure a trie
would help because we = _always_ need to search _all_ file paths, so in
all cases we've got some= 10,000+ entries to load in memory and loop over
them=2E

The tota= l number of entries is the bottleneck here, both for the
database load a= nd the individual search queries=2E

An obvious optimization is to me= moize the database load=2E This has a
significant memory cost though=2E=

The trie could be helpful for multiple Guix generations in the same=
database, but I'm not sure it warrant the increased complexity=2E
Thoughts?
------O43MTP2EFDPWNZ046JN39P6M0YNZ1W--