Subject: Re: Processing large amounts of files
From: Liliana Marie Prikler
To: Ricardo Wurmus
Cc: gwl-devel@gnu.org
Date: Thu, 21 Mar 2024 16:33:12 +0100
In-Reply-To: <877chvehuu.fsf@elephly.net>
References: <2010bdb88116d64da3650b06e58979518b2c7277.camel@ist.tugraz.at>
 <877chvehuu.fsf@elephly.net>

On Thursday, 21.03.2024 at 16:03 +0100, Ricardo Wurmus wrote:
>
> Liliana Marie Prikler writes:
>
> > For comparison:
> >   time cat /tmp/meow/{0..7769}
> >   […]
> >
> >   real  0m0,144s
> >   user  0m0,049s
> >   sys   0m0,094s
> >
> > It takes GWL 6 times longer to compute the workflow than to create
> > the inputs in Guile, and 600 times longer than to actually execute
> > the shell command.  I think there is room for improvement :)
>
> GWL checks if all input files exist before running the command.  Part
> of the difference you see here (takes about 2 seconds on my laptop)
> is GWL running FILE-EXISTS? on 7769 files.  This happens in
> prepare-inputs; its purpose:
>
>   "Ensure that all files in the INPUTS-MAP alist exist and are linked
> to the expected locations.  Pick unspecified inputs from the
> environment. Return either the INPUTS-MAP alist with any
> additionally used input file names added, or raise a condition
> containing the list of missing files."
>
> Another significant delay is introduced by the cache mechanism, which
> computes a unique prefix based on the contents of all input files.
> It's not unexpected that this will take a little while, but it's not
> great either.

Is there a way to speed this up?  At the very least, I'd avoid hashing
the same file twice, but perhaps we could even go further and hash the
directory once w.r.t. all the top-level inputs.  At least some
workflows would probably benefit from stratified hashing, where we
first gather all top-level inputs, then add all outputs that could be
built from those with a single process, etc.

Then again, 2 seconds vs. 10 seconds would already be a great
improvement.

> The rest of the time is lost in inferior package lookups and in using
> Guix to build a script that likely already exists.  The latter is
> something that we could cache (given identical output of "guix
> describe" we could skip the computation of the process scripts).

Yeah, I think we can assume "guix describe" to be constant.  Could we
do a manifest of sorts, where the scripts are computed only once,
after we know all processes?

Cheers