From: Ricardo Wurmus
To: Liliana Marie Prikler
Cc: gwl-devel@gnu.org
Subject: Re: Processing large amounts of files
Date: Wed, 27 Mar 2024 10:58:10 +0100
In-reply-to: <11dfb81e0f3316206c7ecb6fa6d2741fe0721187.camel@ist.tugraz.at>
Message-ID: <87r0fwasp2.fsf@elephly.net>

Liliana Marie Prikler writes:

> On Tuesday, 26.03.2024 at 22:30 +0100, Ricardo Wurmus wrote:
>>
>> Ricardo Wurmus writes:
>> > Another significant delay is introduced by the cache mechanism,
>> > which computes a unique prefix based on the contents of all input
>> > files. It's not unexpected that this will take a little while, but
>> > it's not great either.
>>
>> With commit f4442e409cf05d0c7cc4d6a251626d22efaffe8c it's a little
>> faster. We used a whole lot of alists, and this becomes slow when
>> there are thousands of inputs. We're now using hash tables.
> SGTM. I assume the caches are internal and do not affect input order
> otherwise? i.e. a process that declares
>
>   inputs : files "foo" "bar" "baz"
>
> will still see the same {{inputs}} as before?

Yes, the order should always be the same.

> I see there are tests
> covering make-process, but I'm not quite sure how to parse
> "prepare-inputs returns the unmodified inputs-map when all files
> exist" tbh.

Input handling is a big bag of compromises. In the distant past
workflows hardcoded input file names, which were assumed to be present
at runtime. That wasn't great for my use case, which was to specify a
workflow as a generic thing that has deterministic behavior but allows
for plugging in different input files. That's why I decoupled process
scripts from their inputs; inputs are passed as arguments to these
unchanging scripts.

GWL currently assumes that *any* input anywhere in the workflow can be
injected by the user. There is an option to provide an input mapping,
which maps an existing file to an input file name in the workflow.

GWL will first compute free inputs, i.e. inputs that are not provided by
any of the outputs of any process in the workflow. GWL expects that
these free inputs are either declared by the user or (and this is a
pragmatic decision that I'm not too happy with) that a file matching the
input name can be found relative to the current directory.

The above test is for the simple case where no files were discovered to
fill the slots of computed free inputs.

The caching mechanism exists to avoid rerunning processes when their
output files already exist. In the presence of input maps and file
discovery relative to the current working directory, however, it is
necessary to rerun processes when the input files differ.

GWL computes hashes of the mapped input files and of all process scripts
to arrive at a cache prefix. This cache prefix is derived from a chain
of hashes that covers the workflow definitions and the effective inputs.
Given the same input files and the same workflow we can avoid running
the whole workflow again when the cache already contains outputs from a
previous run.

--
Ricardo
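
For readers who want something concrete, the free-input computation and
the hash chain behind the cache prefix described above can be sketched
roughly as follows. This is an illustrative Python sketch, not GWL's
actual Guile code: the dictionary layout of a process, the function
names free_inputs and cache_prefix, the use of SHA-256, and the ordering
of the chain are all assumptions made for the example.

```python
import hashlib


def free_inputs(processes):
    """Return inputs that no process output provides, in first-seen order.

    `processes` is a list of dicts with "inputs" and "outputs" lists
    (a hypothetical layout chosen for this sketch).
    """
    produced = {out for p in processes for out in p["outputs"]}
    result = []
    for p in processes:
        for name in p["inputs"]:
            if name not in produced and name not in result:
                result.append(name)
    return result


def cache_prefix(scripts, input_contents):
    """Derive a prefix from a chain of hashes (SHA-256 is an assumption).

    `scripts` is a list of process script texts; `input_contents` maps
    each effective input name to the bytes of the file mapped to it.
    """
    digest = b""
    # Fold every process script into the chain first.
    for script in scripts:
        digest = hashlib.sha256(digest + script.encode("utf-8")).digest()
    # Then fold in each input, visited in sorted name order so the
    # result does not depend on dict insertion order.
    for name in sorted(input_contents):
        content_hash = hashlib.sha256(input_contents[name]).digest()
        digest = hashlib.sha256(
            digest + name.encode("utf-8") + content_hash
        ).digest()
    return digest.hex()


# Hypothetical two-step workflow: "align" feeds "count".
workflow = [
    {"inputs": ["reads.fq", "genome.fa"], "outputs": ["aligned.bam"]},
    {"inputs": ["aligned.bam"], "outputs": ["counts.tsv"]},
]
print(free_inputs(workflow))  # ['reads.fq', 'genome.fa']
```

With this shape, the same scripts and the same input contents give the
same prefix, so outputs from a previous run can be found again, while
changing the contents of any mapped input file gives a different prefix
and forces a rerun.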