Re: Processing large amounts of files

unofficial mirror of gwl-devel@gnu.org

From: Ricardo Wurmus <rekado@elephly.net>
To: Liliana Marie Prikler <liliana.prikler@ist.tugraz.at>
Cc: gwl-devel@gnu.org
Subject: Re: Processing large amounts of files
Date: Wed, 27 Mar 2024 10:58:10 +0100
Message-ID: <87r0fwasp2.fsf@elephly.net>
In-Reply-To: <11dfb81e0f3316206c7ecb6fa6d2741fe0721187.camel@ist.tugraz.at>

Liliana Marie Prikler <liliana.prikler@ist.tugraz.at> writes:

> On Tuesday, 26.03.2024 at 22:30 +0100, Ricardo Wurmus wrote:
>> 
>> Ricardo Wurmus <rekado@elephly.net> writes:
>> > Another significant delay is introduced by the cache mechanism,
>> > which computes a unique prefix based on the contents of all input
>> > files.  It's not unexpected that this will take a little while, but
>> > it's not great either.
>> 
>> With commit f4442e409cf05d0c7cc4d6a251626d22efaffe8c it's a little
>> faster.  We used a whole lot of alists, and this becomes slow when
>> there are thousands of inputs.  We're now using hash tables.
> SGTM.  I assume the caches are internal and do not affect input order
> otherwise?  i.e. a process that declares
>
>   inputs : files "foo" "bar" "baz"
>
> will still see the same {{inputs}} as before?

Yes, the order should always be the same.

> I see there are tests
> covering make-process, but I'm not quite sure how to parse
> "prepare-inputs returns the unmodified inputs-map when all files
> exist" tbh.

Input handling is a big bag of compromises.  In the distant past,
workflows hardcoded input file names, which were assumed to be present
at runtime.  That wasn't great for my use case, which was to specify a
workflow as a generic thing that has deterministic behavior but allows
for plugging in different input files.

That's why I decoupled process scripts from their inputs; inputs are
passed as arguments to these unchanging scripts.

GWL currently assumes that *any* input anywhere in the workflow can be
injected by the user.  There is an option to provide an input mapping,
which maps an existing file to an input file name in the workflow.

GWL will first compute free inputs, i.e. inputs that are not provided by
any of the outputs of any process in the workflow.  GWL expects that
these free inputs are either declared by the user or --- and this is a
pragmatic decision that I'm not too happy with --- that a file matching
the input name can be found relative to the current directory.
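
To make the free-input logic concrete, here is a rough sketch in Python.
It is not GWL's code (GWL is written in Guile Scheme); the names
free_inputs and resolve_free_inputs, the dict-based process records and
the input_map argument are all made up for illustration.  It only shows
the idea: collect the inputs that no process produces, then fill each
one from the user's mapping or, failing that, from a file of the same
name relative to the current directory.

```python
from pathlib import Path


def free_inputs(processes):
    """Return inputs that no process produces, in first-declared order.

    `processes` is a hypothetical list of dicts with "inputs" and
    "outputs" keys; it is not GWL's actual data structure.
    """
    produced = {out for p in processes for out in p["outputs"]}
    seen = set()
    free = []
    for p in processes:
        for name in p["inputs"]:
            if name not in produced and name not in seen:
                seen.add(name)
                free.append(name)
    return free


def resolve_free_inputs(free, input_map, cwd="."):
    """Fill each free input from the user mapping, else from `cwd`.

    Returns (resolved, missing): a dict from input name to file path,
    and the list of inputs that could not be filled either way.
    """
    resolved = {}
    missing = []
    for name in free:
        candidate = Path(cwd) / name
        if name in input_map:
            resolved[name] = input_map[name]   # declared by the user
        elif candidate.exists():
            resolved[name] = str(candidate)    # discovered by file name
        else:
            missing.append(name)
    return resolved, missing
```

With two processes where the second consumes the first one's output,
only the files that nothing produces come back as free inputs; anything
left in `missing` would have to be reported to the user.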

The above test is for the simple case where no files were discovered
to fill the slots of computed free inputs.

The caching mechanism exists to avoid rerunning processes when their
output files already exist.  In the presence of input maps and file
discovery relative to the current working directory, however, it is
necessary to rerun processes when the input files differ.

GWL computes hashes of the mapped input files and of all process scripts
to arrive at a cache prefix.  This cache prefix is derived from a chain
of hashes that covers the workflow definitions and the effective inputs.
Given the same input files and the same workflow we can avoid running
the whole workflow again when the cache already contains outputs from a
previous run.
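
As an illustration of the idea, here is a small Python sketch of a
cache prefix derived from script and input hashes.  This is not the
GWL implementation: the choice of SHA-256, the function names
file_digest and cache_prefix, and the exact way the hashes are chained
are assumptions made for the example.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks so large inputs fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def cache_prefix(process_scripts, input_files):
    """Derive one prefix from all process scripts and effective inputs.

    `process_scripts` is a list of script strings in workflow order;
    `input_files` is a list of (input name, actual file path) pairs.
    Both shapes are hypothetical.
    """
    h = hashlib.sha256()
    for script in process_scripts:
        # Feed each script's own digest into the running hash.
        h.update(hashlib.sha256(script.encode("utf-8")).digest())
    for name, path in input_files:
        # Cover both the input name and the contents of the mapped file.
        h.update(name.encode("utf-8"))
        h.update(b"\0")
        h.update(file_digest(path).encode("ascii"))
    return h.hexdigest()
```

The same scripts and the same file contents give the same prefix, so
cached outputs can be reused; changing a script or the contents of any
input file gives a different prefix, and the workflow runs again.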

-- 
Ricardo
