Re: Processing large amounts of files

From: Liliana Marie Prikler <liliana.prikler@ist.tugraz.at>
To: Ricardo Wurmus <rekado@elephly.net>
Cc: gwl-devel@gnu.org
Subject: Re: Processing large amounts of files
Date: Thu, 21 Mar 2024 16:33:12 +0100
Message-ID: <adcc1814522a466cdd98d7667314743441246011.camel@ist.tugraz.at>
In-Reply-To: <877chvehuu.fsf@elephly.net>

On Thursday, 21.03.2024 at 16:03 +0100, Ricardo Wurmus wrote:
> 
> Liliana Marie Prikler <liliana.prikler@ist.tugraz.at> writes:
> 
> > For comparison:
> >   time cat /tmp/meow/{0..7769}
> >   […]
> >   
> >   real  0m0,144s
> >   user  0m0,049s
> >   sys   0m0,094s
> > 
> > It takes GWL 6 times longer to compute the workflow than to create
> > the inputs in Guile, and 600 times longer than to actually execute
> > the shell command.  I think there is room for improvement :)
> 
> GWL checks if all input files exist before running the command.  Part
> of the difference you see here (takes about 2 seconds on my laptop)
> is GWL running FILE-EXISTS? on 7769 files.  This happens in
> prepare-inputs; its purpose:
> 
>   "Ensure that all files in the INPUTS-MAP alist exist and are linked
>   to the expected locations.  Pick unspecified inputs from the
>   environment.  Return either the INPUTS-MAP alist with any
>   additionally used input file names added, or raise a condition
>   containing the list of missing files."
> 
> Another significant delay is introduced by the cache mechanism, which
> computes a unique prefix based on the contents of all input files. 
> It's not unexpected that this will take a little while, but it's not
> great either.
Is there a way to speed this up?  At the very least, I'd avoid hashing
the same file twice, but perhaps we could even go further and hash the
directory once w.r.t. all the top-level inputs.  At least some
workflows would probably benefit from stratified hashing, where we
first gather all top-level inputs, then add all outputs that could be
built from those with a single process, etc.  Then again, 2 seconds vs.
10 seconds would already be a great improvement.

> The rest of the time is lost in inferior package lookups and in using
> Guix to build a script that likely already exists.  The latter is
> something that we could cache (given identical output of "guix
> describe" we could skip the computation of the process scripts).
Yeah, I think we can assume "guix describe" to be constant.  Could we
build a manifest of sorts, so that the scripts are computed only once,
after all processes are known?
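
Roughly what I have in mind, again as a Python sketch rather than
Guile: ScriptCache, script_for and the build callback are made-up
names, and the key layout is an assumption.  The "guix describe" output
is passed in as a plain string that would be obtained once per run.

```python
import hashlib


class ScriptCache:
    """Memoize process scripts, keyed on the Guix state and the process.

    BUILD is a hypothetical, expensive function that turns a process
    into a script; it stands in for the real Guix build step.
    """

    def __init__(self, build):
        self._build = build
        self._scripts = {}
        self.builds = 0  # number of times BUILD was actually invoked

    def script_for(self, describe_output, process_name, process_source):
        h = hashlib.sha256()
        # Terminate each field with NUL so adjacent fields cannot run together.
        for field in (describe_output, process_name, process_source):
            h.update(field.encode())
            h.update(b"\0")
        key = h.hexdigest()
        if key not in self._scripts:
            self._scripts[key] = self._build(process_name, process_source)
            self.builds += 1
        return self._scripts[key]
```

As long as the "guix describe" output and the process definition are
unchanged, the second and later requests return the stored script
without building anything; a different describe output yields a
different key and forces a rebuild.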

Cheers

