unofficial mirror of gwl-devel@gnu.org
* Re: Processing large amounts of files
       [not found] <2010bdb88116d64da3650b06e58979518b2c7277.camel@ist.tugraz.at>
@ 2024-03-21 14:34 ` Ricardo Wurmus
  2024-03-25  7:42   ` Liliana Marie Prikler
  2024-03-21 15:03 ` Ricardo Wurmus
  1 sibling, 1 reply; 9+ messages in thread
From: Ricardo Wurmus @ 2024-03-21 14:34 UTC (permalink / raw)
  To: Liliana Marie Prikler; +Cc: gwl-devel


Hi Liliana,

[-guix-devel@gnu.org, +gwl-devel@gnu.org]

(forgot to actually send this message a few days ago)

thanks for the report!

> I have a somewhat unusual workflow that requires me to do a number of
> processes on numerous, but small input files.  The original is a bit
> unwieldy and takes several minutes to compile, but I've managed to
> produce a more understandable and better performing example.  Note,
> that after a certain number of inputs, I get the following error:
>
> info: .16 Loading workflow file `meow.gwl'...
> info: 2.80 Computing workflow `cat'...
> run: 12.96 Executing: /bin/sh -c /gnu/store/kmssbjcdcabg9fh4nxscwwpnlb4px30h-gwl-meow.scm …
> error: 13.01 Wrong type argument in position 1: #f

It is frustrating that there is no backtrace.  Reliable error handling
is hard.

When running with "-l all" I see this:

  info: .75 Computing workflow `cat'...
  debug: 3.13 Computing script for process `meow'
  guix: 3.13 Looking up package `bash-minimal'
  guix: 3.13 Opening inferior Guix at `/gnu/store/pb1nkrn3sg6a1j6c4r5j2ahygkf4vkv9-profile'
  guix: 4.27 Looking up package `guix'
  debug: 4.45 Generating all scripts and their dependencies.
  debug: 4.89 Generating all scripts and their dependencies.
  run: 6.73 Executing: /bin/sh -c /gnu/store/5idhbvhrwj3p53kkz2vikdn1ypncwj84-gwl-meow.scm '((inputs "/tmp/meow/0" ...
  process: 8.80 In execvp of /bin/sh: Argument list too long
  error: 8.80 Wrong type argument in position 1: #f

This at least tells us that the last error here is due to sh refusing to run.
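For reference, a rough Python sketch (the path pattern and count are taken from the log above; the exact serialization is an assumption) shows how large that single serialized argument gets:

```python
import os

# Sketch (not GWL code): approximate the size of the serialized inputs
# argument from the failing run.  On Linux, argv plus the environment
# must fit within ARG_MAX, and each individual argument string must
# also fit within MAX_ARG_STRLEN (128 KiB); one argument carrying
# thousands of quoted paths gets close to, or past, these limits.
paths = [f"/tmp/meow/{i}" for i in range(7770)]
serialized = "((inputs " + " ".join(f'"{p}"' for p in paths) + "))"

arg_max = os.sysconf("SC_ARG_MAX")
print(f"serialized argument: {len(serialized)} bytes; ARG_MAX: {arg_max}")
```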

> For comparison:
>   time cat /tmp/meow/{0..7769}
>   […]
>   
>   real	0m0,144s
>   user	0m0,049s
>   sys	0m0,094s
>
> It takes GWL 6 times longer to compute the workflow than to create the
> inputs in Guile, and 600 times longer than to actually execute the
> shell command.  I think there is room for improvement :)

Yeah, not good.  Do you have any recommendations?

-- 
Ricardo


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Processing large amounts of files
       [not found] <2010bdb88116d64da3650b06e58979518b2c7277.camel@ist.tugraz.at>
  2024-03-21 14:34 ` Processing large amounts of files Ricardo Wurmus
@ 2024-03-21 15:03 ` Ricardo Wurmus
  2024-03-21 15:33   ` Liliana Marie Prikler
  2024-03-26 21:30   ` Ricardo Wurmus
  1 sibling, 2 replies; 9+ messages in thread
From: Ricardo Wurmus @ 2024-03-21 15:03 UTC (permalink / raw)
  To: Liliana Marie Prikler; +Cc: gwl-devel


Liliana Marie Prikler <liliana.prikler@ist.tugraz.at> writes:

> For comparison:
>   time cat /tmp/meow/{0..7769}
>   […]
>   
>   real	0m0,144s
>   user	0m0,049s
>   sys	0m0,094s
>
> It takes GWL 6 times longer to compute the workflow than to create the
> inputs in Guile, and 600 times longer than to actually execute the
> shell command.  I think there is room for improvement :)

GWL checks if all input files exist before running the command.  Part of
the difference you see here (takes about 2 seconds on my laptop) is GWL
running FILE-EXISTS? on 7769 files.  This happens in prepare-inputs; its
purpose:

  "Ensure that all files in the INPUTS-MAP alist exist and are linked to
  the expected locations.  Pick unspecified inputs from the environment.
  Return either the INPUTS-MAP alist with any additionally used input
  file names added, or raise a condition containing the list of missing
  files."

Another significant delay is introduced by the cache mechanism, which
computes a unique prefix based on the contents of all input files.  It's
not unexpected that this will take a little while, but it's not great
either.

The rest of the time is lost in inferior package lookups and in using
Guix to build a script that likely already exists.  The latter is
something that we could cache (given identical output of "guix describe"
we could skip the computation of the process scripts).
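That caching idea could be sketched like so (Python, with hypothetical names; the real key would come from the "guix describe" output and the process definitions):

```python
_script_cache = {}

def scripts_for(describe_output, process_names):
    # Hypothetical sketch: key the generated process scripts on the
    # "guix describe" output.  While that output is unchanged, reuse
    # the scripts instead of recomputing them through Guix.
    key = (describe_output, tuple(process_names))
    if key not in _script_cache:
        # Stand-in for the expensive Guix-side script generation.
        _script_cache[key] = [f"script-for-{name}" for name in process_names]
    return _script_cache[key]
```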

-- 
Ricardo



* Re: Processing large amounts of files
  2024-03-21 15:03 ` Ricardo Wurmus
@ 2024-03-21 15:33   ` Liliana Marie Prikler
  2024-03-26 21:30   ` Ricardo Wurmus
  1 sibling, 0 replies; 9+ messages in thread
From: Liliana Marie Prikler @ 2024-03-21 15:33 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: gwl-devel

Am Donnerstag, dem 21.03.2024 um 16:03 +0100 schrieb Ricardo Wurmus:
> 
> Liliana Marie Prikler <liliana.prikler@ist.tugraz.at> writes:
> 
> > For comparison:
> >   time cat /tmp/meow/{0..7769}
> >   […]
> >   
> >   real  0m0,144s
> >   user  0m0,049s
> >   sys   0m0,094s
> > 
> > It takes GWL 6 times longer to compute the workflow than to create
> > the inputs in Guile, and 600 times longer than to actually execute
> > the shell command.  I think there is room for improvement :)
> 
> GWL checks if all input files exist before running the command.  Part
> of the difference you see here (takes about 2 seconds on my laptop)
> is GWL running FILE-EXISTS? on 7769 files.  This happens in prepare-
> inputs; its purpose:
> 
>   "Ensure that all files in the INPUTS-MAP alist exist and are linked
>   to the expected locations.  Pick unspecified inputs from the
>   environment.  Return either the INPUTS-MAP alist with any
>   additionally used input file names added, or raise a condition
>   containing the list of missing files."
> 
> Another significant delay is introduced by the cache mechanism, which
> computes a unique prefix based on the contents of all input files. 
> It's not unexpected that this will take a little while, but it's not
> great either.
Is there a way to speed this up?  At the very least, I'd avoid hashing
the same file twice, but perhaps we could even go further and hash the
directory once w.r.t. all the top-level inputs.  At least some
workflows would probably benefit from stratified hashing, where we
first gather all top-level inputs, then add all outputs that could be
built from those with a single process, etc.  Then again, 2 seconds vs.
10 seconds would already be a great improvement.
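Avoiding the double hashing alone is simple in spirit; a Python sketch of per-file memoization (GWL would do the equivalent in Guile):

```python
import hashlib
from functools import lru_cache

@lru_cache(maxsize=None)
def file_hash(path):
    # Memoized content hash: each file is read and hashed at most once
    # per run, even if it appears as an input to many processes.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()
```

Stratified hashing would sit on top of this: hash the top-level inputs first, then derive digests for each later stratum from the already-known digests below it.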

> The rest of the time is lost in inferior package lookups and in using
> Guix to build a script that likely already exists.  The latter is
> something that we could cache (given identical output of "guix
> describe" we could skip the computation of the process scripts).
Yeah, I think we can assume "guix describe" to be constant.  Could we
do a manifest of sorts, where the scripts are only computed once, once
we know all the processes?

Cheers




* Re: Processing large amounts of files
  2024-03-21 14:34 ` Processing large amounts of files Ricardo Wurmus
@ 2024-03-25  7:42   ` Liliana Marie Prikler
  2024-03-25  9:25     ` Ricardo Wurmus
  0 siblings, 1 reply; 9+ messages in thread
From: Liliana Marie Prikler @ 2024-03-25  7:42 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: gwl-devel

Am Donnerstag, dem 21.03.2024 um 15:34 +0100 schrieb Ricardo Wurmus:
> [-guix-devel@gnu.org, +gwl-devel@gnu.org]
oops D:
> 
> [...]
> When running with "-l all" I see this:
> 
>   info: .75 Computing workflow `cat'...
>   debug: 3.13 Computing script for process `meow'
>   guix: 3.13 Looking up package `bash-minimal'
>   guix: 3.13 Opening inferior Guix at
> `/gnu/store/pb1nkrn3sg6a1j6c4r5j2ahygkf4vkv9-profile'
>   guix: 4.27 Looking up package `guix'
>   debug: 4.45 Generating all scripts and their dependencies.
>   debug: 4.89 Generating all scripts and their dependencies.
>   run: 6.73 Executing: /bin/sh -c
> /gnu/store/5idhbvhrwj3p53kkz2vikdn1ypncwj84-gwl-meow.scm '((inputs
> "/tmp/meow/0" ...
>   process: 8.80 In execvp of /bin/sh: Argument list too long
>   error: 8.80 Wrong type argument in position 1: #f
> 
> This at least tells us that the last error here is due to sh refusing
> to run.
Good to know, and I thought it'd be just that, but… shouldn't this
failure to invoke sh be caught by something?

> > For comparison:
> >   time cat /tmp/meow/{0..7769}
> >   […]
> >   
> >   real  0m0,144s
> >   user  0m0,049s
> >   sys   0m0,094s
> > 
> > It takes GWL 6 times longer to compute the workflow than to create
> > the inputs in Guile, and 600 times longer than to actually execute
> > the shell command.  I think there is room for improvement :)
> 
> Yeah, not good.  Do you have any recommendations?
We already talked about this in response to your second mail, but (LRU)
caching of things that can be cached would be one approach to take.
Perhaps there are also inefficiencies in auto-connecting inputs – not
exhibited by this example, but conceivable.

Design-wise, we might need a way of splitting large workflows anyhow.
Files and environment variables work, but feel clunky at the moment,
and files in particular remind me of recursive make… maybe when I get
the time, I can code something up and then look at ways for
simplification.

Cheers



* Re: Processing large amounts of files
  2024-03-25  7:42   ` Liliana Marie Prikler
@ 2024-03-25  9:25     ` Ricardo Wurmus
  2024-03-25 10:42       ` Ricardo Wurmus
  0 siblings, 1 reply; 9+ messages in thread
From: Ricardo Wurmus @ 2024-03-25  9:25 UTC (permalink / raw)
  To: Liliana Marie Prikler; +Cc: gwl-devel


Liliana Marie Prikler <liliana.prikler@ist.tugraz.at> writes:

>> When running with "-l all" I see this:
>> 
>>   info: .75 Computing workflow `cat'...
>>   debug: 3.13 Computing script for process `meow'
>>   guix: 3.13 Looking up package `bash-minimal'
>>   guix: 3.13 Opening inferior Guix at
>> `/gnu/store/pb1nkrn3sg6a1j6c4r5j2ahygkf4vkv9-profile'
>>   guix: 4.27 Looking up package `guix'
>>   debug: 4.45 Generating all scripts and their dependencies.
>>   debug: 4.89 Generating all scripts and their dependencies.
>>   run: 6.73 Executing: /bin/sh -c
>> /gnu/store/5idhbvhrwj3p53kkz2vikdn1ypncwj84-gwl-meow.scm '((inputs
>> "/tmp/meow/0" ...
>>   process: 8.80 In execvp of /bin/sh: Argument list too long
>>   error: 8.80 Wrong type argument in position 1: #f
>> 
>> This at least tells us that the last error here is due to sh refusing
>> to run.
> Good to know, and I thought it'd be just that, but… shouldn't this
> failure to invoke sh be caught through something?

Yes, it really should.  This may be a problem with how we capture stdout
and stderr.  I'll look into it.

>> > For comparison:
>> >   time cat /tmp/meow/{0..7769}
>> >   […]
>> >   
>> >   real  0m0,144s
>> >   user  0m0,049s
>> >   sys   0m0,094s
>> > 
>> > It takes GWL 6 times longer to compute the workflow than to create
>> > the inputs in Guile, and 600 times longer than to actually execute
>> > the shell command.  I think there is room for improvement :)
>> 
>> Yeah, not good.  Do you have any recommendations?
> We already talked about this in response to your second mail, but (LRU)
> caching of things that can be cached would be one approach to take.
> Perhaps there are also inefficiencies in auto-connecting inputs – not
> exhibited by this example, but conceivable.
>
> Design-wise, we might need a way of splitting large workflows anyhow.
> Files and environment variables work, but feel clunky at the moment,
> and files in particular remind me of recursive make… maybe when I get
> the time, I can code something up and then look at ways for
> simplification.

I'd be very happy to see a rough proposal and/or patches.  GWL is
currently unburdened because it hardly has any active/vocal users, so
I'm willing to evolve it in a direction that serves actual users.

-- 
Ricardo



* Re: Processing large amounts of files
  2024-03-25  9:25     ` Ricardo Wurmus
@ 2024-03-25 10:42       ` Ricardo Wurmus
  0 siblings, 0 replies; 9+ messages in thread
From: Ricardo Wurmus @ 2024-03-25 10:42 UTC (permalink / raw)
  To: Liliana Marie Prikler, gwl-devel


Ricardo Wurmus <rekado@elephly.net> writes:

> Liliana Marie Prikler <liliana.prikler@ist.tugraz.at> writes:
>
>>> When running with "-l all" I see this:
>>> 
>>>   info: .75 Computing workflow `cat'...
>>>   debug: 3.13 Computing script for process `meow'
>>>   guix: 3.13 Looking up package `bash-minimal'
>>>   guix: 3.13 Opening inferior Guix at
>>> `/gnu/store/pb1nkrn3sg6a1j6c4r5j2ahygkf4vkv9-profile'
>>>   guix: 4.27 Looking up package `guix'
>>>   debug: 4.45 Generating all scripts and their dependencies.
>>>   debug: 4.89 Generating all scripts and their dependencies.
>>>   run: 6.73 Executing: /bin/sh -c
>>> /gnu/store/5idhbvhrwj3p53kkz2vikdn1ypncwj84-gwl-meow.scm '((inputs
>>> "/tmp/meow/0" ...
>>>   process: 8.80 In execvp of /bin/sh: Argument list too long
>>>   error: 8.80 Wrong type argument in position 1: #f
>>> 
>>> This at least tells us that the last error here is due to sh refusing
>>> to run.
>> Good to know, and I thought it'd be just that, but… shouldn't this
>> failure to invoke sh be caught through something?
>
> Yes, it really should.  This may be a problem with how we capture stdout
> and stderr.  I'll look into it.

Fixed with commit f7d6b159e9423e69271503ef9ea92e191265d8ee.

The command processor erroneously returned the exit code and not the
composite status value containing both exit code and termination signal.

-- 
Ricardo



* Re: Processing large amounts of files
  2024-03-21 15:03 ` Ricardo Wurmus
  2024-03-21 15:33   ` Liliana Marie Prikler
@ 2024-03-26 21:30   ` Ricardo Wurmus
  2024-03-27  7:10     ` Liliana Marie Prikler
  1 sibling, 1 reply; 9+ messages in thread
From: Ricardo Wurmus @ 2024-03-26 21:30 UTC (permalink / raw)
  To: Liliana Marie Prikler, gwl-devel


Ricardo Wurmus <rekado@elephly.net> writes:

> Liliana Marie Prikler <liliana.prikler@ist.tugraz.at> writes:
>
>> For comparison:
>>   time cat /tmp/meow/{0..7769}
>>   […]
>>   
>>   real	0m0,144s
>>   user	0m0,049s
>>   sys	0m0,094s
>>
>> It takes GWL 6 times longer to compute the workflow than to create the
>> inputs in Guile, and 600 times longer than to actually execute the
>> shell command.  I think there is room for improvement :)
>
> GWL checks if all input files exist before running the command.  Part of
> the difference you see here (takes about 2 seconds on my laptop) is GWL
> running FILE-EXISTS? on 7769 files.  This happens in prepare-inputs; its
> purpose:
>
>   "Ensure that all files in the INPUTS-MAP alist exist and are linked to
>   the expected locations.  Pick unspecified inputs from the environment.
>   Return either the INPUTS-MAP alist with any additionally used input
>   file names added, or raise a condition containing the list of missing
>   files."
>
> Another significant delay is introduced by the cache mechanism, which
> computes a unique prefix based on the contents of all input files.  It's
> not unexpected that this will take a little while, but it's not great
> either.

With commit f4442e409cf05d0c7cc4d6a251626d22efaffe8c it's a little
faster.  We used a whole lot of alists, and this becomes slow when there
are thousands of inputs.  We're now using hash tables.
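The effect is easy to reproduce outside Guile.  In this Python sketch the alist lookup scans linearly (like Guile's assoc), so n lookups over n entries is quadratic, while hash-table lookups stay constant on average:

```python
def alist_lookup(alist, key):
    # Linear scan, analogous to Guile's assoc: O(n) per lookup.
    for k, v in alist:
        if k == key:
            return v
    return None

n = 2000
alist = [(f"/tmp/meow/{i}", i) for i in range(n)]
table = dict(alist)  # hash table: O(1) expected per lookup

# Looking up every key is O(n^2) via the alist but O(n) via the table.
assert all(alist_lookup(alist, k) == table[k] for k, _ in alist)
```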

-- 
Ricardo



* Re: Processing large amounts of files
  2024-03-26 21:30   ` Ricardo Wurmus
@ 2024-03-27  7:10     ` Liliana Marie Prikler
  2024-03-27  9:58       ` Ricardo Wurmus
  0 siblings, 1 reply; 9+ messages in thread
From: Liliana Marie Prikler @ 2024-03-27  7:10 UTC (permalink / raw)
  To: Ricardo Wurmus, gwl-devel

Am Dienstag, dem 26.03.2024 um 22:30 +0100 schrieb Ricardo Wurmus:
> 
> Ricardo Wurmus <rekado@elephly.net> writes:
> > Another significant delay is introduced by the cache mechanism,
> > which computes a unique prefix based on the contents of all input
> > files.  It's not unexpected that this will take a little while, but
> > it's not great either.
> 
> With commit f4442e409cf05d0c7cc4d6a251626d22efaffe8c it's a little
> faster.  We used a whole lot of alists, and this becomes slow when
> there are thousands of inputs.  We're now using hash tables.
SGTM.  I assume the caches are internal and do not affect input order
otherwise?  i.e. a process that declares

  inputs : files "foo" "bar" "baz"

will still see the same {{inputs}} as before?  I see there are tests
covering make-process, but I'm not quite sure how to parse "prepare-
inputs returns the unmodified inputs-map when all files exist" tbh.

Cheers



* Re: Processing large amounts of files
  2024-03-27  7:10     ` Liliana Marie Prikler
@ 2024-03-27  9:58       ` Ricardo Wurmus
  0 siblings, 0 replies; 9+ messages in thread
From: Ricardo Wurmus @ 2024-03-27  9:58 UTC (permalink / raw)
  To: Liliana Marie Prikler; +Cc: gwl-devel


Liliana Marie Prikler <liliana.prikler@ist.tugraz.at> writes:

> Am Dienstag, dem 26.03.2024 um 22:30 +0100 schrieb Ricardo Wurmus:
>> 
>> Ricardo Wurmus <rekado@elephly.net> writes:
>> > Another significant delay is introduced by the cache mechanism,
>> > which computes a unique prefix based on the contents of all input
>> > files.  It's not unexpected that this will take a little while, but
>> > it's not great either.
>> 
>> With commit f4442e409cf05d0c7cc4d6a251626d22efaffe8c it's a little
>> faster.  We used a whole lot of alists, and this becomes slow when
>> there are thousands of inputs.  We're now using hash tables.
> SGTM.  I assume the caches are internal and do not affect input order
> otherwise?  i.e. a process that declares
>
>   inputs : files "foo" "bar" "baz"
>
> will still see the same {{inputs}} as before?

Yes, the order should always be the same.

> I see there are tests
> covering make-process, but I'm not quite sure how to parse "prepare-
> inputs returns the unmodified inputs-map when all files exist" tbh.

Input handling is a big bag of compromises.  In the distant past
workflows hardcoded input file names, which were assumed to be present
at runtime.  That wasn't great for my use case, which was to specify a
workflow as a generic thing that has deterministic behavior but allows
for plugging in different input files.

That's why I decoupled process scripts from their inputs; inputs are
passed as arguments to these unchanging scripts.

GWL currently assumes that *any* input anywhere in the workflow can be
injected by the user.  There is an option to provide an input mapping,
which maps an existing file to an input file name in the workflow.

GWL will first compute free inputs, i.e. inputs that are not provided by
any of the outputs of any process in the workflow.  GWL expects that
these free inputs are either declared by the user or --- and this is a
pragmatic decision that I'm not too happy with --- that a file matching
the input name can be found relative to the current directory.

The above test is for the simple case where no files were discovered
to fill the slots of computed free inputs.


The caching mechanism exists to avoid rerunning processes when their
output files already exist.  In the presence of input maps and file
discovery relative to the current working directory, however, it is
necessary to rerun processes when the input files differ.

GWL computes hashes of the mapped input files and of all process scripts
to arrive at a cache prefix.  This cache prefix is derived from a chain
of hashes that covers the workflow definitions and the effective inputs.
Given the same input files and the same workflow we can avoid running
the whole workflow again when the cache already contains outputs from a
previous run.
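A Python sketch of such a hash chain (the names and exact layout are assumptions, not the actual GWL code):

```python
import hashlib

def cache_prefix(workflow_definition, input_hashes):
    # Fold a hash of the workflow definition together with the ordered
    # per-input content hashes into one digest.  The same workflow with
    # the same inputs always yields the same prefix, so outputs cached
    # under it from a previous run can be reused.
    h = hashlib.sha256(workflow_definition.encode("utf-8"))
    for input_hash in input_hashes:
        h.update(input_hash.encode("utf-8"))
    return h.hexdigest()[:16]
```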

-- 
Ricardo



end of thread, other threads:[~2024-03-27 10:13 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <2010bdb88116d64da3650b06e58979518b2c7277.camel@ist.tugraz.at>
2024-03-21 14:34 ` Processing large amounts of files Ricardo Wurmus
2024-03-25  7:42   ` Liliana Marie Prikler
2024-03-25  9:25     ` Ricardo Wurmus
2024-03-25 10:42       ` Ricardo Wurmus
2024-03-21 15:03 ` Ricardo Wurmus
2024-03-21 15:33   ` Liliana Marie Prikler
2024-03-26 21:30   ` Ricardo Wurmus
2024-03-27  7:10     ` Liliana Marie Prikler
2024-03-27  9:58       ` Ricardo Wurmus
