unofficial mirror of guix-devel@gnu.org 
 help / color / mirror / code / Atom feed
From: Roel Janssen <roel@gnu.org>
To: zimoun <zimon.toutoune@gmail.com>
Cc: Guix Devel <guix-devel@gnu.org>
Subject: Re: GWL pipelined process composition ?
Date: Wed, 18 Jul 2018 19:29:15 +0200	[thread overview]
Message-ID: <87601cxxjo.fsf@gnu.org> (raw)
In-Reply-To: <CAJ3okZ0TFmCLygVhrb0gEe0yz=ub2kiSc=g+4=oOLBUkj+pitA@mail.gmail.com>

Hello Simon,

zimoun <zimon.toutoune@gmail.com> writes:

> Hi,
>
> I am asking if it should be possible to optionally stream the
> inputs/outputs when the workflow is processed without writing the
> intermediate files on disk.
>
> Well, a workflow is basically:
>  - some process units (or task or rule) that take inputs (file) and
> produce outputs (other file)
>  - a graph that describes the relationship of theses units.
>
> The simplest workflow is:
>     x --A--> y --B--> z
>  - process A: input file x, output file y
>  - process B: input file y, output file z
>
> Currently, the file y is written on disk by A then read by B. Which
> leads to IO inefficiency. Especially when the file is large. And/or
> when there is several same kind of unit done in parallel.
>
>
> Should be a good idea to have something like the shell pipe `|` to
> compose the process unit ?
> If yes how ? I have no clue where to look...

That's an interesting idea.  Of course, you could literally use the
shell pipe within a single process.  And I think this makes sense, because
if a shell pipe is beneficial in your situation, then it is likely to be
beneficial to run the two programs connected by the pipe on a single
computer / in a single job.

Here's an example:
(define-public A
  (process
    (name "A")
    (package-inputs (list samtools gzip))
    (data-inputs "/tmp/sample.sam")
    (outputs "/tmp/sample.sam.gz")
    (procedure
     #~(system (string-append "samtools view " #$data-inputs
                              " | gzip -c > " #$outputs)))))

> I agree that the storage of intermediate files avoid to compute again
> and again unmodified part of the workflow. In this saves time when
> developing the workflow.
> However, the storage of temporary files appears unnecessary once the
> workflow is done and when it does not need to run on cluster.

I think it's either an efficient data transfer (using a pipe), or
writing to disk in between for better restore points.  We cannot have
both.  The former can already be achieved with the shell pipe, and the
latter can be achieved by writing two processes.

Maybe we can come up with a convenient way to combine two processes
using a shell pipe.  But this needs more thought!

If you have an idea to improve on this, please do share. :-)

> Thank you for all the work about the Guix ecosystem.
>
> All the best,
> simon

Thanks!

Kind regards,
Roel Janssen

  reply	other threads:[~2018-07-18 17:29 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-07-18 11:20 GWL pipelined process composition ? zimoun
2018-07-18 17:29 ` Roel Janssen [this message]
2018-07-18 21:55   ` zimoun
2018-07-19  7:13     ` Pjotr Prins
2018-07-19 11:44       ` zimoun
2018-07-19  8:15     ` Roel Janssen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://guix.gnu.org/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87601cxxjo.fsf@gnu.org \
    --to=roel@gnu.org \
    --cc=guix-devel@gnu.org \
    --cc=zimon.toutoune@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).