unofficial mirror of guix-devel@gnu.org 
* GWL pipelined process composition ?
@ 2018-07-18 11:20 zimoun
  2018-07-18 17:29 ` Roel Janssen
  0 siblings, 1 reply; 6+ messages in thread
From: zimoun @ 2018-07-18 11:20 UTC (permalink / raw)
  To: Guix Devel

Hi,

I am wondering whether it would be possible to optionally stream the
inputs/outputs when the workflow is processed, without writing the
intermediate files to disk.

Well, a workflow is basically:
 - some process units (or tasks, or rules) that take inputs (files)
and produce outputs (other files)
 - a graph that describes the relationships between these units.

The simplest workflow is:
    x --A--> y --B--> z
 - process A: input file x, output file y
 - process B: input file y, output file z

Currently, the file y is written to disk by A and then read by B,
which leads to I/O inefficiency, especially when the file is large
and/or when several units of the same kind run in parallel.


Would it be a good idea to have something like the shell pipe `|` to
compose process units?
If yes, how? I have no clue where to look...


I agree that storing intermediate files avoids recomputing unmodified
parts of the workflow again and again, which saves time when
developing the workflow.
However, storing temporary files appears unnecessary once the workflow
is done and when it does not need to run on a cluster.


Thank you for all the work about the Guix ecosystem.

All the best,
simon


* Re: GWL pipelined process composition ?
  2018-07-18 11:20 GWL pipelined process composition ? zimoun
@ 2018-07-18 17:29 ` Roel Janssen
  2018-07-18 21:55   ` zimoun
  0 siblings, 1 reply; 6+ messages in thread
From: Roel Janssen @ 2018-07-18 17:29 UTC (permalink / raw)
  To: zimoun; +Cc: Guix Devel

Hello Simon,

zimoun <zimon.toutoune@gmail.com> writes:

> Hi,
>
> I am wondering whether it would be possible to optionally stream the
> inputs/outputs when the workflow is processed, without writing the
> intermediate files to disk.
>
> Well, a workflow is basically:
>  - some process units (or tasks, or rules) that take inputs (files)
> and produce outputs (other files)
>  - a graph that describes the relationships between these units.
>
> The simplest workflow is:
>     x --A--> y --B--> z
>  - process A: input file x, output file y
>  - process B: input file y, output file z
>
> Currently, the file y is written to disk by A and then read by B,
> which leads to I/O inefficiency, especially when the file is large
> and/or when several units of the same kind run in parallel.
>
>
> Would it be a good idea to have something like the shell pipe `|`
> to compose process units?
> If yes, how? I have no clue where to look...

That's an interesting idea.  Of course, you could literally use the
shell pipe within a single process.  And I think this makes sense, because
if a shell pipe is beneficial in your situation, then it is likely to be
beneficial to run the two programs connected by the pipe on a single
computer / in a single job.

Here's an example:
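;; Pipe "samtools view" straight into gzip within one procedure, so
;; the intermediate SAM text never touches the disk.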
(define-public A
  (process
    (name "A")
    (package-inputs (list samtools gzip))
    (data-inputs "/tmp/sample.sam")
    (outputs "/tmp/sample.sam.gz")
    (procedure
     #~(system (string-append "samtools view " #$data-inputs
                              " | gzip -c > " #$outputs)))))

> I agree that storing intermediate files avoids recomputing
> unmodified parts of the workflow again and again, which saves time
> when developing the workflow.
> However, storing temporary files appears unnecessary once the
> workflow is done and when it does not need to run on a cluster.

I think it's either an efficient data transfer (using a pipe), or
writing to disk in between for better restore points.  We cannot have
both.  The former can already be achieved with the shell pipe, and the
latter can be achieved by writing two processes.

Maybe we can come up with a convenient way to combine two processes
using a shell pipe.  But this needs more thought!
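
As a very first step, one could at least abstract away the string
plumbing.  A purely hypothetical helper (this does not exist in GWL
today), just to sketch the idea:

(define (pipe-commands . commands)
  ;; Join shell COMMANDS into a single "a | b | c" pipeline string.
  (string-join commands " | "))

With it, the procedure above could read:

(procedure
 #~(system (pipe-commands
            (string-append "samtools view " #$data-inputs)
            (string-append "gzip -c > " #$outputs))))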

If you have an idea to improve on this, please do share. :-)

> Thank you for all the work about the Guix ecosystem.
>
> All the best,
> simon

Thanks!

Kind regards,
Roel Janssen


* Re: GWL pipelined process composition ?
  2018-07-18 17:29 ` Roel Janssen
@ 2018-07-18 21:55   ` zimoun
  2018-07-19  7:13     ` Pjotr Prins
  2018-07-19  8:15     ` Roel Janssen
  0 siblings, 2 replies; 6+ messages in thread
From: zimoun @ 2018-07-18 21:55 UTC (permalink / raw)
  To: Roel Janssen; +Cc: Guix Devel

Hi Roel,

Thank you for all your comments.


> Maybe we can come up with a convenient way to combine two processes
> using a shell pipe.  But this needs more thought!

Yes, from my point of view, the classic shell pipe `|` has two strong
limitations for workflows:
 1. it does not compose at the 'process' level but at the 'procedure' level
 2. it cannot deal with two inputs.

As an illustration of point 1, it seems more in the "functional
spirit" to me to write one process/task/unit corresponding to
"samtools view" and another one for the compression with "gzip -c".
Then, if you have a process that filters some FASTQ files, you can
easily reuse the compression process and compose with it. For more
complicated workflows, such as RNA-seq and others, this composability
seems an advantage.

As an illustration of point 2, here is something I cannot do with a
shell pipe:

  dd if=/dev/urandom of=file1 bs=1024 count=1k
  dd if=/dev/urandom of=file2 bs=1024 count=2k
  tar -cvf file.tar file1 file2

or whatever processes instead of `dd`, which is perhaps not the right
example here.
To be clear, I would like:
  a process that outputs fileA
  a process that outputs fileB
  a process that takes fileA *and* fileB as inputs
without writing fileA and fileB to disk.



> If you have an idea to improve on this, please do share. :-)

I do not know where to look. :-)
Any ideas?


All the best,
simon


* Re: GWL pipelined process composition ?
  2018-07-18 21:55   ` zimoun
@ 2018-07-19  7:13     ` Pjotr Prins
  2018-07-19 11:44       ` zimoun
  2018-07-19  8:15     ` Roel Janssen
  1 sibling, 1 reply; 6+ messages in thread
From: Pjotr Prins @ 2018-07-19  7:13 UTC (permalink / raw)
  To: zimoun; +Cc: Guix Devel

On Wed, Jul 18, 2018 at 11:55:25PM +0200, zimoun wrote:
> Hi Roel,
> 
> Thank you for all your comments.
> 
> > Maybe we can come up with a convenient way to combine two processes
> > using a shell pipe.  But this needs more thought!
> 
> Yes, from my point of view, the classic shell pipe `|` has two strong
> limitations for workflows:
>  1. it does not compose at the 'process' level but at the 'procedure' level
>  2. it cannot deal with two inputs.
> 
> As an illustration of point 1, it seems more in the "functional
> spirit" to me to write one process/task/unit corresponding to
> "samtools view" and another one for the compression with "gzip -c".
> Then, if you have a process that filters some FASTQ files, you can
> easily reuse the compression process and compose with it. For more
> complicated workflows, such as RNA-seq and others, this composability
> seems an advantage.

Yes, but the original question was whether you could stream data
without writing to disk, right? Unix pipes are the system's way of
providing that functionality - with the added advantage of parallel
processing between Unix processes. The downside, as you say, is that
it is not that composable.

To make it composable you'd have to manage process communication -
using some network/socket protocol - and GWL would have to fire up the
processes in parallel so they can communicate - preferably on one box. 

That is a significant and fragile piece of functionality to add ;).

Error/failure handling in particular will be hard.

Unsurprisingly, there are no systems that handle that well - none
that I am aware of, anyway. The best you'll get today is composable
containers that 'talk' with each other. But that is ad hoc network
programming.

The great thing about the GWL is that it describes pipelines
deterministically and makes great use of GNU Guix. I think those are
killer features. 

Adding composable pipes will magnify the size of the code base and
make it fragile at the same time. Besides, network transport layers
will add another possibility of I/O bottlenecks. It is a whole
project in itself ;)

Probably a good idea to keep it simple.

I'd stick with pipes when possible. All pipelines can be described as
a combination of sequential processing and scatter/gather processing:
there will always be inefficiencies. To address these you'll need to
rewrite the software tools you want to run (as we did with sambamba at
the time, to replace samtools). 

In FP you are working in one space, so it is much easier to compose
(functions).
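
For instance, in Guile the x --A--> y --B--> z chain from the first
message is literally one line when f and g are ordinary functions
living in the same process:

  (define workflow (compose g f))   ; z = (workflow x), no file y

It is crossing process (and machine) boundaries that makes the
workflow case hard.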

Pj.


* Re: GWL pipelined process composition ?
  2018-07-18 21:55   ` zimoun
  2018-07-19  7:13     ` Pjotr Prins
@ 2018-07-19  8:15     ` Roel Janssen
  1 sibling, 0 replies; 6+ messages in thread
From: Roel Janssen @ 2018-07-19  8:15 UTC (permalink / raw)
  To: zimoun; +Cc: Guix Devel


zimoun <zimon.toutoune@gmail.com> writes:

> Hi Roel,
>
> Thank you for all your comments.
>
>
>> Maybe we can come up with a convenient way to combine two processes
>> using a shell pipe.  But this needs more thought!
>
> Yes, from my point of view, the classic shell pipe `|` has two strong
> limitations for workflows:
>  1. it does not compose at the 'process' level but at the 'procedure' level
>  2. it cannot deal with two inputs.

Yes, and this strongly suggests that shell pipes are indeed limited to
the procedures *the shell* can combine.  So we can only use them at the
procedure level.  They weren't designed to deal with two (or more)
inputs, and supporting that would make them vastly more complex.

> As an illustration of point 1, it seems more in the "functional
> spirit" to me to write one process/task/unit corresponding to
> "samtools view" and another one for the compression with "gzip -c".
> Then, if you have a process that filters some FASTQ files, you can
> easily reuse the compression process and compose with it. For more
> complicated workflows, such as RNA-seq and others, this composability
> seems an advantage.

Maybe we could solve this at the symbolic (programming) level instead.

So if we were to try to avoid using "| gzip -c > ..." all over our code,
we could define a function to wrap this.  Here's a simple example:

(define (with-compressed-output command output-file)
  (system (string-append command " | gzip -c > " output-file)))

And then you could use it in a procedure like so:

(define-public A
  (process
    (name "A")
    (package-inputs (list samtools gzip))
    (data-inputs "/tmp/sample.sam")
    (outputs "/tmp/sample.sam.gz")
    (procedure
     #~(with-compressed-output
         (string-append "samtools view " #$data-inputs)
         #$outputs))))

This isn't perfect, because we still need to include “gzip” in the
‘package-inputs’.  It doesn't allow multiple input files, nor does it
split the “gzip” command from the “samtools” command on the process
level.  However, it does allow us to express the idea that we want to
compress the output of a command and save that in a file without having
to explicitly provide the commands to do that.
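
For what it's worth, at run time

  (with-compressed-output "samtools view /tmp/sample.sam"
                          "/tmp/sample.sam.gz")

simply executes:

  samtools view /tmp/sample.sam | gzip -c > /tmp/sample.sam.gz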

>
> As an illustration of point 2, here is something I cannot do with a
> shell pipe:
>
>   dd if=/dev/urandom of=file1 bs=1024 count=1k
>   dd if=/dev/urandom of=file2 bs=1024 count=2k
>   tar -cvf file.tar file1 file2
>
> or whatever processes instead of `dd`, which is perhaps not the right
> example here.
> To be clear, I would like:
>   a process that outputs fileA
>   a process that outputs fileB
>   a process that takes fileA *and* fileB as inputs
> without writing fileA and fileB to disk.

Given the ‘dd’ example, I don't see how that could work without
reinventing the way filesystems work.

> All the best,
> simon

Thanks!

Kind regards,
Roel Janssen


* Re: GWL pipelined process composition ?
  2018-07-19  7:13     ` Pjotr Prins
@ 2018-07-19 11:44       ` zimoun
  0 siblings, 0 replies; 6+ messages in thread
From: zimoun @ 2018-07-19 11:44 UTC (permalink / raw)
  To: Pjotr Prins; +Cc: Guix Devel

Hi Pjotr and Roel,

Thank you for the explanations.
I am not sure I have the skills to understand all of them.


> Yes, but the original question was whether you could stream data
> without writing to disk, right? Unix pipes are the system's way of
> providing that functionality - with the added advantage of parallel
> processing between Unix processes. The downside, as you say, is that
> it is not that composable.

My original question was about pipelined process composition :-)
It is still my question ;-)

Unix pipes allow composition at the procedure level, not at the
process level.


> Unsurprisingly, there are no systems that handle that well - none
> that I am aware of, anyway. The best you'll get today is composable
> containers that 'talk' with each other. But that is ad hoc network
> programming.

Hum? I must be missing a point.
I quickly looked at some Stack Overflow questions about C
implementations of the shell pipe.
Perhaps I need to test and fail by myself to understand why it is
hard to capture two stdouts sequentially and then use them as stdins.
I mean, the single-input write-on-disk version is:
 y = f(x)  # process x and then write y
 z = g(y)  # read y, process it, then write z
and the pipelined version is z = g(f(x)), where writing 'y' is
avoided. This composition is what the shell pipe does.

The extension to two inputs is:
 r = f(p)
 s = h(q)
 z = g(r, s)
and the pipelined version is z = g(f(p), h(q)).

Why does fork+dup2 not work to implement this "double" composition?
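
Naively, I would imagine something with named pipes (FIFOs), so that
'r' and 's' exist as names but never as regular files on disk. Here
is an untested Guile sketch of what I have in mind, where f, h and g
stand for arbitrary commands:

  (define fifo-r "/tmp/fifo-r")
  (define fifo-s "/tmp/fifo-s")
  ;; Create the two named pipes.
  (mknod fifo-r 'fifo #o600 0)
  (mknod fifo-s 'fifo #o600 0)
  ;; Start the producers in the background; each one blocks until
  ;; g opens its FIFO for reading.
  (system (string-append "f p > " fifo-r " &"))
  (system (string-append "h q > " fifo-s " &"))
  ;; g reads both FIFOs as if they were ordinary files.
  (system (string-append "g " fifo-r " " fifo-s " > z"))
  (delete-file fifo-r)
  (delete-file fifo-s)

But I guess this only works when g reads its inputs as streams,
without seeking, which is where real tools may break.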


> Adding composable pipes will magnify the size of the code base and
> make it fragile at the same time. Besides, network transport layers
> will add another possibility of I/O bottlenecks. It is a whole
> project in itself ;)

I trust you :-)

Even so, I do not see why network layers should be needed to work
around the filesystem.
Maybe we are not talking about the same issue.
One is composing pipes in a distributed-memory context.
Another is composing pipes in a shared-memory context.

I am talking about composing pipes on one machine. :-)


> In FP you are working in one space, so it is much easier to compose
> (functions).

At the level of processes, you are also in one single space: the
space of files, i.e., the filesystem.


Now, after your explanations, the questions I ask myself are:
 1. is it possible to design (and implement in Guile) shell pipes
that deal with two (or more) inputs?
 2. is it possible to temporarily mount something in RAM to fake the
filesystem? (see the sketch below)
 3. isn't the already-invented wheel here the virtual filesystem
(VFS)?
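
For question 2, maybe /dev/shm already provides this on GNU/Linux: it
is usually a tmpfs mounted by default, so "files" written there live
in RAM only. An untested sketch:

  (define y "/dev/shm/intermediate.sam")
  (system (string-append "samtools view /tmp/sample.sam > " y))
  (system (string-append "gzip -c " y " > /tmp/sample.sam.gz"))
  (delete-file y)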
Let me know your experienced insights :-)



All the best,
simon

