unofficial mirror of gwl-devel@gnu.org
From: Ricardo Wurmus <rekado@elephly.net>
To: zimoun <zimon.toutoune@gmail.com>
Cc: gwl-devel@gnu.org
Subject: Re: [GWL] (random) next steps?
Date: Wed, 16 Jan 2019 23:08:34 +0100
Message-ID: <87won4gsrh.fsf@elephly.net>
In-Reply-To: <CAJ3okZ09sJa5EOATPiWGSVbk2nEmZH+grLdBybc1T2XF2oo-0w@mail.gmail.com>


Hi Simon,

[- guix-devel@gnu.org]

I wrote:

> We can connect a graph by joining the inputs of one process with the
> outputs of another.
>
> With a content addressed store we would run processes in isolation and
> map the declared data inputs into the environment.  Instead of working
> on the global namespace of the shared file system we can learn from Guix
> and strictly control the execution environment.  After a process has run
> to completion, only files that were declared as outputs end up in the
> content addressed store.
>
> A process could declare outputs like this:
>
>     (define the-process
>       (process
>         (name 'foo)
>         (outputs
>          '((result "path/to/result.bam")
>            (meta   "path/to/meta.xml")))))
>
> Other processes can then access these files with:
>
>     (output the-process 'result)
>
> i.e. the file corresponding to the declared output “result” of the
> process named by the variable “the-process”.
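
As a rough illustration of the quoted idea (plain shell, not GWL code):
the process runs in a scratch directory, and afterwards only the
declared outputs are copied into a store under a name derived from
their content hash.  The directory layout, the "hash-name" naming
scheme and the use of sha256sum from GNU coreutils are all assumptions
made for this sketch.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)    # isolated scratch directory for the process
store=$(mktemp -d)   # stand-in for the content addressed store

# Hypothetical process body: two declared outputs plus one stray file.
( cd "$work" && mkdir -p path/to \
  && printf 'reads\n'  > path/to/result.bam \
  && printf '<meta/>\n' > path/to/meta.xml \
  && printf 'junk\n'   > scratch.tmp )

# Only the declared outputs are copied, each named by its content hash.
for out in path/to/result.bam path/to/meta.xml; do
  hash=$(sha256sum "$work/$out" | cut -d' ' -f1)
  cp "$work/$out" "$store/$hash-$(basename "$out")"
done

stored=$(ls "$store")
echo "$stored"
```

The stray file scratch.tmp never reaches the store, because it was not
declared as an output.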

You wrote:

> From my point of view, there are 2 different paths:
>  1- the inputs-outputs are attached to the process/rule/unit
>  2- the processes/rules/units are pure functions and then the
> `workflow' describes how to glue them together.
[…]
> On one hand, with path 1-, it is hard to reuse the process/rule
> because the composition is hard-coded in the inputs-outputs
> (duplication of the same process/rule with different inputs-outputs).
> The graph is written by the user when they write the inputs-outputs
> chain.
> On the other hand, with path 2-, it is difficult to provide both
> the inputs-outputs to the function and also the graph without
> duplicating some code.

I agree with this assessment.

I would like to note, though, that at least the declaration of outputs
works in both systems.  Only when an exact input is tightly attached to
a process/rule do we limit ourselves to the first path where composition
is inflexible.

> Last, is it useful to write the intermediate files to disk if they are
> not stored?
> In the thread [0], we discussed the possibility of streaming the pipes.
> Let's say, the simple case:
>    filter input > filtered
>    quality filtered > output
> and the piped version is better if you do not mind about the filtered file:
>    filter input | quality > output
>
> However, the classic pipe does not fit this case:
>    filter input_R1 > R1_filtered
>    filter input_R2 > R2_filtered
>    align R1_filtered R2_filtered > output_aligned
> In general, one is not interested in keeping the files
> R{1,2}_filtered. So why spend time writing them to disk and hashing
> them?
>
> In other words, is it doable to stream the `processes' at the process
> level?
[…]
> [0] http://lists.gnu.org/archive/html/guix-devel/2018-07/msg00231.html

For this to work at all, inputs and outputs must be declared.  This
wasn’t mentioned before, but it could of course be done in the workflow
declaration rather than the individual process descriptions.

But even then it isn’t clear to me how to do this in a general fashion.
It may work fine for tools that write to I/O streams, but we would
probably need mechanisms to declare this behaviour.  It cannot be
generally inferred, nor can a process automatically change the behaviour
of its procedure to switch between the generation of intermediate files
and output to a stream.

The GWL examples show the use of the (system "foo > out.file") idiom,
which I don’t like very much.  I’d prefer to use "foo" directly and
declare the output to be a stream.

> Last, could we add a GWL session to the before-FOSDEM days?

The Guix Days are what we make of them, so yes, we can have a GWL
session there :)

--
Ricardo

