From: zimoun <zimon.toutoune@gmail.com>
To: Ricardo Wurmus <rekado@elephly.net>
Cc: Guix Devel <guix-devel@gnu.org>, gwl-devel@gnu.org
Subject: Re: [GWL] (random) next steps?
Date: Mon, 17 Dec 2018 18:33:49 +0100
Message-ID: <CAJ3okZ3DxZz017f4W2=Of2WcP79BXnYXg3D3tUitpgHFSaxr_w@mail.gmail.com>
In-Reply-To: <874lbfxijq.fsf@elephly.net>

Dear Ricardo,

> I’m working on updating the GWL manual to show a simple wispy example.

Nice!
I am also taking a closer look at the manual. :-)


> > Some GWL scripts are already there.
> I only know of Roel’s ATACseq workflow[1], but we could add a few more
> independent process definitions for simple tasks such as sequence
> alignment, trimming, etc.  This could be a git subtree that includes an
> independent repository.

Yes, it should be a git subtree.
One idea would be to collect examples and, at the same time, build up a
kind of test suite.
I mean, I have in mind to collect simple, minimal examples that also
populate tests/.

As a starting point (and with minimal effort), I would like to rewrite
the minimal Snakemake examples, e.g.,
https://snakemake.readthedocs.io/en/stable/getting_started/examples.html

Once a wispy "syntax" is fixed, it will be a good exercise. ;-)


> Scheme itself has (format #f "…" foo bar) for string interpolation.
> With a little macro we could generate the right “format” invocation, so
> that the user could do something similar to what you suggested:
>
>     (shell "gzip ${data-inputs} -c > ${outputs}")
>
>     –> (system (format #f "gzip ~a -c > ~a" data-inputs outputs))
>
> String concatenation is one possibility, but I hope we can do better
> than that.  scsh offers special process forms that would allow us to do
> things like this:
>
>     (shell (gzip ,data-inputs -c > ,outputs))
>
> or
>
>     (run (gzip ,data-inputs -c)
>          (> 1 ,outputs))
>
> Maybe we can take some inspiration from scsh.

I did not know about scsh. I am taking a look...

What I have in mind is to reduce the "gap" between the Lisp syntax and
more mainstream-ish syntaxes such as Snakemake or CWL.
The comma forms, such as (shell (gzip ,data-inputs -c > ,outputs)), are nice!
But they are less "natural" than simple string interpolation, at
least to people in my environment. ;-)
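
To make the string-interpolation option concrete, here is a minimal
sketch in Guile. It is a run-time procedure, not the macro Ricardo
describes, and the names `interpolate` and `command` as well as the file
names are made up for illustration; nothing here is existing GWL code.
It also does no shell quoting, which a real implementation would need.

```scheme
(use-modules (ice-9 regex))

;; Hypothetical helper, not part of the GWL: replace each ${name} in
;; TEMPLATE with the value associated with "name" in BINDINGS (an alist
;; with string keys).  Unknown placeholders raise an error.
(define (interpolate template bindings)
  (regexp-substitute/global
   #f "\\$\\{([^}]+)\\}" template
   'pre
   (lambda (m)
     (let ((value (assoc-ref bindings (match:substring m 1))))
       (unless value
         (error "unbound placeholder" (match:substring m 1)))
       (format #f "~a" value)))
   'post))

;; Illustrative file names only.
(define command
  (interpolate "gzip ${data-inputs} -c > ${outputs}"
               '(("data-inputs" . "sample.fastq")
                 ("outputs" . "sample.fastq.gz"))))

(display command)
(newline)
;; prints: gzip sample.fastq -c > sample.fastq.gz
```

The resulting string could then be handed to `system`, as in the quoted
(system (format #f ...)) expansion.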

What do you think?


> > 6.
> > The graph of dependencies between the processes/units/rules is written
> > by hand. What would be the best strategy to capture it? By files "à
> > la" Snakemake? Other?
>
> The GWL currently does not use the input information provided by the
> user in the data-inputs field.  For the content-addressable store we
> will need to change this.  The GWL will then be able to determine that
> data-inputs are in fact the outputs of other processes.

Hmm, nice, but how?
I mean, the graph cannot be deduced, so it needs to be written by
hand somehow, doesn't it?
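
For what it is worth, one way such a deduction could work, "à la"
Snakemake, is to match file names: a process depends on another when one
of its inputs is among the other's outputs. The sketch below assumes
every process declares its inputs and outputs as plain file names; the
record layout and all names (`processes`, `dependency-edges`, the files)
are hypothetical and not how the GWL represents processes.

```scheme
(use-modules (srfi srfi-1))

;; Hypothetical process descriptions: (name inputs outputs),
;; with file names given as strings.
(define processes
  '((align  ("reads.fastq" "reference.fa") ("aligned.bam"))
    (sort   ("aligned.bam")                ("sorted.bam"))
    (report ("sorted.bam")                 ("report.txt"))))

(define (process-name p) (first p))
(define (process-inputs p) (second p))
(define (process-outputs p) (third p))

;; Return a list of (producer . consumer) pairs.  An edge exists when
;; at least one input of CONSUMER is an output of PRODUCER.
(define (dependency-edges processes)
  (append-map
   (lambda (consumer)
     (filter-map
      (lambda (producer)
        (and (not (eq? producer consumer))
             (pair? (lset-intersection string=?
                                       (process-outputs producer)
                                       (process-inputs consumer)))
             (cons (process-name producer) (process-name consumer))))
      processes))
   processes))

(display (dependency-edges processes))
(newline)
;; prints: ((align . sort) (sort . report))
```

This only covers the case where file names are known up front; whether
that is enough for the GWL is exactly the open question.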



Last, just to fix ideas about the input/output sizes we are talking
about.
An aligner such as Bowtie2 or BWA takes as inputs:
 - a fixed dataset (the reference): approx. 25GB for the human species.
 - experimental data (a specific genome): approx. 10GB for some
kinds of sequencing, and a series is approx. 50 experiments
or more (one cohort); so you have to deal with 500GB for one analysis.
The output for each dataset is around 20GB. This output is then used by
other tools to trim, filter, compare, etc.
I mean, part of the time is spent moving data (read/write),
unlike HPC simulations, which are another story with other issues (MPI, etc.).
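
Spelled out as arithmetic, assuming exactly 50 experiments and one
output per experiment (the variable names are only for illustration):

```scheme
;; Figures from the paragraph above, in GB.
(define reference-gb 25)
(define input-gb-per-experiment 10)
(define output-gb-per-experiment 20)
(define experiments 50)   ; assumption: "approx. 50 or more" taken as 50

;; Experimental inputs for one cohort: 50 * 10 = 500 GB.
(define total-input-gb (* experiments input-gb-per-experiment))

;; Aligner outputs for one cohort: 50 * 20 = 1000 GB.
(define total-output-gb (* experiments output-gb-per-experiment))

;; Reference + inputs + outputs: 25 + 500 + 1000 = 1525 GB.
(define total-gb (+ reference-gb total-input-gb total-output-gb))

(display (list total-input-gb total-output-gb total-gb))
(newline)
;; prints: (500 1000 1525)
```

So a single alignment step over one cohort already touches roughly
1.5TB, before any downstream tool reads the outputs again.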

Strategies à la git-annex (Haskell, again! ;-) would be nice. But is
the history useful?


Thank you for any comments or ideas.

All the best,
simon

