From mboxrd@z Thu Jan 1 00:00:00 1970
From: myglc2
Subject: Re: leaky pipelines and Guix
Date: Fri, 04 Mar 2016 18:29:20 -0500
Message-ID: <87vb51hlf3.fsf@gmail.com>
References: <87egci10tz.fsf@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
To: help-guix@gnu.org

ludo@gnu.org (Ludovic Courtès) writes:

> Ricardo Wurmus skribis:
>
>> [...]
>> So, how could I package something like that?  Is packaging the wrong
>> approach here and should I really just be using “guix environment” to
>> prepare a suitable environment, run the pipeline, and then exit?
>
> Maybe packages are the wrong abstraction here?
>
> IIUC, a pipeline is really a function that takes inputs and produces
> output(s).
So it can definitely be modeled as a derivation.  I built and ran
reproducible pipelines on HPC clusters for the last 5 years.  IMO the
derivation model fits (disclaimer, I am still trying to figure Guix
out ;)

I think of a generic pipeline as a series of job steps (a series of
derivations).  Job steps must be configured at a meta level in terms of
parameters, dependencies, inputs, and outputs.  I found Grid Engine
qmake (which is GNU Make integrated with the Grid Engine scheduler)
extremely useful for this.  I used it to configure & manage the
pipeline, express dependencies, partition & manage parallel tasks, deal
with error conditions, and manage starts and re-starts.  Such pipeline
jobs ran for weeks without incident.

I dealt with the input/output problem using a recursive sub-make
architecture in which data flowed up the analysis (make) directory
tree.  I dealt with modularity by using git submodules, and I checked
results into git for provenance.  The only real fly in the ointment was
that make uses time stamps; what you really want is a hash or a git
SHA.  Of course there were also hideous problems with software
configuration, but I expect Guix will solve those :=)

> Perhaps as a first step you could try and write a procedure and a CLI
> around it that simply runs a given pipeline:
>
>   $ guix biopipeline foo frog.dna human.dna
>   …
>   /gnu/store/…-freak.dna
>
> The procedure itself would be along the lines of:
>
>   (define (foo-pipeline input1 input2)
>     (gexp->derivation "result"
>       #~(begin
>           (setenv "PATH" "/foo/bar")
>           (invoke-make-and-co #$input1 #$input2
>                               #$output))))

Sidebar:

- What is "biopipeline" above?  A new guix command?
- Should "foo-pipeline" read "foo", or vice versa?

>> [...]
>> However, most pipelines do not take this approach.  Pipelines are
>> often designed as glue (written in Perl, or as Makefiles) that ties
>> together other tools in some particular order.  These tools are
>> usually assumed to be available on the PATH.
Yes, these pipelines are generally badly designed.

>> [...]
>> I can easily create a shared profile containing the tools that are
>> needed by a particular pipeline and provide a wrapper script that
>> does something like this (pseudo-code):
>>
>>   bash
>>   eval $(guix package --search-paths=prefix)
>>   do things
>>   exit
>>
>> But I wouldn’t want to do this for individual users, letting them
>> install all tools in a separate profile to run that pipeline, run
>> something like the above to set up the environment, then fetch the
>> tarball containing the glue code that constitutes the pipeline
>> (because we wouldn’t offer a Guix package for something that’s not
>> usable without so much effort to prepare an environment first),
>> unpack it and then run it inside that environment.
>>
>> To me this seems to be in the twilight zone between proper packaging
>> and a use-case for “guix environment”.  I welcome any comments about
>> how to approach this and I’m looking forward to the many practical
>> tricks that I must have overlooked.

An attraction of Guix is the possibility of placing job step inputs and
outputs in the store, or something like the store.  So how about
integrating GNU Make with Guix to enable job steps that are equivalent
to ...

  step1: guix environment foo bar && \
         read the store, do things, save in store

... or maybe something like ...

  step2: send to guix-daemon:

> (define (foo-pipeline input1 input2)
>   (gexp->derivation "result"
>     #~(begin
>         (setenv "PATH" "/foo/bar")
>         (invoke-make-and-co #$input1 #$input2
>                             #$output))))

Then you can provide a pipeline by providing a makefile.  Things needed
to make this work:

- make integration with the guix store
- make integration with guix-daemon
- HPC scheduler: qmake allows specific HPC resources to be requested
  for each job step (e.g. memory, slots (CPUs)).  Grid Engine uses
  these to determine where the steps run.

Maybe these features could be achieved by running the guix daemon over
a scheduler, like slurm.
Or maybe by submitting job steps to slurm which are in turn run by the
daemon?  (disclaimer, I am still trying to figure Guix out ;)

- George
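P.S. To make the hash-vs-timestamp point above concrete, here is a toy
shell sketch (nothing Guix- or qmake-specific; the file names and the
`step' helper are made up for illustration): a job step re-runs only
when the hash of its input changes, not when its timestamp does.

```shell
#!/bin/sh
# Toy content-addressed "make": re-run a step only if the sha256 of
# its input differs from the hash recorded on the previous run.
set -e

step () {
    input=$1; output=$2; shift 2          # remaining args: the command
    new_hash=$(sha256sum "$input" | cut -d' ' -f1)
    stamp="$output.inhash"                # records the input hash used
    if [ -f "$output" ] && [ -f "$stamp" ] \
       && [ "$(cat "$stamp")" = "$new_hash" ]; then
        echo "step up to date: $output"   # input unchanged: skip
    else
        "$@" < "$input" > "$output"       # run the tool
        echo "$new_hash" > "$stamp"
    fi
}

echo hello > in.txt
step in.txt out.txt tr a-z A-Z            # runs tr, records hash
step in.txt out.txt tr a-z A-Z            # skipped: input hash unchanged
```

Touching in.txt without changing its contents would still skip the
step, which is exactly what timestamps get wrong; hashing the whole
input closure is, of course, what a derivation already does.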