From: Roel Janssen
Subject: Re: GWL pipelined process composition ?
Date: Thu, 19 Jul 2018 10:15:24 +0200
Message-ID: <87muun8wvf.fsf@gnu.org>
References: <87601cxxjo.fsf@gnu.org>
List-Id: "Development of GNU Guix and the GNU System distribution."
To: zimoun
Cc: Guix Devel

zimoun writes:

> Hi Roel,
>
> Thank you for all your comments.
>
>
>> Maybe we can come up with a convenient way to combine two processes
>> using a shell pipe.  But this needs more thought!
>
> Yes, from my point of view, the classic shell pipe `|` has two strong
> limitations for workflows:
>  1. it does not compose at the 'process' level but at the procedure 'level'
>  2. it cannot deal with two inputs.

Yes, and this strongly suggests that shell pipes are indeed limited to
the procedures *the shell* can combine.  So we can only use them at the
procedure level.  They weren't designed to deal with two (or more)
inputs, and if they were, that would make it vastly more complex.

> As an illustration for the point 1., it appears to me more "functional
> spirit" to write one process/task/unit corresponding to "samtools
> view" and another one about compressing "gzip -c". Then, if you have a
> process that filters some fastq, you can easily reuse the compress
> process, and composes it.
> For more complicated workflows, such as RNAseq or other, the
> composition seems an advantage.

Maybe we could solve this at the symbolic (programming) level instead.

So if we were to try to avoid using "| gzip -c > ..." all over our code,
we could define a function to wrap this.  Here's a simple example:

  (define (with-compressed-output command output-file)
    (system (string-append command " | gzip -c > " output-file)))

And then you could use it in a procedure like so:

  (define-public A
    (process
      (name "A")
      (package-inputs (list samtools gzip))
      (data-inputs "/tmp/sample.sam")
      (outputs "/tmp/sample.sam.gz")
      (procedure
       #~(with-compressed-output
           (string-append "samtools view " #$data-inputs)
           #$outputs))))

This isn't perfect, because we still need to include “gzip” in the
‘package-inputs’.  It doesn't allow multiple input files, nor does it
split the “gzip” command from the “samtools” command on the process
level.  However, it does allow us to express the idea that we want to
compress the output of a command and save that in a file without having
to explicitly provide the commands to do that.

>
> As an illustration for the point 2., I do not do with shell pipe:
>
>   dd if=/dev/urandom of=file1 bs=1024 count=1k
>   dd if=/dev/urandom of=file2 bs=1024 count=2k
>   tar -cvf file.tar file1 file2
>
> or whatever process instead of `dd` which is perhaps not the right example here.
> To be clear,
>   process that outputs fileA
>   process that outputs fileB
>   process that inputs fileA *and* fileB
> without write on disk fileA and fileB.

Given the ‘dd’ example, I don't see how that could work without
reinventing the way filesystems work.

> All the best,
> simon

Thanks!

Kind regards,
Roel Janssen