unofficial mirror of gwl-devel@gnu.org

* fastest way to run a GWL workflow on AWS
From: Ricardo Wurmus @ 2020-07-06  9:52 UTC
  To: gwl-devel

Hey there,

I had an idea to get a GWL workflow to run on AWS without having to mess
with Docker and all that.  GWL should do all of these steps when AWS
deployment is requested:

* create an EFS file system.  Why EFS?  Unlike EBS (block storage) and
  S3, one EFS can be accessed simultaneously by different virtual
  machines (EC2 instances).

* sync the closure of the complete workflow (all steps) to EFS.  (How?
  We could either mount EFS locally or use an EC2 instance as a simple
  “cloud” file server; see the sketch after this list.)  This differs
  from how other workflow languages
  handle things.  Other workflow systems have one or more Docker
  image(s) per step (sometimes one Docker image per application), which
  means that there is some duplication and setup time as Docker images
  are downloaded from a registry (where they have previously been
  uploaded).  Since Guix knows the closure of all programs in the
  workflow we can simply upload all of it.

* create as many EC2 instances as requested (respecting optional
  grouping information to keep any set of processes on the same node)
  and mount the EFS over NFS.  The OS on the EC2 instances doesn’t
  matter.

* run the processes on the EC2 instances (parallelizing as far as
  possible) and have them write to a unique directory on the shared
  EFS.  The rest of the EFS is used as a read-only store to access all
  the Guix-built tools.
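
To make that concrete, here is a rough Guile sketch of the “mount EFS
locally and shell out to rsync” variant of the sync step.  The mount
point and the package names are placeholders, and none of this is
existing GWL code:

  (use-modules (ice-9 popen)
               (ice-9 rdelim))

  ;; Placeholder: where the EFS is mounted on the local machine.
  (define %efs-mount-point "/mnt/efs")

  (define (pipe-lines command . args)
    ;; Run COMMAND ARGS ... and return its output as a list of lines.
    (let ((port (apply open-pipe* OPEN_READ command args)))
      (let loop ((lines '()))
        (let ((line (read-line port)))
          (if (eof-object? line)
              (begin (close-pipe port)
                     (reverse lines))
              (loop (cons line lines)))))))

  (define (workflow-closure packages)
    ;; Build PACKAGES and return every store item in their closure.
    (let ((outputs (apply pipe-lines "guix" "build" packages)))
      (apply pipe-lines "guix" "gc" "--requisites" outputs)))

  (define (sync-closure! packages)
    ;; Copy the store items onto the EFS in a single rsync run,
    ;; preserving the /gnu/store layout.  "-H" keeps the hard links
    ;; that the Guix store uses for deduplication, which also cuts
    ;; down on transferred data.
    (apply system* "rsync" "-aH"
           (append (workflow-closure packages)
                   (list (string-append %efs-mount-point
                                        "/gnu/store/")))))

  ;; Example: (sync-closure! (list "samtools" "bwa"))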

The EFS either stays active or its contents are archived to S3 upon
completion to reduce storage costs.

The last two steps are obviously a little vague; we’d need to add a few
knobs to allow users to easily tweak resource allocation beyond what the
GWL currently offers (e.g. grouping, mapping resources to EC2 machine
sizes).  To implement the last step we would need to keep track of step
execution.  We can already do this, but the complication here is to
trigger execution on the remote nodes.

I also want to add optional reporting for each step.  There could be a
service that listens for events, and each step would emit events to
indicate its start and stop.  This could trivially be visualized, so
that users can keep track of the state of the workflow and its
processes, e.g. with a pretty web interface.

For the deployment to AWS (and eventual tear-down) we can use Guile AWS.

None of this depends on “guix deploy”, which I think would be a poor fit
as these virtual machines are meant to be disposable.

Another thing I’d like to point out is that this doesn’t lead users down
the AWS rabbit hole.  We don’t use specialized AWS services like their
cluster/grid service, nor do we use Docker, nor ECS, etc.  We use the
simplest resource types: plain EC2 and boring NFS storage.  This looks
like one of the simplest remote execution models, which could just as
well be used with other remote compute providers (or even a custom
server farm).

One of the open issues is to figure out how to sync the /gnu/store items
to EFS efficiently.  I don’t really want to shell out to rsync, nor do I
want to use “guix copy”, which would require a remote installation of
Guix.  Perhaps rsync would be the easiest route for a rough first
draft.  It would also be nice if we could deduplicate our slice of the
store to cut down on unnecessary traffic to AWS.

What do you think about this?

-- 
Ricardo


* Re: fastest way to run a GWL workflow on AWS
From: zimoun @ 2020-07-16  0:08 UTC
  To: Ricardo Wurmus, gwl-devel

Dear Ricardo,

Nice ideas!  I am a bit ignorant in this area, so my questions are
surely naive, not to say dumb. :-)

On Mon, 06 Jul 2020 at 11:52, Ricardo Wurmus <rekado@elephly.net> wrote:

> * create an EFS file system.  Why EFS?  Unlike EBS (block storage) and
>   S3, one EFS can be accessed simultaneously by different virtual
>   machines (EC2 instances).

Who creates the EFS file system?  And you are referring to [1], right?

1: https://aws.amazon.com/efs/


> * sync the closure of the complete workflow (all steps) to EFS.  (How?
>   We could either mount EFS locally or use an EC2 instance as a simple
>   “cloud” file server.) This differs from how other workflow languages
>   handle things.  Other workflow systems have one or more Docker
>   image(s) per step (sometimes one Docker image per application), which
>   means that there is some duplication and setup time as Docker images
>   are downloaded from a registry (where they have previously been
>   uploaded).  Since Guix knows the closure of all programs in the
>   workflow we can simply upload all of it.

I think one of the points of using one Docker image per step is to ease
composition, that is, to be able to recompose another workflow from some
of the steps while other steps require other tools with other versions.

In Guix parlance, workflow1 uses tool1 for step1 and tool2 for step2,
both from commit C1.  If workflow2 uses tool1 from commit C1 for step1'
and tool3 from commit C2 for step2', then this is easy if each tool
(step) is containerized separately and not in only one big image.

But it is an issue for the Guix side, not the GWL side. :-)  For
example, is it possible to compose two profiles containing the same
package at the very same version but grafted differently?


> * create as many EC2 instances as requested (respecting optional
>   grouping information to keep any set of processes on the same node)
>   and mount the EFS over NFS.  The OS on the EC2 instances doesn’t
>   matter.

By “The OS on the EC2 instances doesn’t matter”, do you mean that it is
possible to run Guix System, or Guix as a package manager on top of,
say, Debian?

> * run the processes on the EC2 instances (parallelizing as far as
>   possible) and have them write to a unique directory on the shared
>   EFS.  The rest of the EFS is used as a read-only store to access all
>   the Guix-built tools.
>
> The EFS either stays active or its contents are archived to S3 upon
> completion to reduce storage costs.
>
> The last two steps are obviously a little vague; we’d need to add a few
> knobs to allow users to easily tweak resource allocation beyond what the
> GWL currently offers (e.g. grouping, mapping resources to EC2 machine
> sizes.)  To implement the last step we would need to keep track of step
> execution.  We can already do this, but the complication here is to
> effect execution on the remote nodes.

Ok.

> I also want to add optional reporting for each step.  There could be a
> service that listens to events and each step would trigger events to
> indicate start and stop of each step.  This could trivially be
> visualized, so that users can keep track of the state of the workflow
> and its processes, e.g. with a pretty web interface.

By “service”, do you mean a Guix service?

> For the deployment to AWS (and eventual tear-down) we can use Guile AWS.
>
> None of this depends on “guix deploy”, which I think would be a poor fit
> as these virtual machines are meant to be disposable.
>
> Another thing I’d like to point out is that this doesn’t lead users down
> the AWS rabbit hole.  We don’t use specialized AWS services like their
> cluster/grid service, nor do we use Docker, nor ECS, etc.  We use the
> simplest resource types: plain EC2 and boring NFS storage.  This looks
> like one of the simplest remote execution models, which could just as
> well be used with other remote compute providers (or even a custom
> server farm).
>
> One of the open issues is to figure out how to sync the /gnu/store items
> to EFS efficiently.  I don’t really want to shell out to rsync, nor do I
> want to use “guix copy”, which would require a remote installation of
> Guix.  Perhaps rsync would be the easiest route for a rough first
> draft.  It would also be nice if we could deduplicate our slice of the
> store to cut down on unnecessary traffic to AWS.

Naively, why does the “guix pack -f docker” or “guix system
docker-image” approach fail?


All the best,
simon


* Re: fastest way to run a GWL workflow on AWS
From: Ricardo Wurmus @ 2020-07-16 15:17 UTC
  To: zimoun; +Cc: gwl-devel


zimoun <zimon.toutoune@gmail.com> writes:

>> * create an EFS file system.  Why EFS?  Unlike EBS (block storage) and
>>   S3, one EFS can be accessed simultaneously by different virtual
>>   machines (EC2 instances).
>
> Who creates the EFS file system?  And you are referring to [1], right?
>
> 1: https://aws.amazon.com/efs/

Guile AWS would create it on demand (unless a user provides the name of
an existing EFS that already contains a few Guix things).  The idea is
to copy parts of a store to a remote file system — just without the
database and Guix itself doing anything on the remote.  This is very
much like the setup of Guix on HPC clusters where all nodes mount the
shared file system that is controlled by one node.  In the case of EFS
the “controller node” is the user’s machine running GWL.
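
To illustrate (none of this is existing code): on each instance,
mounting the EFS over NFS and bind-mounting its store onto /gnu is
enough to run the synced tools at their usual store paths.  The file
system ID, region, and store hash below are made up:

  ;; Per-instance setup, run as root on a stock EC2 image.
  (system* "mount" "-t" "nfs4" "-o" "nfsvers=4.1"
           "fs-12345678.efs.us-east-1.amazonaws.com:/" "/mnt/efs")
  (system* "mount" "--bind" "/mnt/efs/gnu" "/gnu")

  ;; Any tool from the synced closure now runs directly:
  (system* "/gnu/store/<hash>-samtools-1.10/bin/samtools" "--version")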

>> * sync the closure of the complete workflow (all steps) to EFS.  (How?
>>   We could either mount EFS locally or use an EC2 instance as a simple
>>   “cloud” file server.) This differs from how other workflow languages
>>   handle things.  Other workflow systems have one or more Docker
>>   image(s) per step (sometimes one Docker image per application), which
>>   means that there is some duplication and setup time as Docker images
>>   are downloaded from a registry (where they have previously been
>>   uploaded).  Since Guix knows the closure of all programs in the
>>   workflow we can simply upload all of it.
>
> I think one of the points of using one Docker image per step is to ease
> composition, that is, to be able to recompose another workflow from some
> of the steps while other steps require other tools with other versions.
>
> In Guix parlance, workflow1 uses tool1 for step1 and tool2 for step2,
> both from commit C1.  If workflow2 uses tool1 from commit C1 for step1'
> and tool3 from commit C2 for step2', then this is easy if each tool
> (step) is containerized separately and not in only one big image.
>
> But it is an issue for the Guix side, not the GWL side. :-)  For
> example, is it possible to compose two profiles containing the same
> package at the very same version but grafted differently?

I think it *is* a GWL issue to solve.  The GWL could support inferiors
so that users could reference specific tool variants for parts of the
workflow.  Currently, the GWL uses whatever tools are provided by the
version of Guix it extends.
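
To sketch what that could look like, Guix’s inferior mechanism can pin
a tool to a specific commit, reusing your notation of tool1 at commit
C1 (the commit string and package name are placeholders; this follows
the example in the Guix manual):

  (use-modules (guix inferior)
               (guix channels)
               (srfi srfi-1))

  (define channels
    ;; Pin the 'guix channel to commit C1.
    (list (channel
           (name 'guix)
           (url "https://git.savannah.gnu.org/git/guix.git")
           (commit "C1"))))   ;placeholder

  (define inferior
    (inferior-for-channels channels))

  ;; tool1 as it existed at commit C1; the result can be used much
  ;; like a regular package object.
  (first (lookup-inferior-packages inferior "tool1"))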

>> * create as many EC2 instances as requested (respecting optional
>>   grouping information to keep any set of processes on the same node)
>>   and mount the EFS over NFS.  The OS on the EC2 instances doesn’t
>>   matter.
>
> By “The OS on the EC2 instances doesn’t matter”, do you mean that it is
> possible to run Guix System, or Guix as a package manager on top of,
> say, Debian?

Running Guix System on AWS is tricky.  AWS doesn’t like our disk images
because /etc/fstab doesn’t exist (that was the last error before I
stopped playing with it).  My point is that Guix System isn’t
necessary.  Pick whatever virtual machine image they offer on AWS and
mount the EFS containing all the Guix goodies.

>> I also want to add optional reporting for each step.  There could be a
>> service that listens to events and each step would trigger events to
>> indicate start and stop of each step.  This could trivially be
>> visualized, so that users can keep track of the state of the workflow
>> and its processes, e.g. with a pretty web interface.
>
> By “service”, do you mean a Guix service?

No, much more vague.  When you submit a GWL workflow to a cluster today
the GWL prepares things and then hands off the work to the cluster
scheduler.  The GWL has no way to tell you anything about the progress
of the workflow.  Its work is done once it has compiled a higher-order
description of the workflow down to scripts that the cluster can run.

It doesn’t have to be this way.  Why let the cluster scheduler have all
the fun?  (And more importantly: what do we do if we don’t *have* a
scheduler?)  The GWL could have a sub-command or switch to watch
submitted jobs, a little daemon that listens to events being sent by the
individual steps of the workflow; events like “started”, “error”,
“done”; even fancier ones such as machine load or disk utilization at
this point in time.  When enabled, the jobs themselves would be
instrumented to send information to the GWL monitor, which in turn
would be able to visualize this information.
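
None of this exists yet, but a minimal sketch of the idea in Guile
could look like the following; the UDP transport, the port number, and
the one-line event format are all invented for the example:

  (use-modules (rnrs bytevectors))

  ;; The monitor: a UDP listener that prints events such as
  ;; "step-foo started" or "step-foo done" as they arrive.
  (define (run-monitor port)
    (let ((sock (socket PF_INET SOCK_DGRAM 0)))
      (bind sock AF_INET INADDR_ANY port)
      (let loop ()
        (let* ((buf (make-bytevector 512))
               (len (recv! sock buf))
               (msg (make-bytevector len)))
          (bytevector-copy! buf 0 msg 0 len)
          (format #t "event: ~a~%" (utf8->string msg))
          (loop)))))

  ;; An instrumented job step would report back with something like
  ;; this, where HOST is the monitor's IPv4 address:
  (define (send-event host port message)
    (let ((sock (socket PF_INET SOCK_DGRAM 0)))
      (connect sock AF_INET (inet-pton AF_INET host) port)
      (send sock (string->utf8 message))
      (close-port sock)))

  ;; Example: (send-event "10.0.0.1" 7777 "step-foo started")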

>> One of the open issues is to figure out how to sync the /gnu/store items
>> to EFS efficiently.  I don’t really want to shell out to rsync, nor do I
>> want to use “guix copy”, which would require a remote installation of
>> Guix.  Perhaps rsync would be the easiest route for a rough first
>> draft.  It would also be nice if we could deduplicate our slice of the
>> store to cut down on unnecessary traffic to AWS.
>
> Naively, why does the “guix pack -f docker” or “guix system
> docker-image” approach fail?

Docker images would have to be uploaded to a container registry (either
DockerHub or Amazon’s ECR).  AWS can use Docker only by downloading an
image from a registry when you instantiate a virtual machine.  One of
the advantages of using Guix is that we don’t need to use a big Docker
blob at all; we can instead upload individual store items (and
accumulate them) and use them directly without the need for any copying
from a container registry.

-- 
Ricardo

