* Guix orchestration notes
@ 2018-02-18  4:37 Chris Marusich
  2018-03-01 16:18 ` Orchestration working group (before: Guix orchestration notes) Pjotr Prins
  2018-03-27 18:27 ` Guix orchestration notes Thompson, David
  0 siblings, 2 replies; 5+ messages in thread
From: Chris Marusich @ 2018-02-18  4:37 UTC (permalink / raw)
  To: guix-devel


[-- Attachment #1.1: Type: text/plain, Size: 735 bytes --]

Hi,

At FOSDEM, some of us discussed "orchestration", which means something
like "how to deploy services to more than 1 machine in a coordinated
fashion".  Many people contributed to the discussion.  I took notes.
I've thought about this more, reviewed the "wip-deploy" branch, and
written up my thoughts in the attached file.

It's a rough sketch of ideas, biased with my own opinions and
experience, but I think it's good enough to share.  I invite you to
improve upon it: share your own thoughts, hack some code together, and
just iterate on this a bit, so we can make some progress.

Hopefully, we can agree on a basic design and get a working proof of
concept.  Then we can make a blog post about it!

-- 
Chris

[-- Attachment #1.2: guix-orchestration.org --]
[-- Type: text/x-org, Size: 10786 bytes --]

* end goal: blog post, 1 webserver 1 db
We should make something showable and hackable that people can start
to play with.

Proposed goal: show one deployment that upgrades services across two
or more servers.  Maybe a web server and a backing database.  As a
bonus, show features like "guix gc --requisites" listing ALL runtime
dependencies, even those that are cross-server dependencies.  This
would help illustrate how Guix knows the EXACT closure of what is
deployed to all hosts.
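
For reference, here is what the single-host version of that query
looks like today; the cross-server part would be the new bit:

  # List the full runtime closure of the running system on one host:
  guix gc --requisites $(readlink -f /run/current-system)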
* Feature: service validation scripts
can be part of a service's start-up script for now

could split into multiple pre-defined hooks - which would encourage
consistency
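
As a rough sketch of what pre-defined hooks might look like (the hook
names and the alist-of-thunks shape are made up for illustration;
nothing like this exists yet):

  ;; Hypothetical: validation hooks as an alist of named thunks.
  (define nginx-validation-hooks
    `((pre-start  . ,(lambda ()
                       ;; Check the configuration before starting.
                       (zero? (system* "nginx" "-t"))))
      (post-start . ,(lambda ()
                       ;; Check that the server actually responds.
                       (zero? (system* "curl" "--fail" "-s"
                                       "http://localhost/"))))))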

* Feature: service upgrade/roll-back
make a transient "upgrade service" which manages the transition.
Maybe a service author can define a procedure like the following:

  (define* (switch-to-service #:key (old #f) new)
    ;; Stop the old service, start the new one, transfer state,
    ;; perform validation, etc.
    #t)

It depends on both the old and the new service, so it can run things
in the right order (e.g., run the old service's shutdown scripts, then
run the new service's start-up script).

It also provides a place to implement more complex logic (e.g.,
checkpoint or back up the database on the old service, then restore on
the new service, or even migrate while both services are still
running, then shut down the old service once migration is complete).
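
For example, a database service's procedure might look roughly like
this (backup-database, restore-database, herd-stop, herd-start, and
validate-service are all hypothetical placeholders):

  (define* (switch-to-database-service #:key (old #f) new)
    (when old
      (backup-database old)    ; checkpoint the old service's state
      (herd-stop old))         ; run the old service's shutdown logic
    (herd-start new)           ; run the new service's start-up logic
    (when old
      (restore-database new))  ; transfer the state to the new service
    (validate-service new))    ; abort the upgrade if validation fails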

I'm not sure how to wire this into the normal system activation logic
(maybe by extending activation-service-type?).  But to correctly
manage upgrades/rollbacks in general (e.g., database upgrades), it
seems crucial that we be able to control both the old and the new
service during the transition.

* Feature: <site-configuration>

We need something like a <site-configuration> (maybe
<distributed-system>?) which defines all the services and
operating-systems used in a distributed system.  The key here is that
it involves multiple hosts.

In this configuration, we define services and compose them together.

In it, we define host classes (not individual hosts!).  A host class
can be thought of as a role - web server, database, etc. - but
conceptually it represents "a configuration shared by a group of
hosts".  A host class does NOT contain a list of hosts, and it does
NOT represent a single host, since the details of how many hosts exist
in a host class are not relevant to the abstract structure of the
distributed system's service dependencies.

Basically, a host class corresponds to an operating system
configuration, in that every host in the host class will use the same
operating system configuration.

We need a way to assign services to host classes, i.e. we need to
choose which host classes a service will run on.  This could be
automatic, or it could be defined manually in the site configuration
directly.
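
As a very rough sketch of the shape this might take (neither
<host-class> nor <site-configuration> exists today; the record types
and field names are invented for illustration):

  (use-modules (guix records))

  (define-record-type* <host-class> host-class make-host-class
    host-class?
    (name             host-class-name)              ; e.g. 'web-server
    (operating-system host-class-operating-system)  ; shared OS config
    (services         host-class-services))         ; services assigned here

  (define-record-type* <site-configuration>
    site-configuration make-site-configuration
    site-configuration?
    (host-classes site-configuration-host-classes)
    ;; Coarse-grained dependencies: host class -> host class.
    (dependencies site-configuration-dependencies))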

Services depend on one another.  Currently we have a nice system for
describing service dependencies within a single host.  How do we
extend this so that a service can depend on a service on another host?
I'm not sure.  The most obvious solution is coarse-grained: provide a
way to declare that host class X depends on host class Y.  But I can
see that this might be problematic, too: what if service A in host
class X depends on service B in host class Y, while service C in host
class Y depends on service D in host class X?  If dependencies go from
host class to host class, then this is a circular dependency, which
might be a problem; if dependencies go from service to service, it's
not circular, so it clearly isn't a problem.  But I'm not sure how to
extend our service dependency model across hosts, so unless you have a
better idea, I think it's reasonable to start with the coarse-grained
model of one host class depending on another host class.

To update your site/distributed system, you would run 'guix deploy' or
similar: a push-model tool for coordinating the deployment.  This tool
looks up the mapping from host class to actual nodes at runtime.  The
interface and the implementation of that query are decoupled; e.g.,
maybe it's a procedure like host-class->hosts, so we can get the
mapping from a local config file if we want, or we can get the mapping
by querying AWS APIs for all the instances with a specific tag.
Decoupling the implementation from the interface of host-class->hosts
makes it easy to accommodate both use cases.
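
A minimal sketch of that decoupling, with a trivial alist-backed
implementation (all names here are hypothetical; a cloud-backed
implementation would slot in behind the same interface):

  ;; MAPPING is an alist of (host-class-name . list-of-hosts).
  (define (make-static-host-lookup mapping)
    (lambda (host-class)
      (or (assoc-ref mapping host-class)
          (error "unknown host class:" host-class))))

  (define host-class->hosts
    (make-static-host-lookup
     '((web-server . ("web1.example.org" "web2.example.org"))
       (database   . ("db1.example.org")))))

  (host-class->hosts 'web-server)
  ;; => ("web1.example.org" "web2.example.org")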

* What about a pull model for deployment?

In some large systems, a pull model for deployment is nice.  A push
model might be pretty slow if you're deploying to thousands of hosts.
But a pull model has drawbacks, too.  For example, if 1000 hosts are
pulling from the same place, they can brown it out (this is a common
"death spiral" scenario when, e.g., an entire site loses and then
regains power).

For now, a push model seems easier and more convenient.  If the
deployments are coordinated by a single process, it makes it easier to
enforce a policy for deployments (e.g., do this set of hosts before
the others).  I'll bet it also wouldn't be too hard to later spread a
working push model out into more of a pull-like model.  For example,
if you have a dozen sites, maybe each site has one node that performs
automatic deployment (with a push model) within that site, and this
master node periodically polls some global configuration to get its
latest site configuration.  You get the idea.

For now, a push model seems preferable.

* baby steps: first, let's orchestrate a static fleet of existing guixsd servers
Later, we can add the ability to deploy to a fleet where some servers
don't run Guix.  Later still, we can add the ability to change the
size of the fleet when deploying.  These features will probably follow
naturally if the first design is good.

* note: different guix versions on different machines are not necessarily a problem
The guix daemon is itself included as a service, so it and its
dependencies are exactly described when running 'guix deploy'.  If two
host classes run slightly different versions of the guix daemon, it
shouldn't be a problem (assuming, of course, that those two versions
understand how to speak to one another as needed, e.g. for
offloading).

* Feature: allow operator to specify deployment policy
e.g., "one datacenter at a time", "no more than 30% hosts at once", etc.
for starters, we can have a very simple policy

but we should plan on making it easy for operators to define and use
their own policies, similar to how they can define their own
implementation for how to query the hostclass->host mapping
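
For example, a policy could simply be a procedure that splits the list
of hosts into ordered batches; a sketch, assuming exactly that
interface (nothing like this exists yet):

  (use-modules (srfi srfi-1))  ; for split-at

  ;; Split LST into consecutive batches of at most N elements.
  (define (batches-of n lst)
    (if (null? lst)
        '()
        (call-with-values
            (lambda () (split-at lst (min n (length lst))))
          (lambda (batch rest)
            (cons batch (batches-of n rest))))))

  ;; "No more than 30% of hosts at once":
  (define (at-most-30%-policy hosts)
    (batches-of (max 1 (floor (* 3/10 (length hosts)))) hosts))

  (at-most-30%-policy '("a" "b" "c" "d" "e" "f" "g"))
  ;; => (("a" "b") ("c" "d") ("e" "f") ("g"))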

* observation: a service's references today don't include the services it depends on
maybe they should (optionally)?
maybe they shouldn't?

What does a service "depend" on at runtime?  Obviously, it needs its
program files to run.  But it might need other services.  The former
can be described by following store references, but what about the
latter?  Dependencies between services are not reflected in the Guix
store.  Thus, some aspect of "service dependencies" (actually, a large
part of it) lives outside the store, and outside the "purely
functional software deployment model".

* thoughts on david's existing work: wip-deploy
This is the "wip-deploy" branch.

It does not build, but we can probably fix that easily.

The way he decoupled "platforms" (like "servers running on AWS",
"servers running on local VMs using qemu", etc.) from the various
actions is a very nice design that enables us to perform the same
action on different platforms without caring about the tiny details
that are different between the platforms.

It seems like 1 "machine" = 1 OS config = 1 host; we need to decouple
this.  Things like the IP address are overly specific, too.  We can
solve this with host classes.

Dependencies between services in different OS configs are not modeled;
we need to model them to ensure dependencies are not broken during
deployment.

There is no support (yet!) for deployments to existing, running
machines; it only supports initial deployment.  However, if we make
the 'provision' procedure return state that describes the deployed
machines, and we save it, perhaps we could later look it up and do
another deployment on existing, running machines.  But wait, we
already have this: the deployment object describes the machines, so we
would just need a mechanism to look up their current state.  Are they
running or not?  If they are not running, provision them; if they are
running, update (reconfigure) them.

The existing code seems to anticipate this, but there is no
implementation yet for 'reconfigure'.  I'll bet that if we can figure
out a good solution for the "service upgrade/roll-back" section above,
we can use it here.

** Do 'build-deployment' and 'provision-deployment' use the same derivation?
Yes, and it's important that they both use the same derivation;
otherwise there is no point in providing these as separate steps.

build-deployment:
  machine-os-for-platform
  -> (transforms the os via virtualized-operating-system-os)
  -> operating-system-derivation

provision-deployment:
  machine-os-for-platform
  -> (transforms the os via virtualized-operating-system-os)
  -> the platform's provision procedure, which here sends the os off
     to system-qemu-image/shared-store-script; there, the os gets sent
     through virtualized-operating-system-os a second time (this
     procedure is apparently idempotent, which is good because it
     means the os used for building is the same os used for
     provisioning)
  -> operating-system-derivation

So, 'build-deployment' and 'provision-deployment' wind up using the
same derivation.  This might not have been true if
virtualized-operating-system-os were not idempotent.  In general, it
seems important to ensure that the derivation used by
'build-deployment' and the derivation used by 'provision-deployment'
are in fact identical; otherwise, there isn't much point in building
before provisioning.
* problem: what if I don't want to deploy all services on a host at once?
In a large distributed system, it is often undesirable to deploy all
updates to all services at the same time.  How can we limit this?  A
task like "reconfiguring" a GuixSD server will currently upgrade all
services...

I'm thinking it would be nice to have something like the --upgrade and
--do-not-upgrade options for "guix package", but for service
deployment.
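
A hypothetical invocation, just to illustrate the analogy (these
options exist for 'guix package' but not for any deployment tool
today):

  # Hypothetical: upgrade only nginx; leave postgresql alone.
  guix deploy --upgrade=nginx --do-not-upgrade=postgresql site.scm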

This could be worked around by running just one primary service per
host.  Then you don't care if all the services on the host get
upgraded, since effectively you're treating it all like a single
service.




* Orchestration working group (before: Guix orchestration notes)
  2018-02-18  4:37 Guix orchestration notes Chris Marusich
@ 2018-03-01 16:18 ` Pjotr Prins
  2018-03-04  2:18   ` Devan
  2018-03-27 18:27 ` Guix orchestration notes Thompson, David
  1 sibling, 1 reply; 5+ messages in thread
From: Pjotr Prins @ 2018-03-01 16:18 UTC (permalink / raw)
  To: Chris Marusich; +Cc: guix-devel

Hi Chris,

That was a lengthy writeup, thanks :). I think the first step is to
think in terms of (container/VM) provisioning. Then build that up towards
orchestration and workflows. There is some overlap with the HPC
working group, but I think we ought to have one on
orchestration/provisioning/workflows. I am quite keen to have
something in place in the coming year: that is, rolling out and
firing up software combinations in isolated environments. Let's keep
discussing and build up a web page for that.

My first provisioning step is to start up a Guix container on a server
somewhere through a REST API and fire up a script that can run in the
container.

Pj.


* Re: Orchestration working group (before: Guix orchestration notes)
  2018-03-01 16:18 ` Orchestration working group (before: Guix orchestration notes) Pjotr Prins
@ 2018-03-04  2:18   ` Devan
  2018-03-04 20:42     ` Chris Marusich
  0 siblings, 1 reply; 5+ messages in thread
From: Devan @ 2018-03-04  2:18 UTC (permalink / raw)
  To: Pjotr Prins; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 1389 bytes --]

Hi all,

Pjotr Prins transcribed 0.6K bytes:
> Hi Chris,
> 
> That was a lengthy writeup, thanks :). I think the first step is to
> think in terms of (container/VM) provisioning. Then build that up towards
> orchestration and workflows.

I agree. I think a good start would be figuring out ways to
essentially replace what you can do with Dockerfiles. To me, that
mainly means having the ability to arbitrarily copy files into, and
further manipulate, the tarballs produced by `guix pack`.

This would already allow us to build useful container images without
having to pull in random blob layers, which is the current state of
affairs with Docker. These could then be orchestrated with the many
existing tools, and it would be a big step forward in reproducibility
and security for that ecosystem.

> There is some overlap with the HPC
> working group, but I think we ought to have one on
> orchestration/provisioning/workflows. I am quite keen to have
> something in place in the coming year: that is, rolling out and
> firing up software combinations in isolated environments. Let's keep
> discussing and build up a web page for that.
> 
> My first provisioning step is to start up a Guix container on a server
> somewhere through a REST API and fire up a script that can run in the
> container.

That all sounds like good ideas to me.

> Pj.

Devan



* Re: Orchestration working group (before: Guix orchestration notes)
  2018-03-04  2:18   ` Devan
@ 2018-03-04 20:42     ` Chris Marusich
  0 siblings, 0 replies; 5+ messages in thread
From: Chris Marusich @ 2018-03-04 20:42 UTC (permalink / raw)
  To: Devan; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 1100 bytes --]

Devan <mail@dvn.me> writes:

> Hi all,
>
> Pjotr Prins transcribed 0.6K bytes:
>> Hi Chris,
>> 
>> That was a lengthy writeup, thanks :). I think the first step is to
>> think in terms of (container/VM) provisioning. Then build that up towards
>> orchestration and workflows.
>
> I agree. I think a good start would be figuring out ways to
> essentially replace what you can do with Dockerfiles. To me, that
> mainly means having the ability to arbitrarily copy files into, and
> further manipulate, the tarballs produced by `guix pack`.

With "guix pack", you can already create a Docker image that contains
arbitrary files.  Just define whatever packages you need in order to
install the things you want, and "guix pack --format=docker" will create
a Docker image of exactly what you declare, without building upon any
existing Docker image.  Is this different from what you had in mind?
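
For example (the resulting store file name below is illustrative):

  # Build a Docker image containing Guile, with /bin/guile available:
  guix pack --format=docker -S /bin=bin guile
  # The command prints the path of the image tarball, which can then
  # be loaded:
  docker load < /gnu/store/...-docker-pack.tar.gz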

Once "guix system docker-image" gets added, you will be able to do the
same thing for an entire GuixSD system:

https://debbugs.gnu.org/cgi/bugreport.cgi?bug=30572

-- 
Chris



* Re: Guix orchestration notes
  2018-02-18  4:37 Guix orchestration notes Chris Marusich
  2018-03-01 16:18 ` Orchestration working group (before: Guix orchestration notes) Pjotr Prins
@ 2018-03-27 18:27 ` Thompson, David
  1 sibling, 0 replies; 5+ messages in thread
From: Thompson, David @ 2018-03-27 18:27 UTC (permalink / raw)
  To: Chris Marusich; +Cc: guix-devel

Hi Chris,

On Sat, Feb 17, 2018 at 11:37 PM, Chris Marusich <cmmarusich@gmail.com> wrote:
> Hi,
>
> At FOSDEM, some of us discussed "orchestration", which means something
> like "how to deploy services to more than 1 machine in a coordinated
> fashion".  Many people contributed to the discussion.  I took notes.
> I've thought about this more, reviewed the "wip-deploy" branch, and
> written up my thoughts in the attached file.
>
> It's a rough sketch of ideas, biased with my own opinions and
> experience, but I think it's good enough to share.  I invite you to
> improve upon it: share your own thoughts, hack some code together, and
> just iterate on this a bit, so we can make some progress.
>
> Hopefully, we can agree on a basic design and get a working proof of
> concept.  Then we can make a blog post about it!

These are good notes, thanks for sharing them!

One additional use-case I would consider for an orchestration tool
would be so-called "immutable deployment", where virtual machines are
replaced entirely rather than updated in-place. This is commonly used
for deploying web applications into auto scaling groups (where the
actual number of hosts at any given time is dynamic) using a
"blue-green" deployment technique (in a nutshell it's a double buffer
that allows updating the application without downtime and allows easy
rollback in the event the deploy breaks critical functionality).  This
is the kind of thing that I do at my day job, and we are but one of
many companies that do things this way.

- Dave

