unofficial mirror of guix-devel@gnu.org 
From: zimoun <zimon.toutoune@gmail.com>
To: "Christopher Baines" <mail@cbaines.net>,
	guix-devel@gnu.org, "Ludovic Courtès" <ludo@gnu.org>,
	"Mathieu Othacehe" <othacehe@gnu.org>
Subject: Thoughts on CI (was: Thoughts on building things for substitutes and the Guix Build Coordinator)
Date: Mon, 23 Nov 2020 23:56:07 +0100
Message-ID: <86v9dvu1h4.fsf@gmail.com>
In-Reply-To: <87tutnlnjy.fsf@cbaines.net>

Hi,

(Disclaimer: I am biased, since I have been Mathieu’s rubber duck [1]
for the “new CI design” presented in his talk, and I have read his
first experimental implementations.)

1: <https://en.wikipedia.org/wiki/Rubber_duck_debugging>


Thank you Chris for this detailed email and your inspiring talk.  Really
interesting!  The discussion has been really fruitful.  At least for me.

From my understanding, the Guix Build Coordinator is designed to
distribute the workload across a heterogeneous context (distant
machines).

IIUC, the design of GBC could implement some of Andreas’s ideas, because
the GBC is designed to support unreliable networks and it even has an
experimental trust mechanism for the workers.  The dequeuing algorithm
implemented in GBC is not clear to me; it appears to be “work stealing”
but I have not read the code.

Mathieu’s offload is designed for a cluster, with the architecture of
Berlin in mind, reusing the existing parts of Guix as much as possible.

Since Berlin is a cluster, the workers are already trusted.  So Avahi
can be used to discover them; adding or removing machines should be
hot-swappable, with no reconfiguration involved.  In other words, the
controller/coordinator (master) does not need the list of workers.
That’s the first dynamic part.

The second dynamic part is “work stealing”.  To do so, ZeroMQ is used
both for communication and for dequeuing (work stealing).  This library
is used because it allows focusing on the design, avoiding a
reimplementation of the scheduling strategy, and probably bugs, with
Fibers for the communication.  Well, that’s how I understand it.

For sure, the ‘guile-simple-zmq’ wrapper is not bullet-proof; but it is
simply a wrapper and ZeroMQ is well-tested, AFAIK.  Well, we could
imagine replacing, in a second step, this ZMQ library with Fibers plus a
Scheme reimplementation of the scheduling strategy, once the design has
been tested a bit.
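
To make the “work stealing” idea above concrete, here is a minimal
sketch of how I picture it: idle workers pull the next item themselves,
so the coordinator never assigns work and never needs a list of workers.
This is an illustration only, not Mathieu’s implementation.  It uses
Python threads and a standard-library queue as a stand-in for ZeroMQ
sockets, and the names (run_workers, the fake .drv paths) are
hypothetical.

```python
import queue
import threading


def run_workers(derivations, n_workers=3):
    """Let idle workers pull pending items from a shared queue."""
    pending = queue.Queue()
    for drv in derivations:
        pending.put(drv)

    done = []                # (worker name, derivation) pairs
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                # A worker asks for work only when it is idle.
                drv = pending.get_nowait()
            except queue.Empty:
                return       # nothing left to pull: the worker stops
            with lock:
                # Stand-in for the real build (assumption: no real work here).
                done.append((name, drv))

    threads = [threading.Thread(target=worker, args=(f"worker-{i}",))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


# Hypothetical derivation names, for illustration only.
fake_drvs = [f"/gnu/store/fake-{i}.drv" for i in range(10)]
results = run_workers(fake_drvs)
```

Which worker ends up with which item depends on timing, but each item is
pulled exactly once, and adding a worker only means starting one more
puller.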


> I've been using the Guix Build Coordinator build substitutes for
> guix.cbaines.net, which is my testing ground for providing
> substitutes. I think it's working reasonably well.

What is the configuration of this machine?  Size of the store?  Number
of workers where the agents are running?


> The Guix Build Coordinator supports prioritisation of builds. You can
> assign a priority to builds, and it'll try to order builds in such a way
> that the higher priority builds get processed first. If the aim is to
> serve substitutes, doing some prioritisation might help building the
> most fetched things first.

This is really cool!  How does it work?  Do you manually tag some
specific derivations?
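
For the sake of discussion, here is a minimal sketch of how I would
imagine such an ordering: a priority queue where higher-priority builds
come out first and builds of equal priority keep their submission order.
I have not read the GBC code, so this is an assumption about the general
technique, not a description of what the coordinator does; BuildQueue,
submit and next_build are hypothetical names.

```python
import heapq
import itertools


class BuildQueue:
    """Pop the highest-priority build first; FIFO within one priority."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: submission order

    def submit(self, derivation, priority=0):
        # heapq is a min-heap, so the priority is negated to get the
        # highest priority out first.
        heapq.heappush(self._heap,
                       (-priority, next(self._counter), derivation))

    def next_build(self):
        _, _, derivation = heapq.heappop(self._heap)
        return derivation

    def __len__(self):
        return len(self._heap)


q = BuildQueue()
q.submit("hello.drv")                 # default priority 0
q.submit("gcc.drv", priority=10)
q.submit("emacs.drv", priority=5)
q.submit("guile.drv", priority=10)

order = [q.next_build() for _ in range(len(q))]
# order == ["gcc.drv", "guile.drv", "emacs.drv", "hello.drv"]
```

With something like this, the open question is only where the priority
numbers come from (manual tags, download statistics, etc.).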


> Another feature supported by the Guix Build Coordinator is retries. If a
> build fails, the Guix Build Coordinator can automatically retry it. In a
> perfect world, everything would succeed first time, but because the
> world isn't perfect, there still can be intermittent build
> failures. Retrying failed builds even once can help reduce the chance
> that a failure leads to no substitutes for that builds as well as any
> builds that depend on that output.

Yeah, the current infrastructure lacks a way to distinguish between an
error (“the build completed but returned an error”) and a failure
(“something went wrong along the way”).
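
A minimal sketch of the distinction I have in mind, assuming the two
cases can be told apart at all: an error is reported as-is because
retrying would not help, while a failure is retried a bounded number of
times.  The exception classes and build_with_retries are hypothetical
names; this is neither the GBC’s nor Cuirass’s actual behaviour.

```python
class BuildError(Exception):
    """The build ran to completion and reported an error."""


class TransientFailure(Exception):
    """Something went wrong around the build, e.g. a lost connection."""


def build_with_retries(build, max_retries=1):
    """Call build(); retry on TransientFailure only, up to max_retries."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return build(), attempts
        except BuildError:
            raise                     # a real error: do not retry
        except TransientFailure:
            if attempts > max_retries:
                raise                 # retries exhausted: give up


# Usage: a build that fails once for an unrelated reason, then succeeds.
calls = {"n": 0}


def flaky_build():
    calls["n"] += 1
    if calls["n"] == 1:
        raise TransientFailure("worker disconnected")
    return "/gnu/store/fake-output"   # hypothetical store path


output, attempts = build_with_retries(flaky_build)
# output == "/gnu/store/fake-output", attempts == 2
```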


> Because the build results don't end up in a store (they could, but as
> set out above, not being in the store is a feature I think), you can't
> use `guix gc` to get rid of old store entries/substitutes. I have some
> ideas about what to implement to provide some kind of GC approach over a
> bunch of nars + narinfos, but I haven't implemented anything yet.

So where do they end up?  I missed your answer in the Question/Answer
session.


Speaking about Berlin, the builds should be in the workers’ stores (with
a GC policy to be defined; keep them for debugging purposes?) and the
main store should have only the minimum.  The items should really be
stored only in the cache of publish.  IMHO.

Maybe I am missing something.


> There could be issues with the implementation… I'd like to think it's
> relatively simple, but that doesn't mean there aren't issues. For some
> reason or another, getting backtraces for exceptions rarely works. Most
> of the time the coordinator tries to print a backtrace, the part of
> Guile doing that raises an exception. I've managed to cause it to
> segfault, through using SQLite incorrectly, which hasn't been obvious to
> fix at least for me. Additionally, there are some places where I'm
> fighting against bits of Guix, things like checking for substitutes
> without caching, or substituting a derivation without starting to build
> it.

I am confused by all the SQL involved, and I feel it would be hard to
maintain when scaling up.  I do not know.  I am a newbie.


> Finally, the instrumentation is somewhat reliant on Prometheus, and if
> you want a pretty dashboard, then you might need Grafana too. Both of
> these things aren't packaged for Guix, Prometheus might be feasible to
> package within the next few months, I doubt the same is true for Grafana
> (due to the use of NPM).

Really cool!  Knowing for sure how healthy it is (or not) is really nice.

Cheers,
simon

