* Thoughts on building things for substitutes and the Guix Build Coordinator

From: Christopher Baines @ 2020-11-17 20:45 UTC
To: guix-devel

Hey,

In summary, this email lists the good things and bad things that you
might experience when using the Guix Build Coordinator for providing
substitutes for Guix.

So, over the last ~7 months I've been working on the Guix Build
Coordinator [1]. I think the first email I sent about it is [2], and I'm
not sure if I've sent another one. I did prepare a talk on it though,
which goes through some of the workings [3].

1: https://git.cbaines.net/guix/build-coordinator/tree/README.org
2: https://lists.gnu.org/archive/html/guix-devel/2020-04/msg00323.html
3: https://xana.lepiller.eu/guix-days-2020/guix-days-2020-christopher-baines-guix-build-coordinator.webm

Over the last few weeks I've fixed up and tested the Guix services for
the Guix Build Coordinator, as well as fixing some major issues, like it
segfaulting frequently. I've been using the Guix Build Coordinator to
build substitutes for guix.cbaines.net, which is my testing ground for
providing substitutes. I think it's working reasonably well.

I wanted to write this email, though, to set out more about actually
using the Guix Build Coordinator to build things for substitutes, to
help inform any conversations that happen about that.

First, the good things:

The way the Guix Build Coordinator generates compressed nars where the
agent runs, then sends them over the network to the coordinator, has a
few benefits. The (sometimes expensive) work of generating the nars
takes place where the agents are, so if you've got a bunch of machines
running agents, that work is distributed.
Also, when the nars are received by the coordinator, you have exactly
what you need for serving substitutes. You just generate narinfo files,
and then place the nars + narinfos where they can be fetched. The Guix
Build Coordinator contains code to help with this.

Because you aren't copying the store items back into a single store, or
serving substitutes from the store, you don't need to scale the store to
serve more substitutes. You've still got a bunch of nars + narinfos to
store, but I think that is an easier problem to tackle.

This isn't strictly a benefit of the Guix Build Coordinator, but in
contrast to Cuirass when run on a store which is subject to periodic
garbage collection: assuming you're pairing the Guix Build Coordinator
with the Guix Data Service to provide substitutes for the derivations,
you don't run the risk of garbage collecting the derivations prior to
building them. As I say, this isn't really a benefit of the Guix Build
Coordinator; you'd potentially have the same issue if you ran the Guix
Build Coordinator with guix publish (on a machine which GCs) to provide
derivations, but I thought I'd mention it anyway.

The Guix Build Coordinator supports prioritisation of builds. You can
assign a priority to builds, and it'll try to order builds in such a way
that the higher priority builds get processed first. If the aim is to
serve substitutes, doing some prioritisation might help build the most
fetched things first.

Another feature supported by the Guix Build Coordinator is retries. If a
build fails, the Guix Build Coordinator can automatically retry it. In a
perfect world, everything would succeed first time, but because the
world isn't perfect, there can still be intermittent build failures.
Retrying failed builds even once can help reduce the chance that a
failure leads to no substitutes for that build, as well as for any
builds that depend on its output.
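To make the prioritisation and retry behaviour concrete, here's a rough
sketch (a hypothetical Python model, not the coordinator's actual Guile
code; the names are made up) of a queue that hands out higher-priority
builds first and re-queues failed builds up to a retry limit:

```python
import heapq

class BuildQueue:
    """Toy model: priority ordering plus automatic retries."""

    def __init__(self, max_retries=1):
        self.max_retries = max_retries
        self._heap = []
        self._counter = 0  # tie-breaker so equal priorities keep FIFO order

    def submit(self, derivation, priority=0, attempt=0):
        # heapq is a min-heap, so negate priority: higher priority pops first
        heapq.heappush(self._heap, (-priority, self._counter, derivation, attempt))
        self._counter += 1

    def next_build(self):
        if not self._heap:
            return None
        _, _, drv, attempt = heapq.heappop(self._heap)
        return drv, attempt

    def report_failure(self, derivation, priority, attempt):
        # Intermittent failure? Re-queue, up to max_retries extra attempts.
        if attempt < self.max_retries:
            self.submit(derivation, priority, attempt + 1)

q = BuildQueue(max_retries=1)
q.submit("low.drv", priority=1)
q.submit("high.drv", priority=10)
drv, attempt = q.next_build()       # the high-priority build comes first
q.report_failure(drv, 10, attempt)  # spurious failure: it gets re-queued once
```

Manually retrying a build then amounts to calling submit again for the
same derivation, which matches how the coordinator treats retries as
just another build for that derivation.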
Now the not so good things:

The Guix Build Coordinator just builds things: if you want to build all
Guix packages, you need to work out the derivations, then submit builds
for all of them. There's a script I wrote that does this with the help
of a Guix Data Service instance, but that might not be ideal for all
deployments. So even though it can handle the building of things, and
most of the serving-substitutes part (just not the serving bit), some
other component(s) are needed.

Because the build results don't end up in a store (they could, but as
set out above, not being in the store is a feature I think), you can't
use `guix gc` to get rid of old store entries/substitutes. I have some
ideas about what to implement to provide some kind of GC approach over a
bunch of nars + narinfos, but I haven't implemented anything yet.

There could be issues with the implementation… I'd like to think it's
relatively simple, but that doesn't mean there aren't issues. For some
reason or another, getting backtraces for exceptions rarely works: most
of the time, when the coordinator tries to print a backtrace, the part
of Guile doing so raises an exception. I've also managed to cause it to
segfault, through using SQLite incorrectly, which hasn't been obvious to
fix, at least for me. Additionally, there are some places where I'm
fighting against bits of Guix, things like checking for substitutes
without caching, or substituting a derivation without starting to build
it.

Finally, the instrumentation is somewhat reliant on Prometheus, and if
you want a pretty dashboard, then you might need Grafana too. Neither of
these is packaged for Guix; Prometheus might be feasible to package
within the next few months, but I doubt the same is true for Grafana
(due to its use of NPM).

I think that's a somewhat objective look at what using the Guix Build
Coordinator might be like at the moment. Just let me know if you have
any thoughts or questions.
Thanks,

Chris
* Re: Thoughts on building things for substitutes and the Guix Build Coordinator

From: Ludovic Courtès @ 2020-11-17 22:10 UTC
To: Christopher Baines; +Cc: guix-devel

Hi!

Christopher Baines <mail@cbaines.net> skribis:

> The way the Guix Build Coordinator generates compressed nars where the
> agent runs, then sends them over the network to the coordinator has a
> few benefits. The (sometimes expensive) work of generating the nars
> takes place where the agents are, so if you've got a bunch of machines
> running agents, that work is distributed.
>
> Also, when the nars are received by the coordinator, you have exactly
> what you need for serving substitutes. You just generate narinfo files,
> and then place the nars + narinfos where they can be fetched. The Guix
> Build Coordinator contains code to help with this.
>
> Because you aren't copying the store items back into a single store, or
> serving substitutes from the store, you don't need to scale the store to
> serve more substitutes. You've still got a bunch of nars + narinfos to
> store, but I think that is an easier problem to tackle.

Yes, this is good for the use case of providing substitutes and it would
certainly help on a big build farm like berlin.

I see a lot could be shared with (guix scripts publish) and (guix
scripts substitute). We should extract the relevant bits and move them
to new modules explicitly meant for more general consumption. I think
it’s important to reduce duplication.

> The Guix Build Coordinator supports prioritisation of builds.
> You can
> assign a priority to builds, and it'll try to order builds in such a way
> that the higher priority builds get processed first. If the aim is to
> serve substitutes, doing some prioritisation might help building the
> most fetched things first.

Neat!

> Another feature supported by the Guix Build Coordinator is retries. If a
> build fails, the Guix Build Coordinator can automatically retry it. In a
> perfect world, everything would succeed first time, but because the
> world isn't perfect, there still can be intermittent build
> failures. Retrying failed builds even once can help reduce the chance
> that a failure leads to no substitutes for that build as well as any
> builds that depend on that output.

That’s nice too; it’s one of the practical issues we have with Cuirass
and that’s tempting to ignore because “hey it’s all functional!”, but
then reality gets in the way.

> Now the not so good things:
>
> The Guix Build Coordinator just builds things: if you want to build all
> Guix packages, you need to work out the derivations, then submit builds
> for all of them. There's a script I wrote that does this with the help
> of a Guix Data Service instance, but that might not be ideal for all
> deployments. Even though it can handle the building of things, and most
> of the serving substitutes part (just not the serving bit), some other
> component(s) are needed.

That’s OK; it’s good that these two things (computing derivations and
building them) are separate.

> Because the build results don't end up in a store (they could, but as
> set out above, not being in the store is a feature I think), you can't
> use `guix gc` to get rid of old store entries/substitutes. I have some
> ideas about what to implement to provide some kind of GC approach over a
> bunch of nars + narinfos, but I haven't implemented anything yet.

‘guix publish’ has support for that via (guix cache), so if we could
share code, that’d be great.
One option would be to populate /var/cache/guix/publish and to let ‘guix
publish’ serve it from there.

> There could be issues with the implementation… I'd like to think it's
> relatively simple, but that doesn't mean there aren't issues. For some
> reason or another, getting backtraces for exceptions rarely works. Most
> of the time the coordinator tries to print a backtrace, the part of
> Guile doing that raises an exception. I've managed to cause it to
> segfault, through using SQLite incorrectly, which hasn't been obvious to
> fix at least for me. Additionally, there are some places where I'm
> fighting against bits of Guix, things like checking for substitutes
> without caching, or substituting a derivation without starting to build
> it.

I haven’t yet watched your talk, but I’ve watched Mathieu’s, where he
admits to being concerned about the reliability of code involving Fibers
and/or SQLite (which I can understand given his/our experience, although
I’m maybe less pessimistic). What’s your experience, how do you feel
about it?

> Finally, the instrumentation is somewhat reliant on Prometheus, and if
> you want a pretty dashboard, then you might need Grafana too. Neither of
> these is packaged for Guix; Prometheus might be feasible to package
> within the next few months, I doubt the same is true for Grafana
> (due to the use of NPM).

Heh. :-)

Thanks for this update!

Ludo’.
* Re: Thoughts on building things for substitutes and the Guix Build Coordinator

From: Christopher Baines @ 2020-11-18 7:56 UTC
To: Ludovic Courtès; +Cc: guix-devel

Ludovic Courtès <ludo@gnu.org> writes:

> Christopher Baines <mail@cbaines.net> skribis:
>
>> Because you aren't copying the store items back into a single store, or
>> serving substitutes from the store, you don't need to scale the store to
>> serve more substitutes. You've still got a bunch of nars + narinfos to
>> store, but I think that is an easier problem to tackle.
>
> Yes, this is good for the use case of providing substitutes and it would
> certainly help on a big build farm like berlin.
>
> I see a lot could be shared with (guix scripts publish) and (guix
> scripts substitute). We should extract the relevant bits and move them
> to new modules explicitly meant for more general consumption. I think
> it’s important to reduce duplication.

Yeah, that would be good.

>> Another feature supported by the Guix Build Coordinator is retries. If a
>> build fails, the Guix Build Coordinator can automatically retry it. In a
>> perfect world, everything would succeed first time, but because the
>> world isn't perfect, there still can be intermittent build
>> failures. Retrying failed builds even once can help reduce the chance
>> that a failure leads to no substitutes for that build as well as any
>> builds that depend on that output.
>
> That’s nice too; it’s one of the practical issues we have with Cuirass
> and that’s tempting to ignore because “hey it’s all functional!”, but
> then reality gets in the way.

One further benefit related to this is that if you want to manually
retry building a derivation, you just submit a new build for that
derivation.
The Guix Build Coordinator also has no concept of "Failed (dependency)";
it never gives up. This avoids the situation where spurious failures
block other builds.

>> Because the build results don't end up in a store (they could, but as
>> set out above, not being in the store is a feature I think), you can't
>> use `guix gc` to get rid of old store entries/substitutes. I have some
>> ideas about what to implement to provide some kind of GC approach over a
>> bunch of nars + narinfos, but I haven't implemented anything yet.
>
> ‘guix publish’ has support for that via (guix cache), so if we could
> share code, that’d be great.

Guix publish does time-based deletion, based on when the files were
first created, right? If that works for people, that's fine I guess.

Personally, I'm thinking about GC as in: don't delete nar A if you want
to keep nar B, and nar B references nar A. It's perfectly possible that
someone could fetch nar B if you deleted nar A, but it's also possible
that someone couldn't, because of that missing substitute. Maybe I'm
overthinking this though?

The Cuirass + guix publish approach does something similar, because
Cuirass creates GC roots that expire. guix gc wouldn't delete a store
item if it's needed by something that's protected by a Cuirass-created
GC root.

Another complexity here that I didn't set out initially is that there
are places where the Guix Build Coordinator makes decisions based on the
belief that if its database says a build has succeeded for an output,
that output will be available. In a situation where a build needed an
output that had been successfully built, but then deleted, I think the
coordinator would get stuck forever trying that build, with it never
starting because of the missing store item. My thinking on this at the
moment is that maybe what you'd want to do is tell the Guix Build
Coordinator that you've deleted a store item and it's truly missing, but
that would complicate the setup to some degree.
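To sketch the reference-aware GC idea (hypothetical Python; as said
above, nothing like this is implemented yet): compute the closure of the
References listed in the narinfos, starting from the nars you want to
keep, and only ever delete nars outside that closure.

```python
def live_nars(roots, references):
    """Reference closure: every nar reachable from the roots.

    `references` maps a nar to the nars it refers to, as listed in
    the References field of its narinfo.
    """
    live, stack = set(), list(roots)
    while stack:
        item = stack.pop()
        if item not in live:
            live.add(item)
            stack.extend(references.get(item, ()))
    return live

def collectable(all_nars, roots, references):
    """Nars that can be deleted without breaking any kept nar."""
    return set(all_nars) - live_nars(roots, references)

# Keeping nar B keeps nar A too, since B references A; C is unreferenced.
refs = {"B": ["A"], "C": []}
deletable = collectable({"A", "B", "C"}, roots={"B"}, references=refs)
```

Choosing the roots (say, the nars fetched recently) would then be the
policy decision, separate from this mechanical closure computation.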
> One option would be to populate /var/cache/guix/publish and to let ‘guix
> publish’ serve it from there.

That's probably pretty easy to do; I haven't looked at the details
though.

>> There could be issues with the implementation… I'd like to think it's
>> relatively simple, but that doesn't mean there aren't issues. For some
>> reason or another, getting backtraces for exceptions rarely works. Most
>> of the time the coordinator tries to print a backtrace, the part of
>> Guile doing that raises an exception. I've managed to cause it to
>> segfault, through using SQLite incorrectly, which hasn't been obvious to
>> fix at least for me. Additionally, there are some places where I'm
>> fighting against bits of Guix, things like checking for substitutes
>> without caching, or substituting a derivation without starting to build
>> it.
>
> I haven’t yet watched your talk but I’ve watched Mathieu’s, where he
> admits to being concerned about the reliability of code involving Fibers
> and/or SQLite (which I can understand given his/our experience, although
> I’m maybe less pessimistic). What’s your experience, how do you feel
> about it?

The coordinator does use Fibers, plus a lot of different threads for
different things.

Regarding reliability, it's hard to say really. Given I set out to build
something that works across an (unreliable) network, I've built in
reliability through making sure things retry upon failure, among other
things. I definitely haven't chased any blocked fibers; although there
could be some of those lurking in the code, I might not have noticed
because it sorts itself out eventually.

One of the problems I did have recently was that some hooks would just
stop getting processed. Each type of hook has a thread, which just
checked if there were any events to process every second, and processed
any if there were.
I'm not sure what was wrong, but I changed the code to be smarter: be
triggered when new events are actually entered into the database, and
poll every so often just in case. I haven't seen hooks get stuck since
then, but what I'm trying to convey here is that I'm not quite sure how
to track down issues that occur in specific threads.

Another thing to mention here is that implementing support for
PostgreSQL through Guile Squee is still a thing I have in mind, and that
might be more appropriate for larger databases. It's still prone to the
fibers blocking problem, but at least it's harder to cause segfaults
with Squee compared to SQLite.
* Re: Thoughts on building things for substitutes and the Guix Build Coordinator

From: Ludovic Courtès @ 2020-11-20 10:12 UTC
To: Christopher Baines; +Cc: guix-devel

Hi Chris,

Christopher Baines <mail@cbaines.net> skribis:

>>>> Another feature supported by the Guix Build Coordinator is retries. If a
>>>> build fails, the Guix Build Coordinator can automatically retry it. In a
>>>> perfect world, everything would succeed first time, but because the
>>>> world isn't perfect, there still can be intermittent build
>>>> failures. Retrying failed builds even once can help reduce the chance
>>>> that a failure leads to no substitutes for that build as well as any
>>>> builds that depend on that output.
>>>
>>> That’s nice too; it’s one of the practical issues we have with Cuirass
>>> and that’s tempting to ignore because “hey it’s all functional!”, but
>>> then reality gets in the way.
>>
>> One further benefit related to this is that if you want to manually
>> retry building a derivation, you just submit a new build for that
>> derivation.
>>
>> The Guix Build Coordinator also has no concept of "Failed (dependency)",
>> it never gives up. This avoids the situation where spurious failures
>> block other builds.

I think there’s a balance to be found. Being able to retry is nice, but
“never giving up” is not: on a build farm, you could end up always
rebuilding the same derivation that in the end always fails, and that
can be a huge resource waste.

On berlin we run the daemon with ‘--cache-failures’. It’s not great
because, again, it prevents further builds altogether.

Which makes me think we could change the daemon to have a threshold: it
would maintain a derivation build failure count (instead of a Boolean)
and would only prevent rebuilds once a failure threshold has been
reached.
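Sketched in (hypothetical) Python, the idea is just a per-derivation
counter in place of the current Boolean failure flag:

```python
class FailureCache:
    """Toy model of a thresholded failure cache for the daemon.

    Unlike a boolean --cache-failures flag, a derivation only stops
    being rebuilt after it has failed `threshold` times.
    """

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = {}  # derivation -> failure count

    def record_failure(self, drv):
        self.failures[drv] = self.failures.get(drv, 0) + 1

    def should_build(self, drv):
        # Below the threshold, a past failure may have been intermittent,
        # so the build is still allowed.
        return self.failures.get(drv, 0) < self.threshold

cache = FailureCache(threshold=3)
cache.record_failure("foo.drv")
cache.record_failure("foo.drv")
worth_retrying = cache.should_build("foo.drv")  # two failures: still retry
cache.record_failure("foo.drv")
given_up = not cache.should_build("foo.drv")    # threshold reached: stop
```

This keeps the retry benefit for intermittent failures while bounding
the resources wasted on derivations that always fail.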
>>> Because the build results don't end up in a store (they could, but as
>>> set out above, not being in the store is a feature I think), you can't
>>> use `guix gc` to get rid of old store entries/substitutes. I have some
>>> ideas about what to implement to provide some kind of GC approach over a
>>> bunch of nars + narinfos, but I haven't implemented anything yet.
>>
>> ‘guix publish’ has support for that via (guix cache), so if we could
>> share code, that’d be great.
>
> Guix publish does time-based deletion, based on when the files were
> first created, right? If that works for people, that's fine I guess.

Yes, it’s based on the atime (don’t use “noatime”!), though (guix cache)
lets you implement other policies.

> Personally, I'm thinking about GC as in, don't delete nar A if you want
> to keep nar B, and nar B references nar A. It's perfectly possible that
> someone could fetch nar B if you deleted nar A, but it's also possible
> that someone couldn't because of that missing substitute. Maybe I'm
> overthinking this though?

I think you are. :-) ‘guix publish’ doesn’t do anything this fancy and
it works great. The reason is that clients typically always ask for
both A and B, thus the atime of A is the same as that of B.

> The Cuirass + guix publish approach does something similar, because
> Cuirass creates GC roots that expire. guix gc wouldn't delete a store
> item if it's needed by something that's protected by a Cuirass-created
> GC root.

Cuirass has a TTL on GC roots, which thus defines how long things remain
in the store; ‘publish’ has a TTL on nars, which defines how long nars
remain in its cache. The two are disconnected in fact.

> Another complexity here that I didn't set out initially is that there
> are places where the Guix Build Coordinator makes decisions based on the
> belief that if its database says a build has succeeded for an output,
> that output will be available.
> In a situation where a build needed an
> output that had been successfully built, but then deleted, I think the
> coordinator would get stuck forever trying that build, with it never
> starting because of the missing store item. My thinking on this at the
> moment is maybe what you'd want to do is tell the Guix Build Coordinator
> that you've deleted a store item and it's truly missing, but that would
> complicate the setup to some degree.

I think you’d just end up rebuilding it in that case, no?

>> I haven’t yet watched your talk but I’ve watched Mathieu’s, where he
>> admits to being concerned about the reliability of code involving Fibers
>> and/or SQLite (which I can understand given his/our experience, although
>> I’m maybe less pessimistic). What’s your experience, how do you feel
>> about it?
>
> The coordinator does use Fibers, plus a lot of different threads for
> different things.

Interesting, why are some things running in threads?

There also seems to be shared state in ‘create-work-queue’; why not use
message passing?

> Regarding reliability, it's hard to say really. Given I set out to build
> something that works across an (unreliable) network, I've built in
> reliability through making sure things retry upon failure among other
> things. I definitely haven't chased any blocked fibers, although there
> could be some of those lurking in the code, I might have not noticed
> because it sorts itself out eventually.

OK. Most of the issues we see now with offloading and Cuirass are
things you can only experience with a huge store, a large number of
build machines, and a lot of concurrent derivation builds. Perhaps you
are approaching this scale on your instance actually?

Mathieu experimented with the Coordinator on berlin. It would be nice
to see how it behaved there.

> One of the problems I did have recently was that some hooks would just
> stop getting processed.
> Each type of hook has a thread, which just
> checked if there were any events to process every second, and processed
> any if there were. I'm not sure what was wrong, but I changed the code
> to be smarter, be triggered when new events are actually entered into
> the database, and poll every so often just in case. I haven't seen hooks
> get stuck since then, but what I'm trying to convey here is that I'm not
> quite sure how to track down issues that occur in specific threads.
>
> Another thing to mention here is that implementing support for
> PostgreSQL through Guile Squee is still a thing I have in mind, and that
> might be more appropriate for larger databases. It's still prone to the
> fibers blocking problem, but at least it's harder to cause segfaults
> with Squee compared to SQLite.

OK.

I find it really nice to have metrics built in, but I share Mathieu’s
concern about complexity here. If we’re already hitting scalability
issues with SQLite, then perhaps that’s a sign that metrics should be
handled separately.

Would it be an option to implement metrics gathering in a separate,
optional process, which would essentially subscribe to the relevant
hooks/events?

Thanks,
Ludo’.
* Re: Thoughts on building things for substitutes and the Guix Build Coordinator

From: Christopher Baines @ 2020-11-21 9:46 UTC
To: Ludovic Courtès; +Cc: guix-devel

Ludovic Courtès <ludo@gnu.org> writes:

> Hi Chris,
>
> Christopher Baines <mail@cbaines.net> skribis:
>
>>>> Another feature supported by the Guix Build Coordinator is retries. If a
>>>> build fails, the Guix Build Coordinator can automatically retry it. In a
>>>> perfect world, everything would succeed first time, but because the
>>>> world isn't perfect, there still can be intermittent build
>>>> failures. Retrying failed builds even once can help reduce the chance
>>>> that a failure leads to no substitutes for that build as well as any
>>>> builds that depend on that output.
>>>
>>> That’s nice too; it’s one of the practical issues we have with Cuirass
>>> and that’s tempting to ignore because “hey it’s all functional!”, but
>>> then reality gets in the way.
>>
>> One further benefit related to this is that if you want to manually
>> retry building a derivation, you just submit a new build for that
>> derivation.
>>
>> The Guix Build Coordinator also has no concept of "Failed (dependency)",
>> it never gives up. This avoids the situation where spurious failures
>> block other builds.
>
> I think there’s a balance to be found. Being able to retry is nice, but
> “never giving up” is not: on a build farm, you could end up always
> rebuilding the same derivation that in the end always fails, and that
> can be a huge resource waste.
>
> On berlin we run the daemon with ‘--cache-failures’. It’s not great
> because, again, it prevents further builds altogether.
> Which makes me think we could change the daemon to have a threshold: it
> would maintain a derivation build failure count (instead of a Boolean)
> and would only prevent rebuilds once a failure threshold has been
> reached.

So my comment about not giving up here was specifically about
derivations that cannot be built because inputs are missing, which
Cuirass describes as "Failed (dependency)". There's no wasted time spent
building them, since the builds aren't and cannot be attempted.

>>>> Because the build results don't end up in a store (they could, but as
>>>> set out above, not being in the store is a feature I think), you can't
>>>> use `guix gc` to get rid of old store entries/substitutes. I have some
>>>> ideas about what to implement to provide some kind of GC approach over a
>>>> bunch of nars + narinfos, but I haven't implemented anything yet.
>>>
>>> ‘guix publish’ has support for that via (guix cache), so if we could
>>> share code, that’d be great.
>>
>> Guix publish does time-based deletion, based on when the files were
>> first created, right? If that works for people, that's fine I guess.
>
> Yes, it’s based on the atime (don’t use “noatime”!), though (guix cache)
> lets you implement other policies.
>
>> Personally, I'm thinking about GC as in, don't delete nar A if you want
>> to keep nar B, and nar B references nar A. It's perfectly possible that
>> someone could fetch nar B if you deleted nar A, but it's also possible
>> that someone couldn't because of that missing substitute. Maybe I'm
>> overthinking this though?
>
> I think you are. :-) ‘guix publish’ doesn’t do anything this fancy and
> it works great. The reason is that clients typically always ask for
> both A and B, thus the atime of A is the same as that of B.

Ok, the atime thing kind of makes sense.

>> The Cuirass + guix publish approach does something similar, because
>> Cuirass creates GC roots that expire.
>> guix gc wouldn't delete a store
>> item if it's needed by something that's protected by a Cuirass-created
>> GC root.
>
> Cuirass has a TTL on GC roots, which thus defines how long things remain
> in the store; ‘publish’ has a TTL on nars, which defines how long nars
> remain in its cache. The two are disconnected in fact.

I'm seeing the connection in that guix publish populates its cache from
the store, and things cannot be removed from the store in an
inconsistent manner; you always have to have the dependencies.

>> Another complexity here that I didn't set out initially, is that there
>> are places the Guix Build Coordinator makes decisions based on the
>> belief that if its database says a build has succeeded for an output,
>> that output will be available. In a situation where a build needed an
>> output that had been successfully built, but then deleted, I think the
>> coordinator would get stuck forever trying that build and it not
>> starting because of the missing store item. My thinking on this at the
>> moment is maybe what you'd want to do is tell the Guix Build Coordinator
>> that you've deleted a store item and it's truly missing, but that would
>> complicate the setup to some degree.
>
> I think you’d just end up rebuilding it in that case, no?

Well, the current implementation of the build-missing-inputs hook
assumes an output is available if there's been a successful build, so it
wouldn't think to actually build that thing again. I've been thinking
about this, and I think the approach to take might be to make that hook
configurable so that it can actually check the availability of outputs.
This way it can do less guessing based on whether things have been built
in the past.

>>> I haven’t yet watched your talk but I’ve watched Mathieu’s, where he
>>> admits to being concerned about the reliability of code involving Fibers
>>> and/or SQLite (which I can understand given his/our experience, although
>>> I’m maybe less pessimistic).
>>> What’s your experience, how do you feel
>>> about it?
>>
>> The coordinator does use Fibers, plus a lot of different threads for
>> different things.
>
> Interesting, why are some things running in threads?

So, there's a thread for writing to the SQLite database, and there's a
pool of threads for reading from it. Each hook has an associated thread
which processes those hook events. The allocator runs in its own
thread. There's also a pool of threads for reading the chunked parts of
HTTP requests (which is probably/hopefully unnecessary), as well as a
pool of threads for talking to the guix-daemon to fetch substitutes.
Plus, fibers creates threads to run fibers.

> There also seems to be shared state in ‘create-work-queue’; why not use
> message passing?

So, create-work-queue is just used on the agents, and they don't run
fibers. I've also actually separated out the code in such a way that the
agents can run without fibers even being available, which should enable
the agent process to run on the Hurd (as fibers doesn't work there yet).

>> Regarding reliability, it's hard to say really. Given I set out to build
>> something that works across an (unreliable) network, I've built in
>> reliability through making sure things retry upon failure among other
>> things. I definitely haven't chased any blocked fibers, although there
>> could be some of those lurking in the code, I might have not noticed
>> because it sorts itself out eventually.
>
> OK. Most of the issues we see now with offloading and Cuirass are
> things you can only experience with a huge store, a large number of
> build machines, and a lot of concurrent derivation builds. Perhaps you
> are approaching this scale on your instance actually?
>
> Mathieu experimented with the Coordinator on berlin. It would be nice
> to see how it behaved there.
>
>> One of the problems I did have recently was that some hooks would just
>> stop getting processed.
>> Each type of hook has a thread, which just checked if there were any
>> events to process every second, and processed any if there were.  I'm
>> not sure what was wrong, but I changed the code to be smarter: be
>> triggered when new events are actually entered into the database, and
>> poll every so often just in case.  I haven't seen hooks get stuck
>> since then, but what I'm trying to convey here is that I'm not quite
>> sure how to track down issues that occur in specific threads.
>>
>> Another thing to mention here is that implementing support for
>> PostgreSQL through Guile Squee is still a thing I have in mind, and
>> that might be more appropriate for larger databases.  It's still
>> prone to the fibers blocking problem, but at least it's harder to
>> cause segfaults with Squee compared to SQLite.
>
> OK.
>
> I find it really nice to have metrics built in, but I share Mathieu’s
> concern about complexity here.  If we’re already hitting scalability
> issues with SQLite, then perhaps that’s a sign that metrics should be
> handled separately.

I think SQLite is actually doing OK.  The database for guix.cbaines.net
is 18G in size, which is larger than it needs to be I think, but there
are improvements that I can make to reduce that.  I'm pretty happy with
the performance as well at the moment.

> Would it be an option to implement metrics gathering in a separate,
> optional process, which would essentially subscribe to the relevant
> hooks/events?

So, this is kind of what already happens.  I wrote a Guile library for
Prometheus-style metrics as I was writing the Guix Build Coordinator
[1].  The coordinator uses this, and you can see the metrics data here
[2] for example.  The endpoint is far too slow, I need to fix that, but
it does work.
1: https://git.cbaines.net/guile/prometheus/
2: https://coordinator.guix.cbaines.net/metrics

Now, you can get some information just by reading that page, but it gets
more useful and more readable when you have Prometheus regularly scrape
and record those metrics, so you can graph them over time.
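For readers unfamiliar with it, a Prometheus-style /metrics page is just
plain text: `# HELP` and `# TYPE` comment lines followed by name–value
samples, which the Prometheus server scrapes on a schedule.  Here is a
minimal Python sketch of rendering that format; the metric names are
invented for illustration and are not the coordinator's actual metrics:

```python
# Minimal sketch of the Prometheus text exposition format, as served on
# a /metrics endpoint.  Metric names below are hypothetical examples.

def render_metrics(counters, gauges):
    """Render counters and gauges in the Prometheus text format.

    Each argument maps a metric name to a (help-text, value) pair."""
    lines = []
    for kind, metrics in (("counter", counters), ("gauge", gauges)):
        for name, (help_text, value) in metrics.items():
            lines.append(f"# HELP {name} {help_text}")
            lines.append(f"# TYPE {name} {kind}")
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

print(render_metrics(
    {"builds_completed_total": ("Builds completed since startup", 1234)},
    {"queued_builds": ("Builds currently waiting for an agent", 56)},
))
```

A scraper records these samples over time, which is what makes graphing
them (e.g. in Grafana) possible.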
* Thoughts on CI (was: Thoughts on building things for substitutes and the Guix Build Coordinator)
  2020-11-17 20:45 Thoughts on building things for substitutes and the Guix Build Coordinator Christopher Baines
  2020-11-17 22:10 ` Ludovic Courtès
@ 2020-11-23 22:56 ` zimoun
  2020-11-24  9:42 ` Christopher Baines
  1 sibling, 1 reply; 7+ messages in thread
From: zimoun @ 2020-11-23 22:56 UTC (permalink / raw)
To: Christopher Baines, guix-devel, Ludovic Courtès, Mathieu Othacehe

Hi,

(Disclaimer: I am biased, since I have been Mathieu’s rubber duck [1]
for his “new CI design” presented in his talk, and I have read his first
experimental implementations.)

1: <https://en.wikipedia.org/wiki/Rubber_duck_debugging>

Thank you Chris for this detailed email and your inspiring talk.  Really
interesting!  The discussion has been really fruitful, at least for me.

From my understanding, the Guix Build Coordinator is designed to
distribute the workload in a heterogeneous context (distant machines).
IIUC, the design of the GBC could implement some of Andreas’s ideas,
because the GBC is designed to support unreliable networks and it even
has an experimental trust mechanism for the workers.  The un-queueing
algorithm implemented in the GBC is not clear to me; it appears to be
“work stealing”, but I have not read the code.

Mathieu’s offload is designed for a cluster, with the architecture of
Berlin in mind, reusing as much as possible the existing parts of Guix.

Since Berlin is a cluster, the workers are already trusted, so Avahi
allows discovering them; adding and removing machines should be
hot-swappable, without reconfiguration involved.  In other words, the
controller/coordinator (master) does not need the list of workers.
That’s one of the dynamic parts.

The second dynamic part is “work stealing”, and to do so, ZeroMQ is used
both for communication and for un-queueing (work stealing).
This library is used because it allows focusing on the design, avoiding
reimplementing the scheduling strategy and probable bugs from using
Fibers to communicate.  Well, that’s how I understand it.

For sure, the ’guile-simple-zmq’ wrapper is not bullet-proof; but it is
simply a wrapper, and ZeroMQ is well-tested, AFAIK.  Well, we could
imagine replacing this ZMQ library, in a second step, with Fibers plus a
Scheme reimplementation of the scheduling strategy, once the design is a
bit tested.

> I've been using the Guix Build Coordinator to build substitutes for
> guix.cbaines.net, which is my testing ground for providing
> substitutes.  I think it's working reasonably well.

What is the configuration of this machine?  Size of the store?  Number
of workers where the agents are running?

> The Guix Build Coordinator supports prioritisation of builds.  You can
> assign a priority to builds, and it'll try to order builds in such a
> way that the higher priority builds get processed first.  If the aim
> is to serve substitutes, doing some prioritisation might help building
> the most fetched things first.

This is really cool!  How does it work?  Do you manually tag some
specific derivations?

> Another feature supported by the Guix Build Coordinator is retries.
> If a build fails, the Guix Build Coordinator can automatically retry
> it.  In a
[…]
> perfect world, everything would succeed first time, but because the
> world isn't perfect, there still can be intermittent build failures.
> Retrying failed builds even once can help reduce the chance that a
> failure leads to no substitutes for that build, as well as any builds
> that depend on that output.

Yeah, something in the current infrastructure is lacking to distinguish
between an error (“the build completed but returned an error”) and a
failure (“something along the way went wrong”).
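The bounded-retry behaviour quoted above can be sketched as a small
wrapper that retries only failures around the build (the kind worth
retrying), not builds that completed with a real error.  This is a
hypothetical Python illustration, not the coordinator's actual API; the
exception class name is invented:

```python
# Illustrative sketch of retrying intermittent build failures a bounded
# number of times.  The exception class and function names are
# hypothetical, not taken from the Guix Build Coordinator.

class IntermittentFailure(Exception):
    """Something went wrong around the build (network, agent died, ...)."""

def run_with_retries(build, attempts=2):
    """Run `build` (a zero-argument callable), retrying intermittent
    failures up to `attempts` times.  A build that completes but reports
    a real build error would be returned as-is, not retried."""
    last = None
    for _ in range(attempts):
        try:
            return build()
        except IntermittentFailure as exc:
            last = exc  # transient failure: worth retrying
    raise last

calls = []
def flaky():
    # Fails once with a transient error, then succeeds.
    calls.append(1)
    if len(calls) < 2:
        raise IntermittentFailure("agent connection dropped")
    return "success"

print(run_with_retries(flaky))
```

The point of the error/failure distinction is exactly this: retrying a
deterministic build error wastes work, while retrying a transient
failure often recovers the substitute.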
> Because the build results don't end up in a store (they could, but as
> set out above, not being in the store is a feature I think), you can't
> use `guix gc` to get rid of old store entries/substitutes.  I have
> some ideas about what to implement to provide some kind of GC approach
> over a bunch of nars + narinfos, but I haven't implemented anything
> yet.

Where do they end up, then?  I missed your answer in the Question/Answer
session.

Speaking about Berlin, the builds should be in the workers’ stores (with
a GC policy to be defined; keep them for debugging concerns?) and the
main store should have only the minimum.  The items should really only
be stored in the cache of publish, IMHO.

Maybe I am missing something.

> There could be issues with the implementation… I'd like to think it's
> relatively simple, but that doesn't mean there aren't issues.  For
> some
[…]
> reason or another, getting backtraces for exceptions rarely works.
> Most of the time the coordinator tries to print a backtrace, the part
> of Guile doing that raises an exception.  I've managed to cause it to
> segfault, through using SQLite incorrectly, which hasn't been obvious
> to fix, at least for me.  Additionally, there are some places where
> I'm fighting against bits of Guix, things like checking for
> substitutes without caching, or substituting a derivation without
> starting to build it.

I am confused by all the SQL involved.  And I feel it is hard to
maintain when scaling at large.  I do not know; I am a newbie.

> Finally, the instrumentation is somewhat reliant on Prometheus, and if
> you want a pretty dashboard, then you might need Grafana too.  Both of
> these things aren't packaged for Guix; Prometheus might be feasible to
> package within the next few months, but I doubt the same is true for
> Grafana (due to the use of NPM).

Really cool!  For sure, knowing whether it is healthy (or not) is really
nice.

Cheers,
simon
* Re: Thoughts on CI (was: Thoughts on building things for substitutes and the Guix Build Coordinator)
  2020-11-23 22:56 ` Thoughts on CI (was: Thoughts on building things for substitutes and the Guix Build Coordinator) zimoun
@ 2020-11-24  9:42 ` Christopher Baines
  0 siblings, 0 replies; 7+ messages in thread
From: Christopher Baines @ 2020-11-24 9:42 UTC (permalink / raw)
To: zimoun; +Cc: guix-devel, Mathieu Othacehe

zimoun <zimon.toutoune@gmail.com> writes:

> Hi,
>
> (Disclaimer: I am biased, since I have been Mathieu’s rubber duck [1]
> for his “new CI design” presented in his talk, and I have read his
> first experimental implementations.)
>
> 1: <https://en.wikipedia.org/wiki/Rubber_duck_debugging>
>
> Thank you Chris for this detailed email and your inspiring talk.
> Really interesting!  The discussion has been really fruitful, at least
> for me.

You're welcome :)

> From my understanding, the Guix Build Coordinator is designed to
> distribute the workload in a heterogeneous context (distant machines).

It's not specific to similar or dissimilar machines, although I was
wanting something that would work in both situations.

> IIUC, the design of the GBC could implement some of Andreas’s ideas,
> because the GBC is designed to support unreliable networks and it even
> has an experimental trust mechanism for the workers.  The un-queueing
> algorithm implemented in the GBC is not clear to me; it appears to be
> “work stealing”, but I have not read the code.
>
> Mathieu’s offload is designed for a cluster, with the architecture of
> Berlin in mind, reusing as much as possible the existing parts of
> Guix.
>
> Since Berlin is a cluster, the workers are already trusted, so Avahi
> allows discovering them; adding and removing machines should be
> hot-swappable, without reconfiguration involved.  In other words, the
> controller/coordinator (master) does not need the list of workers.
> That’s one of the dynamic parts.
Yeah, I think that's a really nice feature.  Although I actually see
more use for that if I were trying to perform builds across machines in
my house, since they come and go more than machines in a datacentre.

> The second dynamic part is “work stealing”, and to do so, ZeroMQ is
> used both for communication and for un-queueing (work stealing).  This
> library is used because it allows focusing on the design, avoiding
> reimplementing the scheduling strategy and probable bugs from using
> Fibers to communicate.  Well, that’s how I understand it.
>
> For sure, the ’guile-simple-zmq’ wrapper is not bullet-proof; but it
> is simply a wrapper, and ZeroMQ is well-tested, AFAIK.  Well, we could
> imagine replacing this ZMQ library, in a second step, with Fibers plus
> a Scheme reimplementation of the scheduling strategy, once the design
> is a bit tested.

>> I've been using the Guix Build Coordinator to build substitutes for
>> guix.cbaines.net, which is my testing ground for providing
>> substitutes.  I think it's working reasonably well.
>
> What is the configuration of this machine?  Size of the store?  Number
> of workers where the agents are running?

It has varied; at the moment I'm using 1 small virtual machine for the
coordinator, plus 4 small virtual machines to build packages (3 of these
I run for other reasons as well).  I'm also using my desktop computer to
build packages, which is a much faster machine, but it's only building
packages when I'm using it for other things.

So 4 or 5 machines running the agents, plus 1 machine for the
coordinator itself.  They all have small stores.  For guix.cbaines.net,
I'm using an S3 service (wasabi.com) for the nar storage, which
currently has 1.4TB of nars.

>> The Guix Build Coordinator supports prioritisation of builds.  You
>> can assign a priority to builds, and it'll try to order builds in
>> such a way that the higher priority builds get processed first.
>> If the aim is to serve substitutes, doing some prioritisation might
>> help building the most fetched things first.
>
> This is really cool!  How does it work?  Do you manually tag some
> specific derivations?

The logic for assigning priorities is currently very simple for the
guix.cbaines.net builds; it's here [1].  The channel instance (guix pull
related) derivations are prioritised over packages, with x86_64-linux
being prioritised over other architectures.  For packages, x86_64-linux
is prioritised over other architectures.

1: https://git.cbaines.net/guix/build-coordinator/tree/scripts/guix-build-coordinator-queue-builds-from-guix-data-service.in#n174

In the future, it might be good to try and work out what packages are
more popular or harder to build, and prioritise more along those lines.

>> Because the build results don't end up in a store (they could, but as
>> set out above, not being in the store is a feature I think), you
>> can't use `guix gc` to get rid of old store entries/substitutes.  I
>> have some ideas about what to implement to provide some kind of GC
>> approach over a bunch of nars + narinfos, but I haven't implemented
>> anything yet.
>
> Where do they end up, then?  I missed your answer in the
> Question/Answer session.

That's something as a user that you have to decide when configuring the
service.  The simplest option is just to have a directory on the machine
running the coordinator, where the narinfo and nar files get put.  You
then serve this using NGinx/Apache-HTTPD/...

For guix.cbaines.net, I don't want to pay for a server with lots of
storage; it's cheaper to pay wasabi.com in this case to just store the
files and do the web serving.  There's code available to use an S3
service for storage, and it's not difficult to do similar things.

> Speaking about Berlin, the builds should be in the workers’ stores
> (with a GC policy to be defined; keep them for debugging concerns?)
> and the main store should have only the minimum.
> The items should really only be stored in the cache of publish, IMHO.
>
> Maybe I am missing something.

That sounds OK.  The main disadvantage of worker machines performing
garbage collection is that they might have to re-download some stuff to
perform new builds, so you want to have a good mix of local store items,
plus plenty of space.

At least the way I've been using the Guix Build Coordinator, there is no
main store.  The store on the machine used for the coordinator process
tends to fill up with derivations, but that's about it, and that happens
very slowly.  I also haven't tested trying to mix guix publish and the
Guix Build Coordinator; it could probably work though.

>> There could be issues with the implementation… I'd like to think it's
>> relatively simple, but that doesn't mean there aren't issues.  For
>> some
> […]
>> reason or another, getting backtraces for exceptions rarely works.
>> Most of the time the coordinator tries to print a backtrace, the part
>> of Guile doing that raises an exception.  I've managed to cause it to
>> segfault, through using SQLite incorrectly, which hasn't been obvious
>> to fix, at least for me.  Additionally, there are some places where
>> I'm fighting against bits of Guix, things like checking for
>> substitutes without caching, or substituting a derivation without
>> starting to build it.
>
> I am confused by all the SQL involved.  And I feel it is hard to
> maintain when scaling at large.  I do not know; I am a newbie.

While I quite like SQL, I also do like stateless things, or keeping as
little state as possible.

I think the database size is one aspect of scaling that could do with a
bit more work.  There are also bits of the Guix Build Coordinator that
are serialised, like the processing of hook events, and that can be a
bottleneck when the build throughput is quite high.

While I've had trouble with SQLite, and guile-squee in the Guix Data
Service, things are seeming pretty stable now with both projects.
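The hook-processing fix described earlier in the thread — a dedicated
thread per hook type that wakes when new events are entered, with a
periodic poll as a safety net — can be sketched with a condition
variable.  This is a hypothetical Python illustration of the pattern,
not the coordinator's actual Guile code:

```python
# Sketch of event-driven hook processing with a fallback poll: the
# worker thread wakes when notified of new events, but also wakes every
# poll_interval seconds in case a notification was missed.  Names and
# structure are hypothetical, for illustration only.

import threading

class HookProcessor:
    def __init__(self, poll_interval=5.0):
        self.events = []
        self.cond = threading.Condition()
        self.poll_interval = poll_interval
        self.processed = []
        self.stopping = False

    def enqueue(self, event):
        with self.cond:
            self.events.append(event)
            self.cond.notify()  # wake the worker immediately

    def run(self):
        while True:
            with self.cond:
                if not self.events and not self.stopping:
                    # Wake on notify, or after poll_interval as a
                    # safety net against missed notifications.
                    self.cond.wait(timeout=self.poll_interval)
                batch, self.events = self.events, []
                if self.stopping and not batch:
                    return
            for event in batch:
                self.processed.append(event)  # the hook would run here

    def shutdown(self):
        with self.cond:
            self.stopping = True
            self.cond.notify()

p = HookProcessor(poll_interval=0.1)
worker = threading.Thread(target=p.run)
worker.start()
p.enqueue("build-success")
p.enqueue("build-failure")
p.shutdown()
worker.join()
print(p.processed)
```

Because all events go through one worker per hook type, processing stays
serialised, which is also why it can become a bottleneck at high build
throughput, as noted above.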
>> Finally, the instrumentation is somewhat reliant on Prometheus, and
>> if you want a pretty dashboard, then you might need Grafana too.
>> Both of these things aren't packaged for Guix; Prometheus might be
>> feasible to package within the next few months, but I doubt the same
>> is true for Grafana (due to the use of NPM).
>
> Really cool!  For sure, knowing whether it is healthy (or not) is
> really nice.

Hope this helps, just let me know if you have any more comments or
questions :)

Chris
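The prioritisation logic discussed in this thread — channel-instance
(guix pull) derivations before packages, x86_64-linux before other
architectures — boils down to ordering the queue by a compound sort key.
A hypothetical Python sketch, with invented field names and build
records standing in for the coordinator's actual data:

```python
# Sketch of the build prioritisation described in the thread:
# channel-instance derivations before packages, x86_64-linux before
# other architectures.  Field names and records are made up; only the
# ordering matters.

def priority(build):
    """Return a sort key: lower tuples are processed first."""
    kind_rank = 0 if build["kind"] == "channel-instance" else 1
    arch_rank = 0 if build["system"] == "x86_64-linux" else 1
    return (kind_rank, arch_rank)

queue = [
    {"id": 1, "kind": "package",          "system": "aarch64-linux"},
    {"id": 2, "kind": "channel-instance", "system": "x86_64-linux"},
    {"id": 3, "kind": "package",          "system": "x86_64-linux"},
    {"id": 4, "kind": "channel-instance", "system": "aarch64-linux"},
]
ordered = sorted(queue, key=priority)
print([b["id"] for b in ordered])  # → [2, 4, 3, 1]
```

Extending this toward "most fetched first" would just mean adding a
popularity term to the key, which matches the future direction mentioned
in the reply above.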
Thread overview: 7+ messages (newest: 2020-11-24 9:43 UTC)
2020-11-17 20:45 Thoughts on building things for substitutes and the Guix Build Coordinator Christopher Baines
2020-11-17 22:10 ` Ludovic Courtès
2020-11-18  7:56 ` Christopher Baines
2020-11-20 10:12 ` Ludovic Courtès
2020-11-21  9:46 ` Christopher Baines
2020-11-23 22:56 ` Thoughts on CI (was: Thoughts on building things for substitutes and the Guix Build Coordinator) zimoun
2020-11-24  9:42 ` Christopher Baines