* Thoughts on building things for substitutes and the Guix Build Coordinator

From: Christopher Baines @ 2020-11-17 20:45 UTC
To: guix-devel

Hey,

In summary, this email lists the good things and bad things that you
might experience when using the Guix Build Coordinator for providing
substitutes for Guix.

So, over the last ~7 months I've been working on the Guix Build
Coordinator [1]. I think the first email I sent about it is [2], and I'm
not sure if I've sent another one. I did prepare a talk on it though,
which goes through some of the workings [3].

1: https://git.cbaines.net/guix/build-coordinator/tree/README.org
2: https://lists.gnu.org/archive/html/guix-devel/2020-04/msg00323.html
3: https://xana.lepiller.eu/guix-days-2020/guix-days-2020-christopher-baines-guix-build-coordinator.webm

Over the last few weeks I've fixed up and tested the Guix services for
the Guix Build Coordinator, as well as fixing some major issues, like it
segfaulting frequently. I've been using the Guix Build Coordinator to
build substitutes for guix.cbaines.net, which is my testing ground for
providing substitutes. I think it's working reasonably well.

I wanted to write this email, though, to set out more about actually
using the Guix Build Coordinator to build things for substitutes, to
help inform any conversations that happen about that.

First, the good things:

The way the Guix Build Coordinator generates compressed nars where the
agent runs, then sends them over the network to the coordinator, has a
few benefits. The (sometimes expensive) work of generating the nars
takes place where the agents are, so if you've got a bunch of machines
running agents, that work is distributed.
Also, when the nars are received by the coordinator, you have exactly
what you need for serving substitutes. You just generate narinfo files,
and then place the nars + narinfos where they can be fetched. The Guix
Build Coordinator contains code to help with this.

Because you aren't copying the store items back into a single store, or
serving substitutes from the store, you don't need to scale the store to
serve more substitutes. You've still got a bunch of nars + narinfos to
store, but I think that is an easier problem to tackle.

This isn't strictly a benefit of the Guix Build Coordinator, but in
contrast to Cuirass when run on a store which is subject to periodic
garbage collection: assuming you're pairing the Guix Build Coordinator
with the Guix Data Service to provide substitutes for the derivations,
you don't run the risk of garbage collecting the derivations prior to
building them. As I say, this isn't really a benefit of the Guix Build
Coordinator; you'd potentially have the same issue if you ran the Guix
Build Coordinator with guix publish (on a machine which GCs) to provide
derivations, but I thought I'd mention it anyway.

The Guix Build Coordinator supports prioritisation of builds. You can
assign a priority to builds, and it'll try to order builds in such a way
that the higher priority builds get processed first. If the aim is to
serve substitutes, doing some prioritisation might help build the most
fetched things first.

Another feature supported by the Guix Build Coordinator is retries. If a
build fails, the Guix Build Coordinator can automatically retry it. In a
perfect world, everything would succeed first time, but because the
world isn't perfect, there can still be intermittent build failures.
Retrying failed builds even once can help reduce the chance that a
failure leads to no substitutes for that build, as well as for any
builds that depend on its output.
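To make the prioritisation and retry behaviour concrete, here's a rough
sketch (a hypothetical Python model, not the coordinator's actual Guile
code; the names are made up) of a queue that hands out higher-priority
builds first and re-queues failed builds up to a retry limit:

```python
import heapq

class BuildQueue:
    """Toy model: priority ordering plus automatic retries."""

    def __init__(self, max_retries=1):
        self.max_retries = max_retries
        self._heap = []
        self._counter = 0  # tie-breaker so equal priorities keep FIFO order

    def submit(self, derivation, priority=0, attempt=0):
        # heapq is a min-heap, so negate priority: higher priority pops first
        heapq.heappush(self._heap, (-priority, self._counter, derivation, attempt))
        self._counter += 1

    def next_build(self):
        if not self._heap:
            return None
        _, _, drv, attempt = heapq.heappop(self._heap)
        return drv, attempt

    def report_failure(self, derivation, priority, attempt):
        # Intermittent failure? Re-queue, up to max_retries extra attempts.
        if attempt < self.max_retries:
            self.submit(derivation, priority, attempt + 1)

q = BuildQueue(max_retries=1)
q.submit("low.drv", priority=1)
q.submit("high.drv", priority=10)
drv, attempt = q.next_build()       # the high-priority build comes first
q.report_failure(drv, 10, attempt)  # spurious failure: it gets re-queued once
```

Manually retrying a build then amounts to calling submit again for the
same derivation, which matches how the coordinator treats retries as
just another build for that derivation.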
Now the not so good things:

The Guix Build Coordinator just builds things: if you want to build all
Guix packages, you need to work out the derivations, then submit builds
for all of them. There's a script I wrote that does this with the help
of a Guix Data Service instance, but that might not be ideal for all
deployments. So even though it can handle the building of things, and
most of the serving-substitutes part (just not the serving bit), some
other component(s) are needed.

Because the build results don't end up in a store (they could, but as
set out above, not being in the store is a feature I think), you can't
use `guix gc` to get rid of old store entries/substitutes. I have some
ideas about what to implement to provide some kind of GC approach over a
bunch of nars + narinfos, but I haven't implemented anything yet.

There could be issues with the implementation… I'd like to think it's
relatively simple, but that doesn't mean there aren't issues. For some
reason or another, getting backtraces for exceptions rarely works: most
of the time, when the coordinator tries to print a backtrace, the part
of Guile doing so raises an exception. I've also managed to cause it to
segfault, through using SQLite incorrectly, which hasn't been obvious to
fix, at least for me. Additionally, there are some places where I'm
fighting against bits of Guix, things like checking for substitutes
without caching, or substituting a derivation without starting to build
it.

Finally, the instrumentation is somewhat reliant on Prometheus, and if
you want a pretty dashboard, then you might need Grafana too. Neither of
these is packaged for Guix; Prometheus might be feasible to package
within the next few months, but I doubt the same is true for Grafana
(due to its use of NPM).

I think that's a somewhat objective look at what using the Guix Build
Coordinator might be like at the moment. Just let me know if you have
any thoughts or questions.
Thanks,

Chris
* Re: Thoughts on building things for substitutes and the Guix Build Coordinator

From: Ludovic Courtès @ 2020-11-17 22:10 UTC
To: Christopher Baines; +Cc: guix-devel

Hi!

Christopher Baines <mail@cbaines.net> skribis:

> The way the Guix Build Coordinator generates compressed nars where the
> agent runs, then sends them over the network to the coordinator has a
> few benefits. The (sometimes expensive) work of generating the nars
> takes place where the agents are, so if you've got a bunch of machines
> running agents, that work is distributed.
>
> Also, when the nars are received by the coordinator, you have exactly
> what you need for serving substitutes. You just generate narinfo files,
> and then place the nars + narinfos where they can be fetched. The Guix
> Build Coordinator contains code to help with this.
>
> Because you aren't copying the store items back into a single store, or
> serving substitutes from the store, you don't need to scale the store to
> serve more substitutes. You've still got a bunch of nars + narinfos to
> store, but I think that is an easier problem to tackle.

Yes, this is good for the use case of providing substitutes and it would
certainly help on a big build farm like berlin.

I see a lot could be shared with (guix scripts publish) and (guix
scripts substitute). We should extract the relevant bits and move them
to new modules explicitly meant for more general consumption. I think
it’s important to reduce duplication.

> The Guix Build Coordinator supports prioritisation of builds.
> You can
> assign a priority to builds, and it'll try to order builds in such a way
> that the higher priority builds get processed first. If the aim is to
> serve substitutes, doing some prioritisation might help building the
> most fetched things first.

Neat!

> Another feature supported by the Guix Build Coordinator is retries. If a
> build fails, the Guix Build Coordinator can automatically retry it. In a
> perfect world, everything would succeed first time, but because the
> world isn't perfect, there still can be intermittent build
> failures. Retrying failed builds even once can help reduce the chance
> that a failure leads to no substitutes for that build as well as any
> builds that depend on that output.

That’s nice too; it’s one of the practical issues we have with Cuirass
and that’s tempting to ignore because “hey it’s all functional!”, but
then reality gets in the way.

> Now the not so good things:
>
> The Guix Build Coordinator just builds things: if you want to build all
> Guix packages, you need to work out the derivations, then submit builds
> for all of them. There's a script I wrote that does this with the help
> of a Guix Data Service instance, but that might not be ideal for all
> deployments. Even though it can handle the building of things, and most
> of the serving substitutes part (just not the serving bit), some other
> component(s) are needed.

That’s OK; it’s good that these two things (computing derivations and
building them) are separate.

> Because the build results don't end up in a store (they could, but as
> set out above, not being in the store is a feature I think), you can't
> use `guix gc` to get rid of old store entries/substitutes. I have some
> ideas about what to implement to provide some kind of GC approach over a
> bunch of nars + narinfos, but I haven't implemented anything yet.

‘guix publish’ has support for that via (guix cache), so if we could
share code, that’d be great.
One option would be to populate /var/cache/guix/publish and to let ‘guix
publish’ serve it from there.

> There could be issues with the implementation… I'd like to think it's
> relatively simple, but that doesn't mean there aren't issues. For some
> reason or another, getting backtraces for exceptions rarely works. Most
> of the time the coordinator tries to print a backtrace, the part of
> Guile doing that raises an exception. I've managed to cause it to
> segfault, through using SQLite incorrectly, which hasn't been obvious to
> fix at least for me. Additionally, there are some places where I'm
> fighting against bits of Guix, things like checking for substitutes
> without caching, or substituting a derivation without starting to build
> it.

I haven’t yet watched your talk, but I’ve watched Mathieu’s, where he
admits to being concerned about the reliability of code involving Fibers
and/or SQLite (which I can understand given his/our experience, although
I’m maybe less pessimistic). What’s your experience, how do you feel
about it?

> Finally, the instrumentation is somewhat reliant on Prometheus, and if
> you want a pretty dashboard, then you might need Grafana too. Neither of
> these is packaged for Guix; Prometheus might be feasible to package
> within the next few months, I doubt the same is true for Grafana
> (due to the use of NPM).

Heh. :-)

Thanks for this update!

Ludo’.
* Re: Thoughts on building things for substitutes and the Guix Build Coordinator

From: Christopher Baines @ 2020-11-18 7:56 UTC
To: Ludovic Courtès; +Cc: guix-devel

Ludovic Courtès <ludo@gnu.org> writes:

> Christopher Baines <mail@cbaines.net> skribis:
>
>> Because you aren't copying the store items back into a single store, or
>> serving substitutes from the store, you don't need to scale the store to
>> serve more substitutes. You've still got a bunch of nars + narinfos to
>> store, but I think that is an easier problem to tackle.
>
> Yes, this is good for the use case of providing substitutes and it would
> certainly help on a big build farm like berlin.
>
> I see a lot could be shared with (guix scripts publish) and (guix
> scripts substitute). We should extract the relevant bits and move them
> to new modules explicitly meant for more general consumption. I think
> it’s important to reduce duplication.

Yeah, that would be good.

>> Another feature supported by the Guix Build Coordinator is retries. If a
>> build fails, the Guix Build Coordinator can automatically retry it. In a
>> perfect world, everything would succeed first time, but because the
>> world isn't perfect, there still can be intermittent build
>> failures. Retrying failed builds even once can help reduce the chance
>> that a failure leads to no substitutes for that build as well as any
>> builds that depend on that output.
>
> That’s nice too; it’s one of the practical issues we have with Cuirass
> and that’s tempting to ignore because “hey it’s all functional!”, but
> then reality gets in the way.

One further benefit related to this is that if you want to manually
retry building a derivation, you just submit a new build for that
derivation.
The Guix Build Coordinator also has no concept of "Failed (dependency)";
it never gives up. This avoids the situation where spurious failures
block other builds.

>> Because the build results don't end up in a store (they could, but as
>> set out above, not being in the store is a feature I think), you can't
>> use `guix gc` to get rid of old store entries/substitutes. I have some
>> ideas about what to implement to provide some kind of GC approach over a
>> bunch of nars + narinfos, but I haven't implemented anything yet.
>
> ‘guix publish’ has support for that via (guix cache), so if we could
> share code, that’d be great.

Guix publish does time-based deletion, based on when the files were
first created, right? If that works for people, that's fine I guess.

Personally, I'm thinking about GC as in: don't delete nar A if you want
to keep nar B, and nar B references nar A. It's perfectly possible that
someone could fetch nar B if you deleted nar A, but it's also possible
that someone couldn't, because of that missing substitute. Maybe I'm
overthinking this though?

The Cuirass + guix publish approach does something similar, because
Cuirass creates GC roots that expire. guix gc wouldn't delete a store
item if it's needed by something that's protected by a Cuirass-created
GC root.

Another complexity here that I didn't set out initially is that there
are places where the Guix Build Coordinator makes decisions based on the
belief that if its database says a build has succeeded for an output,
that output will be available. In a situation where a build needed an
output that had been successfully built, but then deleted, I think the
coordinator would get stuck forever trying that build, with it never
starting because of the missing store item. My thinking on this at the
moment is that maybe what you'd want to do is tell the Guix Build
Coordinator that you've deleted a store item and it's truly missing, but
that would complicate the setup to some degree.
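To sketch the reference-aware GC idea (hypothetical Python; as said
above, nothing like this is implemented yet): compute the closure of the
References listed in the narinfos, starting from the nars you want to
keep, and only ever delete nars outside that closure.

```python
def live_nars(roots, references):
    """Reference closure: every nar reachable from the roots.

    `references` maps a nar to the nars it refers to, as listed in
    the References field of its narinfo.
    """
    live, stack = set(), list(roots)
    while stack:
        item = stack.pop()
        if item not in live:
            live.add(item)
            stack.extend(references.get(item, ()))
    return live

def collectable(all_nars, roots, references):
    """Nars that can be deleted without breaking any kept nar."""
    return set(all_nars) - live_nars(roots, references)

# Keeping nar B keeps nar A too, since B references A; C is unreferenced.
refs = {"B": ["A"], "C": []}
deletable = collectable({"A", "B", "C"}, roots={"B"}, references=refs)
```

Choosing the roots (say, the nars fetched recently) would then be the
policy decision, separate from this mechanical closure computation.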
> One option would be to populate /var/cache/guix/publish and to let ‘guix
> publish’ serve it from there.

That's probably pretty easy to do; I haven't looked at the details
though.

>> There could be issues with the implementation… I'd like to think it's
>> relatively simple, but that doesn't mean there aren't issues. For some
>> reason or another, getting backtraces for exceptions rarely works. Most
>> of the time the coordinator tries to print a backtrace, the part of
>> Guile doing that raises an exception. I've managed to cause it to
>> segfault, through using SQLite incorrectly, which hasn't been obvious to
>> fix at least for me. Additionally, there are some places where I'm
>> fighting against bits of Guix, things like checking for substitutes
>> without caching, or substituting a derivation without starting to build
>> it.
>
> I haven’t yet watched your talk but I’ve watched Mathieu’s, where he
> admits to being concerned about the reliability of code involving Fibers
> and/or SQLite (which I can understand given his/our experience, although
> I’m maybe less pessimistic). What’s your experience, how do you feel
> about it?

The coordinator does use Fibers, plus a lot of different threads for
different things.

Regarding reliability, it's hard to say really. Given I set out to build
something that works across an (unreliable) network, I've built in
reliability through making sure things retry upon failure, among other
things. I definitely haven't chased any blocked fibers; although there
could be some of those lurking in the code, I might not have noticed
because it sorts itself out eventually.

One of the problems I did have recently was that some hooks would just
stop getting processed. Each type of hook has a thread, which just
checked if there were any events to process every second, and processed
any if there were.
I'm not sure what was wrong, but I changed the code to be smarter: be
triggered when new events are actually entered into the database, and
poll every so often just in case. I haven't seen hooks get stuck since
then, but what I'm trying to convey here is that I'm not quite sure how
to track down issues that occur in specific threads.

Another thing to mention here is that implementing support for
PostgreSQL through Guile Squee is still a thing I have in mind, and that
might be more appropriate for larger databases. It's still prone to the
fibers blocking problem, but at least it's harder to cause segfaults
with Squee compared to SQLite.
* Re: Thoughts on building things for substitutes and the Guix Build Coordinator

From: Ludovic Courtès @ 2020-11-20 10:12 UTC
To: Christopher Baines; +Cc: guix-devel

Hi Chris,

Christopher Baines <mail@cbaines.net> skribis:

>>>> Another feature supported by the Guix Build Coordinator is retries. If a
>>>> build fails, the Guix Build Coordinator can automatically retry it. In a
>>>> perfect world, everything would succeed first time, but because the
>>>> world isn't perfect, there still can be intermittent build
>>>> failures. Retrying failed builds even once can help reduce the chance
>>>> that a failure leads to no substitutes for that build as well as any
>>>> builds that depend on that output.
>>>
>>> That’s nice too; it’s one of the practical issues we have with Cuirass
>>> and that’s tempting to ignore because “hey it’s all functional!”, but
>>> then reality gets in the way.
>>
>> One further benefit related to this is that if you want to manually
>> retry building a derivation, you just submit a new build for that
>> derivation.
>>
>> The Guix Build Coordinator also has no concept of "Failed (dependency)",
>> it never gives up. This avoids the situation where spurious failures
>> block other builds.

I think there’s a balance to be found. Being able to retry is nice, but
“never giving up” is not: on a build farm, you could end up always
rebuilding the same derivation that in the end always fails, and that
can be a huge resource waste.

On berlin we run the daemon with ‘--cache-failures’. It’s not great
because, again, it prevents further builds altogether.

Which makes me think we could change the daemon to have a threshold: it
would maintain a derivation build failure count (instead of a Boolean)
and would only prevent rebuilds once a failure threshold has been
reached.
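Sketched in (hypothetical) Python, the idea is just a per-derivation
counter in place of the current Boolean failure flag:

```python
class FailureCache:
    """Toy model of a thresholded failure cache for the daemon.

    Unlike a boolean --cache-failures flag, a derivation only stops
    being rebuilt after it has failed `threshold` times.
    """

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = {}  # derivation -> failure count

    def record_failure(self, drv):
        self.failures[drv] = self.failures.get(drv, 0) + 1

    def should_build(self, drv):
        # Below the threshold, a past failure may have been intermittent,
        # so the build is still allowed.
        return self.failures.get(drv, 0) < self.threshold

cache = FailureCache(threshold=3)
cache.record_failure("foo.drv")
cache.record_failure("foo.drv")
worth_retrying = cache.should_build("foo.drv")  # two failures: still retry
cache.record_failure("foo.drv")
given_up = not cache.should_build("foo.drv")    # threshold reached: stop
```

This keeps the retry benefit for intermittent failures while bounding
the resources wasted on derivations that always fail.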
>>> Because the build results don't end up in a store (they could, but as
>>> set out above, not being in the store is a feature I think), you can't
>>> use `guix gc` to get rid of old store entries/substitutes. I have some
>>> ideas about what to implement to provide some kind of GC approach over a
>>> bunch of nars + narinfos, but I haven't implemented anything yet.
>>
>> ‘guix publish’ has support for that via (guix cache), so if we could
>> share code, that’d be great.
>
> Guix publish does time-based deletion, based on when the files were
> first created, right? If that works for people, that's fine I guess.

Yes, it’s based on the atime (don’t use “noatime”!), though (guix cache)
lets you implement other policies.

> Personally, I'm thinking about GC as in, don't delete nar A if you want
> to keep nar B, and nar B references nar A. It's perfectly possible that
> someone could fetch nar B if you deleted nar A, but it's also possible
> that someone couldn't because of that missing substitute. Maybe I'm
> overthinking this though?

I think you are. :-) ‘guix publish’ doesn’t do anything this fancy and
it works great. The reason is that clients typically always ask for
both A and B, thus the atime of A is the same as that of B.

> The Cuirass + guix publish approach does something similar, because
> Cuirass creates GC roots that expire. guix gc wouldn't delete a store
> item if it's needed by something that's protected by a Cuirass-created
> GC root.

Cuirass has a TTL on GC roots, which thus defines how long things remain
in the store; ‘publish’ has a TTL on nars, which defines how long nars
remain in its cache. The two are disconnected in fact.

> Another complexity here that I didn't set out initially is that there
> are places where the Guix Build Coordinator makes decisions based on the
> belief that if its database says a build has succeeded for an output,
> that output will be available.
> In a situation where a build needed an
> output that had been successfully built, but then deleted, I think the
> coordinator would get stuck forever trying that build, with it never
> starting because of the missing store item. My thinking on this at the
> moment is maybe what you'd want to do is tell the Guix Build Coordinator
> that you've deleted a store item and it's truly missing, but that would
> complicate the setup to some degree.

I think you’d just end up rebuilding it in that case, no?

>> I haven’t yet watched your talk but I’ve watched Mathieu’s, where he
>> admits to being concerned about the reliability of code involving Fibers
>> and/or SQLite (which I can understand given his/our experience, although
>> I’m maybe less pessimistic). What’s your experience, how do you feel
>> about it?
>
> The coordinator does use Fibers, plus a lot of different threads for
> different things.

Interesting, why are some things running in threads?

There also seems to be shared state in ‘create-work-queue’; why not use
message passing?

> Regarding reliability, it's hard to say really. Given I set out to build
> something that works across an (unreliable) network, I've built in
> reliability through making sure things retry upon failure among other
> things. I definitely haven't chased any blocked fibers, although there
> could be some of those lurking in the code, I might have not noticed
> because it sorts itself out eventually.

OK. Most of the issues we see now with offloading and Cuirass are
things you can only experience with a huge store, a large number of
build machines, and a lot of concurrent derivation builds. Perhaps you
are approaching this scale on your instance actually?

Mathieu experimented with the Coordinator on berlin. It would be nice
to see how it behaved there.

> One of the problems I did have recently was that some hooks would just
> stop getting processed.
> Each type of hook has a thread, which just
> checked if there were any events to process every second, and processed
> any if there were. I'm not sure what was wrong, but I changed the code
> to be smarter, be triggered when new events are actually entered into
> the database, and poll every so often just in case. I haven't seen hooks
> get stuck since then, but what I'm trying to convey here is that I'm not
> quite sure how to track down issues that occur in specific threads.
>
> Another thing to mention here is that implementing support for
> PostgreSQL through Guile Squee is still a thing I have in mind, and that
> might be more appropriate for larger databases. It's still prone to the
> fibers blocking problem, but at least it's harder to cause segfaults
> with Squee compared to SQLite.

OK.

I find it really nice to have metrics built in, but I share Mathieu’s
concern about complexity here. If we’re already hitting scalability
issues with SQLite, then perhaps that’s a sign that metrics should be
handled separately.

Would it be an option to implement metrics gathering in a separate,
optional process, which would essentially subscribe to the relevant
hooks/events?

Thanks,
Ludo’.
* Re: Thoughts on building things for substitutes and the Guix Build Coordinator

From: Christopher Baines @ 2020-11-21 9:46 UTC
To: Ludovic Courtès; +Cc: guix-devel

Ludovic Courtès <ludo@gnu.org> writes:

> Hi Chris,
>
> Christopher Baines <mail@cbaines.net> skribis:
>
>>>> Another feature supported by the Guix Build Coordinator is retries. If a
>>>> build fails, the Guix Build Coordinator can automatically retry it. In a
>>>> perfect world, everything would succeed first time, but because the
>>>> world isn't perfect, there still can be intermittent build
>>>> failures. Retrying failed builds even once can help reduce the chance
>>>> that a failure leads to no substitutes for that build as well as any
>>>> builds that depend on that output.
>>>
>>> That’s nice too; it’s one of the practical issues we have with Cuirass
>>> and that’s tempting to ignore because “hey it’s all functional!”, but
>>> then reality gets in the way.
>>
>> One further benefit related to this is that if you want to manually
>> retry building a derivation, you just submit a new build for that
>> derivation.
>>
>> The Guix Build Coordinator also has no concept of "Failed (dependency)",
>> it never gives up. This avoids the situation where spurious failures
>> block other builds.
>
> I think there’s a balance to be found. Being able to retry is nice, but
> “never giving up” is not: on a build farm, you could end up always
> rebuilding the same derivation that in the end always fails, and that
> can be a huge resource waste.
>
> On berlin we run the daemon with ‘--cache-failures’. It’s not great
> because, again, it prevents further builds altogether.
> Which makes me think we could change the daemon to have a threshold: it
> would maintain a derivation build failure count (instead of a Boolean)
> and would only prevent rebuilds once a failure threshold has been
> reached.

So my comment about not giving up here was specifically about
derivations that cannot be built because inputs are missing, which
Cuirass describes as "Failed (dependency)". There's no wasted time spent
building them, since the builds aren't and cannot be attempted.

>>>> Because the build results don't end up in a store (they could, but as
>>>> set out above, not being in the store is a feature I think), you can't
>>>> use `guix gc` to get rid of old store entries/substitutes. I have some
>>>> ideas about what to implement to provide some kind of GC approach over a
>>>> bunch of nars + narinfos, but I haven't implemented anything yet.
>>>
>>> ‘guix publish’ has support for that via (guix cache), so if we could
>>> share code, that’d be great.
>>
>> Guix publish does time-based deletion, based on when the files were
>> first created, right? If that works for people, that's fine I guess.
>
> Yes, it’s based on the atime (don’t use “noatime”!), though (guix cache)
> lets you implement other policies.
>
>> Personally, I'm thinking about GC as in, don't delete nar A if you want
>> to keep nar B, and nar B references nar A. It's perfectly possible that
>> someone could fetch nar B if you deleted nar A, but it's also possible
>> that someone couldn't because of that missing substitute. Maybe I'm
>> overthinking this though?
>
> I think you are. :-) ‘guix publish’ doesn’t do anything this fancy and
> it works great. The reason is that clients typically always ask for
> both A and B, thus the atime of A is the same as that of B.

Ok, the atime thing kind of makes sense.

>> The Cuirass + guix publish approach does something similar, because
>> Cuirass creates GC roots that expire.
>> guix gc wouldn't delete a store
>> item if it's needed by something that's protected by a Cuirass-created
>> GC root.
>
> Cuirass has a TTL on GC roots, which thus defines how long things remain
> in the store; ‘publish’ has a TTL on nars, which defines how long nars
> remain in its cache. The two are disconnected in fact.

I'm seeing the connection in that guix publish populates its cache from
the store, and things cannot be removed from the store in an
inconsistent manner; you always have to have the dependencies.

>> Another complexity here that I didn't set out initially, is that there
>> are places the Guix Build Coordinator makes decisions based on the
>> belief that if its database says a build has succeeded for an output,
>> that output will be available. In a situation where a build needed an
>> output that had been successfully built, but then deleted, I think the
>> coordinator would get stuck forever trying that build and it not
>> starting because of the missing store item. My thinking on this at the
>> moment is maybe what you'd want to do is tell the Guix Build Coordinator
>> that you've deleted a store item and it's truly missing, but that would
>> complicate the setup to some degree.
>
> I think you’d just end up rebuilding it in that case, no?

Well, the current implementation of the build-missing-inputs hook
assumes an output is available if there's been a successful build, so it
wouldn't think to actually build that thing again. I've been thinking
about this, and I think the approach to take might be to make that hook
configurable so that it can actually check the availability of outputs.
This way it can do less guessing based on whether things have been built
in the past.

>>> I haven’t yet watched your talk but I’ve watched Mathieu’s, where he
>>> admits to being concerned about the reliability of code involving Fibers
>>> and/or SQLite (which I can understand given his/our experience, although
>>> I’m maybe less pessimistic).
>>> What’s your experience, how do you feel
>>> about it?
>>
>> The coordinator does use Fibers, plus a lot of different threads for
>> different things.
>
> Interesting, why are some things running in threads?

So, there's a thread for writing to the SQLite database, and there's a
pool of threads for reading from it. Each hook has an associated thread
which processes those hook events. The allocator runs in its own
thread. There's also a pool of threads for reading the chunked parts of
HTTP requests (which is probably/hopefully unnecessary), as well as a
pool of threads for talking to the guix-daemon to fetch substitutes.
Plus, fibers creates threads to run fibers.

> There also seems to be shared state in ‘create-work-queue’; why not use
> message passing?

So, create-work-queue is just used on the agents, and they don't run
fibers. I've also actually separated out the code in such a way that the
agents can run without fibers even being available, which should enable
the agent process to run on the Hurd (as fibers doesn't work there yet).

>> Regarding reliability, it's hard to say really. Given I set out to build
>> something that works across an (unreliable) network, I've built in
>> reliability through making sure things retry upon failure among other
>> things. I definitely haven't chased any blocked fibers, although there
>> could be some of those lurking in the code, I might have not noticed
>> because it sorts itself out eventually.
>
> OK. Most of the issues we see now with offloading and Cuirass are
> things you can only experience with a huge store, a large number of
> build machines, and a lot of concurrent derivation builds. Perhaps you
> are approaching this scale on your instance actually?
>
> Mathieu experimented with the Coordinator on berlin. It would be nice
> to see how it behaved there.
>
>> One of the problems I did have recently was that some hooks would just
>> stop getting processed.
>> Each type of hook has a thread, which just checked if there were any
>> events to process every second, and processed any if there were.  I'm
>> not sure what was wrong, but I changed the code to be smarter: be
>> triggered when new events are actually entered into the database, and
>> poll every so often just in case.  I haven't seen hooks get stuck
>> since then, but what I'm trying to convey here is that I'm not quite
>> sure how to track down issues that occur in specific threads.
>>
>> Another thing to mention here is that implementing support for
>> PostgreSQL through Guile Squee is still a thing I have in mind, and
>> that might be more appropriate for larger databases.  It's still
>> prone to the fibers blocking problem, but at least it's harder to
>> cause segfaults with Squee compared to SQLite.
>
> OK.
>
> I find it really nice to have metrics built in, but I share Mathieu’s
> concern about complexity here.  If we’re already hitting scalability
> issues with SQLite, then perhaps that’s a sign that metrics should be
> handled separately.

I think SQLite is actually doing OK.  The database for guix.cbaines.net
is 18G in size, which is larger than it needs to be I think, but there
are improvements that I can make to reduce that.  I'm pretty happy with
the performance as well at the moment.

> Would it be an option to implement metrics gathering in a separate,
> optional process, which would essentially subscribe to the relevant
> hooks/events?

So, this is kind of what already happens.  I wrote a Guile library for
Prometheus-style metrics as I was writing the Guix Build Coordinator
[1].  The coordinator uses this, and you can see the metrics data here
[2] for example.  The endpoint is far too slow, I need to fix that, but
it does work.
1: https://git.cbaines.net/guile/prometheus/
2: https://coordinator.guix.cbaines.net/metrics

Now, you can get some information just by reading that page, but it gets
more useful and more readable when you have Prometheus regularly scrape
and record those metrics, so you can graph them over time.
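For readers unfamiliar with it, a Prometheus-style /metrics page is just
plain text: `# HELP` and `# TYPE` comment lines followed by name–value
samples, which the Prometheus server scrapes on a schedule.  Here is a
minimal Python sketch of rendering that format; the metric names are
invented for illustration and are not the coordinator's actual metrics:

```python
# Minimal sketch of the Prometheus text exposition format, as served on
# a /metrics endpoint.  Metric names below are hypothetical examples.

def render_metrics(counters, gauges):
    """Render counters and gauges in the Prometheus text format.

    Each argument maps a metric name to a (help-text, value) pair."""
    lines = []
    for kind, metrics in (("counter", counters), ("gauge", gauges)):
        for name, (help_text, value) in metrics.items():
            lines.append(f"# HELP {name} {help_text}")
            lines.append(f"# TYPE {name} {kind}")
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

print(render_metrics(
    {"builds_completed_total": ("Builds completed since startup", 1234)},
    {"queued_builds": ("Builds currently waiting for an agent", 56)},
))
```

A scraper records these samples over time, which is what makes graphing
them (e.g. in Grafana) possible.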
* Thoughts on CI (was: Thoughts on building things for substitutes and the Guix Build Coordinator)
  2020-11-17 20:45 Thoughts on building things for substitutes and the Guix Build Coordinator Christopher Baines
  2020-11-17 22:10 ` Ludovic Courtès
@ 2020-11-23 22:56 ` zimoun
  2020-11-24  9:42 ` Christopher Baines
  1 sibling, 1 reply; 7+ messages in thread
From: zimoun @ 2020-11-23 22:56 UTC (permalink / raw)
To: Christopher Baines, guix-devel, Ludovic Courtès, Mathieu Othacehe

Hi,

(Disclaimer: I am biased, since I have been Mathieu’s rubber duck [1]
for his “new CI design” presented in his talk, and I have read his first
experimental implementations.)

1: <https://en.wikipedia.org/wiki/Rubber_duck_debugging>

Thank you Chris for this detailed email and your inspiring talk.  Really
interesting!  The discussion has been really fruitful, at least for me.

From my understanding, the Guix Build Coordinator is designed to
distribute the workload in a heterogeneous context (distant machines).
IIUC, the design of the GBC could implement some of Andreas’s ideas,
because the GBC is designed to support unreliable networks and it even
has an experimental trust mechanism for the workers.  The un-queueing
algorithm implemented in the GBC is not clear to me; it appears to be
“work stealing”, but I have not read the code.

Mathieu’s offload is designed for a cluster, with the architecture of
Berlin in mind, reusing as much as possible the existing parts of Guix.

Since Berlin is a cluster, the workers are already trusted, so Avahi
allows discovering them; adding and removing machines should be
hot-swappable, without reconfiguration involved.  In other words, the
controller/coordinator (master) does not need the list of workers.
That’s one of the dynamic parts.

The second dynamic part is “work stealing”, and to do so, ZeroMQ is used
both for communication and for un-queueing (work stealing).
This library is used because it allows focusing on the design, avoiding
reimplementing the scheduling strategy and probable bugs from using
Fibers to communicate.  Well, that’s how I understand it.

For sure, the ’guile-simple-zmq’ wrapper is not bullet-proof; but it is
simply a wrapper, and ZeroMQ is well-tested, AFAIK.  Well, we could
imagine replacing this ZMQ library, in a second step, with Fibers plus a
Scheme reimplementation of the scheduling strategy, once the design is a
bit tested.

> I've been using the Guix Build Coordinator to build substitutes for
> guix.cbaines.net, which is my testing ground for providing
> substitutes.  I think it's working reasonably well.

What is the configuration of this machine?  Size of the store?  Number
of workers where the agents are running?

> The Guix Build Coordinator supports prioritisation of builds.  You can
> assign a priority to builds, and it'll try to order builds in such a
> way that the higher priority builds get processed first.  If the aim
> is to serve substitutes, doing some prioritisation might help building
> the most fetched things first.

This is really cool!  How does it work?  Do you manually tag some
specific derivations?

> Another feature supported by the Guix Build Coordinator is retries.
> If a build fails, the Guix Build Coordinator can automatically retry
> it.  In a
[…]
> perfect world, everything would succeed first time, but because the
> world isn't perfect, there still can be intermittent build failures.
> Retrying failed builds even once can help reduce the chance that a
> failure leads to no substitutes for that build, as well as any builds
> that depend on that output.

Yeah, something in the current infrastructure is lacking to distinguish
between an error (“the build completed but returned an error”) and a
failure (“something along the way went wrong”).
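The bounded-retry behaviour quoted above can be sketched as a small
wrapper that retries only failures around the build (the kind worth
retrying), not builds that completed with a real error.  This is a
hypothetical Python illustration, not the coordinator's actual API; the
exception class name is invented:

```python
# Illustrative sketch of retrying intermittent build failures a bounded
# number of times.  The exception class and function names are
# hypothetical, not taken from the Guix Build Coordinator.

class IntermittentFailure(Exception):
    """Something went wrong around the build (network, agent died, ...)."""

def run_with_retries(build, attempts=2):
    """Run `build` (a zero-argument callable), retrying intermittent
    failures up to `attempts` times.  A build that completes but reports
    a real build error would be returned as-is, not retried."""
    last = None
    for _ in range(attempts):
        try:
            return build()
        except IntermittentFailure as exc:
            last = exc  # transient failure: worth retrying
    raise last

calls = []
def flaky():
    # Fails once with a transient error, then succeeds.
    calls.append(1)
    if len(calls) < 2:
        raise IntermittentFailure("agent connection dropped")
    return "success"

print(run_with_retries(flaky))
```

The point of the error/failure distinction is exactly this: retrying a
deterministic build error wastes work, while retrying a transient
failure often recovers the substitute.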
> Because the build results don't end up in a store (they could, but as
> set out above, not being in the store is a feature I think), you can't
> use `guix gc` to get rid of old store entries/substitutes.  I have
> some ideas about what to implement to provide some kind of GC approach
> over a bunch of nars + narinfos, but I haven't implemented anything
> yet.

Where do they end up, then?  I missed your answer in the Question/Answer
session.

Speaking about Berlin, the builds should be in the workers’ stores (with
a GC policy to be defined; keep them for debugging concerns?) and the
main store should have only the minimum.  The items should really only
be stored in the cache of publish, IMHO.

Maybe I am missing something.

> There could be issues with the implementation… I'd like to think it's
> relatively simple, but that doesn't mean there aren't issues.  For
> some
[…]
> reason or another, getting backtraces for exceptions rarely works.
> Most of the time the coordinator tries to print a backtrace, the part
> of Guile doing that raises an exception.  I've managed to cause it to
> segfault, through using SQLite incorrectly, which hasn't been obvious
> to fix, at least for me.  Additionally, there are some places where
> I'm fighting against bits of Guix, things like checking for
> substitutes without caching, or substituting a derivation without
> starting to build it.

I am confused by all the SQL involved.  And I feel it is hard to
maintain when scaling at large.  I do not know; I am a newbie.

> Finally, the instrumentation is somewhat reliant on Prometheus, and if
> you want a pretty dashboard, then you might need Grafana too.  Both of
> these things aren't packaged for Guix; Prometheus might be feasible to
> package within the next few months, but I doubt the same is true for
> Grafana (due to the use of NPM).

Really cool!  For sure, knowing whether it is healthy (or not) is really
nice.

Cheers,
simon
* Re: Thoughts on CI (was: Thoughts on building things for substitutes and the Guix Build Coordinator)
  2020-11-23 22:56 ` Thoughts on CI (was: Thoughts on building things for substitutes and the Guix Build Coordinator) zimoun
@ 2020-11-24  9:42 ` Christopher Baines
  0 siblings, 0 replies; 7+ messages in thread
From: Christopher Baines @ 2020-11-24 9:42 UTC (permalink / raw)
To: zimoun; +Cc: guix-devel, Mathieu Othacehe

zimoun <zimon.toutoune@gmail.com> writes:

> Hi,
>
> (Disclaimer: I am biased, since I have been Mathieu’s rubber duck [1]
> for his “new CI design” presented in his talk, and I have read his
> first experimental implementations.)
>
> 1: <https://en.wikipedia.org/wiki/Rubber_duck_debugging>
>
> Thank you Chris for this detailed email and your inspiring talk.
> Really interesting!  The discussion has been really fruitful, at least
> for me.

You're welcome :)

> From my understanding, the Guix Build Coordinator is designed to
> distribute the workload in a heterogeneous context (distant machines).

It's not specific to similar or dissimilar machines, although I was
wanting something that would work in both situations.

> IIUC, the design of the GBC could implement some of Andreas’s ideas,
> because the GBC is designed to support unreliable networks and it even
> has an experimental trust mechanism for the workers.  The un-queueing
> algorithm implemented in the GBC is not clear to me; it appears to be
> “work stealing”, but I have not read the code.
>
> Mathieu’s offload is designed for a cluster, with the architecture of
> Berlin in mind, reusing as much as possible the existing parts of
> Guix.
>
> Since Berlin is a cluster, the workers are already trusted, so Avahi
> allows discovering them; adding and removing machines should be
> hot-swappable, without reconfiguration involved.  In other words, the
> controller/coordinator (master) does not need the list of workers.
> That’s one of the dynamic parts.
Yeah, I think that's a really nice feature.  Although I actually see
more use for that if I were trying to perform builds across machines in
my house, since they come and go more than machines in a datacentre.

> The second dynamic part is “work stealing”, and to do so, ZeroMQ is
> used both for communication and for un-queueing (work stealing).  This
> library is used because it allows focusing on the design, avoiding
> reimplementing the scheduling strategy and probable bugs from using
> Fibers to communicate.  Well, that’s how I understand it.
>
> For sure, the ’guile-simple-zmq’ wrapper is not bullet-proof; but it
> is simply a wrapper, and ZeroMQ is well-tested, AFAIK.  Well, we could
> imagine replacing this ZMQ library, in a second step, with Fibers plus
> a Scheme reimplementation of the scheduling strategy, once the design
> is a bit tested.

>> I've been using the Guix Build Coordinator to build substitutes for
>> guix.cbaines.net, which is my testing ground for providing
>> substitutes.  I think it's working reasonably well.
>
> What is the configuration of this machine?  Size of the store?  Number
> of workers where the agents are running?

It has varied; at the moment I'm using 1 small virtual machine for the
coordinator, plus 4 small virtual machines to build packages (3 of these
I run for other reasons as well).  I'm also using my desktop computer to
build packages, which is a much faster machine, but it's only building
packages when I'm using it for other things.

So 4 or 5 machines running the agents, plus 1 machine for the
coordinator itself.  They all have small stores.  For guix.cbaines.net,
I'm using an S3 service (wasabi.com) for the nar storage, which
currently has 1.4TB of nars.

>> The Guix Build Coordinator supports prioritisation of builds.  You
>> can assign a priority to builds, and it'll try to order builds in
>> such a way that the higher priority builds get processed first.
>> If the aim is to serve substitutes, doing some prioritisation might
>> help building the most fetched things first.
>
> This is really cool!  How does it work?  Do you manually tag some
> specific derivations?

The logic for assigning priorities is currently very simple for the
guix.cbaines.net builds; it's here [1].  The channel instance (guix pull
related) derivations are prioritised over packages, with x86_64-linux
being prioritised over other architectures.  For packages, x86_64-linux
is prioritised over other architectures.

1: https://git.cbaines.net/guix/build-coordinator/tree/scripts/guix-build-coordinator-queue-builds-from-guix-data-service.in#n174

In the future, it might be good to try and work out what packages are
more popular or harder to build, and prioritise more along those lines.

>> Because the build results don't end up in a store (they could, but as
>> set out above, not being in the store is a feature I think), you
>> can't use `guix gc` to get rid of old store entries/substitutes.  I
>> have some ideas about what to implement to provide some kind of GC
>> approach over a bunch of nars + narinfos, but I haven't implemented
>> anything yet.
>
> Where do they end up, then?  I missed your answer in the
> Question/Answer session.

That's something as a user that you have to decide when configuring the
service.  The simplest option is just to have a directory on the machine
running the coordinator, where the narinfo and nar files get put.  You
then serve this using NGinx/Apache-HTTPD/...

For guix.cbaines.net, I don't want to pay for a server with lots of
storage; it's cheaper to pay wasabi.com in this case to just store the
files and do the web serving.  There's code available to use an S3
service for storage, and it's not difficult to do similar things.

> Speaking about Berlin, the builds should be in the workers’ stores
> (with a GC policy to be defined; keep them for debugging concerns?)
> and the main store should have only the minimum.
> The items should really only be stored in the cache of publish, IMHO.
>
> Maybe I am missing something.

That sounds OK.  The main disadvantage of worker machines performing
garbage collection is that they might have to re-download some stuff to
perform new builds, so you want to have a good mix of local store items,
plus plenty of space.

At least the way I've been using the Guix Build Coordinator, there is no
main store.  The store on the machine used for the coordinator process
tends to fill up with derivations, but that's about it, and that happens
very slowly.  I also haven't tested trying to mix guix publish and the
Guix Build Coordinator; it could probably work though.

>> There could be issues with the implementation… I'd like to think it's
>> relatively simple, but that doesn't mean there aren't issues.  For
>> some
> […]
>> reason or another, getting backtraces for exceptions rarely works.
>> Most of the time the coordinator tries to print a backtrace, the part
>> of Guile doing that raises an exception.  I've managed to cause it to
>> segfault, through using SQLite incorrectly, which hasn't been obvious
>> to fix, at least for me.  Additionally, there are some places where
>> I'm fighting against bits of Guix, things like checking for
>> substitutes without caching, or substituting a derivation without
>> starting to build it.
>
> I am confused by all the SQL involved.  And I feel it is hard to
> maintain when scaling at large.  I do not know; I am a newbie.

While I quite like SQL, I also do like stateless things, or keeping as
little state as possible.

I think the database size is one aspect of scaling that could do with a
bit more work.  There are also bits of the Guix Build Coordinator that
are serialised, like the processing of hook events, and that can be a
bottleneck when the build throughput is quite high.

While I've had trouble with SQLite, and guile-squee in the Guix Data
Service, things are seeming pretty stable now with both projects.
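The hook-processing fix described earlier in the thread — a dedicated
thread per hook type that wakes when new events are entered, with a
periodic poll as a safety net — can be sketched with a condition
variable.  This is a hypothetical Python illustration of the pattern,
not the coordinator's actual Guile code:

```python
# Sketch of event-driven hook processing with a fallback poll: the
# worker thread wakes when notified of new events, but also wakes every
# poll_interval seconds in case a notification was missed.  Names and
# structure are hypothetical, for illustration only.

import threading

class HookProcessor:
    def __init__(self, poll_interval=5.0):
        self.events = []
        self.cond = threading.Condition()
        self.poll_interval = poll_interval
        self.processed = []
        self.stopping = False

    def enqueue(self, event):
        with self.cond:
            self.events.append(event)
            self.cond.notify()  # wake the worker immediately

    def run(self):
        while True:
            with self.cond:
                if not self.events and not self.stopping:
                    # Wake on notify, or after poll_interval as a
                    # safety net against missed notifications.
                    self.cond.wait(timeout=self.poll_interval)
                batch, self.events = self.events, []
                if self.stopping and not batch:
                    return
            for event in batch:
                self.processed.append(event)  # the hook would run here

    def shutdown(self):
        with self.cond:
            self.stopping = True
            self.cond.notify()

p = HookProcessor(poll_interval=0.1)
worker = threading.Thread(target=p.run)
worker.start()
p.enqueue("build-success")
p.enqueue("build-failure")
p.shutdown()
worker.join()
print(p.processed)
```

Because all events go through one worker per hook type, processing stays
serialised, which is also why it can become a bottleneck at high build
throughput, as noted above.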
>> Finally, the instrumentation is somewhat reliant on Prometheus, and
>> if you want a pretty dashboard, then you might need Grafana too.
>> Both of these things aren't packaged for Guix; Prometheus might be
>> feasible to package within the next few months, but I doubt the same
>> is true for Grafana (due to the use of NPM).
>
> Really cool!  For sure, knowing whether it is healthy (or not) is
> really nice.

Hope this helps, just let me know if you have any more comments or
questions :)

Chris
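The prioritisation logic discussed in this thread — channel-instance
(guix pull) derivations before packages, x86_64-linux before other
architectures — boils down to ordering the queue by a compound sort key.
A hypothetical Python sketch, with invented field names and build
records standing in for the coordinator's actual data:

```python
# Sketch of the build prioritisation described in the thread:
# channel-instance derivations before packages, x86_64-linux before
# other architectures.  Field names and records are made up; only the
# ordering matters.

def priority(build):
    """Return a sort key: lower tuples are processed first."""
    kind_rank = 0 if build["kind"] == "channel-instance" else 1
    arch_rank = 0 if build["system"] == "x86_64-linux" else 1
    return (kind_rank, arch_rank)

queue = [
    {"id": 1, "kind": "package",          "system": "aarch64-linux"},
    {"id": 2, "kind": "channel-instance", "system": "x86_64-linux"},
    {"id": 3, "kind": "package",          "system": "x86_64-linux"},
    {"id": 4, "kind": "channel-instance", "system": "aarch64-linux"},
]
ordered = sorted(queue, key=priority)
print([b["id"] for b in ordered])  # → [2, 4, 3, 1]
```

Extending this toward "most fetched first" would just mean adding a
popularity term to the key, which matches the future direction mentioned
in the reply above.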
Thread overview: 7+ messages (newest: 2020-11-24 9:43 UTC)
2020-11-17 20:45 Thoughts on building things for substitutes and the Guix Build Coordinator Christopher Baines
2020-11-17 22:10 ` Ludovic Courtès
2020-11-18  7:56 ` Christopher Baines
2020-11-20 10:12 ` Ludovic Courtès
2020-11-21  9:46 ` Christopher Baines
2020-11-23 22:56 ` Thoughts on CI (was: Thoughts on building things for substitutes and the Guix Build Coordinator) zimoun
2020-11-24  9:42 ` Christopher Baines