Ludovic Courtès writes:

> Hi Chris,
>
> Christopher Baines skribis:
>
>>>> Another feature supported by the Guix Build Coordinator is retries.
>>>> If a build fails, the Guix Build Coordinator can automatically retry
>>>> it. In a perfect world, everything would succeed first time, but
>>>> because the world isn't perfect, there can still be intermittent
>>>> build failures. Retrying failed builds even once can help reduce the
>>>> chance that a failure leads to no substitutes for that build, as
>>>> well as for any builds that depend on that output.
>>>
>>> That’s nice too; it’s one of the practical issues we have with Cuirass
>>> and that’s tempting to ignore because “hey it’s all functional!”, but
>>> then reality gets in the way.
>>
>> One further benefit related to this is that if you want to manually
>> retry building a derivation, you just submit a new build for that
>> derivation.
>>
>> The Guix Build Coordinator also has no concept of "Failed
>> (dependency)"; it never gives up. This avoids the situation where
>> spurious failures block other builds.
>
> I think there’s a balance to be found. Being able to retry is nice, but
> “never giving up” is not: on a build farm, you could end up always
> rebuilding the same derivation that in the end always fails, and that
> can be a huge resource waste.
>
> On berlin we run the daemon with ‘--cache-failures’. It’s not great
> because, again, it prevents further builds altogether.
>
> Which makes me think we could change the daemon to have a threshold: it
> would maintain a derivation build failure count (instead of a Boolean)
> and would only prevent rebuilds once a failure threshold has been
> reached.

So my comment about not giving up here was specifically about
derivations that cannot be built because their inputs are missing, which
Cuirass describes as "Failed (dependency)". No time is wasted on these,
since the builds aren't, and cannot be, attempted.
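That said, to make the threshold idea concrete, a rough sketch of the
behaviour might look like this (hypothetical names; the daemon itself is
written in C++, so the Scheme here is purely illustrative):

  ;; Sketch only: keep a per-derivation failure count and a threshold,
  ;; rather than the Boolean that '--cache-failures' effectively keeps.
  (define %failure-threshold 3)             ; hypothetical cut-off

  (define failure-counts (make-hash-table))

  (define (record-build-failure! drv)
    (hash-set! failure-counts drv
               (+ 1 (hash-ref failure-counts drv 0))))

  (define (cached-failure? drv)
    ;; Only refuse to rebuild DRV once it has failed
    ;; %failure-threshold times, instead of after its first failure.
    (>= (hash-ref failure-counts drv 0) %failure-threshold))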
>>>> Because the build results don't end up in a store (they could, but
>>>> as set out above, not being in the store is a feature I think), you
>>>> can't use `guix gc` to get rid of old store entries/substitutes. I
>>>> have some ideas about what to implement to provide some kind of GC
>>>> approach over a bunch of nars + narinfos, but I haven't implemented
>>>> anything yet.
>>>
>>> ‘guix publish’ has support for that via (guix cache), so if we could
>>> share code, that’d be great.
>>
>> Guix publish does time-based deletion, based on when the files were
>> first created, right? If that works for people, that's fine I guess.
>
> Yes, it’s based on the atime (don’t use “noatime”!), though (guix cache)
> lets you implement other policies.
>
>> Personally, I'm thinking about GC as in: don't delete nar A if you
>> want to keep nar B, and nar B references nar A. It's perfectly
>> possible that someone could fetch nar B if you deleted nar A, but it's
>> also possible that someone couldn't because of that missing
>> substitute. Maybe I'm overthinking this though?
>
> I think you are. :-) ‘guix publish’ doesn’t do anything this fancy and
> it works great. The reason is that clients typically always ask for
> both A and B, thus the atime of A is the same as that of B.

OK, the atime thing kind of makes sense.

>> The Cuirass + guix publish approach does something similar, because
>> Cuirass creates GC roots that expire. guix gc wouldn't delete a store
>> item if it's needed by something that's protected by a Cuirass-created
>> GC root.
>
> Cuirass has a TTL on GC roots, which thus defines how long things
> remain in the store; ‘publish’ has a TTL on nars, which defines how
> long nars remain in its cache. The two are disconnected in fact.

I see the connection as being that guix publish populates its cache from
the store, and things cannot be removed from the store in an
inconsistent manner: you always have to have the dependencies.

>> Another complexity here that I didn't set out initially is that there
>> are places where the Guix Build Coordinator makes decisions based on
>> the belief that if its database says a build has succeeded for an
>> output, that output will be available. In a situation where a build
>> needed an output that had been successfully built, but then deleted, I
>> think the coordinator would get stuck forever trying that build, which
>> would never start because of the missing store item. My thinking on
>> this at the moment is that maybe you'd want to tell the Guix Build
>> Coordinator that you've deleted a store item and it's truly missing,
>> but that would complicate the setup to some degree.
>
> I think you’d just end up rebuilding it in that case, no?

Well, the current implementation of the build-missing-inputs hook
assumes an output is available if there's been a successful build, so it
wouldn't think to actually build that thing again.

I've been thinking about this, and I think the approach to take might be
to make that hook configurable so that it can actually check the
availability of outputs. That way it can do less guessing based on
whether things have been built in the past.
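To illustrate what such a check could look like, the hook might ask a
substitute server for the narinfo of each output. This is only a sketch:
the procedure names are made up, and it isn't how the coordinator
currently behaves.

  (use-modules (ice-9 receive)
               (web client)
               (web response)
               (web uri))

  ;; Does SERVER have a substitute for the store item PATH? The first
  ;; 32 characters of the store file name are the hash that narinfo
  ;; URLs are keyed on.
  (define (output-available? server path)
    (let ((uri (string->uri
                (string-append server "/"
                               (string-take (basename path) 32)
                               ".narinfo"))))
      (receive (response body)
          (http-get uri)
        (= 200 (response-code response)))))

  ;; Hypothetical variant of the current logic: an output counts as
  ;; missing unless some server actually has it, rather than being
  ;; assumed available because the database records a successful build.
  (define (missing-outputs outputs)
    (filter (lambda (path)
              (not (output-available? "https://guix.cbaines.net" path)))
            outputs))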
>>> I haven’t yet watched your talk but I’ve watched Mathieu’s, where he
>>> admits to being concerned about the reliability of code involving
>>> Fibers and/or SQLite (which I can understand given his/our
>>> experience, although I’m maybe less pessimistic). What’s your
>>> experience, how do you feel about it?
>>
>> The coordinator does use Fibers, plus a lot of different threads for
>> different things.
>
> Interesting, why are some things running in threads?

So, there's a thread for writing to the SQLite database, and there's a
pool of threads for reading from it. Each hook has an associated thread
which processes that hook's events. The allocator runs in its own
thread. There's also a pool of threads for reading the chunked parts of
HTTP requests (which is probably/hopefully unnecessary), as well as a
pool of threads for talking to the guix-daemon to fetch substitutes.
Plus Fibers creates threads to run fibers.

> There also seems to be shared state in ‘create-work-queue’; why not use
> message passing?

So, create-work-queue is just used on the agents, and they don't run
fibers. I've also separated out the code in such a way that the agents
can run without Fibers even being available, which should enable the
agent process to run on the Hurd (as Fibers doesn't work there yet).

>> Regarding reliability, it's hard to say really. Given that I set out
>> to build something that works across an (unreliable) network, I've
>> built in reliability by making sure things retry upon failure, among
>> other things. I definitely haven't chased any blocked fibers, although
>> there could be some of those lurking in the code; I might not have
>> noticed because things sort themselves out eventually.
>
> OK. Most of the issues we see now with offloading and Cuirass are
> things you can only experience with a huge store, a large number of
> build machines, and a lot of concurrent derivation builds. Perhaps you
> are approaching this scale on your instance actually?
>
> Mathieu experimented with the Coordinator on berlin. It would be nice
> to see how it behaved there.
>
>> One of the problems I did have recently was that some hooks would just
>> stop getting processed. Each type of hook has a thread, which checked
>> every second whether there were any events to process, and processed
>> any if there were. I'm not sure what was wrong, but I changed the code
>> to be smarter: it's now triggered when new events are actually entered
>> into the database, and polls every so often just in case. I haven't
>> seen hooks get stuck since then, but what I'm trying to convey here is
>> that I'm not quite sure how to track down issues that occur in
>> specific threads.
>>
>> Another thing to mention here is that implementing support for
>> PostgreSQL through Guile Squee is still a thing I have in mind, and
>> that might be more appropriate for larger databases. It's still prone
>> to the problem of blocking fibers, but at least it's harder to cause
>> segfaults with Squee compared to SQLite.
>
> OK.
>
> I find it really nice to have metrics built in, but I share Mathieu’s
> concern about complexity here. If we’re already hitting scalability
> issues with SQLite, then perhaps that’s a sign that metrics should be
> handled separately.

I think SQLite is actually doing OK. The database for guix.cbaines.net
is 18G in size, which is larger than it needs to be I think, but there
are improvements that I can make to reduce that. I'm pretty happy with
the performance at the moment as well.

> Would it be an option to implement metrics gathering in a separate,
> optional process, which would essentially subscribe to the relevant
> hooks/events?

So, this is kind of what already happens. I wrote a Guile library for
Prometheus-style metrics as I was writing the Guix Build Coordinator
[1]. The coordinator uses this, and you can see the metrics data here
[2] for example. The endpoint is far too slow, I need to fix that, but
it does work.

1: https://git.cbaines.net/guile/prometheus/
2: https://coordinator.guix.cbaines.net/metrics

Now, you can get some information just by reading that page, but it gets
more useful and more readable when you have Prometheus regularly scrape
and record those metrics, so you can graph them over time.
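For anyone who hasn't looked at such a page before, what Prometheus
scrapes is plain text in its exposition format, along these lines (the
metric names here are illustrative, not necessarily ones the
coordinator actually exports):

  # HELP build_count Number of builds, by current status.
  # TYPE build_count gauge
  build_count{status="succeeded"} 41123
  build_count{status="failed"} 128

Prometheus fetches a page like this on a regular interval and stores
each value as a time series, which is what makes graphing the data over
time straightforward.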