Ludovic Courtès writes:

> Christopher Baines skribis:
>
>> Because you aren't copying the store items back into a single store,
>> or serving substitutes from the store, you don't need to scale the
>> store to serve more substitutes. You've still got a bunch of nars +
>> narinfos to store, but I think that is an easier problem to tackle.
>
> Yes, this is good for the use case of providing substitutes and it
> would certainly help on a big build farm like berlin.
>
> I see a lot could be shared with (guix scripts publish) and (guix
> scripts substitute). We should extract the relevant bits and move them
> to new modules explicitly meant for more general consumption. I think
> it’s important to reduce duplication.

Yeah, that would be good.

>> Another feature supported by the Guix Build Coordinator is retries.
>> If a build fails, the Guix Build Coordinator can automatically retry
>> it. In a perfect world, everything would succeed first time, but
>> because the world isn't perfect, there can still be intermittent
>> build failures. Retrying failed builds even once can help reduce the
>> chance that a failure leads to no substitutes for that build, as well
>> as for any builds that depend on that output.
>
> That’s nice too; it’s one of the practical issues we have with Cuirass
> and that’s tempting to ignore because “hey it’s all functional!”, but
> then reality gets in the way.

One further benefit related to this is that if you want to manually
retry building a derivation, you just submit a new build for that
derivation.

The Guix Build Coordinator also has no concept of "Failed (dependency)";
it never gives up. This avoids the situation where spurious failures
block other builds.

>> Because the build results don't end up in a store (they could, but as
>> set out above, not being in the store is a feature I think), you
>> can't use `guix gc` to get rid of old store entries/substitutes.
>> I have some ideas about what to implement to provide some kind of GC
>> approach over a bunch of nars + narinfos, but I haven't implemented
>> anything yet.
>
> ‘guix publish’ has support for that via (guix cache), so if we could
> share code, that’d be great.

guix publish does time-based deletion, based on when the files were
first created, right? If that works for people, that's fine I guess.

Personally, I'm thinking about GC as in: don't delete nar A if you want
to keep nar B, and nar B references nar A. It's perfectly possible that
someone could fetch nar B even if you deleted nar A, but it's also
possible that someone couldn't, because of that missing substitute.
Maybe I'm overthinking this though?

The Cuirass + guix publish approach does something similar, because
Cuirass creates GC roots that expire. guix gc wouldn't delete a store
item if it's needed by something that's protected by a Cuirass-created
GC root.

Another complexity here that I didn't set out initially is that there
are places where the Guix Build Coordinator makes decisions based on
the belief that if its database says a build has succeeded for an
output, that output will be available. In a situation where a build
needed an output that had been successfully built but then deleted, I
think the coordinator would get stuck forever trying that build, with
it never starting because of the missing store item. My thinking at the
moment is that maybe what you'd want to do is tell the Guix Build
Coordinator that you've deleted a store item and it's truly missing,
but that would complicate the setup to some degree.

> One option would be to populate /var/cache/guix/publish and to let
> ‘guix publish’ serve it from there.

That's probably pretty easy to do; I haven't looked at the details
though.

>> There could be issues with the implementation… I'd like to think it's
>> relatively simple, but that doesn't mean there aren't issues. For
>> some reason or another, getting backtraces for exceptions rarely
>> works.
>> Most of the time, when the coordinator tries to print a backtrace,
>> the part of Guile doing that raises an exception. I've managed to
>> cause it to segfault, through using SQLite incorrectly, which hasn't
>> been obvious to fix, at least for me. Additionally, there are some
>> places where I'm fighting against bits of Guix, things like checking
>> for substitutes without caching, or substituting a derivation without
>> starting to build it.
>
> I haven’t yet watched your talk, but I’ve watched Mathieu’s, where he
> admits to being concerned about the reliability of code involving
> Fibers and/or SQLite (which I can understand given his/our experience,
> although I’m maybe less pessimistic). What’s your experience, how do
> you feel about it?

The coordinator does use Fibers, plus a lot of different threads for
different things.

Regarding reliability, it's hard to say really. Given that I set out to
build something that works across an (unreliable) network, I've built
in reliability by making sure things retry upon failure, among other
things. I definitely haven't chased any blocked fibers; there could be
some of those lurking in the code that I haven't noticed because things
sort themselves out eventually.

One of the problems I did have recently was that some hooks would just
stop getting processed. Each type of hook has a thread, which checked
every second whether there were any events to process, and processed
any if there were. I'm not sure what was wrong, but I changed the code
to be smarter: it's now triggered when new events are actually entered
into the database, and polls every so often just in case. I haven't
seen hooks get stuck since then, but what I'm trying to convey here is
that I'm not quite sure how to track down issues that occur in specific
threads.

Another thing to mention here is that implementing support for
PostgreSQL through Guile Squee is still a thing I have in mind, and
that might be more appropriate for larger databases. It's still prone
to the blocked-fibers problem, but at least it's harder to cause
segfaults with Squee than with SQLite.
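For what it's worth, the hook-processing scheme described above (wake
the worker when a new event arrives, but also poll on a timer just in
case a notification is missed) can be sketched roughly like this. This
is a hypothetical Python illustration, not the actual Guile code; the
in-memory queue stands in for the coordinator's events table, and all
the names are made up:

```python
import queue
import threading


class HookProcessor:
    """One worker thread per hook type: wakes on notification,
    and also polls every `poll_interval` seconds as a safety net."""

    def __init__(self, handler, poll_interval=30):
        self.handler = handler           # called once per pending event
        self.poll_interval = poll_interval
        self.events = queue.Queue()      # stand-in for the events table
        self.wakeup = threading.Event()
        self.running = True
        self.thread = threading.Thread(target=self._loop, daemon=True)
        self.thread.start()

    def enqueue(self, event):
        """Record a new event and wake the worker immediately."""
        self.events.put(event)
        self.wakeup.set()

    def _loop(self):
        while self.running:
            # Wake on notification, or after poll_interval "just in
            # case" a notification was missed.
            self.wakeup.wait(timeout=self.poll_interval)
            self.wakeup.clear()
            while not self.events.empty():
                self.handler(self.events.get())

    def stop(self):
        self.running = False
        self.wakeup.set()
        self.thread.join()
```

The point of the fallback timeout is that even if the notification path
fails for some reason, every pending event still gets processed within
one poll interval, rather than sitting in the queue forever.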