zimoun writes:

> Hi,
>
> (Disclaimer: I am biased, since I have been Mathieu's rubber duck [1]
> about his "new CI design" presented in his talk, and I have read his
> first experimental implementations.)
>
> 1:
>
> Thank you Chris for this detailed email and your inspiring talk.
> Really interesting! The discussion has been really fruitful, at least
> for me.

You're welcome :)

> From my understanding, the Guix Build Coordinator is designed to
> distribute the workload across heterogeneous contexts (distant
> machines).

It's not specific to similar or dissimilar machines, although I was
wanting something that would work in both situations.

> IIUC, the design of the GBC could implement some of Andreas's ideas,
> because the GBC is designed to support unreliable networks, and it
> even has an experimental trust mechanism for the workers. The
> un-queue'ing algorithm implemented in the GBC is not clear; it
> appears to be "work stealing", but I have not read the code.
>
> Mathieu's offload is designed for a cluster, with the architecture of
> Berlin in mind, reusing as much as possible the existing parts of
> Guix.
>
> Since Berlin is a cluster, the workers are already trusted, so Avahi
> allows discovering them; the addition/removal of machines should be
> hot swapping, without reconfiguration involved. In other words, the
> controller/coordinator (master) does not need the list of workers.
> That's one of the dynamic parts.

Yeah, I think that's a really nice feature. Although I actually see
more use for that if I was trying to perform builds across machines in
my house, since they come and go more than machines in a datacentre.

> The second dynamic part is "work stealing", and to do so, ZeroMQ is
> used both for communication and for un-queue'ing (work stealing).
> This library is used because it allows focusing on the design,
> avoiding reimplementing the scheduling strategy, and probably
> avoiding bugs with using Fibers to communicate. Well, that's how I
> understand it.
>
> For sure, the 'guile-simple-zmq' wrapper is not bullet-proof; but it
> is simply a wrapper, and ZeroMQ is well-tested, AFAIK. Well, we could
> imagine replacing this ZMQ library in a second step with Fibers plus
> a Scheme reimplementation of the scheduling strategy, once the design
> is a bit tested.
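Just to make the "work stealing" idea concrete: with ZeroMQ you can
get it almost for free, by having each worker request a job only when
it is idle, so faster workers naturally ask more often. Here's a rough
sketch of what a worker loop could look like; this isn't Mathieu's
code, the guile-simple-zmq procedure names are from memory and may not
match the actual API, and build-derivation! is a made-up stand-in:

  ;; Hypothetical worker-side loop, REQ/REP style: the worker asks the
  ;; coordinator for work only when idle.  Procedure names and
  ;; signatures here are assumptions about guile-simple-zmq.
  (use-modules (simple-zmq))

  (define context (zmq-create-context))
  (define socket (zmq-create-socket context ZMQ_REQ))

  ;; "coordinator.local" is a placeholder address.
  (zmq-connect socket "tcp://coordinator.local:5555")

  (let loop ()
    ;; Asking only when idle is what makes this "work stealing": the
    ;; scheduling emerges from the socket pattern, rather than from a
    ;; hand-written queue of worker states.
    (zmq-send socket "ready")
    (let ((drv (zmq-receive socket 1024)))
      (build-derivation! drv)  ; hypothetical build procedure
      (loop)))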
>> I've been using the Guix Build Coordinator to build substitutes for
>> guix.cbaines.net, which is my testing ground for providing
>> substitutes. I think it's working reasonably well.
>
> What is the configuration of this machine? Size of the store? Number
> of workers where the agents are running?

It has varied. At the moment I'm using 1 small virtual machine for the
coordinator, plus 4 small virtual machines to build packages (3 of
these I run for other reasons as well). I'm also using my desktop
computer to build packages, which is a much faster machine, but it's
only building packages when I'm using it for other things.

So 4 or 5 machines running the agents, plus 1 machine for the
coordinator itself. They all have small stores. For guix.cbaines.net,
I'm using an S3 service (wasabi.com) for the nar storage, which
currently has 1.4TB of nars.

>> The Guix Build Coordinator supports prioritisation of builds. You
>> can assign a priority to builds, and it'll try to order builds in
>> such a way that the higher priority builds get processed first. If
>> the aim is to serve substitutes, doing some prioritisation might
>> help building the most fetched things first.
>
> This is really cool! How does it work? Do you manually tag some
> specific derivations?

The logic for assigning priorities is currently very simple for the
guix.cbaines.net builds; it's here [1]. The channel instance (guix
pull related) derivations are prioritised over packages, with
x86_64-linux being prioritised over other architectures. For packages,
x86_64-linux is prioritised over other architectures.

1: https://git.cbaines.net/guix/build-coordinator/tree/scripts/guix-build-coordinator-queue-builds-from-guix-data-service.in#n174

In the future, it might be good to try and work out what packages are
more popular or harder to build, and prioritise more along those
lines.
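Expressed as code, the rule amounts to something like this. It's a
simplification of the script linked above, not an excerpt from it, and
channel-instance-build? and build-system are hypothetical accessors:

  ;; Sketch of the prioritisation described above; a higher value
  ;; means the build is processed sooner.
  (define (build-priority build)
    (+ (if (channel-instance-build? build)  ; hypothetical predicate
           200   ; guix pull related derivations come first
           0)
       (if (string=? (build-system build) "x86_64-linux")
           100   ; the most-used architecture within each group
           0)))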
>> Because the build results don't end up in a store (they could, but
>> as set out above, not being in the store is a feature I think), you
>> can't use `guix gc` to get rid of old store entries/substitutes. I
>> have some ideas about what to implement to provide some kind of GC
>> approach over a bunch of nars + narinfos, but I haven't implemented
>> anything yet.
>
> Where do they end up then? I missed your answer in the
> Question/Answer session.

That's something that you, as a user, have to decide when configuring
the service. The simplest option is just to have a directory on the
machine running the coordinator, where the narinfo and nar files get
put. You then serve this using NGinx/Apache-HTTPD/...

For guix.cbaines.net, I don't want to pay for a server with lots of
storage; it's cheaper in this case to pay wasabi.com to just store the
files and do the web serving. There's code available to use an S3
service for storage, and it's not difficult to do similar things.

> Speaking about Berlin, the builds should be in the workers' stores
> (with a GC policy to be defined; keep them for debugging concerns?)
> and the main store should have only the minimum. The items should
> really be stored only in the cache of publish. IMHO.
>
> Maybe I am missing something.

That sounds OK. The main disadvantage of worker machines performing
garbage collection is that they might have to re-download some stuff
to perform new builds, so you want to have a good mix of local store
items, plus plenty of space.

At least the way I've been using the Guix Build Coordinator, there is
no main store. The store on the machine used for the coordinator
process tends to fill up with derivations, but that's about it, and
that happens very slowly.

I also haven't tested trying to mix guix publish and the Guix Build
Coordinator, though it could probably work.

>> There could be issues with the implementation… I'd like to think
>> it's relatively simple, but that doesn't mean there aren't issues.

> […]

>> For some reason or another, getting backtraces for exceptions rarely
>> works. Most of the time when the coordinator tries to print a
>> backtrace, the part of Guile doing that raises an exception. I've
>> managed to cause it to segfault, through using SQLite incorrectly,
>> which hasn't been obvious to fix, at least for me. Additionally,
>> there are some places where I'm fighting against bits of Guix,
>> things like checking for substitutes without caching, or
>> substituting a derivation without starting to build it.
>
> I am confused by all the SQL involved. And I feel it is hard to
> maintain when scaling at large. I do not know. I am a newbie.

While I quite like SQL, I also do like stateless things, or keeping as
little state as possible. I think the database size is one aspect of
scaling that could do with a bit more work.

There are also bits of the Guix Build Coordinator that are serialised,
like the processing of hook events, and that can be a bottleneck when
the build throughput is quite high (see the P.S. below for a toy model
of what I mean). While I've had trouble with SQLite, and guile-squee
in the Guix Data Service, things are seeming pretty stable now with
both projects.

>> Finally, the instrumentation is somewhat reliant on Prometheus, and
>> if you want a pretty dashboard, then you might need Grafana too.
>> Neither of these things is packaged for Guix. Prometheus might be
>> feasible to package within the next few months; I doubt the same is
>> true for Grafana (due to the use of NPM).
>
> Really cool! For sure, knowing how healthy it is (or not) is really
> nice.

Hope this helps, just let me know if you have any more comments or
questions :)

Chris
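P.S. Here's the toy model of the serialised hook processing mentioned
above. It's not the coordinator's actual code, just an illustration of
the shape of the bottleneck: events land on one queue, a single
consumer drains it, so one slow hook (say, a slow nar upload) delays
every event behind it.

  ;; Toy model only: a single consumer thread drains a shared queue,
  ;; so hook latency bounds overall event throughput.
  (use-modules (ice-9 q)
               (ice-9 threads))

  (define hook-queue (make-q))
  (define queue-mutex (make-mutex))
  (define queue-condvar (make-condition-variable))

  (define (enqueue-event! event)
    (with-mutex queue-mutex
      (enq! hook-queue event)
      (signal-condition-variable queue-condvar)))

  (define (process-events handler)
    ;; Single consumer: events are handled strictly one at a time.
    (let loop ()
      (let ((event
             (with-mutex queue-mutex
               (let wait ()
                 (if (q-empty? hook-queue)
                     (begin
                       (wait-condition-variable queue-condvar
                                                queue-mutex)
                       (wait))
                     (deq! hook-queue))))))
        ;; A slow handler here stalls every event queued behind it.
        (handler event)
        (loop))))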