From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mp1 ([2001:41d0:2:4a6f::])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	by ms11 with LMTPS id MDRNMLDVvF/gfwAA0tVLHw
	(envelope-from ) for ; Tue, 24 Nov 2020 09:43:12 +0000
Received: from aspmx1.migadu.com ([2001:41d0:2:4a6f::])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
	by mp1 with LMTPS id +P4iLLDVvF9ebgAAbx9fmQ
	(envelope-from ) for ; Tue, 24 Nov 2020 09:43:12 +0000
Received: from lists.gnu.org (lists.gnu.org [209.51.188.17])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by aspmx1.migadu.com (Postfix) with ESMTPS id 267CC9404C2
	for ; Tue, 24 Nov 2020 09:43:12 +0000 (UTC)
Received: from localhost ([::1]:34432 helo=lists1p.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from )
	id 1khUqt-00020o-0k for larch@yhetil.org; Tue, 24 Nov 2020 04:43:11 -0500
Received: from eggs.gnu.org ([2001:470:142:3::10]:48518)
	by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
	(Exim 4.90_1) (envelope-from ) id 1khUqe-0001zw-6q
	for guix-devel@gnu.org; Tue, 24 Nov 2020 04:42:56 -0500
Received: from mira.cbaines.net ([2a01:7e00:e000:2f8:fd4d:b5c7:13fb:3d27]:41455)
	by eggs.gnu.org with esmtp (Exim 4.90_1) (envelope-from )
	id 1khUqZ-0003yU-6L; Tue, 24 Nov 2020 04:42:55 -0500
Received: from localhost (188.28.112.52.threembb.co.uk [188.28.112.52])
	by mira.cbaines.net (Postfix) with ESMTPSA id B1FDD27BBF7;
	Tue, 24 Nov 2020 09:42:43 +0000 (GMT)
Received: from capella (localhost [127.0.0.1])
	by localhost (OpenSMTPD) with ESMTP id 289f6370;
	Tue, 24 Nov 2020 09:42:41 +0000 (UTC)
References: <87tutnlnjy.fsf@cbaines.net> <86v9dvu1h4.fsf@gmail.com>
User-agent: mu4e 1.4.13; emacs 27.1
From: Christopher Baines
To: zimoun
Subject: Re: Thoughts on CI (was: Thoughts on building things for substitutes and the Guix Build Coordinator)
In-reply-to: <86v9dvu1h4.fsf@gmail.com>
Date: Tue, 24 Nov 2020 09:42:38 +0000
Message-ID: <87im9vgkfl.fsf@cbaines.net>
MIME-Version: 1.0
Content-Type: multipart/signed; boundary="=-=-=";
	micalg=pgp-sha512; protocol="application/pgp-signature"
Received-SPF: pass client-ip=2a01:7e00:e000:2f8:fd4d:b5c7:13fb:3d27;
	envelope-from=mail@cbaines.net; helo=mira.cbaines.net
X-Spam_score_int: -18
X-Spam_score: -1.9
X-Spam_bar: -
X-Spam_report: (-1.9 / 5.0 requ) BAYES_00=-1.9, SPF_HELO_PASS=-0.001,
	SPF_PASS=-0.001 autolearn=ham autolearn_force=no
X-Spam_action: no action
X-BeenThere: guix-devel@gnu.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Development of GNU Guix and the GNU System distribution."
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Cc: guix-devel@gnu.org, Mathieu Othacehe
Errors-To: guix-devel-bounces+larch=yhetil.org@gnu.org
Sender: "Guix-devel"
X-Scanner: ns3122888.ip-94-23-21.eu
Authentication-Results: aspmx1.migadu.com; dkim=none; dmarc=none;
	spf=pass (aspmx1.migadu.com: domain of guix-devel-bounces@gnu.org
	designates 209.51.188.17 as permitted sender)
	smtp.mailfrom=guix-devel-bounces@gnu.org
X-Spam-Score: -3.11
X-TUID: FwLh8ftArKeE

--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

zimoun writes:

> Hi,
>
> (Disclaim: I am biased since I have been the Mathieu’s rubber duck [1]
> about his “new CI design” presented in his talk and I have read his
> first experimental implementations.)
>
> 1:
>
>
> Thank you Chris for this detailed email and your inspiring talk. Really
> interesting! The discussion has been really fruitful. At least for me.

You're welcome :)

> From my understanding, the Guix Build Coordinator is designed to
> distribute the workload on heterogeneous context (distant machines).

It's not specific to similar or dissimilar machines, although I was
wanting something that would work in both situations.

> IIUC, the design of GBC could implement some Andreas’s ideas.
> Because, the GBC is designed to support unreliable network and even it
> has experimental trusted mechanism for the workers. The un-queue’ing
> algorithm implemented in GBC is not clear; it appears to be “work
> stealing” but I have not read the code.
>
> The Mathieu’s offload is designed for cluster with the architecture of
> Berlin in mind; reusing as much as possible the existing part of Guix.
>
> Since Berlin is a cluster, the workers are already trusted. So Avahi
> allows to discover them; the addition / remove of machines should be hot
> swapping, without reconfiguration involved. In other words, the
> controller/coordinator (master) does not need the list of workers.
> That’s one of the dynamic part.

Yeah, I think that's a really nice feature. Although I actually see more
use for that if I was trying to perform builds across machines in my
house, since they come and go more than machines in a datacentre.

> The second dynamic part is “work stealing”. And to do so, ZeroMQ is
> used both for communication and for un-queue’ing (work stealing). This
> library is used because it allows to focus on the design avoiding the
> reimplementation of the scheduling strategy and probably bugs with
> Fibers to communicate. Well, that’s how I understand it.
>
> For sure, the ’guile-simple-zmq’ wrapper is not bullet-proof; but it is
> simply a wrapper and ZeroMQ is well-tested, AFAIK. Well, we could
> imagine replace in a second step this ZMQ library by Fibers plus Scheme
> reimplementation of the scheduling strategy; once the design is a bit
> tested.
>
>
>> I've been using the Guix Build Coordinator build substitutes for
>> guix.cbaines.net, which is my testing ground for providing
>> substitutes. I think it's working reasonably well.
>
> What is the configuration of this machine? Size of the store? Number
> of workers where the agents are running?

It has varied.
At the moment I'm using 1 small virtual machine for the coordinator,
plus 4 small virtual machines to build packages (3 of these I run for
other reasons as well). I'm also using my desktop computer to build
packages, which is a much faster machine, but it's only building packages
when I'm using it for other things.

So 4 or 5 machines running the agents, plus 1 machine for the
coordinator itself. They all have small stores. For guix.cbaines.net,
I'm using an S3 service (wasabi.com) for the nar storage, which currently
has 1.4TB of nars.

>> The Guix Build Coordinator supports prioritisation of builds. You can
>> assign a priority to builds, and it'll try to order builds in such a way
>> that the higher priority builds get processed first. If the aim is to
>> serve substitutes, doing some prioritisation might help building the
>> most fetched things first.
>
> This is really cool! How does it work? Do you do manual tag on some
> specific derivations?

The logic for assigning priorities is currently very simple for the
guix.cbaines.net builds; it's here [1]. The channel instance (guix pull
related) derivations are prioritised over packages, with x86_64-linux
being prioritised over other architectures. For packages, x86_64-linux
is prioritised over other architectures.

1: https://git.cbaines.net/guix/build-coordinator/tree/scripts/guix-build-coordinator-queue-builds-from-guix-data-service.in#n174

In the future, it might be good to try and work out what packages are
more popular or harder to build, and prioritise more along those lines.

>> Because the build results don't end up in a store (they could, but as
>> set out above, not being in the store is a feature I think), you can't
>> use `guix gc` to get rid of old store entries/substitutes. I have some
>> ideas about what to implement to provide some kind of GC approach over a
>> bunch of nars + narinfos, but I haven't implemented anything yet.
>
> Where do they end up so?
> I missed your answer in the Question/Answer
> session.

That's something that you, as a user, have to decide when configuring
the service. The simplest option is just to have a directory on the
machine running the coordinator, where the narinfo and nar files get put.
You then serve this using NGinx/Apache-HTTPD/...

For guix.cbaines.net, I don't want to pay for a server with lots of
storage; it's cheaper to pay wasabi.com in this case to just store the
files and do the web serving. There's code available to use an S3 service
for storage, and it's not difficult to do similar things.

> Speaking about Berlin, the builds should be in the workers store (with a
> GC policy to be defined; keep them for debugging concern?) and the main
> store should have only the minimum. The items should be really and only
> stored in the cache of publish. IMHO.
>
> Maybe I miss something.

That sounds OK. The main disadvantage of worker machines performing
garbage collection is that they might have to re-download some stuff to
perform new builds, so you want to have a good mix of local store items,
plus plenty of space.

At least the way I've been using the Guix Build Coordinator, there is no
main store. The store on the machine used for the coordinator process
tends to fill up with derivations, but that's about it, and that happens
very slowly. I also haven't tested trying to mix guix publish and the
Guix Build Coordinator; it could probably work though.

>> There could be issues with the implementation… I'd like to think it's
>> relatively simple, but that doesn't mean there aren't issues. For some
> […]
>> reason or another, getting backtraces for exceptions rarely works. Most
>> of the time the coordinator tries to print a backtrace, the part of
>> Guile doing that raises an exception. I've managed to cause it to
>> segfault, through using SQLite incorrectly, which hasn't been obvious to
>> fix at least for me.
>> Additionally, there are some places where I'm
>> fighting against bits of Guix, things like checking for substitutes
>> without caching, or substituting a derivation without starting to build
>> it.
>
> I am confused by all the SQL involved. And I feel it is hard to
> maintain when scaling at large. I do not know. I am newbie.

While I quite like SQL, I also do like stateless things, or keeping as
little state as possible. I think the database size is one aspect of
scaling that could do with a bit more work. There are also bits of the
Guix Build Coordinator that are serialised, like the processing of hook
events, and that can be a bottleneck when the build throughput is quite
high.

While I've had trouble with SQLite, and guile-squee in the Guix Data
Service, things are seeming pretty stable now with both projects.

>> Finally, the instrumentation is somewhat reliant on Prometheus, and if
>> you want a pretty dashboard, then you might need Grafana too. Both of
>> these things aren't packaged for Guix, Prometheus might be feasible to
>> package within the next few months, I doubt the same is true for Grafana
>> (due to the use of NPM).
>
> Really cool! For sure know how it is healthy (or not) is really nice.
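
To make the priority ordering described earlier in this message a bit
more concrete, here is a minimal Python sketch. It is not taken from the
script linked at [1]: the function name build_priority, the "kind" labels
and the numeric values are all hypothetical, chosen only to reproduce the
ordering described (channel instance derivations before packages, and
x86_64-linux before other architectures within each group).

```python
# Hypothetical sketch: names and numbers are illustrative assumptions,
# not the values used by the real queue-builds script.

def build_priority(kind, system):
    """Return a priority for a build; higher values are processed first."""
    # Channel instance (guix pull related) derivations outrank packages.
    base = 1000 if kind == "channel-instance" else 0
    # Within each group, x86_64-linux outranks other architectures.
    bonus = 100 if system == "x86_64-linux" else 0
    return base + bonus


builds = [
    ("package", "aarch64-linux"),
    ("channel-instance", "i686-linux"),
    ("package", "x86_64-linux"),
    ("channel-instance", "x86_64-linux"),
]

# Order the queue so that the highest priority builds come first.
ordered = sorted(builds, key=lambda b: build_priority(*b), reverse=True)

for kind, system in ordered:
    print(build_priority(kind, system), kind, system)
```

Keeping the group weight larger than the architecture bonus means a
channel instance derivation for any architecture still sorts ahead of
every package build.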

Hope this helps, just let me know if you have any more comments or
questions :)

Chris

--=-=-=
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQKlBAEBCgCPFiEEPonu50WOcg2XVOCyXiijOwuE9XcFAl+81Y5fFIAAAAAALgAo
aXNzdWVyLWZwckBub3RhdGlvbnMub3BlbnBncC5maWZ0aGhvcnNlbWFuLm5ldDNF
ODlFRUU3NDU4RTcyMEQ5NzU0RTBCMjVFMjhBMzNCMEI4NEY1NzcRHG1haWxAY2Jh
aW5lcy5uZXQACgkQXiijOwuE9XemURAAkICnXeanJHEnObGB2z9EiDKJFVCkJBc5
qJLwWza0F9gfkjGF3zxCXEq/+9svjCG01Z3Wd3pJAtPpK0DXFJNojQNRIfllHre7
YDfxOdcqvdh2pKo1Tr+WR7YlXonejWmKAnNeNTW00wIHrlMajzAyAhOf5D95lfN9
vo1EsFsRjhvFX/hNhsFR+CstGezSVsdbQRxFZzYmBgHJUGlRUIKZxA8Ch7e66icw
GVXIqS0KGinaUxuaAGwhSmql+qHHqdUfkXgfudNJqsjzt/2pn/RQss45bqx9n7hp
WfyXxgZZRcvSPjezMY2HdsVEQ6iaexWg3z+i6YMdyj+7NUZxt84ApjR9qUArubNr
bTXSxor8XF57pXV0P8EnafX0D1adi9iHU3krhnqvVzYopKDnw5sLpaiVQ3c4J95q
JSZ0vO9K1F6ZWHZ1iXh4OaHc8aUyZhbKJD/GAzbCxwSS7Bdnt5fSh05tleGLP9Vf
6BvhBmuSSKM8JHrt3whpodc7Ak3214CHGB4DaZWIfpB4+fi92Z+UdFYMWPCpB5oh
iGct9ZehBSQnwCSrKgqInoVZoyVtu5AlnKhLA4La/MvZn6CLnZdI3wnlgWJGhCyH
FtNrVTGIvlVcZgrAuBt+cwZFIzRIc6oqTERGSsdNMR6Q6dPdHaz7LlmddMQjTJdX
zmkKJck/am0=
=KGCZ
-----END PGP SIGNATURE-----
--=-=-=--