From: zimoun
To: Christopher Baines, guix-devel@gnu.org, Ludovic Courtès, Mathieu Othacehe
Subject: Thoughts on CI (was: Thoughts on building things for substitutes and the Guix Build Coordinator)
In-Reply-To: <87tutnlnjy.fsf@cbaines.net>
References: <87tutnlnjy.fsf@cbaines.net>
Date: Mon, 23 Nov 2020 23:56:07 +0100
Message-ID: <86v9dvu1h4.fsf@gmail.com>

Hi,

(Disclaimer: I am biased since I have been Mathieu’s rubber duck [1]
for his “new CI design”, presented in his talk, and I have read his
first experimental implementations.)

1:

Thank you Chris for this detailed email and your inspiring talk. Really
interesting! The discussion has been really fruitful. At least for me.

From my understanding, the Guix Build Coordinator is designed to
distribute the workload across a heterogeneous context (distant
machines).

IIUC, the design of GBC could implement some of Andreas’s ideas, because
the GBC is designed to support unreliable networks and it even has an
experimental trust mechanism for the workers. The un-queue’ing
algorithm implemented in GBC is not clear to me; it appears to be “work
stealing” but I have not read the code.

Mathieu’s offload is designed for a cluster, with the architecture of
Berlin in mind, reusing as much as possible the existing parts of Guix.
Since Berlin is a cluster, the workers are already trusted.
So Avahi allows them to be discovered; adding or removing machines
should be hot-swappable, with no reconfiguration involved. In other
words, the controller/coordinator (master) does not need the list of
workers. That’s one of the dynamic parts.

The second dynamic part is “work stealing”. To do so, ZeroMQ is used
both for communication and for un-queue’ing (work stealing). This
library is used because it allows focusing on the design, avoiding the
reimplementation of the scheduling strategy and probably bugs when using
Fibers to communicate. Well, that’s how I understand it.

For sure, the ’guile-simple-zmq’ wrapper is not bullet-proof; but it is
simply a wrapper and ZeroMQ is well-tested, AFAIK. We could imagine, as
a second step, replacing this ZMQ library with Fibers plus a Scheme
reimplementation of the scheduling strategy, once the design has been
tested a bit.

> I've been using the Guix Build Coordinator build substitutes for
> guix.cbaines.net, which is my testing ground for providing
> substitutes. I think it's working reasonably well.

What is the configuration of this machine? Size of the store? Number
of workers where the agents are running?

> The Guix Build Coordinator supports prioritisation of builds. You can
> assign a priority to builds, and it'll try to order builds in such a way
> that the higher priority builds get processed first. If the aim is to
> serve substitutes, doing some prioritisation might help building the
> most fetched things first.

This is really cool! How does it work? Do you manually tag some
specific derivations?

> Another feature supported by the Guix Build Coordinator is retries. If a
> build fails, the Guix Build Coordinator can automatically retry it. In a

[…]

> perfect world, everything would succeed first time, but because the
> world isn't perfect, there still can be intermittent build
> failures.
> Retrying failed builds even once can help reduce the chance
> that a failure leads to no substitutes for that builds as well as any
> builds that depend on that output.

Yeah, the current infrastructure lacks a way to distinguish between an
error (“the build completed but returned an error”) and a failure
(“something went wrong along the way”).

> Because the build results don't end up in a store (they could, but as
> set out above, not being in the store is a feature I think), you can't
> use `guix gc` to get rid of old store entries/substitutes. I have some
> ideas about what to implement to provide some kind of GC approach over a
> bunch of nars + narinfos, but I haven't implemented anything yet.

So where do they end up? I missed your answer in the Question/Answer
session.

Speaking about Berlin, the builds should be in the workers’ stores (with
a GC policy to be defined; keep them for debugging purposes?) and the
main store should have only the minimum. The items should really be
stored only in the cache of publish. IMHO. Maybe I am missing
something.

> There could be issues with the implementation… I'd like to think it's
> relatively simple, but that doesn't mean there aren't issues. For some

[…]

> reason or another, getting backtraces for exceptions rarely works. Most
> of the time the coordinator tries to print a backtrace, the part of
> Guile doing that raises an exception. I've managed to cause it to
> segfault, through using SQLite incorrectly, which hasn't been obvious to
> fix at least for me. Additionally, there are some places where I'm
> fighting against bits of Guix, things like checking for substitutes
> without caching, or substituting a derivation without starting to build
> it.

I am confused by all the SQL involved. And I feel it will be hard to
maintain when scaling up. I do not know; I am a newbie.
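
Coming back to the error/failure distinction mentioned above, here is a
minimal sketch in Python of what I have in mind. It is not how Cuirass
or the Guix Build Coordinator does it; the names `Outcome`, `Result` and
`run_with_retries` are hypothetical, and I assume the build can tell the
caller which of the three outcomes happened. Only a failure is worth
retrying; an error is reported right away.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable


class Outcome(Enum):
    SUCCESS = auto()  # the build completed and produced its outputs
    ERROR = auto()    # the build completed but returned an error
    FAILURE = auto()  # something went wrong along the way (network, worker...)


@dataclass
class Result:
    outcome: Outcome
    attempts: int  # number of times the build was actually run


def run_with_retries(build: Callable[[], Outcome], max_attempts: int = 3) -> Result:
    """Run `build`, retrying only when the outcome is a FAILURE.

    An ERROR is assumed to be reproducible, so retrying it would only
    waste worker time; it is returned after the first attempt.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        outcome = build()
        if outcome is not Outcome.FAILURE:
            return Result(outcome, attempt)
    # Every attempt failed: report the failure with the attempt count.
    return Result(Outcome.FAILURE, max_attempts)
```

With something like this, a build that fails once and then succeeds is
reported as a success after two attempts, while a build that returns an
error is never run twice.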

> Finally, the instrumentation is somewhat reliant on Prometheus, and if
> you want a pretty dashboard, then you might need Grafana too. Both of
> these things aren't packaged for Guix, Prometheus might be feasible to
> package within the next few months, I doubt the same is true for Grafana
> (due to the use of NPM).

Really cool! For sure, knowing whether it is healthy (or not) is really
nice.

Cheers,
simon