From: Ludovic Courtès
To: Christopher Baines
Cc: guix-devel@gnu.org
Subject: Re: Thoughts on building things for substitutes and the Guix Build Coordinator
Date: Fri, 20 Nov 2020 11:12:49 +0100
Message-ID: <87sg94fiam.fsf@gnu.org>
In-Reply-To: <87r1orksi0.fsf@cbaines.net> (Christopher Baines's message of "Wed, 18 Nov 2020 07:56:39 +0000")
References: <87tutnlnjy.fsf@cbaines.net> <87blfvocrn.fsf@gnu.org> <87r1orksi0.fsf@cbaines.net>

Hi Chris,

Christopher Baines skribis:

>>> Another feature supported by the Guix Build Coordinator is retries. If a
>>> build fails, the Guix Build Coordinator can automatically retry it. In a
>>> perfect world, everything would succeed first time, but because the
>>> world isn't perfect, there still can be intermittent build
>>> failures. Retrying failed builds even once can help reduce the chance
>>> that a failure leads to no substitutes for that build, as well as for
>>> any builds that depend on that output.
>>
>> That’s nice too; it’s one of the practical issues we have with Cuirass
>> and that’s tempting to ignore because “hey, it’s all functional!”, but
>> then reality gets in the way.
>
> One further benefit related to this is that if you want to manually
> retry building a derivation, you just submit a new build for that
> derivation.
>
> The Guix Build Coordinator also has no concept of "Failed (dependency)";
> it never gives up. This avoids the situation where spurious failures
> block other builds.

I think there’s a balance to be found. Being able to retry is nice,
but “never giving up” is not: on a build farm, you could end up always
rebuilding the same derivation that in the end always fails, and that
can be a huge waste of resources.

On berlin we run the daemon with ‘--cache-failures’. It’s not great
because, again, it prevents further builds altogether. Which makes me
think we could change the daemon to use a threshold: it would maintain
a derivation build failure count (instead of a Boolean) and would only
prevent rebuilds once that failure threshold has been reached.
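To make that concrete, here is a rough sketch of the bookkeeping I
have in mind, written in Scheme purely for illustration (the actual
change would live in the daemon and its database; ‘%failure-threshold’,
‘record-build-failure!’, and ‘give-up?’ are made-up names):

  ;; Sketch only: count failures per derivation instead of keeping a
  ;; Boolean "cached failure" flag.
  (define %failure-threshold
    ;; Number of recorded failures after which we stop scheduling rebuilds.
    3)

  (define failure-counts
    ;; Map derivation file names to their failure count.  In the daemon
    ;; this would be a table in its database, not an in-memory table.
    (make-hash-table))

  (define (record-build-failure! drv)
    (hash-set! failure-counts drv
               (+ 1 (hash-ref failure-counts drv 0))))

  (define (give-up? drv)
    ;; True if DRV already failed at least %FAILURE-THRESHOLD times.
    (>= (hash-ref failure-counts drv 0) %failure-threshold))

With something along these lines, a threshold of one gives the current
‘--cache-failures’ behavior, while a build farm could pick a higher
value and still avoid rebuilding the same failing derivation forever.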
>>> Because the build results don't end up in a store (they could, but as
>>> set out above, not being in the store is a feature I think), you can't
>>> use `guix gc` to get rid of old store entries/substitutes. I have some
>>> ideas about what to implement to provide some kind of GC approach over a
>>> bunch of nars + narinfos, but I haven't implemented anything yet.
>>
>> ‘guix publish’ has support for that via (guix cache), so if we could
>> share code, that’d be great.
>
> Guix publish does time-based deletion, based on when the files were
> first created, right? If that works for people, that's fine I guess.

Yes, it’s based on the atime (don’t use “noatime”!), though (guix
cache) lets you implement other policies.

> Personally, I'm thinking about GC as in: don't delete nar A if you want
> to keep nar B, and nar B references nar A. It's perfectly possible that
> someone could fetch nar B if you deleted nar A, but it's also possible
> that someone couldn't because of that missing substitute. Maybe I'm
> overthinking this though?

I think you are. :-) ‘guix publish’ doesn’t do anything this fancy and
it works great. The reason is that clients typically always ask for
both A and B, thus the atime of A is the same as that of B.

> The Cuirass + guix publish approach does something similar, because
> Cuirass creates GC roots that expire. guix gc wouldn't delete a store
> item if it's needed by something that's protected by a Cuirass-created
> GC root.

Cuirass has a TTL on GC roots, which thus defines how long things
remain in the store; ‘publish’ has a TTL on nars, which defines how
long nars remain in its cache. The two are in fact disconnected.

> Another complexity here that I didn't set out initially is that there
> are places where the Guix Build Coordinator makes decisions based on
> the belief that if its database says a build has succeeded for an
> output, that output will be available. If a situation arose where a
> build needed an output that had been successfully built but then
> deleted, I think the coordinator would get stuck forever retrying that
> build, which would never start because of the missing store item. My
> thinking on this at the moment is that maybe what you'd want to do is
> tell the Guix Build Coordinator that you've deleted a store item and
> it's truly missing, but that would complicate the setup to some degree.

I think you’d just end up rebuilding it in that case, no?

>> I haven’t yet watched your talk but I’ve watched Mathieu’s, where he
>> admits to being concerned about the reliability of code involving Fibers
>> and/or SQLite (which I can understand given his/our experience, although
>> I’m maybe less pessimistic). What’s your experience, how do you feel
>> about it?
>
> The coordinator does use Fibers, plus a lot of different threads for
> different things.

Interesting, why are some things running in threads? There also seems
to be shared state in ‘create-work-queue’; why not use message passing?
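To illustrate the kind of thing I have in mind (a sketch with Fibers
channels, not the Coordinator’s actual code; ‘spawn-workers’ and the
worker count are made up): the “queue” is simply a channel that worker
fibers read from, so there is no shared state to protect with locks:

  (use-modules (fibers)
               (fibers channels))

  (define (spawn-workers queue n process)
    ;; Spawn N worker fibers, each taking work items from QUEUE and
    ;; applying PROCESS to them.
    (let loop ((i 0))
      (when (< i n)
        (spawn-fiber
         (lambda ()
           (let work-loop ()
             (process (get-message queue))
             (work-loop))))
        (loop (+ i 1)))))

  (run-fibers
   (lambda ()
     (let ((queue (make-channel)))
       (spawn-workers queue 4 (lambda (item) (pk 'processing item)))
       ;; Submitting work is just sending a message; ‘put-message’
       ;; blocks until a worker is ready, since channels are unbuffered.
       (put-message queue 'example-work-item)
       (sleep 1))))

Retries then fall out naturally: resubmitting a work item is just
another ‘put-message’.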
> Regarding reliability, it's hard to say really. Given I set out to build
> something that works across an (unreliable) network, I've built in
> reliability through making sure things retry upon failure, among other
> things. I definitely haven't chased any blocked fibers, although there
> could be some of those lurking in the code; I might not have noticed
> because it sorts itself out eventually.

OK. Most of the issues we see now with offloading and Cuirass are
things you can only experience with a huge store, a large number of
build machines, and a lot of concurrent derivation builds. Perhaps you
are actually approaching this scale on your instance?

Mathieu experimented with the Coordinator on berlin. It would be nice
to see how it behaved there.

> One of the problems I did have recently was that some hooks would just
> stop getting processed. Each type of hook has a thread, which just
> checked every second whether there were any events to process, and
> processed any if there were. I'm not sure what was wrong, but I changed
> the code to be smarter: be triggered when new events are actually
> entered into the database, and poll every so often just in case. I
> haven't seen hooks get stuck since then, but what I'm trying to convey
> here is that I'm not quite sure how to track down issues that occur in
> specific threads.
>
> Another thing to mention here is that implementing support for
> PostgreSQL through Guile Squee is still a thing I have in mind, and that
> might be more appropriate for larger databases. It's still prone to the
> fiber-blocking problem, but at least it's harder to cause segfaults
> with Squee than with SQLite.

OK.

I find it really nice to have metrics built in, but I share Mathieu’s
concern about complexity here. If we’re already hitting scalability
issues with SQLite, then perhaps that’s a sign that metrics should be
handled separately. Would it be an option to implement metrics
gathering in a separate, optional process that would essentially
subscribe to the relevant hooks/events?

Thanks,
Ludo’.
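PS: Regarding the hook threads that used to poll every second: with
Fibers, one way to express “wake up when notified, but still poll every
so often just in case” is to wait on a choice between a channel and a
timer. This is only a sketch (‘notifications’ and
‘process-pending-events!’ are made-up names):

  (use-modules (fibers channels)
               (fibers operations)
               (fibers timers))

  (define (hook-loop notifications process-pending-events!)
    ;; Block until either a message arrives on NOTIFICATIONS or 60
    ;; seconds have elapsed, then process whatever is pending and loop.
    (let loop ()
      (perform-operation
       (choice-operation
        (wrap-operation (get-operation notifications)
                        (lambda (message) 'notified))
        (wrap-operation (sleep-operation 60)
                        (lambda () 'timeout))))
      (process-pending-events!)
      (loop)))

The whole loop runs in a fiber, so there is no dedicated thread per
hook type and no wakeup every second when nothing is happening.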