From: Phil Beadling
Date: Mon, 17 Jan 2022 17:23:24 +0000
Subject: Re: Parallel guix builds can trample?
To: Ricardo Wurmus
Cc: "guix-devel@gnu.org"
References: <877db6ge0o.fsf@beadling.co.uk> <874k697thf.fsf@elephly.net> <87r19d5pcw.fsf@elephly.net> <871r1dguq6.fsf@beadling.co.uk>
In-Reply-To: <871r1dguq6.fsf@beadling.co.uk>
List-Id: "Development of GNU Guix and the GNU System distribution."

Hi Ricardo, all,

I think we’ve worked out what the issue is, and have a proposed workaround, and perhaps a case for solving the problem in Guix itself (depending on what you people think!).

The issue is that despite each build being performed in its own isolated container, these containers are fed by the same per-user cached source directory.  In the case where *different* versions of the *same* repo are built at once, this results in a race condition.

In our case we have one Linux account that does a lot of automated Guix builds for us.

One example: this account watches our source control and automatically rebuilds all outstanding Pull Requests (PRs) on a repo after a separate successful merge to our integration branch.  PRs are uniquely identified by monotonically increasing PR numbers, e.g. PR-1, PR-2, PR-3 and so on.  Each is a different branch on the same repo with slightly different candidate changes in it.  They are automatically kept up to date with the integration branch.

To do this, our watcher fires off dozens of guix builds (near) instantaneously, each with its own local channel customized for the PR it is building.  Doing them in parallel is important to make the system usably responsive.

Each fired process does this:

  • Clone the channel containing the package into a local directory
  • Modify the commit id of the package to the new merged head of the PR
  • Modify the package version to some dummy version containing the PR number
  • Build the modified package using the local channel
  • Report the result (the build is effectively discarded; it is never used for anything)
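
The two "modify" steps above could be scripted roughly as follows.  This is only a sketch under stated assumptions: rewrite_package is a hypothetical helper name, the package file is assumed to contain exactly one (version "...") form and one (commit "...") form, and the dummy version string in the usage note is just an illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: point a package definition at a PR's merged head.
# Assumes FILE holds one (version "...") and one (commit "...") form, and
# that COMMIT and VERSION contain no '|', '&' or '\' characters.
rewrite_package() {
    local file="$1" commit="$2" version="$3"
    # Write to a temporary file first so a failed sed never truncates FILE.
    sed -e "s|(commit \"[^\"]*\")|(commit \"$commit\")|" \
        -e "s|(version \"[^\"]*\")|(version \"$version\")|" \
        "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}
```

After cloning the channel into, say, pr-1, one would call something like `rewrite_package pr-1/my-package.scm "$commit" "0.0.0-pr1"` and then run `guix build -L pr-1 my-package`.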

What we think is happening is the following:

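  • For each build that is kicked off in quick succession, the local cache of the repo needs to be updated by update-cached-checkout:
      - https://github.com/guix-mirror/guix/blob/9f526f5dad5f4af69d158c50369e182305147f3b/guix/git.scm#L476
      - https://github.com/guix-mirror/guix/blob/9f526f5dad5f4af69d158c50369e182305147f3b/guix/git.scm#L279
  • The problem is that every build uses the same cached repo: before one build has a chance to take a copy of the updated checkout, that checkout can be changed by a separate build process.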

Thus there is a race condition in this scenario.  We can provide a longer test script to demo this if required – it’s quite straightforward to reproduce with just a bash script, now that we know what is causing it.

Our workaround has been to change XDG_CACHE_HOME for each PR build we do.  But this is a bit unsatisfactory as it affects processes beyond Guix – it casts too wide a net, but it does resolve the problem for the time being.
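
For illustration, the XDG_CACHE_HOME workaround can be sketched as a small wrapper.  The name with_private_cache is hypothetical and the temporary directory naming is an assumption; the only thing taken from the description above is that each build gets its own XDG_CACHE_HOME.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command with a throwaway XDG_CACHE_HOME so
# its cached checkouts are not shared with any concurrent build.
with_private_cache() {
    local cache_dir rc
    cache_dir=$(mktemp -d "${TMPDIR:-/tmp}/guix-cache.XXXXXX") || return 1
    # The assignment prefix scopes the variable to this one command.
    XDG_CACHE_HOME="$cache_dir" "$@"
    rc=$?
    # The cache is never reused, so remove it once the command has finished.
    rm -rf "$cache_dir"
    return "$rc"
}
```

Usage would look like `with_private_cache guix build -K -L pr-1 my-package`.  Scoping the variable to the single command keeps it away from the rest of the script, although anything that command itself spawns still sees the changed value.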

Do people think this is enough of an issue to make a switch available in Guix to prevent sharing of cached clones?  This would be easy enough to implement – a crude solution would be for each cache directory name to be generated from a SHA of a string that includes the PID or similar to ensure a unique name; because the directory is never going to be reused, it could be deleted immediately after the build.
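
To make the crude naming idea concrete, here is a sketch in shell rather than in Guix's own Scheme code.  Everything in it is an assumption for illustration: the function name unique_checkout_dir, the choice of sha256sum, and the guix/checkouts layout under the cache directory are not taken from the Guix sources.

```shell
#!/usr/bin/env bash
# Hypothetical naming scheme: hash the repository URL together with a
# process id, so that two concurrent builds never share a directory.
unique_checkout_dir() {
    local url="$1" pid="${2:-$$}" digest
    # sha256sum prints "<hex digest>  -"; keep only the digest.
    digest=$(printf '%s %s' "$url" "$pid" | sha256sum | cut -d ' ' -f 1)
    # The guix/checkouts path below is an assumed layout, not confirmed.
    printf '%s/guix/checkouts/%s\n' "${XDG_CACHE_HOME:-$HOME/.cache}" "$digest"
}
```

The same URL with a different PID yields a different directory, which is the property the proposal relies on; the price is that no clone is ever reused between builds.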

Whilst this is unlikely to happen at the console, as people script guix build use-cases to fit their own problems (in particular building lots of variations of a single piece of software), I can see this causing a headache.  I think at least the manual should make it clear that you cannot build 2 packages referencing the same repo at the same time with the same user (unless I’ve missed this bit, I don’t think it’s made explicitly clear?).  An even simpler change would be to introduce a lock file that refuses the 2nd build, at least preventing the race condition from happening and ensuring referential transparency; or, simpler still, just place a warning on stderr?
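
A sketch of the "refuse the 2nd build" idea, done outside Guix in plain shell.  It uses a lock directory rather than a lock file, because mkdir either creates the directory or fails in one atomic step; with_checkout_lock, the lock path and the exit code 75 are all hypothetical choices.

```shell
#!/usr/bin/env bash
# Hypothetical guard: run a command only if no other build holds the lock.
with_checkout_lock() {
    local lock_dir="$1" rc
    shift
    # mkdir fails if the directory already exists, so only one caller wins.
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "warning: cached checkout is in use by another build: $lock_dir" >&2
        return 75  # arbitrary code meaning "refused, try again later"
    fi
    "$@"
    rc=$?
    rmdir "$lock_dir"
    return "$rc"
}
```

For example, `with_checkout_lock /tmp/my-repo.lock guix build -L pr-1 my-package`.  A build that is killed before it reaches rmdir leaves a stale lock behind, so a real implementation would need some way to detect and clear that.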

If people are amenable to adding a switch or other config option, we’d be happy to look at writing the patch?


Any thoughts/comments/advice?


Cheers!
Phil.

On Wed, 12 Jan 2022 at 09:37, Phil <phil@beadling.co.uk> wrote:
Hi - more details below.

Ricardo Wurmus writes:

>
> How are you using Guix with this?  Do you generate Guix package
> expressions?  Do you use “guix build --with-commit”?
>

The situation is like this - if we had a directory of clones of my
channel:

- pr-1
- pr-2
- pr-3
- pr-4
... and so on

Initially all the clones are taken from the master branch of my
channel and are all identical - but we change the version and commit to
match the head of each PR branch as per below.

Each clone looks like this:
- pr-1
      - my-package.scm
- pr-2
      - my-package.scm
and so on....

Each my-package.scm has a package like below - the initial packages are all
identical, but my system effectively seds the version and commit values
like the below.  These values are never committed back to master; they
are used only as local channels to build each PR to test each build
still passes.

(define-public my-package
  (package
    (name "my-package")
    (version "this-is-different-for-each-pr")  ;; replace master version
    (source
      (git-checkout
        (url "ssh://same@repo:7999/same/repo.git")
        (commit "this-is-different-for-each-pr") ;; replace master version
everything else remains the same in the package....


At this point we have lots of local channels referencing different commits, in
the same package, ready to build - so I spawn them all simultaneously -
the equivalent pseudo-shell that I will mock up today would be:

# map each PR directory to the PID of its background build
# (associative arrays need bash 4 or later):
declare -A pids

for dir in pr-*; do
  guix build -K -L "$dir" my-package > "/tmp/$dir.log" 2>&1 &  # note the ampersand
  pids[$dir]=$!
done

# wait for each build in turn and check its return code:
for dir in "${!pids[@]}"; do
  if ! wait "${pids[$dir]}"; then
    echo "build of $dir failed"
  fi
done

What I'm seeing occasionally is that the logs and return code for, say, directory pr-1
are appearing in the guix build for pr-3 or pr-6 instead.

We know this because the code is different enough in pr-1 that its logs
are unique across all the PRs.  We can also check the source code if the
build fails using --keep-failed to show it doesn't match the commit id
in the package used to build it.

Hopefully that makes sense?  I can post the actual shell script once I've written the mock.