all messages for Guix-related lists mirrored at yhetil.org
* Should we document how to detect if build machines are reachable before trying to offload?
@ 2024-07-04 15:45 Sergio Pastor Pérez
  2024-07-11  9:48 ` Ludovic Courtès
  0 siblings, 1 reply; 8+ messages in thread
From: Sergio Pastor Pérez @ 2024-07-04 15:45 UTC
  To: guix-devel

Hello.

I recently discovered that offloading builds to remote machines[1]
hangs when the machines are not available, instead of falling back to
building locally[2]. This forces the user to pass the `--no-offload`
flag.

I saw on the mailing list[3] that someone suggested relying on the
fact that the `build-machines` field accepts a list of gexps instead
of plain `build-machine` records.

This suggestion is almost correct, but it only checks whether the host
is known, which does not guarantee that it is reachable. Therefore I
came up with this:
--8<---------------cut here---------------start------------->8---
(build-machines
 (list
  #~(let* ((resolvable? (lambda (machine)
                          (zero? (system* #$(file-append netcat "/bin/nc")
                                          "-z" "-w1"
                                          (build-machine-name machine) "22")))))
      (filter resolvable?
              (list (build-machine
                     (name "my-host")
                     (systems (list "x86_64-linux" "i686-linux"))
                     ;; NOTE: Located in '/etc/ssh/ssh_host_ed25519_key.pub' on the machine that does the build.
                     ;; It will be generated by `openssh-service-type'.
                     (host-key #$(plain-file-content %my-host-host-key))
                     ;; NOTE: User on the build machine that allows SSH access with the key from `private-key' field.
                     (user "my-host-user1")
                     (private-key "/home/user1/.ssh/id_ed25519")))))))
--8<---------------cut here---------------end--------------->8---

This makes it possible to detect dynamically which machines are reachable.

A user who never wants to build locally can already pass the `-M 0`
flag. I would therefore expect offloading to fall back gracefully to
building locally instead of getting stuck. If the current behaviour is
intended, I think we should document how to avoid the hang.

Should we add this snippet to the manual/cookbook?
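
For people on a foreign distro, where `/etc/guix/machines.scm` is
written by hand, the same filter could presumably go straight into that
file. This is only an untested sketch: it assumes `nc` is on the PATH of
the offload hook, and the host name, user, key path and the elided host
key are placeholders:
--8<---------------cut here---------------start------------->8---
;; /etc/guix/machines.scm (sketch; host name, user, key path and the
;; elided host key are placeholders, and "nc" is assumed to be on PATH).

(define (reachable? machine)
  ;; 'nc -z -w1 HOST 22' exits with 0 only if a TCP connection to port
  ;; 22 of HOST can be opened within one second.
  (zero? (system* "nc" "-z" "-w1" (build-machine-name machine) "22")))

;; The value of the file must be a list of build machines; keep only
;; those that answer on the SSH port right now.
(filter reachable?
        (list (build-machine
               (name "my-host")
               (systems (list "x86_64-linux" "i686-linux"))
               (host-key "ssh-ed25519 AAAA…")
               (user "my-host-user1")
               (private-key "/home/user1/.ssh/id_ed25519"))))
--8<---------------cut here---------------end--------------->8---

The price is up to one extra second per unreachable machine every time
the file is evaluated, and a machine can of course still go away right
after the check.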

[1] https://guix.gnu.org/manual/en/html_node/Daemon-Offload-Setup.html
[2] https://lists.gnu.org/archive/html/help-guix/2023-12/msg00114.html
[3] https://lists.gnu.org/archive/html/help-guix/2023-12/msg00120.html


Regards,
Sergio.



* Re: Should we document how to detect if build machines are reachable before trying to offload?
  2024-07-04 15:45 Should we document how to detect if build machines are reachable before trying to offload? Sergio Pastor Pérez
@ 2024-07-11  9:48 ` Ludovic Courtès
  2024-07-11 20:23   ` Sergio Pastor Pérez
  0 siblings, 1 reply; 8+ messages in thread
From: Ludovic Courtès @ 2024-07-11  9:48 UTC
  To: Sergio Pastor Pérez; +Cc: guix-devel

Hello!

Sergio Pastor Pérez <sergio.pastorperez@outlook.es> skribis:

> I recently discovered that offloading builds to remote machines[1],
> hangs when the machines are not available; instead of defaulting to
> building locally[2]. This forces the user to use the `--no-offload`
> flag.

Do you remember exactly under what circumstances it hangs?  I think
‘guix offload’ should handle that situation gracefully and we should fix
it if it does not.

Right now, it sets an initial connection timeout of 30s, which is quite
long but turned out to be necessary (see ‘open-ssh-session’ in (guix
scripts offload)).

Thanks,
Ludo’.



* Re: Should we document how to detect if build machines are reachable before trying to offload?
  2024-07-11  9:48 ` Ludovic Courtès
@ 2024-07-11 20:23   ` Sergio Pastor Pérez
  2024-07-21 12:56     ` Ludovic Courtès
  0 siblings, 1 reply; 8+ messages in thread
From: Sergio Pastor Pérez @ 2024-07-11 20:23 UTC
  To: Ludovic Courtès; +Cc: guix-devel

Hi Ludo!

> Do you remember exactly under what circumstances it hangs?  I think
> ‘guix offload’ should handle that situation gracefully and we should fix
> it if it does not.

Yeah. It happens when I have a build machine configured like so and I
disconnect it from the Ethernet connection:
--8<---------------cut here---------------start------------->8---
(build-machines
 (list
  #~(build-machine
     (name "remote")
     (systems (list "x86_64-linux" "i686-linux"))
     (host-key %remote-host-key)
     (private-key %local-key))))
--8<---------------cut here---------------end--------------->8---

With this configuration, `guix offload test` times out after 30
seconds, as you describe. But this other command hangs indefinitely:
--8<---------------cut here---------------start------------->8---
$ timeout 1m guix build imhex -M 0
The following derivation will be built:
  /gnu/store/9absqzdd4ak3pms2jw6rkhlmjvm8zzyv-imhex-1.35.1.drv
process 12199 acquired build slot '/var/guix/offload/bordercollie:22/0'
guix offload: error: failed to connect to 'bordercollie': No route to host
waiting for locks or build slots...
process 12199 acquired build slot '/var/guix/offload/bordercollie:22/0'
guix offload: error: failed to connect to 'bordercollie': No route to host
process 12199 acquired build slot '/var/guix/offload/bordercollie:22/0'
guix offload: error: failed to connect to 'bordercollie': No route to host
process 12199 acquired build slot '/var/guix/offload/bordercollie:22/0'
guix offload: error: failed to connect to 'bordercollie': No route to host
process 12199 acquired build slot '/var/guix/offload/bordercollie:22/0'
guix offload: error: failed to connect to 'bordercollie': No route to host
process 12199 acquired build slot '/var/guix/offload/bordercollie:22/0'
guix offload: error: failed to connect to 'bordercollie': No route to host
process 12199 acquired build slot '/var/guix/offload/bordercollie:22/0'
guix offload: error: failed to connect to 'bordercollie': No route to host
process 12199 acquired build slot '/var/guix/offload/bordercollie:22/0'
guix offload: error: failed to connect to 'bordercollie': No route to host
process 12199 acquired build slot '/var/guix/offload/bordercollie:22/0'
guix offload: error: failed to connect to 'bordercollie': No route to host
process 12199 acquired build slot '/var/guix/offload/bordercollie:22/0'
guix offload: error: failed to connect to 'bordercollie': No route to host
process 12199 acquired build slot '/var/guix/offload/bordercollie:22/0'
guix offload: error: failed to connect to 'bordercollie': No route to host
process 12199 acquired build slot '/var/guix/offload/bordercollie:22/0'
--8<---------------cut here---------------end--------------->8---

`imhex` is just a package that has not been merged upstream yet, so no
substitutes are available and the build gets offloaded.

> Right now, it sets an initial connection timeout of 30s, which is quite
> long but turned out to be necessary (see ‘open-ssh-session’ in (guix
> scripts offload)).

If that long timeout is required, I think the snippet I propose to
document will be useful for other users.


Have a good night!
Sergio.



* Re: Should we document how to detect if build machines are reachable before trying to offload?
  2024-07-11 20:23   ` Sergio Pastor Pérez
@ 2024-07-21 12:56     ` Ludovic Courtès
  2024-07-21 15:59       ` Sergio Pastor Pérez
                         ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Ludovic Courtès @ 2024-07-21 12:56 UTC
  To: Sergio Pastor Pérez; +Cc: guix-devel

Hi,

Sergio Pastor Pérez <sergio.pastorperez@outlook.es> skribis:

>> Do you remember exactly under what circumstances it hangs?  I think
>> ‘guix offload’ should handle that situation gracefully and we should fix
>> it if it does not.
>
> Yeah. It happens when I have a build machine configured like so and I
> disconnect it from the Ethernet connection:
>
> (build-machines
>  (list
>   #~(build-machine
>      (name "remote")
>      (systems (list "x86_64-linux" "i686-linux"))
>      (host-key %remote-host-key
>      (private-key %local-key))))
>
>
> With this configuration `guix offload test` will timeout after 30
> seconds, as you describe. But this other command will hang indefinitely:
>
> $ timeout 1m guix build imhex -M 0
> The following derivation will be built:
>   /gnu/store/9absqzdd4ak3pms2jw6rkhlmjvm8zzyv-imhex-1.35.1.drv
> process 12199 acquired build slot '/var/guix/offload/bordercollie:22/0'
> guix offload: error: failed to connect to 'bordercollie': No route to host
> waiting for locks or build slots...
> process 12199 acquired build slot '/var/guix/offload/bordercollie:22/0'
> guix offload: error: failed to connect to 'bordercollie': No route to host

I believe the problem here is that offloading always wants to offload.
That is, when all the machines in /etc/guix/machines.scm are
unavailable, ‘guix offload’ says so to guix-daemon, but then guix-daemon
just keeps retrying (if you had more than one machine in
/etc/guix/machines.scm, one of which is unavailable, ‘guix offload’
would just pick another one.)

I guess this is probably what we should permit: building locally when we
cannot offload.
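
To make the intended behaviour concrete, here is a toy Guile model of
that fallback. It is only an illustration: `pick-builder', `up?' and the
host "other-host" are made-up names, it ignores `-M 0', and it is not how
guix-daemon or ‘guix offload’ are actually implemented.

--8<---------------cut here---------------start------------->8---
;; Toy model only; the real decision is taken by guix-daemon and
;; 'guix offload', whose code is not reproduced here.
(define (pick-builder machines reachable?)
  "Return the first element of MACHINES that satisfies REACHABLE?, or
the symbol 'local' when none does."
  (let loop ((candidates machines))
    (cond ((null? candidates) 'local)   ;nothing reachable: build locally
          ((reachable? (car candidates)) (car candidates))
          (else (loop (cdr candidates))))))

;; Hypothetical reachability test: pretend only "bordercollie" is down.
(define (up? host)
  (not (string=? host "bordercollie")))

(pick-builder '("bordercollie") up?)              ;=> local
(pick-builder '("bordercollie" "other-host") up?) ;=> "other-host"
--8<---------------cut here---------------end--------------->8---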

Does that make sense?

Ludo’.



* Re: Should we document how to detect if build machines are reachable before trying to offload?
  2024-07-21 12:56     ` Ludovic Courtès
@ 2024-07-21 15:59       ` Sergio Pastor Pérez
  2024-07-21 16:26       ` Vincent Legoll
  2024-07-22 14:59       ` Simon Tournier
  2 siblings, 0 replies; 8+ messages in thread
From: Sergio Pastor Pérez @ 2024-07-21 15:59 UTC
  To: Ludovic Courtès; +Cc: guix-devel

Hello.

> I guess this is probably what we should permit: building locally when we
> cannot offload.
>
> Does that make sense?

Yeah, makes sense.

Regards!
Sergio.



* Re: Should we document how to detect if build machines are reachable before trying to offload?
  2024-07-21 12:56     ` Ludovic Courtès
  2024-07-21 15:59       ` Sergio Pastor Pérez
@ 2024-07-21 16:26       ` Vincent Legoll
  2024-07-21 17:25         ` Tomas Volf
  2024-07-22 14:59       ` Simon Tournier
  2 siblings, 1 reply; 8+ messages in thread
From: Vincent Legoll @ 2024-07-21 16:26 UTC
  To: Ludovic Courtès; +Cc: Sergio Pastor Pérez, guix-devel


Hello,

On Sun, Jul 21, 2024 at 12:57 PM Ludovic Courtès <ludo@gnu.org> wrote:

> I guess this is probably what we should permit: building locally when we
> cannot offload.
>
> Does that make sense?
>

What about making "build locally" not a special case, but just
"offloading to localhost" ?

Maybe as an implicit default, so that it would work naturally as today.

And with some way to deny it for people who don't want to build locally
at all, whatever their reason might be.

Would that trim some "build locally"-specific code ?

Is that already how it's done ?

Is the idea crazy / dumb ?

-- 
Vincent Legoll


* Re: Should we document how to detect if build machines are reachable before trying to offload?
  2024-07-21 16:26       ` Vincent Legoll
@ 2024-07-21 17:25         ` Tomas Volf
  0 siblings, 0 replies; 8+ messages in thread
From: Tomas Volf @ 2024-07-21 17:25 UTC
  To: Vincent Legoll; +Cc: Ludovic Courtès, Sergio Pastor Pérez, guix-devel


On 2024-07-21 16:26:41 +0000, Vincent Legoll wrote:
> Hello,
>
> On Sun, Jul 21, 2024 at 12:57 PM Ludovic Courtès <ludo@gnu.org> wrote:
>
> > I guess this is probably what we should permit: building locally when we
> > cannot offload.
> >
> > Does that make sense?
> >
>
> What about making "build locally" not a special case, but just "offloading
> to
> localhost" ?

That will not work without reworking (fixing) the offload mechanism.  (Some?)
flags are currently ignored for offload (--check, --rounds, ...), but you want
to have them working at least locally when you need them.

>
> Maybe as an implicit default, so that it would work naturally as today.
>
> And with some way to deny it for people who don't want to build locally
> at all, whatever their reason might be.
>
> Would that trim some "build locally"-specific code ?
>
> Is that already how it's done ?
>
> Is the idea crazy / dumb ?

No, I like it in general. If nothing else, it would put the offload code
on the critical path, ensuring it fully works.

I wonder what the downsides are (I am sure there are some).

>
> --
> Vincent Legoll

Tomas

--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.


* Re: Should we document how to detect if build machines are reachable before trying to offload?
  2024-07-21 12:56     ` Ludovic Courtès
  2024-07-21 15:59       ` Sergio Pastor Pérez
  2024-07-21 16:26       ` Vincent Legoll
@ 2024-07-22 14:59       ` Simon Tournier
  2 siblings, 0 replies; 8+ messages in thread
From: Simon Tournier @ 2024-07-22 14:59 UTC
  To: Ludovic Courtès, Sergio Pastor Pérez; +Cc: guix-devel

Hi Ludo,

On Sun, 21 Jul 2024 at 14:56, Ludovic Courtès <ludo@gnu.org> wrote:

> I guess this is probably what we should permit: building locally when we
> cannot offload.
>
> Does that make sense?

Yes! This feature seems wanted. ;-)

Besides satisfying requests [1,2], picked at random :-), it would
help close this old report, I guess.

        bug#24496: offloading should fall back to local build after n tries
        ng0 <ngillmann@runbox.com>
        Wed, 21 Sep 2016 09:39:48 +0000
        id:8760ppr3q3.fsf@we.make.ritual.n0.is
        https://issues.guix.gnu.org/24496
        https://issues.guix.gnu.org/msgid/8760ppr3q3.fsf@we.make.ritual.n0.is
        https://yhetil.org/guix/8760ppr3q3.fsf@we.make.ritual.n0.is

Cheers,
simon

1: How to offload builds only when some of the offload build servers are available
Kyle Andrews <kyle@posteo.net>
Sun, 01 May 2022 16:01:21 +0000
id:875ympmdqw.fsf@posteo.net
https://lists.gnu.org/archive/html/help-guix/2022-05
https://yhetil.org/guix/875ympmdqw.fsf@posteo.net

2: guix offload
Aleksandr Vityazev <avityazew@gmail.com>
Wed, 20 Dec 2023 14:02:27 +0300
id:87o7el9lx8.fsf@gmail.com
https://lists.gnu.org/archive/html/help-guix/2023-12
https://yhetil.org/guix/87o7el9lx8.fsf@gmail.com


