unofficial mirror of bug-guix@gnu.org 
* bug#59493: cuirass-remote-worker crash
@ 2022-11-22 22:14 Ludovic Courtès
  2022-11-23  8:08 ` Mathieu Othacehe
  0 siblings, 1 reply; 5+ messages in thread
From: Ludovic Courtès @ 2022-11-22 22:14 UTC (permalink / raw)
  To: 59493; +Cc: Mathieu Othacehe

Hi,

In /var/log/cuirass-remote-worker.log on overdrive1.guix, I found this:

--8<---------------cut here---------------start------------->8---
2022-11-21 14:27:24 Backtrace:
2022-11-21 14:27:24 Backtrace:
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24   1752:10 10 (with-exception-handler _ _ #:unwind? _ # _)
2022-11-21 14:27:24 In unknown file:
2022-11-21 14:27:24            9 (apply-smob/0 #<thunk 3903a300>)
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24     724:2  8 (call-with-prompt _ _ #<procedure default-prompt-handle?>)
2022-11-21 14:27:24 In ice-9/eval.scm:
2022-11-21 14:27:24   1752:10 10 (with-exception-handler _ _ #:unwind? _ # _)
2022-11-21 14:27:24     619:8  7 (_ #(#(#<directory (guile-user) 3903dc80>)))
2022-11-21 14:27:24 In cuirass/ui.scm:
2022-11-21 14:27:24 In unknown file:
2022-11-21 14:27:24            9 (apply-smob/0 #<thunk 3903a300>)
2022-11-21 14:27:24    104:10  6 (run-cuirass-command _ . _)
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24     724:2  8 (call-with-prompt _ _ #<procedure default-prompt-handle?>)
2022-11-21 14:27:24   1752:10  5 (with-exception-handler _ _ #:unwind? _ # _)
2022-11-21 14:27:24 In ice-9/eval.scm:
2022-11-21 14:27:24 In cuirass/scripts/remote-worker.scm:
2022-11-21 14:27:24     619:8  7 (_ #(#(#<directory (guile-user) 3903dc80>)))
2022-11-21 14:27:24 In cuirass/ui.scm:
2022-11-21 14:27:24    104:10  6 (run-cuirass-command _ . _)
2022-11-21 14:27:24    435:12  4 (_)
2022-11-21 14:27:24 In srfi/srfi-1.scm:
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24   1752:10  5 (with-exception-handler _ _ #:unwind? _ # _)
2022-11-21 14:27:24     634:9  3 (for-each #<procedure 398a3510 at cuirass/scripts/remo?> ?)
2022-11-21 14:27:24 In cuirass/scripts/remote-worker.scm:
2022-11-21 14:27:24 In cuirass/scripts/remote-worker.scm:
2022-11-21 14:27:24    448:18  2 (_ _)
2022-11-21 14:27:24    435:12  4 (_)
2022-11-21 14:27:24 In srfi/srfi-1.scm:
2022-11-21 14:27:24     634:9  3 (for-each #<procedure 398a3510 at cuirass/scripts/remo?> ?)
2022-11-21 14:27:24    356:11  1 (start-worker _ _)
2022-11-21 14:27:24 In cuirass/scripts/remote-worker.scm:
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24    448:18  2 (_ _)
2022-11-21 14:27:24   1685:16  0 (raise-exception _ #:continuable? _)
2022-11-21 14:27:24
2022-11-21 14:27:24 ice-9/boot-9.scm:1685:16: In procedure raise-exception:
2022-11-21 14:27:24 Throw to key `match-error' with args `("match" "no matching pattern" (#vu8()))'.
2022-11-21 14:27:24    356:11  1 (start-worker _ _)
2022-11-21 14:27:24 In ice-9/boot-9.scm:
2022-11-21 14:27:24   1685:16  0 (raise-exception _ #:continuable? _)
2022-11-21 14:27:24
2022-11-21 14:27:24 ice-9/boot-9.scm:1685:16: In procedure raise-exception:
2022-11-21 14:27:24 Throw to key `match-error' with args `("match" "no matching pattern" (#vu8()))'.
--8<---------------cut here---------------end--------------->8---

(Stuttering is due to the unprotected use of ‘primitive-fork’: a
non-local exit in the child leads it to execute the same code as its
parent.  We should fix that, but should we really fork in the first
place?  :-))
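
For illustration, here is a minimal sketch of one common way to protect a
‘primitive-fork’ call so that a non-local exit in the child cannot fall
back into the parent's code.  ‘call-in-child’ is a hypothetical helper
name, not something taken from Cuirass, and this is not necessarily how
the worker should be fixed:

```scheme
(use-modules (ice-9 match))

;; Hypothetical helper (not from Cuirass): run THUNK in a child process
;; and make sure the child can never continue into the parent's code.
(define (call-in-child thunk)
  (match (primitive-fork)
    (0
     ;; Child: exit right after THUNK returns.  If THUNK leaves through a
     ;; non-local exit instead, the unwind handler below terminates the
     ;; process so that it does not keep running the parent's code.
     (dynamic-wind
       (const #t)
       (lambda ()
         (thunk)
         (primitive-exit 0))
       (lambda ()
         (primitive-exit 127))))
    (pid
     ;; Parent: return the child's PID so the caller can wait for it.
     pid)))
```

With such a wrapper, an error in the child would be reported once, by
the child only, instead of twice as in the log above.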

This comes from here:

--8<---------------cut here---------------start------------->8---
  (define (read-server-info socket)
    (request-info socket)
    (match (zmq-get-msg-parts-bytevector socket '())   ;<-- here
      ((empty info)
       (match (zmq-read-message (bv->string info))
         (('server-info
           ('worker-address worker-address)
           ('log-port log-port)
           ('publish-port publish-port))
          (list worker-address log-port publish-port))))))
--8<---------------cut here---------------end--------------->8---

This is the version being used:

--8<---------------cut here---------------start------------->8---
ludo@overdrive1 ~$ cat /proc/24019/cmdline |xargs -0
/gnu/store/zpir9n73amaxrwz2k7x46l73v21vxk6s-guile-3.0.8/bin/guile --no-auto-compile -e main -s /gnu/store/rlqdzmfyamjpn6lz07yqk2hsabv3l7g5-cuirass-1.1.0-11.9f08035/bin/.cuirass-real remote-worker --workers=2 --server=10.0.0.1:5555 --systems=armhf-linux,aarch64-linux --publish-port=5558 --substitute-urls=http://10.0.0.1
ludo@overdrive1 ~$ guix system describe
Generation 36   Sep 27 2022 09:06:48    (current)
  file name: /var/guix/profiles/system-36-link
  canonical file name: /gnu/store/m04qw6f0lfd0wpn1skiys4b56wqfc3b8-system
  label: GNU with Linux-Libre 5.19.11
  bootloader: grub-efi
  root device: /dev/sda3
  kernel: /gnu/store/09r4wbbabskmbrnwmshpdk7vh6g87gam-linux-libre-5.19.11/Image
  channels:
    guix:
      repository URL: https://git.savannah.gnu.org/git/guix.git
      commit: f15a141cf35bd4188767f0e91c0654991d4c49e0
  configuration file: /gnu/store/myvzd1kpw2pfzfj3krl4lzpcbqsdn48x-configuration.scm
--8<---------------cut here---------------end--------------->8---

The sequence leading to this seems to be:

--8<---------------cut here---------------start------------->8---
22340 eventfd2(0, EFD_CLOEXEC <unfinished ...>
[…]
22340 <... eventfd2 resumed>)           = 15
[…]
22340 ppoll([{fd=15, events=POLLIN}], 1, NULL, NULL, 0 <unfinished ...>
[…]
22340 <... ppoll resumed>)              = 1 ([{fd=15, revents=POLLIN}])
22343 epoll_pwait(8,  <unfinished ...>
22340 read(15, "\1\0\0\0\0\0\0\0", 8)   = 8
22340 ppoll([{fd=15, events=POLLIN}], 1, {tv_sec=0, tv_nsec=0}, NULL, 0) = 0 (Timeout)
22340 write(2, "Backtrace:\n", 11)      = 11
--8<---------------cut here---------------end--------------->8---

Does that ring a bell?  Perhaps that was fixed in the meantime?

Right now it cannot be restarted: it always fails at start up with the
error above.  10.0.0.1 is reachable though so I’m not sure what’s up.

Ludo’.





* bug#59493: cuirass-remote-worker crash
  2022-11-22 22:14 bug#59493: cuirass-remote-worker crash Ludovic Courtès
@ 2022-11-23  8:08 ` Mathieu Othacehe
  2022-11-23 15:47   ` Ludovic Courtès
  0 siblings, 1 reply; 5+ messages in thread
From: Mathieu Othacehe @ 2022-11-23  8:08 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 59493


Hello Ludo,

Thanks for gathering this information.

> 2022-11-21 14:27:24   1685:16  0 (raise-exception _ #:continuable? _)
> 2022-11-21 14:27:24
> 2022-11-21 14:27:24 ice-9/boot-9.scm:1685:16: In procedure raise-exception:
> 2022-11-21 14:27:24 Throw to key `match-error' with args `("match" "no matching pattern" (#vu8()))'.

Yes this is because a new remote-server is running on Berlin and it
sends an empty sequence at every connection:
https://git.savannah.gnu.org/cgit/guix/guix-cuirass.git/commit/?id=fc1641381d2a8a0472a71ef5ad2b64361faaaab4

All remote-workers must update, and I have deployed Cuirass
1.1.0-13.1341725 on all hydra workers + guix9p.
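
To make the failure mode concrete, here is a small sketch of why the
old worker throws: its ‘match’ only accepts a two-part message, while
the value in the log, (#vu8()), is a one-element list.  The procedure
name ‘parse-server-reply’ and the #f fallback are assumptions made for
illustration; they are not the code in remote-worker.scm:

```scheme
(use-modules (ice-9 match)
             (rnrs bytevectors))

;; Hypothetical stand-in for the reply handling in 'read-server-info'.
;; PARTS is the list of bytevectors making up one received message.
(define (parse-server-reply parts)
  (match parts
    ((empty info)
     ;; Expected shape: a delimiter frame followed by the payload.
     (utf8->string info))
    (_
     ;; Anything else, such as the lone empty frame '(#vu8()) from the
     ;; log, is reported as #f instead of raising 'match-error'.
     #f)))
```

A caller could then skip such a message and read the next one rather
than crash at startup.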

I have been trying to deploy that to overdrive1 for two days, but Berlin
offloads the builds to kreuzberg, which is having issues: a lot of
builds are timing out:

--8<---------------cut here---------------start------------->8---
building of `/gnu/store/9jg75a8rvdz3qxcbbm95312rlc4hyi98-mrustc-0.10-2.597593a-checkout.drv' timed out after 3600 seconds of silence
build of /gnu/store/9jg75a8rvdz3qxcbbm95312rlc4hyi98-mrustc-0.10-2.597593a-checkout.drv failed
View build log at '/var/log/guix/drvs/9j/g75a8rvdz3qxcbbm95312rlc4hyi98-mrustc-0.10-2.597593a-checkout.drv.gz'.
cannot build derivation `/gnu/store/wavx7rl6h93fpmc46nggnhkyxm75lqa4-mrustc-0.10-2.597593a-checkout.drv': 1 dependencies couldn't be built
--8<---------------cut here---------------end--------------->8---

> (Stuttering is due to the unprotected use of ‘primitive-fork’: a
> non-local exit in the child leads it to execute the same code as its
> parent.  We should fix that, but should we really fork in the first
> place?  :-))

Right, this is problematic. I can't remember why I chose to fork.

In the meantime, this should be fixed by updating to 1.1.0-13.1341725 so
we can close this one I guess.

Mathieu





* bug#59493: cuirass-remote-worker crash
  2022-11-23  8:08 ` Mathieu Othacehe
@ 2022-11-23 15:47   ` Ludovic Courtès
  2022-11-23 16:03     ` Mathieu Othacehe
  0 siblings, 1 reply; 5+ messages in thread
From: Ludovic Courtès @ 2022-11-23 15:47 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 59493

Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

>> 2022-11-21 14:27:24   1685:16  0 (raise-exception _ #:continuable? _)
>> 2022-11-21 14:27:24
>> 2022-11-21 14:27:24 ice-9/boot-9.scm:1685:16: In procedure raise-exception:
>> 2022-11-21 14:27:24 Throw to key `match-error' with args `("match" "no matching pattern" (#vu8()))'.
>
> Yes this is because a new remote-server is running on Berlin and it
> sends an empty sequence at every connection:
> https://git.savannah.gnu.org/cgit/guix/guix-cuirass.git/commit/?id=fc1641381d2a8a0472a71ef5ad2b64361faaaab4

Oh I see.  It would be nice to avoid non-backward-compatible changes in
the protocol so we can upgrade more smoothly.

> All remote-workers must update, and I have deployed Cuirass
> 1.1.0-13.1341725 on all hydra workers + guix9p.
>
> I have been trying to deploy that to overdrive1 for two days, but Berlin
> offloads the builds to kreuzberg, which is having issues: a lot of
> builds are timing out:

Done now!

--8<---------------cut here---------------start------------->8---
ludo@overdrive1 ~$ guix system describe
Generation 37   Nov 23 2022 15:58:08    (current)
  file name: /var/guix/profiles/system-37-link
  canonical file name: /gnu/store/62dr875n7i30l375j87flbqfym78kddg-system
  label: GNU with Linux-Libre 6.0.9
  bootloader: grub-efi
  root device: /dev/sda3
  kernel: /gnu/store/p4impcxw8lba8600acrxs21lgzc06xzq-linux-libre-6.0.9/Image
  channels:
    guix:
      repository URL: https://git.savannah.gnu.org/git/guix.git
      commit: 78f03567f44f704dfbc03cb64368aa42a01e78ad
  configuration file: /gnu/store/myvzd1kpw2pfzfj3krl4lzpcbqsdn48x-configuration.scm
--8<---------------cut here---------------end--------------->8---

Running the Shepherd 0.9.3 and all, wonderful.

>> (Stuttering is due to the unprotected use of ‘primitive-fork’: a
>> non-local exit in the child leads it to execute the same code as its
>> parent.  We should fix that, but should we really fork in the first
>> place?  :-))

Fixed in Cuirass commit 9fb6f21d29c5398b35f4c1a77cf6c20f207c9ebb.

> Right, this is problematic. I can't remember why I chose to fork.

One concern is that, in the Avahi case, we create at least one thread
before forking, and as we know that doesn’t work (as in: it might work
sometimes).  ZMQ may also create threads behind our back.

The parent doesn’t call ‘waitpid’ on its children, which isn’t great.
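
As a rough sketch of what reaping the children could look like, assuming
the parent keeps the list of PIDs returned by ‘primitive-fork’
(‘reap-children’ is a hypothetical name, not existing Cuirass code):

```scheme
(use-modules (ice-9 match))

;; Hypothetical helper: block until every process in PIDS has
;; terminated and return an alist mapping each PID to its exit value
;; (#f for a child that did not exit normally, e.g. killed by a signal).
(define (reap-children pids)
  (map (lambda (pid)
         (match (waitpid pid)
           ((pid . status)
            (cons pid (status:exit-val status)))))
       pids))
```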

To me, ideally this would be either multi-threaded or Fiberized.  The
latter would be more fruitful but what might be difficult is
guile-simple-zmq integration with Fibers (but maybe not: zmq_getsockopt
+ ZMQ_FD lets us get the file descriptor of a socket).

Something to consider…

Thanks,
Ludo’.





* bug#59493: cuirass-remote-worker crash
  2022-11-23 15:47   ` Ludovic Courtès
@ 2022-11-23 16:03     ` Mathieu Othacehe
  2022-11-26 15:04       ` Ludovic Courtès
  0 siblings, 1 reply; 5+ messages in thread
From: Mathieu Othacehe @ 2022-11-23 16:03 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 59493-done


Hey,

> Oh I see.  It would be nice to avoid non-backward-compatible changes in
> the protocol so we can upgrade more smoothly.

Right, sorry. We should introduce a protocol version to avoid that in
the future.
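
A possible shape for this, as a sketch only: each message carries a
version field and the receiver checks it before looking at the payload.
The message layout, ‘%protocol-version’ and ‘handle-message’ are all
assumptions for illustration; nothing like this exists in Cuirass yet:

```scheme
(use-modules (ice-9 match))

;; Hypothetical protocol version understood by this peer.
(define %protocol-version 2)

;; Hypothetical dispatcher: SEXP is a decoded message of the assumed
;; form (message (version N) PAYLOAD).
(define (handle-message sexp)
  (match sexp
    (('message ('version (? integer? version)) payload)
     (if (= version %protocol-version)
         `(ok ,payload)
         ;; Report the mismatch explicitly instead of failing later in
         ;; an unrelated 'match'.
         `(unsupported-version ,version)))
    (_
     '(malformed))))
```

An old worker talking to a new server would then get an explicit
"unsupported version" result rather than a ‘match-error’ backtrace.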

> Fixed in Cuirass commit 9fb6f21d29c5398b35f4c1a77cf6c20f207c9ebb.

Awesome, thanks :)

> To me, ideally this would be either multi-threaded or Fiberized.  The
> latter would be more fruitful but what might be difficult is
> guile-simple-zmq integration with Fibers (but maybe not: zmq_getsockopt
> + ZMQ_FD lets us get the file descriptor of a socket).

I would prefer the multi-threaded approach if possible. While the
concept of Fibers is nice, it adds another layer of complexity and
instability to those programs, which are already hard to debug.

Mathieu





* bug#59493: cuirass-remote-worker crash
  2022-11-23 16:03     ` Mathieu Othacehe
@ 2022-11-26 15:04       ` Ludovic Courtès
  0 siblings, 0 replies; 5+ messages in thread
From: Ludovic Courtès @ 2022-11-26 15:04 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 59493-done

Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

>> To me, ideally this would be either multi-threaded or Fiberized.  The
>> latter would be more fruitful but what might be difficult is
>> guile-simple-zmq integration with Fibers (but maybe not: zmq_getsockopt
>> + ZMQ_FD lets us get the file descriptor of a socket).
>
> I would prefer the multi-threaded approach if possible. While the
> concept of Fibers is nice, it adds another layer of complexity and
> instability to those programs, which are already hard to debug.

I guess it’s not black and white.  Shared-state multithreading is an
endless source of bugs, regardless of the language being used;
message-passing (what Fibers is about) is more tractable.

Sure, Fibers can have bugs of its own (I’m well aware of that :-)) but at
least Fiber-using code can be simpler and less error-ridden than the
equivalent shared-state code.
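
To illustrate the message-passing style, here is a toy sketch that
assumes Guile-Fibers is installed; ‘sum-in-fiber’ is a made-up example
and has nothing to do with the worker's actual logic.  One fiber
computes a value and hands it over through a channel, with no shared
mutable state and no locks:

```scheme
(use-modules (fibers)
             (fibers channels))

;; Toy example: compute the sum of NUMBERS in a separate fiber and
;; receive the result over a channel.
(define (sum-in-fiber numbers)
  (run-fibers
   (lambda ()
     (let ((channel (make-channel)))
       ;; Producer fiber: sends exactly one message, then finishes.
       (spawn-fiber
        (lambda ()
          (put-message channel (apply + numbers))))
       ;; Consumer: suspends until the producer's message arrives.
       (get-message channel)))))
```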

Anyway, we’re not there yet.

Can you remember the rationale for forking in remote-worker.scm, or do
you think we might as well do it all in a single process?

Thanks,
Ludo’.




