all messages for Guix-related lists mirrored at yhetil.org
* bug#67988: [Cuirass] ‘request-work’ responses received by several workers
@ 2023-12-23  9:13 Ludovic Courtès
  2024-05-28 21:50 ` Ludovic Courtès
  2024-05-31 19:55 ` Ludovic Courtès
  0 siblings, 2 replies; 3+ messages in thread
From: Ludovic Courtès @ 2023-12-23  9:13 UTC (permalink / raw)
  To: 67988

Hello,

I’m under the impression that sometimes, when the server replies to
‘worker-request-work’ messages, its reply is received by more than just
the target worker, leading to builds being performed twice:

--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ sudo grep lyhz5d1jb396m32dy0fs9h8vqzw95ddp /var/log/cuirass-remote-server.log
2023-12-23 00:15:29 141.80.167.184 (0LFowqzr): build started: '/gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv'.
2023-12-23 00:18:41 fetching 1 outputs of '/gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv' from http://141.80.167.184:5558
2023-12-23 00:18:45 build succeeded: '/gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv'
2023-12-23 00:21:20 141.80.167.159 (oNzYXCv5): build started: '/gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv'.
2023-12-23 00:24:31 fetching 1 outputs of '/gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv' from http://141.80.167.159:5558
2023-12-23 00:24:32 build succeeded: '/gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv'
ludo@berlin ~$ sudo ssh root@141.80.167.184 grep lyhz5d1jb396m32dy0fs9h8vqzw95ddp /var/log/cuirass-remote-worker.log
2023-12-23 00:12:32 0LFowqzr: building derivation `/gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv' (system: x86_64-linux)
2023-12-23 00:12:54 0LFowqzr: derivation /gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv build succeeded.
ludo@berlin ~$ sudo ssh root@141.80.167.159 grep lyhz5d1jb396m32dy0fs9h8vqzw95ddp /var/log/cuirass-remote-worker.log
2023-12-23 00:17:51 oNzYXCv5: building derivation `/gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv' (system: x86_64-linux)
2023-12-23 00:18:17 oNzYXCv5: derivation /gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv build succeeded.
--8<---------------cut here---------------end--------------->8---
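To spot such duplicates without grepping for one derivation at a time, the server log can be scanned for derivations whose build was started by more than one worker. Below is a minimal Python sketch. It assumes every ‘build started’ line has exactly the shape shown in the excerpt above (timestamp, worker address, worker ID in parentheses, quoted derivation). It is not a tool that ships with Cuirass, and the script name in the usage comment is hypothetical.

```python
import re
from collections import defaultdict

# Assumed shape of a "build started" line in cuirass-remote-server.log,
# taken from the excerpt above:
#   DATE TIME ADDRESS (WORKER-ID): build started: 'DERIVATION'.
BUILD_STARTED = re.compile(
    r"^(\S+ \S+) (\S+) \(([^)]+)\): build started: '([^']+)'"
)


def duplicate_builds(lines):
    """Return {derivation: [(timestamp, address, worker), ...]} for every
    derivation whose build was started by more than one distinct worker."""
    starts = defaultdict(list)
    for line in lines:
        m = BUILD_STARTED.match(line)
        if m:
            timestamp, address, worker, drv = m.groups()
            starts[drv].append((timestamp, address, worker))
    return {
        drv: entries
        for drv, entries in starts.items()
        if len({worker for _, _, worker in entries}) > 1
    }


if __name__ == "__main__":
    import sys

    # Hypothetical usage:
    #   sudo python3 find-duplicate-builds.py /var/log/cuirass-remote-server.log
    for path in sys.argv[1:]:
        with open(path) as log:
            for drv, entries in duplicate_builds(log).items():
                print(drv)
                for timestamp, address, worker in entries:
                    print(f"  {timestamp}  {address} ({worker})")
```

The sketch only reports derivations started by two or more distinct worker IDs. A worker that restarts the same build itself is therefore not flagged.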

This is with Cuirass 1.2.0-1.bdc1f9f.

To be continued…

Ludo’.





* bug#67988: [Cuirass] ‘request-work’ responses received by several workers
From: Ludovic Courtès @ 2024-05-28 21:50 UTC (permalink / raw)
  To: 67988

Ludovic Courtès <ludovic.courtes@inria.fr> wrote:

> I’m under the impression that sometimes, when the server replies to
> ‘worker-request-work’ messages, its reply is received by more than just
> the target worker, leading to builds being performed twice:

Seen again:

--8<---------------cut here---------------start------------->8---
ludo@guix-hpc4 ~/src/cuirass$ sudo grep  nmhvrka9i4qng54w3d478j1lsp9dn7r7 /var/log/cuirass-remote-server.log
2024-05-28 21:31:43 194.199.1.26 (PajrOfGX): build started: '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv'.
2024-05-28 21:34:22 194.199.1.27 (exataaY9): build started: '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv'.
2024-05-28 21:38:32 194.199.1.17 (DIwFaVSn): build started: '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv'.
2024-05-28 22:16:13 fetching 1 outputs of '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv' from http://194.199.1.26:5558
2024-05-28 22:16:18 build succeeded: '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv'
2024-05-28 22:53:49 fetching 1 outputs of '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv' from http://194.199.1.27:5558
2024-05-28 22:53:49 build succeeded: '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv'
2024-05-28 23:03:50 fetching 1 outputs of '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv' from http://194.199.1.17:5558
2024-05-28 23:03:50 build succeeded: '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv'
--8<---------------cut here---------------end--------------->8---

And on the workers:

--8<---------------cut here---------------start------------->8---
$ ssh root@guix-hpc3 grep nmhvrka9i4qng54w3d478j1lsp9dn7r7 /var/log/cuirass-remote-worker.log
2024-05-28 21:57:43 DIwFaVSn: building derivation `/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv' (system: x86_64-linux)
2024-05-28 23:22:58 DIwFaVSn: derivation /gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv build succeeded.
$ ssh root@guix-hpc5 grep nmhvrka9i4qng54w3d478j1lsp9dn7r7 /var/log/cuirass-remote-worker.log
2024-05-28 21:34:13 PajrOfGX: building derivation `/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv' (system: x86_64-linux)
2024-05-28 22:18:40 PajrOfGX: derivation /gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv build succeeded.
$ ssh root@guix-hpc7 grep nmhvrka9i4qng54w3d478j1lsp9dn7r7 /var/log/cuirass-remote-worker.log
2024-05-28 21:34:11 exataaY9: building derivation `/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv' (system: x86_64-linux)
2024-05-28 22:53:35 exataaY9: derivation /gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv build succeeded.
--8<---------------cut here---------------end--------------->8---
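The worker logs also show that the three Firefox builds overlapped in time rather than running one after another. Below is a minimal Python sketch that checks this automatically. It assumes the ‘building derivation’ and ‘build succeeded’ lines have the shapes shown above. It also assumes the machines' clocks are roughly in sync, which the excerpts suggest is only approximately true: the server logs fetching PajrOfGX's output at 22:16:13, while that worker logs success at 22:18:40. Short overlaps should therefore be taken with a grain of salt.

```python
import re
from datetime import datetime
from itertools import combinations

# Assumed line shapes in cuirass-remote-worker.log, taken from the excerpts above.
STARTED = re.compile(r"^(\S+ \S+) ([^:\s]+): building derivation `([^']+)'")
SUCCEEDED = re.compile(r"^(\S+ \S+) ([^:\s]+): derivation (\S+) build succeeded\.")


def parse_time(text):
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def build_intervals(lines):
    """Map (worker, derivation) to (start, end), pairing each start line
    with the matching success line; unfinished builds are left out."""
    starts, intervals = {}, {}
    for line in lines:
        if m := STARTED.match(line):
            when, worker, drv = m.groups()
            starts[(worker, drv)] = parse_time(when)
        elif m := SUCCEEDED.match(line):
            when, worker, drv = m.groups()
            if (worker, drv) in starts:
                intervals[(worker, drv)] = (starts.pop((worker, drv)), parse_time(when))
    return intervals


def concurrent_builds(intervals):
    """Yield (derivation, worker_a, worker_b) for every pair of builds of
    the same derivation whose time intervals overlap."""
    by_drv = {}
    for (worker, drv), span in intervals.items():
        by_drv.setdefault(drv, []).append((worker, span))
    for drv, builds in by_drv.items():
        for (wa, (sa, ea)), (wb, (sb, eb)) in combinations(builds, 2):
            if sa < eb and sb < ea:  # the two intervals overlap
                yield drv, wa, wb
```

If you feed it the combined output of the three ‘grep’ commands above, it reports all three pairs among DIwFaVSn, PajrOfGX, and exataaY9 as concurrent. The "$ ssh ..." command lines simply do not match either pattern.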

Ludo’.





* bug#67988: [Cuirass] ‘request-work’ responses received by several workers
From: Ludovic Courtès @ 2024-05-31 19:55 UTC (permalink / raw)
  To: 67988

Ludovic Courtès <ludovic.courtes@inria.fr> wrote:

> I’m under the impression that sometimes, when the server replies to
> ‘worker-request-work’ messages, its reply is received by more than just
> the target worker, leading to builds being performed twice:

On closer inspection, the theory of the message being received by two
different peers doesn’t hold.

Instead, I believe ‘db-get-pending-build’ would return the same build at
two different points in time, typically while the first copy of that
build was still running.

That’s normally not possible because the build’s status is changed to
‘submitted’ once it’s been picked up.  It turns out that, because the
query in ‘db-get-pending-build’ was slow (fixed in commit
17338588d4862b04e9e405c1244a2ea703b50d98), ‘remote-server’ would
sometimes fail to see worker pings in a timely fashion.  It would then
call ‘db-remove-unresponsive-workers’, which would reschedule the builds
those workers were carrying out, even though they were still building
them.  That is how we ended up with multiple concurrent builds of the
same derivation.
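The failure mode described above can be reproduced with a toy model. This is a hypothetical, heavily simplified Python sketch, not Cuirass code (Cuirass is written in Guile Scheme and keeps this state in its database). The class, the method names (loosely modeled on ‘db-get-pending-build’ and ‘db-remove-unresponsive-workers’), the status strings, and the 120-second timeout are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Build:
    derivation: str
    status: str = "scheduled"      # "scheduled" (pending) or "submitted" (picked up)
    worker: Optional[str] = None


@dataclass
class ToyServer:
    """Hypothetical, heavily simplified model of the remote build server."""
    timeout: float                 # seconds without a processed ping before a worker is dropped
    builds: List[Build] = field(default_factory=list)
    last_ping: Dict[str, float] = field(default_factory=dict)

    def ping(self, worker: str, now: float) -> None:
        # Only pings the server actually gets around to processing are recorded.
        self.last_ping[worker] = now

    def request_work(self, worker: str, now: float) -> Optional[str]:
        """Hand out the first pending build and mark it as submitted."""
        self.ping(worker, now)
        for build in self.builds:
            if build.status == "scheduled":
                build.status, build.worker = "submitted", worker
                return build.derivation
        return None

    def remove_unresponsive_workers(self, now: float) -> List[str]:
        """Drop workers whose last processed ping is too old; reschedule their builds."""
        stale = [w for w, seen in self.last_ping.items() if now - seen > self.timeout]
        for worker in stale:
            del self.last_ping[worker]
            for build in self.builds:
                if build.worker == worker and build.status == "submitted":
                    build.status, build.worker = "scheduled", None
        return stale


server = ToyServer(timeout=120, builds=[Build("firefox.drv")])
a = server.request_work("A", now=0)     # A picks up firefox.drv and starts building
# A keeps pinging, but the server is stuck in a slow query from t=10 to t=200,
# so none of those pings is processed before the next unresponsiveness check.
dropped = server.remove_unresponsive_workers(now=200)   # ["A"]; firefox.drv is rescheduled
b = server.request_work("B", now=201)   # B is handed firefox.drv while A is still on it
```

In this model, worker A is never told that its build was taken away. A single false "unresponsive" verdict is therefore enough to turn one build into two. If A's ping at, say, t=150 had been processed in time, the check at t=200 would drop nobody and B would get no work.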

I added logging in commit c2061ca845d05694ebeb88935a6ff2254711beb2,
which should give us a hint should this happen again.

Ludo’.





