* bug#67988: [Cuirass] ‘request-work’ responses received by several workers
From: Ludovic Courtès @ 2023-12-23 9:13 UTC
To: 67988
Hello,
I’m under the impression that sometimes, when the server replies to
‘worker-request-work’ messages, its reply is received by more than just
the target worker, leading to builds being performed twice:
--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ sudo grep lyhz5d1jb396m32dy0fs9h8vqzw95ddp /var/log/cuirass-remote-server.log
2023-12-23 00:15:29 141.80.167.184 (0LFowqzr): build started: '/gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv'.
2023-12-23 00:18:41 fetching 1 outputs of '/gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv' from http://141.80.167.184:5558
2023-12-23 00:18:45 build succeeded: '/gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv'
2023-12-23 00:21:20 141.80.167.159 (oNzYXCv5): build started: '/gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv'.
2023-12-23 00:24:31 fetching 1 outputs of '/gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv' from http://141.80.167.159:5558
2023-12-23 00:24:32 build succeeded: '/gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv'
ludo@berlin ~$ sudo ssh root@141.80.167.184 grep lyhz5d1jb396m32dy0fs9h8vqzw95ddp /var/log/cuirass-remote-worker.log
2023-12-23 00:12:32 0LFowqzr: building derivation `/gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv' (system: x86_64-linux)
2023-12-23 00:12:54 0LFowqzr: derivation /gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv build succeeded.
ludo@berlin ~$ sudo ssh root@141.80.167.159 grep lyhz5d1jb396m32dy0fs9h8vqzw95ddp /var/log/cuirass-remote-worker.log
2023-12-23 00:17:51 oNzYXCv5: building derivation `/gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv' (system: x86_64-linux)
2023-12-23 00:18:17 oNzYXCv5: derivation /gnu/store/lyhz5d1jb396m32dy0fs9h8vqzw95ddp-cdrdao-1.2.5.drv build succeeded.
--8<---------------cut here---------------end--------------->8---
This is with Cuirass 1.2.0-1.bdc1f9f.
To be continued…
Ludo’.
From: Ludovic Courtès @ 2024-05-28 21:50 UTC
To: 67988
Ludovic Courtès <ludovic.courtes@inria.fr> skribis:
> I’m under the impression that sometimes, when the server replies to
> ‘worker-request-work’ messages, its reply is received by more than just
> the target worker, leading to builds being performed twice:
Seen again:
--8<---------------cut here---------------start------------->8---
ludo@guix-hpc4 ~/src/cuirass$ sudo grep nmhvrka9i4qng54w3d478j1lsp9dn7r7 /var/log/cuirass-remote-server.log
2024-05-28 21:31:43 194.199.1.26 (PajrOfGX): build started: '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv'.
2024-05-28 21:34:22 194.199.1.27 (exataaY9): build started: '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv'.
2024-05-28 21:38:32 194.199.1.17 (DIwFaVSn): build started: '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv'.
2024-05-28 22:16:13 fetching 1 outputs of '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv' from http://194.199.1.26:5558
2024-05-28 22:16:18 build succeeded: '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv'
2024-05-28 22:53:49 fetching 1 outputs of '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv' from http://194.199.1.27:5558
2024-05-28 22:53:49 build succeeded: '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv'
2024-05-28 23:03:50 fetching 1 outputs of '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv' from http://194.199.1.17:5558
2024-05-28 23:03:50 build succeeded: '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv'
--8<---------------cut here---------------end--------------->8---
And on workers:
--8<---------------cut here---------------start------------->8---
$ ssh root@guix-hpc3 grep nmhvrka9i4qng54w3d478j1lsp9dn7r7 /var/log/cuirass-remote-worker.log
2024-05-28 21:57:43 DIwFaVSn: building derivation `/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv' (system: x86_64-linux)
2024-05-28 23:22:58 DIwFaVSn: derivation /gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv build succeeded.
$ ssh root@guix-hpc5 grep nmhvrka9i4qng54w3d478j1lsp9dn7r7 /var/log/cuirass-remote-worker.log
2024-05-28 21:34:13 PajrOfGX: building derivation `/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv' (system: x86_64-linux)
2024-05-28 22:18:40 PajrOfGX: derivation /gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv build succeeded.
$ ssh root@guix-hpc7 grep nmhvrka9i4qng54w3d478j1lsp9dn7r7 /var/log/cuirass-remote-worker.log
2024-05-28 21:34:11 exataaY9: building derivation `/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv' (system: x86_64-linux)
2024-05-28 22:53:35 exataaY9: derivation /gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv build succeeded.
--8<---------------cut here---------------end--------------->8---
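Duplicate dispatches like the ones in the server log can be spotted mechanically instead of grepping one store hash at a time. The following is a minimal Python sketch, not part of Cuirass: the regular expression is modeled only on the ‘build started’ lines quoted in this thread, and the names `STARTED`, `duplicate_starts` and `SAMPLE` are made up for illustration. A derivation that was legitimately retried after a failure would also be reported, so hits still need a manual look.

```python
import re
from collections import defaultdict

# Pattern modeled on the "build started" lines quoted in this thread;
# other Cuirass versions may format these lines differently (assumption).
STARTED = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<ip>\S+) \((?P<worker>\w+)\): build started: '(?P<drv>[^']+)'"
)

def duplicate_starts(lines):
    """Map each derivation started more than once to its (timestamp, worker) list."""
    starts = defaultdict(list)
    for line in lines:
        m = STARTED.match(line)
        if m:
            starts[m["drv"]].append((m["ts"], m["worker"]))
    return {drv: hits for drv, hits in starts.items() if len(hits) > 1}

# Lines copied from the server log excerpt in this message.
SAMPLE = [
    "2024-05-28 21:31:43 194.199.1.26 (PajrOfGX): build started: '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv'.",
    "2024-05-28 21:34:22 194.199.1.27 (exataaY9): build started: '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv'.",
    "2024-05-28 21:38:32 194.199.1.17 (DIwFaVSn): build started: '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv'.",
    "2024-05-28 22:16:13 fetching 1 outputs of '/gnu/store/nmhvrka9i4qng54w3d478j1lsp9dn7r7-firefox-126.0.1.drv' from http://194.199.1.26:5558",
]

if __name__ == "__main__":
    for drv, hits in duplicate_starts(SAMPLE).items():
        print(drv)
        for ts, worker in hits:
            print(f"  {ts}  {worker}")
```

To scan a real log, pass an open file object instead of `SAMPLE`, for example `duplicate_starts(open("/var/log/cuirass-remote-server.log"))`.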
Ludo’.
From: Ludovic Courtès @ 2024-05-31 19:55 UTC
To: 67988
Ludovic Courtès <ludovic.courtes@inria.fr> skribis:
> I’m under the impression that sometimes, when the server replies to
> ‘worker-request-work’ messages, its reply is received by more than just
> the target worker, leading to builds being performed twice:
On closer inspection, the theory of the message being received by two
different peers doesn’t hold.
Instead, I believe ‘db-get-pending-build’ would return the same build at
two different points in time, typically while the first run of that
build is still in progress.
That’s normally not possible because the build’s status is changed to
‘submitted’ once it’s been picked up. Turns out that, due to slowness
of the query in ‘db-get-pending-build’ (fixed in
17338588d4862b04e9e405c1244a2ea703b50d98), ‘remote-server’ would
sometimes fail to see worker pings in a timely fashion. Thus, it would
call ‘db-remove-unresponsive-workers’, which would reschedule builds
that were being carried out by said worker(s). And that’s how we would
end up with multiple concurrent builds of the same derivation.
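The sequence described above can be replayed with a toy model. This is a Python sketch under stated assumptions, not the Cuirass code (which is written in Guile): the method names only echo ‘db-get-pending-build’ and ‘db-remove-unresponsive-workers’, while the 120-second timeout, the 30-second ping delay, the derivation name and the order in which the server handles events after a stall are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

WORKER_TIMEOUT = 120  # assumed: seconds without a ping before a worker is dropped

@dataclass
class Build:
    drv: str
    status: str = "scheduled"      # becomes "submitted" once picked up
    worker: Optional[str] = None

@dataclass
class Server:
    builds: list
    last_seen: dict = field(default_factory=dict)
    started: list = field(default_factory=list)  # (time, worker, drv)

    def get_pending_build(self):
        # Only builds still marked "scheduled" are eligible.
        return next((b for b in self.builds if b.status == "scheduled"), None)

    def ping(self, now, worker):
        self.last_seen[worker] = now

    def request_work(self, now, worker):
        self.ping(now, worker)
        build = self.get_pending_build()
        if build is not None:
            build.status, build.worker = "submitted", worker
            self.started.append((now, worker, build.drv))

    def remove_unresponsive_workers(self, now):
        # Drop silent workers and put their builds back in the queue.
        for worker, seen in list(self.last_seen.items()):
            if now - seen > WORKER_TIMEOUT:
                del self.last_seen[worker]
                for b in self.builds:
                    if b.worker == worker and b.status == "submitted":
                        b.status, b.worker = "scheduled", None

def simulate(query_stall):
    """Return the list of build starts when the server stalls for QUERY_STALL seconds."""
    server = Server([Build("example.drv")])      # hypothetical derivation name
    server.request_work(0, "worker-A")           # worker-A picks up the build
    # worker-A pings at t=30, but the server is stuck in a slow query and
    # only gets back to its event loop at t = 30 + query_stall.
    resume = 30 + query_stall
    server.remove_unresponsive_workers(resume)   # cleanup runs first (assumed ordering)
    server.ping(resume, "worker-A")              # the queued ping is handled too late
    server.request_work(resume + 1, "worker-B")
    return server.started

if __name__ == "__main__":
    print("short stall:", simulate(5))
    print("long stall: ", simulate(200))
```

With a short stall the build is handed out once; with a stall longer than the timeout, worker-A is dropped while still building, the build goes back to ‘scheduled’, and worker-B receives the same derivation.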
I added logging in c2061ca845d05694ebeb88935a6ff2254711beb2, which
should give a hint, should that happen again.
Ludo’.