all messages for Guix-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed
* bug#48468: substitute server connection timeout
@ 2021-05-16 17:57 Mathieu Othacehe
  2021-05-16 18:26 ` Christopher Baines
  2021-06-18 12:33 ` Ludovic Courtès
  0 siblings, 2 replies; 20+ messages in thread
From: Mathieu Othacehe @ 2021-05-16 17:57 UTC (permalink / raw)
  To: 48468


Hello,

We recently have a lot of those errors on Cuirass:

--8<---------------cut here---------------start------------->8---
guix substitute: warning: while fetching http://141.80.167.131:5557/nar/g7ka09613k5v1vlznh87yg35905ggw51-python2-scipy-1.2.2-guile-builder: server is somewhat slow
guix substitute: warning: try `--no-substitutes' if the problem persists
guix substitute: error: connect*: Connection timed out
--8<---------------cut here---------------end--------------->8---

which means that the workers are failing to connect to the Cuirass
remote-server publish server on berlin at 141.80.167.131:5557.

Stracing this publish server shows that connection reuse seems to be
broken:

--8<---------------cut here---------------start------------->8---
accept4(9, {sa_family=AF_INET, sin_port=htons(41742), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41744), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41746), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 25
accept4(9, {sa_family=AF_INET, sin_port=htons(41748), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 24
accept4(9, {sa_family=AF_INET, sin_port=htons(41750), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41752), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41754), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 25
accept4(9, {sa_family=AF_INET, sin_port=htons(41756), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41758), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 26
accept4(9, {sa_family=AF_INET, sin_port=htons(41760), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 24
accept4(9, {sa_family=AF_INET, sin_port=htons(41762), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41764), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41766), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41768), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 22
accept4(9, {sa_family=AF_INET, sin_port=htons(41770), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41772), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41774), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41776), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41778), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41780), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41782), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
--8<---------------cut here---------------end--------------->8---

Investigating it, I found that the connection is closed and opened
multiple times in the call-with-cached-connection procedure of the (guix
script substitute) module.

It looks like its because a 'bad-headers exception is raised when trying
to parse an eof object:

--8<---------------cut here---------------start------------->8---
;;; (error bad-header (read-header-line #<eof>))
--8<---------------cut here---------------end--------------->8---

I'm not sure where this eof comes from. There is this comment in the
http-multiple-get procedure in (guix http-client):

--8<---------------cut here---------------start------------->8---
;; Swallow networking errors that could occur due to connection reuse
;; and the like; they will be handled down the road when trying to
;; read responses.
(false-if-networking-error
 (begin
   (for-each (cut write-request <> buffer) batch)
   (put-bytevector p (get))
   (force-output p))))
--8<---------------cut here---------------end--------------->8---

which would suggest that connection reuse could cause networking errors?

What also puzzles me it that the main guix publish server on berlin does
not seem to present this issue. That would indicate that this error is
caused by how the Cuirass remote-server publish server is started or
configured.

Ludo, Chris, any idea?

I will keep searching anyway :)

Thanks,

Mathieu




^ permalink raw reply	[flat|nested] 20+ messages in thread

* bug#48468: substitute server connection timeout
  2021-05-16 17:57 bug#48468: substitute server connection timeout Mathieu Othacehe
@ 2021-05-16 18:26 ` Christopher Baines
  2021-05-17 14:49   ` Mathieu Othacehe
  2021-05-29 21:44   ` Ludovic Courtès
  2021-06-18 12:33 ` Ludovic Courtès
  1 sibling, 2 replies; 20+ messages in thread
From: Christopher Baines @ 2021-05-16 18:26 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 48468

[-- Attachment #1: Type: text/plain, Size: 5546 bytes --]


Mathieu Othacehe <othacehe@gnu.org> writes:

> Hello,
>
> We recently have a lot of those errors on Cuirass:
>
> --8<---------------cut here---------------start------------->8---
> guix substitute: warning: while fetching http://141.80.167.131:5557/nar/g7ka09613k5v1vlznh87yg35905ggw51-python2-scipy-1.2.2-guile-builder: server is somewhat slow
> guix substitute: warning: try `--no-substitutes' if the problem persists
> guix substitute: error: connect*: Connection timed out
> --8<---------------cut here---------------end--------------->8---
>
> which means that the workers are failing to connect to the Cuirass
> remote-server publish server on berlin at 141.80.167.131:5557.
>
> Stracing this publish server shows that connection reuse seems to be
> broken:
>
> --8<---------------cut here---------------start------------->8---
> accept4(9, {sa_family=AF_INET, sin_port=htons(41742), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41744), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41746), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 25
> accept4(9, {sa_family=AF_INET, sin_port=htons(41748), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 24
> accept4(9, {sa_family=AF_INET, sin_port=htons(41750), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41752), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41754), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 25
> accept4(9, {sa_family=AF_INET, sin_port=htons(41756), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41758), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 26
> accept4(9, {sa_family=AF_INET, sin_port=htons(41760), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 24
> accept4(9, {sa_family=AF_INET, sin_port=htons(41762), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41764), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41766), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41768), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 22
> accept4(9, {sa_family=AF_INET, sin_port=htons(41770), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41772), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41774), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41776), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41778), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41780), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41782), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> --8<---------------cut here---------------end--------------->8---
>
> Investigating it, I found that the connection is closed and opened
> multiple times in the call-with-cached-connection procedure of the (guix
> script substitute) module.
>
> It looks like its because a 'bad-headers exception is raised when trying
> to parse an eof object:
>
> --8<---------------cut here---------------start------------->8---
> ;;; (error bad-header (read-header-line #<eof>))
> --8<---------------cut here---------------end--------------->8---
>
> I'm not sure where this eof comes from. There is this comment in the
> http-multiple-get procedure in (guix http-client):
>
> --8<---------------cut here---------------start------------->8---
> ;; Swallow networking errors that could occur due to connection reuse
> ;; and the like; they will be handled down the road when trying to
> ;; read responses.
> (false-if-networking-error
>  (begin
>    (for-each (cut write-request <> buffer) batch)
>    (put-bytevector p (get))
>    (force-output p))))
> --8<---------------cut here---------------end--------------->8---
>
> which would suggest that connection reuse could cause networking errors?
>
> What also puzzles me it that the main guix publish server on berlin does
> not seem to present this issue. That would indicate that this error is
> caused by how the Cuirass remote-server publish server is started or
> configured.
>
> Ludo, Chris, any idea?

While I've been working in this area, I've actually been trying to pick
apart the connection caching, since the single thread assumption doesn't
hold in the Guix Build Coordinator.

Anyway, I do have a theory. Assuming I'm correct in saying that there's
no nginx between the client and publish server here, I think that's your
configuration difference.

For ci.guix.gnu.org, as well as data.guix.gnu.org, it's NGinx which is
keeping connections alive. I'm not sure the Guile code for the publish
server does similarly, so talking to it directly might be different.

That's on the server side, the actual problem is probably on the client
side, as I guess there are possibly places where closed connections
aren't handled properly. This reminds me I sent some patches relating to
closing connections, this could well be related [1].

1: https://issues.guix.gnu.org/47174

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 987 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* bug#48468: substitute server connection timeout
  2021-05-16 18:26 ` Christopher Baines
@ 2021-05-17 14:49   ` Mathieu Othacehe
  2021-05-18 14:35     ` Mathieu Othacehe
  2021-05-29 21:44   ` Ludovic Courtès
  1 sibling, 1 reply; 20+ messages in thread
From: Mathieu Othacehe @ 2021-05-17 14:49 UTC (permalink / raw)
  To: Christopher Baines; +Cc: 48468


Hey,

> That's on the server side, the actual problem is probably on the client
> side, as I guess there are possibly places where closed connections
> aren't handled properly. This reminds me I sent some patches relating to
> closing connections, this could well be related [1].

Oh, you're right, the Nginx server probably makes the difference
here. The http-write procedure of the Guile (web server http) module
seems to handle keep-alive connections. However, the wrapping http-write
procedure of (guix scripts publish) may not.

I'll have a closer look, thanks for your help.

Mathieu




^ permalink raw reply	[flat|nested] 20+ messages in thread

* bug#48468: substitute server connection timeout
  2021-05-17 14:49   ` Mathieu Othacehe
@ 2021-05-18 14:35     ` Mathieu Othacehe
  2021-05-21 13:30       ` Mathieu Othacehe
  0 siblings, 1 reply; 20+ messages in thread
From: Mathieu Othacehe @ 2021-05-18 14:35 UTC (permalink / raw)
  To: Christopher Baines; +Cc: 48468


Hey,

> I'll have a closer look, thanks for your help.

So this snippet in the http-write procedure of the (guix scripts
publish) module:

--8<---------------cut here---------------start------------->8---
          (swallow-zlib-error
           (close-port port))
--8<---------------cut here---------------end--------------->8---

is closing the client port unconditionally, which means that guix
publish cannot keep connections alive, unless sitting behind an Nginx
proxy.

I'm trying to turn the close-port call into a maybe-close-port with the
following procedure:

--8<---------------cut here---------------start------------->8---
(define (maybe-close-port port)
    (cond
     ((keep-alive? response)
      (poll-set-add! (http-poll-set server) port *events*))
     (else
      (close-port port))))
--8<---------------cut here---------------end--------------->8---

however this is terribly hacky, as I need to access the private poll-set
from (web server http).

Ludo, do you have a better idea?

Thanks,

Mathieu




^ permalink raw reply	[flat|nested] 20+ messages in thread

* bug#48468: substitute server connection timeout
  2021-05-18 14:35     ` Mathieu Othacehe
@ 2021-05-21 13:30       ` Mathieu Othacehe
  0 siblings, 0 replies; 20+ messages in thread
From: Mathieu Othacehe @ 2021-05-21 13:30 UTC (permalink / raw)
  To: Christopher Baines; +Cc: 48468


Hey,

I posted a patchset adding keep-alive support to guix publish earlier:
https://issues.guix.gnu.org/48556.

Thanks,

Mathieu




^ permalink raw reply	[flat|nested] 20+ messages in thread

* bug#48468: substitute server connection timeout
  2021-05-16 18:26 ` Christopher Baines
  2021-05-17 14:49   ` Mathieu Othacehe
@ 2021-05-29 21:44   ` Ludovic Courtès
  1 sibling, 0 replies; 20+ messages in thread
From: Ludovic Courtès @ 2021-05-29 21:44 UTC (permalink / raw)
  To: Christopher Baines; +Cc: Mathieu Othacehe, 48468

Hi,

Christopher Baines <mail@cbaines.net> skribis:

> Mathieu Othacehe <othacehe@gnu.org> writes:

[...]

>> Stracing this publish server shows that connection reuse seems to be
>> broken:
>>
>> --8<---------------cut here---------------start------------->8---
>> accept4(9, {sa_family=AF_INET, sin_port=htons(41742), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
>> accept4(9, {sa_family=AF_INET, sin_port=htons(41744), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
>> accept4(9, {sa_family=AF_INET, sin_port=htons(41746), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 25
>> accept4(9, {sa_family=AF_INET, sin_port=htons(41748), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 24
>> accept4(9, {sa_family=AF_INET, sin_port=htons(41750), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21

Ouch.

>> Investigating it, I found that the connection is closed and opened
>> multiple times in the call-with-cached-connection procedure of the (guix
>> script substitute) module.
>>
>> It looks like its because a 'bad-headers exception is raised when trying
>> to parse an eof object:
>>
>> --8<---------------cut here---------------start------------->8---
>> ;;; (error bad-header (read-header-line #<eof>))
>> --8<---------------cut here---------------end--------------->8---
>>
>> I'm not sure where this eof comes from. There is this comment in the
>> http-multiple-get procedure in (guix http-client):
>>
>> --8<---------------cut here---------------start------------->8---
>> ;; Swallow networking errors that could occur due to connection reuse
>> ;; and the like; they will be handled down the road when trying to
>> ;; read responses.
>> (false-if-networking-error
>>  (begin
>>    (for-each (cut write-request <> buffer) batch)
>>    (put-bytevector p (get))
>>    (force-output p))))
>> --8<---------------cut here---------------end--------------->8---
>>
>> which would suggest that connection reuse could cause networking errors?

[...]

> That's on the server side, the actual problem is probably on the client
> side, as I guess there are possibly places where closed connections
> aren't handled properly.

D’oh.  So we should be able to use it by talking to a bare ‘guix
publish’?  I tried and failed to reproduce it, but that’s probably not
deterministic.

> This reminds me I sent some patches relating to closing connections,
> this could well be related [1].
>
> 1: https://issues.guix.gnu.org/47174

This one will properly close connections when the servers asks for it,
but would that make a difference?

Thanks,
Ludo’.




^ permalink raw reply	[flat|nested] 20+ messages in thread

* bug#48468: substitute server connection timeout
  2021-05-16 17:57 bug#48468: substitute server connection timeout Mathieu Othacehe
  2021-05-16 18:26 ` Christopher Baines
@ 2021-06-18 12:33 ` Ludovic Courtès
  2021-06-29 16:49   ` Mathieu Othacehe
  1 sibling, 1 reply; 20+ messages in thread
From: Ludovic Courtès @ 2021-06-18 12:33 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 48468

Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

> Investigating it, I found that the connection is closed and opened
> multiple times in the call-with-cached-connection procedure of the (guix
> script substitute) module.
>
> It looks like its because a 'bad-headers exception is raised when trying
> to parse an eof object:
>
> ;;; (error bad-header (read-header-line #<eof>))
>
>
> I'm not sure where this eof comes from. There is this comment in the
> http-multiple-get procedure in (guix http-client):
>
> ;; Swallow networking errors that could occur due to connection reuse
> ;; and the like; they will be handled down the road when trying to
> ;; read responses.
> (false-if-networking-error
>  (begin
>    (for-each (cut write-request <> buffer) batch)
>    (put-bytevector p (get))
>    (force-output p))))
>
> which would suggest that connection reuse could cause networking errors?

Trying to see exactly which bit is at fault here.  First,
‘http-multiple-get’ seems to be working as expected when passed a fresh
connection:

--8<---------------cut here---------------start------------->8---
scheme@(guix http-client)> (http-multiple-get (string->uri "https://ci.guix.gnu.org")
					      (lambda (request response port result)
						(get-bytevector-n port (response-content-length response))
						(cons #t result))
					      '()
					      (make-list
					       5000
					       (build-request (build-uri 'https
										   #:host "ci.guix.gnu.org"
										   #:path "/d7gpr41qpsifri54vp2lzjs99zkylscq.narinfo"))))
connecting (5000 requests left)...connecting (4400 requests left)...connecting (3800 requests left)...connecting (3200 requests left)...connecting (2600 requests left)...connecting (2000 requests left)...connecting (1400 requests left)...connecting (800 requests left)...connecting (200 requests left)...$143 = (#t #t …)
scheme@(guix http-client)> (length $143)
$144 = 5000
--8<---------------cut here---------------end--------------->8---

But maybe I’m not looking at the right thing.

Do you have evidence or a reproducer?

Thanks,
Ludo’.




^ permalink raw reply	[flat|nested] 20+ messages in thread

* bug#48468: substitute server connection timeout
  2021-06-18 12:33 ` Ludovic Courtès
@ 2021-06-29 16:49   ` Mathieu Othacehe
  2022-12-05 13:21     ` Ludovic Courtès
  0 siblings, 1 reply; 20+ messages in thread
From: Mathieu Othacehe @ 2021-06-29 16:49 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 48468


Hey,

> But maybe I’m not looking at the right thing.
>
> Do you have evidence or a reproducer?

Yes, adding the following debug message here:

--8<---------------cut here---------------start------------->8---
--- a/guix/scripts/substitute.scm
+++ b/guix/scripts/substitute.scm
@@ -416,6 +416,7 @@ server certificates."
         ;; and retry.  We might also get 'bad-response or a similar
         ;; exception from (web response) later on, once we've sent the
         ;; request, or a ERROR/INVALID-SESSION from GnuTLS.
+        (pk key args)
         (if (or (and (eq? key 'system-error)
                      (= EPIPE (system-error-errno `(,key ,@args))))
                 (and (eq? key 'gnutls-error)
--8<---------------cut here---------------end--------------->8---

and using substitutes from a publish server without keep alive support
(before 0b8fa24), prints something like:

--8<---------------cut here---------------start------------->8---
downloading from http://192.168.1.51:8080/nar/gzip/lkmiyfzi1bq571yn4gypbcs5vn4fpma7-texlive-cm-51265 ...
 texlive-cm-51265                                                                                                                                                                                                                                      7.8MiB/s 00:00 | 1.7MiB transferred

substituting /gnu/store/l788x07ska5vffayz0gayv4hsx5flxal-module-import-compiled...

;;; (bad-header (read-header-line #<eof>))
downloading from http://192.168.1.51:8080/nar/gzip/l788x07ska5vffayz0gayv4hsx5flxal-module-import-compiled ...
 module-import-compiled                                                                                                                                                                                                                                 7.5MiB/s 00:00 | 85KiB transferred

substituting /gnu/store/1s1lrnxlkjwxshk5q2w97ig3clc6n06f-ruby-2.6.5...

;;; (bad-header (read-header-line #<eof>))
--8<---------------cut here---------------end--------------->8---

That's because the connection is closed by the publish server each time
a NAR is sent. This particular behaviour is fixed by:
https://issues.guix.gnu.org/48556.

I hoped that it would decrease the load of the Cuirass publish server
that was flooded by worker connections. While the situation is better,
there are still a lot of substitute timeout errors on Cuirass.

I have also observed those timeouts using the publish server behind
ci.guix.gnu.org. This makes me think that there's something else that is
preventing the server from honoring every connection request in less
than 5 seconds (the timeout duration).

Thanks,

Mathieu




^ permalink raw reply	[flat|nested] 20+ messages in thread

* bug#48468: substitute server connection timeout
  2021-06-29 16:49   ` Mathieu Othacehe
@ 2022-12-05 13:21     ` Ludovic Courtès
  2022-12-07  9:44       ` Mathieu Othacehe
  0 siblings, 1 reply; 20+ messages in thread
From: Ludovic Courtès @ 2022-12-05 13:21 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 48468

Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

> I hoped that it would decrease the load of the Cuirass publish server
> that was flooded by worker connections. While the situation is better,
> there are still a lot of substitute timeout errors on Cuirass.
>
> I have also observed those timeouts using the publish server behind
> ci.guix.gnu.org. This makes me think that there's something else that is
> preventing the server from honoring every connection request in less
> than 5 seconds (the timeout duration).

I don’t see this when substituting from https://ci.guix.gnu.org these
days.

You mentioned on IRC that nginx logs show that ‘guix publish’ times out.
Looking at /var/log/nginx/error.log, I see “Connection reset by peer”
and “Broken pipe”, which could indicate that the client closed the
connection (which was open) prematurely, maybe due to an internal
timeout.

What I’d like to know is whether those timeouts you mention appear
during connection establishment (connect(2) on the client side doesn’t
complete within 5 seconds) or after connection establishment.

Connection establishment is handled by the kernel and ‘guix publish’ is
not involved.  However, the listen(2) call made in (web server http)
sets a maximum backlog of 128 connections; if at some point 128
connections are already queued, then the 129th client will have
connect(2) take some time.  Seems unlikely, but who knows.

Ludo’.




^ permalink raw reply	[flat|nested] 20+ messages in thread

* bug#48468: substitute server connection timeout
  2022-12-05 13:21     ` Ludovic Courtès
@ 2022-12-07  9:44       ` Mathieu Othacehe
  2022-12-07 13:38         ` Ludovic Courtès
  0 siblings, 1 reply; 20+ messages in thread
From: Mathieu Othacehe @ 2022-12-07  9:44 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 48468


Hello,

> You mentioned on IRC that nginx logs show that ‘guix publish’ times out.
> Looking at /var/log/nginx/error.log, I see “Connection reset by peer”
> and “Broken pipe”, which could indicate that the client closed the
> connection (which was open) prematurely, maybe due to an internal
> timeout.

Could it be that the client is receiving 404 because the baking of some
NAR was deferred to a worker, and then it closes the connection?

I think that's what I had in mind with the patch 2/2 of this patchset:
https://issues.guix.gnu.org/50040.

Thanks,

Mathieu




^ permalink raw reply	[flat|nested] 20+ messages in thread

* bug#48468: substitute server connection timeout
  2022-12-07  9:44       ` Mathieu Othacehe
@ 2022-12-07 13:38         ` Ludovic Courtès
  2022-12-07 14:31           ` Mathieu Othacehe
  0 siblings, 1 reply; 20+ messages in thread
From: Ludovic Courtès @ 2022-12-07 13:38 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 48468

Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

>> You mentioned on IRC that nginx logs show that ‘guix publish’ times out.
>> Looking at /var/log/nginx/error.log, I see “Connection reset by peer”
>> and “Broken pipe”, which could indicate that the client closed the
>> connection (which was open) prematurely, maybe due to an internal
>> timeout.
>
> Could it be that the client is receiving 404 because the baking of some
> NAR was deferred to a worker, and then it closes the connection?

Unlikely.  Take
/gnu/store/qmzr030rzgikdxv3g9msqv0l8qp5j6y2-btrfs-raid-root-os.drv,
which was marked as failed earlier today due to missing .drv.  It’s a
4KiB file, and the cache-bypass-threshold is ‘guix publish’ is typically
set to something much higher than that.  So ‘guix publish’ won’t return
404 in that case.

If you find a “missing derivation” error in Cuirass, how would you
search logs to find what happened?  I’m not sure where to look for
useful debugging info.

Thanks,
Ludo’.




^ permalink raw reply	[flat|nested] 20+ messages in thread

* bug#48468: substitute server connection timeout
  2022-12-07 13:38         ` Ludovic Courtès
@ 2022-12-07 14:31           ` Mathieu Othacehe
  2022-12-08 10:26             ` Ludovic Courtès
  0 siblings, 1 reply; 20+ messages in thread
From: Mathieu Othacehe @ 2022-12-07 14:31 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 48468


Hello,

> /gnu/store/qmzr030rzgikdxv3g9msqv0l8qp5j6y2-btrfs-raid-root-os.drv,
> which was marked as failed earlier today due to missing .drv.  It’s a
> 4KiB file, and the cache-bypass-threshold is ‘guix publish’ is typically
> set to something much higher than that.  So ‘guix publish’ won’t return
> 404 in that case.

Yes but that derivation also depends on other derivations, for instance
qemu-minimal and if I try:

--8<---------------cut here---------------start------------->8---
mathieu@berlin ~$ guix build qemu-minimal
...
/gnu/store/lwv2pl0m6dkf6bkzip755w5p71g5akq4-qemu-minimal-7.1.0
--8<---------------cut here---------------end--------------->8---

and then, from my machine.

--8<---------------cut here---------------start------------->8---
curl https://ci.guix.gnu.org/lwv2pl0m6dkf6bkzip755w5p71g5akq4.narinfo
We're baking it: /lwv2pl0m6dkf6bkzip755w5p71g5akq4.narinfo
--8<---------------cut here---------------end--------------->8---

wget exhibits the same behaviour and returns 404.

So any build that requires a heavy substitute, heavier than the cache
bypass threshold at least, will fail on the workers, as it would fail
locally.

That's not really a surprise as baking substitutes takes time and there
is a time window between the moment Cuirass triggers NAR baking and the
moment the NAR is baked, where every user will get a 404.

Mathieu




^ permalink raw reply	[flat|nested] 20+ messages in thread

* bug#48468: substitute server connection timeout
  2022-12-07 14:31           ` Mathieu Othacehe
@ 2022-12-08 10:26             ` Ludovic Courtès
  2022-12-10 10:55               ` Ludovic Courtès
  2022-12-28 14:23               ` Mathieu Othacehe
  0 siblings, 2 replies; 20+ messages in thread
From: Ludovic Courtès @ 2022-12-08 10:26 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 48468

Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

>> /gnu/store/qmzr030rzgikdxv3g9msqv0l8qp5j6y2-btrfs-raid-root-os.drv,
>> which was marked as failed earlier today due to missing .drv.  It’s a
>> 4KiB file, and the cache-bypass-threshold is ‘guix publish’ is typically
>> set to something much higher than that.  So ‘guix publish’ won’t return
>> 404 in that case.
>
> Yes but that derivation also depends on other derivations, for instance
> qemu-minimal and if I try:
>
> mathieu@berlin ~$ guix build qemu-minimal
> ...
> /gnu/store/lwv2pl0m6dkf6bkzip755w5p71g5akq4-qemu-minimal-7.1.0
>
>
> and then, from my machine.
>
> curl https://ci.guix.gnu.org/lwv2pl0m6dkf6bkzip755w5p71g5akq4.narinfo
> We're baking it: /lwv2pl0m6dkf6bkzip755w5p71g5akq4.narinfo

Yes, but derivations (.drv) are depend only on “sources” (like
*-guile-builder, *.patch, *.scm) and on other derivations, all of which
are typically less than 1 MiB.

Derivations don’t depend on derivation outputs like that of qemu-minimal
above.

> So any build that requires a heavy substitute, heavier than the cache
> bypass threshold at least, will fail on the workers, as it would fail
> locally.

The main ‘guix publish’ instance on berlin has cache-bypass-threshold
set to 100 MiB; I don’t think we have any source or .drv that is this
big, or even a tenth of it.  :-)

The ‘guix publish’ instance spawned in (cuirass remote) doesn’t use
caching at all, so there’s no bypass threshold.

The only clue we have so far is a long chain of successful .drv
substitutions followed by a failing one:

--8<---------------cut here---------------start------------->8---
Downloading http://141.80.167.131/nar/zstd/9nm35401i8j42559iadi3iz3kmhmj7pr-guix-system-tests.drv...

[K guix-system-tests.drv  1KiB             0B/s 00:00 [                  ]   0.0%
[K guix-system-tests.drv  1KiB         1.3MiB/s 00:00 [##################] 100.0%
[K guix-system-tests.drv  1KiB         408KiB/s 00:00 [##################] 100.0%

@ substituter-succeeded /gnu/store/9nm35401i8j42559iadi3iz3kmhmj7pr-guix-system-tests.drv
fetching path `/gnu/store/9zf6inizcb09m136c44dj35fmzf2g3hs-guix-system-tests-modules.drv'...
@ substituter-started /gnu/store/9zf6inizcb09m136c44dj35fmzf2g3hs-guix-system-tests-modules.drv substitute
Downloading http://141.80.167.131/nar/zstd/9zf6inizcb09m136c44dj35fmzf2g3hs-guix-system-tests-modules.drv...

[K guix-system-tests-modules.drv  543B     0B/s 00:00 [                  ]   0.0%
[K guix-system-tests-modules.drv  543B 851KiB/s 00:00 [##################] 100.0%
[K guix-system-tests-modules.drv  543B 484KiB/s 00:00 [##################] 100.0%

@ substituter-succeeded /gnu/store/9zf6inizcb09m136c44dj35fmzf2g3hs-guix-system-tests-modules.drv
cannot build missing derivation ?/gnu/store/zq7idl0j51fdzqhhqm9ql90d0f2326k7-btrfs-root-on-subvolume-os.drv?
--8<---------------cut here---------------end--------------->8---

(From <https://ci.guix.gnu.org/build/224849/log/raw>.)

Plausible explanations that come to mind:

  1. ‘guix publish’ returning 404, but not due to baking.  Instead the
     .drv is simply not in store, hence 404.

  2. Client timeout (‘guix publish’ fails to reply on time).

  3. Cached 404 in nginx (though we’re not supposed to cache those I
     think?), or timeout in nginx (again due to ‘guix publish’ being too
     slow and ‘proxy_read_timeout’ is reached, currently 10s).

We should see if we can get useful info from nginx/publish logs.

Thanks,
Ludo’.




^ permalink raw reply	[flat|nested] 20+ messages in thread

* bug#48468: substitute server connection timeout
  2022-12-08 10:26             ` Ludovic Courtès
@ 2022-12-10 10:55               ` Ludovic Courtès
  2022-12-27  9:52                 ` Mathieu Othacehe
  2022-12-28 14:23               ` Mathieu Othacehe
  1 sibling, 1 reply; 20+ messages in thread
From: Ludovic Courtès @ 2022-12-10 10:55 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 48468

Hi,

Ludovic Courtès <ludo@gnu.org> skribis:

> Mathieu Othacehe <othacehe@gnu.org> skribis:

[...]

>> curl https://ci.guix.gnu.org/lwv2pl0m6dkf6bkzip755w5p71g5akq4.narinfo
>> We're baking it: /lwv2pl0m6dkf6bkzip755w5p71g5akq4.narinfo
>
> Yes, but derivations (.drv) are depend only on “sources” (like
> *-guile-builder, *.patch, *.scm) and on other derivations, all of which
> are typically less than 1 MiB.
>
> Derivations don’t depend on derivation outputs like that of qemu-minimal
> above.
>
>> So any build that requires a heavy substitute, heavier than the cache
>> bypass threshold at least, will fail on the workers, as it would fail
>> locally.
>
> The main ‘guix publish’ instance on berlin has cache-bypass-threshold
> set to 100 MiB; I don’t think we have any source or .drv that is this
> big, or even a tenth of it.  :-)

On closer inspection, we do get bigger sources for those system test
derivations:

--8<---------------cut here---------------start------------->8---
$ du  $(guix gc -R /gnu/store/2laznragkfjf0jn6n7b8c53vgmb8maqp-installed-extlinux-os.drv ) |sort -k1 -n |tail
6304	/gnu/store/mvvf60w8p0lx8vlixryb1a1gfib401rf-guix-b129026/gnu/packages
6956	/gnu/store/9cs4i3hygldsan8ahf0p7x5kgzmqz9gr-guix-packages-base-source/gnu/packages/patches
7000	/gnu/store/mvvf60w8p0lx8vlixryb1a1gfib401rf-guix-b129026/gnu
7716	/gnu/store/z0gc56x1w83wn0pi4mz8nhyjdr0d10fs-guix
12352	/gnu/store/mvvf60w8p0lx8vlixryb1a1gfib401rf-guix-b129026
22548	/gnu/store/ky96vygv1hpa8iz6677qz96binbz201h-packages
22684	/gnu/store/rrg9ki4bdjffvmgapjcqwjwjwkhb0qhl-doc
39516	/gnu/store/9cs4i3hygldsan8ahf0p7x5kgzmqz9gr-guix-packages-base-source/gnu/packages
39524	/gnu/store/9cs4i3hygldsan8ahf0p7x5kgzmqz9gr-guix-packages-base-source/gnu
39528	/gnu/store/9cs4i3hygldsan8ahf0p7x5kgzmqz9gr-guix-packages-base-source
--8<---------------cut here---------------end--------------->8---

The items at the bottom weigh around 40 MiB.

That’s still below the 100 MiB cache bypass threshold of the main ‘guix
publish’ instance though.

Ludo’.




^ permalink raw reply	[flat|nested] 20+ messages in thread

* bug#48468: substitute server connection timeout
  2022-12-10 10:55               ` Ludovic Courtès
@ 2022-12-27  9:52                 ` Mathieu Othacehe
  0 siblings, 0 replies; 20+ messages in thread
From: Mathieu Othacehe @ 2022-12-27  9:52 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 48468

[-- Attachment #1: Type: text/plain, Size: 282 bytes --]


Hey Ludo,

> That’s still below the 100 MiB cache bypass threshold of the main ‘guix
> publish’ instance though.

Right. Just to be on the safe side here, what about applying this patch
to have log lines when we are replying 404 due to baking?

Thanks,

Mathieu

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-scripts-publish-Add-a-log-when-replying-404-due-to-b.patch --]
[-- Type: text/x-patch, Size: 1412 bytes --]

From 725d5ba21a0fc0108b60c37bbc8d947fab6ac938 Mon Sep 17 00:00:00 2001
From: Mathieu Othacehe <othacehe@gnu.org>
Date: Tue, 27 Dec 2022 10:49:04 +0100
Subject: [PATCH 1/1] scripts: publish: Add a log when replying 404 due to
 baking.

* guix/scripts/publish.scm (render-narinfo/cached): Add it.
---
 guix/scripts/publish.scm | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/guix/scripts/publish.scm b/guix/scripts/publish.scm
index 3bf3bd9c7c..a2048c98fd 100644
--- a/guix/scripts/publish.scm
+++ b/guix/scripts/publish.scm
@@ -586,9 +586,13 @@ (define (delete-entry narinfo)
                                #:ttl 300          ;temporary
                                #:nar-path nar-path
                                #:compressions compressions)
-               (not-found request
-                          #:phrase "We're baking it"
-                          #:ttl 300)))          ;should be available within 5m
+               (begin
+                 (format #t (G_ "~a ~a: 404 (baking)~%")
+                         (request-method request)
+                         (uri-path (request-uri request)))
+                 (not-found request
+                            #:phrase "We're baking it"
+                            #:ttl 300))))      ;should be available within 5m
           (else
            (not-found request #:phrase "" #:ttl negative-ttl)))))
 
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* bug#48468: substitute server connection timeout
  2022-12-08 10:26             ` Ludovic Courtès
  2022-12-10 10:55               ` Ludovic Courtès
@ 2022-12-28 14:23               ` Mathieu Othacehe
  2023-01-06 22:42                 ` Ludovic Courtès
  1 sibling, 1 reply; 20+ messages in thread
From: Mathieu Othacehe @ 2022-12-28 14:23 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 48468

[-- Attachment #1: Type: text/plain, Size: 5267 bytes --]


Hello,

So I had an other look to that one,

> (From <https://ci.guix.gnu.org/build/224849/log/raw>.)
>
> Plausible explanations that come to mind:
>
>   1. ‘guix publish’ returning 404, but not due to baking.  Instead the
>      .drv is simply not in store, hence 404.
>
>   2. Client timeout (‘guix publish’ fails to reply on time).
>
>   3. Cached 404 in nginx (though we’re not supposed to cache those I
>      think?), or timeout in nginx (again due to ‘guix publish’ being too
>      slow and ‘proxy_read_timeout’ is reached, currently 10s).

I focused on https://ci.guix.gnu.org/build/308493/details which is
similar to the one linked above. Some derivations are successfully
substituted then, one is not and it aborts.

This build stopped at 2022/12/24 00:25:38 CET. At that time the nginx error
log on Berlin looks like:

--8<---------------cut here---------------start------------->8---
2022/12/24 00:25:23 [info] 128755#0: *178076209 recv() failed (104: Connection reset by peer) while sending to client, client: 78.47.68.4, server: ci.guix.gnu.org, request: "GET /x3c2f9zzxhk0ci81nrdsxzglp4b1h3di.narinfo HTTP/1.1", upstream: "http://127.0.0.1:3000/x3c2f9zzxhk0ci81nrdsxzglp4b1h3di.narinfo", host: "ci.guix.gnu.org"
2022/12/24 00:25:24 [info] 128742#0: *178076816 client 174.59.221.51 closed keepalive connection
2022/12/24 00:25:25 [info] 128755#0: *178076830 recv() failed (104: Connection reset by peer) while sending to client, client: 78.47.68.4, server: ci.guix.gnu.org, request: "GET /vx7lgvrcw6i6r0inw513qn6fj3cb27mm.narinfo HTTP/1.1", upstream: "http://127.0.0.1:3000/vx7lgvrcw6i6r0inw513qn6fj3cb27mm.narinfo", host: "ci.guix.gnu.org"
2022/12/24 00:25:26 [info] 128742#0: *178077436 recv() failed (104: Connection reset by peer) while sending to client, client: 78.47.68.4, server: ci.guix.gnu.org, request: "GET /b1g8vyqp8c52yk6vscdvjflci5mnladq.narinfo HTTP/1.1", upstream: "http://127.0.0.1:3000/b1g8vyqp8c52yk6vscdvjflci5mnladq.narinfo", host: "ci.guix.gnu.org"
2022/12/24 00:25:28 [info] 128742#0: *178078045 recv() failed (104: Connection reset by peer) while sending to client, client: 78.47.68.4, server: ci.guix.gnu.org, request: "GET /007zgflsl5xkr377wpakbsis5c2yqh1q.narinfo HTTP/1.1", upstream: "http://127.0.0.1:3000/007zgflsl5xkr377wpakbsis5c2yqh1q.narinfo", host: "ci.guix.gnu.org"
2022/12/24 00:25:32 [info] 128742#0: *178078659 recv() failed (104: Connection reset by peer) while sending to client, client: 78.47.68.4, server: ci.guix.gnu.org, request: "GET /xdfyi8jki350i8njk1glrzg9p0va18vy.narinfo HTTP/1.1", upstream: "http://127.0.0.1:3000/xdfyi8jki350i8njk1glrzg9p0va18vy.narinfo", host: "ci.guix.gnu.org"
2022/12/24 00:25:33 [info] 128741#0: *178079271 recv() failed (104: Connection reset by peer) while sending to client, client: 78.47.68.4, server: ci.guix.gnu.org, request: "GET /rzs7jiwhyls2bg8zv2g1vmmgcfwkvaax.narinfo HTTP/1.1", upstream: "http://127.0.0.1:3000/rzs7jiwhyls2bg8zv2g1vmmgcfwkvaax.narinfo", host: "ci.guix.gnu.org"
2022/12/24 00:25:36 [info] 128741#0: *178080049 client 157.55.39.212 closed keepalive connection
2022/12/24 00:25:38 [info] 128741#0: *178080047 peer closed connection in SSL handshake while SSL handshaking, client: 127.0.0.1, server: 0.0.0.0:443
--8<---------------cut here---------------end--------------->8---

That would, in that specific case, invalidate your hypothesis number
2. When 'guix publish' timeouts, we have messages like:

--8<---------------cut here---------------start------------->8---
2022/12/24 00:45:36 [error] 128749#0: *178225211 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 86.87.253.49, server: ci.guix.gnu.org, request: "GET /m0hqyvj8di41xlraz087kmpr38zaw90r.narinfo HTTP/1.1", upstream: "http://127.0.0.1:3000/m0hqyvj8di41xlraz087kmpr38zaw90r.narinfo", host: "ci.guix.gnu.org"
--8<---------------cut here---------------end--------------->8---

However, like suggested in your hypothesis number 1, it seems instead
that we are replying 404 to the worker which resets the connection. As
we have put aside the baking thing, the question is now why are those
derivations not available?

Are they not part of Berlin's store at that time? Or is the publish
server erroneously returns 404?

It looks like the requested derivations are still absent, as of right
now:

--8<---------------cut here---------------start------------->8---
mathieu@berlin /var/log/nginx$ ls /gnu/store/vx7lgvrcw6i6r0inw513qn6fj3cb27mm*
ls: cannot access '/gnu/store/vx7lgvrcw6i6r0inw513qn6fj3cb27mm*': No such file or directory
mathieu@berlin /var/log/nginx$ ls /gnu/store/b1g8vyqp8c52yk6vscdvjflci5mnladq*
ls: cannot access '/gnu/store/b1g8vyqp8c52yk6vscdvjflci5mnladq*': No such file or directory
mathieu@berlin /var/log/nginx$ ls /gnu/store/007zgflsl5xkr377wpakbsis5c2yqh1q*
ls: cannot access '/gnu/store/007zgflsl5xkr377wpakbsis5c2yqh1q*': No such file or directory
--8<---------------cut here---------------end--------------->8---

As I don't have much clue about what are those derivations, I think we
should instrument a bit the publish server and maybe the substitute
script like proposed in the attachments.

WDYT?

Thanks,

Mathieu

[-- Attachment #2: 0001-scripts-publish-Add-a-custom-baking-header.patch --]
[-- Type: text/x-patch, Size: 3120 bytes --]

From 9f9c839937ac2edd1b5901b2262c4be0954fa20c Mon Sep 17 00:00:00 2001
From: Mathieu Othacehe <othacehe@gnu.org>
Date: Wed, 28 Dec 2022 15:12:46 +0100
Subject: [PATCH 1/2] scripts: publish: Add a custom baking header.

Log the not-found responses and their reason (baking or not) to stdout. Also
send the X-Baking custom header so that the client can be informed of the
cause of the failure.

* guix/scripts/publish.scm (not-found): Add a baking? argument to add the
X-Baking HTTP header to the response if baking is in progress.  Also, log the
404 responses to stdout, indicating if it is due to baking or not.
(render-narinfo/cached): Pass the baking? argument.
---
 guix/scripts/publish.scm | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/guix/scripts/publish.scm b/guix/scripts/publish.scm
index 3bf3bd9c7c..11fedf092e 100644
--- a/guix/scripts/publish.scm
+++ b/guix/scripts/publish.scm
@@ -4,7 +4,7 @@
 ;;; Copyright © 2015-2022 Ludovic Courtès <ludo@gnu.org>
 ;;; Copyright © 2020 Maxim Cournoyer <maxim.cournoyer@gmail.com>
 ;;; Copyright © 2021 Simon Tournier <zimon.toutoune@gmail.com>
-;;; Copyright © 2021 Mathieu Othacehe <othacehe@gnu.org>
+;;; Copyright © 2021, 2022 Mathieu Othacehe <othacehe@gnu.org>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -375,14 +375,28 @@ (define* (narinfo-string store store-path
                                            compression)))
                  compressions))))
 
+;; Custom header to indicate that baking is in progress.
+(declare-opaque-header! "X-Baking")
+
 (define* (not-found request
-                    #:key (phrase "Resource not found")
+                    #:key
+                    baking?
+                    (phrase "Resource not found")
                     ttl)
   "Render 404 response for REQUEST."
+  (format #t (G_ "↳ ~a ~a: 404~a~%")
+          (request-method request)
+          (uri-path (request-uri request))
+          (if baking? " (baking)" ""))
   (values (build-response #:code 404
-                          #:headers (if ttl
-                                        `((cache-control (max-age . ,ttl)))
-                                        '()))
+                          #:headers
+                          (append
+                           (if ttl
+                               `((cache-control (max-age . ,ttl)))
+                               '())
+                           (if baking?
+                               '((x-baking . "1"))
+                               '())))
           (string-append phrase ": "
                          (uri-path (request-uri request)))))
 
@@ -587,6 +601,7 @@ (define (delete-entry narinfo)
                                #:nar-path nar-path
                                #:compressions compressions)
                (not-found request
+                          #:baking? #t
                           #:phrase "We're baking it"
                           #:ttl 300)))          ;should be available within 5m
           (else
-- 
2.38.1


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: 0002-substitutes-Log-the-failing-queries.patch --]
[-- Type: text/x-patch, Size: 1921 bytes --]

From 25ffc57864dbf34ca58741f89c1f790dbde6702f Mon Sep 17 00:00:00 2001
From: Mathieu Othacehe <othacehe@gnu.org>
Date: Wed, 28 Dec 2022 15:19:29 +0100
Subject: [PATCH 2/2] substitutes: Log the failing queries.

* guix/substitutes.scm (%debug?): New variable.
(handle-narinfo-response): Log the failing queries if the %debug? parameter is
set.
---
 guix/substitutes.scm | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/guix/substitutes.scm b/guix/substitutes.scm
index 9014cf61ec..819eb2c73e 100644
--- a/guix/substitutes.scm
+++ b/guix/substitutes.scm
@@ -90,6 +90,12 @@ (define %narinfo-cache-directory
           (string-append %state-directory "/substitute/cache"))
       (string-append (cache-directory #:ensure? #f) "/substitute")))
 
+(define %debug?
+  ;; Enable debug mode by setting the GUIX_SUBSTITUTE_DEBUG environmnent
+  ;; variable.
+  (make-parameter
+   (getenv "GUIX_SUBSTITUTE_DEBUG")))
+
 (define (narinfo-cache-file cache-url path)
   "Return the name of the local file that contains an entry for PATH.  The
 entry is stored in a sub-directory specific to CACHE-URL."
@@ -224,6 +230,15 @@ (define (handle-narinfo-response request response port result)
           (let* ((path      (uri-path (request-uri request)))
                  (hash-part (basename
                              (string-drop-right path 8)))) ;drop ".narinfo"
+            ;; Log the failing queries and indicate if it failed because the
+            ;; narinfo is being baked.
+            (when (%debug?)
+              (let ((baking?
+                     (assoc-ref (response-headers response) 'x-baking)))
+                (display
+                 (format #f "could not fetch ~a~a ~a~a~%"
+                         url path code
+                         (if baking? " (baking)" "")))))
             (if len
                 (get-bytevector-n port len)
                 (read-to-eof port))
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* bug#48468: substitute server connection timeout
  2022-12-28 14:23               ` Mathieu Othacehe
@ 2023-01-06 22:42                 ` Ludovic Courtès
  2023-01-07 13:40                   ` Mathieu Othacehe
  0 siblings, 1 reply; 20+ messages in thread
From: Ludovic Courtès @ 2023-01-06 22:42 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 48468

Hi Mathieu!

Mathieu Othacehe <othacehe@gnu.org> skribis:

>> (From <https://ci.guix.gnu.org/build/224849/log/raw>.)
>>
>> Plausible explanations that come to mind:
>>
>>   1. ‘guix publish’ returning 404, but not due to baking.  Instead the
>>      .drv is simply not in store, hence 404.
>>
>>   2. Client timeout (‘guix publish’ fails to reply on time).
>>
>>   3. Cached 404 in nginx (though we’re not supposed to cache those I
>>      think?), or timeout in nginx (again due to ‘guix publish’ being too
>>      slow and ‘proxy_read_timeout’ is reached, currently 10s).
>
> I focused on https://ci.guix.gnu.org/build/308493/details which is
> similar to the one linked above. Some derivations are successfully
> substituted then, one is not and it aborts.
>
> This build stopped at 2022/12/24 00:25:38 CET. At that time the nginx error
> log on Berlin looks like:
>
> 2022/12/24 00:25:23 [info] 128755#0: *178076209 recv() failed (104: Connection reset by peer) while sending to client, client: 78.X.Y.Z, server: ci.guix.gnu.org, request: "GET /x3c2f9zzxhk0ci81nrdsxzglp4b1h3di.narinfo HTTP/1.1", upstream: "http://127.0.0.1:3000/x3c2f9zzxhk0ci81nrdsxzglp4b1h3di.narinfo", host: "ci.guix.gnu.org"

It means that upstream (i.e., ‘guix publish’) closed the connection,
right?

And it means that it closed it prematurely I guess?

> However, like suggested in your hypothesis number 1, it seems instead
> that we are replying 404 to the worker which resets the connection. As
> we have put aside the baking thing, the question is now why are those
> derivations not available?

In that case we’re not replying at all, are we?

> From 9f9c839937ac2edd1b5901b2262c4be0954fa20c Mon Sep 17 00:00:00 2001
> From: Mathieu Othacehe <othacehe@gnu.org>
> Date: Wed, 28 Dec 2022 15:12:46 +0100
> Subject: [PATCH 1/2] scripts: publish: Add a custom baking header.
>
> Log the not-found responses and their reason (baking or not) to stdout. Also
> send the X-Baking custom header so that the client can be informed of the
> cause of the failure.
>
> * guix/scripts/publish.scm (not-found): Add a baking? argument to add the
> X-Baking HTTP header to the response if baking is in progress.  Also, log the
> 404 responses to stdout, indicating if it is due to baking or not.
> (render-narinfo/cached): Pass the baking? argument.

[...]

>  (define* (not-found request
> -                    #:key (phrase "Resource not found")
> +                    #:key
> +                    baking?
> +                    (phrase "Resource not found")
>                      ttl)
>    "Render 404 response for REQUEST."
> +  (format #t (G_ "↳ ~a ~a: 404~a~%")

Drop ‘G_’ (we don’t translate debugging messages) and use ASCII, to be
on the safe side…

> From 25ffc57864dbf34ca58741f89c1f790dbde6702f Mon Sep 17 00:00:00 2001
> From: Mathieu Othacehe <othacehe@gnu.org>
> Date: Wed, 28 Dec 2022 15:19:29 +0100
> Subject: [PATCH 2/2] substitutes: Log the failing queries.
>
> * guix/substitutes.scm (%debug?): New variable.
> (handle-narinfo-response): Log the failing queries if the %debug? parameter is
> set.

[...]

> +(define %debug?
> +  ;; Enable debug mode by setting the GUIX_SUBSTITUTE_DEBUG environmnent
> +  ;; variable.
> +  (make-parameter
> +   (getenv "GUIX_SUBSTITUTE_DEBUG")))

Instead of an env. var., maybe add a ‘--debug’ command-line option and
parameterize ‘%debug?’ accordingly?

You can also have something like:

  (define-syntax (debug fmt args ...)
    (when (%debug?)
      (format #t fmt args ...)))

LGTM with changes along these lines!

Thanks,
Ludo’.




^ permalink raw reply	[flat|nested] 20+ messages in thread

* bug#48468: substitute server connection timeout
  2023-01-06 22:42                 ` Ludovic Courtès
@ 2023-01-07 13:40                   ` Mathieu Othacehe
  2023-01-09  9:39                     ` Ludovic Courtès
  0 siblings, 1 reply; 20+ messages in thread
From: Mathieu Othacehe @ 2023-01-07 13:40 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 48468


Hello,

> It means that upstream (i.e., ‘guix publish’) closed the connection,
> right?
>
> And it means that it closed it prematurely I guess?

Looks like it yes.

>> However, like suggested in your hypothesis number 1, it seems instead
>> that we are replying 404 to the worker which resets the connection. As
>> we have put aside the baking thing, the question is now why are those
>> derivations not available?
>
> In that case we’re not replying at all, are we?

Well could be, I'm not 100% sure how to understand those nginx logs. If
we are replying anything, it will be visible with the new traces. If on
the other hand the publish server is hanging up then they won't help
much I guess.

> Drop ‘G_’ (we don’t translate debugging messages) and use ASCII, to be
> on the safe side…

Done.

> Instead of an env. var., maybe add a ‘--debug’ command-line option and
> parameterize ‘%debug?’ accordingly?

The --debug command-line feels better but it involves a guix-daemon
modification so I kept the environment variable,

> You can also have something like:
>
>   (define-syntax (debug fmt args ...)
>     (when (%debug?)
>       (format #t fmt args ...)))
>

and used that macro :)

Thanks for having a look!

Mathieu




^ permalink raw reply	[flat|nested] 20+ messages in thread

* bug#48468: substitute server connection timeout
  2023-01-07 13:40                   ` Mathieu Othacehe
@ 2023-01-09  9:39                     ` Ludovic Courtès
  2023-01-10  8:10                       ` Mathieu Othacehe
  0 siblings, 1 reply; 20+ messages in thread
From: Ludovic Courtès @ 2023-01-09  9:39 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 48468

Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

>> Instead of an env. var., maybe add a ‘--debug’ command-line option and
>> parameterize ‘%debug?’ accordingly?
>
> The --debug command-line feels better but it involves a guix-daemon
> modification so I kept the environment variable,

Oh sorry, I was confusing ‘substitute’ and ‘publish’, that makes sense!

Ludo’.




^ permalink raw reply	[flat|nested] 20+ messages in thread

* bug#48468: substitute server connection timeout
  2023-01-09  9:39                     ` Ludovic Courtès
@ 2023-01-10  8:10                       ` Mathieu Othacehe
  0 siblings, 0 replies; 20+ messages in thread
From: Mathieu Othacehe @ 2023-01-10  8:10 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 48468


Hey,

So the debug mechanism is in place. Requesting a non-existing derivation
on a worker gives:

--8<---------------cut here---------------start------------->8---
mathieu@hydra-guix-104 ~$ guix build /gnu/store/yd1p7069rs4xbbfwj5p7nzp9psw7d3vv-hello-2.12.1.drv
substitute: could not fetch http://141.80.167.131/yd1p7069rs4xbbfwj5p7nzp9psw7d3vv.narinfo 404
substitute: updating substitutes from 'http://141.80.167.131'... 100.0%
cannot build missing derivation ‘/gnu/store/yd1p7069rs4xbbfwj5p7nzp9psw7d3vv-hello-2.12.1.drv’
guix build: error: build of `/gnu/store/yd1p7069rs4xbbfwj5p7nzp9psw7d3vv-hello-2.12.1.drv' failed
--8<---------------cut here---------------end--------------->8---

as expected. The funny thing is that during the test failures of
tonight, none of those traces were displayed. That would mean that the
failure is not caused by a missing narinfo.

I added the "--debug" option to the guix-daemon on the workers as well
hoping to gather more info.

Thanks,

Mathieu




^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2023-01-10  9:12 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-05-16 17:57 bug#48468: substitute server connection timeout Mathieu Othacehe
2021-05-16 18:26 ` Christopher Baines
2021-05-17 14:49   ` Mathieu Othacehe
2021-05-18 14:35     ` Mathieu Othacehe
2021-05-21 13:30       ` Mathieu Othacehe
2021-05-29 21:44   ` Ludovic Courtès
2021-06-18 12:33 ` Ludovic Courtès
2021-06-29 16:49   ` Mathieu Othacehe
2022-12-05 13:21     ` Ludovic Courtès
2022-12-07  9:44       ` Mathieu Othacehe
2022-12-07 13:38         ` Ludovic Courtès
2022-12-07 14:31           ` Mathieu Othacehe
2022-12-08 10:26             ` Ludovic Courtès
2022-12-10 10:55               ` Ludovic Courtès
2022-12-27  9:52                 ` Mathieu Othacehe
2022-12-28 14:23               ` Mathieu Othacehe
2023-01-06 22:42                 ` Ludovic Courtès
2023-01-07 13:40                   ` Mathieu Othacehe
2023-01-09  9:39                     ` Ludovic Courtès
2023-01-10  8:10                       ` Mathieu Othacehe

Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/guix.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.