unofficial mirror of bug-guix@gnu.org 
 help / color / mirror / Atom feed
* bug#48468: substitute server connection timeout
@ 2021-05-16 17:57 Mathieu Othacehe
  2021-05-16 18:26 ` Christopher Baines
  2021-06-18 12:33 ` Ludovic Courtès
  0 siblings, 2 replies; 7+ messages in thread
From: Mathieu Othacehe @ 2021-05-16 17:57 UTC (permalink / raw)
  To: 48468


Hello,

We recently have a lot of those errors on Cuirass:

--8<---------------cut here---------------start------------->8---
guix substitute: warning: while fetching http://141.80.167.131:5557/nar/g7ka09613k5v1vlznh87yg35905ggw51-python2-scipy-1.2.2-guile-builder: server is somewhat slow
guix substitute: warning: try `--no-substitutes' if the problem persists
guix substitute: error: connect*: Connection timed out
--8<---------------cut here---------------end--------------->8---

which means that the workers are failing to connect to the Cuirass
remote-server publish server on berlin at 141.80.167.131:5557.

Stracing this publish server shows that connection reuse seems to be
broken:

--8<---------------cut here---------------start------------->8---
accept4(9, {sa_family=AF_INET, sin_port=htons(41742), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41744), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41746), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 25
accept4(9, {sa_family=AF_INET, sin_port=htons(41748), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 24
accept4(9, {sa_family=AF_INET, sin_port=htons(41750), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41752), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41754), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 25
accept4(9, {sa_family=AF_INET, sin_port=htons(41756), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41758), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 26
accept4(9, {sa_family=AF_INET, sin_port=htons(41760), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 24
accept4(9, {sa_family=AF_INET, sin_port=htons(41762), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41764), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41766), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41768), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 22
accept4(9, {sa_family=AF_INET, sin_port=htons(41770), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41772), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41774), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41776), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41778), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41780), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41782), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
--8<---------------cut here---------------end--------------->8---

Investigating it, I found that the connection is closed and opened
multiple times in the call-with-cached-connection procedure of the (guix
script substitute) module.

It looks like its because a 'bad-headers exception is raised when trying
to parse an eof object:

--8<---------------cut here---------------start------------->8---
;;; (error bad-header (read-header-line #<eof>))
--8<---------------cut here---------------end--------------->8---

I'm not sure where this eof comes from. There is this comment in the
http-multiple-get procedure in (guix http-client):

--8<---------------cut here---------------start------------->8---
;; Swallow networking errors that could occur due to connection reuse
;; and the like; they will be handled down the road when trying to
;; read responses.
(false-if-networking-error
 (begin
   (for-each (cut write-request <> buffer) batch)
   (put-bytevector p (get))
   (force-output p))))
--8<---------------cut here---------------end--------------->8---

which would suggest that connection reuse could cause networking errors?

What also puzzles me it that the main guix publish server on berlin does
not seem to present this issue. That would indicate that this error is
caused by how the Cuirass remote-server publish server is started or
configured.

Ludo, Chris, any idea?

I will keep searching anyway :)

Thanks,

Mathieu




^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#48468: substitute server connection timeout
  2021-05-16 17:57 bug#48468: substitute server connection timeout Mathieu Othacehe
@ 2021-05-16 18:26 ` Christopher Baines
  2021-05-17 14:49   ` Mathieu Othacehe
  2021-05-29 21:44   ` Ludovic Courtès
  2021-06-18 12:33 ` Ludovic Courtès
  1 sibling, 2 replies; 7+ messages in thread
From: Christopher Baines @ 2021-05-16 18:26 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 48468

[-- Attachment #1: Type: text/plain, Size: 5546 bytes --]


Mathieu Othacehe <othacehe@gnu.org> writes:

> Hello,
>
> We recently have a lot of those errors on Cuirass:
>
> --8<---------------cut here---------------start------------->8---
> guix substitute: warning: while fetching http://141.80.167.131:5557/nar/g7ka09613k5v1vlznh87yg35905ggw51-python2-scipy-1.2.2-guile-builder: server is somewhat slow
> guix substitute: warning: try `--no-substitutes' if the problem persists
> guix substitute: error: connect*: Connection timed out
> --8<---------------cut here---------------end--------------->8---
>
> which means that the workers are failing to connect to the Cuirass
> remote-server publish server on berlin at 141.80.167.131:5557.
>
> Stracing this publish server shows that connection reuse seems to be
> broken:
>
> --8<---------------cut here---------------start------------->8---
> accept4(9, {sa_family=AF_INET, sin_port=htons(41742), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41744), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41746), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 25
> accept4(9, {sa_family=AF_INET, sin_port=htons(41748), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 24
> accept4(9, {sa_family=AF_INET, sin_port=htons(41750), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41752), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41754), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 25
> accept4(9, {sa_family=AF_INET, sin_port=htons(41756), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41758), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 26
> accept4(9, {sa_family=AF_INET, sin_port=htons(41760), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 24
> accept4(9, {sa_family=AF_INET, sin_port=htons(41762), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41764), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41766), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41768), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 22
> accept4(9, {sa_family=AF_INET, sin_port=htons(41770), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41772), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41774), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41776), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41778), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41780), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41782), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> --8<---------------cut here---------------end--------------->8---
>
> Investigating it, I found that the connection is closed and opened
> multiple times in the call-with-cached-connection procedure of the (guix
> script substitute) module.
>
> It looks like its because a 'bad-headers exception is raised when trying
> to parse an eof object:
>
> --8<---------------cut here---------------start------------->8---
> ;;; (error bad-header (read-header-line #<eof>))
> --8<---------------cut here---------------end--------------->8---
>
> I'm not sure where this eof comes from. There is this comment in the
> http-multiple-get procedure in (guix http-client):
>
> --8<---------------cut here---------------start------------->8---
> ;; Swallow networking errors that could occur due to connection reuse
> ;; and the like; they will be handled down the road when trying to
> ;; read responses.
> (false-if-networking-error
>  (begin
>    (for-each (cut write-request <> buffer) batch)
>    (put-bytevector p (get))
>    (force-output p))))
> --8<---------------cut here---------------end--------------->8---
>
> which would suggest that connection reuse could cause networking errors?
>
> What also puzzles me it that the main guix publish server on berlin does
> not seem to present this issue. That would indicate that this error is
> caused by how the Cuirass remote-server publish server is started or
> configured.
>
> Ludo, Chris, any idea?

While I've been working in this area, I've actually been trying to pick
apart the connection caching, since the single thread assumption doesn't
hold in the Guix Build Coordinator.

Anyway, I do have a theory. Assuming I'm correct in saying that there's
no nginx between the client and publish server here, I think that's your
configuration difference.

For ci.guix.gnu.org, as well as data.guix.gnu.org, it's NGinx which is
keeping connections alive. I'm not sure the Guile code for the publish
server does similarly, so talking to it directly might be different.

That's on the server side, the actual problem is probably on the client
side, as I guess there are possibly places where closed connections
aren't handled properly. This reminds me I sent some patches relating to
closing connections, this could well be related [1].

1: https://issues.guix.gnu.org/47174

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 987 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#48468: substitute server connection timeout
  2021-05-16 18:26 ` Christopher Baines
@ 2021-05-17 14:49   ` Mathieu Othacehe
  2021-05-18 14:35     ` Mathieu Othacehe
  2021-05-29 21:44   ` Ludovic Courtès
  1 sibling, 1 reply; 7+ messages in thread
From: Mathieu Othacehe @ 2021-05-17 14:49 UTC (permalink / raw)
  To: Christopher Baines; +Cc: 48468


Hey,

> That's on the server side, the actual problem is probably on the client
> side, as I guess there are possibly places where closed connections
> aren't handled properly. This reminds me I sent some patches relating to
> closing connections, this could well be related [1].

Oh, you're right, the Nginx server probably makes the difference
here. The http-write procedure of the Guile (web server http) module
seems to handle keep-alive connections. However, the wrapping http-write
procedure of (guix scripts publish) may not.

I'll have a closer look, thanks for your help.

Mathieu




^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#48468: substitute server connection timeout
  2021-05-17 14:49   ` Mathieu Othacehe
@ 2021-05-18 14:35     ` Mathieu Othacehe
  2021-05-21 13:30       ` Mathieu Othacehe
  0 siblings, 1 reply; 7+ messages in thread
From: Mathieu Othacehe @ 2021-05-18 14:35 UTC (permalink / raw)
  To: Christopher Baines; +Cc: 48468


Hey,

> I'll have a closer look, thanks for your help.

So this snippet in the http-write procedure of the (guix scripts
publish) module:

--8<---------------cut here---------------start------------->8---
          (swallow-zlib-error
           (close-port port))
--8<---------------cut here---------------end--------------->8---

is closing the client port unconditionally, which means that guix
publish cannot keep connections alive, unless sitting behind an Nginx
proxy.

I'm trying to turn the close-port call into a maybe-close-port with the
following procedure:

--8<---------------cut here---------------start------------->8---
(define (maybe-close-port port)
    (cond
     ((keep-alive? response)
      (poll-set-add! (http-poll-set server) port *events*))
     (else
      (close-port port))))
--8<---------------cut here---------------end--------------->8---

however this is terribly hacky, as I need to access the private poll-set
from (web server http).

Ludo, do you have a better idea?

Thanks,

Mathieu




^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#48468: substitute server connection timeout
  2021-05-18 14:35     ` Mathieu Othacehe
@ 2021-05-21 13:30       ` Mathieu Othacehe
  0 siblings, 0 replies; 7+ messages in thread
From: Mathieu Othacehe @ 2021-05-21 13:30 UTC (permalink / raw)
  To: Christopher Baines; +Cc: 48468


Hey,

I posted a patchset adding keep-alive support to guix publish earlier:
https://issues.guix.gnu.org/48556.

Thanks,

Mathieu




^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#48468: substitute server connection timeout
  2021-05-16 18:26 ` Christopher Baines
  2021-05-17 14:49   ` Mathieu Othacehe
@ 2021-05-29 21:44   ` Ludovic Courtès
  1 sibling, 0 replies; 7+ messages in thread
From: Ludovic Courtès @ 2021-05-29 21:44 UTC (permalink / raw)
  To: Christopher Baines; +Cc: Mathieu Othacehe, 48468

Hi,

Christopher Baines <mail@cbaines.net> skribis:

> Mathieu Othacehe <othacehe@gnu.org> writes:

[...]

>> Stracing this publish server shows that connection reuse seems to be
>> broken:
>>
>> --8<---------------cut here---------------start------------->8---
>> accept4(9, {sa_family=AF_INET, sin_port=htons(41742), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
>> accept4(9, {sa_family=AF_INET, sin_port=htons(41744), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
>> accept4(9, {sa_family=AF_INET, sin_port=htons(41746), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 25
>> accept4(9, {sa_family=AF_INET, sin_port=htons(41748), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 24
>> accept4(9, {sa_family=AF_INET, sin_port=htons(41750), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21

Ouch.

>> Investigating it, I found that the connection is closed and opened
>> multiple times in the call-with-cached-connection procedure of the (guix
>> script substitute) module.
>>
>> It looks like its because a 'bad-headers exception is raised when trying
>> to parse an eof object:
>>
>> --8<---------------cut here---------------start------------->8---
>> ;;; (error bad-header (read-header-line #<eof>))
>> --8<---------------cut here---------------end--------------->8---
>>
>> I'm not sure where this eof comes from. There is this comment in the
>> http-multiple-get procedure in (guix http-client):
>>
>> --8<---------------cut here---------------start------------->8---
>> ;; Swallow networking errors that could occur due to connection reuse
>> ;; and the like; they will be handled down the road when trying to
>> ;; read responses.
>> (false-if-networking-error
>>  (begin
>>    (for-each (cut write-request <> buffer) batch)
>>    (put-bytevector p (get))
>>    (force-output p))))
>> --8<---------------cut here---------------end--------------->8---
>>
>> which would suggest that connection reuse could cause networking errors?

[...]

> That's on the server side, the actual problem is probably on the client
> side, as I guess there are possibly places where closed connections
> aren't handled properly.

D’oh.  So we should be able to use it by talking to a bare ‘guix
publish’?  I tried and failed to reproduce it, but that’s probably not
deterministic.

> This reminds me I sent some patches relating to closing connections,
> this could well be related [1].
>
> 1: https://issues.guix.gnu.org/47174

This one will properly close connections when the servers asks for it,
but would that make a difference?

Thanks,
Ludo’.




^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#48468: substitute server connection timeout
  2021-05-16 17:57 bug#48468: substitute server connection timeout Mathieu Othacehe
  2021-05-16 18:26 ` Christopher Baines
@ 2021-06-18 12:33 ` Ludovic Courtès
  1 sibling, 0 replies; 7+ messages in thread
From: Ludovic Courtès @ 2021-06-18 12:33 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 48468

Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

> Investigating it, I found that the connection is closed and opened
> multiple times in the call-with-cached-connection procedure of the (guix
> script substitute) module.
>
> It looks like its because a 'bad-headers exception is raised when trying
> to parse an eof object:
>
> ;;; (error bad-header (read-header-line #<eof>))
>
>
> I'm not sure where this eof comes from. There is this comment in the
> http-multiple-get procedure in (guix http-client):
>
> ;; Swallow networking errors that could occur due to connection reuse
> ;; and the like; they will be handled down the road when trying to
> ;; read responses.
> (false-if-networking-error
>  (begin
>    (for-each (cut write-request <> buffer) batch)
>    (put-bytevector p (get))
>    (force-output p))))
>
> which would suggest that connection reuse could cause networking errors?

Trying to see exactly which bit is at fault here.  First,
‘http-multiple-get’ seems to be working as expected when passed a fresh
connection:

--8<---------------cut here---------------start------------->8---
scheme@(guix http-client)> (http-multiple-get (string->uri "https://ci.guix.gnu.org")
					      (lambda (request response port result)
						(get-bytevector-n port (response-content-length response))
						(cons #t result))
					      '()
					      (make-list
					       5000
					       (build-request (build-uri 'https
										   #:host "ci.guix.gnu.org"
										   #:path "/d7gpr41qpsifri54vp2lzjs99zkylscq.narinfo"))))
connecting (5000 requests left)...connecting (4400 requests left)...connecting (3800 requests left)...connecting (3200 requests left)...connecting (2600 requests left)...connecting (2000 requests left)...connecting (1400 requests left)...connecting (800 requests left)...connecting (200 requests left)...$143 = (#t #t …)
scheme@(guix http-client)> (length $143)
$144 = 5000
--8<---------------cut here---------------end--------------->8---

But maybe I’m not looking at the right thing.

Do you have evidence or a reproducer?

Thanks,
Ludo’.




^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-06-18 12:34 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-16 17:57 bug#48468: substitute server connection timeout Mathieu Othacehe
2021-05-16 18:26 ` Christopher Baines
2021-05-17 14:49   ` Mathieu Othacehe
2021-05-18 14:35     ` Mathieu Othacehe
2021-05-21 13:30       ` Mathieu Othacehe
2021-05-29 21:44   ` Ludovic Courtès
2021-06-18 12:33 ` Ludovic Courtès

unofficial mirror of bug-guix@gnu.org 

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://yhetil.org/guix-bugs/0 guix-bugs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 guix-bugs guix-bugs/ https://yhetil.org/guix-bugs \
		bug-guix@gnu.org
	public-inbox-index guix-bugs

Example config snippet for mirrors.
Newsgroups are available over NNTP:
	nntp://news.yhetil.org/yhetil.gnu.guix.bugs
	nntp://news.gmane.io/gmane.comp.gnu.guix.bugs


AGPL code for this site: git clone http://ou63pmih66umazou.onion/public-inbox.git