From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mp12.migadu.com ([2001:41d0:2:4a6f::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by ms9.migadu.com with LMTPS id iGrQD9j4nmSacQAASxT56A (envelope-from ) for ; Fri, 30 Jun 2023 17:46:32 +0200 Received: from aspmx1.migadu.com ([2001:41d0:2:4a6f::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by mp12.migadu.com with LMTPS id MMmSD9j4nmSZCwEAauVa8A (envelope-from ) for ; Fri, 30 Jun 2023 17:46:32 +0200 Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by aspmx1.migadu.com (Postfix) with ESMTPS id A2DCE7544 for ; Fri, 30 Jun 2023 17:46:31 +0200 (CEST) Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1qFGJw-0004Gm-Uo; Fri, 30 Jun 2023 11:46:04 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1qFGJu-0004Dl-Rb for bug-guix@gnu.org; Fri, 30 Jun 2023 11:46:02 -0400 Received: from debbugs.gnu.org ([209.51.188.43]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1qFGJu-0000ST-It for bug-guix@gnu.org; Fri, 30 Jun 2023 11:46:02 -0400 Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1qFGJu-0001CA-Be for bug-guix@gnu.org; Fri, 30 Jun 2023 11:46:02 -0400 X-Loop: help-debbugs@gnu.org Subject: bug#64297: [Cuirass] Remote server not picking up job, losing workers Resent-From: Ludovic =?UTF-8?Q?Court=C3=A8s?= Original-Sender: "Debbugs-submit" Resent-CC: bug-guix@gnu.org Resent-Date: Fri, 30 Jun 2023 15:46:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 64297 X-GNU-PR-Package: guix X-GNU-PR-Keywords: To: 64297@debbugs.gnu.org Received: via spool by 64297-submit@debbugs.gnu.org id=B64297.16881399594577 (code B ref 64297); Fri, 30 Jun 2023 15:46:02 +0000 Received: (at 64297) by debbugs.gnu.org; 30 Jun 2023 15:45:59 +0000 Received: from localhost ([127.0.0.1]:55642 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1qFGJq-0001Bi-Iv for submit@debbugs.gnu.org; Fri, 30 Jun 2023 11:45:59 -0400 Received: from eggs.gnu.org ([209.51.188.92]:49590) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1qFGJo-0001BQ-AC for 64297@debbugs.gnu.org; Fri, 30 Jun 2023 11:45:57 -0400 Received: from fencepost.gnu.org ([2001:470:142:3::e]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1qFGJi-0000PA-VS for 64297@debbugs.gnu.org; Fri, 30 Jun 2023 11:45:50 -0400 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=gnu.org; s=fencepost-gnu-org; h=MIME-Version:In-Reply-To:Date:References:Subject:To: From; bh=8rIKMXatkXoNMmANmTQ4/BnFhzdOWZ2SbYzTlmzDKpY=; b=T153Gm0XXJsZWnDXNXTx 0PAgNrr/nSaFNzc5qQm1drxazcg/41x+Q4pYBlSVg2sliOh+vbxZrbeMCARpmKupZYA426fzJuyFi jD6sLtFUBtRioYuDiAzu9IOs/Zd/HlDgPy41Z/t3TqwaX5/BVRCGTOYrMBdaqk8sOV/duL60RtzwK as1uVhognNtmtsTLhk2qhReU3yuS5M/Thok+32kmYgLemjdrDLBiO03aI6zcrQcr/0pZHKBq7JY/7 R1TVLZSrXGnZV1PV1y45QLok+nyESmH+xnZsWL3Sz+hoTy6dX+wp3xVjm828/YGg96/N4iNltY3Po 8+f6VAfC5Fdweg==; Received: from [193.50.110.213] (helo=ribbon) by fencepost.gnu.org with esmtpsa (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1qFGJi-0002JO-Iu for 64297@debbugs.gnu.org; Fri, 30 Jun 2023 11:45:50 -0400 From: Ludovic =?UTF-8?Q?Court=C3=A8s?= References: <87fs6ey56z.fsf@inria.fr> Date: Fri, 30 Jun 2023 17:45:48 +0200 In-Reply-To: <87fs6ey56z.fsf@inria.fr> ("Ludovic =?UTF-8?Q?Court=C3=A8s?="'s message of "Mon, 26 Jun 2023 10:54:12 +0200") Message-ID: <87wmzlymvn.fsf@gnu.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list X-BeenThere: bug-guix@gnu.org List-Id: Bug reports for GNU Guix List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-guix-bounces+larch=yhetil.org@gnu.org Sender: bug-guix-bounces+larch=yhetil.org@gnu.org X-Migadu-Country: US X-Migadu-Flow: FLOW_IN ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=yhetil.org; s=key1; t=1688139992; h=from:from:sender:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding:resent-cc: resent-from:resent-sender:resent-message-id:in-reply-to:in-reply-to: references:references:list-id:list-help:list-unsubscribe: list-subscribe:list-post:dkim-signature; bh=8rIKMXatkXoNMmANmTQ4/BnFhzdOWZ2SbYzTlmzDKpY=; b=IMsHX4Kezg12Xx7cN3r4oQpsrOtE/RykCPhZGlwp8174i6RpK6zi6QaNm+dd5Vt4eY5Y0m pC45+G2M+1plUeRTls4QCyDuDY41+fg72EuhxNSdfBNkuUfbHG8VBl/D1Hm2uTUaRcmIgV BzAdKf1A1euwDviVVJN9oCOUTBhWu/A4e4pP1x+/lt5tlPu91rXaKP+WWVpzfDyWEB188D WewBoHL3/QWYFZ9A94hF83U/KsiJUyKR71gmMMpvnUFVyNeWhhdavchybT6jPnrkBjLxmC Yifxl0nZSST8WAwr2kI736/2crigfDMX+a1F1Lqcb+127lrzm/w4kOxlshUc7g== ARC-Authentication-Results: i=1; aspmx1.migadu.com; dkim=fail ("headers rsa verify failed") header.d=gnu.org header.s=fencepost-gnu-org header.b=T153Gm0X; dmarc=pass (policy=none) header.from=gnu.org; spf=pass (aspmx1.migadu.com: domain of "bug-guix-bounces+larch=yhetil.org@gnu.org" designates 209.51.188.17 as permitted sender) smtp.mailfrom="bug-guix-bounces+larch=yhetil.org@gnu.org" ARC-Seal: i=1; s=key1; d=yhetil.org; t=1688139992; a=rsa-sha256; cv=none; b=EtO97aQ81no+YArouzEVQHQj3RKbpsEvN+BmlTrsADfCUYD86RxguFfHbmpCTDYDFaWjBl 7rwGh51xcg1XQ7Y7yGBtGUCpx2yVwdUm4W2FEWy6LNOOP7CbL9uVOrRxgOo/3IfvIqBSkv nRijWIT646R+O7zq52u/LW1SVMtxkIOfcJoaqoskq/ukfPUNMskqXyU14/MnpD7luTfR6n Y7t17lG3FiHMFT+RJOpqR5BNgJX7VtE/uCqs4i+rPmEbV9xvv3t4zz4rAh4F+9L2cAaT1j t/HuSrCzDvTffEObMFLr66QN+CfRnQ/NJV/rndWBnM3RsnQSo4Lb+z2CD9WH0g== X-Migadu-Scanner: scn1.migadu.com X-Migadu-Spam-Score: -2.66 Authentication-Results: aspmx1.migadu.com; dkim=fail ("headers rsa verify failed") header.d=gnu.org header.s=fencepost-gnu-org header.b=T153Gm0X; dmarc=pass (policy=none) header.from=gnu.org; spf=pass (aspmx1.migadu.com: domain of "bug-guix-bounces+larch=yhetil.org@gnu.org" designates 209.51.188.17 as permitted sender) smtp.mailfrom="bug-guix-bounces+larch=yhetil.org@gnu.org" X-Migadu-Queue-Id: A2DCE7544 X-Spam-Score: -2.66 X-TUID: 2YiQtHVtZci/ Ludovic Court=C3=A8s skribis: > As of cuirass@1.1.0-16.b825967, =E2=80=98cuirass remote-server=E2=80=99 a= ppears to not > pick jobs as quickly as it should and to lose sight of workers (you can > see them come and go on ). > > /var/log/cuirass-remote-worker.log shows that it does build things, but > only sporadically. Then there are things like: > > 2023-06-26 10:07:58 warning: Poll loop busy during 3404 seconds. > > This is presumably related to Cuirass commit > c4743b54720e86b0e0b0295fb6d33977e4293644 (previously =E2=80=98remote-work= er=E2=80=99 > would have a database worker thread; now it opens a new connection every > time=E2=80=94a stopgap before it=E2=80=99s fiberized, but apparently not = a good one). Apparently this has to do with squee calling =E2=80=98current-read-waiter= =E2=80=99 (i.e., poll(2)) while waiting for its response and passing checking the wrong FD for some reason, as in this case: --8<---------------cut here---------------start------------->8--- 18484 15:01:00 connect(55, {sa_family=3DAF_UNIX, sun_path=3D"/tmp/ephemeral= pg.58xnKh/.s.PGSQL.5432"}, 110) =3D 0 <0.000019> 18484 15:01:00 getsockopt(55, SOL_SOCKET, SO_ERROR, [0], [4]) =3D 0 <0.0000= 14> 18484 15:01:00 getsockname(55, {sa_family=3DAF_UNIX}, [128 =3D> 2]) =3D 0 <= 0.000010> 18484 15:01:00 poll([{fd=3D55, events=3DPOLLOUT|POLLERR}], 1, -1) =3D 1 ([{= fd=3D55, revents=3DPOLLOUT}]) <0.000011> 18484 15:01:00 sendto(55, "\0\0\0!\0\3\0\0user\0ludo\0database\0test\0\0", = 33, MSG_NOSIGNAL, NULL, 0) =3D 33 <0.000014> 18484 15:01:00 poll([{fd=3D55, events=3DPOLLIN|POLLERR}], 1, -1 18484 15:01:00 <... poll resumed>) =3D 1 ([{fd=3D55, revents=3DPOLLIN}= ]) <0.001786> 18484 15:01:00 recvfrom(55, "R\0\0\0\10\0\0\0\0S\0\0\0\26application_name\0= \0S\0\0\0\31client_encoding\0UTF8\0S\0\0\0\27DateStyle\0ISO, MD"..., 16384,= 0, NULL, NULL) =3D 376 <0.000016> 18484 15:01:00 sendto(55, "P\0\0\0\366\0\nUPDATE Builds SET status =3D -2, = worker =3D null FROM\n(SELECT id FROM Workers"..., 291, MSG_NOSIGNAL, NULL,= 0 18484 15:01:00 <... sendto resumed>) =3D 291 <0.000589> 18484 15:01:00 poll([{fd=3D54, events=3DPOLLIN}], 1, -1 18484 15:03:14 <... poll resumed>) =3D 1 ([{fd=3D54, revents=3DPOLLIN}= ]) <134.319198> 18484 15:03:14 recvfrom(55, 18484 15:03:14 <... recvfrom resumed>"1\0\0\0\0042\0\0\0\4n\0\0\0\4C\0\0\0\= rUPDATE 0\0Z\0\0\0\5I", 16384, 0, NULL, NULL) =3D 35 <0.000026> 18484 15:03:14 sendto(55, "P\0\0\0V\0DELETE FROM Workers WHERE\n(extract(ep= och from now())::int - last_seen) > "..., 131, MSG_NOSIGNAL, NULL, 0 18484 15:03:14 <... sendto resumed>) =3D 131 <0.000084> 18484 15:03:14 poll([{fd=3D54, events=3DPOLLIN}], 1, -1) =3D 1 ([{fd=3D54, = revents=3DPOLLNVAL}]) <0.000013> 18484 15:03:14 recvfrom(55, "1\0\0\0\0042\0\0\0\4n\0\0\0\4C\0\0\0\rDELETE 4= \0Z\0\0\0\5I", 16384, 0, NULL, NULL) =3D 35 <0.000016> 18484 15:03:14 sendto(55, "X\0\0\0\4", 5, MSG_NOSIGNAL, NULL, 0) =3D 5 <0.0= 00025> 18484 15:03:14 close(55) =3D 0 <0.000018> 18484 15:03:14 openat(AT_FDCWD, "/etc/localtime", O_RDONLY|O_CLOEXEC) =3D -= 1 ENOENT (No such file or directory) <0.000019> 18484 15:03:14 write(2, "2023-06-30T15:03:14 warning: Poll loop busy during= 134 seconds.\n", 64) =3D 64 <0.000024> --8<---------------cut here---------------end--------------->8--- In this case FD 54 is a connection with a worker process; terminating that process led poll(2) to return, thus unblocking the =E2=80=9Cpoll loop= =E2=80=9D. The problem is most likely with the connection-to-port caching in squee=E2=80=99s =E2=80=98connection-socket-port=E2=80=99, as can be seen in= this other trace where I added =E2=80=98pk=E2=80=99 calls in =E2=80=98connection-socket-port= =E2=80=99: --8<---------------cut here---------------start------------->8--- 19468 15:37:43 connect(55, {sa_family=3DAF_UNIX, sun_path=3D"/tmp/ephemeral= pg.58xnKh/.s.PGSQL.5432"}, 110) =3D 0 <0.000018> 19468 15:37:43 getsockopt(55, SOL_SOCKET, SO_ERROR, [0], [4]) =3D 0 <0.0000= 10> 19468 15:37:43 getsockname(55, {sa_family=3DAF_UNIX}, [128 =3D> 2]) =3D 0 <= 0.000009> 19468 15:37:43 poll([{fd=3D55, events=3DPOLLOUT|POLLERR}], 1, -1) =3D 1 ([{= fd=3D55, revents=3DPOLLOUT}]) <0.000009> 19468 15:37:43 sendto(55, "\0\0\0!\0\3\0\0user\0ludo\0database\0test\0\0", = 33, MSG_NOSIGNAL, NULL, 0) =3D 33 <0.000012> 19468 15:37:43 poll([{fd=3D55, events=3DPOLLIN|POLLERR}], 1, -1) =3D 1 ([{f= d=3D55, revents=3DPOLLIN}]) <0.002109> 19468 15:37:43 recvfrom(55, "R\0\0\0\10\0\0\0\0S\0\0\0\26application_name\0= \0S\0\0\0\31client_encoding\0UTF8\0S\0\0\0\27DateStyle\0ISO, MD"..., 16384,= 0, NULL, NULL) =3D 376 <0.000009> 19468 15:37:43 sendto(55, "P\0\0\0\366\0\nUPDATE Builds SET status =3D -2, = worker =3D null FROM\n(SELECT id FROM Workers"..., 291, MSG_NOSIGNAL, NULL,= 0) =3D 291 <0.000012> 19468 15:37:43 write(1, "\n", 1) =3D 1 <0.000015> 19468 15:37:43 ioctl(54, TCGETS, 0x7ffd8fde6660) =3D -1 ENOTTY (Inappropria= te ioctl for device) <0.000012> 19468 15:37:43 write(1, ";;; (cached # #)\n", 68) =3D 68 <0.000012> --8<---------------cut here---------------end--------------->8--- Above we open a fresh connection on FD 55, but =E2=80=98connection-socket-p= ort=E2=80=99 determines that the connection object is cached and associated with a port for FD 54. To be continued=E2=80=A6 Ludo=E2=80=99.