From: Ludovic Courtès
To: Mathieu Othacehe
Cc: guix-devel
Subject: Re: Substitute timeouts
Date: Wed, 11 Aug 2021 13:06:01 +0200
Message-ID: <87pmukmeau.fsf@gnu.org>
In-Reply-To: <875ywec3oo.fsf@gnu.org> (Mathieu Othacehe's message of "Mon, 09 Aug 2021 12:28:39 +0200")
References: <875ywec3oo.fsf@gnu.org>

Hi,

Mathieu Othacehe wrote:

> I have been investigating a problem that is visible both on the main
> guix publish server at https://ci.guix.gnu.org[1] and on the Cuirass
> build farm[2].
>
> This error comes from the fact that the publish server does not accept
> the "guix substitute" connection requests within the %fetch-timeout
> duration of 5 seconds.

Thanks for getting to the bottom of this!

> The main guix publish server is using a cache. If a requested narinfo is
> not in the cache, it will be baked and the client receives a 404
> error. Since ecaa102a58ad3ab0b42e04a3d10d7c761c05ec98 and the
> introduction of the bypass mechanism, small store items are directly
> returned.
>
> This means that the "narinfo-string" procedure can be called directly in
> the main publish thread. Running perf on the main publish server reveals
> that this procedure can be really expensive under IO pressure (GC
> running for example) because it opens a lot of files. I have observed
> that the "read-derivation-from-file" call can take up to 600 ms.
>
> If multiple clients were to ask narinfo of several items not yet cached,
> under IO pressure, I think that the publish server could become
> unresponsive and cause the timeout errors.

Yeah, it’s a double-edged sword.
If this is a problem on the main ‘guix publish’ server, we can lower
the bypass threshold, which is currently 50 MiB:

  https://git.savannah.gnu.org/cgit/guix/maintenance.git/tree/hydra/modules/sysadmin/services.scm#n450

WDYT?

> The fact that Cuirass triggers the baking of successfully built
> derivations probably doesn't help here.

Could be.  This threshold seemed to work fine earlier (and still does,
mostly?).

> Now regarding the timeout errors that are much more frequent on the
> Cuirass build farm, the cause varies a bit. The Cuirass publish server
> running on Berlin does not use a cache. This means that the
> "narinfo-string" procedure is called for each request, in the main
> thread.
>
> To fix those issues, a solution could be to run the "narinfo-string" in
> a separate thread, but it will make the publish server code even harder
> to understand.

True!  Though maybe it wouldn’t be that much worse.  :-)

The problem is that this thing is very much single-threaded, with
exceptions in a couple of places.  We could add one more exception like
you write, or fiberize it, or run it behind nginx, possibly with a tiny
bit of caching.

> My proposition would be to get rid of the bypass mechanism and instead
> implement a retry when some substitutes are reported as being baked,
> as proposed by Miguel[3].
>
> I think this is the most reasonable solution. This way, users won't
> receive 404 errors and start building substitutes that are being
> baked[4].

(If I followed correctly, the bypass mechanism is not at fault regarding
timeouts on the Cuirass publish server since it’s not using a cache,
right?)

I don’t think it’s reasonable for ‘guix substitute’ to just wait upon
202 (or 404, that doesn’t matter).  First, in terms of UI, you’d have a
command sitting there and doing nothing, which can be off-putting.
Second, clients have no idea how long they’re going to wait; it could
be that the nar is going to be baked within seconds, or it could take
20 minutes if the baking queue is already crowded or if the user is
asking for a big store item like libreoffice.  Third, in many cases,
building locally is likely to be faster than waiting for substitutes to
be available (the majority of packages build very quickly, though the
few most popular leaf packages take a long time to build).

> It will also allow the Cuirass build farm to use directly the main guix
> publish server, simplifying the current CI setup.

The only reason why Cuirass runs its own publish server is to avoid
overloading the main one?

Thanks,
Ludo’.
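
For readers following the thread, the cache/bypass trade-off discussed
above can be sketched roughly as follows.  This is a toy model in Python,
not the actual ‘guix publish’ code (which is written in Scheme): the class
name, method names, store paths, and the exact comparison against the
threshold are all assumptions made for illustration.  Only the 50 MiB
default, the 404 reply for items being baked, and the fact that bypassed
items are computed inline come from the discussion.  Whether the real
server also bakes bypassed items in the background is left out.

```python
from dataclasses import dataclass, field

MIB = 1024 * 1024


@dataclass
class PublishSketch:
    """Toy model of a caching publish server with a bypass threshold."""

    bypass_threshold: int = 50 * MIB            # value quoted in the thread
    cache: dict = field(default_factory=dict)   # store item -> narinfo text
    baking_queue: list = field(default_factory=list)

    def narinfo_string(self, item: str) -> str:
        # Hypothetical stand-in for the expensive procedure, which in the
        # real server opens many files and can be slow under I/O pressure.
        return f"StorePath: {item}\n"

    def handle_narinfo_request(self, item: str, size: int):
        """Return (HTTP status, body) for a narinfo request."""
        if item in self.cache:
            # Already baked: cheap to serve.
            return 200, self.cache[item]
        if size <= self.bypass_threshold:
            # Bypass: the narinfo is computed inline, in the main thread,
            # so a slow call here delays every other client.
            return 200, self.narinfo_string(item)
        # Too big to bypass: queue it for baking, client gets a 404 for now.
        if item not in self.baking_queue:
            self.baking_queue.append(item)
        return 404, None


# Lowering the threshold moves more requests to the "bake, then 404" path.
server = PublishSketch(bypass_threshold=5 * MIB)
print(server.handle_narinfo_request("/gnu/store/example-small", 1 * MIB)[0])  # 200
print(server.handle_narinfo_request("/gnu/store/example-big", 10 * MIB)[0])   # 404
```

The sketch shows why the threshold is a double-edged sword: a high value
spares clients the 404 but puts more expensive work in the main thread,
while a low value keeps the main thread responsive at the cost of more
"come back later" replies.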