From: Ludovic Courtès
To: guix-devel@gnu.org
Subject: Postmortem of service downtime
Date: Thu, 23 May 2024 19:31:11 +0200
Message-ID: <877cfk77vk.fsf@gnu.org>

From Sunday May 19th to Tuesday May 21st, for about 36 hours,
bayfront.guix.gnu.org, the machine behind many services, was down:

  https://lists.gnu.org/archive/html/info-guix/2024-05/msg00000.html

Affected web sites and services included:

  guix.gnu.org
  bordeaux.guix.gnu.org
  logs.guix.gnu.org
  hpc.guix.info
  foundation.guix.info
  packages.guix.gnu.org
  qa.guix.gnu.org

Here’s the series of events that led to this:

  • The machine had not been rebooted for 7 months and needed to be
    rebooted to run a newer version of Shepherd (it was on 0.10.2,
    which had a bug regarding replacements that is fixed in newer
    versions).

  • The machine did not reboot. There’s no IPMI (this fully free
    system we acquired some years ago did not support it), so all we
    have is a remote-controlled power controller that allows us to
    turn it on and off. This had no effect though: the machine didn’t
    come back.

    Fellow hackers of Aquilenet, the non-profit ISP that rents the bay
    in the data center where bayfront is, are looking into setting up
    serial console access to the machine for us.

  • We (Andreas and myself) scheduled an intervention in the data
    center where it is, in Bordeaux (France), and could only get there
    on Tuesday morning.

  • The machine was failing to boot because of an error in the
    Shepherd config (unbound variable), now fixed:

      https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=97a31249793b8af9923f915140a6732539e9d2a3

    The underlying problem is that an error in a non-essential service
    would prevent the machine from booting. This issue is being
    tracked here:

      https://issues.guix.gnu.org/71144

    Such errors can be detected by testing the config in ‘guix system
    vm’, at the cost of extra time for sysadmins.

  • Pulling and reconfiguring the machine was extremely slow. This is
    in part due to spinning disks, and in part due to the fact that we
    had to pull the right commit that would allow us to not rebuild
    Linux-libre locally (substitutes for the latest upgrade, from
    Monday, were unavailable; also we had to pass
    --substitute-urls=https://hydra-guix-129.guix.gnu.org in lieu of
    the default https://bordeaux.guix.gnu.org, which was unavailable).

    A large part of the slowness was due to ‘guix substitute’ reading
    all the 300K+ entries from /var/guix/substitute/cache and deleting
    them, one by one (this took several minutes). Chris had mentioned
    that performance issue in the past; it’s not much of a problem on
    one’s laptop with an SSD, but it’s clearly a problem here where
    there are more entries than usual. We should at least drastically
    reduce the TTL of cache entries.

  • qa-frontpage failed to build when we first reconfigured the
    machine, so we commented it out. This is now fixed:

      https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=3fecb1e8fdea65a7440fec403c1c52da197b5dfe

  • guix-packages-website (the server behind packages.guix.gnu.org)
    still refuses to start with an Artanis error:

      https://issues.guix.gnu.org/71138

Ludo’, on behalf of the emergency rescue^W^W sysadmin team.