From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 23 May 2024 17:41:36 +0000 ()
From: Jay Sulzberger
To: guix-devel@gnu.org
cc: Jay Sulzberger
Subject: Re: Postmortem of service downtime
In-Reply-To: <877cfk77vk.fsf@gnu.org>
Message-ID:
References: <877cfk77vk.fsf@gnu.org>
MIME-Version: 1.0
Content-Type: MULTIPART/MIXED; BOUNDARY="0-406268268-1716486096=:27803"
X-Mailman-Approved-At: Sat, 25 May 2024 11:29:35 -0400
List-Id: "Development of GNU Guix and the GNU System distribution."

This message is in MIME format.  The first part should be readable text,
while the remaining parts are likely unreadable without MIME-aware tools.

--0-406268268-1716486096=:27803
Content-Type: TEXT/PLAIN; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8BIT

On Thu, 23 May 2024, Ludovic Courtès wrote:

> From Sunday May 19th to Tuesday May 21st, for about 36h,
> bayfront.guix.gnu.org, the machine behind many services went down:
>
>   https://lists.gnu.org/archive/html/info-guix/2024-05/msg00000.html
>
> Affected web sites and services included:
>
>   guix.gnu.org
>   bordeaux.guix.gnu.org
>   logs.guix.gnu.org
>   hpc.guix.info
>   foundation.guix.info
>   packages.guix.gnu.org
>   qa.guix.gnu.org
>
> Here’s the series of events that led to this:
>
>   • The machine had not been rebooted for 7 months and needed to be
>     rebooted to run a newer version of Shepherd (it was on 0.10.2, which
>     had a bug regarding replacements that is fixed in newer versions:
>     ).
>
>   • The machine did not reboot.  There’s no IPMI (this fully free system
>     we acquired some years ago did not support it), so all we have is a
>     remote-controlled power controller that allows us to turn it on and
>     off.  This had no effect though: the machine didn’t come back.
>
>     Fellow hackers of Aquilenet, the non-profit ISP that rents the bay
>     in the data center where bayfront is, are looking into setting up
>     serial console access to the machine for us.
>
>   • We (Andreas and myself) scheduled an intervention in the data center
>     where it is, in Bordeaux (France), and could only get there on
>     Tuesday morning.
>
>   • The machine was failing to boot because of an error in the Shepherd
>     config (unbound variable), now fixed:
>
>       https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=97a31249793b8af9923f915140a6732539e9d2a3
>
>     The underlying problem is that an error in a non-essential service
>     would prevent the machine from booting.  This issue is being tracked
>     here:
>
>       https://issues.guix.gnu.org/71144
>
>     Such errors can be detected by testing the config in ‘guix system
>     vm’, at the cost of extra time for sysadmins.
>
>   • Pulling and reconfiguring the machine was extremely slow.  This is
>     in part due to spinning disks, and in part due to the fact that we
>     had to pull the right commit that would allow us to not rebuild
>     Linux-libre locally (substitutes for the latest upgrade, from
>     Monday, were unavailable; also we had to pass
>     --substitute-urls=https://hydra-guix-129.guix.gnu.org in lieu of the
>     default https://bordeaux.guix.gnu.org, which was unavailable).
>
>     A large part of the slowness was due to ‘guix substitute’ reading
>     all the 300K+ entries from /var/guix/substitute/cache and deleting
>     them, one by one (this took several minutes).  Chris had mentioned
>     that performance issue in the past; it’s not much of a problem on
>     one’s laptop with an SSD, but it’s clearly a problem here where
>     there are more entries than usual.  We should at least drastically
>     reduce the TTL of cache entries.
>
>   • qa-frontpage failed to build when we first reconfigured the machine,
>     so we commented it out.  This is now fixed:
>
>       https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=3fecb1e8fdea65a7440fec403c1c52da197b5dfe
>
>   • guix-packages-website (the server behind packages.guix.gnu.org)
>     still refuses to start with an Artanis error:
>
>       https://issues.guix.gnu.org/71138
>
> Ludo’, on behalf of the emergency rescue^W^W sysadmin team.
>

Dear Ludo and Team, thank you for the report!

oo--JS.

--0-406268268-1716486096=:27803--