From: Jay Sulzberger <jays@panix.com>
To: guix-devel@gnu.org
Cc: Jay Sulzberger <jays@panix.com>
Subject: Re: Postmortem of service downtime
Date: Thu, 23 May 2024 17:41:36 +0000
Message-ID: <Pine.NEB.4.64.2405231739040.27803@panix3.panix.com>
In-Reply-To: <877cfk77vk.fsf@gnu.org>
On Thu, 23 May 2024, Ludovic Courtès <ludo@gnu.org> wrote:
> From Sunday May 19th to Tuesday May 21st, for about 36h,
> bayfront.guix.gnu.org, the machine behind many services, went down:
>
> https://lists.gnu.org/archive/html/info-guix/2024-05/msg00000.html
>
> Affected web sites and services included:
>
> guix.gnu.org
> bordeaux.guix.gnu.org
> logs.guix.gnu.org
> hpc.guix.info
> foundation.guix.info
> packages.guix.gnu.org
> qa.guix.gnu.org
>
> Here’s the series of events that led to this:
>
> • The machine had not been rebooted for 7 months and needed to be
> rebooted to run a newer version of Shepherd (it was on 0.10.2, which
> had a bug regarding replacements that is fixed in newer versions:
> <https://issues.guix.gnu.org/67839>).
>
> • The machine did not reboot. There’s no IPMI (this fully free system
> we acquired some years ago did not support it), so all we have is a
> remote-controlled power controller that allows us to turn it on and
> off. This had no effect though: the machine didn’t come back.
>
> Fellow hackers of Aquilenet, the non-profit ISP that rents the bay
> in the data center where bayfront is, are looking into setting up
> serial console access to the machine for us.
>
> • We (Andreas and myself) scheduled an intervention in the data center
> where it is, in Bordeaux (France), and could only get there on
> Tuesday morning.
>
> • The machine was failing to boot because of an error in the Shepherd
> config (unbound variable), now fixed:
>
> https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=97a31249793b8af9923f915140a6732539e9d2a3
>
> The underlying problem is that an error in a non-essential service
> would prevent the machine from booting. This issue is being tracked
> here:
>
> https://issues.guix.gnu.org/71144
>
> Such errors can be detected by testing the config in ‘guix system
> vm’, at the cost of extra time for sysadmins.
>
> • Pulling and reconfiguring the machine was extremely slow. This is
> in part due to spinning disks, and in part due to the fact that we
> had to pull the right commit that would allow us to not rebuild
> Linux-libre locally (substitutes for the latest upgrade, from
> Monday, were unavailable; also we had to pass
> --substitute-urls=https://hydra-guix-129.guix.gnu.org in lieu of the
> default https://bordeaux.guix.gnu.org, which was unavailable).
>
> A large part of the slowness was due to ‘guix substitute’ reading
> all the 300K+ entries from /var/guix/substitute/cache and deleting
> them, one by one (this took several minutes). Chris had mentioned
> that performance issue in the past; it’s not much of a problem on
> one’s laptop with an SSD, but it’s clearly a problem here where
> there are more entries than usual. We should at least drastically
> reduce the TTL of cache entries.
>
> • qa-frontpage failed to build when we first reconfigured the machine,
> so we commented it out. This is now fixed:
>
> https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=3fecb1e8fdea65a7440fec403c1c52da197b5dfe
>
> • guix-packages-website (the server behind packages.guix.gnu.org)
> still refuses to start with an Artanis error:
>
> https://issues.guix.gnu.org/71138
>
> Ludo’, on behalf of the emergency rescue^W^W sysadmin team.
>
Dear Ludo and Team, thank you for the report!
oo--JS.
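
The point in the quoted report about expiring 300K+ cache entries one by one, and the suggestion to reduce the TTL of cache entries, can be illustrated with a minimal sketch. This is not how 'guix substitute' is implemented (that code is written in Scheme); it only shows TTL-based pruning under stated assumptions: the cache is a flat directory of regular files, an entry's age is taken from its modification time, and the name prune_expired and its parameters are hypothetical.

```python
import os
import time


def prune_expired(cache_dir, ttl_seconds, now=None):
    """Delete regular files in cache_dir whose mtime is older than ttl_seconds.

    Hypothetical helper for illustration only; returns the number of
    entries removed. Subdirectories and symlinks are left untouched.
    """
    now = time.time() if now is None else now

    # First pass: collect expired paths so that the directory is not
    # modified while it is still being scanned.
    expired = []
    with os.scandir(cache_dir) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            age = now - entry.stat(follow_symlinks=False).st_mtime
            if age > ttl_seconds:
                expired.append(entry.path)

    # Second pass: one unlink per expired entry. On spinning disks this
    # per-file cost is what adds up when there are many entries.
    for path in expired:
        os.unlink(path)
    return len(expired)
```

With a shorter TTL, each run finds fewer stale entries, so no single run has to delete hundreds of thousands of files at once. The cost of a full directory scan remains either way.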