From: Jay Sulzberger <jays@panix.com>
To: guix-devel@gnu.org
Cc: Jay Sulzberger <jays@panix.com>
Subject: Re: Postmortem of service downtime
Date: Thu, 23 May 2024 17:41:36 +0000
Message-ID: <Pine.NEB.4.64.2405231739040.27803@panix3.panix.com>
In-Reply-To: <877cfk77vk.fsf@gnu.org>


On Thu, 23 May 2024, Ludovic Courtès <ludo@gnu.org> wrote:

> From Sunday May 19th to Tuesday May 21st, for about 36h,
> bayfront.guix.gnu.org, the machine behind many services, went down:
>
>  https://lists.gnu.org/archive/html/info-guix/2024-05/msg00000.html
>
> Affected web sites and services included:
>
>  guix.gnu.org
>  bordeaux.guix.gnu.org
>  logs.guix.gnu.org
>  hpc.guix.info
>  foundation.guix.info
>  packages.guix.gnu.org
>  qa.guix.gnu.org
>
> Here’s the series of events that led to this:
>
>  • The machine had not been rebooted for 7 months and needed to be
>    rebooted to run a newer version of Shepherd (it was on 0.10.2, which
>    had a bug regarding replacements that is fixed in newer versions:
>    <https://issues.guix.gnu.org/67839>).
>
>  • The machine did not reboot.  There’s no IPMI (this fully free system
>    we acquired some years ago did not support it), so all we have is a
>    remote-controlled power controller that allows us to turn it on and
>    off.  This had no effect though: the machine didn’t come back.
>
>    Fellow hackers of Aquilenet, the non-profit ISP that rents the bay
>    in the data center where bayfront is, are looking into setting up
>    serial console access to the machine for us.
>
>  • We (Andreas and myself) scheduled an intervention in the data center
>    where it is, in Bordeaux (France), and could only get there on
>    Tuesday morning.
>
>  • The machine was failing to boot because of an error in the Shepherd
>    config (unbound variable), now fixed:
>
>      https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=97a31249793b8af9923f915140a6732539e9d2a3
>
>    The underlying problem is that an error in a non-essential service
>    would prevent the machine from booting.  This issue is being tracked
>    here:
>
>      https://issues.guix.gnu.org/71144
>
>    Such errors can be detected by testing the config in ‘guix system
>    vm’, at the cost of extra time for sysadmins.
>
>  • Pulling and reconfiguring the machine was extremely slow.  This is
>    in part due to spinning disks, and in part due to the fact that we
>    had to pull the right commit that would allow us to not rebuild
>    Linux-libre locally (substitutes for the latest upgrade, from
>    Monday, were unavailable; also we had to pass
>    --substitute-urls=https://hydra-guix-129.guix.gnu.org in lieu of the
>    default https://bordeaux.guix.gnu.org, which was unavailable).
>
>    A large part of the slowness was due to ‘guix substitute’ reading
>    all the 300K+ entries from /var/guix/substitute/cache and deleting
>    them, one by one (this took several minutes).  Chris had mentioned
>    that performance issue in the past; it’s not much of a problem on
>    one’s laptop with an SSD, but it’s clearly a problem here where
>    there are more entries than usual.  We should at least drastically
>    reduce the TTL of cache entries.
>
>  • qa-frontpage failed to build when we first reconfigured the machine,
>    so we commented it out.  This is now fixed:
>
>      https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=3fecb1e8fdea65a7440fec403c1c52da197b5dfe
>
>  • guix-packages-website (the server behind packages.guix.gnu.org)
>    still refuses to start with an Artanis error:
>
>      https://issues.guix.gnu.org/71138
>
> Ludo’, on behalf of the emergency rescue^W^W sysadmin team.
>
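
On the substitute cache point above: a minimal sketch, in Python rather
than Guile, of what TTL-based expiry of a cache directory costs.  This is
not the code 'guix substitute' runs.  The function name prune_expired, the
use of file mtime as the expiry clock, and the flat walk over the directory
are all assumptions made for illustration.

```python
import os
import time


def prune_expired(cache_dir, ttl_seconds, now=None):
    """Delete files under cache_dir whose mtime is older than ttl_seconds.

    Hypothetical helper: file mtime stands in for whatever expiry
    metadata a real cache entry carries.  Returns (kept, removed).
    """
    now = time.time() if now is None else now
    kept = removed = 0
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                # One stat and at most one unlink per entry: the cost
                # grows linearly with the number of cached entries.
                age = now - os.stat(path).st_mtime
                if age > ttl_seconds:
                    os.unlink(path)
                    removed += 1
                else:
                    kept += 1
            except FileNotFoundError:
                # The entry vanished while we were walking; skip it.
                continue
    return kept, removed
```

Each entry costs at least one stat and possibly one unlink, so on spinning
disks a directory with 300K+ entries is slow to sweep however the loop is
written.  A shorter TTL helps mainly by keeping the number of live entries
small between sweeps.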

Dear Ludo and Team, thank you for the report!

oo--JS.
