On Thu, 23 May 2024, Ludovic Courtès <ludo@gnu.org> wrote:

> From Sunday May 19th to Tuesday May 21st, for about 36h,
> bayfront.guix.gnu.org, the machine behind many services, went down:
>
>  https://lists.gnu.org/archive/html/info-guix/2024-05/msg00000.html
>
> Affected web sites and services included:
>
>  guix.gnu.org
>  bordeaux.guix.gnu.org
>  logs.guix.gnu.org
>  hpc.guix.info
>  foundation.guix.info
>  packages.guix.gnu.org
>  qa.guix.gnu.org
>
> Here’s the series of events that led to this:
>
>  • The machine had not been rebooted for 7 months and needed to be
>    rebooted to run a newer version of Shepherd (it was on 0.10.2, which
>    had a bug regarding replacements that is fixed in newer versions:
>    <https://issues.guix.gnu.org/67839>).
>
>  • The machine did not reboot.  There’s no IPMI (this fully free system
>    we acquired some years ago did not support it), so all we have is a
>    remote-controlled power controller that allows us to turn it on and
>    off.  This had no effect though: the machine didn’t come back.
>
>    Fellow hackers of Aquilenet, the non-profit ISP that rents the bay
>    in the data center where bayfront is, are looking into setting up
>    serial console access to the machine for us.
>
>  • We (Andreas and myself) scheduled an intervention in the data center
>    where it is, in Bordeaux (France), and could only get there on
>    Tuesday morning.
>
>  • The machine was failing to boot because of an error in the Shepherd
>    config (unbound variable), now fixed:
>
>      https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=97a31249793b8af9923f915140a6732539e9d2a3
>
>    The underlying problem is that an error in a non-essential service
>    would prevent the machine from booting.  This issue is being tracked
>    here:
>
>      https://issues.guix.gnu.org/71144
>
>    Such errors can be detected by testing the config in ‘guix system
>    vm’, at the cost of extra time for sysadmins.
>
>  • Pulling and reconfiguring the machine was extremely slow.  This is
>    in part due to spinning disks, and in part due to the fact that we
>    had to pull the right commit that would allow us to not rebuild
>    Linux-libre locally (substitutes for the latest upgrade, from
>    Monday, were unavailable; also we had to pass
>    --substitute-urls=https://hydra-guix-129.guix.gnu.org in lieu of the
>    default https://bordeaux.guix.gnu.org, which was unavailable).
>
>    A large part of the slowness was due to ‘guix substitute’ reading
>    all the 300K+ entries from /var/guix/substitute/cache and deleting
>    them, one by one (this took several minutes).  Chris had mentioned
>    that performance issue in the past; it’s not much of a problem on
>    one’s laptop with an SSD, but it’s clearly a problem here where
>    there are more entries than usual.  We should at least drastically
>    reduce the TTL of cache entries.
>
>  • qa-frontpage failed to build when we first reconfigured the machine,
>    so we commented it out.  This is now fixed:
>
>      https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=3fecb1e8fdea65a7440fec403c1c52da197b5dfe
>
>  • guix-packages-website (the server behind packages.guix.gnu.org)
>    still refuses to start with an Artanis error:
>
>      https://issues.guix.gnu.org/71138
>
> Ludo’, on behalf of the emergency rescue^W^W sysadmin team.
>

Dear Ludo and Team, thank you for the report!
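
On the substitute cache point (300K+ entries deleted one by one, and
the suggestion to reduce the TTL of cache entries), here is a small,
generic sketch of TTL-based pruning.  It is Python, not the Guile code
in 'guix substitute', and it does not claim to mirror how Guix stores
or expires its entries.  The function name prune_expired, the flat
directory layout, and the use of file modification time as the expiry
clock are all assumptions made for illustration:

```python
import os
import time


def prune_expired(cache_dir, ttl_seconds, now=None):
    """Delete regular files directly under cache_dir older than ttl_seconds.

    Hypothetical helper: assumes a flat directory and uses mtime as the
    age of an entry.  Returns the number of files removed.
    """
    now = time.time() if now is None else now
    removed = 0
    # os.scandir yields entries lazily, so a very large directory is
    # never loaded into memory as one list.
    with os.scandir(cache_dir) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue  # skip subdirectories and symlinks
            age = now - entry.stat(follow_symlinks=False).st_mtime
            if age > ttl_seconds:
                os.unlink(entry.path)
                removed += 1
    return removed
```

The cost of one pass still grows with the number of entries, since
each file is visited and unlinked individually.  A shorter TTL does
not make a single deletion cheaper; it only means that fewer entries
pile up between passes, which may matter more on spinning disks.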

oo--JS.