On Thu, 23 May 2024, Ludovic Courtès wrote:

> From Sunday May 19th to Tuesday May 21st, for about 36h,
> bayfront.guix.gnu.org, the machine behind many services went down:
>
>   https://lists.gnu.org/archive/html/info-guix/2024-05/msg00000.html
>
> Affected web sites and services included:
>
>   guix.gnu.org
>   bordeaux.guix.gnu.org
>   logs.guix.gnu.org
>   hpc.guix.info
>   foundation.guix.info
>   packages.guix.gnu.org
>   qa.guix.gnu.org
>
> Here’s the series of events that led to this:
>
>   • The machine had not been rebooted for 7 months and needed to be
>     rebooted to run a newer version of Shepherd (it was on 0.10.2, which
>     had a bug regarding replacements that is fixed in newer versions:
>     ).
>
>   • The machine did not reboot. There’s no IPMI (this fully free system
>     we acquired some years ago did not support it), so all we have is a
>     remote-controlled power controller that allows us to turn it on and
>     off. This had no effect though: the machine didn’t come back.
>
>     Fellow hackers of Aquilenet, the non-profit ISP that rents the bay
>     in the data center where bayfront is, are looking into setting up
>     serial console access to the machine for us.
>
>   • We (Andreas and myself) scheduled an intervention in the data center
>     where it is, in Bordeaux (France), and could only get there on
>     Tuesday morning.
>
>   • The machine was failing to boot because of an error in the Shepherd
>     config (unbound variable), now fixed:
>
>       https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=97a31249793b8af9923f915140a6732539e9d2a3
>
>     The underlying problem is that an error in a non-essential service
>     would prevent the machine from booting. This issue is being tracked
>     here:
>
>       https://issues.guix.gnu.org/71144
>
>     Such errors can be detected by testing the config in ‘guix system
>     vm’, at the cost of extra time for sysadmins.
>
>   • Pulling and reconfiguring the machine was extremely slow. This is
>     in part due to spinning disks, and in part due to the fact that we
>     had to pull the right commit that would allow us to not rebuild
>     Linux-libre locally (substitutes for the latest upgrade, from
>     Monday, were unavailable; also we had to pass
>     --substitute-urls=https://hydra-guix-129.guix.gnu.org in lieu of the
>     default https://bordeaux.guix.gnu.org, which was unavailable).
>
>     A large part of the slowness was due to ‘guix substitute’ reading
>     all the 300K+ entries from /var/guix/substitute/cache and deleting
>     them, one by one (this took several minutes). Chris had mentioned
>     that performance issue in the past; it’s not much of a problem on
>     one’s laptop with an SSD, but it’s clearly a problem here where
>     there are more entries than usual. We should at least drastically
>     reduce the TTL of cache entries.
>
>   • qa-frontpage failed to build when we first reconfigured the machine,
>     so we commented it out. This is now fixed:
>
>       https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=3fecb1e8fdea65a7440fec403c1c52da197b5dfe
>
>   • guix-packages-website (the server behind packages.guix.gnu.org)
>     still refuses to start with an Artanis error:
>
>       https://issues.guix.gnu.org/71138
>
> Ludo’, on behalf of the emergency rescue^W^W sysadmin team.

Dear Ludo and Team,

thank you for the report!

oo--JS.