unofficial mirror of guix-devel@gnu.org 
* Postmortem of service downtime
@ 2024-05-23 17:31 Ludovic Courtès
  2024-05-23 17:41 ` Jay Sulzberger
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Ludovic Courtès @ 2024-05-23 17:31 UTC (permalink / raw)
  To: guix-devel


From Sunday May 19th to Tuesday May 21st, for about 36h,
bayfront.guix.gnu.org, the machine behind many services, went down:

  https://lists.gnu.org/archive/html/info-guix/2024-05/msg00000.html

Affected web sites and services included:

  guix.gnu.org
  bordeaux.guix.gnu.org
  logs.guix.gnu.org
  hpc.guix.info
  foundation.guix.info
  packages.guix.gnu.org
  qa.guix.gnu.org

Here’s the series of events that led to this:

  • The machine had not been rebooted for 7 months and needed to be
    rebooted to run a newer version of Shepherd (it was on 0.10.2, which
    had a bug regarding replacements that is fixed in newer versions:
    <https://issues.guix.gnu.org/67839>).

  • The machine did not reboot.  There’s no IPMI (this fully free system
    we acquired some years ago did not support it), so all we have is a
    remote-controlled power controller that allows us to turn it on and
    off.  This had no effect though: the machine didn’t come back.

    Fellow hackers of Aquilenet, the non-profit ISP that rents the bay
    in the data center where bayfront is, are looking into setting up
    serial console access to the machine for us.

  • We (Andreas and myself) scheduled an intervention in the data center
    where it is, in Bordeaux (France), and could only get there on
    Tuesday morning.

  • The machine was failing to boot because of an error in the Shepherd
    config (unbound variable), now fixed:

      https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=97a31249793b8af9923f915140a6732539e9d2a3

    The underlying problem is that an error in a non-essential service
    would prevent the machine from booting.  This issue is being tracked
    here:

      https://issues.guix.gnu.org/71144

    Such errors can be detected by testing the config in ‘guix system
    vm’, at the cost of extra time for sysadmins.

  • Pulling and reconfiguring the machine was extremely slow.  This is
    in part due to spinning disks, and in part due to the fact that we
    had to pull the right commit that would allow us to not rebuild
    Linux-libre locally (substitutes for the latest upgrade, from
    Monday, were unavailable; also we had to pass
    --substitute-urls=https://hydra-guix-129.guix.gnu.org in lieu of the
    default https://bordeaux.guix.gnu.org, which was unavailable).

    A large part of the slowness was due to ‘guix substitute’ reading
    all the 300K+ entries from /var/guix/substitute/cache and deleting
    them, one by one (this took several minutes).  Chris had mentioned
    that performance issue in the past; it’s not much of a problem on
    one’s laptop with an SSD, but it’s clearly a problem here where
    there are more entries than usual.  We should at least drastically
    reduce the TTL of cache entries.

  • qa-frontpage failed to build when we first reconfigured the machine,
    so we commented it out.  This is now fixed:

      https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=3fecb1e8fdea65a7440fec403c1c52da197b5dfe

  • guix-packages-website (the server behind packages.guix.gnu.org)
    still refuses to start with an Artanis error:

      https://issues.guix.gnu.org/71138
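
To illustrate the TTL point above, here is a minimal Python sketch of
expiring cache entries by age.  It is not the actual ‘guix substitute’
code: the function name ‘purge_expired’, the use of file modification
time as the entry age, and the flat "one file per entry" layout are all
assumptions made for illustration.  It only shows that the cost of a
purge grows with the number of entries, which a shorter TTL keeps small.

```python
import os
import time
from pathlib import Path


def purge_expired(cache_dir, ttl_seconds, now=None):
    """Delete files under CACHE_DIR whose age exceeds TTL_SECONDS.

    Hypothetical helper: age is taken from the file's mtime, which is
    an assumption, not necessarily how the real cache tracks expiry.
    Returns a (removed, kept) pair of counts.
    """
    now = time.time() if now is None else now
    removed = kept = 0
    # Every entry is visited once, so the work is linear in the number
    # of cached files.
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = Path(root) / name
            try:
                age = now - path.stat().st_mtime
                if age > ttl_seconds:
                    path.unlink()
                    removed += 1
                else:
                    kept += 1
            except FileNotFoundError:
                # The entry disappeared in the meantime; skip it.
                continue
    return removed, kept
```

Calling, say, purge_expired("/tmp/example-cache", 3600) on a scratch
directory removes everything older than one hour; run regularly with a
short TTL, each pass has far fewer entries to stat and unlink.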

Ludo’, on behalf of the emergency rescue^W^W sysadmin team.


* Re: Postmortem of service downtime
  2024-05-23 17:31 Postmortem of service downtime Ludovic Courtès
@ 2024-05-23 17:41 ` Jay Sulzberger
  2024-05-25  1:19 ` Maxim Cournoyer
  2024-06-04 19:07 ` Simon Tournier
  2 siblings, 0 replies; 4+ messages in thread
From: Jay Sulzberger @ 2024-05-23 17:41 UTC (permalink / raw)
  To: guix-devel; +Cc: Jay Sulzberger


On Thu, 23 May 2024, Ludovic Courtès <ludo@gnu.org> wrote:

> From Sunday May 19th to Tuesday May 21st, for about 36h,
> bayfront.guix.gnu.org, the machine behind many services, went down:
>
>  https://lists.gnu.org/archive/html/info-guix/2024-05/msg00000.html
>
> Affected web sites and services included:
>
>  guix.gnu.org
>  bordeaux.guix.gnu.org
>  logs.guix.gnu.org
>  hpc.guix.info
>  foundation.guix.info
>  packages.guix.gnu.org
>  qa.guix.gnu.org
>
> Here’s the series of events that led to this:
>
>  • The machine had not been rebooted for 7 months and needed to be
>    rebooted to run a newer version of Shepherd (it was on 0.10.2, which
>    had a bug regarding replacements that is fixed in newer versions:
>    <https://issues.guix.gnu.org/67839>).
>
>  • The machine did not reboot.  There’s no IPMI (this fully free system
>    we acquired some years ago did not support it), so all we have is a
>    remote-controlled power controller that allows us to turn it on and
>    off.  This had no effect though: the machine didn’t come back.
>
>    Fellow hackers of Aquilenet, the non-profit ISP that rents the bay
>    in the data center where bayfront is, are looking into setting up
>    serial console access to the machine for us.
>
>  • We (Andreas and myself) scheduled an intervention in the data center
>    where it is, in Bordeaux (France), and could only get there on
>    Tuesday morning.
>
>  • The machine was failing to boot because of an error in the Shepherd
>    config (unbound variable), now fixed:
>
>      https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=97a31249793b8af9923f915140a6732539e9d2a3
>
>    The underlying problem is that an error in a non-essential service
>    would prevent the machine from booting.  This issue is being tracked
>    here:
>
>      https://issues.guix.gnu.org/71144
>
>    Such errors can be detected by testing the config in ‘guix system
>    vm’, at the cost of extra time for sysadmins.
>
>  • Pulling and reconfiguring the machine was extremely slow.  This is
>    in part due to spinning disks, and in part due to the fact that we
>    had to pull the right commit that would allow us to not rebuild
>    Linux-libre locally (substitutes for the latest upgrade, from
>    Monday, were unavailable; also we had to pass
>    --substitute-urls=https://hydra-guix-129.guix.gnu.org in lieu of the
>    default https://bordeaux.guix.gnu.org, which was unavailable).
>
>    A large part of the slowness was due to ‘guix substitute’ reading
>    all the 300K+ entries from /var/guix/substitute/cache and deleting
>    them, one by one (this took several minutes).  Chris had mentioned
>    that performance issue in the past; it’s not much of a problem on
>    one’s laptop with an SSD, but it’s clearly a problem here where
>    there are more entries than usual.  We should at least drastically
>    reduce the TTL of cache entries.
>
>  • qa-frontpage failed to build when we first reconfigured the machine,
>    so we commented it out.  This is now fixed:
>
>      https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=3fecb1e8fdea65a7440fec403c1c52da197b5dfe
>
>  • guix-packages-website (the server behind packages.guix.gnu.org)
>    still refuses to start with an Artanis error:
>
>      https://issues.guix.gnu.org/71138
>
> Ludo’, on behalf of the emergency rescue^W^W sysadmin team.
>

Dear Ludo and Team, thank you for the report!

oo--JS.


* Re: Postmortem of service downtime
  2024-05-23 17:31 Postmortem of service downtime Ludovic Courtès
  2024-05-23 17:41 ` Jay Sulzberger
@ 2024-05-25  1:19 ` Maxim Cournoyer
  2024-06-04 19:07 ` Simon Tournier
  2 siblings, 0 replies; 4+ messages in thread
From: Maxim Cournoyer @ 2024-05-25  1:19 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guix-devel

Hi Ludovic,

Ludovic Courtès <ludo@gnu.org> writes:

> From Sunday May 19th to Tuesday May 21st, for about 36h,
> bayfront.guix.gnu.org, the machine behind many services, went down:
>
>   https://lists.gnu.org/archive/html/info-guix/2024-05/msg00000.html
>
> Affected web sites and services included:
>
>   guix.gnu.org
>   bordeaux.guix.gnu.org
>   logs.guix.gnu.org
>   hpc.guix.info
>   foundation.guix.info
>   packages.guix.gnu.org
>   qa.guix.gnu.org
>

[...]

>     A large part of the slowness was due to ‘guix substitute’ reading
>     all the 300K+ entries from /var/guix/substitute/cache and deleting
>     them, one by one (this took several minutes).  Chris had mentioned
>     that performance issue in the past; it’s not much of a problem on
>     one’s laptop with an SSD, but it’s clearly a problem here where
>     there are more entries than usual.  We should at least drastically
>     reduce the TTL of cache entries.

Interesting!

>   • qa-frontpage failed to build when we first reconfigured the machine,
>     so we commented it out.  This is now fixed:
>
>       https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=3fecb1e8fdea65a7440fec403c1c52da197b5dfe
>
>   • guix-packages-website (the server behind packages.guix.gnu.org)
>     still refuses to start with an Artanis error:
>
>       https://issues.guix.gnu.org/71138
>
> Ludo’, on behalf of the emergency rescue^W^W sysadmin team.

Phew!  Thanks for the detailed write-up and for the fixes/thankless work
of bringing the machine back up and running.

-- 
Maxim



* Re: Postmortem of service downtime
  2024-05-23 17:31 Postmortem of service downtime Ludovic Courtès
  2024-05-23 17:41 ` Jay Sulzberger
  2024-05-25  1:19 ` Maxim Cournoyer
@ 2024-06-04 19:07 ` Simon Tournier
  2 siblings, 0 replies; 4+ messages in thread
From: Simon Tournier @ 2024-06-04 19:07 UTC (permalink / raw)
  To: Ludovic Courtès, guix-devel

Hi Ludo,

On Thu, 23 May 2024 at 19:31, Ludovic Courtès <ludo@gnu.org> wrote:

> From Sunday May 19th to Tuesday May 21st, for about 36h,
> bayfront.guix.gnu.org, the machine behind many services, went down:
>
>   https://lists.gnu.org/archive/html/info-guix/2024-05/msg00000.html
>
> Affected web sites and services included:
>
>   guix.gnu.org
>   bordeaux.guix.gnu.org
>   logs.guix.gnu.org
>   hpc.guix.info
>   foundation.guix.info
>   packages.guix.gnu.org
>   qa.guix.gnu.org

Oh, I am going outside of my “cellar”… I have not noticed. :-)

> Here’s the series of events that led to this:

Thanks for all the work and the detailed report!
And thanks Andreas for the rescue team.

Cheers,
simon

