unofficial mirror of guix-devel@gnu.org 
* watchdog triggered auto-rollback
@ 2024-05-24 12:50 raingloom
  2024-05-25 16:58 ` Richard Sent
  0 siblings, 1 reply; 7+ messages in thread
From: raingloom @ 2024-05-24 12:50 UTC (permalink / raw)
  To: Guix Devel

Since I've been experimenting with a foolproof unikernel-based static
website deployment lately, I realized I should write down this idea I've
been chewing on for a while:

It would be very nice to have automatic system rollbacks when certain
things break.
One example is broken SSH config that makes a machine unreachable.
Local testing is useful, but like in the SSH example, some issues only
become apparent when you are deploying to the production environment.

Would others find this useful?  Where in the stack would this be solved?
 Could we, for example, catch an issue in the init system and still
perform a rollback?  Or if not a full rollback, then at least a reboot
into the previous config?  (And if that is also broken, then the one
before, etc, etc)
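
The "previous config, then the one before, etc." idea can be sketched as
a walk backwards through the system generations. This is only a rough
illustration in Python, not anything Guix provides: the function name
pick_generation and the is_healthy callback are hypothetical, and how a
generation is actually judged healthy (for example by booting it) is left
open.

```python
from typing import Callable, Iterable, Optional


def pick_generation(generations: Iterable[int],
                    is_healthy: Callable[[int], bool]) -> Optional[int]:
    """Return the newest generation that passes the health check.

    `generations` are generation numbers (hypothetical input), and
    `is_healthy` is a caller-supplied check; both are assumptions made
    for this sketch.  Returns None when every generation fails.
    """
    for generation in sorted(generations, reverse=True):
        if is_healthy(generation):
            return generation
    return None
```

Returning None when nothing passes makes the "every kept generation is
broken" case explicit, so the caller has to decide what to do then.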

Obviously there are a lot of edge cases and potential bugs in this
mechanism as well.  Sticking with the SSH example, rolling back to a
version that was kept around where the authorized keys are different
would also make the machine unreachable via SSH.



* Re: watchdog triggered auto-rollback
  2024-05-24 12:50 watchdog triggered auto-rollback raingloom
@ 2024-05-25 16:58 ` Richard Sent
  0 siblings, 0 replies; 7+ messages in thread
From: Richard Sent @ 2024-05-25 16:58 UTC (permalink / raw)
  To: raingloom; +Cc: Guix Devel

raingloom@riseup.net writes:

> Would others find this useful?  Where in the stack would this be solved?
>  Could we, for example, catch an issue in the init system and still
> perform a rollback?  Or if not a full rollback, then at least a reboot
> into the previous config?  (And if that is also broken, then the one
> before, etc, etc)
>
> Obviously there are a lot of edge cases and potential bugs in this
> mechanism as well.  Sticking with the SSH example, rolling back to a
> version that was kept around where the authorized keys are different
> would also make the machine unreachable via SSH.

I would definitely find this useful, particularly when combined with
unattended-upgrade-service. I don't know the best way to handle an init
system failure.

Perhaps a starting point would be a one-shot
conditional-rollback-service with a "shepherd-requirements" field and a
"test" field that takes a file-like object. This service would execute
that file, write the output to some log file, and trigger a rollback if
an error is signaled.
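
A minimal sketch of that behaviour, written in Python rather than as a
real Guix service, might look like the following. The function and
parameter names are hypothetical. The only real command used is
"guix system roll-back", and treating a non-zero exit status (or a test
that cannot be started) as "an error is signaled" is an assumption.

```python
import subprocess
from typing import Callable, Sequence


def guix_rollback() -> None:
    """Switch back to the previous system generation."""
    subprocess.run(["guix", "system", "roll-back"], check=True)


def run_test_and_maybe_rollback(test_command: Sequence[str],
                                log_path: str,
                                rollback: Callable[[], None]) -> bool:
    """Run the test, append its output to a log, roll back on failure.

    Returns True when the test exits with status 0.  Any other exit
    status counts as a failure (an assumption of this sketch), as does
    a test program that cannot be executed at all.
    """
    with open(log_path, "ab") as log:
        try:
            result = subprocess.run(test_command, stdout=log,
                                    stderr=subprocess.STDOUT, check=False)
            status = result.returncode
        except OSError as error:
            # The test could not be started (missing file, bad permissions).
            log.write(f"could not run test: {error}\n".encode())
            status = 127
    if status != 0:
        rollback()
        return False
    return True
```

Passing the rollback action in as a callable keeps the sketch easy to
try without touching the system; a caller would pass guix_rollback for
the real thing.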

Presumably this service should only trigger on boot, not on reconfigure,
so we don't risk running the test with old services. I don't believe Guix
has a mechanism yet to say "Yes, this service is new, and I /do/ want
Shepherd to auto-start it, but not on reconfigure". This shouldn't be
too hard to add though.

-- 
Take it easy,
Richard Sent
Making my computer weirder one commit at a time.



* watchdog triggered auto-rollback
@ 2024-05-28  0:52 Nathan Dehnel
  2024-05-28  1:46 ` Richard Sent
  0 siblings, 1 reply; 7+ messages in thread
From: Nathan Dehnel @ 2024-05-28  0:52 UTC (permalink / raw)
  To: raingloom, guix-devel

>Would others find this useful?
I would 100% use this.

>Where in the stack would this be solved?
I think there are two places for rollbacks, with two different purposes:

GRUB: https://www.gnu.org/software/grub/manual/grub/html_node/fallback.html
GRUB supports falling back to another boot entry if the machine fails
to boot. This could be integrated with guix so GRUB falls back to a
previous guix system generation. This covers the case of "we can't
start a watchdog service because the system won't boot".

SSH watchdog: a shepherd service that tests SSH connectivity and, if the
test fails, executes "guix system roll-back && reboot". SSH access is a rough
approximation for "the system is working", as kernel, init, and all
manner of networking services, DHCP, DNS, VPN, etc. must work for SSH
to work. And if SSH works then it provides a means for a user to fix
their system anyways.
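
To make the SSH watchdog idea concrete, here is a rough sketch in Python
rather than as an actual shepherd service. All function names, the retry
count, and the delay are made up for illustration. The check only
confirms that something answers with an SSH banner on the port, which is
an assumption about what "tests SSH connectivity" means; it does not
prove that a login with the authorized keys would succeed.

```python
import socket
import subprocess
import time
from typing import Callable


def ssh_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if the peer accepts a TCP connection and sends an SSH banner."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            banner = conn.recv(64)
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
    return banner.startswith(b"SSH-")


def watchdog(check: Callable[[], bool], on_failure: Callable[[], None],
             attempts: int = 5, delay: float = 30.0) -> bool:
    """Retry `check`; call `on_failure` once if every attempt fails."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    on_failure()
    return False


def rollback_and_reboot() -> None:
    """Equivalent of "guix system roll-back && reboot"."""
    subprocess.run(["guix", "system", "roll-back"], check=True)
    subprocess.run(["reboot"], check=True)
```

A caller would wire these together with something like
watchdog(lambda: ssh_reachable("203.0.113.10"), rollback_and_reboot),
where the address is a placeholder. Retrying a few times before rolling
back is a design choice meant to avoid reacting to a network that is
merely slow to come up after boot.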



* Re: watchdog triggered auto-rollback
  2024-05-28  0:52 Nathan Dehnel
@ 2024-05-28  1:46 ` Richard Sent
  2024-05-28 10:10   ` Attila Lendvai
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Sent @ 2024-05-28  1:46 UTC (permalink / raw)
  To: Nathan Dehnel; +Cc: raingloom, guix-devel

Nathan Dehnel <ncdehnel@gmail.com> writes:

> GRUB: https://www.gnu.org/software/grub/manual/grub/html_node/fallback.html
> GRUB supports falling back to another boot entry if the machine fails
> to boot. This could be integrated with guix so GRUB falls back to a
> previous guix system generation. This covers the case of "we can't
> start a watchdog service because the system won't boot".

How does GRUB determine a boot failed? Does it have to be something
drastic like "kernel failed to mount the initrd" or can it catch more
complex errors?

I believe that if the initrd fails during startup it will abort into an
interactive Guile REPL. This might hurt GRUB's ability to detect
something went wrong since the kernel would still be running. A similar
case may apply if Shepherd gets stuck during system initialization.

-- 
Take it easy,
Richard Sent
Making my computer weirder one commit at a time.



* Re: watchdog triggered auto-rollback
  2024-05-28  1:46 ` Richard Sent
@ 2024-05-28 10:10   ` Attila Lendvai
  2024-05-29 13:45     ` Richard Sent
  0 siblings, 1 reply; 7+ messages in thread
From: Attila Lendvai @ 2024-05-28 10:10 UTC (permalink / raw)
  To: Richard Sent; +Cc: Nathan Dehnel, raingloom, guix-devel

> I believe that if the initrd fails during startup it will abort into an
> interactive Guile REPL. This might hurt GRUB's ability to detect
> something went wrong since the kernel would still be running.


i'm afraid that's not the case currently:

%guile-static-stripped crashes with a sigsegv (i.e. the guile used in the initrd (?))
https://issues.guix.gnu.org/71211

-- 
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“Self-education is, I firmly believe, the only kind of education there is.”
	— Isaac Asimov (1920–1992)




* Re: watchdog triggered auto-rollback
  2024-05-28 10:10   ` Attila Lendvai
@ 2024-05-29 13:45     ` Richard Sent
  2024-05-29 20:41       ` Attila Lendvai
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Sent @ 2024-05-29 13:45 UTC (permalink / raw)
  To: Attila Lendvai; +Cc: Nathan Dehnel, raingloom, guix-devel

> i'm afraid that's not the case currently:
> 
> %guile-static-stripped crashes with a sigsegv (i.e. the guile used in the initrd (?))
> https://issues.guix.gnu.org/71211

Interesting. Is this a recent bug? When I was trying to bring up
Guix on the VisionFive 2 I was being dropped into a Guile REPL when
the initrd failed to find the root partition.



* Re: watchdog triggered auto-rollback
  2024-05-29 13:45     ` Richard Sent
@ 2024-05-29 20:41       ` Attila Lendvai
  0 siblings, 0 replies; 7+ messages in thread
From: Attila Lendvai @ 2024-05-29 20:41 UTC (permalink / raw)
  To: Richard Sent; +Cc: Nathan Dehnel, raingloom, guix-devel

> > i'm afraid that's not the case currently:
> > 
> > %guile-static-stripped crashes with a sigsegv (i.e. the guile used in the initrd (?))
> > https://issues.guix.gnu.org/71211
> 
>
> Interesting. Is this a recent bug? When I was trying to bring up
> Guix on the VisionFive 2 I was being dropped into a Guile REPL when
> the initrd failed to find the root partition.


well, the reproducer "works" on a recent x86_64, but i originally noticed this long ago (maybe a year even). back then i investigated an early crash in the boot, and reached %GUILE-STATIC-STRIPPED, and made a TODO note to further investigate. then i forgot most of what happened, and recently i opened a bug report based on my note.

since then EXPRESSION->INITRD may have changed, because it now uses %GUILE-STATIC-INITRD. but it's created with the same MAKE-GUILE-STATIC-STRIPPED that produces the faulty %GUILE-STATIC-STRIPPED, so...

in short: the reproducer crashes both %GUILE-STATIC-STRIPPED and %GUILE-STATIC-INITRD on x86_64, and i believe that it crashes the same in the early phase of the boot when it tries to enter the debugger.

-- 
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
Freedom cannot be given… it can only be taken away.



