* shepherd respawn frequency
@ 2023-05-27  8:34 Attila Lendvai
  2023-07-02 20:01 ` Ludovic Courtès
  0 siblings, 1 reply; 5+ messages in thread
From: Attila Lendvai @ 2023-05-27  8:34 UTC (permalink / raw)
  To: guix-devel

dear guix,

the issue at hand:

i have a daemon that simply quits when one of its running conditions is not satisfied. this can depend on unpredictable external factors, like the temporary unreachability of a remote service.

shepherd respawns it immediately in RESPAWN-SERVICE, without any delay, which leads to a kind of a busy loop (i noticed this through the fan noise of the machine). i know that there's a stopgap measure to disable such services, but:

  1) some of these daemons struggle long enough before quitting that
     they do not trigger the default RESPAWN-LIMIT-HIT? stopgap
     measure

  2) i *do* want shepherd to keep restarting them indefinitely, but
     not immediately after their premature exit
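
A minimal sketch of point 1, in Python rather than Guile: it assumes the stopgap check is "N respawns within T seconds" with a default of 5 respawns in 7 seconds (an assumption about shepherd's default, not taken from this message). The names `respawn_limit_hit` and `exit_times` are hypothetical and only approximate what RESPAWN-LIMIT-HIT? does.

```python
def respawn_limit_hit(respawn_times, limit=5, window=7.0):
    """Return True if the last `limit` respawns all fall within `window` seconds.

    respawn_times: ascending list of timestamps (seconds) of past respawns.
    limit, window: assumed defaults (5 respawns in 7 seconds).
    """
    if len(respawn_times) < limit:
        return False
    recent = respawn_times[-limit:]
    return recent[-1] - recent[0] <= window


def exit_times(lifetime, count):
    """Timestamps of `count` consecutive exits for a daemon that lives
    `lifetime` seconds each time and is respawned immediately."""
    return [lifetime * (i + 1) for i in range(count)]


fast = exit_times(0.5, 10)  # crashes half a second after each start
slow = exit_times(3.0, 10)  # struggles for 3 seconds before quitting

print(respawn_limit_hit(fast))  # five exits span 2 s, inside the window
print(respawn_limit_hit(slow))  # five exits span 12 s, outside the window
```

With these assumed numbers, any daemon that survives longer than about 1.75 seconds per run never trips the limit, so it is respawned forever with no pause in between.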

proposed solution:

would the shepherd maintainers (looking at you Ludo :) accept a change that introduces a new field into <service> called RESPAWN-DELAY, and issues a fiber sleep in RESPAWN-SERVICE when it is not #false and the daemon process quits unexpectedly?

in an initial commit i'd also turn the global variable called RESPAWN-LIMIT into a field of <service>, and make it take its default value from a properly named %RESPAWN-LIMIT global variable.
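
The proposed behavior can be sketched roughly as follows. This is a Python illustration under stated assumptions, not shepherd code: `supervise`, `run_once` and `max_runs` are hypothetical names, and a plain injectable `sleep` stands in for the fiber sleep mentioned above.

```python
import time


def supervise(run_once, respawn_delay=5.0, max_runs=None, sleep=time.sleep):
    """Run `run_once` repeatedly, pausing `respawn_delay` seconds between runs.

    run_once:      callable that blocks until the daemon exits
    respawn_delay: seconds to wait before a respawn; None or 0 respawns at once
    max_runs:      stop after this many runs (for demos), None means forever
    sleep:         injectable sleep function, so the delay can be faked
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        run_once()
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break  # no point in sleeping after the final run
        if respawn_delay:
            sleep(respawn_delay)
    return runs


# Demo with a fake sleep that only records the requested delays.
delays = []
total = supervise(lambda: print("daemon exited"), respawn_delay=5.0,
                  max_runs=3, sleep=delays.append)
print(total, delays)
```

The delay sits between the exit and the next start, so a service that never fails pays no cost, while a failing one is throttled to at most one start per `respawn_delay` seconds.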

open questions:

 - what should be the default value of the respawn delay? i suggest 5
   seconds, and i'd argue against it being disabled by default:

    - premature exits happen more frequently at startup than in an
      already running process

    - an unwanted default respawn delay causes less headache than an
      unwanted busy loop.

 - if the respawn delay is set, then should respawn-limit be ignored?
   IOW, should the logic treat them as two independent variables, or
   should it not? and should there be some logic in how/where they
   take their defaults from?

   my pick: treat them as two independent variables, but when the user
   explicitly specifies a respawn delay for the service object, then
   there shouldn't be any respawn limit, unless the user also
   explicitly specifies it on the <service> object.

   corollary: the handling of defaults should be implemented so that
   the fields of <service> hold #false as default value, in which
   case the logic takes the default value from a global variable in
   shepherd.

 - should i bother with detecting whether this is the first respawn in
   a given past period (of e.g. 1 minute?), and skip the delay when it
   is? this adds extra complexity, which may not be worth it. i'd go
   with a pass here.

 - after a cursory look, i don't understand the relationship between
   RESPAWNS and FAILURES. the former seems to be an endlessly growing
   list of timestamps, while the latter is a ring buffer of
   timestamps. it's not crucial for me to understand it, but i wonder
   if there's a bug lurking there that eats up the heap when a service
   keeps respawning without any delay?
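
The defaulting rule picked in the second question above could look roughly like this. It is a Python sketch under assumptions: `None` plays the role of #false, `effective_policy` is a hypothetical helper, and both global defaults are placeholder values (5 seconds is the delay suggested above, 5 respawns in 7 seconds is an assumed limit).

```python
DEFAULT_RESPAWN_DELAY = 5.0     # seconds; hypothetical global default
DEFAULT_RESPAWN_LIMIT = (5, 7)  # (respawns, seconds); hypothetical global default


def effective_policy(respawn_delay=None, respawn_limit=None):
    """Return (delay, limit) for a service; a limit of None means unlimited.

    A field left at None was not set by the user. If the user sets a delay
    but no limit, no respawn limit applies at all.
    """
    if respawn_delay is None:
        delay = DEFAULT_RESPAWN_DELAY
        limit = DEFAULT_RESPAWN_LIMIT if respawn_limit is None else respawn_limit
    else:
        delay = respawn_delay
        limit = respawn_limit  # stays None (unlimited) unless given explicitly
    return delay, limit


print(effective_policy())                       # both fall back to the globals
print(effective_policy(respawn_delay=30))       # explicit delay, so no limit
print(effective_policy(respawn_delay=30, respawn_limit=(3, 60)))
```

Keeping the fields unset by default is what lets the logic tell "the user asked for this delay" apart from "the delay came from the global default".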

i'm all ears for suggestions, and i'm also happy to hand over the implementation to someone else, who already had plans to do it, and knows the internals of shepherd better than me.

-- 
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“Looking back on a 30-year teaching career full of rewards and prizes, somehow I can't completely believe that I spent my time on earth institutionalized; I can't believe that centralized schooling is allowed to exist at all as a gigantic indoctrination and sorting machine, robbing people of their children. Did it really happen? Was this my life? God help me.”
	— John Taylor Gatto (1935–2018), Teacher of the Year, both in New York City and State, multiple times




* Re: shepherd respawn frequency
  2023-05-27  8:34 shepherd respawn frequency Attila Lendvai
@ 2023-07-02 20:01 ` Ludovic Courtès
  2023-07-02 20:11   ` Felix Lechner via Development of GNU Guix and the GNU System distribution.
                     ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Ludovic Courtès @ 2023-07-02 20:01 UTC (permalink / raw)
  To: Attila Lendvai; +Cc: guix-devel

Hi!

Attila Lendvai <attila@lendvai.name> skribis:

> would the shepherd maintainers (looking at you Ludo :) accept a change that introduces a new field into <service> called RESPAWN-DELAY, and issue a fiber sleep in RESPAWN-SERVICE when it is not #false, and the daemon process quits unexpectedly?

We could do that.  It never occurred to me that this is something one
would want to have though.  My reasoning is that if you mark a service
as respawnable, then you really want it to be respawned as soon as it
fails, not 5 seconds later.

Do you have a motivating example in mind (a daemon) to share?

Thanks,
Ludo’.



* Re: shepherd respawn frequency
  2023-07-02 20:01 ` Ludovic Courtès
@ 2023-07-02 20:11   ` Felix Lechner via Development of GNU Guix and the GNU System distribution.
  2023-07-02 20:25   ` Attila Lendvai
  2023-07-03  9:08   ` Efraim Flashner
  2 siblings, 0 replies; 5+ messages in thread
From: Felix Lechner via Development of GNU Guix and the GNU System distribution. @ 2023-07-02 20:11 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Attila Lendvai, guix-devel

Hi Ludo',

On Sun, Jul 2, 2023 at 1:01 PM Ludovic Courtès <ludo@gnu.org> wrote:
>
> We could do that.

Without advocating that the Shepherd draw any inspiration from
systemd, please allow me to mention, sotto-voce, that the default may
be 100 milliseconds over there. [1]

Kind regards
Felix

[1] https://www.freedesktop.org/software/systemd/man/systemd.service.html#RestartSec=



* Re: shepherd respawn frequency
  2023-07-02 20:01 ` Ludovic Courtès
  2023-07-02 20:11   ` Felix Lechner via Development of GNU Guix and the GNU System distribution.
@ 2023-07-02 20:25   ` Attila Lendvai
  2023-07-03  9:08   ` Efraim Flashner
  2 siblings, 0 replies; 5+ messages in thread
From: Attila Lendvai @ 2023-07-02 20:25 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guix-devel

> We could do that. It never occurred to me that this is something one
> would want to have though. My reasoning is that if you mark a service
> as respawnable, then you really want it to be respawned as soon as it
> fails, not 5 seconds later.


there are a large number of different applications that users may want to run as a service. i'm pretty sure a nontrivial subset of them can get into such an error loop.

this specific one is the Bee client of ethswarm.org. multiple threads are active during its startup. one of them tries to connect to the blockchain node. when the node is unreachable, the client eventually gives up and quits.

arguably, it would be a better behavior to keep trying indefinitely instead of quitting, but let's assume that its behavior is practically beyond our control.

-- 
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
The real voyage of discovery consists not in seeking new landscapes, but in looking with new eyes.




* Re: shepherd respawn frequency
  2023-07-02 20:01 ` Ludovic Courtès
  2023-07-02 20:11   ` Felix Lechner via Development of GNU Guix and the GNU System distribution.
  2023-07-02 20:25   ` Attila Lendvai
@ 2023-07-03  9:08   ` Efraim Flashner
  2 siblings, 0 replies; 5+ messages in thread
From: Efraim Flashner @ 2023-07-03  9:08 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Attila Lendvai, guix-devel


On Sun, Jul 02, 2023 at 10:01:26PM +0200, Ludovic Courtès wrote:
> Hi!
> 
> Attila Lendvai <attila@lendvai.name> skribis:
> 
> > would the shepherd maintainers (looking at you Ludo :) accept a change that introduces a new field into <service> called RESPAWN-DELAY, and issue a fiber sleep in RESPAWN-SERVICE when it is not #false, and the daemon process quits unexpectedly?
> 
> We could do that.  It never occurred to me that this is something one
> would want to have though.  My reasoning is that if you mark a service
> as respawnable, then you really want it to be respawned as soon as it
> fails, not 5 seconds later.
> 
> Do you have a motivating example in mind (a daemon) to share?
> 
> Thanks,
> Ludo’.
> 

If you want to make sure that something is really gone and cleaned up
before trying again. Or an artificial delay, say while waiting for the
correct network interface to come up. Or even just
`mbsync -a && exit 1 || exit 1`

-- 
Efraim Flashner   <efraim@flashner.co.il>   רנשלפ םירפא
GPG key = A28B F40C 3E55 1372 662D  14F7 41AA E7DC CA3D 8351
Confidentiality cannot be guaranteed on emails sent or received unencrypted


