TITLE
Shepherd non-deterministically fails to load the "guix-daemon" service after the "user-processes" service has been started

EXPECTED BEHAVIOUR
Shepherd will always start the "guix-daemon" service after the "user-processes" service has been started.

OBSERVED BEHAVIOUR
Shepherd will sometimes start the "guix-daemon" service after the "user-processes" service has been started and sometimes it won't.

HARDWARE ENVIRONMENT
Motherboard: Asus M5A88-M EVO
CPU: AMD Athlon II X4 640 Processor
SAS card: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 02)
Hard drive: 9x assorted spinning rust sata hard drives

SOFTWARE ENVIRONMENT
See details below

STEPS TO REPRODUCE
Create the software environment detailed below on an equivalent hardware environment to the one listed above

DETAILS
I have a media server running NFS to host videos to computers on my home network. The media server has 9 hard drives in it. 5 hard drives are connected to the motherboard's sata ports and another four are connected to a HBA SAS controller card in IT mode sitting in the motherboard's pci-e slot. The hard-drives are in an LVM configuration like below:

mark@streetkid ~$ lsblk
NAME                                       MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                                          8:0    0 465.8G  0 disk
├─sda1                                       8:1    0     2M  0 part
├─sda2                                       8:2    0   100G  0 part /gnu/store
│                                                                    /
└─sda3                                       8:3    0 365.8G  0 part
  └─streetkid_VG_storage-streetkid_LV_home 253:0    0   4.3T  0 lvm  /home
sdb                                          8:16   0 931.5G  0 disk
└─sdb1                                       8:17   0 931.5G  0 part
  └─streetkid_VG_storage-streetkid_LV_home 253:0    0   4.3T  0 lvm  /home
sdc                                          8:32   0   149G  0 disk
└─sdc1                                       8:33   0   149G  0 part
  └─streetkid_VG_storage-streetkid_LV_home 253:0    0   4.3T  0 lvm  /home
sdd                                          8:48   0 465.8G  0 disk
└─sdd1                                       8:49   0 465.8G  0 part
  └─streetkid_VG_storage-streetkid_LV_home 253:0    0   4.3T  0 lvm  /home
sde                                          8:64   0 465.8G  0 disk
└─sde1                                       8:65   0 465.8G  0 part
  └─streetkid_VG_storage-streetkid_LV_home 253:0    0   4.3T  0 lvm  /home
sdf                                          8:80   0 298.1G  0 disk
└─sdf1                                       8:81   0 298.1G  0 part
  └─streetkid_VG_storage-streetkid_LV_home 253:0    0   4.3T  0 lvm  /home
sdg                                          8:96   0 931.5G  0 disk
└─sdg1                                       8:97   0 931.5G  0 part
  └─streetkid_VG_storage-streetkid_LV_home 253:0    0   4.3T  0 lvm  /home
sdh                                          8:112  0 465.8G  0 disk
└─sdh1                                       8:113  0 465.8G  0 part
  └─streetkid_VG_storage-streetkid_LV_home 253:0    0   4.3T  0 lvm  /home
sdi                                          8:128  0 298.1G  0 disk
└─sdi1                                       8:129  0 298.1G  0 part
  └─streetkid_VG_storage-streetkid_LV_home 253:0    0   4.3T  0 lvm  /home



Recently I noticed that running guix commands would sometimes fail with the following error:

mark@streetkid ~$ guix package -m ~/.config/guix/manifest.scm
guix package: error: failed to connect to `/var/guix/daemon-socket/socket': Connection refused


The NFS service would be working fine but guix would refuse to work. Not knowing what to do to fix guix I tried rolling back to previous system generations using the GRUB menu on bootup. Sometimes this would result in guix working but then on the next bootup - using the same system generation - guix would fail again. I found it was easy to work around the issue by manually getting shepherd to start the guix-daemon:

mark@streetkid ~$ sudo herd start guix-daemon
Service guix-daemon has been started.


I looked through the kernel log and found that the harddrives connected to the SAS controller were taking a long while to be brought up which was blocking services that depend on the "device-mapping" service from being started. The following snippet from /var/log/messages.1.gz illustrates this:

Jul  1 21:36:58 localhost vmunix: [   18.538151] shepherd[1]: Service device-mapping-streetkid_VG_storage-streetkid_LV_home could not be started.
Jul  1 21:36:58 localhost vmunix: [   18.540000] shepherd[1]: Service file-system-/home depends on device-mapping-streetkid_VG_storage-streetkid_LV_home.
Jul  1 21:36:58 localhost vmunix: [   18.541809] shepherd[1]: Service file-system-/home could not be started.
Jul  1 21:36:58 localhost vmunix: [   18.543412] shepherd[1]: Service file-systems depends on file-system-/home.
Jul  1 21:36:58 localhost vmunix: [   18.545276] shepherd[1]: Service file-systems could not be started.
Jul  1 21:36:58 localhost vmunix: [   18.546921] shepherd[1]: Service user-processes depends on file-systems.
Jul  1 21:36:58 localhost vmunix: [   18.548774] shepherd[1]: Service user-processes could not be started.
Jul  1 21:36:58 localhost vmunix: [   18.550432] shepherd[1]: Service guix-daemon depends on user-processes.
Jul  1 21:36:58 localhost vmunix: [   18.552355] shepherd[1]: Service guix-daemon could not be started.

Looking at the logs shows that during bootup the Shepherd will make a few attempts at starting services while waiting for the hard drives connected to the SAS card to become available. The "device-mapping" service needs the SAS card hard drives to be up and will fail when they are not. When they are not up the services that depend on the "device-mapping" service will fail. After a while the hard drives come online and then the "device-mapping" service starts, which then causes the services that depend on it to start up as well.

I then noticed something very strange. On some boot ups the Shepherd would try and fail a few times to start services that depend on "device-mapping" and then when the SAS hard drives were up it would successfully load these dependent services including the "guix-daemon" service. On other boot ups it would try and fail a few times to start services that depend on "device-mapping" and then when the SAS hard drives were up it would successfully load these dependent services BUT WOULD NOT LOAD the "guix-daemon" service.

To see this non-deterministic bug in the logs open the attached file "messages.1". Go to line 21664 which is the start of the boot process starting Jul 1 09:02:35. Search for the next string "Service user-processes" and you can see this service failing to start on line 22558 and again on line 22581. The "user-processes" service then succeeds on line 22657 and on the next line (22658) you can see that the "guix-daemon" starts. In summary:

21664 - System starts booting
22558 - "user-processes" service fails to start
22581 - "user-processes" service fails to start again
22657 - "user-processes" service starts successfully
22658 - "guix-daemon" service starts successfully
22837 - System shuts down

On the next boot "guix-daemon" fails to start after "user-processes" has started:
22838 - System starts booting
23728 - "user-processes" service fails to start
23750 - "user-processes" service fails to start again
23762 - "user-processes" service fails to start a third time
23764 - "guix-daemon" service fails to start
23840 - "user-processes" service starts successfully
23938 - System shuts down

On the next boot "guix-daemon" again fails to start after "user-processes" has started:
23939 - System starts booting
24826 - "user-processes" service fails to start
24853 - "user-processes" service fails to start again
24865 - "user-processes" service fails to start a third time
24867 - "guix-daemon" service fails to start
24941 - "user-processes" service starts successfully
25053 - System shuts down

On the next boot "guix-daemon" successfully starts after "user-processes" has started:
25054 - System starts booting
25952 - "user-processes" service fails to start
25975 - "user-processes" service fails to start again
26064 - "user-processes" service starts successfully
26066 - "guix-daemon" service starts successfully
26169 - System shuts down

Note that these four boots are using the same system generation so the software is the same on all boots.
The generation info is shown below:

Generation 27 Jun 29 2022 22:50:13
  file name: /var/guix/profiles/system-27-link
  canonical file name: /gnu/store/xrn0lz22gnhhq9ywk30q2nmim3fgx5sd-system
  label: GNU with Linux 5.17.15
  bootloader: grub
  root device: UUID: f75c515d-0ef7-4af1-950e-5486408d2451
  kernel: /gnu/store/xbcwcv2a024n0vmncvhgllmz52d4rprj-linux-5.17.15/bzImage
  channels:
    nonguix:
      repository URL: https://gitlab.com/nonguix/nonguix
      branch: master
      commit: 1a122e06fe046caebf39395edc797515861acd3b
    guix:
      repository URL: https://git.savannah.gnu.org/git/guix.git
      branch: master
      commit: 0fe0c739f53bbb635eb5c4a3b172f4be45293c82
  configuration file: /gnu/store/sniryjir3bhlswyarmdmrwdw8zqh8zgp-configuration.scm

The config file for this generation is in the attachment "system-config.scm".


I could find two clues which might be relevant:

1) It seems that on the occasions that the "guix-daemon" successfully starts after "user-processes" has started the Shepherd hasn't tried and then failed to start it earlier. Conversely, on the occasions when the "guix-daemon" service fails to start after "user-processes" has started the shepherd will have tried and failed to start the "guix-daemon" service.
Does the "guix-daemon" service definition state that it should not attempt to start if it has failed to start before?

2) The earlier boots in the kernel log seem to not have the problem of the SAS hard drives not coming up in time to cause services that depend on them from failing to start. This might mean that the SAS card is starting to fail. On the other hand once the hard drives are up the system is stable and the NFS service works fine.

A screenshot of the lvm error messages from a recent boot up showing the hard drives failing to be detected is shown in the "lvm-error-messages.jpg" attachment.