TITLE Shepherd non-deterministically fails to load the "guix-daemon" service after the "user-processes" service has been started EXPECTED BEHAVIOUR Shepherd will always start the "guix-daemon" service after the "user-processes" service has been started. OBSERVED BEHAVIOUR Shepherd will sometimes start the "guix-daemon" service after the "user-processes" service has been started and sometimes it won't. HARDWARE ENVIRONMENT Motherboard: Asus M5A88-M EVO CPU: AMD Athlon II X4 640 Processor SAS card: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 02) Hard drive: 9x assorted spinning rust sata hard drives SOFTWARE ENVIRONMENT See details below STEPS TO REPRODUCE Create the software environment detailed below on an equivalent hardware environment to the one listed above DETAILS I have a media server running NFS to host videos to computers on my home network. The media server has 9 hard drives in it. 5 hard drives are connected to the motherboard's sata ports and another four are connected to a HBA SAS controller card in IT mode sitting in the motherboard's pci-e slot. The hard-drives are in an LVM configuration like below: mark@streetkid ~$ lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS sda 8:0 0 465.8G 0 disk ├─sda1 8:1 0 2M 0 part ├─sda2 8:2 0 100G 0 part /gnu/store │ / └─sda3 8:3 0 365.8G 0 part └─streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home sdb 8:16 0 931.5G 0 disk └─sdb1 8:17 0 931.5G 0 part └─streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home sdc 8:32 0 149G 0 disk └─sdc1 8:33 0 149G 0 part └─streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home sdd 8:48 0 465.8G 0 disk └─sdd1 8:49 0 465.8G 0 part └─streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home sde 8:64 0 465.8G 0 disk └─sde1 8:65 0 465.8G 0 part └─streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home sdf 8:80 0 298.1G 0 disk └─sdf1 8:81 0 298.1G 0 part └─streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home sdg 8:96 0 931.5G 0 disk └─sdg1 8:97 0 931.5G 0 part └─streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home sdh 8:112 0 465.8G 0 disk └─sdh1 8:113 0 465.8G 0 part └─streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home sdi 8:128 0 298.1G 0 disk └─sdi1 8:129 0 298.1G 0 part └─streetkid_VG_storage-streetkid_LV_home 253:0 0 4.3T 0 lvm /home Recently I noticed that running guix commands would sometimes fail with the following error: mark@streetkid ~$ guix package -m ~/.config/guix/manifest.scm guix package: error: failed to connect to `/var/guix/daemon-socket/socket': Connection refused The NFS service would be working fine but guix would refuse to work. Not knowing what to do to fix guix I tried rolling back to previous system generations using the GRUB menu on bootup. Sometimes this would result in guix working but then on the next bootup - using the same system generation - guix would fail again. I found it was easy to work around the issue by manually getting shepherd to start the guix-daemon: mark@streetkid ~$ sudo herd start guix-daemon Service guix-daemon has been started. I looked through the kernel log and found that the harddrives connected to the SAS controller were taking a long while to be brought up which was blocking services that depend on the "device-mapping" service from being started. The following snippet from /var/log/messages.1.gz illustrates this: Jul 1 21:36:58 localhost vmunix: [ 18.538151] shepherd[1]: Service device-mapping-streetkid_VG_storage-streetkid_LV_home could not be started. Jul 1 21:36:58 localhost vmunix: [ 18.540000] shepherd[1]: Service file-system-/home depends on device-mapping-streetkid_VG_storage-streetkid_LV_home. Jul 1 21:36:58 localhost vmunix: [ 18.541809] shepherd[1]: Service file-system-/home could not be started. Jul 1 21:36:58 localhost vmunix: [ 18.543412] shepherd[1]: Service file-systems depends on file-system-/home. Jul 1 21:36:58 localhost vmunix: [ 18.545276] shepherd[1]: Service file-systems could not be started. Jul 1 21:36:58 localhost vmunix: [ 18.546921] shepherd[1]: Service user-processes depends on file-systems. Jul 1 21:36:58 localhost vmunix: [ 18.548774] shepherd[1]: Service user-processes could not be started. Jul 1 21:36:58 localhost vmunix: [ 18.550432] shepherd[1]: Service guix-daemon depends on user-processes. Jul 1 21:36:58 localhost vmunix: [ 18.552355] shepherd[1]: Service guix-daemon could not be started. Looking at the logs shows that during bootup the Shepherd will make a few attempts at starting services while waiting for the hard drives connected to the SAS card to become available. The "device-mapping" service needs the SAS card hard drives to be up and will fail when they are not. When they are not up the services that depend on the "device-mapping" service will fail. After a while the hard drives come online and then the "device-mapping" service starts, which then causes the services that depend on it to start up as well. I then noticed something very strange. On some boot ups the Shepherd would try and fail a few times to start services that depend on "device-mapping" and then when the SAS hard drives were up it would successfully load these dependent services including the "guix-daemon" service. On other boot ups it would try and fail a few times to start services that depend on "device-mapping" and then when the SAS hard drives were up it would successfully load these dependent services BUT WOULD NOT LOAD the "guix-daemon" service. To see this non-deterministic bug in the logs open the attached file "messages.1". Go to line 21664 which is the start of the boot process starting Jul 1 09:02:35. Search for the next string "Service user-processes" and you can see this service failing to start on line 22558 and again on line 22581. The "user-processes" service then succeeds on line 22657 and on the next line (22658) you can see that the "guix-daemon" starts. In summary: 21664 - System starts booting 22558 - "user-processes" service fails to start 22581 - "user-processes" service fails to start again 22657 - "user-processes" service starts successfully 22658 - "guix-daemon" service starts successfully 22837 - System shuts down On the next boot "guix-daemon" fails to start after "user-processes" has started: 22838 - System starts booting 23728 - "user-processes" service fails to start 23750 - "user-processes" service fails to start again 23762 - "user-processes" service fails to start a third time 23764 - "guix-daemon" service fails to start 23840 - "user-processes" service starts successfully 23938 - System shuts down On the next boot "guix-daemon" again fails to start after "user-processes" has started: 23939 - System starts booting 24826 - "user-processes" service fails to start 24853 - "user-processes" service fails to start again 24865 - "user-processes" service fails to start a third time 24867 - "guix-daemon" service fails to start 24941 - "user-processes" service starts successfully 25053 - System shuts down On the next boot "guix-daemon" successfully starts after "user-processes" has started: 25054 - System starts booting 25952 - "user-processes" service fails to start 25975 - "user-processes" service fails to start again 26064 - "user-processes" service starts successfully 26066 - "guix-daemon" service starts successfully 26169 - System shuts down Note that these four boots are using the same system generation so the software is the same on all boots. The generation info is shown below: Generation 27 Jun 29 2022 22:50:13 file name: /var/guix/profiles/system-27-link canonical file name: /gnu/store/xrn0lz22gnhhq9ywk30q2nmim3fgx5sd-system label: GNU with Linux 5.17.15 bootloader: grub root device: UUID: f75c515d-0ef7-4af1-950e-5486408d2451 kernel: /gnu/store/xbcwcv2a024n0vmncvhgllmz52d4rprj-linux-5.17.15/bzImage channels: nonguix: repository URL: https://gitlab.com/nonguix/nonguix branch: master commit: 1a122e06fe046caebf39395edc797515861acd3b guix: repository URL: https://git.savannah.gnu.org/git/guix.git branch: master commit: 0fe0c739f53bbb635eb5c4a3b172f4be45293c82 configuration file: /gnu/store/sniryjir3bhlswyarmdmrwdw8zqh8zgp-configuration.scm The config file for this generation is in the attachment "system-config.scm". I could find two clues which might be relevant: 1) It seems that on the occasions that the "guix-daemon" successfully starts after "user-processes" has started the Shepherd hasn't tried and then failed to start it earlier. Conversely, on the occasions when the "guix-daemon" service fails to start after "user-processes" has started the shepherd will have tried and failed to start the "guix-daemon" service. Does the "guix-daemon" service definition state that it should not attempt to start if it has failed to start before? 2) The earlier boots in the kernel log seem to not have the problem of the SAS hard drives not coming up in time to cause services that depend on them from failing to start. This might mean that the SAS card is starting to fail. On the other hand once the hard drives are up the system is stable and the NFS service works fine. A screenshot of the lvm error messages from a recent boot up showing the hard drives failing to be detected is shown in the "lvm-error-messages.jpg" attachment.