From mboxrd@z Thu Jan  1 00:00:00 1970
From: Ludovic =?UTF-8?Q?Court=C3=A8s?= <ludo@gnu.org>
Subject: bug#37757: Kernel panic upon shutdown
Date: Mon, 02 Dec 2019 18:33:03 +0100
Message-ID: <87d0d6k4z4.fsf@gnu.org>
References: <0876c9961fdffa47be54b756a05eb6320b6bdb18.camel@gmail.com>
 <874kzsfqsx.fsf@gnu.org> <87k183mnza.fsf@gnu.org>
 <87wobkw7gj.fsf@gnu.org>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
Return-path: <bug-guix-bounces+gcggb-bug-guix=m.gmane.org@gnu.org>
Received: from eggs.gnu.org ([2001:470:142:3::10]:51568)
 by lists.gnu.org with esmtp (Exim 4.90_1)
 (envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1ibpaF-0007Vt-JP
 for bug-guix@gnu.org; Mon, 02 Dec 2019 12:34:04 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1ibpaE-00061l-9q
 for bug-guix@gnu.org; Mon, 02 Dec 2019 12:34:03 -0500
Received: from debbugs.gnu.org ([209.51.188.43]:32901)
 by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16)
 (Exim 4.71) (envelope-from <Debian-debbugs@debbugs.gnu.org>)
 id 1ibpaE-00061f-6Z
 for bug-guix@gnu.org; Mon, 02 Dec 2019 12:34:02 -0500
Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2)
 (envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1ibpaD-00030B-Vb
 for bug-guix@gnu.org; Mon, 02 Dec 2019 12:34:01 -0500
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
Resent-Message-ID: <handler.37757.B37757.157530799911484@debbugs.gnu.org>
In-Reply-To: <87wobkw7gj.fsf@gnu.org> ("Ludovic
 \=\?utf-8\?Q\?Court\=C3\=A8s\=22'\?\=
 \=\?utf-8\?Q\?s\?\= message of "Thu, 28 Nov 2019 12:45:00 +0100")
List-Id: Bug reports for GNU Guix <bug-guix.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/bug-guix>,
 <mailto:bug-guix-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/bug-guix>
List-Post: <mailto:bug-guix@gnu.org>
List-Help: <mailto:bug-guix-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/bug-guix>,
 <mailto:bug-guix-request@gnu.org?subject=subscribe>
Errors-To: bug-guix-bounces+gcggb-bug-guix=m.gmane.org@gnu.org
Sender: "bug-Guix" <bug-guix-bounces+gcggb-bug-guix=m.gmane.org@gnu.org>
To: Jesse Gibbons <jgibbons2357@gmail.com>, Jan <tona_kosmicznego_smiecia@interia.pl>
Cc: 37757@debbugs.gnu.org

--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

Hi!

Ludovic Courtès <ludo@gnu.org> writes:

> Jesse (and anyone else experiencing this!), could you try to (1)
> reconfigure with this patch, (2) reboot, (3) try to halt the system to
> reproduce the crash, and (4) retrieve a backtrace from the ‘core’ file?
>
> For #4, you’ll have to do something along these lines once you’ve
> rebooted after the crash:
>
>   sudo gdb /run/current-system/profile/bin/guile /core
>
> and then type “thread apply all bt” at the GDB prompt.

It turns out the previous patch didn’t work; in short, we really have to
use only async-signal-safe functions in the signal handler, so this has
to be done in C.

The attached patch does that.  I’ve tried it with ‘guix system
container’ and, from what I can see, it seems to dump core as expected.

Let me know if you manage to reproduce the bug and to get a core dumped
with this patch.

To everyone reading this: if you’re experiencing shepherd crashes,
please raise your hand :-) and consider applying this patch so we can
gather debugging info!

Thanks,
Ludo’.


--=-=-=
Content-Type: text/x-patch
Content-Disposition: inline

diff --git a/gnu/services/shepherd.scm b/gnu/services/shepherd.scm
index 08bb33039c..cf82ef0a4c 100644
--- a/gnu/services/shepherd.scm
+++ b/gnu/services/shepherd.scm
@@ -271,6 +271,23 @@ and return the resulting '.go' file."
                          (compile-file #$file #:output-file #$output
                                        #:env env))))))
 
+(define (crash-handler)
+  (define gcc-toolchain
+    (module-ref (resolve-interface '(gnu packages commencement))
+                'gcc-toolchain))
+
+  (define source
+    (local-file "../system/aux-files/shepherd-crash-handler.c"))
+
+  (computed-file "crash-handler.so"
+                 #~(begin
+                     (setenv "PATH" #+(file-append gcc-toolchain "/bin"))
+                     (setenv "CPATH" #+(file-append gcc-toolchain "/include"))
+                     (setenv "LIBRARY_PATH"
+                             #+(file-append gcc-toolchain "/lib"))
+                     (system* "gcc" "-Wall" "-g" "-O3" "-fPIC"
+                              "-shared" "-o" #$output #$source))))
+
 (define (shepherd-configuration-file services)
   "Return the shepherd configuration file for SERVICES."
   (assert-valid-graph services)
@@ -281,6 +298,9 @@ and return the resulting '.go' file."
           (use-modules (srfi srfi-34)
                        (system repl error-handling))
 
+          ;; Load the crash handler, which allows shepherd to dump core.
+          (dynamic-link #$(crash-handler))
+
           ;; Arrange to spawn a REPL if something goes wrong.  This is better
           ;; than a kernel panic.
           (call-with-error-handling
diff --git a/gnu/system/aux-files/shepherd-crash-handler.c b/gnu/system/aux-files/shepherd-crash-handler.c
new file mode 100644
index 0000000000..6b2db10866
--- /dev/null
+++ b/gnu/system/aux-files/shepherd-crash-handler.c
@@ -0,0 +1,70 @@
+#define _GNU_SOURCE
+
+#include <stdlib.h>
+#include <unistd.h>
+#include <sched.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/syscall.h>   /* For SYS_xxx definitions */
+#include <signal.h>
+
+static void
+handle_crash (int sig)
+{
+  static const char msg[] = "Shepherd crashed!\n";
+  write (2, msg, sizeof msg - 1);
+
+#ifdef __sparc__
+  /* See 'raw_clone' in systemd.  */
+# error "SPARC uses a different 'clone' syscall convention"
+#endif
+
+  pid_t pid = syscall (SYS_clone, SIGCHLD, NULL);
+  if (pid < 0)
+    abort ();
+
+  if (pid == 0)
+    {
+      /* Restore the default signal handler to get a core dump.  */
+      signal (sig, SIG_DFL);
+
+      const struct rlimit infinity = { RLIM_INFINITY, RLIM_INFINITY };
+      setrlimit (RLIMIT_CORE, &infinity);
+      chdir ("/");
+
+      int pid = syscall (SYS_getpid);
+      kill (pid, sig);
+
+      /* As it turns out, 'kill' simply returns without doing anything, which
+	 is consistent with the "Notes" section of kill(2).  Thus, force a
+	 crash.  */
+      * (int *) 0 = 42;
+
+      _exit (254);
+    }
+  else
+    {
+      signal (sig, SIG_IGN);
+
+      int status;
+      waitpid (pid, &status, 0);
+
+      sync ();
+
+      _exit (255);
+    }
+
+  _exit (253);
+}
+
+static void initialize_crash_handler (void)
+  __attribute__ ((constructor));
+
+static void
+initialize_crash_handler (void)
+{
+  signal (SIGSEGV, handle_crash);
+  signal (SIGABRT, handle_crash);
+}

--=-=-=--