Hello Ludovic! Ludovic Courtès writes: > Hello! > > Maxim Cournoyer skribis: > >> 374 connect(11, {sa_family=AF_UNIX, sun_path="/var/run/dbus/system_bus_socket"}, 34) = 0 > > [...] > >> 374 epoll_wait(5, [{events=EPOLLIN|EPOLLOUT|EPOLLHUP, data={u32=24802800, u64=24802800}}], 20, -1) = 1 >> 374 sendmsg(11, {msg_name=NULL, msg_namelen=0, >> msg_iov=[{iov_base="l\1\0\1\0\0\0\0\1\0\0\0m\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\3\1s\0\5\0\0\0Hello\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\6\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0", >> iov_len=128}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, >> MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EPIPE (Broken pipe) >> 374 gettid() = 374 >> 374 epoll_ctl(5, EPOLL_CTL_MOD, 11, {events=0, data={u32=24802800, u64=24802800}}) = 0 >> 374 timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0 >> 374 epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2 >> 374 read(12, "\1\0\0\0\0\0\0\0", 8) = 8 >> 374 gettid() = 374 >> 374 timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0 >> 374 epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2 >> 374 read(12, "\1\0\0\0\0\0\0\0", 8) = 8 >> 374 gettid() = 374 >> 374 timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0 >> 374 epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2 >> 374 read(12, "\1\0\0\0\0\0\0\0", 8) = 8 >> 374 epoll_ctl(5, EPOLL_CTL_DEL, 11, NULL) = 0 >> 374 close(11) = 0 >> 374 gettid() = 374 >> 374 epoll_wait(5, >> 391 <... close resumed>) = 0 >> 391 madvise(0x7fd6c83dc000, 8368128, MADV_DONTNEED) = 0 >> 391 exit(0) = ? >> 391 +++ exited with 0 +++ >> 374 <... epoll_wait resumed>[{events=EPOLLERR, data={u32=24768000, u64=24768000}}], 17, -1) = 1 >> 374 lseek(7, 0, SEEK_SET) = 0 >> 374 read(7, "tty7\n", 63) = 5 > > As you pointed out on IRC, the initially ‘Hello’ method call above leads > to EPIPE, and we can see that elogind eventually closes its socket to > dbus-daemon *but* keeps doing its thing. > > Some interesting things to note… Indeed, very interesting :-). Thanks for diving in! > First, to my surprise, elogind does not use the client library of the > ‘dbus’ package: > > $ guix gc --references $(./pre-inst-env guix build elogind)|grep dbus > $ echo $? > 1 > > > (This is already the case in ‘master’ with v243.7.) Instead, it has its > own implementation of the DBus protocol, in C, from systemd—we can’t > have enough sources of bugs and vulnerabilities. Oh! Marius found this interesting issue [0] that they shared on IRC today; I wonder if it may be related. sd-bus apparently pipeline things aggressively, which is not always handled well by other D-bus participants. [0] https://github.com/systemd/systemd/issues/16610 > Anyway, the “Hello” message is sent to the system bus asynchronously in > ‘sd-bus.c’: > > static int bus_send_hello(sd_bus *bus) { > _cleanup_(sd_bus_message_unrefp) sd_bus_message *m = NULL; > int r; > > assert(bus); > > if (!bus->bus_client) > return 0; > > r = sd_bus_message_new_method_call( > bus, > &m, > "org.freedesktop.DBus", > "/org/freedesktop/DBus", > "org.freedesktop.DBus", > "Hello"); > if (r < 0) > return r; > > return sd_bus_call_async(bus, NULL, m, hello_callback, NULL, 0); > } > > > A callback is called when a reply is received or an error arises: > > static int hello_callback(sd_bus_message *reply, void *userdata, sd_bus_error *error) { > > [...] > > fail: > /* When Hello() failed, let's propagate this in two ways: first we return the error immediately here, > * which is the propagated up towards the event loop. Let's also invalidate the connection, so that > * if the user then calls back into us again we won't wait any longer. */ > > bus_set_state(bus, BUS_CLOSING); > return r; > } > > > It’s not clear from that whether the authors intended for the thing to > keep going in case of failure. In our case it’s not helpful. If we picture this in the systemd use case, I believe sd-bus must be used as *the* d-bus daemon, perhaps? So it should never die, and expect users to call back into it to retry things? In our case, it acts as a D-Bus client, not a server (IIUC), so perhaps its behavior is not tuned right for our use case. > But why does dbus-daemon drop the connection in the first place? > > To know that, we could change ‘dbus-root-service-type’ to run > dbus-daemon from a ‘--enable-verbose-mode’ build, and with the > ‘DBUS_VERBOSE’ environment set to 1. > > Looking at ‘dbus-server-socket.c’ it would seem that the only sensible > way this can happen is if dbus-daemon doesn’t yet have a > ‘new_connection_function’ set at the time it accepts the incoming > connection: Interesting that you mention this; that's what I worked on yesterday (see attached patches). I managed to get elogind panicking the kernel during a guix system reconfigure, which corrupted GRUB, so had to chroot and reconfigure from there [1]. Not sure what happened, but it seems that killing and restarting elogind is susceptible to cause hard locks. [1] https://lists.gnu.org/archive/html/guix-devel/2018-06/msg00202.html