Hello Ludovic!

Ludovic Courtès <ludo@gnu.org> writes:

> Hello!
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>
>> 374   connect(11, {sa_family=AF_UNIX, sun_path="/var/run/dbus/system_bus_socket"}, 34) = 0
>
> [...]
>
>> 374   epoll_wait(5, [{events=EPOLLIN|EPOLLOUT|EPOLLHUP, data={u32=24802800, u64=24802800}}], 20, -1) = 1
>> 374 sendmsg(11, {msg_name=NULL, msg_namelen=0,
>> msg_iov=[{iov_base="l\1\0\1\0\0\0\0\1\0\0\0m\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\3\1s\0\5\0\0\0Hello\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\6\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0",
>> iov_len=128}], msg_iovlen=1, msg_controllen=0, msg_flags=0},
>> MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EPIPE (Broken pipe)
>> 374   gettid()                          = 374
>> 374   epoll_ctl(5, EPOLL_CTL_MOD, 11, {events=0, data={u32=24802800, u64=24802800}}) = 0
>> 374   timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
>> 374   epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
>> 374   read(12, "\1\0\0\0\0\0\0\0", 8)   = 8
>> 374   gettid()                          = 374
>> 374   timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
>> 374   epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
>> 374   read(12, "\1\0\0\0\0\0\0\0", 8)   = 8
>> 374   gettid()                          = 374
>> 374   timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
>> 374   epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
>> 374   read(12, "\1\0\0\0\0\0\0\0", 8)   = 8
>> 374   epoll_ctl(5, EPOLL_CTL_DEL, 11, NULL) = 0
>> 374   close(11)                         = 0
>> 374   gettid()                          = 374
>> 374   epoll_wait(5,  <unfinished ...>
>> 391   <... close resumed>)              = 0
>> 391   madvise(0x7fd6c83dc000, 8368128, MADV_DONTNEED) = 0
>> 391   exit(0)                           = ?
>> 391   +++ exited with 0 +++
>> 374   <... epoll_wait resumed>[{events=EPOLLERR, data={u32=24768000, u64=24768000}}], 17, -1) = 1
>> 374   lseek(7, 0, SEEK_SET)             = 0
>> 374   read(7, "tty7\n", 63)             = 5
>
> As you pointed out on IRC, the initially ‘Hello’ method call above leads
> to EPIPE, and we can see that elogind eventually closes its socket to
> dbus-daemon *but* keeps doing its thing.
>
> Some interesting things to note…

Indeed, very interesting :-).  Thanks for diving in!

> First, to my surprise, elogind does not use the client library of the
> ‘dbus’ package:
>
> $ guix gc --references $(./pre-inst-env guix build elogind)|grep dbus
> $ echo $?
> 1
>
>
> (This is already the case in ‘master’ with v243.7.)  Instead, it has its
> own implementation of the DBus protocol, in C, from systemd—we can’t
> have enough sources of bugs and vulnerabilities.

Oh!  Marius found this interesting issue [0] that they shared on IRC
today; I wonder if it may be related.  sd-bus apparently pipeline things
aggressively, which is not always handled well by other D-bus
participants.

[0]  https://github.com/systemd/systemd/issues/16610

> Anyway, the “Hello” message is sent to the system bus asynchronously in
> ‘sd-bus.c’:
>
> static int bus_send_hello(sd_bus *bus) {
>         _cleanup_(sd_bus_message_unrefp) sd_bus_message *m = NULL;
>         int r;
>
>         assert(bus);
>
>         if (!bus->bus_client)
>                 return 0;
>
>         r = sd_bus_message_new_method_call(
>                         bus,
>                         &m,
>                         "org.freedesktop.DBus",
>                         "/org/freedesktop/DBus",
>                         "org.freedesktop.DBus",
>                         "Hello");
>         if (r < 0)
>                 return r;
>
>         return sd_bus_call_async(bus, NULL, m, hello_callback, NULL, 0);
> }
>
>
> A callback is called when a reply is received or an error arises:
>
> static int hello_callback(sd_bus_message *reply, void *userdata, sd_bus_error *error) {
>
> [...]
>
> fail:
>         /* When Hello() failed, let's propagate this in two ways: first we return the error immediately here,
>          * which is the propagated up towards the event loop. Let's also invalidate the connection, so that
>          * if the user then calls back into us again we won't wait any longer. */
>
>         bus_set_state(bus, BUS_CLOSING);
>         return r;
> }
>
>
> It’s not clear from that whether the authors intended for the thing to
> keep going in case of failure.  In our case it’s not helpful.

If we picture this in the systemd use case, I believe sd-bus must be
used as *the* d-bus daemon, perhaps?  So it should never die, and expect
users to call back into it to retry things?  In our case, it acts as a
D-Bus client, not a server (IIUC), so perhaps its behavior is not tuned
right for our use case.

> But why does dbus-daemon drop the connection in the first place?
>
> To know that, we could change ‘dbus-root-service-type’ to run
> dbus-daemon from a ‘--enable-verbose-mode’ build, and with the
> ‘DBUS_VERBOSE’ environment set to 1.
>
> Looking at ‘dbus-server-socket.c’ it would seem that the only sensible
> way this can happen is if dbus-daemon doesn’t yet have a
> ‘new_connection_function’ set at the time it accepts the incoming
> connection:

Interesting that you mention this; that's what I worked on yesterday
(see attached patches).  I managed to get elogind panicking the kernel
during a guix system reconfigure, which corrupted GRUB, so had to chroot
and reconfigure from there [1].  Not sure what happened, but it seems
that killing and restarting elogind is susceptible to cause hard locks.

[1]  https://lists.gnu.org/archive/html/guix-devel/2018-06/msg00202.html