From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: Eli Zaretskii Newsgroups: gmane.emacs.devel Subject: Re: scratch/sigchld-fd 8f0ce42 1/2: Fix deadlock when receiving SIGCHLD during 'pselect'. Date: Wed, 20 Jan 2021 17:05:46 +0200 Message-ID: <83pn1z7inp.fsf@gnu.org> References: <20210116184947.2105.45267@vcs0.savannah.gnu.org> <20210116184949.3C17C211A5@vcs0.savannah.gnu.org> <834kjd7yk5.fsf@gnu.org> <83turc7n93.fsf@gnu.org> Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="26137"; mail-complaints-to="usenet@ciao.gmane.io" Cc: phst@google.com, emacs-devel@gnu.org To: Philipp Stephani Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Wed Jan 20 16:06:55 2021 Return-path: Envelope-to: ged-emacs-devel@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1l2F4N-0006cm-TY for ged-emacs-devel@m.gmane-mx.org; Wed, 20 Jan 2021 16:06:51 +0100 Original-Received: from localhost ([::1]:56524 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1l2F4M-0003nM-UD for ged-emacs-devel@m.gmane-mx.org; Wed, 20 Jan 2021 10:06:50 -0500 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]:40488) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1l2F3I-0003Be-8s for emacs-devel@gnu.org; Wed, 20 Jan 2021 10:05:46 -0500 Original-Received: from fencepost.gnu.org ([2001:470:142:3::e]:51514) by eggs.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1l2F3H-0003Qc-Qa; Wed, 20 Jan 2021 10:05:43 -0500 Original-Received: from 84.94.185.95.cable.012.net.il ([84.94.185.95]:4180 helo=home-c4e4a596f7) by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256) (Exim 4.82) (envelope-from ) id 1l2F38-00079t-BX; Wed, 20 Jan 2021 10:05:43 -0500 In-Reply-To: (message from Philipp Stephani on Tue, 19 Jan 2021 21:22:04 +0100) X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.io gmane.emacs.devel:263205 Archived-At: > From: Philipp Stephani > Date: Tue, 19 Jan 2021 21:22:04 +0100 > Cc: Emacs developers , Philipp Stephani > process.c:4729: Faccept_process_output: enter > process.c:5139: wait_reading_process_output: enter > process.c:5193: wait_reading_process_output: outer loop > process.c:5322: wait_reading_process_output: update_tick = 261, > process_tick = 261 > process.c:5554: wait_reading_process_output: before pselect; max_desc = 1019 > process.c:5601: wait_reading_process_output: after pselect: nfds = -1 > process.c:5641: wait_reading_process_output: EINTR > process.c:5193: wait_reading_process_output: outer loop > process.c:5322: wait_reading_process_output: update_tick = 261, > process_tick = 261 > process.c:7189: handle_child_signal: enter > process.c:7234: handle_child_signal: process test 5: change status to > 0; new process_tick = 262 > process.c:144: handle_child_signal: leave > process.c:5554: wait_reading_process_output: before pselect; max_desc = 1017 > > and then Emacs hangs. > So wait_reading_process_output indeed first receives EINTR, loops > around, and checks for the process_tick change before SIGCHLD is > handled. By the time it reruns pselect it's too late. OK, thanks. So the scenario is that the SIGCHLD handler is called _after_ pselect returns with EINTR, and by that time we already call pselect again, whereas the logic in wait_reading_process_output is based on the assumption that the signal handler will be called _before_ pselect returns? (We should have this in the comments to the child_signal_* functions, so that we remember why we introduced this machinery.) First, is this a valid behavior, or is this a bug in the OS? I couldn't find any place where it is documented what Posix says about the relative timing of calling the handler vs syscall returning with EINTR -- is the behavior you see allowed? I do see many places that explicitly or implicitly assume that a syscall will return with EINTR _after_ the signal handler returns. If this _is_ valid behavior, I wonder whether we could somehow amend the loop and its logic to handle this case without introducing additional file descriptors to wait. For example, when we get EINTR, we could add another pselect call with a short timeout, so that the handler has time to run and update the process tick before we loop around. Or maybe, when we get EINTR, we should loop over the processes and see which one(s) have exited. Such changes would basically leave the overall logic intact, which is IMO better for more complex use cases that involve other Lisp threads. Or maybe the problem here is that more than one signal is involved, so that one of them interrupts pselect before the other (our SIGCHLD) gets delivered and calls the SIGCHLD handler? In any case, did you consider what happens with the self-pipe method when more than one Lisp thread is waiting in pselect? > > The thing is, on Windows we can only wait on up to 64 handles (unless > > we complicate the code with multilevel wait, that is), so every > > unnecessary descriptor we need to wait on means we can support fewer > > simultaneous subprocesses. We are already limited to just 32 > > subprocesses, which is quite low a number. > > OK, that's a good point, I didn't know about this limitation. I'll see > that I can remove the pipe on Windows. Just making those child_signal_* function be no-ops on MS-Windows, and #ifedf'ing away that call to FD_SET of child_fd, should be okay, I think.