From: Philipp
Newsgroups: gmane.emacs.devel
Subject: Re: master 8dcb19f 4/4: Add a unit test testing interaction between threads and processes.
Date: Thu, 18 Mar 2021 13:51:10 +0100
Message-ID: <25EEFE6F-9532-4A5A-8824-A3F8F41012F3@gmail.com>
References: <83k0s34eo1.fsf@gnu.org> <83h7n74cwe.fsf@gnu.org> <838s8i4ak8.fsf@gnu.org> <6C5E64C7-19A6-4D46-AC30-33C29FA5DEBB@gmail.com> <83mtvo7xno.fsf@gnu.org>
To: Eli Zaretskii
Cc: Philipp Stephani, emacs-devel@gnu.org

> On 28.02.2021 at 21:19, Eli Zaretskii wrote:
>
>> From: Philipp
>> Date: Sun, 28 Feb 2021 19:30:46 +0100
>> Cc: Philipp Stephani, emacs-devel@gnu.org
>>
>> The specific scenario here is: Start a number of processes and a
>> number of threads, each of which waits for one of the processes
>
> See below: I don't see how you ensure in this test that each thread
> waits for just one process.

Shouldn't providing a process argument to accept-process-output ensure
this, modulo implementation details that are not observable? The only
nontrivial guarantee that the test requires is that after
(accept-process-output PROC) returns nil, the process PROC has finished,
independent of other processes or threads, and that multiple concurrent
calls of this form don't deadlock.

>
>> then, cause the processes to exit (since they run "cat", which works
>> by sending EOF); then, join all the threads. This should work without
>> hangs or errors.
>
> There are two "inputs" from each subprocess that Emacs receives: the
> self-pipe output and the SIGCHLD signal. Depending on the timing and
> on which thread receives which input, the behavior can be different.
> Do you have a clear idea what should happen and what does happen wrt
> these "inputs"? In particular, which thread(s) do you expect to see
> SIGCHLD and/or read the self-pipe as a result of that, and do you
> think it's important whether a thread gets awoken by the signal or by
> reading the self-pipe?

All of those are implementation details that should be entirely
invisible to the thread interface. As such, none of the threads should
"see" any of those. All that matters is the observable behavior of
accept-process-output: it may return nil only if and when its process
argument has finished, independent of threads or signal disposition.
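
To make that contract concrete, here is roughly the kind of wait loop I
have in mind (an illustrative sketch, not the literal test code; the
helper name is made up):

;; Illustrative only; relies solely on the contract discussed here:
;; accept-process-output with a process argument is expected to return
;; nil only once that process has finished.
(defun my-wait-for-process (proc)
  "Wait until PROC has finished, then return its exit status."
  (while (accept-process-output proc))
  (process-exit-status proc))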

>
>>>> The test wouldn't work without it, because it calls
>>>> accept-process-output from a thread different from the thread in
>>>> which the process was created.
>>>
>>> But set-process-thread accepts a thread argument, so you could bind
>>> the process to a particular thread, couldn't you?
>>
>> Yes, but the threads that call `accept-process-output' and
>> `process-send-eof' are different threads; binding the process object
>> to one of the threads would cause the other call to fail.
>
> As Stefan points out, process-send-eof could be sent from any thread.

Correct, though this shouldn't make a difference: the test should pass
whether the process objects have thread affinity or not (and that's the
case for me).

>
>>> More generally, my reading of the code is that the first thread which
>>> calls accept-process-output will start listening on all the file
>>> descriptors of all the sub-processes which were started until then,
>>> and all the other threads will have no descriptors left to listen on.
>>> Is this the situation you intended to create?
>>
>> No, but I'm not convinced that that's actually how the implementation
>> behaves. Wouldn't, with the current implementation, each thread wait
>> for the file descriptors of one process?
>
> How would that happen, in this particular test? The test first starts
> 10 processes, then creates 10 threads, then calls thread-join for each
> of the threads. The threads that the test creates don't run, because
> the first thing each thread does is attempt to take the global lock,
> and it becomes blocked because the main thread still runs. The main
> thread yields the first time it calls thread-join; at that time, one
> of the 10 waiting threads (not necessarily the one passed as argument
> to thread-join) will take the global lock. This thread will call
> accept-process-output, which calls wait_reading_process_output, which
> computes the mask for pselect to wait on. Now look at the code in
> compute_non_keyboard_wait_mask and its ilk: the mask will include
> every descriptor that is not already watched by some other thread.
> Since this is the first thread that calls wait_reading_process_output,
> it will find all the descriptors up for grabs, and will add all of
> them to the mask. Then it will release the global lock and call
> pselect. Releasing the global lock will let some other thread of the
> 9 remaining threads run; that thread will also call
> accept-process-output, which will call wait_reading_process_output --
> but now all the descriptors are taken by the first thread, so there's
> nothing left for the second one, and the same for all the other 8
> threads.
>
> So what we have now is one thread waiting for output from all the 10
> subprocesses, and the other 9 waiting only on the self-pipe. What
> happens then, when the processes exit and we get 10 SIGCHLD signals,
> is left as an exercise. And don't forget that the order in which we
> receive the signal and the byte from the self-pipe is not entirely
> deterministic, either.
>
> Am I missing something, or is this what happens the way the test is
> written? And if I'm not missing anything, is this the situation you
> wanted to create and test?

This might be how the implementation behaves (I haven't checked, but I
have no reason to doubt your explanation). However, the test doesn't
test the implementation, but the observable behavior. That is, I can't
formally prove that the implementation leads to the observable behavior
that the test expects, but I think it should, and I'd consider it a bug
(in Emacs, not the test code) if it doesn't.
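
For reference, this is roughly the shape I take the test to have, stated
purely in terms of observable behavior, with the timeout as the hang
guard (a reconstruction, not the exact code in the tree; names and
details are illustrative, and lexical binding is assumed so the lambdas
capture PROC):

(require 'ert)

(ert-deftest my-test/threads-wait-for-processes ()
  (with-timeout (60 (ert-fail "timed out -- possible deadlock"))
    (let* ((procs
            (mapcar (lambda (_)
                      (let ((proc (make-process :name "cat"
                                                :command '("cat")
                                                :connection-type 'pipe)))
                        ;; Unlock the process from the creating thread,
                        ;; so that another thread may wait for it (see
                        ;; "the test wouldn't work without it" above).
                        (set-process-thread proc nil)
                        proc))
                    (number-sequence 1 10)))
           ;; One waiter per process; each waits only for "its" process
           ;; via the process argument.
           (threads
            (mapcar (lambda (proc)
                      (make-thread
                       (lambda ()
                         (while (accept-process-output proc)))))
                    procs)))
      ;; "cat" exits once its stdin is closed.
      (mapc #'process-send-eof procs)
      ;; Synchronization points: after thread-join, the waiter is done.
      (mapc #'thread-join threads)
      (dolist (proc procs)
        (should (eq (process-status proc) 'exit))
        (should (zerop (process-exit-status proc)))))))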

>
>>> Why is it important to test this situation, and what do we expect to
>>> happen in it? Is it reasonable to have more than one thread wait for
>>> output from a subprocess?
>>
>> The tests shouldn't ask what's reasonable, but what's documented or
>> expected. Surely users can write code like this, and I don't see why
>> that shouldn't be supported.
>
> I didn't say anything about unsupporting this situation, I asked what
> should happen in it, and I meant the details, not merely that there
> are no errors. E.g., is it important that each thread gets output
> from exactly one subprocess?

How does a thread "get output"? Do you mean that accept-process-output
returns non-nil? What I'd expect from accept-process-output with a
process argument is that directly after it returns nil, the process
represented by the process argument has finished, all output has been
passed to buffers/filters, and the sentinel has run. That's independent
of the thread in which accept-process-output is called.

> is it important whether a thread exits
> because it read output or because it received SIGCHLD and read the
> self-pipe?

No, these are implementation details: they might as well not exist as
far as the observable behavior of accept-process-output is concerned.

> is it important whether all the threads exit one after
> another, or is it okay if they all wait until one thread exits and
> then exit immediately (because all the processes have already died)?

I don't think that's observable/documented behavior: we only know that a
thread has exited after thread-join returns.

> etc.
>
> I instrumented the code that gets executed by this test and saw all
> kinds of strange phenomena, and I'm asking myself whether you looked
> at those details and whether you are satisfied by what you saw.

No, that's not the way I approach topics like this. I wrote the test
based on my reading of the documentation (having in mind the complexity
of the implementation, but not being guided by it). I then ran the test
and it passed, which gave me a bit of confidence in the correctness of
the implementation (which the test can't prove, of course).

So the questions I'd ask here: Is the test code correct, based on
documented (or at least implied) guarantees? If it's correct, but the
test fails on some system, the implementation is buggy. What specific
sequence of events in the implementation code leads to the buggy
behavior is then only the next step, not the starting point of the
discussion.

>
>>> The question I ask myself is what to do with the deviant behavior on
>>> MS-Windows. If it behaves differently because what we test here are
>>> the peculiarities of the semi-broken Unix mechanism known as "signals"
>>> and/or some details of our implementation of subprocesses on Posix
>>> platforms, then maybe there's no need to change anything in the
>>> Windows code, and just either skip this test on MS-Windows or write a
>>> different test which will work there. IOW, in such a case there's no
>>> bug in the Windows port, it's just that it isn't bug-for-bug
>>> compatible with Unix.
>>
>> I don't think so. Again, I wrote the test based on my understanding of
>> the expected and promised behavior of the process and thread
>> functions, not the Posix implementation.
>
> Where and how did you obtain that understanding? The documentation is
> (intentionally or unintentionally) very vague regarding these details.

Yes, so I had to fill in some aspects that I hope are reasonable. In
particular, the documentation doesn't state explicitly that waiting for
a process in a different thread than the original one is supported, but
given set-process-thread it implicitly has to be, because what would
otherwise be the purpose of set-process-thread?
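
As an illustration, this is the kind of use I read into
set-process-thread -- creating a process in one thread and handing it to
the thread that will wait for it (a sketch under that assumption, not
code from the test; lexical binding is assumed):

(let* ((proc (make-process :name "cat" :command '("cat")
                           :connection-type 'pipe))
       (worker (make-thread
                (lambda ()
                  ;; Claim PROC for this thread before waiting on it;
                  ;; my reading is that its output is then accepted
                  ;; only here.
                  (set-process-thread proc (current-thread))
                  (while (accept-process-output proc))))))
  ;; Per Stefan's remark, process-send-eof may still be called from the
  ;; main thread.
  (process-send-eof proc)
  (thread-join worker))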

The documentation definitely needs to be significantly expanded, as this
discussion shows.

So maybe asking more directly:
- Should this test pass?
- If so, how can this be deduced from the documentation?
- If not, how can *that* be deduced from the documentation, and what
  changes would be necessary to guarantee that it passes?
- Is there anything missing from the documentation that prevents
  deducing whether the test should pass or not?

> Perhaps you only cared about whether the code should or shouldn't err,
> but that's clearly not the only interesting or important aspect of
> this exercise, is it?

It's at least one important aspect, but finding deadlocks (hangs) is
probably more important.

>
>>> But if what happens in this test is important in practical use cases,
>>> then perhaps our Windows code does need some changes in this area.
>>>
>>> This is why I'm trying to understand the semantics of this test -- I'd
>>> like to figure out which implementation details are important, and
>>> thus should be emulated on Windows, and which are just details that
>>> can be different (and portable Lisp programs will have to take those
>>> differences into account).
>>
>> The way I see it, all aspects of the test are important -- their
>> behavior should be guaranteed by the implementation.
>
> Which aspects of those I mentioned above do you think the
> implementation should guarantee? E.g., should it guarantee how many
> threads get the actual output from the processes in this case,

That depends on what "get the actual output" means. What I expect is the
behavior of accept-process-output, as explained above. Specifically:
accept-process-output will return nil quickly (within milliseconds,
assuming the calling thread is unblocked) once the given process has
exited. Once it returns, all process output has been given to process
filters/buffers, and the sentinel has run.

> and how
> many get the signal?

None, as the signal is an implementation detail and not part of the
interface.

>
>>> For example, the test verifies that each process exited with zero
>>> status, but it doesn't care which thread detected that -- is this
>>> important?
>>
>> IIUC threads don't really "detect" process exit.
>
> ??? When a subprocess exits, Emacs gets a SIGCHLD, and then we access
> the exit code in order to pass it to the sentinel function. This is
> what I call "detecting process exit".

Sure, but is that really related to threads? I'd rather say that *Emacs*
detects process exits.

It's also true that there's some thread that contains the `accept' call,
but that's again an implementation detail and not related to Lisp
threads. Several other implementations would be possible: calling
`accept' in a single thread not accessible from Lisp at all;
implementing Lisp threads as "green threads" without pthread support;
etc.
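
Restating the expectation about accept-process-output, filters, and
sentinels from above in code form (again only an illustrative sketch;
lexical binding is assumed so the lambdas can update the local
variables):

;; The claim is that once the wait loop returns, the filter has seen
;; all output and the sentinel has run.
(let* ((output "")
       (events "")
       (proc (make-process
              :name "echo" :command '("sh" "-c" "echo hello")
              :connection-type 'pipe
              :filter (lambda (_proc string)
                        (setq output (concat output string)))
              :sentinel (lambda (_proc event)
                          (setq events (concat events event))))))
  (while (accept-process-output proc))
  ;; Expected at this point, per the guarantee argued for above:
  ;; output => "hello\n", events => "finished\n", exit status => 0.
  (list output events (process-exit-status proc)))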

>
>> What I'd assume is that the `thread-join' calls are synchronization
>> points -- after calling `thread-join' the thread in question is
>> guaranteed to have finished, and since the thread has been waiting for
>> a process to finish (using the accept-process-output loop), the
>> process is also guaranteed to have finished, and this sequence of
>> events is guaranteed exactly in that order.
>
> I don't see how you reach those conclusions. The first part is true:
> when thread-join exits, the thread it was waiting for is guaranteed to
> have exited. But how it exited, and what caused it to exit -- these
> are open questions, in a situation where 10 subprocesses generate
> output and produce 10 signals.

There's only one way for a thread to exit: when its thread function
returns. At least that's what the manual says. And the thread function
can only return when `accept-process-output' returns nil, so we're back
to the discussion of what the semantics of `accept-process-output' are.

>
>>> Or what about delivery of SIGCHLD -- is it important
>>> which thread gets delivered the signal, or how many threads should the
>>> signal wake up?
>>
>> Not as long as the observable behavior stays the same (and isn't
>> buggy).
>
> And the "observable behavior" is what? That processes exited
> successfully and no Lisp errors happened? Or is there something else
> we expect in this case?

That the code as a whole doesn't hang/deadlock. It's harder to make
precise guarantees for that, but the 60s timeout should be long enough
for the test to finish even on a heavily loaded system.

>
>>> Or is it important when and due to what reason(s)
>>> does each thread exit?
>>
>> It's important that each thread exits normally, i.e., due to its
>> thread function body exiting normally.
>
> So it is NOT important what event caused the thread function to run,
> and which event caused it to exit accept-process-output and finish?

No. AFAIK these are unobservable, and the manual doesn't make any
guarantees in this respect.

>
>> the observable order of events matters: first process-send-eof, then
>> process exit, then thread exit, then thread-join returns.
>
> The order of the first two is guaranteed because a single thread
> arranges for that sequentially. But do you care if the sequence is
> like this:
>
> . all the 10 processes exit
> . one of the threads exits because the processes exited
> . the other 9 threads get awoken one by one, exit
>   accept-process-output because of SIGCHLD, and exit

From what I can see, that should be fine. The test only checks that the
threads finish eventually and don't deadlock, but doesn't care about the
order of events across threads (which is not guaranteed unless there are
explicit synchronization points like mutexes or condition variables).
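
(For completeness, this is the kind of explicit synchronization point I
mean; illustrative only, and lexical binding is assumed.)

;; Imposing an order across threads with a mutex and a condition
;; variable: the waiting thread observes the other thread's progress
;; only through this explicit synchronization point.
(let* ((mutex (make-mutex))
       (done (make-condition-variable mutex))
       (ready nil))
  (make-thread (lambda ()
                 (with-mutex mutex
                   (setq ready t)
                   (condition-notify done))))
  (with-mutex mutex
    (while (not ready)
      (condition-wait done))))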