From: Philipp
Newsgroups: gmane.emacs.devel
Subject: Re: master 8dcb19f 4/4: Add a unit test testing interaction between
threads and processes.
Date: Thu, 18 Mar 2021 13:51:10 +0100
Message-ID: <25EEFE6F-9532-4A5A-8824-A3F8F41012F3@gmail.com>
References: <83k0s34eo1.fsf@gnu.org> <83h7n74cwe.fsf@gnu.org>
<838s8i4ak8.fsf@gnu.org> <6C5E64C7-19A6-4D46-AC30-33C29FA5DEBB@gmail.com>
<83mtvo7xno.fsf@gnu.org>
Cc: Philipp Stephani , emacs-devel@gnu.org
To: Eli Zaretskii
In-Reply-To: <83mtvo7xno.fsf@gnu.org>
> On 28 Feb 2021, at 21:19, Eli Zaretskii wrote:
> 
>> From: Philipp
>> Date: Sun, 28 Feb 2021 19:30:46 +0100
>> Cc: Philipp Stephani ,
>> emacs-devel@gnu.org
>> 
>> The specific scenario here is: Start a number of processes and a
>> number of threads, each of which waits for one of the processes
> 
> See below: I don't see how you ensure in this test that each thread
> waits for just one process.
Shouldn't providing a process argument to accept-process-output ensure
this, modulo implementation details that are not observable? The only
nontrivial guarantee that the test requires is that after
(accept-process-output PROC) returns nil, the process PROC has finished,
independent of other processes or threads, and that multiple concurrent
calls of this form don't deadlock.
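Concretely, the wait loop whose guarantee I'm relying on looks roughly
like this (a sketch with a hypothetical helper name, not the actual test
code):

```elisp
;; Sketch of the guarantee discussed above, assuming only documented
;; behavior of `accept-process-output' with a process argument.
(defun my/wait-for-process (proc)
  "Loop until `accept-process-output' returns nil for PROC.
At that point PROC is expected to have finished, independent of
other processes or threads."
  (while (accept-process-output proc)))
```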
> 
>> then, cause the processes to exit (since they run "cat", that works
>> by sending EOF); then, join all the threads. This should work without
>> hangs or errors.
> 
> There are two "inputs" from each subprocess that Emacs receives: the
> self-pipe output and the SIGCHLD signal. Depending on the timing and
> on which thread receives which input, the behavior can be different.
> Do you have a clear idea what should happen and what does happen wrt
> these "inputs"? In particular, which thread(s) do you expect to see
> SIGCHLD and/or read the self-pipe as result of that, and do you think
> it's important whether a thread gets awoken by the signal or by
> reading the self-pipe?
All of those are implementation details that should be entirely
invisible to the thread interface. As such, none of the threads should
"see" any of those. All that matters is the observable behavior of
accept-process-output: it may return nil only if and when its process
argument has finished, independent of threads or signal disposition.
> 
>>>> The test wouldn't work without it, because it calls
>>>> accept-process-output from a thread different from the thread in
>>>> which the process was created.
>>> 
>>> But set-process-thread accepts a thread argument, so you could bind
>>> the process to a particular thread, couldn't you?
>> 
>> Yes, but the threads that call `accept-process-output' and
>> `process-send-eof' are different threads; binding the process object
>> to one of the threads would cause the other call to fail.
> 
> As Stefan points out, process-send-eof could be sent from any thread.
Correct, though this shouldn't make a difference: the test should pass
whether the process objects have thread affinity or not (and that's the
case for me).
> 
>>> More generally, my reading of the code is that the first thread
>>> which calls accept-process-output will start listening on all the
>>> file descriptors of all the sub-processes which were started until
>>> then, and all the other threads will have no descriptors left to
>>> listen on. Is this the situation you intended to create?
>> 
>> No, but I'm not convinced that that's actually how the implementation
>> behaves. With the current implementation, wouldn't each thread wait
>> for the file descriptors of one process?
> 
> How would that happen, in this particular test? The test first starts
> 10 processes, then creates 10 threads, then calls thread-join for each
> of the threads. The threads that the test create don't run, because
> the first thing each thread does is attempt to take the global lock,
> and becomes blocked because the main thread still runs. The main
> thread yields the first time it calls thread-join; at that time, one
> of the 10 waiting threads (not necessarily the one passed as argument
> to thread-join) will take the global lock. This thread will call
> accept-process-output, which calls wait_reading_process_output, which
> computes the mask for pselect to wait on. Now look at the code in
> compute_non_keyboard_wait_mask and its ilk: the mask will include
> every descriptor that is not already watched by some other thread.
> Since this is the first thread that calls wait_reading_process_output,
> it will find all the descriptors up for grabs, and will add all of
> them to the mask. Then it will release the global lock and call
> pselect. Releasing the global lock will let some other thread of the
> 9 remaining threads run, that thread will also call
> accept-process-output, which will call wait_reading_process_output --
> but now all the descriptors are taken by the first thread, so there's
> nothing left for the second one, and the same for all the other 8
> threads.
> 
> So what we have now is one thread waiting for output from all the 10
> subprocess, and the other 9 waiting only on the self-pipe. What
> happens then, when the processes exit and we get 10 SIGCHLD signals,
> is left as an exercise. And don't forget that order in which we
> receive the signal and the byte from self-pipe is not entirely
> deterministic, either.
> 
> Am I missing something, or is this what happens the way the test is
> written? And if I'm not missing anything, is this the situation you
> wanted to create and test?
This might be how the implementation behaves (I haven't checked, but I
have no reason to doubt your explanation). However, the test doesn't
test the implementation, but the observable behavior. That is, I can't
formally prove that the implementation leads to the observable behavior
that the test expects, but I think it should, and I'd consider it a bug
(in Emacs, not the test code) if it doesn't.
> 
>>> Why is it important to test this situation, and what do we expect to
>>> happen in it? Is it reasonable to have more than one thread wait for
>>> output from a subprocess?
>> 
>> The tests shouldn't ask what's reasonable, but what's documented or
>> expected. Surely users can write code like this, and I don't see why
>> that shouldn't be supported.
> 
> I didn't say anything about unsupporting this situation, I asked what
> should happen in it, and I meant the details, not merely that there
> are no errors. E.g., is it important that each thread gets output
> from exactly one subprocess?
How does a thread "get output"? Do you mean that accept-process-output
returns non-nil? What I'd expect from accept-process-output with a
process argument is that directly after it returns nil, the process
represented by the process argument has finished, all output has been
passed to buffers/filters, and the sentinel has run. That's independent
of the thread in which accept-process-output is called.
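Spelled out as code, that expectation is roughly (a sketch assuming
lexical binding, `cl-lib' for the assertion, and a hypothetical `proc';
not the test itself):

```elisp
;; -*- lexical-binding: t -*-
(require 'cl-lib)

;; Drive PROC's output handling until `accept-process-output'
;; returns nil.
(while (accept-process-output proc))
;; Expected postconditions, regardless of which thread ran the loop:
;; PROC has finished ...
(cl-assert (memq (process-status proc) '(exit signal)))
;; ... and, per the expectation above, all of its output has reached
;; the buffer/filter and the sentinel has already run.
```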
> is it important whether a thread exits
> because it read output or because it received SIGCHLD and read the
> self-pipe?
No, these are implementation details: they might as well not exist as
far as the observable behavior of accept-process-output is concerned.
> is it important whether all the threads exit one after
> another, or is it okay if they all wait until one thread exits and
> then exit immediately (because all the processes have already died)?
I don't think that's observable/documented behavior: we only know that a
thread has exited after thread-join returns.
> etc.
> 
> I instrumented the code that gets executed by this test and saw all
> kinds of strange phenomena, and I'm asking myself whether you looked
> at those details and whether you are satisfied by what you saw.
No, that's not the way I approach topics like this. I wrote the test
based on my reading of the documentation (having in mind the complexity
of the implementation, but not being guided by it). I then ran the test
and it passed, which gave me a bit of confidence in the correctness of
the implementation (which the test can't prove, of course).
So the questions I'd ask here: Is the test code correct, based on
documented (or at least implied) guarantees? If it's correct, but the
test fails on some system, the implementation is buggy. What specific
sequence of events in the implementation code leads to the buggy
behavior is then only the next step, not the starting point of the
discussion.
> 
>>> The question I ask myself is what to do with the deviant behavior on
>>> MS-Windows. If it behaves differently because what we test here are
>>> the peculiarities of the semi-broken Unix mechanism known as
>>> "signals" and/or some details of our implementation of subprocesses
>>> on Posix platforms, then maybe there's no need to change anything in
>>> the Windows code, and just either skip this test on MS-Windows or
>>> write a different test which will work there. IOW, in such a case
>>> there's no bug in the Windows port, it's just that it isn't
>>> bug-for-bug compatible with Unix.
>> 
>> I don't think so. Again, I wrote the test based on my understanding
>> of the expected and promised behavior of the process and thread
>> functions, not the Posix implementation.
> 
> Where and how did you obtain that understanding? The documentation is
> (intentionally or unintentionally) very vague regarding these details.
Yes, so I had to fill in some aspects that I hope are reasonable. In
particular, the documentation doesn't state explicitly that waiting for
a process in a different thread than the original one is supported, but
given set-process-thread it implicitly has to be, because what would
otherwise be the purpose of set-process-thread?
The documentation definitely needs to be significantly expanded, as this
discussion shows.
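For completeness, the knob in question (shown with a hypothetical
`proc'):

```elisp
;; Passing nil removes any thread affinity, so any thread may wait for
;; PROC; passing a thread object would bind the process to that thread.
(set-process-thread proc nil)
```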
So maybe asking more directly:
- Should this test pass?
- If so, how can this be deduced from the documentation?
- If not, how can *that* be deduced from the documentation, and what
  changes would be necessary to guarantee that it passes?
- Is there anything missing from the documentation that prevents
  deducing whether the test should pass or not?
> Perhaps you only cared about whether the code should or shouldn't err,
> but that's clearly not the only interesting or important aspect of
> this exercise, is it?
It's at least one important aspect, but finding deadlocks (hangs) is
probably more important.
> 
>>> But if what happens in this test is important in practical use
>>> cases, then perhaps our Windows code does need some changes in this
>>> area.
>>> 
>>> This is why I'm trying to understand the semantics of this test --
>>> I'd like to figure out which implementation details are important,
>>> and thus should be emulated on Windows, and which are just details
>>> that can be different (and portable Lisp programs will have to take
>>> those differences into account).
>> 
>> The way I see it, all aspects of the test are important -- their
>> behavior should be guaranteed by the implementation.
> 
> Which aspects of those I mentioned above do you think the
> implementation should guarantee? E.g., should it guarantee how many
> threads get the actual output from the processes in this case,
That depends on what "get the actual output" means. What I expect is
the behavior of accept-process-output, as explained above.
Specifically: accept-process-output will return nil quickly (within
milliseconds, assuming the calling thread is unblocked) once the given
process has exited. Once it returns, all process output has been given
to process filters/buffers, and the sentinel has run.
> and how
> many get the signal?
None, as the signal is an implementation detail and not part of the
interface.
> 
>>> For example, the test verifies that each process exited with zero
>>> status, but it doesn't care which thread detected that -- is this
>>> important?
>> 
>> IIUC threads don't really "detect" process exit.
> 
> ??? When a subprocess exits, Emacs gets a SIGCHLD, and then we access
> the exit code in order to pass it to the sentinel function. This is
> what I call "detecting process exit".
Sure, but is that really related to threads? I'd rather say that
*Emacs* detects process exits.
It's also true that there's some thread that contains the `accept' call,
but that's again an implementation detail and not related to Lisp
threads. Several other implementations would be possible: calling
`accept' in a single thread not accessible from Lisp at all;
implementing Lisp threads as "green threads" without pthread support;
etc.
> 
>> What I'd assume is that the `thread-join' calls are synchronization
>> points -- after calling `thread-join' the thread in question is
>> guaranteed to have finished, and since the thread has been waiting
>> for a process to finish (using the accept-process-output loop), the
>> process is also guaranteed to have finished, and this sequence of
>> events is guaranteed exactly in that order.
> 
> I don't see how you reach those conclusions. The first part is true:
> when thread-join exits, the thread it was waiting for is guaranteed to
> have exited. But how it exited, and what caused it to exit -- these
> are open questions, in a situation where 10 subprocesses generate
> output and produce 10 signals.
There's only one way for a thread to exit: when its thread function
returns. At least that's what the manual says. And the thread function
can only return when `accept-process-output' returns nil, so we're back
at the discussion of what the semantics of `accept-process-output' are.
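So the shape of each thread, and of the test as a whole, is essentially
this (a simplified sketch with "cat" subprocesses, assuming lexical
binding and `cl-lib'; not the actual test code):

```elisp
;; -*- lexical-binding: t -*-
(require 'cl-lib)

;; Each thread's function is just the wait loop, so a thread exits
;; exactly when `accept-process-output' returns nil for its process.
(let* ((procs (cl-loop repeat 10
                       collect (make-process :name "cat"
                                             :command '("cat")
                                             :connection-type 'pipe)))
       (threads (mapcar (lambda (proc)
                          (make-thread
                           (lambda ()
                             (while (accept-process-output proc)))))
                        procs)))
  (mapc #'process-send-eof procs)  ; make each `cat' exit
  (mapc #'thread-join threads))    ; must return without hanging
```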
> 
>>> Or what about delivery of SIGCHLD -- is it important which thread
>>> gets delivered the signal, or how many threads should the signal
>>> wake up?
>> 
>> Not as long as the observable behavior stays the same (and isn't
>> buggy).
> 
> And the "observable behavior" is what? that processes exited
> successfully and no Lisp errors happened? Or is there something else
> we expect in this case?
That the code as a whole doesn't hang/deadlock. It's harder to make
precise guarantees for that, but the 60s timeout should be long enough
for the test to finish even on a heavily loaded system.
> 
>>> Or is it important when and due to what reason(s)
>>> does each thread exit?
>> 
>> It's important that each thread exits normally, i.e., due to its
>> thread function body exiting normally.
> 
> So it is NOT important what event caused the thread function to run,
> and which event caused it to exit accept-process-output and finish?
No. AFAIK these are unobservable, and the manual doesn't make any
guarantees in this respect.
> 
>> the observable order of events matters: first process-send-eof, then
>> process exit, then thread exit, then thread-join returns.
> 
> The order of the first two is guaranteed because a single thread
> arranges for that sequentially. But do you care if the sequence is
> like this:
> 
> . all the 10 processes exit
> . one of the threads exits because the processes exited
> . the other 9 threads get awoken one by one, exit
>   accept-process-output because of SIGCHLD, and exit
From what I can see, that should be fine. The test only checks that the
threads finish eventually and don't deadlock, but doesn't care about the
order of events across threads (which is not guaranteed unless there are
explicit synchronization points like mutexes or condition variables).
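For comparison, if a test did need to order events across threads, it
would have to say so explicitly; a minimal sketch with Emacs's own
primitives (assuming lexical binding):

```elisp
;; -*- lexical-binding: t -*-
;; Minimal cross-thread ordering with a condition variable (sketch).
(let* ((mutex (make-mutex))
       (cond-var (make-condition-variable mutex))
       (done nil))
  (make-thread (lambda ()
                 (with-mutex mutex
                   (setq done t)
                   (condition-notify cond-var))))
  ;; Block until the worker thread has signaled `done'.
  (with-mutex mutex
    (while (not done)
      (condition-wait cond-var))))
```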