From: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
To: David Bremner <david@tethera.net>, notmuch@notmuchmail.org
Subject: Re: parallel tests broken on Debian stable
Date: Mon, 20 May 2019 19:49:02 -0400 [thread overview]
Message-ID: <87pnocadqp.fsf@fifthhorseman.net> (raw)
In-Reply-To: <878sv1avfc.fsf@fifthhorseman.net>
[-- Attachment #1: Type: text/plain, Size: 8409 bytes --]
On Mon 2019-05-20 13:27:03 -0400, Daniel Kahn Gillmor wrote:
> c) we should avoid the timeout hanging :)
I dug into this today, and i'm reporting back my findings.
I have what appears to be a fix (see below), but i don't understand it,
so i'm not advocating for it.
To be clear: my two test cases are two KVM instances, one running
stretch (debian stable) and one running sid (debian unstable). both
systems have 4 virtual CPUs (on a hardware platform that has 4 cores).
The two VMs are otherwise similarly configured. Both have moreutils
installed, and GNU parallel is not installed.
on the stretch system, i can achieve this hang/failure with a simple
"make -j4 check". on the "sid" system, i do not see the failure.
When i disable the use of timeout entirely (with NOTMUCH_TEST_TIMEOUT=0,
see id:20190520232535.4904-1-dkg@fifthhorseman.net), the problem goes
away on the stretch system.
When i inspect the state of the debian stretch system when the tests are
hanging, i see this (from "ps auwx"):
------------------------
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
dkg 7980 1.8 0.5 10228 4348 pts/1 S+ 18:49 0:00 make -j4 check
dkg 8001 0.0 0.3 11228 2884 pts/1 S+ 18:49 0:00 bash /home/dkg/src/notmuch/notmuch/test/notmuch-test
dkg 8011 0.0 0.1 10092 804 pts/1 S 18:49 0:00 timeout 2m parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/
dkg 8012 0.0 0.0 4168 744 pts/1 T 18:49 0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg 8013 0.0 0.0 4168 96 pts/1 T 18:49 0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg 8014 0.0 0.0 4168 96 pts/1 T 18:49 0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg 8267 0.0 0.0 4168 96 pts/1 T 18:49 0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg 8268 0.0 0.0 4276 732 pts/1 T 18:49 0:00 sh -c /home/dkg/src/notmuch/notmuch/test/T050-new.sh
dkg 8270 2.7 0.4 11772 3744 pts/1 T 18:49 0:00 bash /home/dkg/src/notmuch/notmuch/test/T050-new.sh
dkg 8320 0.0 0.0 4168 96 pts/1 T 18:49 0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg 8321 0.0 0.0 4276 748 pts/1 T 18:49 0:00 sh -c /home/dkg/src/notmuch/notmuch/test/T060-count.sh
dkg 8322 0.7 0.4 11752 3556 pts/1 T 18:49 0:00 bash /home/dkg/src/notmuch/notmuch/test/T060-count.sh
dkg 8345 0.0 0.0 4168 96 pts/1 T 18:49 0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg 8346 0.0 0.0 4276 744 pts/1 T 18:49 0:00 sh -c /home/dkg/src/notmuch/notmuch/test/T070-insert.sh
dkg 8347 1.7 0.4 11772 3764 pts/1 T 18:49 0:00 bash /home/dkg/src/notmuch/notmuch/test/T070-insert.sh
dkg 8425 0.0 0.0 4168 96 pts/1 T 18:49 0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch
dkg 8426 0.0 0.0 4276 740 pts/1 T 18:49 0:00 sh -c /home/dkg/src/notmuch/notmuch/test/T080-search.sh
dkg 8427 1.5 0.4 11752 3664 pts/1 T 18:49 0:00 bash /home/dkg/src/notmuch/notmuch/test/T080-search.sh
dkg 8763 4.7 2.9 73960 22708 pts/1 T 18:49 0:00 gdb --batch-silent --return-child-result -x count-files.gdb --args notmuch count --output=files *
dkg 8914 0.0 0.8 68508 6228 pts/1 T 18:49 0:00 notmuch search --format=text0 --output=files --offset=1 --limit=1 *
dkg 8915 0.0 0.1 4484 1164 pts/1 T 18:49 0:00 xargs -0 -I {} mv {} /home/dkg/src/notmuch/notmuch/test/tmp.T050-new/mail/moved_messages
dkg 8916 0.0 0.5 68244 3824 pts/1 T 18:49 0:00 notmuch insert --folder=Drafts +draft -unread
dkg 8919 0.0 0.0 13012 704 pts/1 T 18:49 0:00 notmuch new
dkg 8920 0.0 0.0 1412 4 pts/1 t 18:49 0:00 /bin/bash -c exec /home/dkg/src/notmuch/notmuch/notmuch count --output=files \*
------------------------
As you can see in the "STAT" column, nearly all of the hanging processes
are marked with T ("stopped by job control signal" according to ps(1)).
I also note that "t" means "stopped by debugger during the tracing" --
maybe that final line (with "t") is the special one that triggers this?
i don't know.
When i try to connect to any of these stopped processes with "strace -p
$PID", strace reports:
strace: Process 4204 attached
--- stopped by SIGTTOU ---
SIGTTOU is novel to me, and i don't really understand why the test suite
would have this problem. Skimming this guidance:
http://curiousthing.org/sigttin-sigttou-deep-dive-linux
suggested that maybe if i just decoupled the processes from the terminal
"enough" i could get away with a functioning test suite. redirecting
all of stdin, stdout, stderr to /dev/null worked! then i tried pruning
out different pieces, and found that all i needed to do was to redirect
stdin from /dev/null and the test suite would run without problems in
parallel with moreutils parallel. (it also works with GNU parallel, and
if i run the tests serially).
So the patch below is a "fix" but it's not a principled one.
the source for moreutils parallel.c doesn't appear to have changed at
all between stretch and buster. I tried upgrading the version of
moreutils on this stretch system from 0.60-1 to 0.62-1, and i was able
to reproduce the same problem. So i don't believe the problem is with
moreutils.
Some things that might be different between debian stable (stretch) and
testing (buster):
package provides stretch buster
------- -------- ------- ------
GNU coreutils /usr/bin/timeout 8.26-3 8.30-3
GNU bash /bin/bash 4.4-5 5.0-4
GNU dash /bin/sh (via symlink) 0.5.8-2.4 0.5.10.2-5
Linux the kernel 4.9.168-1+deb9u2 4.19.37-3
GNU gdb /usr/bin/gdb 7.12-6 8.2.1-2
I also tried changing the symlink for /bin/sh to point to bash instead
of dash, and was still able to replicate the problem, so i suspect dash
is not the culprit.
However, i tried selectively upgrading all the versions of all of these
packages *except for gdb* to the version in buster (or to the version
from backports, in the case of the kernel). and i'm *still* seeing the
problem on the stretch system.
So perhaps it's some interaction between timeout and gdb? I haven't
managed to test that particular combination yet.
I hope someone else will look into this further, as i'm out of my depth.
--dkg
diff --git a/test/Makefile.local b/test/Makefile.local
index 47244e8f..3a57b6be 100644
--- a/test/Makefile.local
+++ b/test/Makefile.local
@@ -66,13 +66,13 @@ test-binaries: $(TEST_BINARIES)
test: all test-binaries
ifeq ($V,)
@echo 'Use "$(MAKE) V=1" to see the details for passing and known broken tests.'
- @env NOTMUCH_TEST_QUIET=1 $(NOTMUCH_SRCDIR)/$(test_src_dir)/notmuch-test $(OPTIONS)
+ @env NOTMUCH_TEST_QUIET=1 $(NOTMUCH_SRCDIR)/$(test_src_dir)/notmuch-test $(OPTIONS) </dev/null
else
# The user has explicitly enabled quiet execution.
ifeq ($V,0)
- @env NOTMUCH_TEST_QUIET=1 $(NOTMUCH_SRCDIR)/$(test_src_dir)/notmuch-test $(OPTIONS)
+ @env NOTMUCH_TEST_QUIET=1 $(NOTMUCH_SRCDIR)/$(test_src_dir)/notmuch-test $(OPTIONS) </dev/null
else
- @$(NOTMUCH_SRCDIR)/$(test_src_dir)/notmuch-test $(OPTIONS)
+ @$(NOTMUCH_SRCDIR)/$(test_src_dir)/notmuch-test $(OPTIONS) </dev/null
endif
endif
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 227 bytes --]
next prev parent reply other threads:[~2019-05-20 23:49 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-05-10 10:45 parallel tests broken on Debian stable David Bremner
2019-05-20 17:27 ` Daniel Kahn Gillmor
2019-05-20 18:55 ` Tomi Ollila
2019-05-20 21:45 ` Daniel Kahn Gillmor
2019-05-20 23:49 ` Daniel Kahn Gillmor [this message]
2019-05-21 0:44 ` Daniel Kahn Gillmor
2019-05-21 1:03 ` [PATCH] test: avoid hanging older gdb under GNU timeout and moreutils parallel Daniel Kahn Gillmor
2019-05-21 2:32 ` David Bremner
2019-05-21 15:45 ` Daniel Kahn Gillmor
2019-05-21 6:12 ` Tomi Ollila
2019-05-21 15:50 ` Daniel Kahn Gillmor
2019-05-21 20:23 ` Tomi Ollila
2021-10-30 19:30 ` parallel tests broken on Debian stable David Bremner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: https://notmuchmail.org/
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87pnocadqp.fsf@fifthhorseman.net \
--to=dkg@fifthhorseman.net \
--cc=david@tethera.net \
--cc=notmuch@notmuchmail.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://yhetil.org/notmuch.git/
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).