On Mon 2019-05-20 13:27:03 -0400, Daniel Kahn Gillmor wrote: > c) we should avoid the timeout hanging :) I dug into this today, and i'm reporting back my findings. I have what appears to be a fix (see below), but i don't understand it, so i'm not advocating for it. To be clear: my two test cases are two KVM instances, one running stretch (debian stable) and one running sid (debian unstable). both systems have 4 virtual CPUs (on a hardware platform that has 4 cores). The two VMs are otherwise similarly configured. Both have moreutils installed, and GNU parallel is not installed. on the stretch system, i can achieve this hang/failure with a simple "make -j4 check". on the "sid" system, i do not see the failure. When i disable the use of timeout entirely (with NOTMUCH_TEST_TIMEOUT=0, see id:20190520232535.4904-1-dkg@fifthhorseman.net), the problem goes away on the stretch system. When i inspect the state of the debian stretch system when the tests are hanging, i see this (from "ps auwx"): ------------------------ USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND dkg 7980 1.8 0.5 10228 4348 pts/1 S+ 18:49 0:00 make -j4 check dkg 8001 0.0 0.3 11228 2884 pts/1 S+ 18:49 0:00 bash /home/dkg/src/notmuch/notmuch/test/notmuch-test dkg 8011 0.0 0.1 10092 804 pts/1 S 18:49 0:00 timeout 2m parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/ dkg 8012 0.0 0.0 4168 744 pts/1 T 18:49 0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch dkg 8013 0.0 0.0 4168 96 pts/1 T 18:49 0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch dkg 8014 0.0 0.0 4168 96 pts/1 T 18:49 0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch dkg 8267 0.0 0.0 4168 96 pts/1 T 18:49 0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch dkg 8268 0.0 0.0 4276 732 pts/1 T 18:49 0:00 sh -c /home/dkg/src/notmuch/notmuch/test/T050-new.sh dkg 8270 2.7 0.4 11772 3744 pts/1 T 18:49 0:00 bash /home/dkg/src/notmuch/notmuch/test/T050-new.sh dkg 8320 0.0 0.0 4168 96 pts/1 T 18:49 0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch dkg 8321 0.0 0.0 4276 748 pts/1 T 18:49 0:00 sh -c /home/dkg/src/notmuch/notmuch/test/T060-count.sh dkg 8322 0.7 0.4 11752 3556 pts/1 T 18:49 0:00 bash /home/dkg/src/notmuch/notmuch/test/T060-count.sh dkg 8345 0.0 0.0 4168 96 pts/1 T 18:49 0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch dkg 8346 0.0 0.0 4276 744 pts/1 T 18:49 0:00 sh -c /home/dkg/src/notmuch/notmuch/test/T070-insert.sh dkg 8347 1.7 0.4 11772 3764 pts/1 T 18:49 0:00 bash /home/dkg/src/notmuch/notmuch/test/T070-insert.sh dkg 8425 0.0 0.0 4168 96 pts/1 T 18:49 0:00 parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/notmuch/test/T010-help-test.sh /home/dkg/src/notmuch dkg 8426 0.0 0.0 4276 740 pts/1 T 18:49 0:00 sh -c /home/dkg/src/notmuch/notmuch/test/T080-search.sh dkg 8427 1.5 0.4 11752 3664 pts/1 T 18:49 0:00 bash /home/dkg/src/notmuch/notmuch/test/T080-search.sh dkg 8763 4.7 2.9 73960 22708 pts/1 T 18:49 0:00 gdb --batch-silent --return-child-result -x count-files.gdb --args notmuch count --output=files * dkg 8914 0.0 0.8 68508 6228 pts/1 T 18:49 0:00 notmuch search --format=text0 --output=files --offset=1 --limit=1 * dkg 8915 0.0 0.1 4484 1164 pts/1 T 18:49 0:00 xargs -0 -I {} mv {} /home/dkg/src/notmuch/notmuch/test/tmp.T050-new/mail/moved_messages dkg 8916 0.0 0.5 68244 3824 pts/1 T 18:49 0:00 notmuch insert --folder=Drafts +draft -unread dkg 8919 0.0 0.0 13012 704 pts/1 T 18:49 0:00 notmuch new dkg 8920 0.0 0.0 1412 4 pts/1 t 18:49 0:00 /bin/bash -c exec /home/dkg/src/notmuch/notmuch/notmuch count --output=files \* ------------------------ As you can see in the "STAT" column, nearly all of the hanging processes are marked with T ("stopped by job control signal" according to ps(1)). I also note that "t" means "stopped by debugger during the tracing" -- maybe that final line (with "t") is the special one that triggers this? i don't know. When i try to connect to any of these stopped processes with "strace -p $PID", strace reports: strace: Process 4204 attached --- stopped by SIGTTOU --- SIGTTOU is novel to me, and i don't really understand why the test suite would have this problem. Skimming this guidance: http://curiousthing.org/sigttin-sigttou-deep-dive-linux suggested that maybe if i just decoupled the processes from the terminal "enough" i could get away with a functioning test suite. redirecting all of stdin, stdout, stderr to /dev/null worked! then i tried pruning out different pieces, and found that all i needed to do was to redirect stdin from /dev/null and the test suite would run without problems in parallel with moreutils parallel. (it also works with GNU parallel, and if i run the tests serially). So the patch below is a "fix" but it's not a principled one. the source for moreutils parallel.c doesn't appear to have changed at all between stretch and buster. I tried upgrading the version of moreutils on this stretch system from 0.60-1 to 0.62-1, and i was able to reproduce the same problem. So i don't believe the problem is with moreutils. Some things that might be different between debian stable (stretch) and testing (buster): package provides stretch buster ------- -------- ------- ------ GNU coreutils /usr/bin/timeout 8.26-3 8.30-3 GNU bash /bin/bash 4.4-5 5.0-4 GNU dash /bin/sh (via symlink) 0.5.8-2.4 0.5.10.2-5 Linux the kernel 4.9.168-1+deb9u2 4.19.37-3 GNU gdb /usr/bin/gdb 7.12-6 8.2.1-2 I also tried changing the symlink for /bin/sh to point to bash instead of dash, and was still able to replicate the problem, so i suspect dash is not the culprit. However, i tried selectively upgrading all the versions of all of these packages *except for gdb* to the version in buster (or to the version from backports, in the case of the kernel). and i'm *still* seeing the problem on the stretch system. So perhaps it's some interaction between timeout and gdb? I haven't managed to test that particular combination yet. I hope someone else will look into this further, as i'm out of my depth. --dkg diff --git a/test/Makefile.local b/test/Makefile.local index 47244e8f..3a57b6be 100644 --- a/test/Makefile.local +++ b/test/Makefile.local @@ -66,13 +66,13 @@ test-binaries: $(TEST_BINARIES) test: all test-binaries ifeq ($V,) @echo 'Use "$(MAKE) V=1" to see the details for passing and known broken tests.' - @env NOTMUCH_TEST_QUIET=1 $(NOTMUCH_SRCDIR)/$(test_src_dir)/notmuch-test $(OPTIONS) + @env NOTMUCH_TEST_QUIET=1 $(NOTMUCH_SRCDIR)/$(test_src_dir)/notmuch-test $(OPTIONS)