From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by arlo.cworth.org (Postfix) with ESMTP id 0B3E66DE145F for ; Mon, 20 May 2019 16:49:08 -0700 (PDT) X-Virus-Scanned: Debian amavisd-new at cworth.org X-Spam-Flag: NO X-Spam-Score: -0.171 X-Spam-Level: X-Spam-Status: No, score=-0.171 tagged_above=-999 required=5 tests=[AWL=0.030, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_NONE=-0.0001, SPF_PASS=-0.001] autolearn=disabled Received: from arlo.cworth.org ([127.0.0.1]) by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id GXQgCKvO-yt9 for ; Mon, 20 May 2019 16:49:06 -0700 (PDT) Received: from che.mayfirst.org (che.mayfirst.org [162.247.75.118]) by arlo.cworth.org (Postfix) with ESMTPS id A5FFD6DE1457 for ; Mon, 20 May 2019 16:49:06 -0700 (PDT) DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/simple; d=fifthhorseman.net; i=@fifthhorseman.net; q=dns/txt; s=2019; t=1558396145; h=from : to : subject : in-reply-to : references : date : message-id : mime-version : content-type : from; bh=yaTzYO3CQ4KnV1Wd8MyG58EFs6TcI0fh0WeYWKMUP64=; b=JTo6HNfCmQIRH+AEcRKClaODLcD0RtWeg1c8M1wtXBujWLXLEj+bMfQd zPkJJRjRGZjtKhjZbX6ZrOO7X2c6Ag== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=fifthhorseman.net; i=@fifthhorseman.net; q=dns/txt; s=2019rsa; t=1558396145; h=from : to : subject : in-reply-to : references : date : message-id : mime-version : content-type : from; bh=yaTzYO3CQ4KnV1Wd8MyG58EFs6TcI0fh0WeYWKMUP64=; b=Dkxsd55TrJLJi96Vk2MCZTTZHNc5mF6CLPJ7DuFvn5A5RNJXam5yB5hu wCy6dXmkogCiQiSqdjjVmHSdZiMTUntWEp1gL2EbmmfANqZFK+n6A043nK oP98c337z0cylqIUlrgL1MQocrp+Y9y7V7xp1509vFCCkQx55w+6FieTwa GzEF7s3I1h4nh23+M56Y76UiAdHXcAcS/t5Av4aSxleqxikjDEN5w2DpqH YPNlxOe/kIWPk1mL9jiaV1XQM5LTIiqlI4TzkgXrfopyOWhm0lH9tMqmgw ZL4JhJWlMhtO/OZexg9bRHBRHI3vmjkk/RkX4Wjdlfm096BrjuJuUg== Received: from fifthhorseman.net (unknown [38.109.115.130]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by che.mayfirst.org (Postfix) with ESMTPSA id CFD8AF99D; Mon, 20 May 2019 19:49:05 -0400 (EDT) Received: by fifthhorseman.net (Postfix, from userid 1000) id AABA4201C0; Mon, 20 May 2019 19:49:03 -0400 (EDT) From: Daniel Kahn Gillmor To: David Bremner , notmuch@notmuchmail.org Subject: Re: parallel tests broken on Debian stable In-Reply-To: <878sv1avfc.fsf@fifthhorseman.net> References: <87ef56io5d.fsf@tethera.net> <878sv1avfc.fsf@fifthhorseman.net> Autocrypt: addr=dkg@fifthhorseman.net; prefer-encrypt=mutual; keydata= mDMEXEK/AhYJKwYBBAHaRw8BAQdAr/gSROcn+6m8ijTN0DV9AahoHGafy52RRkhCZVwxhEe0K0Rh bmllbCBLYWhuIEdpbGxtb3IgPGRrZ0BmaWZ0aGhvcnNlbWFuLm5ldD6ImQQTFggAQQIbAQUJA8Jn AAULCQgHAgYVCgkICwIEFgIDAQIeAQIXgBYhBMS8Lds4zOlkhevpwvIGkReQOOXGBQJcQsbzAhkB AAoJEPIGkReQOOXG4fkBAO1joRxqAZY57PjdzGieXLpluk9RkWa3ufkt3YUVEpH/AP9c+pgIxtyW +FwMQRjlqljuj8amdN4zuEqaCy4hhz/1DbgzBFxCv4sWCSsGAQQB2kcPAQEHQERSZxSPmgtdw6nN u7uxY7bzb9TnPrGAOp9kClBLRwGfiPUEGBYIACYWIQTEvC3bOMzpZIXr6cLyBpEXkDjlxgUCXEK/ iwIbAgUJAeEzgACBCRDyBpEXkDjlxnYgBBkWCAAdFiEEyQ5tNiAKG5IqFQnndhgZZSmuX/gFAlxC v4sACgkQdhgZZSmuX/iVWgD/fCU4ONzgy8w8UCHGmrmIZfDvdhg512NIBfx+Mz9ls5kA/Rq97vz4 z48MFuBdCuu0W/fVqVjnY7LN5n+CQJwGC0MIA7QA/RyY7Sz2gFIOcrns0RpoHr+3WI+won3xCD8+ sVXSHZvCAP98HCjDnw/b0lGuCR7coTXKLIM44/LFWgXAdZjm1wjODbg4BFxCv50SCisGAQQBl1UB BQEBB0BG4iXnHX/fs35NWKMWQTQoRI7oiAUt0wJHFFJbomxXbAMBCAeIfgQYFggAJhYhBMS8Lds4 zOlkhevpwvIGkReQOOXGBQJcQr+dAhsMBQkB4TOAAAoJEPIGkReQOOXGe/cBAPlek5d9xzcXUn/D kY6jKmxe26CTws3ZkbK6Aa5Ey/qKAP0VuPQSCRxA7RKfcB/XrEphfUFkraL06Xn/xGwJ+D0hCw== Date: Mon, 20 May 2019 19:49:02 -0400 Message-ID: <87pnocadqp.fsf@fifthhorseman.net> MIME-Version: 1.0 Content-Type: multipart/signed; boundary="=-=-="; micalg=pgp-sha512; protocol="application/pgp-signature" X-BeenThere: notmuch@notmuchmail.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: "Use and development of the notmuch mail system." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 20 May 2019 23:49:08 -0000 --=-=-= Content-Type: text/plain Content-Transfer-Encoding: quoted-printable On Mon 2019-05-20 13:27:03 -0400, Daniel Kahn Gillmor wrote: > c) we should avoid the timeout hanging :) I dug into this today, and i'm reporting back my findings. I have what appears to be a fix (see below), but i don't understand it, so i'm not advocating for it. To be clear: my two test cases are two KVM instances, one running stretch (debian stable) and one running sid (debian unstable). both systems have 4 virtual CPUs (on a hardware platform that has 4 cores). The two VMs are otherwise similarly configured. Both have moreutils installed, and GNU parallel is not installed. on the stretch system, i can achieve this hang/failure with a simple "make -j4 check". on the "sid" system, i do not see the failure. When i disable the use of timeout entirely (with NOTMUCH_TEST_TIMEOUT=3D0, see id:20190520232535.4904-1-dkg@fifthhorseman.net), the problem goes away on the stretch system. When i inspect the state of the debian stretch system when the tests are hanging, i see this (from "ps auwx"): =2D----------------------- USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND dkg 7980 1.8 0.5 10228 4348 pts/1 S+ 18:49 0:00 make -j4 c= heck dkg 8001 0.0 0.3 11228 2884 pts/1 S+ 18:49 0:00 bash /home= /dkg/src/notmuch/notmuch/test/notmuch-test dkg 8011 0.0 0.1 10092 804 pts/1 S 18:49 0:00 timeout 2m= parallel -- /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src= /notmuch/notmuch/test/T010-help-test.sh /home/dkg/ dkg 8012 0.0 0.0 4168 744 pts/1 T 18:49 0:00 parallel -= - /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/no= tmuch/test/T010-help-test.sh /home/dkg/src/notmuch dkg 8013 0.0 0.0 4168 96 pts/1 T 18:49 0:00 parallel -= - /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/no= tmuch/test/T010-help-test.sh /home/dkg/src/notmuch dkg 8014 0.0 0.0 4168 96 pts/1 T 18:49 0:00 parallel -= - /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/no= tmuch/test/T010-help-test.sh /home/dkg/src/notmuch dkg 8267 0.0 0.0 4168 96 pts/1 T 18:49 0:00 parallel -= - /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/no= tmuch/test/T010-help-test.sh /home/dkg/src/notmuch dkg 8268 0.0 0.0 4276 732 pts/1 T 18:49 0:00 sh -c /hom= e/dkg/src/notmuch/notmuch/test/T050-new.sh dkg 8270 2.7 0.4 11772 3744 pts/1 T 18:49 0:00 bash /home= /dkg/src/notmuch/notmuch/test/T050-new.sh dkg 8320 0.0 0.0 4168 96 pts/1 T 18:49 0:00 parallel -= - /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/no= tmuch/test/T010-help-test.sh /home/dkg/src/notmuch dkg 8321 0.0 0.0 4276 748 pts/1 T 18:49 0:00 sh -c /hom= e/dkg/src/notmuch/notmuch/test/T060-count.sh dkg 8322 0.7 0.4 11752 3556 pts/1 T 18:49 0:00 bash /home= /dkg/src/notmuch/notmuch/test/T060-count.sh dkg 8345 0.0 0.0 4168 96 pts/1 T 18:49 0:00 parallel -= - /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/no= tmuch/test/T010-help-test.sh /home/dkg/src/notmuch dkg 8346 0.0 0.0 4276 744 pts/1 T 18:49 0:00 sh -c /hom= e/dkg/src/notmuch/notmuch/test/T070-insert.sh dkg 8347 1.7 0.4 11772 3764 pts/1 T 18:49 0:00 bash /home= /dkg/src/notmuch/notmuch/test/T070-insert.sh dkg 8425 0.0 0.0 4168 96 pts/1 T 18:49 0:00 parallel -= - /home/dkg/src/notmuch/notmuch/test/T000-basic.sh /home/dkg/src/notmuch/no= tmuch/test/T010-help-test.sh /home/dkg/src/notmuch dkg 8426 0.0 0.0 4276 740 pts/1 T 18:49 0:00 sh -c /hom= e/dkg/src/notmuch/notmuch/test/T080-search.sh dkg 8427 1.5 0.4 11752 3664 pts/1 T 18:49 0:00 bash /home= /dkg/src/notmuch/notmuch/test/T080-search.sh dkg 8763 4.7 2.9 73960 22708 pts/1 T 18:49 0:00 gdb --batc= h-silent --return-child-result -x count-files.gdb --args notmuch count --ou= tput=3Dfiles * dkg 8914 0.0 0.8 68508 6228 pts/1 T 18:49 0:00 notmuch se= arch --format=3Dtext0 --output=3Dfiles --offset=3D1 --limit=3D1 * dkg 8915 0.0 0.1 4484 1164 pts/1 T 18:49 0:00 xargs -0 -= I {} mv {} /home/dkg/src/notmuch/notmuch/test/tmp.T050-new/mail/moved_messa= ges dkg 8916 0.0 0.5 68244 3824 pts/1 T 18:49 0:00 notmuch in= sert --folder=3DDrafts +draft -unread dkg 8919 0.0 0.0 13012 704 pts/1 T 18:49 0:00 notmuch new dkg 8920 0.0 0.0 1412 4 pts/1 t 18:49 0:00 /bin/bash = -c exec /home/dkg/src/notmuch/notmuch/notmuch count --output=3Dfiles \* =2D----------------------- As you can see in the "STAT" column, nearly all of the hanging processes are marked with T ("stopped by job control signal" according to ps(1)). I also note that "t" means "stopped by debugger during the tracing" -- maybe that final line (with "t") is the special one that triggers this? i don't know. When i try to connect to any of these stopped processes with "strace -p $PID", strace reports: strace: Process 4204 attached --- stopped by SIGTTOU --- SIGTTOU is novel to me, and i don't really understand why the test suite would have this problem. Skimming this guidance: http://curiousthing.org/sigttin-sigttou-deep-dive-linux suggested that maybe if i just decoupled the processes from the terminal "enough" i could get away with a functioning test suite. redirecting all of stdin, stdout, stderr to /dev/null worked! then i tried pruning out different pieces, and found that all i needed to do was to redirect stdin from /dev/null and the test suite would run without problems in parallel with moreutils parallel. (it also works with GNU parallel, and if i run the tests serially). So the patch below is a "fix" but it's not a principled one. the source for moreutils parallel.c doesn't appear to have changed at all between stretch and buster. I tried upgrading the version of moreutils on this stretch system from 0.60-1 to 0.62-1, and i was able to reproduce the same problem. So i don't believe the problem is with moreutils. Some things that might be different between debian stable (stretch) and testing (buster): package provides stretch buster ------- -------- ------- ------ GNU coreutils /usr/bin/timeout 8.26-3 8.30-3 GNU bash /bin/bash 4.4-5 5.0-4 GNU dash /bin/sh (via symlink) 0.5.8-2.4 0.5.10.2-5 Linux the kernel 4.9.168-1+deb9u2 4.19.37-3 GNU gdb /usr/bin/gdb 7.12-6 8.2.1-2 I also tried changing the symlink for /bin/sh to point to bash instead of dash, and was still able to replicate the problem, so i suspect dash is not the culprit. However, i tried selectively upgrading all the versions of all of these packages *except for gdb* to the version in buster (or to the version from backports, in the case of the kernel). and i'm *still* seeing the problem on the stretch system. So perhaps it's some interaction between timeout and gdb? I haven't managed to test that particular combination yet. I hope someone else will look into this further, as i'm out of my depth. --dkg diff --git a/test/Makefile.local b/test/Makefile.local index 47244e8f..3a57b6be 100644 =2D-- a/test/Makefile.local +++ b/test/Makefile.local @@ -66,13 +66,13 @@ test-binaries: $(TEST_BINARIES) test: all test-binaries ifeq ($V,) @echo 'Use "$(MAKE) V=3D1" to see the details for passing and known= broken tests.' =2D @env NOTMUCH_TEST_QUIET=3D1 $(NOTMUCH_SRCDIR)/$(test_src_dir)/not= much-test $(OPTIONS) + @env NOTMUCH_TEST_QUIET=3D1 $(NOTMUCH_SRCDIR)/$(test_src_dir)/notmu= ch-test $(OPTIONS)