* parallel test failures
@ 2021-02-19 12:24 David Bremner
2021-02-21 18:21 ` Xu Wang
2021-02-25 19:33 ` Tomi Ollila
0 siblings, 2 replies; 7+ messages in thread
From: David Bremner @ 2021-02-19 12:24 UTC (permalink / raw)
To: notmuch
I have intermittent failures when running the test suite on sufficiently
parallel machines. I have attached a log of such a failing build,
although it does not seem especially illuminating.
It takes anywhere from 5 to 300 runs to get a failure for me running on
60 hardware threads (30 cores). At least on this machine the number of
tests that pass seems consistent at 1205.
[-- Attachment #2: log.xz --]
[-- Type: application/x-xz, Size: 18492 bytes --]
* Re: parallel test failures
From: Xu Wang @ 2021-02-21 18:21 UTC (permalink / raw)
To: David Bremner; +Cc: notmuch
I did not look at the logs, but I have had similar problems in other
scenarios. The way I debugged them was to use strace to get a list of
all the files the tests accessed. From that list I could see that some
files that should have been in separate temp directories were shared
between tests, and the solution was to put the temp files in a separate
directory for each test. Not sure if this is helpful, but I wanted to share.
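
A rough sketch of the strace approach described above, not taken from
the notmuch test suite. It assumes a log written by something like
`strace -f -e trace=openat -o trace.log ./some-test.sh` (the test
script name and the helper name `written_paths` are hypothetical), and
it only post-processes that log to list paths opened for writing:

```shell
# List the unique paths that were opened for writing, reading strace
# output on stdin. Assumes one syscall per line, for example:
#   1234 openat(AT_FDCWD, "/tmp/foo", O_WRONLY|O_CREAT, 0666) = 3
# Hypothetical usage:
#   written_paths < trace.log
written_paths () {
    # Keep only calls whose flags imply writing or creation,
    # then extract the first double-quoted string (the path).
    grep -E 'O_(WRONLY|RDWR|CREAT)' |
        sed -n 's/^[^"]*"\([^"]*\)".*/\1/p' |
        sort -u
}
```

Comparing the output for two different tests would show any path that
both of them write to.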
Kind regards and best of luck,
Xu
On Fri, Feb 19, 2021 at 7:24 AM David Bremner <david@tethera.net> wrote:
>
>
> I have intermittent failures when running the test suite on sufficiently
> parallel machines. I have attached a log of such a failing build,
> although it does not seem especially illuminating.
>
> It takes anywhere from 5 to 300 runs to get a failure for me running on
> 60 hardware threads (30 cores). At least on this machine the number of
> tests that pass seems consistent at 1205
>
* Re: parallel test failures
From: Tomi Ollila @ 2021-02-25 19:33 UTC (permalink / raw)
To: David Bremner, notmuch
On Fri, Feb 19 2021, David Bremner wrote:
> I have intermittent failures when running the test suite on sufficiently
> parallel machines. I have attached a log of such a failing build,
> although it does not seem especially illuminating.
>
> It takes anywhere from 5 to 300 runs to get a failure for me running on
> 60 hardware threads (30 cores). At least on this machine the number of
> tests that pass seems consistent at 1205
I made the following changes to watch file write accesses:
----
diff --git a/test/notmuch-test b/test/notmuch-test
index b58fd3b3..903a5dff 100755
--- a/test/notmuch-test
+++ b/test/notmuch-test
@@ -62,13 +62,16 @@ if test -z "$NOTMUCH_TEST_SERIALIZE" && command -v parallel >/dev/null ; then
META_FAILURE="parallel test suite returned error code $RES"
fi
else
+ rm -rf inw; mkdir inw
for test in $TESTS; do
+ testname=$(basename $test .sh)
+ inotifywait -d --outfile $PWD/inw/inw-$testname -r -e close_write,delete $PWD/test /tmp
$TEST_TIMEOUT_CMD $test "$@" &
wait $!
# If the test failed without producing results, then it aborted,
# so we should abort, too.
RES=$?
+ pkill inotifywa
- testname=$(basename $test .sh)
if [[ $RES != 0 && ! -e "$NOTMUCH_BUILDDIR/test/test-results/$testname" ]]; then
META_FAILURE="Aborting on $testname (returned $RES)"
break
----
Then I ran the tests with NOTMUCH_TEST_SERIALIZE=t
and afterwards ran
for f in inw/*; do echo $f; sed -e 's,.*notmuch/test/, ,' -e '/tmp.T/ s,/.*,,' $f | sort -u; echo; done | less
to examine the "fallout".
Based on that (a few random glances at the listing) I did not see any
potentially overlapping writes, but I did see an unrelated inconsistency
in the test directories.
Anyway, the log.xz did not show any tests failing, only parallel exiting
nonzero, possibly for some other reason. I cannot say. Stracing (even
with --seccomp-bpf) would probably make the failure even less likely to
happen :/
Tomi
* Re: parallel test failures
From: David Bremner @ 2021-02-26 11:49 UTC (permalink / raw)
To: Tomi Ollila, notmuch
Tomi Ollila <tomi.ollila@iki.fi> writes:
>
> Anyway, the log.gz did not show any tests failing but parallel exiting
> nonzero possibly for some other reason. Cannot say. Probably stracing (even
> with --seccomp-bpf) would make it happen even less likely :/
>
Thanks to both of you for your feedback / suggestions. I did read today
that timeout exits with 124 when the time limit is reached. I haven't
investigated further (nor do I know how the time limit could be reached,
since the whole build+test cycle takes about 10s on this machine).
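
For reference, a small sketch of how that exit status can be told
apart from an ordinary failure. It relies on GNU timeout(1) exiting
with 124 when the limit is hit; the helper name `run_with_limit` is
hypothetical and is not part of the notmuch test harness:

```shell
# Run a command under timeout(1) and report whether it hit the limit.
# Hypothetical usage:
#   run_with_limit 2m ./some-test.sh
run_with_limit () {
    limit=$1; shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        # GNU timeout uses 124 to signal that the time limit was reached.
        echo "timed out after $limit"
    else
        echo "exited with status $status"
    fi
    return "$status"
}
```

A command that itself exits with 124 would be misreported, so this is
only a heuristic.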
d
* Re: parallel test failures
From: David Bremner @ 2021-02-27 2:29 UTC (permalink / raw)
To: notmuch
David Bremner <david@tethera.net> writes:
> Tomi Ollila <tomi.ollila@iki.fi> writes:
>
>>
>> Anyway, the log.gz did not show any tests failing but parallel exiting
>> nonzero possibly for some other reason. Cannot say. Probably stracing (even
>> with --seccomp-bpf) would make it happen even less likely :/
>>
>
> Thanks to both of you for your feedback / suggestions. I did read today
> that timeout exits with 124 when the time limit is reached. I haven't
> investigated further (nor do I know how the timelimit should be reached,
> since the whole build+test cycle takes about 10s on this machine).
Maybe a timeout is not so crazy. I ran a couple of trials with
NOTMUCH_TEST_TIMEOUT=0, and it eventually hung (after 6 and 110
repetitions, respectively) in T355-smime, as far as I can tell on the
first test.
I'm currently running some trials to see if I can reproduce that without
parallel execution, but that of course takes longer.
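
A minimal sketch of such a repetition loop, assuming nothing about how
the trials were actually run. The helper name `repeat_until_failure` is
hypothetical:

```shell
# Run a command repeatedly until it fails or a maximum number of runs
# is reached. Prints the run number of the first failure.
# Hypothetical usage:
#   repeat_until_failure 300 make test
repeat_until_failure () {
    max=$1; shift
    i=1
    while [ "$i" -le "$max" ]; do
        # Discard the command's output; only its exit status matters here.
        if ! "$@" >/dev/null 2>&1; then
            echo "failed on run $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no failure in $max runs"
    return 0
}
```

Note that a hung run never returns, so the loop simply stops making
progress at that iteration, which is what one wants when the goal is to
inspect the hang.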
d
* Re: parallel test failures
From: Tomi Ollila @ 2021-02-27 8:22 UTC (permalink / raw)
To: David Bremner, notmuch
On Fri, Feb 26 2021, David Bremner wrote:
> David Bremner <david@tethera.net> writes:
>
>>
>> Thanks to both of you for your feedback / suggestions. I did read today
>> that timeout exits with 124 when the time limit is reached. I haven't
>> investigated further (nor do I know how the timelimit should be reached,
>> since the whole build+test cycle takes about 10s on this machine).
>
> Maybe a timeout is not so crazy. I ran a couple of trials with
> NOTMUCH_TEST_TIMEOUT=0, and it eventually hung (after 6, and 110
> repetitions) in T355-smime, as far as I can tell on the first test.
> I'm currently running some trials to see if I can duplicate that without
> parallel execution, but that of course takes longer.
So, AFAIU, you got 124 because timeout(1) exited with that status (and
killed all parallel(1) executions, after 2 minutes in that case?)...
... and when you set NOTMUCH_TEST_TIMEOUT=0, timeout(1) was not
executed and a test hung (probably T355-smime).
If you manage to get it into the hung state again (without timeout(1)
getting in the way), you can probably peek at things with ps, /proc,
strace, gdb, or some other (potentially more sophisticated ;) tools.
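
As a starting point for the /proc route, a small sketch assuming a
Linux /proc filesystem. The helper name `show_proc_state` is
hypothetical, and the pid of the hung process has to be found first,
for example with ps or pgrep:

```shell
# Print the name, scheduler state and parent pid of a process,
# read from /proc/<pid>/status (Linux only).
# Hypothetical usage:
#   show_proc_state "$(pgrep -o gpgsm)"
show_proc_state () {
    pid=$1
    if [ ! -d "/proc/$pid" ]; then
        echo "no such process: $pid"
        return 1
    fi
    grep -E '^(Name|State|PPid):' "/proc/$pid/status"
}
```

A state of "S (sleeping)" on a process that should be doing work is a
hint that it is blocked waiting on something, at which point strace -p
or gdb -p on that pid can show what it is waiting for.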
>
> d
Tomi
* Re: parallel test failures
From: David Bremner @ 2021-02-27 11:41 UTC (permalink / raw)
To: Tomi Ollila, notmuch; +Cc: Daniel Kahn Gillmor
Tomi Ollila <tomi.ollila@iki.fi> writes:
> So, AFAIU, you got 124 since timeout(1) exited with that status (and
> killed all parallel(1) executions (after 2 minutes in that case?)...
> ... and when you set NOTMUCH_TEST_TIMEOUT=0 then timeout(1) was not
> executed and a test hung (probably T355-smime).
That sounds right.
> In any way you get it again to hung state (w/o using timeout(1) to
> mess around) you probably can peek things with ps, /proc, strace,
> gdb, or with some other (potentially more sophisticated ;) tools.
In fact it looks like I already reported this issue (or a different
issue causing T355 to hang, which seems less likely) at
id:87h7pxiek3.fsf@tethera.net
Past me seems to have thought it was some kind of gpgsm failure. I would
welcome input from people who use or understand gpgsm.
d