* bug#35181: Hydra offloads often get stuck while exporting build requisites
2019-04-08 8:19 ` Ludovic Courtès
@ 2019-04-08 19:40 ` Mark H Weaver
2019-04-09 10:54 ` Ludovic Courtès
2019-04-09 1:06 ` Mark H Weaver
1 sibling, 1 reply; 12+ messages in thread
From: Mark H Weaver @ 2019-04-08 19:40 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: 35181
Hi Ludovic,
Ludovic Courtès <ludo@gnu.org> writes:
> Mark H Weaver <mhw@netris.org> skribis:
>
>> The source checkout currently being transferred for build 3432472
>> (/gnu/store/…-font-google-material-design-icons-3.0.1-checkout) is 176
>> megabytes uncompressed, as measured by "du -s --si", which is not
>> precisely same as NAR size, but hopefully close enough for a rough
>> estimate. As I write this, build 3432472 been stuck here for 24 hours
>> 15 minutes. Even if the average transfer rate were 4 kilobytes per
>> second, it should have been done in half that time.
>
> This is weird, could it be that data transfers get stuck somehow?
As far as I can tell, that's what seems to happen.
> Did you try to check the status of the ‘nix-store’ and ‘guix offload’
> processes on the head node?
Here are the corresponding 'guix offload' processes:
--8<---------------cut here---------------start------------->8---
hydra@20121227-hydra:~$ ps auxwwf | head -1; ps auxwwf | egrep -B1 'off()load'
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 8984 0.0 0.0 30784 2248 ? Ss Apr07 0:00 | \_ /root/.guix-profile/bin/guix-daemon 8983 --max-jobs=1 --build-users-group=guixbuild --disable-log-compression --gc-keep-outputs --gc-keep-derivations --no-substitutes --cache-failures
root 8985 0.0 0.2 145532 13976 ? SLsl Apr07 0:10 | | \_ /gnu/store/yihvhxv3xyyvl1m2cy1lnf1lyi9h76fk-guile-2.2.2/bin/guile --no-auto-compile /gnu/store/fkkjhida23k612naa9d4q6avqj5v3b28-guix-0.13.0-8.357ab93/bin/.guix-real offload x86_64-linux 3600 1 72000
--
root 14768 0.0 0.0 30752 2356 ? Ss Apr07 0:00 | \_ /root/.guix-profile/bin/guix-daemon 14767 --max-jobs=1 --build-users-group=guixbuild --disable-log-compression --gc-keep-outputs --gc-keep-derivations --no-substitutes --cache-failures
root 14769 0.0 0.2 145668 10912 ? SLsl Apr07 0:16 | | \_ /gnu/store/yihvhxv3xyyvl1m2cy1lnf1lyi9h76fk-guile-2.2.2/bin/guile --no-auto-compile /gnu/store/fkkjhida23k612naa9d4q6avqj5v3b28-guix-0.13.0-8.357ab93/bin/.guix-real offload x86_64-linux 3600 1 72000
--8<---------------cut here---------------end--------------->8---
I tried attaching to both of these offload processes with 'strace', and
waited for several minutes for any system call activity. Both of them
are stuck sleeping within a system call, although I don't yet know which
system call:
--8<---------------cut here---------------start------------->8---
root@20121227-hydra:~# strace -p 8985
Process 8985 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>^C <unfinished ...>
Process 8985 detached
root@20121227-hydra:~# strace -p 14769
Process 14769 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>^C <unfinished ...>
Process 14769 detached
--8<---------------cut here---------------end--------------->8---
Here are the 'nix-store' processes:
--8<---------------cut here---------------start------------->8---
hydra@20121227-hydra:~$ ps auxwwf | head -1; ps auxwwf | egrep -A1 'hydra-()build'
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
hydra 8980 0.0 0.9 187332 46656 pts/5 S Apr07 0:01 | | \_ /usr/local/bin/perl -w /usr/local/bin/hydra-build 3432472
hydra 8983 0.0 0.0 34228 464 pts/5 S Apr07 0:00 | | | \_ nix-store --realise /gnu/store/5ivay4l7bn0sqsi7k53j4qv3kndrby17-font-google-material-design-icons-3.0.1.drv --timeout 72000 --max-silent-time 3600 --option build-max-log-size 67108864 --keep-going --fallback --no-build-output --log-type flat --print-build-trace --add-root /nix/var/nix/gcroots/per-user/hydra/hydra-roots/y61f3cdhx31msdhkdw0kfs5pb75ycgfq-font-google-material-design-icons-3.0.1
hydra 14764 0.0 0.9 187336 46576 pts/5 S Apr07 0:01 | | \_ /usr/local/bin/perl -w /usr/local/bin/hydra-build 3432052
hydra 14767 0.0 0.0 34228 352 pts/5 S Apr07 0:00 | | \_ nix-store --realise /gnu/store/k27i3lkb38gr3mw0mridymhik3qsg6w7-font-fira-sans-4.202.drv --timeout 72000 --max-silent-time 3600 --option build-max-log-size 67108864 --keep-going --fallback --no-build-output --log-type flat --print-build-trace --add-root /nix/var/nix/gcroots/per-user/hydra/hydra-roots/28ncfjmplcwyzas2p3d4cy5xlzacjcnj-font-fira-sans-4.202
--8<---------------cut here---------------end--------------->8---
The 'nix-store' processes seem to be stuck sleeping in 'read', if I'm
interpreting the 'strace' output correctly:
--8<---------------cut here---------------start------------->8---
root@20121227-hydra:~# strace -p 8983
Process 8983 attached - interrupt to quit
read(3, ^C <unfinished ...>
Process 8983 detached
root@20121227-hydra:~# strace -p 14767
Process 14767 attached - interrupt to quit
read(3, ^C <unfinished ...>
Process 14767 detached
--8<---------------cut here---------------end--------------->8---
"netstat --inet --program" shows that the SSH connections are still
open:
--8<---------------cut here---------------start------------->8---
root@20121227-hydra:~# netstat --inet --program | grep 'hydra\.net\.in\.tum\.'
tcp 0 0 20121227-hydra.gn:53216 hydra.net.in.tum.de:ssh ESTABLISHED 14769/guile
tcp 0 0 20121227-hydra.gn:52434 hydra.net.in.tum.de:ssh ESTABLISHED 8985/guile
tcp 0 0 20121227-hydra.gnu.:www hydra.net.in.tum.:52104 TIME_WAIT -
tcp 0 0 20121227-hydra.gnu.:www hydra.net.in.tum.:52103 TIME_WAIT -
--8<---------------cut here---------------end--------------->8---
> I just checked and apparently this package builds fine on berlin.
Also, these same jobs (e.g. same versions) have been built successfully
on hydra.gnunet.org for years without difficulty. In the case of
'font-fira-sans-4.202.x86_64-linux', it has only ever been built on
hydra.gnunet.org, with the last successful build on 28 September 2018.
The packages haven't been updated in years, and so typically are only
rebuilt during 'core-updates' cycles. They only started aborting in
late March, when some other rarely-update package apparently changed to
force a rebuild. However, the similar 'mozjs-60' failures happened
earlier.
FYI, here's the history of build attempts on Hydra:
--8<---------------cut here---------------start------------->8---
hydra=> select case when s.machine~'^(hydra|guix)\.' then s.machine else substring(s.machine from '^[^.]*') end as machine, jobset, s.build, s.stepnr as step, case when s.busy=1 then 'busy' when s.status=0 then NULL when s.status=1 then 'fail' when s.status=4 then 'abort' when s.status=7 then 'timeout' when s.status=8 then 'cfail' else '?' end as stat, regexp_replace(substr(s.drvpath,1+strpos(s.drvpath,'-')),'\.drv$','') as what, date_trunc('day', to_timestamp(s.stoptime)) as finished from builds b, buildsteps s where b.id=s.build and b.job='font-fira-sans-4.202.x86_64-linux' order by s.stoptime;
machine | jobset | build | step | stat | what | finished
------------------+--------------+---------+------+-------+----------------------+------------------------
hydra.gnunet.org | master | 2362639 | 1 | | font-fira-sans-4.202 | 2017-11-29 00:00:00+00
hydra.gnunet.org | core-updates | 2407845 | 1 | | font-fira-sans-4.202 | 2018-01-02 00:00:00+00
hydra.gnunet.org | core-updates | 2674686 | 1 | | font-fira-sans-4.202 | 2018-05-19 00:00:00+00
hydra.gnunet.org | core-updates | 3075254 | 1 | | font-fira-sans-4.202 | 2018-09-28 00:00:00+00
| master | 3432052 | 1 | abort | font-fira-sans-4.202 | 2019-03-31 00:00:00+00
| master | 3432052 | 2 | abort | font-fira-sans-4.202 | 2019-04-02 00:00:00+00
| master | 3432052 | 3 | abort | font-fira-sans-4.202 | 2019-04-03 00:00:00+00
| master | 3432052 | 4 | abort | font-fira-sans-4.202 | 2019-04-05 00:00:00+00
| master | 3432052 | 5 | abort | font-fira-sans-4.202 | 2019-04-05 00:00:00+00
| master | 3432052 | 6 | abort | font-fira-sans-4.202 | 2019-04-06 00:00:00+00
| master | 3432052 | 7 | abort | font-fira-sans-4.202 | 2019-04-06 00:00:00+00
| master | 3432052 | 8 | busy | font-fira-sans-4.202 |
(12 rows)
hydra=> select case when s.machine~'^(hydra|guix)\.' then s.machine else substring(s.machine from '^[^.]*') end as machine, jobset, s.build, s.stepnr as step, case when s.busy=1 then 'busy' when s.status=0 then NULL when s.status=1 then 'fail' when s.status=4 then 'abort' when s.status=7 then 'timeout' when s.status=8 then 'cfail' else '?' end as stat, regexp_replace(substr(s.drvpath,1+strpos(s.drvpath,'-')),'\.drv$','') as what, date_trunc('day', to_timestamp(s.stoptime)) as finished from builds b, buildsteps s where b.id=s.build and b.job='font-google-material-design-icons-3.0.1.i686-linux' order by s.stoptime;
machine | jobset | build | step | stat | what | finished
------------------+--------------+---------+------+-------+------------------------------------------------+------------------------
chapters | master | 1834047 | 1 | | font-google-material-design-icons-3.0.1 | 2017-02-13 00:00:00+00
hydra.gnunet.org | core-updates | 1889434 | 1 | | font-google-material-design-icons-3.0.1 | 2017-03-12 00:00:00+00
| master | 2030520 | 2 | cfail | glibc-intermediate-2.25 | 2017-04-30 00:00:00+00
| master | 2030520 | 1 | cfail | glibc-intermediate-2.25 | 2017-04-30 00:00:00+00
| master | 2030520 | 3 | cfail | glibc-intermediate-2.25 | 2017-04-30 00:00:00+00
guix.sjd.se | master | 2035120 | 1 | | font-google-material-design-icons-3.0.1 | 2017-05-04 00:00:00+00
guix.sjd.se | core-updates | 2111787 | 1 | | font-google-material-design-icons-3.0.1 | 2017-06-25 00:00:00+00
hydra.gnunet.org | master | 2128849 | 1 | | font-google-material-design-icons-3.0.1 | 2017-06-26 00:00:00+00
guix.sjd.se | core-updates | 2175161 | 1 | | font-google-material-design-icons-3.0.1 | 2017-07-20 00:00:00+00
| master | 2334641 | 1 | | font-google-material-design-icons-3.0.1.tar.gz | 2017-10-23 00:00:00+00
hydra.gnunet.org | master | 2334641 | 2 | | font-google-material-design-icons-3.0.1 | 2017-10-23 00:00:00+00
| core-updates | 2406391 | 1 | | module-import | 2018-01-02 00:00:00+00
| core-updates | 2406391 | 2 | | module-import-compiled | 2018-01-02 00:00:00+00
guix.sjd.se | core-updates | 2406391 | 3 | | font-google-material-design-icons-3.0.1 | 2018-01-02 00:00:00+00
guix.sjd.se | core-updates | 2667328 | 1 | | font-google-material-design-icons-3.0.1 | 2018-05-18 00:00:00+00
guix.sjd.se | core-updates | 3073906 | 1 | | font-google-material-design-icons-3.0.1 | 2018-09-25 00:00:00+00
| master | 3432472 | 1 | abort | font-google-material-design-icons-3.0.1 | 2019-03-29 00:00:00+00
| master | 3432472 | 2 | abort | font-google-material-design-icons-3.0.1 | 2019-03-30 00:00:00+00
| master | 3432472 | 3 | abort | font-google-material-design-icons-3.0.1 | 2019-03-31 00:00:00+00
| master | 3432472 | 4 | abort | font-google-material-design-icons-3.0.1 | 2019-04-02 00:00:00+00
| master | 3432472 | 5 | abort | font-google-material-design-icons-3.0.1 | 2019-04-03 00:00:00+00
| master | 3432472 | 6 | abort | font-google-material-design-icons-3.0.1 | 2019-04-05 00:00:00+00
| master | 3432472 | 7 | abort | font-google-material-design-icons-3.0.1 | 2019-04-05 00:00:00+00
| master | 3432472 | 8 | abort | font-google-material-design-icons-3.0.1 | 2019-04-06 00:00:00+00
| master | 3432472 | 9 | abort | font-google-material-design-icons-3.0.1 | 2019-04-06 00:00:00+00
| master | 3432472 | 10 | busy | font-google-material-design-icons-3.0.1 |
(26 rows)
--8<---------------cut here---------------end--------------->8---
It seems that Hydra fails to record the machine name in build steps that
are aborted, but I know that all of the 'aborts' above are on
hydra.gnunet.org, because that's currently the only x86 build slave on
Hydra.
I could easily believe that this problem is specific to
hydra.gnunet.org, but even if that's the case, it would be good if
offloading would reliably time out before days have passed.
Ideally, the timeout would be a "max-silent-time" kind of timeout, so
that we don't impose an arbitrary limitation on total transfer time as
long as progress is being made, and so that the timeout can be
relatively short. However, even a "total-transfer-time" kind of timeout
would be welcome at this point, to stop the profuse bleeding of Hydra's
limited x86 build capacity.
What do you think?
Mark
^ permalink raw reply [flat|nested] 12+ messages in thread
* bug#35181: Hydra offloads often get stuck while exporting build requisites
2019-04-08 8:19 ` Ludovic Courtès
2019-04-08 19:40 ` Mark H Weaver
@ 2019-04-09 1:06 ` Mark H Weaver
2019-04-09 10:56 ` Ludovic Courtès
2023-04-14 13:15 ` Maxim Cournoyer
1 sibling, 2 replies; 12+ messages in thread
From: Mark H Weaver @ 2019-04-09 1:06 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: 35181
merge 35181 34157
thanks
I looked more closely at the 'mozjs-60' failures, and I'm convinced that
it's an instance of the same problem that's currently affecting these
large font builds.
Mozjs-60 was pushed to the master branch on 2019-01-18. It has _never_
successfully built on x86_64 or i686, although all builds were
successful on armhf. See below for the complete list of build attempts
of mozjs-60 on Hydra.
Also of note: So far, all known instances of this problem have occurred
while transferring a large directory, as opposed to a tarball.
We have several packages with source tarballs _much_ larger than these
problematic source checkouts, and which are updated more much
frequently, and yet I've *never* seen an instance of this problem while
exporting a plain file to a build slave. For example, the upstream
IceCat and Firefox ESR tarballs are ~270 megabytes compressed, whereas
'font-google-material-design-icons-3.0.1' source is only ~176 megabytes
_uncompressed_.
I have no explanation for why the superficial form of the store item
should matter here, but maybe it's a clue. I know that plain
non-executable files in the store are handled somewhat differently in
the Nix model than directories or executable files, the latter
associated with the word "recursive", and requiring an additional layer
of encoding for purposes of serialization, but I'm not sufficiently
familiar with the details or relevant code.
Ludovic, can you think of a reason why the file/directory distinction
could be relevant to this issue?
Finally: the problem seems to have been introduced into Hydra sometime
between September 2018 and January 2019. September 2018 is when the
last successful build of the problematic font packages was performed,
and January 2019 is the first known instance of the problem. I do not
currently know of any relevant data points in that time range. The last
'core-updates' merge into 'master' was on December 3rd.
Mark
PS: Here's the complete history of 'mozjs-60' build attempts on Hydra:
First are the 'armhf' attempts, followed by i686, and x86_64. Note
that the two armhf aborts happened after only 2 seconds, and surely
had a different cause than this issue.
--8<---------------cut here---------------start------------->8---
hydra=> select case when s.machine~'^(hydra|guix)\.' then s.machine else substring(s.machine from '^[^.]*') end as machine, jobset, s.build, s.stepnr as step, case when s.busy=1 then 'busy' when s.status=0 then NULL when s.status=1 then 'fail' when s.status=4 then 'abort' when s.status=7 then 'timeout' when s.status=8 then 'cfail' else '?' end as stat, regexp_replace(substr(s.drvpath,1+strpos(s.drvpath,'-')),'\.drv$','') as what, date_trunc('second', to_timestamp(s.stoptime)) as finished, date_trunc('second', to_timestamp(s.stoptime) - to_timestamp(s.starttime)) as duration from builds b, buildsteps s where b.id=s.build and b.job='mozjs-60.2.3-2.armhf-linux' order by s.stoptime;
machine | jobset | build | step | stat | what | finished | duration
--------------+--------+---------+------+-------+-------------------------+------------------------+----------
hydra-slave2 | master | 3342804 | 1 | | mozjs-60.2.3-2-checkout | 2019-01-19 12:58:52+00 | 00:23:55
hydra-slave2 | master | 3342804 | 2 | | mozjs-60.2.3-2 | 2019-01-19 15:49:37+00 | 02:50:42
| master | 3367975 | 1 | abort | mozjs-60.2.3-2 | 2019-02-13 00:03:58+00 | 00:00:02
| master | 3367975 | 2 | abort | mozjs-60.2.3-2 | 2019-02-15 15:35:45+00 | 00:00:02
hydra-slave3 | master | 3367975 | 3 | | mozjs-60.2.3-2 | 2019-02-18 16:38:08+00 | 02:46:42
(5 rows)
hydra=> select case when s.machine~'^(hydra|guix)\.' then s.machine else substring(s.machine from '^[^.]*') end as machine, jobset, s.build, s.stepnr as step, case when s.busy=1 then 'busy' when s.status=0 then NULL when s.status=1 then 'fail' when s.status=4 then 'abort' when s.status=7 then 'timeout' when s.status=8 then 'cfail' else '?' end as stat, regexp_replace(substr(s.drvpath,1+strpos(s.drvpath,'-')),'\.drv$','') as what, date_trunc('second', to_timestamp(s.stoptime)) as finished, date_trunc('second', to_timestamp(s.stoptime) - to_timestamp(s.starttime)) as duration from builds b, buildsteps s where b.id=s.build and b.job='mozjs-60.2.3-2.i686-linux' order by s.stoptime;
machine | jobset | build | step | stat | what | finished | duration
---------+--------+---------+------+-------+----------------+------------------------+-----------------
| master | 3343511 | 1 | abort | mozjs-60.2.3-2 | 2019-01-20 20:05:16+00 | 12:11:12
| master | 3343511 | 2 | abort | mozjs-60.2.3-2 | 2019-01-23 01:52:01+00 | 2 days 05:42:55
| master | 3360985 | 1 | abort | mozjs-60.2.3-2 | 2019-02-15 19:59:42+00 | 09:31:25
| master | 3360985 | 2 | abort | mozjs-60.2.3-2 | 2019-02-16 17:37:06+00 | 05:57:15
| master | 3360985 | 3 | abort | mozjs-60.2.3-2 | 2019-02-17 17:39:49+00 | 16:06:14
| master | 3360985 | 4 | abort | mozjs-60.2.3-2 | 2019-03-03 21:50:48+00 | 00:02:19
(6 rows)
hydra=> select case when s.machine~'^(hydra|guix)\.' then s.machine else substring(s.machine from '^[^.]*') end as machine, jobset, s.build, s.stepnr as step, case when s.busy=1 then 'busy' when s.status=0 then NULL when s.status=1 then 'fail' when s.status=4 then 'abort' when s.status=7 then 'timeout' when s.status=8 then 'cfail' else '?' end as stat, regexp_replace(substr(s.drvpath,1+strpos(s.drvpath,'-')),'\.drv$','') as what, date_trunc('second', to_timestamp(s.stoptime)) as finished, date_trunc('second', to_timestamp(s.stoptime) - to_timestamp(s.starttime)) as duration from builds b, buildsteps s where b.id=s.build and b.job='mozjs-60.2.3-2.x86_64-linux' order by s.stoptime;
machine | jobset | build | step | stat | what | finished | duration
---------+--------+---------+------+-------+----------------+------------------------+-----------------
| master | 3342528 | 1 | abort | mozjs-60.2.3-2 | 2019-01-20 20:04:50+00 | 22:25:28
| master | 3342528 | 2 | abort | mozjs-60.2.3-2 | 2019-01-23 01:51:48+00 | 2 days 05:19:35
| master | 3366691 | 1 | abort | mozjs-60.2.3-2 | 2019-02-17 17:39:59+00 | 09:21:24
(3 rows)
--8<---------------cut here---------------end--------------->8---
^ permalink raw reply [flat|nested] 12+ messages in thread