From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mark H Weaver Subject: bug#35181: Hydra offloads often get stuck while exporting build requisites Date: Mon, 08 Apr 2019 15:40:51 -0400 Message-ID: <87o95g5lpd.fsf@netris.org> References: <87mul17oo2.fsf@netris.org> <87imvp7ogv.fsf@netris.org> <20190407173105.GB1337@macbook41> <87ef6d6mdn.fsf@netris.org> <87pnpw29kp.fsf@gnu.org> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Return-path: Received: from eggs.gnu.org ([209.51.188.92]:45731) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1hDaAc-0004jg-78 for bug-guix@gnu.org; Mon, 08 Apr 2019 15:43:08 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1hDaAZ-0003XU-7v for bug-guix@gnu.org; Mon, 08 Apr 2019 15:43:05 -0400 Received: from debbugs.gnu.org ([209.51.188.43]:36648) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1hDaAY-0003Td-0W for bug-guix@gnu.org; Mon, 08 Apr 2019 15:43:02 -0400 Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1hDaAX-00040F-Q5 for bug-guix@gnu.org; Mon, 08 Apr 2019 15:43:01 -0400 Sender: "Debbugs-submit" Resent-Message-ID: In-Reply-To: <87pnpw29kp.fsf@gnu.org> ("Ludovic \=\?utf-8\?Q\?Court\=C3\=A8s\=22'\?\= \=\?utf-8\?Q\?s\?\= message of "Mon, 08 Apr 2019 10:19:18 +0200") List-Id: Bug reports for GNU Guix List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-guix-bounces+gcggb-bug-guix=m.gmane.org@gnu.org Sender: "bug-Guix" To: Ludovic =?UTF-8?Q?Court=C3=A8s?= Cc: 35181@debbugs.gnu.org Hi Ludovic, Ludovic Court=C3=A8s writes: > Mark H Weaver skribis: > >> The source checkout currently being transferred for build 3432472 >> (/gnu/store/=E2=80=A6-font-google-material-design-icons-3.0.1-checkout) = is 176 >> megabytes uncompressed, as measured by "du -s --si", which is not >> precisely same as NAR size, but hopefully close enough for a rough >> estimate. As I write this, build 3432472 been stuck here for 24 hours >> 15 minutes. Even if the average transfer rate were 4 kilobytes per >> second, it should have been done in half that time. > > This is weird, could it be that data transfers get stuck somehow? As far as I can tell, that's what seems to happen. > Did you try to check the status of the =E2=80=98nix-store=E2=80=99 and = =E2=80=98guix offload=E2=80=99 > processes on the head node? Here are the corresponding 'guix offload' processes: --8<---------------cut here---------------start------------->8--- hydra@20121227-hydra:~$ ps auxwwf | head -1; ps auxwwf | egrep -B1 'off()lo= ad' USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 8984 0.0 0.0 30784 2248 ? Ss Apr07 0:00 | \= _ /root/.guix-profile/bin/guix-daemon 8983 --max-jobs=3D1 --build-user= s-group=3Dguixbuild --disable-log-compression --gc-keep-outputs --gc-keep-d= erivations --no-substitutes --cache-failures root 8985 0.0 0.2 145532 13976 ? SLsl Apr07 0:10 | |= \_ /gnu/store/yihvhxv3xyyvl1m2cy1lnf1lyi9h76fk-guile-2.2.2/bin/guile --n= o-auto-compile /gnu/store/fkkjhida23k612naa9d4q6avqj5v3b28-guix-0.13.0-8.35= 7ab93/bin/.guix-real offload x86_64-linux 3600 1 72000 -- root 14768 0.0 0.0 30752 2356 ? Ss Apr07 0:00 | \= _ /root/.guix-profile/bin/guix-daemon 14767 --max-jobs=3D1 --build-user= s-group=3Dguixbuild --disable-log-compression --gc-keep-outputs --gc-keep-d= erivations --no-substitutes --cache-failures root 14769 0.0 0.2 145668 10912 ? SLsl Apr07 0:16 | |= \_ /gnu/store/yihvhxv3xyyvl1m2cy1lnf1lyi9h76fk-guile-2.2.2/bin/guile --n= o-auto-compile /gnu/store/fkkjhida23k612naa9d4q6avqj5v3b28-guix-0.13.0-8.35= 7ab93/bin/.guix-real offload x86_64-linux 3600 1 72000 --8<---------------cut here---------------end--------------->8--- I tried attaching to both of these offload processes with 'strace', and waited for several minutes for any system call activity. Both of them are stuck sleeping within a system call, although I don't yet know which system call: --8<---------------cut here---------------start------------->8--- root@20121227-hydra:~# strace -p 8985 Process 8985 attached - interrupt to quit restart_syscall(<... resuming interrupted call ...>^C Process 8985 detached root@20121227-hydra:~# strace -p 14769 Process 14769 attached - interrupt to quit restart_syscall(<... resuming interrupted call ...>^C Process 14769 detached --8<---------------cut here---------------end--------------->8--- Here are the 'nix-store' processes: --8<---------------cut here---------------start------------->8--- hydra@20121227-hydra:~$ ps auxwwf | head -1; ps auxwwf | egrep -A1 'hydra-(= )build' USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND hydra 8980 0.0 0.9 187332 46656 pts/5 S Apr07 0:01 | | \= _ /usr/local/bin/perl -w /usr/local/bin/hydra-build 3432472 hydra 8983 0.0 0.0 34228 464 pts/5 S Apr07 0:00 | | |= \_ nix-store --realise /gnu/store/5ivay4l7bn0sqsi7k53j4qv3kndrby17-font-= google-material-design-icons-3.0.1.drv --timeout 72000 --max-silent-time 36= 00 --option build-max-log-size 67108864 --keep-going --fallback --no-build-= output --log-type flat --print-build-trace --add-root /nix/var/nix/gcroots/= per-user/hydra/hydra-roots/y61f3cdhx31msdhkdw0kfs5pb75ycgfq-font-google-mat= erial-design-icons-3.0.1 hydra 14764 0.0 0.9 187336 46576 pts/5 S Apr07 0:01 | | \= _ /usr/local/bin/perl -w /usr/local/bin/hydra-build 3432052 hydra 14767 0.0 0.0 34228 352 pts/5 S Apr07 0:00 | | = \_ nix-store --realise /gnu/store/k27i3lkb38gr3mw0mridymhik3qsg6w7-font-= fira-sans-4.202.drv --timeout 72000 --max-silent-time 3600 --option build-m= ax-log-size 67108864 --keep-going --fallback --no-build-output --log-type f= lat --print-build-trace --add-root /nix/var/nix/gcroots/per-user/hydra/hydr= a-roots/28ncfjmplcwyzas2p3d4cy5xlzacjcnj-font-fira-sans-4.202 --8<---------------cut here---------------end--------------->8--- The 'nix-store' processes seem to be stuck sleeping in 'read', if I'm interpreting the 'strace' output correctly: --8<---------------cut here---------------start------------->8--- root@20121227-hydra:~# strace -p 8983 Process 8983 attached - interrupt to quit read(3, ^C Process 8983 detached root@20121227-hydra:~# strace -p 14767 Process 14767 attached - interrupt to quit read(3, ^C Process 14767 detached --8<---------------cut here---------------end--------------->8--- "netstat --inet --program" shows that the SSH connections are still open: --8<---------------cut here---------------start------------->8--- root@20121227-hydra:~# netstat --inet --program | grep 'hydra\.net\.in\.tum= \.' tcp 0 0 20121227-hydra.gn:53216 hydra.net.in.tum.de:ssh ESTABLI= SHED 14769/guile=20=20=20=20=20 tcp 0 0 20121227-hydra.gn:52434 hydra.net.in.tum.de:ssh ESTABLI= SHED 8985/guile=20=20=20=20=20=20 tcp 0 0 20121227-hydra.gnu.:www hydra.net.in.tum.:52104 TIME_WA= IT -=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20 tcp 0 0 20121227-hydra.gnu.:www hydra.net.in.tum.:52103 TIME_WA= IT -=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20 --8<---------------cut here---------------end--------------->8--- > I just checked and apparently this package builds fine on berlin. Also, these same jobs (e.g. same versions) have been built successfully on hydra.gnunet.org for years without difficulty. In the case of 'font-fira-sans-4.202.x86_64-linux', it has only ever been built on hydra.gnunet.org, with the last successful build on 28 September 2018. The packages haven't been updated in years, and so typically are only rebuilt during 'core-updates' cycles. They only started aborting in late March, when some other rarely-update package apparently changed to force a rebuild. However, the similar 'mozjs-60' failures happened earlier. FYI, here's the history of build attempts on Hydra: --8<---------------cut here---------------start------------->8--- hydra=3D> select case when s.machine~'^(hydra|guix)\.' then s.machine else = substring(s.machine from '^[^.]*') end as machine, jobset, s.build, s.stepn= r as step, case when s.busy=3D1 then 'busy' when s.status=3D0 then NULL whe= n s.status=3D1 then 'fail' when s.status=3D4 then 'abort' when s.status=3D7= then 'timeout' when s.status=3D8 then 'cfail' else '?' end as stat, regexp= _replace(substr(s.drvpath,1+strpos(s.drvpath,'-')),'\.drv$','') as what, da= te_trunc('day', to_timestamp(s.stoptime)) as finished from builds b, builds= teps s where b.id=3Ds.build and b.job=3D'font-fira-sans-4.202.x86_64-linux'= order by s.stoptime; machine | jobset | build | step | stat | what = | finished=20=20=20=20=20=20=20=20 ------------------+--------------+---------+------+-------+----------------= ------+------------------------ hydra.gnunet.org | master | 2362639 | 1 | | font-fira-sans-= 4.202 | 2017-11-29 00:00:00+00 hydra.gnunet.org | core-updates | 2407845 | 1 | | font-fira-sans-= 4.202 | 2018-01-02 00:00:00+00 hydra.gnunet.org | core-updates | 2674686 | 1 | | font-fira-sans-= 4.202 | 2018-05-19 00:00:00+00 hydra.gnunet.org | core-updates | 3075254 | 1 | | font-fira-sans-= 4.202 | 2018-09-28 00:00:00+00 | master | 3432052 | 1 | abort | font-fira-sans-= 4.202 | 2019-03-31 00:00:00+00 | master | 3432052 | 2 | abort | font-fira-sans-= 4.202 | 2019-04-02 00:00:00+00 | master | 3432052 | 3 | abort | font-fira-sans-= 4.202 | 2019-04-03 00:00:00+00 | master | 3432052 | 4 | abort | font-fira-sans-= 4.202 | 2019-04-05 00:00:00+00 | master | 3432052 | 5 | abort | font-fira-sans-= 4.202 | 2019-04-05 00:00:00+00 | master | 3432052 | 6 | abort | font-fira-sans-= 4.202 | 2019-04-06 00:00:00+00 | master | 3432052 | 7 | abort | font-fira-sans-= 4.202 | 2019-04-06 00:00:00+00 | master | 3432052 | 8 | busy | font-fira-sans-= 4.202 |=20 (12 rows) hydra=3D> select case when s.machine~'^(hydra|guix)\.' then s.machine else = substring(s.machine from '^[^.]*') end as machine, jobset, s.build, s.stepn= r as step, case when s.busy=3D1 then 'busy' when s.status=3D0 then NULL whe= n s.status=3D1 then 'fail' when s.status=3D4 then 'abort' when s.status=3D7= then 'timeout' when s.status=3D8 then 'cfail' else '?' end as stat, regexp= _replace(substr(s.drvpath,1+strpos(s.drvpath,'-')),'\.drv$','') as what, da= te_trunc('day', to_timestamp(s.stoptime)) as finished from builds b, builds= teps s where b.id=3Ds.build and b.job=3D'font-google-material-design-icons-= 3.0.1.i686-linux' order by s.stoptime; machine | jobset | build | step | stat | = what | finished=20=20=20=20=20=20=20=20 ------------------+--------------+---------+------+-------+----------------= --------------------------------+------------------------ chapters | master | 1834047 | 1 | | font-google-mat= erial-design-icons-3.0.1 | 2017-02-13 00:00:00+00 hydra.gnunet.org | core-updates | 1889434 | 1 | | font-google-mat= erial-design-icons-3.0.1 | 2017-03-12 00:00:00+00 | master | 2030520 | 2 | cfail | glibc-intermedi= ate-2.25 | 2017-04-30 00:00:00+00 | master | 2030520 | 1 | cfail | glibc-intermedi= ate-2.25 | 2017-04-30 00:00:00+00 | master | 2030520 | 3 | cfail | glibc-intermedi= ate-2.25 | 2017-04-30 00:00:00+00 guix.sjd.se | master | 2035120 | 1 | | font-google-mat= erial-design-icons-3.0.1 | 2017-05-04 00:00:00+00 guix.sjd.se | core-updates | 2111787 | 1 | | font-google-mat= erial-design-icons-3.0.1 | 2017-06-25 00:00:00+00 hydra.gnunet.org | master | 2128849 | 1 | | font-google-mat= erial-design-icons-3.0.1 | 2017-06-26 00:00:00+00 guix.sjd.se | core-updates | 2175161 | 1 | | font-google-mat= erial-design-icons-3.0.1 | 2017-07-20 00:00:00+00 | master | 2334641 | 1 | | font-google-mat= erial-design-icons-3.0.1.tar.gz | 2017-10-23 00:00:00+00 hydra.gnunet.org | master | 2334641 | 2 | | font-google-mat= erial-design-icons-3.0.1 | 2017-10-23 00:00:00+00 | core-updates | 2406391 | 1 | | module-import = | 2018-01-02 00:00:00+00 | core-updates | 2406391 | 2 | | module-import-c= ompiled | 2018-01-02 00:00:00+00 guix.sjd.se | core-updates | 2406391 | 3 | | font-google-mat= erial-design-icons-3.0.1 | 2018-01-02 00:00:00+00 guix.sjd.se | core-updates | 2667328 | 1 | | font-google-mat= erial-design-icons-3.0.1 | 2018-05-18 00:00:00+00 guix.sjd.se | core-updates | 3073906 | 1 | | font-google-mat= erial-design-icons-3.0.1 | 2018-09-25 00:00:00+00 | master | 3432472 | 1 | abort | font-google-mat= erial-design-icons-3.0.1 | 2019-03-29 00:00:00+00 | master | 3432472 | 2 | abort | font-google-mat= erial-design-icons-3.0.1 | 2019-03-30 00:00:00+00 | master | 3432472 | 3 | abort | font-google-mat= erial-design-icons-3.0.1 | 2019-03-31 00:00:00+00 | master | 3432472 | 4 | abort | font-google-mat= erial-design-icons-3.0.1 | 2019-04-02 00:00:00+00 | master | 3432472 | 5 | abort | font-google-mat= erial-design-icons-3.0.1 | 2019-04-03 00:00:00+00 | master | 3432472 | 6 | abort | font-google-mat= erial-design-icons-3.0.1 | 2019-04-05 00:00:00+00 | master | 3432472 | 7 | abort | font-google-mat= erial-design-icons-3.0.1 | 2019-04-05 00:00:00+00 | master | 3432472 | 8 | abort | font-google-mat= erial-design-icons-3.0.1 | 2019-04-06 00:00:00+00 | master | 3432472 | 9 | abort | font-google-mat= erial-design-icons-3.0.1 | 2019-04-06 00:00:00+00 | master | 3432472 | 10 | busy | font-google-mat= erial-design-icons-3.0.1 |=20 (26 rows) --8<---------------cut here---------------end--------------->8--- It seems that Hydra fails to record the machine name in build steps that are aborted, but I know that all of the 'aborts' above are on hydra.gnunet.org, because that's currently the only x86 build slave on Hydra. I could easily believe that this problem is specific to hydra.gnunet.org, but even if that's the case, it would be good if offloading would reliably time out before days have passed. Ideally, the timeout would be a "max-silent-time" kind of timeout, so that we don't impose an arbitrary limitation on total transfer time as long as progress is being made, and so that the timeout can be relatively short. However, even a "total-transfer-time" kind of timeout would be welcome at this point, to stop the profuse bleeding of Hydra's limited x86 build capacity. What do you think? Mark