unofficial mirror of guix-devel@gnu.org 
* CI status
From: Mathieu Othacehe @ 2021-12-15 16:15 UTC
  To: guix-devel


Hello,

You must have noticed that the CI is currently struggling a bit. Here is
a small recap of the situation.

* The IO operations on Berlin are mysteriously slow. Removing files from
  /gnu/store/trash is taking ages. This is reported here:
  https://issues.guix.gnu.org/51787.

  We have to kill the garbage collector frequently to keep things
  going. The obvious downside is that we can't do that forever: we only
  have 9.3T free, and that is decreasing, while we aim to keep 10T
  available.

* The PostgreSQL database behind ci.guix.gnu.org also became super slow
  and I decided to drop it. I don't know if there's a connection with
  the above point. I'm missing the appropriate tools/knowledge to
  monitor the IO & file-system performance.

* The php package no longer builds, as reported here:
  https://issues.guix.gnu.org/52513. This means that we cannot
  reconfigure zabbix. I removed it from the berlin configuration
  temporarily.
  
* The cuirass-remote-server Avahi service is no longer visible when
  running "avahi-browse -a". I strongly suspect that this is related to
  the static-networking update, even though I don't have proof of it
  yet. This means that the remote workers that use Avahi for discovery
  (the hydra-guix-* machines) can no longer connect. The
  ci.guix.gnu.org/workers list is thus quite empty.

* Facing those problems, I tried to roll back to a previous system
  generation, but this brings even more issues: for instance, the
  older Cuirass package struggles with the new database structure and
  other niceties. I think our best course of action is to stick to
  master and fix the above problems.
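A minimal sketch of the free-space check described in the first point, assuming the "T" figures are tebibytes as printed by `df -h` and that /gnu/store is the file system that matters; the helper names and the 10 TiB default are illustrative, not taken from the berlin configuration:

```python
import shutil

# `df -h` prints sizes in powers of 1024, so "10T" is assumed to mean 10 TiB.
TIB = 1024 ** 4


def free_tib(path: str = "/gnu/store") -> float:
    """Free space on the file system holding PATH, in TiB."""
    return shutil.disk_usage(path).free / TIB


def below_target(free_bytes: int, target_tib: float = 10.0) -> bool:
    """True once free space has dropped under the (hypothetical) 10 TiB target."""
    return free_bytes < target_tib * TIB
```

Under those assumptions, `below_target(shutil.disk_usage("/gnu/store").free)` is True in the 9.3T situation described above.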

Thanks,

Mathieu



* Re: CI status
From: Leo Famulari @ 2021-12-15 17:26 UTC
  To: Mathieu Othacehe; +Cc: guix-devel

On Wed, Dec 15, 2021 at 05:15:08PM +0100, Mathieu Othacehe wrote:
> * The IO operations on Berlin are mysteriously slow. Removing files from
>   /gnu/store/trash is taking ages. This is reported here:
>   https://issues.guix.gnu.org/51787.

I believe this is because we are running `guix gc --verify=contents` to
check the status of the build artifacts after the shutdown. I'm not sure
whether or not we can get a progress report on this.
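One possible way to get a rough progress figure, offered as an untested sketch: Linux exposes per-process I/O counters in /proc/PID/io, and since verifying contents means re-reading every store item, the `read_bytes` counter of the process doing the verification (assumed here to be the guix-daemon child serving the `guix gc` request) can be compared with the space used by the store. Page-cache hits are not counted in `read_bytes`, so the result is only an estimate; the function names are hypothetical.

```python
def parse_proc_io(text: str) -> dict:
    """Parse the "key: value" lines of /proc/PID/io into a dict of integers."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value)
    return counters


def verify_progress(io_text: str, store_used_bytes: int) -> float:
    """Fraction of the store's used space read so far, capped at 1.0."""
    read = parse_proc_io(io_text)["read_bytes"]
    return min(read / store_used_bytes, 1.0)
```

Reading another user's /proc/PID/io needs root; the input would be `open(f"/proc/{pid}/io").read()` and `shutil.disk_usage("/gnu/store").used`, with `pid` looked up by hand.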

> * The PostgreSQL database behind ci.guix.gnu.org also became super slow
>   and I decided to drop it. I don't know if there's a connection with
>   the above point. I'm missing the appropriate tools/knowledge to
>   monitor the IO & file-system performance.

You might try `atop`, which at least highlights that the storage is
almost fully loaded with I/O operations. Beyond that, there is `sar` from
the sysstat package, although making use of it requires some learning.
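For a quick non-interactive number, the busy percentage that `sar -d` and `iostat -x` report as %util can be approximated from /proc/diskstats: the 13th field of each line is the cumulative time in milliseconds the device spent doing I/O, so two samples taken a known interval apart give the fraction of that interval the device was busy. A minimal sketch, with hypothetical function names:

```python
import time


def io_ms_by_device(diskstats: str) -> dict:
    """Map device name -> cumulative ms spent doing I/O (13th field of /proc/diskstats)."""
    busy = {}
    for line in diskstats.splitlines():
        fields = line.split()
        if len(fields) >= 14:
            busy[fields[2]] = int(fields[12])
    return busy


def utilization(before: str, after: str, interval_s: float) -> dict:
    """Percent of the interval each device was busy, comparable to %util."""
    b, a = io_ms_by_device(before), io_ms_by_device(after)
    return {dev: 100.0 * (a[dev] - b[dev]) / (interval_s * 1000.0)
            for dev in a if dev in b}


def sample(interval_s: float = 5.0) -> dict:
    """Take two snapshots of /proc/diskstats INTERVAL_S seconds apart."""
    with open("/proc/diskstats") as f:
        before = f.read()
    time.sleep(interval_s)
    with open("/proc/diskstats") as f:
        after = f.read()
    return utilization(before, after, interval_s)
```

`sample()` returns one percentage per block device. Values that stay near 100 match what `atop` highlights, although for RAID arrays or SSDs that serve requests in parallel, a %util near 100 does not necessarily mean the device is saturated.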



* Re: CI status
From: Mathieu Othacehe @ 2021-12-15 19:38 UTC
  To: guix-devel


> * The cuirass-remote-server Avahi service is no longer visible when
>   running "avahi-browse -a". I strongly suspect that this is related to
>   the static-networking update, even though I don't have proof of it
>   yet. This means that the remote workers that use Avahi for discovery
>   (the hydra-guix-* machines) can no longer connect. The
>   ci.guix.gnu.org/workers list is thus quite empty.

This is caused by https://issues.guix.gnu.org/52520. I worked around it
by enabling multicast manually on the berlin eno1 network interface.
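To check for this condition, Linux exposes each interface's flags as a hexadecimal bit mask in /sys/class/net/IFACE/flags, and IFF_MULTICAST is bit 0x1000 in <linux/if.h>. A minimal sketch follows; the command in the final comment is the standard iproute2 way of setting the flag and is an assumption, not a record of what was run on berlin:

```python
IFF_MULTICAST = 0x1000  # value from <linux/if.h>


def interface_flags(iface: str) -> int:
    """Read the flag bit mask Linux exposes for IFACE, e.g. "0x1003"."""
    with open(f"/sys/class/net/{iface}/flags") as f:
        return int(f.read().strip(), 16)


def multicast_enabled(flags: int) -> bool:
    """True when IFF_MULTICAST is set; Avahi's mDNS relies on multicast."""
    return bool(flags & IFF_MULTICAST)


# If multicast_enabled(interface_flags("eno1")) is False, the flag can be set
# by hand (as root) with:  ip link set dev eno1 multicast on
```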

Cuirass is building again, yay!

Mathieu



* Re: CI status
From: Leo Famulari @ 2021-12-15 19:43 UTC
  To: Mathieu Othacehe; +Cc: guix-devel

On Wed, Dec 15, 2021 at 08:38:56PM +0100, Mathieu Othacehe wrote:
> 
> > * The cuirass-remote-server Avahi service is no longer visible when
> >   running "avahi-browse -a". I strongly suspect that this is related to
> >   the static-networking update, even though I don't have proof of it
> >   yet. This means that the remote workers that use Avahi for discovery
> >   (the hydra-guix-* machines) can no longer connect. The
> >   ci.guix.gnu.org/workers list is thus quite empty.
> 
> This is caused by https://issues.guix.gnu.org/52520. I worked around it
> by enabling multicast manually on the berlin eno1 network interface.
> 
> Cuirass is building again, yay!

Great news! Thanks for your diligence.



* Re: CI status
From: Ricardo Wurmus @ 2021-12-15 20:36 UTC
  To: Mathieu Othacehe; +Cc: guix-devel


Mathieu Othacehe <othacehe@gnu.org> writes:

>> * The cuirass-remote-server Avahi service is no longer visible when
>>   running "avahi-browse -a". I strongly suspect that this is related to
>>   the static-networking update, even though I don't have proof of it
>>   yet. This means that the remote workers that use Avahi for discovery
>>   (the hydra-guix-* machines) can no longer connect. The
>>   ci.guix.gnu.org/workers list is thus quite empty.
>
> This is caused by https://issues.guix.gnu.org/52520. I worked around it
> by enabling multicast manually on the berlin eno1 network interface.
>
> Cuirass is building again, yay!

Thank you, Mathieu!  I really appreciate your work on Cuirass and your
efforts in diagnosing and working around performance problems.

-- 
Ricardo



* Re: CI status
From: Maxim Cournoyer @ 2021-12-20  2:16 UTC
  To: Mathieu Othacehe; +Cc: guix-devel

Hi Mathieu,

Mathieu Othacehe <othacehe@gnu.org> writes:

> Hello,
>
> You must have noticed that the CI is currently struggling a bit. Here is
> a small recap of the situation.
>
> * The IO operations on Berlin are mysteriously slow. Removing files from
>   /gnu/store/trash is taking ages. This is reported here:
>   https://issues.guix.gnu.org/51787.
>
>   We have to kill the garbage collector frequently to keep things
>   going. The obvious downside is that we can't do that forever: we only
>   have 9.3T free, and that is decreasing, while we aim to keep 10T
>   available.
>
> * The PostgreSQL database behind ci.guix.gnu.org also became super slow
>   and I decided to drop it. I don't know if there's a connection with
>   the above point. I'm missing the appropriate tools/knowledge to
>   monitor the IO & file-system performance.
>
> * The php package no longer builds, as reported here:
>   https://issues.guix.gnu.org/52513. This means that we cannot
>   reconfigure zabbix. I removed it from the berlin configuration
>   temporarily.
>   
> * The cuirass-remote-server Avahi service is no longer visible when
>   running "avahi-browse -a". I strongly suspect that this is related to
>   the static-networking update, even though I don't have proof of it
>   yet. This means that the remote workers that use Avahi for discovery
>   (the hydra-guix-* machines) can no longer connect. The
>   ci.guix.gnu.org/workers list is thus quite empty.
>
> * Facing those problems, I tried to roll back to a previous system
>   generation, but this brings even more issues: for instance, the
>   older Cuirass package struggles with the new database structure and
>   other niceties. I think our best course of action is to stick to
>   master and fix the above problems.

Ooof, thanks a lot for reporting (and fixing) the above problems on top
of babysitting Cuirass while things stabilize...  I'll try to keep an
eye on Berlin IO activity to see if there are any offenders consuming an
abnormal amount of IO.
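A minimal sketch of one way to spot such offenders without extra tools, similar in spirit to `iotop`: take two snapshots of the per-process `read_bytes` and `write_bytes` counters in /proc/PID/io and rank processes by how much they grew. It has to run as root to see other users' processes; the function names are hypothetical.

```python
import os


def io_bytes_by_pid(proc: str = "/proc") -> dict:
    """Snapshot read_bytes + write_bytes for every process we may inspect."""
    totals = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "io")) as f:
                fields = dict(line.split(":", 1) for line in f if ":" in line)
            totals[int(entry)] = int(fields["read_bytes"]) + int(fields["write_bytes"])
        except (OSError, KeyError, ValueError):
            continue  # the process exited, or access was denied
    return totals


def top_offenders(before: dict, after: dict, n: int = 5) -> list:
    """(pid, bytes) pairs with the largest I/O growth between two snapshots."""
    deltas = {pid: after[pid] - before[pid] for pid in after if pid in before}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)[:n]
```

For example: `before = io_bytes_by_pid()`, wait ten seconds with `time.sleep(10)`, then `top_offenders(before, io_bytes_by_pid())`. Processes that start or exit between the two snapshots are ignored.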

Maxim



Thread overview: 6+ messages
2021-12-15 16:15 CI status Mathieu Othacehe
2021-12-15 17:26 ` Leo Famulari
2021-12-15 19:38 ` Mathieu Othacehe
2021-12-15 19:43   ` Leo Famulari
2021-12-15 20:36   ` Ricardo Wurmus
2021-12-20  2:16 ` Maxim Cournoyer
