* ARM build machines
@ 2020-07-28 10:31 Jonathan Brielmaier
From: Jonathan Brielmaier @ 2020-07-28 10:31 UTC (permalink / raw)
To: Guix-devel
Hi fellow hackers,
after recent discussions on IRC about our SoftIron OverDrive, I thought
it might be a good idea to have a look at what options for powerful ARM
machines are around. That was some weeks ago.
Today there was another discussion on IRC about building packages on
ARM, so I am sending in my findings.
ARM cloud/VPS offers
====================
Packet 96 cores, 128G RAM -> 4380$/a [0]
Amazon 16 vCPUs, 32G RAM -> 4080$/a [1]
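For a rough sense of scale, the two offers above can be compared per
core; this is only a back-of-the-envelope sketch using the list prices
quoted here, ignoring traffic costs and discounts:

```python
# Annual cost per core for the two cloud offers above,
# using only the list prices quoted in this message.
offers = {
    "Packet (96 cores, 128G RAM)": (4380, 96),
    "Amazon (16 vCPUs, 32G RAM)": (4080, 16),
}

for name, (usd_per_year, cores) in offers.items():
    per_core = usd_per_year / cores
    print(f"{name}: ${per_core:.0f} per core per year")
```

The Packet machine is far cheaper per core, though vCPUs and physical
cores are of course not directly comparable.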
Linaro free dev VMs
===================
Linaro, an organization connected with ARM, offers free ARM VMs for
developers. They go up to 32 vCPUs and 64G of RAM:
https://www.linaro.cloud/
ARM servers selfhosting
=======================
Marvell ThunderX 32 cores, 1U 400W -> 2000$+
Marvell ThunderX 48 cores, 1U 400W -> 3000$+
+ RAM + disks + taxes + shipping [2]
Avantek also has ThunderX offerings; slightly more expensive, but based
in the UK[4].
ThunderX2 is insanely expensive compared to v1, starting at something
above 9000$.
Other fast ARM server chips come from Qualcomm (Centriq), Ampere (eMAG)[8]
and Huawei (Kunpeng on TaiShan boards). Huawei TaiShans are distributed
by a Polish company[5] and a US distributor[6].
Colocation costs for such servers start at 1500€/a in Germany. I don't
think self-hosting those is really an option, as they are big and
more power-hungry than the SoftIrons...
Summary
=======
Neither VPSes nor dedicated build machines for ARM are cheap if they are
decently powerful. As a first step, I could try to reach out to Linaro
and ask whether they are willing to provide us some VMs.
Kind Regards
Jonathan
[0] https://www.packet.com/cloud/servers/
[1] https://aws.amazon.com/ec2/pricing/on-demand/ + traffic costs $0.09
per GB
[2] https://www.asacomputers.com/Cavium-ThunderX.html
[4] https://store.avantek.co.uk/arm-servers.html
[5] https://www.vectorsolutions.net/kat-produktow/serwery-data-center/
[6] https://actfornet.com/search?q=taishan
[8] Ampere eMAG is used in the Lenovo ThinkSystem HR330A. I can't find a
reseller for them.
* Re: ARM build machines
From: Mathieu Othacehe @ 2020-07-28 12:55 UTC (permalink / raw)
To: Jonathan Brielmaier; +Cc: Guix-devel
Hello Jonathan,
Thanks a lot for this nice report. Let me summarize the situation a bit
first, before commenting on your findings.
On our main "berlin" build farm, we currently have a few aarch64
machines:
* overdrive.guixsd.org
* dover.guix.info
* dmitri.tobias.gr
* sergei.tobias.gr
We are also emulating aarch64 and armhf builds on around 20 x86_64
machines (hydra-guix-101 ... hydra-guix-120 roughly).
Regarding the physical machines, the overdrive seems like a modest
machine (in comparison to the x86 machines we have), and I don't know
about the other three.
For the emulation, even though it's way, way slower than real hardware,
those 20 machines are really powerful and should be able to bring us
good substitute coverage.
The current situation is that, due to Cuirass/offloading issues such as
[1], our build farm is idle most of the time. Given our computational
power, we should be able to bake many more substitutes, I think.
Maybe we could also take advantage of the build coordinator Christopher
is implementing (plus Guix daemon RPCs over HTTP) to make sure that we
are able to deal with a distributed build farm efficiently.
Now the question I'm asking myself is: could the ARM substitutes
situation be solved by improving our CI software stack, or do we really
need more hardware?
Hard to answer right now, but maybe the best way to proceed would be to
make sure that we are using the hardware we have close to its limit
first and then consider buying more.
> Neither VPSes nor dedicated build machines for ARM are cheap if they are
> decently powerful. As a first step, I could try to reach out to Linaro
> and ask whether they are willing to provide us some VMs.
Yep, that would be great to explore this option!
Concerning the other options you proposed: they are indeed expensive
and, as I said, at this point our CI software stack would not be able to
use them efficiently, which would be sad.
What do other people think?
Thanks,
Mathieu
[1]: https://issues.guix.gnu.org/34033
* Improving CI throughput
From: Ludovic Courtès @ 2020-08-24 14:42 UTC (permalink / raw)
To: Mathieu Othacehe; +Cc: Guix-devel
Hi,
Mathieu Othacehe <othacehe@gnu.org> skribis:
> The current situation is that, due to Cuirass/offloading issues such as
> [1], our build farm is idle most of the time. Given our computational
> power, we should be able to bake many more substitutes, I think.
>
> Maybe we could also take advantage of the build coordinator Christopher
> is implementing (plus Guix daemon RPCs over HTTP) to make sure that we
> are able to deal with a distributed build farm efficiently.
>
> Now the question I'm asking myself is: could the ARM substitutes
> situation be solved by improving our CI software stack, or do we really
> need more hardware?
Yeah, this is a ridiculous situation. We should do a hackathon to get
better monitoring of useful metrics (machine load,
time-of-push-to-time-to-build-completion, etc.), to clearly identify the
bottlenecks (crashes? inefficient protocol? scheduling issues? Cuirass
or offload or guix-daemon issue?), and to address as many of them as we
can.
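The push-to-completion metric mentioned above can be computed as a
simple timestamp difference once the events are recorded; here is a
minimal sketch (the function name and timestamp layout are hypothetical,
not Cuirass's actual schema):

```python
from datetime import datetime

def push_to_completion_seconds(pushed_at: str, completed_at: str) -> float:
    """Latency between a git push and the completion of the
    corresponding build, given ISO-8601 timestamps with UTC offsets."""
    fmt = "%Y-%m-%dT%H:%M:%S%z"
    start = datetime.strptime(pushed_at, fmt)
    end = datetime.strptime(completed_at, fmt)
    return (end - start).total_seconds()

# Example: a push at 10:31 UTC whose build finished at 12:55 UTC.
latency = push_to_completion_seconds("2020-07-28T10:31:00+0000",
                                     "2020-07-28T12:55:00+0000")
print(latency / 3600)  # -> 2.4 (hours)
```

Aggregating this per architecture over time would show directly whether
the bottleneck is crashes, scheduling, or raw build capacity.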
Any volunteers? :-)
Ludo’.
* Re: Improving CI throughput
From: John Soo @ 2020-08-24 14:57 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: Guix-devel, Mathieu Othacehe
Hi Ludo and Guix,
I am not sure how much time I can devote to the problem, but BPF now works in Guix. The bpftrace scripting language is there, if it might help.
Hope that helps a little,
John
* Re: Improving CI throughput
From: Mathieu Othacehe @ 2020-08-25 13:32 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: Guix-devel
Hey,
> Yeah, this is a ridiculous situation. We should do a hackathon to get
> better monitoring of useful metrics (machine load,
> time-of-push-to-time-to-build-completion, etc.), to clearly identify the
> bottlenecks (crashes? inefficient protocol? scheduling issues? Cuirass
> or offload or guix-daemon issue?), and to address as many of them as we
> can.
>
> Any volunteers? :-)
I'd really like to improve the situation! A hackathon seems like a
nice idea.
As a matter of fact, I already spent some time improving the stability
of the Cuirass web interface[1].
Now I can see multiple topics that could be approached in parallel:
* Add metrics to Cuirass as you suggested. There's an open ticket about
that here[2].
* Investigate offloading issues[3].
* Fix database contention[4].
* Fix guix-daemon deadlocking[5].
* Monitor closely what's happening on Berlin and decide if it is
opportune to add a build scheduler mechanism somewhere. See what Hydra
is doing[6] and what Chris is proposing[7].
As most of the issues are only observed on the Berlin machines, access
to which is restricted, we will also have to find a way to reproduce
them locally.
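On the scheduler point in the last item: the core of such a mechanism
can be prototyped independently of Cuirass. Here is a toy least-loaded
dispatcher; the machine and job names are hypothetical, and nothing here
reflects Hydra's or Chris's actual design:

```python
def dispatch(jobs, machines):
    """Assign each job to the machine with the fewest pending jobs:
    a simplistic least-loaded scheduling strategy."""
    load = {m: 0 for m in machines}  # pending jobs per machine
    assignment = {}
    for job in jobs:
        target = min(load, key=load.get)  # least-loaded machine wins
        assignment[job] = target
        load[target] += 1
    return assignment

# Hypothetical names, loosely modeled on the build farm.
machines = ["overdrive1", "hydra-guix-101", "hydra-guix-102"]
jobs = [f"package-{n}" for n in range(7)]
plan = dispatch(jobs, machines)
```

A real scheduler would of course weigh machine speed, architecture
support, and current build durations rather than a plain pending count.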
Anyway, if some people are motivated, we could try to plan a day or
week-end to work on those topics :).
Thanks,
Mathieu
[1]: https://issues.guix.gnu.org/42548
[2]: https://issues.guix.gnu.org/32548
[3]: https://issues.guix.gnu.org/34033
[4]: https://issues.guix.gnu.org/42001
[5]: https://issues.guix.gnu.org/31785
[6]: https://github.com/NixOS/hydra/blob/master/src/hydra-queue-runner/dispatcher.cc
[7]: https://lists.gnu.org/archive/html/guix-devel/2020-04/msg00323.html
* Re: Improving CI throughput
From: Ricardo Wurmus @ 2020-08-25 17:44 UTC (permalink / raw)
To: Mathieu Othacehe; +Cc: guix-devel
Mathieu Othacehe <othacehe@gnu.org> writes:
> As most of the issues are only observed on the Berlin machines, access
> to which is restricted, we will also have to find a way to reproduce
> them locally.
You can access all Berlin build nodes from the head node at
ci.guix.gnu.org, either as “root” or “hydra” (both with root’s pubkey).
If you need more access please let me know.
--
Ricardo
* Re: Improving CI throughput
From: Ludovic Courtès @ 2020-08-28 13:51 UTC (permalink / raw)
To: Mathieu Othacehe; +Cc: Guix-devel
Hello!
Mathieu Othacehe <othacehe@gnu.org> skribis:
>> Yeah, this is a ridiculous situation. We should do a hackathon to get
>> better monitoring of useful metrics (machine load,
>> time-of-push-to-time-to-build-completion, etc.), to clearly identify the
>> bottlenecks (crashes? inefficient protocol? scheduling issues? Cuirass
>> or offload or guix-daemon issue?), and to address as many of them as we
>> can.
>>
>> Any volunteers? :-)
>
> I'd really like to improve the situation! A hackathon seems like a
> nice idea.
>
> As a matter of fact, I already spent some time improving the stability
> of the Cuirass web interface[1].
Much appreciated!
> Now I can see multiple topics that could be approached in parallel:
>
> * Add metrics to Cuirass as you suggested. There's an open ticket about
> that here[2].
>
> * Investigate offloading issues[3].
>
> * Fix database contention[4].
>
> * Fix guix-daemon deadlocking[5].
>
> * Monitor closely what's happening on Berlin and decide if it is
> opportune to add a build scheduler mechanism somewhere. See what Hydra
> is doing[6] and what Chris is proposing[7].
I’m happy to help tackle daemon/offload issues, but I’ll be more
motivated if others join. :-)
> As most of the issues are only observed on the Berlin machines, access
> to which is restricted, we will also have to find a way to reproduce
> them locally.
Yeah, and these are usually non-deterministic issues and not that
frequent.
> Anyway, if some people are motivated, we could try to plan a day or
> week-end to work on those topics :).
I can try and spend some time on it this week-end. I suggest that
people join the IRC channel and shout “CI!” as a way to rally, and then
share what they’re looking at and how they feel. How does that sound?
Thanks for cooking up this list of issues!
Ludo’.