From: Andreas Enge
To: guix-devel@gnu.org
Subject: Guix build coordinator and cgroups
Date: Fri, 7 Jun 2024 16:08:46 +0200

Hello,

when trying to run a Guix build agent in a Docker container on OpenShift
with a colleague, assigning 8 of the 128 cores of the physical machine,
the agent was completely choked, since it started all builds with
commands such as "make -j 128".

The 128 is determined by a call to the Guile function
current-processor-count, which calls nproc from coreutils (see
"man nproc"). This works on bare metal and in virtual machines, but not
in containers or, more generally, when cgroups are used to limit the
number of cores.

Additionally, but less crucially, this probably makes the
max-1min-load-average parameter of
guix-build-coordinator-agent-configuration completely useless: in the
example, the machine could have a load of 120 on the other cores while
the part attached to the build agent is idle.

This can be worked around by passing extra arguments by hand, such as
"--cores=8" to the guix daemon service, and adapting max-parallel-builds
of the build agent service. Still, it would be nice to have a more
automated approach (for instance, when changing the number of assigned
cores in OpenShift, one does not want to recreate a Docker container
with new manual parameters).

Here is how far we got concerning a potential solution.

When cgroups are available, the file /sys/fs/cgroup/cpu.pressure
contains some measure of load congestion:

   some avg10=8.28 avg60=5.50 avg300=2.11 total=365519361
   full avg10=0.00 avg60=0.00 avg300=0.00 total=0

Its contents are described here:
   https://www.kernel.org/doc/html/latest/accounting/psi.html#psi

The "full" line is meaningless. I am not exactly sure what is measured
by the "some" line: it is not the load, but a percentage of time during
which "some tasks are stalled on a given resource".

It looks like the max-1min-load-average of the build agent service could
be replaced by a threshold for the avg60 value of this file. To obtain
the current value, the libcgroup library, which is already available in
Guix, can be used; we may need to write Guile bindings.

I suppose that the number of available cores can be determined in a
similar manner.

What do you think?

Andreas
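
As a rough illustration of the two ideas above, here is a minimal sketch
in Python (chosen only for brevity; an actual implementation would
presumably be in Guile). It parses the cpu.pressure format quoted above
and compares avg60 of the "some" line against a threshold. For the core
count it assumes a cgroup v2 hierarchy mounted at /sys/fs/cgroup and
reads the file cpu.max, which is not mentioned above: it holds either
"max" or a quota in microseconds, followed by the period. All function
names and threshold values are hypothetical and not part of the build
coordinator.

```python
import math
from pathlib import Path

# Assumption: cgroup v2 is mounted here, as in the path quoted above.
CGROUP_ROOT = Path("/sys/fs/cgroup")


def read_cgroup_file(name):
    """Return the contents of a cgroup file, or None if it is missing."""
    try:
        return (CGROUP_ROOT / name).read_text()
    except OSError:
        return None


def parse_cpu_pressure(text):
    """Parse PSI text into e.g. {"some": {"avg10": 8.28, ...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        result[kind] = {
            key: float(value)
            for key, value in (pair.split("=", 1) for pair in pairs)
        }
    return result


def pressure_below_threshold(pressure_text, threshold):
    """True if the 60 s "some" average is below the (hypothetical) threshold."""
    return parse_cpu_pressure(pressure_text)["some"]["avg60"] < threshold


def parse_cpu_max(text):
    """Return the CPU quota in cores from cpu.max, or None if unlimited."""
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def effective_core_count(cpu_max_text, visible_cores):
    """Cap the visible core count by the cgroup quota, with a minimum of 1."""
    limit = parse_cpu_max(cpu_max_text)
    if limit is None:
        return visible_cores
    return max(1, min(visible_cores, math.ceil(limit)))
```

With the sample shown above, pressure_below_threshold(sample, 10.0) is
true, since avg60 is 5.50. A cpu.max of "800000 100000" on a machine
with 128 visible cores gives effective_core_count(...) = 8. Whether
cpu.max is the right source depends on how the container runtime limits
CPUs (quota or cpuset), so this would need checking on OpenShift.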