From mboxrd@z Thu Jan 1 00:00:00 1970
From: Maxim Cournoyer
To: 43773@debbugs.gnu.org
Cc: Maxim Cournoyer
Subject: bug#43773: [PATCH] offload: Improve load normalization and configurability.
Date: Sat, 3 Oct 2020 23:21:12 -0400
Message-Id: <20201004032112.5916-1-maxim.cournoyer@gmail.com>
In-Reply-To: <875z7sm2kv.fsf@gmail.com>
References: <875z7sm2kv.fsf@gmail.com>
X-Mailer: git-send-email 2.28.0
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Fixes .

The computed normalized load was previously obtained by dividing the
load average as found in /proc/loadavg by the number of parallel builds
defined for a build machine.  This normalization did not allow comparing
machines with different numbers of cores, as the reported load average
can be as high as the number of cores; thus comparing that value to a
fixed threshold of 2.0 would mean machines with multiple cores were more
likely to be flagged as overloaded compared to single core machines.
This can be fixed by normalizing using the available number of cores
instead of the number of parallel jobs.

* guix/scripts/offload.scm ()[overload-threshold]: New field.
(node-load): Modify to return a normalized load value between 0 and 1,
taking into account the number of cores available.
(normalized-load): Remove procedure.
(report-load): New procedure.
(choose-build-machine): Adjust to use the modified 'node-load' and the
new 'report-load' and 'build-machine-overload-threshold' procedures.
(check-machine-status): Adjust.
* doc/guix.texi (Daemon Offload Setup): Document the offload scheduler
and the new 'overload-threshold' field.
---
 doc/guix.texi            | 30 +++++++++++++++++++++-
 guix/scripts/offload.scm | 54 ++++++++++++++++++++++++----------------
 2 files changed, 62 insertions(+), 22 deletions(-)

diff --git a/doc/guix.texi b/doc/guix.texi
index a6260a12aa..1d5adbeb63 100644
--- a/doc/guix.texi
+++ b/doc/guix.texi
@@ -1081,7 +1081,28 @@ architecture natively supports it, via emulation (@pxref{Transparent
 Emulation with QEMU}), or both.  Missing prerequisites for the build are
 copied over SSH to the target machine, which then proceeds with the
 build; upon success the output(s) of the build are copied back to the
-initial machine.
+initial machine.  The offload facility comes with a basic scheduler that
+attempts to select the best machine.  The best machine is chosen among
+the available machines based on criteria such as:
+
+@enumerate
+@item
+The availability of a build slot.  A build machine can have as many
+build slots (connections) as the value of the @code{parallel-builds}
+field of its @code{build-machine} object.
+
+@item
+Its relative speed, as defined via the @code{speed} field of its
+@code{build-machine} object.
+
+@item
+Its load.  The normalized machine load must be lower than a threshold
+value, configurable via the @code{overload-threshold} field of its
+@code{build-machine} object.
+
+@item
+Disk space availability.  More than 100 MiB must be available.
+@end enumerate
 
 The @file{/etc/guix/machines.scm} file typically looks like this:
 
@@ -1185,6 +1206,13 @@ when transferring files to and from build machines.
 File name of the Unix-domain socket @command{guix-daemon} is listening
 to on that machine.
 
+@item @code{overload-threshold} (default: @code{0.6})
+The load threshold above which a potential offload machine is
+disregarded by the offload scheduler.  The value roughly translates to
+the total processor usage of the build machine, ranging from 0.0 (0%) to
+1.0 (100%).  It can also be disabled by setting
+@code{overload-threshold} to @code{#f}.
+
 @item @code{parallel-builds} (default: @code{1})
 The number of builds that may run in parallel on the machine.
 
diff --git a/guix/scripts/offload.scm b/guix/scripts/offload.scm
index 3dc8ccefcb..a5fe98b675 100644
--- a/guix/scripts/offload.scm
+++ b/guix/scripts/offload.scm
@@ -88,6 +88,10 @@
                    (default 3))
   (daemon-socket   build-machine-daemon-socket    ; string
                    (default "/var/guix/daemon-socket/socket"))
+  ;; A #f value tells the offload scheduler to disregard the load of the build
+  ;; machine when selecting the best offload machine.
+  (overload-threshold build-machine-overload-threshold ; inexact real between
+                      (default 0.6))                   ; 0.0 and 1.0 | #f
   (parallel-builds build-machine-parallel-builds  ; number
                    (default 1))
   (speed           build-machine-speed            ; inexact real
@@ -391,30 +395,34 @@ of free disk space on '~a'~%")
   (* 100 (expt 2 20)))                            ;100 MiB
 
 (define (node-load node)
-  "Return the load on NODE.  Return +∞ if NODE is misbehaving."
+  "Return the load on NODE, a normalized value between 0.0 and 1.0.  The value
+is derived from /proc/loadavg and normalized according to the number of
+logical cores available, to give a rough estimation of CPU usage.  Return
+1.0 (fully loaded) if NODE is misbehaving."
   (let ((line (inferior-eval '(begin
                                 (use-modules (ice-9 rdelim))
                                 (call-with-input-file "/proc/loadavg"
                                   read-string))
-                             node)))
-    (if (eof-object? line)
-        +inf.0 ;MACHINE does not respond, so assume it is infinitely loaded
+                             node))
+        (ncores (inferior-eval '(begin
+                                  (use-modules (ice-9 threads))
+                                  (current-processor-count))
+                               node)))
+    (if (or (eof-object? line) (eof-object? ncores))
+        1.0    ;MACHINE does not respond, so assume it is fully loaded
         (match (string-tokenize line)
           ((one five fifteen . x)
-           (string->number one))
+           (let ((load (/ (string->number one) ncores)))
+             (if (> load 1.0)
+                 1.0
+                 load)))
           (x
-           +inf.0)))))
-
-(define (normalized-load machine load)
-  "Divide LOAD by the number of parallel builds of MACHINE."
-  (if (rational? load)
-      (let* ((jobs       (build-machine-parallel-builds machine))
-             (normalized (/ load jobs)))
-        (format (current-error-port) "load on machine '~a' is ~s\
- (normalized: ~s)~%"
-                (build-machine-name machine) load normalized)
-        normalized)
-      load))
+           1.0)))))
+
+(define (report-load machine load)
+  (format (current-error-port)
+          "normalized load on machine '~a' is ~,2f~%"
+          (build-machine-name machine) load))
 
 (define (random-seed)
   (logxor (getpid) (car (gettimeofday))))
@@ -472,11 +480,15 @@ slot (which must later be released with 'release-build-slot'), or #f and #f."
          (let* ((session (false-if-exception (open-ssh-session best
                                                                %short-timeout)))
                 (node    (and session (remote-inferior session)))
-                (load    (and node (normalized-load best (node-load node))))
+                (load    (and node (node-load node)))
+                (threshold (build-machine-overload-threshold best))
                 (space   (and node (node-free-disk-space node))))
+           (when load (report-load best load))
            (when node (close-inferior node))
            (when session (disconnect! session))
-           (if (and node (< load 2.) (>= space %minimum-disk-space))
+           (if (and node
+                    (or (not threshold) (< load threshold))
+                    (>= space %minimum-disk-space))
                (match others
                  (((machines slots) ...)
                   ;; Release slots from the uninteresting machines.
@@ -708,13 +720,13 @@ machine."
                             (free (node-free-disk-space inferior)))
                         (close-inferior inferior)
                         (format #t "~a~%  kernel: ~a ~a~%  architecture: ~a~%\
-  host name: ~a~%  normalized load: ~a~%  free disk space: ~,2f MiB~%\
+  host name: ~a~%  normalized load: ~,2f~%  free disk space: ~,2f MiB~%\
   time difference: ~a s~%"
                                 (build-machine-name machine)
                                 (utsname:sysname uts) (utsname:release uts)
                                 (utsname:machine uts)
                                 (utsname:nodename uts)
-                                (normalized-load machine load)
+                                load
                                 (/ free (expt 2 20) 1.)
                                 (- time now))))))))
 
-- 
2.28.0
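
For illustration only, here is how a build machine entry using the new
field might look in /etc/guix/machines.scm.  This is a sketch, not part
of the patch: the host name, host key, user and numeric values are
placeholders, and the fields other than 'overload-threshold' are assumed
to follow the existing build-machine examples in the manual.

```scheme
;; Hypothetical /etc/guix/machines.scm entry.  Only 'overload-threshold'
;; is introduced by this patch; every value below is a placeholder.
(list (build-machine
        (name "build1.example.org")           ;placeholder host name
        (systems (list "x86_64-linux"))
        (host-key "ssh-ed25519 AAAA...")      ;placeholder host key
        (user "offload")                      ;placeholder account
        (parallel-builds 4)
        (speed 2.)
        ;; Skip this machine once its 1-minute load average divided by
        ;; its number of cores reaches 0.8.  Use #f to ignore the load.
        (overload-threshold 0.8)))
```

With such an entry, a 16-core machine reporting a 1-minute load average
of 8.0 gets a normalized load of 8.0 / 16 = 0.5 and remains eligible.
Under the previous scheme, with 4 parallel builds, the same machine
would have scored 8.0 / 4 = 2.0 and failed the fixed (< load 2.) check.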