From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: [bug#43552] [PATCH] Add watchdog support.
From: Mathieu Othacehe
To: 43552@debbugs.gnu.org
Cc: Mathieu Othacehe
Date: Mon, 21 Sep 2020 18:49:08 +0200
Message-Id: <20200921164908.1396570-1-othacehe@gnu.org>
X-Mailer: git-send-email 2.28.0

* src/cuirass/watchdog.scm: New file.
* Makefile.am (dist_pkgmodule_DATA): Add it.
* src/cuirass/utils.scm (with-timeout, get-message-with-timeout): Export them.
* bin/cuirass.in (main): Start the watchdog.
---

Hello,

I've noticed that all Cuirass fibers sometimes seem to be stuck for
several hours. I suspect that this is because a fiber is performing a
blocking action, which prevents its scheduler from running other fibers.

This patch adds a watchdog that prints a warning message whenever a
Fibers scheduler hangs for more than 5 seconds. The next step will be to
identify the faulty fiber.
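
To make the "hangs for more than 5 seconds" check concrete, here is a
minimal sketch of the comparison involved. It is plain Guile with no
Fibers dependency; `stale-pings` and the timestamps are hypothetical
names and values chosen for illustration, not part of the patch.

```scheme
(use-modules (srfi srfi-1)
             (ice-9 match))

;; Hypothetical helper, not part of the patch.  PINGS is an alist of
;; (SCHEDULER-NAME . LAST-PING-TIME).  Return the entries whose last ping is
;; more than TIMEOUT seconds older than NOW, as (SCHEDULER-NAME . AGE) pairs.
(define (stale-pings pings now timeout)
  (filter-map
   (match-lambda
     ((scheduler . time)
      (let ((age (- now time)))
        ;; 'filter-map' drops entries for which this returns #f.
        (and (> age timeout)
             (cons scheduler age)))))
   pings))

;; Made-up timestamps: at time 100, scheduler 1 last pinged 2 seconds ago
;; and scheduler 2 last pinged 12 seconds ago.
(define example-pings '((1 . 98) (2 . 88)))
(define example-stale (stale-pings example-pings 100 5))
```

With a 5 second timeout only scheduler 2 is reported here. The patch
prints a log message at this point rather than returning a list.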
Thanks,

Mathieu

 Makefile.am              |  5 ++-
 bin/cuirass.in           |  3 +-
 src/cuirass/utils.scm    |  3 ++
 src/cuirass/watchdog.scm | 81 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 89 insertions(+), 3 deletions(-)
 create mode 100644 src/cuirass/watchdog.scm

diff --git a/Makefile.am b/Makefile.am
index 0e4d3c8..2c7e6e2 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -50,8 +50,9 @@ dist_pkgmodule_DATA = \
   src/cuirass/metrics.scm \
   src/cuirass/send-events.scm \
   src/cuirass/ui.scm \
-  src/cuirass/utils.scm \
-  src/cuirass/templates.scm
+  src/cuirass/utils.scm \
+  src/cuirass/templates.scm \
+  src/cuirass/watchdog.scm
 
 nodist_pkgmodule_DATA = \
   src/cuirass/config.scm
diff --git a/bin/cuirass.in b/bin/cuirass.in
index ed21ed7..7f4793e 100644
--- a/bin/cuirass.in
+++ b/bin/cuirass.in
@@ -31,6 +31,7 @@ exec ${GUILE:-@GUILE@} --no-auto-compile -e main -s "$0" "$@"
              (cuirass logging)
              (cuirass metrics)
              (cuirass utils)
+             (cuirass watchdog)
              (guix ui)
              ((guix build utils) #:select (mkdir-p))
              (fibers)
@@ -142,7 +143,7 @@ exec ${GUILE:-@GUILE@} --no-auto-compile -e main -s "$0" "$@"
                (if one-shot?
                    (process-specs (db-get-specifications))
                    (let ((exit-channel (make-channel)))
-
+                     (start-watchdog)
                      (if (option-ref opts 'web #f)
                          (begin
                            (spawn-fiber
diff --git a/src/cuirass/utils.scm b/src/cuirass/utils.scm
index 00cfef6..7ce4b83 100644
--- a/src/cuirass/utils.scm
+++ b/src/cuirass/utils.scm
@@ -37,6 +37,9 @@
             define-enumeration
             unwind-protect
 
+            with-timeout
+            get-message-with-timeout
+
             make-worker-thread-channel
             call-with-worker-thread
             with-worker-thread
diff --git a/src/cuirass/watchdog.scm b/src/cuirass/watchdog.scm
new file mode 100644
index 0000000..b912a9f
--- /dev/null
+++ b/src/cuirass/watchdog.scm
@@ -0,0 +1,81 @@
+;;; watchdog.scm -- Monitor fibers scheduling.
+;;; This file is part of Cuirass.
+;;;
+;;; Cuirass is free software: you can redistribute it and/or modify
+;;; it under the terms of the GNU General Public License as published by
+;;; the Free Software Foundation, either version 3 of the License, or
+;;; (at your option) any later version.
+;;;
+;;; Cuirass is distributed in the hope that it will be useful,
+;;; but WITHOUT ANY WARRANTY; without even the implied warranty of
+;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+;;; GNU General Public License for more details.
+;;;
+;;; You should have received a copy of the GNU General Public License
+;;; along with Cuirass. If not, see .
+
+(define-module (cuirass watchdog)
+  #:use-module (cuirass logging)
+  #:use-module (cuirass utils)
+  #:use-module (fibers)
+  #:use-module (fibers channels)
+  #:use-module (fibers internal)
+  #:use-module (fibers operations)
+  #:use-module (ice-9 match)
+  #:use-module (ice-9 threads)
+  #:export (start-watchdog))
+
+(define* (watchdog-fiber scheduler channel
+                         #:key
+                         timeout)
+  "Spawn a fiber running on SCHEDULER that sends over CHANNEL, every TIMEOUT
+seconds, the scheduler name and the current time."
+  (spawn-fiber
+   (lambda ()
+     (while #t
+       (put-message channel (list (scheduler-name scheduler)
+                                  (current-time)))
+       (sleep timeout)))
+   scheduler))
+
+(define* (start-watchdog #:key (timeout 5))
+  "Start a watchdog checking that each Fibers scheduler is not blocked for
+more than TIMEOUT seconds.
+
+The watchdog mechanism consists in spawning a dedicated fiber per running
+Fiber scheduler, using the above watchdog-fiber method. Those fibers send a
+ping signal periodically to a separate thread. If no signal is received from
+one of the schedulers for more than TIMEOUT seconds, a warning message is
+printed."
+  (define (check-timeouts pings)
+    (for-each
+     (match-lambda
+       ((scheduler . time)
+        (let* ((cur-time (current-time))
+               (diff-ping (- cur-time time)))
+          (when (> diff-ping timeout)
+            (log-message "Scheduler ~a blocked since ~a seconds."
+                         scheduler diff-ping)))))
+     pings))
+
+  (let ((watchdog-channel (make-channel)))
+    (parameterize (((@@ (fibers internal) current-fiber) #f))
+      (call-with-new-thread
+       (lambda ()
+         (let loop ((pings '()))
+           (let ((operation-timeout 2))
+             (match (perform-operation
+                     (with-timeout
+                      (get-operation watchdog-channel)
+                      #:seconds operation-timeout
+                      #:wrap (const 'timeout)))
+               ((scheduler ping)
+                (loop (assq-set! pings scheduler ping)))
+               ('timeout
+                (check-timeouts pings)
+                (loop pings))))))))
+    (fold-all-schedulers
+     (lambda (name scheduler seed)
+       (watchdog-fiber scheduler watchdog-channel
+                       #:timeout timeout))
+     '())))
-- 
2.28.0
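
The watchdog thread above keeps one entry per scheduler by passing each
incoming ping through `assq-set!`. The sketch below isolates that
bookkeeping so it can be tried in a plain Guile REPL without Fibers.
`latest-pings` and the sample messages are hypothetical and not part of
the patch; it also assumes scheduler names can be compared with `eq?`
(small integers here), which is what `assq-set!` relies on.

```scheme
(use-modules (srfi srfi-1)
             (ice-9 match))

;; Hypothetical helper, not part of the patch.  MESSAGES is a list of
;; (SCHEDULER TIME) pings in arrival order.  Return an alist that keeps only
;; the most recent TIME for each scheduler.
(define (latest-pings messages)
  (fold (lambda (message pings)
          (match message
            ((scheduler time)
             ;; 'assq-set!' updates the existing entry for SCHEDULER, or adds
             ;; a new one if SCHEDULER has not been seen yet.
             (assq-set! pings scheduler time))))
        '()
        messages))

;; Made-up pings: scheduler 1 reports twice, scheduler 2 only once.
(define example-latest
  (latest-pings '((1 100) (2 100) (1 105))))

(define scheduler-1-last (assq-ref example-latest 1))
(define scheduler-2-last (assq-ref example-latest 2))
```

After the three pings the alist holds two entries: time 105 for
scheduler 1 and time 100 for scheduler 2. A scheduler that stops
pinging keeps its old timestamp, which is what lets the timeout check
detect it.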