unofficial mirror of guix-devel@gnu.org 
 help / color / mirror / code / Atom feed
* Substitute timeouts
@ 2021-08-09 10:28 Mathieu Othacehe
  2021-08-11 11:06 ` Ludovic Courtès
  0 siblings, 1 reply; 3+ messages in thread
From: Mathieu Othacehe @ 2021-08-09 10:28 UTC (permalink / raw)
  To: guix-devel

[-- Attachment #1: Type: text/plain, Size: 2326 bytes --]


Hello,

I have been investigating a problem that is visible both on the main
guix publish server at https://ci.guix.gnu.org[1] and on the Cuirass
build farm[2].

This error comes from the fact that the publish server does not accept
the "guix substitute" connection requests within the %fetch-timeout
duration of 5 seconds.

The main guix publish server is using a cache. If a requested narinfo is
not in the cache, it will be baked and the client receives a 404
error. Since ecaa102a58ad3ab0b42e04a3d10d7c761c05ec98 and the
introduction of the bypass mechanism, small store items are directly
returned.

This means that the "narinfo-string" procedure can be called directly in
the main publish thread. Running perf on the main publish server reveals
that this procedure can be really expensive under IO pressure (GC
running for example) because it opens a lot of files. I have observed
that the "read-derivation-from-file" call can take up to 600 ms.

If multiple clients were to ask narinfo of several items not yet cached,
under IO pressure, I think that the publish server could become
unresponsive and cause the timeout errors.

The fact that Cuirass triggers the baking of successfully built
derivations probably doesn't help here.

Now regarding the timeout errors that are much more frequent on the
Cuirass build farm, the cause varies a bit. The Cuirass publish server
running on Berlin does not use a cache. This means that the
"narinfo-string" procedure is called for each request, in the main
thread.

To fix those issues, a solution could be to run the "narinfo-string" in
a separate thread, but it will make the publish server code even harder
to understand. My proposition would be to get rid of the bypass
mechanism and instead implement a retry when some substitutes are
reported as being baked, as proposed by Miguel[3].

I think this is the most reasonable solution. This way, users won't
receive 404 errors and start building substitutes that are being
baked[4].

It will also allow the Cuirass build farm to use directly the main guix
publish server, simplifying the current CI setup.

There's a proposed patch attached, WDYT?

Thanks,

Mathieu

[1]: https://issues.guix.gnu.org/49089
[2]: https://issues.guix.gnu.org/48468
[3]: https://issues.guix.gnu.org/44193#2
[4]: http://issues.guix.gnu.org/33370


[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 8394 bytes --]

From e3c50c6d70adf71b8e84e6895b625f3d432b10dd Mon Sep 17 00:00:00 2001
From: Mathieu Othacehe <othacehe@gnu.org>
Date: Mon, 9 Aug 2021 11:25:41 +0200
Subject: [PATCH 1/3] narinfo: Add narinfo-promise support.

* guix/narinfo.scm (<narinfo-promise>): New record.
(%make-narinfo-promise, narinfo-promise?, narinfo-promise-path): New
procedures.
---
 guix/narinfo.scm | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/guix/narinfo.scm b/guix/narinfo.scm
index 72e0f75fda..c0d4338a35 100644
--- a/guix/narinfo.scm
+++ b/guix/narinfo.scm
@@ -51,6 +51,10 @@
             narinfo-signature
             narinfo-contents
 
+            %make-narinfo-promise
+            narinfo-promise?
+            narinfo-promise-path
+
             narinfo-hash-algorithm+value
 
             narinfo-hash->sha256
@@ -89,6 +93,13 @@
   ;; for more information.
   (contents     narinfo-contents))
 
+;; This denotes a narinfo that is currently being baked by the publish server
+;; and for which an HTTP 202 return code has been issued.
+(define-record-type <narinfo-promise>
+  (%make-narinfo-promise path)
+  narinfo-promise?
+  (path         narinfo-promise-path))
+
 (define (narinfo-hash-algorithm+value narinfo)
   "Return two values: the hash algorithm used by NARINFO and its value as a
 bytevector."
-- 
2.32.0


From 7dc0f19aca1cf7da0f2586bb386b9b6837673581 Mon Sep 17 00:00:00 2001
From: Mathieu Othacehe <othacehe@gnu.org>
Date: Mon, 9 Aug 2021 11:27:02 +0200
Subject: [PATCH 2/3] scripts: publish: Send an HTTP 202 return code when
 baking a substitute.

* guix/scripts/publish.scm (baking): New procedure.
(render-narinfo/cached): Use it to return an HTTP 202 return code when a
substitute is being baked.
---
 guix/scripts/publish.scm | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/guix/scripts/publish.scm b/guix/scripts/publish.scm
index f35f81dc34..a26d83212b 100644
--- a/guix/scripts/publish.scm
+++ b/guix/scripts/publish.scm
@@ -379,6 +379,13 @@ References: ~a~%"
                       (canonical-sexp->string (signed-string info)))))
     (format #f "~aSignature: 1;~a;~a~%" info (gethostname) signature)))
 
+(define* (baking request
+                 #:key (phrase "We're baking it"))
+  "Render 202 response for REQUEST."
+  (values (build-response #:code 202)
+          (string-append phrase ": "
+                         (uri-path (request-uri request)))))
+
 (define* (not-found request
                     #:key (phrase "Resource not found")
                     ttl)
@@ -591,9 +598,8 @@ requested using POOL."
                                #:ttl 300          ;temporary
                                #:nar-path nar-path
                                #:compressions compressions)
-               (not-found request
-                          #:phrase "We're baking it"
-                          #:ttl 300)))          ;should be available within 5m
+               ;; See: https://issues.guix.gnu.org/44193.
+               (baking request)))
           (else
            (not-found request #:phrase "" #:ttl negative-ttl)))))
 
-- 
2.32.0


From 3f21f60a65f4b66c3c753160db1a68f2dc485d2c Mon Sep 17 00:00:00 2001
From: Mathieu Othacehe <othacehe@gnu.org>
Date: Mon, 9 Aug 2021 11:36:24 +0200
Subject: [PATCH 3/3] substitutes: Retry to fetch narinfos being baked.

guix/substitutes.scm (%nar-baking-delay): New variable.
(fetch-narinfos): Return <narinfo-promise> records for the narinfos that are
currently being fetched.
(lookup-narinfos): Retry to fetch the narinfos that are currently being baked,
after a delay. Do not return until there are no more narinfos being baked.
---
 guix/substitutes.scm | 58 ++++++++++++++++++++++++++++++++------------
 1 file changed, 43 insertions(+), 15 deletions(-)

diff --git a/guix/substitutes.scm b/guix/substitutes.scm
index a5c554acff..3a2fdb124d 100644
--- a/guix/substitutes.scm
+++ b/guix/substitutes.scm
@@ -79,6 +79,11 @@
   ;; Likewise, but for transient errors such as 504 ("Gateway timeout").
   (* 5 60))
 
+(define %nar-baking-delay
+  ;; Number of seconds to wait before asking again a narinfo that has been
+  ;; declared as being baked (202).
+  30)
+
 (define %narinfo-cache-directory
   ;; A local cache of narinfos, to avoid going to the network.  Most of the
   ;; time, 'guix substitute' is called by guix-daemon as root and stores its
@@ -180,7 +185,9 @@ if file doesn't exist, and the narinfo otherwise."
                          (open-connection guix:open-connection-for-uri)
                          (make-progress-reporter
                           (const progress-reporter/silent)))
-  "Retrieve all the narinfos for PATHS from the cache at URL and return them."
+  "Retrieve all the narinfos for PATHS from the cache at URL and return them
+as a list of <narinfo> and <narinfo-promise> records.  The <narinfo-promises>
+records denote narinfos that are currently being baked by the publish server."
   (define progress-reporter
     (make-progress-reporter (length paths)
                             #:url url))
@@ -223,11 +230,15 @@ if file doesn't exist, and the narinfo otherwise."
             (if len
                 (get-bytevector-n port len)
                 (read-to-eof port))
-            (cache-narinfo! url (hash-part->path hash-part) #f
-                            (if (or (= 404 code) (= 202 code))
-                                ttl
-                                %narinfo-transient-error-ttl))
-            result))))
+            (if (= code 202)
+                (cons (%make-narinfo-promise (hash-part->path hash-part))
+                      result)
+                (begin
+                  (cache-narinfo! url (hash-part->path hash-part) #f
+                                  (if (= 404 code)
+                                      ttl
+                                      %narinfo-transient-error-ttl))
+                  result))))))
 
   (define (do-fetch uri)
     (case (and=> uri uri-scheme)
@@ -300,7 +311,9 @@ for PATH."
                           (make-progress-reporter
                            (const progress-reporter/silent)))
   "Return the narinfos for PATHS, invoking the server at CACHE when no
-information is available locally."
+information is available locally.  If some narinfos are currently being baked,
+retry to fetch after a delay.  Do not return until there are no more narinfos
+being baked."
   (let-values (((cached missing)
                 (fold2 (lambda (path cached missing)
                          (let-values (((valid? value)
@@ -313,14 +326,29 @@ information is available locally."
                        '()
                        '()
                        paths)))
-    (values (if (null? missing)
-                cached
-                (let ((missing (fetch-narinfos cache missing
-                                               #:open-connection open-connection
-                                               #:make-progress-reporter
-                                               make-progress-reporter)))
-                  (append cached (or missing '()))))
-            (length missing))))
+    (values
+     (if (null? missing)
+         cached
+         (let* ((missing (fetch-narinfos cache missing
+                                         #:open-connection open-connection
+                                         #:make-progress-reporter
+                                         make-progress-reporter))
+                (narinfos (filter narinfo? missing))
+                (promises (map narinfo-promise-path
+                               (filter narinfo-promise? missing))))
+           (append cached narinfos
+                   (if (null? promises)
+                       '()
+                       (begin
+                         (info (G_ "Some substitutes are being baked,\
+ retrying in ~a seconds.~%")
+                               %nar-baking-delay)
+                         (sleep %nar-baking-delay)
+                         (lookup-narinfos cache promises
+                                          #:open-connection open-connection
+                                          #:make-progress-reporter
+                                          make-progress-reporter))))))
+     (length missing))))
 
 (define* (lookup-narinfos/diverse caches paths authorized?
                                   #:key (open-connection
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: Substitute timeouts
  2021-08-09 10:28 Substitute timeouts Mathieu Othacehe
@ 2021-08-11 11:06 ` Ludovic Courtès
  2021-08-11 11:38   ` Mathieu Othacehe
  0 siblings, 1 reply; 3+ messages in thread
From: Ludovic Courtès @ 2021-08-11 11:06 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: guix-devel

Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

> I have been investigating a problem that is visible both on the main
> guix publish server at https://ci.guix.gnu.org[1] and on the Cuirass
> build farm[2].
>
> This error comes from the fact that the publish server does not accept
> the "guix substitute" connection requests within the %fetch-timeout
> duration of 5 seconds.

Thanks for getting to the bottom of this!

> The main guix publish server is using a cache. If a requested narinfo is
> not in the cache, it will be baked and the client receives a 404
> error. Since ecaa102a58ad3ab0b42e04a3d10d7c761c05ec98 and the
> introduction of the bypass mechanism, small store items are directly
> returned.
>
> This means that the "narinfo-string" procedure can be called directly in
> the main publish thread. Running perf on the main publish server reveals
> that this procedure can be really expensive under IO pressure (GC
> running for example) because it opens a lot of files. I have observed
> that the "read-derivation-from-file" call can take up to 600 ms.
>
> If multiple clients were to ask narinfo of several items not yet cached,
> under IO pressure, I think that the publish server could become
> unresponsive and cause the timeout errors.

Yeah, it’s a double-edged sword.  If this is a problem on the main ‘guix
publish’ server, we can lower the bypass threshold, which is currently
50 MiB:

  https://git.savannah.gnu.org/cgit/guix/maintenance.git/tree/hydra/modules/sysadmin/services.scm#n450

WDYT?

> The fact that Cuirass triggers the baking of successfully built
> derivations probably doesn't help here.

Could be.  This threshold seemed to work fine earlier (and still does,
mostly?).

> Now regarding the timeout errors that are much more frequent on the
> Cuirass build farm, the cause varies a bit. The Cuirass publish server
> running on Berlin does not use a cache. This means that the
> "narinfo-string" procedure is called for each request, in the main
> thread.
>
> To fix those issues, a solution could be to run the "narinfo-string" in
> a separate thread, but it will make the publish server code even harder
> to understand.

True!  Though maybe it wouldn’t be that much worse.  :-)

The problem is that this thing is very much single-threaded, with
exceptions in a couple of places.  We could add one more exception like
you write, or fiberize it, or run it behind nginx, possibly with a tiny
bit of caching.

> My proposition would be to get rid of the bypass mechanism and instead
> implement a retry when some substitutes are reported as being baked,
> as proposed by Miguel[3].
>
> I think this is the most reasonable solution. This way, users won't
> receive 404 errors and start building substitutes that are being
> baked[4].

(If I followed correctly, the bypass mechanism is not at fault regarding
timeouts on the Cuirass publish server since it’s not using a cache,
right?)

I don’t think it’s reasonable for ‘guix substitute’ to just wait upon
202 (or 404, that doesn’t matter).

First, in terms of UI, you’d have a command sitting there and doing
nothing, which can be off-putting.  Second, clients have no idea how
long they’re going to wait; it could be that the nar is going to be
baked within seconds, or it could take 20mn if the baking queue is
already crowded or if the user is asking for a big store item like
libreoffice.  Third, in many cases, building locally is likely to be
faster than waiting for substitutes to be available (the majority of
packages build very quickly, though the few most popular leaf packages
take a long time to build).

> It will also allow the Cuirass build farm to use directly the main guix
> publish server, simplifying the current CI setup.

The only reason why Cuirass runs its own publish server is to avoid
overloading the main one?

Thanks,
Ludo’.


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Substitute timeouts
  2021-08-11 11:06 ` Ludovic Courtès
@ 2021-08-11 11:38   ` Mathieu Othacehe
  0 siblings, 0 replies; 3+ messages in thread
From: Mathieu Othacehe @ 2021-08-11 11:38 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guix-devel


Hey Ludo,

Thanks for taking the time to read my wall of text :D.

> Yeah, it’s a double-edged sword.  If this is a problem on the main ‘guix
> publish’ server, we can lower the bypass threshold, which is currently
> 50 MiB:
>
>   https://git.savannah.gnu.org/cgit/guix/maintenance.git/tree/hydra/modules/sysadmin/services.scm#n450
>
> WDYT?

That would maybe help, but on the other hand, I would prefer to find a
more definitive solution :).

> First, in terms of UI, you’d have a command sitting there and doing
> nothing, which can be off-putting.  Second, clients have no idea how
> long they’re going to wait; it could be that the nar is going to be
> baked within seconds, or it could take 20mn if the baking queue is
> already crowded or if the user is asking for a big store item like
> libreoffice.  Third, in many cases, building locally is likely to be
> faster than waiting for substitutes to be available (the majority of
> packages build very quickly, though the few most popular leaf packages
> take a long time to build).

It would be interesting to monitor the status of the baking
workers. Could it really take 20 minutes to bake a substitute from your
experience?

Personally, I have always found this baking 404 and bypass cache a bit
misleading. When substituting libreoffice, I would much rather wait a
few minutes than trying to build it while there's an almost ready
substitute. I get that this is a personal choice and maybe it should be
an optional behaviour.

>> It will also allow the Cuirass build farm to use directly the main guix
>> publish server, simplifying the current CI setup.
>
> The only reason why Cuirass runs its own publish server is to avoid
> overloading the main one?

No, the main reason is that with the use of a publish cache, the Cuirass
workers would probably hit 404 errors while the substitutes are being
baked. Using a publish server without cache was a way to work around it.

The motivation of the 202 waiting patch was to solve both problems at
once. Maybe I should explore the narinfo dedicated thread solution as a
short term solution, while starting to think about a more long term
solution based on Fiber/Nginx.

A Cuirass dedicated solution could also be to declare a build successful
only when a nar is available and stop using a non-caching publish
server.

Thanks,

Mathieu


^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2021-08-11 11:41 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-09 10:28 Substitute timeouts Mathieu Othacehe
2021-08-11 11:06 ` Ludovic Courtès
2021-08-11 11:38   ` Mathieu Othacehe

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).