unofficial mirror of guix-devel@gnu.org 
From: "Ludovic Courtès" <ludo@gnu.org>
To: Mathieu Othacehe <othacehe@gnu.org>
Cc: guix-devel <guix-devel@gnu.org>
Subject: Re: Substitute timeouts
Date: Wed, 11 Aug 2021 13:06:01 +0200
Message-ID: <87pmukmeau.fsf@gnu.org>
In-Reply-To: <875ywec3oo.fsf@gnu.org> (Mathieu Othacehe's message of "Mon, 09 Aug 2021 12:28:39 +0200")

Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

> I have been investigating a problem that is visible both on the main
> guix publish server at https://ci.guix.gnu.org[1] and on the Cuirass
> build farm[2].
>
> This error comes from the fact that the publish server does not accept
> the "guix substitute" connection requests within the %fetch-timeout
> duration of 5 seconds.

Thanks for getting to the bottom of this!

> The main guix publish server is using a cache. If a requested narinfo is
> not in the cache, it will be baked and the client receives a 404
> error. Since ecaa102a58ad3ab0b42e04a3d10d7c761c05ec98 and the
> introduction of the bypass mechanism, small store items are directly
> returned.
>
> This means that the "narinfo-string" procedure can be called directly in
> the main publish thread. Running perf on the main publish server reveals
> that this procedure can be really expensive under IO pressure (GC
> running for example) because it opens a lot of files. I have observed
> that the "read-derivation-from-file" call can take up to 600 ms.
>
> If multiple clients were to ask for the narinfo of several items not yet cached,
> under IO pressure, I think that the publish server could become
> unresponsive and cause the timeout errors.

Yeah, it’s a double-edged sword.  If this is a problem on the main ‘guix
publish’ server, we can lower the bypass threshold, which is currently
50 MiB:

  https://git.savannah.gnu.org/cgit/guix/maintenance.git/tree/hydra/modules/sysadmin/services.scm#n450

WDYT?
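
To make the trade-off concrete, here is a minimal Python sketch of the
decision described above.  It is a toy model, not the actual ‘guix
publish’ code (which is written in Guile Scheme):
`handle_narinfo_request`, `BYPASS_THRESHOLD`, the cache dictionary and
the baking queue are all hypothetical names, and only the 50 MiB figure
comes from this thread.

```python
MiB = 2 ** 20
BYPASS_THRESHOLD = 50 * MiB   # value mentioned in this thread


def handle_narinfo_request(item, size, cache, baking_queue,
                           threshold=BYPASS_THRESHOLD):
    """Return (status, body) for a narinfo request on ITEM of SIZE bytes.

    Toy model: CACHE maps store items to narinfo strings and
    BAKING_QUEUE is a list of items waiting to be baked.
    """
    if item in cache:
        # Cache hit: cheap, nothing to compute.
        return 200, cache[item]
    if size < threshold:
        # Bypass: the narinfo is computed synchronously, in the thread
        # that serves the request.  This is the potentially slow path.
        return 200, f"narinfo for {item}"
    # Too big to bypass: schedule baking and tell the client to come back.
    if item not in baking_queue:
        baking_queue.append(item)
    return 404, None
```

In this model, lowering `threshold` moves more requests from the
synchronous path to the 404-plus-baking path: the server does less work
in its main thread, but more clients see a 404 on their first attempt.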

> The fact that Cuirass triggers the baking of successfully built
> derivations probably doesn't help here.

Could be.  This threshold seemed to work fine earlier (and still does,
mostly?).

> Now regarding the timeout errors that are much more frequent on the
> Cuirass build farm, the cause varies a bit. The Cuirass publish server
> running on Berlin does not use a cache. This means that the
> "narinfo-string" procedure is called for each request, in the main
> thread.
>
> To fix those issues, a solution could be to run the "narinfo-string" in
> a separate thread, but it will make the publish server code even harder
> to understand.

True!  Though maybe it wouldn’t be that much worse.  :-)

The problem is that this thing is very much single-threaded, with
exceptions in a couple of places.  We could add one more exception like
you write, or fiberize it, or run it behind nginx, possibly with a tiny
bit of caching.
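
For illustration, the “one more exception” option (moving the expensive
call off the main thread) can be sketched as follows.  This is Python
with asyncio rather than Guile, and `slow_narinfo_string` is a
hypothetical stand-in for the real procedure; it only shows the general
technique of offloading a blocking call to a worker thread so that the
accept loop stays responsive.  `asyncio.to_thread` needs Python 3.9 or
later.

```python
import asyncio
import threading
import time

# Threads that ran the slow procedure, kept for inspection.
worker_threads = []


def slow_narinfo_string(item):
    """Hypothetical stand-in for an expensive, blocking computation."""
    worker_threads.append(threading.current_thread())
    time.sleep(0.05)   # pretend to open many files under I/O pressure
    return f"narinfo for {item}"


async def handle_request(item):
    # Offload the blocking call to a worker thread; the event loop
    # (the "main thread" here) stays free to accept other connections.
    return await asyncio.to_thread(slow_narinfo_string, item)


async def serve(items):
    # Handle several requests concurrently; results keep the input order.
    return await asyncio.gather(*(handle_request(i) for i in items))


def demo():
    return asyncio.run(serve(["a", "b", "c"]))
```

The cost is the one mentioned above: there are now two execution
contexts to reason about, and any state shared between them needs care.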

> My proposal would be to get rid of the bypass mechanism and instead
> implement a retry when some substitutes are reported as being baked,
> as proposed by Miguel[3].
>
> I think this is the most reasonable solution. This way, users won't
> receive 404 errors and start building substitutes that are being
> baked[4].

(If I followed correctly, the bypass mechanism is not at fault regarding
timeouts on the Cuirass publish server since it’s not using a cache,
right?)

I don’t think it’s reasonable for ‘guix substitute’ to just wait upon
202 (or 404, that doesn’t matter).

First, in terms of UI, you’d have a command sitting there and doing
nothing, which can be off-putting.  Second, clients have no idea how
long they’re going to wait; it could be that the nar is going to be
baked within seconds, or it could take 20 minutes if the baking queue is
already crowded or if the user is asking for a big store item like
libreoffice.  Third, in many cases, building locally is likely to be
faster than waiting for substitutes to be available (the majority of
packages build very quickly, though the few most popular leaf packages
take a long time to build).
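
For reference, the kind of retry under discussion, with a cap on the
total waiting time, might look like the sketch below.  It is Python
pseudo-client code, not anything in ‘guix substitute’; `fetch`,
`max_wait` and `delay` are hypothetical parameters.  Note that it does
not address the second point: the client still has no way to tell
whether `max_wait` is well chosen for a given store item.

```python
import time


def fetch_with_bounded_retry(fetch, max_wait, delay,
                             sleep=time.sleep, clock=time.monotonic):
    """Call FETCH until it returns 200 or MAX_WAIT seconds have elapsed.

    FETCH is a thunk returning an HTTP status code.  Return
    "substitute" on success, or "build-locally" once another DELAY
    would exceed the deadline.
    """
    deadline = clock() + max_wait
    while True:
        if fetch() == 200:
            return "substitute"
        if clock() + delay > deadline:
            # Waiting any longer would exceed the cap: give up.
            return "build-locally"
        sleep(delay)
```

The `sleep` and `clock` parameters exist only so that the function can
be exercised with a fake clock instead of real waiting.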

> It will also allow the Cuirass build farm to use the main guix
> publish server directly, simplifying the current CI setup.

Is avoiding overload on the main one the only reason why Cuirass runs
its own publish server?

Thanks,
Ludo’.



Thread overview: 3+ messages
2021-08-09 10:28 Substitute timeouts Mathieu Othacehe
2021-08-11 11:06 ` Ludovic Courtès [this message]
2021-08-11 11:38   ` Mathieu Othacehe
