Hey,

This has been on my mind for a while, as I wonder what effect it has on
users fetching substitutes.

The narinfo caching as I understand it works as follows:

  Default success TTL => 36 hours
  Negative TTL => 1 hour
  Transient error TTL => 10 minutes

I'm ignoring the success TTL; I'm just interested in the negative and
transient error values. Negative means that when a server says it
doesn't have an output, that response will be cached for an
hour. Transient errors are for other HTTP response codes, like 504.

I had a look through the Git history: caching negative lookups has been
a thing for a while. Caching transient errors was added, but I couldn't
see why.

Personally I don't see a reason to keep either behaviour?

In an extreme case, the Guix Build Coordinator has to work hard to work
around this caching. Asking the guix-daemon if a substitute exists is
dangerous, as it literally costs an hour if that substitute isn't
available yet, but will be shortly (which happens all the time when
building a bunch of things). Currently it checks itself, and only
goes on to ask the guix-daemon to fetch the item if it knows it to
exist.

The transient error caching is also problematic, as that imposes a
10 minute penalty if there's a server issue.

Any thoughts?

Thanks,

Chris

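For reference, the default behaviour described above can be sketched roughly as follows. This is only an illustration in Python, not the actual Guix code: the function and constant names are made up, and treating 200 as "success" and 404 as "negative" is an assumption based on the description in the message.

```python
# Default TTLs as listed in the message above, in seconds.
SUCCESS_TTL = 36 * 3600          # narinfo found
NEGATIVE_TTL = 1 * 3600          # server says it does not have the output
TRANSIENT_ERROR_TTL = 10 * 60    # any other HTTP response code, e.g. 504


def default_ttl(status: int) -> int:
    """Return how long a narinfo lookup result is cached, by HTTP status.

    Assumption: 200 means success and 404 means a negative lookup.
    """
    if status == 200:
        return SUCCESS_TTL
    if status == 404:
        return NEGATIVE_TTL
    return TRANSIENT_ERROR_TTL


def is_cached(cached_at: float, status: int, now: float) -> bool:
    """True while the cached answer is reused instead of asking the server."""
    return now - cached_at < default_ttl(status)


# A 404 recorded at t=0 still hides the substitute 59 minutes later,
# even if the server got the item a few minutes after the first request.
print(is_cached(0, 404, 59 * 60))   # True
print(is_cached(0, 404, 61 * 60))   # False
```

This is the "costs an hour" effect: once a 404 is recorded, nothing the server does in the meantime is visible to the client until the entry expires.
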
Christopher Baines <mail@cbaines.net> writes:

> This has been on my mind for a while, as I wonder what effect it has on
> users fetching substitutes.
>
> The narinfo caching as I understand it works as follows:
>
>   Default success TTL => 36 hours
>   Negative TTL => 1 hour
>   Transient error TTL => 10 minutes
>
> I'm ignoring the success TTL; I'm just interested in the negative and
> transient error values. Negative means that when a server says it
> doesn't have an output, that response will be cached for an
> hour. Transient errors are for other HTTP response codes, like 504.
>
> I had a look through the Git history: caching negative lookups has been
> a thing for a while. Caching transient errors was added, but I couldn't
> see why.
>
> Personally I don't see a reason to keep either behaviour?

I've now sent a patch to remove this behaviour:

  https://issues.guix.gnu.org/47897

Hi!

(“Sorry for the long delay” is officially my motto at this point.)

Christopher Baines <mail@cbaines.net> skribis:

> This has been on my mind for a while, as I wonder what effect it has on
> users fetching substitutes.
>
> The narinfo caching as I understand it works as follows:
>
>   Default success TTL => 36 hours
>   Negative TTL => 1 hour
>   Transient error TTL => 10 minutes
>
> I'm ignoring the success TTL; I'm just interested in the negative and
> transient error values. Negative means that when a server says it
> doesn't have an output, that response will be cached for an
> hour. Transient errors are for other HTTP response codes, like 504.

You’re looking at the default TTLs, which are not the actual TTLs.
Specifically, servers can include a ‘Cache-Control’ header in their
reply specifying the TTL of their choice, and ‘guix substitute’ honors
that:

  https://git.savannah.gnu.org/cgit/guix.git/tree/guix/substitutes.scm#n200
  https://git.savannah.gnu.org/cgit/guix.git/tree/guix/scripts/publish.scm#n371

‘guix publish’ returns 404 with a TTL of 5mn when the requested item is
in store but needs to be “baked”.

However, ‘guix publish’ does not set ‘Cache-Control’ when the requested
item is not in store. In that case, clients use ‘%narinfo-negative-ttl’
(1h).

> I had a look through the Git history: caching negative lookups has been
> a thing for a while. Caching transient errors was added, but I couldn't
> see why.

Transient error caching was most likely added in the days of
hydra.gnu.org, that VM that was extremely slow. When overloaded, you’d
get 500 or similar, and at that point it was safer for clients to wait
and come back later, possibly much later. :-)

> Personally I don't see a reason to keep either behaviour?

The main arguments for these negative TTLs are:

  1. Reducing server load: if the server doesn’t have libreoffice, don’t
     come back asking every 10s, it’s prolly useless. You could easily
     have “GET storms” for libreoffice if clients don’t restrain
     themselves.

  2. Improving client performance: don’t GET things that are likely to
     fail.

Now, the penalty it imposes is annoying. I’ve sometimes found myself
working around it, too (because I knew the server was going to have the
store item sooner than 1h).

Rather than removing it entirely, I can think of these options:

  1. Reduce the default negative timeouts.

  2. Add an option to ‘guix publish’ (and to the Coordinator?) so they
     send a ‘Cache-Control’ header with the chosen TTL on 404. That
     way, if the server operator doesn’t mind extra load, they can run
     “guix publish --negative-ttl=0”.

WDYT? Does that make any sense?

Ludo’.

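To make the override concrete, here is a rough client-side sketch in Python of "use the server's TTL if it sent one, otherwise fall back to the default". It is not the code in guix/substitutes.scm: the function names are hypothetical, and reading the TTL from the standard ‘max-age’ directive of ‘Cache-Control’ is an assumption. Only the 200 and 404 cases are covered here.

```python
import re
from typing import Optional

SUCCESS_TTL = 36 * 3600   # default for a successful lookup, per the thread
NEGATIVE_TTL = 1 * 3600   # default for a 404 (‘%narinfo-negative-ttl’, 1h)


def max_age(cache_control: Optional[str]) -> Optional[int]:
    """Extract max-age (in seconds) from a Cache-Control value, if present."""
    if not cache_control:
        return None
    match = re.search(r"\bmax-age\s*=\s*(\d+)", cache_control, re.IGNORECASE)
    return int(match.group(1)) if match else None


def narinfo_ttl(status: int, cache_control: Optional[str]) -> int:
    """TTL for a 200 or 404 narinfo reply: server's choice, else the default."""
    if status not in (200, 404):
        raise ValueError("only success and negative lookups are handled here")
    server_ttl = max_age(cache_control)
    if server_ttl is not None:          # note: 0 is a valid server choice
        return server_ttl
    return SUCCESS_TTL if status == 200 else NEGATIVE_TTL


print(narinfo_ttl(404, "max-age=300"))        # 300: item being "baked" (5mn)
print(narinfo_ttl(404, None))                 # 3600: no header, client default
print(narinfo_ttl(404, "public, max-age=0"))  # 0: e.g. a server run with --negative-ttl=0
```

With something like this, option 2 needs no client change: a server that sends a TTL on 404 decides how long its negative answers are remembered.
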
BTW, one thing that would be interesting too is to return 404 with a
long ‘Cache-Control’ validity when the requested store item is among the
cached failures.

We could also add an extra response header to explicitly communicate
that the store item is known to fail to build.

Ludo’.

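Putting the server-side ideas from the last two messages together, a reply policy could look roughly like the Python sketch below. None of this is ‘guix publish’ code: the state names, the one-day validity for cached failures and the ‘X-Build-Failed’ header name are all hypothetical, chosen only for illustration. The 5-minute TTL for items being baked comes from the thread.

```python
from typing import Dict, Optional, Tuple

BAKING_TTL = 5 * 60               # from the thread: 404 with a 5mn TTL while baking
FAILED_TTL = 24 * 3600            # hypothetical "long" validity for cached failures
FAILED_HEADER = "X-Build-Failed"  # hypothetical extra response header


def narinfo_reply(state: str,
                  negative_ttl: Optional[int] = None) -> Tuple[int, Dict[str, str]]:
    """Return (HTTP status, headers) for a narinfo request.

    `state` is one of "available", "baking", "failed" or "missing".
    `negative_ttl` stands in for the proposed --negative-ttl option;
    None means no ‘Cache-Control’ is sent for missing items.
    """
    if state == "available":
        return 200, {}
    if state == "baking":
        return 404, {"Cache-Control": f"max-age={BAKING_TTL}"}
    if state == "failed":
        return 404, {"Cache-Control": f"max-age={FAILED_TTL}",
                     FAILED_HEADER: "1"}
    if state == "missing":
        if negative_ttl is None:
            return 404, {}          # client falls back to its own default
        return 404, {"Cache-Control": f"max-age={negative_ttl}"}
    raise ValueError(f"unknown state: {state}")


print(narinfo_reply("missing", negative_ttl=0))
print(narinfo_reply("failed"))
```

The point of the extra header is that a client could tell "not built yet" apart from "known to fail" instead of inferring it from the TTL alone.
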
Ludovic Courtès <ludo@gnu.org> writes:

> Hi!
>
> (“Sorry for the long delay” is officially my motto at this point.)
>
> Christopher Baines <mail@cbaines.net> skribis:
>
>> This has been on my mind for a while, as I wonder what effect it has on
>> users fetching substitutes.
>>
>> The narinfo caching as I understand it works as follows:
>>
>>   Default success TTL => 36 hours
>>   Negative TTL => 1 hour
>>   Transient error TTL => 10 minutes
>>
>> I'm ignoring the success TTL; I'm just interested in the negative and
>> transient error values. Negative means that when a server says it
>> doesn't have an output, that response will be cached for an
>> hour. Transient errors are for other HTTP response codes, like 504.
>
> You’re looking at the default TTLs, which are not the actual TTLs.
> Specifically, servers can include a ‘Cache-Control’ header in their
> reply specifying the TTL of their choice, and ‘guix substitute’ honors
> that:
>
>   https://git.savannah.gnu.org/cgit/guix.git/tree/guix/substitutes.scm#n200
>   https://git.savannah.gnu.org/cgit/guix.git/tree/guix/scripts/publish.scm#n371
>
> ‘guix publish’ returns 404 with a TTL of 5mn when the requested item is
> in store but needs to be “baked”.
>
> However, ‘guix publish’ does not set ‘Cache-Control’ when the requested
> item is not in store. In that case, clients use ‘%narinfo-negative-ttl’
> (1h).

You're right that the negative TTL is just a default, so it's possible
to override the default behaviour in the success and negative lookup
cases, but I don't believe the Cache-Control header is used for
transient errors.

>> I had a look through the Git history: caching negative lookups has been
>> a thing for a while. Caching transient errors was added, but I couldn't
>> see why.
>
> Transient error caching was most likely added in the days of
> hydra.gnu.org, that VM that was extremely slow. When overloaded, you’d
> get 500 or similar, and at that point it was safer for clients to wait
> and come back later, possibly much later. :-)
>
>> Personally I don't see a reason to keep either behaviour?
>
> The main arguments for these negative TTLs are:
>
>   1. Reducing server load: if the server doesn’t have libreoffice, don’t
>      come back asking every 10s, it’s prolly useless. You could easily
>      have “GET storms” for libreoffice if clients don’t restrain
>      themselves.
>
>   2. Improving client performance: don’t GET things that are likely to
>      fail.

As you say, for the negative TTL, the question here is really what's the
best default value, if a server isn't specifying one.

Given that most narinfo requests precede a build for that thing if the
response is negative, I have my doubts about those two arguments
above. This is assuming the most common case is users asking guix to
install and upgrade things. If a user gets a negative response, they'll
just build it instead and not check for that narinfo again. Even if they
cancel that build when they realise they don't want to build
libreoffice, they'll wait a bit anyway before retrying.

> Now, the penalty it imposes is annoying. I’ve sometimes found myself
> working around it, too (because I knew the server was going to have the
> store item sooner than 1h).
>
> Rather than removing it entirely, I can think of these options:
>
>   1. Reduce the default negative timeouts.

I think reducing it is good; as you say, it's possible to override the
default from the server side. Just in case someone wants caching
behaviour, it might be worth keeping that functionality at least.

>   2. Add an option to ‘guix publish’ (and to the Coordinator?) so they
>      send a ‘Cache-Control’ header with the chosen TTL on 404. That
>      way, if the server operator doesn’t mind extra load, they can run
>      “guix publish --negative-ttl=0”.

That sounds sensible. The Guix Build Coordinator doesn't do any serving,
that's left to something else like nginx. For the deployments I maintain
though, I don't think I'm setting the relevant headers, but I'll look at
changing that.

Going back to the %narinfo-transient-error-ttl, if I'm correct in saying
that it's not possible to override that, maybe that should also use the
relevant header value if set?

Hi,

Christopher Baines <mail@cbaines.net> skribis:

>> Now, the penalty it imposes is annoying. I’ve sometimes found myself
>> working around it, too (because I knew the server was going to have the
>> store item sooner than 1h).
>>
>> Rather than removing it entirely, I can think of these options:
>>
>>   1. Reduce the default negative timeouts.
>
> I think reducing it is good; as you say, it's possible to override the
> default from the server side. Just in case someone wants caching
> behaviour, it might be worth keeping that functionality at least.

OK, let’s do that.

>>   2. Add an option to ‘guix publish’ (and to the Coordinator?) so they
>>      send a ‘Cache-Control’ header with the chosen TTL on 404. That
>>      way, if the server operator doesn’t mind extra load, they can run
>>      “guix publish --negative-ttl=0”.
>
> That sounds sensible. The Guix Build Coordinator doesn't do any serving,
> that's left to something else like nginx. For the deployments I maintain
> though, I don't think I'm setting the relevant headers, but I'll look at
> changing that.

Cool.

> Going back to the %narinfo-transient-error-ttl, if I'm correct in saying
> that it's not possible to override that, maybe that should also use the
> relevant header value if set?

Correct, ‘%narinfo-transient-error-ttl’ cannot be overridden. We can
halve it if you think that’s useful, though when that happens, it means
something’s wrong with the server (returning 500 or similar).

I’ve sent patches to address this, lemme know what you think!

Thanks,
Ludo’.

Ludovic Courtès <ludo@gnu.org> writes:

> Hi,
>
> Christopher Baines <mail@cbaines.net> skribis:
>
>>> Now, the penalty it imposes is annoying. I’ve sometimes found myself
>>> working around it, too (because I knew the server was going to have the
>>> store item sooner than 1h).
>>>
>>> Rather than removing it entirely, I can think of these options:
>>>
>>>   1. Reduce the default negative timeouts.
>>
>> I think reducing it is good; as you say, it's possible to override the
>> default from the server side. Just in case someone wants caching
>> behaviour, it might be worth keeping that functionality at least.
>
> OK, let’s do that.
>
>>>   2. Add an option to ‘guix publish’ (and to the Coordinator?) so they
>>>      send a ‘Cache-Control’ header with the chosen TTL on 404. That
>>>      way, if the server operator doesn’t mind extra load, they can run
>>>      “guix publish --negative-ttl=0”.
>>
>> That sounds sensible. The Guix Build Coordinator doesn't do any serving,
>> that's left to something else like nginx. For the deployments I maintain
>> though, I don't think I'm setting the relevant headers, but I'll look at
>> changing that.
>
> Cool.
>
>> Going back to the %narinfo-transient-error-ttl, if I'm correct in saying
>> that it's not possible to override that, maybe that should also use the
>> relevant header value if set?
>
> Correct, ‘%narinfo-transient-error-ttl’ cannot be overridden. We can
> halve it if you think that’s useful, though when that happens, it means
> something’s wrong with the server (returning 500 or similar).
>
> I’ve sent patches to address this, lemme know what you think!

The patches you've sent look good.

Hi,
Christopher Baines <mail@cbaines.net> skribis:
> The patches you've sent look good.
Pushed as 938ffcbb0589adc07dc12c79eda3e1e2bb9e7cf8 (I was generous and
lowered ‘%narinfo-negative-ttl’ to 10mn :-)).
Thanks,
Ludo’.