* bug#52338: Crawler bots are downloading substitutes
From: Leo Famulari @ 2021-12-06 21:20 UTC
To: 52338

I noticed that some bots are downloading substitutes from
ci.guix.gnu.org.

We should add a robots.txt file to reduce this waste.

Specifically, I see bots from Bing and Semrush:

https://www.bing.com/bingbot.htm
https://www.semrush.com/bot.html

* bug#52338: [maintenance] hydra: berlin: Create robots.txt.
From: Leo Famulari @ 2021-12-06 22:18 UTC
To: 52338

I tested that `guix system build` does succeed with this change, but I
would like a review on whether the resulting Nginx configuration is
correct, and if this is the correct path to disallow.

It generates an Nginx location block like this:

------
location /robots.txt {
  add_header Content-Type text/plain;
  return 200 "User-agent: *
Disallow: /nar
";
}
------

* hydra/nginx/berlin.scm (berlin-locations): Add a robots.txt Nginx
location.
---
 hydra/nginx/berlin.scm | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/hydra/nginx/berlin.scm b/hydra/nginx/berlin.scm
index 1f4b0be..3bb2129 100644
--- a/hydra/nginx/berlin.scm
+++ b/hydra/nginx/berlin.scm
@@ -174,7 +174,14 @@ PUBLISH-URL."
     (nginx-location-configuration
      (uri "/berlin.guixsd.org-export.pub")
      (body
-      (list "root /var/www/guix;"))))))
+      (list "root /var/www/guix;")))
+
+    (nginx-location-configuration
+     (uri "/robots.txt")
+     (body
+      (list
+       "add_header Content-Type text/plain;"
+       "return 200 \"User-agent: *\nDisallow: /nar/\n\";"))))))
 
 (define guix.gnu.org-redirect-locations
   (list
-- 
2.34.0

* bug#52338: Crawler bots are downloading substitutes
From: Mathieu Othacehe @ 2021-12-09 13:27 UTC
To: Leo Famulari; +Cc: 52338

Hello Leo,

> +    (nginx-location-configuration
> +     (uri "/robots.txt")
> +     (body
> +      (list
> +       "add_header Content-Type text/plain;"
> +       "return 200 \"User-agent: *\nDisallow: /nar/\n\";"))))))

Nice, the bots are also accessing the Cuirass web interface, do you
think it would be possible to extend this snippet to prevent it?

Thanks,

Mathieu

* bug#52338: Crawler bots are downloading substitutes
From: Tobias Geerinckx-Rice via Bug reports for GNU Guix @ 2021-12-09 15:42 UTC
To: Mathieu Othacehe; +Cc: 52338, leo

Mathieu Othacehe writes:
> Hello Leo,
>
>> +    (nginx-location-configuration
>> +     (uri "/robots.txt")

It's a micro-optimisation, but it can't hurt to generate
‘location = /robots.txt’ instead of ‘location /robots.txt’ here.

>> +     (body
>> +      (list
>> +       "add_header Content-Type text/plain;"
>> +       "return 200 \"User-agent: *\nDisallow: /nar/\n\";"))))))

Use \r\n instead of \n, even if \n happens to work.

There are many ‘buggy’ crawlers out there.  It's in their own
interest to be fussy whilst claiming to respect robots.txt.  The
less you deviate from the most basic norm imaginable, the better.

I tested whether embedding raw \r\n bytes in nginx.conf strings
like this works, and it seems to, even though a human would
probably not do so.

> Nice, the bots are also accessing the Cuirass web interface, do you
> think it would be possible to extend this snippet to prevent it?

You can replace ‘/nar/’ with ‘/’ to disallow everything:

  Disallow: /

If we want crawlers to index only the front page (so people can
search for ‘Guix CI’, I guess), that's possible:

  Disallow: /
  Allow: /$

Don't confuse ‘$’ with ‘supports regexps’.  Buggy bots might fall
back to ‘Disallow: /’.

This is where it gets ugly: nginx doesn't support escaping ‘$’ in
strings.  At all.  It's insane.

  geo $dollar { default "$"; } # stackoverflow.com/questions/57466554

  server {
    location = /robots.txt {
      return 200 "User-agent: *\r\nDisallow: /\r\nAllow: /$dollar\r\n";
    }
  }

*Obviously.*

An alternative to that is to serve a real on-disc robots.txt.

Kind regards,

T G-R

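A minimal sketch of the on-disc alternative mentioned at the end of the message above. It is not taken from the thread: the directory /var/www/guix is borrowed from the neighbouring location in the patch, and placing robots.txt there is an assumption, as is the surrounding server block.

```nginx
# Assumed file: /var/www/guix/robots.txt, saved with CRLF line endings
# as advised above, containing:
#
#   User-agent: *
#   Disallow: /
#   Allow: /$
#
# Because the rules live in a plain file, the '$' never passes through
# an nginx string and the geo $dollar workaround is not needed.

server {
    # listen / server_name directives omitted in this sketch

    # Exact-match location, as suggested for the generated block.
    location = /robots.txt {
        root /var/www/guix;        # assumption: directory holding robots.txt
        default_type text/plain;   # fallback type if no mime.types mapping applies
    }
}
```

With this layout, changing the crawl rules means editing the file on disk rather than the generated nginx configuration.
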
* bug#52338: Crawler bots are downloading substitutes
From: Leo Famulari @ 2021-12-10 16:22 UTC
To: Tobias Geerinckx-Rice; +Cc: othacehe, 52338

On Thu, Dec 09, 2021 at 04:42:24PM +0100, Tobias Geerinckx-Rice wrote:
[...]
> An alternative to that is to serve a real on-disc robots.txt.

Alright, I leave it up to you. I just want to prevent bots from
downloading substitutes. I don't really have opinions about any of the
details.

* bug#52338: Crawler bots are downloading substitutes
From: Tobias Geerinckx-Rice via Bug reports for GNU Guix @ 2021-12-10 16:47 UTC
To: Leo Famulari; +Cc: othacehe, 52338

Leo Famulari writes:
> Alright, I leave it up to you.

Dammit.

Kind regards,

T G-R

* bug#52338: Crawler bots are downloading substitutes
From: Mathieu Othacehe @ 2021-12-11 9:46 UTC
To: Tobias Geerinckx-Rice; +Cc: 52338

Hey,

The Cuirass web interface logs were quite silent this morning and I
suspected an issue somewhere. I then realized that you did update the
Nginx conf and the bots were no longer knocking at our door, which is
great!

Thanks to both of you,

Mathieu

* bug#52338: Crawler bots are downloading substitutes
From: Mathieu Othacehe @ 2021-12-19 16:53 UTC
To: 52338-done

> Thanks to both of you,

And closing!

Mathieu

* bug#52338: Crawler bots are downloading substitutes
From: Mark H Weaver @ 2021-12-10 21:21 UTC
To: Leo Famulari, 52338

Hi Leo,

Leo Famulari <leo@famulari.name> writes:

> I noticed that some bots are downloading substitutes from
> ci.guix.gnu.org.
>
> We should add a robots.txt file to reduce this waste.
>
> Specifically, I see bots from Bing and Semrush:
>
> https://www.bing.com/bingbot.htm
> https://www.semrush.com/bot.html

For what it's worth: during the years that I administered Hydra, I found
that many bots disregarded the robots.txt file that was in place there.
In practice, I found that I needed to periodically scan the access logs
for bots and forcefully block their requests in order to keep Hydra from
becoming overloaded with expensive queries from bots.

Regards,
Mark

* bug#52338: Crawler bots are downloading substitutes
From: Tobias Geerinckx-Rice via Bug reports for GNU Guix @ 2021-12-10 22:52 UTC
To: Mark H Weaver; +Cc: 52338, leo

All,

Mark H Weaver writes:
> For what it's worth: during the years that I administered Hydra, I found
> that many bots disregarded the robots.txt file that was in place there.
> In practice, I found that I needed to periodically scan the access logs
> for bots and forcefully block their requests in order to keep Hydra from
> becoming overloaded with expensive queries from bots.

Very good point.

IME (which is a few years old at this point) at least the
highlighted BingBot & SemrushThing always respected my robots.txt,
but it's definitely a concern.  I'll leave this bug open to remind
us of that in a few weeks or so…

If it does become a problem, we (I) might add some basic
User-Agent sniffing to either slow down or outright block
non-Guile downloaders.  Whitelisting any legitimate ones, of
course.  I think that's less hassle than dealing with dynamic IP
blocks whilst being equally effective here.

Thanks (again) for taking care of Hydra, Mark, and thank you Leo
for keeping an eye on Cuirass :-)

T G-R

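A rough sketch of the kind of User-Agent sniffing floated in the message above, for crawlers that ignore robots.txt. It is a simpler variant than the one described: instead of blocking every non-Guile downloader and whitelisting legitimate ones, it only rejects the two crawlers named in this thread. The variable name $is_crawler, the 403 status, and the server block are assumptions, not the configuration deployed on berlin.

```nginx
# Goes in the http context. Flags requests whose User-Agent matches
# one of the crawlers named in this thread (case-insensitive regexes).
map $http_user_agent $is_crawler {
    default        0;
    ~*bingbot      1;
    ~*semrushbot   1;
}

server {
    # listen / server_name directives omitted in this sketch

    location /nar/ {
        # "if" treats "0" and the empty string as false.
        if ($is_crawler) {
            return 403;
        }
        # The existing directives that serve substitutes would follow here.
    }
}
```

Matching on User-Agent avoids maintaining the dynamic IP blocks mentioned above, at the cost of trusting a header that a badly behaved client can change.
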
Thread overview: 10+ messages

2021-12-06 21:20 bug#52338: Crawler bots are downloading substitutes Leo Famulari
2021-12-06 22:18 ` bug#52338: [maintenance] hydra: berlin: Create robots.txt Leo Famulari
2021-12-09 13:27   ` bug#52338: Crawler bots are downloading substitutes Mathieu Othacehe
2021-12-09 15:42     ` Tobias Geerinckx-Rice via Bug reports for GNU Guix
2021-12-10 16:22       ` Leo Famulari
2021-12-10 16:47         ` Tobias Geerinckx-Rice via Bug reports for GNU Guix
2021-12-11  9:46           ` Mathieu Othacehe
2021-12-19 16:53             ` Mathieu Othacehe
2021-12-10 21:21 ` Mark H Weaver
2021-12-10 22:52   ` Tobias Geerinckx-Rice via Bug reports for GNU Guix

Code repositories for project(s) associated with this public inbox:

	https://git.savannah.gnu.org/cgit/guix.git