* bug#52338: Crawler bots are downloading substitutes
@ 2021-12-06 21:20 Leo Famulari
2021-12-06 22:18 ` bug#52338: [maintenance] hydra: berlin: Create robots.txt Leo Famulari
2021-12-10 21:21 ` Mark H Weaver
0 siblings, 2 replies; 10+ messages in thread
From: Leo Famulari @ 2021-12-06 21:20 UTC (permalink / raw)
To: 52338
I noticed that some bots are downloading substitutes from
ci.guix.gnu.org.
We should add a robots.txt file to reduce this waste.
Specifically, I see bots from Bing and Semrush:
https://www.bing.com/bingbot.htm
https://www.semrush.com/bot.html
^ permalink raw reply [flat|nested] 10+ messages in thread
* bug#52338: [maintenance] hydra: berlin: Create robots.txt.
2021-12-06 21:20 bug#52338: Crawler bots are downloading substitutes Leo Famulari
@ 2021-12-06 22:18 ` Leo Famulari
2021-12-09 13:27 ` bug#52338: Crawler bots are downloading substitutes Mathieu Othacehe
2021-12-10 21:21 ` Mark H Weaver
1 sibling, 1 reply; 10+ messages in thread
From: Leo Famulari @ 2021-12-06 22:18 UTC (permalink / raw)
To: 52338
I tested that `guix system build` does succeed with this change, but I
would like a review on whether the resulting Nginx configuration is
correct, and if this is the correct path to disallow. It generates an
Nginx location block like this:
------
location /robots.txt {
  add_header Content-Type text/plain;
  return 200 "User-agent: *
Disallow: /nar/
";
}
------
* hydra/nginx/berlin.scm (berlin-locations): Add a robots.txt Nginx location.
---
hydra/nginx/berlin.scm | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/hydra/nginx/berlin.scm b/hydra/nginx/berlin.scm
index 1f4b0be..3bb2129 100644
--- a/hydra/nginx/berlin.scm
+++ b/hydra/nginx/berlin.scm
@@ -174,7 +174,14 @@ PUBLISH-URL."
(nginx-location-configuration
(uri "/berlin.guixsd.org-export.pub")
(body
- (list "root /var/www/guix;"))))))
+ (list "root /var/www/guix;")))
+
+ (nginx-location-configuration
+ (uri "/robots.txt")
+ (body
+ (list
+ "add_header Content-Type text/plain;"
+ "return 200 \"User-agent: *\nDisallow: /nar/\n\";"))))))
(define guix.gnu.org-redirect-locations
(list
--
2.34.0
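
One way to sanity-check the rule in this patch, before deploying it, is to feed the same robots.txt body to a parser and ask whether a given path may be fetched. The sketch below uses Python's standard `urllib.robotparser`; the host is the one named in the report, but the two request paths are made-up examples, not real store items or Cuirass routes.

```python
from urllib.robotparser import RobotFileParser

# Body of the robots.txt served by the patched location block.
ROBOTS_TXT = "User-agent: *\nDisallow: /nar/\n"


def build_parser(body: str) -> RobotFileParser:
    """Parse a robots.txt body held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(body.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
BASE = "https://ci.guix.gnu.org"

# Hypothetical paths, for illustration only.
nar_allowed = parser.can_fetch("bingbot", BASE + "/nar/gzip/example-item")
other_allowed = parser.can_fetch("bingbot", BASE + "/some/other/page")

print("nar download allowed:", nar_allowed)
print("other page allowed:", other_allowed)
```

This only shows how a well-behaved, prefix-matching crawler would read the file; it says nothing about crawlers that ignore robots.txt.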
* bug#52338: Crawler bots are downloading substitutes
2021-12-06 22:18 ` bug#52338: [maintenance] hydra: berlin: Create robots.txt Leo Famulari
@ 2021-12-09 13:27 ` Mathieu Othacehe
2021-12-09 15:42 ` Tobias Geerinckx-Rice via Bug reports for GNU Guix
0 siblings, 1 reply; 10+ messages in thread
From: Mathieu Othacehe @ 2021-12-09 13:27 UTC (permalink / raw)
To: Leo Famulari; +Cc: 52338
Hello Leo,
> + (nginx-location-configuration
> + (uri "/robots.txt")
> + (body
> + (list
> + "add_header Content-Type text/plain;"
> + "return 200 \"User-agent: *\nDisallow: /nar/\n\";"))))))
Nice, the bots are also accessing the Cuirass web interface, do you
think it would be possible to extend this snippet to prevent it?
Thanks,
Mathieu
* bug#52338: Crawler bots are downloading substitutes
2021-12-09 13:27 ` bug#52338: Crawler bots are downloading substitutes Mathieu Othacehe
@ 2021-12-09 15:42 ` Tobias Geerinckx-Rice via Bug reports for GNU Guix
2021-12-10 16:22 ` Leo Famulari
0 siblings, 1 reply; 10+ messages in thread
From: Tobias Geerinckx-Rice via Bug reports for GNU Guix @ 2021-12-09 15:42 UTC (permalink / raw)
To: Mathieu Othacehe; +Cc: 52338, leo
Mathieu Othacehe writes:
> Hello Leo,
>
>> + (nginx-location-configuration
>> + (uri "/robots.txt")
It's a micro-optimisation, but it can't hurt to generate ‘location
= /robots.txt’ instead of ‘location /robots.txt’ here.
>> + (body
>> + (list
>> + "add_header Content-Type text/plain;"
>> "return 200 \"User-agent: *\nDisallow: /nar/\n\";"))))))
Use \r\n instead of \n, even if \n happens to work.
There are many ‘buggy’ crawlers out there. It's in their own
interest to be fussy whilst claiming to respect robots.txt. The
less you deviate from the most basic norm imaginable, the better.
I tested whether embedding raw \r\n bytes in nginx.conf strings
like this works, and it seems to, even though a human would
probably not do so.
> Nice, the bots are also accessing the Cuirass web interface, do you
> think it would be possible to extend this snippet to prevent it?
You can replace ‘/nar/’ with ‘/’ to disallow everything:
Disallow: /
If we want crawlers to index only the front page (so people can
search for ‘Guix CI’, I guess), that's possible:
Disallow: /
Allow: /$
Don't confuse ‘$’ with ‘supports regexps’. Buggy bots might fall
back to ‘Disallow: /’.
This is where it gets ugly: nginx doesn't support escaping ‘$’ in
strings. At all. It's insane.
[-- Attachment #1.2: Type: text/plain, Size: 201 bytes --]
geo $dollar { default "$"; }  # stackoverflow.com/questions/57466554
server {
  location = /robots.txt {
    return 200
      "User-agent: *\r\nDisallow: /\r\nAllow: /$dollar\r\n";
  }
}
[-- Attachment #1.3: Type: text/plain, Size: 99 bytes --]
*Obviously.*
An alternative to that is to serve a real on-disc robots.txt.
Kind regards,
T G-R
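
To illustrate how a crawler that does understand ‘$’ would read the `Disallow: /` plus `Allow: /$` pair, here is a minimal sketch of longest-match evaluation in Python. It is an assumption-laden toy, not any real crawler's code: the helper names (`parse_rules`, `pattern_to_regex`, `is_allowed`) are invented for this example, user-agent grouping is ignored, and the substitute path is hypothetical.

```python
import re

# Body suggested in the message above, with CRLF line endings.
ROBOTS_TXT = "User-agent: *\r\nDisallow: /\r\nAllow: /$\r\n"


def parse_rules(text):
    """Return (is_allow, pattern) pairs; user-agent grouping is ignored."""
    rules = []
    for line in text.splitlines():  # splitlines() accepts both \r\n and \n
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field in ("allow", "disallow") and value:
            rules.append((field == "allow", value))
    return rules


def pattern_to_regex(pattern):
    """'*' is a wildcard; a final '$' anchors the end of the path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(rules, path):
    """Longest matching pattern wins; on a tie Allow wins; no match means allowed."""
    best_len, allowed = -1, True
    for is_allow, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed


RULES = parse_rules(ROBOTS_TXT)
print("front page allowed:", is_allowed(RULES, "/"))
print("substitute allowed:", is_allowed(RULES, "/nar/gzip/example"))  # hypothetical path
```

A crawler that treats patterns as plain prefixes and stops at the first match would instead hit `Disallow: /` for every path, which is the fallback to ‘Disallow: /’ described above.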
* bug#52338: Crawler bots are downloading substitutes
2021-12-09 15:42 ` Tobias Geerinckx-Rice via Bug reports for GNU Guix
@ 2021-12-10 16:22 ` Leo Famulari
2021-12-10 16:47 ` Tobias Geerinckx-Rice via Bug reports for GNU Guix
0 siblings, 1 reply; 10+ messages in thread
From: Leo Famulari @ 2021-12-10 16:22 UTC (permalink / raw)
To: Tobias Geerinckx-Rice; +Cc: othacehe, 52338
On Thu, Dec 09, 2021 at 04:42:24PM +0100, Tobias Geerinckx-Rice wrote:
[...]
> An alternative to that is to serve a real on-disc robots.txt.
Alright, I leave it up to you. I just want to prevent bots from
downloading substitutes. I don't really have opinions about any of the
details.
* bug#52338: Crawler bots are downloading substitutes
2021-12-10 16:22 ` Leo Famulari
@ 2021-12-10 16:47 ` Tobias Geerinckx-Rice via Bug reports for GNU Guix
2021-12-11 9:46 ` Mathieu Othacehe
0 siblings, 1 reply; 10+ messages in thread
From: Tobias Geerinckx-Rice via Bug reports for GNU Guix @ 2021-12-10 16:47 UTC (permalink / raw)
To: Leo Famulari; +Cc: othacehe, 52338
Leo Famulari writes:
> Alright, I leave it up to you.
Dammit.
Kind regards,
T G-R
* bug#52338: Crawler bots are downloading substitutes
2021-12-06 21:20 bug#52338: Crawler bots are downloading substitutes Leo Famulari
2021-12-06 22:18 ` bug#52338: [maintenance] hydra: berlin: Create robots.txt Leo Famulari
@ 2021-12-10 21:21 ` Mark H Weaver
2021-12-10 22:52 ` Tobias Geerinckx-Rice via Bug reports for GNU Guix
1 sibling, 1 reply; 10+ messages in thread
From: Mark H Weaver @ 2021-12-10 21:21 UTC (permalink / raw)
To: Leo Famulari, 52338
Hi Leo,
Leo Famulari <leo@famulari.name> writes:
> I noticed that some bots are downloading substitutes from
> ci.guix.gnu.org.
>
> We should add a robots.txt file to reduce this waste.
>
> Specifically, I see bots from Bing and Semrush:
>
> https://www.bing.com/bingbot.htm
> https://www.semrush.com/bot.html
For what it's worth: during the years that I administered Hydra, I found
that many bots disregarded the robots.txt file that was in place there.
In practice, I found that I needed to periodically scan the access logs
for bots and forcefully block their requests in order to keep Hydra from
becoming overloaded with expensive queries from bots.
Regards,
Mark
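
The periodic log scan described above could be sketched roughly as follows in Python. This assumes the access log uses the common "combined" format; the sample lines, IP addresses, and user-agent strings are invented for illustration and are not taken from the Hydra or berlin logs.

```python
import re
from collections import Counter

# Matches the "combined" access log format (an assumption about the log layout).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def nar_requests_by_agent(lines):
    """Count requests for paths under /nar/ per User-Agent string."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path").startswith("/nar/"):
            counts[match.group("agent")] += 1
    return counts


# Invented sample data; real logs would be read from a file instead.
SAMPLE = [
    '192.0.2.1 - - [10/Dec/2021:21:00:00 +0000] "GET /nar/gzip/example-a HTTP/1.1" 200 1024 "-" "ExampleBot/1.0"',
    '192.0.2.1 - - [10/Dec/2021:21:00:01 +0000] "GET /nar/gzip/example-b HTTP/1.1" 200 2048 "-" "ExampleBot/1.0"',
    '192.0.2.7 - - [10/Dec/2021:21:00:02 +0000] "GET /nar/gzip/example-c HTTP/1.1" 200 4096 "-" "ExampleClient/0.1"',
    '192.0.2.1 - - [10/Dec/2021:21:00:03 +0000] "GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
]

counts = nar_requests_by_agent(SAMPLE)
for agent, hits in counts.most_common():
    print(hits, agent)
```

Sorting by request count makes heavy downloaders stand out; deciding which of them are bots still needs a human look at the user-agent strings.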
* bug#52338: Crawler bots are downloading substitutes
2021-12-10 21:21 ` Mark H Weaver
@ 2021-12-10 22:52 ` Tobias Geerinckx-Rice via Bug reports for GNU Guix
0 siblings, 0 replies; 10+ messages in thread
From: Tobias Geerinckx-Rice via Bug reports for GNU Guix @ 2021-12-10 22:52 UTC (permalink / raw)
To: Mark H Weaver; +Cc: 52338, leo
All,
Mark H Weaver writes:
> For what it's worth: during the years that I administered Hydra, I found
> that many bots disregarded the robots.txt file that was in place there.
> In practice, I found that I needed to periodically scan the access logs
> for bots and forcefully block their requests in order to keep Hydra from
> becoming overloaded with expensive queries from bots.
Very good point.
IME (which is a few years old at this point) at least the
highlighted BingBot & SemrushThing always respected my robots.txt,
but it's definitely a concern. I'll leave this bug open to remind
us of that in a few weeks or so…
If it does become a problem, we (I) might add some basic
User-Agent sniffing to either slow down or outright block
non-Guile downloaders. Whitelisting any legitimate ones, of
course. I think that's less hassle than dealing with dynamic IP
blocks whilst being equally effective here.
Thanks (again) for taking care of Hydra, Mark, and thank you Leo
for keeping an eye on Cuirass :-)
T G-R
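
The User-Agent sniffing idea above amounts to a small allow/block/slow-down policy. A rough Python sketch of that decision follows. Everything specific in it is an assumption: the marker strings, the three-way outcome, and in particular the guess that Guix clients can be recognised by a "GNU Guile" substring in their User-Agent header.

```python
# Substrings that mark a client as a legitimate downloader.
# Assumption: Guix clients send a User-Agent containing "GNU Guile".
ALLOWED_MARKERS = ("GNU Guile",)

# Substrings for the crawlers named in this thread.
BLOCKED_MARKERS = ("bingbot", "SemrushBot")


def classify(user_agent: str) -> str:
    """Return 'allow', 'block', or 'throttle' for a User-Agent header value."""
    ua = user_agent.lower()
    if any(marker.lower() in ua for marker in ALLOWED_MARKERS):
        return "allow"
    if any(marker.lower() in ua for marker in BLOCKED_MARKERS):
        return "block"
    # Unknown clients are slowed down rather than refused outright.
    return "throttle"


for agent in (
    "GNU Guile",
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "curl/7.79.1",
):
    print(classify(agent), "<-", agent)
```

Checking the allow list first means a whitelisted client is never blocked by accident; the trade-off is that any bot can bypass the policy by copying an allowed User-Agent string.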
* bug#52338: Crawler bots are downloading substitutes
2021-12-10 16:47 ` Tobias Geerinckx-Rice via Bug reports for GNU Guix
@ 2021-12-11 9:46 ` Mathieu Othacehe
2021-12-19 16:53 ` Mathieu Othacehe
0 siblings, 1 reply; 10+ messages in thread
From: Mathieu Othacehe @ 2021-12-11 9:46 UTC (permalink / raw)
To: Tobias Geerinckx-Rice; +Cc: 52338
Hey,
The Cuirass web interface logs were quite silent this morning, and I
suspected an issue somewhere. I then realized that you had updated the
Nginx configuration and the bots were no longer knocking at our door,
which is great!
Thanks to both of you,
Mathieu
* bug#52338: Crawler bots are downloading substitutes
2021-12-11 9:46 ` Mathieu Othacehe
@ 2021-12-19 16:53 ` Mathieu Othacehe
0 siblings, 0 replies; 10+ messages in thread
From: Mathieu Othacehe @ 2021-12-19 16:53 UTC (permalink / raw)
To: 52338-done
> Thanks to both of you,
And closing!
Mathieu