bug#52338: Crawler bots are downloading substitutes

unofficial mirror of bug-guix@gnu.org 

* bug#52338: Crawler bots are downloading substitutes
@ 2021-12-06 21:20 Leo Famulari
  2021-12-06 22:18 ` bug#52338: [maintenance] hydra: berlin: Create robots.txt Leo Famulari
  2021-12-10 21:21 ` Mark H Weaver
  0 siblings, 2 replies; 10+ messages in thread
From: Leo Famulari @ 2021-12-06 21:20 UTC
  To: 52338

I noticed that some bots are downloading substitutes from
ci.guix.gnu.org.

We should add a robots.txt file to reduce this waste.

Specifically, I see bots from Bing and Semrush:

https://www.bing.com/bingbot.htm
https://www.semrush.com/bot.html





* bug#52338: [maintenance] hydra: berlin: Create robots.txt.
  2021-12-06 21:20 bug#52338: Crawler bots are downloading substitutes Leo Famulari
@ 2021-12-06 22:18 ` Leo Famulari
  2021-12-09 13:27   ` bug#52338: Crawler bots are downloading substitutes Mathieu Othacehe
  2021-12-10 21:21 ` Mark H Weaver
  1 sibling, 1 reply; 10+ messages in thread
From: Leo Famulari @ 2021-12-06 22:18 UTC
  To: 52338

I tested that `guix system build` does succeed with this change, but I
would like a review on whether the resulting Nginx configuration is
correct, and if this is the correct path to disallow. It generates an
Nginx location block like this:

------
      location /robots.txt {
        add_header  Content-Type  text/plain;
        return 200 "User-agent: *
Disallow: /nar/
";
      }
------
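For anyone reviewing the path: a minimal sketch, using Python's standard `urllib.robotparser`, of which URLs the rule from the patch (`Disallow: /nar/`) would cover for a crawler that honours robots.txt. The sample paths and the `SomeBot` agent name are made up for illustration, and real crawlers may parse the file differently.

```python
from urllib.robotparser import RobotFileParser

# robots.txt body as emitted by the patch below.
ROBOTS_TXT = "User-agent: *\nDisallow: /nar/\n"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(path, agent="SomeBot"):
    """Return True if a compliant crawler may fetch PATH."""
    return parser.can_fetch(agent, path)

# Hypothetical paths, for illustration only.
for path in ("/nar/gzip/example-item", "/", "/jobset/example"):
    print(path, "->", "allowed" if allowed(path) else "disallowed")
```

The rule is a plain prefix match, so anything outside `/nar/` stays crawlable.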

* hydra/nginx/berlin.scm (berlin-locations): Add a robots.txt Nginx location.
---
 hydra/nginx/berlin.scm | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/hydra/nginx/berlin.scm b/hydra/nginx/berlin.scm
index 1f4b0be..3bb2129 100644
--- a/hydra/nginx/berlin.scm
+++ b/hydra/nginx/berlin.scm
@@ -174,7 +174,14 @@ PUBLISH-URL."
            (nginx-location-configuration
             (uri "/berlin.guixsd.org-export.pub")
             (body
-             (list "root /var/www/guix;"))))))
+             (list "root /var/www/guix;")))
+
+           (nginx-location-configuration
+             (uri "/robots.txt")
+             (body
+               (list
+                 "add_header  Content-Type  text/plain;"
+                 "return 200 \"User-agent: *\nDisallow: /nar/\n\";"))))))
 
 (define guix.gnu.org-redirect-locations
   (list
-- 
2.34.0






* bug#52338: Crawler bots are downloading substitutes
  2021-12-06 22:18 ` bug#52338: [maintenance] hydra: berlin: Create robots.txt Leo Famulari
@ 2021-12-09 13:27   ` Mathieu Othacehe
  2021-12-09 15:42     ` Tobias Geerinckx-Rice via Bug reports for GNU Guix
  0 siblings, 1 reply; 10+ messages in thread
From: Mathieu Othacehe @ 2021-12-09 13:27 UTC
  To: Leo Famulari; +Cc: 52338


Hello Leo,

> +           (nginx-location-configuration
> +             (uri "/robots.txt")
> +             (body
> +               (list
> +                 "add_header  Content-Type  text/plain;"
> +                 "return 200 \"User-agent: *\nDisallow: /nar/\n\";"))))))

Nice, the bots are also accessing the Cuirass web interface, do you
think it would be possible to extend this snippet to prevent it?

Thanks,

Mathieu





* bug#52338: Crawler bots are downloading substitutes
  2021-12-09 13:27   ` bug#52338: Crawler bots are downloading substitutes Mathieu Othacehe
@ 2021-12-09 15:42     ` Tobias Geerinckx-Rice via Bug reports for GNU Guix
  2021-12-10 16:22       ` Leo Famulari
  0 siblings, 1 reply; 10+ messages in thread
From: Tobias Geerinckx-Rice via Bug reports for GNU Guix @ 2021-12-09 15:42 UTC
  To: Mathieu Othacehe; +Cc: 52338, leo


Mathieu Othacehe writes:
> Hello Leo,
>
>> +           (nginx-location-configuration
>> +             (uri "/robots.txt")

It's a micro-optimisation, but it can't hurt to generate
‘location = /robots.txt’ instead of ‘location /robots.txt’ here.

>> +             (body
>> +               (list
>> +                 "add_header  Content-Type  text/plain;"
>> +                 "return 200 \"User-agent: *\nDisallow: /nar/\n\";"))))))

Use \r\n instead of \n, even if \n happens to work.

There are many ‘buggy’ crawlers out there.  It's in their own 
interest to be fussy whilst claiming to respect robots.txt.  The 
less you deviate from the most basic norm imaginable, the better.

I tested whether embedding raw \r\n bytes in nginx.conf strings 
like this works, and it seems to, even though a human would 
probably not do so.
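A minimal sketch of that advice in Python, using only the standard library: build the body with explicit CRLF terminators and confirm that a tolerant parser reads it the same way as the LF version. The helper names `robots_body` and `disallows_nar` and the `SomeBot` agent are hypothetical.

```python
from urllib.robotparser import RobotFileParser

def robots_body(lines, eol="\r\n"):
    """Join robots.txt LINES, terminating every line with EOL."""
    return "".join(line + eol for line in lines)

RULES = ["User-agent: *", "Disallow: /nar/"]
crlf_body = robots_body(RULES)            # CRLF, as suggested above
lf_body = robots_body(RULES, eol="\n")    # LF only, as in the patch

def disallows_nar(body):
    """Parse BODY and report whether /nar/ is off limits."""
    parser = RobotFileParser()
    parser.parse(body.splitlines())
    return not parser.can_fetch("SomeBot", "/nar/example")

print(repr(crlf_body))
print(disallows_nar(crlf_body), disallows_nar(lf_body))
```

This only shows that one lenient parser accepts both forms; it says nothing about how a fussier crawler would behave.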

> Nice, the bots are also accessing the Cuirass web interface, do 
> you
> think it would be possible to extend this snippet to prevent it?

You can replace ‘/nar/’ with ‘/’ to disallow everything:

  Disallow: /

If we want crawlers to index only the front page (so people can 
search for ‘Guix CI’, I guess), that's possible:

  Disallow: /
  Allow: /$

Don't confuse ‘$’ with ‘supports regexps’.  Buggy bots might fall 
back to ‘Disallow: /’.
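To make the ‘$’ semantics concrete, here is a rough Python sketch of how a crawler that understands the end-of-path anchor might evaluate those two rules. It assumes the longest-match convention (the longest matching pattern wins, and Allow wins a tie) later written down in RFC 9309. The function names are hypothetical. Python's own `urllib.robotparser` does plain prefix matching and does not treat ‘$’ specially, which is why this sketch uses a hand-written matcher.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a regex.

    '*' matches any run of characters; a trailing '$' anchors the
    pattern to the end of the path.  Everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def is_allowed(path, rules):
    """RULES is a list of (kind, pattern), kind being 'allow' or 'disallow'.

    The longest matching pattern wins; on a tie, 'allow' wins.
    A path that matches no rule is allowed.
    """
    best = (0, True)  # (pattern length, allowed)
    for kind, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), kind == "allow")
            if candidate > best:
                best = candidate
    return best[1]

RULES = [("disallow", "/"), ("allow", "/$")]

# Hypothetical paths, for illustration only.
for path in ("/", "/nar/example", "/jobset/example"):
    print(path, "->", "allowed" if is_allowed(path, RULES) else "disallowed")
```

Only the bare front page matches ‘/$’, so everything else falls under ‘Disallow: /’.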

This is where it gets ugly: nginx doesn't support escaping ‘$’ in 
strings.  At all.  It's insane.


  geo $dollar { default "$"; } # stackoverflow.com/questions/57466554
  server {
    location = /robots.txt {
      return 200
      "User-agent: *\r\nDisallow: /\r\nAllow: /$dollar\r\n";
    }
  }


*Obviously.*

An alternative to that is to serve a real on-disc robots.txt.
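If the inline variant is kept, the escaping can be done mechanically. This Python sketch renders a list of robots.txt lines into the same `return 200 "…";` string as the snippet above. It assumes a `$dollar` variable has been defined elsewhere in nginx.conf (for example by the `geo` block above). The helper name `nginx_return_directive` is hypothetical.

```python
def nginx_return_directive(lines, dollar_var="$dollar"):
    """Render robots.txt LINES as an nginx 'return 200' directive."""
    def escape(line):
        # Backslash-escape backslashes and double quotes first.
        line = line.replace("\\", "\\\\").replace('"', '\\"')
        # nginx has no escape for a literal '$', so substitute the
        # variable that is assumed to expand to "$".
        # Caveat: if letters, digits or '_' directly follow the '$',
        # pass dollar_var="${dollar}" so the variable name stays delimited.
        return line.replace("$", dollar_var)

    # Terminate every line with the escape sequences \r\n
    # (a backslash followed by a letter, not raw CR and LF bytes).
    body = "".join(escape(line) + "\\r\\n" for line in lines)
    return 'return 200 "' + body + '";'

directive = nginx_return_directive(
    ["User-agent: *", "Disallow: /", "Allow: /$"])
print(directive)
```

Serving an on-disc file avoids all of this escaping, at the cost of managing one more file.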

Kind regards,

T G-R


* bug#52338: Crawler bots are downloading substitutes
  2021-12-09 15:42     ` Tobias Geerinckx-Rice via Bug reports for GNU Guix
@ 2021-12-10 16:22       ` Leo Famulari
  2021-12-10 16:47         ` Tobias Geerinckx-Rice via Bug reports for GNU Guix
  0 siblings, 1 reply; 10+ messages in thread
From: Leo Famulari @ 2021-12-10 16:22 UTC
  To: Tobias Geerinckx-Rice; +Cc: othacehe, 52338


On Thu, Dec 09, 2021 at 04:42:24PM +0100, Tobias Geerinckx-Rice wrote:
[...]
> An alternative to that is to serve a real on-disc robots.txt.

Alright, I leave it up to you. I just want to prevent bots from
downloading substitutes. I don't really have opinions about any of the
details.


* bug#52338: Crawler bots are downloading substitutes
  2021-12-10 16:22       ` Leo Famulari
@ 2021-12-10 16:47         ` Tobias Geerinckx-Rice via Bug reports for GNU Guix
  2021-12-11  9:46           ` Mathieu Othacehe
  0 siblings, 1 reply; 10+ messages in thread
From: Tobias Geerinckx-Rice via Bug reports for GNU Guix @ 2021-12-10 16:47 UTC
  To: Leo Famulari; +Cc: othacehe, 52338


Leo Famulari writes:
> Alright, I leave it up to you.

Dammit.

Kind regards,

T G-R


* bug#52338: Crawler bots are downloading substitutes
  2021-12-10 16:47         ` Tobias Geerinckx-Rice via Bug reports for GNU Guix
@ 2021-12-11  9:46           ` Mathieu Othacehe
  2021-12-19 16:53             ` Mathieu Othacehe
  0 siblings, 1 reply; 10+ messages in thread
From: Mathieu Othacehe @ 2021-12-11  9:46 UTC
  To: Tobias Geerinckx-Rice; +Cc: 52338

Hey,

The Cuirass web interface logs were quite silent this morning and I
suspected an issue somewhere. I then realized that you did update the
Nginx conf and the bots were no longer knocking at our door, which is
great!

Thanks to both of you,

Mathieu


* bug#52338: Crawler bots are downloading substitutes
  2021-12-11  9:46           ` Mathieu Othacehe
@ 2021-12-19 16:53             ` Mathieu Othacehe
  0 siblings, 0 replies; 10+ messages in thread
From: Mathieu Othacehe @ 2021-12-19 16:53 UTC
  To: 52338-done


> Thanks to both of you,

And closing!

Mathieu





* bug#52338: Crawler bots are downloading substitutes
  2021-12-06 21:20 bug#52338: Crawler bots are downloading substitutes Leo Famulari
  2021-12-06 22:18 ` bug#52338: [maintenance] hydra: berlin: Create robots.txt Leo Famulari
@ 2021-12-10 21:21 ` Mark H Weaver
  2021-12-10 22:52   ` Tobias Geerinckx-Rice via Bug reports for GNU Guix
  1 sibling, 1 reply; 10+ messages in thread
From: Mark H Weaver @ 2021-12-10 21:21 UTC
  To: Leo Famulari, 52338

Hi Leo,

Leo Famulari <leo@famulari.name> writes:

> I noticed that some bots are downloading substitutes from
> ci.guix.gnu.org.
>
> We should add a robots.txt file to reduce this waste.
>
> Specifically, I see bots from Bing and Semrush:
>
> https://www.bing.com/bingbot.htm
> https://www.semrush.com/bot.html

For what it's worth: during the years that I administered Hydra, I found
that many bots disregarded the robots.txt file that was in place there.
In practice, I found that I needed to periodically scan the access logs
for bots and forcefully block their requests in order to keep Hydra from
becoming overloaded with expensive queries from bots.
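A rough sketch of that kind of log scan in Python: count requests under `/nar/` per User-Agent string. It assumes the access log uses nginx's default "combined" format, which may not match the real server configuration. The sample lines, addresses and agent names are made up for illustration and are not real log data.

```python
import re
from collections import Counter

# Captures the request path and the final quoted field (the User-Agent)
# of a line in nginx's default "combined" log format.
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" .* "(?P<agent>[^"]*)"$')

def count_agents(lines, prefix="/nar/"):
    """Count requests whose path starts with PREFIX, per User-Agent."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line.rstrip("\n"))
        if m and m.group("path").startswith(prefix):
            counts[m.group("agent")] += 1
    return counts

# Made-up sample lines, for illustration only.
SAMPLE = [
    '192.0.2.1 - - [06/Dec/2021:21:00:00 +0000] '
    '"GET /nar/gzip/example HTTP/1.1" 200 1024 "-" "ExampleBot/1.0"',
    '192.0.2.1 - - [06/Dec/2021:21:00:01 +0000] '
    '"GET /nar/gzip/other HTTP/1.1" 200 2048 "-" "ExampleBot/1.0"',
    '192.0.2.2 - - [06/Dec/2021:21:00:02 +0000] '
    '"GET /nar/gzip/example HTTP/1.1" 200 1024 "-" "GNU Guile"',
    '192.0.2.3 - - [06/Dec/2021:21:00:03 +0000] '
    '"GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
]

for agent, hits in count_agents(SAMPLE).most_common():
    print(hits, agent)
```

Agents near the top of the list that are not substitute clients are the candidates for blocking.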

     Regards,
       Mark


* bug#52338: Crawler bots are downloading substitutes
  2021-12-10 21:21 ` Mark H Weaver
@ 2021-12-10 22:52   ` Tobias Geerinckx-Rice via Bug reports for GNU Guix
  0 siblings, 0 replies; 10+ messages in thread
From: Tobias Geerinckx-Rice via Bug reports for GNU Guix @ 2021-12-10 22:52 UTC
  To: Mark H Weaver; +Cc: 52338, leo


All,

Mark H Weaver writes:
> For what it's worth: during the years that I administered Hydra, 
> I found
> that many bots disregarded the robots.txt file that was in place 
> there.
> In practice, I found that I needed to periodically scan the 
> access logs
> for bots and forcefully block their requests in order to keep 
> Hydra from
> becoming overloaded with expensive queries from bots.

Very good point.

IME (which is a few years old at this point) at least the 
highlighted BingBot & SemrushThing always respected my robots.txt, 
but it's definitely a concern.  I'll leave this bug open to remind 
us of that in a few weeks or so…

If it does become a problem, we (I) might add some basic 
User-Agent sniffing to either slow down or outright block 
non-Guile downloaders.  Whitelisting any legitimate ones, of 
course.  I think that's less hassle than dealing with dynamic IP 
blocks whilst being equally effective here.
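The decision logic for such sniffing could look roughly like the Python sketch below. It is only an illustration of the idea, not a proposed implementation: the real check would live in the Nginx configuration. It assumes that Guix clients send a User-Agent beginning with "GNU Guile", which should be verified against real logs first. The allowlist contents and the `decide` helper are hypothetical.

```python
# Assumption: substitute downloads from Guix clients carry a User-Agent
# that starts with "GNU Guile".  Check real logs before relying on this.
ALLOWED_PREFIXES = ("GNU Guile",)

# Hypothetical allowlist for any other legitimate downloaders.
ALLOWED_EXACT = frozenset()

def decide(path, user_agent, protected_prefix="/nar/"):
    """Return 'allow' or 'block' for a single request."""
    if not path.startswith(protected_prefix):
        return "allow"      # only substitute downloads are restricted
    if user_agent.startswith(ALLOWED_PREFIXES) or user_agent in ALLOWED_EXACT:
        return "allow"      # recognised downloader
    return "block"          # anything else; could be throttled instead

# Hypothetical requests, for illustration only.
print(decide("/nar/gzip/example", "GNU Guile"))
print(decide("/nar/gzip/example", "ExampleBot/1.0"))
print(decide("/", "ExampleBot/1.0"))
```

A User-Agent header is trivially spoofed, so this only deters crawlers that identify themselves honestly.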

Thanks (again) for taking care of Hydra, Mark, and thank you Leo 
for keeping an eye on Cuirass :-)

T G-R

