Subject: bug#52338: Crawler bots are downloading substitutes
From: Tobias Geerinckx-Rice via Bug reports for GNU Guix
Reply-to: Tobias Geerinckx-Rice
To: Mathieu Othacehe
Cc: 52338@debbugs.gnu.org, leo@famulari.name
Date: Thu, 09 Dec 2021 16:42:24 +0100
Message-ID: <87sfv1ivl2.fsf@nckx>
In-reply-to: <87tufh6h85.fsf_-_@gnu.org>
References: <2f52f6b48db55f8a79b07dbb242b297ab49d6083.1638828946.git.leo@famulari.name> <87tufh6h85.fsf_-_@gnu.org>

Mathieu Othacehe wrote:
> Hello Leo,
>
>> +           (nginx-location-configuration
>> +             (uri "/robots.txt")

It's a micro-optimisation, but it can't hurt to generate
‘location = /robots.txt’ instead of ‘location /robots.txt’ here.

>> +             (body
>> +               (list
>> +                 "add_header  Content-Type  text/plain;"
>> +                 "return 200 \"User-agent: *\nDisallow: /nar/\n\";"))))))

Use \r\n instead of \n, even if \n happens to work.

There are many ‘buggy’ crawlers out there.  It's in their own
interest to be fussy whilst claiming to respect robots.txt.  The
less you deviate from the most basic norm imaginable, the better.

I tested whether embedding raw \r\n bytes in nginx.conf strings
like this works, and it seems to, even though a human would
probably not do so.

> Nice, the bots are also accessing the Cuirass web interface, do you
> think it would be possible to extend this snippet to prevent it?

You can replace ‘/nar/’ with ‘/’ to disallow everything:

  Disallow: /

If we want crawlers to index only the front page (so people can
search for ‘Guix CI’, I guess), that's possible:

  Disallow: /
  Allow:    /$

Don't confuse ‘$’ with ‘supports regexps’.  Buggy bots might fall
back to ‘Disallow: /’.

This is where it gets ugly: nginx doesn't support escaping ‘$’ in
strings.  At all.  It's insane.

  geo $dollar { default "$"; } # stackoverflow.com/questions/57466554

  server {
    location = /robots.txt {
      return 200
        "User-agent: *\r\nDisallow: /\r\nAllow: /$dollar\r\n";
    }
  }

*Obviously.*

An alternative to that is to serve a real on-disc robots.txt.

Kind regards,

T G-R
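
To make the ‘$’ point above concrete, here is a minimal Python sketch of how a well-behaved crawler might evaluate the ‘Disallow: /’ plus ‘Allow: /$’ pair. It assumes RFC 9309-style matching: the longest matching pattern wins, ‘*’ is a wildcard, and a trailing ‘$’ anchors the end of the path. The function names (rule_to_regex, parse_rules, allowed) are made up for illustration, only the ‘User-agent: *’ group is handled, and real crawlers may well behave differently.

```python
import re


def rule_to_regex(path):
    """Compile a robots.txt path pattern.

    Only two characters are special: '*' matches any run of characters and
    a trailing '$' anchors the end of the URL path. The rest is literal,
    so this is not general regexp support.
    """
    anchored = path.endswith("$")
    if anchored:
        path = path[:-1]
    body = ".*".join(re.escape(part) for part in path.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def parse_rules(robots_txt):
    """Return (is_allow, path) pairs from the 'User-agent: *' group."""
    rules = []
    in_star_group = False
    for line in robots_txt.splitlines():  # accepts both \r\n and \n
        line = line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (s.strip() for s in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            in_star_group = value == "*"
        elif in_star_group and field in ("allow", "disallow") and value:
            rules.append((field == "allow", value))
    return rules


def allowed(robots_txt, url_path):
    """Longest matching pattern wins; Allow wins a tie; no match means allowed."""
    best_allow, best_len = True, 0
    for is_allow, path in parse_rules(robots_txt):
        if rule_to_regex(path).match(url_path):
            if len(path) > best_len or (len(path) == best_len and is_allow):
                best_allow, best_len = is_allow, len(path)
    return best_allow


# The body returned by the nginx snippet, with CRLF line endings.
ROBOTS = "User-agent: *\r\nDisallow: /\r\nAllow: /$\r\n"

print(allowed(ROBOTS, "/"))         # front page: True
print(allowed(ROBOTS, "/nar/abc"))  # anything else: False
```

A parser that treats ‘$’ as a literal character would never match ‘Allow: /$’ against ‘/’, which is the fall-back to ‘Disallow: /’ described above.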