unofficial mirror of meta@public-inbox.org
 help / color / mirror / Atom feed
* [BUG] Unescaped '&' ampersands in atom header links
@ 2023-11-27  8:17 Ricardo Cañuelo
  2023-11-27 10:23 ` Eric Wong
  0 siblings, 1 reply; 3+ messages in thread
From: Ricardo Cañuelo @ 2023-11-27  8:17 UTC (permalink / raw)
  To: meta

Hi,

When parsing outputs from lore.kernel.org with Python3 xml.dom.minidom I
noticed that, for queries that contain '&' characters, they aren't
escaped in the href attributes of the title tags in atom feed headers.

So, for example, for this request:

https://lore.kernel.org/all/?x=A&q=driver+core%3A+Fix+wait_for_device_probe%28%29+%26+deferred_probe_timeout+interaction

The atom header in the output contains:

<title
type="html">driver core: Fix wait_for_device_probe() &#38; deferred_probe_timeout interaction - search results</title><link
rel="alternate"
type="text/html"
href="http://lore.kernel.org/all/?q=driver+core:+Fix+wait_for_device_probe()+&+deferred_probe_timeout+interaction"/><link
rel="self"
href="http://lore.kernel.org/all/?q=driver+core:+Fix+wait_for_device_probe()+&+deferred_probe_timeout+interaction&amp;x=A"/>

where the '&' character is escaped in the text of the <title> tag but
not in the href attributes. Shouldn't these be escaped as well? If so,
the fix should be most likely located in WwwAtomStream.pm:atom_header().

Cheers,
Ricardo

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [BUG] Unescaped '&' ampersands in atom header links
  2023-11-27  8:17 [BUG] Unescaped '&' ampersands in atom header links Ricardo Cañuelo
@ 2023-11-27 10:23 ` Eric Wong
  2023-11-27 21:55   ` Eric Wong
  0 siblings, 1 reply; 3+ messages in thread
From: Eric Wong @ 2023-11-27 10:23 UTC (permalink / raw)
  To: Ricardo Cañuelo; +Cc: meta

Ricardo Cañuelo <ricardo.canuelo@collabora.com> wrote:
<snip>
> where the '&' character is escaped in the text of the <title> tag but
> not in the href attributes. Shouldn't these be escaped as well? If so,
> the fix should be most likely located in WwwAtomStream.pm:atom_header().

Thanks for the bug report.  Yes, '&' should be escaped, that's
the job of the ->qs_html call from atom_header().

The below fixes it and also brings a minor improvement in
allowing '/' to be unescaped for dfn: queries.  More could be
allowed, actually, but '/' is actually common.

But I'm running on fumes right now, so an extra set of eyes
would be helpful.

deployed on https://yhbt.net/lore/

-----8<-----
Subject: [PATCH] www: qs_html: fix escaping of `q' param

Our use of MID_ESC characters was only intended for the pathname
component of URIs and not appropriate for the query string
component.  So use a different $unsafe parameter list for
uri_escape to make the result appropriate for query strings by
disallowing [\&\'\+=] characters.  Most notably, this change
also allows us to accept `/' (slash) unescaped to make dfn: queries
nicer to look at.

Finally, we'll also add a ascii_html call on the URI-escaped
result as an extra safety measure even though it's not really
needed.

As far as I can tell, the code without this fix didn't result in
in an HTML injection since all our uses of uri_escape did escape
angle brackets.

Reported-by: Ricardo Cañuelo <ricardo.canuelo@collabora.com>
Link: https://public-inbox.org/meta/87o7ff4nlk.fsf@collabora.com/
---
 lib/PublicInbox/MID.pm         |  2 +-
 lib/PublicInbox/SearchQuery.pm | 10 +++++++---
 t/psgi_search.t                | 27 ++++++++++++++++++++++++++-
 3 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/lib/PublicInbox/MID.pm b/lib/PublicInbox/MID.pm
index b1ae9939..97cf3a54 100644
--- a/lib/PublicInbox/MID.pm
+++ b/lib/PublicInbox/MID.pm
@@ -125,7 +125,7 @@ sub uniq_mids ($;$) {
 	\@ret;
 }
 
-# RFC3986, section 3.3:
+# RFC3986, section 3.3 (pathnames only):
 sub MID_ESC () { '^A-Za-z0-9\-\._~!\$\&\'\(\)\*\+,;=:@' }
 sub mid_escape ($) { uri_escape_utf8($_[0], MID_ESC) }
 
diff --git a/lib/PublicInbox/SearchQuery.pm b/lib/PublicInbox/SearchQuery.pm
index 96246c53..747e3249 100644
--- a/lib/PublicInbox/SearchQuery.pm
+++ b/lib/PublicInbox/SearchQuery.pm
@@ -6,7 +6,7 @@ package PublicInbox::SearchQuery;
 use strict;
 use v5.10.1;
 use URI::Escape qw(uri_escape);
-use PublicInbox::MID qw(MID_ESC);
+use PublicInbox::Hval qw(ascii_html);
 our $LIM = 200;
 
 sub new {
@@ -35,9 +35,13 @@ sub qs_html {
 	}
 	my $qs = '';
 	if (defined(my $q = $self->{'q'})) {
-		$q = uri_escape($q, MID_ESC);
+		# not using MID_ESC since that's for the path component and
+		# this is for the query component.  Unlike MID_ESC,
+		# this disallows [\&\'\+=] and allows slash [/] for
+		# nicer looking dfn: queries
+		$q = uri_escape($q, '^A-Za-z0-9\-\._~!\$\(\)\*,;:@/');
 		$q =~ s/%20/+/g; # improve URL readability
-		$qs .= "q=$q";
+		$qs .= 'q='.ascii_html($q);
 	}
 	if (my $o = $self->{o}) { # ignore o == 0
 		$qs .= "&amp;o=$o";
diff --git a/t/psgi_search.t b/t/psgi_search.t
index 289c34e7..8c981c6c 100644
--- a/t/psgi_search.t
+++ b/t/psgi_search.t
@@ -18,7 +18,8 @@ local $ENV{TZ} = 'UTC';
 my $digits = '10010260936330';
 my $ua = 'Pine.LNX.4.10';
 my $mid = "$ua.$digits.2460-100000\@penguin.transmeta.com";
-my $ibx = create_inbox 'git', indexlevel => 'full', tmpdir => "$tmpdir/1", sub {
+my $ibx = create_inbox '26-git', indexlevel => 'full', tmpdir => "$tmpdir/1",
+sub {
 	my ($im) = @_;
 	# n.b. these headers are not properly RFC2047-encoded
 	$im->add(PublicInbox::Eml->new(<<EOF)) or BAIL_OUT;
@@ -48,6 +49,17 @@ Message-ID: <no-subject-at-all@example.com>
 From: no subject at all <no-subject-at-all@example.com>
 To: git@vger.kernel.org
 
+EOF
+	$im->add(PublicInbox::Eml->new(<<'EOF')) or BAIL_OUT;
+Message-ID: <ampersand@example.com>
+From: <e@example.com>
+To: git@vger.kernel.org
+Subject: git & ampersand
+
+hi +++ b/foo
+x=y
+s'more
+
 EOF
 };
 
@@ -155,6 +167,19 @@ test_psgi(sub { $www->call(@_) }, sub {
 	is($res->code, 200, 'successful mbox download w/ threads');
 	gunzip(\($res->content) => \(my $after));
 	isnt($before, $after);
+
+	$res = $cb->(GET('/test/?q=git+%26+ampersand&x=A'));
+	is $res->code, 200, 'Atom hit with ampersand';
+	unlike $res->content, qr/git\+&\+ampersand/, '& is HTML-escaped';
+
+	$res = $cb->(GET('/test/?q=%22hi+%2b%2b%2b+b/foo%22&x=A'));
+	is $res->code, 200, 'slashes and plusses search hit';
+	like $res->content, qr!q=%22hi\+(?:%2[bB]){3}\+b/foo%22!,
+		'+ and " escaped, but slash not escaped in query';
+
+	$res = $cb->(GET(q{/test/?q=%22s'more%22&x=A}));
+	is $res->code, 200, 'single quote inside phrase';
+	# TODO: more tests and odd cases
 });
 
 done_testing();

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [BUG] Unescaped '&' ampersands in atom header links
  2023-11-27 10:23 ` Eric Wong
@ 2023-11-27 21:55   ` Eric Wong
  0 siblings, 0 replies; 3+ messages in thread
From: Eric Wong @ 2023-11-27 21:55 UTC (permalink / raw)
  To: Ricardo Cañuelo; +Cc: meta

Thanks, pushed as commit 577e421a0815e66f965bd4317adad5aeea3cc52a
with your Tested-By (sent privately in <87leaj4ea1.fsf@collabora.com>)

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2023-11-27 21:58 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-11-27  8:17 [BUG] Unescaped '&' ampersands in atom header links Ricardo Cañuelo
2023-11-27 10:23 ` Eric Wong
2023-11-27 21:55   ` Eric Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).