unofficial mirror of meta@public-inbox.org
 help / color / mirror / Atom feed
* Feature R/BUG: Auto uri_unescape() & utf8 handling
@ 2017-05-23 10:17 Ævar Arnfjörð Bjarmason
  2017-05-23 18:39 ` Eric Wong
  2017-05-23 18:43 ` Feature R/BUG: Auto uri_unescape() & utf8 handling Eric Wong
  0 siblings, 2 replies; 5+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-05-23 10:17 UTC (permalink / raw)
  To: meta

Feature request:

It would be neat to be able to right click on highlighted msgids in
the "what's cooking e-mails" in GMail and paste them into public
inbox, unfortunately Chrome copies this URL escaped, so you get e.g.:

https://public-inbox.org/git/?q=%3CCACBZZX4UUwzRQmyH8joYaqHnuVTjVtGBHp%252BiZKcnAnwoM_ZJhg%40mail.gmail.com%3E

Whereas that doesn't work because it's seacrhing for a URI escaped
msgid that contains + converted to %2B

Whereas unescaping it works:

https://public-inbox.org/git/?q=CACBZZX4UUwzRQmyH8joYaqHnuVTjVtGBHp%2BiZKcnAnwoM_ZJhg%40mail.gmail.com

I don't know if it breaks anything else but calling uri_unescape()
would work for these sort of msgids, but then of course searching for
%2b wouldn't work, hrm...

BUG:

The code is missing a utf8::decode() or equivalent somewhere, try
searching for: https://public-inbox.org/git/?q=%C3%86var

The search works, but the text in the search box is garbled, the
second search is https://public-inbox.org/git/?q=%C3%83%E2%80%A0var
third https://public-inbox.org/git/?q=%C3%83%C6%92%C3%A2%E2%82%AC%C2%A0var
etc.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Feature R/BUG: Auto uri_unescape() & utf8 handling
  2017-05-23 10:17 Feature R/BUG: Auto uri_unescape() & utf8 handling Ævar Arnfjörð Bjarmason
@ 2017-05-23 18:39 ` Eric Wong
  2017-05-23 22:02   ` [PATCH] www: do not mangle characters from search queries Eric Wong
  2017-05-23 18:43 ` Feature R/BUG: Auto uri_unescape() & utf8 handling Eric Wong
  1 sibling, 1 reply; 5+ messages in thread
From: Eric Wong @ 2017-05-23 18:39 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: meta

Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> BUG:
> 
> The code is missing a utf8::decode() or equivalent somewhere, try
> searching for: https://public-inbox.org/git/?q=%C3%86var
> 
> The search works, but the text in the search box is garbled, the
> second search is https://public-inbox.org/git/?q=%C3%83%E2%80%A0var
> third https://public-inbox.org/git/?q=%C3%83%C6%92%C3%A2%E2%82%AC%C2%A0var
> etc.

Thanks for the report, I'm testing the patch below on
public-inbox.org and it seems fine.  I'll need to write a test for
this...

diff --git a/lib/PublicInbox/MID.pm b/lib/PublicInbox/MID.pm
index 1c2d75c..2613c8e 100644
--- a/lib/PublicInbox/MID.pm
+++ b/lib/PublicInbox/MID.pm
@@ -6,7 +6,7 @@ package PublicInbox::MID;
 use strict;
 use warnings;
 use base qw/Exporter/;
-our @EXPORT_OK = qw/mid_clean id_compress mid2path mid_mime mid_escape/;
+our @EXPORT_OK = qw/mid_clean id_compress mid2path mid_mime mid_escape MID_ESC/;
 use URI::Escape qw(uri_escape_utf8);
 use Digest::SHA qw/sha1_hex/;
 use constant MID_MAX => 40; # SHA-1 hex length
diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm
index cec87c6..42bc648 100644
--- a/lib/PublicInbox/SearchView.pm
+++ b/lib/PublicInbox/SearchView.pm
@@ -222,7 +222,9 @@ sub mset_thread {
 
 sub ctx_prepare {
 	my ($q, $ctx) = @_;
-	my $qh = ascii_html($q->{'q'});
+	my $qh = $q->{'q'};
+	utf8::decode($qh);
+	$qh = ascii_html($qh);
 	$ctx->{-q_value_html} = $qh;
 	$ctx->{-atom} = '?'.$q->qs_html(x => 'A', r => undef);
 	$ctx->{-title_html} = "$qh - search results";
@@ -254,8 +256,9 @@ sub adump {
 package PublicInbox::SearchQuery;
 use strict;
 use warnings;
+use URI::Escape qw(uri_escape);
 use PublicInbox::Hval;
-use PublicInbox::MID qw(mid_escape);
+use PublicInbox::MID qw(MID_ESC);
 
 sub new {
 	my ($class, $qp) = @_;
@@ -280,7 +283,7 @@ sub qs_html {
 		$self = $tmp;
 	}
 
-	my $q = mid_escape($self->{'q'});
+	my $q = uri_escape($self->{'q'}, MID_ESC);
 	$q =~ s/%20/+/g; # improve URL readability
 	my $qs = "q=$q";
 
diff --git a/lib/PublicInbox/WWW.pm b/lib/PublicInbox/WWW.pm
index 13b3921..f3c702e 100644
--- a/lib/PublicInbox/WWW.pm
+++ b/lib/PublicInbox/WWW.pm
@@ -42,6 +42,7 @@ sub call {
 
 	# we don't care about multi-value
 	my %qp = map {
+		utf8::decode($_);
 		my ($k, $v) = split('=', uri_unescape($_), 2);
 		$v = '' unless defined $v;
 		$v =~ tr/+/ /;

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: Feature R/BUG: Auto uri_unescape() & utf8 handling
  2017-05-23 10:17 Feature R/BUG: Auto uri_unescape() & utf8 handling Ævar Arnfjörð Bjarmason
  2017-05-23 18:39 ` Eric Wong
@ 2017-05-23 18:43 ` Eric Wong
  2017-05-23 23:17   ` [PATCH] searchview: retry queries if uri_unescape-able Eric Wong
  1 sibling, 1 reply; 5+ messages in thread
From: Eric Wong @ 2017-05-23 18:43 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: meta

Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> Feature request:
> 
> It would be neat to be able to right click on highlighted msgids in
> the "what's cooking e-mails" in GMail and paste them into public
> inbox, unfortunately Chrome copies this URL escaped, so you get e.g.:
> 
> https://public-inbox.org/git/?q=%3CCACBZZX4UUwzRQmyH8joYaqHnuVTjVtGBHp%252BiZKcnAnwoM_ZJhg%40mail.gmail.com%3E
> 
> Whereas that doesn't work because it's seacrhing for a URI escaped
> msgid that contains + converted to %2B
> 
> Whereas unescaping it works:
> 
> https://public-inbox.org/git/?q=CACBZZX4UUwzRQmyH8joYaqHnuVTjVtGBHp%2BiZKcnAnwoM_ZJhg%40mail.gmail.com
> 
> I don't know if it breaks anything else but calling uri_unescape()
> would work for these sort of msgids, but then of course searching for
> %2b wouldn't work, hrm...

I wonder if it could fall back to auto-unescape if it sees '%'
and can't find any messages, or use an OR query.

Fwiw, I've actually been thinking about doing auto-linkification
for messages (and maybe other things).

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH] www: do not mangle characters from search queries
  2017-05-23 18:39 ` Eric Wong
@ 2017-05-23 22:02   ` Eric Wong
  0 siblings, 0 replies; 5+ messages in thread
From: Eric Wong @ 2017-05-23 22:02 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: meta

Eric Wong <e@80x24.org> wrote:
> Thanks for the report, I'm testing the patch below on
> public-inbox.org and it seems fine.  I'll need to write a test for
> this...

OK, I've pushed this out to public-inbox.git and deployed
to all onions, too; will look at the MID unescaping in a bit.
Thanks again.

----8<------
Subject: [PATCH] www: do not mangle characters from search queries
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
  https://public-inbox.org/meta/CACBZZX5Gnow08r=0A1J_kt3a=zpGyMfvsqu8nAN7kacNnDm+dg@mail.gmail.com/
---
 MANIFEST                      |  1 +
 lib/PublicInbox/MID.pm        |  2 +-
 lib/PublicInbox/SearchView.pm |  9 ++++--
 lib/PublicInbox/WWW.pm        |  1 +
 t/psgi_search.t               | 71 +++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 80 insertions(+), 4 deletions(-)
 create mode 100644 t/psgi_search.t

diff --git a/MANIFEST b/MANIFEST
index d1e0952..3bfd9a4 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -155,6 +155,7 @@ t/plack.t
 t/precheck.t
 t/psgi_attach.t
 t/psgi_mount.t
+t/psgi_search.t
 t/psgi_text.t
 t/qspawn.t
 t/search-thr-index.t
diff --git a/lib/PublicInbox/MID.pm b/lib/PublicInbox/MID.pm
index 1c2d75c..2613c8e 100644
--- a/lib/PublicInbox/MID.pm
+++ b/lib/PublicInbox/MID.pm
@@ -6,7 +6,7 @@ package PublicInbox::MID;
 use strict;
 use warnings;
 use base qw/Exporter/;
-our @EXPORT_OK = qw/mid_clean id_compress mid2path mid_mime mid_escape/;
+our @EXPORT_OK = qw/mid_clean id_compress mid2path mid_mime mid_escape MID_ESC/;
 use URI::Escape qw(uri_escape_utf8);
 use Digest::SHA qw/sha1_hex/;
 use constant MID_MAX => 40; # SHA-1 hex length
diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm
index cec87c6..42bc648 100644
--- a/lib/PublicInbox/SearchView.pm
+++ b/lib/PublicInbox/SearchView.pm
@@ -222,7 +222,9 @@ sub mset_thread {
 
 sub ctx_prepare {
 	my ($q, $ctx) = @_;
-	my $qh = ascii_html($q->{'q'});
+	my $qh = $q->{'q'};
+	utf8::decode($qh);
+	$qh = ascii_html($qh);
 	$ctx->{-q_value_html} = $qh;
 	$ctx->{-atom} = '?'.$q->qs_html(x => 'A', r => undef);
 	$ctx->{-title_html} = "$qh - search results";
@@ -254,8 +256,9 @@ sub adump {
 package PublicInbox::SearchQuery;
 use strict;
 use warnings;
+use URI::Escape qw(uri_escape);
 use PublicInbox::Hval;
-use PublicInbox::MID qw(mid_escape);
+use PublicInbox::MID qw(MID_ESC);
 
 sub new {
 	my ($class, $qp) = @_;
@@ -280,7 +283,7 @@ sub qs_html {
 		$self = $tmp;
 	}
 
-	my $q = mid_escape($self->{'q'});
+	my $q = uri_escape($self->{'q'}, MID_ESC);
 	$q =~ s/%20/+/g; # improve URL readability
 	my $qs = "q=$q";
 
diff --git a/lib/PublicInbox/WWW.pm b/lib/PublicInbox/WWW.pm
index 13b3921..f3c702e 100644
--- a/lib/PublicInbox/WWW.pm
+++ b/lib/PublicInbox/WWW.pm
@@ -42,6 +42,7 @@ sub call {
 
 	# we don't care about multi-value
 	my %qp = map {
+		utf8::decode($_);
 		my ($k, $v) = split('=', uri_unescape($_), 2);
 		$v = '' unless defined $v;
 		$v =~ tr/+/ /;
diff --git a/t/psgi_search.t b/t/psgi_search.t
new file mode 100644
index 0000000..cc9c9cf
--- /dev/null
+++ b/t/psgi_search.t
@@ -0,0 +1,71 @@
+# Copyright (C) 2017 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+use strict;
+use warnings;
+use Test::More;
+use File::Temp qw/tempdir/;
+use Email::MIME;
+use PublicInbox::Config;
+use PublicInbox::WWW;
+my @mods = qw(PublicInbox::SearchIdx HTTP::Request::Common Plack::Test
+		URI::Escape Plack::Builder);
+foreach my $mod (@mods) {
+	eval "require $mod";
+	plan skip_all => "$mod missing for psgi_search.t" if $@;
+}
+use_ok $_ foreach @mods;
+my $tmpdir = tempdir('pi-psgi-search.XXXXXX', TMPDIR => 1, CLEANUP => 1);
+my $git_dir = "$tmpdir/a.git";
+
+is(0, system(qw(git init -q --bare), $git_dir), "git init (main)");
+my $rw = PublicInbox::SearchIdx->new($git_dir, 1);
+ok($rw, "search indexer created");
+my $data = <<'EOF';
+Subject: test
+Message-Id: <utf8@example>
+From: Ævar Arnfjörð Bjarmason <avarab@example>
+To: git@vger.kernel.org
+
+EOF
+
+my $num = 0;
+# nb. using internal API, fragile!
+my $xdb = $rw->_xdb_acquire;
+$xdb->begin_transaction;
+
+foreach (reverse split(/\n\n/, $data)) {
+	$_ .= "\n";
+	my $mime = Email::MIME->new(\$_);
+	my $bytes = bytes::length($mime->as_string);
+	my $doc_id = $rw->add_message($mime, $bytes, ++$num, 'ignored');
+	my $mid = $mime->header('Message-Id');
+	ok($doc_id, 'message added: '. $mid);
+}
+
+$xdb->commit_transaction;
+$rw = undef;
+
+my $cfgpfx = "publicinbox.test";
+my $config = PublicInbox::Config->new({
+	"$cfgpfx.address" => 'git@vger.kernel.org',
+	"$cfgpfx.mainrepo" => $git_dir,
+});
+my $www = PublicInbox::WWW->new($config);
+test_psgi(sub { $www->call(@_) }, sub {
+	my ($cb) = @_;
+	my $res;
+	$res = $cb->(GET('/test/?q=%C3%86var'));
+	my $html = $res->content;
+	like($html, qr/<title>&#198;var - /, 'HTML escaped in title');
+	my @res = ($html =~ m/\?q=(.+var)\b/g);
+	ok(scalar(@res), 'saw query strings');
+	my %uniq = map { $_ => 1 } @res;
+	is(1, scalar keys %uniq, 'all query values identical in HTML');
+	is('%C3%86var', (keys %uniq)[0], 'matches original query');
+	ok(index($html, 'by &#198;var Arnfj&#246;r&#240; Bjarmason') >= 0,
+		"displayed Ævar's name properly in HTML");
+});
+
+done_testing();
+
+1;
-- 
EW

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH] searchview: retry queries if uri_unescape-able
  2017-05-23 18:43 ` Feature R/BUG: Auto uri_unescape() & utf8 handling Eric Wong
@ 2017-05-23 23:17   ` Eric Wong
  0 siblings, 0 replies; 5+ messages in thread
From: Eric Wong @ 2017-05-23 23:17 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: meta

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> > I don't know if it breaks anything else but calling uri_unescape()
> > would work for these sort of msgids, but then of course searching for
> > %2b wouldn't work, hrm...

So it's now trying the query as-is first, then retrying the
unescaped query if there's no results:

------8<------
Subject: [PATCH] searchview: retry queries if uri_unescape-able
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

It is possible to have double-escaped queries when copy and
pasting into browsers, so try to help users work around this
common error by automatically retrying after unescaping once.

Of course, we must inform the user when doing this results in
success, in case they really meant to search for a
double-escaped term which resulted in nothing.

Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
  https://public-inbox.org/meta/CACBZZX5Gnow08r=0A1J_kt3a=zpGyMfvsqu8nAN7kacNnDm+dg@mail.gmail.com/
---
 lib/PublicInbox/SearchView.pm | 36 +++++++++++++++++++++++++++++-------
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm
index 42bc648..f92790f 100644
--- a/lib/PublicInbox/SearchView.pm
+++ b/lib/PublicInbox/SearchView.pm
@@ -5,11 +5,12 @@
 package PublicInbox::SearchView;
 use strict;
 use warnings;
+use URI::Escape qw(uri_unescape uri_escape);
 use PublicInbox::SearchMsg;
 use PublicInbox::Hval qw/ascii_html/;
 use PublicInbox::View;
 use PublicInbox::WwwAtomStream;
-use PublicInbox::MID qw(mid2path mid_mime mid_clean mid_escape);
+use PublicInbox::MID qw(mid2path mid_mime mid_clean mid_escape MID_ESC);
 use PublicInbox::MIME;
 require PublicInbox::Git;
 require PublicInbox::SearchThread;
@@ -29,19 +30,27 @@ sub sres_top_html {
 		mset => 1,
 		relevance => $q->{r},
 	};
-	my ($mset, $total);
+	my ($mset, $total, $err, $cb);
+retry:
 	eval {
 		$mset = $ctx->{srch}->query($q->{'q'}, $opts);
 		$total = $mset->get_matches_estimated;
 	};
-	my $err = $@;
+	$err = $@;
 	ctx_prepare($q, $ctx);
-	my $cb;
 	if ($err) {
 		$code = 400;
 		$ctx->{-html_tip} = '<pre>'.err_txt($ctx, $err).'</pre><hr>';
 		$cb = *noop;
 	} elsif ($total == 0) {
+		if (defined($ctx->{-uxs_retried})) {
+			# undo retry damage:
+			$q->{'q'} = $ctx->{-uxs_retried};
+		} elsif (index($q->{'q'}, '%') >= 0) {
+			$ctx->{-uxs_retried} = $q->{'q'};
+			$q->{'q'} = uri_unescape($q->{'q'});
+			goto retry;
+		}
 		$code = 404;
 		$ctx->{-html_tip} = "<pre>\n[No results found]</pre><hr>";
 		$cb = *noop;
@@ -49,7 +58,7 @@ sub sres_top_html {
 		my $x = $q->{x};
 		return adump($_[0], $mset, $q, $ctx) if $x eq 'A';
 
-		$ctx->{-html_tip} = search_nav_top($mset, $q) . "\n\n";
+		$ctx->{-html_tip} = search_nav_top($mset, $q, $ctx) . "\n\n";
 		if ($x eq 't') {
 			$cb = mset_thread($ctx, $mset, $q);
 		} else {
@@ -113,9 +122,22 @@ sub err_txt {
 }
 
 sub search_nav_top {
-	my ($mset, $q) = @_;
+	my ($mset, $q, $ctx) = @_;
+
+	my $rv = '<pre>';
+	my $initial_q = $ctx->{-uxs_retried};
+	if (defined $initial_q) {
+		my $rewritten = $q->{'q'};
+		utf8::decode($initial_q);
+		utf8::decode($rewritten);
+		$initial_q = ascii_html($initial_q);
+		$rewritten = ascii_html($rewritten);
+		$rv .= " Warning: Initial query:\n <b>$initial_q</b>\n";
+		$rv .= " returned no results, used:\n";
+		$rv .= " <b>$rewritten</b>\n instead\n\n";
+	}
 
-	my $rv = "<pre>Search results ordered by [";
+	$rv .= 'Search results ordered by [';
 	if ($q->{r}) {
 		my $d = $q->qs_html(r => 0);
 		$rv .= qq{<a\nhref="?$d">date</a>|<b>relevance</b>};
-- 
(deployed on public-inbox.org and all .onions)

^ permalink raw reply related	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2017-05-23 23:17 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-05-23 10:17 Feature R/BUG: Auto uri_unescape() & utf8 handling Ævar Arnfjörð Bjarmason
2017-05-23 18:39 ` Eric Wong
2017-05-23 22:02   ` [PATCH] www: do not mangle characters from search queries Eric Wong
2017-05-23 18:43 ` Feature R/BUG: Auto uri_unescape() & utf8 handling Eric Wong
2017-05-23 23:17   ` [PATCH] searchview: retry queries if uri_unescape-able Eric Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).