unofficial mirror of meta@public-inbox.org
* [PATCH 0/10] search: more mairix prefix compatibility
@ 2016-09-09  0:01 Eric Wong
  2016-09-09  0:01 ` [PATCH 01/10] search: allow searching user fields (To/Cc/From) Eric Wong
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: Eric Wong @ 2016-09-09  0:01 UTC (permalink / raw)
  To: meta

This brings us closer to the behavior of mairix(1) for search
by supporting the t:, c:, f:, tc:, tcf:, n:, b:, and bs:
prefixes as documented in the mairix(1) manpage.

We also introduce the use of q: and nq: prefixes for quoted and
non-quoted text, respectively.

There is a schema version change in [PATCH 7/10] to maintain
compatibility with Debian 7.x wheezy installs.  In-place
reindexing would've been expensive anyway, so the schema bump
is perhaps a good idea regardless, as creating a fresh index
should be faster than --reindex.

Eric Wong (10):
      search: allow searching user fields (To/Cc/From)
      search: drop longer subject: prefix for search
      search: more granular message body searching
      search: fix space regressions from recent changes
      search: match quote detection behavior of view
      search: increase term positions for each quoted hunk
      search: fix compatibility with Debian wheezy
      search: avoid mindlessly calling body_set
      search: match the behavior of WWW for indexing text
      search: index attachment filenames

 lib/PublicInbox/Search.pm    |  32 +++++++++---
 lib/PublicInbox/SearchIdx.pm | 104 ++++++++++++++++++++++++-------------
 t/search.t                   | 120 ++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 206 insertions(+), 50 deletions(-)


* [PATCH 01/10] search: allow searching user fields (To/Cc/From)
  2016-09-09  0:01 [PATCH 0/10] search: more mairix prefix compatibility Eric Wong
@ 2016-09-09  0:01 ` Eric Wong
  2016-09-09  0:01 ` [PATCH 02/10] search: drop longer subject: prefix for search Eric Wong
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2016-09-09  0:01 UTC (permalink / raw)
  To: meta

Sometimes it can be useful to search based on who the
message was sent to, sent by, or Cc:-ed.  Of course,
headers can be faked, but they usually are not...

Anyways this mostly matches the behavior of mairix(1).
---
 lib/PublicInbox/Search.pm    | 10 +++++++-
 lib/PublicInbox/SearchIdx.pm | 59 +++++++++++++++++++++++++++++++-------------
 t/search.t                   | 37 +++++++++++++++++++++++++++
 3 files changed, 88 insertions(+), 18 deletions(-)

diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 445c2d8..aec459b 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -51,8 +51,8 @@ my %bool_pfx_internal = (
 	thread => 'G', # newsGroup (or similar entity - e.g. a web forum name)
 );
 
-# do we still need these? probably not..
 my %bool_pfx_external = (
+	# do we still need these? probably not..
 	path => 'XPATH',
 	mid => 'Q', # uniQue id (Message-ID)
 );
@@ -61,6 +61,14 @@ my %prob_prefix = (
 	subject => 'S',
 	s => 'S', # for mairix compatibility
 	m => 'Q', # 'mid' is exact, 'm' can do partial
+	f => 'A', # for mairix compatibility
+	t => 'XTO', # for mairix compatibility
+	tc => 'XTC', # for mairix compatibility
+	c => 'XCC', # for mairix compatibility
+	tcf => 'XTCF', # for mairix compatibility
+	# n.b.: leaving out "a:" alias for "tcf:" even though
+	# mairix supports it.  It is only mentioned in passing in mairix(1)
+	# and the extra two letters are not significantly longer.
 );
 
 # not documenting m: and mid: for now, the using the URLs works w/o Xapian
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index f54f5f2..37fefbe 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -96,12 +96,51 @@ sub _lock_release {
 	close $lockfh or die "close failed: $!\n";
 }
 
-sub add_val {
+sub add_val ($$$) {
 	my ($doc, $col, $num) = @_;
 	$num = Search::Xapian::sortable_serialise($num);
 	$doc->add_value($col, $num);
 }
 
+sub add_values ($$$) {
+	my ($smsg, $bytes, $num) = @_;
+
+	my $ts = $smsg->ts;
+	my $doc = $smsg->{doc};
+	add_val($doc, &PublicInbox::Search::TS, $ts);
+
+	defined($num) and add_val($doc, &PublicInbox::Search::NUM, $num);
+
+	defined($bytes) and add_val($doc, &PublicInbox::Search::BYTES, $bytes);
+
+	add_val($doc, &PublicInbox::Search::LINES,
+			$smsg->{mime}->body_raw =~ tr!\n!\n!);
+
+	my $yyyymmdd = strftime('%Y%m%d', gmtime($ts));
+	$doc->add_value(&PublicInbox::Search::YYYYMMDD, $yyyymmdd);
+}
+
+sub index_users ($$) {
+	my ($tg, $smsg) = @_;
+
+	my $from = $smsg->from;
+	my $to = $smsg->to;
+	my $cc = $smsg->cc;
+
+	$tg->index_text($from, 1, 'A'); # A - author
+	$tg->increase_termpos;
+
+	$tg->index_text($to, 1, 'XTO') if $to ne '';
+	$tg->index_text($cc, 1, 'XCC') if $cc ne '';
+	my $tc = join("\t", $to, $cc);
+	$tg->index_text($tc, 1, 'XTC') if $tc ne '';
+	my $tcf = join("\t", $tc, $from);
+	$tg->index_text($tcf, 1, 'XTCF') if $tcf ne '';
+
+	$tg->index_text($from);
+	$tg->increase_termpos;
+}
+
 sub add_message {
 	my ($self, $mime, $bytes, $num, $blob) = @_; # mime = Email::MIME object
 	my $db = $self->{xdb};
@@ -129,20 +168,7 @@ sub add_message {
 			$doc->add_term(xpfx('path') . id_compress($path));
 		}
 
-		my $ts = $smsg->ts;
-		add_val($doc, &PublicInbox::Search::TS, $ts);
-
-		defined($num) and
-			add_val($doc, &PublicInbox::Search::NUM, $num);
-
-		defined($bytes) and
-			add_val($doc, &PublicInbox::Search::BYTES, $bytes);
-
-		add_val($doc, &PublicInbox::Search::LINES,
-				$mime->body_raw =~ tr!\n!\n!);
-
-		my $yyyymmdd = strftime('%Y%m%d', gmtime($ts));
-		$doc->add_value(&PublicInbox::Search::YYYYMMDD, $yyyymmdd);
+		add_values($smsg, $bytes, $num);
 
 		my $tg = $self->term_generator;
 
@@ -152,8 +178,7 @@ sub add_message {
 		$tg->index_text($subj) if $subj;
 		$tg->increase_termpos;
 
-		$tg->index_text($smsg->from);
-		$tg->increase_termpos;
+		index_users($tg, $smsg);
 
 		msg_iter($mime, sub {
 			my ($part, $depth, @idx) = @{$_[0]};
diff --git a/t/search.t b/t/search.t
index db94c0a..bb0861a 100644
--- a/t/search.t
+++ b/t/search.t
@@ -86,6 +86,7 @@ my $rw_commit = sub {
 			'Message-ID' => '<last@s>',
 			From => 'John Smith <js@example.com>',
 			To => 'list@example.com',
+			Cc => 'foo@example.com',
 		],
 		body => "goodbye forever :<\n");
 
@@ -324,6 +325,42 @@ sub filter_mids {
 	is(scalar @{$res->{msgs}}, 0, 'nothing before 19931001');
 }
 
+# names and addresses
+{
+	my $res = $ro->query('t:list@example.com');
+	is(scalar @{$res->{msgs}}, 6, 'searched To: successfully');
+	foreach my $smsg (@{$res->{msgs}}) {
+		like($smsg->to, qr/\blist\@example\.com\b/, 'to appears');
+	}
+
+	$res = $ro->query('tc:list@example.com');
+	is(scalar @{$res->{msgs}}, 6, 'searched To+Cc: successfully');
+	foreach my $smsg (@{$res->{msgs}}) {
+		my $tocc = join("\n", $smsg->to, $smsg->cc);
+		like($tocc, qr/\blist\@example\.com\b/, 'tocc appears');
+	}
+
+	foreach my $pfx ('tcf:', 'c:') {
+		$res = $ro->query($pfx . 'foo@example.com');
+		is(scalar @{$res->{msgs}}, 1,
+			"searched $pfx successfully for Cc:");
+		foreach my $smsg (@{$res->{msgs}}) {
+			like($smsg->cc, qr/\bfoo\@example\.com\b/,
+				'cc appears');
+		}
+	}
+
+	foreach my $pfx ('', 'tcf:', 'f:') {
+		$res = $ro->query($pfx . 'Laggy');
+		is(scalar @{$res->{msgs}}, 1,
+			"searched $pfx successfully for From:");
+		foreach my $smsg (@{$res->{msgs}}) {
+			like($smsg->from, qr/Laggy Sender/,
+				"From appears with $pfx");
+		}
+	}
+}
+
 done_testing();
 
 1;
-- 
EW


* [PATCH 02/10] search: drop longer subject: prefix for search
  2016-09-09  0:01 [PATCH 0/10] search: more mairix prefix compatibility Eric Wong
  2016-09-09  0:01 ` [PATCH 01/10] search: allow searching user fields (To/Cc/From) Eric Wong
@ 2016-09-09  0:01 ` Eric Wong
  2016-09-09  0:01 ` [PATCH 03/10] search: more granular message body searching Eric Wong
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2016-09-09  0:01 UTC (permalink / raw)
  To: meta

We only document the "s:" prefix anyways.  While the long name is more
descriptive, the ambiguity makes agnostic caching (by Varnish or
similar) slightly harder and longer URLs are more likely to be
accidentally truncated when shared.
---
 lib/PublicInbox/Search.pm |  1 -
 t/search.t                | 14 +++++++-------
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index aec459b..3b25b66 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -58,7 +58,6 @@ my %bool_pfx_external = (
 );
 
 my %prob_prefix = (
-	subject => 'S',
 	s => 'S', # for mairix compatibility
 	m => 'Q', # 'mid' is exact, 'm' can do partial
 	f => 'A', # for mairix compatibility
diff --git a/t/search.t b/t/search.t
index bb0861a..7abaf83 100644
--- a/t/search.t
+++ b/t/search.t
@@ -123,19 +123,19 @@ sub filter_mids {
 		is($res->{total}, 0, "path variant `$p' does not match");
 	}
 
-	$res = $ro->query('subject:(Hello world)');
+	$res = $ro->query('s:(Hello world)');
 	@res = filter_mids($res);
-	is_deeply(\@res, \@exp, 'got expected results for subject:() match');
+	is_deeply(\@res, \@exp, 'got expected results for s:() match');
 
-	$res = $ro->query('subject:"Hello world"');
+	$res = $ro->query('s:"Hello world"');
 	@res = filter_mids($res);
-	is_deeply(\@res, \@exp, 'got expected results for subject:"" match');
+	is_deeply(\@res, \@exp, 'got expected results for s:"" match');
 
-	$res = $ro->query('subject:"Hello world"', {limit => 1});
+	$res = $ro->query('s:"Hello world"', {limit => 1});
 	is(scalar @{$res->{msgs}}, 1, "limit works");
 	my $first = $res->{msgs}->[0];
 
-	$res = $ro->query('subject:"Hello world"', {offset => 1});
+	$res = $ro->query('s:"Hello world"', {offset => 1});
 	is(scalar @{$res->{msgs}}, 1, "offset works");
 	my $second = $res->{msgs}->[0];
 
@@ -181,7 +181,7 @@ sub filter_mids {
 	$rw_commit->();
 	$ro->reopen;
 
-	# Subject:
+	# subject
 	my $res = $ro->query('ghost');
 	my @exp = sort qw(ghost-message@s ghost-reply@s);
 	my @res = filter_mids($res);
-- 
EW


* [PATCH 03/10] search: more granular message body searching
  2016-09-09  0:01 [PATCH 0/10] search: more mairix prefix compatibility Eric Wong
  2016-09-09  0:01 ` [PATCH 01/10] search: allow searching user fields (To/Cc/From) Eric Wong
  2016-09-09  0:01 ` [PATCH 02/10] search: drop longer subject: prefix for search Eric Wong
@ 2016-09-09  0:01 ` Eric Wong
  2016-09-09  0:01 ` [PATCH 04/10] search: fix space regressions from recent changes Eric Wong
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2016-09-09  0:01 UTC (permalink / raw)
  To: meta

"bs:" and "b:" are adapted from mairix(1)

We will also support searching explicitly for quoted vs
non-quoted text via "q:" and "nq:" prefixes since sometimes
readers will not care for quoted text.

In the future, we will support parsing diffs (perhaps when
repobrowse integration is complete).

Note: this roughly doubles the size of the Xapian database due
to the additional information; so this change may not be worth
it.
---
 lib/PublicInbox/Search.pm    | 18 ++++++++++++------
 lib/PublicInbox/SearchIdx.pm | 17 ++++++++++++++---
 t/search.t                   | 25 +++++++++++++++++++++++++
 3 files changed, 51 insertions(+), 9 deletions(-)

diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 3b25b66..f74129d 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -58,16 +58,22 @@ my %bool_pfx_external = (
 );
 
 my %prob_prefix = (
-	s => 'S', # for mairix compatibility
+	# for mairix compatibility
+	s => 'S',
 	m => 'Q', # 'mid' is exact, 'm' can do partial
-	f => 'A', # for mairix compatibility
-	t => 'XTO', # for mairix compatibility
-	tc => 'XTC', # for mairix compatibility
-	c => 'XCC', # for mairix compatibility
-	tcf => 'XTCF', # for mairix compatibility
+	f => 'A',
+	t => 'XTO',
+	tc => 'XTC',
+	c => 'XCC',
+	tcf => 'XTCF',
+	b => 'XBODY',
+	bs => 'XBS',
+
 	# n.b.: leaving out "a:" alias for "tcf:" even though
 	# mairix supports it.  It is only mentioned in passing in mairix(1)
 	# and the extra two letters are not significantly longer.
+	q => 'XQUOT',
+	nq => 'XNQ',
 );
 
 # not documenting m: and mid: for now, the using the URLs works w/o Xapian
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 37fefbe..cd27a29 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -173,7 +173,10 @@ sub add_message {
 		my $tg = $self->term_generator;
 
 		$tg->set_document($doc);
-		$tg->index_text($subj, 1, 'S') if $subj;
+		if ($subj) {
+			$tg->index_text($subj, 1, 'S');
+			$tg->index_text($subj, 1, 'XBS');
+		}
 		$tg->increase_termpos;
 		$tg->index_text($subj) if $subj;
 		$tg->increase_termpos;
@@ -199,13 +202,21 @@ sub add_message {
 				}
 			}
 			if (@quot) {
-				$tg->index_text(join("\n", @quot), 0);
+				my $s = join("\n", @quot);
 				@quot = ();
+				$tg->index_text($s, 1, 'XQUOT');
+				$tg->index_text($s, 0, 'XBS');
+				$tg->index_text($s, 0, 'XBODY');
+				$tg->index_text($s, 0);
 				$tg->increase_termpos;
 			}
 			if (@orig) {
-				$tg->index_text(join("\n", @orig));
+				my $s = join("\n", @orig);
 				@orig = ();
+				$tg->index_text($s, 1, 'XNQ');
+				$tg->index_text($s, 1, 'XBS');
+				$tg->index_text($s, 1, 'XBODY');
+				$tg->index_text($s);
 				$tg->increase_termpos;
 			}
 		});
diff --git a/t/search.t b/t/search.t
index 7abaf83..bddb545 100644
--- a/t/search.t
+++ b/t/search.t
@@ -361,6 +361,31 @@ sub filter_mids {
 	}
 }
 
+{
+	$rw_commit->();
+	$ro->reopen;
+	my $res = $ro->query('b:hello');
+	is(scalar @{$res->{msgs}}, 0, 'no match on body search only');
+	$res = $ro->query('bs:smith');
+	is(scalar @{$res->{msgs}}, 0,
+		'no match on body+subject search for From');
+
+	$res = $ro->query('q:theatre');
+	is(scalar @{$res->{msgs}}, 1, 'only one quoted body');
+	like($res->{msgs}->[0]->from, qr/\AQuoter/, 'got quoted body');
+
+	$res = $ro->query('nq:theatre');
+	is(scalar @{$res->{msgs}}, 1, 'only one non-quoted body');
+	like($res->{msgs}->[0]->from, qr/\ANon-Quoter/, 'got non-quoted body');
+
+	foreach my $pfx (qw(b: bs:)) {
+		$res = $ro->query($pfx . 'theatre');
+		is(scalar @{$res->{msgs}}, 2, "searched both bodies for $pfx");
+		like($res->{msgs}->[0]->from, qr/\ANon-Quoter/,
+			"non-quoter first for $pfx");
+	}
+}
+
 done_testing();
 
 1;
-- 
EW


* [PATCH 04/10] search: fix space regressions from recent changes
  2016-09-09  0:01 [PATCH 0/10] search: more mairix prefix compatibility Eric Wong
                   ` (2 preceding siblings ...)
  2016-09-09  0:01 ` [PATCH 03/10] search: more granular message body searching Eric Wong
@ 2016-09-09  0:01 ` Eric Wong
  2016-09-09  0:01 ` [PATCH 05/10] search: match quote detection behavior of view Eric Wong
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2016-09-09  0:01 UTC (permalink / raw)
  To: meta

As of Xapian 1.0.4 (from 2007), it is possible to call
Search::Xapian::QueryParser::add_prefix multiple times with the
same user field name but different term prefixes.

This brings my current git@vger mirror from 6.5GB to 2.1GB
(both sizes are after xapian-compact).
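
As a rough illustration of the mapping this patch relies on, the
sketch below expands a space-separated prefix table into one
add_prefix call per term prefix.  The $add_prefix closure and @calls
array are hypothetical stand-ins so the snippet runs without Xapian;
the real code calls $qp->add_prefix on a
Search::Xapian::QueryParser object instead.

```perl
use strict;
use warnings;

# Subset of the prefix map from this patch: one user-visible field
# name may map to several space-separated Xapian term prefixes.
my %prob_prefix = (
	t => 'XTO',
	tc => 'XTO XCC',
	tcf => 'XTO XCC A',
);

# Hypothetical recorder standing in for $qp->add_prefix($name, $prefix)
my @calls;
my $add_prefix = sub { push @calls, "$_[0]=$_[1]" };

# keys are sorted only to make the output order predictable
foreach my $name (sort keys %prob_prefix) {
	$add_prefix->($name, $_) foreach split(/ /, $prob_prefix{$name});
}

print "$_\n" foreach @calls;
```

Since each combined field is registered against prefixes that are
already indexed, no extra copy of the text has to be stored for
"tc:", "tcf:", "b:" or "bs:".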
---
 lib/PublicInbox/Search.pm    | 15 +++++++++------
 lib/PublicInbox/SearchIdx.pm | 25 ++++---------------------
 2 files changed, 13 insertions(+), 27 deletions(-)

diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index f74129d..c8e297f 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -60,20 +60,23 @@ my %bool_pfx_external = (
 my %prob_prefix = (
 	# for mairix compatibility
 	s => 'S',
-	m => 'Q', # 'mid' is exact, 'm' can do partial
+	m => 'XMID', # 'mid:' (bool) is exact, 'm:' (prob) can do partial
 	f => 'A',
 	t => 'XTO',
-	tc => 'XTC',
+	tc => 'XTO XCC',
 	c => 'XCC',
-	tcf => 'XTCF',
-	b => 'XBODY',
-	bs => 'XBS',
+	tcf => 'XTO XCC A',
+	b => 'XNQ XQUOT',
+	bs => 'XNQ XQUOT S',
 
 	# n.b.: leaving out "a:" alias for "tcf:" even though
 	# mairix supports it.  It is only mentioned in passing in mairix(1)
 	# and the extra two letters are not significantly longer.
 	q => 'XQUOT',
 	nq => 'XNQ',
+
+	# default:
+	'' => 'XMID S A XNQ XQUOT',
 );
 
 # not documenting m: and mid: for now, the using the URLs works w/o Xapian
@@ -241,7 +244,7 @@ EOF
 	}
 
 	while (my ($name, $prefix) = each %prob_prefix) {
-		$qp->add_prefix($name, $prefix);
+		$qp->add_prefix($name, $_) foreach split(/ /, $prefix);
 	}
 
 	$self->{query_parser} = $qp;
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index cd27a29..ae89060 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -129,15 +129,9 @@ sub index_users ($$) {
 
 	$tg->index_text($from, 1, 'A'); # A - author
 	$tg->increase_termpos;
-
 	$tg->index_text($to, 1, 'XTO') if $to ne '';
+	$tg->increase_termpos;
 	$tg->index_text($cc, 1, 'XCC') if $cc ne '';
-	my $tc = join("\t", $to, $cc);
-	$tg->index_text($tc, 1, 'XTC') if $tc ne '';
-	my $tcf = join("\t", $tc, $from);
-	$tg->index_text($tcf, 1, 'XTCF') if $tcf ne '';
-
-	$tg->index_text($from);
 	$tg->increase_termpos;
 }
 
@@ -173,12 +167,7 @@ sub add_message {
 		my $tg = $self->term_generator;
 
 		$tg->set_document($doc);
-		if ($subj) {
-			$tg->index_text($subj, 1, 'S');
-			$tg->index_text($subj, 1, 'XBS');
-		}
-		$tg->increase_termpos;
-		$tg->index_text($subj) if $subj;
+		$tg->index_text($subj, 1, 'S') if $subj;
 		$tg->increase_termpos;
 
 		index_users($tg, $smsg);
@@ -204,25 +193,19 @@ sub add_message {
 			if (@quot) {
 				my $s = join("\n", @quot);
 				@quot = ();
-				$tg->index_text($s, 1, 'XQUOT');
-				$tg->index_text($s, 0, 'XBS');
-				$tg->index_text($s, 0, 'XBODY');
-				$tg->index_text($s, 0);
+				$tg->index_text($s, 0, 'XQUOT');
 				$tg->increase_termpos;
 			}
 			if (@orig) {
 				my $s = join("\n", @orig);
 				@orig = ();
 				$tg->index_text($s, 1, 'XNQ');
-				$tg->index_text($s, 1, 'XBS');
-				$tg->index_text($s, 1, 'XBODY');
-				$tg->index_text($s);
 				$tg->increase_termpos;
 			}
 		});
 
 		link_message($self, $smsg, $old_tid);
-		$tg->index_text($mid, 1);
+		$tg->index_text($mid, 1, 'XMID');
 		$doc->set_data($smsg->to_doc_data($blob));
 
 		if (my $altid = $self->{-altid}) {
-- 
EW


* [PATCH 05/10] search: match quote detection behavior of view
  2016-09-09  0:01 [PATCH 0/10] search: more mairix prefix compatibility Eric Wong
                   ` (3 preceding siblings ...)
  2016-09-09  0:01 ` [PATCH 04/10] search: fix space regressions from recent changes Eric Wong
@ 2016-09-09  0:01 ` Eric Wong
  2016-09-09  0:01 ` [PATCH 06/10] search: increase term positions for each quoted hunk Eric Wong
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2016-09-09  0:01 UTC (permalink / raw)
  To: meta

This is stricter than the mutt quote_regexp default
("^([ \t]*[|>:}#])+" on Debian jessie),
but matches what we have in View.pm.

I prefer the stricter quote detection since it is less ambiguous
and less likely to hide/obscure important details.
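
To make the difference concrete, here is a small standalone sketch
comparing the old and new patterns from the diff below.  The
classify() helper and the sample lines are made up for illustration
and are not part of SearchIdx.pm.

```perl
use strict;
use warnings;

my $old_re = qr/^\s*>/; # pattern removed by this patch
my $new_re = qr/^>/;    # pattern added by this patch

# Hypothetical helper: label a line according to the given pattern
sub classify {
	my ($re, $line) = @_;
	$line =~ $re ? 'quoted' : 'original';
}

foreach my $line ('> plain quote', '  > indented', 'a > b') {
	printf("%-16s old=%-8s new=%s\n", "'$line'",
		classify($old_re, $line), classify($new_re, $line));
}
```

Only the indented line changes classification: the old pattern
treats it as quoted text, the new one as original text.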
---
 lib/PublicInbox/SearchIdx.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index ae89060..25452da 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -184,7 +184,7 @@ sub add_message {
 			$part->body_set('');
 			my @lines = split(/\n/, $body);
 			while (defined(my $l = shift @lines)) {
-				if ($l =~ /^\s*>/) {
+				if ($l =~ /^>/) {
 					push @quot, $l;
 				} else {
 					push @orig, $l;
-- 
EW


* [PATCH 06/10] search: increase term positions for each quoted hunk
  2016-09-09  0:01 [PATCH 0/10] search: more mairix prefix compatibility Eric Wong
                   ` (4 preceding siblings ...)
  2016-09-09  0:01 ` [PATCH 05/10] search: match quote detection behavior of view Eric Wong
@ 2016-09-09  0:01 ` Eric Wong
  2016-09-09  0:01 ` [PATCH 07/10] search: fix compatibility with Debian wheezy Eric Wong
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2016-09-09  0:01 UTC (permalink / raw)
  To: meta

We pay a storage cost for storing positional information
in Xapian; make good use of it by attempting to preserve
it for (hopefully) better search results.
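
The hunk-by-hunk flushing can be sketched without Xapian as follows.
split_hunks() is a hypothetical helper that only collects the hunks
in message order; in the patch itself, each flush is an
index_text call followed by increase_termpos on the term generator.

```perl
use strict;
use warnings;

# Split a body into alternating quoted/non-quoted hunks and return
# them as [ prefix, text ] pairs in the order they appear.
sub split_hunks {
	my ($body) = @_;
	my (@hunks, @orig, @quot);
	my $flush = sub {
		my ($pfx, $lines) = @_;
		push @hunks, [ $pfx, join("\n", @$lines) ];
		@$lines = (); # start the next hunk empty
	};
	foreach my $l (split(/\n/, $body)) {
		if ($l =~ /^>/) {
			# a quote begins: flush any pending original text
			$flush->('XNQ', \@orig) if @orig;
			push @quot, $l;
		} else {
			# original text begins: flush any pending quote
			$flush->('XQUOT', \@quot) if @quot;
			push @orig, $l;
		}
	}
	$flush->('XQUOT', \@quot) if @quot;
	$flush->('XNQ', \@orig) if @orig;
	@hunks;
}

my @hunks = split_hunks(
	"> first quote\nreply one\n> second quote\nreply two\n");
print "$_->[0]: $_->[1]\n" foreach @hunks;
```

Before this change, all quoted lines of a part were joined into one
string and all original lines into another, so the interleaving of
quotes and replies was lost at indexing time.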
---
 lib/PublicInbox/SearchIdx.pm | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 25452da..0e499ad 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -135,6 +135,13 @@ sub index_users ($$) {
 	$tg->increase_termpos;
 }
 
+sub index_body ($$$) {
+	my ($tg, $lines, $inc) = @_;
+	$tg->index_text(join("\n", @$lines), $inc, $inc ? 'XNQ' : 'XQUOT');
+	@$lines = ();
+	$tg->increase_termpos;
+}
+
 sub add_message {
 	my ($self, $mime, $bytes, $num, $blob) = @_; # mime = Email::MIME object
 	my $db = $self->{xdb};
@@ -185,23 +192,15 @@ sub add_message {
 			my @lines = split(/\n/, $body);
 			while (defined(my $l = shift @lines)) {
 				if ($l =~ /^>/) {
+					index_body($tg, \@orig, 1) if @orig;
 					push @quot, $l;
 				} else {
+					index_body($tg, \@quot, 0) if @quot;
 					push @orig, $l;
 				}
 			}
-			if (@quot) {
-				my $s = join("\n", @quot);
-				@quot = ();
-				$tg->index_text($s, 0, 'XQUOT');
-				$tg->increase_termpos;
-			}
-			if (@orig) {
-				my $s = join("\n", @orig);
-				@orig = ();
-				$tg->index_text($s, 1, 'XNQ');
-				$tg->increase_termpos;
-			}
+			index_body($tg, \@quot, 0) if @quot;
+			index_body($tg, \@orig, 1) if @orig;
 		});
 
 		link_message($self, $smsg, $old_tid);
-- 
EW


* [PATCH 07/10] search: fix compatibility with Debian wheezy
  2016-09-09  0:01 [PATCH 0/10] search: more mairix prefix compatibility Eric Wong
                   ` (5 preceding siblings ...)
  2016-09-09  0:01 ` [PATCH 06/10] search: increase term positions for each quoted hunk Eric Wong
@ 2016-09-09  0:01 ` Eric Wong
  2016-09-09  0:01 ` [PATCH 08/10] search: avoid mindlessly calling body_set Eric Wong
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2016-09-09  0:01 UTC (permalink / raw)
  To: meta

Specifying the "d:" field only worked for
NumberValueRangeProcessor in older versions of Xapian, such
as the one in Debian wheezy (libsearch-xapian-perl=1.2.10.0-1).

This slipped through since I rarely use wheezy anymore, and
perhaps nobody else does, either.  Perhaps wheezy support may be
dropped soon.

Unfortunately, this requires a schema version bump.
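
For readers unfamiliar with the value column, the sketch below shows
how the YYYYMMDD value is derived and why it can be treated as a
number.  in_range() is a hypothetical helper written only for this
illustration; it handles just the closed "d:START..END" form and does
not reflect how Xapian evaluates ranges internally.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Derive the YYYYMMDD value from a Unix timestamp, in the same way
# add_values() does in SearchIdx.pm.
sub yyyymmdd {
	my ($ts) = @_;
	strftime('%Y%m%d', gmtime($ts));
}

# Hypothetical helper (not part of public-inbox): numerically test a
# date against a closed "d:START..END" range string.
sub in_range {
	my ($date, $range) = @_;
	my ($start, $end) = $range =~ /\Ad:(\d{8})\.\.(\d{8})\z/
		or die "unsupported range: $range\n";
	$start <= $date && $date <= $end;
}

my $date = yyyymmdd(1473379260); # timestamp of this thread
print "$date is in range\n" if in_range($date, 'd:20160101..20161231');
```

Because the value is always eight digits, numeric and string ordering
agree; what changes with the schema bump is that the stored value now
goes through add_val() and sortable_serialise.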
---
 lib/PublicInbox/Search.pm    | 5 +++--
 lib/PublicInbox/SearchIdx.pm | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index c8e297f..ceee39a 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -38,7 +38,8 @@ use constant {
 	# 9 - disable Message-ID compression (SHA-1)
 	# 10 - optimize doc for NNTP overviews
 	# 11 - merge threads when vivifying ghosts
-	SCHEMA_VERSION => 11,
+	# 12 - change YYYYMMDD value column to numeric
+	SCHEMA_VERSION => 12,
 
 	# n.b. FLAG_PURE_NOT is expensive not suitable for a public website
 	# as it could become a denial-of-service vector
@@ -221,7 +222,7 @@ sub qp {
 	$qp->set_stemmer($self->stemmer);
 	$qp->set_stemming_strategy(STEM_SOME);
 	$qp->add_valuerangeprocessor(
-		Search::Xapian::StringValueRangeProcessor->new(YYYYMMDD, 'd:'));
+		Search::Xapian::NumberValueRangeProcessor->new(YYYYMMDD, 'd:'));
 
 	while (my ($name, $prefix) = each %bool_pfx_external) {
 		$qp->add_boolean_prefix($name, $prefix);
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 0e499ad..86be9ed 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -117,7 +117,7 @@ sub add_values ($$$) {
 			$smsg->{mime}->body_raw =~ tr!\n!\n!);
 
 	my $yyyymmdd = strftime('%Y%m%d', gmtime($ts));
-	$doc->add_value(&PublicInbox::Search::YYYYMMDD, $yyyymmdd);
+	add_val($doc, PublicInbox::Search::YYYYMMDD, $yyyymmdd);
 }
 
 sub index_users ($$) {
-- 
EW


* [PATCH 08/10] search: avoid mindlessly calling body_set
  2016-09-09  0:01 [PATCH 0/10] search: more mairix prefix compatibility Eric Wong
                   ` (6 preceding siblings ...)
  2016-09-09  0:01 ` [PATCH 07/10] search: fix compatibility with Debian wheezy Eric Wong
@ 2016-09-09  0:01 ` Eric Wong
  2016-09-09  0:01 ` [PATCH 09/10] search: match the behavior of WWW for indexing text Eric Wong
  2016-09-09  0:01 ` [PATCH 10/10] search: index attachment filenames Eric Wong
  9 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2016-09-09  0:01 UTC (permalink / raw)
  To: meta

It's not worth entering a complex codepath in Email::MIME to
save some (probably immeasurable amount of) memory, here.  We've
already stopped doing this in our WWW code a while back, too.
If we really cared enough about it, we'd prioritize work on a
streaming replacement for Email::MIME.
---
 lib/PublicInbox/SearchIdx.pm | 1 -
 1 file changed, 1 deletion(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 86be9ed..0e2d225 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -188,7 +188,6 @@ sub add_message {
 
 			my (@orig, @quot);
 			my $body = $part->body;
-			$part->body_set('');
 			my @lines = split(/\n/, $body);
 			while (defined(my $l = shift @lines)) {
 				if ($l =~ /^>/) {
-- 
EW


* [PATCH 09/10] search: match the behavior of WWW for indexing text
  2016-09-09  0:01 [PATCH 0/10] search: more mairix prefix compatibility Eric Wong
                   ` (7 preceding siblings ...)
  2016-09-09  0:01 ` [PATCH 08/10] search: avoid mindlessly calling body_set Eric Wong
@ 2016-09-09  0:01 ` Eric Wong
  2016-09-09  0:01 ` [PATCH 10/10] search: index attachment filenames Eric Wong
  9 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2016-09-09  0:01 UTC (permalink / raw)
  To: meta

The basic rule is that if it is displayable via our WWW
interface, it should be indexable text for Xapian search.
---
 lib/PublicInbox/SearchIdx.pm | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 0e2d225..fb68f4b 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -148,7 +148,6 @@ sub add_message {
 
 	my ($doc_id, $old_tid);
 	my $mid = mid_clean(mid_mime($mime));
-	my $ct_msg = $mime->header('Content-Type') || 'text/plain';
 
 	eval {
 		die 'Message-ID too long' if length($mid) > MAX_MID_SIZE;
@@ -181,10 +180,22 @@ sub add_message {
 
 		msg_iter($mime, sub {
 			my ($part, $depth, @idx) = @{$_[0]};
-			my $ct = $part->content_type || $ct_msg;
-
-			# account for filter bugs...
-			$ct =~ m!\btext/plain\b!i or return;
+			my $ct = $part->content_type || 'text/plain';
+
+			return if $ct =~ m!\btext/x?html\b!i;
+
+			my $s = eval { $part->body_str };
+			if ($@) {
+				if ($ct =~ m!\btext/plain\b!i) {
+					# Try to assume UTF-8 because Alpine
+					# seems to do wacky things and set
+					# charset=X-UNKNOWN
+					$part->charset_set('UTF-8');
+					$s = eval { $part->body_str };
+					$s = $part->body if $@;
+				}
+			}
+			defined $s or return;
 
 			my (@orig, @quot);
 			my $body = $part->body;
-- 
EW


* [PATCH 10/10] search: index attachment filenames
  2016-09-09  0:01 [PATCH 0/10] search: more mairix prefix compatibility Eric Wong
                   ` (8 preceding siblings ...)
  2016-09-09  0:01 ` [PATCH 09/10] search: match the behavior of WWW for indexing text Eric Wong
@ 2016-09-09  0:01 ` Eric Wong
  9 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2016-09-09  0:01 UTC (permalink / raw)
  To: meta

And while we're at it, ensure searching inside displayable
attachment bodies works.
---
 lib/PublicInbox/Search.pm    |  3 ++-
 lib/PublicInbox/SearchIdx.pm |  4 ++++
 t/search.t                   | 44 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index ceee39a..0c05677 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -69,6 +69,7 @@ my %prob_prefix = (
 	tcf => 'XTO XCC A',
 	b => 'XNQ XQUOT',
 	bs => 'XNQ XQUOT S',
+	n => 'XFN',
 
 	# n.b.: leaving out "a:" alias for "tcf:" even though
 	# mairix supports it.  It is only mentioned in passing in mairix(1)
@@ -77,7 +78,7 @@ my %prob_prefix = (
 	nq => 'XNQ',
 
 	# default:
-	'' => 'XMID S A XNQ XQUOT',
+	'' => 'XMID S A XNQ XQUOT XFN',
 );
 
 # not documenting m: and mid: for now, the using the URLs works w/o Xapian
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index fb68f4b..23aef9f 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -181,6 +181,10 @@ sub add_message {
 		msg_iter($mime, sub {
 			my ($part, $depth, @idx) = @{$_[0]};
 			my $ct = $part->content_type || 'text/plain';
+			my $fn = $part->filename;
+			if (defined $fn && $fn ne '') {
+				$tg->index_text($fn, 1, 'XFN');
+			}
 
 			return if $ct =~ m!\btext/x?html\b!i;
 
diff --git a/t/search.t b/t/search.t
index bddb545..cce3b9e 100644
--- a/t/search.t
+++ b/t/search.t
@@ -386,6 +386,50 @@ sub filter_mids {
 	}
 }
 
+{
+	my $part1 = Email::MIME->create(
+                 attributes => {
+                     content_type => 'text/plain',
+                     disposition  => 'attachment',
+                     charset => 'US-ASCII',
+		     encoding => 'quoted-printable',
+		     filename => 'attached_fart.txt',
+                 },
+                 body_str => 'inside the attachment',
+	);
+	my $part2 = Email::MIME->create(
+                 attributes => {
+                     content_type => 'text/plain',
+                     disposition  => 'attachment',
+                     charset => 'US-ASCII',
+		     encoding => 'quoted-printable',
+		     filename => 'part_deux.txt',
+                 },
+                 body_str => 'inside another',
+	);
+	my $amsg = Email::MIME->create(
+		header_str => [
+			Subject => 'see attachment',
+			'Message-ID' => '<file@attached>',
+			From => 'John Smith <js@example.com>',
+			To => 'list@example.com',
+		],
+		parts => [ $part1, $part2 ],
+	);
+	ok($rw->add_message($amsg), 'added attachment');
+	$rw_commit->();
+	$ro->reopen;
+	my $n = $ro->query('n:attached_fart.txt');
+	is(scalar @{$n->{msgs}}, 1, 'got result for n:');
+	my $res = $ro->query('part_deux.txt');
+	is(scalar @{$res->{msgs}}, 1, 'got result without n:');
+	is($n->{msgs}->[0]->mid, $res->{msgs}->[0]->mid,
+		'same result with and without');
+	my $txt = $ro->query('"inside another"');
+	is($txt->{msgs}->[0]->mid, $res->{msgs}->[0]->mid,
+		'search inside text attachments works');
+}
+
 done_testing();
 
 1;
-- 
EW

