unofficial mirror of meta@public-inbox.org
 help / color / mirror / Atom feed
* [PATCH 00/12] better dedupe, contiguous article numbers
@ 2018-04-18  9:13 Eric Wong (Contractor, The Linux Foundation)
  2018-04-18  9:13 ` [PATCH 01/12] feed: respect feedmax, again Eric Wong (Contractor, The Linux Foundation)
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-04-18  9:13 UTC (permalink / raw)
  To: meta; +Cc: Eric Wong (Contractor, The Linux Foundation)

Hopefully the final round of patches, most notably [7/12]
improving deduplication and [9/12] to avoid causing difficulties
for NNTP readers on v1 inboxes.

And a few more bugfixes along the way...

Eric Wong (Contractor, The Linux Foundation) (12):
  feed: respect feedmax, again
  v1: remove articles from overview DB
  compact: do not merge v2 repos by default
  v2writable: reduce partititions by one
  search: preserve References in Xapian smsg for x=t view
  v2: generate better Message-IDs for duplicates
  v2: improve deduplication checks
  import: cat_blob drops leading 'From ' lines like Inbox
  searchidx: regenerate and avoid article number gaps on full index
  extmsg: remove expensive git path checks
  use %H consistently to disable abbreviations
  searchidx: increase term positions for all text terms

 MANIFEST                      |   2 +
 lib/PublicInbox/ContentId.pm  |  67 ++++++++++--
 lib/PublicInbox/ExtMsg.pm     |  39 ++-----
 lib/PublicInbox/Feed.pm       |   4 +-
 lib/PublicInbox/Import.pm     |  21 ++--
 lib/PublicInbox/Inbox.pm      |   5 -
 lib/PublicInbox/Msgmap.pm     |   8 +-
 lib/PublicInbox/SearchIdx.pm  | 191 ++++++++++++++++++++++------------
 lib/PublicInbox/V2Writable.pm |  69 +++++++-----
 lib/PublicInbox/View.pm       |   7 +-
 script/public-inbox-compact   |  15 ++-
 scripts/dupe-finder           |  54 ++++++++++
 t/content_id.t                |  10 ++
 t/convert-compact.t           |   2 -
 t/psgi_v2.t                   |  11 +-
 t/search.t                    |   9 +-
 t/v1-add-remove-add.t         |  45 ++++++++
 t/v2writable.t                |   5 +-
 18 files changed, 391 insertions(+), 173 deletions(-)
 create mode 100644 scripts/dupe-finder
 create mode 100644 t/v1-add-remove-add.t

-- 
EW


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 01/12] feed: respect feedmax, again
  2018-04-18  9:13 [PATCH 00/12] better dedupe, contiguous article numbers Eric Wong (Contractor, The Linux Foundation)
@ 2018-04-18  9:13 ` Eric Wong (Contractor, The Linux Foundation)
  2018-04-18  9:13 ` [PATCH 02/12] v1: remove articles from overview DB Eric Wong (Contractor, The Linux Foundation)
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-04-18  9:13 UTC (permalink / raw)
  To: meta; +Cc: Eric Wong (Contractor, The Linux Foundation)

Gigantic feeds probably make some clients unhappy,
clamp it to what it was in the past.

Fixes: b9534449ecce2c59 ("view: avoid offset during pagination")
---
 lib/PublicInbox/Feed.pm | 2 +-
 lib/PublicInbox/View.pm | 7 +++----
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/lib/PublicInbox/Feed.pm b/lib/PublicInbox/Feed.pm
index 5cb044b..d7df07e 100644
--- a/lib/PublicInbox/Feed.pm
+++ b/lib/PublicInbox/Feed.pm
@@ -95,7 +95,7 @@ sub recent_msgs {
 		die "BUG: unsupported inbox version: $v\n";
 	}
 	if (my $srch = $ibx->search) {
-		return PublicInbox::View::paginate_recent($ctx);
+		return PublicInbox::View::paginate_recent($ctx, $max);
 	}
 
 	my $hex = '[a-f0-9]';
diff --git a/lib/PublicInbox/View.pm b/lib/PublicInbox/View.pm
index 94058ed..af287b9 100644
--- a/lib/PublicInbox/View.pm
+++ b/lib/PublicInbox/View.pm
@@ -1049,10 +1049,9 @@ sub index_nav { # callback for WwwStream
 	pagination_footer($ctx, '.')
 }
 
-sub paginate_recent ($) {
-	my ($ctx) = @_;
+sub paginate_recent ($$) {
+	my ($ctx, $lim) = @_;
 	my $t = $ctx->{qp}->{t} || '';
-	my $lim = 200; # this is our window
 	my $opts = { limit => $lim };
 	my ($after, $before);
 
@@ -1094,7 +1093,7 @@ sub paginate_recent ($) {
 
 sub index_topics {
 	my ($ctx) = @_;
-	my $msgs = paginate_recent($ctx);
+	my $msgs = paginate_recent($ctx, 200); # 200 is our window
 	if (@$msgs) {
 		walk_thread(thread_results($ctx, $msgs), $ctx, *acc_topic);
 	}
-- 
EW


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 02/12] v1: remove articles from overview DB
  2018-04-18  9:13 [PATCH 00/12] better dedupe, contiguous article numbers Eric Wong (Contractor, The Linux Foundation)
  2018-04-18  9:13 ` [PATCH 01/12] feed: respect feedmax, again Eric Wong (Contractor, The Linux Foundation)
@ 2018-04-18  9:13 ` Eric Wong (Contractor, The Linux Foundation)
  2018-04-18  9:13 ` [PATCH 03/12] compact: do not merge v2 repos by default Eric Wong (Contractor, The Linux Foundation)
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-04-18  9:13 UTC (permalink / raw)
  To: meta; +Cc: Eric Wong (Contractor, The Linux Foundation)

Otherwise articles show up again...
---
 lib/PublicInbox/SearchIdx.pm | 2 ++
 t/search.t                   | 9 ++++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index f9b40b0..fd76627 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -385,11 +385,13 @@ sub remove_message {
 	my $db = $self->{xdb};
 	my $called;
 	$mid = mid_clean($mid);
+	my $over = $self->{over};
 
 	eval {
 		batch_do($self, 'Q' . $mid, sub {
 			my ($ids) = @_;
 			$db->delete_document($_) for @$ids;
+			$over->delete_articles($ids) if $over;
 			$called = 1;
 		});
 	};
diff --git a/t/search.t b/t/search.t
index 516f567..48c2511 100644
--- a/t/search.t
+++ b/t/search.t
@@ -409,8 +409,15 @@ sub filter_mids {
 	my $txt = $ro->query('"inside another"');
 	is($txt->[0]->mid, $res->[0]->mid,
 		'search inside text attachments works');
+
+	my $mid = $n->[0]->mid;
+	my ($id, $prev);
+	my $art = $ro->next_by_mid($mid, \$id, \$prev);
+	ok($art, 'article exists in OVER DB');
+	$rw->unindex_blob($amsg);
+	$rw->commit_txn_lazy;
+	is($ro->lookup_article($art->{num}), undef, 'gone from OVER DB');
 }
-$rw->commit_txn_lazy;
 
 done_testing();
 
-- 
EW


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 03/12] compact: do not merge v2 repos by default
  2018-04-18  9:13 [PATCH 00/12] better dedupe, contiguous article numbers Eric Wong (Contractor, The Linux Foundation)
  2018-04-18  9:13 ` [PATCH 01/12] feed: respect feedmax, again Eric Wong (Contractor, The Linux Foundation)
  2018-04-18  9:13 ` [PATCH 02/12] v1: remove articles from overview DB Eric Wong (Contractor, The Linux Foundation)
@ 2018-04-18  9:13 ` Eric Wong (Contractor, The Linux Foundation)
  2018-04-18  9:13 ` [PATCH 04/12] v2writable: reduce partititions by one Eric Wong (Contractor, The Linux Foundation)
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-04-18  9:13 UTC (permalink / raw)
  To: meta; +Cc: Eric Wong (Contractor, The Linux Foundation)

--no-renumber does not allow merging, and merging is not ideal
for reindexing, either.
---
 script/public-inbox-compact | 15 ++++++++++-----
 t/convert-compact.t         |  2 --
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/script/public-inbox-compact b/script/public-inbox-compact
index 9f33265..5f18497 100755
--- a/script/public-inbox-compact
+++ b/script/public-inbox-compact
@@ -10,6 +10,7 @@ use PublicInbox::InboxWritable;
 use Cwd 'abs_path';
 use File::Temp qw(tempdir);
 use File::Path qw(remove_tree);
+use PublicInbox::Spawn qw(spawn);
 my $usage = "Usage: public-inbox-compact REPO_DIR\n";
 my $dir = shift or die $usage;
 my $config = PublicInbox::Config->new;
@@ -58,10 +59,11 @@ if ($v == 2) {
 	my $new = tempdir('compact-XXXXXXXX', CLEANUP => 1, DIR => $dir);
 	$ibx->with_umask(sub {
 		$v2w->lock_acquire;
-		my @parts;
+		my %pids;
 		while (defined(my $dn = readdir($dh))) {
 			if ($dn =~ /\A\d+\z/) {
-				push @parts, "$old/$dn";
+				my $cmd = [ @compact, "$old/$dn", "$new/$dn" ];
+				$pids{spawn($cmd)} = join(' ', @$cmd);
 			} elsif ($dn eq '.' || $dn eq '..') {
 			} elsif ($dn =~ /\Aover\.sqlite3/) {
 			} else {
@@ -69,9 +71,12 @@ if ($v == 2) {
 			}
 		}
 		close $dh;
-		die "No Xapian parts found in $old\n" unless @parts;
-		my $cmd = [@compact, @parts, "$new/0" ];
-		PublicInbox::Import::run_die($cmd);
+		die "No Xapian parts found in $old\n" unless keys %pids;
+		while (scalar keys %pids) {
+			my $pid = waitpid(-1, 0);
+			my $desc = delete $pids{$pid};
+			die "$desc failed: $?\n" if $?;
+		}
 		commit_changes($v2w, $old, $new);
 	});
 } elsif ($v == 1) {
diff --git a/t/convert-compact.t b/t/convert-compact.t
index 5caa0ac..ced4541 100644
--- a/t/convert-compact.t
+++ b/t/convert-compact.t
@@ -80,8 +80,6 @@ my $env = { NPROC => 2 };
 ok(PublicInbox::Import::run_die($cmd, $env, $rdr), 'v2 compact works');
 $ibx->{mainrepo} = "$tmpdir/v2";
 $ibx->{version} = 2;
-my $v2w = PublicInbox::V2Writable->new($ibx);
-is($v2w->{partitions}, 1, "only one partition in compacted repo");
 
 @xdir = glob("$tmpdir/v2/xap*/*");
 foreach (@xdir) {
-- 
EW


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 04/12] v2writable: reduce partititions by one
  2018-04-18  9:13 [PATCH 00/12] better dedupe, contiguous article numbers Eric Wong (Contractor, The Linux Foundation)
                   ` (2 preceding siblings ...)
  2018-04-18  9:13 ` [PATCH 03/12] compact: do not merge v2 repos by default Eric Wong (Contractor, The Linux Foundation)
@ 2018-04-18  9:13 ` Eric Wong (Contractor, The Linux Foundation)
  2018-04-18  9:13 ` [PATCH 05/12] search: preserve References in Xapian smsg for x=t view Eric Wong (Contractor, The Linux Foundation)
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-04-18  9:13 UTC (permalink / raw)
  To: meta; +Cc: Eric Wong (Contractor, The Linux Foundation)

git fast-import and the main V2Writable process combined takes
about one CPU, so avoid having too many Xapian partitions which
cause unnecessary I/O contention.
---
 lib/PublicInbox/V2Writable.pm | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 1cc4b00..66f8a8a 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -22,8 +22,11 @@ use IO::Handle;
 my $PACKING_FACTOR = 0.4;
 
 # assume 2 cores if GNU nproc(1) is not available
-sub nproc () {
-	int($ENV{NPROC} || `nproc 2>/dev/null` || 2);
+sub nproc_parts () {
+	my $n = int($ENV{NPROC} || `nproc 2>/dev/null` || 2);
+	# subtract for the main process and git-fast-import
+	$n -= 1;
+	$n < 1 ? 1 : $n;
 }
 
 sub count_partitions ($) {
@@ -73,7 +76,7 @@ sub new {
 		rotate_bytes => int((1024 * 1024 * 1024) / $PACKING_FACTOR),
 		last_commit => [], # git repo -> commit
 	};
-	$self->{partitions} = count_partitions($self) || nproc();
+	$self->{partitions} = count_partitions($self) || nproc_parts();
 	bless $self, $class;
 }
 
-- 
EW


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 05/12] search: preserve References in Xapian smsg for x=t view
  2018-04-18  9:13 [PATCH 00/12] better dedupe, contiguous article numbers Eric Wong (Contractor, The Linux Foundation)
                   ` (3 preceding siblings ...)
  2018-04-18  9:13 ` [PATCH 04/12] v2writable: reduce partititions by one Eric Wong (Contractor, The Linux Foundation)
@ 2018-04-18  9:13 ` Eric Wong (Contractor, The Linux Foundation)
  2018-04-18  9:13 ` [PATCH 06/12] v2: generate better Message-IDs for duplicates Eric Wong (Contractor, The Linux Foundation)
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-04-18  9:13 UTC (permalink / raw)
  To: meta; +Cc: Eric Wong (Contractor, The Linux Foundation)

I'm not sure how useful this view is, but it exists for now.
---
 lib/PublicInbox/SearchIdx.pm |  1 +
 t/psgi_v2.t                  | 11 +++++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index fd76627..0b1dc21 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -331,6 +331,7 @@ sub add_message {
 			$tg->index_text($mid, 1, 'XM');
 		}
 		$smsg->{to} = $smsg->{cc} = '';
+		PublicInbox::OverIdx::parse_references($smsg, $mid0, $mids);
 		my $data = $smsg->to_doc_data($oid, $mid0);
 		$doc->set_data($data);
 		if (my $altid = $self->{-altid}) {
diff --git a/t/psgi_v2.t b/t/psgi_v2.t
index faa139f..65448dc 100644
--- a/t/psgi_v2.t
+++ b/t/psgi_v2.t
@@ -113,6 +113,7 @@ test_psgi(sub { $www->call(@_) }, sub {
 	like($raw, qr/^hello ghosts$/m, 'got third message');
 	@from_ = ($raw =~ m/^From /mg);
 	is(scalar(@from_), 3, 'three From_ lines');
+	$config->each_inbox(sub { $_[0]->search->reopen });
 
 	SKIP: {
 		eval { require IO::Uncompress::Gunzip };
@@ -129,7 +130,6 @@ test_psgi(sub { $www->call(@_) }, sub {
 		is(scalar(@from_), 3, 'three From_ lines in t.mbox.gz');
 
 		# search interface
-		$config->each_inbox(sub { $_[0]->search->reopen });
 		$res = $cb->(POST('/v2test/?q=m:a-mid@b&x=m'));
 		$in = $res->content;
 		$status = IO::Uncompress::Gunzip::gunzip(\$in => \$out);
@@ -150,6 +150,13 @@ test_psgi(sub { $www->call(@_) }, sub {
 		is(scalar(@from_), 3, 'three From_ lines in all.mbox');
 	};
 
+	$res = $cb->(GET('/v2test/?q=m:a-mid@b&x=t'));
+	is($res->code, 200, 'success with threaded search');
+	my $raw = $res->content;
+	ok($raw =~ s/\A.*>Results 1-3 of 3<//s, 'got all results');
+	my @over = ($raw =~ m/\d{4}-\d+-\d+\s+\d+:\d+ (.+)$/gm);
+	is_deeply(\@over, [ '<a', '` <a', '` <a' ], 'threaded messages show up');
+
 	local $SIG{__WARN__} = 'DEFAULT';
 	$res = $cb->(GET('/v2test/a-mid@b/'));
 	$raw = $res->content;
@@ -183,7 +190,7 @@ test_psgi(sub { $www->call(@_) }, sub {
 	$res = $cb->(GET('/v2test/reuse@mid/T/'));
 	$raw = $res->content;
 	like($raw, qr/\b4\+ messages\b/, 'thread overview shown with /T/');
-	my @over = ($raw =~ m/^\d{4}-\d+-\d+\s+\d+:\d+ (.+)$/gm);
+	@over = ($raw =~ m/^\d{4}-\d+-\d+\s+\d+:\d+ (.+)$/gm);
 	is_deeply(\@over, [ '<a', '` <a', '` <a', '` <a' ],
 		'duplicate messages share the same root');
 
-- 
EW


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 06/12] v2: generate better Message-IDs for duplicates
  2018-04-18  9:13 [PATCH 00/12] better dedupe, contiguous article numbers Eric Wong (Contractor, The Linux Foundation)
                   ` (4 preceding siblings ...)
  2018-04-18  9:13 ` [PATCH 05/12] search: preserve References in Xapian smsg for x=t view Eric Wong (Contractor, The Linux Foundation)
@ 2018-04-18  9:13 ` Eric Wong (Contractor, The Linux Foundation)
  2018-04-18  9:13 ` [PATCH 07/12] v2: improve deduplication checks Eric Wong (Contractor, The Linux Foundation)
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-04-18  9:13 UTC (permalink / raw)
  To: meta; +Cc: Eric Wong (Contractor, The Linux Foundation)

While hunting duplicates, I noticed a leading '-' in some
Message-IDs as a result of RFC4648 encoding.  While '-' seems
allowed by RFC5322 and URL-friendly (RFC4648), they are uncommon
and make using Message-IDs as arguments for command-line tools
more difficult.  So prefix them with a datestamp to at least
give readers some sense of the age.  And shorten the "localhost"
hostname to "z" to save space.
---
 MANIFEST                      |  1 +
 lib/PublicInbox/Import.pm     | 18 ++++++------
 lib/PublicInbox/V2Writable.pm |  6 ++--
 scripts/dupe-finder           | 54 +++++++++++++++++++++++++++++++++++
 t/v2writable.t                |  5 ++--
 5 files changed, 71 insertions(+), 13 deletions(-)
 create mode 100644 scripts/dupe-finder

diff --git a/MANIFEST b/MANIFEST
index 58b3634..00a0970 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -124,6 +124,7 @@ script/public-inbox-watch
 script/public-inbox.cgi
 scripts/dc-dlvr
 scripts/dc-dlvr.pre
+scripts/dupe-finder
 scripts/edit-sa-prefs
 scripts/import_maildir
 scripts/import_slrnspool
diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 9e8900f..c7a96e1 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -14,6 +14,7 @@ use PublicInbox::Address;
 use PublicInbox::MsgTime qw(msg_timestamp msg_datestamp);
 use PublicInbox::ContentId qw(content_digest);
 use PublicInbox::MDA;
+use POSIX qw(strftime);
 
 sub new {
 	my ($class, $git, $name, $email, $ibx) = @_;
@@ -330,7 +331,7 @@ sub v1_mid0 ($) {
 	my $mids = mids($hdr);
 
 	if (!scalar(@$mids)) { # spam often has no Message-Id
-		my $mid0 = digest2mid(content_digest($mime));
+		my $mid0 = digest2mid(content_digest($mime), $hdr);
 		append_mid($hdr, $mid0);
 		return $mid0;
 	}
@@ -445,18 +446,19 @@ sub atfork_child {
 	}
 }
 
-sub digest2mid ($) {
-	my ($dig) = @_;
+sub digest2mid ($$) {
+	my ($dig, $hdr) = @_;
 	my $b64 = $dig->clone->b64digest;
 	# Make our own URLs nicer:
 	# See "Base 64 Encoding with URL and Filename Safe Alphabet" in RFC4648
 	$b64 =~ tr!+/=!-_!d;
 
-	# We can make this more meaningful with a date prefix or other things,
-	# but this is only needed for crap that fails to generate a Message-ID
-	# or reuses one.  In other words, it's usually spammers who hit this
-	# so they don't deserve nice Message-IDs :P
-	$b64 . '@localhost';
+	# Add a date prefix to prevent a leading '-' in case that trips
+	# up some tools (e.g. if a Message-ID were a expected as a
+	# command-line arg)
+	my $dt = msg_datestamp($hdr);
+	$dt = POSIX::strftime('%Y%m%d%H%M%S', gmtime($dt));
+	"$dt.$b64" . '@z';
 }
 
 sub clean_purge_buffer {
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 66f8a8a..0dcdeda 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -174,19 +174,19 @@ sub num_for_harder {
 
 	my $hdr = $mime->header_obj;
 	my $dig = content_digest($mime);
-	$$mid0 = PublicInbox::Import::digest2mid($dig);
+	$$mid0 = PublicInbox::Import::digest2mid($dig, $hdr);
 	my $num = $self->{mm}->mid_insert($$mid0);
 	unless (defined $num) {
 		# it's hard to spoof the last Received: header
 		my @recvd = $hdr->header_raw('Received');
 		$dig->add("Received: $_") foreach (@recvd);
-		$$mid0 = PublicInbox::Import::digest2mid($dig);
+		$$mid0 = PublicInbox::Import::digest2mid($dig, $hdr);
 		$num = $self->{mm}->mid_insert($$mid0);
 
 		# fall back to a random Message-ID and give up determinism:
 		until (defined($num)) {
 			$dig->add(rand);
-			$$mid0 = PublicInbox::Import::digest2mid($dig);
+			$$mid0 = PublicInbox::Import::digest2mid($dig, $hdr);
 			warn "using random Message-ID <$$mid0> as fallback\n";
 			$num = $self->{mm}->mid_insert($$mid0);
 		}
diff --git a/scripts/dupe-finder b/scripts/dupe-finder
new file mode 100644
index 0000000..1402237
--- /dev/null
+++ b/scripts/dupe-finder
@@ -0,0 +1,54 @@
+#!/usr/bin/perl -w
+# Copyright (C) 2018 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+#
+# ad-hoc tool for finding duplicates, unstable!
+use strict;
+use warnings;
+use PublicInbox::Inbox;
+use PublicInbox::Over;
+use PublicInbox::Search;
+use PublicInbox::Config;
+my $repo = shift;
+my $ibx;
+if (index($repo, '@') > 0) {
+	$ibx = PublicInbox::Config->new->lookup($repo);
+} elsif (-d $repo) {
+	$ibx = { mainrepo => $repo, address => 'unnamed@example.com' };
+	$ibx = PublicInbox::Inbox->new($ibx);
+} else {
+	$ibx = PublicInbox::Config->new->lookup_name($repo);
+}
+$ibx or die "No inbox";
+$ibx->search or die "search not available for inbox";
+my $dbh = $ibx->search->{over_ro}->connect;
+my $over = PublicInbox::Over->new($dbh->sqlite_db_filename);
+
+sub emit ($) {
+	my ($nums) = @_;
+	foreach my $n (@$nums) {
+		my $smsg = $over->get_art($n) or next;
+		print STDERR "$n $smsg->{blob} $smsg->{mid}\n";
+		my $msg = $ibx->msg_by_smsg($smsg) or next;
+		print "From $smsg->{blob}\@$n Thu Jan  1 00:00:00 1970\n";
+		$$msg =~ s/^(>*From )/>$1/gm;
+		print $$msg, "\n";
+	}
+}
+
+my $sth = $dbh->prepare(<<'');
+SELECT id,num FROM id2num WHERE num > 0 ORDER BY id
+
+$sth->execute;
+my $prev_id = -1;
+my ($id, $num, @nums);
+while (1) {
+	($id, $num) = $sth->fetchrow_array;
+	defined $id or last;
+	if ($prev_id != $id) {
+		emit(\@nums) if scalar(@nums) > 1;
+		@nums = ();
+	}
+	$prev_id = $id;
+	push @nums, $num;
+}
diff --git a/t/v2writable.t b/t/v2writable.t
index 85fb6a6..d37fb06 100644
--- a/t/v2writable.t
+++ b/t/v2writable.t
@@ -68,7 +68,7 @@ if ('ensure git configs are correct') {
 		[ $sec->header_obj->header_raw('Message-Id') ],
 		'no new Message-Id added');
 
-	my $sane_mid = qr/\A<[\w\-]+\@localhost>\z/;
+	my $sane_mid = qr/\A<[\w\-\.]+\@\w+>\z/;
 	@warn = ();
 	$mime->header_set('Message-Id', '<a-mid@b>');
 	$mime->body_set('different');
@@ -82,7 +82,8 @@ if ('ensure git configs are correct') {
 	@warn = ();
 	$mime->header_set('Message-Id', '<a-mid@b>');
 	$mime->body_set('this one needs a random mid');
-	my $gen = PublicInbox::Import::digest2mid(content_digest($mime));
+	my $hdr = $mime->header_obj;
+	my $gen = PublicInbox::Import::digest2mid(content_digest($mime), $hdr);
 	unlike($gen, qr![\+/=]!, 'no URL-unfriendly chars in Message-Id');
 	my $fake = PublicInbox::MIME->new($mime->as_string);
 	$fake->header_set('Message-Id', "<$gen>");
-- 
EW


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 07/12] v2: improve deduplication checks
  2018-04-18  9:13 [PATCH 00/12] better dedupe, contiguous article numbers Eric Wong (Contractor, The Linux Foundation)
                   ` (5 preceding siblings ...)
  2018-04-18  9:13 ` [PATCH 06/12] v2: generate better Message-IDs for duplicates Eric Wong (Contractor, The Linux Foundation)
@ 2018-04-18  9:13 ` Eric Wong (Contractor, The Linux Foundation)
  2018-04-18  9:13 ` [PATCH 08/12] import: cat_blob drops leading 'From ' lines like Inbox Eric Wong (Contractor, The Linux Foundation)
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-04-18  9:13 UTC (permalink / raw)
  To: meta; +Cc: Eric Wong (Contractor, The Linux Foundation)

First off, decode text portions of messages since some archived
mail I got was converted from quoted-printable or base-64 to
8bit by the original recipient.  Attempting to merge them with
my own archives (which had no conversion done) led to
unnecessary duplicates showing up.

Then, normalize CRLF line endings in text portions to LF.

In the headers, we relax the content_id hashing to ignore quotes
and lower-case domain names in To, Cc, and From headers since
some mail processors will alter them.

Finally, I've discovered Email::MIME->new($mime->as_string)
does not always round-trip reliably, so we calculate the
content_id twice on user-supplied messages.
---
 lib/PublicInbox/ContentId.pm  | 67 +++++++++++++++++++++++++++++------
 lib/PublicInbox/V2Writable.pm | 35 +++++++++++++-----
 t/content_id.t                | 10 ++++++
 3 files changed, 94 insertions(+), 18 deletions(-)

diff --git a/lib/PublicInbox/ContentId.pm b/lib/PublicInbox/ContentId.pm
index 279eec0..b1d27eb 100644
--- a/lib/PublicInbox/ContentId.pm
+++ b/lib/PublicInbox/ContentId.pm
@@ -7,10 +7,19 @@ use warnings;
 use base qw/Exporter/;
 our @EXPORT_OK = qw/content_id content_digest/;
 use PublicInbox::MID qw(mids references);
+use PublicInbox::MsgIter;
 
 # not sure if less-widely supported hash families are worth bothering with
 use Digest::SHA;
 
+sub digest_addr ($$$) {
+	my ($dig, $h, $v) = @_;
+	$v =~ tr/"//d;
+	$v =~ s/@([a-z0-9\_\.\-\(\)]*([A-Z])\S*)/'@'.lc($1)/ge;
+	utf8::encode($v);
+	$dig->add("$h\0$v\0");
+}
+
 sub content_digest ($) {
 	my ($mime) = @_;
 	my $dig = Digest::SHA->new(256);
@@ -27,24 +36,62 @@ sub content_digest ($) {
 	}
 	foreach my $mid (@{references($hdr)}) {
 		next if $seen{$mid};
-		$dig->add('ref: '.$mid);
+		$dig->add("ref\0$mid\0");
 	}
 
 	# Only use Sender: if From is not present
 	foreach my $h (qw(From Sender)) {
-		my @v = $hdr->header_raw($h);
+		my @v = $hdr->header($h);
 		if (@v) {
-			$dig->add("$h: $_") foreach @v;
-			last;
+			digest_addr($dig, $h, $_) foreach @v;
 		}
 	}
-
-	# Content-* headers are often no-ops, so maybe we don't need them
-	foreach my $h (qw(Subject Date To Cc)) {
-		my @v = $hdr->header_raw($h);
-		$dig->add("$h: $_") foreach @v;
+	foreach my $h (qw(Subject Date)) {
+		my @v = $hdr->header($h);
+		foreach my $v (@v) {
+			utf8::encode($v);
+			$dig->add("$h\0$v\0");
+		}
+	}
+	# Some mail processors will add " to unquoted names that were
+	# not in the original message.  For the purposes of deduplication,
+	# do not take it into account:
+	foreach my $h (qw(To Cc)) {
+		my @v = $hdr->header($h);
+		digest_addr($dig, $h, $_) foreach @v;
 	}
-	$dig->add($mime->body_raw);
+	msg_iter($mime, sub {
+		my ($part, $depth, @idx) = @{$_[0]};
+		$dig->add("\0$depth:".join('.', @idx)."\0");
+		my $fn = $part->filename;
+		if (defined $fn) {
+			utf8::encode($fn);
+			$dig->add("fn\0$fn\0");
+		}
+		my @d = $part->header('Content-Description');
+		foreach my $d (@d) {
+			utf8::encode($d);
+			$dig->add("d\0$d\0");
+		}
+		$dig->add("b\0");
+		my $ct = $part->content_type || 'text/plain';
+		my $s = eval { $part->body_str };
+		if ($@ && $ct =~ m!\btext/plain\b!i) {
+			# Try to assume UTF-8 because Alpine
+			# seems to do wacky things and set
+			# charset=X-UNKNOWN
+			$part->charset_set('UTF-8');
+			$s = eval { $part->body_str };
+		}
+		if (defined $s) {
+			$s =~ s/\r\n/\n/gs;
+			$s =~ s/\s*\z//s;
+			utf8::encode($s);
+		} else {
+			$s = $part->body;
+		}
+		$dig->add($s);
+	});
 	$dig;
 }
 
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 0dcdeda..e9fd502 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -259,12 +259,32 @@ sub purge_oids {
 	$purges;
 }
 
+sub content_ids ($) {
+	my ($mime) = @_;
+	my @cids = ( content_id($mime) );
+
+	# Email::MIME->as_string doesn't always round-trip, so we may
+	# use a second content_id
+	my $rt = content_id(PublicInbox::MIME->new(\($mime->as_string)));
+	push @cids, $rt if $cids[0] ne $rt;
+	\@cids;
+}
+
+sub content_matches ($$) {
+	my ($cids, $existing) = @_;
+	my $cid = content_id($existing);
+	foreach (@$cids) {
+		return 1 if $_ eq $cid
+	}
+	0
+}
+
 sub remove_internal {
 	my ($self, $mime, $cmt_msg, $purge) = @_;
 	$self->idx_init;
 	my $im = $self->importer unless $purge;
 	my $over = $self->{over};
-	my $cid = content_id($mime);
+	my $cids = content_ids($mime);
 	my $parts = $self->{idx_parts};
 	my $mm = $self->{mm};
 	my $removed;
@@ -287,7 +307,7 @@ sub remove_internal {
 			}
 			my $orig = $$msg;
 			my $cur = PublicInbox::MIME->new($msg);
-			if (content_id($cur) eq $cid) {
+			if (content_matches($cids, $cur)) {
 				$smsg->{mime} = $cur;
 				$gone{$smsg->{num}} = [ $smsg, \$orig ];
 			}
@@ -572,8 +592,7 @@ sub get_blob ($$) {
 sub lookup_content {
 	my ($self, $mime, $mid) = @_;
 	my $over = $self->{over};
-	my $cid = content_id($mime);
-	my $found;
+	my $cids = content_ids($mime);
 	my ($id, $prev);
 	while (my $smsg = $over->next_by_mid($mid, \$id, \$prev)) {
 		my $msg = get_blob($self, $smsg);
@@ -582,16 +601,16 @@ sub lookup_content {
 			next;
 		}
 		my $cur = PublicInbox::MIME->new($msg);
-		if (content_id($cur) eq $cid) {
+		if (content_matches($cids, $cur)) {
 			$smsg->{mime} = $cur;
-			$found = $smsg;
-			last;
+			return $smsg;
 		}
 
+
 		# XXX DEBUG_DIFF is experimental and may be removed
 		diff($mid, $cur, $mime) if $ENV{DEBUG_DIFF};
 	}
-	$found;
+	undef;
 }
 
 sub atfork_child {
diff --git a/t/content_id.t b/t/content_id.t
index adcdb6c..01ce65e 100644
--- a/t/content_id.t
+++ b/t/content_id.t
@@ -22,4 +22,14 @@ my $orig = content_id($mime);
 my $reload = content_id(Email::MIME->new($mime->as_string));
 is($orig, $reload, 'content_id matches after serialization');
 
+foreach my $h (qw(From To Cc)) {
+	my $n = '"Quoted N\'Ame" <foo@EXAMPLE.com>';
+	$mime->header_str_set($h, "$n");
+	my $q = content_id($mime);
+	is($n, $mime->header($h), "content_id does not mutate $h:");
+	$mime->header_str_set($h, 'Quoted N\'Ame <foo@example.com>');
+	my $nq = content_id($mime);
+	is($nq, $q, "quotes ignored in $h:");
+}
+
 done_testing();
-- 
EW


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 08/12] import: cat_blob drops leading 'From ' lines like Inbox
  2018-04-18  9:13 [PATCH 00/12] better dedupe, contiguous article numbers Eric Wong (Contractor, The Linux Foundation)
                   ` (6 preceding siblings ...)
  2018-04-18  9:13 ` [PATCH 07/12] v2: improve deduplication checks Eric Wong (Contractor, The Linux Foundation)
@ 2018-04-18  9:13 ` Eric Wong (Contractor, The Linux Foundation)
  2018-04-18  9:13 ` [PATCH 09/12] searchidx: regenerate and avoid article number gaps on full index Eric Wong (Contractor, The Linux Foundation)
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-04-18  9:13 UTC (permalink / raw)
  To: meta; +Cc: Eric Wong (Contractor, The Linux Foundation)

In case people were running old buggy versions from 2016...
(and -convert should probably clean those up, eventually)
---
 lib/PublicInbox/Import.pm | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index c7a96e1..b25427e 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -117,6 +117,9 @@ sub _cat_blob ($$$) {
 	$n = read($r, my $lf, 1);
 	defined($n) or die "read final byte of cat-blob failed: $!";
 	die "bad read on final byte: <$lf>" if $lf ne "\n";
+
+	# fixup some bugginess in old versions:
+	$buf =~ s/\A[\r\n]*From [^\r\n]*\r?\n//s;
 	\$buf;
 }
 
-- 
EW


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 09/12] searchidx: regenerate and avoid article number gaps on full index
  2018-04-18  9:13 [PATCH 00/12] better dedupe, contiguous article numbers Eric Wong (Contractor, The Linux Foundation)
                   ` (7 preceding siblings ...)
  2018-04-18  9:13 ` [PATCH 08/12] import: cat_blob drops leading 'From ' lines like Inbox Eric Wong (Contractor, The Linux Foundation)
@ 2018-04-18  9:13 ` Eric Wong (Contractor, The Linux Foundation)
  2018-04-18  9:13 ` [PATCH 10/12] extmsg: remove expensive git path checks Eric Wong (Contractor, The Linux Foundation)
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-04-18  9:13 UTC (permalink / raw)
  To: meta; +Cc: Eric Wong (Contractor, The Linux Foundation)

Some messages to git@vger went missing from Msgmap from old bugs
and became inaccessible via NNTP.  Forcing NNTP article numbers
when the overview DB came about made the problem more visible when
reindexing old (v1) repositories as all removed spam messages
took up AUTOINCREMENT numbers again before they were removed.

Having large gaps in NNTP article numbers is not good since it
throws off NNTP clients.  This does NOT prevent NNTP clients from
seeing some messages twice, but is better than having them
miss several messages entirely.

We also avoid depending on --reverse in git-log, as
git requires storing an entire commit list in memory for
--reverse, so it's cheaper to store only deleted blobs in the %D
hash since they do not live long.
---
 MANIFEST                      |   1 +
 lib/PublicInbox/Msgmap.pm     |   8 +-
 lib/PublicInbox/SearchIdx.pm  | 186 +++++++++++++++++++++-------------
 lib/PublicInbox/V2Writable.pm |  15 +--
 t/v1-add-remove-add.t         |  45 ++++++++
 5 files changed, 171 insertions(+), 84 deletions(-)
 create mode 100644 t/v1-add-remove-add.t

diff --git a/MANIFEST b/MANIFEST
index 00a0970..8038ad4 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -192,6 +192,7 @@ t/spawn.t
 t/thread-cycle.t
 t/time.t
 t/utf8.mbox
+t/v1-add-remove-add.t
 t/v2-add-remove-add.t
 t/v2mda.t
 t/v2mirror.t
diff --git a/lib/PublicInbox/Msgmap.pm b/lib/PublicInbox/Msgmap.pm
index feef8ba..3237a5e 100644
--- a/lib/PublicInbox/Msgmap.pm
+++ b/lib/PublicInbox/Msgmap.pm
@@ -108,10 +108,10 @@ sub created_at {
 sub mid_insert {
 	my ($self, $mid) = @_;
 	my $dbh = $self->{dbh};
-	my $sql = 'INSERT OR IGNORE INTO msgmap (mid) VALUES (?)';
-	my $sth = $self->{mid_insert} ||= $dbh->prepare($sql);
-	$sth->bind_param(1, $mid);
-	return if $sth->execute == 0;
+	my $sth = $dbh->prepare_cached(<<'');
+INSERT OR IGNORE INTO msgmap (mid) VALUES (?)
+
+	return if $sth->execute($mid) == 0;
 	$dbh->last_insert_id(undef, undef, 'msgmap', 'num');
 }
 
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 0b1dc21..2239c90 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -17,6 +17,7 @@ use PublicInbox::MsgIter;
 use Carp qw(croak);
 use POSIX qw(strftime);
 use PublicInbox::OverIdx;
+use PublicInbox::Spawn qw(spawn);
 require PublicInbox::Git;
 use Compress::Zlib qw(compress);
 
@@ -266,8 +267,8 @@ sub add_message {
 	my $mids = mids($mime->header_obj);
 	$mid0 = $mids->[0] unless defined $mid0; # v1 compatibility
 	unless (defined $num) { # v1
-		my $mm = $self->_msgmap_init;
-		$num = $mm->mid_insert($mid0) || $mm->num_for($mid0);
+		$self->_msgmap_init;
+		$num = index_mm($self, $mime);
 	}
 	eval {
 		my $smsg = PublicInbox::SearchMsg->new($mime);
@@ -464,8 +465,28 @@ sub index_mm {
 	my ($self, $mime) = @_;
 	my $mid = mid_clean(mid_mime($mime));
 	my $mm = $self->{mm};
-	my $num = $mm->mid_insert($mid);
-	return $num if defined $num;
+	my $num;
+
+	if (defined $self->{regen_down}) {
+		$num = $mm->num_for($mid) and return $num;
+
+		while (($num = $self->{regen_down}--) > 0) {
+			if ($mm->mid_set($num, $mid) != 0) {
+				return $num;
+			}
+		}
+	} elsif (defined $self->{regen_up}) {
+		$num = $mm->num_for($mid) and return $num;
+
+		# this is to fixup old bugs due to add-remove-add
+		while (($num = ++$self->{regen_up})) {
+			if ($mm->mid_set($num, $mid) != 0) {
+				return $num;
+			}
+		}
+	}
+
+	$num = $mm->mid_insert($mid) and return $num;
 
 	# fallback to num_for since filters like RubyLang set the number
 	$mm->num_for($mid);
@@ -476,18 +497,6 @@ sub unindex_mm {
 	$self->{mm}->mid_delete(mid_clean(mid_mime($mime)));
 }
 
-sub index_mm2 {
-	my ($self, $mime, $bytes, $blob) = @_;
-	my $num = $self->{mm}->num_for(mid_clean(mid_mime($mime)));
-	add_message($self, $mime, $bytes, $num, $blob);
-}
-
-sub unindex_mm2 {
-	my ($self, $mime) = @_;
-	$self->{mm}->mid_delete(mid_clean(mid_mime($mime)));
-	unindex_blob($self, $mime);
-}
-
 sub index_both {
 	my ($self, $mime, $bytes, $blob) = @_;
 	my $num = index_mm($self, $mime);
@@ -521,12 +530,12 @@ sub batch_adjust ($$$$) {
 	$$max -= $bytes;
 	if ($$max <= 0) {
 		$$max = BATCH_BYTES;
-		$batch_cb->($latest, 1);
+		$batch_cb->($latest);
 	}
 }
 
 # only for v1
-sub rlog {
+sub read_log {
 	my ($self, $log, $add_cb, $del_cb, $batch_cb) = @_;
 	my $hex = '[a-f0-9]';
 	my $h40 = $hex .'{40}';
@@ -537,23 +546,39 @@ sub rlog {
 	my $bytes;
 	my $max = BATCH_BYTES;
 	local $/ = "\n";
+	my %D;
 	my $line;
+	my $newest;
+	my $mid = '20170114215743.5igbjup6qpsh3jfg@genre.crustytoothpaste.net';
 	while (defined($line = <$log>)) {
 		if ($line =~ /$addmsg/o) {
 			my $blob = $1;
+			delete $D{$blob} and next;
 			my $mime = do_cat_mail($git, $blob, \$bytes) or next;
+			my $mids = mids($mime->header_obj);
+			foreach (@$mids) {
+				warn "ADD $mid\n" if ($_ eq $mid);
+			}
 			batch_adjust(\$max, $bytes, $batch_cb, $latest);
 			$add_cb->($self, $mime, $bytes, $blob);
 		} elsif ($line =~ /$delmsg/o) {
 			my $blob = $1;
-			my $mime = do_cat_mail($git, $blob, \$bytes) or next;
-			batch_adjust(\$max, $bytes, $batch_cb, $latest);
-			$del_cb->($self, $mime);
+			$D{$blob} = 1;
 		} elsif ($line =~ /^commit ($h40)/o) {
 			$latest = $1;
+			$newest ||= $latest;
+		}
+	}
+	# get the leftovers
+	foreach my $blob (keys %D) {
+		my $mime = do_cat_mail($git, $blob, \$bytes) or next;
+		my $mids = mids($mime->header_obj);
+		foreach (@$mids) {
+			warn "DEL $mid\n" if ($_ eq $mid);
 		}
+		$del_cb->($self, $mime);
 	}
-	$batch_cb->($latest, 0);
+	$batch_cb->($latest, $newest);
 }
 
 sub _msgmap_init {
@@ -567,18 +592,62 @@ sub _msgmap_init {
 
 sub _git_log {
 	my ($self, $range) = @_;
-	$self->{git}->popen(qw/log --reverse --no-notes --no-color
+	my $git = $self->{git};
+
+	if (index($range, '..') < 0) {
+		my $regen_max = 0;
+		# can't use 'rev-list --count' if we use --diff-filter
+		my $fh = $git->popen(qw(log --pretty=tformat:%h
+				--no-notes --no-color --no-renames
+				--diff-filter=AM), $range);
+		++$regen_max while <$fh>;
+		my (undef, $max) = $self->{mm}->minmax;
+
+		if ($max && $max == $regen_max) {
+			# fix up old bugs in full indexes which caused messages to
+			# not appear in Msgmap
+			$self->{regen_up} = $max;
+		} else {
+			# normal regen is for for fresh data
+			$self->{regen_down} = $regen_max;
+		}
+	}
+
+	$git->popen(qw/log --no-notes --no-color --no-renames
 				--raw -r --no-abbrev/, $range);
 }
 
+sub is_ancestor ($$$) {
+	my ($git, $cur, $tip) = @_;
+	return 0 unless $git->check($cur);
+	my $cmd = [ 'git', "--git-dir=$git->{git_dir}",
+		qw(merge-base --is-ancestor), $cur, $tip ];
+	my $pid = spawn($cmd);
+	defined $pid or die "spawning ".join(' ', @$cmd)." failed: $!";
+	waitpid($pid, 0) == $pid or die join(' ', @$cmd) .' did not finish';
+	$? == 0;
+}
+
+sub need_update ($$$) {
+	my ($self, $cur, $new) = @_;
+	my $git = $self->{git};
+	return 1 if $cur && !is_ancestor($git, $cur, $new);
+	my $range = $cur eq '' ? $new : "$cur..$new";
+	chomp(my $n = $git->qx(qw(rev-list --count), $range));
+	($n eq '' || $n > 0);
+}
+
 # indexes all unindexed messages (v1 only)
 sub _index_sync {
 	my ($self, $opts) = @_;
 	my $tip = $opts->{ref} || 'HEAD';
 	my $reindex = $opts->{reindex};
 	my ($mkey, $last_commit, $lx, $xlog);
-	$self->{git}->batch_prepare;
+	my $git = $self->{git};
+	$git->batch_prepare;
+
 	my $xdb = $self->begin_txn_lazy;
+	my $mm = _msgmap_init($self);
 	do {
 		$xlog = undef;
 		$mkey = 'last_commit';
@@ -588,73 +657,54 @@ sub _index_sync {
 			$lx = '';
 			$mkey = undef if $last_commit ne '';
 		}
+
+		# use last_commit from msgmap if it is older or unset
+		my $lm = $mm->last_commit || '';
+		if (!$lm || ($lm && $lx && is_ancestor($git, $lm, $lx))) {
+			$lx = $lm;
+		}
+
 		$self->{over}->rollback_lazy;
 		$self->{over}->disconnect;
 		delete $self->{txn};
 		$xdb->cancel_transaction;
 		$xdb = _xdb_release($self);
 
-		# ensure we leak no FDs to "git log"
+		# ensure we leak no FDs to "git log" with Xapian <= 1.2
 		my $range = $lx eq '' ? $tip : "$lx..$tip";
 		$xlog = _git_log($self, $range);
 
 		$xdb = $self->begin_txn_lazy;
 	} while ($xdb->get_metadata('last_commit') ne $last_commit);
 
-	my $mm = _msgmap_init($self);
 	my $dbh = $mm->{dbh} if $mm;
-	my $mm_only;
 	my $cb = sub {
-		my ($commit, $more) = @_;
+		my ($commit, $newest) = @_;
 		if ($dbh) {
-			$mm->last_commit($commit) if $commit;
+			if ($newest) {
+				my $cur = $mm->last_commit || '';
+				if (need_update($self, $cur, $newest)) {
+					$mm->last_commit($newest);
+				}
+			}
 			$dbh->commit;
 		}
-		if (!$mm_only) {
-			$xdb->set_metadata($mkey, $commit) if $mkey && $commit;
-			$self->commit_txn_lazy;
+		if ($mkey && $newest) {
+			my $cur = $xdb->get_metadata($mkey);
+			if (need_update($self, $cur, $newest)) {
+				$xdb->set_metadata($mkey, $newest);
+			}
 		}
+		$self->commit_txn_lazy;
 		# let another process do some work... <
-		if ($more) {
-			if (!$mm_only) {
-				$xdb = $self->begin_txn_lazy;
-			}
+		if (!$newest) {
+			$xdb = $self->begin_txn_lazy;
 			$dbh->begin_work if $dbh;
 		}
 	};
 
-	if ($mm) {
-		$dbh->begin_work;
-		my $lm = $mm->last_commit || '';
-		if ($lm eq $lx) {
-			# Common case is the indexes are synced,
-			# we only need to run git-log once:
-			rlog($self, $xlog, *index_both, *unindex_both, $cb);
-		} else {
-			# Uncommon case, msgmap and xapian are out-of-sync
-			# do not care for performance (but git is fast :>)
-			# This happens if we have to reindex Xapian since
-			# msgmap is a frozen format and our Xapian format
-			# is evolving.
-			my $r = $lm eq '' ? $tip : "$lm..$tip";
-
-			# first, ensure msgmap is up-to-date:
-			my $mkey_prev = $mkey;
-			$mkey = undef; # ignore xapian, for now
-			my $mlog = _git_log($self, $r);
-			$mm_only = 1;
-			rlog($self, $mlog, *index_mm, *unindex_mm, $cb);
-			$mm_only = $mlog = undef;
-
-			# now deal with Xapian
-			$mkey = $mkey_prev;
-			$dbh = undef;
-			rlog($self, $xlog, *index_mm2, *unindex_mm2, $cb);
-		}
-	} else {
-		# user didn't install DBD::SQLite and DBI
-		rlog($self, $xlog, *add_message, *unindex_blob, $cb);
-	}
+	$dbh->begin_work;
+	read_log($self, $xlog, *index_both, *unindex_both, $cb);
 }
 
 sub DESTROY {
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index e9fd502..2cc8730 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -15,7 +15,8 @@ use PublicInbox::ContentId qw(content_id content_digest);
 use PublicInbox::Inbox;
 use PublicInbox::OverIdx;
 use PublicInbox::Msgmap;
-use PublicInbox::Spawn;
+use PublicInbox::Spawn qw(spawn);
+use PublicInbox::SearchIdx;
 use IO::Handle;
 
 # an estimate of the post-packed size to the raw uncompressed size
@@ -561,7 +562,6 @@ sub import_init {
 sub diff ($$$) {
 	my ($mid, $cur, $new) = @_;
 	use File::Temp qw(tempfile);
-	use PublicInbox::Spawn qw(spawn);
 
 	my ($ah, $an) = tempfile('email-cur-XXXXXXXX', TMPDIR => 1);
 	print $ah $cur->as_string or die "print: $!";
@@ -738,16 +738,7 @@ sub last_commits {
 	$heads;
 }
 
-sub is_ancestor ($$$) {
-	my ($git, $cur, $tip) = @_;
-	return 0 unless $git->check($cur);
-	my $cmd = [ 'git', "--git-dir=$git->{git_dir}",
-		qw(merge-base --is-ancestor), $cur, $tip ];
-	my $pid = spawn($cmd);
-	defined $pid or die "spawning ".join(' ', @$cmd)." failed: $!";
-	waitpid($pid, 0) == $pid or die join(' ', @$cmd) .' did not finish';
-	$? == 0;
-}
+*is_ancestor = *PublicInbox::SearchIdx::is_ancestor;
 
 sub index_prepare {
 	my ($self, $opts, $epoch_max, $ranges) = @_;
diff --git a/t/v1-add-remove-add.t b/t/v1-add-remove-add.t
new file mode 100644
index 0000000..cd6e281
--- /dev/null
+++ b/t/v1-add-remove-add.t
@@ -0,0 +1,45 @@
+# Copyright (C) 2018 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+use strict;
+use warnings;
+use Test::More;
+use PublicInbox::MIME;
+use PublicInbox::Import;
+use PublicInbox::SearchIdx;
+use File::Temp qw/tempdir/;
+
+foreach my $mod (qw(DBD::SQLite Search::Xapian)) {
+	eval "require $mod";
+	plan skip_all => "$mod missing for v1-add-remove-add.t" if $@;
+}
+my $mainrepo = tempdir('pi-add-remove-add-XXXXXX', TMPDIR => 1, CLEANUP => 1);
+is(system(qw(git init --bare), $mainrepo), 0);
+my $ibx = {
+	mainrepo => $mainrepo,
+	name => 'test-add-remove-add',
+	-primary_address => 'test@example.com',
+};
+$ibx = PublicInbox::Inbox->new($ibx);
+my $mime = PublicInbox::MIME->create(
+	header => [
+		From => 'a@example.com',
+		To => 'test@example.com',
+		Subject => 'this is a subject',
+		Date => 'Fri, 02 Oct 1993 00:00:00 +0000',
+		'Message-ID' => '<a-mid@b>',
+	],
+	body => "hello world\n",
+);
+my $im = PublicInbox::Import->new($ibx->git, undef, undef, $ibx);
+ok($im->add($mime), 'message added');
+ok($im->remove($mime), 'message added');
+ok($im->add($mime), 'message added again');
+$im->done;
+my $rw = PublicInbox::SearchIdx->new($ibx, 1);
+$rw->index_sync;
+my $msgs = $ibx->recent({limit => 10});
+is($msgs->[0]->{mid}, 'a-mid@b', 'message exists in history');
+is(scalar @$msgs, 1, 'only one message in history');
+is($ibx->mm->num_for('a-mid@b'), 2, 'exists with second article number');
+
+done_testing();
-- 
EW


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 10/12] extmsg: remove expensive git path checks
  2018-04-18  9:13 [PATCH 00/12] better dedupe, contiguous article numbers Eric Wong (Contractor, The Linux Foundation)
                   ` (8 preceding siblings ...)
  2018-04-18  9:13 ` [PATCH 09/12] searchidx: regenerate and avoid article number gaps on full index Eric Wong (Contractor, The Linux Foundation)
@ 2018-04-18  9:13 ` Eric Wong (Contractor, The Linux Foundation)
  2018-04-18  9:13 ` [PATCH 11/12] use %H consistently to disable abbreviations Eric Wong (Contractor, The Linux Foundation)
  2018-04-18  9:13 ` [PATCH 12/12] searchidx: increase term positions for all text terms Eric Wong (Contractor, The Linux Foundation)
  11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-04-18  9:13 UTC (permalink / raw)
  To: meta; +Cc: Eric Wong (Contractor, The Linux Foundation)

Searching across different inboxes is expensive without
SQLite (or Xapian) installed, so avoid doing expensive tree
lookups in git.  Since SQLite is required for Xapian
support anyways, we won't need to check Xapian, either.

Sites without SQLite installed will simply 404 if somebody
requests a message which isn't in the current inbox.
---
 lib/PublicInbox/ExtMsg.pm | 39 +++++++--------------------------------
 lib/PublicInbox/Inbox.pm  |  5 -----
 2 files changed, 7 insertions(+), 37 deletions(-)

diff --git a/lib/PublicInbox/ExtMsg.pm b/lib/PublicInbox/ExtMsg.pm
index c71510f..a6f516d 100644
--- a/lib/PublicInbox/ExtMsg.pm
+++ b/lib/PublicInbox/ExtMsg.pm
@@ -31,30 +31,19 @@ sub ext_msg {
 	my $cur = $ctx->{-inbox};
 	my $mid = $ctx->{mid};
 
-	eval { require PublicInbox::Search };
-	my $have_xap = $@ ? 0 : 1;
-	my (@nox, @ibx, @found);
+	eval { require PublicInbox::Msgmap };
+	my $have_mm = $@ ? 0 : 1;
+	my (@ibx, @found);
 
 	$ctx->{www}->{pi_config}->each_inbox(sub {
 		my ($other) = @_;
 		return if $other->{name} eq $cur->{name} || !$other->base_url;
 
-		my $s = $other->search;
-		if (!$s) {
-			push @nox, $other;
-			return;
-		}
-
-		# try to find the URL with Xapian to avoid forking
-		my $doc_id = eval { $s->find_first_doc_id('Q' . $mid) };
-		if ($@) {
-			# xapian not configured properly for this repo
-			push @nox, $other;
-			return;
-		}
+		my $mm = $other->mm or return;
 
-		# maybe we found it!
-		if (defined $doc_id) {
+		# try to find the URL with Msgmap to avoid forking
+		my $num = $mm->num_for($mid);
+		if (defined $num) {
 			push @found, $other;
 		} else {
 			# no point in trying the fork fallback if we
@@ -66,20 +55,6 @@ sub ext_msg {
 
 	return exact($ctx, \@found, $mid) if @found;
 
-	# Xapian not installed or configured for some repos,
-	# do a full MID check (this is expensive...):
-	if (@nox) {
-		my $path = mid2path($mid);
-		foreach my $other (@nox) {
-			my (undef, $type, undef) = $other->path_check($path);
-
-			if ($type && $type eq 'blob') {
-				push @found, $other;
-			}
-		}
-	}
-	return exact($ctx, \@found, $mid) if @found;
-
 	# fall back to partial MID matching
 	my $n_partial = 0;
 	my @partial;
diff --git a/lib/PublicInbox/Inbox.pm b/lib/PublicInbox/Inbox.pm
index f71493a..706089c 100644
--- a/lib/PublicInbox/Inbox.pm
+++ b/lib/PublicInbox/Inbox.pm
@@ -290,11 +290,6 @@ sub smsg_mime {
 	}
 }
 
-sub path_check {
-	my ($self, $path) = @_;
-	git($self)->check('HEAD:'.$path);
-}
-
 sub mid2num($$) {
 	my ($self, $mid) = @_;
 	my $mm = mm($self) or return;
-- 
EW


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 11/12] use %H consistently to disable abbreviations
  2018-04-18  9:13 [PATCH 00/12] better dedupe, contiguous article numbers Eric Wong (Contractor, The Linux Foundation)
                   ` (9 preceding siblings ...)
  2018-04-18  9:13 ` [PATCH 10/12] extmsg: remove expensive git path checks Eric Wong (Contractor, The Linux Foundation)
@ 2018-04-18  9:13 ` Eric Wong (Contractor, The Linux Foundation)
  2018-04-18  9:13 ` [PATCH 12/12] searchidx: increase term positions for all text terms Eric Wong (Contractor, The Linux Foundation)
  11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-04-18  9:13 UTC (permalink / raw)
  To: meta; +Cc: Eric Wong (Contractor, The Linux Foundation)

We generally do not want git to waste time finding abbreviations
and we do not want the possibility of them becoming ambiguous
over time, either.
---
 lib/PublicInbox/Feed.pm       | 2 +-
 lib/PublicInbox/V2Writable.pm | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/Feed.pm b/lib/PublicInbox/Feed.pm
index d7df07e..b373a1e 100644
--- a/lib/PublicInbox/Feed.pm
+++ b/lib/PublicInbox/Feed.pm
@@ -116,7 +116,7 @@ sub recent_msgs {
 	my $log = $ibx->git->popen(qw/log
 				--no-notes --no-color --raw -r
 				--no-abbrev --abbrev-commit/,
-				"--format=%h", $range);
+				"--format=%H", $range);
 	my %deleted; # only an optimization at this point
 	my $last;
 	my $last_commit;
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 2cc8730..1316d62 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -788,7 +788,7 @@ $range
 		$ranges->[$i] = $range;
 
 		# can't use 'rev-list --count' if we use --diff-filter
-		my $fh = $git->popen(qw(log --pretty=tformat:%h
+		my $fh = $git->popen(qw(log --pretty=tformat:%H
 				--no-notes --no-color --no-renames
 				--diff-filter=AM), $range, '--', 'm');
 		++$regen_max while <$fh>;
@@ -865,7 +865,7 @@ sub index_sync {
 	my $regen = $self->index_prepare($opts, $epoch_max, $ranges);
 	$$regen += $max if $max;
 	my $D = {};
-	my @cmd = qw(log --raw -r --pretty=tformat:%h
+	my @cmd = qw(log --raw -r --pretty=tformat:%H
 			--no-notes --no-color --no-abbrev --no-renames);
 
 	# work backwards through history
-- 
EW


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 12/12] searchidx: increase term positions for all text terms
  2018-04-18  9:13 [PATCH 00/12] better dedupe, contiguous article numbers Eric Wong (Contractor, The Linux Foundation)
                   ` (10 preceding siblings ...)
  2018-04-18  9:13 ` [PATCH 11/12] use %H consistently to disable abbreviations Eric Wong (Contractor, The Linux Foundation)
@ 2018-04-18  9:13 ` Eric Wong (Contractor, The Linux Foundation)
  11 siblings, 0 replies; 13+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-04-18  9:13 UTC (permalink / raw)
  To: meta; +Cc: Eric Wong (Contractor, The Linux Foundation)

We do not want phrase searches to cross between independent
fields (filenames/Message-ID vs bodies)
---
 lib/PublicInbox/SearchIdx.pm | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 2239c90..6e44887 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -295,6 +295,7 @@ sub add_message {
 			my $fn = $part->filename;
 			if (defined $fn && $fn ne '') {
 				$tg->index_text($fn, 1, 'XFN');
+				$tg->increase_termpos;
 			}
 
 			return if $ct =~ m!\btext/x?html\b!i;
@@ -330,6 +331,7 @@ sub add_message {
 
 		foreach my $mid (@$mids) {
 			$tg->index_text($mid, 1, 'XM');
+			$tg->increase_termpos;
 		}
 		$smsg->{to} = $smsg->{cc} = '';
 		PublicInbox::OverIdx::parse_references($smsg, $mid0, $mids);
-- 
EW


^ permalink raw reply related	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2018-04-18  9:13 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-04-18  9:13 [PATCH 00/12] better dedupe, contiguous article numbers Eric Wong (Contractor, The Linux Foundation)
2018-04-18  9:13 ` [PATCH 01/12] feed: respect feedmax, again Eric Wong (Contractor, The Linux Foundation)
2018-04-18  9:13 ` [PATCH 02/12] v1: remove articles from overview DB Eric Wong (Contractor, The Linux Foundation)
2018-04-18  9:13 ` [PATCH 03/12] compact: do not merge v2 repos by default Eric Wong (Contractor, The Linux Foundation)
2018-04-18  9:13 ` [PATCH 04/12] v2writable: reduce partititions by one Eric Wong (Contractor, The Linux Foundation)
2018-04-18  9:13 ` [PATCH 05/12] search: preserve References in Xapian smsg for x=t view Eric Wong (Contractor, The Linux Foundation)
2018-04-18  9:13 ` [PATCH 06/12] v2: generate better Message-IDs for duplicates Eric Wong (Contractor, The Linux Foundation)
2018-04-18  9:13 ` [PATCH 07/12] v2: improve deduplication checks Eric Wong (Contractor, The Linux Foundation)
2018-04-18  9:13 ` [PATCH 08/12] import: cat_blob drops leading 'From ' lines like Inbox Eric Wong (Contractor, The Linux Foundation)
2018-04-18  9:13 ` [PATCH 09/12] searchidx: regenerate and avoid article number gaps on full index Eric Wong (Contractor, The Linux Foundation)
2018-04-18  9:13 ` [PATCH 10/12] extmsg: remove expensive git path checks Eric Wong (Contractor, The Linux Foundation)
2018-04-18  9:13 ` [PATCH 11/12] use %H consistently to disable abbreviations Eric Wong (Contractor, The Linux Foundation)
2018-04-18  9:13 ` [PATCH 12/12] searchidx: increase term positions for all text terms Eric Wong (Contractor, The Linux Foundation)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).