* [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes
@ 2018-03-06 8:42 Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 01/34] v2writable: delete ::Import obj when ->done Eric Wong (Contractor, The Linux Foundation)
` (34 more replies)
0 siblings, 35 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
Duplicate detection based on `content_id' now works and rejects
obviously re-sent messages with the same Message-Id.
Since many historical messages already have multiple Message-Ids (some
from buggy versions of git-send-email), we will inject Message-Ids as
needed to differentiate messages with the SAME Message-Id. This
prevents NNTP readers from missing out on messages.
Internally, the Message-Id we _favor_ for NNTP is also the one which
gets used for rendering threads.
Excessively long Message-Ids are just truncated to 244 for now (Xapian
limit for terms). I hope it's not an abuse vector going forward (only
one spam message used it), but this is another problem our
inject-new-Message-Id-on-duplicate scheme "solves".
Internal timestamps used for sorting now favor the first (last-added)
Received: header since is more likely to be correct than the Date:
header.
A wrong Date: header will still show up in the per-message ("permalink")
view, so it can still be used to embarass people with bad clocks :P
(Of course, downloadable mboxes will continue to show them).
For thread skeleton (index) views in HTML, we use the internal
timestamp for now; but maybe we'll use the Date: like the permalink
view. Maybe internally there can be two timestamps like git's
author-vs-committer dates.
Xapian index size is reduced, as the "nq:" search field is no longer
redundantly storing information that would be in searchable diff
fields (df* in https://public-inbox.org/git/_/text/help/).
This (along with remembering to run fstrim(8)) seems to have
reduced best-case indexing speed to around 3.5 hours for the
2000-2017 dataset I'm using \o/
Eric Wong (Contractor, The Linux Foundation) (34):
v2writable: delete ::Import obj when ->done
search: remove informational "warning" message
searchidx: add PID to error message when die-ing
content_id: special treatment for Message-Id headers
evcleanup: disable outside of daemon
v2writable: deduplicate detection on add
evcleanup: do not create event loop if nothing was registered
mid: add `mids' and `references' methods for extraction
content_id: use `mids' and `references' for MID extraction
searchidx: use new `references' method for parsing References
content_id: no need to be human-friendly
v2writable: inject new Message-IDs on true duplicates
search: revert to using 'Q' as a uniQue id per-Xapian conventions
searchidx: support indexing multiple MIDs
mid: be strict with References, but loose on Message-Id
searchidx: avoid excessive XNQ indexing with diffs
searchidxskeleton: add a note about locking
v2writable: generated Message-ID goes first
searchidx: use add_boolean_term for internal terms
searchidx: add NNTP article number as a searchable term
mid: truncate excessively long MIDs early
nntp: use NNTP article numbers for lookups
nntp: fix NEWNEWS command
searchidx: store the primary MID in doc data for NNTP
import: consolidate object info for v2 imports
v2: avoid redundant/repeated configs for git partition repos
INSTALL: document more optional dependencies
search: favor skeleton DB for lookup_mail
search: each_smsg_by_mid uses skeleton if available
v2writable: remove unnecessary skeleton commit
favor Received: date over Date: header globally
import: fall back to Sender for extracting name and email
scripts/import_vger_from_mbox: perform mboxrd or mboxo escaping
v2writable: detect and use previous partition count
INSTALL | 13 ++
MANIFEST | 2 +
lib/PublicInbox/ContentId.pm | 32 +++--
lib/PublicInbox/Daemon.pm | 1 +
lib/PublicInbox/EvCleanup.pm | 6 +-
lib/PublicInbox/ExtMsg.pm | 2 +-
lib/PublicInbox/Import.pm | 99 ++++++-------
lib/PublicInbox/Inbox.pm | 1 +
lib/PublicInbox/MID.pm | 55 +++++++-
lib/PublicInbox/MsgTime.pm | 51 +++++++
lib/PublicInbox/NNTP.pm | 31 ++---
lib/PublicInbox/Search.pm | 70 ++++++++--
lib/PublicInbox/SearchIdx.pm | 260 +++++++++++++++++++++--------------
lib/PublicInbox/SearchIdxPart.pm | 8 +-
lib/PublicInbox/SearchIdxSkeleton.pm | 27 +---
lib/PublicInbox/SearchMsg.pm | 26 ++--
lib/PublicInbox/V2Writable.pm | 166 +++++++++++++++++++---
lib/PublicInbox/View.pm | 8 +-
lib/PublicInbox/WwwAtomStream.pm | 5 +-
scripts/import_vger_from_mbox | 11 +-
t/content_id.t | 5 +-
t/import.t | 9 +-
t/init.t | 2 +
t/mid.t | 22 ++-
t/nntpd.t | 2 +
t/search-thr-index.t | 2 +-
t/v2writable.t | 195 ++++++++++++++++++++++++++
27 files changed, 842 insertions(+), 269 deletions(-)
create mode 100644 lib/PublicInbox/MsgTime.pm
create mode 100644 t/v2writable.t
--
EW
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH 01/34] v2writable: delete ::Import obj when ->done
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 02/34] search: remove informational "warning" message Eric Wong (Contractor, The Linux Foundation)
` (33 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
As with the ::Import class this wraps, we want this to be
usable as a checkpoint and be able to call ->add afterwards.
We'll be relying on ->done to flush changes through all
partition and skeleton DBs for deduplication checks.
---
lib/PublicInbox/V2Writable.pm | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 81c7a4d..0470fb0 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -115,7 +115,7 @@ sub remove {
sub done {
my ($self) = @_;
- my $im = $self->{im};
+ my $im = delete $self->{im};
$im->done if $im; # PublicInbox::Import::done
$self->searchidx_checkpoint(0);
}
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 02/34] search: remove informational "warning" message
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 01/34] v2writable: delete ::Import obj when ->done Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 03/34] searchidx: add PID to error message when die-ing Eric Wong (Contractor, The Linux Foundation)
` (32 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
It was making imports too noisy.
---
lib/PublicInbox/Search.pm | 1 -
1 file changed, 1 deletion(-)
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index a796cf6..21c72b6 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -158,7 +158,6 @@ sub new {
$xdb = $sub;
}
}
- warn "v2 repo with $parts found in $dir\n";
$self->{xdb} = $xdb;
$self->{skel} = Search::Xapian::Database->new("$dir/skel");
} else {
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 03/34] searchidx: add PID to error message when die-ing
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 01/34] v2writable: delete ::Import obj when ->done Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 02/34] search: remove informational "warning" message Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 04/34] content_id: special treatment for Message-Id headers Eric Wong (Contractor, The Linux Foundation)
` (31 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
---
lib/PublicInbox/SearchIdx.pm | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index f4238fe..ec3a6f3 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -837,7 +837,7 @@ sub remote_close {
print $w "close\n" or die "failed to write to pid:$pid: $!\n";
close $w or die "failed to close pipe for pid:$pid: $!\n";
waitpid($pid, 0) == $pid or die "remote process did not finish";
- $? == 0 or die ref($self)." exited with: $?";
+ $? == 0 or die ref($self)." pid:$pid exited with: $?";
}
1;
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 04/34] content_id: special treatment for Message-Id headers
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (2 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 03/34] searchidx: add PID to error message when die-ing Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 05/34] evcleanup: disable outside of daemon Eric Wong (Contractor, The Linux Foundation)
` (30 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
Some emails in LKML archives are identical with the only
difference being s/References:/In-Reply-To:/ in the headers.
Since this difference doesn't affect how we handle message
threading, we will treat them the same way for the purposes
of deduplication.
There may be more changes to how we do content_id along these
lines (e.g. using msg_iter to walk the message).
---
lib/PublicInbox/ContentId.pm | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/lib/PublicInbox/ContentId.pm b/lib/PublicInbox/ContentId.pm
index 65d5a76..7ec638c 100644
--- a/lib/PublicInbox/ContentId.pm
+++ b/lib/PublicInbox/ContentId.pm
@@ -11,7 +11,7 @@ our @EXPORT_OK = qw/content_id/;
use Digest::SHA;
# Content-* headers are often no-ops, so maybe we don't need them
-my @ID_HEADERS = qw(Subject From Date Message-ID References To Cc In-Reply-To);
+my @ID_HEADERS = qw(Subject From Date To Cc);
sub content_id ($;$) {
my ($mime, $alg) = @_;
@@ -19,6 +19,20 @@ sub content_id ($;$) {
my $dig = Digest::SHA->new($alg);
my $hdr = $mime->header_obj;
+ # References: and In-Reply-To: get used interchangeably
+ # in some "duplicates" in LKML. We treat them the same
+ # in SearchIdx, so treat them the same for this:
+ my @mid = $hdr->header_raw('Message-ID');
+ @mid = (join(' ', @mid) =~ /<([^>]+)>/g);
+ my $refs = join(' ', $hdr->header_raw('References'),
+ $hdr->header_raw('In-Reply-To'));
+ my @refs = ($refs =~ /<([^>]+)>/g);
+ my %seen;
+ foreach my $mid (@mid, @refs) {
+ next if $seen{$mid};
+ $dig->add($mid);
+ $seen{$mid} = 1;
+ }
foreach my $h (@ID_HEADERS) {
my @v = $hdr->header_raw($h);
$dig->add($_) foreach @v;
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 05/34] evcleanup: disable outside of daemon
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (3 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 04/34] content_id: special treatment for Message-Id headers Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 06/34] v2writable: deduplicate detection on add Eric Wong (Contractor, The Linux Foundation)
` (29 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
We'll be using these in a more OO manner for V2Writable
(which doesn't use Danga::Socket), so lets not unnecessarily
register cleanup handlers intended for network daemons.
---
lib/PublicInbox/Daemon.pm | 1 +
lib/PublicInbox/EvCleanup.pm | 2 ++
lib/PublicInbox/Inbox.pm | 1 +
3 files changed, 4 insertions(+)
diff --git a/lib/PublicInbox/Daemon.pm b/lib/PublicInbox/Daemon.pm
index 0329bd3..9d27d2b 100644
--- a/lib/PublicInbox/Daemon.pm
+++ b/lib/PublicInbox/Daemon.pm
@@ -460,6 +460,7 @@ sub daemon_loop ($$) {
@listeners = map {
PublicInbox::Listener->new($_, $post_accept)
} @listeners;
+ $PublicInbox::EvCleanup::ENABLED = 1;
Danga::Socket->EventLoop;
$parent_pipe = undef;
}
diff --git a/lib/PublicInbox/EvCleanup.pm b/lib/PublicInbox/EvCleanup.pm
index 559730e..8ed5180 100644
--- a/lib/PublicInbox/EvCleanup.pm
+++ b/lib/PublicInbox/EvCleanup.pm
@@ -7,6 +7,8 @@ use strict;
use warnings;
use base qw(Danga::Socket);
use fields qw(rd);
+
+our $ENABLED;
my $singleton;
my $asapq = [ [], undef ];
my $nextq = [ [], undef ];
diff --git a/lib/PublicInbox/Inbox.pm b/lib/PublicInbox/Inbox.pm
index 54a6eb3..b9cd4c4 100644
--- a/lib/PublicInbox/Inbox.pm
+++ b/lib/PublicInbox/Inbox.pm
@@ -29,6 +29,7 @@ sub cleanup_task () {
sub _cleanup_later ($) {
my ($self) = @_;
+ return unless $PublicInbox::EvCleanup::ENABLED;
$cleanup_timer ||= PublicInbox::EvCleanup::later(*cleanup_task);
$CLEANUP->{"$self"} = $self;
}
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 06/34] v2writable: deduplicate detection on add
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (4 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 05/34] evcleanup: disable outside of daemon Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 07/34] evcleanup: do not create event loop if nothing was registered Eric Wong (Contractor, The Linux Foundation)
` (28 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
This is a bit expensive in a multi-process situation because
we need to make our indices and packs visible to the read-only
pieces.
---
lib/PublicInbox/Search.pm | 16 ++++++++++
lib/PublicInbox/SearchIdx.pm | 9 ++++--
lib/PublicInbox/V2Writable.pm | 68 +++++++++++++++++++++++++++++++++++++------
3 files changed, 81 insertions(+), 12 deletions(-)
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 21c72b6..c074410 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -371,6 +371,22 @@ sub lookup_mail { # no ghosts!
});
}
+sub each_smsg_by_mid {
+ my ($self, $mid, $cb) = @_;
+ $mid = mid_clean($mid);
+ my $xdb = $self->{xdb};
+ # XXX retry_reopen isn't necessary for V2Writable, but the PSGI
+ # interface will need it...
+ my ($head, $tail) = $self->find_doc_ids('XMID' . $mid);
+ for (; $head->nequal($tail); $head->inc) {
+ my $doc_id = $head->get_docid;
+ my $doc = $xdb->get_document($doc_id);
+ my $smsg = PublicInbox::SearchMsg->wrap($doc, $mid);
+ $smsg->{doc_id} = $doc_id;
+ $cb->($smsg) or return;
+ }
+}
+
sub find_unique_doc_id {
my ($self, $termval) = @_;
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index ec3a6f3..ed52e38 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -284,7 +284,11 @@ sub add_message {
my $db = $self->{xdb};
my ($doc_id, $old_tid);
- my $mid = mid_clean(mid_mime($mime));
+ my @mids = mid_mime($mime);
+ if (@mids > 1) {
+ warn "Multi-MID: ( ",join(' | ', @mids)," )\n";
+ }
+ my $mid = mid_clean($mids[0]);
my $skel = $self->{skeleton};
eval {
@@ -512,13 +516,12 @@ sub unindex_blob {
}
sub index_mm {
- my ($self, $mime, $warn_existing) = @_;
+ my ($self, $mime) = @_;
my $mid = mid_clean(mid_mime($mime));
my $mm = $self->{mm};
my $num = $mm->mid_insert($mid);
return $num if defined $num;
- warn "<$mid> reused\n" if $warn_existing;
# fallback to num_for since filters like RubyLang set the number
$mm->num_for($mid);
}
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 0470fb0..57cb7d3 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -11,6 +11,9 @@ use PublicInbox::SearchIdxSkeleton;
use PublicInbox::MIME;
use PublicInbox::Git;
use PublicInbox::Import;
+use PublicInbox::MID qw(mid_clean mid_mime);
+use PublicInbox::ContentId qw(content_id);
+use PublicInbox::Inbox;
# an estimate of the post-packed size to the raw uncompressed size
my $PACKING_FACTOR = 0.4;
@@ -46,22 +49,40 @@ sub new {
# mimics Import::add and wraps it for v2
sub add {
my ($self, $mime, $check_cb) = @_;
- my $existing = $self->lookup_content($mime);
- if ($existing) {
- return undef if $existing->type eq 'mail'; # duplicate
+ # spam check:
+ if ($check_cb) {
+ $mime = $check_cb->($mime) or return;
}
- my $im = $self->importer;
+ # All pipes (> $^F) known to Perl 5.6+ have FD_CLOEXEC set,
+ # as does SQLite 3.4.1+ (released in 2007-07-20), and
+ # Xapian 1.3.2+ (released 2015-03-15).
+ # For the most part, we can spawn git-fast-import without
+ # leaking FDs to it...
+ $self->idx_init;
+
+ my $mid = mid_clean(mid_mime($mime));
+ my $num = $self->{skel}->{mm}->mid_insert($mid);
+ if (!defined($num)) { # mid is already known
+ $self->done; # ensure all subprocesses are done writing
+
+ my $existing = $self->lookup_content($mime);
+ warn "<$mid> resent\n" if $existing;
+ return if $existing; # easy, don't store duplicates
+
+ # reuse NNTP article number?
+ warn "<$mid> reused for mismatched content\n";
+ $self->idx_init;
+ $num = $self->{skel}->{mm}->num_for($mid);
+ }
- # im->add returns undef if check_cb fails
- my $cmt = $im->add($mime, $check_cb) or return;
+ my $im = $self->importer;
+ my $cmt = $im->add($mime);
$cmt = $im->get_mark($cmt);
my $oid = $im->{last_object_id};
my ($len, $msgref) = @{$im->{last_object}};
- $self->idx_init;
- my $num = $self->{skel}->index_mm($mime, 1);
my $nparts = $self->{partitions};
my $part = $num % $nparts;
my $idx = $self->idx_part($part);
@@ -83,6 +104,12 @@ sub idx_part {
sub idx_init {
my ($self) = @_;
return if $self->{idx_parts};
+ my $ibx = $self->{-inbox};
+
+ # do not leak read-only FDs to child processes, we only have these
+ # FDs for duplicate detection so they should not be
+ # frequently activated.
+ delete $ibx->{$_} foreach (qw(git mm search));
# first time initialization, first we create the skeleton pipe:
my $skel = $self->{skel} = PublicInbox::SearchIdxSkeleton->new($self);
@@ -241,7 +268,30 @@ sub import_init {
}
sub lookup_content {
- undef # TODO
+ my ($self, $mime) = @_;
+ my $ibx = $self->{-inbox};
+
+ my $srch = $ibx->search;
+ my $cid = content_id($mime);
+ my $found;
+ my $mid = mid_mime($mime);
+ $srch->each_smsg_by_mid($mid, sub {
+ my ($smsg) = @_;
+ $smsg->load_expand;
+ my $msg = $ibx->msg_by_smsg($smsg);
+ if (!defined($msg)) {
+ warn "broken smsg for $mid\n";
+ return 1; # continue
+ }
+ my $cur = PublicInbox::MIME->new($msg);
+ if (content_id($cur) eq $cid) {
+ $smsg->{mime} = $cur;
+ $found = $smsg;
+ return 0; # break out of loop
+ }
+ 1; # continue
+ });
+ $found;
}
sub atfork_child {
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 07/34] evcleanup: do not create event loop if nothing was registered
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (5 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 06/34] v2writable: deduplicate detection on add Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 08/34] mid: add `mids' and `references' methods for extraction Eric Wong (Contractor, The Linux Foundation)
` (27 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
This was creating an unnecessary epoll descriptor via
Danga::Socket when using V2Writable to import a mbox. That
said, there should probably be better way of detecting whether
or not we're inside a Danga::Socket event loop.
Fixes: 427245acacaf04a8
("evcleanup: ensure deferred close from timers are handled ASAP")
---
lib/PublicInbox/EvCleanup.pm | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/lib/PublicInbox/EvCleanup.pm b/lib/PublicInbox/EvCleanup.pm
index 8ed5180..384efd3 100644
--- a/lib/PublicInbox/EvCleanup.pm
+++ b/lib/PublicInbox/EvCleanup.pm
@@ -79,8 +79,8 @@ sub later ($) {
END {
_run_asap();
- _run_next();
- _run_later();
+ _run_all($nextq);
+ _run_all($laterq);
}
1;
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 08/34] mid: add `mids' and `references' methods for extraction
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (6 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 07/34] evcleanup: do not create event loop if nothing was registered Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 09/34] content_id: use `mids' and `references' for MID extraction Eric Wong (Contractor, The Linux Foundation)
` (26 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
We'll be using a more consistent API for extracting Message-IDs
from various headers.
---
lib/PublicInbox/MID.pm | 24 +++++++++++++++++++++++-
t/mid.t | 22 +++++++++++++++++++++-
2 files changed, 44 insertions(+), 2 deletions(-)
diff --git a/lib/PublicInbox/MID.pm b/lib/PublicInbox/MID.pm
index 2c9822f..786c056 100644
--- a/lib/PublicInbox/MID.pm
+++ b/lib/PublicInbox/MID.pm
@@ -6,7 +6,8 @@ package PublicInbox::MID;
use strict;
use warnings;
use base qw/Exporter/;
-our @EXPORT_OK = qw/mid_clean id_compress mid2path mid_mime mid_escape MID_ESC/;
+our @EXPORT_OK = qw/mid_clean id_compress mid2path mid_mime mid_escape MID_ESC
+ mids references/;
use URI::Escape qw(uri_escape_utf8);
use Digest::SHA qw/sha1_hex/;
use constant MID_MAX => 40; # SHA-1 hex length
@@ -48,6 +49,27 @@ sub mid2path {
sub mid_mime ($) { $_[0]->header_obj->header_raw('Message-ID') }
+sub uniq_mids {
+ my ($hdr, @fields) = @_;
+ my %seen;
+ my @raw;
+ foreach my $f (@fields) {
+ push @raw, $hdr->header_raw($f);
+ }
+ my @mids = (join(' ', @raw) =~ /<([^>]+)>/g);
+ my $mids = scalar(@mids) == 0 ? \@raw: \@mids;
+ my @ret;
+ foreach (@$mids) {
+ next if $seen{$_};
+ push @ret, $_;
+ $seen{$_} = 1;
+ }
+ \@ret;
+}
+
+sub mids { uniq_mids($_[0], 'Message-Id') }
+sub references { uniq_mids($_[0], 'References', 'In-Reply-To') }
+
# RFC3986, section 3.3:
sub MID_ESC () { '^A-Za-z0-9\-\._~!\$\&\';\(\)\*\+,;=:@' }
sub mid_escape ($) { uri_escape_utf8($_[0], MID_ESC) }
diff --git a/t/mid.t b/t/mid.t
index 0bf3331..223be79 100644
--- a/t/mid.t
+++ b/t/mid.t
@@ -1,11 +1,31 @@
# Copyright (C) 2016-2018 all contributors <meta@public-inbox.org>
# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
use Test::More;
-use PublicInbox::MID qw(mid_escape);
+use PublicInbox::MID qw(mid_escape mids references);
is(mid_escape('foo!@(bar)'), 'foo!@(bar)');
is(mid_escape('foo%!@(bar)'), 'foo%25!@(bar)');
is(mid_escape('foo%!@(bar)'), 'foo%25!@(bar)');
+{
+ use Email::MIME;
+ my $mime = Email::MIME->create;
+ $mime->header_set('Message-Id', '<mid-1@a>');
+ is_deeply(['mid-1@a'], mids($mime->header_obj), 'mids in common case');
+ $mime->header_set('Message-Id', '<mid-1@a>', '<mid-2@b>');
+ is_deeply(['mid-1@a', 'mid-2@b'], mids($mime->header_obj), '2 mids');
+ $mime->header_set('Message-Id', '<mid-1@a>', '<mid-1@a>');
+ is_deeply(['mid-1@a'], mids($mime->header_obj), 'dup mids');
+ $mime->header_set('Message-Id', '<mid-1@a> comment');
+ is_deeply(['mid-1@a'], mids($mime->header_obj), 'comment ignored');
+ $mime->header_set('Message-Id', 'bare-mid');
+ is_deeply(['bare-mid'], mids($mime->header_obj), 'bare mid OK');
+
+ $mime->header_set('References', '<hello> <world>');
+ $mime->header_set('In-Reply-To', '<weld>');
+ is_deeply(['hello', 'world', 'weld'], references($mime->header_obj),
+ 'references combines with In-Reply-To');
+}
+
done_testing();
1;
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 09/34] content_id: use `mids' and `references' for MID extraction
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (7 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 08/34] mid: add `mids' and `references' methods for extraction Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 10/34] searchidx: use new `references' method for parsing References Eric Wong (Contractor, The Linux Foundation)
` (25 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
These already take care of deduping internally, so we'll save
ourselves at least some of the trouble while using a more
consistent API. While we're at it, hash the header name as
well, since we need to distinguish which header a certain value
came from.
---
lib/PublicInbox/ContentId.pm | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)
diff --git a/lib/PublicInbox/ContentId.pm b/lib/PublicInbox/ContentId.pm
index 7ec638c..d1a009e 100644
--- a/lib/PublicInbox/ContentId.pm
+++ b/lib/PublicInbox/ContentId.pm
@@ -6,6 +6,7 @@ use strict;
use warnings;
use base qw/Exporter/;
our @EXPORT_OK = qw/content_id/;
+use PublicInbox::MID qw(mids references);
# not sure if less-widely supported hash families are worth bothering with
use Digest::SHA;
@@ -22,20 +23,18 @@ sub content_id ($;$) {
# References: and In-Reply-To: get used interchangeably
# in some "duplicates" in LKML. We treat them the same
# in SearchIdx, so treat them the same for this:
- my @mid = $hdr->header_raw('Message-ID');
- @mid = (join(' ', @mid) =~ /<([^>]+)>/g);
- my $refs = join(' ', $hdr->header_raw('References'),
- $hdr->header_raw('In-Reply-To'));
- my @refs = ($refs =~ /<([^>]+)>/g);
my %seen;
- foreach my $mid (@mid, @refs) {
- next if $seen{$mid};
- $dig->add($mid);
+ foreach my $mid (@{mids($hdr)}) {
+ $dig->add('mid: '.$mid);
$seen{$mid} = 1;
}
+ foreach my $mid (@{references($hdr)}) {
+ next if $seen{$mid};
+ $dig->add('ref: '.$mid);
+ }
foreach my $h (@ID_HEADERS) {
my @v = $hdr->header_raw($h);
- $dig->add($_) foreach @v;
+ $dig->add("$h: $_") foreach @v;
}
$dig->add($mime->body_raw);
'SHA-' . $dig->algorithm . ':' . $dig->hexdigest;
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 10/34] searchidx: use new `references' method for parsing References
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (8 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 09/34] content_id: use `mids' and `references' for MID extraction Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 11/34] content_id: no need to be human-friendly Eric Wong (Contractor, The Linux Foundation)
` (24 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
It's shorter and more convenient, here.
---
lib/PublicInbox/MID.pm | 3 +++
lib/PublicInbox/SearchIdx.pm | 39 +++++++++++++++------------------------
2 files changed, 18 insertions(+), 24 deletions(-)
diff --git a/lib/PublicInbox/MID.pm b/lib/PublicInbox/MID.pm
index 786c056..4ccb704 100644
--- a/lib/PublicInbox/MID.pm
+++ b/lib/PublicInbox/MID.pm
@@ -68,6 +68,9 @@ sub uniq_mids {
}
sub mids { uniq_mids($_[0], 'Message-Id') }
+
+# last References should be IRT, but some mail clients do things
+# out of order, so trust IRT over References iff IRT exists
sub references { uniq_mids($_[0], 'References', 'In-Reply-To') }
# RFC3986, section 3.3:
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index ed52e38..57aed75 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -12,7 +12,7 @@ use warnings;
use Fcntl qw(:flock :DEFAULT);
use PublicInbox::MIME;
use base qw(PublicInbox::Search);
-use PublicInbox::MID qw/mid_clean id_compress mid_mime/;
+use PublicInbox::MID qw/mid_clean id_compress mid_mime mids references/;
use PublicInbox::MsgIter;
use Carp qw(croak);
use POSIX qw(strftime);
@@ -447,33 +447,24 @@ sub next_thread_id {
sub parse_references ($) {
my ($smsg) = @_;
- my $doc = $smsg->{doc};
- my $mid = $smsg->mid;
my $mime = $smsg->{mime};
my $hdr = $mime->header_obj;
-
- # last References should be IRT, but some mail clients do things
- # out of order, so trust IRT over References iff IRT exists
- my @refs = (($hdr->header_raw('References') || '') =~ /<([^>]+)>/g);
- push(@refs, (($hdr->header_raw('In-Reply-To') || '') =~ /<([^>]+)>/g));
-
- if (@refs) {
- my %uniq = ($mid => 1);
- my @orig_refs = @refs;
- @refs = ();
-
- # prevent circular references via References: here:
- foreach my $ref (@orig_refs) {
- if (length($ref) > MAX_MID_SIZE) {
- warn "References: <$ref> too long, ignoring\n";
- }
- next if $uniq{$ref};
- $uniq{$ref} = 1;
- push @refs, $ref;
+ my $refs = references($hdr);
+ return $refs if scalar(@$refs) == 0;
+
+ # prevent circular references via References here:
+ my %mids = map { $_ => 1 } @{mids($hdr)};
+ my @keep;
+ foreach my $ref (@$refs) {
+ # FIXME: this is an archive-prevention vector like X-No-Archive
+ if (length($ref) > MAX_MID_SIZE) {
+ warn "References: <$ref> too long, ignoring\n";
}
+ next if $mids{$ref};
+ push @keep, $ref;
}
- $smsg->{references} = '<'.join('> <', @refs).'>' if @refs;
- \@refs
+ $smsg->{references} = '<'.join('> <', @keep).'>' if @keep;
+ \@keep;
}
sub link_message {
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 11/34] content_id: no need to be human-friendly
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (9 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 10/34] searchidx: use new `references' method for parsing References Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 12/34] v2writable: inject new Message-IDs on true duplicates Eric Wong (Contractor, The Linux Foundation)
` (23 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
We merely use this for internal comparisons and do not store
this in Xapian. So using a shorter, non-human readable digest
is enough. Furthermore, introduce "content_digest" which
returns the Digest::SHA object for extra changes.
---
lib/PublicInbox/ContentId.pm | 15 +++++++++------
t/content_id.t | 5 +++--
2 files changed, 12 insertions(+), 8 deletions(-)
diff --git a/lib/PublicInbox/ContentId.pm b/lib/PublicInbox/ContentId.pm
index d1a009e..8347de2 100644
--- a/lib/PublicInbox/ContentId.pm
+++ b/lib/PublicInbox/ContentId.pm
@@ -5,7 +5,7 @@ package PublicInbox::ContentId;
use strict;
use warnings;
use base qw/Exporter/;
-our @EXPORT_OK = qw/content_id/;
+our @EXPORT_OK = qw/content_id content_digest/;
use PublicInbox::MID qw(mids references);
# not sure if less-widely supported hash families are worth bothering with
@@ -14,10 +14,9 @@ use Digest::SHA;
# Content-* headers are often no-ops, so maybe we don't need them
my @ID_HEADERS = qw(Subject From Date To Cc);
-sub content_id ($;$) {
- my ($mime, $alg) = @_;
- $alg ||= 256;
- my $dig = Digest::SHA->new($alg);
+sub content_digest ($) {
+ my ($mime) = @_;
+ my $dig = Digest::SHA->new(256);
my $hdr = $mime->header_obj;
# References: and In-Reply-To: get used interchangeably
@@ -37,7 +36,11 @@ sub content_id ($;$) {
$dig->add("$h: $_") foreach @v;
}
$dig->add($mime->body_raw);
- 'SHA-' . $dig->algorithm . ':' . $dig->hexdigest;
+ $dig;
+}
+
+sub content_id ($) {
+ content_digest($_[0])->digest;
}
1;
diff --git a/t/content_id.t b/t/content_id.t
index c0ae6ec..adcdb6c 100644
--- a/t/content_id.t
+++ b/t/content_id.t
@@ -18,7 +18,8 @@ my $mime = Email::MIME->create(
body => "hello world\n",
);
-my $res = content_id($mime);
-like($res, qr/\ASHA-256:[a-f0-9]{64}\z/, 'cid in format expected');
+my $orig = content_id($mime);
+my $reload = content_id(Email::MIME->new($mime->as_string));
+is($orig, $reload, 'content_id matches after serialization');
done_testing();
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 12/34] v2writable: inject new Message-IDs on true duplicates
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (10 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 11/34] content_id: no need to be human-friendly Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 13/34] search: revert to using 'Q' as a uniQue id per-Xapian conventions Eric Wong (Contractor, The Linux Foundation)
` (22 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
Since we'll need to support multiple Message-IDs anyways,
inject a new one if we hit a duplicate (or don't get one at
all).
Try to use a deterministic Message-Id for consistency, but give
up determinism and use a random Message-Id if an "attacker"
wants to prevent their message from being archived.
---
MANIFEST | 1 +
lib/PublicInbox/V2Writable.pm | 88 +++++++++++++++++++++++++++++++++----------
t/v2writable.t | 84 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 154 insertions(+), 19 deletions(-)
create mode 100644 t/v2writable.t
diff --git a/MANIFEST b/MANIFEST
index 1aaf8ff..7366aa0 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -177,5 +177,6 @@ t/spawn.t
t/thread-all.t
t/thread-cycle.t
t/utf8.mbox
+t/v2writable.t
t/view.t
t/watch_maildir.t
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 57cb7d3..6d73827 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -11,8 +11,8 @@ use PublicInbox::SearchIdxSkeleton;
use PublicInbox::MIME;
use PublicInbox::Git;
use PublicInbox::Import;
-use PublicInbox::MID qw(mid_clean mid_mime);
-use PublicInbox::ContentId qw(content_id);
+use PublicInbox::MID qw(mids);
+use PublicInbox::ContentId qw(content_id content_digest);
use PublicInbox::Inbox;
# an estimate of the post-packed size to the raw uncompressed size
@@ -62,21 +62,8 @@ sub add {
# leaking FDs to it...
$self->idx_init;
- my $mid = mid_clean(mid_mime($mime));
- my $num = $self->{skel}->{mm}->mid_insert($mid);
- if (!defined($num)) { # mid is already known
- $self->done; # ensure all subprocesses are done writing
-
- my $existing = $self->lookup_content($mime);
- warn "<$mid> resent\n" if $existing;
- return if $existing; # easy, don't store duplicates
-
- # reuse NNTP article number?
- warn "<$mid> reused for mismatched content\n";
- $self->idx_init;
- $num = $self->{skel}->{mm}->num_for($mid);
- }
-
+ my $num = num_for($self, $mime);
+ defined $num or return; # duplicate
my $im = $self->importer;
my $cmt = $im->add($mime);
$cmt = $im->get_mark($cmt);
@@ -95,6 +82,70 @@ sub add {
$mime;
}
+sub num_for {
+ my ($self, $mime) = @_;
+ my $mids = mids($mime->header_obj);
+ if (@$mids) {
+ my $mid = $mids->[0];
+ my $num = $self->{skel}->{mm}->mid_insert($mid);
+ return $num if defined($num); # common case
+
+ # crap, Message-ID is already known, hope somebody just resent:
+ $self->done; # write barrier, clears $self->{skel}
+ foreach my $m (@$mids) {
+ # read-only lookup now safe to do after above barrier
+ my $existing = $self->lookup_content($mime, $m);
+ if ($existing) {
+ warn "<$m> resent\n";
+ return; # easy, don't store duplicates
+ }
+ }
+
+ # very unlikely:
+ warn "<$mid> reused for mismatched content\n";
+ $self->idx_init;
+
+ # try the rest of the mids
+ foreach my $i (1..$#$mids) {
+ my $m = $mids->[$i];
+ $num = $self->{skel}->{mm}->mid_insert($m);
+ if (defined $num) {
+ warn "alternative <$m> for <$mid> found\n";
+ return $num;
+ }
+ }
+ }
+ # none of the existing Message-IDs are good, generate a new one:
+ num_for_harder($self, $mime);
+}
+
+sub num_for_harder {
+ my ($self, $mime) = @_;
+
+ my $hdr = $mime->header_obj;
+ my $dig = content_digest($mime);
+ my $mid = $dig->clone->hexdigest . '@localhost';
+ my $num = $self->{skel}->{mm}->mid_insert($mid);
+ unless (defined $num) {
+ # it's hard to spoof the last Received: header
+ my @recvd = $hdr->header_raw('Received');
+ $dig->add("Received: $_") foreach (@recvd);
+ $mid = $dig->clone->hexdigest . '@localhost';
+ $num = $self->{skel}->{mm}->mid_insert($mid);
+
+ # fall back to a random Message-ID and give up determinism:
+ until (defined($num)) {
+ $dig->add(rand);
+ $mid = $dig->clone->hexdigest . '@localhost';
+ warn "using random Message-ID <$mid> as fallback\n";
+ $num = $self->{skel}->{mm}->mid_insert($mid);
+ }
+ }
+ my @cur = $hdr->header_raw('Message-Id');
+ $hdr->header_set('Message-Id', @cur, "<$mid>");
+ $num;
+}
+
sub idx_part {
my ($self, $part) = @_;
$self->{idx_parts}->[$part];
@@ -268,13 +319,12 @@ sub import_init {
}
sub lookup_content {
- my ($self, $mime) = @_;
+ my ($self, $mime, $mid) = @_;
my $ibx = $self->{-inbox};
my $srch = $ibx->search;
my $cid = content_id($mime);
my $found;
- my $mid = mid_mime($mime);
$srch->each_smsg_by_mid($mid, sub {
my ($smsg) = @_;
$smsg->load_expand;
diff --git a/t/v2writable.t b/t/v2writable.t
new file mode 100644
index 0000000..bc2437a
--- /dev/null
+++ b/t/v2writable.t
@@ -0,0 +1,84 @@
+# Copyright (C) 2018 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+use strict;
+use warnings;
+use Test::More;
+use PublicInbox::MIME;
+use PublicInbox::ContentId qw(content_digest);
+use File::Temp qw/tempdir/;
+foreach my $mod (qw(DBD::SQLite Search::Xapian)) {
+ eval "require $mod";
+ plan skip_all => "$mod missing for nntpd.t" if $@;
+}
+use_ok 'PublicInbox::V2Writable';
+my $mainrepo = tempdir('pi-v2writable-XXXXXX', TMPDIR => 1, CLEANUP => 1);
+my $ibx = {
+ mainrepo => $mainrepo,
+ name => 'test-v2writable',
+ version => 2,
+ -primary_address => 'test@example.com',
+};
+$ibx = PublicInbox::Inbox->new($ibx);
+my $mime = PublicInbox::MIME->create(
+ header => [
+ From => 'a@example.com',
+ To => 'test@example.com',
+ Subject => 'this is a subject',
+ 'Message-ID' => '<a-mid@b>',
+ Date => 'Fri, 02 Oct 1993 00:00:00 +0000',
+ ],
+ body => "hello world\n",
+);
+
+my $im = PublicInbox::V2Writable->new($ibx, 1);
+ok($im->add($mime), 'ordinary message added');
+{
+ my @warn;
+ local $SIG{__WARN__} = sub { push @warn, @_ };
+ is(undef, $im->add($mime), 'obvious duplicate rejected');
+ like(join(' ', @warn), qr/resent/, 'warned about resent message');
+
+ @warn = ();
+ $mime->header_set('Message-Id', '<a-mid@b>', '<c@d>');
+ ok($im->add($mime), 'secondary MID used');
+ like(join(' ', @warn), qr/mismatched/, 'warned about mismatch');
+ like(join(' ', @warn), qr/alternative/, 'warned about alternative');
+ is_deeply([ '<a-mid@b>', '<c@d>' ],
+ [ $mime->header_obj->header_raw('Message-Id') ],
+ 'no new Message-Id added');
+
+ @warn = ();
+ $mime->header_set('Message-Id', '<a-mid@b>');
+ $mime->body_set('different');
+ ok($im->add($mime), 'reused mid ok');
+ like(join(' ', @warn), qr/reused/, 'warned about reused MID');
+ my @mids = $mime->header_obj->header_raw('Message-Id');
+ is($mids[0], '<a-mid@b>', 'original mid not changed');
+ like($mids[1], qr/\A<\w+\@localhost>\z/, 'new MID added');
+ is(scalar(@mids), 2, 'only one new MID added');
+
+ @warn = ();
+ $mime->header_set('Message-Id', '<a-mid@b>');
+ $mime->body_set('this one needs a random mid');
+ my $gen = content_digest($mime)->hexdigest . '@localhost';
+ my $fake = PublicInbox::MIME->new($mime->as_string);
+ $fake->header_set('Message-Id', $gen);
+ ok($im->add($fake), 'fake added easily');
+ is_deeply(\@warn, [], 'no warnings from a faker');
+ ok($im->add($mime), 'random MID made');
+ like(join(' ', @warn), qr/using random/, 'warned about using random');
+ @mids = $mime->header_obj->header_raw('Message-Id');
+ is($mids[0], '<a-mid@b>', 'original mid not changed');
+ like($mids[1], qr/\A<\w+\@localhost>\z/, 'new MID added');
+ is(scalar(@mids), 2, 'only one new MID added');
+
+ @warn = ();
+ $mime->header_set('Message-Id');
+ ok($im->add($mime), 'random MID made for MID free message');
+ @mids = $mime->header_obj->header_raw('Message-Id');
+ like($mids[0], qr/\A<\w+\@localhost>\z/, 'mid was generated');
+ is(scalar(@mids), 1, 'new generated');
+}
+
+$im->done;
+done_testing();
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 13/34] search: revert to using 'Q' as a uniQue id per-Xapian conventions
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (11 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 12/34] v2writable: inject new Message-IDs on true duplicates Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 14/34] searchidx: support indexing multiple MIDs Eric Wong (Contractor, The Linux Foundation)
` (21 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
'Q' is merely a convention in the Xapian world, and is close
enough to unique for practical purposes, so stop using XMID
and gain a little more term length as a result.
---
lib/PublicInbox/ExtMsg.pm | 2 +-
lib/PublicInbox/Search.pm | 8 ++++----
lib/PublicInbox/SearchIdx.pm | 8 ++++----
lib/PublicInbox/SearchIdxSkeleton.pm | 2 +-
lib/PublicInbox/SearchMsg.pm | 2 +-
5 files changed, 11 insertions(+), 11 deletions(-)
diff --git a/lib/PublicInbox/ExtMsg.pm b/lib/PublicInbox/ExtMsg.pm
index 90d68db..f3076a3 100644
--- a/lib/PublicInbox/ExtMsg.pm
+++ b/lib/PublicInbox/ExtMsg.pm
@@ -46,7 +46,7 @@ sub ext_msg {
}
# try to find the URL with Xapian to avoid forking
- my $doc_id = eval { $s->find_first_doc_id('XMID' . $mid) };
+ my $doc_id = eval { $s->find_first_doc_id('Q' . $mid) };
if ($@) {
# xapian not configured properly for this repo
push @nox, $other;
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index c074410..74f406a 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -56,7 +56,7 @@ my %bool_pfx_internal = (
);
my %bool_pfx_external = (
- mid => 'XMID', # Message-ID (full/exact)
+ mid => 'Q', # Message-ID (full/exact), this is mostly uniQue
);
my %prob_prefix = (
@@ -333,7 +333,7 @@ sub lookup_skeleton {
my ($self, $mid) = @_;
my $skel = $self->{skel} or return lookup_message($self, $mid);
$mid = mid_clean($mid);
- my $term = 'XMID' . $mid;
+ my $term = 'Q' . $mid;
my $smsg;
my $beg = $skel->postlist_begin($term);
if ($beg != $skel->postlist_end($term)) {
@@ -352,7 +352,7 @@ sub lookup_message {
my ($self, $mid) = @_;
$mid = mid_clean($mid);
- my $doc_id = $self->find_first_doc_id('XMID' . $mid);
+ my $doc_id = $self->find_first_doc_id('Q' . $mid);
my $smsg;
if (defined $doc_id) {
# raises on error:
@@ -377,7 +377,7 @@ sub each_smsg_by_mid {
my $xdb = $self->{xdb};
# XXX retry_reopen isn't necessary for V2Writable, but the PSGI
# interface will need it...
- my ($head, $tail) = $self->find_doc_ids('XMID' . $mid);
+ my ($head, $tail) = $self->find_doc_ids('Q' . $mid);
for (; $head->nequal($tail); $head->inc) {
my $doc_id = $head->get_docid;
my $doc = $xdb->get_document($doc_id);
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 57aed75..61dc057 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -19,7 +19,7 @@ use POSIX qw(strftime);
require PublicInbox::Git;
use constant {
- MAX_MID_SIZE => 244, # max term size - 1 in Xapian
+ MAX_MID_SIZE => 244, # max term size (Xapian limitation) - length('Q')
PERM_UMASK => 0,
OLD_PERM_GROUP => 1,
OLD_PERM_EVERYBODY => 2,
@@ -302,7 +302,7 @@ sub add_message {
}
$smsg = PublicInbox::SearchMsg->new($mime);
my $doc = $smsg->{doc};
- $doc->add_term('XMID' . $mid);
+ $doc->add_term('Q' . $mid);
my $subj = $smsg->subject;
my $xpath;
@@ -404,7 +404,7 @@ sub remove_message {
$mid = mid_clean($mid);
eval {
- my ($head, $tail) = $self->find_doc_ids('XMID' . $mid);
+ my ($head, $tail) = $self->find_doc_ids('Q' . $mid);
if ($head->equal($tail)) {
warn "cannot remove non-existent <$mid>\n";
}
@@ -721,7 +721,7 @@ sub create_ghost {
my $tid = $self->next_thread_id;
my $doc = Search::Xapian::Document->new;
- $doc->add_term('XMID' . $mid);
+ $doc->add_term('Q' . $mid);
$doc->add_term('G' . $tid);
$doc->add_term('T' . 'ghost');
diff --git a/lib/PublicInbox/SearchIdxSkeleton.pm b/lib/PublicInbox/SearchIdxSkeleton.pm
index aa2713f..333f965 100644
--- a/lib/PublicInbox/SearchIdxSkeleton.pm
+++ b/lib/PublicInbox/SearchIdxSkeleton.pm
@@ -107,7 +107,7 @@ sub index_skeleton_real ($$) {
}
my $doc = $smsg->{doc};
$doc->add_term('XPATH' . $xpath) if defined $xpath;
- $doc->add_term('XMID' . $mid);
+ $doc->add_term('Q' . $mid);
PublicInbox::SearchIdx::add_values($doc, $values);
$doc->set_data($doc_data);
$smsg->{ts} = $ts;
diff --git a/lib/PublicInbox/SearchMsg.pm b/lib/PublicInbox/SearchMsg.pm
index 941bfd2..014f490 100644
--- a/lib/PublicInbox/SearchMsg.pm
+++ b/lib/PublicInbox/SearchMsg.pm
@@ -154,7 +154,7 @@ sub mid ($;$) {
} elsif (my $rv = $self->{mid}) {
$rv;
} else {
- $self->{mid} = _get_term_val($self, 'XMID', qr/\AXMID/) ||
+ $self->{mid} = _get_term_val($self, 'Q', qr/\AQ/) ||
$self->_extract_mid;
}
}
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 14/34] searchidx: support indexing multiple MIDs
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (12 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 13/34] search: revert to using 'Q' as a uniQue id per-Xapian conventions Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 15/34] mid: be strict with References, but loose on Message-Id Eric Wong (Contractor, The Linux Foundation)
` (20 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
It's possible to have a message handle multiple terms;
so use this feature to ensure messages with multiple MIDs
can be found by either one.
---
lib/PublicInbox/Search.pm | 1 -
lib/PublicInbox/SearchIdx.pm | 121 ++++++++++++++++++++++-------------
lib/PublicInbox/SearchIdxSkeleton.pm | 26 ++------
lib/PublicInbox/SearchMsg.pm | 7 ++
t/v2writable.t | 15 ++++-
5 files changed, 105 insertions(+), 65 deletions(-)
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 74f406a..fb7a126 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -373,7 +373,6 @@ sub lookup_mail { # no ghosts!
sub each_smsg_by_mid {
my ($self, $mid, $cb) = @_;
- $mid = mid_clean($mid);
my $xdb = $self->{xdb};
# XXX retry_reopen isn't necessary for V2Writable, but the PSGI
# interface will need it...
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 61dc057..1c10728 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -281,29 +281,19 @@ sub index_body ($$$) {
sub add_message {
my ($self, $mime, $bytes, $num, $blob) = @_; # mime = Email::MIME object
- my $db = $self->{xdb};
-
- my ($doc_id, $old_tid);
- my @mids = mid_mime($mime);
- if (@mids > 1) {
- warn "Multi-MID: ( ",join(' | ', @mids)," )\n";
- }
- my $mid = mid_clean($mids[0]);
+ my $doc_id;
+ my $mids = mids($mime->header_obj);
my $skel = $self->{skeleton};
eval {
- die 'Message-ID too long' if length($mid) > MAX_MID_SIZE;
- my $smsg = $self->lookup_message($mid);
- if ($smsg) {
- # convert a ghost to a regular message
- # it will also clobber any existing regular message
- $doc_id = $smsg->{doc_id};
- $old_tid = $smsg->thread_id unless $skel;
- }
- $smsg = PublicInbox::SearchMsg->new($mime);
+ my $smsg = PublicInbox::SearchMsg->new($mime);
my $doc = $smsg->{doc};
- $doc->add_term('Q' . $mid);
-
+ foreach my $mid (@$mids) {
+ # FIXME: may be abused to prevent archival
+ length($mid) > MAX_MID_SIZE and
+ die 'Message-ID too long';
+ $doc->add_term('Q' . $mid);
+ }
my $subj = $smsg->subject;
my $xpath;
if ($subj ne '') {
@@ -366,31 +356,30 @@ sub add_message {
# populates smsg->references for smsg->to_doc_data
my $refs = parse_references($smsg);
my $data = $smsg->to_doc_data($blob);
- if ($skel) {
- push @values, $mid, $xpath, $data;
- $skel->index_skeleton(\@values);
- } else {
- link_message($self, $smsg, $refs, $old_tid);
+ foreach my $mid (@$mids) {
+ $tg->index_text($mid, 1, 'XM');
}
- $tg->index_text($mid, 1, 'XM');
$doc->set_data($data);
-
if (my $altid = $self->{-altid}) {
foreach my $alt (@$altid) {
- my $id = $alt->mid2alt($mid);
- next unless defined $id;
- $doc->add_term($alt->{xprefix} . $id);
+ foreach my $mid (@$mids) {
+ my $id = $alt->mid2alt($mid);
+ next unless defined $id;
+ $doc->add_term($alt->{xprefix} . $id);
+ }
}
}
- if (defined $doc_id) {
- $db->replace_document($doc_id, $doc);
+ if ($skel) {
+ push @values, $mids, $xpath, $data;
+ $skel->index_skeleton(\@values);
+ $doc_id = $self->{xdb}->add_document($doc);
} else {
- $doc_id = $db->add_document($doc);
+ $doc_id = link_and_save($self, $doc, $mids, $refs);
}
};
if ($@) {
- warn "failed to index message <$mid>: $@\n";
+ warn "failed to index message <".join('> <',@$mids).">: $@\n";
return undef;
}
$doc_id;
@@ -467,27 +456,62 @@ sub parse_references ($) {
\@keep;
}
-sub link_message {
- my ($self, $smsg, $refs, $old_tid) = @_;
+sub link_doc {
+ my ($self, $doc, $refs, $old_tid) = @_;
my $tid;
if (@$refs) {
-
# first ref *should* be the thread root,
# but we can never trust clients to do the right thing
my $ref = shift @$refs;
- $tid = $self->_resolve_mid_to_tid($ref);
- $self->merge_threads($tid, $old_tid) if defined $old_tid;
+ $tid = resolve_mid_to_tid($self, $ref);
+ merge_threads($self, $tid, $old_tid) if defined $old_tid;
# the rest of the refs should point to this tid:
foreach $ref (@$refs) {
- my $ptid = $self->_resolve_mid_to_tid($ref);
+ my $ptid = resolve_mid_to_tid($self, $ref);
merge_threads($self, $tid, $ptid);
}
} else {
$tid = defined $old_tid ? $old_tid : $self->next_thread_id;
}
- $smsg->{doc}->add_term('G' . $tid);
+ $doc->add_term('G' . $tid);
+ $tid;
+}
+
+sub link_and_save {
+ my ($self, $doc, $mids, $refs) = @_;
+ my $db = $self->{xdb};
+ my $old_tid;
+ my $doc_id;
+ my $vivified = 0;
+ foreach my $mid (@$mids) {
+ $self->each_smsg_by_mid($mid, sub {
+ my ($cur) = @_;
+ my $type = $cur->type;
+ my $cur_tid = $cur->thread_id;
+ $old_tid = $cur_tid unless defined $old_tid;
+ if ($type eq 'mail') {
+ # do not break existing mail messages,
+ # just merge the threads
+ merge_threads($self, $old_tid, $cur_tid);
+ return 1;
+ }
+ if ($type ne 'ghost') {
+ die "<$mid> has a bad type: $type\n";
+ }
+ my $tid = link_doc($self, $doc, $refs, $old_tid);
+ $old_tid = $tid unless defined $old_tid;
+ $doc_id = $cur->{doc_id};
+ $self->{xdb}->replace_document($doc_id, $doc);
+ ++$vivified;
+ 1;
+ });
+ }
+ # not really important, but we return any vivified ghost docid, here:
+ return $doc_id if defined $doc_id;
+ link_doc($self, $doc, $refs, $old_tid);
+ $self->{xdb}->add_document($doc);
}
sub index_git_blob_id {
@@ -709,11 +733,22 @@ sub _index_sync {
}
# this will create a ghost as necessary
-sub _resolve_mid_to_tid {
+sub resolve_mid_to_tid {
my ($self, $mid) = @_;
+ my $tid;
+ $self->each_smsg_by_mid($mid, sub {
+ my ($smsg) = @_;
+ my $cur_tid = $smsg->thread_id;
+ if (defined $tid) {
+ merge_threads($self, $tid, $cur_tid);
+ } else {
+ $tid = $smsg->thread_id;
+ }
+ 1;
+ });
+ return $tid if defined $tid;
- my $smsg = $self->lookup_message($mid) || $self->create_ghost($mid);
- $smsg->thread_id;
+ $self->create_ghost($mid)->thread_id;
}
sub create_ghost {
diff --git a/lib/PublicInbox/SearchIdxSkeleton.pm b/lib/PublicInbox/SearchIdxSkeleton.pm
index 333f965..063c83e 100644
--- a/lib/PublicInbox/SearchIdxSkeleton.pm
+++ b/lib/PublicInbox/SearchIdxSkeleton.pm
@@ -92,34 +92,20 @@ sub index_skeleton_real ($$) {
my ($self, $values) = @_;
my $doc_data = pop @$values;
my $xpath = pop @$values;
- my $mid = pop @$values;
+ my $mids = pop @$values;
my $ts = $values->[PublicInbox::Search::TS];
- my $smsg = $self->lookup_message($mid);
- my ($old_tid, $doc_id);
- if ($smsg) {
- # convert a ghost to a regular message
- # it will also clobber any existing regular message
- $doc_id = $smsg->{doc_id};
- $old_tid = $smsg->thread_id;
- } else {
- $smsg = PublicInbox::SearchMsg->new(undef);
- $smsg->{mid} = $mid;
- }
+ my $smsg = PublicInbox::SearchMsg->new(undef);
my $doc = $smsg->{doc};
$doc->add_term('XPATH' . $xpath) if defined $xpath;
- $doc->add_term('Q' . $mid);
+ foreach my $mid (@$mids) {
+ $doc->add_term('Q' . $mid);
+ }
PublicInbox::SearchIdx::add_values($doc, $values);
$doc->set_data($doc_data);
$smsg->{ts} = $ts;
$smsg->load_from_data($doc_data);
my @refs = ($smsg->references =~ /<([^>]+)>/g);
- $self->link_message($smsg, \@refs, $old_tid);
- my $db = $self->{xdb};
- if (defined $doc_id) {
- $db->replace_document($doc_id, $doc);
- } else {
- $doc_id = $db->add_document($doc);
- }
+ $self->link_and_save($doc, $mids, \@refs);
}
1;
diff --git a/lib/PublicInbox/SearchMsg.pm b/lib/PublicInbox/SearchMsg.pm
index 014f490..a556534 100644
--- a/lib/PublicInbox/SearchMsg.pm
+++ b/lib/PublicInbox/SearchMsg.pm
@@ -176,4 +176,11 @@ sub path {
$self->{path} = _get_term_val($self, 'XPATH', qr/\AXPATH/); # path
}
+sub type {
+ my ($self) = @_;
+ my $type = $self->{type};
+ return $type if defined $type;
+ $self->{type} = _get_term_val($self, 'T', qr/\AT/);
+}
+
1;
diff --git a/t/v2writable.t b/t/v2writable.t
index bc2437a..44191c1 100644
--- a/t/v2writable.t
+++ b/t/v2writable.t
@@ -80,5 +80,18 @@ ok($im->add($mime), 'ordinary message added');
is(scalar(@mids), 1, 'new generated');
}
-$im->done;
+{
+ $mime->header_set('Message-Id', '<abcde@1>', '<abcde@2>');
+ ok($im->add($mime), 'message with multiple Message-ID');
+ $im->done;
+ my @found;
+ $ibx->search->each_smsg_by_mid('abcde@1', sub { push @found, @_; 1 });
+ is(scalar(@found), 1, 'message found by first MID');
+ $ibx->search->each_smsg_by_mid('abcde@2', sub { push @found, @_; 1 });
+ is(scalar(@found), 2, 'message found by second MID');
+ is($found[0]->{doc_id}, $found[1]->{doc_id}, 'same document');
+ ok($found[1]->{doc_id} > 0, 'doc_id is positive');
+}
+
+
done_testing();
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 15/34] mid: be strict with References, but loose on Message-Id
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (13 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 14/34] searchidx: support indexing multiple MIDs Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 16/34] searchidx: avoid excessive XNQ indexing with diffs Eric Wong (Contractor, The Linux Foundation)
` (19 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
Traditionally we've been more lax on parsing Message-Id
and allow it without the angle brackets. We've always been
strict on References and can't have it be pointlessly
large when some MUA decides to use HTML-escaped angle
brackets ("<", ">").
---
lib/PublicInbox/MID.pm | 45 +++++++++++++++++++++++++++++++--------------
1 file changed, 31 insertions(+), 14 deletions(-)
diff --git a/lib/PublicInbox/MID.pm b/lib/PublicInbox/MID.pm
index 4ccb704..9608539 100644
--- a/lib/PublicInbox/MID.pm
+++ b/lib/PublicInbox/MID.pm
@@ -49,16 +49,39 @@ sub mid2path {
sub mid_mime ($) { $_[0]->header_obj->header_raw('Message-ID') }
-sub uniq_mids {
- my ($hdr, @fields) = @_;
- my %seen;
- my @raw;
- foreach my $f (@fields) {
- push @raw, $hdr->header_raw($f);
+sub mids ($) {
+ my ($hdr) = @_;
+ my @mids;
+ my @v = $hdr->header_raw('Message-Id');
+ foreach my $v (@v) {
+ my @cur = ($v =~ /<([^>]+)>/sg);
+ if (@cur) {
+ push(@mids, @cur);
+ } else {
+ push(@mids, $v);
+ }
}
- my @mids = (join(' ', @raw) =~ /<([^>]+)>/g);
- my $mids = scalar(@mids) == 0 ? \@raw: \@mids;
+ uniq_mids(\@mids);
+}
+
+# last References should be IRT, but some mail clients do things
+# out of order, so trust IRT over References iff IRT exists
+sub references ($) {
+ my ($hdr) = @_;
+ my @mids;
+ foreach my $f (qw(References In-Reply-To)) {
+ my @v = $hdr->header_raw($f);
+ foreach my $v (@v) {
+ push(@mids, ($v =~ /<([^>]+)>/sg));
+ }
+ }
+ uniq_mids(\@mids);
+}
+
+sub uniq_mids ($) {
+ my ($mids) = @_;
my @ret;
+ my %seen;
foreach (@$mids) {
next if $seen{$_};
push @ret, $_;
@@ -67,12 +90,6 @@ sub uniq_mids {
\@ret;
}
-sub mids { uniq_mids($_[0], 'Message-Id') }
-
-# last References should be IRT, but some mail clients do things
-# out of order, so trust IRT over References iff IRT exists
-sub references { uniq_mids($_[0], 'References', 'In-Reply-To') }
-
# RFC3986, section 3.3:
sub MID_ESC () { '^A-Za-z0-9\-\._~!\$\&\';\(\)\*\+,;=:@' }
sub mid_escape ($) { uri_escape_utf8($_[0], MID_ESC) }
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 16/34] searchidx: avoid excessive XNQ indexing with diffs
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (14 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 15/34] mid: be strict with References, but loose on Message-Id Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 17/34] searchidxskeleton: add a note about locking Eric Wong (Contractor, The Linux Foundation)
` (18 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
When indexing diffs, we can avoid indexing the diff parts under
XNQ and instead combine the parts in the read-only search
interface. This results in better indexing performance and
10-15% smaller Xapian indices.
---
lib/PublicInbox/Search.pm | 9 +++---
lib/PublicInbox/SearchIdx.pm | 77 ++++++++++++++++++++++++++++----------------
2 files changed, 55 insertions(+), 31 deletions(-)
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index fb7a126..a1c423c 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -59,6 +59,7 @@ my %bool_pfx_external = (
mid => 'Q', # Message-ID (full/exact), this is mostly uniQue
);
+my $non_quoted_body = 'XNQ XDFN XDFA XDFB XDFHH XDFCTX XDFPRE XDFPOST';
my %prob_prefix = (
# for mairix compatibility
s => 'S',
@@ -69,12 +70,12 @@ my %prob_prefix = (
c => 'XCC',
tcf => 'XTO XCC A',
a => 'XTO XCC A',
- b => 'XNQ XQUOT',
- bs => 'XNQ XQUOT S',
+ b => $non_quoted_body . ' XQUOT',
+ bs => $non_quoted_body . ' XQUOT S',
n => 'XFN',
q => 'XQUOT',
- nq => 'XNQ',
+ nq => $non_quoted_body,
dfn => 'XDFN',
dfa => 'XDFA',
dfb => 'XDFB',
@@ -85,7 +86,7 @@ my %prob_prefix = (
dfblob => 'XDFPRE XDFPOST',
# default:
- '' => 'XM S A XNQ XQUOT XFN',
+ '' => 'XM S A XQUOT XFN ' . $non_quoted_body,
);
# not documenting m: and mid: for now, the using the URLs works w/o Xapian
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 1c10728..1bca3a6 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -175,14 +175,19 @@ sub index_users ($$) {
$tg->increase_termpos;
}
-sub index_text_inc ($$$) {
- my ($tg, $text, $pfx) = @_;
+sub index_diff_inc ($$$$) {
+ my ($tg, $text, $pfx, $xnq) = @_;
+ if (@$xnq) {
+ $tg->index_text(join("\n", @$xnq), 1, 'XNQ');
+ $tg->increase_termpos;
+ @$xnq = ();
+ }
$tg->index_text($text, 1, $pfx);
$tg->increase_termpos;
}
sub index_old_diff_fn {
- my ($tg, $seen, $fa, $fb) = @_;
+ my ($tg, $seen, $fa, $fb, $xnq) = @_;
# no renames or space support for traditional diffs,
# find the number of leading common paths to strip:
@@ -192,7 +197,9 @@ sub index_old_diff_fn {
$fa = join('/', @fa);
$fb = join('/', @fb);
if ($fa eq $fb) {
- index_text_inc($tg, $fa,'XDFN') unless $seen->{$fa}++;
+ unless ($seen->{$fa}++) {
+ index_diff_inc($tg, $fa, 'XDFN', $xnq);
+ }
return 1;
}
shift @fa;
@@ -205,40 +212,46 @@ sub index_diff ($$$) {
my ($tg, $lines, $doc) = @_;
my %seen;
my $in_diff;
+ my @xnq;
+ my $xnq = \@xnq;
foreach (@$lines) {
if ($in_diff && s/^ //) { # diff context
- index_text_inc($tg, $_, 'XDFCTX');
+ index_diff_inc($tg, $_, 'XDFCTX', $xnq);
} elsif (/^-- $/) { # email signature begins
$in_diff = undef;
} elsif (m!^diff --git ("?a/.+) ("?b/.+)\z!) {
my ($fa, $fb) = ($1, $2);
my $fn = (split('/', git_unquote($fa), 2))[1];
- index_text_inc($tg, $fn, 'XDFN') unless $seen{$fn}++;
+ $seen{$fn}++ or index_diff_inc($tg, $fn, 'XDFN', $xnq);
$fn = (split('/', git_unquote($fb), 2))[1];
- index_text_inc($tg, $fn, 'XDFN') unless $seen{$fn}++;
+ $seen{$fn}++ or index_diff_inc($tg, $fn, 'XDFN', $xnq);
$in_diff = 1;
# traditional diff:
} elsif (m/^diff -(.+) (\S+) (\S+)$/) {
my ($opt, $fa, $fb) = ($1, $2, $3);
+ push @xnq, $_;
# only support unified:
next unless $opt =~ /[uU]/;
- $in_diff = index_old_diff_fn($tg, \%seen, $fa, $fb);
+ $in_diff = index_old_diff_fn($tg, \%seen, $fa, $fb,
+ $xnq);
} elsif (m!^--- ("?a/.+)!) {
my $fn = (split('/', git_unquote($1), 2))[1];
- index_text_inc($tg, $fn, 'XDFN') unless $seen{$fn}++;
+ $seen{$fn}++ or index_diff_inc($tg, $fn, 'XDFN', $xnq);
$in_diff = 1;
} elsif (m!^\+\+\+ ("?b/.+)!) {
my $fn = (split('/', git_unquote($1), 2))[1];
- index_text_inc($tg, $fn, 'XDFN') unless $seen{$fn}++;
+ $seen{$fn}++ or index_diff_inc($tg, $fn, 'XDFN', $xnq);
$in_diff = 1;
} elsif (/^--- (\S+)/) {
$in_diff = $1;
+ push @xnq, $_;
} elsif (defined $in_diff && /^\+\+\+ (\S+)/) {
- $in_diff = index_old_diff_fn($tg, \%seen, $in_diff, $1);
+ $in_diff = index_old_diff_fn($tg, \%seen, $in_diff, $1,
+ $xnq);
} elsif ($in_diff && s/^\+//) { # diff added
- index_text_inc($tg, $_, 'XDFB');
+ index_diff_inc($tg, $_, 'XDFB', $xnq);
} elsif ($in_diff && s/^-//) { # diff removed
- index_text_inc($tg, $_, 'XDFA');
+ index_diff_inc($tg, $_, 'XDFA', $xnq);
} elsif (m!^index ([a-f0-9]+)\.\.([a-f0-9]+)!) {
my ($ba, $bb) = ($1, $2);
index_git_blob_id($doc, 'XDFPRE', $ba);
@@ -248,34 +261,44 @@ sub index_diff ($$$) {
# traditional diff w/o -p
} elsif (/^@@ (?:\S+) (?:\S+) @@\s*(\S+.*)$/) {
# hunk header context
- index_text_inc($tg, $1, 'XDFHH');
+ index_diff_inc($tg, $1, 'XDFHH', $xnq);
# ignore the following lines:
- } elsif (/^(?:dis)similarity index/) {
- } elsif (/^(?:old|new) mode/) {
- } elsif (/^(?:deleted|new) file mode/) {
- } elsif (/^(?:copy|rename) (?:from|to) /) {
- } elsif (/^(?:dis)?similarity index /) {
- } elsif (/^\\ No newline at end of file/) {
- } elsif (/^Binary files .* differ/) {
+ } elsif (/^(?:dis)similarity index/ ||
+ /^(?:old|new) mode/ ||
+ /^(?:deleted|new) file mode/ ||
+ /^(?:copy|rename) (?:from|to) / ||
+ /^(?:dis)?similarity index / ||
+ /^\\ No newline at end of file/ ||
+ /^Binary files .* differ/) {
+ push @xnq, $_;
} elsif ($_ eq '') {
$in_diff = undef;
} else {
+ push @xnq, $_;
warn "non-diff line: $_\n" if DEBUG && $_ ne '';
$in_diff = undef;
}
}
+
+ $tg->index_text(join("\n", @xnq), 1, 'XNQ');
+ $tg->increase_termpos;
}
sub index_body ($$$) {
my ($tg, $lines, $doc) = @_;
my $txt = join("\n", @$lines);
- $tg->index_text($txt, !!$doc, $doc ? 'XNQ' : 'XQUOT');
- $tg->increase_termpos;
- # does it look like a diff?
- if ($doc && $txt =~ /^(?:diff|---|\+\+\+) /ms) {
- $txt = undef;
- index_diff($tg, $lines, $doc);
+ if ($doc) {
+ # does it look like a diff?
+ if ($txt =~ /^(?:diff|---|\+\+\+) /ms) {
+ $txt = undef;
+ index_diff($tg, $lines, $doc);
+ } else {
+ $tg->index_text($txt, 1, 'XNQ');
+ }
+ } else {
+ $tg->index_text($txt, 0, 'XQUOT');
}
+ $tg->increase_termpos;
@$lines = ();
}
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 17/34] searchidxskeleton: add a note about locking
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (15 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 16/34] searchidx: avoid excessive XNQ indexing with diffs Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 18/34] v2writable: generated Message-ID goes first Eric Wong (Contractor, The Linux Foundation)
` (17 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
It's tempting to rely on the atomicity of smaller-than-PIPE_BUF
writes, but it doesn't work if mixed with larger ones.
---
lib/PublicInbox/SearchIdxSkeleton.pm | 2 ++
1 file changed, 2 insertions(+)
diff --git a/lib/PublicInbox/SearchIdxSkeleton.pm b/lib/PublicInbox/SearchIdxSkeleton.pm
index 063c83e..506e566 100644
--- a/lib/PublicInbox/SearchIdxSkeleton.pm
+++ b/lib/PublicInbox/SearchIdxSkeleton.pm
@@ -80,6 +80,8 @@ sub index_skeleton {
$str = length($str) . "\n" . $str;
# multiple processes write to the same pipe, so use flock
+ # We can't avoid this lock for <=PIPE_BUF writes, either,
+ # because those atomic writes can break up >PIPE_BUF ones
$self->_lock_acquire;
print $w $str or $err = $!;
$self->_lock_release;
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 18/34] v2writable: generated Message-ID goes first
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (16 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 17/34] searchidxskeleton: add a note about locking Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 19/34] searchidx: use add_boolean_term for internal terms Eric Wong (Contractor, The Linux Foundation)
` (16 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
This is to make SearchMsg behave more sanely under NNTP.
---
lib/PublicInbox/V2Writable.pm | 2 +-
t/v2writable.t | 8 ++++----
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 6d73827..c73d859 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -142,7 +142,7 @@ sub num_for_harder {
}
}
my @cur = $hdr->header_raw('Message-Id');
- $hdr->header_set('Message-Id', @cur, "<$mid>");
+ $hdr->header_set('Message-Id', "<$mid>", @cur);
$num;
}
diff --git a/t/v2writable.t b/t/v2writable.t
index 44191c1..f95b2e7 100644
--- a/t/v2writable.t
+++ b/t/v2writable.t
@@ -53,8 +53,8 @@ ok($im->add($mime), 'ordinary message added');
ok($im->add($mime), 'reused mid ok');
like(join(' ', @warn), qr/reused/, 'warned about reused MID');
my @mids = $mime->header_obj->header_raw('Message-Id');
- is($mids[0], '<a-mid@b>', 'original mid not changed');
- like($mids[1], qr/\A<\w+\@localhost>\z/, 'new MID added');
+ is($mids[1], '<a-mid@b>', 'original mid not changed');
+ like($mids[0], qr/\A<\w+\@localhost>\z/, 'new MID added');
is(scalar(@mids), 2, 'only one new MID added');
@warn = ();
@@ -68,8 +68,8 @@ ok($im->add($mime), 'ordinary message added');
ok($im->add($mime), 'random MID made');
like(join(' ', @warn), qr/using random/, 'warned about using random');
@mids = $mime->header_obj->header_raw('Message-Id');
- is($mids[0], '<a-mid@b>', 'original mid not changed');
- like($mids[1], qr/\A<\w+\@localhost>\z/, 'new MID added');
+ is($mids[1], '<a-mid@b>', 'original mid not changed');
+ like($mids[0], qr/\A<\w+\@localhost>\z/, 'new MID added');
is(scalar(@mids), 2, 'only one new MID added');
@warn = ();
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 19/34] searchidx: use add_boolean_term for internal terms
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (17 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 18/34] v2writable: generated Message-ID goes first Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 20/34] searchidx: add NNTP article number as a searchable term Eric Wong (Contractor, The Linux Foundation)
` (15 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
Aside from the Message-Id ('Q'), these terms do not appear in
content and thus have no business contributing to the Xapian
document length.
Thanks-to Olly Betts for the tip on xapian-discuss
<20180228004400.GU12724@survex.com>
---
lib/PublicInbox/SearchIdx.pm | 15 ++++++++-------
lib/PublicInbox/SearchIdxSkeleton.pm | 2 +-
lib/PublicInbox/SearchMsg.pm | 2 +-
3 files changed, 10 insertions(+), 9 deletions(-)
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 1bca3a6..f63e072 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -322,7 +322,7 @@ sub add_message {
if ($subj ne '') {
$xpath = $self->subject_path($subj);
$xpath = id_compress($xpath);
- $doc->add_term('XPATH' . $xpath);
+ $doc->add_boolean_term('XPATH' . $xpath);
}
my $lines = $mime->body_raw =~ tr!\n!\n!;
@@ -385,10 +385,11 @@ sub add_message {
$doc->set_data($data);
if (my $altid = $self->{-altid}) {
foreach my $alt (@$altid) {
+ my $pfx = $alt->{xprefix};
foreach my $mid (@$mids) {
my $id = $alt->mid2alt($mid);
next unless defined $id;
- $doc->add_term($alt->{xprefix} . $id);
+ $doc->add_boolean_term($pfx . $id);
}
}
}
@@ -498,7 +499,7 @@ sub link_doc {
} else {
$tid = defined $old_tid ? $old_tid : $self->next_thread_id;
}
- $doc->add_term('G' . $tid);
+ $doc->add_boolean_term('G' . $tid);
$tid;
}
@@ -779,9 +780,9 @@ sub create_ghost {
my $tid = $self->next_thread_id;
my $doc = Search::Xapian::Document->new;
- $doc->add_term('Q' . $mid);
- $doc->add_term('G' . $tid);
- $doc->add_term('T' . 'ghost');
+ $doc->add_boolean_term('Q' . $mid);
+ $doc->add_boolean_term('G' . $tid);
+ $doc->add_boolean_term('T' . 'ghost');
my $smsg = PublicInbox::SearchMsg->wrap($doc, $mid);
$self->{xdb}->add_document($doc);
@@ -805,7 +806,7 @@ sub merge_threads {
foreach my $docid (@ids) {
my $doc = $db->get_document($docid);
$doc->remove_term('G' . $loser_tid);
- $doc->add_term('G' . $winner_tid);
+ $doc->add_boolean_term('G' . $winner_tid);
$db->replace_document($docid, $doc);
}
}
diff --git a/lib/PublicInbox/SearchIdxSkeleton.pm b/lib/PublicInbox/SearchIdxSkeleton.pm
index 506e566..3fe6a4a 100644
--- a/lib/PublicInbox/SearchIdxSkeleton.pm
+++ b/lib/PublicInbox/SearchIdxSkeleton.pm
@@ -98,7 +98,7 @@ sub index_skeleton_real ($$) {
my $ts = $values->[PublicInbox::Search::TS];
my $smsg = PublicInbox::SearchMsg->new(undef);
my $doc = $smsg->{doc};
- $doc->add_term('XPATH' . $xpath) if defined $xpath;
+ $doc->add_boolean_term('XPATH' . $xpath) if defined $xpath;
foreach my $mid (@$mids) {
$doc->add_term('Q' . $mid);
}
diff --git a/lib/PublicInbox/SearchMsg.pm b/lib/PublicInbox/SearchMsg.pm
index a556534..93e6fd8 100644
--- a/lib/PublicInbox/SearchMsg.pm
+++ b/lib/PublicInbox/SearchMsg.pm
@@ -14,7 +14,7 @@ use PublicInbox::Address;
sub new {
my ($class, $mime) = @_;
my $doc = Search::Xapian::Document->new;
- $doc->add_term('T' . 'mail');
+ $doc->add_boolean_term('T' . 'mail');
bless { type => 'mail', doc => $doc, mime => $mime }, $class;
}
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 20/34] searchidx: add NNTP article number as a searchable term
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (18 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 19/34] searchidx: use add_boolean_term for internal terms Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 21/34] mid: truncate excessively long MIDs early Eric Wong (Contractor, The Linux Foundation)
` (14 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
Since we support duplicate MIDs in v2, the NNTP article number
becomes the true unique identifier and we want a way to do fast
lookups on it.
While we're at it, stop putting XPATH in the term partitions
since we only need it in the skeleton DB.
---
lib/PublicInbox/SearchIdx.pm | 8 +++++---
lib/PublicInbox/SearchIdxSkeleton.pm | 4 ++--
2 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index f63e072..3ef444d 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -322,7 +322,6 @@ sub add_message {
if ($subj ne '') {
$xpath = $self->subject_path($subj);
$xpath = id_compress($xpath);
- $doc->add_boolean_term('XPATH' . $xpath);
}
my $lines = $mime->body_raw =~ tr!\n!\n!;
@@ -398,7 +397,8 @@ sub add_message {
$skel->index_skeleton(\@values);
$doc_id = $self->{xdb}->add_document($doc);
} else {
- $doc_id = link_and_save($self, $doc, $mids, $refs);
+ $doc_id = link_and_save($self, $doc, $mids, $refs,
+ $num, $xpath);
}
};
@@ -504,10 +504,12 @@ sub link_doc {
}
sub link_and_save {
- my ($self, $doc, $mids, $refs) = @_;
+ my ($self, $doc, $mids, $refs, $num, $xpath) = @_;
my $db = $self->{xdb};
my $old_tid;
my $doc_id;
+ $doc->add_boolean_term('XNUM' . $num) if defined $num;
+ $doc->add_boolean_term('XPATH' . $xpath) if defined $xpath;
my $vivified = 0;
foreach my $mid (@$mids) {
$self->each_smsg_by_mid($mid, sub {
diff --git a/lib/PublicInbox/SearchIdxSkeleton.pm b/lib/PublicInbox/SearchIdxSkeleton.pm
index 3fe6a4a..4066b59 100644
--- a/lib/PublicInbox/SearchIdxSkeleton.pm
+++ b/lib/PublicInbox/SearchIdxSkeleton.pm
@@ -98,7 +98,6 @@ sub index_skeleton_real ($$) {
my $ts = $values->[PublicInbox::Search::TS];
my $smsg = PublicInbox::SearchMsg->new(undef);
my $doc = $smsg->{doc};
- $doc->add_boolean_term('XPATH' . $xpath) if defined $xpath;
foreach my $mid (@$mids) {
$doc->add_term('Q' . $mid);
}
@@ -106,8 +105,9 @@ sub index_skeleton_real ($$) {
$doc->set_data($doc_data);
$smsg->{ts} = $ts;
$smsg->load_from_data($doc_data);
+ my $num = $values->[PublicInbox::Search::NUM];
my @refs = ($smsg->references =~ /<([^>]+)>/g);
- $self->link_and_save($doc, $mids, \@refs);
+ $self->link_and_save($doc, $mids, \@refs, $num, $xpath);
}
1;
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 21/34] mid: truncate excessively long MIDs early
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (19 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 20/34] searchidx: add NNTP article number as a searchable term Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 22/34] nntp: use NNTP article numbers for lookups Eric Wong (Contractor, The Linux Foundation)
` (13 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
Since we support duplicate MIDs in v2, we can safely truncate
long MID terms in the database and let other normal duplicate
resolution sort it out. It seems only spammers use excessively
long MIDs, and there'll always be abuse/misuse vectors for causing
mis-threaded messages, so it's not worth worrying about
excessively long MIDs.
---
lib/PublicInbox/MID.pm | 11 ++++++++++-
lib/PublicInbox/SearchIdx.pm | 15 ++++++---------
lib/PublicInbox/SearchIdxSkeleton.pm | 3 ---
3 files changed, 16 insertions(+), 13 deletions(-)
diff --git a/lib/PublicInbox/MID.pm b/lib/PublicInbox/MID.pm
index 9608539..422902f 100644
--- a/lib/PublicInbox/MID.pm
+++ b/lib/PublicInbox/MID.pm
@@ -10,7 +10,10 @@ our @EXPORT_OK = qw/mid_clean id_compress mid2path mid_mime mid_escape MID_ESC
mids references/;
use URI::Escape qw(uri_escape_utf8);
use Digest::SHA qw/sha1_hex/;
-use constant MID_MAX => 40; # SHA-1 hex length
+use constant {
+ MID_MAX => 40, # SHA-1 hex length # TODO: get rid of this
+ MAX_MID_SIZE => 244, # max term size (Xapian limitation) - length('Q')
+};
sub mid_clean {
my ($mid) = @_;
@@ -61,6 +64,12 @@ sub mids ($) {
push(@mids, $v);
}
}
+ foreach my $i (0..$#mids) {
+ next if length($mids[$i]) <= MAX_MID_SIZE;
+ warn "Message-ID: <$mids[$i]> too long, truncating\n";
+ $mids[$i] = substr($mids[$i], 0, MAX_MID_SIZE);
+ }
+
uniq_mids(\@mids);
}
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 3ef444d..a70e1eb 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -19,7 +19,6 @@ use POSIX qw(strftime);
require PublicInbox::Git;
use constant {
- MAX_MID_SIZE => 244, # max term size (Xapian limitation) - length('Q')
PERM_UMASK => 0,
OLD_PERM_GROUP => 1,
OLD_PERM_EVERYBODY => 2,
@@ -311,12 +310,6 @@ sub add_message {
eval {
my $smsg = PublicInbox::SearchMsg->new($mime);
my $doc = $smsg->{doc};
- foreach my $mid (@$mids) {
- # FIXME: may be abused to prevent archival
- length($mid) > MAX_MID_SIZE and
- die 'Message-ID too long';
- $doc->add_term('Q' . $mid);
- }
my $subj = $smsg->subject;
my $xpath;
if ($subj ne '') {
@@ -392,9 +385,11 @@ sub add_message {
}
}
}
+
if ($skel) {
push @values, $mids, $xpath, $data;
$skel->index_skeleton(\@values);
+ $doc->add_boolean_term('Q' . $_) foreach @$mids;
$doc_id = $self->{xdb}->add_document($doc);
} else {
$doc_id = link_and_save($self, $doc, $mids, $refs,
@@ -469,9 +464,9 @@ sub parse_references ($) {
my %mids = map { $_ => 1 } @{mids($hdr)};
my @keep;
foreach my $ref (@$refs) {
- # FIXME: this is an archive-prevention vector like X-No-Archive
- if (length($ref) > MAX_MID_SIZE) {
+ if (length($ref) > PublicInbox::MID::MAX_MID_SIZE) {
warn "References: <$ref> too long, ignoring\n";
+ next;
}
next if $mids{$ref};
push @keep, $ref;
@@ -510,6 +505,8 @@ sub link_and_save {
my $doc_id;
$doc->add_boolean_term('XNUM' . $num) if defined $num;
$doc->add_boolean_term('XPATH' . $xpath) if defined $xpath;
+ $doc->add_boolean_term('Q' . $_) foreach @$mids;
+
my $vivified = 0;
foreach my $mid (@$mids) {
$self->each_smsg_by_mid($mid, sub {
diff --git a/lib/PublicInbox/SearchIdxSkeleton.pm b/lib/PublicInbox/SearchIdxSkeleton.pm
index 4066b59..40b28c5 100644
--- a/lib/PublicInbox/SearchIdxSkeleton.pm
+++ b/lib/PublicInbox/SearchIdxSkeleton.pm
@@ -98,9 +98,6 @@ sub index_skeleton_real ($$) {
my $ts = $values->[PublicInbox::Search::TS];
my $smsg = PublicInbox::SearchMsg->new(undef);
my $doc = $smsg->{doc};
- foreach my $mid (@$mids) {
- $doc->add_term('Q' . $mid);
- }
PublicInbox::SearchIdx::add_values($doc, $values);
$doc->set_data($doc_data);
$smsg->{ts} = $ts;
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 22/34] nntp: use NNTP article numbers for lookups
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (20 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 21/34] mid: truncate excessively long MIDs early Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 23/34] nntp: fix NEWNEWS command Eric Wong (Contractor, The Linux Foundation)
` (12 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
Since Message-IDs are no longer unique within Xapian
(but are within the SQLite Msgmap); favor NNTP article
numbers for internal lookups. This will prevent us
from finding the "wrong" internal Message-ID.
---
lib/PublicInbox/NNTP.pm | 29 ++++++++++++++---------------
lib/PublicInbox/Search.pm | 21 +++++++++++++++++++++
2 files changed, 35 insertions(+), 15 deletions(-)
diff --git a/lib/PublicInbox/NNTP.pm b/lib/PublicInbox/NNTP.pm
index 56d8e01..895e502 100644
--- a/lib/PublicInbox/NNTP.pm
+++ b/lib/PublicInbox/NNTP.pm
@@ -463,18 +463,16 @@ find_mid:
defined $mid or return $err;
}
found:
- my $bytes;
- my $s = eval { $ng->msg_by_mid($mid, \$bytes) } or return $err;
- $s = Email::Simple->new($s);
- my $lines;
+ my $smsg = $ng->search->lookup_article($n) or return $err;
+ my $msg = $ng->msg_by_smsg($smsg) or return $err;
+ my $s = Email::Simple->new($msg);
if ($set_headers) {
set_nntp_headers($s->header_obj, $ng, $n, $mid);
- $lines = $s->body =~ tr!\n!\n!;
# must be last
$s->body_set('') if ($set_headers == 2);
}
- [ $n, $mid, $s, $bytes, $lines, $ng ];
+ [ $n, $mid, $s, $smsg->bytes, $smsg->lines, $ng ];
}
sub simple_body_write ($$) {
@@ -693,8 +691,8 @@ sub hdr_xref ($$$) { # optimize XHDR Xref [range] for rtin
}
sub search_header_for {
- my ($srch, $mid, $field) = @_;
- my $smsg = $srch->lookup_mail($mid) or return;
+ my ($srch, $num, $field) = @_;
+ my $smsg = $srch->lookup_article($num) or return;
$smsg->$field;
}
@@ -702,8 +700,8 @@ sub hdr_searchmsg ($$$$) {
my ($self, $xhdr, $field, $range) = @_;
if (defined $range && $range =~ /\A<(.+)>\z/) { # Message-ID
my ($ng, $n) = mid_lookup($self, $1);
- return r430 unless $n;
- my $v = search_header_for($ng->search, $range, $field);
+ return r430 unless defined $n;
+ my $v = search_header_for($ng->search, $n, $field);
hdr_mid_response($self, $xhdr, $ng, $n, $range, $v);
} else { # numeric range
$range = $self->{article} unless defined $range;
@@ -803,9 +801,10 @@ sub cmd_xrover ($;$) {
more($self, '224 Overview information follows');
long_response($self, $beg, $end, sub {
my ($i) = @_;
- my $mid = $mm->mid_for($$i) or return;
- my $h = search_header_for($srch, $mid, 'references');
- more($self, "$$i $h");
+ my $num = $$i;
+ my $h = search_header_for($srch, $num, 'references');
+ defined $h or return;
+ more($self, "$num $h");
});
}
@@ -829,8 +828,8 @@ sub cmd_over ($;$) {
my ($self, $range) = @_;
if ($range && $range =~ /\A<(.+)>\z/) {
my ($ng, $n) = mid_lookup($self, $1);
- my $smsg = $ng->search->lookup_mail($range) or
- return '430 No article with that message-id';
+ defined $n or return r430;
+ my $smsg = $ng->search->lookup_article($n) or return r430;
more($self, '224 Overview information follows (multi-line)');
# Only set article number column if it's the current group
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index a1c423c..802984b 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -372,6 +372,27 @@ sub lookup_mail { # no ghosts!
});
}
+sub lookup_article {
+ my ($self, $num) = @_;
+ my $term = 'XNUM'.$num;
+ my $smsg;
+ eval {
+ retry_reopen($self, sub {
+ my $db = $self->{skel} || $self->{xdb};
+ my $head = $db->postlist_begin($term);
+ return if $head == $db->postlist_end($term);
+ my $doc_id = $head->get_docid;
+ return unless defined $doc_id;
+ # raises on error:
+ my $doc = $db->get_document($doc_id);
+ $smsg = PublicInbox::SearchMsg->wrap($doc);
+ $smsg->load_expand;
+ $smsg->{doc_id} = $doc_id;
+ });
+ };
+ $smsg;
+}
+
sub each_smsg_by_mid {
my ($self, $mid, $cb) = @_;
my $xdb = $self->{xdb};
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 23/34] nntp: fix NEWNEWS command
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (21 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 22/34] nntp: use NNTP article numbers for lookups Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 24/34] searchidx: store the primary MID in doc data for NNTP Eric Wong (Contractor, The Linux Foundation)
` (11 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
I guess nobody uses this command (slrnpull does not), and
the breakage was not noticed until I started writing new
tests for multi-MID handling.
Fixes: 3fc411c772a21d8f ("search: drop pointless range processors for Unix timestamp")
---
lib/PublicInbox/NNTP.pm | 2 +-
lib/PublicInbox/Search.pm | 14 ++++++++++++++
t/nntpd.t | 2 ++
3 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/lib/PublicInbox/NNTP.pm b/lib/PublicInbox/NNTP.pm
index 895e502..fb65ddc 100644
--- a/lib/PublicInbox/NNTP.pm
+++ b/lib/PublicInbox/NNTP.pm
@@ -336,7 +336,7 @@ sub cmd_newnews ($$$$;$$) {
long_response($self, 0, long_response_limit, sub {
my ($i) = @_;
my $srch = $srch[0];
- my $res = $srch->query($ts, $opts);
+ my $res = $srch->query_ts($ts, $opts);
my $msgs = $res->{msgs};
if (my $nr = scalar @$msgs) {
more($self, '<' .
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 802984b..4dc2747 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -330,6 +330,20 @@ sub query_xover {
_do_enquire($self, $query, $opts);
}
+sub query_ts {
+ my ($self, $ts, $opts) = @_;
+ my $qp = $self->{qp_ts} ||= eval {
+ my $q = Search::Xapian::QueryParser->new;
+ $q->set_database($self->{skel} || $self->{xdb});
+ $q->add_valuerangeprocessor(
+ Search::Xapian::NumberValueRangeProcessor->new(TS));
+ $q
+ };
+ my $query = $qp->parse_query($ts, QP_FLAGS);
+ $opts->{enquire} = enquire_skel($self);
+ _do_enquire($self, $query, $opts);
+}
+
sub lookup_skeleton {
my ($self, $mid) = @_;
my $skel = $self->{skel} or return lookup_message($self, $mid);
diff --git a/t/nntpd.t b/t/nntpd.t
index ea0d293..de781d7 100644
--- a/t/nntpd.t
+++ b/t/nntpd.t
@@ -228,6 +228,8 @@ EOF
is_deeply($n->xhdr(qw(list-id 1-)), {},
'XHDR on invalid header returns empty');
+ my $mids = $n->newnews(0, '*');
+ is_deeply($mids, ['<nntp@example.com>'], 'NEWNEWS works');
{
my $t0 = time;
my $date = $n->date;
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 24/34] searchidx: store the primary MID in doc data for NNTP
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (22 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 23/34] nntp: fix NEWNEWS command Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 25/34] import: consolidate object info for v2 imports Eric Wong (Contractor, The Linux Foundation)
` (10 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
We can't rely on header order for Message-ID after all
since we fall back to existing MIDs if they exist and
are unseen. This lets us use SearchMsg->mid to get the
MID we associated with the NNTP article number to ensure
all NNTP article lookups roundtrip correctly.
---
lib/PublicInbox/SearchIdx.pm | 6 ++--
lib/PublicInbox/SearchIdxPart.pm | 8 ++---
lib/PublicInbox/SearchMsg.pm | 9 ++---
lib/PublicInbox/V2Writable.pm | 34 +++++++++++--------
t/search-thr-index.t | 2 +-
t/v2writable.t | 73 ++++++++++++++++++++++++++++++++++++++++
6 files changed, 107 insertions(+), 25 deletions(-)
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index a70e1eb..71469a9 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -302,7 +302,8 @@ sub index_body ($$$) {
}
sub add_message {
- my ($self, $mime, $bytes, $num, $blob) = @_; # mime = Email::MIME object
+ # mime = Email::MIME object
+ my ($self, $mime, $bytes, $num, $oid, $mid0) = @_;
my $doc_id;
my $mids = mids($mime->header_obj);
my $skel = $self->{skeleton};
@@ -370,7 +371,8 @@ sub add_message {
# populates smsg->references for smsg->to_doc_data
my $refs = parse_references($smsg);
- my $data = $smsg->to_doc_data($blob);
+ $mid0 = $mids->[0] unless defined $mid0;
+ my $data = $smsg->to_doc_data($oid, $mid0);
foreach my $mid (@$mids) {
$tg->index_text($mid, 1, 'XM');
}
diff --git a/lib/PublicInbox/SearchIdxPart.pm b/lib/PublicInbox/SearchIdxPart.pm
index 2f577ec..6d8cb2a 100644
--- a/lib/PublicInbox/SearchIdxPart.pm
+++ b/lib/PublicInbox/SearchIdxPart.pm
@@ -51,7 +51,7 @@ sub partition_worker_loop ($$$) {
$xdb = $txn = undef;
} else {
chomp $line;
- my ($len, $artnum, $object_id) = split(/ /, $line);
+ my ($len, $artnum, $oid, $mid0) = split(/ /, $line);
$xdb ||= $self->_xdb_acquire;
if (!$txn) {
$xdb->begin_transaction;
@@ -61,7 +61,7 @@ sub partition_worker_loop ($$$) {
$n == $len or die "short read: $n != $len\n";
my $mime = PublicInbox::MIME->new(\$msg);
$artnum = int($artnum);
- $self->add_message($mime, $n, $artnum, $object_id);
+ $self->add_message($mime, $n, $artnum, $oid, $mid0);
}
}
warn "$$ still in transaction\n" if $txn;
@@ -70,9 +70,9 @@ sub partition_worker_loop ($$$) {
# called by V2Writable
sub index_raw {
- my ($self, $len, $msgref, $artnum, $object_id) = @_;
+ my ($self, $len, $msgref, $artnum, $object_id, $mid0) = @_;
my $w = $self->{w};
- print $w "$len $artnum $object_id\n", $$msgref or die
+ print $w "$len $artnum $object_id $mid0\n", $$msgref or die
"failed to write partition $!\n";
$w->flush or die "failed to flush: $!\n";
}
diff --git a/lib/PublicInbox/SearchMsg.pm b/lib/PublicInbox/SearchMsg.pm
index 93e6fd8..a62a649 100644
--- a/lib/PublicInbox/SearchMsg.pm
+++ b/lib/PublicInbox/SearchMsg.pm
@@ -31,13 +31,14 @@ sub get_val ($$) {
sub load_from_data ($$) {
my ($self) = $_[0]; # data = $_[1]
- my ($subj, $from, $refs, $to, $cc, $blob) = split(/\n/, $_[1]);
+ my ($subj, $from, $refs, $to, $cc, $blob, $mid0) = split(/\n/, $_[1]);
$self->{subject} = $subj;
$self->{from} = $from;
$self->{references} = $refs;
$self->{to} = $to;
$self->{cc} = $cc;
$self->{blob} = $blob;
+ $self->{mid} = $mid0;
}
sub load_expand {
@@ -120,11 +121,11 @@ sub ts {
}
sub to_doc_data {
- my ($self, $blob) = @_;
+ my ($self, $oid, $mid0) = @_;
my @rows = ($self->subject, $self->from, $self->references,
$self->to, $self->cc);
- push @rows, $blob if defined $blob;
- join("\n", @rows);
+ $oid = '' unless defined $oid;
+ join("\n", @rows, $oid, $mid0);
}
sub references {
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index c73d859..caabc8e 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -62,8 +62,10 @@ sub add {
# leaking FDs to it...
$self->idx_init;
- my $num = num_for($self, $mime);
+ my $mid0;
+ my $num = num_for($self, $mime, \$mid0);
defined $num or return; # duplicate
+ defined $mid0 or die "BUG: $mid0 undefined\n";
my $im = $self->importer;
my $cmt = $im->add($mime);
$cmt = $im->get_mark($cmt);
@@ -73,7 +75,7 @@ sub add {
my $nparts = $self->{partitions};
my $part = $num % $nparts;
my $idx = $self->idx_part($part);
- $idx->index_raw($len, $msgref, $num, $oid);
+ $idx->index_raw($len, $msgref, $num, $oid, $mid0);
my $n = $self->{transact_bytes} += $len;
if ($n > (PublicInbox::SearchIdx::BATCH_BYTES * $nparts)) {
$self->checkpoint;
@@ -83,12 +85,15 @@ sub add {
}
sub num_for {
- my ($self, $mime) = @_;
+ my ($self, $mime, $mid0) = @_;
my $mids = mids($mime->header_obj);
if (@$mids) {
my $mid = $mids->[0];
my $num = $self->{skel}->{mm}->mid_insert($mid);
- return $num if defined($num); # common case
+ if (defined $num) { # common case
+ $$mid0 = $mid;
+ return $num;
+ };
# crap, Message-ID is already known, hope somebody just resent:
$self->done; # write barrier, clears $self->{skel}
@@ -111,38 +116,39 @@ sub num_for {
$num = $self->{skel}->{mm}->mid_insert($m);
if (defined $num) {
warn "alternative <$m> for <$mid> found\n";
+ $$mid0 = $m;
return $num;
}
}
}
# none of the existing Message-IDs are good, generate a new one:
- num_for_harder($self, $mime);
+ num_for_harder($self, $mime, $mid0);
}
sub num_for_harder {
- my ($self, $mime) = @_;
+ my ($self, $mime, $mid0) = @_;
my $hdr = $mime->header_obj;
my $dig = content_digest($mime);
- my $mid = $dig->clone->hexdigest . '@localhost';
- my $num = $self->{skel}->{mm}->mid_insert($mid);
+ $$mid0 = $dig->clone->hexdigest . '@localhost';
+ my $num = $self->{skel}->{mm}->mid_insert($$mid0);
unless (defined $num) {
# it's hard to spoof the last Received: header
my @recvd = $hdr->header_raw('Received');
$dig->add("Received: $_") foreach (@recvd);
- $mid = $dig->clone->hexdigest . '@localhost';
- $num = $self->{skel}->{mm}->mid_insert($mid);
+ $$mid0 = $dig->clone->hexdigest . '@localhost';
+ $num = $self->{skel}->{mm}->mid_insert($$mid0);
# fall back to a random Message-ID and give up determinism:
until (defined($num)) {
$dig->add(rand);
- $mid = $dig->clone->hexdigest . '@localhost';
- warn "using random Message-ID <$mid> as fallback\n";
- $num = $self->{skel}->{mm}->mid_insert($mid);
+ $$mid0 = $dig->clone->hexdigest . '@localhost';
+ warn "using random Message-ID <$$mid0> as fallback\n";
+ $num = $self->{skel}->{mm}->mid_insert($$mid0);
}
}
my @cur = $hdr->header_raw('Message-Id');
- $hdr->header_set('Message-Id', "<$mid>", @cur);
+ $hdr->header_set('Message-Id', "<$$mid0>", @cur);
$num;
}
diff --git a/t/search-thr-index.t b/t/search-thr-index.t
index c3534f6..6c6e4c5 100644
--- a/t/search-thr-index.t
+++ b/t/search-thr-index.t
@@ -41,8 +41,8 @@ foreach (reverse split(/\n\n/, $data)) {
$mime->header_set('From' => 'bw@g');
$mime->header_set('To' => 'git@vger.kernel.org');
my $bytes = bytes::length($mime->as_string);
- my $doc_id = $rw->add_message($mime, $bytes, ++$num, 'ignored');
my $mid = $mime->header('Message-Id');
+ my $doc_id = $rw->add_message($mime, $bytes, ++$num, 'ignored', $mid);
push @mids, $mid;
ok($doc_id, 'message added: '. $mid);
}
diff --git a/t/v2writable.t b/t/v2writable.t
index f95b2e7..bf8ae5e 100644
--- a/t/v2writable.t
+++ b/t/v2writable.t
@@ -93,5 +93,78 @@ ok($im->add($mime), 'ordinary message added');
ok($found[1]->{doc_id} > 0, 'doc_id is positive');
}
+SKIP: {
+ use Fcntl qw(FD_CLOEXEC F_SETFD F_GETFD);
+ use Net::NNTP;
+ use IO::Socket;
+ use Socket qw(SO_KEEPALIVE IPPROTO_TCP TCP_NODELAY);
+ eval { require Danga::Socket };
+ skip "Danga::Socket missing $@", 2 if $@;
+ my $err = "$mainrepo/stderr.log";
+ my $out = "$mainrepo/stdout.log";
+ my %opts = (
+ LocalAddr => '127.0.0.1',
+ ReuseAddr => 1,
+ Proto => 'tcp',
+ Type => SOCK_STREAM,
+ Listen => 1024,
+ );
+ my $group = 'inbox.comp.test.v2writable';
+ my $pi_config = "$mainrepo/pi_config";
+ open my $fh, '>', $pi_config or die "open: $!\n";
+ print $fh <<EOF
+[publicinbox "test-v2writable"]
+ mainrepo = $mainrepo
+ version = 2
+ address = test\@example.com
+ newsgroup = $group
+EOF
+ ;
+ close $fh or die "close: $!\n";
+ my $sock = IO::Socket::INET->new(%opts);
+ ok($sock, 'sock created');
+ my $pid;
+ my $len;
+ END { kill 'TERM', $pid if defined $pid };
+ $! = 0;
+ my $fl = fcntl($sock, F_GETFD, 0);
+ ok(! $!, 'no error from fcntl(F_GETFD)');
+ is($fl, FD_CLOEXEC, 'cloexec set by default (Perl behavior)');
+ $pid = fork;
+ if ($pid == 0) {
+ use POSIX qw(dup2);
+ $ENV{PI_CONFIG} = $pi_config;
+ # pretend to be systemd
+ fcntl($sock, F_SETFD, $fl &= ~FD_CLOEXEC);
+ dup2(fileno($sock), 3) or die "dup2 failed: $!\n";
+ $ENV{LISTEN_PID} = $$;
+ $ENV{LISTEN_FDS} = 1;
+ my $nntpd = 'blib/script/public-inbox-nntpd';
+ exec $nntpd, "--stdout=$out", "--stderr=$err";
+ die "FAIL: $!\n";
+ }
+ ok(defined $pid, 'forked nntpd process successfully');
+ $! = 0;
+ fcntl($sock, F_SETFD, $fl |= FD_CLOEXEC);
+ ok(! $!, 'no error from fcntl(F_SETFD)');
+ my $host_port = $sock->sockhost . ':' . $sock->sockport;
+ my $n = Net::NNTP->new($host_port);
+ $n->group($group);
+ my $x = $n->xover('1-');
+ my %uniq;
+ foreach my $num (sort { $a <=> $b } keys %$x) {
+ my $mid = $x->{$num}->[3];
+ is($uniq{$mid}++, 0, "MID for $num is unique in XOVER");
+ is_deeply($n->xhdr('Message-ID', $num),
+ { $num => $mid }, "XHDR lookup OK on num $num");
+ is_deeply($n->xhdr('Message-ID', $mid),
+ { $mid => $mid }, "XHDR lookup OK on MID $num");
+ }
+ my %nn;
+ foreach my $mid (@{$n->newnews(0, $group)}) {
+ is($nn{$mid}++, 0, "MID is unique in NEWNEWS");
+ }
+ is_deeply([sort keys %nn], [sort keys %uniq]);
+};
done_testing();
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 25/34] import: consolidate object info for v2 imports
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (23 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 24/34] searchidx: store the primary MID in doc data for NNTP Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 26/34] v2: avoid redundant/repeated configs for git partition repos Eric Wong (Contractor, The Linux Foundation)
` (9 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
It's easier to store everything in one array ref similar
to what our Git->check routine returns
---
lib/PublicInbox/Import.pm | 6 +++---
lib/PublicInbox/V2Writable.pm | 5 ++---
t/import.t | 9 ++++++---
3 files changed, 11 insertions(+), 9 deletions(-)
diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index ac46c0c..ddb63b1 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -282,9 +282,9 @@ sub add {
print $w $str, "\n" or wfail;
# v2: we need this for Xapian
- if ($self->{want_object_id}) {
- chomp($self->{last_object_id} = $self->get_mark(":$blob"));
- $self->{last_object} = [ $n, \$str ];
+ if ($self->{want_object_info}) {
+ chomp(my $oid = $self->get_mark(":$blob"));
+ $self->{last_object} = [ $oid, $n, \$str ];
}
my $ref = $self->{ref};
my $commit = $self->{mark}++;
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index caabc8e..31376db 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -69,8 +69,7 @@ sub add {
my $im = $self->importer;
my $cmt = $im->add($mime);
$cmt = $im->get_mark($cmt);
- my $oid = $im->{last_object_id};
- my ($len, $msgref) = @{$im->{last_object}};
+ my ($oid, $len, $msgref) = @{$im->{last_object}};
my $nparts = $self->{partitions};
my $part = $num % $nparts;
@@ -318,7 +317,7 @@ sub import_init {
my ($self, $git, $packed_bytes) = @_;
my $im = PublicInbox::Import->new($git, undef, undef, $self->{-inbox});
$im->{bytes_added} = int($packed_bytes / $PACKING_FACTOR);
- $im->{want_object_id} = 1;
+ $im->{want_object_info} = 1;
$im->{ssoma_lock} = 0;
$im->{path_type} = 'v2';
$self->{im} = $im;
diff --git a/t/import.t b/t/import.t
index ca59772..eee4744 100644
--- a/t/import.t
+++ b/t/import.t
@@ -28,11 +28,14 @@ my $mime = PublicInbox::MIME->create(
body => "hello world\n",
);
-$im->{want_object_id} = 1 if 'v2';
+$im->{want_object_info} = 1 if 'v2';
like($im->add($mime), qr/\A:\d+\z/, 'added one message');
if ('v2') {
- like($im->{last_object_id}, qr/\A[a-f0-9]{40}\z/, 'got last_object_id');
+ my $info = $im->{last_object};
+ like($info->[0], qr/\A[a-f0-9]{40}\z/, 'got last object_id');
+ is($mime->as_string, ${$info->[2]}, 'string matches');
+ is($info->[1], length(${$info->[2]}), 'length matches');
my @cmd = ('git', "--git-dir=$git->{git_dir}", qw(hash-object --stdin));
my $in = tempfile();
print $in $mime->as_string or die "write failed: $!";
@@ -44,7 +47,7 @@ if ('v2') {
is($?, 0, 'hash-object');
$out->seek(0, SEEK_SET);
chomp(my $hashed_obj = <$out>);
- is($hashed_obj, $im->{last_object_id}, "last_object_id matches exp");
+ is($hashed_obj, $info->[0], "last object_id matches exp");
}
$im->done;
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 26/34] v2: avoid redundant/repeated configs for git partition repos
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (24 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 25/34] import: consolidate object info for v2 imports Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 27/34] INSTALL: document more optional dependencies Eric Wong (Contractor, The Linux Foundation)
` (8 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
We'll let the config of all.git dictate every other subrepo to
ease maintenance and configuration. The "include" directive has
been supported since git 1.7.10, so it's safe to depend on as v2
requires git 2.6.0+ anyways for "get-mark" in fast-import.
---
lib/PublicInbox/SearchIdx.pm | 2 +-
lib/PublicInbox/V2Writable.pm | 10 +++++++---
t/init.t | 2 ++
t/v2writable.t | 16 ++++++++++++++++
4 files changed, 26 insertions(+), 4 deletions(-)
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 71469a9..725bbd8 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -817,7 +817,7 @@ sub _read_git_config_perm {
my ($self) = @_;
my @cmd = qw(config);
if ($self->{version} == 2) {
- push @cmd, "--file=$self->{mainrepo}/inbox-config";
+ push @cmd, "--file=$self->{mainrepo}/all.git/config";
}
my $fh = $self->{git}->popen(@cmd, 'core.sharedRepository');
local $/ = "\n";
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 31376db..461432e 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -244,16 +244,20 @@ sub git_init {
die "$git_dir exists\n" if -e $git_dir;
my @cmd = (qw(git init --bare -q), $git_dir);
PublicInbox::Import::run_die(\@cmd);
- @cmd = (qw/git config/, "--file=$git_dir/config",
- 'repack.writeBitmaps', 'true');
- PublicInbox::Import::run_die(\@cmd);
my $all = "$self->{-inbox}->{mainrepo}/all.git";
unless (-d $all) {
@cmd = (qw(git init --bare -q), $all);
PublicInbox::Import::run_die(\@cmd);
+ @cmd = (qw/git config/, "--file=$all/config",
+ 'repack.writeBitmaps', 'true');
+ PublicInbox::Import::run_die(\@cmd);
}
+ @cmd = (qw/git config/, "--file=$git_dir/config",
+ 'include.path', '../../all.git/config');
+ PublicInbox::Import::run_die(\@cmd);
+
my $alt = "$all/objects/info/alternates";
my $new_obj_dir = "../../git/$new.git/objects";
my %alts;
diff --git a/t/init.t b/t/init.t
index 54b90ec..6ae608e 100644
--- a/t/init.t
+++ b/t/init.t
@@ -38,6 +38,8 @@ SKIP: {
ok(-d "$tmpdir/v2list", 'v2list directory exists');
ok(-f "$tmpdir/v2list/msgmap.sqlite3", 'msgmap exists');
ok(-d "$tmpdir/v2list/all.git", 'catch-all.git directory exists');
+ @cmd = (qw(git config), "--file=$tmpdir/v2list/all.git/config",
+ qw(core.sharedRepository 0644));
}
done_testing();
diff --git a/t/v2writable.t b/t/v2writable.t
index bf8ae5e..2d35aca 100644
--- a/t/v2writable.t
+++ b/t/v2writable.t
@@ -32,6 +32,22 @@ my $mime = PublicInbox::MIME->create(
my $im = PublicInbox::V2Writable->new($ibx, 1);
ok($im->add($mime), 'ordinary message added');
+
+if ('ensure git configs are correct') {
+ my @cmd = (qw(git config), "--file=$mainrepo/all.git/config",
+ qw(core.sharedRepository 0644));
+ is(system(@cmd), 0, "set sharedRepository in all.git");
+ my $git0 = PublicInbox::Git->new("$mainrepo/git/0.git");
+ my $fh = $git0->popen(qw(config core.sharedRepository));
+ my $v = eval { local $/; <$fh> };
+ chomp $v;
+ is($v, '0644', 'child repo inherited core.sharedRepository');
+ $fh = $git0->popen(qw(config --bool repack.writeBitmaps));
+ $v = eval { local $/; <$fh> };
+ chomp $v;
+ is($v, 'true', 'child repo inherited repack.writeBitmaps');
+}
+
{
my @warn;
local $SIG{__WARN__} = sub { push @warn, @_ };
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 27/34] INSTALL: document more optional dependencies
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (25 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 26/34] v2: avoid redundant/repeated configs for git partition repos Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 28/34] search: favor skeleton DB for lookup_mail Eric Wong (Contractor, The Linux Foundation)
` (7 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
I've missed a few things over time :x
---
INSTALL | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/INSTALL b/INSTALL
index 3758e8d..11d844c 100644
--- a/INSTALL
+++ b/INSTALL
@@ -45,6 +45,9 @@ Optional Perl modules:
- Danga::Socket[4] libdanga-socket-perl
- Net::Server[5] libnet-server-perl
- Filesys::Notify::Simple[6] libfilesys-notify-simple-perl
+ - Inline::C[7] libinline-c-perl
+ - Plack::Middleware::ReverseProxy[8] libplack-middleware-reverseproxy-perl
+ - Plack::Middleware::Deflater[8] libplack-middleware-deflater-perl
[1] - Optional, needed for serving/generating Atom and HTML pages
[2] - Optional, only required for NNTP server
@@ -52,6 +55,8 @@ Optional Perl modules:
[4] - Optional, needed for bundled HTTP and NNTP servers
[5] - Optional, needed for standalone daemonization of HTTP+NNTP servers
[6] - Optional, needed for public-inbox-watch Maildir watcher
+[7] - Optional, allows speeds up spawning on Linux (see public-inbox-daemon(8))
+[8] - Optional, recommended for PSGI interface
When installing Search::Xapian, make sure the underlying Xapian
installation is not affected by an index corruption bug:
@@ -64,6 +69,14 @@ public-inbox will never store unregeneratable data in Xapian
or any other search database we might use; Xapian corruption
will not destroy critical data.
+Optional Perl modules (for developers):
+
+ - XML::Feed[9] libxml-feed-perl
+ - IPC::Run[10] libipc-run-perl
+
+[9] - Optional, for testing Atom feeds
+[10] - Optional, for some tests (we hope to drop this dependency someday)
+
Copyright
---------
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 28/34] search: favor skeleton DB for lookup_mail
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (26 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 27/34] INSTALL: document more optional dependencies Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 29/34] search: each_smsg_by_mid uses skeleton if available Eric Wong (Contractor, The Linux Foundation)
` (6 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
The skeleton DB is smaller and hit more frequently given the
homepage and per-message/thread views; so it will be hotter in
the page cache.
---
lib/PublicInbox/Search.pm | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 4dc2747..dc46ead 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -381,7 +381,7 @@ sub lookup_message {
sub lookup_mail { # no ghosts!
my ($self, $mid) = @_;
retry_reopen($self, sub {
- my $smsg = lookup_message($self, $mid) or return;
+ my $smsg = lookup_skeleton($self, $mid) or return;
$smsg->load_expand;
});
}
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 29/34] search: each_smsg_by_mid uses skeleton if available
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (27 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 28/34] search: favor skeleton DB for lookup_mail Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 30/34] v2writable: remove unnecessary skeleton commit Eric Wong (Contractor, The Linux Foundation)
` (5 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
We do not need the large DBs for MID scans.
---
lib/PublicInbox/Search.pm | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index dc46ead..7cad31a 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -409,13 +409,15 @@ sub lookup_article {
sub each_smsg_by_mid {
my ($self, $mid, $cb) = @_;
- my $xdb = $self->{xdb};
# XXX retry_reopen isn't necessary for V2Writable, but the PSGI
# interface will need it...
- my ($head, $tail) = $self->find_doc_ids('Q' . $mid);
+ my $db = $self->{skel} || $self->{xdb};
+ my $term = 'Q' . $mid;
+ my $head = $db->postlist_begin($term);
+ my $tail = $db->postlist_end($term);
for (; $head->nequal($tail); $head->inc) {
my $doc_id = $head->get_docid;
- my $doc = $xdb->get_document($doc_id);
+ my $doc = $db->get_document($doc_id);
my $smsg = PublicInbox::SearchMsg->wrap($doc, $mid);
$smsg->{doc_id} = $doc_id;
$cb->($smsg) or return;
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 30/34] v2writable: remove unnecessary skeleton commit
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (28 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 29/34] search: each_smsg_by_mid uses skeleton if available Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 31/34] favor Received: date over Date: header globally Eric Wong (Contractor, The Linux Foundation)
` (4 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
Not a big deal since we still commit to the skeleton for every
single partition (barrier work abandoned).
---
lib/PublicInbox/V2Writable.pm | 1 -
1 file changed, 1 deletion(-)
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 461432e..3bcea37 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -229,7 +229,6 @@ sub searchidx_checkpoint {
if ($more) {
$dbh->begin_work;
} else {
- $skel->remote_commit; # XXX should be unnecessary...
$skel->remote_close;
delete $self->{skel};
}
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 31/34] favor Received: date over Date: header globally
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (29 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 30/34] v2writable: remove unnecessary skeleton commit Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 32/34] import: fall back to Sender for extracting name and email Eric Wong (Contractor, The Linux Foundation)
` (3 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
The first Received: header is believable since it typically
hits the user's mail server and can be treated as relatively
trustworthy. We still show the Date: in per-message (permalink)
views, which may expose users for having incorrect Date:
headers, but all the ISO YYYY-MM-DD dates we display will
match what we see.
---
MANIFEST | 1 +
lib/PublicInbox/Import.pm | 35 ++-------------------------
lib/PublicInbox/MsgTime.pm | 51 ++++++++++++++++++++++++++++++++++++++++
lib/PublicInbox/SearchMsg.pm | 6 +++--
lib/PublicInbox/View.pm | 8 +------
lib/PublicInbox/WwwAtomStream.pm | 5 ++--
scripts/import_vger_from_mbox | 1 -
7 files changed, 61 insertions(+), 46 deletions(-)
create mode 100644 lib/PublicInbox/MsgTime.pm
diff --git a/MANIFEST b/MANIFEST
index 7366aa0..a42b9e1 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -73,6 +73,7 @@ lib/PublicInbox/MID.pm
lib/PublicInbox/MIME.pm
lib/PublicInbox/Mbox.pm
lib/PublicInbox/MsgIter.pm
+lib/PublicInbox/MsgTime.pm
lib/PublicInbox/Msgmap.pm
lib/PublicInbox/NNTP.pm
lib/PublicInbox/NNTPD.pm
diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index ddb63b1..7ba1668 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -12,8 +12,7 @@ use PublicInbox::Spawn qw(spawn);
use PublicInbox::MID qw(mid_mime mid2path);
use PublicInbox::Address;
use PublicInbox::ContentId qw(content_id);
-use Date::Parse qw(str2time);
-use Time::Zone qw(tz_offset);
+use PublicInbox::MsgTime qw(msg_timestamp);
sub new {
my ($class, $git, $name, $email, $ibx) = @_;
@@ -204,37 +203,7 @@ sub remove {
sub parse_date ($) {
my ($mime) = @_;
- my $hdr = $mime->header_obj;
- my $date = $hdr->header_raw('Date');
- my ($ts, $zone);
- my $mid = $hdr->header_raw('Message-ID');
- if ($date) {
- $ts = eval { str2time($date) };
- if ($@) {
- warn "bad Date: $date in $mid: $@\n";
- } elsif ($date =~ /\s+([\+\-]\d+)\s*\z/) {
- $zone = $1;
- }
- }
- unless ($ts) {
- my @recvd = $hdr->header_raw('Received');
- foreach my $r (@recvd) {
- $zone = undef;
- $r =~ /\s*(\d+\s+[[:alpha:]]+\s+\d{2,4}\s+
- \d+\D\d+(?:\D\d+)\s+([\+\-]\d+))/osx or next;
- $zone = $2;
- $ts = eval { str2time($1) } and last;
- warn "no date in Received: $r\n";
- }
- }
- $zone ||= '+0000';
- # "-1200" is the furthest westermost zone offset,
- # but git fast-import is liberal so we use "-1400"
- if ($zone >= 1400 || $zone <= -1400) {
- warn "bogus TZ offset: $zone, ignoring and assuming +0000\n";
- $zone = '+0000';
- }
- $ts = time unless defined $ts;
+ my ($ts, $zone) = msg_timestamp($mime->header_obj);
$ts = 0 if $ts < 0; # git uses unsigned times
"$ts $zone";
}
diff --git a/lib/PublicInbox/MsgTime.pm b/lib/PublicInbox/MsgTime.pm
new file mode 100644
index 0000000..87664f4
--- /dev/null
+++ b/lib/PublicInbox/MsgTime.pm
@@ -0,0 +1,51 @@
+# Copyright (C) 2018 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+package PublicInbox::MsgTime;
+use strict;
+use warnings;
+use base qw(Exporter);
+our @EXPORT_OK = qw(msg_timestamp);
+use Date::Parse qw(str2time);
+use Time::Zone qw(tz_offset);
+
+sub msg_timestamp ($) {
+ my ($hdr) = @_; # Email::MIME::Header
+ my ($ts, $zone);
+ my $mid;
+ my @recvd = $hdr->header_raw('Received');
+ foreach my $r (@recvd) {
+ $zone = undef;
+ $r =~ /\s*(\d+\s+[[:alpha:]]+\s+\d{2,4}\s+
+ \d+\D\d+(?:\D\d+)\s+([\+\-]\d+))/sx or next;
+ $zone = $2;
+ $ts = eval { str2time($1) } and last;
+ $mid ||= $hdr->header_raw('Message-ID');
+ warn "no date in $mid Received: $r\n";
+ }
+ unless (defined $ts) {
+ my @date = $hdr->header_raw('Date');
+ foreach my $d (@date) {
+ $zone = undef;
+ $ts = eval { str2time($d) };
+ if ($@) {
+ $mid ||= $hdr->header_raw('Message-ID');
+ warn "bad Date: $d in $mid: $@\n";
+ } elsif ($d =~ /\s+([\+\-]\d+)\s*\z/) {
+ $zone = $1;
+ }
+ }
+ }
+ $ts = time unless defined $ts;
+ return $ts unless wantarray;
+
+ $zone ||= '+0000';
+ # "-1200" is the furthest westermost zone offset,
+ # but git fast-import is liberal so we use "-1400"
+ if ($zone >= 1400 || $zone <= -1400) {
+ warn "bogus TZ offset: $zone, ignoring and assuming +0000\n";
+ $zone = '+0000';
+ }
+ ($ts, $zone);
+}
+
+1;
diff --git a/lib/PublicInbox/SearchMsg.pm b/lib/PublicInbox/SearchMsg.pm
index a62a649..23478a2 100644
--- a/lib/PublicInbox/SearchMsg.pm
+++ b/lib/PublicInbox/SearchMsg.pm
@@ -7,9 +7,9 @@ package PublicInbox::SearchMsg;
use strict;
use warnings;
use Search::Xapian;
-use Date::Parse qw/str2time/;
use PublicInbox::MID qw/mid_clean mid_mime/;
use PublicInbox::Address;
+use PublicInbox::MsgTime qw(msg_timestamp);
sub new {
my ($class, $mime) = @_;
@@ -117,7 +117,9 @@ sub from_name {
sub ts {
my ($self) = @_;
- $self->{ts} ||= eval { str2time($self->{mime}->header('Date')) } || 0;
+ $self->{ts} ||= eval {
+ msg_timestamp($self->{mime}->header_obj);
+ } || 0;
}
sub to_doc_data {
diff --git a/lib/PublicInbox/View.pm b/lib/PublicInbox/View.pm
index aad6748..f811f4f 100644
--- a/lib/PublicInbox/View.pm
+++ b/lib/PublicInbox/View.pm
@@ -6,7 +6,7 @@
package PublicInbox::View;
use strict;
use warnings;
-use Date::Parse qw/str2time/;
+use PublicInbox::MsgTime qw(msg_timestamp);
use PublicInbox::Hval qw/ascii_html obfuscate_addrs/;
use PublicInbox::Linkify;
use PublicInbox::MID qw/mid_clean id_compress mid_mime mid_escape/;
@@ -732,12 +732,6 @@ sub load_results {
$srch->retry_reopen(sub { [ map { $_->mid; $_ } @$msgs ] });
}
-sub msg_timestamp {
- my ($hdr) = @_;
- my $ts = eval { str2time($hdr->header('Date')) };
- defined($ts) ? $ts : 0;
-}
-
sub thread_results {
my ($msgs, $srch) = @_;
require PublicInbox::SearchThread;
diff --git a/lib/PublicInbox/WwwAtomStream.pm b/lib/PublicInbox/WwwAtomStream.pm
index b69de85..bb574a7 100644
--- a/lib/PublicInbox/WwwAtomStream.pm
+++ b/lib/PublicInbox/WwwAtomStream.pm
@@ -7,11 +7,11 @@ use strict;
use warnings;
use POSIX qw(strftime);
-use Date::Parse qw(str2time);
use Digest::SHA qw(sha1_hex);
use PublicInbox::Address;
use PublicInbox::Hval qw(ascii_html);
use PublicInbox::MID qw/mid_clean mid_escape/;
+use PublicInbox::MsgTime qw(msg_timestamp);
# called by PSGI server after getline:
sub close {}
@@ -108,8 +108,7 @@ sub feed_entry {
$irt = '';
}
my $href = $base . mid_escape($mid) . '/';
- my $date = $hdr->header('Date');
- my $t = eval { str2time($date) } if defined $date;
+ my $t = msg_timestamp($hdr);
my @t = gmtime(defined $t ? $t : time);
my $updated = feed_updated(@t);
diff --git a/scripts/import_vger_from_mbox b/scripts/import_vger_from_mbox
index 8f0ec7c..4469887 100644
--- a/scripts/import_vger_from_mbox
+++ b/scripts/import_vger_from_mbox
@@ -4,7 +4,6 @@
use strict;
use warnings;
use Getopt::Long qw/:config gnu_getopt no_ignore_case auto_abbrev/;
-use Date::Parse qw/str2time/;
use PublicInbox::MIME;
use PublicInbox::Inbox;
use PublicInbox::V2Writable;
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 32/34] import: fall back to Sender for extracting name and email
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (30 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 31/34] favor Received: date over Date: header globally Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 33/34] scripts/import_vger_from_mbox: perform mboxrd or mboxo escaping Eric Wong (Contractor, The Linux Foundation)
` (2 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta; +Cc: Nicolás Ojeda Bär
This seems like a reasonable course of action for old messages.
Cc: Nicolás Ojeda Bär <n.oje.bar@gmail.com>
---
lib/PublicInbox/Import.pm | 62 +++++++++++++++++++++++++++++------------------
1 file changed, 39 insertions(+), 23 deletions(-)
diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 7ba1668..664bec6 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -208,15 +208,50 @@ sub parse_date ($) {
"$ts $zone";
}
-# returns undef on duplicate
-# returns the :MARK of the most recent commit
-sub add {
- my ($self, $mime, $check_cb) = @_; # mime = Email::MIME
+sub extract_author_info ($) {
+ my ($mime) = @_;
+ my $sender = '';
my $from = $mime->header('From');
my ($email) = PublicInbox::Address::emails($from);
my ($name) = PublicInbox::Address::names($from);
+ if (!defined($name) || !defined($email)) {
+ $sender = $mime->header('Sender');
+ if (!defined($name)) {
+ ($name) = PublicInbox::Address::names($sender);
+ }
+ if (!defined($email)) {
+ ($email) = PublicInbox::Address::emails($sender);
+ }
+ }
+ if (defined $email) {
+ # quiet down wide character warnings with utf8::encode
+ utf8::encode($email);
+ } else {
+ $email = '';
+ warn "no email in From: $from or Sender: $sender\n";
+ }
+
+ # git gets confused with:
+ # "'A U Thor <u@example.com>' via foo" <foo@example.com>
+ # ref:
+ # <CAD0k6qSUYANxbjjbE4jTW4EeVwOYgBD=bXkSu=akiYC_CB7Ffw@mail.gmail.com>
+ if (defined $name) {
+ $name =~ tr/<>//d;
+ utf8::encode($name);
+ } else {
+ $name = '';
+ warn "no name in From: $from or Sender: $sender\n";
+ }
+ ($name, $email);
+}
+
+# returns undef on duplicate
+# returns the :MARK of the most recent commit
+sub add {
+ my ($self, $mime, $check_cb) = @_; # mime = Email::MIME
+ my ($name, $email) = extract_author_info($mime);
my $date_raw = parse_date($mime);
my $subject = $mime->header('Subject');
$subject = '(no subject)' unless defined $subject;
@@ -263,25 +298,6 @@ sub add {
print $w "reset $ref\n" or wfail;
}
- # quiet down wide character warnings with utf8::encode
- if (defined $email) {
- utf8::encode($email);
- } else {
- $email = '';
- warn "no email in From: $from\n";
- }
-
- # git gets confused with:
- # "'A U Thor <u@example.com>' via foo" <foo@example.com>
- # ref:
- # <CAD0k6qSUYANxbjjbE4jTW4EeVwOYgBD=bXkSu=akiYC_CB7Ffw@mail.gmail.com>
- if (defined $name) {
- $name =~ tr/<>//d;
- utf8::encode($name);
- } else {
- $name = '';
- warn "no name in From: $from\n";
- }
utf8::encode($subject);
print $w "commit $ref\nmark :$commit\n",
"author $name <$email> $date_raw\n",
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 33/34] scripts/import_vger_from_mbox: perform mboxrd or mboxo escaping
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (31 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 32/34] import: fall back to Sender for extracting name and email Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 34/34] v2writable: detect and use previous partition count Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:53 ` [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
It appears most of the mboxes in the archive I've been given are
mboxrd (despite having Content-Length:) and needs the escaping.
---
scripts/import_vger_from_mbox | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/scripts/import_vger_from_mbox b/scripts/import_vger_from_mbox
index 4469887..6a00fae 100644
--- a/scripts/import_vger_from_mbox
+++ b/scripts/import_vger_from_mbox
@@ -11,11 +11,16 @@ use PublicInbox::Import;
my $usage = "usage: $0 NAME EMAIL DIR <MBOX\n";
my $dry_run;
my $version = 2;
+my $variant = 'mboxrd';
my %opts = (
'n|dry-run' => \$dry_run,
'V|version=i' => \$version,
+ 'F|format=s' => \$variant,
);
GetOptions(%opts) or die $usage;
+if ($variant ne 'mboxrd' && $variant ne 'mboxo') {
+ die "Unsupported mbox variant: $variant\n";
+}
my $name = shift or die $usage; # git
my $email = shift or die $usage; # git@vger.kernel.org
my $mainrepo = shift or die $usage; # /path/to/v2/repo
@@ -45,6 +50,11 @@ sub do_add ($$) {
my ($im, $msg) = @_;
$$msg =~ s/(\r?\n)+\z/$1/s;
my $mime = PublicInbox::MIME->new($msg);
+ if ($variant eq 'mboxrd') {
+ $$msg =~ s/^>(>*From )/$1/sm;
+ } elsif ($variant eq 'mboxo') {
+ $$msg =~ s/^>From /From /sm;
+ }
$mime = $vger->scrub($mime);
return unless $im;
$im->add($mime) or
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 34/34] v2writable: detect and use previous partition count
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (32 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 33/34] scripts/import_vger_from_mbox: perform mboxrd or mboxo escaping Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:42 ` Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:53 ` [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-03-06 8:42 UTC (permalink / raw)
To: meta
We need to detect the number of partitions the repository was
created with to ensure Xapian DBs can work across different
machines (or even CPU affinity changes) without leaving messages
unaffected by search.
---
lib/PublicInbox/V2Writable.pm | 22 ++++++++++++++++++++--
t/v2writable.t | 11 ++++++++++-
2 files changed, 30 insertions(+), 3 deletions(-)
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 3bcea37..7728b91 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -19,7 +19,9 @@ use PublicInbox::Inbox;
my $PACKING_FACTOR = 0.4;
# assume 2 cores if GNU nproc(1) is not available
-my $NPROC = int($ENV{NPROC} || `nproc 2>/dev/null` || 2);
+sub nproc () {
+ int($ENV{NPROC} || `nproc 2>/dev/null` || 2);
+}
sub new {
my ($class, $v2ibx, $creat) = @_;
@@ -32,12 +34,28 @@ sub new {
die "$dir does not exist\n";
}
}
+
+ my $nparts = 0;
+ my $xpfx = "$dir/xap" . PublicInbox::Search::SCHEMA_VERSION;
+
+ # always load existing partitions in case core count changes:
+ if (-d $xpfx) {
+ foreach my $part (<$xpfx/*>) {
+ -d $part && $part =~ m!/\d+\z! or next;
+ eval {
+ Search::Xapian::Database->new($part)->close;
+ $nparts++;
+ };
+ }
+ }
+ $nparts = nproc() if ($nparts == 0);
+
my $self = {
-inbox => $v2ibx,
im => undef, # PublicInbox::Import
xap_rw => undef, # PublicInbox::V2SearchIdx
xap_ro => undef,
- partitions => $NPROC,
+ partitions => $nparts,
transact_bytes => 0,
# limit each repo to 1GB or so
rotate_bytes => int((1024 * 1024 * 1024) / $PACKING_FACTOR),
diff --git a/t/v2writable.t b/t/v2writable.t
index 2d35aca..404c865 100644
--- a/t/v2writable.t
+++ b/t/v2writable.t
@@ -30,7 +30,11 @@ my $mime = PublicInbox::MIME->create(
body => "hello world\n",
);
-my $im = PublicInbox::V2Writable->new($ibx, 1);
+my $im = eval {
+ local $ENV{NPROC} = '1';
+ PublicInbox::V2Writable->new($ibx, 1);
+};
+is($im->{partitions}, 1, 'one partition when forced');
ok($im->add($mime), 'ordinary message added');
if ('ensure git configs are correct') {
@@ -182,5 +186,10 @@ EOF
}
is_deeply([sort keys %nn], [sort keys %uniq]);
};
+{
+ local $ENV{NPROC} = 2;
+ $im = PublicInbox::V2Writable->new($ibx, 1);
+ is($im->{partitions}, 1, 'detected single partition from previous');
+}
done_testing();
--
EW
^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
` (33 preceding siblings ...)
2018-03-06 8:42 ` [PATCH 34/34] v2writable: detect and use previous partition count Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-06 8:53 ` Eric Wong
34 siblings, 0 replies; 36+ messages in thread
From: Eric Wong @ 2018-03-06 8:53 UTC (permalink / raw)
To: meta
I wrote:
> Internally, the Message-Id we _favor_ for NNTP is also the one which
> gets used for rendering threads.
This idea will also be used for implementing attachment
downloads when multiple messages share the same MID
^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread, other threads:[~2018-03-06 8:53 UTC | newest]
Thread overview: 36+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-03-06 8:42 [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 01/34] v2writable: delete ::Import obj when ->done Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 02/34] search: remove informational "warning" message Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 03/34] searchidx: add PID to error message when die-ing Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 04/34] content_id: special treatment for Message-Id headers Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 05/34] evcleanup: disable outside of daemon Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 06/34] v2writable: deduplicate detection on add Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 07/34] evcleanup: do not create event loop if nothing was registered Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 08/34] mid: add `mids' and `references' methods for extraction Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 09/34] content_id: use `mids' and `references' for MID extraction Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 10/34] searchidx: use new `references' method for parsing References Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 11/34] content_id: no need to be human-friendly Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 12/34] v2writable: inject new Message-IDs on true duplicates Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 13/34] search: revert to using 'Q' as a uniQue id per-Xapian conventions Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 14/34] searchidx: support indexing multiple MIDs Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 15/34] mid: be strict with References, but loose on Message-Id Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 16/34] searchidx: avoid excessive XNQ indexing with diffs Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 17/34] searchidxskeleton: add a note about locking Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 18/34] v2writable: generated Message-ID goes first Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 19/34] searchidx: use add_boolean_term for internal terms Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 20/34] searchidx: add NNTP article number as a searchable term Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 21/34] mid: truncate excessively long MIDs early Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 22/34] nntp: use NNTP article numbers for lookups Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 23/34] nntp: fix NEWNEWS command Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 24/34] searchidx: store the primary MID in doc data for NNTP Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 25/34] import: consolidate object info for v2 imports Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 26/34] v2: avoid redundant/repeated configs for git partition repos Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 27/34] INSTALL: document more optional dependencies Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 28/34] search: favor skeleton DB for lookup_mail Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 29/34] search: each_smsg_by_mid uses skeleton if available Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 30/34] v2writable: remove unnecessary skeleton commit Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 31/34] favor Received: date over Date: header globally Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 32/34] import: fall back to Sender for extracting name and email Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 33/34] scripts/import_vger_from_mbox: perform mboxrd or mboxo escaping Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:42 ` [PATCH 34/34] v2writable: detect and use previous partition count Eric Wong (Contractor, The Linux Foundation)
2018-03-06 8:53 ` [v2 PATCH 00/34] duplicate handling, smaller Xapian DBs, date fixes Eric Wong
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).