* [PATCH 1/8] v2: use v5.10.1, parent.pm, drop warnings
2020-07-17 6:31 [PATCH 0/8] indexing cleanup and code reduction Eric Wong
@ 2020-07-17 6:31 ` Eric Wong
2020-07-17 6:31 ` [PATCH 2/8] drop binmode usage Eric Wong
` (7 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2020-07-17 6:31 UTC (permalink / raw)
To: meta
The "5.010_001" form was for Perl 5.6, which I doubt anybody
would attempt; so favor "v5.10.1" as it is more readable to
humans. Prefer "parent" to "base" since the former is lighter.
We'll also rely on warnings from "-w" globally (or not) instead
of via "use".
We'll also update "use" statements to reflect what's actually
used by V2Writable.
---
lib/PublicInbox/SearchIdxShard.pm | 4 ++--
lib/PublicInbox/V2Writable.pm | 7 +++----
2 files changed, 5 insertions(+), 6 deletions(-)
diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm
index f7ba293f..baf7352a 100644
--- a/lib/PublicInbox/SearchIdxShard.pm
+++ b/lib/PublicInbox/SearchIdxShard.pm
@@ -5,8 +5,8 @@
# See L<public-inbox-v2-format(5)> for more info on how we shard Xapian
package PublicInbox::SearchIdxShard;
use strict;
-use warnings;
-use base qw(PublicInbox::SearchIdx);
+use v5.10.1;
+use parent qw(PublicInbox::SearchIdx);
use IO::Handle (); # autoflush
use PublicInbox::Eml;
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 528f5e9a..0119ea76 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -5,16 +5,15 @@
# Used to write to V2 inboxes (see L<public-inbox-v2-format(5)>).
package PublicInbox::V2Writable;
use strict;
-use warnings;
-use base qw(PublicInbox::Lock);
-use 5.010_001;
+use v5.10.1;
+use parent qw(PublicInbox::Lock);
use PublicInbox::SearchIdxShard;
use PublicInbox::Eml;
use PublicInbox::Git;
use PublicInbox::Import;
use PublicInbox::MID qw(mids references);
use PublicInbox::ContentHash qw(content_hash content_digest);
-use PublicInbox::Inbox;
+use PublicInbox::InboxWritable;
use PublicInbox::OverIdx;
use PublicInbox::Msgmap;
use PublicInbox::Spawn qw(spawn popen_rd);
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 2/8] drop binmode usage
2020-07-17 6:31 [PATCH 0/8] indexing cleanup and code reduction Eric Wong
2020-07-17 6:31 ` [PATCH 1/8] v2: use v5.10.1, parent.pm, drop warnings Eric Wong
@ 2020-07-17 6:31 ` Eric Wong
2020-07-17 6:31 ` [PATCH 3/8] import: use common capitalization for filtering headers Eric Wong
` (6 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2020-07-17 6:31 UTC (permalink / raw)
To: meta
We only support Unix-like platforms where binmode (":raw") is
the default anyways, and v5.10 semantics means it won't do
unicode_strings (unlike v5.12). So save some lines of code.
---
lib/PublicInbox/Import.pm | 2 --
lib/PublicInbox/SearchIdxShard.pm | 2 --
2 files changed, 4 deletions(-)
diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index fb813159..b61d4b31 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -73,8 +73,6 @@ sub gfi_start {
$self->{out} = $out_w;
$self->{pid} = $pid;
$self->{nchg} = 0;
- binmode $out_w, ':raw' or die "binmode :raw failed: $!";
- binmode $in_r, ':raw' or die "binmode :raw failed: $!";
($in_r, $out_w);
}
diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm
index baf7352a..b51d148b 100644
--- a/lib/PublicInbox/SearchIdxShard.pm
+++ b/lib/PublicInbox/SearchIdxShard.pm
@@ -26,8 +26,6 @@ sub spawn_worker {
my ($self, $v2w, $shard) = @_;
my ($r, $w);
pipe($r, $w) or die "pipe failed: $!\n";
- binmode $r, ':raw';
- binmode $w, ':raw';
$w->autoflush(1);
my $pid = fork;
defined $pid or die "fork failed: $!\n";
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 3/8] import: use common capitalization for filtering headers
2020-07-17 6:31 [PATCH 0/8] indexing cleanup and code reduction Eric Wong
2020-07-17 6:31 ` [PATCH 1/8] v2: use v5.10.1, parent.pm, drop warnings Eric Wong
2020-07-17 6:31 ` [PATCH 2/8] drop binmode usage Eric Wong
@ 2020-07-17 6:31 ` Eric Wong
2020-07-17 6:31 ` [PATCH 4/8] with_umask: pass args to callback Eric Wong
` (5 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2020-07-17 6:31 UTC (permalink / raw)
To: meta
In case this ends up in the same process as Mbox::msg_hdr,
it can reduce memory use by sharing the cache key in
PublicInbox::Eml::re_memo
---
lib/PublicInbox/Import.pm | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index b61d4b31..d565b0a0 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -326,7 +326,7 @@ sub extract_cmt_info ($;$) {
sub drop_unwanted_headers ($) {
my ($mime) = @_;
- $mime->header_set($_) for qw(bytes lines content-length status);
+ $mime->header_set($_) for qw(Bytes Lines Content-Length Status);
$mime->header_set($_) for @PublicInbox::MDA::BAD_HEADERS;
}
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 4/8] with_umask: pass args to callback
2020-07-17 6:31 [PATCH 0/8] indexing cleanup and code reduction Eric Wong
` (2 preceding siblings ...)
2020-07-17 6:31 ` [PATCH 3/8] import: use common capitalization for filtering headers Eric Wong
@ 2020-07-17 6:31 ` Eric Wong
2020-07-17 6:31 ` [PATCH 5/8] overidx: each_by_mid: pass self and args to callbacks Eric Wong
` (4 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2020-07-17 6:31 UTC (permalink / raw)
To: meta
While it makes the code flow slightly less well in some places,
it saves us runtime allocations and indentation.
---
lib/PublicInbox/InboxWritable.pm | 42 +++++++------
lib/PublicInbox/SearchIdx.pm | 40 ++++++------
lib/PublicInbox/V2Writable.pm | 102 ++++++++++++++-----------------
lib/PublicInbox/Xapcmd.pm | 35 ++++++-----
4 files changed, 111 insertions(+), 108 deletions(-)
diff --git a/lib/PublicInbox/InboxWritable.pm b/lib/PublicInbox/InboxWritable.pm
index 875dcce2..1f3f6672 100644
--- a/lib/PublicInbox/InboxWritable.pm
+++ b/lib/PublicInbox/InboxWritable.pm
@@ -37,27 +37,33 @@ sub assert_usable_dir {
die "no inboxdir defined for $self->{name}\n";
}
+sub _init_v1 {
+ my ($self, $skip_artnum) = @_;
+ if (defined($self->{indexlevel}) || defined($skip_artnum)) {
+ require PublicInbox::SearchIdx;
+ require PublicInbox::Msgmap;
+ my $sidx = PublicInbox::SearchIdx->new($self, 1); # just create
+ $sidx->begin_txn_lazy;
+ if (defined $skip_artnum) {
+ my $mm = PublicInbox::Msgmap->new($self->{inboxdir}, 1);
+ $mm->{dbh}->begin_work;
+ $mm->skip_artnum($skip_artnum);
+ $mm->{dbh}->commit;
+ }
+ $sidx->commit_txn_lazy;
+ } else {
+ open my $fh, '>>', "$self->{inboxdir}/ssoma.lock" or
+ die "$self->{inboxdir}/ssoma.lock: $!\n";
+ }
+}
+
sub init_inbox {
my ($self, $shards, $skip_epoch, $skip_artnum) = @_;
if ($self->version == 1) {
my $dir = assert_usable_dir($self);
PublicInbox::Import::init_bare($dir);
- if (defined($self->{indexlevel}) || defined($skip_artnum)) {
- require PublicInbox::SearchIdx;
- require PublicInbox::Msgmap;
- my $sidx = PublicInbox::SearchIdx->new($self, 1); # just create
- $sidx->begin_txn_lazy;
- $self->with_umask(sub {
- my $mm = PublicInbox::Msgmap->new($dir, 1);
- $mm->{dbh}->begin_work;
- $mm->skip_artnum($skip_artnum);
- $mm->{dbh}->commit;
- }) if defined($skip_artnum);
- $sidx->commit_txn_lazy;
- } else {
- open my $fh, '>>', "$dir/ssoma.lock" or
- die "$dir/ssoma.lock: $!\n";
- }
+ $self->umask_prepare;
+ $self->with_umask(\&_init_v1, $self, $skip_artnum);
} else {
my $v2w = importer($self);
$v2w->init_inbox($shards, $skip_epoch, $skip_artnum);
@@ -255,9 +261,9 @@ sub _umask_for {
}
sub with_umask {
- my ($self, $cb) = @_;
+ my ($self, $cb, @arg) = @_;
my $old = umask $self->{umask};
- my $rv = eval { $cb->() };
+ my $rv = eval { $cb->(@arg) };
my $err = $@;
umask $old;
die $err if $err;
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 4caa66d3..c93c9034 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -585,7 +585,7 @@ sub unindex_both { # git->cat_async callback
sub index_sync {
my ($self, $opts) = @_;
delete $self->{lock_path} if $opts->{-skip_lock};
- $self->{-inbox}->with_umask(sub { $self->_index_sync($opts) })
+ $self->{-inbox}->with_umask(\&_index_sync, $self, $opts);
}
sub too_big ($$$) {
@@ -854,17 +854,18 @@ sub remote_remove {
}
}
-sub begin_txn_lazy {
+sub _begin_txn {
my ($self) = @_;
- return if $self->{txn};
+ my $xdb = $self->{xdb} || $self->_xdb_acquire;
+ $self->{over}->begin_lazy if $self->{over};
+ $xdb->begin_transaction if $xdb;
+ $self->{txn} = 1;
+ $xdb;
+}
- $self->{-inbox}->with_umask(sub {
- my $xdb = $self->{xdb} || $self->_xdb_acquire;
- $self->{over}->begin_lazy if $self->{over};
- $xdb->begin_transaction if $xdb;
- $self->{txn} = 1;
- $xdb;
- });
+sub begin_txn_lazy {
+ my ($self) = @_;
+ $self->{-inbox}->with_umask(\&_begin_txn, $self) if !$self->{txn};
}
# store 'indexlevel=medium' in v2 shard=0 and v1 (only one shard)
@@ -882,16 +883,19 @@ sub set_indexlevel {
}
}
+sub _commit_txn {
+ my ($self) = @_;
+ if (my $xdb = $self->{xdb}) {
+ set_indexlevel($self);
+ $xdb->commit_transaction;
+ }
+ $self->{over}->commit_lazy if $self->{over};
+}
+
sub commit_txn_lazy {
my ($self) = @_;
- delete $self->{txn} or return;
- $self->{-inbox}->with_umask(sub {
- if (my $xdb = $self->{xdb}) {
- set_indexlevel($self);
- $xdb->commit_transaction;
- }
- $self->{over}->commit_lazy if $self->{over};
- });
+ delete($self->{txn}) and
+ $self->{-inbox}->with_umask(\&_commit_txn, $self);
}
sub worker_done {
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 0119ea76..b51c8525 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -147,10 +147,8 @@ sub init_inbox {
# returns undef on duplicate or spam
# mimics Import::add and wraps it for v2
sub add {
- my ($self, $mime, $check_cb) = @_;
- $self->{-inbox}->with_umask(sub {
- _add($self, $mime, $check_cb)
- });
+ my ($self, $eml, $check_cb) = @_;
+ $self->{-inbox}->with_umask(\&_add, $self, $eml, $check_cb);
}
# indexes a message, returns true if checkpointing is needed
@@ -276,6 +274,28 @@ sub idx_shard {
$self->{idx_shards}->[$shard_i];
}
+sub _idx_init { # with_umask callback
+ my ($self, $opt) = @_;
+ $self->lock_acquire unless $opt && $opt->{-skip_lock};
+ $self->{over}->create;
+
+ # xcpdb can change shard count while -watch is idle
+ my $nshards = count_shards($self);
+ $self->{shards} = $nshards if $nshards && $nshards != $self->{shards};
+
+ # need to create all shards before initializing msgmap FD
+ # idx_shards must be visible to all forked processes
+ my $max = $self->{shards} - 1;
+ my $idx = $self->{idx_shards} = [];
+ push @$idx, PublicInbox::SearchIdxShard->new($self, $_) for (0..$max);
+
+ # Now that all subprocesses are up, we can open the FDs
+ # for SQLite:
+ my $mm = $self->{mm} = PublicInbox::Msgmap->new_file(
+ "$self->{-inbox}->{inboxdir}/msgmap.sqlite3", 1);
+ $mm->{dbh}->begin_work;
+}
+
# idempotent
sub idx_init {
my ($self, $opt) = @_;
@@ -285,13 +305,10 @@ sub idx_init {
# do not leak read-only FDs to child processes, we only have these
# FDs for duplicate detection so they should not be
# frequently activated.
+ # delete @$ibx{qw(git mm search)};
delete $ibx->{$_} foreach (qw(git mm search));
- my $indexlevel = $ibx->{indexlevel};
- if ($indexlevel && $indexlevel eq 'basic') {
- $self->{parallel} = 0;
- }
-
+ $self->{parallel} = 0 if ($ibx->{indexlevel}//'') eq 'basic';
if ($self->{parallel}) {
pipe(my ($r, $w)) or die "pipe failed: $!";
# pipe for barrier notifications doesn't need to be big,
@@ -301,33 +318,8 @@ sub idx_init {
$w->autoflush(1);
}
- my $over = $self->{over};
$ibx->umask_prepare;
- $ibx->with_umask(sub {
- $self->lock_acquire unless ($opt && $opt->{-skip_lock});
- $over->create;
-
- # xcpdb can change shard count while -watch is idle
- my $nshards = count_shards($self);
- if ($nshards && $nshards != $self->{shards}) {
- $self->{shards} = $nshards;
- }
-
- # need to create all shards before initializing msgmap FD
- my $max = $self->{shards} - 1;
-
- # idx_shards must be visible to all forked processes
- my $idx = $self->{idx_shards} = [];
- for my $i (0..$max) {
- push @$idx, PublicInbox::SearchIdxShard->new($self, $i);
- }
-
- # Now that all subprocesses are up, we can open the FDs
- # for SQLite:
- my $mm = $self->{mm} = PublicInbox::Msgmap->new_file(
- "$self->{-inbox}->{inboxdir}/msgmap.sqlite3", 1);
- $mm->{dbh}->begin_work;
- });
+ $ibx->with_umask(\&_idx_init, $self, $opt);
}
# returns an array mapping [ epoch => latest_commit ]
@@ -379,24 +371,24 @@ sub content_matches ($$) {
# used for removing or replacing (purging)
sub rewrite_internal ($$;$$$) {
- my ($self, $old_mime, $cmt_msg, $new_mime, $sref) = @_;
+ my ($self, $old_eml, $cmt_msg, $new_eml, $sref) = @_;
$self->idx_init;
my ($im, $need_reindex, $replace_map);
if ($sref) {
$replace_map = {}; # oid => sref
- $need_reindex = [] if $new_mime;
+ $need_reindex = [] if $new_eml;
} else {
$im = $self->importer;
}
my $over = $self->{over};
- my $chashes = content_hashes($old_mime);
- my @removed;
- my $mids = mids($old_mime->header_obj);
+ my $chashes = content_hashes($old_eml);
+ my $removed = [];
+ my $mids = mids($old_eml->header_obj);
# We avoid introducing new blobs into git since the raw content
# can be slightly different, so we do not need the user-supplied
# message now that we have the mids and content_hash
- $old_mime = undef;
+ $old_eml = undef;
my $mark;
foreach my $mid (@$mids) {
@@ -422,15 +414,15 @@ sub rewrite_internal ($$;$$$) {
}
foreach my $num (keys %gone) {
my ($smsg, $mime, $orig) = @{$gone{$num}};
- # @removed should only be set once assuming
+ # $removed should only be set once assuming
# no bugs in our deduplication code:
- @removed = (undef, $mime, $smsg);
+ $removed = [ undef, $mime, $smsg ];
my $oid = $smsg->{blob};
if ($replace_map) {
$replace_map->{$oid} = $sref;
} else {
($mark, undef) = $im->remove($orig, $cmt_msg);
- $removed[0] = $mark;
+ $removed->[0] = $mark;
}
$orig = undef;
if ($need_reindex) { # ->replace
@@ -447,28 +439,26 @@ sub rewrite_internal ($$;$$$) {
$self->{last_commit}->[$self->{epoch_max}] = $cmt;
}
if ($replace_map && scalar keys %$replace_map) {
- my $rewrites = _replace_oids($self, $new_mime, $replace_map);
+ my $rewrites = _replace_oids($self, $new_eml, $replace_map);
return { rewrites => $rewrites, need_reindex => $need_reindex };
}
- defined($mark) ? @removed : undef;
+ defined($mark) ? $removed : undef;
}
# public (see PublicInbox::Import->remove), but note the 3rd element
# (retval[2]) is not part of the stable API shared with Import->remove
sub remove {
- my ($self, $mime, $cmt_msg) = @_;
- my @ret;
- $self->{-inbox}->with_umask(sub {
- @ret = rewrite_internal($self, $mime, $cmt_msg);
- });
- defined($ret[0]) ? @ret : undef;
+ my ($self, $eml, $cmt_msg) = @_;
+ my $r = $self->{-inbox}->with_umask(\&rewrite_internal,
+ $self, $eml, $cmt_msg);
+ defined($r) && defined($r->[0]) ? @$r: undef;
}
sub _replace ($$;$$) {
- my ($self, $old_mime, $new_mime, $sref) = @_;
- my $rewritten = $self->{-inbox}->with_umask(sub {
- rewrite_internal($self, $old_mime, undef, $new_mime, $sref);
- }) or return;
+ my ($self, $old_eml, $new_eml, $sref) = @_;
+ my $arg = [ $self, $old_eml, undef, $new_eml, $sref ];
+ my $rewritten = $self->{-inbox}->with_umask(\&rewrite_internal,
+ $self, $old_eml, undef, $new_eml, $sref) or return;
my $rewrites = $rewritten->{rewrites};
# ->done is called if there are rewrites since we gc+prune from git
diff --git a/lib/PublicInbox/Xapcmd.pm b/lib/PublicInbox/Xapcmd.pm
index a57fa559..c04f935c 100644
--- a/lib/PublicInbox/Xapcmd.pm
+++ b/lib/PublicInbox/Xapcmd.pm
@@ -229,6 +229,24 @@ sub prepare_run {
sub check_compact () { runnable_or_die($XAPIAN_COMPACT) }
+sub _run {
+ my ($ibx, $cb, $opt, $reindex) = @_;
+ my $im = $ibx->importer(0);
+ $im->lock_acquire;
+ my ($tmp, $queue) = prepare_run($ibx, $opt);
+
+ # fine-grained locking if we prepare for reindex
+ if (!$opt->{-coarse_lock}) {
+ prepare_reindex($ibx, $im, $reindex);
+ $im->lock_release;
+ }
+
+ $ibx->cleanup;
+ process_queue($queue, $cb, $opt);
+ $im->lock_acquire if !$opt->{-coarse_lock};
+ commit_changes($ibx, $im, $tmp, $opt);
+}
+
sub run {
my ($ibx, $task, $opt) = @_; # task = 'cpdb' or 'compact'
my $cb = \&${\"PublicInbox::Xapcmd::$task"};
@@ -248,22 +266,7 @@ sub run {
local %SIG = %SIG;
setup_signals();
$ibx->umask_prepare;
- $ibx->with_umask(sub {
- my $im = $ibx->importer(0);
- $im->lock_acquire;
- my ($tmp, $queue) = prepare_run($ibx, $opt);
-
- # fine-grained locking if we prepare for reindex
- if (!$opt->{-coarse_lock}) {
- prepare_reindex($ibx, $im, $reindex);
- $im->lock_release;
- }
-
- $ibx->cleanup;
- process_queue($queue, $cb, $opt);
- $im->lock_acquire if !$opt->{-coarse_lock};
- commit_changes($ibx, $im, $tmp, $opt);
- });
+ $ibx->with_umask(\&_run, $ibx, $cb, $opt, $reindex);
}
sub cpdb_retryable ($$) {
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 5/8] overidx: each_by_mid: pass self and args to callbacks
2020-07-17 6:31 [PATCH 0/8] indexing cleanup and code reduction Eric Wong
` (3 preceding siblings ...)
2020-07-17 6:31 ` [PATCH 4/8] with_umask: pass args to callback Eric Wong
@ 2020-07-17 6:31 ` Eric Wong
2020-07-17 6:31 ` [PATCH 6/8] overidx: favor non-OO sub dispatch for internal subs Eric Wong
` (3 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2020-07-17 6:31 UTC (permalink / raw)
To: meta
This saves runtime allocations and reduces the likelyhood of
memory leaks either from cycles or buggy old Perl versions.
---
lib/PublicInbox/OverIdx.pm | 99 +++++++++++++++++++++-----------------
1 file changed, 54 insertions(+), 45 deletions(-)
diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index ea8da723..52f6328e 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -107,7 +107,7 @@ DELETE FROM $_ WHERE num = ?
# this includes ghosts
sub each_by_mid {
- my ($self, $mid, $cols, $cb) = @_;
+ my ($self, $mid, $cols, $cb, @arg) = @_;
my $dbh = $self->{dbh};
=over
@@ -152,27 +152,29 @@ SELECT $cols FROM over WHERE over.num = ? LIMIT 1
foreach (@$nums) {
$sth->execute($_->[0]);
my $smsg = $sth->fetchrow_hashref;
- $cb->(PublicInbox::Over::load_from_row($smsg)) or
- return;
+ $smsg = PublicInbox::Over::load_from_row($smsg);
+ $cb->($self, $smsg, @arg) or return;
}
return if $nr != $lim;
}
}
+sub _resolve_mid_to_tid {
+ my ($self, $smsg, $tid) = @_;
+ my $cur_tid = $smsg->{tid};
+ if (defined $$tid) {
+ merge_threads($self, $$tid, $cur_tid);
+ } else {
+ $$tid = $cur_tid;
+ }
+ 1;
+}
+
# this will create a ghost as necessary
sub resolve_mid_to_tid {
my ($self, $mid) = @_;
my $tid;
- each_by_mid($self, $mid, ['tid'], sub {
- my ($smsg) = @_;
- my $cur_tid = $smsg->{tid};
- if (defined $tid) {
- merge_threads($self, $tid, $cur_tid);
- } else {
- $tid = $cur_tid;
- }
- 1;
- });
+ each_by_mid($self, $mid, ['tid'], \&_resolve_mid_to_tid, \$tid);
defined $tid ? $tid : create_ghost($self, $mid);
}
@@ -271,6 +273,22 @@ sub add_overview {
add_over($self, [ @$smsg{qw(ts ds num)}, $mids, $refs, $xpath, $dd ]);
}
+sub _add_over {
+ my ($self, $smsg, $mid, $refs, $old_tid, $v) = @_;
+ my $cur_tid = $smsg->{tid};
+ my $n = $smsg->{num};
+ die "num must not be zero for $mid" if !$n;
+ $$old_tid = $cur_tid unless defined $$old_tid;
+ if ($n > 0) { # regular mail
+ merge_threads($self, $$old_tid, $cur_tid);
+ } elsif ($n < 0) { # ghost
+ link_refs($self, $refs, $$old_tid);
+ $self->delete_by_num($n);
+ $$v++;
+ }
+ 1;
+}
+
sub add_over {
my ($self, $values) = @_;
my ($ts, $ds, $num, $mids, $refs, $xpath, $ddd) = @$values;
@@ -281,21 +299,8 @@ sub add_over {
$self->delete_by_num($num, \$old_tid);
foreach my $mid (@$mids) {
my $v = 0;
- each_by_mid($self, $mid, ['tid'], sub {
- my ($cur) = @_;
- my $cur_tid = $cur->{tid};
- my $n = $cur->{num};
- die "num must not be zero for $mid" if !$n;
- $old_tid = $cur_tid unless defined $old_tid;
- if ($n > 0) { # regular mail
- merge_threads($self, $old_tid, $cur_tid);
- } elsif ($n < 0) { # ghost
- link_refs($self, $refs, $old_tid);
- $self->delete_by_num($n);
- $v++;
- }
- 1;
- });
+ each_by_mid($self, $mid, ['tid'], \&_add_over,
+ $mid, $refs, \$old_tid, \$v);
$v > 1 and warn "BUG: vivified multiple ($v) ghosts for $mid\n";
$vivified += $v;
}
@@ -320,35 +325,39 @@ INSERT INTO id2num (id, num) VALUES (?,?)
}
}
+sub _remove_oid {
+ my ($self, $smsg, $oid, $nr) = @_;
+ if (!defined($oid) || $smsg->{blob} eq $oid) {
+ $self->delete_by_num($smsg->{num});
+ $$nr++;
+ }
+ 1;
+}
+
# returns number of removed messages
# $oid may be undef to match only on $mid
sub remove_oid {
my ($self, $oid, $mid) = @_;
my $nr = 0;
$self->begin_lazy;
- each_by_mid($self, $mid, ['ddd'], sub {
- my ($smsg) = @_;
- if (!defined($oid) || $smsg->{blob} eq $oid) {
- $self->delete_by_num($smsg->{num});
- $nr++;
- }
- 1;
- });
+ each_by_mid($self, $mid, ['ddd'], \&_remove_oid, $oid, \$nr);
$nr;
}
+sub _num_mid0_for_oid {
+ my ($self, $smsg, $oid, $res) = @_;
+ my $blob = $smsg->{blob};
+ return 1 if (!defined($blob) || $blob ne $oid); # continue;
+ @$res = ($smsg->{num}, $smsg->{mid});
+ 0; # done
+}
+
sub num_mid0_for_oid {
my ($self, $oid, $mid) = @_;
- my ($num, $mid0);
+ my $res = [];
$self->begin_lazy;
- each_by_mid($self, $mid, ['ddd'], sub {
- my ($smsg) = @_;
- my $blob = $smsg->{blob};
- return 1 if (!defined($blob) || $blob ne $oid); # continue;
- ($num, $mid0) = ($smsg->{num}, $smsg->{mid});
- 0; # done
- });
- ($num, $mid0);
+ each_by_mid($self, $mid, ['ddd'], \&_num_mid0_for_oid, $oid, $res);
+ @$res, # ($num, $mid0);
}
sub create_tables {
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 6/8] overidx: favor non-OO sub dispatch for internal subs
2020-07-17 6:31 [PATCH 0/8] indexing cleanup and code reduction Eric Wong
` (4 preceding siblings ...)
2020-07-17 6:31 ` [PATCH 5/8] overidx: each_by_mid: pass self and args to callbacks Eric Wong
@ 2020-07-17 6:31 ` Eric Wong
2020-07-17 6:31 ` [PATCH 7/8] searchidx: use v5.10.1, parent.pm, drop warnings Eric Wong
` (2 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2020-07-17 6:31 UTC (permalink / raw)
To: meta
OO method dispatch was 10-15% slower when I was implementing the
NNTP server. It also serves as a helpful reminder to the reader
at the callsite as to whether a sub is likely in the same
package as the caller or not.
---
lib/PublicInbox/OverIdx.pm | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index 52f6328e..29c6e0b9 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -180,10 +180,10 @@ sub resolve_mid_to_tid {
sub create_ghost {
my ($self, $mid) = @_;
- my $id = $self->mid2id($mid);
- my $num = $self->next_ghost_num;
+ my $id = mid2id($self, $mid);
+ my $num = next_ghost_num($self);
$num < 0 or die "ghost num is non-negative: $num\n";
- my $tid = $self->next_tid;
+ my $tid = next_tid($self);
my $dbh = $self->{dbh};
$dbh->prepare_cached(<<'')->execute($num, $tid);
INSERT INTO over (num, tid) VALUES (?,?)
@@ -221,7 +221,7 @@ sub link_refs {
merge_threads($self, $tid, $ptid);
}
} else {
- $tid = defined $old_tid ? $old_tid : $self->next_tid;
+ $tid = defined $old_tid ? $old_tid : next_tid($self);
}
$tid;
}
@@ -283,7 +283,7 @@ sub _add_over {
merge_threads($self, $$old_tid, $cur_tid);
} elsif ($n < 0) { # ghost
link_refs($self, $refs, $$old_tid);
- $self->delete_by_num($n);
+ delete_by_num($self, $n);
$$v++;
}
1;
@@ -295,8 +295,8 @@ sub add_over {
my $old_tid;
my $vivified = 0;
- $self->begin_lazy;
- $self->delete_by_num($num, \$old_tid);
+ begin_lazy($self);
+ delete_by_num($self, $num, \$old_tid);
foreach my $mid (@$mids) {
my $v = 0;
each_by_mid($self, $mid, ['tid'], \&_add_over,
@@ -305,7 +305,7 @@ sub add_over {
$vivified += $v;
}
my $tid = $vivified ? $old_tid : link_refs($self, $refs, $old_tid);
- my $sid = $self->sid($xpath);
+ my $sid = sid($self, $xpath);
my $dbh = $self->{dbh};
my $sth = $dbh->prepare_cached(<<'');
INSERT INTO over (num, tid, sid, ts, ds, ddd)
@@ -320,7 +320,7 @@ VALUES (?,?,?,?,?,?)
INSERT INTO id2num (id, num) VALUES (?,?)
foreach my $mid (@$mids) {
- my $id = $self->mid2id($mid);
+ my $id = mid2id($self, $mid);
$sth->execute($id, $num);
}
}
@@ -328,7 +328,7 @@ INSERT INTO id2num (id, num) VALUES (?,?)
sub _remove_oid {
my ($self, $smsg, $oid, $nr) = @_;
if (!defined($oid) || $smsg->{blob} eq $oid) {
- $self->delete_by_num($smsg->{num});
+ delete_by_num($self, $smsg->{num});
$$nr++;
}
1;
@@ -339,7 +339,7 @@ sub _remove_oid {
sub remove_oid {
my ($self, $oid, $mid) = @_;
my $nr = 0;
- $self->begin_lazy;
+ begin_lazy($self);
each_by_mid($self, $mid, ['ddd'], \&_remove_oid, $oid, \$nr);
$nr;
}
@@ -355,7 +355,7 @@ sub _num_mid0_for_oid {
sub num_mid0_for_oid {
my ($self, $oid, $mid) = @_;
my $res = [];
- $self->begin_lazy;
+ begin_lazy($self);
each_by_mid($self, $mid, ['ddd'], \&_num_mid0_for_oid, $oid, $res);
@$res, # ($num, $mid0);
}
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 7/8] searchidx: use v5.10.1, parent.pm, drop warnings
2020-07-17 6:31 [PATCH 0/8] indexing cleanup and code reduction Eric Wong
` (5 preceding siblings ...)
2020-07-17 6:31 ` [PATCH 6/8] overidx: favor non-OO sub dispatch for internal subs Eric Wong
@ 2020-07-17 6:31 ` Eric Wong
2020-07-17 6:31 ` [PATCH 8/8] search: simplify unindexing Eric Wong
2020-07-17 7:25 ` [9/8 PATCH] v2writable: git_hash_raw: avoid $TMPDIR write Eric Wong
8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2020-07-17 6:31 UTC (permalink / raw)
To: meta
Prefer "parent" to "base" since the former is lighter and part
of Perl 5.10+. We'll also rely on warnings from "-w" globally
(or not) instead of via "use".
---
lib/PublicInbox/SearchIdx.pm | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index c93c9034..9550847b 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -8,8 +8,8 @@
# This writes to the search index.
package PublicInbox::SearchIdx;
use strict;
-use warnings;
-use base qw(PublicInbox::Search PublicInbox::Lock);
+use v5.10.1;
+use parent qw(PublicInbox::Search PublicInbox::Lock);
use PublicInbox::Eml;
use PublicInbox::InboxWritable;
use PublicInbox::MID qw/mid_clean mid_mime mids_for_index/;
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 8/8] search: simplify unindexing
2020-07-17 6:31 [PATCH 0/8] indexing cleanup and code reduction Eric Wong
` (6 preceding siblings ...)
2020-07-17 6:31 ` [PATCH 7/8] searchidx: use v5.10.1, parent.pm, drop warnings Eric Wong
@ 2020-07-17 6:31 ` Eric Wong
2020-07-17 7:25 ` [9/8 PATCH] v2writable: git_hash_raw: avoid $TMPDIR write Eric Wong
8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2020-07-17 6:31 UTC (permalink / raw)
To: meta
Since over.sqlite3 seems here to stay, we no longer need to do
Message-ID lookups against Xapian and can simply rely on the
docid <=> NNTP article number equivalancy SCHEMA_VERSION=15
gave us.
This rids us of the closure-using batch_do sub in the v1
code path and vastly simplifies both v1 and v2 unindexing.
---
lib/PublicInbox/OverIdx.pm | 13 +--
lib/PublicInbox/SearchIdx.pm | 134 +++++++++++-------------------
lib/PublicInbox/SearchIdxShard.pm | 6 +-
lib/PublicInbox/V2Writable.pm | 7 +-
t/search.t | 6 +-
5 files changed, 66 insertions(+), 100 deletions(-)
diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index 29c6e0b9..5601e602 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -326,22 +326,23 @@ INSERT INTO id2num (id, num) VALUES (?,?)
}
sub _remove_oid {
- my ($self, $smsg, $oid, $nr) = @_;
+ my ($self, $smsg, $oid, $removed) = @_;
if (!defined($oid) || $smsg->{blob} eq $oid) {
delete_by_num($self, $smsg->{num});
- $$nr++;
+ push @$removed, $smsg->{num};
}
1;
}
-# returns number of removed messages
+# returns number of removed messages in scalar context,
+# array of removed article numbers in array context.
# $oid may be undef to match only on $mid
sub remove_oid {
my ($self, $oid, $mid) = @_;
- my $nr = 0;
+ my $removed = [];
begin_lazy($self);
- each_by_mid($self, $mid, ['ddd'], \&_remove_oid, $oid, \$nr);
- $nr;
+ each_by_mid($self, $mid, ['ddd'], \&_remove_oid, $oid, $removed);
+ @$removed;
}
sub _num_mid0_for_oid {
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 9550847b..83162509 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -12,7 +12,7 @@ use v5.10.1;
use parent qw(PublicInbox::Search PublicInbox::Lock);
use PublicInbox::Eml;
use PublicInbox::InboxWritable;
-use PublicInbox::MID qw/mid_clean mid_mime mids_for_index/;
+use PublicInbox::MID qw(mid_mime mids_for_index mids);
use PublicInbox::MsgIter;
use Carp qw(croak);
use POSIX qw(strftime);
@@ -413,88 +413,32 @@ sub add_message {
$smsg->{num};
}
-# returns begin and end PostingIterator
-sub find_doc_ids {
- my ($self, $termval) = @_;
- my $db = $self->{xdb};
-
- ($db->postlist_begin($termval), $db->postlist_end($termval));
-}
-
-# v1 only
-sub batch_do {
- my ($self, $termval, $cb) = @_;
- my $batch_size = 1000; # don't let @ids grow too large to avoid OOM
- while (1) {
- my ($head, $tail) = $self->find_doc_ids($termval);
- return if $head == $tail;
- my @ids;
- for (; $head != $tail && @ids < $batch_size; $head++) {
- push @ids, $head->get_docid;
+sub xdb_remove {
+ my ($self, $oid, @removed) = @_;
+ my $xdb = $self->{xdb} or return;
+ for my $num (@removed) {
+ my $doc = eval { $xdb->get_document($num) };
+ unless ($doc) {
+ warn "E: $@\n" if $@;
+ warn "E: #$num $oid missing in Xapian\n";
+ next;
}
- $cb->(\@ids);
- }
-}
-
-# v1 only, where $mid is unique
-sub remove_message {
- my ($self, $mid) = @_;
- $mid = mid_clean($mid);
-
- if (my $over = $self->{over}) {
- my $nr = eval { $over->remove_oid(undef, $mid) };
- if ($@) {
- warn "failed to remove <$mid> from overview: $@\n";
- } elsif ($nr == 0) {
- warn "<$mid> missing for removal from overview\n";
+ my $smsg = bless {}, 'PublicInbox::Smsg';
+ $smsg->load_expand($doc);
+ my $blob = $smsg->{blob} // '(unset)';
+ if ($blob eq $oid) {
+ $xdb->delete_document($num);
+ } else {
+ warn "E: #$num $oid != $blob in Xapian\n";
}
}
- return unless need_xapian($self);
- my $db = $self->{xdb};
- my $nr = 0;
- eval {
- batch_do($self, 'Q' . $mid, sub {
- my ($ids) = @_;
- $db->delete_document($_) for @$ids;
- $nr += scalar @$ids;
- });
- };
- if ($@) {
- warn "failed to remove <$mid> from Xapian: $@\n";
- } elsif ($nr == 0) {
- warn "<$mid> missing for removal from Xapian\n";
- }
}
-# MID is a hint in V2
sub remove_by_oid {
- my ($self, $oid, $mid) = @_;
-
- $self->{over}->remove_oid($oid, $mid) if $self->{over};
-
- return unless need_xapian($self);
- my $db = $self->{xdb};
-
- # XXX careful, we cannot use batch_do here since we conditionally
- # delete documents based on other factors, so we cannot call
- # find_doc_ids twice.
- my ($head, $tail) = $self->find_doc_ids('Q' . $mid);
- return if $head == $tail;
-
- # there is only ONE element in @delete unless we
- # have bugs in our v2writable deduplication check
- my @delete;
- for (; $head != $tail; $head++) {
- my $docid = $head->get_docid;
- my $doc = $db->get_document($docid);
- my $smsg = bless { mid => $mid }, 'PublicInbox::Smsg';
- $smsg->load_expand($doc);
- if ($smsg->{blob} eq $oid) {
- push(@delete, $docid);
- }
- }
- $db->delete_document($_) foreach @delete;
- scalar(@delete);
+ my ($self, $oid, $num) = @_;
+ die "BUG: remove_by_oid is v2-only\n" if $self->{over};
+ $self->begin_txn_lazy;
+ xdb_remove($self, $oid, $num) if need_xapian($self);
}
sub index_git_blob_id {
@@ -507,10 +451,29 @@ sub index_git_blob_id {
}
}
-sub unindex_blob {
- my ($self, $mime) = @_;
- my $mid = eval { mid_mime($mime) };
- $self->remove_message($mid) if defined $mid;
+# v1 only
+sub unindex_eml {
+ my ($self, $oid, $eml) = @_;
+ my $mids = mids($eml);
+ my $nr = 0;
+ my %tmp;
+ for my $mid (@$mids) {
+ my @removed = eval { $self->{over}->remove_oid($oid, $mid) };
+ if ($@) {
+ warn "E: failed to remove <$mid> from overview: $@\n";
+ } else {
+ $nr += scalar @removed;
+ $tmp{$_}++ for @removed;
+ }
+ }
+ if (!$nr) {
+ $mids = join('> <', @$mids);
+ warn "W: <$mids> missing for removal from overview\n";
+ }
+ while (my ($num, $nr) = each %tmp) {
+ warn "BUG: $num appears >1 times ($nr) for $oid\n" if $nr != 1;
+ }
+ xdb_remove($self, $oid, keys %tmp) if need_xapian($self);
}
sub index_mm {
@@ -577,7 +540,7 @@ sub index_both { # git->cat_async callback
sub unindex_both { # git->cat_async callback
my ($bref, $oid, $type, $size, $self) = @_;
my $eml = PublicInbox::Eml->new($bref);
- unindex_blob($self, $eml);
+ unindex_eml($self, $oid, $eml);
unindex_mm($self, $eml);
}
@@ -844,13 +807,12 @@ sub remote_close {
}
sub remote_remove {
- my ($self, $oid, $mid) = @_;
+ my ($self, $oid, $num) = @_;
if (my $w = $self->{w}) {
# triggers remove_by_oid in a shard
- print $w "D $oid $mid\n" or die "failed to write remove $!";
+ print $w "D $oid $num\n" or die "failed to write remove $!";
} else {
- $self->begin_txn_lazy;
- $self->remove_by_oid($oid, $mid);
+ $self->remove_by_oid($oid, $num);
}
}
diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm
index b51d148b..54426881 100644
--- a/lib/PublicInbox/SearchIdxShard.pm
+++ b/lib/PublicInbox/SearchIdxShard.pm
@@ -62,10 +62,8 @@ sub shard_worker_loop ($$$$$) {
# no need to lock < 512 bytes is atomic under POSIX
print $bnote "barrier $shard\n" or
die "write failed for barrier $!\n";
- } elsif ($line =~ /\AD ([a-f0-9]{40,}) (.+)\n\z/s) {
- my ($oid, $mid) = ($1, $2);
- $self->begin_txn_lazy;
- $self->remove_by_oid($oid, $mid);
+ } elsif ($line =~ /\AD ([a-f0-9]{40,}) ([0-9]+)\n\z/s) {
+ $self->remove_by_oid($1, $2 + 0);
} else {
chomp $line;
# n.b. $mid may contain spaces(!)
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index b51c8525..dffe90d8 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -1185,8 +1185,11 @@ sub sync_prepare ($$$) {
sub unindex_oid_remote ($$$) {
my ($self, $oid, $mid) = @_;
- $_->remote_remove($oid, $mid) foreach @{$self->{idx_shards}};
- $self->{over}->remove_oid($oid, $mid);
+ my @removed = $self->{over}->remove_oid($oid, $mid);
+ for my $num (@removed) {
+ my $idx = idx_shard($self, $num % $self->{shards});
+ $idx->remote_remove($oid, $num);
+ }
}
sub unindex_oid ($$$;$) {
diff --git a/t/search.t b/t/search.t
index 82caf9e4..aa6f94bf 100644
--- a/t/search.t
+++ b/t/search.t
@@ -397,7 +397,9 @@ $ibx->with_umask(sub {
$ibx->with_umask(sub {
my $amsg = eml_load 't/search-amsg.eml';
- ok($rw->add_message($amsg), 'added attachment');
+ my $oid = ('0'x40);
+ my $smsg = bless { blob => $oid }, 'PublicInbox::Smsg';
+ ok($rw->add_message($amsg, $smsg), 'added attachment');
$rw_commit->();
$ro->reopen;
my $n = $ro->query('n:attached_fart.txt');
@@ -418,7 +420,7 @@ $ibx->with_umask(sub {
$art = $ro->{over_ro}->next_by_mid($mid, \$id, \$prev);
ok($art, 'article exists in OVER DB');
}
- $rw->unindex_blob($amsg);
+ $rw->unindex_eml($oid, $amsg);
$rw->commit_txn_lazy;
SKIP: {
skip('$art not defined', 1) unless defined $art;
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [9/8 PATCH] v2writable: git_hash_raw: avoid $TMPDIR write
2020-07-17 6:31 [PATCH 0/8] indexing cleanup and code reduction Eric Wong
` (7 preceding siblings ...)
2020-07-17 6:31 ` [PATCH 8/8] search: simplify unindexing Eric Wong
@ 2020-07-17 7:25 ` Eric Wong
8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2020-07-17 7:25 UTC (permalink / raw)
To: meta
We can rely on FD_CLOEXEC being set by default (since Perl 5.6+)
on pipes to avoid FS/page-cache traffic, here. We also know
"git hash-object" won't output anything until it's consumed all
of its standard input; so there's no danger of a deadlock even
in the the unlikely case git uses a hash that can't fit into
PIPE_BUF :P
---
lib/PublicInbox/V2Writable.pm | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index dffe90d8c..0582dd5e3 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -482,14 +482,12 @@ sub purge {
sub git_hash_raw ($$) {
my ($self, $raw) = @_;
# grab the expected OID we have to reindex:
- open my $tmp_fh, '+>', undef or die "failed to open tmp: $!";
- $tmp_fh->autoflush(1);
- print $tmp_fh $$raw or die "print \$tmp_fh: $!";
- sysseek($tmp_fh, 0, 0) or die "seek failed: $!";
-
+ pipe(my($in, $w)) or die "pipe: $!";
my $git_dir = $self->{-inbox}->git->{git_dir};
my $cmd = ['git', "--git-dir=$git_dir", qw(hash-object --stdin)];
- my $r = popen_rd($cmd, undef, { 0 => $tmp_fh });
+ my $r = popen_rd($cmd, undef, { 0 => $in });
+ print $w $$raw or die "print \$w: $!";
+ close $w or die "close \$w: $!";
local $/ = "\n";
chomp(my $oid = <$r>);
close $r or die "git hash-object failed: $?";
^ permalink raw reply related [flat|nested] 10+ messages in thread