* [PATCH 0/8] extindex and then some...
@ 2021-10-10 14:25 Eric Wong
2021-10-10 14:25 ` [PATCH 1/8] lei_to_mail: show --output on augment progress failure Eric Wong
` (7 more replies)
0 siblings, 8 replies; 9+ messages in thread
From: Eric Wong @ 2021-10-10 14:25 UTC (permalink / raw)
To: meta
One notable fix for -extindex --gc, a couple of minor things
here and there. Still need to speed up --reindex...
Eric Wong (8):
lei_to_mail: show --output on augment progress failure
admin: add '# ' prefix for progress messages
set nodatacow on more SQLite files
extindex: speed up Xapian cleanup in --gc
extindex: minor cost reductions
extindex: --gc doesn't touch ghost entries
lei/store: keep ".err-XXXX" in stderr tmpfile
extindex: sync each inbox before checking for missed messages
lib/PublicInbox/Admin.pm | 2 +-
lib/PublicInbox/ExtSearchIdx.pm | 51 +++++++++++++++++++++------------
lib/PublicInbox/LeiStore.pm | 2 +-
lib/PublicInbox/LeiToMail.pm | 2 +-
lib/PublicInbox/Over.pm | 4 ++-
lib/PublicInbox/SearchIdx.pm | 3 ++
lib/PublicInbox/SharedKV.pm | 3 +-
7 files changed, 43 insertions(+), 24 deletions(-)
* [PATCH 1/8] lei_to_mail: show --output on augment progress failure
2021-10-10 14:25 [PATCH 0/8] extindex and then some Eric Wong
@ 2021-10-10 14:25 ` Eric Wong
2021-10-10 14:25 ` [PATCH 2/8] admin: add '# ' prefix for progress messages Eric Wong
` (6 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Eric Wong @ 2021-10-10 14:25 UTC (permalink / raw)
To: meta
Just in case it fails when there are many parallel invocations.
---
lib/PublicInbox/LeiToMail.pm | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/PublicInbox/LeiToMail.pm b/lib/PublicInbox/LeiToMail.pm
index d42759cf..5a220ba3 100644
--- a/lib/PublicInbox/LeiToMail.pm
+++ b/lib/PublicInbox/LeiToMail.pm
@@ -796,7 +796,7 @@ sub augment_inprogress {
"scanning old contents of $dst for dedupe" :
"removing old contents of $dst")." ...\n";
};
- warn "E: $@" if $@;
+ warn "E: $@ ($dst)" if $@;
}
# called in top-level lei-daemon when LeiAuth is done
* [PATCH 2/8] admin: add '# ' prefix for progress messages
2021-10-10 14:25 [PATCH 0/8] extindex and then some Eric Wong
2021-10-10 14:25 ` [PATCH 1/8] lei_to_mail: show --output on augment progress failure Eric Wong
@ 2021-10-10 14:25 ` Eric Wong
2021-10-10 14:25 ` [PATCH 3/8] set nodatacow on more SQLite files Eric Wong
` (5 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Eric Wong @ 2021-10-10 14:25 UTC (permalink / raw)
To: meta
It's more consistent with TAP output and hopefully puts
users at ease in case they don't understand the meaning
of a message.
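
A minimal sketch of the idea, assuming a hypothetical helper named
make_progress (it is not part of public-inbox): the callback prefixes
every message with '# ', the way TAP marks diagnostic lines, and the
result is captured in an in-memory filehandle for inspection.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only: build a progress callback
# that prefixes each message with '# ' before printing it to $dst.
sub make_progress {
    my ($dst) = @_;
    $dst //= *STDERR{GLOB};
    return sub { print $dst '# ', @_ };
}

# Capture the output in an in-memory filehandle instead of STDERR.
my $buf = '';
open(my $fh, '>', \$buf) or die "open: $!";
my $progress = make_progress($fh);
$progress->("indexing inbox\n");
close($fh) or die "close: $!";
print $buf;
```

Because the prefix is added inside the callback, callers keep passing
plain messages and do not need to know about the output convention.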
---
lib/PublicInbox/Admin.pm | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/PublicInbox/Admin.pm b/lib/PublicInbox/Admin.pm
index a17a632c..11ea8f83 100644
--- a/lib/PublicInbox/Admin.pm
+++ b/lib/PublicInbox/Admin.pm
@@ -320,7 +320,7 @@ sub progress_prepare ($;$) {
} else {
$opt->{verbose} ||= 1;
$dst //= *STDERR{GLOB};
- $opt->{-progress} = sub { print $dst @_ };
+ $opt->{-progress} = sub { print $dst '# ', @_ };
}
}
* [PATCH 3/8] set nodatacow on more SQLite files
2021-10-10 14:25 [PATCH 0/8] extindex and then some Eric Wong
2021-10-10 14:25 ` [PATCH 1/8] lei_to_mail: show --output on augment progress failure Eric Wong
2021-10-10 14:25 ` [PATCH 2/8] admin: add '# ' prefix for progress messages Eric Wong
@ 2021-10-10 14:25 ` Eric Wong
2021-10-10 14:25 ` [PATCH 4/8] extindex: speed up Xapian cleanup in --gc Eric Wong
` (4 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Eric Wong @ 2021-10-10 14:25 UTC (permalink / raw)
To: meta
We'll set nodatacow when detecting existing but empty
files, and also their directories in more cases (for
auxiliary -wal, -journal, -shm files). Hopefully
this keeps performance reasonable on CoW FSes.
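
The switch from -f to -s in the diff below is what lets an existing but
empty file take the same path as a missing one. A small sketch of the
difference between the two file tests, using a temporary directory and a
hypothetical file name (the nodatacow calls themselves live in
PublicInbox::Spawn and are not reproduced here):

```perl
use strict;
use warnings;
use File::Temp ();

my $dir = File::Temp->newdir;       # removed automatically at exit
my $f = "$dir/example.sqlite3";     # hypothetical file name

# Create the file without writing anything to it.
open(my $fh, '+>>', $f) or die "failed to open $f: $!";
close($fh) or die "close: $!";

my $exists_f = -f $f ? 1 : 0;   # true: it is a regular file
my $nonempty = -s $f ? 1 : 0;   # false: the file has zero bytes
print "exists=$exists_f nonempty=$nonempty\n";
```

With "!-f", the empty file above would be skipped; with "!-s", it is
still treated as a file that needs setting up.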
---
lib/PublicInbox/Over.pm | 4 +++-
lib/PublicInbox/SharedKV.pm | 3 ++-
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/lib/PublicInbox/Over.pm b/lib/PublicInbox/Over.pm
index 19da056a..98de82c0 100644
--- a/lib/PublicInbox/Over.pm
+++ b/lib/PublicInbox/Over.pm
@@ -16,9 +16,11 @@ use constant DEFAULT_LIMIT => 1000;
sub dbh_new {
my ($self, $rw) = @_;
my $f = delete $self->{filename};
- if (!-f $f) { # SQLite defaults mode to 0644, we want 0666
+ if (!-s $f) { # SQLite defaults mode to 0644, we want 0666
if ($rw) {
require PublicInbox::Spawn;
+ my ($dir) = ($f =~ m!(.+)/[^/]+\z!);
+ PublicInbox::Spawn::nodatacow_dir($dir);
open my $fh, '+>>', $f or die "failed to open $f: $!";
PublicInbox::Spawn::nodatacow_fd(fileno($fh));
} else {
diff --git a/lib/PublicInbox/SharedKV.pm b/lib/PublicInbox/SharedKV.pm
index 645bb57c..398f4ca8 100644
--- a/lib/PublicInbox/SharedKV.pm
+++ b/lib/PublicInbox/SharedKV.pm
@@ -51,7 +51,8 @@ sub new {
$base //= '';
my $f = $self->{filename} = "$dir/$base.sqlite3";
$self->{lock_path} = $opt->{lock_path} // "$dir/$base.flock";
- unless (-f $f) {
+ unless (-s $f) {
+ PublicInbox::Spawn::nodatacow_dir($dir); # for journal/shm/wal
open my $fh, '+>>', $f or die "failed to open $f: $!";
PublicInbox::Spawn::nodatacow_fd(fileno($fh));
}
* [PATCH 4/8] extindex: speed up Xapian cleanup in --gc
2021-10-10 14:25 [PATCH 0/8] extindex and then some Eric Wong
` (2 preceding siblings ...)
2021-10-10 14:25 ` [PATCH 3/8] set nodatacow on more SQLite files Eric Wong
@ 2021-10-10 14:25 ` Eric Wong
2021-10-10 14:25 ` [PATCH 5/8] extindex: minor cost reductions Eric Wong
` (3 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Eric Wong @ 2021-10-10 14:25 UTC (permalink / raw)
To: meta
Avoiding repeated SQL statements brings --gc down to 2-3 minutes
from around 10. We'll also add some checkpoints around over and
xref3 cleanups.
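
The batched scan can be sketched without a database. In this
illustration, fetch_batch and find_gaps are hypothetical names, and a
sorted Perl array stands in for the "over" table; the real code issues a
LIMIT query and asks the Xapian shards to remove each missing docid.

```perl
use strict;
use warnings;

# Stand-in for "SELECT num FROM over WHERE num >= ? ORDER BY num ASC
# LIMIT ?"; @$over must already be sorted in ascending order.
sub fetch_batch {
    my ($over, $cur, $limit) = @_;
    my @r = grep { $_ >= $cur } @$over;
    splice(@r, $limit) if @r > $limit;
    return \@r;
}

# Return every number between the smallest and largest entries of
# @$over that is absent from @$over.
sub find_gaps {
    my ($over, $limit) = @_;
    my $cur = $over->[0] // return []; # empty
    my @gaps;
    while (1) {
        my $r = fetch_batch($over, $cur, $limit);
        last unless @$r;
        for my $n (@$r) {
            push @gaps, $cur..($n - 1); # numbers skipped before $n
            $cur = $n + 1;
        }
    }
    return \@gaps;
}

my $gaps = find_gaps([1, 2, 5, 6, 9], 2);
print "@$gaps\n";
```

Each batch costs one query regardless of how many numbers it covers,
whereas the old loop ran one COUNT query per candidate number.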
---
lib/PublicInbox/ExtSearchIdx.pm | 37 ++++++++++++++++++++-------------
lib/PublicInbox/SearchIdx.pm | 3 +++
2 files changed, 26 insertions(+), 14 deletions(-)
diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index 20c4cf78..04948b8b 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -421,34 +421,43 @@ sub eidx_gc_scan_shards ($$) { # TODO: use for lei/store
DELETE FROM xref3 WHERE docid NOT IN (SELECT num FROM over)
warn "I: eliminated $nr stale xref3 entries\n" if $nr != 0;
+ reindex_checkpoint($self, $sync) if checkpoint_due($sync);
# fixup from old bugs:
$nr = $self->{oidx}->dbh->do(<<'');
DELETE FROM over WHERE num NOT IN (SELECT docid FROM xref3)
warn "I: eliminated $nr stale over entries\n" if $nr != 0;
+ reindex_checkpoint($self, $sync) if checkpoint_due($sync);
my ($cur) = $self->{oidx}->dbh->selectrow_array(<<EOM);
SELECT MIN(num) FROM over
EOM
- my ($max) = $self->{oidx}->dbh->selectrow_array(<<EOM);
-SELECT MAX(num) FROM over
-EOM
- my $exists;
-restart:
- $exists = $self->{oidx}->dbh->prepare(<<EOM);
-SELECT COUNT(num) FROM over WHERE num = ?
-EOM
- for (; $cur <= $max; $cur++) {
- $exists->execute($cur);
- next if $exists->fetchrow_array != 0;
- $self->idx_shard($cur)->ipc_do('xdb_remove_quiet', $cur);
+ $cur // return; # empty
+ my ($r, $n, %active);
+ $nr = 0;
+ while (1) {
+ $r = $self->{oidx}->dbh->selectcol_arrayref(<<"", undef, $cur);
+SELECT num FROM over WHERE num >= ? ORDER BY num ASC LIMIT 10000
+
+ last unless scalar(@$r);
+ while (defined($n = shift @$r)) {
+ for my $i ($cur..($n - 1)) {
+ my $idx = idx_shard($self, $i);
+ $idx->ipc_do('xdb_remove_quiet', $i);
+ $active{$idx} = $idx;
+ }
+ $cur = $n + 1;
+ }
if (checkpoint_due($sync)) {
- $exists = undef;
+ for my $idx (values %active) {
+ $nr += $idx->ipc_do('nr_quiet_rm')
+ }
+ %active = ();
reindex_checkpoint($self, $sync);
- goto restart;
}
}
+ warn "I: eliminated $nr stale Xapian documents\n" if $nr != 0;
}
sub eidx_gc {
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 78db329d..bebe904b 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -650,8 +650,11 @@ sub xdb_remove_quiet {
begin_txn_lazy($self);
my $xdb = $self->{xdb} // die 'BUG: missing {xdb}';
eval { $xdb->delete_document($docid) };
+ ++$self->{-quiet_rm} unless $@;
}
+sub nr_quiet_rm { delete($_[0]->{-quiet_rm}) // 0 }
+
sub index_git_blob_id {
my ($doc, $pfx, $objid) = @_;
* [PATCH 5/8] extindex: minor cost reductions
2021-10-10 14:25 [PATCH 0/8] extindex and then some Eric Wong
` (3 preceding siblings ...)
2021-10-10 14:25 ` [PATCH 4/8] extindex: speed up Xapian cleanup in --gc Eric Wong
@ 2021-10-10 14:25 ` Eric Wong
2021-10-10 14:25 ` [PATCH 6/8] extindex: --gc doesn't touch ghost entries Eric Wong
` (2 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Eric Wong @ 2021-10-10 14:25 UTC (permalink / raw)
To: meta
Don't bother decoding the 20-byte SHA-1 to a 40-byte hex value
since we don't read it, anyways. We can also use the on-stack
ibx->eidx_key value instead of dispatching the method again.
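
For reference, the conversion being skipped is presumably of this kind:
expanding a raw 20-byte SHA-1 digest into its 40-character hex form. The
sketch uses Digest::SHA and unpack from core Perl; the variable names are
illustrative and not taken from the patch.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1 sha1_hex);

my $oidbin = sha1('hello');            # raw digest, 20 bytes
my $oidhex = unpack('H*', $oidbin);    # hex form, 40 characters
printf "%d %d\n", length($oidbin), length($oidhex);
```

The work per row is small, but it is wasted whenever the caller never
looks at the hex value.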
---
lib/PublicInbox/ExtSearchIdx.pm | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index 04948b8b..42488e12 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -902,15 +902,14 @@ DELETE FROM xref3 WHERE ibx_id = ? AND xnum = ? AND oidbin = ?
$del->execute;
# get_xref3 over-fetches, but this is a rare path:
- my $xr3 = $self->{oidx}->get_xref3($docid);
+ my $xr3 = $self->{oidx}->get_xref3($docid, 1);
my $idx = $self->idx_shard($docid);
if (scalar(@$xr3) == 0) { # all gone
$self->{oidx}->delete_by_num($docid);
$self->{oidx}->eidxq_del($docid);
$idx->ipc_do('xdb_remove', $docid);
} else { # enqueue for reindex of remaining messages
- $idx->ipc_do('remove_eidx_info',
- $docid, $ibx->eidx_key);
+ $idx->ipc_do('remove_eidx_info', $docid, $ekey);
$self->{oidx}->eidxq_add($docid); # yes, add
}
}
* [PATCH 6/8] extindex: --gc doesn't touch ghost entries
2021-10-10 14:25 [PATCH 0/8] extindex and then some Eric Wong
` (4 preceding siblings ...)
2021-10-10 14:25 ` [PATCH 5/8] extindex: minor cost reductions Eric Wong
@ 2021-10-10 14:25 ` Eric Wong
2021-10-10 14:25 ` [PATCH 7/8] lei/store: keep ".err-XXXX" in stderr tmpfile Eric Wong
2021-10-10 14:25 ` [PATCH 8/8] extindex: sync each inbox before checking for missed messages Eric Wong
7 siblings, 0 replies; 9+ messages in thread
From: Eric Wong @ 2021-10-10 14:25 UTC (permalink / raw)
To: meta
We were deleting ghost entries. This was usually harmless since
other messages could fill in the blanks, but it could cause
misthreading in odd cases where a big chunk of a thread is
missing and the latest messages only referenced ghosts.
We'll also save some cycles when scanning Xapian shards since
docids won't be <= 0.
---
lib/PublicInbox/ExtSearchIdx.pm | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index 42488e12..acf35e3d 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -425,13 +425,13 @@ DELETE FROM xref3 WHERE docid NOT IN (SELECT num FROM over)
# fixup from old bugs:
$nr = $self->{oidx}->dbh->do(<<'');
-DELETE FROM over WHERE num NOT IN (SELECT docid FROM xref3)
+DELETE FROM over WHERE num > 0 AND num NOT IN (SELECT docid FROM xref3)
warn "I: eliminated $nr stale over entries\n" if $nr != 0;
reindex_checkpoint($self, $sync) if checkpoint_due($sync);
my ($cur) = $self->{oidx}->dbh->selectrow_array(<<EOM);
-SELECT MIN(num) FROM over
+SELECT MIN(num) FROM over WHERE num > 0
EOM
$cur // return; # empty
my ($r, $n, %active);
* [PATCH 7/8] lei/store: keep ".err-XXXX" in stderr tmpfile
2021-10-10 14:25 [PATCH 0/8] extindex and then some Eric Wong
` (5 preceding siblings ...)
2021-10-10 14:25 ` [PATCH 6/8] extindex: --gc doesn't touch ghost entries Eric Wong
@ 2021-10-10 14:25 ` Eric Wong
2021-10-10 14:25 ` [PATCH 8/8] extindex: sync each inbox before checking for missed messages Eric Wong
7 siblings, 0 replies; 9+ messages in thread
From: Eric Wong @ 2021-10-10 14:25 UTC (permalink / raw)
To: meta
This is slightly more meaningful since the file is already
in ~/.local/share/lei/store, so "lei_store" was redundant
(and the "XXXX" are random characters replaced by File::Temp).
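
A short sketch of how the new template expands, assuming a temporary
directory as a stand-in for ~/.local/share/lei/store:

```perl
use strict;
use warnings;
use File::Temp ();
use POSIX ();

my $dir = File::Temp->newdir;   # stand-in for the lei store directory
my $pfx = POSIX::strftime('%Y%m%d%H%M%S', gmtime(time));

# File::Temp replaces the trailing XXXX and then appends SUFFIX.
my $err = File::Temp->new(TEMPLATE => "$pfx.$$.err-XXXX",
                SUFFIX => '.err', DIR => "$dir");
my $name = $err->filename;
print "$name\n";
```

The resulting name consists of the timestamp, the PID, "err-" followed
by four random characters, and the ".err" suffix.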
---
lib/PublicInbox/LeiStore.pm | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/PublicInbox/LeiStore.pm b/lib/PublicInbox/LeiStore.pm
index 52a1456f..613d1d31 100644
--- a/lib/PublicInbox/LeiStore.pm
+++ b/lib/PublicInbox/LeiStore.pm
@@ -512,7 +512,7 @@ sub xchg_stderr {
return unless -e $dir;
my $old = delete $self->{-tmp_err};
my $pfx = POSIX::strftime('%Y%m%d%H%M%S', gmtime(time));
- my $err = File::Temp->new(TEMPLATE => "$pfx.$$.lei_storeXXXX",
+ my $err = File::Temp->new(TEMPLATE => "$pfx.$$.err-XXXX",
SUFFIX => '.err', DIR => $dir);
open STDERR, '>>', $err->filename or die "dup2: $!";
STDERR->autoflush(1); # shared with shard subprocesses
* [PATCH 8/8] extindex: sync each inbox before checking for missed messages
2021-10-10 14:25 [PATCH 0/8] extindex and then some Eric Wong
` (6 preceding siblings ...)
2021-10-10 14:25 ` [PATCH 7/8] lei/store: keep ".err-XXXX" in stderr tmpfile Eric Wong
@ 2021-10-10 14:25 ` Eric Wong
7 siblings, 0 replies; 9+ messages in thread
From: Eric Wong @ 2021-10-10 14:25 UTC (permalink / raw)
To: meta
Otherwise, it gets too noisy and we repeat some work
when we do an actual sync, since the last_commit info
will be out-of-date.
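
The diff also clamps the upper bound of each scan window to the largest
article number in the inbox. A simplified sketch of that windowing,
assuming every number from 1 to $max exists (the real loop advances from
the last message it actually fetched); the sub name "slices" is
hypothetical:

```perl
use strict;
use warnings;

# Walk 1..$max in windows of roughly $slice numbers, never letting
# the upper bound run past $max.
sub slices {
    my ($max, $slice) = @_;
    my ($beg, $end) = (1, $slice);
    $end = $max if $end > $max;
    my @out;
    while ($beg <= $max) {
        push @out, [ $beg, $end ];
        $beg = $end + 1;
        $end = $beg + $slice;
        $end = $max if $end > $max;
    }
    return \@out;
}

my $windows = join(' ', map { "$_->[0]-$_->[1]" } @{slices(2500, 1000)});
print "$windows\n";
```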
---
lib/PublicInbox/ExtSearchIdx.pm | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index acf35e3d..d589d2c0 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -812,6 +812,9 @@ sub _reindex_check_unseen ($$$) {
my $ibx_id = $ibx->{-ibx_id};
my $slice = 1000;
my ($beg, $end) = (1, $slice);
+ my $err = sync_inbox($self, $sync, $ibx) and return;
+ my $max = $ibx->over->max;
+ $end = $max if $end > $max;
# first, check if we missed any messages in target $ibx
my $msgs;
@@ -825,6 +828,7 @@ sub _reindex_check_unseen ($$$) {
${$sync->{nr}} = $beg;
$beg = $msgs->[-1]->{num} + 1;
$end = $beg + $slice;
+ $end = $max if $end > $max;
if (checkpoint_due($sync)) {
reindex_checkpoint($self, $sync); # release lock
}
@@ -952,6 +956,7 @@ sub sync_inbox {
my $err = _sync_inbox($self, $sync, $ibx);
delete @$ibx{qw(mm over)};
warn $err, "\n" if defined($err);
+ $err;
}
sub dd_smsg { # git->cat_async callback