unofficial mirror of meta@public-inbox.org
* [PATCH 0/4] lei reindex, minor tweaks
@ 2022-08-17  9:33 Eric Wong
  2022-08-17  9:33 ` [PATCH 1/4] searchidx: fix spelling error in comment Eric Wong
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Eric Wong @ 2022-08-17  9:33 UTC
  To: meta

Reindex is far from complete: it probably needs a compact step, plus
some other fixups for old data and rethread support.

But avoiding false positives from base-85 is nice.

Eric Wong (4):
  searchidx: fix spelling error in comment
  lei inspect: less scary exception for invalid "docid:" inspect
  lei/store: reduce work when accessing mail_sync.sqlite3
  lei reindex: new command to reindex lei/store

 Documentation/lei-reindex.pod | 47 +++++++++++++++++++++++++++++++++
 MANIFEST                      |  2 ++
 lib/PublicInbox/LEI.pm        |  2 ++
 lib/PublicInbox/LeiInspect.pm |  5 ++--
 lib/PublicInbox/LeiReindex.pm | 49 +++++++++++++++++++++++++++++++++++
 lib/PublicInbox/LeiStore.pm   | 38 ++++++++++++++++++++++++---
 lib/PublicInbox/SearchIdx.pm  |  2 +-
 7 files changed, 138 insertions(+), 7 deletions(-)
 create mode 100644 Documentation/lei-reindex.pod
 create mode 100644 lib/PublicInbox/LeiReindex.pm

* [PATCH 1/4] searchidx: fix spelling error in comment
  2022-08-17  9:33 [PATCH 0/4] lei reindex, minor tweaks Eric Wong
@ 2022-08-17  9:33 ` Eric Wong
  2022-08-17  9:33 ` [PATCH 2/4] lei inspect: less scary exception for invalid "docid:" inspect Eric Wong
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Eric Wong @ 2022-08-17  9:33 UTC
  To: meta

---
 lib/PublicInbox/SearchIdx.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index bdb84fc7..257b83a5 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -453,7 +453,7 @@ sub eml2doc ($$$;$) {
 	index_ids($self, $doc, $eml, $mids);
 
 	# by default, we maintain compatibility with v1.5.0 and earlier
-	# by writing to docdata.glass, users who never exect to downgrade can
+	# by writing to docdata.glass, users who never expect to downgrade can
 	# use --skip-docdata
 	if (!$self->{-skip_docdata}) {
 		# WWW doesn't need {to} or {cc}, only NNTP

* [PATCH 2/4] lei inspect: less scary exception for invalid "docid:" inspect
  2022-08-17  9:33 [PATCH 0/4] lei reindex, minor tweaks Eric Wong
  2022-08-17  9:33 ` [PATCH 1/4] searchidx: fix spelling error in comment Eric Wong
@ 2022-08-17  9:33 ` Eric Wong
  2022-08-17  9:33 ` [PATCH 3/4] lei/store: reduce work when accessing mail_sync.sqlite3 Eric Wong
  2022-08-17  9:33 ` [PATCH 4/4] lei reindex: new command to reindex lei/store Eric Wong
  3 siblings, 0 replies; 6+ messages in thread
From: Eric Wong @ 2022-08-17  9:33 UTC
  To: meta

It still says "Exception:", but doesn't pointlessly print out
the line number and file of the exception when it's a data/input
problem, and not a code problem on our end.
---
 lib/PublicInbox/LeiInspect.pm | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/lib/PublicInbox/LeiInspect.pm b/lib/PublicInbox/LeiInspect.pm
index d7775d4b..d1dca4ef 100644
--- a/lib/PublicInbox/LeiInspect.pm
+++ b/lib/PublicInbox/LeiInspect.pm
@@ -1,4 +1,4 @@
-# Copyright (C) 2021 all contributors <meta@public-inbox.org>
+# Copyright (C) all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 
 # "lei inspect" general purpose inspector for stuff in SQLite and
@@ -235,7 +235,8 @@ sub inspect_argv { # via wq_do
 	$lei->{1}->autoflush(0);
 	$lei->out('[') if $multi;
 	while (defined(my $x = shift @$argv)) {
-		inspect1($lei, $x, scalar(@$argv)) or return;
+		eval { inspect1($lei, $x, scalar(@$argv)) or return };
+		warn "E: $@\n" if $@;
 	}
 	$lei->out(']') if $multi;
 }
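
The loop change above wraps each per-argument call in eval so that one
bad argument is reported and the rest are still processed.  Below is a
minimal standalone sketch of that pattern.  inspect_one() and
inspect_all() are hypothetical stand-ins, not functions from
public-inbox, and the error text is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for inspect1(): dies on bad input.  The
# message ends in "\n", so Perl does not append " at FILE line N.".
sub inspect_one {
	my ($x) = @_;
	die "Exception: bad docid: $x\n" if $x !~ /\A[0-9]+\z/;
	"docid:$x";
}

# Process every argument; a failure on one is recorded and skipped.
sub inspect_all {
	my (@argv) = @_;
	my (@out, @err);
	while (defined(my $x = shift @argv)) {
		my $res = eval { inspect_one($x) };
		if ($@) {
			push @err, "E: $@"; # $@ already ends in "\n" here
		} else {
			push @out, $res;
		}
	}
	(\@out, \@err);
}

my ($out, $err) = inspect_all(qw(1 oops 3));
print "ok: @$out\n"; # prints "ok: docid:1 docid:3"
print STDERR @$err;
```

Keep in mind that in Perl, "return" inside an eval block leaves only
the eval block and not the enclosing sub, so a loop written this way
carries on with the next argument after a failure.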

* [PATCH 3/4] lei/store: reduce work when accessing mail_sync.sqlite3
  2022-08-17  9:33 [PATCH 0/4] lei reindex, minor tweaks Eric Wong
  2022-08-17  9:33 ` [PATCH 1/4] searchidx: fix spelling error in comment Eric Wong
  2022-08-17  9:33 ` [PATCH 2/4] lei inspect: less scary exception for invalid "docid:" inspect Eric Wong
@ 2022-08-17  9:33 ` Eric Wong
  2022-08-17  9:33 ` [PATCH 4/4] lei reindex: new command to reindex lei/store Eric Wong
  3 siblings, 0 replies; 6+ messages in thread
From: Eric Wong @ 2022-08-17  9:33 UTC
  To: meta

There's no need to initialize eidx if we already have an open
handle for mail_sync.sqlite3.
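
The idea of skipping setup work when a handle is already cached can be
sketched in isolation as follows.  expensive_init() and handle() are
hypothetical names, and the hash standing in for the handle is only a
placeholder; this shows the general defined-or idiom and not the
LeiStore code itself.

```perl
use strict;
use warnings;

my $init_calls = 0; # counts how often the costly setup runs

# Hypothetical stand-in for the setup work we want to skip.
sub expensive_init { $init_calls++ }

# Return the cached handle if defined; otherwise set up and cache it.
sub handle {
	my ($self) = @_;
	$self->{h} // do {
		expensive_init(); # only reached when nothing is cached
		my $h = { path => 'mail_sync.sqlite3' }; # placeholder
		$self->{h} = $h; # last statement: value of the do block
	};
}

my $self = {};
my $first = handle($self);
my $second = handle($self);
print "same handle\n" if $first == $second;
print "init calls: $init_calls\n"; # prints "init calls: 1"
```

With "//=" and the setup call placed before it, the setup would run on
every call; moving it inside the do block confines it to the first
call.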
---
 lib/PublicInbox/LeiStore.pm | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/LeiStore.pm b/lib/PublicInbox/LeiStore.pm
index 66049dfe..d49746cb 100644
--- a/lib/PublicInbox/LeiStore.pm
+++ b/lib/PublicInbox/LeiStore.pm
@@ -255,13 +255,13 @@ sub remove_eml_vmd { # remove just the VMD
 
 sub _lms_rw ($) { # it is important to have eidx processes open before lms
 	my ($self) = @_;
-	my ($eidx, $tl) = eidx_init($self);
-	$self->{lms} //= do {
+	$self->{lms} // do {
 		require PublicInbox::LeiMailSync;
+		my ($eidx, $tl) = eidx_init($self);
 		my $f = "$self->{priv_eidx}->{topdir}/mail_sync.sqlite3";
 		my $lms = PublicInbox::LeiMailSync->new($f);
 		$lms->lms_write_prepare;
-		$lms;
+		$self->{lms} = $lms;
 	};
 }
 

* [PATCH 4/4] lei reindex: new command to reindex lei/store
  2022-08-17  9:33 [PATCH 0/4] lei reindex, minor tweaks Eric Wong
                   ` (2 preceding siblings ...)
  2022-08-17  9:33 ` [PATCH 3/4] lei/store: reduce work when accessing mail_sync.sqlite3 Eric Wong
@ 2022-08-17  9:33 ` Eric Wong
  2022-08-18  7:22   ` Eric Wong
  3 siblings, 1 reply; 6+ messages in thread
From: Eric Wong @ 2022-08-17  9:33 UTC
  To: meta

---
 Documentation/lei-reindex.pod | 47 +++++++++++++++++++++++++++++++++
 MANIFEST                      |  2 ++
 lib/PublicInbox/LEI.pm        |  2 ++
 lib/PublicInbox/LeiReindex.pm | 49 +++++++++++++++++++++++++++++++++++
 lib/PublicInbox/LeiStore.pm   | 32 ++++++++++++++++++++++-
 5 files changed, 131 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/lei-reindex.pod
 create mode 100644 lib/PublicInbox/LeiReindex.pm

diff --git a/Documentation/lei-reindex.pod b/Documentation/lei-reindex.pod
new file mode 100644
index 00000000..3a5861c4
--- /dev/null
+++ b/Documentation/lei-reindex.pod
@@ -0,0 +1,47 @@
+=head1 NAME
+
+lei-reindex - reindex messages already in lei/store
+
+=head1 SYNOPSIS
+
+lei reindex [OPTIONS]
+
+=head1 DESCRIPTION
+
+Forces a re-index of all messages previously indexed by L<lei-import(1)>
+or L<lei-index(1)>.  This can be used for in-place upgrades and bugfixes
+while other processes are querying the store.  Keep in mind this roughly
+doubles the size of the already-large Xapian database.
+
+It does not re-index messages in externals; the C<--reindex> switch
+of L<public-inbox-index(1)> or L<public-inbox-extindex(1)> is needed
+for that.
+
+=head1 OPTIONS
+
+=over
+
+=item -q
+
+=item --quiet
+
+Suppress feedback messages.
+
+=back
+
+=head1 CONTACT
+
+Feedback welcome via plain-text mail to L<mailto:meta@public-inbox.org>
+
+The mail archives are hosted at L<https://public-inbox.org/meta/> and
+L<http://4uok3hntl7oi7b4uf4rtfwefqeexfzil2w6kgk2jn5z2f764irre7byd.onion/meta/>
+
+=head1 COPYRIGHT
+
+Copyright all contributors L<mailto:meta@public-inbox.org>
+
+License: AGPL-3.0+ L<https://www.gnu.org/licenses/agpl-3.0.txt>
+
+=head1 SEE ALSO
+
+L<lei-index(1)>, L<lei-import(1)>
diff --git a/MANIFEST b/MANIFEST
index cc0a9a4c..27e4c4e0 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -56,6 +56,7 @@ Documentation/lei-p2q.pod
 Documentation/lei-q.pod
 Documentation/lei-rediff.pod
 Documentation/lei-refresh-mail-sync.pod
+Documentation/lei-reindex.pod
 Documentation/lei-rm-watch.pod
 Documentation/lei-rm.pod
 Documentation/lei-security.pod
@@ -256,6 +257,7 @@ lib/PublicInbox/LeiPmdir.pm
 lib/PublicInbox/LeiQuery.pm
 lib/PublicInbox/LeiRediff.pm
 lib/PublicInbox/LeiRefreshMailSync.pm
+lib/PublicInbox/LeiReindex.pm
 lib/PublicInbox/LeiRemote.pm
 lib/PublicInbox/LeiRm.pm
 lib/PublicInbox/LeiRmWatch.pm
diff --git a/lib/PublicInbox/LEI.pm b/lib/PublicInbox/LEI.pm
index 595b3fa9..8a3a3ab6 100644
--- a/lib/PublicInbox/LEI.pm
+++ b/lib/PublicInbox/LEI.pm
@@ -253,6 +253,8 @@ our %CMD = ( # sorted in order of importance/use:
 'forget-watch' => [ '{WATCH_NUMBER|--prune}', 'stop and forget a watch',
 	qw(prune), @c_opt ],
 
+'reindex' => [ '', 'reindex all locally-indexed messages', @c_opt ],
+
 'index' => [ 'LOCATION...', 'one-time index from URL or filesystem',
 	qw(in-format|F=s kw! offset=i recursive|r exclude=s include|I=s
 	verbose|v+ incremental!), @net_opt, # mainly for --proxy=
diff --git a/lib/PublicInbox/LeiReindex.pm b/lib/PublicInbox/LeiReindex.pm
new file mode 100644
index 00000000..3f109f33
--- /dev/null
+++ b/lib/PublicInbox/LeiReindex.pm
@@ -0,0 +1,49 @@
+# Copyright all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+
+# "lei reindex" command to reindex everything in lei/store
+package PublicInbox::LeiReindex;
+use v5.12;
+use parent qw(PublicInbox::IPC);
+
+sub reindex_full {
+	my ($lei) = @_;
+	my $sto = $lei->{sto};
+	my $max = $sto->search->over(1)->max;
+	$lei->qerr("# reindexing 1..$max");
+	$sto->wq_do('reindex_art', $_) for (1..$max);
+}
+
+sub reindex_store { # via wq_do
+	my ($self) = @_;
+	my ($lei, $argv) = delete @$self{qw(lei argv)};
+	if (!@$argv) {
+		reindex_full($lei);
+	}
+}
+
+sub lei_reindex {
+	my ($lei, @argv) = @_;
+	my $sto = $lei->_lei_store or return $lei->fail('nothing indexed');
+	$sto->write_prepare($lei);
+	my $self = bless { lei => $lei, argv => \@argv }, __PACKAGE__;
+	my ($op_c, $ops) = $lei->workers_start($self, 1);
+	$lei->{wq1} = $self;
+	$lei->wait_wq_events($op_c, $ops);
+	$self->wq_do('reindex_store');
+	$self->wq_close;
+}
+
+sub _lei_wq_eof { # EOF callback for main lei daemon
+	my ($lei) = @_;
+	$lei->{sto}->wq_do('reindex_done');
+	$lei->wq_eof;
+}
+
+sub ipc_atfork_child {
+	my ($self) = @_;
+	$self->{lei}->_lei_atfork_child;
+	$self->SUPER::ipc_atfork_child;
+}
+
+1;
diff --git a/lib/PublicInbox/LeiStore.pm b/lib/PublicInbox/LeiStore.pm
index d49746cb..277ed6bd 100644
--- a/lib/PublicInbox/LeiStore.pm
+++ b/lib/PublicInbox/LeiStore.pm
@@ -1,4 +1,4 @@
-# Copyright (C) 2020-2021 all contributors <meta@public-inbox.org>
+# Copyright (C) all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 #
 # Local storage (cache/memo) for lei(1), suitable for personal/private
@@ -335,6 +335,36 @@ sub _docids_and_maybe_kw ($$) {
 	($docids, [ sort keys %$kw ]);
 }
 
+sub _reindex_1 { # git->cat_async callback
+	my ($bref, $hex, $type, $size, $smsg) = @_;
+	my ($self, $eidx, $tl) = delete @$smsg{qw(-self -eidx -tl)};
+	$bref //= _lms_rw($self)->local_blob($hex, 1);
+	if ($bref) {
+		my $eml = PublicInbox::Eml->new($bref);
+		$smsg->{-merge_vmd} = 1; # preserve existing keywords
+		$eidx->idx_shard($smsg->{num})->index_eml($eml, $smsg);
+	} else {
+		warn("E: $type $hex\n");
+	}
+}
+
+sub reindex_art {
+	my ($self, $art) = @_;
+	my ($eidx, $tl) = eidx_init($self);
+	my $smsg = $eidx->{oidx}->get_art($art) // return;
+	return if $smsg->{bytes} == 0; # external-only message
+	@$smsg{qw(-self -eidx -tl)} = ($self, $eidx, $tl);
+	$eidx->git->cat_async($smsg->{blob} // die("no blob (#$art)"),
+				\&_reindex_1, $smsg);
+}
+
+sub reindex_done {
+	my ($self) = @_;
+	my ($eidx, $tl) = eidx_init($self);
+	$eidx->git->async_wait_all;
+	# ->done to be called via sto_done_request
+}
+
 sub add_eml {
 	my ($self, $eml, $vmd, $xoids) = @_;
 	my $im = $self->{-fake_im} // $self->importer; # may create new epoch
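
The reindex_art / _reindex_1 / reindex_done trio above queues one
asynchronous blob request per article, passes per-request context to
the callback through a hash slice, and waits for everything at the
end.  The sketch below imitates that flow in plain Perl.
cat_async_sketch() and async_wait_all_sketch() are hypothetical
stand-ins for the git methods, the "-tag" key is invented, and no real
blobs are read.

```perl
use strict;
use warnings;

my @pending; # queued [ $oid, $cb, $arg ] requests

# Hypothetical stand-in for an async blob fetch: only queues the request.
sub cat_async_sketch {
	my ($oid, $cb, $arg) = @_;
	push @pending, [ $oid, $cb, $arg ];
}

# Hypothetical stand-in for waiting on all requests: runs each callback.
sub async_wait_all_sketch {
	while (my $req = shift @pending) {
		my ($oid, $cb, $arg) = @$req;
		my $blob = "raw message for $oid"; # fake blob content
		$cb->(\$blob, $oid, $arg);
	}
}

my @indexed;
sub on_blob {
	my ($bref, $oid, $smsg) = @_;
	# take the private context back out so it does not stay in the record
	my ($self, $tag) = delete @$smsg{qw(-self -tag)};
	push @indexed, join(':', $tag, $smsg->{num}, $oid);
}

for my $num (1..3) {
	my $smsg = { num => $num, blob => "oid$num" };
	@$smsg{qw(-self -tag)} = ({}, 'reindex'); # stash context via hash slice
	cat_async_sketch($smsg->{blob}, \&on_blob, $smsg);
}
print scalar(@indexed), " done before wait\n"; # nothing has run yet
async_wait_all_sketch();
print "$_\n" for @indexed;
```

Nothing is indexed until the final wait, which is why a separate
"done" step is needed after the last article has been queued.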

* Re: [PATCH 4/4] lei reindex: new command to reindex lei/store
  2022-08-17  9:33 ` [PATCH 4/4] lei reindex: new command to reindex lei/store Eric Wong
@ 2022-08-18  7:22   ` Eric Wong
  0 siblings, 0 replies; 6+ messages in thread
From: Eric Wong @ 2022-08-18  7:22 UTC
  To: meta

Eric Wong <e@80x24.org> wrote:
> index d49746cb..277ed6bd 100644
> --- a/lib/PublicInbox/LeiStore.pm
> +++ b/lib/PublicInbox/LeiStore.pm
> @@ -335,6 +335,36 @@ sub _docids_and_maybe_kw ($$) {
>  	($docids, [ sort keys %$kw ]);
>  }
>  
> +sub _reindex_1 { # git->cat_async callback
> +	my ($bref, $hex, $type, $size, $smsg) = @_;
> +	my ($self, $eidx, $tl) = delete @$smsg{qw(-self -eidx -tl)};
> +	$bref //= _lms_rw($self)->local_blob($hex, 1);
> +	if ($bref) {
> +		my $eml = PublicInbox::Eml->new($bref);
> +		$smsg->{-merge_vmd} = 1; # preserve existing keywords
> +		$eidx->idx_shard($smsg->{num})->index_eml($eml, $smsg);
> +	} else {
> +		warn("E: $type $hex\n");

This path has been worrying me a bit.  I hit it fairly often on
one of my systems since there was a time when external-only
messages were fully indexed inside lei/store.  Nowadays,
duplicate indexing is avoided...
