* [PATCH 0/3] lei: hopefully kill /Document \d+ not found/ errors
From: Eric Wong @ 2021-08-14 0:29 UTC
To: meta

2/3 is probably a fix for a long-standing problem, 3/3 was
noticed while working on it. If 2/3 doesn't fix it, then maybe
1/3 will help us narrow it down.

Eric Wong (3):
  lei: diagnostics for /Document \d+ not found/ errors
  lei <q|up>: wait on remote mboxrd imports synchronously
  lei: hexdigest mocks account for unwanted headers

 lib/PublicInbox/FakeImport.pm |  3 +++
 lib/PublicInbox/IPC.pm        |  2 +-
 lib/PublicInbox/LEI.pm        |  5 +++++
 lib/PublicInbox/LeiQuery.pm   |  2 +-
 lib/PublicInbox/LeiRemote.pm  |  9 +++++----
 lib/PublicInbox/LeiSearch.pm  | 17 ++++++++++-------
 lib/PublicInbox/LeiStore.pm   | 10 +++++++++-
 lib/PublicInbox/LeiXSearch.pm | 11 +++++++----
 8 files changed, 41 insertions(+), 18 deletions(-)

* [PATCH 1/3] lei: diagnostics for /Document \d+ not found/ errors
From: Eric Wong @ 2021-08-14 0:29 UTC
To: meta

This may help diagnose "Exception: Document \d+ not found" errors
I'm seeing from "lei up" with HTTPS endpoints.
---
 lib/PublicInbox/IPC.pm       |  2 +-
 lib/PublicInbox/LeiSearch.pm | 17 ++++++++++-------
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/lib/PublicInbox/IPC.pm b/lib/PublicInbox/IPC.pm
index 497a6035..d909dc1c 100644
--- a/lib/PublicInbox/IPC.pm
+++ b/lib/PublicInbox/IPC.pm
@@ -236,7 +236,7 @@ sub recv_and_run {
 	undef $buf;
 	my $sub = shift @$args;
 	eval { $self->$sub(@$args) };
-	warn "$$ wq_worker: $@" if $@;
+	warn "$$ $0 wq_worker: $@" if $@;
 	delete @$self{0..($nfd-1)};
 	$n;
 }
diff --git a/lib/PublicInbox/LeiSearch.pm b/lib/PublicInbox/LeiSearch.pm
index 79b2fd7d..f9e5c8e9 100644
--- a/lib/PublicInbox/LeiSearch.pm
+++ b/lib/PublicInbox/LeiSearch.pm
@@ -55,13 +55,16 @@ sub _xsmsg_vmd { # retry_reopen
 	$kw{flagged} = 1 if delete($smsg->{lei_q_tt_flagged});
 	my @num = $self->over->blob_exists($smsg->{blob});
 	for my $num (@num) { # there should only be one...
-		$doc = $xdb->get_document(num2docid($self, $num));
-		$x = xap_terms('K', $doc);
-		%kw = (%kw, %$x);
-		if ($want_label) { # JSON/JMAP only
-			$x = xap_terms('L', $doc);
-			%L = (%L, %$x);
-		}
+		eval {
+			$doc = $xdb->get_document(num2docid($self, $num));
+			$x = xap_terms('K', $doc);
+			%kw = (%kw, %$x);
+			if ($want_label) { # JSON/JMAP only
+				$x = xap_terms('L', $doc);
+				%L = (%L, %$x);
+			}
+		};
+		warn "$$ $0 #$num (nshard=$self->{nshard}) $smsg->{blob}: $@" if $@;
 	}
 	$smsg->{kw} = [ sort keys %kw ] if scalar(keys(%kw));
 	$smsg->{L} = [ sort keys %L ] if scalar(keys(%L));

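For readers following along, here is a minimal standalone sketch of the eval-and-warn pattern used in the LeiSearch.pm hunk of 1/3. It is not public-inbox code: get_document() and %docs are hypothetical stand-ins for the Xapian lookup, and the error text only imitates the exception named in the subject. The point is that one failed lookup is reported with some context (PID, program name, document number) and skipped, rather than aborting the whole loop.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $xdb->get_document: dies for unknown docids,
# imitating the "Document N not found" exception text.
my %docs = (1 => { K => [ 'seen' ] }, 3 => { K => [ 'flagged' ] });
sub get_document {
	my ($docid) = @_;
	$docs{$docid} // die "Exception: Document $docid not found\n";
}

# Collect keywords for each docid.  A failed lookup is recorded with
# PID, program name and docid, then skipped so the loop can continue.
sub collect_kw {
	my (@docids) = @_;
	my (%kw, @errs);
	for my $docid (@docids) {
		eval {
			my $doc = get_document($docid);
			$kw{$_} = 1 for @{$doc->{K}};
		};
		push @errs, "$$ $0 #$docid: $@" if $@; # only on failure
	}
	warn $_ for @errs;
	([ sort keys %kw ], \@errs);
}

my ($kw, $errs) = collect_kw(1, 2, 3);
print "keywords: @$kw\n";                 # keywords: flagged seen
print "failures: ", scalar(@$errs), "\n"; # failures: 1
```

Including $0 in the message is useful when several worker processes share one stderr, which appears to be the motivation for the IPC.pm change as well.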
* [PATCH 2/3] lei <q|up>: wait on remote mboxrd imports synchronously
From: Eric Wong @ 2021-08-14 0:29 UTC
To: meta

This ought to avoid /Document \d+ not found/ errors from Xapian
when seeing a message for the first time by not attempting to
read keywords for totally unseen messages.
---
 lib/PublicInbox/LeiRemote.pm  |  7 ++++---
 lib/PublicInbox/LeiStore.pm   |  1 +
 lib/PublicInbox/LeiXSearch.pm | 10 +++++++---
 3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/lib/PublicInbox/LeiRemote.pm b/lib/PublicInbox/LeiRemote.pm
index 945d9990..e7deecb8 100644
--- a/lib/PublicInbox/LeiRemote.pm
+++ b/lib/PublicInbox/LeiRemote.pm
@@ -26,11 +26,12 @@ sub _each_mboxrd_eml { # callback for MboxReader->mboxrd
 	my ($eml, $self) = @_;
 	my $lei = $self->{lei};
 	my $xoids = $lei->{ale}->xoids_for($eml, 1);
+	my $smsg = bless {}, 'PublicInbox::Smsg';
 	if ($lei->{sto} && !$xoids) { # memoize locally
-		$lei->{sto}->ipc_do('add_eml', $eml);
+		my $res = $lei->{sto}->ipc_do('add_eml', $eml);
+		$smsg = $res if ref($res) eq ref($smsg);
 	}
-	my $smsg = bless {}, 'PublicInbox::Smsg';
-	$smsg->{blob} = $xoids ? (keys(%$xoids))[0]
+	$smsg->{blob} //= $xoids ? (keys(%$xoids))[0]
 				: git_sha(1, $eml)->hexdigest;
 	$smsg->populate($eml);
 	$smsg->{mid} //= '(none)';
diff --git a/lib/PublicInbox/LeiStore.pm b/lib/PublicInbox/LeiStore.pm
index e26b622d..ce66014f 100644
--- a/lib/PublicInbox/LeiStore.pm
+++ b/lib/PublicInbox/LeiStore.pm
@@ -329,6 +329,7 @@ sub add_eml {
 		}
 		\@docids;
 	} else { # totally new message
+		delete $smsg->{-oidx}; # for IPC-friendliness
 		$smsg->{num} = $oidx->adj_counter('eidx_docid', '+');
 		$oidx->add_overview($eml, $smsg);
 		$oidx->add_xref3($smsg->{num}, -1, $smsg->{blob}, '.');
diff --git a/lib/PublicInbox/LeiXSearch.pm b/lib/PublicInbox/LeiXSearch.pm
index 393f25bf..971f3a06 100644
--- a/lib/PublicInbox/LeiXSearch.pm
+++ b/lib/PublicInbox/LeiXSearch.pm
@@ -266,11 +266,15 @@ sub _smsg_fill ($$) {
 sub each_remote_eml { # callback for MboxReader->mboxrd
 	my ($eml, $self, $lei, $each_smsg) = @_;
 	my $xoids = $lei->{ale}->xoids_for($eml, 1);
+	my $smsg = bless {}, 'PublicInbox::Smsg';
 	if ($self->{import_sto} && !$xoids) {
-		$self->{import_sto}->ipc_do('add_eml', $eml);
+		my $res = $self->{import_sto}->ipc_do('add_eml', $eml);
+		if (ref($res) eq ref($smsg)) { # totally new message
+			$smsg = $res;
+			$smsg->{kw} = []; # short-circuit xsmsg_vmd
+		}
 	}
-	my $smsg = bless {}, 'PublicInbox::Smsg';
-	$smsg->{blob} = $xoids ? (keys(%$xoids))[0]
+	$smsg->{blob} //= $xoids ? (keys(%$xoids))[0]
 				: git_sha(1, $eml)->hexdigest;
 	_smsg_fill($smsg, $eml);
 	wait_startq($lei);

* [PATCH 3/3] lei: hexdigest mocks account for unwanted headers
From: Eric Wong @ 2021-08-14 0:29 UTC
To: meta

PublicInbox::Import never imports @UNWANTED_HEADERS, so ensure
our mock blob OIDs do the same. This ought to prevent
duplicates if the PSGI mboxrd download starts setting
"X-Status: F" like "lei q -tt .."
---
 lib/PublicInbox/FakeImport.pm | 3 +++
 lib/PublicInbox/LEI.pm        | 5 +++++
 lib/PublicInbox/LeiQuery.pm   | 2 +-
 lib/PublicInbox/LeiRemote.pm  | 2 +-
 lib/PublicInbox/LeiStore.pm   | 9 ++++++++-
 lib/PublicInbox/LeiXSearch.pm | 3 +--
 6 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/lib/PublicInbox/FakeImport.pm b/lib/PublicInbox/FakeImport.pm
index dea25cbe..bccc3321 100644
--- a/lib/PublicInbox/FakeImport.pm
+++ b/lib/PublicInbox/FakeImport.pm
@@ -4,12 +4,15 @@
 # pretend to do PublicInbox::Import::add for "lei index"
 package PublicInbox::FakeImport;
 use strict;
+use v5.10.1;
 use PublicInbox::ContentHash qw(git_sha);
+use PublicInbox::Import;

 sub new { bless { bytes_added => 0 }, __PACKAGE__ }

 sub add {
 	my ($self, $eml, $check_cb, $smsg) = @_;
+	PublicInbox::Import::drop_unwanted_headers($eml);
 	$smsg->populate($eml);
 	my $raw = $eml->as_string;
 	$smsg->{blob} = git_sha(1, \$raw)->hexdigest;
diff --git a/lib/PublicInbox/LEI.pm b/lib/PublicInbox/LEI.pm
index 7d0f63dc..347dd280 100644
--- a/lib/PublicInbox/LEI.pm
+++ b/lib/PublicInbox/LEI.pm
@@ -1420,4 +1420,9 @@ sub refresh_watches {
 	}
 }

+sub git_blob_id {
+	my ($lei, $eml) = @_;
+	($lei->{sto} // _lei_store($lei, 1))->git_blob_id($eml);
+}
+
 1;
diff --git a/lib/PublicInbox/LeiQuery.pm b/lib/PublicInbox/LeiQuery.pm
index 37b660f9..962ad49e 100644
--- a/lib/PublicInbox/LeiQuery.pm
+++ b/lib/PublicInbox/LeiQuery.pm
@@ -73,7 +73,7 @@ sub lxs_prepare {
 	my @only = @{$opt->{only} // []};
 	# --local is enabled by default unless --only is used
 	# we'll allow "--only $LOCATION --local"
-	my $sto = $self->_lei_store(1); # FIXME: should not create
+	my $sto = $self->_lei_store(1);
 	$self->{lse} = $sto->search;
 	if ($opt->{'local'} //= scalar(@only) ? 0 : 1) {
 		$lxs->prepare_external($self->{lse});
diff --git a/lib/PublicInbox/LeiRemote.pm b/lib/PublicInbox/LeiRemote.pm
index e7deecb8..580787c0 100644
--- a/lib/PublicInbox/LeiRemote.pm
+++ b/lib/PublicInbox/LeiRemote.pm
@@ -32,7 +32,7 @@ sub _each_mboxrd_eml { # callback for MboxReader->mboxrd
 		$smsg = $res if ref($res) eq ref($smsg);
 	}
 	$smsg->{blob} //= $xoids ? (keys(%$xoids))[0]
-				: git_sha(1, $eml)->hexdigest;
+				: $lei->git_blob_id($eml);
 	$smsg->populate($eml);
 	$smsg->{mid} //= '(none)';
 	push @{$self->{smsg}}, $smsg;
diff --git a/lib/PublicInbox/LeiStore.pm b/lib/PublicInbox/LeiStore.pm
index ce66014f..3f33d114 100644
--- a/lib/PublicInbox/LeiStore.pm
+++ b/lib/PublicInbox/LeiStore.pm
@@ -20,7 +20,7 @@ use PublicInbox::Eml;
 use PublicInbox::Import;
 use PublicInbox::InboxWritable qw(eml_from_path);
 use PublicInbox::V2Writable;
-use PublicInbox::ContentHash qw(content_hash);
+use PublicInbox::ContentHash qw(content_hash git_sha);
 use PublicInbox::MID qw(mids);
 use PublicInbox::LeiSearch;
 use PublicInbox::MDA;
@@ -508,4 +508,11 @@ sub write_prepare {
 	$lei->{sto} = $self;
 }

+# TODO: support SHA-256
+sub git_blob_id { # called via LEI->git_blob_id
+	my ($self, $eml) = @_;
+	$eml->header_set($_) for @PublicInbox::Import::UNWANTED_HEADERS;
+	git_sha(1, $eml)->hexdigest;
+}
+
 1;
diff --git a/lib/PublicInbox/LeiXSearch.pm b/lib/PublicInbox/LeiXSearch.pm
index 971f3a06..5e34d864 100644
--- a/lib/PublicInbox/LeiXSearch.pm
+++ b/lib/PublicInbox/LeiXSearch.pm
@@ -274,8 +274,7 @@ sub each_remote_eml { # callback for MboxReader->mboxrd
 			$smsg->{kw} = []; # short-circuit xsmsg_vmd
 		}
 	}
-	$smsg->{blob} //= $xoids ? (keys(%$xoids))[0]
-				: git_sha(1, $eml)->hexdigest;
+	$smsg->{blob} //= $xoids ? (keys(%$xoids))[0] : $lei->git_blob_id($eml);
 	_smsg_fill($smsg, $eml);
 	wait_startq($lei);
 	if ($lei->{-progress}) {

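To illustrate why 3/3 strips headers before hashing, here is a rough standalone sketch using only core Perl (Digest::SHA). It is an approximation under stated assumptions, not the public-inbox implementation: the @UNWANTED list below is illustrative only (the real list is @PublicInbox::Import::UNWANTED_HEADERS), drop_unwanted() is a hypothetical helper that works on a raw string rather than a PublicInbox::Eml object, and it assumes LF line endings and no folded headers. The blob ID is computed the way git does for SHA-1 repositories: the hash of "blob <size>\0" followed by the content.

```perl
use strict;
use warnings;
use Digest::SHA;

# Illustrative list only; an assumption, not the project's actual list.
my @UNWANTED = qw(Status X-Status);

# Hypothetical helper: drop unwanted header lines from a raw message.
# Simplified: assumes LF line endings and no folded (multi-line) headers.
sub drop_unwanted {
	my ($raw) = @_;
	my ($hdr, $body) = split(/\n\n/, $raw, 2);
	my $re = join('|', map { quotemeta($_) } @UNWANTED);
	$hdr = join("\n", grep { !/\A(?:$re):/i } split(/\n/, $hdr));
	"$hdr\n\n" . ($body // '');
}

# SHA-1 git blob object ID of the message with unwanted headers removed.
sub git_blob_id {
	my ($raw) = @_;
	$raw = drop_unwanted($raw);
	Digest::SHA->new(1)->add('blob '.length($raw)."\0", $raw)->hexdigest;
}

my $plain = "From: a\@example.com\nSubject: hi\n\nbody\n";
my $flagged = "From: a\@example.com\nX-Status: F\nSubject: hi\n\nbody\n";
print git_blob_id($plain) eq git_blob_id($flagged) ? "same\n" : "differ\n";
# prints "same"
```

Without the header removal, the two copies would hash to different blob IDs and look like distinct messages, which is the duplicate scenario the commit message describes.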
* [PATCH 0/3] lei: hopefully^W kill /Document \d+ not found/ errors
From: Eric Wong @ 2021-08-24 20:14 UTC
To: meta

Eric Wong <e@80x24.org> wrote:
> 2/3 is probably a fix for a long-standing problem, 3/3 was
> noticed while working on it. If 2/3 doesn't fix it, then maybe
> 1/3 will help us narrow it down.

s/is probably/was almost certainly/ \o/

* Re: [PATCH 0/3] lei: hopefully^W kill /Document \d+ not found/ errors
From: Eric Wong @ 2021-10-14 5:31 UTC
To: meta

Eric Wong <e@80x24.org> wrote:
> Eric Wong <e@80x24.org> wrote:
> > 2/3 is probably a fix for a long-standing problem, 3/3 was
> > noticed while working on it. If 2/3 doesn't fix it, then maybe
> > 1/3 will help us narrow it down.
>
> s/is probably/was almost certainly/ \o/

No. :<

It still happens when there's overlap between various search
sources, but I think I know how to fix it...

* [PATCH] lei q: avoid kw lookup failure on remote mboxrd
From: Eric Wong @ 2021-10-15 9:52 UTC
To: meta

> > > 2/3 is probably a fix for a long-standing problem, 3/3 was
> > > noticed while working on it. If 2/3 doesn't fix it, then maybe
> > > 1/3 will help us narrow it down.
> >
> > s/is probably/was almost certainly/ \o/
>
> No. :<
>
> It still happens when there's overlap between various search
> sources, but I think I know how to fix it...

Ok, higher certainty for now :P

-----8<-----
Subject: [PATCH] lei q: avoid kw lookup failure on remote mboxrd

When importing several sources in parallel via http(s) mboxrd,
we need to be able to get keywords of uncommitted documents
directly from shard workers. Otherwise, Xapian DocNotFound
errors happen because the read-only LeiSearch won't see
documents from uncommitted transactions.

Keep in mind that it's possible the keywords can be changed
on-the-fly even for uncommitted documents because of inotify
watches from LeiNoteEvent.
---
 lib/PublicInbox/LeiStore.pm   | 28 +++++++++++++++++++++++-----
 lib/PublicInbox/LeiXSearch.pm |  8 +++-----
 lib/PublicInbox/SearchIdx.pm  |  6 ++++++
 t/lei_store.t                 |  3 ++-
 4 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/lib/PublicInbox/LeiStore.pm b/lib/PublicInbox/LeiStore.pm
index bf41dcf5..c45380d1 100644
--- a/lib/PublicInbox/LeiStore.pm
+++ b/lib/PublicInbox/LeiStore.pm
@@ -328,6 +328,20 @@ sub _add_vmd ($$$$) {
 	sto_export_kw($self, $docid, $vmd);
 }

+sub _docids_and_maybe_kw ($$) {
+	my ($self, $docids) = @_;
+	return $docids unless wantarray;
+	my $kw = {};
+	for my $num (@$docids) { # likely only 1, unless ContentHash changes
+		# can't use ->search->msg_keywords on uncommitted docs
+		my $idx = $self->{priv_eidx}->idx_shard($num);
+		my $tmp = eval { $idx->ipc_do('get_terms', 'K', $num) };
+		if ($@) { warn "#$num get_terms: $@" }
+		else { @$kw{keys %$tmp} = values(%$tmp) };
+	}
+	($docids, [ sort keys %$kw ]);
+}
+
 sub add_eml {
 	my ($self, $eml, $vmd, $xoids) = @_;
 	my $im = $self->{-fake_im} // $self->importer; # may create new epoch
@@ -339,7 +353,11 @@ sub add_eml {
 	if ($vmd && $vmd->{sync_info}) {
 		set_sync_info($self, $smsg->{blob}, @{$vmd->{sync_info}});
 	}
-	$im_mark or return; # duplicate blob returns undef
+	unless ($im_mark) { # duplicate blob returns undef
+		return unless wantarray;
+		my @docids = $oidx->blob_exists($smsg->{blob});
+		return _docids_and_maybe_kw $self, \@docids;
+	}

 	local $self->{current_info} = $smsg->{blob};
 	my $vivify_xvmd = delete($smsg->{-vivify_xvmd}) // []; # exact matches
@@ -373,7 +391,7 @@ sub add_eml {
 			}
 			_add_vmd($self, $idx, $docid, $vmd) if $vmd;
 		}
-		$vivify_xvmd;
+		_docids_and_maybe_kw $self, $vivify_xvmd;
 	} elsif (my @docids = _docids_for($self, $eml)) {
 		# fuzzy match from within lei/store
 		for my $docid (@docids) {
@@ -383,8 +401,8 @@ sub add_eml {
 			$idx->ipc_do('add_eidx_info', $docid, '.', $eml);
 			_add_vmd($self, $idx, $docid, $vmd) if $vmd;
 		}
-		\@docids;
-	} else { # totally new message
+		_docids_and_maybe_kw $self, \@docids;
+	} else { # totally new message, no keywords
 		delete $smsg->{-oidx}; # for IPC-friendliness
 		$smsg->{num} = $oidx->adj_counter('eidx_docid', '+');
 		$oidx->add_overview($eml, $smsg);
@@ -392,7 +410,7 @@ sub add_eml {
 		my $idx = $eidx->idx_shard($smsg->{num});
 		$idx->index_eml($eml, $smsg);
 		_add_vmd($self, $idx, $smsg->{num}, $vmd) if $vmd;
-		$smsg;
+		wantarray ? ($smsg, []) : $smsg;
 	}
 }

diff --git a/lib/PublicInbox/LeiXSearch.pm b/lib/PublicInbox/LeiXSearch.pm
index fba16861..3ec75528 100644
--- a/lib/PublicInbox/LeiXSearch.pm
+++ b/lib/PublicInbox/LeiXSearch.pm
@@ -282,11 +282,9 @@ sub each_remote_eml { # callback for MboxReader->mboxrd
 	my $xoids = $lei->{ale}->xoids_for($eml, 1);
 	my $smsg = bless {}, 'PublicInbox::Smsg';
 	if ($self->{import_sto} && !$xoids) {
-		my $res = $self->{import_sto}->wq_do('add_eml', $eml);
-		if (ref($res) eq ref($smsg)) { # totally new message
-			$smsg = $res;
-			$smsg->{kw} = []; # short-circuit xsmsg_vmd
-		}
+		my ($res, $kw) = $self->{import_sto}->wq_do('add_eml', $eml);
+		$smsg = $res if ref($res) eq ref($smsg); # totally new message
+		$smsg->{kw} = $kw; # short-circuit xsmsg_vmd
 	}
 	$smsg->{blob} //= $xoids ? (keys(%$xoids))[0]
 				: $lei->git_oid($eml)->hexdigest;
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 928152ec..585f28f5 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -517,6 +517,12 @@ sub add_eidx_info {
 	$self->{xdb}->replace_document($docid, $doc);
 }

+sub get_terms {
+	my ($self, $pfx, $docid) = @_;
+	begin_txn_lazy($self);
+	xap_terms($pfx, $self->{xdb}, $docid);
+}
+
 sub remove_eidx_info {
 	my ($self, $docid, $eidx_key, $eml) = @_;
 	begin_txn_lazy($self);
diff --git a/t/lei_store.t b/t/lei_store.t
index c31e27a2..40ad7800 100644
--- a/t/lei_store.t
+++ b/t/lei_store.t
@@ -138,7 +138,8 @@ Subject: timezone-dependent test

 WHAT IS TIME ANYMORE?
 EOM
-	ok($sto->add_eml($eml), 'recently received message');
+	my $smsg = $sto->add_eml($eml);
+	ok($smsg && $smsg->{blob}, 'recently received message');
 	$sto->done;
 	local $ENV{TZ} = 'GMT+5';
 	my $lse = $sto->search;

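The last patch relies on Perl's wantarray to return keywords only to callers that ask for them. A minimal standalone sketch of that calling convention follows; add_msg() and %store are hypothetical names for illustration and do not mirror the real add_eml logic.

```perl
use strict;
use warnings;

# Hypothetical in-memory store: blob ID => keyword list.
my %store = (abc123 => [ qw(seen flagged) ]);

# Scalar context: return only the record, as existing callers expect.
# List context: also return the sorted keywords, so the caller does not
# need a second lookup against a possibly stale read-only view.
sub add_msg {
	my ($blob) = @_;
	my $new = !exists $store{$blob};
	$store{$blob} //= []; # totally new message, no keywords
	my $rec = { blob => $blob, new => $new ? 1 : 0 };
	return $rec unless wantarray;
	($rec, [ sort @{$store{$blob}} ]);
}

my $rec = add_msg('abc123');         # scalar context: record only
my ($rec2, $kw) = add_msg('abc123'); # list context: record + keywords
print "$rec->{blob}: @$kw\n";        # abc123: flagged seen
```

Function arguments are evaluated in list context, so passing such a call directly to a test helper would hand it both return values. That is presumably why the t/lei_store.t hunk assigns the result to a scalar before calling ok().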