* Threading/searching problem @ 2018-08-03 18:26 Konstantin Ryabitsev 2018-08-03 19:20 ` Eric Wong 0 siblings, 1 reply; 8+ messages in thread From: Konstantin Ryabitsev @ 2018-08-03 18:26 UTC (permalink / raw) To: meta [-- Attachment #1: Type: text/plain, Size: 1274 bytes --] Hi, all: Something I came over today that seems to be wonky. I was trying to find this message: https://lore.kernel.org/lkml/CA+55aFz5EWE9OTbzDoMfsY2ez04Qv9eg0KQhwKfyJY0vFvoD3g@mail.gmail.com/ It's from Linus about WireGuard, so I searched for it: https://lore.kernel.org/lkml/?q=torvalds+wireguard And it doesn't show up, which is really odd. I see some of the other search strings also not show up: https://lore.kernel.org/lkml/?q=DL_FLAG_AUTOREMOVE_SUPPLIER it's in today's message: https://lore.kernel.org/lkml/53f9f939dbce11e1c96986ff41f29dd6e41b9220.camel@nxp.com/ I'm wondering if it's related to incremental index fixes, but the cover letter mentions deleted messages, which we haven't been doing. There's also an issue of threading which has been irking some people on LKML. If you look at this message again: https://lore.kernel.org/lkml/CA+55aFz5EWE9OTbzDoMfsY2ez04Qv9eg0KQhwKfyJY0vFvoD3g@mail.gmail.com/ You will notice that PI tries to build a subject-based thread all the way back to 2008 and gives up somewhere in 2015. This makes it impossible to navigate that message thread. Any help would be greatly appreciated. If anyone wants a copy of Xapian/git I can provide it on request (they are large!). -K [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Threading/searching problem 2018-08-03 18:26 Threading/searching problem Konstantin Ryabitsev @ 2018-08-03 19:20 ` Eric Wong 2018-08-03 19:38 ` Konstantin Ryabitsev 0 siblings, 1 reply; 8+ messages in thread From: Eric Wong @ 2018-08-03 19:20 UTC (permalink / raw) To: Konstantin Ryabitsev; +Cc: meta Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote: > Hi, all: > > Something I came over today that seems to be wonky. I was trying to find > this message: > > https://lore.kernel.org/lkml/CA+55aFz5EWE9OTbzDoMfsY2ez04Qv9eg0KQhwKfyJY0vFvoD3g@mail.gmail.com/ > > It's from Linus about WireGuard, so I searched for it: > > https://lore.kernel.org/lkml/?q=torvalds+wireguard OK, WireGuard is in the body, there... > And it doesn't show up, which is really odd. I see some of the other search > strings also not show up: > > https://lore.kernel.org/lkml/?q=DL_FLAG_AUTOREMOVE_SUPPLIER > > it's in today's message: > https://lore.kernel.org/lkml/53f9f939dbce11e1c96986ff41f29dd6e41b9220.camel@nxp.com/ Likewise, DL_FLAG_AUTOREMOVE_SUPPLIER only appears in the body. > I'm wondering if it's related to incremental index fixes, but the cover > letter mentions deleted messages, which we haven't been doing. Since you're on mda, I think it's related to the mda + message body indexing fixes I posted: https://public-inbox.org/meta/20180729093441.5250-1-e@80x24.org/ I think it takes about 3-4 hours to reindex my local LKML archives on my fastest SSD (way longer on a slower SSD), so it might only take around 2 hours onthe machine used for lore. > There's also an issue of threading which has been irking some people on > LKML. If you look at this message again: > > https://lore.kernel.org/lkml/CA+55aFz5EWE9OTbzDoMfsY2ez04Qv9eg0KQhwKfyJY0vFvoD3g@mail.gmail.com/ > > You will notice that PI tries to build a subject-based thread all the way > back to 2008 and gives up somewhere in 2015. This makes it impossible to > navigate that message thread. Oops. We'll have to tweak the ordering for giant threads to favor newer messages. Gotta run for a bit but should be done today. > Any help would be greatly appreciated. If anyone wants a copy of Xapian/git > I can provide it on request (they are large!). I've got my own; but it might be worth looking into supporting remote indices a feature of Xapian. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Threading/searching problem 2018-08-03 19:20 ` Eric Wong @ 2018-08-03 19:38 ` Konstantin Ryabitsev 2018-08-05 6:04 ` [PATCH] view: distinguish strict and loose thread matches Eric Wong 0 siblings, 1 reply; 8+ messages in thread From: Konstantin Ryabitsev @ 2018-08-03 19:38 UTC (permalink / raw) To: Eric Wong; +Cc: meta [-- Attachment #1.1: Type: text/plain, Size: 1172 bytes --] On 08/03/18 15:20, Eric Wong wrote: > Since you're on mda, I think it's related to the mda + message body > indexing fixes I posted: > > https://public-inbox.org/meta/20180729093441.5250-1-e@80x24.org/ > > I think it takes about 3-4 hours to reindex my local LKML > archives on my fastest SSD (way longer on a slower SSD), > so it might only take around 2 hours onthe machine used for lore. Thanks, Eric. I've updated to 9015a8af2909 and a reindex is now running. I'll let you know if this doesn't solve the problem. >> https://lore.kernel.org/lkml/CA+55aFz5EWE9OTbzDoMfsY2ez04Qv9eg0KQhwKfyJY0vFvoD3g@mail.gmail.com/ >> >> You will notice that PI tries to build a subject-based thread all the way >> back to 2008 and gives up somewhere in 2015. This makes it impossible to >> navigate that message thread. > > Oops. We'll have to tweak the ordering for giant threads to > favor newer messages. Gotta run for a bit but should be done > today. No rush, I probably won't get around to it until Monday at the earliest. Thanks for your help! Best, -- Konstantin Ryabitsev Director, IT Infrastructure Security The Linux Foundation [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH] view: distinguish strict and loose thread matches 2018-08-03 19:38 ` Konstantin Ryabitsev @ 2018-08-05 6:04 ` Eric Wong 2018-08-05 8:19 ` [RFC] overidx: preserve `tid' column on re-indexing Eric Wong 2018-08-06 20:05 ` [PATCH] view: distinguish strict and loose thread matches Konstantin Ryabitsev 0 siblings, 2 replies; 8+ messages in thread From: Eric Wong @ 2018-08-05 6:04 UTC (permalink / raw) To: Konstantin Ryabitsev; +Cc: meta The "loose" (Subject:-based) thread matching yields too many hits for some common subjects (e.g. "[GIT] Networking" on LKML) and causes thread skeletons to not show the current messages. Favor strict matches in the query and only add loose matches if there's space. While working on this, I noticed the backwards --reindex walk breaks `tid' on v1 repositories, at least. That bug was hidden by the Subject: match logic and not discovered until now. It will be fixed separately. Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org> --- lib/PublicInbox/Over.pm | 43 ++++++++++++++++++++++------ lib/PublicInbox/View.pm | 63 +++++++++++++++++++++++++++++++++++++---- 2 files changed, 93 insertions(+), 13 deletions(-) diff --git a/lib/PublicInbox/Over.pm b/lib/PublicInbox/Over.pm index b2f6883..dda1e6d 100644 --- a/lib/PublicInbox/Over.pm +++ b/lib/PublicInbox/Over.pm @@ -11,6 +11,7 @@ package PublicInbox::Over; use DBD::SQLite; use PublicInbox::SearchMsg; use Compress::Zlib qw(uncompress); +use constant DEFAULT_LIMIT => 1000; sub dbh_new { my ($self) = @_; @@ -53,7 +54,7 @@ sub load_from_row { sub do_get { my ($self, $sql, $opts, @args) = @_; my $dbh = $self->connect; - my $lim = (($opts->{limit} || 0) + 0) || 1000; + my $lim = (($opts->{limit} || 0) + 0) || DEFAULT_LIMIT; $sql .= "LIMIT $lim"; my $msgs = $dbh->selectall_arrayref($sql, { Slice => {} }, @args); load_from_row($_) for @$msgs; @@ -97,21 +98,47 @@ sub get_thread { SELECT tid,sid FROM over WHERE num = ? LIMIT 1 defined $tid or return nothing; # $sid may be undef + + my $cond_all = '(tid = ? OR sid = ?) AND num > ?'; my $sort_col = 'ds'; $num = 0; - if ($prev) { + if ($prev) { # mboxrd stream, only $num = $prev->{num} || 0; $sort_col = 'num'; } - my $cond = '(tid = ? OR sid = ?) AND num > ?'; - my $msgs = do_get($self, <<"", {}, $tid, $sid, $num); -SELECT num,ts,ds,ddd FROM over WHERE $cond ORDER BY $sort_col ASC - return $msgs unless wantarray; + my $cols = 'num,ts,ds,ddd'; + unless (wantarray) { + return do_get($self, <<"", {}, $tid, $sid, $num); +SELECT $cols FROM over WHERE $cond_all +ORDER BY $sort_col ASC - my $nr = $dbh->selectrow_array(<<"", undef, $tid, $sid, $num); -SELECT COUNT(num) FROM over WHERE $cond + } + # HTML view always wants an array and never uses $prev, + # but the mbox stream never wants an array and always has $prev + die '$prev not supported with wantarray' if $prev; + my $nr = $dbh->selectrow_array(<<"", undef, $tid, $sid, $num); +SELECT COUNT(num) FROM over WHERE $cond_all + + # giant thread, prioritize strict (tid) matches and throw + # in the loose (sid) matches at the end + my $msgs = do_get($self, <<"", {}, $tid, $num); +SELECT $cols FROM over WHERE tid = ? AND num > ? +ORDER BY $sort_col ASC + + # do we have room for loose matches? get the most recent ones, first: + my $lim = DEFAULT_LIMIT - scalar(@$msgs); + if ($lim > 0) { + my $opts = { limit => $lim }; + my $loose = do_get($self, <<"", $opts, $tid, $sid, $num); +SELECT $cols FROM over WHERE tid != ? AND sid = ? AND num > ? +ORDER BY $sort_col DESC + + # TODO separate strict and loose matches here once --reindex + # is fixed to preserve `tid' properly + push @$msgs, @$loose; + } ($nr, $msgs); } diff --git a/lib/PublicInbox/View.pm b/lib/PublicInbox/View.pm index 58851ed..eb002ae 100644 --- a/lib/PublicInbox/View.pm +++ b/lib/PublicInbox/View.pm @@ -365,7 +365,7 @@ sub walk_thread { while (@q) { my ($level, $node, $i) = splice(@q, 0, 3); defined $node or next; - $cb->($ctx, $level, $node, $i); + $cb->($ctx, $level, $node, $i) or return; ++$level; $i = 0; unshift @q, map { ($level, $_, $i++) } @{$node->{children}}; @@ -818,10 +818,56 @@ sub indent_for { $level ? INDENT x ($level - 1) : ''; } +sub find_mid_root { + my ($ctx, $level, $node, $idx) = @_; + ++$ctx->{root_idx} if $level == 0; + if ($node->{id} eq $ctx->{mid}) { + $ctx->{found_mid_at} = $ctx->{root_idx}; + return 0; + } + 1; +} + +sub strict_loose_note ($) { + my ($nr) = @_; + my $msg = +" -- strict thread matches above, loose matches on Subject: below --\n"; + + if ($nr > PublicInbox::Over::DEFAULT_LIMIT()) { + $msg .= +" -- use mbox.gz link to download all $nr messages --\n"; + } + $msg; +} + sub thread_results { my ($ctx, $msgs) = @_; require PublicInbox::SearchThread; - PublicInbox::SearchThread::thread($msgs, *sort_ds, $ctx->{-inbox}); + my $ibx = $ctx->{-inbox}; + my $rootset = PublicInbox::SearchThread::thread($msgs, *sort_ds, $ibx); + + # FIXME: `tid' is broken on --reindex, so that needs to be fixed + # and preserved in the future. This bug is hidden by `sid' matches + # in get_thread, so we never noticed it until now. And even when + # reindexing is fixed, we'll keep this code until a SCHEMA_VERSION + # bump since reindexing is expensive and users may not do it + + # loose threading could've returned too many results, + # put the root the message we care about at the top: + my $mid = $ctx->{mid}; + if (defined($mid) && scalar(@$rootset) > 1) { + $ctx->{root_idx} = -1; + my $nr = scalar @$msgs; + walk_thread($rootset, $ctx, *find_mid_root); + my $idx = $ctx->{found_mid_at}; + if (defined($idx) && $idx != 0) { + my $tip = splice(@$rootset, $idx, 1); + @$rootset = reverse @$rootset; + unshift @$rootset, $tip; + $ctx->{sl_note} = strict_loose_note($nr); + } + } + $rootset } sub missing_thread { @@ -864,6 +910,10 @@ sub skel_dump { my $cur = $ctx->{cur}; my $mid = $smsg->{mid}; + if ($level == 0 && $ctx->{skel_dump_roots}++) { + $$dst .= delete $ctx->{sl_note} || ''; + } + my $f = ascii_html($smsg->from_name); my $obfs_ibx = $ctx->{-obfs_ibx}; obfuscate_addrs($obfs_ibx, $f) if $obfs_ibx; @@ -882,7 +932,7 @@ sub skel_dump { delete $ctx->{cur}; $$dst .= "<b>$d<a\nid=r\nhref=\"#t\">". "$attr [this message]</a></b>\n"; - return; + return 1; } else { $ctx->{prev_msg} = $mid; } @@ -922,6 +972,7 @@ sub skel_dump { $m = $ctx->{-upfx}.mid_escape($mid).'/'; } $$dst .= $d . "<a\nhref=\"$m\"$id>" . $end; + 1; } sub _skel_ghost { @@ -947,6 +998,7 @@ sub _skel_ghost { } my $dst = $ctx->{dst}; $$dst .= $d; + 1; } sub sort_ds { @@ -973,7 +1025,7 @@ sub acc_topic { $topic = [ $ds, 1, { $subj => $mid }, $subj ]; $ctx->{-cur_topic} = $topic; push @{$ctx->{order}}, $topic; - return; + return 1; } $topic = $ctx->{-cur_topic}; # should never be undef @@ -987,11 +1039,12 @@ sub acc_topic { } $seen->{$subj} = $mid; # latest for subject } else { # ghost message - return if $level != 0; # ignore child ghosts + return 1 if $level != 0; # ignore child ghosts $topic = [ -666, 0, {} ]; $ctx->{-cur_topic} = $topic; push @{$ctx->{order}}, $topic; } + 1; } sub dump_topics { -- EW ^ permalink raw reply related [flat|nested] 8+ messages in thread
* [RFC] overidx: preserve `tid' column on re-indexing 2018-08-05 6:04 ` [PATCH] view: distinguish strict and loose thread matches Eric Wong @ 2018-08-05 8:19 ` Eric Wong 2018-08-05 21:41 ` Eric Wong 2018-08-06 20:05 ` [PATCH] view: distinguish strict and loose thread matches Konstantin Ryabitsev 1 sibling, 1 reply; 8+ messages in thread From: Eric Wong @ 2018-08-05 8:19 UTC (permalink / raw) To: meta; +Cc: Konstantin Ryabitsev Eric Wong <e@80x24.org> wrote: > While working on this, I noticed the backwards --reindex walk > breaks `tid' on v1 repositories, at least. That bug was hidden > by the Subject: match logic and not discovered until now. It > will be fixed separately. Lightly tested, but seems to make sense... Reindexing http://czquwvybam4bgbro.onion/git/ now... -------8<------- Subject: [RFC] overidx: preserve `tid' column on re-indexing Otherwise, walking backwards through history could mean the root message in a thread forgets its `tid' and it prevents messages from being looked up by it. This bug was hidden by the fact that `sid' matches were often good enough to link threads together. --- lib/PublicInbox/OverIdx.pm | 11 +++++++++-- t/search-thr-index.t | 40 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 49 insertions(+), 2 deletions(-) diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm index 62fec0d..cc9bd7d 100644 --- a/lib/PublicInbox/OverIdx.pm +++ b/lib/PublicInbox/OverIdx.pm @@ -79,8 +79,15 @@ sub mid2id { } sub delete_by_num { - my ($self, $num) = @_; + my ($self, $num, $tid_ref) = @_; my $dbh = $self->{dbh}; + if ($tid_ref) { + my $sth = $dbh->prepare_cached(<<'', undef, 1); +SELECT tid FROM over WHERE num = ? LIMIT 1 + + $sth->execute($num); + $$tid_ref = $sth->fetchrow_array; # may be undef + } foreach (qw(over id2num)) { $dbh->prepare_cached(<<"")->execute($num); DELETE FROM $_ WHERE num = ? @@ -262,7 +269,7 @@ sub add_over { my $vivified = 0; $self->begin_lazy; - $self->delete_by_num($num); + $self->delete_by_num($num, \$old_tid); foreach my $mid (@$mids) { my $v = 0; each_by_mid($self, $mid, ['tid'], sub { diff --git a/t/search-thr-index.t b/t/search-thr-index.t index 2aa97bf..ab6d1b0 100644 --- a/t/search-thr-index.t +++ b/t/search-thr-index.t @@ -48,9 +48,49 @@ foreach (reverse split(/\n\n/, $data)) { } my $prev; +my %tids; +my $dbh = $rw->{over}->connect; foreach my $mid (@mids) { my $msgs = $rw->{over}->get_thread($mid); is(3, scalar(@$msgs), "got all messages from $mid"); + foreach my $m (@$msgs) { + my $tid = $dbh->selectrow_array(<<'', undef, $m->{num}); +SELECT tid FROM over WHERE num = ? LIMIT 1 + + $tids{$tid}++; + } +} + +is(scalar keys %tids, 1, 'all messages have the same tid'); + +$rw->commit_txn_lazy; + +$xdb = $rw->begin_txn_lazy; +{ + my $mime = Email::MIME->new(<<''); +Subject: [RFC 00/14] +Message-Id: <1-bw@g> +From: bw@g +To: git@vger.kernel.org + + my $dbh = $rw->{over}->connect; + my ($id, $prev); + my $reidx = $rw->{over}->next_by_mid('1-bw@g', \$id, \$prev); + ok(defined $reidx); + my $num = $reidx->{num}; + my $tid0 = $dbh->selectrow_array(<<'', undef, $num); +SELECT tid FROM over WHERE num = ? LIMIT 1 + + my $bytes = bytes::length($mime->as_string); + my $mid = mids($mime->header_obj)->[0]; + my $doc_id = $rw->add_message($mime, $bytes, $num, 'ignored', $mid); + ok($doc_id, 'message reindexed'. $mid); + is($doc_id, $num, "article number unchanged: $num"); + + my $tid1 = $dbh->selectrow_array(<<'', undef, $num); +SELECT tid FROM over WHERE num = ? LIMIT 1 + + is($tid1, $tid0, 'tid unchanged on reindex'); } $rw->commit_txn_lazy; -- EW ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [RFC] overidx: preserve `tid' column on re-indexing 2018-08-05 8:19 ` [RFC] overidx: preserve `tid' column on re-indexing Eric Wong @ 2018-08-05 21:41 ` Eric Wong 0 siblings, 0 replies; 8+ messages in thread From: Eric Wong @ 2018-08-05 21:41 UTC (permalink / raw) To: meta; +Cc: Konstantin Ryabitsev Eric Wong <e@80x24.org> wrote: > Lightly tested, but seems to make sense... > Reindexing http://czquwvybam4bgbro.onion/git/ now... Seems fine, updating the non-.onion sites http://hjrcffqmbrq6wope.onion/git will be last (And czquwvybam4bgbro.onion going down for unrelated maintenance) ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] view: distinguish strict and loose thread matches 2018-08-05 6:04 ` [PATCH] view: distinguish strict and loose thread matches Eric Wong 2018-08-05 8:19 ` [RFC] overidx: preserve `tid' column on re-indexing Eric Wong @ 2018-08-06 20:05 ` Konstantin Ryabitsev 2018-08-06 20:10 ` Konstantin Ryabitsev 1 sibling, 1 reply; 8+ messages in thread From: Konstantin Ryabitsev @ 2018-08-06 20:05 UTC (permalink / raw) To: Eric Wong; +Cc: meta [-- Attachment #1.1: Type: text/plain, Size: 700 bytes --] On 08/05/18 02:04, Eric Wong wrote: > The "loose" (Subject:-based) thread matching yields too many > hits for some common subjects (e.g. "[GIT] Networking" on LKML) > and causes thread skeletons to not show the current messages. > Favor strict matches in the query and only add loose matches > if there's space. Eric: I've updated to ce5494b5b83 and reindexed, but it appears that I still have the same behaviour as previously: https://lore.kernel.org/lkml/20180805.004744.1043412275329854518.davem@davemloft.net/ Are there any configuration file changes that are required for this? Best, -- Konstantin Ryabitsev Director, IT Infrastructure Security The Linux Foundation [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] view: distinguish strict and loose thread matches 2018-08-06 20:05 ` [PATCH] view: distinguish strict and loose thread matches Konstantin Ryabitsev @ 2018-08-06 20:10 ` Konstantin Ryabitsev 0 siblings, 0 replies; 8+ messages in thread From: Konstantin Ryabitsev @ 2018-08-06 20:10 UTC (permalink / raw) To: Eric Wong; +Cc: meta [-- Attachment #1.1: Type: text/plain, Size: 522 bytes --] On 08/06/18 16:05, Konstantin Ryabitsev wrote: > I've updated to ce5494b5b83 and reindexed, but it appears that I still > have the same behaviour as previously: > > https://lore.kernel.org/lkml/20180805.004744.1043412275329854518.davem@davemloft.net/ > > Are there any configuration file changes that are required for this? No, but a service restart is required. :) Sorry about that -- I see the changes now. Best, -- Konstantin Ryabitsev Director, IT Infrastructure Security The Linux Foundation [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2018-08-06 20:10 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2018-08-03 18:26 Threading/searching problem Konstantin Ryabitsev 2018-08-03 19:20 ` Eric Wong 2018-08-03 19:38 ` Konstantin Ryabitsev 2018-08-05 6:04 ` [PATCH] view: distinguish strict and loose thread matches Eric Wong 2018-08-05 8:19 ` [RFC] overidx: preserve `tid' column on re-indexing Eric Wong 2018-08-05 21:41 ` Eric Wong 2018-08-06 20:05 ` [PATCH] view: distinguish strict and loose thread matches Konstantin Ryabitsev 2018-08-06 20:10 ` Konstantin Ryabitsev
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).