* Cheap way to check for new messages in a thread
@ 2023-03-27 15:08 Konstantin Ryabitsev
  2023-03-27 19:10 ` Eric Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Konstantin Ryabitsev @ 2023-03-27 15:08 UTC
  To: meta

Hello:

For the bugzilla integration work I'm doing, I need a way to check if there
were any updates to a thread since the last check. Right now, I'm just
grabbing the full thread, parsing it and seeing if there are any new
message-IDs that we don't know about, but it's very wasteful. Any way to just
issue something like "how many messages are in a thread with this message-id"
or "are there any updates to a thread with this message-id since
YYYYMMDDHHMMSS?

-K

* Re: Cheap way to check for new messages in a thread
  2023-03-27 15:08 Cheap way to check for new messages in a thread Konstantin Ryabitsev
@ 2023-03-27 19:10 ` Eric Wong
  2023-03-27 20:47 ` Konstantin Ryabitsev
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Wong @ 2023-03-27 19:10 UTC
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> Hello:
>
> For the bugzilla integration work I'm doing, I need a way to check if there
> were any updates to a thread since the last check. Right now, I'm just
> grabbing the full thread, parsing it and seeing if there are any new
> message-IDs that we don't know about, but it's very wasteful. Any way to just
> issue something like "how many messages are in a thread with this message-id"
> or "are there any updates to a thread with this message-id since
> YYYYMMDDHHMMSS?

  lei q -t --only /path/to/(inbox|extindex) mid:$MSGID rt:APPROXIDATE..

Returns JSON and won't retrieve message bodies from git.

I wouldn't query down to the second due to propagation delays,
clock skew, etc, though.

There might be a JMAP endpoint I can implement for WWW which
only retrieves that info, but getting backreferences (required
by the JMAP spec) to work properly seemed painful.

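The advice above about not querying down to the second can be sketched as
follows. This is only an illustration: `coarse_since` and `lei_thread_cmd`
are hypothetical helper names, not part of lei, and the sketch assumes the
caller stores its last-check time as YYYYMMDDHHMMSS. The lei flags and query
prefixes are the ones quoted in the message above. Rounding down to the day
only widens the window, so the caller is assumed to de-duplicate results by
Message-ID.

```shell
# Hypothetical helper: round a YYYYMMDDHHMMSS timestamp down to
# YYYY-MM-DD so the query tolerates propagation delay and clock skew.
coarse_since() {
    printf '%s\n' "$1" | sed 's/^\(....\)\(..\)\(..\).*$/\1-\2-\3/'
}

# Hypothetical helper: print (not run) the lei invocation for
# "anything in this thread since roughly that day?".
# $1 = inbox or extindex path, $2 = Message-ID, $3 = YYYYMMDDHHMMSS
lei_thread_cmd() {
    printf 'lei q -t --only %s mid:%s rt:%s..\n' \
        "$1" "$2" "$(coarse_since "$3")"
}
```

For example, `lei_thread_cmd /path/to/inbox "$MSGID" 20230327150800` prints a
command whose date range starts at `rt:2023-03-27..`.
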
* Re: Cheap way to check for new messages in a thread
  2023-03-27 19:10 ` Eric Wong
@ 2023-03-27 20:47 ` Konstantin Ryabitsev
  2023-03-27 21:38 ` Eric Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Konstantin Ryabitsev @ 2023-03-27 20:47 UTC
  To: Eric Wong; +Cc: meta

On Mon, Mar 27, 2023 at 07:10:49PM +0000, Eric Wong wrote:
> > For the bugzilla integration work I'm doing, I need a way to check if there
> > were any updates to a thread since the last check. Right now, I'm just
> > grabbing the full thread, parsing it and seeing if there are any new
> > message-IDs that we don't know about, but it's very wasteful. Any way to just
> > issue something like "how many messages are in a thread with this message-id"
> > or "are there any updates to a thread with this message-id since
> > YYYYMMDDHHMMSS?
>
>   lei q -t --only /path/to/(inbox|extindex) mid:$MSGID rt:APPROXIDATE..
>
> Returns JSON and won't retrieve message bodies from git.

Ah, I was hoping to have a fully remote way of doing this.

> I wouldn't query down to the second due to propagation delays,
> clock skew, etc, though.
>
> There might be a JMAP endpoint I can implement for WWW which
> only retrieves that info, but getting backreferences (required
> by the JMAP spec) to work properly seemed painful.

What about a "bodiless" atom feed? It's already available per thread, so
perhaps there could be a mode that skips the bodies or trims them after the
first paragraph?

-K

* Re: Cheap way to check for new messages in a thread
  2023-03-27 20:47 ` Konstantin Ryabitsev
@ 2023-03-27 21:38 ` Eric Wong
  2023-03-28 14:04 ` Konstantin Ryabitsev
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Wong @ 2023-03-27 21:38 UTC
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Mon, Mar 27, 2023 at 07:10:49PM +0000, Eric Wong wrote:
> > > For the bugzilla integration work I'm doing, I need a way to check if there
> > > were any updates to a thread since the last check. Right now, I'm just
> > > grabbing the full thread, parsing it and seeing if there are any new
> > > message-IDs that we don't know about, but it's very wasteful. Any way to just
> > > issue something like "how many messages are in a thread with this message-id"
> > > or "are there any updates to a thread with this message-id since
> > > YYYYMMDDHHMMSS?
> >
> >   lei q -t --only /path/to/(inbox|extindex) mid:$MSGID rt:APPROXIDATE..
> >
> > Returns JSON and won't retrieve message bodies from git.
>
> Ah, I was hoping to have a fully remote way of doing this.
>
> > I wouldn't query down to the second due to propagation delays,
> > clock skew, etc, though.
> >
> > There might be a JMAP endpoint I can implement for WWW which
> > only retrieves that info, but getting backreferences (required
> > by the JMAP spec) to work properly seemed painful.
>
> What about a "bodiless" atom feed? It's already available per thread, so
> perhaps there could be a mode that skips the bodies or trims them after the
> first paragraph?

I thought about that, too; but I'm worried about having one-off
stuff that ends up needing to be supported indefinitely.

JMAP for this would take more time, but I'd be more comfortable
carrying it long-term.

I don't expect trimming after the first paragraph to be a huge
improvement. Retrieving any part of the message from git and
dealing with MIME is expensive, anyways. I wouldn't expect it
to be a big (if any) improvement compared to POST-ing for the
mbox.gz (&x=m&t=1) endpoint with rt:$SINCE..

The mbox.gz endpoints should be a bit more efficient for the
server than Atom feeds; decoding MIME and HTML escaping takes up
considerable CPU time.

* Re: Cheap way to check for new messages in a thread
  2023-03-27 21:38 ` Eric Wong
@ 2023-03-28 14:04 ` Konstantin Ryabitsev
  2023-03-28 19:45 ` Eric Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Konstantin Ryabitsev @ 2023-03-28 14:04 UTC
  To: Eric Wong; +Cc: meta

On Mon, Mar 27, 2023 at 09:38:49PM +0000, Eric Wong wrote:
> I thought about that, too; but I'm worried about having one-off
> stuff that ends up needing to be supported indefinitely.
>
> JMAP for this would take more time, but I'd be more comfortable
> carrying it long-term.
>
> I don't expect trimming after the first paragraph to be a huge
> improvement. Retrieving any part of the message from git and
> dealing with MIME is expensive, anyways. I wouldn't expect it
> to be a big (if any) improvement compared to POST-ing for the
> mbox.gz (&x=m&t=1) endpoint with rt:$SINCE..

Hmm... This didn't seem to do the right thing for me. For example, this
thread:

https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2

If I ask for any new messages in that thread since 20230327120000, I get
nothing:

  curl -Sf -d '' 'https://lore.kernel.org/all/?x=m&t=1&q=mid%3A20230327080502.GA570847@ziqianlu-desk2+AND+dt%3A20230328120000..'

> The mbox.gz endpoints should be a bit more efficient for the
> server than Atom feeds; decoding MIME and HTML escaping takes up
> considerable CPU time.

Good to know. I'm really looking for a way to ask the remote system "hey, is
there anything new in this thread?" so that I can quickly ignore threads
without any updates.

-K

* Re: Cheap way to check for new messages in a thread
  2023-03-28 14:04 ` Konstantin Ryabitsev
@ 2023-03-28 19:45 ` Eric Wong
  2023-03-28 20:00 ` Konstantin Ryabitsev
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Wong @ 2023-03-28 19:45 UTC
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Mon, Mar 27, 2023 at 09:38:49PM +0000, Eric Wong wrote:
> > I thought about that, too; but I'm worried about having one-off
> > stuff that ends up needing to be supported indefinitely.
> >
> > JMAP for this would take more time, but I'd be more comfortable
> > carrying it long-term.
> >
> > I don't expect trimming after the first paragraph to be a huge
> > improvement. Retrieving any part of the message from git and
> > dealing with MIME is expensive, anyways. I wouldn't expect it
> > to be a big (if any) improvement compared to POST-ing for the
> > mbox.gz (&x=m&t=1) endpoint with rt:$SINCE..
>
> Hmm... This didn't seem to do the right thing for me. For example, this
> thread:
>
> https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2
>
> If I ask for any new messages in that thread since 20230327120000, I get
> nothing:
>
>   curl -Sf -d '' 'https://lore.kernel.org/all/?x=m&t=1&q=mid%3A20230327080502.GA570847@ziqianlu-desk2+AND+dt%3A20230328120000..'

Ugh, that's because the thread expansion (t=1) happens after
Xapian handles dt:/rt:/d:

I don't know if there's a good way to do that entirely within
Xapian via high-level Perl bindings.

Some options:

A) grab MSGID first, lookup THREADID for a given MSGID,
   use remaining query

   The problem is figuring out which parts of the query to
   handle, first.  Maybe a solution below...

B) add explicit before= and after= parameters which allow us
   to do filtering ourselves in the thread expansion phase

C) index References:/In-Reply-To: so searching `ref:$MSGID'
   can work. This doesn't work for some MUAs and deep
   threads, though.

D) Support `thread:{subquery}' like notmuch.  Thus
   `thread:{mid:$MSGID} AND dt:$START..' would communicate
   to Xapian what we want for A).  I'm not sure this is doable
   unless using Xapian via C++, but I've been considering
   providing the option to use C++ anyways to support less
   hacky approxidate query parsing.

   According to notmuch docs, it's expensive, though :<

I think it's possible to support /$INBOX/$MSGID/t.mbox.gz?q=...
for A) without too much difficulty.  I'll have to think about it
a bit...

D) is good for long-term consideration if proper timeouts can be
implemented.

> > The mbox.gz endpoints should be a bit more efficient for the
> > server than Atom feeds; decoding MIME and HTML escaping takes up
> > considerable CPU time.
>
> Good to know. I'm really looking for a way to ask the remote system "hey, is
> there anything new in this thread?" so that I can quickly ignore threads
> without any updates.

All the mbox.gz endpoints will 404 if there's no results, and
the `-f' flag of curl will ensure nothing's emitted to stdout in
that case.

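The 404-on-no-results behaviour described above could be turned into a cheap
yes/no check along these lines. This is a sketch, not something from the
thread: `thread_status` is a hypothetical helper name and the `CURL`
override variable is an assumption added so the logic can be exercised
without network access. It relies on curl exiting with status 22 when `-f`
is given and the server answers with an HTTP error.

```shell
# CURL can be pointed at a stub for offline testing; defaults to curl.
: "${CURL:=curl}"

# Hypothetical helper: print "updated", "unchanged" or "error" for a
# POST search URL ($1) against one of the mbox.gz endpoints.
thread_status() {
    rc=0
    # -f: fail on HTTP errors; -o /dev/null: the body is not needed here
    $CURL -sf -d '' -o /dev/null "$1" || rc=$?
    case $rc in
    0)  echo updated ;;
    # 22 means "HTTP status >= 400"; this covers the 404 for no
    # results, but other HTTP errors (e.g. 500) look the same.
    22) echo unchanged ;;
    *)  echo error ;;
    esac
}
```

Usage would look like `thread_status "$url"`, where `$url` is whichever
search URL the caller already POSTs to. Only threads reported as "updated"
would then need a full fetch.
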
* Re: Cheap way to check for new messages in a thread
  2023-03-28 19:45 ` Eric Wong
@ 2023-03-28 20:00 ` Konstantin Ryabitsev
  2023-03-28 22:08 ` Eric Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Konstantin Ryabitsev @ 2023-03-28 20:00 UTC
  To: Eric Wong; +Cc: meta

On Tue, Mar 28, 2023 at 07:45:49PM +0000, Eric Wong wrote:
> C) index References:/In-Reply-To: so searching `ref:$MSGID'
>    can work. This doesn't work for some MUAs and deep
>    threads, though.

I think this is a workable approach, but would require a reindex, right?

-K

* Re: Cheap way to check for new messages in a thread
  2023-03-28 20:00 ` Konstantin Ryabitsev
@ 2023-03-28 22:08 ` Eric Wong
  2023-03-28 23:30 ` Konstantin Ryabitsev
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Wong @ 2023-03-28 22:08 UTC
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Tue, Mar 28, 2023 at 07:45:49PM +0000, Eric Wong wrote:
> > C) index References:/In-Reply-To: so searching `ref:$MSGID'
> >    can work. This doesn't work for some MUAs and deep
> >    threads, though.
>
> I think this is a workable approach, but would require a reindex, right?

Yes, it requires a reindex to take effect, which takes ~2 days
on my lore mirror. The biggest problem is MUAs are likely to
cull References: when threads get too long; so accuracy gets
lost.

Supporting /$MSGID/?q=... doesn't seem like the worst idea,
actually; since I've seen some web forums (phpBB maybe?) have a
"search in thread" function.

thread:{sub-query} is ideal; and I wouldn't rule out doing any
combination of the three (I don't like separating before/after).

* Re: Cheap way to check for new messages in a thread
  2023-03-28 22:08 ` Eric Wong
@ 2023-03-28 23:30 ` Konstantin Ryabitsev
  2023-03-29 21:25 ` Eric Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Konstantin Ryabitsev @ 2023-03-28 23:30 UTC
  To: Eric Wong; +Cc: meta

On Tue, Mar 28, 2023 at 10:08:30PM +0000, Eric Wong wrote:
> > I think this is a workable approach, but would require a reindex, right?
>
> Yes, it requires a reindex to take effect, which takes ~2 days
> on my lore mirror. The biggest problem is MUAs are likely to
> cull References: when threads get too long; so accuracy gets
> lost.
>
> Supporting /$MSGID/?q=... doesn't seem like the worst idea,
> actually; since I've seen some web forums (phpBB maybe?) have a
> "search in thread" function.
>
> thread:{sub-query} is ideal; and I wouldn't rule out doing any
> combination of the three (I don't like separating before/after).

I'm fine with either of these, and just to stress, it's not really blocking
anything I'm working on -- bugbot is in initial rollout stages, so while the
number of tracked bugs/threads remains low, even if we re-download a hundred
threads every 10 minutes, it's just internal churn between two adjacent VMs.
If it becomes heavy, I can always look into switching to lei and performing
local queries instead of doing external polling.

However, if you do want to add ability to cheaply do a "give me just the
newest messages in this thread since this datetime", that would be great for
my needs. :)

-K

* Re: Cheap way to check for new messages in a thread
  2023-03-28 23:30 ` Konstantin Ryabitsev
@ 2023-03-29 21:25 ` Eric Wong
  2023-03-30 11:29 ` Eric Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Wong @ 2023-03-29 21:25 UTC
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> I'm fine with either of these, and just to stress, it's not really blocking
> anything I'm working on -- bugbot is in initial rollout stages, so while the
> number of tracked bugs/threads remains low, even if we re-download a hundred
> threads every 10 minutes, it's just internal churn between two adjacent VMs.
> If it becomes heavy, I can always look into switching to lei and performing
> local queries instead of doing external polling.

Alright.

> However, if you do want to add ability to cheaply do a "give me just the
> newest messages in this thread since this datetime", that would be great for
> my needs. :)

Per-thread search is something I've wanted for a while, anyways,
so I think I'll do /$MSGID/?q= in between ongoing work for
codesearch and chasing down FreeBSD issues.

I may not expose /$MSGID/?q= via HTML just yet since I find
<form> elements confusing as a user :x

Indexing References/IRT would be a waste of space and I/O due to
MUA truncations; so I'm hesitant to do it since we already index
THREADID.

thread:{sub-query} will be nice, but I'll get to it after I deal
with the lei FUSE stuff since I've already done most of the C
work from another project.  Normal FSes are so inefficient for
storing Maildir outputs.

* Re: Cheap way to check for new messages in a thread
  2023-03-29 21:25 ` Eric Wong
@ 2023-03-30 11:29 ` Eric Wong
  2023-03-30 16:45 ` Konstantin Ryabitsev
  2023-06-16 19:11 ` Konstantin Ryabitsev
  0 siblings, 2 replies; 17+ messages in thread
From: Eric Wong @ 2023-03-30 11:29 UTC
  To: Konstantin Ryabitsev; +Cc: meta

Eric Wong <e@80x24.org> wrote:
> Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> > However, if you do want to add ability to cheaply do a "give me just the
> > newest messages in this thread since this datetime", that would be great for
> > my needs. :)
>
> Per-thread search is something I've wanted for a while, anyways,
> so I think I'll do /$MSGID/?q= in between ongoing work for

This implements the mbox.gz retrieval.  I didn't want to deal
with HTML nor figuring out how to expose more <form> elements,
yet; but I figure mbox.gz is the most important.

Now deployed on 80x24.org/lore:

  MSGID=20230327080502.GA570847@ziqianlu-desk2
  curl -d '' -sSf \
    https://80x24.org/lore/all/"$MSGID/?x=m&q=rt:2023-03-29.." | \
    zcat | grep -i ^Message-ID:

shows the expected messages.

-----------8<-----------
Subject: [PATCH] www: support POST /$INBOX/$MSGID/?x=m&q=

This allows filtering the contents of any existing thread using
a search query.  It uses the existing THREADID column in Xapian
so we can internally add a Xapian OP_FILTER to the results.

This new functionality is orthogonal to the existing `t=1'
parameter which gives mairix-style thread expansion.  It doesn't
make sense to use `t=1' with this functionality, but it's not
disallowed, either.

The indentation change in Over->next_by_mid is to ensure
DBI->prepare_cached can share across both ->next_by_mid
and ->mid2tid.

I also noticed the existing regex for `POST /$INBOX/?x=m&q='
was allowing extra characters.
With an added \z, it's now as strict as originally intended and AFAIK nothing was generating invalid URLs for it. Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org> Link: https://public-inbox.org/meta/aaniyhk7wfm4e6m5mbukcrhevzoc6ftctyrfwvmz4fkykwwtlj@mverfng6ytas/T/ --- lib/PublicInbox/Mbox.pm | 5 ++++ lib/PublicInbox/Over.pm | 24 ++++++++++++++++++- lib/PublicInbox/Search.pm | 6 +++++ lib/PublicInbox/WWW.pm | 4 +++- t/psgi_v2.t | 50 ++++++++++++++++++++++++++++++++++----- 5 files changed, 81 insertions(+), 8 deletions(-) diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm index 18db9d38..e1abf7ec 100644 --- a/lib/PublicInbox/Mbox.pm +++ b/lib/PublicInbox/Mbox.pm @@ -229,6 +229,11 @@ sub mbox_all { return PublicInbox::WWW::need($ctx, 'Overview'); my $qopts = $ctx->{qopts} = { relevance => -2 }; # ORDER BY docid DESC + + # {threadid} limits results to a given thread + # {threads} collapses results from messages in the same thread, + # allowing us to use ->expand_thread w/o duplicates in our own code + $qopts->{threadid} = $over->mid2tid($ctx->{mid}) if defined($ctx->{mid}); $qopts->{threads} = 1 if $q->{t}; $srch->query_approxidate($ctx->{ibx}->git, $q_string); my $mset = $srch->mset($q_string, $qopts); diff --git a/lib/PublicInbox/Over.pm b/lib/PublicInbox/Over.pm index 271e2246..6ba27118 100644 --- a/lib/PublicInbox/Over.pm +++ b/lib/PublicInbox/Over.pm @@ -283,13 +283,35 @@ SELECT eidx_key FROM inboxes WHERE ibx_id = ? $rows; } +sub mid2tid { + my ($self, $mid) = @_; + my $dbh = dbh($self); + + my $sth = $dbh->prepare_cached(<<'', undef, 1); +SELECT id FROM msgid WHERE mid = ? LIMIT 1 + + $sth->execute($mid); + my $id = $sth->fetchrow_array or return; + $sth = $dbh->prepare_cached(<<'', undef, 1); +SELECT num FROM id2num WHERE id = ? AND num > ? +ORDER BY num ASC LIMIT 1 + + $sth->execute($id, 0); + my $num = $sth->fetchrow_array or return; + $sth = $dbh->prepare(<<''); +SELECT tid FROM over WHERE num = ?
LIMIT 1 + + $sth->execute($num); + $sth->fetchrow_array; +} + sub next_by_mid { my ($self, $mid, $id, $prev) = @_; my $dbh = dbh($self); unless (defined $$id) { my $sth = $dbh->prepare_cached(<<'', undef, 1); - SELECT id FROM msgid WHERE mid = ? LIMIT 1 +SELECT id FROM msgid WHERE mid = ? LIMIT 1 $sth->execute($mid); $$id = $sth->fetchrow_array; diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm index 5133a3b7..6c3d9f93 100644 --- a/lib/PublicInbox/Search.pm +++ b/lib/PublicInbox/Search.pm @@ -386,6 +386,12 @@ sub mset { sortable_serialise($uid_range->[1])); $query = $X{Query}->new(OP_FILTER(), $query, $range); } + if (defined(my $tid = $opt->{threadid})) { + $tid = sortable_serialise($tid); + $query = $X{Query}->new(OP_FILTER(), $query, + $X{Query}->new(OP_VALUE_RANGE(), THREADID, $tid, $tid)); + } + my $xdb = xdb($self); my $enq = $X{Enquire}->new($xdb); $enq->set_query($query); diff --git a/lib/PublicInbox/WWW.pm b/lib/PublicInbox/WWW.pm index 9ffcb879..a8f1ad17 100644 --- a/lib/PublicInbox/WWW.pm +++ b/lib/PublicInbox/WWW.pm @@ -68,7 +68,9 @@ sub call { my ($idx, $fn) = ($3, $4); return invalid_inbox_mid($ctx, $1, $2) || get_attach($ctx, $idx, $fn); - } elsif ($path_info =~ m!$INBOX_RE/!o) { + } elsif ($path_info =~ m!$INBOX_RE/$MID_RE/\z!o) { + return invalid_inbox_mid($ctx, $1, $2) || mbox_results($ctx); + } elsif ($path_info =~ m!$INBOX_RE/\z!o) { return invalid_inbox($ctx, $1) || mbox_results($ctx); } } diff --git a/t/psgi_v2.t b/t/psgi_v2.t index 5b197a9f..0a77adfb 100644 --- a/t/psgi_v2.t +++ b/t/psgi_v2.t @@ -4,6 +4,7 @@ use strict; use v5.10.1; use PublicInbox::TestCommon; +use IO::Uncompress::Gunzip qw(gunzip); require_git(2.6); use PublicInbox::Eml; use PublicInbox::Config; @@ -76,6 +77,30 @@ $new_mid //= do { local $/; <$fh>; }; + +my $m2t = create_inbox 'mid2tid-1', version => 2, indexlevel => 'medium', sub { + my ($im, $ibx) = @_; + for my $n (1..3) { + $im->add(PublicInbox::Eml->new(<<EOM)) or xbail 'add'; +Date: Fri, 02 Oct 1993 
00:0$n:00 +0000 +Message-ID: <t\@$n> +Subject: tid $n +From: x\@example.com +References: <a-mid\@b> + +$n +EOM + $im->add(PublicInbox::Eml->new(<<EOM)) or xbail 'add'; +Date: Fri, 02 Oct 1993 00:0$n:00 +0000 +Message-ID: <ut\@$n> +Subject: unrelated tid $n +From: x\@example.com +References: <b-mid\@b> + +EOM + } +}; + my $cfgpath = "$ibx->{inboxdir}/pi_config"; { open my $fh, '>', $cfgpath or BAIL_OUT $!; @@ -86,6 +111,9 @@ my $cfgpath = "$ibx->{inboxdir}/pi_config"; [publicinbox "dup"] inboxdir = $dibx->{inboxdir} address = $dibx->{-primary_address} +[publicinbox "m2t"] + inboxdir = $m2t->{inboxdir} + address = $m2t->{-primary_address} EOF close $fh or BAIL_OUT; } @@ -178,20 +206,18 @@ my $client1 = sub { $cfg->each_inbox(sub { $_[0]->search->reopen }); SKIP: { - eval { require IO::Uncompress::Gunzip }; - skip 'IO::Uncompress::Gunzip missing', 6 if $@; my ($in, $out, $status); my $req = GET('/v2test/a-mid@b/raw'); $req->header('Accept-Encoding' => 'gzip'); $res = $cb->($req); is($res->header('Content-Encoding'), 'gzip', 'gzip encoding'); $in = $res->content; - IO::Uncompress::Gunzip::gunzip(\$in => \$out); + gunzip(\$in => \$out); is($out, $raw, 'gzip response matches'); $res = $cb->(GET('/v2test/a-mid@b/t.mbox.gz')); $in = $res->content; - $status = IO::Uncompress::Gunzip::gunzip(\$in => \$out); + $status = gunzip(\$in => \$out); unlike($out, qr/^From oldbug/sm, 'buggy "From_" line omitted'); like($out, qr/^hello world$/m, 'got first in t.mbox.gz'); like($out, qr/^hello world!$/m, 'got second in t.mbox.gz'); @@ -202,7 +228,7 @@ my $client1 = sub { # search interface $res = $cb->(POST('/v2test/?q=m:a-mid@b&x=m')); $in = $res->content; - $status = IO::Uncompress::Gunzip::gunzip(\$in => \$out); + $status = gunzip(\$in => \$out); unlike($out, qr/^From oldbug/sm, 'buggy "From_" line omitted'); like($out, qr/^hello world$/m, 'got first in mbox POST'); like($out, qr/^hello world!$/m, 'got second in mbox POST'); @@ -213,7 +239,7 @@ my $client1 = sub { # all.mbox.gz 
interface $res = $cb->(GET('/v2test/all.mbox.gz')); $in = $res->content; - $status = IO::Uncompress::Gunzip::gunzip(\$in => \$out); + $status = gunzip(\$in => \$out); unlike($out, qr/^From oldbug/sm, 'buggy "From_" line omitted'); like($out, qr/^hello world$/m, 'got first in all.mbox'); like($out, qr/^hello world!$/m, 'got second in all.mbox'); @@ -335,6 +361,18 @@ my $client3 = sub { local $SIG{__WARN__} = sub { push @warn, @_ }; $res = $cb->(GET('/v2test/?t=1970'.'01'.'01')); is_deeply(\@warn, [], 'no warnings on YYYYMMDD only'); + + $res = $cb->(POST("/m2t/t\@1/?q=dt:19931002000300..&x=m")); + is($res->code, 200, 'got 200 on mid2tid query'); + gunzip(\(my $in = $res->content) => \(my $out)); + my @m = ($out =~ m!^Message-ID: <([^>]+)>\n!gms); + is_deeply(\@m, ['t@3'], 'only got latest result from query'); + + $res = $cb->(POST("/m2t/t\@1/?q=dt:19931002000400..&x=m")); + is($res->code, 404, '404 on out-of-range mid2tid query'); + + $res = $cb->(POST("/m2t/t\@1/?q=s:unrelated&x=m")); + is($res->code, 404, '404 on cross-thread search'); }; test_psgi(sub { $www->call(@_) }, $client3); test_httpd($env, $client3, 4); ^ permalink raw reply related [flat|nested] 17+ messages in thread
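On the client side, the per-thread endpoint added by the patch above could
be driven roughly like this. It is a sketch only: `thread_search_url` and
`mbox_mids` are hypothetical helper names, and the URL layout is copied from
the curl example in the message rather than from any documented API.
Message-IDs containing characters that are special in URLs would presumably
need percent-escaping, which this sketch does not attempt.

```shell
# Hypothetical helper: build the per-thread search URL.
# $1 = base URL, $2 = inbox name, $3 = Message-ID, $4 = since-date
thread_search_url() {
    printf '%s/%s/%s/?x=m&q=rt:%s..\n' "$1" "$2" "$3" "$4"
}

# Hypothetical helper: read a gzipped mbox on stdin and print its
# Message-ID header lines.  Like the grep in the original example,
# this also matches body lines that start with "Message-ID:".
mbox_mids() {
    gzip -dc | grep -i '^Message-ID:'
}
```

A poll would then be something like
`curl -d '' -sSf "$(thread_search_url https://80x24.org/lore all "$MSGID" 2023-03-29)" | mbox_mids`,
with the caller comparing the printed IDs against the ones it already knows.
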
* Re: Cheap way to check for new messages in a thread
  2023-03-30 11:29 ` Eric Wong
@ 2023-03-30 16:45 ` Konstantin Ryabitsev
  2023-03-31 1:40 ` Eric Wong
  2023-06-16 19:11 ` Konstantin Ryabitsev
  1 sibling, 1 reply; 17+ messages in thread
From: Konstantin Ryabitsev @ 2023-03-30 16:45 UTC
  To: Eric Wong; +Cc: meta

On Thu, Mar 30, 2023 at 11:29:51AM +0000, Eric Wong wrote:
> > Per-thread search is something I've wanted for a while, anyways,
> > so I think I'll do /$MSGID/?q= in between ongoing work for
>
> This implements the mbox.gz retrieval.  I didn't want to deal
> with HTML nor figuring out how to expose more <form> elements,
> yet; but I figure mbox.gz is the most important.

Nice, thanks!

I can't easily test this, because lore is currently mostly on 1.9 and the
patch doesn't cleanly apply to that tree. However, I will be happy to test it
out once 2.0 is out and we've updated to it on our systems.

Cheers,
-K

* Re: Cheap way to check for new messages in a thread
  2023-03-30 16:45 ` Konstantin Ryabitsev
@ 2023-03-31 1:40 ` Eric Wong
  2023-04-11 11:27 ` Eric Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Wong @ 2023-03-31 1:40 UTC
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> I can't easily test this, because lore is currently mostly on 1.9 and the
> patch doesn't cleanly apply to that tree. However, I will be happy to test it
> out once 2.0 is out and we've updated to it on our systems.

Fwiw, master is good on Linux for mail.

codesearch still needs work, and lei on FreeBSD gets stuck
sometimes.

* Re: Cheap way to check for new messages in a thread
  2023-03-31 1:40 ` Eric Wong
@ 2023-04-11 11:27 ` Eric Wong
  0 siblings, 0 replies; 17+ messages in thread
From: Eric Wong @ 2023-04-11 11:27 UTC
  To: Konstantin Ryabitsev; +Cc: meta

Eric Wong <e@80x24.org> wrote:
> Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> > I can't easily test this, because lore is currently mostly on 1.9 and the
> > patch doesn't cleanly apply to that tree. However, I will be happy to test it
> > out once 2.0 is out and we've updated to it on our systems.
>
> Fwiw, master is good on Linux for mail.

Erm, almost :x   The --batch-command support added a
difficult-to-trigger bug that went undetected for a few months:

  https://public-inbox.org/meta/20230411112350.297099-1-e@80x24.org/
  ("git: fix cat_async_retry")

* Re: Cheap way to check for new messages in a thread
  2023-03-30 11:29 ` Eric Wong
  2023-03-30 16:45 ` Konstantin Ryabitsev
@ 2023-06-16 19:11 ` Konstantin Ryabitsev
  2023-06-16 23:13 ` [PATCH] www: use correct threadid for per-thread search Eric Wong
  1 sibling, 1 reply; 17+ messages in thread
From: Konstantin Ryabitsev @ 2023-06-16 19:11 UTC
  To: Eric Wong; +Cc: meta

On Thu, Mar 30, 2023 at 11:29:51AM +0000, Eric Wong wrote:
> This implements the mbox.gz retrieval.  I didn't want to deal
> with HTML nor figuring out how to expose more <form> elements,
> yet; but I figure mbox.gz is the most important.
>
> Now deployed on 80x24.org/lore:
>
>   MSGID=20230327080502.GA570847@ziqianlu-desk2
>   curl -d '' -sSf \
>     https://80x24.org/lore/all/"$MSGID/?x=m&q=rt:2023-03-29.." | \
>     zcat | grep -i ^Message-ID:

Eric:

Reviving this old thread for some clarification. I noticed that this only
works for /all/, but not for individual inboxes. E.g.:

  $ curl -d '' -sSf \
      https://lore.kernel.org/all/"$MSGID/?x=m&q=rt:2023-03-29.." \
      | zgrep -i ^Message-ID:
  Message-ID: <cfcf852c-e9f0-f560-542d-0f72777a85b2@leemhuis.info>

but with /lkml/ I get a 404:

  $ curl -d '' -sSf \
      https://lore.kernel.org/lkml/"$MSGID/?x=m&q=rt:2023-03-29.." \
      | zgrep -i ^Message-ID:
  curl: (22) The requested URL returned error: 404

Is that intentionally restricted to just extindex?

-K

* [PATCH] www: use correct threadid for per-thread search
  2023-06-16 19:11 ` Konstantin Ryabitsev
@ 2023-06-16 23:13 ` Eric Wong
  2023-06-21 17:11 ` Konstantin Ryabitsev
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Wong @ 2023-06-16 23:13 UTC
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Thu, Mar 30, 2023 at 11:29:51AM +0000, Eric Wong wrote:
> > This implements the mbox.gz retrieval.  I didn't want to deal
> > with HTML nor figuring out how to expose more <form> elements,
> > yet; but I figure mbox.gz is the most important.
> >
> > Now deployed on 80x24.org/lore:
> >
> >   MSGID=20230327080502.GA570847@ziqianlu-desk2
> >   curl -d '' -sSf \
> >     https://80x24.org/lore/all/"$MSGID/?x=m&q=rt:2023-03-29.." | \
> >     zcat | grep -i ^Message-ID:
>
> Eric:
>
> Reviving this old thread for some clarification. I noticed that this only
> works for /all/, but not for individual inboxes. E.g.:
>
>   $ curl -d '' -sSf \
>       https://lore.kernel.org/all/"$MSGID/?x=m&q=rt:2023-03-29.." \
>       | zgrep -i ^Message-ID:
>   Message-ID: <cfcf852c-e9f0-f560-542d-0f72777a85b2@leemhuis.info>
>
> but with /lkml/ I get a 404:
>
>   $ curl -d '' -sSf \
>       https://lore.kernel.org/lkml/"$MSGID/?x=m&q=rt:2023-03-29.." \
>       | zgrep -i ^Message-ID:
>   curl: (22) The requested URL returned error: 404
>
> Is that intentionally restricted to just extindex?

It's a bug, fix below and deployed to https://80x24.org/lore/

---------8<---------
Subject: [PATCH] www: use correct threadid for per-thread search

For individual public-inboxes relying on extindex for per-inbox
search, we must use the threadid from the extindex over.sqlite3
rather than the per-inbox over.sqlite3 file.

Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org> Link: https://public-inbox.org/meta/20230616-rudy-comedy-vision-2b9f92@meerkat/ --- lib/PublicInbox/Mbox.pm | 10 +++++++--- t/extindex-psgi.t | 39 +++++++++++++++++++++++++++++++++++++-- 2 files changed, 44 insertions(+), 5 deletions(-) diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm index e1abf7ec..bf61bb0e 100644 --- a/lib/PublicInbox/Mbox.pm +++ b/lib/PublicInbox/Mbox.pm @@ -225,15 +225,19 @@ sub mbox_all { return mbox_all_ids($ctx) if $q_string !~ /\S/; my $srch = $ctx->{ibx}->isrch or return PublicInbox::WWW::need($ctx, 'Search'); - my $over = $ctx->{ibx}->over or - return PublicInbox::WWW::need($ctx, 'Overview'); my $qopts = $ctx->{qopts} = { relevance => -2 }; # ORDER BY docid DESC # {threadid} limits results to a given thread # {threads} collapses results from messages in the same thread, # allowing us to use ->expand_thread w/o duplicates in our own code - $qopts->{threadid} = $over->mid2tid($ctx->{mid}) if defined($ctx->{mid}); + if (defined($ctx->{mid})) { + my $over = ($ctx->{ibx}->{isrch} ? 
+				$ctx->{ibx}->{isrch}->{es}->over :
+				$ctx->{ibx}->over) or
+			return PublicInbox::WWW::need($ctx, 'Overview');
+		$qopts->{threadid} = $over->mid2tid($ctx->{mid});
+	}
 	$qopts->{threads} = 1 if $q->{t};
 	$srch->query_approxidate($ctx->{ibx}->git, $q_string);
 	my $mset = $srch->mset($q_string, $qopts);
diff --git a/t/extindex-psgi.t b/t/extindex-psgi.t
index 98dc2e48..f10ffbb6 100644
--- a/t/extindex-psgi.t
+++ b/t/extindex-psgi.t
@@ -1,5 +1,5 @@
 #!perl -w
-# Copyright (C) 2020-2021 all contributors <meta@public-inbox.org>
+# Copyright (C) all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 use strict;
 use v5.10.1;
@@ -21,7 +21,28 @@ mkdir "$home/.public-inbox" or BAIL_OUT $!;
 my $pi_config = "$home/.public-inbox/config";
 cp($cfg_path, $pi_config) or BAIL_OUT;
 my $env = { HOME => $home };
-run_script([qw(-extindex --all), "$tmpdir/eidx"], $env) or BAIL_OUT;
+my $m2t = create_inbox 'mid2tid', version => 2, indexlevel => 'basic', sub {
+	my ($im, $ibx) = @_;
+	for my $n (1..3) {
+		$im->add(PublicInbox::Eml->new(<<EOM)) or xbail 'add';
+Date: Fri, 02 Oct 1993 00:0$n:00 +0000
+Message-ID: <t\@$n>
+Subject: tid $n
+From: x\@example.com
+References: <a-mid\@b>
+
+$n
+EOM
+		$im->add(PublicInbox::Eml->new(<<EOM)) or xbail 'add';
+Date: Fri, 02 Oct 1993 00:0$n:00 +0000
+Message-ID: <ut\@$n>
+Subject: unrelated tid $n
+From: x\@example.com
+References: <b-mid\@b>
+
+EOM
+	}
+};
 {
 	open my $cfgfh, '>>', $pi_config or BAIL_OUT;
 	$cfgfh->autoflush(1);
@@ -32,8 +53,14 @@ run_script([qw(-extindex --all), "$tmpdir/eidx"], $env) or BAIL_OUT;
 [publicinbox]
 	wwwlisting = all
 	grokManifest = all
+[publicinbox "m2t"]
+	inboxdir = $m2t->{inboxdir}
+	address = $m2t->{-primary_address}
 EOM
+	close $cfgfh or xbail "close: $!";
 }
+
+run_script([qw(-extindex --all), "$tmpdir/eidx"], $env) or BAIL_OUT;
 my $www = PublicInbox::WWW->new(PublicInbox::Config->new($pi_config));
 my $client = sub {
 	my ($cb) = @_;
@@ -83,6 +110,14 @@ my $client = sub {
 		't2 manifest');
 	is_deeply([ sort keys %{$m->{'/t1'}} ], [ '/t1' ], 't2 manifest');
+
+	# ensure ibx->{isrch}->{es}->over is used instead of ibx->over:
+	$res = $cb->(POST("/m2t/t\@1/?q=dt:19931002000259..&x=m"));
+	is($res->code, 200, 'hit on mid2tid query');
+	$res = $cb->(POST("/m2t/t\@1/?q=dt:19931002000400..&x=m"));
+	is($res->code, 404, '404 on out-of-range mid2tid query');
+	$res = $cb->(POST("/m2t/t\@1/?q=s:unrelated&x=m"));
+	is($res->code, 404, '404 on cross-thread search');
 };
 test_psgi(sub { $www->call(@_) }, $client);
 %$env = (%$env, TMPDIR => $tmpdir, PI_CONFIG => $pi_config);
* Re: [PATCH] www: use correct threadid for per-thread search
  2023-06-16 23:13 ` [PATCH] www: use correct threadid for per-thread search Eric Wong
@ 2023-06-21 17:11 ` Konstantin Ryabitsev
  0 siblings, 0 replies; 17+ messages in thread
From: Konstantin Ryabitsev @ 2023-06-21 17:11 UTC
To: Eric Wong; +Cc: meta

On Fri, Jun 16, 2023 at 11:13:01PM +0000, Eric Wong wrote:
> > Reviving this old thread for some clarification. I noticed that this only
> > works for /all/, but not for individual inboxes. E.g.:
> >
> > $ curl -d '' -sSf \
> >   https://lore.kernel.org/all/"$MSGID/?x=m&q=rt:2023-03-29.." \
> >   | zgrep -i ^Message-ID:
> > Message-ID: <cfcf852c-e9f0-f560-542d-0f72777a85b2@leemhuis.info>
> >
> > but with /lkml/ I get a 404:
> >
> > $ curl -d '' -sSf \
> >   https://lore.kernel.org/lkml/"$MSGID/?x=m&q=rt:2023-03-29.." \
> >   | zgrep -i ^Message-ID:
> > curl: (22) The requested URL returned error: 404
> >
> > Is that intentionally restricted to just extindex?
>
> It's a bug, fix below and deployed to https://80x24.org/lore/

Indeed, looks good now. Thank you!

-K
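For anyone scripting the check discussed in this thread, the curl pipeline quoted above (an empty POST to INBOX/MSGID/?x=m&q=rt:DATE.., then grepping Message-ID: out of the gzipped result) might look roughly like this in Python. This is only a sketch under assumptions: the function names are made up for illustration, the base URL is whatever inbox or extindex you query, the response is assumed to be a gzipped mbox because the original command pipes it to zgrep, and a 404 is treated as "nothing matched" because that is what the test in the patch asserts. Folded Message-ID headers are not handled.

```python
import gzip
import urllib.error
import urllib.parse
import urllib.request


def thread_query_url(base, msgid, since):
    """Build the per-thread search URL (hypothetical helper, not a public-inbox API)."""
    # Drop the angle brackets and escape anything unsafe in a URL path ("/" included).
    mid = urllib.parse.quote(msgid.strip("<>"), safe="@")
    query = urllib.parse.urlencode({"x": "m", "q": f"rt:{since}.."})
    return f"{base.rstrip('/')}/{mid}/?{query}"


def message_ids(mbox_bytes):
    """Collect Message-ID header values from mbox text, skipping message bodies."""
    ids = []
    in_headers = False
    for line in mbox_bytes.splitlines():
        if line.startswith(b"From "):
            in_headers = True       # mbox separator: a new message starts here
        elif not line.strip():
            in_headers = False      # a blank line ends the header block
        elif in_headers and line.lower().startswith(b"message-id:"):
            ids.append(line.split(b":", 1)[1].strip().decode("ascii", "replace"))
    return ids


def new_message_ids(base, msgid, since):
    """POST an empty body (like `curl -d ''`) and return the Message-IDs found."""
    req = urllib.request.Request(
        thread_query_url(base, msgid, since), data=b"", method="POST")
    try:
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 404:         # assumed to mean "no messages matched"
            return []
        raise
    return message_ids(gzip.decompress(body))
```

Calling new_message_ids("https://lore.kernel.org/all/", msgid, "2023-03-29") would then mirror the first curl command; comparing the returned list against the Message-IDs already known locally gives the "anything new?" answer without parsing the whole thread.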
end of thread, other threads:[~2023-06-21 17:11 UTC | newest]

Thread overview: 17+ messages
2023-03-27 15:08 Cheap way to check for new messages in a thread Konstantin Ryabitsev
2023-03-27 19:10 ` Eric Wong
2023-03-27 20:47 ` Konstantin Ryabitsev
2023-03-27 21:38 ` Eric Wong
2023-03-28 14:04 ` Konstantin Ryabitsev
2023-03-28 19:45 ` Eric Wong
2023-03-28 20:00 ` Konstantin Ryabitsev
2023-03-28 22:08 ` Eric Wong
2023-03-28 23:30 ` Konstantin Ryabitsev
2023-03-29 21:25 ` Eric Wong
2023-03-30 11:29 ` Eric Wong
2023-03-30 16:45 ` Konstantin Ryabitsev
2023-03-31  1:40 ` Eric Wong
2023-04-11 11:27 ` Eric Wong
2023-06-16 19:11 ` Konstantin Ryabitsev
2023-06-16 23:13 ` [PATCH] www: use correct threadid for per-thread search Eric Wong
2023-06-21 17:11 ` Konstantin Ryabitsev