unofficial mirror of meta@public-inbox.org
* Cheap way to check for new messages in a thread
@ 2023-03-27 15:08 Konstantin Ryabitsev
  2023-03-27 19:10 ` Eric Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Konstantin Ryabitsev @ 2023-03-27 15:08 UTC (permalink / raw)
  To: meta

Hello:

For the bugzilla integration work I'm doing, I need a way to check if there
were any updates to a thread since the last check. Right now, I'm just
grabbing the full thread, parsing it and seeing if there are any new
message-IDs that we don't know about, but it's very wasteful. Any way to just
issue something like "how many messages are in a thread with this message-id"
or "are there any updates to a thread with this message-id since
YYYYMMDDHHMMSS"?

-K
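
For illustration, the "fetch everything and diff" approach described above
could look roughly like the sketch below. The function name new_message_ids
and the known-IDs file are hypothetical; the header matching is naive (it
ignores folded headers and would also match body lines that happen to start
with "Message-ID:").

```shell
# Hypothetical helper: read an uncompressed mbox on stdin and print the
# Message-IDs that are NOT already listed in the file named by $1
# (one bare Message-ID per line, without angle brackets).
new_message_ids() {
	known=$1
	grep -i '^Message-ID:' |
		sed -e 's/^[^:]*:[[:space:]]*//' -e 's/^<//' -e 's/>[[:space:]]*$//' |
		sort -u |
		grep -vxF -f "$known"
}
```

Usage would be something like `zcat thread.mbox.gz | new_message_ids known-ids.txt`;
the whole thread still has to be downloaded, which is the wasteful part.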

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Cheap way to check for new messages in a thread
  2023-03-27 15:08 Cheap way to check for new messages in a thread Konstantin Ryabitsev
@ 2023-03-27 19:10 ` Eric Wong
  2023-03-27 20:47   ` Konstantin Ryabitsev
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Wong @ 2023-03-27 19:10 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> Hello:
> 
> For the bugzilla integration work I'm doing, I need a way to check if there
> were any updates to a thread since the last check. Right now, I'm just
> grabbing the full thread, parsing it and seeing if there are any new
> message-IDs that we don't know about, but it's very wasteful. Any way to just
> issue something like "how many messages are in a thread with this message-id"
> or "are there any updates to a thread with this message-id since
> YYYYMMDDHHMMSS"?

  lei q -t --only /path/to/(inbox|extindex) mid:$MSGID rt:APPROXIDATE..

Returns JSON and won't retrieve message bodies from git.

I wouldn't query down to the second due to propagation delays,
clock skew, etc, though.
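
As a rough sketch of that advice, a poller could move its "since" time back
by a safety margin before building the date range, so messages delayed by
propagation or clock skew are not missed. The name since_with_margin and the
600-second default are assumptions for illustration, and `-d @EPOCH` assumes
GNU date.

```shell
# Hypothetical helper: turn a last-check epoch time ($1) into a UTC
# YYYYMMDDHHMMSS stamp that is earlier by a safety margin ($2, seconds).
# Assumes GNU date; the 600s default margin is arbitrary.
since_with_margin() {
	last_epoch=$1
	margin=${2:-600}
	date -u -d "@$((last_epoch - margin))" +%Y%m%d%H%M%S
}
```

Overlapping windows mean some messages are seen twice, so the caller would
still need to deduplicate by Message-ID.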


There might be a JMAP endpoint I can implement for WWW which
only retrieves that info, but getting backreferences (required
by the JMAP spec) to work properly seemed painful.


* Re: Cheap way to check for new messages in a thread
  2023-03-27 19:10 ` Eric Wong
@ 2023-03-27 20:47   ` Konstantin Ryabitsev
  2023-03-27 21:38     ` Eric Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Konstantin Ryabitsev @ 2023-03-27 20:47 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Mon, Mar 27, 2023 at 07:10:49PM +0000, Eric Wong wrote:
> > For the bugzilla integration work I'm doing, I need a way to check if there
> > were any updates to a thread since the last check. Right now, I'm just
> > grabbing the full thread, parsing it and seeing if there are any new
> > message-IDs that we don't know about, but it's very wasteful. Any way to just
> > issue something like "how many messages are in a thread with this message-id"
> > or "are there any updates to a thread with this message-id since
> > YYYYMMDDHHMMSS"?
> 
>   lei q -t --only /path/to/(inbox|extindex) mid:$MSGID rt:APPROXIDATE..
> 
> Returns JSON and won't retrieve message bodies from git.

Ah, I was hoping to have a fully remote way of doing this.

> I wouldn't query down to the second due to propagation delays,
> clock skew, etc, though.
> 
> There might be a JMAP endpoint I can implement for WWW which
> only retrieves that info, but getting backreferences (required
> by the JMAP spec) to work properly seemed painful.

What about a "bodiless" atom feed? It's already available per thread, so
perhaps there could be a mode that skips the bodies or trims them after the
first paragraph?

-K


* Re: Cheap way to check for new messages in a thread
  2023-03-27 20:47   ` Konstantin Ryabitsev
@ 2023-03-27 21:38     ` Eric Wong
  2023-03-28 14:04       ` Konstantin Ryabitsev
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Wong @ 2023-03-27 21:38 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Mon, Mar 27, 2023 at 07:10:49PM +0000, Eric Wong wrote:
> > > For the bugzilla integration work I'm doing, I need a way to check if there
> > > were any updates to a thread since the last check. Right now, I'm just
> > > grabbing the full thread, parsing it and seeing if there are any new
> > > message-IDs that we don't know about, but it's very wasteful. Any way to just
> > > issue something like "how many messages are in a thread with this message-id"
> > > or "are there any updates to a thread with this message-id since
> > > > YYYYMMDDHHMMSS"?
> > 
> >   lei q -t --only /path/to/(inbox|extindex) mid:$MSGID rt:APPROXIDATE..
> > 
> > Returns JSON and won't retrieve message bodies from git.
> 
> Ah, I was hoping to have a fully remote way of doing this.
> 
> > I wouldn't query down to the second due to propagation delays,
> > clock skew, etc, though.
> > 
> > There might be a JMAP endpoint I can implement for WWW which
> > only retrieves that info, but getting backreferences (required
> > by the JMAP spec) to work properly seemed painful.
> 
> What about a "bodiless" atom feed? It's already available per thread, so
> perhaps there could be a mode that skips the bodies or trims them after the
> first paragraph?

I thought about that, too; but I'm worried about having one-off
stuff that ends up needing to be supported indefinitely.

JMAP for this would take more time, but I'd be more comfortable
carrying it long-term.

I don't expect trimming after the first paragraph to be a huge
improvement.  Retrieving any part of the message from git and
dealing with MIME is expensive, anyways.  I wouldn't expect it
to be a big (if any) improvement compared to POST-ing for the
mbox.gz (&x=m&t=1) endpoint with rt:$SINCE..

The mbox.gz endpoints should be a bit more efficient for the
server than Atom feeds; decoding MIME and HTML escaping takes up
considerable CPU time.


* Re: Cheap way to check for new messages in a thread
  2023-03-27 21:38     ` Eric Wong
@ 2023-03-28 14:04       ` Konstantin Ryabitsev
  2023-03-28 19:45         ` Eric Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Konstantin Ryabitsev @ 2023-03-28 14:04 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Mon, Mar 27, 2023 at 09:38:49PM +0000, Eric Wong wrote:
> I thought about that, too; but I'm worried about having one-off
> stuff that ends up needing to be supported indefinitely.
> 
> JMAP for this would take more time, but I'd be more comfortable
> carrying it long-term.
> 
> I don't expect trimming after the first paragraph to be a huge
> improvement.  Retrieving any part of the message from git and
> dealing with MIME is expensive, anyways.  I wouldn't expect it
> to be a big (if any) improvement compared to POST-ing for the
> mbox.gz (&x=m&t=1) endpoint with rt:$SINCE..

Hmm... This didn't seem to do the right thing for me. For example, this
thread:

https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2

If I ask for any new messages in that thread since 20230327120000, I get
nothing:

curl -Sf -d '' 'https://lore.kernel.org/all/?x=m&t=1&q=mid%3A20230327080502.GA570847@ziqianlu-desk2+AND+dt%3A20230328120000..'

> The mbox.gz endpoints should be a bit more efficient for the
> server than Atom feeds; decoding MIME and HTML escaping takes up
> considerable CPU time.

Good to know. I'm really looking for a way to ask the remote system "hey, is
there anything new in this thread?" so that I can quickly ignore threads
without any updates.

-K


* Re: Cheap way to check for new messages in a thread
  2023-03-28 14:04       ` Konstantin Ryabitsev
@ 2023-03-28 19:45         ` Eric Wong
  2023-03-28 20:00           ` Konstantin Ryabitsev
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Wong @ 2023-03-28 19:45 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Mon, Mar 27, 2023 at 09:38:49PM +0000, Eric Wong wrote:
> > I thought about that, too; but I'm worried about having one-off
> > stuff that ends up needing to be supported indefinitely.
> > 
> > JMAP for this would take more time, but I'd be more comfortable
> > carrying it long-term.
> > 
> > I don't expect trimming after the first paragraph to be a huge
> > improvement.  Retrieving any part of the message from git and
> > dealing with MIME is expensive, anyways.  I wouldn't expect it
> > to be a big (if any) improvement compared to POST-ing for the
> > mbox.gz (&x=m&t=1) endpoint with rt:$SINCE..
> 
> Hmm... This didn't seem to do the right thing for me. For example, this
> thread:
> 
> https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2
> 
> If I ask for any new messages in that thread since 20230327120000, I get
> nothing:
> 
> curl -Sf -d '' 'https://lore.kernel.org/all/?x=m&t=1&q=mid%3A20230327080502.GA570847@ziqianlu-desk2+AND+dt%3A20230328120000..'

Ugh, that's because the thread expansion (t=1) happens after
Xapian handles dt:/rt:/d:

I don't know if there's a good way to do that entirely within
Xapian via high-level Perl bindings.

Some options:

A) grab MSGID first, lookup THREADID for a given MSGID,
   use remaining query

   The problem is figuring out which parts of the query to
   handle, first.  Maybe a solution below...

B) add explicit before= and after= parameters which allow us
   to do filtering ourselves in the thread expansion phase

C) index References:/In-Reply-To: so searching `ref:$MSGID'
   can work.  This doesn't work for some MUAs and deep
   threads, though.

D) Support `thread:{subquery}' like notmuch.
   Thus `thread:{mid:$MSGID} AND dt:$START..' would communicate
   to Xapian what we want for A).

   I'm not sure this is doable unless using Xapian via C++,
   but I've been considering providing the option to use C++
   anyways to support less hacky approxidate query parsing.
   According to notmuch docs, it's expensive, though :<

I think it's possible to support /$INBOX/$MSGID/t.mbox.gz?q=...
for A) without too much difficulty.  I'll have to think
about it a bit...

D) is good for long-term consideration if proper timeouts can
be implemented.

> > The mbox.gz endpoints should be a bit more efficient for the
> > server than Atom feeds; decoding MIME and HTML escaping takes up
> > considerable CPU time.
> 
> Good to know. I'm really looking for a way to ask the remote system "hey, is
> there anything new in this thread?" so that I can quickly ignore threads
> without any updates.

All the mbox.gz endpoints will 404 if there's no results, and
the `-f' flag of curl will ensure nothing's emitted to stdout
in that case.
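
A minimal sketch of a yes/no check built on that 404 behaviour follows. The
function name thread_has_updates is hypothetical. Instead of `-f', this
variant reads the status code via curl's `-w '%{http_code}'' so that a 404
(no results) can be told apart from other HTTP errors, which `-f' alone
reports with the same exit status.

```shell
# Hypothetical wrapper: POST to an mbox.gz search URL ($1) and report
# whether it has results.  The response body is discarded.
thread_has_updates() {
	code=$(curl -sS -o /dev/null -w '%{http_code}' -d '' "$1") || code=000
	case $code in
	200) echo updated ;;
	404) echo unchanged ;;
	*) echo "error (HTTP $code)"; return 1 ;;
	esac
}
```

The server still has to generate the mbox.gz for a 200 response, so this
only saves work on the client side when there is something new.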


* Re: Cheap way to check for new messages in a thread
  2023-03-28 19:45         ` Eric Wong
@ 2023-03-28 20:00           ` Konstantin Ryabitsev
  2023-03-28 22:08             ` Eric Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Konstantin Ryabitsev @ 2023-03-28 20:00 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Tue, Mar 28, 2023 at 07:45:49PM +0000, Eric Wong wrote:
> C) index References:/In-Reply-To: so searching `ref:$MSGID'
>    can work.  This doesn't work for some MUAs and deep
>    threads, though.

I think this is a workable approach, but would require a reindex, right?

-K


* Re: Cheap way to check for new messages in a thread
  2023-03-28 20:00           ` Konstantin Ryabitsev
@ 2023-03-28 22:08             ` Eric Wong
  2023-03-28 23:30               ` Konstantin Ryabitsev
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Wong @ 2023-03-28 22:08 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Tue, Mar 28, 2023 at 07:45:49PM +0000, Eric Wong wrote:
> > C) index References:/In-Reply-To: so searching `ref:$MSGID'
> >    can work.  This doesn't work for some MUAs and deep
> >    threads, though.
> 
> I think this is a workable approach, but would require a reindex, right?

Yes, it requires a reindex to take effect, which takes ~2 days
on my lore mirror.  The biggest problem is MUAs are likely to
cull References: when threads get too long; so accuracy gets
lost.

Supporting /$MSGID/?q=... doesn't seem like the worst idea,
actually; since I've seen some web forums (phpBB maybe?) have a
"search in thread" function.

thread:{sub-query} is ideal; and I wouldn't rule out doing any
combination of the three (I don't like separating before/after).


* Re: Cheap way to check for new messages in a thread
  2023-03-28 22:08             ` Eric Wong
@ 2023-03-28 23:30               ` Konstantin Ryabitsev
  2023-03-29 21:25                 ` Eric Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Konstantin Ryabitsev @ 2023-03-28 23:30 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Tue, Mar 28, 2023 at 10:08:30PM +0000, Eric Wong wrote:
> > I think this is a workable approach, but would require a reindex, right?
> 
> Yes, it requires a reindex to take effect, which takes ~2 days
> on my lore mirror.  The biggest problem is MUAs are likely to
> cull References: when threads get too long; so accuracy gets
> lost.
> 
> Supporting /$MSGID/?q=... doesn't seem like the worst idea,
> actually; since I've seen some web forums (phpBB maybe?) have a
> "search in thread" function.
> 
> thread:{sub-query} is ideal; and I wouldn't rule out doing any
> combination of the three (I don't like separating before/after).

I'm fine with either of these, and just to stress, it's not really blocking
anything I'm working on -- bugbot is in initial rollout stages, so while the
number of tracked bugs/threads remains low, even if we re-download a hundred
threads every 10 minutes, it's just internal churn between two adjacent VMs.
If it becomes heavy, I can always look into switching to lei and performing
local queries instead of doing external polling.

However, if you do want to add the ability to cheaply do a "give me just the
newest messages in this thread since this datetime", that would be great for
my needs. :)

-K


* Re: Cheap way to check for new messages in a thread
  2023-03-28 23:30               ` Konstantin Ryabitsev
@ 2023-03-29 21:25                 ` Eric Wong
  2023-03-30 11:29                   ` Eric Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Wong @ 2023-03-29 21:25 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> I'm fine with either of these, and just to stress, it's not really blocking
> anything I'm working on -- bugbot is in initial rollout stages, so while the
> number of tracked bugs/threads remains low, even if we re-download a hundred
> threads every 10 minutes, it's just internal churn between two adjacent VMs.
> If it becomes heavy, I can always look into switching to lei and performing
> local queries instead of doing external polling.

Alright.

> However, if you do want to add ability to cheaply do a "give me just the
> newest messages in this thread since this datetime", that would be great for
> my needs. :)

Per-thread search is something I've wanted for a while, anyways,
so I think I'll do /$MSGID/?q= in between ongoing work for
codesearch and chasing down FreeBSD issues.

I may not expose /$MSGID/?q= via HTML just yet since I find
<form> elements confusing as a user :x

Indexing References/IRT would be a waste of space and I/O due to
MUA truncations; so I'm hesitant to do it since we already index
THREADID.

thread:{sub-query} will be nice, but I'll get to it after I deal
with the lei FUSE stuff since I've already done most of the C
work from another project.  Normal FSes are so inefficient
for storing Maildir outputs.


* Re: Cheap way to check for new messages in a thread
  2023-03-29 21:25                 ` Eric Wong
@ 2023-03-30 11:29                   ` Eric Wong
  2023-03-30 16:45                     ` Konstantin Ryabitsev
  2023-06-16 19:11                     ` Konstantin Ryabitsev
  0 siblings, 2 replies; 17+ messages in thread
From: Eric Wong @ 2023-03-30 11:29 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Eric Wong <e@80x24.org> wrote:
> Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> > However, if you do want to add ability to cheaply do a "give me just the
> > newest messages in this thread since this datetime", that would be great for
> > my needs. :)
> 
> Per-thread search is something I've wanted for a while, anyways,
> so I think I'll do /$MSGID/?q= in between ongoing work for

This implements the mbox.gz retrieval.  I didn't want to deal
with HTML nor figuring out how to expose more <form> elements,
yet; but I figure mbox.gz is the most important.

Now deployed on 80x24.org/lore:

MSGID=20230327080502.GA570847@ziqianlu-desk2
curl -d '' -sSf \
   https://80x24.org/lore/all/"$MSGID/?x=m&q=rt:2023-03-29.." | \
   zcat | grep -i ^Message-ID:

shows the expected messages.
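
The request above can be wrapped in a small helper so the same URL layout
(/$INBOX/$MSGID/?x=m&q=...) is reused for other threads. This is only a
sketch: thread_query_url is a hypothetical name, and it assumes the
Message-ID and query are already URL-safe (no escaping is attempted).

```shell
# Hypothetical helper: build the per-thread mbox.gz search URL.
#   $1 = base URL, $2 = inbox name, $3 = Message-ID, $4 = search query
# No URL-escaping is done; callers must pass URL-safe values.
thread_query_url() {
	printf '%s/%s/%s/?x=m&q=%s\n' "${1%/}" "$2" "$3" "$4"
}
```

It would be used as, e.g.:
curl -d '' -sSf "$(thread_query_url https://80x24.org/lore all "$MSGID" 'rt:2023-03-29..')" | zcat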
-----------8<-----------
Subject: [PATCH] www: support POST /$INBOX/$MSGID/?x=m&q=

This allows filtering the contents of any existing thread using
a search query.  It uses the existing THREADID column in Xapian
so we can internally add a Xapian OP_FILTER to the results.

This new functionality is orthogonal to the existing `t=1'
parameter which gives mairix-style thread expansion.  It doesn't
make sense to use `t=1' with this functionality, but it's not
disallowed, either.

The indentation change in Over->next_by_mid is to ensure
DBI->prepare_cached can share across both ->next_by_mid
and ->mid2tid.

I also noticed the existing regex for `POST /$INBOX/?x=m&q=' was
allowing extra characters.  With an added \z, it's now as strict
as originally intended, and AFAIK nothing was generating invalid
URLs for it.

Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Link: https://public-inbox.org/meta/aaniyhk7wfm4e6m5mbukcrhevzoc6ftctyrfwvmz4fkykwwtlj@mverfng6ytas/T/
---
 lib/PublicInbox/Mbox.pm   |  5 ++++
 lib/PublicInbox/Over.pm   | 24 ++++++++++++++++++-
 lib/PublicInbox/Search.pm |  6 +++++
 lib/PublicInbox/WWW.pm    |  4 +++-
 t/psgi_v2.t               | 50 ++++++++++++++++++++++++++++++++++-----
 5 files changed, 81 insertions(+), 8 deletions(-)

diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm
index 18db9d38..e1abf7ec 100644
--- a/lib/PublicInbox/Mbox.pm
+++ b/lib/PublicInbox/Mbox.pm
@@ -229,6 +229,11 @@ sub mbox_all {
 		return PublicInbox::WWW::need($ctx, 'Overview');
 
 	my $qopts = $ctx->{qopts} = { relevance => -2 }; # ORDER BY docid DESC
+
+	# {threadid} limits results to a given thread
+	# {threads} collapses results from messages in the same thread,
+	# allowing us to use ->expand_thread w/o duplicates in our own code
+	$qopts->{threadid} = $over->mid2tid($ctx->{mid}) if defined($ctx->{mid});
 	$qopts->{threads} = 1 if $q->{t};
 	$srch->query_approxidate($ctx->{ibx}->git, $q_string);
 	my $mset = $srch->mset($q_string, $qopts);
diff --git a/lib/PublicInbox/Over.pm b/lib/PublicInbox/Over.pm
index 271e2246..6ba27118 100644
--- a/lib/PublicInbox/Over.pm
+++ b/lib/PublicInbox/Over.pm
@@ -283,13 +283,35 @@ SELECT eidx_key FROM inboxes WHERE ibx_id = ?
 	$rows;
 }
 
+sub mid2tid {
+	my ($self, $mid) = @_;
+	my $dbh = dbh($self);
+
+	my $sth = $dbh->prepare_cached(<<'', undef, 1);
+SELECT id FROM msgid WHERE mid = ? LIMIT 1
+
+	$sth->execute($mid);
+	my $id = $sth->fetchrow_array or return;
+	$sth = $dbh->prepare_cached(<<'', undef, 1);
+SELECT num FROM id2num WHERE id = ? AND num > ?
+ORDER BY num ASC LIMIT 1
+
+	$sth->execute($id, 0);
+	my $num = $sth->fetchrow_array or return;
+	$sth = $dbh->prepare(<<'');
+SELECT tid FROM over WHERE num = ? LIMIT 1
+
+	$sth->execute($num);
+	$sth->fetchrow_array;
+}
+
 sub next_by_mid {
 	my ($self, $mid, $id, $prev) = @_;
 	my $dbh = dbh($self);
 
 	unless (defined $$id) {
 		my $sth = $dbh->prepare_cached(<<'', undef, 1);
-	SELECT id FROM msgid WHERE mid = ? LIMIT 1
+SELECT id FROM msgid WHERE mid = ? LIMIT 1
 
 		$sth->execute($mid);
 		$$id = $sth->fetchrow_array;
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 5133a3b7..6c3d9f93 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -386,6 +386,12 @@ sub mset {
 					sortable_serialise($uid_range->[1]));
 		$query = $X{Query}->new(OP_FILTER(), $query, $range);
 	}
+	if (defined(my $tid = $opt->{threadid})) {
+		$tid = sortable_serialise($tid);
+		$query = $X{Query}->new(OP_FILTER(), $query,
+				$X{Query}->new(OP_VALUE_RANGE(), THREADID, $tid, $tid));
+	}
+
 	my $xdb = xdb($self);
 	my $enq = $X{Enquire}->new($xdb);
 	$enq->set_query($query);
diff --git a/lib/PublicInbox/WWW.pm b/lib/PublicInbox/WWW.pm
index 9ffcb879..a8f1ad17 100644
--- a/lib/PublicInbox/WWW.pm
+++ b/lib/PublicInbox/WWW.pm
@@ -68,7 +68,9 @@ sub call {
 			my ($idx, $fn) = ($3, $4);
 			return invalid_inbox_mid($ctx, $1, $2) ||
 				get_attach($ctx, $idx, $fn);
-		} elsif ($path_info =~ m!$INBOX_RE/!o) {
+		} elsif ($path_info =~ m!$INBOX_RE/$MID_RE/\z!o) {
+			return invalid_inbox_mid($ctx, $1, $2) || mbox_results($ctx);
+		} elsif ($path_info =~ m!$INBOX_RE/\z!o) {
 			return invalid_inbox($ctx, $1) || mbox_results($ctx);
 		}
 	}
diff --git a/t/psgi_v2.t b/t/psgi_v2.t
index 5b197a9f..0a77adfb 100644
--- a/t/psgi_v2.t
+++ b/t/psgi_v2.t
@@ -4,6 +4,7 @@
 use strict;
 use v5.10.1;
 use PublicInbox::TestCommon;
+use IO::Uncompress::Gunzip qw(gunzip);
 require_git(2.6);
 use PublicInbox::Eml;
 use PublicInbox::Config;
@@ -76,6 +77,30 @@ $new_mid //= do {
 	local $/;
 	<$fh>;
 };
+
+my $m2t = create_inbox 'mid2tid-1', version => 2, indexlevel => 'medium', sub {
+	my ($im, $ibx) = @_;
+	for my $n (1..3) {
+		$im->add(PublicInbox::Eml->new(<<EOM)) or xbail 'add';
+Date: Fri, 02 Oct 1993 00:0$n:00 +0000
+Message-ID: <t\@$n>
+Subject: tid $n
+From: x\@example.com
+References: <a-mid\@b>
+
+$n
+EOM
+		$im->add(PublicInbox::Eml->new(<<EOM)) or xbail 'add';
+Date: Fri, 02 Oct 1993 00:0$n:00 +0000
+Message-ID: <ut\@$n>
+Subject: unrelated tid $n
+From: x\@example.com
+References: <b-mid\@b>
+
+EOM
+	}
+};
+
 my $cfgpath = "$ibx->{inboxdir}/pi_config";
 {
 	open my $fh, '>', $cfgpath or BAIL_OUT $!;
@@ -86,6 +111,9 @@ my $cfgpath = "$ibx->{inboxdir}/pi_config";
 [publicinbox "dup"]
 	inboxdir = $dibx->{inboxdir}
 	address = $dibx->{-primary_address}
+[publicinbox "m2t"]
+	inboxdir = $m2t->{inboxdir}
+	address = $m2t->{-primary_address}
 EOF
 	close $fh or BAIL_OUT;
 }
@@ -178,20 +206,18 @@ my $client1 = sub {
 	$cfg->each_inbox(sub { $_[0]->search->reopen });
 
 	SKIP: {
-		eval { require IO::Uncompress::Gunzip };
-		skip 'IO::Uncompress::Gunzip missing', 6 if $@;
 		my ($in, $out, $status);
 		my $req = GET('/v2test/a-mid@b/raw');
 		$req->header('Accept-Encoding' => 'gzip');
 		$res = $cb->($req);
 		is($res->header('Content-Encoding'), 'gzip', 'gzip encoding');
 		$in = $res->content;
-		IO::Uncompress::Gunzip::gunzip(\$in => \$out);
+		gunzip(\$in => \$out);
 		is($out, $raw, 'gzip response matches');
 
 		$res = $cb->(GET('/v2test/a-mid@b/t.mbox.gz'));
 		$in = $res->content;
-		$status = IO::Uncompress::Gunzip::gunzip(\$in => \$out);
+		$status = gunzip(\$in => \$out);
 		unlike($out, qr/^From oldbug/sm, 'buggy "From_" line omitted');
 		like($out, qr/^hello world$/m, 'got first in t.mbox.gz');
 		like($out, qr/^hello world!$/m, 'got second in t.mbox.gz');
@@ -202,7 +228,7 @@ my $client1 = sub {
 		# search interface
 		$res = $cb->(POST('/v2test/?q=m:a-mid@b&x=m'));
 		$in = $res->content;
-		$status = IO::Uncompress::Gunzip::gunzip(\$in => \$out);
+		$status = gunzip(\$in => \$out);
 		unlike($out, qr/^From oldbug/sm, 'buggy "From_" line omitted');
 		like($out, qr/^hello world$/m, 'got first in mbox POST');
 		like($out, qr/^hello world!$/m, 'got second in mbox POST');
@@ -213,7 +239,7 @@ my $client1 = sub {
 		# all.mbox.gz interface
 		$res = $cb->(GET('/v2test/all.mbox.gz'));
 		$in = $res->content;
-		$status = IO::Uncompress::Gunzip::gunzip(\$in => \$out);
+		$status = gunzip(\$in => \$out);
 		unlike($out, qr/^From oldbug/sm, 'buggy "From_" line omitted');
 		like($out, qr/^hello world$/m, 'got first in all.mbox');
 		like($out, qr/^hello world!$/m, 'got second in all.mbox');
@@ -335,6 +361,18 @@ my $client3 = sub {
 	local $SIG{__WARN__} = sub { push @warn, @_ };
 	$res = $cb->(GET('/v2test/?t=1970'.'01'.'01'));
 	is_deeply(\@warn, [], 'no warnings on YYYYMMDD only');
+
+	$res = $cb->(POST("/m2t/t\@1/?q=dt:19931002000300..&x=m"));
+	is($res->code, 200, 'got 200 on mid2tid query');
+	gunzip(\(my $in = $res->content) => \(my $out));
+	my @m = ($out =~ m!^Message-ID: <([^>]+)>\n!gms);
+	is_deeply(\@m, ['t@3'], 'only got latest result from query');
+
+	$res = $cb->(POST("/m2t/t\@1/?q=dt:19931002000400..&x=m"));
+	is($res->code, 404, '404 on out-of-range mid2tid query');
+
+	$res = $cb->(POST("/m2t/t\@1/?q=s:unrelated&x=m"));
+	is($res->code, 404, '404 on cross-thread search');
 };
 test_psgi(sub { $www->call(@_) }, $client3);
 test_httpd($env, $client3, 4);



* Re: Cheap way to check for new messages in a thread
  2023-03-30 11:29                   ` Eric Wong
@ 2023-03-30 16:45                     ` Konstantin Ryabitsev
  2023-03-31  1:40                       ` Eric Wong
  2023-06-16 19:11                     ` Konstantin Ryabitsev
  1 sibling, 1 reply; 17+ messages in thread
From: Konstantin Ryabitsev @ 2023-03-30 16:45 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Thu, Mar 30, 2023 at 11:29:51AM +0000, Eric Wong wrote:
> > Per-thread search is something I've wanted for a while, anyways,
> > so I think I'll do /$MSGID/?q= in between ongoing work for
> 
> This implements the mbox.gz retrieval.  I didn't want to deal
> with HTML nor figuring out how to expose more <form> elements,
> yet; but I figure mbox.gz is the most important.

Nice, thanks!

I can't easily test this, because lore is currently mostly on 1.9 and the
patch doesn't cleanly apply to that tree. However, I will be happy to test it
out once 2.0 is out and we've updated to it on our systems.

Cheers,
-K


* Re: Cheap way to check for new messages in a thread
  2023-03-30 16:45                     ` Konstantin Ryabitsev
@ 2023-03-31  1:40                       ` Eric Wong
  2023-04-11 11:27                         ` Eric Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Wong @ 2023-03-31  1:40 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> I can't easily test this, because lore is currently mostly on 1.9 and the
> patch doesn't cleanly apply to that tree. However, I will be happy to test it
> out once 2.0 is out and we've updated to it on our systems.

Fwiw, master is good on Linux for mail.  codesearch still
needs work, and lei on FreeBSD gets stuck sometimes.


* Re: Cheap way to check for new messages in a thread
  2023-03-31  1:40                       ` Eric Wong
@ 2023-04-11 11:27                         ` Eric Wong
  0 siblings, 0 replies; 17+ messages in thread
From: Eric Wong @ 2023-04-11 11:27 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Eric Wong <e@80x24.org> wrote:
> Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> > I can't easily test this, because lore is currently mostly on 1.9 and the
> > patch doesn't cleanly apply to that tree. However, I will be happy to test it
> > out once 2.0 is out and we've updated to it on our systems.
> 
> Fwiw, master is good on Linux for mail.

Erm, almost :x   The --batch-command support added a difficult-to-trigger bug
that went undetected for a few months:

  https://public-inbox.org/meta/20230411112350.297099-1-e@80x24.org/
  ("git: fix cat_async_retry")


* Re: Cheap way to check for new messages in a thread
  2023-03-30 11:29                   ` Eric Wong
  2023-03-30 16:45                     ` Konstantin Ryabitsev
@ 2023-06-16 19:11                     ` Konstantin Ryabitsev
  2023-06-16 23:13                       ` [PATCH] www: use correct threadid for per-thread search Eric Wong
  1 sibling, 1 reply; 17+ messages in thread
From: Konstantin Ryabitsev @ 2023-06-16 19:11 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Thu, Mar 30, 2023 at 11:29:51AM +0000, Eric Wong wrote:
> This implements the mbox.gz retrieval.  I didn't want to deal
> with HTML nor figuring out how to expose more <form> elements,
> yet; but I figure mbox.gz is the most important.
> 
> Now deployed on 80x24.org/lore:
> 
> MSGID=20230327080502.GA570847@ziqianlu-desk2
> curl -d '' -sSf \
>    https://80x24.org/lore/all/"$MSGID/?x=m&q=rt:2023-03-29.." | \
>    zcat | grep -i ^Message-ID:

Eric:

Reviving this old thread for some clarification. I noticed that this only
works for /all/, but not for individual inboxes. E.g.:

    $ curl -d '' -sSf \
      https://lore.kernel.org/all/"$MSGID/?x=m&q=rt:2023-03-29.." \
      | zgrep -i ^Message-ID:
    Message-ID: <cfcf852c-e9f0-f560-542d-0f72777a85b2@leemhuis.info>

but with /lkml/ I get a 404:

    $ curl -d '' -sSf \
      https://lore.kernel.org/lkml/"$MSGID/?x=m&q=rt:2023-03-29.." \
      | zgrep -i ^Message-ID:
    curl: (22) The requested URL returned error: 404

Is that intentionally restricted to just extindex?

-K


* [PATCH] www: use correct threadid for per-thread search
  2023-06-16 19:11                     ` Konstantin Ryabitsev
@ 2023-06-16 23:13                       ` Eric Wong
  2023-06-21 17:11                         ` Konstantin Ryabitsev
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Wong @ 2023-06-16 23:13 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Thu, Mar 30, 2023 at 11:29:51AM +0000, Eric Wong wrote:
> > This implements the mbox.gz retrieval.  I didn't want to deal
> > with HTML nor figuring out how to expose more <form> elements,
> > yet; but I figure mbox.gz is the most important.
> > 
> > Now deployed on 80x24.org/lore:
> > 
> > MSGID=20230327080502.GA570847@ziqianlu-desk2
> > curl -d '' -sSf \
> >    https://80x24.org/lore/all/"$MSGID/?x=m&q=rt:2023-03-29.." | \
> >    zcat | grep -i ^Message-ID:
> 
> Eric:
> 
> Reviving this old thread for some clarification. I noticed that this only
> works for /all/, but not for individual inboxes. E.g.:
> 
>     $ curl -d '' -sSf \
>       https://lore.kernel.org/all/"$MSGID/?x=m&q=rt:2023-03-29.." \
>       | zgrep -i ^Message-ID:
>     Message-ID: <cfcf852c-e9f0-f560-542d-0f72777a85b2@leemhuis.info>
> 
> but with /lkml/ I get a 404:
> 
>     $ curl -d '' -sSf \
>       https://lore.kernel.org/lkml/"$MSGID/?x=m&q=rt:2023-03-29.." \
>       | zgrep -i ^Message-ID:
>     curl: (22) The requested URL returned error: 404
> 
> Is that intentionally restricted to just extindex?

It's a bug, fix below and deployed to https://80x24.org/lore/

---------8<---------
Subject: [PATCH] www: use correct threadid for per-thread search

For individual public-inboxes relying on extindex for per-inbox
search, we must use the threadid from the extindex over.sqlite3
rather than the per-inbox over.sqlite3 file.

Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Link: https://public-inbox.org/meta/20230616-rudy-comedy-vision-2b9f92@meerkat/
---
 lib/PublicInbox/Mbox.pm | 10 +++++++---
 t/extindex-psgi.t       | 39 +++++++++++++++++++++++++++++++++++++--
 2 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm
index e1abf7ec..bf61bb0e 100644
--- a/lib/PublicInbox/Mbox.pm
+++ b/lib/PublicInbox/Mbox.pm
@@ -225,15 +225,19 @@ sub mbox_all {
 	return mbox_all_ids($ctx) if $q_string !~ /\S/;
 	my $srch = $ctx->{ibx}->isrch or
 		return PublicInbox::WWW::need($ctx, 'Search');
-	my $over = $ctx->{ibx}->over or
-		return PublicInbox::WWW::need($ctx, 'Overview');
 
 	my $qopts = $ctx->{qopts} = { relevance => -2 }; # ORDER BY docid DESC
 
 	# {threadid} limits results to a given thread
 	# {threads} collapses results from messages in the same thread,
 	# allowing us to use ->expand_thread w/o duplicates in our own code
-	$qopts->{threadid} = $over->mid2tid($ctx->{mid}) if defined($ctx->{mid});
+	if (defined($ctx->{mid})) {
+		my $over = ($ctx->{ibx}->{isrch} ?
+				$ctx->{ibx}->{isrch}->{es}->over :
+				$ctx->{ibx}->over) or
+			return PublicInbox::WWW::need($ctx, 'Overview');
+		$qopts->{threadid} = $over->mid2tid($ctx->{mid});
+	}
 	$qopts->{threads} = 1 if $q->{t};
 	$srch->query_approxidate($ctx->{ibx}->git, $q_string);
 	my $mset = $srch->mset($q_string, $qopts);
diff --git a/t/extindex-psgi.t b/t/extindex-psgi.t
index 98dc2e48..f10ffbb6 100644
--- a/t/extindex-psgi.t
+++ b/t/extindex-psgi.t
@@ -1,5 +1,5 @@
 #!perl -w
-# Copyright (C) 2020-2021 all contributors <meta@public-inbox.org>
+# Copyright (C) all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 use strict;
 use v5.10.1;
@@ -21,7 +21,28 @@ mkdir "$home/.public-inbox" or BAIL_OUT $!;
 my $pi_config = "$home/.public-inbox/config";
 cp($cfg_path, $pi_config) or BAIL_OUT;
 my $env = { HOME => $home };
-run_script([qw(-extindex --all), "$tmpdir/eidx"], $env) or BAIL_OUT;
+my $m2t = create_inbox 'mid2tid', version => 2, indexlevel => 'basic', sub {
+	my ($im, $ibx) = @_;
+	for my $n (1..3) {
+		$im->add(PublicInbox::Eml->new(<<EOM)) or xbail 'add';
+Date: Fri, 02 Oct 1993 00:0$n:00 +0000
+Message-ID: <t\@$n>
+Subject: tid $n
+From: x\@example.com
+References: <a-mid\@b>
+
+$n
+EOM
+		$im->add(PublicInbox::Eml->new(<<EOM)) or xbail 'add';
+Date: Fri, 02 Oct 1993 00:0$n:00 +0000
+Message-ID: <ut\@$n>
+Subject: unrelated tid $n
+From: x\@example.com
+References: <b-mid\@b>
+
+EOM
+	}
+};
 {
 	open my $cfgfh, '>>', $pi_config or BAIL_OUT;
 	$cfgfh->autoflush(1);
@@ -32,8 +53,14 @@ run_script([qw(-extindex --all), "$tmpdir/eidx"], $env) or BAIL_OUT;
 [publicinbox]
 	wwwlisting = all
 	grokManifest = all
+[publicinbox "m2t"]
+	inboxdir = $m2t->{inboxdir}
+	address = $m2t->{-primary_address}
 EOM
+	close $cfgfh or xbail "close: $!";
 }
+
+run_script([qw(-extindex --all), "$tmpdir/eidx"], $env) or BAIL_OUT;
 my $www = PublicInbox::WWW->new(PublicInbox::Config->new($pi_config));
 my $client = sub {
 	my ($cb) = @_;
@@ -83,6 +110,14 @@ my $client = sub {
 		't2 manifest');
 	is_deeply([ sort keys %{$m->{'/t1'}} ], [ '/t1' ],
 		't2 manifest');
+
+	# ensure ibx->{isrch}->{es}->over is used instead of ibx->over:
+	$res = $cb->(POST("/m2t/t\@1/?q=dt:19931002000259..&x=m"));
+	is($res->code, 200, 'hit on mid2tid query');
+	$res = $cb->(POST("/m2t/t\@1/?q=dt:19931002000400..&x=m"));
+	is($res->code, 404, '404 on out-of-range mid2tid query');
+	$res = $cb->(POST("/m2t/t\@1/?q=s:unrelated&x=m"));
+	is($res->code, 404, '404 on cross-thread search');
 };
 test_psgi(sub { $www->call(@_) }, $client);
 %$env = (%$env, TMPDIR => $tmpdir, PI_CONFIG => $pi_config);

* Re: [PATCH] www: use correct threadid for per-thread search
  2023-06-16 23:13                       ` [PATCH] www: use correct threadid for per-thread search Eric Wong
@ 2023-06-21 17:11                         ` Konstantin Ryabitsev
  0 siblings, 0 replies; 17+ messages in thread
From: Konstantin Ryabitsev @ 2023-06-21 17:11 UTC
  To: Eric Wong; +Cc: meta

On Fri, Jun 16, 2023 at 11:13:01PM +0000, Eric Wong wrote:
> > Reviving this old thread for some clarification. I noticed that this only
> > works for /all/, but not for individual inboxes. E.g.:
> > 
> >     $ curl -d '' -sSf \
> >       https://lore.kernel.org/all/"$MSGID/?x=m&q=rt:2023-03-29.." \
> >       | zgrep -i ^Message-ID:
> >     Message-ID: <cfcf852c-e9f0-f560-542d-0f72777a85b2@leemhuis.info>
> > 
> > but with /lkml/ I get a 404:
> > 
> >     $ curl -d '' -sSf \
> >       https://lore.kernel.org/lkml/"$MSGID/?x=m&q=rt:2023-03-29.." \
> >       | zgrep -i ^Message-ID:
> >     curl: (22) The requested URL returned error: 404
> > 
> > Is that intentionally restricted to just extindex?
> 
> It's a bug, fix below and deployed to https://80x24.org/lore/

Indeed, looks good now. Thank you!

-K


Thread overview: 17+ messages
2023-03-27 15:08 Cheap way to check for new messages in a thread Konstantin Ryabitsev
2023-03-27 19:10 ` Eric Wong
2023-03-27 20:47   ` Konstantin Ryabitsev
2023-03-27 21:38     ` Eric Wong
2023-03-28 14:04       ` Konstantin Ryabitsev
2023-03-28 19:45         ` Eric Wong
2023-03-28 20:00           ` Konstantin Ryabitsev
2023-03-28 22:08             ` Eric Wong
2023-03-28 23:30               ` Konstantin Ryabitsev
2023-03-29 21:25                 ` Eric Wong
2023-03-30 11:29                   ` Eric Wong
2023-03-30 16:45                     ` Konstantin Ryabitsev
2023-03-31  1:40                       ` Eric Wong
2023-04-11 11:27                         ` Eric Wong
2023-06-16 19:11                     ` Konstantin Ryabitsev
2023-06-16 23:13                       ` [PATCH] www: use correct threadid for per-thread search Eric Wong
2023-06-21 17:11                         ` Konstantin Ryabitsev
