unofficial mirror of meta@public-inbox.org
 help / color / mirror / Atom feed
* lei missing mails
@ 2022-06-29 16:15 Rob Herring
  2022-06-29 16:30 ` Eric Wong
  0 siblings, 1 reply; 11+ messages in thread
From: Rob Herring @ 2022-06-29 16:15 UTC (permalink / raw)
  To: meta

Hi,

I'm using lei with lore where I have 2 queries which overlap. Really,
one is a subset of the other. On those overlapping threads, I'm
finding that sometimes new messages are written to one mailbox and not
the other. (At least sometimes, the messages may be missing from all
mailboxes sometimes too. I'm not certain.) Using --remote-fudge-time
to force refetching seems to get the missing mails. I haven't found
anything strange in timestamps of the missing mails, but otherwise am
not sure how to debug this further. The queries are retrieving full
threads and the missing mails are in the threads, but not direct
matches to the queries. I realize that's not a lot of detail to go on.
Suggestions on debugging this further?

It might be helpful if lei could print out message-ids of messages
written to mailboxes.

Rob

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: lei missing mails
  2022-06-29 16:15 lei missing mails Rob Herring
@ 2022-06-29 16:30 ` Eric Wong
  2022-06-29 16:53   ` Rob Herring
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Wong @ 2022-06-29 16:30 UTC (permalink / raw)
  To: Rob Herring; +Cc: meta

Rob Herring <robh@kernel.org> wrote:
> Hi,
> 
> I'm using lei with lore where I have 2 queries which overlap. Really,
> one is a subset of the other. On those overlapping threads, I'm
> finding that sometimes new messages are written to one mailbox and not
> the other. (At least sometimes, the messages may be missing from all
> mailboxes sometimes too. I'm not certain.) Using --remote-fudge-time
> to force refetching seems to get the missing mails. I haven't found
> anything strange in timestamps of the missing mails, but otherwise am
> not sure how to debug this further. The queries are retrieving full
> threads and the missing mails are in the threads, but not direct
> matches to the queries. I realize that's not a lot of detail to go on.
> Suggestions on debugging this further?

Is this with 1.8 or 1.7?

I forgot to note in the release notes, but there were some
SQLite usage-related fixes which could avoid missing messages.

You'll need "lei daemon-kill" after upgrading to 1.8 to ensure
the new code is running.

What might be interesting is to use the URLs lei prints and
comparing the results w/o lei.

I'll have to double-check if overlapping affects things, but it
shouldn't; since the dedupe logic is per-output.

Is this exclusively with HTTPS endpoints and writing to Maildirs
(or something else?)

> It might be helpful if lei could print out message-ids of messages
> written to mailboxes.

That could get very noisy, especially as mailboxes are written
in parallel.

Thanks.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: lei missing mails
  2022-06-29 16:30 ` Eric Wong
@ 2022-06-29 16:53   ` Rob Herring
  2022-06-29 17:27     ` Eric Wong
  0 siblings, 1 reply; 11+ messages in thread
From: Rob Herring @ 2022-06-29 16:53 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Wed, Jun 29, 2022 at 10:30 AM Eric Wong <e@80x24.org> wrote:
>
> Rob Herring <robh@kernel.org> wrote:
> > Hi,
> >
> > I'm using lei with lore where I have 2 queries which overlap. Really,
> > one is a subset of the other. On those overlapping threads, I'm
> > finding that sometimes new messages are written to one mailbox and not
> > the other. (At least sometimes, the messages may be missing from all
> > mailboxes sometimes too. I'm not certain.) Using --remote-fudge-time
> > to force refetching seems to get the missing mails. I haven't found
> > anything strange in timestamps of the missing mails, but otherwise am
> > not sure how to debug this further. The queries are retrieving full
> > threads and the missing mails are in the threads, but not direct
> > matches to the queries. I realize that's not a lot of detail to go on.
> > Suggestions on debugging this further?
>
> Is this with 1.8 or 1.7?

Commit 68b53c888911 actually. So post 1.8.

> I forgot to note in the release notes, but there were some
> SQLite usage-related fixes which could avoid missing messages.
>
> You'll need "lei daemon-kill" after upgrading to 1.8 to ensure
> the new code is running.

It's possible I haven't done that since updating though I do vaguely
recall seeing something about needing to do that. Is there any way to
tell before I restart it?


> What might be interesting is to use the URLs lei prints and
> comparing the results w/o lei.
>
> I'll have to double-check if overlapping affects things, but it
> shouldn't; since the dedupe logic is per-output.
>
> Is this exclusively with HTTPS endpoints and writing to Maildirs
> (or something else?)

Yes. It's querying lore and writing to a maildir. Here's one of the queries:

[lei]
        q = (dfn:drivers OR dfn:arch OR dfn:Documentation/* OR
dfn:include OR dfn:scripts) AND \
         f:robh@kernel.org AND rt:6.month.ago..
[lei "q"]
        include = https://lore.kernel.org/all/
        external = 1
        local = 1
        remote = 1
        threads = 1
        dedupe = mid
        output = maildir:/home/rob/Mail/my-patches

>
> > It might be helpful if lei could print out message-ids of messages
> > written to mailboxes.
>
> That could get very noisy, especially as mailboxes are written
> in parallel.

Verbose mode already is. Maybe specifying what info you want to be
verbose would help. The network side is mostly uninteresting in this
case for example.

Is there any tool to list new messages in a maildir? I could do that
before and after. I've done the clearing the new flag in mutt between
runs, but that's not really ideal.

Rob

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: lei missing mails
  2022-06-29 16:53   ` Rob Herring
@ 2022-06-29 17:27     ` Eric Wong
  2022-06-29 22:01       ` Rob Herring
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Wong @ 2022-06-29 17:27 UTC (permalink / raw)
  To: Rob Herring; +Cc: meta

Rob Herring <robh@kernel.org> wrote:
> On Wed, Jun 29, 2022 at 10:30 AM Eric Wong <e@80x24.org> wrote:
> >
> > Rob Herring <robh@kernel.org> wrote:
> > > Hi,
> > >
> > > I'm using lei with lore where I have 2 queries which overlap. Really,
> > > one is a subset of the other. On those overlapping threads, I'm
> > > finding that sometimes new messages are written to one mailbox and not
> > > the other. (At least sometimes, the messages may be missing from all
> > > mailboxes sometimes too. I'm not certain.) Using --remote-fudge-time
> > > to force refetching seems to get the missing mails. I haven't found
> > > anything strange in timestamps of the missing mails, but otherwise am
> > > not sure how to debug this further. The queries are retrieving full
> > > threads and the missing mails are in the threads, but not direct
> > > matches to the queries. I realize that's not a lot of detail to go on.
> > > Suggestions on debugging this further?
> >
> > Is this with 1.8 or 1.7?
> 
> Commit 68b53c888911 actually. So post 1.8.

OK, thanks for that info.

> > I forgot to note in the release notes, but there were some
> > SQLite usage-related fixes which could avoid missing messages.
> >
> > You'll need "lei daemon-kill" after upgrading to 1.8 to ensure
> > the new code is running.
> 
> It's possible I haven't done that since updating though I do vaguely
> recall seeing something about needing to do that. Is there any way to
> tell before I restart it?

Not really, but it's pretty cheap to restart (assuming there's no
long-running jobs).

> > What might be interesting is to use the URLs lei prints and
> > comparing the results w/o lei.
> >
> > I'll have to double-check if overlapping affects things, but it
> > shouldn't; since the dedupe logic is per-output.
> >
> > Is this exclusively with HTTPS endpoints and writing to Maildirs
> > (or something else?)
> 
> Yes. It's querying lore and writing to a maildir. Here's one of the queries:
> 
> [lei]
>         q = (dfn:drivers OR dfn:arch OR dfn:Documentation/* OR
> dfn:include OR dfn:scripts) AND \
>          f:robh@kernel.org AND rt:6.month.ago..
> [lei "q"]
>         include = https://lore.kernel.org/all/
>         external = 1
>         local = 1
>         remote = 1
>         threads = 1
>         dedupe = mid
>         output = maildir:/home/rob/Mail/my-patches

Fwiw, dedupe based on mid could be vulnerable to spoofing, which
is why `content' is the default.  But yes, in the past, I've
noticed some messages to meta@public-inbox.org not showing up,
though not recently (I guess lack of activity here is a culprit :x)

I also just noticed an inotify-related bug deadlocking the whole
lei-deamon while looking into this :<

> > > It might be helpful if lei could print out message-ids of messages
> > > written to mailboxes.
> >
> > That could get very noisy, especially as mailboxes are written
> > in parallel.
> 
> Verbose mode already is. Maybe specifying what info you want to be
> verbose would help. The network side is mostly uninteresting in this
> case for example.

Yes, I've been struggling with the verbosity, too; and many
other things :<

> Is there any tool to list new messages in a maildir? I could do that
> before and after. I've done the clearing the new flag in mutt between
> runs, but that's not really ideal.

I suppose `ls'.  There are likely other tools more suited for Maildirs
but I'm not familiar with them off the top of my head.

Maybe lei could grow yet another command.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: lei missing mails
  2022-06-29 17:27     ` Eric Wong
@ 2022-06-29 22:01       ` Rob Herring
  2022-06-30  8:55         ` Eric Wong
  0 siblings, 1 reply; 11+ messages in thread
From: Rob Herring @ 2022-06-29 22:01 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Wed, Jun 29, 2022 at 11:27 AM Eric Wong <e@80x24.org> wrote:
>
> Rob Herring <robh@kernel.org> wrote:
> > On Wed, Jun 29, 2022 at 10:30 AM Eric Wong <e@80x24.org> wrote:
> > >
> > > Rob Herring <robh@kernel.org> wrote:
> > > > Hi,
> > > >
> > > > I'm using lei with lore where I have 2 queries which overlap. Really,
> > > > one is a subset of the other. On those overlapping threads, I'm
> > > > finding that sometimes new messages are written to one mailbox and not
> > > > the other. (At least sometimes, the messages may be missing from all
> > > > mailboxes sometimes too. I'm not certain.) Using --remote-fudge-time
> > > > to force refetching seems to get the missing mails. I haven't found
> > > > anything strange in timestamps of the missing mails, but otherwise am
> > > > not sure how to debug this further. The queries are retrieving full
> > > > threads and the missing mails are in the threads, but not direct
> > > > matches to the queries. I realize that's not a lot of detail to go on.
> > > > Suggestions on debugging this further?
> > >
> > > Is this with 1.8 or 1.7?
> >
> > Commit 68b53c888911 actually. So post 1.8.
>
> OK, thanks for that info.
>
> > > I forgot to note in the release notes, but there were some
> > > SQLite usage-related fixes which could avoid missing messages.
> > >
> > > You'll need "lei daemon-kill" after upgrading to 1.8 to ensure
> > > the new code is running.
> >
> > It's possible I haven't done that since updating though I do vaguely
> > recall seeing something about needing to do that. Is there any way to
> > tell before I restart it?
>
> Not really, but it's pretty cheap to restart (assuming there's no
> long-running jobs).

I've restarted and just hit this again.


> > > What might be interesting is to use the URLs lei prints and
> > > comparing the results w/o lei.

$ lei up --all
# updating /home/rob/Mail/from-me
# updating /home/rob/Mail/missing-cc
# updating /home/rob/Mail/my-patches
# updating /home/rob/Mail/pci
# https://lore.kernel.org/all/ limiting to 2022-06-27 12:42 -0600 and newer
# https://lore.kernel.org/all/ limiting to 2022-06-27  9:50 -0600 and newer
# https://lore.kernel.org/all/ limiting to 2022-06-27 12:42 -0600 and newer
# /usr/bin/curl -Sf -s -d ''
https://lore.kernel.org/all/?x=m&t=1&q=(dt%3A20220529211430..+AND+(f%3Arobh%40kernel.org+OR+f%3Arobh%2Bdt%40kernel.org))+AND+dt%3A20220627184226..
# /home/rob/.local/share/lei/store 144/144
# /home/rob/.local/share/lei/store 3/3
# /usr/bin/curl -Sf -s -d ''
https://lore.kernel.org/all/?x=m&t=1&q=((dfn%3Adrivers+OR+dfn%3Aarch+OR+dfn%3ADocumentation%2F*+OR+dfn%3Ainclude+OR+dfn%3Ascripts)+AND+f%3Arobh%40kernel.org+AND+rt%3A1640812470..)+AND+dt%3A20220627155025..
# /usr/bin/curl -Sf -s -d ''
https://lore.kernel.org/all/?x=m&t=1&q=(l%3Alinux-pci+dfn%3Adrivers%2Fpci%2Fcontroller+dt%3A20220529211430..)+AND+dt%3A20220627184226..
# /home/rob/.local/share/lei/store 0/0
# /home/rob/.local/share/lei/store 362/362
# 0 written to /home/rob/Mail/missing-cc/ (0 matches)
# https://lore.kernel.org/all/ 72/72
# https://lore.kernel.org/all/ 4/4
# https://lore.kernel.org/all/ 131/?
# https://lore.kernel.org/all/ 184/?
# https://lore.kernel.org/all/ 412/?
# https://lore.kernel.org/all/ 603/?
# https://lore.kernel.org/all/ 853/?
# https://lore.kernel.org/all/ 1069/?
# https://lore.kernel.org/all/ 1442/?
# https://lore.kernel.org/all/ 1443/1443
# 1 written to /home/rob/Mail/pci/ (75 matches)
# 2 written to /home/rob/Mail/my-patches/ (148 matches)
# 7 written to /home/rob/Mail/from-me/ (1805 matches)


What I expected was 3 messages written to 'my-patches'.

I think the problem is just simply that the new message missing
doesn't match the query, but is a reply to a match. So with a date
after the original match in the thread won't pick up anything. The 2nd
URL above indeed only has 2 results. I guess I just have to fetch a
wider window like a month every time? What's needed is a get any new
messages in existing threads. I don't suppose there's an efficient way
to do that?

> > >
> > > I'll have to double-check if overlapping affects things, but it
> > > shouldn't; since the dedupe logic is per-output.
> > >
> > > Is this exclusively with HTTPS endpoints and writing to Maildirs
> > > (or something else?)
> >
> > Yes. It's querying lore and writing to a maildir. Here's one of the queries:
> >
> > [lei]
> >         q = (dfn:drivers OR dfn:arch OR dfn:Documentation/* OR
> > dfn:include OR dfn:scripts) AND \
> >          f:robh@kernel.org AND rt:6.month.ago..
> > [lei "q"]
> >         include = https://lore.kernel.org/all/
> >         external = 1
> >         local = 1
> >         remote = 1
> >         threads = 1
> >         dedupe = mid
> >         output = maildir:/home/rob/Mail/my-patches
>
> Fwiw, dedupe based on mid could be vulnerable to spoofing, which
> is why `content' is the default.  But yes, in the past, I've
> noticed some messages to meta@public-inbox.org not showing up,
> though not recently (I guess lack of activity here is a culprit :x)

Does 'content' ignore trailers that mailman lists like to add? I think
I switched because of that.

Rob

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: lei missing mails
  2022-06-29 22:01       ` Rob Herring
@ 2022-06-30  8:55         ` Eric Wong
  2022-07-07  9:48           ` Eric Wong
  2022-07-11 21:59           ` Rob Herring
  0 siblings, 2 replies; 11+ messages in thread
From: Eric Wong @ 2022-06-30  8:55 UTC (permalink / raw)
  To: Rob Herring; +Cc: meta

Rob Herring <robh@kernel.org> wrote:
> On Wed, Jun 29, 2022 at 11:27 AM Eric Wong <e@80x24.org> wrote:
> > Rob Herring <robh@kernel.org> wrote:
> > > On Wed, Jun 29, 2022 at 10:30 AM Eric Wong <e@80x24.org> wrote:
> > > > Rob Herring <robh@kernel.org> wrote:
> > > > > Hi,
> > > > >
> > > > > I'm using lei with lore where I have 2 queries which overlap. Really,
> > > > > one is a subset of the other. On those overlapping threads, I'm
> > > > > finding that sometimes new messages are written to one mailbox and not
> > > > > the other. (At least sometimes, the messages may be missing from all
> > > > > mailboxes sometimes too. I'm not certain.) Using --remote-fudge-time
> > > > > to force refetching seems to get the missing mails. I haven't found
> > > > > anything strange in timestamps of the missing mails, but otherwise am
> > > > > not sure how to debug this further. The queries are retrieving full
> > > > > threads and the missing mails are in the threads, but not direct
> > > > > matches to the queries. I realize that's not a lot of detail to go on.
> > > > > Suggestions on debugging this further?
> > > >
> > > > Is this with 1.8 or 1.7?
> > >
> > > Commit 68b53c888911 actually. So post 1.8.
> >
> > OK, thanks for that info.
> >
> > > > I forgot to note in the release notes, but there were some
> > > > SQLite usage-related fixes which could avoid missing messages.
> > > >
> > > > You'll need "lei daemon-kill" after upgrading to 1.8 to ensure
> > > > the new code is running.
> > >
> > > It's possible I haven't done that since updating though I do vaguely
> > > recall seeing something about needing to do that. Is there any way to
> > > tell before I restart it?
> >
> > Not really, but it's pretty cheap to restart (assuming there's no
> > long-running jobs).
> 
> I've restarted and just hit this again.

Ugh, sorry to hear that :<

> > > > What might be interesting is to use the URLs lei prints and
> > > > comparing the results w/o lei.
> 
> $ lei up --all
> # updating /home/rob/Mail/from-me
> # updating /home/rob/Mail/missing-cc
> # updating /home/rob/Mail/my-patches
> # updating /home/rob/Mail/pci
> # https://lore.kernel.org/all/ limiting to 2022-06-27 12:42 -0600 and newer
> # https://lore.kernel.org/all/ limiting to 2022-06-27  9:50 -0600 and newer
> # https://lore.kernel.org/all/ limiting to 2022-06-27 12:42 -0600 and newer
> # /usr/bin/curl -Sf -s -d ''
> https://lore.kernel.org/all/?x=m&t=1&q=(dt%3A20220529211430..+AND+(f%3Arobh%40kernel.org+OR+f%3Arobh%2Bdt%40kernel.org))+AND+dt%3A20220627184226..
> # /home/rob/.local/share/lei/store 144/144
> # /home/rob/.local/share/lei/store 3/3
> # /usr/bin/curl -Sf -s -d ''
> https://lore.kernel.org/all/?x=m&t=1&q=((dfn%3Adrivers+OR+dfn%3Aarch+OR+dfn%3ADocumentation%2F*+OR+dfn%3Ainclude+OR+dfn%3Ascripts)+AND+f%3Arobh%40kernel.org+AND+rt%3A1640812470..)+AND+dt%3A20220627155025..
> # /usr/bin/curl -Sf -s -d ''
> https://lore.kernel.org/all/?x=m&t=1&q=(l%3Alinux-pci+dfn%3Adrivers%2Fpci%2Fcontroller+dt%3A20220529211430..)+AND+dt%3A20220627184226..
> # /home/rob/.local/share/lei/store 0/0
> # /home/rob/.local/share/lei/store 362/362
> # 0 written to /home/rob/Mail/missing-cc/ (0 matches)
> # https://lore.kernel.org/all/ 72/72
> # https://lore.kernel.org/all/ 4/4
> # https://lore.kernel.org/all/ 131/?
> # https://lore.kernel.org/all/ 184/?
> # https://lore.kernel.org/all/ 412/?
> # https://lore.kernel.org/all/ 603/?
> # https://lore.kernel.org/all/ 853/?
> # https://lore.kernel.org/all/ 1069/?
> # https://lore.kernel.org/all/ 1442/?
> # https://lore.kernel.org/all/ 1443/1443
> # 1 written to /home/rob/Mail/pci/ (75 matches)
> # 2 written to /home/rob/Mail/my-patches/ (148 matches)
> # 7 written to /home/rob/Mail/from-me/ (1805 matches)
> 
> 
> What I expected was 3 messages written to 'my-patches'.
> 
> I think the problem is just simply that the new message missing
> doesn't match the query, but is a reply to a match. So with a date
> after the original match in the thread won't pick up anything. The 2nd
> URL above indeed only has 2 results. I guess I just have to fetch a
> wider window like a month every time? What's needed is a get any new
> messages in existing threads. I don't suppose there's an efficient way
> to do that?

No, I don't think so.  I think this is a separate issue in lei...
"t=1" in the remote query expands threads in a time-agnostic
way, so I don't think that's the problem (though I may be wrong...).

I'll have to check more closely this week (still stuck with POP3
user account/storage issues :<)

> > > >
> > > > I'll have to double-check if overlapping affects things, but it
> > > > shouldn't; since the dedupe logic is per-output.
> > > >
> > > > Is this exclusively with HTTPS endpoints and writing to Maildirs
> > > > (or something else?)
> > >
> > > Yes. It's querying lore and writing to a maildir. Here's one of the queries:
> > >
> > > [lei]
> > >         q = (dfn:drivers OR dfn:arch OR dfn:Documentation/* OR
> > > dfn:include OR dfn:scripts) AND \
> > >          f:robh@kernel.org AND rt:6.month.ago..
> > > [lei "q"]
> > >         include = https://lore.kernel.org/all/
> > >         external = 1
> > >         local = 1
> > >         remote = 1
> > >         threads = 1
> > >         dedupe = mid
> > >         output = maildir:/home/rob/Mail/my-patches
> >
> > Fwiw, dedupe based on mid could be vulnerable to spoofing, which
> > is why `content' is the default.  But yes, in the past, I've
> > noticed some messages to meta@public-inbox.org not showing up,
> > though not recently (I guess lack of activity here is a culprit :x)
> 
> Does 'content' ignore trailers that mailman lists like to add? I think
> I switched because of that.

No, unfortunately not.  Hopefully the admins can be convinced to
get rid of trailers (I'm happy vger did so a few years back).
But I'd rather deal with duplicates than miss messages (there
have been legitimate messages in the past which reused msgids,
unfortunately).

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: lei missing mails
  2022-06-30  8:55         ` Eric Wong
@ 2022-07-07  9:48           ` Eric Wong
  2022-07-11 21:17             ` Rob Herring
  2022-07-11 21:59           ` Rob Herring
  1 sibling, 1 reply; 11+ messages in thread
From: Eric Wong @ 2022-07-07  9:48 UTC (permalink / raw)
  To: Rob Herring; +Cc: meta

Eric Wong <e@80x24.org> wrote:
> Rob Herring <robh@kernel.org> wrote:
> > I've restarted and just hit this again.
> 
> Ugh, sorry to hear that :<

Btw, is this only with HTTP(S) endpoints?  (not local mirrors?)

I wonder if it's something wrong on the HTTP end (server or
client), but did more digging the other day and couldn't
reproduce the problem...

I just posted a very minor diagnostic change which might help
(forgot to Cc you :x)
https://public-inbox.org/meta/20220707094030.1185793-1-e@80x24.org/

As usual, daemon-kill is necessary after upgrading.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: lei missing mails
  2022-07-07  9:48           ` Eric Wong
@ 2022-07-11 21:17             ` Rob Herring
  0 siblings, 0 replies; 11+ messages in thread
From: Rob Herring @ 2022-07-11 21:17 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Thu, Jul 7, 2022 at 3:48 AM Eric Wong <e@80x24.org> wrote:
>
> Eric Wong <e@80x24.org> wrote:
> > Rob Herring <robh@kernel.org> wrote:
> > > I've restarted and just hit this again.
> >
> > Ugh, sorry to hear that :<
>
> Btw, is this only with HTTP(S) endpoints?  (not local mirrors?)

Yes. Only in that https is the only case I've tried. Don't know about
local mirrors because I don't know how to set that up...

Rob

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: lei missing mails
  2022-06-30  8:55         ` Eric Wong
  2022-07-07  9:48           ` Eric Wong
@ 2022-07-11 21:59           ` Rob Herring
  2022-07-18 23:41             ` Eric Wong
  1 sibling, 1 reply; 11+ messages in thread
From: Rob Herring @ 2022-07-11 21:59 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Thu, Jun 30, 2022 at 2:55 AM Eric Wong <e@80x24.org> wrote:
>
> Rob Herring <robh@kernel.org> wrote:
> > On Wed, Jun 29, 2022 at 11:27 AM Eric Wong <e@80x24.org> wrote:
> > > Rob Herring <robh@kernel.org> wrote:
> > > > On Wed, Jun 29, 2022 at 10:30 AM Eric Wong <e@80x24.org> wrote:
> > > > > Rob Herring <robh@kernel.org> wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I'm using lei with lore where I have 2 queries which overlap. Really,
> > > > > > one is a subset of the other. On those overlapping threads, I'm
> > > > > > finding that sometimes new messages are written to one mailbox and not
> > > > > > the other. (At least sometimes, the messages may be missing from all
> > > > > > mailboxes sometimes too. I'm not certain.) Using --remote-fudge-time
> > > > > > to force refetching seems to get the missing mails. I haven't found
> > > > > > anything strange in timestamps of the missing mails, but otherwise am
> > > > > > not sure how to debug this further. The queries are retrieving full
> > > > > > threads and the missing mails are in the threads, but not direct
> > > > > > matches to the queries. I realize that's not a lot of detail to go on.
> > > > > > Suggestions on debugging this further?
> > > > >
> > > > > Is this with 1.8 or 1.7?
> > > >
> > > > Commit 68b53c888911 actually. So post 1.8.
> > >
> > > OK, thanks for that info.
> > >
> > > > > I forgot to note in the release notes, but there were some
> > > > > SQLite usage-related fixes which could avoid missing messages.
> > > > >
> > > > > You'll need "lei daemon-kill" after upgrading to 1.8 to ensure
> > > > > the new code is running.
> > > >
> > > > It's possible I haven't done that since updating though I do vaguely
> > > > recall seeing something about needing to do that. Is there any way to
> > > > tell before I restart it?
> > >
> > > Not really, but it's pretty cheap to restart (assuming there's no
> > > long-running jobs).
> >
> > I've restarted and just hit this again.
>
> Ugh, sorry to hear that :<
>
> > > > > What might be interesting is to use the URLs lei prints and
> > > > > comparing the results w/o lei.
> >
> > $ lei up --all
> > # updating /home/rob/Mail/from-me
> > # updating /home/rob/Mail/missing-cc
> > # updating /home/rob/Mail/my-patches
> > # updating /home/rob/Mail/pci
> > # https://lore.kernel.org/all/ limiting to 2022-06-27 12:42 -0600 and newer
> > # https://lore.kernel.org/all/ limiting to 2022-06-27  9:50 -0600 and newer
> > # https://lore.kernel.org/all/ limiting to 2022-06-27 12:42 -0600 and newer
> > # /usr/bin/curl -Sf -s -d ''
> > https://lore.kernel.org/all/?x=m&t=1&q=(dt%3A20220529211430..+AND+(f%3Arobh%40kernel.org+OR+f%3Arobh%2Bdt%40kernel.org))+AND+dt%3A20220627184226..
> > # /home/rob/.local/share/lei/store 144/144
> > # /home/rob/.local/share/lei/store 3/3
> > # /usr/bin/curl -Sf -s -d ''
> > https://lore.kernel.org/all/?x=m&t=1&q=((dfn%3Adrivers+OR+dfn%3Aarch+OR+dfn%3ADocumentation%2F*+OR+dfn%3Ainclude+OR+dfn%3Ascripts)+AND+f%3Arobh%40kernel.org+AND+rt%3A1640812470..)+AND+dt%3A20220627155025..
> > # /usr/bin/curl -Sf -s -d ''
> > https://lore.kernel.org/all/?x=m&t=1&q=(l%3Alinux-pci+dfn%3Adrivers%2Fpci%2Fcontroller+dt%3A20220529211430..)+AND+dt%3A20220627184226..
> > # /home/rob/.local/share/lei/store 0/0
> > # /home/rob/.local/share/lei/store 362/362
> > # 0 written to /home/rob/Mail/missing-cc/ (0 matches)
> > # https://lore.kernel.org/all/ 72/72
> > # https://lore.kernel.org/all/ 4/4
> > # https://lore.kernel.org/all/ 131/?
> > # https://lore.kernel.org/all/ 184/?
> > # https://lore.kernel.org/all/ 412/?
> > # https://lore.kernel.org/all/ 603/?
> > # https://lore.kernel.org/all/ 853/?
> > # https://lore.kernel.org/all/ 1069/?
> > # https://lore.kernel.org/all/ 1442/?
> > # https://lore.kernel.org/all/ 1443/1443
> > # 1 written to /home/rob/Mail/pci/ (75 matches)
> > # 2 written to /home/rob/Mail/my-patches/ (148 matches)
> > # 7 written to /home/rob/Mail/from-me/ (1805 matches)
> >
> >
> > What I expected was 3 messages written to 'my-patches'.
> >
> > I think the problem is just simply that the new message missing
> > doesn't match the query, but is a reply to a match. So with a date
> > after the original match in the thread won't pick up anything. The 2nd
> > URL above indeed only has 2 results. I guess I just have to fetch a
> > wider window like a month every time? What's needed is a get any new
> > messages in existing threads. I don't suppose there's an efficient way
> > to do that?
>
> No, I don't think so.  I think this is a separate issue in lei...
> "t=1" in the remote query expands threads in a time-agnostic
> way, so I don't think that's the problem (though I may be wrong...).

Based on what the web interface presents, it sure seems like 't=1' is
independent of the query. The results listed are only those that match
the query and date range on the match.

For example, this query returns 3 matches:

https://lore.kernel.org/all/?x=m&t=1&q=((dfn%3Adrivers+OR+dfn%3Aarch+OR+dfn%3ADocumentation%2F*+OR+dfn%3Ainclude+OR+dfn%3Ascripts)+AND+f%3Arobh%40kernel.org+AND+rt%3A1641934905..)+AND+dt%3A20220630203819..

If I change 'dt' to 1 day earlier, I get 1 more match:

https://lore.kernel.org/all/?x=m&t=1&q=((dfn%3Adrivers+OR+dfn%3Aarch+OR+dfn%3ADocumentation%2F*+OR+dfn%3Ainclude+OR+dfn%3Ascripts)+AND+f%3Arobh%40kernel.org+AND+rt%3A1641934905..)+AND+dt%3A20220629203819..

That 4th match has a reply after 6/30, but the 1st query will not get
the reply. This is all reproducible without lei involved at all.

What seems to be needed is a 'thread date' which is the latest time
for any message in a thread that matches. Or perhaps some way to
separate the query from what's transferred. IOW, query for X, but only
send results newer than some date.

Rob

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: lei missing mails
  2022-07-11 21:59           ` Rob Herring
@ 2022-07-18 23:41             ` Eric Wong
  2022-07-20 22:57               ` [PATCH] www: note "x=m" and "t=1" (mis)use for GET requests Eric Wong
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Wong @ 2022-07-18 23:41 UTC (permalink / raw)
  To: Rob Herring; +Cc: meta

Rob Herring <robh@kernel.org> wrote:
> On Thu, Jun 30, 2022 at 2:55 AM Eric Wong <e@80x24.org> wrote:
> >
> > Rob Herring <robh@kernel.org> wrote:
> > > On Wed, Jun 29, 2022 at 11:27 AM Eric Wong <e@80x24.org> wrote:
> > > > Rob Herring <robh@kernel.org> wrote:
> > > > > On Wed, Jun 29, 2022 at 10:30 AM Eric Wong <e@80x24.org> wrote:
> > > > > > Rob Herring <robh@kernel.org> wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > I'm using lei with lore where I have 2 queries which overlap. Really,
> > > > > > > one is a subset of the other. On those overlapping threads, I'm
> > > > > > > finding that sometimes new messages are written to one mailbox and not
> > > > > > > the other. (At least sometimes, the messages may be missing from all
> > > > > > > mailboxes sometimes too. I'm not certain.) Using --remote-fudge-time
> > > > > > > to force refetching seems to get the missing mails. I haven't found
> > > > > > > anything strange in timestamps of the missing mails, but otherwise am
> > > > > > > not sure how to debug this further. The queries are retrieving full
> > > > > > > threads and the missing mails are in the threads, but not direct
> > > > > > > matches to the queries. I realize that's not a lot of detail to go on.
> > > > > > > Suggestions on debugging this further?
> > > > > >
> > > > > > Is this with 1.8 or 1.7?
> > > > >
> > > > > Commit 68b53c888911 actually. So post 1.8.
> > > >
> > > > OK, thanks for that info.
> > > >
> > > > > > I forgot to note in the release notes, but there were some
> > > > > > SQLite usage-related fixes which could avoid missing messages.
> > > > > >
> > > > > > You'll need "lei daemon-kill" after upgrading to 1.8 to ensure
> > > > > > the new code is running.
> > > > >
> > > > > It's possible I haven't done that since updating though I do vaguely
> > > > > recall seeing something about needing to do that. Is there any way to
> > > > > tell before I restart it?
> > > >
> > > > Not really, but it's pretty cheap to restart (assuming there's no
> > > > long-running jobs).
> > >
> > > I've restarted and just hit this again.
> >
> > Ugh, sorry to hear that :<
> >
> > > > > > What might be interesting is to use the URLs lei prints and
> > > > > > comparing the results w/o lei.
> > >
> > > $ lei up --all
> > > # updating /home/rob/Mail/from-me
> > > # updating /home/rob/Mail/missing-cc
> > > # updating /home/rob/Mail/my-patches
> > > # updating /home/rob/Mail/pci
> > > # https://lore.kernel.org/all/ limiting to 2022-06-27 12:42 -0600 and newer
> > > # https://lore.kernel.org/all/ limiting to 2022-06-27  9:50 -0600 and newer
> > > # https://lore.kernel.org/all/ limiting to 2022-06-27 12:42 -0600 and newer
> > > # /usr/bin/curl -Sf -s -d ''
> > > https://lore.kernel.org/all/?x=m&t=1&q=(dt%3A20220529211430..+AND+(f%3Arobh%40kernel.org+OR+f%3Arobh%2Bdt%40kernel.org))+AND+dt%3A20220627184226..
> > > # /home/rob/.local/share/lei/store 144/144
> > > # /home/rob/.local/share/lei/store 3/3
> > > # /usr/bin/curl -Sf -s -d ''
> > > https://lore.kernel.org/all/?x=m&t=1&q=((dfn%3Adrivers+OR+dfn%3Aarch+OR+dfn%3ADocumentation%2F*+OR+dfn%3Ainclude+OR+dfn%3Ascripts)+AND+f%3Arobh%40kernel.org+AND+rt%3A1640812470..)+AND+dt%3A20220627155025..
> > > # /usr/bin/curl -Sf -s -d ''
> > > https://lore.kernel.org/all/?x=m&t=1&q=(l%3Alinux-pci+dfn%3Adrivers%2Fpci%2Fcontroller+dt%3A20220529211430..)+AND+dt%3A20220627184226..
> > > # /home/rob/.local/share/lei/store 0/0
> > > # /home/rob/.local/share/lei/store 362/362
> > > # 0 written to /home/rob/Mail/missing-cc/ (0 matches)
> > > # https://lore.kernel.org/all/ 72/72
> > > # https://lore.kernel.org/all/ 4/4
> > > # https://lore.kernel.org/all/ 131/?
> > > # https://lore.kernel.org/all/ 184/?
> > > # https://lore.kernel.org/all/ 412/?
> > > # https://lore.kernel.org/all/ 603/?
> > > # https://lore.kernel.org/all/ 853/?
> > > # https://lore.kernel.org/all/ 1069/?
> > > # https://lore.kernel.org/all/ 1442/?
> > > # https://lore.kernel.org/all/ 1443/1443
> > > # 1 written to /home/rob/Mail/pci/ (75 matches)
> > > # 2 written to /home/rob/Mail/my-patches/ (148 matches)
> > > # 7 written to /home/rob/Mail/from-me/ (1805 matches)
> > >
> > >
> > > What I expected was 3 messages written to 'my-patches'.
> > >
> > > I think the problem is just simply that the new message missing
> > > doesn't match the query, but is a reply to a match. So with a date
> > > after the original match in the thread won't pick up anything. The 2nd
> > > URL above indeed only has 2 results. I guess I just have to fetch a
> > > wider window like a month every time? What's needed is a get any new
> > > messages in existing threads. I don't suppose there's an efficient way
> > > to do that?
> >
> > No, I don't think so.  I think this is a separate issue in lei...
> > "t=1" in the remote query expands threads in a time-agnostic
> > way, so I don't think that's the problem (though I may be wrong...).
> 
> Based on what the web interface presents, it sure seems like 't=1' is
> independent of the query. The results listed are only those that match
> the query and date range on the match.

Actually, for the HTML results, t=1 is ignored right now...
it's only for mboxrd downloads (via POST) at the moment...
That should be clarified/changed.

> For example, this query returns 3 matches:
> 
> https://lore.kernel.org/all/?x=m&t=1&q=((dfn%3Adrivers+OR+dfn%3Aarch+OR+dfn%3ADocumentation%2F*+OR+dfn%3Ainclude+OR+dfn%3Ascripts)+AND+f%3Arobh%40kernel.org+AND+rt%3A1641934905..)+AND+dt%3A20220630203819..
> 
> If I change 'dt' to 1 day earlier, I get 1 more match:
> 
> https://lore.kernel.org/all/?x=m&t=1&q=((dfn%3Adrivers+OR+dfn%3Aarch+OR+dfn%3ADocumentation%2F*+OR+dfn%3Ainclude+OR+dfn%3Ascripts)+AND+f%3Arobh%40kernel.org+AND+rt%3A1641934905..)+AND+dt%3A20220629203819..
> 
> That 4th match has a reply after 6/30, but the 1st query will not get
> the reply. This is all reproducible without lei involved at all.

Right.  t=1 only expands threads if they're linked via
References/In-Reply-To or (loosely) via matching Subjects.

> What seems to be needed is a 'thread date' which is the latest time
> for any message in a thread that matches. Or perhaps some way to
> separate the query from what's transferred. IOW, query for X, but only
> send results newer than some date.

Yeah, the thread ID info is stored independently of the search
index documents, though; so searching (Xapian) vs filtering
(SQLite) gets a bit tricky

I'll look into JMAP further before making more changes to the
data models to accomodate this.

Apologies for the delays; been stressed over other things :<

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH] www: note "x=m" and "t=1" (mis)use for GET requests
  2022-07-18 23:41             ` Eric Wong
@ 2022-07-20 22:57               ` Eric Wong
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2022-07-20 22:57 UTC (permalink / raw)
  To: meta; +Cc: Rob Herring

Eric Wong <e@80x24.org> wrote:
> Rob Herring <robh@kernel.org> wrote:
> > Based on what the web interface presents, it sure seems like 't=1' is
> > independent of the query. The results listed are only those that match
> > the query and date range on the match.
> 
> Actually, for the HTML results, t=1 is ignored right now...
> it's only for mboxrd downloads (via POST) at the moment...
> That should be clarified/changed.
> 
> > For example, this query returns 3 matches:
> > 
> > https://lore.kernel.org/all/?x=m&t=1&q=((dfn%3Adrivers+OR+dfn%3Aarch+OR+dfn%3ADocumentation%2F*+OR+dfn%3Ainclude+OR+dfn%3Ascripts)+AND+f%3Arobh%40kernel.org+AND+rt%3A1641934905..)+AND+dt%3A20220630203819..
> > 
> > If I change 'dt' to 1 day earlier, I get 1 more match:
> > 
> > https://lore.kernel.org/all/?x=m&t=1&q=((dfn%3Adrivers+OR+dfn%3Aarch+OR+dfn%3ADocumentation%2F*+OR+dfn%3Ainclude+OR+dfn%3Ascripts)+AND+f%3Arobh%40kernel.org+AND+rt%3A1641934905..)+AND+dt%3A20220629203819..
> > 
> > That 4th match has a reply after 6/30, but the 1st query will not get
> > the reply. This is all reproducible without lei involved at all.
> 
> Right.  t=1 only expands threads if they're linked via
> References/In-Reply-To or (loosely) via matching Subjects.

--------8<--------
Subject: [PATCH] www: note "x=m" and "t=1" (mis)use for GET requests

We require "x=m" (requests for mboxes) to be POST requests to
avoid unnecessary traffic from crawlers.  "t=1" only collapses
threads in the summary view, which isn't normally accessible
from <form> elements.

This also fixes the missing "[summary|nested]" element when
"x=m" is used.
---
 lib/PublicInbox/SearchView.pm | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm
index b1cdb480..b025ec96 100644
--- a/lib/PublicInbox/SearchView.pm
+++ b/lib/PublicInbox/SearchView.pm
@@ -1,4 +1,4 @@
-# Copyright (C) 2015-2021 all contributors <meta@public-inbox.org>
+# Copyright (C) all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 #
 # Displays search results for the web interface
@@ -193,18 +193,24 @@ sub search_nav_top {
 
 	my $x = $q->{x};
 	my $pfx = "\t\t\t";
-	if ($x eq '') {
-		my $t = $q->qs_html(x => 't');
-		$rv .= qq{<b>summary</b>|<a\nhref="?$t">nested</a>}
-	} elsif ($x eq 't') {
+	if ($x eq 't') {
 		my $s = $q->qs_html(x => '');
 		$rv .= qq{<a\nhref="?$s">summary</a>|<b>nested</b>};
 		$pfx = "thread overview <a\nhref=#t>below</a> | ";
+	} else {
+		my $t = $q->qs_html(x => 't');
+		$rv .= qq{<b>summary</b>|<a\nhref="?$t">nested</a>}
 	}
 	my $A = $q->qs_html(x => 'A', r => undef);
-	$rv .= qq{|<a\nhref="?$A">Atom feed</a>]};
+	$rv .= qq{|<a\nhref="?$A">Atom feed</a>]\n};
+	$rv .= <<EOM if $x ne 't' && $q->{t};
+*** "t=1" collapses threads in summary, "full threads" requires mbox.gz ***
+EOM
+	$rv .= <<EOM if $x eq 'm';
+*** "x=m" ignored for GET requests, use download buttons below ***
+EOM
 	if ($ctx->{ibx}->isrch->has_threadid) {
-		$rv .= qq{\n${pfx}download mbox.gz: } .
+		$rv .= qq{${pfx}download mbox.gz: } .
 			# we set name=z w/o using it since it seems required for
 			# lynx (but works fine for w3m).
 			qq{<input\ntype=submit\nname=z\n} .
@@ -212,7 +218,7 @@ sub search_nav_top {
 			qq{|<input\ntype=submit\nname=x\n} .
 				q{value="full threads"/>};
 	} else { # BOFH needs to --reindex
-		$rv .= qq{\n${pfx}download: } .
+		$rv .= qq{${pfx}download: } .
 			qq{<input\ntype=submit\nname=z\nvalue="mbox.gz"/>}
 	}
 	$rv .= qq{</pre></form><pre>};

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-07-20 22:57 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-29 16:15 lei missing mails Rob Herring
2022-06-29 16:30 ` Eric Wong
2022-06-29 16:53   ` Rob Herring
2022-06-29 17:27     ` Eric Wong
2022-06-29 22:01       ` Rob Herring
2022-06-30  8:55         ` Eric Wong
2022-07-07  9:48           ` Eric Wong
2022-07-11 21:17             ` Rob Herring
2022-07-11 21:59           ` Rob Herring
2022-07-18 23:41             ` Eric Wong
2022-07-20 22:57               ` [PATCH] www: note "x=m" and "t=1" (mis)use for GET requests Eric Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).