From: Eric Wong <e@80x24.org>
To: workflows@vger.kernel.org, meta@public-inbox.org
Subject: Re: WIP: searching all of lore
Date: Tue, 1 Dec 2020 18:48:14 +0000
Message-ID: <20201201184814.GA32272@dcvr>
In-Reply-To: <20201201140033.gyxmaejay2ddpiz3@nitro.local>
Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Thu, Nov 26, 2020 at 07:45:43PM +0000, Eric Wong wrote:
> > Requires Tor, for now:
> >
> > http://rskvuqcfnfizkjg6h5jvovwb3wkikzcwskf54lfpymus6mxrzw67b5ad.onion/all/
> > http://lore.czquwvybam4bgbro.onion/all/
>
> Thanks for this work, Eric, things are looking good in my tests, though
> I uncovered a bunch of problems with b4 when used with torsocks. :)
>
> When grabbing t.mbox.gz threads from /all, it appears to properly
> reconstitute follow-ups from multiple mailing lists, correct?
Yup, though some duplicates appear because different mailing lists
add different trailers. Maybe some of the PublicInbox::Filter::* stuff
(currently only for -mda + -watch) can be applied to the indexing phase
to better dedupe and drop trailers.
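To illustrate why list-added trailers produce duplicates, here is a minimal Python sketch. It is not the PublicInbox::Filter::* API: `strip_trailer`, `TRAILER_RE`, and the Mailman-style footer shape are all assumptions made for the example, and real list footers vary widely.

```python
import re

# Assumed trailer shape: a rule of 20+ underscores followed by footer text
# running to the end of the body. This is an illustration, not a real filter.
TRAILER_RE = re.compile(r"\n_{20,}\n.*\Z", re.DOTALL)


def strip_trailer(body: str) -> str:
    """Return body with the assumed list-added footer removed."""
    return TRAILER_RE.sub("\n", body)


original = "Looks good to me.\n"
# The same message as delivered by two hypothetical lists.
via_list_a = original + "_" * 47 + "\nlist-a mailing list\n"
via_list_b = original  # this list appends nothing

# The raw bodies differ, so they look like two distinct messages ...
assert via_list_a != via_list_b
# ... but compare equal once the assumed trailer is dropped.
assert strip_trailer(via_list_a) == strip_trailer(via_list_b)
```

Whether a step like this belongs in the indexing phase is exactly the open question raised above; the sketch only shows the effect on comparison.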
> Is there a
> way to "weight" different sources, so that when the same message-id
> exist in multiple places, we can prefer one source over another?
It indexes based on the order it iterates through the inboxes
and messages. That usually follows the order in the config file,
especially if indexing is delayed. Of course it's possible a
message can show up in a low-priority source first due to
network latency or outages (something I'm too familiar with :<).
I have an idea to fix that via --reindex which *might*
allow performance improvements on the Xapian side, too.
--reindex is another mind twister when dealing with multiple
histories compared to normal inboxes and will need a new
approach. Been working on that and my head hurts :x
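The first-seen-wins behavior described above can be sketched roughly as follows. This is a simplified Python model, not public-inbox code: `index_first_seen`, the inbox names, and the Message-IDs are hypothetical, and the model keys on Message-ID alone for brevity.

```python
from typing import Dict, Iterable, List, Tuple


def index_first_seen(inboxes: Iterable[Tuple[str, List[str]]]) -> Dict[str, str]:
    """Map each Message-ID to the first inbox in which it was encountered."""
    seen: Dict[str, str] = {}
    for inbox, message_ids in inboxes:  # iteration follows config-file order
        for mid in message_ids:
            # A later copy never displaces one that was indexed earlier.
            seen.setdefault(mid, inbox)
    return seen


# "preferred-list" comes first in the (made-up) config, but <b@example>
# has not arrived there yet, e.g. because of network latency or an outage.
config_order = [
    ("preferred-list", ["<a@example>"]),
    ("other-list", ["<a@example>", "<b@example>"]),
]
winners = index_first_seen(config_order)

for mid, inbox in winners.items():
    print(mid, "->", inbox)
```

In this model `<a@example>` is attributed to the preferred source while `<b@example>` ends up attributed to the lower-priority one, which is the case a later --reindex pass would have to correct.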
> For
> example, this is useful when we're trying to do DKIM validation and some
> lists are known to mess that up, while others do the right thing.
Right, though I think it's somewhat less necessary given how sensitive
PublicInbox::ContentHash is compared to just using the Message-ID to
dedupe...
One downside of it being so sensitive: yesterday's NNTP speedups couldn't
rely solely on content hashing because of mailing list trailers:
https://public-inbox.org/meta/20201130194201.GA6687@dcvr/
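A rough Python sketch of the difference between deduplicating on Message-ID and on a content hash. The `content_key` function, the fields it covers, and the sample messages are assumptions for illustration; this is not the algorithm PublicInbox::ContentHash actually uses.

```python
import hashlib


def content_key(msg: dict) -> str:
    """Hypothetical content hash over a few headers plus the body."""
    h = hashlib.sha256()
    for field in ("subject", "from", "body"):
        h.update(msg[field].encode("utf-8"))
        h.update(b"\0")  # separator so adjacent fields cannot run together
    return h.hexdigest()


# Two copies of one message; the second went through a list that
# appended a footer to the body (all values are made up).
copy_a = {
    "message_id": "<x@example>",
    "subject": "[PATCH] fix",
    "from": "dev@example",
    "body": "patch text\n",
}
copy_b = dict(copy_a, body=copy_a["body"] + "--\nlist footer\n")

same_by_message_id = copy_a["message_id"] == copy_b["message_id"]
same_by_content = content_key(copy_a) == content_key(copy_b)

print("same Message-ID:", same_by_message_id)
print("same content hash:", same_by_content)
```

Under these assumptions the two copies collapse into one when keyed by Message-ID but stay distinct when keyed by content, so an unmodified copy is kept alongside a modified one. The same sensitivity is why a trailer alone is enough to defeat a content-hash match.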
> Thanks again,
You're welcome :>