From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
 shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
 by dcvr.yhbt.net (Postfix) with ESMTP id A02871F86C;
 Tue, 1 Dec 2020 18:48:14 +0000 (UTC)
Date: Tue, 1 Dec 2020 18:48:14 +0000
From: Eric Wong
To: workflows@vger.kernel.org, meta@public-inbox.org
Subject: Re: WIP: searching all of lore
Message-ID: <20201201184814.GA32272@dcvr>
References: <20201126194543.GA30337@dcvr>
 <20201201140033.gyxmaejay2ddpiz3@nitro.local>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20201201140033.gyxmaejay2ddpiz3@nitro.local>
List-Id:

Konstantin Ryabitsev wrote:
> On Thu, Nov 26, 2020 at 07:45:43PM +0000, Eric Wong wrote:
> > Requires Tor, for now:
> >
> >   http://rskvuqcfnfizkjg6h5jvovwb3wkikzcwskf54lfpymus6mxrzw67b5ad.onion/all/
> >   http://lore.czquwvybam4bgbro.onion/all/
>
> Thanks for this work, Eric, things are looking good in my tests, though
> I uncovered a bunch of problems with b4 when used with torsocks. :)
>
> When grabbing t.mbox.gz threads from /all, it appears to properly
> reconstitute follow-ups from multiple mailing lists, correct?

Yup, though some duplicates appear due to different mailing
list-added trailers.  Maybe some of the PublicInbox::Filter::*
stuff (currently only for -mda + -watch) can be applied to the
indexing phase to better dedupe and drop trailers.

> Is there a
> way to "weight" different sources, so that when the same message-id
> exist in multiple places, we can prefer one source over another?

It indexes based on the order it iterates through the inboxes
and messages.  That usually follows the order in the config file,
especially if indexing is delayed.

Of course it's possible a message can show up in a low-priority
source first due to network latency or outages (something I'm
too familiar with :<).

I have an idea to fix that via --reindex which *might* allow
performance improvements on the Xapian side, too.  --reindex is
another mind twister when dealing with multiple histories
compared to normal inboxes and will need a new approach.  Been
working on that and my head hurts :x

> For
> example, this is useful when we're trying to do DKIM validation and some
> lists are known to mess that up, while others do the right thing.

Right, though I think it's somewhat less necessary given how
sensitive PublicInbox::ContentHash is compared to just using the
Message-ID to dedupe...

One bad thing about it being too sensitive is that NNTP speedups
couldn't rely solely on content hashing because of mailing list
trailers; see yesterday's message:

  https://public-inbox.org/meta/20201130194201.GA6687@dcvr/

> Thanks again,

You're welcome :>