From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00,
	URIBL_RED shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id D9CFF1F4B4;
	Mon, 28 Dec 2020 21:31:39 +0000 (UTC)
Date: Mon, 28 Dec 2020 21:31:39 +0000
From: Eric Wong
To: Konstantin Ryabitsev
Cc: meta@public-inbox.org
Subject: Re: public-inbox + mlmmj best practices?
Message-ID: <20201228213139.GA17600@dcvr>
References: <20201221212032.syunaxzrvcqcrose@chatter.i7.local>
	<20201221213914.GA9374@dcvr>
	<20201222062808.GA4522@dcvr>
	<20201228162218.zcnqxkgwa2i3nt66@chatter.i7.local>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20201228162218.zcnqxkgwa2i3nt66@chatter.i7.local>
List-Id:

Konstantin Ryabitsev wrote:
> On Tue, Dec 22, 2020 at 06:28:08AM +0000, Eric Wong wrote:
> > Eric Wong wrote:
> > >
> > > There's scripts/ssoma-replay which was v1-only and dependent on
> > > ssoma. I've been meaning to convert into something that reads
> > > NNTP so it's not locked into public-inbox. Maybe it could be
> > > part of `lei', too, for piping to arbitrary commands, dunno...
>
> I wrote grok-pi-piper a while back for the purpose of piping from git to
> patchwork.kernel.org. It's not complete yet, because we currently do not
> handle situations with rewritten history, but it's been working well enough. I
> have a write-up here:
>
> https://people.kernel.org/monsieuricon/subscribing-to-lore-lists-with-grokmirror
>
> What is the sanest way to recognize and handle history rewrites? Right now, we
> just keep track of the latest tip hash. On each subsequent run, we just iterate
> all commits between the recorded hash and the newest tip.
> My current thoughts are:
>
> - in addition to the latest tip hash, keep track of author, authordate and
>   message-id of the last processed message
> - if we no longer find the tracked hash in the repo, use author+authordate to
>   find the new hash of the latest message we processed, and verify with
>   message-id
> - if we cannot find the exact match (i.e. our latest processed message is gone
>   from history), find the first commit that happens before our recorded
>   authordate and use that as the "latest processed" jump-off point

That's a lot of persistent state to keep track of.

> This should do the right thing in most situations except for when the message
> that was deleted from history was sent with a bogus Date: header with a date
> in the future. In this case, we can miss valid messages in the queue.

AFAIK, V2Writable always does the right thing on -purge/-edit;
at least for WWW users(*). V2W does more work in rare cases
when history gets rewritten, but doesn't track anything beyond
the latest indexed commit hash.

In the V2Writable::log_range sub, it uses "git merge-base --is-ancestor"
(via is_ancestor wrapper) to cover the common case of contiguous
history.

Otherwise, it attempts "git merge-base" to find a common ancestor:

    if (common_ancestor_found)
        unindex some history starting at common ancestor
        reindex from common ancestor
    else
        unindex all history in epoch
        reindex epoch from scratch

AFAIK, the common_ancestor_found case is always true unless
somebody was wacky enough to run a full gc+prune immediately
after fetching. IOW, I don't think the else case happens in
practice.

(*) The downside to this approach is IMAP UIDs (NNTP article numbers)
    get changed, but I think I can work around that.

The workaround I'm thinking of involves capturing exact blob
OIDs during the unindex phase to create an OID => UID mapping.
reindex would reuse the OID => UID mapping to keep the same IMAP UID.
It could be loosened to use ContentHash, or whatever combination
of Message-ID/From/Date/etc, too.

> Any suggestions on how this can be improved?

Fwiw, my general approach is to keep track of and operate with
as little state as I can get away with (and discard it as soon
as possible). IME it avoids bugs by simplifying things to
accommodate my limited mental capacity.

The lack of distinct POLL{IN|OUT|HUP|ERR} callbacks in the DS
event loop is another example of that approach, as is the lack
of explicit {state} fields for per-client sockets: all state is
implied from what's in (or not in) read/write buffers.
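
For a consumer outside public-inbox, the log_range fallback described
above could be sketched roughly as follows. This is only an
illustration in Python, not the Perl code in V2Writable: the function
names (git_is_ancestor, git_merge_base, plan_sync) and the step labels
are hypothetical, and the only persistent state assumed is the last
processed commit hash. The two git invocations are the ones named
above, "git merge-base --is-ancestor" and "git merge-base".

```python
import subprocess
from typing import Callable, List, Optional, Tuple


def git_is_ancestor(repo: str, old: str, new: str) -> bool:
    """True if `old` is an ancestor of `new`.

    git exits 0 when it is, 1 when it is not, and with another
    non-zero status on errors (e.g. `old` no longer exists).
    """
    cmd = ["git", "-C", repo, "merge-base", "--is-ancestor", old, new]
    rc = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                        stderr=subprocess.DEVNULL).returncode
    return rc == 0


def git_merge_base(repo: str, a: str, b: str) -> Optional[str]:
    """Return a common ancestor of `a` and `b`, or None if there is none."""
    cmd = ["git", "-C", repo, "merge-base", a, b]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    out = proc.stdout.strip()
    return out if proc.returncode == 0 and out else None


def plan_sync(last: str, tip: str,
              is_ancestor: Callable[[str, str], bool],
              merge_base: Callable[[str, str], Optional[str]],
              ) -> List[Tuple[str, str]]:
    """Decide what to undo and redo, given only the last processed hash."""
    if is_ancestor(last, tip):
        # Common case: contiguous history, process only the new commits.
        return [("index", f"{last}..{tip}")]
    base = merge_base(last, tip)
    if base is not None:
        # History was rewritten after `base`: undo what we did past
        # that point, then redo from there along the new history.
        return [("unindex", f"{base}..{last}"), ("index", f"{base}..{tip}")]
    # No common ancestor (e.g. old objects already pruned): start over.
    return [("unindex_all", last), ("index_all", tip)]
```

A caller would pass the git-backed helpers bound to a repository, for
example plan_sync(last, tip, lambda o, n: git_is_ancestor(repo, o, n),
lambda a, b: git_merge_base(repo, a, b)). Keeping the decision separate
from the git calls means it can be exercised without a repository.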