* public-inbox + mlmmj best practices?
@ 2020-12-21 21:20 Konstantin Ryabitsev
  2020-12-21 21:39 ` Eric Wong
  0 siblings, 1 reply; 9+ messages in thread

From: Konstantin Ryabitsev @ 2020-12-21 21:20 UTC (permalink / raw)
To: meta

Hello:

One of our projects is looking at mailing list hosting and I was wondering if
I should steer them towards public-inbox + mlmmj as opposed to things like the
moribund googlegroups, groups.io, etc.

I know meta uses mlmmj, but there don't appear to be many docs on how things
are organized behind the scenes. Is there anything to help me figure out the
best arrangement for this setup?

-K

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: public-inbox + mlmmj best practices?
  2020-12-21 21:20 public-inbox + mlmmj best practices? Konstantin Ryabitsev
@ 2020-12-21 21:39 ` Eric Wong
  2020-12-22  6:28   ` Eric Wong
  0 siblings, 1 reply; 9+ messages in thread

From: Eric Wong @ 2020-12-21 21:39 UTC (permalink / raw)
To: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> Hello:
>
> One of our projects is looking at mailing list hosting and I was wondering if
> I should steer them towards public-inbox + mlmmj as opposed to things like the
> moribund googlegroups, groups.io, etc.
>
> I know meta uses mlmmj, but there don't appear to be many docs on how things
> are organized behind the scenes. Is there anything to help me figure out the
> best arrangement for this setup?

There's scripts/ssoma-replay, which was v1-only and dependent on
ssoma.  I've been meaning to convert it into something that reads
NNTP so it's not locked into public-inbox.  Maybe it could be
part of `lei', too, for piping to arbitrary commands, dunno...
* Re: public-inbox + mlmmj best practices?
  2020-12-21 21:39 ` Eric Wong
@ 2020-12-22  6:28 ` Eric Wong
  2020-12-28 16:22   ` Konstantin Ryabitsev
  0 siblings, 1 reply; 9+ messages in thread

From: Eric Wong @ 2020-12-22 6:28 UTC (permalink / raw)
To: meta

Eric Wong <e@80x24.org> wrote:
>
> There's scripts/ssoma-replay which was v1-only and dependent on
> ssoma.  I've been meaning to convert into something that reads
> NNTP so it's not locked into public-inbox.  Maybe it could be
> part of `lei', too, for piping to arbitrary commands, dunno...

Fwiw, so far `lei' is designed for interactive use, but maybe it
could be more...  A theoretical scripts/nntp-replay would probably
be a cronjob, exclusively, so maybe a standalone script is a better
idea.

OTOH, ssoma + ssoma-replay is perfectly fine for new and small
inboxes, and public-inbox-convert exists if things ever get big.
* Re: public-inbox + mlmmj best practices?
  2020-12-22  6:28 ` Eric Wong
@ 2020-12-28 16:22 ` Konstantin Ryabitsev
  2020-12-28 21:31   ` Eric Wong
  0 siblings, 1 reply; 9+ messages in thread

From: Konstantin Ryabitsev @ 2020-12-28 16:22 UTC (permalink / raw)
To: Eric Wong; +Cc: meta

On Tue, Dec 22, 2020 at 06:28:08AM +0000, Eric Wong wrote:
> Eric Wong <e@80x24.org> wrote:
> >
> > There's scripts/ssoma-replay which was v1-only and dependent on
> > ssoma.  I've been meaning to convert into something that reads
> > NNTP so it's not locked into public-inbox.  Maybe it could be
> > part of `lei', too, for piping to arbitrary commands, dunno...

I wrote grok-pi-piper a while back for the purpose of piping from git to
patchwork.kernel.org. It's not complete yet, because we currently do not
handle situations with rewritten history, but it's been working well enough. I
have a write-up here:

https://people.kernel.org/monsieuricon/subscribing-to-lore-lists-with-grokmirror

What is the sanest way to recognize and handle history rewrites? Right now, we
just keep track of the latest tip hash. On each subsequent run, we iterate over
all commits between the recorded hash and the newest tip. My current thoughts
are:

- in addition to the latest tip hash, keep track of the author, authordate, and
  message-id of the last processed message
- if we no longer find the tracked hash in the repo, use author+authordate to
  find the new hash of the latest message we processed, and verify with the
  message-id
- if we cannot find an exact match (i.e. our latest processed message is gone
  from history), find the first commit that happens before our recorded
  authordate and use that as the "latest processed" jump-off point

This should do the right thing in most situations, except when the message
that was deleted from history was sent with a bogus Date: header with a date
in the future. In that case, we can miss valid messages in the queue.

Any suggestions on how this can be improved?

-K
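The three-step fallback described above could be sketched roughly like this in Python. This is an illustrative sketch only: the `Commit` record and `find_resume_point` are hypothetical names, not grok-pi-piper's actual API, and real code would read commits via git rather than from a list.

```python
from dataclasses import dataclass

@dataclass
class Commit:
    oid: str          # commit hash
    author: str
    authordate: int   # epoch seconds
    message_id: str

def find_resume_point(history, state):
    """Index of the first unprocessed commit in oldest-first `history`,
    given saved `state`: the tip hash plus author/authordate/message-id
    of the last processed message."""
    # Fast path: the recorded tip hash is still present (no rewrite).
    for i, c in enumerate(history):
        if c.oid == state['tip']:
            return i + 1
    # History rewritten: relocate the same message by author+authordate,
    # verified by message-id.
    for i, c in enumerate(history):
        if (c.author, c.authordate, c.message_id) == \
                (state['author'], state['authordate'], state['message_id']):
            return i + 1
    # Message gone entirely: jump off after the newest commit that is
    # not newer than the recorded authordate.
    last = -1
    for i, c in enumerate(history):
        if c.authordate <= state['authordate']:
            last = i
    return last + 1
```

As the thread notes, the last branch inherits the Date-header caveat: a deleted message with a bogus future date would skip valid queued messages.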
* Re: public-inbox + mlmmj best practices?
  2020-12-28 16:22 ` Konstantin Ryabitsev
@ 2020-12-28 21:31 ` Eric Wong
  2021-01-04 20:12   ` Konstantin Ryabitsev
  0 siblings, 1 reply; 9+ messages in thread

From: Eric Wong @ 2020-12-28 21:31 UTC (permalink / raw)
To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> I wrote grok-pi-piper a while back for the purpose of piping from git to
> patchwork.kernel.org. It's not complete yet, because we currently do not
> handle situations with rewritten history, but it's been working well enough. I
> have a write-up here:
>
> https://people.kernel.org/monsieuricon/subscribing-to-lore-lists-with-grokmirror
>
> What is the sanest way to recognize and handle history rewrites? Right now, we
> just keep track of the latest tip hash. On each subsequent run, we just iterate
> all commits between the recorded hash and the newest tip. My current thoughts
> are:
>
> - in addition to the latest tip hash, keep track of author, authordate and
>   message-id of the last processed message
> - if we no longer find the tracked hash in the repo, use author+authordate to
>   find the new hash of the latest message we processed, and verify with
>   message-id
> - if we cannot find the exact match (i.e. our latest processed message is gone
>   from history), find the first commit that happens before our recorded
>   authordate and use that as the "latest processed" jump-off point

That's a lot of persistent state to keep track of.

> This should do the right thing in most situations except for when the message
> that was deleted from history was sent with a bogus Date: header with a date
> in the future. In this case, we can miss valid messages in the queue.

AFAIK, V2Writable always does the right thing on -purge/-edit;
at least for WWW users(*).

V2W does more work in rare cases when history gets rewritten,
but doesn't track anything beyond the latest indexed commit hash.

In the V2Writable::log_range sub, it uses "git merge-base --is-ancestor"
(via the is_ancestor wrapper) to cover the common case of contiguous
history.

Otherwise, it attempts "git merge-base" to find a common ancestor:

	if (common_ancestor_found)
		unindex some history starting at common ancestor
		reindex from common ancestor
	else
		unindex all history in epoch
		reindex epoch from scratch

AFAIK, the common_ancestor_found case is always true unless
somebody was wacky enough to run a full gc+prune immediately
after fetching.  IOW, I don't think the else case happens
in practice.

(*) The downside to this approach is that IMAP UIDs (NNTP article
numbers) get changed, but I think I can work around that.
The workaround I'm thinking of involves capturing exact blob OIDs
during the unindex phase to create an OID => UID mapping.
reindex would reuse the OID => UID mapping to keep the same IMAP
UID.  It could be loosened to use ContentHash, or whatever
combination of Message-ID/From/Date/etc, too.

> Any suggestions on how this can be improved?

Fwiw, my general approach is to keep track of and operate with as
little state as I can get away with (and discard it as soon as
possible).  IME it avoids bugs by simplifying things to accommodate
my limited mental capacity.  The lack of distinct POLL{IN|OUT|HUP|ERR}
callbacks in the DS event loop is another example of that approach,
as is the lack of explicit {state} fields for per-client sockets:
all state is implied from what's in (or not in) read/write buffers.
* Re: public-inbox + mlmmj best practices?
  2020-12-28 21:31 ` Eric Wong
@ 2021-01-04 20:12 ` Konstantin Ryabitsev
  2021-01-05  1:06   ` Eric Wong
  0 siblings, 1 reply; 9+ messages in thread

From: Konstantin Ryabitsev @ 2021-01-04 20:12 UTC (permalink / raw)
To: Eric Wong; +Cc: meta

On Mon, Dec 28, 2020 at 09:31:39PM +0000, Eric Wong wrote:
> AFAIK, V2Writable always does the right thing on -purge/-edit;
> at least for WWW users(*).
>
> V2W does more work in rare cases when history gets rewritten,
> but doesn't track anything beyond the latest indexed commit
> hash.
>
> In the V2Writable::log_range sub, it uses "git merge-base --is-ancestor"
> (via is_ancestor wrapper) to cover the common case of contiguous history.
>
> Otherwise, it attempts "git merge-base" to find a common ancestor:
>
> 	if (common_ancestor_found)
> 		unindex some history starting at common ancestor
> 		reindex from common ancestor
> 	else
> 		unindex all history in epoch
> 		reindex epoch from scratch

I think I understand, but in the case of grok-pi-piper, unindexing is not an
option, since we can't control what the receiving-end app has already done
with the messages we have previously piped to it. We can't assume that it will
do the right thing when it receives duplicate messages, so we need to somehow
make sure that we don't pipe the same message twice.

> AFAIK, the common_ancestor_found case is always true unless
> somebody was wacky enough to run a full gc+prune immediately
> after fetching.  IOW, I don't think the else case happens
> in practice.

:) It kinda does happen in the grok-pi-piper case, since one of the config
options is to continuously "reshallow" the repository so it contains basically
no objects.

https://git.kernel.org/pub/scm/utils/grokmirror/grokmirror.git/tree/grokmirror/pi_piper.py#n58

I know that this is "wacky", as you say, but it helps save dramatic amounts of
space when cloning most of the lore.kernel.org repositories. We can still use
"git fetch --deepen" when necessary, but this does make it impossible to use
the common-ancestor strategy when dealing with history rewrites.

-K
* Re: public-inbox + mlmmj best practices?
  2021-01-04 20:12 ` Konstantin Ryabitsev
@ 2021-01-05  1:06 ` Eric Wong
  2021-01-05  1:29   ` [PATCH] v2writable: exact discontiguous history handling Eric Wong
  0 siblings, 1 reply; 9+ messages in thread

From: Eric Wong @ 2021-01-05 1:06 UTC (permalink / raw)
To: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Mon, Dec 28, 2020 at 09:31:39PM +0000, Eric Wong wrote:
> > In the V2Writable::log_range sub, it uses "git merge-base --is-ancestor"
> > (via is_ancestor wrapper) to cover the common case of contiguous history.
> >
> > Otherwise, it attempts "git merge-base" to find a common ancestor:
> >
> > 	if (common_ancestor_found)
> > 		unindex some history starting at common ancestor
> > 		reindex from common ancestor
> > 	else
> > 		unindex all history in epoch
> > 		reindex epoch from scratch
>
> I think I understand, but in the case of grok-pi-piper, unindexing is not an
> option, since we can't control what the receiving-end app has already done
> with the messages we have previously piped to it. We can't assume that it will
> do the right thing when it receives duplicate messages, so we need to somehow
> make sure that we don't pipe the same message twice.

Nevermind, I just reread my code more carefully :x

Actually, the unindexing code currently stores an {unindexed} hash,
which is a:

	{ Message-ID => (NNTP )num }

mapping.  This allows most unedited messages to keep the same NNTP
article number so clients don't see them twice.  "Most" meaning
non-broken messages which don't have reused Message-IDs.

I'm thinking {unindexed} should be a:

	{ OID => [ num, Message-ID ] }

mapping.  That would allow the new version of the edited message to
be piped and seen by NNTP/IMAP readers.

You *do* want to pipe the new version of the message you've edited,
right?

> > AFAIK, the common_ancestor_found case is always true unless
> > somebody was wacky enough to run a full gc+prune immediately
> > after fetching.  IOW, I don't think the else case happens
> > in practice.
>
> :) It kinda does in grok-pi-piper case, since one of the config options is to
> continuously "reshallow" the repository to basically contain no objects.
>
> https://git.kernel.org/pub/scm/utils/grokmirror/grokmirror.git/tree/grokmirror/pi_piper.py#n58
>
> I know that this is "wacky" as you say, but it helps save dramatic amounts of
> space when cloning most of lore.kernel.org repositories. We can still use "git
> fetch --deepen" when necessary, but this does make it impossible to use the
> common ancestor strategy when dealing with history rewrites.

Understood.  So yeah, actually the current {unindexed} hash in
V2Writable mostly does what we want, but I'm preparing a patch which
does the aforementioned { OID => [ num, Message-ID ] } mapping.
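The proposed { OID => [ num, Message-ID ] } bookkeeping can be sketched in a few lines of Python. The function names here are hypothetical, chosen for illustration; the point is that keying on the blob OID (exact content) lets unchanged messages keep their article numbers while edited messages fall through to a fresh number.

```python
import binascii

def unindex(oid_hex, nums_mids, unindexed):
    """Record (num, Message-ID) pairs being unindexed, keyed by the
    binary blob OID, lowest article numbers first."""
    oidbin = binascii.unhexlify(oid_hex)
    for num, mid in sorted(nums_mids):
        unindexed.setdefault(oidbin, []).extend([num, mid])

def reindex_num(oid_hex, next_num, unindexed):
    """Reuse the old article number iff the blob content is unchanged;
    an edited message (new OID) gets a fresh number so NNTP/IMAP
    clients see the new version."""
    oidbin = binascii.unhexlify(oid_hex)
    u = unindexed.get(oidbin)
    if u:
        num, mid = u[0], u[1]
        del u[:2]
        if not u:                 # done with this OID
            del unindexed[oidbin]
        return num, mid
    return next_num, None         # new or edited content
```

This mirrors the shape of the mapping discussed above, not the actual V2Writable code, which is Perl and woven into the cat_async callbacks.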
* [PATCH] v2writable: exact discontiguous history handling
  2021-01-05  1:06 ` Eric Wong
@ 2021-01-05  1:29 ` Eric Wong
  2021-01-09 22:21   ` Eric Wong
  0 siblings, 1 reply; 9+ messages in thread

From: Eric Wong @ 2021-01-05 1:29 UTC (permalink / raw)
To: meta

Eric Wong <e@80x24.org> wrote:
> That would allow the new version of the edited message to be
> piped and seen by NNTP/IMAP readers.
>
> You *do* want to pipe the new version of the message you've
> edited, right?

---------8<--------
Subject: [PATCH] v2writable: exact discontiguous history handling

We've always temporarily unindexed messages before reindexing
them again if there's discontiguous history.  This change improves
the mechanism we use to prevent NNTP and IMAP clients from seeing
duplicate messages.

Previously, we relied on mapping Message-IDs to NNTP article numbers
to ensure clients would not see the same message twice.  This worked
for most messages, but not for messages with reused or duplicate
Message-IDs.

Instead of relying on Message-IDs as a key, we now rely on the git
blob object ID for exact content matching.  This allows truly
different messages to show up for NNTP|IMAP clients, while still
preventing those clients from seeing the same message again.
---
 lib/PublicInbox/V2Writable.pm | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 459c7e86..54004fd7 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -888,12 +888,16 @@ sub index_oid { # cat_async callback
 	}
 	# {unindexed} is unlikely
-	if ((my $unindexed = $arg->{unindexed}) && scalar(@$mids) == 1) {
-		$num = delete($unindexed->{$mids->[0]});
+	if (my $unindexed = $arg->{unindexed}) {
+		my $oidbin = pack('H*', $oid);
+		my $u = $unindexed->{$oidbin};
+		($num, $mid0) = splice(@$u, 0, 2) if $u;
 		if (defined $num) {
-			$mid0 = $mids->[0];
 			$self->{mm}->mid_set($num, $mid0);
-			delete($arg->{unindexed}) if !keys(%$unindexed);
+			if (scalar(@$u) == 0) { # done with current OID
+				delete $unindexed->{$oidbin};
+				delete($arg->{unindexed}) if !keys(%$unindexed);
+			}
 		}
 	}
 	if (!defined($num)) { # reuse if reindexing (or duplicates)
@@ -1160,10 +1164,13 @@ sub unindex_oid ($$;$) { # git->cat_async callback
 		warn "BUG: multiple articles linked to $oid\n",
 			join(',', sort keys %gone), "\n";
 	}
-	foreach my $num (keys %gone) {
+	# reuse (num => mid) mapping in ascending numeric order
+	for my $num (sort { $a <=> $b } keys %gone) {
+		$num += 0;
 		if ($unindexed) {
 			my $mid0 = $mm->mid_for($num);
-			$unindexed->{$mid0} = $num;
+			my $oidbin = pack('H*', $oid);
+			push @{$unindexed->{$oidbin}}, $num, $mid0;
 		}
 		$mm->num_delete($num);
 	}
@@ -1179,7 +1186,7 @@ sub git { $_[0]->{ibx}->git }
 sub unindex_todo ($$$) {
 	my ($self, $sync, $unit) = @_;
 	my $unindex_range = delete($unit->{unindex_range}) // return;
-	my $unindexed = $sync->{unindexed} //= {}; # $mid0 => $num
+	my $unindexed = $sync->{unindexed} //= {}; # $oidbin => [$num, $mid0]
 	my $before = scalar keys %$unindexed;
 	# order does not matter, here:
 	my $fh = $unit->{git}->popen(qw(log --raw -r --no-notes --no-color
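The pack('H*', $oid) calls in the patch turn the 40-character hex object ID into its 20-byte binary form, so the {unindexed} hash is keyed by compact binary OIDs. For readers more familiar with Python, the equivalent is:

```python
import binascii

def oidbin(oid_hex: str) -> bytes:
    # equivalent of Perl's pack('H*', $oid): hex digits -> raw bytes
    return binascii.unhexlify(oid_hex)
```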
* Re: [PATCH] v2writable: exact discontiguous history handling
  2021-01-05  1:29 ` [PATCH] v2writable: exact discontiguous history handling Eric Wong
@ 2021-01-09 22:21 ` Eric Wong
  0 siblings, 0 replies; 9+ messages in thread

From: Eric Wong @ 2021-01-09 22:21 UTC (permalink / raw)
To: meta

Pushed as 392533147f50061d93cb9ed82abf98067dde5472
end of thread, other threads:[~2021-01-09 22:21 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2020-12-21 21:20 public-inbox + mlmmj best practices? Konstantin Ryabitsev
2020-12-21 21:39 ` Eric Wong
2020-12-22  6:28   ` Eric Wong
2020-12-28 16:22     ` Konstantin Ryabitsev
2020-12-28 21:31       ` Eric Wong
2021-01-04 20:12         ` Konstantin Ryabitsev
2021-01-05  1:06           ` Eric Wong
2021-01-05  1:29             ` [PATCH] v2writable: exact discontiguous history handling Eric Wong
2021-01-09 22:21               ` Eric Wong
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).