* public-inbox + mlmmj best practices?
@ 2020-12-21 21:20 Konstantin Ryabitsev
2020-12-21 21:39 ` Eric Wong
0 siblings, 1 reply; 9+ messages in thread
From: Konstantin Ryabitsev @ 2020-12-21 21:20 UTC (permalink / raw)
To: meta
Hello:
One of our projects is looking at mailing list hosting and I was wondering if
I should steer them towards public-inbox + mlmmj as opposed to things like the
moribund googlegroups, groups.io, etc.
I know meta uses mlmmj, but there don't appear to be many docs on how things
are organized behind the scenes. Is there anything to help me figure out the
best arrangement for this setup?
-K
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: public-inbox + mlmmj best practices?
2020-12-21 21:20 public-inbox + mlmmj best practices? Konstantin Ryabitsev
@ 2020-12-21 21:39 ` Eric Wong
2020-12-22 6:28 ` Eric Wong
0 siblings, 1 reply; 9+ messages in thread
From: Eric Wong @ 2020-12-21 21:39 UTC (permalink / raw)
To: meta
Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> Hello:
>
> One of our projects is looking at mailing list hosting and I was wondering if
> I should steer them towards public-inbox + mlmmj as opposed to things like the
> moribund googlegroups, groups.io, etc.
>
> I know meta uses mlmmj, but there don't appear to be many docs on how things
> are organized behind the scenes. Is there anything to help me figure out the
> best arrangement for this setup?
There's scripts/ssoma-replay which was v1-only and dependent on
ssoma. I've been meaning to convert it into something that reads
NNTP so it's not locked into public-inbox. Maybe it could be
part of `lei', too, for piping to arbitrary commands, dunno...
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: public-inbox + mlmmj best practices?
2020-12-21 21:39 ` Eric Wong
@ 2020-12-22 6:28 ` Eric Wong
2020-12-28 16:22 ` Konstantin Ryabitsev
0 siblings, 1 reply; 9+ messages in thread
From: Eric Wong @ 2020-12-22 6:28 UTC (permalink / raw)
To: meta
Eric Wong <e@80x24.org> wrote:
>
> There's scripts/ssoma-replay which was v1-only and dependent on
> ssoma. I've been meaning to convert it into something that reads
> NNTP so it's not locked into public-inbox. Maybe it could be
> part of `lei', too, for piping to arbitrary commands, dunno...
Fwiw, so far `lei' is designed for interactive use but maybe it
could be more... A theoretical scripts/nntp-replay would
probably be a cronjob, exclusively, so maybe a standalone script
is a better idea.
OTOH, ssoma + ssoma-replay is perfectly fine for new and small
inboxes and public-inbox-convert exists if things ever get big.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: public-inbox + mlmmj best practices?
2020-12-22 6:28 ` Eric Wong
@ 2020-12-28 16:22 ` Konstantin Ryabitsev
2020-12-28 21:31 ` Eric Wong
0 siblings, 1 reply; 9+ messages in thread
From: Konstantin Ryabitsev @ 2020-12-28 16:22 UTC (permalink / raw)
To: Eric Wong; +Cc: meta
On Tue, Dec 22, 2020 at 06:28:08AM +0000, Eric Wong wrote:
> Eric Wong <e@80x24.org> wrote:
> >
> > There's scripts/ssoma-replay which was v1-only and dependent on
> > ssoma. I've been meaning to convert it into something that reads
> > NNTP so it's not locked into public-inbox. Maybe it could be
> > part of `lei', too, for piping to arbitrary commands, dunno...
I wrote grok-pi-piper a while back for the purpose of piping from git to
patchwork.kernel.org. It's not complete yet, because we currently do not
handle situations with rewritten history, but it's been working well enough. I
have a write-up here:
https://people.kernel.org/monsieuricon/subscribing-to-lore-lists-with-grokmirror
What is the sanest way to recognize and handle history rewrites? Right now, we
just keep track of the latest tip hash. On each subsequent run, we iterate over
all commits between the recorded hash and the newest tip. My current thoughts
are:
- in addition to the latest tip hash, keep track of author, authordate and
message-id of the last processed message
- if we no longer find the tracked hash in the repo, use author+authordate to
find the new hash of the latest message we processed, and verify with
message-id
- if we cannot find the exact match (i.e. our latest processed message is gone
from history), find the first commit that happens before our recorded
authordate and use that as the "latest processed" jump-off point
This should do the right thing in most situations except for when the message
that was deleted from history was sent with a bogus Date: header with a date
in the future. In this case, we can miss valid messages in the queue.
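The fallback steps above could be sketched roughly as follows. This is a pure-Python model for illustration only; the `Commit` record and `find_resume_point` name are hypothetical, not grok-pi-piper's actual code:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Commit:
    sha: str
    author: str
    authordate: int   # epoch seconds
    message_id: str

def find_resume_point(history: list[Commit], state: dict) -> Optional[int]:
    """Return the index in `history` of the last-processed message.

    `history` is oldest-to-newest; `state` holds what the previous run
    recorded: sha, author, authordate, message_id.
    """
    # Fast path: the recorded tip hash is still present.
    for i, c in enumerate(history):
        if c.sha == state['sha']:
            return i
    # History was rewritten: look for the same message under a new sha
    # via author+authordate, verifying with Message-ID.
    for i, c in enumerate(history):
        if (c.author == state['author']
                and c.authordate == state['authordate']
                and c.message_id == state['message_id']):
            return i
    # Last resort: the message itself was deleted; resume from the last
    # commit at or before the recorded authordate.  (A bogus future
    # Date: header on the deleted message defeats this, as noted above.)
    best = None
    for i, c in enumerate(history):
        if c.authordate <= state['authordate']:
            best = i
    return best
```

Everything after the returned index would then be piped as new.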
Any suggestions on how this can be improved?
-K
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: public-inbox + mlmmj best practices?
2020-12-28 16:22 ` Konstantin Ryabitsev
@ 2020-12-28 21:31 ` Eric Wong
2021-01-04 20:12 ` Konstantin Ryabitsev
0 siblings, 1 reply; 9+ messages in thread
From: Eric Wong @ 2020-12-28 21:31 UTC (permalink / raw)
To: Konstantin Ryabitsev; +Cc: meta
Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Tue, Dec 22, 2020 at 06:28:08AM +0000, Eric Wong wrote:
> > Eric Wong <e@80x24.org> wrote:
> > >
> > > There's scripts/ssoma-replay which was v1-only and dependent on
> > > ssoma. I've been meaning to convert into something that reads
> > > NNTP so it's not locked into public-inbox. Maybe it could be
> > > part of `lei', too, for piping to arbitrary commands, dunno...
>
> I wrote grok-pi-piper a while back for the purpose of piping from git to
> patchwork.kernel.org. It's not complete yet, because we currently do not
> handle situations with rewritten history, but it's been working well enough. I
> have a write-up here:
>
> https://people.kernel.org/monsieuricon/subscribing-to-lore-lists-with-grokmirror
>
> What is the sanest way to recognize and handle history rewrites? Right now, we
> just keep track of the latest tip hash. On each subsequent run, we just iterate
> all commits between the recorded hash and the newest tip. My current thoughts
> are:
>
> - in addition to the latest tip hash, keep track of author, authordate and
> message-id of the last processed message
> - if we no longer find the tracked hash in the repo, use author+authordate to
> find the new hash of the latest message we processed, and verify with
> message-id
> - if we cannot find the exact match (i.e. our latest processed message is gone
> from history), find the first commit that happens before our recorded
> authordate and use that as the "latest processed" jump-off point
That's a lot of persistent state to keep track of.
> This should do the right thing in most situations except for when the message
> that was deleted from history was sent with a bogus Date: header with a date
> in the future. In this case, we can miss valid messages in the queue.
AFAIK, V2Writable always does the right thing on -purge/-edit;
at least for WWW users(*).
V2W does more work in rare cases when history gets rewritten,
but doesn't track anything beyond the latest indexed commit
hash.
In the V2Writable::log_range sub, it uses "git merge-base --is-ancestor"
(via is_ancestor wrapper) to cover the common case of contiguous history.
Otherwise, it attempts "git merge-base" to find a common ancestor:
	if (common_ancestor_found)
		unindex some history starting at common ancestor
		reindex from common ancestor
	else
		unindex all history in epoch
		reindex epoch from scratch
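That decision can be illustrated with the underlying git commands. This is a standalone Python sketch of the same logic, not public-inbox's actual Perl; `log_range` here only returns the range string:

```python
import subprocess

def is_ancestor(git_dir: str, a: str, b: str) -> bool:
    """True if commit `a` is an ancestor of `b` (contiguous history)."""
    return subprocess.run(
        ['git', '--git-dir', git_dir, 'merge-base', '--is-ancestor', a, b],
        capture_output=True).returncode == 0

def merge_base(git_dir: str, a: str, b: str):
    """Common ancestor of `a` and `b`, or None if there is none."""
    p = subprocess.run(['git', '--git-dir', git_dir, 'merge-base', a, b],
                       capture_output=True, text=True)
    return p.stdout.strip() if p.returncode == 0 else None

def log_range(git_dir: str, last_indexed: str, tip: str) -> str:
    if is_ancestor(git_dir, last_indexed, tip):
        return f'{last_indexed}..{tip}'   # contiguous: index only the new part
    base = merge_base(git_dir, last_indexed, tip)
    if base is not None:
        # unindex back to the common ancestor, then reindex base..tip
        return f'{base}..{tip}'
    return tip                            # no common ancestor: reindex the epoch
```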
AFAIK, the common_ancestor_found case is always true unless
somebody was wacky enough to run a full gc+prune immediately
after fetching. IOW, I don't think the else case happens
in practice.
(*) The downside to this approach is IMAP UIDs (NNTP article
numbers) get changed, but I think I can workaround that. The
workaround I'm thinking of involves capturing exact blob OIDs
during the unindex phase to create an OID => UID mapping.
reindex would reuse the OID => UID mapping to keep the same
IMAP UID. It could be loosened to use ContentHash, or
whatever combination of Message-ID/From/Date/etc, too.
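In outline, that workaround might look like this. A hypothetical sketch only, not the actual V2Writable code, and the OID here is a stand-in hash rather than a true git blob OID:

```python
import hashlib

def unindex(messages, uid_of):
    """Capture an OID => UID mapping while removing messages from the index."""
    oid_to_uid = {}
    for msg in messages:
        oid = hashlib.sha1(msg).hexdigest()  # stand-in for the git blob OID
        oid_to_uid[oid] = uid_of[oid]
        del uid_of[oid]
    return oid_to_uid

def reindex(messages, oid_to_uid, next_uid):
    """Reassign old UIDs to unchanged blobs; edited ones get fresh UIDs."""
    uid_of = {}
    for msg in messages:
        oid = hashlib.sha1(msg).hexdigest()
        if oid in oid_to_uid:     # content unchanged: keep the IMAP UID
            uid_of[oid] = oid_to_uid[oid]
        else:                     # edited message: new UID, clients refetch it
            uid_of[oid] = next_uid
            next_uid += 1
    return uid_of
```

An unedited message keeps its UID across the rewrite; only the edited one appears new to IMAP/NNTP clients.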
> Any suggestions on how this can be improved?
Fwiw, my general approach is to keep track of and operate with
as little state as I can get away with (and discard it as soon
as possible).
IME it avoids bugs by simplifying things to accommodate my
limited mental capacity.
The lack of distinct POLL{IN|OUT|HUP|ERR} callbacks in the DS
event loop is another example of that approach, as is the lack
of explicit {state} fields for per-client sockets: all state
is implied from what's in (or not in) read/write buffers.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: public-inbox + mlmmj best practices?
2020-12-28 21:31 ` Eric Wong
@ 2021-01-04 20:12 ` Konstantin Ryabitsev
2021-01-05 1:06 ` Eric Wong
0 siblings, 1 reply; 9+ messages in thread
From: Konstantin Ryabitsev @ 2021-01-04 20:12 UTC (permalink / raw)
To: Eric Wong; +Cc: meta
On Mon, Dec 28, 2020 at 09:31:39PM +0000, Eric Wong wrote:
> AFAIK, V2Writable always does the right thing on -purge/-edit;
> at least for WWW users(*).
>
> V2W does more work in rare cases when history gets rewritten,
> but doesn't track anything beyond the latest indexed commit
> hash.
>
> In the V2Writable::log_range sub, it uses "git merge-base --is-ancestor"
> (via is_ancestor wrapper) to cover the common case of contiguous history.
>
> Otherwise, it attempts "git merge-base" to find a common ancestor:
>
> 	if (common_ancestor_found)
> 		unindex some history starting at common ancestor
> 		reindex from common ancestor
> 	else
> 		unindex all history in epoch
> 		reindex epoch from scratch
I think I understand, but in the case of grok-pi-piper, unindexing is not an
option, since we can't control what the receiving-end app has already done
with the messages we have previously piped to it. We can't assume that it will
do the right thing when it receives duplicate messages, so we need to somehow
make sure that we don't pipe the same message twice.
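One way to guarantee that, sketched here with hypothetical names rather than grok-pi-piper's actual implementation, is to keep a persistent ledger of content hashes of everything already piped and consult it before delivery:

```python
import hashlib

def pipe_once(msg_bytes, piped_hashes, pipe_fn):
    """Pipe a message downstream only if it has not been delivered before.

    `piped_hashes` is the persistent set of hashes of already-delivered
    messages; because it survives across runs, re-walking old commits
    after a history rewrite cannot produce duplicate deliveries.
    """
    h = hashlib.sha256(msg_bytes).hexdigest()
    if h in piped_hashes:
        return False
    pipe_fn(msg_bytes)
    piped_hashes.add(h)
    return True
```

The ledger grows with the archive, which is the trade-off against the minimal-state approach described earlier in the thread.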
> AFAIK, the common_ancestor_found case is always true unless
> somebody was wacky enough to run a full gc+prune immediately
> after fetching. IOW, I don't think the else case happens
> in practice.
:) It kinda does in the grok-pi-piper case, since one of the config options is
to continuously "reshallow" the repository to basically contain no objects.
https://git.kernel.org/pub/scm/utils/grokmirror/grokmirror.git/tree/grokmirror/pi_piper.py#n58
I know that this is "wacky" as you say, but it helps save dramatic amounts of
space when cloning most of lore.kernel.org repositories. We can still use "git
fetch --deepen" when necessary, but this does make it impossible to use the
common ancestor strategy when dealing with history rewrites.
-K
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: public-inbox + mlmmj best practices?
2021-01-04 20:12 ` Konstantin Ryabitsev
@ 2021-01-05 1:06 ` Eric Wong
2021-01-05 1:29 ` [PATCH] v2writable: exact discontiguous history handling Eric Wong
0 siblings, 1 reply; 9+ messages in thread
From: Eric Wong @ 2021-01-05 1:06 UTC (permalink / raw)
To: meta
Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Mon, Dec 28, 2020 at 09:31:39PM +0000, Eric Wong wrote:
> > AFAIK, V2Writable always does the right thing on -purge/-edit;
> > at least for WWW users(*).
> >
> > V2W does more work in rare cases when history gets rewritten,
> > but doesn't track anything beyond the latest indexed commit
> > hash.
> >
> > In the V2Writable::log_range sub, it uses "git merge-base --is-ancestor"
> > (via is_ancestor wrapper) to cover the common case of contiguous history.
> >
> > Otherwise, it attempts "git merge-base" to find a common ancestor:
> >
> > 	if (common_ancestor_found)
> > 		unindex some history starting at common ancestor
> > 		reindex from common ancestor
> > 	else
> > 		unindex all history in epoch
> > 		reindex epoch from scratch
>
> I think I understand, but in the case of grok-pi-piper, unindexing is not an
> option, since we can't control what the receiving-end app has already done
> with the messages we have previously piped to it. We can't assume that it will
> do the right thing when it receives duplicate messages, so we need to somehow
> make sure that we don't pipe the same message twice.
Nevermind, I just reread my code more carefully :x
Actually, the unindexing code currently stores an {unindexed}
hash which is a { Message-ID => (NNTP) num } mapping, which
allows most unedited messages to keep the same NNTP article
number so clients don't see them twice. "Most" meaning
non-broken messages which don't have reused Message-IDs.
I'm thinking {unindexed} should be a
{ OID => [ num, Message-ID ] } mapping
That would allow the new version of the edited message to be
piped and seen by NNTP/IMAP readers.
You *do* want to pipe the new version of the message you've
edited, right?
> > AFAIK, the common_ancestor_found case is always true unless
> > somebody was wacky enough to run a full gc+prune immediately
> > after fetching. IOW, I don't think the else case happens
> > in practice.
>
> :) It kinda does in grok-pi-piper case, since one of the config options is to
> continuously "reshallow" the repository to basically contain no objects.
>
> https://git.kernel.org/pub/scm/utils/grokmirror/grokmirror.git/tree/grokmirror/pi_piper.py#n58
>
> I know that this is "wacky" as you say, but it helps save dramatic amounts of
> space when cloning most of lore.kernel.org repositories. We can still use "git
> fetch --deepen" when necessary, but this does make it impossible to use the
> common ancestor strategy when dealing with history rewrites.
Understood. So yeah, actually the current {unindexed} hash in
V2Writable mostly does what we want, but I'm preparing a patch
which does the aforementioned { OID => [ num, Message-ID ] }
mapping.
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH] v2writable: exact discontiguous history handling
2021-01-05 1:06 ` Eric Wong
@ 2021-01-05 1:29 ` Eric Wong
2021-01-09 22:21 ` Eric Wong
0 siblings, 1 reply; 9+ messages in thread
From: Eric Wong @ 2021-01-05 1:29 UTC (permalink / raw)
To: meta
Eric Wong <e@80x24.org> wrote:
> That would allow the new version of the edited message to be
> piped and seen by NNTP/IMAP readers.
>
> You *do* want to pipe the new version of the message you've
> edited, right?
---------8<--------
Subject: [PATCH] v2writable: exact discontiguous history handling
We've always temporarily unindexed messages before reindexing
them again if there's discontiguous history.
This change improves the mechanism we use to prevent NNTP and
IMAP clients from seeing duplicate messages.
Previously, we relied on mapping Message-IDs to NNTP article
numbers to ensure clients would not see the same message twice.
This worked for most messages, but not for messages with
reused or duplicate Message-IDs.
Instead of relying on Message-IDs as a key, we now rely on the
git blob object ID for exact content matching. This allows
truly different messages to show up for NNTP|IMAP clients, while
still preventing those clients from seeing the message again.
---
lib/PublicInbox/V2Writable.pm | 21 ++++++++++++++-------
1 file changed, 14 insertions(+), 7 deletions(-)
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 459c7e86..54004fd7 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -888,12 +888,16 @@ sub index_oid { # cat_async callback
}
# {unindexed} is unlikely
- if ((my $unindexed = $arg->{unindexed}) && scalar(@$mids) == 1) {
- $num = delete($unindexed->{$mids->[0]});
+ if (my $unindexed = $arg->{unindexed}) {
+ my $oidbin = pack('H*', $oid);
+ my $u = $unindexed->{$oidbin};
+ ($num, $mid0) = splice(@$u, 0, 2) if $u;
if (defined $num) {
- $mid0 = $mids->[0];
$self->{mm}->mid_set($num, $mid0);
- delete($arg->{unindexed}) if !keys(%$unindexed);
+ if (scalar(@$u) == 0) { # done with current OID
+ delete $unindexed->{$oidbin};
+ delete($arg->{unindexed}) if !keys(%$unindexed);
+ }
}
}
if (!defined($num)) { # reuse if reindexing (or duplicates)
@@ -1160,10 +1164,13 @@ sub unindex_oid ($$;$) { # git->cat_async callback
warn "BUG: multiple articles linked to $oid\n",
join(',',sort keys %gone), "\n";
}
- foreach my $num (keys %gone) {
+ # reuse (num => mid) mapping in ascending numeric order
+ for my $num (sort { $a <=> $b } keys %gone) {
+ $num += 0;
if ($unindexed) {
my $mid0 = $mm->mid_for($num);
- $unindexed->{$mid0} = $num;
+ my $oidbin = pack('H*', $oid);
+ push @{$unindexed->{$oidbin}}, $num, $mid0;
}
$mm->num_delete($num);
}
@@ -1179,7 +1186,7 @@ sub git { $_[0]->{ibx}->git }
sub unindex_todo ($$$) {
my ($self, $sync, $unit) = @_;
my $unindex_range = delete($unit->{unindex_range}) // return;
- my $unindexed = $sync->{unindexed} //= {}; # $mid0 => $num
+ my $unindexed = $sync->{unindexed} //= {}; # $oidbin => [$num, $mid0]
my $before = scalar keys %$unindexed;
# order does not matter, here:
my $fh = $unit->{git}->popen(qw(log --raw -r --no-notes --no-color
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH] v2writable: exact discontiguous history handling
2021-01-05 1:29 ` [PATCH] v2writable: exact discontiguous history handling Eric Wong
@ 2021-01-09 22:21 ` Eric Wong
0 siblings, 0 replies; 9+ messages in thread
From: Eric Wong @ 2021-01-09 22:21 UTC (permalink / raw)
To: meta
Pushed as 392533147f50061d93cb9ed82abf98067dde5472
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2021-01-09 22:21 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-21 21:20 public-inbox + mlmmj best practices? Konstantin Ryabitsev
2020-12-21 21:39 ` Eric Wong
2020-12-22 6:28 ` Eric Wong
2020-12-28 16:22 ` Konstantin Ryabitsev
2020-12-28 21:31 ` Eric Wong
2021-01-04 20:12 ` Konstantin Ryabitsev
2021-01-05 1:06 ` Eric Wong
2021-01-05 1:29 ` [PATCH] v2writable: exact discontiguous history handling Eric Wong
2021-01-09 22:21 ` Eric Wong
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).