unofficial mirror of meta@public-inbox.org
* publicinbox watch path globbing
@ 2023-11-19 23:13 Robin H. Johnson
  2023-11-20  0:10 ` Eric Wong
  0 siblings, 1 reply; 4+ messages in thread
From: Robin H. Johnson @ 2023-11-19 23:13 UTC (permalink / raw)
  To: meta

Hi!

Writing to see about work on converting Gentoo's other (now-broken)
archives web interface over to public-inbox instead.

This is the first of a few questions/bumps along the way.

For historical reasons on the scaling side, the archive maildirs are
stored by date:
watch = maildir:$REDACTED/$LISTNAME/.200001/
watch = maildir:$REDACTED/$LISTNAME/.200102/
watch = maildir:$REDACTED/$LISTNAME/.YYYYMM/
watch = maildir:$REDACTED/$LISTNAME/.202311/
etc.
(over time, directories are moved to stable read-only storage)

If a given low-traffic list does NOT get traffic in a given month,
the directory does not exist (it's created when the first mail arrives
during a calendar month).

Multiply this by ~120 lists, and it gets on the large side for a config
file: 7500+ lines just for the "watch" entries.

While I could generate the config file, I'm wondering about a better
solution: allowing globs in the path.
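
For comparison, the "generate the config file" route can be sketched in a
few lines of POSIX shell. This is only an illustration: gen_watch_lines is
a hypothetical helper name, and the base directory and list name are
whatever the caller passes in. It prints one watch line per YYYYMM
directory that currently exists.

```shell
#!/bin/sh
# Hypothetical helper (not part of public-inbox): print a publicinbox
# section with one "watch" line per existing .YYYYMM maildir.
gen_watch_lines() {
	base=$1 list=$2
	printf '[publicinbox "%s"]\n' "$list"
	# the trailing slash restricts the glob to directories
	for d in "$base/$list"/.[0-9][0-9][0-9][0-9][0-9][0-9]/
	do
		# an unmatched glob stays literal in POSIX sh, so skip it
		[ -d "$d" ] || continue
		printf '\twatch = maildir:%s\n' "$d"
	done
}
```

Months without traffic are skipped because their directories do not exist,
but the output has to be regenerated whenever a new month's directory
appears, which is the maintenance cost a glob would avoid.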

I tried to locate a single place in the codebase where this would be
applied, but it's not clear enough to me if there's a single place where
it can easily be modified.

If there's a consistent place, I think the cleanest syntax that doesn't
break existing consumers would be something like this:
[publicinbox "$LISTNAME"]
watch = maildirglob:$REDACTED/$LISTNAME/.19????/
watch = maildirglob:$REDACTED/$LISTNAME/.20????/

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation President & Treasurer
E-Mail   : robbat2@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136

* Re: publicinbox watch path globbing
  2023-11-19 23:13 publicinbox watch path globbing Robin H. Johnson
@ 2023-11-20  0:10 ` Eric Wong
  2023-11-20  0:16   ` Robin H. Johnson
  0 siblings, 1 reply; 4+ messages in thread
From: Eric Wong @ 2023-11-20  0:10 UTC (permalink / raw)
  To: Robin H. Johnson; +Cc: meta

"Robin H. Johnson" <robbat2@gentoo.org> wrote:
> Hi!
> 
> Writing to see about work in converting Gentoo's (now-broken) other
> archives web interface over into using public-inbox instead.
> 
> This is the first of a few questions/bumps along the way.
> 
> For historical reasons on the scaling side, the archive maildirs are
> stored by date:
> watch = maildir:$REDACTED/$LISTNAME/.200001/
> watch = maildir:$REDACTED/$LISTNAME/.200102/
> watch = maildir:$REDACTED/$LISTNAME/.YYYYMM/
> watch = maildir:$REDACTED/$LISTNAME/.202311/
> etc.
> (over time, directories are moved to stable read-only storage)

Is there any reason to expect new messages to appear in the /.2000??/
and other old directories?

IOW, if somebody with a broken clock sends a message from a past
year/month in the Date: header, does it end up in an old bucket
or the current one?

If your old buckets are frozen, lei in public-inbox.git should be
able to start them off with:

	for d in "$REDACTED/$LISTNAME"/.??????
	do
		lei convert -o "v2:/path/to/inbox-$LISTNAME" "maildir:$d"
	done
	lei daemon-kill # optional, stops lei-daemon when done

And then you'd only have to watch the latest maildir.
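
A slightly more defensive sketch of that loop, as an illustration only:
convert_frozen and the LEI override variable are hypothetical names, and
skipping the newest bucket (so that it is left to -watch) is a design
choice of this sketch, not a requirement. The `lei convert` invocation is
the same one as in the loop above.

```shell
#!/bin/sh
# Hypothetical wrapper: convert every .YYYYMM bucket except the newest.
# LEI can be overridden (e.g. LEI=echo) to preview the commands.
convert_frozen() {
	base=$1 list=$2 inbox=$3
	newest=
	# glob results are sorted, so the last match is the newest bucket
	for d in "$base/$list"/.[0-9][0-9][0-9][0-9][0-9][0-9]
	do
		if [ -d "$d" ]; then newest=$d; fi
	done
	for d in "$base/$list"/.[0-9][0-9][0-9][0-9][0-9][0-9]
	do
		if [ ! -d "$d" ] || [ "$d" = "$newest" ]; then continue; fi
		${LEI:-lei} convert -o "v2:$inbox" "maildir:$d"
	done
}
```

Running it with LEI=echo first prints the commands without touching the
inbox.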

I'll try to get public-inbox 2.0 released soon[1]; but the lei convert
stuff should be ready.

> If a given low-traffic list does NOT get traffic in a given month,
> the directory does not exist (it's created when the first mail arrives
> during a calendar month).
> 
> Multiply this by ~120 lists, and it gets on the large side for a config
> file: 7500+ lines just for the "watch" entries.

I agree that sucks.

> While I could generate the config file, I'm wondering about a better
> solution: allowing globs in the path.

I wanted to have recursive watches at some point but never got
around to it.  So I guess something like this could work recursively:

	watchglob = maildir:$REDACTED/$LISTNAME/**

> I tried to locate a single place in the codebase where this would be
> applied, but it's not clear enough to me if there's a single place where
> it can easily be modified.

The `new' sub in lib/PublicInbox/Watch.pm sets up maildirs/imap/nntp

The glob2re function is better nowadays in public-inbox.git,
and the mdre regexp will probably need to be updated when it sees
a new maildir...

> If there's a consistent place, I think the cleanest syntax that doesn't
> break existing consumers would be something like this:
> [publicinbox "$LISTNAME"]
> watch = maildirglob:$REDACTED/$LISTNAME/.19????/
> watch = maildirglob:$REDACTED/$LISTNAME/.20????/

I think `watchglob = maildir:...' is preferable since I don't
want maildirglob: to be confused as a type.

[1] the release is mainly blocked on me trying to wrap my head around -cindex

* Re: publicinbox watch path globbing
  2023-11-20  0:10 ` Eric Wong
@ 2023-11-20  0:16   ` Robin H. Johnson
  2023-11-20  1:20     ` Eric Wong
  0 siblings, 1 reply; 4+ messages in thread
From: Robin H. Johnson @ 2023-11-20  0:16 UTC (permalink / raw)
  To: Eric Wong; +Cc: Robin H. Johnson, meta

On Mon, Nov 20, 2023 at 12:10:01AM +0000, Eric Wong wrote:
> "Robin H. Johnson" <robbat2@gentoo.org> wrote:
> > Hi!
> > 
> > Writing to see about work in converting Gentoo's (now-broken) other
> > archives web interface over into using public-inbox instead.
> > 
> > This is the first of a few questions/bumps along the way.
> > 
> > For historical reasons on the scaling side, the archive maildirs are
> > stored by date:
> > watch = maildir:$REDACTED/$LISTNAME/.200001/
> > watch = maildir:$REDACTED/$LISTNAME/.200102/
> > watch = maildir:$REDACTED/$LISTNAME/.YYYYMM/
> > watch = maildir:$REDACTED/$LISTNAME/.202311/
> > etc.
> > (over time, directories are moved to stable read-only storage)
> 
> Is there any reason to expect new messages to appear in the /.2000??/
> and other old directories?
> 
> IOW, if somebody with a broken clock sends a message from a past
> year/month in the Date: header, does it end up in an old bucket
> or the current one?
The date is based on arrival time at the archive ingest.

For some of the very old lists, we do have a list of message-ids that we
know existed but aren't captured in the archive, and those mails have
been added to the old locations if they are ever found (maybe once a
year).

> 
> If your old buckets are frozen, lei in public-inbox.git should be
> able to start them off with:
> 
> 	for d in $REDACTED/$LISTNAME/.??????
> 	do
> 		lei convert -o v2:/path/to/inbox-$LISTNAME maildir:$d
> 	done
> 	lei daemon-kill # optional, stops lei-daemon when done
> 
> And then you'd only have to watch the latest maildir.
Any concerns during the month rollover period?
E.g. making sure the 202310 & 202311 are both watched right as time
increments from October to November, because the archive ingest is
likely to write to 202311, but it's possible that public-inbox still
has to run for the last few new messages in 202310?

> > While I could generate the config file, I'm wondering about a better
> > solution: allowing globs in the path.
> 
> I wanted to have recursive watches at some point but never got
> around to it.  So I guess something like this could work recursively:
> 	watchglob = maildir:$REDACTED/$LISTNAME/**
> 
> > I tried to locate a single place in the codebase where this would be
> > applied, but it's not clear enough to me if there's a single place where
> > it can easily be modified.
> 
> The `new' sub in lib/PublicInbox/Watch.pm sets up maildirs/imap/nntp
> 
> The glob2re function is better nowadays in public-inbox.git,
> and the mdre regexp will probably need to be updated when it sees
> a new maildir...
Thanks. I'd want to explicitly scope the glob to the dates.
The spam processing has been to move spam to .spam.YYYYMM.
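
To illustrate the scoping, here is a small shell sketch using the two
patterns proposed earlier in the thread. It relies on the shell's own
pattern matching, which is only an approximation of what a future
watchglob implementation might do, and is_date_bucket is a hypothetical
name.

```shell
#!/bin/sh
# Illustration only: succeed for date buckets such as .202311 and fail
# for anything else, including .spam.YYYYMM.
is_date_bucket() {
	case $1 in
	.19????|.20????) return 0 ;;
	*) return 1 ;;
	esac
}

for name in .199912 .202311 .spam.202311
do
	if is_date_bucket "$name"
	then
		echo "$name: watched"
	else
		echo "$name: skipped"
	fi
done
```

A bare `**` would also pick up the spam directories, whereas the scoped
patterns leave them out.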

> > If there's a consistent place, I think the cleanest syntax that doesn't
> > break existing consumers would be something like this:
> > [publicinbox "$LISTNAME"]
> > watch = maildirglob:$REDACTED/$LISTNAME/.19????/
> > watch = maildirglob:$REDACTED/$LISTNAME/.20????/
> 
> I think `watchglob = maildir:...' is preferable since I don't
> want maildirglob: to be confused as a type.
Agreed, I see concerns there.

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation President & Treasurer
E-Mail   : robbat2@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136

* Re: publicinbox watch path globbing
  2023-11-20  0:16   ` Robin H. Johnson
@ 2023-11-20  1:20     ` Eric Wong
  0 siblings, 0 replies; 4+ messages in thread
From: Eric Wong @ 2023-11-20  1:20 UTC (permalink / raw)
  To: Robin H. Johnson; +Cc: meta

"Robin H. Johnson" <robbat2@gentoo.org> wrote:
> The date is based on arrival time at the archive ingest.
> 
> For some of the very old lists, we do have a list of message-ids that we
> know existed but aren't captured in the archive, and those mails have
> been added to the old locations if they are ever found (maybe once a
> year).

Yeah, it's fine to run `lei convert' repeatedly on the same
Maildirs when outputting to a v2 public-inbox since it enforces dedupe.
You won't end up with duplicates in the archives (unless there are
list-added footers/subjects that change).

> E.g. making sure the 202310 & 202311 are both watched right as time
> increments from October to November, because the archive ingest is
> likely to write to 202311, but it's possible that public-inbox still
> has to run for the last few new messages in 202310?

Yeah, it's fine to keep watch on the last two months (or
whatever number). But -watch will also import not-yet-imported
messages if you're late in configuring watch on a new month (though
the archives would be out-of-date until then).

Also, `lei convert' is idempotent and respects all v2 locking so
it won't trample -watch and vice-versa.
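
A minimal sketch of the "last two months" idea in POSIX shell, assuming
the .YYYYMM layout described earlier in the thread. prev_month and
watch_lines_for are hypothetical helper names, and the base path in the
usage line is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: print the YYYYMM that precedes the given YYYYMM.
prev_month() {
	ym=$1
	y=${ym%??}	# first four characters: year
	m=${ym#????}	# last two characters: month
	m=${m#0}	# drop a leading zero so "08" is not parsed as octal
	if [ "$m" -eq 1 ]
	then
		y=$((y - 1))
		m=12
	else
		m=$((m - 1))
	fi
	printf '%04d%02d\n' "$y" "$m"
}

# Hypothetical helper: print watch lines for the previous and current month.
watch_lines_for() {
	base=$1 list=$2 cur=$3
	for bucket in "$(prev_month "$cur")" "$cur"
	do
		printf 'watch = maildir:%s/%s/.%s/\n' "$base" "$list" "$bucket"
	done
}
```

Something like `watch_lines_for /path/to/archives "$LISTNAME" "$(date +%Y%m)"`
would then cover the rollover window, including the December to January
boundary.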
