* Usage of public-inbox with maildirs
@ 2019-03-20 22:28 Ralf Ramsauer
2019-03-21 3:35 ` Eric Wong
0 siblings, 1 reply; 2+ messages in thread
From: Ralf Ramsauer @ 2019-03-20 22:28 UTC
To: meta; +Cc: Lukas Bulwahn
Hi,
we want to archive a fair number of mailing lists (~160 lists) with
public-inbox.
Therefore, we subscribed to all of those lists with a single email
address. Mails are periodically fetched and stored in a local maildir
via IMAP. Mails are currently not pre-filtered or sorted; all of them
end up in a single maildir.
So every [publicinbox] config entry has the same 'watch' entry for the
maildir, but each has its own watchheader to match a different
list.
Is this the intended way to use public-inbox, or should we rather place
mails from different lists in different maildirs before processing them
with public-inbox?
Secondly, I wrote a script that automatically creates the
public-inbox config together with empty, bare git repositories for every
list.
A config entry looks like:
[publicinbox "listid"]
	address = post@listid.org
	mainrepo = /path/to/repo
	watch = maildir:/path/to/maildir
	watchheader = List-Id:<listid>
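A minimal sketch of how one such stanza per list could be generated, assuming plain POSIX sh. The function name emit_inbox_stanza and its argument order are hypothetical and only mirror the keys shown above; this is not the script mentioned in this mail.

```shell
#!/bin/sh
# Sketch only: print one [publicinbox] stanza for a single list.
# emit_inbox_stanza is a hypothetical helper, not part of public-inbox.
# Arguments: NAME ADDRESS REPO MAILDIR LIST_ID
emit_inbox_stanza() {
    name=$1
    address=$2
    repo=$3
    maildir=$4
    list_id=$5
    printf '[publicinbox "%s"]\n' "$name"
    printf '\taddress = %s\n' "$address"
    printf '\tmainrepo = %s\n' "$repo"
    # Every list watches the same Maildir ...
    printf '\twatch = maildir:%s\n' "$maildir"
    # ... and is told apart by its List-Id header.
    printf '\twatchheader = List-Id:<%s>\n' "$list_id"
}
```

Calling it once per list and appending the output to the config file (for example in a loop over a file of list names) would produce ~160 stanzas that differ only in name, address, repo and List-Id.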
Our maildir currently contains ~120k mails for the initial import, and
this raised some new questions:
1. It appears that the initial import with public-inbox-watch is very
slow. After stracing the perl script, it looks like
public-inbox-watch lstats every single mail. After an hour of not
inserting any mail into a repo, I canceled the process and restarted
it on a smaller initial subset. This works better, but is still slow.
(~4k mails in 10 minutes, feels like constantly getting slower)
If public-inbox-watch is restarted for some reason (e.g., system
reboot), will it stat every single mail again on startup?
IOW, should old mails be removed from the maildir, and will they
hurt performance if they stay? Is there a way to automatically delete
processed mails?
2. public-inbox-watch seems to fill the repositories with the 'old' v1
layout, and I don't know how to switch to v2. Is there a config
parameter for that?
I found the v1-v2 convert script, but I'd like to directly initialise
it with the newer version, if possible.
3. On the initial import, public-inbox-watch seems to randomly insert
mails into repositories. In the end, coverage matters more than
hierarchy, but is there a way to do the initial import sorted by
date?
Thanks a lot!
Ralf
* Re: Usage of public-inbox with maildirs
2019-03-20 22:28 Usage of public-inbox with maildirs Ralf Ramsauer
@ 2019-03-21 3:35 ` Eric Wong
0 siblings, 0 replies; 2+ messages in thread
From: Eric Wong @ 2019-03-21 3:35 UTC
To: Ralf Ramsauer; +Cc: meta, Lukas Bulwahn
Ralf Ramsauer <ralf.ramsauer@oth-regensburg.de> wrote:
> Hi,
>
> we want to archive a fair number of mailing lists (~160 lists) with
> public-inbox.
>
> Therefore, we subscribed to all of those lists with a single email
> address. Mails are periodically fetched and stored in a local maildir
> via IMAP. Mails are currently not pre-filtered or sorted; all of them
> end up in a single maildir.
>
> So every [publicinbox] config entry has the same 'watch' entry for the
> maildir, but each has its own watchheader to match a different
> list.
>
> Is this the intended way to use public-inbox, or should we rather place
> mails from different lists in different maildirs before processing them
> with public-inbox?
Yes, sharing a single Maildir has been supported since this year:
commit ed3b90b7a203fe5513894d01d478f6104cdff897
Date: Sat Jan 5 00:35:42 2019 +0000
("watchmaildir: support multiple inboxes in the same Maildir")
Sorry, haven't hit a good point to make a release, and I'm not too
good at release management :<
> Secondly, I wrote a script that automatically creates the
> public-inbox config together with empty, bare git repositories for every
> list.
>
> A config entry looks like:
>
> [publicinbox "listid"]
> 	address = post@listid.org
> 	mainrepo = /path/to/repo
> 	watch = maildir:/path/to/maildir
> 	watchheader = List-Id:<listid>
All looks fine to me.
> Our maildir currently contains ~120k mails for the initial import, and
> this raised some new questions:
>
> 1. It appears that the initial import with public-inbox-watch is very
> slow. After stracing the perl script, it looks like
> public-inbox-watch lstats every single mail. After an hour of not
> inserting any mail into a repo, I canceled the process and restarted
> it on a smaller initial subset. This works better, but is still slow.
> (~4k mails in 10 minutes, feels like constantly getting slower)
v1 gets slower as repositories get bigger. v2 is barely
affected by that. Are you sure it wasn't importing? The
fast-import processes may not be writing out frequently enough.
> If public-inbox-watch is restarted for some reason (e.g., system
> reboot), will it stat every single mail again on startup?
Yes. However, the scan runs at a low priority compared to
freshly-arrived mail if you have the Linux::Inotify2 module
installed for Filesys::Notify::Simple to use.
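One way to check whether that module is available, as a rough sketch: it assumes the perl found on PATH is the same one public-inbox-watch runs under, and have_perl_module is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: succeed if the given Perl module can be loaded.
# have_perl_module is a hypothetical helper, not part of public-inbox.
have_perl_module() {
    # -M loads the module; "-e 1" is a no-op program.
    perl -M"$1" -e 1 2>/dev/null
}

if have_perl_module Linux::Inotify2; then
    echo "Linux::Inotify2 is available"
else
    echo "Linux::Inotify2 is not available to this perl"
fi
```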
> IOW, should old mails be removed from the maildir, and will they
> hurt performance if they stay? Is there a way to automatically delete
> processed mails?
Yes, old mails should be removed.
I have a cronjob doing something like:
find $MAILDIR -ctime +$AGE_DAYS -type f | xargs rm -f
AGE_DAYS can be whatever you're comfortable with.
Fwiw, I run public-inbox-watch and the find|rm cronjob as
different users, so public-inbox-watch can rely on read-only
access to a Maildir while rm(1) (obviously) needs write access
to the Maildir.
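A slightly more defensive variant of that cleanup, as a sketch rather than the actual cronjob: it quotes the variables and swaps the xargs pipe for find's POSIX "-exec ... {} +", so unusual filenames cannot split the argument list. The function name prune_maildir is hypothetical.

```shell
#!/bin/sh
# Sketch only: delete regular files under a Maildir whose ctime is
# more than AGE_DAYS days old.
# prune_maildir is a hypothetical helper, not part of public-inbox.
# Arguments: MAILDIR AGE_DAYS
prune_maildir() {
    maildir=$1
    age_days=$2
    # Refuse to run against a missing directory.
    [ -d "$maildir" ] || return 1
    # "{} +" batches many paths into each rm invocation.
    find "$maildir" -type f -ctime +"$age_days" -exec rm -f {} +
}
```

From cron this could be invoked as, say, prune_maildir "$MAILDIR" "$AGE_DAYS" under the user that has write access to the Maildir.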
> 2. public-inbox-watch seems to fill the repositories with the 'old' v1
> layout, and I don't know how to switch to v2. Is there a config
> parameter for that?
>
> I found the v1-v2 convert script, but I'd like to directly initialise
> it with the newer version, if possible.
Use "-V2" with public-inbox-init.
Perhaps it could become the default iff SQLite+Xapian are
installed.
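For illustration, a sketch of a per-list wrapper around that command. The argument order NAME INBOX_DIR HTTP_URL ADDRESS is how I understand the public-inbox-init usage; please verify it against public-inbox-init(1) for your version. The function name init_v2_inbox, the DRY_RUN variable and the example values below are assumptions, not anything public-inbox provides.

```shell
#!/bin/sh
# Sketch only: initialise one inbox in the v2 layout.
# init_v2_inbox and DRY_RUN are hypothetical; only "-V2" comes from
# this thread. Argument order is an assumption, see public-inbox-init(1).
# Arguments: NAME INBOX_DIR HTTP_URL ADDRESS
init_v2_inbox() {
    name=$1
    inbox_dir=$2
    http_url=$3
    address=$4
    # With DRY_RUN set, print the command instead of running it.
    ${DRY_RUN:+echo} public-inbox-init -V2 \
        "$name" "$inbox_dir" "$http_url" "$address"
}
```

For example, with DRY_RUN=1 set, init_v2_inbox listid /path/to/repo http://example.com/listid post@listid.org only prints the command it would run, which makes it easy to review ~160 invocations before creating anything.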
> 3. On the initial import, public-inbox-watch seems to randomly insert
> mails into repositories. In the end, coverage matters more than
> hierarchy, but is there a way to do the initial import sorted by
> date?
You can use (or derive from) scripts/import_vger_from_mbox if you
have sorted mboxes.
The main benefit of sorting would be to ensure NNTP article
numbers roughly match the dates. Otherwise, the HTTP interface
won't care about ordering.
I suppose you could import the first time into a throwaway inbox,
fetch http://$HOST/$INBOX/all.mbox.gz,
and zcat the result of that to scripts/import_vger_from_mbox.
> Thanks a lot!
no prob :>