From: Eric Wong <e@80x24.org>
To: "Nicolás Ojeda Bär" <n.oje.bar@gmail.com>
Cc: meta@public-inbox.org
Subject: Re: Relationship between public-inbox and ssoma?
Date: Mon, 5 Mar 2018 17:50:07 +0000
Message-ID: <20180305175007.GA19007@whir>
In-Reply-To: <CAPunWhD_BKT0QgpL5Z=jdMTaV7nE=uK_-JSt1Or2=u6U+wk4Fg@mail.gmail.com>
Nicolás Ojeda Bär <n.oje.bar@gmail.com> wrote:
> Hello Eric,
>
> Thanks for the prompt reply. I am trying to migrate a long-lived
> mailing list (65k messages over 26 years), below are some
> troubles/questions I am having;
> any suggestions would be greatly appreciated.
>
> - public-inbox-watch seems to struggle with very big maildirs; for now
> I am moving the data into the maildir a little at a time and that
> seems to work. Is there a particular obstacle
> to making the importing process more incremental?
Do you know if it's SpamAssassin being slow?
I disable network checks for large imports in ~/.spamassassin/user_prefs
(if I'm using SA at all during the imports):
# uncomment the following for importing archives:
# dns_available no
# skip_rbl_checks 1
# skip_uribl_checks 1
FWIW, large directories are a performance killer in any
application. Seek times and cache overhead are at least two of
the problems, so an SSD will definitely help, and shorter
filenames might help too.
I usually prefer one-off scripts like
scripts/import_vger_from_mbox for initial imports and store
large archives in compressed mboxes instead of Maildir. Lack of
mbox support is one reason I never used notmuch despite studying
it.
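
As an illustration of the one-off import approach, here is a minimal
Python sketch that streams messages out of a gzip-compressed mbox one at
a time, so the whole archive never has to sit in one large directory.
It is not scripts/import_vger_from_mbox (which is Perl); the function
name iter_mbox_messages is hypothetical, and it assumes body lines
starting with "From " were escaped as ">From " when the mbox was written.

```python
import gzip
from email import message_from_bytes


def iter_mbox_messages(path):
    """Yield one parsed message at a time from a gzip-compressed mbox.

    Assumption: any line starting with "From " is a message separator,
    i.e. body lines were escaped as ">From " (mboxo/mboxrd style).
    """
    buf = []
    with gzip.open(path, "rb") as fh:
        for line in fh:
            # A separator line closes the previous message, if any.
            if line.startswith(b"From ") and buf:
                yield message_from_bytes(b"".join(buf))
                buf = []
            buf.append(line)
    # Flush the last message in the file.
    if buf:
        yield message_from_bytes(b"".join(buf))
```

Each yielded message can then be handed to whatever import step is in
use, in batches as small as needed.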
> - Trouble due to missing/malformed headers (mostly on very old
> messages). For example, here is the header of a message that trips
> public-inbox-watch:
>
> From weis@margaux Fri Nov 27 16:24:50 1992
> Received: by margaux.inria.fr, Fri, 27 Nov 92 16:24:50 +0100
> Message-ID: <9211271524.AA29971@margaux.inria.fr>
> To: caml-list@margaux
> Sender: weis@margaux
> Status: O
>
> The error is: fatal: Invalid rfc2822 date "" in ident: <> (I guess
> due to the lack of a Date: field). I added a Date: field just to test
> and noticed that Author: in the git commit was empty, I guess due to the
> use of Sender: rather than From: header.
I have a patch in the wings to use the Received: date:
https://public-inbox.org/meta/20180215110840.30413-16-e@80x24.org/raw
And I'm thinking about favoring Received: over Date: if both
exist, since Date: headers are more often wrong...
> Do you think it is feasible to improve public-inbox-watch to try to
> extract the date from some other header like above?
> and to use Sender: when From: is not found?
Sure, I suppose falling back to Sender is correct if From is
missing.
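
The two fallbacks discussed here can be sketched in Python as below.
This is not the patch linked above; the helper names are hypothetical,
and the choice to try Date: first and then the topmost Received: header
is an assumption (the reverse preference is also being considered
above). The regex is deliberately loose because the old header quoted
in this thread has no ";" before the date.

```python
import re
from email import message_from_string
from email.utils import parsedate_to_datetime

# Loose match for "27 Nov 92 16:24:50 +0100" style dates inside Received:.
_DATE_RE = re.compile(
    r"\d{1,2}\s+[A-Za-z]{3}\s+\d{2,4}\s+\d{1,2}:\d{2}(?::\d{2})?(?:\s+[+-]\d{4})?"
)


def _parse(text):
    """Return a datetime, or None if the text is not a parseable date."""
    try:
        return parsedate_to_datetime(text)
    except (TypeError, ValueError):
        return None


def message_date(msg):
    """Use Date: if it parses, else the first Received: header with a date."""
    date = _parse(msg.get("Date", ""))
    if date is not None:
        return date
    for received in msg.get_all("Received", []):
        match = _DATE_RE.search(received)
        if match:
            date = _parse(match.group(0))
            if date is not None:
                return date
    return None


def message_author(msg):
    """Use From: if present and non-empty, else fall back to Sender:."""
    return msg.get("From") or msg.get("Sender")
```

Applied to the 1992 header quoted above, message_date() finds the date
in Received: and message_author() returns the Sender: value.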
> - There are some messages that do not have Message-Id, but
> public-inbox-watch seems to be able to handle them.
Yes, we generate a Message-ID if one is missing.
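
For illustration only, one way to generate a missing Message-ID is to
derive it from a digest of the raw message, so that importing the same
message twice yields the same ID. This Python sketch is an assumption
about a reasonable scheme, not a description of what public-inbox
does; the function name and the "generated.invalid" domain are
hypothetical.

```python
import hashlib
from email import message_from_bytes


def ensure_message_id(raw, domain="generated.invalid"):
    """Parse raw bytes and add a Message-ID only if none is present.

    The generated ID is a SHA-1 of the raw message, so it is stable
    across repeated imports of identical input (an assumed design goal).
    """
    msg = message_from_bytes(raw)
    # Header lookup is case-insensitive, so "Message-Id" is found too.
    if not msg.get("Message-ID"):
        digest = hashlib.sha1(raw).hexdigest()
        msg["Message-ID"] = f"<{digest}@{domain}>"
    return msg
```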
> Is it the case that Date: is the only header that is absolutely
> necessary for public-inbox-watch to process the message?
Probably none of them are, actually.
> - Does public-inbox-watch ever modify the message data?
Generating a missing Message-ID is one modification.
Status, Lines, Bytes, Content-Length, and @BAD_HEADERS in
lib/PublicInbox/MDA.pm are all dropped:
    our @BAD_HEADERS = (
        # postfix
        qw(delivered-to x-original-to), # prevent training loops

        # The rest are taken from Mailman 2.1.15:
        # could contain passwords:
        qw(approved approve x-approved x-approve urgent),
        # could be used phishing:
        qw(return-receipt-to disposition-notification-to x-confirm-reading-to),
        # Pegasus mail:
        qw(x-pmrqc)
    );
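
To preview what dropping these headers does to a given message, the
same list can be applied with a short Python sketch. This is not the
code in lib/PublicInbox/MDA.pm; the scrub_headers name is hypothetical,
and the header names are copied from the list and sentence above.

```python
from email import message_from_string

# Copied from @BAD_HEADERS above.
BAD_HEADERS = (
    "delivered-to", "x-original-to",
    "approved", "approve", "x-approved", "x-approve", "urgent",
    "return-receipt-to", "disposition-notification-to",
    "x-confirm-reading-to",
    "x-pmrqc",
)
# Also dropped, per the sentence above the list.
MBOX_HEADERS = ("status", "lines", "bytes", "content-length")


def scrub_headers(msg):
    """Delete every occurrence of the listed headers, in place.

    Deleting by name is case-insensitive and is a no-op when the
    header is absent, so no existence check is needed.
    """
    for name in BAD_HEADERS + MBOX_HEADERS:
        del msg[name]
    return msg
```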
Email::MIME might modify invalid characters in the headers (or
alter them in other ways if there are bugs in Email::MIME). I
don't think bodies are modified outside of the
not-really-documented PublicInbox::Filter API. You can check out
some filters at lib/PublicInbox/Filter/*.pm (some commit messages
document them, but I don't think there are manpages yet).
> - In general public-inbox-watch prints very little about what it is
> doing, which makes it hard(er) to trace problems; a verbose flag would
> be a nice addition, I think.
I usually use strace on Linux to track down problems. I'm not
sure it's worth the effort to introduce new options/features
if generic tracing utilities are more detailed and accurate.
Also, I'm going to be mostly offline for about a week starting
tomorrow; so don't expect prompt replies for a bit.