From: Pieter Praet <pieter@praet.org>
To: Mueen Nawaz <mueen@nawaz.org>, notmuch@notmuchmail.org
Subject: Re: Questions about importing mail (mbox)
Date: Sat, 16 Apr 2011 20:43:22 +0200
Message-ID: <87bp06m3zp.fsf@A7GMS.i-did-not-set--mail-host-address--so-tickle-me>
In-Reply-To: <87hbavlxoa.fsf@fester.com>
On Mon, 21 Mar 2011 19:02:45 -0700, Mueen Nawaz <mueen@nawaz.org> wrote:
> I think you misunderstood me. A part of me suspects this has something
> to do with my not explaining myself, but who's to say?<G>
Same here, apparently :D
> I'm experimenting with notmuch, and if I can translate everything I
> currently do in mutt to notmuch, then I'll just dump mutt. The set of
> mboxes I have will remain archived, but for all future incoming email,
> I'll switch to MH or MailDir. So I don't actually need to put my old
> mboxes under revision control - I just need to save them somewhere.
I strongly agree that long-term storage choices are a matter of personal
opinion; however, the intention of my proposal was simply to keep
track of what changed in the mbox as a result of the various operations
performed, so as to gain insight into what gets messed up and where.
Non-VCS would be something along the lines of:
    compact mbox.orig > mbox.comp    # (*if* "compact" were a valid command)
    diff mbox.orig mbox.comp
    mb2md -s ./mbox.comp -d ./maildir
    cat ./maildir/new/* >> mbox.conv
    diff mbox.comp mbox.conv
> > For the actual conversion to Maildir (and any type of mail fetching in
> > general), I'd suggest using FDM [2], you'll never look back.
>
> Thanks - will take a look.
>
> > Regarding the significant discrepancy between processed and added files
> > in Notmuch: Could be dupes (e.g. mail to/cc/bcc yourself or mailing
> > lists, ending up in both Inbox and Sent), which are automatically
> > suppressed by Notmuch.
>
> It definitely was dupes. I didn't realize that notmuch did not keep
> track of dupes.
>
> So I wrote a Python script to go through the mboxes and do a count of
> only unique messages. Problem? I have over 1000 emails that don't have a
> Message-ID header (case invariant search). I could go over why that is,
> but suffice it to say that I hate Microsoft.<G>
>
> Once I remove all dupes, I get to within 300-400 of the count that
> notmuch provides. The remaining 1000+ emails do contain some dupes, and
> I can't find a convenient way to get an accurate count of unique emails
> from them, but at least now I'm in the ballpark, and a lot more
> confident.
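For anyone wanting to repeat that kind of count, here is a minimal sketch
using Python's standard "mailbox" module. It is not your script, just an
illustration; the function name count_unique is made up, and messages
without a Message-ID are simply tallied separately rather than deduplicated.

```python
import mailbox


def count_unique(mbox_path):
    """Return (unique_message_ids, messages_without_id) for one mbox file.

    Hypothetical helper for illustration only.
    """
    seen = set()
    missing = 0
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            # Header lookup on email messages is case-insensitive, so this
            # also finds "message-id:" or "MESSAGE-ID:".
            msg_id = msg.get("Message-ID")
            if msg_id is None or not msg_id.strip():
                missing += 1
            else:
                seen.add(msg_id.strip())
    finally:
        box.close()
    return len(seen), missing
```

Note that the mailbox module itself starts a new message at every line
beginning with "From ", so the count is only as good as the escaping in
the mbox being read.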
Sadly, both mb2md and fdm *will* mess things up, since they both split
on every single occurrence of "^From " [1,2], even if it isn't a
separator line.
Both assume occurrences of "^From " in the message body to be already
escaped like so: "^>From " [3,4].
Even worse, RFC 4155 [5] confirms this to be semi-expected behaviour:
>> Many implementations are also known to escape message body lines that
>> begin with the character sequence of "From ", so as to prevent
>> confusion with overly-liberal parsers that do not search for full
>> separator lines. In the common case, a leading Greater-Than symbol
>> (0x3E) is used for this purpose (with "From " becoming ">From ").
>> However, other implementations are known not to escape such lines
>> unless they are immediately preceded by a blank line or if they also
>> appear to contain an email address and a timestamp. Other
>> implementations are also known to perform secondary escapes against
>> these lines if they are already escaped or quoted, while others
>> ignore these mechanisms altogether.
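To make the failure mode concrete, here is a small Python sketch. It is
my own illustration, not code from mb2md or fdm. It compares a splitter
that breaks on every "^From " with one that only accepts a full separator
line ("From", an address, an asctime-style timestamp) at the start of the
file or after a blank line. The regex and the function names are
assumptions made for the example.

```python
import re

# Assumption: a full separator line is "From <address> <asctime timestamp>".
SEPARATOR = re.compile(
    r"^From \S+ +(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun) "
    r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) "
    r"[ \d]\d \d\d:\d\d:\d\d \d{4}$"
)


def _split(text, is_separator):
    """Group lines into messages; is_separator decides where one starts."""
    messages, current, prev_blank = [], None, True
    for line in text.splitlines():
        if is_separator(line, prev_blank):
            if current is not None:
                messages.append(current)
            current = [line]
        elif current is not None:
            current.append(line)
        prev_blank = line == ""
    if current is not None:
        messages.append(current)
    return ["\n".join(m) for m in messages]


def split_naive(text):
    """Start a new message at every line beginning with 'From '."""
    return _split(text, lambda line, prev_blank: line.startswith("From "))


def split_strict(text):
    """Start a new message only at a full separator line that opens the
    file or follows a blank line."""
    return _split(
        text,
        lambda line, prev_blank: prev_blank and SEPARATOR.match(line) is not None,
    )


SAMPLE = (
    "From pieter@example.org Sat Apr 16 20:43:22 2011\n"
    "Subject: test\n"
    "\n"
    "Hello,\n"
    "From now on I will escape my mboxes.\n"
    "\n"
    "From mueen@example.org Mon Mar 21 19:02:45 2011\n"
    "Subject: second\n"
    "\n"
    "Bye\n"
)

print(len(split_naive(SAMPLE)), len(split_strict(SAMPLE)))  # prints "3 2"
```

The naive splitter turns the unescaped body line into a bogus third
message; the stricter one keeps it inside the first message. The strict
check is still only a heuristic, since a body line can look exactly like
a separator.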
One way to circumvent this is to make use of the Content-Length header
(which is apparently how Mutt does it [6]), but guess what, it suffers
the same fate as Message-ID: plenty of messages simply don't have one.
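For what it's worth, a Content-Length-aware splitter could look roughly
like the sketch below. This is not Mutt's code; the function names are
made up, and the fallback for messages without a usable Content-Length
(split at the next "From " line that follows a blank line) is my own
assumption about a sensible default.

```python
def _content_length(header_block):
    """Return the Content-Length value from raw header bytes, or None."""
    for line in header_block.split(b"\n"):
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"content-length":
            try:
                length = int(value.strip().decode("ascii"))
            except ValueError:
                return None
            return length if length >= 0 else None
    return None


def _looks_like_boundary(data, end):
    """True if offset `end` is followed by end-of-file or a 'From ' line."""
    rest = data[end:].lstrip(b"\n")
    return not rest or rest.startswith(b"From ")


def split_by_content_length(data):
    """Split raw mbox bytes into messages, trusting Content-Length when it
    is present and lands on a plausible message boundary."""
    messages = []
    pos = 0
    while pos < len(data):
        if not data.startswith(b"From ", pos):
            raise ValueError("expected a separator line at offset %d" % pos)
        header_end = data.find(b"\n\n", pos)
        if header_end == -1:
            # Headers only, no body: the rest of the file is one message.
            messages.append(data[pos:])
            break
        body_start = header_end + 2
        length = _content_length(data[pos:header_end])
        end = None
        if length is not None and body_start + length <= len(data):
            candidate = body_start + length
            if _looks_like_boundary(data, candidate):
                end = candidate
        if end is None:
            # Fallback (assumption): next "From " line after a blank line.
            nxt = data.find(b"\n\nFrom ", header_end)
            end = len(data) if nxt == -1 else nxt + 1
        messages.append(data[pos:end])
        pos = end
        # Skip the blank line(s) separating two messages.
        while pos < len(data) and data[pos:pos + 1] == b"\n":
            pos += 1
    return messages
```

With a correct Content-Length, an unescaped "From " line in the body is
skipped over as part of the body. Without the header, the fallback is
exactly as fragile as before.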
> Incidentally, one reason I didn't realize dupes were the reason is that
> I did a search for a word in one email I had and notmuch did not find
> it - so I assumed it had not been indexed. Later on, I realized I had
> written a partial word and discovered that notmuch does find it if I
> type the full word.
>
> What am I doing wrong? Can't notmuch handle partial word matches? Do I
> need to specify an option to get that to work?
AFAIK, this depends on how Xapian splits terms, so it isn't a Notmuch issue.
Globbing helps (sometimes):
    query: "partia AND from:mueen@nawaz.org"
    returns nil
    query: "partia* AND from:mueen@nawaz.org"
    correctly returns this thread.
Peace
-Pieter
[1] mb2md, line 999 (http://www.linuxkungfu.org/files/scripts/mb2md)
[2] fdm, line 461 (http://fdm.cvs.sourceforge.net/viewvc/fdm/fdm/fetch-mbox.c?view=markup)
[3] mb2md, line 1342 (http://www.linuxkungfu.org/files/scripts/mb2md)
[4] fdm, line 468 (http://fdm.cvs.sourceforge.net/viewvc/fdm/fdm/fetch-mbox.c?view=markup)
[5] RFC 4155, section 2, paragraph 5 (http://tools.ietf.org/html/rfc4155)
[6] http://www.mail-archive.com/mutt-users@mutt.org/msg21921.html