unofficial mirror of notmuch@notmuchmail.org
* Questions about importing mail (mbox)
@ 2011-03-21  3:30 Mueen Nawaz
  2011-03-21 14:31 ` Pieter Praet
  2011-03-21 15:27 ` Jesse Rosenthal
  0 siblings, 2 replies; 6+ messages in thread
From: Mueen Nawaz @ 2011-03-21  3:30 UTC
  To: notmuch


Hi,

I'm trying to experiment with notmuch. 

As I understand it, notmuch does not handle mbox for input. The problem
is that all my mail is currently in mbox format.

So I first tried converting mbox to maildir using mb2md.

It didn't do a good job. When I subsequently tried importing to notmuch,
notmuch complained about lots of non-mail files - I confirmed that
indeed mb2md had botched converting those emails.

So then I tried to convert to mh format using Sylpheed. This seemed to
go well, but then when importing to notmuch, it complained again for
about 20 emails, and a manual check confirmed that some messages did not
get converted properly to mh (they don't show up in Sylpheed).

And then I noticed another discrepancy. mutt shows that I started with
44473 messages in mbox. When I imported into Sylpheed, it showed 44482
messages (no idea where the extra 9 came from). However, notmuch is
reporting that it processed 44482 files, but that it added 35602
messages.

Why only 35602 (it complained for only about 20 messages)? A search
confirmed that some messages that show up in both mutt (in mbox) and
Sylpheed (in mh format) were not indexed.

So I want to know: When you guys switched to notmuch, how did you ensure
you did not miss any emails? I really, really, really don't want to lose
any emails in this process!

Thanks.

* Re: Questions about importing mail (mbox)
  2011-03-21  3:30 Questions about importing mail (mbox) Mueen Nawaz
@ 2011-03-21 14:31 ` Pieter Praet
  2011-03-22  2:02   ` Mueen Nawaz
  2011-03-21 15:27 ` Jesse Rosenthal
  1 sibling, 1 reply; 6+ messages in thread
From: Pieter Praet @ 2011-03-21 14:31 UTC
  To: Mueen Nawaz, notmuch

On Sun, 20 Mar 2011 20:30:52 -0700, Mueen Nawaz <mueen@nawaz.org> wrote:
> 
> Hi,
> 
> I'm trying to experiment with notmuch. 
> 
> As I understand it, notmuch does not handle mbox for input. The problem
> is that all my mail is currently in mbox format.
> 
> So I first tried converting mbox to maildir using mb2md.
> 
> It didn't do a good job. When I subsequently tried importing to notmuch,
> notmuch complained about lots of non-mail files - I confirmed that
> indeed mb2md had botched converting those emails.
> 
> So then I tried to convert to mh format using Sylpheed. This seemed to
> go well, but then when importing to notmuch, it complained again for
> about 20 emails, and a manual check confirmed that some messages did not
> get converted properly to mh (they don't show up in Sylpheed).
> 
> And then I noticed another discrepancy. mutt shows that I started with
> 44473 messages in mbox. When I imported into Sylpheed, it showed 44482
> messages (no idea where the extra 9 came from). However, notmuch is
> reporting that it processed 44482 files, but that it added 35602
> messages.
> 
> Why only 35602 (it complained for only about 20 messages)? A search
> confirmed that some messages that show up in both mutt (in mbox) and
> Sylpheed (in mh format) were not indexed.
> 
> So I want to know: When you guys switched to notmuch, how did you ensure
> you did not miss any emails? I really, really, really don't want to lose
> any emails in this process!
> 
> Thanks.


It would've been a no-brainer if you'd been using Maildir all along
(mbox is evil incarnate), but...

I'd suggest keeping your original mbox file safe in git [1], and
consistently committing every step of the way, so even if messages were
to get lost in translation, you still have a way to get them back, with
negligible storage overhead (just remember to "git gc --aggressive
--prune=now" when you're finished).

Compacting the mbox file, i.e. purging all stale messages (sync-mailbox
in mutt?) and diffing to HEAD could then possibly give you an indication
as to the origin of the 9 surplus files.

For the actual conversion to Maildir (and any type of mail fetching in
general), I'd suggest using FDM [2], you'll never look back.

Regarding the significant discrepancy between processed and added files
in Notmuch: Could be dupes (e.g. mail to/cc/bcc yourself or mailing
lists, ending up in both Inbox and Sent), which are automatically
suppressed by Notmuch.


[1] http://git-scm.com/
[2] http://fdm.sourceforge.net/

* Re: Questions about importing mail (mbox)
  2011-03-21  3:30 Questions about importing mail (mbox) Mueen Nawaz
  2011-03-21 14:31 ` Pieter Praet
@ 2011-03-21 15:27 ` Jesse Rosenthal
  2011-03-22  2:07   ` Mueen Nawaz
  1 sibling, 1 reply; 6+ messages in thread
From: Jesse Rosenthal @ 2011-03-21 15:27 UTC
  To: Mueen Nawaz, notmuch

On Sun, 20 Mar 2011 20:30:52 -0700, Mueen Nawaz <mueen@nawaz.org> wrote:
> 
> So I want to know: When you guys switched to notmuch, how did you ensure
> you did not miss any emails? I really, really, really don't want to lose
> any emails in this process!

I didn't need to convert when I started using notmuch, but for past
mbox-to-maildir conversions, I always had the most confidence in using
mutt interactively. Tag all messages (S-t, all), copy or save to a
maildir, and make sure your mbox_type is set appropriately. There are
scripts out there to automate it, but if you're worried about missing
something, doing it by hand might work a bit better for you. (You can
also do it in chunks by date to make sure everything is moving over.)
Not the most efficient, but you should only have to do it once.
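
For anyone who would rather script the copy than drive mutt by hand, the
same before-and-after count check can be sketched with Python's standard
mailbox module. This is only a minimal sketch under assumptions: the
function name mbox_to_maildir and the paths passed to it are hypothetical.
Note that the stdlib mbox reader also treats any line starting with
"From " as a separator, so it will not rescue an mbox with unescaped body
lines.

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox into a Maildir.

    Returns (copied, stored): the number of messages read from the mbox
    and the number found in the Maildir afterwards. Hypothetical helper.
    """
    src = mailbox.mbox(mbox_path, create=False)
    dst = mailbox.Maildir(maildir_path, create=True)
    copied = 0
    try:
        for message in src:
            dst.add(message)  # new files land in the Maildir's "new" subdirectory
            copied += 1
    finally:
        src.close()
        dst.close()
    # Recount from disk with a fresh object, independent of the copy loop.
    stored = len(mailbox.Maildir(maildir_path, create=False))
    return copied, stored
```

If the two numbers differ, or differ from what mutt reports for the mbox,
something was split or dropped on the way. Running it on one dated chunk
at a time narrows down where.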

Best,
Jesse

* Re: Questions about importing mail (mbox)
  2011-03-21 14:31 ` Pieter Praet
@ 2011-03-22  2:02   ` Mueen Nawaz
  2011-04-16 18:43     ` Pieter Praet
  0 siblings, 1 reply; 6+ messages in thread
From: Mueen Nawaz @ 2011-03-22  2:02 UTC
  To: notmuch

Pieter Praet <pieter@praet.org> writes:
> It would've been a no-brainer if you'd been using Maildir all along
> (mbox is evil incarnate), but...

Sure, but mbox is too convenient.

> I'd suggest keeping your original mbox file safe in git [1], and
> consistently committing every step of the way, so even if messages were
> to get lost in translation, you still have a way to get them back, with
> negligible storage overhead (just remember to "git gc --aggressive
> --prune=now" when you're finished).

I think you misunderstood me. A part of me suspects this has something
to do with my not explaining myself, but who's to say?<G>

I'm experimenting with notmuch, and if I can translate everything I
currently do in mutt to notmuch, then I'll just dump mutt. The set of
mboxes I have will remain archived, but for all future incoming email,
I'll switch to MH or MailDir. So I don't actually need to put my old
mboxes under revision control - I just need to save them somewhere.

> For the actual conversion to Maildir (and any type of mail fetching in
> general), I'd suggest using FDM [2], you'll never look back.

Thanks - will take a look.

> Regarding the significant discrepancy between processed and added files
> in Notmuch: Could be dupes (e.g. mail to/cc/bcc yourself or mailing
> lists, ending up in both Inbox and Sent), which are automatically
> suppressed by Notmuch.

It definitely was dupes. I didn't realize that notmuch did not keep
track of dupes. 

So I wrote a Python script to go through the mboxes and do a count of
only unique messages. Problem? I have over 1000 emails that don't have a
Message-ID header (case-insensitive search). I could go over why that is,
but suffice it to say that I hate Microsoft.<G>

Once I remove all dupes, I get to within 300-400 of the count that
notmuch provides. The remaining 1000+ emails do contain some dupes, and
I can't find a convenient way to get an accurate count of unique emails
from them, but at least now I'm in the ballpark, and a lot more
confident.
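
The script itself was not posted. A minimal sketch of such a count,
assuming Python's standard mailbox module and a hypothetical function
name count_unique, might look like this:

```python
import mailbox


def count_unique(mbox_paths):
    """Return (total, unique_ids, missing_id) across the given mbox files."""
    seen = set()
    total = 0
    missing = 0
    for path in mbox_paths:
        box = mailbox.mbox(path, create=False)
        try:
            for message in box:
                total += 1
                # Header lookup in the email package is case-insensitive.
                raw = message.get("Message-ID")
                msg_id = str(raw).strip() if raw is not None else ""
                if msg_id:
                    seen.add(msg_id)
                else:
                    missing += 1
        finally:
            box.close()
    return total, len(seen), missing
```

The count to compare against notmuch would then be somewhere between
unique_ids and unique_ids + missing_id, since the messages without a
Message-ID cannot be deduplicated this way.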

Incidentally, one reason I didn't realize dupes were the reason is that
I did a search for a word in one email I had and notmuch did not find
it - so I assumed it had not been indexed. Later on, I realized I had
written a partial word and discovered that notmuch does find it if I
type the full word.

What am I doing wrong? Can't notmuch handle partial word matches? Do I
need to specify an option to get that to work?

Anyway, thanks for the help - I'll investigate further.

* Re: Questions about importing mail (mbox)
  2011-03-21 15:27 ` Jesse Rosenthal
@ 2011-03-22  2:07   ` Mueen Nawaz
  0 siblings, 0 replies; 6+ messages in thread
From: Mueen Nawaz @ 2011-03-22  2:07 UTC
  To: notmuch

Jesse Rosenthal <jrosenthal@jhu.edu> writes:

> I didn't need to convert when I started using notmuch, but for past
> mbox-to-maildir conversions, I always had the most confidence in using
> mutt interactively. Tag all messages (S-t, all), copy or save to a
> maildir, and make sure your mbox_type is set appropriately. There are
> scripts out there to automate it, but if you're worried about missing
> something, doing it by hand might work a bit better for you. (You can
> also do it in chunks by date to make sure everything is moving over.)
> Not the most efficient, but you should only have to do it once.

Thanks - will give it a try. It half solves my problem, in that I can do
a message count using mutt before and after to see the conversion went
well. The second issue is figuring out if notmuch really did index all
of them - challenging because I have plenty of dupes. I may just have to
take it all on faith for now.

As I had mentioned, when going from MH to notmuch, it complained
about roughly 20 messages. I was in a hurry so didn't take a detailed look,
but two of them were clearly corrupt in my mbox file. They had a from
and virtually no other headers. So perhaps all the problems I'm having
stem from corrupt messages in my mbox...

* Re: Questions about importing mail (mbox)
  2011-03-22  2:02   ` Mueen Nawaz
@ 2011-04-16 18:43     ` Pieter Praet
  0 siblings, 0 replies; 6+ messages in thread
From: Pieter Praet @ 2011-04-16 18:43 UTC
  To: Mueen Nawaz, notmuch

On Mon, 21 Mar 2011 19:02:45 -0700, Mueen Nawaz <mueen@nawaz.org> wrote:
> I think you misunderstood me. A part of me suspects this has something
> to do with my not explaining myself, but who's to say?<G>

Same here, apparently :D

> I'm experimenting with notmuch, and if I can translate everything I
> currently do in mutt to notmuch, then I'll just dump mutt. The set of
> mboxes I have will remain archived, but for all future incoming email,
> I'll switch to MH or MailDir. So I don't actually need to put my old
> mboxes under revision control - I just need to save them somewhere.

I strongly agree that long-term storage choices are a matter of personal
opinion; however, the intention of my proposal was simply to keep
track of what changed in the mbox as a result of the various operations
performed, so as to gain insight into what gets messed up and where.

Non-VCS would be something along the lines of:
    compact mbox.orig > mbox.comp   # (*if* "compact" were a valid command)
    diff mbox.orig mbox.comp
    mb2md -s ./mbox.comp -d ./maildir
    cat ./maildir/new/* >> mbox.conv
    diff mbox.comp mbox.conv
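
The same before/after comparison can also be made per message rather than
per line, which avoids noise from reordering and from the "From " lines
themselves. A rough sketch, assuming Python's standard mailbox module; the
names id_counts and compare are made up for illustration, and messages
without a Message-ID are lumped together under one placeholder key:

```python
import mailbox
from collections import Counter


def id_counts(box):
    """Count how often each Message-ID occurs in an open mailbox."""
    counts = Counter()
    for message in box:
        raw = message.get("Message-ID")
        key = str(raw).strip() if raw is not None else "<no Message-ID>"
        counts[key] += 1
    return counts


def compare(mbox_path, maildir_path):
    """Return (lost, extra) Counters of Message-IDs between mbox and Maildir."""
    src = mailbox.mbox(mbox_path, create=False)
    dst = mailbox.Maildir(maildir_path, create=False)
    try:
        before = id_counts(src)
        after = id_counts(dst)
    finally:
        src.close()
        dst.close()
    # Counter subtraction keeps only the positive differences.
    return before - after, after - before
```

Anything in lost was in the mbox but did not make it into the Maildir;
anything in extra appeared during conversion.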

> > For the actual conversion to Maildir (and any type of mail fetching in
> > general), I'd suggest using FDM [2], you'll never look back.
> 
> Thanks - will take a look.
> 
> > Regarding the significant discrepancy between processed and added files
> > in Notmuch: Could be dupes (e.g. mail to/cc/bcc yourself or mailing
> > lists, ending up in both Inbox and Sent), which are automatically
> > suppressed by Notmuch.
> 
> It definitely was dupes. I didn't realize that notmuch did not keep
> track of dupes. 
> 
> So I wrote a Python script to go through the mboxes and do a count of
> only unique messages. Problem? I have over 1000 emails that don't have a
> Message-ID header (case-insensitive search). I could go over why that is,
> but suffice it to say that I hate Microsoft.<G>
> 
> Once I remove all dupes, I get to within 300-400 of the count that
> notmuch provides. The remaining 1000+ emails do contain some dupes, and
> I can't find a convenient way to get an accurate count of unique emails
> from them, but at least now I'm in the ballpark, and a lot more
> confident.

Sadly, both mb2md and fdm *will* mess things up, since they both split
on every single occurrence of "^From " [1,2], even if it isn't a
separator line.

Both assume occurrences of "^From " in the message body to be already
escaped like so: "^>From " [3,4].

Even worse, RFC 4155 [5] confirms this to be semi-expected behaviour:
>> Many implementations are also known to escape message body lines that
>> begin with the character sequence of "From ", so as to prevent
>> confusion with overly-liberal parsers that do not search for full
>> separator lines.  In the common case, a leading Greater-Than symbol
>> (0x3E) is used for this purpose (with "From " becoming ">From ").
>> However, other implementations are known not to escape such lines
>> unless they are immediately preceded by a blank line or if they also
>> appear to contain an email address and a timestamp.  Other
>> implementations are also known to perform secondary escapes against
>> these lines if they are already escaped or quoted, while others
>> ignore these mechanisms altogether.

One way to circumvent this is by making use of the Content-Length header
(which is apparently how Mutt does it [6]), but guess what: it suffers
the same fate as Message-ID, i.e. it is often missing...
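
To get a feel for how many spurious splits a given mbox would produce, one
can count lines that merely start with "From " against lines that look
like a full separator (sender plus asctime-style date). The sketch below
is an illustration only: the SEPARATOR pattern is my own approximation,
not the check used by mb2md, fdm or Mutt, and count_messages is a
hypothetical name.

```python
import re

# Approximation of a full mbox separator line:
# "From <sender> <weekday> <month> <day> <hh:mm:ss> <year>"
SEPARATOR = re.compile(
    r"^From \S+ +(Mon|Tue|Wed|Thu|Fri|Sat|Sun) "
    r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d{1,2} "
    r"\d{2}:\d{2}:\d{2} \d{4}\s*$"
)


def count_messages(text):
    """Return (naive, strict) message counts for the given mbox text."""
    lines = text.splitlines()
    # Naive: every line beginning with "From " starts a new message.
    naive = sum(1 for line in lines if line.startswith("From "))
    # Strict: only lines that look like a complete separator count.
    strict = sum(1 for line in lines if SEPARATOR.match(line))
    return naive, strict


sample = (
    "From a@example.org Mon Mar 21 03:30:00 2011\n"
    "Subject: hi\n"
    "\n"
    "From now on I use Maildir.\n"
)
print(count_messages(sample))  # (2, 1)
```

A gap between the two numbers is one possible source of surplus messages
after conversion.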

> Incidentally, one reason I didn't realize dupes were the reason is that
> I did a search for a word in one email I had and notmuch did not find
> it - so I assumed it had not been indexed. Later on, I realized I had
> written a partial word and discovered that notmuch does find it if I
> type the full word.
> 
> What am I doing wrong? Can't notmuch handle partial word matches? Do I
> need to specify an option to get that to work?

AFAIK, this depends on how Xapian splits terms, so it isn't a Notmuch issue.
Globbing helps (sometimes).

query: "partia AND from:mueen@nawaz.org"
    returns nil

query: "partia* AND from:mueen@nawaz.org"
    correctly returns this thread.
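
Why the trailing "*" makes a difference can be shown with a toy inverted
index. This is only a simplified model for illustration, not how Xapian is
actually implemented (it ignores stemming and prefixes such as "from:"
entirely); build_index and search are made-up names.

```python
import re


def build_index(documents):
    """Map each whole lowercase word to the set of document ids containing it."""
    index = {}
    for doc_id, text in documents.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    """Exact term lookup, or prefix expansion when the query ends in '*'."""
    query = query.lower()
    if query.endswith("*"):
        prefix = query[:-1]
        hits = set()
        # Expand the wildcard to every indexed term sharing the prefix.
        for term, doc_ids in index.items():
            if term.startswith(prefix):
                hits |= doc_ids
        return hits
    # Without a wildcard, only a whole indexed term can match.
    return set(index.get(query, set()))


docs = {1: "Can't notmuch handle partial word matches?", 2: "mbox is evil incarnate"}
idx = build_index(docs)
print(search(idx, "partia"))   # set()
print(search(idx, "partia*"))  # {1}
```

Since only whole words are stored as terms, "partia" has nothing to match
until the wildcard expands it to "partial".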



Peace

-Pieter


[1] mb2md, line 999 (http://www.linuxkungfu.org/files/scripts/mb2md)
[2] fdm, line 461 (http://fdm.cvs.sourceforge.net/viewvc/fdm/fdm/fetch-mbox.c?view=markup)
[3] mb2md, line 1342 (http://www.linuxkungfu.org/files/scripts/mb2md)
[4] fdm, line 468 (http://fdm.cvs.sourceforge.net/viewvc/fdm/fdm/fetch-mbox.c?view=markup)
[5] RFC 4155, section 2, paragraph 5 (http://tools.ietf.org/html/rfc4155)
[6] http://www.mail-archive.com/mutt-users@mutt.org/msg21921.html
