From: Pieter Praet
To: Mueen Nawaz, notmuch@notmuchmail.org
Subject: Re: Questions about importing mail (mbox)
Date: Sat, 16 Apr 2011 20:43:22 +0200
Message-ID: <87bp06m3zp.fsf@A7GMS.i-did-not-set--mail-host-address--so-tickle-me>
In-Reply-To: <87hbavlxoa.fsf@fester.com>
References: <87bp15m9oz.fsf@fester.com> <87zkooo88x.fsf@A7GMS.i-did-not-set--mail-host-address--so-tickle-me> <87hbavlxoa.fsf@fester.com>
User-Agent: Notmuch/0.5-86-g4875299 (http://notmuchmail.org) Emacs/23.1.50.1 (x86_64-pc-linux-gnu)
List-Id: "Use and development of the notmuch mail system."

On Mon, 21 Mar 2011 19:02:45 -0700, Mueen Nawaz wrote:
> I think you misunderstood me. A part of me suspects this has something
> to do with my not explaining myself, but who's to say?

Same here, apparently :D

> I'm experimenting with notmuch, and if I can translate everything I
> currently do in mutt to notmuch, then I'll just dump mutt. The set of
> mboxes I have will remain archived, but for all future incoming email,
> I'll switch to MH or MailDir. So I don't actually need to put my old
> mboxes under revision control - I just need to save them somewhere.

I strongly agree that long term storage choices are a matter of personal
opinion, however the intention of my proposition was simply to keep
track of what changed in the mbox as a result of the various ops
performed, so as to gain insight into what gets messed up and where.

Non-VCS would be something along the lines of:

  compact mbox.orig > mbox.comp    # (*if* "compact" were a valid command)
  diff mbox.orig mbox.comp
  mb2md -s ./mbox.comp -d ./maildir
  cat ./maildir/new/* >> mbox.conv
  diff mbox.comp mbox.conv

> > For the actual conversion to Maildir (and any type of mail fetching in
> > general), I'd suggest using FDM [2], you'll never look back.
>
> Thanks - will take a look.
>
> > Regarding the significant discrepancy between processed and added files
> > in Notmuch: Could be dupes (e.g. mail to/cc/bcc yourself or mailing
> > lists, ending up in both Inbox and Sent), which are automatically
> > suppressed by Notmuch.
>
> It definitely was dupes. I didn't realize that notmuch did not keep
> track of dupes.
>
> So I wrote a Python script to go through the mboxes and do a count of
> only unique messages. Problem? I have over 1000 emails that don't have a
> Message-ID header (case invariant search). I could go over why that is,
> but suffice it to say that I hate Microsoft.
>
> Once I remove all dupes, I get to within 300-400 of the count that
> notmuch provides. The remaining 1000+ emails do contain some dupes, and
> I can't find a convenient way to get an accurate count of unique emails
> from them, but at least now I'm in the ballpark, and a lot more
> confident.

Sadly, both mb2md and fdm *will* mess things up, since they both split
on every single occurrence of "^From " [1,2], even if it isn't a
separator line. Both assume occurrences of "^From " in the message body
to be already escaped like so: "^>From " [3,4].

Even worse, RFC 4155 [5] confirms this to be semi-expected behaviour:

>> Many implementations are also known to escape message body lines that
>> begin with the character sequence of "From ", so as to prevent
>> confusion with overly-liberal parsers that do not search for full
>> separator lines. In the common case, a leading Greater-Than symbol
>> (0x3E) is used for this purpose (with "From " becoming ">From ").
>> However, other implementations are known not to escape such lines
>> unless they are immediately preceded by a blank line or if they also
>> appear to contain an email address and a timestamp. Other
>> implementations are also known to perform secondary escapes against
>> these lines if they are already escaped or quoted, while others
>> ignore these mechanisms altogether.

One way to circumvent this is by making use of the Content-Length header
(which is apparently how Mutt does it [6]), but guess what, it suffers
the same fate as Message-ID (i.e. it can't be relied upon to be
present)...

> Incidentally, one reason I didn't realize dupes were the reason is that
> I did a search for a word in one email I had and notmuch did not find
> it - so I assumed it had not been indexed. Later on, I realized I had
> written a partial word and discovered that notmuch does find it if I
> type the full word.
>
> What am I doing wrong? Can't notmuch handle partial word matches? Do I
> need to specify an option to get that to work?

AFAIK, this depends on how Xapian splits terms, so it isn't a Notmuch
issue. Globbing helps (sometimes).

  query: "partia AND from:mueen@nawaz.org" returns nil
  query: "partia* AND from:mueen@nawaz.org" correctly returns this thread.


Peace

-Pieter

[1] mb2md, line 999 (http://www.linuxkungfu.org/files/scripts/mb2md)
[2] fdm, line 461 (http://fdm.cvs.sourceforge.net/viewvc/fdm/fdm/fetch-mbox.c?view=markup)
[3] mb2md, line 1342 (http://www.linuxkungfu.org/files/scripts/mb2md)
[4] fdm, line 468 (http://fdm.cvs.sourceforge.net/viewvc/fdm/fdm/fetch-mbox.c?view=markup)
[5] RFC 4155, section 2, paragraph 5 (http://tools.ietf.org/html/rfc4155)
[6] http://www.mail-archive.com/mutt-users@mutt.org/msg21921.html
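
For illustration, here is a minimal Python sketch of the "^From " problem
described above. It is not taken from mb2md or fdm: the sample mbox, the
function names and the separator regex are all made up for this example,
and the "strict" check is only one of the heuristics that the RFC 4155
excerpt mentions (preceding blank line, plus something that looks like an
address and a timestamp).

```python
import re

# Hypothetical two-message mbox. The body of the first message contains an
# unescaped line that starts with "From " but is not a separator line.
SAMPLE_MBOX = (
    "From alice@example.org Sat Apr 16 11:43:33 2011\n"
    "Subject: first\n"
    "\n"
    "Hello,\n"
    "From what I can tell, this line is not a separator.\n"
    "\n"
    "From bob@example.org Sat Apr 16 11:50:00 2011\n"
    "Subject: second\n"
    "\n"
    "Bye.\n"
)

# Assumed shape of a full separator line: "From ", an address-like token,
# then an asctime-style timestamp. Real-world mboxes vary more than this.
SEPARATOR = re.compile(
    r"^From \S+ +\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4}$"
)


def count_naive(text):
    """Count messages by treating every line starting with 'From ' as a separator."""
    return sum(1 for line in text.splitlines() if line.startswith("From "))


def count_strict(text):
    """Count only full separator lines that follow a blank line (or start the file)."""
    count = 0
    previous_blank = True  # the start of the file counts as "after a blank line"
    for line in text.splitlines():
        if previous_blank and SEPARATOR.match(line):
            count += 1
        previous_blank = (line == "")
    return count


print("naive :", count_naive(SAMPLE_MBOX))
print("strict:", count_strict(SAMPLE_MBOX))
```

The naive count reports one message too many for this sample, which is
the kind of discrepancy that shows up when body lines were never escaped
as ">From ". The stricter check is still a heuristic and can be fooled by
a body line that happens to look like a full separator.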