Hello everyone :),

I like Git, so when folks suggest storing things in Git, I'm usually
excited ;). Eric Wong has been working on some tools to store email
in a Git repository, and his client-side code is ssoma [1]. I wanted
a bit more metadata than the stock ssoma-mda [2], and ended up just
writing a ssoma-mda in Python [3]. It needs Python ≥3.4 and pygit2.
I had pygit2 already installed for Python 3.3 (which gave me a local
libgit2), so I used pip to install it for 3.4:

  $ python3.4 -m ensurepip --user
  $ pip3.4 install --user pygit2

Then I grabbed the archives, and pulled them into Git:

  $ wget http://notmuchmail.org/archives/notmuch.mbox
  $ git init --bare notmuch-archives.git
  $ cd notmuch-archives.git
  $ python3.4
  >>> import email.utils
  >>> import mailbox
  >>> import ssoma_mda
  >>> mbox = mailbox.mbox('../notmuch.mbox', factory=None, create=False)
  >>> # Sort by the Date header so messages are delivered in chronological order.
  >>> messages = sorted(mbox, key=lambda m: email.utils.mktime_tz(email.utils.parsedate_tz(m['date'])))
  >>> for message in messages:
  ...     if ((message['message-id'] == '<m2k4gmyjer.fsf@ecocode.net>' and
  ...          message['X-List-Received-Date'] == 'Sat, 26 Feb 2011 14:23:34 -0000') or
  ...         (message['message-id'] == '<4EDF728E.3050204@gmail.com>' and
  ...          message['X-List-Received-Date'] == 'Wed, 07 Dec 2011 14:05:16 -0000') or
  ...         (message['message-id'] == '<4FE369F2.5080804@gmail.com>' and
  ...          message['X-List-Received-Date'] == 'Thu, 21 Jun 2012 18:38:07 -0000') or
  ...         (message['message-id'] == '<5122353D.4060601@gmail.com>' and
  ...          message['X-List-Received-Date'] == 'Mon, 18 Feb 2013 14:06:12 -0000') or
  ...         (message['message-id'] == '<CA+eQo_1hMsTD4+6ifqgEQXW0_qYXGOdfkO6tBuGQKV+W7OSaKA@mail.gmail.com>' and
  ...          message['X-List-Received-Date'] == 'Wed, 24 Apr 2013 18:09:55 -0000') or
  ...         (message['message-id'] == '<527B9E8C.5000001@krugs.de>' and
  ...          message['X-List-Received-Date'] == 'Thu, 07 Nov 2013 14:07:32 -0000') or
  ...         (message['message-id'] == '<1399645162-8653-1-git-send-email-wael.nasreddine@gmail.com>' and
  ...          message['X-List-Received-Date'] == 'Fri, 09 May 2014 14:19:36 -0000') or
  ...         (message['message-id'] == '<m2mw9xkyvg.fsf@krugs.de>' and
  ...          message['X-List-Received-Date'] == 'Thu, 18 Sep 2014 10:27:35 -0000') or
  ...         (message['message-id'] == '<cover.1411379395.git.jani@nikula.org>' and
  ...          message['X-List-Received-Date'] != 'Mon, 22 Sep 2014 09:54:16 -0000')):
  ...         continue
  ...     ssoma_mda.deliver(message=message, once=True)
  >>> ^D

On my 1.1GHz Intel Celeron 847 Sandy Bridge netbook, that took about
half an hour. The initial repository was large:

  $ du -hs .
  394M	.

But packing it up made it small:

  $ git gc --aggressive
  $ du -hs .
  51M	.

With a few fewer messages than the mbox:

  $ git log --oneline | wc -l
  19650

Compared with 19660 messages in the mbox at 107 MB (160 MB for the
associated Maildir). The messages I dropped were extra copies that
shared a Message-ID with a message I kept:

* id:m2k4gmyjer.fsf@ecocode.net had different received dates:

    -X-List-Received-Date: Sat, 26 Feb 2011 14:12:20 -0000
    +X-List-Received-Date: Sat, 26 Feb 2011 14:23:34 -0000

  but no significant differences.

* id:4EDF728E.3050204@gmail.com had a real address in the
  first-to-arrive version:

    -X-List-Received-Date: Wed, 07 Dec 2011 14:10:13 -0000
    -> <4winter@informatik.uni-hamburg.de>

  and an obfuscated one in the second-to-arrive version:

    +X-List-Received-Date: Wed, 07 Dec 2011 14:05:16 -0000
    +> <4winter-jNDFPZUTrfQBEfOqpokbeYV0Y/DQsy6Ps0AfqQuZ5sE@public.gmane.org>

* id:4FE369F2.5080804@gmail.com had the same:

    -X-List-Received-Date: Thu, 21 Jun 2012 18:37:54 -0000
    -> > wrote:

* id:5122353D.4060601@gmail.com had different received dates:

    -X-List-Received-Date: Mon, 18 Feb 2013 14:06:05 -0000
    +X-List-Received-Date: Mon, 18 Feb 2013 14:06:12 -0000

  but no significant differences.

* id:CA+eQo_1hMsTD4+6ifqgEQXW0_qYXGOdfkO6tBuGQKV+W7OSaKA@mail.gmail.com
  had different MIME boundaries:

    -Content-Type: multipart/alternative; boundary=f46d043be11ac45a0904db1f3428
    -X-List-Received-Date: Wed, 24 Apr 2013 18:09:46 -0000
    +Content-Type: multipart/alternative; boundary=e89a8f646ff3faa11d04db1f3294
    +X-List-Received-Date: Wed, 24 Apr 2013 18:09:55 -0000

  but no significant differences.

* id:527B9E8C.5000001@krugs.de had obfuscated addresses:

    -X-List-Received-Date: Thu, 07 Nov 2013 14:07:33 -0000
    -> Rainer M Krug writes:
    +X-List-Received-Date: Thu, 07 Nov 2013 14:07:32 -0000
    +> Rainer M Krug writes:

* id:1399645162-8653-1-git-send-email-wael.nasreddine@gmail.com had
  additional content in the later submission:

    -Subject: [PATCH] Add Travis-CI config file.
    -Date: Fri, 9 May 2014 07:19:22 -0700
    -X-List-Received-Date: Fri, 09 May 2014 14:19:36 -0000
    - .travis.yml | 10 ++++++++++
    - 1 file changed, 10 insertions(+)
    +Subject: [PATCH v2] Enable Travis-CI as a backup continuous integration
    + service.
    +Date: Fri, 9 May 2014 14:44:50 -0700
    +X-List-Received-Date: Fri, 09 May 2014 21:45:16 -0000
    +
    +The v2 adds a notification section to send failure (or back to passing) notifications
    +to the mailing list and to the IRC channel
    +
    + .travis.yml | 13 +++++++++++++
    + 1 file changed, 13 insertions(+)

* id:m2mw9xkyvg.fsf@krugs.de had an obfuscated address and a different
  signature:

    -X-List-Received-Date: Thu, 18 Sep 2014 10:27:31 -0000
    ->> guyzmo writes:
     -----BEGIN PGP SIGNATURE-----
     Version: GnuPG/MacGPG2 v2.0.22 (Darwin)
    -iQEcBAEBAgAGBQJUGrN3AAoJENvXNx4PUvmC4J0IAN9Wf+0ArvirJCoewItnEZoo
    -ySg4VRP7uWVqDxHVl5N9XFv4YE2bZ2E2eMGvbo6v7I82lhqeR5dauZhlgCMki+ZI
    +X-List-Received-Date: Thu, 18 Sep 2014 10:27:35 -0000
    +>> guyzmo writes:
     -----BEGIN PGP SIGNATURE-----
     Version: GnuPG/MacGPG2 v2.0.22 (Darwin)
    +iQEcBAEBAgAGBQJUGrN4AAoJENvXNx4PUvmC6LsIAIaFrd4MFnm8EixrAHPGfW6j
    +L3KNG7Dv+hQuNRUN6qn+emZHI8wX4O74HOZOpZWkE09CmjkPJBmf7IuJwtz2ONbM

* id:cover.1411379395.git.jani@nikula.org came in three times, with
  three dates, but no significant differences:

    Date: Mon, 22 Sep 2014 11:54:20 +0200
    X-List-Received-Date: Mon, 22 Sep 2014 09:54:16 -0000

    Date: Mon, 22 Sep 2014 11:54:42 +0200
    X-List-Received-Date: Mon, 22 Sep 2014 09:54:37 -0000

    Date: Mon, 22 Sep 2014 11:54:51 +0200
    X-List-Received-Date: Mon, 22 Sep 2014 09:54:49 -0000

Anyhow, I've pushed the Git archive [4,5] if anyone wants to play
around with ssoma. I think this would be a nice backend for folks
building notmuch-based web archives, and pulling from Git is easier
than downloading a new mbox ;).

Cheers,
Trevor

[1]: http://ssoma.public-inbox.org/README
[2]: http://public-inbox.org/meta/m/ec8f54cf6451eef6e9f59eff691cd9002f4fdf65.html
[3]: http://git.tremily.us/?p=ssoma-mda.git;a=shortlog;h=refs/heads/python
     I have an uncommitted patch to work around
     http://bugs.python.org/issue22684
[4]: http://git.tremily.us/?p=notmuch-archives.git
[5]: git://tremily.us/notmuch-archives.git

--
This email may be signed or encrypted with GnuPG (http://www.gnupg.org).
For more information, see http://en.wikipedia.org/wiki/Pretty_Good_Privacy
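
For anyone repeating this import on another archive, the duplicate
Message-IDs discussed above could probably be located before delivery
with something like the following. This is only a minimal sketch using
the standard library: find_duplicate_ids is a hypothetical helper (it is
not part of ssoma-mda), and it assumes the mbox carries the same
X-List-Received-Date header that the notmuch archive does.

```python
import collections
import mailbox


def find_duplicate_ids(mbox_path):
    """Map each repeated Message-ID to the list-received dates of its copies.

    Hypothetical helper for illustration; not part of ssoma-mda.
    """
    received = collections.defaultdict(list)
    mbox = mailbox.mbox(mbox_path, factory=None, create=False)
    try:
        for message in mbox:
            message_id = message['message-id']
            if message_id is None:
                # No Message-ID header, so there is nothing to compare on.
                continue
            # Assumption: the archive records X-List-Received-Date;
            # the value is None for messages that lack the header.
            received[message_id].append(message['X-List-Received-Date'])
    finally:
        mbox.close()
    # Keep only the IDs that were seen more than once.
    return {mid: dates for mid, dates in received.items() if len(dates) > 1}
```

Calling find_duplicate_ids('../notmuch.mbox') would return a dict keyed
by Message-ID, which still leaves the choice of which copy to keep as a
manual decision, as it was in the session above.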