unofficial mirror of notmuch@notmuchmail.org
* Mail in git
@ 2010-02-15  0:29 Stewart Smith
  2010-02-16  9:08 ` Michal Sojka
                   ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Stewart Smith @ 2010-02-15  0:29 UTC (permalink / raw)
  To: notmuch

So... I sketched this out in my head at LCA... and it's taken a bit of
time to actually properly try it.

The problem is:
A simple 'find ~/Maildir' takes 10 minutes, and if you write the
output to a file, it's 88MB+.

There are "only" about 900,000 entries there. But this means 900,000
files, which is a non-trivial number. Some mail folders are quite
large too.

Some of this problem could just be solved by using notmuch a bit
differently (folder per month for example).

However... this is a one-way change and going back would be very
tricky.

There's also the backup problem. Iterating through ~1 million inodes
takes a *LONG* time. Restoring them takes even longer (think about
writing all that data to the file system journal).

Historically, if I was running a backup, I couldn't really use my
laptop; it'd be saturated with disk IO performing the file system
dump. It would also take many hours.

Restoring from backup? About 8 hours.

An observation is that mail never changes. It may be reclassified (and
that's what notmuch is for), but it never changes.

We really just want a way to store and access many many many small
blobs of data that never change.

It turns out git is pretty good at that. Underneath, we could just use
it as an object store (a quick git-hash-object and git-cat-file test
confirmed this to be pretty simple to do). Even better, since a lot
of mail is fairly similar, we can use delta compression between mail
messages to reduce the storage space. Git is pretty good at that too.
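
As a rough illustration of the object-store idea, here is a minimal
Python sketch of what git-hash-object -w and git-cat-file do for a
blob: the object id is the SHA-1 of a "blob <size>\0" header plus the
content, and the loose object is that same byte string zlib-compressed
under objects/<first two hex digits>/<rest>. The function names and
the objects_dir parameter are made up for this example, and it assumes
a classic SHA-1 repository. It covers loose objects only; packs and
delta compression are not shown.

```python
import hashlib
import os
import zlib


def blob_id(data: bytes) -> str:
    """Return the id git assigns to a blob with this content (SHA-1 repos)."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def write_blob(objects_dir: str, data: bytes) -> str:
    """Store data as a loose object, roughly like `git hash-object -w`."""
    oid = blob_id(data)
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    if not os.path.exists(path):  # identical content is stored only once
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(zlib.compress(b"blob %d\0" % len(data) + data))
    return oid


def read_blob(objects_dir: str, oid: str) -> bytes:
    """Read a blob back, roughly like `git cat-file blob <oid>`."""
    with open(os.path.join(objects_dir, oid[:2], oid[2:]), "rb") as f:
        raw = zlib.decompress(f.read())
    header, _, body = raw.partition(b"\0")
    if header != b"blob %d" % len(body):
        raise ValueError("corrupt or non-blob object: " + oid)
    return body
```

Because the id depends only on the content, a message that never
changes keeps the same id forever, and a duplicate copy costs nothing
extra to store.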

A few giant git packs will be much quicker to back up and restore than
1 million files.

So... I wrote a script to test it....

$ time perl /home/stewart/evenless.pl /home/stewart/Maildir/

real    841m41.491s
user    491m3.200s
sys     261m58.080s

That takes a 15GB Maildir down to a 3.7GB git repo.

The algorithm of evenless.pl is basically:
1 get next directory entry
2 if is directory, recurse into it
3 write item to git (git hash-object -w)
4 add item to tree object
5 if number of items written since the last pack = 1000
  5.1 make pack of last 1000 items
6 goto 1
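
The loop above could be sketched in Python roughly as follows. This is
not evenless.pl (which is Perl and not shown here), just an
illustration of the same control flow. The names import_maildir,
write_object and make_pack are invented for the example: write_object
stands in for "git hash-object -w <path>" and returns the object id,
and make_pack stands in for whatever builds a pack from a batch of
ids. The final flush of a partial batch is an assumption; the steps
above don't say what happens to the leftover items.

```python
import os
from typing import Callable, Dict, List, Union

# A tree maps an entry name to an object id (file) or a nested tree (directory).
Tree = Dict[str, Union[str, "Tree"]]


def import_maildir(
    root: str,
    write_object: Callable[[str], str],
    make_pack: Callable[[List[str]], None],
    batch_size: int = 1000,
) -> Tree:
    """Walk root, store every file, and pack the stored ids in batches."""
    pending: List[str] = []  # ids written since the last pack

    def walk(directory: str) -> Tree:
        tree: Tree = {}
        for entry in sorted(os.scandir(directory), key=lambda e: e.name):
            if entry.is_dir(follow_symlinks=False):
                tree[entry.name] = walk(entry.path)      # step 2: recurse
            elif entry.is_file(follow_symlinks=False):
                oid = write_object(entry.path)           # step 3: write item
                tree[entry.name] = oid                   # step 4: add to tree
                pending.append(oid)
                if len(pending) >= batch_size:           # step 5: pack a batch
                    make_pack(list(pending))
                    pending.clear()
        return tree

    tree = walk(root)
    if pending:  # assumption: pack whatever is left at the end
        make_pack(list(pending))
    return tree
```

Passing the two operations in as callables keeps the walk itself
independent of git, so it can be tried against a small test directory
with stand-in functions before pointing it at a real repository.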

$ git count-objects -v
count: 479
size: 27680
in-pack: 873109
packs: 1084
size-pack: 3746219
prune-packable: 0
garbage: 0
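
To read those numbers: git reports size and size-pack in KiB. A small
Python sketch that parses the report and converts the figures (the
function names are made up for this example):

```python
def parse_count_objects(text: str) -> dict:
    """Parse `git count-objects -v` output into {field: int}."""
    stats = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        stats[key.strip()] = int(value)
    return stats


def summarize(stats: dict) -> dict:
    """Convert the KiB figures and compute the average pack population."""
    return {
        "pack_gib": stats["size-pack"] * 1024 / 2**30,
        "loose_mib": stats["size"] * 1024 / 2**20,
        "objects_per_pack": stats["in-pack"] / stats["packs"],
    }


REPORT = """\
count: 479
size: 27680
in-pack: 873109
packs: 1084
size-pack: 3746219
prune-packable: 0
garbage: 0
"""

if __name__ == "__main__":
    for name, value in summarize(parse_count_objects(REPORT)).items():
        print(f"{name}: {value:.2f}")
```

For the report above this works out to about 3.57 GiB in packs (about
3.84 x 10^9 bytes), about 27 MiB in loose objects, and an average of
roughly 805 objects per pack.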

If I did a "git checkout", about 8 hours later I'd have a directory
tree exactly the same as my Maildir.

Why didn't I just git-add everything? I didn't exactly feel like
creating another giant copy of my mail (that also takes a long time).

What about adding more mail to the archive?

The way I see it, you use a Maildir for day-to-day mail (e.g.
delivery) and every so often you run some magic command that takes old
mail out of the Maildir and stores it in the git repo.
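
The "magic command" doesn't exist yet. Its first half, picking out the
old mail, might look something like this Python sketch. Everything
here is an assumption made for illustration: the name old_messages,
using file mtime as the age of a message, and looking only at the cur/
and new/ subdirectories of a single Maildir folder (tmp/ is skipped
because deliveries there are still in progress). Storing the selected
files in git and removing them from the Maildir is left out.

```python
import os
import time
from typing import Iterator, Optional


def old_messages(
    maildir: str, max_age_days: float, now: Optional[float] = None
) -> Iterator[str]:
    """Yield paths of messages whose mtime is older than max_age_days."""
    cutoff = (time.time() if now is None else now) - max_age_days * 86400
    for sub in ("cur", "new"):  # skip tmp/: deliveries still in progress
        folder = os.path.join(maildir, sub)
        if not os.path.isdir(folder):
            continue
        for entry in os.scandir(folder):
            if entry.is_file() and entry.stat().st_mtime < cutoff:
                yield entry.path
```

A real archiver would only delete a message from the Maildir after
confirming that the object had been written to the repo.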

Next step?

Teach notmuch to read mail out of it and add it to an index
(oh, and add some kind of verification and error checking when creating
the git repo).
-- 
Stewart Smith


