From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from localhost (localhost [127.0.0.1])
 by olra.theworths.org (Postfix) with ESMTP id 14EF7431FBC
 for ; Wed, 17 Feb 2010 02:07:32 -0800 (PST)
X-Virus-Scanned: Debian amavisd-new at olra.theworths.org
X-Spam-Flag: NO
X-Spam-Score: -0.703
X-Spam-Level:
X-Spam-Status: No, score=-0.703 tagged_above=-999 required=5
 tests=[AWL=-0.704, BAYES_50=0.001] autolearn=ham
Received: from olra.theworths.org ([127.0.0.1])
 by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id lOf9puRcGlLg for ;
 Wed, 17 Feb 2010 02:07:31 -0800 (PST)
Received: from kaylee.flamingspork.com (kaylee.flamingspork.com [74.207.245.61])
 by olra.theworths.org (Postfix) with ESMTP id 25CBD431FAE
 for ; Wed, 17 Feb 2010 02:07:31 -0800 (PST)
Received: from willster (localhost [127.0.0.1])
 by kaylee.flamingspork.com (Postfix) with ESMTPS id 80F626396;
 Wed, 17 Feb 2010 10:04:26 +0000 (UTC)
Received: by willster (Postfix, from userid 1000)
 id 431AA10FB47A; Wed, 17 Feb 2010 21:07:28 +1100 (EST)
From: Stewart Smith
To: Ben Gamari , notmuch
In-Reply-To: <87ocjok8yo.fsf@willster.local.flamingspork.com>
References: <20100215002914.GA22402@flamingspork.com>
 <1266347128-sup-7796@ben-laptop>
 <87ocjok8yo.fsf@willster.local.flamingspork.com>
Date: Wed, 17 Feb 2010 21:07:28 +1100
Message-ID: <87mxz8jhun.fsf@willster.local.flamingspork.com>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
Subject: Re: Mail in git
X-BeenThere: notmuch@notmuchmail.org
X-Mailman-Version: 2.1.13
Precedence: list
List-Id: "Use and development of the notmuch mail system."
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Wed, 17 Feb 2010 10:07:32 -0000

--=-=-=

On Wed, 17 Feb 2010 11:21:51 +1100, Stewart Smith wrote:
> Using fast-import is interesting. Does it update the working tree?
> The big thing I wanted to avoid was creating a working tree (another million
> inodes being created is not ever what I need)
>
> Also interesting is the mention of creating packs on the fly... this
> could save the time in first writing the object and then packing it (as
> my script does).
>
> I'm going to play with this....

and I did.

Good news... on my mailstore (which, as I've previously mentioned, takes
about 10 minutes to run 'du' over, about the same time as 'notmuch new'
takes), using the (attached) evenless.pl to create a single commit with
everything in it:

$ du -sh .git
3.4G    .git

Down from a whopping 14-15GB!!!

My previous effort (git-write-object, create pack every 1000 messages,
rinse, repeat) took all night and got to 3.7GB. This took only 108
minutes. In both cases, I was creating the repository on another spindle
(USB2.0 disk attached to my laptop).

git-ls-tree and git-cat-file both work for listing and getting objects.

The next thing to think about is adding objects as they come in...
creating a new commit with just an added file should be pretty simple
and easy... but this means we get to keep a "revision history" of the
mailstore, which is *possibly* not ideal in terms of storage efficiency
(I'll do a trial with mine of doing one message at a time and seeing
what the end size is).

However... commit per added mail (or mails) does give us the advantage
of a really well documented and tested backup system :)

Deleting could be hard... if we actually want the objects to go away in
a "permanent" way (not just no longer be referenced).
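
A minimal sketch of what "a new commit with just an added file" could
look like as a fast-import stream. This is not part of evenless.pl: the
function name add_mail_stream, the committer address and the example
path are all made up for illustration, and the date uses fast-import's
default raw format (seconds since the epoch plus a timezone offset)
rather than rfc2822. The stream is only built as a string here, so
nothing touches a repository until it is piped into git.

```perl
#!/usr/bin/perl -w
use strict;

# Build a fast-import stream that adds ONE message on top of the current
# tip of $branch. Nothing is sent to git here; the caller is expected to
# pipe the returned string into "git fast-import".
# NOTE: paths containing newlines or starting with a double quote would
# need C-style quoting, which this sketch does not handle.
sub add_mail_stream
{
    my ($branch, $ident, $when, $path, $content)= @_;
    my $msg= "add $path\n";

    my $stream= "commit refs/heads/$branch\n";
    $stream.= "committer $ident $when\n";
    $stream.= "data ".length($msg)."\n".$msg;
    # "from <ref>^0" makes the new commit a child of the existing branch
    # tip, so the tree from the bulk import is kept and one path is added
    $stream.= "from refs/heads/$branch^0\n";
    # "inline" means the blob data follows the M line directly (no mark)
    $stream.= "M 0644 inline $path\n";
    $stream.= "data ".length($content)."\n".$content."\n";
    return $stream;
}

# Example with hypothetical values: ident, path and message are made up.
my $mail= "From: someone\@example.org\nSubject: hi\n\nbody\n";
print add_mail_stream('master', 'EvenLess <evenless@example.org>',
                      time()." +0000", 'cur/1266401248.M1P1.host:2,S',
                      $mail);
```

Saved under a name of your choosing (add_mail.pl is only a suggestion)
and run inside the repository, this would be used as
"perl add_mail.pl | git fast-import". Whether thousands of such
one-message commits pack as well as the single bulk commit is exactly
the open question above; an occasional repack would probably still be
wanted.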
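For the stats nerds:

$ time perl /home/stewart/evenless/evenless.pl /home/stewart/Maildir/INBOX
git-fast-import statistics:
---------------------------------------------------------------------
Alloc'd objects:     785000
Total objects:       781813 (     79023 duplicates                  )
      blobs  :       781363 (     79023 duplicates     708627 deltas)
      trees  :          449 (         0 duplicates          0 deltas)
      commits:            1 (         0 duplicates          0 deltas)
      tags   :            0 (         0 duplicates          0 deltas)
Total branches:           1 (         1 loads     )
      marks:        1048576 (    860386 unique    )
      atoms:         860557
Memory total:        182780 KiB
       pools:        152116 KiB
     objects:         30664 KiB
---------------------------------------------------------------------
pack_report: getpagesize()            =       4096
pack_report: core.packedGitWindowSize = 1073741824
pack_report: core.packedGitLimit      = 8589934592
pack_report: pack_used_ctr            =          1
pack_report: pack_mmap_calls          =          1
pack_report: pack_open_windows        =          1 /          1
pack_report: pack_mapped              =  388496447 /  388496447
---------------------------------------------------------------------

real    107m43.130s
user    45m25.430s
sys     2m49.440s

--=-=-=
Content-Type: text/x-perl
Content-Disposition: inline; filename=evenless.pl
Content-Description: evenless.pl: maildir to git using fast-import

#!/usr/bin/perl -w
# evenless.pl: import a maildir into git using fast-import, without
# ever creating a working tree. Run from inside the target repository:
#   perl evenless.pl /path/to/Maildir

use strict;

my $FILES= "";            # accumulated "M 0644 :<mark> <path>" lines
my $mark= 1;              # next fast-import mark to hand out
my $stripdir= $ARGV[0];   # prefix removed from the paths stored in git
defined $stripdir or die "usage: $0 maildir\n";

sub fastimport_blobs ($);

# Emit one blob per file under $dirname, recursing into subdirectories.
sub fastimport_blobs ($)
{
    my $dirname= shift @_;
    opendir (my $dirhandle, $dirname) or die "opendir $dirname: $!";
    foreach (readdir $dirhandle)
    {
        next if /^\.\.?$/;
        # skip mail client index/summary files and the notmuch database
        next if /\.cmeta$/;
        next if /\.ibex\.index$/;
        next if /\.ibex\.index\.data$/;
        next if /\.ev-summary$/;
        next if /\.ev-summary-meta$/;
        next if /\.notmuch$/;
        if (-d $dirname.'/'.$_)
        {
            print STDERR "Recursing into $_/ ";
            fastimport_blobs($dirname.'/'.$_);
            print STDERR "\n";
        }
        else
        {
            # slurp the message as raw bytes; 3-arg open copes with
            # odd characters in file names
            open (my $filein, '<', "$dirname/$_")
                or die "open $dirname/$_: $!";
            binmode $filein;
            my $content= do { local $/; <$filein> };
            close $filein;
            $content= "" unless defined $content;

            # use the length actually read, not the stat() size, so the
            # data header stays right if the file changed under us
            print FASTIMPORT "blob\n";
            print FASTIMPORT "mark :$mark\n";
            print FASTIMPORT "data ".length($content)."\n";
            print FASTIMPORT $content;
            print FASTIMPORT "\n";

            # store the path relative to the maildir given on the
            # command line; \Q..\E stops regex metacharacters in the
            # path from being interpreted
            my $storedir= "$dirname/$_";
            $storedir=~ s/^\Q$stripdir\E//;
            $storedir=~ s/^\/+//;
            $FILES.="M 0644 :$mark $storedir\n";
            $mark++;
        }
    }
    closedir $dirhandle;
}

open (FASTIMPORT, "| git fast-import --date-format=rfc2822")
    or die "cannot start git fast-import: $!";
binmode FASTIMPORT;

fastimport_blobs($ARGV[0]);

# a single commit that references every blob written above
print FASTIMPORT "commit refs/heads/master\n";
# NOTE: fast-import requires "committer Name <email> date"; the <email>
# part appears to have been stripped from this archived copy and has to
# be restored before running.
print FASTIMPORT "committer EvenLess ".`date -R`;
print FASTIMPORT "data 11\n";
print FASTIMPORT "mail commit\n";
print FASTIMPORT $FILES;
print FASTIMPORT "\n";

close (FASTIMPORT) or die "git fast-import failed\n";

--=-=-=

-- 
Stewart Smith

--=-=-=--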