unofficial mirror of notmuch@notmuchmail.org
From: Stewart Smith <stewart@flamingspork.com>
To: Ben Gamari <bgamari@gmail.com>, notmuch <notmuch@notmuchmail.org>
Subject: Re: Mail in git
Date: Wed, 17 Feb 2010 21:07:28 +1100
Message-ID: <87mxz8jhun.fsf@willster.local.flamingspork.com>
In-Reply-To: <87ocjok8yo.fsf@willster.local.flamingspork.com>

[-- Attachment #1: Type: text/plain, Size: 3118 bytes --]

On Wed, 17 Feb 2010 11:21:51 +1100, Stewart Smith <stewart@flamingspork.com> wrote:
> Using fast-import is interesting. Does it update the working tree? The
> big thing I wanted to avoid was creating a working tree (another million
> inodes being created is not ever what I need)
> 
> Also interesting is the mention of creating packs on the fly... this
> could save the time in first writing the object and then packing it (as
> my script does).
> 
> I'm going to play with this....

And I did.

Good news. On my mailstore (which, as I've previously mentioned, takes
about 10 minutes to run 'du' over, about the same time as 'notmuch new'
takes), using the (attached) evenless.pl to create a single commit with
everything in it:

$ du -sh .git
3.4G	.git

Down from a whopping 14-15GB!!!

My previous effort (git-write-object, create a pack every 1000 messages,
rinse, repeat) took all night and got to 3.7GB.

This took only 108 minutes.

In both cases, I was creating the repository on another spindle (a USB 2.0
disk attached to my laptop).

git-ls-tree and git-cat-file both work for listing and getting objects.
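
As a rough illustration of that, here is a minimal sketch that imports one
made-up message into a throwaway repository and reads it back. The
repository location, the path cur/msg1 and the message text are all
hypothetical; only the git commands themselves are real.

```shell
#!/bin/sh
set -eu

# Throwaway repository standing in for the imported mailstore (hypothetical).
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Import a single fake message, using the same blob/mark/commit stream shape
# that evenless.pl generates.
git fast-import --quiet <<'EOF'
blob
mark :1
data 12
Subject: hi

commit refs/heads/master
committer EvenLess <evenless@evenless> 1266401248 +1100
data 11
mail commit
M 644 :1 cur/msg1

EOF

# List every stored message path without checking anything out.
listing=$(git ls-tree -r --name-only refs/heads/master)

# Fetch one message straight from the object store.
first_line=$(git cat-file -p refs/heads/master:cur/msg1 | head -n 1)

# fast-import never touches the working tree, so only .git exists here.
entries=$(ls -A)

echo "$listing"
echo "$first_line"
```

Reading through `<ref>:<path>` means a consumer never needs the million
inodes of a checkout; it only needs the path names from git ls-tree.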

The next thing to think about is adding objects as they come in. Creating
a new commit with just an added file should be pretty simple and easy, but
it means we get to keep a "revision history" of the mailstore, which is
*possibly* not ideal in terms of storage efficiency (I'll do a trial with
mine, adding one message at a time and seeing what the end size is).

However, a commit per added mail (or batch of mails) does give us the
advantage of a really well documented and tested backup system :)

Deleting could be hard if we actually want the objects to go away in a
"permanent" way (not just no longer be referenced).
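
For the single-commit layout, one way this might be done is to rebuild the
branch as a fresh root commit without the message, then expire the reflogs
and prune. The sketch below assumes a small repository with two made-up
messages (cur/keep and cur/unwanted); with a real per-mail history, every
commit that ever referenced the blob would have to be rewritten as well.

```shell
#!/bin/sh
set -eu

# Identity for the rewritten commit (hypothetical values).
export GIT_AUTHOR_NAME=EvenLess GIT_AUTHOR_EMAIL=evenless@evenless
export GIT_COMMITTER_NAME=EvenLess GIT_COMMITTER_EMAIL=evenless@evenless

repo=$(mktemp -d)
cd "$repo"
git init -q .

git fast-import --quiet <<'EOF'
commit refs/heads/master
committer EvenLess <evenless@evenless> 1266401248 +1100
data 11
mail commit
M 644 inline cur/keep
data 14
Subject: keep

M 644 inline cur/unwanted
data 14
Subject: oops

EOF

# Remember the blob we want gone, so we can check for it afterwards.
blob=$(git rev-parse refs/heads/master:cur/unwanted)

# Build a tree without the message, using only the index (no checkout).
git read-tree refs/heads/master
git update-index --force-remove cur/unwanted
tree=$(git write-tree)

# New parentless commit, so the old history no longer references the blob.
new=$(git commit-tree "$tree" -m "mail commit")
git update-ref refs/heads/master "$new"

# Drop the remaining references to the old commit, then prune for real.
git reflog expire --expire=now --all
git gc -q --prune=now

if git cat-file -e "$blob" 2>/dev/null; then gone=no; else gone=yes; fi
remaining=$(git ls-tree -r --name-only refs/heads/master)
echo "unwanted blob gone: $gone; remaining: $remaining"
```

The repack in git gc rewrites the whole pack, so on a multi-gigabyte
mailstore this is anything but a cheap operation.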

For the stats nerds:

$ time perl /home/stewart/evenless/evenless.pl /home/stewart/Maildir/INBOX

git-fast-import statistics:
---------------------------------------------------------------------
Alloc'd objects:     785000
Total objects:       781813 (     79023 duplicates                  )
      blobs  :       781363 (     79023 duplicates     708627 deltas)
      trees  :          449 (         0 duplicates          0 deltas)
      commits:            1 (         0 duplicates          0 deltas)
      tags   :            0 (         0 duplicates          0 deltas)
Total branches:           1 (         1 loads     )
      marks:        1048576 (    860386 unique    )
      atoms:         860557
Memory total:        182780 KiB
       pools:        152116 KiB
     objects:         30664 KiB
---------------------------------------------------------------------
pack_report: getpagesize()            =       4096
pack_report: core.packedGitWindowSize = 1073741824
pack_report: core.packedGitLimit      = 8589934592
pack_report: pack_used_ctr            =          1
pack_report: pack_mmap_calls          =          1
pack_report: pack_open_windows        =          1 /          1
pack_report: pack_mapped              =  388496447 /  388496447
---------------------------------------------------------------------


real	107m43.130s
user	45m25.430s
sys	2m49.440s



[-- Attachment #2: evenless.pl: maildir to git using fast-import --]
[-- Type: text/x-perl, Size: 1413 bytes --]

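#!/usr/bin/perl -w

# evenless.pl: import a maildir into git as one commit via git fast-import.
# No working tree is created; blobs are streamed straight into a pack.

use strict;

my $FILES = "";   # "M <mode> :<mark> <path>" lines for the final commit
my $mark = 1;     # next fast-import mark to hand out

die "usage: $0 <maildir>\n" unless @ARGV == 1;
my $stripdir = $ARGV[0];

sub fastimport_blobs ($);
sub fastimport_blobs ($)
{
    my $dirname = shift @_;

    opendir(my $dirhandle, $dirname) or die "opendir $dirname: $!";
    foreach (readdir $dirhandle)
    {
        next if /^\.\.?$/;
        # Skip index and summary files left behind by mail clients.
        next if /\.cmeta$/;
        next if /\.ibex\.index$/;
        next if /\.ibex\.index\.data$/;
        next if /\.ev-summary$/;
        next if /\.ev-summary-meta$/;
        next if /\.notmuch$/;

        if (-d $dirname.'/'.$_)
        {
            print STDERR "Recursing into $_/ ";
            fastimport_blobs($dirname.'/'.$_);
            print STDERR "\n";
        }
        else
        {
            # Slurp the file as raw bytes and size the blob from what was
            # actually read, so a short read cannot corrupt the stream.
            open(FILEIN, "<", "$dirname/$_") or die "open $dirname/$_: $!";
            binmode FILEIN;
            my $content = do { local $/; <FILEIN> };
            close FILEIN;
            $content = "" unless defined $content;
            print FASTIMPORT "blob\n";
            print FASTIMPORT "mark :$mark\n";
            print FASTIMPORT "data ".length($content)."\n";
            print FASTIMPORT $content;
            print FASTIMPORT "\n";
            my $storedir = "$dirname/$_";
            $storedir =~ s/^\Q$stripdir\E//;   # treat the prefix literally
            $storedir =~ s/^\/+//;
            $FILES .= "M 0644 :$mark $storedir\n";
            $mark++;
        }
    }
    closedir $dirhandle;
}

open(FASTIMPORT, "|-", "git", "fast-import", "--date-format=rfc2822")
    or die "cannot run git fast-import: $!";
binmode FASTIMPORT;

fastimport_blobs($ARGV[0]);

# One commit referencing every blob imported above.
print FASTIMPORT "commit refs/heads/master\n";
print FASTIMPORT "committer EvenLess <evenless\@evenless> ".`date -R`;
print FASTIMPORT "data 11\n";
print FASTIMPORT "mail commit\n";
print FASTIMPORT $FILES;
print FASTIMPORT "\n";

close FASTIMPORT or die "git fast-import failed (exit status $?)\n";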


-- 
Stewart Smith
