From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Sojka
To: Stewart Smith, notmuch@notmuchmail.org
Subject: Re: Mail in git
Date: Tue, 16 Feb 2010 10:08:45 +0100
Message-ID: <87wrydim3m.fsf@steelpick.localdomain>
In-Reply-To: <20100215002914.GA22402@flamingspork.com>
References: <20100215002914.GA22402@flamingspork.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
List-Id: "Use and development of the notmuch mail system."

Hi Stewart,

On Mon, 15 Feb 2010 11:29:14 +1100, Stewart Smith wrote:
> Which goes from a 15GB Maildir to a 3.7GB git repo.

That's quite an interesting ratio. I tried a plain git add and git gc
on my mail store, and the resulting repo was approximately 50% of the
mail store size. Do you think this difference might be caused by the
way you created the packs?

> The algorithm of evenless.pl is basically:
> 1 get next directory entry
> 2 if is directory, recurse into it
> 3 write item to git (git hash-object -w)
> 4 add item to tree object
> 5 if number of items written = 1000
> 5.1 make pack of last 1000 items
> 6 goto 1

So it seems that you have all your mails in a single tree. How long
does it take to calculate the difference between two trees (git
diff-tree --name-status)? This operation will be needed by "notmuch
new" to determine which files/blobs to index.

I suppose it would be better if mail blobs were stored in subtrees. If
a subtree has not changed, git doesn't need to descend into it because
it has the same sha1. I think that storing mails in a structure similar
to .git/objects (i.e. 256 subdirectories named after the first sha1
byte, with file names made of the remaining 19 bytes, that is 38 hex
digits) would be reasonable.

> Next step?
>
> Make notmuch be able to read mail out of it and add it to an index
> (oh, and some kind of verification and error checking about creating
> the git repo).

Besides using git to reduce the size of the mail store, another feature
that comes with git for free is synchronization. For this to work, you
only need to store tags in the repo. What might work is to store tags
in files named .tags. The tags would be stored in these files
alphabetically, one tag per line.
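
To illustrate why a sorted, one-tag-per-line file should merge cleanly, here is a minimal Python sketch of a set-based three-way merge. The function names are hypothetical and exist only for this example; this is not how git's own line-based merge is implemented, just the outcome one would hope it produces for such files.

```python
def parse_tags(text: str) -> set[str]:
    """Parse a tags file: one tag per line, blank lines ignored."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def format_tags(tags: set[str]) -> str:
    """Serialize tags alphabetically, one per line, each newline-terminated."""
    return "".join(f"{tag}\n" for tag in sorted(tags))


def merge_tags(base: str, ours: str, theirs: str) -> str:
    """Three-way merge: apply the additions and removals made on both sides."""
    b, o, t = parse_tags(base), parse_tags(ours), parse_tags(theirs)
    added = (o - b) | (t - b)      # tags that either side introduced
    removed = (b - o) | (b - t)    # tags that either side dropped
    return format_tags((b | added) - removed)


if __name__ == "__main__":
    base = "inbox\nunread\n"
    ours = "inbox\nnotmuch\nunread\n"   # one side added "notmuch"
    theirs = "inbox\n"                  # the other side removed "unread"
    print(merge_tags(base, ours, theirs), end="")
```

Because every tag sits on its own line in a fixed order, independent additions and removals touch different lines, which is what keeps such merges simple.
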
I guess that this format makes it easy to merge tags during
synchronization, even without writing a custom git merge driver.

Another point that must be solved if we would like to use git with
notmuch is the license problem. As Carl pointed out in another thread,
Git is licensed under GPLv2 only whereas notmuch is under GPLv3, and
these licenses are incompatible. So I think we will need some kind of
hooks in notmuch from which external programs (git) will be called.

Cheers,
Michal