Mail in git

unofficial mirror of notmuch@notmuchmail.org

* Mail in git
@ 2010-02-15  0:29 Stewart Smith
  2010-02-16  9:08 ` Michal Sojka
                   ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Stewart Smith @ 2010-02-15  0:29 UTC (permalink / raw)
  To: notmuch

So... I sketched this out in my head at LCA... and it's taken a bit of
time to actually properly try it.

The problem is:
A simple 'find ~/Maildir' takes 10 minutes, and if you write the
output to a file, it's 88MB+.

There are "only" about 900,000 entries there. But this means 900,000
files, which is a non-trivial amount. Some mail folders are quite
large too.

Some of this problem could just be solved by using notmuch a bit
differently (folder per month for example).

However... this is a one-way change and going back would be very
tricky.

There's also the backup problem. Iterating through ~1 million inodes
takes a *LONG* time. Restoring it takes even longer (think about
writing all that data to the file system journal).

Historically, while running a backup, I couldn't really use my
laptop; it'd be saturated with disk IO performing the file system
dump. It would also take many hours.

Restoring from backup? About 8 hours.

An observation is that mail never changes. It may be reclassified (and
that's what notmuch is for), but it never changes.

We really just want a way to store and access many many many small
blobs of data that never change.

It turns out git is pretty good at that. Underneath, we could just use
it as an object store (a simple git-hash-object and git-cat-file test
confirmed this to be pretty simple to do). Even better, since a lot
of mail is fairly similar, we can use delta compression between mail
messages to reduce the storage space. Git is pretty good at that too.
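
As a rough illustration of the object-store idea (a Python sketch, not the
git commands used in the test): git names each blob by the SHA-1 of a short
"blob <size>\0" header plus the content, so identical messages collapse into
a single object. The function name git_blob_id and the sample message are
made up for this sketch.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 that `git hash-object` assigns to a blob."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Hypothetical message, stored twice: both copies map to the same key.
message = b"From: a@example.org\nSubject: hi\n\nhello\n"
store = {}
for copy in (message, message):
    store[git_blob_id(copy)] = copy

print(len(store))  # prints 1
```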

A few giant git packs will be much quicker to back up and restore than
1 million files.

So... I wrote a script to test it....

$ time perl /home/stewart/evenless.pl /home/stewart/Maildir/

real    841m41.491s
user    491m3.200s
sys     261m58.080s

Which goes from a 15GB Maildir to a 3.7GB git repo.

The algorithm of evenless.pl is basically:
1 get next directory entry
2 if is directory, recurse into it
3 write item to git (git hash-object -w)
4 add item to tree object
5 if number of items written = 1000
  5.1 make pack of last 1000 items
6 goto 1
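
The walk-and-batch part of the steps above could be sketched like this in
Python (the real script is Perl and is not shown here). The name
walk_in_batches is hypothetical, and the actual git calls (hash-object,
building the tree, making a pack) are deliberately left out.

```python
import os


def walk_in_batches(root, batch_size=1000):
    """Yield lists of file paths found under root, batch_size at a time."""
    batch = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            batch.append(os.path.join(dirpath, name))
            if len(batch) == batch_size:
                yield batch  # step 5.1 would pack these items here
                batch = []
    if batch:
        yield batch  # leftover items that never filled a whole batch
```

Items in the final short batch would stay as loose objects unless they are
packed separately.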

$ git count-objects -v
count: 479
size: 27680
in-pack: 873109
packs: 1084
size-pack: 3746219
prune-packable: 0
garbage: 0

If I did a "git checkout", about 8 hours later I'd have a directory
tree exactly the same as my maildir.

Why didn't I just git-add everything? I didn't exactly feel like
creating another giant copy of my mail (that also takes a long time).

What about adding more mail to the archive?

So the way I think is that you use a Maildir for day to day mail (e.g.
delivery) and every so often you run some magic command that takes old
mail out of the Maildir and stores it in the git repo.

Next step?

Make notmuch be able to read mail out of it and add it to an index
(oh, and some kind of verification and error checking about creating
the git repo).
-- 
Stewart Smith


* Re: Mail in git
  2010-02-15  0:29 Mail in git Stewart Smith
@ 2010-02-16  9:08 ` Michal Sojka
  2010-02-16 19:06 ` Ben Gamari
  2010-02-17  1:21 ` martin f krafft
  2 siblings, 0 replies; 32+ messages in thread
From: Michal Sojka @ 2010-02-16  9:08 UTC (permalink / raw)
  To: Stewart Smith, notmuch

Hi Stewart,

On Mon, 15 Feb 2010 11:29:14 +1100, Stewart Smith <stewart@flamingspork.com> wrote:
> Which goes from a 15GB Maildir to a 3.7GB git repo.

That's quite an interesting ratio. I've tried a plain git add and git gc on
my mail store and the result was a repo of approximately 50% of mail
store size. Do you think that this difference might be caused by the way
you created the packs?

> 
> The algorithm of evenless.pl is basically:
> 1 get next directory entry
> 2 if is directory, recurse into it
> 3 write item to git (git hash-object -w)
> 4 add item to tree object
> 5 if number of items written = 1000
>   5.1 make pack of last 1000 items
> 6 goto 1

So it seems that you have all your mails in a single tree. How long does it
take to calculate the difference of two trees (git diff-tree --name-status)?
This operation will be needed by "notmuch new" to determine which
files/blobs to index. I suppose it will be better if mail blobs are
stored in subtrees. If a subtree is not changed git doesn't need to
descend to it because it has the same sha1.
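
To make the subtree argument concrete, here is a small Python model (not
git's actual implementation): trees are nested dicts, every tree gets an id
hashed from its entries, and the comparison skips any subtree whose id is
unchanged. The names tree_id and added_paths are invented for this sketch,
and it recomputes ids on the fly, whereas git has them stored.

```python
import hashlib


def tree_id(tree):
    """Hash a nested dict of {name: blob_id or subtree} bottom-up."""
    h = hashlib.sha1()
    for name in sorted(tree):
        child = tree[name]
        child_id = tree_id(child) if isinstance(child, dict) else child
        h.update(f"{name}\0{child_id}\n".encode())
    return h.hexdigest()


def added_paths(old, new, prefix=""):
    """List paths present in new but not in old, skipping equal subtrees."""
    added = []
    for name, child in sorted(new.items()):
        before = old.get(name)
        if isinstance(child, dict):
            before = before if isinstance(before, dict) else {}
            # Same id means same content: no need to descend.
            if tree_id(before) != tree_id(child):
                added += added_paths(before, child, prefix + name + "/")
        elif before is None:
            added.append(prefix + name)
    return added
```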

I think that storing mails in a structure similar to .git/objects
(i.e. 256 subdirectories based on the first sha1 byte and file names
based on the remaining 38 hex digits of the sha1) would be reasonable.
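
A minimal sketch of that layout in Python, assuming 40-digit hex sha1
strings: the first two hex digits (one byte) pick the subdirectory and the
rest is the file name, as .git/objects does. The helper name fanout_path is
hypothetical.

```python
def fanout_path(sha1_hex: str) -> str:
    """Map a 40-digit hex sha1 to '<2 digits>/<38 digits>'."""
    if len(sha1_hex) != 40:
        raise ValueError("expected a 40-digit hex sha1")
    return sha1_hex[:2] + "/" + sha1_hex[2:]


print(fanout_path("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"))
```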

> Next step?
> 
> Make notmuch be able to read mail out of it and add it to an index
> (oh, and some kind of verification and error checking about creating
> the git repo).

Besides using git to compact the size of the mail store, another feature that
comes with git for free is synchronization. For this to work, you only
need to store tags in the repo. What might work is to store tags in
files named <mail-name>.tags. The tags would be stored in the files
alphabetically, one tag per line. I guess that this makes it easy
to merge tags during synchronization even without writing a custom git
merge driver.
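
Here is a small Python sketch of the tag file format and of the result one
would hope a merge produces. It models the merge as a three-way merge on tag
sets, which is an assumption of this sketch and not something git's
line-based merge guarantees; format_tags and merge_tags are made-up names.

```python
def format_tags(tags):
    """Render tags as file content: sorted, one tag per line."""
    return "".join(tag + "\n" for tag in sorted(set(tags)))


def merge_tags(base, ours, theirs):
    """Three-way merge: keep additions and removals from both sides."""
    base, ours, theirs = set(base), set(ours), set(theirs)
    added = (ours - base) | (theirs - base)
    removed = (base - ours) | (base - theirs)
    return sorted((base | added) - removed)


# One side removed "unread", the other added "todo".
merged = merge_tags(["inbox", "unread"], ["inbox"], ["inbox", "todo", "unread"])
print(format_tags(merged), end="")
```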

Another point that must be solved if we would like to use git with
notmuch is the license problem. As Carl pointed out in another
thread, Git is licensed under GPLv2 only whereas notmuch is under GPLv3, and
these licenses are incompatible. So I think we will need some kind of
hooks in notmuch from which external programs (git) will be called.

Cheers,
 Michal


* Re: Mail in git
  2010-02-15  0:29 Mail in git Stewart Smith
  2010-02-16  9:08 ` Michal Sojka
@ 2010-02-16 19:06 ` Ben Gamari
  2010-02-17  0:21   ` Stewart Smith
  2010-02-17  1:21 ` martin f krafft
  2 siblings, 1 reply; 32+ messages in thread
From: Ben Gamari @ 2010-02-16 19:06 UTC (permalink / raw)
  To: notmuch

Excerpts from Stewart Smith's message of Sun Feb 14 19:29:14 -0500 2010:
> So... I sketched this out in my head at LCA... and it's taken a bit of
> time to actually properly try it.
> 
In case anyone wanted to play around with this, I've written up my own
little implementation[1] of a git mail import script. It's quite simple,
but I felt it might be nice to have some public code to play around
with. I get around 80 messages/second on my laptop and things are
definitely quite IO bound. You get 1 commit per message, although I'm
not entirely sure if this is the correct way to do things.

- Ben


[1] git://goldnerlab.physics.umass.edu/git-mail


* Re: Mail in git
  2010-02-16 19:06 ` Ben Gamari
@ 2010-02-17  0:21   ` Stewart Smith
  2010-02-17 10:07     ` Stewart Smith
  0 siblings, 1 reply; 32+ messages in thread
From: Stewart Smith @ 2010-02-17  0:21 UTC (permalink / raw)
  To: Ben Gamari, notmuch

On Tue, 16 Feb 2010 14:06:29 -0500, Ben Gamari <bgamari@gmail.com> wrote:
> Excerpts from Stewart Smith's message of Sun Feb 14 19:29:14 -0500 2010:
> > So... I sketched this out in my head at LCA... and it's taken a bit of
> > time to actually properly try it.
> > 
> In case anyone wanted to play around with this, I've written up my own
> little implementation[1] of a git mail import script. It's quite simple,
> but I felt it might be nice to have some public code to play around
> with. I get around 80 messages/second on my laptop and things are
> definitely quite IO bound. You get 1 commit per message, although I'm
> not entirely sure if this is the correct way to do things.
> 
> [1] git://goldnerlab.physics.umass.edu/git-mail

Using fast-import is interesting. Does it update the working tree? The
big thing I wanted to avoid was creating a working tree (another million
inodes being created is not ever what I need)

Also interesting is the mention of creating packs on the fly... this
could save the time in first writing the object and then packing it (as
my script does).

I'm going to play with this....
-- 
Stewart Smith


* Re: Mail in git
  2010-02-15  0:29 Mail in git Stewart Smith
  2010-02-16  9:08 ` Michal Sojka
  2010-02-16 19:06 ` Ben Gamari
@ 2010-02-17  1:21 ` martin f krafft
  2010-02-17 15:03   ` Ben Gamari
  2010-02-17 23:56   ` Mail in git Stewart Smith
  2 siblings, 2 replies; 32+ messages in thread
From: martin f krafft @ 2010-02-17  1:21 UTC (permalink / raw)
  To: notmuch

also sprach Stewart Smith <stewart@flamingspork.com> [2010.02.15.1329 +1300]:
> What about adding more mail to the archive?
> 
> So the way I think is that you use a Maildir for day to day mail
> (e.g. delivery) and every so often you run some magic command that
> takes old mail out of the Maildir and stores it in the git repo.

Either that, or the other idea we had (which I prefer), which would
basically be:

  evenless-submit — add a new mail (and return a hash ID)
                    and invoke a hook, e.g. to let notmuch know
  evenless-cat    — print the full mail given ID with headers to stdout
  evenless-delete — unlink a mail identified by hash ID
                    and invoke a hook, e.g. to let notmuch know

If we expose the submit and delete functionality at the notmuch
level, then we don't need the hooks for then evenless would be
plumbing.
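
The shape of that interface can be sketched with an in-memory stand-in
(Python, no git involved). The class name MailStore, the hook signature and
the use of a plain SHA-1 of the message as ID are all assumptions of this
sketch; a git-backed version would use git's own object IDs.

```python
import hashlib


class MailStore:
    """Toy content-addressed store with submit/cat/delete and a hook."""

    def __init__(self, hook=None):
        self._objects = {}
        # The hook is called as hook(event, mail_id), e.g. to tell notmuch.
        self._hook = hook or (lambda event, mail_id: None)

    def submit(self, mail: bytes) -> str:
        mail_id = hashlib.sha1(mail).hexdigest()
        self._objects[mail_id] = mail
        self._hook("submit", mail_id)
        return mail_id

    def cat(self, mail_id: str) -> bytes:
        return self._objects[mail_id]

    def delete(self, mail_id: str) -> None:
        del self._objects[mail_id]
        self._hook("delete", mail_id)
```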

Anything to avoid a cronjob would be good, I think.

Then we need a notmuch backend for mutt etc.. For those who still
want to use a regular Maildir, let them use the worktree.

What I am wondering is if (explicit) tags couldn't be represented as
tree-objects with this.

  evenless-link   — link a message object with a tree object
  evenless-unlink – unlink a message object from tree object
    [replaces evenless-delete]

messages would then be deleted whenever using git-gc.

No idea how this would sync if we don't keep ancestry. Otoh, it
would probably not be very expensive to do just that.

notmuch would then only search and provide the hash ID(s); tags
would be a function of storage.

Is it possible to find out all trees that reference a given object
with Git in constant or sub-linear time?

-- 
martin | http://madduck.net/ | http://two.sentenc.es/

"the question of whether computers can think
 is like the question of whether submarines can swim."
                                                 -- edsger w. dijkstra

spamtraps: madduck.bogus@madduck.net


* Re: Mail in git
  2010-02-17  0:21   ` Stewart Smith
@ 2010-02-17 10:07     ` Stewart Smith
  2011-05-21  7:05       ` martin f krafft
  0 siblings, 1 reply; 32+ messages in thread
From: Stewart Smith @ 2010-02-17 10:07 UTC (permalink / raw)
  To: Ben Gamari, notmuch

On Wed, 17 Feb 2010 11:21:51 +1100, Stewart Smith <stewart@flamingspork.com> wrote:
> Using fast-import is interesting. Does it update the working tree? The
> big thing I wanted to avoid was creating a working tree (another million
> inodes being created is not ever what I need)
> 
> Also interesting is the mention of creating packs on the fly... this
> could save the time in first writing the object and then packing it (as
> my script does).
> 
> I'm going to play with this....

and I did.

good news... on my mailstore (which, as I've previously mentioned, takes
about 10 minutes to run 'du' over, about the same time as 'notmuch new'
takes):

using the (attached) evenless.pl to create a single commit with
everything in it:

$ du -sh .git
3.4G	.git

Down from a whopping 14-15GB!!!

My previous effort (git hash-object -w, create pack every 1000 messages,
rinse, repeat) took all night and got to 3.7GB.

This took only 108 minutes.

In both cases, I was creating the repository on another spindle (USB2.0
disk attached to my laptop).

git-ls-tree and git-cat-file both work for listing and getting objects.

The next thing to think about is adding objects as they come
in... creating a new commit with just an added file should be pretty
simple and easy... but this means we get to keep a "revision history" of
the mailstore, which is *possibly* not ideal in terms of storage
efficiency (I'll do a trial with mine of doing one message at a time and
seeing what the end size is).
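
A commit per added mail could look roughly like the stream built below. This
is a Python sketch that only constructs the bytes; it assumes fast-import's
default raw date format ("<seconds> <offset>") and uses "from <branch>^0",
which the git-fast-import documentation describes for continuing an existing
branch in a new run. The function name add_message_stream, the commit message
and the example path are made up.

```python
def add_message_stream(path, content, when,
                       branch="refs/heads/master", first=False):
    """Build a fast-import stream for one commit that adds one message."""
    log = b"add " + path.encode() + b"\n"
    parts = [
        b"commit " + branch.encode() + b"\n",
        b"committer EvenLess <evenless@evenless> " + when.encode() + b"\n",
        b"data %d\n" % len(log), log,
    ]
    if not first:
        # Continue from the branch tip already stored in the repository.
        parts.append(b"from " + branch.encode() + b"^0\n")
    parts += [
        b"M 644 inline " + path.encode() + b"\n",
        b"data %d\n" % len(content), content, b"\n",
    ]
    return b"".join(parts)


stream = add_message_stream("cur/1234.msg", b"hello\n", "1266401220 +1100")
```

The resulting bytes would be written to the standard input of git
fast-import run inside the repository.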

however... commit per added mail (or mails) does give us the advantage
of a really well documented and tested backup system :)

Deleting could be hard... if we actually want the objects to go away in a
"permanent" way (not just no longer be referenced).

for the stats nerds:

$ time perl /home/stewart/evenless/evenless.pl /home/stewart/Maildir/INBOX

git-fast-import statistics:
---------------------------------------------------------------------
Alloc'd objects:     785000
Total objects:       781813 (     79023 duplicates                  )
      blobs  :       781363 (     79023 duplicates     708627 deltas)
      trees  :          449 (         0 duplicates          0 deltas)
      commits:            1 (         0 duplicates          0 deltas)
      tags   :            0 (         0 duplicates          0 deltas)
Total branches:           1 (         1 loads     )
      marks:        1048576 (    860386 unique    )
      atoms:         860557
Memory total:        182780 KiB
       pools:        152116 KiB
     objects:         30664 KiB
---------------------------------------------------------------------
pack_report: getpagesize()            =       4096
pack_report: core.packedGitWindowSize = 1073741824
pack_report: core.packedGitLimit      = 8589934592
pack_report: pack_used_ctr            =          1
pack_report: pack_mmap_calls          =          1
pack_report: pack_open_windows        =          1 /          1
pack_report: pack_mapped              =  388496447 /  388496447
---------------------------------------------------------------------


real	107m43.130s
user	45m25.430s
sys	2m49.440s



[-- Attachment #2: evenless.pl: maildir to git using fast-import --]
[-- Type: text/x-perl, Size: 1413 bytes --]

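#!/usr/bin/perl -w

# Import a Maildir into git as one commit, streaming blobs to git fast-import.

use strict;

# "M <mode> :<mark> <path>" lines, collected for the final commit.
my $FILES= "";

# Next fast-import mark; each blob gets its own.
my $mark= 1;

# Leading directory to strip from the paths stored in git.
my $stripdir= $ARGV[0] or die "usage: evenless.pl MAILDIR\n";

sub fastimport_blobs ($);
sub fastimport_blobs ($)
{
    my $dirname= shift @_;

    opendir (my $dirhandle, $dirname) or die "opendir $dirname: $!";
    foreach (readdir $dirhandle)
    {
	next if /^\.\.?$/;
	# Skip index and summary files left by mail clients and notmuch.
	next if /\.cmeta$/;
	next if /\.ibex\.index$/;
	next if /\.ibex\.index\.data$/;
	next if /\.ev-summary$/;
	next if /\.ev-summary-meta$/;
	next if /\.notmuch$/;

	if (-d $dirname.'/'.$_)
	{
	    print STDERR "Recursing into $_/ ";
	    fastimport_blobs($dirname.'/'.$_);
	    print STDERR "\n";
	}
	else
	{
	    # Read the whole message first so the announced size is exact.
	    open (my $filein, '<', "$dirname/$_") or die "open $dirname/$_: $!";
	    binmode $filein;
	    my $content= do { local $/; <$filein> };
	    close $filein;
	    print FASTIMPORT "blob\n";
	    print FASTIMPORT "mark :$mark\n";
	    print FASTIMPORT "data ".length($content)."\n";
	    print FASTIMPORT $content;
	    # Store the path relative to the directory given on the command line.
	    my $storedir= "$dirname/$_";
	    $storedir=~ s/^\Q$stripdir\E//;
	    $storedir=~ s/^\///;
	    $FILES.="M 0644 :$mark $storedir\n";
	    $mark++;
	}
    }
    closedir $dirhandle;
}

open (FASTIMPORT, "| git fast-import --date-format=rfc2822")
    or die "cannot run git fast-import: $!";
binmode FASTIMPORT;

fastimport_blobs($ARGV[0]);

# One commit that references every blob written above.
print FASTIMPORT "commit refs/heads/master\n";
print FASTIMPORT "committer EvenLess <evenless\@evenless> ".`date -R`;
print FASTIMPORT "data 11\n";
print FASTIMPORT "mail commit\n";
print FASTIMPORT $FILES;
print FASTIMPORT "\n";

close FASTIMPORT or die "git fast-import failed\n";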

-- 
Stewart Smith


* Re: Mail in git
  2010-02-17  1:21 ` martin f krafft
@ 2010-02-17 15:03   ` Ben Gamari
  2010-02-17 19:23     ` Mark Anderson
  2010-02-17 23:56   ` Mail in git Stewart Smith
  1 sibling, 1 reply; 32+ messages in thread
From: Ben Gamari @ 2010-02-17 15:03 UTC (permalink / raw)
  To: notmuch

Excerpts from martin f krafft's message of Tue Feb 16 20:21:01 -0500 2010:
> What I am wondering is if (explicit) tags couldn't be represented as
> tree-objects with this.
> 
>   evenless-link   — link a message object with a tree object
>   evenless-unlink – unlink a message object from tree object
>     [replaces evenless-delete]

I was actually wondering this very thing. I'd just be worried about tags
with large numbers of messages (presumably we would need an All tag,
that would contain a reference to every known message). It seems like
the simple act of adding a message to the repository could turn into an
extremely expensive operation.

Moreover, deleting a message could also be quite expensive as this will
require rewriting all of the tags that reference it. Surely, we would
need to batch these sorts of operations to avoid disastrous performance.

However, even with batching, it seems we would face some pretty serious
scalability issues. I think if we were to implement tag storage in
trees, we'd need to use a multi-level tree. This way we could avoid
rewriting a tree object containing all of the tag's messages on every
change. I apologize if this was already obvious to everyone but me.

> 
> messages would then be deleted whenever using git-gc.
> 
> No idea how this would sync if we don't keep ancestry. Otoh, it
> would probably not be very expensive to do just that.

I think that keeping the ancestry would be quite important and would
come with relatively low overhead given the correct dereferencing of
data structures.

> 
> notmuch would then only search and provide the hash ID(s); tags
> would be a function of storage.
> 
> Is it possible to find out all trees that reference a given object
> with Git in constant or sub-linear time?
> 
I don't believe so. I think this is one of the reasons why git gc is so
expensive.

- Ben


* Re: Mail in git
  2010-02-17 15:03   ` Ben Gamari
@ 2010-02-17 19:23     ` Mark Anderson
  2010-02-17 19:34       ` Ben Gamari
  0 siblings, 1 reply; 32+ messages in thread
From: Mark Anderson @ 2010-02-17 19:23 UTC (permalink / raw)
  To: Ben Gamari, notmuch

On Wed, 17 Feb 2010 10:03:36 -0500, Ben Gamari <bgamari@gmail.com> wrote:
> > notmuch would then only search and provide the hash ID(s); tags
> > would be a function of storage.
> > 
> > Is it possible to find out all trees that reference a given object
> > with Git in constant or sub-linear time?
> > 
> I don't believe so. I think this is one of the reasons why git gc is so
> expensive.

But if we have notmuch as a cache of the tags, then don't we already
know the tree objects that need updating?  Yes, we would probably need
some consistency checks for when things don't work as planned, but in
the common case we ought to always know.

Perhaps I'm misunderstanding these tree objects, and you're suggesting
that we don't even tell notmuch about them.

-Mark

Just poking my nose where it don't belong, since 1984.


* Re: Mail in git
  2010-02-17 19:23     ` Mark Anderson
@ 2010-02-17 19:34       ` Ben Gamari
  2010-02-17 23:52         ` martin f krafft
  0 siblings, 1 reply; 32+ messages in thread
From: Ben Gamari @ 2010-02-17 19:34 UTC (permalink / raw)
  To: Mark Anderson; +Cc: notmuch

Excerpts from Mark Anderson's message of Wed Feb 17 14:23:48 -0500 2010:
> But if we have notmuch as a cache of the tags, then don't we already
> know the tree objects that need updating?  Yes, we would probably need
> some consistency checks for when things don't work as planned, but in
> the common case we ought to always know.
> 
Cached or not, rewriting would still be an incredibly (e.g.
prohibitively or close to it) expensive operation for a large mailstore.

> Perhaps I'm misunderstanding these tree objects, and you're suggesting
> that we don't even tell notmuch about them.
> 
I think it would be unwise to teach notmuch anything about the
underlying store. That would be leaking way too many implementation
details into 

- Ben


* Re: Mail in git
  2010-02-17 19:34       ` Ben Gamari
@ 2010-02-17 23:52         ` martin f krafft
  2010-02-18  0:39           ` Ben Gamari
  0 siblings, 1 reply; 32+ messages in thread
From: martin f krafft @ 2010-02-17 23:52 UTC (permalink / raw)
  To: notmuch

also sprach Ben Gamari <bgamari@gmail.com> [2010.02.18.0834 +1300]:
> Excerpts from Mark Anderson's message of Wed Feb 17 14:23:48 -0500
> 2010:
> > But if we have notmuch as a cache of the tags, then don't we
> > already know the tree objects that need updating?  Yes, we would
> > probably need some consistency checks for when things don't work
> > as planned, but in the common case we ought to always know.
> > 
> Cached or not, rewriting would still be an incredibly (e.g.
> prohibitively or close to it) expensive operation for a large
> mailstore.

Why? Well, it would involve creating n objects and unlinking n objects
for n tags, but it would be constant in the number of messages, no?

> > Perhaps I'm misunderstanding these tree objects, and you're
> > suggesting that we don't even tell notmuch about them.
> > 
> I think it would be unwise to teach notmuch anything about the
> underlying store. That would be leaking way too many
> implementation details into

I agree. Also, it would introduce redundancy.

-- 
martin | http://madduck.net/ | http://two.sentenc.es/
 
"twenty-four hour room-service must be one of the
 premiere achievements of modern civilization."
                                          -- special agent dale cooper
 
spamtraps: madduck.bogus@madduck.net


* Re: Mail in git
  2010-02-17  1:21 ` martin f krafft
  2010-02-17 15:03   ` Ben Gamari
@ 2010-02-17 23:56   ` Stewart Smith
  2010-02-18  1:01     ` Ben Gamari
  1 sibling, 1 reply; 32+ messages in thread
From: Stewart Smith @ 2010-02-17 23:56 UTC (permalink / raw)
  To: martin f krafft, notmuch

On Wed, 17 Feb 2010 14:21:01 +1300, martin f krafft <madduck@madduck.net> wrote:
> What I am wondering is if (explicit) tags couldn't be represented as
> tree-objects with this.
> 
>   evenless-link   — link a message object with a tree object
>   evenless-unlink – unlink a message object from tree object
>     [replaces evenless-delete]

I think it could get expensive for tags with lots of messages.

With my fast-import script, doing the commit (that
referenced... umm.. 800,000+ objects) took a *very* long time.

As far as I understand it, the tree object is stored in full and space
is only reclaimed during repack (due to delta compression).

So if you, say, had the entire history of a high volume list such as
linux-kernel, adding messages could get rather expensive if you
auto-tagged (or autotagged messages with patches or whatever).

> messages would then be deleted whenever using git-gc.
> 
> No idea how this would sync if we don't keep ancestry. Otoh, it
> would probably not be very expensive to do just that.

If we keep ancestry though, we are reusing existing working code for
backup (git-pull :)

Keep in mind that with my tests, the Maildir in git is about a quarter
to a fifth of the size of it in Maildir... so a bit of extra usage per
message isn't as dramatic as it may sound.

> Is it possible to find out all trees that reference a given object
> with Git in constant or sub-linear time?

I don't think so.... but I'm not sure.

-- 
Stewart Smith


* Re: Mail in git
  2010-02-17 23:52         ` martin f krafft
@ 2010-02-18  0:39           ` Ben Gamari
  2010-02-18  1:58             ` martin f krafft
  0 siblings, 1 reply; 32+ messages in thread
From: Ben Gamari @ 2010-02-18  0:39 UTC (permalink / raw)
  To: notmuch

Excerpts from martin f krafft's message of Wed Feb 17 18:52:11 -0500 2010:
> also sprach Ben Gamari <bgamari@gmail.com> [2010.02.18.0834 +1300]:
> > Excerpts from Mark Anderson's message of Wed Feb 17 14:23:48 -0500
> > 2010:
> > > But if we have notmuch as a cache of the tags, then don't we
> > > already know the tree objects that need updating?  Yes, we would
> > > probably need some consistency checks for when things don't work
> > > as planned, but in the common case we ought to always know.
> > > 
> > Cached or not, rewriting would still be an incredibly (e.g.
> > prohibitively or close to it) expensive operation for a large
> > mailstore.
> 
> Why? Well, it would involve creating n objects and unlinking n objects
> for n tags, but it would be constant in the number of messages, no?

Yes, it would be linear in number of tags. I suppose if messages
weren't stored in the top-level tree nodes, then it would still be
linear, although with a slope equal to the reciprocal of the fan-out.
This has the potential to be very reasonable performance-wise.

- Ben


* Re: Mail in git
  2010-02-17 23:56   ` Mail in git Stewart Smith
@ 2010-02-18  1:01     ` Ben Gamari
  2010-02-18  2:00       ` martin f krafft
  0 siblings, 1 reply; 32+ messages in thread
From: Ben Gamari @ 2010-02-18  1:01 UTC (permalink / raw)
  To: notmuch

Excerpts from Stewart Smith's message of Wed Feb 17 18:56:53 -0500 2010:
> On Wed, 17 Feb 2010 14:21:01 +1300, martin f krafft <madduck@madduck.net> wrote:
> > What I am wondering is if (explicit) tags couldn't be represented as
> > tree-objects with this.
> 
> I think it could get expensive for tags with lots of messages.
> 
> As far as I understand it, the tree object is stored in full and space
> is only reclaimed during repack (due to delta compression).
> 
> So if you, say, had the entire history of a high volume list such as
> linux-kernel, adding messages could get rather expensive if you
> auto-tagged (or autotagged messages with patches or whatever).
> 

Well, it's tough to say, but I don't think it's as bad as you think. I
proposed that we could use a tree structure like the following,

                  ╭─msg1
      ╭tagA.list1╶┼─msg2
      │           ╰─msg3
      │
      │           ╭─msg4
      ├tagA.list2╶┼─msg5
      │           ╰─msg6
tagA ╶┤
      │           ╭─msg7
      ├tagA.list3╶┼─msg8
      │           ╰─msg9
      │
      │           ╭─msg10
      ╰tagA.list4╶┼─msg11
                  ╰─msg12

This way, adding a message to, say list3, would only require rewriting
list3 and tagA, which seems pretty reasonable to me. Moreover, we could
make the tree structure as deep as necessary, although we
would need to rewrite a node at every level of the tree, so it's tough
to say how many levels is too many. It could simply be adaptive (e.g.
bisect any nodes with more than N children).
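
A toy model of the two-level layout, in Python rather than git: each tag has
a fixed number of sub-lists, a message goes into the sub-list chosen by the
first hex digit of its id, and every node gets a hash of its entries as a
stand-in for a git tree id. The function names, the fan-out of 16 and the
generated message ids are all assumptions of this sketch.

```python
import hashlib


def node_id(entries):
    """Stand-in for a git tree id: a hash over the sorted entry names."""
    return hashlib.sha1("\n".join(sorted(entries)).encode()).hexdigest()


def bucket_of(msg_id, fanout=16):
    """Choose a sub-list from the first hex digit of the message id."""
    return int(msg_id[0], 16) % fanout


def tag_tree(msg_ids, fanout=16):
    """Return (root_id, [sub-list ids]) for a two-level tag tree."""
    buckets = [[] for _ in range(fanout)]
    for msg_id in msg_ids:
        buckets[bucket_of(msg_id, fanout)].append(msg_id)
    bucket_ids = [node_id(bucket) for bucket in buckets]
    root_id = node_id(f"{i:02x} {bid}" for i, bid in enumerate(bucket_ids))
    return root_id, bucket_ids


msgs = [hashlib.sha1(str(i).encode()).hexdigest() for i in range(1000)]
new_msg = hashlib.sha1(b"new message").hexdigest()
root_before, before = tag_tree(msgs)
root_after, after = tag_tree(msgs + [new_msg])
changed = [i for i in range(16) if before[i] != after[i]]
print(changed == [bucket_of(new_msg)])  # True: only one sub-list is rewritten
```

Adding the message changes one sub-list and the root; the other fifteen
sub-lists keep their ids and would not need to be written again.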

This certainly isn't as simple as the naive approach, but I think it's
the only reasonable approach performance-wise and I don't believe it
should be too tricky.

> > messages would then be deleted whenever using git-gc.
> > 
> > No idea how this would sync if we don't keep ancestry. Otoh, it
> > would probably not be very expensive to do just that.
> 
> If we keep ancestry though, we are reusing existing working code for
> backup (git-pull :)

This is one of the reasons I feel it's important we keep it. And as is
stated below, the storage overhead is minimal.
> 
> Keep in mind that with my tests, the Maildir in git is about a quarter
> to a fifth of the size of it in Maildir... so a bit of extra usage per
> message isn't as dramatic as it may sound.
> 


* Re: Mail in git
  2010-02-18  0:39           ` Ben Gamari
@ 2010-02-18  1:58             ` martin f krafft
  2010-02-18  2:19               ` Ben Gamari
  0 siblings, 1 reply; 32+ messages in thread
From: martin f krafft @ 2010-02-18  1:58 UTC (permalink / raw)
  To: Ben Gamari; +Cc: notmuch

also sprach Ben Gamari <bgamari@gmail.com> [2010.02.18.1339 +1300]:
> Yes, it would be linear in number of tags. I suppose if messages
> weren't stored in the top-level tree nodes, then it would still be
> linear, although with a slope equal to the reciprocal of the fan-out.
> This has the potential to be very reasonable performance-wise.

Messages are never stored in tree nodes; all these do is store
references to objects (blobs) holding messages. I bet you know this,
but I just wanted to make it explicit.

So retagging is really just writing a new tree with a modified list
of references.

-- 
martin | http://madduck.net/ | http://two.sentenc.es/
 
"no survivors? then where do the stories come from I wonder?"
                                               -- captain jack sparrow
 
spamtraps: madduck.bogus@madduck.net


* Re: Mail in git
  2010-02-18  1:01     ` Ben Gamari
@ 2010-02-18  2:00       ` martin f krafft
  2010-02-18  2:11         ` Git ancestry and sync problems (was: Mail in git) martin f krafft
  0 siblings, 1 reply; 32+ messages in thread
From: martin f krafft @ 2010-02-18  2:00 UTC (permalink / raw)
  To: Ben Gamari; +Cc: notmuch

also sprach Ben Gamari <bgamari@gmail.com> [2010.02.18.1401 +1300]:
> > If we keep ancestry though, we are reusing existing working code for
> > backup (git-pull :)
> 
> This is one of the reasons I feel it's important we keep it. And as is
> stated below, the storage overhead is minimal.

Absolutely; Stewart mentioned at LCA to forego the porcelain and
harness the power of the plumbing, and I knew back then that this
would be among the first things of which to convince him once he had
the basic idea out. ;)

-- 
martin | http://madduck.net/ | http://two.sentenc.es/
 
DISCLAIMER: this entire message is privileged communication, intended
for the sole use of its recipients only. If you read it even though
you know you aren't supposed to, you're a poopy-head.
 
spamtraps: madduck.bogus@madduck.net


* Git ancestry and sync problems (was:  Mail in git)
  2010-02-18  2:00       ` martin f krafft
@ 2010-02-18  2:11         ` martin f krafft
  2010-02-18  8:34           ` racin
  0 siblings, 1 reply; 32+ messages in thread
From: martin f krafft @ 2010-02-18  2:11 UTC (permalink / raw)
  To: Ben Gamari, notmuch

also sprach martin f krafft <madduck@madduck.net> [2010.02.18.1500 +1300]:
> > > If we keep ancestry though, we are reusing existing working code for
> > > backup (git-pull :)
> > 
> > This is one of the reasons I feel it's important we keep it. And as is
> > stated below, the storage overhead is minimal.
> 
> Absolutely; Stewart mentioned at LCA to forego the porcelain and
> harness the power of the plumbing, and I knew back then that this
> would be among the first things of which to convince him once he had
> the basic idea out. ;)

Except I fear that as soon as we allow manipulation of the local
store, we'll potentially run into this problem:

  http://notmuchmail.org/pipermail/notmuch/2010/001114.html
  id:20100112045152.GA15275@lapse.rw.madduck.net

-- 
martin | http://madduck.net/ | http://two.sentenc.es/
 
it's as bad as you think, and they are out to get you.
 
spamtraps: madduck.bogus@madduck.net


* Re: Mail in git
  2010-02-18  1:58             ` martin f krafft
@ 2010-02-18  2:19               ` Ben Gamari
  2010-02-18  2:48                 ` nested tag trees (was: Mail in git) martin f krafft
  0 siblings, 1 reply; 32+ messages in thread
From: Ben Gamari @ 2010-02-18  2:19 UTC (permalink / raw)
  To: martin f krafft; +Cc: notmuch

Excerpts from martin f krafft's message of Wed Feb 17 20:58:47 -0500 2010:
> also sprach Ben Gamari <bgamari@gmail.com> [2010.02.18.1339 +1300]:
> > Yes, it would be linear in number of tags. I suppose if messages
> > weren't stored in the top-level tree nodes, then it would still be
> > linear, although with a slope equal to the reciprocal of the fan-out.
> > This has the potential to be very reasonable performance-wise.
> 
> Messages are never stored in tree nodes; all these do are store
> references to objects (blobs) holding messages. I bet you know this,
> but I just wanted to make it explicit.

Yep, I'm aware.
> 
> So retagging is really just writing a new tree with a modified list
> of references.
> 
Certainly. However, if you have a large tag (>100,000 messages), this
list of references could easily be tens of megabytes. For this reason, it
seems like the added overhead of nesting trees would be well worth it.
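
A rough back-of-the-envelope check of that size estimate, as a sketch only. It assumes a flat tree in which every message is one entry of the form "<mode> <name>\0" followed by a 20-byte binary SHA-1, which is how git serialises tree entries; the entry name lengths tried here are assumptions, since the thread does not say how messages would be named. Sizes are uncompressed, before zlib.

```python
# Estimate the uncompressed size of one flat git tree object that lists
# every message carrying a given tag.
MODE = b"100644"   # regular-file mode used for blob entries
SHA1_LEN = 20      # binary SHA-1 stored in each entry


def tree_entry_size(name_len: int) -> int:
    """Bytes used by one entry: mode, space, name, NUL, binary SHA-1."""
    return len(MODE) + 1 + name_len + 1 + SHA1_LEN


def flat_tree_size(n_messages: int, name_len: int) -> int:
    """Total body size of a tree with n_messages entries of equal name length."""
    return n_messages * tree_entry_size(name_len)


if __name__ == "__main__":
    # Hypothetical name lengths: a 40-character hex id, and longer
    # Message-ID-like names.
    for name_len in (40, 80, 200):
        size = flat_tree_size(100_000, name_len)
        print(f"{name_len}-byte names: {size / 1e6:.1f} MB")
```

Under these assumptions a 100,000-message tag costs several megabytes per rewrite with short names and reaches tens of megabytes only with long names or larger tags, so the order of magnitude in the estimate above is plausible.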

- Ben


* nested tag trees (was:  Mail in git)
  2010-02-18  2:19               ` Ben Gamari
@ 2010-02-18  2:48                 ` martin f krafft
  2010-02-18  4:32                   ` martin f krafft
       [not found]                   ` <1266463007-sup-8777@ben-laptop>
  0 siblings, 2 replies; 32+ messages in thread
From: martin f krafft @ 2010-02-18  2:48 UTC (permalink / raw)
  To: Ben Gamari; +Cc: notmuch

[-- Attachment #1: Type: text/plain, Size: 955 bytes --]

also sprach Ben Gamari <bgamari@gmail.com> [2010.02.18.1519 +1300]:
> > So retagging is really just writing a new tree with a modified list
> > of references.
> > 
> Certainly. However, if you have a large tag (>100,000 messages), this
> list of references could easily be tens of megabytes. For this reason, it
> seems like the added overhead of nesting trees would be well worth it.

True — iff we find a way to enumerate trees referencing a given
blob or tree so that we can walk up the hierarchy. I could look
right now, but I am about to cross half of the globe tomorrow, so
I have other things I should rather be doing. Sorry.

-- 
martin | http://madduck.net/ | http://two.sentenc.es/
 
"men always want to be a woman's first love.
 women have a more subtle instinct:
 what they like is to be a man's last romance."
                                                        -- oscar wilde
 
spamtraps: madduck.bogus@madduck.net


* Re: nested tag trees (was:  Mail in git)
  2010-02-18  2:48                 ` nested tag trees (was: Mail in git) martin f krafft
@ 2010-02-18  4:32                   ` martin f krafft
       [not found]                   ` <1266463007-sup-8777@ben-laptop>
  1 sibling, 0 replies; 32+ messages in thread
From: martin f krafft @ 2010-02-18  4:32 UTC (permalink / raw)
  To: Ben Gamari, notmuch

[-- Attachment #1: Type: text/plain, Size: 680 bytes --]

also sprach martin f krafft <madduck@madduck.net> [2010.02.18.1548 +1300]:
> True — iff we find a way to enumerate trees referencing a given
> blob or tree so that we can walk up the hierarchy. I could look
> right now, but I am about to cross half of the globe tomorrow, so
> I have other things I should rather be doing. Sorry.

http://marc.info/?l=git&m=126646636824600&w=2
id:20100218041240.GA4127@lapse.rw.madduck.net

-- 
martin | http://madduck.net/ | http://two.sentenc.es/
 
"although occasionally there is something to be said for solitude."
                                          -- special agent dale cooper
 
spamtraps: madduck.bogus@madduck.net


* Re: nested tag trees (was:  Mail in git)
       [not found]                   ` <1266463007-sup-8777@ben-laptop>
@ 2010-02-18  4:34                     ` martin f krafft
       [not found]                     ` <20100218034613.GD1991@lapse.rw.madduck.net>
  1 sibling, 0 replies; 32+ messages in thread
From: martin f krafft @ 2010-02-18  4:34 UTC (permalink / raw)
  To: notmuch discussion list

[-- Attachment #1: Type: text/plain, Size: 1645 bytes --]

[Taking a private message back to the list with permission]

also sprach Ben Gamari <bgamari@gmail.com> [2010.02.18.1620 +1300]:
> This is a very good point. From what I've read about the database
> format, I can't think of any way that reverse dependencies could be
> easily found, unfortunately. If there really is no way to do this, then
> we could have a problem. I'm not sure rewriting tens of megabytes
> every time you receive a mail message is acceptable.

You would not need to do that, since the messages don't change, and
thus their blobs remain the same.

However, for every manipulation of a message, you would need to
iterate *all* tag trees (O(n)) and update the ones referencing the
message (also O(n)).

The entire process will still be O(n) per message, and O(m×n) for
all:

  messages=[list of messages]
  add_tags=[list of tags to add]
  remove_tags=[list of tags to remove]
  tagtrees=[all tag trees]
  trees_to_update=[]

  for t in remove_tags:
    if intersection(t.tree.children, messages):
      T = new_tree(t.name)
      write_tree(T, t.tree.children - messages)
      write_tree(t.tree, [])
      t.tree = T

  for t in add_tags:
    t.tree = new_tree(t.name)
    rewrite_tree(t.tree, messages)

This can probably be further optimised, but still: it's not quite as
nice as enumerating all parents of a message in O(1) time (which
would still result in O(m×n)).
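
The loop above can be made runnable with a small in-memory model. This is only a sketch under stated assumptions: tag trees are modelled as immutable frozensets of message ids instead of real git tree objects, `retag` is a hypothetical helper name, and the add step merges the messages into the existing tree, which is one reading of `rewrite_tree` in the pseudocode.

```python
def retag(tagtrees, messages, add_tags=(), remove_tags=()):
    """Return a new tag -> frozenset(message ids) mapping.

    Every tag whose membership changes gets a freshly built frozenset,
    mirroring the need to write a new tree object; untouched tags keep
    pointing at their old (shared) frozenset.
    """
    messages = frozenset(messages)
    updated = dict(tagtrees)  # shallow copy; the old mapping is not modified

    for tag in remove_tags:
        old = updated.get(tag, frozenset())
        if old & messages:                  # only rewrite trees that are affected
            updated[tag] = old - messages   # O(size of the tag) per rewrite

    for tag in add_tags:
        old = updated.get(tag, frozenset())
        if not messages <= old:             # skip if all messages are already tagged
            updated[tag] = old | messages

    return updated


if __name__ == "__main__":
    trees = {"inbox": frozenset({"m1", "m2"}), "work": frozenset({"m3"})}
    new = retag(trees, ["m1"], add_tags=["work"], remove_tags=["inbox"])
    print(sorted(new["inbox"]), sorted(new["work"]))
```

Each rewrite still copies the whole set for the affected tag, which is the per-tag cost being discussed; the sketch does nothing to avoid it.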

-- 
martin | http://madduck.net/ | http://two.sentenc.es/
 
"... (ethik und ästhetik sind eins.)"
                                                       -- wittgenstein
 
spamtraps: madduck.bogus@madduck.net


* Re: nested tag trees (was:  Mail in git)
       [not found]                     ` <20100218034613.GD1991@lapse.rw.madduck.net>
@ 2010-02-18  4:44                       ` Ben Gamari
  2010-02-18  4:59                         ` martin f krafft
  0 siblings, 1 reply; 32+ messages in thread
From: Ben Gamari @ 2010-02-18  4:44 UTC (permalink / raw)
  To: martin f krafft; +Cc: notmuch

Excerpts from martin f krafft's message of Wed Feb 17 22:46:13 -0500 2010:
> You ought to have sent to the list, and I want to send mine there
> too, so please give permission.
> 
Oops! Sorry about that. Damn you sup.

> also sprach Ben Gamari <bgamari@gmail.com> [2010.02.18.1620 +1300]:
> > This is a very good point. From what I've read about the database
> > format, I can't think of any way that reverse dependencies could be
> > easily found, unfortunately. If there really is no way to do this, then
> > we could have a problem. I'm not sure rewriting tens of megabytes
> > every time you receive a mail message is acceptable.
> 
> You would not need to do that, since the messages don't change, and
> thus their blobs remain the same.

I believe you would. The problem isn't the messages (well, that's a
problem too), it's the fact that the tree (e.g. tag) objects which
reference the messages are immutable (I believe). This presents us
with the difficult circumstance of being unable to modify a tag after
it has been created. Therefore, as far as I can tell, we need to
rewrite the tag's tree object whenever we add or remove a message.
This was the reason I suggested nesting tag trees, although this only
partially solves the issue.
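
To illustrate why a tree cannot be modified in place, here is a minimal sketch that computes git object ids by hand. It assumes git's standard object encoding (a "<type> <size>\0" header followed by the body, hashed with SHA-1) and flat trees containing only blob entries; the entry names `msg1` and `msg2` are made up for the example.

```python
import hashlib
from typing import Dict


def git_object_id(kind: bytes, body: bytes) -> str:
    """SHA-1 of '<kind> <len>\\0' + body, as git computes object ids."""
    header = kind + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries: Dict[str, str]) -> bytes:
    """Serialise a flat tree: name -> hex blob id, sorted by name bytes."""
    out = b""
    for name in sorted(entries, key=lambda n: n.encode()):
        out += b"100644 " + name.encode() + b"\0" + bytes.fromhex(entries[name])
    return out


if __name__ == "__main__":
    msg1 = git_object_id(b"blob", b"Subject: one\n\nhello\n")
    msg2 = git_object_id(b"blob", b"Subject: two\n\nworld\n")

    before = git_object_id(b"tree", tree_body({"msg1": msg1}))
    after = git_object_id(b"tree", tree_body({"msg1": msg1, "msg2": msg2}))

    # The id is derived from the content, so adding one entry yields a
    # different object; the old tree still exists under its old id.
    print(before != after)
```

Because the id is a hash of the full entry list, "changing" a tag always means writing a complete new tree object, however small the change.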

(Please correct me if I'm wrong about any/all of the above)

> 
> However, for every manipulation of a message, you would need to
> iterate *all* tag trees (O(n)) and update the ones referencing the
> message (also O(n)).
> 
This is definitely an issue.

> This can probably be further optimised, but still: it's not quite as
> nice as enumerating all parents of a message in O(1) time (which
> would still result in O(m×n)).
> 
Yeah, I'm not sure how well this would scale on truly massive
mail stores.

Cheers,

- Ben


* Re: nested tag trees (was:  Mail in git)
  2010-02-18  4:44                       ` Ben Gamari
@ 2010-02-18  4:59                         ` martin f krafft
  2010-02-18  5:10                           ` Ben Gamari
  0 siblings, 1 reply; 32+ messages in thread
From: martin f krafft @ 2010-02-18  4:59 UTC (permalink / raw)
  To: Ben Gamari; +Cc: notmuch

[-- Attachment #1: Type: text/plain, Size: 2295 bytes --]

also sprach Ben Gamari <bgamari@gmail.com> [2010.02.18.1744 +1300]:
> I believe you would. The problem isn't the messages (well, that's
> a problem too), it's the fact that the tree (e.g. tag) objects
> which reference the messages are immutable (I believe). This
> presents us with the difficult circumstance of being unable to
> modify a tag after it has been created. Therefore, as far as I can
> tell, we need to rewrite the tag's tree object whenever we add or
> remove a message. This was the reason I suggested nesting tag
> trees, although this only partially solves the issue.

You are absolutely right, and I think nesting tag trees is an
interesting idea to pursue. It *would* make it impossible to ever
check out the metatree into the filesystem, or rather result in
subdirectories that the user shouldn't need to worry about.

Instead of nested subtrees, think of 16 subtrees forming a level-1
hash table, or 256 for level-2, which really *ought* to be enough.
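
A minimal sketch of that fan-out, assuming buckets are chosen by the leading hex digits of a SHA-1 over the message id (one digit gives 16 buckets, two give 256). The hashing choice and the helper names `bucket_for` and `group_by_bucket` are assumptions for illustration, not something the thread specifies.

```python
import hashlib


def bucket_for(message_id: str, levels: int = 1) -> str:
    """Bucket name: the first `levels` hex digits of SHA-1(message_id)."""
    digest = hashlib.sha1(message_id.encode()).hexdigest()
    return digest[:levels]


def group_by_bucket(message_ids, levels: int = 1):
    """Split one large tag into up to 16**levels small subtrees."""
    buckets = {}
    for mid in message_ids:
        buckets.setdefault(bucket_for(mid, levels), set()).add(mid)
    return buckets


if __name__ == "__main__":
    ids = [f"<msg{i}@example.org>" for i in range(100_000)]
    buckets = group_by_bucket(ids, levels=2)
    largest = max(len(members) for members in buckets.values())
    # Retagging one message now rewrites one subtree of roughly
    # len(ids) / 256 entries, plus the top-level tree of bucket references.
    print(len(buckets), largest)
```

With this layout a retag touches one bucket and the small top-level tree instead of the full 100,000-entry list.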

Anyway, rewriting a tree object is pretty much exactly the same as
removing a line (e.g. a message ID) from a file (e.g. a tag), as
that file would have to be fully rewritten.

> > This can probably be further optimised, but still: it's not
> > quite as nice as enumerating all parents of a message in O(1)
> > time (which would still result in O(m×n)).
> > 
> Yeah, I'm not sure how well this would scale on truly massive mail
> stores.

The more I think about this, the more I want to implement this
between evenless and Git, i.e. as a porcelain layer, since then
I could also use it for vcs-home[0]. In fact, maybe one day we can
store ~ and mail all in one Git repo, with different porcelains for
different use-cases, and notmuch indexing it all anyway. ;)

0. http://vcs-home.madduck.net

Let's continue the technical discussion on the Git list, okay?

http://marc.info/?l=git&m=126646636824600&w=2
id:20100218041240.GA4127@lapse.rw.madduck.net

-- 
martin | http://madduck.net/ | http://two.sentenc.es/

"i hate vulgar realism in literature. the man who could call a spade
 a spade should be compelled to use one. it is the only thing he is
 fit for."
                                                        -- oscar wilde

spamtraps: madduck.bogus@madduck.net


* Re: nested tag trees (was:  Mail in git)
  2010-02-18  4:59                         ` martin f krafft
@ 2010-02-18  5:10                           ` Ben Gamari
  2010-02-19  0:31                             ` martin f krafft
  0 siblings, 1 reply; 32+ messages in thread
From: Ben Gamari @ 2010-02-18  5:10 UTC (permalink / raw)
  To: martin f krafft; +Cc: notmuch

Excerpts from martin f krafft's message of Wed Feb 17 23:59:43 -0500 2010:
> also sprach Ben Gamari <bgamari@gmail.com> [2010.02.18.1744 +1300]:
> > I believe you would. The problem isn't the messages (well, that's
> > a problem too), it's the fact that the tree (e.g. tag) objects
> > which reference the messages are immutable (I believe). This
> > presents us with the difficult circumstance of being unable to
> > modify a tag after it has been created. Therefore, as far as I can
> > tell, we need to rewrite the tag's tree object whenever we add or
> > remove a message. This was the reason I suggested nesting tag
> > trees, although this only partially solves the issue.
> 
> You are absolutely right, and I think nesting tag trees is an
> interesting idea to pursue. It *would* make it impossible to ever
> check out the metatree into the filesystem, or rather result in
> subdirectories that the user shouldn't need to worry about.
> 
Yeah, this is a bit of a bummer. This is really a stretch, but I wonder
if the git folks would accept patches/minor database semantics changes
in the name of making git more flexible as a general purpose object
database. I really doubt it, but you never know.

> Instead of nested subtrees, think of 16 subtrees forming a level-1
> hash table, or 256 for level-2, which really *ought* to be enough.
> 
> Anyway, rewriting a tree object is pretty much exactly the same as
> removing a line (e.g. a message ID) from a file (e.g. a tag), as
> that file would have to be fully rewritten.
> 
This is very true, but what exactly do you mean by this statement?

> > Yeah, I'm not sure how well this would scale on truly massive mail
> > stores.
> 
> The more I think about this, the more I want to implement this
> between evenless and Git, i.e. as a porcelain layer, since then
> I could also use it for vcs-home[0]. In fact, maybe one day we can
> store ~ and mail all in one Git repo, with different porcelains for
> different use-cases, and notmuch indexing it all anyway. ;)

It would be nice if git just didn't attach so many semantics to its
object types and left more up to the porcelain. Git is a fantastic
database; unfortunately, it seems you need to work around a lot of VCS
behavior in order to make use of it in a non-VCS application. Attaching
less meaning to database objects would make things substantially easier.

> Let's continue the technical discussion on the Git list, okay?
> 

Yep. As soon as Majordomo sends me my confirmation.

Cheers,

- Ben


* Re: Git ancestry and sync problems (was:  Mail in git)
  2010-02-18  2:11         ` Git ancestry and sync problems (was: Mail in git) martin f krafft
@ 2010-02-18  8:34           ` racin
  2010-02-18 12:20             ` Jameson Rollins
                               ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: racin @ 2010-02-18  8:34 UTC (permalink / raw)
  To: martin f krafft; +Cc: Ben Gamari, notmuch

----- "martin f krafft" <madduck@madduck.net> wrote:

> Except I fear that as soon as we allow manipulation of the local
> store, we'll potentially run into this problem:
> 
>   http://notmuchmail.org/pipermail/notmuch/2010/001114.html
>   id:20100112045152.GA15275@lapse.rw.madduck.net

I don't understand the problem. Why not just leave all "inbox" mail in a regular Maildir,
and use git only once mail has been explicitly archived? This way, mail is added to git only if we want
to save it, and we rarely (never?) need to remove mail from the git store. Deleting mail
is also much easier to do from the Maildir. This mail flow would make much more sense to me.

Thanks,
Matthieu


* Re: Git ancestry and sync problems (was:  Mail in git)
  2010-02-18  8:34           ` racin
@ 2010-02-18 12:20             ` Jameson Rollins
  2010-02-18 12:47             ` Ben Gamari
  2010-02-18 23:23             ` martin f krafft
  2 siblings, 0 replies; 32+ messages in thread
From: Jameson Rollins @ 2010-02-18 12:20 UTC (permalink / raw)
  To: racin, martin f krafft; +Cc: Ben Gamari, notmuch

[-- Attachment #1: Type: text/plain, Size: 617 bytes --]

On Thu, 18 Feb 2010 09:34:28 +0100 (CET), racin@free.fr wrote:
> I don't understand the problem. Why not just leave all "inbox" mail in a regular Maildir,
> and use git only once mail has been explicitly archived? This way, mail is added to git only if we want
> to save it, and we rarely (never?) need to remove mail from the git store. Deleting mail
> is also much easier to do from the Maildir. This mail flow would make much more sense to me.

I agree that this sounds much simpler and far easier to implement.  Once
you're past any deletion phase, using git sounds much more sensible.

jamie.


* Re: Git ancestry and sync problems (was: Mail in git)
  2010-02-18  8:34           ` racin
  2010-02-18 12:20             ` Jameson Rollins
@ 2010-02-18 12:47             ` Ben Gamari
  2010-02-18 23:23             ` martin f krafft
  2 siblings, 0 replies; 32+ messages in thread
From: Ben Gamari @ 2010-02-18 12:47 UTC (permalink / raw)
  To: racin; +Cc: martin f krafft, notmuch

Excerpts from racin's message of Thu Feb 18 03:34:28 -0500 2010:
> 
> ----- "martin f krafft" <madduck@madduck.net> wrote:
> 
> > Except I fear that as soon as we allow manipulation of the local
> > store, we'll potentially run into this problem:
> > 
> >   http://notmuchmail.org/pipermail/notmuch/2010/001114.html
> >   id:20100112045152.GA15275@lapse.rw.madduck.net
> 
> I don't understand the problem. Why not just leave all "inbox" mail in a regular Maildir,
> and use git only once mail has been explicitly archived? This way, mail is added to git only if we want
> to save it, and we rarely (never?) need to remove mail from the git store. Deleting mail
> is also much easier to do from the Maildir. This mail flow would make much more sense to me.

Yes, this would certainly be much easier, but by doing this you also
pass up on using much of git's utility (although as we've discovered,
using it is non-trivial). In particular, I am very interested in using
git to keep my mail and tags synchronized across multiple machines. This
is an issue that I've always struggled with: While I dislike having my
mail tied to my laptop, there is no good two-way solution for
synchronizing maildirs and metadata across machines.

- Ben


* Re: Git ancestry and sync problems (was:  Mail in git)
  2010-02-18  8:34           ` racin
  2010-02-18 12:20             ` Jameson Rollins
  2010-02-18 12:47             ` Ben Gamari
@ 2010-02-18 23:23             ` martin f krafft
  2 siblings, 0 replies; 32+ messages in thread
From: martin f krafft @ 2010-02-18 23:23 UTC (permalink / raw)
  To: racin; +Cc: Ben Gamari, notmuch

[-- Attachment #1: Type: text/plain, Size: 711 bytes --]

also sprach racin@free.fr <racin@free.fr> [2010.02.18.2134 +1300]:
> I don't understand the problem. Why not just leave all "inbox"
> mail in a regular Maildir, and use git only once mail has been
> explicitly archived?

I don't archive my mail. I would like to be able to bring mails from
the past back into circulation at any time, without duplicating
them, or potentially having to discard the metadata when moving them
into a Maildir.

-- 
martin | http://madduck.net/ | http://two.sentenc.es/
 
"aus der kriegsschule des lebens -
 was mich nicht umbringt, macht mich härter."
                                                 - friedrich nietzsche
 
spamtraps: madduck.bogus@madduck.net


* Re: nested tag trees (was:  Mail in git)
  2010-02-18  5:10                           ` Ben Gamari
@ 2010-02-19  0:31                             ` martin f krafft
  2010-02-19  9:52                               ` Michal Sojka
  0 siblings, 1 reply; 32+ messages in thread
From: martin f krafft @ 2010-02-19  0:31 UTC (permalink / raw)
  To: Ben Gamari; +Cc: notmuch

[-- Attachment #1: Type: text/plain, Size: 1383 bytes --]

also sprach Ben Gamari <bgamari@gmail.com> [2010.02.18.1810 +1300]:
> Yeah, this is a bit of a bummer. This is really a stretch, but I wonder
> if the git folks would accept patches/minor database semantics changes
> in the name of making git more flexible as a general purpose object
> database. I really doubt it, but you never know.

I am pretty sure they won't. Git is a content tracker, not a general
purpose filesystem. It's a bit of a shame.

> > Instead of nested subtrees, think of 16 subtrees forming
> > a level-1 hash table, or 256 for level-2, which really *ought*
> > to be enough.
> > 
> > Anyway, rewriting a tree object is pretty much exactly the same
> > as removing a line (e.g. a message ID) from a file (e.g. a tag),
> > as that file would have to be fully rewritten.
> > 
> This is very true, but what exactly do you mean by this statement?

That any form of tag-to-message mapping will be expensive when you
have a million messages referenced. If you used symlinks like mairix
does, any manipulation would require changes to the directory index,
which — curiously — functions much like the subtree approach you
proposed.

-- 
martin | http://madduck.net/ | http://two.sentenc.es/
 
"the faster i go, the behinder i get."
                                                    -- lewis carroll
 
spamtraps: madduck.bogus@madduck.net


* Re: nested tag trees (was:  Mail in git)
  2010-02-19  0:31                             ` martin f krafft
@ 2010-02-19  9:52                               ` Michal Sojka
  2010-02-19 14:27                                 ` Ben Gamari
  0 siblings, 1 reply; 32+ messages in thread
From: Michal Sojka @ 2010-02-19  9:52 UTC (permalink / raw)
  To: martin f krafft, Ben Gamari; +Cc: notmuch

On Fri, 19 Feb 2010 13:31:15 +1300, martin f krafft <madduck@madduck.net> wrote:
> also sprach Ben Gamari <bgamari@gmail.com> [2010.02.18.1810 +1300]:
> > > Instead of nested subtrees, think of 16 subtrees forming
> > > a level-1 hash table, or 256 for level-2, which really *ought*
> > > to be enough.
> > > 
> > > Anyway, rewriting a tree object is pretty much exactly the same
> > > as removing a line (e.g. a message ID) from a file (e.g. a tag),
> > > as that file would have to be fully rewritten.
> > > 
> > This is very true, but what exactly do you mean by this statement?
> 
> That any form of tag-to-message mapping will be expensive when you
> have a million messages referenced. If you used symlinks like mairix
> does, any manipulation would require changes to the directory index,
> which — curiously — functions much like the subtree approach you
> proposed.

Why do you want to store tag-to-message mapping in git? This is IMHO
perfectly solved by Xapian so storing message-to-tag mapping would be
sufficient, wouldn't it?
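
A small sketch of that idea, assuming only the message-to-tag direction is stored and the reverse lookup is derived from it. A plain dictionary stands in for the index here; this is not Xapian's API, and `invert` is a hypothetical helper name.

```python
from collections import defaultdict


def invert(message_tags):
    """Build tag -> set(message ids) from message id -> iterable of tags."""
    tag_messages = defaultdict(set)
    for message_id, tags in message_tags.items():
        for tag in tags:
            tag_messages[tag].add(message_id)
    return dict(tag_messages)


if __name__ == "__main__":
    # Stored form: one small record per message, so changing a message's
    # tags rewrites only that record, never a list of 100,000 references.
    stored = {
        "<a@example.org>": ["inbox", "notmuch"],
        "<b@example.org>": ["notmuch"],
        "<c@example.org>": [],
    }
    index = invert(stored)
    print(sorted(index["notmuch"]))
```

The derived index can be rebuilt from the stored records at any time, so it does not itself need to live in the object store.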

Michal


* Re: nested tag trees (was: Mail in git)
  2010-02-19  9:52                               ` Michal Sojka
@ 2010-02-19 14:27                                 ` Ben Gamari
  0 siblings, 0 replies; 32+ messages in thread
From: Ben Gamari @ 2010-02-19 14:27 UTC (permalink / raw)
  To: Michal Sojka; +Cc: martin f krafft, notmuch

Excerpts from Michal Sojka's message of Fri Feb 19 04:52:18 -0500 2010:
> Why do you want to store tag-to-message mapping in git? This is IMHO
> perfectly solved by Xapian so storing message-to-tag mapping would be
> sufficient, wouldn't it?
> 
In my case, I would like to keep the entire state of my mail store
synchronized between multiple machines. This includes both messages and
metadata alike. It seems clear that Xapian would still be necessary for
querying in reasonable time, but I feel like tag storage itself should
have support beyond just the indexer.

- Ben


* Re: Mail in git
  2010-02-17 10:07     ` Stewart Smith
@ 2011-05-21  7:05       ` martin f krafft
  2011-05-21  7:25         ` Stewart Smith
  0 siblings, 1 reply; 32+ messages in thread
From: martin f krafft @ 2011-05-21  7:05 UTC (permalink / raw)
  To: Stewart Smith; +Cc: notmuch

[-- Attachment #1: Type: text/plain, Size: 907 bytes --]

also sprach Stewart Smith <stewart@flamingspork.com> [2010.02.17.1107 +0100]:
> On Wed, 17 Feb 2010 11:21:51 +1100, Stewart Smith <stewart@flamingspork.com> wrote:
> > Using fast-import is interesting. Does it update the working tree? The
> > big thing I wanted to avoid was creating a working tree (another million
> > inodes being created is not ever what I need)
> > 
> > Also interesting is the mention of creating packs on the fly... this
> > could save the time in first writing the object and then packing it (as
> > my script does).
> > 
> > I'm going to play with this....
> 
> and I did.

Has anyone worked on this since?

-- 
martin | http://madduck.net/ | http://two.sentenc.es/
 
"one should never allow one's mind
 and one's foot to wander at the same time."
                                -- edward perkins (yes, the librarian)
 
spamtraps: madduck.bogus@madduck.net


* Re: Mail in git
  2011-05-21  7:05       ` martin f krafft
@ 2011-05-21  7:25         ` Stewart Smith
  0 siblings, 0 replies; 32+ messages in thread
From: Stewart Smith @ 2011-05-21  7:25 UTC (permalink / raw)
  To: martin f krafft; +Cc: notmuch

On Sat, 21 May 2011 09:05:54 +0200, martin f krafft <madduck@madduck.net> wrote:
> Has anyone worked on this since?

No, haven't had the cycles... and an SSD helped reduce the urgency a bit.

-- 
Stewart Smith


Thread overview: 32+ messages
2010-02-15  0:29 Mail in git Stewart Smith
2010-02-16  9:08 ` Michal Sojka
2010-02-16 19:06 ` Ben Gamari
2010-02-17  0:21   ` Stewart Smith
2010-02-17 10:07     ` Stewart Smith
2011-05-21  7:05       ` martin f krafft
2011-05-21  7:25         ` Stewart Smith
2010-02-17  1:21 ` martin f krafft
2010-02-17 15:03   ` Ben Gamari
2010-02-17 19:23     ` Mark Anderson
2010-02-17 19:34       ` Ben Gamari
2010-02-17 23:52         ` martin f krafft
2010-02-18  0:39           ` Ben Gamari
2010-02-18  1:58             ` martin f krafft
2010-02-18  2:19               ` Ben Gamari
2010-02-18  2:48                 ` nested tag trees (was: Mail in git) martin f krafft
2010-02-18  4:32                   ` martin f krafft
     [not found]                   ` <1266463007-sup-8777@ben-laptop>
2010-02-18  4:34                     ` martin f krafft
     [not found]                     ` <20100218034613.GD1991@lapse.rw.madduck.net>
2010-02-18  4:44                       ` Ben Gamari
2010-02-18  4:59                         ` martin f krafft
2010-02-18  5:10                           ` Ben Gamari
2010-02-19  0:31                             ` martin f krafft
2010-02-19  9:52                               ` Michal Sojka
2010-02-19 14:27                                 ` Ben Gamari
2010-02-17 23:56   ` Mail in git Stewart Smith
2010-02-18  1:01     ` Ben Gamari
2010-02-18  2:00       ` martin f krafft
2010-02-18  2:11         ` Git ancestry and sync problems (was: Mail in git) martin f krafft
2010-02-18  8:34           ` racin
2010-02-18 12:20             ` Jameson Rollins
2010-02-18 12:47             ` Ben Gamari
2010-02-18 23:23             ` martin f krafft
