unofficial mirror of notmuch@notmuchmail.org
* More ideas about logging.
@ 2011-12-16  2:09 David Bremner
  2011-12-16  4:07 ` Austin Clements
  2011-12-16  7:16 ` Michael Hudson-Doyle
  0 siblings, 2 replies; 10+ messages in thread
From: David Bremner @ 2011-12-16  2:09 UTC (permalink / raw)
  To: Notmuch Mail; +Cc: Olly Betts



Various discussions (mostly on IRC) arising from my jlog proposal, and
from Thomas's mtime proposal
(id:"1323796305-28789-1-git-send-email-schnouki@schnouki.net"),
got me thinking.  So let me know what you think about the following.

The goal here is to log tag adds and deletes (including those implicit
in message deletion) to facilitate tag synchronization.

If we use Xapian to store transaction numbers (much as the
last_thread_id is stored now), then we don't need an external logging
library. We can rely on Xapian to keep other clients from writing.

Assume we have routines read_metadata and write_metadata that read and
write to the xapian database metadata (in real life, I think we might
need to decide in advance exactly what will be written there).

When we create a database:

write_metadata('log_write',0)
write_metadata('log_read',0) // more about this later

To carry out database operation X with logging, we do the following

begin_atomic

    num=read_metadata('log_write')

    X

    // begin dangerzone
    fprintf(logfile,"%d %s",num+1,stuff) // or whatever.

    write_metadata('log_write', num+1)

end_atomic
//end dangerzone

If I understand correctly, then the only way the database and the log
can get out of sync is if this is interrupted in the "dangerzone"
between the start of the log write and the end of the xapian atomic
transaction. But then since we can consider the database authoritative
(since our goal is synchronization rather than recovery), we can discard
those portions of the log. We have to be a bit careful to discard
incomplete log items at the end of the log (maybe a checksum?).
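
A minimal sketch of the scheme above, in Python rather than against the
real Xapian API. A plain dict stands in for the database metadata, the
log is a text file, and every name here (format_entry, append_entry,
recover_log, the CRC32 line layout) is an assumption made for
illustration only. It shows one way entries that are torn, fail their
checksum, or are numbered beyond the committed counter could be dropped
when the database is opened for writing.

```python
import os
import zlib


def format_entry(num, stuff):
    """Build one log line: '<num> <crc32> <stuff>'. The layout is hypothetical."""
    body = f"{num} {stuff}"
    return f"{num} {zlib.crc32(body.encode()):08x} {stuff}\n"


def append_entry(metadata, log_path, stuff):
    """Log one operation, then bump the counter (stand-in for the atomic block)."""
    num = metadata["log_write"]
    with open(log_path, "a") as log:
        log.write(format_entry(num + 1, stuff))  # dangerzone starts here
    metadata["log_write"] = num + 1              # stand-in for the commit


def recover_log(metadata, log_path):
    """Rewrite the log, keeping only valid entries numbered <= log_write."""
    kept = []
    if os.path.exists(log_path):
        with open(log_path) as log:
            for line in log:
                parts = line.rstrip("\n").split(" ", 2)
                if len(parts) != 3 or not line.endswith("\n"):
                    break  # torn final line
                num, _crc, stuff = parts
                if not num.isdigit() or format_entry(int(num), stuff) != line:
                    break  # checksum mismatch
                if int(num) > metadata["log_write"]:
                    break  # logged, but the database never committed it
                kept.append(line)
    with open(log_path, "w") as log:
        log.writelines(kept)
    return len(kept)
```

Rewriting the whole file is the lazy option; a real implementation would
more likely truncate at a remembered offset.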

So how do we discard? Two places. At the opening of the database for
writing, we truncate the log file (if we are very lazy, we can use seek
offsets as transaction indices to facilitate this).

In order to guarantee that each log item is output exactly once, it seems
like we need another counter (or maybe I'm overthinking this)

     read_ptr = read_metadata('log_read')

     write_ptr = read_metadata('log_write')
     
     while (read_ptr < write_ptr) {
         begin_atomic
            s = read(read_ptr)
            do_stuff(s)
            read_ptr++
            write_metadata('log_read', read_ptr);
         end_atomic
     }

     write_metadata('log_write',0) // The log file will be truncated
                                   // on db open
     write_metadata('log_read',0)

I think we can double check if write_ptr <= read_ptr on next db open,
and truncate then if needed.

I think we need to assume that do_stuff is atomic here; I'm not sure how
reasonable or unreasonable that is in practice.
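
The reader loop can be sketched the same way. This is only an
illustration under stated assumptions: a dict stands in for the Xapian
metadata, `entries` is a hypothetical in-memory list in which
entries[i] holds transaction i+1, and the atomicity of do_stuff plus the
counter update is assumed rather than implemented.

```python
def drain_log(metadata, entries, do_stuff):
    """Process transactions log_read+1 .. log_write, saving progress after each."""
    read_ptr = metadata["log_read"]
    write_ptr = metadata["log_write"]
    while read_ptr < write_ptr:
        # begin_atomic (assumed): do_stuff and the counter update
        # succeed or fail together
        do_stuff(entries[read_ptr])
        read_ptr += 1
        metadata["log_read"] = read_ptr
        # end_atomic
    # Everything is consumed, so reset both counters; the log file
    # itself would be truncated on the next open.
    metadata["log_write"] = 0
    metadata["log_read"] = 0
```

If the loop is interrupted, the stored log_read says where to resume, so
an item that was already handled is not handed to do_stuff again.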

I also don't know about the performance implications of reading and
writing the Xapian metadata like a maniac. Of course, if this whole
scheme is fatally flawed, there is no need to worry about performance.

I don't think the actual amount of code involved would be too bad. Of
course, I thought this was going to be a short message too.

d




* Re: More ideas about logging.
  2011-12-16  2:09 More ideas about logging David Bremner
@ 2011-12-16  4:07 ` Austin Clements
  2011-12-16 11:56   ` David Bremner
                     ` (2 more replies)
  2011-12-16  7:16 ` Michael Hudson-Doyle
  1 sibling, 3 replies; 10+ messages in thread
From: Austin Clements @ 2011-12-16  4:07 UTC (permalink / raw)
  To: David Bremner; +Cc: Olly Betts, Notmuch Mail

Quoth David Bremner on Dec 15 at 10:09 pm:
> Assume we have routines read_metadata and write_metadata that read and
> write to the xapian database metadata (in real life, I think we might
> need to decide in advance exactly what will be written there).
> 
> when we create a database
> 
> write_metadata('log_write',0)
> write_metadata('log_read',0) // more about this later
> 
> To carry out database operation X with logging, we do the following
> 
> begin_atomic
> 
>     num=read_metadata('log_write')
> 
>     X
> 
>     // begin dangerzone
>     fprintf(logfile,"%d %s",num+1,stuff) // or whatever.
> 
>     write_metadata('log_write', num+1)
> 
> end_atomic
> //end dangerzone

The trouble with this approach is that the OS doesn't have to flush
logfile to the disk platters in any particular order relative to the
updates to Xapian.  So, after someone trips over your plug, you could
come back with Xapian saying you have 500 log entries when your
logfile comes back with only 20.  The only way I know of to fix this
is to fsync after the logfile write, which would obviously have
performance issues.  But maybe there are cleverer ways?
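
For concreteness, the fsync fix might look like the following sketch
(Python; durable_append is a made-up name, not anything in notmuch). It
only ensures the log line has been handed to stable storage before the
function returns, so that a later Xapian commit cannot overtake it; it
does nothing to reduce the cost of the flush.

```python
import os


def durable_append(log_path, line):
    """Append one line and flush it to the device before returning."""
    with open(log_path, "a") as log:
        log.write(line)
        log.flush()             # drain Python's userspace buffer to the kernel
        os.fsync(log.fileno())  # ask the kernel to flush the file to the device
```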


* Re: More ideas about logging.
  2011-12-16  2:09 More ideas about logging David Bremner
  2011-12-16  4:07 ` Austin Clements
@ 2011-12-16  7:16 ` Michael Hudson-Doyle
  2011-12-16 12:02   ` David Bremner
  1 sibling, 1 reply; 10+ messages in thread
From: Michael Hudson-Doyle @ 2011-12-16  7:16 UTC (permalink / raw)
  To: David Bremner, Notmuch Mail; +Cc: Olly Betts

On Thu, 15 Dec 2011 22:09:08 -0400, David Bremner <bremner@debian.org> wrote:
> Various discussions (mostly on IRC) from my jlog proposal, and a from
> Thomas's mtime
> (id:"1323796305-28789-1-git-send-email-schnouki@schnouki.net") proposal
> got me thinking.  So let me know what you think about the following.
>
> The goal here is to log tag adds and deletes (including those implicit
> in message deletion) to facilitate tag synchronization.

It's a tangent, but would this sort of thing allow a "undo last tagging
operation" command in emacs?

Cheers,
mwh


* Re: More ideas about logging.
  2011-12-16  4:07 ` Austin Clements
@ 2011-12-16 11:56   ` David Bremner
  2011-12-18 18:34   ` David Bremner
  2012-10-12 16:28   ` Ethan Glasser-Camp
  2 siblings, 0 replies; 10+ messages in thread
From: David Bremner @ 2011-12-16 11:56 UTC (permalink / raw)
  To: Austin Clements; +Cc: Notmuch Mail

On Thu, 15 Dec 2011 23:07:22 -0500, Austin Clements <amdragon@MIT.EDU> wrote:

> Quoth David Bremner on Dec 15 at 10:09 pm:
>
> you could come back with Xapian saying you have 500 log entries when
> your logfile comes back with only 20.  The only way I know of to fix
> this is to fsync after the logfile write, which would obviously have
> performance issues.  But maybe there are cleverer ways?

This might be why jlog uses journalling to write log entries. I'm not
familiar with the details though.

d


* Re: More ideas about logging.
  2011-12-16  7:16 ` Michael Hudson-Doyle
@ 2011-12-16 12:02   ` David Bremner
  2011-12-18 21:53     ` Michael Hudson-Doyle
  0 siblings, 1 reply; 10+ messages in thread
From: David Bremner @ 2011-12-16 12:02 UTC (permalink / raw)
  To: Michael Hudson-Doyle; +Cc: Notmuch Mail

On Fri, 16 Dec 2011 20:16:51 +1300, Michael Hudson-Doyle <michael.hudson@canonical.com> wrote:
> 
> It's a tangent, but would this sort of thing allow a "undo last tagging
> operation" command in emacs?
> 

It seems like it would be much simpler to track that information in a
data structure in emacs? Undo info is a frequently updated stack while
logs are naturally a queue. Also, implementing a list of tagging
operations in emacs sounds way easier than anything we talked about
here.

d


* Re: More ideas about logging.
  2011-12-16  4:07 ` Austin Clements
  2011-12-16 11:56   ` David Bremner
@ 2011-12-18 18:34   ` David Bremner
  2011-12-18 20:22     ` Tom Prince
  2012-10-12 16:28   ` Ethan Glasser-Camp
  2 siblings, 1 reply; 10+ messages in thread
From: David Bremner @ 2011-12-18 18:34 UTC (permalink / raw)
  To: Austin Clements; +Cc: Notmuch Mail


On Thu, 15 Dec 2011 23:07:22 -0500, Austin Clements <amdragon@MIT.EDU> wrote:
> Quoth David Bremner on Dec 15 at 10:09 pm:
> 
> The trouble with this approach is that the OS doesn't have to flush
> logfile to the disk platters in any particular order relative to the
> updates to Xapian.  So, after someone trips over your plug, you could
> come back with Xapian saying you have 500 log entries when your
> logfile comes back with only 20.  The only way I know of to fix this
> is to fsync after the logfile write, which would obviously have
> performance issues.  But maybe there are cleverer ways?

What about just declaring the log invalid in this case and forcing a
"slow-sync"? It seems it should be no harder to detect the log being
behind xapian than it would be to detect it being ahead.
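
As a sketch of that check (hypothetical names; `committed` is the
counter stored in Xapian and `last_logged` is the highest transaction
number found in the log file):

```python
def log_state(committed, last_logged):
    """Decide what to do with the log when the database is opened."""
    if last_logged > committed:
        return "truncate"   # log is ahead: drop the uncommitted tail
    if last_logged < committed:
        return "slow-sync"  # log is behind: entries were lost, do a full sync
    return "ok"
```

With Austin's numbers, log_state(500, 20) gives "slow-sync".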

Another idea would be to replace logging with mkdir(2) and creat(2); I
made some experiments in branch 'tree-dump' in repo
     
     git://pivot.cs.unb.ca/notmuch

This generates a tree of empty files in the style of nmbug (with an
extra layer of directories at the top to help prevent file system
explosion).
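
A rough Python sketch of the idea, not the code in the 'tree-dump'
branch: the directory layout, the two-hex-digit bucket used as the extra
layer, and the percent-encoding of names are all assumptions made for
illustration.

```python
import hashlib
import os
from urllib.parse import quote


def tag_path(root, message_id, tag):
    """Path of the empty file recording that message_id carries tag."""
    # Hypothetical fan-out layer: first two hex digits of a hash of the id.
    bucket = hashlib.sha1(message_id.encode()).hexdigest()[:2]
    # Percent-encode so that '/' cannot appear inside a file name.
    return os.path.join(root, bucket, quote(message_id, safe=""),
                        quote(tag, safe=""))


def add_tag(root, message_id, tag):
    """Record a tag by creating an empty file, making directories as needed."""
    path = tag_path(root, message_id, tag)
    os.makedirs(os.path.dirname(path), exist_ok=True)  # mkdir(2)
    open(path, "a").close()                            # creates an empty file


def remove_tag(root, message_id, tag):
    """Forget a tag by unlinking its file; a missing file is not an error."""
    try:
        os.unlink(tag_path(root, message_id, tag))
    except FileNotFoundError:
        pass
```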

It isn't super fast as a way to dump (probably at least 10x slower than
the file based methods). On the other hand, on this machine (an i7 950
with a spinning disk) it takes about 1 ms per tag to write (i.e. 175k
tags take about 160s). It is completely IO bound, so I would expect it
to be faster on SSD.  I am running lvm on top of dm-crypt.

The more worrying part is disk usage; the tag tree for 200k messages
uses 400k inodes and 836M of apparent disk usage (according to du), while
the same tags in "sup" format take 11M.  Maybe this could be useful if
combined with some scheme to only dump tags not covered by maildir (for
those using maildir flag synching already).

d






* Re: More ideas about logging.
  2011-12-18 18:34   ` David Bremner
@ 2011-12-18 20:22     ` Tom Prince
  2011-12-20 20:25       ` David Bremner
  0 siblings, 1 reply; 10+ messages in thread
From: Tom Prince @ 2011-12-18 20:22 UTC (permalink / raw)
  To: David Bremner, Austin Clements; +Cc: Notmuch Mail

On Sun, 18 Dec 2011 14:34:00 -0400, David Bremner <bremner@debian.org> wrote:
> The more worrying part is disk usage; the tag tree for 200k messages
> uses 400k inodes, and 836M of apparent disk usage (according to du) the
> same tags in "sup" format take 11M.  Maybe this could be useful if
> combined with some scheme to only dump tags not covered by maildir (for
> those using maildir flag synching already)

Well, it would seem natural to re-use the nmbug logic here, and just use
a bare git repo for this. One would need a way to sync and merge the
tag-tree automatically anyway. I admit I haven't tried nmbug yet, but it
seems that nmbug, switched from syncing just notmuch:: tags to syncing
everything but notmuch:: tags, would be a sensible way to sync tags?

  Tom


* Re: More ideas about logging.
  2011-12-16 12:02   ` David Bremner
@ 2011-12-18 21:53     ` Michael Hudson-Doyle
  0 siblings, 0 replies; 10+ messages in thread
From: Michael Hudson-Doyle @ 2011-12-18 21:53 UTC (permalink / raw)
  To: David Bremner; +Cc: Notmuch Mail

On Fri, 16 Dec 2011 08:02:17 -0400, David Bremner <bremner@debian.org> wrote:
> On Fri, 16 Dec 2011 20:16:51 +1300, Michael Hudson-Doyle <michael.hudson@canonical.com> wrote:
> > 
> > It's a tangent, but would this sort of thing allow a "undo last tagging
> > operation" command in emacs?
> > 
> 
> It seems like it would be much simpler to track that information in a
> data structure in emacs?

I think that would require tracking more information than is currently
tracked -- when you press 'a' on a thread, emacs would have to remember
which emails had the inbox tag before running 'notmuch tag -inbox
thread:xxx' or whatever it runs today.
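
To make the extra bookkeeping concrete, here is a small sketch in Python
(not elisp, and not how the emacs interface works today): the dict of
tag sets and both function names are hypothetical. The point is only
that the undo record must list the messages that actually carried the
tag, not the whole thread.

```python
def remove_tag_with_undo(tags_by_message, thread, tag, undo_stack):
    """Remove tag from every message in thread, remembering which had it."""
    had_tag = [m for m in thread if tag in tags_by_message[m]]
    for m in had_tag:
        tags_by_message[m].discard(tag)
    undo_stack.append((tag, had_tag))


def undo_last(tags_by_message, undo_stack):
    """Restore the tag only on the messages that carried it before."""
    tag, had_tag = undo_stack.pop()
    for m in had_tag:
        tags_by_message[m].add(tag)
```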

You are probably right that what you're talking about here is not the
easiest way to do this though.

Cheers,
mwh


* Re: More ideas about logging.
  2011-12-18 20:22     ` Tom Prince
@ 2011-12-20 20:25       ` David Bremner
  0 siblings, 0 replies; 10+ messages in thread
From: David Bremner @ 2011-12-20 20:25 UTC (permalink / raw)
  To: Tom Prince, Austin Clements; +Cc: Notmuch Mail

On Sun, 18 Dec 2011 13:22:20 -0700, Tom Prince <tom.prince@ualberta.net> wrote:
> On Sun, 18 Dec 2011 14:34:00 -0400, David Bremner <bremner@debian.org> wrote:
> > The more worrying part is disk usage; the tag tree for 200k messages
> > uses 400k inodes, and 836M of apparent disk usage (according to du) the
> > same tags in "sup" format take 11M.  Maybe this could be useful if
> > combined with some scheme to only dump tags not covered by maildir (for
> > those using maildir flag synching already)
> 
> Well, it would seem natural to re-use the nmbug logic here, and just use
> a bare git repo for this. One would need a way to sync and merge the
> tag-tree automatically anyway. I admit I haven't tried nmbug yet, but it
> seems that nmbug, switched from sync just notmuch:: to syncing
> everything but notmuch:: would be a sensible way to sync tags?

I was mainly interested in whether some guarantee of atomicity could be
given in a simple way.  The git update-index approach doesn't really make
those kinds of guarantees.  Probably this is tolerable for a human
initiated "dump" process; not so much for other uses.  Furthermore, much
of the motivation for both mtimes and logging is to make incremental
dumping possible, in order to avoid the time needed for a full dump. This
experiment was also to see how feasible it was to insert some
"mkdir+creat" in the notmuch-tag critical path.

Since a few people have mentioned this, I should confess that
there are (at least) 2 performance bugs lurking in nmbug that make it
probably not yet suitable for large scale tag syncing.

1) I did not get the merging working with only the index, so 
   nmbug currently makes a temporary checkout to do the merge.

2) transferring tags from the git repo to xapian is currently quite slow
   because it does one call to git tag for each tag, rather than
   constructing an input for "notmuch restore".  
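
For (2), the batching could look roughly like the sketch below. It
assumes the one-line-per-message "sup" style dump format (message id
followed by space-separated tags in parentheses); that assumption, and
the function name restore_lines, should be checked against the installed
notmuch before relying on it.

```python
from collections import defaultdict


def restore_lines(pairs):
    """Group (message_id, tag) pairs into one dump-style line per message."""
    tags = defaultdict(set)
    for message_id, tag in pairs:
        tags[message_id].add(tag)
    return [f"{mid} ({' '.join(sorted(t))})" for mid, t in sorted(tags.items())]
```

The resulting lines would then be fed to a single "notmuch restore" run
instead of one process per tag.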

I _think_ both of these are fixable in principle.  Maybe somebody with
better git internals knowledge than I would like to take a look at (1). 
(2) is just a SimpleMatterOfProgramming (TM). Patches, as they say, are
welcome ;).

d


* Re: More ideas about logging.
  2011-12-16  4:07 ` Austin Clements
  2011-12-16 11:56   ` David Bremner
  2011-12-18 18:34   ` David Bremner
@ 2012-10-12 16:28   ` Ethan Glasser-Camp
  2 siblings, 0 replies; 10+ messages in thread
From: Ethan Glasser-Camp @ 2012-10-12 16:28 UTC (permalink / raw)
  To: Austin Clements, David Bremner; +Cc: Olly Betts, Notmuch Mail

Austin Clements <amdragon@MIT.EDU> writes:

> The trouble with this approach is that the OS doesn't have to flush
> logfile to the disk platters in any particular order relative to the
> updates to Xapian.  So, after someone trips over your plug, you could
> come back with Xapian saying you have 500 log entries when your
> logfile comes back with only 20.  The only way I know of to fix this
> is to fsync after the logfile write, which would obviously have
> performance issues.  But maybe there are cleverer ways?

Sorry to jump in almost a year after the fact, but..

How bad do you think those performance issues are going to be? I don't
see them as prohibitive, even in the case where you write a log entry
for every message being tagged. Xapian's doing an fsync each time we
commit, isn't it? (Or is there some cute trick where it rename()s the
database?)

If, instead of truncating, you replay logged operations, I think you can
get away with just writing log entries for user-level operations (like
"notmuch tag +mytag to:somequery") which could touch a lot of
messages. This would then only require one fsync on the log file before
doing a lot of tag updates.
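
A sketch of what I mean, with hypothetical names throughout
(log_command, tag_many, replay; apply_tag and run_command are callbacks
supplied by the caller). It relies on tag operations being idempotent,
so replaying a command that had already been partly applied is harmless.

```python
import os


def log_command(log_path, command):
    """Record one user-level command and fsync before any tagging starts."""
    with open(log_path, "a") as log:
        log.write(command + "\n")
        log.flush()
        os.fsync(log.fileno())  # the single fsync for this operation


def tag_many(log_path, command, matching_ids, apply_tag):
    """Log the command once, then update every matching message."""
    log_command(log_path, command)
    for message_id in matching_ids:
        apply_tag(message_id)   # many updates, no further log writes


def replay(log_path, run_command):
    """Re-run every logged command, for example after a crash."""
    with open(log_path) as log:
        for line in log:
            run_command(line.rstrip("\n"))
```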

Ethan
