From: dm-list-email-notmuch@scs.stanford.edu
To: David Bremner, Gaute Hope
Cc: notmuch
Subject: Re: [PATCH] Add configurable changed tag to messages that have been changed on disk
Date: Fri, 11 Apr 2014 09:03:38 -0700
Message-ID: <877g6v3lb9.fsf@ta.scs.stanford.edu>
In-Reply-To: <87k3aw5dj5.fsf@zancas.localnet>
References: <1396800683-9164-1-git-send-email-eg@gaute.vetsj.com>
 <87wqf2gqig.fsf@ta.scs.stanford.edu>
 <1397140962-sup-6514@qwerzila>
 <87wqexnqvb.fsf@ta.scs.stanford.edu>
 <87k3aw5dj5.fsf@zancas.localnet>
Reply-To: David Mazieres expires 2014-07-10 PDT
List-Id: "Use and development of the notmuch mail system."

David Bremner writes:

>> Exactly.  It could be a tick, or just the current time of day if your
>> clock does not go backwards.  (I'd be willing to do a full scan if the
>> clock ever goes backwards.)  The advantage of time is that you don't
>> have to synchronously update some counter.
>
> I think I'd lean towards global time so that one could use it to resolve
> conflicts between changes to multiple copies of the database.

I, too, would prefer to use time.  However, I'm doubtful it would help
resolve conflicts.  On the plus side, I'm not sure it is even needed to
resolve conflicts.  My mail synchronizer has an algorithm for resolving
conflicts that always works without human intervention and in my
limited experience does exactly what I want:

  * If there's a conflict between two replicas, ensure that each
    maildir ends up with the maximum of the number of copies of the
    message in each of the two databases being reconciled.  [Example:
    If replica A deletes a message and replica B moves it from folder
    INBOX to folder SPAM, you end up with a copy in SPAM.  If replica A
    moves a message to folder IMPORTANT and replica B moves it to SPAM,
    then you get two hard links to the same file, one in IMPORTANT and
    one in SPAM.]

  * If there's a conflict and two replicas have different tags on the
    same message, then the tags in notmuch's new.tags directive get
    logically ANDed, while all other tags get logically ORed.

Granted, I've only been using this system for a week.  On the other
hand, all I was doing was starting to test something I had written, yet
it ended up being so much better than my old system that I couldn't go
back and ended up using my system in production far earlier than
anticipated...

>> Making sure the write-operations update the time should be easy.  Most
>> or all of the changes are probably funneled through
>> _notmuch_message_sync.  Worst case, there are only 9 places in the
>> source code that make use of a Xapian:WritableDatabase, so I'm pretty
>> confident total changes wouldn't be much more than 50 lines of code.
>
> Maybe. Don't forget upgrading the database, updating the test suite, and
> presumably some changes to the CLI so the new mtime can actually be
> used. Not to be discouraging ;).

The CLI is trivial.  We'll just add another search keyword, ctime,
analogous to date.  As far as updating the test suite, etc., it's
almost certain that the core notmuch developers would be unsatisfied
with whatever I've done, since the code base is very clean and has a
very uniform style.  So when I say I'd want some "indication that such
a change could be upstreamed," I mean more specifically that someone
would be willing to shepherd the process of getting the code into
shape.

> In the ensuing time, nothing better has developed for tag
> synchronization (my pet use case) so maybe it's time to pursue this
> again.

I do have something pretty good for tag synchronization.  It requires a
full database scan each time to detect changes, but I've heavily
optimized it to be very fast by skipping over the notmuch library and
directly scanning the underlying Xapian Btrees.  Currently my
bottleneck is indexing messages (e.g., running notmuch new or calling
notmuch_database_add_message), which is painfully slow on 32-bit
machines.  (Unfortunately my mail server is a 32-bit machine.)

To give you an idea, on a 32-bit machine, if I get a handful of new
messages (e.g., 6), running "notmuch new" takes 19 seconds, while
scanning the database to check for renames and changed tags adds
another 1.4 seconds.  On a 64-bit machine, "notmuch new" might take 1
second, while scanning the database adds 350 msec.  So full database
scans might not be the end of the world.
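
For illustration, the two conflict rules listed earlier (keep the
maximum number of copies per maildir folder; AND the tags named in
new.tags, OR all others) could be sketched as below.  This is not the
synchronizer's actual code: merge_link_counts, merge_tags, and their
arguments are hypothetical names, folders are modeled as a mapping from
folder name to copy count, and tags are modeled as plain Python sets.

```python
from collections import Counter


def merge_link_counts(copies_a, copies_b):
    """Per folder, keep the larger of the two replicas' copy counts.

    copies_a / copies_b map folder name -> number of copies of one
    message in that replica (hypothetical representation).
    """
    # Counter's "|" operator takes the element-wise maximum and drops
    # folders whose resulting count is zero.
    return Counter(copies_a) | Counter(copies_b)


def merge_tags(tags_a, tags_b, new_tags):
    """AND the tags that appear in new.tags, OR everything else."""
    a, b, new = set(tags_a), set(tags_b), set(new_tags)
    kept_new = a & b & new        # survives only if both replicas have it
    kept_other = (a | b) - new    # survives if either replica has it
    return kept_new | kept_other


if __name__ == "__main__":
    # Replica A deleted the message, replica B moved it to SPAM.
    print(dict(merge_link_counts({}, {"SPAM": 1})))
    # Replica A filed it under IMPORTANT, replica B under SPAM.
    print(dict(merge_link_counts({"IMPORTANT": 1}, {"SPAM": 1})))
    # With new.tags = unread;inbox, "unread" is dropped because only
    # replica A still carries it.
    print(sorted(merge_tags({"inbox", "unread", "work"},
                            {"inbox", "todo"},
                            {"unread", "inbox"})))
```

Both functions apply only to a message that changed in both replicas;
a change made in just one replica would simply be propagated as is.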

The biggest performance bottleneck at this point is notmuch's painfully
slow indexing.  It kills me that it takes 10 minutes to index 100,000
mail messages on a 16-core machine with 48 GiB of RAM.  But the library
is non-reentrant and allocates thread IDs in such a way that it's hard
to create parallel databases and later merge them.  Basically I can't
figure out how to make productive use of more than one CPU core even
when synchronizing across 1 Gb Ethernet!

It's pretty beta, but my intention is to open-source my code, so I'd be
glad to have beta testers if you are interested in testing tag
synchronization.

> It would be good to have some preliminary idea about the time
> and space costs of adding document mtimes. I guess database bloat
> should not be too bad, since it's only 64bits (?) per mail message.

Plus a Btree to index it, so figure at least 24 bytes per message.
Another issue is that values are always brought into memory with a
document, so it will consume more RAM.  But yeah, I don't think it
should be that bad.

David