From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by olra.theworths.org (Postfix) with ESMTP id D26A3431FC4 for ; Wed, 23 Apr 2014 15:31:58 -0700 (PDT) X-Virus-Scanned: Debian amavisd-new at olra.theworths.org X-Spam-Flag: NO X-Spam-Score: -2.3 X-Spam-Level: X-Spam-Status: No, score=-2.3 tagged_above=-999 required=5 tests=[RCVD_IN_DNSWL_MED=-2.3] autolearn=disabled Received: from olra.theworths.org ([127.0.0.1]) by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id PIBwrsjGS7ZV for ; Wed, 23 Apr 2014 15:31:54 -0700 (PDT) Received: from market.scs.stanford.edu (market.scs.stanford.edu [171.66.3.10]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by olra.theworths.org (Postfix) with ESMTPS id E922B431FBF for ; Wed, 23 Apr 2014 15:31:53 -0700 (PDT) Received: from market.scs.stanford.edu (localhost.scs.stanford.edu [127.0.0.1]) by market.scs.stanford.edu (8.14.7/8.14.7) with ESMTP id s3NMVoYW028965; Wed, 23 Apr 2014 15:31:50 -0700 (PDT) Received: (from dm@localhost) by market.scs.stanford.edu (8.14.7/8.14.7/Submit) id s3NMVnq9018357; Wed, 23 Apr 2014 15:31:49 -0700 (PDT) X-Authentication-Warning: market.scs.stanford.edu: dm set sender to return-qnmtsdtvmswvfn8meiysvc9qaw@ta.scs.stanford.edu using -f From: dm-list-email-notmuch@scs.stanford.edu To: Austin Clements Subject: Re: [PATCH] Add configurable changed tag to messages that have been changed on disk In-Reply-To: <20140423205920.GM25817@mit.edu> References: <1396800683-9164-1-git-send-email-eg@gaute.vetsj.com> <87wqf2gqig.fsf@ta.scs.stanford.edu> <1397140962-sup-6514@qwerzila> <87wqexnqvb.fsf@ta.scs.stanford.edu> <1397163239-sup-5101@qwerzila> <87d2g9ja0h.fsf@maritornes.cs.unb.ca> <1398237865-sup-624@qwerzila> <87ioq0l8th.fsf@ta.scs.stanford.edu> <20140423205920.GM25817@mit.edu> Date: Wed, 23 Apr 2014 15:31:49 -0700 Message-ID: <87iopz4qzu.fsf@ta.scs.stanford.edu> MIME-Version: 1.0 Content-Type: text/plain Cc: notmuch X-BeenThere: notmuch@notmuchmail.org X-Mailman-Version: 2.1.13 Precedence: list Reply-To: David Mazieres expires 2014-07-22 PDT List-Id: "Use and development of the notmuch mail system." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Apr 2014 22:31:59 -0000 Austin Clements writes: >> A middle ground might be to use the maximum of two values: 1) the >> time-of-day at which notmuch started executing, and 2) the highest ctime >> in the database plus 100 microseconds (leaving plenty of slop to store >> timestamps as IEEE doubles with 52 significant bits). Since the values >> will be Btree-indexed, computing the max plus one will be cheap. > > This makes me curious if you've considered how to fit this in to > Xapian. The Xapian query syntax supports range queries over document > "values", but within the Xapian B-tree, values are stored in docid > order, not value order, so Xapian's range query operator is actually a > full scan in implementation. I assume it does this so it doesn't have > to store both forward and inverse indexes of values. (I spent some > time figuring out the layout of the Xapian database and have fairly > detailed notes if anyone's curious.) Aside from finding the previous max time, everything else should work identically to the date query operator and NOTMUCH_VALUE_TIMESTAMP. Though I believe you, I'm a little surprised the values aren't indexed. An alternative design might use terms like XCTIMExxxxxxxxxxxxxxxx where the x's are hex digits. But this seems a bit clunky and not using Xapian the way it is indented to be used. When I do a query with a giant result set ordered by date (notmuch search --sort=oldest-first "*"), the first few results come back pretty quickly, so I guess the full database scan is not an issue, at least for ~10^5 messages. > This is still reasonably fast in practice because it's a sequential > scan and only requires a few bytes per message, but it's probably not > what you'd expect. That said, Xapian does track per-value statistics > that would suffice for the particular problem of monotonic time stamps > (e.g., Database::get_value_upper_bound). Oh, well in that case there is no issue. That max is the only statistic we need. Everything that requires a full database scan, like get me all messages whose properties have changed since time X, is something that you can't do at all right now. And in fact I'm already scanning all 100,000 message IDs AND diffing the results against a separate sqlite database to detect changes in only 0.09 seconds (Linux) or 1.2 seconds (32-bit OpenBSD). This will only make that faster, and additionally allow other people to do what I'm doing without writing a bunch of C++ code. > In principle it would be possible to use user metadata or even > document terms to support true B-tree range scans by ctime order, but > I don't think it's possible to express queries over this using > Xapian's query parser. I've written about 90% of a (new) custom query > parser for Notmuch that would enable this, but little things like my > looming thesis deadline have interfered with me finishing it. Yeah, I've been avoiding the query parser and just scanning terms and postlists directly. Since the lack of ctime forced me to scan the whole database anyway, I found it much faster to scan each tag's posting list and dump the results into sqlite than to extract tag terms on a per-document basis the way notmuch dump does. >> Incidentally, if you are really this paranoid about time stamps, it >> should bother you that notmuch's directory timestamps only have one >> second granularity. > > This is historical (and, I agree, unfortunate). But nobody's > complained, so it hasn't been worth changing the libnotmuch interface > to support sub-second directory mtimes. However, notmuch new does > correctly handle deliveries in the same second it runs. If the > wall-clock time when it starts is the same as the on-disk directory > mtime, it skips updating the in-database directory mtime at the end. > Hence, on the next run, it will still consider the directory > out-of-date. It's a bit of a hack, but it's a hack that would be > necessary for supporting older file systems even if we did support > sub-second timestamps. Yeah, is kind of a problem me. I currently scan the XFDIRENTRY terms belonging to a directory only if the directory's notmuch mtime has changed since the last time I examined Xapian's state. I used to scan the actual directories, which was fine, but not so useful because I don't actually want to deal with messages that notmuch has not yet indexed. Conversely, if a directory has not changed since the last time muchsync ran, but notmuch's idea of the directory has changed (because someone ran notmuch new), then I do care about scanning for new/deleted XFDIRENTRY terms. But couldn't notmuch fix the sub-second problem in a fully backwards compatible way? After all, the database is already storing these mtimes as doubles. Even for OSes that don't support st_mtim, notmuch could just add 0.00001 seconds to the previous timestamp of a modified directory. David