From: Austin Clements
To: Sascha Silbe
Date: Mon, 25 Jun 2012 21:47:54 -0400
Subject: Re: [PATCH 0/3] Speed up notmuch new for unchanged directories
Message-ID: <20120626014754.GO24342@mit.edu>
References: <1340555366-25891-1-git-send-email-sascha-pgp@silbe.org>
 <87pq8n1de4.fsf@awakening.csail.mit.edu>
Cc: notmuch
List-Id: "Use and development of the notmuch mail system."

Quoth Sascha Silbe on Jun 26 at 12:13 am:
> Austin Clements writes:
> > On Sun, 24 Jun 2012, Sascha Silbe wrote:
> > ["notmuch new" listing every directory, even if it's unchanged]
> > I haven't looked over your patches yet, but this result surprises me.
> > Could you explain your setup a little more? How much mail do you have
> > and across how many directories? What file system are you using?
>
> As mentioned in passing already, I have a total of about 900k unique
> mails (sometimes several copies of them, received over different paths,
> e.g. mailing list and a direct CC).
> Most of that is "old" mails, in
> directories that are not getting updated. If notmuch would support mbox,
> I'd use that instead for those old mails. The total number of
> directories in the mail store is about 29k and the total number of files
> (including the git repository and mbox files that sup used) is about
> 1.25M.
>
> Since a housekeeping job last weekend, the number of mails in
> directories that are still getting updated is about 4k, i.e. about 5‰ of
> the total number of mails or 3‰ of the total number of files. The number
> of directories getting updated is 104, i.e. about 4‰ of the total number
> of directories.
>
> Ideally, we'd get the run-time of "notmuch new" down by a similar
> factor. With just plain POSIX and no additional information that won't
> be possible, but providing a way to channel information about updates
> into notmuch (rather than having it scan everything over and over again)
> should help. That information is already available as output from the
> mail fetching process (rsync in my case). Of course, it would be purely
> optional: "notmuch new" without additional information would simply
> continue to scan everything.

This would be great. I've been thinking along similar lines for a
while (in my case, I want to feed notmuch new from inotify), though I
haven't written any code for it.

> > I'm also surprised that your new approach helps. This directory listing
> > has to be read off disk one way or the other, but listing directories is
> > the bread-and-butter of file systems, whereas I would think that Xapian
> > would require more IO to accomplish the same effect.
>
> "notmuch new" needs to iterate over a list of all directories to find
> those with new mails (and potentially new subdirectories). However, it
> does not need to list the *contents* of those folders. I'm surprised as
> well, but rather in the opposite direction: Based on a naive
> calculation, we'd expect to see a speedup on the order of
> (1.25M+29k)/29k = 44.
> The actual results suggest that stat()ing (done
> 29k times both before and after the patch) is taking about 19 times as
> long as listing a directory entry (before the patch we listed 1M
> entries, now we list none if nothing has changed). (*)

For a cold cache, these aren't the numbers that matter. With an HDD
and how few files your directories contain on average, only seeks will
matter. I would expect your workload without your patch to have at
least 1 but closer to 2 seeks per directory: one to stat the directory
and one to get the directory contents block. Some of the stat seeks
will be eliminated by the buffer cache, even starting cold, because of
inode locality (absolute best case is 16x reduction, but if you
created the directories over time, then this locality is probably
quite poor). There are a few other potential seeks to get the
directory document from Xapian and to get its mtime value, but those
should exhibit strong locality, so they probably don't contribute
much.

NewEgg says your drive has an average seek time of 8.9ms, so with 29k
directories and assuming your directories are sequential on disk,
that's at least 258s and closer to 516s, which agrees with your
benchmark results.

I'm surprised by your results because I would expect your workload
with your patches to exhibit about the same number of seeks: one to
stat the directory (same as before) and one for
notmuch_directory_get_child_files, which has to seek in the term index
to get the child directories. My guess is that this exhibits better
locality because the child directory terms are stored contiguously in
the database's key space (though not necessarily sequentially on disk
unless this is a fresh database). Unfortunately, I'm not sure of a
good way to test this hypothesis. Any thoughts?
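
For reference, the two back-of-the-envelope figures in this thread can
be reproduced with a minimal Python sketch. It assumes, as above, that
only seeks matter for a cold cache, and it takes the 8.9ms average seek
time, the 29k directories and the 1.25M files as given. The constant
and function names are purely illustrative and do not come from
notmuch.

```python
# Rough cold-cache estimate: count seeks only, ignore transfer time.
AVG_SEEK_S = 8.9e-3    # average seek time quoted in the thread (8.9 ms)
N_DIRS = 29_000        # approximate number of directories in the mail store
N_FILES = 1_250_000    # approximate number of files in the mail store


def scan_seconds(n_dirs, seeks_per_dir, seek_s=AVG_SEEK_S):
    """Estimated seconds spent seeking during one full scan."""
    return n_dirs * seeks_per_dir * seek_s


def naive_speedup(n_files, n_dirs):
    """Naive ratio of entries touched before vs. after skipping listings."""
    return (n_files + n_dirs) / n_dirs


lower = scan_seconds(N_DIRS, 1)  # one seek per directory (stat only)
upper = scan_seconds(N_DIRS, 2)  # stat plus directory contents block
ratio = naive_speedup(N_FILES, N_DIRS)

print(f"seek-bound scan time: {lower:.0f}s to {upper:.0f}s")
print(f"naive expected speedup: {ratio:.0f}x")
```

Changing the seeks-per-directory argument (for example to a value
between 1 and 2) gives a quick feel for how much the buffer cache would
have to help before the estimate stops matching the benchmark.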