From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from localhost (localhost [127.0.0.1])
    by olra.theworths.org (Postfix) with ESMTP id D20A4431FBF;
    Sat, 21 Nov 2009 19:28:31 -0800 (PST)
X-Virus-Scanned: Debian amavisd-new at olra.theworths.org
Received: from olra.theworths.org ([127.0.0.1])
    by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)
    with ESMTP id SefZV3xd2Lqu; Sat, 21 Nov 2009 19:28:31 -0800 (PST)
Received: from cworth.org (localhost [127.0.0.1])
    by olra.theworths.org (Postfix) with ESMTP id 5D685431FAE;
    Sat, 21 Nov 2009 19:28:30 -0800 (PST)
From: Carl Worth
To: Brett Viren
In-Reply-To: <46263c600911211436s5826015eqc5fc18a4164245cb@mail.gmail.com>
References: <20091121145111.GB19397@excalibur.local>
    <87fx874xj5.fsf@yoom.home.cworth.org>
    <46263c600911211436s5826015eqc5fc18a4164245cb@mail.gmail.com>
Date: Sun, 22 Nov 2009 04:28:18 +0100
Message-ID: <87hbsn2q7h.fsf@yoom.home.cworth.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: notmuch@notmuchmail.org
Subject: Re: 25 minutes load time with emacs -f notmuch
X-BeenThere: notmuch@notmuchmail.org
X-Mailman-Version: 2.1.12
Precedence: list
List-Id: "Use and development of the notmuch mail system."
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sun, 22 Nov 2009 03:28:32 -0000

On Sat, 21 Nov 2009 17:36:18 -0500, Brett Viren wrote:
> Processed 130871 total files in 38m 7s (57 files/sec.).
> Added 102723 new messages to the database (not much, really).

Just be glad that you have so little mail. ;-)

> This was ~2GB of mail on a 2.5GHz CPU. That seems pretty reasonable
> to me but I'd like to rerun the "notmuch new" under google perftools
> to see if there are any obvious bottlenecks that might be cleaned up.

To me, here are the obvious things to fix after looking at a profile:

1. We're spending a *lot* of time searching in the Xapian database.
   But our initial indexing operation should only be *writing* data
   into the database, so what's this searching about?

   Well, at each new message, we're looking up the ID from its
   In-Reply-To header to find a thread ID to link to, and then we're
   looking up all of the IDs from its References header to find thread
   IDs that need to be merged with ours. So both parent and child
   lookups.

   And since those are taking a bunch of time, I think it might make
   sense to just keep a hashtable mapping message-ID -> thread-ID and
   do lookups in that (we should have plenty of memory on current
   machines even with lots of mail).

2. We're hitting the slow Xapian document updates for thread-ID
   merging.

   Whenever we find a child that was already in the database with one
   thread ID that should have ours, we simply want to set its thread
   ID to ours. But as we've talked about recently, Xapian has a bug
   (defect 250) that makes it much more expensive than it should be to
   update a single term.

   So, we could do a first pass over the messages to find all their
   thread IDs and get them to settle down before doing any indexing in
   a separate, second pass.

Step (2) should help even if we don't do step (1), but clearly we can
do both.

It would be great if anyone wants to take a look at either or both of
these, otherwise I will when I can.

-Carl
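
P.S. To make the two ideas above concrete, here is a rough sketch in Python (not notmuch's actual C code). It assumes each message has already been parsed into a (message_id, in_reply_to, references) tuple. The names resolve_threads, index_all and index_message are hypothetical, and the "thread ID" here is simply a representative message-ID, whereas a real implementation would generate its own IDs. The first pass settles thread membership in an in-memory table (a dict used as a union-find structure) without touching the database, and the second pass indexes each message exactly once with its final thread ID.

```python
def resolve_threads(messages):
    """First pass: settle thread IDs in memory, before any indexing.

    messages: iterable of (message_id, in_reply_to, references) tuples,
    where in_reply_to may be None and references is a list of IDs.
    Returns a dict mapping every message-ID seen (including IDs that
    were only referenced) to a representative ID for its thread.
    """
    parent = {}  # in-memory table: message-ID -> parent in union-find

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two threads

    for msg_id, in_reply_to, references in messages:
        find(msg_id)  # register the message even if it links to nothing
        linked = list(references)
        if in_reply_to:
            linked.append(in_reply_to)
        for other in linked:
            union(msg_id, other)

    return {m: find(m) for m in list(parent)}


def index_all(messages, thread_of, index_message):
    """Second pass: index each message once with its settled thread ID.

    index_message is a hypothetical callback standing in for the real
    database write; it receives (message_id, thread_id).
    """
    for msg_id, _in_reply_to, _references in messages:
        index_message(msg_id, thread_of[msg_id])
```

Because all merging happens in the dict, no document that has already been written ever needs its thread ID changed, which is the point of step (2). The memory cost is roughly one dict entry per message-ID, in line with the hashtable suggested in step (1).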