unofficial mirror of notmuch@notmuchmail.org
 help / color / mirror / code / Atom feed
* Improved search speed by 2.5X
@ 2010-11-16 16:18 Austin Clements
  2010-11-16 16:58 ` David Edmondson
  2010-11-16 19:06 ` Carl Worth
  0 siblings, 2 replies; 4+ messages in thread
From: Austin Clements @ 2010-11-16 16:18 UTC (permalink / raw)
  To: notmuch

'Lo.  I've been toying with switching to notmuch for a while, but the
speed of search has been holding me back.  Perhaps I should upgrade
the archaic machine I'm trying to run notmuch on, but I figured
optimizing search would be a good way for me to cut my teeth on
notmuch's code.  I've implemented two general optimizations that,
together, make thread search 2.5X faster, reducing my test "tag:inbox"
query from 4.523 seconds to 1.779 seconds.

Before cleaning up and posting the patches, I wanted to make sure my
general approach is both correct and pedagogically sound.

I reduced thread search's 1 + 2t Xapian queries (where t is the number
of unique matched threads) to 1 + t queries and now construct exactly
one notmuch_message_t for each message instead of 2 to 3.  The query
performed by notmuch_query_search_threads to get all messages matching
the user's query now fetches only the docid set instead of
constructing message objects and decompressing the thread ID of every
message from its term list.  Threads are then constructed from these
docids using "thread:" queries much as before; however, now which
subset of these messages matched the user query is determined by
checking their docids against the docid set constructed earlier,
rather than requiring yet another Xapian query per thread to find
matched messages.

Before this optimization, my test "tag:inbox" query took 4.523
seconds.  With the optimization, it takes 3.072 seconds (1.5X faster).
It has the downside that it requires more RAM, though even with, say,
a 1 million message database, it's at most ~4.2 megs.  In principle it
also introduces latency before the first search result, but fetching
docid sets is what Xapian was born to do and I haven't noticed any
latency.

The second optimization is based on the observation that Xapian
decompresses a document's term list every time you iterate over it.
As a result, notmuch can decompress the beginning of a single term
list at least five times.  I combined all of the message metadata
fetching (message ID, thread ID, reply-to, filename list, and tag
list) into a single pass over the term list that fetches everything.

This optimization reduces my "tag:inbox" query to 3.521 seconds (1.3X
faster).  This one is a bit more of a balancing act, since it may
fetch metadata that is never needed; however, doing the single pass
takes only a little longer than running any of the individual metadata
requests.

These two optimizations complement each other.  With both, my query
takes 1.779 seconds (2.5X faster).  Because the first constructs only
one message object per message, it aggregates all metadata operations
on that one object instead of spreading them across multiple objects,
increasing the effectiveness of single-pass metadata retrieval.

Does this all seem reasonable?  My code passes the test suite [1], so
I believe it to be fairly sound.


[1] Except for 2 emacs tests that depend on author order.  What order
are matched authors *supposed* to be in?

-- 
Austin Clements                                      MIT/'06/PhD/CSAIL
amdragon@mit.edu                           http://web.mit.edu/amdragon
       Somewhere in the dream we call reality you will find me,
              searching for the reality we call dreams.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Improved search speed by 2.5X
  2010-11-16 16:18 Improved search speed by 2.5X Austin Clements
@ 2010-11-16 16:58 ` David Edmondson
  2010-11-16 17:59   ` Austin Clements
  2010-11-16 19:06 ` Carl Worth
  1 sibling, 1 reply; 4+ messages in thread
From: David Edmondson @ 2010-11-16 16:58 UTC (permalink / raw)
  To: Austin Clements, notmuch

On Tue, 16 Nov 2010 11:18:43 -0500, Austin Clements <amdragon@MIT.EDU> wrote:
> [1] Except for 2 emacs tests that depend on author order.  What order
> are matched authors *supposed* to be in?

In IRC cworth confirmed that he considered the current behaviour a bug,
which is to say that the authors should be sorted by first occurrence in
the thread.

dme.
-- 
David Edmondson, http://dme.org

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Improved search speed by 2.5X
  2010-11-16 16:58 ` David Edmondson
@ 2010-11-16 17:59   ` Austin Clements
  0 siblings, 0 replies; 4+ messages in thread
From: Austin Clements @ 2010-11-16 17:59 UTC (permalink / raw)
  To: David Edmondson; +Cc: notmuch

Quoth David Edmondson on Nov 16 at  4:58 pm:
> On Tue, 16 Nov 2010 11:18:43 -0500, Austin Clements <amdragon@MIT.EDU> wrote:
> > [1] Except for 2 emacs tests that depend on author order.  What order
> > are matched authors *supposed* to be in?
> 
> In IRC cworth confirmed that he considered the current behaviour a bug,
> which is to say that the authors should be sorted by first occurrence in
> the thread.

Ah, perfect.  That's what my patch does most naturally.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Improved search speed by 2.5X
  2010-11-16 16:18 Improved search speed by 2.5X Austin Clements
  2010-11-16 16:58 ` David Edmondson
@ 2010-11-16 19:06 ` Carl Worth
  1 sibling, 0 replies; 4+ messages in thread
From: Carl Worth @ 2010-11-16 19:06 UTC (permalink / raw)
  To: Austin Clements, notmuch

[-- Attachment #1: Type: text/plain, Size: 928 bytes --]

On Tue, 16 Nov 2010 11:18:43 -0500, Austin Clements <amdragon@MIT.EDU> wrote:
> 'Lo.  I've been toying with switching to notmuch for a while, but the
> speed of search has been holding me back.  Perhaps I should upgrade
> the archaic machine I'm trying to run notmuch on, but I figured
> optimizing search would be a good way for me to cut my teeth on
> notmuch's code.

Hey, if you're going to keep optimizing notmuch, then I say, keep the
old machine! :-)

> Does this all seem reasonable?  My code passes the test suite [1], so
> I believe it to be fairly sound.

Yes. It all sounds quite good. Patches please!

> [1] Except for 2 emacs tests that depend on author order.  What order
> are matched authors *supposed* to be in?

There's an existing bug with author order, (sadly codified in some
current test cases). So this should be a good excuse for fixing that.

-Carl

-- 
carl.d.worth@intel.com

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2010-11-16 19:07 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-11-16 16:18 Improved search speed by 2.5X Austin Clements
2010-11-16 16:58 ` David Edmondson
2010-11-16 17:59   ` Austin Clements
2010-11-16 19:06 ` Carl Worth

Code repositories for project(s) associated with this public inbox

	https://yhetil.org/notmuch.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).