From: Mark Walters
To: Austin Clements
Subject: Re: [RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag
In-Reply-To: <20120128183340.GD17991@mit.edu>
References: <20120124011609.GX16740@mit.edu> <1327367923-18228-2-git-send-email-markwalters1009@gmail.com> <20120124024521.GY16740@mit.edu> <874nvg6qxn.fsf@qmul.ac.uk> <20120128183340.GD17991@mit.edu>
User-Agent: Notmuch/0.11+132~g30df010 (http://notmuchmail.org) Emacs/23.2.1 (i486-pc-linux-gnu)
Date: Sat, 28 Jan 2012 23:57:08 +0000
Message-ID: <8739azqt2j.fsf@qmul.ac.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: notmuch@notmuchmail.org
List-Id: "Use and development of the notmuch mail system."

On Sat, 28 Jan 2012 13:33:40 -0500, Austin Clements wrote:
> Quoth Mark Walters on Jan 28 at 10:51 am:
> >
> > > >      exclude_query = _notmuch_exclude_tags (query, final_query);
> > > >
> > > > -    final_query = Xapian::Query (Xapian::Query::OP_AND_NOT,
> > > > -                                 final_query, exclude_query);
> > > > +    enquire.set_weighting_scheme (Xapian::BoolWeight());
> > > > +    enquire.set_query (exclude_query);
> > > > +
> > > > +    mset = enquire.get_mset (0, notmuch->xapian_db->get_doccount ());
> > > > +
> > > > +    GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof (unsigned int));
> > > > +
> > > > +    for (iterator = mset.begin (); iterator != mset.end (); iterator++)
> > > > +    {
> > > > +        unsigned int doc_id = *iterator;
> > > > +        g_array_append_val (excluded_doc_ids, doc_id);
> > > > +    }
> > > > +    messages->base.excluded_doc_ids = talloc (query, _notmuch_doc_id_set);
> > > > +    _notmuch_doc_id_set_init (query,
> > > > +                              messages->base.excluded_doc_ids,
> > > > +                              excluded_doc_ids);
> > >
> > > This might be inefficient for message-only queries, since it will
> > > fetch *all* excluded docids.  This highlights a basic difference
> > > between message and thread search: thread search can return messages
> > > that don't match the original query and hence needs to know all
> > > potentially excluded messages, while message search can only return
> > > messages that match the original query.
> >
> > I now have some benchmarks (not run enough times to be hugely accurate
> > so ignore minor differences). The full results are below. The summary
> > is:
> >
> > Large-archive = 1 100 000 messages in 290 000 threads (about 10 years of
> > lkml). I mark 1 000 000 deleted
> > Small-archive = 70 000 messages in 35 000 threads. 10 000 marked
> > deleted.
> >
> > Doing the initial exclude work on the big collection takes about 0.8s
> > and on the small collection about 0.01s. So any query to the big
> > collection takes at least 0.8s longer and this all occurs before any
> > results appear.
>
> Interesting.  Do you know where that time is spent?
>
> Also, it might be reasonable to assume that no more than, say, 10% of
> a person's mail store is excluded, but maybe that depends on how
> people use this feature.
>
> > I then implemented the exclude doing it once for each thread query in
> > _notmuch_create_thread. Roughly this made any query 50% slower.
>
> That's not terrible.
>
> > In normal front end use even the 0.8s is not totally unusable, but it is
> > totally unacceptable in the backend where a user might do something like
> >
> > for i in ` notmuch search --output=threads from:xxx ` ;
> > do
> >   notmuch search --output=messages $i;
> > done
> >
> > to list all messages in all matching threads.
> >
> > So I think my conclusions are:
> >
> > (1) message only queries must be done without the full exclude.
> > (2) thread queries which only match one message should not do the full
> > exclude
> > (3) it would be nice to switch between the two approaches depending on
> > size but I don't see how to do that without extra(!) queries
> > (4) One possible might be do something that say does thirty threads with
> > the by thread method and then if not finished does the full exclude.
> > (5) thread-by-thread might be best for Jani's limit-match
> > id:"1327692900-22926-1-git-send-email-jani@nikula.org"
> >
> > Obviously, anything setting an exclude flag like this will be slower
> > (since it is doing more work): the question is are either of these (or a
> > combination like (4) above) acceptable?
>
> Or only mark matched messages as excluded.
>
> Here's another idea (actually, a rehash of an old idea).  For message
> search do two queries, the original query and " AND
> ", and use this to keep everything in order and mark excluded
> messages.  For thread search, use message search results so it's easy
> to both sort by unexcluded messages and include fully-excluded
> threads, but compute the excluded flag (either just for unmatched
> messages or for all messages) by examining each message's tags
> directly (which thread_add_message already iterates over, so this is
> easy and won't add any overhead).  If the excluded query is fast,
> which I think it will be, I think this should get the best of all
> worlds and be fairly straightforward to implement (no asymmetries
> between the queries used for message and thread search).  It would be
> easy and worth it to run the excluded query by hand on your test
> corpus; I suspect it will be much faster than 0.8s because the query
> already uses "Tmail", which is huge and doesn't seem to slow things
> down.

I have tried your suggestion (still marking all messages) and it does
seem the way to go: the difference in speed from master is small,
between 0 and 10% for most of the tests.
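
For anyone following along, the two-query idea for message search could be sketched roughly as below. This is only an illustration in plain C++ with no Xapian calls, and it is not the patch itself: `Match`, `mark_excluded` and the two doc-id lists are hypothetical names. The first list stands in for the results of the original query in sort order, the second for the results of the original query ANDed with the exclude tags.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for one search result; not a notmuch type.
struct Match {
    std::uint32_t doc_id;
    bool excluded;
};

// `ordered`: doc ids matched by the original query, already in the
// requested sort order (assumed input).
// `also_excluded`: doc ids matched by "original query AND exclude tags"
// (assumed input); by construction a subset of `ordered`.
std::vector<Match> mark_excluded(const std::vector<std::uint32_t>& ordered,
                                 const std::vector<std::uint32_t>& also_excluded)
{
    // Build a set once so each membership test is O(1) on average.
    std::unordered_set<std::uint32_t> excluded(also_excluded.begin(),
                                               also_excluded.end());

    std::vector<Match> out;
    out.reserve(ordered.size());
    for (std::uint32_t id : ordered) {
        // Keep the original ordering; only the flag depends on the
        // second query.
        out.push_back({id, excluded.count(id) != 0});
    }
    return out;
}
```

The first list supplies the ordering and the second only the flag, so the sort order is untouched and excluded messages are still returned, just marked. Only messages that match the original query are ever looked at, which is why this avoids fetching every excluded docid up front.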

My implementation seems to work, and I will post it in reply to this
thread. The library code is reasonable (although it is unclear whether
messages matching an exclude tag that has itself been specified in the
query should be marked as excluded). The cli stuff needs thought (about
what it should do rather than how to do it).

I won't post the emacs stuff yet, but when I merge my various bits
together I should get different colour headerlines for excluded
messages, and they should initially be shown collapsed.

Best wishes

Mark