Date: Mon, 30 May 2011 21:05:15 -0400
From: Austin Clements
To: Patrick Totzke
Cc: notmuch
Subject: Re: one-time-iterators
Message-ID: <20110531010515.GW29861@mit.edu>
In-Reply-To: <1306572273-sup-9995@brick>
References: <1306397849-sup-3304@brick> <877h9d9y5m.fsf@yoom.home.cworth.org>
 <1306442683-sup-9315@brick> <20110526214302.GR29861@mit.edu>
 <1306446621-sup-3184@brick> <1306518628-sup-5396@brick>
 <1306572273-sup-9995@brick>
List-Id: "Use and development of the notmuch mail system."
Quoth Patrick Totzke on May 28 at 9:58 am:
> Excerpts from Austin Clements's message of Fri May 27 20:29:24 +0100 2011:
> > On Fri, May 27, 2011 at 2:04 PM, Patrick Totzke wrote:
> > > Excerpts from Austin Clements's message of Fri May 27 03:41:44 +0100 2011:
> > >> >> > > Have you tried simply calling list() on your thread
> > >> >> > > iterator to see how expensive it is?  My bet is that it's quite cheap,
> > >> >> > > both memory-wise and CPU-wise.
> > >> >> > Funny thing:
> > >> >> >  q=Database().create_query('*')
> > >> >> >  time tlist = list(q.search_threads())
> > >> >> > raises a NotmuchError(STATUS.NOT_INITIALIZED) exception. For some reason
> > >> >> > the list constructor must read more than once from the iterator.
> > >> >> > So this is not an option, but even if it worked, it would show
> > >> >> > the same behaviour as my above test..
> > >> >>
> > >> >> Interesting.  Looks like the Threads class implements __len__ and that
> > >> >> its implementation exhausts the iterator.  Which isn't a great idea in
> > >> >> itself, but it turns out that Python's implementation of list() calls
> > >> >> __len__ if it's available (presumably to pre-size the list) before
> > >> >> iterating over the object, so it exhausts the iterator before even
> > >> >> using it.
> > >> >>
> > >> >> That said, if list(q.search_threads()) did work, it wouldn't give you
> > >> >> better performance than your experiment above.
> > > True. Nevertheless I think that list(q.search_threads())
> > > should be equivalent to [t for t in q.search_threads()], which is
> > > something to be fixed in the bindings. Should I file an issue somehow?
> > > Or is it enough to state this as a TODO here on the list?
> >
> > Yes, they should be equivalent.
> >
> > Sebastian was thinking about fixing the larger issue of generator
> > exhaustion, which would address this, though the performance would
> > depend on the cost of iterating twice.  This is why generators
> > shouldn't support __len__.  Unfortunately, it's probably hard to get
> > rid of at this point and I doubt there's a way to tell list() to
> > overlook the presence of a __len__ method.
> Why not simply remove support for __len__ in the Threads and Messages
> classes?

Presumably len is there because things use it.  On the other hand,
given the issues surrounding len, I suspect anything that's using it
is already a mess.  (See the short illustration of the list()/__len__
interaction at the end of this message.)

> > >> >> > would it be very hard to implement a Query.search_thread_ids()?
> > >> >> > This name is a bit off because it had to be done on a lower level.
> > >> >>
> > >> >> Lazily fetching the thread metadata on the C side would probably
> > >> >> address your problem automatically.  But what are you doing that
> > >> >> doesn't require any information about the threads you're manipulating?
> > >> > Agreed. Unfortunately, there seems to be no way to get a list of thread
> > >> > ids or a reliable iterator thereof by using the current python bindings.
> > >> > It would be enough for me to have the ids because then I could
> > >> > search for the few threads I actually need individually on demand.
> > >>
> > >> There's no way to do that from the C API either, so don't feel left
> > >> out.  ]:--8)  It seems to me that the right solution to your problem
> > >> is to make thread information lazy (effectively, everything gathered
> > >> in lib/thread.cc:_thread_add_message).
> > >> Then you could probably materialize that iterator cheaply.
> > > Alright. I'll put this on my mental notmuch wish list and
> > > hope that someone will have addressed this before I run out of
> > > ideas how to improve my UI and have time to look at this myself.
> > > For now, I go with the [t.get_thread_id() for t in q.search_threads()]
> > > approach to cache the thread ids myself and live with the fact that
> > > this takes time for large result sets.
> > >
> > >> In fact, it's probably worth
> > >> trying a hack where you put dummy information in the thread object
> > >> from _thread_add_message and see how long it takes just to walk the
> > >> iterator (unfortunately I don't think profiling will help much here
> > >> because much of your time is probably spent waiting for I/O).
> > > I don't think I understand what you mean by dummy info in a thread
> > > object.
> >
> > In _thread_add_message, rather than looking up the message's author,
> > subject, etc, just hard-code some dummy values.  Performance-wise,
> > this would simulate making the thread metadata lookup lazy, so you
> > could see if making this lazy would address your problem.
> Thanks for the clarification. I did that, and also commented out the
> lower parts of _notmuch_thread_create, and this did indeed improve
> the performance, but not as much as I had hoped:
>
> In [10]: q=Database().create_query('*')
> In [11]: time T=[t for t in q.search_threads()]
> CPU times: user 2.43 s, sys: 0.22 s, total: 2.65 s
> Wall time: 2.66 s
>
> And I have only about 8000 mails in my index.
> Making thread lookups lazy would help, but here one would still
> create a lot of unused (empty) thread objects.
> The easiest solution to my problem would, in my opinion, be
> a function that queries only for thread ids without instantiating them.
> But I can't think of any use case other than mine for this,
> so I guess many of you would be against adding this to the API?

Well, you may be in a pickle.  I seriously doubt creating the thread
objects is your problem, though *expanding* the threads may be your
main expense now.  I'd recommend actual profiling at this point.

It may be possible to lazily expand threads, too.  I'm not sure how
familiar you are with the C code for this.  notmuch does *not* index
threads; it indexes individual messages.  When you do a thread query,
it finds all *messages* that match that query, and then, for each
message, it does another query to find all messages in the same
thread.  You might be able to do just the original message query
eagerly (which takes virtually no time) and then do the thread
expansion queries only as the iterator demands them.  You may
encounter concurrency issues with a scheme like this, though as long
as notmuch can't split threads, I think it would be okay.
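For reference, here is a minimal, self-contained illustration of the
list()/__len__ interaction described above.  The OneShot class is a
made-up stand-in for the bindings' Threads wrapper (a single-pass
iterator plus a __len__ that walks it); the real bindings raise
NotmuchError(STATUS.NOT_INITIALIZED) instead of silently yielding
nothing, but the mechanism is the same:

    class OneShot:
        """Made-up stand-in for Threads: wraps a single-pass iterator
        and, like the bindings, answers len() by walking it."""
        def __init__(self, items):
            self._it = iter(items)

        def __len__(self):
            # Counting consumes the one-shot iterator, just as
            # Threads.__len__ exhausts the underlying notmuch iterator.
            return sum(1 for _ in self._it)

        def __iter__(self):
            return self._it

    print(list(OneShot('abc')))         # [] -- list() called __len__ first
    print([x for x in OneShot('abc')])  # ['a', 'b', 'c'] -- no __len__ call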
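And here is a rough sketch, in terms of the same bindings calls used in
the snippets above (Database, create_query, search_threads,
get_thread_id), of the client-side version of that lazy approach: cache
only the thread ids up front, then run a thread:<id> query for a thread
when it is actually needed.  The helper names are invented for
illustration, and the import assumes the bindings are installed as the
notmuch module:

    from notmuch import Database

    def cache_thread_ids(db, querystring='*'):
        # Cheap eager pass: walk the result set once, keeping only the ids.
        q = db.create_query(querystring)
        return [t.get_thread_id() for t in q.search_threads()]

    def lookup_thread(db, tid):
        # Lazy per-thread expansion: one extra query per thread, but only
        # for the few threads that are actually needed right now.
        q = db.create_query('thread:' + tid)
        for t in q.search_threads():
            return t        # thread ids are unique; the first hit is it
        return None         # thread vanished since the id was cached

    # Usage sketch:
    db = Database()         # opens the default database read-only
    tids = cache_thread_ids(db)
    first = lookup_thread(db, tids[0]) if tids else None

This keeps the eager pass cheap at the cost of one extra query per
displayed thread, and each query object is only walked once, which also
sidesteps the exhausted-iterator problem above.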