From: Austin Clements <amdragon@mit.edu>
To: Patrick Totzke <patricktotzke@googlemail.com>,
	Sebastian Spaeth <sebastian@sspaeth.de>
Cc: notmuch <notmuch@notmuchmail.org>
Subject: Re: one-time-iterators
Date: Fri, 27 May 2011 15:29:24 -0400
Message-ID: <BANLkTi=cZ50h50xf_OigTyjdfY_y34AX_g@mail.gmail.com>
In-Reply-To: <1306518628-sup-5396@brick>

On Fri, May 27, 2011 at 2:04 PM, Patrick Totzke
<patricktotzke@googlemail.com> wrote:
> Excerpts from Austin Clements's message of Fri May 27 03:41:44 +0100 2011:
>> >> > > Have you tried simply calling list() on your thread
>> >> > > iterator to see how expensive it is?  My bet is that it's quite cheap,
>> >> > > both memory-wise and CPU-wise.
>> >> > Funny thing:
>> >> >  q=Database().create_query('*')
>> >> >  time tlist = list(q.search_threads())
>> >> > raises a NotmuchError(STATUS.NOT_INITIALIZED) exception. For some reason
>> >> > the list constructor must read more than once from the iterator.
>> >> > So this is not an option, but even if it worked, it would show
>> >> > the same behaviour as my above test..
>> >>
>> >> Interesting.  Looks like the Threads class implements __len__ and that
>> >> its implementation exhausts the iterator.  Which isn't a great idea in
>> >> itself, but it turns out that Python's implementation of list() calls
>> >> __len__ if it's available (presumably to pre-size the list) before
>> >> iterating over the object, so it exhausts the iterator before even
>> >> using it.
>> >>
>> >> That said, if list(q.search_threads()) did work, it wouldn't give you
>> >> better performance than your experiment above.
> true. Nevertheless I think that list(q.search_threads())
> should be equivalent to [t for t in q.search_threads()], which is
> something to be fixed in the bindings. Should I file an issue somehow?
> Or is it enough to state this as a TODO here on the list?

Yes, they should be equivalent.

Sebastian was thinking about fixing the larger issue of generator
exhaustion, which would address this, though the performance would
depend on the cost of iterating twice.  This is why generators
shouldn't support __len__.  Unfortunately, it's probably hard to get
rid of at this point and I doubt there's a way to tell list() to
overlook the presence of a __len__ method.
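
To make the list()/__len__ interaction concrete, here is a minimal
sketch in plain Python. OneShotThreads is a hypothetical stand-in, not
the bindings' Threads class: it is its own one-time iterator and its
__len__ counts by consuming that iterator. It relies on the CPython
detail that list() asks for a length hint before pulling items, which
a list comprehension does not do.

```python
class OneShotThreads:
    """Hypothetical stand-in for a one-time iterator that defines __len__."""

    def __init__(self, items):
        self._source = iter(items)

    def __iter__(self):
        # The object is its own iterator, so it can be walked only once.
        return self

    def __next__(self):
        return next(self._source)

    def __len__(self):
        # Counts by walking the underlying iterator, which consumes it.
        return sum(1 for _ in self._source)


# list() consults __len__ first (CPython behaviour), so nothing is left
# to iterate over afterwards.
via_list = list(OneShotThreads(["t1", "t2", "t3"]))

# A comprehension never calls __len__, so every item is seen.
via_comprehension = [t for t in OneShotThreads(["t1", "t2", "t3"])]

print(via_list, via_comprehension)
```

In this sketch the exhausted iterator simply yields nothing, so list()
comes back empty; in the bindings the same sequence of calls surfaces
as the NotmuchError quoted above.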

>> >> > would it be very hard to implement a Query.search_thread_ids() ?
>> >> > This name is a bit off because it had to be done on a lower level.
>> >>
>> >> Lazily fetching the thread metadata on the C side would probably
>> >> address your problem automatically.  But what are you doing that
>> >> doesn't require any information about the threads you're manipulating?
>> > Agreed. Unfortunately, there seems to be no way to get a list of thread
>> > ids or a reliable iterator thereof by using the current python bindings.
>> > It would be enough for me to have the ids because then I could
>> > search for the few threads I actually need individually on demand.
>>
>> There's no way to do that from the C API either, so don't feel left
>> out.  ]:--8)  It seems to me that the right solution to your problem
>> is to make thread information lazy (effectively, everything gathered
>> in lib/thread.cc:_thread_add_message).  Then you could probably
>> materialize that iterator cheaply.
> Alright. I'll put this on my mental notmuch wish list and
> hope that someone will have addressed this before I run out of
> ideas how to improve my UI and have time to look at this myself.
> For now, I go with the [t.get_thread_id() for t in q.search_threads()]
> approach to cache the thread ids myself and live with the fact that
> this takes time for large result sets.
>
>> In fact, it's probably worth
>> trying a hack where you put dummy information in the thread object
>> from _thread_add_message and see how long it takes just to walk the
>> iterator (unfortunately I don't think profiling will help much here
>> because much of your time is probably spent waiting for I/O).
> I don't think I understand what you mean by dummy info in a thread
> object.

In _thread_add_message, rather than looking up the message's author,
subject, etc, just hard-code some dummy values.  Performance-wise,
this would simulate making the thread metadata lookup lazy, so you
could see if making this lazy would address your problem.
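
As a rough illustration of what "lazy" would mean here, the following
Python sketch defers the metadata lookup until it is first asked for.
All names in it (LazyThread, fake_lookup, the "thread-N" ids) are
hypothetical and only model the idea; the real change would live on
the C side in lib/thread.cc.

```python
from functools import cached_property


class LazyThread:
    """Hypothetical thread object that knows its id up front and
    fetches everything else only on first access."""

    def __init__(self, thread_id, load_metadata):
        self.thread_id = thread_id
        self._load_metadata = load_metadata

    @cached_property
    def metadata(self):
        # Runs at most once per thread; the result is cached afterwards.
        return self._load_metadata(self.thread_id)


calls = []


def fake_lookup(thread_id):
    # Stand-in for the expensive author/subject lookup.
    calls.append(thread_id)
    return {"subject": "subject of " + thread_id}


threads = [LazyThread("thread-%d" % i, fake_lookup) for i in range(1000)]

# Collecting ids triggers no metadata lookups at all.
ids = [t.thread_id for t in threads]

# Only the thread that is actually inspected pays for its metadata.
subject = threads[42].metadata["subject"]
```

With this shape, materializing the whole result set costs only the id
walk, and the per-thread cost is paid for the few threads that end up
being displayed.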
