Re: one-time-iterators - Patrick Totzke

unofficial mirror of notmuch@notmuchmail.org
 help / color / mirror / code / Atom feed

From: Patrick Totzke <patricktotzke@googlemail.com>
To: Austin Clements <amdragon@mit.edu>
Cc: notmuch <notmuch@notmuchmail.org>
Subject: Re: one-time-iterators
Date: Sat, 28 May 2011 09:58:03 +0100	[thread overview]
Message-ID: <1306572273-sup-9995@brick> (raw)
In-Reply-To: <BANLkTi=cZ50h50xf_OigTyjdfY_y34AX_g@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 5095 bytes --]

Excerpts from Austin Clements's message of Fri May 27 20:29:24 +0100 2011:
> On Fri, May 27, 2011 at 2:04 PM, Patrick Totzke
> <patricktotzke@googlemail.com> wrote:
> > Excerpts from Austin Clements's message of Fri May 27 03:41:44 +0100 2011:
> >> >> > > Have you tried simply calling list() on your thread
> >> >> > > iterator to see how expensive it is?  My bet is that it's quite cheap,
> >> >> > > both memory-wise and CPU-wise.
> >> >> > Funny thing:
> >> >> >  q=Database().create_query('*')
> >> >> >  time tlist = list(q.search_threads())
> >> >> > raises a NotmuchError(STATUS.NOT_INITIALIZED) exception. For some reason
> >> >> > the list constructor must read mere than once from the iterator.
> >> >> > So this is not an option, but even if it worked, it would show
> >> >> > the same behaviour as my above test..
> >> >>
> >> >> Interesting.  Looks like the Threads class implements __len__ and that
> >> >> its implementation exhausts the iterator.  Which isn't a great idea in
> >> >> itself, but it turns out that Python's implementation of list() calls
> >> >> __len__ if it's available (presumably to pre-size the list) before
> >> >> iterating over the object, so it exhausts the iterator before even
> >> >> using it.
> >> >>
> >> >> That said, if list(q.search_threads()) did work, it wouldn't give you
> >> >> better performance than your experiment above.
> > true. Nevertheless I think that list(q.search_threads())
> > should be equivalent to [t for t in q.search_threads()], which is
> > something to be fixed in the bindings. Should I file an issue somehow?
> > Or is enough to state this as a TODO here on the list?
> 
> Yes, they should be equivalent.
> 
> Sebastian was thinking about fixing the larger issue of generator
> exhaustion, which would address this, though the performance would
> depend on the cost of iterating twice.  This is why generators
> shouldn't support __len__.  Unfortunately, it's probably hard to get
> rid of at this point and I doubt there's a way to tell list() to
> overlook the presence of a __len__ method.
Why not simply removing support for __len__ in the Threads and Messages classes?

> >> >> > would it be very hard to implement a Query.search_thread_ids() ?
> >> >> > This name is a bit off because it had to be done on a lower level.
> >> >>
> >> >> Lazily fetching the thread metadata on the C side would probably
> >> >> address your problem automatically.  But what are you doing that
> >> >> doesn't require any information about the threads you're manipulating?
> >> > Agreed. Unfortunately, there seems to be no way to get a list of thread
> >> > ids or a reliable iterator thereof by using the current python bindings.
> >> > It would be enough for me to have the ids because then I could
> >> > search for the few threads I actually need individually on demand.
> >>
> >> There's no way to do that from the C API either, so don't feel left
> >> out.  ]:--8)  It seems to me that the right solution to your problem
> >> is to make thread information lazy (effectively, everything gathered
> >> in lib/thread.cc:_thread_add_message).  Then you could probably
> >> materialize that iterator cheaply.
> > Alright. I'll put this on my mental notmuch wish list and
> > hope that someone will have addressed this before I run out of
> > ideas how to improve my UI and have time to look at this myself.
> > For now, I go with the [t.get_thread_id for t in q.search_threads()]
> > approach to cache the thread ids myself and live with the fact that
> > this takes time for large result sets.
> >
> >> In fact, it's probably worth
> >> trying a hack where you put dummy information in the thread object
> >> from _thread_add_message and see how long it takes just to walk the
> >> iterator (unfortunately I don't think profiling will help much here
> >> because much of your time is probably spent waiting for I/O).
> > I don't think I understand what you mean by dummy info in a thread
> > object.
> 
> In _thread_add_message, rather than looking up the message's author,
> subject, etc, just hard-code some dummy values.  Performance-wise,
> this would simulate making the thread metadata lookup lazy, so you
> could see if making this lazy would address your problem.
Thanks for the clarification. I did that, and also commented out the
lower parts of _notmuch_thread_create and this did indeed improve
the performance, but not so much as I had hoped:

In [10]: q=Database().create_query('*')
In [11]: time T=[t for t in q.search_threads()]
CPU times: user 2.43 s, sys: 0.22 s, total: 2.65 s
Wall time: 2.66 s

And I have only about 8000 mails in my index.
Making thread lookups lazy would help, but here one would still
create a lot of unused (empty) thread objects.
The easiest solution to my problem would in my opinion be
a function that queries only for thread ids without instanciating them.
But I can't think of any other use case than mine for this
so I guess many of you would be against adding this to the API?
/p

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

next prev parent reply	other threads:[~2011-05-28  8:58 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-05-26  8:31 one-time-iterators Patrick Totzke
2011-05-26 17:20 ` one-time-iterators Carl Worth
2011-05-26 20:18   ` one-time-iterators Austin Clements
2011-05-26 21:47     ` one-time-iterators Patrick Totzke
     [not found]     ` <1306442683-sup-9315@brick>
     [not found]       ` <20110526214302.GR29861@mit.edu>
2011-05-26 22:22         ` one-time-iterators Patrick Totzke
2011-05-27  2:41           ` one-time-iterators Austin Clements
2011-05-27 18:04             ` one-time-iterators Patrick Totzke
2011-05-27 19:29               ` one-time-iterators Austin Clements
2011-05-28  8:58                 ` Patrick Totzke [this message]
2011-05-31  1:05                   ` one-time-iterators Austin Clements
2011-05-26 21:16   ` one-time-iterators Michael Hudson-Doyle

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://notmuchmail.org/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1306572273-sup-9995@brick \
    --to=patricktotzke@googlemail.com \
    --cc=amdragon@mit.edu \
    --cc=notmuch@notmuchmail.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://yhetil.org/notmuch.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).