one-time-iterators

unofficial mirror of notmuch@notmuchmail.org
 help / color / mirror / code / Atom feed

* one-time-iterators
@ 2011-05-26  8:31 Patrick Totzke
  2011-05-26 17:20 ` one-time-iterators Carl Worth
  0 siblings, 1 reply; 11+ messages in thread
From: Patrick Totzke @ 2011-05-26  8:31 UTC (permalink / raw)
  To: notmuch

[-- Attachment #1: Type: text/plain, Size: 1080 bytes --]

Hi!

I'm playing around with one-time iterators Threads in the python bindings
in combination with changes to the database.
Particularly, I'd like to iterate over a list of threads partially,
change the tags of a single thread and afterwards continue iterating.
Of course I get "Xapian DB-changed" exceptions.
As I see it, there are two possible solutions:
First, I iterate over all the threads and cache all the info i'd like to 
extract from each one (or just the id and re-query the info on demand).
The problem here is, that this list of thread-ids might be huge and
caching it will block my code.
Secondly, I could replace the iterator i'm using with a fresh one after 
db-changes. But then I'd need some magic that discards an initial
portion of the (fresh) iteration and/or updates the pre-db-change
partial iteration.

Wow. This reads really complicated. All I want to say is:
if I change tags in my search-results view, I get Xapian errors :)
The question: How do you solve this in the emacs code?
do you store all tids of a query? 

Thanks,
/p

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: one-time-iterators
  2011-05-26  8:31 one-time-iterators Patrick Totzke
@ 2011-05-26 17:20 ` Carl Worth
  2011-05-26 20:18   ` one-time-iterators Austin Clements
  2011-05-26 21:16   ` one-time-iterators Michael Hudson-Doyle
  0 siblings, 2 replies; 11+ messages in thread
From: Carl Worth @ 2011-05-26 17:20 UTC (permalink / raw)
  To: Patrick Totzke, notmuch

[-- Attachment #1: Type: text/plain, Size: 1157 bytes --]

On Thu, 26 May 2011 09:31:19 +0100, Patrick Totzke <patricktotzke@googlemail.com> wrote:
> Wow. This reads really complicated. All I want to say is:
> if I change tags in my search-results view, I get Xapian errors :)

Yes, that's frustrating. I wish that we had a more reliable interface at
the notmuch library level. But I'm not entirely sure what would be the
best way to do this.

> The question: How do you solve this in the emacs code?
> do you store all tids of a query? 

The emacs code does not use the notmuch library interface like your
python bindings do. Instead, it uses the notmuch command-line tool, (and
buffers up the text output by it). The support for asynchronous
operations in the emacs interface means that it's likely possible
someone could run into a similar problem:

	1. Start a search returning a *lot* of results

	2. When the first results come in, make some tag changes

	3. See if the original search aborts

I may have even had this happen to me before, but if I did I've never
actually noticed it. I don't know what a good answer might be for this
problem.

-Carl

-- 
carl.d.worth@intel.com

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: one-time-iterators
  2011-05-26 17:20 ` one-time-iterators Carl Worth
@ 2011-05-26 20:18   ` Austin Clements
  2011-05-26 21:47     ` one-time-iterators Patrick Totzke
       [not found]     ` <1306442683-sup-9315@brick>
  2011-05-26 21:16   ` one-time-iterators Michael Hudson-Doyle
  1 sibling, 2 replies; 11+ messages in thread
From: Austin Clements @ 2011-05-26 20:18 UTC (permalink / raw)
  To: Carl Worth, Patrick Totzke; +Cc: notmuch

On May 26, 2011 1:20 PM, "Carl Worth" <cworth@cworth.org> wrote:
> > The question: How do you solve this in the emacs code?
> > do you store all tids of a query?
>
> The emacs code does not use the notmuch library interface like your
> python bindings do. Instead, it uses the notmuch command-line tool, (and
> buffers up the text output by it). The support for asynchronous
> operations in the emacs interface means that it's likely possible
> someone could run into a similar problem:
>
>        1. Start a search returning a *lot* of results
>
>        2. When the first results come in, make some tag changes
>
>        3. See if the original search aborts
>
> I may have even had this happen to me before, but if I did I've never
> actually noticed it. I don't know what a good answer might be for this
> problem.

I proposed a solution to this problem a while ago
(id:"AANLkTi=KOx8aTJipkiArFVjEHE6zt_JypoASMiiAWBZ6@mail.gmail.com"),
though I haven't tried implementing it yet.

Though, Patrick, that solution doesn't address your problem.  On the
other hand, it's not clear to me what concurrent access semantics
you're actually expecting.  I suspect you don't want the remaining
iteration to reflect the changes, since your changes could equally
well have affected earlier iteration results.  But if you want a
consistent view of your query results, something's going to have to
materialize that iterator, and it might as well be you (or Xapian
would need more sophisticated concurrency control than it has).  But
this shouldn't be expensive because all you need to materialize are
the document ids; you shouldn't need to eagerly fetch the per-thread
information.  Have you tried simply calling list() on your thread
iterator to see how expensive it is?  My bet is that it's quite cheap,
both memory-wise and CPU-wise.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: one-time-iterators
  2011-05-26 17:20 ` one-time-iterators Carl Worth
  2011-05-26 20:18   ` one-time-iterators Austin Clements
@ 2011-05-26 21:16   ` Michael Hudson-Doyle
  1 sibling, 0 replies; 11+ messages in thread
From: Michael Hudson-Doyle @ 2011-05-26 21:16 UTC (permalink / raw)
  To: Carl Worth, Patrick Totzke, notmuch

On Thu, 26 May 2011 10:20:21 -0700, Carl Worth <cworth@cworth.org> wrote:
> On Thu, 26 May 2011 09:31:19 +0100, Patrick Totzke <patricktotzke@googlemail.com> wrote:
> > Wow. This reads really complicated. All I want to say is:
> > if I change tags in my search-results view, I get Xapian errors :)
> 
> Yes, that's frustrating. I wish that we had a more reliable interface at
> the notmuch library level. But I'm not entirely sure what would be the
> best way to do this.
> 
> > The question: How do you solve this in the emacs code?
> > do you store all tids of a query? 
> 
> The emacs code does not use the notmuch library interface like your
> python bindings do. Instead, it uses the notmuch command-line tool, (and
> buffers up the text output by it). The support for asynchronous
> operations in the emacs interface means that it's likely possible
> someone could run into a similar problem:
> 
> 	1. Start a search returning a *lot* of results
> 
> 	2. When the first results come in, make some tag changes
> 
> 	3. See if the original search aborts
> 
> I may have even had this happen to me before, but if I did I've never
> actually noticed it. I don't know what a good answer might be for this
> problem.

I've had exactly this happen to me.  Yay for post-vacation email
mountains and slow laptop drives...

Cheers,
mwh

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: one-time-iterators
  2011-05-26 20:18   ` one-time-iterators Austin Clements
@ 2011-05-26 21:47     ` Patrick Totzke
       [not found]     ` <1306442683-sup-9315@brick>
  1 sibling, 0 replies; 11+ messages in thread
From: Patrick Totzke @ 2011-05-26 21:47 UTC (permalink / raw)
  To: Austin Clements; +Cc: notmuch

[-- Attachment #1: Type: text/plain, Size: 4001 bytes --]

hehe, did it again (dropping the list from cc). I need to stop
using sup :P thanks Austin.

Excerpts from Carl Worth's message of Thu May 26 18:20:21 +0100 2011:
> On Thu, 26 May 2011 09:31:19 +0100, Patrick Totzke <patricktotzke@googlemail.com> wrote:
> > Wow. This reads really complicated. All I want to say is:
> > if I change tags in my search-results view, I get Xapian errors :)
> 
> Yes, that's frustrating. I wish that we had a more reliable interface at
> the notmuch library level. But I'm not entirely sure what would be the
> best way to do this.
Actually, I expected something like this. For this reason each sup instance 
locks its index.
At the moment I'm going for custom wrapper classes around notmuch.Thread
and notmuch.Messages that cache the result of the calls relevant for me.
But the real issue seems to be the iterator:
It takes an awful lot of time just to copy the thread ids of all threads from 
large a query result.

I tried the following in ipython:
 q=Database().create_query('*')
 time tids = [t.get_thread_id() for t in q.search_threads()]

which results in
CPU times: user 7.64 s, sys: 2.06 s, total: 9.70 s
Wall time: 9.84 s

It would really help if the Query object could return an iterator
of thread-ids that makes this copying unnecessary. Is it possible to
implement this? Or would this require the same amount of copying to happen
at a lower level?
I have not looked into the code for the bindings or the C code so far,
but I guess the Query.search_threads() translates to some 
"SELECT id,morestuff from threads"
where for me a "SELECT is from threads" would totally suffice. Copying 
(in the C code) only the ids so some list that yields an iterator should be faster.

> > The question: How do you solve this in the emacs code?
> > do you store all tids of a query? 
> 
> The emacs code does not use the notmuch library interface like your
> python bindings do. Instead, it uses the notmuch command-line tool, (and
> buffers up the text output by it). 
Ahh ok. Thanks for the explanation.

Excerpts from Austin Clements's message of Thu May 26 21:18:53 +0100 2011:
> I proposed a solution to this problem a while ago
> (id:"AANLkTi=KOx8aTJipkiArFVjEHE6zt_JypoASMiiAWBZ6@mail.gmail.com"),
> though I haven't tried implementing it yet.
Sorry, I wasn't on the list back then.

> Though, Patrick, that solution doesn't address your problem.  On the
> other hand, it's not clear to me what concurrent access semantics
> you're actually expecting.  I suspect you don't want the remaining
> iteration to reflect the changes, since your changes could equally
> well have affected earlier iteration results. 
That's right. 
> But if you want a
> consistent view of your query results, something's going to have to
> materialize that iterator, and it might as well be you (or Xapian
> would need more sophisticated concurrency control than it has).  But
> this shouldn't be expensive because all you need to materialize are
> the document ids; you shouldn't need to eagerly fetch the per-thread
> information.  
I thought so, but it seems that Query.search_threads() already
caches more than the id of each item. Which is as expected
because it is designed to return thread objects, not their ids.
As you can see above, this _is_ too expensive for me.

> Have you tried simply calling list() on your thread
> iterator to see how expensive it is?  My bet is that it's quite cheap,
> both memory-wise and CPU-wise.
Funny thing:
 q=Database().create_query('*')
 time tlist = list(q.search_threads())
raises a NotmuchError(STATUS.NOT_INITIALIZED) exception. For some reason
the list constructor must read mere than once from the iterator.
So this is not an option, but even if it worked, it would show
the same behaviour as my above test..

would it be very hard to implement a Query.search_thread_ids() ?
This name is a bit off because it had to be done on a lower level.
Cheers,
/p

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: one-time-iterators
       [not found]       ` <20110526214302.GR29861@mit.edu>
@ 2011-05-26 22:22         ` Patrick Totzke
  2011-05-27  2:41           ` one-time-iterators Austin Clements
  0 siblings, 1 reply; 11+ messages in thread
From: Patrick Totzke @ 2011-05-26 22:22 UTC (permalink / raw)
  To: Austin Clements; +Cc: notmuch

[-- Attachment #1: Type: text/plain, Size: 3981 bytes --]

Excerpts from Austin Clements's message of Thu May 26 22:43:02 +0100 2011:
> http://notmuch.198994.n3.nabble.com/notmuch-s-idea-of-concurrency-failing-an-invocation-tp2373468p2565731.html
ah, good old peterson :P thanks.

> > > Though, Patrick, that solution doesn't address your problem.  On the
> > > other hand, it's not clear to me what concurrent access semantics
> > > you're actually expecting.  I suspect you don't want the remaining
> > > iteration to reflect the changes, since your changes could equally
> > > well have affected earlier iteration results. 
> > That's right. 
> > > But if you want a
> > > consistent view of your query results, something's going to have to
> > > materialize that iterator, and it might as well be you (or Xapian
> > > would need more sophisticated concurrency control than it has).  But
> > > this shouldn't be expensive because all you need to materialize are
> > > the document ids; you shouldn't need to eagerly fetch the per-thread
> > > information.  
> > I thought so, but it seems that Query.search_threads() already
> > caches more than the id of each item. Which is as expected
> > because it is designed to return thread objects, not their ids.
> > As you can see above, this _is_ too expensive for me.
> 
> I'd forgotten that constructing threads on the C side was eager about
> the thread tags, author list and subject (which, without Istvan's
> proposed patch, even requires opening and parsing the message file).
> This is probably what's killing you.
> 
> Out of curiosity, what is your situation that you won't wind up paying
> the cost of this iteration one way or the other and that the latency
> of doing these tag changes matters?

I'm trying to implement a terminal interface for notmuch in python
that resembles sup.
For the search results view, i read an initial portion from a Threads iterator 
to fill my teminal window with threadline-widgets. Obviously, for a
large number of results I don't want to go through all of them.
The problem arises if you toggle a tag on the selected threadline and afterwards
continue to scroll down.

> > > Have you tried simply calling list() on your thread
> > > iterator to see how expensive it is?  My bet is that it's quite cheap,
> > > both memory-wise and CPU-wise.
> > Funny thing:
> >  q=Database().create_query('*')
> >  time tlist = list(q.search_threads())
> > raises a NotmuchError(STATUS.NOT_INITIALIZED) exception. For some reason
> > the list constructor must read mere than once from the iterator.
> > So this is not an option, but even if it worked, it would show
> > the same behaviour as my above test..
> 
> Interesting.  Looks like the Threads class implements __len__ and that
> its implementation exhausts the iterator.  Which isn't a great idea in
> itself, but it turns out that Python's implementation of list() calls
> __len__ if it's available (presumably to pre-size the list) before
> iterating over the object, so it exhausts the iterator before even
> using it.
> 
> That said, if list(q.search_threads()) did work, it wouldn't give you
> better performance than your experiment above.
> 
> > would it be very hard to implement a Query.search_thread_ids() ?
> > This name is a bit off because it had to be done on a lower level.
> 
> Lazily fetching the thread metadata on the C side would probably
> address your problem automatically.  But what are you doing that
> doesn't require any information about the threads you're manipulating?
Agreed. Unfortunately, there seems to be no way to get a list of thread
ids or a reliable iterator thereof by using the current python bindings.
It would be enough for me to have the ids because then I could
search for the few threads I actually need individually on demand.

Here is the branch in which I'm trying out these things. Sorry for the
messy code, its late :P
https://github.com/pazz/notmuch-gui/tree/toggletags

/p

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: one-time-iterators
  2011-05-26 22:22         ` one-time-iterators Patrick Totzke
@ 2011-05-27  2:41           ` Austin Clements
  2011-05-27 18:04             ` one-time-iterators Patrick Totzke
  0 siblings, 1 reply; 11+ messages in thread
From: Austin Clements @ 2011-05-27  2:41 UTC (permalink / raw)
  To: Patrick Totzke; +Cc: notmuch

On Thu, May 26, 2011 at 6:22 PM, Patrick Totzke
<patricktotzke@googlemail.com> wrote:
> Excerpts from Austin Clements's message of Thu May 26 22:43:02 +0100 2011:
>> > > Though, Patrick, that solution doesn't address your problem.  On the
>> > > other hand, it's not clear to me what concurrent access semantics
>> > > you're actually expecting.  I suspect you don't want the remaining
>> > > iteration to reflect the changes, since your changes could equally
>> > > well have affected earlier iteration results.
>> > That's right.
>> > > But if you want a
>> > > consistent view of your query results, something's going to have to
>> > > materialize that iterator, and it might as well be you (or Xapian
>> > > would need more sophisticated concurrency control than it has).  But
>> > > this shouldn't be expensive because all you need to materialize are
>> > > the document ids; you shouldn't need to eagerly fetch the per-thread
>> > > information.
>> > I thought so, but it seems that Query.search_threads() already
>> > caches more than the id of each item. Which is as expected
>> > because it is designed to return thread objects, not their ids.
>> > As you can see above, this _is_ too expensive for me.
>>
>> I'd forgotten that constructing threads on the C side was eager about
>> the thread tags, author list and subject (which, without Istvan's
>> proposed patch, even requires opening and parsing the message file).
>> This is probably what's killing you.
>>
>> Out of curiosity, what is your situation that you won't wind up paying
>> the cost of this iteration one way or the other and that the latency
>> of doing these tag changes matters?
>
> I'm trying to implement a terminal interface for notmuch in python
> that resembles sup.
> For the search results view, i read an initial portion from a Threads iterator
> to fill my teminal window with threadline-widgets. Obviously, for a
> large number of results I don't want to go through all of them.
> The problem arises if you toggle a tag on the selected threadline and afterwards
> continue to scroll down.

Ah, that makes sense.

>> > > Have you tried simply calling list() on your thread
>> > > iterator to see how expensive it is?  My bet is that it's quite cheap,
>> > > both memory-wise and CPU-wise.
>> > Funny thing:
>> >  q=Database().create_query('*')
>> >  time tlist = list(q.search_threads())
>> > raises a NotmuchError(STATUS.NOT_INITIALIZED) exception. For some reason
>> > the list constructor must read mere than once from the iterator.
>> > So this is not an option, but even if it worked, it would show
>> > the same behaviour as my above test..
>>
>> Interesting.  Looks like the Threads class implements __len__ and that
>> its implementation exhausts the iterator.  Which isn't a great idea in
>> itself, but it turns out that Python's implementation of list() calls
>> __len__ if it's available (presumably to pre-size the list) before
>> iterating over the object, so it exhausts the iterator before even
>> using it.
>>
>> That said, if list(q.search_threads()) did work, it wouldn't give you
>> better performance than your experiment above.
>>
>> > would it be very hard to implement a Query.search_thread_ids() ?
>> > This name is a bit off because it had to be done on a lower level.
>>
>> Lazily fetching the thread metadata on the C side would probably
>> address your problem automatically.  But what are you doing that
>> doesn't require any information about the threads you're manipulating?
> Agreed. Unfortunately, there seems to be no way to get a list of thread
> ids or a reliable iterator thereof by using the current python bindings.
> It would be enough for me to have the ids because then I could
> search for the few threads I actually need individually on demand.

There's no way to do that from the C API either, so don't feel left
out.  ]:--8)  It seems to me that the right solution to your problem
is to make thread information lazy (effectively, everything gathered
in lib/thread.cc:_thread_add_message).  Then you could probably
materialize that iterator cheaply.  In fact, it's probably worth
trying a hack where you put dummy information in the thread object
from _thread_add_message and see how long it takes just to walk the
iterator (unfortunately I don't think profiling will help much here
because much of your time is probably spent waiting for I/O).

I don't think there would be any downside to doing this for eager
consumers like the CLI.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: one-time-iterators
  2011-05-27  2:41           ` one-time-iterators Austin Clements
@ 2011-05-27 18:04             ` Patrick Totzke
  2011-05-27 19:29               ` one-time-iterators Austin Clements
  0 siblings, 1 reply; 11+ messages in thread
From: Patrick Totzke @ 2011-05-27 18:04 UTC (permalink / raw)
  To: Austin Clements; +Cc: Patrick Totzke, notmuch

[-- Attachment #1: Type: text/plain, Size: 3312 bytes --]

Excerpts from Austin Clements's message of Fri May 27 03:41:44 +0100 2011:
> >> > > Have you tried simply calling list() on your thread
> >> > > iterator to see how expensive it is?  My bet is that it's quite cheap,
> >> > > both memory-wise and CPU-wise.
> >> > Funny thing:
> >> >  q=Database().create_query('*')
> >> >  time tlist = list(q.search_threads())
> >> > raises a NotmuchError(STATUS.NOT_INITIALIZED) exception. For some reason
> >> > the list constructor must read mere than once from the iterator.
> >> > So this is not an option, but even if it worked, it would show
> >> > the same behaviour as my above test..
> >>
> >> Interesting.  Looks like the Threads class implements __len__ and that
> >> its implementation exhausts the iterator.  Which isn't a great idea in
> >> itself, but it turns out that Python's implementation of list() calls
> >> __len__ if it's available (presumably to pre-size the list) before
> >> iterating over the object, so it exhausts the iterator before even
> >> using it.
> >>
> >> That said, if list(q.search_threads()) did work, it wouldn't give you
> >> better performance than your experiment above.
true. Nevertheless I think that list(q.search_threads())
should be equivalent to [t for t in q.search_threads()], which is
something to be fixed in the bindings. Should I file an issue somehow?
Or is enough to state this as a TODO here on the list?

> >> > would it be very hard to implement a Query.search_thread_ids() ?
> >> > This name is a bit off because it had to be done on a lower level.
> >>
> >> Lazily fetching the thread metadata on the C side would probably
> >> address your problem automatically.  But what are you doing that
> >> doesn't require any information about the threads you're manipulating?
> > Agreed. Unfortunately, there seems to be no way to get a list of thread
> > ids or a reliable iterator thereof by using the current python bindings.
> > It would be enough for me to have the ids because then I could
> > search for the few threads I actually need individually on demand.
> 
> There's no way to do that from the C API either, so don't feel left
> out.  ]:--8)  It seems to me that the right solution to your problem
> is to make thread information lazy (effectively, everything gathered
> in lib/thread.cc:_thread_add_message).  Then you could probably
> materialize that iterator cheaply. 
Alright. I'll put this on my mental notmuch wish list and 
hope that someone will have addressed this before I run out of
ideas how to improve my UI and have time to look at this myself.
For now, I go with the [t.get_thread_id for t in q.search_threads()]
approach to cache the thread ids myself and live with the fact that
this takes time for large result sets.

> In fact, it's probably worth
> trying a hack where you put dummy information in the thread object
> from _thread_add_message and see how long it takes just to walk the
> iterator (unfortunately I don't think profiling will help much here
> because much of your time is probably spent waiting for I/O).
I don't think I understand what you mean by dummy info in a thread
object.

> I don't think there would be any downside to doing this for eager
> consumers like the CLI.
one should think so, yes.
/p

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: one-time-iterators
  2011-05-27 18:04             ` one-time-iterators Patrick Totzke
@ 2011-05-27 19:29               ` Austin Clements
  2011-05-28  8:58                 ` one-time-iterators Patrick Totzke
  0 siblings, 1 reply; 11+ messages in thread
From: Austin Clements @ 2011-05-27 19:29 UTC (permalink / raw)
  To: Patrick Totzke, Sebastian Spaeth; +Cc: notmuch

On Fri, May 27, 2011 at 2:04 PM, Patrick Totzke
<patricktotzke@googlemail.com> wrote:
> Excerpts from Austin Clements's message of Fri May 27 03:41:44 +0100 2011:
>> >> > > Have you tried simply calling list() on your thread
>> >> > > iterator to see how expensive it is?  My bet is that it's quite cheap,
>> >> > > both memory-wise and CPU-wise.
>> >> > Funny thing:
>> >> >  q=Database().create_query('*')
>> >> >  time tlist = list(q.search_threads())
>> >> > raises a NotmuchError(STATUS.NOT_INITIALIZED) exception. For some reason
>> >> > the list constructor must read mere than once from the iterator.
>> >> > So this is not an option, but even if it worked, it would show
>> >> > the same behaviour as my above test..
>> >>
>> >> Interesting.  Looks like the Threads class implements __len__ and that
>> >> its implementation exhausts the iterator.  Which isn't a great idea in
>> >> itself, but it turns out that Python's implementation of list() calls
>> >> __len__ if it's available (presumably to pre-size the list) before
>> >> iterating over the object, so it exhausts the iterator before even
>> >> using it.
>> >>
>> >> That said, if list(q.search_threads()) did work, it wouldn't give you
>> >> better performance than your experiment above.
> true. Nevertheless I think that list(q.search_threads())
> should be equivalent to [t for t in q.search_threads()], which is
> something to be fixed in the bindings. Should I file an issue somehow?
> Or is enough to state this as a TODO here on the list?

Yes, they should be equivalent.

Sebastian was thinking about fixing the larger issue of generator
exhaustion, which would address this, though the performance would
depend on the cost of iterating twice.  This is why generators
shouldn't support __len__.  Unfortunately, it's probably hard to get
rid of at this point and I doubt there's a way to tell list() to
overlook the presence of a __len__ method.

>> >> > would it be very hard to implement a Query.search_thread_ids() ?
>> >> > This name is a bit off because it had to be done on a lower level.
>> >>
>> >> Lazily fetching the thread metadata on the C side would probably
>> >> address your problem automatically.  But what are you doing that
>> >> doesn't require any information about the threads you're manipulating?
>> > Agreed. Unfortunately, there seems to be no way to get a list of thread
>> > ids or a reliable iterator thereof by using the current python bindings.
>> > It would be enough for me to have the ids because then I could
>> > search for the few threads I actually need individually on demand.
>>
>> There's no way to do that from the C API either, so don't feel left
>> out.  ]:--8)  It seems to me that the right solution to your problem
>> is to make thread information lazy (effectively, everything gathered
>> in lib/thread.cc:_thread_add_message).  Then you could probably
>> materialize that iterator cheaply.
> Alright. I'll put this on my mental notmuch wish list and
> hope that someone will have addressed this before I run out of
> ideas how to improve my UI and have time to look at this myself.
> For now, I go with the [t.get_thread_id for t in q.search_threads()]
> approach to cache the thread ids myself and live with the fact that
> this takes time for large result sets.
>
>> In fact, it's probably worth
>> trying a hack where you put dummy information in the thread object
>> from _thread_add_message and see how long it takes just to walk the
>> iterator (unfortunately I don't think profiling will help much here
>> because much of your time is probably spent waiting for I/O).
> I don't think I understand what you mean by dummy info in a thread
> object.

In _thread_add_message, rather than looking up the message's author,
subject, etc, just hard-code some dummy values.  Performance-wise,
this would simulate making the thread metadata lookup lazy, so you
could see if making this lazy would address your problem.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: one-time-iterators
  2011-05-27 19:29               ` one-time-iterators Austin Clements
@ 2011-05-28  8:58                 ` Patrick Totzke
  2011-05-31  1:05                   ` one-time-iterators Austin Clements
  0 siblings, 1 reply; 11+ messages in thread
From: Patrick Totzke @ 2011-05-28  8:58 UTC (permalink / raw)
  To: Austin Clements; +Cc: notmuch

[-- Attachment #1: Type: text/plain, Size: 5095 bytes --]

Excerpts from Austin Clements's message of Fri May 27 20:29:24 +0100 2011:
> On Fri, May 27, 2011 at 2:04 PM, Patrick Totzke
> <patricktotzke@googlemail.com> wrote:
> > Excerpts from Austin Clements's message of Fri May 27 03:41:44 +0100 2011:
> >> >> > > Have you tried simply calling list() on your thread
> >> >> > > iterator to see how expensive it is?  My bet is that it's quite cheap,
> >> >> > > both memory-wise and CPU-wise.
> >> >> > Funny thing:
> >> >> >  q=Database().create_query('*')
> >> >> >  time tlist = list(q.search_threads())
> >> >> > raises a NotmuchError(STATUS.NOT_INITIALIZED) exception. For some reason
> >> >> > the list constructor must read mere than once from the iterator.
> >> >> > So this is not an option, but even if it worked, it would show
> >> >> > the same behaviour as my above test..
> >> >>
> >> >> Interesting.  Looks like the Threads class implements __len__ and that
> >> >> its implementation exhausts the iterator.  Which isn't a great idea in
> >> >> itself, but it turns out that Python's implementation of list() calls
> >> >> __len__ if it's available (presumably to pre-size the list) before
> >> >> iterating over the object, so it exhausts the iterator before even
> >> >> using it.
> >> >>
> >> >> That said, if list(q.search_threads()) did work, it wouldn't give you
> >> >> better performance than your experiment above.
> > true. Nevertheless I think that list(q.search_threads())
> > should be equivalent to [t for t in q.search_threads()], which is
> > something to be fixed in the bindings. Should I file an issue somehow?
> > Or is enough to state this as a TODO here on the list?
> 
> Yes, they should be equivalent.
> 
> Sebastian was thinking about fixing the larger issue of generator
> exhaustion, which would address this, though the performance would
> depend on the cost of iterating twice.  This is why generators
> shouldn't support __len__.  Unfortunately, it's probably hard to get
> rid of at this point and I doubt there's a way to tell list() to
> overlook the presence of a __len__ method.
Why not simply removing support for __len__ in the Threads and Messages classes?

> >> >> > would it be very hard to implement a Query.search_thread_ids() ?
> >> >> > This name is a bit off because it had to be done on a lower level.
> >> >>
> >> >> Lazily fetching the thread metadata on the C side would probably
> >> >> address your problem automatically.  But what are you doing that
> >> >> doesn't require any information about the threads you're manipulating?
> >> > Agreed. Unfortunately, there seems to be no way to get a list of thread
> >> > ids or a reliable iterator thereof by using the current python bindings.
> >> > It would be enough for me to have the ids because then I could
> >> > search for the few threads I actually need individually on demand.
> >>
> >> There's no way to do that from the C API either, so don't feel left
> >> out.  ]:--8)  It seems to me that the right solution to your problem
> >> is to make thread information lazy (effectively, everything gathered
> >> in lib/thread.cc:_thread_add_message).  Then you could probably
> >> materialize that iterator cheaply.
> > Alright. I'll put this on my mental notmuch wish list and
> > hope that someone will have addressed this before I run out of
> > ideas how to improve my UI and have time to look at this myself.
> > For now, I go with the [t.get_thread_id for t in q.search_threads()]
> > approach to cache the thread ids myself and live with the fact that
> > this takes time for large result sets.
> >
> >> In fact, it's probably worth
> >> trying a hack where you put dummy information in the thread object
> >> from _thread_add_message and see how long it takes just to walk the
> >> iterator (unfortunately I don't think profiling will help much here
> >> because much of your time is probably spent waiting for I/O).
> > I don't think I understand what you mean by dummy info in a thread
> > object.
> 
> In _thread_add_message, rather than looking up the message's author,
> subject, etc, just hard-code some dummy values.  Performance-wise,
> this would simulate making the thread metadata lookup lazy, so you
> could see if making this lazy would address your problem.
Thanks for the clarification. I did that, and also commented out the
lower parts of _notmuch_thread_create and this did indeed improve
the performance, but not so much as I had hoped:

In [10]: q=Database().create_query('*')
In [11]: time T=[t for t in q.search_threads()]
CPU times: user 2.43 s, sys: 0.22 s, total: 2.65 s
Wall time: 2.66 s

And I have only about 8000 mails in my index.
Making thread lookups lazy would help, but here one would still
create a lot of unused (empty) thread objects.
The easiest solution to my problem would in my opinion be
a function that queries only for thread ids without instanciating them.
But I can't think of any other use case than mine for this
so I guess many of you would be against adding this to the API?
/p

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: one-time-iterators
  2011-05-28  8:58                 ` one-time-iterators Patrick Totzke
@ 2011-05-31  1:05                   ` Austin Clements
  0 siblings, 0 replies; 11+ messages in thread
From: Austin Clements @ 2011-05-31  1:05 UTC (permalink / raw)
  To: Patrick Totzke; +Cc: notmuch

Quoth Patrick Totzke on May 28 at  9:58 am:
> Excerpts from Austin Clements's message of Fri May 27 20:29:24 +0100 2011:
> > On Fri, May 27, 2011 at 2:04 PM, Patrick Totzke
> > <patricktotzke@googlemail.com> wrote:
> > > Excerpts from Austin Clements's message of Fri May 27 03:41:44 +0100 2011:
> > >> >> > > Have you tried simply calling list() on your thread
> > >> >> > > iterator to see how expensive it is?  My bet is that it's quite cheap,
> > >> >> > > both memory-wise and CPU-wise.
> > >> >> > Funny thing:
> > >> >> >  q=Database().create_query('*')
> > >> >> >  time tlist = list(q.search_threads())
> > >> >> > raises a NotmuchError(STATUS.NOT_INITIALIZED) exception. For some reason
> > >> >> > the list constructor must read mere than once from the iterator.
> > >> >> > So this is not an option, but even if it worked, it would show
> > >> >> > the same behaviour as my above test..
> > >> >>
> > >> >> Interesting.  Looks like the Threads class implements __len__ and that
> > >> >> its implementation exhausts the iterator.  Which isn't a great idea in
> > >> >> itself, but it turns out that Python's implementation of list() calls
> > >> >> __len__ if it's available (presumably to pre-size the list) before
> > >> >> iterating over the object, so it exhausts the iterator before even
> > >> >> using it.
> > >> >>
> > >> >> That said, if list(q.search_threads()) did work, it wouldn't give you
> > >> >> better performance than your experiment above.
> > > true. Nevertheless I think that list(q.search_threads())
> > > should be equivalent to [t for t in q.search_threads()], which is
> > > something to be fixed in the bindings. Should I file an issue somehow?
> > > Or is enough to state this as a TODO here on the list?
> > 
> > Yes, they should be equivalent.
> > 
> > Sebastian was thinking about fixing the larger issue of generator
> > exhaustion, which would address this, though the performance would
> > depend on the cost of iterating twice.  This is why generators
> > shouldn't support __len__.  Unfortunately, it's probably hard to get
> > rid of at this point and I doubt there's a way to tell list() to
> > overlook the presence of a __len__ method.
> Why not simply removing support for __len__ in the Threads and Messages classes?

Presumably len is there because things use it.  On the other hand,
given the issues surrounding len, I suspect anything that's using it
is already a mess.

> > >> >> > would it be very hard to implement a Query.search_thread_ids() ?
> > >> >> > This name is a bit off because it had to be done on a lower level.
> > >> >>
> > >> >> Lazily fetching the thread metadata on the C side would probably
> > >> >> address your problem automatically.  But what are you doing that
> > >> >> doesn't require any information about the threads you're manipulating?
> > >> > Agreed. Unfortunately, there seems to be no way to get a list of thread
> > >> > ids or a reliable iterator thereof by using the current python bindings.
> > >> > It would be enough for me to have the ids because then I could
> > >> > search for the few threads I actually need individually on demand.
> > >>
> > >> There's no way to do that from the C API either, so don't feel left
> > >> out.  ]:--8)  It seems to me that the right solution to your problem
> > >> is to make thread information lazy (effectively, everything gathered
> > >> in lib/thread.cc:_thread_add_message).  Then you could probably
> > >> materialize that iterator cheaply.
> > > Alright. I'll put this on my mental notmuch wish list and
> > > hope that someone will have addressed this before I run out of
> > > ideas how to improve my UI and have time to look at this myself.
> > > For now, I go with the [t.get_thread_id for t in q.search_threads()]
> > > approach to cache the thread ids myself and live with the fact that
> > > this takes time for large result sets.
> > >
> > >> In fact, it's probably worth
> > >> trying a hack where you put dummy information in the thread object
> > >> from _thread_add_message and see how long it takes just to walk the
> > >> iterator (unfortunately I don't think profiling will help much here
> > >> because much of your time is probably spent waiting for I/O).
> > > I don't think I understand what you mean by dummy info in a thread
> > > object.
> > 
> > In _thread_add_message, rather than looking up the message's author,
> > subject, etc, just hard-code some dummy values.  Performance-wise,
> > this would simulate making the thread metadata lookup lazy, so you
> > could see if making this lazy would address your problem.
> Thanks for the clarification. I did that, and also commented out the
> lower parts of _notmuch_thread_create and this did indeed improve
> the performance, but not so much as I had hoped:
> 
> In [10]: q=Database().create_query('*')
> In [11]: time T=[t for t in q.search_threads()]
> CPU times: user 2.43 s, sys: 0.22 s, total: 2.65 s
> Wall time: 2.66 s
> 
> And I have only about 8000 mails in my index.
> Making thread lookups lazy would help, but here one would still
> create a lot of unused (empty) thread objects.
> The easiest solution to my problem would in my opinion be
> a function that queries only for thread ids without instanciating them.
> But I can't think of any other use case than mine for this
> so I guess many of you would be against adding this to the API?

Well, you may be in a pickle.  I seriously doubt creating the thread
objects is your problem, though *expanding* the threads may be your
main expense now.  I'd recommend actual profiling at this point.

It may be possible to lazily expand threads, too.  I'm not sure how
familiar you are with the C code for this.  notmuch does *not* index
threads; it indexes individual messages.  When you do a thread query,
it finds all *messages* that match that query, and then, for each
message, it does another query to find all messages in the same
thread.  You might be able to do just the original message query
eagerly (which takes virtually no time) and then do the thread
expansion queries only as the iterator demands them.  You may
encounter concurrency issues with a scheme like this, though as long
as notmuch can't split threads, I think it would be okay.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2011-05-31  1:05 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-05-26  8:31 one-time-iterators Patrick Totzke
2011-05-26 17:20 ` one-time-iterators Carl Worth
2011-05-26 20:18   ` one-time-iterators Austin Clements
2011-05-26 21:47     ` one-time-iterators Patrick Totzke
     [not found]     ` <1306442683-sup-9315@brick>
     [not found]       ` <20110526214302.GR29861@mit.edu>
2011-05-26 22:22         ` one-time-iterators Patrick Totzke
2011-05-27  2:41           ` one-time-iterators Austin Clements
2011-05-27 18:04             ` one-time-iterators Patrick Totzke
2011-05-27 19:29               ` one-time-iterators Austin Clements
2011-05-28  8:58                 ` one-time-iterators Patrick Totzke
2011-05-31  1:05                   ` one-time-iterators Austin Clements
2011-05-26 21:16   ` one-time-iterators Michael Hudson-Doyle

Code repositories for project(s) associated with this public inbox

	https://yhetil.org/notmuch.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).