unofficial mirror of notmuch@notmuchmail.org
* [PATCH] dump: Don't sort.
@ 2011-10-29 10:37 Thomas Schwinge
  2011-11-15  1:10 ` David Bremner
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Thomas Schwinge @ 2011-10-29 10:37 UTC (permalink / raw)
  To: notmuch

From: Thomas Schwinge <thomas@schwinge.name>

This improves usage experience considerably in the given scenario.

---


Hi!

I decided that it'd be useful to put the reasoning and data right next to
the source code (as opposed to putting it into the commit message), so that
the next person to read this code has it all in one place.


Grüße,
 Thomas


---

 notmuch-dump.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/notmuch-dump.c b/notmuch-dump.c
index 7e7bc17..a431e23 100644
--- a/notmuch-dump.c
+++ b/notmuch-dump.c
@@ -45,7 +45,12 @@ notmuch_dump_command (unused (void *ctx), int argc, char *argv[])
 	fprintf (stderr, "Out of memory\n");
 	return 1;
     }
-    notmuch_query_set_sort (query, NOTMUCH_SORT_MESSAGE_ID);
+    /* This used to use NOTMUCH_SORT_MESSAGE_ID.  On 2011-10-29, a measurement
+     * on a 372981 messages instance showed that wall time can be reduced from
+     * 28 minutes (sorted by Message-ID) to 15 minutes (unsorted), the latter
+     * being much more ``database-disk-layout-friendly''.  Subsequently sorting
+     * the 25 MiB of data is a no-brainer, if required.  */
+    notmuch_query_set_sort (query, NOTMUCH_SORT_UNSORTED);
 
     if (argc) {
 	output = fopen (argv[0], "w");
-- 
tg: (3bafdfc..) t/dump_unsorted (depends on: baseline)
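
The comment in the patch notes that the unsorted output can simply be sorted
afterwards if an ordered dump is needed. A minimal sketch of that step follows,
assuming the dump is line-oriented (one message per line). The helper names
sort_dump and same_lines are made up for illustration and are not part of
notmuch. The resulting order is plain byte order over whole lines, which is
not guaranteed to be identical to the order NOTMUCH_SORT_MESSAGE_ID produced.

```shell
# Sort a line-oriented dump read from stdin in plain byte order.
# LC_ALL=C keeps the order independent of the user's locale.
sort_dump() {
    LC_ALL=C sort
}

# Succeed if two dump files contain the same lines, ignoring their order.
# Both file names are placeholders supplied by the caller.
same_lines() {
    a=$(mktemp) && b=$(mktemp) || return 2
    sort_dump < "$1" > "$a"
    sort_dump < "$2" > "$b"
    if cmp -s "$a" "$b"; then status=0; else status=1; fi
    rm -f "$a" "$b"
    return "$status"
}

# Hypothetical usage:
#   notmuch dump | sort_dump > dump.sorted
#   same_lines dump.from-old-binary dump.from-new-binary && echo "same content"
```

Comparing after sorting, rather than comparing the raw files, is what makes
the check meaningful here: the patch changes only the order of the lines,
not which lines are written.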


* Re: [PATCH] dump: Don't sort.
  2011-10-29 10:37 [PATCH] dump: Don't sort Thomas Schwinge
@ 2011-11-15  1:10 ` David Bremner
  2011-11-21 11:04   ` Tomi Ollila
  2011-11-19 15:11 ` Petter Reinholdtsen
  2011-11-27 18:40 ` [PATCH] dump: Don't sort the output by message id Tom Prince
  2 siblings, 1 reply; 7+ messages in thread
From: David Bremner @ 2011-11-15  1:10 UTC (permalink / raw)
  To: Thomas Schwinge, notmuch

On Sat, 29 Oct 2011 12:37:37 +0200, Thomas Schwinge <thomas@schwinge.name> wrote:
> From: Thomas Schwinge <thomas@schwinge.name>
> 
> This improves usage experience considerably in the given scenario.
> 

I'm not sure if I mentioned this only in IRC, so let me recap.

Personally, I think this needs more information in the commit message,
and (optionally) less in the comment. I would prefer to be able to
understand what is going on just from the output of git log.

d


* Re: [PATCH] dump: Don't sort.
  2011-10-29 10:37 [PATCH] dump: Don't sort Thomas Schwinge
  2011-11-15  1:10 ` David Bremner
@ 2011-11-19 15:11 ` Petter Reinholdtsen
  2011-11-28 21:04   ` Thomas Schwinge
  2011-11-27 18:40 ` [PATCH] dump: Don't sort the output by message id Tom Prince
  2 siblings, 1 reply; 7+ messages in thread
From: Petter Reinholdtsen @ 2011-11-19 15:11 UTC (permalink / raw)
  To: notmuch


[Thomas Schwinge]
> +    /* This used to use NOTMUCH_SORT_MESSAGE_ID.  On 2011-10-29, a measurement
> +     * on a 372981 messages instance showed that wall time can be reduced from
> +     * 28 minutes (sorted by Message-ID) to 15 minutes (unsorted), the latter
> +     * being much more ``database-disk-layout-friendly''.  Subsequently sorting
> +     * the 25 MiB of data is a no-brainer, if required.  */

This sounds like a great idea for my use case.  Doing 'notmuch dump'
with my 1.2 million emails takes hours at the moment (not a very fast
encrypted file system), and results in a 90 MiB dump file.
-- 
Happy hacking
Petter Reinholdtsen


* Re: [PATCH] dump: Don't sort.
  2011-11-15  1:10 ` David Bremner
@ 2011-11-21 11:04   ` Tomi Ollila
  0 siblings, 0 replies; 7+ messages in thread
From: Tomi Ollila @ 2011-11-21 11:04 UTC (permalink / raw)
  To: Thomas Schwinge, notmuch

On Mon, 14 Nov 2011 21:10:44 -0400, David Bremner <david@tethera.net> wrote:
> On Sat, 29 Oct 2011 12:37:37 +0200, Thomas Schwinge <thomas@schwinge.name> wrote:
> > From: Thomas Schwinge <thomas@schwinge.name>
> > 
> > This improves usage experience considerably in the given scenario.
> > 
> 
> I'm not sure if I mentioned this only in IRC, so let me recap.
> 
> Personally, I think this needs more information in the commit message,
> and (optionally) less in the comment. I would prefer to be able to
> understand what is going on just from the output of git log.

I agree; the git commit message:

--8<----8<----8<----8<----8<----8<----8<----8<----8<--
dump: Don't sort.

This improves usage experience considerably in the given scenario.
--8<----8<----8<----8<----8<----8<----8<----8<----8<--

Is not very informative. Better to describe the given scenario in the commit
message and (perhaps!) tune the comment in the code.

We've seen that the patch is very useful -- so, Thomas, please provide a
patch in a format that can be comfortably added to the repository :).

> 
> d

Tomi


* [PATCH] dump: Don't sort the output by message id.
  2011-10-29 10:37 [PATCH] dump: Don't sort Thomas Schwinge
  2011-11-15  1:10 ` David Bremner
  2011-11-19 15:11 ` Petter Reinholdtsen
@ 2011-11-27 18:40 ` Tom Prince
  2011-11-29  7:10   ` David Bremner
  2 siblings, 1 reply; 7+ messages in thread
From: Tom Prince @ 2011-11-27 18:40 UTC (permalink / raw)
  To: Notmuch Mail; +Cc: Thomas Schwinge

From: Thomas Schwinge <thomas@schwinge.name>

Asking Xapian to sort the messages for us causes suboptimal I/O patterns. This
would be useful if we only wanted the first few results, but since we want
everything anyway, it is a pessimization.

On 2011-10-29, a measurement on a 372981 messages instance showed that wall
time can be reduced from 28 minutes (sorted by Message-ID) to 15 minutes
(unsorted).

Timings on 189605 messages:

$ time notmuch.old dump
19.48user 5.83system 12:10.42elapsed 3%CPU (0avgtext+0avgdata 110656maxresident)k
3629584inputs+22720outputs (33major+7073minor)pagefaults 0swaps
$ echo 3 > /proc/sys/vm/drop_caches
$ time notmuch.new dump
14.89user 1.20system 3:23.58elapsed 7%CPU (0avgtext+0avgdata 46032maxresident)k
1256264inputs+22464outputs (43major+1990minor)pagefaults 0swaps
---
 This just moves the motivation to the commit message, and adds more detailed timing information.

 notmuch-dump.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/notmuch-dump.c b/notmuch-dump.c
index 126593d..0475eb9 100644
--- a/notmuch-dump.c
+++ b/notmuch-dump.c
@@ -73,7 +73,10 @@ notmuch_dump_command (unused (void *ctx), int argc, char *argv[])
 	fprintf (stderr, "Out of memory\n");
 	return 1;
     }
-    notmuch_query_set_sort (query, NOTMUCH_SORT_MESSAGE_ID);
+    /* Don't ask xapian to sort by Message-ID. Xapian optimizes returning the
+     * first results quickly at the expense of total time.
+     */
+    notmuch_query_set_sort (query, NOTMUCH_SORT_UNSORTED);
 
     for (messages = notmuch_query_search_messages (query);
 	 notmuch_messages_valid (messages);
-- 
1.7.6.1


* Re: [PATCH] dump: Don't sort.
  2011-11-19 15:11 ` Petter Reinholdtsen
@ 2011-11-28 21:04   ` Thomas Schwinge
  0 siblings, 0 replies; 7+ messages in thread
From: Thomas Schwinge @ 2011-11-28 21:04 UTC (permalink / raw)
  To: notmuch; +Cc: Petter Reinholdtsen


Hi!

First, thanks to David, Tomi, Tom for moving this forward.


On Sat, 19 Nov 2011 16:11:13 +0100, Petter Reinholdtsen <pere@hungry.com> wrote:
> [Thomas Schwinge]
> > +    /* This used to use NOTMUCH_SORT_MESSAGE_ID.  On 2011-10-29, a measurement
> > +     * on a 372981 messages instance showed that wall time can be reduced from
> > +     * 28 minutes (sorted by Message-ID) to 15 minutes (unsorted), the latter
> > +     * being much more ``database-disk-layout-friendly''.  Subsequently sorting
> > +     * the 25 MiB of data is a no-brainer, if required.  */

Here is the measurement redone: I discovered that during the earlier run,
other work was going on in parallel in another Xen domU on that system,
disturbing the measurement.

Discard caches, every time before dumping:

    $ sync; sleep 3; echo -n 3 | sudo dd of=/proc/sys/vm/drop_caches

Original (sorted by Message-ID):

    $ \time notmuch dump > ~/tmp/Mail-notmuch_dump/dump
    26.41user 16.56system 14:34.81elapsed 4%CPU (0avgtext+0avgdata 167152maxresident)k
    2994440inputs+55896outputs (41major+11627minor)pagefaults 0swaps

Unsorted:

    $ \time notmuch dump | sort > ~/tmp/Mail-notmuch_dump/dump
    24.79user 3.86system 12:00.22elapsed 3%CPU (0avgtext+0avgdata 57216maxresident)k
    2929192inputs+0outputs (40major+4942minor)pagefaults 0swaps

The difference is no longer as big as before, but still better than
nothing.

> This sounds like a great idea for my use case.  Doing 'notmuch dump'
> with my 1.2 million emails takes hours at the moment (not a very fast
> encrypted file system), and results in a 90 MiB dump file.

... and you will gain most by putting the .notmuch directory onto an SSD,
as I have done by now:

Original (sorted by Message-ID), with .notmuch on SSD:

    $ \time notmuch dump > ~/tmp/Mail-notmuch_dump/dump
    24.86user 13.40system 1:06.01elapsed 57%CPU (0avgtext+0avgdata 167200maxresident)k
    2992184inputs+55920outputs (49major+11622minor)pagefaults 0swaps

Unsorted, with .notmuch on SSD:

    $ \time notmuch dump > ~/tmp/Mail-notmuch_dump/dump
    21.90user 2.68system 0:51.70elapsed 47%CPU (0avgtext+0avgdata 57248maxresident)k
    2926912inputs+55920outputs (50major+4934minor)pagefaults 0swaps

User and system time (roughly) remain the same, but the wall time drops
considerably -- an SSD at its best, obviously.
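
To put the wall-clock figures above into perspective, the elapsed times can
be turned into ratios. This is only a sketch: the helper names
elapsed_to_seconds and speedup are made up for illustration, and it assumes
the [h:]m:ss.ss elapsed format that GNU time prints in the transcripts above.

```shell
# Convert an elapsed time such as "14:34.81" (or "1:02:03.00") to seconds.
elapsed_to_seconds() {
    printf '%s\n' "$1" | LC_ALL=C awk -F: '{
        s = 0
        for (i = 1; i <= NF; i++) s = s * 60 + $i
        printf "%.2f\n", s
    }'
}

# Print how many times faster the second run was than the first.
speedup() {
    old=$(elapsed_to_seconds "$1")
    new=$(elapsed_to_seconds "$2")
    LC_ALL=C awk -v o="$old" -v n="$new" 'BEGIN { printf "%.2f\n", o / n }'
}

speedup 14:34.81 12:00.22   # HDD, sorted vs. unsorted: prints 1.21
speedup 1:06.01 0:51.70     # SSD, sorted vs. unsorted: prints 1.28
```

On either medium the unsorted run finishes in roughly four fifths of the
sorted run's wall time, whereas moving .notmuch to the SSD shortens the
sorted run from 874.81 s to 66.01 s, roughly a factor of 13.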


Generally speaking, I decided it was enough to just put the .notmuch
directory onto the SSD, and not the whole mail store: when new messages are
added (notmuch new), they're still in the page cache anyway (having been
retrieved via POP3 or whatever just before), and for regular message read
access, an HDD's seek time shouldn't matter too much (and I've taken
notice of Austin's patches, which even retrieve Subject: etc. from the
DB), so what remains to be optimized is random access to the DB.


Grüße,
 Thomas



* Re: [PATCH] dump: Don't sort the output by message id.
  2011-11-27 18:40 ` [PATCH] dump: Don't sort the output by message id Tom Prince
@ 2011-11-29  7:10   ` David Bremner
  0 siblings, 0 replies; 7+ messages in thread
From: David Bremner @ 2011-11-29  7:10 UTC (permalink / raw)
  To: Tom Prince, Notmuch Mail; +Cc: Thomas Schwinge

On Sun, 27 Nov 2011 13:40:53 -0500, Tom Prince <tom.prince@ualberta.net> wrote:
> From: Thomas Schwinge <thomas@schwinge.name>
> 
> Asking Xapian to sort the messages for us causes suboptimal I/O patterns. This
> would be useful if we only wanted the first few results, but since we want
> everything anyway, it is a pessimization.

Pushed.

d

