unofficial mirror of notmuch@notmuchmail.org
 help / color / mirror / code / Atom feed
* Missing messages breaking threads
@ 2009-12-18 19:02 James Westby
  2009-12-18 19:41 ` Carl Worth
  0 siblings, 1 reply; 26+ messages in thread
From: James Westby @ 2009-12-18 19:02 UTC (permalink / raw)
  To: notmuch

Hi,

I like the architecture of notmuch, and have just switched
to using it as my primary client, so thanks.

I however have discovered one issue that is a pain.

I use a bugtracker a lot which has workflows that mean that
you don't always get teh initial messages from a bug.

To put it in more common terms, imagine this:

  * Alice sends a mail to a large group of your friends, but
    not you.
  * Each of these friends replies, and puts you in Cc for the
    reply.

This will mean that you get several messages that all have
References and In-Reply-To set to ids that aren't known to
notmuch.

This means that it doesn't thread them, and so they aren't
grouped in the UI. This becomes painful when you are dealing
with several bugs like this. It's almost like being back in
the days of a message oriented client, and we know that's
not the way to do it.

Therefore I'd like to fix this. The obvious way is to
introduce documents in to the db for each id we see, and
threading should then naturally work better.

The only issue I see with doing this is with mail delays.
Once we do this we will sometimes receive a message that
already has a dummy document. What happens currently with
message-id collisions?

Therefore I would propose this:

  * When doing a thread resolution and we have ids that
    we don't know, add a document to the db that is
    something like

    {id: <message-id>
     thread: <synthesised thread-id>
     dummy: True
     content: "Messages missing"
    }

  * When we get a message-id conflict check for dummy:True
    and replace the document if it is there.

How does this sound?

There could be an issue with synthesising too many threads
and then ending up having to try and put a message in two
threads? I see there is code for merging threads, would that
handle this?

Thanks,

James

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Missing messages breaking threads
  2009-12-18 19:02 Missing messages breaking threads James Westby
@ 2009-12-18 19:41 ` Carl Worth
  2009-12-18 19:53   ` James Westby
  0 siblings, 1 reply; 26+ messages in thread
From: Carl Worth @ 2009-12-18 19:41 UTC (permalink / raw)
  To: James Westby, notmuch

[-- Attachment #1: Type: text/plain, Size: 2670 bytes --]

On Fri, 18 Dec 2009 19:02:21 +0000, James Westby <jw+debian@jameswestby.net> wrote:
> I like the architecture of notmuch, and have just switched
> to using it as my primary client, so thanks.

You're quite welcome, James. Welcome to notmuch!

> Therefore I'd like to fix this. The obvious way is to
> introduce documents in to the db for each id we see, and
> threading should then naturally work better.

That sounds like a fine idea.

> The only issue I see with doing this is with mail delays.
> Once we do this we will sometimes receive a message that
> already has a dummy document. What happens currently with
> message-id collisions?

The current message-ID collision logic is pretty brain-dead. It just
says "Oh, I've seen a file with this message before, so I'll skip this
additional file".

But I'm just putting the finishing touches on a patch that instead does:

	Oh, and here's an additional filename for that message ID. Add
	that too, please.

Beyond that, all we would need to do as well is to also index the new
content. I don't want to do useless re-indexing when files just get
renamed. So maybe all we need to do is to save the filesize of the
last-indexed file for a document and then when we encounter a file with
the same message ID and a larger file size, then index it as well?

That would even take care of providing the opportunity to index
additional mailing-list-added content for messages also sent directly
via CC.

The file-size heuristic wouldn't be perfect for these other cases. I
guess we save a list of sha-1 sums for indexed files or so, (assuming
that's cheaper than just re-indexing---before the Xapian Defect 250 fix
I'm sure it is, but after I'm not sure---we maybe should just always
re-index---but I think I have seen the TermGenerator appear in profiles
of indexing runs.)

>   * When we get a message-id conflict check for dummy:True
>     and replace the document if it is there.
> 
> How does this sound?

That sounds fine. It's the same as what I propose above with
"filesize:0" instead of "dummy:true".

> There could be an issue with synthesising too many threads
> and then ending up having to try and put a message in two
> threads? I see there is code for merging threads, would that
> handle this?

It should, yes.

The current logic is that a message can only appear in a single
thread. So if a message has children or parents with distinct thread IDs
then those threads are merged.

I can imagine some strange cross-posting scenario where one could argue
that the merging shouldn't happen, but I'm not sure we want to try to
respect that.

-Carl

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Missing messages breaking threads
  2009-12-18 19:41 ` Carl Worth
@ 2009-12-18 19:53   ` James Westby
  2009-12-18 20:52     ` Carl Worth
  2009-12-18 21:21     ` [PATCH] Store the size of the file for each message James Westby
  0 siblings, 2 replies; 26+ messages in thread
From: James Westby @ 2009-12-18 19:53 UTC (permalink / raw)
  To: Carl Worth, notmuch

On Fri, 18 Dec 2009 11:41:18 -0800, Carl Worth <cworth@cworth.org> wrote:
> On Fri, 18 Dec 2009 19:02:21 +0000, James Westby <jw+debian@jameswestby.net> wrote:
> > Therefore I'd like to fix this. The obvious way is to
> > introduce documents in to the db for each id we see, and
> > threading should then naturally work better.
> 
> That sounds like a fine idea.

Good, at least I'm not totally off the map.
 
> > The only issue I see with doing this is with mail delays.
> > Once we do this we will sometimes receive a message that
> > already has a dummy document. What happens currently with
> > message-id collisions?
> 
> The current message-ID collision logic is pretty brain-dead. It just
> says "Oh, I've seen a file with this message before, so I'll skip this
> additional file".
> 
> But I'm just putting the finishing touches on a patch that instead does:
> 
> 	Oh, and here's an additional filename for that message ID. Add
> 	that too, please.
> 
> Beyond that, all we would need to do as well is to also index the new
> content. I don't want to do useless re-indexing when files just get
> renamed. So maybe all we need to do is to save the filesize of the
> last-indexed file for a document and then when we encounter a file with
> the same message ID and a larger file size, then index it as well?

I would say different file size, but I imagine larger is the majority
of interesting cases.

> That would even take care of providing the opportunity to index
> additional mailing-list-added content for messages also sent directly
> via CC.
> 
> The file-size heuristic wouldn't be perfect for these other cases. I
> guess we save a list of sha-1 sums for indexed files or so, (assuming
> that's cheaper than just re-indexing---before the Xapian Defect 250 fix
> I'm sure it is, but after I'm not sure---we maybe should just always
> re-index---but I think I have seen the TermGenerator appear in profiles
> of indexing runs.)

I'm not sure this is needed too much, but would obviously be
correct.

On Xapian 250, I have a very slow spinning disk, and it was hitting
me hard, making processing my inbox far too slow. I built Xapian SVN
with the patch from the bug and it is now lightning fast, so
consider this another endorsement. I also tried the supplemental
patch and it showed no further improvement for notmuch tag.

> >   * When we get a message-id conflict check for dummy:True
> >     and replace the document if it is there.
> > 
> > How does this sound?
> 
> That sounds fine. It's the same as what I propose above with
> "filesize:0" instead of "dummy:true".

That works. However, we would want the old content to go away
in these cases wouldn't we.

Or do we not index whatever dummy text we add? Or do we not
even put it in? Or not even show it at all? I was just thinking
of having "Missing messages..." showing up as the start of
the thread, but maybe it's no needed.

> > There could be an issue with synthesising too many threads
> > and then ending up having to try and put a message in two
> > threads? I see there is code for merging threads, would that
> > handle this?
> 
> It should, yes.
> 
> The current logic is that a message can only appear in a single
> thread. So if a message has children or parents with distinct thread IDs
> then those threads are merged.
> 
> I can imagine some strange cross-posting scenario where one could argue
> that the merging shouldn't happen, but I'm not sure we want to try to
> respect that.

Fair enough.

So, to summarise, I should first look at storing filesizes, then
the collision code to make it index further when the filesize grows,
and then finally the code to add documents for missing messages?

The only thing I am unclear on is how to handle existing databases?
Do we have any concept of versioning? Or should I just assume that
filesize: may not be in the document and act appropriately?

Thanks,

James

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Missing messages breaking threads
  2009-12-18 19:53   ` James Westby
@ 2009-12-18 20:52     ` Carl Worth
  2009-12-18 21:24       ` James Westby
  2009-12-22 22:48       ` Olly Betts
  2009-12-18 21:21     ` [PATCH] Store the size of the file for each message James Westby
  1 sibling, 2 replies; 26+ messages in thread
From: Carl Worth @ 2009-12-18 20:52 UTC (permalink / raw)
  To: James Westby, notmuch

[-- Attachment #1: Type: text/plain, Size: 1668 bytes --]

On Fri, 18 Dec 2009 19:53:13 +0000, James Westby <jw+debian@jameswestby.net> wrote:
> Or do we not index whatever dummy text we add? Or do we not
> even put it in? Or not even show it at all? I was just thinking
> of having "Missing messages..." showing up as the start of
> the thread, but maybe it's no needed.

Oh, I was assuming you wouldn't index any text. The UI can add "missing
message" for a document with no filename, for example.

> So, to summarise, I should first look at storing filesizes, then
> the collision code to make it index further when the filesize grows,
> and then finally the code to add documents for missing messages?

Some of the code areas to be touched will be changing soon, (at least as
far as when filenames appear and disappear). Hopefully I'll have
something posted for that sooner rather than later to avoid having to
redo too much work.

> The only thing I am unclear on is how to handle existing databases?
> Do we have any concept of versioning? Or should I just assume that
> filesize: may not be in the document and act appropriately?

My current, outstanding patch is going to be the first trigger for a
"flag day" where we'll all need to rewrite our databases.

We don't have any concept of versioning yet, but it would obviously be
easy to have a new version document with an increasing integer.

But even with my current patch I'm considering doing a graceful upgrade
of the database in-place rather than making the user do something like a
dump, delete, rebuild, restore. That would give a much better experience
than "Your database is out-of-date, please rebuild it", so we'll see if
I pursue that in the end.

-Carl



[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH] Store the size of the file for each message
  2009-12-18 19:53   ` James Westby
  2009-12-18 20:52     ` Carl Worth
@ 2009-12-18 21:21     ` James Westby
  2009-12-18 22:29       ` Carl Worth
  1 sibling, 1 reply; 26+ messages in thread
From: James Westby @ 2009-12-18 21:21 UTC (permalink / raw)
  To: notmuch

When indexing a message store the filesize along with it so that
when we store all the filenames for a message-id we can know if
any of them have different content cheaply.

The value stored is defined to be the largest filesize of any
of the files for that message.

This changes the API for efficiency reasons. The size is often
known to the caller, and so we save a second stat by asking them
to provide it. If they don't know it they can pass -1 and the
stat will be done for them.

We store the filesize such that we can query a range. Thus it
would be possible to query "filesize:0..100" if you somehow
knew the raw message was less that 100 bytes.
---

  Here's the first part, storing the filesize. I'm using
  add_value so that we can make it sortable, is that valid
  for retrieving it as well?

  The only thing I'm not sure about is if it works. Is there
  a way to inspect a document to see the values that are
  stored? Doing a search isn't working, so I imagine I made
  a mistake.

  Thanks,

  James

 lib/database.cc       |   17 +++++++++++++++++
 lib/message.cc        |   25 +++++++++++++++++++++++++
 lib/notmuch-private.h |    8 +++++++-
 lib/notmuch.h         |    5 +++++
 notmuch-new.c         |    2 +-
 5 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/lib/database.cc b/lib/database.cc
index b6c4d07..0ec77cd 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -454,6 +454,17 @@ notmuch_database_create (const char *path)
     return notmuch;
 }
 
+struct FilesizeValueRangeProcessor : public Xapian::ValueRangeProcessor {
+    FilesizeValueRangeProcessor() {}
+
+    Xapian::valueno operator()(std::string &begin, std::string &) {
+        if (begin.substr(0, 9) != "filesize:")
+            return Xapian::BAD_VALUENO;
+        begin.erase(0, 9);
+        return NOTMUCH_VALUE_FILESIZE;
+    }
+};
+
 notmuch_database_t *
 notmuch_database_open (const char *path,
 		       notmuch_database_mode_t mode)
@@ -463,6 +474,7 @@ notmuch_database_open (const char *path,
     struct stat st;
     int err;
     unsigned int i;
+    FilesizeValueRangeProcessor filesize_proc;
 
     if (asprintf (&notmuch_path, "%s/%s", path, ".notmuch") == -1) {
 	notmuch_path = NULL;
@@ -508,6 +520,7 @@ notmuch_database_open (const char *path,
 	notmuch->query_parser->set_stemmer (Xapian::Stem ("english"));
 	notmuch->query_parser->set_stemming_strategy (Xapian::QueryParser::STEM_SOME);
 	notmuch->query_parser->add_valuerangeprocessor (notmuch->value_range_processor);
+	notmuch->query_parser->add_valuerangeprocessor (&filesize_proc);
 
 	for (i = 0; i < ARRAY_SIZE (BOOLEAN_PREFIX_EXTERNAL); i++) {
 	    prefix_t *prefix = &BOOLEAN_PREFIX_EXTERNAL[i];
@@ -889,6 +902,7 @@ _notmuch_database_link_message (notmuch_database_t *notmuch,
 notmuch_status_t
 notmuch_database_add_message (notmuch_database_t *notmuch,
 			      const char *filename,
+			      const off_t size,
 			      notmuch_message_t **message_ret)
 {
     notmuch_message_file_t *message_file;
@@ -992,6 +1006,9 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
 	if (private_status == NOTMUCH_PRIVATE_STATUS_NO_DOCUMENT_FOUND) {
 	    _notmuch_message_set_filename (message, filename);
 	    _notmuch_message_add_term (message, "type", "mail");
+	    ret = _notmuch_message_set_filesize (message, filename, size);
+	    if (ret)
+		goto DONE;
 	} else {
 	    ret = NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
 	    goto DONE;
diff --git a/lib/message.cc b/lib/message.cc
index 49519f1..2bfc5ed 100644
--- a/lib/message.cc
+++ b/lib/message.cc
@@ -426,6 +426,31 @@ _notmuch_message_set_filename (notmuch_message_t *message,
     message->doc.set_data (s);
 }
 
+notmuch_status_t
+_notmuch_message_set_filesize (notmuch_message_t *message,
+			       const char *filename,
+			       const off_t size)
+{
+    struct stat st;
+    off_t realsize = size;
+    notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
+
+    if (realsize < 0) {
+	if (stat (filename, &st)) {
+	    ret = NOTMUCH_STATUS_FILE_ERROR;
+	    goto DONE;
+	} else {
+	    realsize = st.st_size;
+	}
+    }
+
+    message->doc.add_value (NOTMUCH_VALUE_FILESIZE,
+			 Xapian::sortable_serialise (realsize));
+
+  DONE:
+    return ret;
+}
+
 const char *
 notmuch_message_get_filename (notmuch_message_t *message)
 {
diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
index 116f63d..1ba3055 100644
--- a/lib/notmuch-private.h
+++ b/lib/notmuch-private.h
@@ -100,7 +100,8 @@ _internal_error (const char *format, ...) PRINTF_ATTRIBUTE (1, 2);
 
 typedef enum {
     NOTMUCH_VALUE_TIMESTAMP = 0,
-    NOTMUCH_VALUE_MESSAGE_ID
+    NOTMUCH_VALUE_MESSAGE_ID,
+    NOTMUCH_VALUE_FILESIZE
 } notmuch_value_t;
 
 /* Xapian (with flint backend) complains if we provide a term longer
@@ -193,6 +194,11 @@ void
 _notmuch_message_set_filename (notmuch_message_t *message,
 			       const char *filename);
 
+notmuch_status_t
+_notmuch_message_set_filesize (notmuch_message_t *message,
+			       const char *filename,
+			       const off_t size);
+
 void
 _notmuch_message_ensure_thread_id (notmuch_message_t *message);
 
diff --git a/lib/notmuch.h b/lib/notmuch.h
index 60834fb..5d0d224 100644
--- a/lib/notmuch.h
+++ b/lib/notmuch.h
@@ -32,6 +32,7 @@
 NOTMUCH_BEGIN_DECLS
 
 #include <time.h>
+#include <stdlib.h>
 
 #ifndef FALSE
 #define FALSE 0
@@ -241,6 +242,9 @@ notmuch_database_get_timestamp (notmuch_database_t *database,
  * notmuch database will reference the filename, and will not copy the
  * entire contents of the file.
  *
+ * 'size' should be the number of bytes in the file, or -1 if you are
+ * not sure.
+ *
  * If 'message' is not NULL, then, on successful return '*message'
  * will be initialized to a message object that can be used for things
  * such as adding tags to the just-added message. The user should call
@@ -265,6 +269,7 @@ notmuch_database_get_timestamp (notmuch_database_t *database,
 notmuch_status_t
 notmuch_database_add_message (notmuch_database_t *database,
 			      const char *filename,
+			      const off_t size,
 			      notmuch_message_t **message);
 
 /* Find a message with the given message_id.
diff --git a/notmuch-new.c b/notmuch-new.c
index 9d20616..cea66c2 100644
--- a/notmuch-new.c
+++ b/notmuch-new.c
@@ -235,7 +235,7 @@ add_files_recursive (notmuch_database_t *notmuch,
 		    fflush (stdout);
 		}
 
-		status = notmuch_database_add_message (notmuch, next, &message);
+		status = notmuch_database_add_message (notmuch, next, st->st_size, &message);
 		switch (status) {
 		    /* success */
 		    case NOTMUCH_STATUS_SUCCESS:
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: Missing messages breaking threads
  2009-12-18 20:52     ` Carl Worth
@ 2009-12-18 21:24       ` James Westby
  2009-12-22 22:48       ` Olly Betts
  1 sibling, 0 replies; 26+ messages in thread
From: James Westby @ 2009-12-18 21:24 UTC (permalink / raw)
  To: Carl Worth, notmuch

On Fri, 18 Dec 2009 12:52:58 -0800, Carl Worth <cworth@cworth.org> wrote:
> On Fri, 18 Dec 2009 19:53:13 +0000, James Westby <jw+debian@jameswestby.net> wrote:
> Oh, I was assuming you wouldn't index any text. The UI can add "missing
> message" for a document with no filename, for example.

Works for me.

> > So, to summarise, I should first look at storing filesizes, then
> > the collision code to make it index further when the filesize grows,
> > and then finally the code to add documents for missing messages?
> 
> Some of the code areas to be touched will be changing soon, (at least as
> far as when filenames appear and disappear). Hopefully I'll have
> something posted for that sooner rather than later to avoid having to
> redo too much work.

That would be great. I'm learning all the code anyway, so there's not
a whole lot of knowledge being thrown away.

I've just sent an initial cut at the fist step.

> > The only thing I am unclear on is how to handle existing databases?
> > Do we have any concept of versioning? Or should I just assume that
> > filesize: may not be in the document and act appropriately?
> 
> My current, outstanding patch is going to be the first trigger for a
> "flag day" where we'll all need to rewrite our databases.
> 
> We don't have any concept of versioning yet, but it would obviously be
> easy to have a new version document with an increasing integer.
> 
> But even with my current patch I'm considering doing a graceful upgrade
> of the database in-place rather than making the user do something like a
> dump, delete, rebuild, restore. That would give a much better experience
> than "Your database is out-of-date, please rebuild it", so we'll see if
> I pursue that in the end.

That sounds nice, I'd certainly prefer this sort of thing as it evolves.

Thanks,

James

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Store the size of the file for each message
  2009-12-18 21:21     ` [PATCH] Store the size of the file for each message James Westby
@ 2009-12-18 22:29       ` Carl Worth
  2009-12-19  0:08         ` James Westby
                           ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Carl Worth @ 2009-12-18 22:29 UTC (permalink / raw)
  To: James Westby, notmuch

[-- Attachment #1: Type: text/plain, Size: 1369 bytes --]

On Fri, 18 Dec 2009 21:21:03 +0000, James Westby <jw+debian@jameswestby.net> wrote:
>   Here's the first part, storing the filesize. I'm using
>   add_value so that we can make it sortable, is that valid
>   for retrieving it as well?

Yes, a value makes sense here and should make the value easy to
retrieve.

>   The only thing I'm not sure about is if it works. Is there
>   a way to inspect a document to see the values that are
>   stored?

I usually use a little tool I wrote called xapian-dump. It currently
exists only in the git history of notmuch. Look at commit:

	22691064666c03c5e76bc787395bfe586929f4cc

or so.

> Doing a search isn't working, so I imagine I made a mistake.

Let's see... (just reviewing here, not testing)..

> +struct FilesizeValueRangeProcessor : public Xapian::ValueRangeProcessor {
> +    FilesizeValueRangeProcessor() {}
> +
> +    Xapian::valueno operator()(std::string &begin, std::string &) {
> +        if (begin.substr(0, 9) != "filesize:")
> +            return Xapian::BAD_VALUENO;
> +        begin.erase(0, 9);
> +        return NOTMUCH_VALUE_FILESIZE;
> +    }
> +};

If the file size is just an integer, then you shouldn't need a custom
ValueRangeProcessor. One of the existing processors in Xapian should
work fine.

Having not ever written a custom processor, I can't say whether the one
above is correct or not.

-Carl

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Store the size of the file for each message
  2009-12-18 22:29       ` Carl Worth
@ 2009-12-19  0:08         ` James Westby
  2009-12-19  0:57           ` Carl Worth
  2009-12-19  0:11         ` [PATCH v2] Store the size of the file for each message James Westby
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 26+ messages in thread
From: James Westby @ 2009-12-19  0:08 UTC (permalink / raw)
  To: Carl Worth, notmuch

On Fri, 18 Dec 2009 14:29:21 -0800, Carl Worth <cworth@cworth.org> wrote:
> On Fri, 18 Dec 2009 21:21:03 +0000, James Westby <jw+debian@jameswestby.net> wrote:
> Yes, a value makes sense here and should make the value easy to
> retrieve.

Excellent.

> I usually use a little tool I wrote called xapian-dump. It currently
> exists only in the git history of notmuch. Look at commit:
> 
> 	22691064666c03c5e76bc787395bfe586929f4cc
> 
> or so.

Thanks, I found delve, which at least showed that something was
being stored. It's in the xapian-tools package, and

   delve -V2 <database>

prints out the filesize value for each document.

It would be great if we could specify an alternative configuration
file for testing so that I can set up a small maildir and test
against that.

> If the file size is just an integer, then you shouldn't need a custom
> ValueRangeProcessor. One of the existing processors in Xapian should
> work fine.

Correct, I hadn't read the documentation closely enough. After fixing
that and doing some testing I have this working now. Patch incoming.

Thanks,

James

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v2] Store the size of the file for each message
  2009-12-18 22:29       ` Carl Worth
  2009-12-19  0:08         ` James Westby
@ 2009-12-19  0:11         ` James Westby
  2009-12-19  1:29         ` [PATCH] Reindex larger files that duplicate ids we have James Westby
  2009-12-20 20:27         ` [PATCH] Store documents for message-ids we haven't seen James Westby
  3 siblings, 0 replies; 26+ messages in thread
From: James Westby @ 2009-12-19  0:11 UTC (permalink / raw)
  To: notmuch

When indexing a message store the filesize along with it so that
when we store all the filenames for a message-id we can know if
any of them have different content cheaply.

The value stored is defined to be the largest filesize of any
of the files for that message.

This changes the API for efficiency reasons. The size is often
known to the caller, and so we save a second stat by asking them
to provide it. If they don't know it they can pass -1 and the
stat will be done for them.

We store the filesize such that we can query a range. Thus it
would be possible to query "filesize:0..100" if you somehow
knew the raw message was less that 100 bytes.
---

  With new, improved, working, filesize:.. search.

 lib/database.cc       |    7 +++++++
 lib/message.cc        |   25 +++++++++++++++++++++++++
 lib/notmuch-private.h |    8 +++++++-
 lib/notmuch.h         |    5 +++++
 notmuch-new.c         |    2 +-
 5 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/lib/database.cc b/lib/database.cc
index b6c4d07..d834d94 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -463,6 +463,8 @@ notmuch_database_open (const char *path,
     struct stat st;
     int err;
     unsigned int i;
+    Xapian::NumberValueRangeProcessor *filesize_proc = new Xapian::NumberValueRangeProcessor (NOTMUCH_VALUE_FILESIZE,
+			 "filesize:", true);
 
     if (asprintf (&notmuch_path, "%s/%s", path, ".notmuch") == -1) {
 	notmuch_path = NULL;
@@ -508,6 +510,7 @@ notmuch_database_open (const char *path,
 	notmuch->query_parser->set_stemmer (Xapian::Stem ("english"));
 	notmuch->query_parser->set_stemming_strategy (Xapian::QueryParser::STEM_SOME);
 	notmuch->query_parser->add_valuerangeprocessor (notmuch->value_range_processor);
+	notmuch->query_parser->add_valuerangeprocessor (filesize_proc);
 
 	for (i = 0; i < ARRAY_SIZE (BOOLEAN_PREFIX_EXTERNAL); i++) {
 	    prefix_t *prefix = &BOOLEAN_PREFIX_EXTERNAL[i];
@@ -889,6 +892,7 @@ _notmuch_database_link_message (notmuch_database_t *notmuch,
 notmuch_status_t
 notmuch_database_add_message (notmuch_database_t *notmuch,
 			      const char *filename,
+			      const off_t size,
 			      notmuch_message_t **message_ret)
 {
     notmuch_message_file_t *message_file;
@@ -992,6 +996,9 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
 	if (private_status == NOTMUCH_PRIVATE_STATUS_NO_DOCUMENT_FOUND) {
 	    _notmuch_message_set_filename (message, filename);
 	    _notmuch_message_add_term (message, "type", "mail");
+	    ret = _notmuch_message_set_filesize (message, filename, size);
+	    if (ret)
+		goto DONE;
 	} else {
 	    ret = NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
 	    goto DONE;
diff --git a/lib/message.cc b/lib/message.cc
index 49519f1..2bfc5ed 100644
--- a/lib/message.cc
+++ b/lib/message.cc
@@ -426,6 +426,31 @@ _notmuch_message_set_filename (notmuch_message_t *message,
     message->doc.set_data (s);
 }
 
+notmuch_status_t
+_notmuch_message_set_filesize (notmuch_message_t *message,
+			       const char *filename,
+			       const off_t size)
+{
+    struct stat st;
+    off_t realsize = size;
+    notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
+
+    if (realsize < 0) {
+	if (stat (filename, &st)) {
+	    ret = NOTMUCH_STATUS_FILE_ERROR;
+	    goto DONE;
+	} else {
+	    realsize = st.st_size;
+	}
+    }
+
+    message->doc.add_value (NOTMUCH_VALUE_FILESIZE,
+			 Xapian::sortable_serialise (realsize));
+
+  DONE:
+    return ret;
+}
+
 const char *
 notmuch_message_get_filename (notmuch_message_t *message)
 {
diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
index 116f63d..1ba3055 100644
--- a/lib/notmuch-private.h
+++ b/lib/notmuch-private.h
@@ -100,7 +100,8 @@ _internal_error (const char *format, ...) PRINTF_ATTRIBUTE (1, 2);
 
 typedef enum {
     NOTMUCH_VALUE_TIMESTAMP = 0,
-    NOTMUCH_VALUE_MESSAGE_ID
+    NOTMUCH_VALUE_MESSAGE_ID,
+    NOTMUCH_VALUE_FILESIZE
 } notmuch_value_t;
 
 /* Xapian (with flint backend) complains if we provide a term longer
@@ -193,6 +194,11 @@ void
 _notmuch_message_set_filename (notmuch_message_t *message,
 			       const char *filename);
 
+notmuch_status_t
+_notmuch_message_set_filesize (notmuch_message_t *message,
+			       const char *filename,
+			       const off_t size);
+
 void
 _notmuch_message_ensure_thread_id (notmuch_message_t *message);
 
diff --git a/lib/notmuch.h b/lib/notmuch.h
index 60834fb..5d0d224 100644
--- a/lib/notmuch.h
+++ b/lib/notmuch.h
@@ -32,6 +32,7 @@
 NOTMUCH_BEGIN_DECLS
 
 #include <time.h>
+#include <stdlib.h>
 
 #ifndef FALSE
 #define FALSE 0
@@ -241,6 +242,9 @@ notmuch_database_get_timestamp (notmuch_database_t *database,
  * notmuch database will reference the filename, and will not copy the
  * entire contents of the file.
  *
+ * 'size' should be the number of bytes in the file, or -1 if you are
+ * not sure.
+ *
  * If 'message' is not NULL, then, on successful return '*message'
  * will be initialized to a message object that can be used for things
  * such as adding tags to the just-added message. The user should call
@@ -265,6 +269,7 @@ notmuch_database_get_timestamp (notmuch_database_t *database,
 notmuch_status_t
 notmuch_database_add_message (notmuch_database_t *database,
 			      const char *filename,
+			      const off_t size,
 			      notmuch_message_t **message);
 
 /* Find a message with the given message_id.
diff --git a/notmuch-new.c b/notmuch-new.c
index 9d20616..cea66c2 100644
--- a/notmuch-new.c
+++ b/notmuch-new.c
@@ -235,7 +235,7 @@ add_files_recursive (notmuch_database_t *notmuch,
 		    fflush (stdout);
 		}
 
-		status = notmuch_database_add_message (notmuch, next, &message);
+		status = notmuch_database_add_message (notmuch, next, st->st_size, &message);
 		switch (status) {
 		    /* success */
 		    case NOTMUCH_STATUS_SUCCESS:
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH] Store the size of the file for each message
  2009-12-19  0:08         ` James Westby
@ 2009-12-19  0:57           ` Carl Worth
  2009-12-19  1:35             ` James Westby
  0 siblings, 1 reply; 26+ messages in thread
From: Carl Worth @ 2009-12-19  0:57 UTC (permalink / raw)
  To: James Westby, notmuch

[-- Attachment #1: Type: text/plain, Size: 834 bytes --]

On Sat, 19 Dec 2009 00:08:24 +0000, James Westby <jw+debian@jameswestby.net> wrote:
> Thanks, I found delve, which at least showed that something was
> being stored. It's in the xapian-tools package, and
> 
>    delve -V2 <database>
> 
> prints out the filesize value for each document.

Ah, right. I had forgotten about that.

> It would be great if we could specify an alternative configuration
> file for testing so that I can set up a small maildir and test
> against that.

You can, actually. Just set the NOTMUCH_CONFIG environment variable to
your alternate configuration file. (And yes, we're missing any mention
of this in our documentation.)

> Correct, I hadn't read the documentation closely enough. After fixing
> that and doing some testing I have this working now. Patch incoming.

Cool!

-Carl

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH] Reindex larger files that duplicate ids we have
  2009-12-18 22:29       ` Carl Worth
  2009-12-19  0:08         ` James Westby
  2009-12-19  0:11         ` [PATCH v2] Store the size of the file for each message James Westby
@ 2009-12-19  1:29         ` James Westby
  2009-12-21 18:33           ` Carl Worth
  2009-12-20 20:27         ` [PATCH] Store documents for message-ids we haven't seen James Westby
  3 siblings, 1 reply; 26+ messages in thread
From: James Westby @ 2009-12-19  1:29 UTC (permalink / raw)
  To: notmuch

When we see a message where we already have the file
id stored, check if the size is larger. If it is then
re-index and set the file size and name to be the
new message.
---

  Here's the (quite simple) patch to implement indexing the
  largest copy of each mail that we have.

  Does the re-indexing replace the old terms? In the case
  where you had a collision with different text this could
  make a search return mails that don't contain that text.
  I don't think it's a big issue though, even if that is the
  case.

  Thanks,

  James

 lib/database.cc       |    4 +++-
 lib/index.cc          |   27 +++++++++++++++++++++++++++
 lib/message.cc        |   31 ++++++++++++++++++++++++++-----
 lib/notmuch-private.h |   13 +++++++++++++
 lib/notmuch.h         |    5 +++--
 5 files changed, 72 insertions(+), 8 deletions(-)

diff --git a/lib/database.cc b/lib/database.cc
index d834d94..64f29b9 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -1000,7 +1000,9 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
 	    if (ret)
 		goto DONE;
 	} else {
-	    ret = NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
+	    ret = _notmuch_message_possibly_reindex (message, filename, size);
+	    if (!ret)
+		ret = NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
 	    goto DONE;
 	}
 
diff --git a/lib/index.cc b/lib/index.cc
index 125fa6c..14c3268 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -312,3 +312,30 @@ _notmuch_message_index_file (notmuch_message_t *message,
 
     return ret;
 }
+
+notmuch_status_t
+_notmuch_message_possibly_reindex (notmuch_message_t *message,
+			     const char *filename,
+			     const off_t size)
+{
+    off_t realsize = size;
+    off_t stored_size;
+    notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
+
+    ret = _notmuch_message_size_on_disk (message, filename, &realsize);
+    if (ret)
+        goto DONE;
+    stored_size = _notmuch_message_get_filesize (message);
+    if (realsize > stored_size) {
+	ret = _notmuch_message_index_file (message, filename);
+	if (ret)
+	    goto DONE;
+	ret = _notmuch_message_set_filesize (message, filename, realsize);
+	_notmuch_message_set_filename (message, filename);
+	_notmuch_message_sync (message);
+    }
+
+  DONE:
+    return ret;
+
+}
diff --git a/lib/message.cc b/lib/message.cc
index 2bfc5ed..cc32741 100644
--- a/lib/message.cc
+++ b/lib/message.cc
@@ -427,23 +427,38 @@ _notmuch_message_set_filename (notmuch_message_t *message,
 }
 
 notmuch_status_t
-_notmuch_message_set_filesize (notmuch_message_t *message,
+_notmuch_message_size_on_disk (notmuch_message_t *message,
 			       const char *filename,
-			       const off_t size)
+			       off_t *size)
 {
     struct stat st;
-    off_t realsize = size;
     notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
 
-    if (realsize < 0) {
+    if (*size < 0) {
 	if (stat (filename, &st)) {
 	    ret = NOTMUCH_STATUS_FILE_ERROR;
 	    goto DONE;
 	} else {
-	    realsize = st.st_size;
+	    *size = st.st_size;
 	}
     }
 
+  DONE:
+    return ret;
+}
+
+notmuch_status_t
+_notmuch_message_set_filesize (notmuch_message_t *message,
+			       const char *filename,
+			       const off_t size)
+{
+    off_t realsize = size;
+    notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
+
+    ret = _notmuch_message_size_on_disk (message, filename, &realsize);
+    if (ret)
+        goto DONE;
+
     message->doc.add_value (NOTMUCH_VALUE_FILESIZE,
 			 Xapian::sortable_serialise (realsize));
 
@@ -451,6 +466,12 @@ _notmuch_message_set_filesize (notmuch_message_t *message,
     return ret;
 }
 
+off_t
+_notmuch_message_get_filesize (notmuch_message_t *message)
+{
+    return Xapian::sortable_unserialise (message->doc.get_value (NOTMUCH_VALUE_FILESIZE));
+}
+
 const char *
 notmuch_message_get_filename (notmuch_message_t *message)
 {
diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
index 1ba3055..cf65fd9 100644
--- a/lib/notmuch-private.h
+++ b/lib/notmuch-private.h
@@ -199,6 +199,14 @@ _notmuch_message_set_filesize (notmuch_message_t *message,
 			       const char *filename,
 			       const off_t size);
 
+off_t
+_notmuch_message_get_filesize (notmuch_message_t *message);
+
+notmuch_status_t
+_notmuch_message_size_on_disk (notmuch_message_t *message,
+			       const char *filename,
+			       off_t *size);
+
 void
 _notmuch_message_ensure_thread_id (notmuch_message_t *message);
 
@@ -218,6 +226,11 @@ notmuch_status_t
 _notmuch_message_index_file (notmuch_message_t *message,
 			     const char *filename);
 
+notmuch_status_t
+_notmuch_message_possibly_reindex (notmuch_message_t *message,
+			     const char *filename,
+			     const off_t size);
+
 /* message-file.c */
 
 /* XXX: I haven't decided yet whether these will actually get exported
diff --git a/lib/notmuch.h b/lib/notmuch.h
index 5d0d224..892e420 100644
--- a/lib/notmuch.h
+++ b/lib/notmuch.h
@@ -256,8 +256,9 @@ notmuch_database_get_timestamp (notmuch_database_t *database,
  * NOTMUCH_STATUS_SUCCESS: Message successfully added to database.
  *
  * NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID: Message has the same message
- *	ID as another message already in the database. Nothing added
- *	to the database.
+ *	ID as another message already in the database. This may have
+ *	caused some further indexing to be done, but it is not an entirely
+ *	new message.
  *
  * NOTMUCH_STATUS_FILE_ERROR: an error occurred trying to open the
  *	file, (such as permission denied, or file not found,
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH] Store the size of the file for each message
  2009-12-19  0:57           ` Carl Worth
@ 2009-12-19  1:35             ` James Westby
  2009-12-19  5:24               ` Carl Worth
  0 siblings, 1 reply; 26+ messages in thread
From: James Westby @ 2009-12-19  1:35 UTC (permalink / raw)
  To: Carl Worth, notmuch

On Fri, 18 Dec 2009 16:57:16 -0800, Carl Worth <cworth@cworth.org> wrote:
> You can, actually. Just set the NOTMUCH_CONFIG environment variable to
> your alternate configuration file. (And yes, we're missing any mention
> of this in our documentation.)

Sweet. Where would be the best place to document it? Just in the
man page?

Thanks,

James

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Store the size of the file for each message
  2009-12-19  1:35             ` James Westby
@ 2009-12-19  5:24               ` Carl Worth
  2009-12-19  8:02                 ` Marten Veldthuis
  2009-12-19 13:12                 ` [PATCH] Add ENVIRONMENT VARIABLES section to the man page James Westby
  0 siblings, 2 replies; 26+ messages in thread
From: Carl Worth @ 2009-12-19  5:24 UTC (permalink / raw)
  To: James Westby, notmuch

[-- Attachment #1: Type: text/plain, Size: 709 bytes --]

On Sat, 19 Dec 2009 01:35:46 +0000, James Westby <jw+debian@jameswestby.net> wrote:
> On Fri, 18 Dec 2009 16:57:16 -0800, Carl Worth <cworth@cworth.org> wrote:
> > You can, actually. Just set the NOTMUCH_CONFIG environment variable to
> > your alternate configuration file. (And yes, we're missing any mention
> > of this in our documentation.)
> 
> Sweet. Where would be the best place to document it? Just in the
> man page?

Currently we're replicating all of our documentation both in the man
page and in the output from "notmuch help". It's annoying to have to
add everything in two places, but I don't have a good idea for making
that sharable yet.

Anyone have a solution here?

-Carl

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Store the size of the file for each message
  2009-12-19  5:24               ` Carl Worth
@ 2009-12-19  8:02                 ` Marten Veldthuis
  2009-12-19 17:17                   ` Carl Worth
  2009-12-19 13:12                 ` [PATCH] Add ENVIRONMENT VARIABLES section to the man page James Westby
  1 sibling, 1 reply; 26+ messages in thread
From: Marten Veldthuis @ 2009-12-19  8:02 UTC (permalink / raw)
  To: Carl Worth, James Westby, notmuch

On Fri, 18 Dec 2009 21:24:10 -0800, Carl Worth <cworth@cworth.org> wrote:
> Currently we're replicating all of our documentation both in the man
> page and in the output from "notmuch help". It's annoying to have to
> add everything in two places, but I don't have a good idea for making
> that sharable yet.
> 
> Anyone have a solution here?

Something like "git help add" just opens the manpage for git-add. Can't
we do the same here?

-- 
- Marten

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH] Add ENVIRONMENT VARIABLES section to the man page
  2009-12-19  5:24               ` Carl Worth
  2009-12-19  8:02                 ` Marten Veldthuis
@ 2009-12-19 13:12                 ` James Westby
  2010-02-23 19:25                   ` Carl Worth
  1 sibling, 1 reply; 26+ messages in thread
From: James Westby @ 2009-12-19 13:12 UTC (permalink / raw)
  To: notmuch

Briefly describe the NOTMUCH_CONFIG variable there.
---

  It turns out that it is documented in notmuch help setup, but I
  missed it.

  I'm not sure how to phrase it to fit at the end of notmuch help,
  suggestions welcome. For now I added a traditional ENVIRONMENT
  VARIABLES section to the man page.

  Thanks,

  James

 notmuch.1 |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/notmuch.1 b/notmuch.1
index 369ecba..f304404 100644
--- a/notmuch.1
+++ b/notmuch.1
@@ -432,6 +432,10 @@ specify a date range to return messages from 2009-10-01 until the
 current time:
 
 	$(date +%s -d 2009-10-01)..$(date +%s)
+.SH ENVIRONMENT VARIABLES
+.IP NOTMUCH_CONFIG
+Specifies the location of the configuration file to override the default of
+.IR ~/.notmuch-config .
 .SH SEE ALSO
 The emacs-based interface to notmuch (available as
 .B notmuch.el
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH] Store the size of the file for each message
  2009-12-19  8:02                 ` Marten Veldthuis
@ 2009-12-19 17:17                   ` Carl Worth
  0 siblings, 0 replies; 26+ messages in thread
From: Carl Worth @ 2009-12-19 17:17 UTC (permalink / raw)
  To: Marten Veldthuis, James Westby, notmuch

[-- Attachment #1: Type: text/plain, Size: 685 bytes --]

On Sat, 19 Dec 2009 09:02:11 +0100, Marten Veldthuis <marten@veldthuis.com> wrote:
> > Anyone have a solution here?
> 
> Something like "git help add" just opens the manpage for git-add. Can't
> we do the same here?

The granularity is different, though. I like that "notmuch help show"
shows just the documentation for "notmuch show". And I also like that
"man notmuch" shows all the documentation, (without having to have N
different man pages for each of the sub-commands).

Meanwhile, the git approach does mean that one doesn't get any "builtin"
help unless the external man-based stuff is working and installed. I'm
not sure that I want to depend on that.

-Carl

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH] Store documents for message-ids we haven't seen
  2009-12-18 22:29       ` Carl Worth
                           ` (2 preceding siblings ...)
  2009-12-19  1:29         ` [PATCH] Reindex larger files that duplicate ids we have James Westby
@ 2009-12-20 20:27         ` James Westby
  2009-12-21 18:43           ` Carl Worth
  3 siblings, 1 reply; 26+ messages in thread
From: James Westby @ 2009-12-20 20:27 UTC (permalink / raw)
  To: notmuch

When scanning the maildir there can be earlier messages
in the thread that we haven't seen, either due to mail
delays, or because they weren't sent. This can break the
threading.

Therefore we store stub documents for every message id
we see, so that we can link them in to the threads. If
we see the message later then we will replace the document
with the full information.
---

  Here's the third part of the patch, that actually stores
  the stubs and links the threads. It's fairly simple,
  though we may want to store a little more in the stubs.

  One case that isn't handled is when we don't find the thread
  id of the parent, but then find the message itself. I believe
  this case shouldn't happen, but you never know.

 lib/database.cc       |   25 +++++++++++++++++++++++++
 lib/message.cc        |    9 +--------
 lib/notmuch-private.h |   10 ++++++++++
 3 files changed, 36 insertions(+), 8 deletions(-)

diff --git a/lib/database.cc b/lib/database.cc
index 64f29b9..6ee8068 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -782,8 +782,33 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
 							     parent_message_id);
 
 	if (parent_thread_id == NULL) {
+	    notmuch_private_status_t private_status;
+	    notmuch_message_t *new_parent_message;
+	    thread_id_t parent_thread_id_t;
+
+	    // We have yet to see the referenced message, generate a thread id
+	    // for it if needed and store a dummy message for the parent. If we
+	    // find the mail later we will replace the dummy.
 	    _notmuch_message_add_term (message, "reference",
 				       parent_message_id);
+	    if (*thread_id == NULL) {
+		thread_id_generate (&parent_thread_id_t);
+		*thread_id = parent_thread_id_t.str;
+	    }
+	    new_parent_message = _notmuch_message_create_for_message_id (notmuch,
+					    parent_message_id, &private_status);
+		if (private_status) {
+		    if (private_status == NOTMUCH_PRIVATE_STATUS_NO_DOCUMENT_FOUND) {
+			// expected
+			_notmuch_message_add_term (new_parent_message, "thread", *thread_id);
+			_notmuch_message_sync (new_parent_message);
+		    } else {
+			ret = COERCE_STATUS (private_status,
+				 "Cannot create parent message");
+			goto DONE;
+		    }
+		}
+	    _notmuch_message_add_term (message, "thread", *thread_id);
 	} else {
 	    if (*thread_id == NULL) {
 		*thread_id = talloc_strdup (message, parent_thread_id);
diff --git a/lib/message.cc b/lib/message.cc
index 7129d59..dff02c4 100644
--- a/lib/message.cc
+++ b/lib/message.cc
@@ -42,13 +42,6 @@ struct _notmuch_message {
     Xapian::Document doc;
 };
 
-/* "128 bits of thread-id ought to be enough for anybody" */
-#define NOTMUCH_THREAD_ID_BITS	 128
-#define NOTMUCH_THREAD_ID_DIGITS (NOTMUCH_THREAD_ID_BITS / 4)
-typedef struct _thread_id {
-    char str[NOTMUCH_THREAD_ID_DIGITS + 1];
-} thread_id_t;
-
 /* We end up having to call the destructor explicitly because we had
  * to use "placement new" in order to initialize C++ objects within a
  * block that we allocated with talloc. So C++ is making talloc
@@ -551,7 +544,7 @@ _notmuch_message_set_date (notmuch_message_t *message,
 			    Xapian::sortable_serialise (time_value));
 }
 
-static void
+void
 thread_id_generate (thread_id_t *thread_id)
 {
     static int seeded = 0;
diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
index cf65fd9..cf75b4a 100644
--- a/lib/notmuch-private.h
+++ b/lib/notmuch-private.h
@@ -141,6 +141,13 @@ typedef enum _notmuch_private_status {
      :									\
      (notmuch_status_t) private_status)
 
+/* "128 bits of thread-id ought to be enough for anybody" */
+#define NOTMUCH_THREAD_ID_BITS	 128
+#define NOTMUCH_THREAD_ID_DIGITS (NOTMUCH_THREAD_ID_BITS / 4)
+typedef struct _thread_id {
+    char str[NOTMUCH_THREAD_ID_DIGITS + 1];
+} thread_id_t;
+
 /* database.cc */
 
 /* Lookup a prefix value by name.
@@ -347,6 +354,9 @@ void
 _notmuch_message_add_reply (notmuch_message_t *message,
 			    notmuch_message_node_t *reply);
 
+void
+thread_id_generate (thread_id_t *thread_id);
+
 /* sha1.c */
 
 char *
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH] Reindex larger files that duplicate ids we have
  2009-12-19  1:29         ` [PATCH] Reindex larger files that duplicate ids we have James Westby
@ 2009-12-21 18:33           ` Carl Worth
  0 siblings, 0 replies; 26+ messages in thread
From: Carl Worth @ 2009-12-21 18:33 UTC (permalink / raw)
  To: James Westby, notmuch

[-- Attachment #1: Type: text/plain, Size: 1548 bytes --]

Hi James,

I just got to a point in my outstanding rework where I thought it would
make sense to pull this patch series in, (I'm adding support for storing
multiple filenames in a single mail document).

I took a closer look at this series, and I think it's still independent,
so I'll finish up what I'm doing and then add this series on top later.

But I can at least answer some of the questions you asked for now:

>   Does the re-indexing replace the old terms?

Before this patch, there's there's not yet any "re-indexing" in
notmuch. So we'll basically need to think about what we want to do here.

As this patch is written, (just calling into the existing _index_file
function), the re-indexing only adds new terms, (and doesn't delete
any). That's probably correct. We're using file size as an heuristic
that the larger file is a superset of the smaller file, but it doesn't
guarantee that the smaller file doesn't contain any unique terms. So I'd
be extremely hesitant to drop any terms here.

>                                               In the case
>   where you had a collision with different text this could
>   make a search return mails that don't contain that text.
>   I don't think it's a big issue though, even if that is the
>   case.

That's correct. As mentioned in a previous thread, this is likely only a
big issue in the face of deliberate message-ID spoofing or so. In that
thread we talked about some ideas for mitigating that. But I don't think
we need to solve that problem before applying this patch series.

-Carl

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Store documents for message-ids we haven't seen
  2009-12-20 20:27         ` [PATCH] Store documents for message-ids we haven't seen James Westby
@ 2009-12-21 18:43           ` Carl Worth
  0 siblings, 0 replies; 26+ messages in thread
From: Carl Worth @ 2009-12-21 18:43 UTC (permalink / raw)
  To: James Westby, notmuch

[-- Attachment #1: Type: text/plain, Size: 1293 bytes --]

On Sun, 20 Dec 2009 20:27:32 +0000, James Westby <jw+debian@jameswestby.net> wrote:
>   One case that isn't handled is when we don't find the thread
>   id of the parent, but then find the message itself. I believe
>   this case shouldn't happen, but you never know.

It really shouldn't happen since we are holding a write lock on the
database, (so there's no possible race condition here with another
client delivering the parent message).

But since you almost can't help but detect the case, (just noticing a
NOTMUCH_STATUS_SUCCESS value from _create_for_message_id), please put an
INTERNAL_ERROR there rather than marching along with an incorrect thread
ID.

> +	    // We have yet to see the referenced message, generate a thread id
> +	    // for it if needed and store a dummy message for the parent. If we
> +	    // find the mail later we will replace the dummy.

Call me old-fashioned if you will, but I'd much rather have C style
multi-line comments (/* ... */) rather than these C++-style comments
with //.

> +		    if (private_status == NOTMUCH_PRIVATE_STATUS_NO_DOCUMENT_FOUND) {
> +			// expected

And I think this comment deserves a complete sentence before the
condition. Something like:

/* We expect this call to create a new document (return NO_DOCUMENT_FOUND) */

-Carl

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Missing messages breaking threads
  2009-12-18 20:52     ` Carl Worth
  2009-12-18 21:24       ` James Westby
@ 2009-12-22 22:48       ` Olly Betts
  2009-12-23  0:05         ` Carl Worth
  1 sibling, 1 reply; 26+ messages in thread
From: Olly Betts @ 2009-12-22 22:48 UTC (permalink / raw)
  To: notmuch

Carl Worth writes:
> We don't have any concept of versioning yet, but it would obviously be
> easy to have a new version document with an increasing integer.

Adding a magic document for this isn't ideal as you have to make sure
it can't appear in search results, etc.

This is just the sort of thing which Xapian's "user metadata" is there
for.  It's essentially a key/value store which is versioned along with
the rest of the Xapian database.  So to set it:

  database.set_metadata("version", "1");

And to read (and default if not set):

  string version = database.get_metadata("version");
  if (version.empty()) version = "0";

Cheers,
   Olly

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Missing messages breaking threads
  2009-12-22 22:48       ` Olly Betts
@ 2009-12-23  0:05         ` Carl Worth
  2010-03-13 21:27           ` [PATCH] Store thread ids for messages that we haven't seen yet James Westby
  0 siblings, 1 reply; 26+ messages in thread
From: Carl Worth @ 2009-12-23  0:05 UTC (permalink / raw)
  To: Olly Betts, notmuch

[-- Attachment #1: Type: text/plain, Size: 643 bytes --]

On Tue, 22 Dec 2009 22:48:25 +0000 (UTC), Olly Betts <olly@survex.com> wrote:
> This is just the sort of thing which Xapian's "user metadata" is there
> for.  It's essentially a key/value store which is versioned along with
> the rest of the Xapian database.  So to set it:
> 
>   database.set_metadata("version", "1");
> 
> And to read (and default if not set):
> 
>   string version = database.get_metadata("version");
>   if (version.empty()) version = "0";

Thanks, Olly!

That is exactly what we'll want here, and is much better than a magic
document.

-Carl (grateful to have a Xapian expert keeping watch on the list)

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Add ENVIRONMENT VARIABLES section to the man page
  2009-12-19 13:12                 ` [PATCH] Add ENVIRONMENT VARIABLES section to the man page James Westby
@ 2010-02-23 19:25                   ` Carl Worth
  0 siblings, 0 replies; 26+ messages in thread
From: Carl Worth @ 2010-02-23 19:25 UTC (permalink / raw)
  To: James Westby, notmuch

[-- Attachment #1: Type: text/plain, Size: 581 bytes --]

On Sat, 19 Dec 2009 13:12:15 +0000, James Westby <jw+debian@jameswestby.net> wrote:
> Briefly describe the NOTMUCH_CONFIG variable there.

Thanks, James! I've pushed this out now.

>   I'm not sure how to phrase it to fit at the end of notmuch help,
>   suggestions welcome. For now I added a traditional ENVIRONMENT
>   VARIABLES section to the man page.

I had originally postponed your patch only because I wanted to try to
answer your question here about how to fit this at the end of notmuch
help as well. But it's not obvious to me that there's a good way to do
that.

-Carl

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH] Store thread ids for messages that we haven't seen yet
  2009-12-23  0:05         ` Carl Worth
@ 2010-03-13 21:27           ` James Westby
  2010-04-13 15:20             ` Carl Worth
  0 siblings, 1 reply; 26+ messages in thread
From: James Westby @ 2010-03-13 21:27 UTC (permalink / raw)
  To: notmuch

This allows us to thread messages even when we receive them out of
order, or never receive the root.

The thread ids for messages that aren't present but are referred to are
stored as metadata in the database and then retrieved if we ever get
that message.

When determining the thread id for a message we also check for this
metadata so that we can thread descendants of a message together before
we receive it.
---
 lib/database.cc   |   78 ++++++++++++++++++++++++++++++++++++++--------------
 test/notmuch-test |   32 +++++++++++++++++++--
 2 files changed, 86 insertions(+), 24 deletions(-)

diff --git a/lib/database.cc b/lib/database.cc
index c91e97c..92234ff 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -1111,6 +1111,31 @@ notmuch_database_get_directory (notmuch_database_t *notmuch,
     return _notmuch_directory_create (notmuch, path, &status);
 }
 
+static const char *
+_notmuch_database_generate_thread_id (notmuch_database_t *notmuch)
+{
+    /* 16 bytes (+ terminator) for hexadecimal representation of
+     * a 64-bit integer. */
+    static char thread_id[17];
+    Xapian::WritableDatabase *db;
+
+    db = static_cast <Xapian::WritableDatabase *> (notmuch->xapian_db);
+
+    notmuch->last_thread_id++;
+
+    sprintf (thread_id, "%016" PRIx64, notmuch->last_thread_id);
+
+    db->set_metadata ("last_thread_id", thread_id);
+
+    return thread_id;
+}
+
+static char *
+_get_metadata_thread_id_key (void *ctx, const char *message_id)
+{
+    return talloc_asprintf (ctx, "thread_id_%s", message_id);
+}
+
 /* Find the thread ID to which the message with 'message_id' belongs.
  *
  * Returns NULL if no message with message ID 'message_id' is in the
@@ -1127,8 +1152,25 @@ _resolve_message_id_to_thread_id (notmuch_database_t *notmuch,
     const char *ret = NULL;
 
     message = notmuch_database_find_message (notmuch, message_id);
-    if (message == NULL)
-	goto DONE;
+    /* If we haven't seen that message yet then check if we have already
+     * generated a dummy id for it and stored it in the metadata.
+     * If not then we generate a new thread id.
+     * This ensures that we can thread messages even when we haven't received
+     * the root (yet?)
+     */
+    if (message == NULL) {
+        Xapian::WritableDatabase *db = static_cast <Xapian::WritableDatabase *> (notmuch->xapian_db);
+        char * metadata_key = _get_metadata_thread_id_key (ctx, message_id);
+        string thread_id = notmuch->xapian_db->get_metadata(metadata_key);
+        if (thread_id.empty()) {
+            ret = _notmuch_database_generate_thread_id(notmuch);
+            db->set_metadata(metadata_key, ret);
+        } else {
+            ret = thread_id.c_str();
+        }
+        talloc_free (metadata_key);
+        goto DONE;
+    }
 
     ret = talloc_steal (ctx, notmuch_message_get_thread_id (message));
 
@@ -1295,25 +1337,6 @@ _notmuch_database_link_message_to_children (notmuch_database_t *notmuch,
     return ret;
 }
 
-static const char *
-_notmuch_database_generate_thread_id (notmuch_database_t *notmuch)
-{
-    /* 16 bytes (+ terminator) for hexadecimal representation of
-     * a 64-bit integer. */
-    static char thread_id[17];
-    Xapian::WritableDatabase *db;
-
-    db = static_cast <Xapian::WritableDatabase *> (notmuch->xapian_db);
-
-    notmuch->last_thread_id++;
-
-    sprintf (thread_id, "%016" PRIx64, notmuch->last_thread_id);
-
-    db->set_metadata ("last_thread_id", thread_id);
-
-    return thread_id;
-}
-
 /* Given a (mostly empty) 'message' and its corresponding
  * 'message_file' link it to existing threads in the database.
  *
@@ -1337,6 +1360,19 @@ _notmuch_database_link_message (notmuch_database_t *notmuch,
 {
     notmuch_status_t status;
     const char *thread_id = NULL;
+    char *metadata_key = _get_metadata_thread_id_key (message,
+            notmuch_message_get_message_id (message));
+    /* Check if we have already seen related messages to this one.
+     * If we have then use the thread_id that we stored at that time.
+     */
+    string stored_id = notmuch->xapian_db->get_metadata (metadata_key);
+    if (!stored_id.empty()) {
+        Xapian::WritableDatabase *db = static_cast <Xapian::WritableDatabase *> (notmuch->xapian_db);
+        db->set_metadata (metadata_key, "");
+        thread_id = stored_id.c_str();
+        _notmuch_message_add_term (message, "thread", thread_id);
+    }
+    talloc_free (metadata_key);
 
     status = _notmuch_database_link_message_to_parents (notmuch, message,
 							message_file,
diff --git a/test/notmuch-test b/test/notmuch-test
index 7bc53ec..9264fb0 100755
--- a/test/notmuch-test
+++ b/test/notmuch-test
@@ -64,6 +64,10 @@ increment_mtime ()
 #	Additional values for email headers. If these are not provided
 #	then the relevant headers will simply not appear in the
 #	message.
+#
+#  '[id]=<message-id>'
+#
+#	Controls the message-id of the created message.
 gen_msg_cnt=0
 gen_msg_filename=""
 gen_msg_id=""
@@ -73,9 +77,14 @@ generate_message ()
     local -A template="($@)"
     local additional_headers
 
-    gen_msg_cnt=$((gen_msg_cnt + 1))
-    gen_msg_name=msg-$(printf "%03d" $gen_msg_cnt)
-    gen_msg_id="${gen_msg_name}@notmuch-test-suite"
+    if [ -z "${template[id]}" ]; then
+	gen_msg_cnt=$((gen_msg_cnt + 1))
+	gen_msg_name=msg-$(printf "%03d" $gen_msg_cnt)
+	gen_msg_id="${gen_msg_name}@notmuch-test-suite"
+    else
+	gen_msg_name="msg-${template[id]}"
+	gen_msg_id="${template[id]}"
+    fi
 
     if [ -z "${template[dir]}" ]; then
 	gen_msg_filename="${MAIL_DIR}/$gen_msg_name"
@@ -534,6 +543,23 @@ printf " Restore with nothing to do...\t"
 $NOTMUCH restore dump.expected
 echo "	PASS"
 
+printf "\nTesting threading when messages received out of order:\n"
+printf " Adding initial child message...\t\t"
+generate_message [body]=foo '[in-reply-to]=\<parent-id\>' [subject]=brokenthreadtest '[date]="Sat, 01 Jan 2000 12:00:00 -0000"'
+execute_expecting new "Added 1 new message to the database."
+printf " Searching returns the message...\t\t"
+execute_expecting "search foo" "thread:XXX   2000-01-01 [1/1] Notmuch Test Suite; brokenthreadtest (inbox unread)"
+printf " Adding second child message...\t\t"
+generate_message [body]=foo '[in-reply-to]=\<parent-id\>' [subject]=brokenthreadtest '[date]="Sat, 01 Jan 2000 12:00:00 -0000"'
+execute_expecting new "Added 1 new message to the database."
+printf " Searching returns both messages in one thread...\t\t"
+execute_expecting "search foo" "thread:XXX   2000-01-01 [2/2] Notmuch Test Suite; brokenthreadtest (inbox unread)"
+printf " Adding parent message...\t\t"
+generate_message [body]=foo [id]=parent-id [subject]=brokenthreadtest '[date]="Sat, 01 Jan 2000 12:00:00 -0000"'
+execute_expecting new "Added 1 new message to the database."
+printf " Searching returns all three messages in one thread...\t\t"
+execute_expecting "search foo" "thread:XXX   2000-01-01 [3/3] Notmuch Test Suite; brokenthreadtest (inbox unread)"
+
 cat <<EOF
 Notmuch test suite complete.
 
-- 
1.7.0

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH] Store thread ids for messages that we haven't seen yet
  2010-03-13 21:27           ` [PATCH] Store thread ids for messages that we haven't seen yet James Westby
@ 2010-04-13 15:20             ` Carl Worth
  2010-04-13 15:36               ` James Westby
  0 siblings, 1 reply; 26+ messages in thread
From: Carl Worth @ 2010-04-13 15:20 UTC (permalink / raw)
  To: James Westby, notmuch

[-- Attachment #1: Type: text/plain, Size: 1170 bytes --]

On Sat, 13 Mar 2010 16:27:57 -0500, James Westby <jw+debian@jameswestby.net> wrote:
> This allows us to thread messages even when we receive them out of
> order, or never receive the root.

Thanks for this patch, James! It's especially nice to have the fix come
in with additions to the test suite as well.

I did split up the commit so the addition to the test suite happens
first. That way it's easy to test the test itself, (verifying that it
fails before the fix, and then passes after the fix).

I also added a few documentation and other cleanups as follow-on
commits. Hopefully, they don't change the logic at all, but make things
easier to understand.

So that's all pushed.

Then, I started implementing support for retroactively storing
thread_ids for non-existing messages references in already-existing
messages. It took me perhaps too long that a change like that, (while
useful), is too invasive for the current 0.2 release, and not essential
for this particular feature.

So I've postponed that part at least. I hope to make a database-schema
upgrade a key part of a release in a couple of cycles, (for this
feature and for "list:" and "folder:").

-Carl

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Store thread ids for messages that we haven't seen yet
  2010-04-13 15:20             ` Carl Worth
@ 2010-04-13 15:36               ` James Westby
  2010-04-15  0:39                 ` Carl Worth
  0 siblings, 1 reply; 26+ messages in thread
From: James Westby @ 2010-04-13 15:36 UTC (permalink / raw)
  To: Carl Worth, notmuch

On Tue, 13 Apr 2010 08:20:48 -0700, Carl Worth <cworth@cworth.org> wrote:
> Thanks for this patch, James! It's especially nice to have the fix come
> in with additions to the test suite as well.

Thanks for including a test suite I could add to!

> I did split up the commit so the addition to the test suite happens
> first. That way it's easy to test the test itself, (verifying that it
> fails before the fix, and then passes after the fix).

Your choice. I prefer putting them in the same commit to be more
self-documenting, and then using the capabilities of my VCS to verify
the change if i desire.

> I also added a few documentation and other cleanups as follow-on
> commits. Hopefully, they don't change the logic at all, but make things
> easier to understand.
> 
> So that's all pushed.

Great, thanks.

> Then, I started implementing support for retroactively storing
> thread_ids for non-existing messages references in already-existing
> messages. It took me perhaps too long that a change like that, (while
> useful), is too invasive for the current 0.2 release, and not essential
> for this particular feature.

This would fix up threads for all existing messages? Probably a good
thing to have, but not that important to me. In my case I can always
open the bug in my browser if I want to see the full conversation.

> So I've postponed that part at least. I hope to make a database-schema
> upgrade a key part of a release in a couple of cycles, (for this
> feature and for "list:" and "folder:").

Cool, I look forward to it.

Thanks,

James

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Store thread ids for messages that we haven't seen yet
  2010-04-13 15:36               ` James Westby
@ 2010-04-15  0:39                 ` Carl Worth
  0 siblings, 0 replies; 26+ messages in thread
From: Carl Worth @ 2010-04-15  0:39 UTC (permalink / raw)
  To: James Westby, notmuch

[-- Attachment #1: Type: text/plain, Size: 1197 bytes --]

On Tue, 13 Apr 2010 16:36:30 +0100, James Westby <jw+debian@jameswestby.net> wrote:
> Your choice. I prefer putting them in the same commit to be more
> self-documenting, and then using the capabilities of my VCS to verify
> the change if i desire.

But that's my point. With it split, I can actually use "git checkout" to
go to a state where the test exists, but the bug hasn't been fixed. With
it combined, there's no such state. (I can checkout state with the bug,
but then there's nothing in the test suite to exercise it.)

> This would fix up threads for all existing messages?

Yes. It seems un-right for notmuch to provide a feature on an arbitrary
subset of messages, (those that happened to be added after the user
switched to some particular version of notmuch).

> Probably a good thing to have, but not that important to me. In my
> case I can always open the bug in my browser if I want to see the full
> conversation.

I agree it's not totally essential. But it should be easy enough to pick
up in the upcoming database upgrade, (which may actually end up being a
full rebuild anyway---I've got a lot of things to change and a full
rebuild might be the fastest thing to do).

-Carl

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2010-04-15  0:39 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-12-18 19:02 Missing messages breaking threads James Westby
2009-12-18 19:41 ` Carl Worth
2009-12-18 19:53   ` James Westby
2009-12-18 20:52     ` Carl Worth
2009-12-18 21:24       ` James Westby
2009-12-22 22:48       ` Olly Betts
2009-12-23  0:05         ` Carl Worth
2010-03-13 21:27           ` [PATCH] Store thread ids for messages that we haven't seen yet James Westby
2010-04-13 15:20             ` Carl Worth
2010-04-13 15:36               ` James Westby
2010-04-15  0:39                 ` Carl Worth
2009-12-18 21:21     ` [PATCH] Store the size of the file for each message James Westby
2009-12-18 22:29       ` Carl Worth
2009-12-19  0:08         ` James Westby
2009-12-19  0:57           ` Carl Worth
2009-12-19  1:35             ` James Westby
2009-12-19  5:24               ` Carl Worth
2009-12-19  8:02                 ` Marten Veldthuis
2009-12-19 17:17                   ` Carl Worth
2009-12-19 13:12                 ` [PATCH] Add ENVIRONMENT VARIABLES section to the man page James Westby
2010-02-23 19:25                   ` Carl Worth
2009-12-19  0:11         ` [PATCH v2] Store the size of the file for each message James Westby
2009-12-19  1:29         ` [PATCH] Reindex larger files that duplicate ids we have James Westby
2009-12-21 18:33           ` Carl Worth
2009-12-20 20:27         ` [PATCH] Store documents for message-ids we haven't seen James Westby
2009-12-21 18:43           ` Carl Worth

Code repositories for project(s) associated with this public inbox

	https://yhetil.org/notmuch.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).