unofficial mirror of notmuch@notmuchmail.org
 help / color / mirror / code / Atom feed
* [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated
@ 2013-02-25 23:50 Aaron Ecay
  2013-02-26  9:19 ` Jani Nikula
  2013-02-28 19:41 ` Jani Nikula
  0 siblings, 2 replies; 17+ messages in thread
From: Aaron Ecay @ 2013-02-25 23:50 UTC (permalink / raw)
  To: notmuch

Presently, the code which finds the parent of a message as it is being
added to the database assumes that the first Message-ID-like substring
of the In-Reply-To header is the parent Message ID.  Some mail clients,
however, put stuff other than the Message-ID of the parent in the
In-Reply-To header, such as the email address of the sender of the
parent.  This can fool notmuch.

The updated algorithm prefers the last Message ID in the References
header.  The References header lists messages oldest-first, so the last
Message ID is the parent (RFC2822, p. 24).  The References header is
also less likely to be in a non-standard
syntax (http://cr.yp.to/immhf/thread.html,
http://www.jwz.org/doc/threading.html).  In case the References header
is not to be found, fall back to the old behavior.
---

I especially notice this problem on public mailing lists, where
certain people's messages always cause an "out-dent" of the threading,
instead of being nested under whichever message they are replies to.

Technically, putting non-Message-ID crud in the In-Reply-To field is a
violation of RFC2822, but it appears that in practice the References
header is respected more often than the In-Reply-To one.

 lib/database.cc | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/lib/database.cc b/lib/database.cc
index 91d4329..cbf33ae 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -501,8 +501,10 @@ _parse_message_id (void *ctx, const char *message_id, const char **next)
  * 'message_id' in the result (to avoid mass confusion when a single
  * message references itself cyclically---and yes, mail messages are
  * not infrequent in the wild that do this---don't ask me why).
+ *
+ * Return the last reference parsed.
 */
-static void
+static char *
 parse_references (void *ctx,
 		  const char *message_id,
 		  GHashTable *hash,
@@ -511,7 +513,7 @@ parse_references (void *ctx,
     char *ref;

     if (refs == NULL || *refs == '\0')
-	return;
+	return NULL;

     while (*refs) {
 	ref = _parse_message_id (ctx, refs, &refs);
@@ -519,6 +521,8 @@ parse_references (void *ctx,
 	if (ref && strcmp (ref, message_id))
 	    g_hash_table_insert (hash, ref, NULL);
     }
+
+    return ref;
 }

 notmuch_status_t
@@ -1365,7 +1369,7 @@ _notmuch_database_generate_doc_id (notmuch_database_t *notmuch)
     notmuch->last_doc_id++;

     if (notmuch->last_doc_id == 0)
-	INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");
+	INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");

     return notmuch->last_doc_id;
 }
@@ -1509,7 +1513,7 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
 					   const char **thread_id)
 {
     GHashTable *parents = NULL;
-    const char *refs, *in_reply_to, *in_reply_to_message_id;
+    const char *refs, *in_reply_to, *in_reply_to_message_id, *last_ref_message_id;
     GList *l, *keys = NULL;
     notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;

@@ -1517,21 +1521,31 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
 				     _my_talloc_free_for_g_hash, NULL);

     refs = notmuch_message_file_get_header (message_file, "references");
-    parse_references (message, notmuch_message_get_message_id (message),
-		      parents, refs);
+    last_ref_message_id = parse_references (message,
+					    notmuch_message_get_message_id (message),
+					    parents, refs);

     in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to");
     parse_references (message, notmuch_message_get_message_id (message),
 		      parents, in_reply_to);

-    /* Carefully avoid adding any self-referential in-reply-to term. */
     in_reply_to_message_id = _parse_message_id (message, in_reply_to, NULL);
+    /* If the parent message ID from the Reply-To and References
+     * headers are different, use the References one.  This is because
+     * the Reply-To header is more likely to be in an non-standard
+     * format. */
+    if (in_reply_to_message_id &&
+	last_ref_message_id &&
+	strcmp (last_ref_message_id, in_reply_to_message_id)) {
+	in_reply_to_message_id = last_ref_message_id;
+    }
+    /* Carefully avoid adding any self-referential in-reply-to term. */
     if (in_reply_to_message_id &&
 	strcmp (in_reply_to_message_id,
 		notmuch_message_get_message_id (message)))
     {
 	_notmuch_message_add_term (message, "replyto",
-			     _parse_message_id (message, in_reply_to, NULL));
+			     in_reply_to_message_id);
     }

     keys = g_hash_table_get_keys (parents);
--
1.8.1.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated
  2013-02-25 23:50 [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated Aaron Ecay
@ 2013-02-26  9:19 ` Jani Nikula
  2013-02-26 19:40   ` Aaron Ecay
  2013-02-28 19:41 ` Jani Nikula
  1 sibling, 1 reply; 17+ messages in thread
From: Jani Nikula @ 2013-02-26  9:19 UTC (permalink / raw)
  To: Aaron Ecay, notmuch

On Tue, 26 Feb 2013, Aaron Ecay <aaronecay@gmail.com> wrote:
> Presently, the code which finds the parent of a message as it is being
> added to the database assumes that the first Message-ID-like substring
> of the In-Reply-To header is the parent Message ID.  Some mail clients,
> however, put stuff other than the Message-ID of the parent in the
> In-Reply-To header, such as the email address of the sender of the
> parent.  This can fool notmuch.

Hi Aaron, please provide references to a few messages like this. If
available on the notmuch list an id: reference would be best, but
otherwise some archive that allows viewing full message headers or
downloading the full message would be best.

Thanks,
Jani.

>
> The updated algorithm prefers the last Message ID in the References
> header.  The References header lists messages oldest-first, so the last
> Message ID is the parent (RFC2822, p. 24).  The References header is
> also less likely to be in a non-standard
> syntax (http://cr.yp.to/immhf/thread.html,
> http://www.jwz.org/doc/threading.html).  In case the References header
> is not to be found, fall back to the old behavior.
> ---
>
> I especially notice this problem on public mailing lists, where
> certain people's messages always cause an "out-dent" of the threading,
> instead of being nested under whichever message they are replies to.
>
> Technically, putting non-Message-ID crud in the In-Reply-To field is a
> violation of RFC2822, but it appears that in practice the References
> header is respected more often than the In-Reply-To one.
>
>  lib/database.cc | 30 ++++++++++++++++++++++--------
>  1 file changed, 22 insertions(+), 8 deletions(-)
>
> diff --git a/lib/database.cc b/lib/database.cc
> index 91d4329..cbf33ae 100644
> --- a/lib/database.cc
> +++ b/lib/database.cc
> @@ -501,8 +501,10 @@ _parse_message_id (void *ctx, const char *message_id, const char **next)
>   * 'message_id' in the result (to avoid mass confusion when a single
>   * message references itself cyclically---and yes, mail messages are
>   * not infrequent in the wild that do this---don't ask me why).
> + *
> + * Return the last reference parsed.
>  */
> -static void
> +static char *
>  parse_references (void *ctx,
>  		  const char *message_id,
>  		  GHashTable *hash,
> @@ -511,7 +513,7 @@ parse_references (void *ctx,
>      char *ref;
>
>      if (refs == NULL || *refs == '\0')
> -	return;
> +	return NULL;
>
>      while (*refs) {
>  	ref = _parse_message_id (ctx, refs, &refs);
> @@ -519,6 +521,8 @@ parse_references (void *ctx,
>  	if (ref && strcmp (ref, message_id))
>  	    g_hash_table_insert (hash, ref, NULL);
>      }
> +
> +    return ref;
>  }
>
>  notmuch_status_t
> @@ -1365,7 +1369,7 @@ _notmuch_database_generate_doc_id (notmuch_database_t *notmuch)
>      notmuch->last_doc_id++;
>
>      if (notmuch->last_doc_id == 0)
> -	INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");
> +	INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");
>
>      return notmuch->last_doc_id;
>  }
> @@ -1509,7 +1513,7 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
>  					   const char **thread_id)
>  {
>      GHashTable *parents = NULL;
> -    const char *refs, *in_reply_to, *in_reply_to_message_id;
> +    const char *refs, *in_reply_to, *in_reply_to_message_id, *last_ref_message_id;
>      GList *l, *keys = NULL;
>      notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
>
> @@ -1517,21 +1521,31 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
>  				     _my_talloc_free_for_g_hash, NULL);
>
>      refs = notmuch_message_file_get_header (message_file, "references");
> -    parse_references (message, notmuch_message_get_message_id (message),
> -		      parents, refs);
> +    last_ref_message_id = parse_references (message,
> +					    notmuch_message_get_message_id (message),
> +					    parents, refs);
>
>      in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to");
>      parse_references (message, notmuch_message_get_message_id (message),
>  		      parents, in_reply_to);
>
> -    /* Carefully avoid adding any self-referential in-reply-to term. */
>      in_reply_to_message_id = _parse_message_id (message, in_reply_to, NULL);
> +    /* If the parent message ID from the Reply-To and References
> +     * headers are different, use the References one.  This is because
> +     * the Reply-To header is more likely to be in an non-standard
> +     * format. */
> +    if (in_reply_to_message_id &&
> +	last_ref_message_id &&
> +	strcmp (last_ref_message_id, in_reply_to_message_id)) {
> +	in_reply_to_message_id = last_ref_message_id;
> +    }
> +    /* Carefully avoid adding any self-referential in-reply-to term. */
>      if (in_reply_to_message_id &&
>  	strcmp (in_reply_to_message_id,
>  		notmuch_message_get_message_id (message)))
>      {
>  	_notmuch_message_add_term (message, "replyto",
> -			     _parse_message_id (message, in_reply_to, NULL));
> +			     in_reply_to_message_id);
>      }
>
>      keys = g_hash_table_get_keys (parents);
> --
> 1.8.1.4
> _______________________________________________
> notmuch mailing list
> notmuch@notmuchmail.org
> http://notmuchmail.org/mailman/listinfo/notmuch

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated
  2013-02-26  9:19 ` Jani Nikula
@ 2013-02-26 19:40   ` Aaron Ecay
  0 siblings, 0 replies; 17+ messages in thread
From: Aaron Ecay @ 2013-02-26 19:40 UTC (permalink / raw)
  To: Jani Nikula, notmuch

2013ko otsailak 26an, Jani Nikula-ek idatzi zuen:
> Hi Aaron, please provide references to a few messages like this. If
> available on the notmuch list an id: reference would be best, but
> otherwise some archive that allows viewing full message headers or
> downloading the full message would be best.

Here’s a message from the notmuch list that has passed through gmane,
which inserts References but not In-Reply-To headers:
id:877h0sa207.fsf@fester.com .  Here’s a link to one with a borked
In-Reply-To header:
http://article.gmane.org/gmane.emacs.orgmode/66616/raw

-- 
Aaron Ecay

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated
  2013-02-25 23:50 [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated Aaron Ecay
  2013-02-26  9:19 ` Jani Nikula
@ 2013-02-28 19:41 ` Jani Nikula
  2013-03-01 17:06   ` Jani Nikula
  1 sibling, 1 reply; 17+ messages in thread
From: Jani Nikula @ 2013-02-28 19:41 UTC (permalink / raw)
  To: Aaron Ecay, notmuch


Hi Aaron -

On Tue, 26 Feb 2013, Aaron Ecay <aaronecay@gmail.com> wrote:
> Presently, the code which finds the parent of a message as it is being
> added to the database assumes that the first Message-ID-like substring
> of the In-Reply-To header is the parent Message ID.  Some mail clients,
> however, put stuff other than the Message-ID of the parent in the
> In-Reply-To header, such as the email address of the sender of the
> parent.  This can fool notmuch.

I think the background is that RFC 822 defines In-Reply-To (and
References too for that matter) as *(phrase / msg-id), while RFC 2822
defines them as 1*msg-id. I'd like something about RFC 822 being
mentioned in the commit message.

The problem in the gmane message you link to in
id:87liaa3luc.fsf@gmail.com is likely related to the FAQ item 05.26 "How
do I fix a bogus In-Reply-To or missing References field?" in the MH FAQ
http://www.newt.com/faq/mh.html.

> The updated algorithm prefers the last Message ID in the References
> header.  The References header lists messages oldest-first, so the last
> Message ID is the parent (RFC2822, p. 24).  The References header is
> also less likely to be in a non-standard
> syntax (http://cr.yp.to/immhf/thread.html,
> http://www.jwz.org/doc/threading.html).  In case the References header
> is not to be found, fall back to the old behavior.
> ---
>
> I especially notice this problem on public mailing lists, where
> certain people's messages always cause an "out-dent" of the threading,
> instead of being nested under whichever message they are replies to.
>
> Technically, putting non-Message-ID crud in the In-Reply-To field is a
> violation of RFC2822, but it appears that in practice the References
> header is respected more often than the In-Reply-To one.
>
>  lib/database.cc | 30 ++++++++++++++++++++++--------
>  1 file changed, 22 insertions(+), 8 deletions(-)
>
> diff --git a/lib/database.cc b/lib/database.cc
> index 91d4329..cbf33ae 100644
> --- a/lib/database.cc
> +++ b/lib/database.cc
> @@ -501,8 +501,10 @@ _parse_message_id (void *ctx, const char *message_id, const char **next)
>   * 'message_id' in the result (to avoid mass confusion when a single
>   * message references itself cyclically---and yes, mail messages are
>   * not infrequent in the wild that do this---don't ask me why).
> + *
> + * Return the last reference parsed.
>  */
> -static void
> +static char *
>  parse_references (void *ctx,
>  		  const char *message_id,
>  		  GHashTable *hash,
> @@ -511,7 +513,7 @@ parse_references (void *ctx,
>      char *ref;
>
>      if (refs == NULL || *refs == '\0')
> -	return;
> +	return NULL;
>
>      while (*refs) {
>  	ref = _parse_message_id (ctx, refs, &refs);
> @@ -519,6 +521,8 @@ parse_references (void *ctx,
>  	if (ref && strcmp (ref, message_id))
>  	    g_hash_table_insert (hash, ref, NULL);
>      }
> +
> +    return ref;

As the comment for the function says, we explicitly avoid including
self-references. I think I'd err on the safe side and return NULL if the
last ref equals message-id.

>  }
>
>  notmuch_status_t
> @@ -1365,7 +1369,7 @@ _notmuch_database_generate_doc_id (notmuch_database_t *notmuch)
>      notmuch->last_doc_id++;
>
>      if (notmuch->last_doc_id == 0)
> -	INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");
> +	INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");

I don't know how you got this non-change hunk here, but please remove
it. :)

>
>      return notmuch->last_doc_id;
>  }
> @@ -1509,7 +1513,7 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
>  					   const char **thread_id)
>  {
>      GHashTable *parents = NULL;
> -    const char *refs, *in_reply_to, *in_reply_to_message_id;
> +    const char *refs, *in_reply_to, *in_reply_to_message_id, *last_ref_message_id;
>      GList *l, *keys = NULL;
>      notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
>
> @@ -1517,21 +1521,31 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
>  				     _my_talloc_free_for_g_hash, NULL);
>
>      refs = notmuch_message_file_get_header (message_file, "references");
> -    parse_references (message, notmuch_message_get_message_id (message),
> -		      parents, refs);
> +    last_ref_message_id = parse_references (message,
> +					    notmuch_message_get_message_id (message),
> +					    parents, refs);
>
>      in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to");
>      parse_references (message, notmuch_message_get_message_id (message),
>  		      parents, in_reply_to);
>
> -    /* Carefully avoid adding any self-referential in-reply-to term. */
>      in_reply_to_message_id = _parse_message_id (message, in_reply_to, NULL);

I wonder if you should reuse your parse_references() change here, so
you'd set in_reply_to_message_id to the last message-id in
In-Reply-To. This might tackle some of the problematic cases directly,
but should still be all right per RFC 2822. I didn't verify how the
parser handles an RFC 2822 violating free form header though.

> +    /* If the parent message ID from the Reply-To and References
> +     * headers are different, use the References one.  This is because
> +     * the Reply-To header is more likely to be in an non-standard
> +     * format. */
> +    if (in_reply_to_message_id &&
> +	last_ref_message_id &&
> +	strcmp (last_ref_message_id, in_reply_to_message_id)) {
> +	in_reply_to_message_id = last_ref_message_id;
> +    }

I suggest adding an else if branch (or revamp the above if condition) to
tackle the missing In-Reply-To header:

    else if (!in_reply_to_message_id && last_ref_message_id) {
        in_reply_to_message_id = last_ref_message_id;
    }

> +    /* Carefully avoid adding any self-referential in-reply-to term. */
>      if (in_reply_to_message_id &&
>  	strcmp (in_reply_to_message_id,
>  		notmuch_message_get_message_id (message)))

If you change parse_references() to be careful about never returning a
self-reference, and set in_reply_to_message_id from there, I think you
can drop the strcmp here. And move the comment to an appropriate place.

Thanks for the patch, I think we should do this. But this is an area
where I think we need to be careful, so another reviewer wouldn't
harm. Some tests for this would be good too, obviously.


BR,
Jani.

>      {
>  	_notmuch_message_add_term (message, "replyto",
> -			     _parse_message_id (message, in_reply_to, NULL));
> +			     in_reply_to_message_id);
>      }
>
>      keys = g_hash_table_get_keys (parents);
> --
> 1.8.1.4
> _______________________________________________
> notmuch mailing list
> notmuch@notmuchmail.org
> http://notmuchmail.org/mailman/listinfo/notmuch

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated
  2013-02-28 19:41 ` Jani Nikula
@ 2013-03-01 17:06   ` Jani Nikula
  2013-03-03 23:46     ` Aaron Ecay
  0 siblings, 1 reply; 17+ messages in thread
From: Jani Nikula @ 2013-03-01 17:06 UTC (permalink / raw)
  To: Aaron Ecay, notmuch

On Thu, 28 Feb 2013, Jani Nikula <jani@nikula.org> wrote:
> Hi Aaron -
>
> On Tue, 26 Feb 2013, Aaron Ecay <aaronecay@gmail.com> wrote:
>> Presently, the code which finds the parent of a message as it is being
>> added to the database assumes that the first Message-ID-like substring
>> of the In-Reply-To header is the parent Message ID.  Some mail clients,
>> however, put stuff other than the Message-ID of the parent in the
>> In-Reply-To header, such as the email address of the sender of the
>> parent.  This can fool notmuch.
>
> I think the background is that RFC 822 defines In-Reply-To (and
> References too for that matter) as *(phrase / msg-id), while RFC 2822
> defines them as 1*msg-id. I'd like something about RFC 822 being
> mentioned in the commit message.
>
> The problem in the gmane message you link to in
> id:87liaa3luc.fsf@gmail.com is likely related to the FAQ item 05.26 "How
> do I fix a bogus In-Reply-To or missing References field?" in the MH FAQ
> http://www.newt.com/faq/mh.html.
>
>> The updated algorithm prefers the last Message ID in the References
>> header.  The References header lists messages oldest-first, so the last
>> Message ID is the parent (RFC2822, p. 24).  The References header is
>> also less likely to be in a non-standard
>> syntax (http://cr.yp.to/immhf/thread.html,
>> http://www.jwz.org/doc/threading.html).  In case the References header
>> is not to be found, fall back to the old behavior.
>> ---
>>
>> I especially notice this problem on public mailing lists, where
>> certain people's messages always cause an "out-dent" of the threading,
>> instead of being nested under whichever message they are replies to.
>>
>> Technically, putting non-Message-ID crud in the In-Reply-To field is a
>> violation of RFC2822, but it appears that in practice the References
>> header is respected more often than the In-Reply-To one.
>>
>>  lib/database.cc | 30 ++++++++++++++++++++++--------
>>  1 file changed, 22 insertions(+), 8 deletions(-)
>>
>> diff --git a/lib/database.cc b/lib/database.cc
>> index 91d4329..cbf33ae 100644
>> --- a/lib/database.cc
>> +++ b/lib/database.cc
>> @@ -501,8 +501,10 @@ _parse_message_id (void *ctx, const char *message_id, const char **next)
>>   * 'message_id' in the result (to avoid mass confusion when a single
>>   * message references itself cyclically---and yes, mail messages are
>>   * not infrequent in the wild that do this---don't ask me why).
>> + *
>> + * Return the last reference parsed.
>>  */
>> -static void
>> +static char *
>>  parse_references (void *ctx,
>>  		  const char *message_id,
>>  		  GHashTable *hash,
>> @@ -511,7 +513,7 @@ parse_references (void *ctx,
>>      char *ref;
>>
>>      if (refs == NULL || *refs == '\0')
>> -	return;
>> +	return NULL;
>>
>>      while (*refs) {
>>  	ref = _parse_message_id (ctx, refs, &refs);
>> @@ -519,6 +521,8 @@ parse_references (void *ctx,
>>  	if (ref && strcmp (ref, message_id))
>>  	    g_hash_table_insert (hash, ref, NULL);
>>      }
>> +
>> +    return ref;
>
> As the comment for the function says, we explicitly avoid including
> self-references. I think I'd err on the safe side and return NULL if the
> last ref equals message-id.
>
>>  }
>>
>>  notmuch_status_t
>> @@ -1365,7 +1369,7 @@ _notmuch_database_generate_doc_id (notmuch_database_t *notmuch)
>>      notmuch->last_doc_id++;
>>
>>      if (notmuch->last_doc_id == 0)
>> -	INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");
>> +	INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");
>
> I don't know how you got this non-change hunk here, but please remove
> it. :)
>
>>
>>      return notmuch->last_doc_id;
>>  }
>> @@ -1509,7 +1513,7 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
>>  					   const char **thread_id)
>>  {
>>      GHashTable *parents = NULL;
>> -    const char *refs, *in_reply_to, *in_reply_to_message_id;
>> +    const char *refs, *in_reply_to, *in_reply_to_message_id, *last_ref_message_id;
>>      GList *l, *keys = NULL;
>>      notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
>>
>> @@ -1517,21 +1521,31 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
>>  				     _my_talloc_free_for_g_hash, NULL);
>>
>>      refs = notmuch_message_file_get_header (message_file, "references");
>> -    parse_references (message, notmuch_message_get_message_id (message),
>> -		      parents, refs);
>> +    last_ref_message_id = parse_references (message,
>> +					    notmuch_message_get_message_id (message),
>> +					    parents, refs);
>>
>>      in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to");
>>      parse_references (message, notmuch_message_get_message_id (message),
>>  		      parents, in_reply_to);
>>
>> -    /* Carefully avoid adding any self-referential in-reply-to term. */
>>      in_reply_to_message_id = _parse_message_id (message, in_reply_to, NULL);
>
> I wonder if you should reuse your parse_references() change here, so
> you'd set in_reply_to_message_id to the last message-id in
> In-Reply-To. This might tackle some of the problematic cases directly,
> but should still be all right per RFC 2822. I didn't verify how the
> parser handles an RFC 2822 violating free form header though.

Strike that based on http://www.jwz.org/doc/threading.html:

"If there are multiple things in In-Reply-To that look like Message-IDs,
only use the first one of them: odds are that the later ones are
actually email addresses, not IDs."

>> +    /* If the parent message ID from the Reply-To and References
>> +     * headers are different, use the References one.  This is because
>> +     * the Reply-To header is more likely to be in an non-standard
>> +     * format. */
>> +    if (in_reply_to_message_id &&
>> +	last_ref_message_id &&
>> +	strcmp (last_ref_message_id, in_reply_to_message_id)) {
>> +	in_reply_to_message_id = last_ref_message_id;
>> +    }

I talked to Austin (CC) about the patch on IRC, and his comment was,
perceptive as always:

 23:38        amdragon   Is the logic in that patch equivalent to always using
                         the last message ID in references unless there is no
                         references header?  Seems like it is, but in a
                         convoluted way.

And that's actually the case, isn't it? To make the code reflect that,
you should use last_ref_message_id, and if that's NULL, fallback to
in_reply_to_message_id.

> I suggest adding an else if branch (or revamp the above if condition) to
> tackle the missing In-Reply-To header:
>
>     else if (!in_reply_to_message_id && last_ref_message_id) {
>         in_reply_to_message_id = last_ref_message_id;
>     }

Strike that, it should be the other way round.

>
>> +    /* Carefully avoid adding any self-referential in-reply-to term. */
>>      if (in_reply_to_message_id &&
>>  	strcmp (in_reply_to_message_id,
>>  		notmuch_message_get_message_id (message)))
>
> If you change parse_references() to be careful about never returning a
> self-reference, and set in_reply_to_message_id from there, I think you
> can drop the strcmp here. And move the comment to an appropriate place.

Strike that.

> Thanks for the patch, I think we should do this. But this is an area
> where I think we need to be careful, so another reviewer wouldn't
> harm. Some tests for this would be good too, obviously.

I got this part right. :)


Jani.

>
>
> BR,
> Jani.
>
>>      {
>>  	_notmuch_message_add_term (message, "replyto",
>> -			     _parse_message_id (message, in_reply_to, NULL));
>> +			     in_reply_to_message_id);
>>      }
>>
>>      keys = g_hash_table_get_keys (parents);
>> --
>> 1.8.1.4
>> _______________________________________________
>> notmuch mailing list
>> notmuch@notmuchmail.org
>> http://notmuchmail.org/mailman/listinfo/notmuch

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated
  2013-03-01 17:06   ` Jani Nikula
@ 2013-03-03 23:46     ` Aaron Ecay
  2013-03-03 23:55       ` Aaron Ecay
  0 siblings, 1 reply; 17+ messages in thread
From: Aaron Ecay @ 2013-03-03 23:46 UTC (permalink / raw)
  To: Jani Nikula, notmuch

Hi Jani,

Thanks to you and Austin for the comments.

2013ko martxoak 1an, Jani Nikula-ek idatzi zuen:
>> I think the background is that RFC 822 defines In-Reply-To (and
>> References too for that matter) as *(phrase / msg-id), while RFC 2822
>> defines them as 1*msg-id. I'd like something about RFC 822 being
>> mentioned in the commit message.
>> 
>> The problem in the gmane message you link to in
>> id:87liaa3luc.fsf@gmail.com is likely related to the FAQ item 05.26
>> "How do I fix a bogus In-Reply-To or missing References field?" in
>> the MH FAQ http://www.newt.com/faq/mh.html.

Likely yes.  But I think notmuch should handle these messages, since
they are seen in the wild (and I don’t think you disagree with me on
this point?)


>> 
>> As the comment for the function says, we explicitly avoid including
>> self-references. I think I'd err on the safe side and return NULL if
>> the last ref equals message-id.

Done.

>> 
>> I don't know how you got this non-change hunk here, but please remove
>> it. :)

That’s what I get for setting my editor to delete trailing whitespace on
save (then not reading outgoing patches carefully).  Fixed.

>> I wonder if you should reuse your parse_references() change here, so
>> you'd set in_reply_to_message_id to the last message-id in
>> In-Reply-To. This might tackle some of the problematic cases
>> directly, but should still be all right per RFC 2822. I didn't verify
>> how the parser handles an RFC 2822 violating free form header though.
> 
> Strike that based on http://www.jwz.org/doc/threading.html:
> 
> "If there are multiple things in In-Reply-To that look like
> Message-IDs, only use the first one of them: odds are that the later
> ones are actually email addresses, not IDs."

Hmm.  I think it’s a toss-up which of multiple quasi-message-ids is the
real one.  In the email message example I linked upthread, it was the
last one that was real.  I decided to use the last one, because it
allows the self-reference checking to be pushed entirely into
parse_references.  If you feel strongly that we should use the first
one, I can change it back.

> I talked to Austin (CC) about the patch on IRC, and his comment was,
> perceptive as always:
> 
>  23:38 amdragon Is the logic in that patch equivalent to always using
> the last message ID in references unless there is no references
> header?  Seems like it is, but in a convoluted way.
> 
> And that's actually the case, isn't it? To make the code reflect that,
> you should use last_ref_message_id, and if that's NULL, fallback to
> in_reply_to_message_id.

Yes.  Fixed.

> 
>> I suggest adding an else if branch (or revamp the above if condition)
>> to tackle the missing In-Reply-To header:
>> 
>> else if (!in_reply_to_message_id && last_ref_message_id) {
>> in_reply_to_message_id = last_ref_message_id; }
> 
> Strike that, it should be the other way round.

Now that the self-reference check is in parse_references, the
conditional is much simpler.

One additional change I made in this version was to factor out 3 calls
to “notmuch_message_get_message_id (message)” into a variable inside the
_notmuch_database_link_message_to_parents function, for a small boost to
readability (and perhaps speed, depending on how clever the compiler is
I guess).

I also added tests – those are the first of two patches that will follow
this email, the second being the code to make them pass.

-- 
Aaron Ecay

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated
  2013-03-03 23:46     ` Aaron Ecay
@ 2013-03-03 23:55       ` Aaron Ecay
  2013-03-04  2:45         ` David Bremner
  2013-03-04  5:59         ` [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated Tomi Ollila
  0 siblings, 2 replies; 17+ messages in thread
From: Aaron Ecay @ 2013-03-03 23:55 UTC (permalink / raw)
  To: Jani Nikula, notmuch

[-- Attachment #1: Type: text/plain, Size: 254 bytes --]

git send-email is mad about lines >998 characters in the test patch, so
I’m sending the patches as attachments to this email.  (Is there a
better way to include the expected output of a notmuch command which
outputs long lines in a test script?)


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-test-add-tests-for-the-handling-of-References-and-In.patch --]
[-- Type: text/x-patch, Size: 7433 bytes --]

From 23836241dd304b98f2a05803fbb5a5a94f563050 Mon Sep 17 00:00:00 2001
From: Aaron Ecay <aaronecay@gmail.com>
Date: Sun, 3 Mar 2013 18:14:07 -0500
Subject: [PATCH 1/2] test: add tests for the handling of References and
 In-Reply-To headers

These tests are known_broken, the following commit fixes them.
---
 test/thread-replies | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)
 create mode 100755 test/thread-replies

diff --git a/test/thread-replies b/test/thread-replies
new file mode 100755
index 0000000..fd11a09
--- /dev/null
+++ b/test/thread-replies
@@ -0,0 +1,55 @@
+#!/usr/bin/env bash
+#
+# Copyright (c) 2013 Aaron Ecay
+#
+
+test_description='test of proper handling of in-reply-to and references headers
+
+This test makes sure that the thread structure in the notmuch database is
+constructed properly, even in the presence of non-RFC-compliant headers'
+
+. ./test-lib.sh
+
+test_begin_subtest "Use References when In-Reply-To is broken"
+test_subtest_known_broken
+add_message '[id]="foo@one.com"' \
+    '[subject]=one'
+add_message '[in-reply-to]="mumble"' \
+    '[references]="<foo@one.com>"' \
+    '[subject]="Re: one"'
+output=$(notmuch show --format=json 'subject:one')
+test_expect_equal "$output" '[[[{"id": "foo@one.com", "match": true, "excluded": false, "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-001", "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"], "headers": {"Subject": "one", "From": "Notmuch Test Suite <test_suite@notmuchmail.org>", "To": "Notmuch Test Suite <test_suite@notmuchmail.org>", "Date": "Fri, 05 Jan 2001 15:43:57 +0000"}, "body": [{"id": 1, "content-type": "text/plain", "content": "This is just a test message (#1)\n"}]}, [[{"id": "msg-002@notmuch-test-suite", "match": true, "excluded": false, "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-002", "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"], "headers": {"Subject": "Re: one", "From": "Notmuch Test Suite <test_suite@notmuchmail.org>", "To": "Notmuch Test Suite <test_suite@notmuchmail.org>", "Date": "Fri, 05 Jan 2001 15:43:57 +0000"}, "body": [{"id": 1, "content-type": "text/plain", "content": "This is just a test message (#2)\n"}]}, []]]]]]'
+
+test_begin_subtest "Prefer References to In-Reply-To"
+test_subtest_known_broken
+add_message '[id]="foo@two.com"' \
+    '[subject]=two'
+add_message '[in-reply-to]="<bar@baz.com>"' \
+    '[references]="<foo@two.com>"' \
+    '[subject]="Re: two"'
+output=$(notmuch show --format=json 'subject:two')
+test_expect_equal "$output" '[[[{"id": "foo@two.com", "match": true, "excluded": false, "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-003", "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"], "headers": {"Subject": "two", "From": "Notmuch Test Suite <test_suite@notmuchmail.org>", "To": "Notmuch Test Suite <test_suite@notmuchmail.org>", "Date": "Fri, 05 Jan 2001 15:43:57 +0000"}, "body": [{"id": 1, "content-type": "text/plain", "content": "This is just a test message (#3)\n"}]}, [[{"id": "msg-004@notmuch-test-suite", "match": true, "excluded": false, "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-004", "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"], "headers": {"Subject": "Re: two", "From": "Notmuch Test Suite <test_suite@notmuchmail.org>", "To": "Notmuch Test Suite <test_suite@notmuchmail.org>", "Date": "Fri, 05 Jan 2001 15:43:57 +0000"}, "body": [{"id": 1, "content-type": "text/plain", "content": "This is just a test message (#4)\n"}]}, []]]]]]'
+
+test_begin_subtest "Use In-Reply-To when no References"
+test_subtest_known_broken
+add_message '[id]="foo@three.com"' \
+    '[subject]="three"'
+add_message '[in-reply-to]="<foo@three.com>"' \
+    '[subject]="Re: three"'
+output=$(notmuch show --format=json 'subject:three')
+test_expect_equal "$output" '[[[{"id": "foo@three.com", "match": true, "excluded": false, "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-005", "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"], "headers": {"Subject": "three", "From": "Notmuch Test Suite <test_suite@notmuchmail.org>", "To": "Notmuch Test Suite <test_suite@notmuchmail.org>", "Date": "Fri, 05 Jan 2001 15:43:57 +0000"}, "body": [{"id": 1, "content-type": "text/plain", "content": "This is just a test message (#5)\n"}]}, [[{"id": "msg-006@notmuch-test-suite", "match": true, "excluded": false, "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-006", "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"], "headers": {"Subject": "Re: three", "From": "Notmuch Test Suite <test_suite@notmuchmail.org>", "To": "Notmuch Test Suite <test_suite@notmuchmail.org>", "Date": "Fri, 05 Jan 2001 15:43:57 +0000"}, "body": [{"id": 1, "content-type": "text/plain", "content": "This is just a test message (#6)\n"}]}, []]]]]]'
+
+test_begin_subtest "Use last Reference"
+test_subtest_known_broken
+add_message '[id]="foo@four.com"' \
+    '[subject]="four"'
+add_message '[id]="bar@four.com"' \
+    '[subject]="not-four"'
+add_message '[in-reply-to]="<baz@four.com>"' \
+    '[references]="<baz@four.com> <foo@four.com>"' \
+    '[subject]="neither"'
+output=$(notmuch show --format=json 'subject:four')
+test_expect_equal "$output" '[[[{"id": "foo@four.com", "match": true, "excluded": false, "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-007", "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"], "headers": {"Subject": "four", "From": "Notmuch Test Suite <test_suite@notmuchmail.org>", "To": "Notmuch Test Suite <test_suite@notmuchmail.org>", "Date": "Fri, 05 Jan 2001 15:43:57 +0000"}, "body": [{"id": 1, "content-type": "text/plain", "content": "This is just a test message (#7)\n"}]}, [[{"id": "msg-009@notmuch-test-suite", "match": false, "excluded": false, "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-009", "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"], "headers": {"Subject": "neither", "From": "Notmuch Test Suite <test_suite@notmuchmail.org>", "To": "Notmuch Test Suite <test_suite@notmuchmail.org>", "Date": "Fri, 05 Jan 2001 15:43:57 +0000"}, "body": [{"id": 1, "content-type": "text/plain", "content": "This is just a test message (#9)\n"}]}, []]]]], [[{"id": "bar@four.com", "match": true, "excluded": false, "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-008", "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"], "headers": {"Subject": "not-four", "From": "Notmuch Test Suite <test_suite@notmuchmail.org>", "To": "Notmuch Test Suite <test_suite@notmuchmail.org>", "Date": "Fri, 05 Jan 2001 15:43:57 +0000"}, "body": [{"id": 1, "content-type": "text/plain", "content": "This is just a test message (#8)\n"}]}, []]]]'
+
+
+test_done
-- 
1.8.1.5


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: 0002-lib-database.cc-change-how-the-parent-of-a-message-i.patch --]
[-- Type: text/x-patch, Size: 9294 bytes --]

From 57739b9722a86ba50ef97ad7d5d21b3e5bc1a977 Mon Sep 17 00:00:00 2001
From: Aaron Ecay <aaronecay@gmail.com>
Date: Mon, 25 Feb 2013 18:46:41 -0500
Subject: [PATCH 2/2] lib/database.cc: change how the parent of a message is
 calculated

Presently, the code which finds the parent of a message as it is being
added to the database assumes that the first Message-ID-like substring
of the In-Reply-To header is the parent Message ID.  Some mail clients,
however, put stuff other than the Message-ID of the parent in the
In-Reply-To header, such as the email address of the sender of the
parent.  This can fool notmuch.

The updated algorithm prefers the last Message ID in the References
header.  The References header lists messages oldest-first, so the last
Message ID is the parent (RFC2822, p. 24).  The References header is
also less likely to be in a non-standard
syntax (http://cr.yp.to/immhf/thread.html,
http://www.jwz.org/doc/threading.html).  In case the References header
is not to be found, fall back to the old behavior.

V2 of this patch, incorporating feedback from Jani and (indirectly)
Austin.
---
 lib/database.cc     | 48 +++++++++++++++++++++++++++++++++---------------
 test/thread-replies |  4 ----
 2 files changed, 33 insertions(+), 19 deletions(-)

diff --git a/lib/database.cc b/lib/database.cc
index 91d4329..52ed618 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -501,8 +501,10 @@ _parse_message_id (void *ctx, const char *message_id, const char **next)
  * 'message_id' in the result (to avoid mass confusion when a single
  * message references itself cyclically---and yes, mail messages are
  * not infrequent in the wild that do this---don't ask me why).
-*/
-static void
+ *
+ * Return the last reference parsed, if it is not equal to message_id.
+ */
+static char *
 parse_references (void *ctx,
 		  const char *message_id,
 		  GHashTable *hash,
@@ -511,7 +513,7 @@ parse_references (void *ctx,
     char *ref;
 
     if (refs == NULL || *refs == '\0')
-	return;
+	return NULL;
 
     while (*refs) {
 	ref = _parse_message_id (ctx, refs, &refs);
@@ -519,6 +521,17 @@ parse_references (void *ctx,
 	if (ref && strcmp (ref, message_id))
 	    g_hash_table_insert (hash, ref, NULL);
     }
+
+    /* The return value of this function is used to add a parent
+     * reference to the database.  We should avoid making a message
+     * its own parent, thus the following check.
+     */
+
+    if (ref && strcmp(ref, message_id)) {
+	return ref;
+    } else {
+	return NULL;
+    }
 }
 
 notmuch_status_t
@@ -1510,28 +1523,33 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
 {
     GHashTable *parents = NULL;
     const char *refs, *in_reply_to, *in_reply_to_message_id;
+    const char *last_ref_message_id, *this_message_id;
     GList *l, *keys = NULL;
     notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
 
     parents = g_hash_table_new_full (g_str_hash, g_str_equal,
 				     _my_talloc_free_for_g_hash, NULL);
+    this_message_id = notmuch_message_get_message_id (message);
 
     refs = notmuch_message_file_get_header (message_file, "references");
-    parse_references (message, notmuch_message_get_message_id (message),
-		      parents, refs);
+    last_ref_message_id = parse_references (message,
+					    this_message_id,
+					    parents, refs);
 
     in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to");
-    parse_references (message, notmuch_message_get_message_id (message),
-		      parents, in_reply_to);
-
-    /* Carefully avoid adding any self-referential in-reply-to term. */
-    in_reply_to_message_id = _parse_message_id (message, in_reply_to, NULL);
-    if (in_reply_to_message_id &&
-	strcmp (in_reply_to_message_id,
-		notmuch_message_get_message_id (message)))
-    {
+    in_reply_to_message_id = parse_references (message,
+					       this_message_id,
+					       parents, in_reply_to);
+
+    /* For the parent of this message, use the last message ID of the
+     * References header, if available.  If not, fall back to the
+     * first message ID in the In-Reply-To header. */
+    if (last_ref_message_id) {
+        _notmuch_message_add_term (message, "replyto",
+                                   last_ref_message_id);
+    } else if (in_reply_to_message_id) {
 	_notmuch_message_add_term (message, "replyto",
-			     _parse_message_id (message, in_reply_to, NULL));
+			     in_reply_to_message_id);
     }
 
     keys = g_hash_table_get_keys (parents);
diff --git a/test/thread-replies b/test/thread-replies
index fd11a09..6dc6143 100755
--- a/test/thread-replies
+++ b/test/thread-replies
@@ -11,7 +11,6 @@ constructed properly, even in the presence of non-RFC-compliant headers'
 . ./test-lib.sh
 
 test_begin_subtest "Use References when In-Reply-To is broken"
-test_subtest_known_broken
 add_message '[id]="foo@one.com"' \
     '[subject]=one'
 add_message '[in-reply-to]="mumble"' \
@@ -21,7 +20,6 @@ output=$(notmuch show --format=json 'subject:one')
 test_expect_equal "$output" '[[[{"id": "foo@one.com", "match": true, "excluded": false, "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-001", "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"], "headers": {"Subject": "one", "From": "Notmuch Test Suite <test_suite@notmuchmail.org>", "To": "Notmuch Test Suite <test_suite@notmuchmail.org>", "Date": "Fri, 05 Jan 2001 15:43:57 +0000"}, "body": [{"id": 1, "content-type": "text/plain", "content": "This is just a test message (#1)\n"}]}, [[{"id": "msg-002@notmuch-test-suite", "match": true, "excluded": false, "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-002", "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"], "headers": {"Subject": "Re: one", "From": "Notmuch Test Suite <test_suite@notmuchmail.org>", "To": "Notmuch Test Suite <test_suite@notmuchmail.org>", "Date": "Fri, 05 Jan 2001 15:43:57 +0000"}, "body": [{"id": 1, "content-type": "text/plain", "content": "This is just a test message (#2)\n"}]}, []]]]]]'
 
 test_begin_subtest "Prefer References to In-Reply-To"
-test_subtest_known_broken
 add_message '[id]="foo@two.com"' \
     '[subject]=two'
 add_message '[in-reply-to]="<bar@baz.com>"' \
@@ -31,7 +29,6 @@ output=$(notmuch show --format=json 'subject:two')
 test_expect_equal "$output" '[[[{"id": "foo@two.com", "match": true, "excluded": false, "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-003", "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"], "headers": {"Subject": "two", "From": "Notmuch Test Suite <test_suite@notmuchmail.org>", "To": "Notmuch Test Suite <test_suite@notmuchmail.org>", "Date": "Fri, 05 Jan 2001 15:43:57 +0000"}, "body": [{"id": 1, "content-type": "text/plain", "content": "This is just a test message (#3)\n"}]}, [[{"id": "msg-004@notmuch-test-suite", "match": true, "excluded": false, "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-004", "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"], "headers": {"Subject": "Re: two", "From": "Notmuch Test Suite <test_suite@notmuchmail.org>", "To": "Notmuch Test Suite <test_suite@notmuchmail.org>", "Date": "Fri, 05 Jan 2001 15:43:57 +0000"}, "body": [{"id": 1, "content-type": "text/plain", "content": "This is just a test message (#4)\n"}]}, []]]]]]'
 
 test_begin_subtest "Use In-Reply-To when no References"
-test_subtest_known_broken
 add_message '[id]="foo@three.com"' \
     '[subject]="three"'
 add_message '[in-reply-to]="<foo@three.com>"' \
@@ -40,7 +37,6 @@ output=$(notmuch show --format=json 'subject:three')
 test_expect_equal "$output" '[[[{"id": "foo@three.com", "match": true, "excluded": false, "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-005", "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"], "headers": {"Subject": "three", "From": "Notmuch Test Suite <test_suite@notmuchmail.org>", "To": "Notmuch Test Suite <test_suite@notmuchmail.org>", "Date": "Fri, 05 Jan 2001 15:43:57 +0000"}, "body": [{"id": 1, "content-type": "text/plain", "content": "This is just a test message (#5)\n"}]}, [[{"id": "msg-006@notmuch-test-suite", "match": true, "excluded": false, "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-006", "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"], "headers": {"Subject": "Re: three", "From": "Notmuch Test Suite <test_suite@notmuchmail.org>", "To": "Notmuch Test Suite <test_suite@notmuchmail.org>", "Date": "Fri, 05 Jan 2001 15:43:57 +0000"}, "body": [{"id": 1, "content-type": "text/plain", "content": "This is just a test message (#6)\n"}]}, []]]]]]'
 
 test_begin_subtest "Use last Reference"
-test_subtest_known_broken
 add_message '[id]="foo@four.com"' \
     '[subject]="four"'
 add_message '[id]="bar@four.com"' \
-- 
1.8.1.5


[-- Attachment #4: Type: text/plain, Size: 17 bytes --]



-- 
Aaron Ecay

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated
  2013-03-03 23:55       ` Aaron Ecay
@ 2013-03-04  2:45         ` David Bremner
  2013-03-06  3:31           ` [PATCH 1/2] test: add tests for the handling of References and In-Reply-To headers Aaron Ecay
  2013-03-04  5:59         ` [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated Tomi Ollila
  1 sibling, 1 reply; 17+ messages in thread
From: David Bremner @ 2013-03-04  2:45 UTC (permalink / raw)
  To: Aaron Ecay, Jani Nikula, notmuch

Aaron Ecay <aaronecay@gmail.com> writes:

> git send-email is mad about lines >998 characters in the test patch, so
> I’m sending the patches as attachments to this email.  (Is there a
> better way to include the expected output of a notmuch command which
> outputs long lines in a test script?)

What about pretty printing the json in the source?
test_expect_equal_json canonicalizes both arguments before comparing, so
it should be ok.

d

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated
  2013-03-03 23:55       ` Aaron Ecay
  2013-03-04  2:45         ` David Bremner
@ 2013-03-04  5:59         ` Tomi Ollila
  1 sibling, 0 replies; 17+ messages in thread
From: Tomi Ollila @ 2013-03-04  5:59 UTC (permalink / raw)
  To: Aaron Ecay, Jani Nikula, notmuch

On Mon, Mar 04 2013, Aaron Ecay <aaronecay@gmail.com> wrote:

> git send-email is mad about lines >998 characters in the test patch, so
> I’m sending the patches as attachments to this email.  (Is there a
> better way to include the expected output of a notmuch command which
> outputs long lines in a test script?)

Simplest thing is to split the expected string to multiple lines
and use test_expect_equal_json() for checking.

Tomi

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/2] test: add tests for the handling of References and In-Reply-To headers
  2013-03-04  2:45         ` David Bremner
@ 2013-03-06  3:31           ` Aaron Ecay
  2013-03-06  3:31             ` [PATCH 2/2] lib/database.cc: change how the parent of a message is calculated Aaron Ecay
                               ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Aaron Ecay @ 2013-03-06  3:31 UTC (permalink / raw)
  To: notmuch

These tests are known_broken, the following commit fixes them.
---

Thanks to David and Tomi for pointing out test_expect_equal_json.  In
the process of implementing that, I discovered
notmuch_json_show_sanitize, which I had also not been using.  This
patch fixes the test to use both these conveniences.  The second patch
is not changed substantively, but I am resending it for tidiness.

 test/thread-replies | 144 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 144 insertions(+)
 create mode 100755 test/thread-replies

diff --git a/test/thread-replies b/test/thread-replies
new file mode 100755
index 0000000..a902691
--- /dev/null
+++ b/test/thread-replies
@@ -0,0 +1,144 @@
+#!/usr/bin/env bash
+#
+# Copyright (c) 2013 Aaron Ecay
+#
+
+test_description='test of proper handling of in-reply-to and references headers
+
+This test makes sure that the thread structure in the notmuch database is
+constructed properly, even in the presence of non-RFC-compliant headers'
+
+. ./test-lib.sh
+
+test_begin_subtest "Use References when In-Reply-To is broken"
+test_subtest_known_broken
+add_message '[id]="foo@one.com"' \
+    '[subject]=one'
+add_message '[in-reply-to]="mumble"' \
+    '[references]="<foo@one.com>"' \
+    '[subject]="Re: one"'
+output=$(notmuch show --format=json 'subject:one' | notmuch_json_show_sanitize)
+expected='[[[{"id": "foo@one.com",
+ "match": true,
+ "excluded": false,
+ "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-001",
+ "timestamp": 978709437,
+ "date_relative": "2001-01-05",
+ "tags": ["inbox", "unread"],
+ "headers": {"Subject": "one",
+ "From": "Notmuch Test Suite <test_suite@notmuchmail.org>",
+ "To": "Notmuch Test Suite <test_suite@notmuchmail.org>",
+ "Date": "Fri, 05 Jan 2001 15:43:57 +0000"},
+ "body": [{"id": 1,
+ "content-type": "text/plain",
+ "content": "This is just a test message (#1)\n"}]},
+ [[{"id": "msg-002@notmuch-test-suite",
+ "match": true, "excluded": false,
+ "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-002",
+ "timestamp": 978709437, "date_relative": "2001-01-05",
+ "tags": ["inbox", "unread"], "headers": {"Subject": "Re: one",
+ "From": "Notmuch Test Suite <test_suite@notmuchmail.org>",
+ "To": "Notmuch Test Suite <test_suite@notmuchmail.org>",
+ "Date": "Fri, 05 Jan 2001 15:43:57 +0000"},
+ "body": [{"id": 1, "content-type": "text/plain",
+ "content": "This is just a test message (#2)\n"}]}, []]]]]]'
+expected=`echo "$expected" | notmuch_json_show_sanitize`
+test_expect_equal_json "$output" "$expected"
+
+test_begin_subtest "Prefer References to In-Reply-To"
+test_subtest_known_broken
+add_message '[id]="foo@two.com"' \
+    '[subject]=two'
+add_message '[in-reply-to]="<bar@baz.com>"' \
+    '[references]="<foo@two.com>"' \
+    '[subject]="Re: two"'
+output=$(notmuch show --format=json 'subject:two' | notmuch_json_show_sanitize)
+expected='[[[{"id": "foo@two.com",
+ "match": true, "excluded": false,
+ "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-003",
+ "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"],
+ "headers": {"Subject": "two",
+ "From": "Notmuch Test Suite <test_suite@notmuchmail.org>",
+ "To": "Notmuch Test Suite <test_suite@notmuchmail.org>",
+ "Date": "Fri, 05 Jan 2001 15:43:57 +0000"},
+ "body": [{"id": 1, "content-type": "text/plain",
+ "content": "This is just a test message (#3)\n"}]},
+ [[{"id": "msg-004@notmuch-test-suite", "match": true, "excluded": false,
+ "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-004",
+ "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"],
+ "headers": {"Subject": "Re: two",
+ "From": "Notmuch Test Suite <test_suite@notmuchmail.org>",
+ "To": "Notmuch Test Suite <test_suite@notmuchmail.org>",
+ "Date": "Fri, 05 Jan 2001 15:43:57 +0000"},
+ "body": [{"id": 1,
+ "content-type": "text/plain", "content": "This is just a test message (#4)\n"}]},
+ []]]]]]'
+expected=`echo "$expected" | notmuch_json_show_sanitize`
+test_expect_equal_json "$output" "$expected"
+
+test_begin_subtest "Use In-Reply-To when no References"
+test_subtest_known_broken
+add_message '[id]="foo@three.com"' \
+    '[subject]="three"'
+add_message '[in-reply-to]="<foo@three.com>"' \
+    '[subject]="Re: three"'
+output=$(notmuch show --format=json 'subject:three' | notmuch_json_show_sanitize)
+expected='[[[{"id": "foo@three.com", "match": true, "excluded": false,
+ "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-005",
+ "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"],
+ "headers": {"Subject": "three",
+ "From": "Notmuch Test Suite <test_suite@notmuchmail.org>",
+ "To": "Notmuch Test Suite <test_suite@notmuchmail.org>",
+ "Date": "Fri, 05 Jan 2001 15:43:57 +0000"}, "body": [{"id": 1,
+ "content-type": "text/plain", "content": "This is just a test message (#5)\n"}]},
+ [[{"id": "msg-006@notmuch-test-suite", "match": true, "excluded": false,
+ "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-006",
+ "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"],
+ "headers": {"Subject": "Re: three",
+ "From": "Notmuch Test Suite <test_suite@notmuchmail.org>",
+ "To": "Notmuch Test Suite <test_suite@notmuchmail.org>",
+ "Date": "Fri, 05 Jan 2001 15:43:57 +0000"}, "body": [{"id": 1,
+ "content-type": "text/plain", "content": "This is just a test message (#6)\n"}]},
+ []]]]]]'
+expected=`echo "$expected" | notmuch_json_show_sanitize`
+test_expect_equal_json "$output" "$expected"
+
+test_begin_subtest "Use last Reference"
+test_subtest_known_broken
+add_message '[id]="foo@four.com"' \
+    '[subject]="four"'
+add_message '[id]="bar@four.com"' \
+    '[subject]="not-four"'
+add_message '[in-reply-to]="<baz@four.com>"' \
+    '[references]="<baz@four.com> <foo@four.com>"' \
+    '[subject]="neither"'
+output=$(notmuch show --format=json 'subject:four' | notmuch_json_show_sanitize)
+expected='[[[{"id": "foo@four.com", "match": true, "excluded": false,
+ "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-007",
+ "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"],
+ "headers": {"Subject": "four",
+ "From": "Notmuch Test Suite <test_suite@notmuchmail.org>",
+ "To": "Notmuch Test Suite <test_suite@notmuchmail.org>",
+ "Date": "Fri, 05 Jan 2001 15:43:57 +0000"}, "body": [{"id": 1,
+ "content-type": "text/plain", "content": "This is just a test message (#7)\n"}]},
+ [[{"id": "msg-009@notmuch-test-suite", "match": false, "excluded": false,
+ "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-009",
+ "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"],
+ "headers": {"Subject": "neither",
+ "From": "Notmuch Test Suite <test_suite@notmuchmail.org>",
+ "To": "Notmuch Test Suite <test_suite@notmuchmail.org>",
+ "Date": "Fri, 05 Jan 2001 15:43:57 +0000"}, "body": [{"id": 1,
+ "content-type": "text/plain", "content": "This is just a test message (#9)\n"}]},
+ []]]]], [[{"id": "bar@four.com", "match": true, "excluded": false,
+ "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-008",
+ "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"],
+ "headers": {"Subject": "not-four",
+ "From": "Notmuch Test Suite <test_suite@notmuchmail.org>",
+ "To": "Notmuch Test Suite <test_suite@notmuchmail.org>",
+ "Date": "Fri, 05 Jan 2001 15:43:57 +0000"}, "body": [{"id": 1,
+ "content-type": "text/plain", "content": "This is just a test message (#8)\n"}]}, []]]]'
+expected=`echo "$expected" | notmuch_json_show_sanitize`
+test_expect_equal_json "$output" "$expected"
+
+
+test_done
-- 
1.8.1.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/2] lib/database.cc: change how the parent of a message is calculated
  2013-03-06  3:31           ` [PATCH 1/2] test: add tests for the handling of References and In-Reply-To headers Aaron Ecay
@ 2013-03-06  3:31             ` Aaron Ecay
  2013-05-04 14:10               ` David Bremner
                                 ` (2 more replies)
  2013-05-04 14:01             ` [PATCH 1/2] test: add tests for the handling of References and In-Reply-To headers David Bremner
                               ` (2 subsequent siblings)
  3 siblings, 3 replies; 17+ messages in thread
From: Aaron Ecay @ 2013-03-06  3:31 UTC (permalink / raw)
  To: notmuch

Presently, the code which finds the parent of a message as it is being
added to the database assumes that the first Message-ID-like substring
of the In-Reply-To header is the parent Message ID.  Some mail clients,
however, put stuff other than the Message-ID of the parent in the
In-Reply-To header, such as the email address of the sender of the
parent.  This can fool notmuch.

The updated algorithm prefers the last Message ID in the References
header.  The References header lists messages oldest-first, so the last
Message ID is the parent (RFC2822, p. 24).  The References header is
also less likely to be in a non-standard
syntax (http://cr.yp.to/immhf/thread.html,
http://www.jwz.org/doc/threading.html).  In case the References header
is not to be found, fall back to the old behavior.

V2 of this patch, incorporating feedback from Jani and (indirectly)
Austin.
---
 lib/database.cc     | 48 +++++++++++++++++++++++++++++++++---------------
 test/thread-replies |  4 ----
 2 files changed, 33 insertions(+), 19 deletions(-)

diff --git a/lib/database.cc b/lib/database.cc
index 91d4329..52ed618 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -501,8 +501,10 @@ _parse_message_id (void *ctx, const char *message_id, const char **next)
  * 'message_id' in the result (to avoid mass confusion when a single
  * message references itself cyclically---and yes, mail messages are
  * not infrequent in the wild that do this---don't ask me why).
-*/
-static void
+ *
+ * Return the last reference parsed, if it is not equal to message_id.
+ */
+static char *
 parse_references (void *ctx,
 		  const char *message_id,
 		  GHashTable *hash,
@@ -511,7 +513,7 @@ parse_references (void *ctx,
     char *ref;
 
     if (refs == NULL || *refs == '\0')
-	return;
+	return NULL;
 
     while (*refs) {
 	ref = _parse_message_id (ctx, refs, &refs);
@@ -519,6 +521,17 @@ parse_references (void *ctx,
 	if (ref && strcmp (ref, message_id))
 	    g_hash_table_insert (hash, ref, NULL);
     }
+
+    /* The return value of this function is used to add a parent
+     * reference to the database.  We should avoid making a message
+     * its own parent, thus the following check.
+     */
+
+    if (ref && strcmp(ref, message_id)) {
+	return ref;
+    } else {
+	return NULL;
+    }
 }
 
 notmuch_status_t
@@ -1510,28 +1523,33 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
 {
     GHashTable *parents = NULL;
     const char *refs, *in_reply_to, *in_reply_to_message_id;
+    const char *last_ref_message_id, *this_message_id;
     GList *l, *keys = NULL;
     notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
 
     parents = g_hash_table_new_full (g_str_hash, g_str_equal,
 				     _my_talloc_free_for_g_hash, NULL);
+    this_message_id = notmuch_message_get_message_id (message);
 
     refs = notmuch_message_file_get_header (message_file, "references");
-    parse_references (message, notmuch_message_get_message_id (message),
-		      parents, refs);
+    last_ref_message_id = parse_references (message,
+					    this_message_id,
+					    parents, refs);
 
     in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to");
-    parse_references (message, notmuch_message_get_message_id (message),
-		      parents, in_reply_to);
-
-    /* Carefully avoid adding any self-referential in-reply-to term. */
-    in_reply_to_message_id = _parse_message_id (message, in_reply_to, NULL);
-    if (in_reply_to_message_id &&
-	strcmp (in_reply_to_message_id,
-		notmuch_message_get_message_id (message)))
-    {
+    in_reply_to_message_id = parse_references (message,
+					       this_message_id,
+					       parents, in_reply_to);
+
+    /* For the parent of this message, use the last message ID of the
+     * References header, if available.  If not, fall back to the
+     * first message ID in the In-Reply-To header. */
+    if (last_ref_message_id) {
+        _notmuch_message_add_term (message, "replyto",
+                                   last_ref_message_id);
+    } else if (in_reply_to_message_id) {
 	_notmuch_message_add_term (message, "replyto",
-			     _parse_message_id (message, in_reply_to, NULL));
+			     in_reply_to_message_id);
     }
 
     keys = g_hash_table_get_keys (parents);
diff --git a/test/thread-replies b/test/thread-replies
index a902691..28c2b1f 100755
--- a/test/thread-replies
+++ b/test/thread-replies
@@ -11,7 +11,6 @@ constructed properly, even in the presence of non-RFC-compliant headers'
 . ./test-lib.sh
 
 test_begin_subtest "Use References when In-Reply-To is broken"
-test_subtest_known_broken
 add_message '[id]="foo@one.com"' \
     '[subject]=one'
 add_message '[in-reply-to]="mumble"' \
@@ -46,7 +45,6 @@ expected=`echo "$expected" | notmuch_json_show_sanitize`
 test_expect_equal_json "$output" "$expected"
 
 test_begin_subtest "Prefer References to In-Reply-To"
-test_subtest_known_broken
 add_message '[id]="foo@two.com"' \
     '[subject]=two'
 add_message '[in-reply-to]="<bar@baz.com>"' \
@@ -77,7 +75,6 @@ expected=`echo "$expected" | notmuch_json_show_sanitize`
 test_expect_equal_json "$output" "$expected"
 
 test_begin_subtest "Use In-Reply-To when no References"
-test_subtest_known_broken
 add_message '[id]="foo@three.com"' \
     '[subject]="three"'
 add_message '[in-reply-to]="<foo@three.com>"' \
@@ -104,7 +101,6 @@ expected=`echo "$expected" | notmuch_json_show_sanitize`
 test_expect_equal_json "$output" "$expected"
 
 test_begin_subtest "Use last Reference"
-test_subtest_known_broken
 add_message '[id]="foo@four.com"' \
     '[subject]="four"'
 add_message '[id]="bar@four.com"' \
-- 
1.8.1.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] test: add tests for the handling of References and In-Reply-To headers
  2013-03-06  3:31           ` [PATCH 1/2] test: add tests for the handling of References and In-Reply-To headers Aaron Ecay
  2013-03-06  3:31             ` [PATCH 2/2] lib/database.cc: change how the parent of a message is calculated Aaron Ecay
@ 2013-05-04 14:01             ` David Bremner
  2013-05-04 14:03             ` David Bremner
  2013-05-04 16:23             ` Jani Nikula
  3 siblings, 0 replies; 17+ messages in thread
From: David Bremner @ 2013-05-04 14:01 UTC (permalink / raw)
  To: Aaron Ecay, notmuch

Aaron Ecay <aaronecay@gmail.com> writes:

> These tests are known_broken, the following commit fixes them.
> ---

On current master, I get 

FIXED  Use In-Reply-To when no References

d

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] test: add tests for the handling of References and In-Reply-To headers
  2013-03-06  3:31           ` [PATCH 1/2] test: add tests for the handling of References and In-Reply-To headers Aaron Ecay
  2013-03-06  3:31             ` [PATCH 2/2] lib/database.cc: change how the parent of a message is calculated Aaron Ecay
  2013-05-04 14:01             ` [PATCH 1/2] test: add tests for the handling of References and In-Reply-To headers David Bremner
@ 2013-05-04 14:03             ` David Bremner
  2013-05-04 16:23             ` Jani Nikula
  3 siblings, 0 replies; 17+ messages in thread
From: David Bremner @ 2013-05-04 14:03 UTC (permalink / raw)
  To: Aaron Ecay, notmuch

Aaron Ecay <aaronecay@gmail.com> writes:

> These tests are known_broken, the following commit fixes them.
> ---

Sorry for the extra mails, but this patch also makes test/basic fail
because thread-replies is not added to test/notmuch-test

d

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] lib/database.cc: change how the parent of a message is calculated
  2013-03-06  3:31             ` [PATCH 2/2] lib/database.cc: change how the parent of a message is calculated Aaron Ecay
@ 2013-05-04 14:10               ` David Bremner
  2013-05-04 16:24               ` Jani Nikula
  2013-05-14  0:43               ` David Bremner
  2 siblings, 0 replies; 17+ messages in thread
From: David Bremner @ 2013-05-04 14:10 UTC (permalink / raw)
  To: Aaron Ecay, notmuch

Aaron Ecay <aaronecay@gmail.com> writes:
> +    if (ref && strcmp(ref, message_id)) {
> +	return ref;
> +    } else {
> +	return NULL;
> +    }

As a nit, there should be a space after strcmp. But if nobody has any
other comments, I could amend that.

d

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] test: add tests for the handling of References and In-Reply-To headers
  2013-03-06  3:31           ` [PATCH 1/2] test: add tests for the handling of References and In-Reply-To headers Aaron Ecay
                               ` (2 preceding siblings ...)
  2013-05-04 14:03             ` David Bremner
@ 2013-05-04 16:23             ` Jani Nikula
  3 siblings, 0 replies; 17+ messages in thread
From: Jani Nikula @ 2013-05-04 16:23 UTC (permalink / raw)
  To: Aaron Ecay, notmuch

On Wed, 06 Mar 2013, Aaron Ecay <aaronecay@gmail.com> wrote:
> These tests are known_broken, the following commit fixes them.
> ---
>
> Thanks to David and Tomi for pointing out test_expect_equal_json.  In
> the process of implementing that, I discovered
> notmuch_json_show_sanitize, which I had also not been using.  This
> patch fixes the test to use both these conveniences.  The second patch
> is not changed substantively, but I am resending it for tidiness.
>
>  test/thread-replies | 144 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 144 insertions(+)
>  create mode 100755 test/thread-replies
>
> diff --git a/test/thread-replies b/test/thread-replies
> new file mode 100755
> index 0000000..a902691
> --- /dev/null
> +++ b/test/thread-replies
> @@ -0,0 +1,144 @@
> +#!/usr/bin/env bash
> +#
> +# Copyright (c) 2013 Aaron Ecay
> +#
> +
> +test_description='test of proper handling of in-reply-to and references headers
> +
> +This test makes sure that the thread structure in the notmuch database is
> +constructed properly, even in the presence of non-RFC-compliant headers'

Nitpick, most other tests have one line test descriptions. The second
paragraph could be left as comment.

> +
> +. ./test-lib.sh
> +
> +test_begin_subtest "Use References when In-Reply-To is broken"
> +test_subtest_known_broken
> +add_message '[id]="foo@one.com"' \
> +    '[subject]=one'
> +add_message '[in-reply-to]="mumble"' \
> +    '[references]="<foo@one.com>"' \
> +    '[subject]="Re: one"'
> +output=$(notmuch show --format=json 'subject:one' | notmuch_json_show_sanitize)
> +expected='[[[{"id": "foo@one.com",
> + "match": true,
> + "excluded": false,
> + "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-001",

Please replace the path with:

 "filename": "YYYYY",

Ditto below for all of them.

> + "timestamp": 978709437,
> + "date_relative": "2001-01-05",
> + "tags": ["inbox", "unread"],
> + "headers": {"Subject": "one",
> + "From": "Notmuch Test Suite <test_suite@notmuchmail.org>",
> + "To": "Notmuch Test Suite <test_suite@notmuchmail.org>",
> + "Date": "Fri, 05 Jan 2001 15:43:57 +0000"},
> + "body": [{"id": 1,
> + "content-type": "text/plain",
> + "content": "This is just a test message (#1)\n"}]},
> + [[{"id": "msg-002@notmuch-test-suite",
> + "match": true, "excluded": false,
> + "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-002",
> + "timestamp": 978709437, "date_relative": "2001-01-05",
> + "tags": ["inbox", "unread"], "headers": {"Subject": "Re: one",
> + "From": "Notmuch Test Suite <test_suite@notmuchmail.org>",
> + "To": "Notmuch Test Suite <test_suite@notmuchmail.org>",
> + "Date": "Fri, 05 Jan 2001 15:43:57 +0000"},
> + "body": [{"id": 1, "content-type": "text/plain",
> + "content": "This is just a test message (#2)\n"}]}, []]]]]]'
> +expected=`echo "$expected" | notmuch_json_show_sanitize`
> +test_expect_equal_json "$output" "$expected"
> +
> +test_begin_subtest "Prefer References to In-Reply-To"
> +test_subtest_known_broken
> +add_message '[id]="foo@two.com"' \
> +    '[subject]=two'
> +add_message '[in-reply-to]="<bar@baz.com>"' \
> +    '[references]="<foo@two.com>"' \
> +    '[subject]="Re: two"'
> +output=$(notmuch show --format=json 'subject:two' | notmuch_json_show_sanitize)
> +expected='[[[{"id": "foo@two.com",
> + "match": true, "excluded": false,
> + "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-003",
> + "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"],
> + "headers": {"Subject": "two",
> + "From": "Notmuch Test Suite <test_suite@notmuchmail.org>",
> + "To": "Notmuch Test Suite <test_suite@notmuchmail.org>",
> + "Date": "Fri, 05 Jan 2001 15:43:57 +0000"},
> + "body": [{"id": 1, "content-type": "text/plain",
> + "content": "This is just a test message (#3)\n"}]},
> + [[{"id": "msg-004@notmuch-test-suite", "match": true, "excluded": false,
> + "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-004",
> + "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"],
> + "headers": {"Subject": "Re: two",
> + "From": "Notmuch Test Suite <test_suite@notmuchmail.org>",
> + "To": "Notmuch Test Suite <test_suite@notmuchmail.org>",
> + "Date": "Fri, 05 Jan 2001 15:43:57 +0000"},
> + "body": [{"id": 1,
> + "content-type": "text/plain", "content": "This is just a test message (#4)\n"}]},
> + []]]]]]'
> +expected=`echo "$expected" | notmuch_json_show_sanitize`
> +test_expect_equal_json "$output" "$expected"
> +
> +test_begin_subtest "Use In-Reply-To when no References"
> +test_subtest_known_broken

As David said, this is not broken currently, as In-Reply-To is used
unconditionally. Just drop the broken annotation, the test itself is
useful.

> +add_message '[id]="foo@three.com"' \
> +    '[subject]="three"'
> +add_message '[in-reply-to]="<foo@three.com>"' \
> +    '[subject]="Re: three"'
> +output=$(notmuch show --format=json 'subject:three' | notmuch_json_show_sanitize)
> +expected='[[[{"id": "foo@three.com", "match": true, "excluded": false,
> + "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-005",
> + "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"],
> + "headers": {"Subject": "three",
> + "From": "Notmuch Test Suite <test_suite@notmuchmail.org>",
> + "To": "Notmuch Test Suite <test_suite@notmuchmail.org>",
> + "Date": "Fri, 05 Jan 2001 15:43:57 +0000"}, "body": [{"id": 1,
> + "content-type": "text/plain", "content": "This is just a test message (#5)\n"}]},
> + [[{"id": "msg-006@notmuch-test-suite", "match": true, "excluded": false,
> + "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-006",
> + "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"],
> + "headers": {"Subject": "Re: three",
> + "From": "Notmuch Test Suite <test_suite@notmuchmail.org>",
> + "To": "Notmuch Test Suite <test_suite@notmuchmail.org>",
> + "Date": "Fri, 05 Jan 2001 15:43:57 +0000"}, "body": [{"id": 1,
> + "content-type": "text/plain", "content": "This is just a test message (#6)\n"}]},
> + []]]]]]'
> +expected=`echo "$expected" | notmuch_json_show_sanitize`
> +test_expect_equal_json "$output" "$expected"
> +
> +test_begin_subtest "Use last Reference"
> +test_subtest_known_broken
> +add_message '[id]="foo@four.com"' \
> +    '[subject]="four"'
> +add_message '[id]="bar@four.com"' \
> +    '[subject]="not-four"'
> +add_message '[in-reply-to]="<baz@four.com>"' \
> +    '[references]="<baz@four.com> <foo@four.com>"' \
> +    '[subject]="neither"'
> +output=$(notmuch show --format=json 'subject:four' | notmuch_json_show_sanitize)
> +expected='[[[{"id": "foo@four.com", "match": true, "excluded": false,
> + "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-007",
> + "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"],
> + "headers": {"Subject": "four",
> + "From": "Notmuch Test Suite <test_suite@notmuchmail.org>",
> + "To": "Notmuch Test Suite <test_suite@notmuchmail.org>",
> + "Date": "Fri, 05 Jan 2001 15:43:57 +0000"}, "body": [{"id": 1,
> + "content-type": "text/plain", "content": "This is just a test message (#7)\n"}]},
> + [[{"id": "msg-009@notmuch-test-suite", "match": false, "excluded": false,
> + "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-009",
> + "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"],
> + "headers": {"Subject": "neither",
> + "From": "Notmuch Test Suite <test_suite@notmuchmail.org>",
> + "To": "Notmuch Test Suite <test_suite@notmuchmail.org>",
> + "Date": "Fri, 05 Jan 2001 15:43:57 +0000"}, "body": [{"id": 1,
> + "content-type": "text/plain", "content": "This is just a test message (#9)\n"}]},
> + []]]]], [[{"id": "bar@four.com", "match": true, "excluded": false,
> + "filename": "/home/aecay/development/notmuch/notmuch-git/src/notmuch/test/tmp.thread-replies/mail/msg-008",
> + "timestamp": 978709437, "date_relative": "2001-01-05", "tags": ["inbox", "unread"],
> + "headers": {"Subject": "not-four",
> + "From": "Notmuch Test Suite <test_suite@notmuchmail.org>",
> + "To": "Notmuch Test Suite <test_suite@notmuchmail.org>",
> + "Date": "Fri, 05 Jan 2001 15:43:57 +0000"}, "body": [{"id": 1,
> + "content-type": "text/plain", "content": "This is just a test message (#8)\n"}]}, []]]]'
> +expected=`echo "$expected" | notmuch_json_show_sanitize`
> +test_expect_equal_json "$output" "$expected"
> +
> +
> +test_done
> -- 
> 1.8.1.5
>
> _______________________________________________
> notmuch mailing list
> notmuch@notmuchmail.org
> http://notmuchmail.org/mailman/listinfo/notmuch

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] lib/database.cc: change how the parent of a message is calculated
  2013-03-06  3:31             ` [PATCH 2/2] lib/database.cc: change how the parent of a message is calculated Aaron Ecay
  2013-05-04 14:10               ` David Bremner
@ 2013-05-04 16:24               ` Jani Nikula
  2013-05-14  0:43               ` David Bremner
  2 siblings, 0 replies; 17+ messages in thread
From: Jani Nikula @ 2013-05-04 16:24 UTC (permalink / raw)
  To: Aaron Ecay, notmuch


LGTM

On Wed, 06 Mar 2013, Aaron Ecay <aaronecay@gmail.com> wrote:
> Presently, the code which finds the parent of a message as it is being
> added to the database assumes that the first Message-ID-like substring
> of the In-Reply-To header is the parent Message ID.  Some mail clients,
> however, put stuff other than the Message-ID of the parent in the
> In-Reply-To header, such as the email address of the sender of the
> parent.  This can fool notmuch.
>
> The updated algorithm prefers the last Message ID in the References
> header.  The References header lists messages oldest-first, so the last
> Message ID is the parent (RFC2822, p. 24).  The References header is
> also less likely to be in a non-standard
> syntax (http://cr.yp.to/immhf/thread.html,
> http://www.jwz.org/doc/threading.html).  In case the References header
> is not to be found, fall back to the old behavior.
>
> V2 of this patch, incorporating feedback from Jani and (indirectly)
> Austin.
> ---
>  lib/database.cc     | 48 +++++++++++++++++++++++++++++++++---------------
>  test/thread-replies |  4 ----
>  2 files changed, 33 insertions(+), 19 deletions(-)
>
> diff --git a/lib/database.cc b/lib/database.cc
> index 91d4329..52ed618 100644
> --- a/lib/database.cc
> +++ b/lib/database.cc
> @@ -501,8 +501,10 @@ _parse_message_id (void *ctx, const char *message_id, const char **next)
>   * 'message_id' in the result (to avoid mass confusion when a single
>   * message references itself cyclically---and yes, mail messages are
>   * not infrequent in the wild that do this---don't ask me why).
> -*/
> -static void
> + *
> + * Return the last reference parsed, if it is not equal to message_id.
> + */
> +static char *
>  parse_references (void *ctx,
>  		  const char *message_id,
>  		  GHashTable *hash,
> @@ -511,7 +513,7 @@ parse_references (void *ctx,
>      char *ref;
>  
>      if (refs == NULL || *refs == '\0')
> -	return;
> +	return NULL;
>  
>      while (*refs) {
>  	ref = _parse_message_id (ctx, refs, &refs);
> @@ -519,6 +521,17 @@ parse_references (void *ctx,
>  	if (ref && strcmp (ref, message_id))
>  	    g_hash_table_insert (hash, ref, NULL);
>      }
> +
> +    /* The return value of this function is used to add a parent
> +     * reference to the database.  We should avoid making a message
> +     * its own parent, thus the following check.
> +     */
> +
> +    if (ref && strcmp(ref, message_id)) {
> +	return ref;
> +    } else {
> +	return NULL;
> +    }
>  }
>  
>  notmuch_status_t
> @@ -1510,28 +1523,33 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
>  {
>      GHashTable *parents = NULL;
>      const char *refs, *in_reply_to, *in_reply_to_message_id;
> +    const char *last_ref_message_id, *this_message_id;
>      GList *l, *keys = NULL;
>      notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
>  
>      parents = g_hash_table_new_full (g_str_hash, g_str_equal,
>  				     _my_talloc_free_for_g_hash, NULL);
> +    this_message_id = notmuch_message_get_message_id (message);
>  
>      refs = notmuch_message_file_get_header (message_file, "references");
> -    parse_references (message, notmuch_message_get_message_id (message),
> -		      parents, refs);
> +    last_ref_message_id = parse_references (message,
> +					    this_message_id,
> +					    parents, refs);
>  
>      in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to");
> -    parse_references (message, notmuch_message_get_message_id (message),
> -		      parents, in_reply_to);
> -
> -    /* Carefully avoid adding any self-referential in-reply-to term. */
> -    in_reply_to_message_id = _parse_message_id (message, in_reply_to, NULL);
> -    if (in_reply_to_message_id &&
> -	strcmp (in_reply_to_message_id,
> -		notmuch_message_get_message_id (message)))
> -    {
> +    in_reply_to_message_id = parse_references (message,
> +					       this_message_id,
> +					       parents, in_reply_to);
> +
> +    /* For the parent of this message, use the last message ID of the
> +     * References header, if available.  If not, fall back to the
> +     * first message ID in the In-Reply-To header. */
> +    if (last_ref_message_id) {
> +        _notmuch_message_add_term (message, "replyto",
> +                                   last_ref_message_id);
> +    } else if (in_reply_to_message_id) {
>  	_notmuch_message_add_term (message, "replyto",
> -			     _parse_message_id (message, in_reply_to, NULL));
> +			     in_reply_to_message_id);
>      }
>  
>      keys = g_hash_table_get_keys (parents);
> diff --git a/test/thread-replies b/test/thread-replies
> index a902691..28c2b1f 100755
> --- a/test/thread-replies
> +++ b/test/thread-replies
> @@ -11,7 +11,6 @@ constructed properly, even in the presence of non-RFC-compliant headers'
>  . ./test-lib.sh
>  
>  test_begin_subtest "Use References when In-Reply-To is broken"
> -test_subtest_known_broken
>  add_message '[id]="foo@one.com"' \
>      '[subject]=one'
>  add_message '[in-reply-to]="mumble"' \
> @@ -46,7 +45,6 @@ expected=`echo "$expected" | notmuch_json_show_sanitize`
>  test_expect_equal_json "$output" "$expected"
>  
>  test_begin_subtest "Prefer References to In-Reply-To"
> -test_subtest_known_broken
>  add_message '[id]="foo@two.com"' \
>      '[subject]=two'
>  add_message '[in-reply-to]="<bar@baz.com>"' \
> @@ -77,7 +75,6 @@ expected=`echo "$expected" | notmuch_json_show_sanitize`
>  test_expect_equal_json "$output" "$expected"
>  
>  test_begin_subtest "Use In-Reply-To when no References"
> -test_subtest_known_broken
>  add_message '[id]="foo@three.com"' \
>      '[subject]="three"'
>  add_message '[in-reply-to]="<foo@three.com>"' \
> @@ -104,7 +101,6 @@ expected=`echo "$expected" | notmuch_json_show_sanitize`
>  test_expect_equal_json "$output" "$expected"
>  
>  test_begin_subtest "Use last Reference"
> -test_subtest_known_broken
>  add_message '[id]="foo@four.com"' \
>      '[subject]="four"'
>  add_message '[id]="bar@four.com"' \
> -- 
> 1.8.1.5
>
> _______________________________________________
> notmuch mailing list
> notmuch@notmuchmail.org
> http://notmuchmail.org/mailman/listinfo/notmuch

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] lib/database.cc: change how the parent of a message is calculated
  2013-03-06  3:31             ` [PATCH 2/2] lib/database.cc: change how the parent of a message is calculated Aaron Ecay
  2013-05-04 14:10               ` David Bremner
  2013-05-04 16:24               ` Jani Nikula
@ 2013-05-14  0:43               ` David Bremner
  2 siblings, 0 replies; 17+ messages in thread
From: David Bremner @ 2013-05-14  0:43 UTC (permalink / raw)
  To: Aaron Ecay, notmuch

Aaron Ecay <aaronecay@gmail.com> writes:

> Presently, the code which finds the parent of a message as it is being
> added to the database assumes that the first Message-ID-like substring
> of the In-Reply-To header is the parent Message ID.  Some mail
> clients, however, put stuff other than the Message-ID of the parent in
> the In-Reply-To header, such as the email address of the sender of the
> parent.  This can fool notmuch.
>

Pushed both, with a certain amount of hacking on the test patch.

d

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2013-05-14  0:43 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-02-25 23:50 [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated Aaron Ecay
2013-02-26  9:19 ` Jani Nikula
2013-02-26 19:40   ` Aaron Ecay
2013-02-28 19:41 ` Jani Nikula
2013-03-01 17:06   ` Jani Nikula
2013-03-03 23:46     ` Aaron Ecay
2013-03-03 23:55       ` Aaron Ecay
2013-03-04  2:45         ` David Bremner
2013-03-06  3:31           ` [PATCH 1/2] test: add tests for the handling of References and In-Reply-To headers Aaron Ecay
2013-03-06  3:31             ` [PATCH 2/2] lib/database.cc: change how the parent of a message is calculated Aaron Ecay
2013-05-04 14:10               ` David Bremner
2013-05-04 16:24               ` Jani Nikula
2013-05-14  0:43               ` David Bremner
2013-05-04 14:01             ` [PATCH 1/2] test: add tests for the handling of References and In-Reply-To headers David Bremner
2013-05-04 14:03             ` David Bremner
2013-05-04 16:23             ` Jani Nikula
2013-03-04  5:59         ` [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated Tomi Ollila

Code repositories for project(s) associated with this public inbox

	https://yhetil.org/notmuch.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).