From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by olra.theworths.org (Postfix) with ESMTP id D7467431FBC for ; Thu, 28 Feb 2013 11:41:21 -0800 (PST) X-Virus-Scanned: Debian amavisd-new at olra.theworths.org X-Spam-Flag: NO X-Spam-Score: -0.7 X-Spam-Level: X-Spam-Status: No, score=-0.7 tagged_above=-999 required=5 tests=[RCVD_IN_DNSWL_LOW=-0.7] autolearn=disabled Received: from olra.theworths.org ([127.0.0.1]) by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id CVWGnYYK4NWU for ; Thu, 28 Feb 2013 11:41:19 -0800 (PST) Received: from mail-la0-f41.google.com (mail-la0-f41.google.com [209.85.215.41]) (using TLSv1 with cipher RC4-SHA (128/128 bits)) (No client certificate requested) by olra.theworths.org (Postfix) with ESMTPS id DF529431FAF for ; Thu, 28 Feb 2013 11:41:18 -0800 (PST) Received: by mail-la0-f41.google.com with SMTP id fo12so2188940lab.28 for ; Thu, 28 Feb 2013 11:41:16 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=x-received:from:to:subject:in-reply-to:references:user-agent:date :message-id:mime-version:content-type:x-gm-message-state; bh=k9EuZu6+hVJPIbkf59XdmyV9Y+qxB8aX3oOJkHI8feQ=; b=VWH/+NmnDNbbC/KNLHlxOW2KRyxqsbQyDuZFiQSu7AZFXF0/ItApTtMFNw/C3ZpoBd bkoutZiQB6y+a5FXJnfCUIHi6nzvZJBQaYZtvoOewEwtS20ssMgNA8pLYmoQr7JuVPOy j8mjrcOpBb3tauD19VTYAKM78ceaN2lfG4Gf7clmgO7RazGpyZD2kCdoOd2sTr+IwdPA Bfolh7SpgyGycz/heqCn8CEw2/RQCJT5vv+9Gy0FlwMCTLThxXWMJ1WRRbwOIt3iRGuV 9b8fqUfJvnNwM+i4SZ3uj489jAS7W6WsAAeTUqZK/948l1x9VoPIVuZraK5dv+/ltAiI H2JQ== X-Received: by 10.112.13.200 with SMTP id j8mr4057894lbc.68.1362080475852; Thu, 28 Feb 2013 11:41:15 -0800 (PST) Received: from localhost (dsl-hkibrasgw4-50df51-27.dhcp.inet.fi. [80.223.81.27]) by mx.google.com with ESMTPS id fm8sm3283414lbb.17.2013.02.28.11.41.13 (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Thu, 28 Feb 2013 11:41:14 -0800 (PST) From: Jani Nikula To: Aaron Ecay , notmuch@notmuchmail.org Subject: Re: [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated In-Reply-To: <1361836225-17279-1-git-send-email-aaronecay@gmail.com> References: <1361836225-17279-1-git-send-email-aaronecay@gmail.com> User-Agent: Notmuch/0.15.2+33~g98253a3 (http://notmuchmail.org) Emacs/24.2.1 (x86_64-pc-linux-gnu) Date: Thu, 28 Feb 2013 21:41:16 +0200 Message-ID: <87621cteeb.fsf@nikula.org> MIME-Version: 1.0 Content-Type: text/plain X-Gm-Message-State: ALoCoQmRb+br0tqWNRdxc1RyXji58iziwxTGZotvTWLSVA5AvRzDUu7LRxiZyORYP7S9dIcpuQDM X-BeenThere: notmuch@notmuchmail.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: "Use and development of the notmuch mail system." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 28 Feb 2013 19:41:22 -0000 Hi Aaron - On Tue, 26 Feb 2013, Aaron Ecay wrote: > Presently, the code which finds the parent of a message as it is being > added to the database assumes that the first Message-ID-like substring > of the In-Reply-To header is the parent Message ID. Some mail clients, > however, put stuff other than the Message-ID of the parent in the > In-Reply-To header, such as the email address of the sender of the > parent. This can fool notmuch. I think the background is that RFC 822 defines In-Reply-To (and References too for that matter) as *(phrase / msg-id), while RFC 2822 defines them as 1*msg-id. I'd like something about RFC 822 being mentioned in the commit message. The problem in the gmane message you link to in id:87liaa3luc.fsf@gmail.com is likely related to the FAQ item 05.26 "How do I fix a bogus In-Reply-To or missing References field?" in the MH FAQ http://www.newt.com/faq/mh.html. > The updated algorithm prefers the last Message ID in the References > header. The References header lists messages oldest-first, so the last > Message ID is the parent (RFC2822, p. 24). The References header is > also less likely to be in a non-standard > syntax (http://cr.yp.to/immhf/thread.html, > http://www.jwz.org/doc/threading.html). In case the References header > is not to be found, fall back to the old behavior. > --- > > I especially notice this problem on public mailing lists, where > certain people's messages always cause an "out-dent" of the threading, > instead of being nested under whichever message they are replies to. > > Technically, putting non-Message-ID crud in the In-Reply-To field is a > violation of RFC2822, but it appears that in practice the References > header is respected more often than the In-Reply-To one. > > lib/database.cc | 30 ++++++++++++++++++++++-------- > 1 file changed, 22 insertions(+), 8 deletions(-) > > diff --git a/lib/database.cc b/lib/database.cc > index 91d4329..cbf33ae 100644 > --- a/lib/database.cc > +++ b/lib/database.cc > @@ -501,8 +501,10 @@ _parse_message_id (void *ctx, const char *message_id, const char **next) > * 'message_id' in the result (to avoid mass confusion when a single > * message references itself cyclically---and yes, mail messages are > * not infrequent in the wild that do this---don't ask me why). > + * > + * Return the last reference parsed. > */ > -static void > +static char * > parse_references (void *ctx, > const char *message_id, > GHashTable *hash, > @@ -511,7 +513,7 @@ parse_references (void *ctx, > char *ref; > > if (refs == NULL || *refs == '\0') > - return; > + return NULL; > > while (*refs) { > ref = _parse_message_id (ctx, refs, &refs); > @@ -519,6 +521,8 @@ parse_references (void *ctx, > if (ref && strcmp (ref, message_id)) > g_hash_table_insert (hash, ref, NULL); > } > + > + return ref; As the comment for the function says, we explicitly avoid including self-references. I think I'd err on the safe side and return NULL if the last ref equals message-id. > } > > notmuch_status_t > @@ -1365,7 +1369,7 @@ _notmuch_database_generate_doc_id (notmuch_database_t *notmuch) > notmuch->last_doc_id++; > > if (notmuch->last_doc_id == 0) > - INTERNAL_ERROR ("Xapian document IDs are exhausted.\n"); > + INTERNAL_ERROR ("Xapian document IDs are exhausted.\n"); I don't know how you got this non-change hunk here, but please remove it. :) > > return notmuch->last_doc_id; > } > @@ -1509,7 +1513,7 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch, > const char **thread_id) > { > GHashTable *parents = NULL; > - const char *refs, *in_reply_to, *in_reply_to_message_id; > + const char *refs, *in_reply_to, *in_reply_to_message_id, *last_ref_message_id; > GList *l, *keys = NULL; > notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS; > > @@ -1517,21 +1521,31 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch, > _my_talloc_free_for_g_hash, NULL); > > refs = notmuch_message_file_get_header (message_file, "references"); > - parse_references (message, notmuch_message_get_message_id (message), > - parents, refs); > + last_ref_message_id = parse_references (message, > + notmuch_message_get_message_id (message), > + parents, refs); > > in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to"); > parse_references (message, notmuch_message_get_message_id (message), > parents, in_reply_to); > > - /* Carefully avoid adding any self-referential in-reply-to term. */ > in_reply_to_message_id = _parse_message_id (message, in_reply_to, NULL); I wonder if you should reuse your parse_references() change here, so you'd set in_reply_to_message_id to the last message-id in In-Reply-To. This might tackle some of the problematic cases directly, but should still be all right per RFC 2822. I didn't verify how the parser handles an RFC 2822 violating free form header though. > + /* If the parent message ID from the Reply-To and References > + * headers are different, use the References one. This is because > + * the Reply-To header is more likely to be in an non-standard > + * format. */ > + if (in_reply_to_message_id && > + last_ref_message_id && > + strcmp (last_ref_message_id, in_reply_to_message_id)) { > + in_reply_to_message_id = last_ref_message_id; > + } I suggest adding an else if branch (or revamp the above if condition) to tackle the missing In-Reply-To header: else if (!in_reply_to_message_id && last_ref_message_id) { in_reply_to_message_id = last_ref_message_id; } > + /* Carefully avoid adding any self-referential in-reply-to term. */ > if (in_reply_to_message_id && > strcmp (in_reply_to_message_id, > notmuch_message_get_message_id (message))) If you change parse_references() to be careful about never returning a self-reference, and set in_reply_to_message_id from there, I think you can drop the strcmp here. And move the comment to an appropriate place. Thanks for the patch, I think we should do this. But this is an area where I think we need to be careful, so another reviewer wouldn't harm. Some tests for this would be good too, obviously. BR, Jani. > { > _notmuch_message_add_term (message, "replyto", > - _parse_message_id (message, in_reply_to, NULL)); > + in_reply_to_message_id); > } > > keys = g_hash_table_get_keys (parents); > -- > 1.8.1.4 > _______________________________________________ > notmuch mailing list > notmuch@notmuchmail.org > http://notmuchmail.org/mailman/listinfo/notmuch