From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by olra.theworths.org (Postfix) with ESMTP id 16050431FBD for ; Sun, 20 Apr 2014 05:04:09 -0700 (PDT) X-Virus-Scanned: Debian amavisd-new at olra.theworths.org X-Spam-Flag: NO X-Spam-Score: 3.001 X-Spam-Level: *** X-Spam-Status: No, score=3.001 tagged_above=-999 required=5 tests=[DKIM_ADSP_CUSTOM_MED=0.001, FREEMAIL_FROM=0.001, FREEMAIL_REPLY=2.499, NML_ADSP_CUSTOM_MED=1.2, RCVD_IN_DNSWL_LOW=-0.7] autolearn=disabled Received: from olra.theworths.org ([127.0.0.1]) by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id Z8NZUlzkc1g8 for ; Sun, 20 Apr 2014 05:04:05 -0700 (PDT) Received: from mail2.qmul.ac.uk (mail2.qmul.ac.uk [138.37.6.6]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by olra.theworths.org (Postfix) with ESMTPS id 71C30431FBC for ; Sun, 20 Apr 2014 05:04:05 -0700 (PDT) Received: from smtp.qmul.ac.uk ([138.37.6.40]) by mail2.qmul.ac.uk with esmtp (Exim 4.71) (envelope-from ) id 1WbqTK-0001Zg-HT; Sun, 20 Apr 2014 13:03:59 +0100 Received: from 188.29.253.189.threembb.co.uk ([188.29.253.189] helo=localhost) by smtp.qmul.ac.uk with esmtpsa (TLSv1:AES128-SHA:128) (Exim 4.71) (envelope-from ) id 1WbqTJ-0008FP-52; Sun, 20 Apr 2014 13:03:46 +0100 From: Mark Walters To: Carl Worth , David Bremner , notmuch Subject: Re: [RFC PATCH] Re: excessive thread fusing In-Reply-To: <87oazwjq1e.fsf@yoom.home.cworth.org> References: <87ioq5mrbz.fsf@maritornes.cs.unb.ca> <87fvl8mpzj.fsf@qmul.ac.uk> <87oazwjq1e.fsf@yoom.home.cworth.org> User-Agent: Notmuch/0.15.2+615~g78e3a93 (http://notmuchmail.org) Emacs/23.4.1 (x86_64-pc-linux-gnu) Date: Sun, 20 Apr 2014 13:03:34 +0100 Message-ID: <877g6kmcmh.fsf@qmul.ac.uk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Sender-Host-Address: 188.29.253.189 X-QM-Geographic: According to ripencc, this message was delivered by a machine in Britain (UK) (GB). X-QM-SPAM-Info: Sender has good ham record. :) X-QM-Body-MD5: 38660748c1edc2097058a623755a6d01 (of first 20000 bytes) X-SpamAssassin-Score: 1.0 X-SpamAssassin-SpamBar: + X-SpamAssassin-Report: The QM spam filters have analysed this message to determine if it is spam. We require at least 5.0 points to mark a message as spam. This message scored 1.0 points. Summary of the scoring: * 0.0 FREEMAIL_FROM Sender email is commonly abused enduser mail provider * (markwalters1009[at]gmail.com) * 0.0 AC_HTML_NONSENSE_TAGS RAW: Many consecutive multi-letter HTML tags, * likely nonsense/spam * 1.0 FREEMAIL_REPLY From and body contain different freemails X-QM-Scan-Virus: ClamAV says the message is clean X-BeenThere: notmuch@notmuchmail.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: "Use and development of the notmuch mail system." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 20 Apr 2014 12:04:09 -0000 On Sun, 20 Apr 2014, Carl Worth wrote: > Mark Walters writes: >> I have done dome debugging of this. > > Thanks for looking closely, Mark! > >> There is a patch below which fixes this test case but who knows what >> it breaks! Please DO NOT apply unless someone who knows this code says >> it's OK. > > I wrote much of the original code being patched here, so hopefully I > understand it and can say something useful. > > I agree that the patch should not be applied. I don't like to see one > piece of code not trusting another in the same code base. If the > parse_references() function doesn't deal well with a malformed header, > then we should fix it, not step around it. > > Meanwhile, not treating all potential referenced message IDs > consistently could definitely make the notmuch algorithm more fragile > and sensitive to the order of message indexing, etc. So let's not do > that. I agree. This bug first came up in id:874nvcekjk.fsf@qmul.ac.uk; I think that got mostly fixed by cf8aaafbad68 (id:1361836225-17279-1-git-send-email-aaronecay@gmail.com and related thread) so we may want to check whether that change is still wanted if we fix the actual bug. > Instead, let's track down and fix the actual bug. > > Thanks for the idea of using two-digit names for these messages. That > makes it much easier to inspect the relevant headers. > > Below, I've grepped out the actual References and In-Reply-To headers > From the messages, and then simply substituted minimal, and > easy-to-understand values for the message IDs. > > With these minimally modified headers, it's easy to manually inspect the > relationships and see that messages 17 and 18 belong in one thread, and > messages 32-52 belong in a separate thread. > > It's also quite easy to see the potential source of the bug. The > In-Reply-To headers for messages 18, 32, and 52 all share a common > string (an email address) formatted to look like a message-id, > "". If notmuch looks at those headers, and treats > that string as a message-id, then all of theses messages will be > connected into a single thread. > > And since that's the reported behavior, it seems likely that > "" is the cause of this bug. > >> I put some debug stuff in _notmuch_database_link_message_to_parents and >> I think that the problem comes from the call to parse_references on line >> 1767 which adds the malformed in-reply-to header to the hash table, so >> this malformed line gets added as a potential parent. > > Am I correct that your debugging showed that "" is > being added to the hash table? Yes that is correct. > My inspection of _parse_references() and parse_message_id() suggests > that that's exactly what notmuch is doing, (treating both of the > angle-bracketed portions ("" as well as the actual > message-ID, "" or "" or "") as message IDs. > > So it seems like we need a new _parse_in_reply_to() function to use in > place of _parse_references() and the new function will need a better > heuristic for dealing with the unpredictability of In-Reply-To. > > The only real reason that we are trying to grab multiple message ID > values from an In-Reply-To header is that RFC 2822 explicitly allows > that, (to support a message simultaneously replying to multiple > messages). I don't believe that that's common, but we might as well > support it. At the same time, RFC 2822 also explicitly specifies that > the In-Reply-To header will consist of nothing but message IDs. > > So perhaps the heuristic here could be to notice any characters outside > of angle brackets, (like "Message" in the headers below), and in that > case go to a strictly "not RFC 2822" mode and look for exactly one > message ID. At that point, JWZ would recommend "the first <>-bracketed > text found therein", but that would give precisely the wrong answer in > this particular case. Here the correct Message ID appears in the last > <>-bracketed text. I have not surveyed a large email corpus to determine > how often "last <>-bracketed text" would fail as a heuristic. > > Another idea would be to trigger specifically on common forms. Judging > From the samples in this particular thread, it seems like a workable > heuristic would be: > > If the In-Reply-To header begins with '<': > > Parse that initial portion as a message ID > > Else if it ends with '>': > > Parse that final portion as a message ID > > Else > > Ignore this garbage-valued header. > > That's probably the best and most reliably thing to do here. > > Does anyone have any better ideas? Is there a case coming before all the above: if the In-Reply-To header is correctly formed then parse as we do currently? (You sort of suggest so above but I just wanted to check) >> As a clear example that I don't understand this code I don't know why >> this no longer causes a problem if message 17 gets added too. > > I wanted to test my own knowledge of the code to see if I could explain > this. But I didn't precisely follow your explanation of the behavior you > saw. In cases (1) and (2) of your description, what order are you using > to "add all messages" or "add all apart from 52"? I just untarred the tar file David posted. Then the messages get added in the following order: 45 39 47 33 31 18 42 51 41 46 37 44 35 36 34 49 40 48 38 52 17 50 32 43 which is the same as the order in the tar file. (I think this is notmuch using inode based sort as it has not seen the directory before) In Case 2 I started with a fresh untar; then moved message 52 out of the Maildir; ran notmuch new, then moved message 52 back into the the Maildir tree and ran notmuch new again. > Then, for cases (3) and (4), what is done before adding the messages > mentioned in these cases? Add all other messages? Again, in what order? In case 3 I started with a fresh untar. Moved all the message except 18 elsewhere. ran notmuch new. moved message 52 back and ran notmuch new. In have checked case 4 carefully adding messages 1 at a time and running notmuch new between each addition. If I add 18 17 52 I get 2 threads. If I add 17 18 52 I get 1 thread > I haven't tracked through all the logic of the existing algorithm for > this case. But I don't like hearing that notmuch constructs different > threads for the same messages presented in different orders. This sounds > like a bug separate from what we've discussed above. I agree but I don't know the logic well enough to be sure. Best wishes Mark > > -Carl > > 18:References: > 32:References: > 33:References: > 34:References: > 35:References: > 36:References: > 37:References: > 38:References: > 39:References: > 40:References: > 41:References: > 42:References: > 43:References: > 44:References: > 45:References: > 46:References: > 47:References: > 48:References: > 49:References: > 50:References: > 51:References: > 52:References: > > 18:In-reply-to: Message from Eric S Fraga of "Tue, 01 Mar 2011 15:25:38 GMT." > 32:In-Reply-To: Message from Eric S Fraga of "Thu, 10 Mar 2011 21:00:16 GMT." > 33:In-Reply-To: (Nick Dokos's message of "Thu, 10 Mar 2011 18:06:33 -0500") > 34:In-Reply-To: > 35:In-Reply-To: > 36:In-Reply-To: (Carsten Dominik's message of "Sun, 13 Mar 2011 08:39:13 +0100") > 37:In-Reply-To: > 38:In-Reply-To: (Carsten Dominik's message of "Mon, 14 Mar 2011 08:40:33 +0100") > 39:In-Reply-To: (Nick Dokos's message of "Thu, 10 Mar 2011 18:06:33 -0500") > 40:In-Reply-To: > 41:In-Reply-To: (Carsten Dominik's message of "Fri, 11 Mar 2011 12:36:13 +0100") > 42:In-Reply-To: > 43:In-Reply-To: > 44:In-Reply-To: > 45:In-reply-to: Message from Carsten Dominik of "Fri, 11 Mar 2011 12:36:13 +0100." > 46:In-Reply-To: > 47:In-reply-to: Message from Carsten Dominik of "Mon, 14 Mar 2011 11:21:36 BST." > 48:In-Reply-To: > 49:In-reply-to: Message from Carsten Dominik of "Mon, 14 Mar 2011 18:02:54 BST." > 51:In-Reply-To: > 52:In-reply-to: Message from Eric S Fraga of "Fri, 11 Mar 2011 08:47:58 GMT." > > -- > carl.d.worth@intel.com