From: David Bremner
To: Carl Worth, Mark Walters, notmuch
Subject: Re: [RFC PATCH] Re: excessive thread fusing
In-Reply-To: <87oazwjq1e.fsf@yoom.home.cworth.org>
References: <87ioq5mrbz.fsf@maritornes.cs.unb.ca> <87fvl8mpzj.fsf@qmul.ac.uk> <87oazwjq1e.fsf@yoom.home.cworth.org>
User-Agent: Notmuch/0.17+202~gb65f328 (http://notmuchmail.org) Emacs/24.3.1 (x86_64-pc-linux-gnu)
Date: Sun, 20 Apr 2014 21:59:26 +0900
Message-ID: <87fvl8upg1.fsf@maritornes.cs.unb.ca>
List-Id: "Use and development of the notmuch mail system."

Carl Worth writes:

> Another idea would be to trigger specifically on common forms.
> Judging from the samples in this particular thread, it seems like a workable
> heuristic would be:
>
>   If the In-Reply-To header begins with '<':
>       Parse that initial portion as a message ID
>   Else if it ends with '>':
>       Parse that final portion as a message ID
>   Else
>       Ignore this garbage-valued header.
>

Using the hacky script below, I scanned my own mail collection of about
300k messages. I can make the following observations:

- I have some RFC-compliant In-Reply-To headers with multiple IDs.

- I have a non-trivial number of Message from $NAME
  of $date

- I didn't see any cases where using the last angle-bracketed thing
  would fail.

- I did see some cases where the header starts with '<' but the
  matching '>' was missing.

- I also noticed some RFC 2047 encoding of In-Reply-To headers.

######################################################################
# hacky script follows
# $1 is the mail directory to scan
dir=$1
echo "Scanning $dir"
tempdir=$(mktemp -d)
echo "Writing to ${tempdir}"
# Extract the In-Reply-To value of every file; -c joins continued header lines.
find "$dir" -type f -exec sh -c 'formail -c -xIn-reply-to < "$1"' sh {} \; \
    > "${tempdir}/ids"
# Normalize whitespace, drop <...> message IDs and collapse (...) comments,
# so that only the unusual leftovers end up in the report.
sed -e 's/\t/ /' -e 's/  */ /g' -e 's/<[^ ]*>//g' -e 's/(.*)/(comment)/' \
    < "${tempdir}/ids" | sort | uniq | tee "${tempdir}/report"
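
For comparison, the heuristic quoted at the top could be sketched in POSIX shell roughly as follows. This is only an illustration, not the proposed notmuch patch: the function name parse_in_reply_to is made up, the header value is assumed to have had its surrounding whitespace trimmed already, and a value that starts with '<' but has no matching '>' is treated here as garbage, which is one possible reading of the heuristic.

```shell
# Hypothetical helper, not part of notmuch: apply the quoted heuristic
# to one In-Reply-To header value and print the message ID it selects
# (without the angle brackets). Returns non-zero for garbage values.
parse_in_reply_to() {
    value=$1
    case $value in
        "<"*">"*)
            # Begins with '<' and has a closing '>': take the initial <...>.
            id=${value#"<"}
            id=${id%%">"*}
            ;;
        *"<"*">")
            # Ends with '>': take the final <...>.
            id=${value%">"}
            id=${id##*"<"}
            ;;
        *)
            # Assumption: anything else, including an unterminated '<',
            # is a garbage-valued header and is ignored.
            return 1
            ;;
    esac
    printf '%s\n' "$id"
}
```

Something like `parse_in_reply_to "$value" || echo "ignored"` could then be run over each line of the ids file, after trimming the leading space that formail leaves in front of the value.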