From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mp12.migadu.com ([2001:41d0:2:4a6f::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by ms5.migadu.com with LMTPS id OGQpMHge8mPKLwEAbAwnHQ (envelope-from ) for ; Sun, 19 Feb 2023 14:04:56 +0100 Received: from aspmx1.migadu.com ([2001:41d0:2:4a6f::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by mp12.migadu.com with LMTPS id cC0AMHge8mOZ5AAAauVa8A (envelope-from ) for ; Sun, 19 Feb 2023 14:04:56 +0100 Received: from mail.notmuchmail.org (yantan.tethera.net [135.181.149.255]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by aspmx1.migadu.com (Postfix) with ESMTPS id EB48C77EA for ; Sun, 19 Feb 2023 14:04:55 +0100 (CET) Received: from yantan.tethera.net (localhost [127.0.0.1]) by mail.notmuchmail.org (Postfix) with ESMTP id 406965F342; Sun, 19 Feb 2023 13:04:53 +0000 (UTC) Received: from fethera.tethera.net (fethera.tethera.net [IPv6:2607:5300:60:c5::1]) by mail.notmuchmail.org (Postfix) with ESMTP id 294645DD59 for ; Sun, 19 Feb 2023 13:04:51 +0000 (UTC) Received: by fethera.tethera.net (Postfix, from userid 1001) id 0B0185FB9C; Sun, 19 Feb 2023 08:04:49 -0500 (EST) Received: (nullmailer pid 3363046 invoked by uid 1000); Sun, 19 Feb 2023 13:04:42 -0000 From: David Bremner To: Michael J Gruber Subject: Re: Proof of concept for counting messages in thread In-Reply-To: References: <20230213122631.2088558-1-david@tethera.net> <87lel1pluu.fsf@tethera.net> <87fsb9pb5m.fsf@tethera.net> <87bklxow5v.fsf@tethera.net> Date: Sun, 19 Feb 2023 09:04:42 -0400 Message-ID: <873571x0ut.fsf@tethera.net> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" Message-ID-Hash: FARYH4PXSQTD3YR34SIF2FS73BZQWRGN X-Message-ID-Hash: FARYH4PXSQTD3YR34SIF2FS73BZQWRGN X-MailFrom: david@tethera.net X-Mailman-Rule-Misses: dmarc-mitigation; 
no-senders; approved; emergency; loop; banned-address; member-moderation; header-match-notmuch.notmuchmail.org-0; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: notmuch@notmuchmail.org X-Mailman-Version: 3.3.3 Precedence: list List-Id: "Use and development of the notmuch mail system." List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: X-Migadu-Country: DE X-Migadu-Flow: FLOW_IN ARC-Seal: i=1; s=key1; d=yhetil.org; t=1676811896; a=rsa-sha256; cv=none; b=tQpPVMMJpw6urZ5ZP1pDChZHj8NbNA3VSISTXmigxcXhFsCD7177qFUR2ba8JGwbnjqVLQ UB+7/3lEJX2yQrAyIN1IhMgzcpvayT9Stdges/LKSsCuKe5hd69h8UT5P+ZS/VsHJN9LKR ON5MshwPf1To1DruSF0E26HmNwHaNAC76BN0xOYDZ/wm0M945HZvA7gk5D5S9Yc0WVq5Q2 AF+HbklPR/fGsbvGgec53ZTXsvUQXtU9AkL02tAb4ARQgFro02QI6dMheLyUkhkAYtNizg qEYaSHl20UdpvNkVPWT+Q5XqWHdGx7kLXmt8i95STuaVOsNScHWAtZCM8OmnFw== ARC-Authentication-Results: i=1; aspmx1.migadu.com; dkim=none; dmarc=none; spf=pass (aspmx1.migadu.com: domain of notmuch-bounces@notmuchmail.org designates 135.181.149.255 as permitted sender) smtp.mailfrom=notmuch-bounces@notmuchmail.org ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=yhetil.org; s=key1; t=1676811896; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references:list-id:list-help: list-owner:list-unsubscribe:list-subscribe:list-post; bh=U35F/V0XRfeokhyqry3+xnexqOceGhlgwr5fIu9Uk8k=; b=hFKELk40P+klkKjd8IVKs7GqYRDX/lcAJkKyfh19+IbPGUu6anFez+yBUAlVJK2KYVX8V4 c6b/Du7RDA1KxZn9/H6qh8QjypAAAJF7MdvLEBBdSmuaROJeCxvYVYVCg6yWwFq5DRm2E/ 1z9gXjW59ce1DBwXbh4+a6nd+CU299zvBI+Oe2I/qSoJbHFg+DvbaWtOkXHHGhH3P4XiN9 SehN9KXg/SbCPrsFiIu7BOhDtcB+9Rv7LiPJgEAKEX4sK/UrcpFwSk12a9GTNdLj9wiWpi 5vnRMeUJlEiLNuH24dKL/Fp1bnx+w1aglyY1PBrkI77UGdJe3wwD1ZU0JINUrA== Authentication-Results: aspmx1.migadu.com; dkim=none; dmarc=none; spf=pass (aspmx1.migadu.com: 
domain of notmuch-bounces@notmuchmail.org designates 135.181.149.255 as permitted sender) smtp.mailfrom=notmuch-bounces@notmuchmail.org X-Migadu-Scanner: scn1.migadu.com X-Migadu-Spam-Score: -1.71 X-Spam-Score: -1.71 X-Migadu-Queue-Id: EB48C77EA X-TUID: T+jcMjGcq2Se

--=-=-=
Content-Type: text/plain

Michael J Gruber writes:

>
> Yes, the extra ones all are ghosts, and I slowly remember that they
> scared me in the past already ...
>
> These ghosts appear to be pretty common. It happens all the time that
> I am joined to an existing discussion thread where I do not have all
> references.

I have about 8% ghost messages in my 730k messages. I don't think I have
any situation as extreme as you do, with hundreds of ghost messages for a
small number of actual messages in a thread. If you would like to
calculate the ratio for your mail store, you can run

% xapian-delve -v -A Tghost ~/.local/share/notmuch/default/xapian
% xapian-delve -v -A Tmail ~/.local/share/notmuch/default/xapian

> I'd go as far as to say that counting ghosts as thread
> members makes this useless for me. On the other hand, notmuch's own
> count gets this right. And getting different counts is even more
> confusing.

The count shown in e.g. notmuch search is calculated after the query has
been run, so it isn't easily usable as part of a query. Maybe there is a
way to trade off some performance for fewer false positives. In principle
we could do a query for each thread found by the current technique to
postprocess the results. I can see that getting pretty slow if there are
many results, though.

At least for the original motivation of looking for messages without
replies, counting ghost messages makes some sense. In general it also
makes sense for finding large threads. I did the query
'(thread (count 200 *))' on my mail store and most matches are genuinely
large threads. A few are false positives like the one you describe.
In my case it is easy to see where the ghosts come from, as the (spam)
messages have hundreds of (presumably fictional) references.

>
>> 2) Do they have more than one G term? That suggests a bug somewhere. We
>> actually have a test in the test suite [1] for that, but of course that is
>> with a simple artificial database.
>
> No, they all have one. But their sheer number looks suspicious: those
> 5 "real" e-mails have maybe 20 reference headers in total, and some of
> them refer to some of those 5. Grepping the account store for those
> references gives me around that number. Where do the 110 ghosts (90
> extra) come from which this thread points to? Still scared by them ...
> we need ghost busters!

The only information attached to a ghost message is the thread-id and
the message-id. You can get a visual picture of the thread with the
attached script, but that will probably just confirm what you did with
grep. To see what is in the database, you can run

% quest -btype:T -bthread:G -d mail/.notmuch/xapian "type:ghost and thread:0000000000000002"

That gives you record numbers that you can examine with xapian-delve -r.
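
For the ghost ratio mentioned earlier in this message, here is a minimal
sketch of the arithmetic. It assumes you have read the two term
frequencies (for Tghost and Tmail) off the xapian-delve output by hand.
ghost_percent is a hypothetical helper, not part of notmuch or Xapian,
and treating the ratio as "ghosts as a share of ghost plus mail
documents" is an assumption about what is being measured.

```shell
#!/bin/sh
# ghost_percent GHOSTS MAILS
# Hypothetical helper: prints ghost documents as a percentage of all
# message documents (ghost + mail), rounded to one decimal place.
# Both arguments are counts you supply yourself.
ghost_percent() {
    awk -v g="$1" -v m="$2" 'BEGIN {
        # Avoid dividing by zero on an empty database.
        if (g + m == 0) { print "0.0"; exit }
        printf "%.1f\n", 100 * g / (g + m)
    }'
}
```

With made-up counts, "ghost_percent 8 92" prints 8.0.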
--=-=-= Content-Type: application/octet-stream Content-Disposition: attachment; filename=draw-thread Content-Transfer-Encoding: base64 IyEvYmluL2Jhc2gKCiMgVGhpcyBzY3JpcHQgY2FuIGJlIHVzZWQgbGlrZQojIE5PVE1VQ0hfQ09O RklHPXRlc3QvdG1wLlQ1ODAtdGhyZWFkLXNlYXJjaC9ub3RtdWNoLWNvbmZpZyBcCiMgICAgZGV2 ZWwvZHJhdy10aHJlYWQgdGhyZWFkOjAwMDAwMDAwMDAwMDAwMDIgfCBkb3QgLVRwZGYgPiB0aHJl YWQyLnBkZgoKIyBJbiBhZGRpdGlvbiB0byBub3RtdWNoLCB5b3Ugd2lsbCBuZWVkIHRoZSBmb2xs b3dpbmcgdG9vbHMgaW5zdGFsbGVkCiMgLSBncmFwaHZpegojIC0gZm9ybWFpbCAocGFydCBvZiBw cm9jbWFpbCkKCnRocmVhZGlkPSQxCgpkZWNsYXJlIC1hIGVkZ2VzCgpkZWNsYXJlIC1hIGRlc3QK ZWNobyAiZGlncmFwaCBcIiR0aHJlYWRpZFwiIHsiCmZvciBtZXNzYWdlaWQgaW4gJChub3RtdWNo IHNlYXJjaCAtLW91dHB1dD1tZXNzYWdlcyAkdGhyZWFkaWQpOyBkbwogICAgZWNobyAic3ViZ3Jh cGggXCJjbHVzdGVyXyRtZXNzYWdlaWRcIiB7IgogICAgcHJpbnRmICJcIiVzXCIgW3NoYXBlPWZv bGRlcl07XG4iICR7bWVzc2FnZWlkI2lkOn0KICAgIGZvciBmaWxlIGluICQobm90bXVjaCBzZWFy Y2ggLS1vdXRwdXQ9ZmlsZXMgJG1lc3NhZ2VpZCk7IGRvCiAgICAgICAgbm9kZT0kKGJhc2VuYW1l ICRmaWxlKQogICAgICAgIHByaW50ZiAiXCIlc1wiIFtzaGFwZT1ub3RlXTtcbiIgJG5vZGUKCiAg ICAgICAgbWFwZmlsZSAtdCBkZXN0IDwgPChmb3JtYWlsIC14IHJlZmVyZW5jZXMgPCAkZmlsZSB8 IHRyICc8PiwnICciIiAnKQogICAgICAgIGVkZ2U9IlwiJG5vZGVcIiAtPiB7ICR7ZGVzdFsqXX0g fSIKICAgICAgICBlZGdlcys9KCRlZGdlKQogICAgZG9uZQogICAgZWNobyAifSIKZG9uZQoKZm9y IGVkZ2UgaW4gIiR7ZWRnZXNbKl19IjsgZG8KICAgIGVjaG8gJGVkZ2UKZG9uZQoKZWNobyAifSIK --=-=-= Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline --=-=-=--