unofficial mirror of notmuch@notmuchmail.org
* Proof of concept for counting messages in thread
@ 2023-02-13 12:26 David Bremner
  2023-02-13 12:26 ` [PATCH 1/2] WIP/lib: add count query backend David Bremner
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: David Bremner @ 2023-02-13 12:26 UTC (permalink / raw)
  To: notmuch; +Cc: pabs

So far this only supports counting messages in threads, and the sexp
based query parser. It seems useful to expand it to other fields
(from, e.g.). I'm not sure how motivated I am to shim this into the
infix query parser, but we will see how it goes.
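
As a rough illustration of the approach in patch 1/2 (not the patch itself): the count query walks all terms with the field's prefix, keeps those whose term frequency falls inside the requested range, and ORs them together. The sketch below reproduces only the selection step, with a plain `std::map` standing in for the Xapian term list. `TermFreqs` and `terms_in_count_range` are hypothetical names for illustration, not notmuch or Xapian API.

```cpp
#include <climits>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the Xapian term list: term -> number of
// documents that carry the term (the term frequency).
using TermFreqs = std::map<std::string, unsigned long>;

// Return every term starting with `prefix` whose frequency lies in
// [from, to].  OR-ing these terms gives "fields with from..to documents".
std::vector<std::string>
terms_in_count_range (const TermFreqs &freqs, const std::string &prefix,
                      unsigned long from, unsigned long to = ULONG_MAX)
{
    std::vector<std::string> out;

    // Keys sharing a prefix are contiguous in a sorted map, so start at
    // the first key >= prefix and stop once the prefix no longer matches.
    for (auto it = freqs.lower_bound (prefix);
         it != freqs.end () &&
         it->first.compare (0, prefix.size (), prefix) == 0;
         ++it) {
        if (from <= it->second && it->second <= to)
            out.push_back (it->first);
    }
    return out;
}
```

With thread terms (prefix `G`) of frequency 1, 5 and 115, a range of 1..1 selects only the first term, which is the case `(thread (count 1))` is meant to cover.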


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/2] WIP/lib: add count query backend
  2023-02-13 12:26 Proof of concept for counting messages in thread David Bremner
@ 2023-02-13 12:26 ` David Bremner
  2023-02-13 12:26 ` [PATCH 2/2] WIP: support thread count queries David Bremner
  2023-02-13 15:39 ` Proof of concept for counting messages in thread Michael J Gruber
  2 siblings, 0 replies; 12+ messages in thread
From: David Bremner @ 2023-02-13 12:26 UTC (permalink / raw)
  To: notmuch; +Cc: pabs

---
 lib/Makefile.local     |  3 +-
 lib/count-query.cc     | 62 ++++++++++++++++++++++++++++++++++++++++++
 lib/database-private.h |  6 ++++
 3 files changed, 70 insertions(+), 1 deletion(-)
 create mode 100644 lib/count-query.cc

diff --git a/lib/Makefile.local b/lib/Makefile.local
index 4e766305..cc646946 100644
--- a/lib/Makefile.local
+++ b/lib/Makefile.local
@@ -66,7 +66,8 @@ libnotmuch_cxx_srcs =		\
 	$(dir)/init.cc		\
 	$(dir)/parse-sexp.cc	\
 	$(dir)/sexp-fp.cc	\
-	$(dir)/lastmod-fp.cc
+	$(dir)/lastmod-fp.cc    \
+	$(dir)/count-query.cc
 
 libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)
 
diff --git a/lib/count-query.cc b/lib/count-query.cc
new file mode 100644
index 00000000..5d258880
--- /dev/null
+++ b/lib/count-query.cc
@@ -0,0 +1,62 @@
+/* count-query.cc - generate queries for terms on few / many messages.
+ *
+ * This file is part of notmuch.
+ *
+ * Copyright © 2023 David Bremner
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see https://www.gnu.org/licenses/ .
+ *
+ * Author: David Bremner <david@tethera.net>
+ */
+
+#include "database-private.h"
+
+notmuch_status_t
+_notmuch_count_strings_to_query (notmuch_database_t *notmuch, std::string field,
+				 const std::string &from, const std::string &to,
+				 Xapian::Query &output, std::string &msg)
+{
+
+    long from_idx = 0, to_idx = LONG_MAX;
+    std::string term_prefix = _find_prefix (field.c_str ());
+    std::vector<std::string> terms;
+
+    if (! from.empty ()) {
+	try {
+	    from_idx = std::stol(from);
+	} catch (std::logic_error &e) {
+	    msg = "bad 'from' count: '" + from + "'";
+	    return NOTMUCH_STATUS_BAD_QUERY_SYNTAX;
+	}
+    }
+
+    if (! to.empty ()) {
+	try {
+	    to_idx = std::stol(to);
+	} catch (std::logic_error &e) {
+	    msg = "bad 'to' count: '" + to + "'";
+	    return NOTMUCH_STATUS_BAD_QUERY_SYNTAX;
+	}
+    }
+
+    for (Xapian::TermIterator it = notmuch->xapian_db->allterms_begin (term_prefix);
+	 it != notmuch->xapian_db->allterms_end (); ++it) {
+	Xapian::doccount freq = it.get_termfreq();
+	if (from_idx <= freq && freq <= to_idx)
+	    terms.push_back (*it);
+    }
+
+    output = Xapian::Query (Xapian::Query::OP_OR, terms.begin (), terms.end ());
+    return NOTMUCH_STATUS_SUCCESS;
+}
diff --git a/lib/database-private.h b/lib/database-private.h
index b9be4e22..ba96a93c 100644
--- a/lib/database-private.h
+++ b/lib/database-private.h
@@ -387,5 +387,11 @@ notmuch_status_t
 _notmuch_lastmod_strings_to_query (notmuch_database_t *notmuch,
 				   const std::string &from, const std::string &to,
 				   Xapian::Query &output, std::string &msg);
+
+/* count-query.cc */
+notmuch_status_t
+_notmuch_count_strings_to_query (notmuch_database_t *notmuch, std::string field,
+				 const std::string &from, const std::string &to,
+				 Xapian::Query &output, std::string &msg);
 #endif
 #endif
-- 
2.39.1
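
A note on the bound parsing in `count-query.cc`: the `std::sto*` functions stop at the first character they cannot convert, so relying on exceptions alone lets inputs such as `1.5` or `12abc` through. A stricter variant could look like the sketch below. `parse_count_bound` is a hypothetical helper, not part of the patch, and rejecting negative counts is an assumption about the intended semantics.

```cpp
#include <climits>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: parse one bound of a count range strictly.
// An empty string means "no bound" and yields `if_empty`.
// Returns false on malformed, negative, or out-of-range input.
bool
parse_count_bound (const std::string &text, long &out, long if_empty = LONG_MAX)
{
    if (text.empty ()) {
        out = if_empty;
        return true;
    }

    try {
        std::size_t used = 0;
        long value = std::stol (text, &used);

        // std::stol stops at the first non-digit, so require that the
        // whole string was consumed.
        if (used != text.size () || value < 0)
            return false;

        out = value;
        return true;
    } catch (const std::logic_error &) {
        // std::invalid_argument and std::out_of_range both derive
        // from std::logic_error.
        return false;
    }
}
```

For the lower bound one would pass `0` as `if_empty`; for the upper bound the default `LONG_MAX` leaves the range open.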


* [PATCH 2/2] WIP: support thread count queries
  2023-02-13 12:26 Proof of concept for counting messages in thread David Bremner
  2023-02-13 12:26 ` [PATCH 1/2] WIP/lib: add count query backend David Bremner
@ 2023-02-13 12:26 ` David Bremner
  2023-02-13 15:39 ` Proof of concept for counting messages in thread Michael J Gruber
  2 siblings, 0 replies; 12+ messages in thread
From: David Bremner @ 2023-02-13 12:26 UTC (permalink / raw)
  To: notmuch; +Cc: pabs

---
 lib/parse-sexp.cc         | 35 ++++++++++++++++++++++++++++++++---
 test/T081-sexpr-search.sh |  6 ++++++
 2 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/lib/parse-sexp.cc b/lib/parse-sexp.cc
index 9cadbc13..1faa9023 100644
--- a/lib/parse-sexp.cc
+++ b/lib/parse-sexp.cc
@@ -34,6 +34,8 @@ typedef enum {
     SEXP_FLAG_ORPHAN	= 1 << 8,
     SEXP_FLAG_RANGE	= 1 << 9,
     SEXP_FLAG_PATHNAME	= 1 << 10,
+    SEXP_FLAG_COUNT	= 1 << 11,
+    SEXP_FLAG_MODIFIER	= 1 << 12,
 } _sexp_flag_t;
 
 /*
@@ -70,6 +72,8 @@ static _sexp_prefix_t prefixes[] =
       SEXP_FLAG_FIELD },
     { "date",           Xapian::Query::OP_INVALID,      Xapian::Query::MatchAll,
       SEXP_FLAG_RANGE },
+    { "count",          Xapian::Query::OP_INVALID,      Xapian::Query::MatchAll,
+      SEXP_FLAG_RANGE | SEXP_FLAG_MODIFIER },
     { "from",           Xapian::Query::OP_AND,          Xapian::Query::MatchAll,
       SEXP_FLAG_FIELD | SEXP_FLAG_WILDCARD | SEXP_FLAG_REGEX | SEXP_FLAG_EXPAND },
     { "folder",         Xapian::Query::OP_OR,           Xapian::Query::MatchNothing,
@@ -113,7 +117,8 @@ static _sexp_prefix_t prefixes[] =
     { "tag",            Xapian::Query::OP_AND,          Xapian::Query::MatchAll,
       SEXP_FLAG_FIELD | SEXP_FLAG_BOOLEAN | SEXP_FLAG_WILDCARD | SEXP_FLAG_REGEX | SEXP_FLAG_EXPAND },
     { "thread",         Xapian::Query::OP_OR,           Xapian::Query::MatchNothing,
-      SEXP_FLAG_FIELD | SEXP_FLAG_BOOLEAN | SEXP_FLAG_WILDCARD | SEXP_FLAG_REGEX | SEXP_FLAG_EXPAND },
+      SEXP_FLAG_FIELD | SEXP_FLAG_BOOLEAN | SEXP_FLAG_WILDCARD | SEXP_FLAG_REGEX |
+      SEXP_FLAG_EXPAND | SEXP_FLAG_COUNT },
     { "to",             Xapian::Query::OP_AND,          Xapian::Query::MatchAll,
       SEXP_FLAG_FIELD | SEXP_FLAG_WILDCARD | SEXP_FLAG_EXPAND },
     { }
@@ -513,6 +518,7 @@ _sexp_expand_param (notmuch_database_t *notmuch, const _sexp_prefix_t *parent,
 
 static notmuch_status_t
 _sexp_parse_range (notmuch_database_t *notmuch,  const _sexp_prefix_t *prefix,
+		   const _sexp_prefix_t *parent,
 		   const sexp_t *sx, Xapian::Query &output)
 {
     const char *from, *to;
@@ -552,6 +558,27 @@ _sexp_parse_range (notmuch_database_t *notmuch,  const _sexp_prefix_t *prefix,
 	    to = "";
     }
 
+    if (strcmp (prefix->name, "count") == 0) {
+	notmuch_status_t status;
+	if (! parent) {
+	    _notmuch_database_log (notmuch, "illegal '%s' outside field\n",
+				   prefix->name);
+	    return NOTMUCH_STATUS_BAD_QUERY_SYNTAX;
+	}
+	if (! (parent->flags & SEXP_FLAG_COUNT)) {
+	    _notmuch_database_log (notmuch, "'%s' not supported in field '%s'\n",
+				   prefix->name, parent->name);
+	    return NOTMUCH_STATUS_BAD_QUERY_SYNTAX;
+	}
+
+	status = _notmuch_count_strings_to_query (notmuch, parent->name, from, to, output, msg);
+	if (status) {
+	    if (! msg.empty ())
+		_notmuch_database_log (notmuch, "%s\n", msg.c_str ());
+	}
+	return status;
+    }
+
     if (strcmp (prefix->name, "date") == 0) {
 	notmuch_status_t status;
 	status = _notmuch_date_strings_to_query (NOTMUCH_VALUE_TIMESTAMP, from, to, output, msg);
@@ -654,7 +681,9 @@ _sexp_to_xapian_query (notmuch_database_t *notmuch, const _sexp_prefix_t *parent
 
     for (_sexp_prefix_t *prefix = prefixes; prefix && prefix->name; prefix++) {
 	if (strcmp (prefix->name, sx->list->val) == 0) {
-	    if (prefix->flags & (SEXP_FLAG_FIELD | SEXP_FLAG_RANGE)) {
+	    if ((prefix->flags & (SEXP_FLAG_FIELD)) ||
+		((prefix->flags & SEXP_FLAG_RANGE) &&
+		 ! (prefix->flags & SEXP_FLAG_MODIFIER))) {
 		if (parent) {
 		    _notmuch_database_log (notmuch, "nested field: '%s' inside '%s'\n",
 					   prefix->name, parent->name);
@@ -677,7 +706,7 @@ _sexp_to_xapian_query (notmuch_database_t *notmuch, const _sexp_prefix_t *parent
 	    }
 
 	    if (prefix->flags & SEXP_FLAG_RANGE)
-		return _sexp_parse_range (notmuch, prefix, sx->list->next, output);
+		return _sexp_parse_range (notmuch, prefix, parent, sx->list->next, output);
 
 	    if (strcmp (prefix->name, "infix") == 0) {
 		return _sexp_parse_infix (notmuch, sx->list->next, output);
diff --git a/test/T081-sexpr-search.sh b/test/T081-sexpr-search.sh
index 0c7db9c2..2013fa5c 100755
--- a/test/T081-sexpr-search.sh
+++ b/test/T081-sexpr-search.sh
@@ -1318,5 +1318,11 @@ notmuch search subject:notmuch or List:notmuch | notmuch_search_sanitize > EXPEC
 notmuch search --query=sexp '(About notmuch)' | notmuch_search_sanitize > OUTPUT
 test_expect_equal_file EXPECTED OUTPUT
 
+test_begin_subtest "threads with one message"
+notmuch search --query=sexp '(and (from gusarov) (thread (count 1)))' | notmuch_search_sanitize > OUTPUT
+cat <<EOF >EXPECTED
+thread:XXX   2009-11-17 [1/1] Mikhail Gusarov; [notmuch] [PATCH] Handle rename of message file (inbox unread)
+EOF
+test_expect_equal_file EXPECTED OUTPUT
 
 test_done
-- 
2.39.1


* Re: Proof of concept for counting messages in thread
  2023-02-13 12:26 Proof of concept for counting messages in thread David Bremner
  2023-02-13 12:26 ` [PATCH 1/2] WIP/lib: add count query backend David Bremner
  2023-02-13 12:26 ` [PATCH 2/2] WIP: support thread count queries David Bremner
@ 2023-02-13 15:39 ` Michael J Gruber
  2023-02-13 16:32   ` David Bremner
  2 siblings, 1 reply; 12+ messages in thread
From: Michael J Gruber @ 2023-02-13 15:39 UTC (permalink / raw)
  To: David Bremner, notmuch; +Cc: pabs

On Mon, 13 Feb 2023 at 13:26, David Bremner <david@tethera.net> wrote:
>
> So far this only supports counting messages in threads, and the sexp
> based query parser. It seems useful to expand it to other fields
> (from, e.g.). I'm not sure how motivated I am to shim this into the
> infix query parser, but we will see how it goes.

This certainly looks interesting, and not easy to get by scripting
around the existing commands. It is kinda special, so having it in
sexp only seems okay.

I am getting a few surprising matches, e.g.
```
notmuch search  --query=sexp '(thread (count 115)))'
thread:0000000000021229   2021-05-17 [5/5] Michael J Gruber ... redacted
notmuch count --exclude=false thread:0000000000021229
5
```
It could be some database issues, of course. Or me misunderstanding something :)

Patch 1/2 is crlf garbled, by the way. Applies cleanly after removing
the extra ^Ms.

Michael


* Re: Proof of concept for counting messages in thread
  2023-02-13 15:39 ` Proof of concept for counting messages in thread Michael J Gruber
@ 2023-02-13 16:32   ` David Bremner
  2023-02-13 17:03     ` Michael J Gruber
  0 siblings, 1 reply; 12+ messages in thread
From: David Bremner @ 2023-02-13 16:32 UTC (permalink / raw)
  To: Michael J Gruber, notmuch; +Cc: pabs

Michael J Gruber <michaeljgruber+grubix+git@gmail.com> writes:

> I am getting a few surprising matches, e.g.
> ```
> notmuch search  --query=sexp '(thread (count 115)))'
> thread:0000000000021229   2021-05-17 [5/5] Michael J Gruber ... redacted
> notmuch count --exclude=false thread:0000000000021229
> 5
> ```
> It could be some database issues, of course. Or me misunderstanding something :)

Hmm. I don't see any strange matches for that particular query, just a
thread that actually has 115 messages. But there could also be bugs of
course.  Does xapian-check complain about your database?

>
> Patch 1/2 is crlf garbled, by the way. Applies cleanly after removing
> the extra ^Ms.

Hmm. Probably because of Content-Transfer-Encoding: 8bit

I have a direct mailed copy that didn't go through mailman, and that
looks OK. 

>
> Michael


* Re: Proof of concept for counting messages in thread
  2023-02-13 16:32   ` David Bremner
@ 2023-02-13 17:03     ` Michael J Gruber
  2023-02-13 20:23       ` David Bremner
  0 siblings, 1 reply; 12+ messages in thread
From: Michael J Gruber @ 2023-02-13 17:03 UTC (permalink / raw)
  To: David Bremner; +Cc: notmuch, pabs

On Mon, 13 Feb 2023 at 17:32, David Bremner <david@tethera.net> wrote:
>
> Michael J Gruber <michaeljgruber+grubix+git@gmail.com> writes:
>
> > I am getting a few surprising matches, e.g.
> > ```
> > notmuch search  --query=sexp '(thread (count 115)))'
> > thread:0000000000021229   2021-05-17 [5/5] Michael J Gruber ... redacted
> > notmuch count --exclude=false thread:0000000000021229
> > 5
> > ```
> > It could be some database issues, of course. Or me misunderstanding something :)
>
> Hmm. I don't see any strange matches for that particular query, just a
> thread that actually has 115 messages. But there could also be bugs of
> > course.  Does xapian-check complain about your database?

It has 5, as confirmed by the search output and that of `notmuch
count`. But it is matched by `count 115`.
`xapian-check` is happy. (There used to be some issue with additional
thread entries at some point.)

Michael


* Re: Proof of concept for counting messages in thread
  2023-02-13 17:03     ` Michael J Gruber
@ 2023-02-13 20:23       ` David Bremner
  2023-02-13 22:36         ` Michael J Gruber
  0 siblings, 1 reply; 12+ messages in thread
From: David Bremner @ 2023-02-13 20:23 UTC (permalink / raw)
  To: Michael J Gruber; +Cc: notmuch

Michael J Gruber <michaeljgruber+grubix+git@gmail.com> writes:
>
> It has 5, as confirmed by the search output and that of `notmuch
> count`. But it is matched by `count 115`.
> `xapian-check` is happy. (There used to be some issue with additional
> thread entries at some point.)
>
> Michael

A simple test to try is

% xapian-delve -t G0000000000021229 \
  ~/.local/share/notmuch/default/xapian

adjusting your database path as needed.

If that says "termfreq 115", then something is broken (or at least
confusing) about your database (possibly related to the previous issues
with threading). In that case I'm curious if there are 115 distinct
record numbers.  You can find all of the thread-ids attached to a given
message with

% xapian-delve -1r 267585 ~/.local/share/notmuch/default/xapian | grep ^G

where 267585 is an example record number in my database.


* Re: Proof of concept for counting messages in thread
  2023-02-13 20:23       ` David Bremner
@ 2023-02-13 22:36         ` Michael J Gruber
  2023-02-14  1:47           ` David Bremner
  0 siblings, 1 reply; 12+ messages in thread
From: Michael J Gruber @ 2023-02-13 22:36 UTC (permalink / raw)
  To: David Bremner; +Cc: notmuch

On Mon, 13 Feb 2023 at 21:23, David Bremner <david@tethera.net> wrote:
>
> Michael J Gruber <michaeljgruber+grubix+git@gmail.com> writes:
> >
> > It has 5, as confirmed by the search output and that of `notmuch
> > count`. But it is matched by `count 115`.
> > `xapian-check` is happy. (There used to be some issue with additional
> > thread entries at some point.)
> >
> > Michael
>
> A simple test to try is
>
> % xapian-delve -t G0000000000021229 \
>   ~/.local/share/notmuch/default/xapian
>
> adjusting your database path as needed.
>
> If that says "termfreq 115", then something is broken (or at least
> confusing) about your database (possibly related to the previous issues
> with threading). In that case I'm curious if there are 115 distinct
> record numbers.  You can find all of the thread-ids attached to a given
> message with
>
> % xapian-delve -1r 267585 ~/.local/share/notmuch/default/xapian | grep ^G
>
> where 267585 is an example record number in my database.

That is really weird:
```
xapian-delve -t G0000000000021229 .
Posting List for term 'G0000000000021229' (termfreq 115, collfreq 0,
wdf_max 0): 146259 ...
```
with 115 record numbers, all different.
Doing `xapian-delve -1r` for each of them and grepping for the G-lines
gives 115 times that correct thread id.
Grepping for the Q-lines and notmuch-searching for the message ids
gives only 5 results (the expected ones). Apparently, there are bogus
mail records which that thread points to.
I guess I should recreate the db, if I only knew how lieer deals with
a reindexed mail store ... (The thread and the 5 messages sit in an
mbsynced folder, but lieer syncs other folders with that same db).

Michael


* Re: Proof of concept for counting messages in thread
  2023-02-13 22:36         ` Michael J Gruber
@ 2023-02-14  1:47           ` David Bremner
  2023-02-18 17:47             ` Michael J Gruber
  0 siblings, 1 reply; 12+ messages in thread
From: David Bremner @ 2023-02-14  1:47 UTC (permalink / raw)
  To: Michael J Gruber; +Cc: notmuch

Michael J Gruber <michaeljgruber+grubix+git@gmail.com> writes:

> That is really weird:
> ```
> xapian-delve -t G0000000000021229 .
> Posting List for term 'G0000000000021229' (termfreq 115, collfreq 0,
> wdf_max 0): 146259 ...
> ```
> with 115 record numbers, all different.
> Doing `xapian-delve -1r` for each of them and grepping for the G-lines
> gives 115 times that correct thread id.
> Grepping for the Q-lines and notmuch-searching for the message ids
> gives only 5 results (the expected ones). Apparently, there are bogus
> mail records which that thread points to.

1) Do those "bogus" records have a "Tghost" term? That would be for
messages that are known via references, but not actually in the local
database. This is a bug / feature of the current implementation, it
counts all messages known, whether or not local copies exist.

2) Do they have more than one G term? That suggests a bug somewhere. We
actually have a test in the test suite [1] for that, but of course that is
with a simple artificial database. 

[1]: in T670-duplicate-mid.sh:

db=$HOME/.local/share/notmuch/default/xapian
for doc in $(xapian-delve -1 -t '' "$db" | grep '^[1-9]'); do
    xapian-delve -1 -r "$doc" "$db" | grep -c '^G'
done > OUTPUT.raw
sort -u < OUTPUT.raw > OUTPUT
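
To make point 1 concrete: a term-frequency count sees every record that carries the thread term, ghosts included, whereas a count of real messages only looks at records of type mail. The sketch below models that difference with a hypothetical, heavily simplified `Record` struct. The `Tmail` and `Tghost` type terms and the `G` thread prefix follow the terms discussed in this thread; everything else is made up for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified database record: only the thread term
// (prefix G) and the type term (Tmail or Tghost) are modelled.
struct Record {
    std::string thread_term;
    std::string type_term;
};

// What a termfreq-based count sees: every record carrying the term.
std::size_t
termfreq (const std::vector<Record> &db, const std::string &thread_term)
{
    std::size_t n = 0;
    for (const Record &r : db)
        if (r.thread_term == thread_term)
            n++;
    return n;
}

// Count only records for messages that actually exist locally.
std::size_t
mail_count (const std::vector<Record> &db, const std::string &thread_term)
{
    std::size_t n = 0;
    for (const Record &r : db)
        if (r.thread_term == thread_term && r.type_term == "Tmail")
            n++;
    return n;
}
```

A thread with 5 mail records and 110 ghost records has a term frequency of 115 but a mail count of 5, which matches the mismatch reported above.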


* Re: Proof of concept for counting messages in thread
  2023-02-14  1:47           ` David Bremner
@ 2023-02-18 17:47             ` Michael J Gruber
  2023-02-19 13:04               ` David Bremner
  0 siblings, 1 reply; 12+ messages in thread
From: Michael J Gruber @ 2023-02-18 17:47 UTC (permalink / raw)
  To: David Bremner; +Cc: notmuch

On Tue, 14 Feb 2023 at 02:47, David Bremner <david@tethera.net> wrote:
>
> Michael J Gruber <michaeljgruber+grubix+git@gmail.com> writes:
>
> > That is really weird:
> > ```
> > xapian-delve -t G0000000000021229 .
> > Posting List for term 'G0000000000021229' (termfreq 115, collfreq 0,
> > wdf_max 0): 146259 ...
> > ```
> > with 115 record numbers, all different.
> > Doing `xapian-delve -1r` for each of them and grepping for the G-lines
> > gives 115 times that correct thread id.
> > Grepping for the Q-lines and notmuch-searching for the message ids
> > gives only 5 results (the expected ones). Apparently, there are bogus
> > mail records which that thread points to.
>
> 1) Do those "bogus" records have a "Tghost" term? That would be for
> messages that are known via references, but not actually in the local
> database. This is a bug / feature of the current implementation, it
> counts all messages known, whether or not local copies exist.

Yes, the extra ones all are ghosts, and I slowly remember that they
scared me in the past already ...

These ghosts appear to be pretty common. It happens all the time that
I am joined to an existing discussion thread where I do not have all
references. I'd go as far as to say that counting ghosts as thread
members makes this useless for me. On the other hand, notmuch's own
count gets this right. And getting different counts is even more
confusing.

> 2) Do they have more than one G term? That suggests a bug somewhere. We
> actually have a test in the test suite [1] for that, but of course that is
> with a simple artificial database.

No, they all have one. But their sheer number looks suspicious: those
5 "real" e-mails have maybe 20 reference headers in total, and some of
them refer to some of those 5. Grepping the account store for those
references gives me around that number. Where do the 110 ghosts (90
extra) come from which this thread points to? Still scared by them ...
we need ghost busters!

Michael


* Re: Proof of concept for counting messages in thread
  2023-02-18 17:47             ` Michael J Gruber
@ 2023-02-19 13:04               ` David Bremner
  2023-02-19 13:56                 ` David Bremner
  0 siblings, 1 reply; 12+ messages in thread
From: David Bremner @ 2023-02-19 13:04 UTC (permalink / raw)
  To: Michael J Gruber; +Cc: notmuch

[-- Attachment #1: Type: text/plain, Size: 2798 bytes --]

Michael J Gruber <michaeljgruber+grubix+git@gmail.com> writes:

>
> Yes, the extra ones all are ghosts, and I slowly remember that they
> scared me in the past already ...
>
> These ghosts appear to be pretty common. It happens all the time that
> I am joined to an existing discussion thread where I do not have all
> references.

I have about 8% ghost messages in my 730k messages. I don't think I have
any situation as extreme as you do with hundreds of ghost messages for a
small number of actual messages in thread.

If you would like to calculate the ratio for your mail store, you can run

% xapian-delve -v -A Tghost ~/.local/share/notmuch/default/xapian
% xapian-delve -v -A Tmail ~/.local/share/notmuch/default/xapian

> I'd go as far as to say that counting ghosts as thread
> members makes this useless for me. On the other hand, notmuch's own
> count gets this right. And getting different counts is even more
> confusing.

The count shown in e.g. notmuch search is calculated after the query has
been run, so it isn't easily usable as part of a query. Maybe there is a
way to trade off some performance for fewer false positives. In principle
we could do a query for each thread found by the current technique to
postprocess the results. I can see that getting pretty slow if there are
many results though.
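
A minimal sketch of such a two-phase filter, under stated assumptions: `ThreadStats` and `threads_with_mail_count` are hypothetical names, and the per-thread statistics are given up front rather than queried. The sketch also deviates slightly from the idea above: instead of post-filtering threads whose term frequency is inside the range, it prefilters on `termfreq >= from`, so that a thread whose ghosts push it above `to` is not lost. That costs more recounts in exchange for avoiding false negatives.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical per-thread statistics: number of records carrying the
// thread term (mail + ghost), and how many of those are real messages.
struct ThreadStats {
    unsigned long termfreq;
    unsigned long mail;
};

// Return the ids of threads with between `from` and `to` real messages.
std::vector<std::string>
threads_with_mail_count (const std::map<std::string, ThreadStats> &threads,
                         unsigned long from, unsigned long to)
{
    std::vector<std::string> out;

    for (const auto &entry : threads) {
        const ThreadStats &s = entry.second;

        // Phase 1 (cheap): mail <= termfreq always holds, so a thread
        // with termfreq below `from` cannot have enough real messages.
        if (s.termfreq < from)
            continue;

        // Phase 2 (the per-thread query in a real database): recount
        // using only real messages.
        if (from <= s.mail && s.mail <= to)
            out.push_back (entry.first);
    }
    return out;
}
```

In a real implementation phase 2 is the part that could get slow, since it runs once per surviving thread.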

At least for the original motivation of looking for messages without
replies counting ghost messages makes some sense. In general it also
makes sense for finding large threads. I did the query '(thread (count
200 *))' on my mail store and most matches are genuinely large
threads. A few are false positives like the one you describe. In my case
it is easy to see where the ghosts come from, as the (spam) messages
have hundreds of (presumably fictional) references.

>
>> 2) Do they have more than one G term? That suggests a bug somewhere. We
>> actually have a test in the test suite [1] for that, but of course that is
>> with a simple artificial database.
>
> No, they all have one. But their sheer number looks suspicious: those
> 5 "real" e-mails have maybe 20 reference headers in total, and some of
> them refer to some of those 5. Grepping the account store for those
> references gives me around that number. Where do the 110 ghosts (90
> extra) come from which this thread points to? Still scared by them ...
> we need ghost busters!

The only information attached to a ghost message is the thread-id and
the message-id.  You can get a visual picture of the thread with the
attached script. But that will probably just confirm what you did with
grep. To see what is in the database, you can run

% quest -btype:T -bthread:G -d mail/.notmuch/xapian "type:ghost and thread:0000000000000002"

That gives you record numbers, that you can examine with xapian-delve
-r.




[-- Attachment #2: draw-thread --]
[-- Type: application/octet-stream, Size: 912 bytes --]

#!/bin/bash

# This script can be used like
# NOTMUCH_CONFIG=test/tmp.T580-thread-search/notmuch-config \
#    devel/draw-thread thread:0000000000000002 | dot -Tpdf > thread2.pdf

# In addition to notmuch, you will need the following tools installed
# - graphviz
# - formail (part of procmail)

threadid=$1

declare -a edges

declare -a dest
echo "digraph \"$threadid\" {"
for messageid in $(notmuch search --output=messages $threadid); do
    echo "subgraph \"cluster_$messageid\" {"
    printf "\"%s\" [shape=folder];\n" ${messageid#id:}
    for file in $(notmuch search --output=files $messageid); do
        node=$(basename $file)
        printf "\"%s\" [shape=note];\n" $node

        mapfile -t dest < <(formail -x references < $file | tr '<>,' '"" ')
        edge="\"$node\" -> { ${dest[*]} }"
        # Quote so each edge stays a single array element.
        edges+=("$edge")
    done
    echo "}"
done

# Emit one edge statement per line.
for edge in "${edges[@]}"; do
    echo "$edge"
done

echo "}"




* Re: Proof of concept for counting messages in thread
  2023-02-19 13:04               ` David Bremner
@ 2023-02-19 13:56                 ` David Bremner
  0 siblings, 0 replies; 12+ messages in thread
From: David Bremner @ 2023-02-19 13:56 UTC (permalink / raw)
  To: Michael J Gruber; +Cc: notmuch

David Bremner <david@tethera.net> writes:

> Michael J Gruber <michaeljgruber+grubix+git@gmail.com> writes:
>
>>
>> Yes, the extra ones all are ghosts, and I slowly remember that they
>> scared me in the past already ...
>>
>> These ghosts appear to be pretty common. It happens all the time that
>> I am joined to an existing discussion thread where I do not have all
>> references.
>
> I have about 8% ghost messages in my 730k messages. I don't think I have
> any situation as extreme as you do with hundreds of ghost messages for a
> small number of actual messages in thread.

That turns out to be a lie, as I wrote below.


Code repositories for project(s) associated with this public inbox

	https://yhetil.org/notmuch.git/
