unofficial mirror of notmuch@notmuchmail.org
 help / color / mirror / code / Atom feed
* Partial support for sorting the output of notmuch address
@ 2021-12-19 18:18 David Bremner
  2021-12-19 18:18 ` [RFC Patch 1/3] CLI/address: refactor/inline print_popular David Bremner
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: David Bremner @ 2021-12-19 18:18 UTC (permalink / raw)
  To: notmuch

One of the annoyances with notmuch-address (and motivations for
various third party work-a-likes) is the fact that it does not output
data in (descending) frequency order.  This series adds that for
either --output=count or --deduplicate=address. This is because the
current --dedup=mailbox handling does not actually count the
occurences.

I see two ways to go here

1) Unify the handling of --dedup={mailbox,address}. This will probably
yield a performance hit for dedup=mailbox. How big probably depends on how many duplicates you have.
The main difference would be that the output would all arrive at the end, like with --dedup=address.

If people are using --dedup=mailbox for large queries, now is the time
to speak up. I noticed the builtin completion in notmuch-emacs is
already using --deduplicate=address, so probably no change there.

2) Apply something like this series, and deal with questions from
users confused by why only one of the deduplication options, or
output=count, sorts.

At least in my tests, there does not seem to be a noticable
performance hit for the current series. I'm also interested to know if
other people do notice such a hit.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [RFC Patch 1/3] CLI/address: refactor/inline print_popular
  2021-12-19 18:18 Partial support for sorting the output of notmuch address David Bremner
@ 2021-12-19 18:18 ` David Bremner
  2021-12-19 18:18 ` [RFC Patch 2/3] CLI/address: sort output by frequency with deduplicate=address David Bremner
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: David Bremner @ 2021-12-19 18:18 UTC (permalink / raw)
  To: notmuch

The goal is to make it easier to sort the current output in future commits.

There should be no functional change in this commit.
---
 notmuch-search.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/notmuch-search.c b/notmuch-search.c
index 327e1445..47eec601 100644
--- a/notmuch-search.c
+++ b/notmuch-search.c
@@ -473,10 +473,10 @@ _list_free_for_g_hash (void *ptr)
     g_list_free_full (ptr, _talloc_free_for_g_hash);
 }
 
-/* Print the most common variant of a list of unique mailboxes, and
+/* Find the most common variant of a list of unique mailboxes, and
  * conflate the counts. */
-static void
-print_popular (const search_context_t *ctx, GList *list)
+static mailbox_t *
+summarize_mailboxes (GList *list)
 {
     GList *l;
     mailbox_t *mailbox = NULL, *m;
@@ -498,7 +498,7 @@ print_popular (const search_context_t *ctx, GList *list)
     /* The original count is no longer needed, so overwrite. */
     mailbox->count = total;
 
-    print_mailbox (ctx, mailbox);
+    return mailbox;
 }
 
 static void
@@ -512,9 +512,10 @@ print_hash_value (unused (void *key), void *list, void *context)
 {
     const search_context_t *ctx = context;
 
-    if (ctx->dedup == DEDUP_ADDRESS)
-	print_popular (ctx, list);
-    else
+    if (ctx->dedup == DEDUP_ADDRESS) {
+	mailbox_t *mailbox = summarize_mailboxes (list);
+	print_mailbox (ctx, mailbox);
+    } else
 	g_list_foreach (list, print_list_value, context);
 }
 
@@ -619,9 +620,9 @@ do_search_messages (search_context_t *ctx)
     }
 
     if (ctx->addresses &&
-	(ctx->output & OUTPUT_COUNT || ctx->dedup == DEDUP_ADDRESS))
+	(ctx->output & OUTPUT_COUNT || ctx->dedup == DEDUP_ADDRESS)) {
 	g_hash_table_foreach (ctx->addresses, print_hash_value, ctx);
-
+    }
     notmuch_messages_destroy (messages);
 
     format->end (format);
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [RFC Patch 2/3] CLI/address: sort output by frequency with deduplicate=address
  2021-12-19 18:18 Partial support for sorting the output of notmuch address David Bremner
  2021-12-19 18:18 ` [RFC Patch 1/3] CLI/address: refactor/inline print_popular David Bremner
@ 2021-12-19 18:18 ` David Bremner
  2021-12-19 18:18 ` [RFC Patch 3/3] CLI/address: source by frequency if output=count David Bremner
  2021-12-20 16:48 ` Partial support for sorting the output of notmuch address inwit
  3 siblings, 0 replies; 6+ messages in thread
From: David Bremner @ 2021-12-19 18:18 UTC (permalink / raw)
  To: notmuch

Since the current output is not in any guaranteed order, changing the
output ordering should not break anything.

This is one of the easy cases, since we are already making a complete
pass of the matches and storing them in a hash table.
---
 notmuch-search.c     | 19 +++++++++++++++----
 test/T095-address.sh | 28 ++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/notmuch-search.c b/notmuch-search.c
index 47eec601..ff3967fd 100644
--- a/notmuch-search.c
+++ b/notmuch-search.c
@@ -64,6 +64,7 @@ typedef struct {
     int limit;
     int dupe;
     GHashTable *addresses;
+    GList *output_mailboxes;
     int dedup;
 } search_context_t;
 
@@ -507,14 +508,20 @@ print_list_value (void *mailbox, void *context)
     print_mailbox (context, mailbox);
 }
 
+static gint
+compare_count (gconstpointer a, gconstpointer b)
+{
+    return ((mailbox_t *) b)->count - ((mailbox_t *) a)->count;
+}
+
 static void
-print_hash_value (unused (void *key), void *list, void *context)
+process_hash_value (unused (void *key), void *list, void *context)
 {
-    const search_context_t *ctx = context;
+    search_context_t *ctx = context;
 
     if (ctx->dedup == DEDUP_ADDRESS) {
 	mailbox_t *mailbox = summarize_mailboxes (list);
-	print_mailbox (ctx, mailbox);
+	ctx->output_mailboxes = g_list_prepend (ctx->output_mailboxes, mailbox);
     } else
 	g_list_foreach (list, print_list_value, context);
 }
@@ -621,7 +628,10 @@ do_search_messages (search_context_t *ctx)
 
     if (ctx->addresses &&
 	(ctx->output & OUTPUT_COUNT || ctx->dedup == DEDUP_ADDRESS)) {
-	g_hash_table_foreach (ctx->addresses, print_hash_value, ctx);
+	g_hash_table_foreach (ctx->addresses, process_hash_value, ctx);
+	ctx->output_mailboxes = g_list_sort (ctx->output_mailboxes, compare_count);
+	if (ctx->dedup == DEDUP_ADDRESS)
+	    g_list_foreach (ctx->output_mailboxes, print_list_value, ctx);
     }
     notmuch_messages_destroy (messages);
 
@@ -910,6 +920,7 @@ notmuch_address_command (notmuch_database_t *notmuch, int argc, char *argv[])
     ctx->addresses = g_hash_table_new_full (strcase_hash, strcase_equal,
 					    _talloc_free_for_g_hash,
 					    _list_free_for_g_hash);
+    ctx->output_mailboxes = NULL; /* empty list, according to glib */
 
     /* The order is not guaranteed if a full pass is required, so go
      * for fastest. */
diff --git a/test/T095-address.sh b/test/T095-address.sh
index 8bb3627a..53886591 100755
--- a/test/T095-address.sh
+++ b/test/T095-address.sh
@@ -276,6 +276,26 @@ notmuch@notmuchmail.org
 EOF
 test_expect_equal_file EXPECTED OUTPUT
 
+test_begin_subtest "--deduplicate=address --output=sender --output=recipients --output=count (sort by frequency)"
+notmuch address --deduplicate=address --output=sender --output=recipients --output=count '*' | head -n 4 >OUTPUT
+cat <<EOF >EXPECTED
+50	notmuch@notmuchmail.org
+12	Carl Worth <cworth@cworth.org>
+8	Keith Packard <keithp@keithp.com>
+6	Alexander Botero-Lowry <alex.boterolowry@gmail.com>
+EOF
+test_expect_equal_file EXPECTED OUTPUT
+
+test_begin_subtest "--deduplicate=address --output=sender --output=recipients --output=count (sort by frequency)"
+notmuch address --deduplicate=address --output=sender --output=recipients '*' | head -n 4 >OUTPUT
+cat <<EOF >EXPECTED
+notmuch@notmuchmail.org
+Carl Worth <cworth@cworth.org>
+Keith Packard <keithp@keithp.com>
+Alexander Botero-Lowry <alex.boterolowry@gmail.com>
+EOF
+test_expect_equal_file EXPECTED OUTPUT
+
 generate_message '[from]="Foo Bar <foo.bar@example.com>"'
 generate_message '[from]="Foo Bar <Foo.Bar@Example.Com>"'
 generate_message '[from]="Foo Bar <foo.bar@example.com>"'
@@ -325,6 +345,14 @@ cat <<EOF >EXPECTED
 EOF
 test_expect_equal_file EXPECTED OUTPUT
 
+test_begin_subtest "--deduplicate=address --output=sender --output=count (sort by frequency)"
+notmuch address --deduplicate=address --output=sender --output=count from:example.com >OUTPUT
+cat <<EOF >EXPECTED
+7	Foo Bar <foo.bar@example.com>
+3	Baz <foo.bar+baz@example.com>
+EOF
+test_expect_equal_file EXPECTED OUTPUT
+
 if [[ NOTMUCH_HAVE_SFSEXP = 1 ]]; then
     test_begin_subtest "sexpr query: all messages"
     notmuch address '*' > EXPECTED
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [RFC Patch 3/3] CLI/address: source by frequency if output=count
  2021-12-19 18:18 Partial support for sorting the output of notmuch address David Bremner
  2021-12-19 18:18 ` [RFC Patch 1/3] CLI/address: refactor/inline print_popular David Bremner
  2021-12-19 18:18 ` [RFC Patch 2/3] CLI/address: sort output by frequency with deduplicate=address David Bremner
@ 2021-12-19 18:18 ` David Bremner
  2021-12-20 16:48 ` Partial support for sorting the output of notmuch address inwit
  3 siblings, 0 replies; 6+ messages in thread
From: David Bremner @ 2021-12-19 18:18 UTC (permalink / raw)
  To: notmuch

This is the other low hanging fruit, because we already require a
second pass of the hashed values to output.
---
 notmuch-search.c     | 10 ++++++----
 test/T095-address.sh | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 40 insertions(+), 4 deletions(-)

diff --git a/notmuch-search.c b/notmuch-search.c
index ff3967fd..9f698a09 100644
--- a/notmuch-search.c
+++ b/notmuch-search.c
@@ -522,8 +522,11 @@ process_hash_value (unused (void *key), void *list, void *context)
     if (ctx->dedup == DEDUP_ADDRESS) {
 	mailbox_t *mailbox = summarize_mailboxes (list);
 	ctx->output_mailboxes = g_list_prepend (ctx->output_mailboxes, mailbox);
-    } else
-	g_list_foreach (list, print_list_value, context);
+    } else {
+	for (GList *l = list; l; l = l->next) {
+	    ctx->output_mailboxes = g_list_prepend (ctx->output_mailboxes, l->data);
+	}
+    }
 }
 
 static int
@@ -630,8 +633,7 @@ do_search_messages (search_context_t *ctx)
 	(ctx->output & OUTPUT_COUNT || ctx->dedup == DEDUP_ADDRESS)) {
 	g_hash_table_foreach (ctx->addresses, process_hash_value, ctx);
 	ctx->output_mailboxes = g_list_sort (ctx->output_mailboxes, compare_count);
-	if (ctx->dedup == DEDUP_ADDRESS)
-	    g_list_foreach (ctx->output_mailboxes, print_list_value, ctx);
+	g_list_foreach (ctx->output_mailboxes, print_list_value, ctx);
     }
     notmuch_messages_destroy (messages);
 
diff --git a/test/T095-address.sh b/test/T095-address.sh
index 53886591..204168c7 100755
--- a/test/T095-address.sh
+++ b/test/T095-address.sh
@@ -119,6 +119,14 @@ cat <<EOF >EXPECTED
 EOF
 test_expect_equal_file EXPECTED OUTPUT
 
+test_begin_subtest "--output=sender --output=count (sort by frequency)"
+notmuch address --output=sender --output=count '*' | head -n 2 >OUTPUT
+cat <<EOF >EXPECTED
+12	Carl Worth <cworth@cworth.org>
+7	Keith Packard <keithp@keithp.com>
+EOF
+test_expect_equal_file EXPECTED OUTPUT
+
 test_begin_subtest "--output=recipients --output=address"
 notmuch address --output=recipients --output=address '*' >OUTPUT
 cat <<EOF >EXPECTED
@@ -155,6 +163,14 @@ cat <<EOF >EXPECTED
 EOF
 test_expect_equal_file EXPECTED OUTPUT
 
+test_begin_subtest "--output=sender --output=address --output=count (sort by frequency)"
+notmuch address --output=sender --output=address --output=count '*' | head -n 2 >OUTPUT
+cat <<EOF >EXPECTED
+12	cworth@cworth.org
+7	keithp@keithp.com
+EOF
+test_expect_equal_file EXPECTED OUTPUT
+
 test_begin_subtest "--output=count --format=json"
 # Since the iteration order of GHashTable is not specified, we
 # preprocess and sort the results to keep the order stable here.
@@ -181,6 +197,15 @@ cat <<EOF >EXPECTED
 EOF
 test_expect_equal_file EXPECTED OUTPUT
 
+test_begin_subtest "--output=count --format=json (sort by frequency)"
+notmuch address --output=count --format=json '*' | \
+    sed -e 's/^\[//' -e 's/]$//' -e 's/,$//' | head -n 2 >OUTPUT
+cat <<EOF >EXPECTED
+{"name": "Carl Worth", "address": "cworth@cworth.org", "name-addr": "Carl Worth <cworth@cworth.org>", "count": 12}
+{"name": "Keith Packard", "address": "keithp@keithp.com", "name-addr": "Keith Packard <keithp@keithp.com>", "count": 7}
+EOF
+test_expect_equal_file EXPECTED OUTPUT
+
 test_begin_subtest "--deduplicate=no --sort=oldest-first --output=sender"
 notmuch address --deduplicate=no --sort=oldest-first --output=sender '*' >OUTPUT
 cat <<EOF >EXPECTED
@@ -337,6 +362,15 @@ cat <<EOF >EXPECTED
 EOF
 test_expect_equal_file EXPECTED OUTPUT
 
+test_begin_subtest "--deduplicate=mailbox --output=sender --output=count (sort by frequency)"
+notmuch address --deduplicate=mailbox --output=sender --output=count from:example.com  | head -n 3 | sort -k2 >OUTPUT
+cat <<EOF >EXPECTED
+2	Baz <foo.bar+baz@example.com>
+2	Foo Bar <foo.bar@example.com>
+2	foo.bar@example.com
+EOF
+test_expect_equal_file EXPECTED OUTPUT
+
 test_begin_subtest "--deduplicate=address --output=sender --output=count"
 notmuch address --deduplicate=address --output=sender --output=count from:example.com | sort -n >OUTPUT
 cat <<EOF >EXPECTED
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: Partial support for sorting the output of notmuch address
  2021-12-19 18:18 Partial support for sorting the output of notmuch address David Bremner
                   ` (2 preceding siblings ...)
  2021-12-19 18:18 ` [RFC Patch 3/3] CLI/address: source by frequency if output=count David Bremner
@ 2021-12-20 16:48 ` inwit
  2021-12-20 17:44   ` David Bremner
  3 siblings, 1 reply; 6+ messages in thread
From: inwit @ 2021-12-20 16:48 UTC (permalink / raw)
  To: David Bremner, notmuch

On Sun Dec 19, 2021 at 7:18 PM CET, David Bremner wrote:
> One of the annoyances with notmuch-address (and motivations for various third
> party work-a-likes) is the fact that it does not output data in (descending)
> frequency order. 
Thanks, David! For those of us using company-prescient.el in emacs, this is not
really necessary. In the case of email addresses, I guess one could argue that
the responsability for sorting could be of notmuch, but this works for me so
far.

In any case, prescient's 'frecency' approach is better imho than frequency only. 

Regards,


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Partial support for sorting the output of notmuch address
  2021-12-20 16:48 ` Partial support for sorting the output of notmuch address inwit
@ 2021-12-20 17:44   ` David Bremner
  0 siblings, 0 replies; 6+ messages in thread
From: David Bremner @ 2021-12-20 17:44 UTC (permalink / raw)
  To: inwit, notmuch

"inwit" <inwit@sindominio.net> writes:

> On Sun Dec 19, 2021 at 7:18 PM CET, David Bremner wrote:
>> One of the annoyances with notmuch-address (and motivations for various third
>> party work-a-likes) is the fact that it does not output data in (descending)
>> frequency order. 
> Thanks, David! For those of us using company-prescient.el in emacs, this is not
> really necessary. In the case of email addresses, I guess one could argue that
> the responsability for sorting could be of notmuch, but this works for me so
> far.

Sure. I'm just intesterted in bringing notmuch-address to rough feature
parity with some of the third party alternatives, particularly for
command line use.

> In any case, prescient's 'frecency' approach is better imho than frequency only. 

Yeah, I can imagine that would be an improvement.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-12-20 17:44 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-19 18:18 Partial support for sorting the output of notmuch address David Bremner
2021-12-19 18:18 ` [RFC Patch 1/3] CLI/address: refactor/inline print_popular David Bremner
2021-12-19 18:18 ` [RFC Patch 2/3] CLI/address: sort output by frequency with deduplicate=address David Bremner
2021-12-19 18:18 ` [RFC Patch 3/3] CLI/address: source by frequency if output=count David Bremner
2021-12-20 16:48 ` Partial support for sorting the output of notmuch address inwit
2021-12-20 17:44   ` David Bremner

Code repositories for project(s) associated with this public inbox

	https://yhetil.org/notmuch.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).