unofficial mirror of notmuch@notmuchmail.org
 help / color / mirror / code / Atom feed
* reindex improvements
@ 2019-04-16  1:46 David Bremner
  2019-04-16  1:46 ` [PATCH 1/2] CLI/reindex: fix memory leak David Bremner
  2019-04-16  1:46 ` [PATCH 2/2] n_m_remove_indexed_terms: reduce number of Xapian API calls David Bremner
  0 siblings, 2 replies; 7+ messages in thread
From: David Bremner @ 2019-04-16  1:46 UTC (permalink / raw)
  To: notmuch

With the upgrade path for the body: prefix being "notmuch reindex
'*'", it's important that that is usable. It turns out that due to a
blunder on my part it wasn't really unless your whole mail store fits
comfortably in RAM.

[PATCH 1/2] CLI/reindex: fix memory leak

       This is an outright bugfix. 
       
[PATCH 2/2] n_m_remove_indexed_terms: reduce number of Xapian API

       This is optional, but the performance payoff is impressive. I
       suspect it might be even more impressive on spinning rust, but
       I didn't test it.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 1/2] CLI/reindex: fix memory leak
  2019-04-16  1:46 reindex improvements David Bremner
@ 2019-04-16  1:46 ` David Bremner
  2019-04-17 17:04   ` Tomi Ollila
  2019-04-19  2:22   ` David Bremner
  2019-04-16  1:46 ` [PATCH 2/2] n_m_remove_indexed_terms: reduce number of Xapian API calls David Bremner
  1 sibling, 2 replies; 7+ messages in thread
From: David Bremner @ 2019-04-16  1:46 UTC (permalink / raw)
  To: notmuch

Since message is owned by messages, it was held for the entire run of
the program. This in turn means that the Xapian::Document objects are
not freed, and thus one ends up with (effectively) a copy of one's
entire mailstore in memory when running

       notmuch reindex '*'

Thanks to Olly Betts for the patient help debugging, and the
suggestion of a fix.
---
 notmuch-reindex.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/notmuch-reindex.c b/notmuch-reindex.c
index d8589120..fefa7c63 100644
--- a/notmuch-reindex.c
+++ b/notmuch-reindex.c
@@ -71,6 +71,7 @@ reindex_query (notmuch_database_t *notmuch, const char *query_string,
 	ret = notmuch_message_reindex(message, indexopts);
 	if (ret != NOTMUCH_STATUS_SUCCESS)
 	    break;
+	notmuch_message_destroy (message);
     }
 
     if (!ret)
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 2/2] n_m_remove_indexed_terms: reduce number of Xapian API calls.
  2019-04-16  1:46 reindex improvements David Bremner
  2019-04-16  1:46 ` [PATCH 1/2] CLI/reindex: fix memory leak David Bremner
@ 2019-04-16  1:46 ` David Bremner
  2019-05-22 11:58   ` David Bremner
  2019-05-23 11:40   ` David Bremner
  1 sibling, 2 replies; 7+ messages in thread
From: David Bremner @ 2019-04-16  1:46 UTC (permalink / raw)
  To: notmuch

Previously this functioned scanned every term attached to a given
Xapian document. It turns out we know how to read only the terms we
need to preserve (and we might have already done so). This commit
replaces many calls to Xapian::Document::remove_term with one call to
::clear_terms, and a (typically much smaller) number of calls to
::add_term. Roughly speaking this is based on the assumption that most
messages have more text than they have tags.

According to the performance test suite, this yields a roughly 40%
speedup on "notmuch reindex '*'"
---
 lib/message.cc | 66 +++++++++++++++++++++++++++++---------------------
 1 file changed, 38 insertions(+), 28 deletions(-)

diff --git a/lib/message.cc b/lib/message.cc
index 6f2f6345..3e33d8b8 100644
--- a/lib/message.cc
+++ b/lib/message.cc
@@ -716,6 +716,8 @@ _notmuch_message_remove_terms (notmuch_message_t *message, const char *prefix)
 
 /* Remove all terms generated by indexing, i.e. not tags or
  * properties, along with any automatic tags*/
+/* According to Xapian API docs, none of these calls throw
+ * exceptions */
 notmuch_private_status_t
 _notmuch_message_remove_indexed_terms (notmuch_message_t *message)
 {
@@ -727,45 +729,53 @@ _notmuch_message_remove_indexed_terms (notmuch_message_t *message)
 	tag_prefix = _find_prefix ("tag"),
 	type_prefix = _find_prefix ("type");
 
-    for (i = message->doc.termlist_begin ();
-	 i != message->doc.termlist_end (); i++) {
+    /* Make sure we have the data to restore to Xapian*/
+    _notmuch_message_ensure_metadata (message,NULL);
 
-	const std::string term = *i;
+    /* Empirically, it turns out to be faster to remove all the terms,
+     * and add back the ones we want. */
+    message->doc.clear_terms ();
+    message->modified = true;
 
-	if (term.compare (0, type_prefix.size (), type_prefix) == 0)
-	    continue;
+    /* still a mail message */
+    message->doc.add_term (type_prefix + "mail");
 
-	if (term.compare (0, id_prefix.size (), id_prefix) == 0)
-	    continue;
+    /* Put back message-id */
+    message->doc.add_term (id_prefix + message->message_id);
 
-	if (term.compare (0, property_prefix.size (), property_prefix) == 0)
-	    continue;
+    /* Put back non-automatic tags */
+    for (notmuch_tags_t *tags = notmuch_message_get_tags (message);
+	 notmuch_tags_valid (tags);
+	 notmuch_tags_move_to_next (tags)) {
 
-	if (term.compare (0, tag_prefix.size (), tag_prefix) == 0 &&
-	    term.compare (1, strlen("encrypted"), "encrypted") != 0 &&
-	    term.compare (1, strlen("signed"), "signed") != 0 &&
-	    term.compare (1, strlen("attachment"), "attachment") != 0)
-	    continue;
+	const char *tag = notmuch_tags_get (tags);
 
-	try {
-	    message->doc.remove_term ((*i));
-	    message->modified = true;
-	} catch (const Xapian::InvalidArgumentError) {
-	    /* Ignore failure to remove non-existent term. */
-	} catch (const Xapian::Error &error) {
-	    notmuch_database_t *notmuch = message->notmuch;
-
-	    if (!notmuch->exception_reported) {
-		_notmuch_database_log(notmuch_message_get_database (message), "A Xapian exception occurred creating message: %s\n",
-				      error.get_msg().c_str());
-		notmuch->exception_reported = true;
-	    }
-	    return NOTMUCH_PRIVATE_STATUS_XAPIAN_EXCEPTION;
+	if (STRNCMP_LITERAL (tag, "encrypted") != 0 &&
+	    STRNCMP_LITERAL (tag, "signed") != 0 &&
+	    STRNCMP_LITERAL (tag, "attachment") != 0) {
+	    std::string term = tag_prefix + tag;
+	    message->doc.add_term(term);
 	}
     }
+
+    /* Put back properties */
+    notmuch_message_properties_t *list;
+
+    for (list = notmuch_message_get_properties (message, "", false);
+	 notmuch_message_properties_valid (list); notmuch_message_properties_move_to_next (list)) {
+	std::string term = property_prefix +
+	    notmuch_message_properties_key(list) + "=" +
+	    notmuch_message_properties_value(list);
+
+	message->doc.add_term(term);
+    }
+
+    notmuch_message_properties_destroy (list);
+
     return NOTMUCH_PRIVATE_STATUS_SUCCESS;
 }
 
+
 /* Return true if p points at "new" or "cur". */
 static bool is_maildir (const char *p)
 {
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] CLI/reindex: fix memory leak
  2019-04-16  1:46 ` [PATCH 1/2] CLI/reindex: fix memory leak David Bremner
@ 2019-04-17 17:04   ` Tomi Ollila
  2019-04-19  2:22   ` David Bremner
  1 sibling, 0 replies; 7+ messages in thread
From: Tomi Ollila @ 2019-04-17 17:04 UTC (permalink / raw)
  To: David Bremner, notmuch

On Mon, Apr 15 2019, David Bremner wrote:

> Since message is owned by messages, it was held for the entire run of
> the program. This in turn means that the Xapian::Document objects are
> not freed, and thus one ends up with (effectively) a copy of one's
> entire mailstore in memory when running
>
>        notmuch reindex '*'
>
> Thanks to Olly Betts for the patient help debugging, and the
> suggestion of a fix.
> ---
>  notmuch-reindex.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/notmuch-reindex.c b/notmuch-reindex.c
> index d8589120..fefa7c63 100644
> --- a/notmuch-reindex.c
> +++ b/notmuch-reindex.c
> @@ -71,6 +71,7 @@ reindex_query (notmuch_database_t *notmuch, const char *query_string,
>  	ret = notmuch_message_reindex(message, indexopts);
>  	if (ret != NOTMUCH_STATUS_SUCCESS)
>  	    break;
> +	notmuch_message_destroy (message);

This particular change looks trivial to me...

Tomi

>      }
>  
>      if (!ret)
> -- 
> 2.20.1

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] CLI/reindex: fix memory leak
  2019-04-16  1:46 ` [PATCH 1/2] CLI/reindex: fix memory leak David Bremner
  2019-04-17 17:04   ` Tomi Ollila
@ 2019-04-19  2:22   ` David Bremner
  1 sibling, 0 replies; 7+ messages in thread
From: David Bremner @ 2019-04-19  2:22 UTC (permalink / raw)
  To: notmuch

David Bremner <david@tethera.net> writes:

> Since message is owned by messages, it was held for the entire run of
> the program. This in turn means that the Xapian::Document objects are
> not freed, and thus one ends up with (effectively) a copy of one's
> entire mailstore in memory when running

Pushed this one to master. People may be wanting to do a complete
reindex to take advantage of body: searches.

d

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] n_m_remove_indexed_terms: reduce number of Xapian API calls.
  2019-04-16  1:46 ` [PATCH 2/2] n_m_remove_indexed_terms: reduce number of Xapian API calls David Bremner
@ 2019-05-22 11:58   ` David Bremner
  2019-05-23 11:40   ` David Bremner
  1 sibling, 0 replies; 7+ messages in thread
From: David Bremner @ 2019-05-22 11:58 UTC (permalink / raw)
  To: notmuch

David Bremner <david@tethera.net> writes:

> Previously this functioned scanned every term attached to a given
> Xapian document. It turns out we know how to read only the terms we
> need to preserve (and we might have already done so). This commit
> replaces many calls to Xapian::Document::remove_term with one call to
> ::clear_terms, and a (typically much smaller) number of calls to
> ::add_term. Roughly speaking this is based on the assumption that most
> messages have more text than they have tags.
>
> According to the performance test suite, this yields a roughly 40%
> speedup on "notmuch reindex '*'"

I've marked this ready to merge. If you have any feedback, please send
it ASAP.

d

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] n_m_remove_indexed_terms: reduce number of Xapian API calls.
  2019-04-16  1:46 ` [PATCH 2/2] n_m_remove_indexed_terms: reduce number of Xapian API calls David Bremner
  2019-05-22 11:58   ` David Bremner
@ 2019-05-23 11:40   ` David Bremner
  1 sibling, 0 replies; 7+ messages in thread
From: David Bremner @ 2019-05-23 11:40 UTC (permalink / raw)
  To: notmuch

David Bremner <david@tethera.net> writes:

> Previously this functioned scanned every term attached to a given
> Xapian document. It turns out we know how to read only the terms we
> need to preserve (and we might have already done so). This commit
> replaces many calls to Xapian::Document::remove_term with one call to
> ::clear_terms, and a (typically much smaller) number of calls to
> ::add_term. Roughly speaking this is based on the assumption that most
> messages have more text than they have tags.

pushed to master.

d

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2019-05-23 11:40 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-16  1:46 reindex improvements David Bremner
2019-04-16  1:46 ` [PATCH 1/2] CLI/reindex: fix memory leak David Bremner
2019-04-17 17:04   ` Tomi Ollila
2019-04-19  2:22   ` David Bremner
2019-04-16  1:46 ` [PATCH 2/2] n_m_remove_indexed_terms: reduce number of Xapian API calls David Bremner
2019-05-22 11:58   ` David Bremner
2019-05-23 11:40   ` David Bremner

Code repositories for project(s) associated with this public inbox

	https://yhetil.org/notmuch.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).