unofficial mirror of notmuch@notmuchmail.org
 help / color / mirror / code / Atom feed
* v3 of index multiple files per msg-id, add reindex command
@ 2017-06-04 12:32 David Bremner
  2017-06-04 12:32 ` [patch v3 01/12] lib: isolate n_d_add_message and helper functions into own file David Bremner
                   ` (11 more replies)
  0 siblings, 12 replies; 17+ messages in thread
From: David Bremner @ 2017-06-04 12:32 UTC (permalink / raw)
  To: notmuch, notmuch

this obsoletes

     id:20170507124012.30188-1-david@tethera.net

As with the previous versions:

,----
| WARNING: reindexing is an intrusive operation. I don't think this will
| corrupt your database, but previous versions thrashed threading pretty
| well. notmuch-dump is your friend.
`----

Compared to the last version this rebases against master (and updates
a couple of tests merged in the last month to handle the summary
format changes) and documents the summary format.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [patch v3 01/12] lib: isolate n_d_add_message and helper functions into own file
  2017-06-04 12:32 v3 of index multiple files per msg-id, add reindex command David Bremner
@ 2017-06-04 12:32 ` David Bremner
  2017-06-04 12:32 ` [patch v3 02/12] lib/n_d_add_message: refactor test for new/ghost messages David Bremner
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: David Bremner @ 2017-06-04 12:32 UTC (permalink / raw)
  To: notmuch, notmuch

'database.cc' is becoming a monster, and it's hard to follow what the
various static functions are used for. It turns out that about 1/3 of
this file notmuch_database_add_message and helper functions not used
by any other function. This commit isolates this code into it's own
file.

Some side effects of this refactoring:

- find_doc_ids becomes the non-static (but still private)
  _notmuch_database_find_doc_ids
- a few instances of 'string' have 'std::' prepended, avoiding the
  need for 'using namespace std;' in the new file.
---
 lib/Makefile.local     |   1 +
 lib/add-message.cc     | 721 ++++++++++++++++++++++++++++++++++++++++++++++++
 lib/database-private.h |  10 +
 lib/database.cc        | 732 +------------------------------------------------
 4 files changed, 739 insertions(+), 725 deletions(-)
 create mode 100644 lib/add-message.cc

diff --git a/lib/Makefile.local b/lib/Makefile.local
index bf6e0649..9dd68286 100644
--- a/lib/Makefile.local
+++ b/lib/Makefile.local
@@ -50,6 +50,7 @@ libnotmuch_cxx_srcs =		\
 	$(dir)/directory.cc	\
 	$(dir)/index.cc		\
 	$(dir)/message.cc	\
+	$(dir)/add-message.cc	\
 	$(dir)/message-property.cc \
 	$(dir)/query.cc		\
 	$(dir)/query-fp.cc      \
diff --git a/lib/add-message.cc b/lib/add-message.cc
new file mode 100644
index 00000000..5fe2c45b
--- /dev/null
+++ b/lib/add-message.cc
@@ -0,0 +1,721 @@
+#include "database-private.h"
+
+/* Advance 'str' past any whitespace or RFC 822 comments. A comment is
+ * a (potentially nested) parenthesized sequence with '\' used to
+ * escape any character (including parentheses).
+ *
+ * If the sequence to be skipped continues to the end of the string,
+ * then 'str' will be left pointing at the final terminating '\0'
+ * character.
+ */
+static void
+skip_space_and_comments (const char **str)
+{
+    const char *s;
+
+    s = *str;
+    while (*s && (isspace (*s) || *s == '(')) {
+	while (*s && isspace (*s))
+	    s++;
+	if (*s == '(') {
+	    int nesting = 1;
+	    s++;
+	    while (*s && nesting) {
+		if (*s == '(') {
+		    nesting++;
+		} else if (*s == ')') {
+		    nesting--;
+		} else if (*s == '\\') {
+		    if (*(s+1))
+			s++;
+		}
+		s++;
+	    }
+	}
+    }
+
+    *str = s;
+}
+
+/* Parse an RFC 822 message-id, discarding whitespace, any RFC 822
+ * comments, and the '<' and '>' delimiters.
+ *
+ * If not NULL, then *next will be made to point to the first character
+ * not parsed, (possibly pointing to the final '\0' terminator.
+ *
+ * Returns a newly talloc'ed string belonging to 'ctx'.
+ *
+ * Returns NULL if there is any error parsing the message-id. */
+static char *
+_parse_message_id (void *ctx, const char *message_id, const char **next)
+{
+    const char *s, *end;
+    char *result;
+
+    if (message_id == NULL || *message_id == '\0')
+	return NULL;
+
+    s = message_id;
+
+    skip_space_and_comments (&s);
+
+    /* Skip any unstructured text as well. */
+    while (*s && *s != '<')
+	s++;
+
+    if (*s == '<') {
+	s++;
+    } else {
+	if (next)
+	    *next = s;
+	return NULL;
+    }
+
+    skip_space_and_comments (&s);
+
+    end = s;
+    while (*end && *end != '>')
+	end++;
+    if (next) {
+	if (*end)
+	    *next = end + 1;
+	else
+	    *next = end;
+    }
+
+    if (end > s && *end == '>')
+	end--;
+    if (end <= s)
+	return NULL;
+
+    result = talloc_strndup (ctx, s, end - s + 1);
+
+    /* Finally, collapse any whitespace that is within the message-id
+     * itself. */
+    {
+	char *r;
+	int len;
+
+	for (r = result, len = strlen (r); *r; r++, len--)
+	    if (*r == ' ' || *r == '\t')
+		memmove (r, r+1, len);
+    }
+
+    return result;
+}
+
+/* Parse a References header value, putting a (talloc'ed under 'ctx')
+ * copy of each referenced message-id into 'hash'.
+ *
+ * We explicitly avoid including any reference identical to
+ * 'message_id' in the result (to avoid mass confusion when a single
+ * message references itself cyclically---and yes, mail messages are
+ * not infrequent in the wild that do this---don't ask me why).
+ *
+ * Return the last reference parsed, if it is not equal to message_id.
+ */
+static char *
+parse_references (void *ctx,
+		  const char *message_id,
+		  GHashTable *hash,
+		  const char *refs)
+{
+    char *ref, *last_ref = NULL;
+
+    if (refs == NULL || *refs == '\0')
+	return NULL;
+
+    while (*refs) {
+	ref = _parse_message_id (ctx, refs, &refs);
+
+	if (ref && strcmp (ref, message_id)) {
+	    g_hash_table_add (hash, ref);
+	    last_ref = ref;
+	}
+    }
+
+    /* The return value of this function is used to add a parent
+     * reference to the database.  We should avoid making a message
+     * its own parent, thus the above check.
+     */
+    return talloc_strdup(ctx, last_ref);
+}
+
+static const char *
+_notmuch_database_generate_thread_id (notmuch_database_t *notmuch)
+{
+    /* 16 bytes (+ terminator) for hexadecimal representation of
+     * a 64-bit integer. */
+    static char thread_id[17];
+    Xapian::WritableDatabase *db;
+
+    db = static_cast <Xapian::WritableDatabase *> (notmuch->xapian_db);
+
+    notmuch->last_thread_id++;
+
+    sprintf (thread_id, "%016" PRIx64, notmuch->last_thread_id);
+
+    db->set_metadata ("last_thread_id", thread_id);
+
+    return thread_id;
+}
+
+static char *
+_get_metadata_thread_id_key (void *ctx, const char *message_id)
+{
+    if (strlen (message_id) > NOTMUCH_MESSAGE_ID_MAX)
+	message_id = _notmuch_message_id_compressed (ctx, message_id);
+
+    return talloc_asprintf (ctx, NOTMUCH_METADATA_THREAD_ID_PREFIX "%s",
+			    message_id);
+}
+
+
+static notmuch_status_t
+_resolve_message_id_to_thread_id_old (notmuch_database_t *notmuch,
+				      void *ctx,
+				      const char *message_id,
+				      const char **thread_id_ret);
+
+
+/* Find the thread ID to which the message with 'message_id' belongs.
+ *
+ * Note: 'thread_id_ret' must not be NULL!
+ * On success '*thread_id_ret' is set to a newly talloced string belonging to
+ * 'ctx'.
+ *
+ * Note: If there is no message in the database with the given
+ * 'message_id' then a new thread_id will be allocated for this
+ * message ID and stored in the database metadata so that the
+ * thread ID can be looked up if the message is added to the database
+ * later.
+ */
+static notmuch_status_t
+_resolve_message_id_to_thread_id (notmuch_database_t *notmuch,
+				  void *ctx,
+				  const char *message_id,
+				  const char **thread_id_ret)
+{
+    notmuch_private_status_t status;
+    notmuch_message_t *message;
+
+    if (! (notmuch->features & NOTMUCH_FEATURE_GHOSTS))
+	return _resolve_message_id_to_thread_id_old (notmuch, ctx, message_id,
+						     thread_id_ret);
+
+    /* Look for this message (regular or ghost) */
+    message = _notmuch_message_create_for_message_id (
+	notmuch, message_id, &status);
+    if (status == NOTMUCH_PRIVATE_STATUS_SUCCESS) {
+	/* Message exists */
+	*thread_id_ret = talloc_steal (
+	    ctx, notmuch_message_get_thread_id (message));
+    } else if (status == NOTMUCH_PRIVATE_STATUS_NO_DOCUMENT_FOUND) {
+	/* Message did not exist.  Give it a fresh thread ID and
+	 * populate this message as a ghost message. */
+	*thread_id_ret = talloc_strdup (
+	    ctx, _notmuch_database_generate_thread_id (notmuch));
+	if (! *thread_id_ret) {
+	    status = NOTMUCH_PRIVATE_STATUS_OUT_OF_MEMORY;
+	} else {
+	    status = _notmuch_message_initialize_ghost (message, *thread_id_ret);
+	    if (status == 0)
+		/* Commit the new ghost message */
+		_notmuch_message_sync (message);
+	}
+    } else {
+	/* Create failed. Fall through. */
+    }
+
+    notmuch_message_destroy (message);
+
+    return COERCE_STATUS (status, "Error creating ghost message");
+}
+
+/* Pre-ghost messages _resolve_message_id_to_thread_id */
+static notmuch_status_t
+_resolve_message_id_to_thread_id_old (notmuch_database_t *notmuch,
+				      void *ctx,
+				      const char *message_id,
+				      const char **thread_id_ret)
+{
+    notmuch_status_t status;
+    notmuch_message_t *message;
+    std::string thread_id_string;
+    char *metadata_key;
+    Xapian::WritableDatabase *db;
+
+    status = notmuch_database_find_message (notmuch, message_id, &message);
+
+    if (status)
+	return status;
+
+    if (message) {
+	*thread_id_ret = talloc_steal (ctx,
+				       notmuch_message_get_thread_id (message));
+
+	notmuch_message_destroy (message);
+
+	return NOTMUCH_STATUS_SUCCESS;
+    }
+
+    /* Message has not been seen yet.
+     *
+     * We may have seen a reference to it already, in which case, we
+     * can return the thread ID stored in the metadata. Otherwise, we
+     * generate a new thread ID and store it there.
+     */
+    db = static_cast <Xapian::WritableDatabase *> (notmuch->xapian_db);
+    metadata_key = _get_metadata_thread_id_key (ctx, message_id);
+    thread_id_string = notmuch->xapian_db->get_metadata (metadata_key);
+
+    if (thread_id_string.empty()) {
+	*thread_id_ret = talloc_strdup (ctx,
+					_notmuch_database_generate_thread_id (notmuch));
+	db->set_metadata (metadata_key, *thread_id_ret);
+    } else {
+	*thread_id_ret = talloc_strdup (ctx, thread_id_string.c_str());
+    }
+
+    talloc_free (metadata_key);
+
+    return NOTMUCH_STATUS_SUCCESS;
+}
+
+static notmuch_status_t
+_merge_threads (notmuch_database_t *notmuch,
+		const char *winner_thread_id,
+		const char *loser_thread_id)
+{
+    Xapian::PostingIterator loser, loser_end;
+    notmuch_message_t *message = NULL;
+    notmuch_private_status_t private_status;
+    notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
+
+    _notmuch_database_find_doc_ids (notmuch, "thread", loser_thread_id, &loser, &loser_end);
+
+    for ( ; loser != loser_end; loser++) {
+	message = _notmuch_message_create (notmuch, notmuch,
+					   *loser, &private_status);
+	if (message == NULL) {
+	    ret = COERCE_STATUS (private_status,
+				 "Cannot find document for doc_id from query");
+	    goto DONE;
+	}
+
+	_notmuch_message_remove_term (message, "thread", loser_thread_id);
+	_notmuch_message_add_term (message, "thread", winner_thread_id);
+	_notmuch_message_sync (message);
+
+	notmuch_message_destroy (message);
+	message = NULL;
+    }
+
+  DONE:
+    if (message)
+	notmuch_message_destroy (message);
+
+    return ret;
+}
+
+static void
+_my_talloc_free_for_g_hash (void *ptr)
+{
+    talloc_free (ptr);
+}
+
+static notmuch_status_t
+_notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
+					   notmuch_message_t *message,
+					   notmuch_message_file_t *message_file,
+					   const char **thread_id)
+{
+    GHashTable *parents = NULL;
+    const char *refs, *in_reply_to, *in_reply_to_message_id;
+    const char *last_ref_message_id, *this_message_id;
+    GList *l, *keys = NULL;
+    notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
+
+    parents = g_hash_table_new_full (g_str_hash, g_str_equal,
+				     _my_talloc_free_for_g_hash, NULL);
+    this_message_id = notmuch_message_get_message_id (message);
+
+    refs = _notmuch_message_file_get_header (message_file, "references");
+    last_ref_message_id = parse_references (message,
+					    this_message_id,
+					    parents, refs);
+
+    in_reply_to = _notmuch_message_file_get_header (message_file, "in-reply-to");
+    in_reply_to_message_id = parse_references (message,
+					       this_message_id,
+					       parents, in_reply_to);
+
+    /* For the parent of this message, use the last message ID of the
+     * References header, if available.  If not, fall back to the
+     * first message ID in the In-Reply-To header. */
+    if (last_ref_message_id) {
+	_notmuch_message_add_term (message, "replyto",
+				   last_ref_message_id);
+    } else if (in_reply_to_message_id) {
+	_notmuch_message_add_term (message, "replyto",
+			     in_reply_to_message_id);
+    }
+
+    keys = g_hash_table_get_keys (parents);
+    for (l = keys; l; l = l->next) {
+	char *parent_message_id;
+	const char *parent_thread_id = NULL;
+
+	parent_message_id = (char *) l->data;
+
+	_notmuch_message_add_term (message, "reference",
+				   parent_message_id);
+
+	ret = _resolve_message_id_to_thread_id (notmuch,
+						message,
+						parent_message_id,
+						&parent_thread_id);
+	if (ret)
+	    goto DONE;
+
+	if (*thread_id == NULL) {
+	    *thread_id = talloc_strdup (message, parent_thread_id);
+	    _notmuch_message_add_term (message, "thread", *thread_id);
+	} else if (strcmp (*thread_id, parent_thread_id)) {
+	    ret = _merge_threads (notmuch, *thread_id, parent_thread_id);
+	    if (ret)
+		goto DONE;
+	}
+    }
+
+  DONE:
+    if (keys)
+	g_list_free (keys);
+    if (parents)
+	g_hash_table_unref (parents);
+
+    return ret;
+}
+
+static notmuch_status_t
+_notmuch_database_link_message_to_children (notmuch_database_t *notmuch,
+					    notmuch_message_t *message,
+					    const char **thread_id)
+{
+    const char *message_id = notmuch_message_get_message_id (message);
+    Xapian::PostingIterator child, children_end;
+    notmuch_message_t *child_message = NULL;
+    const char *child_thread_id;
+    notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
+    notmuch_private_status_t private_status;
+
+    _notmuch_database_find_doc_ids (notmuch, "reference", message_id, &child, &children_end);
+
+    for ( ; child != children_end; child++) {
+
+	child_message = _notmuch_message_create (message, notmuch,
+						 *child, &private_status);
+	if (child_message == NULL) {
+	    ret = COERCE_STATUS (private_status,
+				 "Cannot find document for doc_id from query");
+	    goto DONE;
+	}
+
+	child_thread_id = notmuch_message_get_thread_id (child_message);
+	if (*thread_id == NULL) {
+	    *thread_id = talloc_strdup (message, child_thread_id);
+	    _notmuch_message_add_term (message, "thread", *thread_id);
+	} else if (strcmp (*thread_id, child_thread_id)) {
+	    _notmuch_message_remove_term (child_message, "reference",
+					  message_id);
+	    _notmuch_message_sync (child_message);
+	    ret = _merge_threads (notmuch, *thread_id, child_thread_id);
+	    if (ret)
+		goto DONE;
+	}
+
+	notmuch_message_destroy (child_message);
+	child_message = NULL;
+    }
+
+  DONE:
+    if (child_message)
+	notmuch_message_destroy (child_message);
+
+    return ret;
+}
+
+/* Fetch and clear the stored thread_id for message, or NULL if none. */
+static char *
+_consume_metadata_thread_id (void *ctx, notmuch_database_t *notmuch,
+			     notmuch_message_t *message)
+{
+    const char *message_id;
+    std::string stored_id;
+    char *metadata_key;
+
+    message_id = notmuch_message_get_message_id (message);
+    metadata_key = _get_metadata_thread_id_key (ctx, message_id);
+
+    /* Check if we have already seen related messages to this one.
+     * If we have then use the thread_id that we stored at that time.
+     */
+    stored_id = notmuch->xapian_db->get_metadata (metadata_key);
+    if (stored_id.empty ()) {
+	return NULL;
+    } else {
+	Xapian::WritableDatabase *db;
+
+	db = static_cast <Xapian::WritableDatabase *> (notmuch->xapian_db);
+
+	/* Clear the metadata for this message ID. We don't need it
+	 * anymore. */
+	db->set_metadata (metadata_key, "");
+
+	return talloc_strdup (ctx, stored_id.c_str ());
+    }
+}
+
+/* Given a blank or ghost 'message' and its corresponding
+ * 'message_file' link it to existing threads in the database.
+ *
+ * First, if is_ghost, this retrieves the thread ID already stored in
+ * the message (which will be the case if a message was previously
+ * added that referenced this one).  If the message is blank
+ * (!is_ghost), it doesn't have a thread ID yet (we'll generate one
+ * later in this function).  If the database does not support ghost
+ * messages, this checks for a thread ID stored in database metadata
+ * for this message ID.
+ *
+ * Second, we look at 'message_file' and its link-relevant headers
+ * (References and In-Reply-To) for message IDs.
+ *
+ * Finally, we look in the database for existing message that
+ * reference 'message'.
+ *
+ * In all cases, we assign to the current message the first thread ID
+ * found. We will also merge any existing, distinct threads where this
+ * message belongs to both, (which is not uncommon when messages are
+ * processed out of order).
+ *
+ * Finally, if no thread ID has been found through referenced messages, we
+ * call _notmuch_message_generate_thread_id to generate a new thread
+ * ID. This should only happen for new, top-level messages, (no
+ * References or In-Reply-To header in this message, and no previously
+ * added message refers to this message).
+ */
+static notmuch_status_t
+_notmuch_database_link_message (notmuch_database_t *notmuch,
+				notmuch_message_t *message,
+				notmuch_message_file_t *message_file,
+				notmuch_bool_t is_ghost)
+{
+    void *local = talloc_new (NULL);
+    notmuch_status_t status;
+    const char *thread_id = NULL;
+
+    /* Check if the message already had a thread ID */
+    if (notmuch->features & NOTMUCH_FEATURE_GHOSTS) {
+	if (is_ghost)
+	    thread_id = notmuch_message_get_thread_id (message);
+    } else {
+	thread_id = _consume_metadata_thread_id (local, notmuch, message);
+	if (thread_id)
+	    _notmuch_message_add_term (message, "thread", thread_id);
+    }
+
+    status = _notmuch_database_link_message_to_parents (notmuch, message,
+							message_file,
+							&thread_id);
+    if (status)
+	goto DONE;
+
+    if (! (notmuch->features & NOTMUCH_FEATURE_GHOSTS)) {
+	/* In general, it shouldn't be necessary to link children,
+	 * since the earlier indexing of those children will have
+	 * stored a thread ID for the missing parent.  However, prior
+	 * to ghost messages, these stored thread IDs were NOT
+	 * rewritten during thread merging (and there was no
+	 * performant way to do so), so if indexed children were
+	 * pulled into a different thread ID by a merge, it was
+	 * necessary to pull them *back* into the stored thread ID of
+	 * the parent.  With ghost messages, we just rewrite the
+	 * stored thread IDs during merging, so this workaround isn't
+	 * necessary. */
+	status = _notmuch_database_link_message_to_children (notmuch, message,
+							     &thread_id);
+	if (status)
+	    goto DONE;
+    }
+
+    /* If not part of any existing thread, generate a new thread ID. */
+    if (thread_id == NULL) {
+	thread_id = _notmuch_database_generate_thread_id (notmuch);
+
+	_notmuch_message_add_term (message, "thread", thread_id);
+    }
+
+ DONE:
+    talloc_free (local);
+
+    return status;
+}
+
+notmuch_status_t
+notmuch_database_add_message (notmuch_database_t *notmuch,
+			      const char *filename,
+			      notmuch_message_t **message_ret)
+{
+    notmuch_message_file_t *message_file;
+    notmuch_message_t *message = NULL;
+    notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS, ret2;
+    notmuch_private_status_t private_status;
+    notmuch_bool_t is_ghost = false;
+
+    const char *date, *header;
+    const char *from, *to, *subject;
+    char *message_id = NULL;
+
+    if (message_ret)
+	*message_ret = NULL;
+
+    ret = _notmuch_database_ensure_writable (notmuch);
+    if (ret)
+	return ret;
+
+    message_file = _notmuch_message_file_open (notmuch, filename);
+    if (message_file == NULL)
+	return NOTMUCH_STATUS_FILE_ERROR;
+
+    /* Adding a message may change many documents.  Do this all
+     * atomically. */
+    ret = notmuch_database_begin_atomic (notmuch);
+    if (ret)
+	goto DONE;
+
+    /* Parse message up front to get better error status. */
+    ret = _notmuch_message_file_parse (message_file);
+    if (ret)
+	goto DONE;
+
+    /* Before we do any real work, (especially before doing a
+     * potential SHA-1 computation on the entire file's contents),
+     * let's make sure that what we're looking at looks like an
+     * actual email message.
+     */
+    from = _notmuch_message_file_get_header (message_file, "from");
+    subject = _notmuch_message_file_get_header (message_file, "subject");
+    to = _notmuch_message_file_get_header (message_file, "to");
+
+    if ((from == NULL || *from == '\0') &&
+	(subject == NULL || *subject == '\0') &&
+	(to == NULL || *to == '\0')) {
+	ret = NOTMUCH_STATUS_FILE_NOT_EMAIL;
+	goto DONE;
+    }
+
+    /* Now that we're sure it's mail, the first order of business
+     * is to find a message ID (or else create one ourselves).
+     */
+    header = _notmuch_message_file_get_header (message_file, "message-id");
+    if (header && *header != '\0') {
+	message_id = _parse_message_id (message_file, header, NULL);
+
+	/* So the header value isn't RFC-compliant, but it's
+	 * better than no message-id at all.
+	 */
+	if (message_id == NULL)
+	    message_id = talloc_strdup (message_file, header);
+    }
+
+    if (message_id == NULL ) {
+	/* No message-id at all, let's generate one by taking a
+	 * hash over the file's contents.
+	 */
+	char *sha1 = _notmuch_sha1_of_file (filename);
+
+	/* If that failed too, something is really wrong. Give up. */
+	if (sha1 == NULL) {
+	    ret = NOTMUCH_STATUS_FILE_ERROR;
+	    goto DONE;
+	}
+
+	message_id = talloc_asprintf (message_file, "notmuch-sha1-%s", sha1);
+	free (sha1);
+    }
+
+    try {
+	/* Now that we have a message ID, we get a message object,
+	 * (which may or may not reference an existing document in the
+	 * database). */
+
+	message = _notmuch_message_create_for_message_id (notmuch,
+							  message_id,
+							  &private_status);
+
+	talloc_free (message_id);
+
+	if (message == NULL) {
+	    ret = COERCE_STATUS (private_status,
+				 "Unexpected status value from _notmuch_message_create_for_message_id");
+	    goto DONE;
+	}
+
+	_notmuch_message_add_filename (message, filename);
+
+	/* Is this a newly created message object or a ghost
+	 * message?  We have to be slightly careful: if this is a
+	 * blank message, it's not safe to call
+	 * notmuch_message_get_flag yet. */
+	if (private_status == NOTMUCH_PRIVATE_STATUS_NO_DOCUMENT_FOUND ||
+	    (is_ghost = notmuch_message_get_flag (
+		message, NOTMUCH_MESSAGE_FLAG_GHOST))) {
+	    _notmuch_message_add_term (message, "type", "mail");
+	    if (is_ghost)
+		/* Convert ghost message to a regular message */
+		_notmuch_message_remove_term (message, "type", "ghost");
+
+	    ret = _notmuch_database_link_message (notmuch, message,
+						  message_file, is_ghost);
+	    if (ret)
+		goto DONE;
+
+	    date = _notmuch_message_file_get_header (message_file, "date");
+	    _notmuch_message_set_header_values (message, date, from, subject);
+
+	    ret = _notmuch_message_index_file (message, message_file);
+	    if (ret)
+		goto DONE;
+	} else {
+	    ret = NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
+	}
+
+	_notmuch_message_sync (message);
+    } catch (const Xapian::Error &error) {
+	_notmuch_database_log (notmuch, "A Xapian exception occurred adding message: %s.\n",
+		 error.get_msg().c_str());
+	notmuch->exception_reported = TRUE;
+	ret = NOTMUCH_STATUS_XAPIAN_EXCEPTION;
+	goto DONE;
+    }
+
+  DONE:
+    if (message) {
+	if ((ret == NOTMUCH_STATUS_SUCCESS ||
+	     ret == NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID) && message_ret)
+	    *message_ret = message;
+	else
+	    notmuch_message_destroy (message);
+    }
+
+    if (message_file)
+	_notmuch_message_file_close (message_file);
+
+    ret2 = notmuch_database_end_atomic (notmuch);
+    if ((ret == NOTMUCH_STATUS_SUCCESS ||
+	 ret == NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID) &&
+	ret2 != NOTMUCH_STATUS_SUCCESS)
+	ret = ret2;
+
+    return ret;
+}
diff --git a/lib/database-private.h b/lib/database-private.h
index 727b1d61..504618b9 100644
--- a/lib/database-private.h
+++ b/lib/database-private.h
@@ -246,4 +246,14 @@ _notmuch_database_get_terms_with_prefix (void *ctx, Xapian::TermIterator &i,
 					 Xapian::TermIterator &end,
 					 const char *prefix);
 
+void
+_notmuch_database_find_doc_ids (notmuch_database_t *notmuch,
+				const char *prefix_name,
+				const char *value,
+				Xapian::PostingIterator *begin,
+				Xapian::PostingIterator *end);
+
+
+#pragma GCC visibility pop
+
 #endif
diff --git a/lib/database.cc b/lib/database.cc
index 5b13f541..8f0e22a8 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -463,12 +463,12 @@ find_doc_ids_for_term (notmuch_database_t *notmuch,
     *end = notmuch->xapian_db->postlist_end (term);
 }
 
-static void
-find_doc_ids (notmuch_database_t *notmuch,
-	      const char *prefix_name,
-	      const char *value,
-	      Xapian::PostingIterator *begin,
-	      Xapian::PostingIterator *end)
+void
+_notmuch_database_find_doc_ids (notmuch_database_t *notmuch,
+				const char *prefix_name,
+				const char *value,
+				Xapian::PostingIterator *begin,
+				Xapian::PostingIterator *end)
 {
     char *term;
 
@@ -488,7 +488,7 @@ _notmuch_database_find_unique_doc_id (notmuch_database_t *notmuch,
 {
     Xapian::PostingIterator i, end;
 
-    find_doc_ids (notmuch, prefix_name, value, &i, &end);
+    _notmuch_database_find_doc_ids (notmuch, prefix_name, value, &i, &end);
 
     if (i == end) {
 	*doc_id = 0;
@@ -568,147 +568,6 @@ notmuch_database_find_message (notmuch_database_t *notmuch,
     }
 }
 
-/* Advance 'str' past any whitespace or RFC 822 comments. A comment is
- * a (potentially nested) parenthesized sequence with '\' used to
- * escape any character (including parentheses).
- *
- * If the sequence to be skipped continues to the end of the string,
- * then 'str' will be left pointing at the final terminating '\0'
- * character.
- */
-static void
-skip_space_and_comments (const char **str)
-{
-    const char *s;
-
-    s = *str;
-    while (*s && (isspace (*s) || *s == '(')) {
-	while (*s && isspace (*s))
-	    s++;
-	if (*s == '(') {
-	    int nesting = 1;
-	    s++;
-	    while (*s && nesting) {
-		if (*s == '(') {
-		    nesting++;
-		} else if (*s == ')') {
-		    nesting--;
-		} else if (*s == '\\') {
-		    if (*(s+1))
-			s++;
-		}
-		s++;
-	    }
-	}
-    }
-
-    *str = s;
-}
-
-/* Parse an RFC 822 message-id, discarding whitespace, any RFC 822
- * comments, and the '<' and '>' delimiters.
- *
- * If not NULL, then *next will be made to point to the first character
- * not parsed, (possibly pointing to the final '\0' terminator.
- *
- * Returns a newly talloc'ed string belonging to 'ctx'.
- *
- * Returns NULL if there is any error parsing the message-id. */
-static char *
-_parse_message_id (void *ctx, const char *message_id, const char **next)
-{
-    const char *s, *end;
-    char *result;
-
-    if (message_id == NULL || *message_id == '\0')
-	return NULL;
-
-    s = message_id;
-
-    skip_space_and_comments (&s);
-
-    /* Skip any unstructured text as well. */
-    while (*s && *s != '<')
-	s++;
-
-    if (*s == '<') {
-	s++;
-    } else {
-	if (next)
-	    *next = s;
-	return NULL;
-    }
-
-    skip_space_and_comments (&s);
-
-    end = s;
-    while (*end && *end != '>')
-	end++;
-    if (next) {
-	if (*end)
-	    *next = end + 1;
-	else
-	    *next = end;
-    }
-
-    if (end > s && *end == '>')
-	end--;
-    if (end <= s)
-	return NULL;
-
-    result = talloc_strndup (ctx, s, end - s + 1);
-
-    /* Finally, collapse any whitespace that is within the message-id
-     * itself. */
-    {
-	char *r;
-	int len;
-
-	for (r = result, len = strlen (r); *r; r++, len--)
-	    if (*r == ' ' || *r == '\t')
-		memmove (r, r+1, len);
-    }
-
-    return result;
-}
-
-/* Parse a References header value, putting a (talloc'ed under 'ctx')
- * copy of each referenced message-id into 'hash'.
- *
- * We explicitly avoid including any reference identical to
- * 'message_id' in the result (to avoid mass confusion when a single
- * message references itself cyclically---and yes, mail messages are
- * not infrequent in the wild that do this---don't ask me why).
- *
- * Return the last reference parsed, if it is not equal to message_id.
- */
-static char *
-parse_references (void *ctx,
-		  const char *message_id,
-		  GHashTable *hash,
-		  const char *refs)
-{
-    char *ref, *last_ref = NULL;
-
-    if (refs == NULL || *refs == '\0')
-	return NULL;
-
-    while (*refs) {
-	ref = _parse_message_id (ctx, refs, &refs);
-
-	if (ref && strcmp (ref, message_id)) {
-	    g_hash_table_add (hash, ref);
-	    last_ref = ref;
-	}
-    }
-
-    /* The return value of this function is used to add a parent
-     * reference to the database.  We should avoid making a message
-     * its own parent, thus the above check.
-     */
-    return talloc_strdup(ctx, last_ref);
-}
-
 notmuch_status_t
 notmuch_database_create (const char *path, notmuch_database_t **database)
 {
@@ -2044,583 +1903,6 @@ _notmuch_database_generate_doc_id (notmuch_database_t *notmuch)
     return notmuch->last_doc_id;
 }
 
-static const char *
-_notmuch_database_generate_thread_id (notmuch_database_t *notmuch)
-{
-    /* 16 bytes (+ terminator) for hexadecimal representation of
-     * a 64-bit integer. */
-    static char thread_id[17];
-    Xapian::WritableDatabase *db;
-
-    db = static_cast <Xapian::WritableDatabase *> (notmuch->xapian_db);
-
-    notmuch->last_thread_id++;
-
-    sprintf (thread_id, "%016" PRIx64, notmuch->last_thread_id);
-
-    db->set_metadata ("last_thread_id", thread_id);
-
-    return thread_id;
-}
-
-static char *
-_get_metadata_thread_id_key (void *ctx, const char *message_id)
-{
-    if (strlen (message_id) > NOTMUCH_MESSAGE_ID_MAX)
-	message_id = _notmuch_message_id_compressed (ctx, message_id);
-
-    return talloc_asprintf (ctx, NOTMUCH_METADATA_THREAD_ID_PREFIX "%s",
-			    message_id);
-}
-
-static notmuch_status_t
-_resolve_message_id_to_thread_id_old (notmuch_database_t *notmuch,
-				      void *ctx,
-				      const char *message_id,
-				      const char **thread_id_ret);
-
-/* Find the thread ID to which the message with 'message_id' belongs.
- *
- * Note: 'thread_id_ret' must not be NULL!
- * On success '*thread_id_ret' is set to a newly talloced string belonging to
- * 'ctx'.
- *
- * Note: If there is no message in the database with the given
- * 'message_id' then a new thread_id will be allocated for this
- * message ID and stored in the database metadata so that the
- * thread ID can be looked up if the message is added to the database
- * later.
- */
-static notmuch_status_t
-_resolve_message_id_to_thread_id (notmuch_database_t *notmuch,
-				  void *ctx,
-				  const char *message_id,
-				  const char **thread_id_ret)
-{
-    notmuch_private_status_t status;
-    notmuch_message_t *message;
-
-    if (! (notmuch->features & NOTMUCH_FEATURE_GHOSTS))
-	return _resolve_message_id_to_thread_id_old (notmuch, ctx, message_id,
-						     thread_id_ret);
-
-    /* Look for this message (regular or ghost) */
-    message = _notmuch_message_create_for_message_id (
-	notmuch, message_id, &status);
-    if (status == NOTMUCH_PRIVATE_STATUS_SUCCESS) {
-	/* Message exists */
-	*thread_id_ret = talloc_steal (
-	    ctx, notmuch_message_get_thread_id (message));
-    } else if (status == NOTMUCH_PRIVATE_STATUS_NO_DOCUMENT_FOUND) {
-	/* Message did not exist.  Give it a fresh thread ID and
-	 * populate this message as a ghost message. */
-	*thread_id_ret = talloc_strdup (
-	    ctx, _notmuch_database_generate_thread_id (notmuch));
-	if (! *thread_id_ret) {
-	    status = NOTMUCH_PRIVATE_STATUS_OUT_OF_MEMORY;
-	} else {
-	    status = _notmuch_message_initialize_ghost (message, *thread_id_ret);
-	    if (status == 0)
-		/* Commit the new ghost message */
-		_notmuch_message_sync (message);
-	}
-    } else {
-	/* Create failed. Fall through. */
-    }
-
-    notmuch_message_destroy (message);
-
-    return COERCE_STATUS (status, "Error creating ghost message");
-}
-
-/* Pre-ghost messages _resolve_message_id_to_thread_id */
-static notmuch_status_t
-_resolve_message_id_to_thread_id_old (notmuch_database_t *notmuch,
-				      void *ctx,
-				      const char *message_id,
-				      const char **thread_id_ret)
-{
-    notmuch_status_t status;
-    notmuch_message_t *message;
-    string thread_id_string;
-    char *metadata_key;
-    Xapian::WritableDatabase *db;
-
-    status = notmuch_database_find_message (notmuch, message_id, &message);
-
-    if (status)
-	return status;
-
-    if (message) {
-	*thread_id_ret = talloc_steal (ctx,
-				       notmuch_message_get_thread_id (message));
-
-	notmuch_message_destroy (message);
-
-	return NOTMUCH_STATUS_SUCCESS;
-    }
-
-    /* Message has not been seen yet.
-     *
-     * We may have seen a reference to it already, in which case, we
-     * can return the thread ID stored in the metadata. Otherwise, we
-     * generate a new thread ID and store it there.
-     */
-    db = static_cast <Xapian::WritableDatabase *> (notmuch->xapian_db);
-    metadata_key = _get_metadata_thread_id_key (ctx, message_id);
-    thread_id_string = notmuch->xapian_db->get_metadata (metadata_key);
-
-    if (thread_id_string.empty()) {
-	*thread_id_ret = talloc_strdup (ctx,
-					_notmuch_database_generate_thread_id (notmuch));
-	db->set_metadata (metadata_key, *thread_id_ret);
-    } else {
-	*thread_id_ret = talloc_strdup (ctx, thread_id_string.c_str());
-    }
-
-    talloc_free (metadata_key);
-
-    return NOTMUCH_STATUS_SUCCESS;
-}
-
-static notmuch_status_t
-_merge_threads (notmuch_database_t *notmuch,
-		const char *winner_thread_id,
-		const char *loser_thread_id)
-{
-    Xapian::PostingIterator loser, loser_end;
-    notmuch_message_t *message = NULL;
-    notmuch_private_status_t private_status;
-    notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
-
-    find_doc_ids (notmuch, "thread", loser_thread_id, &loser, &loser_end);
-
-    for ( ; loser != loser_end; loser++) {
-	message = _notmuch_message_create (notmuch, notmuch,
-					   *loser, &private_status);
-	if (message == NULL) {
-	    ret = COERCE_STATUS (private_status,
-				 "Cannot find document for doc_id from query");
-	    goto DONE;
-	}
-
-	_notmuch_message_remove_term (message, "thread", loser_thread_id);
-	_notmuch_message_add_term (message, "thread", winner_thread_id);
-	_notmuch_message_sync (message);
-
-	notmuch_message_destroy (message);
-	message = NULL;
-    }
-
-  DONE:
-    if (message)
-	notmuch_message_destroy (message);
-
-    return ret;
-}
-
-static void
-_my_talloc_free_for_g_hash (void *ptr)
-{
-    talloc_free (ptr);
-}
-
-static notmuch_status_t
-_notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
-					   notmuch_message_t *message,
-					   notmuch_message_file_t *message_file,
-					   const char **thread_id)
-{
-    GHashTable *parents = NULL;
-    const char *refs, *in_reply_to, *in_reply_to_message_id;
-    const char *last_ref_message_id, *this_message_id;
-    GList *l, *keys = NULL;
-    notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
-
-    parents = g_hash_table_new_full (g_str_hash, g_str_equal,
-				     _my_talloc_free_for_g_hash, NULL);
-    this_message_id = notmuch_message_get_message_id (message);
-
-    refs = _notmuch_message_file_get_header (message_file, "references");
-    last_ref_message_id = parse_references (message,
-					    this_message_id,
-					    parents, refs);
-
-    in_reply_to = _notmuch_message_file_get_header (message_file, "in-reply-to");
-    in_reply_to_message_id = parse_references (message,
-					       this_message_id,
-					       parents, in_reply_to);
-
-    /* For the parent of this message, use the last message ID of the
-     * References header, if available.  If not, fall back to the
-     * first message ID in the In-Reply-To header. */
-    if (last_ref_message_id) {
-	_notmuch_message_add_term (message, "replyto",
-				   last_ref_message_id);
-    } else if (in_reply_to_message_id) {
-	_notmuch_message_add_term (message, "replyto",
-			     in_reply_to_message_id);
-    }
-
-    keys = g_hash_table_get_keys (parents);
-    for (l = keys; l; l = l->next) {
-	char *parent_message_id;
-	const char *parent_thread_id = NULL;
-
-	parent_message_id = (char *) l->data;
-
-	_notmuch_message_add_term (message, "reference",
-				   parent_message_id);
-
-	ret = _resolve_message_id_to_thread_id (notmuch,
-						message,
-						parent_message_id,
-						&parent_thread_id);
-	if (ret)
-	    goto DONE;
-
-	if (*thread_id == NULL) {
-	    *thread_id = talloc_strdup (message, parent_thread_id);
-	    _notmuch_message_add_term (message, "thread", *thread_id);
-	} else if (strcmp (*thread_id, parent_thread_id)) {
-	    ret = _merge_threads (notmuch, *thread_id, parent_thread_id);
-	    if (ret)
-		goto DONE;
-	}
-    }
-
-  DONE:
-    if (keys)
-	g_list_free (keys);
-    if (parents)
-	g_hash_table_unref (parents);
-
-    return ret;
-}
-
-static notmuch_status_t
-_notmuch_database_link_message_to_children (notmuch_database_t *notmuch,
-					    notmuch_message_t *message,
-					    const char **thread_id)
-{
-    const char *message_id = notmuch_message_get_message_id (message);
-    Xapian::PostingIterator child, children_end;
-    notmuch_message_t *child_message = NULL;
-    const char *child_thread_id;
-    notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
-    notmuch_private_status_t private_status;
-
-    find_doc_ids (notmuch, "reference", message_id, &child, &children_end);
-
-    for ( ; child != children_end; child++) {
-
-	child_message = _notmuch_message_create (message, notmuch,
-						 *child, &private_status);
-	if (child_message == NULL) {
-	    ret = COERCE_STATUS (private_status,
-				 "Cannot find document for doc_id from query");
-	    goto DONE;
-	}
-
-	child_thread_id = notmuch_message_get_thread_id (child_message);
-	if (*thread_id == NULL) {
-	    *thread_id = talloc_strdup (message, child_thread_id);
-	    _notmuch_message_add_term (message, "thread", *thread_id);
-	} else if (strcmp (*thread_id, child_thread_id)) {
-	    _notmuch_message_remove_term (child_message, "reference",
-					  message_id);
-	    _notmuch_message_sync (child_message);
-	    ret = _merge_threads (notmuch, *thread_id, child_thread_id);
-	    if (ret)
-		goto DONE;
-	}
-
-	notmuch_message_destroy (child_message);
-	child_message = NULL;
-    }
-
-  DONE:
-    if (child_message)
-	notmuch_message_destroy (child_message);
-
-    return ret;
-}
-
-/* Fetch and clear the stored thread_id for message, or NULL if none. */
-static char *
-_consume_metadata_thread_id (void *ctx, notmuch_database_t *notmuch,
-			     notmuch_message_t *message)
-{
-    const char *message_id;
-    string stored_id;
-    char *metadata_key;
-
-    message_id = notmuch_message_get_message_id (message);
-    metadata_key = _get_metadata_thread_id_key (ctx, message_id);
-
-    /* Check if we have already seen related messages to this one.
-     * If we have then use the thread_id that we stored at that time.
-     */
-    stored_id = notmuch->xapian_db->get_metadata (metadata_key);
-    if (stored_id.empty ()) {
-	return NULL;
-    } else {
-	Xapian::WritableDatabase *db;
-
-	db = static_cast <Xapian::WritableDatabase *> (notmuch->xapian_db);
-
-	/* Clear the metadata for this message ID. We don't need it
-	 * anymore. */
-	db->set_metadata (metadata_key, "");
-
-	return talloc_strdup (ctx, stored_id.c_str ());
-    }
-}
-
-/* Given a blank or ghost 'message' and its corresponding
- * 'message_file' link it to existing threads in the database.
- *
- * First, if is_ghost, this retrieves the thread ID already stored in
- * the message (which will be the case if a message was previously
- * added that referenced this one).  If the message is blank
- * (!is_ghost), it doesn't have a thread ID yet (we'll generate one
- * later in this function).  If the database does not support ghost
- * messages, this checks for a thread ID stored in database metadata
- * for this message ID.
- *
- * Second, we look at 'message_file' and its link-relevant headers
- * (References and In-Reply-To) for message IDs.
- *
- * Finally, we look in the database for existing message that
- * reference 'message'.
- *
- * In all cases, we assign to the current message the first thread ID
- * found. We will also merge any existing, distinct threads where this
- * message belongs to both, (which is not uncommon when messages are
- * processed out of order).
- *
- * Finally, if no thread ID has been found through referenced messages, we
- * call _notmuch_message_generate_thread_id to generate a new thread
- * ID. This should only happen for new, top-level messages, (no
- * References or In-Reply-To header in this message, and no previously
- * added message refers to this message).
- */
-static notmuch_status_t
-_notmuch_database_link_message (notmuch_database_t *notmuch,
-				notmuch_message_t *message,
-				notmuch_message_file_t *message_file,
-				notmuch_bool_t is_ghost)
-{
-    void *local = talloc_new (NULL);
-    notmuch_status_t status;
-    const char *thread_id = NULL;
-
-    /* Check if the message already had a thread ID */
-    if (notmuch->features & NOTMUCH_FEATURE_GHOSTS) {
-	if (is_ghost)
-	    thread_id = notmuch_message_get_thread_id (message);
-    } else {
-	thread_id = _consume_metadata_thread_id (local, notmuch, message);
-	if (thread_id)
-	    _notmuch_message_add_term (message, "thread", thread_id);
-    }
-
-    status = _notmuch_database_link_message_to_parents (notmuch, message,
-							message_file,
-							&thread_id);
-    if (status)
-	goto DONE;
-
-    if (! (notmuch->features & NOTMUCH_FEATURE_GHOSTS)) {
-	/* In general, it shouldn't be necessary to link children,
-	 * since the earlier indexing of those children will have
-	 * stored a thread ID for the missing parent.  However, prior
-	 * to ghost messages, these stored thread IDs were NOT
-	 * rewritten during thread merging (and there was no
-	 * performant way to do so), so if indexed children were
-	 * pulled into a different thread ID by a merge, it was
-	 * necessary to pull them *back* into the stored thread ID of
-	 * the parent.  With ghost messages, we just rewrite the
-	 * stored thread IDs during merging, so this workaround isn't
-	 * necessary. */
-	status = _notmuch_database_link_message_to_children (notmuch, message,
-							     &thread_id);
-	if (status)
-	    goto DONE;
-    }
-
-    /* If not part of any existing thread, generate a new thread ID. */
-    if (thread_id == NULL) {
-	thread_id = _notmuch_database_generate_thread_id (notmuch);
-
-	_notmuch_message_add_term (message, "thread", thread_id);
-    }
-
- DONE:
-    talloc_free (local);
-
-    return status;
-}
-
-notmuch_status_t
-notmuch_database_add_message (notmuch_database_t *notmuch,
-			      const char *filename,
-			      notmuch_message_t **message_ret)
-{
-    notmuch_message_file_t *message_file;
-    notmuch_message_t *message = NULL;
-    notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS, ret2;
-    notmuch_private_status_t private_status;
-    notmuch_bool_t is_ghost = false;
-
-    const char *date, *header;
-    const char *from, *to, *subject;
-    char *message_id = NULL;
-
-    if (message_ret)
-	*message_ret = NULL;
-
-    ret = _notmuch_database_ensure_writable (notmuch);
-    if (ret)
-	return ret;
-
-    message_file = _notmuch_message_file_open (notmuch, filename);
-    if (message_file == NULL)
-	return NOTMUCH_STATUS_FILE_ERROR;
-
-    /* Adding a message may change many documents.  Do this all
-     * atomically. */
-    ret = notmuch_database_begin_atomic (notmuch);
-    if (ret)
-	goto DONE;
-
-    /* Parse message up front to get better error status. */
-    ret = _notmuch_message_file_parse (message_file);
-    if (ret)
-	goto DONE;
-
-    /* Before we do any real work, (especially before doing a
-     * potential SHA-1 computation on the entire file's contents),
-     * let's make sure that what we're looking at looks like an
-     * actual email message.
-     */
-    from = _notmuch_message_file_get_header (message_file, "from");
-    subject = _notmuch_message_file_get_header (message_file, "subject");
-    to = _notmuch_message_file_get_header (message_file, "to");
-
-    if ((from == NULL || *from == '\0') &&
-	(subject == NULL || *subject == '\0') &&
-	(to == NULL || *to == '\0')) {
-	ret = NOTMUCH_STATUS_FILE_NOT_EMAIL;
-	goto DONE;
-    }
-
-    /* Now that we're sure it's mail, the first order of business
-     * is to find a message ID (or else create one ourselves).
-     */
-    header = _notmuch_message_file_get_header (message_file, "message-id");
-    if (header && *header != '\0') {
-	message_id = _parse_message_id (message_file, header, NULL);
-
-	/* So the header value isn't RFC-compliant, but it's
-	 * better than no message-id at all.
-	 */
-	if (message_id == NULL)
-	    message_id = talloc_strdup (message_file, header);
-    }
-
-    if (message_id == NULL ) {
-	/* No message-id at all, let's generate one by taking a
-	 * hash over the file's contents.
-	 */
-	char *sha1 = _notmuch_sha1_of_file (filename);
-
-	/* If that failed too, something is really wrong. Give up. */
-	if (sha1 == NULL) {
-	    ret = NOTMUCH_STATUS_FILE_ERROR;
-	    goto DONE;
-	}
-
-	message_id = talloc_asprintf (message_file, "notmuch-sha1-%s", sha1);
-	free (sha1);
-    }
-
-    try {
-	/* Now that we have a message ID, we get a message object,
-	 * (which may or may not reference an existing document in the
-	 * database). */
-
-	message = _notmuch_message_create_for_message_id (notmuch,
-							  message_id,
-							  &private_status);
-
-	talloc_free (message_id);
-
-	if (message == NULL) {
-	    ret = COERCE_STATUS (private_status,
-				 "Unexpected status value from _notmuch_message_create_for_message_id");
-	    goto DONE;
-	}
-
-	_notmuch_message_add_filename (message, filename);
-
-	/* Is this a newly created message object or a ghost
-	 * message?  We have to be slightly careful: if this is a
-	 * blank message, it's not safe to call
-	 * notmuch_message_get_flag yet. */
-	if (private_status == NOTMUCH_PRIVATE_STATUS_NO_DOCUMENT_FOUND ||
-	    (is_ghost = notmuch_message_get_flag (
-		message, NOTMUCH_MESSAGE_FLAG_GHOST))) {
-	    _notmuch_message_add_term (message, "type", "mail");
-	    if (is_ghost)
-		/* Convert ghost message to a regular message */
-		_notmuch_message_remove_term (message, "type", "ghost");
-
-	    ret = _notmuch_database_link_message (notmuch, message,
-						  message_file, is_ghost);
-	    if (ret)
-		goto DONE;
-
-	    date = _notmuch_message_file_get_header (message_file, "date");
-	    _notmuch_message_set_header_values (message, date, from, subject);
-
-	    ret = _notmuch_message_index_file (message, message_file);
-	    if (ret)
-		goto DONE;
-	} else {
-	    ret = NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
-	}
-
-	_notmuch_message_sync (message);
-    } catch (const Xapian::Error &error) {
-	_notmuch_database_log (notmuch, "A Xapian exception occurred adding message: %s.\n",
-		 error.get_msg().c_str());
-	notmuch->exception_reported = TRUE;
-	ret = NOTMUCH_STATUS_XAPIAN_EXCEPTION;
-	goto DONE;
-    }
-
-  DONE:
-    if (message) {
-	if ((ret == NOTMUCH_STATUS_SUCCESS ||
-	     ret == NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID) && message_ret)
-	    *message_ret = message;
-	else
-	    notmuch_message_destroy (message);
-    }
-
-    if (message_file)
-	_notmuch_message_file_close (message_file);
-
-    ret2 = notmuch_database_end_atomic (notmuch);
-    if ((ret == NOTMUCH_STATUS_SUCCESS ||
-	 ret == NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID) &&
-	ret2 != NOTMUCH_STATUS_SUCCESS)
-	ret = ret2;
-
-    return ret;
-}
-
 notmuch_status_t
 notmuch_database_remove_message (notmuch_database_t *notmuch,
 				 const char *filename)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [patch v3 02/12] lib/n_d_add_message: refactor test for new/ghost messages
  2017-06-04 12:32 v3 of index multiple files per msg-id, add reindex command David Bremner
  2017-06-04 12:32 ` [patch v3 01/12] lib: isolate n_d_add_message and helper functions into own file David Bremner
@ 2017-06-04 12:32 ` David Bremner
  2017-06-04 12:32 ` [patch v3 03/12] lib: factor out message-id parsing to separate file David Bremner
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: David Bremner @ 2017-06-04 12:32 UTC (permalink / raw)
  To: notmuch, notmuch

The switch is easier to understand than the side effects in the if
test. It also potentially allows us more flexibility in breaking up
this function into smaller pieces, since passing private_status around
is icky.
---
 lib/add-message.cc | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/lib/add-message.cc b/lib/add-message.cc
index 5fe2c45b..0f09415e 100644
--- a/lib/add-message.cc
+++ b/lib/add-message.cc
@@ -570,7 +570,7 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
     notmuch_message_t *message = NULL;
     notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS, ret2;
     notmuch_private_status_t private_status;
-    notmuch_bool_t is_ghost = false;
+    notmuch_bool_t is_ghost = FALSE, is_new = FALSE;
 
     const char *date, *header;
     const char *from, *to, *subject;
@@ -655,7 +655,17 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
 
 	talloc_free (message_id);
 
-	if (message == NULL) {
+	/* We cannot call notmuch_message_get_flag for a new message */
+	switch (private_status) {
+	case NOTMUCH_PRIVATE_STATUS_NO_DOCUMENT_FOUND:
+	    is_ghost = FALSE;
+	    is_new = TRUE;
+	    break;
+	case NOTMUCH_PRIVATE_STATUS_SUCCESS:
+	    is_ghost = notmuch_message_get_flag (message, NOTMUCH_MESSAGE_FLAG_GHOST);
+	    is_new = FALSE;
+	    break;
+	default:
 	    ret = COERCE_STATUS (private_status,
 				 "Unexpected status value from _notmuch_message_create_for_message_id");
 	    goto DONE;
@@ -663,18 +673,11 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
 
 	_notmuch_message_add_filename (message, filename);
 
-	/* Is this a newly created message object or a ghost
-	 * message?  We have to be slightly careful: if this is a
-	 * blank message, it's not safe to call
-	 * notmuch_message_get_flag yet. */
-	if (private_status == NOTMUCH_PRIVATE_STATUS_NO_DOCUMENT_FOUND ||
-	    (is_ghost = notmuch_message_get_flag (
-		message, NOTMUCH_MESSAGE_FLAG_GHOST))) {
+	if (is_new || is_ghost) {
 	    _notmuch_message_add_term (message, "type", "mail");
 	    if (is_ghost)
 		/* Convert ghost message to a regular message */
 		_notmuch_message_remove_term (message, "type", "ghost");
-
 	    ret = _notmuch_database_link_message (notmuch, message,
 						  message_file, is_ghost);
 	    if (ret)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [patch v3 03/12] lib: factor out message-id parsing to separate file.
  2017-06-04 12:32 v3 of index multiple files per msg-id, add reindex command David Bremner
  2017-06-04 12:32 ` [patch v3 01/12] lib: isolate n_d_add_message and helper functions into own file David Bremner
  2017-06-04 12:32 ` [patch v3 02/12] lib/n_d_add_message: refactor test for new/ghost messages David Bremner
@ 2017-06-04 12:32 ` David Bremner
  2017-06-04 12:32 ` [patch v3 04/12] lib: refactor notmuch_database_add_message header parsing David Bremner
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: David Bremner @ 2017-06-04 12:32 UTC (permalink / raw)
  To: notmuch, notmuch

This is really pure C string parsing, and doesn't need to be mixed in
with the Xapian/C++ layer. Although not strictly necessary, it also
makes it a bit more natural to call _parse_message_id from multiple
compilation units.
---
 lib/Makefile.local    |   1 +
 lib/add-message.cc    | 106 +-------------------------------------------------
 lib/message-id.c      |  96 +++++++++++++++++++++++++++++++++++++++++++++
 lib/notmuch-private.h |  14 +++++++
 4 files changed, 112 insertions(+), 105 deletions(-)
 create mode 100644 lib/message-id.c

diff --git a/lib/Makefile.local b/lib/Makefile.local
index 9dd68286..0b5c4b08 100644
--- a/lib/Makefile.local
+++ b/lib/Makefile.local
@@ -38,6 +38,7 @@ libnotmuch_c_srcs =		\
 	$(dir)/filenames.c	\
 	$(dir)/string-list.c	\
 	$(dir)/message-file.c	\
+	$(dir)/message-id.c	\
 	$(dir)/messages.c	\
 	$(dir)/sha1.c		\
 	$(dir)/built-with.c	\
diff --git a/lib/add-message.cc b/lib/add-message.cc
index 0f09415e..314016a8 100644
--- a/lib/add-message.cc
+++ b/lib/add-message.cc
@@ -1,109 +1,5 @@
 #include "database-private.h"
 
-/* Advance 'str' past any whitespace or RFC 822 comments. A comment is
- * a (potentially nested) parenthesized sequence with '\' used to
- * escape any character (including parentheses).
- *
- * If the sequence to be skipped continues to the end of the string,
- * then 'str' will be left pointing at the final terminating '\0'
- * character.
- */
-static void
-skip_space_and_comments (const char **str)
-{
-    const char *s;
-
-    s = *str;
-    while (*s && (isspace (*s) || *s == '(')) {
-	while (*s && isspace (*s))
-	    s++;
-	if (*s == '(') {
-	    int nesting = 1;
-	    s++;
-	    while (*s && nesting) {
-		if (*s == '(') {
-		    nesting++;
-		} else if (*s == ')') {
-		    nesting--;
-		} else if (*s == '\\') {
-		    if (*(s+1))
-			s++;
-		}
-		s++;
-	    }
-	}
-    }
-
-    *str = s;
-}
-
-/* Parse an RFC 822 message-id, discarding whitespace, any RFC 822
- * comments, and the '<' and '>' delimiters.
- *
- * If not NULL, then *next will be made to point to the first character
- * not parsed, (possibly pointing to the final '\0' terminator.
- *
- * Returns a newly talloc'ed string belonging to 'ctx'.
- *
- * Returns NULL if there is any error parsing the message-id. */
-static char *
-_parse_message_id (void *ctx, const char *message_id, const char **next)
-{
-    const char *s, *end;
-    char *result;
-
-    if (message_id == NULL || *message_id == '\0')
-	return NULL;
-
-    s = message_id;
-
-    skip_space_and_comments (&s);
-
-    /* Skip any unstructured text as well. */
-    while (*s && *s != '<')
-	s++;
-
-    if (*s == '<') {
-	s++;
-    } else {
-	if (next)
-	    *next = s;
-	return NULL;
-    }
-
-    skip_space_and_comments (&s);
-
-    end = s;
-    while (*end && *end != '>')
-	end++;
-    if (next) {
-	if (*end)
-	    *next = end + 1;
-	else
-	    *next = end;
-    }
-
-    if (end > s && *end == '>')
-	end--;
-    if (end <= s)
-	return NULL;
-
-    result = talloc_strndup (ctx, s, end - s + 1);
-
-    /* Finally, collapse any whitespace that is within the message-id
-     * itself. */
-    {
-	char *r;
-	int len;
-
-	for (r = result, len = strlen (r); *r; r++, len--)
-	    if (*r == ' ' || *r == '\t')
-		memmove (r, r+1, len);
-    }
-
-    return result;
-}
-
 /* Parse a References header value, putting a (talloc'ed under 'ctx')
  * copy of each referenced message-id into 'hash'.
  *
@@ -126,7 +22,7 @@ parse_references (void *ctx,
 	return NULL;
 
     while (*refs) {
-	ref = _parse_message_id (ctx, refs, &refs);
+	ref = _notmuch_message_id_parse (ctx, refs, &refs);
 
 	if (ref && strcmp (ref, message_id)) {
 	    g_hash_table_add (hash, ref);
diff --git a/lib/message-id.c b/lib/message-id.c
new file mode 100644
index 00000000..d7541d50
--- /dev/null
+++ b/lib/message-id.c
@@ -0,0 +1,96 @@
+#include "notmuch-private.h"
+
+/* Advance 'str' past any whitespace or RFC 822 comments. A comment is
+ * a (potentially nested) parenthesized sequence with '\' used to
+ * escape any character (including parentheses).
+ *
+ * If the sequence to be skipped continues to the end of the string,
+ * then 'str' will be left pointing at the final terminating '\0'
+ * character.
+ */
+static void
+skip_space_and_comments (const char **str)
+{
+    const char *s;
+
+    s = *str;
+    while (*s && (isspace (*s) || *s == '(')) {
+	while (*s && isspace (*s))
+	    s++;
+	if (*s == '(') {
+	    int nesting = 1;
+	    s++;
+	    while (*s && nesting) {
+		if (*s == '(') {
+		    nesting++;
+		} else if (*s == ')') {
+		    nesting--;
+		} else if (*s == '\\') {
+		    if (*(s+1))
+			s++;
+		}
+		s++;
+	    }
+	}
+    }
+
+    *str = s;
+}
+
+char *
+_notmuch_message_id_parse (void *ctx, const char *message_id, const char **next)
+{
+    const char *s, *end;
+    char *result;
+
+    if (message_id == NULL || *message_id == '\0')
+	return NULL;
+
+    s = message_id;
+
+    skip_space_and_comments (&s);
+
+    /* Skip any unstructured text as well. */
+    while (*s && *s != '<')
+	s++;
+
+    if (*s == '<') {
+	s++;
+    } else {
+	if (next)
+	    *next = s;
+	return NULL;
+    }
+
+    skip_space_and_comments (&s);
+
+    end = s;
+    while (*end && *end != '>')
+	end++;
+    if (next) {
+	if (*end)
+	    *next = end + 1;
+	else
+	    *next = end;
+    }
+
+    if (end > s && *end == '>')
+	end--;
+    if (end <= s)
+	return NULL;
+
+    result = talloc_strndup (ctx, s, end - s + 1);
+
+    /* Finally, collapse any whitespace that is within the message-id
+     * itself. */
+    {
+	char *r;
+	int len;
+
+	for (r = result, len = strlen (r); *r; r++, len--)
+	    if (*r == ' ' || *r == '\t')
+		memmove (r, r+1, len);
+    }
+
+    return result;
+}
diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
index ac315e4c..dda94d76 100644
--- a/lib/notmuch-private.h
+++ b/lib/notmuch-private.h
@@ -492,6 +492,20 @@ notmuch_status_t
 _notmuch_query_count_documents (notmuch_query_t *query,
 				const char *type,
 				unsigned *count_out);
+/* message-id.c */
+
+/* Parse an RFC 822 message-id, discarding whitespace, any RFC 822
+ * comments, and the '<' and '>' delimiters.
+ *
+ * If not NULL, then *next will be made to point to the first character
+ * not parsed, (possibly pointing to the final '\0' terminator.
+ *
+ * Returns a newly talloc'ed string belonging to 'ctx'.
+ *
+ * Returns NULL if there is any error parsing the message-id. */
+char *
+_notmuch_message_id_parse (void *ctx, const char *message_id, const char **next);
+
 
 /* message.cc */
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [patch v3 04/12] lib: refactor notmuch_database_add_message header parsing
  2017-06-04 12:32 v3 of index multiple files per msg-id, add reindex command David Bremner
                   ` (2 preceding siblings ...)
  2017-06-04 12:32 ` [patch v3 03/12] lib: factor out message-id parsing to separate file David Bremner
@ 2017-06-04 12:32 ` David Bremner
  2017-06-04 12:32 ` [patch v3 05/12] test: add known broken tests for duplicate message id David Bremner
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: David Bremner @ 2017-06-04 12:32 UTC (permalink / raw)
  To: notmuch, notmuch

This function is large and hard to understand and modify. Start to
break it down into meaningful pieces.
---
 lib/add-message.cc    | 54 +++-----------------------------
 lib/message-file.c    | 86 +++++++++++++++++++++++++++++++++++++++++++++++++++
 lib/notmuch-private.h | 11 +++++++
 3 files changed, 101 insertions(+), 50 deletions(-)

diff --git a/lib/add-message.cc b/lib/add-message.cc
index 314016a8..2922eaa9 100644
--- a/lib/add-message.cc
+++ b/lib/add-message.cc
@@ -468,7 +468,7 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
     notmuch_private_status_t private_status;
     notmuch_bool_t is_ghost = FALSE, is_new = FALSE;
 
-    const char *date, *header;
+    const char *date;
     const char *from, *to, *subject;
     char *message_id = NULL;
 
@@ -489,57 +489,12 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
     if (ret)
 	goto DONE;
 
-    /* Parse message up front to get better error status. */
-    ret = _notmuch_message_file_parse (message_file);
+    ret = _notmuch_message_file_get_headers (message_file,
+					     &from, &subject, &to, &date,
+					     &message_id);
     if (ret)
 	goto DONE;
 
-    /* Before we do any real work, (especially before doing a
-     * potential SHA-1 computation on the entire file's contents),
-     * let's make sure that what we're looking at looks like an
-     * actual email message.
-     */
-    from = _notmuch_message_file_get_header (message_file, "from");
-    subject = _notmuch_message_file_get_header (message_file, "subject");
-    to = _notmuch_message_file_get_header (message_file, "to");
-
-    if ((from == NULL || *from == '\0') &&
-	(subject == NULL || *subject == '\0') &&
-	(to == NULL || *to == '\0')) {
-	ret = NOTMUCH_STATUS_FILE_NOT_EMAIL;
-	goto DONE;
-    }
-
-    /* Now that we're sure it's mail, the first order of business
-     * is to find a message ID (or else create one ourselves).
-     */
-    header = _notmuch_message_file_get_header (message_file, "message-id");
-    if (header && *header != '\0') {
-	message_id = _parse_message_id (message_file, header, NULL);
-
-	/* So the header value isn't RFC-compliant, but it's
-	 * better than no message-id at all.
-	 */
-	if (message_id == NULL)
-	    message_id = talloc_strdup (message_file, header);
-    }
-
-    if (message_id == NULL ) {
-	/* No message-id at all, let's generate one by taking a
-	 * hash over the file's contents.
-	 */
-	char *sha1 = _notmuch_sha1_of_file (filename);
-
-	/* If that failed too, something is really wrong. Give up. */
-	if (sha1 == NULL) {
-	    ret = NOTMUCH_STATUS_FILE_ERROR;
-	    goto DONE;
-	}
-
-	message_id = talloc_asprintf (message_file, "notmuch-sha1-%s", sha1);
-	free (sha1);
-    }
-
     try {
 	/* Now that we have a message ID, we get a message object,
 	 * (which may or may not reference an existing document in the
@@ -579,7 +534,6 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
 	    if (ret)
 		goto DONE;
 
-	    date = _notmuch_message_file_get_header (message_file, "date");
 	    _notmuch_message_set_header_values (message, date, from, subject);
 
 	    ret = _notmuch_message_index_file (message, message_file);
diff --git a/lib/message-file.c b/lib/message-file.c
index db18b163..70526ef0 100644
--- a/lib/message-file.c
+++ b/lib/message-file.c
@@ -92,6 +92,12 @@ _notmuch_message_file_open (notmuch_database_t *notmuch,
     return _notmuch_message_file_open_ctx (notmuch, NULL, filename);
 }
 
+const char *
+_notmuch_message_file_get_filename (notmuch_message_file_t *message_file)
+{
+    return message_file->filename;
+}
+
 void
 _notmuch_message_file_close (notmuch_message_file_t *message)
 {
@@ -304,3 +310,83 @@ _notmuch_message_file_get_header (notmuch_message_file_t *message,
 
     return decoded;
 }
+
+notmuch_status_t
+_notmuch_message_file_get_headers (notmuch_message_file_t *message_file,
+				   const char **from_out,
+				   const char **subject_out,
+				   const char **to_out,
+				   const char **date_out,
+				   char **message_id_out)
+{
+    notmuch_status_t ret;
+    const char *header;
+    const char *from, *to, *subject, *date;
+    char *message_id = NULL;
+
+    /* Parse message up front to get better error status. */
+    ret = _notmuch_message_file_parse (message_file);
+    if (ret)
+	goto DONE;
+
+    /* Before we do any real work, (especially before doing a
+     * potential SHA-1 computation on the entire file's contents),
+     * let's make sure that what we're looking at looks like an
+     * actual email message.
+     */
+    from = _notmuch_message_file_get_header (message_file, "from");
+    subject = _notmuch_message_file_get_header (message_file, "subject");
+    to = _notmuch_message_file_get_header (message_file, "to");
+    date = _notmuch_message_file_get_header (message_file, "date");
+
+    if ((from == NULL || *from == '\0') &&
+	(subject == NULL || *subject == '\0') &&
+	(to == NULL || *to == '\0')) {
+	ret = NOTMUCH_STATUS_FILE_NOT_EMAIL;
+	goto DONE;
+    }
+
+    /* Now that we're sure it's mail, the first order of business
+     * is to find a message ID (or else create one ourselves).
+     */
+    header = _notmuch_message_file_get_header (message_file, "message-id");
+    if (header && *header != '\0') {
+	message_id = _notmuch_message_id_parse (message_file, header, NULL);
+
+	/* So the header value isn't RFC-compliant, but it's
+	 * better than no message-id at all.
+	 */
+	if (message_id == NULL)
+	    message_id = talloc_strdup (message_file, header);
+    }
+
+    if (message_id == NULL ) {
+	/* No message-id at all, let's generate one by taking a
+	 * hash over the file's contents.
+	 */
+	char *sha1 = _notmuch_sha1_of_file (_notmuch_message_file_get_filename (message_file));
+
+	/* If that failed too, something is really wrong. Give up. */
+	if (sha1 == NULL) {
+	    ret = NOTMUCH_STATUS_FILE_ERROR;
+	    goto DONE;
+	}
+
+	message_id = talloc_asprintf (message_file, "notmuch-sha1-%s", sha1);
+	free (sha1);
+    }
+ DONE:
+    if (ret == NOTMUCH_STATUS_SUCCESS) {
+	if (from_out)
+	    *from_out = from;
+	if (subject_out)
+	    *subject_out = subject;
+	if (to_out)
+	    *to_out = to;
+	if (date_out)
+	    *date_out = date;
+	if (message_id_out)
+	    *message_id_out = message_id;
+    }
+    return ret;
+}
diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
index dda94d76..4534fb52 100644
--- a/lib/notmuch-private.h
+++ b/lib/notmuch-private.h
@@ -425,6 +425,17 @@ const char *
 _notmuch_message_file_get_header (notmuch_message_file_t *message,
 				 const char *header);
 
+notmuch_status_t
+_notmuch_message_file_get_headers (notmuch_message_file_t *message_file,
+				   const char **from_out,
+				   const char **subject_out,
+				   const char **to_out,
+				   const char **date_out,
+				   char **message_id_out);
+
+const char *
+_notmuch_message_file_get_filename (notmuch_message_file_t *message);
+
 /* index.cc */
 
 notmuch_status_t
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [patch v3 05/12] test: add known broken tests for duplicate message id
  2017-06-04 12:32 v3 of index multiple files per msg-id, add reindex command David Bremner
                   ` (3 preceding siblings ...)
  2017-06-04 12:32 ` [patch v3 04/12] lib: refactor notmuch_database_add_message header parsing David Bremner
@ 2017-06-04 12:32 ` David Bremner
  2017-06-04 12:32 ` [patch v3 06/12] lib: index message files with duplicate message-ids David Bremner
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: David Bremner @ 2017-06-04 12:32 UTC (permalink / raw)
  To: notmuch, notmuch

There are many other problems that could be tested, but these ones we
have some hope of fixing because it doesn't require UI changes, just
indexing changes.
---
 test/T670-duplicate-mid.sh | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)
 create mode 100755 test/T670-duplicate-mid.sh

diff --git a/test/T670-duplicate-mid.sh b/test/T670-duplicate-mid.sh
new file mode 100755
index 00000000..ced28a21
--- /dev/null
+++ b/test/T670-duplicate-mid.sh
@@ -0,0 +1,28 @@
+#!/usr/bin/env bash
+test_description="duplicate message ids"
+. ./test-lib.sh || exit 1
+
+add_message '[id]="duplicate"' '[subject]="message 1" [filename]=copy1'
+add_message '[id]="duplicate"' '[subject]="message 2" [filename]=copy2'
+
+test_begin_subtest 'Search for second subject'
+test_subtest_known_broken
+cat <<EOF >EXPECTED
+MAIL_DIR/copy1
+MAIL_DIR/copy2
+EOF
+notmuch search --output=files subject:'"message 2"' | notmuch_dir_sanitize > OUTPUT
+test_expect_equal_file EXPECTED OUTPUT
+
+add_message '[id]="duplicate"' '[body]="sekrit" [filename]=copy3'
+test_begin_subtest 'search for body in duplicate file'
+test_subtest_known_broken
+cat <<EOF >EXPECTED
+MAIL_DIR/copy1
+MAIL_DIR/copy2
+MAIL_DIR/copy3
+EOF
+notmuch search --output=files "sekrit" | notmuch_dir_sanitize > OUTPUT
+test_expect_equal_file EXPECTED OUTPUT
+
+test_done
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [patch v3 06/12] lib: index message files with duplicate message-ids
  2017-06-04 12:32 v3 of index multiple files per msg-id, add reindex command David Bremner
                   ` (4 preceding siblings ...)
  2017-06-04 12:32 ` [patch v3 05/12] test: add known broken tests for duplicate message id David Bremner
@ 2017-06-04 12:32 ` David Bremner
  2017-06-04 20:34   ` Daniel Kahn Gillmor
  2017-06-04 12:32 ` [patch v3 07/12] lib: add notmuch_message_count_files David Bremner
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 17+ messages in thread
From: David Bremner @ 2017-06-04 12:32 UTC (permalink / raw)
  To: notmuch, notmuch

The corresponding xapian document just gets more terms added to it,
but this doesn't seem to break anything. Values on the other hand get
overwritten, which is a bit annoying, but arguably it is not worse to
take the values (from, subject, date) from the last file indexed
rather than the first.
---
 lib/add-message.cc         | 20 +++++++++++---------
 test/T160-json.sh          |  4 ++--
 test/T670-duplicate-mid.sh |  2 --
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/lib/add-message.cc b/lib/add-message.cc
index 2922eaa9..ae9b14a7 100644
--- a/lib/add-message.cc
+++ b/lib/add-message.cc
@@ -529,19 +529,21 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
 	    if (is_ghost)
 		/* Convert ghost message to a regular message */
 		_notmuch_message_remove_term (message, "type", "ghost");
-	    ret = _notmuch_database_link_message (notmuch, message,
+	}
+
+	ret = _notmuch_database_link_message (notmuch, message,
 						  message_file, is_ghost);
-	    if (ret)
-		goto DONE;
+	if (ret)
+	    goto DONE;
 
-	    _notmuch_message_set_header_values (message, date, from, subject);
+	_notmuch_message_set_header_values (message, date, from, subject);
 
-	    ret = _notmuch_message_index_file (message, message_file);
-	    if (ret)
-		goto DONE;
-	} else {
+	ret = _notmuch_message_index_file (message, message_file);
+	if (ret)
+	    goto DONE;
+
+	if (! is_new && !is_ghost)
 	    ret = NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
-	}
 
 	_notmuch_message_sync (message);
     } catch (const Xapian::Error &error) {
diff --git a/test/T160-json.sh b/test/T160-json.sh
index ac51895e..07955a2b 100755
--- a/test/T160-json.sh
+++ b/test/T160-json.sh
@@ -71,8 +71,8 @@ test_begin_subtest "Format version: too high"
 test_expect_code 21 "notmuch search --format-version=999 \\*"
 
 test_begin_subtest "Show message: multiple filenames"
-add_message "[id]=message-id@example.com [filename]=copy1"
-add_message "[id]=message-id@example.com [filename]=copy2"
+add_message '[id]=message-id@example.com [filename]=copy1 [date]="Fri, 05 Jan 2001 15:43:52 +0000"'
+add_message '[id]=message-id@example.com [filename]=copy2 [date]="Fri, 05 Jan 2001 15:43:52 +0000"'
 cat <<EOF > EXPECTED
 [
     [
diff --git a/test/T670-duplicate-mid.sh b/test/T670-duplicate-mid.sh
index ced28a21..f1952555 100755
--- a/test/T670-duplicate-mid.sh
+++ b/test/T670-duplicate-mid.sh
@@ -6,7 +6,6 @@ add_message '[id]="duplicate"' '[subject]="message 1" [filename]=copy1'
 add_message '[id]="duplicate"' '[subject]="message 2" [filename]=copy2'
 
 test_begin_subtest 'Search for second subject'
-test_subtest_known_broken
 cat <<EOF >EXPECTED
 MAIL_DIR/copy1
 MAIL_DIR/copy2
@@ -16,7 +15,6 @@ test_expect_equal_file EXPECTED OUTPUT
 
 add_message '[id]="duplicate"' '[body]="sekrit" [filename]=copy3'
 test_begin_subtest 'search for body in duplicate file'
-test_subtest_known_broken
 cat <<EOF >EXPECTED
 MAIL_DIR/copy1
 MAIL_DIR/copy2
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [patch v3 07/12] lib: add notmuch_message_count_files
  2017-06-04 12:32 v3 of index multiple files per msg-id, add reindex command David Bremner
                   ` (5 preceding siblings ...)
  2017-06-04 12:32 ` [patch v3 06/12] lib: index message files with duplicate message-ids David Bremner
@ 2017-06-04 12:32 ` David Bremner
  2017-06-04 12:32 ` [patch v3 08/12] lib: add notmuch_thread_get_total_files David Bremner
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: David Bremner @ 2017-06-04 12:32 UTC (permalink / raw)
  To: notmuch, notmuch

This operation is relatively inexpensive, as the needed metadata is
already computed by our lazy metadata fetching. The goal is to support
better UI for messages with multipile files.
---
 lib/message.cc        | 8 ++++++++
 lib/notmuch-private.h | 6 ++++++
 lib/notmuch.h         | 8 ++++++++
 lib/string-list.c     | 6 ++++++
 4 files changed, 28 insertions(+)

diff --git a/lib/message.cc b/lib/message.cc
index b330dcce..f3d1e159 100644
--- a/lib/message.cc
+++ b/lib/message.cc
@@ -946,6 +946,14 @@ notmuch_message_get_filenames (notmuch_message_t *message)
     return _notmuch_filenames_create (message, message->filename_list);
 }
 
+int
+notmuch_message_count_files (notmuch_message_t *message)
+{
+    _notmuch_message_ensure_filename_list (message);
+
+    return _notmuch_string_list_length (message->filename_list);
+}
+
 notmuch_bool_t
 notmuch_message_get_flag (notmuch_message_t *message,
 			  notmuch_message_flag_t flag)
diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
index 4534fb52..bcd22bee 100644
--- a/lib/notmuch-private.h
+++ b/lib/notmuch-private.h
@@ -550,6 +550,12 @@ typedef struct _notmuch_string_list {
 notmuch_string_list_t *
 _notmuch_string_list_create (const void *ctx);
 
+/*
+ * return the number of strings in 'list'
+ */
+int
+_notmuch_string_list_length (notmuch_string_list_t *list);
+
 /* Add 'string' to 'list'.
  *
  * The list will create its own talloced copy of 'string'.
diff --git a/lib/notmuch.h b/lib/notmuch.h
index e1745444..f6ed3990 100644
--- a/lib/notmuch.h
+++ b/lib/notmuch.h
@@ -1342,6 +1342,14 @@ notmuch_messages_t *
 notmuch_message_get_replies (notmuch_message_t *message);
 
 /**
+ * Get the total number of files associated with a message.
+ * @returns Non-negative integer
+ * @since libnotmuch 5.0 (notmuch 0.25)
+ */
+int
+notmuch_message_count_files (notmuch_message_t *message);
+
+/**
  * Get a filename for the email corresponding to 'message'.
  *
  * The returned filename is an absolute filename, (the initial
diff --git a/lib/string-list.c b/lib/string-list.c
index 43ebe499..9c3ae7ef 100644
--- a/lib/string-list.c
+++ b/lib/string-list.c
@@ -42,6 +42,12 @@ _notmuch_string_list_create (const void *ctx)
     return list;
 }
 
+int
+_notmuch_string_list_length (notmuch_string_list_t *list)
+{
+    return list->length;
+}
+
 void
 _notmuch_string_list_append (notmuch_string_list_t *list,
 			     const char *string)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [patch v3 08/12] lib: add notmuch_thread_get_total_files
  2017-06-04 12:32 v3 of index multiple files per msg-id, add reindex command David Bremner
                   ` (6 preceding siblings ...)
  2017-06-04 12:32 ` [patch v3 07/12] lib: add notmuch_message_count_files David Bremner
@ 2017-06-04 12:32 ` David Bremner
  2017-06-04 12:32 ` [patch v3 09/12] cli/search: print total number of files matched in summary output David Bremner
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: David Bremner @ 2017-06-04 12:32 UTC (permalink / raw)
  To: notmuch, notmuch

This is relatively inexpensive in terms of run time and implementation
cost as we are already traversing the list of messages in a thread.
---
 lib/notmuch.h | 12 ++++++++++++
 lib/thread.cc |  9 +++++++++
 2 files changed, 21 insertions(+)

diff --git a/lib/notmuch.h b/lib/notmuch.h
index f6ed3990..7bd5346f 100644
--- a/lib/notmuch.h
+++ b/lib/notmuch.h
@@ -1097,6 +1097,18 @@ int
 notmuch_thread_get_total_messages (notmuch_thread_t *thread);
 
 /**
+ * Get the total number of files in 'thread'.
+ *
+ * This sums notmuch_message_count_files over all messages in the
+ * thread
+ * @returns Non-negative integer
+ * @since libnotmuch 5.0 (notmuch 0.25)
+ */
+
+int
+notmuch_thread_get_total_files (notmuch_thread_t *thread);
+
+/**
  * Get a notmuch_messages_t iterator for the top-level messages in
  * 'thread' in oldest-first order.
  *
diff --git a/lib/thread.cc b/lib/thread.cc
index 1a1ecfa5..e17ef63e 100644
--- a/lib/thread.cc
+++ b/lib/thread.cc
@@ -44,6 +44,7 @@ struct _notmuch_thread {
 
     GHashTable *message_hash;
     int total_messages;
+    int total_files;
     int matched_messages;
     time_t oldest;
     time_t newest;
@@ -266,6 +267,7 @@ _thread_add_message (notmuch_thread_t *thread,
     _notmuch_message_list_add_message (thread->message_list,
 				       talloc_steal (thread, message));
     thread->total_messages++;
+    thread->total_files += notmuch_message_count_files (message);
 
     g_hash_table_insert (thread->message_hash,
 			 xstrdup (notmuch_message_get_message_id (message)),
@@ -495,6 +497,7 @@ _notmuch_thread_create (void *ctx,
 						  free, NULL);
 
     thread->total_messages = 0;
+    thread->total_files = 0;
     thread->matched_messages = 0;
     thread->oldest = 0;
     thread->newest = 0;
@@ -567,6 +570,12 @@ notmuch_thread_get_total_messages (notmuch_thread_t *thread)
 }
 
 int
+notmuch_thread_get_total_files (notmuch_thread_t *thread)
+{
+    return thread->total_files;
+}
+
+int
 notmuch_thread_get_matched_messages (notmuch_thread_t *thread)
 {
     return thread->matched_messages;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [patch v3 09/12] cli/search: print total number of files matched in summary output.
  2017-06-04 12:32 v3 of index multiple files per msg-id, add reindex command David Bremner
                   ` (7 preceding siblings ...)
  2017-06-04 12:32 ` [patch v3 08/12] lib: add notmuch_thread_get_total_files David Bremner
@ 2017-06-04 12:32 ` David Bremner
  2017-06-04 12:32 ` [patch v3 10/12] lib: add _notmuch_message_remove_indexed_terms David Bremner
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: David Bremner @ 2017-06-04 12:32 UTC (permalink / raw)
  To: notmuch, notmuch

The structured output formats already have all of the filenames. This
is an easy bit of UI change to make the multiple files visible.
---
 doc/man1/notmuch-search.rst          | 17 ++++++++++++++++-
 notmuch-search.c                     | 15 +++++++++++++--
 test/T080-search.sh                  |  2 +-
 test/T100-search-by-folder.sh        |  4 ++--
 test/T340-maildir-sync.sh            |  4 ++--
 test/T370-search-folder-coherence.sh |  2 +-
 test/T500-search-date.sh             |  2 +-
 test/T650-regexp-query.sh            |  4 ++--
 8 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/doc/man1/notmuch-search.rst b/doc/man1/notmuch-search.rst
index 4e660a6f..fd6dcadd 100644
--- a/doc/man1/notmuch-search.rst
+++ b/doc/man1/notmuch-search.rst
@@ -42,7 +42,9 @@ Supported options for **search** include
             the search terms. The summary includes the thread ID, date,
             the number of messages in the thread (both the number
             matched and the total number), the authors of the thread and
-            the subject.
+	    the subject. In the case where a thread contains multiple files for
+	    some messages, the total number of files is printed in parentheses
+	    (see below for an example).
 
         **threads**
             Output the thread IDs of all threads with any message
@@ -135,6 +137,19 @@ Supported options for **search** include
         prefix. The prefix matches messages based on filenames. This
         option filters filenames of the matching messages.
 
+EXAMPLE
+=======
+
+The following shows an example of the summary output format, with one
+message having multiple filenames.
+
+::
+
+  % notmuch search date:today.. and tag:bad-news
+  thread:0000000000063c10 Today [1/1] Some Persun; To the bone (inbox unread)
+  thread:0000000000063c25 Today [1/1(2)] Ann Other; Bears (inbox unread)
+  thread:0000000000063c00 Today [1/1] A Thurd; Bites, stings, sad feelings (inbox unread)
+
 EXIT STATUS
 ===========
 
diff --git a/notmuch-search.c b/notmuch-search.c
index 019e14ee..380e9d8f 100644
--- a/notmuch-search.c
+++ b/notmuch-search.c
@@ -160,6 +160,7 @@ do_search_threads (search_context_t *ctx)
 	    const char *subject = notmuch_thread_get_subject (thread);
 	    const char *thread_id = notmuch_thread_get_thread_id (thread);
 	    int matched = notmuch_thread_get_matched_messages (thread);
+	    int files = notmuch_thread_get_total_files (thread);
 	    int total = notmuch_thread_get_total_messages (thread);
 	    const char *relative_date = NULL;
 	    notmuch_bool_t first_tag = TRUE;
@@ -175,13 +176,23 @@ do_search_threads (search_context_t *ctx)
 
 	    if (format->is_text_printer) {
                 /* Special case for the text formatter */
-		printf ("thread:%s %12s [%d/%d] %s; %s (",
+		printf ("thread:%s %12s ",
 			thread_id,
-			relative_date,
+			relative_date);
+		if (total == files)
+		    printf ("[%d/%d] %s; %s (",
 			matched,
 			total,
 			sanitize_string (ctx_quote, authors),
 			sanitize_string (ctx_quote, subject));
+		else
+		    printf ("[%d/%d(%d)] %s; %s (",
+			matched,
+			total,
+			files,
+			sanitize_string (ctx_quote, authors),
+			sanitize_string (ctx_quote, subject));
+
 	    } else { /* Structured Output */
 		format->map_key (format, "thread");
 		format->string (format, thread_id);
diff --git a/test/T080-search.sh b/test/T080-search.sh
index d2d71ca9..3bb3dced 100755
--- a/test/T080-search.sh
+++ b/test/T080-search.sh
@@ -111,7 +111,7 @@ thread:XXX   2009-11-18 [3/3] Adrian Perez de Castro, Keith Packard, Carl Worth;
 thread:XXX   2009-11-18 [3/3] Israel Herraiz, Keith Packard, Carl Worth; [notmuch] New to the list (inbox unread)
 thread:XXX   2009-11-18 [3/3] Jan Janak, Carl Worth; [notmuch] What a great idea! (inbox unread)
 thread:XXX   2009-11-18 [2/2] Jan Janak, Carl Worth; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
-thread:XXX   2009-11-18 [3/3] Aron Griffis, Keith Packard, Carl Worth; [notmuch] archive (inbox unread)
+thread:XXX   2009-11-18 [3/3(4)] Aron Griffis, Keith Packard, Carl Worth; [notmuch] archive (inbox unread)
 thread:XXX   2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
 thread:XXX   2009-11-18 [7/7] Lars Kellogg-Stedman, Mikhail Gusarov, Keith Packard, Carl Worth; [notmuch] Working with Maildir storage? (inbox signed unread)
 thread:XXX   2009-11-18 [5/5] Mikhail Gusarov, Carl Worth, Keith Packard; [notmuch] [PATCH 1/2] Close message file after parsing message headers (inbox unread)
diff --git a/test/T100-search-by-folder.sh b/test/T100-search-by-folder.sh
index 2844ec61..79c266e4 100755
--- a/test/T100-search-by-folder.sh
+++ b/test/T100-search-by-folder.sh
@@ -15,7 +15,7 @@ add_message '[dir]=things/bad' '[subject]="Bites, stings, sad feelings"'
 test_begin_subtest "Single-world folder: specification (multiple results)"
 output=$(notmuch search folder:bad folder:bad/news folder:things/bad | notmuch_search_sanitize)
 test_expect_equal "$output" "thread:XXX   2001-01-05 [1/1] Notmuch Test Suite; To the bone (inbox unread)
-thread:XXX   2001-01-05 [1/1] Notmuch Test Suite; Bears (inbox unread)
+thread:XXX   2001-01-05 [1/1(2)] Notmuch Test Suite; Bears (inbox unread)
 thread:XXX   2001-01-05 [1/1] Notmuch Test Suite; Bites, stings, sad feelings (inbox unread)"
 
 test_begin_subtest "Top level folder"
@@ -24,7 +24,7 @@ test_expect_equal "$output" "thread:XXX   2001-01-05 [1/1] Notmuch Test Suite; T
 
 test_begin_subtest "Two-word path to narrow results to one"
 output=$(notmuch search folder:bad/news | notmuch_search_sanitize)
-test_expect_equal "$output" "thread:XXX   2001-01-05 [1/1] Notmuch Test Suite; Bears (inbox unread)"
+test_expect_equal "$output" "thread:XXX   2001-01-05 [1/1(2)] Notmuch Test Suite; Bears (inbox unread)"
 
 test_begin_subtest "Folder search with --output=files"
 output=$(notmuch search --output=files folder:bad/news | notmuch_search_files_sanitize)
diff --git a/test/T340-maildir-sync.sh b/test/T340-maildir-sync.sh
index 6d956635..959bf8d8 100755
--- a/test/T340-maildir-sync.sh
+++ b/test/T340-maildir-sync.sh
@@ -153,14 +153,14 @@ cp "$MAIL_DIR/cur/duplicated-message:2," "$MAIL_DIR/cur/duplicated-message-copy:
 NOTMUCH_NEW > output
 notmuch search subject:"Duplicated message" | notmuch_search_sanitize >> output
 test_expect_equal "$(< output)" "No new mail.
-thread:XXX   2001-01-05 [1/1] Notmuch Test Suite; Duplicated message (inbox replied)"
+thread:XXX   2001-01-05 [1/1(2)] Notmuch Test Suite; Duplicated message (inbox replied)"
 
 test_begin_subtest "Adding duplicate message without flags does not remove tags"
 cp "$MAIL_DIR/cur/duplicated-message-copy:2,RS" "$MAIL_DIR/cur/duplicated-message-another-copy:2,"
 NOTMUCH_NEW > output
 notmuch search subject:"Duplicated message" | notmuch_search_sanitize >> output
 test_expect_equal "$(< output)" "No new mail.
-thread:XXX   2001-01-05 [1/1] Notmuch Test Suite; Duplicated message (inbox replied)"
+thread:XXX   2001-01-05 [1/1(3)] Notmuch Test Suite; Duplicated message (inbox replied)"
 
 test_begin_subtest "Tag changes modify flags of multiple files"
 notmuch tag -replied subject:"Duplicated message"
diff --git a/test/T370-search-folder-coherence.sh b/test/T370-search-folder-coherence.sh
index d1cb45ec..8748b3d0 100755
--- a/test/T370-search-folder-coherence.sh
+++ b/test/T370-search-folder-coherence.sh
@@ -32,7 +32,7 @@ test_expect_equal_file EXPECTED OUTPUT
 
 test_begin_subtest "Test matches folder:spam"
 output=$(notmuch search folder:spam)
-test_expect_equal "$output" "thread:0000000000000001   2001-01-05 [1/1] Notmuch Test Suite; Single new message (inbox unread)"
+test_expect_equal "$output" "thread:0000000000000001   2001-01-05 [1/1(2)] Notmuch Test Suite; Single new message (inbox unread)"
 
 test_begin_subtest "Remove folder:spam copy of email"
 rm $dir/spam/$(basename $file_x)
diff --git a/test/T500-search-date.sh b/test/T500-search-date.sh
index f10207f8..fc4ecdc3 100755
--- a/test/T500-search-date.sh
+++ b/test/T500-search-date.sh
@@ -23,7 +23,7 @@ notmuch search date:18-Nov-2009_02:19:26-0800..2009-11-18_04:49:52-06:00 | notmu
 cat <<EOF >EXPECTED
 thread:XXX   2009-11-18 [1/3] Carl Worth| Jan Janak; [notmuch] What a great idea! (inbox unread)
 thread:XXX   2009-11-18 [1/2] Carl Worth| Jan Janak; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
-thread:XXX   2009-11-18 [1/3] Carl Worth| Aron Griffis, Keith Packard; [notmuch] archive (inbox unread)
+thread:XXX   2009-11-18 [1/3(4)] Carl Worth| Aron Griffis, Keith Packard; [notmuch] archive (inbox unread)
 thread:XXX   2009-11-18 [1/2] Carl Worth| Keith Packard; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
 EOF
 test_expect_equal_file EXPECTED OUTPUT
diff --git a/test/T650-regexp-query.sh b/test/T650-regexp-query.sh
index b7bdda11..d5def764 100755
--- a/test/T650-regexp-query.sh
+++ b/test/T650-regexp-query.sh
@@ -29,7 +29,7 @@ test_expect_equal_file EXPECTED OUTPUT
 test_begin_subtest "unanchored folder:// specification"
 output=$(notmuch search folder:/bad/ | notmuch_search_sanitize)
 test_expect_equal "$output" "thread:XXX   2001-01-05 [1/1] Notmuch Test Suite; To the bone (inbox unread)
-thread:XXX   2001-01-05 [1/1] Notmuch Test Suite; Bears (inbox unread)
+thread:XXX   2001-01-05 [1/1(2)] Notmuch Test Suite; Bears (inbox unread)
 thread:XXX   2001-01-05 [1/1] Notmuch Test Suite; Bites, stings, sad feelings (inbox unread)"
 
 test_begin_subtest "anchored folder:// search"
@@ -39,7 +39,7 @@ test_expect_equal "$output" "thread:XXX   2001-01-05 [1/1] Notmuch Test Suite; T
 test_begin_subtest "unanchored path:// specification"
 output=$(notmuch search path:/bad/ | notmuch_search_sanitize)
 test_expect_equal "$output" "thread:XXX   2001-01-05 [1/1] Notmuch Test Suite; To the bone (inbox unread)
-thread:XXX   2001-01-05 [1/1] Notmuch Test Suite; Bears (inbox unread)
+thread:XXX   2001-01-05 [1/1(2)] Notmuch Test Suite; Bears (inbox unread)
 thread:XXX   2001-01-05 [1/1] Notmuch Test Suite; Bites, stings, sad feelings (inbox unread)"
 
 test_begin_subtest "anchored path:// search"
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [patch v3 10/12] lib: add _notmuch_message_remove_indexed_terms
  2017-06-04 12:32 v3 of index multiple files per msg-id, add reindex command David Bremner
                   ` (8 preceding siblings ...)
  2017-06-04 12:32 ` [patch v3 09/12] cli/search: print total number of files matched in summary output David Bremner
@ 2017-06-04 12:32 ` David Bremner
  2017-06-04 12:32 ` [patch v3 11/12] lib: add notmuch_message_reindex David Bremner
  2017-06-04 12:32 ` [patch v3 12/12] add "notmuch reindex" subcommand David Bremner
  11 siblings, 0 replies; 17+ messages in thread
From: David Bremner @ 2017-06-04 12:32 UTC (permalink / raw)
  To: notmuch, notmuch

Testing will be provided via use in notmuch_message_reindex
---
 lib/message.cc        | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++
 lib/notmuch-private.h |  2 ++
 2 files changed, 56 insertions(+)

diff --git a/lib/message.cc b/lib/message.cc
index f3d1e159..33c24354 100644
--- a/lib/message.cc
+++ b/lib/message.cc
@@ -599,6 +599,59 @@ _notmuch_message_remove_terms (notmuch_message_t *message, const char *prefix)
     }
 }
 
+
+/* Remove all terms generated by indexing, i.e. not tags or
+ * properties, along with any automatic tags*/
+notmuch_private_status_t
+_notmuch_message_remove_indexed_terms (notmuch_message_t *message)
+{
+    Xapian::TermIterator i;
+
+    const std::string
+	id_prefix = _find_prefix ("id"),
+	property_prefix = _find_prefix ("property"),
+	tag_prefix = _find_prefix ("tag"),
+	type_prefix = _find_prefix ("type");
+
+    for (i = message->doc.termlist_begin ();
+	 i != message->doc.termlist_end (); i++) {
+
+	const std::string term = *i;
+
+	if (term.compare (0, type_prefix.size (), type_prefix) == 0)
+	    continue;
+
+	if (term.compare (0, id_prefix.size (), id_prefix) == 0)
+	    continue;
+
+	if (term.compare (0, property_prefix.size (), property_prefix) == 0)
+	    continue;
+
+	if (term.compare (0, tag_prefix.size (), tag_prefix) == 0 &&
+	    term.compare (1, strlen("encrypted"), "encrypted") != 0 &&
+	    term.compare (1, strlen("signed"), "signed") != 0 &&
+	    term.compare (1, strlen("attachment"), "attachment") != 0)
+	    continue;
+
+	try {
+	    message->doc.remove_term ((*i));
+	    message->modified = TRUE;
+	} catch (const Xapian::InvalidArgumentError) {
+	    /* Ignore failure to remove non-existent term. */
+	} catch (const Xapian::Error &error) {
+	    notmuch_database_t *notmuch = message->notmuch;
+
+	    if (!notmuch->exception_reported) {
+		_notmuch_database_log(_notmuch_message_database (message), "A Xapian exception occurred creating message: %s\n",
+				      error.get_msg().c_str());
+		notmuch->exception_reported = TRUE;
+	    }
+	    return NOTMUCH_PRIVATE_STATUS_XAPIAN_EXCEPTION;
+	}
+    }
+    return NOTMUCH_PRIVATE_STATUS_SUCCESS;
+}
+
 /* Return true if p points at "new" or "cur". */
 static bool is_maildir (const char *p)
 {
@@ -646,6 +699,7 @@ _notmuch_message_add_folder_terms (notmuch_message_t *message,
 
     talloc_free (folder);
 
+    message->modified = TRUE;
     return NOTMUCH_STATUS_SUCCESS;
 }
 
diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
index bcd22bee..a4a20d8e 100644
--- a/lib/notmuch-private.h
+++ b/lib/notmuch-private.h
@@ -526,6 +526,8 @@ _notmuch_message_add_reply (notmuch_message_t *message,
 notmuch_database_t *
 _notmuch_message_database (notmuch_message_t *message);
 
+void
+_notmuch_message_remove_unprefixed_terms (notmuch_message_t *message);
 /* sha1.c */
 
 char *
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [patch v3 11/12] lib: add notmuch_message_reindex
  2017-06-04 12:32 v3 of index multiple files per msg-id, add reindex command David Bremner
                   ` (9 preceding siblings ...)
  2017-06-04 12:32 ` [patch v3 10/12] lib: add _notmuch_message_remove_indexed_terms David Bremner
@ 2017-06-04 12:32 ` David Bremner
  2017-06-04 12:32 ` [patch v3 12/12] add "notmuch reindex" subcommand David Bremner
  11 siblings, 0 replies; 17+ messages in thread
From: David Bremner @ 2017-06-04 12:32 UTC (permalink / raw)
  To: notmuch, notmuch; +Cc: Daniel Kahn Gillmor

From: Daniel Kahn Gillmor <dkg@fifthhorseman.net>

This new function asks the database to reindex a given message.
The parameter `indexopts` is currently ignored, but is intended to
provide an extensible API to support e.g. changing the encryption or
filtering status (e.g. whether and how certain non-plaintext parts are
indexed).
---
 lib/add-message.cc    |   2 +-
 lib/message.cc        | 108 +++++++++++++++++++++++++++++++++++++++++++++++++-
 lib/notmuch-private.h |   6 +++
 lib/notmuch.h         |  15 +++++++
 4 files changed, 129 insertions(+), 2 deletions(-)

diff --git a/lib/add-message.cc b/lib/add-message.cc
index ae9b14a7..26405742 100644
--- a/lib/add-message.cc
+++ b/lib/add-message.cc
@@ -220,7 +220,7 @@ _my_talloc_free_for_g_hash (void *ptr)
     talloc_free (ptr);
 }
 
-static notmuch_status_t
+notmuch_status_t
 _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
 					   notmuch_message_t *message,
 					   notmuch_message_file_t *message_file,
diff --git a/lib/message.cc b/lib/message.cc
index 33c24354..5b6abc83 100644
--- a/lib/message.cc
+++ b/lib/message.cc
@@ -579,7 +579,9 @@ void
 _notmuch_message_remove_terms (notmuch_message_t *message, const char *prefix)
 {
     Xapian::TermIterator i;
-    size_t prefix_len = strlen (prefix);
+    size_t prefix_len = 0;
+
+    prefix_len = strlen (prefix);
 
     while (1) {
 	i = message->doc.termlist_begin ();
@@ -1934,3 +1936,107 @@ _notmuch_message_frozen (notmuch_message_t *message)
 {
     return message->frozen;
 }
+
+notmuch_status_t
+notmuch_message_reindex (notmuch_message_t *message,
+			 notmuch_param_t unused (*indexopts))
+{
+    notmuch_database_t *notmuch = NULL;
+    notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
+    notmuch_private_status_t private_status;
+    notmuch_filenames_t *orig_filenames = NULL;
+    const char *orig_thread_id = NULL;
+    notmuch_message_file_t *message_file = NULL;
+
+    int found = 0;
+
+    if (message == NULL)
+	return NOTMUCH_STATUS_NULL_POINTER;
+
+    /* Save in case we need to delete message */
+    orig_thread_id = notmuch_message_get_thread_id (message);
+    if (!orig_thread_id) {
+	/* XXX TODO: make up new error return? */
+	INTERNAL_ERROR ("message without thread-id");
+    }
+
+    /* strdup it because the metadata may be invalidated */
+    orig_thread_id = talloc_strdup (message, orig_thread_id);
+
+    notmuch = _notmuch_message_database (message);
+
+    ret = _notmuch_database_ensure_writable (notmuch);
+    if (ret)
+	return ret;
+
+    orig_filenames = notmuch_message_get_filenames (message);
+
+    private_status = _notmuch_message_remove_indexed_terms (message);
+    if (private_status) {
+	ret = COERCE_STATUS(private_status, "error removing terms");
+	goto DONE;
+    }
+
+    /* re-add the filenames with the associated indexopts */
+    for (; notmuch_filenames_valid (orig_filenames);
+	 notmuch_filenames_move_to_next (orig_filenames)) {
+
+	const char *date;
+	const char *from, *to, *subject;
+	char *message_id = NULL;
+	const char *thread_id = NULL;
+
+	const char *filename = notmuch_filenames_get (orig_filenames);
+
+	message_file = _notmuch_message_file_open (notmuch, filename);
+	if (message_file == NULL)
+	    continue;
+
+	ret = _notmuch_message_file_get_headers (message_file,
+						 &from, &subject, &to, &date,
+						 &message_id);
+	if (ret)
+	    goto DONE;
+
+	/* XXX TODO: deal with changing message id? */
+
+	_notmuch_message_add_filename (message, filename);
+
+	ret = _notmuch_database_link_message_to_parents (notmuch, message,
+							 message_file,
+							 &thread_id);
+	if (ret)
+	    goto DONE;
+
+	if (thread_id == NULL)
+	    thread_id = orig_thread_id;
+
+	_notmuch_message_add_term (message, "thread", thread_id);
+	_notmuch_message_set_header_values (message, date, from, subject);
+
+	ret = _notmuch_message_index_file (message, message_file);
+
+	if (ret == NOTMUCH_STATUS_FILE_ERROR)
+	    continue;
+	if (ret)
+	    goto DONE;
+
+	found++;
+	_notmuch_message_file_close (message_file);
+	message_file = NULL;
+    }
+    if (found == 0) {
+	/* put back thread id to help cleanup */
+	_notmuch_message_add_term (message, "thread", orig_thread_id);
+	ret = _notmuch_message_delete (message);
+    } else {
+	_notmuch_message_sync (message);
+    }
+
+ DONE:
+    if (message_file)
+	_notmuch_message_file_close (message_file);
+
+    /* XXX TODO destroy orig_filenames? */
+    return ret;
+}
diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
index a4a20d8e..f4250442 100644
--- a/lib/notmuch-private.h
+++ b/lib/notmuch-private.h
@@ -436,6 +436,12 @@ _notmuch_message_file_get_headers (notmuch_message_file_t *message_file,
 const char *
 _notmuch_message_file_get_filename (notmuch_message_file_t *message);
 
+/* add-message.cc */
+notmuch_status_t
+_notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
+					   notmuch_message_t *message,
+					   notmuch_message_file_t *message_file,
+					   const char **thread_id);
 /* index.cc */
 
 notmuch_status_t
diff --git a/lib/notmuch.h b/lib/notmuch.h
index 7bd5346f..df0d7d2c 100644
--- a/lib/notmuch.h
+++ b/lib/notmuch.h
@@ -219,6 +219,7 @@ typedef struct _notmuch_tags notmuch_tags_t;
 typedef struct _notmuch_directory notmuch_directory_t;
 typedef struct _notmuch_filenames notmuch_filenames_t;
 typedef struct _notmuch_config_list notmuch_config_list_t;
+typedef struct _notmuch_param notmuch_param_t;
 #endif /* __DOXYGEN__ */
 
 /**
@@ -1394,6 +1395,20 @@ notmuch_filenames_t *
 notmuch_message_get_filenames (notmuch_message_t *message);
 
 /**
+ * Re-index the e-mail corresponding to 'message' using the supplied index options
+ *
+ * Returns the status of the re-index operation.  (see the return
+ * codes documented in notmuch_database_add_message)
+ *
+ * After reindexing, the user should discard the message object passed
+ * in here by calling notmuch_message_destroy, since it refers to the
+ * original message, not to the reindexed message.
+ */
+notmuch_status_t
+notmuch_message_reindex (notmuch_message_t *message,
+			 notmuch_param_t *indexopts);
+
+/**
  * Message flags.
  */
 typedef enum _notmuch_message_flag {
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [patch v3 12/12] add "notmuch reindex" subcommand
  2017-06-04 12:32 v3 of index multiple files per msg-id, add reindex command David Bremner
                   ` (10 preceding siblings ...)
  2017-06-04 12:32 ` [patch v3 11/12] lib: add notmuch_message_reindex David Bremner
@ 2017-06-04 12:32 ` David Bremner
  11 siblings, 0 replies; 17+ messages in thread
From: David Bremner @ 2017-06-04 12:32 UTC (permalink / raw)
  To: notmuch, notmuch; +Cc: Daniel Kahn Gillmor

From: Daniel Kahn Gillmor <dkg@fifthhorseman.net>

This new subcommand takes a set of search terms, and re-indexes the
list of matching messages.
---
 Makefile.local                    |   1 +
 doc/conf.py                       |   4 ++
 doc/index.rst                     |   1 +
 doc/man1/notmuch-reindex.rst      |  29 +++++++++
 doc/man1/notmuch.rst              |   4 +-
 doc/man7/notmuch-search-terms.rst |   7 +-
 notmuch-client.h                  |   3 +
 notmuch-reindex.c                 | 134 ++++++++++++++++++++++++++++++++++++++
 notmuch.c                         |   2 +
 performance-test/M04-reindex.sh   |  11 ++++
 performance-test/T03-reindex.sh   |  13 ++++
 test/T670-duplicate-mid.sh        |   7 ++
 test/T700-reindex.sh              |  79 ++++++++++++++++++++++
 13 files changed, 291 insertions(+), 4 deletions(-)
 create mode 100644 doc/man1/notmuch-reindex.rst
 create mode 100644 notmuch-reindex.c
 create mode 100755 performance-test/M04-reindex.sh
 create mode 100755 performance-test/T03-reindex.sh
 create mode 100755 test/T700-reindex.sh

diff --git a/Makefile.local b/Makefile.local
index 6bc78ef8..af12ca7f 100644
--- a/Makefile.local
+++ b/Makefile.local
@@ -225,6 +225,7 @@ notmuch_client_srcs =		\
 	notmuch-dump.c		\
 	notmuch-insert.c	\
 	notmuch-new.c		\
+	notmuch-reindex.c       \
 	notmuch-reply.c		\
 	notmuch-restore.c	\
 	notmuch-search.c	\
diff --git a/doc/conf.py b/doc/conf.py
index a3d82696..aa864b3c 100644
--- a/doc/conf.py
+++ b/doc/conf.py
@@ -95,6 +95,10 @@ man_pages = [
      u'incorporate new mail into the notmuch database',
      [notmuch_authors], 1),
 
+    ('man1/notmuch-reindex', 'notmuch-reindex',
+     u're-index matching messages',
+     [notmuch_authors], 1),
+
     ('man1/notmuch-reply', 'notmuch-reply',
      u'constructs a reply template for a set of messages',
      [notmuch_authors], 1),
diff --git a/doc/index.rst b/doc/index.rst
index 344606d9..aa6c9f40 100644
--- a/doc/index.rst
+++ b/doc/index.rst
@@ -18,6 +18,7 @@ Contents:
    man5/notmuch-hooks
    man1/notmuch-insert
    man1/notmuch-new
+   man1/notmuch-reindex
    man1/notmuch-reply
    man1/notmuch-restore
    man1/notmuch-search
diff --git a/doc/man1/notmuch-reindex.rst b/doc/man1/notmuch-reindex.rst
new file mode 100644
index 00000000..e39cc4ee
--- /dev/null
+++ b/doc/man1/notmuch-reindex.rst
@@ -0,0 +1,29 @@
+===============
+notmuch-reindex
+===============
+
+SYNOPSIS
+========
+
+**notmuch** **reindex** [*option* ...] <*search-term*> ...
+
+DESCRIPTION
+===========
+
+Re-index all messages matching the search terms.
+
+See **notmuch-search-terms(7)** for details of the supported syntax for
+<*search-term*\ >.
+
+The **reindex** command searches for all messages matching the
+supplied search terms, and re-creates the full-text index on these
+messages using the supplied options.
+
+SEE ALSO
+========
+
+**notmuch(1)**, **notmuch-config(1)**, **notmuch-count(1)**,
+**notmuch-dump(1)**, **notmuch-hooks(5)**, **notmuch-insert(1)**,
+**notmuch-new(1)**,
+**notmuch-reply(1)**, **notmuch-restore(1)**, **notmuch-search(1)**,
+**notmuch-search-terms(7)**, **notmuch-show(1)**, **notmuch-tag(1)**
diff --git a/doc/man1/notmuch.rst b/doc/man1/notmuch.rst
index fbd7f381..b2a8376e 100644
--- a/doc/man1/notmuch.rst
+++ b/doc/man1/notmuch.rst
@@ -149,8 +149,8 @@ SEE ALSO
 
 **notmuch-address(1)**, **notmuch-compact(1)**, **notmuch-config(1)**,
 **notmuch-count(1)**, **notmuch-dump(1)**, **notmuch-hooks(5)**,
-**notmuch-insert(1)**, **notmuch-new(1)**, **notmuch-reply(1)**,
-**notmuch-restore(1)**, **notmuch-search(1)**,
+**notmuch-insert(1)**, **notmuch-new(1)**, **notmuch-reindex(1)**,
+**notmuch-reply(1)**, **notmuch-restore(1)**, **notmuch-search(1)**,
 **notmuch-search-terms(7)**, **notmuch-show(1)**, **notmuch-tag(1)**
 
 The notmuch website: **https://notmuchmail.org**
diff --git a/doc/man7/notmuch-search-terms.rst b/doc/man7/notmuch-search-terms.rst
index 47cab48d..dd76972e 100644
--- a/doc/man7/notmuch-search-terms.rst
+++ b/doc/man7/notmuch-search-terms.rst
@@ -9,6 +9,8 @@ SYNOPSIS
 
 **notmuch** **dump** [--format=(batch-tag|sup)] [--] [--output=<*file*>] [--] [<*search-term*> ...]
 
+**notmuch** **reindex** [option ...] <*search-term*> ...
+
 **notmuch** **search** [option ...] <*search-term*> ...
 
 **notmuch** **show** [option ...] <*search-term*> ...
@@ -421,5 +423,6 @@ SEE ALSO
 
 **notmuch(1)**, **notmuch-config(1)**, **notmuch-count(1)**,
 **notmuch-dump(1)**, **notmuch-hooks(5)**, **notmuch-insert(1)**,
-**notmuch-new(1)**, **notmuch-reply(1)**, **notmuch-restore(1)**,
-**notmuch-search(1)**, **notmuch-show(1)**, **notmuch-tag(1)**
+**notmuch-new(1)**, **notmuch-reindex(1)**, **notmuch-reply(1)**,
+**notmuch-restore(1)**, **notmuch-search(1)**, **notmuch-show(1)**,
+**notmuch-tag(1)**
diff --git a/notmuch-client.h b/notmuch-client.h
index 62d4bcec..d6234fca 100644
--- a/notmuch-client.h
+++ b/notmuch-client.h
@@ -197,6 +197,9 @@ int
 notmuch_insert_command (notmuch_config_t *config, int argc, char *argv[]);
 
 int
+notmuch_reindex_command (notmuch_config_t *config, int argc, char *argv[]);
+
+int
 notmuch_reply_command (notmuch_config_t *config, int argc, char *argv[]);
 
 int
diff --git a/notmuch-reindex.c b/notmuch-reindex.c
new file mode 100644
index 00000000..44223042
--- /dev/null
+++ b/notmuch-reindex.c
@@ -0,0 +1,134 @@
+/* notmuch - Not much of an email program, (just index and search)
+ *
+ * Copyright © 2016 Daniel Kahn Gillmor
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see http://www.gnu.org/licenses/ .
+ *
+ * Author: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
+ */
+
+#include "notmuch-client.h"
+#include "string-util.h"
+
+static volatile sig_atomic_t interrupted;
+
+static void
+handle_sigint (unused (int sig))
+{
+    static char msg[] = "Stopping...         \n";
+
+    /* This write is "opportunistic", so it's okay to ignore the
+     * result.  It is not required for correctness, and if it does
+     * fail or produce a short write, we want to get out of the signal
+     * handler as quickly as possible, not retry it. */
+    IGNORE_RESULT (write (2, msg, sizeof (msg) - 1));
+    interrupted = 1;
+}
+
+/* reindex all messages matching 'query_string' using the passed-in indexopts
+ */
+static int
+reindex_query (notmuch_database_t *notmuch, const char *query_string,
+	       notmuch_param_t *indexopts)
+{
+    notmuch_query_t *query;
+    notmuch_messages_t *messages;
+    notmuch_message_t *message;
+    notmuch_status_t status;
+
+    notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
+
+    query = notmuch_query_create (notmuch, query_string);
+    if (query == NULL) {
+	fprintf (stderr, "Out of memory.\n");
+	return 1;
+    }
+
+    /* reindexing is not interested in any special sort order */
+    notmuch_query_set_sort (query, NOTMUCH_SORT_UNSORTED);
+
+    status = notmuch_query_search_messages (query, &messages);
+    if (print_status_query ("notmuch reindex", query, status))
+	return status;
+
+    ret = notmuch_database_begin_atomic (notmuch);
+    for (;
+	 notmuch_messages_valid (messages) && ! interrupted;
+	 notmuch_messages_move_to_next (messages)) {
+	message = notmuch_messages_get (messages);
+
+	ret = notmuch_message_reindex(message, indexopts);
+	if (ret != NOTMUCH_STATUS_SUCCESS)
+	    break;
+    }
+
+    if (!ret)
+	ret = notmuch_database_end_atomic (notmuch);
+
+    notmuch_query_destroy (query);
+
+    return ret || interrupted;
+}
+
+int
+notmuch_reindex_command (notmuch_config_t *config, int argc, char *argv[])
+{
+    char *query_string = NULL;
+    notmuch_database_t *notmuch;
+    struct sigaction action;
+    int opt_index;
+    int ret;
+    notmuch_param_t *indexopts = NULL;
+
+    /* Set up our handler for SIGINT */
+    memset (&action, 0, sizeof (struct sigaction));
+    action.sa_handler = handle_sigint;
+    sigemptyset (&action.sa_mask);
+    action.sa_flags = SA_RESTART;
+    sigaction (SIGINT, &action, NULL);
+
+    notmuch_opt_desc_t options[] = {
+	{ NOTMUCH_OPT_INHERIT, (void *) &notmuch_shared_options, NULL, 0, 0 },
+	{ 0, 0, 0, 0, 0 }
+    };
+
+    opt_index = parse_arguments (argc, argv, options, 1);
+    if (opt_index < 0)
+	return EXIT_FAILURE;
+
+    notmuch_process_shared_options (argv[0]);
+
+    if (notmuch_database_open (notmuch_config_get_database_path (config),
+			       NOTMUCH_DATABASE_MODE_READ_WRITE, &notmuch))
+	return EXIT_FAILURE;
+
+    notmuch_exit_if_unmatched_db_uuid (notmuch);
+
+    query_string = query_string_from_args (config, argc-opt_index, argv+opt_index);
+    if (query_string == NULL) {
+	fprintf (stderr, "Out of memory\n");
+	return EXIT_FAILURE;
+    }
+
+    if (*query_string == '\0') {
+	fprintf (stderr, "Error: notmuch reindex requires at least one search term.\n");
+	return EXIT_FAILURE;
+    }
+    
+    ret = reindex_query (notmuch, query_string, indexopts);
+
+    notmuch_database_destroy (notmuch);
+
+    return ret || interrupted ? EXIT_FAILURE : EXIT_SUCCESS;
+}
diff --git a/notmuch.c b/notmuch.c
index 8e332ce6..201c7454 100644
--- a/notmuch.c
+++ b/notmuch.c
@@ -123,6 +123,8 @@ static command_t commands[] = {
       "Restore the tags from the given dump file (see 'dump')." },
     { "compact", notmuch_compact_command, NOTMUCH_CONFIG_OPEN,
       "Compact the notmuch database." },
+    { "reindex", notmuch_reindex_command, NOTMUCH_CONFIG_OPEN,
+      "Re-index all messages matching the search terms." },
     { "config", notmuch_config_command, NOTMUCH_CONFIG_OPEN,
       "Get or set settings in the notmuch configuration file." },
     { "help", notmuch_help_command, NOTMUCH_CONFIG_CREATE, /* create but don't save config */
diff --git a/performance-test/M04-reindex.sh b/performance-test/M04-reindex.sh
new file mode 100755
index 00000000..d36e061b
--- /dev/null
+++ b/performance-test/M04-reindex.sh
@@ -0,0 +1,11 @@
+#!/bin/bash
+
+test_description='reindex'
+
+. ./perf-test-lib.sh || exit 1
+
+memory_start
+
+memory_run 'reindex *' "notmuch reindex '*'"
+
+memory_done
diff --git a/performance-test/T03-reindex.sh b/performance-test/T03-reindex.sh
new file mode 100755
index 00000000..7af2d22d
--- /dev/null
+++ b/performance-test/T03-reindex.sh
@@ -0,0 +1,13 @@
+#!/bin/bash
+
+test_description='tagging'
+
+. ./perf-test-lib.sh || exit 1
+
+time_start
+
+time_run 'reindex *' "notmuch reindex '*'"
+time_run 'reindex *' "notmuch reindex '*'"
+time_run 'reindex *' "notmuch reindex '*'"
+
+time_done
diff --git a/test/T670-duplicate-mid.sh b/test/T670-duplicate-mid.sh
index f1952555..da5ce5d4 100755
--- a/test/T670-duplicate-mid.sh
+++ b/test/T670-duplicate-mid.sh
@@ -23,4 +23,11 @@ EOF
 notmuch search --output=files "sekrit" | notmuch_dir_sanitize > OUTPUT
 test_expect_equal_file EXPECTED OUTPUT
 
+rm ${MAIL_DIR}/copy3
+test_begin_subtest 'reindex drops terms in duplicate file'
+cp /dev/null EXPECTED
+notmuch reindex '*'
+notmuch search --output=files "sekrit" | notmuch_dir_sanitize > OUTPUT
+test_expect_equal_file EXPECTED OUTPUT
+
 test_done
diff --git a/test/T700-reindex.sh b/test/T700-reindex.sh
new file mode 100755
index 00000000..051fbb3c
--- /dev/null
+++ b/test/T700-reindex.sh
@@ -0,0 +1,79 @@
+#!/usr/bin/env bash
+test_description='reindexing messages'
+. ./test-lib.sh || exit 1
+
+add_email_corpus
+
+notmuch tag +usertag1 '*'
+
+notmuch search '*' | notmuch_search_sanitize > initial-threads
+notmuch search --output=messages '*' > initial-message-ids
+notmuch dump > initial-dump
+
+test_begin_subtest 'reindex preserves threads'
+notmuch reindex '*'
+notmuch search '*' | notmuch_search_sanitize > OUTPUT
+test_expect_equal_file initial-threads OUTPUT
+
+test_begin_subtest 'reindex after removing duplicate file preserves threads'
+# remove one copy
+sed 's,3/3(4),3/3,' < initial-threads > EXPECTED
+mv $MAIL_DIR/bar/18:2, duplicate-msg-1.eml
+notmuch reindex '*'
+notmuch search '*' | notmuch_search_sanitize > OUTPUT
+test_expect_equal_file EXPECTED OUTPUT
+
+test_begin_subtest 'reindex preserves message-ids'
+notmuch reindex '*'
+notmuch search --output=messages '*' > OUTPUT
+test_expect_equal_file initial-message-ids OUTPUT
+
+test_begin_subtest 'reindex preserves tags'
+notmuch reindex '*'
+notmuch dump > OUTPUT
+test_expect_equal_file initial-dump OUTPUT
+
+test_begin_subtest 'reindex moves a message between threads'
+notmuch search --output=threads id:87iqd9rn3l.fsf@vertex.dottedmag > EXPECTED
+# re-parent
+sed -i 's/1258471718-6781-1-git-send-email-dottedmag@dottedmag.net/87iqd9rn3l.fsf@vertex.dottedmag/' $MAIL_DIR/02:2,*
+notmuch reindex id:1258471718-6781-2-git-send-email-dottedmag@dottedmag.net
+notmuch search --output=threads id:1258471718-6781-2-git-send-email-dottedmag@dottedmag.net > OUTPUT
+test_expect_equal_file EXPECTED OUTPUT
+
+test_begin_subtest 'reindex detects removal of all files'
+notmuch search --output=messages not id:20091117232137.GA7669@griffis1.net> EXPECTED
+# remove both copies
+mv $MAIL_DIR/cur/51:2,* duplicate-message-2.eml
+notmuch reindex id:20091117232137.GA7669@griffis1.net
+notmuch search --output=messages '*' > OUTPUT
+test_expect_equal_file EXPECTED OUTPUT
+
+test_begin_subtest "reindex preserves properties"
+cat <<EOF > prop-dump
+#= 1258471718-6781-1-git-send-email-dottedmag@dottedmag.net userprop=userval
+#= 1258471718-6781-2-git-send-email-dottedmag@dottedmag.net userprop=userval
+#= 1258491078-29658-1-git-send-email-dottedmag@dottedmag.net userprop=userval1
+#= 20091117190054.GU3165@dottiness.seas.harvard.edu userprop=userval
+#= 20091117203301.GV3165@dottiness.seas.harvard.edu userprop=userval3
+#= 87fx8can9z.fsf@vertex.dottedmag userprop=userval2
+#= 87iqd9rn3l.fsf@vertex.dottedmag userprop=userval
+#= 87lji4lx9v.fsf@yoom.home.cworth.org userprop=userval3
+#= 87lji5cbwo.fsf@yoom.home.cworth.org userprop=userval
+#= cf0c4d610911171136h1713aa59w9cf9aa31f052ad0a@mail.gmail.com userprop=userval
+EOF
+notmuch restore < prop-dump
+notmuch reindex '*'
+notmuch dump | grep '^#=' | sort > OUTPUT
+test_expect_equal_file prop-dump OUTPUT
+test_done
+
+add_email_corpus lkml
+
+test_begin_subtest "reindex of lkml corpus preserves threads"
+notmuch search '*' | notmuch_search_sanitize > EXPECTED
+notmuch reindex '*'
+notmuch search '*' | notmuch_search_sanitize > OUTPUT
+test_expect_equal_file EXPECTED OUTPUT
+
+test_done
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [patch v3 06/12] lib: index message files with duplicate message-ids
  2017-06-04 12:32 ` [patch v3 06/12] lib: index message files with duplicate message-ids David Bremner
@ 2017-06-04 20:34   ` Daniel Kahn Gillmor
  2017-06-06  1:09     ` David Bremner
  2017-06-10 20:02     ` [PATCH] fixup! " David Bremner
  0 siblings, 2 replies; 17+ messages in thread
From: Daniel Kahn Gillmor @ 2017-06-04 20:34 UTC (permalink / raw)
  To: David Bremner, notmuch, notmuch

[-- Attachment #1: Type: text/plain, Size: 1054 bytes --]

On Sun 2017-06-04 09:32:29 -0300, David Bremner wrote:
> The corresponding xapian document just gets more terms added to it,
> but this doesn't seem to break anything. Values on the other hand get
> overwritten, which is a bit annoying, but arguably it is not worse to
> take the values (from, subject, date) from the last file indexed
> rather than the first.

It's certainly a change in behavior, though.  This suggests that i can
send you mail and have it change how an existing message shows up in the
summary view, for example.

for example, i could follow up on the current message with another
message with Message-Id: 20170604123235.24466-7-david@tethera.net and
give it a subject "Re: [patch v3 06/12] lib: do *not* index message
files with duplicate message-ids".  that's a bit odd, no?

I'm not saying it's wrong, i'm just hoping to acknowledge that this
might be a controversial change.

this whole series is pretty elaborate, of course, but so far it's
looking like reasonable steps toward a clearer and more hackable
codebase.

      --dkg

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch v3 06/12] lib: index message files with duplicate message-ids
  2017-06-04 20:34   ` Daniel Kahn Gillmor
@ 2017-06-06  1:09     ` David Bremner
  2017-06-09 10:57       ` David Bremner
  2017-06-10 20:02     ` [PATCH] fixup! " David Bremner
  1 sibling, 1 reply; 17+ messages in thread
From: David Bremner @ 2017-06-06  1:09 UTC (permalink / raw)
  To: Daniel Kahn Gillmor, notmuch, notmuch

Daniel Kahn Gillmor <dkg@fifthhorseman.net> writes:

> On Sun 2017-06-04 09:32:29 -0300, David Bremner wrote:
>> The corresponding xapian document just gets more terms added to it,
>> but this doesn't seem to break anything. Values on the other hand get
>> overwritten, which is a bit annoying, but arguably it is not worse to
>> take the values (from, subject, date) from the last file indexed
>> rather than the first.
[snip]
> for example, i could follow up on the current message with another
> message with Message-Id: 20170604123235.24466-7-david@tethera.net and
> give it a subject "Re: [patch v3 06/12] lib: do *not* index message
> files with duplicate message-ids".  that's a bit odd, no?

Yes, I agree that's a bit strange.  We should make some effort to
display the subject that belongs with a given message body. I think it's
not too hard [1] to preserve the old behaviour of keeping the first
subject, date, and from. This leaves us with a version of the original
hiding message attack, but only for the special case of regex searches,
since those rely exclusively on the value slots.

[1]: should be just a matter of guarding the call to
_notmuch_message_set_header_values() with if (is_new || is_ghost), but
that needs testing.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch v3 06/12] lib: index message files with duplicate message-ids
  2017-06-06  1:09     ` David Bremner
@ 2017-06-09 10:57       ` David Bremner
  0 siblings, 0 replies; 17+ messages in thread
From: David Bremner @ 2017-06-09 10:57 UTC (permalink / raw)
  To: Daniel Kahn Gillmor, notmuch, notmuch

David Bremner <david@tethera.net> writes:

> Daniel Kahn Gillmor <dkg@fifthhorseman.net> writes:

>> for example, i could follow up on the current message with another
>> message with Message-Id: 20170604123235.24466-7-david@tethera.net and
>> give it a subject "Re: [patch v3 06/12] lib: do *not* index message
>> files with duplicate message-ids".  that's a bit odd, no?
>
> Yes, I agree that's a bit strange.  We should make some effort to
> display the subject that belongs with a given message body. I think it's
> not too hard [1] to preserve the old behaviour of keeping the first
> subject, date, and from. This leaves us with a version of the original
> hiding message attack, but only for the special case of regex searches,
> since those rely exclusively on the value slots.

I had a slightly radical idea for how to deal with that. Subject/from
from extra files could be appended to the value slot (e.g. separated by
newlines). Then regexp searches would behave similarly to term based
searches in that matching any file would match the message. We'd have to
be slightly careful about what anchors meant.  A further enhancement
would be to expose the search result as an array. This kind of approach
doesn't really make sense for dates, as we essentially search for those
as numbers, and such a hack would break the built-in xapian range
search.  

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] fixup! lib: index message files with duplicate message-ids
  2017-06-04 20:34   ` Daniel Kahn Gillmor
  2017-06-06  1:09     ` David Bremner
@ 2017-06-10 20:02     ` David Bremner
  1 sibling, 0 replies; 17+ messages in thread
From: David Bremner @ 2017-06-10 20:02 UTC (permalink / raw)
  To: Daniel Kahn Gillmor, David Bremner, notmuch, notmuch

---

This seems to do the trick for preserving the subject as additional files are indexed.

 lib/add-message.cc         | 3 ++-
 test/T670-duplicate-mid.sh | 7 +++++++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/lib/add-message.cc b/lib/add-message.cc
index 26405742..711ed9fa 100644
--- a/lib/add-message.cc
+++ b/lib/add-message.cc
@@ -536,7 +536,8 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
 	if (ret)
 	    goto DONE;
 
-	_notmuch_message_set_header_values (message, date, from, subject);
+	if (is_new || is_ghost)
+	    _notmuch_message_set_header_values (message, date, from, subject);
 
 	ret = _notmuch_message_index_file (message, message_file);
 	if (ret)
diff --git a/test/T670-duplicate-mid.sh b/test/T670-duplicate-mid.sh
index da5ce5d4..ea5e1d6a 100755
--- a/test/T670-duplicate-mid.sh
+++ b/test/T670-duplicate-mid.sh
@@ -5,6 +5,13 @@ test_description="duplicate message ids"
 add_message '[id]="duplicate"' '[subject]="message 1" [filename]=copy1'
 add_message '[id]="duplicate"' '[subject]="message 2" [filename]=copy2'
 
+test_begin_subtest 'First subject preserved'
+cat <<EOF > EXPECTED
+thread:XXX   2001-01-05 [1/1(2)] Notmuch Test Suite; message 1 (inbox unread)
+EOF
+notmuch search id:duplicate | notmuch_search_sanitize > OUTPUT
+test_expect_equal_file EXPECTED OUTPUT
+
 test_begin_subtest 'Search for second subject'
 cat <<EOF >EXPECTED
 MAIL_DIR/copy1
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2017-06-10 20:02 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-06-04 12:32 v3 of index multiple files per msg-id, add reindex command David Bremner
2017-06-04 12:32 ` [patch v3 01/12] lib: isolate n_d_add_message and helper functions into own file David Bremner
2017-06-04 12:32 ` [patch v3 02/12] lib/n_d_add_message: refactor test for new/ghost messages David Bremner
2017-06-04 12:32 ` [patch v3 03/12] lib: factor out message-id parsing to separate file David Bremner
2017-06-04 12:32 ` [patch v3 04/12] lib: refactor notmuch_database_add_message header parsing David Bremner
2017-06-04 12:32 ` [patch v3 05/12] test: add known broken tests for duplicate message id David Bremner
2017-06-04 12:32 ` [patch v3 06/12] lib: index message files with duplicate message-ids David Bremner
2017-06-04 20:34   ` Daniel Kahn Gillmor
2017-06-06  1:09     ` David Bremner
2017-06-09 10:57       ` David Bremner
2017-06-10 20:02     ` [PATCH] fixup! " David Bremner
2017-06-04 12:32 ` [patch v3 07/12] lib: add notmuch_message_count_files David Bremner
2017-06-04 12:32 ` [patch v3 08/12] lib: add notmuch_thread_get_total_files David Bremner
2017-06-04 12:32 ` [patch v3 09/12] cli/search: print total number of files matched in summary output David Bremner
2017-06-04 12:32 ` [patch v3 10/12] lib: add _notmuch_message_remove_indexed_terms David Bremner
2017-06-04 12:32 ` [patch v3 11/12] lib: add notmuch_message_reindex David Bremner
2017-06-04 12:32 ` [patch v3 12/12] add "notmuch reindex" subcommand David Bremner

Code repositories for project(s) associated with this public inbox

	https://yhetil.org/notmuch.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).