* [PATCH v4] lib: replace the header parser with gmime
@ 2014-03-23 20:01 Jani Nikula
2014-03-29 22:11 ` Austin Clements
0 siblings, 1 reply; 3+ messages in thread
From: Jani Nikula @ 2014-03-23 20:01 UTC (permalink / raw)
To: notmuch
The notmuch library includes a full blown message header parser. Yet
the same message headers are parsed by gmime during indexing. Switch
to gmime parsing completely.
These are the main changes:
* Gmime stops header parsing at the first invalid header, and presumes
the message body starts from there. The current parser is quite
liberal in accepting broken headers. The change means we will be
much pickier about accepting invalid messages.
* The current parser converts tabs used in header folding to
spaces. Gmime preserve the tabs. Due to a broken python library used
in mailman, there are plenty of mailing lists that produce headers
with tabs in header folding, and we'll see plenty of tabs. (This
change has been mitigated in preparatory patches.)
* For pure header parsing, the current parser is likely faster than
gmime, which parses the whole message rather than just the
headers. Since we parse the message and its headers using gmime for
indexing anyway, this avoids and extra header parsing round when
adding new messages. In case of duplicate messages, we'll end up
parsing the full message although just headers would be
sufficient. All in all this should still speed up 'notmuch new'.
* Calls to notmuch_message_get_header() may be slightly slower than
previously for headers that are not indexed in the database, due to
parsing of the whole message. Within the notmuch code base, notmuch
reply is the only such user.
---
This is v4 of patches [1] and [2], combining them into one. Patch [3]
is required to make all the tests pass.
The implementation has been changed considerably. Notably we only
cache the decoded headers to a hash table if and when they are
needed. The main reasons for doing caching at all is that gmime
provides raw headers that need decoding, and to minimize changes
outside of message-file.c we want to own and track the allocated
strings instead of pushing the responsibility to callers.
This addresses most comments from Austin.
The diff in message-file.c is rather confusing due to a mix of
added/removed lines. Looking at the patched file is much more
pleasant.
BR,
Jani.
[1] id:bdef45dc21c777130a7ae1fa650f172d8c6eaaf4.1391456555.git.jani@nikula.org
[2] id:31d785c4a3e4b90862a0fdc545d4e900a4c898e2.1391456555.git.jani@nikula.org
[3] id:1395247490-19892-1-git-send-email-jani@nikula.org
---
lib/database.cc | 15 +-
lib/index.cc | 70 +--------
lib/message-file.c | 420 ++++++++++++++++++++------------------------------
lib/notmuch-private.h | 54 +++----
4 files changed, 208 insertions(+), 351 deletions(-)
diff --git a/lib/database.cc b/lib/database.cc
index aef748f75b5d..4750f40cf0fb 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -1971,15 +1971,10 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
if (ret)
goto DONE;
- notmuch_message_file_restrict_headers (message_file,
- "date",
- "from",
- "in-reply-to",
- "message-id",
- "references",
- "subject",
- "to",
- (char *) NULL);
+ /* Parse message up front to get better error status. */
+ ret = notmuch_message_file_parse (message_file);
+ if (ret)
+ goto DONE;
try {
/* Before we do any real work, (especially before doing a
@@ -2066,7 +2061,7 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
date = notmuch_message_file_get_header (message_file, "date");
_notmuch_message_set_header_values (message, date, from, subject);
- ret = _notmuch_message_index_file (message, filename);
+ ret = _notmuch_message_index_file (message, message_file);
if (ret)
goto DONE;
} else {
diff --git a/lib/index.cc b/lib/index.cc
index 78c18cf36d10..46a019325454 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -425,63 +425,15 @@ _index_mime_part (notmuch_message_t *message,
notmuch_status_t
_notmuch_message_index_file (notmuch_message_t *message,
- const char *filename)
+ notmuch_message_file_t *message_file)
{
- GMimeStream *stream = NULL;
- GMimeParser *parser = NULL;
- GMimeMessage *mime_message = NULL;
+ GMimeMessage *mime_message;
InternetAddressList *addresses;
- FILE *file = NULL;
const char *from, *subject;
- notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
- static int initialized = 0;
- char from_buf[5];
- bool is_mbox = false;
- static bool mbox_warning = false;
-
- if (! initialized) {
- g_mime_init (GMIME_ENABLE_RFC2047_WORKAROUNDS);
- initialized = 1;
- }
-
- file = fopen (filename, "r");
- if (! file) {
- fprintf (stderr, "Error opening %s: %s\n", filename, strerror (errno));
- ret = NOTMUCH_STATUS_FILE_ERROR;
- goto DONE;
- }
-
- /* Is this mbox? */
- if (fread (from_buf, sizeof (from_buf), 1, file) == 1 &&
- strncmp (from_buf, "From ", 5) == 0)
- is_mbox = true;
- rewind (file);
- /* Evil GMime steals my FILE* here so I won't fclose it. */
- stream = g_mime_stream_file_new (file);
-
- parser = g_mime_parser_new_with_stream (stream);
- g_mime_parser_set_scan_from (parser, is_mbox);
-
- mime_message = g_mime_parser_construct_message (parser);
-
- if (is_mbox) {
- if (!g_mime_parser_eos (parser)) {
- /* This is a multi-message mbox. */
- ret = NOTMUCH_STATUS_FILE_NOT_EMAIL;
- goto DONE;
- }
- /* For historical reasons, we support single-message mboxes,
- * but this behavior is likely to change in the future, so
- * warn. */
- if (!mbox_warning) {
- mbox_warning = true;
- fprintf (stderr, "\
-Warning: %s is an mbox containing a single message,\n\
-likely caused by misconfigured mail delivery. Support for single-message\n\
-mboxes is deprecated and may be removed in the future.\n", filename);
- }
- }
+ mime_message = notmuch_message_file_get_mime_message (message_file);
+ if (! mime_message)
+ return NOTMUCH_STATUS_FILE_NOT_EMAIL;
from = g_mime_message_get_sender (mime_message);
@@ -502,15 +454,5 @@ mboxes is deprecated and may be removed in the future.\n", filename);
_index_mime_part (message, g_mime_message_get_mime_part (mime_message));
- DONE:
- if (mime_message)
- g_object_unref (mime_message);
-
- if (parser)
- g_object_unref (parser);
-
- if (stream)
- g_object_unref (stream);
-
- return ret;
+ return NOTMUCH_STATUS_SUCCESS;
}
diff --git a/lib/message-file.c b/lib/message-file.c
index a2850c278b5a..88662608d319 100644
--- a/lib/message-file.c
+++ b/lib/message-file.c
@@ -26,30 +26,15 @@
#include <glib.h> /* GHashTable */
-typedef struct {
- char *str;
- size_t size;
- size_t len;
-} header_value_closure_t;
-
struct _notmuch_message_file {
/* File object */
FILE *file;
+ char *filename;
- /* Header storage */
- int restrict_headers;
+ /* Cache for decoded headers */
GHashTable *headers;
- int broken_headers;
- int good_headers;
- size_t header_size; /* Length of full message header in bytes. */
-
- /* Parsing state */
- char *line;
- size_t line_size;
- header_value_closure_t value;
- int parsing_started;
- int parsing_finished;
+ GMimeMessage *message;
};
static int
@@ -76,15 +61,12 @@ strcase_hash (const void *ptr)
static int
_notmuch_message_file_destructor (notmuch_message_file_t *message)
{
- if (message->line)
- free (message->line);
-
- if (message->value.size)
- free (message->value.str);
-
if (message->headers)
g_hash_table_destroy (message->headers);
+ if (message->message)
+ g_object_unref (message->message);
+
if (message->file)
fclose (message->file);
@@ -102,20 +84,17 @@ _notmuch_message_file_open_ctx (void *ctx, const char *filename)
if (unlikely (message == NULL))
return NULL;
+ /* Only needed for error messages during parsing. */
+ message->filename = talloc_strdup (message, filename);
+ if (message->filename == NULL)
+ goto FAIL;
+
talloc_set_destructor (message, _notmuch_message_file_destructor);
message->file = fopen (filename, "r");
if (message->file == NULL)
goto FAIL;
- message->headers = g_hash_table_new_full (strcase_hash,
- strcase_equal,
- free,
- g_free);
-
- message->parsing_started = 0;
- message->parsing_finished = 0;
-
return message;
FAIL:
@@ -137,264 +116,205 @@ notmuch_message_file_close (notmuch_message_file_t *message)
talloc_free (message);
}
-void
-notmuch_message_file_restrict_headersv (notmuch_message_file_t *message,
- va_list va_headers)
+notmuch_status_t
+notmuch_message_file_parse (notmuch_message_file_t *message)
{
- char *header;
+ GMimeStream *stream;
+ GMimeParser *parser;
+ notmuch_status_t status = NOTMUCH_STATUS_SUCCESS;
+ static int initialized = 0;
+ char from_buf[5];
+ notmuch_bool_t is_mbox = FALSE;
+ static notmuch_bool_t mbox_warning = FALSE;
- if (message->parsing_started)
- INTERNAL_ERROR ("notmuch_message_file_restrict_headers called after parsing has started");
+ if (message->message)
+ return NOTMUCH_STATUS_SUCCESS;
- while (1) {
- header = va_arg (va_headers, char*);
- if (header == NULL)
- break;
- g_hash_table_insert (message->headers,
- xstrdup (header), NULL);
+ if (! initialized) {
+ g_mime_init (GMIME_ENABLE_RFC2047_WORKAROUNDS);
+ initialized = 1;
}
- message->restrict_headers = 1;
-}
-
-void
-notmuch_message_file_restrict_headers (notmuch_message_file_t *message, ...)
-{
- va_list va_headers;
-
- va_start (va_headers, message);
+ message->headers = g_hash_table_new_full (strcase_hash, strcase_equal,
+ free, g_free);
+ if (! message->headers)
+ return NOTMUCH_STATUS_OUT_OF_MEMORY;
- notmuch_message_file_restrict_headersv (message, va_headers);
-}
+ /* Is this mbox? */
+ if (fread (from_buf, sizeof (from_buf), 1, message->file) == 1 &&
+ strncmp (from_buf, "From ", 5) == 0)
+ is_mbox = TRUE;
+ rewind (message->file);
-static void
-copy_header_unfolding (header_value_closure_t *value,
- const char *chunk)
-{
- char *last;
+ stream = g_mime_stream_file_new (message->file);
- if (chunk == NULL)
- return;
+ /* We'll own and fclose the FILE* ourselves. */
+ g_mime_stream_file_set_owner (GMIME_STREAM_FILE (stream), FALSE);
- while (*chunk == ' ' || *chunk == '\t')
- chunk++;
+ parser = g_mime_parser_new_with_stream (stream);
+ g_mime_parser_set_scan_from (parser, is_mbox);
- if (value->len + 1 + strlen (chunk) + 1 > value->size) {
- unsigned int new_size = value->size;
- if (value->size == 0)
- new_size = strlen (chunk) + 1;
- else
- while (value->len + 1 + strlen (chunk) + 1 > new_size)
- new_size *= 2;
- value->str = xrealloc (value->str, new_size);
- value->size = new_size;
+ message->message = g_mime_parser_construct_message (parser);
+ if (! message->message) {
+ status = NOTMUCH_STATUS_FILE_NOT_EMAIL;
+ goto DONE;
}
- last = value->str + value->len;
- if (value->len) {
- *last = ' ';
- last++;
- value->len++;
+ if (is_mbox) {
+ if (! g_mime_parser_eos (parser)) {
+ /* This is a multi-message mbox. */
+ status = NOTMUCH_STATUS_FILE_NOT_EMAIL;
+ goto DONE;
+ }
+ /*
+ * For historical reasons, we support single-message mboxes,
+ * but this behavior is likely to change in the future, so
+ * warn.
+ */
+ if (! mbox_warning) {
+ mbox_warning = TRUE;
+ fprintf (stderr, "\
+Warning: %s is an mbox containing a single message,\n\
+likely caused by misconfigured mail delivery. Support for single-message\n\
+mboxes is deprecated and may be removed in the future.\n", message->filename);
+ }
}
- strcpy (last, chunk);
- value->len += strlen (chunk);
+ DONE:
+ g_object_unref (stream);
+ g_object_unref (parser);
+
+ if (status) {
+ g_hash_table_destroy (message->headers);
+ message->headers = NULL;
- last = value->str + value->len - 1;
- if (*last == '\n') {
- *last = '\0';
- value->len--;
+ if (message->message) {
+ g_object_unref (message->message);
+ message->message = NULL;
+ }
+
+ rewind (message->file);
}
+
+ return status;
}
-/* As a special-case, a value of NULL for header_desired will force
- * the entire header to be parsed if it is not parsed already. This is
- * used by the _notmuch_message_file_get_headers_end function.
- * Another special case is the Received: header. For this header we
- * want to concatenate all instances of the header instead of just
- * hashing the first instance as we use this when analyzing the path
- * the mail has taken from sender to recipient.
- */
-const char *
-notmuch_message_file_get_header (notmuch_message_file_t *message,
- const char *header_desired)
+GMimeMessage *
+notmuch_message_file_get_mime_message (notmuch_message_file_t *message)
{
- int contains;
- char *header, *decoded_value, *header_sofar, *combined_header;
- const char *s, *colon;
- int match, newhdr, hdrsofar, is_received;
- static int initialized = 0;
+ if (notmuch_message_file_parse (message))
+ return NULL;
- is_received = (strcmp(header_desired,"received") == 0);
+ return message->message;
+}
- if (! initialized) {
- g_mime_init (GMIME_ENABLE_RFC2047_WORKAROUNDS);
- initialized = 1;
- }
+/*
+ * Get all instances of a header decoded and concatenated.
+ *
+ * The result must be freed using g_free().
+ *
+ * Return NULL on errors, empty string for non-existing headers.
+ */
+static char *
+_notmuch_message_file_get_combined_header (notmuch_message_file_t *message,
+ const char *header)
+{
+ GMimeHeaderList *headers;
+ GMimeHeaderIter *iter;
+ char *combined = NULL;
- message->parsing_started = 1;
-
- if (header_desired == NULL)
- contains = 0;
- else
- contains = g_hash_table_lookup_extended (message->headers,
- header_desired, NULL,
- (gpointer *) &decoded_value);
-
- if (contains && decoded_value)
- return decoded_value;
-
- if (message->parsing_finished)
- return "";
-
-#define NEXT_HEADER_LINE(closure) \
- while (1) { \
- ssize_t bytes_read = getline (&message->line, \
- &message->line_size, \
- message->file); \
- if (bytes_read == -1) { \
- message->parsing_finished = 1; \
- break; \
- } \
- if (*message->line == '\n') { \
- message->parsing_finished = 1; \
- break; \
- } \
- if (closure && \
- (*message->line == ' ' || *message->line == '\t')) \
- { \
- copy_header_unfolding ((closure), message->line); \
- } \
- if (*message->line == ' ' || *message->line == '\t') \
- message->header_size += strlen (message->line); \
- else \
- break; \
- }
+ headers = g_mime_object_get_header_list (GMIME_OBJECT (message->message));
+ if (! headers)
+ return NULL;
- if (message->line == NULL)
- NEXT_HEADER_LINE (NULL);
+ iter = g_mime_header_iter_new ();
+ if (! iter)
+ return NULL;
- while (1) {
+ if (! g_mime_header_list_get_iter (headers, iter))
+ goto DONE;
- if (message->parsing_finished)
- break;
+ do {
+ const char *value;
+ char *decoded;
- colon = strchr (message->line, ':');
+ if (strcasecmp (g_mime_header_iter_get_name (iter), header) != 0)
+ continue;
- if (colon == NULL) {
- message->broken_headers++;
- /* A simple heuristic for giving up on things that just
- * don't look like mail messages. */
- if (message->broken_headers >= 10 &&
- message->good_headers < 5)
- {
- message->parsing_finished = 1;
- break;
+ value = g_mime_header_iter_get_value (iter);
+ decoded = g_mime_utils_header_decode_text (value);
+ if (! decoded) {
+ if (combined) {
+ g_free (combined);
+ combined = NULL;
}
- NEXT_HEADER_LINE (NULL);
- continue;
+ goto DONE;
}
- message->header_size += strlen (message->line);
+ if (combined) {
+ char *tmp = g_strdup_printf ("%s %s", combined, decoded);
+ g_free (decoded);
+ g_free (combined);
+ if (! tmp) {
+ combined = NULL;
+ goto DONE;
+ }
- message->good_headers++;
+ combined = tmp;
+ } else {
+ combined = decoded;
+ }
+ } while (g_mime_header_iter_next (iter));
- header = xstrndup (message->line, colon - message->line);
+ /* Return empty string for non-existing headers. */
+ if (! combined)
+ combined = g_strdup ("");
- if (message->restrict_headers &&
- ! g_hash_table_lookup_extended (message->headers,
- header, NULL, NULL))
- {
- free (header);
- NEXT_HEADER_LINE (NULL);
- continue;
- }
+ DONE:
+ g_mime_header_iter_free (iter);
- s = colon + 1;
- while (*s == ' ' || *s == '\t')
- s++;
+ return combined;
+}
- message->value.len = 0;
- copy_header_unfolding (&message->value, s);
+/* Return NULL on errors, empty string for non-existing headers. */
+const char *
+notmuch_message_file_get_header (notmuch_message_file_t *message,
+ const char *header)
+{
+ const char *value = NULL;
+ char *decoded;
+ notmuch_bool_t found;
- NEXT_HEADER_LINE (&message->value);
+ if (notmuch_message_file_parse (message))
+ return NULL;
- if (header_desired == NULL)
- match = 0;
+ /* If we have a cached decoded value, use it. */
+ found = g_hash_table_lookup_extended (message->headers, header,
+ NULL, (gpointer *) &value);
+ if (found)
+ return value ? value : "";
+
+ if (strcasecmp (header, "received") == 0) {
+ /*
+ * The Received: header is special. We concatenate all
+ * instances of the header as we use this when analyzing the
+ * path the mail has taken from sender to recipient.
+ */
+ decoded = _notmuch_message_file_get_combined_header (message, header);
+ } else {
+ value = g_mime_object_get_header (GMIME_OBJECT (message->message),
+ header);
+ if (value)
+ decoded = g_mime_utils_header_decode_text (value);
else
- match = (strcasecmp (header, header_desired) == 0);
-
- decoded_value = g_mime_utils_header_decode_text (message->value.str);
- header_sofar = (char *)g_hash_table_lookup (message->headers, header);
- /* we treat the Received: header special - we want to concat ALL of
- * the Received: headers we encounter.
- * for everything else we return the first instance of a header */
- if (strcasecmp(header, "received") == 0) {
- if (header_sofar == NULL) {
- /* first Received: header we encountered; just add it */
- g_hash_table_insert (message->headers, header, decoded_value);
- } else {
- /* we need to add the header to those we already collected */
- newhdr = strlen(decoded_value);
- hdrsofar = strlen(header_sofar);
- combined_header = g_malloc(hdrsofar + newhdr + 2);
- strncpy(combined_header,header_sofar,hdrsofar);
- *(combined_header+hdrsofar) = ' ';
- strncpy(combined_header+hdrsofar+1,decoded_value,newhdr+1);
- g_free (decoded_value);
- g_hash_table_insert (message->headers, header, combined_header);
- }
- } else {
- if (header_sofar == NULL) {
- /* Only insert if we don't have a value for this header, yet. */
- g_hash_table_insert (message->headers, header, decoded_value);
- } else {
- free (header);
- g_free (decoded_value);
- decoded_value = header_sofar;
- }
- }
- /* if we found a match we can bail - unless of course we are
- * collecting all the Received: headers */
- if (match && !is_received)
- return decoded_value;
- }
-
- if (message->parsing_finished) {
- fclose (message->file);
- message->file = NULL;
+ decoded = g_strdup ("");
}
- if (message->line)
- free (message->line);
- message->line = NULL;
-
- if (message->value.size) {
- free (message->value.str);
- message->value.str = NULL;
- message->value.size = 0;
- message->value.len = 0;
- }
+ if (! decoded)
+ return NULL;
- /* For the Received: header we actually might end up here even
- * though we found the header (as we force continued parsing
- * in that case). So let's check if that's the header we were
- * looking for and return the value that we found (if any)
- */
- if (is_received)
- return (char *)g_hash_table_lookup (message->headers, "received");
-
- /* We've parsed all headers and never found the one we're looking
- * for. It's probably just not there, but let's check that we
- * didn't make a mistake preventing us from seeing it. */
- if (message->restrict_headers && header_desired &&
- ! g_hash_table_lookup_extended (message->headers,
- header_desired, NULL, NULL))
- {
- INTERNAL_ERROR ("Attempt to get header \"%s\" which was not\n"
- "included in call to notmuch_message_file_restrict_headers\n",
- header_desired);
- }
+ /* Cache the decoded value. We also own the strings. */
+ g_hash_table_insert (message->headers, xstrdup (header), decoded);
- return "";
+ return decoded;
}
diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
index 59eb2bc285a5..734a4e338554 100644
--- a/lib/notmuch-private.h
+++ b/lib/notmuch-private.h
@@ -46,6 +46,8 @@ NOTMUCH_BEGIN_DECLS
#include <talloc.h>
+#include <gmime/gmime.h>
+
#include "xutil.h"
#include "error_util.h"
@@ -320,13 +322,6 @@ notmuch_message_set_author (notmuch_message_t *message, const char *author);
const char *
notmuch_message_get_author (notmuch_message_t *message);
-
-/* index.cc */
-
-notmuch_status_t
-_notmuch_message_index_file (notmuch_message_t *message,
- const char *filename);
-
/* message-file.c */
/* XXX: I haven't decided yet whether these will actually get exported
@@ -352,32 +347,31 @@ _notmuch_message_file_open_ctx (void *ctx, const char *filename);
void
notmuch_message_file_close (notmuch_message_file_t *message);
-/* Restrict 'message' to only save the named headers.
+/* Parse the message.
*
- * When the caller is only interested in a short list of headers,
- * known in advance, calling this function can avoid wasted time and
- * memory parsing/saving header values that will never be needed.
+ * This will be done automatically as necessary on other calls
+ * depending on it, but an explicit call allows for better error
+ * status reporting.
+ */
+notmuch_status_t
+notmuch_message_file_parse (notmuch_message_file_t *message);
+
+/* Get the gmime message of a message file.
*
- * The variable arguments should be a list of const char * with a
- * final '(const char *) NULL' to terminate the list.
+ * The message file is parsed as necessary.
*
- * If this function is called, it must be called before any calls to
- * notmuch_message_get_header for this message.
+ * Returns GMimeMessage* on success (which the caller must not unref),
+ * NULL if the message file parsing fails.
*
- * After calling this function, if notmuch_message_get_header is
- * called with a header name not in this list, then NULL will be
- * returned even if that header exists in the actual message.
+ * XXX: Would be nice to not have to expose GMimeMessage here.
*/
-void
-notmuch_message_file_restrict_headers (notmuch_message_file_t *message, ...);
-
-/* Identical to notmuch_message_restrict_headers but accepting a va_list. */
-void
-notmuch_message_file_restrict_headersv (notmuch_message_file_t *message,
- va_list va_headers);
+GMimeMessage *
+notmuch_message_file_get_mime_message (notmuch_message_file_t *message);
/* Get the value of the specified header from the message as a UTF-8 string.
*
+ * The message file is parsed as necessary.
+ *
* The header name is case insensitive.
*
* The Received: header is special - for it all Received: headers in
@@ -387,13 +381,19 @@ notmuch_message_file_restrict_headersv (notmuch_message_file_t *message,
* only until the message is closed. The caller should copy it if
* needing to modify the value or to hold onto it for longer.
*
- * Returns NULL if the message does not contain a header line matching
- * 'header'.
+ * Returns NULL on errors, empty string if the message does not
+ * contain a header line matching 'header'.
*/
const char *
notmuch_message_file_get_header (notmuch_message_file_t *message,
const char *header);
+/* index.cc */
+
+notmuch_status_t
+_notmuch_message_index_file (notmuch_message_t *message,
+ notmuch_message_file_t *message_file);
+
/* messages.c */
typedef struct _notmuch_message_node {
--
1.9.0
^ permalink raw reply related [flat|nested] 3+ messages in thread
* Re: [PATCH v4] lib: replace the header parser with gmime
2014-03-23 20:01 [PATCH v4] lib: replace the header parser with gmime Jani Nikula
@ 2014-03-29 22:11 ` Austin Clements
2014-03-30 21:10 ` Jani Nikula
0 siblings, 1 reply; 3+ messages in thread
From: Austin Clements @ 2014-03-29 22:11 UTC (permalink / raw)
To: Jani Nikula; +Cc: notmuch
Various little comments below. Overall this is looking really good.
Quoth Jani Nikula on Mar 23 at 10:01 pm:
> The notmuch library includes a full blown message header parser. Yet
> the same message headers are parsed by gmime during indexing. Switch
> to gmime parsing completely.
>
> These are the main changes:
>
> * Gmime stops header parsing at the first invalid header, and presumes
> the message body starts from there. The current parser is quite
> liberal in accepting broken headers. The change means we will be
> much pickier about accepting invalid messages.
>
> * The current parser converts tabs used in header folding to
> spaces. Gmime preserve the tabs. Due to a broken python library used
> in mailman, there are plenty of mailing lists that produce headers
> with tabs in header folding, and we'll see plenty of tabs. (This
> change has been mitigated in preparatory patches.)
>
> * For pure header parsing, the current parser is likely faster than
> gmime, which parses the whole message rather than just the
> headers. Since we parse the message and its headers using gmime for
> indexing anyway, this avoids and extra header parsing round when
> adding new messages. In case of duplicate messages, we'll end up
> parsing the full message although just headers would be
> sufficient. All in all this should still speed up 'notmuch new'.
>
> * Calls to notmuch_message_get_header() may be slightly slower than
> previously for headers that are not indexed in the database, due to
> parsing of the whole message. Within the notmuch code base, notmuch
> reply is the only such user.
>
> ---
>
> This is v4 of patches [1] and [2], combining them into one. Patch [3]
> is required to make all the tests pass.
>
> The implementation has been changed considerably. Notably we only
> cache the decoded headers to a hash table if and when they are
> needed. The main reasons for doing caching at all is that gmime
> provides raw headers that need decoding, and to minimize changes
> outside of message-file.c we want to own and track the allocated
> strings instead of pushing the responsibility to callers.
>
> This addresses most comments from Austin.
>
> The diff in message-file.c is rather confusing due to a mix of
> added/removed lines. Looking at the patched file is much more
> pleasant.
>
> BR,
> Jani.
>
> [1] id:bdef45dc21c777130a7ae1fa650f172d8c6eaaf4.1391456555.git.jani@nikula.org
> [2] id:31d785c4a3e4b90862a0fdc545d4e900a4c898e2.1391456555.git.jani@nikula.org
> [3] id:1395247490-19892-1-git-send-email-jani@nikula.org
> ---
> lib/database.cc | 15 +-
> lib/index.cc | 70 +--------
> lib/message-file.c | 420 ++++++++++++++++++++------------------------------
> lib/notmuch-private.h | 54 +++----
> 4 files changed, 208 insertions(+), 351 deletions(-)
>
> diff --git a/lib/database.cc b/lib/database.cc
> index aef748f75b5d..4750f40cf0fb 100644
> --- a/lib/database.cc
> +++ b/lib/database.cc
> @@ -1971,15 +1971,10 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
> if (ret)
> goto DONE;
>
> - notmuch_message_file_restrict_headers (message_file,
> - "date",
> - "from",
> - "in-reply-to",
> - "message-id",
> - "references",
> - "subject",
> - "to",
> - (char *) NULL);
> + /* Parse message up front to get better error status. */
> + ret = notmuch_message_file_parse (message_file);
> + if (ret)
> + goto DONE;
>
> try {
> /* Before we do any real work, (especially before doing a
> @@ -2066,7 +2061,7 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
> date = notmuch_message_file_get_header (message_file, "date");
> _notmuch_message_set_header_values (message, date, from, subject);
>
> - ret = _notmuch_message_index_file (message, filename);
> + ret = _notmuch_message_index_file (message, message_file);
> if (ret)
> goto DONE;
> } else {
> diff --git a/lib/index.cc b/lib/index.cc
> index 78c18cf36d10..46a019325454 100644
> --- a/lib/index.cc
> +++ b/lib/index.cc
> @@ -425,63 +425,15 @@ _index_mime_part (notmuch_message_t *message,
>
> notmuch_status_t
> _notmuch_message_index_file (notmuch_message_t *message,
> - const char *filename)
> + notmuch_message_file_t *message_file)
> {
> - GMimeStream *stream = NULL;
> - GMimeParser *parser = NULL;
> - GMimeMessage *mime_message = NULL;
> + GMimeMessage *mime_message;
> InternetAddressList *addresses;
> - FILE *file = NULL;
> const char *from, *subject;
> - notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
> - static int initialized = 0;
> - char from_buf[5];
> - bool is_mbox = false;
> - static bool mbox_warning = false;
> -
> - if (! initialized) {
> - g_mime_init (GMIME_ENABLE_RFC2047_WORKAROUNDS);
> - initialized = 1;
> - }
> -
> - file = fopen (filename, "r");
> - if (! file) {
> - fprintf (stderr, "Error opening %s: %s\n", filename, strerror (errno));
> - ret = NOTMUCH_STATUS_FILE_ERROR;
> - goto DONE;
> - }
> -
> - /* Is this mbox? */
> - if (fread (from_buf, sizeof (from_buf), 1, file) == 1 &&
> - strncmp (from_buf, "From ", 5) == 0)
> - is_mbox = true;
> - rewind (file);
>
> - /* Evil GMime steals my FILE* here so I won't fclose it. */
> - stream = g_mime_stream_file_new (file);
> -
> - parser = g_mime_parser_new_with_stream (stream);
> - g_mime_parser_set_scan_from (parser, is_mbox);
> -
> - mime_message = g_mime_parser_construct_message (parser);
> -
> - if (is_mbox) {
> - if (!g_mime_parser_eos (parser)) {
> - /* This is a multi-message mbox. */
> - ret = NOTMUCH_STATUS_FILE_NOT_EMAIL;
> - goto DONE;
> - }
> - /* For historical reasons, we support single-message mboxes,
> - * but this behavior is likely to change in the future, so
> - * warn. */
> - if (!mbox_warning) {
> - mbox_warning = true;
> - fprintf (stderr, "\
> -Warning: %s is an mbox containing a single message,\n\
> -likely caused by misconfigured mail delivery. Support for single-message\n\
> -mboxes is deprecated and may be removed in the future.\n", filename);
> - }
> - }
> + mime_message = notmuch_message_file_get_mime_message (message_file);
> + if (! mime_message)
> + return NOTMUCH_STATUS_FILE_NOT_EMAIL;
This could also be because of out-of-memory.
notmuch_message_file_get_mime_message (and notmuch_message_file_parse,
in turn) should probably take a notmuch_status_t *status_ret argument
like other internal functions.
>
> from = g_mime_message_get_sender (mime_message);
>
> @@ -502,15 +454,5 @@ mboxes is deprecated and may be removed in the future.\n", filename);
>
> _index_mime_part (message, g_mime_message_get_mime_part (mime_message));
>
> - DONE:
> - if (mime_message)
> - g_object_unref (mime_message);
> -
> - if (parser)
> - g_object_unref (parser);
> -
> - if (stream)
> - g_object_unref (stream);
> -
> - return ret;
> + return NOTMUCH_STATUS_SUCCESS;
> }
> diff --git a/lib/message-file.c b/lib/message-file.c
> index a2850c278b5a..88662608d319 100644
> --- a/lib/message-file.c
> +++ b/lib/message-file.c
> @@ -26,30 +26,15 @@
>
> #include <glib.h> /* GHashTable */
>
> -typedef struct {
> - char *str;
> - size_t size;
> - size_t len;
> -} header_value_closure_t;
> -
> struct _notmuch_message_file {
> /* File object */
> FILE *file;
> + char *filename;
>
> - /* Header storage */
> - int restrict_headers;
> + /* Cache for decoded headers */
> GHashTable *headers;
> - int broken_headers;
> - int good_headers;
> - size_t header_size; /* Length of full message header in bytes. */
> -
> - /* Parsing state */
> - char *line;
> - size_t line_size;
> - header_value_closure_t value;
>
> - int parsing_started;
> - int parsing_finished;
> + GMimeMessage *message;
> };
>
> static int
> @@ -76,15 +61,12 @@ strcase_hash (const void *ptr)
> static int
> _notmuch_message_file_destructor (notmuch_message_file_t *message)
> {
> - if (message->line)
> - free (message->line);
> -
> - if (message->value.size)
> - free (message->value.str);
> -
> if (message->headers)
> g_hash_table_destroy (message->headers);
>
> + if (message->message)
> + g_object_unref (message->message);
> +
> if (message->file)
> fclose (message->file);
>
> @@ -102,20 +84,17 @@ _notmuch_message_file_open_ctx (void *ctx, const char *filename)
> if (unlikely (message == NULL))
> return NULL;
>
> + /* Only needed for error messages during parsing. */
> + message->filename = talloc_strdup (message, filename);
> + if (message->filename == NULL)
> + goto FAIL;
> +
> talloc_set_destructor (message, _notmuch_message_file_destructor);
>
> message->file = fopen (filename, "r");
> if (message->file == NULL)
> goto FAIL;
>
> - message->headers = g_hash_table_new_full (strcase_hash,
> - strcase_equal,
> - free,
> - g_free);
> -
> - message->parsing_started = 0;
> - message->parsing_finished = 0;
> -
> return message;
>
> FAIL:
> @@ -137,264 +116,205 @@ notmuch_message_file_close (notmuch_message_file_t *message)
> talloc_free (message);
> }
>
> -void
> -notmuch_message_file_restrict_headersv (notmuch_message_file_t *message,
> - va_list va_headers)
> +notmuch_status_t
> +notmuch_message_file_parse (notmuch_message_file_t *message)
> {
> - char *header;
> + GMimeStream *stream;
> + GMimeParser *parser;
> + notmuch_status_t status = NOTMUCH_STATUS_SUCCESS;
> + static int initialized = 0;
> + char from_buf[5];
> + notmuch_bool_t is_mbox = FALSE;
> + static notmuch_bool_t mbox_warning = FALSE;
>
> - if (message->parsing_started)
> - INTERNAL_ERROR ("notmuch_message_file_restrict_headers called after parsing has started");
> + if (message->message)
> + return NOTMUCH_STATUS_SUCCESS;
>
> - while (1) {
> - header = va_arg (va_headers, char*);
> - if (header == NULL)
> - break;
> - g_hash_table_insert (message->headers,
> - xstrdup (header), NULL);
> + if (! initialized) {
> + g_mime_init (GMIME_ENABLE_RFC2047_WORKAROUNDS);
> + initialized = 1;
> }
>
> - message->restrict_headers = 1;
> -}
> -
> -void
> -notmuch_message_file_restrict_headers (notmuch_message_file_t *message, ...)
> -{
> - va_list va_headers;
> -
> - va_start (va_headers, message);
> + message->headers = g_hash_table_new_full (strcase_hash, strcase_equal,
> + free, g_free);
> + if (! message->headers)
> + return NOTMUCH_STATUS_OUT_OF_MEMORY;
>
> - notmuch_message_file_restrict_headersv (message, va_headers);
> -}
> + /* Is this mbox? */
> + if (fread (from_buf, sizeof (from_buf), 1, message->file) == 1 &&
> + strncmp (from_buf, "From ", 5) == 0)
> + is_mbox = TRUE;
> + rewind (message->file);
>
> -static void
> -copy_header_unfolding (header_value_closure_t *value,
> - const char *chunk)
> -{
> - char *last;
> + stream = g_mime_stream_file_new (message->file);
>
> - if (chunk == NULL)
> - return;
> + /* We'll own and fclose the FILE* ourselves. */
> + g_mime_stream_file_set_owner (GMIME_STREAM_FILE (stream), FALSE);
>
> - while (*chunk == ' ' || *chunk == '\t')
> - chunk++;
> + parser = g_mime_parser_new_with_stream (stream);
> + g_mime_parser_set_scan_from (parser, is_mbox);
>
> - if (value->len + 1 + strlen (chunk) + 1 > value->size) {
> - unsigned int new_size = value->size;
> - if (value->size == 0)
> - new_size = strlen (chunk) + 1;
> - else
> - while (value->len + 1 + strlen (chunk) + 1 > new_size)
> - new_size *= 2;
> - value->str = xrealloc (value->str, new_size);
> - value->size = new_size;
> + message->message = g_mime_parser_construct_message (parser);
> + if (! message->message) {
> + status = NOTMUCH_STATUS_FILE_NOT_EMAIL;
> + goto DONE;
> }
>
> - last = value->str + value->len;
> - if (value->len) {
> - *last = ' ';
> - last++;
> - value->len++;
> + if (is_mbox) {
> + if (! g_mime_parser_eos (parser)) {
> + /* This is a multi-message mbox. */
> + status = NOTMUCH_STATUS_FILE_NOT_EMAIL;
> + goto DONE;
> + }
> + /*
> + * For historical reasons, we support single-message mboxes,
> + * but this behavior is likely to change in the future, so
> + * warn.
> + */
> + if (! mbox_warning) {
> + mbox_warning = TRUE;
> + fprintf (stderr, "\
> +Warning: %s is an mbox containing a single message,\n\
> +likely caused by misconfigured mail delivery. Support for single-message\n\
> +mboxes is deprecated and may be removed in the future.\n", message->filename);
I still wonder if this is the right place to issue this warning. For
example, doesn't this mean that, when constructing a reply to a single
message mbox, we'll issue this warning?
Or maybe it's time to drop single message mbox support. We deprecated
it in 0.15.1, which was over a year ago. OTOH, wheezy still has
0.13.2, so there could be people still using mboxes without the
warning. OTOOH, we only kept this around to appear David anyway. So
maybe the answer is to leave this in an not think too hard about it.
> + }
> }
>
> - strcpy (last, chunk);
> - value->len += strlen (chunk);
> + DONE:
> + g_object_unref (stream);
> + g_object_unref (parser);
> +
> + if (status) {
> + g_hash_table_destroy (message->headers);
> + message->headers = NULL;
>
> - last = value->str + value->len - 1;
> - if (*last == '\n') {
> - *last = '\0';
> - value->len--;
> + if (message->message) {
> + g_object_unref (message->message);
> + message->message = NULL;
> + }
> +
> + rewind (message->file);
> }
> +
> + return status;
> }
>
> -/* As a special-case, a value of NULL for header_desired will force
> - * the entire header to be parsed if it is not parsed already. This is
> - * used by the _notmuch_message_file_get_headers_end function.
> - * Another special case is the Received: header. For this header we
> - * want to concatenate all instances of the header instead of just
> - * hashing the first instance as we use this when analyzing the path
> - * the mail has taken from sender to recipient.
> - */
> -const char *
> -notmuch_message_file_get_header (notmuch_message_file_t *message,
> - const char *header_desired)
> +GMimeMessage *
> +notmuch_message_file_get_mime_message (notmuch_message_file_t *message)
> {
> - int contains;
> - char *header, *decoded_value, *header_sofar, *combined_header;
> - const char *s, *colon;
> - int match, newhdr, hdrsofar, is_received;
> - static int initialized = 0;
> + if (notmuch_message_file_parse (message))
> + return NULL;
>
> - is_received = (strcmp(header_desired,"received") == 0);
> + return message->message;
> +}
>
> - if (! initialized) {
> - g_mime_init (GMIME_ENABLE_RFC2047_WORKAROUNDS);
> - initialized = 1;
> - }
> +/*
> + * Get all instances of a header decoded and concatenated.
> + *
> + * The result must be freed using g_free().
> + *
> + * Return NULL on errors, empty string for non-existing headers.
> + */
> +static char *
> +_notmuch_message_file_get_combined_header (notmuch_message_file_t *message,
> + const char *header)
> +{
> + GMimeHeaderList *headers;
> + GMimeHeaderIter *iter;
> + char *combined = NULL;
>
> - message->parsing_started = 1;
> -
> - if (header_desired == NULL)
> - contains = 0;
> - else
> - contains = g_hash_table_lookup_extended (message->headers,
> - header_desired, NULL,
> - (gpointer *) &decoded_value);
> -
> - if (contains && decoded_value)
> - return decoded_value;
> -
> - if (message->parsing_finished)
> - return "";
> -
> -#define NEXT_HEADER_LINE(closure) \
> - while (1) { \
> - ssize_t bytes_read = getline (&message->line, \
> - &message->line_size, \
> - message->file); \
> - if (bytes_read == -1) { \
> - message->parsing_finished = 1; \
> - break; \
> - } \
> - if (*message->line == '\n') { \
> - message->parsing_finished = 1; \
> - break; \
> - } \
> - if (closure && \
> - (*message->line == ' ' || *message->line == '\t')) \
> - { \
> - copy_header_unfolding ((closure), message->line); \
> - } \
> - if (*message->line == ' ' || *message->line == '\t') \
> - message->header_size += strlen (message->line); \
> - else \
> - break; \
> - }
> + headers = g_mime_object_get_header_list (GMIME_OBJECT (message->message));
> + if (! headers)
> + return NULL;
>
> - if (message->line == NULL)
> - NEXT_HEADER_LINE (NULL);
> + iter = g_mime_header_iter_new ();
> + if (! iter)
> + return NULL;
>
> - while (1) {
> + if (! g_mime_header_list_get_iter (headers, iter))
> + goto DONE;
>
> - if (message->parsing_finished)
> - break;
> + do {
> + const char *value;
Incorrect indentation.
> + char *decoded;
>
> - colon = strchr (message->line, ':');
> + if (strcasecmp (g_mime_header_iter_get_name (iter), header) != 0)
> + continue;
>
> - if (colon == NULL) {
> - message->broken_headers++;
> - /* A simple heuristic for giving up on things that just
> - * don't look like mail messages. */
> - if (message->broken_headers >= 10 &&
> - message->good_headers < 5)
> - {
> - message->parsing_finished = 1;
> - break;
> + value = g_mime_header_iter_get_value (iter);
> + decoded = g_mime_utils_header_decode_text (value);
Maybe a comment saying that GMime retains ownership of value, but
decoded must be freed with g_free? GMime is confusing about
ownership, and the fact that value *doesn't* have to be freed (which
is not documented by GMime!) just adds to the confusion. I could see
this being easy to screw up during later code changes.
> + if (! decoded) {
> + if (combined) {
> + g_free (combined);
> + combined = NULL;
> }
> - NEXT_HEADER_LINE (NULL);
> - continue;
> + goto DONE;
> }
>
> - message->header_size += strlen (message->line);
> + if (combined) {
> + char *tmp = g_strdup_printf ("%s %s", combined, decoded);
> + g_free (decoded);
> + g_free (combined);
> + if (! tmp) {
> + combined = NULL;
> + goto DONE;
> + }
>
> - message->good_headers++;
> + combined = tmp;
> + } else {
> + combined = decoded;
> + }
> + } while (g_mime_header_iter_next (iter));
Yet another funny GMime API... Did you consider
for (; g_mime_header_iter_is_valid (iter); g_mime_header_iter_next (iter))
? I realize that the g_mime_header_iter_is_valid isn't actually
necessary because g_mime_header_list_get_iter returns NULL if the
iterator would be empty, but using a for-loop would make it more
obvious that this code is correct.
>
> - header = xstrndup (message->line, colon - message->line);
> + /* Return empty string for non-existing headers. */
> + if (! combined)
> + combined = g_strdup ("");
>
> - if (message->restrict_headers &&
> - ! g_hash_table_lookup_extended (message->headers,
> - header, NULL, NULL))
> - {
> - free (header);
> - NEXT_HEADER_LINE (NULL);
> - continue;
> - }
> + DONE:
> + g_mime_header_iter_free (iter);
>
> - s = colon + 1;
> - while (*s == ' ' || *s == '\t')
> - s++;
> + return combined;
> +}
Oof. That received header had better be worth it!
>
> - message->value.len = 0;
> - copy_header_unfolding (&message->value, s);
> +/* Return NULL on errors, empty string for non-existing headers. */
This may sound strange, but I'd be inclined to omit this comment,
since it made me not realize I should go looking for the much more
thorough comment in notmuch-private.h.
> +const char *
> +notmuch_message_file_get_header (notmuch_message_file_t *message,
> + const char *header)
> +{
> + const char *value = NULL;
> + char *decoded;
> + notmuch_bool_t found;
>
> - NEXT_HEADER_LINE (&message->value);
> + if (notmuch_message_file_parse (message))
> + return NULL;
>
> - if (header_desired == NULL)
> - match = 0;
> + /* If we have a cached decoded value, use it. */
> + found = g_hash_table_lookup_extended (message->headers, header,
> + NULL, (gpointer *) &value);
> + if (found)
> + return value ? value : "";
When would header be found, but value be NULL? (If this can't in fact
happen, how about using g_hash_table_lookup?)
> +
> + if (strcasecmp (header, "received") == 0) {
> + /*
> + * The Received: header is special. We concatenate all
> + * instances of the header as we use this when analyzing the
> + * path the mail has taken from sender to recipient.
> + */
> + decoded = _notmuch_message_file_get_combined_header (message, header);
> + } else {
> + value = g_mime_object_get_header (GMIME_OBJECT (message->message),
> + header);
> + if (value)
> + decoded = g_mime_utils_header_decode_text (value);
> else
> - match = (strcasecmp (header, header_desired) == 0);
> -
> - decoded_value = g_mime_utils_header_decode_text (message->value.str);
> - header_sofar = (char *)g_hash_table_lookup (message->headers, header);
> - /* we treat the Received: header special - we want to concat ALL of
> - * the Received: headers we encounter.
> - * for everything else we return the first instance of a header */
> - if (strcasecmp(header, "received") == 0) {
> - if (header_sofar == NULL) {
> - /* first Received: header we encountered; just add it */
> - g_hash_table_insert (message->headers, header, decoded_value);
> - } else {
> - /* we need to add the header to those we already collected */
> - newhdr = strlen(decoded_value);
> - hdrsofar = strlen(header_sofar);
> - combined_header = g_malloc(hdrsofar + newhdr + 2);
> - strncpy(combined_header,header_sofar,hdrsofar);
> - *(combined_header+hdrsofar) = ' ';
> - strncpy(combined_header+hdrsofar+1,decoded_value,newhdr+1);
> - g_free (decoded_value);
> - g_hash_table_insert (message->headers, header, combined_header);
> - }
> - } else {
> - if (header_sofar == NULL) {
> - /* Only insert if we don't have a value for this header, yet. */
> - g_hash_table_insert (message->headers, header, decoded_value);
> - } else {
> - free (header);
> - g_free (decoded_value);
> - decoded_value = header_sofar;
> - }
> - }
> - /* if we found a match we can bail - unless of course we are
> - * collecting all the Received: headers */
> - if (match && !is_received)
> - return decoded_value;
> - }
> -
> - if (message->parsing_finished) {
> - fclose (message->file);
> - message->file = NULL;
> + decoded = g_strdup ("");
> }
>
> - if (message->line)
> - free (message->line);
> - message->line = NULL;
> -
> - if (message->value.size) {
> - free (message->value.str);
> - message->value.str = NULL;
> - message->value.size = 0;
> - message->value.len = 0;
> - }
> + if (! decoded)
> + return NULL;
>
> - /* For the Received: header we actually might end up here even
> - * though we found the header (as we force continued parsing
> - * in that case). So let's check if that's the header we were
> - * looking for and return the value that we found (if any)
> - */
> - if (is_received)
> - return (char *)g_hash_table_lookup (message->headers, "received");
> -
> - /* We've parsed all headers and never found the one we're looking
> - * for. It's probably just not there, but let's check that we
> - * didn't make a mistake preventing us from seeing it. */
> - if (message->restrict_headers && header_desired &&
> - ! g_hash_table_lookup_extended (message->headers,
> - header_desired, NULL, NULL))
> - {
> - INTERNAL_ERROR ("Attempt to get header \"%s\" which was not\n"
> - "included in call to notmuch_message_file_restrict_headers\n",
> - header_desired);
> - }
> + /* Cache the decoded value. We also own the strings. */
> + g_hash_table_insert (message->headers, xstrdup (header), decoded);
>
> - return "";
> + return decoded;
> }
> diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
> index 59eb2bc285a5..734a4e338554 100644
> --- a/lib/notmuch-private.h
> +++ b/lib/notmuch-private.h
> @@ -46,6 +46,8 @@ NOTMUCH_BEGIN_DECLS
>
> #include <talloc.h>
>
> +#include <gmime/gmime.h>
> +
> #include "xutil.h"
> #include "error_util.h"
>
> @@ -320,13 +322,6 @@ notmuch_message_set_author (notmuch_message_t *message, const char *author);
> const char *
> notmuch_message_get_author (notmuch_message_t *message);
>
> -
> -/* index.cc */
> -
> -notmuch_status_t
> -_notmuch_message_index_file (notmuch_message_t *message,
> - const char *filename);
> -
> /* message-file.c */
>
> /* XXX: I haven't decided yet whether these will actually get exported
> @@ -352,32 +347,31 @@ _notmuch_message_file_open_ctx (void *ctx, const char *filename);
> void
> notmuch_message_file_close (notmuch_message_file_t *message);
>
> -/* Restrict 'message' to only save the named headers.
> +/* Parse the message.
> *
> - * When the caller is only interested in a short list of headers,
> - * known in advance, calling this function can avoid wasted time and
> - * memory parsing/saving header values that will never be needed.
> + * This will be done automatically as necessary on other calls
> + * depending on it, but an explicit call allows for better error
> + * status reporting.
> + */
> +notmuch_status_t
> +notmuch_message_file_parse (notmuch_message_file_t *message);
Most internal functions have a leading _. (We've diverged from that
in a few places, especially in notmuch_message_file_*, but it would be
nice to not diverge further from that convention.)
> +
> +/* Get the gmime message of a message file.
> *
> - * The variable arguments should be a list of const char * with a
> - * final '(const char *) NULL' to terminate the list.
> + * The message file is parsed as necessary.
> *
> - * If this function is called, it must be called before any calls to
> - * notmuch_message_get_header for this message.
> + * Returns GMimeMessage* on success (which the caller must not unref),
> + * NULL if the message file parsing fails.
> *
> - * After calling this function, if notmuch_message_get_header is
> - * called with a header name not in this list, then NULL will be
> - * returned even if that header exists in the actual message.
> + * XXX: Would be nice to not have to expose GMimeMessage here.
> */
> -void
> -notmuch_message_file_restrict_headers (notmuch_message_file_t *message, ...);
> -
> -/* Identical to notmuch_message_restrict_headers but accepting a va_list. */
> -void
> -notmuch_message_file_restrict_headersv (notmuch_message_file_t *message,
> - va_list va_headers);
> +GMimeMessage *
> +notmuch_message_file_get_mime_message (notmuch_message_file_t *message);
Same comment about _.
>
> /* Get the value of the specified header from the message as a UTF-8 string.
> *
> + * The message file is parsed as necessary.
> + *
> * The header name is case insensitive.
> *
> * The Received: header is special - for it all Received: headers in
> @@ -387,13 +381,19 @@ notmuch_message_file_restrict_headersv (notmuch_message_file_t *message,
> * only until the message is closed. The caller should copy it if
> * needing to modify the value or to hold onto it for longer.
> *
> - * Returns NULL if the message does not contain a header line matching
> - * 'header'.
> + * Returns NULL on errors, empty string if the message does not
> + * contain a header line matching 'header'.
> */
> const char *
> notmuch_message_file_get_header (notmuch_message_file_t *message,
> const char *header);
>
> +/* index.cc */
> +
> +notmuch_status_t
> +_notmuch_message_index_file (notmuch_message_t *message,
> + notmuch_message_file_t *message_file);
> +
Any particular reason this floated down? (Obviously it's not an
actual problem.)
> /* messages.c */
>
> typedef struct _notmuch_message_node {
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [PATCH v4] lib: replace the header parser with gmime
2014-03-29 22:11 ` Austin Clements
@ 2014-03-30 21:10 ` Jani Nikula
0 siblings, 0 replies; 3+ messages in thread
From: Jani Nikula @ 2014-03-30 21:10 UTC (permalink / raw)
To: Austin Clements; +Cc: notmuch
On Sun, 30 Mar 2014, Austin Clements <amdragon@MIT.EDU> wrote:
> Various little comments below. Overall this is looking really good.
>
> Quoth Jani Nikula on Mar 23 at 10:01 pm:
>> The notmuch library includes a full blown message header parser. Yet
>> the same message headers are parsed by gmime during indexing. Switch
>> to gmime parsing completely.
>>
>> These are the main changes:
>>
>> * Gmime stops header parsing at the first invalid header, and presumes
>> the message body starts from there. The current parser is quite
>> liberal in accepting broken headers. The change means we will be
>> much pickier about accepting invalid messages.
>>
>> * The current parser converts tabs used in header folding to
>> spaces. Gmime preserve the tabs. Due to a broken python library used
>> in mailman, there are plenty of mailing lists that produce headers
>> with tabs in header folding, and we'll see plenty of tabs. (This
>> change has been mitigated in preparatory patches.)
>>
>> * For pure header parsing, the current parser is likely faster than
>> gmime, which parses the whole message rather than just the
>> headers. Since we parse the message and its headers using gmime for
>> indexing anyway, this avoids and extra header parsing round when
>> adding new messages. In case of duplicate messages, we'll end up
>> parsing the full message although just headers would be
>> sufficient. All in all this should still speed up 'notmuch new'.
>>
>> * Calls to notmuch_message_get_header() may be slightly slower than
>> previously for headers that are not indexed in the database, due to
>> parsing of the whole message. Within the notmuch code base, notmuch
>> reply is the only such user.
>>
>> ---
>>
>> This is v4 of patches [1] and [2], combining them into one. Patch [3]
>> is required to make all the tests pass.
>>
>> The implementation has been changed considerably. Notably we only
>> cache the decoded headers to a hash table if and when they are
>> needed. The main reasons for doing caching at all is that gmime
>> provides raw headers that need decoding, and to minimize changes
>> outside of message-file.c we want to own and track the allocated
>> strings instead of pushing the responsibility to callers.
>>
>> This addresses most comments from Austin.
>>
>> The diff in message-file.c is rather confusing due to a mix of
>> added/removed lines. Looking at the patched file is much more
>> pleasant.
>>
>> BR,
>> Jani.
>>
>> [1] id:bdef45dc21c777130a7ae1fa650f172d8c6eaaf4.1391456555.git.jani@nikula.org
>> [2] id:31d785c4a3e4b90862a0fdc545d4e900a4c898e2.1391456555.git.jani@nikula.org
>> [3] id:1395247490-19892-1-git-send-email-jani@nikula.org
>> ---
>> lib/database.cc | 15 +-
>> lib/index.cc | 70 +--------
>> lib/message-file.c | 420 ++++++++++++++++++++------------------------------
>> lib/notmuch-private.h | 54 +++----
>> 4 files changed, 208 insertions(+), 351 deletions(-)
>>
>> diff --git a/lib/database.cc b/lib/database.cc
>> index aef748f75b5d..4750f40cf0fb 100644
>> --- a/lib/database.cc
>> +++ b/lib/database.cc
>> @@ -1971,15 +1971,10 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
>> if (ret)
>> goto DONE;
>>
>> - notmuch_message_file_restrict_headers (message_file,
>> - "date",
>> - "from",
>> - "in-reply-to",
>> - "message-id",
>> - "references",
>> - "subject",
>> - "to",
>> - (char *) NULL);
>> + /* Parse message up front to get better error status. */
>> + ret = notmuch_message_file_parse (message_file);
>> + if (ret)
>> + goto DONE;
>>
>> try {
>> /* Before we do any real work, (especially before doing a
>> @@ -2066,7 +2061,7 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
>> date = notmuch_message_file_get_header (message_file, "date");
>> _notmuch_message_set_header_values (message, date, from, subject);
>>
>> - ret = _notmuch_message_index_file (message, filename);
>> + ret = _notmuch_message_index_file (message, message_file);
>> if (ret)
>> goto DONE;
>> } else {
>> diff --git a/lib/index.cc b/lib/index.cc
>> index 78c18cf36d10..46a019325454 100644
>> --- a/lib/index.cc
>> +++ b/lib/index.cc
>> @@ -425,63 +425,15 @@ _index_mime_part (notmuch_message_t *message,
>>
>> notmuch_status_t
>> _notmuch_message_index_file (notmuch_message_t *message,
>> - const char *filename)
>> + notmuch_message_file_t *message_file)
>> {
>> - GMimeStream *stream = NULL;
>> - GMimeParser *parser = NULL;
>> - GMimeMessage *mime_message = NULL;
>> + GMimeMessage *mime_message;
>> InternetAddressList *addresses;
>> - FILE *file = NULL;
>> const char *from, *subject;
>> - notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
>> - static int initialized = 0;
>> - char from_buf[5];
>> - bool is_mbox = false;
>> - static bool mbox_warning = false;
>> -
>> - if (! initialized) {
>> - g_mime_init (GMIME_ENABLE_RFC2047_WORKAROUNDS);
>> - initialized = 1;
>> - }
>> -
>> - file = fopen (filename, "r");
>> - if (! file) {
>> - fprintf (stderr, "Error opening %s: %s\n", filename, strerror (errno));
>> - ret = NOTMUCH_STATUS_FILE_ERROR;
>> - goto DONE;
>> - }
>> -
>> - /* Is this mbox? */
>> - if (fread (from_buf, sizeof (from_buf), 1, file) == 1 &&
>> - strncmp (from_buf, "From ", 5) == 0)
>> - is_mbox = true;
>> - rewind (file);
>>
>> - /* Evil GMime steals my FILE* here so I won't fclose it. */
>> - stream = g_mime_stream_file_new (file);
>> -
>> - parser = g_mime_parser_new_with_stream (stream);
>> - g_mime_parser_set_scan_from (parser, is_mbox);
>> -
>> - mime_message = g_mime_parser_construct_message (parser);
>> -
>> - if (is_mbox) {
>> - if (!g_mime_parser_eos (parser)) {
>> - /* This is a multi-message mbox. */
>> - ret = NOTMUCH_STATUS_FILE_NOT_EMAIL;
>> - goto DONE;
>> - }
>> - /* For historical reasons, we support single-message mboxes,
>> - * but this behavior is likely to change in the future, so
>> - * warn. */
>> - if (!mbox_warning) {
>> - mbox_warning = true;
>> - fprintf (stderr, "\
>> -Warning: %s is an mbox containing a single message,\n\
>> -likely caused by misconfigured mail delivery. Support for single-message\n\
>> -mboxes is deprecated and may be removed in the future.\n", filename);
>> - }
>> - }
>> + mime_message = notmuch_message_file_get_mime_message (message_file);
>> + if (! mime_message)
>> + return NOTMUCH_STATUS_FILE_NOT_EMAIL;
>
> This could also be because of out-of-memory.
> notmuch_message_file_get_mime_message (and notmuch_message_file_parse,
> in turn) should probably take a notmuch_status_t *status_ret argument
> like other internal functions.
Agreed (though I opted for returning notmuch_status_t and passing
GMimeMessage** as parameter).
>
>>
>> from = g_mime_message_get_sender (mime_message);
>>
>> @@ -502,15 +454,5 @@ mboxes is deprecated and may be removed in the future.\n", filename);
>>
>> _index_mime_part (message, g_mime_message_get_mime_part (mime_message));
>>
>> - DONE:
>> - if (mime_message)
>> - g_object_unref (mime_message);
>> -
>> - if (parser)
>> - g_object_unref (parser);
>> -
>> - if (stream)
>> - g_object_unref (stream);
>> -
>> - return ret;
>> + return NOTMUCH_STATUS_SUCCESS;
>> }
>> diff --git a/lib/message-file.c b/lib/message-file.c
>> index a2850c278b5a..88662608d319 100644
>> --- a/lib/message-file.c
>> +++ b/lib/message-file.c
>> @@ -26,30 +26,15 @@
>>
>> #include <glib.h> /* GHashTable */
>>
>> -typedef struct {
>> - char *str;
>> - size_t size;
>> - size_t len;
>> -} header_value_closure_t;
>> -
>> struct _notmuch_message_file {
>> /* File object */
>> FILE *file;
>> + char *filename;
>>
>> - /* Header storage */
>> - int restrict_headers;
>> + /* Cache for decoded headers */
>> GHashTable *headers;
>> - int broken_headers;
>> - int good_headers;
>> - size_t header_size; /* Length of full message header in bytes. */
>> -
>> - /* Parsing state */
>> - char *line;
>> - size_t line_size;
>> - header_value_closure_t value;
>>
>> - int parsing_started;
>> - int parsing_finished;
>> + GMimeMessage *message;
>> };
>>
>> static int
>> @@ -76,15 +61,12 @@ strcase_hash (const void *ptr)
>> static int
>> _notmuch_message_file_destructor (notmuch_message_file_t *message)
>> {
>> - if (message->line)
>> - free (message->line);
>> -
>> - if (message->value.size)
>> - free (message->value.str);
>> -
>> if (message->headers)
>> g_hash_table_destroy (message->headers);
>>
>> + if (message->message)
>> + g_object_unref (message->message);
>> +
>> if (message->file)
>> fclose (message->file);
>>
>> @@ -102,20 +84,17 @@ _notmuch_message_file_open_ctx (void *ctx, const char *filename)
>> if (unlikely (message == NULL))
>> return NULL;
>>
>> + /* Only needed for error messages during parsing. */
>> + message->filename = talloc_strdup (message, filename);
>> + if (message->filename == NULL)
>> + goto FAIL;
>> +
>> talloc_set_destructor (message, _notmuch_message_file_destructor);
>>
>> message->file = fopen (filename, "r");
>> if (message->file == NULL)
>> goto FAIL;
>>
>> - message->headers = g_hash_table_new_full (strcase_hash,
>> - strcase_equal,
>> - free,
>> - g_free);
>> -
>> - message->parsing_started = 0;
>> - message->parsing_finished = 0;
>> -
>> return message;
>>
>> FAIL:
>> @@ -137,264 +116,205 @@ notmuch_message_file_close (notmuch_message_file_t *message)
>> talloc_free (message);
>> }
>>
>> -void
>> -notmuch_message_file_restrict_headersv (notmuch_message_file_t *message,
>> - va_list va_headers)
>> +notmuch_status_t
>> +notmuch_message_file_parse (notmuch_message_file_t *message)
>> {
>> - char *header;
>> + GMimeStream *stream;
>> + GMimeParser *parser;
>> + notmuch_status_t status = NOTMUCH_STATUS_SUCCESS;
>> + static int initialized = 0;
>> + char from_buf[5];
>> + notmuch_bool_t is_mbox = FALSE;
>> + static notmuch_bool_t mbox_warning = FALSE;
>>
>> - if (message->parsing_started)
>> - INTERNAL_ERROR ("notmuch_message_file_restrict_headers called after parsing has started");
>> + if (message->message)
>> + return NOTMUCH_STATUS_SUCCESS;
>>
>> - while (1) {
>> - header = va_arg (va_headers, char*);
>> - if (header == NULL)
>> - break;
>> - g_hash_table_insert (message->headers,
>> - xstrdup (header), NULL);
>> + if (! initialized) {
>> + g_mime_init (GMIME_ENABLE_RFC2047_WORKAROUNDS);
>> + initialized = 1;
>> }
>>
>> - message->restrict_headers = 1;
>> -}
>> -
>> -void
>> -notmuch_message_file_restrict_headers (notmuch_message_file_t *message, ...)
>> -{
>> - va_list va_headers;
>> -
>> - va_start (va_headers, message);
>> + message->headers = g_hash_table_new_full (strcase_hash, strcase_equal,
>> + free, g_free);
>> + if (! message->headers)
>> + return NOTMUCH_STATUS_OUT_OF_MEMORY;
>>
>> - notmuch_message_file_restrict_headersv (message, va_headers);
>> -}
>> + /* Is this mbox? */
>> + if (fread (from_buf, sizeof (from_buf), 1, message->file) == 1 &&
>> + strncmp (from_buf, "From ", 5) == 0)
>> + is_mbox = TRUE;
>> + rewind (message->file);
>>
>> -static void
>> -copy_header_unfolding (header_value_closure_t *value,
>> - const char *chunk)
>> -{
>> - char *last;
>> + stream = g_mime_stream_file_new (message->file);
>>
>> - if (chunk == NULL)
>> - return;
>> + /* We'll own and fclose the FILE* ourselves. */
>> + g_mime_stream_file_set_owner (GMIME_STREAM_FILE (stream), FALSE);
>>
>> - while (*chunk == ' ' || *chunk == '\t')
>> - chunk++;
>> + parser = g_mime_parser_new_with_stream (stream);
>> + g_mime_parser_set_scan_from (parser, is_mbox);
>>
>> - if (value->len + 1 + strlen (chunk) + 1 > value->size) {
>> - unsigned int new_size = value->size;
>> - if (value->size == 0)
>> - new_size = strlen (chunk) + 1;
>> - else
>> - while (value->len + 1 + strlen (chunk) + 1 > new_size)
>> - new_size *= 2;
>> - value->str = xrealloc (value->str, new_size);
>> - value->size = new_size;
>> + message->message = g_mime_parser_construct_message (parser);
>> + if (! message->message) {
>> + status = NOTMUCH_STATUS_FILE_NOT_EMAIL;
>> + goto DONE;
>> }
>>
>> - last = value->str + value->len;
>> - if (value->len) {
>> - *last = ' ';
>> - last++;
>> - value->len++;
>> + if (is_mbox) {
>> + if (! g_mime_parser_eos (parser)) {
>> + /* This is a multi-message mbox. */
>> + status = NOTMUCH_STATUS_FILE_NOT_EMAIL;
>> + goto DONE;
>> + }
>> + /*
>> + * For historical reasons, we support single-message mboxes,
>> + * but this behavior is likely to change in the future, so
>> + * warn.
>> + */
>> + if (! mbox_warning) {
>> + mbox_warning = TRUE;
>> + fprintf (stderr, "\
>> +Warning: %s is an mbox containing a single message,\n\
>> +likely caused by misconfigured mail delivery. Support for single-message\n\
>> +mboxes is deprecated and may be removed in the future.\n", message->filename);
>
> I still wonder if this is the right place to issue this warning. For
> example, doesn't this mean that, when constructing a reply to a single
> message mbox, we'll issue this warning?
>
> Or maybe it's time to drop single message mbox support. We deprecated
> it in 0.15.1, which was over a year ago. OTOH, wheezy still has
> 0.13.2, so there could be people still using mboxes without the
> warning. OTOOH, we only kept this around to appear David anyway. So
> maybe the answer is to leave this in an not think too hard about it.
I'm adding a prep patch to drop mbox support altogether.
>
>> + }
>> }
>>
>> - strcpy (last, chunk);
>> - value->len += strlen (chunk);
>> + DONE:
>> + g_object_unref (stream);
>> + g_object_unref (parser);
>> +
>> + if (status) {
>> + g_hash_table_destroy (message->headers);
>> + message->headers = NULL;
>>
>> - last = value->str + value->len - 1;
>> - if (*last == '\n') {
>> - *last = '\0';
>> - value->len--;
>> + if (message->message) {
>> + g_object_unref (message->message);
>> + message->message = NULL;
>> + }
>> +
>> + rewind (message->file);
>> }
>> +
>> + return status;
>> }
>>
>> -/* As a special-case, a value of NULL for header_desired will force
>> - * the entire header to be parsed if it is not parsed already. This is
>> - * used by the _notmuch_message_file_get_headers_end function.
>> - * Another special case is the Received: header. For this header we
>> - * want to concatenate all instances of the header instead of just
>> - * hashing the first instance as we use this when analyzing the path
>> - * the mail has taken from sender to recipient.
>> - */
>> -const char *
>> -notmuch_message_file_get_header (notmuch_message_file_t *message,
>> - const char *header_desired)
>> +GMimeMessage *
>> +notmuch_message_file_get_mime_message (notmuch_message_file_t *message)
>> {
>> - int contains;
>> - char *header, *decoded_value, *header_sofar, *combined_header;
>> - const char *s, *colon;
>> - int match, newhdr, hdrsofar, is_received;
>> - static int initialized = 0;
>> + if (notmuch_message_file_parse (message))
>> + return NULL;
>>
>> - is_received = (strcmp(header_desired,"received") == 0);
>> + return message->message;
>> +}
>>
>> - if (! initialized) {
>> - g_mime_init (GMIME_ENABLE_RFC2047_WORKAROUNDS);
>> - initialized = 1;
>> - }
>> +/*
>> + * Get all instances of a header decoded and concatenated.
>> + *
>> + * The result must be freed using g_free().
>> + *
>> + * Return NULL on errors, empty string for non-existing headers.
>> + */
>> +static char *
>> +_notmuch_message_file_get_combined_header (notmuch_message_file_t *message,
>> + const char *header)
>> +{
>> + GMimeHeaderList *headers;
>> + GMimeHeaderIter *iter;
>> + char *combined = NULL;
>>
>> - message->parsing_started = 1;
>> -
>> - if (header_desired == NULL)
>> - contains = 0;
>> - else
>> - contains = g_hash_table_lookup_extended (message->headers,
>> - header_desired, NULL,
>> - (gpointer *) &decoded_value);
>> -
>> - if (contains && decoded_value)
>> - return decoded_value;
>> -
>> - if (message->parsing_finished)
>> - return "";
>> -
>> -#define NEXT_HEADER_LINE(closure) \
>> - while (1) { \
>> - ssize_t bytes_read = getline (&message->line, \
>> - &message->line_size, \
>> - message->file); \
>> - if (bytes_read == -1) { \
>> - message->parsing_finished = 1; \
>> - break; \
>> - } \
>> - if (*message->line == '\n') { \
>> - message->parsing_finished = 1; \
>> - break; \
>> - } \
>> - if (closure && \
>> - (*message->line == ' ' || *message->line == '\t')) \
>> - { \
>> - copy_header_unfolding ((closure), message->line); \
>> - } \
>> - if (*message->line == ' ' || *message->line == '\t') \
>> - message->header_size += strlen (message->line); \
>> - else \
>> - break; \
>> - }
>> + headers = g_mime_object_get_header_list (GMIME_OBJECT (message->message));
>> + if (! headers)
>> + return NULL;
>>
>> - if (message->line == NULL)
>> - NEXT_HEADER_LINE (NULL);
>> + iter = g_mime_header_iter_new ();
>> + if (! iter)
>> + return NULL;
>>
>> - while (1) {
>> + if (! g_mime_header_list_get_iter (headers, iter))
>> + goto DONE;
>>
>> - if (message->parsing_finished)
>> - break;
>> + do {
>> + const char *value;
>
> Incorrect indentation.
I wonder what happened there!
>
>> + char *decoded;
>>
>> - colon = strchr (message->line, ':');
>> + if (strcasecmp (g_mime_header_iter_get_name (iter), header) != 0)
>> + continue;
>>
>> - if (colon == NULL) {
>> - message->broken_headers++;
>> - /* A simple heuristic for giving up on things that just
>> - * don't look like mail messages. */
>> - if (message->broken_headers >= 10 &&
>> - message->good_headers < 5)
>> - {
>> - message->parsing_finished = 1;
>> - break;
>> + value = g_mime_header_iter_get_value (iter);
>> + decoded = g_mime_utils_header_decode_text (value);
>
> Maybe a comment saying that GMime retains ownership of value, but
> decoded must be freed with g_free? GMime is confusing about
> ownership, and the fact that value *doesn't* have to be freed (which
> is not documented by GMime!) just adds to the confusion. I could see
> this being easy to screw up during later code changes.
Agreed.
>
>> + if (! decoded) {
>> + if (combined) {
>> + g_free (combined);
>> + combined = NULL;
>> }
>> - NEXT_HEADER_LINE (NULL);
>> - continue;
>> + goto DONE;
>> }
>>
>> - message->header_size += strlen (message->line);
>> + if (combined) {
>> + char *tmp = g_strdup_printf ("%s %s", combined, decoded);
>> + g_free (decoded);
>> + g_free (combined);
>> + if (! tmp) {
>> + combined = NULL;
>> + goto DONE;
>> + }
>>
>> - message->good_headers++;
>> + combined = tmp;
>> + } else {
>> + combined = decoded;
>> + }
>> + } while (g_mime_header_iter_next (iter));
>
> Yet another funny GMime API... Did you consider
>
> for (; g_mime_header_iter_is_valid (iter); g_mime_header_iter_next (iter))
>
> ? I realize that the g_mime_header_iter_is_valid isn't actually
> necessary because g_mime_header_list_get_iter returns NULL if the
> iterator would be empty, but using a for-loop would make it more
> obvious that this code is correct.
As I mentioned on IRC, that would lead to an infinite loop. Ugh.
>
>>
>> - header = xstrndup (message->line, colon - message->line);
>> + /* Return empty string for non-existing headers. */
>> + if (! combined)
>> + combined = g_strdup ("");
>>
>> - if (message->restrict_headers &&
>> - ! g_hash_table_lookup_extended (message->headers,
>> - header, NULL, NULL))
>> - {
>> - free (header);
>> - NEXT_HEADER_LINE (NULL);
>> - continue;
>> - }
>> + DONE:
>> + g_mime_header_iter_free (iter);
>>
>> - s = colon + 1;
>> - while (*s == ' ' || *s == '\t')
>> - s++;
>> + return combined;
>> +}
>
> Oof. That received header had better be worth it!
I'm not sure it is, but we've got users for it, so...
>
>>
>> - message->value.len = 0;
>> - copy_header_unfolding (&message->value, s);
>> +/* Return NULL on errors, empty string for non-existing headers. */
>
> This may sound strange, but I'd be inclined to omit this comment,
> since it made me not realize I should go looking for the much more
> thorough comment in notmuch-private.h.
Agreed.
>
>> +const char *
>> +notmuch_message_file_get_header (notmuch_message_file_t *message,
>> + const char *header)
>> +{
>> + const char *value = NULL;
>> + char *decoded;
>> + notmuch_bool_t found;
>>
>> - NEXT_HEADER_LINE (&message->value);
>> + if (notmuch_message_file_parse (message))
>> + return NULL;
>>
>> - if (header_desired == NULL)
>> - match = 0;
>> + /* If we have a cached decoded value, use it. */
>> + found = g_hash_table_lookup_extended (message->headers, header,
>> + NULL, (gpointer *) &value);
>> + if (found)
>> + return value ? value : "";
>
> When would header be found, but value be NULL? (If this can't in fact
> happen, how about using g_hash_table_lookup?)
Right. This was a remnant of header restricting which has now been
removed. Switching to g_hash_table_lookup.
>
>> +
>> + if (strcasecmp (header, "received") == 0) {
>> + /*
>> + * The Received: header is special. We concatenate all
>> + * instances of the header as we use this when analyzing the
>> + * path the mail has taken from sender to recipient.
>> + */
>> + decoded = _notmuch_message_file_get_combined_header (message, header);
>> + } else {
>> + value = g_mime_object_get_header (GMIME_OBJECT (message->message),
>> + header);
>> + if (value)
>> + decoded = g_mime_utils_header_decode_text (value);
>> else
>> - match = (strcasecmp (header, header_desired) == 0);
>> -
>> - decoded_value = g_mime_utils_header_decode_text (message->value.str);
>> - header_sofar = (char *)g_hash_table_lookup (message->headers, header);
>> - /* we treat the Received: header special - we want to concat ALL of
>> - * the Received: headers we encounter.
>> - * for everything else we return the first instance of a header */
>> - if (strcasecmp(header, "received") == 0) {
>> - if (header_sofar == NULL) {
>> - /* first Received: header we encountered; just add it */
>> - g_hash_table_insert (message->headers, header, decoded_value);
>> - } else {
>> - /* we need to add the header to those we already collected */
>> - newhdr = strlen(decoded_value);
>> - hdrsofar = strlen(header_sofar);
>> - combined_header = g_malloc(hdrsofar + newhdr + 2);
>> - strncpy(combined_header,header_sofar,hdrsofar);
>> - *(combined_header+hdrsofar) = ' ';
>> - strncpy(combined_header+hdrsofar+1,decoded_value,newhdr+1);
>> - g_free (decoded_value);
>> - g_hash_table_insert (message->headers, header, combined_header);
>> - }
>> - } else {
>> - if (header_sofar == NULL) {
>> - /* Only insert if we don't have a value for this header, yet. */
>> - g_hash_table_insert (message->headers, header, decoded_value);
>> - } else {
>> - free (header);
>> - g_free (decoded_value);
>> - decoded_value = header_sofar;
>> - }
>> - }
>> - /* if we found a match we can bail - unless of course we are
>> - * collecting all the Received: headers */
>> - if (match && !is_received)
>> - return decoded_value;
>> - }
>> -
>> - if (message->parsing_finished) {
>> - fclose (message->file);
>> - message->file = NULL;
>> + decoded = g_strdup ("");
>> }
>>
>> - if (message->line)
>> - free (message->line);
>> - message->line = NULL;
>> -
>> - if (message->value.size) {
>> - free (message->value.str);
>> - message->value.str = NULL;
>> - message->value.size = 0;
>> - message->value.len = 0;
>> - }
>> + if (! decoded)
>> + return NULL;
>>
>> - /* For the Received: header we actually might end up here even
>> - * though we found the header (as we force continued parsing
>> - * in that case). So let's check if that's the header we were
>> - * looking for and return the value that we found (if any)
>> - */
>> - if (is_received)
>> - return (char *)g_hash_table_lookup (message->headers, "received");
>> -
>> - /* We've parsed all headers and never found the one we're looking
>> - * for. It's probably just not there, but let's check that we
>> - * didn't make a mistake preventing us from seeing it. */
>> - if (message->restrict_headers && header_desired &&
>> - ! g_hash_table_lookup_extended (message->headers,
>> - header_desired, NULL, NULL))
>> - {
>> - INTERNAL_ERROR ("Attempt to get header \"%s\" which was not\n"
>> - "included in call to notmuch_message_file_restrict_headers\n",
>> - header_desired);
>> - }
>> + /* Cache the decoded value. We also own the strings. */
>> + g_hash_table_insert (message->headers, xstrdup (header), decoded);
>>
>> - return "";
>> + return decoded;
>> }
>> diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
>> index 59eb2bc285a5..734a4e338554 100644
>> --- a/lib/notmuch-private.h
>> +++ b/lib/notmuch-private.h
>> @@ -46,6 +46,8 @@ NOTMUCH_BEGIN_DECLS
>>
>> #include <talloc.h>
>>
>> +#include <gmime/gmime.h>
>> +
>> #include "xutil.h"
>> #include "error_util.h"
>>
>> @@ -320,13 +322,6 @@ notmuch_message_set_author (notmuch_message_t *message, const char *author);
>> const char *
>> notmuch_message_get_author (notmuch_message_t *message);
>>
>> -
>> -/* index.cc */
>> -
>> -notmuch_status_t
>> -_notmuch_message_index_file (notmuch_message_t *message,
>> - const char *filename);
>> -
>> /* message-file.c */
>>
>> /* XXX: I haven't decided yet whether these will actually get exported
>> @@ -352,32 +347,31 @@ _notmuch_message_file_open_ctx (void *ctx, const char *filename);
>> void
>> notmuch_message_file_close (notmuch_message_file_t *message);
>>
>> -/* Restrict 'message' to only save the named headers.
>> +/* Parse the message.
>> *
>> - * When the caller is only interested in a short list of headers,
>> - * known in advance, calling this function can avoid wasted time and
>> - * memory parsing/saving header values that will never be needed.
>> + * This will be done automatically as necessary on other calls
>> + * depending on it, but an explicit call allows for better error
>> + * status reporting.
>> + */
>> +notmuch_status_t
>> +notmuch_message_file_parse (notmuch_message_file_t *message);
>
> Most internal functions have a leading _. (We've diverged from that
> in a few places, especially in notmuch_message_file_*, but it would be
> nice to not diverge further from that convention.)
Fixed.
>
>> +
>> +/* Get the gmime message of a message file.
>> *
>> - * The variable arguments should be a list of const char * with a
>> - * final '(const char *) NULL' to terminate the list.
>> + * The message file is parsed as necessary.
>> *
>> - * If this function is called, it must be called before any calls to
>> - * notmuch_message_get_header for this message.
>> + * Returns GMimeMessage* on success (which the caller must not unref),
>> + * NULL if the message file parsing fails.
>> *
>> - * After calling this function, if notmuch_message_get_header is
>> - * called with a header name not in this list, then NULL will be
>> - * returned even if that header exists in the actual message.
>> + * XXX: Would be nice to not have to expose GMimeMessage here.
>> */
>> -void
>> -notmuch_message_file_restrict_headers (notmuch_message_file_t *message, ...);
>> -
>> -/* Identical to notmuch_message_restrict_headers but accepting a va_list. */
>> -void
>> -notmuch_message_file_restrict_headersv (notmuch_message_file_t *message,
>> - va_list va_headers);
>> +GMimeMessage *
>> +notmuch_message_file_get_mime_message (notmuch_message_file_t *message);
>
> Same comment about _.
Ditto.
>
>>
>> /* Get the value of the specified header from the message as a UTF-8 string.
>> *
>> + * The message file is parsed as necessary.
>> + *
>> * The header name is case insensitive.
>> *
>> * The Received: header is special - for it all Received: headers in
>> @@ -387,13 +381,19 @@ notmuch_message_file_restrict_headersv (notmuch_message_file_t *message,
>> * only until the message is closed. The caller should copy it if
>> * needing to modify the value or to hold onto it for longer.
>> *
>> - * Returns NULL if the message does not contain a header line matching
>> - * 'header'.
>> + * Returns NULL on errors, empty string if the message does not
>> + * contain a header line matching 'header'.
>> */
>> const char *
>> notmuch_message_file_get_header (notmuch_message_file_t *message,
>> const char *header);
>>
>> +/* index.cc */
>> +
>> +notmuch_status_t
>> +_notmuch_message_index_file (notmuch_message_t *message,
>> + notmuch_message_file_t *message_file);
>> +
>
> Any particular reason this floated down? (Obviously it's not an
> actual problem.)
It needs the typedef for notmuch_message_file_t. Previous versions of
the patch moved the typedef, but this keeps the typedef next to other
stuff from message-file.c.
Thanks for the review.
BR,
Jani.
>
>> /* messages.c */
>>
>> typedef struct _notmuch_message_node {
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2014-03-30 21:10 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-03-23 20:01 [PATCH v4] lib: replace the header parser with gmime Jani Nikula
2014-03-29 22:11 ` Austin Clements
2014-03-30 21:10 ` Jani Nikula
Code repositories for project(s) associated with this public inbox
https://yhetil.org/notmuch.git/
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).