unofficial mirror of notmuch@notmuchmail.org
 help / color / mirror / code / Atom feed
* improve from-header guessing
@ 2010-04-16 20:51 Dirk Hohndel
  2010-04-16 20:51 ` [PATCH 1/2] Add interface to obtain the concatenation of all instances of a specified header Dirk Hohndel
  2010-04-23 18:47 ` improve from-header guessing Carl Worth
  0 siblings, 2 replies; 5+ messages in thread
From: Dirk Hohndel @ 2010-04-16 20:51 UTC (permalink / raw)
  To: notmuch

The following two patches should address most of the concerns raised 
to my previous series. 

The first patch simply adds an interface to obtain a concatenation of
all instances of a specific header from an email.
The second patch uses that in order to get the full Received: headers.
It now looks at Envelope-to: and X-Original-To: headers, then at the
concatenated Received headers for either a "for email@add.res" clause
that matches a configured address or for a " by " clause that matches
the domain of a configured address.

What is still missing is the check if the host from which the mail was
received in this last case had a routable IP address.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 1/2] Add interface to obtain the concatenation of all instances of a specified header
  2010-04-16 20:51 improve from-header guessing Dirk Hohndel
@ 2010-04-16 20:51 ` Dirk Hohndel
  2010-04-16 20:51   ` [PATCH 2/2] Improve heuristic for guessing best from address in replies Dirk Hohndel
  2010-04-23 18:47 ` improve from-header guessing Carl Worth
  1 sibling, 1 reply; 5+ messages in thread
From: Dirk Hohndel @ 2010-04-16 20:51 UTC (permalink / raw)
  To: notmuch

notmuch_message_get_header only returns the first instance of the specified
header in a message.
notmuch_message_get_concat_header concatenates the values from ALL instances
of that header in a message. This is useful for example to get the full
delivery path as captured in all of the Received: headers.

Signed-off-by: Dirk Hohndel <hohndel@infradead.org>
---
 lib/database.cc       |   14 +++++++-------
 lib/message-file.c    |   49 +++++++++++++++++++++++++++++++++++--------------
 lib/message.cc        |   12 +++++++++++-
 lib/notmuch-private.h |    2 +-
 lib/notmuch.h         |   16 ++++++++++++++++
 5 files changed, 70 insertions(+), 23 deletions(-)

diff --git a/lib/database.cc b/lib/database.cc
index 6842faf..d706263 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -1289,11 +1289,11 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
     parents = g_hash_table_new_full (g_str_hash, g_str_equal,
 				     _my_talloc_free_for_g_hash, NULL);
 
-    refs = notmuch_message_file_get_header (message_file, "references");
+    refs = notmuch_message_file_get_header (message_file, "references", 0);
     parse_references (message, notmuch_message_get_message_id (message),
 		      parents, refs);
 
-    in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to");
+    in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to", 0);
     parse_references (message, notmuch_message_get_message_id (message),
 		      parents, in_reply_to);
 
@@ -1506,9 +1506,9 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
 	 * let's make sure that what we're looking at looks like an
 	 * actual email message.
 	 */
-	from = notmuch_message_file_get_header (message_file, "from");
-	subject = notmuch_message_file_get_header (message_file, "subject");
-	to = notmuch_message_file_get_header (message_file, "to");
+	from = notmuch_message_file_get_header (message_file, "from", 0);
+	subject = notmuch_message_file_get_header (message_file, "subject", 0);
+	to = notmuch_message_file_get_header (message_file, "to", 0);
 
 	if ((from == NULL || *from == '\0') &&
 	    (subject == NULL || *subject == '\0') &&
@@ -1521,7 +1521,7 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
 	/* Now that we're sure it's mail, the first order of business
 	 * is to find a message ID (or else create one ourselves). */
 
-	header = notmuch_message_file_get_header (message_file, "message-id");
+	header = notmuch_message_file_get_header (message_file, "message-id", 0);
 	if (header && *header != '\0') {
 	    message_id = _parse_message_id (message_file, header, NULL);
 
@@ -1580,7 +1580,7 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
 	    if (ret)
 		goto DONE;
 
-	    date = notmuch_message_file_get_header (message_file, "date");
+	    date = notmuch_message_file_get_header (message_file, "date", 0);
 	    _notmuch_message_set_date (message, date);
 
 	    _notmuch_message_index_file (message, filename);
diff --git a/lib/message-file.c b/lib/message-file.c
index 0c152a3..a01adbb 100644
--- a/lib/message-file.c
+++ b/lib/message-file.c
@@ -209,15 +209,21 @@ copy_header_unfolding (header_value_closure_t *value,
 
 /* As a special-case, a value of NULL for header_desired will force
  * the entire header to be parsed if it is not parsed already. This is
- * used by the _notmuch_message_file_get_headers_end function. */
+ * used by the _notmuch_message_file_get_headers_end function. 
+ * If concat is 'true' then it parses the whole message and
+ * concatenates all instances of the header in question. This is
+ * currently used to get a complete Received: header when analyzing
+ * the path the mail has taken from sender to recipient.
+ */
 const char *
 notmuch_message_file_get_header (notmuch_message_file_t *message,
-				 const char *header_desired)
+				 const char *header_desired,
+				 int concat)
 {
     int contains;
-    char *header, *decoded_value;
+    char *header, *decoded_value, *header_sofar, *combined_header;
     const char *s, *colon;
-    int match;
+    int match, newhdr, hdrsofar;
     static int initialized = 0;
 
     if (! initialized) {
@@ -227,7 +233,7 @@ notmuch_message_file_get_header (notmuch_message_file_t *message,
 
     message->parsing_started = 1;
 
-    if (header_desired == NULL)
+    if (concat || header_desired == NULL) 
 	contains = 0;
     else
 	contains = g_hash_table_lookup_extended (message->headers,
@@ -237,6 +243,9 @@ notmuch_message_file_get_header (notmuch_message_file_t *message,
     if (contains && decoded_value)
 	return decoded_value;
 
+    if (concat)
+	message->parsing_finished = 0;
+
     if (message->parsing_finished)
 	return "";
 
@@ -312,20 +321,32 @@ notmuch_message_file_get_header (notmuch_message_file_t *message,
 
 	NEXT_HEADER_LINE (&message->value);
 
-	if (header_desired == 0)
+	if (concat || header_desired == NULL)
 	    match = 0;
 	else
 	    match = (strcasecmp (header, header_desired) == 0);
 
 	decoded_value = g_mime_utils_header_decode_text (message->value.str);
-	if (g_hash_table_lookup (message->headers, header) == NULL) {
-	    /* Only insert if we don't have a value for this header, yet.
-	     * This way we always return the FIRST instance of any header
-	     * we search for
-	     * FIXME: we should be returning ALL instances of a header
-	     *        or at least provide a way to iterate over them
-	     */
-	    g_hash_table_insert (message->headers, header, decoded_value);
+	header_sofar = (char *)g_hash_table_lookup (message->headers, header);
+	if (concat) {
+	    if (header_sofar == NULL) {
+		/* Only insert if we don't have a value for this header, yet. */
+		g_hash_table_insert (message->headers, header, decoded_value);
+	    } else {
+		/* the caller wants them all concatenated */
+		newhdr = strlen(decoded_value);
+		hdrsofar = strlen(header_sofar);
+		combined_header = xmalloc(hdrsofar + newhdr + 2);
+		strncpy(combined_header,header_sofar,hdrsofar);
+		*(combined_header+hdrsofar) = ' ';
+		strncpy(combined_header+hdrsofar+1,decoded_value,newhdr+1);
+		g_hash_table_insert (message->headers, header, combined_header);
+	    }
+	} else {
+	    if (header_sofar == NULL) {
+		/* Only insert if we don't have a value for this header, yet. */
+		g_hash_table_insert (message->headers, header, decoded_value);
+	    }
 	}
 	if (match)
 	    return decoded_value;
diff --git a/lib/message.cc b/lib/message.cc
index 721c9a6..fb8fe95 100644
--- a/lib/message.cc
+++ b/lib/message.cc
@@ -264,7 +264,17 @@ notmuch_message_get_header (notmuch_message_t *message, const char *header)
     if (message->message_file == NULL)
 	return NULL;
 
-    return notmuch_message_file_get_header (message->message_file, header);
+    return notmuch_message_file_get_header (message->message_file, header, 0);
+}
+
+const char *
+notmuch_message_get_concat_header (notmuch_message_t *message, const char *header)
+{
+    _notmuch_message_ensure_message_file (message);
+    if (message->message_file == NULL)
+	return NULL;
+
+    return notmuch_message_file_get_header (message->message_file, header, 1);
 }
 
 /* Return the message ID from the In-Reply-To header of 'message'.
diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
index d52d84d..9f8a10a 100644
--- a/lib/notmuch-private.h
+++ b/lib/notmuch-private.h
@@ -342,7 +342,7 @@ notmuch_message_file_restrict_headersv (notmuch_message_file_t *message,
  */
 const char *
 notmuch_message_file_get_header (notmuch_message_file_t *message,
-				 const char *header);
+				 const char *header, int concat);
 
 /* messages.c */
 
diff --git a/lib/notmuch.h b/lib/notmuch.h
index a7e66dd..d77eb5c 100644
--- a/lib/notmuch.h
+++ b/lib/notmuch.h
@@ -787,6 +787,22 @@ notmuch_message_get_date  (notmuch_message_t *message);
 const char *
 notmuch_message_get_header (notmuch_message_t *message, const char *header);
 
+/* Get the concatenated value of all instances of the specified header
+ * from 'message'.
+ *
+ * The value will be read from the actual message file, not from the
+ * notmuch database. The header name is case insensitive.
+ *
+ * The returned string belongs to the message so should not be
+ * modified or freed by the caller (nor should it be referenced after
+ * the message is destroyed).
+ *
+ * Returns an empty string ("") if the message does not contain a
+ * header line matching 'header'. Returns NULL if any error occurs.
+ */
+const char *
+notmuch_message_get_concat_header (notmuch_message_t *message, const char *header);
+
 /* Get the tags for 'message', returning a notmuch_tags_t object which
  * can be used to iterate over all tags.
  *
-- 
1.6.6.1

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH 2/2] Improve heuristic for guessing best from address in replies
  2010-04-16 20:51 ` [PATCH 1/2] Add interface to obtain the concatenation of all instances of a specified header Dirk Hohndel
@ 2010-04-16 20:51   ` Dirk Hohndel
  0 siblings, 0 replies; 5+ messages in thread
From: Dirk Hohndel @ 2010-04-16 20:51 UTC (permalink / raw)
  To: notmuch

We now look at Envelope-To: and Original-To: headers
Then concat all of the Received headers and walk through them to find
either a "for email@add.res" clause or a host in a known domain.

This should deal with most of the fetchmail and mail hoster induced
pain (and failure) of the old heuristic.

Signed-off-by: Dirk Hohndel <hohndel@infradead.org>
---
 notmuch-reply.c |  125 +++++++++++++++++++++++++++++++++++++++++--------------
 1 files changed, 94 insertions(+), 31 deletions(-)

diff --git a/notmuch-reply.c b/notmuch-reply.c
index 230cacc..78d3914 100644
--- a/notmuch-reply.c
+++ b/notmuch-reply.c
@@ -305,33 +305,95 @@ add_recipients_from_message (GMimeMessage *reply,
 static const char *
 guess_from_received_header (notmuch_config_t *config, notmuch_message_t *message)
 {
-    const char *received,*primary;
-    char **other;
-    char *by,*mta,*ptr,*token;
+    const char *received,*primary,*by;
+    char **other,*tohdr;
+    char *mta,*ptr,*token;
     char *domain=NULL;
     char *tld=NULL;
     const char *delim=". \t";
     size_t i,other_len;
 
-    received = notmuch_message_get_header (message, "received");
-    by = strstr (received, " by ");
-    if (by && *(by+4)) {
-	/* sadly, the format of Received: headers is a bit inconsistent,
-	 * depending on the MTA used. So we try to extract just the MTA
-	 * here by removing leading whitespace and assuming that the MTA
-	 * name ends at the next whitespace
-	 * we test for *(by+4) to be non-'\0' to make sure there's something
-	 * there at all - and then assume that the first whitespace delimited
-	 * token that follows is the last receiving server
+    const char *to_headers[] = {"Envelope-to", "X-Original-To"};
+
+    primary = notmuch_config_get_user_primary_email (config);
+    other = notmuch_config_get_user_other_email (config, &other_len);
+
+    /* sadly, there is no standard way to find out to which email
+     * address a mail was delivered - what is in the headers depends
+     * on the MTAs used along the way. So we are trying a number of
+     * heuristics which hopefully will answer this question.
+
+     * We only got here if none of the users email addresses are in
+     * the To: or Cc: header. From here we try the following in order:
+     * 1) check for an Envelope-to: header
+     * 2) check for an X-Original-To: header
+     * 3) check for a (for <email@add.res>) clause in Received: headers
+     * 4) check for the domain part of known email addresses in the 
+     *    'by' part of Received headers
+     * If none of these work, we give up and return NULL
+     */
+    for (i = 0; i < sizeof(to_headers)/sizeof(*to_headers); i++) {
+	tohdr = xstrdup(notmuch_message_get_header (message, to_headers[i]));
+	if (tohdr && *tohdr) {
+	    /* tohdr is potentialy a list of email addresses, so here we
+	     * check if one of the email addresses is a substring of tohdr
+	     */
+	    if (strcasestr(tohdr, primary)) {
+		free(tohdr);
+		return primary;
+	    }
+	    for (i = 0; i < other_len; i++)
+		if (strcasestr (tohdr, other[i])) {
+		    free(tohdr);
+		    return other[i];
+		}
+	    free(tohdr);
+	}
+    }
+		   
+    /* We get the concatenated Received: headers and search from the
+     * front (last Received: header added) and try to extract from
+     * them indications to which email address this message was
+     * delivered.
+     */
+    received = notmuch_message_get_concat_header (message, "received");
+    /* First we look for a " for <email@add.res>" in the received
+     * header
+     */
+    ptr = strstr (received, " for ");
+    if (ptr) {
+	/* the text following is potentialy a list of email addresses,
+	 * so again we check if one of the email addresses is a
+	 * substring of ptr
 	 */
-	mta = strdup (by+4);
-	if (mta == NULL)
-	    return NULL;
+	if (strcasestr(ptr, primary)) {
+	    return primary;
+	}
+	for (i = 0; i < other_len; i++)
+	    if (strcasestr (ptr, other[i])) {
+		return other[i];
+	    }
+    }
+    /* Finally, we parse all the " by MTA ..." headers to guess the
+     * email address that this was originally delivered to.
+     * We extract just the MTA here by removing leading whitespace and
+     * assuming that the MTA name ends at the next whitespace.
+     * We test for *(by+4) to be non-'\0' to make sure there's
+     * something there at all - and then assume that the first
+     * whitespace delimited token that follows is the receiving
+     * system in this step of the receive chain
+     */
+    by = received;
+    while((by = strstr (by, " by ")) != NULL) {
+	by += 4;
+	if (*by == '\0')
+	    break;
+	mta = xstrdup (by);
 	token = strtok(mta," \t");
 	if (token == NULL)
-	    return NULL;
+	    break;
 	/* Now extract the last two components of the MTA host name
-	 * as domain and tld
+	 * as domain and tld.
 	 */
 	while ((ptr = strsep (&token, delim)) != NULL) {
 	    if (*ptr == '\0')
@@ -341,23 +403,24 @@ guess_from_received_header (notmuch_config_t *config, notmuch_message_t *message
 	}
 
 	if (domain) {
-	    /* recombine domain and tld and look for it among the configured
-	     * email addresses
+	    /* Recombine domain and tld and look for it among the configured
+	     * email addresses.
+	     * This time we have a known domain name and nothing else - so
+	     * the test is the other way around: we check if this is a 
+	     * substring of one of the email addresses.
 	     */
 	    *(tld-1) = '.';
-	    primary = notmuch_config_get_user_primary_email (config);
-	    if (strcasestr (primary, domain)) {
-		free (mta);
-		return primary;
+	    
+	    if (strcasestr(primary, domain)) {
+		free(mta);
+	    return primary;
+	}
+	for (i = 0; i < other_len; i++)
+	    if (strcasestr (other[i],domain)) {
+		free(mta);
+		return other[i];
 	    }
-	    other = notmuch_config_get_user_other_email (config, &other_len);
-	    for (i = 0; i < other_len; i++)
-		if (strcasestr (other[i], domain)) {
-		    free (mta);
-		    return other[i];
-		}
 	}
-
 	free (mta);
     }
 
-- 
1.6.6.1

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: improve from-header guessing
  2010-04-16 20:51 improve from-header guessing Dirk Hohndel
  2010-04-16 20:51 ` [PATCH 1/2] Add interface to obtain the concatenation of all instances of a specified header Dirk Hohndel
@ 2010-04-23 18:47 ` Carl Worth
  2010-04-24 19:09   ` Dirk Hohndel
  1 sibling, 1 reply; 5+ messages in thread
From: Carl Worth @ 2010-04-23 18:47 UTC (permalink / raw)
  To: Dirk Hohndel, notmuch

[-- Attachment #1: Type: text/plain, Size: 1493 bytes --]

On Fri, 16 Apr 2010 13:51:40 -0700, Dirk Hohndel <hohndel@infradead.org> wrote:
> The following two patches should address most of the concerns raised 
> to my previous series. 

Allow me to raise new concerns then. ;-)

> The first patch simply adds an interface to obtain a concatenation of
> all instances of a specific header from an email.

I was hoping to see the "special-case value of NULL" go away with this
change.

And I like that there's a new function to get the concatenated header,
(I would prefer an unabbreviated name of get_concatenated_header than
get_header_concat), but I don't like seeing all the existing callers of
get_header updated to pass an extra 0. Instead, I'd prefer to see those
calls unchanged, and a tiny new get_header that passes the 0 and then
make the actual implementing function be static and named something like
notmuch_message_file_get_header_internal.

Both patches have some trailing whitespace. I see these easily wince I
have the following in my ~/.gitconfig:

	[core]
		whitespace = trailing-space,space-before-tab

I'm sure there's a way to make git refuse to let you commit changes with
trailing whitespace, but I don't know offhand what it is.

Finally, I'd like to see some tests for this feature. (But we do have
the feature already without tests, so I won't strictly block on that).

If you can fix up any of the above before I make another pass through ym
queue, that would be great.

Thanks,

-Carl

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: improve from-header guessing
  2010-04-23 18:47 ` improve from-header guessing Carl Worth
@ 2010-04-24 19:09   ` Dirk Hohndel
  0 siblings, 0 replies; 5+ messages in thread
From: Dirk Hohndel @ 2010-04-24 19:09 UTC (permalink / raw)
  To: Carl Worth, notmuch

On Fri, 23 Apr 2010 11:47:04 -0700, Carl Worth <cworth@cworth.org> wrote:
> On Fri, 16 Apr 2010 13:51:40 -0700, Dirk Hohndel <hohndel@infradead.org> wrote:
> > The following two patches should address most of the concerns raised 
> > to my previous series. 
> 
> Allow me to raise new concerns then. ;-)

Any time
 
> > The first patch simply adds an interface to obtain a concatenation of
> > all instances of a specific header from an email.
> 
> I was hoping to see the "special-case value of NULL" go away with this
> change.
> 
> And I like that there's a new function to get the concatenated header,
> (I would prefer an unabbreviated name of get_concatenated_header than
> get_header_concat), but I don't like seeing all the existing callers of
> get_header updated to pass an extra 0. Instead, I'd prefer to see those
> calls unchanged, and a tiny new get_header that passes the 0 and then
> make the actual implementing function be static and named something like
> notmuch_message_file_get_header_internal.

Turns out that the way I did this was broken anyway. So we can simply
forget these patches and your concerns. I'm sure you'll raise new
concerns on the new ("rearchitected") patches.

> Both patches have some trailing whitespace. I see these easily wince I
> have the following in my ~/.gitconfig:
> 
> 	[core]
> 		whitespace = trailing-space,space-before-tab

I know. I'm trying to be better about checking whitespace pollution
before submitting things.

> Finally, I'd like to see some tests for this feature. (But we do have
> the feature already without tests, so I won't strictly block on that).

Hu? You even commited these already. Or am I reading email out of order
again? 

/D

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2010-04-24 19:09 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-04-16 20:51 improve from-header guessing Dirk Hohndel
2010-04-16 20:51 ` [PATCH 1/2] Add interface to obtain the concatenation of all instances of a specified header Dirk Hohndel
2010-04-16 20:51   ` [PATCH 2/2] Improve heuristic for guessing best from address in replies Dirk Hohndel
2010-04-23 18:47 ` improve from-header guessing Carl Worth
2010-04-24 19:09   ` Dirk Hohndel

Code repositories for project(s) associated with this public inbox

	https://yhetil.org/notmuch.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).