* improve from-header guessing
@ 2010-04-16 20:51 Dirk Hohndel
2010-04-16 20:51 ` [PATCH 1/2] Add interface to obtain the concatenation of all instances of a specified header Dirk Hohndel
2010-04-23 18:47 ` improve from-header guessing Carl Worth
0 siblings, 2 replies; 5+ messages in thread
From: Dirk Hohndel @ 2010-04-16 20:51 UTC (permalink / raw)
To: notmuch
The following two patches should address most of the concerns raised
to my previous series.
The first patch simply adds an interface to obtain a concatenation of
all instances of a specific header from an email.
The second patch uses that in order to get the full Received: headers.
It now looks at Envelope-to: and X-Original-To: headers, then at the
concatenated Received headers for either a "for email@add.res" clause
that matches a configured address or for a " by " clause that matches
the domain of a configured address.
What is still missing is a check of whether the host from which the mail
was received in this last case had a routable IP address.
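The lookup order described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: guess_delivered_to, match_address and contains_nocase are hypothetical names, the header values are passed in as plain strings instead of being read through the notmuch API, and the final " by " domain step is left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive substring test (portable stand-in for strcasestr). */
static int
contains_nocase (const char *haystack, const char *needle)
{
    size_t n = strlen (needle);

    if (n == 0)
        return 0; /* an empty address never counts as a match */
    for (; *haystack; haystack++) {
        size_t k = 0;

        while (k < n && haystack[k] &&
               tolower ((unsigned char) haystack[k]) ==
               tolower ((unsigned char) needle[k]))
            k++;
        if (k == n)
            return 1;
    }
    return 0;
}

/* Return the first configured address that occurs in 'text', or NULL. */
static const char *
match_address (const char *text, const char *const *addrs, size_t n_addrs)
{
    size_t j;

    if (text == NULL)
        return NULL;
    for (j = 0; j < n_addrs; j++)
        if (contains_nocase (text, addrs[j]))
            return addrs[j];
    return NULL;
}

/* Hypothetical driver: Envelope-to first, then X-Original-To, then the
 * first " for " clause of the concatenated Received: headers. */
const char *
guess_delivered_to (const char *envelope_to, const char *x_original_to,
                    const char *received,
                    const char *const *addrs, size_t n_addrs)
{
    const char *hit, *clause;

    assert (addrs != NULL || n_addrs == 0);
    if ((hit = match_address (envelope_to, addrs, n_addrs)) != NULL)
        return hit;
    if ((hit = match_address (x_original_to, addrs, n_addrs)) != NULL)
        return hit;
    clause = received ? strstr (received, " for ") : NULL;
    return match_address (clause, addrs, n_addrs);
}
```

The address list would hold the primary address followed by the other configured addresses, so the primary address wins when several match.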
* [PATCH 1/2] Add interface to obtain the concatenation of all instances of a specified header
2010-04-16 20:51 improve from-header guessing Dirk Hohndel
@ 2010-04-16 20:51 ` Dirk Hohndel
2010-04-16 20:51 ` [PATCH 2/2] Improve heuristic for guessing best from address in replies Dirk Hohndel
2010-04-23 18:47 ` improve from-header guessing Carl Worth
1 sibling, 1 reply; 5+ messages in thread
From: Dirk Hohndel @ 2010-04-16 20:51 UTC (permalink / raw)
To: notmuch
notmuch_message_get_header only returns the first instance of the specified
header in a message.
notmuch_message_get_concat_header concatenates the values from ALL instances
of that header in a message. This is useful for example to get the full
delivery path as captured in all of the Received: headers.
Signed-off-by: Dirk Hohndel <hohndel@infradead.org>
---
lib/database.cc | 14 +++++++-------
lib/message-file.c | 49 +++++++++++++++++++++++++++++++++++--------------
lib/message.cc | 12 +++++++++++-
lib/notmuch-private.h | 2 +-
lib/notmuch.h | 16 ++++++++++++++++
5 files changed, 70 insertions(+), 23 deletions(-)
diff --git a/lib/database.cc b/lib/database.cc
index 6842faf..d706263 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -1289,11 +1289,11 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
parents = g_hash_table_new_full (g_str_hash, g_str_equal,
_my_talloc_free_for_g_hash, NULL);
- refs = notmuch_message_file_get_header (message_file, "references");
+ refs = notmuch_message_file_get_header (message_file, "references", 0);
parse_references (message, notmuch_message_get_message_id (message),
parents, refs);
- in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to");
+ in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to", 0);
parse_references (message, notmuch_message_get_message_id (message),
parents, in_reply_to);
@@ -1506,9 +1506,9 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
* let's make sure that what we're looking at looks like an
* actual email message.
*/
- from = notmuch_message_file_get_header (message_file, "from");
- subject = notmuch_message_file_get_header (message_file, "subject");
- to = notmuch_message_file_get_header (message_file, "to");
+ from = notmuch_message_file_get_header (message_file, "from", 0);
+ subject = notmuch_message_file_get_header (message_file, "subject", 0);
+ to = notmuch_message_file_get_header (message_file, "to", 0);
if ((from == NULL || *from == '\0') &&
(subject == NULL || *subject == '\0') &&
@@ -1521,7 +1521,7 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
/* Now that we're sure it's mail, the first order of business
* is to find a message ID (or else create one ourselves). */
- header = notmuch_message_file_get_header (message_file, "message-id");
+ header = notmuch_message_file_get_header (message_file, "message-id", 0);
if (header && *header != '\0') {
message_id = _parse_message_id (message_file, header, NULL);
@@ -1580,7 +1580,7 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
if (ret)
goto DONE;
- date = notmuch_message_file_get_header (message_file, "date");
+ date = notmuch_message_file_get_header (message_file, "date", 0);
_notmuch_message_set_date (message, date);
_notmuch_message_index_file (message, filename);
diff --git a/lib/message-file.c b/lib/message-file.c
index 0c152a3..a01adbb 100644
--- a/lib/message-file.c
+++ b/lib/message-file.c
@@ -209,15 +209,21 @@ copy_header_unfolding (header_value_closure_t *value,
/* As a special-case, a value of NULL for header_desired will force
* the entire header to be parsed if it is not parsed already. This is
- * used by the _notmuch_message_file_get_headers_end function. */
+ * used by the _notmuch_message_file_get_headers_end function.
+ * If concat is 'true' then it parses the whole message and
+ * concatenates all instances of the header in question. This is
+ * currently used to get a complete Received: header when analyzing
+ * the path the mail has taken from sender to recipient.
+ */
const char *
notmuch_message_file_get_header (notmuch_message_file_t *message,
- const char *header_desired)
+ const char *header_desired,
+ int concat)
{
int contains;
- char *header, *decoded_value;
+ char *header, *decoded_value, *header_sofar, *combined_header;
const char *s, *colon;
- int match;
+ int match, newhdr, hdrsofar;
static int initialized = 0;
if (! initialized) {
@@ -227,7 +233,7 @@ notmuch_message_file_get_header (notmuch_message_file_t *message,
message->parsing_started = 1;
- if (header_desired == NULL)
+ if (concat || header_desired == NULL)
contains = 0;
else
contains = g_hash_table_lookup_extended (message->headers,
@@ -237,6 +243,9 @@ notmuch_message_file_get_header (notmuch_message_file_t *message,
if (contains && decoded_value)
return decoded_value;
+ if (concat)
+ message->parsing_finished = 0;
+
if (message->parsing_finished)
return "";
@@ -312,20 +321,32 @@ notmuch_message_file_get_header (notmuch_message_file_t *message,
NEXT_HEADER_LINE (&message->value);
- if (header_desired == 0)
+ if (concat || header_desired == NULL)
match = 0;
else
match = (strcasecmp (header, header_desired) == 0);
decoded_value = g_mime_utils_header_decode_text (message->value.str);
- if (g_hash_table_lookup (message->headers, header) == NULL) {
- /* Only insert if we don't have a value for this header, yet.
- * This way we always return the FIRST instance of any header
- * we search for
- * FIXME: we should be returning ALL instances of a header
- * or at least provide a way to iterate over them
- */
- g_hash_table_insert (message->headers, header, decoded_value);
+ header_sofar = (char *)g_hash_table_lookup (message->headers, header);
+ if (concat) {
+ if (header_sofar == NULL) {
+ /* Only insert if we don't have a value for this header, yet. */
+ g_hash_table_insert (message->headers, header, decoded_value);
+ } else {
+ /* the caller wants them all concatenated */
+ newhdr = strlen(decoded_value);
+ hdrsofar = strlen(header_sofar);
+ combined_header = xmalloc(hdrsofar + newhdr + 2);
+ strncpy(combined_header,header_sofar,hdrsofar);
+ *(combined_header+hdrsofar) = ' ';
+ strncpy(combined_header+hdrsofar+1,decoded_value,newhdr+1);
+ g_hash_table_insert (message->headers, header, combined_header);
+ }
+ } else {
+ if (header_sofar == NULL) {
+ /* Only insert if we don't have a value for this header, yet. */
+ g_hash_table_insert (message->headers, header, decoded_value);
+ }
}
if (match)
return decoded_value;
diff --git a/lib/message.cc b/lib/message.cc
index 721c9a6..fb8fe95 100644
--- a/lib/message.cc
+++ b/lib/message.cc
@@ -264,7 +264,17 @@ notmuch_message_get_header (notmuch_message_t *message, const char *header)
if (message->message_file == NULL)
return NULL;
- return notmuch_message_file_get_header (message->message_file, header);
+ return notmuch_message_file_get_header (message->message_file, header, 0);
+}
+
+const char *
+notmuch_message_get_concat_header (notmuch_message_t *message, const char *header)
+{
+ _notmuch_message_ensure_message_file (message);
+ if (message->message_file == NULL)
+ return NULL;
+
+ return notmuch_message_file_get_header (message->message_file, header, 1);
}
/* Return the message ID from the In-Reply-To header of 'message'.
diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
index d52d84d..9f8a10a 100644
--- a/lib/notmuch-private.h
+++ b/lib/notmuch-private.h
@@ -342,7 +342,7 @@ notmuch_message_file_restrict_headersv (notmuch_message_file_t *message,
*/
const char *
notmuch_message_file_get_header (notmuch_message_file_t *message,
- const char *header);
+ const char *header, int concat);
/* messages.c */
diff --git a/lib/notmuch.h b/lib/notmuch.h
index a7e66dd..d77eb5c 100644
--- a/lib/notmuch.h
+++ b/lib/notmuch.h
@@ -787,6 +787,22 @@ notmuch_message_get_date (notmuch_message_t *message);
const char *
notmuch_message_get_header (notmuch_message_t *message, const char *header);
+/* Get the concatenated value of all instances of the specified header
+ * from 'message'.
+ *
+ * The value will be read from the actual message file, not from the
+ * notmuch database. The header name is case insensitive.
+ *
+ * The returned string belongs to the message so should not be
+ * modified or freed by the caller (nor should it be referenced after
+ * the message is destroyed).
+ *
+ * Returns an empty string ("") if the message does not contain a
+ * header line matching 'header'. Returns NULL if any error occurs.
+ */
+const char *
+notmuch_message_get_concat_header (notmuch_message_t *message, const char *header);
+
/* Get the tags for 'message', returning a notmuch_tags_t object which
* can be used to iterate over all tags.
*
--
1.6.6.1
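For illustration, the joining step in the message-file.c hunk amounts to the following. This is a minimal sketch that assumes plain malloc instead of notmuch's xmalloc; concat_header_values is a hypothetical name that does not appear in the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Join 'sofar' and 'next' with a single space. Returns a newly
 * allocated string that the caller must free, or NULL if the
 * allocation fails. */
char *
concat_header_values (const char *sofar, const char *next)
{
    size_t a, b;
    char *out;

    assert (sofar != NULL && next != NULL);
    a = strlen (sofar);
    b = strlen (next);
    out = malloc (a + b + 2);          /* separator plus trailing NUL */
    if (out == NULL)
        return NULL;
    memcpy (out, sofar, a);            /* copy without the NUL */
    out[a] = ' ';
    memcpy (out + a + 1, next, b + 1); /* copy including the NUL */
    return out;
}
```

Using memcpy with explicit lengths avoids relying on strncpy to place the terminating NUL, which it only does when the source is shorter than the given count.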
* [PATCH 2/2] Improve heuristic for guessing best from address in replies
2010-04-16 20:51 ` [PATCH 1/2] Add interface to obtain the concatenation of all instances of a specified header Dirk Hohndel
@ 2010-04-16 20:51 ` Dirk Hohndel
0 siblings, 0 replies; 5+ messages in thread
From: Dirk Hohndel @ 2010-04-16 20:51 UTC (permalink / raw)
To: notmuch
We now look at the Envelope-to: and X-Original-To: headers first.
Then we concatenate all of the Received: headers and walk through them to
find either a "for email@add.res" clause or a host in a known domain.
This should deal with most of the fetchmail and mail hoster induced
pain (and failure) of the old heuristic.
Signed-off-by: Dirk Hohndel <hohndel@infradead.org>
---
notmuch-reply.c | 125 +++++++++++++++++++++++++++++++++++++++++--------------
1 files changed, 94 insertions(+), 31 deletions(-)
diff --git a/notmuch-reply.c b/notmuch-reply.c
index 230cacc..78d3914 100644
--- a/notmuch-reply.c
+++ b/notmuch-reply.c
@@ -305,33 +305,95 @@ add_recipients_from_message (GMimeMessage *reply,
static const char *
guess_from_received_header (notmuch_config_t *config, notmuch_message_t *message)
{
- const char *received,*primary;
- char **other;
- char *by,*mta,*ptr,*token;
+ const char *received,*primary,*by;
+ char **other,*tohdr;
+ char *mta,*ptr,*token;
char *domain=NULL;
char *tld=NULL;
const char *delim=". \t";
size_t i,other_len;
- received = notmuch_message_get_header (message, "received");
- by = strstr (received, " by ");
- if (by && *(by+4)) {
- /* sadly, the format of Received: headers is a bit inconsistent,
- * depending on the MTA used. So we try to extract just the MTA
- * here by removing leading whitespace and assuming that the MTA
- * name ends at the next whitespace
- * we test for *(by+4) to be non-'\0' to make sure there's something
- * there at all - and then assume that the first whitespace delimited
- * token that follows is the last receiving server
+ const char *to_headers[] = {"Envelope-to", "X-Original-To"};
+
+ primary = notmuch_config_get_user_primary_email (config);
+ other = notmuch_config_get_user_other_email (config, &other_len);
+
+ /* sadly, there is no standard way to find out to which email
+ * address a mail was delivered - what is in the headers depends
+ * on the MTAs used along the way. So we are trying a number of
+ * heuristics which hopefully will answer this question.
+
+ * We only get here if none of the user's email addresses are in
+ * the To: or Cc: header. From here we try the following in order:
+ * 1) check for an Envelope-to: header
+ * 2) check for an X-Original-To: header
+ * 3) check for a (for <email@add.res>) clause in Received: headers
+ * 4) check for the domain part of known email addresses in the
+ * 'by' part of Received headers
+ * If none of these work, we give up and return NULL
+ */
+ for (i = 0; i < sizeof(to_headers)/sizeof(*to_headers); i++) {
+ tohdr = xstrdup(notmuch_message_get_header (message, to_headers[i]));
+ if (tohdr && *tohdr) {
+ /* tohdr is potentially a list of email addresses, so here we
+ * check if one of the email addresses is a substring of tohdr
+ */
+ if (strcasestr(tohdr, primary)) {
+ free(tohdr);
+ return primary;
+ }
+ for (size_t j = 0; j < other_len; j++)
+ if (strcasestr (tohdr, other[j])) {
+ free(tohdr);
+ return other[j];
+ }
+ free(tohdr);
+ }
+ }
+
+ /* We get the concatenated Received: headers and search from the
+ * front (last Received: header added) and try to extract from
+ * them indications to which email address this message was
+ * delivered.
+ */
+ received = notmuch_message_get_concat_header (message, "received");
+ /* First we look for a " for <email@add.res>" in the received
+ * header
+ */
+ ptr = strstr (received, " for ");
+ if (ptr) {
+ /* the text following is potentially a list of email addresses,
+ * so again we check if one of the email addresses is a
+ * substring of ptr
*/
- mta = strdup (by+4);
- if (mta == NULL)
- return NULL;
+ if (strcasestr(ptr, primary)) {
+ return primary;
+ }
+ for (i = 0; i < other_len; i++)
+ if (strcasestr (ptr, other[i])) {
+ return other[i];
+ }
+ }
+ /* Finally, we parse all the " by MTA ..." headers to guess the
+ * email address that this was originally delivered to.
+ * We extract just the MTA here by removing leading whitespace and
+ * assuming that the MTA name ends at the next whitespace.
+ * We test for *(by+4) to be non-'\0' to make sure there's
+ * something there at all - and then assume that the first
+ * whitespace delimited token that follows is the receiving
+ * system in this step of the receive chain
+ */
+ by = received;
+ while((by = strstr (by, " by ")) != NULL) {
+ by += 4;
+ if (*by == '\0')
+ break;
+ mta = xstrdup (by);
token = strtok(mta," \t");
if (token == NULL)
- return NULL;
+ break;
/* Now extract the last two components of the MTA host name
- * as domain and tld
+ * as domain and tld.
*/
while ((ptr = strsep (&token, delim)) != NULL) {
if (*ptr == '\0')
@@ -341,23 +403,24 @@ guess_from_received_header (notmuch_config_t *config, notmuch_message_t *message
}
if (domain) {
- /* recombine domain and tld and look for it among the configured
- * email addresses
+ /* Recombine domain and tld and look for it among the configured
+ * email addresses.
+ * This time we have a known domain name and nothing else - so
+ * the test is the other way around: we check if this is a
+ * substring of one of the email addresses.
*/
*(tld-1) = '.';
- primary = notmuch_config_get_user_primary_email (config);
- if (strcasestr (primary, domain)) {
- free (mta);
- return primary;
+
+ if (strcasestr(primary, domain)) {
+ free(mta);
+ return primary;
+ }
+ for (i = 0; i < other_len; i++)
+ if (strcasestr (other[i],domain)) {
+ free(mta);
+ return other[i];
}
- other = notmuch_config_get_user_other_email (config, &other_len);
- for (i = 0; i < other_len; i++)
- if (strcasestr (other[i], domain)) {
- free (mta);
- return other[i];
- }
}
-
free (mta);
}
--
1.6.6.1
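The " by " step of the heuristic can be sketched in isolation as below. This is a hedged illustration and not the patch's implementation: next_by_domain and last_two_labels are hypothetical names, the rule that a host name ends at whitespace or ';' is an assumption, and the sketch works on lengths instead of strtok/strsep so that no copy of the header has to be freed on every exit path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return a pointer to the last two dot-separated labels of
 * host[0..len), e.g. "example.org" within "mx1.example.org".
 * Returns NULL if the host contains no dot at all. */
static const char *
last_two_labels (const char *host, size_t len)
{
    size_t i = len, dots = 0;

    while (i > 0) {
        if (host[i - 1] == '.' && ++dots == 2)
            return host + i;
        i--;
    }
    return dots == 1 ? host : NULL;
}

/* Find the next " by <host>" clause at or after *cursor. On success,
 * copy the last two labels of <host> into 'out', advance *cursor past
 * the host and return 1. Return 0 when no usable clause remains. */
int
next_by_domain (const char **cursor, char *out, size_t out_size)
{
    const char *by, *end, *dom;
    size_t len;

    assert (cursor != NULL && *cursor != NULL && out != NULL && out_size > 0);
    while ((by = strstr (*cursor, " by ")) != NULL) {
        by += 4;
        by += strspn (by, " \t");   /* skip any extra whitespace */
        len = strcspn (by, " \t;"); /* assumed end of the host name */
        end = by + len;
        *cursor = end;              /* always moves forward */
        dom = last_two_labels (by, len);
        if (dom != NULL && (size_t) (end - dom) < out_size) {
            memcpy (out, dom, (size_t) (end - dom));
            out[end - dom] = '\0';
            return 1;
        }
    }
    return 0;
}
```

A caller would loop over next_by_domain and test each returned domain as a substring of the configured addresses, using a loop index of its own for that inner search.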
* Re: improve from-header guessing
2010-04-16 20:51 improve from-header guessing Dirk Hohndel
2010-04-16 20:51 ` [PATCH 1/2] Add interface to obtain the concatenation of all instances of a specified header Dirk Hohndel
@ 2010-04-23 18:47 ` Carl Worth
2010-04-24 19:09 ` Dirk Hohndel
1 sibling, 1 reply; 5+ messages in thread
From: Carl Worth @ 2010-04-23 18:47 UTC (permalink / raw)
To: Dirk Hohndel, notmuch
On Fri, 16 Apr 2010 13:51:40 -0700, Dirk Hohndel <hohndel@infradead.org> wrote:
> The following two patches should address most of the concerns raised
> to my previous series.
Allow me to raise new concerns then. ;-)
> The first patch simply adds an interface to obtain a concatenation of
> all instances of a specific header from an email.
I was hoping to see the "special-case value of NULL" go away with this
change.
And I like that there's a new function to get the concatenated header,
(I would prefer the unabbreviated name get_concatenated_header over
get_concat_header), but I don't like seeing all the existing callers of
get_header updated to pass an extra 0. Instead, I'd prefer to see those
calls unchanged, and a tiny new get_header that passes the 0 and then
make the actual implementing function be static and named something like
notmuch_message_file_get_header_internal.
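The layering suggested here might look roughly like the sketch below. It is a toy model under stated assumptions: the toy_* names and the flat header array are invented for illustration, and unlike notmuch the returned strings are malloc'd and owned by the caller.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    const char *name;
    const char *value;
} toy_header_t;

typedef struct {
    const toy_header_t *headers;
    size_t count;
} toy_message_file_t;

/* Case-insensitive comparison of two header names. */
static int
names_equal (const char *a, const char *b)
{
    while (*a && *b &&
           tolower ((unsigned char) *a) == tolower ((unsigned char) *b)) {
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* The single implementation; the 'concat' flag never leaves this file. */
static char *
toy_get_header_internal (const toy_message_file_t *msg, const char *name,
                         int concat)
{
    char *result = NULL;
    size_t i, used = 0;

    for (i = 0; i < msg->count; i++) {
        size_t len;
        char *grown;

        if (! names_equal (msg->headers[i].name, name))
            continue;
        len = strlen (msg->headers[i].value);
        grown = realloc (result, used + len + 2);
        if (grown == NULL) {
            free (result);
            return NULL;
        }
        result = grown;
        if (used > 0)
            result[used++] = ' ';
        memcpy (result + used, msg->headers[i].value, len + 1);
        used += len;
        if (! concat)
            break; /* first instance only */
    }
    /* Absent header: empty string. NULL is reserved for errors. */
    return result ? result : calloc (1, 1);
}

/* Existing callers keep this two-argument signature unchanged. */
char *
toy_get_header (const toy_message_file_t *msg, const char *name)
{
    return toy_get_header_internal (msg, name, 0);
}

/* The new entry point for all instances of a header. */
char *
toy_get_concatenated_header (const toy_message_file_t *msg, const char *name)
{
    return toy_get_header_internal (msg, name, 1);
}
```

With this shape, none of the existing get_header call sites need to change, and the flag is only visible inside the implementing file.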
Both patches have some trailing whitespace. I see these easily since I
have the following in my ~/.gitconfig:
[core]
whitespace = trailing-space,space-before-tab
I'm sure there's a way to make git refuse to let you commit changes with
trailing whitespace, but I don't know offhand what it is.
Finally, I'd like to see some tests for this feature. (But we do have
the feature already without tests, so I won't strictly block on that).
If you can fix up any of the above before I make another pass through my
queue, that would be great.
Thanks,
-Carl
* Re: improve from-header guessing
2010-04-23 18:47 ` improve from-header guessing Carl Worth
@ 2010-04-24 19:09 ` Dirk Hohndel
0 siblings, 0 replies; 5+ messages in thread
From: Dirk Hohndel @ 2010-04-24 19:09 UTC (permalink / raw)
To: Carl Worth, notmuch
On Fri, 23 Apr 2010 11:47:04 -0700, Carl Worth <cworth@cworth.org> wrote:
> On Fri, 16 Apr 2010 13:51:40 -0700, Dirk Hohndel <hohndel@infradead.org> wrote:
> > The following two patches should address most of the concerns raised
> > to my previous series.
>
> Allow me to raise new concerns then. ;-)
Any time
> > The first patch simply adds an interface to obtain a concatenation of
> > all instances of a specific header from an email.
>
> I was hoping to see the "special-case value of NULL" go away with this
> change.
>
> And I like that there's a new function to get the concatenated header,
> (I would prefer the unabbreviated name get_concatenated_header over
> get_concat_header), but I don't like seeing all the existing callers of
> get_header updated to pass an extra 0. Instead, I'd prefer to see those
> calls unchanged, and a tiny new get_header that passes the 0 and then
> make the actual implementing function be static and named something like
> notmuch_message_file_get_header_internal.
Turns out that the way I did this was broken anyway. So we can simply
forget these patches and your concerns. I'm sure you'll raise new
concerns on the new ("rearchitected") patches.
> Both patches have some trailing whitespace. I see these easily since I
> have the following in my ~/.gitconfig:
>
> [core]
> whitespace = trailing-space,space-before-tab
I know. I'm trying to be better about checking whitespace pollution
before submitting things.
> Finally, I'd like to see some tests for this feature. (But we do have
> the feature already without tests, so I won't strictly block on that).
Huh? You even committed these already. Or am I reading email out of order
again?
/D