* improve from-header guessing @ 2010-04-16 20:51 Dirk Hohndel 2010-04-16 20:51 ` [PATCH 1/2] Add interface to obtain the concatenation of all instances of a specified header Dirk Hohndel 2010-04-23 18:47 ` improve from-header guessing Carl Worth 0 siblings, 2 replies; 5+ messages in thread From: Dirk Hohndel @ 2010-04-16 20:51 UTC (permalink / raw) To: notmuch The following two patches should address most of the concerns raised to my previous series. The first patch simply adds an interface to obtain a concatenation of all instances of a specific header from an email. The second patch uses that in order to get the full Received: headers. It now looks at Envelope-to: and X-Original-To: headers, then at the concatenated Received headers for either a "for email@add.res" clause that matches a configured address or for a " by " clause that matches the domain of a configured address. What is still missing is the check if the host from which the mail was received in this last case had a routable IP address. ^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH 1/2] Add interface to obtain the concatenation of all instances of a specified header 2010-04-16 20:51 improve from-header guessing Dirk Hohndel @ 2010-04-16 20:51 ` Dirk Hohndel 2010-04-16 20:51 ` [PATCH 2/2] Improve heuristic for guessing best from address in replies Dirk Hohndel 2010-04-23 18:47 ` improve from-header guessing Carl Worth 1 sibling, 1 reply; 5+ messages in thread From: Dirk Hohndel @ 2010-04-16 20:51 UTC (permalink / raw) To: notmuch notmuch_message_get_header only returns the first instance of the specified header in a message. notmuch_message_get_concat_header concatenates the values from ALL instances of that header in a message. This is useful for example to get the full delivery path as captured in all of the Received: headers. Signed-off-by: Dirk Hohndel <hohndel@infradead.org> --- lib/database.cc | 14 +++++++------- lib/message-file.c | 49 +++++++++++++++++++++++++++++++++++-------------- lib/message.cc | 12 +++++++++++- lib/notmuch-private.h | 2 +- lib/notmuch.h | 16 ++++++++++++++++ 5 files changed, 70 insertions(+), 23 deletions(-) diff --git a/lib/database.cc b/lib/database.cc index 6842faf..d706263 100644 --- a/lib/database.cc +++ b/lib/database.cc @@ -1289,11 +1289,11 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch, parents = g_hash_table_new_full (g_str_hash, g_str_equal, _my_talloc_free_for_g_hash, NULL); - refs = notmuch_message_file_get_header (message_file, "references"); + refs = notmuch_message_file_get_header (message_file, "references", 0); parse_references (message, notmuch_message_get_message_id (message), parents, refs); - in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to"); + in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to", 0); parse_references (message, notmuch_message_get_message_id (message), parents, in_reply_to); @@ -1506,9 +1506,9 @@ notmuch_database_add_message (notmuch_database_t *notmuch, * let's make sure that what we're looking at looks like an * actual email message. */ - from = notmuch_message_file_get_header (message_file, "from"); - subject = notmuch_message_file_get_header (message_file, "subject"); - to = notmuch_message_file_get_header (message_file, "to"); + from = notmuch_message_file_get_header (message_file, "from", 0); + subject = notmuch_message_file_get_header (message_file, "subject", 0); + to = notmuch_message_file_get_header (message_file, "to", 0); if ((from == NULL || *from == '\0') && (subject == NULL || *subject == '\0') && @@ -1521,7 +1521,7 @@ notmuch_database_add_message (notmuch_database_t *notmuch, /* Now that we're sure it's mail, the first order of business * is to find a message ID (or else create one ourselves). */ - header = notmuch_message_file_get_header (message_file, "message-id"); + header = notmuch_message_file_get_header (message_file, "message-id", 0); if (header && *header != '\0') { message_id = _parse_message_id (message_file, header, NULL); @@ -1580,7 +1580,7 @@ notmuch_database_add_message (notmuch_database_t *notmuch, if (ret) goto DONE; - date = notmuch_message_file_get_header (message_file, "date"); + date = notmuch_message_file_get_header (message_file, "date", 0); _notmuch_message_set_date (message, date); _notmuch_message_index_file (message, filename); diff --git a/lib/message-file.c b/lib/message-file.c index 0c152a3..a01adbb 100644 --- a/lib/message-file.c +++ b/lib/message-file.c @@ -209,15 +209,21 @@ copy_header_unfolding (header_value_closure_t *value, /* As a special-case, a value of NULL for header_desired will force * the entire header to be parsed if it is not parsed already. This is - * used by the _notmuch_message_file_get_headers_end function. */ + * used by the _notmuch_message_file_get_headers_end function. + * If concat is 'true' then it parses the whole message and + * concatenates all instances of the header in question. This is + * currently used to get a complete Received: header when analyzing + * the path the mail has taken from sender to recipient. + */ const char * notmuch_message_file_get_header (notmuch_message_file_t *message, - const char *header_desired) + const char *header_desired, + int concat) { int contains; - char *header, *decoded_value; + char *header, *decoded_value, *header_sofar, *combined_header; const char *s, *colon; - int match; + int match, newhdr, hdrsofar; static int initialized = 0; if (! initialized) { @@ -227,7 +233,7 @@ notmuch_message_file_get_header (notmuch_message_file_t *message, message->parsing_started = 1; - if (header_desired == NULL) + if (concat || header_desired == NULL) contains = 0; else contains = g_hash_table_lookup_extended (message->headers, @@ -237,6 +243,9 @@ notmuch_message_file_get_header (notmuch_message_file_t *message, if (contains && decoded_value) return decoded_value; + if (concat) + message->parsing_finished = 0; + if (message->parsing_finished) return ""; @@ -312,20 +321,32 @@ notmuch_message_file_get_header (notmuch_message_file_t *message, NEXT_HEADER_LINE (&message->value); - if (header_desired == 0) + if (concat || header_desired == NULL) match = 0; else match = (strcasecmp (header, header_desired) == 0); decoded_value = g_mime_utils_header_decode_text (message->value.str); - if (g_hash_table_lookup (message->headers, header) == NULL) { - /* Only insert if we don't have a value for this header, yet. - * This way we always return the FIRST instance of any header - * we search for - * FIXME: we should be returning ALL instances of a header - * or at least provide a way to iterate over them - */ - g_hash_table_insert (message->headers, header, decoded_value); + header_sofar = (char *)g_hash_table_lookup (message->headers, header); + if (concat) { + if (header_sofar == NULL) { + /* Only insert if we don't have a value for this header, yet. */ + g_hash_table_insert (message->headers, header, decoded_value); + } else { + /* the caller wants them all concatenated */ + newhdr = strlen(decoded_value); + hdrsofar = strlen(header_sofar); + combined_header = xmalloc(hdrsofar + newhdr + 2); + strncpy(combined_header,header_sofar,hdrsofar); + *(combined_header+hdrsofar) = ' '; + strncpy(combined_header+hdrsofar+1,decoded_value,newhdr+1); + g_hash_table_insert (message->headers, header, combined_header); + } + } else { + if (header_sofar == NULL) { + /* Only insert if we don't have a value for this header, yet. */ + g_hash_table_insert (message->headers, header, decoded_value); + } } if (match) return decoded_value; diff --git a/lib/message.cc b/lib/message.cc index 721c9a6..fb8fe95 100644 --- a/lib/message.cc +++ b/lib/message.cc @@ -264,7 +264,17 @@ notmuch_message_get_header (notmuch_message_t *message, const char *header) if (message->message_file == NULL) return NULL; - return notmuch_message_file_get_header (message->message_file, header); + return notmuch_message_file_get_header (message->message_file, header, 0); +} + +const char * +notmuch_message_get_concat_header (notmuch_message_t *message, const char *header) +{ + _notmuch_message_ensure_message_file (message); + if (message->message_file == NULL) + return NULL; + + return notmuch_message_file_get_header (message->message_file, header, 1); } /* Return the message ID from the In-Reply-To header of 'message'. diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h index d52d84d..9f8a10a 100644 --- a/lib/notmuch-private.h +++ b/lib/notmuch-private.h @@ -342,7 +342,7 @@ notmuch_message_file_restrict_headersv (notmuch_message_file_t *message, */ const char * notmuch_message_file_get_header (notmuch_message_file_t *message, - const char *header); + const char *header, int concat); /* messages.c */ diff --git a/lib/notmuch.h b/lib/notmuch.h index a7e66dd..d77eb5c 100644 --- a/lib/notmuch.h +++ b/lib/notmuch.h @@ -787,6 +787,22 @@ notmuch_message_get_date (notmuch_message_t *message); const char * notmuch_message_get_header (notmuch_message_t *message, const char *header); +/* Get the concatenated value of all instances of the specified header + * from 'message'. + * + * The value will be read from the actual message file, not from the + * notmuch database. The header name is case insensitive. + * + * The returned string belongs to the message so should not be + * modified or freed by the caller (nor should it be referenced after + * the message is destroyed). + * + * Returns an empty string ("") if the message does not contain a + * header line matching 'header'. Returns NULL if any error occurs. + */ +const char * +notmuch_message_get_concat_header (notmuch_message_t *message, const char *header); + /* Get the tags for 'message', returning a notmuch_tags_t object which * can be used to iterate over all tags. * -- 1.6.6.1 ^ permalink raw reply related [flat|nested] 5+ messages in thread
* [PATCH 2/2] Improve heuristic for guessing best from address in replies 2010-04-16 20:51 ` [PATCH 1/2] Add interface to obtain the concatenation of all instances of a specified header Dirk Hohndel @ 2010-04-16 20:51 ` Dirk Hohndel 0 siblings, 0 replies; 5+ messages in thread From: Dirk Hohndel @ 2010-04-16 20:51 UTC (permalink / raw) To: notmuch We now look at Envelope-To: and Original-To: headers Then concat all of the Received headers and walk through them to find either a "for email@add.res" clause or a host in a known domain. This should deal with most of the fetchmail and mail hoster induced pain (and failure) of the old heuristic. Signed-off-by: Dirk Hohndel <hohndel@infradead.org> --- notmuch-reply.c | 125 +++++++++++++++++++++++++++++++++++++++++-------------- 1 files changed, 94 insertions(+), 31 deletions(-) diff --git a/notmuch-reply.c b/notmuch-reply.c index 230cacc..78d3914 100644 --- a/notmuch-reply.c +++ b/notmuch-reply.c @@ -305,33 +305,95 @@ add_recipients_from_message (GMimeMessage *reply, static const char * guess_from_received_header (notmuch_config_t *config, notmuch_message_t *message) { - const char *received,*primary; - char **other; - char *by,*mta,*ptr,*token; + const char *received,*primary,*by; + char **other,*tohdr; + char *mta,*ptr,*token; char *domain=NULL; char *tld=NULL; const char *delim=". \t"; size_t i,other_len; - received = notmuch_message_get_header (message, "received"); - by = strstr (received, " by "); - if (by && *(by+4)) { - /* sadly, the format of Received: headers is a bit inconsistent, - * depending on the MTA used. So we try to extract just the MTA - * here by removing leading whitespace and assuming that the MTA - * name ends at the next whitespace - * we test for *(by+4) to be non-'\0' to make sure there's something - * there at all - and then assume that the first whitespace delimited - * token that follows is the last receiving server + const char *to_headers[] = {"Envelope-to", "X-Original-To"}; + + primary = notmuch_config_get_user_primary_email (config); + other = notmuch_config_get_user_other_email (config, &other_len); + + /* sadly, there is no standard way to find out to which email + * address a mail was delivered - what is in the headers depends + * on the MTAs used along the way. So we are trying a number of + * heuristics which hopefully will answer this question. + + * We only got here if none of the users email addresses are in + * the To: or Cc: header. From here we try the following in order: + * 1) check for an Envelope-to: header + * 2) check for an X-Original-To: header + * 3) check for a (for <email@add.res>) clause in Received: headers + * 4) check for the domain part of known email addresses in the + * 'by' part of Received headers + * If none of these work, we give up and return NULL + */ + for (i = 0; i < sizeof(to_headers)/sizeof(*to_headers); i++) { + tohdr = xstrdup(notmuch_message_get_header (message, to_headers[i])); + if (tohdr && *tohdr) { + /* tohdr is potentialy a list of email addresses, so here we + * check if one of the email addresses is a substring of tohdr + */ + if (strcasestr(tohdr, primary)) { + free(tohdr); + return primary; + } + for (i = 0; i < other_len; i++) + if (strcasestr (tohdr, other[i])) { + free(tohdr); + return other[i]; + } + free(tohdr); + } + } + + /* We get the concatenated Received: headers and search from the + * front (last Received: header added) and try to extract from + * them indications to which email address this message was + * delivered. + */ + received = notmuch_message_get_concat_header (message, "received"); + /* First we look for a " for <email@add.res>" in the received + * header + */ + ptr = strstr (received, " for "); + if (ptr) { + /* the text following is potentialy a list of email addresses, + * so again we check if one of the email addresses is a + * substring of ptr */ - mta = strdup (by+4); - if (mta == NULL) - return NULL; + if (strcasestr(ptr, primary)) { + return primary; + } + for (i = 0; i < other_len; i++) + if (strcasestr (ptr, other[i])) { + return other[i]; + } + } + /* Finally, we parse all the " by MTA ..." headers to guess the + * email address that this was originally delivered to. + * We extract just the MTA here by removing leading whitespace and + * assuming that the MTA name ends at the next whitespace. + * We test for *(by+4) to be non-'\0' to make sure there's + * something there at all - and then assume that the first + * whitespace delimited token that follows is the receiving + * system in this step of the receive chain + */ + by = received; + while((by = strstr (by, " by ")) != NULL) { + by += 4; + if (*by == '\0') + break; + mta = xstrdup (by); token = strtok(mta," \t"); if (token == NULL) - return NULL; + break; /* Now extract the last two components of the MTA host name - * as domain and tld + * as domain and tld. */ while ((ptr = strsep (&token, delim)) != NULL) { if (*ptr == '\0') @@ -341,23 +403,24 @@ guess_from_received_header (notmuch_config_t *config, notmuch_message_t *message } if (domain) { - /* recombine domain and tld and look for it among the configured - * email addresses + /* Recombine domain and tld and look for it among the configured + * email addresses. + * This time we have a known domain name and nothing else - so + * the test is the other way around: we check if this is a + * substring of one of the email addresses. */ *(tld-1) = '.'; - primary = notmuch_config_get_user_primary_email (config); - if (strcasestr (primary, domain)) { - free (mta); - return primary; + + if (strcasestr(primary, domain)) { + free(mta); + return primary; + } + for (i = 0; i < other_len; i++) + if (strcasestr (other[i],domain)) { + free(mta); + return other[i]; } - other = notmuch_config_get_user_other_email (config, &other_len); - for (i = 0; i < other_len; i++) - if (strcasestr (other[i], domain)) { - free (mta); - return other[i]; - } } - free (mta); } -- 1.6.6.1 ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: improve from-header guessing 2010-04-16 20:51 improve from-header guessing Dirk Hohndel 2010-04-16 20:51 ` [PATCH 1/2] Add interface to obtain the concatenation of all instances of a specified header Dirk Hohndel @ 2010-04-23 18:47 ` Carl Worth 2010-04-24 19:09 ` Dirk Hohndel 1 sibling, 1 reply; 5+ messages in thread From: Carl Worth @ 2010-04-23 18:47 UTC (permalink / raw) To: Dirk Hohndel, notmuch [-- Attachment #1: Type: text/plain, Size: 1493 bytes --] On Fri, 16 Apr 2010 13:51:40 -0700, Dirk Hohndel <hohndel@infradead.org> wrote: > The following two patches should address most of the concerns raised > to my previous series. Allow me to raise new concerns then. ;-) > The first patch simply adds an interface to obtain a concatenation of > all instances of a specific header from an email. I was hoping to see the "special-case value of NULL" go away with this change. And I like that there's a new function to get the concatenated header, (I would prefer an unabbreviated name of get_concatenated_header than get_header_concat), but I don't like seeing all the existing callers of get_header updated to pass an extra 0. Instead, I'd prefer to see those calls unchanged, and a tiny new get_header that passes the 0 and then make the actual implementing function be static and named something like notmuch_message_file_get_header_internal. Both patches have some trailing whitespace. I see these easily wince I have the following in my ~/.gitconfig: [core] whitespace = trailing-space,space-before-tab I'm sure there's a way to make git refuse to let you commit changes with trailing whitespace, but I don't know offhand what it is. Finally, I'd like to see some tests for this feature. (But we do have the feature already without tests, so I won't strictly block on that). If you can fix up any of the above before I make another pass through ym queue, that would be great. Thanks, -Carl [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: improve from-header guessing 2010-04-23 18:47 ` improve from-header guessing Carl Worth @ 2010-04-24 19:09 ` Dirk Hohndel 0 siblings, 0 replies; 5+ messages in thread From: Dirk Hohndel @ 2010-04-24 19:09 UTC (permalink / raw) To: Carl Worth, notmuch On Fri, 23 Apr 2010 11:47:04 -0700, Carl Worth <cworth@cworth.org> wrote: > On Fri, 16 Apr 2010 13:51:40 -0700, Dirk Hohndel <hohndel@infradead.org> wrote: > > The following two patches should address most of the concerns raised > > to my previous series. > > Allow me to raise new concerns then. ;-) Any time > > The first patch simply adds an interface to obtain a concatenation of > > all instances of a specific header from an email. > > I was hoping to see the "special-case value of NULL" go away with this > change. > > And I like that there's a new function to get the concatenated header, > (I would prefer an unabbreviated name of get_concatenated_header than > get_header_concat), but I don't like seeing all the existing callers of > get_header updated to pass an extra 0. Instead, I'd prefer to see those > calls unchanged, and a tiny new get_header that passes the 0 and then > make the actual implementing function be static and named something like > notmuch_message_file_get_header_internal. Turns out that the way I did this was broken anyway. So we can simply forget these patches and your concerns. I'm sure you'll raise new concerns on the new ("rearchitected") patches. > Both patches have some trailing whitespace. I see these easily wince I > have the following in my ~/.gitconfig: > > [core] > whitespace = trailing-space,space-before-tab I know. I'm trying to be better about checking whitespace pollution before submitting things. > Finally, I'd like to see some tests for this feature. (But we do have > the feature already without tests, so I won't strictly block on that). Hu? You even commited these already. Or am I reading email out of order again? /D ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2010-04-24 19:09 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-04-16 20:51 improve from-header guessing Dirk Hohndel 2010-04-16 20:51 ` [PATCH 1/2] Add interface to obtain the concatenation of all instances of a specified header Dirk Hohndel 2010-04-16 20:51 ` [PATCH 2/2] Improve heuristic for guessing best from address in replies Dirk Hohndel 2010-04-23 18:47 ` improve from-header guessing Carl Worth 2010-04-24 19:09 ` Dirk Hohndel
Code repositories for project(s) associated with this public inbox https://yhetil.org/notmuch.git/ This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).