* Searching through different charsets @ 2012-02-22 17:10 Serge Z 2012-02-24 0:31 ` Michal Sojka 0 siblings, 1 reply; 15+ messages in thread From: Serge Z @ 2012-02-22 17:10 UTC (permalink / raw) To: notmuch Hello! I've got the following problem: fetched emails can be in different encodings, and searching for a term typed in one encoding (the system default) does not match the same term in another encoding. The solution, as I see it, could be to preprocess each incoming email to "normalize" it and its encoding so that the indexer only has to handle emails in the system encoding. Could you please suggest something? Another issue (less important, but still wanted) is searching through HTML messages without matching the HTML tags. This problem looks solvable with a properly configured run-mailcap. Is there such a solution anywhere? Thanks. ^ permalink raw reply [flat|nested] 15+ messages in thread
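To make the mismatch concrete, here is a minimal Python sketch (illustrative only, not notmuch code) of why a byte-level index cannot match terms across charsets, and how normalizing everything to UTF-8 before indexing fixes it:

```python
# The Czech word "tučňáččí" under two encodings.
word = "tučňáččí"

utf8_bytes = word.encode("utf-8")
latin2_bytes = word.encode("iso-8859-2")

# The byte sequences differ, so a naive byte-level index built from an
# ISO-8859-2 body can never match a query term typed in UTF-8.
assert utf8_bytes != latin2_bytes

# Normalizing every body to UTF-8 before indexing removes the mismatch.
normalized = latin2_bytes.decode("iso-8859-2").encode("utf-8")
assert normalized == utf8_bytes
```

The same byte-vs-text distinction is what the patch later in this thread addresses inside the indexer itself.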
* Re: Searching through different charsets 2012-02-22 17:10 Searching through different charsets Serge Z @ 2012-02-24 0:31 ` Michal Sojka 2012-02-24 0:33 ` [PATCH] test: Add test for searching of uncommonly encoded messages Michal Sojka 0 siblings, 1 reply; 15+ messages in thread From: Michal Sojka @ 2012-02-24 0:31 UTC (permalink / raw) To: Serge Z, notmuch On Wed, 22 Feb 2012, Serge Z wrote: > > Hello! > > I've got the following problem: fetched emails can be in different encodings. > And searching a term typed in one encoding (system default) does not match the > same term in another encoding. > > The solution, as I see, can be in preprocessing each incoming email to > "normalize" it and its encoding so that indexer will handle emails in system > encoding only. Could you please suggest something? I can confirm this issue and I'm sending a patch with a test case (marked as broken) for it. I expect the fix to be quite simple because all the encoding/decoding machinery is already implemented in GMime, which notmuch uses when indexing. > > Another issue (not so much wanted but wanted too) is searching through html > messages without matching html tags. I don't know whether somebody is working on this or not. > This problem looks to be solvable by properly configured run-mailcap. Is there > such solution anywhere? I don't think that run-mailcap has anything to do with notmuch. -Michal ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH] test: Add test for searching of uncommonly encoded messages 2012-02-24 0:31 ` Michal Sojka @ 2012-02-24 0:33 ` Michal Sojka 2012-02-24 4:29 ` Serge Z ` (2 more replies) 0 siblings, 3 replies; 15+ messages in thread From: Michal Sojka @ 2012-02-24 0:33 UTC (permalink / raw) To: notmuch Emails encoded as anything other than ASCII or UTF-8 are not indexed properly by notmuch. It is not possible to search for non-ASCII words within those messages. --- test/encoding | 9 +++++++++ test/test-lib.sh | 5 +++++ 2 files changed, 14 insertions(+), 0 deletions(-) diff --git a/test/encoding b/test/encoding index 33259c1..3992b5c 100755 --- a/test/encoding +++ b/test/encoding @@ -21,4 +21,13 @@ irrelevant \fbody} \fmessage}" +test_begin_subtest "Search for ISO-8859-2 encoded message" +test_subtest_known_broken +add_message '[content-type]="text/plain; charset=iso-8859-2"' \ + '[content-transfer-encoding]=8bit' \ + '[subject]="ISO-8859-2 encoded message"' \ + "[body]=$'Czech word tu\350\362\341\350\350\355 means pinguin\'s.'" # ISO-8859-2 characters are generated by shell's escape sequences +output=$(notmuch search tučňáččí 2>&1 | notmuch_show_sanitize) +test_expect_equal "$output" "thread:0000000000000002 2001-01-05 [1/1] Notmuch Test Suite; ISO-8859-2 encoded message (inbox unread)" + test_done diff --git a/test/test-lib.sh b/test/test-lib.sh index 063a2b2..2781506 100644 --- a/test/test-lib.sh +++ b/test/test-lib.sh @@ -356,6 +356,11 @@ ${additional_headers}" ${additional_headers}" fi + if [ ! -z "${template[content-transfer-encoding]}" ]; then + additional_headers="Content-Transfer-Encoding: ${template[content-transfer-encoding]} +${additional_headers}" + fi + # Note that in the way we're setting it above and using it below, # `additional_headers' will also serve as the header / body separator # (empty line in between). -- 1.7.9.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
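For reference (not part of the patch), the shell octal escapes in the test body produce raw ISO-8859-2 bytes for the search term used above; a quick Python check of the mapping:

```python
# Shell escapes in the test body map to ISO-8859-2 as:
# \350 = 0xE8 = č, \362 = 0xF2 = ň, \341 = 0xE1 = á, \355 = 0xED = í
raw = b"tu\350\362\341\350\350\355"

# Decoded under ISO-8859-2, the bytes spell the term the test searches for.
assert raw.decode("iso-8859-2") == "tučňáččí"
```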
* Re: [PATCH] test: Add test for searching of uncommonly encoded messages 2012-02-24 0:33 ` [PATCH] test: Add test for searching of uncommonly encoded messages Michal Sojka @ 2012-02-24 4:29 ` Serge Z 2012-02-24 7:00 ` Michal Sojka 2012-02-24 7:36 ` [PATCH 1/2] Convert non-UTF-8 parts to UTF-8 before indexing them Michal Sojka 2012-02-29 11:55 ` [PATCH] test: Add test for searching of uncommonly encoded messages David Bremner 2 siblings, 1 reply; 15+ messages in thread From: Serge Z @ 2012-02-24 4:29 UTC (permalink / raw) To: notmuch Quoting Michal Sojka (2012-02-24 04:33:15) >Emails that are encoded differently than as ASCII or UTF-8 are not >indexed properly by notmuch. It is not possible to search for non-ASCII >words within those messages. Ok. But we can preprocess each incoming message right after 'getmail' to convert it from HTML to text and to UTF-8 encoding. One solution is to create a separate script for this and make getmail pipe all messages to this script and then on to notmuch. But it would be better if the maildir contained only the original messages, so the question is: can we make the notmuch indexing engine index a preprocessed message while the maildir keeps the original message, as it was obtained? ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] test: Add test for searching of uncommonly encoded messages 2012-02-24 4:29 ` Serge Z @ 2012-02-24 7:00 ` Michal Sojka 2012-02-24 7:57 ` Serge Z 0 siblings, 1 reply; 15+ messages in thread From: Michal Sojka @ 2012-02-24 7:00 UTC (permalink / raw) To: Serge Z, notmuch On Fri, 24 Feb 2012, Serge Z wrote: > > Quoting Michal Sojka (2012-02-24 04:33:15) > >Emails that are encoded differently than as ASCII or UTF-8 are not > >indexed properly by notmuch. It is not possible to search for non-ASCII > >words within those messages. > > Ok. But we can preprocess each incoming message right after 'getmail' to > convert it from html to text and to utf8 encoding. One solution is to create a > seperate script for this and make gmail pipe all messages to this script, and > then to notmuch. But It would be better if maildir contains original messages > only, so the question is: can we make nomuch indexing engine to index > preprocessed message while maildir will contain original message - as it was > obtained? Hi, I'm not a big fan of adding a "preprocessor". First, I think that both issues you mention are actually bugs, and it would be better to fix them for everybody than to require each user to configure some preprocessor. Second, depending on what your preprocessor did and how, the initial mail indexing could be way slower, which is also not something people want. Do you have any other use case for the preprocessor besides the UTF-8 and html->text conversions? Cheers, -Michal ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] test: Add test for searching of uncommonly encoded messages 2012-02-24 7:00 ` Michal Sojka @ 2012-02-24 7:57 ` Serge Z 2012-02-24 8:38 ` Michal Sojka 0 siblings, 1 reply; 15+ messages in thread From: Serge Z @ 2012-02-24 7:57 UTC (permalink / raw) To: notmuch Quoting Michal Sojka (2012-02-24 11:00:02) >On Fri, 24 Feb 2012, Serge Z wrote: >> >> Quoting Michal Sojka (2012-02-24 04:33:15) >> >Emails that are encoded differently than as ASCII or UTF-8 are not >> >indexed properly by notmuch. It is not possible to search for non-ASCII >> >words within those messages. >> >> Ok. But we can preprocess each incoming message right after 'getmail' to >> convert it from html to text and to utf8 encoding. One solution is to create a >> seperate script for this and make gmail pipe all messages to this script, and >> then to notmuch. But It would be better if maildir contains original messages >> only, so the question is: can we make nomuch indexing engine to index >> preprocessed message while maildir will contain original message - as it was >> obtained? > >Hi, > >I'm not big fan of adding "preprocessor". First, I thing that both >reasons you mention are actually bugs and it would be better to fix them >for everybody than requiring each user to configure some preprocessor. >Second, depending on what and how would your preprocessor do, the >initial mail indexing could be a way slower, which is also nothing that >people want. > >Do you have any other use case for the preprocessor besides utf8 and >html->text conversions? > >Cheers, >-Michal Well, I don't want to add any external preprocessor either. This may be considered an architectural decision: the search engine should not access messages directly, but through some preprocessing layer that would handle different encodings in bodies and headers, RFC 2047-encoded headers (if this is not handled yet), etc.
Anyway, imho this solution would best be contained in a separate library that would be useful for notmuch clients as well as other mail indexing engines. Or an existing library should be found. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] test: Add test for searching of uncommonly encoded messages 2012-02-24 7:57 ` Serge Z @ 2012-02-24 8:38 ` Michal Sojka 2012-02-25 8:36 ` Serge Z 0 siblings, 1 reply; 15+ messages in thread From: Michal Sojka @ 2012-02-24 8:38 UTC (permalink / raw) To: Serge Z, notmuch On Fri, 24 Feb 2012, Serge Z wrote: > > Quoting Michal Sojka (2012-02-24 11:00:02) > >I'm not big fan of adding "preprocessor". First, I thing that both > >reasons you mention are actually bugs and it would be better to fix them > >for everybody than requiring each user to configure some preprocessor. > >Second, depending on what and how would your preprocessor do, the > >initial mail indexing could be a way slower, which is also nothing that > >people want. > > > >Do you have any other use case for the preprocessor besides utf8 and > >html->text conversions? > > > >Cheers, > >-Michal > > Well, I don't want to add any external preprocessor too. > > This may be considered as an architectural decision: search engine should not > access messages directly, but through some preprocessing layer which would > handle the case of different encodings in body and headers, RFC2047-encoded > headers (if this is not handled yet) etc. > > Anyway, this solution imho would be nice to be concluded inside a separate > library Yes, this library is called GMime, and notmuch already makes use of it. -Michal ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] test: Add test for searching of uncommonly encoded messages 2012-02-24 8:38 ` Michal Sojka @ 2012-02-25 8:36 ` Serge Z 2012-02-26 9:33 ` Double decoded text/html parts (was: [PATCH] test: Add test for searching of uncommonly encoded messages) Michal Sojka 0 siblings, 1 reply; 15+ messages in thread From: Serge Z @ 2012-02-25 8:36 UTC (permalink / raw) To: notmuch Hi! I've run into another problem: I've got a text/html email with its body encoded in cp1251. The encoding is mentioned in both the Content-Type email header and the HTML <meta> tag. So when the client tries to display it with an external html2text converter, the message is decoded twice: first by the client, then by html2text (I use w3m). As I understand it, notmuch (while indexing this message) decodes it once and indexes it the right way (though it also puts HTML tags into the index). But what if the message contains no "charset" option in the Content-Type email header but does contain a <meta> content-type tag with a charset? Should such a message be considered wrongly composed, or should it be indexed by diving into the HTML details (the <meta> content-type)? ^ permalink raw reply [flat|nested] 15+ messages in thread
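The double decoding described here can be reproduced outside any mailer: the client converts the cp1251 bytes to UTF-8, then the HTML viewer, trusting a <meta> tag that still says cp1251, converts a second time. A small Python sketch (the Russian word is a hypothetical body, not from the thread):

```python
original = "привет"                       # hypothetical cp1251 body text

stored = original.encode("cp1251")        # bytes as stored in the message
shown_by_client = stored.decode("cp1251") # client decodes correctly once

# Client pipes UTF-8 to the viewer; the viewer, misled by the stale
# <meta charset> tag, decodes the UTF-8 bytes *again* as cp1251.
piped = shown_by_client.encode("utf-8")
mojibake = piped.decode("cp1251")

assert mojibake != original
# The damage round-trips cleanly, confirming it is a pure double decode:
assert mojibake.encode("cp1251").decode("utf-8") == original
```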
* Re: Double decoded text/html parts (was: [PATCH] test: Add test for searching of uncommonly encoded messages) 2012-02-25 8:36 ` Serge Z @ 2012-02-26 9:33 ` Michal Sojka 2012-02-26 10:20 ` Serge Z 0 siblings, 1 reply; 15+ messages in thread From: Michal Sojka @ 2012-02-26 9:33 UTC (permalink / raw) To: Serge Z, notmuch On Sat, 25 Feb 2012, Serge Z wrote: > > Hi! > I've struck another problem: > > I've got an html/text email with body encoded with cp1251. > Its encoding is mentioned in both Content-type: email header and html <meta> > tag. So when the client tries to display it with external html2text converter, > The message is decoded twice: first by client, second by html2text (I > use w3m). Right. After my analysis of the problem (see below), it seems there is no trivial solution for this. > As I understand, notmuch (while indexing this message) decodes it once and > index it in the right way (though including html tags to index). But what if > the message contains no "charset" option in Content-Type email header but > contain <meta> content-type tag with charset noted? This should not happen. It violates RFC 2046, section 4.1.2. > Should such message be considered as being composed wrong or it should > be indexed with diving into html details (content-type)? I don't think it's wrongly composed, and it should even be indexed correctly (with my patch). The problem is when you view such a message with an external HTML viewer. In my mailbox I can find two different types of text/html parts. First, parts that contain a complete HTML document, including all headers and especially <meta http-equiv="content-type" content="text/html; ...">. Such parts could be passed to an external HTML viewer without any decoding by notmuch. The second type is a text/html part that does not contain any HTML headers. Passing such a part to an external HTML viewer undecoded would require it to guess the correct charset from the content.
AFAIK Firefox users can set a fallback charset (used for HTML documents with an unknown charset) in the preferences, but I don't know what other browsers would do. In particular, do you know how w3m behaves when the charset is not specified? In any case, if we want notmuch to do the right thing, we should analyze the content of text/html parts and decide whether to decode the part or not. Perhaps a simple heuristic could be to search the content of the part for the strings "charset=" and "encoding="; if either is found, notmuch wouldn't decode that part. Otherwise it would decode it according to the Content-Type header. As a curiosity, I found the following in one of my emails. Note that two different encodings (iso-8859-2 and windows-1250) are specified at the same time :) That's the reason why I think that fixing the problem won't be trivial. Content-Type: text/html; charset="iso-8859-2" Content-Transfer-Encoding: 8bit <?xml version="1.0" encoding="windows-1250" ?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2" /> Cheers, -Michal ^ permalink raw reply [flat|nested] 15+ messages in thread
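The "charset="/"encoding=" heuristic proposed in this message could look roughly like the following Python sketch (illustrative only, not notmuch code; the function name and the 2 KiB scan window are assumptions, not anything from the thread):

```python
def part_declares_charset(html_bytes):
    """Heuristic: if the text/html part itself names a charset (in a
    <meta> tag or an XML declaration), pass it to the external viewer
    undecoded; otherwise decode it per the Content-Type header."""
    head = html_bytes[:2048].lower()   # declarations live near the top
    return b"charset=" in head or b"encoding=" in head

with_meta = (b'<html><head><meta http-equiv="Content-Type" '
             b'content="text/html; charset=iso-8859-2" /></head>')
bare = b"<p>no headers here</p>"

assert part_declares_charset(with_meta)
assert not part_declares_charset(bare)
```

As the iso-8859-2/windows-1250 example above shows, a part can declare two conflicting charsets at once, so a heuristic like this can only decide *whether* to decode, not which declaration to trust.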
* Re: Double decoded text/html parts (was: [PATCH] test: Add test for searching of uncommonly encoded messages) 2012-02-26 9:33 ` Double decoded text/html parts (was: [PATCH] test: Add test for searching of uncommonly encoded messages) Michal Sojka @ 2012-02-26 10:20 ` Serge Z 0 siblings, 0 replies; 15+ messages in thread From: Serge Z @ 2012-02-26 10:20 UTC (permalink / raw) To: notmuch This works: w3m -o document_charset=windows-1251 test.html It tells w3m to assume windows-1251 encoding if no HTML <meta> content-type tag is given. ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 1/2] Convert non-UTF-8 parts to UTF-8 before indexing them 2012-02-24 0:33 ` [PATCH] test: Add test for searching of uncommonly encoded messages Michal Sojka 2012-02-24 4:29 ` Serge Z @ 2012-02-24 7:36 ` Michal Sojka 2012-02-24 7:36 ` [PATCH 2/2] test: Remove 'broken' flag from encoding test Michal Sojka ` (2 more replies) 2012-02-29 11:55 ` [PATCH] test: Add test for searching of uncommonly encoded messages David Bremner 2 siblings, 3 replies; 15+ messages in thread From: Michal Sojka @ 2012-02-24 7:36 UTC (permalink / raw) To: notmuch This fixes a bug that made it impossible to search for non-ASCII words in such parts. The code here was copied from show_text_part_content(), because the show command already does the needed conversion when showing the message. --- lib/index.cc | 15 +++++++++++++++ 1 files changed, 15 insertions(+), 0 deletions(-) diff --git a/lib/index.cc b/lib/index.cc index d8f8b2b..e377732 100644 --- a/lib/index.cc +++ b/lib/index.cc @@ -315,6 +315,7 @@ _index_mime_part (notmuch_message_t *message, GByteArray *byte_array; GMimeContentDisposition *disposition; char *body; + const char *charset; if (! part) { fprintf (stderr, "Warning: Not indexing empty mime part.\n"); @@ -390,6 +391,20 @@ _index_mime_part (notmuch_message_t *message, g_mime_stream_filter_add (GMIME_STREAM_FILTER (filter), discard_uuencode_filter); + charset = g_mime_object_get_content_type_parameter (part, "charset"); + if (charset) { + GMimeFilter *charset_filter; + charset_filter = g_mime_filter_charset_new (charset, "UTF-8"); + /* This result can be NULL for things like "unknown-8bit". + * Don't set a NULL filter as that makes GMime print + * annoying assertion-failure messages on stderr. 
*/ + if (charset_filter) { + g_mime_stream_filter_add (GMIME_STREAM_FILTER (filter), + charset_filter); + g_object_unref (charset_filter); + } + } + wrapper = g_mime_part_get_content_object (GMIME_PART (part)); if (wrapper) g_mime_data_wrapper_write_to_stream (wrapper, filter); -- 1.7.9.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
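The behaviour of the patch above — convert to UTF-8 when a charset parameter is present, and silently pass bytes through for charsets the library cannot map (such as "unknown-8bit", where g_mime_filter_charset_new() returns NULL) — can be mimicked in a few lines of Python (a sketch of the logic only, not the GMime code path; the function name is invented for illustration):

```python
import codecs

def body_to_utf8(body, charset):
    """Decode `body` from `charset` and re-encode as UTF-8 for indexing;
    leave the bytes untouched when no charset is given or when it is
    unknown (the analogue of GMime returning a NULL filter)."""
    if not charset:
        return body
    try:
        codecs.lookup(charset)
    except LookupError:            # e.g. "unknown-8bit"
        return body
    return body.decode(charset, errors="replace").encode("utf-8")

iso = b"tu\350\362\341\350\350\355"   # ISO-8859-2 bytes for "tučňáččí"
assert body_to_utf8(iso, "iso-8859-2") == "tučňáččí".encode("utf-8")
assert body_to_utf8(iso, "unknown-8bit") == iso   # passed through as-is
assert body_to_utf8(iso, None) == iso
```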
* [PATCH 2/2] test: Remove 'broken' flag from encoding test 2012-02-24 7:36 ` [PATCH 1/2] Convert non-UTF-8 parts to UTF-8 before indexing them Michal Sojka @ 2012-02-24 7:36 ` Michal Sojka 2012-02-25 4:33 ` [PATCH 1/2] Convert non-UTF-8 parts to UTF-8 before indexing them Austin Clements 2012-02-29 11:55 ` David Bremner 2 siblings, 0 replies; 15+ messages in thread From: Michal Sojka @ 2012-02-24 7:36 UTC (permalink / raw) To: notmuch --- test/encoding | 1 - 1 files changed, 0 insertions(+), 1 deletions(-) diff --git a/test/encoding b/test/encoding index 3992b5c..f0d073c 100755 --- a/test/encoding +++ b/test/encoding @@ -22,7 +22,6 @@ irrelevant \fmessage}" test_begin_subtest "Search for ISO-8859-2 encoded message" -test_subtest_known_broken add_message '[content-type]="text/plain; charset=iso-8859-2"' \ '[content-transfer-encoding]=8bit' \ '[subject]="ISO-8859-2 encoded message"' \ -- 1.7.9.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] Convert non-UTF-8 parts to UTF-8 before indexing them 2012-02-24 7:36 ` [PATCH 1/2] Convert non-UTF-8 parts to UTF-8 before indexing them Michal Sojka 2012-02-24 7:36 ` [PATCH 2/2] test: Remove 'broken' flag from encoding test Michal Sojka @ 2012-02-25 4:33 ` Austin Clements 2012-02-29 11:55 ` David Bremner 2 siblings, 0 replies; 15+ messages in thread From: Austin Clements @ 2012-02-25 4:33 UTC (permalink / raw) To: Michal Sojka; +Cc: notmuch LGTM. I'm assuming this interacts with the uuencoding filter in the right order (I don't see how any other order could be correct), but don't actually know. Quoth Michal Sojka on Feb 24 at 8:36 am: > This fixes a bug that didn't allow to search for non-ASCII words such > parts. The code here was copied from show_text_part_content(), because > the show command already does the needed conversion when showing the > message. > --- > lib/index.cc | 15 +++++++++++++++ > 1 files changed, 15 insertions(+), 0 deletions(-) > > diff --git a/lib/index.cc b/lib/index.cc > index d8f8b2b..e377732 100644 > --- a/lib/index.cc > +++ b/lib/index.cc > @@ -315,6 +315,7 @@ _index_mime_part (notmuch_message_t *message, > GByteArray *byte_array; > GMimeContentDisposition *disposition; > char *body; > + const char *charset; > > if (! part) { > fprintf (stderr, "Warning: Not indexing empty mime part.\n"); > @@ -390,6 +391,20 @@ _index_mime_part (notmuch_message_t *message, > g_mime_stream_filter_add (GMIME_STREAM_FILTER (filter), > discard_uuencode_filter); > > + charset = g_mime_object_get_content_type_parameter (part, "charset"); > + if (charset) { > + GMimeFilter *charset_filter; > + charset_filter = g_mime_filter_charset_new (charset, "UTF-8"); > + /* This result can be NULL for things like "unknown-8bit". > + * Don't set a NULL filter as that makes GMime print > + * annoying assertion-failure messages on stderr. 
*/ > + if (charset_filter) { > + g_mime_stream_filter_add (GMIME_STREAM_FILTER (filter), > + charset_filter); > + g_object_unref (charset_filter); > + } > + } > + > wrapper = g_mime_part_get_content_object (GMIME_PART (part)); > if (wrapper) > g_mime_data_wrapper_write_to_stream (wrapper, filter); ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] Convert non-UTF-8 parts to UTF-8 before indexing them 2012-02-24 7:36 ` [PATCH 1/2] Convert non-UTF-8 parts to UTF-8 before indexing them Michal Sojka 2012-02-24 7:36 ` [PATCH 2/2] test: Remove 'broken' flag from encoding test Michal Sojka 2012-02-25 4:33 ` [PATCH 1/2] Convert non-UTF-8 parts to UTF-8 before indexing them Austin Clements @ 2012-02-29 11:55 ` David Bremner 2 siblings, 0 replies; 15+ messages in thread From: David Bremner @ 2012-02-29 11:55 UTC (permalink / raw) To: Michal Sojka, notmuch On Fri, 24 Feb 2012 08:36:22 +0100, Michal Sojka <sojkam1@fel.cvut.cz> wrote: > This fixes a bug that didn't allow to search for non-ASCII words such > parts. The code here was copied from show_text_part_content(), because > the show command already does the needed conversion when showing the > message. pushed both, d ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] test: Add test for searching of uncommonly encoded messages 2012-02-24 0:33 ` [PATCH] test: Add test for searching of uncommonly encoded messages Michal Sojka 2012-02-24 4:29 ` Serge Z 2012-02-24 7:36 ` [PATCH 1/2] Convert non-UTF-8 parts to UTF-8 before indexing them Michal Sojka @ 2012-02-29 11:55 ` David Bremner 2 siblings, 0 replies; 15+ messages in thread From: David Bremner @ 2012-02-29 11:55 UTC (permalink / raw) To: Michal Sojka, notmuch On Fri, 24 Feb 2012 01:33:15 +0100, Michal Sojka <sojkam1@fel.cvut.cz> wrote: > Emails that are encoded differently than as ASCII or UTF-8 are not > indexed properly by notmuch. It is not possible to search for non-ASCII > words within those messages. pushed d ^ permalink raw reply [flat|nested] 15+ messages in thread