From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from localhost (localhost [127.0.0.1])
 by arlo.cworth.org (Postfix) with ESMTP id 6C8AE6DE021C
 for ; Mon, 30 Jul 2018 00:47:56 -0700 (PDT)
X-Virus-Scanned: Debian amavisd-new at cworth.org
X-Spam-Flag: NO
X-Spam-Score: -0.006
X-Spam-Level:
X-Spam-Status: No, score=-0.006 tagged_above=-999 required=5
 tests=[AWL=0.005, SPF_PASS=-0.001, T_RP_MATCHES_RCVD=-0.01]
 autolearn=disabled
Received: from arlo.cworth.org ([127.0.0.1])
 by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id TMcX9dXZNmJg for ; Mon, 30 Jul 2018 00:47:55 -0700 (PDT)
Received: from smtp.eurecom.fr (smtp.eurecom.fr [193.55.113.210])
 by arlo.cworth.org (Postfix) with ESMTP id 79AA16DE01F7
 for ; Mon, 30 Jul 2018 00:47:55 -0700 (PDT)
X-IronPort-AV: E=Sophos;i="5.51,422,1526335200"; d="scan'208";a="7974083"
Received: from waha.eurecom.fr (HELO smtps.eurecom.fr) ([10.3.2.236])
 by drago1i.eurecom.fr with ESMTP; 30 Jul 2018 09:47:54 +0200
Received: from archibald (unknown [193.55.114.4])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by smtps.eurecom.fr (Postfix) with ESMTPSA id 824BD4E9;
 Mon, 30 Jul 2018 09:47:54 +0200 (CEST)
From: Sebastian Poeplau
To: Jeffrey Stedfast, "notmuch@notmuchmail.org"
Subject: Re: Handling mislabeled emails encoded with Windows-1252
In-Reply-To: <87wotdxjuu.fsf@eurecom.fr>
References: <87lgaeat37.fsf@eurecom.fr> <8736w91jz0.fsf@tethera.net>
 <87effszpg7.fsf@eurecom.fr> <87zhyby589.fsf@eurecom.fr>
 <9C0F603A-6125-4CF0-8AE7-E02301355906@microsoft.com>
 <87wotdxjuu.fsf@eurecom.fr>
Date: Mon, 30 Jul 2018 09:47:55 +0200
Message-ID: <87tvohxiz8.fsf@eurecom.fr>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
X-BeenThere: notmuch@notmuchmail.org
X-Mailman-Version: 2.1.26
Precedence: list
List-Id: "Use and development of the notmuch mail system."
X-List-Received-Date: Mon, 30 Jul 2018 07:47:56 -0000

--=-=-=
Content-Type: text/plain

Hi,

>> As an added optimization, you could try limiting that block of code to
>> just when the charset is one of the iso-8859-* charsets.
>>
>> The following code snippet should help with that:
>>
>> charset = charset ? g_mime_charset_canon_name (charset) : NULL;
>> if (wrapper && charset && g_ascii_strncasecmp (charset, "iso-8859-", 9)) {
>> ...
>>
>> The reason you need to use g_mime_charset_canon_name (if you decide to
>> add the optimization) is that mail software does not always use the
>> canonical form of the various charset names that they use. Often you
>> will get stuff like "latin1" or "iso_8859-1".
>
> Nice, I'll add it.

Updated patch attached.

Cheers,
Sebastian

--=-=-=
Content-Type: text/x-patch
Content-Disposition: inline; filename=fix_windows_charsets.patch

diff -ura notmuch-0.27/notmuch-show.c notmuch-0.27-patched/notmuch-show.c
--- notmuch-0.27/notmuch-show.c 2018-06-13 03:42:34.000000000 +0200
+++ notmuch-0.27-patched/notmuch-show.c 2018-07-30 09:41:05.491636418 +0200
@@ -272,6 +272,7 @@
     GMimeContentType *content_type = g_mime_object_get_content_type (GMIME_OBJECT (part));
     GMimeStream *stream_filter = NULL;
     GMimeFilter *crlf_filter = NULL;
+    GMimeFilter *windows_filter = NULL;
     GMimeDataWrapper *wrapper;
     const char *charset;
 
@@ -282,13 +283,37 @@
     if (stream_out == NULL)
         return;
 
+    charset = g_mime_object_get_content_type_parameter (part, "charset");
+    charset = charset ? g_mime_charset_canon_name (charset) : NULL;
+    wrapper = g_mime_part_get_content_object (GMIME_PART (part));
+    if (wrapper && charset && !g_ascii_strncasecmp (charset, "iso-8859-", 9)) {
+        GMimeStream *null_stream = NULL;
+        GMimeStream *null_stream_filter = NULL;
+
+        /* Check for mislabeled Windows encoding */
+        null_stream = g_mime_stream_null_new ();
+        null_stream_filter = g_mime_stream_filter_new (null_stream);
+        windows_filter = g_mime_filter_windows_new (charset);
+        g_mime_stream_filter_add(GMIME_STREAM_FILTER (null_stream_filter),
+                                 windows_filter);
+        g_mime_data_wrapper_write_to_stream (wrapper, null_stream_filter);
+        charset = g_mime_filter_windows_real_charset(
+            (GMimeFilterWindows *) windows_filter);
+
+        if (null_stream_filter)
+            g_object_unref (null_stream_filter);
+        if (null_stream)
+            g_object_unref (null_stream);
+        /* Keep a reference to windows_filter in order to prevent the
+         * charset string from deallocation. */
+    }
+
     stream_filter = g_mime_stream_filter_new (stream_out);
     crlf_filter = g_mime_filter_crlf_new (false, false);
     g_mime_stream_filter_add(GMIME_STREAM_FILTER (stream_filter),
                              crlf_filter);
     g_object_unref (crlf_filter);
 
-    charset = g_mime_object_get_content_type_parameter (part, "charset");
     if (charset) {
         GMimeFilter *charset_filter;
         charset_filter = g_mime_filter_charset_new (charset, "UTF-8");
@@ -313,11 +338,12 @@
         }
     }
 
-    wrapper = g_mime_part_get_content_object (GMIME_PART (part));
     if (wrapper && stream_filter)
         g_mime_data_wrapper_write_to_stream (wrapper, stream_filter);
     if (stream_filter)
         g_object_unref(stream_filter);
+    if (windows_filter)
+        g_object_unref (windows_filter);
 }
 
 static const char*
--=-=-=--
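
The prefix test in the patch relies on the comparison returning 0 on a match, which is why the patched condition negates it while the quoted snippet does not. Below is a rough standalone sketch of that check, assuming plain C without GLib so it builds on its own. The names ascii_strncasecmp and is_iso_8859 are hypothetical stand-ins made up for illustration; they are not GLib or GMime API, and the real code would call g_ascii_strncasecmp on the name returned by g_mime_charset_canon_name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Lower-case a single byte using ASCII rules only (no locale). */
static int ascii_lower (unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Hypothetical stand-in for GLib's g_ascii_strncasecmp(): compares at most
 * n bytes ignoring ASCII case and returns 0 when they are equal. */
static int ascii_strncasecmp (const char *a, const char *b, size_t n)
{
    assert (a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        int ca = ascii_lower ((unsigned char) a[i]);
        int cb = ascii_lower ((unsigned char) b[i]);
        if (ca != cb)
            return ca - cb;
        if (ca == '\0')
            break;      /* both strings ended before n bytes */
    }
    return 0;
}

/* True when the (assumed already canonicalized) charset name starts with
 * "iso-8859-". A match yields 0 from the comparison, hence the negation. */
static bool is_iso_8859 (const char *charset)
{
    return charset != NULL && !ascii_strncasecmp (charset, "iso-8859-", 9);
}
```

With these helpers, is_iso_8859 ("ISO-8859-15") is true, while is_iso_8859 ("latin1") and is_iso_8859 ("windows-1252") are false. The "latin1" case is the reason for canonicalizing the name before testing the prefix.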