From mboxrd@z Thu Jan 1 00:00:00 1970
From: Sebastian Poeplau
To: Jeffrey Stedfast, "notmuch@notmuchmail.org"
Subject: Re: Handling mislabeled emails encoded with Windows-1252
In-Reply-To: <87effszpg7.fsf@eurecom.fr>
References: <87lgaeat37.fsf@eurecom.fr> <8736w91jz0.fsf@tethera.net> <87effszpg7.fsf@eurecom.fr>
Date: Sat, 28 Jul 2018 13:22:46 +0200
Message-ID: <87zhyby589.fsf@eurecom.fr>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
List-Id: "Use and development of the notmuch mail system."

--=-=-=
Content-Type: text/plain

Hi all,

Here's the updated patch. It filters the message through the
GMimeFilterWindows that Jeff mentioned and then uses the charset it
detects for GMimeFilterCharset in the actual rendering of the message.

Jeff, is this how to use the filter correctly?

Cheers,
Sebastian

--=-=-=
Content-Type: text/x-patch
Content-Disposition: inline; filename=fix_windows_charsets.patch

diff -ura notmuch-0.27/notmuch-show.c notmuch-0.27-patched/notmuch-show.c
--- notmuch-0.27/notmuch-show.c 2018-06-13 03:42:34.000000000 +0200
+++ notmuch-0.27-patched/notmuch-show.c 2018-07-28 10:25:25.358502880 +0200
@@ -271,7 +271,10 @@
 {
     GMimeContentType *content_type = g_mime_object_get_content_type (GMIME_OBJECT (part));
     GMimeStream *stream_filter = NULL;
+    GMimeStream *null_stream = NULL;
+    GMimeStream *null_stream_filter = NULL;
     GMimeFilter *crlf_filter = NULL;
+    GMimeFilter *windows_filter = NULL;
     GMimeDataWrapper *wrapper;
     const char *charset;
 
@@ -282,13 +285,27 @@
     if (stream_out == NULL)
         return;
 
+    charset = g_mime_object_get_content_type_parameter (part, "charset");
+    wrapper = g_mime_part_get_content_object (GMIME_PART (part));
+    if (wrapper && charset) {
+        /* Check for mislabeled Windows encoding */
+        null_stream = g_mime_stream_null_new ();
+        null_stream_filter = g_mime_stream_filter_new (null_stream);
+        windows_filter = g_mime_filter_windows_new (charset);
+        g_mime_stream_filter_add(GMIME_STREAM_FILTER (null_stream_filter),
+                                 windows_filter);
+        g_mime_data_wrapper_write_to_stream (wrapper, null_stream_filter);
+        charset = g_mime_filter_windows_real_charset(
+            (GMimeFilterWindows *) windows_filter);
+        g_object_unref (windows_filter);
+    }
+
     stream_filter = g_mime_stream_filter_new (stream_out);
     crlf_filter = g_mime_filter_crlf_new (false, false);
     g_mime_stream_filter_add(GMIME_STREAM_FILTER (stream_filter),
                              crlf_filter);
     g_object_unref (crlf_filter);
 
-    charset = g_mime_object_get_content_type_parameter (part, "charset");
     if (charset) {
         GMimeFilter *charset_filter;
         charset_filter = g_mime_filter_charset_new (charset, "UTF-8");
@@ -313,9 +330,12 @@
         }
     }
 
-    wrapper = g_mime_part_get_content_object (GMIME_PART (part));
     if (wrapper && stream_filter)
         g_mime_data_wrapper_write_to_stream (wrapper, stream_filter);
+    if (null_stream_filter)
+        g_object_unref (null_stream_filter);
+    if (null_stream)
+        g_object_unref (null_stream);
     if (stream_filter)
         g_object_unref(stream_filter);
 }

--=-=-=
Content-Type: text/plain

Sebastian Poeplau writes:

> Hi Jeff,
>
>> GMime actually comes with a stream filter (GMimeFilterWindows) which can auto-detect this situation.
>>
>> In this particular case, you'd instantiate the GMimeFilterWindows like this:
>>
>> filter = g_mime_filter_windows_new ("iso-8859-1");
>>
>> "iso-8859-1" being the charset that the content claims to be in.
>>
>> Then you'd pipe the raw (decoded but not converted to utf-8) content though the filter and afterward call g_mime_filter_windows_real_charset (filter) which would return, in this user's case, "windows-1252".
>
> Nice, this is exactly what I was looking for! Somehow I missed it when
> checking GMime. I'll adapt my local fix and post the results here.
>
> Thanks,
> Sebastian

--=-=-=--