From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from localhost (localhost [127.0.0.1])
 by arlo.cworth.org (Postfix) with ESMTP id 7B3EE6DE0219
 for ; Tue, 24 Jul 2018 06:56:08 -0700 (PDT)
X-Virus-Scanned: Debian amavisd-new at cworth.org
X-Spam-Flag: NO
X-Spam-Score: -0.006
X-Spam-Level:
X-Spam-Status: No, score=-0.006 tagged_above=-999 required=5
 tests=[AWL=0.005, SPF_PASS=-0.001, T_RP_MATCHES_RCVD=-0.01]
 autolearn=disabled
Received: from arlo.cworth.org ([127.0.0.1])
 by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id Uyc2JhDdjAB1 for ;
 Tue, 24 Jul 2018 06:56:07 -0700 (PDT)
Received: from smtp.eurecom.fr (smtp.eurecom.fr [193.55.113.210])
 by arlo.cworth.org (Postfix) with ESMTP id 2FAD36DE01FF
 for ; Tue, 24 Jul 2018 06:56:06 -0700 (PDT)
X-IronPort-AV: E=Sophos;i="5.51,398,1526335200";
 d="py'?scan'208";a="7936583"
Received: from waha.eurecom.fr (HELO smtps.eurecom.fr) ([10.3.2.236])
 by drago1i.eurecom.fr with ESMTP; 24 Jul 2018 15:55:54 +0200
Received: from archibald (unknown [193.55.114.4])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by smtps.eurecom.fr (Postfix) with ESMTPSA id BB06BFE6;
 Tue, 24 Jul 2018 15:55:53 +0200 (CEST)
From: Sebastian Poeplau
To: David Bremner , notmuch@notmuchmail.org
Subject: Re: Handling mislabeled emails encoded with Windows-1252
In-Reply-To: <87o9exyseg.fsf@eurecom.fr>
References: <87lgaeat37.fsf@eurecom.fr> <8736w91jz0.fsf@tethera.net>
 <87o9exyseg.fsf@eurecom.fr>
Date: Tue, 24 Jul 2018 15:55:54 +0200
Message-ID: <87k1pkzqj9.fsf@eurecom.fr>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
X-BeenThere: notmuch@notmuchmail.org
X-Mailman-Version: 2.1.26
Precedence: list
List-Id: "Use and development of the notmuch mail system."
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 24 Jul 2018 13:56:08 -0000

--=-=-=
Content-Type: text/plain

Hi again,

>> Everyone's mail situation is unique, but I haven't noticed this
>> problem. Do you have a mechanical (e.g. scripted) way of detecting such
>> mails? I suppose it could just look for characters in the range 0x80 to
>> 0x95 in allegedly ISO_8859-1 messages. A census of the situation in my
>> own mail would help me think about this problem, I think.
>
> Yes, I guess that should be a good enough heuristic for detecting
> affected mail. I'll try to come up with a simple script and post it
> here.

Attached is a Python script that checks individual message files and
prints their name if it finds them to contain mislabeled Windows-1252
text. The heuristic seems to work well on my mail - let me know if you
encounter any issues!

Cheers,
Sebastian

--=-=-=
Content-Type: application/octet-stream
Content-Disposition: attachment; filename=find_mislabeled_cp1252.py
Content-Transfer-Encoding: base64

IyEvdXNyL2Jpbi9lbnYgcHl0aG9uMwpmcm9tIGVtYWlsLnBhcnNlciBpbXBvcnQgQnl0ZXNQYXJz
ZXIKaW1wb3J0IGVtYWlsLnBvbGljeQoKCm1haWxfcGFyc2VyID0gQnl0ZXNQYXJzZXIocG9saWN5
PWVtYWlsLnBvbGljeS5kZWZhdWx0KQoKCmRlZiBjaGVja19tZXNzYWdlKGZpbGVuYW1lKToKICAg
ICIiIgogICAgUmV0dXJuIFRydWUgaWYgdGhlIHNwZWNpZmllZCBtZXNzYWdlIGNvbnRhaW5zIG1p
c2xhYmVsZWQgV2luZG93cy0xMjUyIHRleHQuCiAgICAiIiIKICAgIHdpdGggb3BlbihmaWxlbmFt
ZSwgJ3JiJykgYXMgZjoKICAgICAgICBtZXNzYWdlID0gbWFpbF9wYXJzZXIucGFyc2UoZikKCiAg
ICBmb3IgcGFydCBpbiBtZXNzYWdlLndhbGsoKToKICAgICAgICBpZiBwYXJ0LmdldF9jb250ZW50
X3R5cGUoKSAhPSAndGV4dC9wbGFpbicgb3IgXAogICAgICAgICAgIHBhcnQuZ2V0X2NvbnRlbnRf
Y2hhcnNldCgpICE9ICdpc28tODg1OS0xJzoKICAgICAgICAgICAgY29udGludWUKCiAgICAgICAg
Ym9keSA9IHBhcnQuZ2V0X2NvbnRlbnQoKQogICAgICAgIGZvciBjaGFyIGluIGJvZHk6CiAgICAg
ICAgICAgIGNvZGUgPSBvcmQoY2hhcikKICAgICAgICAgICAgaWYgY29kZSA+IDB4ODAgYW5kIGNv
ZGUgPCAweGEwOgogICAgICAgICAgICAgICAgcmV0dXJuIFRydWUKCiAgICByZXR1cm4gRmFsc2UK
CgpkZWYgbWFpbihhcmdzKToKICAgIGZvciBhcmcgaW4gYXJnczoKICAgICAgICBpZiBjaGVja19t
ZXNzYWdlKGFyZyk6CiAgICAgICAgICAgIHByaW50KGFyZykKCgppZiBfX25hbWVfXyA9PSAnX19t
YWluX18nOgogICAgaW1wb3J0IHN5cwogICAgbWFpbihzeXMuYXJndikK
--=-=-=--
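
For readers who do not want to decode the attachment, here is a minimal sketch of the heuristic discussed in this thread. It is an illustration, not necessarily identical to the attached script. The function name `looks_like_cp1252` is hypothetical, and the byte range 0x80-0x9F (the whole C1 control block) is an assumption that is slightly wider than the 0x80 to 0x95 suggested in the quoted text.

```python
from email import message_from_bytes, policy

# Bytes 0x80-0x9F are control codes in ISO-8859-1 and almost never appear
# in real text, but Windows-1252 uses them for curly quotes, dashes and
# the euro sign. The exact bounds are an assumption of this sketch.
SUSPICIOUS = range(0x80, 0xA0)


def looks_like_cp1252(raw_message: bytes) -> bool:
    """Return True if a text/plain part labeled iso-8859-1 contains
    bytes that only make sense as Windows-1252 (hypothetical helper)."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        # get_content_charset() returns the charset name in lower case.
        if part.get_content_charset() != "iso-8859-1":
            continue
        # decode=True undoes base64/quoted-printable and yields raw bytes,
        # so the check works on bytes rather than on decoded characters.
        payload = part.get_payload(decode=True) or b""
        if any(byte in SUSPICIOUS for byte in payload):
            return True
    return False
```

Working on the raw bytes keeps the check independent of how the declared charset would be decoded. A census over a mail store could call this once per message file and print the names of the files for which it returns True.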