From mboxrd@z Thu Jan 1 00:00:00 1970
From: David Bremner
To: David Edmondson, Mark Walters, notmuch@notmuchmail.org
Subject: Re: [PATCH v1 0/3] Improve the acquisition of text parts.
References: <1457457179-4707-1-git-send-email-dme@dme.org> <87ziu2s8rb.fsf@qmul.ac.uk>
User-Agent: Notmuch/0.21+74~g6c60fb1 (http://notmuchmail.org) Emacs/24.5.1 (x86_64-pc-linux-gnu)
Date: Mon, 14 Mar 2016 08:49:36 -0300
Message-ID: <87bn6h5lf3.fsf@zancas.localnet>
MIME-Version: 1.0
Content-Type: text/plain

David Edmondson writes:

> On Sun, Mar 13 2016, Mark Walters wrote:
>> However, it would be sensible to get testing in a greater variety of
>> charsets/encodings
>
> Agreed. Does anyone have suggestions on how we might achieve this? A
> corpus of mail that we could use?

Maybe the notmuch performance corpus, particularly the lkml sample.

    grep -R charset= performance-test/corpus/mail/lkml | sed -e 's/^.*charset=//' -e 's/;.*//' -e 's/"//g' | tr '[A-Z]' '[a-z]' | sort -u

gives

euc-kr
gb2312
iso-2022-jp
iso-2022-jp-2
iso-8859-1
iso-8859-14
iso 8859-15
iso-8859-15
iso-8859-1
iso-8859-2
iso-8859-6
iso-8859-7
iso-8859-9
koi8-r
koi8-u
ks_c_5601-1987
shift_jis
unknown
unknown-8bit
us-ascii
utf8
utf-8
windows-1250
windows-1251
windows-1252
windows-1255

to unpack the corpus

    cd performance-test
    make download-corpus
    ./T00-new.sh --large

probably interrupt the test once notmuch-new starts running.
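
If it also helps to know how often each charset shows up (to decide which ones are worth testing first), the same pipeline can be adapted to count occurrences instead of just listing them. This is only a rough sketch: charset_counts is a made-up helper name, not something in the notmuch tree, and the default path assumes the corpus has been unpacked as described above.

```shell
#!/bin/sh
# Hypothetical helper, not part of notmuch: count charset labels under a
# directory of mail files, most frequent first.
charset_counts() {
    # Default to the lkml sample path mentioned in the message; pass a
    # different directory as the first argument to override it.
    dir=${1:-performance-test/corpus/mail/lkml}
    grep -R -h 'charset=' "$dir" |
        sed -e 's/^.*charset=//' -e 's/;.*//' -e 's/"//g' |
        tr '[:upper:]' '[:lower:]' |
        sort | uniq -c | sort -rn
}
```

Running charset_counts with no argument from the top of the source tree should print one line per charset label, prefixed by the number of matching lines. Like the original pipeline it only matches a lowercase "charset=", so labels written with other capitalisation of the parameter name are missed.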