From mboxrd@z Thu Jan 1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Giorgos Keramidas
Newsgroups: gmane.emacs.devel
Subject: Email text that confuses charset recognition in emacs
Date: Tue, 16 Apr 2013 18:27:57 +0200
Message-ID: <20130416162747.GA11871@saturn>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Trace: ger.gmane.org 1366129710 13492 80.91.229.3 (16 Apr 2013 16:28:30 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Tue, 16 Apr 2013 16:28:30 +0000 (UTC)
To: emacs-devel@gnu.org
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Tue Apr 16 18:28:35 2013
Return-path:
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
 by plane.gmane.org with esmtp (Exim 4.69) (envelope-from )
 id 1US8kD-0002AQ-4M
 for ged-emacs-devel@m.gmane.org; Tue, 16 Apr 2013 18:28:33 +0200
Original-Received: from localhost ([::1]:53416 helo=lists.gnu.org)
 by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
 id 1US8kC-00024O-O5
 for ged-emacs-devel@m.gmane.org; Tue, 16 Apr 2013 12:28:32 -0400
Original-Received: from eggs.gnu.org ([208.118.235.92]:45942)
 by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
 id 1US8k7-0001yh-B6
 for emacs-devel@gnu.org; Tue, 16 Apr 2013 12:28:30 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned
 (Exim 4.71) (envelope-from )
 id 1US8k4-0006jq-Ao
 for emacs-devel@gnu.org; Tue, 16 Apr 2013 12:28:27 -0400
Original-Received: from tux-cave.hellug.gr ([195.134.99.74]:58327)
 by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
 id 1US8k3-0006gl-LV
 for emacs-devel@gnu.org; Tue, 16 Apr 2013 12:28:24 -0400
X-Hellug-MailScanner-From: keramida@ceid.upatras.gr
X-Hellug-MailScanner-SpamCheck: not spam, SpamAssassin (not cached,
 score=-0.19, required 5, autolearn=not spam, ALL_TRUSTED -1.00,
 BAYES_50 0.80, T_LOTS_OF_MONEY 0.01)
X-Hellug-MailScanner: Found to be clean
X-Hellug-MailScanner-ID: r3GGS3NC009269
Original-Received: from saturn.laptop
 (217-162-217-29.dynamic.hispeed.ch [217.162.217.29])
 (authenticated bits=0)
 by tux-cave.hellug.gr (8.14.3/8.14.3/Debian-9.4) with ESMTP
 id r3GGS3NC009269
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT)
 for ; Tue, 16 Apr 2013 19:28:11 +0300
Original-Received: from saturn.laptop (localhost [127.0.0.1])
 by saturn.laptop (8.14.4/8.14.4/Debian-2.1ubuntu1) with ESMTP
 id r3GGRvJM015234
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT)
 for ; Tue, 16 Apr 2013 18:27:57 +0200
Original-Received: (from keramida@localhost)
 by saturn.laptop (8.14.4/8.14.4/Submit) id r3GGRvws015233
 for emacs-devel@gnu.org; Tue, 16 Apr 2013 18:27:57 +0200
X-Authentication-Warning: saturn.laptop: keramida set sender to keramida@ceid.upatras.gr using -f
Content-Disposition: inline
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x
X-Received-From: 195.134.99.74
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions."
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:158951
Archived-At:

Hi everyone,

I just noticed that the attached email message confuses the charset
detection machinery of Emacs, and it starts interpreting all text as
Japanese text -- even though most of the contents of the file are plain
us-ascii text.

I first noticed the problem when I received the attached message from
the `freebsd-current' mailing list, and the text looked Japanese in a
Gnus article buffer. But saving the email text with `C-u M-g' to look
at the raw article text and re-opening the raw article text in Emacs
always shows Japanese text.
If I open the article text with `M-x find-file-literally', I can read
all of the English text, minus a few parts of the attribution line of
the email text. The problem seems to start near this text in the email
body:

| >>> In message <18DF99B0-6E66-4906-A233-7778451B8A92@felyko.com>, Rui Paulo
| >>> writes:
| >>>> 2013/04/15 9:55^[$B!"^[(BCy Schubert ^[$B$N%a%C%;!<%8^[
| >> (B:

I think it's because one of the intermediate mail relays (or even
Gmail, which got the final delivery for me) wrapped a line of text in
the middle of the byte sequence: "%8^[(B" one line before the last one.

In fact, joining the two last lines, and removing the bogusly inserted
text of "\n>> " restores the sequence to something that decodes
properly in Emacs:

>>>> 2013/04/15 9:55^[$A!"^[(BCy Schubert ^[$A$N%a%C%;^[$B!<^[$A%8^[(B:

Since this is not a problem in what Emacs does with charset decoding of
the entire buffer, is it something we should try to fix in Gnus? Is
there any way we can detect this sort of decoding problem at all?

I am running an Emacs snapshot built earlier this morning from this
version, by the way:

| 0416 18:10 saturn:~/git/emacs$ git log -1
| commit 3e440d19d12bc740010d9e98958d529260eea321
| Author: Michael Albinus
| Date: Tue Apr 16 10:11:56 2013 +0200
|
| * tramp.texi (Frequently Asked Questions): Precise, how to define
| an own ControlPath.

Here's also the uuencoded "broken" text of the email/article buffer:

begin 644 gnus-article-confusing-charset.txt
M1&5L:79E2`Q,"XQ-"XQ.#(N-S(@=3DVET:"!33510(&ED(&XT.&UR
M.#$Y.38W-V5E;2XS+C$S-C8P-3@R-C8U,C@["B`@("`@("`@36]N+"`Q-2!!
M<'(@,C`Q,R`Q,SHS-SHT-B`M,#2!P;W-E:61O;BYC96ED+G5P871R87,N9W(@ M*%!O2!M86EL+F-E:60N=3D7!A=3D')A"D*"6ED(#DT.30T.3%#049#-SL@36]N+"`Q-2!!<'(@,C`Q M,R`R,SHS-SHT-2`K,#,P,"`H14535"D*1&5L:79E2!M86EL+F-E:60N=3D7!A=3D')A"D@=3DVET:"!%4TU44"!I9"`Y,#)&-3DQ0T%&0S8*"69O2!M86EL+F-E:60N=3D7!A=3D')A"D@=3DVET:"!%4TU44"!I9"`W,#1",CDQ0T%&0S4*"69O"P@9G)O;2!U#(N9G)E M96)S9"YO#$N1G)E94)31"YO"D@=3DVET:"!%4TU44"!I M9"!&,D4P1D5#1@H@9F]R(#QC=3D7)R96YT0&9R965B6%H;V\N8V]M(%LY."XQ,S"D@=3DVET:"!33510(&ED($,S,#DX,38U M,@H@9F]R(#QC=3D7)R96YT0&9R965B2!N;3$U+F)U;&QE=3D"YM86EL+F=3DQ,2YY86AO;RYC;VT@=3DVET M:"!.3D9-4#L*(#$U($%P2!T;30N8G5L;&5T+FUA:6PN9W$Q M+GEA:&]O+F-O;2!W:71H($Y.1DU0.PH@,34@07!R(#(P,3,@,3DZ,S(Z-#D@ M+3`P,#`*4F5C96EV960Z(&9R;VT@6S$R-RXP+C`N,5T@8GD@&5D+W)E;&%X960[(&0]>6%H;V\N8V]M.R!S/7,Q,#(T.PH@ M=3D#TQ,S8V,#4T,S8Y.R!B:#U+:S=3DM,VY,5D1-,$UK3FE5:&DO9&98351:=3DC)- M:3!%04E"-$UZD1W/3L*(&@]6"U986AO;RU.97=3DM86XM260Z6"U986AO M;RU.97=3DM86XM4')O<&5R=3D'DZ6"U936%I;"U/4T3H@>6UA:6PM M,PI8+5E-86EL+4]31SH@34UG+CE.,%9-,6YT2$-2>&QY:S=3D&-W9D0C1E5'5. 
M96-356Y03'II:EI#2&U4:V$*(#!J67=3DO,DYD3&E.1'I926-%=3D7HN,69D<$=3D3 M0T-">'AS:%%X<&AM0V93:6Y8;'=3DN66A72#!0,'%&6FE8<`H@-W)"G1V,55784=3DO7UEV1U):;G5F;E8N;FTS M;5HR8DE("B!O-65(:T%V:TY&6&AR6EA#1$E#24(R3E"83A/<%=3DI0W$N04UL6#-&<%\*(&E#4G-83V1,27DP55I$46=3DJ94\P M8W4Y4&=3DY>4-8=3D5!15%0Q54QV2F\R>7DU9T5B9%]Y3V,V>#=3DT5VTQ5`H@=3DG,P M=3D%EL>&IT4#!&44E?,V-036%T131(3WHQ:W%O>GIX:%]Y4'-!0VE-95%->C1D M=3DTU#86IO1'8W=3DSE&"B!*-$]&841S2U!!>E]C65!Q,'%N44U#1E%":&,X2!S;71P,C`S+FUA:6PN9W$Q+GEA:&]O+F-O;2!W:71H(%--5%`[ M(#$U($%P6%H;V\N8V]M/@I);BU297!L>2U4;SH@/#(P,3,P M-#$U,3DR-RYR,T9*4G1Q.3`P,C2Y38VAU M8F5R=3D$!K;VUQ=3D6%T2YC;VT^+`H@(F-U6MO+F-O;3XL(")N971`9G)E96)S9"YO7!E.B!T97AT+W!L86EN.R!C:&%R2Y38VAU8F5R=3D$!K;VUQ=3D6%T2!P M;W)T6]U('=3DA M;G0@=3D&\@=3DV]R:R!O;B!S;VUE=3D&AI;F<@=3D&AA=3D"!P96]P;&4@:&%V92!B965N M('1R>6EN9R!T;R!R96UO=3D@H^/B!E(',*/CX^/B!I;F-E(#(P,#4_"CX^/B`* M/CX^($D@86YD(&]T:&5R6YC+"!R9&ES=3D"P@86YD(&$@;&]C86P@0U93(')E M<&\I+B`*/CX^($EN=3D&5R;W!E7-T96US M('=3DH:6-H('5S92!)4"!&:6QT97(@:7,@82!P;'5S+B!)9B`*/CX^('1H97)E M)W,@82!M86EN=3D&%I;F5R+"!I=3D"!O;FQY(&UA:V5S($9R965"4T0@6]U)W)E(&-O;6UI=3D'1E9"!T;R!M86EN=3D&%I;FEN M9R!)4$9I;'1E2!R:6=3DH=3D"!N;W2!A;F0@=3DW)O=3D&4@4$8@+2T@22=3DM(&YO=3D"!S=3D7)E('=3DH870@3F5T M0E-$(&1I9"DN"CX@"@H*22!W;W5L9"!A