From mboxrd@z Thu Jan 1 00:00:00 1970
From: Reiner Steib
Newsgroups: gmane.emacs.devel
Subject: Auto-detection of windows-1252 fails
Date: Sat, 05 Jan 2008 14:22:37 +0100
To: emacs-devel@gnu.org

Hi,

in September/October 2006 we had a long thread on emacs-pretest-bugs
about auto-detection of windows-1252 text files:

  Subject: local chars displayed as numbers

[ I include a summary of this thread below. ]

windows-1252 files were supposed to be detected automatically in the
"Latin-1" and "German" language environments.  This doesn't work
(anymore?) in Emacs 22.1, on the Emacs_22 branch, or on the trunk.

* Recipe to reproduce the problem:

$ echo -e '\x91 O:\xD6 o:\xF6 \x92' > w1252-O-o.txt

The file contains the following non-ASCII characters:

- LEFT SINGLE QUOTATION MARK (U+2018)
- LATIN CAPITAL LETTER O WITH DIAERESIS (U+00D6)
- LATIN SMALL LETTER O WITH DIAERESIS (U+00F6)
- RIGHT SINGLE QUOTATION MARK (U+2019)

$ file w1252-O-o.txt
w1252-O-o.txt: Non-ISO extended-ASCII text

When decoded correctly, it looks like this:

,----[ w1252-O-o.txt ]
| ‘ O:Ö o:ö ’
`----

* Expected result:

According to the discussion in 2006, this file should be recognized as
windows-1252 with either of the following command lines:

$ LC_ALL=de_DE emacs -Q w1252-O-o.txt
$ emacs -Q --eval '(set-language-environment "German")' w1252-O-o.txt

* Current result:

The file is opened in iso-8859-1, i.e. the left quotation mark is
displayed as \221 and the right quotation mark is detected as
eight-bit-control:

,----[ M-x describe-char RET ]
|   character: \222 (146, #o222, #x92, U+0092)
|     charset: eight-bit-control (8-bit control code (0x80..0x9F))
|  code point: #x92
|      syntax:   which means: whitespace
| buffer code: #x92
|   file code: not encodable by coding system iso-latin-1-unix
|     display: by display table entry [?'] (see below)
|
| The display table entry is displayed by these fonts (glyph codes):
|  ': -Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO8859-1 (#x27)
`----

* Summary of the September/October 2006 discussion:

The following change was installed...

,----[ ChangeLog.12 ]
| 2006-09-21  Kenichi Handa
|
|     * language/european.el ("Latin-1"): Add windows-1252 to
|     coding-priority.
|     ("German"): Likewise.
`----

... and was supposed to result in the following behavior:

Kenichi Handa wrote:
| A file containing a windows-1252 char that doesn't appear in
| iso-8859-1 is detected as windows-1252.  Bad effect is that some (or
| many) binary files are also detected as windows-1252.

Some people pointed out that, as a side effect, some (or many) binary
files might also be detected as windows-1252.  Eli suggested
implementing null-byte detection, which should solve this problem.

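The null-byte idea above can be sketched in shell. This is only an
illustration of the proposed behavior, not Emacs' actual detection
code: the function name classify and the order of the checks are
assumptions made for this example. The byte values are the usual ones:
0x80..0x9F are C1 control codes in iso-8859-1, and 0x81, 0x8D, 0x8F,
0x90 and 0x9D are left undefined by windows-1252.

```shell
#!/bin/sh
# Hypothetical helper (not part of Emacs): guess a coding for a file
# from its raw bytes, checking for null bytes first.
classify() {
    # One lowercase hex byte per line, e.g. "91", "20", "d6".
    hex=$(od -An -v -tx1 "$1" | tr -s ' ' '\n' | grep -v '^$')

    if printf '%s\n' "$hex" | grep -qx '00'; then
        echo binary          # any null byte: treat as binary
    elif printf '%s\n' "$hex" | grep -qxE '81|8d|8f|90|9d'; then
        echo raw-text        # bytes left undefined by windows-1252
    elif printf '%s\n' "$hex" | grep -qxE '[89][0-9a-f]'; then
        echo windows-1252    # 0x80..0x9F: C1 controls in iso-8859-1
    elif printf '%s\n' "$hex" | grep -qxE '[a-f][0-9a-f]'; then
        echo iso-8859-1      # only 0xA0..0xFF beyond ASCII
    else
        echo ascii
    fi
}
```

Under these assumed rules, "classify w1252-O-o.txt" would report
windows-1252 for the file from the recipe, and the same bytes plus a
null byte would be reported as binary.
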
Kenichi Handa wrote:
| Reiner Steib writes:
|
| > (6) Implement null-byte detection (to prevent binary files
| >     mis-detected as windows-12xx), keep the current code (windows-1252)
| >     and add windows-1254/1255 accordingly.
|
| I think that change results in the best behavior.

... and Richard agreed with that.  But I don't think this has been
done.  ("the current code" refers to the 2006-09-21 change, see above.)

I attached 3 simple test files and described the result:

,----
| I did some tests with (see attached auto-coding.tar.gz)...
|
| (a) a file containing only windows-1252 characters,
|
| (b) a file with some Latin-1 text plus "reserved characters"
|     (i.e. chars not defined in windows-1252),
|
| (c) a file with some Latin-1 and windows-1252 text plus a null-byte.
|
| Emacs detected the files as:
|
| (a) windows-1252 (-> correct)
|
| (b) raw-text-unix (-> correct)
|
| (c) windows-1252 (-> slightly incorrect, at least for people who argue
|     that binary is better here)
`----

* Additionally, the addition of windows-1252 to "German" has been lost
  in the emacs-unicode-2 branch:

--- european.el 26 Jul 2007 05:27:10 -0000 1.100
+++ european.el 25 Dec 2007 10:57:51 -0000 1.86.4.13
@@ -277,16 +414,15 @@
 (set-language-info-alist
  "German" '((tutorial . "TUTORIAL.de")
-            (charset ascii latin-iso8859-1)
+            (charset iso-8859-1)
             (coding-system iso-latin-1 iso-latin-9)
-            (coding-priority iso-latin-1 windows-1252)
+            (coding-priority iso-latin-1)
+            (nonascii-translation . iso-8859-1)
             (input-method . "german-postfix")

Bye, Reiner.
-- 
       ,,,
      (o o)
---ooO-(_)-Ooo---  |  PGP key available  |  http://rsteib.home.pages.de/