From: Reiner Steib <reinersteib+gmane@imap.cc>
To: emacs-devel@gnu.org
Subject: Auto-detection of windows-1252 fails
Date: Sat, 05 Jan 2008 14:22:37 +0100
Message-ID: <v9fxxctnj6.fsf@marauder.physik.uni-ulm.de>

Hi,

in September/October 2006 we had a long thread on emacs-pretest-bugs
about auto-detection of windows-1252 text files:

  Subject: local chars displayed as numbers
  <http://thread.gmane.org/gmane.emacs.pretest.bugs/14020/>
  [ I include a summary of this thread below. ]

windows-1252 files were supposed to be detected automatically in the
"Latin-1" and "German" language environments.  This doesn't work
(anymore?) in Emacs 22.1, the Emacs_22 branch and in the trunk.

* Recipe to reproduce the problem:

  $ echo -e '\x91 O:\xD6 o:\xF6 \x92' > w1252-O-o.txt

  I.e., the file contains the following non-ASCII characters:

  - LEFT SINGLE QUOTATION MARK (U+2018)
  - LATIN CAPITAL LETTER O WITH DIAERESIS (U+00D6)
  - LATIN SMALL LETTER O WITH DIAERESIS (U+00F6)
  - RIGHT SINGLE QUOTATION MARK (U+2019)
  
  $ file w1252-O-o.txt
  w1252-O-o.txt: Non-ISO extended-ASCII text
  
  When decoded correctly, it looks like this:
  ,----[ w1252-O-o.txt ]
  | ‘ O:Ö o:ö ’
  `----
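
  To check the byte values outside Emacs, here is a small Python sketch
  (not part of the original recipe).  It assumes that Python's "cp1252"
  and "latin-1" codecs are adequate stand-ins for windows-1252 and
  iso-8859-1, and decodes the same bytes that the echo command writes:

```python
# Bytes produced by the echo command in the recipe, including the newline.
data = b"\x91 O:\xd6 o:\xf6 \x92\n"

# Decode the same bytes under both candidate encodings.
as_cp1252 = data.decode("cp1252")
as_latin1 = data.decode("latin-1")

# Show the decoded text and the code points of the non-ASCII characters.
print(as_cp1252, end="")
print("cp1252 :", [f"U+{ord(c):04X}" for c in as_cp1252 if ord(c) > 127])
print("latin-1:", [f"U+{ord(c):04X}" for c in as_latin1 if ord(c) > 127])
```

  Only the cp1252 decoding yields the two quotation marks listed above;
  under latin-1 the bytes 0x91 and 0x92 stay in the 0x80..0x9F range.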
  
* Expected result:

  According to the discussion in 2006, this file should be recognized
  as windows-1252 with the following command lines:

  $ LC_ALL=de_DE emacs -Q w1252-O-o.txt

  $ emacs -Q --eval '(set-language-environment "German")' w1252-O-o.txt

* Current result:

  The file is opened in iso-8859-1, i.e. the left quotation mark is
  displayed as \221 and the right quotation mark is detected as
  eight-bit-control:

  ,----[ M-x describe-char RET ]
  |   character: \222 (146, #o222, #x92, U+0092)
  |     charset: eight-bit-control (8-bit control code (0x80..0x9F))
  |  code point: #x92
  |      syntax:   	which means: whitespace
  | buffer code: #x92
  |   file code: not encodable by coding system iso-latin-1-unix
  |     display: by display table entry [?'] (see below)
  | 
  | The display table entry is displayed by these fonts (glyph codes):
  | ': -Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO8859-1 (#x27)
  `----
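
  The \221 and \222 escapes are the octal forms of the bytes 0x91 and
  0x92.  The following Python sketch (an illustration only; the helper
  name `describe` is made up here) shows how the two encodings treat
  these bytes, again assuming Python's codecs match the Emacs coding
  systems for them:

```python
import unicodedata


def describe(byte):
    """Summarize how one byte value looks under latin-1 and cp1252."""
    latin1_char = bytes([byte]).decode("latin-1")
    cp1252_char = bytes([byte]).decode("cp1252")
    return {
        # Octal form, as used in escapes such as \221.
        "octal": format(byte, "o"),
        # Unicode general category of the latin-1 interpretation.
        "latin1_category": unicodedata.category(latin1_char),
        # Character name of the cp1252 interpretation.
        "cp1252_name": unicodedata.name(cp1252_char),
    }


for byte in (0x91, 0x92):
    print(hex(byte), describe(byte))
```

  The latin-1 interpretation falls into the control-character category,
  whereas cp1252 maps the same bytes to the quotation marks.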
  
* Summary of the September/October 2006 discussion:

  The following change was installed...
  
  ,----[ ChangeLog.12 ]
  | 2006-09-21  Kenichi Handa  <handa@m17n.org>
  | 
  | 	* language/european.el ("Latin-1"): Add windows-1252 to
  | 	coding-priority.
  | 	("German"): Likewise.
  `----
  
  ... and was supposed to result in the following behavior:
  
  Kenichi Handa wrote in
  <http://article.gmane.org/gmane.emacs.pretest.bugs/14384>:
  
  | A file containing a windows-1252 char that doesn't appear in
  | iso-8859-1 is detected as windows-1252.  Bad effect is that some (or
  | many) binary files are also detected as windows-1252.
  
  Some people pointed out that this may have the bad effect that some
  (or many) binary files are also detected as windows-1252.  Eli
  suggested implementing null-byte detection, which should solve this
  problem.
  
  In <http://thread.gmane.org/gmane.emacs.pretest.bugs/14020/focus=14384>
  Kenichi Handa wrote:
  | Reiner Steib <reinersteib+gmane <at> imap.cc> writes:
  | 
  | > (6) Implement null-byte detection (to prevent binary files
  | >    mis-detected as windows-12xx), keep the current code (windows-1252)
  | >    and add windows-1254/1255 accordingly.
  | 
  | I think that change results in the best behavior.
  
  ... and Richard agreed on that.  But I don't think this has been done.
  ("the current code" refers to the 2006-09-21 change, see above.)
  
  In
  <http://thread.gmane.org/gmane.emacs.pretest.bugs/14020/focus=14367> I
  attached 3 simple test files and described the results:
  
  ,----
  | I did some tests with (see attached auto-coding.tar.gz)...
  | 
  | (a) a file containing only windows-1252 characters,
  | 
  | (b) a file with some Latin-1 text plus "reserved characters"
  |     (i.e. chars not defined in windows-1252),
  | 
  | (c) a file with some Latin-1 and windows-1252 text plus a null-byte.
  | 
  | Emacs detected the files as:
  | 
  | (a) windows-1252 (-> correct)
  | 
  | (b) raw-text-unix (-> correct)
  | 
  | (c) windows-1252 (-> slightly incorrect, at least for people who argue
  |     that binary is better here)
  `----
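
  To make option (6) concrete, here is a rough Python sketch of the
  proposed decision order.  It is not how Emacs implements detection;
  the function name `guess_coding` and the returned labels are
  hypothetical, and Python's "cp1252" codec is assumed to reject the
  same reserved bytes that windows-1252 leaves undefined:

```python
def guess_coding(data: bytes) -> str:
    """Hypothetical heuristic: null byte first, then windows-1252 checks."""
    # A null byte is taken as a sign of binary data (file (c) above).
    if b"\x00" in data:
        return "binary"
    # Bytes that windows-1252 leaves undefined rule it out (file (b)).
    try:
        data.decode("cp1252")
    except UnicodeDecodeError:
        return "raw-text"
    # Bytes in 0x80..0x9F are controls in iso-8859-1 but printable
    # characters in windows-1252 (file (a)).
    if any(0x80 <= b <= 0x9F for b in data):
        return "windows-1252"
    # Everything else fits iso-8859-1.
    return "iso-latin-1"


print(guess_coding(b"\x91 O:\xd6 o:\xf6 \x92\n"))
print(guess_coding(b"Gr\xfc\xdfe \x81\n"))
print(guess_coding(b"\x91 \xd6 \x00\n"))
```

  With this ordering, file (c) would be reported as binary instead of
  windows-1252, while (a) and (b) keep the results listed above.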
  
* Additionally, the addition of windows-1252 to "German" has been lost
  in the emacs-unicode-2 branch:

  --- european.el	26 Jul 2007 05:27:10 -0000	1.100
  +++ european.el	25 Dec 2007 10:57:51 -0000	1.86.4.13
  @@ -277,16 +414,15 @@
   
   (set-language-info-alist
    "German" '((tutorial . "TUTORIAL.de")
  -	    (charset ascii latin-iso8859-1)
  +	    (charset iso-8859-1)
   	    (coding-system iso-latin-1 iso-latin-9)
  -	    (coding-priority iso-latin-1 windows-1252)
  +	    (coding-priority iso-latin-1)
  +	    (nonascii-translation . iso-8859-1)
   	    (input-method . "german-postfix")
  
Bye, Reiner.
-- 
       ,,,
      (o o)
---ooO-(_)-Ooo---  |  PGP key available  |  http://rsteib.home.pages.de/
