From: Reiner Steib <reinersteib+gmane@imap.cc>
To: emacs-devel@gnu.org
Subject: Auto-detection of windows-1252 fails
Date: Sat, 05 Jan 2008 14:22:37 +0100
Message-ID: <v9fxxctnj6.fsf@marauder.physik.uni-ulm.de>
Hi,
in September/October 2006 we had a long thread on emacs-pretest-bugs
about auto-detection of windows-1252 text files:
Subject: local chars displayed as numbers
<http://thread.gmane.org/gmane.emacs.pretest.bugs/14020/>
[ I include a summary of this thread below. ]
windows-1252 files were supposed to be detected automatically in the
"Latin-1" and "German" language environments. This doesn't work
(anymore?) in Emacs 22.1, the Emacs_22 branch and in the trunk.
* Recipe to reproduce the problem:
$ echo -e '\x91 O:\xD6 o:\xF6 \x92' > w1252-O-o.txt
That is, the file contains the following non-ASCII characters:
- LEFT SINGLE QUOTATION MARK (U+2018)
- LATIN CAPITAL LETTER O WITH DIAERESIS (U+00D6)
- LATIN SMALL LETTER O WITH DIAERESIS (U+00F6)
- RIGHT SINGLE QUOTATION MARK (U+2019)
$ file w1252-O-o.txt
w1252-O-o.txt: Non-ISO extended-ASCII text
When decoded correctly, it looks like this:
,----[ w1252-O-o.txt ]
| ‘ O:Ö o:ö ’
`----
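The recipe above relies on `echo -e' and `\x' escapes, which not every
shell supports. A more portable sketch, assuming only POSIX printf, od
and tr, writes the same bytes with octal escapes and dumps them as hex
so that the file content can be checked before opening it in Emacs.
The function names are invented for this illustration.

```shell
# Write the sample file using octal escapes, which POSIX printf understands
# (0x91 = \221, 0xD6 = \326, 0xF6 = \366, 0x92 = \222).
make_sample() {
    printf '\221 O:\326 o:\366 \222\n' > "$1"
}

# Print the bytes of a file as one contiguous lowercase hex string.
hex_bytes() {
    od -An -tx1 "$1" | tr -d ' \n'
}
```

Running `make_sample w1252-O-o.txt' followed by `hex_bytes w1252-O-o.txt'
should print 91204f3ad6206f3af620920a, i.e. the four non-ASCII bytes
surrounded by plain ASCII and a trailing newline.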
* Expected result:
According to the discussion in 2006, this file should be recognized
as windows-1252 with the following command lines:
$ LC_ALL=de_DE emacs -Q w1252-O-o.txt
$ emacs -Q --eval '(set-language-environment "German")' w1252-O-o.txt
* Current result:
The file is opened in iso-8859-1, i.e. the left quotation mark is
displayed as \221 and the right quotation mark is detected as
eight-bit-control:
,----[ M-x describe-char RET ]
| character: \222 (146, #o222, #x92, U+0092)
| charset: eight-bit-control (8-bit control code (0x80..0x9F))
| code point: #x92
| syntax: which means: whitespace
| buffer code: #x92
| file code: not encodable by coding system iso-latin-1-unix
| display: by display table entry [?'] (see below)
|
| The display table entry is displayed by these fonts (glyph codes):
| ': -Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO8859-1 (#x27)
`----
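The detected coding system can probably also be checked without an
interactive session. The following is only a sketch, assuming an
`emacs' binary on PATH. The function name and the W1252_FILE
environment variable are made up for this example.

```shell
# Print the coding system Emacs chooses for a file in the "German"
# language environment. The file name is passed through the (made-up)
# environment variable W1252_FILE to avoid quoting it inside the Lisp form.
detect_coding_german() {
    W1252_FILE="$1" emacs -Q --batch --eval '
      (progn
        (set-language-environment "German")
        (with-current-buffer (find-file-noselect (getenv "W1252_FILE"))
          (princ buffer-file-coding-system)
          (terpri)))'
}
```

With the behavior described above, `detect_coding_german w1252-O-o.txt'
would be expected to print an iso-latin-1 variant, where a windows-1252
variant was intended.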
* Summary of the September/October 2006 discussion:
The following change was installed...
,----[ ChangeLog.12 ]
| 2006-09-21 Kenichi Handa <handa@m17n.org>
|
| * language/european.el ("Latin-1"): Add windows-1252 to
| coding-priority.
| ("German"): Likewise.
`----
... and was supposed to result in the following behavior:
Kenichi Handa wrote in
<http://article.gmane.org/gmane.emacs.pretest.bugs/14384>:
| A file containing a windows-1252 char that doesn't appear in
| iso-8859-1 is detected as windows-1252. Bad effect is that some (or
| many) binary files are also detected as windows-1252.
Some people pointed out the downside that some (or many) binary files
would then also be detected as windows-1252. Eli suggested
implementing null-byte detection, which should solve this problem.
In <http://thread.gmane.org/gmane.emacs.pretest.bugs/14020/focus=14384>
Kenichi Handa wrote:
| Reiner Steib <reinersteib+gmane <at> imap.cc> writes:
|
| > (6) Implement null-byte detection (to prevent binary files
| > mis-detected as windows-12xx), keep the current code (windows-1252)
| > and add windows-1254/1255 accordingly.
|
| I think that change results in the best behavior.
... and Richard agreed on that. But I don't think this has been done.
("the current code" refers to the 2006-09-21 change, see above.)
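To illustrate the idea behind null-byte detection (this is not Emacs
code, only a shell sketch of the heuristic, with invented function
names): a file that contains a NUL byte is treated as binary, and
everything else is left to the normal coding-system detection.

```shell
# Succeed (exit status 0) if the file contains at least one NUL byte.
# Compares the file size with its size after deleting all NUL bytes.
has_nul_byte() {
    total=$(wc -c < "$1" | tr -d ' ')
    without_nul=$(tr -d '\000' < "$1" | wc -c | tr -d ' ')
    [ "$total" -ne "$without_nul" ]
}

# Hypothetical policy: report "binary" when a NUL byte is present,
# otherwise "text" (i.e. a candidate for windows-1252 detection).
classify() {
    if has_nul_byte "$1"; then
        echo binary
    else
        echo text
    fi
}
```

Under this policy the sample file from the recipe would be classified
as text, while a file with Latin-1 and windows-1252 text plus a single
null byte would be classified as binary.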
In
<http://thread.gmane.org/gmane.emacs.pretest.bugs/14020/focus=14367> I
attached three simple test files and described the result:
,----
| I did some tests with (see attached auto-coding.tar.gz)...
|
| (a) a file containing only windows-1252 characters,
|
| (b) a file with some Latin-1 text plus "reserved characters"
| (i.e. chars not defined in windows-1252),
|
| (c) a file with some Latin-1 and windows-1252 text plus a null-byte.
|
| Emacs detected the files as:
|
| (a) windows-1252 (-> correct)
|
| (b) raw-text-unix (-> correct)
|
| (c) windows-1252 (-> slightly incorrect, at least for people who argue
| that binary is better here)
`----
* Additionally, the addition of windows-1252 to "German" has been lost
in the emacs-unicode-2 branch:
--- european.el 26 Jul 2007 05:27:10 -0000 1.100
+++ european.el 25 Dec 2007 10:57:51 -0000 1.86.4.13
@@ -277,16 +414,15 @@
(set-language-info-alist
"German" '((tutorial . "TUTORIAL.de")
- (charset ascii latin-iso8859-1)
+ (charset iso-8859-1)
(coding-system iso-latin-1 iso-latin-9)
- (coding-priority iso-latin-1 windows-1252)
+ (coding-priority iso-latin-1)
+ (nonascii-translation . iso-8859-1)
(input-method . "german-postfix")
Bye, Reiner.
--
,,,
(o o)
---ooO-(_)-Ooo--- | PGP key available | http://rsteib.home.pages.de/