* Auto-detection of windows-1252 fails
@ 2008-01-05 13:22 Reiner Steib
2008-01-05 16:44 ` David De La Harpe Golden
[not found] ` <E1JBQZ5-0001Uy-TV@fencepost.gnu.org>
0 siblings, 2 replies; 4+ messages in thread
From: Reiner Steib @ 2008-01-05 13:22 UTC
To: emacs-devel
Hi,
in September/October 2006 we had a long thread on emacs-pretest-bugs
about auto-detection of windows-1252 text files:
Subject: local chars displayed as numbers
<http://thread.gmane.org/gmane.emacs.pretest.bugs/14020/>
[ I include a summary of this thread below. ]
windows-1252 files were supposed to be detected automatically in the
"Latin-1" and "German" language environments. This doesn't work
(anymore?) in Emacs 22.1, the Emacs_22 branch and in the trunk.
* Recipe to reproduce the problem:
$ echo -e '\x91 O:\xD6 o:\xF6 \x92' > w1252-O-o.txt
I.e., the file contains the following non-ASCII characters:
- LEFT SINGLE QUOTATION MARK (U+2018)
- LATIN CAPITAL LETTER O WITH DIAERESIS (U+00D6)
- LATIN SMALL LETTER O WITH DIAERESIS (U+00F6)
- RIGHT SINGLE QUOTATION MARK (U+2019)
$ file w1252-O-o.txt
w1252-O-o.txt: Non-ISO extended-ASCII text
When decoded correctly, it looks like this:
,----[ w1252-O-o.txt ]
| ‘ O:Ö o:ö ’
`----
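Note that "echo -e" with \x escapes depends on the shell in use. As a
sketch of a more portable variant of the recipe, printf with octal
escapes should write the same bytes, and od shows what ended up in the
file. The temporary file and the variable name "bytes" are only
illustrative and not part of the original recipe.

```shell
# Write the same eight-bit bytes with printf octal escapes, then dump
# them in hex to confirm the file content.
# \221 = 0x91, \326 = 0xD6, \366 = 0xF6, \222 = 0x92
f=$(mktemp)
printf '\221 O:\326 o:\366 \222\n' > "$f"

# od prints the bytes in hex; tr and sed only normalize the spacing.
bytes=$(od -An -tx1 "$f" | tr -s ' \n' ' ' | sed 's/^ //; s/ $//')
echo "$bytes"
rm -f "$f"
```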
* Expected result:
According to the discussion in 2006, this file should be recognized
as windows-1252 with the following command lines:
$ LC_ALL=de_DE emacs -Q w1252-O-o.txt
$ emacs -Q --eval '(set-language-environment "German")' w1252-O-o.txt
* Current result:
The file is opened in iso-8859-1, i.e. the left quotation mark is
displayed as \221 and the right quotation mark is detected as
eight-bit-control:
,----[ M-x describe-char RET ]
| character: \222 (146, #o222, #x92, U+0092)
| charset: eight-bit-control (8-bit control code (0x80..0x9F))
| code point: #x92
| syntax: which means: whitespace
| buffer code: #x92
| file code: not encodable by coding system iso-latin-1-unix
| display: by display table entry [?'] (see below)
|
| The display table entry is displayed by these fonts (glyph codes):
| ': -Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO8859-1 (#x27)
`----
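The two quotation marks are the bytes 0x91 and 0x92. In iso-8859-1 the
range 0x80..0x9F holds C1 control codes, whereas windows-1252 assigns
printable characters to most of that range, which is why these bytes
are the ones that give the file away. A rough way to count such bytes
outside Emacs, as a sketch only (the variable name "c1_count" is made
up for this example):

```shell
# Count bytes in the C1 range 0x80..0x9F (octal 200..237).
f=$(mktemp)
printf '\221 O:\326 o:\366 \222\n' > "$f"

# LC_ALL=C makes tr work on single bytes rather than on characters.
# "tr -cd SET" deletes everything that is NOT in SET.
c1_count=$(( $(LC_ALL=C tr -cd '\200-\237' < "$f" | wc -c) ))
echo "C1 bytes: $c1_count"
rm -f "$f"
```

A non-zero count means the file cannot be plain iso-8859-1 text without
control codes; it does not by itself prove that the file is
windows-1252.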
* Summary of the September/October 2006 discussion:
The following change was installed...
,----[ ChangeLog.12 ]
| 2006-09-21 Kenichi Handa <handa@m17n.org>
|
| * language/european.el ("Latin-1"): Add windows-1252 to
| coding-priority.
| ("German"): Likewise.
`----
... and was supposed to result in the following behavior:
Kenichi Handa wrote in
<http://article.gmane.org/gmane.emacs.pretest.bugs/14384>:
| A file containing a windows-1252 char that doesn't appear in
| iso-8859-1 is detected as windows-1252. Bad effect is that some (or
| many) binary files are also detected as windows-1252.
Some people pointed out that this may lead to the bad effect that some
(or many) binary files are also detected as windows-1252. Eli
suggested implementing null-byte detection, which should solve this
problem.
In <http://thread.gmane.org/gmane.emacs.pretest.bugs/14020/focus=14384>
Kenichi Handa wrote:
| Reiner Steib <reinersteib+gmane <at> imap.cc> writes:
|
| > (6) Implement null-byte detection (to prevent binary files
| > mis-detected as windows-12xx), keep the current code (windows-1252)
| > and add windows-1254/1255 accordingly.
|
| I think that change results in the best behavior.
... and Richard agreed on that. But I don't think this has been done.
("the current code" refers to the 2006-09-21 change, see above.)
In
<http://thread.gmane.org/gmane.emacs.pretest.bugs/14020/focus=14367> I
attached 3 simple test files and described the result:
,----
| I did some tests with (see attached auto-coding.tar.gz)...
|
| (a) a file containing only windows-1252 characters,
|
| (b) a file with some Latin-1 text plus "reserved characters"
| (i.e. chars not defined in windows-1252),
|
| (c) a file with some Latin-1 and windows-1252 text plus a null-byte.
|
| Emacs detected the files as:
|
| (a) windows-1252 (-> correct)
|
| (b) raw-text-unix (-> correct)
|
| (c) windows-1252 (-> slightly incorrect, at least for people who argue
| that binary is better here)
`----
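The idea behind null-byte detection can be illustrated outside Emacs.
The following is only a sketch of the heuristic, not the code in
src/coding.c; the helper "has_nul" and the two sample files are
hypothetical stand-ins for test files (a) and (c).

```shell
# Hypothetical helper: succeeds if the file contains at least one
# NUL byte. It compares the byte count before and after deleting NULs.
has_nul() {
  total=$(( $(wc -c < "$1") ))
  kept=$(( $(LC_ALL=C tr -d '\000' < "$1" | wc -c) ))
  [ "$total" -ne "$kept" ]
}

text=$(mktemp); bin=$(mktemp)
printf '\221 O:\326 \222\n' > "$text"       # like (a): windows-1252 only
printf '\221 O:\326 \000 \222\n' > "$bin"   # like (c): plus a null byte

has_nul "$text" && text_kind=binary || text_kind=text
has_nul "$bin" && bin_kind=binary || bin_kind=text
echo "$text_kind $bin_kind"
rm -f "$text" "$bin"
```

With such a check in front of the charset detection, a file like (c)
would be treated as binary instead of windows-1252.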
* Additionally, the addition of windows-1252 to "German" has been lost
in the emacs-unicode-2 branch:
--- european.el 26 Jul 2007 05:27:10 -0000 1.100
+++ european.el 25 Dec 2007 10:57:51 -0000 1.86.4.13
@@ -277,16 +414,15 @@
(set-language-info-alist
"German" '((tutorial . "TUTORIAL.de")
- (charset ascii latin-iso8859-1)
+ (charset iso-8859-1)
(coding-system iso-latin-1 iso-latin-9)
- (coding-priority iso-latin-1 windows-1252)
+ (coding-priority iso-latin-1)
+ (nonascii-translation . iso-8859-1)
(input-method . "german-postfix")
Bye, Reiner.
--
,,,
(o o)
---ooO-(_)-Ooo--- | PGP key available | http://rsteib.home.pages.de/
* Re: Auto-detection of windows-1252 fails
From: David De La Harpe Golden @ 2008-01-05 16:44 UTC
To: emacs-devel
[This is a more appropriate place to note this than the thread Reiner
and I were in on comp.emacs]:
Function set-coding-system-priority in src/coding.c
says it won't actually add two encodings of the
same "coding-system-category" to the priority list. And at least in
emacs-unicode2, iso-latin-1 and windows-1252 are apparently being put
in the same category, "coding-category-charset", unlike, say utf-8 or
sjis. So adding windows-1252 to coding-priority after iso-latin-1
won't do anything without changing more stuff.
main and emacs-unicode2's src/coding.c look like they differ quite a bit
at this stage, but on emacs-unicode2 (I personally don't have a cvs
head version around at the moment to check what it's doing):
(mapcar (lambda (x)
(cons x (coding-system-category x)))
(coding-system-list)) ;; or just '(windows-1252 iso-latin-1))
shows that a lot of encodings are ending up in coding-category-charset.
* Re: Auto-detection of windows-1252 fails
From: Kenichi Handa @ 2008-01-09 6:33 UTC
To: emacs-devel; +Cc: rms, reinersteib+gmane
In article <E1JBQZ5-0001Uy-TV@fencepost.gnu.org>, Richard Stallman <rms@gnu.org> writes:
> Can you please DTRT on this, and ack?
[...]
> From: Reiner Steib <reinersteib+gmane@imap.cc>
> Date: Sat, 05 Jan 2008 14:22:37 +0100
> Subject: Auto-detection of windows-1252 fails
[...]
> in September/October 2006 we had a long thread on emacs-pretest-bugs
> about auto-detection of windows-1252 text files:
> Subject: local chars displayed as numbers
> <http://thread.gmane.org/gmane.emacs.pretest.bugs/14020/>
> [ I include a summary of this thread below. ]
> windows-1252 files were supposed to be detected automatically in the
> "Latin-1" and "German" language environments. This doesn't work
> (anymore?) in Emacs 22.1, the Emacs_22 branch and in the trunk.
> * Summary of the September/October 2006 discussion:
> The following change was installed...
> ,----[ ChangeLog.12 ]
> | 2006-09-21 Kenichi Handa <handa@m17n.org>
> |
> | * language/european.el ("Latin-1"): Add windows-1252 to
> | coding-priority.
> | ("German"): Likewise.
> `----
> ... and was supposed to result in the following behavior:
> Kenichi Handa wrote in
> <http://article.gmane.org/gmane.emacs.pretest.bugs/14384>:
> | A file containing a windows-1252 char that doesn't appear in
> | iso-8859-1 is detected as windows-1252. Bad effect is that some (or
> | many) binary files are also detected as windows-1252.
> Some people pointed out that this may lead to the bad effect that some
> (or many) binary files are also detected as windows-1252. Eli
> suggested to implement null-byte detection which should solve this
> problem.
> In <http://thread.gmane.org/gmane.emacs.pretest.bugs/14020/focus=14384>
> Kenichi Handa wrote:
> | Reiner Steib <reinersteib+gmane <at> imap.cc> writes:
> |
> | > (6) Implement null-byte detection (to prevent binary files
> | > mis-detected as windows-12xx), keep the current code (windows-1252)
> | > and add windows-1254/1255 accordingly.
> |
> | I think that change results in the best behavior.
> ... and Richard agreed on that. But I don't think this has been done.
> ("the current code" refers to the 2006-09-21 change, see above.)
I've just installed the null-byte detection code and some
improvements to the handling of latin-extra-code-table in the trunk.
Could you please test the latest code?
> | > and add windows-1254/1255 accordingly.
I've not yet done that. Could someone tell me which to add
where?
> * Additionally, the addition of windows-1252 to "German" has been lost
> in the emacs-unicode-2 branch:
> --- european.el 26 Jul 2007 05:27:10 -0000 1.100
> +++ european.el 25 Dec 2007 10:57:51 -0000 1.86.4.13
> @@ -277,16 +414,15 @@
> (set-language-info-alist
> "German" '((tutorial . "TUTORIAL.de")
> - (charset ascii latin-iso8859-1)
> + (charset iso-8859-1)
> (coding-system iso-latin-1 iso-latin-9)
> - (coding-priority iso-latin-1 windows-1252)
> + (coding-priority iso-latin-1)
> + (nonascii-translation . iso-8859-1)
> (input-method . "german-postfix")
Oops, I don't know why that change was lost. I'll fix it
soon, together with the equivalent changes for null-byte
detection and the improved latin-extra-code-table handling.
---
Kenichi Handa
handa@ni.aist.go.jp
* Re: Auto-detection of windows-1252 fails
From: Reiner Steib @ 2008-01-14 20:58 UTC
To: Kenichi Handa; +Cc: reinersteib+gmane, rms, emacs-devel
On Wed, Jan 09 2008, Kenichi Handa wrote:
>> | > (6) Implement null-byte detection (to prevent binary files
>> | > mis-detected as windows-12xx), keep the current code (windows-1252)
>> | > and add windows-1254/1255 accordingly.
[...]
> I've just installed the null-byte detection code and some
> improvement on handling latin-extra-code-table in the trunk.
> Could you please test the latest code?
Works for me as expected. I've done some tests with de_DE.utf-8,
de_DE, ...
>> (set-language-info-alist
>> "German" '((tutorial . "TUTORIAL.de")
>> - (charset ascii latin-iso8859-1)
>> + (charset iso-8859-1)
>> (coding-system iso-latin-1 iso-latin-9)
>> - (coding-priority iso-latin-1 windows-1252)
>> + (coding-priority iso-latin-1)
>> + (nonascii-translation . iso-8859-1)
>> (input-method . "german-postfix")
>
> Oops, I don't know why that change was lost. I'll fix it
> soon as well as the equivalent change for null-byte
> detection and latin-extra-code-table handling improvement.
Thanks.
Bye, Reiner.
--
,,,
(o o)
---ooO-(_)-Ooo--- | PGP key available | http://rsteib.home.pages.de/