unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#2354: 23.0.90; Emacs fails to detect utf-8 encoding with language environment Latin-1
@ 2009-02-17 10:35 ` David Engster
  2009-02-17 16:45   ` Juanma Barranquero
  2009-02-28 12:30   ` bug#2354: marked as done (23.0.90; Emacs fails to detect utf-8 encoding with language environment Latin-1) Emacs bug Tracking System
  0 siblings, 2 replies; 41+ messages in thread
From: David Engster @ 2009-02-17 10:35 UTC (permalink / raw)
  To: bug-gnu-emacs

This is what I believe to be a regression in CVS Emacs since the
23.0.90 pretest. I'm using a fresh CVS checkout from 2009-02-17,
compiled with 'make bootstrap'.

You can reproduce it as follows:

1. emacs -Q
2. M-x set-language-environment RET Latin-1 RET
3. In some buffer write:

 (ucs-insert "2500")

4. Eval it, so that the unicode character is inserted into the buffer.
5. Save the file and choose utf-8 as encoding.
6. Kill the buffer.
7. Load the file you just saved.

Result: Emacs displays "â\224\200" for the unicode character.

Expected behaviour: Emacs should detect utf-8 encoding and display
correct character.
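
The displayed result is what one would expect if the file's UTF-8 bytes
were read as Latin-1. As a minimal sketch (the helper name utf8_encode
is made up for illustration and is not Emacs code), encoding U+2500 by
hand gives the bytes E2 94 80. Read as Latin-1, 0xE2 is "â", and the
other two bytes have no printable Latin-1 glyph, so they show up as the
octal escapes \224 and \200:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   out must have room for 4 bytes.  Returns the number of bytes
   written.  Surrogates are not rejected; this is only a sketch. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char) (0xF0 | (cp >> 18));
    out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char) (0x80 | (cp & 0x3F));
    return 4;
}
```

Calling utf8_encode(0x2500, buf) fills buf with 0xE2 0x94 0x80, which
is \342 \224 \200 in octal.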

Please note that this has worked without problems with the Emacs
23.0.90 pretest, so it must be due to some change(s) since then in CVS.

In GNU Emacs 23.0.90.1 (i686-pc-linux-gnu, GTK+ Version 2.12.11)
 of 2009-02-17 on void
Windowing system distributor `The X.Org Foundation', version 11.0.10402000
configured using `configure  '--prefix=/usr/local/emacs''

Important settings:
  value of $LC_ALL: nil
  value of $LC_COLLATE: nil
  value of $LC_CTYPE: nil
  value of $LC_MESSAGES: nil
  value of $LC_MONETARY: nil
  value of $LC_NUMERIC: nil
  value of $LC_TIME: nil
  value of $LANG: nil
  value of $XMODIFIERS: nil
  locale-coding-system: nil
  default-enable-multibyte-characters: t

Major mode: Lisp Interaction

Minor modes in effect:
  tooltip-mode: t
  tool-bar-mode: t
  mouse-wheel-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  global-auto-composition-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent input:
M-x r e p o <tab> r <tab> C-g M-x s e t - l a n <tab> 
<return> L a t i n w <backspace> - w <return> <backspace> 
1 <return> M-x r e p o <tab> r <tab> <return>

Recent messages:
For information about GNU Emacs and the GNU system, type C-h C-a.
Making completion list...
Quit
Making completion list...








* bug#2354: 23.0.90; Emacs fails to detect utf-8 encoding with language environment Latin-1
  2009-02-17 10:35 ` bug#2354: 23.0.90; Emacs fails to detect utf-8 encoding with language environment Latin-1 David Engster
@ 2009-02-17 16:45   ` Juanma Barranquero
  2009-02-17 18:04     ` David Engster
  2009-02-28 12:30   ` bug#2354: marked as done (23.0.90; Emacs fails to detect utf-8 encoding with language environment Latin-1) Emacs bug Tracking System
  1 sibling, 1 reply; 41+ messages in thread
From: Juanma Barranquero @ 2009-02-17 16:45 UTC (permalink / raw)
  To: David Engster; +Cc: 2354

On Tue, Feb 17, 2009 at 11:35, David Engster <deng@randomsample.de> wrote:

> You can reproduce it as follows:
>
> 1. emacs -Q
> 2. M-x set-language-environment RET Latin-1 RET
> 3. In some buffer write:
>
>  (ucs-insert "2500")
>
> 4. Eval it, so that the unicode character is inserted into the buffer.
> 5. Save the file and choose utf-8 as encoding.
> 6. Kill the buffer.
> 7. Load the file you just saved.
>
> Result: Emacs displays "â\224\200" for the unicode character.

I cannot reproduce it on Windows with the current trunk. The file's
coding is correctly detected as UTF-8.

    Juanma







* bug#2354: 23.0.90; Emacs fails to detect utf-8 encoding with language environment Latin-1
  2009-02-17 16:45   ` Juanma Barranquero
@ 2009-02-17 18:04     ` David Engster
  0 siblings, 0 replies; 41+ messages in thread
From: David Engster @ 2009-02-17 18:04 UTC (permalink / raw)
  To: Juanma Barranquero; +Cc: 2354

Juanma Barranquero <lekktu@gmail.com> writes:
> On Tue, Feb 17, 2009 at 11:35, David Engster <deng@randomsample.de> wrote:
>
>> You can reproduce it as follows:
>>
>> 1. emacs -Q
>> 2. M-x set-language-environment RET Latin-1 RET
>> 3. In some buffer write:
>>
>>  (ucs-insert "2500")
>>
>> 4. Eval it, so that the unicode character is inserted into the buffer.
>> 5. Save the file and choose utf-8 as encoding.
>> 6. Kill the buffer.
>> 7. Load the file you just saved.
>>
>> Result: Emacs displays "â\224\200" for the unicode character.
>
> I cannot reproduce it on Windows with the current trunk. The file's
> coding is correctly detected as UTF-8.

Thank you for looking into this. I tested this now again on a different
machine, but also running GNU/Linux (Ubuntu 8.10), with the same
result. FWIW, I think I could track down this issue to the following
commit for src/coding.c:

revision 1.413
date: 2009-02-09 01:42:37 +0100;  author: handa;  state: Exp;  lines: +1 -1;  commitid: WAhpeD8cqX926HBt;
(detect_coding_charset): Fix previous change.

With revision 1.412 of coding.c, the error disappears for me.

-David







* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
@ 2009-02-27 14:10 ` Uwe Siart
  2009-02-27 16:03   ` Eli Zaretskii
                     ` (4 more replies)
  0 siblings, 5 replies; 41+ messages in thread
From: Uwe Siart @ 2009-02-27 14:10 UTC (permalink / raw)
  To: emacs-pretest-bug

I'm using the windows port of 23.0.91 on Win2k SP4 and I found that it
fails to read utf-8 encoded files correctly. When visiting a file in
utf-8 encoding all characters above 255 are screwed up and "C-h C RET"
indicates iso-latin1-dos for saving the file. This has not been an
issue in 23.0.90.

-- 
Uwe


In GNU Emacs 23.0.91.1 (i386-mingw-nt5.0.2195)
 of 2009-02-27 on SOFT-MJASON
Windowing system distributor `Microsoft Corp.', version 5.0.2195
configured using `configure --with-gcc (3.4)'

Important settings:
  value of $LC_ALL: nil
  value of $LC_COLLATE: nil
  value of $LC_CTYPE: nil
  value of $LC_MESSAGES: nil
  value of $LC_MONETARY: nil
  value of $LC_NUMERIC: nil
  value of $LC_TIME: nil
  value of $LANG: DEU
  value of $XMODIFIERS: nil
  locale-coding-system: cp1252
  default-enable-multibyte-characters: t

Major mode: Lisp Interaction

Minor modes in effect:
  iswitchb-mode: t
  display-time-mode: t
  auto-insert-mode: t
  diff-auto-refine-mode: t
  delete-selection-mode: t
  pc-selection-mode: t
  tooltip-mode: t
  mouse-wheel-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  global-auto-composition-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  column-number-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent input:
M-x r e <tab> p o <tab> r t <tab> <return>

Recent messages:
Loading time...done
Loading iswitchb...done
For information about GNU Emacs and the GNU system, type C-h C-a.
Making completion list... [2 times]







* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-27 14:10 ` bug#2497: 23.0.91; Fails to read UTF-8 on Win2k Uwe Siart
@ 2009-02-27 16:03   ` Eli Zaretskii
  2009-02-27 16:48     ` Uwe Siart
  2009-02-27 16:11   ` Juanma Barranquero
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 41+ messages in thread
From: Eli Zaretskii @ 2009-02-27 16:03 UTC (permalink / raw)
  To: uwe.siart, 2497

> Date: Fri, 27 Feb 2009 15:10:19 +0100
> From: Uwe Siart <uwe.siart@tum.de>
> Cc: 
> 
> I'm using the windows port of 23.0.91 on Win2k SP4 and I found that it
> fails to read utf-8 encoded files correctly. When visiting a file in
> utf-8 encoding all characters above 255 are screwed up and "C-h C RET"
> indicates iso-latin1-dos for saving the file.

Does it work with "C-x RET c utf-8 RET" immediately prior to
"C-x C-f"?  If it does, then the problem is with guessing the
encoding, not with decoding it.

Also, what is the default value of buffer-file-coding-system, and was
it the same in 23.0.90?







* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-27 14:10 ` bug#2497: 23.0.91; Fails to read UTF-8 on Win2k Uwe Siart
  2009-02-27 16:03   ` Eli Zaretskii
@ 2009-02-27 16:11   ` Juanma Barranquero
  2009-02-27 16:16     ` Juanma Barranquero
                       ` (2 more replies)
  2009-02-27 17:46   ` David Engster
                     ` (2 subsequent siblings)
  4 siblings, 3 replies; 41+ messages in thread
From: Juanma Barranquero @ 2009-02-27 16:11 UTC (permalink / raw)
  To: uwe.siart; +Cc: 2497

On Fri, Feb 27, 2009 at 15:10, Uwe Siart <uwe.siart@tum.de> wrote:

> I'm using the windows port of 23.0.91 on Win2k SP4 and I found that it
> fails to read utf-8 encoded files correctly. When visiting a file in
> utf-8 encoding all characters above 255 are screwed up and "C-h C RET"
> indicates iso-latin1-dos for saving the file. This has not been an
> issue in 23.0.90.

Do you have a specific example of a UTF-8 encoded file that was detected
as UTF-8 in 23.0.90 but is detected as Latin-1 in 23.0.91?

For example, I create a UTF-8 file (without UTF-8 byte-order-mark
"signature") with just the following contents:

cañón

And 23.0.90 also thinks it is Latin-1.

That said, if you need UTF-8 to be given more priority than Latin-1,
etc, you can use `set-coding-system-priority' in your .emacs.

    Juanma







* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-27 16:11   ` Juanma Barranquero
@ 2009-02-27 16:16     ` Juanma Barranquero
  2009-02-27 16:27       ` Uwe Siart
  2009-02-27 16:23     ` Uwe Siart
  2009-02-27 17:02     ` Leo
  2 siblings, 1 reply; 41+ messages in thread
From: Juanma Barranquero @ 2009-02-27 16:16 UTC (permalink / raw)
  To: uwe.siart; +Cc: 2497

On Fri, Feb 27, 2009 at 17:11, Juanma Barranquero <lekktu@gmail.com> wrote:

> cañón
>
> And 23.0.90 also thinks it is Latin-1.

Just to be clear: of course "cañón" is Latin-1. What I mean is that
emacs 23.0.90 also reads the byte representation of "cañón" in UTF-8,
that is:

  0000000 63 61 c3 b1 c3 b3 6e

and interprets it as Latin-1: caÃ±Ã³n
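
To make the byte-level effect concrete, here is a small sketch (the
function name latin1_to_utf8 is invented for this note, it is not an
Emacs function). It treats every input byte as a Latin-1 character and
writes that character out as UTF-8, which is what a misdetection as
Latin-1 amounts to when the text is later shown on a UTF-8 display:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: re-encode bytes that were read as Latin-1
   into UTF-8.  Each Latin-1 byte is the code point of the same
   value, so bytes >= 0x80 become two-byte UTF-8 sequences.
   out must have room for 2 * n bytes.  Returns bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t j = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0x80) {
            out[j++] = in[i];
        } else {
            out[j++] = (unsigned char) (0xC0 | (in[i] >> 6));
            out[j++] = (unsigned char) (0x80 | (in[i] & 0x3F));
        }
    }
    return j;
}
```

Fed the seven bytes 63 61 c3 b1 c3 b3 6e, it produces eleven bytes:
63 61 c3 83 c2 b1 c3 83 c2 b3 6e. Each original two-byte sequence has
turned into two separate characters (Ã followed by ± or ³).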

    Juanma







* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-27 16:11   ` Juanma Barranquero
  2009-02-27 16:16     ` Juanma Barranquero
@ 2009-02-27 16:23     ` Uwe Siart
  2009-02-27 16:38       ` Juanma Barranquero
  2009-02-27 17:02     ` Leo
  2 siblings, 1 reply; 41+ messages in thread
From: Uwe Siart @ 2009-02-27 16:23 UTC (permalink / raw)
  To: Juanma Barranquero; +Cc: 2497

Juanma Barranquero <lekktu@gmail.com> writes:

> On Fri, Feb 27, 2009 at 15:10, Uwe Siart <uwe.siart@tum.de> wrote:
>
>> I'm using the windows port of 23.0.91 on Win2k SP4 and I found that it
>> fails to read utf-8 encoded files correctly. When visiting a file in
>> utf-8 encoding all characters above 255 are screwed up and "C-h C RET"
>> indicates iso-latin1-dos for saving the file. This has not been an
>> issue in 23.0.90.
>
> Do you have a specific example of a UTF-8 coded file that was detected
> as UTF-8 in 23.0.90 and it is detected as Latin-1 in 23.0.91?

Yes. My .gnus.el: <http://www.siart.de/etc/.gnus.el>

I hope, the webserver delivers it in utf-8 encoding.

> For example, I create a UTF-8 file (without UTF-8 byte-order-mark
> "signature") with just the following contents:
>
> cañón
>
> And 23.0.90 also thinks it is Latin-1.

Maybe because it can be encoded in latin-1. That would be ok for me. But
my .gnus.el contains symbols (arrows for the summary buffer) that are
definitely not included in latin-1 but 23.0.91 recognises latin-1.

-- 
Uwe







* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-27 16:16     ` Juanma Barranquero
@ 2009-02-27 16:27       ` Uwe Siart
  2009-02-27 16:32         ` Juanma Barranquero
  0 siblings, 1 reply; 41+ messages in thread
From: Uwe Siart @ 2009-02-27 16:27 UTC (permalink / raw)
  To: Juanma Barranquero; +Cc: 2497

Juanma Barranquero <lekktu@gmail.com> writes:

> Just to be clear: of course "cañón" is Latin-1. What I mean is that
> emacs 23.0.90 also reads the byte representation of "cañón" in UTF-8,
> that is:
>
>   0000000 63 61 c3 b1 c3 b3 6e
>
> and interprets it as Latin-1: caÃ±Ã³n

I tried this out in 23.0.90 in the following way:

- mark "cañón" from your mail
- create empty file with 'touch t.txt'
- visit t.txt and yank cañón
- save t.txt
- visit t.txt

and get the correct result (cañón, not caÃ±Ã³n)

-- 
Uwe







* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-27 16:27       ` Uwe Siart
@ 2009-02-27 16:32         ` Juanma Barranquero
  0 siblings, 0 replies; 41+ messages in thread
From: Juanma Barranquero @ 2009-02-27 16:32 UTC (permalink / raw)
  To: uwe.siart; +Cc: 2497

On Fri, Feb 27, 2009 at 17:27, Uwe Siart <uwe.siart@tum.de> wrote:

> I tried this out in 23.0.90 in the following way:
>
> - mark "cañón" from your mail
> - create empty file with 'touch t.txt'
> - visit t.txt and yank cañón
> - save t.txt
> - visit t.txt
>
> and get the correct result (cañón, not caÃ±Ã³n)

Of course: you've created a file t.txt encoded in Latin-1, not UTF-8.

    Juanma







* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-27 16:23     ` Uwe Siart
@ 2009-02-27 16:38       ` Juanma Barranquero
  2009-02-27 18:19         ` Eli Zaretskii
  0 siblings, 1 reply; 41+ messages in thread
From: Juanma Barranquero @ 2009-02-27 16:38 UTC (permalink / raw)
  To: uwe.siart; +Cc: 2497

On Fri, Feb 27, 2009 at 17:23, Uwe Siart <uwe.siart@tum.de> wrote:

> Yes. My .gnus.el: <http://www.siart.de/etc/.gnus.el>

Aha, yes, the bug is reproducible.

> I hope, the webserver delivers it in utf-8 encoding.

Yes. Emacs 23.0.90 opens it as utf-8, as does Notepad2.

> But
> my .gnus.el contains symbols (arrows for the summary buffer) that are
> definitely not included in latin-1 but 23.0.91 recognises latin-1.


    Juanma







* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-27 16:03   ` Eli Zaretskii
@ 2009-02-27 16:48     ` Uwe Siart
  2009-02-27 18:19       ` Eli Zaretskii
  0 siblings, 1 reply; 41+ messages in thread
From: Uwe Siart @ 2009-02-27 16:48 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 2497

Eli Zaretskii <eliz@gnu.org> writes:

>> Date: Fri, 27 Feb 2009 15:10:19 +0100
>> From: Uwe Siart <uwe.siart@tum.de>
>> Cc: 
>> 
>> I'm using the windows port of 23.0.91 on Win2k SP4 and I found that it
>> fails to read utf-8 encoded files correctly. When visiting a file in
>> utf-8 encoding all characters above 255 are screwed up and "C-h C RET"
>> indicates iso-latin1-dos for saving the file.
>
> Does it work with "C-x RET c utf-8 RET" immediately prior to
> "C-x C-f"?

It works with "C-x RET c utf-8 RET" immediately prior to "C-x C-f".

> If it does, then the problem is with guessing the encoding, not with
> decoding it.

That's also my impression.

> Also, what is the default value of buffer-file-coding-system, and was
> it the same in 23.0.90?

iso-latin-1-dos in 23.0.90 and in 23.0.91.

-- 
Uwe







* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-27 16:11   ` Juanma Barranquero
  2009-02-27 16:16     ` Juanma Barranquero
  2009-02-27 16:23     ` Uwe Siart
@ 2009-02-27 17:02     ` Leo
  2 siblings, 0 replies; 41+ messages in thread
From: Leo @ 2009-02-27 17:02 UTC (permalink / raw)
  To: bug-gnu-emacs

On 2009-02-27 16:11 +0000, Juanma Barranquero wrote:
> On Fri, Feb 27, 2009 at 15:10, Uwe Siart <uwe.siart@tum.de> wrote:
>
>> I'm using the windows port of 23.0.91 on Win2k SP4 and I found that it
>> fails to read utf-8 encoded files correctly. When visiting a file in
>> utf-8 encoding all characters above 255 are screwed up and "C-h C RET"
>> indicates iso-latin1-dos for saving the file. This has not been an
>> issue in 23.0.90.
>
> Do you have a specific example of a UTF-8 coded file that was detected
> as UTF-8 in 23.0.90 and it is detected as Latin-1 in 23.0.91?
>
> For example, I create a UTF-8 file (without UTF-8 byte-order-mark
> "signature") with just the following contents:
>
> cañón
>
> And 23.0.90 also thinks it is Latin-1.
>
> That said, if you need UTF-8 to be given more priority than Latin-1,
> etc, you can use `set-coding-system-priority' in your .emacs.
>
>     Juanma

I have the following code in my .emacs when I changed to w32 last
June. So the problem might exist longer.

;;; FIXME: find out why GNU/Linux does not need this
(prefer-coding-system 'utf-8)

I just tested some Chinese files. Without that line, all of them are
being opened in latin-1 encoding and are unreadable.

Tested in GNU Emacs 23.0.91.1 (i386-mingw-nt5.1.2600) of 2009-02-26

-- 
.:  Leo  :.  [ sdl.web AT gmail.com ]  .: I use Emacs :.









* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-27 14:10 ` bug#2497: 23.0.91; Fails to read UTF-8 on Win2k Uwe Siart
  2009-02-27 16:03   ` Eli Zaretskii
  2009-02-27 16:11   ` Juanma Barranquero
@ 2009-02-27 17:46   ` David Engster
  2009-02-27 21:15     ` Uwe Siart
  2009-02-28  1:32     ` Jason Rumney
  2009-02-27 23:34   ` bug#2497: 23.0.91; Fails to read UTF-8 on Windows2k Richard M Stallman
  2009-02-28 12:30   ` bug#2497: marked as done (23.0.91; Fails to read UTF-8 on Win2k) Emacs bug Tracking System
  4 siblings, 2 replies; 41+ messages in thread
From: David Engster @ 2009-02-27 17:46 UTC (permalink / raw)
  To: uwe.siart; +Cc: emacs-pretest-bug, 2497

Uwe Siart <uwe.siart@tum.de> writes:
> I'm using the windows port of 23.0.91 on Win2k SP4 and I found that it
> fails to read utf-8 encoded files correctly. When visiting a file in
> utf-8 encoding all characters above 255 are screwed up and "C-h C RET"
> indicates iso-latin1-dos for saving the file. This has not been an
> issue in 23.0.90.

Maybe this is a duplicate of what I reported in

http://emacsbugs.donarmstrong.com/cgi-bin/bugreport.cgi?bug=2354

As I write later in that bug report, I think I could track down this
issue to the change in revision 1.413 of src/coding.c. Maybe you could
try if the same applies to your problem.

-David







* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-27 16:48     ` Uwe Siart
@ 2009-02-27 18:19       ` Eli Zaretskii
  2009-02-27 20:35         ` Uwe Siart
  2009-02-28  4:40         ` Stefan Monnier
  0 siblings, 2 replies; 41+ messages in thread
From: Eli Zaretskii @ 2009-02-27 18:19 UTC (permalink / raw)
  To: uwe.siart; +Cc: 2497

> From: Uwe Siart <uwe.siart@tum.de>
> Cc: 2497@emacsbugs.donarmstrong.com
> Date: Fri, 27 Feb 2009 17:48:15 +0100
> 
> It works with "C-x RET c utf-8 RET" immediately prior to "C-x C-f".
> 
> > If it does, then the problem is with guessing the encoding, not with
> > decoding it.
> 
> That's also my impression.
> 
> > Also, what is the default value of buffer-file-coding-system, and was
> > it the same in 23.0.90?
> 
> iso-latin-1-dos in 23.0.90 and in 23.0.91.

Then you shouldn't expect Emacs to guess UTF-8 encoding correctly in
every single instance.  Distinguishing between UTF-8 and Latin-1 is
generally impossible with the current state of the art of coded
character sets support in Emacs.  It might work in certain cases, but
that's sheer luck.

One way to work around that in your specific case, without changing
your global defaults, is to add a `coding:' cookie to your .gnus.el
file.







* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-27 16:38       ` Juanma Barranquero
@ 2009-02-27 18:19         ` Eli Zaretskii
  2009-02-27 20:38           ` Juanma Barranquero
  2009-02-28  1:29           ` Jason Rumney
  0 siblings, 2 replies; 41+ messages in thread
From: Eli Zaretskii @ 2009-02-27 18:19 UTC (permalink / raw)
  To: Juanma Barranquero, 2497; +Cc: uwe.siart

> Date: Fri, 27 Feb 2009 17:38:37 +0100
> From: Juanma Barranquero <lekktu@gmail.com>
> Cc: 2497@emacsbugs.donarmstrong.com
> 
> On Fri, Feb 27, 2009 at 17:23, Uwe Siart <uwe.siart@tum.de> wrote:
> 
> > Yes. My .gnus.el: <http://www.siart.de/etc/.gnus.el>
> 
> Aha, yes, the bug is reproducible.

Which bug?







* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-27 18:19       ` Eli Zaretskii
@ 2009-02-27 20:35         ` Uwe Siart
  2009-02-28  4:40         ` Stefan Monnier
  1 sibling, 0 replies; 41+ messages in thread
From: Uwe Siart @ 2009-02-27 20:35 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 2497

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Uwe Siart <uwe.siart@tum.de>
>> iso-latin-1-dos in 23.0.90 and in 23.0.91.
>
> Then you shouldn't expect Emacs to guess UTF-8 encoding correctly in
> every single instance. Distinguishing between UTF-8 and Latin-1 is
> generally impossible with the current state of the art of coded
> character sets support in Emacs. It might work in certain cases, but
> that's sheer luck.

I do not have the background knowledge to join in this conversation but
I just observed that it worked correctly for years now (even with CVS
Emacsen prior to the 22.1 release) and that it stopped working in
23.0.91. If it appears that this is not a bug then I will take the
measures you suggested and set a utf-8 cookie in all files concerned.

-- 
Uwe







* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-27 18:19         ` Eli Zaretskii
@ 2009-02-27 20:38           ` Juanma Barranquero
  2009-02-28  1:29           ` Jason Rumney
  1 sibling, 0 replies; 41+ messages in thread
From: Juanma Barranquero @ 2009-02-27 20:38 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 2497, uwe.siart

On Fri, Feb 27, 2009 at 19:19, Eli Zaretskii <eliz@gnu.org> wrote:

>> Aha, yes, the bug is reproducible.
>
> Which bug?

I mean, the fact that the given .gnus.el file was read as utf-8-dos in
23.0.90 and as iso-latin1-dos in 23.0.91 (with characters that are not
latin-1).

    Juanma







* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-27 17:46   ` David Engster
@ 2009-02-27 21:15     ` Uwe Siart
  2009-02-28  1:32     ` Jason Rumney
  1 sibling, 0 replies; 41+ messages in thread
From: Uwe Siart @ 2009-02-27 21:15 UTC (permalink / raw)
  To: 2497; +Cc: emacs-pretest-bug

David Engster <deng@randomsample.de> writes:

> Maybe this is a duplicate of what I reported in
>
> http://emacsbugs.donarmstrong.com/cgi-bin/bugreport.cgi?bug=2354
>
> As I write later in that bug report, I think I could track down this
> issue to the change in revision 1.413 of src/coding.c. Maybe you could
> try if the same applies to your problem.

At least I can reproduce it and it seems to be the very same thing that
I stumbled across. But due to lack of detailed knowledge about coding
recognition I'm unable to join the discussion whether this is a bug or
not. It's just that I felt more comfortable about the previous state.

So far I got things back to work with

;; -*- coding:utf-8-dos; -*-

as the first line of my .gnus.el :-)

-- 
Uwe







* bug#2497: 23.0.91; Fails to read UTF-8 on Windows2k
  2009-02-27 14:10 ` bug#2497: 23.0.91; Fails to read UTF-8 on Win2k Uwe Siart
                     ` (2 preceding siblings ...)
  2009-02-27 17:46   ` David Engster
@ 2009-02-27 23:34   ` Richard M Stallman
  2009-02-28  9:47     ` Uwe Siart
  2009-02-28 12:30   ` bug#2497: marked as done (23.0.91; Fails to read UTF-8 on Win2k) Emacs bug Tracking System
  4 siblings, 1 reply; 41+ messages in thread
From: Richard M Stallman @ 2009-02-27 23:34 UTC (permalink / raw)
  To: uwe.siart, 2497; +Cc: emacs-pretest-bug

Please don't call that system "Win"--that name implies praise.






* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-27 18:19         ` Eli Zaretskii
  2009-02-27 20:38           ` Juanma Barranquero
@ 2009-02-28  1:29           ` Jason Rumney
  1 sibling, 0 replies; 41+ messages in thread
From: Jason Rumney @ 2009-02-28  1:29 UTC (permalink / raw)
  To: Eli Zaretskii, 2497; +Cc: Juanma Barranquero, uwe.siart

Eli Zaretskii wrote:
>> Date: Fri, 27 Feb 2009 17:38:37 +0100
>> From: Juanma Barranquero <lekktu@gmail.com>
>> Cc: 2497@emacsbugs.donarmstrong.com
>>
>> On Fri, Feb 27, 2009 at 17:23, Uwe Siart <uwe.siart@tum.de> wrote:
>>
>>     
>>> Yes. My .gnus.el: <http://www.siart.de/etc/.gnus.el>
>>>       
>> Aha, yes, the bug is reproducible.
>>     
>
> Which bug?
>   

The one where the OP's .gnus.el contains characters which were correctly 
detected as UTF-8 in 23.0.90, but now appear as \200\224 octal escapes, 
as the file is incorrectly detected as Latin-1.








* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-27 17:46   ` David Engster
  2009-02-27 21:15     ` Uwe Siart
@ 2009-02-28  1:32     ` Jason Rumney
  2009-02-28  1:35       ` Processed (with 5 errors): " Emacs bug Tracking System
  1 sibling, 1 reply; 41+ messages in thread
From: Jason Rumney @ 2009-02-28  1:32 UTC (permalink / raw)
  To: David Engster, 2497

merge 2354 2497

David Engster wrote:
> Maybe this is a duplicate of what I reported in
>
> http://emacsbugs.donarmstrong.com/cgi-bin/bugreport.cgi?bug=2354
>   

It seems so, yes.








* Processed (with 5 errors): Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-28  1:32     ` Jason Rumney
@ 2009-02-28  1:35       ` Emacs bug Tracking System
  0 siblings, 0 replies; 41+ messages in thread
From: Emacs bug Tracking System @ 2009-02-28  1:35 UTC (permalink / raw)
  To: Jason Rumney; +Cc: Emacs Bugs

Processing commands for control@emacsbugs.donarmstrong.com:

> merge 2354 2497
bug#2354: 23.0.90; Emacs fails to detect utf-8 encoding with language environment Latin-1
bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Merged 2354 2497.

> David Engster wrote:
Unknown command or malformed arguments to command.

> > Maybe this is a duplicate of what I reported in
Unknown command or malformed arguments to command.

> >
Unknown command or malformed arguments to command.

> > http://emacsbugs.donarmstrong.com/cgi-bin/bugreport.cgi?bug=2354
Unknown command or malformed arguments to command.

> >
Unknown command or malformed arguments to command.

Too many unknown commands, stopping here.

Please contact me if you need assistance.

Don Armstrong
(administrator, Emacs bugs database)





* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-27 18:19       ` Eli Zaretskii
  2009-02-27 20:35         ` Uwe Siart
@ 2009-02-28  4:40         ` Stefan Monnier
  2009-02-28  8:17           ` Uwe Siart
  2009-02-28 10:49           ` Eli Zaretskii
  1 sibling, 2 replies; 41+ messages in thread
From: Stefan Monnier @ 2009-02-28  4:40 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 2497, uwe.siart

>> It works with "C-x RET c utf-8 RET" immediately prior to "C-x C-f".
>> > If it does, then the problem is with guessing the encoding, not with
>> > decoding it.
>> That's also my impression.
>> > Also, what is the default value of buffer-file-coding-system, and was
>> > it the same in 23.0.90?
>> iso-latin-1-dos in 23.0.90 and in 23.0.91.
> Then you shouldn't expect Emacs to guess UTF-8 encoding correctly in
> every single instance.  Distinguishing between UTF-8 and Latin-1 is

The guessing shouldn't give priority to buffer-file-coding-system.
We have set-coding-system-priority for that instead.
And IIUC utf-8 should always have a pretty high priority since false
positives are fairly rare.  So this still looks like a real bug.
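
Why false positives are rare can be illustrated with a well-formedness
check. The following is only a sketch under stated assumptions: the
function is_valid_utf8 is hypothetical, it is not the detector in
src/coding.c, and Emacs's real detection logic may differ. The point is
that Latin-1 text containing non-ASCII characters almost never happens
to satisfy the UTF-8 rules for lead and continuation bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check: return 1 if the n bytes at s are well-formed
   UTF-8, 0 otherwise.  Rejects stray continuation bytes, truncated
   sequences, overlong forms, surrogates and values above U+10FFFF. */
int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    assert(s != NULL || n == 0);
    while (i < n) {
        unsigned char b = s[i];
        size_t need;        /* continuation bytes expected */
        uint32_t cp, min;   /* decoded value, smallest legal value */

        if (b < 0x80) {
            i++;
            continue;
        } else if (b >= 0xC2 && b <= 0xDF) {
            need = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            need = 2; cp = b & 0x0F; min = 0x800;
        } else if (b >= 0xF0 && b <= 0xF4) {
            need = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return 0;       /* stray continuation or illegal lead byte */
        }
        if (n - i - 1 < need)
            return 0;       /* sequence runs past the end of the data */
        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;   /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += need + 1;
    }
    return 1;
}
```

With this check, the UTF-8 bytes of "cañón" (63 61 c3 b1 c3 b3 6e) pass,
while the Latin-1 bytes of the same word (63 61 f1 f3 6e) fail, because
0xF1 would have to be followed by three continuation bytes.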


        Stefan








* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-28  4:40         ` Stefan Monnier
@ 2009-02-28  8:17           ` Uwe Siart
  2009-02-28 10:14             ` David Engster
  2009-02-28 22:00             ` Stefan Monnier
  2009-02-28 10:49           ` Eli Zaretskii
  1 sibling, 2 replies; 41+ messages in thread
From: Uwe Siart @ 2009-02-28  8:17 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 2497

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> The guessing shouldn't give priority to buffer-file-coding-system.
> We have set-coding-system-priority for that instead. And IIUC utf-8
> should always have a pretty high priority since false positives are
> fairly rare. So this still looks like a real bug.

Here I would like to note that I never had false positives in the past
(before 23.0.91) but I do have false positives now. Therefore I'm
inclined to call it a bug.

-- 
Uwe







* bug#2497: 23.0.91; Fails to read UTF-8 on Windows2k
  2009-02-27 23:34   ` bug#2497: 23.0.91; Fails to read UTF-8 on Windows2k Richard M Stallman
@ 2009-02-28  9:47     ` Uwe Siart
  2009-02-28 18:08       ` Richard M Stallman
  0 siblings, 1 reply; 41+ messages in thread
From: Uwe Siart @ 2009-02-28  9:47 UTC (permalink / raw)
  To: rms; +Cc: emacs-pretest-bug, 2497

Richard M Stallman <rms@gnu.org> writes:

> Please don't call that system "Win"--that name implies praise.

How right you are. Forgive me my trespasses. In my own defence I have to
say that I never thought of W2k as the "system". My system is Emacs and
I'm very comfortable with it. W2k is its boot loader. The boot loader
does not become noticeable too much. I never understood, however, why
this boot loader takes up a whole CD.

-- 
Uwe







* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-28  8:17           ` Uwe Siart
@ 2009-02-28 10:14             ` David Engster
  2009-02-28 12:09               ` Eli Zaretskii
  2009-02-28 22:00             ` Stefan Monnier
  1 sibling, 1 reply; 41+ messages in thread
From: David Engster @ 2009-02-28 10:14 UTC (permalink / raw)
  To: uwe.siart; +Cc: 2497

Uwe Siart <uwe.siart@tum.de> writes:
> Stefan Monnier <monnier@iro.umontreal.ca> writes:
>
>> The guessing shouldn't give priority to buffer-file-coding-system.
>> We have set-coding-system-priority for that instead. And IIUC utf-8
>> should always have a pretty high priority since false positives are
>> fairly rare. So this still looks like a real bug.
>
> Here I would like to note that I never had false positives in the past
> (before 23.0.91) but I do have false positives now. Therefore I'm
> inclined to call it a bug.

I second this - this has worked for years without problems, and suddenly
it fails to detect UTF-8 with a Latin-1 environment.

I once again confirmed that this behaviour can be tracked down to this
change in detect_coding_charset in coding.c (revision 1.413):

--- coding.c    7 Feb 2009 10:49:39 -0000       1.412
+++ coding.c    9 Feb 2009 00:42:37 -0000       1.413
@@ -5101,7 +5101,7 @@
   valids = AREF (attrs, coding_attr_charset_valids);
   name = CODING_ID_NAME (coding->id);
   if (VECTORP (Vlatin_extra_code_table)
-      && strcmp ((char *) SDATA (SYMBOL_NAME (name)), "iso-8859-"))
+      && strcmp ((char *) SDATA (SYMBOL_NAME (name)), "iso-8859-") == 0)
     check_latin_extra = 1;
   if (! NILP (CODING_ATTR_ASCII_COMPAT (attrs)))
     src += head_ascii;

I'm inclined to say that this change is wrong, since strcmp will only
return 0 if two strings are exactly equal. In this case though, the
string "iso-8859-" is compared to "iso-8859-1" (in my case), so it
returns 1 and therefore check_latin_extra is not set.
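
To illustrate the difference, here is a minimal C sketch. The function
names are hypothetical and not part of coding.c. Note that strcmp is
only guaranteed to return some nonzero value for unequal strings, not
necessarily 1.

```c
#include <assert.h>
#include <string.h>

/* Exact comparison: nonzero only when NAME is exactly "iso-8859-".  */
int
is_exactly_iso_8859 (const char *name)
{
  assert (name != NULL);
  return strcmp (name, "iso-8859-") == 0;
}

/* Prefix comparison: nonzero when NAME merely starts with "iso-8859-".  */
int
starts_with_iso_8859 (const char *name)
{
  static const char prefix[] = "iso-8859-";

  assert (name != NULL);
  /* sizeof prefix counts the trailing NUL, hence the "- 1".  */
  return strncmp (name, prefix, sizeof prefix - 1) == 0;
}
```

With the name "iso-8859-1", the exact comparison fails while the prefix
comparison succeeds.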

-David








^ permalink raw reply	[flat|nested] 41+ messages in thread

* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-28  4:40         ` Stefan Monnier
  2009-02-28  8:17           ` Uwe Siart
@ 2009-02-28 10:49           ` Eli Zaretskii
  2009-02-28 12:16             ` Uwe Siart
                               ` (2 more replies)
  1 sibling, 3 replies; 41+ messages in thread
From: Eli Zaretskii @ 2009-02-28 10:49 UTC (permalink / raw)
  To: Stefan Monnier, Kenichi Handa; +Cc: 2497, uwe.siart

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: 2497@emacsbugs.donarmstrong.com,  uwe.siart@tum.de
> Date: Fri, 27 Feb 2009 23:40:01 -0500
> 
> >> It works with "C-x RET c utf-8 RET" immediately prior to "C-x C-f".
> >> > If it does, then the problem is with guessing the encoding, not with
> >> > decoding it.
> >> That's also my impression.
> >> > Also, what is the default value of buffer-file-coding-system, and was
> >> > it the same in 23.0.90?
> >> iso-latin-1-dos in 23.0.90 and in 23.0.91.
> > Then you shouldn't expect Emacs to guess UTF-8 encoding correctly in
> > every single instance.  Distinguishing between UTF-8 and Latin-1 is
> 
> The guessing shouldn't give priority to buffer-file-coding-system.
> Instead we have the set-coding-system-priority instead.

Please give me some credit: I said ``the _default_value_ of
buffer-file-coding-system''.  That default tells volumes about the
coding-system priorities.

> And IIUC utf-8 should always have a pretty high priority

With today's CVS on a Windows XP machine I get this:

  M-: (coding-system-priority-list) RET
  =>  (iso-latin-1 utf-8 iso-2022-7bit iso-2022-7bit-lock iso-2022-8bit-ss2 emacs-mule raw-text iso-2022-jp in-is13194-devanagari chinese-iso-8bit utf-8-auto utf-8-with-signature utf-16 utf-16be-with-signature utf-16le-with-signature utf-16be utf-16le japanese-shift-jis undecided)

So UTF-8 is indeed ``pretty high'', but lower than the locale's
default.

> So this still looks like a real bug.

Perhaps it is, but I didn't know Emacs 23 can reliably distinguish
between Latin-1 and UTF-8, even when UTF-8 sequences are present in
the text.  Can we do that reliably?  Perhaps Handa-san can shed some
light on this.






^ permalink raw reply	[flat|nested] 41+ messages in thread

* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-28 10:14             ` David Engster
@ 2009-02-28 12:09               ` Eli Zaretskii
  2009-02-28 14:16                 ` Jason Rumney
  2009-02-28 14:31                 ` David Engster
  0 siblings, 2 replies; 41+ messages in thread
From: Eli Zaretskii @ 2009-02-28 12:09 UTC (permalink / raw)
  To: David Engster, 2497; +Cc: uwe.siart

> From: David Engster <deng@randomsample.de>
> Date: Sat, 28 Feb 2009 11:14:16 +0100
> Cc: 2497@emacsbugs.donarmstrong.com
> 
> I once again confirmed that this behaviour can be tracked down to this
> change in detect_coding_charset in coding.c (revision 1.413):
> 
> --- coding.c    7 Feb 2009 10:49:39 -0000       1.412
> +++ coding.c    9 Feb 2009 00:42:37 -0000       1.413
> @@ -5101,7 +5101,7 @@
>    valids = AREF (attrs, coding_attr_charset_valids);
>    name = CODING_ID_NAME (coding->id);
>    if (VECTORP (Vlatin_extra_code_table)
> -      && strcmp ((char *) SDATA (SYMBOL_NAME (name)), "iso-8859-"))
> +      && strcmp ((char *) SDATA (SYMBOL_NAME (name)), "iso-8859-") == 0)
>      check_latin_extra = 1;
>    if (! NILP (CODING_ATTR_ASCII_COMPAT (attrs)))
>      src += head_ascii;
> 
> I'm inclined to say that this change is wrong, since strcmp will only
> return 0 if two strings are exactly equal. In this case though, the
> string "iso-8859-" is compared to "iso-8859-1" (in my case), so it
> returns 1 and therefore check_latin_extra is not set.

You are right.  But in my case, it was not enough to test for
"iso-8859-", as the symbol's name was "iso-latin-1", not "iso-8859-1".

I installed the patch below, that does seem to fix the problem with
the OP's .gnus.el, although I don't know how general that problem is,
nor whether Emacs is capable of distinguishing UTF-8 from Latin-N in
general.


2009-02-28  Eli Zaretskii  <eliz@gnu.org>

	* coding.c (detect_coding_charset): Fix change from 2008-10-21.
	Also, check iso-latin-*, not only iso-8859-*.

Index: src/coding.c
===================================================================
RCS file: /cvsroot/emacs/emacs/src/coding.c,v
retrieving revision 1.419
diff -u -r1.419 coding.c
--- src/coding.c	22 Feb 2009 15:48:03 -0000	1.419
+++ src/coding.c	28 Feb 2009 12:01:18 -0000
@@ -5103,7 +5103,10 @@
   valids = AREF (attrs, coding_attr_charset_valids);
   name = CODING_ID_NAME (coding->id);
   if (VECTORP (Vlatin_extra_code_table)
-      && strcmp ((char *) SDATA (SYMBOL_NAME (name)), "iso-8859-") == 0)
+      && (strncmp ((char *) SDATA (SYMBOL_NAME (name)),
+		   "iso-8859-", sizeof ("iso-8859-") - 1) == 0
+	  || strncmp ((char *) SDATA (SYMBOL_NAME (name)),
+		      "iso-latin-", sizeof ("iso-latin-") - 1) == 0))
     check_latin_extra = 1;
   if (! NILP (CODING_ATTR_ASCII_COMPAT (attrs)))
     src += head_ascii;
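
The condition in the patch can also be read as a small standalone
predicate. The sketch below is only an illustration with hypothetical
helper names; it works on plain C strings rather than on Lisp symbol
names, so it is not a drop-in replacement for the code in coding.c.

```c
#include <assert.h>
#include <string.h>

/* Return nonzero if NAME begins with PREFIX.  */
int
has_prefix (const char *name, const char *prefix)
{
  assert (name != NULL && prefix != NULL);
  return strncmp (name, prefix, strlen (prefix)) == 0;
}

/* Hypothetical helper: nonzero for names in either family checked by
   the patch, such as "iso-8859-1" or "iso-latin-1".  */
int
names_latin_coding_system (const char *name)
{
  return (has_prefix (name, "iso-8859-")
          || has_prefix (name, "iso-latin-"));
}
```

A name shorter than the prefix, such as "iso-8859", does not match,
because strncmp stops at the terminating NUL of the shorter string.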






^ permalink raw reply	[flat|nested] 41+ messages in thread

* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-28 10:49           ` Eli Zaretskii
@ 2009-02-28 12:16             ` Uwe Siart
  2009-02-28 22:04             ` Stefan Monnier
  2009-03-02 11:43             ` Kenichi Handa
  2 siblings, 0 replies; 41+ messages in thread
From: Uwe Siart @ 2009-02-28 12:16 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 2497

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Stefan Monnier <monnier@iro.umontreal.ca>
>> So this still looks like a real bug.
>
> Perhaps it is, but I didn't know Emacs 23 can reliably distinguish
> between Latin-1 and UTF-8, even when UTF-8 sequences are present in
> the text. Can we do that reliably? Perhaps Handa-san can shed some
> light on this.

Finding a solution to do it reliably would of course be the best.

Assuming this is not possible right now, we should distinguish between
»high reliability« and »poor reliability«. In my experience it was
much more reliable earlier, so (as a user with limited viewpoint)
I vote for reverting the change.

-- 
Uwe






^ permalink raw reply	[flat|nested] 41+ messages in thread

* bug#2354: marked as done (23.0.90; Emacs fails to detect utf-8  encoding with language environment Latin-1)
  2009-02-17 10:35 ` bug#2354: 23.0.90; Emacs fails to detect utf-8 encoding with language environment Latin-1 David Engster
  2009-02-17 16:45   ` Juanma Barranquero
@ 2009-02-28 12:30   ` Emacs bug Tracking System
  1 sibling, 0 replies; 41+ messages in thread
From: Emacs bug Tracking System @ 2009-02-28 12:30 UTC (permalink / raw)
  To: Eli Zaretskii

[-- Attachment #1: Type: text/plain, Size: 907 bytes --]


Your message dated Sat, 28 Feb 2009 14:21:08 +0200
with message-id <uzlg6oiq3.fsf@gnu.org>
and subject line Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
has caused the Emacs bug report #2354,
regarding 23.0.90; Emacs fails to detect utf-8 encoding with language environment Latin-1
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact owner@emacsbugs.donarmstrong.com
immediately.)


-- 
2354: http://emacsbugs.donarmstrong.com/cgi-bin/bugreport.cgi?bug=2354
Emacs Bug Tracking System
Contact owner@emacsbugs.donarmstrong.com with problems

[-- Attachment #2: Type: message/rfc822, Size: 4181 bytes --]

From: David Engster <deng@randomsample.de>
To: bug-gnu-emacs@gnu.org
Subject: 23.0.90; Emacs fails to detect utf-8 encoding with language environment Latin-1
Date: Tue, 17 Feb 2009 11:35:11 +0100
Message-ID: <87y6w5jqqo.fsf@engster.org>

This is what I believe to be a regression in CVS Emacs since the
23.0.90 pretest. I'm using a fresh CVS checkout from 2009-02-17,
compiled with 'make bootstrap'.

You can reproduce it as follows:

1. emacs -Q
2. M-x set-language-environment RET Latin-1 RET
3. In some buffer write:

 (ucs-insert "2500")

4. Eval it, so that the unicode character is inserted into the buffer.
5. Save the file and choose utf-8 as encoding.
6. Kill the buffer.
7. Load the file you just saved.

Result: Emacs displays "â\224\200" for the unicode character.

Expected behaviour: Emacs should detect utf-8 encoding and display
correct character.

Please note that this has worked without problems with the Emacs
23.0.90 pretest, so it must be due to some change(s) since then in CVS.

In GNU Emacs 23.0.90.1 (i686-pc-linux-gnu, GTK+ Version 2.12.11)
 of 2009-02-17 on void
Windowing system distributor `The X.Org Foundation', version 11.0.10402000
configured using `configure  '--prefix=/usr/local/emacs''

Important settings:
  value of $LC_ALL: nil
  value of $LC_COLLATE: nil
  value of $LC_CTYPE: nil
  value of $LC_MESSAGES: nil
  value of $LC_MONETARY: nil
  value of $LC_NUMERIC: nil
  value of $LC_TIME: nil
  value of $LANG: nil
  value of $XMODIFIERS: nil
  locale-coding-system: nil
  default-enable-multibyte-characters: t

Major mode: Lisp Interaction

Minor modes in effect:
  tooltip-mode: t
  tool-bar-mode: t
  mouse-wheel-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  global-auto-composition-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent input:
M-x r e p o <tab> r <tab> C-g M-x s e t - l a n <tab> 
<return> L a t i n w <backspace> - w <return> <backspace> 
1 <return> M-x r e p o <tab> r <tab> <return>

Recent messages:
For information about GNU Emacs and the GNU system, type C-h C-a.
Making completion list...
Quit
Making completion list...




[-- Attachment #3: Type: message/rfc822, Size: 2452 bytes --]

From: Eli Zaretskii <eliz@gnu.org>
To: 2497-done@emacsbugs.donarmstrong.com, 2354-done@emacsbugs.donarmstrong.com
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Sat, 28 Feb 2009 14:21:08 +0200
Message-ID: <uzlg6oiq3.fsf@gnu.org>

> From: David Engster <deng@randomsample.de>
> Date: Fri, 27 Feb 2009 18:46:12 +0100
> Cc: emacs-pretest-bug@gnu.org, 2497@emacsbugs.donarmstrong.com
> 
> Uwe Siart <uwe.siart@tum.de> writes:
> > I'm using the windows port of 23.0.91 on Win2k SP4 and I found that it
> > fails to read utf-8 encoded files correctly. When visiting a file in
> > utf-8 encoding all characters above 255 are screwed up and "C-h C RET"
> > indicates iso-latin1-dos for saving the file. This has not been an
> > issue in 23.0.90.
> 
> Maybe this is a duplicate of what I reported in
> 
> http://emacsbugs.donarmstrong.com/cgi-bin/bugreport.cgi?bug=2354
> 
> As I write later in that bug report, I think I could track down this
> issue to the change in revision 1.413 of src/coding.c. Maybe you could
> try if the same applies to your problem.

Should be fixed by this change:

2009-02-28  Eli Zaretskii  <eliz@gnu.org>

	* coding.c (detect_coding_charset): Fix change from 2008-10-21.
	Also, check iso-latin-*, not only iso-8859-*.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* bug#2497: marked as done (23.0.91; Fails to read UTF-8 on Win2k)
  2009-02-27 14:10 ` bug#2497: 23.0.91; Fails to read UTF-8 on Win2k Uwe Siart
                     ` (3 preceding siblings ...)
  2009-02-27 23:34   ` bug#2497: 23.0.91; Fails to read UTF-8 on Windows2k Richard M Stallman
@ 2009-02-28 12:30   ` Emacs bug Tracking System
  4 siblings, 0 replies; 41+ messages in thread
From: Emacs bug Tracking System @ 2009-02-28 12:30 UTC (permalink / raw)
  To: Eli Zaretskii

[-- Attachment #1: Type: text/plain, Size: 865 bytes --]


Your message dated Sat, 28 Feb 2009 14:21:08 +0200
with message-id <uzlg6oiq3.fsf@gnu.org>
and subject line Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
has caused the Emacs bug report #2354,
regarding 23.0.91; Fails to read UTF-8 on Win2k
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact owner@emacsbugs.donarmstrong.com
immediately.)


-- 
2354: http://emacsbugs.donarmstrong.com/cgi-bin/bugreport.cgi?bug=2354
Emacs Bug Tracking System
Contact owner@emacsbugs.donarmstrong.com with problems

[-- Attachment #2: Type: message/rfc822, Size: 3281 bytes --]

From: Uwe Siart <uwe.siart@tum.de>
To: emacs-pretest-bug@gnu.org
Subject: 23.0.91; Fails to read UTF-8 on Win2k
Date: Fri, 27 Feb 2009 15:10:19 +0100
Message-ID: <877i3c55tg.fsf@tum.de>

I'm using the windows port of 23.0.91 on Win2k SP4 and I found that it
fails to read utf-8 encoded files correctly. When visiting a file in
utf-8 encoding all characters above 255 are screwed up and "C-h C RET"
indicates iso-latin1-dos for saving the file. This has not been an
issue in 23.0.90.

-- 
Uwe


In GNU Emacs 23.0.91.1 (i386-mingw-nt5.0.2195)
 of 2009-02-27 on SOFT-MJASON
Windowing system distributor `Microsoft Corp.', version 5.0.2195
configured using `configure --with-gcc (3.4)'

Important settings:
  value of $LC_ALL: nil
  value of $LC_COLLATE: nil
  value of $LC_CTYPE: nil
  value of $LC_MESSAGES: nil
  value of $LC_MONETARY: nil
  value of $LC_NUMERIC: nil
  value of $LC_TIME: nil
  value of $LANG: DEU
  value of $XMODIFIERS: nil
  locale-coding-system: cp1252
  default-enable-multibyte-characters: t

Major mode: Lisp Interaction

Minor modes in effect:
  iswitchb-mode: t
  display-time-mode: t
  auto-insert-mode: t
  diff-auto-refine-mode: t
  delete-selection-mode: t
  pc-selection-mode: t
  tooltip-mode: t
  mouse-wheel-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  global-auto-composition-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  column-number-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent input:
M-x r e <tab> p o <tab> r t <tab> <return>

Recent messages:
Loading time...done
Loading iswitchb...done
For information about GNU Emacs and the GNU system, type C-h C-a.
Making completion list... [2 times]






^ permalink raw reply	[flat|nested] 41+ messages in thread

* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-28 12:09               ` Eli Zaretskii
@ 2009-02-28 14:16                 ` Jason Rumney
  2009-02-28 14:31                 ` David Engster
  1 sibling, 0 replies; 41+ messages in thread
From: Jason Rumney @ 2009-02-28 14:16 UTC (permalink / raw)
  To: Eli Zaretskii, 2497; +Cc: uwe.siart, David Engster

Eli Zaretskii wrote:
> You are right.  But in my case, it was not enough to test for
> "iso-8859-", as the symbol's name was "iso-latin-1", not "iso-8859-1".
>
> I installed the patch below, that does seem to fix the problem with
> the OP's .gnus.el, although I don't know how general that problem is,
> nor whether Emacs is capable of distinguishing UTF-8 from Latin-N in
> general.
>   

I installed a further change for the case where latin-extra-code-table 
is not a vector. But I don't understand why we have this table, and why 
the default value allows the 6 C1 control codes PU1, PU2, STS, CCH, MW 
and SPA to appear in latin text without breaking the auto detection. Are 
these control characters really that common?









^ permalink raw reply	[flat|nested] 41+ messages in thread

* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-28 12:09               ` Eli Zaretskii
  2009-02-28 14:16                 ` Jason Rumney
@ 2009-02-28 14:31                 ` David Engster
  1 sibling, 0 replies; 41+ messages in thread
From: David Engster @ 2009-02-28 14:31 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 2497, uwe.siart

Eli Zaretskii <eliz@gnu.org> writes:
>> From: David Engster <deng@randomsample.de>
>> I'm inclined to say that this change is wrong, since strcmp will only
>> return 0 if two strings are exactly equal. In this case though, the
>> string "iso-8859-" is compared to "iso-8859-1" (in my case), so it
>> returns 1 and therefore check_latin_extra is not set.
>
> You are right.  But in my case, it was not enough to test for
> "iso-8859-", as the symbol's name was "iso-latin-1", not "iso-8859-1".
>
> I installed the patch below, that does seem to fix the problem with
> the OP's .gnus.el, although I don't know how general that problem is,
> nor whether Emacs is capable of distinguishing UTF-8 from Latin-N in
> general.

I can confirm this patch fixes my original bug report (#2354). Thanks!

-David






^ permalink raw reply	[flat|nested] 41+ messages in thread

* bug#2497: 23.0.91; Fails to read UTF-8 on Windows2k
  2009-02-28  9:47     ` Uwe Siart
@ 2009-02-28 18:08       ` Richard M Stallman
  0 siblings, 0 replies; 41+ messages in thread
From: Richard M Stallman @ 2009-02-28 18:08 UTC (permalink / raw)
  To: uwe.siart, 2497; +Cc: emacs-pretest-bug, 2497

    How right you are. Forgive me my trespasses.

Only Emacs can forgive you, but I am confident that it will.

						 In my own defence I have to
    say that I never thought of W2k as the "system". My system is Emacs and
    I'm very comfortable with it. W2k is its boot loader.

Why not switch to a free boot loader then?






^ permalink raw reply	[flat|nested] 41+ messages in thread

* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-28  8:17           ` Uwe Siart
  2009-02-28 10:14             ` David Engster
@ 2009-02-28 22:00             ` Stefan Monnier
  1 sibling, 0 replies; 41+ messages in thread
From: Stefan Monnier @ 2009-02-28 22:00 UTC (permalink / raw)
  To: uwe.siart; +Cc: 2497

>> The guessing shouldn't give priority to buffer-file-coding-system.
>> Instead we have the set-coding-system-priority instead. And IIUC utf-8
>> should always have a pretty high priority since false positives are
>> fairly rare. So this still looks like a real bug.

> Here I would like to note that I never had false positives in the past
> (before 23.0.91) but I do have false positives now. Therefore I'm
> inclined to call it a bug.

To clear things up: by "false positives" I meant text that Emacs thinks
is valid utf-8 whereas it's really using some other coding system.


        Stefan






^ permalink raw reply	[flat|nested] 41+ messages in thread

* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-28 10:49           ` Eli Zaretskii
  2009-02-28 12:16             ` Uwe Siart
@ 2009-02-28 22:04             ` Stefan Monnier
  2009-03-02 11:43             ` Kenichi Handa
  2 siblings, 0 replies; 41+ messages in thread
From: Stefan Monnier @ 2009-02-28 22:04 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 2497, uwe.siart

>> The guessing shouldn't give priority to buffer-file-coding-system.
>> Instead we have the set-coding-system-priority instead.

> Please give me some credit: I said ``the _default_value_ of
> buffer-file-coding-system''.  That default tells volumes about the
> coding-system priorities.

I'm sorry for my bad wording: what I wrote was only meant to describe
the way the code is currently expected to work (AFAIK).

>   M-: (coding-system-priority-list) RET
>   =>  (iso-latin-1 utf-8 iso-2022-7bit iso-2022-7bit-lock iso-2022-8bit-ss2 emacs-mule raw-text iso-2022-jp in-is13194-devanagari chinese-iso-8bit utf-8-auto utf-8-with-signature utf-16 utf-16be-with-signature utf-16le-with-signature utf-16be utf-16le japanese-shift-jis undecided)

> So UTF-8 is indeed ``pretty high'', but lower than the locale's
> default.

That seems to be the source of the problem.  utf-8 should always come
before latin-1 in that list, since utf-8 streams that are valid latin-1
streams are not uncommon, whereas latin-1 streams that are valid utf-8
streams are extremely rare.
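
A toy C model of why the order matters. All names are hypothetical, the
UTF-8 check is a simplified structural one, and the real detector in
coding.c is considerably more involved. The point is only that a
detector which accepts everything must not come first.

```c
#include <assert.h>
#include <stddef.h>

enum coding { CODING_LATIN1, CODING_UTF8, CODING_UNDECIDED };

/* Two example orderings of the priority list.  */
const enum coding latin1_first[] = { CODING_LATIN1, CODING_UTF8 };
const enum coding utf8_first[] = { CODING_UTF8, CODING_LATIN1 };

/* Simplified structural check: every lead byte must be followed by the
   right number of continuation bytes.  Not every overlong form is
   rejected.  */
static int
accepts_utf8 (const unsigned char *p, size_t n)
{
  size_t i = 0;

  while (i < n)
    {
      size_t need = p[i] < 0x80 ? 0
        : (p[i] >= 0xC2 && p[i] <= 0xDF) ? 1
        : (p[i] & 0xF0) == 0xE0 ? 2
        : (p[i] >= 0xF0 && p[i] <= 0xF4) ? 3 : (size_t) -1;

      if (need == (size_t) -1 || n - i - 1 < need)
        return 0;
      for (size_t k = 1; k <= need; k++)
        if ((p[i + k] & 0xC0) != 0x80)
          return 0;
      i += need + 1;
    }
  return 1;
}

/* Every byte value is an iso-8859-1 code, so this always succeeds.  */
static int
accepts_latin1 (const unsigned char *p, size_t n)
{
  (void) p;
  (void) n;
  return 1;
}

/* Return the first coding in PRIORITY (COUNT entries) that accepts
   TEXT, or CODING_UNDECIDED if none does.  */
enum coding
detect (const enum coding *priority, size_t count,
        const unsigned char *text, size_t len)
{
  assert (priority != NULL && text != NULL);
  for (size_t i = 0; i < count; i++)
    {
      if (priority[i] == CODING_UTF8 && accepts_utf8 (text, len))
        return CODING_UTF8;
      if (priority[i] == CODING_LATIN1 && accepts_latin1 (text, len))
        return CODING_LATIN1;
    }
  return CODING_UNDECIDED;
}
```

With latin1_first, the UTF-8 bytes E2 94 80 (U+2500) come out as
Latin-1. With utf8_first they are detected as UTF-8, and Latin-1 text
such as "caf\xE9" still falls through to Latin-1 because it is not
well-formed UTF-8.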


        Stefan






^ permalink raw reply	[flat|nested] 41+ messages in thread

* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-02-28 10:49           ` Eli Zaretskii
  2009-02-28 12:16             ` Uwe Siart
  2009-02-28 22:04             ` Stefan Monnier
@ 2009-03-02 11:43             ` Kenichi Handa
  2009-03-02 15:25               ` Stefan Monnier
  2 siblings, 1 reply; 41+ messages in thread
From: Kenichi Handa @ 2009-03-02 11:43 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 2497, uwe.siart

In article <uab86q1ih.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

>   M-: (coding-system-priority-list) RET
>>> (iso-latin-1 utf-8 iso-2022-7bit iso-2022-7bit-lock iso-2022-8bit-ss2 emacs-mule raw-text iso-2022-jp in-is13194-devanagari chinese-iso-8bit utf-8-auto utf-8-with-signature utf-16 utf-16be-with-signature utf-16le-with-signature utf-16be utf-16le japanese-shift-jis undecided)

> So UTF-8 is indeed ``pretty high'', but lower than the locale's
> default.

> > So this still looks like a real bug.

> Perhaps it is, but I didn't know Emacs 23 can reliably distinguish
> between Latin-1 and UTF-8, even when UTF-8 sequences are present in
> the text.  Can we do that reliably?  Perhaps Handa-san can shed some
> light on this.

The coding system iso-latin-1 is for the character set
iso-8859-1, and the code-space of iso-8859-1 is 0x00..0xFF
(without gap, i.e. including 0x80..0x9F) (see
/usr/share/i18n/charmaps/ISO-8859-1.gz).  So, if we follow
it strictly, any byte sequence can be a correct iso-8859-1
stream, and it means that when iso-latin-1 has the highest
priority, all files are detected as iso-latin-1.

So, as long as we strictly follow the definition of
iso-8859-1...
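
A small C sketch of this point, with hypothetical function names (this
is not how coding.c implements detection). A strict iso-8859-1 check
has nothing to reject, whereas a UTF-8 check can fail. The test bytes
E2 94 80 are the UTF-8 form of U+2500 from bug #2354, which Emacs
displayed as "â\224\200" (octal 224 and 200 are 0x94 and 0x80).

```c
#include <assert.h>
#include <stddef.h>

/* Every byte 0x00..0xFF is an iso-8859-1 code, so this check can
   never fail.  */
int
valid_latin1 (const unsigned char *buf, size_t len)
{
  assert (buf != NULL || len == 0);
  (void) buf;
  (void) len;
  return 1;
}

/* Return nonzero if BUF holds well-formed UTF-8.  */
int
valid_utf8 (const unsigned char *buf, size_t len)
{
  size_t i = 0;

  assert (buf != NULL || len == 0);
  while (i < len)
    {
      unsigned char c = buf[i];
      size_t need;
      unsigned long cp;

      if (c < 0x80) { i++; continue; }
      else if (c >= 0xC2 && c <= 0xDF) { need = 1; cp = c & 0x1F; }
      else if (c >= 0xE0 && c <= 0xEF) { need = 2; cp = c & 0x0F; }
      else if (c >= 0xF0 && c <= 0xF4) { need = 3; cp = c & 0x07; }
      else return 0;  /* stray continuation or invalid lead byte */

      if (len - i - 1 < need)
        return 0;     /* sequence truncated at end of buffer */
      for (size_t k = 1; k <= need; k++)
        {
          if ((buf[i + k] & 0xC0) != 0x80)
            return 0;
          cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
      /* Reject overlong forms, surrogates, and values past U+10FFFF.  */
      if ((need == 2 && cp < 0x800) || (need == 3 && cp < 0x10000)
          || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return 0;
      i += need + 1;
    }
  return 1;
}
```

Both checks accept E2 94 80, so validity alone cannot decide between
the two coding systems for that file.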

In article <jwv7i3az0fc.fsf-monnier+emacsbugreports@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

> That seems to be the source of the problem.  utf-8 should always come
> before latin-1 in that list, since utf-8 streams that are valid latin-1
> streams are not uncommon, whereas latin-1 streams that are valid utf-8
> streams are extremely rare.

I think that is the only solution.

In article <87ab86ah9z.fsf@tum.de>, Uwe Siart <uwe.siart@tum.de> writes:

> Assumed this is not possible right now we should distinguish between
> »high reliability« and »poor reliability«. From my perception it has
> been much more reliable earlier so (as a user with limited viewpoint)
> I vote for reverting the change.

In Emacs 22, the coding system iso-latin-1 was defined as a
variant of an iso-2022-based coding system, and thus 0x80..0x9F
were not valid bytes (except for 0x91 etc. in
latin-extra-code-table).  So, some UTF-8 texts were not
detected as iso-latin-1.

To recover that behaviour, we can define iso-latin-1 as
before by doing this:

(define-coding-system 'iso-latin-1
  "Emacs 22 iso-latin-1."
  :mnemonic ?1
  :coding-type 'iso-2022
  :charset-list '(ascii latin-iso8859-1)
  :ascii-compatible-p t
  :mime-charset 'iso-8859-1
  :designation [ascii latin-iso8859-1 nil nil])

But even with that, some valid UTF-8 texts will still be
detected as iso-latin-1.  So I don't think this is a
"high reliability" solution.
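
The limitation can be sketched in C as follows. This is a model under
stated assumptions, not the Emacs 22 code: the function names are
hypothetical, and the extra table is assumed to hold the six C1 codes
PU1, PU2, STS, CCH, MW and SPA (0x91..0x96) mentioned earlier in this
thread.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed stand-in for latin-extra-code-table: 0x91..0x96.  */
static int
in_extra_table (unsigned char c)
{
  return c >= 0x91 && c <= 0x96;
}

/* Nonzero if TEXT contains no byte in 0x80..0x9F other than those
   allowed by the extra table.  */
int
accepts_latin1_without_c1 (const unsigned char *text, size_t len)
{
  assert (text != NULL || len == 0);
  for (size_t i = 0; i < len; i++)
    if (text[i] >= 0x80 && text[i] <= 0x9F && !in_extra_table (text[i]))
      return 0;
  return 1;
}
```

Under these assumptions E2 94 80 (U+2500) is rejected because of the
0x80 byte, but C3 A9 (U+00E9 in UTF-8) contains no byte in 0x80..0x9F
and is still accepted as Latin-1.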

---
Kenichi Handa
handa@m17n.org






^ permalink raw reply	[flat|nested] 41+ messages in thread

* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-03-02 11:43             ` Kenichi Handa
@ 2009-03-02 15:25               ` Stefan Monnier
  2009-03-02 19:25                 ` Eli Zaretskii
  0 siblings, 1 reply; 41+ messages in thread
From: Stefan Monnier @ 2009-03-02 15:25 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: 2497, uwe.siart

>> That seems to be the source of the problem.  utf-8 should always come
>> before latin-1 in that list, since utf-8 streams that are valid latin-1
>> streams are not uncommon, whereas latin-1 streams that are valid utf-8
>> streams are extremely rare.
> I think that is the only solution.

Not only is it the only solution, but it's one we already agreed on
several years ago.  So, again, the bug is in the ordering, and
we have to figure out which code ends up putting latin-1 before utf-8 in
the coding system priority.


        Stefan






^ permalink raw reply	[flat|nested] 41+ messages in thread

* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-03-02 15:25               ` Stefan Monnier
@ 2009-03-02 19:25                 ` Eli Zaretskii
  2009-03-03 16:34                   ` Stefan Monnier
  0 siblings, 1 reply; 41+ messages in thread
From: Eli Zaretskii @ 2009-03-02 19:25 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 2497, uwe.siart

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Eli Zaretskii <eliz@gnu.org>,  2497@emacsbugs.donarmstrong.com,  uwe.siart@tum.de
> Date: Mon, 02 Mar 2009 10:25:45 -0500
> 
> So, again, the bug is in the ordering

Actually, the OP was complaining that, even with this ordering, Emacs
23 did TRT for him, and that a recent change broke that.  That bug is
fixed now, I believe, so you are talking about a more general problem.

> we have to figure out which code ends up putting latin-1 before utf-8 in
> the coding system priority.

Well, I think this is fairly easy: set-locale-environment does it.
Observe:

  (defun set-locale-environment (&optional locale-name frame)
    "Set up multi-lingual environment for using LOCALE-NAME.
  This sets the language environment, the coding system priority,
  the default input method and sometimes other things.
	...
	(let ((language-name
	       (locale-name-match locale locale-language-names))
	      (charset-language-name
	       (locale-name-match locale locale-charset-language-names))
	      (default-eol-type (coding-system-eol-type
				 default-buffer-file-coding-system))
	      (coding-system
	       (or (locale-name-match locale locale-preferred-coding-systems)
		   (when locale
		     (if (string-match "\\.\\([^@]+\\)" locale)
			 (locale-charset-to-coding-system
			  (match-string 1 locale)))))))
	...
	  (when (and (not frame)
		     coding-system
		     (not (coding-system-equal coding-system
					       locale-coding-system)))
    >>>>>	  (prefer-coding-system coding-system)
	    ;; Fixme: perhaps prefer-coding-system should set this too.
	    ;; But it's not the time to do such a fundamental change.
	    (setq default-sendmail-coding-system coding-system)
	    (setq locale-coding-system coding-system))))

Even the doc string says that the coding system priority is set
according to the locale's native encoding.
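
As a rough model of the effect, preferring a coding system can be
pictured as a move-to-front on the priority list. The C sketch below
uses hypothetical names and a plain array of strings; it does not
reproduce what prefer-coding-system does internally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Example list with utf-8 first, as the thread argues it should be.  */
const char *example_priority[] = { "utf-8", "iso-latin-1", "iso-2022-7bit" };

/* Move NAME to the front of LIST (COUNT entries), keeping the relative
   order of the other entries.  Return 1 if NAME was found, else 0.  */
int
move_to_front (const char **list, size_t count, const char *name)
{
  assert (list != NULL && name != NULL);
  for (size_t i = 0; i < count; i++)
    if (strcmp (list[i], name) == 0)
      {
        const char *found = list[i];

        /* Shift entries 0..i-1 one slot to the right.  */
        memmove (list + 1, list, i * sizeof *list);
        list[0] = found;
        return 1;
      }
  return 0;
}
```

Calling move_to_front (example_priority, 3, "iso-latin-1") leaves
iso-latin-1 first and utf-8 second, which is the ordering shown in the
coding-system-priority-list output earlier in this thread.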






^ permalink raw reply	[flat|nested] 41+ messages in thread

* bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
  2009-03-02 19:25                 ` Eli Zaretskii
@ 2009-03-03 16:34                   ` Stefan Monnier
  0 siblings, 0 replies; 41+ messages in thread
From: Stefan Monnier @ 2009-03-03 16:34 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 2497, uwe.siart

>> So, again, the bug is in the ordering
> Actually, the OP was complaining that, even with this ordering, Emacs
> 23 did TRT for him, and that a recent change broke that.  That bug is
> fixed now, I believe, so you are talking about a more general problem.

Yes.  I didn't realize that the reason why it worked before is because
we were lucky.

>> we have to figure out which code ends up putting latin-1 before utf-8 in
>> the coding system priority.

> Well, I think this is fairly easy: set-locale-environment does it.
> Observe:

>   (defun set-locale-environment (&optional locale-name frame)
[...]
>>>>>> (prefer-coding-system coding-system)
[...]
> Even the doc string says that the coding system priority is set
> according to the locale's native encoding.

Indeed, thanks for spotting it.  Can someone change this code so it
doesn't move utf-8 from first to second place?


        Stefan






^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread

Thread overview: 41+ messages
     [not found] <uzlg6oiq3.fsf@gnu.org>
2009-02-17 10:35 ` bug#2354: 23.0.90; Emacs fails to detect utf-8 encoding with language environment Latin-1 David Engster
2009-02-17 16:45   ` Juanma Barranquero
2009-02-17 18:04     ` David Engster
2009-02-28 12:30   ` bug#2354: marked as done (23.0.90; Emacs fails to detect utf-8 encoding with language environment Latin-1) Emacs bug Tracking System
2009-02-27 14:10 ` bug#2497: 23.0.91; Fails to read UTF-8 on Win2k Uwe Siart
2009-02-27 16:03   ` Eli Zaretskii
2009-02-27 16:48     ` Uwe Siart
2009-02-27 18:19       ` Eli Zaretskii
2009-02-27 20:35         ` Uwe Siart
2009-02-28  4:40         ` Stefan Monnier
2009-02-28  8:17           ` Uwe Siart
2009-02-28 10:14             ` David Engster
2009-02-28 12:09               ` Eli Zaretskii
2009-02-28 14:16                 ` Jason Rumney
2009-02-28 14:31                 ` David Engster
2009-02-28 22:00             ` Stefan Monnier
2009-02-28 10:49           ` Eli Zaretskii
2009-02-28 12:16             ` Uwe Siart
2009-02-28 22:04             ` Stefan Monnier
2009-03-02 11:43             ` Kenichi Handa
2009-03-02 15:25               ` Stefan Monnier
2009-03-02 19:25                 ` Eli Zaretskii
2009-03-03 16:34                   ` Stefan Monnier
2009-02-27 16:11   ` Juanma Barranquero
2009-02-27 16:16     ` Juanma Barranquero
2009-02-27 16:27       ` Uwe Siart
2009-02-27 16:32         ` Juanma Barranquero
2009-02-27 16:23     ` Uwe Siart
2009-02-27 16:38       ` Juanma Barranquero
2009-02-27 18:19         ` Eli Zaretskii
2009-02-27 20:38           ` Juanma Barranquero
2009-02-28  1:29           ` Jason Rumney
2009-02-27 17:02     ` Leo
2009-02-27 17:46   ` David Engster
2009-02-27 21:15     ` Uwe Siart
2009-02-28  1:32     ` Jason Rumney
2009-02-28  1:35       ` Processed (with 5 errors): " Emacs bug Tracking System
2009-02-27 23:34   ` bug#2497: 23.0.91; Fails to read UTF-8 on Windows2k Richard M Stallman
2009-02-28  9:47     ` Uwe Siart
2009-02-28 18:08       ` Richard M Stallman
2009-02-28 12:30   ` bug#2497: marked as done (23.0.91; Fails to read UTF-8 on Win2k) Emacs bug Tracking System
