unofficial mirror of emacs-devel@gnu.org 
* [angeli@iwi.uni-sb.de: Coding problem with Euro sign]
@ 2005-12-13 23:34 Richard M. Stallman
  2005-12-14 18:56 ` Kevin Rodgers
  0 siblings, 1 reply; 32+ messages in thread
From: Richard M. Stallman @ 2005-12-13 23:34 UTC (permalink / raw)


Would someone please DTRT and ack?

------- Start of forwarded message -------
From: Ralf Angeli <angeli@iwi.uni-sb.de>
To: emacs-pretest-bug@gnu.org
Date: Tue, 13 Dec 2005 13:12:02 +0100
Subject: Coding problem with Euro sign
Sender: emacs-pretest-bug-bounces+rms=gnu.org@gnu.org

- --=-=-=

Attached you can find a file with two 8-bit characters I extracted
from a file produced by Visual Studio under Windows.  The characters
should be u umlaut and the Euro sign.  Emacs does not seem to be able
to find the right coding system for it and displays it with
raw-text-dos.  I could not get the file displayed correctly by loading
it with iso-latin-1, iso-latin-9, or cp1251.  And I am not sure if
this is a problem of Emacs or if Visual Studio simply produced
garbage.


- --=-=-=
Content-Type: text/plain; charset=utf-8
Content-Disposition: attachment; filename=test.txt
Content-Transfer-Encoding: quoted-printable

=FC u umlaut
=C2=80 euro

- --=-=-=



In GNU Emacs 22.0.50.1 (i486-pc-linux-gnu, GTK+ Version 2.6.10)
 of 2005-12-07 on pacem, modified by Debian
X server distributor `The X.Org Foundation', version 11.0.60802000
configured using `configure '--build' 'i486-linux-gnu' '--host' 'i486-linux-gnu' '--prefix=/usr' '--sharedstatedir=/var/lib' '--libexecdir=/usr/lib' '--localstatedir=/var' '--infodir=/usr/share/info' '--mandir=/usr/share/man' '--with-pop=yes' '--enable-locallisppath=/etc/emacs-snapshot:/etc/emacs:/usr/local/share/emacs/22.0.50/site-lisp:/usr/local/share/emacs/site-lisp:/usr/share/emacs/22.0.50/site-lisp:/usr/share/emacs/site-lisp:/usr/share/emacs/22.0.50/leim' '--with-x=yes' '--with-x-toolkit=gtk' 'CFLAGS=-DDEBIAN -g -Wno-pointer-sign -O2' 'build_alias=i486-linux-gnu' 'host_alias=i486-linux-gnu''

Important settings:
  value of $LC_ALL: nil
  value of $LC_COLLATE: nil
  value of $LC_CTYPE: nil
  value of $LC_MESSAGES: nil
  value of $LC_MONETARY: nil
  value of $LC_NUMERIC: nil
  value of $LC_TIME: nil
  value of $LANG: en_US
  locale-coding-system: iso-8859-1
  default-enable-multibyte-characters: t

Major mode: Text

Minor modes in effect:
  desktop-save-mode: t
  display-time-mode: t
  iswitchb-mode: t
  recentf-mode: t
  show-paren-mode: t
  encoded-kbd-mode: t
  auto-compression-mode: t
  mouse-wheel-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  unify-8859-on-encoding-mode: t
  utf-translate-cjk-mode: t
  column-number-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent input:
C-x k <return> C-x <return> c i s o - l a <tab> 1 <return> 
C-x C-f s c <tab> t e s <backspace> x <tab> . <backspace> 
t . t x <tab> <backspace> <backspace> <backspace> <backspace> 
<backspace> s t <tab> x <tab> <return> C-x k <return> 
C-x <return> c i s o - l a t <tab> 9 <return> C-x C-f 
M-p <return> C-x k <return> C-x <return> c c p 1 2 
<tab> <return> C-x C-f M-p <return> M-x r e p o r t 
- - e m <tab> <return>

Recent messages:
Cleaning up the recentf list...done (1 removed)
Loading iswitchb...done
Loading time...done
Loading desktop...done
No desktop file.
For information about the GNU Project and its goals, type C-h C-p.
Loading cl-seq...done
Making completion list...
Loading help-mode...done
Loading emacsbug...done

- --=-=-=--
------- End of forwarded message -------

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [angeli@iwi.uni-sb.de: Coding problem with Euro sign]
  2005-12-13 23:34 [angeli@iwi.uni-sb.de: Coding problem with Euro sign] Richard M. Stallman
@ 2005-12-14 18:56 ` Kevin Rodgers
  2005-12-14 22:51   ` Ralf Angeli
  0 siblings, 1 reply; 32+ messages in thread
From: Kevin Rodgers @ 2005-12-14 18:56 UTC (permalink / raw)


Richard M. Stallman wrote:
 > Would someone please DTRT and ack?
 >
 > ------- Start of forwarded message -------
 > From: Ralf Angeli <angeli@iwi.uni-sb.de>
 > To: emacs-pretest-bug@gnu.org
 > Date: Tue, 13 Dec 2005 13:12:02 +0100
...
 > Subject: Coding problem with Euro sign
 > Sender: emacs-pretest-bug-bounces+rms=gnu.org@gnu.org
...
 >
 > - --=-=-=
 >
 > Attached you can find a file with two 8-bit characters I extracted
 > from a file produced by Visual Studio under Windows.  The characters
 > should be u umlaut and the Euro sign.  Emacs does not seem to be able
 > to find the right coding system for it and displays it with
 > raw-text-dos.  I could not get the file displayed correctly by loading
 > it with iso-latin-1, iso-latin-9, or cp1251.  And I am not sure if
 > this is a problem of Emacs or if Visual Studio simply produced
 > garbage.
 >
 >
 > - --=-=-=
 > Content-Type: text/plain; charset=utf-8
 > Content-Disposition: attachment; filename=test.txt
 > Content-Transfer-Encoding: quoted-printable
 >
 > =FC u umlaut
 > =C2=80 euro
 >
 > - --=-=-=

I think the OP is confused: u umlaut is 0xFC in ISO 8859-1 (Latin 1),
ISO 8859-15 (Latin 9), and Unicode.  The euro is 0xA4 in ISO 8859-15 but
U+20AC in Unicode (and not defined in ISO 8859-1).

But in UTF-8, as the quoted-printable attachment claims to be, they are
0xC3 0xBC and 0xE2 0x82 0xAC resp.

The attachment above uses a single-byte encoding for u umlaut.  But the
encoding used for the euro is either an unknown 2-byte encoding or the
wrong single-byte encoding (C2 is A circumflex in ISO 8859-15) followed
by 0x80 (undefined in ISO 8859-*).  That could explain why Emacs does
not recognize it as iso-latin-1 or iso-latin-9.

As far as Microsoft Windows code pages go, 1251 is Cyrillic so the OP
must have meant 1252.  And in that character set, the euro is indeed
0x80 (and 0xC2 is still A circumflex).

So the attachment should have been labelled windows-1252 instead of
utf-8, and its contents would be more accurately written as:

=FC u umlaut
=C2 A circumflex
=80 euro

And the OP should try visiting the file with the cp1252 coding system.
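
The byte values above can be checked from Lisp.  This is only a sketch:
it assumes an Emacs recent enough to provide `unibyte-string' and to
define the `iso-8859-15' and `windows-1252' coding systems, which is
not true of every version mentioned in this thread.  Each form in the
list is expected to evaluate to t.

```elisp
(list
 ;; 0xFC is u umlaut (U+00FC) in both Latin-1 and windows-1252.
 (equal (decode-coding-string (unibyte-string #xFC) 'iso-8859-1) "\u00FC")
 (equal (decode-coding-string (unibyte-string #xFC) 'windows-1252) "\u00FC")
 ;; The euro sign (U+20AC) is 0xA4 in Latin-9 and 0x80 in windows-1252.
 (equal (decode-coding-string (unibyte-string #xA4) 'iso-8859-15) "\u20AC")
 (equal (decode-coding-string (unibyte-string #x80) 'windows-1252) "\u20AC")
 ;; In UTF-8 the same characters take two and three bytes respectively.
 (equal (encode-coding-string "\u00FC" 'utf-8)
        (unibyte-string #xC3 #xBC))
 (equal (encode-coding-string "\u20AC" 'utf-8)
        (unibyte-string #xE2 #x82 #xAC)))
```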

-- 
Kevin Rodgers


* Re: [angeli@iwi.uni-sb.de: Coding problem with Euro sign]
  2005-12-14 18:56 ` Kevin Rodgers
@ 2005-12-14 22:51   ` Ralf Angeli
  2005-12-15  1:34     ` Kevin Rodgers
  0 siblings, 1 reply; 32+ messages in thread
From: Ralf Angeli @ 2005-12-14 22:51 UTC (permalink / raw)


* Kevin Rodgers (2005-12-14) writes:

> I think the OP is confused:

Was confused.  That was cleared up on emacs-pretest-bug.

> And the OP should try visiting the file with the cp1252 coding system.

Well, the question now is if it is possible for Emacs to figure out
the coding system by itself with the example at hand.

-- 
Ralf


* Re: [angeli@iwi.uni-sb.de: Coding problem with Euro sign]
  2005-12-14 22:51   ` Ralf Angeli
@ 2005-12-15  1:34     ` Kevin Rodgers
  2005-12-15 16:20       ` Ralf Angeli
  0 siblings, 1 reply; 32+ messages in thread
From: Kevin Rodgers @ 2005-12-15  1:34 UTC (permalink / raw)


Ralf Angeli wrote:
> * Kevin Rodgers (2005-12-14) writes:
>>I think the OP is confused: 
> 
> Was confused.  That was cleared up on emacs-pretest-bug.

Good!  I hope you didn't take offense at my remark.

>>And the OP should try visiting the file with the cp1252 coding system.
> 
> Well, the question now is if it is possible for Emacs to figure out
> the coding system by itself with the example at hand.

You could try something like this:

(setq auto-coding-regexp-alist
       (cons '("[\040-\177][\200-\237]" . cp1252)
             auto-coding-regexp-alist))

I don't think that's a general purpose solution since (1)
auto-coding-regexp-alist actually has precedence over `-*-coding:-*-'
file variables and (2) other encodings probably use those o200 - o237
bytes (certainly other Microsoft Windows code pages do).

-- 
Kevin Rodgers


* Re: [angeli@iwi.uni-sb.de: Coding problem with Euro sign]
  2005-12-15  1:34     ` Kevin Rodgers
@ 2005-12-15 16:20       ` Ralf Angeli
  2005-12-15 22:02         ` Kevin Rodgers
  2005-12-16 10:35         ` [angeli@iwi.uni-sb.de: Coding problem with Euro sign] David Hansen
  0 siblings, 2 replies; 32+ messages in thread
From: Ralf Angeli @ 2005-12-15 16:20 UTC (permalink / raw)


* Kevin Rodgers (2005-12-15) writes:

> Ralf Angeli wrote:
>> * Kevin Rodgers (2005-12-14) writes:
>>>I think the OP is confused: 
>> 
>> Was confused.  That was cleared up on emacs-pretest-bug.
>
> Good!  I hope you didn't take offense at my remark.

Oh well ... something like that was to be expected as my knowledge
about coding systems is only improving slowly. (c:

>>>And the OP should try visiting the file with the cp1252 coding system.
>> 
>> Well, the question now is if it is possible for Emacs to figure out
>> the coding system by itself with the example at hand.
>
> You could try something like this:
>
> (setq auto-coding-regexp-alist
>        (cons '("[\040-\177][\200-\237]" . cp1252)
>              auto-coding-regexp-alist))
>
> I don't think that's a general purpose solution since (1)
> auto-coding-regexp-alist actually has precedence over `-*-coding:-*-'
> file variables and (2) other encodings probably use those o200 - o237
> bytes (certainly other Microsoft Windows code pages do).

This doesn't seem to work here.  I still see the byte codes of the
8-bit characters when opening the file after evaluating the above
form.

And a customization is actually not what I am interested in; I'd like
Emacs to figure this out by itself, out of the box.

I am not sure how common something like the case at hand is but it is
certainly not academic.  And if one is working with different
operating systems or interchanging files with people working on
different operating systems the failure to detect the correct coding
could lead to people regarding Emacs as a truly inferior piece of
software.  I can already hear them: "What?  It displays the Euro sign
as \200?  Even Notepad gets this right!"  On these grounds it may
become a bit hard to convince people that Emacs is the one true
editor.

Anyway, I tested a bit and under Windows (surprise) every application
I tried (e.g. Notepad and OpenOffice) managed to display the file
correctly.  On GNU/Linux no application got it right.  I checked with
less, more, vim, nano, pico, and OpenOffice.  Either "garbage" was
displayed or (in case of OpenOffice) a dialog asking the user to
specify the encoding.  So it's not like Emacs isn't in good company.
Nevertheless it would be nice if Emacs got it right.  Unfortunately I
lack the knowledge for judging if this is possible at all without
having to use all sorts of unreliable heuristics which are costly to
implement.

-- 
Ralf


* Re: [angeli@iwi.uni-sb.de: Coding problem with Euro sign]
  2005-12-15 16:20       ` Ralf Angeli
@ 2005-12-15 22:02         ` Kevin Rodgers
  2005-12-16  8:57           ` Eli Zaretskii
  2005-12-16 11:55           ` Ralf Angeli
  2005-12-16 10:35         ` [angeli@iwi.uni-sb.de: Coding problem with Euro sign] David Hansen
  1 sibling, 2 replies; 32+ messages in thread
From: Kevin Rodgers @ 2005-12-15 22:02 UTC (permalink / raw)


Ralf Angeli wrote:
 > * Kevin Rodgers (2005-12-15) writes:
 >
 >>Ralf Angeli wrote:
 >>
 >>>* Kevin Rodgers (2005-12-14) writes:
 >>>>And the OP should try visiting the file with the cp1252 coding system.
 >>>
 >>>Well, the question now is if it is possible for Emacs to figure out
 >>>the coding system by itself with the example at hand.
 >>
 >>You could try something like this:
 >>
 >>(setq auto-coding-regexp-alist
 >>       (cons '("[\040-\177][\200-\237]" . cp1252)
 >>             auto-coding-regexp-alist))
 >>
 >>I don't think that's a general purpose solution since (1)
 >>auto-coding-regexp-alist actually has precedence over `-*-coding:-*-'
 >>file variables and (2) other encodings probably use those o200 - o237
 >>bytes (certainly other Microsoft Windows code pages do).
 >
 > This doesn't seem to work here.  I still see the byte codes of the
 > 8-bit characters when opening the file after evaluating the above
 > form.

OK, now I've actually tried that here in Emacs 21.4 running on
Unix/Solaris under X.  First it complained that cp1252 is an invalid
coding system, so I found the "MS-DOS and MULE" Info node referenced
from the "Coding Systems" node and tried `M-x codepage-setup'.  It
wouldn't take 1252, but a quick search in that node revealed that the
right number is 850.

So I tweaked the auto-coding-regexp-alist entry to use cp850 and
revisited the file.  Now instead of displaying the u umlaut and A
circumflex characters as such in my default font's character set
(iso8859-1) and the euro as "\200", Emacs displays the u umlaut as
superscript 3, A circumflex as "\302", and the euro as C cedilla.

I assume those display problems are because I haven't configured an
Emacs fontset for the cp850 coding system.  But the
auto-coding-regexp-alist entry worked as intended, and you're on
Windows so your fontset should be properly configured for that.

One other detail: that entry only sets the coding system if the euro
is immediately preceded by an ASCII character.  Is that the case in
your file?  What does `C-h C RET' say after visiting the file?

I assume you're running with multibyte characters enabled.

 > And a customization is actually not what I am interested in; I'd like
 > Emacs to figure this out by itself, out of the box.

How is Emacs supposed to infer the coding system from the contents of
that file?  If you can come up with a suitable customization, perhaps
it will be incorporated into Emacs as the default behavior.

 > I am not sure how common something like the case at hand is but it is
 > certainly not academic.  And if one is working with different
 > operating systems or interchanging files with people working on
 > different operating systems the failure to detect the correct coding
 > could lead to people regarding Emacs as a truly inferior piece of
 > software.  I can already hear them: "What?  It displays the Euro sign
 > as \200?  Even Notepad gets this right!"  On these grounds it may
 > become a bit hard to convince people that Emacs is the one true
 > editor.

Can Notepad display files in anything besides CP850/Windows-1252 and
probably UTF-8 w/BOM?  E.g. can it distinguish ISO 8859-1 from ISO
8859-2 from ISO 8859-15?

 > Anyway, I tested a bit and under Windows (surprise) every application
 > I tried (e.g. Notepad and OpenOffice) managed to display the file
 > correctly.  On GNU/Linux no application got it right.  I checked with
 > less, more, vim, nano, pico, and OpenOffice.  Either "garbage" was
 > displayed or (in case of OpenOffice) a dialog asking the user to
 > specify the encoding.  So it's not like Emacs isn't in good company.
 > Nevertheless it would be nice if Emacs got it right.  Unfortunately I
 > lack the knowledge for judging if this is possible at all without
 > having to use all sorts of unreliable heuristics which are costly to
 > implement.

Yes, Windows applications simply assume you're using a proprietary
Microsoft character set, and GNU/Linux apps prioritize support for
standard character encodings.  Maybe all you need is
(prefer-coding-system 'cp850)

-- 
Kevin Rodgers


* Re: [angeli@iwi.uni-sb.de: Coding problem with Euro sign]
  2005-12-15 22:02         ` Kevin Rodgers
@ 2005-12-16  8:57           ` Eli Zaretskii
  2005-12-16 17:59             ` Kevin Rodgers
  2005-12-16 11:55           ` Ralf Angeli
  1 sibling, 1 reply; 32+ messages in thread
From: Eli Zaretskii @ 2005-12-16  8:57 UTC (permalink / raw)
  Cc: emacs-devel

> From: Kevin Rodgers <ihs_4664@yahoo.com>
> Date: Thu, 15 Dec 2005 15:02:48 -0700
> 
> OK, now I've actually tried that here in Emacs 21.4 running on
> Unix/Solaris under X.  First it complained that cp1252 is an invalid
> coding system, so I found the "MS-DOS and MULE" Info node referenced
> from the "Coding Systems" node and tried `M-x codepage-setup'.  It
> wouldn't take 1252, but a quick search in that node revealed that the
> right number is 850.

Emacs 21.x doesn't support cp1252, and cp850 is not an equivalent of
cp1252.

> So I tweaked the auto-coding-regexp-alist entry to use cp850 and
> revisited the file.  Now instead of displaying the u umlaut and A
> circumflex characters as such in my default font's character set
> (iso8859-1) and the euro as "\200", Emacs displays the u umlaut as
> superscript 3, A circumflex as "\302", and the euro as C cedilla.

cp850 doesn't support the Euro sign, that's why it displays it as
something else.  And the codepoints of other Latin characters are
different in cp850 than in iso-8859-1, so you see different glyphs.

> I assume those display problems are because I haven't configured an
> Emacs fontset for the cp850 coding system.

No, it's because the codepoints are different; see above.
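
The difference in codepoints can be seen by decoding the same bytes
under both coding systems.  A small sketch, assuming an Emacs that
defines both `cp850' and `windows-1252' and provides `unibyte-string';
the comments name the characters reported earlier in the thread.

```elisp
(list
 ;; Byte 0xFC: u umlaut in windows-1252, superscript three in cp850.
 (decode-coding-string (unibyte-string #xFC) 'windows-1252)
 (decode-coding-string (unibyte-string #xFC) 'cp850)
 ;; Byte 0x80: euro sign in windows-1252, C cedilla in cp850.
 (decode-coding-string (unibyte-string #x80) 'windows-1252)
 (decode-coding-string (unibyte-string #x80) 'cp850)
 ;; Going the other way, u umlaut is stored as byte 0x81 in cp850.
 (encode-coding-string "\u00FC" 'cp850))
```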


* Re: [angeli@iwi.uni-sb.de: Coding problem with Euro sign]
  2005-12-15 16:20       ` Ralf Angeli
  2005-12-15 22:02         ` Kevin Rodgers
@ 2005-12-16 10:35         ` David Hansen
  1 sibling, 0 replies; 32+ messages in thread
From: David Hansen @ 2005-12-16 10:35 UTC (permalink / raw)


On Thu, 15 Dec 2005 17:20:09 +0100 Ralf Angeli wrote:


> [ windows-1252 problems ]
>
> I am not sure how common something like the case at hand is but it is
> certainly not academic.  And if one is working with different
> operating systems or interchanging files with people working on
> different operating systems the failure to detect the correct coding
> could lead to people regarding Emacs as a truly inferior piece of
> software.  I can already hear them: "What?  It displays the Euro sign
> as \200?  Even Notepad gets this right!"  On these grounds it may
> become a bit hard to convince people that Emacs is the one true
> editor.

I think all that has to be done is: if emacs "decides" to choose
latin-1 as the coding and there are [\200-\237] chars then choose
windows-1252 instead.  Who has control chars in her text file
anyway?
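
As a rough sketch of that rule, and not of how Emacs actually detects
coding systems: the function below takes the raw file contents as a
unibyte string and returns a coding system symbol.  The name
`my/latin-1-or-windows-1252' is made up for illustration.

```elisp
(defun my/latin-1-or-windows-1252 (bytes)
  "Guess a coding system for the unibyte string BYTES.
Return `windows-1252' if any byte lies in the range 0x80-0x9F,
which Latin-1 reserves for control characters, else `iso-8859-1'."
  (let ((found nil))
    ;; `append' turns the string into a list of byte values.
    (dolist (b (append bytes nil))
      (when (<= #x80 b #x9F)
        (setq found t)))
    (if found 'windows-1252 'iso-8859-1)))

;; The test file from the bug report: u umlaut, newline, euro.
(my/latin-1-or-windows-1252 (unibyte-string #xFC #x0A #x80))
;; => windows-1252
(my/latin-1-or-windows-1252 (unibyte-string #xFC #x0A))
;; => iso-8859-1
```

The same bytes are also used by other Windows code pages, so a rule
like this can only ever pick a plausible default.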

David


* Re: [angeli@iwi.uni-sb.de: Coding problem with Euro sign]
  2005-12-15 22:02         ` Kevin Rodgers
  2005-12-16  8:57           ` Eli Zaretskii
@ 2005-12-16 11:55           ` Ralf Angeli
  2005-12-16 22:58             ` Kevin Rodgers
  2006-01-10 12:38             ` windows-XXXX and cpXXXX Kenichi Handa
  1 sibling, 2 replies; 32+ messages in thread
From: Ralf Angeli @ 2005-12-16 11:55 UTC (permalink / raw)


* Kevin Rodgers (2005-12-15) writes:

> Ralf Angeli wrote:
>  > * Kevin Rodgers (2005-12-15) writes:
>  >
>  >>You could try something like this:
>  >>
>  >>(setq auto-coding-regexp-alist
>  >>       (cons '("[\040-\177][\200-\237]" . cp1252)
>  >>             auto-coding-regexp-alist))
>  >
>  > This doesn't seem to work here.  I still see the byte codes of the
>  > 8-bit characters when opening the file after evaluating the above
>  > form.
[...]
> I assume those display problems are because I haven't configured an
> Emacs fontset for the cp850 coding system.  But the
> auto-coding-regexp-alist entry worked as intended, and you're on
> Windows so your fontset should be properly configured for that.

Currently I am on GNU/Linux.  Anyway, with the development version of
Emacs I did not have the problems with cp1252 you described when
loading the file.  But when trying to write the file I got this
warning:

,----
| Warning (:warning): Invalid coding system `cp1252' is specified
| for the current buffer/file by the variable `auto-coding-regexp-alist'.
| It is highly recommended to fix it before writing to a file.
`----

I didn't do `M-x codepage-setup RET' before trying all of this.
Interestingly loading and writing the file worked fine if I used
windows-1252 instead of cp1252.
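
Whether a given name is usable can be checked before relying on it.
A small sketch using the standard `coding-system-p' predicate; which
names are reported as defined depends on the Emacs version and on what
has been loaded, so no particular output is assumed here.

```elisp
;; Report, for each candidate name, whether this Emacs knows it.
(dolist (name '(cp1252 windows-1252 cp850))
  (message "%s: %s"
           name
           (if (coding-system-p name) "defined" "not defined")))
```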

> One other detail: that entry only sets the coding system if the euro
> is immediately preceded by an ASCII character.  Is that the case in
> your file?

No.  On emacs-pretest-bug I already explained that the original (test)
file doesn't include the A circumflex, that means the euro is preceded
by a newline.  (Maybe it would be better to continue the discussion in
the thread on emacs-pretest-bug in order to avoid repetition?)

If I insert a space or a random ASCII character before the Euro sign
and evaluate the form above (using windows-1252 for the encoding) the
encoding is being identified correctly and both the u umlaut and the
Euro sign are being displayed correctly.

> What does `C-h C RET' say after visiting the file?

In case the encoding is not identified correctly:

,----
| Coding system for saving this buffer:
|   t -- raw-text-dos
| 
| Default coding system (for new files):
|   1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
| Coding system for keyboard input:
|   1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
| Coding system for terminal output:
|   1 -- iso-8859-1 (alias of iso-latin-1)
| 
| Defaults for subprocess I/O:
|   decoding: 1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
|   encoding: 1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
| 
| Priority order for recognizing coding systems when reading files:
|   1. iso-latin-1 (alias: iso-8859-1 latin-1)
|   2. mule-utf-8 (alias: utf-8)
|   3. mule-utf-16be-with-signature (alias: utf-16be-with-signature mule-utf-16-be utf-16-be)
|   4. mule-utf-16le-with-signature (alias: utf-16le-with-signature mule-utf-16-le utf-16-le)
|   5. iso-2022-jp (alias: junet)
|   6. iso-2022-7bit 
|   7. iso-2022-7bit-lock (alias: iso-2022-int-1)
|   8. iso-2022-8bit-ss2 
|   9. emacs-mule 
|   10. raw-text 
|   11. japanese-shift-jis (alias: shift_jis sjis cp932)
|   12. chinese-big5 (alias: big5 cn-big5 cp950)
|   13. no-conversion 
| 
|   Other coding systems cannot be distinguished automatically
|   from these, and therefore cannot be recognized automatically
|   with the present coding system priorities.
| 
|   The following are decoded correctly but recognized as iso-2022-7bit-lock:
|     iso-2022-7bit-ss2 iso-2022-7bit-lock-ss2 iso-2022-cn iso-2022-cn-ext
|     iso-2022-jp-2 iso-2022-kr
| [...]
`----

In case the coding is identified correctly:

,----
| Coding system for saving this buffer:
|   * -- windows-1252-dos
| 
| Default coding system (for new files):
|   1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
| Coding system for keyboard input:
|   1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
| Coding system for terminal output:
|   1 -- iso-8859-1 (alias of iso-latin-1)
| 
| Defaults for subprocess I/O:
|   decoding: 1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
|   encoding: 1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| [...]
`----

> I assume you're running with multibyte characters enabled.

Yes.  The relevant setting should be included in the original bug
report.

>  > And a customization is actually not what I am interested in; I'd like
>  > Emacs to figure this out by itself, out of the box.
>
> How is Emacs supposed to infer the coding system from the contents of
> that file?  If you can come up with a suitable customization, perhaps
> it will be incorporated into Emacs as the default behavior.

If I knew how to do that I would have sent a patch already.  My naive
approach would be to look for the presence of bytes which are
characteristic for Windows codepages in order to identify the encoding
as a Windows codepage.  Maybe looking at line endings can help to make
the right decision.  After the encoding was identified to be a Windows
codepage, the exact codepage could be chosen based on the language
environment.  But this suggestion is just random guesswork from my
side because I know close to nothing about what processes are involved
in identifying an encoding.
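
The second half of that idea, picking the codepage from the language
environment, might look roughly like the sketch below.  The table and
both `my/' names are hypothetical and cover only a few environments;
`current-language-environment' is the standard variable.

```elisp
(defvar my/windows-codepage-alist
  '(("Latin-1"      . windows-1252)
    ("Latin-2"      . windows-1250)
    ("Cyrillic-ISO" . windows-1251)
    ("Greek"        . windows-1253))
  "Hypothetical map from language environment name to Windows codepage.")

(defun my/windows-codepage-for-environment (&optional env)
  "Return the codepage guessed for ENV, or nil if none is listed.
ENV defaults to the value of `current-language-environment'."
  (cdr (assoc (or env current-language-environment)
              my/windows-codepage-alist)))

(my/windows-codepage-for-environment "Latin-1")
;; => windows-1252
```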

> Can Notepad display files in anything besides CP850/Windows-1252 and
> probably UTF-8 w/BOM?  E.g. can it distinguish ISO 8859-1 from ISO
> 8859-2 from ISO 8859-15?

As far as I understood Reiner on emacs-pretest-bug this is impossible
anyway.

> Yes, Windows applications simply assume you're using a proprietary
> Microsoft character set, and GNU/Linux apps prioritize support for
> standard character encodings.  Maybe all you need is
> (prefer-coding-system 'cp850)

Wouldn't that be a bit too restricted as a general solution for Emacs?

-- 
Ralf


* Re: [angeli@iwi.uni-sb.de: Coding problem with Euro sign]
  2005-12-16  8:57           ` Eli Zaretskii
@ 2005-12-16 17:59             ` Kevin Rodgers
  2005-12-17  7:19               ` Eli Zaretskii
  0 siblings, 1 reply; 32+ messages in thread
From: Kevin Rodgers @ 2005-12-16 17:59 UTC (permalink / raw)


Eli Zaretskii wrote:
 >>From: Kevin Rodgers <ihs_4664@yahoo.com>
 >>Date: Thu, 15 Dec 2005 15:02:48 -0700
 >>
 >>OK, now I've actually tried that here in Emacs 21.4 running on
 >>Unix/Solaris under X.  First it complained that cp1252 is an invalid
 >>coding system, so I found the "MS-DOS and MULE" Info node referenced
 >>from the "Coding Systems" node and tried `M-x codepage-setup'.  It
 >>wouldn't take 1252, but a quick search in that node revealed that the
 >>right number is 850.
 >
 >
 > Emacs 21.x doesn't support cp1252, and cp850 is not an equivalent of
 > cp1252.

Ah, I misunderstood the following passage from the MS-DOS and MULE info
node:

,----
|    MS-Windows provides its own codepages, which are different from the
| DOS codepages for the same locale.  For example, DOS codepage 850
| supports the same character set as Windows codepage 1252; DOS codepage
| 855 supports the same character set as Windows codepage 1251, etc.  The
| MS-Windows version of Emacs uses the current codepage for display when
| invoked with the `-nw' option.
`----

So 850 and 1252 assign the same set of characters to different code
points?

-- 
Kevin Rodgers


* Re: [angeli@iwi.uni-sb.de: Coding problem with Euro sign]
  2005-12-16 11:55           ` Ralf Angeli
@ 2005-12-16 22:58             ` Kevin Rodgers
  2005-12-17  7:36               ` Eli Zaretskii
  2005-12-17 10:47               ` Reiner Steib
  2006-01-10 12:38             ` windows-XXXX and cpXXXX Kenichi Handa
  1 sibling, 2 replies; 32+ messages in thread
From: Kevin Rodgers @ 2005-12-16 22:58 UTC (permalink / raw)


Ralf Angeli wrote:
 > Currently I am on GNU/Linux.  Anyway, with the development version of
 > Emacs I did not have the problems with cp1252 you described when
 > loading the file.  But when trying to write the file I got this
 > warning:
 >
 > ,----
 > | Warning (:warning): Invalid coding system `cp1252' is specified
 > | for the current buffer/file by the variable `auto-coding-regexp-alist'.
 > | It is highly recommended to fix it before writing to a file.
 > `----
 >
 > I didn't do `M-x codepage-setup RET' before trying all of this.
 > Interestingly loading and writing the file worked fine if I used
 > windows-1252 instead of cp1252.

Well, there you go.  Emacs 22.0 supports windows-1252, and Emacs 21.4
only supports cp850.

 > * Kevin Rodgers (2005-12-15) writes:
 >>One other detail: that entry only sets the coding system if the euro
 >>is immediately preceded by an ASCII character.  Is that the case in
 >>your file?
 >
 > No.  On emacs-pretest-bug I already explained that the original (test)
 > file doesn't include the A circumflex, that means the euro is preceded
 > by a newline.  (Maybe it would be better to continue the discussion in
 > the thread on emacs-pretest-bug in order to avoid repetition?)

Ah.  The regexp only matched the [\200-\237] characters after a
non-control ASCII character.  So [\040-\177] needs to be expanded, at
least to [\t\n\r\040-\177] to include tab and newline sequences, but
maybe [\t\n\r\v\f\040-\177] to include vertical tab and formfeed, or
even [\000-\177] to include all ASCII characters.
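
With the widest of those character classes, and the `windows-1252'
name that worked for both reading and writing, the entry would look
something like this.  It is a sketch of a personal customization only,
and the caveats about `auto-coding-regexp-alist' raised earlier in the
thread still apply.

```elisp
;; Treat any ASCII byte followed by a byte in 0x80-0x9F as a sign of
;; windows-1252.  Other encodings use these bytes too, so this can
;; misfire on files that are not windows-1252.
(add-to-list 'auto-coding-regexp-alist
             '("[\000-\177][\200-\237]" . windows-1252))
```

A byte in that range at the very start of a file has no preceding
character and would still not match.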

(I don't subscribe to emacs-pretest-bug, I read the gnu.emacs.devel
newsgroup on gmane.org, which is gatewayed to and from the
emacs-devel@gnu.org mailing list.  If you followed up to both mailing
lists/newsgroups that should solve the problem.)

 > If I insert a space or a random ASCII character before the Euro sign
 > and evaluate the form above (using windows-1252 for the encoding) the
 > encoding is being identified correctly and both the u umlaut and the
 > Euro sign are being displayed correctly.

Good!

...

 >>How is Emacs supposed to infer the coding system from the contents of
 >>that file?  If you can come up with a suitable customization, perhaps
 >>it will be incorporated into Emacs as the default behavior.
 >
 > If I knew how to do that I would have sent a patch already.  My naive
 > approach would be to look for the presence of bytes which are
 > characteristic for Windows codepages in order to identify the encoding
 > as a Windows codepage.

Right, but a single byte is not enough information to identify the
character encoding.  Even a pattern is not enough, since coding systems
may differ only in what characters are assigned to the same byte
sequence: sometimes you need "out of band" information.

Have you read the Recognize Coding node (aka Recognizing Coding Systems)
of the Emacs manual?

The Emacs implementors are less naive than you and me.  :-)

 > Maybe looking at line endings can help to make the right decision.

That would be a very weak heuristic indeed.  As I understand it, Emacs is
very conservative in this regard: if a buffer contains only single \r
sequences, it's mac; if it contains only \n sequences, it's unix; if it
contains only \r\n sequences, it's DOS; but if it contains a mix, it is
indeterminate.
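
That description can be written down as a small classifier.  This is a
sketch of the rule as stated here, not the code Emacs itself uses, and
both function names are invented for the example.

```elisp
(defun my/count-matches-in-string (regexp string)
  "Return the number of non-overlapping matches of REGEXP in STRING."
  (let ((count 0)
        (start 0))
    (while (string-match regexp string start)
      (setq count (1+ count)
            start (match-end 0)))
    count))

(defun my/guess-eol-type (text)
  "Classify TEXT as `dos', `unix' or `mac'; return nil if mixed or absent."
  (let* ((crlf (my/count-matches-in-string "\r\n" text))
         ;; Lone CRs and lone LFs, not counting those inside CRLF pairs.
         (cr (- (my/count-matches-in-string "\r" text) crlf))
         (lf (- (my/count-matches-in-string "\n" text) crlf)))
    (cond ((and (> crlf 0) (= cr 0) (= lf 0)) 'dos)
          ((and (> lf 0) (= crlf 0) (= cr 0)) 'unix)
          ((and (> cr 0) (= crlf 0) (= lf 0)) 'mac)
          (t nil))))

(my/guess-eol-type "one\r\ntwo\r\n")   ; => dos
(my/guess-eol-type "one\ntwo\r\n")     ; => nil (mixed)
```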

 > After the encoding was identified to be a Windows
 > codepage, the exact codepage could be chosen based on the language
 > environment.  But this suggestion is just random guesswork from my
 > side because I know close to nothing about what processes are involved
 > in identifying an encoding.

Me neither, your idea sounds reasonable to me.  But I don't understand
why auto-coding-regexp-alist has such a high priority (over the coding:
tag).

 >>Can Notepad display files in anything besides CP850/Windows-1252 and
 >>probably UTF-8 w/BOM?  E.g. can it distinguish ISO 8859-1 from ISO
 >>8859-2 from ISO 8859-15?
 >
 > As far as I understood Reiner on emacs-pretest-bug this is impossible
 > anyway.

Just as windows-1252 can't be distinguished reliably from any other
coding systems that use bytes [\200-\237].

 >>Yes, Windows applications simply assume you're using a proprietary
 >>Microsoft character set, and GNU/Linux apps prioritize support for
 >>standard character encodings.  Maybe all you need is
 >>(prefer-coding-system 'cp850)
 >
 > Wouldn't that be a bit too restricted as a general solution for Emacs?

Of course.  But we don't know whether this is a general problem for
Emacs or a specific problem for your configuration, nor in either case
whether it's a problem that can be solved.  As a scientist I'd like to
solve the most general case, but as an engineer I'd like to start by
solving the particular problem you've identified.

-- 
Kevin Rodgers

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [angeli@iwi.uni-sb.de: Coding problem with Euro sign]
  2005-12-16 17:59             ` Kevin Rodgers
@ 2005-12-17  7:19               ` Eli Zaretskii
  0 siblings, 0 replies; 32+ messages in thread
From: Eli Zaretskii @ 2005-12-17  7:19 UTC (permalink / raw)
  Cc: emacs-devel

> From: Kevin Rodgers <ihs_4664@yahoo.com>
> Date: Fri, 16 Dec 2005 10:59:02 -0700
> 
> Ah, I misunderstood the following passage from the MS-DOS and MULE info
> node:
> 
> ,----
> |    MS-Windows provides its own codepages, which are different from the
> | DOS codepages for the same locale.  For example, DOS codepage 850
> | supports the same character set as Windows codepage 1252; DOS codepage
> | 855 supports the same character set as Windows codepage 1251, etc.  The
> | MS-Windows version of Emacs uses the current codepage for display when
> | invoked with the `-nw' option.
> `----
> 
> So 850 and 1252 assign the same set of characters to different code
> points?

Yes.  In the quoted paragraph, ``the same character set'' refers to
the _underlying_ Emacs character set: Latin-1, Latin-2, etc.  cpNNN
are not character sets, they are encodings (a.k.a. coding systems) of
the character sets they support.
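
A short Python sketch may help here (Python's codec names are used,
not Emacs coding systems, and the helper name encoded_bytes is made up
for the example).  It encodes the same character with both code pages
and shows that the resulting byte differs.

```python
def encoded_bytes(text: str, encodings=("cp850", "cp1252")) -> dict:
    """Return the hex form of TEXT encoded with each of ENCODINGS."""
    return {enc: text.encode(enc).hex() for enc in encodings}


if __name__ == "__main__":
    for ch in ("\u00fc", "\u00e9"):  # u umlaut, e acute
        print(ch, encoded_bytes(ch))
```

For u umlaut this gives 81 under cp850 and fc under cp1252: the same
character, at different code points.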

Also note that in the CVS version of Emacs, codepage.el is used only
by the MS-DOS build; other builds, including the MS-Windows port, use
code-pages.el, which is an entirely different implementation of the
same functionality.

* Re: [angeli@iwi.uni-sb.de: Coding problem with Euro sign]
  2005-12-16 22:58             ` Kevin Rodgers
@ 2005-12-17  7:36               ` Eli Zaretskii
  2005-12-17 10:47               ` Reiner Steib
  1 sibling, 0 replies; 32+ messages in thread
From: Eli Zaretskii @ 2005-12-17  7:36 UTC (permalink / raw)
  Cc: emacs-devel

> From: Kevin Rodgers <ihs_4664@yahoo.com>
> Date: Fri, 16 Dec 2005 15:58:22 -0700
> 
> Ralf Angeli wrote:
>  > Currently I am on GNU/Linux.  Anyway, with the development version of
>  > Emacs I did not have the problems with cp1252 you described when
>  > loading the file.  But when trying to write the file I got this
>  > warning:
>  >
>  > ,----
>  > | Warning (:warning): Invalid coding system `cp1252' is specified
>  > | for the current buffer/file by the variable `auto-coding-regexp-alist'.
>  > | It is highly recommended to fix it before writing to a file.
>  > `----
>  >
>  > I didn't do `M-x codepage-setup RET' before trying all of this.
>  > Interestingly loading and writing the file worked fine if I used
>  > windows-1252 instead of cp1252.
> 
> Well, there you go.  Emacs 22.0 supports windows-1252, and Emacs 21.4
> only supports cp850.

As I wrote elsewhere in this thread, Emacs 22 has a new implementation
of the code page support (code-pages.el), which doesn't need
codepage-setup, and also supports windows-1252 and other similar
encoding names.  Emacs 21 didn't have that; thus the differences in
behavior described above.

>  > After the encoding was identified to be a Windows
>  > codepage, the exact codepage could be chosen based on the language
>  > environment.  But this suggestion is just random guesswork from my
>  > side because I know close to nothing about what processes are involved
>  > in identifying an encoding.
> 
> Me neither, your idea sounds reasonable to me.

I may be missing something, but where is the problem?  Emacs already
uses the Windows language environment to select the default code
page.  (I didn't track this long thread, so perhaps I don't understand
the issue you were discussing here.)  Are you talking about the
default code page, or something else?  If that's not the default code
page, then the language environment is not a good guide to decide the
encoding.

> I don't understand why auto-coding-regexp-alist has such a high
> priority (over the coding: tag).

Because we want RMAIL Babyl files to be recognized and read with no
decoding, even if something or someone tagged it with a `coding' tag.

In other words, auto-coding-regexp-alist exists _precisely_ so you
could define something that depends on the file's contents, but takes
precedence over the `coding' tags.

* Re: [angeli@iwi.uni-sb.de: Coding problem with Euro sign]
  2005-12-16 22:58             ` Kevin Rodgers
  2005-12-17  7:36               ` Eli Zaretskii
@ 2005-12-17 10:47               ` Reiner Steib
  1 sibling, 0 replies; 32+ messages in thread
From: Reiner Steib @ 2005-12-17 10:47 UTC (permalink / raw)


On Fri, Dec 16 2005, Kevin Rodgers wrote:

> (I don't subscribe to emacs-pretest-bug, I read the gnu.emacs.devel
> newsgroup on gmane.org, which is gatewayed to and from the
> emacs-devel@gnu.org mailing list.  If you followed up to both mailing
> lists/newsgroups that should solve the problem.)

emacs-pretest-bug is available on Gmane as gmane.emacs.pretest.bugs.
This thread is
http://thread.gmane.org/87r78hryb1.fsf%40iwi190.iwi.uni-sb.de there.

Bye, Reiner.
-- 
       ,,,
      (o o)
---ooO-(_)-Ooo---  |  PGP key available  |  http://rsteib.home.pages.de/

* windows-XXXX and cpXXXX
  2005-12-16 11:55           ` Ralf Angeli
  2005-12-16 22:58             ` Kevin Rodgers
@ 2006-01-10 12:38             ` Kenichi Handa
  2006-01-10 19:18               ` Eli Zaretskii
  1 sibling, 1 reply; 32+ messages in thread
From: Kenichi Handa @ 2006-01-10 12:38 UTC (permalink / raw)
  Cc: emacs-devel

I changed the Subject: line.

In article <dnua00$mc2$1@sea.gmane.org>, Ralf Angeli <angeli@iwi.uni-sb.de> writes:

>>   >>(setq auto-coding-regexp-alist
>>   >>       (cons '("[\040-\177][\200-\237]" . cp1252)
>>   >>             auto-coding-regexp-alist))

Please note that even in the latest code, Emacs doesn't know
about cp1252.  It knows only windows-1252.
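
For readers unfamiliar with the quoted auto-coding-regexp-alist entry:
its regexp matches a printable ASCII byte (octal 040-177) immediately
followed by a byte in the range octal 200-237, i.e. 0x80-0x9F.  The
following Python sketch applies the same pattern to raw bytes.  It is
an illustration only; the function name has_c1_after_ascii is invented
here, and a match is merely a hint, not proof, of a Windows code page.

```python
import re

# Same pattern as the quoted Emacs regexp, written with hex escapes:
# octal 040-177 is 0x20-0x7F, octal 200-237 is 0x80-0x9F.
C1_AFTER_ASCII = re.compile(rb"[\x20-\x7f][\x80-\x9f]")


def has_c1_after_ascii(data: bytes) -> bool:
    """Return True if DATA contains an ASCII byte followed by 0x80-0x9F."""
    return C1_AFTER_ASCII.search(data) is not None
```

Valid UTF-8 never matches, because bytes in 0x80-0x9F only occur there
after a lead or continuation byte of 0x80 or above.  Any other 8-bit
encoding that assigns characters to 0x80-0x9F can match, which is why
the pattern alone cannot single out windows-1252.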

Emacs currently knows these windows-* coding systems:

  windows-1250 windows-1251 windows-1252 windows-1253
  windows-1254 windows-1255 windows-1256 windows-1257 windows-1258

and these cp* coding systems:

  cp437 cp720 cp737 cp775 cp850 cp851 cp852 cp855 cp857 cp860
  cp861 cp862 cp863 cp864 cp865 cp866 cp869 cp874 cp1125

But it seems that windows-XXXX are quite frequently called
as cpXXXX(*).  If so, I'll register cpXXXXs as aliases of the
corresponding windows-XXXXs, ok?

Note (*): IANA doesn't list cpXXXX.  It lists only cpXXXs
as aliases of IBMXXXs.  And I've never seen a complaint
saying that Emacs doesn't know about IBMXXX.

---
Kenichi Handa
handa@m17n.org

* Re: windows-XXXX and cpXXXX
  2006-01-10 12:38             ` windows-XXXX and cpXXXX Kenichi Handa
@ 2006-01-10 19:18               ` Eli Zaretskii
  2006-01-11 11:35                 ` Kenichi Handa
  0 siblings, 1 reply; 32+ messages in thread
From: Eli Zaretskii @ 2006-01-10 19:18 UTC (permalink / raw)
  Cc: angeli, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> Date: Tue, 10 Jan 2006 21:38:01 +0900
> Cc: emacs-devel@gnu.org
> 
> Please note that even in the latest code, Emacs doesn't know
> about cp1252.  It knows only windows-1252.
> 
> Emacs currently knows these windows-* coding systems:
> 
>   windows-1250 windows-1251 windows-1252 windows-1253
>   windows-1254 windows-1255 windows-1256 windows-1257 windows-1258
> 
> and these cp* coding systems:
> 
>   cp437 cp720 cp737 cp775 cp850 cp851 cp852 cp855 cp857 cp860
>   cp861 cp862 cp863 cp864 cp865 cp866 cp869 cp874 cp1125
> 
> But it seems that windows-XXXX are quite frequently called
> as cpXXXX(*).  If so, I'll register cpXXXXs as aliases of the
> corresponding windows-XXXXs, ok?

Won't that increase the confusion, which is IMHO already too high,
between codepage.el and code-pages.el?  The cpXXX encodings you listed
above are for DOS only, so it is okay to call them cp*.

> Note (*): IANA doesn't list cpXXXX.  It lists only cpXXXs
> as aliases of IBMXXXs.  And I've never seen a complaint
> saying that Emacs doesn't know about IBMXXX.

That's because DOS never pretended to be a net-connected platform, so
it was unimportant to say what encoding you worked in, since your
machine was stand-alone anyway.

* Re: windows-XXXX and cpXXXX
  2006-01-10 19:18               ` Eli Zaretskii
@ 2006-01-11 11:35                 ` Kenichi Handa
  2006-01-11 17:46                   ` Eli Zaretskii
  0 siblings, 1 reply; 32+ messages in thread
From: Kenichi Handa @ 2006-01-11 11:35 UTC (permalink / raw)
  Cc: angeli, emacs-devel

In article <uzmm3rivy.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:
>>  Emacs currently knows these windows-* coding systems:
>>  
>>    windows-1250 windows-1251 windows-1252 windows-1253
>>    windows-1254 windows-1255 windows-1256 windows-1257 windows-1258
>>  
>>  and these cp* coding systems:
>>  
>>    cp437 cp720 cp737 cp775 cp850 cp851 cp852 cp855 cp857 cp860
>>    cp861 cp862 cp863 cp864 cp865 cp866 cp869 cp874 cp1125

Here, I forgot to add cp1251 and cp9XXs.

>>  But it seems that windows-XXXX are quite frequently called
>>  as cpXXXX(*).  If so, I'll register cpXXXXs as aliases of the
>>  corresponding windows-XXXXs, ok?

> Won't that increase the confusion, which is IMHO already too high,
> between codepage.el and code-pages.el?  

They already support a coding system of the same name
(e.g. cp720) in a different way.  What kind of confusion
does making aliases cp125[02345678] increase?

> The cpXXX encodings you listed above are for DOS only, so
> it is okay to call them cp*.

I didn't intend to change them.  My suggestion is just to
make cp125[02345678] aliases of windows-125[02345678].
Actually, cp1251 is already an alias of windows-1251.

---
Kenichi Handa
handa@m17n.org

* Re: windows-XXXX and cpXXXX
  2006-01-11 11:35                 ` Kenichi Handa
@ 2006-01-11 17:46                   ` Eli Zaretskii
  2006-01-12  1:25                     ` Kenichi Handa
  0 siblings, 1 reply; 32+ messages in thread
From: Eli Zaretskii @ 2006-01-11 17:46 UTC (permalink / raw)
  Cc: angeli, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> CC: angeli@iwi.uni-sb.de, emacs-devel@gnu.org
> Date: Wed, 11 Jan 2006 20:35:34 +0900
> 
> >>    cp437 cp720 cp737 cp775 cp850 cp851 cp852 cp855 cp857 cp860
> >>    cp861 cp862 cp863 cp864 cp865 cp866 cp869 cp874 cp1125
> 
> Here, I forgot to add cp1251 and cp9XXs.
> 
> >>  But it seems that windows-XXXX are quite frequently called
> >>  as cpXXXX(*).  If so, I'll register cpXXXXs as aliases of the
> >>  corresponding windows-XXXXs, ok?
> 
> > Won't that increase the confusion, which is IMHO already too high,
> > between codepage.el and code-pages.el?  
> 
> They already support a coding system of the same name
> (e.g. cp720) in a different way.

Then I'd rather remove DOS codepages (as opposed to Windows codepages)
from code-pages.el, than add more cpNNN encodings to it.

> What kind of confusion does making aliases cp125[02345678] increase?

Confusion between codepage.el and code-pages.el.  They are different
and subtly incompatible, but define symbols that are almost identical.
As code-pages.el cannot be used in the MS-DOS port, we cannot throw
away codepage.el.  Thus, I think DOS codepages (whose names are cpNNN)
should be provided only by codepage.el.

> I didn't intend to change them.  My suggestion is just to
> make cp125[02345678] aliases of windows-125[02345678].

My concern would be how a user is to know which library of the two she
is using, or should use in a given situation.  I already saw several
confused users on help-gnu-emacs.  So I think we need to sanitize
these two libraries from each other's namespace.  Adding cpNNN aliases
would be a step in the wrong direction.

Alternatively, someone who has more time than I do could add to
code-pages.el what it is lacking now to fully support the MS-DOS port,
and then we could toss codepage.el and add the aliases you asked about
to code-pages.el.

* Re: windows-XXXX and cpXXXX
  2006-01-11 17:46                   ` Eli Zaretskii
@ 2006-01-12  1:25                     ` Kenichi Handa
  2006-01-12  4:33                       ` Eli Zaretskii
  0 siblings, 1 reply; 32+ messages in thread
From: Kenichi Handa @ 2006-01-12  1:25 UTC (permalink / raw)
  Cc: angeli, emacs-devel

In article <u7j9664ke.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

>>  From: Kenichi Handa <handa@m17n.org>
>>  CC: angeli@iwi.uni-sb.de, emacs-devel@gnu.org
>>  Date: Wed, 11 Jan 2006 20:35:34 +0900
>>  
>>  >>    cp437 cp720 cp737 cp775 cp850 cp851 cp852 cp855 cp857 cp860
>>  >>    cp861 cp862 cp863 cp864 cp865 cp866 cp869 cp874 cp1125
>>  
>>  Here, I forgot to add cp1251 and cp9XXs.
>>  
>>  >>  But it seems that windows-XXXX are quite frequently called
>>  >>  as cpXXXX(*).  If so, I'll register cpXXXXs as aliases of the
>>  >>  corresponding windows-XXXXs, ok?
>>  
>>  > Won't that increase the confusion, which is IMHO already too high,
>>  > between codepage.el and code-pages.el?  
>>  
>>  They already support a coding system of the same name
>>  (e.g. cp720) in a different way.

> Then I'd rather remove DOS codepages (as opposed to Windows codepages)
> from code-pages.el, than add more cpNNN encodings to it.

>>  What kind of confusion does making aliases cp125[02345678] increase?

> Confusion between codepage.el and code-pages.el.  They are different
> and subtly incompatible, but define symbols that are almost identical.
> As code-pages.el cannot be used in the MS-DOS port, we cannot throw
> away codepage.el.  Thus, I think DOS codepages (whose names are cpNNN)
> should be provided only by codepage.el.

I don't know the distinction between cpNNN and cpNNNN.

If cpNNN are only for DOS and are never used in the other
environment, I agree that having cpNNN in code-pages.el is
useless.

But, as for cpNNNN,
<http://www.microsoft.com/typography/unicode/cscp.htm> says
that Windows uses codepages 125[012345678].  And if it's a
convention to refer to them by names cp125X, shouldn't we
provide those names for non-DOS users?

>>  I didn't intend to change them.  My suggestion is just to
>>  make cp125[02345678] aliases of windows-125[02345678].

> My concern would be how a user is to know which library of the two she
> is using, or should use in a given situation.

DOS users use codepage.el.  The other users use
code-pages.el.  Isn't it clear?

> I already saw several confused users on help-gnu-emacs.
> So I think we need to sanitize these two libraries from
> each other's namespace.  Adding cpNNN aliases would be a
> step in the wrong direction.

> Alternatively, someone who has more time than I do could add to
> code-pages.el what it is lacking now to fully support the MS-DOS port,
> and then we could toss codepage.el and add the aliases you asked about
> to code-pages.el.

Unfortunately, I don't have the time to merge them either. :-(

---
Kenichi Handa
handa@m17n.org

* Re: windows-XXXX and cpXXXX
  2006-01-12  1:25                     ` Kenichi Handa
@ 2006-01-12  4:33                       ` Eli Zaretskii
  2006-01-12  8:29                         ` Werner LEMBERG
  2006-01-12 13:23                         ` Kenichi Handa
  0 siblings, 2 replies; 32+ messages in thread
From: Eli Zaretskii @ 2006-01-12  4:33 UTC (permalink / raw)
  Cc: angeli, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> CC: angeli@iwi.uni-sb.de, emacs-devel@gnu.org
> Date: Thu, 12 Jan 2006 10:25:13 +0900
> 
> > Confusion between codepage.el and code-pages.el.  They are different
> > and subtly incompatible, but define symbols that are almost identical.
> > As code-pages.el cannot be used in the MS-DOS port, we cannot throw
> > away codepage.el.  Thus, I think DOS codepages (whose names are cpNNN)
> > should be provided only by codepage.el.
> 
> I don't know the distinction between cpNNN and cpNNNN.

That's not what I meant: I didn't mean to say that cpNNN with 3-digit
numbers and cpNNNN with 4 digits are different in any way.

> If cpNNN are only for DOS and are never used in the other
> environment, I agree that having cpNNN in code-pages.el is
> useless.

I don't think we can make such a distinction.  What I'd like to
suggest is that code-pages.el and codepage.el use different names for
them, just for the sake of the user.  For example, code-pages.el could
use ibmNNN or something.

> But, as for cpNNNN,
> <http://www.microsoft.com/typography/unicode/cscp.htm> says
> that Windows uses codepages 125[012345678].  And if it's a
> convention to refer to them by names cp125X, shouldn't we
> provide those names for non-DOS users?

I don't know about such a convention, outside Emacs.  windows-NNNN is
a notation used by IANA, but cpNNNN is not.

> > My concern would be how a user is to know which library of the two she
> > is using, or should use in a given situation.
> 
> DOS users use codepage.el.  The other users use
> code-pages.el.  Isn't it clear?

It turns out people don't know this, and it certainly is unclear for
them.  I mean users, not Emacs developers.

* Re: windows-XXXX and cpXXXX
  2006-01-12  4:33                       ` Eli Zaretskii
@ 2006-01-12  8:29                         ` Werner LEMBERG
  2006-01-12 19:56                           ` Eli Zaretskii
  2006-01-12 13:23                         ` Kenichi Handa
  1 sibling, 1 reply; 32+ messages in thread
From: Werner LEMBERG @ 2006-01-12  8:29 UTC (permalink / raw)
  Cc: emacs-devel, angeli, handa

> > DOS users use codepage.el.  The other users use
> > code-pages.el.  Isn't it clear?
> 
> It turns out people don't know this, and it certainly is unclear for
> them.  I mean users, not Emacs developers.

Prior to everything else I suggest to rename `codepage.el' to, say,
`ms-cpxxxx.el' (maybe you find a better name).  This gives at least a
small hint in the file name itself what it contains.


    Werner


PS: On 8+3 systems, the proposed file name is appropriately shortened
    to `ms-cpxxx.el' :-)

* Re: windows-XXXX and cpXXXX
  2006-01-12  4:33                       ` Eli Zaretskii
  2006-01-12  8:29                         ` Werner LEMBERG
@ 2006-01-12 13:23                         ` Kenichi Handa
  2006-01-12 19:59                           ` Eli Zaretskii
  1 sibling, 1 reply; 32+ messages in thread
From: Kenichi Handa @ 2006-01-12 13:23 UTC (permalink / raw)
  Cc: angeli, emacs-devel

Oops, I misunderstood code-pages.el.  It already makes
cp125X aliases of windows-125X.  The reason why only
windows-125X are seen in my environment is that
code-pages.el is not loaded (it is loaded by default only in
such language environments as Latin-6/7, Lithuanian, etc).

Hmmm, the current situation is more confusing than I have
thought.  Eli, to fix this confusion, it seems that there's
no easy way other than making cp125X available by default.

---
Kenichi Handa
handa@m17n.org

* Re: windows-XXXX and cpXXXX
  2006-01-12  8:29                         ` Werner LEMBERG
@ 2006-01-12 19:56                           ` Eli Zaretskii
  0 siblings, 0 replies; 32+ messages in thread
From: Eli Zaretskii @ 2006-01-12 19:56 UTC (permalink / raw)
  Cc: emacs-devel, angeli, handa

> Date: Thu, 12 Jan 2006 09:29:41 +0100 (CET)
> Cc: handa@m17n.org, angeli@iwi.uni-sb.de, emacs-devel@gnu.org
> From: Werner LEMBERG <wl@gnu.org>
> 
> Prior to everything else I suggest to rename `codepage.el' to, say,
> `ms-cpxxxx.el' (maybe you find a better name).  This gives at least a
> small hint in the file name itself what it contains.

The problem is not the file name (typical Emacs users don't have any
idea what files are loaded when they need some feature), it's the
symbols codepage.el defines.

* Re: windows-XXXX and cpXXXX
  2006-01-12 13:23                         ` Kenichi Handa
@ 2006-01-12 19:59                           ` Eli Zaretskii
  2006-01-13  0:58                             ` Kenichi Handa
  0 siblings, 1 reply; 32+ messages in thread
From: Eli Zaretskii @ 2006-01-12 19:59 UTC (permalink / raw)
  Cc: angeli, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> CC: angeli@iwi.uni-sb.de, emacs-devel@gnu.org
> Date: Thu, 12 Jan 2006 22:23:58 +0900
> 
> Hmmm, the current situation is more confusing than I have
> thought.  Eli, to fix this confusion, it seems that there's
> no easy way other than making cp125X available by default.

What confusion is that?

Also, didn't you just say that cp125X are already available as
aliases?  So what do you mean by ``making cp125X available''?

* Re: windows-XXXX and cpXXXX
  2006-01-12 19:59                           ` Eli Zaretskii
@ 2006-01-13  0:58                             ` Kenichi Handa
  2006-01-13  8:52                               ` Eli Zaretskii
  0 siblings, 1 reply; 32+ messages in thread
From: Kenichi Handa @ 2006-01-13  0:58 UTC (permalink / raw)
  Cc: angeli, emacs-devel

In article <ulkxli5fb.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

>>  Hmmm, the current situation is more confusing than I have
>>  thought.  Eli, to fix this confusion, it seems that there's
>>  no easy way other than making cp125X available by default.

> What confusion is that?

Even for non-DOS users, (coding-system-p 'cp1252) returns
nil or t depending on his locale.

> Also, didn't you just say that cp125X are already available as
> aliases?  So what do you mean by ``making cp125X available''?

code-pages.el, when loaded, defines cp125X as aliases of
windows-125X.  But, as windows-125X defined in code-pages.el
have autoload cookies, they are always available to users
regardless of their locale.

What I mean by "making cp125X available by default" is
making cp125X always available to users regardless of their
locale.

---
Kenichi Handa
handa@m17n.org

* Re: windows-XXXX and cpXXXX
  2006-01-13  0:58                             ` Kenichi Handa
@ 2006-01-13  8:52                               ` Eli Zaretskii
  2006-01-13 11:50                                 ` Kenichi Handa
  2006-01-13 14:45                                 ` Stefan Monnier
  0 siblings, 2 replies; 32+ messages in thread
From: Eli Zaretskii @ 2006-01-13  8:52 UTC (permalink / raw)
  Cc: angeli, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> CC: angeli@iwi.uni-sb.de, emacs-devel@gnu.org
> Date: Fri, 13 Jan 2006 09:58:42 +0900
> 
> Even for non-DOS users, (coding-system-p 'cp1252) returns
> nil or t depending on his locale.

Why would non-DOS users need cp1252?  Why can't they use windows-1252,
which is the official IANA name of that encoding?

The fact that people get confused by what the Emacs manual says about
this (which I believe is what led us to this discussion) can be taken
care of by fixing the manual.  There's no need to modify code for that.

In short, cp1252 is something we invented in codepage.el.  So we might
as well retire those cpXXX symbols for good, on all platforms (except
DOS, which must use codepage.el).  That is my suggestion: let's remove
cpXXX and cpXXXX aliases from code-pages.el, and let's start educating
users to use windows-XXXX instead.

* Re: windows-XXXX and cpXXXX
  2006-01-13  8:52                               ` Eli Zaretskii
@ 2006-01-13 11:50                                 ` Kenichi Handa
  2006-01-13 12:59                                   ` Eli Zaretskii
  2006-01-13 14:45                                 ` Stefan Monnier
  1 sibling, 1 reply; 32+ messages in thread
From: Kenichi Handa @ 2006-01-13 11:50 UTC (permalink / raw)
  Cc: angeli, emacs-devel

In article <ulkxkse69.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> Why would non-DOS users need cp1252?  Why can't they use windows-1252,
> which is the official IANA name of that encoding?

> The fact that people get confused by what the Emacs manual says about
> this (which I believe is what led us to this discussion) can be taken
> care of by fixing the manual.  There's no need to modify code for that.

> In short, cp1252 is something we invented in codepage.el.  So we might
> as well retire those cpXXX symbols for good, on all platforms (except
> DOS, which must use codepage.el).  That is my suggestion: let's remove
> cpXXX and cpXXXX aliases from code-pages.el, and let's start educating
> users to use windows-XXXX instead.

At least cpXXX can't be replaced by windows-XXX.  Do you
suggest using the official name ibmXXX instead?

But I've just searched for the name cp1252 on Google and found this:

iconv has supported cpXXXX for a fairly long time.  CPXXXX names are
also used on this page:
  <http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/>
It seems that Java (JVM) also supports the name cp1252.

So, I think the name cpXXXXs have already acquired
citizenship, and thus it is a lack of sophistication not to
support the name cpXXXX.
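
As one more data point beyond the sources listed above, Python's codec
registry also accepts both spellings.  The sketch below is an
illustration only; the helper name same_codec is invented for it.  It
checks whether two names resolve to the same codec using the standard
codecs.lookup function.

```python
import codecs


def same_codec(name_a: str, name_b: str) -> bool:
    """Return True if both names resolve to the same Python codec."""
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name


if __name__ == "__main__":
    print(same_codec("cp1252", "windows-1252"))
    print(same_codec("cp850", "ibm850"))
    print(same_codec("cp850", "cp1252"))
```

There, cp1252 and windows-1252 name one codec, as do cp850 and ibm850,
while cp850 and cp1252 remain distinct.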

Of course, I agree that the right thing is to somehow
merge codepage.el and code-pages.el, and provide the same
definition of a coding system to both DOS and non-DOS users.
But we don't have the manpower to work on it now.

---
Kenichi Handa
handa@m17n.org

* Re: windows-XXXX and cpXXXX
  2006-01-13 11:50                                 ` Kenichi Handa
@ 2006-01-13 12:59                                   ` Eli Zaretskii
  2006-01-16  1:05                                     ` Kenichi Handa
  0 siblings, 1 reply; 32+ messages in thread
From: Eli Zaretskii @ 2006-01-13 12:59 UTC (permalink / raw)
  Cc: angeli, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> CC: angeli@iwi.uni-sb.de, emacs-devel@gnu.org
> Date: Fri, 13 Jan 2006 20:50:07 +0900
> 
> At least cpXXX can't be replaced by windows-XXX.  Do you
> suggest using the official name ibmXXX instead?

Yes, in code-pages.el I suggest that we do that.

> But I've just searched for the name cp1252 on Google and found this:
> 
> iconv has supported cpXXXX for a fairly long time.  CPXXXX names are
> also used on this page:
>   <http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/>
> It seems that Java (JVM) also supports the name cp1252.
> 
> So, I think the name cpXXXXs have already acquired
> citizenship, and thus it is a lack of sophistication not to
> support the name cpXXXX.

That's too bad.

* Re: windows-XXXX and cpXXXX
  2006-01-13  8:52                               ` Eli Zaretskii
  2006-01-13 11:50                                 ` Kenichi Handa
@ 2006-01-13 14:45                                 ` Stefan Monnier
  1 sibling, 0 replies; 32+ messages in thread
From: Stefan Monnier @ 2006-01-13 14:45 UTC (permalink / raw)
  Cc: emacs-devel, angeli, Kenichi Handa

>> Even for non-DOS users, (coding-system-p 'cp1252) returns
>> nil or t depending on his locale.

> Why would non-DOS users need cp1252?  Why can't they use windows-1252,
> which is the official IANA name of that encoding?

> The fact that people get confused by what the Emacs manual says about
> this (which I believe is what led us to this discussion) can be taken
> care of by fixing the manual.  There's no need to modify code for that.

> In short, cp1252 is something we invented in codepage.el.  So we might
> as well retire those cpXXX symbols for good, on all platforms (except
> DOS, which must use codepage.el).  That is my suggestion: let's remove
> cpXXX and cpXXXX aliases from code-pages.el, and let's start educating
> users to use windows-XXXX instead.

For what it's worth, I think it makes sense, although what would make even
more sense is to get code-pages working in the DOS port.


        Stefan

* Re: windows-XXXX and cpXXXX
  2006-01-13 12:59                                   ` Eli Zaretskii
@ 2006-01-16  1:05                                     ` Kenichi Handa
  2006-01-16  4:31                                       ` Eli Zaretskii
  0 siblings, 1 reply; 32+ messages in thread
From: Kenichi Handa @ 2006-01-16  1:05 UTC (permalink / raw)
  Cc: angeli, emacs-devel

In article <uace0s2qs.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

>>  So, I think the name cpXXXXs have already acquired
>>  citizenship, and thus it is a lack of sophistication not to
>>  support the name cpXXXX.

> That's too bad.

Yes, that's too bad.  So, what should we do?

---
Kenichi Handa
handa@m17n.org

* Re: windows-XXXX and cpXXXX
  2006-01-16  1:05                                     ` Kenichi Handa
@ 2006-01-16  4:31                                       ` Eli Zaretskii
  2006-01-16 12:11                                         ` Kenichi Handa
  0 siblings, 1 reply; 32+ messages in thread
From: Eli Zaretskii @ 2006-01-16  4:31 UTC (permalink / raw)
  Cc: angeli, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> CC: angeli@iwi.uni-sb.de, emacs-devel@gnu.org
> Date: Mon, 16 Jan 2006 10:05:43 +0900
> 
> In article <uace0s2qs.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:
> 
> >>  So, I think the name cpXXXXs have already acquired
> >>  citizenship, and thus it is a lack of sophistication not to
> >>  support the name cpXXXX.
> 
> > That's too bad.
> 
> Yes, that's too bad.  So, what should we do?

If you think it's imperative that we let users have cp* encodings, go
ahead and install the changes.

* Re: windows-XXXX and cpXXXX
  2006-01-16  4:31                                       ` Eli Zaretskii
@ 2006-01-16 12:11                                         ` Kenichi Handa
  0 siblings, 0 replies; 32+ messages in thread
From: Kenichi Handa @ 2006-01-16 12:11 UTC (permalink / raw)
  Cc: angeli, emacs-devel

In article <uirsk6beu.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> If you think it's imperative that we let users have cp* encodings, go
> ahead and install the changes.

Ok, just done.  What I've done is mainly just adding these
lines in code-pages.el.

;;;###autoload(autoload-coding-system 'cp1250 '(require 'code-pages))

---
Kenichi Handa
handa@m17n.org

end of thread, other threads:[~2006-01-16 12:11 UTC | newest]

Thread overview: 32+ messages
2005-12-13 23:34 [angeli@iwi.uni-sb.de: Coding problem with Euro sign] Richard M. Stallman
2005-12-14 18:56 ` Kevin Rodgers
2005-12-14 22:51   ` Ralf Angeli
2005-12-15  1:34     ` Kevin Rodgers
2005-12-15 16:20       ` Ralf Angeli
2005-12-15 22:02         ` Kevin Rodgers
2005-12-16  8:57           ` Eli Zaretskii
2005-12-16 17:59             ` Kevin Rodgers
2005-12-17  7:19               ` Eli Zaretskii
2005-12-16 11:55           ` Ralf Angeli
2005-12-16 22:58             ` Kevin Rodgers
2005-12-17  7:36               ` Eli Zaretskii
2005-12-17 10:47               ` Reiner Steib
2006-01-10 12:38             ` windows-XXXX and cpXXXX Kenichi Handa
2006-01-10 19:18               ` Eli Zaretskii
2006-01-11 11:35                 ` Kenichi Handa
2006-01-11 17:46                   ` Eli Zaretskii
2006-01-12  1:25                     ` Kenichi Handa
2006-01-12  4:33                       ` Eli Zaretskii
2006-01-12  8:29                         ` Werner LEMBERG
2006-01-12 19:56                           ` Eli Zaretskii
2006-01-12 13:23                         ` Kenichi Handa
2006-01-12 19:59                           ` Eli Zaretskii
2006-01-13  0:58                             ` Kenichi Handa
2006-01-13  8:52                               ` Eli Zaretskii
2006-01-13 11:50                                 ` Kenichi Handa
2006-01-13 12:59                                   ` Eli Zaretskii
2006-01-16  1:05                                     ` Kenichi Handa
2006-01-16  4:31                                       ` Eli Zaretskii
2006-01-16 12:11                                         ` Kenichi Handa
2006-01-13 14:45                                 ` Stefan Monnier
2005-12-16 10:35         ` [angeli@iwi.uni-sb.de: Coding problem with Euro sign] David Hansen
