From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Ralf Angeli <angeli@iwi.uni-sb.de>
Newsgroups: gmane.emacs.devel
Subject: Re: [angeli@iwi.uni-sb.de: Coding problem with Euro sign]
Date: Fri, 16 Dec 2005 12:55:47 +0100
Message-ID: <dnua00$mc2$1@sea.gmane.org>
References: <E1EmJfO-000347-9K@fencepost.gnu.org> <dnpptf$cqu$1@sea.gmane.org>
	<dnq7l2$ue1$1@sea.gmane.org> <dnqh85$okp$1@sea.gmane.org>
	<dns53n$dk4$1@sea.gmane.org> <dnsp6c$mg2$1@sea.gmane.org>
NNTP-Posting-Host: main.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: sea.gmane.org 1134760487 16333 80.91.229.2 (16 Dec 2005 19:14:47 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Fri, 16 Dec 2005 19:14:47 +0000 (UTC)
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Fri Dec 16 20:14:44 2005
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Original-Received: from lists.gnu.org ([199.232.76.165])
	by ciao.gmane.org with esmtp (Exim 4.43)
	id 1EnL0O-0003sI-7b
	for ged-emacs-devel@m.gmane.org; Fri, 16 Dec 2005 20:12:36 +0100
Original-Received: from localhost ([127.0.0.1] helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1EnL15-0001RS-QR
	for ged-emacs-devel@m.gmane.org; Fri, 16 Dec 2005 14:13:19 -0500
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1EnGpi-00065F-2Y
	for emacs-devel@gnu.org; Fri, 16 Dec 2005 09:45:18 -0500
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1EnGk8-0003x8-LZ
	for emacs-devel@gnu.org; Fri, 16 Dec 2005 09:39:41 -0500
Original-Received: from [199.232.76.173] (helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1EnEEw-00056K-Py
	for emacs-devel@gnu.org; Fri, 16 Dec 2005 06:59:12 -0500
Original-Received: from [80.91.229.2] (helo=ciao.gmane.org)
	by monty-python.gnu.org with esmtp (TLS-1.0:RSA_AES_128_CBC_SHA:16)
	(Exim 4.34) id 1EnEHL-0005GM-5n
	for emacs-devel@gnu.org; Fri, 16 Dec 2005 07:01:40 -0500
Original-Received: from list by ciao.gmane.org with local (Exim 4.43)
	id 1EnEBq-0005Ex-GZ
	for emacs-devel@gnu.org; Fri, 16 Dec 2005 12:55:58 +0100
Original-Received: from dialin-212-144-211-196.pools.arcor-ip.net ([212.144.211.196])
	by main.gmane.org with esmtp (Gmexim 0.1 (Debian))
	id 1AlnuQ-0007hv-00
	for <emacs-devel@gnu.org>; Fri, 16 Dec 2005 12:55:58 +0100
Original-Received: from angeli by dialin-212-144-211-196.pools.arcor-ip.net with local
	(Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00
	for <emacs-devel@gnu.org>; Fri, 16 Dec 2005 12:55:58 +0100
X-Injected-Via-Gmane: http://gmane.org/
Original-To: emacs-devel@gnu.org
Original-Lines: 157
Original-X-Complaints-To: usenet@sea.gmane.org
X-Gmane-NNTP-Posting-Host: dialin-212-144-211-196.pools.arcor-ip.net
User-Agent: Gnus/5.110004 (No Gnus v0.4) Emacs/22.0.50 (gnu/linux)
Cancel-Lock: sha1:lXEiqwetqnsffdk4wXOjU51lQ5E=
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:47870
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/47870>

* Kevin Rodgers (2005-12-15) writes:

> Ralf Angeli wrote:
>  > * Kevin Rodgers (2005-12-15) writes:
>  >
>  >>You could try something like this:
>  >>
>  >>(setq auto-coding-regexp-alist
>  >>       (cons '("[\040-\177][\200-\237]" . cp1252)
>  >>             auto-coding-regexp-alist))
>  >
>  > This doesn't seem to work here.  I still see the byte codes of the
>  > 8-bit characters when opening the file after evaluating the above
>  > form.
[...]
> I assume those display problems are because I haven't configured an
> Emacs fontset for the cp850 coding system.  But the
> auto-coding-regexp-alist entry worked as intended, and you're on
> Windows so your fontset should be properly configured for that.

Currently I am on GNU/Linux.  Anyway, with the development version of
Emacs I did not have the problems with cp1252 you described when
loading the file.  But when trying to write the file I got this
warning:

,----
| Warning (:warning): Invalid coding system `cp1252' is specified
| for the current buffer/file by the variable `auto-coding-regexp-alist'.
| It is highly recommended to fix it before writing to a file.
`----

I didn't do `M-x codepage-setup RET' before trying all of this.
Interestingly, loading and writing the file worked fine when I used
windows-1252 instead of cp1252.

> One other detail: that entry only sets the coding system if the euro
> is immediately preceded by an ASCII character.  Is that the case in
> your file?

No.  On emacs-pretest-bug I already explained that the original (test)
file doesn't include the A circumflex, which means the Euro sign is
preceded by a newline.  (Maybe it would be better to continue the
discussion in the thread on emacs-pretest-bug in order to avoid
repetition?)

If I insert a space or a random ASCII character before the Euro sign
and evaluate the form above (using windows-1252 for the encoding), the
encoding is identified correctly and both the u umlaut and the Euro
sign are displayed correctly.
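
A variant of the quoted form that would also cover a Euro sign at the
start of a line or of the file might look like the following.  This is
an untested sketch, not a proposed default: the widened character
class (newline added) and the beginning-of-buffer alternative are
assumptions, and it names windows-1252 because cp1252 triggered the
warning shown earlier.

```elisp
;; Untested sketch.  Matches a byte in the range 0x80-0x9F that is
;; either at the very start of the text or directly preceded by a
;; newline or a byte in the range 0x20-0x7F.
(setq auto-coding-regexp-alist
      (cons '("\\(?:\\`\\|[\n\040-\177]\\)[\200-\237]" . windows-1252)
            auto-coding-regexp-alist))
```

Keeping the "preceded by a 7-bit byte" condition should still exclude
UTF-8 text, where bytes in that range only follow a lead byte of 0xC2
or above.  An entry like this overrides the normal detection, though,
so any other file containing such a byte sequence would be treated as
windows-1252 as well.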

> What does `C-h C RET' say after visiting the file?

In case the encoding is not identified correctly:

,----
| Coding system for saving this buffer:
|   t -- raw-text-dos
| 
| Default coding system (for new files):
|   1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
| Coding system for keyboard input:
|   1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
| Coding system for terminal output:
|   1 -- iso-8859-1 (alias of iso-latin-1)
| 
| Defaults for subprocess I/O:
|   decoding: 1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
|   encoding: 1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
| 
| Priority order for recognizing coding systems when reading files:
|   1. iso-latin-1 (alias: iso-8859-1 latin-1)
|   2. mule-utf-8 (alias: utf-8)
|   3. mule-utf-16be-with-signature (alias: utf-16be-with-signature mule-utf-16-be utf-16-be)
|   4. mule-utf-16le-with-signature (alias: utf-16le-with-signature mule-utf-16-le utf-16-le)
|   5. iso-2022-jp (alias: junet)
|   6. iso-2022-7bit 
|   7. iso-2022-7bit-lock (alias: iso-2022-int-1)
|   8. iso-2022-8bit-ss2 
|   9. emacs-mule 
|   10. raw-text 
|   11. japanese-shift-jis (alias: shift_jis sjis cp932)
|   12. chinese-big5 (alias: big5 cn-big5 cp950)
|   13. no-conversion 
| 
|   Other coding systems cannot be distinguished automatically
|   from these, and therefore cannot be recognized automatically
|   with the present coding system priorities.
| 
|   The following are decoded correctly but recognized as iso-2022-7bit-lock:
|     iso-2022-7bit-ss2 iso-2022-7bit-lock-ss2 iso-2022-cn iso-2022-cn-ext
|     iso-2022-jp-2 iso-2022-kr
| [...]
`----

In case the encoding is identified correctly:

,----
| Coding system for saving this buffer:
|   * -- windows-1252-dos
| 
| Default coding system (for new files):
|   1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
| Coding system for keyboard input:
|   1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
| Coding system for terminal output:
|   1 -- iso-8859-1 (alias of iso-latin-1)
| 
| Defaults for subprocess I/O:
|   decoding: 1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
|   encoding: 1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| [...]
`----

> I assume you're running with multibyte characters enabled.

Yes.  The relevant setting should be included in the original bug
report.

>  > And a customization is actually not what I am interested in; I'd like
>  > Emacs to figure this out by itself, out of the box.
>
> How is Emacs supposed to infer the coding system from the contents of
> that file?  If you can come up with a suitable customization, perhaps
> it will be incorporated into Emacs as the default behavior.

If I knew how to do that I would have sent a patch already.  My naive
approach would be to look for the presence of bytes which are
characteristic of Windows codepages in order to identify the encoding
as a Windows codepage.  Maybe looking at line endings can help to make
the right decision.  Once the encoding has been identified as a
Windows codepage, the exact codepage could be chosen based on the
language environment.  But this suggestion is just random guesswork on
my part because I know close to nothing about what processes are
involved in identifying an encoding.
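
Purely as an illustration of that guesswork, the idea could be
sketched as a function on `auto-coding-functions', assuming that
variable is available in the Emacs in use.  Everything named `my-...'
is hypothetical, the table entries are made-up examples, and the code
is untested:

```elisp
;; Hypothetical table: which Windows codepage to assume for a given
;; language environment.  The entries are illustrative assumptions.
(defvar my-windows-codepage-alist
  '(("Latin-1" . windows-1252)
    ("German" . windows-1252)
    ("Latin-2" . windows-1250))
  "Alist mapping language environment names to Windows codepages.")

(defun my-guess-windows-codepage (size)
  "Guess a Windows codepage from the next SIZE characters after point.
Return a coding system, or nil if the text does not look like a
Windows codepage file.  Hypothetical heuristic, not part of Emacs."
  (let ((end (min (point-max) (+ (point) size)))
        ;; A byte in the range 0x80-0x9F at the start of the buffer or
        ;; right after a newline or a byte in the range 0x20-0x7F.
        (regexp "\\(?:\\`\\|[\n\040-\177]\\)[\200-\237]"))
    ;; Raw bytes are represented differently in multibyte buffers.
    (when enable-multibyte-characters
      (setq regexp (string-to-multibyte regexp)))
    (when (and
           ;; Hint 1: DOS line endings.
           (save-excursion (search-forward "\r\n" end t))
           ;; Hint 2: a byte characteristic of Windows codepages.
           (save-excursion (re-search-forward regexp end t)))
      ;; Pick the codepage from the language environment and fall
      ;; back to windows-1252 if the environment is not in the table.
      (or (cdr (assoc current-language-environment
                      my-windows-codepage-alist))
          'windows-1252))))

(add-to-list 'auto-coding-functions 'my-guess-windows-codepage)
```

Whether two hints like these are reliable enough for a default is
exactly the open question; the sketch only shows where such a
heuristic could be plugged in.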

> Can Notepad display files in anything besides CP850/Windows-1252 and
> probably UTF-8 w/BOM?  E.g. can it distinguish ISO 8859-1 from ISO
> 8859-2 from ISO 8859-15?

As far as I understood Reiner on emacs-pretest-bug, this is impossible
anyway.

> Yes, Windows applications simply assumes you're using a proprietary
> Microsoft character set, and GNU/Linux apps prioritize support for
> standard character encodings.  Maybe all you need is
> (prefer-coding-system 'cp850)

Wouldn't that be a bit too restrictive as a general solution for Emacs?

-- 
Ralf