unofficial mirror of emacs-devel@gnu.org 
From: Ralf Angeli <angeli@iwi.uni-sb.de>
Subject: Re: [angeli@iwi.uni-sb.de: Coding problem with Euro sign]
Date: Fri, 16 Dec 2005 12:55:47 +0100
Message-ID: <dnua00$mc2$1@sea.gmane.org>
In-Reply-To: dnsp6c$mg2$1@sea.gmane.org

* Kevin Rodgers (2005-12-15) writes:

> Ralf Angeli wrote:
>  > * Kevin Rodgers (2005-12-15) writes:
>  >
>  >>You could try something like this:
>  >>
>  >>(setq auto-coding-regexp-alist
>  >>       (cons '("[\040-\177][\200-\237]" . cp1252)
>  >>             auto-coding-regexp-alist))
>  >
>  > This doesn't seem to work here.  I still see the byte codes of the
>  > 8-bit characters when opening the file after evaluating the above
>  > form.
[...]
> I assume those display problems are because I haven't configured an
> Emacs fontset for the cp850 coding system.  But the
> auto-coding-regexp-alist entry worked as intended, and you're on
> Windows so your fontset should be properly configured for that.

Currently I am on GNU/Linux.  Anyway, with the development version of
Emacs I did not have the problems with cp1252 you described when
loading the file.  But when trying to write the file I got this
warning:

,----
| Warning (:warning): Invalid coding system `cp1252' is specified
| for the current buffer/file by the variable `auto-coding-regexp-alist'.
| It is highly recommended to fix it before writing to a file.
`----

I didn't do `M-x codepage-setup RET' before trying any of this.
Interestingly, loading and writing the file worked fine when I used
windows-1252 instead of cp1252.

> One other detail: that entry only sets the coding system if the euro
> is immediately preceded by an ASCII character.  Is that the case in
> your file?

No.  On emacs-pretest-bug I already explained that the original (test)
file doesn't include the A circumflex, which means the euro is preceded
by a newline.  (Maybe it would be better to continue the discussion in
the thread on emacs-pretest-bug in order to avoid repetition?)

If I insert a space or any other ASCII character before the Euro sign
and evaluate the form above (using windows-1252 as the coding system),
the encoding is identified correctly and both the u umlaut and the
Euro sign are displayed correctly.
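
A minimal sketch of a variant that covers the newline case as well is
shown here.  It is untested guesswork under two assumptions: that the
coding system is available under the name windows-1252, and that a
byte in the range \200-\237 is a good enough hint.  The second
assumption is weak, because UTF-8 continuation bytes fall into the
same range.

```elisp
;; Sketch only: like the quoted form, but the 8-bit byte may also
;; follow a newline or sit at the very start of the file.
(setq auto-coding-regexp-alist
      (cons '("\\(?:\\`\\|[\n\040-\177]\\)[\200-\237]" . windows-1252)
            auto-coding-regexp-alist))
```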

> What does `C-h C RET' say after visiting the file?

In case the encoding is not identified correctly:

,----
| Coding system for saving this buffer:
|   t -- raw-text-dos
| 
| Default coding system (for new files):
|   1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
| Coding system for keyboard input:
|   1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
| Coding system for terminal output:
|   1 -- iso-8859-1 (alias of iso-latin-1)
| 
| Defaults for subprocess I/O:
|   decoding: 1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
|   encoding: 1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
| 
| Priority order for recognizing coding systems when reading files:
|   1. iso-latin-1 (alias: iso-8859-1 latin-1)
|   2. mule-utf-8 (alias: utf-8)
|   3. mule-utf-16be-with-signature (alias: utf-16be-with-signature mule-utf-16-be utf-16-be)
|   4. mule-utf-16le-with-signature (alias: utf-16le-with-signature mule-utf-16-le utf-16-le)
|   5. iso-2022-jp (alias: junet)
|   6. iso-2022-7bit 
|   7. iso-2022-7bit-lock (alias: iso-2022-int-1)
|   8. iso-2022-8bit-ss2 
|   9. emacs-mule 
|   10. raw-text 
|   11. japanese-shift-jis (alias: shift_jis sjis cp932)
|   12. chinese-big5 (alias: big5 cn-big5 cp950)
|   13. no-conversion 
| 
|   Other coding systems cannot be distinguished automatically
|   from these, and therefore cannot be recognized automatically
|   with the present coding system priorities.
| 
|   The following are decoded correctly but recognized as iso-2022-7bit-lock:
|     iso-2022-7bit-ss2 iso-2022-7bit-lock-ss2 iso-2022-cn iso-2022-cn-ext
|     iso-2022-jp-2 iso-2022-kr
| [...]
`----

In case the coding is identified correctly:

,----
| Coding system for saving this buffer:
|   * -- windows-1252-dos
| 
| Default coding system (for new files):
|   1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
| Coding system for keyboard input:
|   1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
| Coding system for terminal output:
|   1 -- iso-8859-1 (alias of iso-latin-1)
| 
| Defaults for subprocess I/O:
|   decoding: 1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| 
|   encoding: 1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
| [...]
`----

> I assume you're running with multibyte characters enabled.

Yes.  The relevant setting should be included in the original bug
report.

>  > And a customization is actually not what I am interested in; I'd like
>  > Emacs to figure this out by itself, out of the box.
>
> How is Emacs supposed to infer the coding system from the contents of
> that file?  If you can come up with a suitable customization, perhaps
> it will be incorporated into Emacs as the default behavior.

If I knew how to do that I would have sent a patch already.  My naive
approach would be to look for bytes that are characteristic of Windows
codepages in order to identify the encoding as a Windows codepage.
Looking at line endings might also help to make the right decision.
Once the encoding has been identified as a Windows codepage, the exact
codepage could be chosen based on the language environment.  But this
suggestion is just guesswork on my part because I know close to
nothing about the processes involved in identifying an encoding.
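
The guess described above could be sketched roughly as follows.  This
is only an illustration, not how Emacs detects coding systems: the
function, the variable, and the language-to-codepage mapping are all
hypothetical, and the check would also fire on UTF-8 files with DOS
line endings, so a real implementation would have to rule those out
first.

```elisp
;; Hypothetical sketch; none of these names exist in Emacs itself.
(defvar my-windows-codepage-by-language
  '(("Latin-1" . windows-1252)
    ("German"  . windows-1252)
    ("Latin-2" . windows-1250))
  "Assumed mapping from language environment to a Windows codepage.")

(defun my-guess-windows-codepage (file)
  "Return a guessed Windows codepage for FILE, or nil.
The guess requires both a byte in the range \\200-\\237 and a CRLF
line ending somewhere in FILE."
  (with-temp-buffer
    (set-buffer-multibyte nil)              ; work on raw bytes
    (insert-file-contents-literally file)   ; no decoding, no EOL conversion
    (when (and (progn (goto-char (point-min))
                      (re-search-forward "[\200-\237]" nil t))
               (progn (goto-char (point-min))
                      (search-forward "\r\n" nil t)))
      ;; Pick the codepage from the current language environment.
      (cdr (assoc current-language-environment
                  my-windows-codepage-by-language)))))
```

Calling `(my-guess-windows-codepage "/path/to/file")' would return a
coding system symbol only when both hints are present and the
language environment appears in the assumed table.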

> Can Notepad display files in anything besides CP850/Windows-1252 and
> probably UTF-8 w/BOM?  E.g. can it distinguish ISO 8859-1 from ISO
> 8859-2 from ISO 8859-15?

As far as I understood Reiner on emacs-pretest-bug this is impossible
anyway.

> Yes, Windows applications simply assume you're using a proprietary
> Microsoft character set, and GNU/Linux apps prioritize support for
> standard character encodings.  Maybe all you need is
> (prefer-coding-system 'cp850)

Wouldn't that be a bit too restricted as a general solution for Emacs?

-- 
Ralf
