From: Kevin Rodgers <ihs_4664@yahoo.com>
Subject: Re: [angeli@iwi.uni-sb.de: Coding problem with Euro sign]
Date: Fri, 16 Dec 2005 15:58:22 -0700
Message-ID: <dnvgqj$p7c$1@sea.gmane.org>
In-Reply-To: <dnua00$mc2$1@sea.gmane.org>

Ralf Angeli wrote:
 > Currently I am on GNU/Linux.  Anyway, with the development version of
 > Emacs I did not have the problems with cp1252 you described when
 > loading the file.  But when trying to write the file I got this
 > warning:
 >
 > ,----
 > | Warning (:warning): Invalid coding system `cp1252' is specified
 > | for the current buffer/file by the variable `auto-coding-regexp-alist'.
 > | It is highly recommended to fix it before writing to a file.
 > `----
 >
 > I didn't do `M-x codepage-setup RET' before trying all of this.
 > Interestingly loading and writing the file worked fine if I used
 > windows-1252 instead of cp1252.

Well, there you go.  Emacs 22.0 supports windows-1252, and Emacs 21.4
only supports cp850.

 > * Kevin Rodgers (2005-12-15) writes:
 >>One other detail: that entry only sets the coding system if the euro
 >>is immediately preceded by an ASCII character.  Is that the case in
 >>your file?
 >
 > No.  On emacs-pretest-bug I already explained that the original (test)
 > file doesn't include the A circumflex, that means the euro is preceded
 > by a newline.  (Maybe it would be better to continue the discussion in
 > the thread on emacs-pretest-bug in order to avoid repetition?)

Ah.  The regexp only matched the [\200-\237] characters after a
non-control ASCII character.  So [\040-\177] needs to be expanded, at
least to [\t\n\r\040-\177] to include tab and newline sequences, but
maybe [\t\n\r\v\f\040-\177] to include vertical tab and formfeed, or
even [\000-\177] to include all ASCII characters.
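
To make the difference concrete, here is a minimal sketch in Python (not Emacs Lisp, and not the actual auto-coding-regexp-alist entry, which I am only reconstructing from the description above).  The pattern names and sample strings are my own assumptions; octal \040-\177 is written as hex 0x20-0x7f and \200-\237 as 0x80-0x9f.

```python
import re

# Bytes in the range octal \200-\237 (hex 0x80-0x9f).
HIGH = rb"[\x80-\x9f]"

# Hypothetical patterns reconstructed from the description above.
narrow = re.compile(rb"[\x20-\x7f]" + HIGH)         # non-control ASCII only
wider = re.compile(rb"[\t\n\r\x20-\x7f]" + HIGH)    # plus tab, LF, CR
all_ascii = re.compile(rb"[\x00-\x7f]" + HIGH)      # every ASCII byte

# Sample data: u umlaut, then a Euro sign preceded by a newline or a space.
after_newline = "f\u00fcr\n\u20ac 5\n".encode("cp1252")
after_space = "f\u00fcr \u20ac 5\n".encode("cp1252")


def matches(pattern, data):
    """Return True if the pattern occurs anywhere in the byte string."""
    return pattern.search(data) is not None


for name, pattern in (("narrow", narrow), ("wider", wider), ("all_ascii", all_ascii)):
    print(name, matches(pattern, after_newline), matches(pattern, after_space))
```

The narrow class misses the Euro sign after a newline but finds it after a space, which is the behavior Ralf reported; the two wider classes find it in both samples.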

(I don't subscribe to emacs-pretest-bug; I read the gnu.emacs.devel
newsgroup on gmane.org, which is gatewayed to and from the
emacs-devel@gnu.org mailing list.  If you followed up to both mailing
lists/newsgroups, that should solve the problem.)

 > If I insert a space or a random ASCII character before the Euro sign
 > and evaluate the form above (using windows-1252 for the encoding) the
 > encoding is being identified correctly and both the u umlaut and the
 > Euro sign are being displayed correctly.

Good!

...

 >>How is Emacs supposed to infer the coding system from the contents of
 >>that file?  If you can come up with a suitable customization, perhaps
 >>it will be incorporated into Emacs as the default behavior.
 >
 > If I knew how to do that I would have sent a patch already.  My naive
 > approach would be to look for the presence of bytes which are
 > characteristic for Windows codepages in order to identify the encoding
 > as a Windows codepage.

Right, but a single byte is not enough information to identify the
character encoding.  Even a pattern is not enough, since coding systems
may differ only in what characters are assigned to the same byte
sequence: sometimes you need "out of band" information.
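
A small illustration of that point, as a sketch using Python's standard codecs rather than Emacs coding systems (the variable names are mine): the same byte decodes to a different character under each encoding, and nothing in the byte itself says which one was meant.

```python
# One byte, three plausible readings.
byte_80 = b"\x80"
readings_80 = {name: byte_80.decode(name) for name in ("cp1252", "cp850", "latin-1")}

# Another byte that separates ISO 8859-1 from ISO 8859-15.
byte_a4 = b"\xa4"
readings_a4 = {name: byte_a4.decode(name) for name in ("iso8859-1", "iso8859-15")}

for name, char in {**readings_80, **readings_a4}.items():
    print(f"{name}: U+{ord(char):04X}")
```

Here 0x80 is the Euro sign in cp1252, C with cedilla in cp850, and a C1 control character in Latin-1, while 0xA4 is the generic currency sign in ISO 8859-1 but the Euro sign in ISO 8859-15.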

Have you read the Recognize Coding node (aka Recognizing Coding Systems)
of the Emacs manual?

The Emacs implementors are less naive than you and me.  :-)

 > Maybe looking at line endings can help to make the right decision.

That would be a very weak heuristic indeed.  As I understand it, Emacs is
very conservative in this regard: if a buffer contains only single \r
sequences, it's mac; if it contains only \n sequences, it's unix; if it
contains only \r\n sequences, it's DOS; but if it contains a mix, it is
indeterminate.
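
The rule as I have described it can be sketched like this in Python.  This is only my understanding written out, not how Emacs actually implements end-of-line detection; the function name and the "undecided" label are assumptions.

```python
def classify_eol(data: bytes) -> str:
    """Classify line endings as 'dos', 'mac', 'unix', or 'undecided'."""
    crlf = data.count(b"\r\n")
    lone_cr = data.count(b"\r") - crlf   # \r not followed by \n
    lone_lf = data.count(b"\n") - crlf   # \n not preceded by \r
    kinds = [name
             for name, count in (("dos", crlf), ("mac", lone_cr), ("unix", lone_lf))
             if count > 0]
    # Only a buffer with exactly one kind of line ending is classified.
    return kinds[0] if len(kinds) == 1 else "undecided"


print(classify_eol(b"one\ntwo\n"))
print(classify_eol(b"one\r\ntwo\n"))
```

Note that the result says nothing about the character encoding: a cp1252 file saved with unix line endings and an ISO 8859-15 file saved with DOS line endings are both perfectly possible.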

 > After the encoding was identified to be a Windows
 > codepage, the exact codepage could be chosen based on the language
 > environment.  But this suggestion is just random guesswork from my
 > side because I know close to nothing about what processes are involved
 > in identifying an encoding.

Me neither; your idea sounds reasonable to me.  But I don't understand
why auto-coding-regexp-alist has such a high priority (over the coding:
tag).
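
For what it's worth, the two-step idea (first spot bytes typical of a Windows codepage, then pick the codepage from the language environment) might look roughly like this in Python.  Everything here is hypothetical: the function name, the environment names, and the table are illustrations of the suggestion, not anything Emacs does.

```python
import re

# Step 1: bytes in \200-\237 (0x80-0x9f) hint at a Windows codepage.
C1_RANGE = re.compile(rb"[\x80-\x9f]")

# Step 2: hypothetical table from language environment to codepage.
CODEPAGE_BY_ENVIRONMENT = {
    "Latin-1": "cp1252",
    "Latin-2": "cp1250",
    "Cyrillic": "cp1251",
}


def guess_windows_codepage(data, environment):
    """Return a codepage name, or None if there is no basis for a guess."""
    if C1_RANGE.search(data) is None:
        return None  # no characteristic bytes, so no guess
    return CODEPAGE_BY_ENVIRONMENT.get(environment)


print(guess_windows_codepage(b"\x80 5", "Latin-1"))
print(guess_windows_codepage(b"plain ASCII", "Latin-1"))
```

It is still only a guess, since other coding systems also use those bytes.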

 >>Can Notepad display files in anything besides CP850/Windows-1252 and
 >>probably UTF-8 w/BOM?  E.g. can it distinguish ISO 8859-1 from ISO
 >>8859-2 from ISO 8859-15?
 >
 > As far as I understood Reiner on emacs-pretest-bug this is impossible
 > anyway.

Just as windows-1252 can't be distinguished reliably from any other
coding system that uses the bytes [\200-\237].

 >>Yes, Windows applications simply assume you're using a proprietary
 >>Microsoft character set, and GNU/Linux apps prioritize support for
 >>standard character encodings.  Maybe all you need is
 >>(prefer-coding-system 'cp850)
 >
 > Wouldn't that be a bit too restricted as a general solution for Emacs?

Of course.  But we don't know whether this is a general problem for
Emacs or a specific problem for your configuration, nor in either case
whether it's a problem that can be solved.  As a scientist I'd like to
solve the most general case, but as an engineer I'd like to start by
solving the particular problem you've identified.

-- 
Kevin Rodgers