From: Kevin Rodgers
Newsgroups: gmane.emacs.devel
Subject: Re: [angeli@iwi.uni-sb.de: Coding problem with Euro sign]
Date: Fri, 16 Dec 2005 15:58:22 -0700

Ralf Angeli wrote:
> Currently I am on GNU/Linux.  Anyway, with the development version of
> Emacs I did not have the problems with cp1252 you described when
> loading the file.  But when trying to write the file I got this
> warning:
>
> ,----
> | Warning (:warning): Invalid coding system `cp1252' is specified
> | for the current buffer/file by the variable `auto-coding-regexp-alist'.
> | It is highly recommended to fix it before writing to a file.
> `----
>
> I didn't do `M-x codepage-setup RET' before trying all of this.
> Interestingly loading and writing the file worked fine if I used
> windows-1252 instead of cp1252.

Well, there you go.  Emacs 22.0 supports windows-1252, and Emacs 21.4
only supports cp850.

> * Kevin Rodgers (2005-12-15) writes:
>>One other detail: that entry only sets the coding system if the euro
>>is immediately preceded by an ASCII character.  Is that the case in
>>your file?
>
> No.  On emacs-pretest-bug I already explained that the original (test)
> file doesn't include the A circumflex, that means the euro is preceded
> by a newline.
> (Maybe it would be better to continue the discussion in
> the thread on emacs-pretest-bug in order to avoid repetition?)

Ah.  The regexp only matched the [\200-\237] characters after a
non-control ASCII character.  So [\040-\177] needs to be expanded, at
least to [\t\n\r\040-\177] to include tab and newline sequences, but
maybe [\t\n\r\v\f\040-\177] to include vertical tab and formfeed, or
even [\000-\177] to include all ASCII characters.

(I don't subscribe to emacs-pretest-bug; I read the gmane.emacs.devel
newsgroup on gmane.org, which is gatewayed to and from the
emacs-devel@gnu.org mailing list.  If you followed up to both mailing
lists/newsgroups, that should solve the problem.)

> If I insert a space or a random ASCII character before the Euro sign
> and evaluate the form above (using windows-1252 for the encoding) the
> encoding is being identified correctly and both the u umlaut and the
> Euro sign are being displayed correctly.

Good!

...

>>How is Emacs supposed to infer the coding system from the contents of
>>that file?  If you can come up with a suitable customization, perhaps
>>it will be incorporated into Emacs as the default behavior.
>
> If I knew how to do that I would have sent a patch already.  My naive
> approach would be to look for the presence of bytes which are
> characteristic for Windows codepages in order to identify the encoding
> as a Windows codepage.

Right, but a single byte is not enough information to identify the
character encoding.  Even a pattern is not enough, since coding systems
may differ only in what characters are assigned to the same byte
sequence: sometimes you need "out of band" information.

Have you read the Recognize Coding node (aka Recognizing Coding Systems)
of the Emacs manual?  The Emacs implementors are less naive than you and
me.  :-)

> Maybe looking at line endings can help to make the right decision.

That would be a very weak heuristic indeed.
As I understand it, Emacs is very conservative in this regard: if a
buffer contains only single \r sequences, it's mac; if it contains only
\n sequences, it's unix; if it contains only \r\n sequences, it's DOS;
but if it contains a mix, it is indeterminate.

> After the encoding was identified to be a Windows
> codepage, the exact codepage could be chosen based on the language
> environment.  But this suggestion is just random guesswork from my
> side because I know close to nothing about what processes are involved
> in identifying an encoding.

Me neither; your idea sounds reasonable to me.  But I don't understand
why auto-coding-regexp-alist has such a high priority (over the coding:
tag).

>>Can Notepad display files in anything besides CP850/Windows-1252 and
>>probably UTF-8 w/BOM?  E.g. can it distinguish ISO 8859-1 from ISO
>>8859-2 from ISO 8859-15?
>
> As far as I understood Reiner on emacs-pretest-bug this is impossible
> anyway.

Just as windows-1252 can't be distinguished reliably from any other
coding system that uses bytes [\200-\237].

>>Yes, Windows applications simply assume you're using a proprietary
>>Microsoft character set, and GNU/Linux apps prioritize support for
>>standard character encodings.  Maybe all you need is
>>(prefer-coding-system 'cp850)
>
> Wouldn't that be a bit too restricted as a general solution for Emacs?

Of course.  But we don't know whether this is a general problem for
Emacs or a specific problem for your configuration, nor in either case
whether it's a problem that can be solved.  As a scientist I'd like to
solve the most general case, but as an engineer I'd like to start by
solving the particular problem you've identified.

-- 
Kevin Rodgers
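
P.S. To make two of the points above concrete, here is a rough sketch in
Python (not Emacs Lisp, and not the actual auto-coding-regexp-alist
entry).  The pattern names and classify_eol are hypothetical, made up
for illustration.  The first part shows why a [\200-\237] byte is missed
when it follows a newline and how widening the preceding character class
fixes that; the second part mimics the conservative line-ending rule as
I described it, which may not match what Emacs really does internally.

```python
import re

# Hypothetical stand-ins for the regexp discussed above (octal ranges
# rewritten in hex: \040-\177 = \x20-\x7f, \200-\237 = \x80-\x9f).
NARROW = re.compile(rb"[\x20-\x7f][\x80-\x9f]")        # original class
WIDER = re.compile(rb"[\t\n\r\x20-\x7f][\x80-\x9f]")   # plus tab, LF, CR
ALL_ASCII = re.compile(rb"[\x00-\x7f][\x80-\x9f]")     # any ASCII byte


def classify_eol(data: bytes) -> str:
    """Classify line endings the conservative way described above.

    Returns "dos", "mac" or "unix" only when exactly one convention
    occurs; a mix is "indeterminate".  Treating data with no line
    endings at all as "indeterminate" is an assumption of this sketch.
    """
    crlf = data.count(b"\r\n")
    lone_cr = data.count(b"\r") - crlf   # \r not followed by \n
    lone_lf = data.count(b"\n") - crlf   # \n not preceded by \r
    seen = [name
            for name, n in (("dos", crlf), ("mac", lone_cr), ("unix", lone_lf))
            if n > 0]
    return seen[0] if len(seen) == 1 else "indeterminate"


# The euro sign is byte 0x80 in windows-1252.
after_newline = b"Preis:\n\x80 5"   # euro preceded by a newline
after_space = b"Preis: \x80 5"      # euro preceded by a space

print(NARROW.search(after_newline) is not None)  # narrow class misses it
print(WIDER.search(after_newline) is not None)   # widened class finds it
print(NARROW.search(after_space) is not None)    # space is in the narrow class
print(classify_eol(b"a\r\nb\n"))                 # mixed endings
```

Note that even the widest pattern only says "a byte in [\200-\237]
follows an ASCII byte"; it still can't tell windows-1252 from any other
coding system that assigns characters to those bytes.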