From: Ralf Angeli
Newsgroups: gmane.emacs.devel
Subject: Re: [angeli@iwi.uni-sb.de: Coding problem with Euro sign]
Date: Fri, 16 Dec 2005 12:55:47 +0100
Original-To: emacs-devel@gnu.org
User-Agent: Gnus/5.110004 (No Gnus v0.4) Emacs/22.0.50 (gnu/linux)

* Kevin Rodgers (2005-12-15) writes:

> Ralf Angeli wrote:
> > * Kevin Rodgers (2005-12-15) writes:
> >
> >>You could try something like this:
> >>
> >>(setq auto-coding-regexp-alist
> >>      (cons '("[\040-\177][\200-\237]" . cp1252)
> >>            auto-coding-regexp-alist))
> >
> > This doesn't seem to work here. I still see the byte codes of the
> > 8-bit characters when opening the file after evaluating the above
> > form.
[...]
> I assume those display problems are because I haven't configured an
> Emacs fontset for the cp850 coding system. But the
> auto-coding-regexp-alist entry worked as intended, and you're on
> Windows so your fontset should be properly configured for that.

Currently I am on GNU/Linux. Anyway, with the development version of
Emacs I did not have the problems with cp1252 you described when
loading the file. But when trying to write the file I got this
warning:

,----
| Warning (:warning): Invalid coding system `cp1252' is specified
| for the current buffer/file by the variable `auto-coding-regexp-alist'.
| It is highly recommended to fix it before writing to a file.
`----

I didn't do `M-x codepage-setup RET' before trying all of this.
Interestingly loading and writing the file worked fine if I used
windows-1252 instead of cp1252.

> One other detail: that entry only sets the coding system if the euro
> is immediately preceded by an ASCII character. Is that the case in
> your file?

No. On emacs-pretest-bug I already explained that the original (test)
file doesn't include the A circumflex, which means the euro is preceded
by a newline. (Maybe it would be better to continue the discussion in
the thread on emacs-pretest-bug in order to avoid repetition?)

If I insert a space or a random ASCII character before the Euro sign
and evaluate the form above (using windows-1252 for the encoding) the
encoding is being identified correctly and both the u umlaut and the
Euro sign are being displayed correctly.

> What does `C-h C RET' say after visiting the file?

In case the encoding is not identified correctly:

,----
| Coding system for saving this buffer:
|   t -- raw-text-dos
|
| Default coding system (for new files):
|   1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
|
| Coding system for keyboard input:
|   1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
|
| Coding system for terminal output:
|   1 -- iso-8859-1 (alias of iso-latin-1)
|
| Defaults for subprocess I/O:
|   decoding: 1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
|
|   encoding: 1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
|
|
| Priority order for recognizing coding systems when reading files:
|   1. iso-latin-1 (alias: iso-8859-1 latin-1)
|   2. mule-utf-8 (alias: utf-8)
|   3. mule-utf-16be-with-signature (alias: utf-16be-with-signature mule-utf-16-be utf-16-be)
|   4. mule-utf-16le-with-signature (alias: utf-16le-with-signature mule-utf-16-le utf-16-le)
|   5. iso-2022-jp (alias: junet)
|   6. iso-2022-7bit
|   7. iso-2022-7bit-lock (alias: iso-2022-int-1)
|   8. iso-2022-8bit-ss2
|   9. emacs-mule
|   10. raw-text
|   11. japanese-shift-jis (alias: shift_jis sjis cp932)
|   12. chinese-big5 (alias: big5 cn-big5 cp950)
|   13. no-conversion
|
|   Other coding systems cannot be distinguished automatically
|   from these, and therefore cannot be recognized automatically
|   with the present coding system priorities.
|
|   The following are decoded correctly but recognized as iso-2022-7bit-lock:
|     iso-2022-7bit-ss2 iso-2022-7bit-lock-ss2 iso-2022-cn iso-2022-cn-ext
|     iso-2022-jp-2 iso-2022-kr
|
[...]
`----

In case the coding is identified correctly:

,----
| Coding system for saving this buffer:
|   * -- windows-1252-dos
|
| Default coding system (for new files):
|   1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
|
| Coding system for keyboard input:
|   1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
|
| Coding system for terminal output:
|   1 -- iso-8859-1 (alias of iso-latin-1)
|
| Defaults for subprocess I/O:
|   decoding: 1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
|
|   encoding: 1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
|
[...]
`----

> I assume you're running with multibyte characters enabled.

Yes. The relevant setting should be included in the original bug
report.

> > And a customization is actually not what I am interested in; I'd like
> > Emacs to figure this out by itself, out of the box.
>
> How is Emacs supposed to infer the coding system from the contents of
> that file? If you can come up with a suitable customization, perhaps
> it will be incorporated into Emacs as the default behavior.

If I knew how to do that I would have sent a patch already. My naive
approach would be to look for the presence of bytes which are
characteristic for Windows codepages in order to identify the encoding
as a Windows codepage. Maybe looking at line endings can help to make
the right decision. After the encoding was identified to be a Windows
codepage, the exact codepage could be chosen based on the language
environment.
But this suggestion is just random guesswork from my side because I
know close to nothing about what processes are involved in identifying
an encoding.

> Can Notepad display files in anything besides CP850/Windows-1252 and
> probably UTF-8 w/BOM? E.g. can it distinguish ISO 8859-1 from ISO
> 8859-2 from ISO 8859-15?

As far as I understood Reiner on emacs-pretest-bug this is impossible
anyway.

> Yes, Windows applications simply assume you're using a proprietary
> Microsoft character set, and GNU/Linux apps prioritize support for
> standard character encodings. Maybe all you need is
> (prefer-coding-system 'cp850)

Wouldn't that be a bit too restricted as a general solution for Emacs?

-- 
Ralf
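
Two observations from the message can be combined into a variant of the
quoted `auto-coding-regexp-alist' entry: the quoted regexp does not match
when the Euro sign follows a newline, and `windows-1252' could be written
without the warning that `cp1252' produced. The form below is only a
sketch under those assumptions, not a tested fix. The extended regexp is
an illustration that does not appear in the thread, and the
beginning-of-buffer alternative is assumed to apply to the portion of
the file that Emacs inspects.

```elisp
;; Sketch only, not a verified fix.
;; Assumption: `windows-1252' replaces `cp1252', because the message
;; reports that only the former could be saved without a warning.
;; The regexp matches a byte in the range \200-\237 that is either the
;; first byte of the inspected text or follows a newline or a
;; character in the range \040-\177.
(setq auto-coding-regexp-alist
      (cons '("\\(?:\\`\\|[\n\040-\177]\\)[\200-\237]" . windows-1252)
            auto-coding-regexp-alist))
```

The preceding-character condition is kept on purpose: in valid UTF-8 a
byte in that range never directly follows an ASCII byte, so the pattern
should not fire on UTF-8 files. It can still misfire on other encodings
that use those bytes as lead bytes, such as Shift_JIS, so it is probably
too coarse to serve as a default for every language environment.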