* word syntax/umlauts emacs 23 vs 22
From: Ralf Fassel @ 2010-10-12 17:00 UTC
To: help-gnu-emacs

I recently switched from 22.3 (opensuse 11.1) to emacs 23.1 (opensuse 11.3).

In emacs-22.3, a word containing German Umlauts was skipped as a whole by forward-word/backward-word.
In emacs-23.1, a word containing German Umlauts is considered as three parts by forward-word/backward-word.

E.g.

  Müller

i.e. "\115\374\154\154\145\162" in Unibyte/Latin-9.

Setting point before the 'M' and doing M-x forward-word ends up after the 'r' in emacs-22 and after the 'M' in emacs-23.

The syntax entry for the Umlaut ü in emacs-23 explicitly says 'word', so why does forward-word stop at the Umlaut?

emacs-23:
  character: ü (252, #o374, #xfc)
  preferred charset: eight-bit (Raw bytes 128-255)
  code point: 0xFC
  syntax: w which means: word
  buffer code: #xFC
  file code: #xFC
  display: by display table entry [?ü] (see below)

  The display table entry is displayed by these fonts (glyph codes):
  ü: x:-b&h-lucidatypewriter-medium-r-normal-sans-18-180-75-75-m-110-iso8859-1 (#xFC)

emacs-22:
  character: ü (252, #o374, #xfc)
  charset: eight-bit-graphic (8-bit graphic char (0xA0..0xFF))
  code point: #xFC
  syntax: w which means: word
  buffer code: #xFC
  file code: not encodable by coding system iso-latin-9
  display: by display table entry [?ü] (see below)

  The display table entry is displayed by these fonts (glyph codes):
  ü: -B&H-LucidaTypewriter-Medium-R-Normal-Sans-18-180-75-75-M-110-ISO8859-1 (#xFC)

Any hints how to get the emacs-22 behaviour back?

R'
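The byte sequence quoted in this message can be checked outside Emacs. The following is a minimal Python sketch, not anything from the thread itself: it decodes the six octal escapes as Latin-1 and as Latin-9 (the two agree for these bytes) and shows that the same bytes are not valid UTF-8. The variable names are illustrative, and nothing here says how Emacs treats the buffer internally.

```python
# The six bytes from the message, written with the same octal escapes.
raw = b"\115\374\154\154\145\162"

# Latin-1 (ISO 8859-1) and Latin-9 (ISO 8859-15) agree on byte 0xFC.
as_latin1 = raw.decode("iso-8859-1")
as_latin9 = raw.decode("iso-8859-15")

# A lone 0xFC byte is not a valid UTF-8 sequence.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(as_latin1, as_latin9, valid_utf8)
```

So the file content only becomes the word "Müller" once a single-byte coding system is applied to it.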
* Re: word syntax/umlauts emacs 23 vs 22
From: Stefan Monnier @ 2010-10-12 22:15 UTC
To: help-gnu-emacs

> emacs-23:
> character: ü (252, #o374, #xfc)
> preferred charset: eight-bit (Raw bytes 128-255)

This character is not really the "u mit umlaut" but rather it's the byte 252 (FC in hexadecimal), which happens to be displayed as ü for reasons I'm not sure I understand.

> emacs-22:
> character: ü (252, #o374, #xfc)
> charset: eight-bit-graphic (8-bit graphic char (0xA0..0xFF))

Same thing here. I.e. your Emacs-22 also gets this file wrong. The only difference between Emacs-22 and Emacs-23 is that Emacs-23 doesn't pretend that bytes between 128 and 255 are latin-1 chars.

The right fix is to try and figure out why the char is "byte nb 252" rather than "u mit umlaut". Try to look at that file with "emacs -Q", to see if you can reproduce the problem there.

Stefan

PS: maybe you're using Emacs in unibyte mode, which was a bad idea in Emacs-22, is deprecated in Emacs-23 and won't exist any more in Emacs-24.
* Re: word syntax/umlauts emacs 23 vs 22
From: Stefan Monnier @ 2010-10-15 17:42 UTC
To: help-gnu-emacs

> | Try to look at that file with "emacs -Q", to see if you can reproduce
> | the problem there.
> Same thing. I visit a file with 'über' (4 bytes/chars). This is
> initially displayed as \374ber (the \374 being one entity).
[...]
> character: \374 (252, #o374, #xfc)
> preferred charset: eight-bit (Raw bytes 128-255)
> code point: 0xFC
> syntax: w which means: word
> buffer code: #xFC
> file code: #xFC
> display: by this font (glyph code)
> x:-sony-fixed-medium-r-normal--16-120-100-100-c-80-iso8859-1 (#xFC)

I cannot reproduce it. What is your LANG/LC_ALL setting?

> I use unibyte in regular use.

How do you tell Emacs to use unibyte?

Stefan
* Re: word syntax/umlauts emacs 23 vs 22
From: Ralf Fassel @ 2010-10-20 19:28 UTC
To: help-gnu-emacs

* Stefan Monnier <monnier@iro.umontreal.ca>

Stefan, thanks for your patience with this.

| I cannot reproduce it. What is your LANG/LC_ALL setting?

LANG=de_DE.UTF-8

| > I use unibyte in regular use.
|
| How do you tell Emacs to use unibyte?

I just recognized that I had set EMACS_UNIBYTE in the environment. If I unset this and start /usr/bin/emacs -Q, I get correct word-movement on Umlauts inserted on a German keyboard.

Now we still have basically all of our files in unibyte encoding, and they show as M\374ller, with the single-byte Umlauts as escape sequences, and word-movement stops at the non-ascii char. I found that if I customize the latin1-display Variable, they show up as Umlauts, and word-movement also behaves properly. Is setting latin1-display the Right Thing to work with the unibyte files?

Thanks
R'
* Re: word syntax/umlauts emacs 23 vs 22
From: Jason Rumney @ 2010-10-21 0:18 UTC
To: help-gnu-emacs

On Oct 21, 3:28 am, Ralf Fassel <ralf...@gmx.de> wrote:

> Now we still have basically all of our files in unibyte encoding, and
> they show as M\374ller, with the single-byte Umlauts as escape sequences,
> and word-movement stops at the non-ascii char. I found that if I
> customize the latin1-display Variable, they show up as Umlauts, and
> word-movement also behaves properly. Is setting latin1-display the
> Right Thing to work with the unibyte files?

No. The correct thing is to use latin-1 instead of unibyte if the files are indeed latin-1. But this should happen by default.
* Re: word syntax/umlauts emacs 23 vs 22
From: Stefan Monnier @ 2010-10-21 1:27 UTC
To: help-gnu-emacs

> | I cannot reproduce it. What is your LANG/LC_ALL setting?
> LANG=de_DE.UTF-8
> | > I use unibyte in regular use.
> | How do you tell Emacs to use unibyte?
> I just recognized that I had set EMACS_UNIBYTE in the environment.
> If I unset this and start /usr/bin/emacs -Q, I get correct word-movement
> on Umlauts inserted on a German keyboard.

Great.

> Now we still have basically all of our files in unibyte encoding, and

"unibyte encoding" is a term that makes sense here, but searching for it won't put you on the right track, I'm afraid ;-)

> they show as M\374ller, with the single-byte Umlauts as escape sequences,

Your "unibyte encoding" is most likely latin-1 or latin-9, so your problem now is that Emacs for some reason does not try latin-1 for those files that don't use utf-8.

C-x RET r latin-1 RET should cause the file to be re-read as a latin-1 file, and it should then be displayed properly.

Now, the question is why didn't Emacs recognize the file as a latin-1 file. If you do

   emacs23 -Q ~/tmp/foo.txt

where foo.txt is a file encoded in latin-1 that contains

   Müller

and some more ASCII text, Emacs should properly recognize the file as latin-1 (as indicated in the leftmost part of the mode-line by "-1:") and the ü should be recognized and displayed fine. At least it works for me (and many more people). So if that doesn't work for you, there's something more going on (maybe you'll want to try it with different files, because it may be a problem in the file's encoding).

> and word-movement stops at the non-ascii char. I found that if I
> customize the latin1-display Variable, they show up as Umlauts, and
> word-movement also behaves properly. Is setting latin1-display the
> Right Thing to work with the unibyte files?

No, the "latin1-display" thingy, as the name implies, deals with display and hence just works around the problem, just like your reliance on UNIBYTE did.

Stefan
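The fallback described in this message (try utf-8 first, then latin-1 for files that are not utf-8) can be sketched in Python as below. This only illustrates the idea and is not Emacs's actual detection code; `guess_decode` is a hypothetical helper name.

```python
def guess_decode(data: bytes) -> tuple:
    """Return (codec_name, text), trying UTF-8 before Latin-1."""
    try:
        return "utf-8", data.decode("utf-8")
    except UnicodeDecodeError:
        # Python's Latin-1 codec maps every byte to a character,
        # so this second attempt cannot fail.
        return "iso-8859-1", data.decode("iso-8859-1")


print(guess_decode(b"M\xfcller"))      # single-byte umlaut, not valid UTF-8
print(guess_decode(b"M\xc3\xbcller"))  # umlaut encoded as UTF-8
```

Both inputs come out as the same text; only the coding system that had to be applied differs.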
* Re: word syntax/umlauts emacs 23 vs 22
From: Ralf Fassel @ 2010-10-21 13:25 UTC
To: help-gnu-emacs

* Stefan Monnier <monnier@iro.umontreal.ca>
| Your "unibyte encoding" is most likely latin-1 or latin-9, so your
| problem now is that Emacs for some reason does not try latin-1 for
| those files that don't use utf-8.

Let's assume Latin-9 (IIRC that is Latin-1 with the Euro-Sign?).

| C-x RET r latin-1 RET should cause the file to be re-read as a latin-1
| file, and it should then be displayed properly.

The file is one line:

  % cat foo.txt
  Herr Müller editiert mit Emacs.
  % od -c foo.txt
  0000000   H   e   r   r       M 374   l   l   e   r       e   d   i   t
  0000020   i   e   r   t       m   i   t       E   m   a   c   s   .  \n
  0000040

If I load that in

  emacs23 -Q ~/tmp/foo.txt

the file is displayed correctly (mode line shows "1:--- foo.txt"), Umlauts are displayed properly and word movement works.

However, a different, larger (800+kB) file which is also supposed to be latin-1 displays as "t:--- file", and the single-byte Umlauts are displayed as octal. How can I find out why emacs loads this file in 't' instead of '1'? A quick search shows only doubled umlauts such as 'Grüße' or öö, but if I add these to foo.txt, emacs still loads foo.txt as "1:".

Thanks
R'
* Re: word syntax/umlauts emacs 23 vs 22
From: Jason Rumney @ 2010-10-21 15:24 UTC
To: help-gnu-emacs

On Oct 21, 9:25 pm, Ralf Fassel <ralf...@gmx.de> wrote:

> However, a different, larger (800+kB) file which is also supposed to be
> latin-1 displays as "t:--- file", and the single-byte Umlauts are
> displayed as octal. How can I find out why emacs loads this file in 't'
> instead of '1'? A quick search shows only doubled umlauts such as
> 'Grüße' or öö, but if I add these to foo.txt, emacs still loads foo.txt
> as "1:".

Was the file edited on Windows? Does it contain euro signs, "smartquotes" or other non-standard Microsoft additions to Latin-1?

If this is the answer to your problem, then the following should help (but probably breaks auto-detection of utf-8 files).

(prefer-coding-system 'windows-1252)
* Re: word syntax/umlauts emacs 23 vs 22
From: Ralf Fassel @ 2010-10-25 9:33 UTC
To: help-gnu-emacs

* Jason Rumney <jasonrumney@gmail.com>
| Was the file edited on Windows? Does it contain euro signs,
| "smartquotes" or other non-standard Microsoft additions to Latin-1?

Yes. A single windows-\200-Euro-sign was the reason for not opening it in latin-1. After replacing that char, the file is opened ok in latin-1.

| (prefer-coding-system 'windows-1252)

No, those files should not be edited using a windows code page, they are supposed to be latin-1.

Thanks
R'
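To locate such a stray byte without trial and error, one could scan the file for bytes in the range 0x80 to 0x9F, which ISO 8859-1 does not use for printable characters and which windows-1252 mostly assigns to printable ones such as the Euro sign. A minimal Python sketch follows; `find_c1_bytes` is a hypothetical helper, and the sample data merely stands in for the real file, which is not available here.

```python
def find_c1_bytes(data: bytes):
    """Return (offset, byte value) pairs for bytes in 0x80..0x9F."""
    return [(i, b) for i, b in enumerate(data) if 0x80 <= b <= 0x9F]


# Stand-in for the real file: Latin-1 text with one windows-1252 Euro (\200).
sample = b"Gr\xfc\xdfe, Preis: 5 \x80\n"

for offset, value in find_c1_bytes(sample):
    print(f"offset {offset}: \\{value:o} (0x{value:02X})")
```

For an actual file, the bytes could be read with `open(path, "rb").read()` and passed to the same function.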
* Re: word syntax/umlauts emacs 23 vs 22
From: Stefan Monnier @ 2010-10-29 18:26 UTC
To: help-gnu-emacs

> | (prefer-coding-system 'windows-1252)
> No, those files should not be edited using a windows code page, they are
> supposed to be latin-1.

Are you saying that this char was an error? If not, then it does seem like you want to use windows-1252 (which is a perfectly normal coding-system, tho it happens to be a lot more common for a valid utf-8 file to also be a valid windows-1252 file, so it's more difficult for Emacs to automatically choose the right coding-system unless you tell it that you prefer utf-8 more than 1252).

Stefan
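The ambiguity mentioned here can be seen in a small Python sketch: bytes that are valid utf-8 will often also decode without error as windows-1252, only to the wrong characters, so validity alone cannot tell the two apart. The sample string is illustrative. Python's codecs are used only to show the byte-level overlap; they say nothing about how Emacs ranks coding systems.

```python
utf8_bytes = "Müller zahlt 5 €".encode("utf-8")

# Decoding as windows-1252 raises no error; it just yields other characters.
misread = utf8_bytes.decode("cp1252")

print(utf8_bytes.decode("utf-8"))
print(misread)
```

A detector that accepted the first coding system that decodes cleanly would therefore depend entirely on the order in which it tries them.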
* Re: word syntax/umlauts emacs 23 vs 22
From: Ralf Fassel @ 2010-11-04 9:36 UTC
To: help-gnu-emacs

* Stefan Monnier <monnier@iro.umontreal.ca>
| > | (prefer-coding-system 'windows-1252)
| > No, those files should not be edited using a windows code page, they
| > are supposed to be latin-1.
|
| Are you saying that this char was an error?

Hard to say. Windows users insert \200 when they press the Euro sign on their keyboard, Linux users enter \244. Since we're mostly Linux, the \244 should be the Euro.

| If not, then it does seem like you want to use windows-1252 (which is
| a perfectly normal coding-system,

Since the Umlauts are at the same positions in latin-9 and windows-1252 we might as well use cp1252, but then the Linux-Euro will get displayed as a crossed 'o'... well...

I'd say let's end this thread. We can work with emacs and xemacs again after re-enabling multibyte mode, and using the proper default encoding.

Thanks again
R'
* Re: word syntax/umlauts emacs 23 vs 22
From: Stefan Monnier @ 2010-11-04 19:37 UTC
To: help-gnu-emacs

> | > | (prefer-coding-system 'windows-1252)
> | > No, those files should not be edited using a windows code page, they
> | > are supposed to be latin-1.
> | Are you saying that this char was an error?
> Hard to say. Windows users insert \200 when they press the Euro sign on
> their keyboard, Linux users enter \244.

Don't know about Windows, but at least for GNU/Linux what you say is largely not true: all major distributions of GNU/Linux switched to utf-8 locales by default many years ago, so by now most GNU/Linux users will insert #xE2 #x82 #xAC when they press the Euro sign.

Stefan
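The byte values mentioned in the last two messages can be cross-checked with Python's standard codecs. The sketch below is an illustration added for reference, with illustrative variable names: it encodes the Euro sign in windows-1252, Latin-9 and utf-8, confirms that Latin-1 has no Euro sign, and shows what byte \244 becomes when read as windows-1252.

```python
euro = "€"

print(euro.encode("cp1252"))       # windows-1252
print(euro.encode("iso-8859-15"))  # Latin-9
print(euro.encode("utf-8"))

# Latin-1 has no Euro sign at all.
try:
    euro.encode("iso-8859-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False

# Byte \244 read as windows-1252 (or Latin-1) is the generic currency sign.
currency = b"\244".decode("cp1252")

# The umlaut byte \374 is the same character in all three single-byte codecs.
umlauts = {name: b"\374".decode(name)
           for name in ("iso-8859-1", "iso-8859-15", "cp1252")}

print(latin1_has_euro, currency, umlauts)
```

This matches the observation that switching between latin-9 and cp1252 leaves the umlauts alone but changes what \200 and \244 mean.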
* Re: word syntax/umlauts emacs 23 vs 22
From: Ilya Zakharevich @ 2010-10-26 2:53 UTC
To: help-gnu-emacs

On 2010-10-21, Jason Rumney <jasonrumney@gmail.com> wrote:
> Was the file edited on Windows? Does it contain euro signs,
> "smartquotes" or other non-standard Microsoft additions to Latin-1?

Is not it siege mentality? cp1252 is as standard as iso-SHIT-1 (in fact much more standard ;-). This fact should better be recognized by Emacs' recognition engine...

Hope this helps,
Ilya

P.S. And BOTH of them are latin-1. So if you CARE, please use proper names.
* Re: word syntax/umlauts emacs 23 vs 22
From: Ralf Fassel @ 2010-10-25 9:31 UTC
To: help-gnu-emacs

* Stefan Monnier <monnier@iro.umontreal.ca>
| You can do the following:
| - open the file
| - set its coding-system to latin-1 with C-x RET f latin-1 RET
| - try to save the file using the new coding system: C-x C-s
| - this should hopefully pop up a window giving you a list of chars
|   that conflict.

Ok, by string-replacing the regular Umlauts I found the culprit was a single \200 char within 880+kB. This is the Windows-Euro (of course the file is edited from Windows and Linux). After replacing this single char, the file is opened in latin-1 ok.

Since all those files use a special editing major mode in emacs anyway, I suppose I can switch to latin-1 when entering that major mode.

| But I still suggest you try what I already suggested:
|
| | C-x RET r latin-1 RET should cause the file to be re-read as a
| | latin-1 file, and it should then be displayed properly.
|
| and see what this gives.

This works ok in that the Umlauts are displayed properly, word movement works, and the single \200 char is not bitched about when saving the file. So I'd call this function in the major mode and all should be well (fingers crossed)...

Now you don't happen to know the details of all of this in *X*emacs? Last time I looked, the functions were all different...

R'