word syntax/umlauts emacs 23 vs 22

unofficial mirror of help-gnu-emacs@gnu.org

* word syntax/umlauts emacs 23 vs 22
@ 2010-10-12 17:00 Ralf Fassel
  2010-10-12 22:15 ` Stefan Monnier
  0 siblings, 1 reply; 14+ messages in thread
From: Ralf Fassel @ 2010-10-12 17:00 UTC (permalink / raw)
  To: help-gnu-emacs

I recently switched from 22.3 (opensuse 11.1) to emacs 23.1 (opensuse 11.3).

In emacs-22.3, a word containing german Umlauts was skipped as a whole
by forward-word/backward-word.

In emacs-23.1, a word containing german Umlauts is considered as three
parts by forward-word/backward-word.

E.g.
  Müller   i.e.  "\115\374\154\154\145\162" in Unibyte/Latin-9

Setting point before the 'M' and doing M-x forward-word ends up after
the 'r' in emacs-22 and after the 'M' in emacs-23.

The syntax entry for the Umlaut ü in emacs-23 explicitly says 'word',
so why does forward-word stop at the Umlaut?

emacs-23:
            character: ü (252, #o374, #xfc)
    preferred charset: eight-bit (Raw bytes 128-255)
           code point: 0xFC
               syntax: w 	which means: word
          buffer code: #xFC
            file code: #xFC
              display: by display table entry [?ü] (see below)

    The display table entry is displayed by these fonts (glyph codes):
    ü: x:-b&h-lucidatypewriter-medium-r-normal-sans-18-180-75-75-m-110-iso8859-1 (#xFC)

emacs-22:
      character: ü (252, #o374, #xfc)
        charset: eight-bit-graphic (8-bit graphic char (0xA0..0xFF))
     code point: #xFC
         syntax: w 	which means: word
    buffer code: #xFC
      file code: not encodable by coding system iso-latin-9
        display: by display table entry [?ü] (see below)

    The display table entry is displayed by these fonts (glyph codes):
    ü: -B&H-LucidaTypewriter-Medium-R-Normal-Sans-18-180-75-75-M-110-ISO8859-1 (#xFC)

Any hints how to get the emacs-22 behaviour back?

R'



* Re: word syntax/umlauts emacs 23 vs 22
  2010-10-12 17:00 word syntax/umlauts emacs 23 vs 22 Ralf Fassel
@ 2010-10-12 22:15 ` Stefan Monnier
       [not found]   ` <yga7hhm1tma.fsf@gepard2.akutech-local.de>
  0 siblings, 1 reply; 14+ messages in thread
From: Stefan Monnier @ 2010-10-12 22:15 UTC (permalink / raw)
  To: help-gnu-emacs

> emacs-23:
>             character: ü (252, #o374, #xfc)
>     preferred charset: eight-bit (Raw bytes 128-255)

This character is not really the "u mit umlaut" but rather it's the byte
252 (FC in hexadecimal), which happens to be displayed as ü for reasons
I'm not sure I understand.

> emacs-22:
>       character: ü (252, #o374, #xfc)
>         charset: eight-bit-graphic (8-bit graphic char (0xA0..0xFF))

Same thing here.  I.e. your Emacs-22 also gets this file wrong.

The only difference between Emacs-22 and Emacs-23 is that Emacs-23
doesn't pretend that bytes between 128 and 255 are latin-1 chars.
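The distinction between "byte 252" and the character ü can be illustrated outside Emacs. The following is only a sketch in Python (used here purely for illustration; it says nothing about how Emacs stores raw bytes internally). It takes the six octal bytes from the original report and shows that they become "Müller" only once a Latin-1 interpretation is applied, and that the same bytes are not valid UTF-8.

```python
# Illustration only: Python stands in for "some decoder". This is not
# Emacs's own decoding logic.

# The bytes from the original report: "\115\374\154\154\145\162"
raw = bytes([0o115, 0o374, 0o154, 0o154, 0o145, 0o162])

# Interpreted as Latin-1, byte 0xFC (252) becomes the character U+00FC (ü).
as_latin1 = raw.decode("latin-1")
print(as_latin1)

# Interpreted as UTF-8, the lone byte 0xFC is invalid, so decoding fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```

In other words, whether byte 252 "is" ü depends entirely on the coding system applied when the file is read.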

The right fix is to try and figure out why the char is "byte nb 252"
rather than "u mit umlaut".  Try to look at that file with "emacs -Q",
to see if you can reproduce the problem there.

        Stefan

PS: maybe you're using Emacs in unibyte mode, which was a bad idea in
Emacs-22, is deprecated in Emacs-23 and won't exist any more in Emacs-24.


* Re: word syntax/umlauts emacs 23 vs 22
       [not found]   ` <yga7hhm1tma.fsf@gepard2.akutech-local.de>
@ 2010-10-15 17:42     ` Stefan Monnier
  2010-10-20 19:28       ` Ralf Fassel
  0 siblings, 1 reply; 14+ messages in thread
From: Stefan Monnier @ 2010-10-15 17:42 UTC (permalink / raw)
  To: help-gnu-emacs

> | Try to look at that file with "emacs -Q", to see if you can reproduce
> | the problem there.

> Same thing.  I visit a file with 'über' (4 bytes/chars).  This is
> initially displayed as \374ber (the \374 being one entity).
[...]
>             character: \374 (252, #o374, #xfc)
>     preferred charset: eight-bit (Raw bytes 128-255)
>            code point: 0xFC
>                syntax: w 	which means: word
>           buffer code: #xFC
>             file code: #xFC
>               display: by this font (glyph code)
>         x:-sony-fixed-medium-r-normal--16-120-100-100-c-80-iso8859-1 (#xFC)

I cannot reproduce it.  What is your LANG/LC_ALL setting?

> I use unibyte in regular use.

How do you tell Emacs to use unibyte?


        Stefan



* Re: word syntax/umlauts emacs 23 vs 22
  2010-10-15 17:42     ` Stefan Monnier
@ 2010-10-20 19:28       ` Ralf Fassel
  2010-10-21  0:18         ` Jason Rumney
  2010-10-21  1:27         ` Stefan Monnier
  0 siblings, 2 replies; 14+ messages in thread
From: Ralf Fassel @ 2010-10-20 19:28 UTC (permalink / raw)
  To: help-gnu-emacs

* Stefan Monnier <monnier@iro.umontreal.ca>
Stefan,
thanks for your patience with this.

| I cannot reproduce it.  What is your LANG/LC_ALL setting?

LANG=de_DE.UTF-8

| > I use unibyte in regular use.
>
| How do you tell Emacs to use unibyte?

I just realized that I had set EMACS_UNIBYTE in the environment.

If I unset this and start /usr/bin/emacs -Q, I get correct word-movement
on Umlauts inserted on a german keyboard.

Now we still have basically all of our files in unibyte encoding, and
they show as M\374ller, with the single-byte Umlauts as escape sequences,
and word-movement stops at the non-ascii char.  I found that if I
customize the latin1-display variable, they show up as Umlauts, and
word-movement also behaves properly.  Is setting latin1-display the
Right Thing to work with the unibyte files?

Thanks
R'


* Re: word syntax/umlauts emacs 23 vs 22
  2010-10-20 19:28       ` Ralf Fassel
@ 2010-10-21  0:18         ` Jason Rumney
  2010-10-21  1:27         ` Stefan Monnier
  1 sibling, 0 replies; 14+ messages in thread
From: Jason Rumney @ 2010-10-21  0:18 UTC (permalink / raw)
  To: help-gnu-emacs

On Oct 21, 3:28 am, Ralf Fassel <ralf...@gmx.de> wrote:

> Now we still have basically all of our files in unibyte encoding, and
> they show as M\374ller, with the single-byte Umlauts as escape sequences,
> and word-movement stops at the non-ascii char.  I found that if I
> customize the latin1-display Variable, they show up as Umlauts, and
> word-movement also behaves properly.  Is setting latin1-display the
> Right Thing to work with the unibyte files?

No. The correct thing is to use latin-1 instead of unibyte if the
files are indeed latin-1.  But this should happen by default.



* Re: word syntax/umlauts emacs 23 vs 22
  2010-10-20 19:28       ` Ralf Fassel
  2010-10-21  0:18         ` Jason Rumney
@ 2010-10-21  1:27         ` Stefan Monnier
  2010-10-21 13:25           ` Ralf Fassel
  1 sibling, 1 reply; 14+ messages in thread
From: Stefan Monnier @ 2010-10-21  1:27 UTC (permalink / raw)
  To: help-gnu-emacs

> | I cannot reproduce it.  What is your LANG/LC_ALL setting?
> LANG=de_DE.UTF-8

> | > I use unibyte in regular use.
> | How do you tell Emacs to use unibyte?
> I just realized that I had set EMACS_UNIBYTE in the environment.

> If I unset this and start /usr/bin/emacs -Q, I get correct word-movement
> on Umlauts inserted on a german keyboard.

Great.

> Now we still have basically all of our files in unibyte encoding, and

"unibyte encoding" is a term that makes sense here, but searching for it
won't put you on the right track, I'm afraid ;-)

> they show as M\374ller, with the single-byte Umlauts as escape sequences,

Your "unibyte encoding" is most likely latin-1 or latin-9, so your
problem now is that Emacs for some reason does not try latin-1 for those
files that don't use utf-8.

C-x RET r latin-1 RET should cause the file to be re-read as a latin-1
file, and it should then be displayed properly.  Now, the question is
why didn't Emacs recognize the file as a latin-1 file.

If you do

   emacs23 -Q ~/tmp/foo.txt

where foo.txt is a file encoded in latin-1 that contains Müller and some
more ASCII text, Emacs should properly recognize the file as latin-1 (as
indicated in the leftmost part of the mode-line by "-1:") and the
ü should be recognized and displayed fine.  At least it works for me
(and many more people).
So if that doesn't work for you, there's something more going on (maybe
you'll want to try it with different files, because it may be a problem
in the file's encoding).

> and word-movement stops at the non-ascii char.  I found that if I
> customize the latin1-display Variable, they show up as Umlauts, and
> word-movement also behaves properly.  Is setting latin1-display the
> Right Thing to work with the unibyte files?

No, the "latin1-display" thingy, as the name implies, deals with display
and hence just works around the problem, just like your reliance on
UNIBYTE did.

        Stefan


* Re: word syntax/umlauts emacs 23 vs 22
  2010-10-21  1:27         ` Stefan Monnier
@ 2010-10-21 13:25           ` Ralf Fassel
  2010-10-21 15:24             ` Jason Rumney
       [not found]             ` <jwv8w1p4pjo.fsf-monnier+gnu.emacs.help@gnu.org>
  0 siblings, 2 replies; 14+ messages in thread
From: Ralf Fassel @ 2010-10-21 13:25 UTC (permalink / raw)
  To: help-gnu-emacs

* Stefan Monnier <monnier@iro.umontreal.ca>
| Your "unibyte encoding" is most likely latin-1 or latin-9, so your
| problem now is that Emacs for some reason does not try latin-1 for
| those files that don't use utf-8.

Let's assume Latin-9 (IIRC that is Latin-1 with the Euro-Sign?).

| C-x RET r latin-1 RET should cause the file to be re-read as a latin-1
| file, and it should then be displayed properly.

The file is one line:

    % cat foo.txt
    Herr Müller editiert mit Emacs.
    % od -c foo.txt
    0000000   H   e   r   r       M 374   l   l   e   r       e   d   i   t
    0000020   i   e   r   t       m   i   t       E   m   a   c   s   .  \n
    0000040

If I load that in
  emacs23 -Q ~/tmp/foo.txt
the file is displayed correctly (mode line shows "1:--- foo.txt"),
Umlauts are displayed properly and word movement works.

However, a different, larger (800+kB) file which is also supposed to be
latin-1 displays as "t:--- file", and the single-byte Umlauts are
displayed as octal.  How can I find out why emacs loads this file in 't'
instead of '1'?  A quick search shows only doubled umlauts such as
'Grüße' or öö, but if I add these to foo.txt, emacs still loads foo.txt
as "1:".
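One way to hunt for the odd byte in a large file, independent of Emacs, is to scan for bytes in the range 0x80-0x9F. ISO-8859-1 assigns no printable characters there, while windows-1252 uses that range for the Euro sign, "smart quotes" and similar. That such bytes are what keeps the file from being detected as latin-1 is an assumption, not something established at this point. The sketch below is in Python; `find_c1_bytes` and the sample data are hypothetical names made up for illustration.

```python
def find_c1_bytes(data: bytes):
    """Return (offset, byte) pairs for every byte in the range 0x80-0x9F."""
    return [(i, b) for i, b in enumerate(data) if 0x80 <= b <= 0x9F]


# Hypothetical sample standing in for the large file: ordinary Latin-1 text
# (including doubled umlauts) plus one stray byte \200 (0x80).
sample = "Grüße, Herr Müller: 5 ".encode("latin-1") + b"\x80\n"

hits = find_c1_bytes(sample)
for offset, byte in hits:
    # Print the byte in octal (as Emacs shows it) and in hex.
    print(f"offset {offset}: \\{byte:o} (0x{byte:02X})")
```

For a real file, the data would be read in binary mode, e.g. `open(path, "rb").read()`, so that no decoding happens before the scan. The umlauts and ß themselves (0xFC, 0xDF, ...) lie outside the scanned range and are not reported.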

Thanks
R'


* Re: word syntax/umlauts emacs 23 vs 22
  2010-10-21 13:25           ` Ralf Fassel
@ 2010-10-21 15:24             ` Jason Rumney
  2010-10-25  9:33               ` Ralf Fassel
  2010-10-26  2:53               ` Ilya Zakharevich
       [not found]             ` <jwv8w1p4pjo.fsf-monnier+gnu.emacs.help@gnu.org>
  1 sibling, 2 replies; 14+ messages in thread
From: Jason Rumney @ 2010-10-21 15:24 UTC (permalink / raw)
  To: help-gnu-emacs

On Oct 21, 9:25 pm, Ralf Fassel <ralf...@gmx.de> wrote:

> However, a different, larger (800+kB) file which is also supposed to be
> latin-1 displays as "t:--- file", and the single-byte Umlauts are
> displayed as octal.  How can I find out why emacs loads this file in 't'
> instead of '1'?  A quick search shows only doubled umlauts such as
> 'Grüße' or öö, but if I add these to foo.txt, emacs still loads foo.txt
> as "1:".

Was the file edited on Windows?  Does it contain euro signs,
"smartquotes" or other non-standard Microsoft additions to Latin-1?

If this is the answer to your problem, then the following should help
(but probably breaks auto-detection of utf-8 files).

(prefer-coding-system 'windows-1252)



* Re: word syntax/umlauts emacs 23 vs 22
       [not found]             ` <jwv8w1p4pjo.fsf-monnier+gnu.emacs.help@gnu.org>
@ 2010-10-25  9:31               ` Ralf Fassel
  0 siblings, 0 replies; 14+ messages in thread
From: Ralf Fassel @ 2010-10-25  9:31 UTC (permalink / raw)
  To: help-gnu-emacs

* Stefan Monnier <monnier@iro.umontreal.ca>
| You can do the following:
| - open the file
| - set its coding-system to latin-1 with C-x RET f latin-1 RET
| - try to save the file using the new coding system: C-x C-s
| - this should hopefully pop up a window giving you a list of chars
|   that conflict.

Ok, by string-replacing the regular Umlauts I found the culprit was a
single \200 char within 880+kB.  This is the Windows-Euro (of course the
file is edited from Windows and Linux).  After replacing this single
char, the file is opened in latin-1 ok.

Since all those files use a special editing major mode in emacs anyway,
I suppose I can switch to latin-1 when entering that major mode.

| But I still suggest you try what I already suggested:
>
| | C-x RET r latin-1 RET should cause the file to be re-read as a
| | latin-1 file, and it should then be displayed properly.
>
| and see what this gives.

This works ok in that the Umlauts are displayed properly, word movement
works, and the single \200 char is not bitched about when saving the
file.  So I'd call this function in the major mode and all should be
well (fingers crossed)...

Now you don't happen to know the details of all of this in *X*emacs?
Last time I looked, the functions were all different...

R'


* Re: word syntax/umlauts emacs 23 vs 22
  2010-10-21 15:24             ` Jason Rumney
@ 2010-10-25  9:33               ` Ralf Fassel
  2010-10-29 18:26                 ` Stefan Monnier
  2010-10-26  2:53               ` Ilya Zakharevich
  1 sibling, 1 reply; 14+ messages in thread
From: Ralf Fassel @ 2010-10-25  9:33 UTC (permalink / raw)
  To: help-gnu-emacs

* Jason Rumney <jasonrumney@gmail.com>
| Was the file edited on Windows?  Does it contain euro signs,
| "smartquotes" or other non-standard Microsoft additions to Latin-1?

Yes.  A single windows-\200-Euro-sign was the reason for not opening it
in latin-1.  After replacing that char, the file is opened ok in latin-1.

| (prefer-coding-system 'windows-1252)

No, those files should not be edited using a windows code page, they are
supposed to be latin-1.

Thanks
R'


* Re: word syntax/umlauts emacs 23 vs 22
  2010-10-21 15:24             ` Jason Rumney
  2010-10-25  9:33               ` Ralf Fassel
@ 2010-10-26  2:53               ` Ilya Zakharevich
  1 sibling, 0 replies; 14+ messages in thread
From: Ilya Zakharevich @ 2010-10-26  2:53 UTC (permalink / raw)
  To: help-gnu-emacs

On 2010-10-21, Jason Rumney <jasonrumney@gmail.com> wrote:
> Was the file edited on Windows?  Does it contain euro signs,
> "smartquotes" or other non-standard Microsoft additions to Latin-1?

Is not it siege mentality?  cp1252 is as standard as iso-SHIT-1 (in
fact much more standard ;-).  This fact should better be recognized by
Emacs' recognition engine...

Hope this helps,
Ilya

P.S.  And BOTH of them are latin-1.  So if you CARE, please use proper names.



* Re: word syntax/umlauts emacs 23 vs 22
  2010-10-25  9:33               ` Ralf Fassel
@ 2010-10-29 18:26                 ` Stefan Monnier
  2010-11-04  9:36                   ` Ralf Fassel
  0 siblings, 1 reply; 14+ messages in thread
From: Stefan Monnier @ 2010-10-29 18:26 UTC (permalink / raw)
  To: help-gnu-emacs

> | (prefer-coding-system 'windows-1252)
> No, those files should not be edited using a windows code page, they are
> supposed to be latin-1.

Are you saying that this char was an error?  If not, then it does seem
like you want to use windows-1252 (which is a perfectly normal
coding-system, tho it happens to be a lot more common for a valid utf-8
file to also be a valid windows-1252 file, so it's more difficult for
Emacs to automatically choose the right coding-system unless you tell it
that you prefer utf-8 more than 1252).
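The ambiguity described above can be seen with a small Python sketch (illustration only; this is not Emacs's detection algorithm, and the sample sentence is made up). Bytes that are valid UTF-8 also decode without error as windows-1252, just into the wrong characters, whereas typical windows-1252 text with umlauts is not valid UTF-8.

```python
# Made-up sample text containing an umlaut and a Euro sign.
text = "Herr Müller zahlt 5 €"

# Valid UTF-8 bytes decode as windows-1252 without any error,
# but the result is mojibake rather than the original text.
utf8_bytes = text.encode("utf-8")
misread = utf8_bytes.decode("cp1252")
print(misread)

# The reverse direction fails: the windows-1252 bytes for ü (0xFC) and
# € (0x80) do not form valid UTF-8 sequences.
cp1252_bytes = text.encode("cp1252")
try:
    cp1252_bytes.decode("utf-8")
    reverse_ok = True
except UnicodeDecodeError:
    reverse_ok = False
print(reverse_ok)
```

So a detector that tries windows-1252 first will accept most UTF-8 files by mistake, which is why the order of preference matters.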

        Stefan


* Re: word syntax/umlauts emacs 23 vs 22
  2010-10-29 18:26                 ` Stefan Monnier
@ 2010-11-04  9:36                   ` Ralf Fassel
  2010-11-04 19:37                     ` Stefan Monnier
  0 siblings, 1 reply; 14+ messages in thread
From: Ralf Fassel @ 2010-11-04  9:36 UTC (permalink / raw)
  To: help-gnu-emacs

* Stefan Monnier <monnier@iro.umontreal.ca>
| > | (prefer-coding-system 'windows-1252)
| > No, those files should not be edited using a windows code page, they
| > are supposed to be latin-1.
>
| Are you saying that this char was an error?

Hard to say.  Windows users insert \200 when they press the Euro sign on
their keyboard, Linux users enter \244.  Since we're mostly Linux, the
\244 should be the Euro.  

| If not, then it does seem like you want to use windows-1252 (which is
| a perfectly normal coding-system,

Since the Umlauts are at the same positions in latin-9 and windows-1252
we might as well use cp1252, but then the Linux-Euro will get displayed
as a crossed 'o'... well...
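The byte positions mentioned above can be checked with a short Python sketch (illustration only; the codec names are Python's, i.e. `latin_1` for ISO-8859-1, `iso8859_15` for Latin-9 and `cp1252` for windows-1252).

```python
codecs_to_compare = ("latin_1", "iso8859_15", "cp1252")

# Byte 0xFC (\374) is ü in all three encodings.
umlaut = {name: b"\xfc".decode(name) for name in codecs_to_compare}

# Byte 0xA4 (\244) is the Euro sign only in Latin-9; in the other two it is
# the generic currency sign (the "crossed o").
byte_244 = {name: b"\xa4".decode(name) for name in codecs_to_compare}

# Byte 0x80 (\200) is the Euro sign only in windows-1252; in the ISO
# encodings it is a C1 control code, not a printable character.
byte_200 = {name: b"\x80".decode(name) for name in codecs_to_compare}

for name in codecs_to_compare:
    print(name, repr(umlaut[name]), repr(byte_244[name]), repr(byte_200[name]))
```

No single one of these three encodings shows both \200 and \244 as a Euro sign, which is the trade-off described above.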

I'd say let's end this thread.  We can work with emacs and xemacs again
after re-enabling multibyte mode, and using the proper default
encoding.

Thanks again
R'


* Re: word syntax/umlauts emacs 23 vs 22
  2010-11-04  9:36                   ` Ralf Fassel
@ 2010-11-04 19:37                     ` Stefan Monnier
  0 siblings, 0 replies; 14+ messages in thread
From: Stefan Monnier @ 2010-11-04 19:37 UTC (permalink / raw)
  To: help-gnu-emacs

> | > | (prefer-coding-system 'windows-1252)
> | > No, those files should not be edited using a windows code page, they
> | > are supposed to be latin-1.
> | Are you saying that this char was an error?
> Hard to say.  Windows users insert \200 when they press the Euro sign on
> their keyboard, Linux users enter \244.

Don't know about Windows, but at least for GNU/Linux what you say is
largely not true: all major distributions of GNU/Linux switched to utf-8
locales by default many years ago, so by now most GNU/Linux users will
insert #xE2 #x82 #xAC when they press the Euro sign.
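The three bytes quoted above are simply the UTF-8 encoding of the Euro sign, as a quick Python check shows (illustration only).

```python
euro = "\u20ac"  # the Euro sign, U+20AC

# In a UTF-8 locale the Euro sign is a three-byte sequence.
euro_utf8 = euro.encode("utf-8")
print(" ".join(f"{b:02X}" for b in euro_utf8))

# In the single-byte encodings discussed earlier it is one byte each.
euro_latin9 = euro.encode("iso8859_15")
euro_cp1252 = euro.encode("cp1252")
print(len(euro_utf8), len(euro_latin9), len(euro_cp1252))
```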


        Stefan


