Ediff problem with accents

unofficial mirror of help-gnu-emacs@gnu.org

* Ediff problem with accents
From: Sébastien Vauban @ 2006-09-12 20:32 UTC


Hi,

Recently (after a new PC installation, in fact), I've run into a
strange problem whose cause I can't work out. It concerns the
behavior of ediff.

I've always used ediff without any problem. That great tool
always helped me review the changes I'd made before committing
them and writing a sensible log comment.

Now I can't really use it anymore, as it sees every accent as
being a difference between the source and the modified file...

Here's an example:

    ,----[ Source buffer ]
    | Ce document présente le détail des modifications apportées...
    | Avant.
    |
    |----[ Modified buffer ]
    | Ce document prÃ©sente le dÃ©tail des modifications apportÃ©es...
    | Maintenant.
    `----

Needless to say, it becomes very difficult to distinguish
the real modifications (`Avant' -> `Maintenant') from the "false
positives" (`présente' -> `prÃ©sente').

Note that this problem only occurs with ediff, not at all with
diff (which only reports the real modifications).

Any help?

Thank you very much,
  Seb

-- 
Sébastien Vauban

* Re: Ediff problem with accents
From: Peter Dyballa @ 2006-09-12 21:36 UTC
  Cc: help-gnu-emacs

On 12.09.2006 at 20:32, Sébastien Vauban wrote:

>     ,----[ Source buffer ]
>     | Ce document présente le détail des modifications apportées...
>     | Avant.
>     |
>     |----[ Modified buffer ]
>     | Ce document prÃ©sente le dÃ©tail des modifications apportÃ©es...
>     | Maintenant.
>     `----

The upper buffer's contents are apparently displayed in an ISO Latin
encoding, presumably ISO 8859-1 or ISO 8859-15, while the lower one
shows UTF-8 contents decoded in that same ISO Latin encoding (in
UTF-8, é is the two bytes C3 A9, which read as Latin-1 are Ã and ©).
It seems more likely, though, that (almost) the same UTF-8 contents
are displayed once as UTF-8 (correct) and once as ISO Latin
(incorrect).
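
The effect described above can be reproduced in a scratch buffer. This
is only a sketch: `my-utf8-read-as-latin-1' is a hypothetical helper
name, not something from Seb's setup, and it only mimics what happens
when UTF-8 bytes are decoded with a Latin-1 coding system.

```elisp
(defun my-utf8-read-as-latin-1 (text)
  "Encode TEXT as UTF-8, then decode the resulting bytes as ISO 8859-1.
This mimics a UTF-8 file being read with the wrong coding system.
Hypothetical helper, for illustration only."
  (decode-coding-string (encode-coding-string text 'utf-8) 'iso-8859-1))

(my-utf8-read-as-latin-1 "é")         ; => "Ã©"
(my-utf8-read-as-latin-1 "présente")  ; => "prÃ©sente"
```

The second result is exactly the string shown in the modified buffer,
which suggests that one of the two buffers was decoded with the wrong
coding system rather than that the file contents differ.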

To make both buffers use the same encoding, put this into your
.emacs file:

	(prefer-coding-system    'utf-8-unix)

or set environment variables like LC_CTYPE or LANG to a value that
names UTF-8, like mine: de_DE.UTF-8. With such a setting GNU Emacs
uses UTF-8 automatically.

At least it works for my buffers with UTF-8 contents in ediff ...
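
To see what Emacs actually picked up from the environment, something
like the following could be evaluated with M-:. It is a sketch only;
`my-report-coding-defaults' is a made-up name, and the values shown
will of course depend on the machine.

```elisp
(defun my-report-coding-defaults ()
  "Show the locale variables and the coding defaults Emacs derived from them.
Hypothetical helper, for illustration only."
  (interactive)
  (message "LANG=%s LC_CTYPE=%s locale=%s default-for-new-files=%s"
           (getenv "LANG")
           (getenv "LC_CTYPE")
           locale-coding-system
           (default-value 'buffer-file-coding-system)))
```

If LANG and LC_CTYPE both come back as nil, Emacs had no locale
information to go on and fell back on its built-in defaults.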

--
Greetings

   Pete

"Let's face it; we don't want a free market economy either."
         James Farley, president, Coca-Cola Export Corp., 1959

* Re: Ediff problem with accents
From: Sébastien Vauban @ 2006-09-22 10:14 UTC


Hello Peter,

Sorry for the long delay... but I wasn't able to run the tests I
wanted until now.

FYI, I've cleaned up the part of my .emacs that deals with coding
systems (you'll find an extract below), and I've made a lot of
comparisons.

I still have the problem, but here is a closer look at what I'm
experiencing:

    o   if (prefer-coding-system 'iso-latin-9),
        then I see the following when ediff'ing:

        ----------------------------------------
        | ^M               |                   |
        | pr\351sente ^M   | présente          |
        | ^M               |                   |
        |                  |                   |
        |------------------|-------------------|
        |-0:%%             |-0\--              |  (modeline)
        ----------------------------------------
          iso-latin-9?       iso-latin-9-dos


    o   if (prefer-coding-system 'utf-8),
        then I see the following when ediff'ing:

        ----------------------------------------
        | ^M               |                   |
        | pr\351sente ^M   | présente          |
        | ^M               |                   |
        |                  |                   |
        |------------------|-------------------|
        |-u:%%             |-1\--              |
        ----------------------------------------
          utf-8?             iso-latin-1-dos


    o   if I don't set any preferred coding system (commented line),
        then I see the following when ediff'ing:

        ----------------------------------------
        | ^M               |                   |
        | pr\351sente ^M   | présente          |
        | ^M               |                   |
        |                  |                   |
        |------------------|-------------------|
        |-1\%%             |-1\--              |
        ----------------------------------------
          iso-latin-1-dos?   iso-latin-1-dos

To determine the coding system shown under each window, I used

    M-x describe-coding-system RET RET

but, for the base version, it states "not set locally, use the
default"; that's why I wrote the default coding system for new
files and put a question mark after it (because I'm not sure
this is the right way to find out).
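
A more direct way to check might be to look at the buffer-local
variable `buffer-file-coding-system' in each of the two buffers. The
command below is only a sketch, and `my-show-buffer-coding' is a
hypothetical name:

```elisp
(defun my-show-buffer-coding ()
  "Show the coding system recorded for the current buffer.
Hypothetical helper, for illustration only."
  (interactive)
  (message "%s: %s" (buffer-name) buffer-file-coding-system))
```

Running it once in each ediff buffer should show whether the two sides
really were decoded with the same coding system and the same
end-of-line convention.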

So, you can see that, whatever I do, I can't compare my buffers
in a normal way... I'm completely lost...

THANK YOU very much for any help you can give me,
  Seb

PS- As promised, an extract of my .emacs config file:

,----[ my Emacs Init File ]
| 
| (message "26 International Character Set Support...")
| 
| ;; default input method for multilingual text
| (setq default-input-method "latin-9-prefix")
| 
| ;; if you want to use UTF-8 on Emacs 21.3, install Mule-UCS
| (GNUEmacs
|     (try-require 'un-define))
| 
| (add-to-list 'file-coding-system-alist
|              '("\\.owl\\'" utf-8 . utf-8))
| ;; In GNU Emacs, when you specify the coding explicitly in the file, that
| ;; overrides `file-coding-system-alist'. Not in XEmacs?
| 
| ;; ;; default coding system (for new files)
| (GNUEmacs
|     (prefer-coding-system 'utf-8))
| 
| (GNUEmacs
|     ;; to copy and paste outside Emacs
|     (set-clipboard-coding-system 'iso-latin-9))  ;; aka iso-8859-15
| 
| ;; unify the Latin-N charsets, so that Emacs knows that the é in Latin-9
| ;; (with the euro) is the same as the é in Latin-1 (without the euro)
| ;; [avoid the small accentuated characters]
| (when (try-require 'ucs-tables)
|     (unify-8859-on-encoding-mode 1)  ;; harmless
|     (unify-8859-on-decoding-mode 1)) ;; may unexpectedly change files if they
|                                      ;; contain different Latin-N charsets
|                                      ;; which should not be unified
| 
| (when window-system
|   ;; functions for dealing with char tables
|   (require 'disp-table))
| 
| (XEmacs
|     (require 'iso-syntax))
| 
| (message "26 International Character Set Support... Done")
`----

-- 
Sébastien Vauban

* Re: Ediff problem with accents
From: Peter Dyballa @ 2006-09-22 10:42 UTC
  Cc: GNU Emacs List

On 22.09.2006 at 11:20, Sébastien Vauban wrote:

> Hello Peter,
>
> Sorry for the long delay... but I wasn't able to run the tests I
> wanted until now.
>
>     Note -- You can copy this mail to gnus.emacs.help. I don't
>             have news access from where I am now...
>
> FYI, I've cleaned up the part of my .emacs that deals with coding
> systems (you'll find an extract below), and I've made a lot of
> comparisons.
>
> I still have the problem, but here is a closer look at what I'm
> experiencing:
>
>     o   if (prefer-coding-system 'iso-latin-9),
>         then I see the following when ediff'ing:
>
>         ----------------------------------------
>         | ^M               |                   |
>         | pr\351sente ^M   | présente          |
>         | ^M               |                   |
>         |                  |                   |
>         |------------------|-------------------|
>         |-0:%%             |-0\--              |  (modeline)
>         ----------------------------------------
>           iso-latin-9?       iso-latin-9-dos

That's correct as far as the mode line goes: ISO 8859-15, i.e. ISO
Latin-9, is in use. Is the left buffer read-only and the right one
unmodified? The ``\´´ puzzles me, but it has been so long since I
used GNU Emacs on some MS Losedows that I cannot remember what it
means. The ^M in the left buffer should not appear; the correct
encoding is probably that of the right-hand buffer, which also
displays présente properly.

>
>
>     o   if (prefer-coding-system 'utf-8),
>         then I see the following when ediff'ing:
>
>         ----------------------------------------
>         | ^M               |                   |
>         | pr\351sente ^M   | présente          |
>         | ^M               |                   |
>         |                  |                   |
>         |------------------|-------------------|
>         |-u:%%             |-1\--              |
>         ----------------------------------------
>           utf-8?             iso-latin-1-dos

Again, the mode lines are right, and the left buffer needs to be
read as utf-8-dos to make the ^M disappear and make pr\351sente
appear correctly. The prefer-coding-system function accepts
"extensions" like -dos, -mac, and -unix to specify the preferred
encoding exactly.
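
A single file can also be read with an explicit coding system,
independently of the global preference. Note that \351 is the octal
escape for the single byte E9, which is é in Latin-1 and Latin-9 but
not a complete character in UTF-8, so the sketch below assumes that
the file on disk is Latin-1 with CRLF line ends. The function name
`my-read-as-latin-1-dos' is made up for this example.

```elisp
(defun my-read-as-latin-1-dos (file)
  "Read FILE into a buffer, decoding it as ISO Latin-1 with CRLF line ends.
Return the buffer.  Hypothetical helper, for illustration only."
  ;; `coding-system-for-read' overrides automatic detection while it is bound.
  (let ((coding-system-for-read 'iso-latin-1-dos))
    (find-file-noselect file)))
```

Interactively, C-x RET c iso-latin-1-dos RET typed just before C-x C-f
should have much the same effect for a single command.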

>
>
>     o   if I don't set any preferred coding system (commented line),
>         then I see the following when ediff'ing:
>
>         ----------------------------------------
>         | ^M               |                   |
>         | pr\351sente ^M   | présente          |
>         | ^M               |                   |
>         |                  |                   |
>         |------------------|-------------------|
>         |-1\%%             |-1\--              |
>         ----------------------------------------
>           iso-latin-1-dos?   iso-latin-1-dos

Here certainly the left buffer is not -dos – otherwise the ^M would  
not appear there.

>
> To determine the coding system shown under each window, I used
>
>     M-x describe-coding-system RET RET
>
> but, for the base version, it states "not set locally, use the
> default"; that's why I wrote the default coding system for new
> files and put a question mark after it (because I'm not sure
> this is the right way to find out).

There are no local settings in the file (see below), so some default
is assumed, which the *Help* buffer should describe:

	Coding system for saving this buffer:
	  0 -- iso-latin-9-unix

	Default coding system (for new files):
	  u -- mule-utf-8-unix

	Coding system for keyboard input:
	  nil
	Coding system for terminal output:
	  u -- mule-utf-8 (alias: utf-8)

	Defaults for subprocess I/O:
	  decoding: u -- mule-utf-8 (alias: utf-8)

	  encoding: u -- mule-utf-8 (alias: utf-8)

	Priority order for recognizing coding systems when reading files:
	  .
	  .
	  .

The prefer-coding-system setting also affects your old files: they
can now be interpreted differently than when they were created and
saved. You could stick with iso-latin-9-dos to have the € and keep
your old files unchanged. Every new (and old) file will have some
extra ^M bytes, but at least new and old ones will be treated
equally. (Conversion could be done on the command line (recode,
iconv) or, more time-consumingly, with GNU Emacs: Options menu ->
Mule -> Set Coding Systems.)
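
For more than a handful of files, the conversion inside GNU Emacs
could be scripted. The following is a minimal sketch, not a tested
migration tool: `my-recode-file' is a hypothetical helper, it
overwrites the file in place, and it assumes you already know the
file's current encoding (so keep a backup).

```elisp
(defun my-recode-file (file from to)
  "Rewrite FILE in place, decoding it as FROM and encoding it as TO.
FROM and TO are coding system symbols.  Hypothetical helper."
  (with-temp-buffer
    ;; Force the decoding instead of letting Emacs guess.
    (let ((coding-system-for-read from))
      (insert-file-contents file))
    ;; Force the encoding (and end-of-line style) used for writing.
    (let ((coding-system-for-write to))
      (write-region (point-min) (point-max) file))))

;; Example call, with a made-up file name:
;; (my-recode-file "notes.txt" 'iso-latin-9-dos 'utf-8-unix)
```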

>
> So, you can see that, whatever I do, I can't compare my buffers
> in a normal way... I'm completely lost...

Try: (prefer-coding-system 'iso-latin-9-dos).

You can also make several of these calls, each with a different
encoding. GNU Emacs will then choose from this list first and only
afterwards try to find another encoding.
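
As a sketch of such a list (the particular choice and order of
encodings here is only an assumption about what might suit Seb's
files): each call puts its coding system at the front of the priority
list, so the call made last ends up with the highest priority.

```elisp
;; Tried second: Latin-9 files with DOS line ends.
(prefer-coding-system 'iso-latin-9-dos)
;; Tried first, because this call comes last.
(prefer-coding-system 'utf-8)
```

The resulting order is shown under "Priority order for recognizing
coding systems when reading files" in the output of
M-x describe-coding-system RET RET.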

>
> PS- As promised, an extract of my .emacs config file:
>
> ,----[ my Emacs Init File ]
> |
> | (message "26 International Character Set Support...")
> |
> | ;; default input method for multilingual text
> | (setq default-input-method "latin-9-prefix")

I do not use any input method: my keyboard composes é when I press
the dead key ´ first and then the e. That also works for some other
accented characters. Actually I think I have never used any Emacs
input method. 20 years ago I had DEC or Sun keyboards with a Compose
key; now the X server lets me type other characters with alt or
shift-alt pressed ...

> |
> | ;; if you want to use UTF-8 on Emacs 21.3, install Mule-UCS
> | (GNUEmacs
> |     (try-require 'un-define))

This was necessary with GNU Emacs 20. The recent 21.x versions have
MULE more or less built in. It could be that this line causes a lot
of your trouble. (A good way to test the built-in capabilities is to
launch GNU Emacs with -Q: no site or user specific initialisation
files are used. And it might perform better ...)

> |
> | (add-to-list 'file-coding-system-alist
> |              '("\\.owl\\'" utf-8 . utf-8))

This obviously only affects .owl files.

> | ;; In GNU Emacs, when you specify the coding explicitly in the file, that
> | ;; overrides `file-coding-system-alist'. Not in XEmacs?
> |
> | ;; ;; default coding system (for new files)
> | (GNUEmacs
> |     (prefer-coding-system 'utf-8))

You might consider adding -dos, but it's more important to
understand that this change will make a lot of your old files
unusable. In UTF-8 only the 7-bit ASCII range is encoded in one
octet. All 8-bit characters from the ISO Latin encodings are encoded
in two octets (or even three, for example the €). Your é is encoded
as C3 A9 (€ as the well-known E2 82 AC). If GNU Emacs only sees E9
(or A4 for €), it will make mistakes! If you switch to UTF-8 you
would need to convert all text files first, or record their old
encodings by adding a header line like this as the first line:

	 -*- mode: Text; coding: iso-8859-15; -*-

The mode part is not necessary (could also be tex or latex), but
coding *is*. The other option is 'local variables' in the file's footer:

	%%% Local Variables:
	%%% coding: iso-8859-15
	%%% mode: tex
	%%% End:

You might also need to teach GNU Emacs that these local variables
are 'safe'.
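
The byte values quoted above for é and € can be checked from within
Emacs. This is just a sketch; `my-hex-bytes' is a hypothetical helper
name, and iso-8859-15 is used here as the standard name for Latin-9.

```elisp
(defun my-hex-bytes (text coding)
  "Return the bytes of TEXT encoded in CODING as space-separated hex pairs.
Hypothetical helper, for illustration only."
  (mapconcat (lambda (byte) (format "%02X" byte))
             (encode-coding-string text coding)
             " "))

(my-hex-bytes "é" 'iso-8859-15)  ; => "E9"
(my-hex-bytes "é" 'utf-8)        ; => "C3 A9"
(my-hex-bytes "€" 'iso-8859-15)  ; => "A4"
(my-hex-bytes "€" 'utf-8)        ; => "E2 82 AC"
```

A file that contains a lone E9 is therefore not valid UTF-8, which is
why the coding line in the header or footer matters once UTF-8 becomes
the preferred coding system.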

> |
> | (GNUEmacs
> |     ;; to copy and paste outside Emacs
> |     (set-clipboard-coding-system 'iso-latin-9))  ;; aka iso-8859-15

This depends on the windowing system you use. Now, I think, most will  
use UTF-8 ...

> |
> | ;; unify the Latin-N charsets, so that Emacs knows that the é in Latin-9
> | ;; (with the euro) is the same as the é in Latin-1 (without the euro)
> | ;; [avoid the small accentuated characters]
> | (when (try-require 'ucs-tables)
> |     (unify-8859-on-encoding-mode 1)  ;; harmless
> |     (unify-8859-on-decoding-mode 1)) ;; may unexpectedly change files if they
> |                                      ;; contain different Latin-N charsets
> |                                      ;; which should not be unified

I use these two in GNU Emacs 21.3.50 without the MULE/ucs clause ...

> |
> | (when window-system
> |   ;; functions for dealing with char tables
> |   (require 'disp-table))

This might have been useful in GNU Emacs 20 and before. I never used
it, except perhaps for european-display or the like. I also avoid
set-language-environment: this is close to obsolete, to put it
politely.

> |
> | (XEmacs
> |     (require 'iso-syntax))

I'm not really an XEmacs user, but I think this is also something  
from the past, 20th century or before.

It could be that GNU Emacs 22.0.50 would serve you better. Both GNU
Emacsen, 22.0.50 and 21.3, work better and set themselves up better
internally when they can read environment variables like LC_CTYPE,
LANG, or LC_ALL that describe the environment they are running in.
Then you only need to specify the exceptions to this general rule. In
my environment I have LC_CTYPE=de_DE.UTF-8 ...

--
Greetings

   Pete

A morning without coffee is like something without something else.
