unofficial mirror of emacs-devel@gnu.org 
* Russian letters
@ 2006-07-05 18:10 Paul Pogonyshev
  2006-07-05 18:19 ` Andreas Schwab
  0 siblings, 1 reply; 23+ messages in thread
From: Paul Pogonyshev @ 2006-07-05 18:10 UTC (permalink / raw)


Russian letters loaded from file and newly typed are different
character no matter if `unify-8859-on-...-mode's are active or
not.

Characters loaded from file:

  character: а (3664, #o7120, #xe50, U+0430)
    charset: cyrillic-iso8859-5 (Right-Hand Part of Latin/Cyrillic Alphabet (ISO/IEC 8859-5): ISO-IR-144.)
 code point: #x50
     syntax: w 	which means: word
   category: y:Cyrillic
buffer code: #x8C #xD0
  file code: #xD0 #xB0 (encoded by coding system mule-utf-8-unix)
    display: by this font (glyph code)
     -cronyx-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO8859-5 (#xD0)

Newly typed characters:

  character: а (332880, #o1212120, #x51450, U+0430)
    charset: mule-unicode-0100-24ff (Unicode characters of the range U+0100..U+24FF.)
 code point: #x28 #x50
     syntax: w 	which means: word
   category: y:Cyrillic
buffer code: #x9C #xF4 #xA8 #xD0
  file code: #xD0 #xB0 (encoded by coding system mule-utf-8-unix)
    display: by this font (glyph code)
     -Adobe-Courier-Medium-R-Normal--17-120-100-100-M-100-ISO10646-1 (#x430)

The latter are displayed as boxes on my machine, which makes editing
of Russian text impossible.  Reproducible with `emacs -Q'.  I consider
it a bug.

Paul
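
A minimal Python sketch, not from the original message, of the byte
arithmetic behind the two `describe-char' reports above.  It uses only
Python's standard codecs, not Emacs internals.  The helper
`mule_unicode_code_point' is hypothetical: it assumes a 96x96 layout
starting at U+0100 with each position offset by 32.  That assumption
reproduces the reported code point #x28 #x50, but it is not confirmed
anywhere in this thread.

```python
# Illustration only: standard Python codecs, not Emacs internals.
ch = "\u0430"  # CYRILLIC SMALL LETTER A, the character in both reports

# "file code" in both reports: the UTF-8 encoding of U+0430.
utf8_bytes = ch.encode("utf-8")

# Glyph code in the ISO8859-5 font: the ISO 8859-5 byte for the same letter.
iso_byte = ch.encode("iso8859_5")

# "code point: #x50" for cyrillic-iso8859-5 is that byte with the high
# bit cleared (the charset covers the right-hand part of ISO 8859-5).
iso_code_point = iso_byte[0] & 0x7F


def mule_unicode_code_point(cp):
    """Hypothetical: split (cp - 0x100) into two base-96 digits, each + 32."""
    high, low = divmod(cp - 0x100, 96)
    return high + 32, low + 32


def hexbytes(values):
    """Format an iterable of small integers the way describe-char does."""
    return " ".join("#x%02X" % v for v in values)


print("file code (UTF-8):     ", hexbytes(utf8_bytes))
print("ISO 8859-5 byte:       ", hexbytes(iso_byte))
print("cyrillic-iso8859-5 cp: ", hexbytes([iso_code_point]))
print("mule-unicode cp (assumed layout):",
      hexbytes(mule_unicode_code_point(ord(ch))))
```

Both reports describe the same Unicode character with the same file
code; only the internal charset, and therefore the font chosen for
display, differs.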


* Re: Russian letters
  2006-07-05 18:10 Russian letters Paul Pogonyshev
@ 2006-07-05 18:19 ` Andreas Schwab
  2006-07-05 21:43   ` Paul Pogonyshev
  0 siblings, 1 reply; 23+ messages in thread
From: Andreas Schwab @ 2006-07-05 18:19 UTC (permalink / raw)
  Cc: emacs-devel

Paul Pogonyshev <pogonyshev@gmx.net> writes:

> Russian letters loaded from file and newly typed are different
> character no matter if `unify-8859-on-...-mode's are active or
> not.

What's your language environment?

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


* Re: Russian letters
  2006-07-05 18:19 ` Andreas Schwab
@ 2006-07-05 21:43   ` Paul Pogonyshev
  2006-07-05 22:08     ` Andreas Schwab
  0 siblings, 1 reply; 23+ messages in thread
From: Paul Pogonyshev @ 2006-07-05 21:43 UTC (permalink / raw)
  Cc: Andreas Schwab

Andreas Schwab wrote:
> Paul Pogonyshev <pogonyshev@gmx.net> writes:
> 
> > Russian letters loaded from file and newly typed are different
> > character no matter if `unify-8859-on-...-mode's are active or
> > not.
> 
> What's your language environment?

I'm not sure I understand your question.  The buffer is in UTF-8 and
Emacs knows that.  `locale' reports

LANG=en_US.utf8
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=en_US.utf8

Paul


* Re: Russian letters
  2006-07-05 21:43   ` Paul Pogonyshev
@ 2006-07-05 22:08     ` Andreas Schwab
  2006-07-05 22:21       ` Paul Pogonyshev
  0 siblings, 1 reply; 23+ messages in thread
From: Andreas Schwab @ 2006-07-05 22:08 UTC (permalink / raw)
  Cc: emacs-devel

Paul Pogonyshev <pogonyshev@gmx.net> writes:

> Andreas Schwab wrote:
>> Paul Pogonyshev <pogonyshev@gmx.net> writes:
>> 
>> > Russian letters loaded from file and newly typed are different
>> > character no matter if `unify-8859-on-...-mode's are active or
>> > not.
>> 
>> What's your language environment?
>
> I'm not sure I understand your question.

C-h L (describe-language-environment)

Anyway, as documented, unify-8859-on-decoding-mode can only map to
`iso-latin-1' and `mule-unicode-0100-24ff'.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


* Re: Russian letters
  2006-07-05 22:08     ` Andreas Schwab
@ 2006-07-05 22:21       ` Paul Pogonyshev
  2006-07-05 22:55         ` Andreas Schwab
  2006-07-06  3:41         ` Eli Zaretskii
  0 siblings, 2 replies; 23+ messages in thread
From: Paul Pogonyshev @ 2006-07-05 22:21 UTC (permalink / raw)
  Cc: Andreas Schwab

Andreas Schwab wrote:
> Paul Pogonyshev <pogonyshev@gmx.net> writes:
> 
> > Andreas Schwab wrote:
> >> Paul Pogonyshev <pogonyshev@gmx.net> writes:
> >> 
> >> > Russian letters loaded from file and newly typed are different
> >> > character no matter if `unify-8859-on-...-mode's are active or
> >> > not.
> >> 
> >> What's your language environment?
> >
> > I'm not sure I understand your question.
> 
> C-h L (describe-language-environment)

UTF-8 language environment

Input methods (default rfc1345):
  rfc1345 ("m" in mode line)
  TeX ("\" in mode line)
  sgml ("&" in mode line)
  ucs ("U+" in mode line)

Character sets:
  nothing specific to UTF-8

Coding systems:
  mule-utf-8 (`u' in mode line):
	UTF-8 encoding for Emacs-supported Unicode characters.
It supports Unicode characters of these ranges:
    U+0000..U+33FF, U+E000..U+FFFF.
They correspond to these Emacs character sets:
    ascii, latin-iso8859-1, mule-unicode-0100-24ff,
    mule-unicode-2500-33ff, mule-unicode-e000-ffff

On decoding (e.g. reading a file), Unicode characters not in the above
ranges are decoded into sequences of eight-bit-control and
eight-bit-graphic characters to preserve their byte sequences.  The
byte sequence is preserved on i/o for valid utf-8, but not necessarily
for invalid utf-8.

On encoding (e.g. writing a file), Emacs characters not belonging to
any of the character sets listed above are encoded into the UTF-8 byte
sequence representing U+FFFD (REPLACEMENT CHARACTER).
	(alias: mule-utf-8 utf-8)
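
The ranges listed in the coding-system description above can be checked
mechanically.  A rough Python sketch, with the hypothetical helper name
`in_mule_utf_8_ranges'; it models only the two ranges quoted above and
says nothing about how Emacs maps its internal character sets:

```python
# Ranges copied from the mule-utf-8 description quoted above.
SUPPORTED_RANGES = [(0x0000, 0x33FF), (0xE000, 0xFFFF)]


def in_mule_utf_8_ranges(code_point):
    """Return True if the code point lies in one of the listed ranges."""
    return any(low <= code_point <= high for low, high in SUPPORTED_RANGES)


# UTF-8 bytes of U+FFFD (REPLACEMENT CHARACTER), which the description
# names as the fallback on encoding.
REPLACEMENT_BYTES = "\ufffd".encode("utf-8")

for cp in (0x0430, 0x4E2D):
    print("U+%04X supported: %s" % (cp, in_mule_utf_8_ranges(cp)))
```

Cyrillic U+0430 falls inside the first range, so the coding system
itself has no trouble with it.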

> Anyway, as documented, unify-8859-on-decoding-mode can only map to
> `iso-latin-1' and `mule-unicode-0100-24ff'.

That's fine, but if the same characters read from file and typed from
keyboard are different in a buffer, that's nothing else than a bug.
Tell the average user about language environments.  Ideally, Emacs
should work in this case as installed, without any configuration or
lines in `.emacs'.

Paul


* Re: Russian letters
  2006-07-05 22:21       ` Paul Pogonyshev
@ 2006-07-05 22:55         ` Andreas Schwab
  2006-07-06 15:59           ` Paul Pogonyshev
  2006-07-06  3:41         ` Eli Zaretskii
  1 sibling, 1 reply; 23+ messages in thread
From: Andreas Schwab @ 2006-07-05 22:55 UTC (permalink / raw)
  Cc: emacs-devel

Paul Pogonyshev <pogonyshev@gmx.net> writes:

>> Anyway, as documented, unify-8859-on-decoding-mode can only map to
>> `iso-latin-1' and `mule-unicode-0100-24ff'.
>
> That's fine, but if the same characters read from file and typed from
> keyboard are different in a buffer, that's nothing else than a bug.

You can get that only if you explicitly specify a coding system during
reading, otherwise your file would be decoded as latin-1 even if it is
encoded as cyrillic-iso-8bit.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


* Re: Russian letters
  2006-07-05 22:21       ` Paul Pogonyshev
  2006-07-05 22:55         ` Andreas Schwab
@ 2006-07-06  3:41         ` Eli Zaretskii
  2006-07-06 15:56           ` Paul Pogonyshev
  1 sibling, 1 reply; 23+ messages in thread
From: Eli Zaretskii @ 2006-07-06  3:41 UTC (permalink / raw)
  Cc: emacs-devel

> From: Paul Pogonyshev <pogonyshev@gmx.net>
> Date: Thu, 6 Jul 2006 01:21:26 +0300
> Cc: Andreas Schwab <schwab@suse.de>
> 
> > Anyway, as documented, unify-8859-on-decoding-mode can only map to
> > `iso-latin-1' and `mule-unicode-0100-24ff'.
> 
> That's fine, but if the same characters read from file and typed from
> keyboard are different in a buffer, that's nothing else than a bug.

What was the file's encoding?

Does it help to play with the value of utf-fragment-on-decoding?


* Re: Russian letters
  2006-07-06  3:41         ` Eli Zaretskii
@ 2006-07-06 15:56           ` Paul Pogonyshev
  2006-07-06 20:12             ` Eli Zaretskii
  0 siblings, 1 reply; 23+ messages in thread
From: Paul Pogonyshev @ 2006-07-06 15:56 UTC (permalink / raw)


Eli Zaretskii wrote:
> > From: Paul Pogonyshev <pogonyshev@gmx.net>
> > Date: Thu, 6 Jul 2006 01:21:26 +0300
> > Cc: Andreas Schwab <schwab@suse.de>
> > 
> > > Anyway, as documented, unify-8859-on-decoding-mode can only map to
> > > `iso-latin-1' and `mule-unicode-0100-24ff'.
> > 
> > That's fine, but if the same characters read from file and typed from
> > keyboard are different in a buffer, that's nothing else than a bug.
> 
> What was the file's encoding?

UTF-8.

> Does it help to play with the value of utf-fragment-on-decoding?

Not really.  Nothing seems to change and the results of `describe-char'
are identical.

Paul


* Re: Russian letters
  2006-07-05 22:55         ` Andreas Schwab
@ 2006-07-06 15:59           ` Paul Pogonyshev
  2006-07-06 16:39             ` Andreas Schwab
  0 siblings, 1 reply; 23+ messages in thread
From: Paul Pogonyshev @ 2006-07-06 15:59 UTC (permalink / raw)
  Cc: Andreas Schwab

Andreas Schwab wrote:
> Paul Pogonyshev <pogonyshev@gmx.net> writes:
> 
> >> Anyway, as documented, unify-8859-on-decoding-mode can only map to
> >> `iso-latin-1' and `mule-unicode-0100-24ff'.
> >
> > That's fine, but if the same characters read from file and typed from
> > keyboard are different in a buffer, that's nothing else than a bug.
> 
> You can get that only if you explicitly specify a coding system during
> reading, otherwise your file would be decoded as latin-1 even if it is
> encoded as cyrillic-iso-8bit.

The file is UTF-8 and mentions its coding in `Local variables'.  Again,
the file is read just fine.  The problems begin when I type new
characters into the buffer: they are treated differently than the same
characters read from the file.

Paul


* Re: Russian letters
  2006-07-06 15:59           ` Paul Pogonyshev
@ 2006-07-06 16:39             ` Andreas Schwab
  2006-07-06 18:17               ` Paul Pogonyshev
  0 siblings, 1 reply; 23+ messages in thread
From: Andreas Schwab @ 2006-07-06 16:39 UTC (permalink / raw)
  Cc: emacs-devel

Paul Pogonyshev <pogonyshev@gmx.net> writes:

> The file is UTF-8 and mentions its coding in `Local variables'.  Again,
> the file is read just fine.  The problems begin when I type new
> characters into the buffer: they are treated differently than the same
> characters read from the file.

I can't reproduce that here.  Whenever I read a file with russian letters
that is encoded in utf-8 the letters are decoded into
mule-unicode-0100-24ff.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


* Re: Russian letters
  2006-07-06 16:39             ` Andreas Schwab
@ 2006-07-06 18:17               ` Paul Pogonyshev
  2006-07-06 20:11                 ` Eli Zaretskii
  0 siblings, 1 reply; 23+ messages in thread
From: Paul Pogonyshev @ 2006-07-06 18:17 UTC (permalink / raw)
  Cc: Andreas Schwab

Andreas Schwab wrote:
> Paul Pogonyshev <pogonyshev@gmx.net> writes:
> 
> > The file is UTF-8 and mentions its coding in `Local variables'.  Again,
> > the file is read just fine.  The problems begin when I type new
> > characters into the buffer: they are treated differently than the same
> > characters read from the file.
> 
> I can't reproduce that here.  Whenever I read a file with russian letters
> that is encoded in utf-8 the letters are decoded into
> mule-unicode-0100-24ff.

I said many times that the problems begin when I _type_ characters, not
when they are read from file.  They end up being different characters, at
least in the sense that `describe-char' shows different things.  I presume
that new characters are still valid, but they are shown as boxes (no font
support, apparently), while read characters are shown just fine.

Paul


* Re: Russian letters
  2006-07-06 18:17               ` Paul Pogonyshev
@ 2006-07-06 20:11                 ` Eli Zaretskii
  0 siblings, 0 replies; 23+ messages in thread
From: Eli Zaretskii @ 2006-07-06 20:11 UTC (permalink / raw)
  Cc: schwab, emacs-devel

> From: Paul Pogonyshev <pogonyshev@gmx.net>
> Date: Thu, 6 Jul 2006 21:17:24 +0300
> Cc: Andreas Schwab <schwab@suse.de>
> 
> Andreas Schwab wrote:
> > Paul Pogonyshev <pogonyshev@gmx.net> writes:
> > 
> > > The file is UTF-8 and mentions its coding in `Local variables'.  Again,
> > > the file is read just fine.  The problems begin when I type new
> > > characters into the buffer: they are treated differently than the same
> > > characters read from the file.
> > 
> > I can't reproduce that here.  Whenever I read a file with russian letters
> > that is encoded in utf-8 the letters are decoded into
> > mule-unicode-0100-24ff.
> 
> I said many times that the problems begin when I _type_ characters, not
> when they are read from file.

No, you said, and I quote:

    Russian letters loaded from file and newly typed are different
    character no matter if `unify-8859-on-...-mode's are active or
    not.

    Characters loaded from file:

      character: a (3664, #o7120, #xe50, U+0430)
	charset: cyrillic-iso8859-5 (Right-Hand Part of Latin/Cyrillic Alphabet (ISO/IEC 8859-5): ISO-IR-144.)
     code point: #x50
	 syntax: w 	which means: word
       category: y:Cyrillic
    buffer code: #x8C #xD0
      file code: #xD0 #xB0 (encoded by coding system mule-utf-8-unix)
	display: by this font (glyph code)
	 -cronyx-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO8859-5 (#xD0)

    Newly typed characters:

      character: a (332880, #o1212120, #x51450, U+0430)
	charset: mule-unicode-0100-24ff (Unicode characters of the range U+0100..U+24FF.)
     code point: #x28 #x50
	 syntax: w 	which means: word
       category: y:Cyrillic
    buffer code: #x9C #xF4 #xA8 #xD0
      file code: #xD0 #xB0 (encoded by coding system mule-utf-8-unix)
	display: by this font (glyph code)
	 -Adobe-Courier-Medium-R-Normal--17-120-100-100-M-100-ISO10646-1 (#x430)

That is, you said that characters read from a file are decoded into
cyrillic-iso8859-5, while characters you type are decoded into
mule-unicode-0100-24ff.  Now it sounds like it's the other way around,
especially since you say that the file is encoded in UTF-8 (which is
_always_ decoded into mule-unicode-0100-24ff, AFAIR).

Please clarify which one it is.

Also, please try the same file and keyboard keys in "emacs -Q",
perhaps something in your .emacs has unpleasant side effects.


* Re: Russian letters
  2006-07-06 15:56           ` Paul Pogonyshev
@ 2006-07-06 20:12             ` Eli Zaretskii
  2006-07-06 20:27               ` Paul Pogonyshev
  0 siblings, 1 reply; 23+ messages in thread
From: Eli Zaretskii @ 2006-07-06 20:12 UTC (permalink / raw)
  Cc: emacs-devel

> From: Paul Pogonyshev <pogonyshev@gmx.net>
> Date: Thu, 6 Jul 2006 18:56:54 +0300
> 
> Eli Zaretskii wrote:
> > > From: Paul Pogonyshev <pogonyshev@gmx.net>
> > > Date: Thu, 6 Jul 2006 01:21:26 +0300
> > > Cc: Andreas Schwab <schwab@suse.de>
> > > 
> > > > Anyway, as documented, unify-8859-on-decoding-mode can only map to
> > > > `iso-latin-1' and `mule-unicode-0100-24ff'.
> > > 
> > > That's fine, but if the same characters read from file and typed from
> > > keyboard are different in a buffer, that's nothing else than a bug.
> > 
> > What was the file's encoding?
> 
> UTF-8.

UTF-8 is always decoded into Unicode character set, while you
originally said that characters read from a file were decoded into
Cyrillic ISO character set.  Which one is true?


* Re: Russian letters
  2006-07-06 20:12             ` Eli Zaretskii
@ 2006-07-06 20:27               ` Paul Pogonyshev
  2006-07-06 20:38                 ` Paul Pogonyshev
  2006-07-06 21:14                 ` Eli Zaretskii
  0 siblings, 2 replies; 23+ messages in thread
From: Paul Pogonyshev @ 2006-07-06 20:27 UTC (permalink / raw)


Eli Zaretskii wrote:
> > From: Paul Pogonyshev <pogonyshev@gmx.net>
> > Date: Thu, 6 Jul 2006 18:56:54 +0300
> > 
> > Eli Zaretskii wrote:
> > > > From: Paul Pogonyshev <pogonyshev@gmx.net>
> > > > Date: Thu, 6 Jul 2006 01:21:26 +0300
> > > > Cc: Andreas Schwab <schwab@suse.de>
> > > > 
> > > > > Anyway, as documented, unify-8859-on-decoding-mode can only map to
> > > > > `iso-latin-1' and `mule-unicode-0100-24ff'.
> > > > 
> > > > That's fine, but if the same characters read from file and typed from
> > > > keyboard are different in a buffer, that's nothing else than a bug.
> > > 
> > > What was the file's encoding?
> > 
> > UTF-8.
> 
> UTF-8 is always decoded into Unicode character set, while you
> originally said that characters read from a file were decoded into
> Cyrillic ISO character set.  Which one is true?

I explicitly reverted the buffer in UTF-8 (though I know it is):

	C-x RET r utf-8 RET yes RET

`describe-char' on the Cyrillic characters from the file shows this:

  character: а (3664, #o7120, #xe50, U+0430)
    charset: cyrillic-iso8859-5 (Right-Hand Part of Latin/Cyrillic Alphabet (ISO/IEC 8859-5): ISO-IR-144.)
 code point: #x50
     syntax: w 	which means: word
   category: y:Cyrillic
buffer code: #x8C #xD0
  file code: #xD0 #xB0 (encoded by coding system mule-utf-8-unix)
    display: by this font (glyph code)
     -cronyx-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO8859-5 (#xD0)

Note the file code, it is UTF-8!

Paul


* Re: Russian letters
  2006-07-06 20:27               ` Paul Pogonyshev
@ 2006-07-06 20:38                 ` Paul Pogonyshev
  2006-07-07  8:41                   ` Eli Zaretskii
  2006-07-06 21:14                 ` Eli Zaretskii
  1 sibling, 1 reply; 23+ messages in thread
From: Paul Pogonyshev @ 2006-07-06 20:38 UTC (permalink / raw)
  Cc: Eli Zaretskii

Paul Pogonyshev wrote:
> I explicitly reverted the buffer in UTF-8 (though I know it is):
> 
> 	C-x RET r utf-8 RET yes RET
> 
> `describe-char' on the Cyrillic characters from the file shows this:
> 
>   character: а (3664, #o7120, #xe50, U+0430)
>     charset: cyrillic-iso8859-5 (Right-Hand Part of Latin/Cyrillic Alphabet (ISO/IEC 8859-5): ISO-IR-144.)
>  code point: #x50
>      syntax: w 	which means: word
>    category: y:Cyrillic
> buffer code: #x8C #xD0
>   file code: #xD0 #xB0 (encoded by coding system mule-utf-8-unix)
>     display: by this font (glyph code)
>      -cronyx-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO8859-5 (#xD0)
> 
> Note the file code, it is UTF-8!

Actually, this doesn't happen in `emacs -Q', not sure why...  There
characters are decoded to `mule-unicode-0100-24ff' (and displayed as
boxes, gah.)

Paul


* Re: Russian letters
  2006-07-06 20:27               ` Paul Pogonyshev
  2006-07-06 20:38                 ` Paul Pogonyshev
@ 2006-07-06 21:14                 ` Eli Zaretskii
  2006-07-06 21:48                   ` Paul Pogonyshev
  1 sibling, 1 reply; 23+ messages in thread
From: Eli Zaretskii @ 2006-07-06 21:14 UTC (permalink / raw)
  Cc: emacs-devel

> From: Paul Pogonyshev <pogonyshev@gmx.net>
> Date: Thu, 6 Jul 2006 23:27:27 +0300
> 
> > > > What was the file's encoding?
> > > 
> > > UTF-8.
> > 
> > UTF-8 is always decoded into Unicode character set, while you
> > originally said that characters read from a file were decoded into
> > Cyrillic ISO character set.  Which one is true?
> 
> I explicitly reverted the buffer in UTF-8 (though I know it is):
> 
> 	C-x RET r utf-8 RET yes RET

This gets more and more complicated with each message.

Could you please post a short file (as a binary attachment) and a
clear recipe how you visit it and how you type Cyrillic characters in
order to reproduce the problem?

Did I understand correctly that, unlike you first said, the Cyrillic
ISO characters come from keyboard input, while the mule-unicode
characters come from a file?


* Re: Russian letters
  2006-07-06 21:14                 ` Eli Zaretskii
@ 2006-07-06 21:48                   ` Paul Pogonyshev
  2006-07-07  8:46                     ` Eli Zaretskii
  0 siblings, 1 reply; 23+ messages in thread
From: Paul Pogonyshev @ 2006-07-06 21:48 UTC (permalink / raw)


[-- Attachment #1: Type: text/plain, Size: 1384 bytes --]

Eli Zaretskii wrote:
> > From: Paul Pogonyshev <pogonyshev@gmx.net>
> > Date: Thu, 6 Jul 2006 23:27:27 +0300
> > 
> > > > > What was the file's encoding?
> > > > 
> > > > UTF-8.
> > > 
> > > UTF-8 is always decoded into Unicode character set, while you
> > > originally said that characters read from a file were decoded into
> > > Cyrillic ISO character set.  Which one is true?
> > 
> > I explicitly reverted the buffer in UTF-8 (though I know it is):
> > 
> > 	C-x RET r utf-8 RET yes RET
> 
> This gets more and more complicated with each message.
> 
> Could you please post a short file (as a binary attachment) and a
> clear recipe how you visit it and how you type Cyrillic characters in
> order to reproduce the problem?

OK, after some testing I came up with a test (I didn't find it earlier
because `customize-variable' is essential, simply setting it doesn't
work):

$ emacs -Q
M-x customize-variable RET utf-fragment-on-decoding RET
[set to t, set for current session]
C-x C-f test.text RET

Now, the characters from the file are decoded into `cyrillic-iso8859-5',
while new, typed characters are in `mule-unicode-0100-24ff'.

> Did I understand correctly that, unlike you first said, the Cyrillic
> ISO characters come from keyboard input, while the mule-unicode
> characters come from a file?

No.  Please try yourself, that way it must be easier to understand ;)

Paul

[-- Attachment #2: test.text --]
[-- Type: text/plain, Size: 102 bytes --]

АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ

Local variables:
coding: utf-8
End:


* Re: Russian letters
  2006-07-06 20:38                 ` Paul Pogonyshev
@ 2006-07-07  8:41                   ` Eli Zaretskii
  0 siblings, 0 replies; 23+ messages in thread
From: Eli Zaretskii @ 2006-07-07  8:41 UTC (permalink / raw)
  Cc: emacs-devel

> From: Paul Pogonyshev <pogonyshev@gmx.net>
> Date: Thu, 6 Jul 2006 23:38:12 +0300
> Cc: Eli Zaretskii <eliz@gnu.org>
> 
> Actually, this doesn't happen in `emacs -Q', not sure why...  There
> characters are decoded to `mule-unicode-0100-24ff' (and displayed as
> boxes, gah.)

This is expected behavior, AFAIK.  The empty boxes mean you need to
install a Unicode font that spans the Cyrillic characters.


* Re: Russian letters
  2006-07-06 21:48                   ` Paul Pogonyshev
@ 2006-07-07  8:46                     ` Eli Zaretskii
  2006-07-07 19:59                       ` Paul Pogonyshev
  0 siblings, 1 reply; 23+ messages in thread
From: Eli Zaretskii @ 2006-07-07  8:46 UTC (permalink / raw)
  Cc: emacs-devel

> From: Paul Pogonyshev <pogonyshev@gmx.net>
> Date: Fri, 7 Jul 2006 00:48:15 +0300
> 
> $ emacs -Q
> M-x customize-variable RET utf-fragment-on-decoding RET
> [set to t, set for current session]
> C-x C-f test.text RET
> 
> Now, the characters from the file are decoded into `cyrillic-iso8859-5',
> while new, typed characters are in `mule-unicode-0100-24ff'.

This is exactly what is expected.  Here's the doc string of
utf-fragment-on-decoding:

    utf-fragment-on-decoding's value is nil

    Whether or not to decode some chars in UTF-8/16 text into iso8859 charsets.
    Setting this means that the relevant Cyrillic and Greek characters are
    decoded into the iso8859 charsets rather than into
    mule-unicode-0100-24ff.  The iso8859 charsets take half as much space
    in the buffer, but using them may affect how the buffer can be re-encoded
    and may require a different input method to search for them, for instance.
    See `unify-8859-on-decoding-mode' and `unify-8859-on-encoding-mode'
    for mechanisms to make this largely transparent.

The reason why the default value is nil is precisely that most users
will not want the fragmentation, they will want the characters to
belong to a single character set.

Did you set this variable to a non-nil value in your .emacs?  If so,
how about removing that customization?  If the reason is that you
don't have Unicode fonts installed, I think installing them is a
better solution.
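
The doc string's "half as much space" claim can be checked against the
`buffer code' lines from the `describe-char' output earlier in this
thread.  A small Python sketch; `parse_buffer_code' is a hypothetical
helper, and the two byte strings are copied from that output rather than
produced by Emacs here:

```python
def parse_buffer_code(text):
    """Turn a describe-char style string such as '#x8C #xD0' into bytes."""
    return bytes(int(token[2:], 16) for token in text.split())


# "buffer code" values as reported by describe-char for the same letter.
iso_buffer = parse_buffer_code("#x8C #xD0")                # cyrillic-iso8859-5
unicode_buffer = parse_buffer_code("#x9C #xF4 #xA8 #xD0")  # mule-unicode-0100-24ff

print("cyrillic-iso8859-5 bytes per char:    ", len(iso_buffer))
print("mule-unicode-0100-24ff bytes per char:", len(unicode_buffer))
```

The space saving is real, but it comes at the cost of the charset split
between file-read and typed text that this thread is about.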


* Re: Russian letters
  2006-07-07  8:46                     ` Eli Zaretskii
@ 2006-07-07 19:59                       ` Paul Pogonyshev
  2006-07-08 12:35                         ` Eli Zaretskii
  0 siblings, 1 reply; 23+ messages in thread
From: Paul Pogonyshev @ 2006-07-07 19:59 UTC (permalink / raw)


Eli Zaretskii wrote:
> > From: Paul Pogonyshev <pogonyshev@gmx.net>
> > Date: Fri, 7 Jul 2006 00:48:15 +0300
> > 
> > $ emacs -Q
> > M-x customize-variable RET utf-fragment-on-decoding RET
> > [set to t, set for current session]
> > C-x C-f test.text RET
> > 
> > Now, the characters from the file are decoded into `cyrillic-iso8859-5',
> > while new, typed characters are in `mule-unicode-0100-24ff'.
> 
> This is exactly what is expected.  Here's the doc string of
> utf-fragment-on-decoding:
> 
>     utf-fragment-on-decoding's value is nil
> 
>     Whether or not to decode some chars in UTF-8/16 text into iso8859 charsets.
>     [...]

Why not do the same to the typed characters?  Current behavior is
inconsistent---some characters are decoded (into iso-8859 charsets),
some are not.

> The reason why the default value is nil is precisely that most users
> will not want the fragmentation, they will want the characters to
> belong to a single character set.

I understand you, but actually, most users do not bother.  Emacs should
work `out of the box' and display the characters.  Apparently, it can
show Cyrillic letters, but won't show them, uh?

Why doesn't Emacs try to decode characters on displaying?  This can
be done only once just to check if the decoded characters can be
shown normally, not as boxes.

> Did you set this variable to a non-nil value in your .emacs?  If so,
> how about removing that customization?  If the reason is that you
> don't have Unicode fonts installed, I think installing them is a
> better solution.

I use Debian Sarge which is only 1 year old.  And Emacs doesn't work
with its standard font and Cyrillic letters as is.  (Well, I didn't
try the standard package, but CVS `emacs -Q' shows boxes.)  I had enough
persistence to find the reason (here, thank you), but most users won't.
Especially since Emacs cannot even list font families (at least I don't
know how.)

Paul


* Re: Russian letters
  2006-07-07 19:59                       ` Paul Pogonyshev
@ 2006-07-08 12:35                         ` Eli Zaretskii
  2006-07-08 15:30                           ` Paul Pogonyshev
  0 siblings, 1 reply; 23+ messages in thread
From: Eli Zaretskii @ 2006-07-08 12:35 UTC (permalink / raw)
  Cc: handa, emacs-devel

> From: Paul Pogonyshev <pogonyshev@gmx.net>
> Date: Fri, 7 Jul 2006 22:59:40 +0300
> 
> >     utf-fragment-on-decoding's value is nil
> > 
> >     Whether or not to decode some chars in UTF-8/16 text into iso8859 charsets.
> >     [...]
> 
> Why not do the same to the typed characters?

Maybe it does, let's find out: how did you type those characters?  Did
you use a Leim input method (which one?), or did you type them on your
keyboard?

> Current behavior is inconsistent---some characters are decoded (into
> iso-8859 charsets), some are not.

I think it is consistent in the default configuration.

> > The reason why the default value is nil is precisely that most users
> > will not want the fragmentation, they will want the characters to
> > belong to a single character set.
> 
> I understand you, but actually, most users do not bother.  Emacs should
> work `out of the box' and display the characters.

It does work `out of the box', if you don't change the value of
utf-fragment-on-decoding.

> Why doesn't Emacs try to decode characters on displaying?

Decoding happens on input, when the characters are inserted into a
buffer, not when they are displayed.  Such insertion occurs when you
either (a) type the characters at the keyboard, or (b) visit a file,
or (c) paste them from an X selection or a clipboard, or (d) read
output of some process which interacts with Emacs.  (I hope I didn't
forget any other possibilities.)

If you describe how you typed those characters, maybe we will find a
bug that needs to be fixed.

> > Did you set this variable to a non-nil value in your .emacs?  If so,
> > how about removing that customization?  If the reason is that you
> > don't have Unicode fonts installed, I think installing them is a
> > better solution.
> 
> I use Debian Sarge which is only 1 year old.  And Emacs doesn't work
> with its standard font and Cyrillic letters as is.  (Well, I didn't
> try the standard package, but CVS `emacs -Q' shows boxes.)  I had enough
> persistence to find the reason (here, thank you), but most users won't.
> Especially since Emacs cannot even list font families (at least I don't
> know how.)

I still don't understand whether you modified the value of
utf-fragment-on-decoding or it came that way with Debian Sarge.  In
the latter case, I think it's something to complain about to Debian
maintainers.

The missing fonts are also an issue with Debian, I think.  Perhaps they
have an optional package you need to install, but since you live in a
Cyrillic locale (if I understand correctly the headers of your
message), I find it hard to believe that your system lacks Unicode
fonts that support Cyrillic characters.

If you do have these fonts installed, maybe it's yet another bug in
Emacs.  One of your prior messages showed that your locale is
en_US.utf8.  I don't know enough about font selection and fontsets;
Handa-san, could you please tell Paul what information to send in
order to find out why Unicode fonts aren't found by Emacs?


* Re: Russian letters
  2006-07-08 12:35                         ` Eli Zaretskii
@ 2006-07-08 15:30                           ` Paul Pogonyshev
  2006-07-08 16:06                             ` Eli Zaretskii
  0 siblings, 1 reply; 23+ messages in thread
From: Paul Pogonyshev @ 2006-07-08 15:30 UTC (permalink / raw)
  Cc: handa

Eli Zaretskii wrote:
> > From: Paul Pogonyshev <pogonyshev@gmx.net>
> > Date: Fri, 7 Jul 2006 22:59:40 +0300
> > 
> > >     utf-fragment-on-decoding's value is nil
> > > 
> > >     Whether or not to decode some chars in UTF-8/16 text into iso8859 charsets.
> > >     [...]
> > 
> > Why not do the same to the typed characters?
> 
> Maybe it does, let's find out: how did you type those characters?  Did
> you use a Leim input method (which one?), or did you type them on your
> keyboard?

I think it is the Leim input method `russian-computer'.  I.e. I use `C-\'
in Emacs to switch between US English and Russian keyboard layouts.

> > Current behavior is inconsistent---some characters are decoded (into
> > iso-8859 charsets), some are not.
> 
> I think it is consistent in the default configuration.

Yes, in the default.  But not if you change `utf-fragment-on-decoding',
I think.

> > > The reason why the default value is nil is precisely that most users
> > > will not want the fragmentation, they will want the characters to
> > > belong to a single character set.
> > 
> > I understand you, but actually, most users do not bother.  Emacs should
> > work `out of the box' and display the characters.
> 
> It does work `out of the box', if you don't change the value of
> utf-fragment-on-decoding.

And displays boxes in place of Russian characters (all of them.)  If
`utf-fragment-on-decoding' is non-nil, it displays characters read from
a file fine, but not the newly typed characters.

> > Why doesn't Emacs try to decode characters on displaying?
> 
> Decoding happens on input, when the characters are inserted into a
> buffer, not when they are displayed.  Such insertion occurs when you
> either (a) type the characters at the keyboard, or (b) visit a file,
> or (c) paste them from an X selection or a clipboard, or (d) read
> output of some process which interacts with Emacs.  (I hope I didn't
> forget any other possibilities.)
> 
> If you describe how you typed those characters, maybe we will find a
> bug that needs to be fixed.

Maybe I used imprecise words.  We know that Emacs can display Russian
characters if they are decoded into a national ISO charset.  The same
(conceptually, from the user's point of view) characters are shown as
boxes when they are in a UTF charset.  It should be possible to display
ranges from a UTF charset as national charsets.  I.e. if character
U+0430 is displayed as ISO-8859-5 0x50, all problems are solved.  No
matter how the characters are encoded, if they conceptually are the
same, they should be displayed using the same method, no?
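
To make the mapping concrete, here is a minimal sketch in Python using
only the standard codecs.  It models the character mapping alone, not
Emacs internals, and the helper name `to_iso8859_5_code_point' is
hypothetical.  It shows that U+0430 corresponds to ISO-8859-5 byte 0xD0,
whose low seven bits give the 0x50 code point, and that the UTF-8 file
code is D0 B0 either way.

```python
def to_iso8859_5_code_point(ch):
    """Return the 7-bit ISO-8859-5 code point for ch, or None if unmapped."""
    try:
        byte = ch.encode("iso8859_5")[0]
    except UnicodeEncodeError:
        return None  # character has no place in ISO-8859-5
    # Only the right-hand part (bytes 0xA0..0xFF) is the national charset;
    # its code points are written without the high bit.
    return byte & 0x7F if byte >= 0xA0 else None


ch = "\u0430"  # CYRILLIC SMALL LETTER A
print(hex(to_iso8859_5_code_point(ch)))  # 0x50
print(ch.encode("utf-8").hex())          # d0b0
```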

> > > Did you set this variable to a non-nil value in your .emacs?  If so,
> > > how about removing that customization?  If the reason is that you
> > > don't have Unicode fonts installed, I think installing them is a
> > > better solution.
> > 
> > I use Debian Sarge which is only 1 year old.  And Emacs doesn't work
> > with its standard font and Cyrillic letters as is.  (Well, I didn't
> > try the standard package, but CVS `emacs -Q' shows boxes.)  I had enough
> > persistence to find the reason (here, thank you), but most users won't.
> > Especially since Emacs cannot even list font families (at least I don't
> > know how.)
> 
> I still don't understand whether you modified the value of
> utf-fragment-on-decoding or it came that way with Debian Sarge.  In
> the latter case, I think it's something to complain about to Debian
> maintainers.

I think the modification of `utf-fragment-on-decoding' is a remnant of
the times I tried to solve the Russian characters problem.  Maybe it
worked then, not sure.

> The missing fonts are also an issue with Debian, I think.  Perhaps they
> have an optional package you need to install, but since you live in a
> Cyrillic locale (if I understand correctly the headers of your
> message), I find it hard to believe that your system lacks Unicode
> fonts that support Cyrillic characters.

I'll try writing to Debian.  However, Emacs does a poor job: while it
_can_ show Cyrillic characters in `adobe-courier', it does so only when
they are in a certain encoding.

Cronyx fonts do indeed support Russian characters.  However, customizing
`default' face to use cronyx-courier for some reason influences only the
current Emacs session.  Bug?

In the current session: I customize the default face to use `cronyx-courier'
and press the ``Save for Future Sessions'' button.  Cyrillic characters
are now displayed with the Cronyx font, but ASCII characters are shown with
`adobe-courier'...

A new session: all characters are shown with `adobe-courier'.  In particular,
Cyrillic characters are shown as boxes.  `.emacs' does indeed contain
`cronyx-courier', but for some reason it doesn't take effect at all...

Actually, I now see that I had this problem before and wrote about it in
``Pango-like font fallback (was Re: Russian numero sign)'' thread:

    I went to install all the fonts I could find in my Debian Sarge.  And
    found cronyx-courier font, which looks nice _and_ has Cyrillic
    characters.  However, when I customize the default face in Emacs and
    set that font family, latin characters are still displayed in
    adobe-courier (though Cyrillic ones are shown in cronyx-courier)...
    And the customization doesn't take any effect after I restart Emacs...
    Any ideas?

Kenichi Handa answered:

   Perhaps that's because you don't have
   -cronyx-courier-...-iso8859-1.  Emacs by default uses an
   iso8859-1 font for ASCII.  To change it, you must create a
   proper fontset by one of these ways: [...]

How an average user is supposed to find it is beyond me.  I discovered it
only here.

Paul


* Re: Russian letters
  2006-07-08 15:30                           ` Paul Pogonyshev
@ 2006-07-08 16:06                             ` Eli Zaretskii
  0 siblings, 0 replies; 23+ messages in thread
From: Eli Zaretskii @ 2006-07-08 16:06 UTC (permalink / raw)
  Cc: handa, emacs-devel

> From: Paul Pogonyshev <pogonyshev@gmx.net>
> Date: Sat, 8 Jul 2006 18:30:12 +0300
> Cc: handa@m17n.org
> 
> Eli Zaretskii wrote:
> > > From: Paul Pogonyshev <pogonyshev@gmx.net>
> > > Date: Fri, 7 Jul 2006 22:59:40 +0300
> > > 
> > > >     utf-fragment-on-decoding's value is nil
> > > > 
> > > >     Whether or not to decode some chars in UTF-8/16 text into iso8859 charsets.
> > > >     [...]
> > > 
> > > Why not do the same to the typed characters?
> > 
> > Maybe it does, let's find out: how did you type those characters?  Did
> > you use a Leim input method (which one?), or did you type them on your
> > keyboard?
> 
> I think it is the Leim input method `russian-computer'.  I.e. I use `C-\'
> in Emacs to switch between US English and Russian keyboard layouts.

Handa-san, should Leim obey utf-fragment-on-decoding?  I think it
should, but maybe there's some complication that prevents it.

> No matter how the characters are encoded, if they conceptually are
> the same, they should be displayed using the same method, no?

Ideally, yes.  However, this is a harsh requirement: a font assumes a
certain encoding of a character, so Emacs cannot easily use another
font if it's for a different encoding.
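
A rough Python sketch of why this is hard: the glyph index for the same
character depends on the registry the font was built for.  The function
name `glyph_code' and the registry-to-codec table are illustrative
assumptions covering only the registries mentioned in this thread; this
is not how Emacs itself computes glyph codes.

```python
# Registry -> Python codec; an assumed table for illustration only.
REGISTRY_CODECS = {"iso8859-1": "latin_1", "iso8859-5": "iso8859_5"}


def glyph_code(ch, registry):
    """Return the glyph index ch would have in a font of this registry,
    or None if such a font cannot represent ch at all."""
    registry = registry.lower()
    if registry == "iso10646-1":
        return ord(ch)  # Unicode fonts are indexed by code point
    codec = REGISTRY_CODECS.get(registry)
    if codec is None:
        return None  # registry not covered by this sketch
    try:
        return ch.encode(codec)[0]
    except UnicodeEncodeError:
        return None


a = "\u0430"  # CYRILLIC SMALL LETTER A
print(hex(glyph_code(a, "ISO10646-1")))  # 0x430
print(hex(glyph_code(a, "ISO8859-5")))   # 0xd0
print(glyph_code(a, "ISO8859-1"))        # None
```

The two non-None results match the glyph codes (#x430 and #xD0) in the
`describe-char' output from the first message of this thread.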

> Cronyx fonts do indeed support Russian characters.  However, customizing
> `default' face to use cronyx-courier for some reason influences only the
> current Emacs session.  Bug?

Probably.  I'll let Handa-san answer this.

> Actually, I now see that I had this problem before and wrote about it in
> ``Pango-like font fallback (was Re: Russian numero sign)'' thread:
> 
>     I went to install all the fonts I could find in my Debian Sarge.  And
>     found cronyx-courier font, which looks nice _and_ has Cyrillic
>     characters.  However, when I customize the default face in Emacs and
>     set that font family, latin characters are still displayed in
>     adobe-courier (though Cyrillic ones are shown in cronyx-courier)...
>     And the customization doesn't take any effect after I restart Emacs...
>     Any ideas?
> 
> Kenichi Handa answered:
> 
>    Perhaps that's because you don't have
>    -cronyx-courier-...-iso8859-1.  Emacs by default uses an
>    iso8859-1 font for ASCII.  To change it, you must create a
>    proper fontset by one of these ways: [...]
> 
> How an average user is supposed to find it is beyond me.

They shouldn't have to.  But I think Debian should add a -cronyx-courier font
for Latin-1, because without that Emacs is broken for Cyrillic
scripts.  Or maybe there's some other Unicode font that covers both
Cyrillic and Latin-1.
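
As a toy model of the situation (not Emacs's actual font selection
code), the Python sketch below reads the registry from an XLFD font name
and picks the first installed font that matches a wanted registry.  The
helper names `xlfd_registry' and `pick_font' are hypothetical; the two
font names are the ones from the `describe-char' output in the first
message of this thread.

```python
def xlfd_registry(name):
    """Return the 'registry-encoding' suffix of a full XLFD name, lowercased."""
    fields = name.split("-")
    # A full XLFD name is a leading '-' followed by 14 '-'-separated fields.
    if len(fields) != 15 or fields[0] != "":
        raise ValueError("not a full XLFD name: %r" % name)
    return (fields[13] + "-" + fields[14]).lower()


def pick_font(fonts, wanted_registry):
    """Return the first font whose registry matches, or None if none does."""
    for name in fonts:
        if xlfd_registry(name) == wanted_registry:
            return name
    return None


installed = [
    "-cronyx-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO8859-5",
    "-Adobe-Courier-Medium-R-Normal--17-120-100-100-M-100-ISO10646-1",
]
print(pick_font(installed, "iso8859-5"))  # the cronyx font
print(pick_font(installed, "iso8859-1"))  # None
```

In this model a request for an iso8859-1 font finds nothing among the
listed fonts, which is the gap Handa-san's reply points at.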


end of thread, other threads:[~2006-07-08 16:06 UTC | newest]

Thread overview: 23+ messages
2006-07-05 18:10 Russian letters Paul Pogonyshev
2006-07-05 18:19 ` Andreas Schwab
2006-07-05 21:43   ` Paul Pogonyshev
2006-07-05 22:08     ` Andreas Schwab
2006-07-05 22:21       ` Paul Pogonyshev
2006-07-05 22:55         ` Andreas Schwab
2006-07-06 15:59           ` Paul Pogonyshev
2006-07-06 16:39             ` Andreas Schwab
2006-07-06 18:17               ` Paul Pogonyshev
2006-07-06 20:11                 ` Eli Zaretskii
2006-07-06  3:41         ` Eli Zaretskii
2006-07-06 15:56           ` Paul Pogonyshev
2006-07-06 20:12             ` Eli Zaretskii
2006-07-06 20:27               ` Paul Pogonyshev
2006-07-06 20:38                 ` Paul Pogonyshev
2006-07-07  8:41                   ` Eli Zaretskii
2006-07-06 21:14                 ` Eli Zaretskii
2006-07-06 21:48                   ` Paul Pogonyshev
2006-07-07  8:46                     ` Eli Zaretskii
2006-07-07 19:59                       ` Paul Pogonyshev
2006-07-08 12:35                         ` Eli Zaretskii
2006-07-08 15:30                           ` Paul Pogonyshev
2006-07-08 16:06                             ` Eli Zaretskii
