unofficial mirror of emacs-devel@gnu.org 
* converting between charsets
@ 2006-05-07  9:52 Alexander Kotelnikov
  2006-05-07 12:43 ` Stefan Monnier
  0 siblings, 1 reply; 19+ messages in thread
From: Alexander Kotelnikov @ 2006-05-07  9:52 UTC (permalink / raw)


Hello.

After I switched to utf-8 as my basic environment encoding (on Linux)
I got need of converting some texts sometimes back to koi8-r. Typical
task here is to convert outgoing mail to persons and newsgroups
hierarchies which do not understand multibyte encodings.

Theoretically something like
(encode-coding-region (point-min) (point-max) 'koi8-r)
should work, but it does not.

There could be three different ways, which I checked, how characters
to be converted can appear in emacs buffer:
  a. when I open such file.
  b. when I type in characters and my keyboard layout in X is different
     from 'us', for me it is normally 'ru' then.
  c. when I type in after I used toggle-input-method.


And the trouble is that encode-coding-region converts only in case
(c). In (a) and (b) characters that need conversion are substituted
with question marks. And even in (c), although conversion is performed
(if, for instance, I save the file afterwards, it turns out to be in
koi8-r), the converted characters are shown in the buffer in \321 manner.

So, it will be nice to get some help on this, thanks.

Relevant lines in my ~/.emacs are:

(set-language-environment "UTF-8")
(set-terminal-coding-system 'utf-8)
(set-selection-coding-system 'utf-8)
(setq default-buffer-file-coding-system 'utf-8)
(set-input-mode (car (current-input-mode)) (nth 1 (current-input-mode)) 0)
(setq default-input-method "cyrillic-jcuken")

BTW, there are other troubles with handling characters other than the
first half of the ASCII table:

1. Paste in X (from non-Emacs to Emacs) does not work correctly. It
seems to be broken in different ways for singlebyte and multibyte.

2. With my utf-8 setup non-ascii input does not work on terminal (for
example, when emacs is run in xterm as emacs -nw) when I switch input
with system means (X keyboard layout, console input mode), instead of
toggle-input-method.

Probably, somebody can comment on this also.

Thanks once more,
-- 
Alexander Kotelnikov
Saint-Petersburg, Russia

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: converting between charsets
  2006-05-07  9:52 converting between charsets Alexander Kotelnikov
@ 2006-05-07 12:43 ` Stefan Monnier
  2006-05-07 19:40   ` Alexander Kotelnikov
  0 siblings, 1 reply; 19+ messages in thread
From: Stefan Monnier @ 2006-05-07 12:43 UTC (permalink / raw)


> After I switched to utf-8 as my basic environment encoding (on Linux)
> I got need of converting some texts sometimes back to koi8-r.  Typical
> task here is to convert outgoing mail to persons and newsgroups
> hierarchies which do not understand multibyte encodings.

Emacs always converts from/to the encoding you use.  So you don't really
need to "convert from utf-8 to koi8", when sending email because, before the
email is sent, it's not any more in utf-8 than in any other encoding (other
than the internal encoding).
I.e. all you need is to tell Emacs that when sending to newsgroups such and
such, it should use koi8 rather than utf-8.  How to do that depends on the
newsreader you're using.

> Theoretically something like
> (encode-coding-region (point-min) (point-max) 'koi8-r)
> should work, but it does not.

I don't think that's true in theory.

> Relevant lines in my ~/.emacs are:

> (set-language-environment "UTF-8")
> (set-terminal-coding-system 'utf-8)
> (set-selection-coding-system 'utf-8)
> (setq default-buffer-file-coding-system 'utf-8)
> (set-input-mode (car (current-input-mode)) (nth 1 (current-input-mode)) 0)
> (setq default-input-method "cyrillic-jcuken")

Looks like you have some problems here.  Try to remove most of the lines
(if your locale is using utf-8 already, you really don't need to do
anything at all in your .emacs).
At the very least try removing the set-selection-coding-system and
set-input-mode.

> 1. Paste in X (from non-Emacs to Emacs) does not work correctly. It
> seems to be broken in different ways for singlebyte and multibyte.

Probably caused by your set-selection-coding-system.

> 2. With my utf-8 setup non-ascii input does not work on terminal (for
> example, when emacs is run in xterm as emacs -nw) when I switch input
> with system means (X keyboard layout, console input mode), instead of
> toggle-input-method.

Could be because of your set-input-mode.


        Stefan


* Re: converting between charsets
  2006-05-07 12:43 ` Stefan Monnier
@ 2006-05-07 19:40   ` Alexander Kotelnikov
  2006-05-08  3:28     ` Stefan Monnier
  0 siblings, 1 reply; 19+ messages in thread
From: Alexander Kotelnikov @ 2006-05-07 19:40 UTC (permalink / raw)


>>>>> On Sun, 07 May 2006 08:43:56 -0400
>>>>> "SM" == Stefan Monnier <monnier@iro.umontreal.ca> wrote:
SM> 
>> After I switched to utf-8 as my basic environment encoding (on Linux)
>> I got need of converting some texts sometimes back to koi8-r.  Typical
>> task here is to convert outgoing mail to persons and newsgroups
>> hierarchies which do not understand multibyte encodings.
SM> 
SM> Emacs always converts from/to the encoding you use.  So you don't really
SM> need to "convert from utf-8 to koi8", when sending email because, before the
SM> email is sent, it's not any more in utf-8 than in any other encoding (other
SM> than the internal encoding).
SM> I.e. all you need is to tell Emacs that when sending to newsgroups such and
SM> such, it should use koi8 rather than utf-8.  How to do that depends on the
SM> newsreader you're using.

I am using Gnus, it does not have such functionality, and I am
going to implement it. So what I need is to be able to convert from
the internal representation to some code page (mostly koi8-r).

>> Theoretically something like
>> (encode-coding-region (point-min) (point-max) 'koi8-r)
>> should work, but it does not.
SM> 
SM> I don't think that's true in theory.

Why?

>> Relevant lines in my ~/.emacs are:
SM> 
>> (set-language-environment "UTF-8")
>> (set-terminal-coding-system 'utf-8)
>> (set-selection-coding-system 'utf-8)
>> (setq default-buffer-file-coding-system 'utf-8)
>> (set-input-mode (car (current-input-mode)) (nth 1 (current-input-mode)) 0)
>> (setq default-input-method "cyrillic-jcuken")
SM> 
SM> Looks like you have some problems here.  Try to remove most of the lines
SM> (if your locale is using utf-8 already, you really don't need to do
SM> anything at all in your .emacs).
SM> At the very least try removing the set-selection-coding-system and
SM> set-input-mode.
SM> 
>> 1. Paste in X (from non-Emacs to Emacs) does not work correctly. It
>> seems to be broken in different ways for singlebyte and multibyte.
SM> 
SM> Probably caused by your set-selection-coding-system.
SM> 
>> 2. With my utf-8 setup non-ascii input does not work on terminal (for
>> example, when emacs is run in xterm as emacs -nw) when I switch input
>> with system means (X keyboard layout, console input mode), instead of
>> toggle-input-method.
SM> 
SM> Could be because of your set-input-mode.

I have started emacs without ~/.emacs and evaluated 
(setq default-input-method "cyrillic-jcuken")

What I got:
1. Paste into Emacs frame works strange: a font different from the
normal one is used, and on save I am asked in the minibuffer:
Select coding system (default euc-jp)
and I am shown a *Warning* buffer with these lines:

Start of *Warning*
These default coding systems were tried:
  mule-utf-8
However, none of them safely encodes the target text.

Select one of the following safe coding systems:
  euc-jp shift_jis iso-2022-jp iso-2022-jp-2 x-ctext
  japanese-iso-7bit-1978-irv iso-2022-7bit raw-text emacs-mule
  no-conversion iso-2022-7bit-lock-ss2 ctext-no-compositions
  iso-2022-8bit-ss2 iso-2022-7bit-lock iso-2022-7bit-ss2
  tibetan-iso-8bit-with-esc thai-tis620-with-esc lao-with-esc
  korean-iso-8bit-with-esc hebrew-iso-8bit-with-esc
  greek-iso-8bit-with-esc iso-latin-9-with-esc iso-latin-8-with-esc
  iso-latin-5-with-esc iso-latin-4-with-esc iso-latin-3-with-esc
  iso-latin-2-with-esc iso-latin-1-with-esc
  in-is13194-devanagari-with-esc cyrillic-iso-8bit-with-esc
  chinese-iso-8bit-with-esc japanese-iso-8bit-with-esc
End of *Warning*

Cyrillic input in emacs -nw in xterm still does not work, if I just
change X keyboard layout.

Converting works just like before: it converts only text typed with
toggled input method.

-- 
Alexander Kotelnikov
Saint-Petersburg, Russia


* Re: converting between charsets
  2006-05-07 19:40   ` Alexander Kotelnikov
@ 2006-05-08  3:28     ` Stefan Monnier
  2006-05-08  9:39       ` Alexander Kotelnikov
  0 siblings, 1 reply; 19+ messages in thread
From: Stefan Monnier @ 2006-05-08  3:28 UTC (permalink / raw)


>>> After I switched to utf-8 as my basic environment encoding (on Linux)
>>> I got need of converting some texts sometimes back to koi8-r.  Typical
>>> task here is to convert outgoing mail to persons and newsgroups
>>> hierarchies which do not understand multibyte encodings.
SM> 
SM> Emacs always converts from/to the encoding you use.  So you don't really
SM> need to "convert from utf-8 to koi8", when sending email because, before the
SM> email is sent, it's not any more in utf-8 than in any other encoding (other
SM> than the internal encoding).
SM> I.e. all you need is to tell Emacs that when sending to newsgroups such and
SM> such, it should use koi8 rather than utf-8.  How to do that depends on the
SM> newsreader you're using.

> I am using Gnus, it does not have such functionality,

In what way does the functionality described in the node "Charsets" of the
Gnus manual fail to provide the functionality you need?

>>> Theoretically something like
>>> (encode-coding-region (point-min) (point-max) 'koi8-r)
>>> should work, but it does not.
SM> I don't think that's true in theory.
> Why?

Because it completely depends on how and when you do it.  There already is
an encoding step taking place somewhere.  So if you only add a call to
encode-coding-region somewhere you'll simply cause a double encoding to
happen which will most likely give you garbage.

So one way to do it is to take care of the encoding yourself, which may
amount to doing the whole "send" yourself (i.e. the NIH approach).  Or the
other way is to figure out how to tell the code that already does the
encoding to use koi8 rather than utf-8.
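
As a rough illustration of why an extra in-place encoding call is risky, here is a minimal sketch that leaves the buffer alone and only inspects what koi8-r encoding would produce. It assumes the text of interest is the whole current buffer and that `unencodable-char-position` and `encode-coding-string` are available in the Emacs being used; it is not what Gnus itself does when sending.

```elisp
;; Sketch only: look at what koi8-r encoding would produce, without
;; modifying the buffer and without touching the sending code.
(let ((pos (unencodable-char-position (point-min) (point-max) 'koi8-r)))
  (if pos
      ;; Some character has no koi8-r representation; show where it is.
      (message "Not encodable in koi8-r at position %d: %c"
               pos (char-after pos))
    ;; Everything is encodable; encode a copy of the text as a string.
    (message "koi8-r bytes: %S"
             (encode-coding-string
              (buffer-substring-no-properties (point-min) (point-max))
              'koi8-r))))
```

Because the result is only a string passed to `message`, the encoding step that happens later during sending still sees unencoded text, so nothing gets encoded twice.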

> I have started emacs without ~/.emacs and evaluated 
> (setq default-input-method "cyrillic-jcuken")

What's your locale?  What version of Emacs is this?

> What I got:
> 1. Paste into Emacs frame works strange:

What text did you paste?  Where does it come from?

> Cyrillic input in emacs -nw in xterm still does not work, if I just
> change X keyboard layout.

That doesn't give us much to go on, does it?  What does it do, other than
"not work"?


        Stefan


* Re: converting between charsets
  2006-05-08  3:28     ` Stefan Monnier
@ 2006-05-08  9:39       ` Alexander Kotelnikov
  2006-05-08 14:30         ` Stefan Monnier
  0 siblings, 1 reply; 19+ messages in thread
From: Alexander Kotelnikov @ 2006-05-08  9:39 UTC (permalink / raw)


>>>>> On Sun, 07 May 2006 23:28:09 -0400
>>>>> "SM" == Stefan Monnier <monnier@iro.umontreal.ca> wrote:
SM> 
>>>> After I switched to utf-8 as my basic environment encoding (on Linux)
>>>> I got need of converting some texts sometimes back to koi8-r.  Typical
>>>> task here is to convert outgoing mail to persons and newsgroups
>>>> hierarchies which do not understand multibyte encodings.
SM> 
SM> Emacs always converts from/to the encoding you use.  So you don't really
SM> need to "convert from utf-8 to koi8", when sending email because, before the
SM> email is sent, it's not any more in utf-8 than in any other encoding (other
SM> than the internal encoding).
SM> I.e. all you need is to tell Emacs that when sending to newsgroups such and
SM> such, it should use koi8 rather than utf-8.  How to do that depends on the
SM> newsreader you're using.
SM> 
>> I am using Gnus, it does not have such functionality,
SM> 
SM> In what way does the functionality described in the node "Charsets" of the
SM> Gnus manual fail to provide the functionality you need?

It fails. Its default value contains element

("^\\(fido7\\|relcom\\)\\.[^,]*\\(,[ 	\n]*\\(fido7\\|relcom\\)\\.[^,]*\\)*$" koi8-r
  (koi8-r))

and my posts to the fido7 hierarchy go out in utf-8 anyway.

>>>> Theoretically something like
>>>> (encode-coding-region (point-min) (point-max) 'koi8-r)
>>>> should work, but it does not.
SM> I don't think that's true in theory.
>> Why?
SM> 
SM> Because it completely depends on how and when you do it.  There already is
SM> an encoding step taking place somewhere.  So if you only add a call to
SM> encode-coding-region somewhere you'll simply cause a double encoding to
SM> happen which will most likely give you garbage.

Let's first talk about encoding regions. Why doesn't it work with
encode-coding-region?

As for garbage: if encoding/decoding works, I can always decode
into the internal representation and encode into the desired charset
in a send hook.

I would be happy to get an answer to the question: "How do I decode and
encode in Emacs?"

SM> So one way to do it is to take care of the encoding yourself, which may
SM> amount to doing the whole "send" yourself (i.e. the NIH approach).
SM> Or the

NIH?

SM> other way is to figure out how to tell the code that already does the
SM> encoding to use koi8 rather than utf-8.

There is no such code right now, and, probably, I will write it. But
I'll need to encode into koi8-r, which does not seem to work.

>> I have started emacs without ~/.emacs and evaluated 
>> (setq default-input-method "cyrillic-jcuken")
SM> 
SM> What's your locale?  What version of Emacs is this?
SM> 
>> What I got:
>> 1. Paste into Emacs frame works strange:
SM> 
SM> What text did you paste?  Where does it come from?

I type some Russian text in xterm and paste it into Emacs, have a look
at the attached screenshot.


[-- Attachment #2: see the difference --]
[-- Type: image/png, Size: 8634 bytes --]

SM> 
>> Cyrillic input in emacs -nw in xterm still does not work, if I just
>> change X keyboard layout.
SM> 
SM> That doesn't give us much to go on, does it?  What does it do, other than
SM> "not work"?

It beeps.

-- 
Alexander Kotelnikov
Saint-Petersburg, Russia


* Re: converting between charsets
  2006-05-08  9:39       ` Alexander Kotelnikov
@ 2006-05-08 14:30         ` Stefan Monnier
  2006-05-09  5:41           ` Alexander Kotelnikov
  0 siblings, 1 reply; 19+ messages in thread
From: Stefan Monnier @ 2006-05-08 14:30 UTC (permalink / raw)


> It fails. Its default value contains element

> ("^\\(fido7\\|relcom\\)\\.[^,]*\\(,[ 	\n]*\\(fido7\\|relcom\\)\\.[^,]*\\)*$" koi8-r
>   (koi8-r))

> and my posts to the fido7 hierarchy go out in utf-8 anyway.

[ I don't know if your setting is correct and supposed to work,
  but I'll assume it is.  ]

Then please report this as a bug with M-x report-emacs-bug (or directly to
the Gnus guys, but be sure to include the kind of information included in
M-x report-emacs-bug).

> Let's first talk about encoding regions. Why doesn't it work with
> encode-coding-region?

It works.  Any evidence that it doesn't?

> As for garbage: if encoding/decoding works, I can always decode
> into the internal representation and encode into the desired charset
> in a send hook.

Not always: decoding+encoding can't always be exact inverses of each other.
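
One direction of the problem can be sketched with a simple round trip. This is only an illustration under assumptions: the sample string is made up, and the euro sign is included because koi8-r has no code for it, so the round trip is expected to be lossy.

```elisp
;; Sketch only: encode a string to koi8-r and decode it back.
;; The sample text is hypothetical; the euro sign is there on purpose,
;; since koi8-r cannot represent it.
(let* ((text "Привет €")
       (bytes (encode-coding-string text 'koi8-r))
       (back (decode-coding-string bytes 'koi8-r)))
  (message "Round trip %s"
           (if (string= text back) "preserved the text" "changed the text")))
```

Comparing `text` and `back` is the important part: whenever they differ, a hook that blindly decodes and re-encodes has silently altered the message.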

> I would be happy to get an answer to the question: "How do I decode and
> encode in Emacs?"

It's not the right question.  The question you seem to want to ask is "how
do I change the way Emacs's package FOO encodes/decodes my object BAR?"

SM> So one way to do it is to take care of the encoding yourself, which may
SM> amount to doing the whole "send" yourself (i.e. the NIH approach).
SM> Or the

> NIH?

Not Invented Here: the typical reaction of reinventing your own wheel rather
than try to adapt the ones you're already using (but which you haven't built
yourself).

SM> other way is to figure out how to tell the code that already does the
SM> encoding to use koi8 rather than utf-8.

> There is no such code right now, and, probably, I will write it.

You complain that it uses utf-8, so somewhere a piece of code encodes the
text into utf-8.

>>> 1. Paste into Emacs frame works strange:
SM> What text did you paste?  Where does it come from?
> I type some Russian text in xterm and paste it into Emacs, have a look
> at the attached screenshot.

Oh, I see.  I don't know enough of how this works to help you much further.
If you hit C-u C-x = on the various chars (especially on two similar chars
displayed with different fonts), you'll see that they come from different
charsets (one is probably something like iso-8859-5 and the other may be
unicode).  Emacs-22 doesn't unify them by default.  You can try to put
(unify-8859-on-decoding-mode 1) in your .emacs.  And you can also try to
play with utf-fragment-on-decoding.  And ask someone more knowledgeable
about such problems.

You could even M-x report-emacs-bug about it, since maybe the default config
in a cyrillic locale should already take care of it.
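
To get the same information for a whole buffer rather than one character at a time, something like the following might help. It is a minimal sketch that assumes the suspicious text is in the current buffer; it only reports which charsets are present and does not fix anything.

```elisp
;; Sketch only: list the charsets of the characters in the current buffer.
;; Seeing both a Unicode charset and a Japanese or ISO 8859 one for what
;; looks like the same Cyrillic letters points at the unification issue.
(message "Charsets in buffer: %S"
         (find-charset-region (point-min) (point-max)))
```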

>>> Cyrillic input in emacs -nw in xterm still does not work, if I just
>>> change X keyboard layout.
SM> 
SM> That doesn't give us much to go on, does it?  What does it do, other than
SM> "not work"?

> It beeps.

What does C-h l show after hitting a particular key?


        Stefan


* Re: converting between charsets
  2006-05-08 14:30         ` Stefan Monnier
@ 2006-05-09  5:41           ` Alexander Kotelnikov
  2006-05-09 18:42             ` Stefan Monnier
  0 siblings, 1 reply; 19+ messages in thread
From: Alexander Kotelnikov @ 2006-05-09  5:41 UTC (permalink / raw)


>>>>> On Mon, 08 May 2006 10:30:48 -0400
>>>>> "SM" == Stefan Monnier <monnier@iro.umontreal.ca> wrote:
SM> 
>> Let's first talk about encoding regions. Why doesn't it work with
>> encode-coding-region?
SM> 
SM> It works.  Any evidence that it doesn't?

I started this thread with a note about problems with
encode-coding-region:

>>>>> On Sun, 07 May 2006 13:52:08 +0400
>>>>> "AK" == Alexander Kotelnikov <sacha@myxomop.com> wrote:
AK> 
AK> There could be three different ways, which I checked, how characters
AK> to be converted can appear in emacs buffer:
AK>   a. when I open such file.
AK>   b. when I type in characters and my keyboard layout in X is different
AK>      from 'us', for me it is normally 'ru' then.
AK>   c. when I type in after I used toggle-input-method.
AK> 
AK> 
AK> And the trouble is that encode-coding-region converts only in case
AK> (c). In (a) and (b) characters that need conversion are substituted
AK> with question marks. And even in (c), although conversion is performed
AK> (if, for instance, I save the file afterwards, it turns out to be in
AK> koi8-r), the converted characters are shown in the buffer in \321 manner.
AK> 
AK> So, it will be nice to get some help on this, thanks.

>>>> 1. Paste into Emacs frame works strange:
SM> What text did you paste?  Where does it come from?
>> I type some Russian text in xterm and paste it into Emacs, have a look
>> at the attached screenshot.
SM> 
SM> Oh, I see.  I don't know enough of how this works to help you much further.
SM> If you hit C-u C-x = on the various chars (especially on two similar chars
SM> displayed with different fonts), you'll see that they come from different
SM> charsets (one is probably something like iso-8859-5 and the other may be
SM> unicode).  Emacs-22 doesn't unify them by default.  You can try to put
SM> (unify-8859-on-decoding-mode 1) in your .emacs.  And you can also try to
SM> play with utf-fragment-on-decoding.  And ask someone more knowledgeable
SM> about such problems.

On first character like latin T:
  character: <I removed cyrillic character> (01212102, 332866, 0x51442)
    charset: mule-unicode-0100-24ff
	     (Unicode characters of the range U+0100..U+24FF.)
 code point: 40 66
     syntax: word
   category: y:Cyrillic  
buffer code: 0x9C 0xF4 0xA8 0xC2
  file code: 0xD0 0xA2 (encoded by coding system mule-utf-8)
       font: -monotype-courier new-medium-r-normal--13-94-99-99-m-80-iso10646-1

After the same character in the next line:
  character: <I removed cyrillic character shown with wrong font> (0151664, 54196, 0xd3b4)
    charset: japanese-jisx0208 (JISX0208.1983/1990 Japanese Kanji: ISO-IR-87)
 code point: 39 52
     syntax: word
   category: Y:Cyrillic characters of 2-byte character sets   j:Japanese  
	     |:While filling, we can break a line at this character.  
buffer code: 0x92 0xA7 0xB4
  file code: not encodable by coding system mule-utf-8
       font: -Misc-Fixed-Medium-R-Normal--14-130-75-75-C-140-JISX0208.1983-0

Something is not ok here...

SM> You could even M-x report-emacs-bug about it, since maybe the default config
SM> in a cyrillic locale should already take care of it.
SM> 
>>>> Cyrillic input in emacs -nw in xterm still does not work, if I just
>>>> change X keyboard layout.
SM> 
SM> That doesn't give us much to go on, does it?  What does it do, other than
SM> "not work"?
SM> 
>> It beeps.
SM> 
SM> What does C-h l show after hitting a particular key?

M-P M-0 C-h l

-- 
Alexander Kotelnikov
Saint-Petersburg, Russia


* Re: converting between charsets
  2006-05-09  5:41           ` Alexander Kotelnikov
@ 2006-05-09 18:42             ` Stefan Monnier
  2006-05-13 18:42               ` Alexander Kotelnikov
  0 siblings, 1 reply; 19+ messages in thread
From: Stefan Monnier @ 2006-05-09 18:42 UTC (permalink / raw)


> I started this thread with a note about problems with
> encode-coding-region:

>>>>> On Sun, 07 May 2006 13:52:08 +0400
>>>>> "AK" == Alexander Kotelnikov <sacha@myxomop.com> wrote:
AK> 
AK> There could be three different ways, which I checked, how characters
AK> to be converted can appear in emacs buffer:
AK> a. when I open such file.
AK> b. when I type in characters and my keyboard layout in X is different
AK> from 'us', for me it is normally 'ru' then.
AK> c. when I type in after I used toggle-input-method.
AK> 
AK> 
AK> And the trouble is that encode-coding-region converts only in case
AK> (c). In (a) and (b) characters that need conversion are substituted
AK> with question marks. And even in (c), although conversion is performed
AK> (if, for instance, I save the file afterwards, it turns out to be in
AK> koi8-r), the converted characters are shown in the buffer in \321 manner.
AK> 
AK> So, it will be nice to get some help on this, thanks.

Please explain why you think there is a relation between those things and
encode-coding-region.  And of course, that will involve describing where,
how, and when you call encode-coding-region.

SM> Oh, I see.  I don't know enough of how this works to help you much further.
SM> If you hit C-u C-x = on the various chars (especially on two similar chars
SM> displayed with different fonts), you'll see that they come from different
SM> charsets (one is probably something like iso-8859-5 and the other may be
SM> unicode).  Emacs-22 doesn't unify them by default.  You can try to put
SM> (unify-8859-on-decoding-mode 1) in your .emacs.  And you can also try to
SM> play with utf-fragment-on-decoding.  And ask someone more knowledgeable
SM> about such problems.

> On first character like latin T:
>   character: <I removed cyrillic character> (01212102, 332866, 0x51442)
>     charset: mule-unicode-0100-24ff
> 	     (Unicode characters of the range U+0100..U+24FF.)
>  code point: 40 66
>      syntax: word
>    category: y:Cyrillic  
> buffer code: 0x9C 0xF4 0xA8 0xC2
>   file code: 0xD0 0xA2 (encoded by coding system mule-utf-8)
>        font: -monotype-courier new-medium-r-normal--13-94-99-99-m-80-iso10646-1

> After the same character in the next line:
>   character: <I removed cyrillic character shown with wrong font> (0151664, 54196, 0xd3b4)
>     charset: japanese-jisx0208 (JISX0208.1983/1990 Japanese Kanji: ISO-IR-87)
>  code point: 39 52
>      syntax: word
>    category: Y:Cyrillic characters of 2-byte character sets   j:Japanese  
> 	     |:While filling, we can break a line at this character.  
> buffer code: 0x92 0xA7 0xB4
>   file code: not encodable by coding system mule-utf-8
>        font: -Misc-Fixed-Medium-R-Normal--14-130-75-75-C-140-JISX0208.1983-0

> Something is not ok here...

Same kind of issue as the one I mentioned.
Have you tried unify-8859-on-decoding-mode?

In any case, please report this via M-x report-emacs-bug with as many
painful details as you can come up with (i.e. describe how to reproduce the
problem starting from "emacs -Q", showing your locale, etc...).

SM> You could even M-x report-emacs-bug about it, since maybe the default config
SM> in a cyrillic locale should already take care of it.
SM> 
>>>>> Cyrillic input in emacs -nw in xterm still does not work, if I just
>>>>> change X keyboard layout.
SM> 
SM> That doesn't give us much to go on, does it?  What does it do, other than
SM> "not work"?
SM> 
>>> It beeps.
SM> 
SM> What does C-h l show after hitting a particular key?

> M-P M-0 C-h l

So when you hit that key, Emacs received M-P M-0 rather than the char you
think you sent to it.  What is your locale?


        Stefan


* Re: converting between charsets
  2006-05-09 18:42             ` Stefan Monnier
@ 2006-05-13 18:42               ` Alexander Kotelnikov
  2006-05-14  3:20                 ` Stefan Monnier
  0 siblings, 1 reply; 19+ messages in thread
From: Alexander Kotelnikov @ 2006-05-13 18:42 UTC (permalink / raw)


>>>>> On Tue, 09 May 2006 14:42:01 -0400
>>>>> "SM" == Stefan Monnier <monnier@iro.umontreal.ca> wrote:
SM> 
>> I started this thread with a note about problems with
>> encode-coding-region:
SM> 
>>>>> On Sun, 07 May 2006 13:52:08 +0400
>>>>> "AK" == Alexander Kotelnikov <sacha@myxomop.com> wrote:
AK> 
AK> There could be three different ways, which I checked, how characters
AK> to be converted can appear in emacs buffer:
AK> a. when I open such file.
AK> b. when I type in characters and my keyboard layout in X is different
AK> from 'us', for me it is normally 'ru' then.
AK> c. when I type in after I used toggle-input-method.
AK> 
AK> 
AK> And the trouble is that encode-coding-region converts only in case
AK> (c). In (a) and (b) characters that need conversion are substituted
AK> with question marks. And even in (c), although conversion is performed
AK> (if, for instance, I save the file afterwards, it turns out to be in
AK> koi8-r), the converted characters are shown in the buffer in \321 manner.
AK> 
AK> So, it will be nice to get some help on this, thanks.
SM> 
SM> Please explain why you think there is a relation between those things and
SM> encode-coding-region.  And of course, that will involve describing where,
SM> how, and when you call encode-coding-region.

I do not understand the question. I use encode-coding-region to encode
a region into a charset and some characters are not encoded, but are
substituted with question marks.

SM> Oh, I see.  I don't know enough of how this works to help you much further.
SM> If you hit C-u C-x = on the various chars (especially on two similar chars
SM> displayed with different fonts), you'll see that they come from different
SM> charsets (one is probably something like iso-8859-5 and the other may be
SM> unicode).  Emacs-22 doesn't unify them by default.  You can try to put
SM> (unify-8859-on-decoding-mode 1) in your .emacs.  And you can also try to
SM> play with utf-fragment-on-decoding.  And ask someone more knowledgeable
SM> about such problems.
SM> 
>> On first character like latin T:
>> character: <I removed cyrillic character> (01212102, 332866, 0x51442)
>> charset: mule-unicode-0100-24ff
>> (Unicode characters of the range U+0100..U+24FF.)
>> code point: 40 66
>> syntax: word
>> category: y:Cyrillic  
>> buffer code: 0x9C 0xF4 0xA8 0xC2
>> file code: 0xD0 0xA2 (encoded by coding system mule-utf-8)
>> font: -monotype-courier new-medium-r-normal--13-94-99-99-m-80-iso10646-1
SM> 
>> After the same character in the next line:
>> character: <I removed cyrillic character shown with wrong font> (0151664, 54196, 0xd3b4)
>> charset: japanese-jisx0208 (JISX0208.1983/1990 Japanese Kanji: ISO-IR-87)
>> code point: 39 52
>> syntax: word
>> category: Y:Cyrillic characters of 2-byte character sets   j:Japanese  
>> |:While filling, we can break a line at this character.  
>> buffer code: 0x92 0xA7 0xB4
>> file code: not encodable by coding system mule-utf-8
>> font: -Misc-Fixed-Medium-R-Normal--14-130-75-75-C-140-JISX0208.1983-0
SM> 
>> Something is not ok here...
SM> 
SM> Same kind of issue as the one I mentioned.
SM> Have you tried unify-8859-on-decoding-mode?

Just tried. Nothing changes.

SM> In any case, please report this via M-x report-emacs-bug with as many
SM> painful details as you can come up with (i.e. describe how to reproduce the
SM> problem starting from "emacs -Q", showing your locale, etc...).
SM> 
SM> You could even M-x report-emacs-bug about it, since maybe the default config
SM> in a cyrillic locale should already take care of it.
SM> 
>>>>> Cyrillic input in emacs -nw in xterm still does not work, if I just
>>>>> change X keyboard layout.
SM> 
SM> That doesn't give us much to go on, does it?  What does it do, other than
SM> "not work"?
SM> 
>>>> It beeps.
SM> 
SM> What does C-h l show after hitting a particular key?
SM> 
>> M-P M-0 C-h l
SM> 
SM> So when you hit that key, Emacs received M-P M-0 rather than the char you
SM> think you sent to it.  What is your locale?

22:37 pts/28 sacha@vinci:~ 1> locale
LANG=ru_RU.UTF-8
LC_CTYPE="ru_RU.UTF-8"
LC_NUMERIC=C
LC_TIME=C
LC_COLLATE="ru_RU.UTF-8"
LC_MONETARY="ru_RU.UTF-8"
LC_MESSAGES=C
LC_PAPER="ru_RU.UTF-8"
LC_NAME="ru_RU.UTF-8"
LC_ADDRESS="ru_RU.UTF-8"
LC_TELEPHONE="ru_RU.UTF-8"
LC_MEASUREMENT="ru_RU.UTF-8"
LC_IDENTIFICATION="ru_RU.UTF-8"
LC_ALL=


-- 
Alexander Kotelnikov
Saint-Petersburg, Russia


* Re: converting between charsets
  2006-05-13 18:42               ` Alexander Kotelnikov
@ 2006-05-14  3:20                 ` Stefan Monnier
  2006-05-14 17:53                   ` Alexander Kotelnikov
  0 siblings, 1 reply; 19+ messages in thread
From: Stefan Monnier @ 2006-05-14  3:20 UTC (permalink / raw)


SM> Please explain why you think there is a relation between those things and
SM> encode-coding-region.  And of course, that will involve describing where,
SM> how, and when you call encode-coding-region.

> I do not understand the question. I use encode-coding-region to encode
> a region into a charset and some characters are not encoded, but are
> substituted with question marks.

Then show us how and when you call encode-coding-region.
I.e. repeat the above but in elisp rather than English.
Assume you're explaining it to a complete idiot.

SM> Same kind of issue as the one I mentioned.
SM> Have you tried unify-8859-on-decoding-mode?

> Just tried. Nothing changes.

Then please report it via M-x report-emacs-bug.

>>> M-P M-0 C-h l
SM> 
SM> So when you hit that key, Emacs received M-P M-0 rather than the char you
SM> think you sent to it.  What is your locale?

> 22:37 pts/28 sacha@vinci:~ 1> locale
> LANG=ru_RU.UTF-8
> LC_CTYPE="ru_RU.UTF-8"
> LC_NUMERIC=C
> LC_TIME=C
> LC_COLLATE="ru_RU.UTF-8"
> LC_MONETARY="ru_RU.UTF-8"
> LC_MESSAGES=C
> LC_PAPER="ru_RU.UTF-8"
> LC_NAME="ru_RU.UTF-8"
> LC_ADDRESS="ru_RU.UTF-8"
> LC_TELEPHONE="ru_RU.UTF-8"
> LC_MEASUREMENT="ru_RU.UTF-8"
> LC_IDENTIFICATION="ru_RU.UTF-8"
> LC_ALL=

Sounds like a bug, then.  Please report it via M-x report-emacs-bug.


        Stefan


* Re: converting between charsets
  2006-05-14  3:20                 ` Stefan Monnier
@ 2006-05-14 17:53                   ` Alexander Kotelnikov
  2006-05-15  0:37                     ` Stefan Monnier
  0 siblings, 1 reply; 19+ messages in thread
From: Alexander Kotelnikov @ 2006-05-14 17:53 UTC (permalink / raw)


>>>>> On Sat, 13 May 2006 23:20:03 -0400
>>>>> "SM" == Stefan Monnier <monnier@iro.umontreal.ca> wrote:
SM> 
SM> Please explain why you think there is a relation between those things and
SM> encode-coding-region.  And of course, that will involve describing where,
SM> how, and when you call encode-coding-region.
SM> 
>> I do not understand the question. I use encode-coding-region to encode
>> a region into a charset and some characters are not encoded, but are
>> substituted with question mark.
SM> 
SM> Then show us how and when you call encode-coding-region.
SM> I.e. repeat the above, but in Elisp rather than English.
SM> Assume you're explaining it to a complete idiot.

For example. 
1. (find-file "/tmp/test.txt")
2. enter some text in Russian (after I toggled xkb layout)
3. M-: (encode-coding-region (point-min) (point-max) 'koi8-r) and
Russian characters become '?'.
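
To see which characters trigger the '?' substitution before converting,
the buffer can be scanned first. This is only a minimal sketch:
`my-unencodable-chars' is a hypothetical helper name, and the limit of 10
reported positions is an arbitrary assumption. It relies on the built-in
`unencodable-char-position'.

```elisp
(defun my-unencodable-chars (coding-system)
  "Return (POSITION . CHARACTER) pairs that CODING-SYSTEM cannot encode.
Scans the whole current buffer and reports at most 10 characters.
Hypothetical helper for illustration, not part of Emacs."
  (mapcar (lambda (pos) (cons pos (char-after pos)))
          ;; With a non-nil COUNT argument the result is a list of
          ;; buffer positions rather than a single position.
          (unencodable-char-position (point-min) (point-max)
                                     coding-system 10)))
```

Evaluating (my-unencodable-chars 'koi8-r) in the test buffer before step 3
should return nil when every character can be converted; a non-empty list
names the positions that would turn into '?'.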

-- 
Alexander Kotelnikov
Saint-Petersburg, Russia


* Re: converting between charsets
  2006-05-14 17:53                   ` Alexander Kotelnikov
@ 2006-05-15  0:37                     ` Stefan Monnier
  2006-05-15  5:55                       ` Alexander Kotelnikov
  0 siblings, 1 reply; 19+ messages in thread
From: Stefan Monnier @ 2006-05-15  0:37 UTC (permalink / raw)


SM> Then show us how and when you call encode-coding-region.
SM> I.e. repeat the above, but in Elisp rather than English.
SM> Assume you're explaining it to a complete idiot.

Please take seriously the bit about the idiot.

> For example.
> 1. (find-file "/tmp/test.txt")

How did you start Emacs?

> 2. enter some text in Russian (after I toggled xkb layout)
> 3. M-: (encode-coding-region (point-min) (point-max) 'koi8-r) and
> Russian characters become '?'.

What did you expect instead?


        Stefan


* Re: converting between charsets
  2006-05-15  0:37                     ` Stefan Monnier
@ 2006-05-15  5:55                       ` Alexander Kotelnikov
  2006-05-15  6:02                         ` Alexander Kotelnikov
  2006-05-15 14:11                         ` Stefan Monnier
  0 siblings, 2 replies; 19+ messages in thread
From: Alexander Kotelnikov @ 2006-05-15  5:55 UTC (permalink / raw)


>>>>> On Sun, 14 May 2006 20:37:21 -0400
>>>>> "SM" == Stefan Monnier <monnier@iro.umontreal.ca> wrote:
SM> 
SM> Then show us how and when you call encode-coding-region.
SM> I.e. repeat the above, but in Elisp rather than English.
SM> Assume you're explaining it to a complete idiot.
SM> 
SM> Please take seriously the bit about the idiot.
SM> 
>> For example.
>> 1. (find-file "/tmp/test.txt")
SM> 
SM> How did you start Emacs?

For example with 'emacs -q'; even if I do not suppress reading my ~/.emacs,
the result is the same.

>> 2. enter some text in Russian (after I toggled xkb layout)
>> 3. M-: (encode-coding-region (point-min) (point-max) 'koi8-r) and
>> Russian characters become '?'.
SM> 
SM> What did you expect instead?

I expect that Cyrillic characters will be encoded to their koi8-r values.
-- 
Alexander Kotelnikov
Saint-Petersburg, Russia


* Re: converting between charsets
  2006-05-15  5:55                       ` Alexander Kotelnikov
@ 2006-05-15  6:02                         ` Alexander Kotelnikov
  2006-05-15 14:11                         ` Stefan Monnier
  1 sibling, 0 replies; 19+ messages in thread
From: Alexander Kotelnikov @ 2006-05-15  6:02 UTC (permalink / raw)


>>>>> On Mon, 15 May 2006 09:55:10 +0400
>>>>> "AK" == Alexander Kotelnikov <sacha@myxomop.com> wrote:
AK> 
>>>>> On Sun, 14 May 2006 20:37:21 -0400
>>>>> "SM" == Stefan Monnier <monnier@iro.umontreal.ca> wrote:
SM> 
>>> 2. enter some text in Russian (after I toggled xkb layout)
>>> 3. M-: (encode-coding-region (point-min) (point-max) 'koi8-r) and
>>> Russian characters become '?'.
SM> 
SM> What did you expect instead?
AK> 
AK> I expect that Cyrillic characters will be encoded to their koi8-r values.

It should behave just as it does when I toggle xkb to 'fr', enter some
text in French (with accents and so on), and evaluate
(encode-coding-region (point-min) (point-max) 'iso-8859-1)
That works, I just checked!

-- 
Alexander Kotelnikov
Saint-Petersburg, Russia


* Re: converting between charsets
  2006-05-15  5:55                       ` Alexander Kotelnikov
  2006-05-15  6:02                         ` Alexander Kotelnikov
@ 2006-05-15 14:11                         ` Stefan Monnier
  2006-05-15 20:30                           ` Alexander Kotelnikov
  1 sibling, 1 reply; 19+ messages in thread
From: Stefan Monnier @ 2006-05-15 14:11 UTC (permalink / raw)


SM> Then show us how and when you call encode-coding-region.
SM> I.e. repeat the above, but in Elisp rather than English.
SM> Assume you're explaining it to a complete idiot.
SM> 
SM> Please take seriously the bit about the idiot.
SM> 
>>> For example.
>>> 1. (find-file "/tmp/test.txt")
SM> 
SM> How did you start Emacs?

> For example with 'emacs -q'; even if I do not suppress reading my ~/.emacs,
> the result is the same.

Under X or under a tty?

>>> 2. enter some text in Russian (after I toggled xkb layout)
>>> 3. M-: (encode-coding-region (point-min) (point-max) 'koi8-r) and
>>> Russian characters become '?'.
SM> What did you expect instead?
> I expect that Cyrillic characters will be encoded to their koi8-r values.

If you put the cursor on the russian chars before calling
encode-coding-region and hit C-u C-x = what does it say?

If you put the cursor on the `?' that replaced that char and hit C-u C-x =
what does it say?


        Stefan


* Re: converting between charsets
  2006-05-15 14:11                         ` Stefan Monnier
@ 2006-05-15 20:30                           ` Alexander Kotelnikov
  2006-05-16  3:50                             ` Stefan Monnier
  0 siblings, 1 reply; 19+ messages in thread
From: Alexander Kotelnikov @ 2006-05-15 20:30 UTC (permalink / raw)


>>>>> On Mon, 15 May 2006 10:11:48 -0400
>>>>> "SM" == Stefan Monnier <monnier@iro.umontreal.ca> wrote:
SM> 
SM> Then show us how and when you call encode-coding-region.
SM> I.e. repeat the above, but in Elisp rather than English.
SM> Assume you're explaining it to a complete idiot.
SM> 
SM> Please take seriously the bit about the idiot.
SM> 
>>>> For example.
>>>> 1. (find-file "/tmp/test.txt")
SM> 
SM> How did you start Emacs?
SM> 
>> For example with 'emacs -q'; even if I do not suppress reading my ~/.emacs,
>> the result is the same.
SM> 
SM> Under X or under a tty?

X

>>>> 2. enter some text in Russian (after I toggled xkb layout)
>>>> 3. M-: (encode-coding-region (point-min) (point-max) 'koi8-r) and
>>>> Russian characters become '?'.
SM> What did you expect instead?
>> I expect that Cyrillic characters will be encoded to their koi8-r values.
SM> 
SM> If you put the cursor on the russian chars before calling
SM> encode-coding-region and hit C-u C-x = what does it say?

  character: Т (01212102, 332866, 0x51442)
    charset: mule-unicode-0100-24ff
             (Unicode characters of the range U+0100..U+24FF.)
 code point: 40 66
     syntax: word
   category: y:Cyrillic  
buffer code: 0x9C 0xF4 0xA8 0xC2
  file code: 0xD0 0xA2 (encoded by coding system utf-8)
       font: -monotype-courier new-medium-r-normal--13-94-99-99-m-80-iso10646-1

SM> If you put the cursor on the `?' that replaced that char and hit C-u C-x =
SM> what does it say?

  character: ? (077, 63, 0x3f)
    charset: ascii (ASCII (ISO646 IRV))
 code point: 63
     syntax: punctuation
   category: a:ASCII   l:Latin  
buffer code: 0x3F
  file code: 0x3F (encoded by coding system utf-8)
       font: -monotype-courier new-medium-r-normal--13-94-99-99-m-80-adobe-standard

And for français I get:
  character: ç (04347, 2279, 0x8e7)
    charset: latin-iso8859-1
             (Right-Hand Part of Latin Alphabet 1 (ISO/IEC 8859-1): ISO-IR-100)
 code point: 103
     syntax: word
   category: l:Latin  
buffer code: 0x81 0xE7
  file code: 0xC3 0xA7 (encoded by coding system utf-8)
       font: -monotype-courier new-medium-r-normal--13-94-99-99-m-80-iso8859-1

After encoding (representation is \347):
  character: ç (0347, 231, 0xe7)
    charset: eight-bit-graphic (8-bit graphic char (0xA0..0xFF))
 code point: 231
     syntax: whitespace
   category:
buffer code: 0xE7
  file code: 0xE7 (encoded by coding system utf-8)
       font: -monotype-courier new-medium-r-normal--13-94-99-99-m-80-adobe-standard
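
The "file code" bytes reported above can be cross-checked from Lisp. The
following is a minimal sketch, not a transcript of the session above: it
assumes Emacs 23 or later, where a character is its Unicode code point
(the session above shows the older Mule character codes instead), and
`my-char-bytes' is a hypothetical helper name.

```elisp
;; Sketch only: assumes Emacs 23 or later (Unicode-based characters).
(defun my-char-bytes (char coding-system)
  "Return the list of bytes CODING-SYSTEM produces for CHAR.
Hypothetical helper for illustration, not part of Emacs."
  (string-to-list (encode-coding-string (string char) coding-system)))

(list (my-char-bytes #x0422 'utf-8)       ; Т: compare with 0xD0 0xA2
      (my-char-bytes #x0422 'koi8-r)      ; Т as a single koi8-r byte
      (my-char-bytes #x00E7 'utf-8)       ; ç: compare with 0xC3 0xA7
      (my-char-bytes #x00E7 'iso-8859-1)) ; ç: compare with 0xE7
```

A result of 63 (the code of '?') in any of these lists would indicate
that the character was substituted rather than converted.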
-- 
Alexander Kotelnikov
Saint-Petersburg, Russia


* Re: converting between charsets
  2006-05-15 20:30                           ` Alexander Kotelnikov
@ 2006-05-16  3:50                             ` Stefan Monnier
  2006-05-16 10:04                               ` Alexander Kotelnikov
  0 siblings, 1 reply; 19+ messages in thread
From: Stefan Monnier @ 2006-05-16  3:50 UTC (permalink / raw)


SM> If you put the cursor on the russian chars before calling
SM> encode-coding-region and hit C-u C-x = what does it say?

>   character: Т (01212102, 332866, 0x51442)
>     charset: mule-unicode-0100-24ff
>              (Unicode characters of the range U+0100..U+24FF.)
>  code point: 40 66
>      syntax: word
>    category: y:Cyrillic  
> buffer code: 0x9C 0xF4 0xA8 0xC2
>   file code: 0xD0 0xA2 (encoded by coding system utf-8)
>        font: -monotype-courier new-medium-r-normal--13-94-99-99-m-80-iso10646-1

SM> If you put the cursor on the `?' that replaced that char and hit C-u C-x =
SM> what does it say?

>   character: ? (077, 63, 0x3f)
>     charset: ascii (ASCII (ISO646 IRV))
>  code point: 63
>      syntax: punctuation
>    category: a:ASCII   l:Latin  
> buffer code: 0x3F
>   file code: 0x3F (encoded by coding system utf-8)
>        font: -monotype-courier new-medium-r-normal--13-94-99-99-m-80-adobe-standard

Hmm... with my Emacs (a recent CVS checkout), if I do

   M-: (encode-coding-string (string 332866) 'koi8-r) RET

I get "\364" rather than "?".  So either you're running an older Emacs and
the problem has been fixed, or there's something else going on that
I don't understand.
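
Note that the literal 332866 is the character code used by the Emacs
version under discussion. A rough equivalent of the same check for later
versions is sketched below, assuming Emacs 23 or newer, where characters
are Unicode code points and 332866 no longer denotes that Cyrillic
letter. `my-koi8-r-te-check' is a hypothetical name used only here.

```elisp
;; Assumption: Emacs 23 or later.  The character is given by its
;; Unicode code point U+0422 (CYRILLIC CAPITAL LETTER TE).
(defun my-koi8-r-te-check ()
  "Report whether U+0422 encodes to the koi8-r byte #xF4 (octal 364).
Hypothetical helper for illustration, not part of Emacs."
  (let ((encoded (encode-coding-string (string #x0422) 'koi8-r)))
    (if (equal (string-to-list encoded) '(#xF4))
        'encoded-as-koi8-r
      ;; Anything else, including a lone ?? (63), means the
      ;; conversion did not produce the expected byte.
      'substituted-or-unexpected)))
```

Evaluating (my-koi8-r-te-check) with M-: gives a quick yes/no answer
without reading octal escapes in the echo area.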


        Stefan


* Re: converting between charsets
  2006-05-16  3:50                             ` Stefan Monnier
@ 2006-05-16 10:04                               ` Alexander Kotelnikov
  2006-05-17 15:20                                 ` Stefan Monnier
  0 siblings, 1 reply; 19+ messages in thread
From: Alexander Kotelnikov @ 2006-05-16 10:04 UTC (permalink / raw)


>>>>> On Mon, 15 May 2006 23:50:50 -0400
>>>>> "SM" == Stefan Monnier <monnier@iro.umontreal.ca> wrote:
SM> 
SM> Hmm... with my Emacs (a recent CVS checkout), if I do
SM> 
SM>    M-: (encode-coding-string (string 332866) 'koi8-r) RET
SM> 
SM> I get "\364" rather than "?".  So either you're running an older Emacs and
SM> the problem has been fixed, or there's something else going on that
SM> I don't understand.

Ah. I tried it with Debian's emacs-snapshot and:
1. encode-coding-* work.
2. Russian input on console also works.
3. Russian paste in X still does not work.

Thanks,
-- 
Alexander Kotelnikov
Saint-Petersburg, Russia


* Re: converting between charsets
  2006-05-16 10:04                               ` Alexander Kotelnikov
@ 2006-05-17 15:20                                 ` Stefan Monnier
  0 siblings, 0 replies; 19+ messages in thread
From: Stefan Monnier @ 2006-05-17 15:20 UTC (permalink / raw)


SM> Hmm... with my Emacs (a recent CVS checkout), if I do
SM> 
SM> M-: (encode-coding-string (string 332866) 'koi8-r) RET
SM> 
SM> I get "\364" rather than "?".  So either you're running an older Emacs and
SM> the problem has been fixed, or there's something else going on that
SM> I don't understand.

> Ah. I tried it with Debian's emacs-snapshot and:
> 1. encode-coding-* work.
> 2. Russian input on console also works.
> 3. Russian paste in X still does not work.

Good.  I don't know enough about how X copy/pasting works, so please report
this via M-x report-emacs-bug so that someone else can take a look at it.


        Stefan
