unofficial mirror of emacs-devel@gnu.org 
* Reporting UTF-8 related problems?
@ 2002-07-28 16:14 Karl Eichwalder
  2002-07-28 18:23 ` Eli Zaretskii
  2002-07-28 18:26 ` Eli Zaretskii
  0 siblings, 2 replies; 21+ messages in thread
From: Karl Eichwalder @ 2002-07-28 16:14 UTC (permalink / raw)
  Cc: Kenichi Handa

Is it useful to report UTF-8 related problems (Emacs CVS version, from
trunk)?

Cut-and-paste via X selection shows issues.

Visit http://www.textkritik.de/bka/dokumente/dok_k/koepke1.htm with
lynx running in a UTF-8 xterm (the page is in German).  You should be
able to see ,,Die Familie Schroffenstein'' and ,,Amphitryon'' properly
quoted (double left quotes at the line bottom).

Yanking in Emacs with the mouse results in

    ^[%Gâž ž¾^[%@Die Familie Schroffenstein^[$(B!H

Doing the same from the Gnome web browser Galeon results in

    ?Die Familie Schroffenstein?

(literal question marks instead of Unicode quotes).  BTW, pasting from
Galeon into a UTF-8 xterm works without problems.

The Emacs buffer is UTF-8 enabled, and I also tried setting the coding
system for the X selection accordingly (C-x RET x utf-8 RET).

-- 
ke@suse.de (work) / keichwa@gmx.net (home):              |
http://www.suse.de/~ke/                                  |      ,__o
Free Translation Project:                                |    _-\_<,
http://www.iro.umontreal.ca/contrib/po/HTML/             |   (*)/'(*)


* Re: Reporting UTF-8 related problems?
  2002-07-28 16:14 Reporting UTF-8 related problems? Karl Eichwalder
@ 2002-07-28 18:23 ` Eli Zaretskii
  2002-07-28 18:26 ` Eli Zaretskii
  1 sibling, 0 replies; 21+ messages in thread
From: Eli Zaretskii @ 2002-07-28 18:23 UTC (permalink / raw)
  Cc: emacs-devel, handa

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 988 bytes --]

> From: Karl Eichwalder <keichwa@gmx.net>
> Date: Sun, 28 Jul 2002 18:14:35 +0200
> 
> Yanking in Emacs with the mouse results in
> 
>     ^[%G\x7fž\x7fž\x7f^[%@Die Familie Schroffenstein^[$(B!H
> 
> Doing the same from the Gnome web browser Galeon results in
> 
>     ?Die Familie Schroffenstein?
> 
> (literal question marks instead of Unicode quotes).  BTW, there is no
> problem to paste from Galeon into an UTF-8 xterm.
> 
> The Emacs buffer is UTF-8 enable and I also tried to set the Coding
> system for X selection accordingly (C-x RET x utf-8 RET).
> 
> -- 
> ke@suse.de (work) / keichwa@gmx.net (home):              |
> http://www.suse.de/~ke/                                  |      ,__o
> Free Translation Project:                                |    _-\_<,
> http://www.iro.umontreal.ca/contrib/po/HTML/             |   (*)/'(*)


* Re: Reporting UTF-8 related problems?
  2002-07-28 16:14 Reporting UTF-8 related problems? Karl Eichwalder
  2002-07-28 18:23 ` Eli Zaretskii
@ 2002-07-28 18:26 ` Eli Zaretskii
  2002-07-29  5:18   ` Kenichi Handa
  2002-07-29 17:29   ` Richard Stallman
  1 sibling, 2 replies; 21+ messages in thread
From: Eli Zaretskii @ 2002-07-28 18:26 UTC (permalink / raw)
  Cc: emacs-devel, handa

> From: Karl Eichwalder <keichwa@gmx.net>
> Date: Sun, 28 Jul 2002 18:14:35 +0200
> 
> Cut-and-paste via X selection shows issues.
> 
> Visit http://www.textkritik.de/bka/dokumente/dok_k/koepke1.htm with
> lynx running in a UTF-8 xterm (the page is in German).

The telltale ESC % sequence is the beginning of the ``extended
segment'' in ICCCM parlance.  Emacs doesn't currently support UTF-8
in the extended segments, but adding that support should be easy, I'd
think.  See ctext-post-read-conversion and ctext-pre-write-conversion
defined in mule.el.


* Re: Reporting UTF-8 related problems?
  2002-07-28 18:26 ` Eli Zaretskii
@ 2002-07-29  5:18   ` Kenichi Handa
  2002-07-29  5:37     ` Kenichi Handa
  2002-07-29 15:35     ` Karl Eichwalder
  2002-07-29 17:29   ` Richard Stallman
  1 sibling, 2 replies; 21+ messages in thread
From: Kenichi Handa @ 2002-07-29  5:18 UTC (permalink / raw)
  Cc: keichwa, emacs-devel

In article <2110-Sun28Jul2002212621+0300-eliz@is.elta.co.il>, "Eli Zaretskii" <eliz@is.elta.co.il> writes:
>>  From: Karl Eichwalder <keichwa@gmx.net>
>>  Date: Sun, 28 Jul 2002 18:14:35 +0200
>>  
>>  Cut-and-paste via X selection shows issues.
>>  
>>  Visit http://www.textkritik.de/bka/dokumente/dok_k/koepke1.htm with
>>  lynx running in a UTF-8 xterm (the page is in German).

> The telltale ESC % sequence is the beginning of the ``extended
> segment'' in ICCCM parlance.  Emacs doesn't currently support UTF-8
> in the extended segments, but adding that support should be easy, I'd
> think.  See ctext-post-read-conversion and ctext-pre-write-conversion
> defined on mule.el.

The reported escape sequence is "ESC % G ... ESC % @", which
is not an extended segment of CTEXT (described in section 6
of the ctext document) but the special sequence for UTF-8
(described in the newly inserted section 7 of the ctext
document distributed with XFree86; I'll attach it).

I've just committed a change to ctext-post-read-conversion
(in mule.el).  Could you please try it?  Eli, could you also
check my change?

> Doing the same from the Gnome web browser Galeon results in

>     ?Die Familie Schroffenstein?

> (literal question marks instead of Unicode quotes).  BTW, there is no
> problem to paste from Galeon into an UTF-8 xterm.

I tried the latest Galeon.  It sends Emacs the same byte
sequence as lynx does.

---
Ken'ichi HANDA
handa@etl.go.jp


* Re: Reporting UTF-8 related problems?
  2002-07-29  5:18   ` Kenichi Handa
@ 2002-07-29  5:37     ` Kenichi Handa
  2002-07-29 15:35     ` Karl Eichwalder
  1 sibling, 0 replies; 21+ messages in thread
From: Kenichi Handa @ 2002-07-29  5:37 UTC (permalink / raw)
  Cc: eliz, keichwa, emacs-devel

In article <200207290518.OAA04004@etlken.m17n.org>, Kenichi Handa <handa@etl.go.jp> writes:
> The reported escape sequence is "ESC % G ... ESC % @" which
> is not the extended segments of CTEXT (described in the
> section 6 of the ctext document), but the special sequence
> for utf-8 (described in the newly inserted section 7 of the
> ctext document distributed with XFree86, I'll attach it).

Oops, I forgot to attach it.  Here it is.

---
Ken'ichi HANDA
handa@etl.go.jp

7.  The UTF-8 encoding

Unicode characters that are not contained in one of the
approved standard encodings can be encoded using the UTF-8
encoding. The following escape sequences are used:


     01/11 02/05 04/07	 switch into UTF-8 mode
     01/11 02/05 04/00	 return from UTF-8 mode


The first is the ISO registered sequence for UTF-8 (ISO-
IR-196), the second is the ISO-2022 ``standard return''
sequence. While in UTF-8 mode, the UTF-8 encoding replaces
the currently designated GL and GR encodings. After return
from UTF-8 mode, the previously designated GL and GR encod-
ings are reactivated.

[This is the only ``other coding system'' used in Compound
Text.]

[This is an XFree86 extension introduced in XFree86 4.0.2.]
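
As a rough illustration of the section quoted above, the following Python
sketch picks out "ESC % G ... ESC % @" spans and decodes them as UTF-8.  It
is not the code in mule.el.  The function name, the sample bytes and the
Latin-1 fallback for text outside UTF-8 mode are assumptions made for this
example, and all other Compound Text escape sequences are ignored.

```python
ESC = b"\x1b"
UTF8_START = ESC + b"%G"  # 01/11 02/05 04/07: switch into UTF-8 mode
UTF8_END = ESC + b"%@"    # 01/11 02/05 04/00: return from UTF-8 mode


def decode_utf8_segments(data: bytes, fallback: str = "latin-1") -> str:
    """Decode ESC % G ... ESC % @ spans as UTF-8, everything else with `fallback`.

    Simplification: other escape sequences (designations, extended
    segments) are not interpreted at all.
    """
    out = []
    pos = 0
    while pos < len(data):
        start = data.find(UTF8_START, pos)
        if start == -1:
            # No further UTF-8 span: the rest is ordinary text.
            out.append(data[pos:].decode(fallback))
            break
        out.append(data[pos:start].decode(fallback))
        body = start + len(UTF8_START)
        end = data.find(UTF8_END, body)
        if end == -1:
            # Unterminated span: treat the remainder as UTF-8.
            out.append(data[body:].decode("utf-8", errors="replace"))
            break
        out.append(data[body:end].decode("utf-8", errors="replace"))
        pos = end + len(UTF8_END)
    return "".join(out)


# Hypothetical sample: U+201E sent in UTF-8 mode, then plain ASCII.
sample = b"\x1b%G\xe2\x80\x9e\x1b%@Die Familie Schroffenstein"
print(decode_utf8_segments(sample))
```

The sample carries only the opening quote; a real selection may mix several
UTF-8 spans with other designations, which this sketch would pass through
undecoded.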


* Re: Reporting UTF-8 related problems?
  2002-07-29  5:18   ` Kenichi Handa
  2002-07-29  5:37     ` Kenichi Handa
@ 2002-07-29 15:35     ` Karl Eichwalder
  2002-07-30  5:22       ` Kenichi Handa
  1 sibling, 1 reply; 21+ messages in thread
From: Karl Eichwalder @ 2002-07-29 15:35 UTC (permalink / raw)
  Cc: eliz, emacs-devel

Kenichi Handa <handa@etl.go.jp> writes:

> I've just committed a change to ctext-post-read-conversion
> (in mule.el).

Thanks, it mostly works for me.  When I yank the phrase into Emacs using
the mouse, the right double quote becomes 2 letters wide (it looks like
there is a space just before the quote):

Char: “ (0150310, 53448, 0xd0c8) point=309 of 321 (96%) column 12 

Another issue is you cannot kill and yank the quotation mark without
marking it first.

>> Doing the same from the Gnome web browser Galeon results in
>
>>     ?Die Familie Schroffenstein?
>
>> (literal question marks instead of Unicode quotes).  BTW, there is no
>> problem to paste from Galeon into an UTF-8 xterm.
>
> I tried the latest Galeon.  It sends Emacs the same byte
> sequence as what lynx does.

Still wondering why Emacs treats the quotes coming from Galeon differently?

-- 
ke@suse.de (work) / keichwa@gmx.net (home):              |
http://www.suse.de/~ke/                                  |      ,__o
Free Translation Project:                                |    _-\_<,
http://www.iro.umontreal.ca/contrib/po/HTML/             |   (*)/'(*)


* Re: Reporting UTF-8 related problems?
  2002-07-28 18:26 ` Eli Zaretskii
  2002-07-29  5:18   ` Kenichi Handa
@ 2002-07-29 17:29   ` Richard Stallman
  1 sibling, 0 replies; 21+ messages in thread
From: Richard Stallman @ 2002-07-29 17:29 UTC (permalink / raw)
  Cc: keichwa, emacs-devel, handa

    The telltale ESC % sequence is the beginning of the ``extended
    segment'' in ICCCM parlance.  Emacs doesn't currently support UTF-8
    in the extended segments, but adding that support should be easy, I'd
    think.

If someone would like to work on this, can you give more detailed
advice?


* Re: Reporting UTF-8 related problems?
  2002-07-29 15:35     ` Karl Eichwalder
@ 2002-07-30  5:22       ` Kenichi Handa
  2002-07-30  6:01         ` Karl Eichwalder
  0 siblings, 1 reply; 21+ messages in thread
From: Kenichi Handa @ 2002-07-30  5:22 UTC (permalink / raw)
  Cc: eliz, emacs-devel

In article <sh65yy35te.fsf@tux.gnu.franken.de>, Karl Eichwalder <keichwa@gmx.net> writes:
> Kenichi Handa <handa@etl.go.jp> writes:
>>  I've just committed a change to ctext-post-read-conversion
>>  (in mule.el).

> Thanks it mostly works for me.  When I yank the phrase into Emacs using
> the mouse, the right double quote becomes 2 letters wide (it look like a
> space just before the quote:

> Char: “ (0150310, 53448, 0xd0c8) point=309 of 321 (96%) column 12 

This is because Emacs received this byte sequence:
	ESC $ ( B ! H
"ESC $ ( B" is a designation sequence for jisx0208,
and the following two bytes "! H" specify the above
Japanese symbol.
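
To see which character the two bytes "! H" select in jisx0208, here is a
small Python sketch.  It relies on the fact that EUC-JP is the same
two-byte code with the high bit set on each byte; the helper name is made
up for this example and nothing here is taken from the Emacs sources.

```python
import unicodedata


def decode_jisx0208_pair(b1: int, b2: int) -> str:
    """Decode one 7-bit JIS X 0208 byte pair via its EUC-JP form (high bits set)."""
    return bytes([b1 | 0x80, b2 | 0x80]).decode("euc_jp")


ch = decode_jisx0208_pair(ord("!"), ord("H"))
print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# East Asian width class; "A" means ambiguous, i.e. wide in CJK contexts.
print(unicodedata.east_asian_width(ch))
```

So the sender chose a Japanese charset for what is, in Unicode terms, an
ordinary left double quotation mark, which is consistent with the
double-width display reported above.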

This is a problem of lynx and galeon (or some core part of
gnome, I don't know).

> Another issue is you cannot kill and yank the quotation mark without
> marking it first.

I don't understand what you mean.  "kill and yank" from
where to where?  What is the meaning of "marking it"?

>>>  Doing the same from the Gnome web browser Galeon results in
>> 
>>>      ?Die Familie Schroffenstein?
>> 
>>>  (literal question marks instead of Unicode quotes).  BTW, there is no
>>>  problem to paste from Galeon into an UTF-8 xterm.
>> 
>>  I tried the latest Galeon.  It sends Emacs the same byte
>>  sequence as what lynx does.

> Still wondering why Emacs treats the quotes coming from Galeon differently?

No.  I think your Galeon actually sent `?' to Emacs.  My
Galeon (ver.1.2.5) sends "ESC % G ... ESC % @".

The ICCCM document distributed with XFree86 contains this
paragraph (which doesn't exist in X.V11R6's document):

----------------------------------------------------------------------
UTF8_STRING as a type or a target specifies an UTF-8 encoded
string, with NEWLINE (U+000A, hex 0A) as end-of-line marker.
----------------------------------------------------------------------

What I suspect is that UTF-8 xterm asks Galeon to send
selection-data by UTF8_STRING (not by TEXT as Emacs does).
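
The suspicion above amounts to the requestor preferring one target type
over another.  A minimal Python sketch of such a preference follows.  The
preference order and the function name are assumptions for illustration
only; they are not taken from xterm, Galeon or Emacs.

```python
# Assumed preference order, most specific text target first.
PREFERRED_TARGETS = ["UTF8_STRING", "COMPOUND_TEXT", "TEXT", "STRING"]


def choose_target(offered):
    """Return the most preferred target the selection owner offers, or None."""
    offered_set = set(offered)
    for target in PREFERRED_TARGETS:
        if target in offered_set:
            return target
    return None


# A requestor that knows UTF8_STRING gets raw UTF-8 ...
print(choose_target(["STRING", "TEXT", "UTF8_STRING"]))
# ... while one that only knows the older targets falls back to TEXT.
print(choose_target(["STRING", "TEXT"]))
```

Under this model, two requestors can receive different bytes from the same
owner for the same selection, depending only on the target they ask for.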


---
Ken'ichi HANDA
handa@etl.go.jp


* Re: Reporting UTF-8 related problems?
  2002-07-30  5:22       ` Kenichi Handa
@ 2002-07-30  6:01         ` Karl Eichwalder
  2002-07-30  7:11           ` Kenichi Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Karl Eichwalder @ 2002-07-30  6:01 UTC (permalink / raw)
  Cc: eliz, emacs-devel

Kenichi Handa <handa@etl.go.jp> writes:

>> Char: “ (0150310, 53448, 0xd0c8) point=309 of 321 (96%) column 12 
>
> This is because Emacs received this byte sequence:
> 	ESC $ ( B ! H
> "ESC $ ( B" is a designation sequence for jisx0208, 
> and the following two bytes "! H" specifies the above
> Japanese symbol.

Originally, it was the "right double quote raising" and not meant to be
a special Japanese symbol ;)

> This is a problem of lynx and galeon (or some core part of
> gnome, I don't know).

I will keep an eye on it.

> I don't understand what do you mean.  "kill and yank" from
> where to where?  What is the meaning of "marking it"?

Sorry.  This time: from Emacs to Emacs.  I assumed you could C-d the
current letter and yank it back (C-y).  My assumption was wrong.  C-d
just deletes; thus C-y cannot yank it back.

> No.  I think your Galeon actually sent `?' to Emacs.  My
> Galeon (ver.1.2.5) sends "ESC % G ... ESC % @".

Oops, my Galeon is outdated.

> What I suspect is that UTF-8 xterm asks Galeon to send
> selection-data by UTF8_STRING (not by TEXT as Emacs does).

Sounds convincing.  Thanks.

-- 
ke@suse.de (work) / keichwa@gmx.net (home):              |
http://www.suse.de/~ke/                                  |      ,__o
Free Translation Project:                                |    _-\_<,
http://www.iro.umontreal.ca/contrib/po/HTML/             |   (*)/'(*)


* Re: Reporting UTF-8 related problems?
  2002-07-30  6:01         ` Karl Eichwalder
@ 2002-07-30  7:11           ` Kenichi Handa
  2002-07-30  7:57             ` Andreas Schwab
  2002-07-30 18:58             ` Karl Eichwalder
  0 siblings, 2 replies; 21+ messages in thread
From: Kenichi Handa @ 2002-07-30  7:11 UTC (permalink / raw)
  Cc: eliz, emacs-devel

In article <shznw9eotw.fsf@tux.gnu.franken.de>, Karl Eichwalder <keichwa@gmx.net> writes:
> Kenichi Handa <handa@etl.go.jp> writes:
>>>  Char: “ (0150310, 53448, 0xd0c8) point=309 of 321 (96%) column 12 
>> 
>>  This is because Emacs received this byte sequence:
>>  	ESC $ ( B ! H
>>  "ESC $ ( B" is a designation sequence for jisx0208, 
>>  and the following two bytes "! H" specifies the above
>>  Japanese symbol.

> Originally, it was the "right double quote raising" and not meant to be
> a special Japanese symbol ;)

I checked the contents of the html file itself and found this:

	&#132;Die Familie Schroffenstein&#147

I thought that the notation &#NUMBER is for transmitting the
Unicode character with code NUMBER.  But 132 and 147 are
control codes in Unicode, not any kind of quotation mark.  Do you
know of a web page that properly describes their meaning?

> Sorry.  This time: from Emacs to Emacs.  I assumed, you can C-d the
> current letter and yank it back (C-y).  My assumptions is wrong.  C-d
> just deletes; thus C-y cannot yank it back.

That's a general feature of Emacs.  C-d DELETEs a
character; it does not KILL it.  C-y can yank only what was killed.
The Emacs info node "Deletion and Killing" explains the
difference in detail.

---
Ken'ichi HANDA
handa@etl.go.jp


* Re: Reporting UTF-8 related problems?
  2002-07-30  7:11           ` Kenichi Handa
@ 2002-07-30  7:57             ` Andreas Schwab
  2002-07-30  8:30               ` Kenichi Handa
  2002-07-30 18:58             ` Karl Eichwalder
  1 sibling, 1 reply; 21+ messages in thread
From: Andreas Schwab @ 2002-07-30  7:57 UTC (permalink / raw)
  Cc: keichwa, eliz, emacs-devel

Kenichi Handa <handa@etl.go.jp> writes:

|> In article <shznw9eotw.fsf@tux.gnu.franken.de>, Karl Eichwalder <keichwa@gmx.net> writes:
|> > Kenichi Handa <handa@etl.go.jp> writes:
|> >>>  Char: “ (0150310, 53448, 0xd0c8) point=309 of 321 (96%) column 12 
|> >> 
|> >>  This is because Emacs received this byte sequence:
|> >>  	ESC $ ( B ! H
|> >>  "ESC $ ( B" is a designation sequence for jisx0208, 
|> >>  and the following two bytes "! H" specifies the above
|> >>  Japanese symbol.
|> 
|> > Originally, it was the "right double quote raising" and not meant to be
|> > a special Japanese symbol ;)
|> 
|> I checked the contents of the html file itself and found this:
|> 
|> 	&#132;Die Familie Schroffenstein&#147
|> 
|> I thought that the notation &#NUMBER is for transmitting
|> Unicode character of code NUMBER.  But, 132 and 147 are
|> control codes in Unicode, not any kind of quotings.  Do you
|> know a proper web page describing the meaning of them?

The numbers are supposed to be ISO 8859-1 character codes.  I'd guess the
page has been written with some broken (a.k.a. W*nd*ws) software (the use
of *.htm makes this apparent).  There is no hope of it being compliant with
any standard.  I tried to validate it through the W3.org validator, but no
document type matches.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux AG, Deutschherrnstr. 15-19, D-90429 Nürnberg
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


* Re: Reporting UTF-8 related problems?
  2002-07-30  7:57             ` Andreas Schwab
@ 2002-07-30  8:30               ` Kenichi Handa
  0 siblings, 0 replies; 21+ messages in thread
From: Kenichi Handa @ 2002-07-30  8:30 UTC (permalink / raw)
  Cc: keichwa, eliz, emacs-devel

In article <je65yxwsve.fsf@sykes.suse.de>, Andreas Schwab <schwab@suse.de> writes:
> |> 	&#132;Die Familie Schroffenstein&#147
> |> 
> |> I thought that the notation &#NUMBER is for transmitting
> |> Unicode character of code NUMBER.  But, 132 and 147 are
> |> control codes in Unicode, not any kind of quotings.  Do you
> |> know a proper web page describing the meaning of them?

> The numbers are supposed to be ISO 8859-1 characters codes.  I'd guess the
> page has been written with some broken (a.k.a. W*nd*ws) software (the use
> of *.htm makes this apparent).  There is no hope for being compliant to
> any standard.  I tried to validate it through the W3.org validator, but no
> document type matches.

Ah, I see.  I found that windows-125X maps 132 and 147 to
U+201E and U+201C.  So, perhaps those systems (galeon and
lynx) parse them as U+201E and U+201C.  Anyway, how to
encode them in X selection is their problem and Emacs can't
do anything about it.
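
The two readings of the numbers 132 and 147 can be compared with a few
lines of Python.  This is only a sketch of the mapping discussed above,
using Python's "cp1252" codec for windows-1252; it says nothing about how
lynx or Galeon actually parse the page.

```python
import unicodedata

for code in (132, 147):
    as_unicode = chr(code)                      # read as a Unicode code point
    as_cp1252 = bytes([code]).decode("cp1252")  # read as a windows-1252 byte
    print(
        code,
        unicodedata.category(as_unicode),       # "Cc" means control character
        f"U+{ord(as_cp1252):04X}",
        unicodedata.name(as_cp1252),
    )
```

Read as Unicode code points, both numbers are C1 control characters; read
as windows-1252 bytes, they are the low and high German quotation marks.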

---
Ken'ichi HANDA
handa@etl.go.jp


* Re: Reporting UTF-8 related problems?
  2002-07-30  7:11           ` Kenichi Handa
  2002-07-30  7:57             ` Andreas Schwab
@ 2002-07-30 18:58             ` Karl Eichwalder
  2002-07-30 19:51               ` Karl Eichwalder
                                 ` (2 more replies)
  1 sibling, 3 replies; 21+ messages in thread
From: Karl Eichwalder @ 2002-07-30 18:58 UTC (permalink / raw)
  Cc: eliz, emacs-devel, Andreas Schwab

Kenichi Handa <handa@etl.go.jp> writes:

> 	&#132;Die Familie Schroffenstein&#147
>
> I thought that the notation &#NUMBER is for transmitting
> Unicode character of code NUMBER.  But, 132 and 147 are
> control codes in Unicode, not any kind of quotings.

&#NUMBERs are so-called "character references"; the SGML declaration
defines which are allowed.  For HTML you must consult the html.d[e]?cl
file.  The crucial section is (HTML 2):

     BASESET   "ISO Registration Number 100//CHARSET
                ECMA-94 Right Part of
                Latin Alphabet Nr. 1//ESC 2/13 4/1"

         DESCSET  128  32   UNUSED
                  160  96    32

This basically means: &#128 to &#159 are unused.  The same applies to
HTML 4 (and later to XML and XHTML):

          BASESET  "ISO Registration Number 177//CHARSET
                    ISO/IEC 10646-1:1993 UCS-4 with
                    implementation level 3//ESC 2/5 2/15 4/6"
         DESCSET 0       9       UNUSED
                 9       2       9
                 11      2       UNUSED
                 13      1       13
                 14      18      UNUSED
                 32      95      32
                 127     1       UNUSED
                 128     32      UNUSED
                 [...]

To make the SGML parser happy you can provide a changed declaration:

          BASESET  "ISO Registration Number 177//CHARSET
                    ISO/IEC 10646-1:1993 UCS-4 with
                    implementation level 3//ESC 2/5 2/15 4/6"
         DESCSET 0       9       UNUSED
                 9       2       9
                 11      2       UNUSED
                 13      1       13
                 14      18      UNUSED
                 32      95      32
                 127     1       UNUSED
                 128     4      UNUSED
                 132     1      "My rising double quote left (low)"
                 133     14     UNUSED
                 147     1      "My rising double quote right (high)"
                 148     16     UNUSED
                 [...]

Untested, and the result is invalid HTML.  If they sent a
proper HTTP header, it could be okay:

Content-Type: text/html; charset=windows-1252


Andreas Schwab <schwab@suse.de> writes:

> The numbers are supposed to be ISO 8859-1 characters codes.  I'd guess the
> page has been written with some broken (a.k.a. W*nd*ws) software (the use
> of *.htm makes this apparent).

Yes, they have "interesting" guidelines online...

Kenichi Handa <handa@etl.go.jp> writes:

> Ah, I see.  I found that windows-125X maps 132 and 147 to
> U+201E and U+201C.  So, perhaps those systems (galeon and
> lynx) parse them as U+201E and U+201C.  Anyway, how to
> encode them in X selection is their problem and Emacs can't
> do anything about it.

Yes, but once in the X selection I'd like to see Emacs honor them.

The spacing problem also occurs when I try to cut and paste from Markus
Kuhn's demo file
(http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt):

• ‚deutsche „Anführungszeichen“

When I insert (C-x RET c utf-8 RET C-x C-f UTF-8-demo.txt RET), things
are correctly displayed (the characters are different):

• ‚deutsche‘ „Anführungszeichen“

Cutting and pasting both these examples from Emacs (this mail buffer) to a
UTF-8 xterm doesn't work either; instead of the quotes I see "-1" and
garbage.

I hope the examples will go through.

-- 
ke@suse.de (work) / keichwa@gmx.net (home):              |
http://www.suse.de/~ke/                                  |      ,__o
Free Translation Project:                                |    _-\_<,
http://www.iro.umontreal.ca/contrib/po/HTML/             |   (*)/'(*)


* Re: Reporting UTF-8 related problems?
  2002-07-30 18:58             ` Karl Eichwalder
@ 2002-07-30 19:51               ` Karl Eichwalder
  2002-07-31  2:59               ` Karl Eichwalder
  2002-07-31 12:26               ` Kenichi Handa
  2 siblings, 0 replies; 21+ messages in thread
From: Karl Eichwalder @ 2002-07-30 19:51 UTC (permalink / raw)
  Cc: eliz, emacs-devel, Andreas Schwab

Karl Eichwalder <keichwa@gmx.net> writes:

>                  128     4      UNUSED
>                  132     1      "My rising double quote left (low)"
>                  133     14     UNUSED
>                  147     1      "My rising double quote right (high)"
>                  148     16     UNUSED
                           ^^ 12 of course.  Do your math, Karl!
>                  [...]
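
The arithmetic behind that correction can be checked mechanically: each
DESCSET row must start where the previous one ended, and the replacement
rows must cover the same codes as the "128 32 UNUSED" row they replace.  A
minimal Python sketch, with a helper name invented for this example:

```python
def tile_end(rows):
    """Return the first code after the last row.

    rows is a list of (first_code, count) pairs, as in a DESCSET table.
    Raises ValueError if a row does not start where the previous one ended.
    """
    expected = rows[0][0]
    for first, count in rows:
        if first != expected:
            raise ValueError(f"row starts at {first}, expected {expected}")
        expected = first + count
    return expected


original = [(128, 32)]
as_posted = [(128, 4), (132, 1), (133, 14), (147, 1), (148, 16)]
corrected = [(128, 4), (132, 1), (133, 14), (147, 1), (148, 12)]

print(tile_end(original), tile_end(as_posted), tile_end(corrected))
```

With a count of 16 the last row would run four codes past 159; with 12 it
ends exactly where the original row did.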

> Andreas Schwab <schwab@suse.de> writes:
>
>> The numbers are supposed to be ISO 8859-1 characters codes.  I'd guess the
>> page has been written with some broken (a.k.a. W*nd*ws) software (the use
>> of *.htm makes this apparent).

BTW, nsgmls accepts the numeric character references even when marked as
UNUSED; try in a UTF-8 xterm:

echo '
<!DOCTYPE html PUBLIC "-//IETF//DTD HTML//EN">
<html>
 <head><title></title></head>
 <body>&#132;Die Familie Schroffenstein&#147;</body>
' | nsgmls -c /usr/share/sgml/CATALOG.html /usr/share/sgml/html/html.decl - \
  | iconv -f windows-1252 -t utf-8

-- 
ke@suse.de (work) / keichwa@gmx.net (home):              |
http://www.suse.de/~ke/                                  |      ,__o
Free Translation Project:                                |    _-\_<,
http://www.iro.umontreal.ca/contrib/po/HTML/             |   (*)/'(*)


* Re: Reporting UTF-8 related problems?
  2002-07-30 18:58             ` Karl Eichwalder
  2002-07-30 19:51               ` Karl Eichwalder
@ 2002-07-31  2:59               ` Karl Eichwalder
  2002-07-31 12:26               ` Kenichi Handa
  2 siblings, 0 replies; 21+ messages in thread
From: Karl Eichwalder @ 2002-07-31  2:59 UTC (permalink / raw)
  Cc: eliz, emacs-devel, Andreas Schwab

Karl Eichwalder <keichwa@gmx.net> writes:

> I hope the examples will go through.

It did -- at least I received it as sent.  Gnus did a great job even if
it decided to use "macintosh" encodings for some parts.

Karl Eichwalder <keichwa@gmx.net> writes:

>>                  128     4      UNUSED
>>                  132     1      "My rising double quote left (low)"
>>                  133     14     UNUSED
>>                  147     1      "My rising double quote right (high)"
>>                  148     16     UNUSED
>                           ^^ 12 of course.  Do your math, Karl!

> BTW, nsgmls accepts the numeric character references even when marked as
> UNUSED; try in a UTF-8 xterm:

I did some more digging.  This works because HTML does not say within
the SYNTAX clause that &#128;-&#159; are control characters; the DocBook
SGML declaration explicitly forbids these as data:

SYNTAX

	SHUNCHAR  CONTROLS   0   1   2   3   4   5   6   7   8   9
                            10  11  12  13  14  15  16  17  18  19
                            20  21  22  23  24  25  26  27  28  29
                            30  31                     127 128 129
                           130 131 132 133 134 135 136 137 138 139
                           140 141 142 143 144 145 146 147 148 149
                           150 151 152 153 154 155 156 157 158 159
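
Read as a set, the SHUNCHAR list above is simply 0 to 31 plus 127 to 159.
The following Python sketch builds that set and checks the two codes used
on the web page; the variable and function names are made up for this
example.

```python
# Codes shunned by the declaration quoted above: C0 controls, DEL, C1 controls.
SHUNNED = set(range(0, 32)) | set(range(127, 160))


def is_shunned(code: int) -> bool:
    """True if the declaration forbids this code as data."""
    return code in SHUNNED


for code in (132, 147, 160):
    print(code, "shunned" if is_shunned(code) else "allowed")
```

Both 132 and 147 fall inside the shunned range, while 160 (the first code
after it) does not.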


-- 
ke@suse.de (work) / keichwa@gmx.net (home):              |
http://www.suse.de/~ke/                                  |      ,__o
Free Translation Project:                                |    _-\_<,
http://www.iro.umontreal.ca/contrib/po/HTML/             |   (*)/'(*)


* Re: Reporting UTF-8 related problems?
  2002-07-30 18:58             ` Karl Eichwalder
  2002-07-30 19:51               ` Karl Eichwalder
  2002-07-31  2:59               ` Karl Eichwalder
@ 2002-07-31 12:26               ` Kenichi Handa
  2002-07-31 16:29                 ` Karl Eichwalder
  2002-08-01  5:18                 ` Eli Zaretskii
  2 siblings, 2 replies; 21+ messages in thread
From: Kenichi Handa @ 2002-07-31 12:26 UTC (permalink / raw)
  Cc: eliz, emacs-devel, schwab

In article <sh65yxniuf.fsf@tux.gnu.franken.de>, Karl Eichwalder <keichwa@gmx.net> writes:
> Yes, but once in the X selection I'd like to see Emacs honor them.

> The spacing problem also occurs when I try to cut and paste from Markus
> Kuhn's demo file
> (http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt):

As far as I understand, that's not a spacing problem.  As
those clients send Emacs the designation sequence of
jisx0208 characters, Emacs just decodes them correctly (i.e.
honoring them) and displays them with a Japanese double-width
font.

> When I insert (C-x RET c utf-8 RET C-x C-f UTF-8-demo.txt RET), things
> are correctly displayed (the characters are different):

That's because the file is correctly encoded in utf-8, thus
Emacs can decode it correctly.

> Cut and paste both these examples from Emacs (this mail buffer) to a
> UTF-8 xterm doesn't work neither; instead of the quotes I see "-1" and
> garbage.

Yes, because I have not yet installed code for encoding an
Emacs string into what a UTF-8 xterm expects.

I confirmed that a UTF-8 xterm does request the target type
UTF8_STRING first.  I'm now looking for a way to handle it.

While tracing the whole procedure Emacs uses to handle a
selection request, I found the following.

Could someone else also check whether I missed something?

When Emacs receives a selection request,
x_handle_selection_request (xselect.c) is called.

The flow is as follows:

x_handle_selection_request (EVENT)  -- xselect.c
  x_get_local_selection (SELECTION, TARGET_TYPE)  -- xselect.c
    xselect-convert-to-string (SELECTION, TARGET-TYPE, VALUE)  -- select.el
       => returns MULTIBYTE-STRING
    => returns MULTIBYTE-STRING
  lisp_data_to_selection_data (EVENT, MULTIBYTE-STRING, ...)
     => returns encoded string
  x_reply_selection_request (EVENT, above returned encoded string)
     ;; sends selection data to the other client

So, it seems that we can perform the encoding in the lisp
function xselect-convert-to-string, not in
lisp_data_to_selection_data.  BUT...

xselect-convert-to-string is also called in this way:

yank  -- simple.el
  current-kill  -- simple.el
    x-cut-buffer-or-selection-value  -- x-win.el
      x-get-selection  -- select.el
        Fx_get_selection_internal  -- xselect.c
          x_get_local_selection  -- xselect.c
             xselect-convert-to-string  -- select.el  !!!

And, in the latter case, xselect-convert-to-string must
return an Emacs string without encoding it.  Currently,
xselect-convert-to-string has no way to know in which
situation it is called.

So, how about calling xselect-convert-to-string with
TARGET-TYPE nil in the latter case?  This can be done by
adding one more arg LOCAL-REQUEST to x_get_local_selection.

If the above analysis is correct, we can implement the
rather sensitive/delicate code for handling strings in
lisp_data_to_selection_data and x_encode_text in Lisp, which
makes Emacs's reaction to selection requests more flexible
and also makes future maintenance easier.
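
The proposal can be modelled in a few lines of Python.  This is a toy model
under stated assumptions, not Emacs code: the Python function names only
echo the C and Lisp names above, the set of target types is made up, and
Latin-1 stands in for whatever encoding STRING would really use.

```python
def convert_to_string(value: str, target_type):
    """Toy stand-in for xselect-convert-to-string.

    target_type None marks a local request (a yank inside Emacs): return
    the string unencoded.  Otherwise encode for the requesting client.
    """
    if target_type is None:
        return value
    if target_type == "UTF8_STRING":
        return value.encode("utf-8")
    if target_type == "STRING":
        # Assumption for the sketch: Latin-1 with '?' for anything else.
        return value.encode("latin-1", errors="replace")
    raise ValueError(f"unsupported target type: {target_type!r}")


def get_local_selection(value: str, target_type, local_request: bool):
    """Toy stand-in for x_get_local_selection with the extra LOCAL-REQUEST arg."""
    return convert_to_string(value, None if local_request else target_type)


text = "\u201eDie Familie Schroffenstein\u201c"
print(repr(get_local_selection(text, "UTF8_STRING", local_request=True)))
print(repr(get_local_selection(text, "UTF8_STRING", local_request=False)))
```

The point of the extra argument is that the converter itself no longer has
to guess who is calling: the local path gets a string back, and the path
that answers another client gets bytes.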

What do you think?

---
Ken'ichi HANDA
handa@etl.go.jp


* Re: Reporting UTF-8 related problems?
  2002-07-31 12:26               ` Kenichi Handa
@ 2002-07-31 16:29                 ` Karl Eichwalder
  2002-08-01  5:18                 ` Eli Zaretskii
  1 sibling, 0 replies; 21+ messages in thread
From: Karl Eichwalder @ 2002-07-31 16:29 UTC (permalink / raw)
  Cc: eliz, emacs-devel, schwab

Kenichi Handa <handa@etl.go.jp> writes:

> As far as I understand, that's not a spacing problem.  As
> those clients send Emacs the designation sequence of
> jisx0208 characters, Emacs just decodes them correctly (i.e.
> honoring them) and displaying them by Japanese double-width
> font.

I will try to talk to the xterm/lynx and Galeon maintainers; thanks for
verifying the problem and your commitment to solving it!

> That's because the file is correctly encoded in utf-8, thus
> Emacs can decode it correctly.

Okay, I guess it's best to stay away from using the X selection at
the moment.  I hope others can jump in and answer your confirmation
request; my knowledge of these issues is close to zero.

-- 
ke@suse.de (work) / keichwa@gmx.net (home):              |
http://www.suse.de/~ke/                                  |      ,__o
Free Translation Project:                                |    _-\_<,
http://www.iro.umontreal.ca/contrib/po/HTML/             |   (*)/'(*)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Reporting UTF-8 related problems?
  2002-07-31 12:26               ` Kenichi Handa
  2002-07-31 16:29                 ` Karl Eichwalder
@ 2002-08-01  5:18                 ` Eli Zaretskii
  2002-08-14  1:21                   ` Kenichi Handa
  1 sibling, 1 reply; 21+ messages in thread
From: Eli Zaretskii @ 2002-08-01  5:18 UTC (permalink / raw)
  Cc: emacs-devel


On Wed, 31 Jul 2002, Kenichi Handa wrote:

> So, how about calling xselect-convert-to-string with
> TARGET-TYPE nil in the latter case?  This can be done by
> adding one more arg LOCAL-REQUEST to x_get_local_selection.
> 
> If the above analysis is correct, we can implement in Lisp
> the rather sensitive/delicate string-handling code that is
> now in lisp_data_to_selection_data and x_encode_text, which
> makes Emacs's reaction to selection requests more flexible
> and also makes future maintenance easier.
> 
> What do you think?

I think it's a good idea.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Reporting UTF-8 related problems?
  2002-08-01  5:18                 ` Eli Zaretskii
@ 2002-08-14  1:21                   ` Kenichi Handa
  2002-11-03 20:21                     ` Karl Eichwalder
  0 siblings, 1 reply; 21+ messages in thread
From: Kenichi Handa @ 2002-08-14  1:21 UTC (permalink / raw)
  Cc: emacs-devel

In article <Pine.SUN.3.91.1020801081750.20714A-100000@is>, Eli Zaretskii <eliz@is.elta.co.il> writes:
> On Wed, 31 Jul 2002, Kenichi Handa wrote:

>>  So, how about calling xselect-convert-to-string with
>>  TARGET-TYPE nil in the latter case?  This can be done by
>>  adding one more arg LOCAL-REQUEST to x_get_local_selection.
>>  
>>  If the above analysis is correct, we can implement in Lisp
>>  the rather sensitive/delicate string-handling code that is
>>  now in lisp_data_to_selection_data and x_encode_text, which
>>  makes Emacs's reaction to selection requests more flexible
>>  and also makes future maintenance easier.
>>  
>>  What do you think?

> I think it's a good idea.

I've just committed this change.  I confirmed that this
works on X (now pasting from Emacs to UTF-8 xterm also
works), but I don't know if it doesn't break anything on
Windows/DOS.

2002-08-14  Kenichi Handa  <handa@etl.go.jp>

	* select.el (xselect-convert-to-string): If TYPE is non-nil,
	encode the selection data string.  Always return cons of type and
	string.
	(selection-converter-alist): Add (UTF8_STRING .
	xselect-convert-to-string).

2002-08-14  Kenichi Handa  <handa@etl.go.jp>

	* xselect.c (QUTF8_STRING): New variable.
	(symbol_to_x_atom): Pay attention to QUTF8_STRING.
	(x_atom_to_symbol): Likewise.
	(x_get_local_selection): New argument local_request.  If it is
	nonzero, call handler_fn with the second arg nil.
	(x_handle_selection_request): Call x_get_local_selection with
	local_request 0.
	(lisp_data_to_selection_data): Don't encode the string here.
	(Fx_get_selection_internal): Call x_get_local_selection with
	local_request 1.
	(syms_of_xselect): Intern and staticpro QUTF8_STRING.

	* xterm.c (x_term_init): Initialize dpyinfo->Xatom_UTF8_STRING.

	* xterm.h (struct x_display_info): New member Xatom_UTF8_STRING.
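
The dispatch described in the ChangeLog entries above can be sketched
roughly as follows.  This is only a toy model in C, not the committed
code: the struct, the function names convert_to_string and
get_local_selection, and the "encoded" flag are hypothetical stand-ins
for xselect-convert-to-string and x_get_local_selection, and a NULL
type plays the role of Lisp nil.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result of a selection handler: the type it was asked
   for (NULL when none was given) and whether the string was encoded
   for another X client.  */
struct selection_result
{
  const char *type;
  int encoded;
};

/* Stand-in for xselect-convert-to-string: encode only when a target
   type is given; a NULL type means Emacs itself is the requestor, so
   the string is returned as is.  */
static struct selection_result
convert_to_string (const char *type)
{
  struct selection_result r;
  r.type = type;
  r.encoded = (type != NULL);
  return r;
}

/* Stand-in for x_get_local_selection with the new local_request flag:
   if it is nonzero, the handler is called with no target type.  */
static struct selection_result
get_local_selection (const char *target_type, int local_request)
{
  return convert_to_string (local_request ? NULL : target_type);
}

/* Usage: a request from another client is encoded, a local one is not.  */
static void
check_dispatch (void)
{
  struct selection_result remote = get_local_selection ("UTF8_STRING", 0);
  struct selection_result local = get_local_selection ("UTF8_STRING", 1);

  assert (remote.encoded && strcmp (remote.type, "UTF8_STRING") == 0);
  assert (!local.encoded && local.type == NULL);
}
```

The point of the flag is that the handler itself no longer has to
guess who is asking; the caller states it.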

---
Ken'ichi HANDA
handa@etl.go.jp

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Reporting UTF-8 related problems?
  2002-08-14  1:21                   ` Kenichi Handa
@ 2002-11-03 20:21                     ` Karl Eichwalder
  2002-11-04  4:56                       ` Karl Eichwalder
  0 siblings, 1 reply; 21+ messages in thread
From: Karl Eichwalder @ 2002-11-03 20:21 UTC (permalink / raw)
  Cc: eliz, schwab, emacs-devel

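Kenichi Handa <handa@etl.go.jp> writes:

> I've just committed this change.  I confirmed that this
> works on X (now pasting from Emacs to UTF-8 xterm also
> works),

Yes, from Emacs to UTF-8 xterm works -- thanks for this enhancement ☺ .

The opposite direction still fails for me.  Letters also available within
the latin1 range are pasted correctly, but "ő" (o with long dots, used
in Hungarian) is shown as a hash mark ("#") only.

Pasting ő from UTF-8 xterm to mozilla works.

> but I don't know if it doesn't break anything on Windows/DOS.

-- 
ke@suse.de (work) / keichwa@gmx.net (home):              |
http://www.gnu.franken.de/ke/                            |      ,__o
Free Translation Project:                                |    _-\_<,
http://www.iro.umontreal.ca/contrib/po/HTML/             |   (*)/'(*)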

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Reporting UTF-8 related problems?
  2002-11-03 20:21                     ` Karl Eichwalder
@ 2002-11-04  4:56                       ` Karl Eichwalder
  0 siblings, 0 replies; 21+ messages in thread
From: Karl Eichwalder @ 2002-11-04  4:56 UTC (permalink / raw)
  Cc: eliz, schwab, emacs-devel

Sorry, Gnus decided to use a strange encoding that my Emacs isn't able
to decode again.  Let's try again:

Kenichi Handa <handa@etl.go.jp> writes:

> I've just committed this change.  I confirmed that this
> works on X (now pasting from Emacs to UTF-8 xterm also
> works),

Yes, from Emacs to UTF-8 xterm works -- thanks for this enhancement ☺ .

The opposite direction still fails for me.  Letters also available within
the latin1 range are pasted correctly, but "ő" (o with long dots, used
in Hungarian) is shown as a hash mark ("#") only.

Pasting ő from UTF-8 xterm to mozilla works.
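
To illustrate the latin1 distinction above (this is not a diagnosis of
the bug): "ö" is U+00F6 and fits in latin1, while "ő" is U+0151 and
does not.  A minimal sketch in C; the helper names decode_utf8_2 and
fits_latin1 are made up for this example.

```c
#include <assert.h>

/* Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) into a code
   point.  Returns -1 if the bytes do not have that shape.  Overlong
   forms are not rejected; this is only an illustration.  */
static long
decode_utf8_2 (unsigned char b0, unsigned char b1)
{
  if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80)
    return -1;
  return ((long) (b0 & 0x1F) << 6) | (b1 & 0x3F);
}

/* Nonzero if the code point can be represented in latin1.  */
static int
fits_latin1 (long cp)
{
  return cp >= 0 && cp <= 0xFF;
}

static void
check_examples (void)
{
  /* "ö" is the UTF-8 byte pair C3 B6.  */
  assert (decode_utf8_2 (0xC3, 0xB6) == 0xF6);
  assert (fits_latin1 (0xF6));

  /* "ő" is the UTF-8 byte pair C5 91.  */
  assert (decode_utf8_2 (0xC5, 0x91) == 0x151);
  assert (!fits_latin1 (0x151));
}
```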

> but I don't know if it doesn't break anything on Windows/DOS.



-- 
ke@suse.de (work) / keichwa@gmx.net (home):              |
http://www.gnu.franken.de/ke/                            |      ,__o
Free Translation Project:                                |    _-\_<,
http://www.iro.umontreal.ca/contrib/po/HTML/             |   (*)/'(*)

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2002-11-04  4:56 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-07-28 16:14 Reporting UTF-8 related problems? Karl Eichwalder
2002-07-28 18:23 ` Eli Zaretskii
2002-07-28 18:26 ` Eli Zaretskii
2002-07-29  5:18   ` Kenichi Handa
2002-07-29  5:37     ` Kenichi Handa
2002-07-29 15:35     ` Karl Eichwalder
2002-07-30  5:22       ` Kenichi Handa
2002-07-30  6:01         ` Karl Eichwalder
2002-07-30  7:11           ` Kenichi Handa
2002-07-30  7:57             ` Andreas Schwab
2002-07-30  8:30               ` Kenichi Handa
2002-07-30 18:58             ` Karl Eichwalder
2002-07-30 19:51               ` Karl Eichwalder
2002-07-31  2:59               ` Karl Eichwalder
2002-07-31 12:26               ` Kenichi Handa
2002-07-31 16:29                 ` Karl Eichwalder
2002-08-01  5:18                 ` Eli Zaretskii
2002-08-14  1:21                   ` Kenichi Handa
2002-11-03 20:21                     ` Karl Eichwalder
2002-11-04  4:56                       ` Karl Eichwalder
2002-07-29 17:29   ` Richard Stallman
