* Reporting UTF-8 related problems?
From: Karl Eichwalder @ 2002-07-28 16:14 UTC
Cc: Kenichi Handa

Is it useful to report UTF-8 related problems (Emacs CVS version, from
trunk)?

Cut-and-paste via X selection shows issues.

Visit http://www.textkritik.de/bka/dokumente/dok_k/koepke1.htm with
lynx running in a UTF-8 xterm (the page is in German).  You should be
able to see ,,Die Familie Schroffenstein'' and ,,Amphitryon'' properly
quoted (double left quotes at the line bottom).

Yanking in Emacs with the mouse results in

    ^[%Gâ ¾^[%@Die Familie Schroffenstein^[$(B!H

Doing the same from the Gnome web browser Galeon results in

    ?Die Familie Schroffenstein?

(literal question marks instead of Unicode quotes).  BTW, there is no
problem pasting from Galeon into a UTF-8 xterm.

The Emacs buffer is UTF-8 enabled and I also tried to set the coding
system for X selection accordingly (C-x RET x utf-8 RET).

--
ke@suse.de (work) / keichwa@gmx.net (home):              |
http://www.suse.de/~ke/                                  |      ,__o
Free Translation Project:                                |    _-\_<,
http://www.iro.umontreal.ca/contrib/po/HTML/             |   (*)/'(*)
* Re: Reporting UTF-8 related problems?
From: Eli Zaretskii @ 2002-07-28 18:23 UTC
Cc: emacs-devel, handa

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]

> From: Karl Eichwalder <keichwa@gmx.net>
> Date: Sun, 28 Jul 2002 18:14:35 +0200
>
> Yanking in Emacs with the mouse results in
>
>     ^[%G\x7f\x7f\x7f^[%@Die Familie Schroffenstein^[$(B!H
>
> Doing the same from the Gnome web browser Galeon results in
>
>     ?Die Familie Schroffenstein?
>
> (literal question marks instead of Unicode quotes).  BTW, there is no
> problem to paste from Galeon into an UTF-8 xterm.
>
> The Emacs buffer is UTF-8 enable and I also tried to set the Coding
> system for X selection accordingly (C-x RET x utf-8 RET).
* Re: Reporting UTF-8 related problems?
From: Eli Zaretskii @ 2002-07-28 18:26 UTC
Cc: emacs-devel, handa

> From: Karl Eichwalder <keichwa@gmx.net>
> Date: Sun, 28 Jul 2002 18:14:35 +0200
>
> Cut-and-paste via X selection shows issues.
>
> Visit http://www.textkritik.de/bka/dokumente/dok_k/koepke1.htm with
> lynx running in a UTF-8 xterm (the page is in German).

The telltale ESC % sequence is the beginning of the ``extended
segment'' in ICCCM parlance.  Emacs doesn't currently support UTF-8
in the extended segments, but adding that support should be easy, I'd
think.  See ctext-post-read-conversion and ctext-pre-write-conversion
defined in mule.el.
* Re: Reporting UTF-8 related problems?
From: Kenichi Handa @ 2002-07-29 5:18 UTC
Cc: keichwa, emacs-devel

In article <2110-Sun28Jul2002212621+0300-eliz@is.elta.co.il>, "Eli Zaretskii" <eliz@is.elta.co.il> writes:

>> From: Karl Eichwalder <keichwa@gmx.net>
>> Date: Sun, 28 Jul 2002 18:14:35 +0200
>>
>> Cut-and-paste via X selection shows issues.
>>
>> Visit http://www.textkritik.de/bka/dokumente/dok_k/koepke1.htm with
>> lynx running in a UTF-8 xterm (the page is in German).

> The telltale ESC % sequence is the beginning of the ``extended
> segment'' in ICCCM parlance.  Emacs doesn't currently support UTF-8
> in the extended segments, but adding that support should be easy, I'd
> think.  See ctext-post-read-conversion and ctext-pre-write-conversion
> defined in mule.el.

The reported escape sequence is "ESC % G ... ESC % @", which
is not an extended segment of CTEXT (described in section 6
of the ctext document), but the special sequence for utf-8
(described in the newly inserted section 7 of the ctext
document distributed with XFree86; I'll attach it).

I've just committed a change to ctext-post-read-conversion
(in mule.el).  Could you please try it?

Eli, could you also check my change?

> Doing the same from the Gnome web browser Galeon results in

>     ?Die Familie Schroffenstein?

> (literal question marks instead of Unicode quotes).  BTW, there is no
> problem pasting from Galeon into a UTF-8 xterm.

I tried the latest Galeon.  It sends Emacs the same byte
sequence as lynx does.

---
Ken'ichi HANDA
handa@etl.go.jp
* Re: Reporting UTF-8 related problems?
From: Kenichi Handa @ 2002-07-29 5:37 UTC
Cc: eliz, keichwa, emacs-devel

In article <200207290518.OAA04004@etlken.m17n.org>, Kenichi Handa <handa@etl.go.jp> writes:

> The reported escape sequence is "ESC % G ... ESC % @" which
> is not the extended segments of CTEXT (described in the
> section 6 of the ctext document), but the special sequence
> for utf-8 (described in the newly inserted section 7 of the
> ctext document distributed with XFree86, I'll attach it).

Oops, I forgot to attach it.  Here it is.

---
Ken'ichi HANDA
handa@etl.go.jp

7.  The UTF-8 encoding

Unicode characters that are not contained in one of the approved
standard encodings can be encoded using the UTF-8 encoding.  The
following escape sequences are used:

     01/11 02/05 04/07     switch into UTF-8 mode
     01/11 02/05 04/00     return from UTF-8 mode

The first is the ISO registered sequence for UTF-8 (ISO-IR-196), the
second is the ISO-2022 ``standard return'' sequence.

While in UTF-8 mode, the UTF-8 encoding replaces the currently
designated GL and GR encodings.  After return from UTF-8 mode, the
previously designated GL and GR encodings are reactivated.

[This is the only ``other coding system'' used in Compound Text.]

[This is an XFree86 extension introduced in XFree86 4.0.2.]
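As a rough illustration of the section 7 text quoted above, the following Python sketch splits a byte string on the two escape sequences (01/11 02/05 04/07 is ESC % G, 01/11 02/05 04/00 is ESC % @) and decodes the enclosed bytes as UTF-8. It is a simplification and not what ctext-post-read-conversion does: the function name is made up for this example, and everything outside UTF-8 mode is assumed to be plain Latin-1, whereas real Compound Text can designate other character sets there.

```python
UTF8_START = b"\x1b%G"  # 01/11 02/05 04/07: switch into UTF-8 mode
UTF8_END = b"\x1b%@"    # 01/11 02/05 04/00: return from UTF-8 mode


def decode_ctext_utf8(data: bytes) -> str:
    """Decode UTF-8 segments of a Compound Text string (hypothetical helper).

    Assumption for this sketch: bytes outside UTF-8 mode are Latin-1.
    """
    pieces = []
    pos = 0
    while pos < len(data):
        start = data.find(UTF8_START, pos)
        if start == -1:
            # No further UTF-8 segment: the rest is ordinary text.
            pieces.append(data[pos:].decode("latin-1"))
            break
        pieces.append(data[pos:start].decode("latin-1"))
        seg_start = start + len(UTF8_START)
        end = data.find(UTF8_END, seg_start)
        if end == -1:
            # Unterminated segment: treat the remainder as UTF-8.
            pieces.append(data[seg_start:].decode("utf-8"))
            break
        pieces.append(data[seg_start:end].decode("utf-8"))
        pos = end + len(UTF8_END)
    return "".join(pieces)


# E2 80 9E is the UTF-8 encoding of U+201E, the low opening quote.
sample = b"\x1b%G\xe2\x80\x9e\x1b%@Die Familie Schroffenstein"
print(decode_ctext_utf8(sample))
```

The sample deliberately stops before the trailing ESC $ ( B designation from the original report, since that part is not UTF-8 mode and would need a separate decoder.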
* Re: Reporting UTF-8 related problems?
From: Karl Eichwalder @ 2002-07-29 15:35 UTC
Cc: eliz, emacs-devel

Kenichi Handa <handa@etl.go.jp> writes:

> I've just committed a change to ctext-post-read-conversion
> (in mule.el).

Thanks, it mostly works for me.  When I yank the phrase into Emacs using
the mouse, the right double quote becomes 2 letters wide (it looks like
a space just before the quote):

    Char: “ (0150310, 53448, 0xd0c8) point=309 of 321 (96%) column 12

Another issue is that you cannot kill and yank the quotation mark
without marking it first.

>> Doing the same from the Gnome web browser Galeon results in
>
>>     ?Die Familie Schroffenstein?
>
>> (literal question marks instead of Unicode quotes).  BTW, there is no
>> problem pasting from Galeon into a UTF-8 xterm.
>
> I tried the latest Galeon.  It sends Emacs the same byte
> sequence as lynx does.

Still wondering: why does Emacs treat the quotes coming from Galeon
differently?
* Re: Reporting UTF-8 related problems?
From: Kenichi Handa @ 2002-07-30 5:22 UTC
Cc: eliz, emacs-devel

In article <sh65yy35te.fsf@tux.gnu.franken.de>, Karl Eichwalder <keichwa@gmx.net> writes:

> Kenichi Handa <handa@etl.go.jp> writes:
>> I've just committed a change to ctext-post-read-conversion
>> (in mule.el).

> Thanks, it mostly works for me.  When I yank the phrase into Emacs using
> the mouse, the right double quote becomes 2 letters wide (it looks like
> a space just before the quote):

>     Char: “ (0150310, 53448, 0xd0c8) point=309 of 321 (96%) column 12

This is because Emacs received this byte sequence:

    ESC $ ( B ! H

"ESC $ ( B" is a designation sequence for jisx0208, and the
following two bytes "! H" specify the above Japanese symbol.

This is a problem of lynx and galeon (or some core part of
gnome, I don't know).

> Another issue is that you cannot kill and yank the quotation mark
> without marking it first.

I don't understand what you mean.  "Kill and yank" from
where to where?  What is the meaning of "marking it"?

>>> Doing the same from the Gnome web browser Galeon results in
>>
>>>     ?Die Familie Schroffenstein?
>>
>>> (literal question marks instead of Unicode quotes).  BTW, there is no
>>> problem pasting from Galeon into a UTF-8 xterm.
>>
>> I tried the latest Galeon.  It sends Emacs the same byte
>> sequence as lynx does.

> Still wondering: why does Emacs treat the quotes coming from Galeon
> differently?

No.  I think your Galeon actually sent `?' to Emacs.  My
Galeon (ver.1.2.5) sends "ESC % G ... ESC % @".

The ICCCM document distributed with XFree86 contains this
paragraph (which doesn't exist in X.V11R6's document):

----------------------------------------------------------------------
UTF8_STRING as a type or a target specifies an UTF-8 encoded string,
with NEWLINE (U+000A, hex 0A) as end-of-line marker.
----------------------------------------------------------------------

What I suspect is that a UTF-8 xterm asks Galeon to send
selection data as UTF8_STRING (not as TEXT, as Emacs does).

---
Ken'ichi HANDA
handa@etl.go.jp
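To make the "ESC $ ( B ! H" explanation concrete, here is a small Python sketch that looks up the two bytes "! H" in JIS X 0208. It does not parse the designation sequence itself; it relies on the fact that EUC-JP stores JIS X 0208 characters as the same two bytes with the high bit set, and on Python's euc_jp codec. The helper name is invented for this example.

```python
import unicodedata


def jisx0208_pair_to_unicode(pair: bytes) -> str:
    """Map a 7-bit JIS X 0208 byte pair to Unicode via EUC-JP (hypothetical helper)."""
    if len(pair) != 2:
        raise ValueError("a JIS X 0208 character is exactly two bytes")
    # EUC-JP represents JIS X 0208 as the same two bytes with bit 8 set.
    euc = bytes(b | 0x80 for b in pair)
    return euc.decode("euc_jp")


ch = jisx0208_pair_to_unicode(b"!H")  # 0x21 0x48, the bytes after ESC $ ( B
print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

So the sender transmitted the intended quotation mark, only tagged as a jisx0208 character, which fits the later observation in the thread that Emacs shows it in a double-width Japanese font.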
* Re: Reporting UTF-8 related problems?
From: Karl Eichwalder @ 2002-07-30 6:01 UTC
Cc: eliz, emacs-devel

Kenichi Handa <handa@etl.go.jp> writes:

>>     Char: “ (0150310, 53448, 0xd0c8) point=309 of 321 (96%) column 12
>
> This is because Emacs received this byte sequence:
>     ESC $ ( B ! H
> "ESC $ ( B" is a designation sequence for jisx0208,
> and the following two bytes "! H" specify the above
> Japanese symbol.

Originally, it was the "right double quote raising" and not meant to be
a special Japanese symbol ;)

> This is a problem of lynx and galeon (or some core part of
> gnome, I don't know).

I will keep an eye on it.

> I don't understand what you mean.  "Kill and yank" from
> where to where?  What is the meaning of "marking it"?

Sorry.  This time: from Emacs to Emacs.  I assumed you can C-d the
current letter and yank it back (C-y).  My assumption was wrong.  C-d
just deletes; thus C-y cannot yank it back.

> No.  I think your Galeon actually sent `?' to Emacs.  My
> Galeon (ver.1.2.5) sends "ESC % G ... ESC % @".

Oops, my Galeon is outdated.

> What I suspect is that a UTF-8 xterm asks Galeon to send
> selection data as UTF8_STRING (not as TEXT, as Emacs does).

Sounds convincing.  Thanks.
* Re: Reporting UTF-8 related problems?
From: Kenichi Handa @ 2002-07-30 7:11 UTC
Cc: eliz, emacs-devel

In article <shznw9eotw.fsf@tux.gnu.franken.de>, Karl Eichwalder <keichwa@gmx.net> writes:

> Kenichi Handa <handa@etl.go.jp> writes:
>>>     Char: “ (0150310, 53448, 0xd0c8) point=309 of 321 (96%) column 12
>>
>> This is because Emacs received this byte sequence:
>>     ESC $ ( B ! H
>> "ESC $ ( B" is a designation sequence for jisx0208,
>> and the following two bytes "! H" specify the above
>> Japanese symbol.

> Originally, it was the "right double quote raising" and not meant to be
> a special Japanese symbol ;)

I checked the contents of the html file itself and found this:

    &#132;Die Familie Schroffenstein&#147;

I thought that the notation &#NUMBER is for transmitting the
Unicode character of code NUMBER.  But 132 and 147 are
control codes in Unicode, not any kind of quotation marks.
Do you know a proper web page describing their meaning?

> Sorry.  This time: from Emacs to Emacs.  I assumed you can C-d the
> current letter and yank it back (C-y).  My assumption was wrong.  C-d
> just deletes; thus C-y cannot yank it back.

That's a general feature of Emacs.  C-d DELETEs a character,
it does not KILL it.  C-y can yank only what was killed.  The
Emacs info node "Deletion and Killing" explains the difference
in detail.

---
Ken'ichi HANDA
handa@etl.go.jp
* Re: Reporting UTF-8 related problems?
From: Andreas Schwab @ 2002-07-30 7:57 UTC
Cc: keichwa, eliz, emacs-devel

Kenichi Handa <handa@etl.go.jp> writes:

|> In article <shznw9eotw.fsf@tux.gnu.franken.de>, Karl Eichwalder <keichwa@gmx.net> writes:
|> > Kenichi Handa <handa@etl.go.jp> writes:
|> >>>     Char: “ (0150310, 53448, 0xd0c8) point=309 of 321 (96%) column 12
|> >>
|> >> This is because Emacs received this byte sequence:
|> >>     ESC $ ( B ! H
|> >> "ESC $ ( B" is a designation sequence for jisx0208,
|> >> and the following two bytes "! H" specify the above
|> >> Japanese symbol.
|>
|> > Originally, it was the "right double quote raising" and not meant to be
|> > a special Japanese symbol ;)
|>
|> I checked the contents of the html file itself and found this:
|>
|>     &#132;Die Familie Schroffenstein&#147;
|>
|> I thought that the notation &#NUMBER is for transmitting the
|> Unicode character of code NUMBER.  But 132 and 147 are
|> control codes in Unicode, not any kind of quotation marks.
|> Do you know a proper web page describing their meaning?

The numbers are supposed to be ISO 8859-1 character codes.  I'd guess the
page has been written with some broken (a.k.a. W*nd*ws) software (the use
of *.htm makes this apparent).  There is no hope of it being compliant
with any standard.  I tried to validate it through the W3.org validator,
but no document type matches.

Andreas.

--
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux AG, Deutschherrnstr. 15-19, D-90429 Nürnberg
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."
* Re: Reporting UTF-8 related problems?
From: Kenichi Handa @ 2002-07-30 8:30 UTC
Cc: keichwa, eliz, emacs-devel

In article <je65yxwsve.fsf@sykes.suse.de>, Andreas Schwab <schwab@suse.de> writes:

> |>     &#132;Die Familie Schroffenstein&#147;
> |>
> |> I thought that the notation &#NUMBER is for transmitting
> |> Unicode character of code NUMBER.  But, 132 and 147 are
> |> control codes in Unicode, not any kind of quotings.  Do you
> |> know a proper web page describing the meaning of them?

> The numbers are supposed to be ISO 8859-1 characters codes.  I'd guess the
> page has been written with some broken (a.k.a. W*nd*ws) software (the use
> of *.htm makes this apparent).  There is no hope for being compliant to
> any standard.  I tried to validate it through the W3.org validator, but no
> document type matches.

Ah, I see.  I found that windows-125X maps 132 and 147 to
U+201E and U+201C.  So, perhaps those systems (galeon and
lynx) parse them as U+201E and U+201C.  Anyway, how to
encode them in X selection is their problem and Emacs can't
do anything about it.

---
Ken'ichi HANDA
handa@etl.go.jp
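The windows-125X mapping mentioned above is easy to check with a short Python sketch. It contrasts what the numbers 132 and 147 mean as literal Unicode code points with what windows-1252 assigns to the same byte values. The helper name is invented for this example; the last line uses Python's html.unescape, which follows the HTML5 rules and therefore applies the same remapping that lenient browsers do.

```python
import html
import unicodedata


def charref_as_cp1252(code: int) -> str:
    """Read a numeric character reference as a windows-1252 byte (hypothetical helper)."""
    return bytes([code]).decode("cp1252")


for code in (132, 147):
    literal = chr(code)               # the number taken as a Unicode code point
    mapped = charref_as_cp1252(code)  # the number taken as a windows-1252 byte
    print(code,
          unicodedata.category(literal),  # "Cc" marks a control character
          f"U+{ord(mapped):04X}",
          unicodedata.name(mapped))

# HTML5-style lenient decoding of the markup found in the page.
print(html.unescape("&#132;Die Familie Schroffenstein&#147;"))
```

This only demonstrates the mapping; which interpretation lynx or Galeon of that era actually applied is a guess in the thread, not something the sketch can confirm.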
* Re: Reporting UTF-8 related problems?
From: Karl Eichwalder @ 2002-07-30 18:58 UTC
Cc: eliz, emacs-devel, Andreas Schwab

Kenichi Handa <handa@etl.go.jp> writes:

>     &#132;Die Familie Schroffenstein&#147;
>
> I thought that the notation &#NUMBER is for transmitting the
> Unicode character of code NUMBER.  But 132 and 147 are
> control codes in Unicode, not any kind of quotation marks.

&#NUMBERs are so-called "character references"; the SGML declaration
defines which are allowed.  For HTML you must consult the html.d[e]?cl
file.  The crucial section is (HTML 2):

    BASESET  "ISO Registration Number 100//CHARSET
              ECMA-94 Right Part of Latin Alphabet Nr. 1//ESC 2/13 4/1"
    DESCSET  128  32  UNUSED
             160  96  32

This basically means: &#128; to &#159; are unused.  The same applies to
HTML 4 (and later to XML resp. XHTML):

    BASESET  "ISO Registration Number 177//CHARSET
              ISO/IEC 10646-1:1993 UCS-4 with
              implementation level 3//ESC 2/5 2/15 4/6"
    DESCSET  0    9   UNUSED
             9    2   9
             11   2   UNUSED
             13   1   13
             14   18  UNUSED
             32   95  32
             127  1   UNUSED
             128  32  UNUSED
             [...]

To make the SGML parser happy you can provide a changed declaration:

    BASESET  "ISO Registration Number 177//CHARSET
              ISO/IEC 10646-1:1993 UCS-4 with
              implementation level 3//ESC 2/5 2/15 4/6"
    DESCSET  0    9   UNUSED
             9    2   9
             11   2   UNUSED
             13   1   13
             14   18  UNUSED
             32   95  32
             127  1   UNUSED
             128  4   UNUSED
             132  1   "My rising double quote left (low)"
             133  14  UNUSED
             147  1   "My rising double quote right (high)"
             148  16  UNUSED
             [...]

Untested, and the result is invalid HTML.  If they announced a
proper HTTP header, it could be okay:

    Content-Type: text/html; charset=windows-1252

Andreas Schwab <schwab@suse.de> writes:

> The numbers are supposed to be ISO 8859-1 character codes.  I'd guess the
> page has been written with some broken (a.k.a. W*nd*ws) software (the use
> of *.htm makes this apparent).

Yes, they have "interesting" guidelines online...

Kenichi Handa <handa@etl.go.jp> writes:

> Ah, I see.  I found that windows-125X maps 132 and 147 to
> U+201E and U+201C.  So, perhaps those systems (galeon and
> lynx) parse them as U+201E and U+201C.  Anyway, how to
> encode them in X selection is their problem and Emacs can't
> do anything about it.

Yes, but once in the X selection I'd like to see Emacs honor them.

The spacing problem also occurs when I try to cut and paste from Markus
Kuhn's demo file
(http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt):

    • ‚deutsche‘ „Anführungszeichen“

When I insert (C-x RET c utf-8 RET C-x C-f UTF-8-demo.txt RET), things
are correctly displayed (the characters are different):

    • ‚deutsche‘ „Anführungszeichen“

Cutting and pasting both of these examples from Emacs (this mail buffer)
to a UTF-8 xterm doesn't work either; instead of the quotes I see "-1"
and garbage.

I hope the examples will go through.
* Re: Reporting UTF-8 related problems?
From: Karl Eichwalder @ 2002-07-30 19:51 UTC
Cc: eliz, emacs-devel, Andreas Schwab

Karl Eichwalder <keichwa@gmx.net> writes:

>          128  4   UNUSED
>          132  1   "My rising double quote left (low)"
>          133  14  UNUSED
>          147  1   "My rising double quote right (high)"
>          148  16  UNUSED
                ^^ 12 of course.  Do your math, Karl!
>          [...]

> Andreas Schwab <schwab@suse.de> writes:
>
>> The numbers are supposed to be ISO 8859-1 character codes.  I'd guess the
>> page has been written with some broken (a.k.a. W*nd*ws) software (the use
>> of *.htm makes this apparent).

BTW, nsgmls accepts the numeric character references even when marked as
UNUSED; try in a UTF-8 xterm:

    echo '
    <!DOCTYPE html PUBLIC "-//IETF//DTD HTML//EN">
    <html>
    <head><title></title></head>
    <body>&#132;Die Familie Schroffenstein&#147;</body>
    ' | nsgmls -c /usr/share/sgml/CATALOG.html /usr/share/sgml/html/html.decl - \
      | iconv -f windows-1252 -t utf-8
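The 16 versus 12 correction above is plain range arithmetic: each DESCSET row is a start code and a count, and the rows must cover 128 through 159 without gaps or overlap before the existing row starting at 160 takes over. A small Python sketch (the function name and the tuple layout are assumptions made for this example) checks both variants:

```python
def tiles_exactly(rows, first, end):
    """Return True if (start, count) rows cover [first, end) with no gap or overlap."""
    expected = first
    for start, count in rows:
        if start != expected:
            return False
        expected = start + count
    return expected == end


# Rows from the changed declaration, as originally posted (last count 16).
posted = [(128, 4), (132, 1), (133, 14), (147, 1), (148, 16)]
# Same rows with the corrected last count.
corrected = [(128, 4), (132, 1), (133, 14), (147, 1), (148, 12)]

print(tiles_exactly(posted, 128, 160))     # 148 + 16 = 164 runs past 159
print(tiles_exactly(corrected, 128, 160))  # 148 + 12 = 160 ends right at 159
```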
* Re: Reporting UTF-8 related problems?
From: Karl Eichwalder @ 2002-07-31 2:59 UTC
Cc: eliz, emacs-devel, Andreas Schwab

Karl Eichwalder <keichwa@gmx.net> writes:

> I hope the examples will go through.

They did -- at least I received the message as sent.  Gnus did a great
job even if it decided to use "macintosh" encodings for some parts.

Karl Eichwalder <keichwa@gmx.net> writes:

>>          128  4   UNUSED
>>          132  1   "My rising double quote left (low)"
>>          133  14  UNUSED
>>          147  1   "My rising double quote right (high)"
>>          148  16  UNUSED
>                 ^^ 12 of course.  Do your math, Karl!

> BTW, nsgmls accepts the numeric character references even when marked as
> UNUSED; try in a UTF-8 xterm:

I did some more digging.  This works because HTML does not say within
the SYNTAX clause that &#128;-&#159; are control characters; the DocBook
SGML declaration explicitly forbids these as data:

    SYNTAX
      SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
                        18 19 20 21 22 23 24 25 26 27 28 29 30 31 127
                        128 129 130 131 132 133 134 135 136 137 138 139
                        140 141 142 143 144 145 146 147 148 149 150 151
                        152 153 154 155 156 157 158 159
* Re: Reporting UTF-8 related problems?
From: Kenichi Handa @ 2002-07-31 12:26 UTC
Cc: eliz, emacs-devel, schwab

In article <sh65yxniuf.fsf@tux.gnu.franken.de>, Karl Eichwalder <keichwa@gmx.net> writes:

> Yes, but once in the X selection I'd like to see Emacs honor them.

> The spacing problem also occurs when I try to cut and paste from Markus
> Kuhn's demo file
> (http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt):

As far as I understand, that's not a spacing problem.  As
those clients send Emacs the designation sequence of
jisx0208 characters, Emacs just decodes them correctly (i.e.
honors them) and displays them with a Japanese double-width
font.

> When I insert (C-x RET c utf-8 RET C-x C-f UTF-8-demo.txt RET), things
> are correctly displayed (the characters are different):

That's because the file is correctly encoded in utf-8, thus
Emacs can decode it correctly.

> Cutting and pasting both of these examples from Emacs (this mail buffer)
> to a UTF-8 xterm doesn't work either; instead of the quotes I see "-1"
> and garbage.

Yes, because I have not yet installed code for encoding an
Emacs string into what a UTF-8 xterm expects.  I confirmed
that a UTF-8 xterm does request the target type UTF8_STRING
first.  I'm now looking for a way to handle it.

While tracing the whole procedure by which Emacs handles a
selection request, I found the following.  Could someone else
also check whether I missed something?

When Emacs receives a selection request,
x_handle_selection_request (xselect.c) is called.  The flow
is like this:

    x_handle_selection_request (EVENT)  -- xselect.c
      x_get_local_selection (SELECTION, TARGET_TYPE)  -- xselect.c
        xselect-convert-to-string (SELECTION, TARGET-TYPE, VALUE) -- select.el
          => returns MULTIBYTE-STRING
        => returns MULTIBYTE-STRING
      lisp_data_to_selection_data (EVENT, MULTIBYTE-STRING, ...)
        => returns encoded string
      x_reply_selection_request (EVENT, above returned encoded string)
        ;; sends selection data to the other client

So, it seems that we can perform the encoding in the Lisp
function xselect-convert-to-string, not in
lisp_data_to_selection_data.

BUT... xselect-convert-to-string is also called in this way:

    yank  -- simple.el
      current-kill  -- simple.el
        x-cut-buffer-or-selection-value  -- x-win.el
          x-get-selection  -- select.el
            Fx_get_selection_internal  -- xselect.c
              x_get_local_selection  -- xselect.c
                xselect-convert-to-string  -- select.el !!!

And, in the latter case, xselect-convert-to-string must
return an Emacs string without encoding it.  Currently,
xselect-convert-to-string has no way to know in which
situation it is called.

So, how about calling xselect-convert-to-string with
TARGET-TYPE nil in the latter case?  This can be done by
adding one more arg LOCAL-REQUEST to x_get_local_selection.

If the above analysis is correct, we can implement the
rather sensitive/delicate string handling code of
lisp_data_to_selection_data and x_encode_text in Lisp, which
makes Emacs' reaction to a selection request more flexible
and also makes future maintenance easier.

What do you think?

---
Ken'ichi HANDA
handa@etl.go.jp
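The proposal above boils down to one converter with two calling modes. The Python sketch below only models that idea and is not the Emacs code: the function name, the target table, and the codecs chosen per target are assumptions for illustration. A target type of None stands for the local yank path, where the string must come back untouched; any other target means another X client asked, so the string is encoded for that target.

```python
# Hypothetical table: X selection target -> Python codec used for encoding.
TARGET_CODECS = {
    "UTF8_STRING": "utf-8",  # UTF-8, per the XFree86 ICCCM paragraph quoted in the thread
    "STRING": "latin-1",     # ICCCM STRING is ISO 8859-1 based
}


def convert_to_string(text, target_type=None):
    """Return (target_type, data) for a selection value (illustrative model only).

    target_type None models a local request: the text is returned unencoded.
    """
    if target_type is None:
        return (None, text)
    codec = TARGET_CODECS.get(target_type)
    if codec is None:
        raise ValueError(f"unsupported target type: {target_type}")
    # Characters the target cannot represent are replaced rather than raising.
    return (target_type, text.encode(codec, errors="replace"))


print(convert_to_string("\u201eAnf\u00fchrungszeichen\u201c"))                 # local yank
print(convert_to_string("\u201eAnf\u00fchrungszeichen\u201c", "UTF8_STRING"))  # remote client
```

Keeping the decision in one place is the design point of the proposal: the caller signals which situation applies, and the encoding policy can then live entirely on the Lisp side.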
* Re: Reporting UTF-8 related problems?
From: Karl Eichwalder @ 2002-07-31 16:29 UTC
Cc: eliz, emacs-devel, schwab

Kenichi Handa <handa@etl.go.jp> writes:

> As far as I understand, that's not a spacing problem.  As
> those clients send Emacs the designation sequence of
> jisx0208 characters, Emacs just decodes them correctly (i.e.
> honors them) and displays them with a Japanese double-width
> font.

I will try to talk to the xterm/lynx and Galeon maintainers; thanks for
verifying the problem and for your commitment to solving it!

> That's because the file is correctly encoded in utf-8, thus
> Emacs can decode it correctly.

Okay, I guess it's best to stay away from using the X selection at
the moment.

I hope others can jump in and answer your confirmation request; my
knowledge of these issues is close to zero.
* Re: Reporting UTF-8 related problems?
From: Eli Zaretskii @ 2002-08-01 5:18 UTC
Cc: emacs-devel

On Wed, 31 Jul 2002, Kenichi Handa wrote:

> So, how about calling xselect-convert-to-string with
> TARGET-TYPE nil in the latter case?  This can be done by
> adding one more arg LOCAL-REQUEST to x_get_local_selection.
>
> If the above analysis is correct, we can implement the
> rather sensitive/delicate code for handling string in
> lisp_data_to_selection_data and x_encode_text in Lisp, which
> makes the Emacs' reaction to selection request more flexible
> and also makes the future maintanance easier.
>
> What do you think?

I think it's a good idea.
* Re: Reporting UTF-8 related problems?
From: Kenichi Handa @ 2002-08-14 1:21 UTC
Cc: emacs-devel

In article <Pine.SUN.3.91.1020801081750.20714A-100000@is>, Eli Zaretskii <eliz@is.elta.co.il> writes:

> On Wed, 31 Jul 2002, Kenichi Handa wrote:

>> So, how about calling xselect-convert-to-string with
>> TARGET-TYPE nil in the latter case?  This can be done by
>> adding one more arg LOCAL-REQUEST to x_get_local_selection.
>>
>> If the above analysis is correct, we can implement the
>> rather sensitive/delicate code for handling string in
>> lisp_data_to_selection_data and x_encode_text in Lisp, which
>> makes the Emacs' reaction to selection request more flexible
>> and also makes the future maintanance easier.
>>
>> What do you think?

> I think it's a good idea.

I've just committed this change.  I confirmed that this
works on X (now pasting from Emacs to UTF-8 xterm also
works), but I don't know if it doesn't break anything on
Windows/DOS.

2002-08-14  Kenichi Handa  <handa@etl.go.jp>

        * select.el (xselect-convert-to-string): If TYPE is non-nil,
        encode the selection data string.  Always return cons of type
        and string.
        (selection-converter-alist): Add (UTF8_STRING
        . xselect-convert-to-string).

2002-08-14  Kenichi Handa  <handa@etl.go.jp>

        * xselect.c (QUTF8_STRING): New variable.
        (symbol_to_x_atom): Pay attention to QUTF8_STRING.
        (x_atom_to_symbol): Likewise.
        (x_get_local_selection): New argument local_request.  If it is
        nonzero, call handler_fn with the second arg nil.
        (x_handle_selection_request): Call x_get_local_selection with
        local_request 0.
        (lisp_data_to_selection_data): Don't encode the string here.
        (Fx_get_selection_internal): Call x_get_local_selection with
        local_request 1.
        (syms_of_xselect): Intern and staticpro QUTF8_STRING.

        * xterm.c (x_term_init): Initialize dpyinfo->Xatom_UTF8_STRING.

        * xterm.h (struct x_display_info): New member Xatom_UTF8_STRING.

---
Ken'ichi HANDA
handa@etl.go.jp
* Re: Reporting UTF-8 related problems?
From: Karl Eichwalder @ 2002-11-03 20:21 UTC
Cc: eliz, schwab, emacs-devel

Kenichi Handa <handa@etl.go.jp> writes:

> I've just committed this change.  I confirmed that this
> works on X (now pasting from Emacs to UTF-8 xterm also
> works),

Yes, from Emacs to UTF-8 xterm works -- thanks for this enhancement ☺ .

The opposite direction still fails for me.  Letters also available within
the latin1 range are pasted correctly, but "ő" (o with long dots, used
in Hungarian) is shown as a hash mark ("#") only.

Pasting ő from UTF-8 xterm to mozilla works.

> but I don't know if it doesn't break anything on Windows/DOS.
* Re: Reporting UTF-8 related problems?
From: Karl Eichwalder @ 2002-11-04 4:56 UTC
Cc: eliz, schwab, emacs-devel

Sorry, Gnus decided to go for a strange encoding that my Emacs isn't
able to decode again.  Let's try again:

Kenichi Handa <handa@etl.go.jp> writes:

> I've just committed this change.  I confirmed that this
> works on X (now pasting from Emacs to UTF-8 xterm also
> works),

Yes, from Emacs to UTF-8 xterm works -- thanks for this enhancement ☺ .

The opposite direction still fails for me.  Letters also available within
the latin1 range are pasted correctly, but "ő" (o with long dots, used
in Hungarian) is shown as a hash mark ("#") only.

Pasting ő from UTF-8 xterm to mozilla works.

> but I don't know if it doesn't break anything on Windows/DOS.
* Re: Reporting UTF-8 related problems?
From: Richard Stallman @ 2002-07-29 17:29 UTC
Cc: keichwa, emacs-devel, handa

    The telltale ESC % sequence is the beginning of the ``extended
    segment'' in ICCCM parlance.  Emacs doesn't currently support UTF-8
    in the extended segments, but adding that support should be easy, I'd
    think.

If someone would like to work on this, can you give more detailed advice?