* X11 Compound Text vs ISO 2022
From: James Cloos @ 2010-07-06 16:21 UTC
To: emacs-devel

While testing my recently applied patch, I've discovered that Emacs will
produce ISO-2022 output for COMPOUND_TEXT which other libs and apps --
notably including libX11 -- cannot decode.

As an example, (encode-coding-string "•" 'compound-text) ; U+2022 BULLET
produces "^[$(O#@^[(B". '$(O' is ISO-IR 228¹, JIS X 0213:2000. But
libX11 only knows about the $( charsets: 0, 1, A-D and G-M.

A number of characters are output in '^[$-1'; such as:

  (encode-coding-string "ℜ" 'compound-text) ; U+211C BLACK-LETTER CAPITAL R
  "^[$-1\365\334^[-A"
  (encode-coding-string "ʻ" 'compound-text) ; U+02BB MODIFIER LETTER TURNED COMMA
  "^[$-1\244\333^[-A"

That is encoded in mule-unicode-0100-24ff, essentially unknown outside
Emacs.

Other libs/apps prefer to use utf-8³ in compound_text for such chars.

I understand *why* this happens, given that Emacs used to use 2022
internally, but it confuses other X11 apps.

I am not fully fluent in Emacs' internal charset conversion routines;
is there an easy way to tell it to limit which 2022 charsets it will
use when converting a string into a 2022 encoding? A better way?

I will be adding at least some of the charsets to libX11, provided I
can find the relevant mappings with X11-compatible licensing, but that
will not help current installations, nor those who, like Emacs, rolled
their own compound_text decoders.

-JimC

P.S. The libX11 src, in libX11/src/xlibi18n/lcCT.c, is the best
resource to know which 2022 charsets libX11 supports.

1] http://www.itscj.ipsj.or.jp/ISO-IR/228.pdf
2] http://www.itscj.ipsj.or.jp/ISO-IR/143.pdf
3] http://www.itscj.ipsj.or.jp/ISO-IR/196.pdf

--
James Cloos <cloos@jhcloos.com> OpenPGP: 1024D/ED7DAEA6
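The escape sequences quoted in the message above follow the general ISO 2022 shape: ESC, any number of intermediate bytes (0x20-0x2F), then one final byte (0x30-0x7E). That makes it possible to pull them out of an encoded string mechanically and see which designations Emacs emitted. A minimal sketch follows; the helper name `my-ctext-escape-sequences` is hypothetical and not part of Emacs.

```elisp
;; Hypothetical helper (not part of Emacs): list the ISO 2022 escape
;; sequences that occur in an already encoded, unibyte string.
(defun my-ctext-escape-sequences (encoded)
  "Return the ISO 2022 escape sequences found in ENCODED, in order.
An escape sequence is taken to be ESC, zero or more intermediate
bytes in the range 0x20-0x2F, and one final byte in 0x30-0x7E."
  (let ((pos 0)
        (found '()))
    ;; "\e" is ESC; [ -/] covers 0x20-0x2F; [0-~] covers 0x30-0x7E.
    (while (string-match "\e[ -/]*[0-~]" encoded pos)
      (push (match-string 0 encoded) found)
      (setq pos (match-end 0)))
    (nreverse found)))
```

Applied to the literal strings from the message, `(my-ctext-escape-sequences "\e$(O#@\e(B")` returns the two sequences `"\e$(O"` and `"\e(B"`, and `(my-ctext-escape-sequences "\e$-1\365\334\e-A")` returns `"\e$-1"` and `"\e-A"`. Passing the result of `encode-coding-string` instead of a literal shows what a given Emacs build actually emits.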
* Re: X11 Compound Text vs ISO 2022
From: David De La Harpe Golden @ 2010-07-06 20:18 UTC
To: emacs-devel; +Cc: James Cloos

On 06/07/10 17:21, James Cloos wrote:
> While testing my recently applied patch, I've discovered that Emacs will
> produce ISO-2022 output for COMPOUND_TEXT which other libs and apps --
> notably including libX11 -- cannot decode.
>
> As an example, (encode-coding-string "•" 'compound-text) ; U+2022 BULLET
> produces "^[$(O#@^[(B". '$(O' is ISO-IR 228¹, JIS X 0213:2000. But
> libX11 only knows about the $( charsets: 0, 1, A-D and G-M.
>
> A number of characters are output in '^[$-1'; such as:
>
> (encode-coding-string "ℜ" 'compound-text) ; U+211C BLACK-LETTER CAPITAL R
> "^[$-1\365\334^[-A"
> (encode-coding-string "ʻ" 'compound-text) ; U+02BB MODIFIER LETTER TURNED COMMA
> "^[$-1\244\333^[-A"
>
> That is encoded in mule-unicode-0100-24ff, essentially unknown outside
> Emacs.
>
> Other libs/apps prefer to use utf-8³ in compound_text for such chars.
>

Not really intimately familiar with the area [compound text seems to be
a bit of a horror in these days of unicode...]

But anyway, if emacs isn't using one of the character sets listed in the
table in sect. 4/5 of "the" spec [1] or utf-8 as per sect. 7, presumably
it's an emacs bug unless emacs has successfully "registered the encoding
with the X consortium" as per sect. 6 (and I don't see that happening...).

Conversely, if emacs is sending a charset that IS listed in the table in
sect. 4/5 or utf-8 as per sect. 7, then libX11 and other apps are "at
fault" if they don't recognise them.

[1] http://www.it.freebsd.org/pub/Unix/XFree86/WWW/htdocs/current/ctext.html

But err... the spec on freedesktop.org seems a lot older, not even
mentioning utf-8 ???

[2] http://cgit.freedesktop.org/xorg/doc/xorg-docs/tree/specs/CTEXT/ctext.tbl.ms
* Re: X11 Compound Text vs ISO 2022
From: James Cloos @ 2010-07-06 22:30 UTC
To: David De La Harpe Golden; +Cc: emacs-devel

>>>>> "DDLHG" == David De La Harpe Golden <david@harpegolden.net> writes:

DDLHG> But anyway, if emacs isn't using one of the character sets listed in
DDLHG> the table in sect. 4/5 of "the" spec [1] or utf-8 as per sect.7,
DDLHG> presumably it's an emacs bug unless emacs has successfully "registered
DDLHG> the encoding with the X consortium" as per sect. 6 (and I don't see
DDLHG> that happening...).

Exactly.

Xorg libX11 supports what is in that spec (including the utf8 which
was first added by XFree86, but was not added to the upstream spec), a
couple of other charsets "for compatibility with Xfree86 3.1" and two
sets which are "used by Emacs, but not backed by ISO-IR".

Xorg's luit app has its own 2022 encoder/decoder which supports a couple
of additional charsets, such as "DEC Special", "DEC Technical", four
KOI8 variations, cp125[012], cp437, cp850 and cp866. But it does not
use those for COMPOUND_TEXT, only as its internal encoding, much like
Emacs used to do.

DDLHG> Conversely, if emacs is sending a charset that IS listed in the table
DDLHG> in sect. 4/5 or utf-8 as per sect. 7, then libX11 and other apps are
DDLHG> "at fault" if they don't recognise them.

Emacs sends as COMPOUND_TEXT a 2022 encoding which appears to be exactly
what it used to use internally, rather than keeping to the ctext spec.

DDLHG> But err... the spec on freedesktop.org seems a lot older, not even
DDLHG> mentioning utf-8 ???

I think utf8 is the only significant difference between the upstream
Xorg spec and the Xfree86 modification. I vaguely recall the
discussions on the xfree86 list(s) when it was introduced (too many
years ago, [SIGH]). The EWMH spec and the UTF8_STRING format came
about, in part, out of that discussion, IIRC.

Emacs does need to limit what it is willing to encode in COMPOUND_TEXT,
and to use utf8-in-ctext for everything which is not in the 8859, GB,
JISX, KSC, CNS or BIG5 variants libX11 supports. I'd go a bit further
and prefer utf8 over the CJK encodings for characters which are not
part of a CJK string. (As an example, Emacs uses japanese-jisx0213-1
for U+2022 BULLET; it would be better to use utf-8 unless the BULLET
is in a string which was entered via the Japanese input method, or
LANG is ja_JP, or something of that sort.)

The question, then, is how best to do that?

-JimC
--
James Cloos <cloos@jhcloos.com> OpenPGP: 1024D/ED7DAEA6
* Re: X11 Compound Text vs ISO 2022
From: Stephen J. Turnbull @ 2010-07-07 0:36 UTC
To: James Cloos; +Cc: emacs-devel, David De La Harpe Golden

James Cloos writes:

> I think utf8 is the only significant difference between the upstream
> Xorg spec and the Xfree86 modification. I vaguely recall the
> discussions on the xfree86 list(s) when it was introduced (too many
> years ago, [SIGH]). The EWMH spec and the UTF8_STRING format came
> about, in part, out of that discussion, IIRC.

As of about 2004, the XFree86 spec was totally bogus (internally
contradictory on the subject of encoding some ISO 8859 coded character
sets), and the XFree86 implementation ignored it anyway in many cases.

> Emacs does need to limit what it is willing to encode in COMPOUND_TEXT,
> and to use utf8-in-ctext for everything which is not in the 8859, GB,
> JISX, KSC, CNS or BIG5 variants libX11 supports. I'd go a bit further
> and prefer utf8 over the CJK encodings for characters which are not
> part of a CJK string.

But that goes against the spec, which AFAIK still provides that in
COMPOUND_TEXT the escape to non-ISO-2022 should only be used for
characters not in the repertoires of the registered charsets:

    Extended segments are not to be used for any character set
    encoding that can be constructed from a GL/GR pair of approved
    standard encodings. For example, it is incorrect to use an
    extended segment for any of the ISO 8859 family of encodings.

I would argue that you have two choices here: consider the whole
string to be Unicode, and use an extended segment for the whole
thing; or consider the string to be pieced together from segments in
approved standard encodings, in which case a character that can be
represented in those encodings should be.

BTW, for the case of U+2022 BULLET using JIS X 0213, the most recent
spec I could find on the web doesn't admit JIS X 0213 (or JIS X 0212
for that matter).

> The question, then, is how best to do that?

Wouldn't it be better to avoid use of COMPOUND_TEXT targets? How many
apps prefer it to UTF8_STRING? So, for example, when asked for
supported targets Emacs could list UTF8_STRING first.
* Re: X11 Compound Text vs ISO 2022
From: James Cloos @ 2010-07-07 5:19 UTC
To: Stephen J. Turnbull; +Cc: emacs-devel, David De La Harpe Golden

>>>>> "SJT" == Stephen J Turnbull <stephen@xemacs.org> writes:

SJT> But that goes against the spec, which AFAIK still provides that in
SJT> COMPOUND_TEXT the escape to non-ISO-2022 should only be used for
SJT> characters not in the repertoires of the registered charsets:

SJT>     Extended segments are not to be used for any character set
SJT>     encoding that can be constructed from a GL/GR pair of approved
SJT>     standard encodings. For example, it is incorrect to use an
SJT>     extended segment for any of the ISO 8859 family of encodings.

SJT> I would argue that you have two choices here: consider the whole
SJT> string to be Unicode, and use an extended segment for the whole
SJT> thing; or consider the string to be pieced together from segments in
SJT> approved standard encodings, in which case a character that can be
SJT> represented in those encodings should be.

AFAICT, gtk and qt do the former, and that is really what I was
suggesting, except when there is reason for Emacs to believe that the
user may prefer the CJK set.

SJT> BTW, for the case of U+2022 BULLET using JIS X 0213, the most recent
SJT> spec I could find on the web doesn't admit JIS X 0213 (or JIS X 0212
SJT> for that matter).

Exactly the complaint. And even compound-text-with-extensions makes
that choice. I'm testing the latter now in xfns.c, but the ctext
charsets still need to avoid JIS X 0213.

Yes, that seems to fix everything except the usage of 0213.

>> The question, then, is how best to do that?

SJT> Wouldn't it be better to avoid use of COMPOUND_TEXT targets? How many
SJT> apps prefer it to UTF8_STRING? So, for example, when asked for
SJT> supported targets Emacs could list UTF8_STRING first.

Things are getting better, but ctext is still required for some
properties and for interactions with some other clients.

I'd prefer UTF8_STRING everywhere, but not to the extent of breaking
compatibility with the other clients I (and others) use.

-JimC
--
James Cloos <cloos@jhcloos.com> OpenPGP: 1024D/ED7DAEA6
* Re: X11 Compound Text vs ISO 2022
From: James Cloos @ 2010-07-07 19:51 UTC
To: Stephen J. Turnbull; +Cc: David De La Harpe Golden, emacs-devel

>>>>> "SJT" == Stephen J Turnbull <stephen@xemacs.org> writes:
>>>>> "JC" == James Cloos <cloos@jhcloos.com> writes:

SJT> I would argue that you have two choices here: consider the whole
SJT> string to be Unicode, and use an extended segment for the whole
SJT> thing; or consider the string to be pieced together from segments in
SJT> approved standard encodings, in which case a character that can be
SJT> represented in those encodings should be.

JC> AFAICT, gtk and qt do the former, and that is really what I was
JC> suggesting, except when there is reason for Emacs to believe that the
JC> user may prefer the CJK set.

I misstated that. Ctext should still use the 8859 sets where possible,
and the GB_2312-80, JIS_X0208-1983, JIS_X0208-1990, KS_C_5601-1987 and
CNS11643-1992 sets (but not JIS_X0213) for characters covered by Emacs'
'han script name symbol (as used by, eg, (set-fontset-font)) and utf8
for everything else.

Other apps which understand utf8 will (one hopes) prefer UTF8_STRING;
those which actually need ctext would then get maximum benefit.

As I wrote, ctext-with-extensions is almost there; eliminating 0213
from it should just about do it.

-JimC

P.S. Stephen: mail to your @xemacs address always bounces, and has for
some time.

--
James Cloos <cloos@jhcloos.com> OpenPGP: 1024D/ED7DAEA6
* Re: X11 Compound Text vs ISO 2022
From: David De La Harpe Golden @ 2010-07-08 0:24 UTC
To: Emacs developers; +Cc: James Cloos, Kenichi Handa

On 07/07/10 20:51, James Cloos wrote:
> I misstated that. Ctext should still use the 8859 sets where possible,
> and the GB_2312-80, JIS_X0208-1983, JIS_X0208-1990, KS_C_5601-1987 and
> CNS11643-1992 sets (but not JIS_X0213) for characters covered by Emacs'

Modifying function mule.el/ctext-non-standard-encodings-table with,
say*:

  (not (string-match "jisx0213" (symbol-name charset)))

in its charset-list-walking dolist does exclude that in particular
from consideration, yielding:

  (encode-coding-string "•" 'compound-text-with-extensions)
  "^[%G\342\200\242^[%@"

... But excluding the ones that are not expected to work seems
backwards when one could instead supply a shortlist of the ones that
can be expected to work. (I'm still a bit unclear why adding a
:charset-list with a shortlist to the definition of
compound-text-with-extensions didn't work; maybe something isn't
getting bound somewhere.)

[* similar to the existing line for excluding cns11643, which I think
was because "big5" was preferred, see "bzr log -r52413.1.843"/"bzr diff
-r52413.1.842..52413.1.843"]
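The shortlist idea in the message above can also be checked from the outside: given the escape sequences a receiver is expected to understand, flag anything else that shows up in the encoder's output. A minimal sketch follows. The helper name `my-ctext-unexpected-escapes` is hypothetical, and the small allow-list in the usage note contains only sequences that appear in this thread; it is an assumption for illustration, not the list from the ctext spec or from lcCT.c.

```elisp
;; Hypothetical helper (not part of Emacs): report escape sequences in
;; an encoded string that are not on a caller-supplied allow-list.
(defun my-ctext-unexpected-escapes (encoded allowed)
  "Return the escape sequences in ENCODED that are not in ALLOWED.
ALLOWED is a list of strings, each one complete escape sequence."
  (let ((pos 0)
        (bad '()))
    ;; ESC, intermediates 0x20-0x2F, final byte 0x30-0x7E.
    (while (string-match "\e[ -/]*[0-~]" encoded pos)
      (let ((seq (match-string 0 encoded))
            (end (match-end 0)))
        (unless (member seq allowed)
          (push seq bad))
        (setq pos end)))
    (nreverse bad)))
```

With the allow-list `'("\e(B" "\e-A" "\e%G" "\e%@")`, the string `"\e$(O#@\e(B"` from the first message yields `("\e$(O")`, while the UTF-8 form `"\e%G\342\200\242\e%@"` shown above yields `nil`. A fuller allow-list would have to be taken from whichever of the spec or lcCT.c one decides to follow.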
* Re: X11 Compound Text vs ISO 2022
From: James Cloos @ 2010-07-14 21:07 UTC
To: David De La Harpe Golden; +Cc: Kenichi Handa, Emacs developers

I used this as a --batch file to generate a list of how Emacs converts
each UCS code point to compound-text and compound-text-with-extensions:

;;; emacs -q --batch --script to-n-from-ctext.el
(setq num 0)
(while (< num 1114112)
  (princ (format "%04X\t%S\n" num
                 (decode-coding-string
                  (encode-coding-string (format "%c" num)
                                        'compound-text-with-extensions)
                  'compound-text)))
  (setq num (+ 1 num)))
;;;;;;

(Change 'compound-text-with-extensions to 'ctext to see how converting
to ctext works.)

The results of converting to ctext are:

,----< tab-separated charset count >
| ipa	6
| lao	94
| tibetan	193
| chinese-big5-1	415
| chinese-big5-2	29
| chinese-cns11643-1	2257
| chinese-cns11643-2	6594
| chinese-cns11643-3	5705
| chinese-cns11643-4	7217
| chinese-cns11643-5	8599
| chinese-cns11643-6	6384
| chinese-cns11643-7	6539
| arabic-digit	9
| chinese-gb2312	7299
| latin-iso8859-1	96
| latin-iso8859-13	3
| latin-iso8859-14	27
| latin-iso8859-15	2
| latin-iso8859-16	4
| latin-iso8859-2	57
| latin-iso8859-3	22
| latin-iso8859-4	35
| cyrillic-iso8859-5	93
| arabic-iso8859-6	48
| greek-iso8859-7	77
| hebrew-iso8859-8	30
| katakana-jisx0201	63
| japanese-jisx0208	316
| japanese-jisx0212	124
| japanese-jisx0213-1	507
| japanese-jisx0213-2	250
| korean-ksc5601	2907
| thai-tis620	96
| mule-unicode-0100-24ff	7851
| mule-unicode-2500-33ff	3005
| mule-unicode-e000-ffff	7219
| vietnamese-viscii-lower	46
| vietnamese-viscii-upper	46
`----

As you can see, that is of no value. It also fails to convert the vast
majority of non-bmp characters.

Converting to ctext-with-extensions gives somewhat better results:

,----< tab-separated charset count >
| latin-iso8859-1	96
| latin-iso8859-2	57
| latin-iso8859-3	22
| latin-iso8859-4	35
| cyrillic-iso8859-5	93
| arabic-iso8859-6	48
| greek-iso8859-7	77
| hebrew-iso8859-8	30
| thai-tis620	96
| latin-iso8859-13	3
| latin-iso8859-14	27
| latin-iso8859-15	2
| latin-iso8859-16	4
| katakana-jisx0201	63
| chinese-gb2312	7299
| japanese-jisx0208	316
| japanese-jisx0212	124
| korean-ksc5601	2907
| chinese-cns11643-1	2044
| chinese-cns11643-2	3307
| chinese-cns11643-3	1714
| chinese-cns11643-4	755
| chinese-cns11643-5	89
| chinese-cns11643-6	39
| chinese-cns11643-7	31
| utf-8	1093949
| japanese-jisx0213-1	507
| japanese-jisx0213-2	250
`----

As you can see, 8859-9 and 8859-10 are not generated, but that is
because all of their characters can be found in 8859-1 through -8 and
is therefore not a problem.

But japanese-jisx0213-1 and japanese-jisx0213-2 need to go; they are
simply unknown by other COMPOUND_TEXT users.

It is clear that the current definition of compound-text is wrong; I'd
replace it with the current compound-text-with-extensions and make that
an alias for backwards compatibility.

Then, we need to determine how to prevent Emacs from considering the
jisx0213-? charsets when converting to ctext. And, perhaps, to prefer
utf8 over the gb, cns, ksc, and jisx charsets when converting "narrow"
characters (and ambiguous characters when in a "narrow" or "non-cjk"
locale).

Handa-san already did some comparable work for font selection; what he
did there is also needed here.

-JimC
--
James Cloos <cloos@jhcloos.com> OpenPGP: 1024D/ED7DAEA6
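The message above does not show how the per-charset counts were derived from the script's output. One way to get a comparable breakdown is to tally the leading escape sequence of each encoded character rather than an Emacs charset name; that substitution is an assumption of this sketch, and the function name `my-ctext-tally` is hypothetical. Characters whose encoding starts without an escape sequence (ASCII, for instance) are counted under the key "none".

```elisp
;; Hypothetical sketch (not part of Emacs): count how often each leading
;; escape sequence is used when encoding the code points FROM..TO.
(defun my-ctext-tally (coding from to)
  "Return a hash table mapping leading escape sequence to a count.
Each code point from FROM to TO inclusive is encoded on its own with
CODING.  Encodings that do not begin with an escape sequence are
counted under the key \"none\"."
  (let ((table (make-hash-table :test 'equal))
        (c from))
    (while (<= c to)
      (let* ((encoded (encode-coding-string (string c) coding))
             ;; Anchor at the start of the string: only the first
             ;; designation is of interest here.
             (key (if (string-match "\\`\e[ -/]*[0-~]" encoded)
                      (match-string 0 encoded)
                    "none")))
        (puthash key (1+ (gethash key table 0)) table))
      (setq c (1+ c)))
    table))
```

For example, `(maphash (lambda (k n) (princ (format "%S\t%d\n" k n))) (my-ctext-tally 'compound-text-with-extensions #x2000 #x20FF))` prints one line per escape sequence for that range. Running it over all 1114112 code points, as the original script does, takes noticeably longer.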
* Re: X11 Compound Text vs ISO 2022
From: David De La Harpe Golden @ 2010-07-06 23:38 UTC
To: James Cloos; +Cc: emacs-devel

> A number of characters are output in '^[$-1'; such as:
> (encode-coding-string "ℜ" 'compound-text) ; U+211C BLACK-LETTER CAPITAL R
> "^[$-1\365\334^[-A"
> (encode-coding-string "ʻ" 'compound-text) ; U+02BB MODIFIER LETTER TURNED COMMA
> "^[$-1\244\333^[-A"

> That is encoded in mule-unicode-0100-24ff, essentially unknown outside
> Emacs.

But actually I think emacs should be using coding system
compound-text-with-extensions by default, not coding system
compound-text? At least if you haven't customized
selection-coding-system.

That does give different results from your examples in some cases:

  (encode-coding-string "ℜ" 'compound-text-with-extensions)
  "^[%G\342\204\234^[%@"
  (encode-coding-string "ʻ" 'compound-text-with-extensions)
  "^[%G\312\273^[%@"
* Re: X11 Compound Text vs ISO 2022
From: David De La Harpe Golden @ 2010-07-07 1:15 UTC
To: James Cloos; +Cc: emacs-devel

On 07/07/10 00:38, David De La Harpe Golden wrote:
> > A number of characters are output in '^[$-1'; such as:
> > (encode-coding-string "ℜ" 'compound-text) ; U+211C BLACK-LETTER CAPITAL R
> > "^[$-1\365\334^[-A"
> > (encode-coding-string "ʻ" 'compound-text) ; U+02BB MODIFIER LETTER TURNED COMMA
> > "^[$-1\244\333^[-A"
>
> > That is encoded in mule-unicode-0100-24ff, essentially unknown outside
> > Emacs.
>
> But actually I think emacs should be using coding system
> compound-text-with-extensions by default, not coding system
> compound-text? At least if you haven't customized
> selection-coding-system.

Well, that probably "solves" one thing, but what about the use of
JIS X 0213?

The definitions of coding systems compound-text and
compound-text-with-extensions say (around line 1445 of
lisp/international/mule-conf.el):

  :charset-list 'iso-2022

Which according to the define-coding-system doc means it thinks
compound-text supports "all iso-2022 charsets"...

So maybe it could/should be trimmed to only those iso-2022 charsets in
the compound text spec, though I'm not sure that naively adjusting
:charset-list will work right, especially for
compound-text-with-extensions (I just managed to segfault emacs playing
with it).
* Re: X11 Compound Text vs ISO 2022
From: James Cloos @ 2010-07-07 4:55 UTC
To: David De La Harpe Golden; +Cc: emacs-devel

>>>>> "DDLHG" == David De La Harpe Golden <david@harpegolden.net> writes:

DDLHG> But actually I think emacs should be using coding system
DDLHG> compound-text-with-extensions by default, not coding system
DDLHG> compound-text? At least if you haven't customized
DDLHG> selection-coding-system.

Perhaps for selections, but I was looking into weird results for the
WM_NAME and WM_ICON_NAME properties. (My patch to also set the _NET
version in the non-GTK case --- just like GTK already does --- fixes
that for window managers which support the _NET properties, but mine
does not, yet.)

For that, the C code explicitly uses compound-text, not
compound-text-with-extensions.

That difference may explain why I was unable to duplicate the issue
with selections.

So perhaps the fix is to use the -with-extensions variation in
x_set_name_internal()?

-JimC
--
James Cloos <cloos@jhcloos.com> OpenPGP: 1024D/ED7DAEA6
* Re: X11 Compound Text vs ISO 2022
From: Kenichi Handa @ 2010-07-29 12:36 UTC
To: James Cloos; +Cc: david, emacs-devel

Very sorry for the late response on this matter.

In article <m3zky4wpgw.fsf@carbon.jhcloos.org>, James Cloos <cloos@jhcloos.com> writes:

> While testing my recently applied patch, I've discovered that Emacs will
> produce ISO-2022 output for COMPOUND_TEXT which other libs and apps --
> notably including libX11 -- cannot decode.

> As an example, (encode-coding-string "•" 'compound-text) ; U+2022 BULLET
> produces "^[$(O#@^[(B". '$(O' is ISO-IR 228¹, JIS X 0213:2000. But
> libX11 only knows about the $( charsets: 0, 1, A-D and G-M.

> A number of characters are output in '^[$-1'; such as:

> (encode-coding-string "ℜ" 'compound-text) ; U+211C BLACK-LETTER CAPITAL R
> "^[$-1\365\334^[-A"
> (encode-coding-string "ʻ" 'compound-text) ; U+02BB MODIFIER LETTER TURNED COMMA
> "^[$-1\244\333^[-A"

> That is encoded in mule-unicode-0100-24ff, essentially unknown outside
> Emacs.

I admit that this behaviour is not good now. When I first implemented
ctext in Emacs, there was neither UTF8_STRING nor
CTEXT_with_UTF8_extended_segment. So, I added more character sets to
it for cut&paste between two running Emacses. As Emacs was the only
application that supported many character sets at that time, no one
complained about that behaviour of ctext. The other applications
couldn't handle that many characters anyway.

> Other libs/apps prefer to use utf-8³ in compound_text for such chars.

> I understand *why* this happens, given that Emacs used to use 2022
> internally, but it confuses other X11 apps.

Actually the latest Emacs (Emacs 23 and later) uses unicode
internally.

> I am not fully fluent in Emacs' internal charset conversion routines;
> is there an easy way to tell it to limit which 2022 charsets it will
> use when converting a string into a 2022 encoding? A better way?

It's fairly easy to limit the charsets of ctext. But I care about
backward compatibility. As ctext is the only coding system that is
compatible with iso-8859-1 and can encode many other character sets,
there will be old users who still use it for file/process encodings.

And, anyway, ctext is not used for selection; I'd rather just document
that ctext is not fully compatible with X's COMPOUND_TEXT spec, but is
an extended version.

For WM_NAME, etc, yes, we should use ctext-with-extensions, and as
ctext-with-extensions is not intended to be used directly by users, I
think it won't cause actual problems even if we change it so that more
characters are encoded using UTF8-extended-segment. So, I'll work on
it soon.

The only problem with ctext-with-extensions is that it is now
implemented in Elisp, and thus it may cause GC. I'm not sure it is
safe to call Lisp at the place where we convert WM_NAME etc. If it is
not safe, I'll implement ctext-with-extensions in C.

---
Kenichi Handa
handa@m17n.org
* Re: X11 Compound Text vs ISO 2022
From: James Cloos @ 2010-07-29 15:51 UTC
To: Kenichi Handa; +Cc: david, emacs-devel

>>>>> "KH" == Kenichi Handa <handa@m17n.org> writes:

KH> Very sorry for the late response on this matter.

That is OK.

KH> I added more
KH> character sets to it for cut&paste between two running
KH> Emacses.

Understood. And, now that you have mentioned it, I think I even
remember (vaguely) a post or a discussion about it from back then.

KH> It's fairly easy to limit charsets of ctext. But, I care
KH> the backward compatibility. As ctext is the only coding
KH> system that is compatible with iso-8859-1 and can encode
KH> many other character sets, there will be old users who still
KH> uses it for file/process encodings.

I was not aware of that.

KH> And, anyway ctext is not used for selection,

It has to be used for X selection, yes? How else could X selection of
text work than using data tagged with X's STRING, COMPOUND_TEXT or
UTF8_STRING atoms?

KH> I'd rather just document that ctext is not fully compatible X's
KH> COMPOUND_TEXT spec, but is the extended version.

KH> For WM_NAME, etc, yes, we should use ctext-with-extensions,
KH> and as ctext-with-extensions is not intended to be used
KH> directly by users, I think it won't cause actual problems
KH> even if we change it so that more characters are encoded
KH> using UTF8-extended-segment. So, I'll work on it soon.

ctext-with-extensions already supports the UTF8 extended segment; the
bug is that it uses JISX 0213 for some characters. The earlier JISX
versions (0201, 0208 and 0212) are OK, but 0213 is not.

KH> The only problem with ctext-with-extensions is that it is
KH> now implemented by Elisp, and thus it may cause GC. I'm not
KH> sure it is safe to call Lisp at the place we convert WM_NAME
KH> etc. If it is not safe, I'll implement
KH> ctext-with-extensions in C.

The WM_NAME code already has to gc protect to do the conversion to utf8
for the gtk call (when compiled for gtk) and the new code to set the
UTF8_STRING _NET_WM_NAME and _NET_WM_ICON_NAME properties; I presume it
could do a conversion to ctext-with-extensions within that same protect?

Then it just needs to prefer utf8 over jisx0213.

Thanks for looking at it.

-JimC
--
James Cloos <cloos@jhcloos.com> OpenPGP: 1024D/ED7DAEA6
* Re: X11 Compound Text vs ISO 2022
From: Kenichi Handa @ 2010-07-30 1:27 UTC
To: James Cloos; +Cc: emacs-devel, david

In article <m3tynii8vi.fsf@carbon.jhcloos.org>, James Cloos <cloos@jhcloos.com> writes:

KH> And, anyway ctext is not used for selection,

> It has to be used for X selection, yes?

No.

> How else could X selection of
> text work than using data tagged with X's STRING, COMPOUND_TEXT or
> UTF8_STRING atoms?

iso-8859-1, ctext-with-extensions, utf-8 respectively in this order.

KH> I'd rather just document that ctext is not fully compatible X's
KH> COMPOUND_TEXT spec, but is the extended version.

KH> For WM_NAME, etc, yes, we should use ctext-with-extensions,
KH> and as ctext-with-extensions is not intended to be used
KH> directly by users, I think it won't cause actual problems
KH> even if we change it so that more characters are encoded
KH> using UTF8-extended-segment. So, I'll work on it soon.

> ctext-with-extensions already supports the UTF8 extended segment; the
> bug is that it uses JISX 0213 for some characters. The earlier JISX
> versions (0201, 0208 and 0212) are OK, but 0213 is not.

Yes, I understand that. My intention is to modify (or fix)
ctext-with-extensions to use UTF8-extended-segment for characters that
don't belong to any of the few legacy character sets listed in the spec
of COMPOUND_TEXT.

KH> The only problem with ctext-with-extensions is that it is
KH> now implemented by Elisp, and thus it may cause GC. I'm not
KH> sure it is safe to call Lisp at the place we convert WM_NAME
KH> etc. If it is not safe, I'll implement
KH> ctext-with-extensions in C.

> the WM_NAME code already has to gc protect to do the conversion to utf8
> for the gtk call (when compiled for gtk) and the new code to set the
> UTF8_STRING _NET_WM_NAME and _NET_WM_ICON_NAME properties; I presume it
> could do a conversion to ctext-with-extensions within that same protect?

Ah, then, perhaps so.

By the way, the spec of COMPOUND_TEXT (included in xorg-docs-1.5) lists
these registered charsets:

  ISO8859-1
  ISO8859-2
  ISO8859-3
  ISO8859-4
  ISO8859-5
  ISO8859-6
  ISO8859-7
  ISO8859-8
  ISO8859-9
  JISX0201.1976-0
  GB2312.1980-0
  JISX0208.1983-0
  KSC5601.1987-0

but libX11-1.3.2/src/xlibi18n/lcCT.c lists many more charsets (e.g.
more 8859 series and all CNS11643 series).

I think it is better to follow the above spec than lcCT.c. What do you
think?

---
Kenichi Handa
handa@m17n.org
* Re: X11 Compound Text vs ISO 2022
From: James Cloos @ 2010-07-30 18:46 UTC
To: Kenichi Handa; +Cc: emacs-devel, david

>>>>> "KH" == Kenichi Handa <handa@m17n.org> writes:

KH> iso-8859-1, ctext-with-extensions, utf-8 respectively in
KH> this order.

D'Oh. I read COMPOUND_TEXT even though you wrote ctext. :(

KH> By the way, the spec of COMPOUND_TEXT (included in
KH> xorg-docs-1.5) lists these registered charsets.

KH> ISO8859-1 through ISO8859-9
KH> JISX0201.1976-0
KH> GB2312.1980-0
KH> JISX0208.1983-0
KH> KSC5601.1987-0

KH> but libX11-1.3.2/src/xlibi18n/lcCT.c lists many more
KH> charsets (e.g. more 8859 series and all CNS11643 series).

KH> I think it is better to follow the above spec than lcCT.c.
KH> What do you think?

The changes to the code go back many, many years, and I (as an Xorg
member) expect that the spec will be updated to match the code.

So I'd follow the Xlib code rather than the spec.

-JimC
--
James Cloos <cloos@jhcloos.com> OpenPGP: 1024D/ED7DAEA6
* Re: X11 Compound Text vs ISO 2022
From: Stephen J. Turnbull @ 2010-08-01 9:35 UTC
To: James Cloos; +Cc: david, emacs-devel, Kenichi Handa

James Cloos writes:

> The changes to the code go back many, many years, and I (as an Xorg
> member) expect that the spec will be updated to match the code.

Really? I can't say I'm terribly impressed with X.org's response to
complaints that specs and behavior don't match.
* Re: X11 Compound Text vs ISO 2022
From: James Cloos @ 2010-08-01 11:06 UTC
To: Stephen J. Turnbull; +Cc: david, emacs-devel, Kenichi Handa

>>>>> "SJT" == Stephen J Turnbull <stephen@xemacs.org> writes:

>> The changes to the code go back many, many years, and I (as an Xorg
>> member) expect that the spec will be updated to match the code.

SJT> Really?  I can't say I'm terribly impressed with X.org's response
SJT> to complaints that specs and behavior don't match.

The compound text spec shows a 1989 copyright, states that it is
version 1.1 documenting X Version 11 Release 6.8 and notes that
it might be expanded in the future.

The ctext additions were clearly added to support additional locales.

And the expansion has stopped thanks to the general shift from
iso-2022 to iso-10646.

Clearly the ctext spec should have followed along with the reference
code just like, eg, the elisp manual follows the code.

I think it is more than fair to update the ctext spec to document the
current reference code, especially now that the document is old enough
to purchase alcohol in its home country. :)

-JimC
-- 
James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6
* Re: X11 Compound Text vs ISO 2022
From: Stephen J. Turnbull @ 2010-08-02 8:14 UTC
To: James Cloos; +Cc: emacs-devel

James Cloos writes:

 > I think it is more than fair to update the ctext spec to document the
 > current reference code, especially now that the document is old enough
 > to purchase alcohol in its home country. :)

Sure.  But do we have to wait for the diamond anniversary?
* Re: X11 Compound Text vs ISO 2022
From: Kenichi Handa @ 2010-08-06 12:50 UTC
To: James Cloos; +Cc: stephen, david, emacs-devel

In article <m362zulhh5.fsf@carbon.jhcloos.org>, James Cloos <cloos@jhcloos.com> writes:

> The compound text spec shows a 1989 copyright, states that it is
> version 1.1 documenting X Version 11 Release 6.8 and notes that
> it might be expanded in the future.

> The ctext additions were clearly added to support additional locales.

> And the expansion has stopped thanks to the general shift from
> iso-2022 to iso-10646.

> Clearly the ctext spec should have followed along with the reference
> code just like, eg, the elisp manual follows the code.

> I think it is more than fair to update the ctext spec to document the
> current reference code, especially now that the document is old enough
> to purchase alcohol in its home country. :)

I've just committed new code to make ctext-with-extensions
conform to X's Compound Text spec.

As for which charsets to treat as "the standard encodings",
I made a variable ctext-standard-encodings and, for the
moment, set its default value to the following (i.e. following
the current (old) SPEC).

  ascii latin-jisx0201 katakana-jisx0201
  latin-iso8859-1 latin-iso8859-2 latin-iso8859-3 latin-iso8859-4
  greek-iso8859-7 arabic-iso8859-6 hebrew-iso8859-8 cyrillic-iso8859-5
  latin-iso8859-9 chinese-gb2312 japanese-jisx0208 korean-ksc5601

If we actually find that it is better to follow the current
CODE, we can just add more charsets to the variable.

The new code was committed to the emacs-23 branch, but I don't
know when it will be propagated to the trunk.

---
Kenichi Handa
handa@m17n.org
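
[Illustration, not part of the original message.]  The "just add more
charsets to the variable" step could look roughly like the init-file
fragment below.  It is only a sketch under stated assumptions: it
assumes an Emacs that already contains this change, and
latin-iso8859-15 is an arbitrary pick for illustration (the thread
only says that lcCT.c knows "more 8859 series" charsets).

```elisp
;; Sketch only: treat one more charset as a "standard" Compound Text
;; encoding when encoding with ctext-with-extensions.
;; Assumption: latin-iso8859-15 is chosen purely as an example; the
;; thread does not say which extra charsets should be added.
(when (boundp 'ctext-standard-encodings)
  ;; Append, so the charsets from the default list keep their order.
  (add-to-list 'ctext-standard-encodings 'latin-iso8859-15 t))

;; Encode a test string to inspect the result by hand.  No expected
;; output is shown because the bytes depend on the Emacs version and
;; on the current charset priority.
(encode-coding-string "€" 'ctext-with-extensions)
```

Guarding with boundp keeps the fragment harmless on an Emacs that
predates the variable.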
* Re: X11 Compound Text vs ISO 2022
From: James Cloos @ 2010-08-08 9:47 UTC
To: Kenichi Handa; +Cc: stephen, emacs-devel, david

>>>>> "KH" == Kenichi Handa <handa@m17n.org> writes:

KH> I've just committed new code to make ctext-with-extensions
KH> conform to X's Compound Text spec.

Thanks.

I read through the patch on the diffs list; everything looked right.

KH> As for which charsets to treat as "the standard encodings", I made a
KH> variable ctext-standard-encodings and set the default value to this
KH> at the moment (i.e. following the current (old) SPEC).

Good idea!

KH> The new code was committed to emacs-23 branch, but I don't
KH> know when it is propagated to the trunk.

I will probably wait until it is merged into trunk to test; all of my
systems currently run trunk or a packaged snapshot thereof.

-JimC
-- 
James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6
* Re: X11 Compound Text vs ISO 2022
From: Kenichi Handa @ 2010-08-09 1:49 UTC
To: James Cloos; +Cc: stephen, emacs-devel, david

In article <m3tyn5qvu7.fsf@carbon.jhcloos.org>, James Cloos <cloos@jhcloos.com> writes:

>>>>>> "KH" == Kenichi Handa <handa@m17n.org> writes:
KH> I've just committed new code to make ctext-with-extensions
KH> conform to X's Compound Text spec.
[...]
> I read through the patch on the diffs list; everything looked right.

Thank you very much for double checking the change.

---
Kenichi Handa
handa@m17n.org