X11 Compound Text vs ISO 2022

unofficial mirror of emacs-devel@gnu.org 

* X11 Compound Text vs ISO 2022
@ 2010-07-06 16:21 James Cloos
  2010-07-06 20:18 ` David De La Harpe Golden
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: James Cloos @ 2010-07-06 16:21 UTC (permalink / raw)
  To: emacs-devel

While testing my recently applied patch, I've discovered that Emacs will
produce ISO-2022 output for COMPOUND_TEXT which other libs and apps --
notably including libX11 -- cannot decode.

As an example, (encode-coding-string "•" 'compound-text) ; U+2022 BULLET
produces "^[$(O#@^[(B".  '$(O' is ISO-IR 228¹, JIS X 0213:2000.  But
libX11 only knows about the $( charsets:  0, 1, A-D and G-M.
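
To make the mismatch concrete, here is a minimal sketch that checks an
ESC $ ( F designation against the final bytes listed above.  The list is
copied from this message, not extracted from lcCT.c, and the my- names
are made up for illustration.

```elisp
;; Final bytes of the ESC $ ( F designations that libX11 is said to
;; understand above: 0, 1, A-D and G-M.  (Assumption: taken from this
;; message, not generated from lcCT.c.)
(defconst my-libx11-known-finals (append "01ABCDGHIJKLM" nil))

(defun my-libx11-knows-designation-p (escape-sequence)
  "Return t if ESCAPE-SEQUENCE is ESC $ ( F with a known final byte F."
  (and (string-match "\\`\e\\$(\\(.\\)\\'" escape-sequence)
       (memq (aref (match-string 1 escape-sequence) 0)
             my-libx11-known-finals)
       t))

(my-libx11-knows-designation-p "\e$(O") ; => nil, the BULLET case above
(my-libx11-knows-designation-p "\e$(B") ; => t
```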

A number of characters are output in '^[$-1'; such as:

(encode-coding-string "ℜ" 'compound-text) ; U+211C BLACK-LETTER CAPITAL R
"^[$-1\365\334^[-A"
(encode-coding-string "ʻ" 'compound-text) ; U+02BB MODIFIER LETTER TURNED COMMA
"^[$-1\244\333^[-A"

That is encoded in mule-unicode-0100-24ff, essentially unknown outside
Emacs.
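
The two byte pairs above are consistent with a simple layout.  The
following is only a sketch, assuming mule-unicode-0100-24ff is a 96x96
set whose positions count up from U+0100 and whose bytes are sent with
the high bit set (GR).  The layout is inferred from these two examples
rather than taken from the Emacs sources, and the function name is
hypothetical.

```elisp
;; Assumption: position = CHAR - #x100, split into a row and a column
;; of 96 entries each, offset by 32 and sent in GR (bit 7 set).
(defun my-mule-unicode-0100-bytes (char)
  "Return the two GR bytes for CHAR (U+0100..U+24FF) as a list."
  (let ((offset (- char #x100)))
    (list (logior #x80 (+ 32 (/ offset 96)))
          (logior #x80 (+ 32 (% offset 96))))))

(my-mule-unicode-0100-bytes #x211C) ; => (245 220), i.e. \365 \334
(my-mule-unicode-0100-bytes #x02BB) ; => (164 219), i.e. \244 \333
```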

Other libs/apps prefer to use utf-8³ in compound_text for such chars.

I understand *why* this happens, given that Emacs used to use 2022
internally, but it confuses other X11 apps.

I am not fully fluent in Emacs' internal charset conversion routines;
is there an easy way to tell it to limit which 2022 charsets it will
use when converting a string into a 2022 encoding?  A better way?

I will be adding at least some of the charsets to libX11, provided I can
find the relevant mappings with X11-compatible licensing, but that will
not help current installations, nor those who, like Emacs, rolled their
own compound_text decoders.

-JimC

P.S.  The libX11 src, in libX11/src/xlibi18n/lcCT.c, is the best
      resource to know which 2022 charsets libX11 supports.

1] http://www.itscj.ipsj.or.jp/ISO-IR/228.pdf
2] http://www.itscj.ipsj.or.jp/ISO-IR/143.pdf
3] http://www.itscj.ipsj.or.jp/ISO-IR/196.pdf

-- 
James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: X11 Compound Text vs ISO 2022
  2010-07-06 16:21 X11 Compound Text vs ISO 2022 James Cloos
@ 2010-07-06 20:18 ` David De La Harpe Golden
  2010-07-06 22:30   ` James Cloos
  2010-07-06 23:38 ` David De La Harpe Golden
  2010-07-29 12:36 ` Kenichi Handa
  2 siblings, 1 reply; 21+ messages in thread
From: David De La Harpe Golden @ 2010-07-06 20:18 UTC (permalink / raw)
  To: emacs-devel; +Cc: James Cloos

On 06/07/10 17:21, James Cloos wrote:
> While testing my recently applied patch, I've discovered that Emacs will
> produce ISO-2022 output for COMPOUND_TEXT which other libs and apps --
> notably including libX11 -- cannot decode.
>
> As an example, (encode-coding-string "•" 'compound-text) ; U+2022 BULLET
> produces "^[$(O#@^[(B".  '$(O' is ISO-IR 228¹, JIS X 0213:2000.  But
> libX11 only knows about the $( charsets:  0, 1, A-D and G-M.
>
> A number of characters are output in '^[$-1'; such as:
>
> (encode-coding-string "ℜ" 'compound-text) ; U+211C BLACK-LETTER CAPITAL R
> "^[$-1\365\334^[-A"
> (encode-coding-string "ʻ" 'compound-text) ; U+02BB MODIFIER LETTER TURNED COMMA
> "^[$-1\244\333^[-A"
>
> That is encoded in mule-unicode-0100-24ff, essentially unknown outside
> Emacs.
>
> Other libs/apps prefer to use utf-8³ in compound_text for such chars.
>

Not really intimately familiar with the area [compound text seems to be 
a bit of a horror in these days of unicode...]

But anyway, if emacs isn't using one of the character sets listed in the 
table in sect. 4/5 of "the" spec [1] or utf-8 as per sect.7, presumably 
it's an emacs bug unless emacs has successfully "registered the encoding 
with the X consortium" as per sect. 6 (and I don't see that happening...).

Conversely, if emacs is sending a charset that IS listed in the table
in sect. 4/5 or utf-8 as per sect. 7, then libX11 and other apps are "at 
fault" if they don't recognise them.


[1]

http://www.it.freebsd.org/pub/Unix/XFree86/WWW/htdocs/current/ctext.html

But err... the spec on freedesktop.org seems a lot older, not even 
mentioning utf-8 ???

[2]
http://cgit.freedesktop.org/xorg/doc/xorg-docs/tree/specs/CTEXT/ctext.tbl.ms

* Re: X11 Compound Text vs ISO 2022
  2010-07-06 20:18 ` David De La Harpe Golden
@ 2010-07-06 22:30   ` James Cloos
  2010-07-07  0:36     ` Stephen J. Turnbull
  0 siblings, 1 reply; 21+ messages in thread
From: James Cloos @ 2010-07-06 22:30 UTC (permalink / raw)
  To: David De La Harpe Golden; +Cc: emacs-devel

>>>>> "DDLHG" == David De La Harpe Golden <david@harpegolden.net> writes:

DDLHG> But anyway, if emacs isn't using one of the character sets listed in
DDLHG> the table in sect. 4/5 of "the" spec [1] or utf-8 as per sect.7,
DDLHG> presumably it's an emacs bug unless emacs has successfully "registered
DDLHG> the encoding with the X consortium" as per sect. 6 (and I don't see
DDLHG> that happening...).

Exactly.  Xorg libX11 supports what is in that spec (including the utf8
which was first added by XFree86, but was not added to the upstream spec),
a couple of other charsets "for compatability with Xfree86 3.1" and two sets
which are "used by Emacs, but not backed by ISO-IR".

Xorg's luit app has its own 2022 encoder/decoder which supports a couple
of additional charsets, such as "DEC Special", "DEC Technical", four
KOI8 variations, cp125[012], cp437, cp850 and cp866.  But it does not
use those for COMPOUND_TEXT, only as its internal encoding, much like
Emacs used to do.

DDLHG> Conversely, if emacs is sending a charset that IS listed in the table
DDLHG> in sect. 4/5 or utf-8 as per sect. 7, then libX11 and other apps are
DDLHG> "at fault" if they don't recognise them.

Emacs sends as COMPOUND_TEXT a 2022 encoding which appears to be exactly
what it used to use internally, rather than keeping to the ctext spec.

DDLHG> But err... the spec on freedesktop.org seems a lot older, not even
DDLHG> mentioning utf-8 ???

I think utf8 is the only significant difference between the upstream
Xorg spec and the Xfree86 modification.  I vaguely recall the
discussions on the xfree86 list(s) when it was introduced (too many
years ago, [SIGH]).  The EWMH spec and the UTF8_STRING format came
about, in part, out of that discussion, IIRC.

Emacs does need to limit what it is willing to encode in COMPOUND_TEXT,
and to use utf8-in-ctext for everything which is not in the 8859, GB,
JISX, KSC, CNS or BIG5 variants libX11 supports.  I'd go a bit further
and prefer utf8 over the CJK encodings for characters which are not
part of a CJK string.  (As an example, Emacs uses japanese-jisx0213-1
for U+2022 BULLET; it would be better to use utf-8 unless the
BULLET is in a string which was entered via the Japanese input
method, or LANG is ja_JP, or something of that sort.)

The question, then, is how best to do that?

-JimC
-- 
James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6


* Re: X11 Compound Text vs ISO 2022
  2010-07-06 16:21 X11 Compound Text vs ISO 2022 James Cloos
  2010-07-06 20:18 ` David De La Harpe Golden
@ 2010-07-06 23:38 ` David De La Harpe Golden
  2010-07-07  1:15   ` David De La Harpe Golden
  2010-07-07  4:55   ` James Cloos
  2010-07-29 12:36 ` Kenichi Handa
  2 siblings, 2 replies; 21+ messages in thread
From: David De La Harpe Golden @ 2010-07-06 23:38 UTC (permalink / raw)
  To: James Cloos; +Cc: emacs-devel

 > A number of characters are output in '^[$-1'; such as:
 > (encode-coding-string "ℜ" 'compound-text) ; U+211C BLACK-LETTER CAPITAL R
 > "^[$-1\365\334^[-A"
 > (encode-coding-string "ʻ" 'compound-text) ; U+02BB MODIFIER LETTER TURNED COMMA
 > "^[$-1\244\333^[-A"

 > That is encoded in mule-unicode-0100-24ff, essentially unknown outside
 > Emacs.

But actually I think emacs should be using coding system
compound-text-with-extensions by default, not coding system
compound-text?  At least if you haven't customized
selection-coding-system.

That does give different results to your examples in some cases:

(encode-coding-string "ℜ" 'compound-text-with-extensions)
"^[%G\342\204\234^[%@"
(encode-coding-string "ʻ" 'compound-text-with-extensions)
"^[%G\312\273^[%@"
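
Those two results are just the UTF-8 bytes of the character wrapped in
ESC % G ... ESC % @.  A minimal sketch, assuming that framing is all the
UTF-8 case does (my-utf8-in-ctext is a hypothetical helper, not part of
Emacs):

```elisp
(defun my-utf8-in-ctext (string)
  "Wrap the UTF-8 encoding of STRING in ESC % G ... ESC % @."
  (concat "\e%G" (encode-coding-string string 'utf-8) "\e%@"))

;; Compare with the two results quoted above.
(equal (my-utf8-in-ctext "ℜ") "\e%G\342\204\234\e%@") ; => t
(equal (my-utf8-in-ctext "ʻ") "\e%G\312\273\e%@")     ; => t
```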







* Re: X11 Compound Text vs ISO 2022
  2010-07-06 22:30   ` James Cloos
@ 2010-07-07  0:36     ` Stephen J. Turnbull
  2010-07-07  5:19       ` James Cloos
  0 siblings, 1 reply; 21+ messages in thread
From: Stephen J. Turnbull @ 2010-07-07  0:36 UTC (permalink / raw)
  To: James Cloos; +Cc: emacs-devel, David De La Harpe Golden

James Cloos writes:

 > I think utf8 is the only significant difference between the upstream
 > Xorg spec and the Xfree86 modification.  I vaguely recall the
 > discussions on the xfree86 list(s) when it was introduced (too many
 > years ago, [SIGH]).  The EWMH spec and the UTF8_STRING format came
 > about, in part, out of that discussion, IIRC.

As of about 2004, the XFree86 spec was totally bogus (internally
contradictory on the subject of encoding some ISO 8859 coded character
sets), and the XFree86 implementation ignored it anyway in many cases.

 > Emacs does need to limit what it is willing to encode in COMPOUND_TEXT,
 > and to use utf8-in-ctext for everything which is not in the 8859, GB,
 > JISX, KSC, CNS or BIG5 variants libX11 supports.  I'd go a bit further
 > and prefer utf8 over the CJK encodings for characters which are not
 > part of a CJK string.

But that goes against the spec, which AFAIK still provides that in
COMPOUND_TEXT the escape to non-ISO-2022 should only be used for
characters not in the repertoires of the registered charsets:

    Extended segments are not to be used for any character set
    encoding that can be constructed from a GL/GR pair of approved
    standard encodings. For example, it is incorrect to use an
    extended segment for any of the ISO 8859 family of encodings.

I would argue that you have two choices here: consider the whole
string to be Unicode, and use an extended segment for the whole
thing; or consider the string to be pieced together from segments in
approved standard encodings, in which case a character that can be
represented in those encodings should be.

BTW, for the case of U+2022 BULLET using JIS X 0213, the most recent spec
I could find on the web doesn't admit JIS X 0213 (or JIS X 0212 for
that matter).

 > The question, then, is how best to do that?

Wouldn't it be better to avoid use of COMPOUND_TEXT targets?  How many
apps prefer it to UTF8_STRING?  So, for example, when asked for
supported targets Emacs could list UTF8_STRING first.


* Re: X11 Compound Text vs ISO 2022
  2010-07-06 23:38 ` David De La Harpe Golden
@ 2010-07-07  1:15   ` David De La Harpe Golden
  2010-07-07  4:55   ` James Cloos
  1 sibling, 0 replies; 21+ messages in thread
From: David De La Harpe Golden @ 2010-07-07  1:15 UTC (permalink / raw)
  To: James Cloos; +Cc: emacs-devel

On 07/07/10 00:38, David De La Harpe Golden wrote:
>  > A number of characters are output in '^[$-1'; such as:
>  > (encode-coding-string "ℜ" 'compound-text) ; U+211C BLACK-LETTER CAPITAL R
>  > "^[$-1\365\334^[-A"
>  > (encode-coding-string "ʻ" 'compound-text) ; U+02BB MODIFIER LETTER TURNED COMMA
>  > "^[$-1\244\333^[-A"
>
>  > That is encoded in mule-unicode-0100-24ff, essentially unknown outside
>  > Emacs.
>
> But actually I think emacs should be using coding system
> compound-text-with-extensions by default, not coding system
> compound-text? At least if you haven't customized
> selection-coding-system.



Well, that probably "solves" one thing, but what about the use of JIS X 0213?

The definition of coding systems compound-text and 
compound-text-with-extensions say (around line 1445 of 
lisp/international/mule-conf.el):

:charset-list 'iso-2022

Which according to the define-coding-system doc means it thinks compound-text 
supports "all iso-2022 charsets"...
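
One way to check what a given Emacs build has recorded is to ask the
coding system directly.  This is only a sketch; the values printed
depend on the Emacs version, so none are shown here.

```elisp
;; `coding-system-charset-list' returns the :charset-list a coding
;; system was defined with.  Per its doc string, the symbol `iso-2022'
;; stands for "all ISO-2022 charsets" rather than an explicit list.
(dolist (cs '(compound-text compound-text-with-extensions))
  (message "%s: %S" cs (coding-system-charset-list cs)))
```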

So maybe it could/should be trimmed to only those iso-2022 charsets in 
the compound text spec, though I'm not sure that naively adjusting 
:charset-list will work right, especially for 
compound-text-with-extensions  (I just managed to segfault emacs playing 
with it).

* Re: X11 Compound Text vs ISO 2022
  2010-07-06 23:38 ` David De La Harpe Golden
  2010-07-07  1:15   ` David De La Harpe Golden
@ 2010-07-07  4:55   ` James Cloos
  1 sibling, 0 replies; 21+ messages in thread
From: James Cloos @ 2010-07-07  4:55 UTC (permalink / raw)
  To: David De La Harpe Golden; +Cc: emacs-devel

>>>>> "DDLHG" == David De La Harpe Golden <david@harpegolden.net> writes:

DDLHG> But actually I think emacs should be using coding system
DDLHG> compound-text-with-extensions by default, not coding system
DDLHG> compound-text?  At least if you haven't customized
DDLHG> selection-coding-system.

Perhaps for selections, but I was looking into weird results for the
WM_NAME and WM_ICON_NAME properties.  (My patch to also set the _NET
version in the non-GTK case --- just like GTK already does --- fixes
that for window managers which support the _NET properties, but mine
does not, yet.)  For that, the C code explicitly uses compound-text,
not compound-text-with-extensions.

That difference may explain why I was unable to duplicate the issue
with selections.

So perhaps the fix is to use the -with-extensions variation in
x_set_name_internal()?

-JimC
-- 
James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6


* Re: X11 Compound Text vs ISO 2022
  2010-07-07  0:36     ` Stephen J. Turnbull
@ 2010-07-07  5:19       ` James Cloos
  2010-07-07 19:51         ` James Cloos
  0 siblings, 1 reply; 21+ messages in thread
From: James Cloos @ 2010-07-07  5:19 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-devel, David De La Harpe Golden

>>>>> "SJT" == Stephen J Turnbull <stephen@xemacs.org> writes:

SJT> But that goes against the spec, which AFAIK still provides that in
SJT> COMPOUND_TEXT the escape to non-ISO-2022 should only be used for
SJT> characters not in the repertoires of the registered charsets:

SJT>     Extended segments are not to be used for any character set
SJT>     encoding that can be constructed from a GL/GR pair of approved
SJT>     standard encodings. For example, it is incorrect to use an
SJT>     extended segment for any of the ISO 8859 family of encodings.

SJT> I would argue that you have two choices here: consider the whole
SJT> string to be Unicode, and use an extended segment for the whole
SJT> thing; or consider the string to be pieced together from segments in
SJT> approved standard encodings, in which case a character that can be
SJT> represented in those encodings should be.

AFAICT, gtk and qt do the former, and that is really what I was
suggesting, except when there is reason for Emacs to believe that the
user may prefer the CJK set.

SJT> BTW, for the case of U+2022 BULLET using JIS X 0213, the most recent spec
SJT> I could find on the web doesn't admit JIS X 0213 (or JIS X 0212 for
SJT> that matter).

Exactly the complaint.  And even compound-text-with-extensions makes
that choice.  I'm testing the latter now in xfns.c, but the ctext
charsets still need to avoid JIS X 0213.

Yes, that seems to fix everything except the usage of 0213.

>> The question, then, is how best to do that?

SJT> Wouldn't it be better to avoid use of COMPOUND_TEXT targets?  How many
SJT> apps prefer it to UTF8_STRING?  So, for example, when asked for
SJT> supported targets Emacs could list UTF8_STRING first.

Things are getting better, but ctext is still required for some
properties and for interactions with some other clients.  I'd prefer
UTF8_STRING everywhere, but not to the extent of breaking compatibility
with the other clients I (and others) use.  

-JimC
-- 
James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6


* Re: X11 Compound Text vs ISO 2022
  2010-07-07  5:19       ` James Cloos
@ 2010-07-07 19:51         ` James Cloos
  2010-07-08  0:24           ` David De La Harpe Golden
  0 siblings, 1 reply; 21+ messages in thread
From: James Cloos @ 2010-07-07 19:51 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: David De La Harpe Golden, emacs-devel

>>>>> "SJT" == Stephen J Turnbull <stephen@xemacs.org> writes:
>>>>> "JC" == James Cloos <cloos@jhcloos.com> writes:

SJT> I would argue that you have two choices here: consider the whole
SJT> string to be Unicode, and use an extended segment for the whole
SJT> thing; or consider the string to be pieced together from segments in
SJT> approved standard encodings, in which case a character that can be
SJT> represented in those encodings should be.

JC> AFAICT, gtk and qt do the former, and that is really what I was
JC> suggesting, except when there is reason for Emacs to believe that the
JC> user may prefer the CJK set.

I misstated that.  Ctext should still use the 8859 sets where possible,
and the GB_2312-80, JIS_X0208-1983, JIS_X0208-1990, KS_C_5601-1987 and
CNS11643-1992 sets (but not JIS_X0213) for characters covered by Emacs'
'han script name symbol (as used by, eg, (set-fontset-font)) and utf8
for everything else.  
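
As a rough illustration of that rule, here is a per-character sketch.
It is a simplification under stated assumptions: Latin-1 alone stands in
for "the 8859 sets", the CJK branch does not check whether any ctext
charset actually contains the character, and the function name is
hypothetical.

```elisp
(defun my-ctext-segment-kind (char)
  "Classify CHAR as `iso8859', `cjk' or `utf-8' for ctext output."
  (cond ((< char #x100) 'iso8859)                        ; Latin-1 only
        ((eq (aref char-script-table char) 'han) 'cjk)   ; han script
        (t 'utf-8)))                                     ; everything else

(my-ctext-segment-kind ?é)     ; => iso8859
(my-ctext-segment-kind ?漢)    ; => cjk
(my-ctext-segment-kind #x2022) ; => utf-8 (BULLET is not in the han script)
```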

Other apps which understand utf8 will (one hopes) prefer UTF8_STRING;
those which actually need ctext would then get maximum benefit.

As I wrote, ctext-with-extensions is almost there; eliminating 0213
from it should just about do it.

-JimC

P.S.  Stephen: mail to your @xemacs address always bounces,
      and has for some time.
-- 
James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6


* Re: X11 Compound Text vs ISO 2022
  2010-07-07 19:51         ` James Cloos
@ 2010-07-08  0:24           ` David De La Harpe Golden
  2010-07-14 21:07             ` James Cloos
  0 siblings, 1 reply; 21+ messages in thread
From: David De La Harpe Golden @ 2010-07-08  0:24 UTC (permalink / raw)
  To: Emacs developers; +Cc: James Cloos, Kenichi Handa

On 07/07/10 20:51, James Cloos wrote:


> I misstated that.  Ctext should still use the 8859 sets where possible,
> and the GB_2312-80, JIS_X0208-1983, JIS_X0208-1990, KS_C_5601-1987 and
> CNS11643-1992 sets (but not JIS_X0213) for characters covered by Emacs'

Modifying function mule.el/ctext-non-standard-encodings-table with, say*:

(not (string-match "jisx0213" (symbol-name charset)))

in its charset-list-walking dolist does exclude that in particular from 
consideration, yielding:

(encode-coding-string "•" 'compound-text-with-extensions)
"^[%G\342\200\242^[%@"

... But excluding the ones that are not expected to work seems backwards
when there is a shortlist of ones that can be expected to work.  (I'm
still a bit unclear why adding a :charset-list with a shortlist to the
definition of compound-text-with-extensions didn't work; maybe something
isn't getting bound somewhere.)

[* similar to existing line for excluding cns11643, which I think was
because "big5" was preferred, see
"bzr log -r52413.1.843"/"bzr diff -r52413.1.842..52413.1.843"]
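
For illustration only, the same test applied while walking a literal
list of charsets.  The list below and the my- name are made up for the
example; this is not the body of ctext-non-standard-encodings-table.

```elisp
(defun my-ctext-charset-allowed-p (charset)
  "Return non-nil unless CHARSET is one of the jisx0213 planes."
  (not (string-match "jisx0213" (symbol-name charset))))

;; Walk the list with `dolist', keeping only the charsets that pass.
(let (kept)
  (dolist (charset '(japanese-jisx0208 japanese-jisx0212
                     japanese-jisx0213-1 japanese-jisx0213-2))
    (when (my-ctext-charset-allowed-p charset)
      (push charset kept)))
  (nreverse kept))
;; => (japanese-jisx0208 japanese-jisx0212)
```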




* Re: X11 Compound Text vs ISO 2022
  2010-07-08  0:24           ` David De La Harpe Golden
@ 2010-07-14 21:07             ` James Cloos
  0 siblings, 0 replies; 21+ messages in thread
From: James Cloos @ 2010-07-14 21:07 UTC (permalink / raw)
  To: David De La Harpe Golden; +Cc: Kenichi Handa, Emacs developers

I used this as a --batch file to generate a list of how Emacs converts
each UCS code point to compound-text and compound-text-with-extensions:

;;; emacs -q --batch --script to-n-from-ctext.el
(setq num 0)
(while (< num 1114112)
  (princ (format "%04X\t%S\n" num (decode-coding-string
    (encode-coding-string (format "%c" num)
    'compound-text-with-extensions) 'compound-text)))
  (setq num (+ 1 num)))
;;;;;;

(Change 'compound-text-with-extensions to 'ctext to see how converting
to ctext works.)

The results of converting to ctext are:

,----< tab-separated charset count >
| ipa	6
| lao	94
| tibetan	193
| chinese-big5-1	415
| chinese-big5-2	29
| chinese-cns11643-1	2257
| chinese-cns11643-2	6594
| chinese-cns11643-3	5705
| chinese-cns11643-4	7217
| chinese-cns11643-5	8599
| chinese-cns11643-6	6384
| chinese-cns11643-7	6539
| arabic-digit	9
| chinese-gb2312	7299
| latin-iso8859-1	96
| latin-iso8859-13	3
| latin-iso8859-14	27
| latin-iso8859-15	2
| latin-iso8859-16	4
| latin-iso8859-2	57
| latin-iso8859-3	22
| latin-iso8859-4	35
| cyrillic-iso8859-5	93
| arabic-iso8859-6	48
| greek-iso8859-7	77
| hebrew-iso8859-8	30
| katakana-jisx0201	63
| japanese-jisx0208	316
| japanese-jisx0212	124
| japanese-jisx0213-1	507
| japanese-jisx0213-2	250
| korean-ksc5601	2907
| thai-tis620	96
| mule-unicode-0100-24ff	7851
| mule-unicode-2500-33ff	3005
| mule-unicode-e000-ffff	7219
| vietnamese-viscii-lower	46
| vietnamese-viscii-upper	46
`----

As you can see, that is of no value.  It also fails to convert the vast
majority of non-bmp characters.

Converting to ctext-with-extensions gives somewhat better results:

,----< tab-separated charset count >
| latin-iso8859-1	96
| latin-iso8859-2	57
| latin-iso8859-3	22
| latin-iso8859-4	35
| cyrillic-iso8859-5	93
| arabic-iso8859-6	48
| greek-iso8859-7	77
| hebrew-iso8859-8	30
| thai-tis620	96
| latin-iso8859-13	3
| latin-iso8859-14	27
| latin-iso8859-15	2
| latin-iso8859-16	4
| katakana-jisx0201	63
| chinese-gb2312	7299
| japanese-jisx0208	316
| japanese-jisx0212	124
| korean-ksc5601	2907
| chinese-cns11643-1	2044
| chinese-cns11643-2	3307
| chinese-cns11643-3	1714
| chinese-cns11643-4	755
| chinese-cns11643-5	89
| chinese-cns11643-6	39
| chinese-cns11643-7	31
| utf-8	1093949
| japanese-jisx0213-1	507
| japanese-jisx0213-2	250
`----

As you can see, 8859-9 and 8859-10 are not generated, but that is
because all of their characters can be found in 8859-1 through -8
and is therefore not a problem.

But japanese-jisx0213-1 and japanese-jisx0213-2 need to go; they are
simply unknown by other COMPOUND_TEXT users.

It is clear that the current definition of compound-text is wrong;
I'd replace it with the current compound-text-with-extensions and make
that an alias for backwards compatibility.

Then, we need to determine how to prevent Emacs from considering the
jisx0213-? charsets when converting to ctext.

And, perhaps, to prefer utf8 over the gb, cns, ksc, and jisx charsets
when converting "narrow" characters (and ambiguous characters when in a
"narrow" or "non-cjk" locale).  Handa-san already did some comparable
work for font selection; what he did there is also needed here.

-JimC
-- 
James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6


* Re: X11 Compound Text vs ISO 2022
  2010-07-06 16:21 X11 Compound Text vs ISO 2022 James Cloos
  2010-07-06 20:18 ` David De La Harpe Golden
  2010-07-06 23:38 ` David De La Harpe Golden
@ 2010-07-29 12:36 ` Kenichi Handa
  2010-07-29 15:51   ` James Cloos
  2 siblings, 1 reply; 21+ messages in thread
From: Kenichi Handa @ 2010-07-29 12:36 UTC (permalink / raw)
  To: James Cloos; +Cc: david, emacs-devel

Very sorry for the late response on this matter.

In article <m3zky4wpgw.fsf@carbon.jhcloos.org>, James Cloos <cloos@jhcloos.com> writes:

> While testing my recently applied patch, I've discovered that Emacs will
> produce ISO-2022 output for COMPOUND_TEXT which other libs and apps --
> notably including libX11 -- cannot decode.

> As an example, (encode-coding-string "•" 'compound-text) ; U+2022 BULLET
> produces "^[$(O#@^[(B".  '$(O' is ISO-IR 228¹, JIS X 0213:2000.  But
> libX11 only knows about the $( charsets:  0, 1, A-D and G-M.

> A number of characters are output in '^[$-1'; such as:

> (encode-coding-string "ℜ" 'compound-text) ; U+211C BLACK-LETTER CAPITAL R
> "^[$-1\365\334^[-A"
> (encode-coding-string "ʻ" 'compound-text) ; U+02BB MODIFIER LETTER TURNED COMMA
> "^[$-1\244\333^[-A"

> That is encoded in mule-unicode-0100-24ff, essentially unknown outside
> Emacs.

I admit that this behaviour is not good now.  When I
first implemented ctext in Emacs, there was neither UTF8_STRING
nor CTEXT_with_UTF8_extended_segment.  So, I added more
character sets to it for cut&paste between two running
Emacses.  As Emacs was the only application that supported
many character sets at that time, no one complained about
that behaviour of ctext.  The other applications couldn't
handle that many characters anyway.

> Other libs/apps prefer to use utf-8³ in compound_text for such chars.

> I understand *why* this happens, given that Emacs used to use 2022
> internally, but it confuses other X11 apps.

Actually, the latest Emacs (Emacs 23 and later) uses
Unicode internally.

> I am not fully fluent in Emacs' internal charset conversion routines;
> is there an easy way to tell it to limit which 2022 charsets it will
> use when converting a string into a 2022 encoding?  A better way?

It's fairly easy to limit the charsets of ctext.  But I care
about backward compatibility.  As ctext is the only coding
system that is compatible with iso-8859-1 and can encode
many other character sets, there will be old users who still
use it for file/process encodings.

And anyway, ctext is not used for selection, so I'd rather just
document that ctext is not fully compatible with X's
COMPOUND_TEXT spec, but is an extended version.

For WM_NAME, etc, yes, we should use ctext-with-extensions,
and as ctext-with-extensions is not intended to be used
directly by users, I think it won't cause actual problems
even if we change it so that more characters are encoded
using UTF8-extended-segment.  So, I'll work on it soon.

The only problem with ctext-with-extensions is that it is
now implemented by Elisp, and thus it may cause GC.  I'm not
sure it is safe to call Lisp at the place we convert WM_NAME
etc.  If it is not safe, I'll implement
ctext-with-extensions in C.

---
Kenichi Handa
handa@m17n.org


* Re: X11 Compound Text vs ISO 2022
  2010-07-29 12:36 ` Kenichi Handa
@ 2010-07-29 15:51   ` James Cloos
  2010-07-30  1:27     ` Kenichi Handa
  0 siblings, 1 reply; 21+ messages in thread
From: James Cloos @ 2010-07-29 15:51 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: david, emacs-devel

>>>>> "KH" == Kenichi Handa <handa@m17n.org> writes:

KH> Very sorry for the late response on this matter.

That is OK.

KH> I added more
KH> character sets to it for cut&paste between two running
KH> Emacses.

Understood.  And, now that you have mentioned it, I think I even
remember (vaguely) a post or a discussion about it from back then.

KH> It's fairly easy to limit the charsets of ctext.  But I care
KH> about backward compatibility.  As ctext is the only coding
KH> system that is compatible with iso-8859-1 and can encode
KH> many other character sets, there will be old users who still
KH> use it for file/process encodings.

I was not aware of that.

KH> And, anyway ctext is not used for selection,

It has to be used for X selection, yes?  How else could X selection of
text work than using data tagged with X's STRING, COMPOUND_TEXT or
UTF8_STRING atoms?

KH> I'd rather just document that ctext is not fully compatible with X's
KH> COMPOUND_TEXT spec, but is an extended version.

KH> For WM_NAME, etc, yes, we should use ctext-with-extensions,
KH> and as ctext-with-extensions is not intended to be used
KH> directly by users, I think it won't cause actual problems
KH> even if we change it so that more characters are encoded
KH> using UTF8-extended-segment.  So, I'll work on it soon.

ctext-with-extensions already supports the UTF8 extended segment; the
bug is that it uses JISX 0213 for some characters.  The earlier JISX
versions (0201, 0208 and 0212) are OK, but 0213 is not.

KH> The only problem with ctext-with-extensions is that it is
KH> now implemented by Elisp, and thus it may cause GC.  I'm not
KH> sure it is safe to call Lisp at the place we convert WM_NAME
KH> etc.  If it is not safe, I'll implement
KH> ctext-with-extensions in C.

The WM_NAME code already has to gc protect to do the conversion to utf8
for the gtk call (when compiled for gtk) and the new code to set the
UTF8_STRING _NET_WM_NAME and _NET_WM_ICON_NAME properties; I presume it
could do a conversion to ctext-with-extensions within that same protect?

Then it just needs to prefer utf8 over jisx0213.

Thanks for looking at it.

-JimC
-- 
James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6


* Re: X11 Compound Text vs ISO 2022
  2010-07-29 15:51   ` James Cloos
@ 2010-07-30  1:27     ` Kenichi Handa
  2010-07-30 18:46       ` James Cloos
  0 siblings, 1 reply; 21+ messages in thread
From: Kenichi Handa @ 2010-07-30  1:27 UTC (permalink / raw)
  To: James Cloos; +Cc: emacs-devel, david

In article <m3tynii8vi.fsf@carbon.jhcloos.org>, James Cloos <cloos@jhcloos.com> writes:

KH> And, anyway ctext is not used for selection,

> It has to be used for X selection, yes?

No.

> How else could X selection of
> text work than using data tagged with X's STRING, COMPOUND_TEXT or
> UTF8_STRING atoms?

iso-8859-1, ctext-with-extensions, utf-8 respectively in
this order.
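
Spelled out as a table, that correspondence might look like the sketch
below.  The variable and function names are hypothetical and only
illustrate the mapping; they are not the names used in the Emacs
selection code.

```elisp
;; Hypothetical table: X text target atom -> coding system, in the
;; order given above.
(defconst my-selection-target-coding-systems
  '((STRING        . iso-8859-1)
    (COMPOUND_TEXT . compound-text-with-extensions)
    (UTF8_STRING   . utf-8))
  "Coding system used for each X text target, as described above.")

(defun my-encode-for-target (string target)
  "Encode STRING with the coding system listed for TARGET."
  (let ((coding (cdr (assq target my-selection-target-coding-systems))))
    (unless coding
      (error "Unknown target: %S" target))
    (encode-coding-string string coding)))

(my-encode-for-target "é" 'STRING)      ; => "\351"
(my-encode-for-target "é" 'UTF8_STRING) ; => "\303\251"
```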

KH> I'd rather just document that ctext is not fully compatible with X's
KH> COMPOUND_TEXT spec, but is an extended version.

KH> For WM_NAME, etc, yes, we should use ctext-with-extensions,
KH> and as ctext-with-extensions is not intended to be used
KH> directly by users, I think it won't cause actual problems
KH> even if we change it so that more characters are encoded
KH> using UTF8-extended-segment.  So, I'll work on it soon.

> ctext-with-extensions already supports the UTF8 extended segment; the
> bug is that it uses JISX 0213 for some characters.  The earlier JISX
> versions (0201, 0208 and 0212) are OK, but 0213 is not.

Yes, I understand that.  My intention is to modify (or fix)
ctext-with-extensions to use UTF8-extended-segment for
characters that don't belong to any of the few legacy
character sets listed in the spec of COMPOUND_TEXT.

KH> The only problem with ctext-with-extensions is that it is
KH> now implemented by Elisp, and thus it may cause GC.  I'm not
KH> sure it is safe to call Lisp at the place we convert WM_NAME
KH> etc.  If it is not safe, I'll implement
KH> ctext-with-extensions in C.

> the WM_NAME code already has to gc protect to do the conversion to utf8
> for the gtk call (when compiled for gtk) and the new code to set the
> UTF8_STRING _NET_WM_NAME and _NET_WM_ICON_NAME properties; I presume it
> could do a conversion to ctext-with-extensions within that same protect?

Ah, then, perhaps so.

By the way, the spec of COMPOUND_TEXT (included in
xorg-docs-1.5) lists these registered charsets.

ISO8859-1
ISO8859-2
ISO8859-3
ISO8859-4
ISO8859-5
ISO8859-6
ISO8859-7
ISO8859-8
ISO8859-9
JISX0201.1976-0
GB2312.1980-0
JISX0208.1983-0
KSC5601.1987-0

but libX11-1.3.2/src/xlibi18n/lcCT.c lists many more
charsets (e.g. more 8859 series and all CNS11643 series).  I
think it is better to follow the above spec than lcCT.c.
What do you think?

---
Kenichi Handa
handa@m17n.org


* Re: X11 Compound Text vs ISO 2022
  2010-07-30  1:27     ` Kenichi Handa
@ 2010-07-30 18:46       ` James Cloos
  2010-08-01  9:35         ` Stephen J. Turnbull
  0 siblings, 1 reply; 21+ messages in thread
From: James Cloos @ 2010-07-30 18:46 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: emacs-devel, david

>>>>> "KH" == Kenichi Handa <handa@m17n.org> writes:

KH> iso-8859-1, ctext-with-extensions, utf-8 respectively in
KH> this order.

D'Oh.  I read COMPOUND_TEXT even though you wrote ctext. :(

KH> By the way, the spec of COMPOUND_TEXT (included in
KH> xorg-docs-1.5) lists these registered charsets.

KH> ISO8859-1 through ISO8859-9
KH> JISX0201.1976-0
KH> GB2312.1980-0
KH> JISX0208.1983-0
KH> KSC5601.1987-0

KH> but libX11-1.3.2/src/xlibi18n/lcCT.c lists many more
KH> charsets (e.g. more 8859 series and all CNS11643 series).
KH> I think it is better to follow the above spec than lcCT.c.
KH> What do you think?

The changes to the code go back many, many years, and I (as an Xorg
member) expect that the spec will be updated to match the code.

So I'd follow the Xlib code rather than the spec.

-JimC
-- 
James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6


* Re: X11 Compound Text vs ISO 2022
  2010-07-30 18:46       ` James Cloos
@ 2010-08-01  9:35         ` Stephen J. Turnbull
  2010-08-01 11:06           ` James Cloos
  0 siblings, 1 reply; 21+ messages in thread
From: Stephen J. Turnbull @ 2010-08-01  9:35 UTC (permalink / raw)
  To: James Cloos; +Cc: david, emacs-devel, Kenichi Handa

James Cloos writes:

 > The changes to the code go back many, many years, and I (as an Xorg
 > member) expect that the spec will be updated to match the code.

Really?  I can't say I'm terribly impressed with X.org's response to
complaints that specs and behavior don't match.





* Re: X11 Compound Text vs ISO 2022
  2010-08-01  9:35         ` Stephen J. Turnbull
@ 2010-08-01 11:06           ` James Cloos
  2010-08-02  8:14             ` Stephen J. Turnbull
  2010-08-06 12:50             ` Kenichi Handa
  0 siblings, 2 replies; 21+ messages in thread
From: James Cloos @ 2010-08-01 11:06 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: david, emacs-devel, Kenichi Handa

>>>>> "SJT" == Stephen J Turnbull <stephen@xemacs.org> writes:

>> The changes to the code go back many, many years, and I (as an Xorg
>> member) expect that the spec will be updated to match the code.

SJT> Really?  I can't say I'm terribly impressed with X.org's response
SJT> to complaints that specs and behavior don't match.

The compound text spec shows a 1989 copyright, states that it is
version 1.1 documenting X Version 11 Release 6.8 and notes that
it might be expanded in the future.

The ctext additions were clearly added to support additional locales.
And the expansion has stopped thanks to the general shift from
iso-2022 to iso-10646.

Clearly the ctext spec should have followed along with the reference
code just like, eg, the elisp manual follows the code.

I think it is more than fair to update the ctext spec to document the
current reference code, especially now that the document is old enough
to purchase alcohol in its home country. :)

-JimC
-- 
James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6


* Re: X11 Compound Text vs ISO 2022
  2010-08-01 11:06           ` James Cloos
@ 2010-08-02  8:14             ` Stephen J. Turnbull
  2010-08-06 12:50             ` Kenichi Handa
  1 sibling, 0 replies; 21+ messages in thread
From: Stephen J. Turnbull @ 2010-08-02  8:14 UTC (permalink / raw)
  To: James Cloos; +Cc: emacs-devel

James Cloos writes:

 > I think it is more than fair to update the ctext spec to document the
 > current reference code, especially now that the document is old enough
 > to purchase alcohol in its home country. :)

Sure.  But do we have to wait for the diamond anniversary?




* Re: X11 Compound Text vs ISO 2022
  2010-08-01 11:06           ` James Cloos
  2010-08-02  8:14             ` Stephen J. Turnbull
@ 2010-08-06 12:50             ` Kenichi Handa
  2010-08-08  9:47               ` James Cloos
  1 sibling, 1 reply; 21+ messages in thread
From: Kenichi Handa @ 2010-08-06 12:50 UTC (permalink / raw)
  To: James Cloos; +Cc: stephen, david, emacs-devel

In article <m362zulhh5.fsf@carbon.jhcloos.org>, James Cloos <cloos@jhcloos.com> writes:

> The compound text spec shows a 1989 copyright, states that it is
> version 1.1 documenting X Version 11 Release 6.8 and notes that
> it might be expanded in the future.

> The ctext additions were clearly added to support additional locales.
> And the expansion has stopped thanks to the general shift from
> iso-2022 to iso-10646.

> Clearly the ctext spec should have followed along with the reference
> code just like, eg, the elisp manual follows the code.

> I think it is more than fair to update the ctext spec to document the
> current reference code, especially now that the document is old enough
> to purchase alcohol in its home country. :)

I've just committed new code to make ctext-with-extensions
conform to X's Compound Text spec.  As for which charsets to
treat as "the standard encodings", I added a variable,
ctext-standard-encodings, whose default value for now follows
the current (old) SPEC:

ascii latin-jisx0201 katakana-jisx0201 latin-iso8859-1
latin-iso8859-2 latin-iso8859-3 latin-iso8859-4
greek-iso8859-7 arabic-iso8859-6 hebrew-iso8859-8
cyrillic-iso8859-5 latin-iso8859-9 chinese-gb2312
japanese-jisx0208 korean-ksc5601

If we actually find that it is better to follow the current
CODE, we can just add more charsets to the variable.
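
As a minimal sketch of that last point, extra charsets could be added
to the variable from Lisp.  This assumes ctext-standard-encodings is an
ordinary list of charset symbols consulted at encoding time.  The two
charsets below are only examples, chosen because lcCT.c is said to know
more of the 8859 series and the CNS11643 series; whether other clients
decode the resulting output has not been verified here.

```elisp
;; Sketch only: treat two further charsets as "standard" for
;; ctext-with-extensions.  The choice of charsets is an assumption
;; for illustration, not a recommendation from the thread.
(when (boundp 'ctext-standard-encodings)
  (dolist (charset '(latin-iso8859-15 chinese-cns11643-1))
    ;; Append, so that the default charsets keep their priority.
    (add-to-list 'ctext-standard-encodings charset t)))
```

The boundp guard keeps the snippet harmless on an Emacs that predates
the variable.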

The new code was committed to the emacs-23 branch, but I don't
know when it will be propagated to the trunk.

---
Kenichi Handa
handa@m17n.org




* Re: X11 Compound Text vs ISO 2022
  2010-08-06 12:50             ` Kenichi Handa
@ 2010-08-08  9:47               ` James Cloos
  2010-08-09  1:49                 ` Kenichi Handa
  0 siblings, 1 reply; 21+ messages in thread
From: James Cloos @ 2010-08-08  9:47 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: stephen, emacs-devel, david

>>>>> "KH" == Kenichi Handa <handa@m17n.org> writes:

KH> I've just committed new code to make ctext-with-extensions
KH> conform to X's Compound Text spec.

Thanks.

I read through the patch on the diffs list; everything looked right.

KH> As for which charsets to treat as "the standard encodings", I made a
KH> variable ctext-standard-encodings and set the default value to this
KH> at the moment (i.e. following the current (old) SPEC).

Good idea!

KH> The new code was committed to the emacs-23 branch, but I don't
KH> know when it will be propagated to the trunk.

I will probably wait until it is merged into trunk to test; all of my
systems currently run trunk or a packaged snapshot thereof.

-JimC
-- 
James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6


* Re: X11 Compound Text vs ISO 2022
  2010-08-08  9:47               ` James Cloos
@ 2010-08-09  1:49                 ` Kenichi Handa
  0 siblings, 0 replies; 21+ messages in thread
From: Kenichi Handa @ 2010-08-09  1:49 UTC (permalink / raw)
  To: James Cloos; +Cc: stephen, emacs-devel, david

In article <m3tyn5qvu7.fsf@carbon.jhcloos.org>, James Cloos <cloos@jhcloos.com> writes:

>>>>>> "KH" == Kenichi Handa <handa@m17n.org> writes:
KH> I've just committed new code to make ctext-with-extensions
KH> conform to X's Compound Text spec.
[...]
> I read through the patch on the diffs list; everything looked right.

Thank you very much for double checking the change.

---
Kenichi Handa
handa@m17n.org




Thread overview: 21+ messages
2010-07-06 16:21 X11 Compound Text vs ISO 2022 James Cloos
2010-07-06 20:18 ` David De La Harpe Golden
2010-07-06 22:30   ` James Cloos
2010-07-07  0:36     ` Stephen J. Turnbull
2010-07-07  5:19       ` James Cloos
2010-07-07 19:51         ` James Cloos
2010-07-08  0:24           ` David De La Harpe Golden
2010-07-14 21:07             ` James Cloos
2010-07-06 23:38 ` David De La Harpe Golden
2010-07-07  1:15   ` David De La Harpe Golden
2010-07-07  4:55   ` James Cloos
2010-07-29 12:36 ` Kenichi Handa
2010-07-29 15:51   ` James Cloos
2010-07-30  1:27     ` Kenichi Handa
2010-07-30 18:46       ` James Cloos
2010-08-01  9:35         ` Stephen J. Turnbull
2010-08-01 11:06           ` James Cloos
2010-08-02  8:14             ` Stephen J. Turnbull
2010-08-06 12:50             ` Kenichi Handa
2010-08-08  9:47               ` James Cloos
2010-08-09  1:49                 ` Kenichi Handa
