[mew-int 01581] Re: windows 1252


* [mew-int 01581] Re: windows 1252
       [not found]   ` <20031030.175736.39971315.kazu@iijlab.net>
@ 2003-10-30 14:41     ` Werner LEMBERG
  2003-10-31 11:04       ` [mew-int 01579] " Kenichi Handa
  0 siblings, 1 reply; 29+ messages in thread
From: Werner LEMBERG @ 2003-10-30 14:41 UTC (permalink / raw)
  Cc: mew-int

The mail program `mew' uses ctext encoding to display the summary
buffer.  The question is how to display mails in windows-125x
encodings.  Here is the relevant snippet from the mail exchange:

  From: Kazu Yamamoto (山本和彦) <kazu@iijlab.net>
  Subject: [mew-int 01579] Re: windows 1252
  Date: Thu, 30 Oct 2003 17:57:36 +0900 (JST)

  [about handling windows-125x in ctext encoding]

  It's worth trying to ask the developers to have *private* extensions
  to 'ctext.  Defining a new private escape sequence is just enough, I
  guess.

Is something like this already available?  Do you have other
suggestions?  Otherwise, is there a simple way to convert windows-125x
encodings to something which can be easily displayed with ctext?

    Werner

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [mew-int 01579] Re: windows 1252
  2003-10-30 14:41     ` [mew-int 01581] Re: windows 1252 Werner LEMBERG
@ 2003-10-31 11:04       ` Kenichi Handa
  2003-10-31 12:39         ` [mew-int 01583] " Kazu Yamamoto
  0 siblings, 1 reply; 29+ messages in thread
From: Kenichi Handa @ 2003-10-31 11:04 UTC (permalink / raw)
  Cc: mew-int, emacs-devel

In article <20031030.154127.155831569.wl@gnu.org>, Werner LEMBERG <wl@gnu.org> writes:
> The mail program `mew' uses ctext encoding to display the summary
> buffer.  The question is how to display mails in windows-125x
> encodings.

??? I don't understand why encoding is relevant to
displaying?  Emacs displays a character according to its
internal character code.

---
Ken'ichi HANDA
handa@m17n.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [mew-int 01583] Re: windows 1252
  2003-10-31 11:04       ` [mew-int 01579] " Kenichi Handa
@ 2003-10-31 12:39         ` Kazu Yamamoto
  2003-11-01 15:36           ` [mew-int 01584] " Eli Zaretskii
  0 siblings, 1 reply; 29+ messages in thread
From: Kazu Yamamoto @ 2003-10-31 12:39 UTC (permalink / raw)
  Cc: wl, emacs-devel, mew-int

From: Kenichi Handa <handa@m17n.org>
Subject: [mew-int 01582] Re: windows 1252

> > The mail program `mew' uses ctext encoding to display the summary
> > buffer.  The question is how to display mails in windows-125x
> > encodings.
> 
> ??? I don't understand why encoding is relevant to
> displaying?  Emacs displays a character according to its
> internal character code.

I think Werner misunderstood. As Handa-san said, encoding does not
matter for displaying characters in a buffer.

Mew saves a buffer in Summary mode to a file as a cache. For this
file, 'ctext is used.

Note that I don't know whether the ctext implementation in Emacs can
handle windows-125x. I said that, if it cannot, it's worth asking the
developers to extend the ctext implementation to handle windows-125x.

--Kazu

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [mew-int 01584] Re: windows 1252
  2003-10-31 12:39         ` [mew-int 01583] " Kazu Yamamoto
@ 2003-11-01 15:36           ` Eli Zaretskii
  2003-11-02  6:41             ` [mew-int 01582] " Stephen J. Turnbull
  0 siblings, 1 reply; 29+ messages in thread
From: Eli Zaretskii @ 2003-11-01 15:36 UTC (permalink / raw)
  Cc: handa, wl, mew-int, emacs-devel

> Date: Fri, 31 Oct 2003 21:39:16 +0900 (JST)
> From: Kazu Yamamoto (=?iso-2022-jp?B?GyRCOzNLXE9CSScbKEI=?=) <kazu@iijlab.net>
> 
> Mew saves a buffer in Summary mode to a file as a cache. For this
> file, 'ctext is used.

Doesn't the Summary buffer include only the Subject and From parts of
the message?  If so, then they should not normally include any
non-ASCII characters (it's contrary to the relevant RFC, IIRC).  If
they do have non-ASCII characters, one could probably qp-encode them
before writing the cache.
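
For illustration, here is a minimal Python sketch (not part of Mew or
Emacs) of the wire format in question.  It decodes the RFC 2047
encoded-word from the From header quoted above; the variable names are
only for this example.

```python
from email.header import decode_header, make_header

# Encoded-word copied from the From header quoted above.
raw = "=?iso-2022-jp?B?GyRCOzNLXE9CSScbKEI=?="

# decode_header splits the value into (bytes, charset) pairs;
# make_header joins them back into a readable string.
display_name = str(make_header(decode_header(raw)))
print(display_name)
```

On the wire the header is pure ASCII; the non-ASCII name only appears
after decoding, which is what ends up in a summary line.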

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [mew-int 01582] Re: windows 1252
  2003-11-01 15:36           ` [mew-int 01584] " Eli Zaretskii
@ 2003-11-02  6:41             ` Stephen J. Turnbull
  2003-11-04  2:13               ` [mew-int 01586] " Kazu Yamamoto
  0 siblings, 1 reply; 29+ messages in thread
From: Stephen J. Turnbull @ 2003-11-02  6:41 UTC (permalink / raw)
  Cc: handa, wl, kazu, mew-int, emacs-devel

>>>>> "Eli" == Eli Zaretskii <eliz@elta.co.il> writes:

    >> Date: Fri, 31 Oct 2003 21:39:16 +0900 (JST)

    >> From: Kazu Yamamoto (=?iso-2022-jp?B?GyRCOzNLXE9CSScbKEI=?=)
    >> <kazu@iijlab.net>

    >> Mew saves a buffer in Summary mode to a file as a cache. For
    >> this file, 'ctext is used.

    Eli> Doesn't the Summary buffer include only the Subject and From
    Eli> parts of the message?  If so, then they should not normally
    Eli> include any non-ASCII characters (it's contrary to the
    Eli> relevant RFC, IIRC).

Not really.  RFCs are about wire format; they are not relevant to
summary buffers or disk storage.  (Unless there is some reason, such
as digital signatures, that the on-the-wire format needs to be
preserved exactly.)

    Eli> If they do have non-ASCII characters,
    Eli> one could probably qp-encode them before writing the cache.

Why bother?

If ctext is really preferred, there is a standard mechanism, the X
Compound Text "extended segment", which Emacs should already know
about.  In fact, I'd like to encourage the mew people to use that, so
that there could actually be a non-abusive use of it for reference
(this is the same mechanism that XFree86 abuses for ISO 8859/15
selections).

That said, I think the sane thing for summary buffers is to use UTF-8
as the encoding; the Unihan problem can be dealt with using a separate
language field in the cache database, or guessed from the Content-Type
header.

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [mew-int 01586] Re: windows 1252
  2003-11-02  6:41             ` [mew-int 01582] " Stephen J. Turnbull
@ 2003-11-04  2:13               ` Kazu Yamamoto
  2003-11-04  5:55                 ` [mew-int 01585] " Eli Zaretskii
                                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Kazu Yamamoto @ 2003-11-04  2:13 UTC (permalink / raw)


Hello all,

> If ctext is really preferred, there is a standard mechanism, the X
> Compound Text "extended segment", which Emacs should already know
> about.  In fact, I'd like to encourage the mew people to use that, so
> that there could actually be a non-abusive use of it for reference
> (this is the same mechanism that XFree86 abuses for ISO 8859/15
> selections).

If my understanding is correct, the Emacs implementation of ctext has
a private extension for Big5. That's why I said "ask developers to
extend ctext".

> That said, I think the sane thing for summary buffers is to use UTF-8
> as the encoding; the Unihan problem can be dealt with using a separate
> language field in the cache database, or guessed from the Content-Type
> header.

I should have explained the background earlier.

The reasons why Mew uses ctext for the Summary mode cache are:

(1) Backward compatibility with non-Mule Emacsen.

	Non-Mule Emacsen treat 8-bit bytes as ISO-8859-1. Thus, to share
	the cache among Mule Emacsen and non-Mule Emacsen, we need a
	character set whose 8-bit range is ISO-8859-1.

(2) Coexistence of Emacs and XEmacs.

	The 'emacs-mule coding-system is not appropriate since XEmacs
	has a different internal representation from Emacs's. Note
	that the 'emacs-mule coding-system also differs among Emacs
	versions.

The one-and-only coding-system which, I found, meets the requirements
above is 'ctext.

--Kazu


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [mew-int 01585] Re: windows 1252
  2003-11-04  2:13               ` [mew-int 01586] " Kazu Yamamoto
@ 2003-11-04  5:55                 ` Eli Zaretskii
  2003-11-04  6:13                   ` [mew-int 01587] " Kazu Yamamoto
  2003-11-04  6:23                   ` [mew-int 01589] " Stephen J. Turnbull
  2003-11-04 15:13                 ` [mew-int 01590] " Stefan Monnier
  2003-11-07  7:13                 ` [mew-int 01596] " Kenichi Handa
  2 siblings, 2 replies; 29+ messages in thread
From: Eli Zaretskii @ 2003-11-04  5:55 UTC (permalink / raw)
  Cc: mew-int, emacs-devel

> Date: Tue, 04 Nov 2003 11:13:34 +0900 (JST)
> From: Kazu Yamamoto (=?iso-2022-jp?B?GyRCOzNLXE9CSScbKEI=?=) <kazu@iijlab.net>
> (2) Coexistence of Emacs and XEmacs.
> 
> 	The 'emacs-mule coding-system is not appropriate since XEmacs
> 	has a different internal representation from Emacs's. Note
> 	that the 'emacs-mule coding-system also differs among Emacs
> 	versions.
> 
> The one-and-only coding-system which, I found, meets the requirements
> above is 'ctext.

In that case, extending ctext to handle the problem we were discussing
in this thread would not be a good idea: IIRC, XEmacs doesn't support
extended segments, and non-MULE Emacsen certainly don't.  Or am I
missing something?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [mew-int 01587] Re: windows 1252
  2003-11-04  5:55                 ` [mew-int 01585] " Eli Zaretskii
@ 2003-11-04  6:13                   ` Kazu Yamamoto
  2003-11-04  6:23                   ` [mew-int 01589] " Stephen J. Turnbull
  1 sibling, 0 replies; 29+ messages in thread
From: Kazu Yamamoto @ 2003-11-04  6:13 UTC (permalink / raw)

From: Eli Zaretskii <eliz@elta.co.il>
Subject: Re: [mew-int 01585] Re: windows 1252

> > The one-and-only coding-system which, I found, meets the requirements
> > above is 'ctext.
> 
> In that case, extending ctext to handle the problem we were discussing
> in this thread would not be a good idea: IIRC, XEmacs doesn't support
> extended segments, and non-MULE Emacsen certainly don't.  Or am I
> missing something?

Speaking of non-Mule Emacsen first, they can handle only US-ASCII and
ISO-8859-1. They cannot display other character sets defined in ctext.
But Mew users are happy because they use Mew in a proper way (see
below).

Some Mew users may ask a question:

Q) I'm using Emacs 20.7 and when I receive Japanese messages, Mew
   displays Japanese correctly. But when I use Mew on Emacs 19 and
   read a Summary cache, the part written in Japanese is not
   displayed well. Why?

The answer is:

A) Since Emacs 19 does not support Japanese, Japanese cannot be
   displayed on Emacs 19. If you want to display Japanese, use Mew on
   Emacs 20.7.

The situation is the same for XEmacs. Suppose we extend ctext of Emacs
21.4 to handle windows-125x.

Q) I'm using Emacs 21.4 and when I receive windows-125x mail, Mew
   displays it correctly. But when I use XEmacs, it is displayed
   broken. Why?

A) ctext of XEmacs cannot handle windows-125x. If you want to display
   windows-125x, use Mew on Emacs 21.4.

Note that this method of Mew has survived for the last ten years
in multi-language and multi-Emacs environments.

--Kazu

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [mew-int 01589] Re: windows 1252
  2003-11-04  5:55                 ` [mew-int 01585] " Eli Zaretskii
  2003-11-04  6:13                   ` [mew-int 01587] " Kazu Yamamoto
@ 2003-11-04  6:23                   ` Stephen J. Turnbull
  1 sibling, 0 replies; 29+ messages in thread
From: Stephen J. Turnbull @ 2003-11-04  6:23 UTC (permalink / raw)
  Cc: Kazu Yamamoto, mew-int, emacs-devel

>>>>> "Eli" == Eli Zaretskii <eliz@elta.co.il> writes:

> Date: Tue, 04 Nov 2003 11:13:34 +0900 (JST)
> From: Kazu Yamamoto (=?iso-2022-jp?B?GyRCOzNLXE9CSScbKEI=?=) <kazu@iijlab.net>
> (2) Coexistence of Emacs and XEmacs.
> 
> 	The 'emacs-mule coding-system is not appropriate since XEmacs
> 	has a different internal representation from Emacs's. Note
> 	that the 'emacs-mule coding-system also differs among Emacs
> 	versions.
> 
> The one-and-only coding-system which, I found, meets the requirements
> above is 'ctext.

    Eli> In that case, extending ctext to handle the problem we were
    Eli> discussing in this thread would not be a good idea: IIRC,
    Eli> XEmacs doesn't support extended segments, and non-MULE
    Eli> Emacsen certainly don't.  Or am I missing something?

Use of extended segments in this way is not an extension of ctext;
it's exactly what they are designed for.  True, XEmacs doesn't support
them properly yet, but that's XEmacs's problem; we should and will
support them.  But I would oppose adding special-case support (eg,
using private character sets or an alternative use of DOCS) to XEmacs.


-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [mew-int 01590] Re: windows 1252
  2003-11-04  2:13               ` [mew-int 01586] " Kazu Yamamoto
  2003-11-04  5:55                 ` [mew-int 01585] " Eli Zaretskii
@ 2003-11-04 15:13                 ` Stefan Monnier
  2003-11-04 15:55                   ` [mew-int 01591] " Kazu Yamamoto
  2003-11-07  7:13                 ` [mew-int 01596] " Kenichi Handa
  2 siblings, 1 reply; 29+ messages in thread
From: Stefan Monnier @ 2003-11-04 15:13 UTC (permalink / raw)
  Cc: emacs-devel, mew-int

> (1) Backward compatibility with non-Mule Emacsen.

> 	Non-Mule Emacsen treat 8-bit bytes as ISO-8859-1. Thus, to share
> 	the cache among Mule Emacsen and non-Mule Emacsen, we need a
> 	character set whose 8-bit range is ISO-8859-1.

utf-8 would be ideal here.

> (2) Coexistence of Emacs and XEmacs.

> 	The 'emacs-mule coding-system is not appropriate since XEmacs
> 	has a different internal representation from Emacs's. Note
> 	that the 'emacs-mule coding-system also differs among Emacs
> 	versions.

iso-2022 would be the answer here.

> The one-and-only coding-system which, I found, meets the requirements
> above is 'ctext.

It's unfortunate, but I guess it makes sense.
It should be possible to make ctext-with-extensions work for your case.

BTW, windows-1252 should internally be turned into a mix of chars from
various charsets and they should (hopefully) all be encodable directly in
ctext, so I'm not sure what your exact problem is.  Could you describe what
currently happens with windows-1252 and what you'd like to see instead?


        Stefan


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [mew-int 01591] Re: windows 1252
  2003-11-04 15:13                 ` [mew-int 01590] " Stefan Monnier
@ 2003-11-04 15:55                   ` Kazu Yamamoto
  2003-11-04 17:04                     ` [mew-int 01590] " Stefan Monnier
  2003-11-04 18:45                     ` Stephen J. Turnbull
  0 siblings, 2 replies; 29+ messages in thread
From: Kazu Yamamoto @ 2003-11-04 15:55 UTC (permalink / raw)


From: Stefan Monnier <monnier@IRO.UMontreal.CA>
Subject: [mew-int 01590] Re: windows 1252

> > (1) Backward compatibility with non-Mule Emacsen.
> 
> > 	Non-Mule Emacsen treat 8-bit bytes as ISO-8859-1. Thus, to share
> > 	the cache among Mule Emacsen and non-Mule Emacsen, we need a
> > 	character set whose 8-bit range is ISO-8859-1.
> 
> utf-8 would be ideal here.

Not correct.

UTF-8 is compatible with US-ASCII but not with ISO-8859-1.

That is, U+0000-U+007F is encoded as the single bytes 0x00-0x7f, while
U+0080-U+00FF is encoded as two 8-bit bytes (110xxxxx 10xxxxxx).
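
A small Python sketch of the difference, using only the standard
codecs.  The sample character (U+00E9) is an arbitrary choice from the
U+0080-U+00FF range, not something taken from a Mew cache.

```python
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, inside U+0080-U+00FF

latin1 = ch.encode("iso-8859-1")  # one byte, same value as the code point
utf8 = ch.encode("utf-8")         # two bytes: 110xxxxx 10xxxxxx

print(latin1.hex(), utf8.hex())

# In the ASCII range the two encodings produce identical bytes.
ascii_same = "A".encode("iso-8859-1") == "A".encode("utf-8")
```

A non-Mule Emacs reading the UTF-8 form as ISO-8859-1 would show two
characters where one was intended.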

> > (2) Coexistence of Emacs and XEmacs.
> 
> > 	The 'emacs-mule coding-system is not appropriate since XEmacs
> > 	has a different internal representation from Emacs's. Note
> > 	that the 'emacs-mule coding-system also differs among Emacs
> > 	versions.
> 
> iso-2022 would be the answer here.

Yes, ctext is one instance of the ISO-2022 framework.

> It's unfortunate, but I guess it makes sense.
> It should be possible to make ctext-with-extensions work for your case.

To support a new character set in ctext, we only need to register a
new escape sequence. The new ctext is forward compatible, and backward
compatible if the new character set is not encoded. So, we don't need
a new coding-system name, I think.

> BTW, windows-1252 should internally be turned into a mix of chars from
> various charsets and they should (hopefully) all be encodable directly in
> ctext, so I'm not sure what your exact problem is.  Could you describe what
> currently happens with windows-1252 and what you'd like to see instead?

As I said, I don't know windows-1252 well and I don't know whether the
current ctext can encode all windows-1252 characters. I would like to
know the correct information about this.

--Kazu


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [mew-int 01590] Re: windows 1252
  2003-11-04 15:55                   ` [mew-int 01591] " Kazu Yamamoto
@ 2003-11-04 17:04                     ` Stefan Monnier
  2003-11-04 18:45                     ` Stephen J. Turnbull
  1 sibling, 0 replies; 29+ messages in thread
From: Stefan Monnier @ 2003-11-04 17:04 UTC (permalink / raw)
  Cc: mew-int, emacs-devel

> As I said, I don't know windows-1252 well and I don't know whether the
> current ctext can encode all windows-1252 characters. I would like to
> know the correct information about this.

Huh?  I thought your question was about a real problem.
What makes you think there's any problem with windows-1252, then ?


        Stefan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [mew-int 01590] Re: windows 1252
  2003-11-04 15:55                   ` [mew-int 01591] " Kazu Yamamoto
  2003-11-04 17:04                     ` [mew-int 01590] " Stefan Monnier
@ 2003-11-04 18:45                     ` Stephen J. Turnbull
  2003-11-05  1:59                       ` [mew-int 01594] " Kazu Yamamoto
  1 sibling, 1 reply; 29+ messages in thread
From: Stephen J. Turnbull @ 2003-11-04 18:45 UTC (permalink / raw)
  Cc: mew-int, emacs-devel

>>>>> "Kazu" == Kazu Yamamoto <(山本和彦) <kazu@iijlab.net>> writes:

    >> It's unfortunate, but I guess it makes sense.  It should be
    >> possible to make ctext-with-extensions work for your case.

    Kazu> To support a new character set in ctext, we only need to
    Kazu> register a new escape sequence.

You don't even need to do that with an extended segment.  The
Windows-125x sets are all IANA-registered, which should be enough for
global uniqueness.  To represent the text, you just use the name of
the character set: ESC % / 1 <M> <L> Windows-1252 STX ... where <M>
and <L> encode the length of the segment and ESC and STX are the ASCII
control characters 0x1B and 0x02.
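
As a rough sketch of that layout in Python: the function name is made
up for this example, and the length rule (M and L carry the high bit,
with length = (M-128)*128 + (L-128) counting every byte after L,
including the charset name and STX) is my reading of the X Compound
Text specification, so treat it as an assumption.

```python
ESC = b"\x1b"
STX = b"\x02"

def extended_segment(charset_name: str, data: bytes) -> bytes:
    """Wrap single-byte text in an ESC % / 1 M L name STX data segment."""
    payload = charset_name.encode("ascii") + STX + data
    if len(payload) >= 128 * 128:
        raise ValueError("segment too long for a two-byte length")
    m = 0x80 + len(payload) // 128   # high 7 bits of the length
    l = 0x80 + len(payload) % 128    # low 7 bits of the length
    return ESC + b"%/1" + bytes([m, l]) + payload

# One Windows-1252 byte (0x80, the euro sign) passed through as itself.
segment = extended_segment("Windows-1252", b"\x80")
print(segment)
```

The data bytes are copied unchanged, which is why no new escape
sequence has to be registered for each charset.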

    Kazu> As I said, I don't know windows-1252 well and I don't know
    Kazu> whether the current ctext can encode all windows-1252 characters.

ctext can, because in the extended segment the characters will be
represented as themselves.  Whether Mule can or not is a different
story.  However, I'm fairly sure that all of the characters that
Windows 125x put into the C1 space are encodable by Mule.  See

http://www.microsoft.com/globaldev/reference/sbcs/1252.htm

for example.

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [mew-int 01594] Re: windows 1252
  2003-11-04 18:45                     ` Stephen J. Turnbull
@ 2003-11-05  1:59                       ` Kazu Yamamoto
  2003-11-05  5:00                         ` [mew-int 01593] " Stephen J. Turnbull
  2003-11-07  7:28                         ` [mew-int 01597] " Kenichi Handa
  0 siblings, 2 replies; 29+ messages in thread
From: Kazu Yamamoto @ 2003-11-05  1:59 UTC (permalink / raw)


Hello Stephen,

# ordering was changed.

> ctext can, because in the extended segment the characters will be
> represented as themselves.  Whether Mule can or not is a different
> story.  However, I'm fairly sure that all of the characters that
> Windows 125x put into the C1 space are encodable by Mule.  See
> 
> http://www.microsoft.com/globaldev/reference/sbcs/1252.htm
> 
> for example.

Thank you for this information.

I wrote the bytes 0x80-0xff to a file and had Emacs read it as Windows
1252.

Q1) According to the page above, 0x8f is undefined, and 0x9e is
    defined as LATIN SMALL LETTER Z WITH CARON.

    But Emacs 21.3.50 treated 0x8f as LATIN SMALL LETTER Z WITH CARON
    and 0x9e as undefined.

    Is this a bug?

> You don't even need to do that with an extended segment.  The
> Windows-125x sets are all IANA-registered, which should be enough for
> global uniqueness.  To represent the text, you just use the name of
> the character set: ESC % / 1 <M> <L> Windows-1252 STX ... where <M>
> and <L> encode the length of the segment and ESC and STX are the ASCII
> control characters 0x1B and 0x02.

I saved the buffer as ctext. The resulting file is shown below. All
characters in Windows 1252 can be encoded with ctext. :-)

Q2) However the encoding is different from the one above. Is this
    encoding correct?

Note that I verified that Emacs can read the ctext file correctly.

--Kazu

ESC $ - 1 0xf4 0xcc ESC - A 
ESC $ - 1 0xf2 0xfa ESC - A 
ESC $ - 1 0xa1 0xd2 ESC - A 
ESC $ - 1 0xf2 0xfe ESC - A 
ESC $ - 1 0xf3 0xa6 ESC - A 
ESC $ - 1 0xf3 0xa0 ESC - A 
ESC $ - 1 0xf3 0xa1 ESC - A 
ESC $ - 1 0xa4 0xe6 ESC - A 
ESC $ - 1 0xf3 0xb0 ESC - A 
ESC $ - 1 0xa1 0xa0 ESC - A 
ESC $ - 1 0xf3 0xb9 ESC - A 
ESC $ - 1 0xa0 0xf2 ESC - A 
ESC $ - 1 0xa1 0xbd ESC - A 
ESC $ - 1 0xa1 0xbe ESC - A 
ESC $ - 1 0xf2 0xf8 ESC - A 
ESC $ - 1 0xf2 0xf9 ESC - A 
ESC $ - 1 0xf2 0xfc ESC - A 
ESC $ - 1 0xf2 0xfd ESC - A 
ESC $ - 1 0xf3 0xa2 ESC - A 
ESC $ - 1 0xf2 0xf3 ESC - A 
ESC $ - 1 0xf2 0xf4 ESC - A 
ESC $ - 1 0xa4 0xfc ESC - A 
ESC $ - 1 0xf5 0xe2 ESC - A 
ESC $ - 1 0xa1 0xa1 ESC - A 
ESC $ - 1 0xf3 0xba ESC - A 
ESC $ - 1 0xa0 0xf3 ESC - A 
ESC $ - 1 0xa1 0xb8 ESC - A 
0xa0 
0xa1 
...


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [mew-int 01593] Re: windows 1252
  2003-11-05  1:59                       ` [mew-int 01594] " Kazu Yamamoto
@ 2003-11-05  5:00                         ` Stephen J. Turnbull
  2003-11-07  7:30                           ` Kenichi Handa
  2003-11-07  7:28                         ` [mew-int 01597] " Kenichi Handa
  1 sibling, 1 reply; 29+ messages in thread
From: Stephen J. Turnbull @ 2003-11-05  5:00 UTC (permalink / raw)
  Cc: mew-int, emacs-devel

>>>>> "Kazu" == Kazu Yamamoto <(山本和彦) <kazu@iijlab.net>> writes:

    Kazu> Q1) According to the page above, 0x8f is undefined, and 0x9e
    Kazu> is defined as LATIN SMALL LETTER Z WITH CARON.

    Kazu>     But Emacs 21.3.50 treated 0x8f as LATIN SMALL LETTER Z
    Kazu> WITH CARON and 0x9e as undefined.

    Kazu>     Is this a bug?

Yes.  CP1252.TXT from the Unicode Consortium also has 0x9E as LATIN
SMALL LETTER Z WITH CARON.
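
The same check can be made against Python's own cp1252 codec, which is
independent of Emacs.  This is only a cross-check sketch; the helper
name is invented for the example.

```python
import unicodedata

def cp1252_name(byte_value: int) -> str:
    """Return the Unicode name for one cp1252 byte, or 'undefined'."""
    try:
        ch = bytes([byte_value]).decode("cp1252")
    except UnicodeDecodeError:
        return "undefined"
    return unicodedata.name(ch)

for b in (0x8E, 0x8F, 0x9E):
    print(hex(b), cp1252_name(b))
```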

    Kazu> Q2) However the encoding is different from the one above. Is
    Kazu> this encoding correct?

In the encoding below, "ESC $ - 1" selects a multibyte private charset in the
GR register, and "ESC - A" then switches back to Latin-1.  Since
Windows 1252 is just Latin-1 in 0xA0--0xFF, this is the expected
result.  I don't recognize the encoding offhand, but I guess it's some
transform of Unicode, probably the mule-unicode charset.
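
A hedged Python sketch of that guess: it assumes the two GR bytes index
a 96x96 table that starts at U+0100, i.e. code point = 0x100 +
96*(b1 - 0xA0) + (b2 - 0xA0).  The formula is inferred from the byte
pairs in the listing, not taken from the Emacs sources, and the
function name is invented.

```python
def mule_unicode_0100_24ff(b1: int, b2: int) -> str:
    """Map a two-byte GR pair to a character (assumed 96x96 layout)."""
    if not (0xA0 <= b1 <= 0xFF and 0xA0 <= b2 <= 0xFF):
        raise ValueError("both bytes must be in 0xA0-0xFF")
    return chr(0x0100 + (b1 - 0xA0) * 96 + (b2 - 0xA0))

# First three byte pairs from the attached ctext listing.
pairs = [(0xF4, 0xCC), (0xF2, 0xFA), (0xA1, 0xD2)]
decoded = "".join(mule_unicode_0100_24ff(a, b) for a, b in pairs)
print([hex(ord(c)) for c in decoded])
```

Under this assumption the first three pairs come out as the characters
that Windows 1252 assigns to 0x80, 0x82 and 0x83.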

    Kazu> Note that I verified that Emacs can read the ctext file
    Kazu> correctly.

Then the obvious advice is that mew should just use that encoding, and
recommend GNU Emacs to those who insist on Windows 1252 (etc) instead
of ISO 8859/1.  I doubt that XEmacs will use this private charset;
it's really too late to add to 21.4, and 21.5 will take Stefan's
advice and recommend Unicode (UTF-8) for this purpose.  We recognize
the backward compatibility problem that mew faces, but prefer all
standard-with-XEmacs external encodings to be standard to improve
portability to other apps, and to leave all ISO 2022 private charsets
to the user.  It's a tough choice.

The only real question is whether emacs-unicode will continue to support the
mule-unicode charsets as external encodings.  I suppose so, but you'd
have to ask Handa-san and/or Dave Love.  If not, then I suggest you
follow Stefan's advice and start moving to UTF-8 immediately.

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [mew-int 01596] Re: windows 1252
  2003-11-04  2:13               ` [mew-int 01586] " Kazu Yamamoto
  2003-11-04  5:55                 ` [mew-int 01585] " Eli Zaretskii
  2003-11-04 15:13                 ` [mew-int 01590] " Stefan Monnier
@ 2003-11-07  7:13                 ` Kenichi Handa
  2003-11-10  7:11                   ` [mew-int 01607] " Kazu Yamamoto
  2 siblings, 1 reply; 29+ messages in thread
From: Kenichi Handa @ 2003-11-07  7:13 UTC (permalink / raw)
  Cc: emacs-devel, mew-int

I'm sorry for the late response on this thread.

First, I want to clarify these things:

(1) windows-1252

This is actually not a charset but a coding system in
Emacs.  When Emacs reads a file with this coding system, it
decodes each byte into a character from one of these character sets:
	ascii, latin-iso8859-1, mule-unicode-0100-24ff

(2) ctext (alias of compound-text)

On conversion, it is not fully compatible with the
specification of X Compound Text because it encodes any
Emacs character, using a designation sequence for
private character sets (please note that all Emacs charsets
have an iso-final-char).  So, Big5 characters are preceded by
ESC $ ( 0 or 1, mule-unicode-0100-24ff characters are
preceded by ESC - 1.

(3) ctext-with-extensions (alias of compound-text-with-extensions)

It can handle several kinds of "extended segment".  On
decoding, it handles ESC % / N M L ... ^b for what is listed in
ctext-non-standard-encoding-alist, and ESC % G ... ESC % @
for UTF-8.  On encoding, it does two-pass encoding: first
by `compound-text', then it re-encodes what was encoded by a
designation sequence listed in
ctext-non-standard-designations-alist using the "extended
segment".  Currently only ESC $ ( 0 and ESC $ ( 1 are
listed.  Thus only Big5 is encoded using the "extended
segment".

As for the Mew case, I think the following would work well.

When it runs under the current Emacs, keep using ctext but
add a coding tag to the file.  Emacs should be able to
encode/decode all Emacs characters.

When it runs under the emacs-unicode version, on writing the
file, if all the characters can be encoded by ctext, keep
using it.  If not (because, in emacs-unicode, some characters
don't belong to any charset that has an iso-final-char), use
utf-8.  In both cases, add a coding tag.  On reading,
check the coding tag first.  If there is no coding tag, read by
ctext; otherwise, read by the coding system specified in the
tag.
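
The read/write decision above can be sketched roughly as follows in
Python.  The function names and the Emacs-style "-*- coding: ... -*-"
first-line tag are assumptions for illustration; this is not how Mew
actually stores its cache.

```python
import re

# Emacs-style tag such as "-*- coding: utf-8 -*-" on the first line.
_CODING_TAG = re.compile(rb"coding:\s*([A-Za-z0-9_.-]+)")

def coding_for_read(first_line: bytes, default: str = "ctext") -> str:
    """Use the tagged coding system, falling back to ctext for old caches."""
    match = _CODING_TAG.search(first_line)
    return match.group(1).decode("ascii") if match else default

def coding_for_write(ctext_can_encode_everything: bool) -> str:
    """Keep ctext while it suffices; otherwise switch to utf-8."""
    return "ctext" if ctext_can_encode_everything else "utf-8"

def tag_line(coding: str) -> bytes:
    """Build the first line that records the chosen coding system."""
    return b"-*- coding: " + coding.encode("ascii") + b" -*-\n"

print(coding_for_read(tag_line(coding_for_write(False))))
```

Caches written before the tag existed have no tag and are still read as
ctext, so old files keep working.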

By the way,

> The one-and-only coding-system which, I found, meets the requirements
> above is 'ctext.

I think iso-latin-1-with-esc also meets your requirements.

---
Ken'ichi HANDA
handa@m17n.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [mew-int 01597] Re: windows 1252
  2003-11-05  1:59                       ` [mew-int 01594] " Kazu Yamamoto
  2003-11-05  5:00                         ` [mew-int 01593] " Stephen J. Turnbull
@ 2003-11-07  7:28                         ` Kenichi Handa
  2003-11-07  8:21                           ` [mew-int 01599] " Kazu Yamamoto
  1 sibling, 1 reply; 29+ messages in thread
From: Kenichi Handa @ 2003-11-07  7:28 UTC (permalink / raw)
  Cc: emacs-devel, mew-int

In article <20031105.105912.246010891.kazu@iijlab.net>, Kazu Yamamoto (山本和彦) <kazu@iijlab.net> writes:
> I wrote the bytes 0x80-0xff to a file and had Emacs read it as Windows
> 1252.

> Q1) According to the page above, 0x8f is undefined, and 0x9e is
>     defined as LATIN SMALL LETTER Z WITH CARON.

>     But Emacs 21.3.50 treated 0x8f as LATIN SMALL LETTER Z WITH CARON
>     and 0x9e as undefined.

>     Is this a bug?

Yes, it's a bug.  Thank you for noticing it.  I've just
installed a fix.

---
Ken'ichi HANDA
handa@m17n.org


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [mew-int 01593] Re: windows 1252
  2003-11-05  5:00                         ` [mew-int 01593] " Stephen J. Turnbull
@ 2003-11-07  7:30                           ` Kenichi Handa
  0 siblings, 0 replies; 29+ messages in thread
From: Kenichi Handa @ 2003-11-07  7:30 UTC (permalink / raw)
  Cc: mew-int, kazu, emacs-devel

In article <87he1jquir.fsf@tleepslib.sk.tsukuba.ac.jp>, "Stephen J. Turnbull" <stephen@xemacs.org> writes:
> The only real question is whether emacs-unicode will continue to support the
> mule-unicode charsets as external encodings.  I suppose so, but you'd
> have to ask Handa-san and/or Dave Love.

Yes, emacs-unicode still can read/write mule-unicode
charsets.  Actually it supports all current charsets, and
the coding system emacs-mule is also still supported.

---
Ken'ichi HANDA
handa@m17n.org

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [mew-int 01599] Re: windows 1252
  2003-11-07  7:28                         ` [mew-int 01597] " Kenichi Handa
@ 2003-11-07  8:21                           ` Kazu Yamamoto
  0 siblings, 0 replies; 29+ messages in thread
From: Kazu Yamamoto @ 2003-11-07  8:21 UTC (permalink / raw)


From: Kenichi Handa <handa@m17n.org>
Subject: [mew-int 01597] Re: windows 1252

> Yes, it's a bug.  Thank you for noticing it.  I've just
> installed a fix.

I verified this. Thank you for fixing it.

--Kazu


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [mew-int 01607] Re: windows 1252
  2003-11-07  7:13                 ` [mew-int 01596] " Kenichi Handa
@ 2003-11-10  7:11                   ` Kazu Yamamoto
  2003-11-10  7:42                     ` [mew-int 01608] " Kenichi Handa
  0 siblings, 1 reply; 29+ messages in thread
From: Kazu Yamamoto @ 2003-11-10  7:11 UTC (permalink / raw)


Hello Handa-san,

Thank you for your explanation.

> (2) ctext (alias of compound-text)
> 
> On conversion, it is not fully compatible with the
> specification of X Compound Text because it encodes any
> Emacs character, using a designation sequence for
> private character sets (please note that all Emacs charsets
> have an iso-final-char).  So, Big5 characters are preceded by
> ESC $ ( 0 or 1, mule-unicode-0100-24ff characters are
> preceded by ESC - 1.
              ^^^^^^^

Let me clarify. 

Q1) It seems to me that Emacs encodes mule-unicode-0100-24ff with ESC
$ - 1. But the explanation above says ESC - 1. Which one is correct
according to Emacs's spec?

Q2) I don't think it's a good idea to disclose the internal
representation "mule-unicode-0100-24ff" in a file. According to the
spec of ctext provided with XFree86, it has an extension for UTF-8:

---
7.  The UTF-8 encoding

Unicode characters that are not contained in one of the
approved standard encodings can be encoded using the UTF-8
encoding. The following escape sequences are used:

     01/11 02/05 04/07   switch into UTF-8 mode
     01/11 02/05 04/00   return from UTF-8 mode

The first is the ISO registered sequence for UTF-8 (ISO-IR-196),
the second is the ISO-2022 ``standard return''
sequence. While in UTF-8 mode, the UTF-8 encoding replaces
the currently designated GL and GR encodings. After return
from UTF-8 mode, the previously designated GL and GR encodings
are reactivated.
---

How about using this to encode mule-unicode-0100-24ff?
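
[Editorial illustration, not part of the original message.]  The
column/row notation in the quoted section maps to bytes as
column * 16 + row, so the two sequences are ESC % G and ESC % @.
The following is a minimal Python sketch under that reading; the
helper names `colrow` and `wrap_utf8` are made up for this example
and are not Emacs or X11 functions.

```python
def colrow(notation: str) -> bytes:
    """Convert ISO 2022 column/row notation (e.g. '01/11 02/05') to bytes."""
    out = bytearray()
    for pair in notation.split():
        col, row = pair.split("/")
        out.append(int(col) * 16 + int(row))  # 01/11 -> 0x1B (ESC)
    return bytes(out)


# The two sequences quoted from section 7 of the XFree86 ctext spec.
UTF8_ENTER = colrow("01/11 02/05 04/07")  # ESC % G
UTF8_LEAVE = colrow("01/11 02/05 04/00")  # ESC % @


def wrap_utf8(text: str) -> bytes:
    """Hypothetical helper: bracket UTF-8 bytes with the enter/leave sequences."""
    return UTF8_ENTER + text.encode("utf-8") + UTF8_LEAVE
```

A receiver that does not know the XFree86 extension would see the
bracketed bytes as an unknown escape sequence, which is the
interoperability concern raised later in the thread.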

> When it runs under the emacs-unicode version, on writing the
> file, if all the characters can be encoded by ctext, keep
> using it.  If not (because, in emacs-unicode, some characters
> don't belong to any charset that has iso-final-char), use
> utf-8.  And in both cases, add a coding tag.  On reading,
> check the coding tag at first.  If no coding tag, read by
> ctext, otherwise, read by the coding system specified in the
> tag.

I remember that, some years ago, Handa-san said to me, "The current
Emacs is using mule-unicode but will migrate to Unicode".  But I don't
know what exactly emacs-unicode refers to. Which versions? Or
a different source tree?

--Kazu



* [mew-int 01608] Re: windows 1252
  2003-11-10  7:11                   ` [mew-int 01607] " Kazu Yamamoto
@ 2003-11-10  7:42                     ` Kenichi Handa
  2003-11-12 16:36                       ` [mew-int 01596] " Stephen J. Turnbull
  0 siblings, 1 reply; 29+ messages in thread
From: Kenichi Handa @ 2003-11-10  7:42 UTC (permalink / raw)
  Cc: emacs-devel, mew-int

In article <20031110.161123.49979847.kazu@iijlab.net>, Kazu Yamamoto (山本和彦) <kazu@iijlab.net> writes:
>>  have an iso-final-char).  So, Big5 characters are preceded by
>>  ESC $ ( 0 or 1, mule-unicode-0100-24ff characters are
>>  preceded by ESC - 1.
>               ^^^^^^^

> Let me clarify. 

> Q1) It seems to me that Emacs encodes mule-unicode-0100-24ff with ESC
> $ - 1. But the explanation above says ESC - 1. Which one is correct as
> Emacs's spec?

Sorry, mine was a typo, ESC $ - 1 is correct because
mule-unicode-0100-24ff is treated as a 96x96 charset.
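
[Editorial illustration, not part of the original message.]  The
difference between ESC - 1 and ESC $ - 1 follows from the ISO 2022
designation syntax: a 96-character set designated to G1 uses the
intermediate byte "-", and a multibyte set adds "$" in front.  A
minimal Python sketch of that rule, assuming only G1 designations
are of interest; the function name `designate_g1` is hypothetical.

```python
ESC = b"\x1b"


def designate_g1(final: str, chars: int, dimension: int) -> bytes:
    """Build an ISO 2022 escape sequence designating a charset to G1.

    chars is 94 or 96 (characters per dimension); dimension is 1 for
    single-byte sets and 2 or more for multibyte sets.
    """
    if chars not in (94, 96):
        raise ValueError("chars must be 94 or 96")
    if dimension < 1:
        raise ValueError("dimension must be at least 1")
    intermediate = b")" if chars == 94 else b"-"
    multibyte = b"$" if dimension > 1 else b""
    return ESC + multibyte + intermediate + final.encode("ascii")


# mule-unicode-0100-24ff: final byte "1", treated as a 96x96 charset.
MULE_UNICODE = designate_g1("1", 96, 2)   # ESC $ - 1
# ISO-8859-15 right half: final byte "b", a single-byte 96-charset.
LATIN9 = designate_g1("b", 96, 1)         # ESC - b
```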

> Q2) I don't think it's a good idea to disclose the internal
> representation "mule-unicode-0100-24ff" into a file. According to the
> spec of ctext provided with XFree86, it has extension for UTF-8:

> ---
> 7.  The UTF-8 encoding
[...]
> How about using this to encode mule-unicode-0100-24ff?

That's a good idea.   I'll work on it.

> I remember that, some years ago, Handa-san said to me, "The current
> Emacs is using mule-unicode but will migrate to Unicode".  But I don't
> know what exactly emacs-unicode refers to. Which versions? Or
> a different source tree?

The latest emacs-unicode is available in the
"emacs-unicode-2" branch of CVS.

---
Ken'ichi HANDA
handa@m17n.org



* Re: [mew-int 01596] Re: windows 1252
  2003-11-10  7:42                     ` [mew-int 01608] " Kenichi Handa
@ 2003-11-12 16:36                       ` Stephen J. Turnbull
  2003-11-13  1:01                         ` Kenichi Handa
  0 siblings, 1 reply; 29+ messages in thread
From: Stephen J. Turnbull @ 2003-11-12 16:36 UTC (permalink / raw)
  Cc: mew-int, kazu, emacs-devel

>>>>> "Kenichi" == Kenichi Handa <handa@m17n.org> writes:

    >> 7.  The UTF-8 encoding
    Kenichi> [...]
    >> How about using this to encode mule-unicode-0100-24ff?

    Kenichi> That's a good idea.  I'll work on it.

AFAIK this is an XFree86-only extension.  As of X11R6.4 such
extensions were forbidden in X.org Compound Text Encoding.  Is it
really a good idea?

On the other hand, even if extended segments are ugly, we must support
extended segments to handle ISO-8859-15 selections on XFree86.  At
least it is a standard mechanism on all versions of X, going back to
at least X11R5.


-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.


* Re: [mew-int 01596] Re: windows 1252
  2003-11-12 16:36                       ` [mew-int 01596] " Stephen J. Turnbull
@ 2003-11-13  1:01                         ` Kenichi Handa
  2003-11-13 16:32                           ` Stephen J. Turnbull
  2003-11-13 19:49                           ` Eli Zaretskii
  0 siblings, 2 replies; 29+ messages in thread
From: Kenichi Handa @ 2003-11-13  1:01 UTC (permalink / raw)
  Cc: mew-int, kazu, d.love, emacs-devel

In article <87ekwdilwx.fsf@tleepslib.sk.tsukuba.ac.jp>, "Stephen J. Turnbull" <stephen@xemacs.org> writes:
>>>>>>  "Kenichi" == Kenichi Handa <handa@m17n.org> writes:
>>>  7.  The UTF-8 encoding
Kenichi>  [...]
>>>  How about using this to encode mule-unicode-0100-24ff?

Kenichi>  That's a good idea.  I'll work on it.

> AFAIK this is an XFree86-only extension.  As of X11R6.4 such
> extensions were forbidden in X.org Compound Text Encoding.  Is it
> really a good idea?

I think so.  Currently we encode mule-unicode-0100-24ff by
ESC $ - 1 ...  which is also an invalid code, and only Emacs
can decode it.  If we use UTF-8 encoding, more clients can
decode it.

> On the other hand, even if extended segments are ugly, we must support
> extended segments to handle ISO-8859-15 selections on XFree86.  At
> least it is a standard mechanism on all versions of X, going back to
> at least X11R5.

Emacs decodes the extended segment for ISO-8859-15 correctly,
but doesn't use one for encoding.  According to Dave, Latin-9
(ISO-8859-15) users don't want it.  See this code in
mule.el.

;; If you add charsets here, be sure to modify the regexp used by
;; ctext-pre-write-conversion to look up non-standard charsets.
(defvar ctext-non-standard-designations-alist
  '(("$(0" . (big5 "big5-0" 2))
    ("$(1" . (big5 "big5-0" 2))
    ;; The following are actually standard; generating extended
    ;; segments for them is wrong and screws e.g. Latin-9 users.
    ;; 8859-{10,13,16} aren't Emacs charsets anyhow.  -- fx
;;     ("-V"  . (t "iso8859-10" 1))
;;     ("-Y"  . (t "iso8859-13" 1))
;;     ("-_"  . (t "iso8859-14" 1))
;;     ("-b"  . (t "iso8859-15" 1))
;;     ("-f"  . (t "iso8859-16" 1))

I think Dave is correct because the CTEXT spec has this
paragraph.

	Extended segments are not to be used for any character set
	encoding that can be constructed from a GL/GR pair of
	approved standard encodings. For example, it is incorrect to
	use an extended segment for any of the ISO 8859 family of
	encodings.

---
Ken'ichi HANDA
handa@m17n.org


* Re: [mew-int 01596] Re: windows 1252
  2003-11-13  1:01                         ` Kenichi Handa
@ 2003-11-13 16:32                           ` Stephen J. Turnbull
  2003-11-14  2:57                             ` Kenichi Handa
  2003-11-13 19:49                           ` Eli Zaretskii
  1 sibling, 1 reply; 29+ messages in thread
From: Stephen J. Turnbull @ 2003-11-13 16:32 UTC (permalink / raw)
  Cc: stephen, mew-int, kazu, d.love, emacs-devel

>>>>> "Kenichi" == Kenichi Handa <handa@m17n.org> writes:

    Kenichi> In article <87ekwdilwx.fsf@tleepslib.sk.tsukuba.ac.jp>,
    Kenichi> "Stephen J. Turnbull" <stephen@xemacs.org> writes:
    >>>>>>> "Kenichi" == Kenichi Handa <handa@m17n.org> writes:
    >>>> 7.  The UTF-8 encoding
    Kenichi> [...]
    >>>> How about using this to encode mule-unicode-0100-24ff?

    Kenichi> That's a good idea.  I'll work on it.

    >> AFAIK this is an XFree86-only extension.  As of X11R6.4 such
    >> extensions were forbidden in X.org Compound Text Encoding.  Is
    >> it really a good idea?

    Kenichi> I think so.  Currently we encode mule-unicode-0100-24ff
    Kenichi> by ESC $ - 1 ...  which is also an invalid code, and only
    Kenichi> Emacs can decode it.  If we use UTF-8 encoding, more
    Kenichi> clients can decode it.

I certainly agree that UTF-8 should be used for encoding.  The
question is should the DOCS UTF-8 (XFree86 only, I fear) sequence be
used to invoke it, or should the DOCS private final byte UTF-8 (X11
standard extended segment) be used.

XEmacs will follow what GNU does on this; there's no point in having
yet another unneeded incompatibility.  But I prefer to follow the
general standard, not XFree86, especially where the XFree86 practice
has always been forbidden by the general X11 standard.

    Kenichi> Emacs decodes extended segment for ISO-8859-15 correctly,
    Kenichi> but doesn't use it for encoding.  According to Dave,
    Kenichi> Latin-9 (ISO-8859-15) users don't want it.  See this code
    Kenichi> in mule.el.

I know it violates the CTEXT standard but many Linux apps give it to
you anyway.

It's interesting that they happily take the standard codes.  That's
useful to know.




* Re: [mew-int 01596] Re: windows 1252
  2003-11-13  1:01                         ` Kenichi Handa
  2003-11-13 16:32                           ` Stephen J. Turnbull
@ 2003-11-13 19:49                           ` Eli Zaretskii
  2003-11-14  3:39                             ` [mew-int 01621] " Kenichi Handa
  1 sibling, 1 reply; 29+ messages in thread
From: Eli Zaretskii @ 2003-11-13 19:49 UTC (permalink / raw)
  Cc: mew-int, kazu, emacs-devel

> Date: Thu, 13 Nov 2003 10:01:57 +0900 (JST)
> From: Kenichi Handa <handa@m17n.org>
> 
> (defvar ctext-non-standard-designations-alist
>   '(("$(0" . (big5 "big5-0" 2))
>     ("$(1" . (big5 "big5-0" 2))
>     ;; The following are actually standard; generating extended
>     ;; segments for them is wrong and screws e.g. Latin-9 users.
>     ;; 8859-{10,13,16} aren't Emacs charsets anyhow.  -- fx
> ;;     ("-V"  . (t "iso8859-10" 1))
> ;;     ("-Y"  . (t "iso8859-13" 1))
> ;;     ("-_"  . (t "iso8859-14" 1))
> ;;     ("-b"  . (t "iso8859-15" 1))
> ;;     ("-f"  . (t "iso8859-16" 1))
> 
> I think Dave is correct because the CTEXT spec has this
> paragraph.
> 
> 	Extended segments are not to be used for any character set
> 	encoding that can be constructed from a GL/GR pair of
> 	approved standard encodings. For example, it is incorrect to
> 	use an extended segment for any of the ISO 8859 family of
> 	encodings.

For the record, when I worked on this code, I added the ISO 8859
charsets mentioned above because the then official version of the
CTEXT spec did not include them in the list of approved standard
encodings.  So, as far as that CTEXT spec was concerned, these
charsets were not members of the ISO 8859 family.


* Re: [mew-int 01596] Re: windows 1252
  2003-11-13 16:32                           ` Stephen J. Turnbull
@ 2003-11-14  2:57                             ` Kenichi Handa
  2003-11-14 11:20                               ` Stephen J. Turnbull
  0 siblings, 1 reply; 29+ messages in thread
From: Kenichi Handa @ 2003-11-14  2:57 UTC (permalink / raw)
  Cc: mew-int, kazu, d.love, emacs-devel

In article <87he18grg9.fsf@tleepslib.sk.tsukuba.ac.jp>, "Stephen J. Turnbull" <stephen@xemacs.org> writes:
> I certainly agree that UTF-8 should be used for encoding.  The
> question is should the DOCS UTF-8 (XFree86 only, I fear) sequence be
> used to invoke it, or should the DOCS private final byte UTF-8 (X11
> standard extended segment) be used.

I don't understand what "DOCS private final byte UTF-8"
means.  Do you mean using the following for UTF-8?

6.  Non-Standard Character Set Encodings
[...]
     01/11 02/05 02/15 03/00 M L   variable number of octets per character

Do you know whether any application sends or receives
such an encoding?  If so, we now have three methods for
transferring UTF-8 in inter-client communication (the above,
the XFree86-only UTF-8 encoding using ESC % G ..., and
UTF8_STRING instead of CTEXT), and there's no way to know
which receiver accepts which encoding.  Sigh...
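
[Editorial illustration, not part of the original message.]  For
comparison with the ESC % G form, here is a rough Python sketch of
the extended-segment form quoted above (ESC % / 0 M L).  It relies
on my reading of the Compound Text spec, which should be checked
before relying on it: M and L give the number of remaining octets
as (M - 128) * 128 + (L - 128), and those octets are the encoding
name, an STX (00/02) byte, and then the data.  The function name
`extended_segment` is hypothetical, and the name "UTF-8" is only an
example; the thread notes that it is not in the X registry.

```python
ESC = 0x1B
STX = 0x02


def extended_segment(name: str, data: bytes) -> bytes:
    """Wrap data in a variable-width extended segment (final byte 03/00)."""
    body = name.encode("latin-1") + bytes([STX]) + data
    if len(body) >= 128 * 128:
        raise ValueError("segment too long for a two-byte length field")
    m = len(body) // 128 + 128   # high part of the length
    l = len(body) % 128 + 128    # low part of the length
    # 02/05 = '%', 02/15 = '/', 03/00 = '0'
    return bytes([ESC, 0x25, 0x2F, 0x30, m, l]) + body


segment = extended_segment("UTF-8", "\u00e9".encode("utf-8"))
```

The same text can therefore travel in at least two incompatible
wrappers inside COMPOUND_TEXT, which is why a receiver's support
cannot be assumed.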

Kenichi>  Emacs decodes extended segment for ISO-8859-15 correctly,
Kenichi>  but doesn't use it for encoding.  According to Dave,
Kenichi>  Latin-9 (ISO-8859-15) users don't want it.  See this code
Kenichi>  in mule.el.

> I know it violates the CTEXT standard but many Linux apps give it to
> you anyway.

> It's interesting that they happily take the standard codes.  That's
> useful to know.

I've just confirmed that, in an iso-8859-15 locale, an XFree86
client (gnome-terminal) sends iso-8859-15 chars in an extended
segment, not in the standard encoding (i.e. ESC - b ...),
but accepts iso-8859-15 in the standard encoding.

---
Ken'ichi HANDA
handa@m17n.org


* [mew-int 01621] Re: windows 1252
  2003-11-13 19:49                           ` Eli Zaretskii
@ 2003-11-14  3:39                             ` Kenichi Handa
  0 siblings, 0 replies; 29+ messages in thread
From: Kenichi Handa @ 2003-11-14  3:39 UTC (permalink / raw)
  Cc: mew-int, kazu, emacs-devel, d.love

In article <9003-Thu13Nov2003214931+0200-eliz@elta.co.il>, "Eli Zaretskii" <eliz@elta.co.il> writes:
>>  I think Dave is correct because the CTEXT spec has this
>>  paragraph.
>>  
>>  	Extended segments are not to be used for any character set
>>  	encoding that can be constructed from a GL/GR pair of
>>  	approved standard encodings. For example, it is incorrect to
>>  	use an extended segment for any of the ISO 8859 family of
>>  	encodings.

> For the record, when I worked on this code, I added the ISO 8859
> charsets mentioned above because the then official version of the
> CTEXT spec did not include them in the list of approved standard
> encodings.  So, as far as that CTEXT spec was concerned, these
> charsets were not members of the ISO 8859 family.

Hmmm, I didn't understand the above paragraph the way you did,
but it seems that you are correct.  Dave, what do you think?

FYI, I found this section in the spec.

------------------------------------------------------------
10.  Extensions

There is no absolute requirement for a parser to deal with
anything but the particular encoding syntax defined in this
specification. However, it is possible that Compound Text
may be extended in the future, and as such it may be
desirable to construct the parser to handle 2022/6429 syntax
more generally.

There are two general formats covering all control sequences
that are expected to appear in extensions:

01/11 {I} F

     For this format, I is always in the range 02/00 to
     02/15, and F is always in the range 03/00 to 07/14.

[...]
If extensions to this specification are defined in the
future, then any string incorporating instances of such
extensions must start with one of the following control
sequences:

     01/11 02/03 V 03/00   ignoring extensions is OK
     01/11 02/03 V 03/01   ignoring extensions is not OK
[...]
------------------------------------------------------------

So, designating ISO-8859-15 by ESC - b (i.e. 01/11 {I} F)
without either of the last two ESC sequences explicitly
violates CTEXT even if CTEXT is extended in the future.
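
[Editorial illustration, not part of the original message.]  The
check described in the quoted section 10 can be sketched in Python
as follows.  This is only a sketch: the function name
`extension_prefix` is hypothetical, and the byte V is returned
as-is because the excerpt above does not state its allowed range.

```python
def extension_prefix(data: bytes):
    """Detect a leading 01/11 02/03 V 03/00 or 01/11 02/03 V 03/01.

    Returns (version_byte, may_ignore_extensions), or None when the
    string does not start with either control sequence.
    """
    if (
        len(data) >= 4
        and data[0] == 0x1B           # 01/11, ESC
        and data[1] == 0x23           # 02/03, '#'
        and data[3] in (0x30, 0x31)   # 03/00 or 03/01
    ):
        return data[2], data[3] == 0x30
    return None


# A string that designates ISO-8859-15 with ESC - b and nothing else
# carries no such prefix.
no_prefix = extension_prefix(b"\x1b-b\xa4")
```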

---
Ken'ichi HANDA
handa@m17n.org



* Re: [mew-int 01596] Re: windows 1252
  2003-11-14  2:57                             ` Kenichi Handa
@ 2003-11-14 11:20                               ` Stephen J. Turnbull
  2003-11-14 12:02                                 ` Kenichi Handa
  0 siblings, 1 reply; 29+ messages in thread
From: Stephen J. Turnbull @ 2003-11-14 11:20 UTC (permalink / raw)
  Cc: mew-int, kazu, d.love, emacs-devel

>>>>> "Kenichi" == Kenichi Handa <handa@m17n.org> writes:

    Kenichi> I don't understand what "DOCS private final byte UTF-8"
    Kenichi> means.  Do you mean using the following for UTF-8?

    Kenichi> 6.  Non-Standard Character Set Encodings

    Kenichi>      01/11 02/05 02/15 03/00 M L   variable number of octets per character

Yes.

    Kenichi> Do you know whether any application sends or
    Kenichi> receives such an encoding?

Good question.  Can't confirm at the moment; the only non-XFree86
system I have immediate access to is Solaris, and I can't get it into
any UTF-8 locale :-(.

According to the standard, apps that implement compound text must
implement extended segments, and if you implement utf-8 anyway, this
would be trivial to support.  But that doesn't mean they do.

    Kenichi> If so, we now have three methods for transferring UTF-8
    Kenichi> in inter-client communication (the above, the XFree86-only
    Kenichi> UTF-8 encoding using ESC % G ..., and UTF8_STRING instead

Isn't UTF8_STRING also XFree86 only?  The Solaris apps (xterm and
dtterm) I have tried don't seem to understand it.  (xterm's man page
says X11R5, how strange.)  

Also, after selecting purely ASCII text in an XFree86 UXTerm, I get

(get-selection 'PRIMARY)                => "Loretta Guarino Reid"
(get-selection 'PRIMARY 'COMPOUND_TEXT) => "Loretta Guarino Reid"
(get-selection 'PRIMARY 'UTF8_STRING)   => nil

but that might be an internal problem of XEmacs not knowing what to do
with the data.




* Re: [mew-int 01596] Re: windows 1252
  2003-11-14 11:20                               ` Stephen J. Turnbull
@ 2003-11-14 12:02                                 ` Kenichi Handa
  0 siblings, 0 replies; 29+ messages in thread
From: Kenichi Handa @ 2003-11-14 12:02 UTC (permalink / raw)
  Cc: mew-int, kazu, d.love, emacs-devel

In article <87vfpndwnl.fsf@tleepslib.sk.tsukuba.ac.jp>, "Stephen J. Turnbull" <stephen@xemacs.org> writes:
> According to the standard, apps that implement compound text must
> implement extended segments, and if you implement utf-8 anyway, this
> would be trivial to support.  But that doesn't mean they do.

The ctext spec says:

	The name of the encoding should be registered with the X
	Consortium to avoid conflicts and ...

But the "registry" file included in X11R6.6 contains
neither "UTF-8" nor "ISO-8859-15".

> Isn't UTF8_STRING also XFree86 only?

I don't think so.  The above "registry" file has this note
at the end.

[134]   In R6.6 X.Org is reserving the string UTF8_STRING for use as an ICCCM
        property type and selection target.  The ICCCM spec will be updated
        in a future release to fully specify UTF8_STRING.

---
Ken'ichi HANDA
handa@m17n.org


Thread overview: 29+ messages
     [not found] <20031029.160819.120233945.kazu@iijlab.net>
     [not found] ` <20031029.082403.193886873.wl@gnu.org>
     [not found]   ` <20031030.175736.39971315.kazu@iijlab.net>
2003-10-30 14:41     ` [mew-int 01581] Re: windows 1252 Werner LEMBERG
2003-10-31 11:04       ` [mew-int 01579] " Kenichi Handa
2003-10-31 12:39         ` [mew-int 01583] " Kazu Yamamoto
2003-11-01 15:36           ` [mew-int 01584] " Eli Zaretskii
2003-11-02  6:41             ` [mew-int 01582] " Stephen J. Turnbull
2003-11-04  2:13               ` [mew-int 01586] " Kazu Yamamoto
2003-11-04  5:55                 ` [mew-int 01585] " Eli Zaretskii
2003-11-04  6:13                   ` [mew-int 01587] " Kazu Yamamoto
2003-11-04  6:23                   ` [mew-int 01589] " Stephen J. Turnbull
2003-11-04 15:13                 ` [mew-int 01590] " Stefan Monnier
2003-11-04 15:55                   ` [mew-int 01591] " Kazu Yamamoto
2003-11-04 17:04                     ` [mew-int 01590] " Stefan Monnier
2003-11-04 18:45                     ` Stephen J. Turnbull
2003-11-05  1:59                       ` [mew-int 01594] " Kazu Yamamoto
2003-11-05  5:00                         ` [mew-int 01593] " Stephen J. Turnbull
2003-11-07  7:30                           ` Kenichi Handa
2003-11-07  7:28                         ` [mew-int 01597] " Kenichi Handa
2003-11-07  8:21                           ` [mew-int 01599] " Kazu Yamamoto
2003-11-07  7:13                 ` [mew-int 01596] " Kenichi Handa
2003-11-10  7:11                   ` [mew-int 01607] " Kazu Yamamoto
2003-11-10  7:42                     ` [mew-int 01608] " Kenichi Handa
2003-11-12 16:36                       ` [mew-int 01596] " Stephen J. Turnbull
2003-11-13  1:01                         ` Kenichi Handa
2003-11-13 16:32                           ` Stephen J. Turnbull
2003-11-14  2:57                             ` Kenichi Handa
2003-11-14 11:20                               ` Stephen J. Turnbull
2003-11-14 12:02                                 ` Kenichi Handa
2003-11-13 19:49                           ` Eli Zaretskii
2003-11-14  3:39                             ` [mew-int 01621] " Kenichi Handa
