unofficial mirror of emacs-devel@gnu.org 
* UCS-2BE
@ 2006-08-30 22:54 Juri Linkov
  2006-08-31  9:09 ` UCS-2BE Jason Rumney
  0 siblings, 1 reply; 23+ messages in thread
From: Juri Linkov @ 2006-08-30 22:54 UTC (permalink / raw)


Maybe I'm missing it, but I don't see the coding UCS-2BE supported
by Emacs (e.g. in a list of `describe-coding-system').  Is this true?
If so, why not support it?

-- 
Juri Linkov
http://www.jurta.org/emacs/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: UCS-2BE
  2006-08-30 22:54 UCS-2BE Juri Linkov
@ 2006-08-31  9:09 ` Jason Rumney
  2006-08-31 10:23   ` UCS-2BE Kenichi Handa
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Rumney @ 2006-08-31  9:09 UTC (permalink / raw)
  Cc: emacs-devel

Juri Linkov wrote:
> Maybe I'm missing it, but I don't see the coding UCS-2BE supported
> by Emacs (e.g. in a list of `describe-coding-system').  Is this true?
> If so, why not support it?
>
>   
Can Emacs (22) represent any of the characters that exist in UTF-16 but 
not UCS-2? If not, then it can just be an alias.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: UCS-2BE
  2006-08-31  9:09 ` UCS-2BE Jason Rumney
@ 2006-08-31 10:23   ` Kenichi Handa
  2006-08-31 10:39     ` UCS-2BE Jason Rumney
  0 siblings, 1 reply; 23+ messages in thread
From: Kenichi Handa @ 2006-08-31 10:23 UTC (permalink / raw)
  Cc: juri, emacs-devel

In article <44F6A74A.9040708@gnu.org>, Jason Rumney <jasonr@gnu.org> writes:

> Juri Linkov wrote:
> > Maybe I'm missing it, but I don't see the coding UCS-2BE supported
> > by Emacs (e.g. in a list of `describe-coding-system').  Is this true?
> > If so, why not support it?
> >   
> Can Emacs (22) represent any of the characters that exist in UTF-16 but 
> not UCS-2? If not, then it can just be an alias.

To my understanding, UCS-2 and UCS-4 are the names of
Character Encoding Forms (CEF), not Character Encoding Schemes
(CES), and as a CEF doesn't include a byte serialization
mechanism, it can't be a coding system.  Only a CES (UTF-XXX)
can be a coding system.

But, I don't know the definition of UCS-2BE.  Is it just a
limited UTF-16BE (limited to BMP)?  Where is it defined
officially?

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: UCS-2BE
  2006-08-31 10:23   ` UCS-2BE Kenichi Handa
@ 2006-08-31 10:39     ` Jason Rumney
  2006-08-31 10:55       ` UCS-2BE Kenichi Handa
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Rumney @ 2006-08-31 10:39 UTC (permalink / raw)
  Cc: juri, emacs-devel

Kenichi Handa wrote:
> But, I don't know the definition of UCS-2BE.  Is it just a
> limited UTF-16BE (limited to BMP)?  Where is it defined
> officially?
>
>   
It's not really authoritative, but there is some information here:

http://en.wikipedia.org/wiki/UTF-16#UCS-2

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: UCS-2BE
  2006-08-31 10:39     ` UCS-2BE Jason Rumney
@ 2006-08-31 10:55       ` Kenichi Handa
  2006-08-31 11:56         ` UCS-2BE Andreas Schwab
  0 siblings, 1 reply; 23+ messages in thread
From: Kenichi Handa @ 2006-08-31 10:55 UTC (permalink / raw)
  Cc: juri, emacs-devel

In article <44F6BC5B.8010504@gnu.org>, Jason Rumney <jasonr@gnu.org> writes:

> Kenichi Handa wrote:
> > But, I don't know the definition of UCS-2BE.  Is it just a
> > limited UTF-16BE (limited to BMP)?  Where is it defined
> > officially?
> >
> >   
> It's not really authoritative, but there is some information here:

> http://en.wikipedia.org/wiki/UTF-16#UCS-2

Thank you, but it says nothing about "UCS-2BE".

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: UCS-2BE
  2006-08-31 10:55       ` UCS-2BE Kenichi Handa
@ 2006-08-31 11:56         ` Andreas Schwab
  2006-08-31 12:16           ` UCS-2BE Kenichi Handa
  0 siblings, 1 reply; 23+ messages in thread
From: Andreas Schwab @ 2006-08-31 11:56 UTC (permalink / raw)
  Cc: juri, emacs-devel, Jason Rumney

Kenichi Handa <handa@m17n.org> writes:

> In article <44F6BC5B.8010504@gnu.org>, Jason Rumney <jasonr@gnu.org> writes:
>
>> Kenichi Handa wrote:
>> > But, I don't know the definition of UCS-2BE.  Is it just a
>> > limited UTF-16BE (limited to BMP)?  Where is it defined
>> > officially?
>> >
>> >   
>> It's not really authoritative, but there is some information here:
>
>> http://en.wikipedia.org/wiki/UTF-16#UCS-2
>
> Thank you, but it says nothing about "UCS-2BE".

"UTF-16 is often mislabeled UCS-2."

Otherwise UCS-2 is just UTF-16 without surrogates.
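
To make "UTF-16 without surrogates" concrete, here is a minimal Python
sketch of how UTF-16 turns a code point into 16-bit code units.  The
function name utf16_code_units is made up for this illustration.  For a
BMP character the result is the single code unit that UCS-2 would also
use; only characters beyond U+FFFF need a surrogate pair, which is the
part UCS-2 lacks.

```python
def utf16_code_units(cp: int) -> list[int]:
    """Return the UTF-16 code units for one Unicode scalar value."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp <= 0xFFFF:
        # BMP: one code unit, the same value UCS-2 would use.
        return [cp]
    # Supplementary planes: split the 20-bit offset into a surrogate pair.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


print([hex(u) for u in utf16_code_units(0x20AC)])   # EURO SIGN, in the BMP
print([hex(u) for u in utf16_code_units(0x1D11E)])  # MUSICAL SYMBOL G CLEF
```

The second call yields the pair D834 DD1E, which has no UCS-2
counterpart.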

See <http://www.unicode.org/versions/Unicode4.0.0/appC.pdf>.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: UCS-2BE
  2006-08-31 11:56         ` UCS-2BE Andreas Schwab
@ 2006-08-31 12:16           ` Kenichi Handa
  2006-08-31 14:33             ` UCS-2BE Andreas Schwab
  2006-08-31 23:32             ` UCS-2BE Juri Linkov
  0 siblings, 2 replies; 23+ messages in thread
From: Kenichi Handa @ 2006-08-31 12:16 UTC (permalink / raw)
  Cc: juri, emacs-devel, jasonr

In article <jepsehazju.fsf@sykes.suse.de>, Andreas Schwab <schwab@suse.de> writes:

>>> http://en.wikipedia.org/wiki/UTF-16#UCS-2
> >
> > Thank you, but it says nothing about "UCS-2BE".

> "UTF-16 is often mislabeled UCS-2."

> Otherwise UCS-2 is just UTF-16 without surrogates.

Yes, I know that.  But it doesn't necessarily mean that
"UTF-16BE is often mislabeled UCS-2BE".  As I've never seen
"UCS-2BE", I'd like to confirm exactly what it means.

> See <http://www.unicode.org/versions/Unicode4.0.0/appC.pdf>.

It says nothing about "UCS-2BE", either.

If UCS-2BE is a mislabel of UTF-16BE, UCS-2BE can simply be
an alias of UTF-16BE.  If UCS-2BE is a BMP subset of
UTF-16BE, UCS-2BE should be implemented differently from
UTF-16BE (at least, we should not select it by
select-safe-coding-system on saving a buffer that contains
non-BMP characters).

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: UCS-2BE
  2006-08-31 12:16           ` UCS-2BE Kenichi Handa
@ 2006-08-31 14:33             ` Andreas Schwab
  2006-08-31 22:48               ` UCS-2BE Kenichi Handa
  2006-08-31 23:32             ` UCS-2BE Juri Linkov
  1 sibling, 1 reply; 23+ messages in thread
From: Andreas Schwab @ 2006-08-31 14:33 UTC (permalink / raw)
  Cc: juri, jasonr, emacs-devel

Kenichi Handa <handa@m17n.org> writes:

>> See <http://www.unicode.org/versions/Unicode4.0.0/appC.pdf>.
>
> It says nothing about "UCS-2BE", either.

C.2 [...]  The 32-bit form is referred to as UCS-4 (Universal Character
Set coded in 4 octets), and the 16-bit form is referred to as UCS-2
(Universal Character Set coded in 2 octets).

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: UCS-2BE
  2006-08-31 14:33             ` UCS-2BE Andreas Schwab
@ 2006-08-31 22:48               ` Kenichi Handa
  2006-08-31 23:02                 ` UCS-2BE Andreas Schwab
  0 siblings, 1 reply; 23+ messages in thread
From: Kenichi Handa @ 2006-08-31 22:48 UTC (permalink / raw)
  Cc: juri, jasonr, emacs-devel

In article <jey7t59dqh.fsf@sykes.suse.de>, Andreas Schwab <schwab@suse.de> writes:

> Kenichi Handa <handa@m17n.org> writes:
>>> See <http://www.unicode.org/versions/Unicode4.0.0/appC.pdf>.
> >
> > It says nothing about "UCS-2BE", either.

> C.2 [...]  The 32-bit form is referred to as UCS-4 (Universal Character
> Set coded in 4 octets), and the 16-bit form is referred to as UCS-2
> (Universal Character Set coded in 2 octets).

??? So, what is "UCS-2BE"?

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: UCS-2BE
  2006-08-31 22:48               ` UCS-2BE Kenichi Handa
@ 2006-08-31 23:02                 ` Andreas Schwab
  2006-09-01  1:22                   ` UCS-2BE Kenichi Handa
  0 siblings, 1 reply; 23+ messages in thread
From: Andreas Schwab @ 2006-08-31 23:02 UTC (permalink / raw)
  Cc: juri, emacs-devel, jasonr

Kenichi Handa <handa@m17n.org> writes:

> In article <jey7t59dqh.fsf@sykes.suse.de>, Andreas Schwab <schwab@suse.de> writes:
>
>> Kenichi Handa <handa@m17n.org> writes:
>>>> See <http://www.unicode.org/versions/Unicode4.0.0/appC.pdf>.
>> >
>> > It says nothing about "UCS-2BE", either.
>
>> C.2 [...]  The 32-bit form is referred to as UCS-4 (Universal Character
>> Set coded in 4 octets), and the 16-bit form is referred to as UCS-2
>> (Universal Character Set coded in 2 octets).
>
> ??? So, what is "UCS-2BE"?

Like every multi-octet encoding you need to specify the byte order.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: UCS-2BE
  2006-08-31 12:16           ` UCS-2BE Kenichi Handa
  2006-08-31 14:33             ` UCS-2BE Andreas Schwab
@ 2006-08-31 23:32             ` Juri Linkov
  2006-09-01  1:19               ` UCS-2BE Kenichi Handa
  1 sibling, 1 reply; 23+ messages in thread
From: Juri Linkov @ 2006-08-31 23:32 UTC (permalink / raw)
  Cc: schwab, emacs-devel, jasonr

> If UCS-2BE is a mislabel of UTF-16BE, UCS-2BE can simply be
> an alias of UTF-16BE.  If UCS-2BE is a BMP subset of
> UTF-16BE, UCS-2BE should be implemented differently from
> UTF-16BE

`UCS-2' is the fixed-length encoding of the BMP.  `UCS-2BE' is
the big-endian version of the UCS-2 encoding, without a BOM.
So, just as UCS-2 is a BMP subset of UTF-16, UCS-2BE is a BMP
subset of UTF-16BE (and UCS-2LE is a BMP subset of UTF-16LE).
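
The subset relation can be sketched in a few lines of Python.  This is
only an illustration under the definition given above, not the iconv
or Emacs code; encode_ucs2be is a hypothetical helper.  For BMP-only
text it produces the same bytes as Python's built-in "utf-16-be"
codec, and it refuses anything that would need a surrogate pair.

```python
def encode_ucs2be(text: str) -> bytes:
    """Encode BMP-only text as big-endian 16-bit units, without a BOM."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError(f"U+{cp:04X} is outside the BMP")
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"U+{cp:04X} is a surrogate, not a character")
        out += cp.to_bytes(2, "big")  # more significant byte first
    return bytes(out)


bmp_text = "A\u20ac"
# For BMP text the two encodings agree byte for byte.
print(encode_ucs2be(bmp_text) == bmp_text.encode("utf-16-be"))
```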

The encodings `UCS-2' and `UCS-2BE' are implemented in iconv
(http://www.gnu.org/software/libiconv/), so you could look
at the implementation of UCS-2BE:

http://libiconv.cvs.sourceforge.net/libiconv/libiconv/lib/ucs2be.h?revision=1.4&view=markup

Comparing it with the implementation of UTF-16BE, you can see that
UTF-16BE deals also with other planes:

http://libiconv.cvs.sourceforge.net/libiconv/libiconv/lib/utf16be.h?revision=1.4&view=markup

And comparing UCS-2BE with the implementation of UCS-2, you can see that
UCS-2 also deals with a BOM:

http://libiconv.cvs.sourceforge.net/libiconv/libiconv/lib/ucs2.h?revision=1.4&view=markup

There is one difference between outputting a BOM in the iconv
implementations of UCS-2 and UTF-16:

http://libiconv.cvs.sourceforge.net/libiconv/libiconv/lib/utf16.h?revision=1.4&view=markup

i.e. converting a string to UTF-16 adds the BOM to the output, but
converting to UCS-2 doesn't add the BOM.
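
For comparison, Python's standard codecs draw the same line between
the BOM-producing "utf-16" and the BOM-less "utf-16-be".  The sketch
below shows only Python's behaviour; it says nothing about what iconv
or Emacs emit.

```python
import codecs

text = "A"
with_bom = text.encode("utf-16")      # BOM first, then native byte order
no_bom_be = text.encode("utf-16-be")  # explicit byte order, no BOM

# The BOM is U+FEFF serialized in whichever byte order was chosen.
has_bom = with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

print(with_bom.hex(" "))   # platform-dependent: BOM plus one code unit
print(no_bom_be.hex(" "))  # always "00 41"
print(has_bom)
```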

Does the Emacs implementation of UTF-16 output the BOM?

> (at least, we should not select it by select-safe-coding-system on
> saving a buffer that contains non-BMP characters).

What do you think is the right way to deal with non-BMP characters
when the user tries to save a UTF-16(BE) buffer in the UCS-2(BE)
encoding?

-- 
Juri Linkov
http://www.jurta.org/emacs/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: UCS-2BE
       [not found] <E1GIw3v-00059X-TI@monty-python.gnu.org>
@ 2006-08-31 23:36 ` Jonathan Yavner
  0 siblings, 0 replies; 23+ messages in thread
From: Jonathan Yavner @ 2006-08-31 23:36 UTC (permalink / raw)
  Cc: JURI, SCHWAB, JASONR

> ??? So, what is "UCS-2BE"?

BE means "big-endian".  The more significant byte is stored first, 
followed by the less significant byte.  Also known as "network byte 
order".

UCS-2LE is what most people actually use on x86-based computers.  The 
less significant byte arrives before the more significant one in each 
16-bit quantity.
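
As a small illustration in Python, the standard struct module can pack
the same 16-bit value in both byte orders (">H" is big-endian, "<H" is
little-endian).  The value 0x20AC (EURO SIGN) is just an example.

```python
import struct

code_unit = 0x20AC  # EURO SIGN

big = struct.pack(">H", code_unit)     # more significant byte first
little = struct.pack("<H", code_unit)  # less significant byte first

print(big.hex(" "))     # 20 ac
print(little.hex(" "))  # ac 20
```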

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: UCS-2BE
  2006-08-31 23:32             ` UCS-2BE Juri Linkov
@ 2006-09-01  1:19               ` Kenichi Handa
  2006-09-01 11:30                 ` UCS-2BE YAMAMOTO Mitsuharu
  0 siblings, 1 reply; 23+ messages in thread
From: Kenichi Handa @ 2006-09-01  1:19 UTC (permalink / raw)
  Cc: schwab, jasonr, emacs-devel

In article <87ac5ko50j.fsf@jurta.org>, Juri Linkov <juri@jurta.org> writes:

> `UCS-2' is the fixed-length encoding of the BMP.  `UCS-2BE' is
> the big-endian version of the UCS-2 encoding, without a BOM.
> So, just as UCS-2 is a BMP subset of UTF-16, UCS-2BE is a BMP
> subset of UTF-16BE (and UCS-2LE is a BMP subset of UTF-16LE).

Where did you get that info?

The word "encoding" is ambiguous here.  There are "CEF
(Character Encoding Form)" and "CES (Character Encoding
Scheme)".  Unicode says (see Glossary):

Character Encoding Form: Mapping from a character set
definition to the actual code units used to represent the
data.

Character Encoding Scheme: A character encoding form plus
byte serialization. ...

UCS-XXX are CEF, and UTF-XXX are CES.  So, UCS-XXX are not
appropriate label names for specifying how to byte-serialize
characters (i.e. on saving characters in a file).  At least,
that is the official definition in Unicode.
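
The CEF/CES distinction can be sketched as two separate steps in
Python.  Both function names are invented for this illustration, and
the sketch handles BMP text only.  The first step maps characters to
16-bit code units (the encoding form) and has no notion of byte order;
the second step serializes those units to bytes, and that is where
"BE" or "LE" comes in (the encoding scheme).

```python
def to_code_units(text: str) -> list[int]:
    """Encoding form: characters -> 16-bit code units (BMP only)."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError(f"U+{cp:04X} needs more than one 16-bit unit")
        units.append(cp)
    return units


def serialize(units: list[int], byteorder: str) -> bytes:
    """Encoding scheme: code units -> bytes, in "big" or "little" order."""
    return b"".join(u.to_bytes(2, byteorder) for u in units)


units = to_code_units("hi")
print(units)                       # integers, no byte order involved
print(serialize(units, "big"))     # the "BE" serialization
print(serialize(units, "little"))  # the "LE" serialization
```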

And, as you can see now, there's a contradiction in the term
"UCS-2BE" because "BE" is information about
byte-serialization.  But the term "UCS-2BE" itself is not
defined in Unicode.  So, there are two possibilities:

(1) It's just a mis-label of something.
(2) It's defined somewhere else.

Which is the case?

> The encodings `UCS-2' and `UCS-2BE' are implemented in iconv
> (http://www.gnu.org/software/libiconv/), so you could look
> at the implementation of UCS-2BE:

Does it mean that it's an invention of iconv to use those
names as CES?

> Does the Emacs implementation of UTF-16 output the BOM?

Yes.

> > (at least, we should not select it by select-safe-coding-system on
> > saving a buffer that contains non-BMP characters).

> What do you think is the right way to deal with non-BMP characters
> when the user tries to save a UTF-16(BE) buffer in the UCS-2(BE)
> encoding?

It depends on how UCS-2BE is defined.  If we follow the
implementation of iconv (and if the buffer contains non-BMP
characters), we should ask the user to select a proper
coding system.
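
A check of that kind could look like the following Python sketch.  It
is not Emacs code and does not model select-safe-coding-system;
non_bmp_characters is a hypothetical helper that simply reports which
characters would stop a buffer from being saved in a BMP-only
encoding, so that the user can be asked to choose another coding
system.

```python
def non_bmp_characters(text: str) -> list[tuple[int, str]]:
    """Return (position, code point label) for every non-BMP character."""
    return [
        (pos, f"U+{ord(ch):04X}")
        for pos, ch in enumerate(text)
        if ord(ch) > 0xFFFF
    ]


buffer_text = "a\U0001D11Eb"
offenders = non_bmp_characters(buffer_text)
if offenders:
    # A real editor would prompt for another coding system at this point.
    print("cannot save as a BMP-only encoding:", offenders)
```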

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: UCS-2BE
  2006-08-31 23:02                 ` UCS-2BE Andreas Schwab
@ 2006-09-01  1:22                   ` Kenichi Handa
  2006-09-01  9:01                     ` UCS-2BE Andreas Schwab
  0 siblings, 1 reply; 23+ messages in thread
From: Kenichi Handa @ 2006-09-01  1:22 UTC (permalink / raw)
  Cc: juri, jasonr, emacs-devel

In article <jeirk8ik4p.fsf@sykes.suse.de>, Andreas Schwab <schwab@suse.de> writes:

>>> C.2 [...]  The 32-bit form is referred to as UCS-4 (Universal Character
>>> Set coded in 4 octets), and the 16-bit form is referred to as UCS-2
>>> (Universal Character Set coded in 2 octets).
> >
> > ??? So, what is "UCS-2BE"?

> Like every multi-octet encoding you need to specify the byte order.

You are also confusing CEF and CES.  Please see my reply to
Juri.

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: UCS-2BE
  2006-09-01  1:22                   ` UCS-2BE Kenichi Handa
@ 2006-09-01  9:01                     ` Andreas Schwab
  2006-09-01 11:28                       ` UCS-2BE Kenichi Handa
  0 siblings, 1 reply; 23+ messages in thread
From: Andreas Schwab @ 2006-09-01  9:01 UTC (permalink / raw)
  Cc: juri, emacs-devel, jasonr

Kenichi Handa <handa@m17n.org> writes:

> In article <jeirk8ik4p.fsf@sykes.suse.de>, Andreas Schwab <schwab@suse.de> writes:
>
>>>> C.2 [...]  The 32-bit form is referred to as UCS-4 (Universal Character
>>>> Set coded in 4 octets), and the 16-bit form is referred to as UCS-2
>>>> (Universal Character Set coded in 2 octets).
>> >
>> > ??? So, what is "UCS-2BE"?
>
>> Like every multi-octet encoding you need to specify the byte order.
>
> You are also confusing CEF and CES.  Please see my reply to
> Juri.

The above quote is talking about "coded in N octets".  If that's not about
serialisation, what else is it?

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: UCS-2BE
  2006-09-01  9:01                     ` UCS-2BE Andreas Schwab
@ 2006-09-01 11:28                       ` Kenichi Handa
  0 siblings, 0 replies; 23+ messages in thread
From: Kenichi Handa @ 2006-09-01 11:28 UTC (permalink / raw)
  Cc: juri, emacs-devel, jasonr

In article <jeirk87yg5.fsf@sykes.suse.de>, Andreas Schwab <schwab@suse.de> writes:

> The above quote is talking about "coded in N octets".  If that's not about
> serialisation, what else is it?

To my understanding, it means 8*N bits here, and the wording
"UCS-4 (Universal Character Set coded in 4 octets)" is just
for explaining where the literal "UCS-4" comes from.

See this description in C.2.

"As a consequence, UCS-4 can now be taken effectively as an
alias for the Unicode encoding form UTF-32, ..."

So, apparently UCS-4 is CEF here.

By the way, Unicode itself is confusing in names.  For
instance, UTF-32 means both "UTF-32 encoding form (CEF)" and
"UTF-32 encoding scheme (CES)".  Unicode 4.1 says:

"For historical reasons, the Unicode encoding schemes are
also referred to as Unicode (or UCS) transformation formats
(UTF). That term is, however, ambiguous between its usage
for encoding forms and encoding schemes."

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: UCS-2BE
  2006-09-01  1:19               ` UCS-2BE Kenichi Handa
@ 2006-09-01 11:30                 ` YAMAMOTO Mitsuharu
  2006-09-01 12:26                   ` UCS-2BE Kenichi Handa
  0 siblings, 1 reply; 23+ messages in thread
From: YAMAMOTO Mitsuharu @ 2006-09-01 11:30 UTC (permalink / raw)
  Cc: Juri Linkov, schwab, emacs-devel, jasonr

>>>>> On Fri, 01 Sep 2006 10:19:34 +0900, Kenichi Handa <handa@m17n.org> said:

> UCS-XXX are CEF, and UTF-XXX are CES.  So, UCS-XXX are not
> appropriate label names for specifying how to byte-serialize
> characters (i.e. on saving characters in a file).  At least, that is
> the official definition in Unicode.

IIUC, UCS belongs to the ISO/IEC 10646 terminology rather than to
the Unicode terminology, except in Unicode 1.1 (though there would be
some references to it in the documentation, of course).

"Unicode Technical Report #17, Character Encoding Model"
(http://www.unicode.org/reports/tr17/index.html) says:

  Examples of encoding forms as applied to particular coded character
  sets:

    Name           Encoding forms
    Unicode 4.0    UTF-16 (default), UTF-8, or UTF-32 encoding form
    Unicode 3.0    either UTF-16 (default) or UTF-8 encoding form
    Unicode 1.1    either UCS-2 (default) or UTF-8 encoding form
    ISO/IEC 10646, depending on the declared implementation levels, may
                   have UCS-2, UCS-4, UTF-16, or UTF-8.

  Examples of Unicode Character Encoding Schemes:

    The Unicode Standard has seven character encoding schemes: UTF-8,
    UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.

    Unicode 1.1 had three character encoding schemes: UTF-8, UCS-2BE,
    and UCS-2LE, although the latter two were not named that way at
    the time.

I suspect "UCS-2BE" is just a customary name and not explicitly
defined even in ISO/IEC 10646.

"UTF-8 and Unicode FAQ" (http://www.cl.cam.ac.uk/~mgk25/unicode.html)
says:

  No endianess is implied by the encoding names UCS-2, UCS-4, UTF-16,
  and UTF-32, though ISO 10646-1 says that Bigendian should be
  preferred unless otherwise agreed.  It has become customary to
  append the letters "BE" (Bigendian, high-byte first) and "LE"
  (Littleendian, low-byte first) to the encoding names in order to
  explicitly specify a byte order.

				     YAMAMOTO Mitsuharu
				mituharu@math.s.chiba-u.ac.jp

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: UCS-2BE
  2006-09-01 11:30                 ` UCS-2BE YAMAMOTO Mitsuharu
@ 2006-09-01 12:26                   ` Kenichi Handa
  2006-09-01 12:30                     ` UCS-2BE Andreas Schwab
                                       ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Kenichi Handa @ 2006-09-01 12:26 UTC (permalink / raw)
  Cc: juri, schwab, jasonr, emacs-devel

Thank you for the info! 

In article <wl3bbbvn71.wl%mituharu@math.s.chiba-u.ac.jp>, YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp> writes:

> "Unicode Technical Report #17, Character Encoding Model"
> (http://www.unicode.org/reports/tr17/index.html) says:
[...]
>   Examples of Unicode Character Encoding Schemes:
[...]
>     Unicode 1.1 had three character encoding schemes: UTF-8, UCS-2BE,
>     and UCS-2LE, although the latter two were not named that way at
>     the time.

Ah!  So here we can see the term "UCS-2BE" as CES.  But how
was it defined? (I don't have Unicode 1.1)

> I suspect "UCS-2BE" is just a customary name and not explicitly
> defined even in ISO/IEC 10646.

> "UTF-8 and Unicode FAQ" (http://www.cl.cam.ac.uk/~mgk25/unicode.html)
> says:

>   No endianess is implied by the encoding names UCS-2, UCS-4, UTF-16,
>   and UTF-32, though ISO 10646-1 says that Bigendian should be
>   preferred unless otherwise agreed.  It has become customary to
>   append the letters "BE" (Bigendian, high-byte first) and "LE"
>   (Littleendian, low-byte first) to the encoding names in order to
>   explicitly specify a byte order.

I don't know how authoritative this page is, but it also
says:

    A full featured character encoding converter will have
    to provide the following 13 encoding variants of Unicode
    and UCS:

	UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4LE, UCS-4BE,
	UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE,
	UTF-32LE

So it seems that UCS-2BE is not a mis-label of UTF-16BE, and
that treating it as a subset (not using surrogate pairs) of
UTF-16BE (as done in iconv) is the right thing.
I'll try to implement it (and others) in emacs-unicode-2.
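
The decoding side of such a subset treatment could look roughly like
this Python sketch.  It assumes the "BMP subset of UTF-16BE" reading
reached above and is not the emacs-unicode-2 or iconv implementation;
decode_ucs2be is a hypothetical name.  Each pair of bytes is read as
one big-endian code unit, and surrogate values are rejected instead of
being combined into pairs as a UTF-16BE decoder would do.

```python
def decode_ucs2be(data: bytes) -> str:
    """Decode big-endian 16-bit units, rejecting surrogate code units."""
    if len(data) % 2 != 0:
        raise ValueError("truncated input: odd number of bytes")
    chars = []
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "big")
        if 0xD800 <= unit <= 0xDFFF:
            # UTF-16BE would pair this with a neighbour; a BMP-only
            # decoder has no such mechanism.
            raise ValueError(f"surrogate code unit {unit:#06x} at byte {i}")
        chars.append(chr(unit))
    return "".join(chars)


print(decode_ucs2be(b"\x00A\x20\xac"))
```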

By the way, why do people want so many variants... sigh...

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: UCS-2BE
  2006-09-01 12:26                   ` UCS-2BE Kenichi Handa
@ 2006-09-01 12:30                     ` Andreas Schwab
  2006-09-01 12:57                       ` UCS-2BE Kenichi Handa
  2006-09-01 17:08                     ` UCS-2BE Stefan Monnier
  2006-09-01 23:45                     ` UCS-2BE Juri Linkov
  2 siblings, 1 reply; 23+ messages in thread
From: Andreas Schwab @ 2006-09-01 12:30 UTC (permalink / raw)
  Cc: juri, jasonr, YAMAMOTO Mitsuharu, emacs-devel

Kenichi Handa <handa@m17n.org> writes:

> By the way, why do people want so many variants... sigh...

Actually, sane people only want UTF-8. :-)

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: UCS-2BE
  2006-09-01 12:30                     ` UCS-2BE Andreas Schwab
@ 2006-09-01 12:57                       ` Kenichi Handa
  0 siblings, 0 replies; 23+ messages in thread
From: Kenichi Handa @ 2006-09-01 12:57 UTC (permalink / raw)
  Cc: juri, emacs-devel, mituharu, jasonr

In article <jeejuv7or8.fsf@sykes.suse.de>, Andreas Schwab <schwab@suse.de> writes:

> Kenichi Handa <handa@m17n.org> writes:
> > By the way, why do people want so many variants... sigh...

> Actually, sane people only want UTF-8. :-)

Actually, UTF-8 is not that sane nowadays because it may
have a BOM <EF BB BF>.  I read Unicode 4.1 but it doesn't
define how to treat the leading byte sequence <EF BB BF> of
UTF-8.  It says, in short, "be careful"!!!  Sigh...
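
The ambiguity is easy to reproduce in Python, whose standard library
offers two codecs that treat the leading <EF BB BF> differently.  This
shows only Python's behaviour and is no statement about what Unicode
requires.

```python
import codecs

data = codecs.BOM_UTF8 + "hi".encode("utf-8")  # bytes EF BB BF 68 69

kept = data.decode("utf-8")          # BOM survives as U+FEFF in the text
stripped = data.decode("utf-8-sig")  # BOM is treated as a signature

print(repr(kept))      # '\ufeffhi'
print(repr(stripped))  # 'hi'
```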

Is the situation somehow improved in Unicode 5.0?

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: UCS-2BE
  2006-09-01 12:26                   ` UCS-2BE Kenichi Handa
  2006-09-01 12:30                     ` UCS-2BE Andreas Schwab
@ 2006-09-01 17:08                     ` Stefan Monnier
  2006-09-01 23:45                     ` UCS-2BE Juri Linkov
  2 siblings, 0 replies; 23+ messages in thread
From: Stefan Monnier @ 2006-09-01 17:08 UTC (permalink / raw)
  Cc: juri, schwab, emacs-devel, YAMAMOTO Mitsuharu, jasonr

> 	UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4LE, UCS-4BE,
> 	UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE,
> 	UTF-32LE

That's it?  Nothing else?
Whatever happened to UCS-4BLE and UCS-LBE (with big-endian and
little-endian differing at the 8-bit level and the 16-bit level)?
And UCS-4-1243 and UCS-4-1324 and other byte permutations?
And what about bit-level endianness and permutations?
And how 'bout 18bit computers?


        Stefan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: UCS-2BE
  2006-09-01 12:26                   ` UCS-2BE Kenichi Handa
  2006-09-01 12:30                     ` UCS-2BE Andreas Schwab
  2006-09-01 17:08                     ` UCS-2BE Stefan Monnier
@ 2006-09-01 23:45                     ` Juri Linkov
  2006-09-02  1:27                       ` UCS-2BE Kenichi Handa
  2 siblings, 1 reply; 23+ messages in thread
From: Juri Linkov @ 2006-09-01 23:45 UTC (permalink / raw)
  Cc: schwab, jasonr, mituharu, emacs-devel

> Ah!  So here we can see the term "UCS-2BE" as CES.  But how
> was it defined? (I don't have Unicode 1.1)

You can buy the specification for 114 Swiss francs ;-)
http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=39921

> So it seems that UCS-2BE is not a mis-label of UTF-16BE, and
> that treating it as a subset (not using surrogate pairs) of
> UTF-16BE (as done in iconv) is the right thing.

I have the same understanding.

> I'll try to implement it (and others) in emacs-unicode-2.

Is it reasonable to use the iconv library in Emacs (to not reimplement
its encodings)?

> By the way, why do people want so many variants... sigh...

There are still some standards (e.g. in the telecom industry)
that are based on the old encodings.  My understanding is that UCS-2
is an obsolete encoding that will gradually be replaced by UTF-16.

-- 
Juri Linkov
http://www.jurta.org/emacs/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: UCS-2BE
  2006-09-01 23:45                     ` UCS-2BE Juri Linkov
@ 2006-09-02  1:27                       ` Kenichi Handa
  0 siblings, 0 replies; 23+ messages in thread
From: Kenichi Handa @ 2006-09-02  1:27 UTC (permalink / raw)
  Cc: schwab, emacs-devel, mituharu, jasonr

In article <87fyfb6uy5.fsf@jurta.org>, Juri Linkov <juri@jurta.org> writes:

> > So it seems that UCS-2BE is not a mis-label of UTF-16BE, and
> > that treating it as a subset (not using surrogate pairs) of
> > UTF-16BE (as done in iconv) is the right thing.

> I have the same understanding.

Ok.

> > I'll try to implement it (and others) in emacs-unicode-2.

> Is it reasonable to use the iconv library in Emacs (to not reimplement
> its encodings)?

We can use iconv for certain encodings, but not for all.
And, the code for handling variants of UCS/UTF is fairly
short, as you can see in iconv.  In addition, we need the data
in etc/charsets anyway for the
decode-char/encode-char/map-charset-chars functions.

So, I think it's simpler to have encoder/decoder in Emacs
itself.

---
Kenichi Handa
handa@m17n.org

^ permalink raw reply	[flat|nested] 23+ messages in thread
