* UCS-2BE @ 2006-08-30 22:54 Juri Linkov 2006-08-31 9:09 ` UCS-2BE Jason Rumney 0 siblings, 1 reply; 23+ messages in thread From: Juri Linkov @ 2006-08-30 22:54 UTC (permalink / raw) Maybe I'm missing it, but I don't see the coding UCS-2BE supported by Emacs (e.g. in a list of `describe-coding-system'). Is this true? If yes, why not to support it? -- Juri Linkov http://www.jurta.org/emacs/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: UCS-2BE 2006-08-30 22:54 UCS-2BE Juri Linkov @ 2006-08-31 9:09 ` Jason Rumney 2006-08-31 10:23 ` UCS-2BE Kenichi Handa 0 siblings, 1 reply; 23+ messages in thread From: Jason Rumney @ 2006-08-31 9:09 UTC (permalink / raw) Cc: emacs-devel Juri Linkov wrote: > Maybe I'm missing it, but I don't see the coding UCS-2BE supported > by Emacs (e.g. in a list of `describe-coding-system'). Is this true? > If yes, why not to support it? > > Can Emacs (22) represent any of the characters that exist in UTF-16 but not UCS-2? If not, then it can just be an alias. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: UCS-2BE 2006-08-31 9:09 ` UCS-2BE Jason Rumney @ 2006-08-31 10:23 ` Kenichi Handa 2006-08-31 10:39 ` UCS-2BE Jason Rumney 0 siblings, 1 reply; 23+ messages in thread From: Kenichi Handa @ 2006-08-31 10:23 UTC (permalink / raw) Cc: juri, emacs-devel In article <44F6A74A.9040708@gnu.org>, Jason Rumney <jasonr@gnu.org> writes: > Juri Linkov wrote: > > Maybe I'm missing it, but I don't see the coding UCS-2BE supported > > by Emacs (e.g. in a list of `describe-coding-system'). Is this true? > > If yes, why not to support it? > > > Can Emacs (22) represent any of the characters that exist in UTF-16 but > not UCS-2? If not, then it can just be an alias. To my understanding, UCS-2 and UCS-4 are the names of Character Encoding Form (CEF), not Character Encoding Scheme (CES), and as CEF doesn't include byte serialization mechanism, it can't be a coding system. Only CES (UTF-XXX) can be a coding system. But, I don't know the definition of UCS-2BE. Is it just a limited UTF-16BE (limited to BMP)? Where is it defined officially? --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: UCS-2BE 2006-08-31 10:23 ` UCS-2BE Kenichi Handa @ 2006-08-31 10:39 ` Jason Rumney 2006-08-31 10:55 ` UCS-2BE Kenichi Handa 0 siblings, 1 reply; 23+ messages in thread From: Jason Rumney @ 2006-08-31 10:39 UTC (permalink / raw) Cc: juri, emacs-devel Kenichi Handa wrote: > But, I don't know the definition of UCS-2BE. Is it just a > limited UTF-16BE (limited to BMP)? Where is it defined > officially? > > It's not really authoritative, but there is some information here: http://en.wikipedia.org/wiki/UTF-16#UCS-2 ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: UCS-2BE 2006-08-31 10:39 ` UCS-2BE Jason Rumney @ 2006-08-31 10:55 ` Kenichi Handa 2006-08-31 11:56 ` UCS-2BE Andreas Schwab 0 siblings, 1 reply; 23+ messages in thread From: Kenichi Handa @ 2006-08-31 10:55 UTC (permalink / raw) Cc: juri, emacs-devel In article <44F6BC5B.8010504@gnu.org>, Jason Rumney <jasonr@gnu.org> writes: > Kenichi Handa wrote: > > But, I don't know the definition of UCS-2BE. Is it just a > > limited UTF-16BE (limited to BMP)? Where is it defined > > officially? > > > > > It's not really authoritative, but there is some information here: > http://en.wikipedia.org/wiki/UTF-16#UCS-2 Thank you, but it says nothing about "UCS-2BE". --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: UCS-2BE 2006-08-31 10:55 ` UCS-2BE Kenichi Handa @ 2006-08-31 11:56 ` Andreas Schwab 2006-08-31 12:16 ` UCS-2BE Kenichi Handa 0 siblings, 1 reply; 23+ messages in thread From: Andreas Schwab @ 2006-08-31 11:56 UTC (permalink / raw) Cc: juri, emacs-devel, Jason Rumney Kenichi Handa <handa@m17n.org> writes: > In article <44F6BC5B.8010504@gnu.org>, Jason Rumney <jasonr@gnu.org> writes: > >> Kenichi Handa wrote: >> > But, I don't know the definition of UCS-2BE. Is it just a >> > limited UTF-16BE (limited to BMP)? Where is it defined >> > officially? >> > >> > >> It's not really authoritative, but there is some information here: > >> http://en.wikipedia.org/wiki/UTF-16#UCS-2 > > Thank you, but it says nothing about "UCS-2BE". "UTF-16 is often mislabeled UCS-2." Otherwise UCS-2 is just UTF-16 without surrogates. See <http://www.unicode.org/versions/Unicode4.0.0/appC.pdf>. Andreas. -- Andreas Schwab, SuSE Labs, schwab@suse.de SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: UCS-2BE 2006-08-31 11:56 ` UCS-2BE Andreas Schwab @ 2006-08-31 12:16 ` Kenichi Handa 2006-08-31 14:33 ` UCS-2BE Andreas Schwab 2006-08-31 23:32 ` UCS-2BE Juri Linkov 0 siblings, 2 replies; 23+ messages in thread From: Kenichi Handa @ 2006-08-31 12:16 UTC (permalink / raw) Cc: juri, emacs-devel, jasonr In article <jepsehazju.fsf@sykes.suse.de>, Andreas Schwab <schwab@suse.de> writes: >>> http://en.wikipedia.org/wiki/UTF-16#UCS-2 > > > > Thank you, but it says nothing about "UCS-2BE". > "UTF-16 is often mislabeled UCS-2." > Otherwise UCS-2 is just UTF-16 without surrogates. Yes, I know that. But it doesn't necessarily mean that "UTF-16BE is often mislabeled UCS-2BE". As I've never seen "UCS-2BE", I'd like to confirm exactly what it means. > See <http://www.unicode.org/versions/Unicode4.0.0/appC.pdf>. It says nothing about "UCS-2BE", either. If UCS-2BE is a mislabel of UTF-16BE, UCS-2BE can simply be an alias of UTF-16BE. If UCS-2BE is a BMP subset of UTF-16BE, UCS-2BE should be implemented differently from UTF-16BE (at least, we should not select it by select-safe-coding-system on saving a buffer that contains non-BMP characters). --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 23+ messages in thread
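Editorial illustration of the distinction drawn in the message above: for a character inside the BMP, a "BMP subset of UTF-16BE" and UTF-16BE itself would produce the same two bytes, while a character outside the BMP needs a surrogate pair, which a fixed two-octet form has no room for. This is only a sketch using Python's built-in "utf-16-be" codec; the two sample characters are arbitrary choices and do not come from the thread.

```python
# A BMP character and the first code point beyond the BMP.
bmp_char = "\u3042"          # HIRAGANA LETTER A, U+3042
astral_char = "\U00010000"   # U+10000, first code point outside the BMP

bmp_bytes = bmp_char.encode("utf-16-be")
astral_bytes = astral_char.encode("utf-16-be")

# One 16-bit code unit: identical to what a BMP-only 16-bit form would emit.
print(bmp_bytes.hex())     # 3042

# Two 16-bit code units (a surrogate pair): not representable in 2 octets.
print(astral_bytes.hex())  # d800dc00
```

If UCS-2BE were a plain alias of UTF-16BE, the second case would be accepted silently; if it is a BMP subset, it has to be rejected or reported.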
* Re: UCS-2BE 2006-08-31 12:16 ` UCS-2BE Kenichi Handa @ 2006-08-31 14:33 ` Andreas Schwab 2006-08-31 22:48 ` UCS-2BE Kenichi Handa 2006-08-31 23:32 ` UCS-2BE Juri Linkov 1 sibling, 1 reply; 23+ messages in thread From: Andreas Schwab @ 2006-08-31 14:33 UTC (permalink / raw) Cc: juri, jasonr, emacs-devel Kenichi Handa <handa@m17n.org> writes: >> See <http://www.unicode.org/versions/Unicode4.0.0/appC.pdf>. > > It says nothing about "UCS-2BE", either. C.2 [...] The 32-bit form is referred to as UCS-4 (Universal Character Set coded in 4 octets), and the 16-bit form is referred to as UCS-2 (Universal Character Set coded in 2 octets). Andreas. -- Andreas Schwab, SuSE Labs, schwab@suse.de SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: UCS-2BE 2006-08-31 14:33 ` UCS-2BE Andreas Schwab @ 2006-08-31 22:48 ` Kenichi Handa 2006-08-31 23:02 ` UCS-2BE Andreas Schwab 0 siblings, 1 reply; 23+ messages in thread From: Kenichi Handa @ 2006-08-31 22:48 UTC (permalink / raw) Cc: juri, jasonr, emacs-devel In article <jey7t59dqh.fsf@sykes.suse.de>, Andreas Schwab <schwab@suse.de> writes: > Kenichi Handa <handa@m17n.org> writes: >>> See <http://www.unicode.org/versions/Unicode4.0.0/appC.pdf>. > > > > It says nothing about "UCS-2BE", either. > C.2 [...] The 32-bit form is referred to as UCS-4 (Universal Character > Set coded in 4 octets), and the 16-bit form is referred to as UCS-2 > (Universal Character Set coded in 2 octets). ??? So, what is "UCS-2BE"? --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: UCS-2BE 2006-08-31 22:48 ` UCS-2BE Kenichi Handa @ 2006-08-31 23:02 ` Andreas Schwab 2006-09-01 1:22 ` UCS-2BE Kenichi Handa 0 siblings, 1 reply; 23+ messages in thread From: Andreas Schwab @ 2006-08-31 23:02 UTC (permalink / raw) Cc: juri, emacs-devel, jasonr Kenichi Handa <handa@m17n.org> writes: > In article <jey7t59dqh.fsf@sykes.suse.de>, Andreas Schwab <schwab@suse.de> writes: > >> Kenichi Handa <handa@m17n.org> writes: >>>> See <http://www.unicode.org/versions/Unicode4.0.0/appC.pdf>. >> > >> > It says nothing about "UCS-2BE", either. > >> C.2 [...] The 32-bit form is referred to as UCS-4 (Universal Character >> Set coded in 4 octets), and the 16-bit form is referred to as UCS-2 >> (Universal Character Set coded in 2 octets). > > ??? So, what is "UCS-2BE"? Like every multi-octet encoding you need to specify the byte order. Andreas. -- Andreas Schwab, SuSE Labs, schwab@suse.de SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: UCS-2BE 2006-08-31 23:02 ` UCS-2BE Andreas Schwab @ 2006-09-01 1:22 ` Kenichi Handa 2006-09-01 9:01 ` UCS-2BE Andreas Schwab 0 siblings, 1 reply; 23+ messages in thread From: Kenichi Handa @ 2006-09-01 1:22 UTC (permalink / raw) Cc: juri, jasonr, emacs-devel In article <jeirk8ik4p.fsf@sykes.suse.de>, Andreas Schwab <schwab@suse.de> writes: >>> C.2 [...] The 32-bit form is referred to as UCS-4 (Universal Character >>> Set coded in 4 octets), and the 16-bit form is referred to as UCS-2 >>> (Universal Character Set coded in 2 octets). > > > > ??? So, what is "UCS-2BE"? > Like every multi-octet encoding you need to specify the byte order. You are also confusing CEF and CES. Please see my reply to Juri. --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: UCS-2BE 2006-09-01 1:22 ` UCS-2BE Kenichi Handa @ 2006-09-01 9:01 ` Andreas Schwab 2006-09-01 11:28 ` UCS-2BE Kenichi Handa 0 siblings, 1 reply; 23+ messages in thread From: Andreas Schwab @ 2006-09-01 9:01 UTC (permalink / raw) Cc: juri, emacs-devel, jasonr Kenichi Handa <handa@m17n.org> writes: > In article <jeirk8ik4p.fsf@sykes.suse.de>, Andreas Schwab <schwab@suse.de> writes: > >>>> C.2 [...] The 32-bit form is referred to as UCS-4 (Universal Character >>>> Set coded in 4 octets), and the 16-bit form is referred to as UCS-2 >>>> (Universal Character Set coded in 2 octets). >> > >> > ??? So, what is "UCS-2BE"? > >> Like every multi-octet encoding you need to specify the byte order. > > You are also confusing CEF and CES. Please see my reply to > Juri. The above quote is talking about "coded in N octets". If that's not about serialisation, what else is it? Andreas. -- Andreas Schwab, SuSE Labs, schwab@suse.de SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: UCS-2BE 2006-09-01 9:01 ` UCS-2BE Andreas Schwab @ 2006-09-01 11:28 ` Kenichi Handa 0 siblings, 0 replies; 23+ messages in thread From: Kenichi Handa @ 2006-09-01 11:28 UTC (permalink / raw) Cc: juri, emacs-devel, jasonr In article <jeirk87yg5.fsf@sykes.suse.de>, Andreas Schwab <schwab@suse.de> writes: > The above quote is talking about "coded in N octets". If that's not about > serialisation, what else is it? To my understanding, it means 8*N bits here, and the wording "UCS-4 (Universal Character Set coded in 4 octets)" is just for explaining where the literal "UCS-4" comes from. See this description in C.2. "As a consequence, UCS-4 can now be taken effectively as an alias for the Unicode encoding form UTF-32, ..." So, apparently UCS-4 is a CEF here. By the way, Unicode itself is confusing in names. For instance, UTF-32 means both "UTF-32 encoding form (CEF)" and "UTF-32 encoding scheme (CES)". Unicode 4.1 says: "For historical reasons, the Unicode encoding schemes are also referred to as Unicode (or UCS) transformation formats (UTF). That term is, however, ambiguous between its usage for encoding forms and encoding schemes." --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: UCS-2BE 2006-08-31 12:16 ` UCS-2BE Kenichi Handa 2006-08-31 14:33 ` UCS-2BE Andreas Schwab @ 2006-08-31 23:32 ` Juri Linkov 2006-09-01 1:19 ` UCS-2BE Kenichi Handa 1 sibling, 1 reply; 23+ messages in thread From: Juri Linkov @ 2006-08-31 23:32 UTC (permalink / raw) Cc: schwab, emacs-devel, jasonr > If UCS-2BE is a mislabel of UTF-16BE, UCS-2BE can simply be > an alias of UTF16-BE. If UCS-2BE is a BMP subset of > UTF-16BE, UCS2-BE should be implemented differently from > UTF-16BE `UCS-2' is the fixed-length encoding of the BMP. `UCS-2BE' is a big-endian version of the UCS-2 encoding without using a BOM. So as actually UCS-2 is a BMP subset of UTF-16, UCS-2BE is a BMP subset of UTF-16BE (and UCS-2LE is a BMP subset of UTF-16LE). The encodings `UCS-2' and `UCS-2BE' are implemented in iconv (http://www.gnu.org/software/libiconv/), so you could look at the implementation of UCS-2BE: http://libiconv.cvs.sourceforge.net/libiconv/libiconv/lib/ucs2be.h?revision=1.4&view=markup Comparing it with the implementation of UTF-16BE, you can see that UTF-16BE deals also with other planes: http://libiconv.cvs.sourceforge.net/libiconv/libiconv/lib/utf16be.h?revision=1.4&view=markup And comparing UCS-2BE with the implementation of UCS-2, you can see that UCS-2 also deals with a BOM: http://libiconv.cvs.sourceforge.net/libiconv/libiconv/lib/ucs2.h?revision=1.4&view=markup There is one difference between outputting a BOM in the iconv implementations of UCS-2 and UTF-16: http://libiconv.cvs.sourceforge.net/libiconv/libiconv/lib/utf16.h?revision=1.4&view=markup i.e. converting a string to UTF-16 adds the BOM to the output, but converting to UCS-2 doesn't add the BOM. Does the Emacs implementation of UTF-16 output the BOM? > (at least, we should not select it by select-safe-coding-system on > saving a buffer that contains non-BMP characters). 
What do you think is the right way to deal with non-BMP characters when the user will try to save a UTF-16(BE) buffer in the UCS-2(BE) encoding? -- Juri Linkov http://www.jurta.org/emacs/ ^ permalink raw reply [flat|nested] 23+ messages in thread
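Editorial sketch of the encoding as described in the message above: fixed-length, big-endian, BMP only, no BOM. The function name `encode_ucs2be` is hypothetical, and the choice to raise an error on unencodable input (including lone surrogate code points) is an assumption made for illustration; it is not a transcription of the libiconv sources linked above.

```python
import struct


def encode_ucs2be(text: str) -> bytes:
    """Serialize text as 16-bit big-endian code units, BMP only, no BOM.

    Hypothetical helper: raises ValueError for anything that would need
    a surrogate pair in UTF-16, and for surrogate code points themselves.
    """
    out = bytearray()
    for index, char in enumerate(text):
        code_point = ord(char)
        if code_point > 0xFFFF:
            raise ValueError(
                f"U+{code_point:04X} at index {index} is outside the BMP"
            )
        if 0xD800 <= code_point <= 0xDFFF:
            raise ValueError(
                f"U+{code_point:04X} at index {index} is a surrogate code point"
            )
        # ">H" packs one unsigned 16-bit integer, most significant byte first.
        out += struct.pack(">H", code_point)
    return bytes(out)
```

For BMP-only input the result matches `text.encode("utf-16-be")`; the two diverge only when the input contains a character beyond U+FFFF, which is the case the question above asks about.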
* Re: UCS-2BE 2006-08-31 23:32 ` UCS-2BE Juri Linkov @ 2006-09-01 1:19 ` Kenichi Handa 2006-09-01 11:30 ` UCS-2BE YAMAMOTO Mitsuharu 0 siblings, 1 reply; 23+ messages in thread From: Kenichi Handa @ 2006-09-01 1:19 UTC (permalink / raw) Cc: schwab, jasonr, emacs-devel In article <87ac5ko50j.fsf@jurta.org>, Juri Linkov <juri@jurta.org> writes: > `UCS-2' is the fixed-length encoding of the BMP. `UCS-2BE' is > a big-endian version of the UCS-2 encoding without using a BOM. > So as actually UCS-2 is a BMP subset of UTF-16, UCS-2BE is a BMP > subset of UTF-16BE (and UCS-2LE is a BMP subset of UTF-16LE). Where did you get that info? The word "encoding" is ambiguous here. There are "CEF (Character Encoding Form)" and "CES (Character Encoding Scheme)". Unicode says (see Glossary): Character Encoding Form: Mapping from a character set definition to the actual code units used to represent the data. Character Encoding Scheme: A character encoding form plus byte serialization. ... UCS-XXX are CEF, and UTF-XXX are CES. So, UCS-XXX are not appropriate label names for specifying how to byte-serialize characters (i.e. on saving characters in a file). At least, that is the official definition in Unicode. And, as you see now, there is a contradiction in the term "UCS-2BE" because "BE" is information about byte-serialization. But the term "UCS-2BE" itself is not defined in Unicode. So, there are two possibilities: (1) It's just a mis-label of something. (2) It's defined somewhere else. Which is the case? > The encodings `UCS-2' and `UCS-2BE' are implemented in iconv > (http://www.gnu.org/software/libiconv/), so you could look > at the implementation of UCS-2BE: Does it mean that it's an invention of iconv to use those names as CES? > Does the Emacs implementation of UTF-16 output the BOM? Yes. > > (at least, we should not select it by select-safe-coding-system on > > saving a buffer that contains non-BMP characters). 
> What do you think is the right way to deal with non-BMP characters > when the user will try to save a UTF-16(BE) buffer in the UCS-2(BE) > encoding? It depends on how UCS-2BE is defined. If we follow the implementation of iconv (and if the buffer contains non-BMP characters), we should ask the user to select a proper coding system. --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 23+ messages in thread
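Editorial sketch of the check implied by the answer above: before offering a BMP-only coding system for saving, find out whether the text contains any non-BMP characters, so that the user can be asked to pick something else. The function names `non_bmp_positions` and `can_save_as_ucs2` are hypothetical and are not Emacs functions; the actual prompting is left out.

```python
def non_bmp_positions(text: str) -> list[tuple[int, str]]:
    """Return (index, "U+XXXX") for every character beyond U+FFFF."""
    return [
        (index, f"U+{ord(char):04X}")
        for index, char in enumerate(text)
        if ord(char) > 0xFFFF
    ]


def can_save_as_ucs2(text: str) -> bool:
    """True when every character fits in a single 16-bit code unit."""
    return not non_bmp_positions(text)


sample = "a\U0001D11Eb"  # U+1D11E MUSICAL SYMBOL G CLEF between two letters
if not can_save_as_ucs2(sample):
    # A real editor would prompt for another coding system at this point.
    print("cannot save as UCS-2:", non_bmp_positions(sample))
```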
* Re: UCS-2BE 2006-09-01 1:19 ` UCS-2BE Kenichi Handa @ 2006-09-01 11:30 ` YAMAMOTO Mitsuharu 2006-09-01 12:26 ` UCS-2BE Kenichi Handa 0 siblings, 1 reply; 23+ messages in thread From: YAMAMOTO Mitsuharu @ 2006-09-01 11:30 UTC (permalink / raw) Cc: Juri Linkov, schwab, emacs-devel, jasonr >>>>> On Fri, 01 Sep 2006 10:19:34 +0900, Kenichi Handa <handa@m17n.org> said: > UCS-XXX are CEF, and UTF-XXX are CES. So, UCS-XXX are not > appropriate label names for specifying how to byte-serialize > characters (i.e. on saving characters in a file). At least, that is > the official definition in Unicode. IIUC, UCS is in the ISO/IEC 10646 terminology, rather than in the Unicode terminology except Unicode 1.1 (though there would be some references in the documentation, of course.) "Unicode Technical Report #17, Character Encoding Model" (http://www.unicode.org/reports/tr17/index.html) says: Examples of encoding forms as applied to particular coded character sets: Name Encoding forms Unicode 4.0 UTF-16 (default), UTF-8, or UTF-32 encoding form Unicode 3.0 either UTF-16 (default) or UTF-8 encoding form Unicode 1.1 either UCS-2 (default) or UTF-8 encoding form ISO/IEC 10646, depending on the declared implementation levels, may have UCS-2, UCS-4, UTF-16, or UTF-8. Examples of Unicode Character Encoding Schemes: The Unicode Standard has seven character encoding schemes: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE. Unicode 1.1 had three character encoding schemes: UTF-8, UCS-2BE, and UCS-2LE, although the latter two were not named that way at the time. I suspect "UCS-2BE" is just a customary name and not explicitly defined even in ISO/IEC 10646. "UTF-8 and Unicode FAQ" (http://www.cl.cam.ac.uk/~mgk25/unicode.html) says: No endianess is implied by the encoding names UCS-2, UCS-4, UTF-16, and UTF-32, though ISO 10646-1 says that Bigendian should be preferred unless otherwise agreed. 
It has become customary to append the letters "BE" (Bigendian, high-byte first) and "LE" (Littleendian, low-byte first) to the encoding names in order to explicitly specify a byte order. YAMAMOTO Mitsuharu mituharu@math.s.chiba-u.ac.jp ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: UCS-2BE 2006-09-01 11:30 ` UCS-2BE YAMAMOTO Mitsuharu @ 2006-09-01 12:26 ` Kenichi Handa 2006-09-01 12:30 ` UCS-2BE Andreas Schwab ` (2 more replies) 0 siblings, 3 replies; 23+ messages in thread From: Kenichi Handa @ 2006-09-01 12:26 UTC (permalink / raw) Cc: juri, schwab, jasonr, emacs-devel Thank you for the info! In article <wl3bbbvn71.wl%mituharu@math.s.chiba-u.ac.jp>, YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp> writes: > "Unicode Technical Report #17, Character Encoding Model" > (http://www.unicode.org/reports/tr17/index.html) says: [...] > Examples of Unicode Character Encoding Schemes: [...] > Unicode 1.1 had three character encoding schemes: UTF-8, UCS-2BE, > and UCS-2LE, although the latter two were not named that way at > the time. Ah! So here we can see the term "UCS-2BE" as CES. But how it was defined? (I don't have Unicode 1.1) > I suspect "UCS-2BE" is just a customary name and not explicitly > defined even in ISO/IEC 10646. > "UTF-8 and Unicode FAQ" (http://www.cl.cam.ac.uk/~mgk25/unicode.html) > says: > No endianess is implied by the encoding names UCS-2, UCS-4, UTF-16, > and UTF-32, though ISO 10646-1 says that Bigendian should be > preferred unless otherwise agreed. It has become customary to > append the letters "BE" (Bigendian, high-byte first) and "LE" > (Littleendian, low-byte first) to the encoding names in order to > explicitly specify a byte order. I don't know how much authorized this page is, but it also says: A full featured character encoding converter will have to provide the following 13 encoding variants of Unicode and UCS: UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4LE, UCS-4BE, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE It seems that UCS-2BE is not a mis-label of UTF-16BE, then, it seems that treating it as a subset (not using surrogate pair) of UTF-16BE (as done in iconv) is the right thing. I'll try to implement it (and others) in emacs-unicode-2. By the way, why do people want such many variants... 
sigh... --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 23+ messages in thread
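Editorial illustration of why the list above has both suffixed and unsuffixed names: in a typical converter the plain name leaves the byte order open and so carries a BOM, while the BE and LE variants fix the order and omit it. The sketch below shows this with Python's built-in codecs, used here only as a convenient stand-in; it says nothing about how Emacs or iconv implement these names.

```python
import codecs

with_bom = "A".encode("utf-16")        # byte order unspecified: BOM is written
big_endian = "A".encode("utf-16-be")   # order fixed by the name: no BOM
little_endian = "A".encode("utf-16-le")

# The BOM is U+FEFF serialized in whichever order the encoder chose.
starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

print(starts_with_bom)       # True
print(big_endian.hex())      # 0041
print(little_endian.hex())   # 4100
```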
* Re: UCS-2BE 2006-09-01 12:26 ` UCS-2BE Kenichi Handa @ 2006-09-01 12:30 ` Andreas Schwab 2006-09-01 12:57 ` UCS-2BE Kenichi Handa 2006-09-01 17:08 ` UCS-2BE Stefan Monnier 2006-09-01 23:45 ` UCS-2BE Juri Linkov 2 siblings, 1 reply; 23+ messages in thread From: Andreas Schwab @ 2006-09-01 12:30 UTC (permalink / raw) Cc: juri, jasonr, YAMAMOTO Mitsuharu, emacs-devel Kenichi Handa <handa@m17n.org> writes: > By the way, why do people want such many variants... sigh... Actually, sane people only want UTF-8. :-) Andreas. -- Andreas Schwab, SuSE Labs, schwab@suse.de SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: UCS-2BE 2006-09-01 12:30 ` UCS-2BE Andreas Schwab @ 2006-09-01 12:57 ` Kenichi Handa 0 siblings, 0 replies; 23+ messages in thread From: Kenichi Handa @ 2006-09-01 12:57 UTC (permalink / raw) Cc: juri, emacs-devel, mituharu, jasonr In article <jeejuv7or8.fsf@sykes.suse.de>, Andreas Schwab <schwab@suse.de> writes: > Kenichi Handa <handa@m17n.org> writes: > > By the way, why do people want such many variants... sigh... > Actually, sane people only want UTF-8. :-) Actually, UTF-8 is not that sane nowadays because it may have BOM <EF BB BF>. I read Unicode 4.1 but it doesn't define how to treat the heading byte sequence <EF BB BF> of UTF-8. It says, in short, "be careful"!!! Sigh... Is the situation somehow improved in Unicode 5.0? --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 23+ messages in thread
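Editorial illustration of the ambiguity mentioned above: a leading <EF BB BF> can either be kept as the character U+FEFF or be treated as a signature and dropped, depending on which decoder is chosen. The sketch uses Python's "utf-8" and "utf-8-sig" codecs as two examples of those policies; it does not describe what Unicode 4.1 or Emacs prescribe.

```python
import codecs

data = codecs.BOM_UTF8 + "abc".encode("utf-8")   # EF BB BF 61 62 63

kept = data.decode("utf-8")          # BOM survives as the character U+FEFF
stripped = data.decode("utf-8-sig")  # BOM is treated as a signature and removed

print(data[:3].hex())   # efbbbf
print(len(kept))        # 4
print(len(stripped))    # 3
```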
* Re: UCS-2BE 2006-09-01 12:26 ` UCS-2BE Kenichi Handa 2006-09-01 12:30 ` UCS-2BE Andreas Schwab @ 2006-09-01 17:08 ` Stefan Monnier 2006-09-01 23:45 ` UCS-2BE Juri Linkov 2 siblings, 0 replies; 23+ messages in thread From: Stefan Monnier @ 2006-09-01 17:08 UTC (permalink / raw) Cc: juri, schwab, emacs-devel, YAMAMOTO Mitsuharu, jasonr > UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4LE, UCS-4BE, > UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, > UTF-32LE That's it? Nothing else? Whatever happened to UCS-4BLE and UCS-LBE (with big-endian and little-endian differing at the 8-bit level and the 16-bit level)? And UCS-4-1243 and UCS-4-1324 and other byte permutations? And what about bit-level endianness and permutations? And how 'bout 18-bit computers? Stefan ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: UCS-2BE 2006-09-01 12:26 ` UCS-2BE Kenichi Handa 2006-09-01 12:30 ` UCS-2BE Andreas Schwab 2006-09-01 17:08 ` UCS-2BE Stefan Monnier @ 2006-09-01 23:45 ` Juri Linkov 2006-09-02 1:27 ` UCS-2BE Kenichi Handa 2 siblings, 1 reply; 23+ messages in thread From: Juri Linkov @ 2006-09-01 23:45 UTC (permalink / raw) Cc: schwab, jasonr, mituharu, emacs-devel > Ah! So here we can see the term "UCS-2BE" as CES. But how > it was defined? (I don't have Unicode 1.1) You can buy the specification for 114 Swiss francs ;-) http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=39921 > It seems that UCS-2BE is not a mis-label of UTF-16BE, then, > it seems that treating it as a subset (not using surrogate > pair) of UTF-16BE (as done in iconv) is the right thing. I have the same understanding. > I'll try to implement it (and others) in emacs-unicode-2. Is it reasonable to use the iconv library in Emacs (to avoid reimplementing its encodings)? > By the way, why do people want such many variants... sigh... There are still some standards (e.g. in the telecom industry) that are based on the old encodings. My understanding is that UCS-2 is an obsolete encoding and will be replaced gradually by UTF-16. -- Juri Linkov http://www.jurta.org/emacs/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: UCS-2BE 2006-09-01 23:45 ` UCS-2BE Juri Linkov @ 2006-09-02 1:27 ` Kenichi Handa 0 siblings, 0 replies; 23+ messages in thread From: Kenichi Handa @ 2006-09-02 1:27 UTC (permalink / raw) Cc: schwab, emacs-devel, mituharu, jasonr In article <87fyfb6uy5.fsf@jurta.org>, Juri Linkov <juri@jurta.org> writes: > > It seems that UCS-2BE is not a mis-label of UTF-16BE, then, > > it seems that treating it as a subset (not using surrogate > > pair) of UTF-16BE (as done in iconv) is the right thing. > I have the same understanding. Ok. > > I'll try to implement it (and others) in emacs-unicode-2. > Is it reasonable to use the iconv library in Emacs (to not reimplement > its encodings)? We can use iconv for certain encodings, but not for all. And, the code for handling variants of UCS/UTF is fairly short as you see in iconv. In addition, we anyway need data in etc/charsets for decode-char/encode-char/map-charset-chars functions. So, I think it's simpler to have encoder/decoder in Emacs itself. --- Kenichi Handa handa@m17n.org ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: UCS-2BE [not found] <E1GIw3v-00059X-TI@monty-python.gnu.org> @ 2006-08-31 23:36 ` Jonathan Yavner 0 siblings, 0 replies; 23+ messages in thread From: Jonathan Yavner @ 2006-08-31 23:36 UTC (permalink / raw) Cc: JURI, SCHWAB, JASONR > ??? So, what is "UCS-2BE"? BE means "big-endian". The more significant byte is stored first, followed by the less significant byte. Also known as "network byte order". UCS-2LE is what most people actually use on x86-based computers. The less significant byte arrives before the more significant one in each 16-bit quantity. ^ permalink raw reply [flat|nested] 23+ messages in thread
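Editorial illustration of the byte orders described in the message above, for a single 16-bit quantity. The value 0x3042 is an arbitrary example chosen so that the two bytes differ; it does not come from the thread.

```python
code_unit = 0x3042  # one 16-bit code unit (U+3042 in a 16-bit form)

# Big-endian ("network byte order"): more significant byte first.
big_endian = code_unit.to_bytes(2, "big")

# Little-endian: less significant byte first.
little_endian = code_unit.to_bytes(2, "little")

print(big_endian.hex())     # 3042
print(little_endian.hex())  # 4230
```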