More Cyrillic vs UTF-8

unofficial mirror of emacs-devel@gnu.org 

* More Cyrillic vs UTF-8
@ 2003-04-25 16:35 Simon Josefsson
  2003-04-25 22:42 ` Eli Zaretskii
  2003-04-26  7:52 ` Kenichi Handa
  0 siblings, 2 replies; 25+ messages in thread
From: Simon Josefsson @ 2003-04-25 16:35 UTC (permalink / raw)


(Same configuration as last mail)

Cut'n'paste the following string into a new file and save it:

Горбачев

UTF-8 isn't shown as an option, and indeed selecting UTF-8 destroys
the data.  Doesn't Emacs CVS support the entire Unicode repertoire?

(The string above, encoded as shift_jis, is, according to od -x:
0000000 4384 8084 8284 7184 7084 8984 7584 7284)
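
As a hedged illustration (not part of the original report), the od -x
dump above can be checked in Python.  The sketch assumes od ran on a
little-endian machine, so each 16-bit word is printed with its two
bytes swapped; the helper name words_to_bytes is hypothetical.

```python
def words_to_bytes(od_words: str) -> bytes:
    """Undo the byte swap of `od -x` output from a little-endian host."""
    out = bytearray()
    for word in od_words.split():
        value = int(word, 16)
        out.append(value & 0xFF)  # low byte comes first in the file
        out.append(value >> 8)    # high byte comes second
    return bytes(out)


dump = "4384 8084 8284 7184 7084 8984 7584 7284"
raw = words_to_bytes(dump)

# Decode the recovered bytes as Shift_JIS, then re-encode as UTF-8.
text = raw.decode("shift_jis")
utf8 = text.encode("utf-8")

print(text)
print(utf8.hex(" "))
```

The decoded characters are ordinary Cyrillic letters with a
well-defined UTF-8 form, so the data itself can be represented in
UTF-8.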

* Re: More Cyrillic vs UTF-8
  2003-04-25 16:35 More Cyrillic vs UTF-8 Simon Josefsson
@ 2003-04-25 22:42 ` Eli Zaretskii
  2003-04-26  0:26   ` Simon Josefsson
  2003-04-26  7:52 ` Kenichi Handa
  1 sibling, 1 reply; 25+ messages in thread
From: Eli Zaretskii @ 2003-04-25 22:42 UTC (permalink / raw)
  Cc: emacs-devel

> From: Simon Josefsson <jas@extundo.com>
> Date: Fri, 25 Apr 2003 18:35:37 +0200
> 
> Doesn't Emacs CVS support the entire Unicode repertoire?

It currently only supports the parts of the BMP whose codepoints are
in the ranges 0000-33ff and e000-ffff.  It doesn't support anything
beyond that.
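
A minimal sketch of the range check implied by those numbers, in
Python.  It models only the two code-point ranges quoted above, not
how Emacs represents characters internally; the names
SUPPORTED_RANGES, is_supported and unsupported_chars are hypothetical.

```python
# Code-point ranges quoted in this message (inclusive bounds).
SUPPORTED_RANGES = [(0x0000, 0x33FF), (0xE000, 0xFFFF)]


def is_supported(char: str) -> bool:
    """Return True if the character's code point lies in a quoted range."""
    code_point = ord(char)
    return any(low <= code_point <= high for low, high in SUPPORTED_RANGES)


def unsupported_chars(text: str) -> list:
    """List the characters of text that fall outside the quoted ranges."""
    return [char for char in text if not is_supported(char)]


print(unsupported_chars("Горбачев"))      # Cyrillic letters, U+04xx
print(unsupported_chars("abc\u4e00\uac00"))  # one CJK ideograph, one Hangul syllable
```

By this check the Cyrillic block itself is inside the supported
range, while CJK ideographs and Hangul syllables are not.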

* Re: More Cyrillic vs UTF-8
  2003-04-25 22:42 ` Eli Zaretskii
@ 2003-04-26  0:26   ` Simon Josefsson
  2003-04-26 13:45     ` Richard Stallman
  0 siblings, 1 reply; 25+ messages in thread
From: Simon Josefsson @ 2003-04-26  0:26 UTC (permalink / raw)
  Cc: emacs-devel

"Eli Zaretskii" <eliz@elta.co.il> writes:

>> From: Simon Josefsson <jas@extundo.com>
>> Date: Fri, 25 Apr 2003 18:35:37 +0200
>> 
>> Doesn't Emacs CVS support the entire Unicode repertoire?
>
> It currently only supports the parts of the BMP whose codepoints are
> in the ranges 0000-33ff and e000-ffff.  It doesn't support anything
> beyond that.

Could we add that information to the PROBLEMS file?

--- PROBLEMS.~1.147.~	Tue Feb  4 16:44:10 2003
+++ PROBLEMS	Sat Apr 26 02:26:21 2003
@@ -27,6 +27,11 @@
 mule-unicode-e000-ffff:-gnu-unifont-*-iso10646-1,\
 mule-unicode-0100-24ff:-gnu-unifont-*-iso10646-1
 
+* Some Unicode characters are not supported.
+
+Emacs currently only supports the parts of the BMP whose codepoints
+are in the ranges 0000-33ff and e000-ffff.
+
 * Problems with file dialogs in Emacs built with Open Motif.
 
 When Emacs 21 is built with Open Motif 2.1, it can happen that the

* Re: More Cyrillic vs UTF-8
  2003-04-25 16:35 More Cyrillic vs UTF-8 Simon Josefsson
  2003-04-25 22:42 ` Eli Zaretskii
@ 2003-04-26  7:52 ` Kenichi Handa
  2003-04-26 11:54   ` Simon Josefsson
  1 sibling, 1 reply; 25+ messages in thread
From: Kenichi Handa @ 2003-04-26  7:52 UTC (permalink / raw)
  Cc: emacs-devel

In article <ilu4r4m357q.fsf@latte.josefsson.org>, Simon Josefsson <jas@extundo.com> writes:
> (Same configuration as last mail)
> Cut'n'paste the following string into a new file and save it:

> Горбачев

> UTF-8 isn't shown as an option, and indeed selecting UTF-8 destroys
> the data.  Doesn't Emacs CVS support the entire Unicode repertoire?

> (The string above, encoded as shift_jis, is, according to od -x:
> 0000000 4384 8084 8284 7184 7084 8984 7584 7284)

Those characters belong to the charset japanese-jisx0208,
and the current Emacs still can't encode them into UTF-8.

How did you get such characters?

---
Ken'ichi HANDA
handa@m17n.org

* Re: More Cyrillic vs UTF-8
  2003-04-26  7:52 ` Kenichi Handa
@ 2003-04-26 11:54   ` Simon Josefsson
  0 siblings, 0 replies; 25+ messages in thread
From: Simon Josefsson @ 2003-04-26 11:54 UTC (permalink / raw)
  Cc: emacs-devel

Kenichi Handa <handa@m17n.org> writes:

> In article <ilu4r4m357q.fsf@latte.josefsson.org>, Simon Josefsson <jas@extundo.com> writes:
>> (Same configuration as last mail)
>> Cut'n'paste the following string into a new file and save it:
>
>> Горбачев
>
>> UTF-8 isn't shown as an option, and indeed selecting UTF-8 destroys
>> the data.  Doesn't Emacs CVS support the entire Unicode repertoire?
>
>> (The string above, encoded as shift_jis, is, according to od -x:
>> 0000000 4384 8084 8284 7184 7084 8984 7584 7284)
>
> Those characters belong to the charset japanese-jisx0208,
> and the current Emacs still can't encode them into UTF-8.
>
> How did you get such characters?

That may be interesting by itself.  Go to
http://www.nns.ru/persons/gorbach.html using galeon (or mozilla, I
think).  Cut'n'paste the first word and yank it in Emacs.  It looks as
single-width in galeon, but when yanked into emacs it becomes double
width. Yanking it into xterm or gnome-terminal doesn't change the
string, it looks like single-width.  Save the HTML file and open it in
emacs as a koi8 file (note that emacs doesn't auto detect it as koi8
so you have to do that manually), then it is single-width too.

I guess it is the emacs X cut'n'paste code that somehow makes the
string into double width japanese characters.

* Re: More Cyrillic vs UTF-8
  2003-04-26  0:26   ` Simon Josefsson
@ 2003-04-26 13:45     ` Richard Stallman
  2003-04-26 14:15       ` Simon Josefsson
  0 siblings, 1 reply; 25+ messages in thread
From: Richard Stallman @ 2003-04-26 13:45 UTC (permalink / raw)
  Cc: emacs-devel

    Could we add that information to the PROBLEMS file?

    --- PROBLEMS.~1.147.~	Tue Feb  4 16:44:10 2003
    +++ PROBLEMS	Sat Apr 26 02:26:21 2003
    @@ -27,6 +27,11 @@
     mule-unicode-e000-ffff:-gnu-unifont-*-iso10646-1,\
     mule-unicode-0100-24ff:-gnu-unifont-*-iso10646-1

    +* Some Unicode characters are not supported.
    +
    +Emacs currently only supports the parts of the BMP whose codepoints
    +are in the ranges 0000-33ff and e000-ffff.
    +

Mentioning this in PROBLEMS seems like a good idea to me, but a useful
entry needs to be stated in terms of what behavior the user sees.
This text doesn't explain the practical consequences; a user would say
"so what does that mean for me?"

* Re: More Cyrillic vs UTF-8
  2003-04-26 13:45     ` Richard Stallman
@ 2003-04-26 14:15       ` Simon Josefsson
  2003-04-26 20:19         ` Kai Großjohann
  2003-04-28  4:37         ` Richard Stallman
  0 siblings, 2 replies; 25+ messages in thread
From: Simon Josefsson @ 2003-04-26 14:15 UTC (permalink / raw)
  Cc: emacs-devel

Richard Stallman <rms@gnu.org> writes:

>     Could we add that information to the PROBLEMS file?
>
>     --- PROBLEMS.~1.147.~	Tue Feb  4 16:44:10 2003
>     +++ PROBLEMS	Sat Apr 26 02:26:21 2003
>     @@ -27,6 +27,11 @@
>      mule-unicode-e000-ffff:-gnu-unifont-*-iso10646-1,\
>      mule-unicode-0100-24ff:-gnu-unifont-*-iso10646-1
>
>     +* Some Unicode characters are not supported.
>     +
>     +Emacs currently only supports the parts of the BMP whose codepoints
>     +are in the ranges 0000-33ff and e000-ffff.
>     +
>
> Mentioning this in PROBLEMS seems like a good idea to me, but a useful
> entry needs to be stated in terms of what behavior the user sees.
> This text doesn't explain the practical consequences; a user would say
> "so what does that mean for me?"

Is this better?  This was the behaviour I got when trying to save the
data; I had specified that the coding system for saving should be
utf-8 but when I tried to save the buffer Emacs was unable to encode
the characters and suggested shift_jis (etc) instead.

--- PROBLEMS.~1.147.~	Tue Feb  4 16:44:10 2003
+++ PROBLEMS	Sat Apr 26 16:13:07 2003
@@ -27,6 +27,13 @@
 mule-unicode-e000-ffff:-gnu-unifont-*-iso10646-1,\
 mule-unicode-0100-24ff:-gnu-unifont-*-iso10646-1
 
+* Encoding some characters as Unicode is rejected by Emacs.
+
+Emacs currently only supports the parts of the BMP whose codepoints
+are in the ranges 0000-33ff and e000-ffff.  If you try to save a file
+containing characters with code points outside this range, Emacs will
+suggest other compatible coding systems.
+
 * Problems with file dialogs in Emacs built with Open Motif.
 
 When Emacs 21 is built with Open Motif 2.1, it can happen that the

* Re: More Cyrillic vs UTF-8
  2003-04-26 14:15       ` Simon Josefsson
@ 2003-04-26 20:19         ` Kai Großjohann
  2003-04-26 21:16           ` Simon Josefsson
  2003-04-28  4:37         ` Richard Stallman
  1 sibling, 1 reply; 25+ messages in thread
From: Kai Großjohann @ 2003-04-26 20:19 UTC (permalink / raw)


Simon Josefsson <jas@extundo.com> writes:

> Richard Stallman <rms@gnu.org> writes:
>
>> Mentioning this in PROBLEMS seems like a good idea to me, but a useful
>> entry needs to be stated in terms of what behavior the user sees.
>> This text doesn't explain the practical consequences; a user would say
>> "so what does that mean for me?"
>
> Is this better?

Can you say what characters you're talking about, instead of just the
code points?  I guess that most people haven't memorized the Unicode
table (yours truly included ;-).
-- 
file-error; Data: (Opening input file no such file or directory ~/.signature)

* Re: More Cyrillic vs UTF-8
  2003-04-26 20:19         ` Kai Großjohann
@ 2003-04-26 21:16           ` Simon Josefsson
  2003-04-26 21:29             ` Kai Großjohann
  0 siblings, 1 reply; 25+ messages in thread
From: Simon Josefsson @ 2003-04-26 21:16 UTC (permalink / raw)


kai.grossjohann@gmx.net (Kai Großjohann) writes:

> Simon Josefsson <jas@extundo.com> writes:
>
>> Richard Stallman <rms@gnu.org> writes:
>>
>>> Mentioning this in PROBLEMS seems like a good idea to me, but a useful
>>> entry needs to be stated in terms of what behavior the user sees.
>>> This text doesn't explain the practical consequences; a user would say
>>> "so what does that mean for me?"
>>
>> Is this better?
>
> Can you say what characters you're talking about, instead of just the
> code points?  I guess that most people haven't memorized the Unicode
> table (yours truly included ;-).

I agree, but I don't know which they are, and maybe the range includes
very many different kinds of characters.  And as new characters are
added all the time, I fear that both the list of supported characters
and the list of unsupported characters would be too long to be useful.
Hm.

* Re: More Cyrillic vs UTF-8
  2003-04-26 21:16           ` Simon Josefsson
@ 2003-04-26 21:29             ` Kai Großjohann
  2003-04-26 21:47               ` Simon Josefsson
  0 siblings, 1 reply; 25+ messages in thread
From: Kai Großjohann @ 2003-04-26 21:29 UTC (permalink / raw)


Simon Josefsson <jas@extundo.com> writes:

> kai.grossjohann@gmx.net (Kai Großjohann) writes:
>
>> Simon Josefsson <jas@extundo.com> writes:
>>
>>> Richard Stallman <rms@gnu.org> writes:
>>>
>>>> Mentioning this in PROBLEMS seems like a good idea to me, but a useful
>>>> entry needs to be stated in terms of what behavior the user sees.
>>>> This text doesn't explain the practical consequences; a user would say
>>>> "so what does that mean for me?"
>>>
>>> Is this better?
>>
>> Can you say what characters you're talking about, instead of just the
>> code points?  I guess that most people haven't memorized the Unicode
>> table (yours truly included ;-).
>
> I agree, but I don't know which they are, and maybe the range includes
> very many different kinds of characters.  And as new characters are
> added all the time, I fear that both the list of supported characters
> and the list of unsupported characters would be too long to be useful.
> Hm.

Well, isn't Unicode divided into blocks so that one can list the
blocks?  Hm.  Oh!  See http://www.unicode.org/charts/ -- looks quite
promising.  Searching for the code blocks there and then giving the
names ought to be useful.  WDYT?

-- 
file-error; Data: (Opening input file no such file or directory ~/.signature)

* Re: More Cyrillic vs UTF-8
  2003-04-26 21:29             ` Kai Großjohann
@ 2003-04-26 21:47               ` Simon Josefsson
  2003-04-27  8:37                 ` Kai Großjohann
  2003-04-28  4:37                 ` Richard Stallman
  0 siblings, 2 replies; 25+ messages in thread
From: Simon Josefsson @ 2003-04-26 21:47 UTC (permalink / raw)


kai.grossjohann@gmx.net (Kai Großjohann) writes:

> Simon Josefsson <jas@extundo.com> writes:
>
>> kai.grossjohann@gmx.net (Kai Großjohann) writes:
>>
>>> Simon Josefsson <jas@extundo.com> writes:
>>>
>>>> Richard Stallman <rms@gnu.org> writes:
>>>>
>>>>> Mentioning this in PROBLEMS seems like a good idea to me, but a useful
>>>>> entry needs to be stated in terms of what behavior the user sees.
>>>>> This text doesn't explain the practical consequences; a user would say
>>>>> "so what does that mean for me?"
>>>>
>>>> Is this better?
>>>
>>> Can you say what characters you're talking about, instead of just the
>>> code points?  I guess that most people haven't memorized the Unicode
>>> table (yours truly included ;-).
>>
>> I agree, but I don't know which they are, and maybe the range includes
>> very many different kinds of characters.  And as new characters are
>> added all the time, I fear that both the list of supported characters
>> and the list of unsupported characters would be too long to be useful.
>> Hm.
>
> Well, isn't Unicode divided into blocks so that one can list the
> blocks?  Hm.  Oh!  See http://www.unicode.org/charts/ -- looks quite
> promising.  Searching for the code blocks there and then giving the
> names ought to be useful.  WDYT?

The compiled list is below.  Does it really help anyone to list all of
them?

Supported:

Basic Latin
Latin-1 Supplement
Latin Extended-A
Latin Extended-B
IPA Extensions
Spacing Modifier Letters
Combining Diacritical Marks
Greek
Cyrillic
Cyrillic Supplement
Armenian
Hebrew
Arabic
Syriac
Thaana
Devanagari
Bengali
Gurmukhi
Gujarati
Oriya
Tamil
Telugu
Kannada
Malayalam
Sinhala
Thai
Lao
Tibetan
Myanmar
Georgian
Hangul Jamo
Ethiopic
Cherokee
Unified Canadian Aboriginal Syllabic
Ogham
Runic
Tagalog
Hanunoo
Buhid
Tagbanwa
Khmer
Mongolian
Latin Extended Additional
Greek Extended
General Punctuation
Superscripts and Subscripts
Currency Symbols
Combining Marks for Symbols
Letterlike Symbols
Number Forms
Arrows
Mathematical Operators
Miscellaneous Technical
Control Pictures
Optical Character Recognition
Enclosed Alphanumerics
Box Drawing
Block Elements
Geometric Shapes
Miscellaneous Symbols
Dingbats
Miscellaneous Mathematical Symbols-A
Supplemental Arrows-A
Braille Patterns
Supplemental Arrows-B
Miscellaneous Mathematical Symbols-B
Supplemental Mathematical Operators
CJK Radicals Supplement
Kangxi Radicals
Ideographic Description Characters
CJK Symbols and Punctuation
Hiragana
Katakana
Bopomofo
Hangul Compatibility Jamo
Kanbun
Bopomofo Extended
Enclosed CJK Letters and Months
CJK Compatibility

Private Use Area
CJK Compatibility Ideographs
Alphabetic Presentation Forms
Arabic Presentation Forms-A
Variation Selectors
Combining Half Marks
CJK Compatibility Forms
Small Form Variants
Arabic Presentation Forms-B
Halfwidth and Fullwidth Forms
Specials

Unsupported:

CJK Unified Ideographs Extension A (1.5MB)
CJK Unified Ideographs (5MB)
Yi Syllables
Yi Radicals
Hangul Syllables (7MB)
High Surrogates
Low Surrogates
Old Italic
Gothic
Deseret
Byzantine Musical Symbols
Musical Symbols
Mathematical Alphanumeric Symbols
CJK Unified Ideographs Extension B (13MB)
CJK Compatibility Ideographs Supplement
Tags
Supplementary Private Use Area-A
Supplementary Private Use Area-B
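
For illustration only, a Python sketch of how such a split can be
derived from the ranges 0000-33ff and e000-ffff quoted earlier in the
thread.  The block boundaries in BLOCKS are an assumption taken from
the Unicode charts rather than from this thread, the table is a small
subset, and the names BLOCKS and block_is_supported are hypothetical.

```python
# Code-point ranges quoted in the thread as supported (inclusive bounds).
SUPPORTED_RANGES = [(0x0000, 0x33FF), (0xE000, 0xFFFF)]

# Illustrative subset of Unicode blocks: (name, first, last).
BLOCKS = [
    ("Cyrillic", 0x0400, 0x04FF),
    ("CJK Compatibility", 0x3300, 0x33FF),
    ("CJK Unified Ideographs Extension A", 0x3400, 0x4DBF),
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
    ("Yi Syllables", 0xA000, 0xA48F),
    ("Hangul Syllables", 0xAC00, 0xD7AF),
    ("Private Use Area", 0xE000, 0xF8FF),
    ("Gothic", 0x10330, 0x1034F),
    ("Musical Symbols", 0x1D100, 0x1D1FF),
]


def block_is_supported(first: int, last: int) -> bool:
    """A block counts as supported only if it fits entirely in one range."""
    return any(low <= first and last <= high for low, high in SUPPORTED_RANGES)


supported = [name for name, first, last in BLOCKS if block_is_supported(first, last)]
unsupported = [name for name, first, last in BLOCKS if not block_is_supported(first, last)]

print("Supported:", ", ".join(supported))
print("Unsupported:", ", ".join(unsupported))
```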

* Re: More Cyrillic vs UTF-8
  2003-04-26 21:47               ` Simon Josefsson
@ 2003-04-27  8:37                 ` Kai Großjohann
  2003-04-28 12:35                   ` Kenichi Handa
  2003-04-28 23:38                   ` Richard Stallman
  2003-04-28  4:37                 ` Richard Stallman
  1 sibling, 2 replies; 25+ messages in thread
From: Kai Großjohann @ 2003-04-27  8:37 UTC (permalink / raw)


Simon Josefsson <jas@extundo.com> writes:

> Unsupported:
>
> CJK Unified Ideographs Extension A (1.5MB)
> CJK Unified Ideographs (5MB)
> Yi Syllables
> Yi Radicals
> Hangul Syllables (7MB)
> High Surrogates
> Low Surrogates
> Old Italic
> Gothic
> Deseret
> Byzantine Musical Symbols
> Musical Symbols
> Mathematical Alphanumeric Symbols
> CJK Unified Ideographs Extension B (13MB)
> CJK Compatibility Ideographs Supplement
> Tags
> Supplementary Private Use Area-A
> Supplementary Private Use Area-B

It seems that these might be summarized by CJK, Music, Maths, Private
Use Area.

WDYT?
-- 
file-error; Data: (Opening input file no such file or directory ~/.signature)

* Re: More Cyrillic vs UTF-8
  2003-04-26 14:15       ` Simon Josefsson
  2003-04-26 20:19         ` Kai Großjohann
@ 2003-04-28  4:37         ` Richard Stallman
  1 sibling, 0 replies; 25+ messages in thread
From: Richard Stallman @ 2003-04-28  4:37 UTC (permalink / raw)
  Cc: emacs-devel

    +* Encoding some characters as Unicode is rejected by Emacs.
    +
    +Emacs currently only supports the parts of the BMP whose codepoints
    +are in the ranges 0000-33ff and e000-ffff.  If you try to save a file
    +containing characters with code points outside this range, Emacs will
    +suggest other compatible coding systems.

That is clearer; it's written in terms of behavior the user sees.
I agree with the people who said that the codepoint numbers may not
be clear enough.

* Re: More Cyrillic vs UTF-8
  2003-04-26 21:47               ` Simon Josefsson
  2003-04-27  8:37                 ` Kai Großjohann
@ 2003-04-28  4:37                 ` Richard Stallman
  1 sibling, 0 replies; 25+ messages in thread
From: Richard Stallman @ 2003-04-28  4:37 UTC (permalink / raw)
  Cc: emacs-devel

    The compiled list is below.  Does it really help anyone to list all of
    them?

The list of unsupported ones is not too long.
Listing them might be compact and useful.

* Re: More Cyrillic vs UTF-8
  2003-04-27  8:37                 ` Kai Großjohann
@ 2003-04-28 12:35                   ` Kenichi Handa
  2003-04-28 23:08                     ` Simon Josefsson
                                       ` (2 more replies)
  2003-04-28 23:38                   ` Richard Stallman
  1 sibling, 3 replies; 25+ messages in thread
From: Kenichi Handa @ 2003-04-28 12:35 UTC (permalink / raw)
  Cc: jas

In article <8465p0l4jp.fsf@lucy.is.informatik.uni-duisburg.de>, kai.grossjohann@gmx.net (Kai Großjohann) writes:

> Simon Josefsson <jas@extundo.com> writes:
>>  Unsupported:
>> 
>>  CJK Unified Ideographs Extension A (1.5MB)
>>  CJK Unified Ideographs (5MB)
[...]
>>  Supplementary Private Use Area-A
>>  Supplementary Private Use Area-B

> It seems that these might be summarized by CJK, Music, Maths, Private
> Use Area.

Private Use Area in U+E000..U+F8FF are supported.

Richard Stallman <rms@gnu.org> writes:
>     +* Encoding some characters as Unicode is rejected by Emacs.
>     +
>     +Emacs currently only supports the parts of the BMP whose codepoints
>     +are in the ranges 0000-33ff and e000-ffff.  If you try to save a file
>     +containing characters with code points outside this range, Emacs will
>     +suggest other compatible coding systems.

> That is clearer; it's written in terms of behavior the user sees.
> I agree with the people who said that the codepoint numbers may not
> be clear enough.

Perhaps it is better to mention utf-translate-cjk mode, like this:

* Encoding some characters as Unicode (UTF-8) is rejected by Emacs.

Emacs currently, by default, only supports the parts of the
BMP whose codepoints are in the ranges 0000-33ff and
e000-ffff.  This excludes CJK, Yi, Music, and Maths.

If you try to save a file containing characters with code
points outside this range, Emacs will suggest other
compatible coding systems.

By turning Utf-Translate-Cjk mode on, many more CJK
characters are included in the support.

---
Ken'ichi HANDA
handa@m17n.org

* Re: More Cyrillic vs UTF-8
  2003-04-28 12:35                   ` Kenichi Handa
@ 2003-04-28 23:08                     ` Simon Josefsson
  2003-04-29 16:51                       ` Kai Großjohann
  2003-04-29  5:39                     ` Richard Stallman
       [not found]                     ` <87llxusaj9.fsf@gnu.org>
  2 siblings, 1 reply; 25+ messages in thread
From: Simon Josefsson @ 2003-04-28 23:08 UTC (permalink / raw)
  Cc: kai.grossjohann

Kenichi Handa <handa@m17n.org> writes:

> Richard Stallman <rms@gnu.org> writes:
>>     +* Encoding some characters as Unicode is rejected by Emacs.
>>     +
>>     +Emacs currently only supports the parts of the BMP whose codepoints
>>     +are in the ranges 0000-33ff and e000-ffff.  If you try to save a file
>>     +containing characters with code points outside this range, Emacs will
>>     +suggest other compatible coding systems.
>
>> That is clearer; it's written in terms of behavior the user sees.
>> I agree with the people who said that the codepoint numbers may not
>> be clear enough.
>
> Perhaps it is better to mention utf-translate-cjk mode, like this:
>
> * Encoding some characters as Unicode (UTF-8) is rejected by Emacs.
>
> Emacs currently, by default, only supports the parts of the
> BMP whose codepoints are in the ranges 0000-33ff and
> e000-ffff.  This excludes CJK, Yi, Music, and Maths.
>
> If you try to save a file containing characters with code
> points outside this range, Emacs will suggest other
> compatible coding systems.
>
> By turning Utf-Translate-Cjk mode on, many more CJK
> characters are included in the support.

This looks good.

As for utf-translate-cjk, it does sound like that functionality
should be enabled by default.  Is the only problem that loading them
is slow?  Perhaps it can be loaded lazily?

* Re: More Cyrillic vs UTF-8
  2003-04-27  8:37                 ` Kai Großjohann
  2003-04-28 12:35                   ` Kenichi Handa
@ 2003-04-28 23:38                   ` Richard Stallman
  2003-04-29 16:17                     ` Benjamin Riefenstahl
  1 sibling, 1 reply; 25+ messages in thread
From: Richard Stallman @ 2003-04-28 23:38 UTC (permalink / raw)
  Cc: emacs-devel

    > Unsupported:
    >
    > CJK Unified Ideographs Extension A (1.5MB)
    > CJK Unified Ideographs (5MB)
    > Yi Syllables
    > Yi Radicals
    > Hangul Syllables (7MB)
    > High Surrogates
    > Low Surrogates
    > Old Italic
    > Gothic
    > Deseret
    > Byzantine Musical Symbols
    > Musical Symbols
    > Mathematical Alphanumeric Symbols
    > CJK Unified Ideographs Extension B (13MB)
    > CJK Compatibility Ideographs Supplement
    > Tags
    > Supplementary Private Use Area-A
    > Supplementary Private Use Area-B

    It seems that these might be summarized by CJK, Music, Maths, Private
    Use Area.

I don't know what "Surrogates" are.  Also, Old Italic and Gothic do not
fit in that list.  What are "Tags"?

Also, I am not sure whether ALL CJK characters are included here.
For instance, are Hangul letters included here?

* Re: More Cyrillic vs UTF-8
  2003-04-28 12:35                   ` Kenichi Handa
  2003-04-28 23:08                     ` Simon Josefsson
@ 2003-04-29  5:39                     ` Richard Stallman
  2003-04-29 13:36                       ` Simon Josefsson
       [not found]                     ` <87llxusaj9.fsf@gnu.org>
  2 siblings, 1 reply; 25+ messages in thread
From: Richard Stallman @ 2003-04-29  5:39 UTC (permalink / raw)
  Cc: jas

    Perhaps it is better to mention utf-translate-cjk mode, like this:

    * Encoding some characters as Unicode (UTF-8) is rejected by Emacs.

    Emacs currently, by default, only supports the parts of the
    BMP whose codepoints are in the ranges 0000-33ff and
    e000-ffff.  This excludes CJK, Yi, Music, and Maths.

    If you try to save a file containing characters with code
    points outside this range, Emacs will suggest other
    compatible coding systems.

    By turning Utf-Translate-Cjk mode on, many more CJK
    characters are included in the support.

Please install that now, even though it may require a little
further modification.

* Re: More Cyrillic vs UTF-8
  2003-04-29  5:39                     ` Richard Stallman
@ 2003-04-29 13:36                       ` Simon Josefsson
  0 siblings, 0 replies; 25+ messages in thread
From: Simon Josefsson @ 2003-04-29 13:36 UTC (permalink / raw)
  Cc: Kenichi Handa

Richard Stallman <rms@gnu.org> writes:

>     Perhaps it is better to mention utf-translate-cjk mode, like this:
>
>     * Encoding some characters as Unicode (UTF-8) is rejected by Emacs.
>
>     Emacs currently, by default, only supports the parts of the
>     BMP whose codepoints are in the ranges 0000-33ff and
>     e000-ffff.  This excludes CJK, Yi, Music, and Maths.
>
>     If you try to save a file containing characters with code
>     points outside this range, Emacs will suggest other
>     compatible coding systems.
>
>     By turning Utf-Translate-Cjk mode on, many more CJK
>     characters are included in the support.
>
> Please install that now, even though it may require a little
> further modification.

I added it.  I changed UTF-8 into UTF-8/16, since I assume the same
holds for UTF-16.

* Re: More Cyrillic vs UTF-8
  2003-04-28 23:38                   ` Richard Stallman
@ 2003-04-29 16:17                     ` Benjamin Riefenstahl
  2003-04-30  5:43                       ` Richard Stallman
  0 siblings, 1 reply; 25+ messages in thread
From: Benjamin Riefenstahl @ 2003-04-29 16:17 UTC (permalink / raw)
  Cc: Kai Großjohann

Hi Richard,


>     > Unsupported:
>     >
>     > CJK Unified Ideographs Extension A (1.5MB)
>     > CJK Unified Ideographs (5MB)
>     > Yi Syllables
>     > Yi Radicals
>     > Hangul Syllables (7MB)
>     > High Surrogates
>     > Low Surrogates
>     > Old Italic
>     > Gothic
>     > Deseret
>     > Byzantine Musical Symbols
>     > Musical Symbols
>     > Mathematical Alphanumeric Symbols
>     > CJK Unified Ideographs Extension B (13MB)
>     > CJK Compatibility Ideographs Supplement
>     > Tags
>     > Supplementary Private Use Area-A
>     > Supplementary Private Use Area-B
> 
>     It seems that these might be summarized by CJK, Music, Maths,
>     Private Use Area.

Richard Stallman <rms@gnu.org> writes:
> I don't know what "Surrogates" are.  Also, Old Italic and Gothic do
> not fit in that list.  What are "Tags"?

"Surrogates" are the codes that are used in UTF-16 to encode
characters with code points above \uFFFF.
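
A minimal sketch of that mapping in Python, using the arithmetic
defined for UTF-16.  The function name to_surrogate_pair is
hypothetical, and U+1D11E (a character from the Musical Symbols block)
is just a convenient example above \uFFFF.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into (high, low) surrogate codes."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1D11E)
print(f"{high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder.
encoded = "\U0001D11E".encode("utf-16-be")
print(encoded.hex(" "))
```

The surrogate codes never stand for characters on their own, which is
why they appear in the list as blocks rather than as scripts.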

"Tags" are codes used for in-band language tagging.

> Also, I am not sure whether ALL CJK characters are included here.
> For instance, are Hangul letters included here?

Kana, Bopomofo and some CJK compatibility and special symbols are
below \u33FF and/or above \uE000, but the major part of the CJK and
all of Hangul is unsupported.


so long, benny

* Re: More Cyrillic vs UTF-8
  2003-04-28 23:08                     ` Simon Josefsson
@ 2003-04-29 16:51                       ` Kai Großjohann
  2003-04-29 20:00                         ` Robert J. Chassell
  0 siblings, 1 reply; 25+ messages in thread
From: Kai Großjohann @ 2003-04-29 16:51 UTC (permalink / raw)


Simon Josefsson <jas@extundo.com> writes:

> As for utf-translate-cjk, it does sound like that functionality
> should be enabled by default.  Is the only problem that loading them
> is slow?  Perhaps it can be loaded lazily?

I don't think it can be done lazily.  I'm sure that Dave would have
done that, if possible.

-- 
file-error; Data: (Opening input file no such file or directory ~/.signature)

* Re: More Cyrillic vs UTF-8
  2003-04-29 16:51                       ` Kai Großjohann
@ 2003-04-29 20:00                         ` Robert J. Chassell
  0 siblings, 0 replies; 25+ messages in thread
From: Robert J. Chassell @ 2003-04-29 20:00 UTC (permalink / raw)


By the way, Sergey Fleytin <fleytin@mail.ru> just posted a message to
the Emacspeak mailing list that he is using a version of Emacspeak
that converts Cyrillic text to spoken Russian.

I don't know how good this is, nor its licenses (I asked him), but
you might want to listen as well as read Russian.

The FTP site is:

    ftp://ftp.rakurs.spb.ru/pub/Goga/

Sergey says

    I am using emacspeak with a so called 'multilingual server'. It
    was written by one of the Russian programmers, who also wrote a
    Russian tts engine for it. This server uses freephone&mbrola for
    English and ru_tts for Russian. Moreover, that person also
    produced a special installation cd-rom called 'slackspeak'. On
    that disk one would find pre-installed, ready to use,
    speech-enabled linux system. It uses emacspeak as a speech
    interface and uses only software synth for output. If the system
    fails to recognize your sound card, the output is directed to the
    pc-speaker. You can either boot directly from that cd or start it
    from the DOS prompt.

-- 
    Robert J. Chassell                         Rattlesnake Enterprises
    http://www.rattlesnake.com                  GnuPG Key ID: 004B4AC8
    http://www.teak.cc                             bob@rattlesnake.com

* Re: More Cyrillic vs UTF-8
  2003-04-29 16:17                     ` Benjamin Riefenstahl
@ 2003-04-30  5:43                       ` Richard Stallman
  2003-04-30  8:01                         ` Kai Großjohann
  0 siblings, 1 reply; 25+ messages in thread
From: Richard Stallman @ 2003-04-30  5:43 UTC (permalink / raw)
  Cc: kai.grossjohann

So, how should we amend the list "CJK, Music, Maths, Private Use
Area"?  Is adding "Gothic and Old Italic" enough?

* Re: More Cyrillic vs UTF-8
  2003-04-30  5:43                       ` Richard Stallman
@ 2003-04-30  8:01                         ` Kai Großjohann
  0 siblings, 0 replies; 25+ messages in thread
From: Kai Großjohann @ 2003-04-30  8:01 UTC (permalink / raw)


Richard Stallman <rms@gnu.org> writes:

> So, how should we amend the list "CJK, Music, Maths, Private Use
> Area"?  Is adding "Gothic and Old Italic" enough?

I guess "CJK, Music, Maths, Private Use Area, Gothic, and Old Italic"
is good enough.

But don't delete the code points -- then Unicode experts know the
full story.
-- 
file-error; Data: (Opening input file no such file or directory ~/.signature)

* Re: More Cyrillic vs UTF-8
       [not found]                     ` <87llxusaj9.fsf@gnu.org>
@ 2003-05-01 11:27                       ` Kenichi Handa
  0 siblings, 0 replies; 25+ messages in thread
From: Kenichi Handa @ 2003-05-01 11:27 UTC (permalink / raw)
  Cc: jas

In article <87llxusaj9.fsf@gnu.org>, Alex Schroeder <alex@gnu.org> writes:
> I'm attaching the real message as a text file encoded using
> iso-2022-jp, and using the MIME type application/octet-stream.  You
> will probably have to save the file and open it using Emacs.  :(

The problem described in your real message should be fixed
now.  Could you please try again with the latest HEAD?

---
Ken'ichi HANDA
handa@m17n.org
