What exactly is chinese-big5?

unofficial mirror of emacs-devel@gnu.org 

* What exactly is chinese-big5?
@ 2008-04-17 11:15 Eli Zaretskii
  2008-04-18  1:32 ` Kenichi Handa
  0 siblings, 1 reply; 8+ messages in thread
From: Eli Zaretskii @ 2008-04-17 11:15 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: emacs-devel

Emacs 22.2 supports the chinese-big5 encoding.  However, I cannot find
anywhere the precise description of which flavor(s) of BIG5 is/are
supported.  The Wiki page http://en.wikipedia.org/wiki/Big5 describes
half a dozen extensions to the original Big5 encoding, so it would
be good to know which one(s) we support.

The specific situation where I needed to know this was when I was
handed a file with what was supposed to be Chinese text and was asked
to convert it to UTF-8.  detect-coding-region suggested chinese-big5
as the only Chinese encoding for the non-ASCII characters in the file,
so I tried that.  Interestingly enough, both `recode' and `iconv'
refused to convert the file, no matter what flavor of Big5 (including
cp950) I tried, but Emacs read the file with no problems and produced
what seems like a valid UTF-8 encoding.  `iconv' 1.12 supports quite a
few Big5 flavors, but they all choked on some characters in the file.

So what exactly is chinese-big5 in Emacs, and how come it succeeds
where the latest `iconv' fails?  In particular, should I worry about
possibly incorrect conversion by Emacs, where `iconv' barfs (the file
is very large and I cannot proofread all the converted strings)?

TIA


* Re: What exactly is chinese-big5?
  2008-04-17 11:15 What exactly is chinese-big5? Eli Zaretskii
@ 2008-04-18  1:32 ` Kenichi Handa
  2008-04-18  8:16   ` Eli Zaretskii
  0 siblings, 1 reply; 8+ messages in thread
From: Kenichi Handa @ 2008-04-18  1:32 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

In article <E1JmS52-0000vS-J3@fencepost.gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> Emacs 22.2 supports the chinese-big5 encoding.  However, I cannot find
> anywhere the precise description of which flavor(s) of BIG5 is/are
> supported.  The Wiki page http://en.wikipedia.org/wiki/Big5 describes
> half a dozen extensions to the original Big5 encoding, so it would
> be good to know which one(s) we support.

Emacs supports the full range of the Big5 code space, i.e.:
   1st byte: 0xA1 .. 0xFE
   2nd byte: 0x40 .. 0x7E and 0xA1 .. 0xFE

It means that Emacs can decode every Big5 code point that
falls in the above range.  Moreover, Emacs 22 pays no attention
to which code point is assigned to which Chinese character.
It has a separate character space for Big5 characters (in
charsets chinese-big5-1 and chinese-big5-2) and thus can
hold all possible characters.  Some code points may not be
assigned to any character in some variants of Big5; Emacs
22.2 simply doesn't care about that.
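
The byte ranges quoted above can be checked mechanically.  Here is a
minimal Python sketch, not Emacs code: the names in_big5_code_space
and scan_big5 are made up for illustration, and treating every byte
below 0x80 as single-byte ASCII is an assumption about the file.  It
only tests membership in the code space; like Emacs 22, it says
nothing about whether a code point is assigned to a character.

```python
def in_big5_code_space(lead: int, trail: int) -> bool:
    """True if (lead, trail) lies in the code space quoted above."""
    lead_ok = 0xA1 <= lead <= 0xFE
    trail_ok = 0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE
    return lead_ok and trail_ok


def scan_big5(data: bytes) -> list:
    """Return offsets of bytes that cannot start a valid sequence.

    Assumption: bytes below 0x80 are single-byte ASCII.
    """
    bad = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            i += 1                      # plain ASCII byte
        elif i + 1 < len(data) and in_big5_code_space(b, data[i + 1]):
            i += 2                      # well-formed two-byte sequence
        else:
            bad.append(i)               # outside the code space
            i += 1
    return bad
```

An empty result from scan_big5 means only that every non-ASCII byte
pair fits the code space, which is roughly the condition under which
Emacs 22 reads the file without complaint.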

Emacs 23 supports all code points of Big5 as well.  It
first decodes Big5 characters into a single separate code space
(#x130000 and above), then unifies most of them with Unicode
using a charset map distributed with glibc
(/usr/share/i18n/charmaps/BIG5.gz).  So, for instance, Big5
A140 is decoded and unified to U+3000, but Big5 FEFE is just
decoded to #x134621 (outside the Unicode range).
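
The two-stage decoding can be sketched in Python as follows.  This is
an illustration under stated assumptions, not the Emacs
implementation: the row-major indexing with 191 columns per lead byte
(trail bytes 0x40..0xFE) is inferred only because it reproduces the
two values quoted above (#x130000 as the base and #x134621 for FEFE),
and Python's "big5" codec stands in for the glibc charmap.  The names
private_code and decode_pair are hypothetical.

```python
BIG5_BASE = 0x130000                    # start of the private code space
TRAIL_MIN, TRAIL_MAX = 0x40, 0xFE
ROW_WIDTH = TRAIL_MAX - TRAIL_MIN + 1   # 191 columns per lead byte (assumed)


def private_code(lead: int, trail: int) -> int:
    """Map a Big5 byte pair to a code point in the private space."""
    return BIG5_BASE + (lead - 0xA1) * ROW_WIDTH + (trail - TRAIL_MIN)


def decode_pair(lead: int, trail: int) -> int:
    """Unify with Unicode when a mapping exists, else keep the private code.

    Assumption: Python's "big5" codec stands in for the glibc charmap.
    """
    try:
        return ord(bytes([lead, trail]).decode("big5"))
    except UnicodeDecodeError:
        return private_code(lead, trail)
```

With these assumptions, decode_pair(0xA1, 0x40) gives 0x3000 and
decode_pair(0xFE, 0xFE) falls back to 0x134621, matching the two
examples in the paragraph above.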

> The specific situation where I needed to know this was when I was
> handed a file with what was supposed to be Chinese text and was asked
> to convert it to UTF-8.  detect-coding-region suggested chinese-big5
> as the only Chinese encoding for the non-ASCII characters in the file,
> so I tried that.  Interestingly enough, both `recode' and `iconv'
> refused to convert the file, no matter what flavor of Big5 (including
> cp950) I tried, but Emacs read the file with no problems and produced
> what seems like a valid UTF-8 encoding.  `iconv' 1.12 supports quite a
> few Big5 flavors, but they all choked on some characters in the file.

When you read a Big5 file containing the byte sequence FEFE and
try to write it as utf-8, Emacs 22.2 silently generates U+FFFD
(REPLACEMENT CHARACTER), as described in the docstring of the
utf-8 coding system.  So it is possible that the file you wrote
also contains that character.

On the other hand, Emacs 23 warns that only Big5 and
utf-8-emacs can encode it.

> So what exactly is chinese-big5 in Emacs, and how come it succeeds
> where the latest `iconv' fails?

As explained above.

> In particular, should I worry about
> possibly incorrect conversion by Emacs, where `iconv' barfs (the file
> is very large and I cannot proofread all the converted strings)?

In Emacs 22, you can read the written file as utf-8 and
search for U+FFFD.  In Emacs 23, you'll see the warning on
writing the file.
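
For a file too large to proofread, the search for U+FFFD can also be
done outside Emacs.  The following Python sketch (find_replacements
is an illustrative name, not an existing tool) reports the position
of every replacement character in text that has already been decoded
as UTF-8.

```python
REPLACEMENT = "\ufffd"


def find_replacements(text: str) -> list:
    """Return (line, column) pairs, both 1-based, for each U+FFFD."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        col = line.find(REPLACEMENT)
        while col != -1:
            hits.append((lineno, col + 1))
            col = line.find(REPLACEMENT, col + 1)
    return hits
```

One would typically call it on the contents of the converted file,
read with encoding="utf-8"; each hit marks a place where the original
bytes could not be unified with Unicode.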

---
Kenichi Handa
handa@ni.aist.go.jp


* Re: What exactly is chinese-big5?
  2008-04-18  1:32 ` Kenichi Handa
@ 2008-04-18  8:16   ` Eli Zaretskii
  2008-04-18 11:28     ` Kenichi Handa
  0 siblings, 1 reply; 8+ messages in thread
From: Eli Zaretskii @ 2008-04-18  8:16 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> CC: emacs-devel@gnu.org
> Date: Fri, 18 Apr 2008 10:32:15 +0900
> 
> Emacs supports the full range of the Big5 code space, i.e.:
>    1st byte: 0xA1 .. 0xFE
>    2nd byte: 0x40 .. 0x7E and 0xA1 .. 0xFE

Thank you for the detailed explanations.

> In Emacs 22, you can read the written file as utf-8 and
> search for U+FFFD.

Is U+FFFD the _only_ character that will be produced for any codepoint
that is unassigned in the Big5 code space?  That is, if I search for
U+FFFD, will I find _all_ the places where the original file had
something not belonging to Big5?

Also, assuming that I find one or more invalid characters, is there
some encoding other than chinese-big5 that I should try, which could
explain those problematic characters, besides those I mentioned in my
original message?  This file came from Chinese speaking people, so
there's little doubt it should include only strings that can be read
by Chinese speakers.  Therefore, I wonder how come it does not
translate cleanly into Unicode.  (I cannot ask the people who produced
the file about these issues, since they seem to be pretty ignorant
about that: they claimed the file was in UTF-8...)

Thanks again for your help.


* Re: What exactly is chinese-big5?
  2008-04-18  8:16   ` Eli Zaretskii
@ 2008-04-18 11:28     ` Kenichi Handa
  2008-04-18 12:47       ` Eli Zaretskii
  2008-04-18 13:26       ` Jason Rumney
  0 siblings, 2 replies; 8+ messages in thread
From: Kenichi Handa @ 2008-04-18 11:28 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

In article <u4p9zr314.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> > In Emacs 22, you can read the written file as utf-8 and
> > search for U+FFFD.

> Is U+FFFD the _only_ character that will be produced for any codepoint
> that is unassigned in the Big5 code space?  That is, if I search for
> U+FFFD, will I find _all_ the places where the original file had
> something not belonging to Big5?

Not exactly.  U+FFFD is the only character that will be
produced for "any character that can't be unified with
Unicode".  Which Big5 characters can be unified with Unicode is
defined in subst-big5.el in Emacs 22 (I don't know which
Big5 version Dave used to make that file) and in
etc/charsets/BIG5.map in Emacs 23.  So, if the dialect of
Big5 differs from what is defined in those files, it is
possible that some character which the file's creator
considers Big5 is encoded as U+FFFD.
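
To see how much the dialects actually disagree, one can decode every
byte pair in the code space under several mapping tables and collect
the pairs whose results differ.  The Python sketch below does this
with Python's own "big5", "cp950" and "big5hkscs" codecs; these are
stand-ins chosen for illustration and are not the tables in
subst-big5.el or BIG5.map.  The function names are hypothetical.

```python
CODECS = ("big5", "cp950", "big5hkscs")  # Python stand-ins for three dialects


def decode_or_none(pair: bytes, codec: str):
    """Decode one byte pair, returning None when the codec rejects it."""
    try:
        return pair.decode(codec)
    except UnicodeDecodeError:
        return None


def dialect_differences(names=CODECS) -> dict:
    """Map each byte pair to its per-codec results when they disagree."""
    diffs = {}
    trails = list(range(0x40, 0x7F)) + list(range(0xA1, 0xFF))
    for lead in range(0xA1, 0xFF):
        for trail in trails:
            pair = bytes([lead, trail])
            results = tuple(decode_or_none(pair, name) for name in names)
            if len(set(results)) > 1:
                diffs[pair] = results
    return diffs
```

Entries where one result is None correspond to the U+FFFD case
described above; entries with two different characters show that a
dialect mismatch can also yield a different, valid-looking character
rather than a replacement character.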

> Also, assuming that I find one or more invalid characters, is there
> some encoding other than chinese-big5 that I should try, which could
> explain those problematic characters, besides those I mentioned in my
> original message?  This file came from Chinese speaking people, so
> there's little doubt it should include only strings that can be read
> by Chinese speakers.  Therefore, I wonder how come it does not
> translate cleanly into Unicode.  (I cannot ask the people who produced
> the file about these issues, since they seem to be pretty ignorant
> about that: they claimed the file was in UTF-8...)

That file may be GBK, whose code space is similar to but
wider than that of Big5.  However, GBK is supported only in
Emacs 23.
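
Because the code spaces overlap, a quick way to compare candidate
encodings is to decode the bytes leniently under each one and count
the replacement characters produced.  This is a rough heuristic
sketched in Python, with an arbitrary candidate list and a
hypothetical function name; a count of zero shows only that the bytes
are well formed under that codec, not that the guess is right.

```python
CANDIDATES = ("utf-8", "big5", "cp950", "big5hkscs", "gbk", "gb18030")


def replacement_counts(data: bytes, candidates=CANDIDATES) -> dict:
    """Count U+FFFD produced by a lenient decode under each candidate.

    Note: text that legitimately contains U+FFFD inflates the count.
    """
    return {
        name: data.decode(name, errors="replace").count("\ufffd")
        for name in candidates
    }
```

Several candidates may score zero on the same input, so the counts
narrow the choice but a reader of Chinese still has to confirm the
result.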

---
Kenichi Handa
handa@ni.aist.go.jp


* Re: What exactly is chinese-big5?
  2008-04-18 11:28     ` Kenichi Handa
@ 2008-04-18 12:47       ` Eli Zaretskii
  2008-04-18 13:37         ` Jason Rumney
  2008-04-18 13:26       ` Jason Rumney
  1 sibling, 1 reply; 8+ messages in thread
From: Eli Zaretskii @ 2008-04-18 12:47 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> CC: emacs-devel@gnu.org
> Date: Fri, 18 Apr 2008 20:28:08 +0900
> 
> > Is U+FFFD the _only_ character that will be produced for any codepoint
> > that is unassigned in the Big5 code space?  That is, if I search for
> > U+FFFD, will I find _all_ the places where the original file had
> > something not belonging to Big5?
> 
> Not exactly.  U+FFFD is the only character that will be
> produced for "any character that can't be unified with
> Unicode".  Which Big5 characters can be unified with Unicode is
> defined in subst-big5.el in Emacs 22 (I don't know which
> Big5 version Dave used to make that file) and in
> etc/charsets/BIG5.map in Emacs 23.  So, if the dialect of
> Big5 differs from what is defined in those files, it is
> possible that some character which the file's creator
> considers Big5 is encoded as U+FFFD.

Thanks.

> That file may be GBK, whose code space is similar to but
> wider than that of Big5.  However, GBK is supported only in
> Emacs 23.

In Emacs 22, cp936 could be a good approximation to GBK, right?





* Re: What exactly is chinese-big5?
  2008-04-18 11:28     ` Kenichi Handa
  2008-04-18 12:47       ` Eli Zaretskii
@ 2008-04-18 13:26       ` Jason Rumney
  1 sibling, 0 replies; 8+ messages in thread
From: Jason Rumney @ 2008-04-18 13:26 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: Eli Zaretskii, emacs-devel

Kenichi Handa wrote:
> So, if the dialect of
> Big5 differs from what is defined in those files, it is
> possible that some character which the file's creator
> considers Big5 is encoded as U+FFFD.
>   

Isn't there also a possibility that some character gets encoded into a 
different utf-8 character if the Big5 dialect is different, or are all 
characters in Big5 that are encodable as utf-8 common to all dialects?







* Re: What exactly is chinese-big5?
  2008-04-18 12:47       ` Eli Zaretskii
@ 2008-04-18 13:37         ` Jason Rumney
  2008-04-18 15:52           ` Eli Zaretskii
  0 siblings, 1 reply; 8+ messages in thread
From: Jason Rumney @ 2008-04-18 13:37 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, Kenichi Handa

Eli Zaretskii wrote:
> In Emacs 22, cp936 could be a good approximation to GBK, right?
>   

Does Emacs 22 have an independent implementation of cp936, or is it just
an alias for gb2312?

I defined some aliases like this (against protests from Dave Love), so 
that we could initialise the locale-coding-system on Windows based on 
the system codepage, without needing a mapping alist to select the 
closest approximation when Emacs did not have full support for the 
codepage. Dave's protest was that it would cause confusion to claim 
support for codepages where that support was incomplete, while my point 
of view was that limited support is better than none at all.


* Re: What exactly is chinese-big5?
  2008-04-18 13:37         ` Jason Rumney
@ 2008-04-18 15:52           ` Eli Zaretskii
  0 siblings, 0 replies; 8+ messages in thread
From: Eli Zaretskii @ 2008-04-18 15:52 UTC (permalink / raw)
  To: Jason Rumney; +Cc: emacs-devel, handa

> Date: Fri, 18 Apr 2008 14:37:03 +0100
> From: Jason Rumney <jasonr@gnu.org>
> CC: Kenichi Handa <handa@m17n.org>, emacs-devel@gnu.org
> 
> Eli Zaretskii wrote:
> > In Emacs 22, cp936 could be a good approximation to GBK, right?
> >   
> 
> Does Emacs 22 have an independent implementation of cp936, or is it just
> an alias for gb2312?

You are right, it's an alias of gb2312.  And in Emacs 23, it's an
alias of gbk.
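
What is lost when cp936 is treated as gb2312 can be illustrated with
Python's codecs, which are used here only as stand-ins and are not
the Emacs coding systems.  GBK is a superset of GB2312, so any
character present only in GBK has no representation under the
narrower alias.  The helper name encodable is hypothetical.

```python
def encodable(ch: str, codec: str) -> bool:
    """True if the codec has a code point for the character."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# U+5011 is a traditional form, present in GBK but not in GB2312.
sample = "\u5011"
print(encodable(sample, "gb2312"), encodable(sample, "gbk"))
```

A file containing such characters would therefore not round-trip
through a coding system that is really gb2312 under the cp936 name.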

> I defined some aliases like this (against protests from Dave Love), so 
> that we could initialise the locale-coding-system on Windows based on 
> the system codepage, without needing a mapping alist to select the 
> closest approximation when Emacs did not have full support for the 
> codepage. Dave's protest was that it would cause confusion to claim 
> support for codepages where that support was incomplete, while my point 
> of view was that limited support is better than none at all.

I have no objections to these aliases, especially since the confusion
is there anyway, given the lack of documentation for the precise
ranges of codepoints supported by big5 and others, as mentioned in
this thread by Handa-san.



