can not decode 0x93 and 0x94 to correct char

unofficial mirror of emacs-devel@gnu.org 

* can not decode 0x93 and 0x94 to correct char
@ 2007-09-28  4:24 William Xue
  0 siblings, 0 replies; 11+ messages in thread
From: William Xue @ 2007-09-28  4:24 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1 bytes --]



[-- Attachment #2: sc_tmp_15.png --]
[-- Type: image/png, Size: 41476 bytes --]

[-- Attachment #3: sc_tmp_16.png --]
[-- Type: image/png, Size: 17483 bytes --]

[-- Attachment #4: Type: text/plain, Size: 142 bytes --]

_______________________________________________
Emacs-devel mailing list
Emacs-devel@gnu.org
http://lists.gnu.org/mailman/listinfo/emacs-devel

* can not decode 0x93 and 0x94 to correct char
@ 2007-09-28  4:33 William Xue
  2007-09-28  6:31 ` Kenichi Handa
  2007-09-28 13:50 ` Stefan Monnier
  0 siblings, 2 replies; 11+ messages in thread
From: William Xue @ 2007-09-28  4:33 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 288 bytes --]

Hi,

Could you confirm the issue?

version: GNU Emacs 23.0.0.1
platform: winxp + sp2

Steps:

1. emacs -q
2. open char_err_clip.c
3. \223GPL License\224

Please see the screenshots for details.

(The same issue can be reproduced with GNU Emacs 22.1.1 under Cygwin
here.)

Thanks.

[-- Attachment #2: sc_tmp_15.png --]
[-- Type: image/png, Size: 41476 bytes --]

[-- Attachment #3: sc_tmp_16.png --]
[-- Type: image/png, Size: 17483 bytes --]

[-- Attachment #4: char_err_clip.c --]
[-- Type: application/octet-stream, Size: 311 bytes --]

* Re: can not decode 0x93 and 0x94 to correct char
  2007-09-28  4:33 William Xue
@ 2007-09-28  6:31 ` Kenichi Handa
  2007-09-28  8:30   ` Eli Zaretskii
  2007-09-28 13:50 ` Stefan Monnier
  1 sibling, 1 reply; 11+ messages in thread
From: Kenichi Handa @ 2007-09-28  6:31 UTC (permalink / raw)
  To: William Xue; +Cc: emacs-devel

In article <op.tzcj9x17hkv0w5@smiling>, "William Xue" <william.xue@gmail.com> writes:

> Could you confirm the issue?

> version: GNU Emacs 23.0.0.1
> platform: winxp + sp2

> Steps:

> 1. emacs -q
> 2. open char_err_clip.c
> 3. \223GPL License\224

\223 and \224 are code points of cp125X for LEFT DOUBLE
QUOTATION MARK and RIGHT DOUBLE QUOTATION MARK.  Their
Unicode code points are U+201C and U+201D (note that they
are not included in ISO-8859-1).

With the trunk Emacs, you must use UTF-8 (or another UTF-based
encoding, or one of some CJK encodings) to handle those
characters.

With emacs-unicode, the CP125X support includes those
characters, so you can also use one of the CP125X encodings.
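
A minimal sketch of the byte values discussed above, written in Python rather than Emacs Lisp purely for illustration. It assumes Python's built-in `cp1252` and `latin-1` codecs as stand-ins for the corresponding Emacs coding systems; the sample string is the one from step 3 of the report.

```python
# The octal escapes \223 and \224 shown by Emacs are the raw bytes 0x93 and 0x94.
raw = bytes([0o223]) + b"GPL License" + bytes([0o224])

# Decoded as cp1252, the two bytes become the curly double quotation marks
# U+201C and U+201D.
as_cp1252 = raw.decode("cp1252")
print([hex(ord(ch)) for ch in (as_cp1252[0], as_cp1252[-1])])

# Decoded as ISO-8859-1 (latin-1), the same bytes map to C1 control
# characters, so there is no printable glyph to show for them.
as_latin1 = raw.decode("latin-1")

# The bytes are not valid UTF-8 either: 0x93 is a continuation byte and
# cannot start a sequence, so a strict UTF-8 decoder rejects the input.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
```

This only shows what each encoding does with the two bytes; it says nothing about which coding system a given Emacs build picks by default.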

---
Kenichi Handa
handa@m17n.org

* Re: can not decode 0x93 and 0x94 to correct char
  2007-09-28  6:31 ` Kenichi Handa
@ 2007-09-28  8:30   ` Eli Zaretskii
  2007-09-28  9:38     ` William Xue
  2007-10-01  1:33     ` Kenichi Handa
  0 siblings, 2 replies; 11+ messages in thread
From: Eli Zaretskii @ 2007-09-28  8:30 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: william.xue, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> Date: Fri, 28 Sep 2007 15:31:04 +0900
> Cc: emacs-devel@gnu.org
> 
> \223 and \224 are code points of cp125X for LEFT DOUBLE
> QUOTATION MARK and RIGHT DOUBLE QUOTATION MARK.  Their
> Unicode code points are U+201C and U+201D (note that they
> are not included in ISO-8859-1).
> 
> With the trunk Emacs, you must use UTF-8 (or the other
> UTF-based encodings, some of CJK encodings) to handle those
> characters.

Actually, Emacs 22.1 displays these characters just fine if I type
"C-x RET c cp1252 RET C-x C-f char_err_clip.c RET".  OTOH, using UTF-8
instead of cp1252 still displays octal escapes, like I'd expect.

AFAIK, in Emacs 22 on Windows, cp1252 and friends map these characters
into Unicode by design, and that is why using cp1252 for visiting this
file does The Right Thing.  So I don't think the OP needs to wait for
the inherent Unicode support.

* Re: can not decode 0x93 and 0x94 to correct char
  2007-09-28  8:30   ` Eli Zaretskii
@ 2007-09-28  9:38     ` William Xue
  2007-10-01  1:33     ` Kenichi Handa
  1 sibling, 0 replies; 11+ messages in thread
From: William Xue @ 2007-09-28  9:38 UTC (permalink / raw)
  To: Eli Zaretskii, Kenichi Handa; +Cc: emacs-devel

On Fri, 28 Sep 2007 16:30:01 +0800, Eli Zaretskii <eliz@gnu.org> wrote:

>> From: Kenichi Handa <handa@m17n.org>
>> Date: Fri, 28 Sep 2007 15:31:04 +0900
>> Cc: emacs-devel@gnu.org
>>
>> \223 and \224 are code points of cp125X for LEFT DOUBLE
>> QUOTATION MARK and RIGHT DOUBLE QUOTATION MARK.  Their
>> Unicode code points are U+201C and U+201D (note that they
>> are not included in ISO-8859-1).
>>
>> With the trunk Emacs, you must use UTF-8 (or the other
>> UTF-based encodings, some of CJK encodings) to handle those
>> characters.
>
> Actually, Emacs 22.1 displays these characters just fine if I type
> "C-x RET c cp1252 RET C-x C-f char_err_clip.c RET".  OTOH, using UTF-8
> instead of cp1252 still displays octal escapes, like I'd expect.
>
> AFAIK, in Emacs 22 on Windows, cp1252 and friends map these characters
> into Unicode by design, and that is why using cp1252 for visiting this
> file does The Right Thing.  So I don't think the OP needs to wait for
> the inherent Unicode support.

I use the following settings in .emacs to make things work.

(prefer-coding-system 'windows-1258)
(prefer-coding-system 'utf-8-emacs)


-- 
Yours,
WilliamX

* Re: can not decode 0x93 and 0x94 to correct char
  2007-09-28  4:33 William Xue
  2007-09-28  6:31 ` Kenichi Handa
@ 2007-09-28 13:50 ` Stefan Monnier
  2007-09-28 14:45   ` Eli Zaretskii
  1 sibling, 1 reply; 11+ messages in thread
From: Stefan Monnier @ 2007-09-28 13:50 UTC (permalink / raw)
  To: William Xue; +Cc: emacs-devel

> Could you confirm the issue?

> version: GNU Emacs 23.0.0.1
> platform: winxp + sp2

> Steps:

> 1. emacs -q
> 2. open char_err_clip.c
> 3. \223GPL License\224

> please check screen shots for detail.

The problem here seems to be the default coding system used by Emacs.
Apparently it uses something like latin-1 rather than something
like cp1252.  I don't know enough about how such things are specified in
general (outside of Emacs) under w32 to be able to help any further, but all
I know is that maybe Emacs should try and figure out that your default coding
system should be cp1252.  Maybe the problem is that Emacs doesn't try to do
it, or maybe it doesn't know how to do it, or maybe it does it wrong, or
maybe it doesn't want to do it (e.g. because cp1252 covers the whole 256
possible bytes so the auto-detection can't work well).

        Stefan

* Re: can not decode 0x93 and 0x94 to correct char
  2007-09-28 13:50 ` Stefan Monnier
@ 2007-09-28 14:45   ` Eli Zaretskii
  2007-09-29  8:29     ` William Xue
  0 siblings, 1 reply; 11+ messages in thread
From: Eli Zaretskii @ 2007-09-28 14:45 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: william.xue, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Fri, 28 Sep 2007 09:50:47 -0400
> Cc: emacs-devel@gnu.org
> 
> > 1. emacs -q
> > 2. open char_err_clip.c
> > 3. \223GPL License\224
> 
> > please check screen shots for detail.
> 
> The problem here seems to be the default coding system used by Emacs.
> Apparently it uses something like latin-1 rather than something
> like cp1252.

Yes.  However, I don't think this is a problem, see below.

> I don't know enough about how such things are specified in
> general (outside of Emacs) under w32 to be able to help any further, but all
> I know is that maybe Emacs should try and figure out that your default coding
> system should be cp1252.  Maybe the problem is that Emacs doesn't try to do
> it, or maybe it doesn't know how to do it, or maybe it does it wrong, or
> maybe it doesn't want to do it (e.g. because cp1252 covers the whole 256
> possible bytes so the auto-detection can't work well).

Emacs on Windows looks up the UI language of the current user, and
then sets up the language environment for that language.  Most
language environments do not specify cpNNNN as their preferred
encodings, so neither does Emacs.

* Re: can not decode 0x93 and 0x94 to correct char
  2007-09-28 14:45   ` Eli Zaretskii
@ 2007-09-29  8:29     ` William Xue
  2007-09-29 13:47       ` Stefan Monnier
  0 siblings, 1 reply; 11+ messages in thread
From: William Xue @ 2007-09-29  8:29 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

On Fri, 28 Sep 2007 22:45:29 +0800, Eli Zaretskii <eliz@gnu.org> wrote:

> Emacs on Windows looks up the UI language of the current user, and
> then sets up the language environment for that language.  Most
> language environments do not specify cpNNNN as their preferred
> encodings, so neither does Emacs.

Are the following suitable settings for this situation?

; for cp1258
(prefer-coding-system 'windows-1258)
; for displaying utf-8 encoded file
(prefer-coding-system 'utf-8-emacs)
; for displaying chinese characters
(prefer-coding-system 'gb2312)

It would be a little problem. Because if I changed the gb2312 to gb18030
or gbk, the first setting (prefer-coding-system 'windows-1258) would
be failed.

-- 
Yours,
WilliamX

* Re: can not decode 0x93 and 0x94 to correct char
  2007-09-29  8:29     ` William Xue
@ 2007-09-29 13:47       ` Stefan Monnier
  2007-09-29 15:30         ` William Xue
  0 siblings, 1 reply; 11+ messages in thread
From: Stefan Monnier @ 2007-09-29 13:47 UTC (permalink / raw)
  To: William Xue; +Cc: Eli Zaretskii, emacs-devel

> ; for cp1258
> (prefer-coding-system 'windows-1258)
> ; for displaying utf-8 encoded file
> (prefer-coding-system 'utf-8-emacs)
> ; for displaying chinese characters
> (prefer-coding-system 'gb2312)

> It would be a little problem. Because if I changed the gb2312 to gb18030
> or gbk, the first setting (prefer-coding-system 'windows-1258) would
> be failed.

I'm not sure what you mean by "would be failed", but when you use
prefer-coding-system, you have to realize that it's not quite as simple as
it sounds:
- first, the three statements above mean to try (in this order) first
  gb2312, then utf-8, then windows-1258.
- second, this order should not be chosen exclusively based on how often
  you expect to use each of those encodings, because it also depends a lot
  on the frequency of false positives.  E.g. utf-8 should usually be first,
  because it has very few false positives (if the auto-detect decides it's
  utf-8, then it's very unlikely that the file isn't utf-8).
  OTOH windows-1258 should *not* be first because it has many false
  positives: any file without a 0 byte in it is a valid windows-1258 file.

The second point is the main reason why the order of detection of coding
systems when reading a file should be the same as the order of preference to
choose a coding system to use when writing a file.
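
The two points above can be sketched as follows, in Python for illustration only. The `prefer` and `detect` helpers are hypothetical: `prefer` models the fact that each prefer-coding-system call puts its argument at the front of the priority list, and `detect` simply returns the first codec that decodes the bytes without error, which is far cruder than what Emacs really does. Python's `latin-1` codec stands in for a permissive single-byte encoding, because it accepts every possible byte value.

```python
def prefer(priority, codec):
    """Model prefer-coding-system: move codec to the front of the list."""
    return [codec] + [c for c in priority if c != codec]


def detect(data, priority):
    """Return the first codec in priority order that decodes data cleanly."""
    for codec in priority:
        try:
            data.decode(codec)
            return codec
        except UnicodeDecodeError:
            continue
    return None


# Three calls in the same order as the .emacs snippet: the last call wins.
order = []
for codec in ("cp1258", "utf-8", "gb2312"):
    order = prefer(order, codec)
# order is now ["gb2312", "utf-8", "cp1258"]

utf8_text = "\u201cGPL License\u201d".encode("utf-8")

# Strict codec first: UTF-8 is recognised, because invalid input is rejected.
strict_first = detect(utf8_text, ["utf-8", "latin-1"])

# Permissive codec first: it accepts anything, so UTF-8 is never reached.
permissive_first = detect(utf8_text, ["latin-1", "utf-8"])
```

With the strict codec first the UTF-8 file is identified as UTF-8; with the permissive codec first the same bytes are claimed by it, which is the false-positive effect described above.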


        Stefan

* Re: can not decode 0x93 and 0x94 to correct char
  2007-09-29 13:47       ` Stefan Monnier
@ 2007-09-29 15:30         ` William Xue
  0 siblings, 0 replies; 11+ messages in thread
From: William Xue @ 2007-09-29 15:30 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

On Sat, 29 Sep 2007 21:47:37 +0800, Stefan Monnier  
<monnier@iro.umontreal.ca> wrote:

>> ; for cp1258
>> (prefer-coding-system 'windows-1258)
>> ; for displaying utf-8 encoded file
>> (prefer-coding-system 'utf-8-emacs)
>> ; for displaying chinese characters
>> (prefer-coding-system 'gb2312)
>
>> It would be a little problem. Because if I changed the gb2312 to gb18030
>> or gbk, the first setting (prefer-coding-system 'windows-1258) would
>> be failed.
>
> I'm not sure what you mean by "would be failed", but when you use

If I change gb2312 to gb18030 or gbk, the characters \223 and \224, which
are the left and right quotation marks in cp1258, are not decoded correctly.

So I think this may not be the right solution for this situation. If
somebody also wants to decode Japanese, French, Russian, and so on, it
becomes too complex.

> prefer-coding-system, you have to realize that it's not quite as simple as
> it sounds:
> - first, the three statements above mean to try (in this order) first
>   gb2312, then utf-8, then windows-1258.
> - second, this order should not be chosen exclusively based on how often
>   you expect to use each of those encodings, because it also depends a lot
>   on the frequency of false positives.  E.g. utf-8 should usually be first,
>   because it has very few false positives (if the auto-detect decides it's
>   utf-8, then it's very unlikely that the file isn't utf-8).
>   OTOH windows-1258 should *not* be first because it has many false
>   positives: any file without a 0 byte in it is a valid windows-1258 file.
>
> The second point is the main reason why the order of detection of coding
> systems when reading a file should be the same as the order of preference to
> choose a coding system to use when writing a file.

Thanks!

-- 
Yours,
WilliamX

* Re: can not decode 0x93 and 0x94 to correct char
  2007-09-28  8:30   ` Eli Zaretskii
  2007-09-28  9:38     ` William Xue
@ 2007-10-01  1:33     ` Kenichi Handa
  1 sibling, 0 replies; 11+ messages in thread
From: Kenichi Handa @ 2007-10-01  1:33 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: william.xue, emacs-devel

In article <utzpf42gm.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> > \223 and \224 are code points of cp125X for LEFT DOUBLE
> > QUOTATION MARK and RIGHT DOUBLE QUOTATION MARK.  Their
> > Unicode code points are U+201C and U+201D (note that they
> > are not included in ISO-8859-1).
> > 
> > With the trunk Emacs, you must use UTF-8 (or the other
> > UTF-based encodings, some of CJK encodings) to handle those
> > characters.

> Actually, Emacs 22.1 displays these characters just fine if I type
> "C-x RET c cp1252 RET C-x C-f char_err_clip.c RET".  OTOH, using UTF-8
> instead of cp1252 still displays octal escapes, like I'd expect.

> AFAIK, in Emacs 22 on Windows, cp1252 and friends map these characters
> into Unicode by design, and that is why using cp1252 for visiting this
> file does The Right Thing.  So I don't think the OP needs to wait for
> the inherent Unicode support.

Ah, sorry, it was my misunderstanding.  You are right.
Emacs 22 (and the trunk) correctly decodes them with cp1252.

---
Kenichi Handa
handa@m17n.org
