* bug#1654: 23.0.60; auto encoding detection (detect-coding-region) not working
@ 2008-12-21 8:21 Gaofeng Huang
0 siblings, 0 replies; 12+ messages in thread
From: Gaofeng Huang @ 2008-12-21 8:21 UTC (permalink / raw)
To: emacs-pretest-bug
[-- Attachment #1: Type: text/plain, Size: 500 bytes --]
Please write in English if possible, because the Emacs maintainers
usually do not have translators to read other languages for them.
Your bug report will be posted to the emacs-pretest-bug@gnu.org mailing list.
Please describe exactly what actions triggered the bug
and the precise symptoms of the bug:
Automatic encoding detection fails to pick the correct encoding (for
example, for the two attached files). The detect-coding-region
function does not work either (after M-x find-file-literally).
[-- Attachment #2: test_gbk.txt --]
[-- Type: text/plain, Size: 153 bytes --]
File: 08-jj_lin-street-tosk.mp3
Title: 街道 Track: 8
Artist: 林俊傑
Album: 陸 Year: 2008
[-- Attachment #3: test_big5.txt --]
[-- Type: text/plain, Size: 153 bytes --]
File: 08-jj_lin-street-tosk.mp3
Title: 街道 Track: 8
Artist: 林俊傑
Album: 陸 Year: 2008
[-- Attachment #4: Type: text/plain, Size: 3205 bytes --]
If Emacs crashed, and you have the Emacs process in the gdb debugger,
please include the output from the following gdb commands:
`bt full' and `xbacktrace'.
If you would like to further debug the crash, please read the file
/Applications/Emacs.app/Contents/Resources/etc/DEBUG for instructions.
In GNU Emacs 23.0.60.1 (i386-apple-darwin9.5.0, NS apple-appkit-949.35)
of 2008-12-11 on neutron.local
Windowing system distributor `Apple', version 97.112.112.108.101.45.97.112.112.107.105.116.45.57.52.57.46.51.53
configured using `configure '--with-ns''
Important settings:
value of $LC_ALL: nil
value of $LC_COLLATE: nil
value of $LC_CTYPE: zh_CN.UTF-8
value of $LC_MESSAGES: nil
value of $LC_MONETARY: nil
value of $LC_NUMERIC: nil
value of $LC_TIME: nil
value of $LANG: nil
value of $XMODIFIERS: @im=fcitx
locale-coding-system: utf-8-unix
default-enable-multibyte-characters: t
Major mode: Group
Minor modes in effect:
erc-log-mode: t
erc-list-mode: t
erc-menu-mode: t
erc-autojoin-mode: t
erc-ring-mode: t
erc-networks-mode: t
erc-pcomplete-mode: t
erc-track-mode: t
erc-track-minor-mode: t
erc-match-mode: t
erc-button-mode: t
erc-fill-mode: t
erc-stamp-mode: t
erc-netsplit-mode: t
erc-irccontrols-mode: t
erc-noncommands-mode: t
erc-move-to-prompt-mode: t
erc-readonly-mode: t
gnus-undo-mode: t
iswitchb-mode: t
recentf-mode: t
which-function-mode: t
show-paren-mode: t
mouse-sel-mode: t
global-hl-line-mode: t
pinbar-mode: t
shell-dirtrack-mode: t
tooltip-mode: t
tool-bar-mode: t
mouse-wheel-mode: t
file-name-shadow-mode: t
global-font-lock-mode: t
font-lock-mode: t
global-auto-composition-mode: t
auto-composition-mode: t
auto-encryption-mode: t
auto-compression-mode: t
column-number-mode: t
line-number-mode: t
transient-mark-mode: t
Recent input:
g i p SPC r y m SPC DEL r t f p SPC v b SPC b SPC .
RET w q SPC j SPC i SPC j SPC e t SPC r SPC h j SPC
t h g c SPC s g SPC ? RET ESC 1 C-c C-@ C-c C-@ C-l
ESC 1 p ESC x u n i TAB c TAB TAB - v TAB RET C-n C-n
C-n C-n C-n C-n C-n C-n C-n C-n C-n C-n C-x C-f ESC
p RET C-n C-n C-n C-n C-n C-n C-n C-n C-n C-n C-n C-n
C-n C-n C-n C-n C-x k C-x b & RET ESC 1 C-c C-@ DEL
DEL DEL DEL DEL DEL DEL DEL DEL DEL DEL DEL DEL DEL
DEL DEL DEL C-l C-l C-l C-l DEL DEL ESC 1 ESC x n e
w s C-p C-p RET U N o o o o o o ESC x n e w s C-p C-p
RET ESC 1 C-p C-p C-p C-p C-p C-p C-p C-p C-p C-p C-p
C-p C-p C-p C-p C-p C-p C-p C-p C-p C-p C-p C-p C-p
C-x C-f ~ / t e s t _ g b k TAB RET C-x k C-x C-f ESC
p C-g ESC x u n i c a TAB - d i s TAB RET C-x C-f ESC
p RET C-x k C-x C-f ESC p ESC DEL ESC DEL b i g TAB
RET C-x k ESC x ESC p C-e ESC DEL e n TAB RET C-x C-f
ESC p RET C-x k C-x C-f ESC p ESC DEL ESC DEL g b k
TAB RET C-x k C-n C-n C-n C-n C-n C-n C-n C-n C-p ESC
x r e p o TAB r TAB RET
Recent messages:
Truncate long lines enabled [2 times]
Truncate long lines enabled [2 times]
Reading active file from gmail via nnimap...
nnimap: Checking mailboxes...done
Reading active file from freenews.netfront.net via nntp...
Opening nntp server on freenews.netfront.net...done
Checking new news...done
Making completion list...
^ permalink raw reply [flat|nested] 12+ messages in thread
* bug#1654: 23.0.60; auto encoding detection (detect-coding-region) not working
@ 2009-03-27 2:59 Chong Yidong
2009-03-27 4:34 ` Kenichi Handa
0 siblings, 1 reply; 12+ messages in thread
From: Chong Yidong @ 2009-03-27 2:59 UTC (permalink / raw)
To: Kenichi Handa; +Cc: 1654
Hi Handa-san,
Could you take a look at bug#1654? Thanks.
http://emacsbugs.donarmstrong.com/cgi-bin/bugreport.cgi?bug=1654
^ permalink raw reply [flat|nested] 12+ messages in thread
* bug#1654: 23.0.60; auto encoding detection (detect-coding-region) not working
2009-03-27 2:59 Chong Yidong
@ 2009-03-27 4:34 ` Kenichi Handa
2009-03-27 4:58 ` Chong Yidong
2009-03-27 5:03 ` poppyer
0 siblings, 2 replies; 12+ messages in thread
From: Kenichi Handa @ 2009-03-27 4:34 UTC (permalink / raw)
To: Chong Yidong; +Cc: poppyer, 1654
In article <87k56bu0uu.fsf@cyd.mit.edu>, Chong Yidong <cyd@stupidchicken.com> writes:
> Hi Handa-san,
> Could you take a look at bug#1654? Thanks.
> http://emacsbugs.donarmstrong.com/cgi-bin/bugreport.cgi?bug=1654
Ok, but I can't reproduce the problem for test_gbk.txt.
What is shown by C-h C RET?
Gaofeng Huang wrote:
> Important settings:
[...]
> value of $LC_CTYPE: zh_CN.UTF-8
So, I installed that locale and ran Emacs under it. The
file test_gbk.txt was correctly detected as GBK.
But, as the byte-sequence patterns of GBK and Big5 are not
distinguishable, it is impossible to make Emacs detect both
files correctly at the same time.
By the way, as the attached test_big5.txt is identical to
test_gbk.txt, I converted test_big5.txt to Big5 with iconv
and tested with that.
> The auto encoding detection can not detect the correct encoding (for
> example, for the two files attached). And neither the
> detect-coding-region function works (after M-x find-file-literally)
At least detect-coding-region works correctly for
test_gbk.txt. But it doesn't work for test_big5.txt, for the
same reason as above.
If you want to read a Big5 file correctly, do something like
this:
C-x C-m c big5 RET C-x C-f _FILENAME_
or change the language environment to Chinese-BIG5 by:
C-x C-m l Chinese-BIG5 RET
or run Emacs under Big5 locale.
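The first two workarounds can also be written in Lisp, for example in an init file. What follows is only a minimal sketch: the helper name `my-find-file-as-big5` is hypothetical and not part of Emacs, while `coding-system-for-read`, `find-file` and `set-language-environment` are the standard facilities behind the key sequences above.

```elisp
;; Hypothetical helper: visit FILE, forcing Big5 decoding for this one read.
;; Binding `coding-system-for-read' has the same effect as typing the
;; C-x C-m c big5 RET prefix before C-x C-f.
(defun my-find-file-as-big5 (file)
  "Visit FILE, decoding it as Big5 regardless of auto-detection."
  (interactive "fFind Big5 file: ")
  (let ((coding-system-for-read 'big5))
    (find-file file)))

;; Alternatively, switch the whole session to the Big5 language
;; environment (the Lisp counterpart of C-x C-m l Chinese-BIG5 RET):
;; (set-language-environment "Chinese-BIG5")
```

The per-file helper leaves detection for other files untouched, which may be preferable when GBK and Big5 files are mixed.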
---
Kenichi Handa
handa@m17n.org
^ permalink raw reply [flat|nested] 12+ messages in thread
* bug#1654: 23.0.60; auto encoding detection (detect-coding-region) not working
2009-03-27 4:34 ` Kenichi Handa
@ 2009-03-27 4:58 ` Chong Yidong
2009-03-27 5:20 ` poppyer
2009-03-27 6:02 ` Kenichi Handa
2009-03-27 5:03 ` poppyer
1 sibling, 2 replies; 12+ messages in thread
From: Chong Yidong @ 2009-03-27 4:58 UTC (permalink / raw)
To: Kenichi Handa; +Cc: poppyer, 1654
Kenichi Handa <handa@m17n.org> writes:
>> http://emacsbugs.donarmstrong.com/cgi-bin/bugreport.cgi?bug=1654
>
> Ok, but I can't reproduce the problem for test_gbk.txt.
> What is shown by C-h C RET?
Strangely enough, I get
Coding system for saving this buffer:
S -- japanese-shift-jis-unix (alias: shift_jis-unix sjis-unix
cp932-unix)
My LANG is en_US.UTF-8.
^ permalink raw reply [flat|nested] 12+ messages in thread
* bug#1654: 23.0.60; auto encoding detection (detect-coding-region) not working
2009-03-27 4:34 ` Kenichi Handa
2009-03-27 4:58 ` Chong Yidong
@ 2009-03-27 5:03 ` poppyer
2009-03-27 6:55 ` Kenichi Handa
1 sibling, 1 reply; 12+ messages in thread
From: poppyer @ 2009-03-27 5:03 UTC (permalink / raw)
To: Kenichi Handa; +Cc: Chong Yidong, 1654
[-- Attachment #1: Type: text/plain, Size: 1255 bytes --]
Kenichi Handa <handa@m17n.org> writes:
> In article <87k56bu0uu.fsf@cyd.mit.edu>, Chong Yidong <cyd@stupidchicken.com> writes:
>
>> Hi Handa-san,
>> Could you take a look at bug#1654? Thanks.
>
>> http://emacsbugs.donarmstrong.com/cgi-bin/bugreport.cgi?bug=1654
>
>
>> The auto encoding detection can not detect the correct encoding (for
>> example, for the two files attached). And neither the
>> detect-coding-region function works (after M-x find-file-literally)
>
> At least detect-coding-region works correctly for
> test_gbk.txt. But, it doesn't work for test_big5.txt by the
> same reason as above.
>
Yes, the GBK issue is confirmed fixed in CVS
(after coding.c rev1.413).
But for Big5, the list returned by
"(detect-coding-region (region-beginning) (region-end))"
does not contain big5. I understand that GBK and Big5 byte sequences might
not be easy to distinguish, but in this case both encodings are
compatible with the literal input text, so both should be in the returned list. Am
I right? Can you check this?
BTW, is there any hook that I can run after the coding detection? I might
want to write a small Lisp function to distinguish Big5 from GBK (by character
statistics, for example).
I re-attached the test_big5.txt file here.
[-- Attachment #2: test_big5.txt --]
[-- Type: text/plain, Size: 153 bytes --]
File: 08-jj_lin-street-tosk.mp3
Title: 街道 Track: 8
Artist: 林俊傑
Album: 陸 Year: 2008
[-- Attachment #3: Type: text/plain, Size: 57 bytes --]
Cheers,
poppyer
> ---
> Kenichi Handa
> handa@m17n.org
^ permalink raw reply [flat|nested] 12+ messages in thread
* bug#1654: 23.0.60; auto encoding detection (detect-coding-region) not working
2009-03-27 4:58 ` Chong Yidong
@ 2009-03-27 5:20 ` poppyer
2009-03-27 7:29 ` Kenichi Handa
2009-03-27 6:02 ` Kenichi Handa
1 sibling, 1 reply; 12+ messages in thread
From: poppyer @ 2009-03-27 5:20 UTC (permalink / raw)
To: Chong Yidong; +Cc: 1654
Chong Yidong <cyd@stupidchicken.com> writes:
> Kenichi Handa <handa@m17n.org> writes:
>
>>> http://emacsbugs.donarmstrong.com/cgi-bin/bugreport.cgi?bug=1654
>>
>> Ok, but I can't reproduce the problem for test_gbk.txt.
>> What is shown by C-h C RET?
>
> Strangely enough, I get
>
> Coding system for saving this buffer:
> S -- japanese-shift-jis-unix (alias: shift_jis-unix sjis-unix
> cp932-unix)
>
> My LANG is en_US.UTF-8.
Yes, that is expected, I think. For non-zh_CN locales, Japanese is
prioritized for CJK.
IMPORTANT: ignore the .txt file attached to my previous email. It
seems that my email system converts Big5 to GBK automatically. So yes,
you need to convert it to Big5 with iconv before testing.
So the current issue is this: the docstring of detect-coding-region says
"Return a list of possible coding systems ordered by priority",
but for a Big5 text file, the big5 coding system is not in the returned
list.
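To make the check concrete, here is a small sketch of how one might test whether a given coding system appears among the candidates. The helper name `my-coding-candidate-p` is hypothetical; `detect-coding-region` and `coding-system-base` are the existing functions.

```elisp
;; Hypothetical helper: does `detect-coding-region' list CODING for the
;; text between BEG and END?  End-of-line variants (-unix, -dos, -mac)
;; are ignored by comparing base coding systems.
(defun my-coding-candidate-p (coding beg end)
  "Return non-nil if CODING is among the coding systems detected in BEG..END."
  (let ((wanted (coding-system-base coding)))
    (memq wanted
          (mapcar #'coding-system-base
                  (detect-coding-region beg end)))))

;; Usage, in a buffer visited with M-x find-file-literally:
;; (my-coding-candidate-p 'big5 (point-min) (point-max))
```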
Cheers,
poppyer
^ permalink raw reply [flat|nested] 12+ messages in thread
* bug#1654: 23.0.60; auto encoding detection (detect-coding-region) not working
2009-03-27 4:58 ` Chong Yidong
2009-03-27 5:20 ` poppyer
@ 2009-03-27 6:02 ` Kenichi Handa
1 sibling, 0 replies; 12+ messages in thread
From: Kenichi Handa @ 2009-03-27 6:02 UTC (permalink / raw)
To: Chong Yidong; +Cc: poppyer, 1654
In article <87vdpv7eal.fsf@cyd.mit.edu>, Chong Yidong <cyd@stupidchicken.com> writes:
> Kenichi Handa <handa@m17n.org> writes:
>>> http://emacsbugs.donarmstrong.com/cgi-bin/bugreport.cgi?bug=1654
> >
> > Ok, but I can't reproduce the problem for test_gbk.txt.
> > What is shown by C-h C RET?
> Strangely enough, I get
> Coding system for saving this buffer:
> S -- japanese-shift-jis-unix (alias: shift_jis-unix sjis-unix
> cp932-unix)
> My LANG is en_US.UTF-8.
In that locale, the above is normal. Shift-JIS can't be
distinguished from Big5 and GBK either.
---
Kenichi Handa
handa@m17n.org
^ permalink raw reply [flat|nested] 12+ messages in thread
* bug#1654: 23.0.60; auto encoding detection (detect-coding-region) not working
2009-03-27 5:03 ` poppyer
@ 2009-03-27 6:55 ` Kenichi Handa
2009-03-27 7:00 ` poppyer
2009-03-27 8:52 ` poppyer
0 siblings, 2 replies; 12+ messages in thread
From: Kenichi Handa @ 2009-03-27 6:55 UTC (permalink / raw)
To: poppyer; +Cc: cyd, 1654
In article <ukwsabwo8x.fsf@nusnet-97-126.dynip.nus.edu.sg>, poppyer <poppyer@gmail.com> writes:
> But for the big5, in the list returned by
> "(detect-coding-region (region-beginning) (region-end))",
> there is not big5. I do understand that gbk and big5's sequences might
> not be easy to distinguish, but in this case, both encodings are
> compatible to the input literal text, so both should be in the returned list. Am
> I right?
You are right. But the current Emacs can't have both GBK
and Big5 in the list of coding systems to try for detection,
because they are in the same coding-system category
(i.e. charset-based). I know that this restriction is not
good, and improving it is on my todo list, but I still don't
have time to work on it.
> BTW, is that any hook that I can put after the coding detection? I might
> want to write a small lisp to distinguish BIG5 and GBK (by char statistics,
> for example).
We don't have such a hook, but I think you can use
after-insert-file-functions for reading a file. When that
hook is called, the buffer already contains text decoded
by buffer-file-coding-system. You can re-decode the newly
inserted text like this:
(defun check-gbk-big5 (nchars)
  (if (and enable-multibyte-characters
           (not coding-system-for-read)
           (coding-system-equal
            'chinese-gbk (coding-system-base buffer-file-coding-system)))
      (let* ((pos (point))
             (end (+ pos nchars))
             (modified (buffer-modified-p)))
        (when (search-forward "\x5201" end t) ;; (*1)
          (save-restriction
            (goto-char pos)
            (narrow-to-region pos end)
            (encode-coding-region pos end buffer-file-coding-system)
            (decode-coding-region pos (point-max) 'big5)
            (set-buffer-file-coding-system last-coding-system-used)
            (set-buffer-modified-p modified)
            ;; The hook must return a character count, not a position.
            (setq nchars (- (point-max) (point-min)))))))
  nchars)
(add-hook 'after-insert-file-functions 'check-gbk-big5)
You can replace the (*1) part with your own check function.
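As one possible shape for such a check function, the sketch below counts how many characters in a region belong to a caller-supplied set. It is only an illustration of the "char statistics" idea: the name `my-count-chars-in-set` is hypothetical, and which characters indicate a mis-decoded Big5 file is left to the caller.

```elisp
;; Hypothetical helper: count the characters between BEG and END that
;; appear in the string CHARS.  Point is left where it was.
(defun my-count-chars-in-set (chars beg end)
  "Return how many characters in BEG..END occur in the string CHARS."
  (let ((set (string-to-list chars))
        (count 0))
    (save-excursion
      (goto-char beg)
      (while (< (point) end)
        (when (memq (char-after) set)
          (setq count (1+ count)))
        (forward-char 1)))
    count))
```

In check-gbk-big5, the (*1) test could then become something like `(> (my-count-chars-in-set suspicious-chars pos end) 0)`, where `suspicious-chars` is a string you choose yourself.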
---
Kenichi Handa
handa@m17n.org
^ permalink raw reply [flat|nested] 12+ messages in thread
* bug#1654: 23.0.60; auto encoding detection (detect-coding-region) not working
2009-03-27 6:55 ` Kenichi Handa
@ 2009-03-27 7:00 ` poppyer
2009-03-27 8:52 ` poppyer
1 sibling, 0 replies; 12+ messages in thread
From: poppyer @ 2009-03-27 7:00 UTC (permalink / raw)
To: Kenichi Handa; +Cc: cyd, 1654
Kenichi Handa <handa@m17n.org> writes:
> In article <ukwsabwo8x.fsf@nusnet-97-126.dynip.nus.edu.sg>, poppyer <poppyer@gmail.com> writes:
>
>> But for the big5, in the list returned by
>> "(detect-coding-region (region-beginning) (region-end))",
>> there is not big5. I do understand that gbk and big5's sequences might
>> not be easy to distinguish, but in this case, both encodings are
>> compatible to the input literal text, so both should be in the returned list. Am
>> I right?
>
> You are right. But, the current Emacs can't have both GBK
> and Big5 in a list of coding systems to try for detecting
> because they are in the same category of coding-system
> (i.e. charset-base). I know that this restriction is not
> good, and improving it is in my todo list, but I still don't
> have a time to work on it.
>
Thanks, that is clear.
poppyer
^ permalink raw reply [flat|nested] 12+ messages in thread
* bug#1654: 23.0.60; auto encoding detection (detect-coding-region) not working
2009-03-27 5:20 ` poppyer
@ 2009-03-27 7:29 ` Kenichi Handa
0 siblings, 0 replies; 12+ messages in thread
From: Kenichi Handa @ 2009-03-27 7:29 UTC (permalink / raw)
To: poppyer; +Cc: cyd, 1654
In article <h0r60jwnh8.fsf@nusnet-97-126.dynip.nus.edu.sg>, poppyer <poppyer@gmail.com> writes:
> So the current issue is that: in detect-coding-region's DOC, it says
> "Return a list of possible coding systems ordered by priority".
> But for big5 txt file, the big5 coding system is not in the returned
> list.
OK, I've just modified the docstrings as follows:
DEFUN ("detect-coding-region", Fdetect_coding_region, Sdetect_coding_region,
2, 3, 0,
doc: /* Detect coding system of the text in the region between START and END.
Return a list of possible coding systems ordered by priority.
The coding systems to try and their priorities follows what
the function `coding-system-priority-list' (which see) returns.
[...]
DEFUN ("coding-system-priority-list", Fcoding_system_priority_list,
Scoding_system_priority_list, 0, 1, 0,
doc: /* Return a list of coding systems ordered by their priorities.
The list contains a subset of coding systems; i.e. coding systems
assigned to each coding category (see `coding-category-list').
[...]
Are they clear? If not, please improve them.
---
Kenichi Handa
handa@m17n.org
^ permalink raw reply [flat|nested] 12+ messages in thread
* bug#1654: 23.0.60; auto encoding detection (detect-coding-region) not working
2009-03-27 6:55 ` Kenichi Handa
2009-03-27 7:00 ` poppyer
@ 2009-03-27 8:52 ` poppyer
2009-03-30 1:08 ` Kenichi Handa
1 sibling, 1 reply; 12+ messages in thread
From: poppyer @ 2009-03-27 8:52 UTC (permalink / raw)
To: Kenichi Handa; +Cc: cyd, 1654
Kenichi Handa <handa@m17n.org> writes:
> In article <ukwsabwo8x.fsf@nusnet-97-126.dynip.nus.edu.sg>, poppyer <poppyer@gmail.com> writes:
>
>> But for the big5, in the list returned by
>> "(detect-coding-region (region-beginning) (region-end))",
>> there is not big5. I do understand that gbk and big5's sequences might
>> not be easy to distinguish, but in this case, both encodings are
>> compatible to the input literal text, so both should be in the returned list. Am
>> I right?
>
> You are right. But, the current Emacs can't have both GBK
> and Big5 in a list of coding systems to try for detecting
> because they are in the same category of coding-system
> (i.e. charset-base). I know that this restriction is not
> good, and improving it is in my todo list, but I still don't
> have a time to work on it.
>
I just re-examined the code and found the bug.
It is in lisp/language/chinese.el,
near line 125:
=================
(define-coding-system 'chinese-big5
  "BIG5 8-bit encoding for Chinese (MIME:Big5)"
  :coding-type 'charset
  :mnemonic ?B
  :charset-list '(ascii big5)
  :mime-charset 'big5)
=====================
should be:
=================
(define-coding-system 'chinese-big5
  "BIG5 8-bit encoding for Chinese (MIME:Big5)"
  :coding-type 'big5 ;; change charset to big5 here, poppyer
  :mnemonic ?B
  :charset-list '(ascii big5)
  :mime-charset 'big5)
=====================
After recompiling Emacs, I get
(coding-system-category 'big5) => coding_category_big5
and coding_category_big5 is already defined in coding.c.
So:
gbk belongs to coding_category_charset
big5 belongs to coding_category_big5
sjis belongs to coding_category_sjis
These are three different categories, so all three can be listed by
detect-coding-region.
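A quick way to inspect the categories from Lisp is sketched below. The helper name `my-coding-categories` is hypothetical; it only wraps the `coding-system-category` call used above, and the values it reports depend on the Emacs build being tested.

```elisp
;; Hypothetical helper: pair each coding system with its detection category.
(defun my-coding-categories (coding-systems)
  "Return an alist of (CODING-SYSTEM . CATEGORY) for CODING-SYSTEMS."
  (mapcar (lambda (cs)
            (cons cs (coding-system-category cs)))
          coding-systems))

;; Usage:
;; (my-coding-categories '(chinese-gbk chinese-big5 japanese-shift-jis))
```

If two entries come back with the same category, only one of them can appear in the list that detect-coding-region tries.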
Cheers,
poppyer
^ permalink raw reply [flat|nested] 12+ messages in thread
* bug#1654: 23.0.60; auto encoding detection (detect-coding-region) not working
2009-03-27 8:52 ` poppyer
@ 2009-03-30 1:08 ` Kenichi Handa
0 siblings, 0 replies; 12+ messages in thread
From: Kenichi Handa @ 2009-03-30 1:08 UTC (permalink / raw)
To: poppyer; +Cc: cyd, 1654
In article <9m7i2bwdod.fsf@nusnet-97-126.dynip.nus.edu.sg>, poppyer <poppyer@gmail.com> writes:
> I just re-examine the code, and find the bug.
> And it is a bug in lisp/language/chinese.el
> near line 125:
[...]
> should be:
> =================
> (define-coding-system 'chinese-big5
> "BIG5 8-bit encoding for Chinese (MIME:Big5)"
> :coding-type 'big5 ;; change charset to big5 here, poppyer
> :mnemonic ?B
> :charset-list '(ascii big5)
> :mime-charset 'big5)
> =====================
Actually, this is not a bug. When I introduced the
coding-type `charset' in Emacs 23, I changed most
coding-systems that require charset mapping to that type.
In the future, I want to delete all Big5- (and SJIS-) specific
code in coding.c.
But, the implementation of detecting coding systems of the
same type won't be in time for Emacs 23.1. So, I'll commit
your change.
---
Kenichi Handa
handa@m17n.org
^ permalink raw reply [flat|nested] 12+ messages in thread