Questions on charset encoding detection and keyboard layout

unofficial mirror of help-gnu-emacs@gnu.org

* Questions on charset encoding detection and keyboard layout
@ 2009-12-10 14:42 Hou, Ruoyu
  2009-12-10 19:03 ` Eli Zaretskii
  0 siblings, 1 reply; 7+ messages in thread
From: Hou, Ruoyu @ 2009-12-10 14:42 UTC
  To: Emacs Mailing List

Hello,

I am new to Emacs and have started to run into some problems with
multilingual editing. I hope someone will kindly give me some hints.
My main platform is NetBSD 5.0.1, running emacs23.1-gtk from pkgsrc.

1. My work involves handling documents containing East Asian text in
various encodings (GB2312, GB18030, BIG5, GBK, UTF-8, Shift-JIS,
EUC-JP, ISO-2022-JP). My language environment is set to UTF-8, because
I want all the documents *I create* to be saved as UTF-8, while files
in other encodings should be displayed, edited and saved properly
without changing their original encoding. I noticed that my Emacs does
not properly recognize documents encoded in euc-jp, so I have to set
the encoding manually every time I encounter such a document. Are
there any settings I could tweak to auto-detect and display the
encodings mentioned above accurately? In most cases, I have no a
priori knowledge of the exact encoding of a given document.

From the manual I learned that some encodings are not easily
distinguishable from one another, so I guess the settings would be
delicate. A gvim user mentioned that EUC-JP has to come before GBK in
the encoding list to get the right result.

2. I'm using a computer with a Japanese keyboard layout (an 84-key
notebook variant of jp-106). When using Emacs in X GUI mode, the
language conversion keys (<henkan>, <muhenkan>, <hirakana/katakana>,
etc.) are echoed correctly in the echo area, so I can bind key macros
to them to toggle the Japanese input method. However, when I start
emacs -nw in, say, uxterm, those keys are not echoed.

Frankly, I prefer to run Emacs in non-GUI mode for faster response on
my old notebook. I'm wondering whether I could configure those
conversion keys to work the same way as in GUI mode. Do I have to
change any settings in X or Emacs? Has anyone experienced a similar
situation?

I have to confess that I have not gained much experience with Emacs
since switching my work platform to open source software, and as a
somewhat busy bench biologist I have probably skipped some homework I
should have done before asking. I would appreciate any hints or
pointers to further reading. Thanks for your attention.

Regards,
-- 
Hou, Ruoyu

Laboratory of Reproductive & Stem Cell Biology,
College of Life Science & Biotech.,
Shanghai Jiao Tong University,
Shanghai 200240, P.R.China.


* Re: Questions on charset encoding detection and keyboard layout
  2009-12-10 14:42 Questions on charset encoding detection and keyboard layout Hou, Ruoyu
@ 2009-12-10 19:03 ` Eli Zaretskii
  2009-12-11  5:42   ` Hou, Ruoyu
  0 siblings, 1 reply; 7+ messages in thread
From: Eli Zaretskii @ 2009-12-10 19:03 UTC
  To: help-gnu-emacs

> Date: Thu, 10 Dec 2009 22:42:11 +0800
> From: "Hou, Ruoyu" <phoenixhou@gmail.com>
> 
> 1. My work involves handling documents containing East Asian text in
> various encodings (GB2312, GB18030, BIG5, GBK, UTF-8, Shift-JIS,
> EUC-JP, ISO-2022-JP). My language environment is set to UTF-8, because
> I want all the documents *I create* to be saved as UTF-8, while files
> in other encodings should be displayed, edited and saved properly
> without changing their original encoding. I noticed that my Emacs does
> not properly recognize documents encoded in euc-jp, so I have to set
> the encoding manually every time I encounter such a document. Are
> there any settings I could tweak to auto-detect and display the
> encodings mentioned above accurately?

Try putting this in your ~/.emacs init file (and restart Emacs after
that):

 (prefer-coding-system 'euc-jp)





* Re: Questions on charset encoding detection and keyboard layout
  2009-12-10 19:03 ` Eli Zaretskii
@ 2009-12-11  5:42   ` Hou, Ruoyu
  2009-12-11  8:42     ` Eli Zaretskii
  0 siblings, 1 reply; 7+ messages in thread
From: Hou, Ruoyu @ 2009-12-11  5:42 UTC
  To: Emacs Mailing List

Dear Zaretskii,

I tried the tip you gave me, but now my GBK-encoded files are
unreadable. How would you solve this problem?

Moreover, as I mentioned in my previous post, how could I set
prefer-coding-system without knowing beforehand which encoding I am
going to encounter?

Thanks for your help.

Regards,


-- 
Hou, Ruoyu






* Re: Questions on charset encoding detection and keyboard layout
  2009-12-11  5:42   ` Hou, Ruoyu
@ 2009-12-11  8:42     ` Eli Zaretskii
  2009-12-11 15:18       ` Kevin Rodgers
  2009-12-11 19:51       ` Hou, Ruoyu
  0 siblings, 2 replies; 7+ messages in thread
From: Eli Zaretskii @ 2009-12-11  8:42 UTC
  To: help-gnu-emacs

> Date: Fri, 11 Dec 2009 13:42:39 +0800
> From: "Hou, Ruoyu" <phoenixhou@gmail.com>
> 
> I tried the tip you gave me, but now my GBK-encoded files are
> unreadable. How would you solve this problem?
> 
> Moreover, as I mentioned in my previous post, how could I set
> prefer-coding-system without knowing beforehand which encoding I am
> going to encounter?

If you have many documents in different encodings that Emacs cannot
distinguish by itself, then I'm afraid there's no good solution except
"C-x RET c", which requires that you know the encoding in advance.  At
least I'm not aware of any better way.  What do other applications do?

Of course, if you inadvertently visit a file without knowing the
encoding, and want to re-visit it with the correct encoding, after you
notice that Emacs didn't properly decode it, then typing "C-x RET c
CORRECT-ENCODING RET M-x revert-buffer RET" will fix the problem.
Here CORRECT-ENCODING is the correct encoding of the file.

Also, if you could somehow arrange for documents in different
encodings to reside in different directories, then perhaps you could
set up directory-local variables to make Emacs decode the files in
each directory correctly.  See the node "Directory Variables" in the
Emacs user manual for details about this feature.
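
A minimal sketch of the per-directory idea, using a path-based rule in
the init file rather than directory-local variables proper (as far as
I know, "coding" itself is not accepted as a directory-local variable,
so treat that as an assumption to check against the manual).  The
directories ~/docs/jp/ and ~/docs/cn/ are hypothetical and do not come
from this thread; modify-coding-system-alist adds a file-name regexp
to file-coding-system-alist:

```elisp
;; Hypothetical layout: EUC-JP files live under .../docs/jp/ and GBK
;; files under .../docs/cn/.  Adjust the regexps to your own paths.
(modify-coding-system-alist 'file "/docs/jp/" 'euc-jp)
(modify-coding-system-alist 'file "/docs/cn/" 'gbk)
```

Files whose names match neither regexp should still go through the
usual auto-detection.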


* Re: Questions on charset encoding detection and keyboard layout
  2009-12-11  8:42     ` Eli Zaretskii
@ 2009-12-11 15:18       ` Kevin Rodgers
  2009-12-11 19:51       ` Hou, Ruoyu
  1 sibling, 0 replies; 7+ messages in thread
From: Kevin Rodgers @ 2009-12-11 15:18 UTC
  To: help-gnu-emacs

Eli Zaretskii wrote:
> Of course, if you inadvertently visit a file without knowing the
> encoding, and want to re-visit it with the correct encoding, after you
> notice that Emacs didn't properly decode it, then typing "C-x RET c
> CORRECT-ENCODING RET M-x revert-buffer RET" will fix the problem.
> Here CORRECT-ENCODING is the correct encoding of the file.

aka C-x RET r CORRECT-ENCODING RET

-- 
Kevin Rodgers
Denver, Colorado, USA






* Re: Questions on charset encoding detection and keyboard layout
  2009-12-11  8:42     ` Eli Zaretskii
  2009-12-11 15:18       ` Kevin Rodgers
@ 2009-12-11 19:51       ` Hou, Ruoyu
  2009-12-11 20:54         ` Eli Zaretskii
  1 sibling, 1 reply; 7+ messages in thread
From: Hou, Ruoyu @ 2009-12-11 19:51 UTC
  To: help-gnu-emacs

Dear Zaretskii,

Before switching to Emacs I used EmEditor, a proprietary editor for
Windows. It could auto-detect files in different encodings and present
a list of candidate encodings, ordered by statistical confidence, from
which I could pick the most likely one. So I guess it implements some
statistical algorithm to detect the proper encoding.

I also tried MadEdit, an open source cross-platform editor. So far it
has automatically decoded the files it handled, without even needing
me to choose a likely encoding. I am not skilled enough to read its
source code, so I can't tell how it is done. I also don't know how
MULE handles encoding detection.

A friend of mine, a Vim user, showed me how to handle those different
encodings with ":set fencs=(a list of possible encodings, the point is
to put euc-jp before gbk)". It seems to be done by calling libiconv and
libintl (or gettext, I'm not sure).

I just thought that Emacs should perform better than, or at least as
well as, these programs.

Thanks for your help. I am actually using the commands you mentioned to
set encodings for viewing or saving. Sorting documents by encoding is a
good idea and habit, if only I had had the foresight. It's a bit
unrealistic when facing a large and constantly growing quantity of
unsorted documents in different encodings already on disk (as I always
complain, why can't those guys just use UTF-8?). Is it possible, for
example, to write a script to distinguish and sort those documents?

Regards,


-- 
Hou, Ruoyu



* Re: Questions on charset encoding detection and keyboard layout
  2009-12-11 19:51       ` Hou, Ruoyu
@ 2009-12-11 20:54         ` Eli Zaretskii
  0 siblings, 0 replies; 7+ messages in thread
From: Eli Zaretskii @ 2009-12-11 20:54 UTC
  To: help-gnu-emacs

> Date: Sat, 12 Dec 2009 03:51:50 +0800
> From: "Hou, Ruoyu" <phoenixhou@gmail.com>
> 
> Before switching to Emacs I used EmEditor, a proprietary editor for
> Windows. It could auto-detect files in different encodings and present
> a list of candidate encodings, ordered by statistical confidence, from
> which I could pick the most likely one. So I guess it implements some
> statistical algorithm to detect the proper encoding.

This feature still awaits a volunteer to be added to Emacs.  It
shouldn't be too hard, I think.

> A friend of mine, a Vim user, showed me how to handle those different
> encodings with ":set fencs=(a list of possible encodings, the point is
> to put euc-jp before gbk)".

The customization I suggested, i.e.

  (prefer-coding-system 'euc-jp)

was supposed to make euc-jp of higher priority than GBK (and
everything else).  However, I understand it did you more harm than
good.

For finer-grained control, try calling set-coding-system-priority
for every encoding you need to deal with, and in such an order that
the resulting list returned by coding-system-priority-list would show
the encodings in the order you want them.  (These two functions are
documented in the ELisp manual.)  I'm not sure this will have the same
effect as ":set fencs" in vim, though.
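
As a hedged sketch of that suggestion: set-coding-system-priority
accepts several coding systems in one call, highest priority first.
The particular order below is only an assumption modelled on the Vim
advice (euc-jp before gbk); it is not known to give correct detection
for every file, and it assumes nothing later in the init file resets
the priorities (a set-language-environment call, for instance, may do
so).

```elisp
;; Assumed order, highest priority first; tune it to your own files.
(set-coding-system-priority 'utf-8 'euc-jp 'gbk 'big5 'shift_jis)

;; Inspect the resulting order, e.g. with M-: or in *scratch*.
(coding-system-priority-list)
```

Coding systems not named in the call keep their relative order after
the ones listed.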

> Sorting documents by encoding is a good idea and habit, if only I had
> had the foresight. It's a bit unrealistic when facing a large and
> constantly growing quantity of unsorted documents in different
> encodings already on disk (as I always complain, why can't those guys
> just use UTF-8?). Is it possible, for example, to write a script to
> distinguish and sort those documents?

I would try to find a program that could print a file's
encoding. `file' does not do that, but maybe there's something else
out there.
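
Within the limits of its own detection, Emacs can also report its
guesses from Lisp.  A minimal sketch, where my-guess-file-coding is a
hypothetical helper name: it reads the raw bytes of a file and returns
the candidates from detect-coding-region, ordered by the current
priority list.  It is only as reliable as Emacs's detection, so it
will not separate EUC-JP from GBK any better than visiting the file
does.

```elisp
(defun my-guess-file-coding (file)
  "Return the coding systems Emacs considers possible for FILE.
Candidates are ordered by the current coding system priority."
  (with-temp-buffer
    ;; Work on raw bytes so nothing is decoded before detection.
    (set-buffer-multibyte nil)
    (insert-file-contents-literally file)
    (detect-coding-region (point-min) (point-max))))
```

For example, (my-guess-file-coding "~/docs/unknown.txt"), with a
made-up path, returns a list of coding system symbols that a script
could use to sort files into per-encoding directories.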

