How to determine encoding for file?

unofficial mirror of help-gnu-emacs@gnu.org

* How to determine encoding for file?
@ 2010-01-24 21:29 kj
  2010-01-24 21:59 ` Pascal J. Bourguignon
  2010-01-26  2:58 ` Kevin Rodgers
  0 siblings, 2 replies; 6+ messages in thread
From: kj @ 2010-01-24 21:29 UTC (permalink / raw)
  To: help-gnu-emacs

I've downloaded a large file that is supposed to contain a mixture
of Japanese and English (it's basically a learner's dictionary).
The English is displayed correctly, but not so for the Japanese.

I've tried setting the buffer's coding system to utf-8,
japanese-shift-jis, japanese-shift-jis-mac, japanese-shift-jis-dos
(just guessing).  None worked.

In fact, I'm not even sure that any of these changes of the coding
system achieved *anything*, since the buffer's appearance remained
unchanged throughout all this mucking around.  I used the command
set-buffer-file-coding-system to do this.  Should I need to do
anything besides re-setting the coding system to see a change in
how the file is displayed?

More importantly, is there a better way to determine a file's
correct coding system besides trial and error?

TIA!

~K


* Re: How to determine encoding for file?
  2010-01-24 21:29 How to determine encoding for file? kj
@ 2010-01-24 21:59 ` Pascal J. Bourguignon
  2010-01-25  5:57   ` tomas
       [not found]   ` <mailman.146.1264399148.14305.help-gnu-emacs@gnu.org>
  2010-01-26  2:58 ` Kevin Rodgers
  1 sibling, 2 replies; 6+ messages in thread
From: Pascal J. Bourguignon @ 2010-01-24 21:59 UTC (permalink / raw)
  To: help-gnu-emacs

kj <no.email@please.post> writes:

> I've downloaded a large file that is supposed to contain a mixture
> of Japanese and English (it's basically a learner's dictionary).
> The English is displayed correctly, but not so for the Japanese.
>
> I've tried setting the buffer's coding system to utf-8,
> japanese-shift-jis, japanese-shift-jis-mac, japanese-shift-jis-dos
> (just guessing).  None worked.
>
> In fact, I'm not even sure that any of these changes of the coding
> system achieved *anything*, since the buffer's appearance remained
> unchanged throughout all this mucking around.  I used the command
> set-buffer-file-coding-system to do this.  Should I need to do
> anything besides re-setting the coding system to see a change in
> how the file is displayed?
>
> More importantly, is there a better way to determine a file's
> correct coding system besides trial and error?

No, there is no better way.

From the sequences of bytes you find inside the file, you may
eliminate some encodings (e.g. some byte sequences are invalid in
UTF-8), but there are several encodings that use all the byte values
and in which any sequence of bytes is valid, so you cannot choose between
these encodings without knowing the meaning of the file.

For example, the following file:

$ od -t x1 -c /tmp/a.txt
0000000  50  72  69  63  65  20  6f  66  20  74  68  65  20  69  74  65
          P   r   i   c   e       o   f       t   h   e       i   t   e
0000020  6d  3a  20  31  34  20  a4  0a
          m   :       1   4     244  \n
0000030


The encoding cannot be determined without knowing what item is
referenced, and what price ranges are probable for this item.  It
could contain:

Price of the item: 14 €

or:

Price of the item: 14 ¤

(It could also contain another character, but it would be less probable).
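
A minimal Python sketch of the same ambiguity, using the bytes from the
od dump above. The post does not name the two encodings; pairing the
two readings with ISO-8859-15 and ISO-8859-1 is an assumption made for
this illustration, and strict_decodings is a hypothetical helper name.

```python
# Bytes from the od dump: "Price of the item: 14 " + 0xA4 + newline.
DATA = b"Price of the item: 14 \xa4\n"


def strict_decodings(data, encodings):
    """Return {encoding: text} for each encoding that decodes data without error."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = data.decode(enc)
        except UnicodeDecodeError:
            # This encoding is eliminated: the bytes are invalid in it.
            pass
    return results


if __name__ == "__main__":
    candidates = ["utf-8", "iso-8859-1", "iso-8859-15"]
    for enc, text in strict_decodings(DATA, candidates).items():
        print(f"{enc:12} {text!r}")
```

UTF-8 drops out because a lone 0xA4 byte is not a valid UTF-8 sequence,
but both ISO-8859 variants decode cleanly, and nothing in the bytes
themselves says which reading was intended.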


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/



* Re: How to determine encoding for file?
  2010-01-24 21:59 ` Pascal J. Bourguignon
@ 2010-01-25  5:57   ` tomas
       [not found]   ` <mailman.146.1264399148.14305.help-gnu-emacs@gnu.org>
  1 sibling, 0 replies; 6+ messages in thread
From: tomas @ 2010-01-25  5:57 UTC (permalink / raw)
  To: help-gnu-emacs


On Sun, Jan 24, 2010 at 10:59:46PM +0100, Pascal J. Bourguignon wrote:
> kj <no.email@please.post> writes:
> 
> > I've downloaded a large file that is supposed to contain a mixture
> > of Japanese and English (it's basically a learner's dictionary).
> > The English is displayed correctly, but not so for the Japanese.
> >
> > I've tried setting the buffer's coding system to utf-8,
> > japanese-shift-jis, japanese-shift-jis-mac, japanese-shift-jis-dos
> > (just guessing).  None worked.
> >
> > In fact, I'm not even sure that any of these changes of the coding
> > system achieved *anything*, since the buffer's appearance remained
> > unchanged throughout all this mucking around.  I used the command
> > set-buffer-file-coding-system to do this.

This won't do the trick (see below for what will). This function only
tells Emacs: "forget that you loaded this file as shift-JIS; from now on
treat it as UTF-8" (for example). It doesn't change the text already in
the buffer, but when you save the file, it will be converted to the new
coding system (if possible).

> >                                            Should I need to do
> > anything besides re-setting the coding system to see a change in
> > how the file is displayed?

You'll have to use `revert-buffer-with-coding-system' (by default mapped
to the key sequence C-x RET r). This will reload the file under the
assumption of the new coding system.
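
The difference between relabeling and reverting can be sketched outside
Emacs. The Python below is only an analogy, not how Emacs is
implemented; the Buffer class and the load, set_coding,
revert_with_coding and save helpers are hypothetical names made up for
this illustration.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    text: str    # characters as decoded when the file was read
    coding: str  # encoding that will be used on the next save


def load(raw: bytes, coding: str) -> Buffer:
    """Decode the file bytes once, at read time."""
    return Buffer(text=raw.decode(coding), coding=coding)


def set_coding(buf: Buffer, coding: str) -> None:
    """Analogue of set-buffer-file-coding-system: change the label only."""
    buf.coding = coding


def revert_with_coding(raw: bytes, coding: str) -> Buffer:
    """Analogue of revert-buffer-with-coding-system: decode the bytes again."""
    return load(raw, coding)


def save(buf: Buffer) -> bytes:
    """Encode the current text using the buffer's current label."""
    return buf.text.encode(buf.coding)


if __name__ == "__main__":
    raw = "日本語".encode("shift_jis")  # bytes on disk
    buf = load(raw, "latin-1")          # wrong guess at read time
    before = buf.text
    set_coding(buf, "utf-8")
    print(buf.text == before)           # the displayed text is untouched
    print(revert_with_coding(raw, "shift_jis").text)
```

In this model the text is fixed at read time, so only a second decode of
the original bytes can change what is displayed; the label matters only
when save runs.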

> > More importantly, is there a better way to determine a file's
> > correct coding system besides trial and error?

Pascal answered this part better than I could :-)

There will always be lots of byte sequences that are valid under several
coding systems (but mean different things). The methods out there to get a
grip on the problem are heuristic, partly based on statistical
properties of the text. If you want to have some fun understanding the
kind of problems involved, have a look at [1]. For an implementation in
Emacs Lisp, see Unicad [2].

--------
[1] <http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html>
[2] <http://www.emacswiki.org/emacs-en/Unicad>
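
A toy heuristic of this kind, far simpler than the detector described in
[1] or Unicad [2], might look like the Python sketch below. It assumes
the text is expected to be Japanese and simply prefers the candidate
whose non-ASCII characters land most often in Japanese Unicode blocks;
japanese_score and guess_encoding are hypothetical names, and the chosen
ranges are an assumption of this sketch.

```python
def japanese_score(text):
    """Fraction of non-ASCII characters that fall in common Japanese blocks."""
    non_ascii = [ch for ch in text if ord(ch) > 0x7F]
    if not non_ascii:
        return 0.0

    def is_japanese(ch):
        cp = ord(ch)
        return (0x3000 <= cp <= 0x303F      # CJK symbols and punctuation
                or 0x3040 <= cp <= 0x30FF   # hiragana and katakana
                or 0x4E00 <= cp <= 0x9FFF)  # CJK unified ideographs

    return sum(1 for ch in non_ascii if is_japanese(ch)) / len(non_ascii)


def guess_encoding(raw, candidates):
    """Return (encoding, score) for the best-scoring candidate, or None."""
    best = None
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # invalid byte sequence: candidate eliminated
        score = japanese_score(text)
        if best is None or score > best[1]:
            best = (enc, score)
    return best


if __name__ == "__main__":
    raw = "dictionary: 辞書 じしょ".encode("euc_jp")
    print(guess_encoding(raw, ["utf-8", "shift_jis", "euc_jp"]))
```

A score like this can only rank candidates. It cannot prove one of them
correct, which is why such methods remain heuristic.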

Regards



* Re: How to determine encoding for file?
       [not found]   ` <mailman.146.1264399148.14305.help-gnu-emacs@gnu.org>
@ 2010-01-25 14:55     ` kj
  2010-01-26  9:01       ` Thien-Thi Nguyen
  0 siblings, 1 reply; 6+ messages in thread
From: kj @ 2010-01-25 14:55 UTC (permalink / raw)
  To: help-gnu-emacs

Pascal, Tomas, many thanks for your comments and suggestions!  I
eventually resorted to a brute-force solution: I created many copies
of a fragment of the file, with a different -*- coding: ??? -*-
line at the top, and opened all these copies in Emacs.  Then I
scanned all of them using *Buffer List*.  Dumb but effective.
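
The same brute-force comparison can be sketched without making copies of
the file. This Python is an illustration rather than the procedure used
above; preview_decodings is a hypothetical helper, and the candidate
list is whatever encodings seem plausible.

```python
def preview_decodings(raw, candidates, width=60):
    """Return (encoding, preview) pairs for side-by-side visual comparison.

    Undecodable bytes are shown as U+FFFD instead of raising, so every
    candidate yields a line that can be inspected.
    """
    previews = []
    for enc in candidates:
        text = raw.decode(enc, errors="replace")
        first_line = text.splitlines()[0] if text else ""
        previews.append((enc, first_line[:width]))
    return previews


if __name__ == "__main__":
    fragment = "cat 猫\ndog 犬\n".encode("shift_jis")  # stand-in for a file fragment
    for enc, preview in preview_decodings(fragment, ["utf-8", "shift_jis", "euc_jp"]):
        print(f"{enc:10} {preview}")
```

As with the copies opened in Emacs, a human still has to look at the
previews and decide which one reads correctly.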

~K


* Re: How to determine encoding for file?
  2010-01-24 21:29 How to determine encoding for file? kj
  2010-01-24 21:59 ` Pascal J. Bourguignon
@ 2010-01-26  2:58 ` Kevin Rodgers
  1 sibling, 0 replies; 6+ messages in thread
From: Kevin Rodgers @ 2010-01-26  2:58 UTC (permalink / raw)
  To: help-gnu-emacs

kj wrote:
> I've downloaded a large file that is supposed to contain a mixture
> of Japanese and English (it's basically a learner's dictionary).
> The English is displayed correctly, but not so for the Japanese.
> 
> I've tried setting the buffer's coding system to utf-8,
> japanese-shift-jis, japanese-shift-jis-mac, japanese-shift-jis-dos
> (just guessing).  None worked.
> 
> In fact, I'm not even sure that any of these changes of the coding
> system achieved *anything*, since the buffer's appearance remained
> unchanged throughout all this mucking around.  I used the command
> set-buffer-file-coding-system to do this.  Should I need to do
> anything besides re-setting the coding system to see a change in
> how the file is displayed?

Use `C-x RET r' (aka M-x revert-buffer-with-coding-system).

> More importantly, is there a better way to determine a file's
> correct coding system besides trial and error?

-- 
Kevin Rodgers
Denver, Colorado, USA






* Re: How to determine encoding for file?
  2010-01-25 14:55     ` kj
@ 2010-01-26  9:01       ` Thien-Thi Nguyen
  0 siblings, 0 replies; 6+ messages in thread
From: Thien-Thi Nguyen @ 2010-01-26  9:01 UTC (permalink / raw)
  To: help-gnu-emacs

() kj <no.email@please.post>
() Mon, 25 Jan 2010 14:55:06 +0000 (UTC)

   Dumb but effective.

Out of curiosity, how many rounds did it take?
What was the encoding found, in the end?

If such a method is unavoidable, perhaps Emacs should have:

(require 'cl-lib)  ; for `cl-find-if'

(defun interactive-search-encoding (&optional candidates)
  "Search interactively for an acceptable encoding for the current buffer.
For each encoding, do `revert-buffer-with-coding-system', and
query the user if the result is acceptable.  Stop looping if so.
Optional arg CANDIDATES is a list of encodings (symbols) to try;
it defaults to the value returned by `coding-system-priority-list'.
Return non-nil if an acceptable encoding is found."
  (interactive
   (list (let ((raw (read-string "Try (space-separated) encodings: ")))
           ;; Empty input yields nil, which selects the default list below.
           (mapcar #'intern (split-string raw)))))
  (unless candidates (setq candidates (coding-system-priority-list)))
  ;; Skip the confirmation prompt of `revert-buffer' for unmodified buffers.
  (let ((revert-without-query '(".*")))
    (cl-find-if (lambda (coding)
                  (revert-buffer-with-coding-system coding)
                  (when (y-or-n-p (format "%s acceptable? " coding))
                    (message "buffer-file-coding-system now %s"
                             buffer-file-coding-system)
                    t))
                candidates)))

Emacs does have `select-safe-coding-system', but it is not interactive.

thi




