From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: tomas@tuxteam.de
Newsgroups: gmane.emacs.help
Subject: Re: How to determine encoding for file?
Date: Mon, 25 Jan 2010 06:57:43 +0100
Message-ID: <20100125055743.GB26580@tomas>
References: <hjie4c$e4v$1@reader1.panix.com>
	<87y6jn2mgd.fsf@hubble.lan.informatimago.com>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii; x-action=pgp-signed
X-Trace: ger.gmane.org 1264399222 10857 80.91.229.12 (25 Jan 2010 06:00:22 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Mon, 25 Jan 2010 06:00:22 +0000 (UTC)
To: help-gnu-emacs@gnu.org
Original-X-From: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Mon Jan 25 07:00:15 2010
Return-path: <help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org>
Envelope-to: geh-help-gnu-emacs@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by lo.gmane.org with esmtp (Exim 4.50)
	id 1NZHzC-0000Xb-Ap
	for geh-help-gnu-emacs@m.gmane.org; Mon, 25 Jan 2010 07:00:15 +0100
Original-Received: from localhost ([127.0.0.1]:40416 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1NZHz1-0001Lo-FX
	for geh-help-gnu-emacs@m.gmane.org; Mon, 25 Jan 2010 00:59:31 -0500
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1NZHyd-0001LE-Cz
	for help-gnu-emacs@gnu.org; Mon, 25 Jan 2010 00:59:07 -0500
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1NZHyY-0001KB-10
	for help-gnu-emacs@gnu.org; Mon, 25 Jan 2010 00:59:06 -0500
Original-Received: from [199.232.76.173] (port=54344 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1NZHyX-0001K8-Tm
	for help-gnu-emacs@gnu.org; Mon, 25 Jan 2010 00:59:01 -0500
Original-Received: from alextrapp1.equinoxe.de ([217.22.192.104]:37383
	helo=www.elogos.de) by monty-python.gnu.org with esmtp (Exim 4.60)
	(envelope-from <tomas@tuxteam.de>) id 1NZHyW-0001UE-V3
	for help-gnu-emacs@gnu.org; Mon, 25 Jan 2010 00:59:01 -0500
Original-Received: by www.elogos.de (Postfix, from userid 1000)
	id 18D7A9004A; Mon, 25 Jan 2010 06:57:43 +0100 (CET)
Content-Disposition: inline
In-Reply-To: <87y6jn2mgd.fsf@hubble.lan.informatimago.com>
User-Agent: Mutt/1.5.15+20070412 (2007-04-11)
X-detected-operating-system: by monty-python.gnu.org: GNU/Linux 2.6 (newer, 2)
X-BeenThere: help-gnu-emacs@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Users list for the GNU Emacs text editor <help-gnu-emacs.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/help-gnu-emacs>,
	<mailto:help-gnu-emacs-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/help-gnu-emacs>
List-Post: <mailto:help-gnu-emacs@gnu.org>
List-Help: <mailto:help-gnu-emacs-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/help-gnu-emacs>,
	<mailto:help-gnu-emacs-request@gnu.org?subject=subscribe>
Original-Sender: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org
Errors-To: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.help:71431
Archived-At: <http://permalink.gmane.org/gmane.emacs.help/71431>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sun, Jan 24, 2010 at 10:59:46PM +0100, Pascal J. Bourguignon wrote:
> kj <no.email@please.post> writes:
> 
> > I've downloaded a large file that is supposed to contain a mixture
> > of Japanese and English (it's basically a learner's dictionary).
> > The English is displayed correctly, but not so for the Japanese.
> >
> > I've tried setting the buffer's coding system to utf-8,
> > japanese-shift-jis, japanese-shift-jis-mac, japanese-shift-jis-dos
> > (just guessing).  None worked.
> >
> > In fact, I'm not even sure that any of these changes of the coding
> > system achieved *anything*, since the buffer's appearance remained
> > unchanged throughout all this mucking around.  I used the command
> > set-buffer-file-coding-system to do this.

This won't do the trick (see below for what will). That function only
tells Emacs: "forget that you loaded this file as Shift-JIS; from now on
it is UTF-8" (for example). It doesn't change what you see in the
buffer, but when you save the file, the text will be written out in the
new coding system (if possible).

> >                                            Should I need to do
> > anything besides re-setting the coding system to see a change in
> > how the file is displayed?

You'll have to use `revert-buffer-with-coding-system' (by default bound
to the key sequence C-x RET r). This will reload the file under the
assumption of the new coding system.
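
To make the difference between the two commands concrete, here is a
small sketch outside of Emacs, in Python (chosen only because it is easy
to run). The sample word and the variable names are my own assumptions,
not taken from your file, and the comparison with the Emacs commands is
an analogy, not a description of how Emacs implements them.

```python
# Bytes as they might sit on disk: the word "nihongo" (three kanji),
# written with escapes to keep this message ASCII, encoded as Shift-JIS.
word = "\u65e5\u672c\u8a9e"
raw = word.encode("shift_jis")

# The first load guessed the wrong coding system. Latin-1 accepts any
# byte, so there is no error, only garbage on screen.
wrong = raw.decode("latin-1")

# Rough analogue of set-buffer-file-coding-system: only the encoding
# used for *saving* changes. The text that was already decoded stays
# exactly as garbled as it was.
saved = wrong.encode("utf-8")
still_wrong = saved.decode("utf-8")

# Rough analogue of revert-buffer-with-coding-system: go back to the
# original bytes and decode them again under the new assumption.
right = raw.decode("shift_jis")

print(still_wrong == wrong)  # relabelling did not repair anything
print(right == word)         # re-reading the bytes did
```

The point is that the repair has to start from the bytes on disk, not
from text that was already decoded the wrong way.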

> > More importantly, is there a better way to determine a file's
> > correct coding system besides trial and error?

Pascal answered this part better than I could :-)

There will always be lots of byte sequences that are valid under several
coding systems (but mean different things). The methods out there to get
a grip on the problem are heuristic, partly based on statistical
properties of the text. If you want to have some fun understanding the
kind of problems involved, have a look at [1]. For an implementation in
Emacs Lisp, see Unicad [2].
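
A minimal illustration of the ambiguity, again in Python: the helper
`plausible_codings' is a hypothetical name of mine, and the six bytes
and the candidate list are made up for the example. It only checks which
candidates decode without error. That is much cruder than the detectors
described in [1] and [2], which go on to rank the survivors
statistically.

```python
def plausible_codings(raw, candidates):
    """Return the candidate codecs under which raw decodes without error."""
    survivors = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this coding system
        survivors.append(name)
    return survivors


# Six bytes; nothing in them says which coding system produced them.
raw = bytes([0x93, 0xFA, 0x96, 0x7B, 0x8C, 0xEA])
candidates = ["utf-8", "shift_jis", "cp1252", "latin-1"]

survivors = plausible_codings(raw, candidates)
print(survivors)
for name in survivors:
    # ascii() keeps the output readable on any terminal
    print(name, ascii(raw.decode(name)))
```

UTF-8 rejects these bytes, but the other three candidates all accept
them and each yields different text, so a validity check alone cannot
tell you which reading was intended.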

- --------
[1] <http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html>
[2] <http://www.emacswiki.org/emacs-en/Unicad>

Regards

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFLXTLXBcgs9XrR2kYRAnR5AJ9Jowgc9pPrCaW0lRe1Tv7xFGya+QCfRXJ8
mLTW2GBvke8OYbVdWiVcrcU=
=gJuQ
-----END PGP SIGNATURE-----