From mboxrd@z Thu Jan 1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: tomas@tuxteam.de
Newsgroups: gmane.emacs.help
Subject: Re: How to determine encoding for file?
Date: Mon, 25 Jan 2010 06:57:43 +0100
Message-ID: <20100125055743.GB26580@tomas>
References: <87y6jn2mgd.fsf@hubble.lan.informatimago.com>
In-Reply-To: <87y6jn2mgd.fsf@hubble.lan.informatimago.com>
To: help-gnu-emacs@gnu.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii; x-action=pgp-signed
Content-Disposition: inline
User-Agent: Mutt/1.5.15+20070412 (2007-04-11)
List-Id: Users list for the GNU Emacs text editor
Xref: news.gmane.org gmane.emacs.help:71431

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sun, Jan 24, 2010 at 10:59:46PM +0100, Pascal J. Bourguignon wrote:
> kj writes:
>
> > I've downloaded a large file that is supposed to contain a mixture
> > of Japanese and English (it's basically a learner's dictionary).
> > The English is displayed correctly, but not so for the Japanese.
> >
> > I've tried setting the buffer's coding system to utf-8,
> > japanese-shift-jis, japanese-shift-jis-mac, japanese-shift-jis-dos
> > (just guessing). None worked.
> >
> > In fact, I'm not even sure that any of these changes of the coding
> > system achieved *anything*, since the buffer's appearance remained
> > unchanged throughout all this mucking around. I used the command
> > set-buffer-file-coding-system to do this.

This won't do the trick (see below for what will). This function just
says: "forget that you loaded this file as Shift-JIS; from now on it
is UTF-8" (for example). So it changes nothing in the buffer, but when
you save the file, it will be converted to the new coding system (if
possible).

> > Should I need to do
> > anything besides re-setting the coding system to see a change in
> > how the file is displayed?

You'll have to use `revert-buffer-with-coding-system' (by default
mapped to the key sequence C-x RET r).
This reloads the file, decoding it on the assumption that it is in the
new coding system.

> > More importantly, is there a better way to determine a file's
> > correct coding system besides trial and error?

Pascal answered this part better than I could :-)

There will always be lots of byte sequences that are valid under
several coding systems (but mean different things). The methods out
there to get a grip on the problem are heuristic, partly based on
statistical properties of the text. If you want to have some fun
understanding the kind of problems involved, have a look at [1]. For
an implementation in Emacs Lisp, see Unicad [2].

- --------
[1]
[2]

Regards
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFLXTLXBcgs9XrR2kYRAnR5AJ9Jowgc9pPrCaW0lRe1Tv7xFGya+QCfRXJ8
mLTW2GBvke8OYbVdWiVcrcU=
=gJuQ
-----END PGP SIGNATURE-----
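
P.S. To make the two points above concrete, here is a rough sketch in
Python rather than Emacs Lisp. It is only an illustration: the sample
word, the candidate list and the helper names are my own assumptions,
and it is not how Emacs or Unicad detect anything. It shows (a) that
the same bytes can be valid under several coding systems while meaning
different things, and (b) the difference between re-decoding the bytes
(roughly what reverting with another coding system does) and
re-encoding the text (roughly what changing the coding system for
saving does).

```python
# Illustration only: the sample word, the candidate list and the helper
# names are assumptions made for this sketch, not taken from the thread.

CANDIDATES = ("utf-8", "shift_jis", "euc_jp", "latin-1")


def decodings(raw, candidates=CANDIDATES):
    """Return {codec: text} for every candidate codec that accepts `raw`."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are invalid under this codec
    return results


def reinterpret(raw, codec):
    """Same bytes, different assumption: comparable to reloading the file."""
    return raw.decode(codec)


def transcode(text, codec):
    """Same text, different bytes: comparable to saving in a new coding system."""
    return text.encode(codec)


if __name__ == "__main__":
    word = "ことば"                  # hypothetical file content
    on_disk = word.encode("euc_jp")  # pretend the file was written as EUC-JP

    # Several codecs accept the bytes, but only one gives the intended text.
    for codec, text in decodings(on_disk).items():
        print(f"{codec:10} {text!r}")

    # Re-decoding fixes what you see; the bytes on disk stay the same.
    print(reinterpret(on_disk, "euc_jp"))

    # Re-encoding changes the bytes; the text you see stays the same.
    print(transcode(word, "utf-8") != on_disk)
```

Trial decoding like this only rules out the coding systems under which
the bytes are invalid. Picking among the ones that remain is where the
statistical heuristics come in.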