From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: Detecting the coding system of a file programmatically
Date: Fri, 10 Aug 2018 10:28:07 +0300
Message-ID: <83ftzmof3s.fsf@gnu.org>
Cc: emacs-devel@gnu.org
To: Andrea Cardaci
In-reply-to: (message from Andrea Cardaci on Fri, 10 Aug 2018 03:02:55 +0200)
List-Id: "Emacs development discussions."

> From: Andrea Cardaci
> Date: Fri, 10 Aug 2018 03:02:55 +0200
>
>     (with-temp-buffer
>       (insert-file-contents-literally path)
>       (decode-coding-region (point-min) (point-max) 'utf-8)
>       (... do stuff with the buffer ...))
>
> I use `insert-file-contents-literally' because the non-literally
> counterpart is too slow (about twice as much apparently) as it does a
> bunch of stuff in addition to simply populate the buffer.
> Unfortunately, one of these things is to decode the buffer.
>
> Now instead of hardcoding 'utf-8 I'd like to detect the correct
> encoding where possible, so I tried experimenting with
> `find-operation-coding-system'.

That's the wrong function to use in this case; you want
decode-coding-inserted-region instead.  Alternatively, you could use
detect-coding-region and then decode-coding-region with the value it
returns.

I suggest a good read of the "Explicit Encoding" and "Lisp and Coding
Systems" nodes of the ELisp manual.

> I created a latin-1 file (which gets
> recognised properly when I visit it) and tried the following:
>
>     (with-temp-buffer
>       (setq path "~/tmp/latin-1")
>       (insert-file-contents-literally path)
>       (find-operation-coding-system
>        'insert-file-contents
>        (cons path (current-buffer))))
>
> But all I get is (undecided).

That's expected: find-operation-coding-system returns the _default_ to
use for the named operation.  It doesn't consider the contents of the
buffer.

> Now my question is twofold: is this the best approach for what I'm
> trying to achieve?
> And in any case, why does the latter example not
> work as expected?  (And hence how can I detect the coding system
> programmatically?)

I hope I answered all of those questions; if not, please ask more.

In any case, it is definitely OK to call decode-coding-region with the
value 'undecided' returned by find-operation-coding-system, because
'undecided' is a special value which signals to decode-coding-region
that detection of the actual encoding is necessary.  Note that
find-operation-coding-system returns a cons cell of the form
(DECODING . ENCODING), which is why you saw (undecided), so pass its
car.  Thus, I expect this to work for you:

    (with-temp-buffer
      (insert-file-contents-literally path)
      (decode-coding-region
       (point-min) (point-max)
       (car (find-operation-coding-system
             'insert-file-contents
             (cons path (current-buffer))))))

But I still recommend using decode-coding-inserted-region, because it
will do all of the above (and slightly more) for you internally.
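
For illustration, the two recommended routes might look roughly like
the sketch below.  The helper names `my/read-file-decoded' and
`my/detect-file-coding' are hypothetical, invented for this example;
only the built-in functions they call come from Emacs itself.  The
first helper lets decode-coding-inserted-region pick the coding
system; the second only reports the guess made by
detect-coding-region.

```elisp
;; Sketch only: both helper names are hypothetical.

(defun my/read-file-decoded (path)
  "Return the contents of PATH as a decoded string.
The bytes are inserted literally, then decoded as though
`insert-file-contents' had read them from PATH."
  (with-temp-buffer
    (insert-file-contents-literally path)
    ;; PATH is passed so that file-name based rules can be consulted
    ;; in addition to the bytes in the region.
    (decode-coding-inserted-region (point-min) (point-max) path)
    (buffer-string)))

(defun my/detect-file-coding (path)
  "Return the coding system Emacs guesses for the bytes in PATH."
  (with-temp-buffer
    (insert-file-contents-literally path)
    ;; A non-nil third argument asks for the single most likely
    ;; candidate instead of a list of candidates.
    (detect-coding-region (point-min) (point-max) t)))
```

The value returned by the second helper can be passed straight to
decode-coding-region.  Detection is a heuristic that depends on the
current coding-system priorities, so a different guess is possible on
a differently configured system.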