Detecting the coding system of a file programmatically

unofficial mirror of emacs-devel@gnu.org 

* Detecting the coding system of a file programmatically
@ 2018-08-10  1:02 Andrea Cardaci
  2018-08-10  7:28 ` Eli Zaretskii
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Andrea Cardaci @ 2018-08-10  1:02 UTC
  To: Emacs developers

Hi,

I need to batch process (read-only) a number of files with Emacs. My
current approach is the following:

(with-temp-buffer
  (insert-file-contents-literally path)
  (decode-coding-region (point-min) (point-max) 'utf-8)
  (... do stuff with the buffer ...))

I use `insert-file-contents-literally' because the non-literal
counterpart is too slow (apparently about twice as slow), as it does a
bunch of things in addition to simply populating the buffer.
Unfortunately, one of these things is decoding the buffer.

Now instead of hardcoding 'utf-8 I'd like to detect the correct
encoding where possible, so I tried experimenting with
`find-operation-coding-system'. I created a latin-1 file (which gets
recognised properly when I visit it) and tried the following:

(with-temp-buffer
  (setq path "~/tmp/latin-1")
  (insert-file-contents-literally path)
  (find-operation-coding-system
   'insert-file-contents
   (cons path (current-buffer))))

But all I get is (undecided). Now my question is twofold: is this the
best approach for what I'm trying to achieve? And in any case, why
does the latter example not work as expected? (And hence, how can I
detect the coding system programmatically?)

Best,

Andrea


* Re: Detecting the coding system of a file programmatically
  2018-08-10  1:02 Detecting the coding system of a file programmatically Andrea Cardaci
@ 2018-08-10  7:28 ` Eli Zaretskii
  2018-08-10 13:37   ` Andrea Cardaci
  2018-08-10 15:26 ` Stefan Monnier
  2018-08-16 21:27 ` Juri Linkov
  2 siblings, 1 reply; 8+ messages in thread
From: Eli Zaretskii @ 2018-08-10  7:28 UTC
  To: Andrea Cardaci; +Cc: emacs-devel

> From: Andrea Cardaci <cyrus.and@gmail.com>
> Date: Fri, 10 Aug 2018 03:02:55 +0200
> 
> (with-temp-buffer
>   (insert-file-contents-literally path)
>   (decode-coding-region (point-min) (point-max) 'utf-8)
>   (... do stuff with the buffer ...))
> 
> I use `insert-file-contents-literally' because the non-literally
> counterpart is too slow (about twice as much apparently) as it does a
> bunch of stuff in addition to simply populate the buffer.
> Unfortunately, one of these things is to decode the buffer.
> 
> Now instead of hardcoding 'utf-8 I'd like to detect the correct
> encoding where possible, so I tried experimenting with
> `find-operation-coding-system'.

That's the wrong function to use in this case; you want
decode-coding-inserted-region instead.  Alternatively, you could use
detect-coding-region and then decode-coding-region with the value it
returns.  I suggest a good read of the "Explicit Encoding" and "Lisp
and Coding Systems" nodes of the ELisp manual.
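
A minimal sketch of the second option (detect, then decode), assuming
the file is readable; the function name `my-file-string-detected' is
made up for illustration and is not part of Emacs:

```elisp
;; Hypothetical helper: the name is an assumption, not from this thread.
(defun my-file-string-detected (path)
  "Return the contents of PATH, decoded with a coding system guessed from its bytes."
  (with-temp-buffer
    ;; Insert the raw bytes without any decoding.
    (insert-file-contents-literally path)
    ;; A non-nil third argument makes `detect-coding-region' return only
    ;; the most likely coding system instead of a list of candidates.
    (let ((coding (detect-coding-region (point-min) (point-max) t)))
      (decode-coding-region (point-min) (point-max) coding)
      (buffer-string))))
```

Note that the guess depends only on the bytes and on the coding-system
priorities of the running Emacs, so the same file may be decoded
differently under another locale.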

> I created a latin-1 file (which gets
> recognised properly when I visit it) and tried the following:
> 
> (with-temp-buffer
>   (setq path "~/tmp/latin-1")
>   (insert-file-contents-literally path)
>   (find-operation-coding-system
>    'insert-file-contents
>    (cons path (current-buffer))))
> 
> But all I get is (undecided).

That's expected: find-operation-coding-system returns the _default_ to
use for the named operation.  It doesn't consider the contents of the
buffer.

> Now my question is twofold: is this the best approach for what I'm
> trying to achieve? And in any case, why does the latter example does
> not work as expected? (And hence how I can detect the coding system
> programmatically?)

I hope I answered all of those questions; if not, please ask more.

In any case, it is definitely OK to call decode-coding-region with the
value 'undecided' returned by find-operation-coding-system, because
'undecided' is a special value which signals to decode-coding-region
that detection of the actual encoding is necessary.  Thus, I expect
this to work for you:

  (with-temp-buffer
    (insert-file-contents-literally path)
    (decode-coding-region (point-min) (point-max)
                          (find-operation-coding-system
			    'insert-file-contents
			    (cons path (current-buffer)))))

But I still recommend using decode-coding-inserted-region, because it
will do all of the above (and slightly more) for you internally.
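
A minimal sketch of that recommendation, assuming the file is readable;
the function name `my-file-string-as-inserted' is hypothetical:

```elisp
;; Hypothetical helper: the name is an assumption, not from this thread.
(defun my-file-string-as-inserted (path)
  "Return the contents of PATH, decoded as if it had been read from that file."
  (with-temp-buffer
    ;; Insert the raw bytes without any decoding.
    (insert-file-contents-literally path)
    ;; Decode the inserted bytes, letting Emacs choose the coding system
    ;; from the file name and the contents.
    (decode-coding-inserted-region (point-min) (point-max) path)
    (buffer-string)))
```

Passing the file name is what lets the file-name-based defaults take
part in the choice, which a bare `decode-coding-region' call cannot do.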


* Re: Detecting the coding system of a file programmatically
  2018-08-10  7:28 ` Eli Zaretskii
@ 2018-08-10 13:37   ` Andrea Cardaci
  2018-08-10 13:54     ` Eli Zaretskii
  0 siblings, 1 reply; 8+ messages in thread
From: Andrea Cardaci @ 2018-08-10 13:37 UTC
  To: Eli Zaretskii; +Cc: Emacs developers

Hi Eli,

Thanks for the thorough reply.

> That's the wrong function to use in this case; you want
> decode-coding-inserted-region instead.

Yes, that works!

> Thus, I expect this to work for you:
>
>   (with-temp-buffer
>     (insert-file-contents-literally path)
>     (decode-coding-region (point-min) (point-max)
>                           (find-operation-coding-system
>                             'insert-file-contents
>                             (cons path (current-buffer)))))

Yes, except that `decode-coding-region' accepts a single symbol, whereas
`find-operation-coding-system' returns a cons such as (undecided). I also
tried directly with:

(decode-coding-region (point-min) (point-max) 'undecided)

which in my use case resulted in snappier performance. Basically, this
latter `decode-coding-region' call doesn't noticeably slow down
`insert-file-contents-literally', whereas using
`decode-coding-inserted-region' is more or less as slow as using
`insert-file-contents' alone. I guess I'll go with
`decode-coding-region' and 'undecided.
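
For reference, a sketch of how the quoted snippet can be adapted so
that `decode-coding-region' receives a single symbol: take the car of
the returned cons. The function name `my-file-string-default-coding' is
hypothetical, and falling back to 'undecided when nothing matches is an
assumption of this sketch:

```elisp
;; Hypothetical helper: the name is an assumption, not from this thread.
(defun my-file-string-default-coding (path)
  "Return the contents of PATH, decoded with the default coding system for reading it."
  (with-temp-buffer
    (insert-file-contents-literally path)
    ;; `find-operation-coding-system' returns (DECODING . ENCODING) or nil;
    ;; only the decoding part is wanted here.
    (let ((coding (car (find-operation-coding-system
                        'insert-file-contents
                        (cons path (current-buffer))))))
      ;; Assumption: fall back to detection when there is no default.
      (decode-coding-region (point-min) (point-max) (or coding 'undecided))
      (buffer-string))))
```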

Andrea


* Re: Detecting the coding system of a file programmatically
  2018-08-10 13:37   ` Andrea Cardaci
@ 2018-08-10 13:54     ` Eli Zaretskii
  2018-08-10 14:47       ` Andrea Cardaci
  0 siblings, 1 reply; 8+ messages in thread
From: Eli Zaretskii @ 2018-08-10 13:54 UTC
  To: Andrea Cardaci; +Cc: emacs-devel

> From: Andrea Cardaci <cyrus.and@gmail.com>
> Date: Fri, 10 Aug 2018 15:37:08 +0200
> Cc: Emacs developers <emacs-devel@gnu.org>
> 
> (decode-coding-region (point-min) (point-max) 'undecided)
> 
> which in my use case it resulted in a more snappy performance.
> Basically this latter `decode-coding-region' doesn't introduce a
> noticeable slowing to the `insert-file-contents-literally', instead
> using `decode-coding-inserted-region' is more or less as slow as using
> `insert-file-contents' alone. I guess I'll go with the former.

Suit yourself, but you need to be aware that while speeding up the
code, you lose some features, which may or may not be important, such
as setting the default coding-system based on the file's name.  If
this code ever needs to handle a file whose contents fool the Emacs
guesswork (which is based on a small part of the buffer contents),
your shortcut might misfire.  E.g., UTF-8 encoded files sometimes dupe
Emacs into thinking they are encoded in some Windows codepage, if that
codepage is the default encoding under the user's locale, so
processing XML or LaTeX files might use a wrong encoding.


* Re: Detecting the coding system of a file programmatically
  2018-08-10 13:54     ` Eli Zaretskii
@ 2018-08-10 14:47       ` Andrea Cardaci
  0 siblings, 0 replies; 8+ messages in thread
From: Andrea Cardaci @ 2018-08-10 14:47 UTC
  To: Eli Zaretskii; +Cc: Emacs developers

> Suit yourself, but you need to be aware that while speeding up the
> code, you lose some features, which may or may not be important, such
> as setting the default coding-system based on the file's name.

I'll take that into consideration, thanks.




* Re: Detecting the coding system of a file programmatically
  2018-08-10  1:02 Detecting the coding system of a file programmatically Andrea Cardaci
  2018-08-10  7:28 ` Eli Zaretskii
@ 2018-08-10 15:26 ` Stefan Monnier
  2018-08-16 21:27 ` Juri Linkov
  2 siblings, 0 replies; 8+ messages in thread
From: Stefan Monnier @ 2018-08-10 15:26 UTC
  To: emacs-devel

> I use `insert-file-contents-literally' because the non-literally
> counterpart is too slow (about twice as much apparently) as it does a

I'd be interested to hear if your final code is significantly faster
than `insert-file-contents'.


        Stefan





* Re: Detecting the coding system of a file programmatically
  2018-08-10  1:02 Detecting the coding system of a file programmatically Andrea Cardaci
  2018-08-10  7:28 ` Eli Zaretskii
  2018-08-10 15:26 ` Stefan Monnier
@ 2018-08-16 21:27 ` Juri Linkov
  2018-08-17 11:33   ` Andrea Cardaci
  2 siblings, 1 reply; 8+ messages in thread
From: Juri Linkov @ 2018-08-16 21:27 UTC
  To: Andrea Cardaci; +Cc: Emacs developers

[-- Attachment #1: Type: text/plain, Size: 486 bytes --]

> I use `insert-file-contents-literally' because the non-literally
> counterpart is too slow (about twice as much apparently) as it does a
> bunch of stuff in addition to simply populate the buffer.
> Unfortunately, one of these things is to decode the buffer.

For better performance, I restrict how much of the file is inserted by
passing small values for the BEG and END arguments of
`insert-file-contents'.  For example, to automatically detect the
encodings of files for diff, I use this customization:


[-- Attachment #2: dired-diff.el --]
[-- Type: application/emacs-lisp, Size: 984 bytes --]
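
The attachment is not reproduced here.  Independently of it, the idea of
limiting BEG and END can be sketched as follows; the function name
`my-guess-file-coding' and the 4096-byte default are assumptions made
for illustration, not taken from dired-diff.el:

```elisp
;; Hypothetical helper: the name and the default size are assumptions.
(defun my-guess-file-coding (file &optional max-bytes)
  "Guess the coding system of FILE from its first MAX-BYTES bytes (default 4096)."
  (with-temp-buffer
    ;; Read and decode only a prefix of the file.
    (insert-file-contents file nil 0 (or max-bytes 4096))
    ;; I/O operations record the coding system they used in this variable.
    last-coding-system-used))
```

The prefix may end in the middle of a multibyte sequence, and non-ASCII
text that first appears later in the file is never seen, so the result
is only a guess.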


* Re: Detecting the coding system of a file programmatically
  2018-08-16 21:27 ` Juri Linkov
@ 2018-08-17 11:33   ` Andrea Cardaci
  0 siblings, 0 replies; 8+ messages in thread
From: Andrea Cardaci @ 2018-08-17 11:33 UTC
  To: juri; +Cc: Emacs developers

Hi Juri,

But this way the extra operations executed by `insert-file-contents'
are performed anyway, albeit on a smaller portion of the file. I'm not
sure whether the slow part (decoding excluded) is proportional to the
size of the input file.

I'll keep that in mind as an alternative solution, thanks.

I ended up using:

(with-temp-buffer
  (insert-file-contents-literally path)
  (decode-coding-region (point-min) (point-max) 'undecided)
  (... do stuff with the buffer ...))

But take a look at the gotchas mentioned by Eli.

On Thu, 16 Aug 2018 at 23:40, Juri Linkov <juri@linkov.net> wrote:
>
> > I use `insert-file-contents-literally' because the non-literally
> > counterpart is too slow (about twice as much apparently) as it does a
> > bunch of stuff in addition to simply populate the buffer.
> > Unfortunately, one of these things is to decode the buffer.
>
> For better performance I restrict the size of inserted file by giving to
> `insert-file-contents' a small value of args BEG and END.  For example,
> to automatically detect encodings of files for diff I use such customization:
>
