From: Eli Zaretskii <eliz@gnu.org>
To: Michael Heerdegen <michael_heerdegen@web.de>
Cc: rswgnu@gmail.com, emacs-devel@gnu.org, rsw@gnu.org
Subject: Re: [ELPA] New package: find-dups
Date: Wed, 11 Oct 2017 21:56:43 +0300
Message-ID: <83fuapo6ms.fsf@gnu.org>
In-Reply-To: <87bmldefg5.fsf@web.de> (message from Michael Heerdegen on Wed, 11 Oct 2017 19:56:26 +0200)
> From: Michael Heerdegen <michael_heerdegen@web.de>
> Date: Wed, 11 Oct 2017 19:56:26 +0200
> Cc: rswgnu@gmail.com, Emacs Development <emacs-devel@gnu.org>
>
> #+begin_src emacs-lisp
> (find-dups my-sequence-of-file-names
>            (list (list (lambda (file)
>                          (file-attribute-size (file-attributes file)))
>                        #'eq)
>                  (list (lambda (file)
>                          (shell-command-to-string
>                           (format "head %s"
>                                   (shell-quote-argument file))))
>                        #'equal)
>                  (list (lambda (file)
>                          (shell-command-to-string
>                           (format "md5sum %s | awk '{print $1;}'"
>                                   (shell-quote-argument file))))
>                        #'equal)))
> #+end_src

Apologies for barging into the middle of a discussion, but starting
processes and making strings out of their output, just to process a
portion of a file, is sub-optimal, because process creation is not
cheap.  It is easier to simply read a predefined number of bytes into
a buffer; insert-file-contents-literally supports that via its BEG and
END arguments.  Likewise with md5sum: we have the md5 primitive for that.
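
A rough sketch of what that could look like follows.  The helper names
my-file-head-bytes and my-file-md5, and the 1024-byte default, are
illustrative assumptions and not part of find-dups; only
insert-file-contents-literally and md5 are existing Emacs functions.

#+begin_src emacs-lisp
;; Hypothetical helpers; names and the 1024-byte default are assumptions.

(defun my-file-head-bytes (file &optional nbytes)
  "Return the first NBYTES (default 1024) raw bytes of FILE.
The result is a unibyte string; no decoding is performed."
  (with-temp-buffer
    (set-buffer-multibyte nil)
    ;; BEG = 0 and END = NBYTES: read only the start of the file.
    ;; A file shorter than NBYTES is simply read in full.
    (insert-file-contents-literally file nil 0 (or nbytes 1024))
    (buffer-string)))

(defun my-file-md5 (file)
  "Return the MD5 checksum of the raw bytes of FILE as a hex string."
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert-file-contents-literally file)
    ;; Hash the buffer directly instead of making a string from it first.
    (md5 (current-buffer))))

;; Possible use in the stage list quoted above (not evaluated here,
;; since it needs the find-dups package):
;;   (list (list #'my-file-head-bytes #'equal)
;;         (list #'my-file-md5 #'equal))
#+end_src

Neither helper starts a subprocess, and the file contents are never
decoded on their way into the buffer.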

In general, working with buffers is much more efficient in Emacs than
working with strings, so avoid strings, let alone large strings, as
much as you can.

One other comment is that shell-command-to-string decodes the output
from the shell command, which is not something you want here, because
AFAIU you are looking for files whose contents are identical on the
byte-stream level.  That is, 2 files which have the same characters, but
are encoded differently on disk (like one UTF-8, the other Latin-1),
should be considered different in this context, whereas
shell-command-to-string will/might produce identical strings for them.
(Decoding is also expensive run-time wise.)
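
To illustrate the decoding point, here is a small sketch that stands in
for two files by encoding the same text in two ways.  The function name
my-encoding-demo and the sample text are assumptions made up for this
example.

#+begin_src emacs-lisp
;; Hypothetical demo; the function name and sample text are assumptions.

(defun my-encoding-demo ()
  "Compare one text encoded as UTF-8 and as Latin-1.
Return a list (BYTES-EQUAL DECODED-EQUAL MD5-EQUAL)."
  (let* ((text "caf\u00e9")
         ;; What the two files would contain on disk.
         (utf8-bytes (encode-coding-string text 'utf-8))
         (latin1-bytes (encode-coding-string text 'iso-latin-1)))
    (list
     ;; Raw bytes differ: 5 bytes in UTF-8 versus 4 in Latin-1.
     (equal utf8-bytes latin1-bytes)
     ;; After decoding, both become the same string of characters.
     (equal (decode-coding-string utf8-bytes 'utf-8)
            (decode-coding-string latin1-bytes 'iso-latin-1))
     ;; Checksums of the raw bytes differ as well.
     (equal (md5 utf8-bytes) (md5 latin1-bytes)))))

;; (my-encoding-demo) => (nil t nil)
#+end_src

Comparing decoded text would report these two as duplicates, while a
byte-level comparison keeps them apart.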