unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed
From: Michael Heerdegen <michael_heerdegen@web.de>
To: Robert Weiner <rsw@gnu.org>
Cc: rswgnu@gmail.com, Emacs Development <emacs-devel@gnu.org>
Subject: Re: [ELPA] New package: find-dups
Date: Wed, 11 Oct 2017 19:56:26 +0200	[thread overview]
Message-ID: <87bmldefg5.fsf@web.de> (raw)
In-Reply-To: <CA+OMD9gZN0FUNKiu_fhZPDN0wTt-He5aC3b-C92u13MHbRFwZQ@mail.gmail.com> (Robert Weiner's message of "Wed, 11 Oct 2017 13:05:58 -0400")

Robert Weiner <rsw@gnu.org> writes:

> This seems incredibly complicated.  It would help if you would state
> the general problem you are trying to solve and the performance
> characteristics you need.  It certainly is not a generic duplicate
> removal library.  Why can't you flatten your list and then just apply
> a sequence of predicate matches as needed or use hashing as mentioned
> in the commentary?

I guess the name is misleading, I'll try to find a better one.

Look at the example of finding files with equal contents in your file
system: you have a list or stream of, say, 10000 files in a file
hierarchy.  If you calculate hashes of all of those 10000 files, it will
take hours.

It's wiser to do it in steps: first, look at the file's sizes of all
files.  That's a very fast test, and files with equal contents have the
same size.  You can discard all files with unique sizes.

In a second step, we have less many files.  We could look at the first N
bytes of the files.  That's still quite fast.  Left are groups of files
with equal sizes and equal heads.  For those it's worth of calculating a
hash sum to see which have also equal contents.

The idea of the library is to abstract over the type of elements and the
number and kinds of test.  So you can write the above as 

#+begin_src emacs-lisp
(find-dups my-sequence-of-file-names
           (list (list (lambda (file)
                         (file-attribute-size (file-attributes file)))
                       #'eq)
                 (list (lambda (file)
                         (shell-command-to-string
                          (format "head %s"
                                  (shell-quote-argument file))))
                       #'equal)
                 (list (lambda (file)
                         (shell-command-to-string
                          (format "md5sum %s | awk '{print $1;}'"
                                  (shell-quote-argument file))))
                       #'equal)))
#+end_src

and `find-dups' executes the algorithm with the steps as specified.  You
need just to specify a number of tests but don't need to write out the
code yourself.

Do you need a mathematical formulation of the abstract problem that the
algorithm solves, and how it works?  I had hoped the example in the
header is a good explanation...


Regards,

Michael.



  reply	other threads:[~2017-10-11 17:56 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-10-11 15:25 [ELPA] New package: find-dups Michael Heerdegen
2017-10-11 17:05 ` Robert Weiner
2017-10-11 17:56   ` Michael Heerdegen [this message]
2017-10-11 18:56     ` Eli Zaretskii
2017-10-11 19:25       ` Michael Heerdegen
2017-10-11 23:28     ` Thien-Thi Nguyen
2017-10-12  2:23     ` Robert Weiner
2017-10-12  8:37 ` Andreas Politz
2017-10-12 12:32   ` Michael Heerdegen
2017-10-12 13:20     ` Nicolas Petton
2017-10-12 18:49       ` Michael Heerdegen
2017-10-13 10:21         ` Michael Heerdegen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87bmldefg5.fsf@web.de \
    --to=michael_heerdegen@web.de \
    --cc=emacs-devel@gnu.org \
    --cc=rsw@gnu.org \
    --cc=rswgnu@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).