Re: [ELPA] New package: find-dups

unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed

From: Robert Weiner <rsw@gnu.org>
To: Michael Heerdegen <michael_heerdegen@web.de>
Cc: Emacs Development <emacs-devel@gnu.org>
Subject: Re: [ELPA] New package: find-dups
Date: Wed, 11 Oct 2017 22:23:00 -0400	[thread overview]
Message-ID: <CA+OMD9ix3oZnUdnPYLfvudUWmEMN3KJw7ho_n7ochc+TM14j6A@mail.gmail.com> (raw)
In-Reply-To: <87bmldefg5.fsf@web.de>

[-- Attachment #1: Type: text/plain, Size: 3587 bytes --]

On Wed, Oct 11, 2017 at 1:56 PM, Michael Heerdegen <michael_heerdegen@web.de
> wrote:

> Robert Weiner <rsw@gnu.org> writes:
>
> > This seems incredibly complicated.  It would help if you would state
> > the general problem you are trying to solve and the performance
> > characteristics you need.  It certainly is not a generic duplicate
> > removal library.  Why can't you flatten your list and then just apply
> > a sequence of predicate matches as needed or use hashing as mentioned
> > in the commentary?
>
> I guess the name is misleading, I'll try to find a better one.
>

Sounds good.  How about filter-set?  You are filtering a bunch of items to
produce a set.  I'm not sure if this is limited to files or more generic.


> 
>
> 
> Look at the example of finding files with equal contents in your file
> 
>
> 
> system: you have a list or stream of, say, 10000 files in a file
> 
>
> 
> hierarchy.  If you calculate hashes of all of those 10000 files,
> 
> it will
> 
> take hours.
> 
>
> 
Ok, so you want to filter down a set of hierarchically arranged files.


> 
>
> 
> It's wiser to do it in steps: first, look at the file's sizes of all
> 
>
> 
> files.  That's a very fast test, and files with equal contents
> 
> have the
> 
> same size.  You can discard all files with unique sizes.
> 
>
> 
Yes, but that is just filtering (get the size of the files and filter to
sets of files with the unique sizes).  Then you chain more filters to
filter further.
   (filter-duplicates list-of-filters-to-apply list-of-files-to-filter)

which would produce a chain of filters like:
   (filterN .... (filter2 (filter1 list-of-files-to-filter))


>
> In a second step, we have less many files.  We could look at the first N
> bytes of the files.  That's still quite fast.


So you apply your fastest and most effective filters first.


> 
>   Left are groups of files
> 
> with equal sizes and equal heads.  For those it's worth of calculating a
> 
> hash sum to see which have also equal contents.
>

Ok.


> 
>
> 
> The idea of the library is to abstract over the type of elements and the
> 
> n
> 
> u
> 
> m
> 
> b
> 
> e
> 
> r
> 
> 
> a
> 
> n
> 
> d
> 
> 
> k
> 
> i
> 
> n
> 
> d
> 
> s
> 
> 
> o
> 
> f
> 
> 
> t
> 
> e
> 
> s
> 
> t
> 
> .



But as the prior message author noted, you don't need lists of lists to do
that.  We want you to simplify things so they are most generally useful
and easier to understand.

>
> and `find-dups' executes the algorithm with the steps as specified.  You
> need just to specify a number of tests but don't need to write out the
> code yourself.
>

I don't quite see what code is not being written except the sequencing of
the filter applications which is your code.


> 
>
> 
> Do you need a mathematical formulation of the abstract problem that the
> 
> algorithm solves, and how it works?  I had hoped the example in the
> 
> header is a good explanation...
>

The example is a good one to use but as was noted is only one use case.
Keep at it and you'll see it will become something much nicer.

Bob

[-- Attachment #2: Type: text/html, Size: 10925 bytes --]

next prev parent reply	other threads:[~2017-10-12  2:23 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-10-11 15:25 [ELPA] New package: find-dups Michael Heerdegen
2017-10-11 17:05 ` Robert Weiner
2017-10-11 17:56   ` Michael Heerdegen
2017-10-11 18:56     ` Eli Zaretskii
2017-10-11 19:25       ` Michael Heerdegen
2017-10-11 23:28     ` Thien-Thi Nguyen
2017-10-12  2:23     ` Robert Weiner [this message]
2017-10-12  8:37 ` Andreas Politz
2017-10-12 12:32   ` Michael Heerdegen
2017-10-12 13:20     ` Nicolas Petton
2017-10-12 18:49       ` Michael Heerdegen
2017-10-13 10:21         ` Michael Heerdegen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CA+OMD9ix3oZnUdnPYLfvudUWmEMN3KJw7ho_n7ochc+TM14j6A@mail.gmail.com \
    --to=rsw@gnu.org \
    --cc=emacs-devel@gnu.org \
    --cc=michael_heerdegen@web.de \
    --cc=rswgnu@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).