unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed
From: Robert Weiner <rsw@gnu.org>
To: Michael Heerdegen <michael_heerdegen@web.de>
Cc: Emacs Development <emacs-devel@gnu.org>
Subject: Re: [ELPA] New package: find-dups
Date: Wed, 11 Oct 2017 22:23:00 -0400	[thread overview]
Message-ID: <CA+OMD9ix3oZnUdnPYLfvudUWmEMN3KJw7ho_n7ochc+TM14j6A@mail.gmail.com> (raw)
In-Reply-To: <87bmldefg5.fsf@web.de>

[-- Attachment #1: Type: text/plain, Size: 3587 bytes --]

On Wed, Oct 11, 2017 at 1:56 PM, Michael Heerdegen <michael_heerdegen@web.de
> wrote:

> Robert Weiner <rsw@gnu.org> writes:
>
> > This seems incredibly complicated.  It would help if you would state
> > the general problem you are trying to solve and the performance
> > characteristics you need.  It certainly is not a generic duplicate
> > removal library.  Why can't you flatten your list and then just apply
> > a sequence of predicate matches as needed or use hashing as mentioned
> > in the commentary?
>
> I guess the name is misleading, I'll try to find a better one.
>

​Sounds good.  How about filter-set?  You are filtering a bunch of items to
produce a set.  I'm not sure if this is limited to files or more generic.
​

> ​​
>
> ​​
> Look at the example of finding files with equal contents in your file
> ​​
>
> ​​
> system: you have a list or stream of, say, 10000 files in a file
> ​​
>
> ​​
> hierarchy.  If you calculate hashes of all of those 10000 files,
> ​​
> it will
> ​​
> take hours.
> ​​
>
> ​​
​Ok, so you want to filter down a set of hierarchically arranged files.
​

> ​​
>
> ​​
> It's wiser to do it in steps: first, look at the file's sizes of all
> ​​
>
> ​​
> files.  That's a very fast test, and files with equal contents
> ​​
> have the
> ​​
> same size.  You can discard all files with unique sizes.
> ​​
>
> ​​
​​Yes, but that is just filtering (get the size of the files and filter to
sets of files with the unique sizes).  Then you chain more filters to
filter further.
   (filter-duplicates list-of-filters-to-apply list-of-files-to-filter)

which would produce a chain of filters like:
   (filterN .... (filter2 (filter1 list-of-files-to-filter))
​

>
> In a second step, we have less many files.  We could look at the first N
> bytes of the files.  That's still quite fast.


​So you apply your fastest and most effective filters first.
​

> ​​
>   Left are groups of files
> ​​
> with equal sizes and equal heads.  For those it's worth of calculating a
> ​​
> hash sum to see which have also equal contents.
>

​Ok.
​

> ​​
>
> ​​
> The idea of the library is to abstract over the type of elements and the
> ​​
> n
> ​​
> u
> ​​
> m
> ​​
> b
> ​​
> e
> ​​
> r
> ​​
> ​​
> a
> ​​
> n
> ​​
> d
> ​​
> ​​
> k
> ​​
> i
> ​​
> n
> ​​
> d
> ​​
> s
> ​​
> ​​
> o
> ​​
> f
> ​​
> ​​
> t
> ​​
> e
> ​​
> s
> ​​
> t
> ​​
> .

​​

​But as the prior message author noted, you don't need lists of lists to do
that.  We want you to simplify things so they are most​ generally useful
and easier to understand.

>
> and `find-dups' executes the algorithm with the steps as specified.  You
> need just to specify a number of tests but don't need to write out the
> code yourself.
>

​I don't quite see what code is not being written except the sequencing of
the filter applications which is your code.
​

> ​​
>
> ​​
> Do you need a mathematical formulation of the abstract problem that the
> ​​
> algorithm solves, and how it works?  I had hoped the example in the
> ​​
> header is a good explanation...
>

​The example is a good one to use but as was noted is only one use case.
Keep at it and you'll see it will become something much nicer.

Bob​

[-- Attachment #2: Type: text/html, Size: 10925 bytes --]

  parent reply	other threads:[~2017-10-12  2:23 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-10-11 15:25 [ELPA] New package: find-dups Michael Heerdegen
2017-10-11 17:05 ` Robert Weiner
2017-10-11 17:56   ` Michael Heerdegen
2017-10-11 18:56     ` Eli Zaretskii
2017-10-11 19:25       ` Michael Heerdegen
2017-10-11 23:28     ` Thien-Thi Nguyen
2017-10-12  2:23     ` Robert Weiner [this message]
2017-10-12  8:37 ` Andreas Politz
2017-10-12 12:32   ` Michael Heerdegen
2017-10-12 13:20     ` Nicolas Petton
2017-10-12 18:49       ` Michael Heerdegen
2017-10-13 10:21         ` Michael Heerdegen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CA+OMD9ix3oZnUdnPYLfvudUWmEMN3KJw7ho_n7ochc+TM14j6A@mail.gmail.com \
    --to=rsw@gnu.org \
    --cc=emacs-devel@gnu.org \
    --cc=michael_heerdegen@web.de \
    --cc=rswgnu@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).