On Wed, Oct 11, 2017 at 1:56 PM, Michael Heerdegen wrote:

> Robert Weiner writes:
>
> > This seems incredibly complicated. It would help if you would state
> > the general problem you are trying to solve and the performance
> > characteristics you need. It certainly is not a generic duplicate
> > removal library. Why can't you flatten your list and then just apply
> > a sequence of predicate matches as needed, or use hashing as mentioned
> > in the commentary?
>
> I guess the name is misleading; I'll try to find a better one.

Sounds good. How about filter-set? You are filtering a bunch of items to produce a set. I'm not sure whether this is limited to files or is more generic.

> Look at the example of finding files with equal contents in your file
> system: you have a list or stream of, say, 10000 files in a file
> hierarchy. If you calculate hashes of all of those 10000 files, it
> will take hours.

Ok, so you want to filter down a set of hierarchically arranged files.

> It's wiser to do it in steps: first, look at the sizes of all the
> files. That's a very fast test, and files with equal contents have the
> same size. You can discard all files with unique sizes.

Yes, but that is just filtering (get the sizes of the files and filter down to the sets of files that share a size). Then you chain more filters to filter further:

  (filter-duplicates list-of-filters-to-apply list-of-files-to-filter)

which would produce a chain of filters like:

  (filterN ... (filter2 (filter1 list-of-files-to-filter)))

> In a second step, we have fewer files. We could look at the first N
> bytes of the files. That's still quite fast.

So you apply your fastest and most effective filters first.

> Left are groups of files with equal sizes and equal heads. For those
> it's worth calculating a hash sum to see which also have equal
> contents.

Ok.
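[The three-step pipeline described above — group by size, then by the first N bytes, then by a full content hash, discarding singleton groups at every stage — can be sketched as follows. This is an illustrative Python sketch, not the Elisp library under discussion; the names `group_by` and `find_duplicate_files` are made up for this example.]

```python
import hashlib
import os
from collections import defaultdict

def group_by(files, key):
    """Group files by key(file) and drop groups of size 1:
    a file whose key value is unique cannot have a duplicate."""
    groups = defaultdict(list)
    for f in files:
        groups[key(f)].append(f)
    return [g for g in groups.values() if len(g) > 1]

def find_duplicate_files(paths, head_bytes=1024):
    """Return groups of paths with identical contents, applying the
    cheapest test first: size, then the first head_bytes, then a full hash."""
    def head(p):
        with open(p, 'rb') as fh:
            return fh.read(head_bytes)

    def full_hash(p):
        h = hashlib.sha256()
        with open(p, 'rb') as fh:
            for chunk in iter(lambda: fh.read(65536), b''):
                h.update(chunk)
        return h.digest()

    candidate_groups = [list(paths)]
    for key in (os.path.getsize, head, full_hash):
        candidate_groups = [sub for g in candidate_groups
                            for sub in group_by(g, key)]
    return candidate_groups
```

Each pass only ever shrinks the candidate set, so the expensive hash runs over the few files that already agree on size and head bytes.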
> The idea of the library is to abstract over the type of elements and
> the number and kinds of tests.

But as the prior message author noted, you don't need lists of lists to do that. We want you to simplify things so they are most generally useful and easier to understand.

> And `find-dups' executes the algorithm with the steps as specified.
> You just need to specify a number of tests but don't need to write out
> the code yourself.

I don't quite see what code is not being written, except the sequencing of the filter applications, which is your code.

> Do you need a mathematical formulation of the abstract problem that
> the algorithm solves, and how it works? I had hoped the example in the
> header is a good explanation...

The example is a good one to use, but as was noted it is only one use case. Keep at it and you'll see it will become something much nicer.

Bob
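[The generic shape being discussed — abstract over the element type and over the number and kinds of tests by taking a list of key functions and successively partitioning, discarding singletons after each pass — can be sketched like this. Python is used purely for illustration; `chain_filters` and its signature are hypothetical, not the library's actual API.]

```python
from collections import defaultdict

def chain_filters(key_fns, items):
    """Successively partition items by each key function in key_fns,
    discarding any group with a single member after every pass.
    Returns the groups that survive all tests."""
    groups = [list(items)]
    for key in key_fns:
        next_groups = []
        for group in groups:
            buckets = defaultdict(list)
            for item in group:
                buckets[key(item)].append(item)
            next_groups.extend(g for g in buckets.values() if len(g) > 1)
        groups = next_groups
    return groups
```

For example, filtering strings first by length and then by first character keeps only the strings that agree on both tests; the same driver works for files with size/head/hash tests, which is the abstraction the library name is trying to capture.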