Robert Weiner <rsw@gnu.org> writes:
> This seems incredibly complicated. It would help if you would state
> the general problem you are trying to solve and the performance
> characteristics you need. It certainly is not a generic duplicate
> removal library. Why can't you flatten your list and then just apply
> a sequence of predicate matches as needed or use hashing as mentioned
> in the commentary?
I guess the name is misleading; I'll try to find a better one.
Look at the example of finding files with equal contents in your file
system: you have a list or stream of, say, 10000 files in a file
hierarchy. If you calculate hashes of all of those 10000 files, it will
take hours.
It's wiser to do it in steps: first, look at the sizes of all the
files. That's a very fast test, and files with equal contents have the
same size. You can discard all files with a unique size.
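
To make that first step concrete, here is a minimal Emacs Lisp sketch
(not the library's code; `my-group-by-size' is a name made up for this
mail) that groups a list of file names by size and throws away the
groups of one:

(defun my-group-by-size (files)
  "Return groups of FILES whose members all have the same size."
  (let ((table (make-hash-table :test #'equal))
        (groups '()))
    (dolist (file files)
      ;; `file-attribute-size' only needs a stat, not the contents.
      (push file (gethash (file-attribute-size (file-attributes file))
                          table)))
    (maphash (lambda (_size group)
               (when (cdr group)  ; a unique size can't be a duplicate
                 (push group groups)))
             table)
    groups))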
In a second step, we have fewer files. We could look at the first N
bytes of each file. That's still quite fast. What is left are groups
of files with equal sizes and equal heads. For those it's worth
calculating a hash sum to see which also have equal contents.
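
The later steps are all the same refinement, just with a costlier key
function each time. A sketch of that, again with invented names
(`my-refine', `my-file-head', `my-file-hash') rather than the
library's actual code:

(defun my-refine (groups key-fn)
  "Split each group in GROUPS by KEY-FN; drop groups of one element."
  (let ((result '()))
    (dolist (group groups)
      (let ((table (make-hash-table :test #'equal)))
        (dolist (file group)
          (push file (gethash (funcall key-fn file) table)))
        (maphash (lambda (_key subgroup)
                   (when (cdr subgroup)
                     (push subgroup result)))
                 table)))
    result))

(defun my-file-head (file n)
  "Return the first N bytes of FILE as a unibyte string."
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert-file-contents-literally file nil 0 n)
    (buffer-string)))

(defun my-file-hash (file)
  "Return the SHA-256 hash of the contents of FILE."
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert-file-contents-literally file)
    (secure-hash 'sha256 (current-buffer))))

;; Cheapest test first, most expensive last:
;; (my-refine (my-refine (my-group-by-size files)
;;                       (lambda (f) (my-file-head f 100)))
;;            #'my-file-hash)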
The idea of the library is to abstract over the type of the elements
and the number and kinds of tests: you specify the tests to perform,
and `find-dups' executes the algorithm with the steps as specified.
You just need to specify the tests; you don't need to write out the
code yourself.
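
Schematically (this is simplified, not the exact interface; the
header shows the real one), a call could look like:

;; Each pair names a key function and the equality test for its
;; results; each step only runs on the groups the previous step
;; could not tell apart.
(find-dups files
           (list (cons (lambda (f)
                         (file-attribute-size (file-attributes f)))
                       #'equal)
                 (cons (lambda (f) (my-file-head f 100)) #'equal)
                 (cons #'my-file-hash #'equal)))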
Do you need a mathematical formulation of the abstract problem that the
algorithm solves, and how it works? I had hoped the example in the
header was a good explanation...