Robert Weiner <rsw@gnu.org> writes:
> This seems incredibly complicated. It would help if you would state
> the general problem you are trying to solve and the performance
> characteristics you need. It certainly is not a generic duplicate
> removal library. Why can't you flatten your list and then just apply
> a sequence of predicate matches as needed or use hashing as mentioned
> in the commentary?
I guess the name is misleading; I'll try to find a better one.
Look at the example of finding files with equal contents in your file
system: you have a list or stream of, say, 10000 files in a file
hierarchy. If you calculate hashes of all of those 10000 files, it will
take hours.
It's wiser to do it in steps: first, look at the sizes of all the
files. That's a very fast test, and files with equal contents have the
same size. You can discard all files with a unique size.
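
To make that first step concrete, here is a minimal Emacs Lisp sketch
(not the library's code; `my-group-by-size' is a name made up for this
mail) that groups a list of file names by size and throws away the
groups of one:

(defun my-group-by-size (files)
  "Return groups of FILES whose members all have the same size."
  (let ((table (make-hash-table :test #'equal))
        (groups '()))
    (dolist (file files)
      ;; `file-attribute-size' only needs a stat, not the contents.
      (push file (gethash (file-attribute-size (file-attributes file))
                          table)))
    (maphash (lambda (_size group)
               (when (cdr group)  ; a unique size can't be a duplicate
                 (push group groups)))
             table)
    groups))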
In a second step, we have fewer files. We could look at the first N
bytes of each file. That's still quite fast. What is left are groups
of files with equal sizes and equal heads. For those it's worth
calculating a hash sum to see which also have equal contents.
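
The later steps are all the same refinement, just with a costlier key
function each time. A sketch of that, again with invented names
(`my-refine', `my-file-head', `my-file-hash') rather than the
library's actual code:

(defun my-refine (groups key-fn)
  "Split each group in GROUPS by KEY-FN; drop groups of one element."
  (let ((result '()))
    (dolist (group groups)
      (let ((table (make-hash-table :test #'equal)))
        (dolist (file group)
          (push file (gethash (funcall key-fn file) table)))
        (maphash (lambda (_key subgroup)
                   (when (cdr subgroup)
                     (push subgroup result)))
                 table)))
    result))

(defun my-file-head (file n)
  "Return the first N bytes of FILE as a unibyte string."
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert-file-contents-literally file nil 0 n)
    (buffer-string)))

(defun my-file-hash (file)
  "Return the SHA-256 hash of the contents of FILE."
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert-file-contents-literally file)
    (secure-hash 'sha256 (current-buffer))))

;; Cheapest test first, most expensive last:
;; (my-refine (my-refine (my-group-by-size files)
;;                       (lambda (f) (my-file-head f 100)))
;;            #'my-file-hash)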
The idea of the library is to abstract over the type of the elements
and the number and kinds of tests: you specify the tests to perform,
and `find-dups' executes the algorithm with the steps as specified.
You just need to specify the tests; you don't need to write out the
code yourself.
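
Schematically (this is simplified, not the exact interface; the
header shows the real one), a call could look like:

;; Each pair names a key function and the equality test for its
;; results; each step only runs on the groups the previous step
;; could not tell apart.
(find-dups files
           (list (cons (lambda (f)
                         (file-attribute-size (file-attributes f)))
                       #'equal)
                 (cons (lambda (f) (my-file-head f 100)) #'equal)
                 (cons #'my-file-hash #'equal)))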
Do you need a mathematical formulation of the abstract problem that the
algorithm solves, and how it works? I had hoped the example in the
header was a good explanation...