On Wed, Oct 11, 2017 at 1:56 PM, Michael Heerdegen wrote:

> Robert Weiner writes:
>
> > This seems incredibly complicated. It would help if you would state
> > the general problem you are trying to solve and the performance
> > characteristics you need. It certainly is not a generic duplicate
> > removal library. Why can't you flatten your list and then just apply
> > a sequence of predicate matches as needed, or use hashing as mentioned
> > in the commentary?
>
> I guess the name is misleading; I'll try to find a better one.

Sounds good. How about filter-set? You are filtering a bunch of items to produce a set. I'm not sure whether this is limited to files or is more generic.

> Look at the example of finding files with equal contents in your file
> system: you have a list or stream of, say, 10000 files in a file
> hierarchy. If you calculate hashes of all of those 10000 files, it
> will take hours.

Ok, so you want to filter down a set of hierarchically arranged files.

> It's wiser to do it in steps: first, look at the sizes of all the
> files. That's a very fast test, and files with equal contents have the
> same size. You can discard all files with unique sizes.

Yes, but that is just filtering (get the sizes of the files and filter down to the sets of files that share a size). Then you chain more filters to filter further:

  (filter-duplicates list-of-filters-to-apply list-of-files-to-filter)

which would produce a chain of filters like:

  (filterN ... (filter2 (filter1 list-of-files-to-filter)))

> In a second step, we have fewer files. We could look at the first N
> bytes of the files. That's still quite fast.

So you apply your fastest and most effective filters first.

> Left are groups of files with equal sizes and equal heads. For those
> it's worth calculating a hash sum to see which also have equal
> contents.

Ok.
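[The three-step pipeline described above — group by size, then by the first N bytes, then by a full content hash, discarding singleton groups at every stage — can be sketched as follows. This is an illustrative Python sketch, not the Elisp library under discussion; the names `group_by` and `find_duplicate_files` are made up for this example.]

```python
import hashlib
import os
from collections import defaultdict

def group_by(files, key):
    """Group files by key(file) and drop groups of size 1:
    a file whose key value is unique cannot have a duplicate."""
    groups = defaultdict(list)
    for f in files:
        groups[key(f)].append(f)
    return [g for g in groups.values() if len(g) > 1]

def find_duplicate_files(paths, head_bytes=1024):
    """Return groups of paths with identical contents, applying the
    cheapest test first: size, then the first head_bytes, then a full hash."""
    def head(p):
        with open(p, 'rb') as fh:
            return fh.read(head_bytes)

    def full_hash(p):
        h = hashlib.sha256()
        with open(p, 'rb') as fh:
            for chunk in iter(lambda: fh.read(65536), b''):
                h.update(chunk)
        return h.digest()

    candidate_groups = [list(paths)]
    for key in (os.path.getsize, head, full_hash):
        candidate_groups = [sub for g in candidate_groups
                            for sub in group_by(g, key)]
    return candidate_groups
```

Each pass only ever shrinks the candidate set, so the expensive hash runs over the few files that already agree on size and head bytes.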
> The idea of the library is to abstract over the type of elements and
> the number and kinds of tests.

But as the prior message author noted, you don't need lists of lists to do that. We want you to simplify things so they are most generally useful and easier to understand.

> And `find-dups' executes the algorithm with the steps as specified.
> You just need to specify a number of tests but don't need to write out
> the code yourself.

I don't quite see what code is not being written, except the sequencing of the filter applications, which is your code.

> Do you need a mathematical formulation of the abstract problem that
> the algorithm solves, and how it works? I had hoped the example in the
> header is a good explanation...

The example is a good one to use, but as was noted it is only one use case. Keep at it and you'll see it will become something much nicer.

Bob
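[The generic shape being discussed — abstract over the element type and over the number and kinds of tests by taking a list of key functions and successively partitioning, discarding singletons after each pass — can be sketched like this. Python is used purely for illustration; `chain_filters` and its signature are hypothetical, not the library's actual API.]

```python
from collections import defaultdict

def chain_filters(key_fns, items):
    """Successively partition items by each key function in key_fns,
    discarding any group with a single member after every pass.
    Returns the groups that survive all tests."""
    groups = [list(items)]
    for key in key_fns:
        next_groups = []
        for group in groups:
            buckets = defaultdict(list)
            for item in group:
                buckets[key(item)].append(item)
            next_groups.extend(g for g in buckets.values() if len(g) > 1)
        groups = next_groups
    return groups
```

For example, filtering strings first by length and then by first character keeps only the strings that agree on both tests; the same driver works for files with size/head/hash tests, which is the abstraction the library name is trying to capture.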