From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: Robert Weiner Newsgroups: gmane.emacs.devel Subject: Re: [ELPA] New package: find-dups Date: Wed, 11 Oct 2017 22:23:00 -0400 Message-ID: References: <87fuapemew.fsf@web.de> <87bmldefg5.fsf@web.de> Reply-To: rswgnu@gmail.com NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: multipart/alternative; boundary="001a113fe9f0f057ec055b50392f" X-Trace: blaine.gmane.org 1507775054 32466 195.159.176.226 (12 Oct 2017 02:24:14 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Thu, 12 Oct 2017 02:24:14 +0000 (UTC) Cc: Emacs Development To: Michael Heerdegen Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Thu Oct 12 04:24:10 2017 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1e2TAN-0007f2-RR for ged-emacs-devel@m.gmane.org; Thu, 12 Oct 2017 04:24:08 +0200 Original-Received: from localhost ([::1]:43362 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1e2TAV-0002Hm-5Y for ged-emacs-devel@m.gmane.org; Wed, 11 Oct 2017 22:24:15 -0400 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:56646) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1e2T9r-0002Hg-BX for emacs-devel@gnu.org; Wed, 11 Oct 2017 22:23:36 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1e2T9o-0007ez-0v for emacs-devel@gnu.org; Wed, 11 Oct 2017 22:23:35 -0400 Original-Received: from fencepost.gnu.org ([2001:4830:134:3::e]:35277) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1e2T9n-0007ea-TU for emacs-devel@gnu.org; Wed, 11 Oct 2017 22:23:31 -0400 Original-Received: from mail-qt0-f169.google.com ([209.85.216.169]:45780) by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_128_CBC_SHA1:128) (Exim 4.82) (envelope-from ) id 1e2T9n-0004uX-Jn for emacs-devel@gnu.org; Wed, 11 Oct 2017 22:23:31 -0400 Original-Received: by mail-qt0-f169.google.com with SMTP id p1so11056784qtg.2 for ; Wed, 11 Oct 2017 19:23:31 -0700 (PDT) X-Gm-Message-State: AMCzsaUE4gFwYUdnUyOU7U6uAmlYCncySulNb9ZeNOKNLnnPVMDGfisK YqDB/LVcDiRGiMKejWLh13PBZO7i/vggB3sxonI= X-Google-Smtp-Source: AOwi7QDqgc/G9g5BDPFHZIKIEOq7P865x9dgqRo8Wf0RBX8Cv9KEWG/6IWL3D5HeQ8A1KyPCXCwZ6ui841+inyPP/QQ= X-Received: by 10.55.19.228 with SMTP id 97mr1322186qkt.271.1507775011115; Wed, 11 Oct 2017 19:23:31 -0700 (PDT) Original-Received: by 10.237.34.225 with HTTP; Wed, 11 Oct 2017 19:23:00 -0700 (PDT) In-Reply-To: <87bmldefg5.fsf@web.de> X-Gmail-Original-Message-ID: X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Received-From: 2001:4830:134:3::e X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.org gmane.emacs.devel:219398 Archived-At: --001a113fe9f0f057ec055b50392f Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Wed, Oct 11, 2017 at 1:56 PM, Michael Heerdegen wrote: > Robert Weiner writes: > > > This seems incredibly complicated. It would help if you would state > > the general problem you are trying to solve and the performance > > characteristics you need. It certainly is not a generic duplicate > > removal library. Why can't you flatten your list and then just apply > > a sequence of predicate matches as needed or use hashing as mentioned > > in the commentary? > > I guess the name is misleading, I'll try to find a better one. > =E2=80=8BSounds good. How about filter-set? You are filtering a bunch of = items to produce a set. I'm not sure if this is limited to files or more generic. =E2=80=8B > =E2=80=8B=E2=80=8B > > =E2=80=8B=E2=80=8B > Look at the example of finding files with equal contents in your file > =E2=80=8B=E2=80=8B > > =E2=80=8B=E2=80=8B > system: you have a list or stream of, say, 10000 files in a file > =E2=80=8B=E2=80=8B > > =E2=80=8B=E2=80=8B > hierarchy. If you calculate hashes of all of those 10000 files, > =E2=80=8B=E2=80=8B > it will > =E2=80=8B=E2=80=8B > take hours. > =E2=80=8B=E2=80=8B > > =E2=80=8B=E2=80=8B =E2=80=8BOk, so you want to filter down a set of hierarchically arranged fi= les. =E2=80=8B > =E2=80=8B=E2=80=8B > > =E2=80=8B=E2=80=8B > It's wiser to do it in steps: first, look at the file's sizes of all > =E2=80=8B=E2=80=8B > > =E2=80=8B=E2=80=8B > files. That's a very fast test, and files with equal contents > =E2=80=8B=E2=80=8B > have the > =E2=80=8B=E2=80=8B > same size. You can discard all files with unique sizes. > =E2=80=8B=E2=80=8B > > =E2=80=8B=E2=80=8B =E2=80=8B=E2=80=8BYes, but that is just filtering (get the size of the file= s and filter to sets of files with the unique sizes). Then you chain more filters to filter further. (filter-duplicates list-of-filters-to-apply list-of-files-to-filter) which would produce a chain of filters like: (filterN .... (filter2 (filter1 list-of-files-to-filter)) =E2=80=8B > > In a second step, we have less many files. We could look at the first N > bytes of the files. That's still quite fast. =E2=80=8BSo you apply your fastest and most effective filters first. =E2=80=8B > =E2=80=8B=E2=80=8B > Left are groups of files > =E2=80=8B=E2=80=8B > with equal sizes and equal heads. For those it's worth of calculating a > =E2=80=8B=E2=80=8B > hash sum to see which have also equal contents. > =E2=80=8BOk. =E2=80=8B > =E2=80=8B=E2=80=8B > > =E2=80=8B=E2=80=8B > The idea of the library is to abstract over the type of elements and the > =E2=80=8B=E2=80=8B > n > =E2=80=8B=E2=80=8B > u > =E2=80=8B=E2=80=8B > m > =E2=80=8B=E2=80=8B > b > =E2=80=8B=E2=80=8B > e > =E2=80=8B=E2=80=8B > r > =E2=80=8B=E2=80=8B > =E2=80=8B=E2=80=8B > a > =E2=80=8B=E2=80=8B > n > =E2=80=8B=E2=80=8B > d > =E2=80=8B=E2=80=8B > =E2=80=8B=E2=80=8B > k > =E2=80=8B=E2=80=8B > i > =E2=80=8B=E2=80=8B > n > =E2=80=8B=E2=80=8B > d > =E2=80=8B=E2=80=8B > s > =E2=80=8B=E2=80=8B > =E2=80=8B=E2=80=8B > o > =E2=80=8B=E2=80=8B > f > =E2=80=8B=E2=80=8B > =E2=80=8B=E2=80=8B > t > =E2=80=8B=E2=80=8B > e > =E2=80=8B=E2=80=8B > s > =E2=80=8B=E2=80=8B > t > =E2=80=8B=E2=80=8B > . =E2=80=8B=E2=80=8B =E2=80=8BBut as the prior message author noted, you don't need lists of lis= ts to do that. We want you to simplify things so they are most=E2=80=8B generally u= seful and easier to understand. > > and `find-dups' executes the algorithm with the steps as specified. You > need just to specify a number of tests but don't need to write out the > code yourself. > =E2=80=8BI don't quite see what code is not being written except the sequen= cing of the filter applications which is your code. =E2=80=8B > =E2=80=8B=E2=80=8B > > =E2=80=8B=E2=80=8B > Do you need a mathematical formulation of the abstract problem that the > =E2=80=8B=E2=80=8B > algorithm solves, and how it works? I had hoped the example in the > =E2=80=8B=E2=80=8B > header is a good explanation... > =E2=80=8BThe example is a good one to use but as was noted is only one use = case. Keep at it and you'll see it will become something much nicer. Bob=E2=80=8B --001a113fe9f0f057ec055b50392f Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
On Wed, Oct 11, 2= 017 at 1:56 PM, Michael Heerdegen <michael_heerdegen@web.de> wrote:
<= div class=3D"gmail_quote">
Robert Weine= r <rsw@gnu.org> = writes:

> This seems incredibly complicated.=C2=A0 It would help if you would st= ate
> the general problem you are trying to solve and the performance
> characteristics you need.=C2=A0 It certainly is not a generic duplicat= e
> removal library.=C2=A0 Why can't you flatten your list and then ju= st apply
> a sequence of predicate matches as needed or use hashing as mentioned<= br> > in the commentary?

I guess the name is misleading, I'll try to find a better one.

=E2=80=8BSounds good.=C2=A0 How about filter-set?= =C2=A0 You are filtering a bunch of items to produce a set.=C2=A0 I'm n= ot sure if this is limited to files or more generic.
=E2=80=8B
=E2=80=8B=E2=80=8B

=E2=80=8B=E2=80=8B
Look at the example of finding files wit= h equal contents in your file
=E2=80=8B=E2=80=8B

=E2=80=8B=E2=80=8B
system: you have a list or stream of, sa= y, 10000 files in a file
=E2=80=8B=E2=80=8B

=E2=80=8B=E2=80=8B
hierarchy.=C2=A0 If you calculate hashes= of all of those 10000 files,
=E2=80=8B=E2=80=8B
it will
=E2=80=8B=E2=80=8B
take hours.
=E2=80=8B=E2=80=8B=

=E2=80=8B=E2=80=8B
=E2=80=8BOk, so you want to fil= ter down a set of hierarchically arranged files.
=E2=80=8B
=E2=80=8B=E2=80=8B

=E2=80=8B=E2=80=8B
It's wiser to do it in steps: first,= look at the file's sizes of all
=E2=80=8B=E2=80=8B

=E2=80=8B=E2=80=8B
files.=C2=A0 That's a very fast test= , and files with equal contents
=E2=80=8B=E2=80=8B
have the=
=E2=80=8B=E2=80=8B
same size.=C2=A0 You can discard all fil= es with unique sizes.
=E2=80=8B=E2=80=8B

= =E2=80=8B=E2=80=8B
=E2=80=8B=E2=80=8BYes, but that is just filtering (get = the size of the files and filter to sets of files with the unique sizes).= =C2=A0 Then you chain more filters to filter further.
=C2=A0 =C2=A0(= filter-duplicates list-of-filters-to-apply list-of-files-to-filter)

= which would produce a chain of filters like:
=C2=A0 =C2=A0(filterN .... (f= ilter2 (filter1 list-of-files-to-filter))
=E2=80=8B

In a second step, we have less many files.=C2=A0 We could look at the first= N
bytes of the files.=C2=A0 That's still quite fast.
=E2=80=8BSo you apply your fastest and most effective filters first.
= =E2=80=8B
=E2=80=8B=E2=80=8B=
=C2=A0 Left are groups of files
=E2=80=8B=E2=80=8B
with equal sizes and equal heads.=C2=A0 = For those it's worth of calculating a
=E2=80=8B=E2=80=8B
hash sum to see which have also equal co= ntents.

=E2=80=8BOk.
=E2=80=8B
=E2=80=8B=E2=80=8B

=E2=80=8B=E2=80=8B
The idea of the library is to abstract o= ver the type of elements and the
=E2=80=8B=E2=80=8B
n
=E2=80=8B=E2=80=8B
u=E2=80=8B=E2=80=8B
m
=E2=80=8B=E2=80=8B
b
=E2=80=8B=E2=80=8B
e
=E2=80=8B=E2=80=8B
r
=E2=80=8B=E2=80=8B
=E2=80=8B=E2=80=8B
a
=E2=80=8B=E2=80=8B
n
=E2=80=8B=E2=80=8B
d
= =E2=80=8B=E2=80=8B
=E2=80=8B=E2=80=8B
k
=E2= =80=8B=E2=80=8B
i
=E2=80=8B=E2=80=8B
n
=E2=80= =8B=E2=80=8B
d
=E2=80=8B=E2=80=8B
s
=E2=80=8B= =E2=80=8B
=E2=80=8B=E2=80=8B
o
=E2=80=8B=E2= =80=8B
f
=E2=80=8B=E2=80=8B
=E2=80=8B=E2=80= =8B
t
=E2=80=8B=E2=80=8B
e
=E2=80=8B=E2=80=8B=
s
=E2=80=8B=E2=80=8B
t
=E2=80=8B=E2=80=8B.
=E2=80=8B=E2=80=8B

=E2=80=8BBut as the prior m= essage author noted, you don't need lists of lists to do that.=C2=A0 We= want you to simplify things so they are most=E2=80=8B generally useful and= easier to understand.

and `find-dups' executes the algorithm with the steps as specified.=C2= =A0 You
need just to specify a number of tests but don't need to write out the<= br> code yourself.

=E2=80=8BI don't quite see wh= at code is not being written except the sequencing of the filter applicatio= ns which is your code.
=E2=80=8B
=E2=80=8B=E2=80=8B

=E2=80=8B=E2=80=8B
Do you need a mathematical formulation o= f the abstract problem that the
=E2=80=8B=E2=80=8B
algorithm solves, and how it works?=C2= =A0 I had hoped the example in the
=E2=80=8B=E2=80=8B
header is a good explanation...

=E2=80=8BThe example is a good one to use but as was not= ed is only one use case.=C2=A0 Keep at it and you'll see it will become= something much nicer.

Bob=E2=80=8B
--001a113fe9f0f057ec055b50392f--