* Re: [ELPA] New package: find-dups
2017-10-11 17:56 ` Michael Heerdegen
@ 2017-10-11 18:56 ` Eli Zaretskii
2017-10-11 19:25 ` Michael Heerdegen
2017-10-11 23:28 ` Thien-Thi Nguyen
2017-10-12 2:23 ` Robert Weiner
2 siblings, 1 reply; 12+ messages in thread
From: Eli Zaretskii @ 2017-10-11 18:56 UTC (permalink / raw)
To: Michael Heerdegen; +Cc: rswgnu, emacs-devel, rsw
> From: Michael Heerdegen <michael_heerdegen@web.de>
> Date: Wed, 11 Oct 2017 19:56:26 +0200
> Cc: rswgnu@gmail.com, Emacs Development <emacs-devel@gnu.org>
>
> #+begin_src emacs-lisp
> (find-dups my-sequence-of-file-names
>            (list (list (lambda (file)
>                          (file-attribute-size (file-attributes file)))
>                        #'eq)
>                  (list (lambda (file)
>                          (shell-command-to-string
>                           (format "head %s"
>                                   (shell-quote-argument file))))
>                        #'equal)
>                  (list (lambda (file)
>                          (shell-command-to-string
>                           (format "md5sum %s | awk '{print $1;}'"
>                                   (shell-quote-argument file))))
>                        #'equal)))
> #+end_src
Apologies for barging into the middle of a discussion, but starting
processes and making strings out of their output to process just a
portion of a file is sub-optimal, because process creation is not
cheap. It is easier to simply read a predefined number of bytes into
a buffer; insert-file-contents-literally supports that. Likewise with
md5sum: we have the md5 primitive for that.
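A minimal sketch of that approach (my-file-head and my-file-md5 are
hypothetical helper names, not part of find-dups or of Emacs):
#+begin_src emacs-lisp
;; Hypothetical helpers sketching the above; both read the file
;; literally (no decoding, no subprocess) into a temporary buffer.
(defun my-file-head (file n)
  "Return the first N bytes of FILE as a unibyte string."
  (with-temp-buffer
    (set-buffer-multibyte nil)
    ;; The BEG/END arguments limit how much is read from disk.
    (insert-file-contents-literally file nil 0 n)
    (buffer-string)))

(defun my-file-md5 (file)
  "Return the MD5 checksum of FILE's raw bytes."
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert-file-contents-literally file)
    ;; `md5' works directly on a buffer; no external md5sum needed.
    (md5 (current-buffer))))
#+end_src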
In general, working with buffers is much more efficient in Emacs than
working with strings, so avoid strings, let alone large strings, as
much as you can.
One other comment is that shell-command-to-string decodes the output
from the shell command, which is not something you want here, because
AFAIU you are looking for files whose contents are identical on the
byte-stream level, i.e. two files which have the same characters but
are encoded differently on disk (like one UTF-8, the other Latin-1)
should be considered different in this context, whereas
shell-command-to-string will/might produce identical strings for them.
(Decoding is also expensive run-time-wise.)
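To make that concrete, a hypothetical illustration (a.txt and b.txt
are assumed to contain the same non-ASCII text, one stored as UTF-8,
the other as Latin-1; my-file-head is the hypothetical helper sketched
above):
#+begin_src emacs-lisp
;; Depending on how the process output is decoded, the decoded
;; strings may compare equal...
(equal (shell-command-to-string "head a.txt")
       (shell-command-to-string "head b.txt"))   ; possibly t
;; ...whereas the raw bytes do not, since the byte streams differ:
(equal (my-file-head "a.txt" 1024)
       (my-file-head "b.txt" 1024))              ; nil
#+end_src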
* Re: [ELPA] New package: find-dups
2017-10-11 18:56 ` Eli Zaretskii
@ 2017-10-11 19:25 ` Michael Heerdegen
0 siblings, 0 replies; 12+ messages in thread
From: Michael Heerdegen @ 2017-10-11 19:25 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: rswgnu, emacs-devel, rsw
Hi Eli,
Thanks for your comment.
> Apologies for barging into the middle of a discussion, but starting
> processes and making strings out of their output to process just a
> portion of a file is sub-optimal, because process creation is not
> cheap. It is easier to simply read a predefined number of bytes into
> a buffer; insert-file-contents-literally supports that. Likewise with
> md5sum: we have the md5 primitive for that.
That's one of the details I postponed until we know whether we want
this for inclusion.  My first version used
`insert-file-contents-literally', but I ran into problems, and calling
`debug' crashed Emacs, so I gave up for the moment.  Personally I don't
care that much, because it doesn't make a big speed difference, but
yes, I know the current approach isn't strictly correct, and I do care
about that.
Regards,
Michael.
* Re: [ELPA] New package: find-dups
2017-10-11 17:56 ` Michael Heerdegen
2017-10-11 18:56 ` Eli Zaretskii
@ 2017-10-11 23:28 ` Thien-Thi Nguyen
2017-10-12 2:23 ` Robert Weiner
2 siblings, 0 replies; 12+ messages in thread
From: Thien-Thi Nguyen @ 2017-10-11 23:28 UTC (permalink / raw)
To: Emacs Development; +Cc: Michael Heerdegen, rswgnu
() Michael Heerdegen <michael_heerdegen@web.de>
() Wed, 11 Oct 2017 19:56:26 +0200
Robert Weiner <rsw@gnu.org> writes:
> This seems incredibly complicated. It would help if you
> would state the general problem you are trying to solve and
> the performance characteristics you need. It certainly is
> not a generic duplicate removal library. Why can't you
> flatten your list and then just apply a sequence of
> predicate matches as needed or use hashing as mentioned in
> the commentary?
   I guess the name is misleading, I'll try to find a better one.
How about "multi-pass-dups", then use/document "pass" everywhere
in the code/discussion? (Currently, you use "stage" in the code,
and "step" in this thread.)
   Look at the example of finding files with equal contents in
   your file system: [...]
Can you think of another use-case? That exercise will help
highlight the general (factorable) concepts to document well.
Conversely, if you cannot, maybe that's a hint that the
abstraction level is too high; some opportunity exists for
specialization (and thus optimization).
   In a second step, we have fewer files.
This is the key motivation for multi-pass...
   Do you need a mathematical formulation of the abstract
   problem that the algorithm solves, and how it works?
...so briefly explaining how iteration mitigates the suffering
due to the (irreducible) N^2 might be good to do early on (in
Commentary), before giving examples. Leading w/ a small bit of
theory caters to those readers already disposed to that style.
To help the rest of the readers, a common technique is to label
components of the theory (e.g., [1], [2], ...) and refer to
these later, in the concrete examples. Those readers might
gloss over the theory at first (being indisposed) but the back
references invite them to make connections at their own pace.
In short, "it is wise" to show how "it is wise" and avoid saying
"it is wise" (according to this practiced "wise"ass :-D).
   (find-dups my-sequence-of-file-names
              (list (list (lambda (file) ...)
                          #'eq)
                    (list (lambda (file) ...)
                          #'equal)
                    (list (lambda (file) ...)
                          #'equal)))
IIUC the 2nd level ‘list’ is to associate each characterization
func w/ a comparison func. I wonder if there is another way.
Too bad Emacs Lisp has no "object properties" like Guile, eh?
OTOH, the 1st level ‘list’ seems like a gratuitous hoop (read:
source of latent PEBKAC/complaints/redesign). Why not move that
down-chain, so caller need not worry? Something like:
#+begin_src emacs-lisp
(multi-pass-dups
 MY-SEQUENCE-OF-FILE-NAMES
 (list (lambda (file) ...)
       #'eq)
 (list (lambda (file) ...)
       #'equal)
 (list (lambda (file) ...)
       #'equal))
#+end_src
(I also use the all-caps-for-metavariables convention, here.)
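A minimal sketch of such a wrapper (assuming the find-dups shown
earlier; multi-pass-dups is the proposed name here, not an existing
function):
#+begin_src emacs-lisp
;; Hypothetical wrapper for the suggestion above: take the passes as
;; &rest arguments, so the caller doesn't build the outer list.
(defun multi-pass-dups (sequence &rest passes)
  "Find duplicates in SEQUENCE, narrowing with each of PASSES.
Each pass is a list (CHARACTERIZATION-FN COMPARISON-FN)."
  ;; &rest has already collected PASSES into the list that
  ;; `find-dups' expects as its second argument.
  (find-dups sequence passes))
#+end_src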
--
Thien-Thi Nguyen -----------------------------------------------
(defun responsep (query)
(pcase (context query)
(`(technical ,ml) (correctp ml))
...)) 748E A0E8 1CB8 A748 9BFA
--------------------------------------- 6CE4 6703 2224 4C80 7502
* Re: [ELPA] New package: find-dups
2017-10-11 17:56 ` Michael Heerdegen
2017-10-11 18:56 ` Eli Zaretskii
2017-10-11 23:28 ` Thien-Thi Nguyen
@ 2017-10-12 2:23 ` Robert Weiner
2 siblings, 0 replies; 12+ messages in thread
From: Robert Weiner @ 2017-10-12 2:23 UTC (permalink / raw)
To: Michael Heerdegen; +Cc: Emacs Development
On Wed, Oct 11, 2017 at 1:56 PM, Michael Heerdegen <michael_heerdegen@web.de> wrote:
> Robert Weiner <rsw@gnu.org> writes:
>
> > This seems incredibly complicated. It would help if you would state
> > the general problem you are trying to solve and the performance
> > characteristics you need. It certainly is not a generic duplicate
> > removal library. Why can't you flatten your list and then just apply
> > a sequence of predicate matches as needed or use hashing as mentioned
> > in the commentary?
> I guess the name is misleading, I'll try to find a better one.
Sounds good. How about filter-set? You are filtering a bunch of items to
produce a set. I'm not sure if this is limited to files or more generic.
> Look at the example of finding files with equal contents in your file
> system: you have a list or stream of, say, 10000 files in a file
> hierarchy.  If you calculate hashes of all of those 10000 files, it
> will take hours.
Ok, so you want to filter down a set of hierarchically arranged files.
> It's wiser to do it in steps: first, look at the sizes of all the
> files.  That's a very fast test, and files with equal contents have
> the same size.  You can discard all files with unique sizes.
Yes, but that is just filtering (get the size of the files and filter to
sets of files with the unique sizes). Then you chain more filters to
filter further.
(filter-duplicates list-of-filters-to-apply list-of-files-to-filter)
which would produce a chain of filters like:
(filterN .... (filter2 (filter1 list-of-files-to-filter)))
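A minimal sketch of that chain (filter-duplicates is the hypothetical
name proposed above, not an existing function; each filter is assumed
to map a list of candidates to the sublist worth keeping):
#+begin_src emacs-lisp
(require 'seq)

;; Hypothetical sketch of the proposed chaining: fold the filters
;; over the candidate list, each pass narrowing it further.
(defun filter-duplicates (filters items)
  "Apply each of FILTERS in turn to ITEMS, narrowing the candidates."
  (seq-reduce (lambda (candidates filter)
                (funcall filter candidates))
              filters
              items))
#+end_src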
> In a second step, we have fewer files.  We could look at the first N
> bytes of the files.  That's still quite fast.
So you apply your fastest and most effective filters first.
> Left are groups of files with equal sizes and equal heads.  For those
> it's worth calculating a hash sum to see which also have equal
> contents.
Ok.
> The idea of the library is to abstract over the type of elements and
> the number and kinds of tests.
But as the prior message author noted, you don't need lists of lists to do
that. We want you to simplify things so they are most generally useful
and easier to understand.
> and `find-dups' executes the algorithm with the steps as specified.
> You just need to specify a number of tests but don't need to write
> out the code yourself.
I don't quite see what code is not being written except the sequencing
of the filter applications, which is your code.
> Do you need a mathematical formulation of the abstract problem that
> the algorithm solves, and how it works?  I had hoped the example in
> the header is a good explanation...
The example is a good one to use but, as was noted, it is only one use
case.
Keep at it and you'll see it will become something much nicer.
Bob