* Re: [ELPA] New package: find-dups
2017-10-11 17:56 ` Michael Heerdegen
@ 2017-10-11 18:56 ` Eli Zaretskii
2017-10-11 19:25 ` Michael Heerdegen
2017-10-11 23:28 ` Thien-Thi Nguyen
2017-10-12 2:23 ` Robert Weiner
2 siblings, 1 reply; 12+ messages in thread
From: Eli Zaretskii @ 2017-10-11 18:56 UTC (permalink / raw)
To: Michael Heerdegen; +Cc: rswgnu, emacs-devel, rsw
> From: Michael Heerdegen <michael_heerdegen@web.de>
> Date: Wed, 11 Oct 2017 19:56:26 +0200
> Cc: rswgnu@gmail.com, Emacs Development <emacs-devel@gnu.org>
>
> #+begin_src emacs-lisp
> (find-dups my-sequence-of-file-names
>            (list (list (lambda (file)
>                          (file-attribute-size (file-attributes file)))
>                        #'eq)
>                  (list (lambda (file)
>                          (shell-command-to-string
>                           (format "head %s"
>                                   (shell-quote-argument file))))
>                        #'equal)
>                  (list (lambda (file)
>                          (shell-command-to-string
>                           (format "md5sum %s | awk '{print $1;}'"
>                                   (shell-quote-argument file))))
>                        #'equal)))
> #+end_src
Apologies for barging into the middle of a discussion, but starting
processes and making strings out of their output to process just a
portion of a file is sub-optimal, because process creation is not
cheap. It is easier to simply read a predefined number of bytes into
a buffer; insert-file-contents-literally supports that. Likewise with
md5sum: we have the md5 primitive for that.
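A minimal sketch of that approach (my-file-head and my-file-md5 are
hypothetical helper names, not part of find-dups or of Emacs):
#+begin_src emacs-lisp
;; Hypothetical helpers sketching the above; both read the file
;; literally (no decoding, no subprocess) into a temporary buffer.
(defun my-file-head (file n)
  "Return the first N bytes of FILE as a unibyte string."
  (with-temp-buffer
    (set-buffer-multibyte nil)
    ;; The BEG/END arguments limit how much is read from disk.
    (insert-file-contents-literally file nil 0 n)
    (buffer-string)))

(defun my-file-md5 (file)
  "Return the MD5 checksum of FILE's raw bytes."
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert-file-contents-literally file)
    ;; `md5' works directly on a buffer; no external md5sum needed.
    (md5 (current-buffer))))
#+end_src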
In general, working with buffers is much more efficient in Emacs than
working with strings, so avoid strings, let alone large strings, as
much as you can.
One other comment is that shell-command-to-string decodes the output
from the shell command, which is not something you want here, because
AFAIU you are looking for files whose contents are identical on the
byte-stream level, i.e. two files which have the same characters but
are encoded differently on disk (like one UTF-8, the other Latin-1)
should be considered different in this context, whereas
shell-command-to-string will/might produce identical strings for them.
(Decoding is also expensive run-time-wise.)
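To make that concrete, a hypothetical illustration (a.txt and b.txt
are assumed to contain the same non-ASCII text, one stored as UTF-8,
the other as Latin-1; my-file-head is the hypothetical helper sketched
above):
#+begin_src emacs-lisp
;; Depending on how the process output is decoded, the decoded
;; strings may compare equal...
(equal (shell-command-to-string "head a.txt")
       (shell-command-to-string "head b.txt"))   ; possibly t
;; ...whereas the raw bytes do not, since the byte streams differ:
(equal (my-file-head "a.txt" 1024)
       (my-file-head "b.txt" 1024))              ; nil
#+end_src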
* Re: [ELPA] New package: find-dups
2017-10-11 18:56 ` Eli Zaretskii
@ 2017-10-11 19:25 ` Michael Heerdegen
0 siblings, 0 replies; 12+ messages in thread
From: Michael Heerdegen @ 2017-10-11 19:25 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: rswgnu, emacs-devel, rsw
Hi Eli,
Thanks for your comment.
> Apologies for barging into the middle of a discussion, but starting
> processes and making strings out of their output to process just a
> portion of a file is sub-optimal, because process creation is not
> cheap. It is easier to simply read a predefined number of bytes into
> a buffer; insert-file-contents-literally supports that. Likewise with
> md5sum: we have the md5 primitive for that.
That's one of the details I postponed until we know whether we want
this for inclusion.  My first version used
`insert-file-contents-literally', but I ran into problems, and calling
`debug' crashed Emacs, so I gave up for the moment.  Personally I don't
care that much, because it doesn't make a big speed difference, but
yes, I know the current approach isn't strictly correct, and I do care
about that.
Regards,
Michael.
* Re: [ELPA] New package: find-dups
2017-10-11 17:56 ` Michael Heerdegen
2017-10-11 18:56 ` Eli Zaretskii
@ 2017-10-11 23:28 ` Thien-Thi Nguyen
2017-10-12 2:23 ` Robert Weiner
2 siblings, 0 replies; 12+ messages in thread
From: Thien-Thi Nguyen @ 2017-10-11 23:28 UTC (permalink / raw)
To: Emacs Development; +Cc: Michael Heerdegen, rswgnu
() Michael Heerdegen <michael_heerdegen@web.de>
() Wed, 11 Oct 2017 19:56:26 +0200
Robert Weiner <rsw@gnu.org> writes:
> This seems incredibly complicated. It would help if you
> would state the general problem you are trying to solve and
> the performance characteristics you need. It certainly is
> not a generic duplicate removal library. Why can't you
> flatten your list and then just apply a sequence of
> predicate matches as needed or use hashing as mentioned in
> the commentary?
   I guess the name is misleading, I'll try to find a better one.
How about "multi-pass-dups", then use/document "pass" everywhere
in the code/discussion? (Currently, you use "stage" in the code,
and "step" in this thread.)
   Look at the example of finding files with equal contents in
   your file system: [...]
Can you think of another use-case? That exercise will help
highlight the general (factorable) concepts to document well.
Conversely, if you cannot, maybe that's a hint that the
abstraction level is too high; some opportunity exists for
specialization (and thus optimization).
   In a second step, we have fewer files.
This is the key motivation for multi-pass...
   Do you need a mathematical formulation of the abstract
   problem that the algorithm solves, and how it works?
...so briefly explaining how iteration mitigates the suffering
due to the (irreducible) N^2 might be good to do early on (in
Commentary), before giving examples. Leading w/ a small bit of
theory caters to those readers already disposed to that style.
To help the rest of the readers, a common technique is to label
components of the theory (e.g., [1], [2], ...) and refer to
these later, in the concrete examples. Those readers might
gloss over the theory at first (being indisposed) but the back
references invite them to make connections at their own pace.
In short, "it is wise" to show how "it is wise" and avoid saying
"it is wise" (according to this practiced "wise"ass :-D).
   (find-dups my-sequence-of-file-names
              (list (list (lambda (file) ...)
                          #'eq)
                    (list (lambda (file) ...)
                          #'equal)
                    (list (lambda (file) ...)
                          #'equal)))
IIUC the 2nd level ‘list’ is to associate each characterization
func w/ a comparison func. I wonder if there is another way.
Too bad Emacs Lisp has no "object properties" like Guile, eh?
OTOH, the 1st level ‘list’ seems like a gratuitous hoop (read:
source of latent PEBKAC/complaints/redesign). Why not move that
down-chain, so caller need not worry? Something like:
#+begin_src emacs-lisp
(multi-pass-dups
 MY-SEQUENCE-OF-FILE-NAMES
 (list (lambda (file) ...)
       #'eq)
 (list (lambda (file) ...)
       #'equal)
 (list (lambda (file) ...)
       #'equal))
#+end_src
(I also use the all-caps-for-metavariables convention, here.)
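A minimal sketch of such a wrapper (assuming the find-dups shown
earlier; multi-pass-dups is the proposed name here, not an existing
function):
#+begin_src emacs-lisp
;; Hypothetical wrapper for the suggestion above: take the passes as
;; &rest arguments, so the caller doesn't build the outer list.
(defun multi-pass-dups (sequence &rest passes)
  "Find duplicates in SEQUENCE, narrowing with each of PASSES.
Each pass is a list (CHARACTERIZATION-FN COMPARISON-FN)."
  ;; &rest has already collected PASSES into the list that
  ;; `find-dups' expects as its second argument.
  (find-dups sequence passes))
#+end_src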
--
Thien-Thi Nguyen -----------------------------------------------
(defun responsep (query)
(pcase (context query)
(`(technical ,ml) (correctp ml))
...)) 748E A0E8 1CB8 A748 9BFA
--------------------------------------- 6CE4 6703 2224 4C80 7502
* Re: [ELPA] New package: find-dups
2017-10-11 17:56 ` Michael Heerdegen
2017-10-11 18:56 ` Eli Zaretskii
2017-10-11 23:28 ` Thien-Thi Nguyen
@ 2017-10-12 2:23 ` Robert Weiner
2 siblings, 0 replies; 12+ messages in thread
From: Robert Weiner @ 2017-10-12 2:23 UTC (permalink / raw)
To: Michael Heerdegen; +Cc: Emacs Development
On Wed, Oct 11, 2017 at 1:56 PM, Michael Heerdegen <michael_heerdegen@web.de> wrote:
> Robert Weiner <rsw@gnu.org> writes:
>
> > This seems incredibly complicated. It would help if you would state
> > the general problem you are trying to solve and the performance
> > characteristics you need. It certainly is not a generic duplicate
> > removal library. Why can't you flatten your list and then just apply
> > a sequence of predicate matches as needed or use hashing as mentioned
> > in the commentary?
> I guess the name is misleading, I'll try to find a better one.
Sounds good. How about filter-set? You are filtering a bunch of items to
produce a set. I'm not sure if this is limited to files or more generic.
> Look at the example of finding files with equal contents in your file
> system: you have a list or stream of, say, 10000 files in a file
> hierarchy.  If you calculate hashes of all of those 10000 files, it
> will take hours.
Ok, so you want to filter down a set of hierarchically arranged files.
> It's wiser to do it in steps: first, look at the sizes of all the
> files.  That's a very fast test, and files with equal contents have
> the same size.  You can discard all files with unique sizes.
Yes, but that is just filtering (get the size of the files and filter to
sets of files with the unique sizes). Then you chain more filters to
filter further.
(filter-duplicates list-of-filters-to-apply list-of-files-to-filter)
which would produce a chain of filters like:
(filterN .... (filter2 (filter1 list-of-files-to-filter)))
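A minimal sketch of that chain (filter-duplicates is the hypothetical
name proposed above, not an existing function; each filter is assumed
to map a list of candidates to the sublist worth keeping):
#+begin_src emacs-lisp
(require 'seq)

;; Hypothetical sketch of the proposed chaining: fold the filters
;; over the candidate list, each pass narrowing it further.
(defun filter-duplicates (filters items)
  "Apply each of FILTERS in turn to ITEMS, narrowing the candidates."
  (seq-reduce (lambda (candidates filter)
                (funcall filter candidates))
              filters
              items))
#+end_src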
> In a second step, we have fewer files.  We could look at the first N
> bytes of the files.  That's still quite fast.
So you apply your fastest and most effective filters first.
> Left are groups of files with equal sizes and equal heads.  For those
> it's worth calculating a hash sum to see which also have equal
> contents.
Ok.
> The idea of the library is to abstract over the type of elements and
> the number and kinds of tests.
But as the prior message author noted, you don't need lists of lists to do
that. We want you to simplify things so they are most generally useful
and easier to understand.
> and `find-dups' executes the algorithm with the steps as specified.
> You just need to specify a number of tests but don't need to write
> out the code yourself.
I don't quite see what code is not being written except the sequencing
of the filter applications, which is your code.
> Do you need a mathematical formulation of the abstract problem that
> the algorithm solves, and how it works?  I had hoped the example in
> the header is a good explanation...
The example is a good one to use but, as was noted, it is only one use
case.
Keep at it and you'll see it will become something much nicer.
Bob