From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: Michael Heerdegen Newsgroups: gmane.emacs.devel Subject: [ELPA] New package: find-dups Date: Wed, 11 Oct 2017 17:25:59 +0200 Message-ID: <87fuapemew.fsf@web.de> NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" X-Trace: blaine.gmane.org 1507735589 4071 195.159.176.226 (11 Oct 2017 15:26:29 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Wed, 11 Oct 2017 15:26:29 +0000 (UTC) User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.0.60 (gnu/linux) To: Emacs Development Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Wed Oct 11 17:26:21 2017 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1e2Itf-0007hP-T3 for ged-emacs-devel@m.gmane.org; Wed, 11 Oct 2017 17:26:12 +0200 Original-Received: from localhost ([::1]:41449 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1e2Itk-0002JG-5a for ged-emacs-devel@m.gmane.org; Wed, 11 Oct 2017 11:26:16 -0400 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:36941) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1e2Ita-0002Hs-Fj for emacs-devel@gnu.org; Wed, 11 Oct 2017 11:26:07 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1e2ItX-0008U3-8u for emacs-devel@gnu.org; Wed, 11 Oct 2017 11:26:06 -0400 Original-Received: from mout.web.de ([212.227.15.3]:62393) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1e2ItW-0008Sn-Sn for emacs-devel@gnu.org; Wed, 11 Oct 2017 11:26:03 -0400 Original-Received: from drachen.dragon ([88.67.74.137]) by smtp.web.de (mrweb003 [213.165.67.108]) with ESMTPSA (Nemesis) id 0Md4le-1dkE4r3BD8-00IAwu; Wed, 11 Oct 2017 17:26:00 +0200 X-Provags-ID: V03:K0:2CmH2+D/n9CqRlfd10cy9RUKVdoQR8GJez/Lt4D1tTmASChdJHI wSs/PmrMcHHH1cu+2ObCsIZ0bJHQI3WGK+4tBaofM5q3w3P9yFX4x30riv7caUEIf7jwJ91 kI+YSeSn22/T4lQP07ypzPb/dcYh4zAgW8L2TJrN8VQx5gFKTRUdzzQipb7ilpZp6nn67IK P12JJU4OS/ALfNU8XCr1Q== X-UI-Out-Filterresults: notjunk:1;V01:K0:b81vy2NTsuw=:lOy8Y1JqGfhqlbnA0eVb4J 65ocBlpeDzkXhPzpkfbzJ4VDp/pVC8wNK8npHqQyGF41Op7Xjmt2dJ5iebLCL8VjnHn5g5IRJ rLsSk8tSSMFHN+ZqJPoJkNq2osP/6Mp7TFzq7uPNphMuV6L9/i1m6urAlF26pCt/W7x47wh7n r0nPG6xOJ6Ep3eG6XDMzeKJsyB54SZfcXUJhTGo9hNNkzsKt5TjiSse9ccWSNrp5ftSHG/qp/ DosIJ1P7ogzSqu0NSpvrBEuqHDt8RK6mul/NJ7hqbqrA1ffMjg9B8YdFMIpxOtQjLyEMSlp+3 aEMbB4fXn6HADnGb9I0uzxuvbCCopabm5ZAmQGEk7AL+o1iA27P+qoQXPxA/oN7df2vI2Ix2U yHE7Z+GrX9Uwbsb3ddR9FXAVju5Ojab/lUdGozimo+R5suaKF7LxOpUmsvweq3FZDHXWg/iYm 23josVABil29tRw54KRsuutpVLKTI47/vQIqUEEceCpBXUntl89GCLc9dpDzYkafmjDHxH6eB EhxM7nnCAigksUkRPxOpLdX1iWy9h5YqMneMl0c3WeqxRLavsp099WSaTIuzQ1ukMko3hGB// +s5NNtciDb4JBgdpH8HzGhECFOjZyG4YsGJ2paggJSJSpbHIFiaz9fPu2fUpggt81CFdO/U2I RtKzG342+AWaYxsMv3WBDCcijnL+S+wdccDAJaYH2LG8Am2bw46ze0X8FsstXYbcj6XpEYj7A iHhgeeNBFRGQHbSQLARb/me4InBN4JTZwTrQxUdPvL+NyPsB9K1lqZ4NP/dhJHf9QMFWRdgC X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] [fuzzy] X-Received-From: 212.227.15.3 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.org gmane.emacs.devel:219378 Archived-At: --=-=-= Content-Type: text/plain Hello, would somebody want me to add something like the following to Gnu Elpa? Then I would push it after doing some fine-tuning (like adding autoloads, finding KEYWORDS for the file header etc - if nobody wants it, I could spare finding keywords etc). --=-=-= Content-Type: application/emacs-lisp Content-Disposition: inline; filename=find-dups.el Content-Transfer-Encoding: quoted-printable ;;; find-dups.el --- find equivalent elements in a sequence -*- lexical-b= inding: t -*- ;; Copyright (C) 2017 Free Software Foundation, Inc ;; Author: Michael Heerdegen ;; Maintainer: Michael Heerdegen ;; Created: 11 Oct 2017 ;; Keywords: ... ;; Version: 0.1 ;; This file is not part of GNU Emacs. ;; GNU Emacs is free software: you can redistribute it and/or modify ;; it under the terms of the GNU General Public License as published by ;; the Free Software Foundation, either version 3 of the License, or ;; (at your option) any later version. ;; GNU Emacs is distributed in the hope that it will be useful, ;; but WITHOUT ANY WARRANTY; without even the implied warranty of ;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ;; GNU General Public License for more details. ;; You should have received a copy of the GNU General Public License ;; along with GNU Emacs. If not, see . ;;; Commentary: ;; ;; This package implements a simple algorithm to find equivalent ;; elements in a sequence of elements using multiple steps of ;; refinement with increasing cost. ;; ;; The basic idea is as follows: suppose you want to find files with ;; identical contents in a file hierarchy. The input is a sequence of ;; absolute file names. You could do it this way: Create a hash table ;; with file checksums as keys. Calculate checksums for every file ;; and push the file's name to the entry for that checksum. Finally, ;; look under which keys multiple files are stored - these are the ;; groups of files with duplicated contents. This is not very ;; efficient, however. You could instead first have a look at the ;; file sizes, and in a second step, calculate checksums only for ;; those groups of files that have equal sizes. ;; ;; The simple algorithm implemented here works as follows: Input is a ;; list of lists of elements that were found to be equivalent in the ;; last step. In step N the algorithm uses a pair (F_n EQ-PRED_n) to ;; build a new list of lists of elements that also turned out to be ;; equivalent modulo `EQ-PRED' after applying F. Bunches containing a ;; single element are discarded as such an element can't be equivalent ;; to any other. ;; ;; The algorithm is initialized with (SEQ), were SEQ is the original ;; sequence of elements (it doesn't have to be a list). ;; ;; Example: In the case of finding files with duplicate contents your ;; call could (literally) look like this: ;; ;; (find-dups my-sequence-of-file-names ;; (list (list (lambda (file) ;; (file-attribute-size (file-attributes file))) ;; #'eq) ;; (list (lambda (file) ;; (shell-command-to-string ;; (format "head -n 100 %s" ;; (shell-quote-argument file)))) ;; #'equal) ;; (list (lambda (file) ;; (shell-command-to-string ;; (format "md5sum %s | awk '{print $1;}'" ;; (shell-quote-argument file)))) ;; #'equal))) ;; ;; This package implements the function `find-dups' as well as a user ;; command to find files with identical contents as described above. ;;; Code: (eval-when-compile (require 'subr-x)) (require 'seq) (require 'map) (require 'stream) (defun find-dups-partition (seq f &optional eq-pred) "Return a list of lists of equivalent elements." (cl-callf or eq-pred #'equal) (let ((map (if (and (memq eq-pred (list #'equal #'eq)) (not (seq-empty-p (seq-drop seq 100)))) (make-hash-table :test eq-pred) '())) result) (seq-doseq (elt seq) (push elt (map-elt map (funcall f elt) nil eq-pred= ))) (map-do (lambda (_ elts) (when (cdr elts) (push elts result))) map) result)) (defun find-dups (seq &rest stages) (let (nbr-stages (stage 0) (result (list seq)) (should-message nil)) (when (eq (car stages) :print-stages) (pop stages) (setq should-message t)) (setq nbr-stages (length stages)) (while stages (when should-message (message "Stage %d/%d%s" (cl-incf stage) nbr-stages (if-let ((descr (caddr (car stages)))) (format " (%s)" descr) ""))) (let ((stage (pop stages))) (setq result (mapcan (lambda (seq) (apply #'find-dups-partition seq (seq-take stage 2))) result)))) result)) (defvar find-file-dups-stages (list (list (lambda (file) (file-attribute-size (file-attributes file))) #'eq "size") (list (lambda (file) (shell-command-to-string (format "head -n 100 %s"(shell-quote-argument file)))) #'equal "head -n 100") (list (lambda (file) (message "md5sum %s" file) (shell-command-to-string (format "md5sum %s | awk '{print $1;}'" (shell-quote-argument file)))) #'equal "md5sum"))) (defun find-file-dups (file-stream) "Find duplicate files in FILE-STREAM." (apply #'find-dups file-stream :print-stages find-file-dups-stages)) (defun find-directory-file-dups (dir &optional recurse follow-links filter) "Simple interactive front end for `find-file-dups'." (interactive (list (read-directory-name "Dir: ") current-prefix-arg)) (pop-to-buffer (get-buffer-create "*File Dups*")) (erase-buffer) (insert "Duplicate files:") (dolist (files (find-file-dups (stream-of-directory-files (expand-file-name dir) t nil recurse foll= ow-links (lambda (file) (and (file-readable-p file) (file-regular-p file) (or (not filter) (funcall = filter file))))))) (insert "\n\n") (dolist (file files) (insert file "\n"))) (insert "\n\n\nDone.")) (provide 'find-dups) --=-=-= Content-Type: text/plain TIA, Michael. --=-=-=--