* remove-duplicates performances @ 2011-05-20 13:51 Thierry Volpiatto 2011-05-20 14:09 ` Stefan Monnier 2011-05-20 14:28 ` David Kastrup 0 siblings, 2 replies; 14+ messages in thread From: Thierry Volpiatto @ 2011-05-20 13:51 UTC (permalink / raw) To: emacs-devel Hi all, i just noticed that `remove-duplicates' is very slow. Something like below seem much faster: --8<---------------cut here---------------start------------->8--- (defun* remove-dups (seq &key (test 'eq)) (let ((cont (make-hash-table :test test))) (loop for elm in seq unless (gethash elm cont) do (puthash elm elm cont) finally return (loop for i being the hash-values in cont collect i)))) --8<---------------cut here---------------end--------------->8--- Test: (setq A (let ((seq (loop for i from 1 to 10000 collect i))) (append seq seq))) (1 2 3 4 5 6 7 8 9 10 1 2 ...) (remove-dups A) (1 2 3 4 5 6 7 8 9 10 11 12 ...) elp-results: remove-dups 1 0.013707 0.013707 (remove-duplicates A) (1 2 3 4 5 6 7 8 9 10 11 12 ...) elp-results: remove-duplicates 1 66.971619 66.971619 Would be nice to improve performances of `remove-duplicates'. -- A+ Thierry Get my Gnupg key: gpg --keyserver pgp.mit.edu --recv-keys 59F29997 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: remove-duplicates performances 2011-05-20 13:51 remove-duplicates performances Thierry Volpiatto @ 2011-05-20 14:09 ` Stefan Monnier 2011-05-20 14:39 ` Thierry Volpiatto 2011-05-20 15:16 ` Ted Zlatanov 2011-05-20 14:28 ` David Kastrup 1 sibling, 2 replies; 14+ messages in thread From: Stefan Monnier @ 2011-05-20 14:09 UTC (permalink / raw) To: Thierry Volpiatto; +Cc: emacs-devel > i just noticed that `remove-duplicates' is very slow. It's using an O(N²) algorithm, so it's indeed slow for long lists. > (setq A (let ((seq (loop for i from 1 to 10000 collect i))) > (append seq seq))) > (1 2 3 4 5 6 7 8 9 10 1 2 ...) For such long lists, I fully expect it to be slow. But for short lists, the overhead of constructing a hash-table should make the current code competitive. Can you try and find out for which lengths your code is better than the one we have? Stefan ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: remove-duplicates performances 2011-05-20 14:09 ` Stefan Monnier @ 2011-05-20 14:39 ` Thierry Volpiatto 2011-05-20 16:14 ` Stefan Monnier 2011-05-20 15:16 ` Ted Zlatanov 1 sibling, 1 reply; 14+ messages in thread From: Thierry Volpiatto @ 2011-05-20 14:39 UTC (permalink / raw) To: emacs-devel Stefan Monnier <monnier@iro.umontreal.ca> writes: >> i just noticed that `remove-duplicates' is very slow. > > It's using an O(N²) algorithm, so it's indeed slow for long lists. I am not familiar with big O notations, but yes the code seem complex with many loops. >> (setq A (let ((seq (loop for i from 1 to 10000 collect i))) >> (append seq seq))) >> (1 2 3 4 5 6 7 8 9 10 1 2 ...) > > For such long lists, I fully expect it to be slow. > But for short lists, the overhead of constructing a hash-table should > make the current code competitive. Can you try and find out for which > lengths your code is better than the one we have? I go down to a list of 10 elements and it still faster: liste de 2X100 éléments: remove-duplicates 1 0.010716 0.010716 remove-dups 1 0.00027 0.00027 liste de 2X50 éléments: remove-duplicates 1 0.00415 0.00415 remove-dups 1 0.000129 0.000129 liste de 2X25 éléments: remove-duplicates 1 0.00047 0.00047 remove-dups 1 7.9e-05 7.9e-05 liste de 2X10 éléments: remove-duplicates 1 0.000209 0.000209 remove-dups 1 3.6e-05 3.6e-05 liste de 2X5 éléments: remove-duplicates 1 7.3e-05 7.3e-05 remove-dups 1 6.4e-05 6.4e-05 -- A+ Thierry Get my Gnupg key: gpg --keyserver pgp.mit.edu --recv-keys 59F29997 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: remove-duplicates performances 2011-05-20 14:39 ` Thierry Volpiatto @ 2011-05-20 16:14 ` Stefan Monnier 2011-05-20 17:46 ` Thierry Volpiatto 0 siblings, 1 reply; 14+ messages in thread From: Stefan Monnier @ 2011-05-20 16:14 UTC (permalink / raw) To: Thierry Volpiatto; +Cc: emacs-devel > I go down to a list of 10 elements and it still faster: I'm not surprised the break-even is less than 10. > liste de 2X10 éléments: > remove-duplicates 1 0.000209 0.000209 > remove-dups 1 3.6e-05 3.6e-05 > liste de 2X5 éléments: > remove-duplicates 1 7.3e-05 7.3e-05 > remove-dups 1 6.4e-05 6.4e-05 Hmm... so it's faster to do it for 20 than for 10? I expect it is common to call remove-duplicates with very short lists (shorter than 10 for sure) that present (almost) no duplication. Stefan ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: remove-duplicates performances 2011-05-20 16:14 ` Stefan Monnier @ 2011-05-20 17:46 ` Thierry Volpiatto 0 siblings, 0 replies; 14+ messages in thread From: Thierry Volpiatto @ 2011-05-20 17:46 UTC (permalink / raw) To: emacs-devel Stefan Monnier <monnier@iro.umontreal.ca> writes: >> I go down to a list of 10 elements and it still faster: > > I'm not surprised the break-even is less than 10. > >> liste de 2X10 éléments: >> remove-duplicates 1 0.000209 0.000209 >> remove-dups 1 3.6e-05 3.6e-05 > >> liste de 2X5 éléments: >> remove-duplicates 1 7.3e-05 7.3e-05 >> remove-dups 1 6.4e-05 6.4e-05 > > Hmm... so it's faster to do it for 20 than for 10? Yes, i didn't notice that, i redo it and i have now: For 10X2==>remove-dups 1 5.2e-05 5.2e-05 For 5X2==> remove-dups 1 5.4e-05 5.4e-05 Don't know why. Can you try on your side? > I expect it is common to call remove-duplicates with very short lists > (shorter than 10 for sure) that present (almost) no duplication. Well, on small list, (shorter than ten), no need to have such a complex function, most of the time i use something like: (loop for i in '(a b a c b c) unless (memq i ls) collect i into ls finally return ls) I just tried remove-duplicates in common-lisp (slime) and i see the result on big list is instant.(i didn't instrument though) -- A+ Thierry Get my Gnupg key: gpg --keyserver pgp.mit.edu --recv-keys 59F29997 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: remove-duplicates performances 2011-05-20 14:09 ` Stefan Monnier 2011-05-20 14:39 ` Thierry Volpiatto @ 2011-05-20 15:16 ` Ted Zlatanov 1 sibling, 0 replies; 14+ messages in thread From: Ted Zlatanov @ 2011-05-20 15:16 UTC (permalink / raw) To: emacs-devel On Fri, 20 May 2011 11:09:55 -0300 Stefan Monnier <monnier@iro.umontreal.ca> wrote: >> i just noticed that `remove-duplicates' is very slow. SM> It's using an O(N²) algorithm, so it's indeed slow for long lists. ... SM> For such long lists, I fully expect it to be slow. SM> But for short lists, the overhead of constructing a hash-table should SM> make the current code competitive. Can you try and find out for which SM> lengths your code is better than the one we have? That would depend on the memory and CPU available to Emacs, wouldn't it? Plus small data sets will have a lot of performance variance due to external factors. I would just make 30 the breaking point. Ted ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: remove-duplicates performances 2011-05-20 13:51 remove-duplicates performances Thierry Volpiatto 2011-05-20 14:09 ` Stefan Monnier @ 2011-05-20 14:28 ` David Kastrup 2011-05-20 16:00 ` Thierry Volpiatto 1 sibling, 1 reply; 14+ messages in thread From: David Kastrup @ 2011-05-20 14:28 UTC (permalink / raw) To: emacs-devel Thierry Volpiatto <thierry.volpiatto@gmail.com> writes: > i just noticed that `remove-duplicates' is very slow. > > Something like below seem much faster: > > (defun* remove-dups (seq &key (test 'eq)) > (let ((cont (make-hash-table :test test))) > (loop for elm in seq > unless (gethash elm cont) > do (puthash elm elm cont) > finally return (loop for i being the hash-values in cont collect i)))) > > Test: > > (setq A (let ((seq (loop for i from 1 to 10000 collect i))) > (append seq seq))) > (1 2 3 4 5 6 7 8 9 10 1 2 ...) > > (remove-dups A) > (1 2 3 4 5 6 7 8 9 10 11 12 ...) > elp-results: remove-dups 1 0.013707 0.013707 > > (remove-duplicates A) > (1 2 3 4 5 6 7 8 9 10 11 12 ...) > elp-results: remove-duplicates 1 66.971619 66.971619 > > Would be nice to improve performances of `remove-duplicates'. There is little point in the overhead of a hashtable for a one-shot operation. Hashtables are best for _maintaining_ data, not for processing other data structures. I've found the following in some file of mine: (defun uniquify (list predicate) (let* ((p list) lst (x1 (make-symbol "x1")) (x2 (make-symbol "x2"))) (while p (push p lst) (setq p (cdr p))) ;;; (princ lst)(princ "\n") (setq lst (sort lst `(lambda(,x1 ,x2) (funcall ',predicate (car ,x1) (car ,x2))))) ;;; lst now contains all sorted sublists, with equal cars being ;;; sorted in order of increasing length (from end of list to start). ;; (while (cdr lst) (unless (funcall predicate (car (car lst)) (car (cadr lst))) (setcar (car lst) x1)) (setq lst (cdr lst))) (delq x1 list))) (uniquify '(2 1 2 1 2) '<) (uniquify '(4 7 3 26 4 2 6 24 4 5 2 3 2 4 6) '<) Obviously, this should make use of lexical binding "nowadays" instead of creating its own unique symbols for function parameters. And instead of using x1 as a sentinel value some other unique thing might be used, like a one-shot '(nil). Basically, for a one-shot O(n^2) operation on a list, `sort' will usually do the trick. But if you are doing that O(n^2) operation regularly, you should probably maintain the whole data set in a hashtable instead. -- David Kastrup ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: remove-duplicates performances 2011-05-20 14:28 ` David Kastrup @ 2011-05-20 16:00 ` Thierry Volpiatto 2011-05-20 17:01 ` David Kastrup 0 siblings, 1 reply; 14+ messages in thread From: Thierry Volpiatto @ 2011-05-20 16:00 UTC (permalink / raw) To: emacs-devel David Kastrup <dak@gnu.org> writes: > I've found the following in some file of mine: > > (defun uniquify (list predicate) > (let* ((p list) lst (x1 (make-symbol "x1")) > (x2 (make-symbol "x2"))) > (while p > (push p lst) > (setq p (cdr p))) > ;;; (princ lst)(princ "\n") > (setq lst > (sort lst `(lambda(,x1 ,x2) > (funcall ',predicate (car ,x1) (car ,x2))))) > ;;; lst now contains all sorted sublists, with equal cars being > ;;; sorted in order of increasing length (from end of list to start). > ;; > > (while (cdr lst) > (unless (funcall predicate (car (car lst)) (car (cadr lst))) > (setcar (car lst) x1)) > (setq lst (cdr lst))) > (delq x1 list))) > > (uniquify '(2 1 2 1 2) '<) > (uniquify '(4 7 3 26 4 2 6 24 4 5 2 3 2 4 6) '<) This is nice and very instructive (at least for me) thanks. It is not as performant as the version with hash-table, but very usable: 0.3 <=> 0.13 with same test on list with 20000 elements. However, isn't it a problem when we want to remove duplicate in a list type alist e.g ((a . 1) (b . 2) (a . 1) (c . 3) (b . 2)...) -- A+ Thierry Get my Gnupg key: gpg --keyserver pgp.mit.edu --recv-keys 59F29997 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: remove-duplicates performances 2011-05-20 16:00 ` Thierry Volpiatto @ 2011-05-20 17:01 ` David Kastrup 2011-05-20 17:31 ` Thierry Volpiatto 2011-05-20 17:57 ` Ted Zlatanov 0 siblings, 2 replies; 14+ messages in thread From: David Kastrup @ 2011-05-20 17:01 UTC (permalink / raw) To: emacs-devel Thierry Volpiatto <thierry.volpiatto@gmail.com> writes: > David Kastrup <dak@gnu.org> writes: > >> I've found the following in some file of mine: >> >> (defun uniquify (list predicate) >> (let* ((p list) lst (x1 (make-symbol "x1")) >> (x2 (make-symbol "x2"))) >> (while p >> (push p lst) >> (setq p (cdr p))) >> ;;; (princ lst)(princ "\n") >> (setq lst >> (sort lst `(lambda(,x1 ,x2) >> (funcall ',predicate (car ,x1) (car ,x2))))) >> ;;; lst now contains all sorted sublists, with equal cars being >> ;;; sorted in order of increasing length (from end of list to start). >> ;; >> >> (while (cdr lst) >> (unless (funcall predicate (car (car lst)) (car (cadr lst))) >> (setcar (car lst) x1)) >> (setq lst (cdr lst))) >> (delq x1 list))) >> >> (uniquify '(2 1 2 1 2) '<) >> (uniquify '(4 7 3 26 4 2 6 24 4 5 2 3 2 4 6) '<) > > This is nice and very instructive (at least for me) thanks. > It is not as performant as the version with hash-table, Well, the sorting function is a mess due to not being compiled and fearing dynamic binding. If you byte-compile something like (defun uniquify (list predicate) (let* ((p list) lst (sentinel (list nil))) (while p (push p lst) (setq p (cdr p))) (setq lst (sort lst (lambda(x1 x2) (funcall predicate (car x1) (car x2))))) ;;; lst now contains all sorted sublists, with equal cars being ;;; sorted in order of increasing length (from end of list to start). ;; (while (cdr lst) (unless (funcall predicate (car (car lst)) (car (cadr lst))) (setcar (car lst) sentinel)) (setq lst (cdr lst))) (delq sentinel list))) the behavior is likely better. > but very usable: 0.3 <=> 0.13 with same test on list with 20000 > elements. However, isn't it a problem when we want to remove > duplicate in a list type alist e.g ((a . 1) (b . 2) (a . 1) (c . 3) (b > . 2)...) Why? You need a predicate < both for sorting and for telling inequality. As long as you define a suitable predicate for that purpose, what should go wrong? Any elements for which (or (predicate a b) (predicate b a)) is nil will be considered duplicate. -- David Kastrup ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: remove-duplicates performances 2011-05-20 17:01 ` David Kastrup @ 2011-05-20 17:31 ` Thierry Volpiatto 2011-05-20 17:46 ` David Kastrup 2011-05-20 17:57 ` Ted Zlatanov 1 sibling, 1 reply; 14+ messages in thread From: Thierry Volpiatto @ 2011-05-20 17:31 UTC (permalink / raw) To: emacs-devel David Kastrup <dak@gnu.org> writes: > Thierry Volpiatto <thierry.volpiatto@gmail.com> writes: > >> David Kastrup <dak@gnu.org> writes: >> >>> I've found the following in some file of mine: >>> >>> (defun uniquify (list predicate) >>> (let* ((p list) lst (x1 (make-symbol "x1")) >>> (x2 (make-symbol "x2"))) >>> (while p >>> (push p lst) >>> (setq p (cdr p))) >>> ;;; (princ lst)(princ "\n") >>> (setq lst >>> (sort lst `(lambda(,x1 ,x2) >>> (funcall ',predicate (car ,x1) (car ,x2))))) >>> ;;; lst now contains all sorted sublists, with equal cars being >>> ;;; sorted in order of increasing length (from end of list to start). >>> ;; >>> >>> (while (cdr lst) >>> (unless (funcall predicate (car (car lst)) (car (cadr lst))) >>> (setcar (car lst) x1)) >>> (setq lst (cdr lst))) >>> (delq x1 list))) >>> >>> (uniquify '(2 1 2 1 2) '<) >>> (uniquify '(4 7 3 26 4 2 6 24 4 5 2 3 2 4 6) '<) >> >> This is nice and very instructive (at least for me) thanks. >> It is not as performant as the version with hash-table, > > Well, the sorting function is a mess due to not being compiled and > fearing dynamic binding. If you byte-compile something like > > (defun uniquify (list predicate) > (let* ((p list) lst (sentinel (list nil))) > (while p > (push p lst) > (setq p (cdr p))) > (setq lst > (sort lst (lambda(x1 x2) > (funcall predicate (car x1) (car x2))))) > ;;; lst now contains all sorted sublists, with equal cars being > ;;; sorted in order of increasing length (from end of list to start). > ;; > (while (cdr lst) > (unless (funcall predicate (car (car lst)) (car (cadr lst))) > (setcar (car lst) sentinel)) > (setq lst (cdr lst))) > (delq sentinel list))) > > the behavior is likely better. Yes, that's not really a problem, the results are very acceptable compared to remove-duplicates (68s!). >> but very usable: 0.3 <=> 0.13 with same test on list with 20000 >> elements. However, isn't it a problem when we want to remove >> duplicate in a list type alist e.g ((a . 1) (b . 2) (a . 1) (c . 3) (b >> . 2)...) > > Why? You need a predicate < both for sorting and for telling > inequality. As long as you define a suitable predicate for that > purpose, what should go wrong? Any elements for which > (or (predicate a b) (predicate b a)) is nil will be considered > duplicate. Yes, i understand that, what i mean is you have to write a predicate each time, which could be inconvenient, instead of using :test 'equal. -- A+ Thierry Get my Gnupg key: gpg --keyserver pgp.mit.edu --recv-keys 59F29997 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: remove-duplicates performances 2011-05-20 17:31 ` Thierry Volpiatto @ 2011-05-20 17:46 ` David Kastrup 2011-05-20 18:01 ` Stefan Monnier 0 siblings, 1 reply; 14+ messages in thread From: David Kastrup @ 2011-05-20 17:46 UTC (permalink / raw) To: emacs-devel Thierry Volpiatto <thierry.volpiatto@gmail.com> writes: > David Kastrup <dak@gnu.org> writes: > >> Thierry Volpiatto <thierry.volpiatto@gmail.com> writes: >>> >>> This is nice and very instructive (at least for me) thanks. >>> It is not as performant as the version with hash-table, [...] >>> but very usable: 0.3 <=> 0.13 with same test on list with 20000 >>> elements. However, isn't it a problem when we want to remove >>> duplicate in a list type alist e.g ((a . 1) (b . 2) (a . 1) (c . 3) (b >>> . 2)...) >> >> Why? You need a predicate < both for sorting and for telling >> inequality. As long as you define a suitable predicate for that >> purpose, what should go wrong? Any elements for which >> (or (predicate a b) (predicate b a)) is nil will be considered >> duplicate. > Yes, i understand that, what i mean is you have to write a predicate > each time, which could be inconvenient, instead of using :test 'equal. With all due respect, you are proposing a hashtable as an alternate mechanism. A hashtable requires a hash function and an equality test. If you want to get better behavior than O(n^2), you can't just use an equivalence operator for weeding out duplicates. -- David Kastrup ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: remove-duplicates performances 2011-05-20 17:46 ` David Kastrup @ 2011-05-20 18:01 ` Stefan Monnier 2011-05-20 21:57 ` Pascal J. Bourguignon 0 siblings, 1 reply; 14+ messages in thread From: Stefan Monnier @ 2011-05-20 18:01 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel >> Yes, i understand that, what i mean is you have to write a predicate >> each time, which could be inconvenient, instead of using :test 'equal. > With all due respect, you are proposing a hashtable as an alternate > mechanism. A hashtable requires a hash function and an equality test. We do have a built-in hash function that corresponds to the `equal' equality test, as well as one for the `eq' equality test. Total ordering predicates consistent with `eq' or `equal' OTOH are not currently provided, so the use of `sort' requires extra work. Stefan ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: remove-duplicates performances 2011-05-20 18:01 ` Stefan Monnier @ 2011-05-20 21:57 ` Pascal J. Bourguignon 0 siblings, 0 replies; 14+ messages in thread From: Pascal J. Bourguignon @ 2011-05-20 21:57 UTC (permalink / raw) To: emacs-devel Stefan Monnier <monnier@iro.umontreal.ca> writes: >>> Yes, i understand that, what i mean is you have to write a predicate >>> each time, which could be inconvenient, instead of using :test 'equal. >> With all due respect, you are proposing a hashtable as an alternate >> mechanism. A hashtable requires a hash function and an equality test. > > We do have a built-in hash function that corresponds to the `equal' > equality test, as well as one for the `eq' equality test. > > Total ordering predicates consistent with `eq' or `equal' OTOH are not > currently provided, so the use of `sort' requires extra work. We can use a hash table when the test is eq, eql or equal (and corresponding special cases when test-not is specified), or when the user has defined a hash-table test with define-hash-table-test. This can be tested with (get test 'hash-table-test). In the other cases, we cannot use a hash-table. Notably, CL remove-duplicates and delete-duplicates are specified in such a way that they can produce (sometimes) useful result when given non equivalence relationships as test function. I guess in those case we have to fall back to the O(n²) algorithm. And notice that the function to re-implement is cl-delete-duplicates, since it's used by both remove-duplicates and delete-duplicates. -- __Pascal Bourguignon__ http://www.informatimago.com/ A bad day in () is better than a good day in {}. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: remove-duplicates performances 2011-05-20 17:01 ` David Kastrup 2011-05-20 17:31 ` Thierry Volpiatto @ 2011-05-20 17:57 ` Ted Zlatanov 1 sibling, 0 replies; 14+ messages in thread From: Ted Zlatanov @ 2011-05-20 17:57 UTC (permalink / raw) To: emacs-devel On Fri, 20 May 2011 19:01:16 +0200 David Kastrup <dak@gnu.org> wrote: DK> Well, the sorting function is a mess due to not being compiled and DK> fearing dynamic binding. If you byte-compile something like ... DK> the behavior is likely better. Incidentally, Doug Hoyte claims in "Let Over Lambda" that sorting networks have better performance than other sorting algorithms for small lists (25 elements was the biggest optimization example he gave). This seemed very reasonable to me because of their low memory and startup costs, though I don't know enough about the topic to recommend sorting networks in Emacs Lisp specifically. It may be worthwhile to generate optimized sorting networks for such small lists and to use them for sorting and (with modifications) for `remove-duplicates' and frequency counting. Is anyone interested? Ted ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2011-05-20 21:57 UTC | newest] Thread overview: 14+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-05-20 13:51 remove-duplicates performances Thierry Volpiatto 2011-05-20 14:09 ` Stefan Monnier 2011-05-20 14:39 ` Thierry Volpiatto 2011-05-20 16:14 ` Stefan Monnier 2011-05-20 17:46 ` Thierry Volpiatto 2011-05-20 15:16 ` Ted Zlatanov 2011-05-20 14:28 ` David Kastrup 2011-05-20 16:00 ` Thierry Volpiatto 2011-05-20 17:01 ` David Kastrup 2011-05-20 17:31 ` Thierry Volpiatto 2011-05-20 17:46 ` David Kastrup 2011-05-20 18:01 ` Stefan Monnier 2011-05-20 21:57 ` Pascal J. Bourguignon 2011-05-20 17:57 ` Ted Zlatanov
Code repositories for project(s) associated with this external index https://git.savannah.gnu.org/cgit/emacs.git https://git.savannah.gnu.org/cgit/emacs/org-mode.git This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.