From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Oleh Krehel Newsgroups: gmane.emacs.devel Subject: What to do for faster `remove-duplicates'? Date: Wed, 06 May 2015 09:33:02 +0200 Message-ID: <87383atb2p.fsf@gmail.com> NNTP-Posting-Host: plane.gmane.org Mime-Version: 1.0 Content-Type: text/plain X-Trace: ger.gmane.org 1430897966 26825 80.91.229.3 (6 May 2015 07:39:26 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Wed, 6 May 2015 07:39:26 +0000 (UTC) To: emacs-devel@gnu.org Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Wed May 06 09:39:21 2015 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1YptvM-0002lm-3R for ged-emacs-devel@m.gmane.org; Wed, 06 May 2015 09:39:20 +0200 Original-Received: from localhost ([::1]:43410 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YptvL-0003Ne-D5 for ged-emacs-devel@m.gmane.org; Wed, 06 May 2015 03:39:19 -0400 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:42833) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YptvG-0003Kh-Gj for emacs-devel@gnu.org; Wed, 06 May 2015 03:39:15 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1YptvB-0007Jy-ID for emacs-devel@gnu.org; Wed, 06 May 2015 03:39:14 -0400 Original-Received: from mail-wi0-x230.google.com ([2a00:1450:400c:c05::230]:35489) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YptvB-0007Jh-Bt for emacs-devel@gnu.org; Wed, 06 May 2015 03:39:09 -0400 Original-Received: by widdi4 with SMTP id di4so190746951wid.0 for ; Wed, 06 May 2015 00:39:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=from:to:subject:date:message-id:mime-version:content-type; bh=qL/4N/USLky3CCKy/KBjPEKHyHi363odMH0QRmI7SIM=; b=yxa1ju19cLUZ6LO3hTtgZbQWvAEoWhjDmQ0x5Jeb3U6l2IA5kr72up9PjpnPtThbz6 qJ+Wms+V0neEHIA7RXrc3QMKFJgbZcK8yiQNYHlTkwaA79pkMS3/DiX8WvDxKLnYnFqQ 2udScB44ujOCmcIuUIWZjAOLpwBpTcpqoJLwU8E9w0uC2CGf3JWiPXIiY594CjHkSThy w1x+DIitHxFFlYHu0S/LRLyCGH4OUpX0gmEq1X4ZvAwap8nBI2nVm32PIK0aBhKa3mkJ ma2jd7Vl76BhKsLIyx3HL/Doj0q6sW4eJRV7dgKMelpNuJ1SO9lF55gfbLg4NA1Arq2d O4HA== X-Received: by 10.180.88.99 with SMTP id bf3mr11249814wib.75.1430897947872; Wed, 06 May 2015 00:39:07 -0700 (PDT) Original-Received: from firefly (dyn069045.nbw.tue.nl. [131.155.69.45]) by mx.google.com with ESMTPSA id wr2sm1368982wjb.45.2015.05.06.00.39.07 for (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Wed, 06 May 2015 00:39:07 -0700 (PDT) X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). X-Received-From: 2a00:1450:400c:c05::230 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.emacs.devel:186273 Archived-At: Hi all, Should `cl-remove-duplicates' switch to using a hash table if the candidates list is large? I see here a speedup factor of >500 for removing the duplicates of `load-library' candidates. Helm has a function for this, which looks very simple (much more simple that the current `cl-remove-duplicates'): (cl-defun helm-fast-remove-dups (seq &key (test 'eq)) "Remove duplicates elements in list SEQ. This is same as `remove-duplicates' but with memoisation. It is much faster, especially in large lists. A test function can be provided with TEST argument key. Default is `eq'." (cl-loop with cont = (make-hash-table :test test) for elm in seq unless (gethash elm cont) do (puthash elm elm cont) finally return (cl-loop for i being the hash-values in cont collect i))) And here are the benchmark tests: (setq cands (locate-file-completion-table load-path (get-load-suffixes) "" nil t)) (length cands) 5357 (length (cl-remove-duplicates cands :test 'equal)) 2481 (benchmark-run (cl-remove-duplicates cands :test 'equal)) (0.67873101 0 0.0) (benchmark-run (helm-fast-remove-dups cands :test 'equal)) (0.001350054 0 0.0) (benchmark-run (seq-uniq cands 'equal)) (5.270219822 27 2.396615401000002) This is also a request for seq.el to revise its optimization strategy. So should `cl-remove-duplicates' be a 2-in-1 (lists for small input, hash table for large input), or should there be `cl-remove-duplicates-hash', or an optional argument to `cl-remove-duplicates' to use a hash? Oleh