From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mark H Weaver Subject: Optimizing union.scm Date: Sun, 23 Mar 2014 23:55:43 -0400 Message-ID: <87r45sl074.fsf_-_@yeeloong.lan> References: <871tz8oldk.fsf@netris.org> <874n2oubuq.fsf@gnu.org> <8761n4mzvu.fsf@yeeloong.lan> <87siq8r77u.fsf@gnu.org> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Return-path: Received: from eggs.gnu.org ([2001:4830:134:3::10]:58596) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1WRw02-0007nJ-NG for guix-devel@gnu.org; Sun, 23 Mar 2014 23:56:39 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1WRvzx-00089V-DM for guix-devel@gnu.org; Sun, 23 Mar 2014 23:56:34 -0400 In-Reply-To: <87siq8r77u.fsf@gnu.org> ("Ludovic \=\?utf-8\?Q\?Court\=C3\=A8s\=22'\?\= \=\?utf-8\?Q\?s\?\= message of "Sun, 23 Mar 2014 21:27:33 +0100") List-Id: "Development of GNU Guix and the GNU System distribution." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guix-devel-bounces+gcggd-guix-devel=m.gmane.org@gnu.org Sender: guix-devel-bounces+gcggd-guix-devel=m.gmane.org@gnu.org To: Ludovic =?utf-8?Q?Court=C3=A8s?= Cc: guix-devel@gnu.org ludo@gnu.org (Ludovic Court=C3=A8s) writes: > Mark H Weaver skribis: > >> I want to optimize it anyway, since it takes over 5 minutes to build >> my profile, which is a bit painful. > > Oh, this much? > > I have 140 packages in my profile and it takes less than 30s to build > it; that=E2=80=99s an SSD though, so that probably makes a big difference. My profile has 155 packages. I've looked over union.scm and can see some extreme wastefulness, most notably in the use of 'others-have-it?'. I guess this typically does N 'lstat' calls for every file or directory in every package in the resulting profile that cannot be pruned (due to being in a directory that's only in one package), where N is the number of packages in the profile. I'm fairly sure it is possible to replace those N 'lstat' calls with something that requires 0 system calls and at most O(log N) time. The basic idea would be to iterate over all packages in a breadth-first manner, as follows: I think we should readdir the top-level directories of every package, and merge them together into a single map structure (vhash?) that maps filenames to sets of packages containing that filename. We don't even need to 'lstat' them at this point, all we need are the names. After we've read the directories of every package, we then iterate over the map. For any unique entries, we simply make a symlink in the new profile. For duplicates we'd do conflict resolution, which starts with an 'lstat'. For directories, the default conflict resolution would simply create a directory in the new profile and then recurse into that subdirectory, considering only the packages that contained that subdirectory. My guess is that this would speed up profile creation by at least an order of magnitude, maybe more. What do you think? Mark