From mboxrd@z Thu Jan 1 00:00:00 1970 From: ludo@gnu.org (Ludovic =?utf-8?Q?Court=C3=A8s?=) Subject: Re: Optimizing union.scm Date: Mon, 24 Mar 2014 14:45:23 +0100 Message-ID: <87ha6nzp58.fsf@gnu.org> References: <871tz8oldk.fsf@netris.org> <874n2oubuq.fsf@gnu.org> <8761n4mzvu.fsf@yeeloong.lan> <87siq8r77u.fsf@gnu.org> <87r45sl074.fsf_-_@yeeloong.lan> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Return-path: Received: from eggs.gnu.org ([2001:4830:134:3::10]:34216) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1WS5Du-0006jg-4Z for guix-devel@gnu.org; Mon, 24 Mar 2014 09:47:35 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1WS5Bu-0000Io-DY for guix-devel@gnu.org; Mon, 24 Mar 2014 09:45:32 -0400 Received: from hera.aquilenet.fr ([2a01:474::1]:46196) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1WS5Bu-0000HE-55 for guix-devel@gnu.org; Mon, 24 Mar 2014 09:45:26 -0400 In-Reply-To: <87r45sl074.fsf_-_@yeeloong.lan> (Mark H. Weaver's message of "Sun, 23 Mar 2014 23:55:43 -0400") List-Id: "Development of GNU Guix and the GNU System distribution." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guix-devel-bounces+gcggd-guix-devel=m.gmane.org@gnu.org Sender: guix-devel-bounces+gcggd-guix-devel=m.gmane.org@gnu.org To: Mark H Weaver Cc: guix-devel@gnu.org Mark H Weaver skribis: > ludo@gnu.org (Ludovic Court=C3=A8s) writes: > >> Mark H Weaver skribis: >> >>> I want to optimize it anyway, since it takes over 5 minutes to build >>> my profile, which is a bit painful. >> >> Oh, this much? >> >> I have 140 packages in my profile and it takes less than 30s to build >> it; that=E2=80=99s an SSD though, so that probably makes a big differenc= e. > > My profile has 155 packages. I've looked over union.scm and can see > some extreme wastefulness, most notably in the use of 'others-have-it?'. > I guess this typically does N 'lstat' calls for every file or directory > in every package in the resulting profile that cannot be pruned (due to > being in a directory that's only in one package), where N is the number > of packages in the profile. We could use a memoizing version of =E2=80=98file-exists?=E2=80=99 in =E2= =80=98union-build=E2=80=99 and see what happens. > I'm fairly sure it is possible to replace those N 'lstat' calls with > something that requires 0 system calls and at most O(log N) time. The > basic idea would be to iterate over all packages in a breadth-first > manner, as follows: > > I think we should readdir the top-level directories of every package, > and merge them together into a single map structure (vhash?) that maps > filenames to sets of packages containing that filename. We don't even > need to 'lstat' them at this point, all we need are the names. After > we've read the directories of every package, we then iterate over the > map. For any unique entries, we simply make a symlink in the new > profile. > > For duplicates we'd do conflict resolution, which starts with an > 'lstat'. For directories, the default conflict resolution would simply > create a directory in the new profile and then recurse into that > subdirectory, considering only the packages that contained that > subdirectory. > > My guess is that this would speed up profile creation by at least an > order of magnitude, maybe more. > > What do you think? I=E2=80=99m all for it! It=E2=80=99s not obvious to me that we would gain this much (esp. an order = of magnitude), but if that=E2=80=99s the case, it=E2=80=99s great. (I=E2=80= =99m guessing that the bottleneck is I/O, not CPU.) My only concern is that that code should remain readable and tested. So: patches welcome. ;-) Ludo=E2=80=99.