Subject: [bug#39258] benchmark search: default vs v2 vs v3
From: zimoun
Date: Sun, 26 Apr 2020 05:54:21 +0200
To: 39258@debbugs.gnu.org, Arun Isaac, Ludovic Courtès, Pierre Neidhardt

Hi,

Thank you Arun for the patches and all the work.  Sorry for the delay.

TLDR:
 1) around 25 seconds added to "guix pull"... but more often than not
    I am already waiting around 10 minutes when pulling.
 2) the speedup is clear: more than 2x.

The question is the tradeoff between the slowdown of pull and the
speedup of search.  What is acceptable?

Here let us benchmark 3 versions of Guix:

 - default is a357849f5b
 - v2 rebased on default and based on Xapian
 - v3 rebased on default too and based on "custom" index

and let us compare the time of "guix pull" and then "guix search".
Because v2 uses Xapian, the accuracy is different and so the list of
outputs is different depending on the query; the impact on the
performance seems minimal.  Let us discuss accuracy and BM25 elsewhere
and focus on performance for now.


* guix pull
-----------

The idea is: measure whether computing the new index is expensive or
not, compared to everything else "guix pull" computes.

** Reference
------------

Maybe I have misconfigured something or my laptop is really not
powerful at all, but here are some numbers.  (Note: /proc/cpuinfo
reports 4 times Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz and
/sys/block/sda/queue/rotational says 0, which means SSD.)

--8<---------------cut here---------------start------------->8---
$ guix describe
Generation 8	Apr 25 2020 09:00:01	(current)
  guix f84b036
    repository URL: https://git.savannah.gnu.org/git/guix.git
    branch: master
    commit: f84b0363053e5479464f6ce6ded45f80360d90fc
--8<---------------cut here---------------end--------------->8---

--8<---------------cut here---------------start------------->8---
$ time guix pull -C ~/.config/guix/default-channels.scm
Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
Building from this channel:
  guix      https://git.savannah.gnu.org/git/guix.git	8cf6d15
downloading from https://ci.guix.gnu.org/nar/gzip/xgakzpfs3rz57m666hsk1v3d3zcy7wgn-config.scm ...
 config.scm

[...]

building fonts directory...
building directory of Info manuals...
building database for manual pages...
building profile with 1 package...
building /gnu/store/kq1zlj5rxz8wrxc3ha8vck2wv2iakfnb-inferior-script.scm.drv...
building package cache...
building profile with 1 package...
New in this revision:
  2 new packages: cl-osicat, sbcl-osicat

real	13m37.997s
user	1m38.129s
sys	0m0.856s
--8<---------------cut here---------------end--------------->8---

And because "guix search" is used, say, 10 times more often than "guix
pull", an increase of 10% in the time of "guix pull" is worth paying
if it makes "guix search" faster, IMHO.

Therefore, because "guix pull" takes around 13 minutes here, the extra
cost to index all the packages can be roughly 1min30s (at most).

Then, if I pull back from 8cf6d15 to '--commit=a357849f5b', it takes:

  real	2m13.693s
  user	1m37.418s
  sys	0m0.666s

so in this case 10% means around 13s.  But after 1 minute of waiting,
the command already feels too long to me, and since I am already
waiting I do not mind much whether it takes 2m13s or 3m00s.

Well, it is hard to draw a clear line about what could be accepted as
the time of indexing because the time of pulling is already highly
variable.

What is the average time of "guix pull"?  It would be really
interesting to poll the users.  They could report:

 - guix describe
 - time guix pull

with whatever channels they have set up.  Just to have an idea of what
the acceptable extra time added by indexing should be.  For sure it
depends on the hardware, but it would provide an idea and help to see
if the extra time is worth it or not.

WDYT?


** Let's compare the index time
-------------------------------

Let us pull for the 3 cases and populate the store with all the
necessary items.  Could be looooonng! (20 minutes)  For example, for
the version 2 of patches -- living in my branch 'search-v2' using a
worktree.
--8<---------------cut here---------------start------------->8---
time ./pre-inst-env guix pull -p /tmp/v2 \
     --url=$PWD --branch=search-v2 \
     -C ~/.config/guix/default-channels.scm
--8<---------------cut here---------------end--------------->8---

and then let us locate the index file for each version:

--8<---------------cut here---------------start------------->8---
# ls -l /tmp/default/lib/guix
/gnu/store/g5c08vqsv31nkn2r0hr32dbrkhf3cvd8-guix-package-cache

readlink /tmp/v2/lib/guix/package-search.index
/gnu/store/8xbzhn81hmshagbgazmnr7xfps1cdsa3-guix-package-search-index/lib/guix/package-search.index

readlink /tmp/v3/lib/guix/package-metadata.cache
/gnu/store/8j78b5c4ddic21gcx7wpbq2akjn7x7mr-guix-package-metadata-cache/lib/guix/package-metadata.cache
--8<---------------cut here---------------end--------------->8---

Well, let us remove the profiles and garbage collect the index files:

--8<---------------cut here---------------start------------->8---
rm /tmp/default /tmp/v{2,3}*
guix gc -D \
     /gnu/store/g5c08vqsv31nkn2r0hr32dbrkhf3cvd8-guix-package-cache \
     /gnu/store/8xbzhn81hmshagbgazmnr7xfps1cdsa3-guix-package-search-index \
     /gnu/store/8j78b5c4ddic21gcx7wpbq2akjn7x7mr-guix-package-metadata-cache
--8<---------------cut here---------------end--------------->8---

And then re-run "guix pull".  We are now comparing apples to apples, I
guess.

| time | default   | v2        | v3        |
|------+-----------+-----------+-----------|
| real | 1m11.899s | 1m30.806s | 1m34.341s |
| user | 1m23.845s | 1m24.160s | 1m24.233s |
| sys  | 0m0.570s  | 0m0.563s  | 0m0.529s  |

Therefore the extra cost is less than 20s for v2 and less than 25s for
v3.

The whole question is what this extra 25s is compared to, i.e., which
time of "guix pull":

 - more than 13m: adding 25s is acceptable
 - less than 2m: adding 25s is questionable

Usually, my feeling about "guix pull" is... I am waiting!  Therefore,
I will not notice this extra 25s because it is masked by all the other
work "guix pull" is doing.
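
To put those 20s and 25s in perspective, here is a minimal sketch (not
part of the patches) that recomputes the overhead from the "real" row
of the table and expresses it as a share of the two reference pulls
measured earlier (13m37.997s and 2m13.693s).  The helper names
to_seconds, overhead and percent_of are made up for this illustration,
and it assumes an awk that prints a dot as decimal separator.

```shell
#!/bin/sh
# Illustrative helpers only; the names are hypothetical, not Guix tooling.

# Convert a `time`-style duration such as "1m11.899s" into seconds.
to_seconds() {
    printf '%s\n' "$1" | awk -F'[ms]' '{ printf "%.3f", $1 * 60 + $2 }'
}

# Extra wall-clock seconds of a patched run ($2) over the baseline ($1).
overhead() {
    awk -v base="$(to_seconds "$1")" -v new="$(to_seconds "$2")" \
        'BEGIN { printf "%.3f", new - base }'
}

# Overhead in seconds ($1) as a percentage of a full pull duration ($2).
percent_of() {
    awk -v extra="$1" -v total="$(to_seconds "$2")" \
        'BEGIN { printf "%.1f", 100 * extra / total }'
}

default=1m11.899s    # "real" time of the default version
for entry in v2=1m30.806s v3=1m34.341s; do
    name=${entry%%=*}
    extra=$(overhead "$default" "${entry#*=}")
    echo "$name: +${extra}s," \
         "$(percent_of "$extra" 13m37.997s)% of the long pull," \
         "$(percent_of "$extra" 2m13.693s)% of the short pull"
done
```

Measured against the 10% figure discussed above, the long pull stays
well inside the budget while the short pull goes beyond it, which is
the acceptable/questionable split listed above.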

* guix search
-------------

Let us compare cold cache (echo 3 | sudo tee /proc/sys/vm/drop_caches)
and warm cache.  For example, for the query 'inkscape' (first block:
cold cache; second block: warm cache):

| time | default  | v2       | v3       |
|------+----------+----------+----------|
| real | 0m1.842s | 0m0.331s | 0m0.437s |
| user | 0m1.270s | 0m0.179s | 0m0.336s |
| sys  | 0m0.142s | 0m0.047s | 0m0.052s |
|------+----------+----------+----------|
| real | 0m0.898s | 0m0.132s | 0m0.292s |
| user | 0m1.069s | 0m0.168s | 0m0.353s |
| sys  | 0m0.072s | 0m0.008s | 0m0.019s |

Therefore the speedup is at least 3.

| cache | default-vs-v2 | default-vs-v3 |
|-------+---------------+---------------|
| cold  |           5.6 |           4.2 |
| warm  |           6.8 |           3.1 |

Another query:

--8<---------------cut here---------------start------------->8---
time guix search crypto library | recsel -P name | grep libb2
--8<---------------cut here---------------end--------------->8---

| time | default  | v2       | v3       |
|------+----------+----------+----------|
| real | 0m2.216s | 0m1.109s | 0m0.689s |
| user | 0m1.655s | 0m1.309s | 0m0.683s |
| sys  | 0m0.193s | 0m0.073s | 0m0.035s |
|------+----------+----------+----------|
| real | 0m1.197s | 0m0.490s | 0m0.491s |
| user | 0m1.448s | 0m0.819s | 0m0.625s |
| sys  | 0m0.089s | 0m0.034s | 0m0.039s |

| cache | default-vs-v2 | default-vs-v3 |
|-------+---------------+---------------|
| cold  |           2.0 |           3.2 |
| warm  |           2.4 |           2.4 |


Before going further, especially about any other more sophisticated
inverted index (BM25), it seems important to me to settle what "cost"
on "guix pull" the users are ready to pay.  Because somehow the
inverted index has to be computed.  And without an inverted index, it
seems difficult to improve the accuracy.

One solution could be: compute the inverted index in the background
with a low priority.  If the index is not ready yet when "guix search"
is called, then fall back to the current default behaviour.

WDYT?

Cheers,
simon