From: aurtzy
To: 73220@debbugs.gnu.org
Cc: aurtzy, Simon Tournier
Subject: [bug#73220] [PATCH] ui: Add more nuance to relevance scoring.
Date: Fri, 13 Sep 2024 20:17:52 -0400
Message-ID: <4eea8048-fb10-40b5-a16b-09c96932ccb0@gmail.com>
In-Reply-To: <87a5gbve0s.fsf@gmail.com>
References: <87a5gbve0s.fsf@gmail.com>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="------------21s0DpniSld8nE19l6UNkpfL"

This is a multi-part message in MIME format.
--------------21s0DpniSld8nE19l6UNkpfL
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: 8bit

Hi Simon,

On 9/13/24 10:12, Simon Tournier wrote:

>> | tor                   | Tor related; ~torbrowser~ somewhere near top. |
> --8<---------------cut here---------------start------------->8---
> $ ./pre-inst-env guix search tor | recsel -p name,relevance | head -8
> name: tor
> relevance: 208
>
> name: tor-client
> relevance: 169
>
> name: torsocks
> relevance: 103
> --8<---------------cut here---------------end--------------->8---
>
> Compared to current:
>
> --8<---------------cut here---------------start------------->8---
> $ guix search tor | recsel -p name,relevance | head -8
> name: tor
> relevance: 47
>
> name: ghc-storablevector
> relevance: 29
>
> name: tor-client
> relevance: 28
> --8<---------------cut here---------------end--------------->8---
>
> However, the position moves from 225th to 19th.
>
>     $ guix search tor | recsel -P name | grep -n torbrowser
>     225:torbrowser
>
>     $ ./pre-inst-env guix search tor | recsel -P name | grep -n torbrowser
>     19:torbrowser
>
> As with ’dig’, the description of the ’torbrowser’ package could be
> improved, because ’guix search tor browser’ returns nothing.

Does ~torbrowser~ not appear as the third result in all three cases for you when running =guix search tor browser=?

Otherwise, if you meant =guix search tor= to find ~torbrowser~: perhaps it should be ranked higher, but it could be argued that patch v1's behavior is still better in this respect, considering that all results above ~torbrowser~ are indeed related to Tor.

>> | Keyword(s) with poor  | Expectations                                  |
>> | results before        |                                               |
>> |-----------------------+-----------------------------------------------|
>> | dig                   | ~bind~ near top.                              |
> Hum, indeed and I do not know if we can improve here.  Well, it’s hard
> to improve for short terms, BTW.
>
> --8<---------------cut here---------------start------------->8---
> $ ./pre-inst-env guix search dig | recsel -p name,relevance | head -8
> name: go-go-uber-org-dig
> relevance: 104
>
> name: rust-num-bigint-dig
> relevance: 78
>
> name: rust-num-bigint-dig
> relevance: 78
> --8<---------------cut here---------------end--------------->8---
>
> Compared to current:
>
> --8<---------------cut here---------------start------------->8---
> $ guix search dig | recsel -p name,relevance | head -8
> name: sysdig
> relevance: 24
>
> name: texlive-pedigree-perl
> relevance: 13
>
> name: ruby-net-http-digest-auth
> relevance: 13
> --8<---------------cut here---------------end--------------->8---
>
> Indeed, 17th position is better than 609th.  But if you add a term such as
> ’dns’, bang! :-)  Well, BTW the description of ’bind’ could be a bit
> improved because the word network does not appear.  Anyway. :-)

[...]

>> | rsh                   | ~inetutils~ near top.                         |
> --8<---------------cut here---------------start------------->8---
> $ ./pre-inst-env guix search rsh | recsel -p name,relevance | head -8
> name: inetutils
> relevance: 26
>
> name: emacs-tramp
> relevance: 26
>
> name: rust-borsh-schema-derive-internal
> relevance: 22
> --8<---------------cut here---------------end--------------->8---
>
> Compared to current:
>
> --8<---------------cut here---------------start------------->8---
> $ guix search rsh | recsel -p name,relevance | head -8
> name: go-sigs-k8s-io-yaml
> relevance: 14
>
> name: python-pymarshal
> relevance: 13
>
> name: emacs-powershell
> relevance: 13
> --8<---------------cut here---------------end--------------->8---

[...]

>> | gcc                   | ~gcc-toolchain~ near top.                     |
> Indeed, something is unexpected.  Well, first:
>
>     $ guix search gcc | recsel -CP name | uniq | head -8
>     gccgo
>     gfortran-toolchain
>     gdc-toolchain
>     gcc-toolchain
>     gcc-cross-x86_64-w64-mingw32-toolchain
>     gcc-cross-or1k-elf-toolchain
>     gcc-cross-i686-w64-mingw32-toolchain
>     gcc-cross-avr-toolchain
>
>     $ guix search gcc | recsel -CP name | uniq -c | sort -rn | head -8
>          18 llvm
>          12 gcc-toolchain
>           6 libgccjit
>           6 gccgo
>           3 isl
>           2 libstdc++-doc
>           2 java-commons-cli
>           2 gdc-toolchain
>
> In other words, packages with multiple versions degrade the experience.
> Well, that had already been “improved” [1] with some SEO. ;-)  Indeed,
> maybe the relevance should be improved.
>
> Second, gccgo has a relevance score of 22 with the only term ’gcc’,
> compared to gcc-toolchain scoring at 15.
>
>     gccgo        gcc-toolchain
>   4 * 1 * 1      4 * 1 * 1
> + 2 * 5 * 1    + 2 * 1 * 1
> + 1 * 0        + 1 * 0
> + 3 * 1 * 1    + 3 * 1 * 1
> + 2 * 0        + 2 * 1 * 3
> + 1 * 5 * 1    + 1 * 0
> = 22           = 15
>
> This is unexpected.  And, IMHO that’s a bug!  In the description of
> gcc-toolchain, the term ’gcc’ appears 3 times but it only scores ’1’
> instead of ’5’.
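
The two columns above are plain weighted sums, so they are easy to double-check.  Here is a minimal sketch, assuming the weights are the left-hand factors of each row (4, 2, 1, 3, 2, 1) and the per-field scores are the middle factors; the trailing factor is 1 or 3 and is folded into the score where it matters.  ~weighted-sum~ is a hypothetical helper name for illustration, not something taken from the Guix sources.

```scheme
;; Hypothetical helper: multiply each field weight by that field's score
;; and add everything up.
(define (weighted-sum weights scores)
  (apply + (map * weights scores)))

;; Weights as they appear in the left-hand factor of each row above.
(define weights '(4 2 1 3 2 1))

;; Per-field scores read off the two columns above.
(define gccgo-scores         '(1 5 0 1 0 5))
(define gcc-toolchain-scores '(1 1 0 1 3 0))

(display (weighted-sum weights gccgo-scores))          ; prints 22
(newline)
(display (weighted-sum weights gcc-toolchain-scores))  ; prints 15
(newline)
```

The three description matches of gcc-toolchain contribute only 2 * 3 = 6, whereas a single match scored as 5 in a weight-2 field already contributes 10 for gccgo.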

> As the patch tries to address, the main issue is:
>
>   (define (score regexp str)
>     (fold-matches regexp str 0
>                   (lambda (m score)
>                     (+ score
>                        (if (string=? (match:substring m) str)
>                            5             ;exact match
>                            1)))))
>
> Here the exact match does not consider a substring exact match.  For
> instance, one would consider that the term ’gcc’ exactly matches in
> “some GCC thing”.  Considering the current implementation, that’s not
> the case.  For instance, here is a snippet that behaves like the
> procedure ’scoring’:
>
> --8<---------------cut here---------------start------------->8---
> scheme@(guix-user)> ,use(ice-9 regex)
> scheme@(guix-user)> (define regexp (make-regexp "gcc" regexp/icase))
> scheme@(guix-user)> (define str "some GCC thing")
> scheme@(guix-user)> (fold-matches regexp str 0
>     (lambda (m res)
>       (+ res
>         (if (string=? (match:substring m) str)
>           5 1))))
> $2 = 1
> --8<---------------cut here---------------end--------------->8---
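
For comparison, a boundary-aware check can tell these cases apart.  The following is only a sketch under my own assumptions, not the code of either patch version: ~word-char?~, ~match-kind~ and ~kinds~ are hypothetical names, and a "word" character is assumed to be a letter or a digit.

```scheme
(use-modules (ice-9 regex))

;; Assumption: a "word" character is a letter or a digit.
(define (word-char? c)
  (or (char-alphabetic? c) (char-numeric? c)))

;; Hypothetical helper: classify the regexp match M found in STR.
(define (match-kind m str)
  (let* ((start (match:start m))
         (end   (match:end m))
         ;; A side is a boundary when the match touches the edge of STR
         ;; or the neighbouring character is not a word character.
         (left?  (or (= start 0)
                     (not (word-char? (string-ref str (- start 1))))))
         (right? (or (= end (string-length str))
                     (not (word-char? (string-ref str end))))))
    (cond ((and (= start 0) (= end (string-length str))) 'exact)
          ((and left? right?) 'whole-word)
          (else 'substring))))

(define regexp (make-regexp "gcc" regexp/icase))

;; Collect the kind of every match of REGEXP in STR.
(define (kinds str)
  (fold-matches regexp str '()
                (lambda (m acc) (cons (match-kind m str) acc))))

(kinds "gcc")             ; => (exact)
(kinds "some GCC thing")  ; => (whole-word)
(kinds "libgccjit")       ; => (substring)
```

With such a classification, "some GCC thing" no longer falls into the same bucket as "libgccjit", which is the distinction the current ~string=?~ test cannot make.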


> See v2 for my proposal fixing this.
>
> Please note that this v2 gives the same ranking for torbrowser.  It
> also improves the situation with gcc-toolchain.
>
> --8<---------------cut here---------------start------------->8---
> $ ./pre-inst-env guix search gcc | recsel -CP name | grep -n gcc-toolchain
> 1:gcc-toolchain
> 2:gcc-toolchain
> 3:gcc-toolchain
> 4:gcc-toolchain
> 5:gcc-toolchain
> 6:gcc-toolchain
> 7:gcc-toolchain
> 8:gcc-toolchain
> 9:gcc-toolchain
> 10:gcc-toolchain
> 11:gcc-toolchain
> 12:gcc-toolchain
>
> $ ./pre-inst-env guix search tor | recsel -CP name | grep -n torbrowser
> 7:torbrowser
>
> $ ./pre-inst-env guix search dig | recsel -CP name | grep -n bind
> 44:bind
> --8<---------------cut here---------------end--------------->8---
>
> However, inetutils is still at 44th with the single term ’rsh’.  I
> would suggest tweaking the description.

And including a relevant part of your message from #70689:

> Again, considering the case at hand: If instead of 3 randomly picked in
> v2 of #73220, we would pick 7, then inetutils is ranked first.
>
> Yeah, maybe 3 isn’t enough… And maybe 7 is a good choice.

What do you think about setting the value to the sum of all weights in ~metrics~, as I did in patch v1? My reasoning is that an object is almost always relevant if it contains a whole-word match, whereas it is only "maybe relevant" if it matches substrings alone, so it seems reasonable to show most objects with whole-word matches first. This improves or maintains the consistency of relevant results in the test cases with shorter terms, and it also reduces the guesswork of choosing arbitrary numbers that may or may not work.

Note that I also gave the same treatment to exact match scores, although they are not weighted as heavily (only double the whole-word score in v1).

In the case of ~inetutils~, for example, this formula guarantees that if I search for =rsh= (a common subword that nevertheless has a very specific meaning of its own), ~inetutils~ /always/ shows up at or near the top along with other rsh-related packages, assuming no exact matches.

In other words, the intention would be to have the calculations set up such that they implicitly "categorize" object rankings into a (rough) hierarchy of the following:

|--------------------------------------------|
| Objects with at least one exact match      |
|--------------------------------------------|
| Objects with at least one whole word match |
|--------------------------------------------|
| Objects with only substring matches        |
|--------------------------------------------|
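
To make the arithmetic behind this hierarchy concrete, here is a minimal sketch for a single search term.  It assumes the field weights from the arithmetic quoted earlier (4, 2, 1, 3, 2, 1), a substring score of 1, a whole-word score equal to the sum of the weights, an exact score of twice that, and at most one counted match per field.  All names below are hypothetical and the numbers are illustrative rather than copied from the patch.

```scheme
;; Assumed field weights, as in the arithmetic quoted earlier.
(define weights '(4 2 1 3 2 1))

;; Assumed per-match scores: a substring match counts 1, a whole-word
;; match counts the sum of all weights, an exact match twice that.
(define substring-score 1)
(define whole-word-score (apply + weights))   ; 13
(define exact-score (* 2 whole-word-score))   ; 26

;; Each field is counted at most once, so an object's relevance is the
;; weighted sum of one score per field (0 when the field does not match).
(define (relevance scores)
  (apply + (map * weights scores)))

;; Best case for an object with only substring matches: every field matches.
(define best-substring-only
  (relevance (map (lambda (_) substring-score) weights)))   ; 13

;; Worst case for an object with a whole-word match: a single match in
;; the field with the smallest weight.
(define worst-whole-word
  (* (apply min weights) whole-word-score))                 ; 13

;; Worst case for an object with an exact match, likewise.
(define worst-exact
  (* (apply min weights) exact-score))                      ; 26
```

Under these assumptions the weakest whole-word object (13) never ranks below the strongest substring-only object (13), though the two can tie.  Exact matches are not strictly separated from whole-word matches either, since whole-word matches in several fields can add up to more than 26.  That is why I call the hierarchy rough.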

>> I opted to switch to counting a maximum of one match per field, which helps
>> with cases where a common subword matches /many/ times in packages with longer
>> descriptions, pushing more relevant packages down.  In multi-term searches,
>> the unique terms - which are naturally rarer - also contribute to a larger
>> percentage of the score as a result of these changes.
>> Having matches with only one word boundary be scored as 2 instead of 1 was
>> done with the reasoning that a term is more likely to be part of a compound
>> word name (and thus more relevant) if it is a prefix or suffix; for example,
>> "gl" in OpenGL, "borg" in borgmatic, and "tor" in torbrowser.
> [...]
>
>> Closing this message on an unrelated note for future work: I stumbled on an
>> interesting idea while looking for test cases which suggested reducing the
>> score of a programming library when its language is not included in search
>> terms.  It's out of scope for the current issue, but I thought I'd mention it
>> anyways for potential further improvements.
> Well, years ago I thought about implementing TF-IDF [2,3].  Other ideas
> [4] are floating around.  Then, we spent some time making “guix
> search” faster [5] and today my TODO is about having an extension
> relying on Guile-Xapian.
>
> Therefore, I would prefer to keep the ’relevance’ more or less predictable
> by only counting the number of occurrences and applying some weights.
> Otherwise, for what my opinion is worth, the direction would not be to
> re-invent an algorithm but maybe to implement some already well-known ones.
> TF-IDF [3] is one, Okapi-BM25 is another, etc.  All in all,
> that is what Xapian provides. ;-) And it does it very well!  That’s why I
> would be tempted to have a Guix extension relying on Guile-Xapian for
> indexing and searching (fast!).

Yes, I had thought about trying something like TF-IDF while looking into the issue, but it seemed much less trivial than changes to a scoring function. The count-once-per-field change was supposed to at least tangentially mimic this behavior and reduce bias towards objects that happen to have very long descriptions but aren't very relevant. It's also needed for my "categorization" math to hold.
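
As a small illustration of that bias, the sketch below compares counting every occurrence with counting at most once per field.  The two description strings are made up for the example, and ~count-all~ and ~count-once~ are hypothetical names; this is not the patch code.

```scheme
(use-modules (ice-9 regex))

(define regexp (make-regexp "tor" regexp/icase))

;; Count every occurrence of the term in STR.
(define (count-all str)
  (fold-matches regexp str 0 (lambda (m n) (+ n 1))))

;; Count at most one occurrence per field.
(define (count-once str)
  (if (regexp-exec regexp str) 1 0))

;; Made-up descriptions for illustration only.
(define long-text
  "A storage vector library: store, restore and monitor factory data.")
(define short-text
  "Anonymizing overlay network (Tor).")

(count-all long-text)    ; => 6
(count-all short-text)   ; => 1
(count-once long-text)   ; => 1
(count-once short-text)  ; => 1
```

When every occurrence counts, the unrelated but wordy description wins six to one; when each field counts once, both start level and the kind of match decides.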

> Hum, why this:
>
>     guix search ' dig$' dig | recsel -p name,relevance | head -8
>
> does not return the package ’bind’?

It appears the ~regexp/newline~ flag needs to be set for ~make-regexp~. A quick test adding it here [1] seemed to work.
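
The effect of the flag can be reproduced outside of Guix.  This is a minimal sketch with a made-up two-line description, assuming only that descriptions may contain newlines; it does not reproduce the code at [1].

```scheme
(use-modules (ice-9 regex))

;; Made-up two-line description; "dig" ends the first line, not the string.
(define description
  "DNS tools such as dig\nand nslookup.")

;; Without regexp/newline, `$' only matches at the very end of the string.
(define without-newline (make-regexp " dig$" regexp/icase))

;; With regexp/newline, `$' also matches right before a newline.
(define with-newline (make-regexp " dig$" regexp/icase regexp/newline))

(regexp-exec without-newline description)   ; => #f
(match:substring
 (regexp-exec with-newline description))    ; => " dig"
```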


My main concern with v2 is that I don't think whole words are weighted heavily enough, but it provides a simpler solution that still offers an improvement, so I'm happy either way.

Thanks for the feedback!

[1] https://git.savannah.gnu.org/cgit/guix.git/tree/guix/scripts/package.scm#n897

Cheers,

aurtzy

--------------21s0DpniSld8nE19l6UNkpfL--