From: bokr@bokr.com
To: Richard Sent <richard@freakingpenguin.com>
Cc: 70689@debbugs.gnu.org
Subject: bug#70689: guix search doesn't weigh word matches higher than subword matches
Date: Wed, 1 May 2024 15:45:05 +0200
Message-ID: <20240501134505.GA10144@LionPure>
In-Reply-To: <87bk5qcm1w.fsf@freakingpenguin.com>
On +2024-04-30 22:18:03 -0400, Richard Sent wrote:
> Hi Guix!
>
> When running guix search, relevance in the synopsis and description
> fields is computed strictly by the number of matches, whether as a word
> or as a subword. Ideally, if a search string matches an isolated word in
> a field, that result should be considered more relevant than one that
> merely matches a subword, even multiple times.
>
> To illustrate, imagine trying to find what package provides the `rsh`
> binary and running `$ guix search rsh`. This binary is part of
> `inetutils` and the description field contains:
>
> > Inetutils is a collection of common network programs, such as an ftp
> > client and server, a telnet client and server, an rsh client and
> > server, and hostname.
>
> Most likely, this is what the user is interested in. However, inetutils
> does not show up until roughly the 75th result with a relevance of 2
> (the lowest possible relevance).
>
> Almost every search result beforehand contains the string "rsh" as a
> component of another word, such as "marshaling", "powershell", and
> "hershey". However, these match multiple times and are weighted
> significantly higher.
>
> Ideally, guix search should rate inetutils higher because the string
> "rsh" occurs as its own word, not as a component of another, unrelated
> word. (Very, very few people would search "rsh" looking for matches with
> "hershey", even if "hershey" occurs multiple times.)
>
> Another example of where this can happen is with "dig", part of the bind
> package. Searching for "dig" returns garbage because "dig" is a common
> subword. Bind is scored with a relevance of 2, even though bind's
> description emphasises that dig is part of it.
>
> This would improve the experience when searching with strings that
> commonly occur as subwords.
>
> Since this change can't occur in a vacuum, care should be taken not to
> reduce the effectiveness of other reasonably foreseeable search queries.
>
> --
> Take it easy,
> Richard Sent
> Making my computer weirder one commit at a time.
>
>
>
I like your proposal :)
I'm wondering how [1] compares in what it does for your use(ful) case.
(I am not familiar with Hyper Estraier beyond being prompted for gnu.org searching)
[1] <https://directory.fsf.org/wiki/Hyper_Estraier>
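
For concreteness, here is a minimal sketch of the ranking described in the
quoted report. It is Python rather than Guile and is not how guix search is
implemented; the names match_counts and rank, and the package
"hypothetical-pkg" with its description, are made up for illustration.
Results are ordered first by whole-word hits and only then by subword hits,
so a single whole-word match outranks any number of subword matches.

```python
import re


def match_counts(term, text):
    """Return (whole_word_hits, subword_hits) for term in text, ignoring case."""
    escaped = re.escape(term)
    # Every occurrence of the term, including those inside longer words.
    total = len(re.findall(escaped, text, flags=re.IGNORECASE))
    # Occurrences delimited by word boundaries on both sides.
    whole = len(re.findall(rf"\b{escaped}\b", text, flags=re.IGNORECASE))
    return whole, total - whole


def rank(term, descriptions):
    """Sort package names so whole-word hits dominate subword hits."""
    return sorted(
        descriptions,
        key=lambda name: match_counts(term, descriptions[name]),
        reverse=True,
    )


descriptions = {
    # Excerpt of the inetutils description quoted above.
    "inetutils": "a telnet client and server, an rsh client and server, and hostname.",
    # Invented entry that only matches "rsh" inside other words.
    "hypothetical-pkg": "Marshaling helpers for powershell; more marshaling.",
}

print(rank("rsh", descriptions))
```

With these two entries, inetutils sorts first even though the other
description contains "rsh" three times as a subword. A weighted sum would
be a softer alternative, at the cost of having to tune the weights.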
--
Regards,
Bengt Richter