From: Simon Tournier <zimon.toutoune@gmail.com>
To: Richard Sent <richard@freakingpenguin.com>
Cc: aurtzy <aurtzy@gmail.com>, 70689@debbugs.gnu.org
Subject: bug#70689: guix search doesn't weigh word matches higher than subword matches
Date: Fri, 13 Sep 2024 17:08:19 +0200
Message-ID: <877cbfvbfg.fsf@gmail.com>
In-Reply-To: <87bk5qcm1w.fsf@freakingpenguin.com> (Richard Sent's message of "Tue, 30 Apr 2024 22:18:03 -0400")

Hi,

On Tue, 30 Apr 2024 at 22:18, Richard Sent <richard@freakingpenguin.com> wrote:

>> Inetutils is a collection of common network programs, such as an ftp
>> client and server, a telnet client and server, an rsh client and
>> server, and hostname.
>
> Most likely, this is what the user is interested in. However, inetutils
> does not show up until roughly the ~75th result with a relevance of 2
> (the lowest possible relevance).

Using Guix 056910e, I get:

    $ guix search rsh | recsel -CP name | grep -n inetutils
    76:inetutils

Then using the proposed v2 patch#73220 [1], I get:

    $ ./pre-inst-env guix search rsh | recsel -CP name | grep -n inetutils
    34:inetutils

Well, that’s not perfect but a bit better.


> Almost every search result beforehand contains the string "rsh" as a
> component of another word, such as "marshaling", "powershell", and
> "hershey". However, these match multiple times and are weighted
> significantly higher.

Well, if we consider the current implementation, the relevance score of
the highest-ranked result reads:

          4 * 0       name
        + 2 * 0       upstream-name
        + 1 * 0       outputs
        + 3 * 2 * 1   synopsis
        + 2 * 4 * 1   description
        + 1 * 0       file-name
        = 14

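where each line reads: field-weight * match * weight-match, i.e., the
weight of the field, times the number of matches, times the weight of
each match.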

Compared to inetutils:

          4 * 0       name
        + 2 * 0       upstream-name
        + 1 * 0       outputs
        + 3 * 0       synopsis
        + 2 * 1 * 1   description
        + 1 * 0       file-name
        = 2
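
To make the arithmetic concrete, here is a rough sketch in Python.  It is
not the Guix code (which is Guile Scheme); the names FIELD_WEIGHTS and
relevance are made up for illustration, and the match counts are simply
copied from the two breakdowns above.

```python
# Field weights as listed in the breakdowns above.
FIELD_WEIGHTS = {
    "name": 4,
    "upstream-name": 2,
    "outputs": 1,
    "synopsis": 3,
    "description": 2,
    "file-name": 1,
}


def relevance(matches, match_weight=1):
    """Sum field-weight * number-of-matches * match-weight over all fields.

    `matches` maps a field name to its number of matches; a field that is
    absent from the mapping counts as 0 matches.
    """
    return sum(weight * matches.get(field, 0) * match_weight
               for field, weight in FIELD_WEIGHTS.items())


# Match counts taken from the two breakdowns above.
top_result = relevance({"synopsis": 2, "description": 4})
inetutils = relevance({"description": 1})
print(top_result, inetutils)  # 14 2
```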

Well, this case cannot be improved much.  First, the field-weights are
almost optimal [2].  Second, the number of occurrences depends on the
description; maybe it could be improved, I have not checked yet.

And v2 of #73220 replaces the value of weight-match: the term ’rsh’ in
“an rsh client” should have a higher score than in “uses `json.Marshal'
and `json.Unmarshal'”.

In other words, it reads:

          4 * 0       name
        + 2 * 0       upstream-name
        + 1 * 0       outputs
        + 3 * 0       synopsis
        + 2 * 1 * 3   description
        + 1 * 0       file-name
        = 6
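
As a rough illustration of the idea (a sketch in Python, not the code of
the patch), one way to tell a whole-word match from a subword match is to
look at the characters around each occurrence.  The function name
match_weights and its parameters are hypothetical; the weights 3 and 1 and
the description field-weight of 2 are the ones from the breakdowns above.

```python
import re


def match_weights(term, text, word_weight=3, subword_weight=1):
    """Return one weight per occurrence of `term` in `text`.

    An occurrence with no letter or digit directly before or after it is
    treated as a whole-word match; any other occurrence is a subword match.
    """
    weights = []
    for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        before = text[m.start() - 1] if m.start() > 0 else " "
        after = text[m.end()] if m.end() < len(text) else " "
        is_word = not (before.isalnum() or after.isalnum())
        weights.append(word_weight if is_word else subword_weight)
    return weights


DESCRIPTION_WEIGHT = 2  # field-weight of the description, as above

inetutils = "a telnet client and server, an rsh client and server, and hostname."
marshal = "uses `json.Marshal' and `json.Unmarshal'"

print(DESCRIPTION_WEIGHT * sum(match_weights("rsh", inetutils)))  # 6
print(DESCRIPTION_WEIGHT * sum(match_weights("rsh", marshal)))    # 4
```

With word_weight=7 instead of 3, the inetutils excerpt scores 2 * 1 * 7 = 14.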

I think this addresses your suggestion.


> Ideally, guix search should rate inetutils higher because the string
> "rsh" occurs as its own word, not as a component of another, unrelated
> word. (Very, very people would search "rsh" looking for matches with
> "hershey", even if "hershey" occurs multiple times.)

Again, considering the case at hand: if, instead of the 3 arbitrarily
picked in v2 of #73220, we picked 7, then inetutils would be ranked first.

Yeah, maybe 3 isn’t enough… And maybe 7 is a good choice.

Do you have examples other than ’rsh’?


> Another example of where this can happen is with "dig", part of the bind
> package. Searching for "dig" returns garbage because "dig" is a common
> subword. Bind is scored with a relevance of 2, even though bind's
> description emphasises that dig is part of it.

Please note that using v2 of #73220 with a weight of 7, the package is
returned “third”: a relevance of 14 (behind 24 and 20).

However, it appears 8th in the list because the order of packages
having the same relevance score is arbitrary.  It just depends on how
the modules are walked.  Therefore, we cannot do much, IMHO.
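
A toy illustration in Python of why ties end up in traversal order.  The
package names other than bind and the order of the input list are
invented, and I am assuming here that entries with equal relevance keep
the order in which they were found:

```python
# Hypothetical (name, relevance) pairs, in the order a module walk might
# yield them.  Only "bind" and the scores 24, 20 and 14 come from the
# discussion above; everything else is made up.
found = [
    ("pkg-a", 14),
    ("bind", 14),
    ("pkg-b", 24),
    ("pkg-c", 20),
    ("pkg-d", 14),
]

# Python's sort is stable, also with reverse=True: entries with the same
# relevance keep their input order.
by_relevance = sorted(found, key=lambda entry: entry[1], reverse=True)
print([name for name, _ in by_relevance])
# ['pkg-b', 'pkg-c', 'pkg-a', 'bind', 'pkg-d']
```

Here bind has the third-highest score but is listed fourth, only because
pkg-a happened to be found first.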


Cheers,
simon


1: https://issues.guix.gnu.org/73220#1

2: Re: Search improvements (Was: Opposition to new single-letter package name "t")
zimoun <zimon.toutoune@gmail.com>
Tue, 09 Mar 2021 19:37:23 +0100
id:CAJ3okZ3+hn0nJP98OhnZYLWJvhLGpdTUK+jB0hoM5JArQxO=zw@mail.gmail.com
https://lists.gnu.org/archive/html/guix-devel/2021-03
https://yhetil.org/guix/CAJ3okZ3+hn0nJP98OhnZYLWJvhLGpdTUK+jB0hoM5JArQxO=zw@mail.gmail.com



