From mboxrd@z Thu Jan  1 00:00:00 1970
From: zimoun <zimon.toutoune@gmail.com>
Subject: Re: Organizing packages
Date: Tue, 16 Jul 2019 19:04:26 +0200
Message-ID: <CAJ3okZ0LaJzWDBA7bjqZew_jAmtt1rj9PJhevwrtBiA_COXENg@mail.gmail.com>
References: <20190709100732.3f760245@gmail.com> <87ef2s4t8d.fsf@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Return-path: <guix-devel-bounces+gcggd-guix-devel=m.gmane.org@gnu.org>
Received: from eggs.gnu.org ([2001:470:142:3::10]:35020)
 by lists.gnu.org with esmtp (Exim 4.86_2)
 (envelope-from <zimon.toutoune@gmail.com>) id 1hnQsc-0000Ls-IL
 for guix-devel@gnu.org; Tue, 16 Jul 2019 13:04:43 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <zimon.toutoune@gmail.com>) id 1hnQsb-0000av-2I
 for guix-devel@gnu.org; Tue, 16 Jul 2019 13:04:42 -0400
In-Reply-To: <87ef2s4t8d.fsf@gnu.org>
List-Id: "Development of GNU Guix and the GNU System distribution."
 <guix-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/guix-devel>,
 <mailto:guix-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/guix-devel>
List-Post: <mailto:guix-devel@gnu.org>
List-Help: <mailto:guix-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/guix-devel>,
 <mailto:guix-devel-request@gnu.org?subject=subscribe>
Errors-To: guix-devel-bounces+gcggd-guix-devel=m.gmane.org@gnu.org
Sender: "Guix-devel" <guix-devel-bounces+gcggd-guix-devel=m.gmane.org@gnu.org>
To: =?UTF-8?Q?Ludovic_Court=C3=A8s?= <ludo@gnu.org>
Cc: Guix Devel <guix-devel@gnu.org>, Jesse Gibbons <jgibbons2357@gmail.com>

Dear,

On Sun, 14 Jul 2019 at 15:54, Ludovic Courtès <ludo@gnu.org> wrote:

> > I think this will make searching easier because not everything has an
> > obvious name, and when I `guix search` for a purpose (like drawing) I
> > often get unrelated results.
>
> I don’t think the module hierarchy should be thought of as a tool for
> users to search for packages.

I totally agree. :-)


> So really, ‘guix search’ is the tool that should be improved.  It’s been
> discussed many times, and improving it turns out to be difficult without
> resorting to external sources of information (e.g., list of command
> names, popularity database, etc.)
>
> What we can do is look at specific examples to see if there’s something
> we can improve on the current scoring system (with the understanding
> that sometimes the answer is that we cannot do any better.)
>
> For example, ‘guix search drawing program’ shows Tux Paint as the first
> result, which is good; but ‘guix search drawing’ and ‘guix search
> drawing application’ are much less useful.  In this particular example,
> it’s not clear to me what can be done.
>
> One suggestion that was made before and that might help here is to
> increase the score of leaf packages (applications).

One of the current issues is that the score is not normalized in any way.

The current score is computed by counting the occurrences of the search
term in each field (name, synopsis, description, etc.) and weighting them
per field (see `%package-metrics` in guix/ui.scm).
For instance, `guix search drawing` ranks `texlive-latex-eepic` first
because the word `drawing` appears 4 times in its description, `tuxpaint`
second because it appears 2 times, and `xfig` third (1 time).
What I would expect (IMHO) is that these 3 packages get the same
score. Therefore, some kind of normalization seems to be
missing. But which one? :-)
Leaf packages should also score higher than non-leaf packages. For
instance, `xfig` should rank higher than `libart-lgpl`.
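
To make the normalization problem concrete, here is a minimal sketch of a
weighted occurrence count in Python. The weights and the two package
records are made up for illustration; the real weights are the ones in
`%package-metrics` and may differ, and `relevance` is a hypothetical
helper, not the actual Guix code.

```python
import re

# Hypothetical per-field weights, for illustration only; the real values
# live in `%package-metrics` in guix/ui.scm and may differ.
FIELD_WEIGHTS = {"name": 4, "synopsis": 3, "description": 2}


def relevance(package, term):
    """Sum over fields of (field weight) * (matches of `term` in the field)."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return sum(
        weight * len(pattern.findall(package.get(field, "")))
        for field, weight in FIELD_WEIGHTS.items()
    )


# Made-up package records, not actual Guix metadata.
short = {
    "name": "foo",
    "synopsis": "Drawing tool",
    "description": "A drawing tool.",
}
verbose = {
    "name": "bar",
    "synopsis": "Macros for figures",
    "description": "Drawing lines, drawing circles, drawing arcs "
                   "and drawing splines.",
}

print(relevance(short, "drawing"))    # 3*1 + 2*1 = 5
print(relevance(verbose, "drawing"))  # 2*4 = 8
```

With a raw count, the record that merely repeats the word most often comes
first, whatever the length of its text.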

Even so, this scoring system leaves little room for improvement for
single-word searches.


Moreover, nothing can help with badly written descriptions.

For example, you already need to know that a `roguelike` is a kind of
`game` to make sense of the description of the package `angband`.

  guix package --show=angband | recsel -C -p synopsis,description

synopsis: Dungeon exploration roguelike
description: Angband is a Classic dungeon exploration roguelike.  Explore the
+ depths below Angband, seeking riches, fighting monsters, and preparing to
+ fight Morgoth, the Lord of Darkness.

In my opinion, the issue is that the description is not good enough
to provide any relevant information usable by a search tool. And, again
in my opinion, adding tags or sorting the packages into
relevant files---which is a way of tagging---seems the wrong approach.
For example, where should VLC go? In video.scm? But is it closer to
ffmpeg-for-stepmania or to Totem, which is in gnome.scm?
IMHO, bikeshedding cannot improve package search. :-)


However, some kind of tf-idf [1] could be used to let the packages
organize themselves better when searching.

[1] https://en.wikipedia.org/wiki/Tf%E2%80%93idf


For example, I have 10146 package definitions:
  guix search ' ' | recsel -P name -C | wc -l
  10146
and 46 contain the word 'drawing'.
So, the Inverse-Document-Frequency is:
 IDF(drawing) = log(10146 / 46)
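
As a quick check of this value, here is a small Python sketch. It assumes
the natural logarithm, which is what the TF-IDF figures further down in
this message are consistent with; the variable names are only for
illustration.

```python
import math

n_packages = 10146  # package definitions counted above
n_matching = 46     # packages whose record contains `drawing`

# Inverse document frequency, using the natural logarithm (assumption).
idf_drawing = math.log(n_packages / n_matching)
print(f"IDF(drawing) = {idf_drawing:.3f}")
```

The rarer the term among the packages, the larger this value.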

Let us consider the three most relevant packages (according to the current
score).
The term `drawing` appears on this many lines of each record (`grep -c`
counts matching lines):
   for pkg in $(guix search drawing | recsel -C -P name | head -n3);\
   do\
      echo "$pkg" ; guix package --show="$pkg" | grep -c drawing ;\
   done

FREQ(drawing, texlive-latex-eepic) = 5
FREQ(drawing, tuxpaint) = 2
FREQ(drawing, xfig) = 2

Let us normalize by the length of the document:
   for pkg in $(guix search drawing | recsel -C -P name | head -n3);\
   do\
      echo "$pkg" ; guix package --show="$pkg" \
      | recsel -P synopsis,description | wc -w ;\
   done

LEN(texlive-latex-eepic) = 68
LEN(tuxpaint) = 60
LEN(xfig) = 76

One possible definition of the Term-Frequency is then:

TF(drawing, texlive-latex-eepic) = 5 / 68
TF(drawing, tuxpaint) = 2 / 60
TF(drawing, xfig) = 2 / 76


The TF-IDF reads:

TF-IDF(drawing, texlive-latex-eepic) = 5/68*log(10146/46) = 0.3968
TF-IDF(drawing, tuxpaint) = 2/60*log(10146/46) = 0.1799
TF-IDF(drawing, xfig) = 2/76*log(10146/46) = 0.1420
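
The three values can be reproduced with a few lines of Python. This is
only a sketch of the arithmetic: the counts and lengths are the ones
measured above, the logarithm is taken as the natural one, and the names
`measurements` and `tf_idf` are mine.

```python
import math

N_PACKAGES = 10146     # total number of package definitions
N_WITH_DRAWING = 46    # packages containing `drawing`

# package name -> (FREQ(drawing, pkg), LEN(pkg)) as measured above
measurements = {
    "texlive-latex-eepic": (5, 68),
    "tuxpaint": (2, 60),
    "xfig": (2, 76),
}

idf = math.log(N_PACKAGES / N_WITH_DRAWING)  # natural logarithm
tf_idf = {
    name: freq / length * idf
    for name, (freq, length) in measurements.items()
}

# Print the packages from the highest to the lowest TF-IDF.
for name, score in sorted(tf_idf.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:22s} {score:.4f}")
```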


This does not change the current ranking much, but it makes it possible
to tell which words are good filters.

Let us consider the word `program` and the package `tuxpaint`.
The current relevance score is 5 for `program`. The term appears 2
times (note that the synopsis says `software`, which should be
replaced by `program`).
The current relevance score is 7 for `drawing`. The term also appears 2 times.
The difference is only due to the per-field weights.

However, the TF-IDF values are very different:

TF-IDF(drawing, tuxpaint) = 2/60*log(10146/46) = 0.1799
TF-IDF(program, tuxpaint) = 2/60*log(10146/1056) = 0.0754

In other words, the term `drawing` carries more information than the term
`program` for the package tuxpaint.
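
Since both terms have the same term frequency for tuxpaint (2/60), the
whole difference comes from the IDF factor. A small Python sketch of that
comparison, using the package counts quoted above and assuming the natural
logarithm (the names `doc_freq` and `scores` are only illustrative):

```python
import math

N_PACKAGES = 10146
# Number of packages containing each term, as used in the formulas above.
doc_freq = {"drawing": 46, "program": 1056}

# Both terms appear 2 times in the 60-word tuxpaint text.
tf = 2 / 60

scores = {}
for term, df in doc_freq.items():
    idf = math.log(N_PACKAGES / df)  # natural logarithm
    scores[term] = tf * idf
    print(f"{term:8s} IDF = {idf:.3f}  TF-IDF = {scores[term]:.4f}")

# How much more weight `drawing` gets than `program` for this package.
ratio = scores["drawing"] / scores["program"]
print(f"ratio = {ratio:.2f}")
```

A term found in about a tenth of all packages, like `program`, filters
much less than one found in 46 of them.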


In my opinion, text mining or search engine strategies seem a better
approach to improving `guix search` than making the file tree more rigid.
And some data mining to see how the packages cluster (depending on the
metric) would help to first understand how to improve `guix
search`.
I do not know if my words make sense.

All the best,
simon