Organizing packages

unofficial mirror of guix-devel@gnu.org 

* Organizing packages
@ 2019-07-09 16:07 Jesse Gibbons
  2019-07-14 13:54 ` Ludovic Courtès
  0 siblings, 1 reply; 9+ messages in thread
From: Jesse Gibbons @ 2019-07-09 16:07 UTC (permalink / raw)
  To: guix-devel

I noticed that a few files have only one package definition and are
named for that package. I think these packages can be organized better.
Might I suggest the following rules:

1. If a package is a library for a particular language $LANG (like
Python, Perl, etc.), it goes in ${LANG}-xyz.scm. If it is a library built
for a particular $PURPOSE, it may go into ${LANG}-${PURPOSE}.scm with
the other libraries for that purpose.

2. If the package defines a compiler or interpreter for a language
$LANG, it may go into ${LANG}.scm

3. If the package is part of a large divisible project $PROJ like gcc or
texlive, it may go into ${PROJ}.scm

4. If the package is maintained as part of a large desktop environment
$DE like GNOME or KDE, it may be put in ${DE}.scm

5. When in doubt, the package must go into a file named after its
$PURPOSE, ${PURPOSE}.scm. For example, if the package is a game (like
supertuxracer), it goes into games.scm; if it is for undirected
fun (like sl), it goes into toys.scm; if it is for audio
control or audio production, it goes into audio.scm; if it is for
drawing or producing graphics, it goes into graphics.scm; etc. Projects
that can be described with multiple purposes (like fortune) may go into
any of those files.
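
To make the five rules concrete, here is a minimal Python sketch of them as
a decision procedure. It assumes the rules are tried in the order listed,
which the proposal does not state explicitly. The function `target_file`
and its parameters are hypothetical names made up for illustration, and
"xyz" simply mirrors the placeholder used in rule 1.

```python
from typing import Optional


def target_file(purpose: str,
                library_for: Optional[str] = None,
                library_purpose: Optional[str] = None,
                implements_language: Optional[str] = None,
                project: Optional[str] = None,
                desktop: Optional[str] = None) -> str:
    """Return the file suggested by rules 1 to 5, tried in that order."""
    if library_for:                      # rule 1: language library
        return f"{library_for}-{library_purpose or 'xyz'}.scm"
    if implements_language:              # rule 2: compiler or interpreter
        return f"{implements_language}.scm"
    if project:                          # rule 3: large divisible project
        return f"{project}.scm"
    if desktop:                          # rule 4: desktop environment
        return f"{desktop}.scm"
    return f"{purpose}.scm"              # rule 5: fall back to the purpose


# Examples taken from the rules above.
print(target_file("toys"))                       # a toy such as sl
print(target_file("games"))                      # a game
print(target_file("typesetting", project="texlive"))
```

A package with several plausible purposes (like fortune) would still need a
human decision about which `purpose` to pass in.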

I think this will make searching easier because not everything has an
obvious name, and when I `guix search` for a purpose (like drawing) I
often get unrelated results.

If we follow these rules:

- Most packages will remain in place, especially libraries upon which
  plenty of packages depend.

- Packages that are placed in their own files should be moved to a file
  that describes their function. For example:
  - abduco.scm only defines abduco, a program that lets child processes
    of a shell become independent. It might fit into
    task-management.scm or moreutils.scm.
  - abiword.scm only defines abiword, a word processing program. It may
    be placed into a new file office.scm.
  - acct.scm only defines acct, which based on its description might fit
    into admin.scm
  - acl.scm only defines acl, which based on its functionality fits into
    admin.scm.
  - anthy.scm only defines anthy, which is useful for Japanese language
    input. It can fit into language.scm.
  I hope you can see the pattern, and perhaps you can find better places
  for some of these packages. By my count, there are currently over 140
  of these files that define only one package.

- Files that define multiple packages and are named after one app might
  be split into multiple pieces by function. For example,
  libreoffice.scm defines, among other packages, libreoffice, which can
  go into office.scm, and multiple dictionaries, which can be moved
  into the already-existing dictionaries.scm. To me this is lower
  priority than one-package files.

- When a package is moved, the associated copyright information should
  be copied with it. This might make it difficult to split files like
  libreoffice.scm. I do not think the people who move packages should
  add their copyright information because they are not really adding new
  code, but I leave this detail to community discussion.

- Ideally, packages would be alphabetized within their respective
  files. This might not be practical.

- Services should likewise be organized by function, with file names
  consisting of full words. Although there is not much to do on this
  front, it will likely break a lot of OS configurations and make the
  manual inaccurate. Therefore it is very low priority to me.

Feel free to tell me what you think of this suggestion. If the
maintainers and the community like this idea, I will personally work
on carrying it out. I am also ready to openly discuss particular
packages, either in response to this or when I send a patch.

-Jesse


* Re: Organizing packages
  2019-07-09 16:07 Organizing packages Jesse Gibbons
@ 2019-07-14 13:54 ` Ludovic Courtès
  2019-07-15 17:21   ` Jesse Gibbons
  2019-07-16 17:04   ` zimoun
  0 siblings, 2 replies; 9+ messages in thread
From: Ludovic Courtès @ 2019-07-14 13:54 UTC (permalink / raw)
  To: Jesse Gibbons; +Cc: guix-devel

Hello!

Jesse Gibbons <jgibbons2357@gmail.com> skribis:

> I noticed that a few files have only one package definition and are
> named for that package. I think these packages can be organized better.
> Might I suggest the following rules:
>
> 1. if a package is a library for a particular language $LANG (like
> Python, Perl, etc.) it goes in ${LANG}-xyz.scm. If it is a library built
> for a particular PURPOSE, it may go into LANG-PURPOSE.scm with those
> packages.
>
> 2. If the package defines a compiler or interpreter for a language
> $LANG, it may go into ${LANG}.scm
>
> 3. If the package is part of a large divisible project $PROJ like gcc or
> texlive, it may go into ${PROJ}.scm
>
> 4. If the package is maintained as part of a large desktop environment
> $DE like GNOME or KDE, it may be put in ${DE}.scm
>
> 5. When in doubt, the package must go into a file named after its
> $PURPOSE, ${PURPOSE}.scm. For example, if the package is a game (like
> supertuxracer), it goes into games.scm; if it is for undirected
> fun (like sl), it goes into toys.scm; if it is for audio
> control or audio production, it goes into audio.scm; if it is for
> drawing or producing graphics, it goes into graphics.scm; etc. Projects
> that can be described with multiple purposes (like fortune) may go into
> any of those files.

I had experience with Nixpkgs, which has a decision tree for where to
put packages:

  https://nixos.org/nixpkgs/manual/#sec-hierarchy

In the end I didn’t find it to be helpful in any way: you’d always have
to open ‘top-level/all-packages’, a file that lists all the packages, to
find out where the package you’re looking for lives.

I believe ‘guix edit’ largely solves that (along with Helm or similar
editor support for grepping).

> I think this will make searching easier because not everything has an
> obvious name, and when I `guix search` for a purpose (like drawing) I
> often get unrelated results.

I don’t think the module hierarchy should be thought of as a tool for
users to search for packages.

So really, ‘guix search’ is the tool that should be improved.  It’s been
discussed many times, and improving it turns out to be difficult without
resorting to external sources of information (e.g., list of command
names, popularity database, etc.)

What we can do is look at specific examples to see if there’s something
we can improve on the current scoring system (with the understanding
that sometimes the answer is that we cannot do any better.)

For example, ‘guix search drawing program’ shows Tux Paint as the first
result, which is good; but ‘guix search drawing’ and ‘guix search
drawing application’ are much less useful.  In this particular example,
it’s not clear to me what can be done.

One suggestion that was made before and that might help here is to
increase the score of leaf packages (applications).

Food for thought!

Ludo’.


* Re: Organizing packages
  2019-07-14 13:54 ` Ludovic Courtès
@ 2019-07-15 17:21   ` Jesse Gibbons
  2019-07-15 17:38     ` Robert Vollmert
  2019-07-15 21:37     ` Ricardo Wurmus
  2019-07-16 17:04   ` zimoun
  1 sibling, 2 replies; 9+ messages in thread
From: Jesse Gibbons @ 2019-07-15 17:21 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guix-devel

On Sun, 14 Jul 2019 15:54:10 +0200
Ludovic Courtès <ludo@gnu.org> wrote:

> Hello!
> 
> Jesse Gibbons <jgibbons2357@gmail.com> skribis:
> 
> > I noticed that a few files have only one package definition and are
> > named for that package. I think these packages can be organized
> > better. Might I suggest the following rules:
> >
> > 1. if a package is a library for a particular language $LANG (like
> > Python, Perl, etc.) it goes in ${LANG}-xyz.scm. If it is a library
> > built for a particular PURPOSE, it may go into LANG-PURPOSE.scm
> > with those packages.
> >
> > 2. If the package defines a compiler or interpreter for a language
> > $LANG, it may go into ${LANG}.scm
> >
> > 3. If the package is part of a large divisible project $PROJ like
> > gcc or texlive, it may go into ${PROJ}.scm
> >
> > 4. If the package is maintained as part of a large desktop
> > environment $DE like GNOME or KDE, it may be put in ${DE}.scm
> >
> > 5. When in doubt, the package must go into a file named after its
> > $PURPOSE, ${PURPOSE}.scm. For example, if the package is a game
> > (like supertuxracer), it goes into games.scm; if it is for
> > undirected fun (like sl), it goes into toys.scm; if it is for audio
> > control or audio production, it goes into audio.scm; if it is for
> > drawing or producing graphics, it goes into graphics.scm; etc.
> > Projects that can be described with multiple purposes (like
> > fortune) may go into any of those files.  
> 
> I had experience with Nixpkgs, which has a decision tree for where to
> put packages:
> 
>   https://nixos.org/nixpkgs/manual/#sec-hierarchy
> 
> In the end I didn’t find it to be helpful in any way: you’d always
> have to open ‘top-level/all-packages’, a file that lists all the
> packages, to find out where the package you’re looking for lives.
> 
> I believe ‘guix edit’ greatly solves that (along with Helm or similar
> editor support for grepping.)
> 
Interesting. So is it worth trying to organize the guix packages or do
you think it will get too complicated? I'm primarily bothered by the
number of small files with only one package definition and the
inconsistency in how packages are organized. I would rather a file have
multiple package definitions that make sense together than a hundred
files with only one package definition.

> > I think this will make searching easier because not everything has
> > an obvious name, and when I `guix search` for a purpose (like
> > drawing) I often get unrelated results.  
This was an afterthought.

> 
> I don’t think the module hierarchy should be thought of as a tool for
> users to search for packages.
> 
> So really, ‘guix search’ is the tool that should be improved.  It’s
> been discussed many times, and improving it turns out to be difficult
> without resorting to external sources of information (e.g., list of
> command names, popularity database, etc.)
I was thinking this would help `guix search`. For example, if I try
`guix search game` a lot of the leaf packages in games.scm are ranked
with relevance level 1 because they do not have the word "game" in
their synopsis or description. I would expect them to have a higher
relevance (8 at the very minimum) because of their placement in
games.scm. I do not think these packages would be listed at all if they
were not in games.scm.

Hypothetically, if someone decided to define a package for the tuxemon
RPG in a new file "tuxemon.scm" and did not mention the word "game" in
the synopsis or description, it would not be listed in the `guix search
game` results at all. If it were placed into games.scm, then it would at
least come up in the results.

> 
> What we can do is look at specific examples to see if there’s
> something we can improve on the current scoring system (with the
> understanding that sometimes the answer is that we cannot do any
> better.)
> 
> For example, ‘guix search drawing program’ shows Tux Paint as the
> first result, which is good; but ‘guix search drawing’ and ‘guix
> search drawing application’ are much less useful.  In this particular
> example, it’s not clear to me what can be done.
> 
> One suggestion that was made before and that might help here is to
> increase the score of leaf packages (applications).
> 
> Food for thought!
> 
> Ludo’.
I have ideas about how to resolve this and other issues regarding `guix
search`, but perhaps they are best explained in bug reports or other
guix-devel discussions.

-Jesse


* Re: Organizing packages
  2019-07-15 17:21   ` Jesse Gibbons
@ 2019-07-15 17:38     ` Robert Vollmert
  2019-07-15 20:15       ` Jesse Gibbons
  2019-07-15 21:37     ` Ricardo Wurmus
  1 sibling, 1 reply; 9+ messages in thread
From: Robert Vollmert @ 2019-07-15 17:38 UTC (permalink / raw)
  To: Jesse Gibbons; +Cc: guix-devel



> On 15. Jul 2019, at 19:21, Jesse Gibbons <jgibbons2357@gmail.com> wrote:
> 
> On Sun, 14 Jul 2019 15:54:10 +0200
> Ludovic Courtès <ludo@gnu.org> wrote:
> 
>> Hello!
>> 
>> Jesse Gibbons <jgibbons2357@gmail.com> skribis:
>> 
>>> I noticed that a few files have only one package definition and are
>>> named for that package. I think these packages can be organized
>>> better. Might I suggest the following rules:
>>> 
>>> 1. if a package is a library for a particular language $LANG (like
>>> Python, Perl, etc.) it goes in ${LANG}-xyz.scm. If it is a library
>>> built for a particular PURPOSE, it may go into LANG-PURPOSE.scm
>>> with those packages.
>>> 
>>> 2. If the package defines a compiler or interpreter for a language
>>> $LANG, it may go into ${LANG}.scm
>>> 
>>> 3. If the package is part of a large divisible project $PROJ like
>>> gcc or texlive, it may go into ${PROJ}.scm
>>> 
>>> 4. If the package is maintained as part of a large desktop
>>> environment $DE like GNOME or KDE, it may be put in ${DE}.scm
>>> 
>>> 5. When in doubt, the package must go into a file named after its
>>> $PURPOSE, ${PURPOSE}.scm. For example, if the package is a game
>>> (like supertuxracer), it goes into games.scm; if it is for
>>> undirected fun (like sl), it goes into toys.scm; if it is for audio
>>> control or audio production, it goes into audio.scm; if it is for
>>> drawing or producing graphics, it goes into graphics.scm; etc.
>>> Projects that can be described with multiple purposes (like
>>> fortune) may go into any of those files.  
>> 
>> I had experience with Nixpkgs, which has a decision tree for where to
>> put packages:
>> 
>>  https://nixos.org/nixpkgs/manual/#sec-hierarchy
>> 
>> In the end I didn’t find it to be helpful in any way: you’d always
>> have to open ‘top-level/all-packages’, a file that lists all the
>> packages, to find out where the package you’re looking for lives.
>> 
>> I believe ‘guix edit’ greatly solves that (along with Helm or similar
>> editor support for grepping.)
>> 
> Interesting. So is it worth trying to organize the guix packages or do
> you think it will get too complicated? I'm primarily bothered by the
> number of small files with only one package definition and the
> inconsistency in how packages are organized. I would rather a file have
> multiple package definitions that make sense together than a hundred
> files with only one package definition.

Just to voice some support for a consistent approach. It would be beneficial
in a similar way that a consistent indentation style helps: fewer decisions to
make, fewer opportunities for bike-shedding discussions.

(Personally, one file per package sounds fine, too. No confusion about why
which module imports what. No overhead deciding where to file a package. No
need to grep around for where a package might be defined.)


* Re: Organizing packages
  2019-07-15 17:38     ` Robert Vollmert
@ 2019-07-15 20:15       ` Jesse Gibbons
  0 siblings, 0 replies; 9+ messages in thread
From: Jesse Gibbons @ 2019-07-15 20:15 UTC (permalink / raw)
  To: Robert Vollmert; +Cc: guix-devel

On Mon, 15 Jul 2019 19:38:34 +0200
Robert Vollmert <rob@vllmrt.net> wrote:

> > On 15. Jul 2019, at 19:21, Jesse Gibbons <jgibbons2357@gmail.com>
> > wrote:
> > 
> > On Sun, 14 Jul 2019 15:54:10 +0200
> > Ludovic Courtès <ludo@gnu.org> wrote:
> >   
> >> Hello!
> >> 
> >> Jesse Gibbons <jgibbons2357@gmail.com> skribis:
> >>   
> >>> I noticed that a few files have only one package definition and
> >>> are named for that package. I think these packages can be
> >>> organized better. Might I suggest the following rules:
> >>> 
> >>> 1. if a package is a library for a particular language $LANG (like
> >>> Python, Perl, etc.) it goes in ${LANG}-xyz.scm. If it is a library
> >>> built for a particular PURPOSE, it may go into LANG-PURPOSE.scm
> >>> with those packages.
> >>> 
> >>> 2. If the package defines a compiler or interpreter for a language
> >>> $LANG, it may go into ${LANG}.scm
> >>> 
> >>> 3. If the package is part of a large divisible project $PROJ like
> >>> gcc or texlive, it may go into ${PROJ}.scm
> >>> 
> >>> 4. If the package is maintained as part of a large desktop
> >>> environment $DE like GNOME or KDE, it may be put in ${DE}.scm
> >>> 
> >>> 5. When in doubt, the package must go into a file named after its
> >>> $PURPOSE, ${PURPOSE}.scm. For example, if the package is a game
> >>> (like supertuxracer), it goes into games.scm; if it is for
> >>> undirected fun (like sl), it goes into toys.scm; if it is for
> >>> audio control or audio production, it goes into audio.scm; if it
> >>> is for drawing or producing graphics, it goes into graphics.scm;
> >>> etc. Projects that can be described with multiple purposes (like
> >>> fortune) may go into any of those files.    
> >> 
> >> I had experience with Nixpkgs, which has a decision tree for where
> >> to put packages:
> >> 
> >>  https://nixos.org/nixpkgs/manual/#sec-hierarchy
> >> 
> >> In the end I didn’t find it to be helpful in any way: you’d always
> >> have to open ‘top-level/all-packages’, a file that lists all the
> >> packages, to find out where the package you’re looking for lives.
> >> 
> >> I believe ‘guix edit’ greatly solves that (along with Helm or
> >> similar editor support for grepping.)
> >>   
> > Interesting. So is it worth trying to organize the guix packages or
> > do you think it will get too complicated? I'm primarily bothered by
> > the number of small files with only one package definition and the
> > inconsistency in how packages are organized. I would rather a file
> > have multiple package definitions that make sense together than a
> > hundred files with only one package definition.  
> 
> Just to voice some support for a consistent approach. It would be
> beneficial in a similar way that a consistent indentation style
> helps: fewer decisions to make, fewer opportunities for bike-shedding
> discussions.
> 
> (Personally, one file per package sounds fine, too. No confusion
> about why which module imports what. No overhead deciding where to
> file a package. No need to grep around for where a package might be
> defined.)
> 

I too wouldn't mind a one-package-per-file approach as long as it is
consistent. But consider packages that have multiple parts, like gcc and
texlive, as well as the dictionary packages that are generated with
non-public syntax, like the French dictionaries in libreoffice.scm, not
to mention the packages for both python2 and python3. I think there's a
good reason to cluster them into groups. Also, as I pointed out, the
one-package-per-file approach would significantly hurt guix search
unless something is done to change how guix search works.


* Re: Organizing packages
  2019-07-15 17:21   ` Jesse Gibbons
  2019-07-15 17:38     ` Robert Vollmert
@ 2019-07-15 21:37     ` Ricardo Wurmus
  1 sibling, 0 replies; 9+ messages in thread
From: Ricardo Wurmus @ 2019-07-15 21:37 UTC (permalink / raw)
  To: Jesse Gibbons; +Cc: guix-devel

Jesse Gibbons <jgibbons2357@gmail.com> writes:

> Interesting. So is it worth trying to organize the guix packages or do
> you think it will get too complicated? I'm primarily bothered by the
> number of small files with only one package definition and the
> inconsistency in how packages are organized. I would rather a file have
> multiple package definitions that make sense together than a hundred
> files with only one package definition.

I think it doesn’t matter much, but in some cases having separate
modules even if they only contain one package definition can be really
important because it reduces the “module closure” of certain packages.
Our modules are heavily interdependent and that’s bad as we cannot
easily split up the work that has to be done by “guix pull”.

We shouldn’t have to evaluate all or most modules just to build the
derivations for core packages.  In those cases smaller modules can be
used to cut inter-module references.
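
To illustrate what "module closure" means here, the following Python
sketch computes the set of modules reachable from one module through its
imports. The import graph is entirely hypothetical (it is not the real
Guix module graph), and `module_closure` is a name invented for this
example.

```python
from typing import Dict, Set


def module_closure(module: str, imports: Dict[str, Set[str]]) -> Set[str]:
    """Return `module` plus every module it imports, directly or not."""
    closure: Set[str] = set()
    stack = [module]
    while stack:
        current = stack.pop()
        if current in closure:
            continue
        closure.add(current)
        stack.extend(imports.get(current, ()))
    return closure


# Hypothetical import graph: module name -> modules it imports.
imports = {
    "core": {"compression"},
    "compression": set(),
    "misc": {"core", "gnome", "python"},
    "gnome": {"python"},
    "python": {"compression"},
}

print(sorted(module_closure("core", imports)))
print(sorted(module_closure("misc", imports)))
```

In this toy graph, a core package defined in "core" only needs two modules
to be evaluated, whereas the same package defined in "misc" would pull in
all five.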

(In many other cases, however, that’s just how package definitions were
organized in the early days of Guix.  Moving them around is fine if the
move is a clear improvement.)

--
Ricardo


* Re: Organizing packages
  2019-07-14 13:54 ` Ludovic Courtès
  2019-07-15 17:21   ` Jesse Gibbons
@ 2019-07-16 17:04   ` zimoun
  2019-07-17 21:27     ` Improving ‘guix search’ scoring Ludovic Courtès
  1 sibling, 1 reply; 9+ messages in thread
From: zimoun @ 2019-07-16 17:04 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Guix Devel, Jesse Gibbons

Dear,

On Sun, 14 Jul 2019 at 15:54, Ludovic Courtès <ludo@gnu.org> wrote:

> > I think this will make searching easier because not everything has an
> > obvious name, and when I `guix search` for a purpose (like drawing) I
> > often get unrelated results.
>
> I don’t think the module hierarchy should be thought of as a tool for
> users to search for packages.

I totally agree. :-)

> So really, ‘guix search’ is the tool that should be improved.  It’s been
> discussed many times, and improving it turns out to be difficult without
> resorting to external sources of information (e.g., list of command
> names, popularity database, etc.)
>
> What we can do is look at specific examples to see if there’s something
> we can improve on the current scoring system (with the understanding
> that sometimes the answer is that we cannot do any better.)
>
> For example, ‘guix search drawing program’ shows Tux Paint as the first
> result, which is good; but ‘guix search drawing’ and ‘guix search
> drawing application’ are much less useful.  In this particular example,
> it’s not clear to me what can be done.
>
> One suggestion that was made before and that might help here is to
> increase the score of leaf packages (applications).

One of the current issues is that the score is not "normalized" somehow.

The current score is built from the number of occurrences of the search
terms in each field (name, synopsis, description, etc.), weighted per
field (see `%package-metrics` in guix/ui.scm).
For instance, `guix search drawing` ranks `texlive-latex-eepic` first
because the word `drawing` appears 4 times in the description, `tuxpaint`
second (2 times) and `xfig` third (1 time).
What I would expect (IMHO) is that these 3 packages are
scored at the same value. Therefore, some kind of normalization seems
to be missing. But what? :-)
And leaf packages should have a higher score than non-leaf packages. For
instance, `xfig` should rank higher than `libart-lgpl`.
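
As a rough illustration of the scoring described above, here is a Python
sketch of a weighted occurrence count. It is a simplification of the
description in this message, not a port of the Guix code: the weights are
illustrative values (the real ones are in `%package-metrics`), and the two
sample packages are made up.

```python
import re
from typing import Dict

# Illustrative per-field weights; the real values are defined in
# `%package-metrics` in guix/ui.scm and may differ.
WEIGHTS: Dict[str, int] = {"name": 4, "synopsis": 3, "description": 2}


def relevance(package: Dict[str, str], term: str) -> int:
    """Sum over all fields of weight * occurrences of `term` in the field."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return sum(weight * len(pattern.findall(package.get(field, "")))
               for field, weight in WEIGHTS.items())


# Two made-up packages: a macro library that repeats the term, and an
# application that mentions it once in the synopsis and once in the
# description.
library = {
    "name": "latex-pictures",
    "synopsis": "Picture macros",
    "description": "Drawing lines, drawing circles, drawing arcs "
                   "and drawing splines.",
}
application = {
    "name": "paint",
    "synopsis": "Drawing program for children",
    "description": "A simple drawing application.",
}

print(relevance(library, "drawing"), relevance(application, "drawing"))
```

Because nothing is normalized, the library outranks the application simply
by repeating the term more often.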

Even then, this scoring system cannot be improved much for single-word
searches.

Moreover, nothing can help with badly written descriptions.

For example, you need to know that `roguelike` is a `game` when
reading the description of the package `angband`.

  guix package --show=angband | recsel -C -p synopsis,description

synopsis: Dungeon exploration roguelike
description: Angband is a Classic dungeon exploration roguelike.  Explore the
+ depths below Angband, seeking riches, fighting monsters, and preparing to
+ fight Morgoth, the Lord of Darkness.

In my opinion, the issue is that the description is not good enough
to provide any relevant information usable by a search tool. And again
in my opinion, adding tags or classifying the packages inside
relevant filenames---which is a way to tag---seems a wrong approach.
For example, where should VLC go? video.scm? But is it closer to
ffmpeg-for-stepmania or to Totem, which is in gnome.scm?
IMHO, bikeshedding cannot be an improvement for searching packages. :-)

However, a kind of tf-idf [1] should be used to better self organize
the packages when searching.

[1] https://en.wikipedia.org/wiki/Tf%E2%80%93idf

For example, I have 10146 package definitions:
  guix search ' ' | recsel -P name -C | wc -l
  10146
and 46 contain the word 'drawing'.
So, the Inverse-Document-Frequency is:
 IDF(drawing) = log(10146 / 46)
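
The message does not say which logarithm is meant. A quick check in Python,
using only the counts quoted above, suggests it is the natural logarithm:
that is the base that reproduces the TF-IDF value of 0.1799 given for
tuxpaint later in this message. The function name `idf` is just a label
chosen for this sketch.

```python
import math


def idf(total_documents: int, documents_with_term: int,
        log=math.log) -> float:
    """Inverse document frequency: log(N / n)."""
    return log(total_documents / documents_with_term)


TOTAL_PACKAGES = 10146         # package definitions counted above
PACKAGES_WITH_DRAWING = 46     # packages containing the word "drawing"

natural = idf(TOTAL_PACKAGES, PACKAGES_WITH_DRAWING)
base_10 = idf(TOTAL_PACKAGES, PACKAGES_WITH_DRAWING, log=math.log10)

# TF for tuxpaint is 2 / 60 in this message.
print(f"natural log: IDF = {natural:.4f}, TF-IDF = {2 / 60 * natural:.4f}")
print(f"base 10:     IDF = {base_10:.4f}, TF-IDF = {2 / 60 * base_10:.4f}")
```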

Let us consider the first 3 most relevant packages (with the current score).
The term `drawing` appears (note that `grep -c` counts matching lines, not
individual occurrences):
   for pkg in $(guix search drawing | recsel -C -P name | head -n3);\
   do\
      echo $pkg ; guix package --show=$pkg | grep -c drawing ;\
   done

FREQ(drawing, texlive-latex-eepic) = 5
FREQ(drawing, tuxpaint) = 2
FREQ(drawing, xfig) = 2

Let us normalize by the length of the document, here the number of words in
the synopsis and description:
   for pkg in $(guix search drawing | recsel -C -P name | head -n3);\
   do\
      echo $pkg ; guix package --show=$pkg \
      | recsel -P synopsis,description | wc -w ;\
   done

LEN(texlive-latex-eepic) = 68
LEN(tuxpaint) = 60
LEN(xfig) = 76

Then one definition of the Term-Frequency is:

TF(drawing, texlive-latex-eepic) = 5 / 68
TF(drawing, tuxpaint) = 2 / 60
TF(drawing, xfig) = 2 / 76

The TF-IDF reads:

TF-IDF(drawing, texlive-latex-eepic) = 5/68*log(10146/46) = 0.3968
TF-IDF(drawing, tuxpaint) = 2/60*log(10146/46) = 0.1799
TF-IDF(drawing, xfig) = 2/76*log(10146/46) = 0.1420
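
The three values can be recomputed with a few lines of Python. This is only
a check of the arithmetic using the counts reported in this message,
assuming the natural logarithm; `tf_idf` is a helper name chosen for the
sketch.

```python
import math

TOTAL_PACKAGES = 10146
PACKAGES_WITH_DRAWING = 46

# package -> (occurrences of "drawing", length in words), as reported above
counts = {
    "texlive-latex-eepic": (5, 68),
    "tuxpaint": (2, 60),
    "xfig": (2, 76),
}


def tf_idf(freq: int, length: int, total: int, containing: int) -> float:
    """Term frequency (freq / length) times log(total / containing)."""
    return (freq / length) * math.log(total / containing)


scores = {
    name: tf_idf(freq, length, TOTAL_PACKAGES, PACKAGES_WITH_DRAWING)
    for name, (freq, length) in counts.items()
}

# Print the packages from most to least relevant.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.4f}")
```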

This does not change the current ranking much. But it makes it possible to
see which words are good filters.

Let us consider the word `program` and the package `tuxpaint`.
The current relevance score is 5 for `program`. The term appears 2
times (note that the synopsis says `software`, which should be
replaced by `program`).
The current relevance score is 7 for `drawing`. The term also appears 2 times.
The difference is only due to the per-field weights.

However, the TF-IDF is totally different:

TF-IDF(drawing, tuxpaint) = 2/60*log(10146/46) = 0.1799
TF-IDF(program, tuxpaint) = 2/60*log(10146/1056) = 0.0754

Well, the term `drawing` carries more information than the term `program`
for the package tuxpaint.
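
The same comparison as a small Python sketch, ranking the two terms for
tuxpaint by TF-IDF. The numbers are the ones quoted in this message; the
variable names are made up for the example and the natural logarithm is
assumed.

```python
import math

TOTAL_PACKAGES = 10146
TUXPAINT_LENGTH = 60                                   # words
document_frequency = {"drawing": 46, "program": 1056}  # packages per term
tuxpaint_counts = {"drawing": 2, "program": 2}         # occurrences

tf_idf = {
    term: (tuxpaint_counts[term] / TUXPAINT_LENGTH)
    * math.log(TOTAL_PACKAGES / document_frequency[term])
    for term in tuxpaint_counts
}

# Both terms occur equally often in tuxpaint, so the whole difference
# comes from how rare each term is across all packages.
for term in sorted(tf_idf, key=tf_idf.get, reverse=True):
    print(f"{term}: {tf_idf[term]:.4f}")
print(f"ratio: {tf_idf['drawing'] / tf_idf['program']:.2f}")
```

With equal term counts, `drawing` weighs more than twice as much as
`program`, purely because far fewer packages mention it.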

In my opinion, text mining or search engine strategies seem a better
approach to improving `guix search` than making the filename tree more
rigid. And some data mining to see how the packages cluster (depending on
the metric) would help to first understand how to improve `guix
search`.
I do not know if my words make sense.

All the best,
simon


* Improving ‘guix search’ scoring
  2019-07-16 17:04   ` zimoun
@ 2019-07-17 21:27     ` Ludovic Courtès
  2019-07-18 11:11       ` zimoun
  0 siblings, 1 reply; 9+ messages in thread
From: Ludovic Courtès @ 2019-07-17 21:27 UTC (permalink / raw)
  To: zimoun; +Cc: Guix Devel, Jesse Gibbons

Hello zimoun!

zimoun <zimon.toutoune@gmail.com> skribis:

> However, a kind of tf-idf [1] should be used to better self organize
> the packages when searching.
>
> [1] https://en.wikipedia.org/wiki/Tf%E2%80%93idf
>
>
> For example, I have 10146 package definitions:
>   guix search ' ' | recsel -P name -C | wc -l
>   10146
> and 46 contain the word 'drawing'.
> So, the Inverse-Document-Frequency is:
>  IDF(drawing) = log(10146 / 46)
>
> Let us consider the first 3 most relevant packages (with the current score).
> The term `drawing` appears:
>    for pkg in $(guix search drawing | recsel -C -P name | head -n3);\
>    do\
>       echo $pkg ; guix package --show=$pkg | grep -c drawing ;\
>    done
>
> FREQ(drawing, texlive-latex-eepic) = 5
> FREQ(drawing, tuxpaint) = 2
> FREQ(drawing, xfig) = 2
>
> Let us normalize by the length of the document:
>    for pkg in $(guix search drawing | recsel -C -P name | head -n3);\
>    do\
>       echo $pkg ; guix package --show=$pkg \
>       | recsel -P synopsis,description | wc -w ;\
>    done
>
> LEN(texlive-latex-eepic) = 68
> LEN(tuxpaint) = 60
> LEN(xfig) = 76
>
> Then one definition of the Term-Frequency is:
>
> TF(drawing, texlive-latex-eepic) = 5 / 68
> TF(drawing, tuxpaint) = 2 / 60
> TF(drawing, xfig) = 2 / 76
>
>
> The TF-IDF reads:
>
> TF-IDF(drawing, texlive-latex-eepic) = 5/68*log(10146/46) =0.3968
> TF-IDF(drawing, tuxpaint) = 2/60*log(10146/46) =0.1799
> TF-IDF(drawing, xfig) = 2/76*log(10146/46) =0.1420
>
>
> This does not change the current ranking much. But it makes it possible to
> see which words are good filters.
>
> Let us consider the word `program` and the package `tuxpaint`.
> The current relevance score is 5 for `program`. The term appears 2
> times (note that the synopsis says `software`, which should be
> replaced by `program`).
> The current relevance score is 7 for `drawing`. The term also appears 2 times.
> The difference is only due to the per-field weights.
>
> However, the TF-IDF is totally different:
>
> TF-IDF(drawing, tuxpaint) = 2/60*log(10146/46) =0.1799
> TF-IDF(program, tuxpaint) = 2/60*log(10146/1056) =0.0754
>
> Well, the term `drawing` carries more information than the term `program`
> for the package tuxpaint.

That’s insightful!

I guess computing the TF-IDF could perhaps improve the results compared
to the current scoring mechanism.  It would be worth trying to implement
it.

The bottom line though, as you wrote, is that this all depends on the
quality of synopses and descriptions, and there’s only so much we can
draw from 5-line descriptions.

Thanks,
Ludo’.


* Re: Improving ‘guix search’ scoring
  2019-07-17 21:27     ` Improving ‘guix search’ scoring Ludovic Courtès
@ 2019-07-18 11:11       ` zimoun
  0 siblings, 0 replies; 9+ messages in thread
From: zimoun @ 2019-07-18 11:11 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Guix Devel, Jesse Gibbons

Hi,

On Wed, 17 Jul 2019 at 23:27, Ludovic Courtès <ludo@gnu.org> wrote:

> I guess computing the TF-IDF could perhaps improve the results compared
> to the current scoring mechanism.  It would be worth trying to implement
> it.
>
> The bottom line though, as you wrote, is that this all depends on the
> quality of synopses and descriptions, and there’s only so much we can
> draw from 5-line descriptions.

In my opinion, because the description is, say, 5 lines plus the
synopsis, before implementing something one first needs to analyse
the "quality" of the available information (words + dependencies). I
mean doing some "data science" (buzz buzz! :-)) with R or Python.
And I do not know the state of the art of recommender systems, nor how
they apply to package retrieval. I have never read anything about that
in other distributions (Debian, Gentoo, etc.). Has anyone? Any
pointers?

For example, the current scoring looks like a poor man's version of the
Boolean model of Information Retrieval [1]. What about the Okapi model
[2]? etc.
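
For readers who do not know the Okapi model, here is a minimal Python
sketch of the usual BM25 term score, applied to the `drawing` counts from
earlier in the thread. Several things are assumptions: the average
document length (taken as the mean of the three lengths quoted, since the
real corpus average is not given), the common default parameters k1 = 1.5
and b = 0.75, and the "+1" variant of the IDF. It is not a proposal for
how Guix should implement it.

```python
import math


def bm25_idf(total: int, containing: int) -> float:
    """BM25 inverse document frequency (the non-negative '+1' variant)."""
    return math.log((total - containing + 0.5) / (containing + 0.5) + 1.0)


def bm25(freq: int, length: int, avg_length: float,
         total: int, containing: int,
         k1: float = 1.5, b: float = 0.75) -> float:
    """BM25 score of one term for one document."""
    norm = freq + k1 * (1.0 - b + b * length / avg_length)
    return bm25_idf(total, containing) * freq * (k1 + 1.0) / norm


TOTAL_PACKAGES = 10146
PACKAGES_WITH_DRAWING = 46
counts = {                      # package -> (occurrences, length in words)
    "texlive-latex-eepic": (5, 68),
    "tuxpaint": (2, 60),
    "xfig": (2, 76),
}
# Assumption: mean of the three quoted lengths, not the corpus average.
AVG_LENGTH = sum(length for _, length in counts.values()) / len(counts)

bm25_scores = {
    name: bm25(freq, length, AVG_LENGTH,
               TOTAL_PACKAGES, PACKAGES_WITH_DRAWING)
    for name, (freq, length) in counts.items()
}
for name, score in sorted(bm25_scores.items(),
                          key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Under these assumptions the ranking is the same as with TF-IDF, but the gap
between texlive-latex-eepic and tuxpaint shrinks, because BM25 saturates
the contribution of repeated occurrences of a term.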

Well, if a student is reading this thread and is looking for a project. ;-)

And I will try to have a look after my summer holidays.
Please share your opinion or experience.

All the best,
simon

[1] https://en.wikipedia.org/wiki/Boolean_model_of_information_retrieval
[2] https://en.wikipedia.org/wiki/Okapi_BM25


