* Organizing packages
From: Jesse Gibbons @ 2019-07-09 16:07 UTC
To: guix-devel

I noticed that a few files have only one package definition and are
named for that package. I think these packages can be organized better.
Might I suggest the following rules:

1. If a package is a library for a particular language $LANG (like
Python, Perl, etc.), it goes in ${LANG}-xyz.scm. If it is a library built
for a particular $PURPOSE, it may go into ${LANG}-${PURPOSE}.scm with
those packages.

2. If the package defines a compiler or interpreter for a language
$LANG, it may go into ${LANG}.scm.

3. If the package is part of a large divisible project $PROJ like gcc or
texlive, it may go into ${PROJ}.scm.

4. If the package is maintained as part of a large desktop environment
$DE like GNOME or KDE, it may be put in ${DE}.scm.

5. When in doubt, the package must go into a file named after its
$PURPOSE, ${PURPOSE}.scm. For example, if the package is a game (like
supertuxracer), it goes into games.scm; if it is for undirected fun
(like sl), it goes into toys.scm; if it is for audio control or audio
production, it goes into audio.scm; if it is for drawing or producing
graphics, it goes into graphics.scm; etc. Projects that can be described
with multiple purposes (like fortune) may go into any of those files.

I think this will make searching easier because not everything has an
obvious name, and when I `guix search` for a purpose (like drawing) I
often get unrelated results.

If we follow these rules:

- Most packages will remain in place, especially libraries upon which
  plenty of packages depend.

- Packages that are placed in their own files should be moved to a file
  that describes their function. For example:

  - abduco.scm only defines abduco, a program that lets child processes
    of a shell become independent. It might fit into task-management.scm
    or moreutils.scm.
  - abiword.scm only defines abiword, a word processing program. It may
    be placed into a new file office.scm.
  - acct.scm only defines acct, which based on its description might fit
    into admin.scm.
  - acl.scm only defines acl, which based on its functionality fits into
    admin.scm.
  - anthy.scm only defines anthy, which is useful for Japanese language
    input. It can fit into language.scm.

  I hope you can see the pattern, and perhaps you can find better places
  for some of these packages. By my count, there are currently over 140
  of these files that define only one package.

- Files that define multiple packages and are named after one app might
  be split into multiple pieces by function. For example,
  libreoffice.scm defines, among other packages, libreoffice, which can
  go into office.scm, and multiple dictionaries, which can be moved into
  the already-existing dictionaries.scm. To me this is lower priority
  than one-package files.

- When a package is moved, the associated copyright information should
  be copied with it. This might make it difficult to split files like
  libreoffice.scm. I do not think the people who move packages should
  add their copyright information because they are not really adding new
  code, but I leave this detail to community discussion.

- Ideally, packages would be alphabetized within their respective files.
  This might not be practical.

- Services should likewise be organized by function, with file names
  consisting of full words. Although there is not much to do on this
  front, it will likely break a lot of OS configurations and make the
  manual inaccurate. Therefore it is very low priority to me.

Feel free to tell me what you think of this suggestion. If the
maintainers and the community like this idea, I will personally work on
carrying it out. I am also ready to openly discuss particular packages,
either in response to this or when I send a patch.

-Jesse
* Re: Organizing packages 2019-07-09 16:07 Organizing packages Jesse Gibbons @ 2019-07-14 13:54 ` Ludovic Courtès 2019-07-15 17:21 ` Jesse Gibbons 2019-07-16 17:04 ` zimoun 0 siblings, 2 replies; 9+ messages in thread From: Ludovic Courtès @ 2019-07-14 13:54 UTC (permalink / raw) To: Jesse Gibbons; +Cc: guix-devel Hello! Jesse Gibbons <jgibbons2357@gmail.com> skribis: > I noticed that a few files have only one package definition and are > named for that package. I think these packages can be organized better. > Might I suggest the following rules: > > 1. if a package is a library for a particular language $LANG (like > Python, Perl, etc.) it goes in ${LANG}-xyz.scm. If it is a library built > for a particular PURPOSE, it may go into LANG-PURPOSE.scm with those > packages. > > 2. If the package defines a compiler or interpreter for a language > $LANG, it may go into ${LANG}.scm > > 3. If the package is part of a large divisible project $PROJ like gcc or > texlive, it may go into ${PROJ}.scm > > 4. If the package is maintained a part of a large desktop environment > $DE like GNOME or KDE, it may be put in ${DE}.scm > > 5. When in doubt, the package must go into a file named after its > $PURPOSE, ${PURPOSE}.scm. For example, if the package is a game (like > supertuxracer), it goes into games.scm; if it is for undirected > fun (like sl), it goes into toys.scm; if it is for audio > control or audio production, it goes into audio.scm; if it is for > drawing or producing graphics, it goes into graphics.scm; etc. Projects > that can be described with multiple purposes (like fortune) may go into > any of those files. I had experience with Nixpkgs, which has a decision tree for where to put packages: https://nixos.org/nixpkgs/manual/#sec-hierarchy In the end I didn’t find it to be helpful in any way: you’d always have to open ‘top-level/all-packages’, a file that lists all the packages, to find out where the package you’re looking for lives. 
I believe ‘guix edit’ greatly solves that (along with Helm or similar editor support for grepping.) > I think this will make searching easier because not everything has an > obvious name, and when I `guix search` for a purpose (like drawing) I > often get unrelated results. I don’t think the module hierarchy should be thought of as a tool for users to search for packages. So really, ‘guix search’ is the tool that should be improved. It’s been discussed many times, and improving it turns out to be difficult without resorting to external sources of information (e.g., list of command names, popularity database, etc.) What we can do is look at specific examples to see if there’s something we can improve on the current scoring system (with the understanding that sometimes the answer is that we cannot do any better.) For example, ‘guix search drawing program’ shows Tux Paint as the first result, which is good; but ‘guix search drawing’ and ‘guix search drawing application’ are much less useful. In this particular example, it’s not clear to me what can be done. One suggestion that was made before and that might help here is to increase the score of leaf packages (applications). Food for thought! Ludo’. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Organizing packages 2019-07-14 13:54 ` Ludovic Courtès @ 2019-07-15 17:21 ` Jesse Gibbons 2019-07-15 17:38 ` Robert Vollmert 2019-07-15 21:37 ` Ricardo Wurmus 2019-07-16 17:04 ` zimoun 1 sibling, 2 replies; 9+ messages in thread From: Jesse Gibbons @ 2019-07-15 17:21 UTC (permalink / raw) To: Ludovic Courtès; +Cc: guix-devel On Sun, 14 Jul 2019 15:54:10 +0200 Ludovic Courtès <ludo@gnu.org> wrote: > Hello! > > Jesse Gibbons <jgibbons2357@gmail.com> skribis: > > > I noticed that a few files have only one package definition and are > > named for that package. I think these packages can be organized > > better. Might I suggest the following rules: > > > > 1. if a package is a library for a particular language $LANG (like > > Python, Perl, etc.) it goes in ${LANG}-xyz.scm. If it is a library > > built for a particular PURPOSE, it may go into LANG-PURPOSE.scm > > with those packages. > > > > 2. If the package defines a compiler or interpreter for a language > > $LANG, it may go into ${LANG}.scm > > > > 3. If the package is part of a large divisible project $PROJ like > > gcc or texlive, it may go into ${PROJ}.scm > > > > 4. If the package is maintained a part of a large desktop > > environment $DE like GNOME or KDE, it may be put in ${DE}.scm > > > > 5. When in doubt, the package must go into a file named after its > > $PURPOSE, ${PURPOSE}.scm. For example, if the package is a game > > (like supertuxracer), it goes into games.scm; if it is for > > undirected fun (like sl), it goes into toys.scm; if it is for audio > > control or audio production, it goes into audio.scm; if it is for > > drawing or producing graphics, it goes into graphics.scm; etc. > > Projects that can be described with multiple purposes (like > > fortune) may go into any of those files. 
> > I had experience with Nixpkgs, which has a decision tree for where to > put packages: > > https://nixos.org/nixpkgs/manual/#sec-hierarchy > > In the end I didn’t find it to be helpful in any way: you’d always > have to open ‘top-level/all-packages’, a file that lists all the > packages, to find out where the package you’re looking for lives. > > I believe ‘guix edit’ greatly solves that (along with Helm or similar > editor support for grepping.) > Interesting. So is it worth trying to organize the guix packages or do you think it will get too complicated? I'm primarily bothered by the number of small files with only one package definition and the inconsistency in how packages are organized. I would rather a file have multiple package definitions that make sense together than a hundred files with only one package definition. > > I think this will make searching easier because not everything has > > an obvious name, and when I `guix search` for a purpose (like > > drawing) I often get unrelated results. This was an afterthought. > > I don’t think the module hierarchy should be thought of as a tool for > users to search for packages. > > So really, ‘guix search’ is the tool that should be improved. It’s > been discussed many times, and improving it turns out to be difficult > without resorting to external sources of information (e.g., list of > command names, popularity database, etc.) I was thinking this would help `guix search`. For example, if I try `guix search game` a lot of the leaf packages in games.scm are ranked with relevance level 1 because they do not have the word "game" in their synopsis or description. I would expect them to have a higher relevance (8 at the very minimum) because of their placement in games.scm. I do not think these packages would be listed at all if they were not in games.scm. 
Hypothetically, if someone decided to define a package for the tuxemon RPG in a new file "tuxemon.scm" and did not mention the word "game" in the summary or description, it would not be listed in the `guix search game` results at all. If it was placed into games.scm, then it would at least come up in the results. > > What we can do is look at specific examples to see if there’s > something we can improve on the current scoring system (with the > understanding that sometimes the answer is that we cannot do any > better.) > > For example, ‘guix search drawing program’ shows Tux Paint as the > first result, which is good; but ‘guix search drawing’ and ‘guix > search drawing application’ are much less useful. In this particular > example, it’s not clear to me what can be done. > > One suggestion that was made before and that might help here is to > increase the score of leaf packages (applications). > > Food for thought! > > Ludo’. I have ideas about how to resolve this and other issues regarding `guix search`, but perhaps they are best explained in bug reports or other guix-devel discussions. -Jesse ^ permalink raw reply [flat|nested] 9+ messages in thread
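The effect described in this message can be illustrated with a minimal Python sketch of a field-weighted relevance score in which the defining file name counts as one low-weight field. The field names, the weights, and the two sample package records are assumptions chosen for illustration; they are not copied from Guix's actual scoring code.

```python
import re

# Hypothetical per-field weights; the real values used by Guix may differ.
FIELD_WEIGHTS = {"name": 4, "synopsis": 3, "description": 2, "file": 1}


def relevance(package, term):
    """Sum over all fields of (field weight) * (matches of `term` in the field)."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return sum(
        weight * len(pattern.findall(package.get(field, "")))
        for field, weight in FIELD_WEIGHTS.items()
    )


# Two made-up package records that differ only in the defining file.
in_games = {
    "name": "tuxemon",
    "synopsis": "Monster-catching RPG",
    "description": "Catch and train creatures in an open world.",
    "file": "games.scm",
}
in_own_file = dict(in_games, file="tuxemon.scm")

print(relevance(in_games, "game"))     # 1: only the file name matches
print(relevance(in_own_file, "game"))  # 0: nothing matches
```

Under these assumptions the package is found only because of where it is filed, which is why moving it to a single-package file would drop it from the results unless its synopsis or description mentions the search term.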
* Re: Organizing packages 2019-07-15 17:21 ` Jesse Gibbons @ 2019-07-15 17:38 ` Robert Vollmert 2019-07-15 20:15 ` Jesse Gibbons 2019-07-15 21:37 ` Ricardo Wurmus 1 sibling, 1 reply; 9+ messages in thread From: Robert Vollmert @ 2019-07-15 17:38 UTC (permalink / raw) To: Jesse Gibbons; +Cc: guix-devel > On 15. Jul 2019, at 19:21, Jesse Gibbons <jgibbons2357@gmail.com> wrote: > > On Sun, 14 Jul 2019 15:54:10 +0200 > Ludovic Courtès <ludo@gnu.org> wrote: > >> Hello! >> >> Jesse Gibbons <jgibbons2357@gmail.com> skribis: >> >>> I noticed that a few files have only one package definition and are >>> named for that package. I think these packages can be organized >>> better. Might I suggest the following rules: >>> >>> 1. if a package is a library for a particular language $LANG (like >>> Python, Perl, etc.) it goes in ${LANG}-xyz.scm. If it is a library >>> built for a particular PURPOSE, it may go into LANG-PURPOSE.scm >>> with those packages. >>> >>> 2. If the package defines a compiler or interpreter for a language >>> $LANG, it may go into ${LANG}.scm >>> >>> 3. If the package is part of a large divisible project $PROJ like >>> gcc or texlive, it may go into ${PROJ}.scm >>> >>> 4. If the package is maintained a part of a large desktop >>> environment $DE like GNOME or KDE, it may be put in ${DE}.scm >>> >>> 5. When in doubt, the package must go into a file named after its >>> $PURPOSE, ${PURPOSE}.scm. For example, if the package is a game >>> (like supertuxracer), it goes into games.scm; if it is for >>> undirected fun (like sl), it goes into toys.scm; if it is for audio >>> control or audio production, it goes into audio.scm; if it is for >>> drawing or producing graphics, it goes into graphics.scm; etc. >>> Projects that can be described with multiple purposes (like >>> fortune) may go into any of those files. 
>> >> I had experience with Nixpkgs, which has a decision tree for where to >> put packages: >> >> https://nixos.org/nixpkgs/manual/#sec-hierarchy >> >> In the end I didn’t find it to be helpful in any way: you’d always >> have to open ‘top-level/all-packages’, a file that lists all the >> packages, to find out where the package you’re looking for lives. >> >> I believe ‘guix edit’ greatly solves that (along with Helm or similar >> editor support for grepping.) >> > Interesting. So is it worth trying to organize the guix packages or do > you think it will get too complicated? I'm primarily bothered by the > number of small files with only one package definition and the > inconsistency in how packages are organized. I would rather a file have > multiple package definitions that make sense together than a hundred > files with only one package definition. Just to voice some support for a consistent approach. It would be beneficial in a similar way that a consistent indentation style helps: Less decisions to make, less opportunity for bike-shedding discussions. (Personally, one file per package sounds fine, too. No confusion about why which module imports what. No overhead deciding where to file a package. No need to grep around for where a package might be defined.) ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Organizing packages 2019-07-15 17:38 ` Robert Vollmert @ 2019-07-15 20:15 ` Jesse Gibbons 0 siblings, 0 replies; 9+ messages in thread From: Jesse Gibbons @ 2019-07-15 20:15 UTC (permalink / raw) To: Robert Vollmert; +Cc: guix-devel On Mon, 15 Jul 2019 19:38:34 +0200 Robert Vollmert <rob@vllmrt.net> wrote: > > On 15. Jul 2019, at 19:21, Jesse Gibbons <jgibbons2357@gmail.com> > > wrote: > > > > On Sun, 14 Jul 2019 15:54:10 +0200 > > Ludovic Courtès <ludo@gnu.org> wrote: > > > >> Hello! > >> > >> Jesse Gibbons <jgibbons2357@gmail.com> skribis: > >> > >>> I noticed that a few files have only one package definition and > >>> are named for that package. I think these packages can be > >>> organized better. Might I suggest the following rules: > >>> > >>> 1. if a package is a library for a particular language $LANG (like > >>> Python, Perl, etc.) it goes in ${LANG}-xyz.scm. If it is a library > >>> built for a particular PURPOSE, it may go into LANG-PURPOSE.scm > >>> with those packages. > >>> > >>> 2. If the package defines a compiler or interpreter for a language > >>> $LANG, it may go into ${LANG}.scm > >>> > >>> 3. If the package is part of a large divisible project $PROJ like > >>> gcc or texlive, it may go into ${PROJ}.scm > >>> > >>> 4. If the package is maintained a part of a large desktop > >>> environment $DE like GNOME or KDE, it may be put in ${DE}.scm > >>> > >>> 5. When in doubt, the package must go into a file named after its > >>> $PURPOSE, ${PURPOSE}.scm. For example, if the package is a game > >>> (like supertuxracer), it goes into games.scm; if it is for > >>> undirected fun (like sl), it goes into toys.scm; if it is for > >>> audio control or audio production, it goes into audio.scm; if it > >>> is for drawing or producing graphics, it goes into graphics.scm; > >>> etc. Projects that can be described with multiple purposes (like > >>> fortune) may go into any of those files. 
> >> > >> I had experience with Nixpkgs, which has a decision tree for where > >> to put packages: > >> > >> https://nixos.org/nixpkgs/manual/#sec-hierarchy > >> > >> In the end I didn’t find it to be helpful in any way: you’d always > >> have to open ‘top-level/all-packages’, a file that lists all the > >> packages, to find out where the package you’re looking for lives. > >> > >> I believe ‘guix edit’ greatly solves that (along with Helm or > >> similar editor support for grepping.) > >> > > Interesting. So is it worth trying to organize the guix packages or > > do you think it will get too complicated? I'm primarily bothered by > > the number of small files with only one package definition and the > > inconsistency in how packages are organized. I would rather a file > > have multiple package definitions that make sense together than a > > hundred files with only one package definition. > > Just to voice some support for a consistent approach. It would be > beneficial in a similar way that a consistent indentation style > helps: Less decisions to make, less opportunity for bike-shedding > discussions. > > (Personally, one file per package sounds fine, too. No confusion > about why which module imports what. No overhead deciding where to > file a package. No need to grep around for where a package might be > defined.) > I too wouldn't mind a one-package-per-file approach as long as it is consistent. But consider packages that have multiple parts like gcc and texlive, as well as the dictionaries packages that are generated with non-public syntax like the french dictionaries in libreoffice.scm, not to mention the packages for both python2 and python3. I think there's a good reason to cluster them into groups. But the one-package-per-file approach would affect guix search in a significant negative way, as I pointed out unless something is done to change how guix search works. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Organizing packages
From: Ricardo Wurmus @ 2019-07-15 21:37 UTC
To: Jesse Gibbons; +Cc: guix-devel

Jesse Gibbons <jgibbons2357@gmail.com> writes:

> Interesting. So is it worth trying to organize the guix packages or do
> you think it will get too complicated? I'm primarily bothered by the
> number of small files with only one package definition and the
> inconsistency in how packages are organized. I would rather a file have
> multiple package definitions that make sense together than a hundred
> files with only one package definition.

I think it doesn’t matter much, but in some cases having separate
modules even if they only contain one package definition can be really
important because it reduces the “module closure” of certain packages.

Our modules are heavily interdependent and that’s bad as we cannot
easily split up the work that has to be done by “guix pull”. We
shouldn’t have to evaluate all or most modules just to build the
derivations for core packages. In those cases smaller modules can be
used to cut inter-module references.

(In many other cases, however, that’s just how package definitions were
organized in the early days of Guix. Moving them around is fine if the
move is a clear improvement.)

--
Ricardo
* Re: Organizing packages
From: zimoun @ 2019-07-16 17:04 UTC
To: Ludovic Courtès; +Cc: Guix Devel, Jesse Gibbons

Dear,

On Sun, 14 Jul 2019 at 15:54, Ludovic Courtès <ludo@gnu.org> wrote:

> > I think this will make searching easier because not everything has an
> > obvious name, and when I `guix search` for a purpose (like drawing) I
> > often get unrelated results.
>
> I don’t think the module hierarchy should be thought of as a tool for
> users to search for packages.

I totally agree. :-)

> So really, ‘guix search’ is the tool that should be improved. It’s been
> discussed many times, and improving it turns out to be difficult without
> resorting to external sources of information (e.g., list of command
> names, popularity database, etc.)
>
> What we can do is look at specific examples to see if there’s something
> we can improve on the current scoring system (with the understanding
> that sometimes the answer is that we cannot do any better.)
>
> For example, ‘guix search drawing program’ shows Tux Paint as the first
> result, which is good; but ‘guix search drawing’ and ‘guix search
> drawing application’ are much less useful. In this particular example,
> it’s not clear to me what can be done.
>
> One suggestion that was made before and that might help here is to
> increase the score of leaf packages (applications).

One of the current issues is that the score is not "normalized" in any
way. The current score is built from the number of occurrences of the
search term in each field (name, synopsis, description, etc.),
multiplied by per-field weights (see `%package-metrics` in guix/ui.scm).

For instance, `guix search drawing` ranks `texlive-latex-eepic` first
because the word `drawing` appears 4 times in its description,
`tuxpaint` second (2 times), and `xfig` third (1 time).

What I would expect (IMHO) is that these 3 packages get the same score.
So some kind of normalization seems to be missing. But which one? :-)

Also, leaf packages should have a higher score than non-leaf packages.
For instance, `xfig` should rank higher than `libart-lgpl`.

Beyond that, this scoring system cannot be improved much for
single-word searches.

Moreover, nothing can help with badly written descriptions.
For example, you need to know that a `roguelike` is a `game` when reading
the description of the package `angband`.

    guix package --show=angband | recsel -C -p synopsis,description

    synopsis: Dungeon exploration roguelike
    description: Angband is a Classic dungeon exploration roguelike. Explore the
    + depths below Angband, seeking riches, fighting monsters, and preparing to
    + fight Morgoth, the Lord of Darkness.

In my opinion, the issue is that the description is not good enough to
provide any relevant information usable by a search tool.

And again in my opinion, adding tags or classifying the packages into
relevant file names, which is a way of tagging, seems the wrong approach.
For example, where should VLC go? video.scm? But is it closer to
ffmpeg-for-stepmania, or to Totem, which is in gnome.scm?

IMHO, bikeshedding over file names cannot improve package search. :-)

However, some kind of tf-idf [1] should be used so that packages
organize themselves better when searching.

[1] https://en.wikipedia.org/wiki/Tf%E2%80%93idf

For example, I have 10146 package definitions:

    guix search ' ' | recsel -P name -C | wc -l
    10146

and 46 contain the word 'drawing'.
So, the Inverse-Document-Frequency is:

    IDF(drawing) = log(10146 / 46)

Let us consider the 3 most relevant packages (with the current score).
The term `drawing` appears:

    for pkg in $(guix search drawing | recsel -C -P name | head -n3)
    do
      echo "$pkg"; guix package --show="$pkg" | grep -c drawing
    done

    FREQ(drawing, texlive-latex-eepic) = 5
    FREQ(drawing, tuxpaint)            = 2
    FREQ(drawing, xfig)                = 2

Let us normalize by the length of the document:

    for pkg in $(guix search drawing | recsel -C -P name | head -n3)
    do
      echo "$pkg"; guix package --show="$pkg" \
        | recsel -P synopsis,description | wc -w
    done

    LEN(texlive-latex-eepic) = 68
    LEN(tuxpaint)            = 60
    LEN(xfig)                = 76

Then one definition of the Term-Frequency is:

    TF(drawing, texlive-latex-eepic) = 5 / 68
    TF(drawing, tuxpaint)            = 2 / 60
    TF(drawing, xfig)                = 2 / 76

The TF-IDF reads:

    TF-IDF(drawing, texlive-latex-eepic) = 5/68 * log(10146/46) = 0.3968
    TF-IDF(drawing, tuxpaint)            = 2/60 * log(10146/46) = 0.1799
    TF-IDF(drawing, xfig)                = 2/76 * log(10146/46) = 0.1420

This does not change the current result much. But it lets us see which
words are good filters.

Let us consider the word `program` and the package `tuxpaint`.
The current relevance score is 5 for `program`. The term appears 2
times (note that `software` appears in the synopsis, which should be
replaced by `program`).
The current relevance score is 7 for `drawing`. The term also appears 2 times.
The difference is only due to the per-field weights.

However, the TF-IDF is totally different:

    TF-IDF(drawing, tuxpaint) = 2/60 * log(10146/46)   = 0.1799
    TF-IDF(program, tuxpaint) = 2/60 * log(10146/1056) = 0.0754

So the term `drawing` carries more information than the term `program`
for the package tuxpaint.

In my opinion, text-mining or search-engine strategies seem a better
approach to improving `guix search` than making the file name tree more
rigid. And some data mining to see how the packages cluster (depending
on the metric) would help to first understand how to improve
`guix search`.

I do not know if my words make sense.

All the best,
simon
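The arithmetic in this message can be reproduced with a short Python sketch. It uses only the counts quoted in the message (10146 packages, 46 containing `drawing`, 1056 used as the document count for `program`, and the per-package frequencies and lengths) and assumes the natural logarithm, which is what matches the quoted results. The function name `tf_idf` is made up for this illustration and is not part of Guix.

```python
import math

TOTAL_PACKAGES = 10146  # package count quoted in the message


def tf_idf(term_count, doc_length, docs_with_term, total_docs=TOTAL_PACKAGES):
    """Term frequency (count / length) times natural-log inverse document frequency."""
    tf = term_count / doc_length
    idf = math.log(total_docs / docs_with_term)
    return tf * idf


# (occurrences of "drawing", words in synopsis + description), from the message.
packages = {
    "texlive-latex-eepic": (5, 68),
    "tuxpaint": (2, 60),
    "xfig": (2, 76),
}

for name, (count, length) in packages.items():
    print(f"{name}: {tf_idf(count, length, docs_with_term=46):.4f}")

# The message uses 1056 as the document count for "program".
print(f"tuxpaint/program: {tf_idf(2, 60, docs_with_term=1056):.4f}")
```

The same term count and document length give a much lower value for `program` than for `drawing` purely because `program` appears in far more packages, which is the "good filter" effect the message describes.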
* Improving ‘guix search’ scoring 2019-07-16 17:04 ` zimoun @ 2019-07-17 21:27 ` Ludovic Courtès 2019-07-18 11:11 ` zimoun 0 siblings, 1 reply; 9+ messages in thread From: Ludovic Courtès @ 2019-07-17 21:27 UTC (permalink / raw) To: zimoun; +Cc: Guix Devel, Jesse Gibbons Hello zimoun! zimoun <zimon.toutoune@gmail.com> skribis: > However, a kind of tf-idf [1] should be used to better self organize > the packages when searching. > > [1] https://en.wikipedia.org/wiki/Tf%E2%80%93idf > > > For example, I have 10146 package definitions: > guix search ' ' | recsel -P name -C | wc -l > 10146 > and 46 contain the word 'drawing'. > So, the Inverse-Document-Frequency is: > IDF(drawing) = log(10146 / 46) > > Let consider the 3 first most relevant package (with the current score). > The term `drawing` appears: > for pkg in $(guix search drawing | recsel -C -P name | head -n3);\ > do\ > echo $pkg ; guix package --show=$pkg | grep -c drawing ;\ > done > > FREQ(drawing, texlive-latex-eepic) = 5 > FREQ(drawing, tuxpaint) = 2 > FREQ(drawing, xfig) = 2 > > Let normalize by the length of the document: > for pkg in $(guix search drawing | recsel -C -P name | head -n3);\ > do\ > echo $pkg ; guix package --show=$pkg \ > | recsel -P synopsis,description | wc -w ;\ > done > > LEN(texlive-latex-eepic) = 68 > LEN(tuxpaint) = 60 > LEN(xfig) = 76 > > Then one definition of the Term-Frequency is: > > TF(drawing, texlive-latex-eepic) = 5 / 68 > TF(drawing, tuxpaint) = 2 / 60 > TF(drawing, xfig) = 2 / 76 > > > The TF-IDF reads: > > TF-IDF(drawing, texlive-latex-eepic) = 5/68*log(10146/46) =0.3968 > TF-IDF(drawing, tuxpaint) = 2/60*log(10146/46) =0.1799 > TF-IDF(drawing, xfig) = 2/76*log(10146/46) =0.1420 > > > This does not change much the current result. But this allows to > better know which words are "good filter". > > Let consider the word `program` and the package `tuxpaint`. > The current relevance score is 5 for `program`. 
The term appears 2 > times (note that `software` appears in synopsis which should be > replaced be `program`). > The current relevance score is 7 for `drawing`. The term also appears 2 times. > The difference is just because the weight per field. > > However, the TF-IDF is totally different: > > TF-IDF(drawing, tuxpaint) = 2/60*log(10146/46) =0.1799 > TF-IDF(program, tuxpaint) = 2/60*log(10146/1056) =0.0754 > > Well, the term `drawing` owns more information than the term `program` > for the package tuxpaint. That’s insightful! I guess computing the TF-IDF could perhaps improve the results compared to the current scoring mechanism. It would be worth trying to implement it. The bottom line though, as you wrote, is that this all depends on the quality of synopses and descriptions, and there’s only so much we can draw from 5-line descriptions. Thanks, Ludo’. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Improving ‘guix search’ scoring
From: zimoun @ 2019-07-18 11:11 UTC
To: Ludovic Courtès; +Cc: Guix Devel, Jesse Gibbons

Hi,

On Wed, 17 Jul 2019 at 23:27, Ludovic Courtès <ludo@gnu.org> wrote:

> I guess computing the TF-IDF could perhaps improve the results compared
> to the current scoring mechanism. It would be worth trying to implement
> it.
>
> The bottom line though, as you wrote, is that this all depends on the
> quality of synopses and descriptions, and there’s only so much we can
> draw from 5-line descriptions.

In my opinion, because the description is, say, 5 lines plus the
synopsis, one first needs to analyse the "quality" of the available
information (words + dependencies) before implementing anything.
I mean doing some "data science" (buzz buzz! :-)) with R or Python.

And I do not know the state of the art of recommender systems, nor how
they have been applied to package retrieval. I have never read anything
about that for other distributions (Debian, Gentoo, etc.). Does anyone
know of something? Any pointers?

For example, the current scoring looks like a poor man's version of the
Boolean model of Information Retrieval [1]. What about the Okapi model
[2]? etc.

Well, if a student is reading this thread and is looking for a project. ;-)
And I will try to have a look after my summer holidays.

Please share your opinion or experience.

All the best,
simon

[1] https://en.wikipedia.org/wiki/Boolean_model_of_information_retrieval
[2] https://en.wikipedia.org/wiki/Okapi_BM25
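For readers unfamiliar with the Okapi model mentioned here, the following Python sketch shows the usual BM25 per-term score applied to the `drawing` counts quoted earlier in the thread. It is only an illustration of the formula, not a proposal from the thread: the parameters k1 = 1.2 and b = 0.75 are common defaults, and the average document length is an assumption (the mean of the three quoted lengths), since the thread does not give a corpus-wide average.

```python
import math


def bm25_term_score(term_count, doc_length, avg_doc_length,
                    docs_with_term, total_docs, k1=1.2, b=0.75):
    """Okapi BM25 contribution of one query term to one document's score."""
    # Smoothed inverse document frequency (always positive).
    idf = math.log((total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1)
    # Longer-than-average documents are penalised, shorter ones favoured.
    length_norm = 1 - b + b * doc_length / avg_doc_length
    # The fraction saturates as term_count grows, so repeats count less and less.
    return idf * term_count * (k1 + 1) / (term_count + k1 * length_norm)


# (occurrences of "drawing", words in synopsis + description), from the thread.
packages = {
    "texlive-latex-eepic": (5, 68),
    "tuxpaint": (2, 60),
    "xfig": (2, 76),
}
# Assumption: use the mean of these three lengths as the average length.
avg_len = sum(length for _, length in packages.values()) / len(packages)

scores = {
    name: bm25_term_score(count, length, avg_len,
                          docs_with_term=46, total_docs=10146)
    for name, (count, length) in packages.items()
}
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

Because the term-count factor saturates, five occurrences no longer score 2.5 times as much as two, which moves in the direction of the normalization asked for earlier in the thread.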
Thread overview: 9+ messages

2019-07-09 16:07 Organizing packages Jesse Gibbons
2019-07-14 13:54 ` Ludovic Courtès
2019-07-15 17:21   ` Jesse Gibbons
2019-07-15 17:38     ` Robert Vollmert
2019-07-15 20:15       ` Jesse Gibbons
2019-07-15 21:37     ` Ricardo Wurmus
2019-07-16 17:04   ` zimoun
2019-07-17 21:27     ` Improving ‘guix search’ scoring Ludovic Courtès
2019-07-18 11:11       ` zimoun