From: zimoun
Subject: Re: Organizing packages
Date: Tue, 16 Jul 2019 19:04:26 +0200
To: Ludovic Courtès
Cc: Guix Devel, Jesse Gibbons

Dear,

On Sun, 14 Jul 2019 at 15:54, Ludovic Courtès wrote:

> > I think this will make searching easier because not everything has an
> > obvious name, and when I `guix search` for a purpose (like drawing) I
> > often get unrelated results.
>
> I don’t think the module hierarchy should be thought of as a tool for
> users to search for packages.

I totally agree. :-)

> So really, ‘guix search’ is the tool that should be improved.  It’s been
> discussed many times, and improving it turns out to be difficult without
> resorting to external sources of information (e.g., list of command
> names, popularity database, etc.)
>
> What we can do is look at specific examples to see if there’s something
> we can improve on the current scoring system (with the understanding
> that sometimes the answer is that we cannot do any better.)
>
> For example, ‘guix search drawing program’ shows Tux Paint as the first
> result, which is good; but ‘guix search drawing’ and ‘guix search
> drawing application’ are much less useful.  In this particular example,
> it’s not clear to me what can be done.
>
> One suggestion that was made before and that might help here is to
> increase the score of leaf packages (applications).

One of the current issues is that the score is not "normalized" in any
way. The current score is built from the number of occurrences of the
search term in each field (name, synopsis, description, etc.),
multiplied by a per-field weight (see `%package-metrics` in
guix/ui.scm).

For instance, `guix search drawing` ranks `texlive-latex-eepic` first
because the word `drawing` appears 4 times in its description,
`tuxpaint` second because of 2 occurrences, and `xfig` third (1
occurrence).

What should be expected (IMHO) is that these 3 packages get the same
score. Therefore, some kind of normalization seems to be missing. But
which one? :-)

And leaf packages should have a higher score than non-leaf packages.
For instance, `xfig` should rank higher than `libart-lgpl`.

Then, with this scoring system, the situation cannot be improved much
for single-word searches.

Moreover, nothing can help with badly written descriptions. For
example, you need to know that a `roguelike` is a `game` when reading
the description of the package `angband`.

  guix package --show=angband | recsel -C -p synopsis,description
  synopsis: Dungeon exploration roguelike
  description: Angband is a Classic dungeon exploration roguelike.  Explore the
  + depths below Angband, seeking riches, fighting monsters, and preparing to
  + fight Morgoth, the Lord of Darkness.

In my opinion, the issue is that the description is not good enough to
provide any relevant information usable by a search tool.
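
To make the point about the non-normalized score concrete, here is a
minimal sketch in Python (not the actual Guile code) of a score built
from occurrence counts times per-field weights. The weights and the two
packages are made up for illustration; the real weights are the ones in
`%package-metrics`.

```python
import re

# Hypothetical per-field weights, for illustration only; the real ones
# live in %package-metrics in guix/ui.scm.
WEIGHTS = {"name": 4, "synopsis": 3, "description": 2}


def relevance(package, term):
    """Sum over fields of (occurrences of TERM in the field) x (field weight)."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return sum(weight * len(pattern.findall(package.get(field, "")))
               for field, weight in WEIGHTS.items())


# Two made-up packages: the second description merely repeats the word.
short = {"name": "foo", "synopsis": "Drawing program",
         "description": "A simple program for kids."}
wordy = {"name": "bar", "synopsis": "Macros",
         "description": "Drawing lines, drawing circles, drawing arcs, "
                        "drawing splines."}

print(relevance(short, "drawing"))
print(relevance(wordy, "drawing"))
```

With these made-up inputs the wordy description wins (8 against 3)
only because it repeats the word, which is the effect a normalization
should remove.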

And again in my opinion, adding tags or classifying the packages into
relevant files (which is a way of tagging) seems the wrong approach.
For example, where should VLC go? video.scm? But is it closer to
ffmpeg-for-stepmania or to Totem, which is in gnome.scm?

IMHO, bikeshedding cannot be an improvement for searching packages. :-)

However, a kind of tf-idf [1] should be used to better self-organize
the packages when searching.

[1] https://en.wikipedia.org/wiki/Tf%E2%80%93idf

For example, I have 10146 package definitions:

  guix search ' ' | recsel -P name -C | wc -l
  10146

and 46 contain the word 'drawing'. So the Inverse Document Frequency
is:

  IDF(drawing) = log(10146 / 46)

Let us consider the 3 most relevant packages (with the current score).
The term `drawing` appears:

  for pkg in $(guix search drawing | recsel -C -P name | head -n3); do
    echo "$pkg"
    # grep -c counts matching lines in the whole --show output
    guix package --show="$pkg" | grep -c drawing
  done

  FREQ(drawing, texlive-latex-eepic) = 5
  FREQ(drawing, tuxpaint)            = 2
  FREQ(drawing, xfig)                = 2

Let us normalize by the length of the document:

  for pkg in $(guix search drawing | recsel -C -P name | head -n3); do
    echo "$pkg"
    # number of words in synopsis + description
    guix package --show="$pkg" | recsel -P synopsis,description | wc -w
  done

  LEN(texlive-latex-eepic) = 68
  LEN(tuxpaint)            = 60
  LEN(xfig)                = 76

Then one definition of the Term Frequency is:

  TF(drawing, texlive-latex-eepic) = 5 / 68
  TF(drawing, tuxpaint)            = 2 / 60
  TF(drawing, xfig)                = 2 / 76

The TF-IDF reads:

  TF-IDF(drawing, texlive-latex-eepic) = 5/68*log(10146/46) = 0.3968
  TF-IDF(drawing, tuxpaint)            = 2/60*log(10146/46) = 0.1799
  TF-IDF(drawing, xfig)                = 2/76*log(10146/46) = 0.1420

This does not change the current result much. But it lets us see which
words are good filters.

Let us consider the word `program` and the package `tuxpaint`.

The current relevance score is 5 for `program`. The term appears 2
times (note that `software` appears in the synopsis and should be
replaced by `program`).
The current relevance score is 7 for `drawing`. The term also appears 2
times. The difference comes only from the per-field weights.

However, the TF-IDF is totally different:

  TF-IDF(drawing, tuxpaint) = 2/60*log(10146/46)   = 0.1799
  TF-IDF(program, tuxpaint) = 2/60*log(10146/1056) = 0.0754

So the term `drawing` carries more information than the term `program`
for the package tuxpaint.

In my opinion, text mining or search engine strategies seem a better
approach to improving `guix search` than making the file tree more
rigid. And some data mining to see how the packages cluster (depending
on the metric) would help to first understand how to improve
`guix search`.

I do not know if my words make sense.

All the best,
simon
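
PS: for anyone who wants to redo the arithmetic, here is a minimal
sketch in Python. It only plugs in the counts quoted in this mail (it
does not query Guix), it assumes the natural logarithm, and the names
`N_PACKAGES`, `DOC_FREQ` and `tf_idf` are mine, chosen for
illustration.

```python
import math

N_PACKAGES = 10146                           # package definitions counted above
DOC_FREQ = {"drawing": 46, "program": 1056}  # packages containing each term


def tf_idf(count, length, term):
    """Term frequency (count / document length) times natural-log IDF."""
    return count / length * math.log(N_PACKAGES / DOC_FREQ[term])


# (occurrences of the term, words in synopsis + description), from this mail
for name, count, length in [("texlive-latex-eepic", 5, 68),
                            ("tuxpaint", 2, 60),
                            ("xfig", 2, 76)]:
    print(f"{name:20} {tf_idf(count, length, 'drawing'):.4f}")

print(f"{'tuxpaint (program)':20} {tf_idf(2, 60, 'program'):.4f}")
```

It should print the same four values as in the TF-IDF tables of this
mail.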