From: Ludovic Courtès
To: zimoun
Cc: Guix Devel, Jesse Gibbons
Subject: Improving ‘guix search’ scoring
Date: Wed, 17 Jul 2019 23:27:47 +0200
Message-ID: <878ssw8i7g.fsf_-_@gnu.org>
References: <20190709100732.3f760245@gmail.com> <87ef2s4t8d.fsf@gnu.org>
In-Reply-To: (zimoun's message of "Tue, 16 Jul 2019 19:04:26 +0200")

Hello zimoun!

zimoun wrote:

> However, a kind of tf-idf [1] should be used to better self organize
> the packages when searching.
>
> [1] https://en.wikipedia.org/wiki/Tf%E2%80%93idf
>
>
> For example, I have 10146 package definitions:
>    guix search ' ' | recsel -P name -C | wc -l
>    10146
> and 46 contain the word 'drawing'.
> So, the Inverse-Document-Frequency is:
>    IDF(drawing) = log(10146 / 46)
>
> Let consider the 3 first most relevant package (with the current score).
> The term `drawing` appears:
>    for pkg in $(guix search drawing | recsel -C -P name | head -n3);\
>    do\
>       echo $pkg ; guix package --show=$pkg | grep -c drawing ;\
>    done
>
>    FREQ(drawing, texlive-latex-eepic) = 5
>    FREQ(drawing, tuxpaint) = 2
>    FREQ(drawing, xfig) = 2
>
> Let normalize by the length of the document:
>    for pkg in $(guix search drawing | recsel -C -P name | head -n3);\
>    do\
>       echo $pkg ; guix package --show=$pkg \
>          | recsel -P synopsis,description | wc -w ;\
>    done
>
>    LEN(texlive-latex-eepic) = 68
>    LEN(tuxpaint) = 60
>    LEN(xfig) = 76
>
> Then one definition of the Term-Frequency is:
>
>    TF(drawing, texlive-latex-eepic) = 5 / 68
>    TF(drawing, tuxpaint) = 2 / 60
>    TF(drawing, xfig) = 2 / 76
>
>
> The TF-IDF reads:
>
>    TF-IDF(drawing, texlive-latex-eepic) = 5/68*log(10146/46) = 0.3968
>    TF-IDF(drawing, tuxpaint) = 2/60*log(10146/46) = 0.1799
>    TF-IDF(drawing, xfig) = 2/76*log(10146/46) = 0.1420
>
>
> This does not change much the current result. But this allows to
> better know which words are "good filter".
>
> Let consider the word `program` and the package `tuxpaint`.
> The current relevance score is 5 for `program`. The term appears 2
> times (note that `software` appears in synopsis which should be
> replaced be `program`).
> The current relevance score is 7 for `drawing`. The term also appears 2 times.
> The difference is just because the weight per field.
>
> However, the TF-IDF is totally different:
>
>    TF-IDF(drawing, tuxpaint) = 2/60*log(10146/46) = 0.1799
>    TF-IDF(program, tuxpaint) = 2/60*log(10146/1056) = 0.0754
>
> Well, the term `drawing` owns more information than the term `program`
> for the package tuxpaint.

That’s insightful!  I guess computing the TF-IDF could perhaps improve
the results compared to the current scoring mechanism.  It would be
worth trying to implement it.
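
As a rough illustration only (this is not Guix code), the arithmetic
quoted above can be reproduced with a few lines of Python.  The function
name `tf_idf` and the `EXAMPLES` table are hypothetical; the counts are
copied by hand from the quoted message, and the natural logarithm is
assumed because it is the base that matches the quoted figures.

```python
import math

def tf_idf(term_count, doc_length, total_docs, docs_with_term):
    """Length-normalized term frequency times inverse document frequency.

    Hypothetical helper: the natural logarithm is an assumption chosen
    because it reproduces the figures quoted in the message.
    """
    tf = term_count / doc_length
    idf = math.log(total_docs / docs_with_term)
    return tf * idf

TOTAL_PACKAGES = 10146  # package count reported in the quoted message

# (term, package, occurrences, words in synopsis+description,
#  number of packages containing the term), all taken from the message.
EXAMPLES = [
    ("drawing", "texlive-latex-eepic", 5, 68, 46),
    ("drawing", "tuxpaint", 2, 60, 46),
    ("drawing", "xfig", 2, 76, 46),
    ("program", "tuxpaint", 2, 60, 1056),
]

scores = {
    (term, pkg): tf_idf(count, length, TOTAL_PACKAGES, containing)
    for term, pkg, count, length, containing in EXAMPLES
}

for (term, pkg), score in scores.items():
    print(f"TF-IDF({term}, {pkg}) = {score:.4f}")
```

The printed values agree with the quoted ones to four decimal places.
A real implementation would have to gather the counts from the package
metadata itself rather than from a hand-written table.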

The bottom line though, as you wrote, is that this all depends on the
quality of synopses and descriptions, and there’s only so much we can
draw from 5-line descriptions.

Thanks,
Ludo’.