Subject: [bug#73220] [PATCH] ui: Add more nuance to relevance scoring.
To: aurtzy, 73220@debbugs.gnu.org
Cc: Josselin Poiret, aurtzy, Mathieu Othacehe, Ludovic Courtès,
 Tobias Geerinckx-Rice, Christopher Baines
From: Simon Tournier
Date: Fri, 13 Sep 2024 16:12:19 +0200
Message-ID: <87a5gbve0s.fsf@gmail.com>

Hi,

On Fri, 13 Sep 2024 at 03:02, aurtzy wrote:
> Fixes .

Thanks!

> | Keyword(s) with poor  | Expectations                                  |
> | results before        |                                               |
> |-----------------------+-----------------------------------------------|
> | dig                   | ~bind~ near top.                              |

Hum, indeed and I do not know if we can improve here.  Well, it’s hard
to improve for short terms, BTW.

--8<---------------cut here---------------start------------->8---
$ ./pre-inst-env guix search dig | recsel -p name,relevance | head -8
name: go-go-uber-org-dig
relevance: 104

name: rust-num-bigint-dig
relevance: 78

name: rust-num-bigint-dig
relevance: 78
--8<---------------cut here---------------end--------------->8---

Compared to current:

--8<---------------cut here---------------start------------->8---
$ guix search dig | recsel -p name,relevance | head -8
name: sysdig
relevance: 24

name: texlive-pedigree-perl
relevance: 13

name: ruby-net-http-digest-auth
relevance: 13
--8<---------------cut here---------------end--------------->8---

Indeed, 17th position is better than 609th.
But if you add a term such as ’dns’, bang! :-)  Well, BTW the description
of ’bind’ could be improved a bit because the word “network” does not
appear.  Anyway. :-)

Hum, why does this:

    guix search ' dig$' dig | recsel -p name,relevance | head -8

not return the package ’bind’?

> | rsh                   | ~inetutils~ near top.                         |

--8<---------------cut here---------------start------------->8---
$ ./pre-inst-env guix search rsh | recsel -p name,relevance | head -8
name: inetutils
relevance: 26

name: emacs-tramp
relevance: 26

name: rust-borsh-schema-derive-internal
relevance: 22
--8<---------------cut here---------------end--------------->8---

Compared to current:

--8<---------------cut here---------------start------------->8---
$ guix search rsh | recsel -p name,relevance | head -8
name: go-sigs-k8s-io-yaml
relevance: 14

name: python-pymarshal
relevance: 13

name: emacs-powershell
relevance: 13
--8<---------------cut here---------------end--------------->8---

> | c                     | C language related.                           |
> | c compiler            | Compiler-related C stuff.                     |

This cannot be improved.

> | r                     | R language related.                           |

Usually, I add the prefix ^r\- and I have no issue when searching for R
packages.  For instance, search for ^r\- plus a keyword and it works well.

    $ guix search ^r\- cyto | recsel -CP name | cut -f1 -d'-' | uniq -c
         29 r

Somehow, I do not think we can improve here.  I mean, the improvement is
to document the usage of prefixes.  Similarly for ghc (haskell), ocaml,
python, etc.

> | tor                   | Tor related; ~torbrowser~ somewhere near top. |

--8<---------------cut here---------------start------------->8---
$ ./pre-inst-env guix search tor | recsel -p name,relevance | head -8
name: tor
relevance: 208

name: tor-client
relevance: 169

name: torsocks
relevance: 103
--8<---------------cut here---------------end--------------->8---

Compared to current:

--8<---------------cut here---------------start------------->8---
$ guix search tor | recsel -p name,relevance | head -8
name: tor
relevance: 47

name: ghc-storablevector
relevance: 29

name: tor-client
relevance: 28
--8<---------------cut here---------------end--------------->8---

However, the position of ’torbrowser’ moves from 225th to 19th.

    $ guix search tor | recsel -P name | grep -n torbrowser
    225:torbrowser

    $ ./pre-inst-env guix search tor | recsel -P name | grep -n torbrowser
    19:torbrowser

Similarly to ’dig’, the description of the ’torbrowser’ package could be
improved, because ’guix search tor browser’ returns nothing.

> | gcc                   | ~gcc-toolchain~ near top.                     |

Indeed, something is unexpected.  Well, first:

    $ guix search gcc | recsel -CP name | uniq | head -8
    gccgo
    gfortran-toolchain
    gdc-toolchain
    gcc-toolchain
    gcc-cross-x86_64-w64-mingw32-toolchain
    gcc-cross-or1k-elf-toolchain
    gcc-cross-i686-w64-mingw32-toolchain
    gcc-cross-avr-toolchain

    $ guix search gcc | recsel -CP name | uniq -c | sort -rn | head -8
         18 llvm
         12 gcc-toolchain
          6 libgccjit
          6 gccgo
          3 isl
          2 libstdc++-doc
          2 java-commons-cli
          2 gdc-toolchain

In other words, packages with multiple versions degrade the experience.
Well, that had already been “improved” [1] with some SEO. ;-)  Indeed,
maybe the relevance should be improved.

Second, gccgo has a relevance score of 22 with the single term ’gcc’,
compared to gcc-toolchain scoring at 15.

    gccgo               gcc-toolchain
      4 * 1 * 1           4 * 1 * 1
    + 2 * 5 * 1         + 2 * 1 * 1
    + 1 * 0             + 1 * 0
    + 3 * 1 * 1         + 3 * 1 * 1
    + 2 * 0             + 2 * 1 * 3
    + 1 * 5 * 1         + 1 * 0
    = 22                = 15

This is unexpected.  And, IMHO, that’s a bug!  In the description of
gcc-toolchain, the term ’gcc’ appears 3 times but each occurrence only
scores ’1’ instead of ’5’.

As the patch tries to address, the main issue is:

    (define (score regexp str)
      (fold-matches regexp str 0
                    (lambda (m score)
                      (+ score
                         (if (string=? (match:substring m) str)
                             5             ;exact match
                             1)))))

Here the exact match does not consider a substring exact match: a match
counts as exact only when it spans the whole string.  For instance, one
would consider that the term ’gcc’ exactly matches in “some GCC thing”.
With the current implementation, that’s not the case.  For instance, a
snippet mimicking the procedure ’score’:

--8<---------------cut here---------------start------------->8---
scheme@(guix-user)> ,use(ice-9 regex)
scheme@(guix-user)> (define regexp (make-regexp "gcc" regexp/icase))
scheme@(guix-user)> (define str "some GCC thing")
scheme@(guix-user)> (fold-matches regexp str 0 (lambda (m res) (+ res (if (string=? (match:substring m) str) 5 1))))
$2 = 1
--8<---------------cut here---------------end--------------->8---

See v2 for my proposal fixing this.  Please note that this v2 gives the
same ranking for torbrowser.  It also improves the situation with
gcc-toolchain.
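
For illustration only, here is a minimal sketch of what “counting a
whole-word occurrence as an exact match” could look like.  It is not the
code of the v2 patch: the names ’whole-word?’, ’score/word’ and ’gcc-rx’
are hypothetical, the 5/1 weights are simply reused from the ’score’
procedure above, and “word” is assumed to mean a run of letters and
digits.

```scheme
(use-modules (ice-9 regex))

;; Hypothetical helper: #t when match M is delimited on both sides by a
;; non-alphanumeric character or by the start/end of the string.
(define (whole-word? m)
  (let ((str   (match:string m))
        (start (match:start m))
        (end   (match:end m)))
    (define (boundary? index)
      (or (< index 0)
          (>= index (string-length str))
          (not (char-set-contains? char-set:letter+digit
                                   (string-ref str index)))))
    (and (boundary? (- start 1))
         (boundary? end))))

;; Hypothetical variant of 'score': a whole-word occurrence counts 5,
;; any other occurrence counts 1.
(define (score/word regexp str)
  (fold-matches regexp str 0
                (lambda (m score)
                  (+ score (if (whole-word? m) 5 1)))))

(define gcc-rx (make-regexp "gcc" regexp/icase))

(score/word gcc-rx "some GCC thing")   ;; => 5
(score/word gcc-rx "gcc-toolchain")    ;; => 5
(score/word gcc-rx "gccgo")            ;; => 1
```

With such a rule, ’gcc’ inside a description would weigh as much as an
exact field match, while a mere prefix as in ’gccgo’ would not.  The
’char-set’ lookup runs for every match, so its cost would deserve a
measurement.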

--8<---------------cut here---------------start------------->8---
$ ./pre-inst-env guix search gcc | recsel -CP name | grep -n gcc-toolchain
1:gcc-toolchain
2:gcc-toolchain
3:gcc-toolchain
4:gcc-toolchain
5:gcc-toolchain
6:gcc-toolchain
7:gcc-toolchain
8:gcc-toolchain
9:gcc-toolchain
10:gcc-toolchain
11:gcc-toolchain
12:gcc-toolchain

$ ./pre-inst-env guix search tor | recsel -CP name | grep -n torbrowser
7:torbrowser

$ ./pre-inst-env guix search dig | recsel -CP name | grep -n bind
44:bind
--8<---------------cut here---------------end--------------->8---

However, inetutils is still at 44th with the single term ’rsh’.  I would
suggest tweaking the description.

Bah, maybe it is then a bit slower on cold caches?  Hum?!  Well, I have
not investigated, not with your patch either. :-)  Well, that is
something that could be investigated, especially the performance of
’char-set’ operations.

1: https://issues.guix.gnu.org/43342

> I opted to switch to counting a maximum of one match per field, which helps
> with cases where a common subword matches /many/ times in packages with longer
> descriptions, pushing more relevant packages down.  In multi-term searches,
> the unique terms - which are naturally rarer - also contribute to a larger
> percentage of the score as a result of these changes.

> Having matches with only one word boundary be scored as 2 instead of 1 was
> done with the reasoning that a term is more likely to be part of a compound
> word name (and thus more relevant) if it is a prefix or suffix; for example,
> "gl" in OpenGL, "borg" in borgmatic, and "tor" in torbrowser.

[...]

> Closing this message on an unrelated note for future work: I stumbled on an
> interesting idea while looking for test cases which suggested reducing the
> score of a programming library when its language is not included in search
> terms.  It's out of scope for the current issue, but I thought I'd mention it
> anyways for potential further improvements.

Well, years ago I thought about implementing TF-IDF [2,3].  Other ideas
[4] are floating around.  Then, we spent some time making “guix search”
faster [5] and today my TODO is about having an extension relying on
Guile-Xapian.

Therefore, I would prefer to keep the ’relevance’ more or less
predictable by only counting the number of occurrences and applying some
weights.

Otherwise, for what my opinion is worth, the direction would not be to
re-invent an algorithm but rather to implement an already well-known
one.  TF-IDF [3] is one, Okapi BM25 [6] is another, etc.  All in all,
that is what Xapian provides. ;-)  And it does it very well!  That’s why
I would be tempted to have a Guix extension relying on Guile-Xapian for
indexing and searching (fast!).

Cheers,
simon

2: Re: Organizing packages
zimoun
Tue, 16 Jul 2019 19:04:26 +0200
id:CAJ3okZ0LaJzWDBA7bjqZew_jAmtt1rj9PJhevwrtBiA_COXENg@mail.gmail.com
https://lists.gnu.org/archive/html/guix-devel/2019-07
https://yhetil.org/guix/CAJ3okZ0LaJzWDBA7bjqZew_jAmtt1rj9PJhevwrtBiA_COXENg@mail.gmail.com

3: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

4: Inverted index to accelerate guix package search
Arun Isaac
Sun, 12 Jan 2020 20:33:51 +0530
id:cu7h810emy0.fsf@systemreboot.net
https://lists.gnu.org/archive/html/guix-devel/2020-01
https://yhetil.org/guix/cu7h810emy0.fsf@systemreboot.net

5: [bug#39258] Faster guix search using an sqlite cache
Arun Isaac
Fri, 24 Jan 2020 01:21:57 +0530
id:cu7pnfaar36.fsf@systemreboot.net
https://issues.guix.gnu.org/39258
https://issues.guix.gnu.org/msgid/cu7pnfaar36.fsf@systemreboot.net
https://yhetil.org/guix/cu7pnfaar36.fsf@systemreboot.net

6: https://en.wikipedia.org/wiki/Okapi_BM25