From: zimoun
Date: Tue, 9 Mar 2021 19:37:23 +0100
Subject: Re: Search improvements (Was: Opposition to new single-letter package name "t")
To: Tobias Geerinckx-Rice
Cc: LibreMiami, Raghav Gururajan, jgart, Nicolas Goaziou, Guix Devel
In-Reply-To: <87h7lkmhb5.fsf@nckx>

Hi Tobias,

On Tue, 9 Mar 2021 at 18:14, Tobias Geerinckx-Rice wrote:

> For most upstreams whether or not dashes were in vogue[0] when
> they named their project is literally arbitrary. We'd penalise
> many other packages like texlive-todonotes, open{ssh,vpn,*},
> ktexteditor, r-performanceanalytics, qutebrowser, ... It's not a
> net win.

I am not sure I understand what you mean here.

> If I might pet my own peeve, I think clever heuristics appear
> necessary in part because %package-metrics grossly overscores
> package names. Rank them *below* synopsis & description--which
> will contain the name anyway--with a metric of 1, maybe 2. Enough
> to keep the relevant stuff above the irrelevant stuff (python- >
> ruby-, etc.) without distorting things as they do now.

I actually did the math, i.e., I wrote down the scoring function, something like (to simplify)

    score(package, query) = sum_{term in query} (wS cS + wD cD + wN cN)

where wS, wD, wN are the weights for synopsis, description and name, and cS, cD, cN are the corresponding numbers of occurrences of the term. Then, for example, I computed the Jacobian and so on, in order to see the relation between the weights w* and the numbers of occurrences c*. I also looked at the condition for having:

    score(package_1, query) = score(package_2, query)

and basically, using the linear relevance as it is currently, the weights (%package-metrics) are not so bad; you cannot find a really better heuristic.
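
To make the linear scoring function above concrete, here is a minimal sketch in Python. It is only an illustration, not what "guix search" actually runs: the weights in WEIGHTS are hypothetical values picked for the example (they are not the real %package-metrics), the two packages are made up, and occurrences are counted as plain case-insensitive substrings, which is an assumption.

```python
import re

# Hypothetical weights (wN, wS, wD), chosen only for illustration;
# they are NOT the actual values of %package-metrics in Guix.
WEIGHTS = {"name": 4, "synopsis": 2, "description": 1}


def count_occurrences(term, text):
    """Count case-insensitive, non-overlapping occurrences of term in text."""
    return len(re.findall(re.escape(term.lower()), text.lower()))


def score(package, query):
    """Linear, local score: sum over query terms of wS*cS + wD*cD + wN*cN."""
    total = 0
    for term in query.split():
        for field, weight in WEIGHTS.items():
            total += weight * count_occurrences(term, package.get(field, ""))
    return total


# Two made-up packages, for illustration only.
pkg_a = {
    "name": "foo-compiler",
    "synopsis": "A compiler for Foo",
    "description": "This compiler translates Foo programs.",
}
pkg_b = {
    "name": "bar",
    "synopsis": "Text editor",
    "description": "An editor with a built-in compiler for macros.",
}

for pkg in (pkg_a, pkg_b):
    print(pkg["name"], score(pkg, "compiler"), score(pkg, "foo compiler"))
```

The score is "local" in the sense that each package is scored only from its own fields; nothing in score() looks at how common a term is across the other packages.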

Another conclusion is that it really depends on the number of terms in the query. Basically, if you type one term, you know what you are looking for: it is the package name, but you are not sure of it. With more terms, the result currently depends strongly on the quality of the synopsis and description. For instance, try:

    guix search gnu compiler

and compare the descriptions of all the packages with a relevance higher than 4 (gcc-toolchain).

Well, with a linear and local scoring function as it is currently, you cannot improve much, IMHO. By local, I mean only considering the words of one package independently of the words of other packages. That is why I suggest TF-IDF [1], which also takes into account how rare a term is across all the packages. For a concrete example, see .

Once you have a TF-IDF, the natural scoring is BM25 [2]. Well, it is included in Xapian and there is a patch by Arun using Xapian as a backend for "guix search", see .

What is missing is a good evaluation, i.e., example queries. I asked for such examples (what query a user types and what they expect) here, but no one replied; since I am comfortable enough searching with Guix, and other bugs are more annoying for my workflow, I moved on to other things. For another discussion on the topic, see .

Since 2020, I have been reading about "word embedding" (part of the vogue[0] graph neural nets), and I think it would be a great project: first, some vogue[0] stats to evaluate how the packages cluster together, i.e., is emacs-foo closer to emacs-bar or to python-foo? Second, depending on the results, implement such an embedding to improve "guix search". The first means using Julia (or packaging PyTorch for Guix ;-)) and the second means an implementation targeting Guile (it could be awesome to have an equivalent of Zygote [3,4] for Guile).

0: Not a joke. :-)
1:
2:
3:
4:

Cheers,
simon