From: zimoun
Date: Tue, 9 Mar 2021 16:12:24 +0100
Subject: Re: Search improvements (Was: Opposition to new single-letter package name "t")
To: Taylan Kammer
Cc: LibreMiami, Raghav Gururajan, jgart, Nicolas Goaziou, Guix Devel

Hi,

On Tue, 9 Mar 2021 at 14:37, Taylan Kammer wrote:

> This discussion made me realize that "guix search" might benefit from
> the following improvement though: I think the relevance score for a
> search result should be increased significantly if the searched word is
> a standalone (not substring) part of a package's name when the name is
> split into dash-separated words.

Currently, a perfect match uses the weight of 5 and a substring match
uses 1.  You are proposing to add something in between, say 3, for a
perfect match on a substring delimited by dashes.  Why not.

> For instance, the package "emacs-hl-todo" should get a much higher score
> than "emacs-mastodon" when searching for "todo".  Currently the Mastodon
> one has score 11 and the todo one only 9.

Here is how the relevance score reads:

query: todo

| field       | emacs-hl-todo  | emacs-mastodon  | weight |
|-------------+----------------+-----------------+--------|
| name        | 1              | 1               | 4      |
| synopsis    | 1              | 1               | 3      |
| description | 1              | 2               | 2      |
|-------------+----------------+-----------------+--------|
| total       | 1*4+1*3+2*1= 9 | 1*4+1*3+2*2= 11 |        |

Therefore, something looks wrong here: the score for emacs-hl-todo
should be 1*4+1*5*3+1*5*2= 29 because the term TODO should be considered
as a perfect match for the query todo.

> The same thing goes for the synopsis and description of the package, but
> with respectively lower increases to the score.  (I.e.
> name > synopsis > description.)

Your proposal just needs a tweak of 'score' in the function 'relevance'
from (guix ui).  The weight for each field is a separate part (see
%package-metrics in (guix ui)).

> Handling of plurals like "todos" instead of "todo" would also be great
> but could be left to a later step.

The issue with this is that it is strongly tied to the language.
Therefore, an external library implementing natural language processing
would have to be added.  And I am not convinced it is worth it at the
CLI level.

> Any thoughts about / objections to this idea?  To be honest I haven't
> checked if there's maybe already a bug report about this.

If you are interested, there is such a discussion in this heavy thread:

And the 'relevance' function could be improved, for sure.  For example,
I proposed TF-IDF here:

and I did some small calculations (optimization) to compute "better"
relevance weights (%package-metrics), but the current choices are not so
bad and are simple enough. :-)

Last week, I started to examine a strategy based on Bag-of-Words and
some word embedding strategies, mimicking a simple autoencoder [1] such
as Word2Vec [2].  Since the Guile tools are poor in this field, I have
started to use Julia first, to see whether such a solution is worth
implementing.  My idea is to see how the packages cluster based on the
synopsis+description information; then, ideally, based on this, we
should be able to define package similarity and "synonyms".

Well, if you are a student and you are looking for a cool project about
Machine Learning and Data Science, ping me. :-)

1:
2:

Cheers,
simon
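
To make the arithmetic in the table concrete, here is a minimal sketch of
the scoring described in this message: each match scores 5 when it covers
the whole field and 1 when it is a substring, and the per-field sums are
multiplied by the field weights 4, 3 and 2.  It is written in Python for
illustration and is not the Guile code from (guix ui).  The function
names, the case-insensitive matching, and the in-between score of 3 for a
dash-delimited word in the name are assumptions, and the synopsis and
description strings are placeholders chosen only to reproduce the match
counts in the table.

```python
import re

# Field weights as given in the table (name 4, synopsis 3, description 2).
FIELD_WEIGHTS = {"name": 4, "synopsis": 3, "description": 2}

EXACT, SUBSTRING = 5, 1  # per-match scores quoted in the message
DASH_WORD = 3            # hypothetical value for the proposed in-between case


def is_dash_word(text, start, end):
    """True if text[start:end] is bounded by dashes or the string edges."""
    left_ok = start == 0 or text[start - 1] == "-"
    right_ok = end == len(text) or text[end] == "-"
    return left_ok and right_ok


def score(term, text, dash_words=False):
    """Add up a per-match score for every occurrence of term in text."""
    total = 0
    for m in re.finditer(re.escape(term), text, re.IGNORECASE):
        if m.start() == 0 and m.end() == len(text):
            total += EXACT       # the match covers the whole field
        elif dash_words and is_dash_word(text, m.start(), m.end()):
            total += DASH_WORD   # assumed score for a dash-delimited word
        else:
            total += SUBSTRING
    return total


def relevance(term, package, dash_words=False):
    """Weighted sum of field scores; the dash rule applies to the name only."""
    return sum(
        weight * score(term, package[field], dash_words and field == "name")
        for field, weight in FIELD_WEIGHTS.items()
    )


# Placeholder package records (illustrative strings, not the real metadata).
hl_todo = {
    "name": "emacs-hl-todo",
    "synopsis": "Emacs mode to highlight TODO and similar keywords",
    "description": "Highlight TODO and similar keywords in comments and strings.",
}
mastodon = {
    "name": "emacs-mastodon",
    "synopsis": "Emacs client for Mastodon",
    "description": "mastodon.el is an Emacs client for Mastodon, a federated social network.",
}

print(relevance("todo", hl_todo), relevance("todo", mastodon))
print(relevance("todo", hl_todo, dash_words=True),
      relevance("todo", mastodon, dash_words=True))
```

Under these assumptions the first line reproduces the totals of the
table, and with the dash rule enabled emacs-hl-todo moves ahead of
emacs-mastodon, because "todo" is a whole dash-separated word in the
first name and only a fragment of "mastodon" in the second.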