From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Drew Adams <drew.adams@oracle.com>
Newsgroups: gmane.emacs.devel
Subject: RE: Why do apropos commands match only pairs of words in a word-list
	pattern?
Date: Mon, 11 May 2015 10:01:36 -0700 (PDT)
Message-ID: <d114bca8-efe7-418a-86d0-f018a68b4718@default>
References: <<eb17cadb-6235-4c3f-919a-6ca8dcc1da6d@default>>
	<<87pp671ygs.fsf@yahoo.fr>>
	<<fb8ad237-c97f-4ec9-94b4-6937e4d01abc@default>>
	<<83lhgvma79.fsf@gnu.org>>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: quoted-printable
X-Trace: ger.gmane.org 1431363724 9322 80.91.229.3 (11 May 2015 17:02:04 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Mon, 11 May 2015 17:02:04 +0000 (UTC)
Cc: theonewiththeevillook@yahoo.fr, emacs-devel@gnu.org
To: Eli Zaretskii <eliz@gnu.org>, Drew Adams <drew.adams@oracle.com>
In-Reply-To: <<83lhgvma79.fsf@gnu.org>>
Xref: news.gmane.org gmane.emacs.devel:186423
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/186423>

> the design was discussed in this long thread:
>   http://lists.gnu.org/archive/html/emacs-devel/2002-05/msg00397.html
> You will see that the heuristic in question did get some attention.

Thanks for the reference.  I'll pull out what I see as a summary, with
some comments from me.

The arguments given there in favor of the current, 2-or-more, design and
against a straightforward AND design boiled down to these 3, all from Kim:

1. Emacs returns few hits anyway.

 For WEB search engines, I think AND does make sense -- since there
 are SOOOO many pages to match.  But for a limited universe like
 emacs -- which doesn't always use the most obvious terms --
 using AND doesn't make a lot of sense to me.

2. It's good enough.

 I think it is adequate in practice.

3. It is more helpful when you don't know exactly what you're looking for.

 [it] has a more "novice" appeal: if don't know what a specific
 function is called, it will be easier to enter a few more alternatives,
 and see what turns up. -- it specifying more words returns more
 alternatives.

 matching at least two keywords will find all the entries found by
 searching for all combinations - and it may find some entries the
 user didn't think about

My response to these arguments:

1. It's not clear to me that the "limited universe" of Emacs is so
   limited that it is helpful to include the noise of false positives.

2. And that "adequate in practice" argument echoes the more-or-less
   apologetic comment in the code that suggests that the design is
   not ideal (not really what we want) but is probably OK in general.

3. And I think it is a mistake to try to be "smart", guessing
   what's best for a novice, by using "dumb" matching.  If you want
   to try to be smart then you need to do something more/other than
   just return all matches of any two of the words.
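
To make the difference concrete, here is a minimal Python sketch of the
two rules.  It is not the apropos.el code: it assumes plain substring
matching on distinct words, and the function names and the short list of
symbol names are chosen only for illustration.

```python
def matched_words(name, words):
    """Return the query words that occur as substrings of name."""
    return [w for w in words if w in name]


def match_all(names, words):
    """Straightforward AND: keep names that contain every word."""
    return [n for n in names if len(matched_words(n, words)) == len(words)]


def match_any_two(names, words):
    """Pairwise rule: keep names that contain at least two of the words.

    With a single-word query, one match is enough.
    """
    needed = min(2, len(words))
    return [n for n in names if len(matched_words(n, words)) >= needed]


# Illustrative sample only; a real search covers thousands of symbols.
names = [
    "kill-buffer-and-window",
    "kill-buffer",
    "window-buffer",
    "kill-line",
    "delete-window",
]
words = ["kill", "buffer", "window"]

and_hits = match_all(names, words)
pair_hits = match_any_two(names, words)
print("AND:     ", and_hits)
print("pairwise:", pair_hits)
```

Every AND hit is also a pairwise hit.  The pairwise rule additionally
returns names that miss one of the words, which is the noise I am
talking about.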

More importantly, as I said, I think this should be a user choice,
not just a design-time choice.  Even Kim suggested user choice:

 We could put a "button bar" at the top of the apropos output with
 the following buttons:

   [Match all words]  [anchored match]  [search documentation]

(No such user choice was ever implemented, AFAIK.)

Back to my summary of the thread -

A certain Eli Z came out clearly in favor of AND, and against OR'd
pairwise (AND) matches:

 Perhaps that's because they want to show off the number of hits
 they return.  I was always annoyed by ORing, and many times catch
 myself forgetting to type the magic that makes it do an AND.
 Because I always want the AND method.

and

 > I don't like the "and" approach -- at least not as the default.

 I'm afraid anything else will bring too many hits.  A docs search
 tool that returns gobs of information is not very useful, in my
 experience.

and

 I'm afraid this rule will bring many false hits, and I think we
 should beware of that as the plague.

Kim then backed off a bit from pairwise matching:

 if matching only two words gives too many matches for documentation,
 require three (or four) matching words.

To which Mr Z said:

 a rule based on the number of matched keywords is not good enough,
 since sometimes even one word is enough to yield a very accurate
 result.

Miles said (and Mr Z agreed):

 I think it's clear that we need a bit of experience with this
 stuff, so we can see how well the various alternatives actually
 work in practice, rather than sitting around pontificating...

Well, we don't seem to have experimented with different approaches,
but rather have just gone with OR'd AND'd pair matches. In the end,
RMS decided that pairwise matching "seems more useful", Kim
implemented it, and that was that.

Note that Kai G mentioned what Nicolas R suggested recently: It's OK
to return tons of hits, including noise, if you sort by relevance.
And RMS said "that is the best way to handle the argument".

But as I said before, a major problem with that approach is that it
interferes with other sort possibilities.  If you have many hits, most
of which are noise, the *only* order in which the noise can reasonably
be ignored is relevance (e.g. more AND matches first, fewer later).
You cannot reorder those alphabetically, or by putting all function
names first, then variable names, or any other meaningful order.
Doing that would push all the noise throughout the list of hits.

IOW, a high-noise (high recall, low precision) return set requires
an ordering that keeps the noise farther from immediate view.  Users
deserve to be able to use different sort orders, and the approach
of noise-might-help-sometimes-&-costs-nothing-if-far-from-view
interferes with sorting.
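
A small Python sketch of that interference, under the same simplifying
assumptions as before (substring matching, relevance taken as the number
of query words matched).  The scoring function and the sample names are
hypothetical; "apply-kill-to-buffer-window" in particular is invented
for the example.

```python
def score(name, words):
    """Relevance, taken here as the number of query words found in name."""
    return sum(1 for w in words if w in name)


def full_match_positions(ordered, words):
    """Indexes in ORDERED of the hits that match every query word."""
    return [i for i, n in enumerate(ordered) if score(n, words) == len(words)]


words = ["kill", "buffer", "window"]

# Hypothetical pairwise hit list: each name matches at least two words.
hits = [
    "window-buffer",
    "kill-buffer",
    "display-buffer-pop-up-window",
    "kill-buffer-and-window",
    "apply-kill-to-buffer-window",
]

# Relevance order: most matched words first, ties broken alphabetically.
by_relevance = sorted(hits, key=lambda n: (-score(n, words), n))

# Alphabetical order: the order a user might actually ask for.
alphabetical = sorted(hits)

relevance_positions = full_match_positions(by_relevance, words)
alphabetical_positions = full_match_positions(alphabetical, words)
print("full matches, relevance order:   ", relevance_positions)
print("full matches, alphabetical order:", alphabetical_positions)
```

In relevance order the full matches sit together at the top.  In
alphabetical order they are separated by partial matches, and with a
realistic number of hits the gaps would be much larger.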

Finally, Eli mentioned also the possibility that is used in Icicles
and probably some other completion UIs: let the user progressively
refine the set of returned hits.

 The user enters a query.  The system does the search and presents
 a menu of possible refinements of the original search spec.  The
 user chooses one of the possibilities, and the process repeats,
 until the list of possible hits is shorter than some predefined
 value; when that happens, the list of hits is displayed.

 The user never needs to wade through gobs of hits, trying to
 figure out which one is relevant to his/her query.

Whether the hits returned at each refinement stage are displayed
or not is not the question (IMO).  In Icicles, a user can choose
whether to see the hits at each stage.  But it is generally useful,
IMO, to show them, even when there are many.  That doesn't imply
that a user must "wade through" them, but s?he can get an idea=20
of what's there - and that helps guide upcoming refinement patterns.
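
Progressive refinement can be sketched roughly as follows, in Python.
This is not how Icicles is implemented; the function names, the
substring matching, and the threshold parameter are assumptions made for
the sketch.  Each stage keeps its hits, so a UI is free to show them or
not.

```python
def refine(candidates, pattern):
    """Keep only the candidates that contain PATTERN as a substring."""
    return [c for c in candidates if pattern in c]


def progressive_search(candidates, patterns, threshold):
    """Apply PATTERNS one at a time, narrowing the previous stage's hits.

    Stops as soon as a stage has fewer than THRESHOLD hits.  Returns a
    list of (pattern, hits) pairs, one per stage actually run.
    """
    stages = []
    current = list(candidates)
    for pattern in patterns:
        current = refine(current, pattern)
        stages.append((pattern, current))
        if len(current) < threshold:
            break
    return stages


# Illustrative sample only.
names = [
    "kill-buffer-and-window",
    "kill-buffer",
    "window-buffer",
    "kill-line",
    "delete-window",
]

stages = progressive_search(names, ["buffer", "kill", "window"], threshold=2)
for pattern, hits in stages:
    print(f"after {pattern!r}: {len(hits)} hits: {hits}")
```

Seeing that "buffer" alone leaves three candidates, and what they are,
is what suggests the next pattern to try.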