From mboxrd@z Thu Jan  1 00:00:00 1970
Path: main.gmane.org!not-for-mail
From: Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai =?iso-8859-1?q?Gro=DFjohann?=)
Newsgroups: gmane.emacs.devel
Subject: Re: Apropos commands and regexps
Date: Thu, 16 May 2002 13:04:13 +0200
Sender: emacs-devel-admin@gnu.org
Message-ID: <vafvg9o8ic2.fsf@INBOX.auto.emacs.devel.tok.lucy.cs.uni-dortmund.de>
References: <5xg00y41zj.fsf@kfs2.cua.dk>
NNTP-Posting-Host: localhost.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: main.gmane.org 1021547232 8818 127.0.0.1 (16 May 2002 11:07:12 GMT)
X-Complaints-To: usenet@main.gmane.org
NNTP-Posting-Date: Thu, 16 May 2002 11:07:12 +0000 (UTC)
Cc: emacs-devel@gnu.org
Return-path: <emacs-devel-admin@gnu.org>
Original-Received: from quimby.gnus.org ([80.91.224.244])
	by main.gmane.org with esmtp (Exim 3.33 #1 (Debian))
	id 178J6G-0002I7-00
	for <emacs-devel@main.gmane.org>; Thu, 16 May 2002 13:07:12 +0200
Original-Received: from fencepost.gnu.org ([199.232.76.164])
	by quimby.gnus.org with esmtp (Exim 3.12 #1 (Debian))
	id 178JIE-0000Dw-00
	for <emacs-devel@quimby.gnus.org>; Thu, 16 May 2002 13:19:34 +0200
Original-Received: from localhost ([127.0.0.1] helo=fencepost.gnu.org)
	by fencepost.gnu.org with esmtp (Exim 3.34 #1 (Debian))
	id 178J6B-0005FW-00; Thu, 16 May 2002 07:07:07 -0400
Original-Received: from waldorf.cs.uni-dortmund.de ([129.217.4.42])
	by fencepost.gnu.org with esmtp (Exim 3.34 #1 (Debian))
	id 178J3U-00058H-00
	for <emacs-devel@gnu.org>; Thu, 16 May 2002 07:04:20 -0400
Original-Received: from lothlorien.cs.uni-dortmund.de (lothlorien [129.217.19.67])
        by waldorf.cs.uni-dortmund.de  with ESMTP id g4GB4Jb23498;
	Thu, 16 May 2002 13:04:19 +0200 (MES)
Original-Received: from lucy.cs.uni-dortmund.de (lucy [129.217.19.80])
	by lothlorien.cs.uni-dortmund.de id NAA12543; Thu, 16 May 2002 13:04:14 +0200 (MET DST)
Original-Received: by lucy.cs.uni-dortmund.de (Postfix, from userid 6104)
	id 000DD3B41E; Thu, 16 May 2002 13:04:13 +0200 (CEST)
Original-To: storm@cua.dk (Kim F. Storm)
In-Reply-To: <5xg00y41zj.fsf@kfs2.cua.dk> (storm@cua.dk's message of "12 May
 2002 02:57:20 +0200")
Original-Lines: 104
User-Agent: Gnus/5.090007 (Oort Gnus v0.07) Emacs/21.2.50
 (i686-pc-linux-gnu)
Errors-To: emacs-devel-admin@gnu.org
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.0.9
Precedence: bulk
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Post: <mailto:emacs-devel@gnu.org>
List-Subscribe: <http://mail.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
List-Id: Emacs development discussions. <emacs-devel.gnu.org>
List-Unsubscribe: <http://mail.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://mail.gnu.org/pipermail/emacs-devel/>
Xref: main.gmane.org gmane.emacs.devel:3997
X-Report-Spam: http://spam.gmane.org/gmane.emacs.devel:3997

storm@cua.dk (Kim F. Storm) writes:

> Wouldn't it be simpler (for a novice user -- and for advanced users
> too) to simply write one or more words (substrings) and then search
> for all combinations of those words (substrings) in the relevant list.

There has been a lengthy discussion on this.  Since my research area
is Information Retrieval, maybe I should say something about this...

IMHO, the long-term goal for searching the Emacs documentation should
look similar to the following:

The user enters a query.  The system searches items of documentation
and computes a score for each one.  Then the items with the highest
score come out first.

Now we need to decide on the query language (query format), and we
need to decide on the method of computing a score for each item of
documentation.

For the query language, I see these possibilities:

* List of words.

  Here, items containing all words will come out first, followed by
  items with all but one word, and so on.  The presence or absence of
  a very common word has less effect on the score than the presence
  or absence of an unusual word.  (Extreme example: if the user types
  "A B C" and there is only one item of documentation matching C but
  A and B are very common, then that item should come out first.)

* List of words, with optional "+" or "-" prefixes.

  The idea is that words prefixed with + must occur, whereas words
  prefixed with - must not occur.  The presence or absence of
  unprefixed words just raises/lowers the score, as in the first
  alternative.

* Boolean expression with parentheses and AND, OR, NOT connectives.

  Actually, these connectives just combine the individual scores.
  For example, if some item has score X w.r.t. query A and score Y
  w.r.t. query B, then the score for A AND B could be X * Y, the
  score for A OR B could be X + Y - (X * Y), and so on.  These
  formulas come from a probabilistic interpretation of the scores
  (which are assumed to be between 0 and 1).  But other useful
  formulas could be found, with different theoretical foundations.

Maybe people can suggest other possibilities.

And then we need to compute the scores for each individual "word" in
the query.  The Emacs documentation has a complex structure (for the
traditional Information Retrieval crowd anyway, who looks at
retrieving "documents" where each such document is just a sequence of
terms).  A number of rules come to my mind that might be useful to
implement:

* If the word occurs in a command/function/variable name, then the
  score should be higher than a match in the docstring (or other
  explanatory text) only.

* If the word does not occur at all, but a synonym of the word does,
  the item should match (perhaps with a lowered score).

* Instead of just synonyms, also consider more general terms,
  more specific terms, related terms.

* If the word does not occur, but a derived form does, then the item
  should match (perhaps with a lowered score).  So "mouse" should
  find "mice" and so on.  The Porter stemming algorithm appears to be
  a useful thing here.

* I guess that "igrep" should be considered a "derived form" of "grep"
  in the context of the Emacs documentation.  Do we do this with an
  explicit synonym list?  Or perhaps with a metric of similarity
  between terms which is based on editing distance or suchlike?

And then, there is the issue of where to look when searching.
Sometimes, people want to search in the docstrings, sometimes only in
the command names.

When searching in info files, some subdivision makes sense I think.
Should each node be considered a retrievable item, or is another
subdivision more sensible?

The above should be interpreted as a source of ideas.  I think I
would like to implement most of the mechanisms someday, but who knows
when I will be able to do that.  You just select from this something
which appears useful and implement that.  I haven't covered result
presentation at all, yet.

For instance, a cool replacement (complement?) for M-x apropos RET
could be something that splits the query and each Lisp symbol into
words.  So now the query is a set of words and the Lisp symbol is a
set of words.  The score for the Lisp symbol would be the number of
elements in the intersection of these two sets.  Sort the result by
decreasing score.  (Also offer to sort by name of symbol, while
displaying the score.)

What do you think?

kai
-- 
Silence is foo!