From mboxrd@z Thu Jan 1 00:00:00 1970 Path: main.gmane.org!not-for-mail From: Kai.Grossjohann@CS.Uni-Dortmund.DE (Kai =?iso-8859-1?q?Gro=DFjohann?=) Newsgroups: gmane.emacs.devel Subject: Re: Apropos commands and regexps Date: Thu, 16 May 2002 13:04:13 +0200 Sender: emacs-devel-admin@gnu.org Message-ID: References: <5xg00y41zj.fsf@kfs2.cua.dk> NNTP-Posting-Host: localhost.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Trace: main.gmane.org 1021547232 8818 127.0.0.1 (16 May 2002 11:07:12 GMT) X-Complaints-To: usenet@main.gmane.org NNTP-Posting-Date: Thu, 16 May 2002 11:07:12 +0000 (UTC) Cc: emacs-devel@gnu.org Return-path: Original-Received: from quimby.gnus.org ([80.91.224.244]) by main.gmane.org with esmtp (Exim 3.33 #1 (Debian)) id 178J6G-0002I7-00 for ; Thu, 16 May 2002 13:07:12 +0200 Original-Received: from fencepost.gnu.org ([199.232.76.164]) by quimby.gnus.org with esmtp (Exim 3.12 #1 (Debian)) id 178JIE-0000Dw-00 for ; Thu, 16 May 2002 13:19:34 +0200 Original-Received: from localhost ([127.0.0.1] helo=fencepost.gnu.org) by fencepost.gnu.org with esmtp (Exim 3.34 #1 (Debian)) id 178J6B-0005FW-00; Thu, 16 May 2002 07:07:07 -0400 Original-Received: from waldorf.cs.uni-dortmund.de ([129.217.4.42]) by fencepost.gnu.org with esmtp (Exim 3.34 #1 (Debian)) id 178J3U-00058H-00 for ; Thu, 16 May 2002 07:04:20 -0400 Original-Received: from lothlorien.cs.uni-dortmund.de (lothlorien [129.217.19.67]) by waldorf.cs.uni-dortmund.de with ESMTP id g4GB4Jb23498; Thu, 16 May 2002 13:04:19 +0200 (MES) Original-Received: from lucy.cs.uni-dortmund.de (lucy [129.217.19.80]) by lothlorien.cs.uni-dortmund.de id NAA12543; Thu, 16 May 2002 13:04:14 +0200 (MET DST) Original-Received: by lucy.cs.uni-dortmund.de (Postfix, from userid 6104) id 000DD3B41E; Thu, 16 May 2002 13:04:13 +0200 (CEST) Original-To: storm@cua.dk (Kim F. Storm) In-Reply-To: <5xg00y41zj.fsf@kfs2.cua.dk> (storm@cua.dk's message of "12 May 2002 02:57:20 +0200") Original-Lines: 104 User-Agent: Gnus/5.090007 (Oort Gnus v0.07) Emacs/21.2.50 (i686-pc-linux-gnu) Errors-To: emacs-devel-admin@gnu.org X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.0.9 Precedence: bulk List-Help: List-Post: List-Subscribe: , List-Id: Emacs development discussions. List-Unsubscribe: , List-Archive: Xref: main.gmane.org gmane.emacs.devel:3997 X-Report-Spam: http://spam.gmane.org/gmane.emacs.devel:3997 storm@cua.dk (Kim F. Storm) writes: > Wouldn't it be simpler (for a novice user -- and for advanced users > too) to simply write one or more words (substrings) and then search > for all combinations of those words (substrings) in the relevant list. There has been a lengthy discussion on this. Since my research area is Information Retrieval, maybe I should say something about this... IMHO, the long-term goal for searching the Emacs documentation should look similar to the following: The user enters a query. The system searches items of documentation and computes a score for each one. Then the items with the highest score come out first. Now we need to decide on the query language (query format), and we need to decide on the method of computing a score for each item of documentation. For the query language, I see these possibilities: * List of words. Here, items containing all words will come out first, followed by items with all but one word, and so on. The presence or absence of a very common word has less effect on the score than the presence or absence of an unusual word. (Extreme example: if the user types "A B C" and there is only one item of documentation matching C but A and B are very common, then that item should come out first.) * List of words, with optional "+" or "-" prefixes. The idea is that words prefixed with + must occur, whereas words prefixed with - must not occur. The presence or absence of unprefixed words just raises/lowers the score, as in the first alternative. * Boolean expression with parentheses and AND, OR, NOT connectives. Actually, these connectives just combine the individual scores. For example, if some item has score X w.r.t. query A and score Y w.r.t. query B, then the score for A AND B could be X * Y, the score for A OR B could be X + Y - (X * Y), and so on. These formulas come from a probabilistic interpretation of the scores (which are assumed to be between 0 and 1). But other useful formulas could be found, with different theoretical foundations. Maybe people can suggest other possibilities. And then we need to compute the scores for each individual "word" in the query. The Emacs documentation has a complex structure (for the traditional Information Retrieval crowd anyway, who looks at retrieving "documents" where each such document is just a sequence of terms). A number of rules come to my mind that might be useful to implement: * If the word occurs in a command/function/variable name, then the score should be higher than a match in the docstring (or other explanatory text) only. * If the word does not occur at all, but a synonym of the word does, the item should match (perhaps with a lowered score). * Instead of just synonyms, also consider more general terms, more specific terms, related terms. * If the word does not occur, but a derived form does, then the item should match (perhaps with a lowered score). So "mouse" should find "mice" and so on. The Porter stemming algorithm appears to be a useful thing here. * I guess that "igrep" should be considered a "derived form" of "grep" in the context of the Emacs documentation. Do we do this with an explicit synonym list? Or perhaps with a metric of similarity between terms which is based on editing distance or suchlike? And then, there is the issue of where to look when searching. Sometimes, people want to search in the docstrings, sometimes only in the command names. When searching in info files, some subdivision makes sense I think. Should each node be considered a retrievable item, or is another subdivision more sensible? The above should be interpreted as a source of ideas. I think I would like to implement most of the mechanisms someday, but who knows when I will be able to do that. You just select from this something which appears useful and implement that. I haven't covered result presentation at all, yet. For instance, a cool replacement (complement?) for M-x apropos RET could be something that splits the query and each Lisp symbol into words. So now the query is a set of words and the Lisp symbol is a set of words. The score for the Lisp symbol would be the number of elements in the intersection of these two sets. Sort the result by decreasing score. (Also offer to sort by name of symbol, while displaying the score.) What do you think? kai -- Silence is foo!