From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: Character group folding in searches
Date: Fri, 06 Feb 2015 16:32:57 +0200
Message-ID: <83zj8rcdpi.fsf@gnu.org>
To: bruce.connor.am@gmail.com
Cc: emacs-devel@gnu.org

> Date: Fri, 6 Feb 2015 11:04:03 -0200
> From: Artur Malabarba
>
> 1. Follow the `decomposition' char property. For instance, the
> character "a" in the search string would match any one of "aãáâ" (and
> so on). This is easy to do, and one of the patches below already shows
> how. Note that this won't handle symbols that are actually composed of
> multiple characters.
>
> 2. Follow an intuitive sense of similarity which is not defined in the
> unicode standard. For instance, an ascii single quote in the search
> string should match any type of single quote (there are about a dozen
> that I know of).
>
> 3. Ignore modifier (non-spacing) characters. Another way of writing
> "á" is to write "a" followed by a special non-spacing acute. This
> kind of thing (a symbol composed of multiple characters) is not
> handled by item 1, so I'm listing it as a separate point.
>
> 4. Perform the conversion two-ways. That is, item 1 should work even
> if the search contained "á" instead of "a". Item 2 should match an
> ascii quote if the search string contains a curly quote. This is
> mostly useful when the user copies a fancy string from somewhere and
> pastes it into the search field.
>
> 5. It should work for any searching, not just isearch.

The full set of "folding" transformations is described in the Unicode
technical report UTR #30. It was withdrawn, but its last draft is
still enlightening. I think we should support some subset of what's
described there.

The way to do it IMO is to generate a set of char-tables where each
character is mapped to its folded variant, one char-table for each
subset of folding. A character whose folding is not a single
character should map to a vector or a string of characters (not sure
which one is best, we should choose the one that lends itself to the
most efficient use).

I think the best approach is to modify search.c to be able to handle
folding that produces more than a single character. I think we will
also need search.c to support several alternative foldings for the
same search operation. Making these changes would be relatively easy,
I think, and once it's done, all the rest will "just work", because
the basic search algorithms don't need to be touched.

As a final general remark, I don't think I like the "group" part of
the terminology. Why not use "character-folding" instead? It's what
this is called out there.

> * group-folding-with-regexp-lisp.patch
>
> This one takes each input character and either keeps it verbatim or
> transforms it into a regexp which matches the entire group that this
> character represents. It is implemented in isearch.
>
> + It trivially handles goals 1, 2 and 3. Because regexps are quite
> versatile, it is the only solution that handles item 3 (it allows each
> character to match more than a single character).

But the downside is that we will have to construct such regexps for
all the foldings of all the characters we want to support. That will
be quite a large database, and a lot of work to construct it.

> * group-folding-with-case-table-lisp.patch
>
> This patch is entirely in elisp. I've put it all inside `isearch.el'
> for now, for the sake of simplicity, but it's not restricted to
> isearch.
>
> It creates a new case-table which performs group folding by borrowing
> the case-folding machinery, so it is very fast. Then, group folding
> can be achieved by running the search inside a `with-group-folding`
> macro. There's also an example implementation which turns it on for
> isearch by default.
>
> + It immediately satisfies items 1, 2, 4, and 5.
> + It is very fast.
> - It has no simple way of achieving item 3.

It could use a separate case-table for item 3, couldn't it? I think
we will need separate tables for different foldings anyway, because
each use case calls for some specific folding. In isearch, the user
will have to specify which foldings she wants to be in effect.

> - If the user decides to set `group-fold-search' to t, this can break
> existing code (a disadvantage that the lisp version above does not
> have).
> - It adds two extra fields to every buffer object (the boolean
> variable and the char table).

I'm not sure we need to add these tables to the buffer object. The
experience with using case-tables this way is not encouraging, because
in several important cases it is not at all clear which buffer is
relevant to the folding-match operation one needs to do.

> Do any of these options seem good enough? Which would you all like to explore?
> I like the second one best, but goal 3 is quite important.

I think we must lift the limitation of a single-character folding
result, which means changes on the C level are inevitable. I also
think we need to talk a bit more about which kinds of folding we would
like to support.

Thanks.
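
For illustration only, the one-table-per-folding idea discussed above
can be sketched in Python (not Emacs Lisp or C, and not taken from
either patch). Plain dicts stand in for char-tables, and folding both
strings before a substring test stands in for whatever search.c would
do while matching. All names here (DIACRITICS, QUOTES, MULTI, fold,
folded_contains) are hypothetical, and the QUOTES and MULTI entries
are hand-picked examples, not a list from UTR #30.

```python
import unicodedata


def build_diacritic_table(codepoints):
    """Map characters to their base form; combining marks map to ''."""
    table = {}
    for cp in codepoints:
        ch = chr(cp)
        # NFD follows canonical decompositions; dropping characters with
        # a non-zero combining class leaves only the base character(s).
        folded = "".join(
            c for c in unicodedata.normalize("NFD", ch)
            if not unicodedata.combining(c)
        )
        if folded != ch:
            table[ch] = folded
    return table


# Items 1 and 3: derived from Unicode data (Latin ranges and the
# combining diacritical marks only, to keep the sketch small).
DIACRITICS = build_diacritic_table(range(0x00A0, 0x0370))

# Item 2: "intuitive" similarity, written by hand (illustrative subset).
QUOTES = {"\u2018": "'", "\u2019": "'", "\u201B": "'", "\u2032": "'"}

# A folding whose result is longer than one character (illustrative).
MULTI = {"\u00DF": "ss", "\uFB01": "fi"}


def fold(text, tables):
    """Fold each character through the first table that has an entry."""
    out = []
    for ch in text:
        for table in tables:
            if ch in table:
                out.append(table[ch])
                break
        else:
            out.append(ch)  # no folding applies; keep the character
    return "".join(out)


def folded_contains(haystack, needle, tables):
    """Item 4: fold both sides, so the match works in either direction."""
    return fold(needle, tables) in fold(haystack, tables)
```

Mapping every entry to a string means that a one-character result, a
multi-character result, and an ignored character (the empty string)
all go through the same code path, and the caller chooses which
foldings are in effect by passing a different list of tables.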