From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.devel
Subject: Re: Character group folding in searches
Date: Fri, 06 Feb 2015 16:32:57 +0200
Message-ID: <83zj8rcdpi.fsf@gnu.org>
References: <CAAdUY-L8ipk4Aj83hJErinrgODjJab+mhx==59=FjnfmFm_wjw@mail.gmail.com>
Reply-To: Eli Zaretskii <eliz@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
X-Trace: ger.gmane.org 1423233200 15745 80.91.229.3 (6 Feb 2015 14:33:20 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Fri, 6 Feb 2015 14:33:20 +0000 (UTC)
Cc: emacs-devel@gnu.org
To: bruce.connor.am@gmail.com
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Fri Feb 06 15:33:19 2015
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1YJjy8-0004vA-R3
	for ged-emacs-devel@m.gmane.org; Fri, 06 Feb 2015 15:33:17 +0100
Original-Received: from localhost ([::1]:48625 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1YJjy3-0002qH-2r
	for ged-emacs-devel@m.gmane.org; Fri, 06 Feb 2015 09:33:11 -0500
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:39787)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <eliz@gnu.org>) id 1YJjxz-0002pL-96
	for emacs-devel@gnu.org; Fri, 06 Feb 2015 09:33:08 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <eliz@gnu.org>) id 1YJjxu-0003jg-67
	for emacs-devel@gnu.org; Fri, 06 Feb 2015 09:33:07 -0500
Original-Received: from mtaout20.012.net.il ([80.179.55.166]:54360)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from <eliz@gnu.org>)
	id 1YJjxt-0003jU-PZ
	for emacs-devel@gnu.org; Fri, 06 Feb 2015 09:33:02 -0500
Original-Received: from conversion-daemon.a-mtaout20.012.net.il by
	a-mtaout20.012.net.il (HyperSendmail v2007.08) id
	<0NJC00E00TKYSP00@a-mtaout20.012.net.il> for
	emacs-devel@gnu.org; Fri, 06 Feb 2015 16:33:00 +0200 (IST)
Original-Received: from HOME-C4E4A596F7 ([87.69.4.28]) by a-mtaout20.012.net.il
	(HyperSendmail v2007.08) with ESMTPA id
	<0NJC00EN0TQZC950@a-mtaout20.012.net.il>;
	Fri, 06 Feb 2015 16:33:00 +0200 (IST)
In-reply-to: <CAAdUY-L8ipk4Aj83hJErinrgODjJab+mhx==59=FjnfmFm_wjw@mail.gmail.com>
X-012-Sender: halo1@inter.net.il
X-detected-operating-system: by eggs.gnu.org: Solaris 10
X-Received-From: 80.179.55.166
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:182528
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/182528>

> Date: Fri, 6 Feb 2015 11:04:03 -0200
> From: Artur Malabarba <bruce.connor.am@gmail.com>
>
> 1. Follow the `decomposition' char property. For instance, the
> character "a" in the search string would match any one of "aãáâ"
> (and so on). This is easy to do, and one of the patches below already
> shows how. Note that this won't handle symbols that are actually
> composed of multiple characters.
>
> 2. Follow an intuitive sense of similarity which is not defined in the
> Unicode standard. For instance, an ASCII single quote in the search
> string should match any type of single quote (there are about a dozen
> that I know of).
>
> 3. Ignore modifier (non-spacing) characters. Another way of writing
> "á" is to write "a" followed by a special non-spacing acute. This
> kind of thing (a symbol composed of multiple characters) is not
> handled by item 1, so I'm listing it as a separate point.
>
> 4. Perform the conversion both ways. That is, item 1 should work even
> if the search contained "á" instead of "a". Item 2 should match an
> ASCII quote if the search string contains a curly quote. This is
> mostly useful when the user copies a fancy string from somewhere and
> pastes it into the search field.
>
> 5. It should work for any searching, not just isearch.

The full set of "folding" transformations is described in the Unicode
technical report UTR #30.  It was withdrawn, but its last draft is
still enlightening.

I think we should support some subset of what's described there.

The way to do it IMO is to generate a set of char-tables where each
character is mapped to its folded variant, one char-table for each
subset of folding.  A character whose folding is not a single
character should map to a vector or a string of characters (I'm not
sure which is better; we should choose whichever lends itself to the
most efficient use).
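
As a rough illustration of the table idea (in Python rather than
Emacs Lisp or C, so it is not the proposed implementation), the
sketch below builds one table from Unicode decomposition data.  The
names build_fold_table and accent_table are made up for this example;
a dict stands in for a char-table, and a multi-character fold is
simply stored as a string.

```python
import unicodedata

def build_fold_table(chars):
    """Map each character to its folded form.

    The folded form is the compatibility decomposition with combining
    marks dropped.  It is a one-character string for most accented
    letters and a longer string for ligatures and similar characters.
    Characters that fold to themselves are left out of the table.
    """
    table = {}
    for ch in chars:
        decomposed = unicodedata.normalize("NFKD", ch)
        folded = "".join(c for c in decomposed
                         if not unicodedata.combining(c))
        if folded and folded != ch:
            table[ch] = folded
    return table

# Latin-1 Supplement through Latin Extended-B, plus the "fi" ligature
# (U+FB01) as an example of a fold that is longer than one character.
accent_table = build_fold_table(
    [chr(cp) for cp in range(0x80, 0x250)] + ["\ufb01"])
```

Looking up "\u00e1" in accent_table gives a single character, while
"\ufb01" gives the two-character string "fi", which is the case that a
plain case-table cannot represent.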

I think the best approach is to modify search.c to be able to handle
folding that produces more than a single character.  I think we will
also need search.c to support several alternative foldings for the
same search operation.  Making these changes would be relatively easy,
I think, and once it's done, all the rest will "just work", because
the basic search algorithms don't need to be touched.
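
A minimal sketch of that separation, again in Python and only as an
illustration: the folding step may expand one character into several
and may apply several tables, while the search itself stays an
ordinary substring search.  The three small tables and the function
names fold and folding_search are assumptions made for this example;
nothing here mirrors the actual search.c code.

```python
# Tiny hand-written tables, one per kind of folding (illustrative only).
ACCENTS = {"\u00e1": "a", "\u00e3": "a", "\u00e2": "a", "\u00e9": "e"}
LIGATURES = {"\ufb01": "fi"}             # one character folds to two
QUOTES = {"\u2018": "'", "\u2019": "'"}

def fold(text, tables):
    """Fold TEXT through every table in TABLES, in order.

    Returns the folded text and a list giving, for each folded
    character, the index of the source character it came from.
    """
    pieces, origin = [], []
    for i, ch in enumerate(text):
        folded = ch
        for table in tables:
            folded = "".join(table.get(c, c) for c in folded)
        pieces.append(folded)
        origin.extend([i] * len(folded))
    return "".join(pieces), origin

def folding_search(pattern, text, tables):
    """Return (start, end) of the first match in TEXT, or None.

    Both sides are folded, so the match works in both directions.
    A match that begins or ends inside an expanded character is
    widened to cover that whole source character.
    """
    folded_pattern, _ = fold(pattern, tables)
    folded_text, origin = fold(text, tables)
    hit = folded_text.find(folded_pattern)
    if hit < 0 or not folded_pattern:
        return None
    start = origin[hit]
    end = origin[hit + len(folded_pattern) - 1] + 1
    return start, end
```

For example, folding_search("fine", "a \ufb01ne caf\u00e9", [LIGATURES])
finds the three source characters that spell the word with a ligature,
and passing a different list of tables changes which foldings are in
effect without touching the search step.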

As a final general remark, I don't think I like the "group" part of
the terminology.  Why not use "character-folding" instead?  That is
what this is called elsewhere.

> * group-folding-with-regexp-lisp.patch
>
> This one takes each input character and either keeps it verbatim or
> transforms it into a regexp which matches the entire group that this
> character represents. It is implemented in isearch.
>
> + It trivially handles goals 1, 2 and 3. Because regexps are quite
> versatile, it is the only solution that handles item 3 (it allows each
> character to match more than a single character).

But the downside is that we will have to construct such regexps for
all the foldings of all the characters we want to support.  That will
be quite a large database, and a lot of work to construct it.
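
To give a feel for what such a database involves, here is a hedged
Python sketch that derives the per-character classes from canonical
decomposition data instead of writing them by hand.  The function
names and the restriction to combining marks in U+0300..U+036F are
assumptions of this example, and it covers only decomposition-based
folding (items 1 and 3), not the "intuitive similarity" of item 2.

```python
import re
import unicodedata

# Zero or more combining diacritical marks after each matched character.
MARKS = "[\u0300-\u036f]*"

def build_fold_classes(code_points):
    """Group characters under the base character that starts their
    canonical decomposition, e.g. U+00E1 goes under "a"."""
    classes = {}
    for cp in code_points:
        ch = chr(cp)
        base = unicodedata.normalize("NFD", ch)[0]
        if base != ch:
            classes.setdefault(base, set()).add(ch)
    return classes

def fold_regexp(pattern, classes):
    """Turn PATTERN into a regexp in which every character also
    matches its precomposed variants and trailing combining marks."""
    parts = []
    for ch in pattern:
        members = [ch] + sorted(classes.get(ch, ()))
        if len(members) > 1:
            body = "".join(re.escape(m) for m in members)
            parts.append("[" + body + "]" + MARKS)
        else:
            parts.append(re.escape(ch) + MARKS)
    return "".join(parts)

# Scan Latin-1 Supplement through Latin Extended-B only.
classes = build_fold_classes(range(0xC0, 0x250))
```

A pattern such as "resume" then compiles to a regexp that matches the
word written with precomposed accented letters or with separate
combining accents.  Even this small range yields a class for most
Latin letters, which hints at how large the full set would be.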

> * group-folding-with-case-table-lisp.patch
>
> This patch is entirely in elisp. I've put it all inside `isearch.el'
> for now, for the sake of simplicity, but it's not restricted to
> isearch.
>
> It creates a new case-table which performs group folding by borrowing
> the case-folding machinery, so it is very fast. Then, group folding
> can be achieved by running the search inside a `with-group-folding`
> macro. There's also an example implementation which turns it on for
> isearch by default.
>
> + It immediately satisfies items 1, 2, 4, and 5.
> + It is very fast.
> - It has no simple way of achieving item 3.

It could use a separate case-table for item 3, couldn't it?

I think we will need separate tables for different foldings anyway,
because each use case calls for some specific folding.  In isearch,
the user will have to specify which foldings she wants to be in
effect.
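
One possible shape for this, sketched in Python with made-up names
(FOLDINGS, fold_text): each kind of folding is a separate named table,
and the caller lists the ones that should be in effect.  The "marks"
table maps combining marks to the empty string, which is how item 3
could be expressed once a fold is allowed to be something other than
exactly one character.

```python
# One table per kind of folding; the entries are illustrative only.
FOLDINGS = {
    "diacritics": {"\u00e1": "a", "\u00e3": "a", "\u00e2": "a"},
    "quotes": {"\u2018": "'", "\u2019": "'"},
    # Combining marks U+0300..U+036F fold to nothing at all.
    "marks": {chr(cp): "" for cp in range(0x0300, 0x0370)},
}

def fold_text(text, enabled):
    """Apply the foldings named in ENABLED to TEXT, in order."""
    for name in enabled:
        table = FOLDINGS[name]
        text = "".join(table.get(ch, ch) for ch in text)
    return text
```

With only "marks" enabled, "a" followed by U+0301 folds to "a" while a
precomposed U+00E1 is left alone; enabling "diacritics" as well folds
both spellings to the same string.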

> - If the user decides to set `group-fold-search' to t, this can break
> existing code (a disadvantage that the lisp version above does not
> have).
> - It adds two extra fields to every buffer object (the boolean
> variable and the char table).

I'm not sure we need to add these tables to the buffer object.  The
experience with using case-tables this way is not encouraging, because
in several important cases it is not at all clear which buffer is
relevant to the folding-match operation one needs to do.

> Do any of these options seem good enough? Which would you all like to
> explore?
> I like the second one best, but goal 3 is quite important.

I think we must lift the limitation that a folding result is a single
character, which means changes on the C level are inevitable.

I also think we need to talk a bit more about which kinds of folding
we would like to support.

Thanks.