From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.devel
Subject: Re: Single quotes in Info
Date: Wed, 28 Jan 2015 17:24:44 +0200
Message-ID: <83vbjrnd1f.fsf@gnu.org>
References: <87twzhgk84.fsf@wmi.amu.edu.pl> <83lhksshdm.fsf@gnu.org>
	<9ee0c895-a178-40e1-b1c8-ed2b97071c6b@default>
	<87h9vgglkz.fsf@wmi.amu.edu.pl>
	<CAAdUY-J4s+1_C7bj32Xk5x8d01fe9baPCYmwd+0KU=QorO7wZg@mail.gmail.com>
	<83h9vcp0bq.fsf@gnu.org>
	<CAAdUY-Kck6moHTRJshbXJdRVQ6gK6Q24f_PD7SuEaZ7hURpdQw@mail.gmail.com>
	<83y4onorcc.fsf@gnu.org>
	<CAAdUY-+ooLydD-qPtiEvv-01TGxX5E-cf6asvs+Jn+eR_=38ig@mail.gmail.com>
Reply-To: Eli Zaretskii <eliz@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
X-Trace: ger.gmane.org 1422458705 4319 80.91.229.3 (28 Jan 2015 15:25:05 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Wed, 28 Jan 2015 15:25:05 +0000 (UTC)
Cc: emacs-devel@gnu.org
To: bruce.connor.am@gmail.com
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Wed Jan 28 16:25:05 2015
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1YGUUJ-0000TH-En
	for ged-emacs-devel@m.gmane.org; Wed, 28 Jan 2015 16:25:03 +0100
Original-Received: from localhost ([::1]:54163 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1YGUUI-0006GP-Jh
	for ged-emacs-devel@m.gmane.org; Wed, 28 Jan 2015 10:25:02 -0500
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:34354)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <eliz@gnu.org>) id 1YGUUE-0006Fp-Py
	for emacs-devel@gnu.org; Wed, 28 Jan 2015 10:25:00 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <eliz@gnu.org>) id 1YGUUA-0001D7-GE
	for emacs-devel@gnu.org; Wed, 28 Jan 2015 10:24:58 -0500
Original-Received: from mtaout26.012.net.il ([80.179.55.182]:59590)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from <eliz@gnu.org>)
	id 1YGUUA-0001Cq-37
	for emacs-devel@gnu.org; Wed, 28 Jan 2015 10:24:54 -0500
Original-Received: from conversion-daemon.mtaout26.012.net.il by mtaout26.012.net.il
	(HyperSendmail v2007.08) id
	<0NIW006007VUZ700@mtaout26.012.net.il> for emacs-devel@gnu.org;
	Wed, 28 Jan 2015 17:24:49 +0200 (IST)
Original-Received: from HOME-C4E4A596F7 ([87.69.4.28]) by mtaout26.012.net.il
	(HyperSendmail v2007.08) with ESMTPA id
	<0NIW00NHV85DQE80@mtaout26.012.net.il>;
	Wed, 28 Jan 2015 17:24:49 +0200 (IST)
In-reply-to: <CAAdUY-+ooLydD-qPtiEvv-01TGxX5E-cf6asvs+Jn+eR_=38ig@mail.gmail.com>
X-012-Sender: halo1@inter.net.il
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x
X-Received-From: 80.179.55.182
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:181910
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/181910>

> Date: Tue, 27 Jan 2015 23:15:22 -0200
> From: Artur Malabarba <bruce.connor.am@gmail.com>
> Cc: emacs-devel <emacs-devel@gnu.org>
>
> Eli, if I may ask, did you get a chance to see the code? (it's quite short)
> The last couple emails give me the impression we're not quite on the same page.

I did just now, and I don't think I was on a different page.

> The purpose of this is to allow the user to search for complex characters (such as curly quotes or any of these ＂“””„⹂〞‟‟❞❝❠“„〝〟🙷🙶🙸) by typing a simple character available on simple keyboards (such as the plain double quote ").

But that's exactly where it falls short of supporting a more general
feature, one that lets you find text that is "equivalent" to what you
search for.  The limitation to "simple characters available on simple
keyboards" might seem a no-brainer for predominantly ASCII text, but
it _is_ a serious limitation for any non-ASCII script, certainly for
complex scripts, which Emacs has supported for years.

> Each simple character, needs an entry on the `isearch-groups-alist' variable. The max number of entries we'll ever need on this alist (in the very worst possible scenario) is the number of simple characters in a simple keyboard (which is way less than 5000 last I checked).

You seem to forget that modern keyboards and input methods support
much more than what meets the eye on the keyboard.  Even Latin locales
provide non-ASCII characters such as á and å.  It is also not uncommon
to copy/paste a search string from some text, in which case the search
string could include the "complex" characters, but you'd still want to
find their "simple" equivalents; your code, which transforms only the
search string, cannot support this use case.  Moreover, CJK locales
use input methods that can produce thousands of characters, and for
people in those cultures such input is "simple" because they can use
nothing simpler.

Using a database that maps ASCII characters to regexps doesn't scale
for supporting these use cases.  It doesn't even scale to the
above-mentioned Latin characters, because á has a sequence of 2
characters "a ́" as its canonical decomposition, so when I type á, I
expect to find both á and "a ́", and vice versa.  More complex scripts
have several forms of the same letter, such as the "final" form used
in Arabic and Hebrew for the last letter in a word -- typing one of
these forms should find any other form.  Etc. etc. -- there's a huge
complexity behind all this, and we need to support it if we want to be
respected as a text editor.
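
To make the á example concrete, here is a small sketch.  It uses
Python's standard unicodedata module purely for illustration (this is
an assumption of the example, not Emacs code); it reads the same
Unicode decomposition data that the text above refers to.

```python
import unicodedata

precomposed = "\u00e1"   # á as a single code point
decomposed = "a\u0301"   # a followed by COMBINING ACUTE ACCENT

# The raw decomposition field from the Unicode Character Database:
# code points of the base letter and the combining mark.
mapping = unicodedata.decomposition(precomposed)
print(mapping)

# Compared code point by code point, the two spellings differ...
print(precomposed == decomposed)

# ...but they agree once the precomposed form is canonically decomposed.
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

A table keyed on ASCII characters alone would need a separate entry
for every such pair, in both directions.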

The way to support this is similar to how we support case-insensitive
search: we "fold" each character, both in the search string and in the
text being searched, using case tables, and then compare the "folded"
characters.  Similarly, to support equivalence, we need to produce a
canonical/equivalent decomposition from each character on both sides
of the comparison, and then compare the results.

As I said before, we already have all the necessary data in the
'decomposition' property of each character, we just need to use it in
a way that is similar to case tables, just slightly more complex
(because we are no longer talking about single characters).
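
A minimal sketch of "fold both sides, then compare", again in Python
and again only as an illustration.  The helper names fold and
equivalent_contains are hypothetical, and whole-string normalization
stands in for the per-character folding a real search would do.

```python
import unicodedata


def fold(text: str) -> str:
    """Hypothetical fold: canonical decomposition, then case folding."""
    return unicodedata.normalize("NFD", text).casefold()


def equivalent_contains(needle: str, haystack: str) -> bool:
    """Report whether the folded needle occurs in the folded haystack."""
    return fold(needle) in fold(haystack)


# Precomposed search string, decomposed text:
print(equivalent_contains("caf\u00e9", "Cafe\u0301 au lait"))

# Decomposed search string, precomposed text:
print(equivalent_contains("cafe\u0301", "CAF\u00c9"))
```

Because both sides go through the same fold, it does not matter
whether the search string was typed or pasted.  Note that a plain
substring test on decomposed text also lets a bare base letter match
the start of a letter-plus-accent sequence; whether that is wanted is
a separate design decision.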

> > > Does it relate a simple character to all its complex
> > > equivalents? Or does it relate each complex character to a simple alternative?
> > The latter.  Read paragraph 1.1 of UAX #15 for the starting point, and
> > also section 3.7 of the Unicode Standard.
> If it's the latter, then it's the wrong way for us to do an automated approach. What we need is to know the whole set of Unicode characters which is equivalent to a given ASCII character. Of course we can build this table from the Unicode Standard (that's exactly what the `isearch-groups-alist' variable is meant to do), I'm just saying an automated approach probably isn't viable here.

I don't see why it won't be viable, or maybe I don't understand what
you mean by "automated" here.  I certainly don't think we should limit
ourselves to "simple characters", not for something as general-purpose
as text search.  This might be okay for Info only, but not if we want
it in isearch.el.

My idea is to use the 'decomposition' property to decompose each
character in the search string and in the text being searched, when
they need to be compared.  Exactly like we do with case-folding.