From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: "Stephen J. Turnbull" <stephen@xemacs.org>
Newsgroups: gmane.emacs.devel
Subject: Re: ASCII-folded search [was: Re: Upcoming loss of usability ...]
Date: Thu, 18 Jun 2015 16:48:58 +0900
Message-ID: <87lhfhfod1.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <20150615142237.GA3517@acm.fritz.box>
	<87y4jkhqh5.fsf@uwakimon.sk.tsukuba.ac.jp>
	<E1Z4aDD-0005QL-Iq@fencepost.gnu.org>
	<557F3C22.4060909@cs.ucla.edu>
	<E1Z4tDG-0003St-DZ@fencepost.gnu.org>
	<5580D356.4050708@cs.ucla.edu> <87si9qonxb.fsf@gnu.org>
	<E1Z5EWg-00050S-GG@fencepost.gnu.org> <87ioamz8if.fsf@petton.fr>
	<32013464-2300-46c6-ba46-4a3c36bfee5d@default>
	<87twu62nnt.fsf@mbork.pl>
	<87oakdfwim.fsf@uwakimon.sk.tsukuba.ac.jp> <83wpz1lh7c.fsf@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
X-Trace: ger.gmane.org 1434613776 15924 80.91.229.3 (18 Jun 2015 07:49:36 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Thu, 18 Jun 2015 07:49:36 +0000 (UTC)
Cc: emacs-devel@gnu.org
To: Eli Zaretskii <eliz@gnu.org>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Thu Jun 18 09:49:29 2015
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1Z5UZk-0001ZL-0T
	for ged-emacs-devel@m.gmane.org; Thu, 18 Jun 2015 09:49:28 +0200
Original-Received: from localhost ([::1]:50631 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1Z5UZj-0005l5-5H
	for ged-emacs-devel@m.gmane.org; Thu, 18 Jun 2015 03:49:27 -0400
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:34683)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <turnbull@sk.tsukuba.ac.jp>) id 1Z5UZV-0005kw-Rb
	for emacs-devel@gnu.org; Thu, 18 Jun 2015 03:49:14 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <turnbull@sk.tsukuba.ac.jp>) id 1Z5UZQ-0005oM-WE
	for emacs-devel@gnu.org; Thu, 18 Jun 2015 03:49:13 -0400
Original-Received: from shako.sk.tsukuba.ac.jp ([130.158.97.161]:51105)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <turnbull@sk.tsukuba.ac.jp>)
	id 1Z5UZM-0005mA-83; Thu, 18 Jun 2015 03:49:04 -0400
Original-Received: from uwakimon.sk.tsukuba.ac.jp (uwakimon.sk.tsukuba.ac.jp
	[130.158.99.156])
	(using TLSv1.2 with cipher DHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by shako.sk.tsukuba.ac.jp (Postfix) with ESMTPS id 12A471C3976;
	Thu, 18 Jun 2015 16:48:59 +0900 (JST)
Original-Received: by uwakimon.sk.tsukuba.ac.jp (Postfix, from userid 1000)
	id EBE5F1A2CA2; Thu, 18 Jun 2015 16:48:58 +0900 (JST)
In-Reply-To: <83wpz1lh7c.fsf@gnu.org>
X-Mailer: VM undefined under 21.5  (beta34) "kale" 83e5c3cd6be6 XEmacs Lucid
	(x86_64-unknown-linux)
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x
X-Received-From: 130.158.97.161
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:187270
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/187270>

Eli Zaretskii writes:

 > No, it's much more complex than that.  For starters, normalization
 > won't convert u+2018 etc. to their ASCII counterparts.  The Unicode
 > Standard doesn't consider those even compatibility-equivalent.

True, but the OP asked for a "reasonable subset".  Given the context,
sure, we'd have to go beyond what NFD (or NFKD) gives, but that could
be done over time, starting with a few quotation characters (which can
probably be assembled by selecting on Unicode name).
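
To make that concrete, here is a rough sketch of the heuristic in
Python (not Emacs Lisp, and not a proposal for the actual
implementation).  The names quote_fold_table and ascii_fold are made
up for illustration, and the code point range scanned is an arbitrary
assumption; only the unicodedata calls are real library API.

```python
import unicodedata

def quote_fold_table(first=0x00A0, last=0x2070):
    """Map quotation characters to ASCII, selecting them by Unicode name.

    The scanned range is an assumption chosen to cover Latin-1 and
    General Punctuation; a real table would grow over time.
    """
    table = {}
    for cp in range(first, last):
        name = unicodedata.name(chr(cp), "")
        if "QUOTATION MARK" not in name:
            continue
        # Crude heuristic: "SINGLE" in the name means an apostrophe-like
        # quote, anything else is treated as a double quote.
        table[cp] = "'" if "SINGLE" in name else '"'
    return table

QUOTE_TABLE = quote_fold_table()

def ascii_fold(text):
    """Fold text toward ASCII: decompose, drop combining marks, map quotes."""
    decomposed = unicodedata.normalize("NFKD", text)
    base_only = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base_only.translate(QUOTE_TABLE)
```

Folding both the search string and the text with ascii_fold makes
U+2018 match an ASCII apostrophe and makes precomposed and decomposed
spellings of the same accented letter compare equal, which is all the
"reasonable subset" needs at first.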

 > And for matching just the base characters (which is what I presume
 > is meant here by "ascii-folding"), we'd need to handle correctly
 > any number of combinations of pre-composed and decomposed character
 > sequences in both the search string and the text we search, and
 > implement that on the fly, since the buffer text obviously cannot
 > be transformed for these purposes.

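That's not at all obvious, for two reasons.  (1) If the applications
producing and consuming the buffer text claim Unicode conformance,
they have to treat canonically equivalent sequences as the same text,
so we sure can transform it.  (2) Nobody said we have to do the
transformation in place; we can search a transformed copy.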
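
As an illustration of point (2), here is a minimal Python sketch that
leaves the text alone, searches a folded copy, and maps the match back
to offsets in the original.  The function names fold_with_offsets and
folded_search are hypothetical, and building the copy eagerly is an
assumption made for brevity; nothing here reflects how Emacs search
actually works.

```python
import unicodedata

def fold_with_offsets(text):
    """Return (folded, offsets): offsets[k] is the index in text of the
    character that produced folded[k]."""
    folded = []
    offsets = []
    for i, ch in enumerate(text):
        # Decompose one character at a time so each output character
        # can be traced back to its source position.
        for d in unicodedata.normalize("NFKD", ch):
            if unicodedata.combining(d):
                continue  # drop accents and other combining marks
            folded.append(d)
            offsets.append(i)
    return "".join(folded), offsets

def folded_search(needle, haystack):
    """Find needle in haystack ignoring accents.

    Returns (start, end) offsets into the untouched haystack, or None.
    """
    folded_needle, _ = fold_with_offsets(needle)
    folded_hay, offsets = fold_with_offsets(haystack)
    if not folded_needle:
        return None
    pos = folded_hay.find(folded_needle)
    if pos < 0:
        return None
    start = offsets[pos]
    end = offsets[pos + len(folded_needle) - 1] + 1
    # Extend the match over combining marks that follow the last base
    # character, so a decomposed accent is not split off.
    while end < len(haystack) and unicodedata.combining(haystack[end]):
        end += 1
    return start, end
```

The same search string then finds both spellings: the precomposed
U+00E9 and the decomposed "e" plus U+0301 each fold to a plain "e",
and the returned offsets still refer to the original, unmodified text.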

 > Interested individuals can start by studying the following
 > references:

I don't think that's the place to start.  The whole idea is heuristic.
Sure, at some point we'd want to improve accuracy by applying those
TRs, but anyone who wants to do this can start with just the
heuristic.

I'm not offering to do this myself, so your advice is better than
mine.  But it *could* be done this way.