From: "Stephen J. Turnbull"
Newsgroups: gmane.emacs.devel
Subject: Re: ASCII-folded search [was: Re: Upcoming loss of usability ...]
Date: Thu, 18 Jun 2015 16:48:58 +0900
Message-ID: <87lhfhfod1.fsf@uwakimon.sk.tsukuba.ac.jp>
In-Reply-To: <83wpz1lh7c.fsf@gnu.org>
To: Eli Zaretskii
Cc: emacs-devel@gnu.org

Eli Zaretskii writes:

> No, it's much more complex than that.  For starters, normalization
> won't convert u+2018 etc. to their ASCII counterparts.  The Unicode
> Standard doesn't consider those even compatibility-equivalent.

True, but the OP asked for a "reasonable subset".  Given the context,
sure, we'd have to go beyond what NFD (or NFKD) gives, but that could
be done over time, starting with a few quotation characters (which can
probably be assembled by selecting on Unicode name).

> And for matching just the base characters (which is what I presume
> is meant here by "ascii-folding"), we'd need to handle correctly
> any number of combinations of pre-composed and decomposed character
> sequences in both the search string and the text we search, and
> implement that on the fly, since the buffer text obviously cannot
> be transformed for these purposes.

That's not at all obvious, for two reasons.  (1) If the applications
producing and consuming the buffer text claim Unicode conformance, we
sure can.  (2) Nobody said we have to do the transformation in place.

> Interested individuals can start by studying the following
> references:

I don't think that's the place to start.  The whole idea is
heuristic.  Sure, at some point we'd want to improve accuracy by
applying those TRs, but anyone who wants to do this can start with
just the heuristic.

I'm not offering to do this myself, so your advice is better than
mine.  But it *could* be done this way.
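
For illustration only, here is a minimal sketch of the heuristic
described above, written in Python rather than Emacs Lisp.  It is not
code from this thread or from Emacs.  The names build_quote_table,
fold and folded_search are hypothetical, and so are the choice of the
range U+2018..U+201F as the "few quotation characters" and the rule
that maps them to ' or ".  The fold is NFKD decomposition with
combining marks dropped, plus the quote table selected by Unicode
name.  The search runs on a folded copy and maps the match back to
the untouched original through an offset list, so nothing is
transformed in place.

```python
import unicodedata


def build_quote_table(start=0x2018, stop=0x2020):
    """Select quotation characters by Unicode name (assumed seed range)."""
    table = {}
    for cp in range(start, stop):
        name = unicodedata.name(chr(cp), "")
        if "QUOTATION MARK" in name:
            # Assumption: DOUBLE marks fold to ", all others to '.
            table[chr(cp)] = '"' if "DOUBLE" in name else "'"
    return table


QUOTES = build_quote_table()


def fold(text, extra=QUOTES):
    """Return (folded copy, offsets); offsets[i] is the index in `text`
    of the character that produced folded[i]."""
    pieces = []
    offsets = []
    for i, ch in enumerate(text):
        if ch in extra:
            piece = extra[ch]
        else:
            # Decompose, then drop combining marks.  A stand-alone
            # combining mark therefore folds to the empty string.
            piece = "".join(
                c for c in unicodedata.normalize("NFKD", ch)
                if not unicodedata.combining(c)
            )
        pieces.append(piece)
        offsets.extend([i] * len(piece))
    return "".join(pieces), offsets


def folded_search(needle, haystack):
    """Find `needle` in `haystack` under folding.

    Returns a (start, end) slice into the original haystack, or None.
    """
    folded_needle, _ = fold(needle)
    if not folded_needle:
        return None
    folded_hay, offsets = fold(haystack)
    pos = folded_hay.find(folded_needle)
    if pos < 0:
        return None
    start = offsets[pos]
    end = offsets[pos + len(folded_needle) - 1] + 1
    # Include combining marks that trail the last matched base character.
    while end < len(haystack) and unicodedata.combining(haystack[end]):
        end += 1
    return start, end


if __name__ == "__main__":
    print(fold("\u201cna\u00efve\u201d caf\u00e9")[0])
    print(folded_search("cafe", "un cafe\u0301 noir"))
```

Because both the search string and the searched text go through the
same fold, precomposed and decomposed spellings compare equal without
special cases.  Accuracy is only as good as the table and the
decompositions; applying the relevant TRs would be the later
refinement.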