From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: ASCII-folded search [was: Re: Upcoming loss of usability ...]
Date: Thu, 18 Jun 2015 08:27:03 +0300
Message-ID: <83wpz1lh7c.fsf@gnu.org>
In-reply-to: <87oakdfwim.fsf@uwakimon.sk.tsukuba.ac.jp>
To: "Stephen J. Turnbull"
Cc: emacs-devel@gnu.org

> From: "Stephen J. Turnbull"
> Date: Thu, 18 Jun 2015 13:52:49 +0900
> Cc: emacs-devel@gnu.org
>
> Marcin Borkowski writes:
>
> > On the other hand, it would be great if we had an "ascii-folding"
> > option, making (some reasonable subset of) Unicode "equivalent" to
> > ASCII,
>
> I believe Emacs already implements NFD normalization.

Yes, see ucs-normalize-NFD-region and friends.

> All you need after that is to skip compose characters when
> searching.

No, it's much more complex than that.  For starters, normalization
won't convert u+2018 etc. to their ASCII counterparts.  The Unicode
Standard doesn't consider those even compatibility-equivalent.
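
The point about U+2018 can be checked outside Emacs.  The following is a
minimal sketch in Python (not Emacs Lisp, and not part of the original
message) using the standard unicodedata module; the helper name
normalized_forms is made up for illustration.  It applies all four
normalization forms to the curly quote and, for contrast, to a
precomposed accented letter.

```python
import unicodedata

LEFT_QUOTE = "\u2018"  # LEFT SINGLE QUOTATION MARK
E_ACUTE = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE


def normalized_forms(ch):
    """Return ch under each of the four Unicode normalization forms.

    Hypothetical helper, named here only for this illustration.
    """
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return {form: unicodedata.normalize(form, ch) for form in forms}


quote_forms = normalized_forms(LEFT_QUOTE)
e_acute_forms = normalized_forms(E_ACUTE)

# The curly quote has no canonical or compatibility decomposition,
# so every form returns it unchanged.
print({form: hex(ord(s)) for form, s in quote_forms.items()})

# The precomposed letter, by contrast, splits under NFD into a base
# letter followed by a combining mark.
print([hex(ord(c)) for c in e_acute_forms["NFD"]])
```

So mapping such punctuation to ASCII would need a separate table on top
of whatever normalization provides.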
And for matching just the base characters (which is what I presume is
meant here by "ascii-folding"), we'd need to handle correctly any
number of combinations of pre-composed and decomposed character
sequences in both the search string and the text we search, and
implement that on the fly, since the buffer text obviously cannot be
transformed for these purposes.

So yes, this feature is something that's sorely needed, but volunteers
need to know that the task is not too easy (or else it would have been
done long ago).  Interested individuals can start by studying the
following references:

  . Sections 5.18 "Case Mappings" and 5.19 "Mapping Compatibility
    Variants" of the Unicode Standard
  . UTN#5 "Canonical Equivalence in Applications"
    (http://www.unicode.org/notes/tn5/)
  . UTR#15 "Unicode Normalization Forms"
    (http://unicode.org/reports/tr15/)
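
To make the base-character matching described above concrete, here is a
rough sketch in Python (again not Emacs Lisp, and not how Emacs does or
should implement it).  The names fold and search_folded are hypothetical.
The sketch decomposes each character, drops combining marks, and keeps a
map back to the original offsets so that the match is reported against
the untransformed text.  It builds a folded copy of the whole text,
which is exactly what cannot be done wholesale to a buffer; a real
implementation would have to do the equivalent incrementally during the
search.  It also ignores case and does nothing for characters such as
U+2018 that have no decomposition.

```python
import unicodedata


def fold(text):
    """Return (folded, index_map) for text.

    folded holds only the non-combining characters of the canonical
    decomposition; index_map[i] is the offset in text that produced
    folded[i].
    """
    folded = []
    index_map = []
    for i, ch in enumerate(text):
        for d in unicodedata.normalize("NFD", ch):
            if unicodedata.combining(d) == 0:
                folded.append(d)
                index_map.append(i)
    return "".join(folded), index_map


def search_folded(needle, haystack):
    """Find needle in haystack, ignoring combining marks on both sides.

    Returns (start, end) offsets into the original haystack for the
    first match, or None if there is no match.
    """
    folded_needle, _ = fold(needle)
    folded_hay, index_map = fold(haystack)
    if not folded_needle:
        return None
    pos = folded_hay.find(folded_needle)
    if pos < 0:
        return None
    start = index_map[pos]
    end = index_map[pos + len(folded_needle) - 1] + 1
    # Extend the match over combining marks that follow the last base
    # character in the original text.
    while end < len(haystack) and unicodedata.combining(haystack[end]) != 0:
        end += 1
    return start, end


# Decomposed "e" + COMBINING ACUTE ACCENT first, precomposed U+00E9 second.
text = "un cafe\u0301 et un caf\u00e9"
print(search_folded("cafe", text))
print(search_folded("caf\u00e9", "a cafe"))
print(search_folded("'", "\u2018x\u2019"))
```

The offset map is the part that grows complicated in practice: one
original character may contribute several folded characters or none,
and the reported match has to cover whole character sequences in the
original text.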