From: Juri Linkov
Newsgroups: gmane.emacs.devel
Subject: Re: Character group folding in searches
Date: Sat, 07 Feb 2015 02:07:15 +0200
Organization: LINKOV.NET
Message-ID: <87386iinyk.fsf@mail.linkov.net>
To: Artur Malabarba
Cc: emacs-devel
In-Reply-To: (Artur Malabarba's message of "Fri, 6 Feb 2015 11:04:03 -0200")

> My question is:
>
> Do any of these options seem good enough? Which would you all like to explore?

This feature, as I see it, has several levels of complexity:

* 1-to-1 char-folding

  ?a <=> ?á

  This is already supported by char-tables, so there is no problem.

* 1-to-1 char-folding in combination with case-folding

  ?a <=> ?Á

  (in this example one of them is in lower case and another
  in upper case with acute)

  I'm not sure how your patch handles this case. We have to consult the
  information about case-folding from the case-table. Otherwise, we would
  need two new tables instead of one: where character mappings are
  with and without case-folding. In any case we have to take care
  of the correct interaction with case-fold-search.

* 1-to-1 char-folding plus a combining character

  ?a <=> "á" (a followed by U+0301)

  The simplest solution is just to ignore all combining characters in search.
  This should be easy to implement in the search engine by introducing
  a new list of ignorable characters to skip during the search.

* multi-character translation such as ligatures, etc.

  ?ﬄ <=> "ffl"

  This is the hardest case. Maybe the existing translation tables
  from ucs-normalize.el could help.
  Then configuring would be like

    (set-char-table-extra-slot case-table 3 (get 'ucs-normalize-nfd-table 'translation-table))

  But this requires a significant modification of the search engine to use
  the same logic in the search as is used in `translate-region-internal'
  to support multi-character translation in the search.

  Also it might require adding a new mode such as "lax-decomposition"
  that, like lax-word mode, will match partially, e.g. "f" will match "ﬄ".

  Or maybe it would be better to use an external library like
  http://userguide.icu-project.org/collation/icu-string-search-service

I agree with your attempts to have something instead of having nothing
(as in an all-or-nothing attitude). So to me it seems that the first
3 items would comprise the useful minimum, and the hardest last case
could be implemented afterwards.
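
To make the four levels concrete, here is a minimal sketch that works on plain strings, outside the search engine. It is only an illustration under stated assumptions, not the proposed implementation: the name `my-fold-key' is hypothetical, and it uses compatibility decomposition (NFKD) from ucs-normalize.el rather than the NFD table mentioned above, because the ﬄ ligature has only a compatibility decomposition.

```elisp
(require 'ucs-normalize)
(require 'cl-lib)

;; Hypothetical helper, not part of Emacs: reduce STRING to a key
;; under which all four kinds of variants compare equal.
(defun my-fold-key (string)
  "Return STRING decomposed (NFKD), without combining marks, downcased."
  (let ((decomposed (ucs-normalize-NFKD-string string)))
    (downcase
     (apply #'string
            ;; Drop every character whose general category is a mark
            ;; (Mn, Mc, Me); this is the "ignorable characters" idea.
            (cl-remove-if
             (lambda (ch)
               (memq (get-char-code-property ch 'general-category)
                     '(Mn Mc Me)))
             (string-to-list decomposed))))))

;; Level 1: ?a <=> ?á
(string= (my-fold-key "á") (my-fold-key "a"))
;; Level 2: ?a <=> ?Á (char-folding combined with case-folding)
(string= (my-fold-key "Á") (my-fold-key "a"))
;; Level 3: ?a <=> "a" followed by U+0301
(string= (my-fold-key "a\u0301") (my-fold-key "a"))
;; Level 4: ?ﬄ <=> "ffl"
(string= (my-fold-key "\ufb04") "ffl")
;; "lax-decomposition": a partial match of "f" against the ligature
(string-match-p (regexp-quote (my-fold-key "f")) (my-fold-key "\ufb04"))
```

Comparing keys like this sidesteps the multi-character problem only because both sides are rewritten before comparison; a real search would still have to map match positions in the folded text back to buffer positions, which is where the search engine changes described above come in.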