From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
Date: Fri, 06 Feb 2015 09:35:24 +0200
Message-ID: <83ioffeblv.fsf@gnu.org>
References: <87fvakvwbf.fsf@lifelogs.com>
To: bruce.connor.am@gmail.com
Cc: emacs-devel@gnu.org

> Date: Thu, 5 Feb 2015 23:17:42 +0000
> From: Artur Malabarba
>
> As for answering your questions:
>
> >> implementing it for users so it works like `case-fold-search' (you just
> >> set something in Customize and all search commands DWYM) seems much
> >> harder.
>
> Doing it as part of Emacs is not terribly hard, but it has
> disadvantages. Namely, the case-fold-search machinery only relates one
> character to another character (1 to 1). At least for latin this would
> be enough a lot of the time, e.g. you can use it to relate "á" to "a".
> However, there's another way of writing "á" which takes two
> characters, and this situation can't be handled (AFAIK) by the
> case-fold-search machinery.

This just means you cannot implement that without changes to the C
level.  Changing the C code to lift the one-character restriction is
not very hard.

> The bright side is that I think this two-char way of writing latin
> accents is much less common (not 100% sure though, it's hard to tell
> the difference).
> The downside is that I know nothing about other
> languages, so maybe using two chars to represent one char is the
> default behavior in some other languages?

It can be more than 2 characters, e.g. in scripts that use diacritics:
there could be more than one diacritic combined with one base
character.  And then there are characters to be ignored, like ZWJ and
bidi directional controls.

So I think ad-hoc rules like the above are not going to cut it.  We
must use the decomposed forms, whatever they are, and we should also
consult the character properties to ignore the ignorables.

> >> Does anyone have suggestions? Maybe some defadvice magic?
>
> You can use a defadvice around one of the isearch internal functions
> (check out the branch I mentioned) to implement something in elisp.
> And you can redefine the buffer's case-folding table and use that in
> the advice, but that will require that you generate the entire table.

Please don't kludge around the problem.  If it is important enough for
you to solve it, let's solve it as God intended.
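
A minimal sketch of the "decompose, then ignore the ignorables" idea,
written in Python rather than Emacs C or Lisp so that it can be run
stand-alone.  It is not how Emacs implements or would implement this.
The names fold_for_search and folded_equal are hypothetical, and
treating general categories Mn (nonspacing mark) and Cf (format) as
the set of characters to skip is an assumption that only approximates
the real Unicode properties; ZWJ and the bidi directional controls do
fall in Cf.

```python
import unicodedata

# Assumption: Mn (nonspacing marks) and Cf (format characters such as
# ZWJ and the bidi directional controls) stand in for "diacritics" and
# "ignorables".  A real implementation would consult richer character
# properties than the general category alone.
SKIPPED_CATEGORIES = {"Mn", "Cf"}


def fold_for_search(text: str) -> str:
    """Return a search key with marks and format characters removed."""
    # Canonical decomposition splits a precomposed character into its
    # base character followed by any number of combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = [
        ch for ch in decomposed
        if unicodedata.category(ch) not in SKIPPED_CATEGORIES
    ]
    # Fold case last, so the key is also case-insensitive.
    return "".join(kept).casefold()


def folded_equal(a: str, b: str) -> bool:
    """Compare two strings after folding both with fold_for_search."""
    return fold_for_search(a) == fold_for_search(b)
```

With this sketch, the precomposed "\u00e1", the two-character
"a\u0301", and a base character carrying several marks such as
"a\u0301\u0323" all fold to the same key as plain "a", which is the
many-to-one relation that a 1-to-1 table cannot express.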