From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
Date: Fri, 06 Feb 2015 09:29:33 +0200
Message-ID: <83k2zvebvm.fsf@gnu.org>
References: <87fvakvwbf.fsf@lifelogs.com>
In-reply-to: <87fvakvwbf.fsf@lifelogs.com>
To: emacs-devel@gnu.org

> From: Ted Zlatanov
> Date: Thu, 05 Feb 2015 17:16:04 -0500
>
> https://emacs.stackexchange.com/questions/7992/how-to-search-an-arabic-word-in-text-without-its-diacritics-accents
> suggested it would be useful if diacritics were ignored when searching
> for text in various situations. This is similar to `case-fold-search'
> but more generic. Here's what I suggested as the answer at the ELisp
> level:
>
> #+begin_src emacs-lisp
> (defun kill-marks (string)
>   (concat (loop for c across string
>                 when (not (eq 'Mn (get-char-code-property c 'general-category)))
>                 collect c)))
>
> (let* ((original1 "your Arabic string here")
>        (normalized1 (ucs-normalize-NFKD-string original1))
>        (original2 "your other Arabic string here")
>        (normalized2 (ucs-normalize-NFKD-string original2)))
>   (equal
>    (replace-regexp-in-string "." 'kill-marks normalized1)
>    (replace-regexp-in-string "." 'kill-marks normalized2)))
> #+end_src

That doesn't do what we want, it's only a partial solution to that
problem. E.g., it doesn't equate the initial, medial, and final
variants of the letters used by Arabic and other Semitic scripts.
Moreover, you cannot even search for "a" and find "á", AFAICS.

The way to solve this correctly and generally was discussed here some
time ago, so if there are people here for whom this is an itch to
scratch, please let's do this as discussed there. We already have all
the necessary information for that in Emacs databases.
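
For illustration only, and not necessarily the approach from the
earlier discussion: a minimal sketch of how the `decomposition'
character property in the Emacs Unicode tables could drive such a
search. The names my-fold-table and my-fold-regexp are hypothetical,
the table covers only the code point range it is given, and only the
first character of each decomposition is considered.

#+begin_src emacs-lisp
(require 'regexp-opt)

(defun my-fold-table (start end)
  "Map base characters to the characters in START..END that decompose to them.
Hypothetical helper: only the first character of each decomposition
is treated as the base."
  (let ((table (make-hash-table :test #'eql))
        (c start))
    (while (<= c end)
      (let* ((decomp (get-char-code-property c 'decomposition))
             ;; A leading symbol such as `final' or `compat' is a
             ;; compatibility tag, not a character; skip it.
             (base (if (symbolp (car decomp)) (cadr decomp) (car decomp))))
        ;; Characters without a decomposition map to themselves.
        (when (and base (/= base c))
          (push c (gethash base table))))
      (setq c (1+ c)))
    table))

(defun my-fold-regexp (string table)
  "Return a regexp matching STRING or the variants of its characters in TABLE."
  (mapconcat
   (lambda (c)
     (let ((variants (gethash c table)))
       (if variants
           ;; Character alternative for the base plus all its variants.
           (regexp-opt-charset (cons c variants))
         (regexp-quote (string c)))))
   string ""))

;; Latin: a search for "a" also matches "á".
(string-match-p (my-fold-regexp "a" (my-fold-table #x00C0 #x017F)) "á")

;; Arabic: a search for ALEF (U+0627) also matches its final form (U+FE8E).
(string-match-p (my-fold-regexp (string #x0627) (my-fold-table #xFE70 #xFEFC))
                (string #xFE8E))
#+end_src

Both calls should return a match position rather than nil. A real
implementation would presumably build the table once for all of
Unicode and would also have to cope with base characters followed by
separate combining marks, which this sketch ignores.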