From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Juri Linkov <juri@linkov.net>
Newsgroups: gmane.emacs.devel
Subject: Re: extending case-fold-search to remove nonspacing marks (diacritics
	etc.)
Date: Fri, 06 Feb 2015 02:54:45 +0200
Organization: LINKOV.NET
Message-ID: <87386jx2m2.fsf@mail.linkov.net>
References: <87fvakvwbf.fsf@lifelogs.com>
	<CAAdUY-+tA-5WjkXuqphdN=FEYH2D=YCHe4GxjmbSftK7L0a-MQ@mail.gmail.com>
	<CAAdUY-+A6Rz=BSbbOfDxVKPznLwGbZmbi0JUi0EY4=MHrQaW6g@mail.gmail.com>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain
X-Trace: ger.gmane.org 1423185539 11114 80.91.229.3 (6 Feb 2015 01:18:59 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Fri, 6 Feb 2015 01:18:59 +0000 (UTC)
Cc: emacs-devel <emacs-devel@gnu.org>
To: Artur Malabarba <bruce.connor.am@gmail.com>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Fri Feb 06 02:18:59 2015
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1YJXZS-0003rv-CA
	for ged-emacs-devel@m.gmane.org; Fri, 06 Feb 2015 02:18:58 +0100
Original-Received: from localhost ([::1]:46255 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1YJXZR-0005qS-Aw
	for ged-emacs-devel@m.gmane.org; Thu, 05 Feb 2015 20:18:57 -0500
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:39879)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <juri@linkov.net>) id 1YJXZN-0005qB-55
	for emacs-devel@gnu.org; Thu, 05 Feb 2015 20:18:54 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <juri@linkov.net>) id 1YJXZI-0006bU-3e
	for emacs-devel@gnu.org; Thu, 05 Feb 2015 20:18:53 -0500
Original-Received: from ps18281.dreamhost.com ([69.163.222.226]:37907
	helo=ps18281.dreamhostps.com) by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <juri@linkov.net>) id 1YJXZH-0006YU-Uo
	for emacs-devel@gnu.org; Thu, 05 Feb 2015 20:18:48 -0500
Original-Received: from localhost.linkov.net (ps18281.dreamhostps.com [69.163.222.226])
	by ps18281.dreamhostps.com (Postfix) with ESMTP id 55948312D8ED34;
	Thu,  5 Feb 2015 17:18:39 -0800 (PST)
In-Reply-To: <CAAdUY-+A6Rz=BSbbOfDxVKPznLwGbZmbi0JUi0EY4=MHrQaW6g@mail.gmail.com>
	(Artur Malabarba's message of "Thu, 5 Feb 2015 23:17:42 +0000")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.0.50 (x86_64-pc-linux-gnu)
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x
X-Received-From: 69.163.222.226
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:182487
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/182487>

> Something essentially identical to this was being discussed here a
> couple of weeks ago. Look for the thread "Single quotes in Info". I
> wrote a small elisp solution for building this into isearch (which you
> can find on the "scratch/isearch-character-group-folding" branch). It
> took a different approach to yours, relating characters to regexp, but
> it works.

I see that your branch contains nothing beyond what was implemented
long ago in bug#13041, where the major stumbling block was the
inefficiency of the regexp-based solution.  Could you help improve it?

> The bright side is that I think this two-char way of writing latin
> accents is much less common (not 100% sure though, it's hard to tell
> the difference). The downside is that I know nothing about other
> languages, so maybe using two chars to represent one char is the
> default behavior in some other languages?

As https://emacs.stackexchange.com/q/7992/478 indicates,
other languages need special characters such as diacritics/accents
to be inserted into or deleted from the search string/buffer for
normalization.

When looking for a solution, I recommend checking ucs-normalize.
For example, evaluating:

  (require 'ucs-normalize)
  ucs-normalize-combining-chars

you can see exactly the same characters

  1616 1615 1619 1648 1618 1612 1613 1611 1617 1614

mentioned in https://emacs.stackexchange.com/a/8001/478
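
Those code points are the Arabic marks U+064B..U+0653 and U+0670.
A minimal sketch to check this, assuming only that ucs-normalize is
loaded; it uses `get-char-code-property' to look up the Unicode names:

```elisp
(require 'ucs-normalize)

;; For each code point from the list above, show its hex form, its
;; Unicode name, and whether ucs-normalize treats it as combining.
(mapcar (lambda (c)
          (list (format "U+%04X" c)
                (get-char-code-property c 'name)
                (and (memq c ucs-normalize-combining-chars) t)))
        '(1616 1615 1619 1648 1618 1612 1613 1611 1617 1614))
```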

Using its corresponding regexp `ucs-normalize-combining-chars-regexp'
is easy in isearch, e.g.:

  ;; Decomposition search for accented letters.
  (define-key isearch-mode-map "\M-sd" 'isearch-toggle-decomposition)

  (defun isearch-toggle-decomposition ()
    "Toggle Unicode decomposition searching on or off."
    (interactive)
    (setq isearch-word (unless (eq isearch-word 'isearch-decomposition-regexp)
                         'isearch-decomposition-regexp))
    (if isearch-word (setq isearch-regexp nil))
    (setq isearch-success t isearch-adjusted t)
    (isearch-update))

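  (defun isearch-decomposition-regexp (string &optional _lax)
    "Return a regexp that matches decomposed Unicode characters in STRING."
    ;; Drop the trailing "+" so the regexp matches one combining character.
    (let ((accents (substring ucs-normalize-combining-chars-regexp 0 -1)))
      (mapconcat
       (lambda (c0)
         ;; Quote each character so that regexp specials match literally,
         ;; then allow an optional combining character after it.
         (concat (regexp-quote (string c0)) accents "?"))
       ;; Strip combining characters from the search string itself.
       (replace-regexp-in-string accents "" string) "")))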

  (put 'isearch-decomposition-regexp 'isearch-message-prefix "deco ")

But this is less efficient than implementing it properly with case tables.
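
The regexp approach above only strips combining characters that are
already separate in the search string, so a precomposed letter such as
U+00E9 typed into the search string keeps its accent.  A rough sketch
of one way around that, assuming it is acceptable to decompose the
search string with `ucs-normalize-NFD-string' first.  The name
`my-decomposition-regexp' is hypothetical and not part of Emacs:

```elisp
(require 'ucs-normalize)

(defun my-decomposition-regexp (string)
  "Return a regexp matching STRING with optional combining marks.
Hypothetical helper for illustration, not part of Emacs."
  (let ((marks (substring ucs-normalize-combining-chars-regexp 0 -1)))
    (mapconcat
     (lambda (c) (concat (regexp-quote (string c)) marks "?"))
     ;; Decompose first so that precomposed letters are split into
     ;; base letter + combining mark, then strip the marks.
     (replace-regexp-in-string
      marks "" (ucs-normalize-NFD-string string))
     "")))

;; Both spellings of the search string should give the same regexp.
(string= (my-decomposition-regexp "caf\u00e9")
         (my-decomposition-regexp "cafe"))

;; The regexp should match the decomposed form "e" + U+0301.
(string-match-p (concat "\\`" (my-decomposition-regexp "cafe") "\\'")
                "cafe\u0301")
```

This still does not match precomposed characters in the buffer, and
it does nothing about the efficiency problem.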