From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Artur Malabarba <bruce.connor.am@gmail.com>
Newsgroups: gmane.emacs.devel
Subject: Re: extending case-fold-search to remove nonspacing marks (diacritics
	etc.)
Date: Fri, 6 Feb 2015 02:32:46 +0000
Message-ID: <CAAdUY-L228zdyHLp0V5O0xN47qGSKVZmq-RKCC0L5Xc48gYbwA@mail.gmail.com>
References: <87fvakvwbf.fsf@lifelogs.com>
	<CAAdUY-+tA-5WjkXuqphdN=FEYH2D=YCHe4GxjmbSftK7L0a-MQ@mail.gmail.com>
	<CAAdUY-+A6Rz=BSbbOfDxVKPznLwGbZmbi0JUi0EY4=MHrQaW6g@mail.gmail.com>
	<87386jx2m2.fsf@mail.linkov.net>
Reply-To: bruce.connor.am@gmail.com
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
X-Trace: ger.gmane.org 1423189975 11929 80.91.229.3 (6 Feb 2015 02:32:55 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Fri, 6 Feb 2015 02:32:55 +0000 (UTC)
Cc: emacs-devel <emacs-devel@gnu.org>
To: Juri Linkov <juri@linkov.net>
In-Reply-To: <87386jx2m2.fsf@mail.linkov.net>
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:182489
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/182489>

2015-02-05 22:54 GMT-02:00 Juri Linkov <juri@linkov.net>:
>> Something essentially identical to this was being discussed here a
>> couple of weeks ago. Look for the thread "Single quotes in Info". I
>> wrote a small elisp solution for building this into isearch (which you
>> can find on the "scratch/isearch-character-group-folding" branch). It
>> took a different approach to yours, relating characters to regexp, but
>> it works.
>
> I see that your branch contains nothing more than was already implemented
> a long time ago in bug#13041 where the major stumbling block was
> an inefficiency of the regexp-based solution.  Could you help to improve it?

I'll have a look. The code I wrote was fast enough for isearch and I'm
starting to convince myself it was the best solution.

The motivation behind extending the case-fold tables was to make this
fast enough to use on any search, and also to have it work in some
real corner cases. Combine that with the core dump I hit while trying
to implement it, and you have a recipe for my fast-diminishing
motivation to do this.

>> The bright side is that I think this two-char way of writing latin
>> accents is much less common (not 100% sure though, it's hard to tell
>> the difference). The downside is that I know nothing about other
>> languages, so maybe using two chars to represent one char is the
>> default behavior in some other languages?
>
> As https://emacs.stackexchange.com/q/7992/478 indicates,
> other languages require insertion/deletion of special characters
> like diacritics/accents from the search string/buffer for normalization.
>
> When looking for a solution I recommend you to check ucs-normalize.
> For example, evaluating:
>
>   (require 'ucs-normalize)
>   ucs-normalize-combining-chars
>
> you can see exactly the same characters
>
>   1616 1615 1619 1648 1618 1612 1613 1611 1617 1614
>
> mentioned in https://emacs.stackexchange.com/a/8001/478
>
> Using its corresponding regexp `ucs-normalize-combining-chars-regexp'
> is easy in isearch, e.g.:
>
>   ;; Decomposition search for accented letters.
>   (define-key isearch-mode-map "\M-sd" 'isearch-toggle-decomposition)
>
>   (defun isearch-toggle-decomposition ()
>     "Toggle Unicode decomposition searching on or off."
>     (interactive)
>     (setq isearch-word (unless (eq isearch-word 'isearch-decomposition-regexp)
>                          'isearch-decomposition-regexp))
>     (if isearch-word (setq isearch-regexp nil))
>     (setq isearch-success t isearch-adjusted t)
>     (isearch-update))
>
>   (defun isearch-decomposition-regexp (string &optional _lax)
>     "Return a regexp that matches decomposed Unicode characters in STRING."
>     (let ((accents (substring ucs-normalize-combining-chars-regexp 0 -1)))
>       (mapconcat
>        (lambda (c0)
>          (concat (string c0) accents "?"))
>        (replace-regexp-in-string accents "" string) "")))
>
>   (put 'isearch-decomposition-regexp 'isearch-message-prefix "deco ")
>
> But this is more inefficient than properly implementing it using case tables.

There's probably a way of handling these in C code, but it'll have to
be done by hand (translation tables won't do it), and by someone who
understands this better than I do. :-)
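
To make the "translation tables won't do it" point a bit more concrete,
here is a minimal sketch, assuming `ucs-normalize' (the library you
pointed to) is loaded. As far as I understand it, a case table maps one
character to one character, while the decomposed form of an accented
letter is two characters, so no single table entry can cover it.
`my-strip-marks' is a hypothetical helper name made up for this
example; it is not in Emacs or in my branch.

```elisp
(require 'ucs-normalize)

;; Precomposed "é" (U+00E9) is a single character...
(length "\u00e9")                             ; => 1
;; ...but its canonical decomposition is "e" followed by
;; COMBINING ACUTE ACCENT (U+0301), i.e. two characters.
(length (ucs-normalize-NFD-string "\u00e9"))  ; => 2

;; Hypothetical helper: decompose STRING, then delete the combining
;; characters, leaving only the base letters.
(defun my-strip-marks (string)
  "Return STRING in NFD form with combining characters removed."
  (replace-regexp-in-string
   ucs-normalize-combining-chars-regexp ""
   (ucs-normalize-NFD-string string)))

;; Both spellings of the word go through the same two steps.
(my-strip-marks "caf\u00e9")   ; precomposed spelling
(my-strip-marks "cafe\u0301")  ; two-char spelling
```

Normalizing both the search string and the text like this is only an
illustration of the mismatch, not a proposal: doing it on the buffer
side for every search is exactly the kind of cost we are trying to
avoid.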