From: Juri Linkov
Newsgroups: gmane.emacs.devel
Subject: Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
Date: Fri, 06 Feb 2015 02:54:45 +0200
Organization: LINKOV.NET
Message-ID: <87386jx2m2.fsf@mail.linkov.net>
References: <87fvakvwbf.fsf@lifelogs.com>
In-Reply-To: (Artur Malabarba's message of "Thu, 5 Feb 2015 23:17:42 +0000")
To: Artur Malabarba
Cc: emacs-devel

> Something essentially identical to this was being discussed here a
> couple of weeks ago. Look for the thread "Single quotes in Info". I
> wrote a small elisp solution for building this into isearch (which you
> can find on the "scratch/isearch-character-group-folding" branch). It
> took a different approach to yours, relating characters to regexp, but
> it works.

I see that your branch contains nothing more than was already implemented
a long time ago in bug#13041, where the major stumbling block was
the inefficiency of the regexp-based solution. Could you help to improve it?

> The bright side is that I think this two-char way of writing latin
> accents is much less common (not 100% sure though, it's hard to tell
> the difference). The downside is that I know nothing about other
> languages, so maybe using two chars to represent one char is the
> default behavior in some other languages?

As https://emacs.stackexchange.com/q/7992/478 indicates, other languages
require insertion/deletion of special characters like diacritics/accents
from the search string/buffer for normalization.

When looking for a solution I recommend checking ucs-normalize.
For example, evaluating:

(require 'ucs-normalize)
ucs-normalize-combining-chars

you can see exactly the same characters
1616 1615 1619 1648 1618 1612 1613 1611 1617 1614
mentioned in https://emacs.stackexchange.com/a/8001/478

Using its corresponding regexp `ucs-normalize-combining-chars-regexp'
is easy in isearch, e.g.:

;; Decomposition search for accented letters.
(define-key isearch-mode-map "\M-sd" 'isearch-toggle-decomposition)

(defun isearch-toggle-decomposition ()
  "Toggle Unicode decomposition searching on or off."
  (interactive)
  (setq isearch-word (unless (eq isearch-word 'isearch-decomposition-regexp)
                       'isearch-decomposition-regexp))
  (if isearch-word (setq isearch-regexp nil))
  (setq isearch-success t isearch-adjusted t)
  (isearch-update))

(defun isearch-decomposition-regexp (string &optional _lax)
  "Return a regexp that matches decomposed Unicode characters in STRING."
  (let ((accents (substring ucs-normalize-combining-chars-regexp 0 -1)))
    (mapconcat
     (lambda (c0) (concat (string c0) accents "?"))
     (replace-regexp-in-string accents "" string) "")))

(put 'isearch-decomposition-regexp 'isearch-message-prefix "deco ")

But this is less efficient than properly implementing it using case tables.
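
To make the normalization idea concrete outside of isearch, here is
a minimal sketch that decomposes a string to NFD and then deletes runs
of combining characters.  It uses only `ucs-normalize-NFD-string' and
`ucs-normalize-combining-chars-regexp' from ucs-normalize.el; the two
`my-' functions are hypothetical names made up for illustration and are
not part of Emacs or of the branch discussed above.

```elisp
(require 'ucs-normalize)

(defun my-strip-combining-chars (string)
  "Return STRING decomposed to NFD with combining characters removed.
Hypothetical helper for illustration only."
  (replace-regexp-in-string
   ;; Matches one or more consecutive combining characters.
   ucs-normalize-combining-chars-regexp ""
   ;; Split precomposed letters into base letter + combining marks.
   (ucs-normalize-NFD-string string)))

(defun my-equal-ignoring-accents (a b)
  "Return non-nil if A and B are equal once combining marks are dropped.
Hypothetical helper for illustration only."
  (string= (my-strip-combining-chars a)
           (my-strip-combining-chars b)))

;; Precomposed U+00E9 becomes plain "e" after decomposition + stripping.
(my-strip-combining-chars "caf\u00e9")
;; Arabic letter U+0628 followed by kasra U+0650 (decimal 1616).
(my-strip-combining-chars "\u0628\u0650")
```

This normalizes whole strings rather than building a regexp, so it only
shows which characters are treated as removable marks; it does not
address the search-time efficiency concern raised above.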