From: Artur Malabarba
Newsgroups: gmane.emacs.devel
Subject: Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
Date: Fri, 6 Feb 2015 02:32:46 +0000
References: <87fvakvwbf.fsf@lifelogs.com> <87386jx2m2.fsf@mail.linkov.net>
In-Reply-To: <87386jx2m2.fsf@mail.linkov.net>
Reply-To: bruce.connor.am@gmail.com
To: Juri Linkov
Cc: emacs-devel

2015-02-05 22:54 GMT-02:00 Juri Linkov :
>> Something essentially identical to this was being discussed here a
>> couple of weeks ago. Look for the thread "Single quotes in Info". I
>> wrote a small elisp solution for building this into isearch (which you
>> can find on the "scratch/isearch-character-group-folding" branch). It
>> took a different approach to yours, relating characters to regexp, but
>> it works.
>
> I see that your branch contains nothing more than was already implemented
> a long time ago in bug#13041 where the major stumbling block was
> an inefficiency of the regexp-based solution. Could you help to improve it?

I'll have a look.

The code I wrote was fast enough for isearch, and I'm starting to
convince myself it was the best solution. The motivation behind
extending case-fold tables was to make it fast enough to use in any
search, and also to have it work in some real corner cases. Combine
this with the core-dump issue I've hit while trying to implement it,
and you have a recipe for my fast-diminishing motivation to do this.

>> The bright side is that I think this two-char way of writing latin
>> accents is much less common (not 100% sure though, it's hard to tell
>> the difference). The downside is that I know nothing about other
>> languages, so maybe using two chars to represent one char is the
>> default behavior in some other languages?
>
> As https://emacs.stackexchange.com/q/7992/478 indicates,
> other languages require insertion/deletion of special characters
> like diacritics/accents from the search string/buffer for normalization.
>
> When looking for a solution I recommend you to check ucs-normalize.
> For example, evaluating:
>
>   (require 'ucs-normalize)
>   ucs-normalize-combining-chars
>
> you can see exactly the same characters
>
>   1616 1615 1619 1648 1618 1612 1613 1611 1617 1614
>
> mentioned in https://emacs.stackexchange.com/a/8001/478
>
> Using its corresponding regexp `ucs-normalize-combining-chars-regexp'
> is easy in isearch, e.g.:
>
> ;; Decomposition search for accented letters.
> (define-key isearch-mode-map "\M-sd" 'isearch-toggle-decomposition)
>
> (defun isearch-toggle-decomposition ()
>   "Toggle Unicode decomposition searching on or off."
>   (interactive)
>   (setq isearch-word (unless (eq isearch-word 'isearch-decomposition-regexp)
>                        'isearch-decomposition-regexp))
>   (if isearch-word (setq isearch-regexp nil))
>   (setq isearch-success t isearch-adjusted t)
>   (isearch-update))
>
> (defun isearch-decomposition-regexp (string &optional _lax)
>   "Return a regexp that matches decomposed Unicode characters in STRING."
>   (let ((accents (substring ucs-normalize-combining-chars-regexp 0 -1)))
>     (mapconcat
>      (lambda (c0)
>        (concat (string c0) accents "?"))
>      (replace-regexp-in-string accents "" string) "")))
>
> (put 'isearch-decomposition-regexp 'isearch-message-prefix "deco ")
>
> But this is more inefficient than properly implementing it using case tables.

There's probably a way of handling these in C code, but it'll have to
be done manually (translation tables won't do it), and by someone who
understands this better than I do. :-)
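
To make the ucs-normalize route discussed above concrete, here is a
minimal sketch of the string-side half of the problem only: decompose a
string to NFD, then drop the combining characters, so that the one-char
and two-char spellings of an accented letter compare equal. The `my-'
function names are hypothetical and are not taken from any branch or
bug report; `ucs-normalize-NFD-string' and
`ucs-normalize-combining-chars-regexp' are assumed to behave as they do
in ucs-normalize.el. It normalizes strings and is not a search
implementation, so it does not address the efficiency question.

```elisp
;; Sketch only: the `my-' helpers below are hypothetical names used for
;; illustration; they are not part of Emacs.
(require 'ucs-normalize)

(defun my-strip-combining-marks (string)
  "Return STRING in NFD form with its combining characters removed."
  (replace-regexp-in-string
   ;; Matches a run of one or more combining characters.
   ucs-normalize-combining-chars-regexp ""
   ;; Canonical decomposition: a precomposed letter becomes its base
   ;; character followed by the combining mark(s).
   (ucs-normalize-NFD-string string)))

(defun my-diacritic-insensitive-equal-p (a b)
  "Return non-nil if A and B are equal once combining marks are ignored."
  (string= (my-strip-combining-marks a)
           (my-strip-combining-marks b)))
```

With these definitions, (my-diacritic-insensitive-equal-p "café" "cafe")
should return non-nil whether the "é" is written as one character or as
"e" followed by a combining acute accent.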