From: Juri Linkov
Newsgroups: gmane.emacs.bugs
Subject: bug#13041: 24.2; diacritic-fold-search
Date: Sun, 02 Dec 2012 02:27:32 +0200
Organization: JURTA
Message-ID: <87hao5jqu3.fsf@mail.jurta.org>
References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org>
In-Reply-To: <83fw3qtboc.fsf@gnu.org> (Eli Zaretskii's message of "Sat, 01 Dec 2012 10:32:35 +0200")
To: Eli Zaretskii
Cc: perin@panix.com, 13041@debbugs.gnu.org, perin@acm.org

> Using these properties, every search string can be converted to a
> sequence of non-decomposable characters (this process is recursive,
> because the 'decomposition' property can use characters that
> themselves are decomposable).  If the user wants to ignore diacritics,
> then the diacritics should be dropped from the decomposition sequence
> before starting the search.  E.g., for the decomposition of è above,
> we will drop the 768 and will be left with 101, which is 'e'.  Then
> searching for that string should apply the same decomposition
> transformation to the text being searched, when comparing them.

Yes, using the `decomposition' property would be better than hard-coding
these decomposition mappings.  Though I'm surprised to see case mappings
hard-coded in lisp/international/characters.el instead of using the
properties `uppercase' and `lowercase' during creation of case tables.

But nevertheless the `decomposition' property should be used to find
all decomposable characters.  The question is how to use them in the
search.

One solution is to use the case tables.
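
For reference, the fold described in the quoted text (decompose
recursively, then drop the diacritics) can be sketched as a plain
function, independent of how it is later wired into the search.  This is
only a minimal sketch: the names `sketch-fold-char' and
`sketch-fold-string' are hypothetical, and it assumes that
`get-char-code-property' returns a list containing the character itself
for non-decomposable characters, possibly preceded by a tag symbol for
compatibility decompositions.

```elisp
(defun sketch-fold-char (char)
  "Return the list of base characters that CHAR decomposes to.
Nonspacing marks (general category Mn) are dropped.
Hypothetical helper, for illustration only."
  (let ((decomp (get-char-code-property char 'decomposition)))
    ;; Skip a leading compatibility tag such as `compat' or `font'.
    (while (and decomp (not (characterp (car decomp))))
      (setq decomp (cdr decomp)))
    (if (or (null decomp) (memq char decomp))
        ;; Not decomposable: keep it unless it is a combining mark.
        (unless (eq (get-char-code-property char 'general-category) 'Mn)
          (list char))
      ;; Decomposable: fold every component recursively.
      (apply #'append (mapcar #'sketch-fold-char decomp)))))

(defun sketch-fold-string (string)
  "Return STRING with every character folded by `sketch-fold-char'."
  (concat (apply #'append (mapcar #'sketch-fold-char string))))
```

With this, (sketch-fold-string "café") evaluates to "cafe".  Note that
compatibility decompositions are followed too, so ligatures and similar
characters are expanded as well, which may or may not be wanted.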
I tried to build the case table with the decomposed characters
retrieved using the `decomposition' property recursively:

(defvar decomposition-table nil)

(defun make-decomposition-table ()
  (let ((table (standard-case-table))
        canon)
    (setq canon (copy-sequence table))
    (let ((c #x0000) d)
      (while (<= c #xFFFD)
        (make-decomposition-table-1 canon c c)
        (setq c (1+ c))))
    (set-char-table-extra-slot table 1 canon)
    (set-char-table-extra-slot table 2 nil)
    (setq decomposition-table table)))

(defun make-decomposition-table-1 (canon c0 c1)
  (let ((d (get-char-code-property c1 'decomposition)))
    (when d
      (unless (characterp (car d)) (pop d))
      (if (eq c1 (car d))
          (aset canon c0 (car d))
        (make-decomposition-table-1 canon c0 (car d))))))

(make-decomposition-table)

Then a new Isearch command (the existing `isearch-toggle-case-fold'
can't be used because it enables/disables the standard case table)
could toggle between the current case table and the decomposition
case table using

(set-case-table decomposition-table)

After evaluating this, Isearch correctly finds all related characters
in every row of this example:

http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html

But it seems using the case table for decomposition has one limitation.
I see no way to ignore combining accent characters in the case table,
i.e. to map combining accent characters to nothing.  These characters
have the general-category "Mn (Mark, Nonspacing)", so they should be
ignored in the search.

An alternative would be to build a regexp from the search string
like building a regexp for word-search:

(define-key isearch-mode-map "\M-sd" 'isearch-toggle-decomposition)

(defun isearch-toggle-decomposition ()
  "Toggle Unicode decomposition searching on or off."
  (interactive)
  (setq isearch-word (unless (eq isearch-word 'isearch-decomposition-regexp)
                       'isearch-decomposition-regexp))
  (if isearch-word (setq isearch-regexp nil))
  (setq isearch-success t isearch-adjusted t)
  (isearch-update))

(defun isearch-decomposition-regexp (string &optional _lax)
  "Return a regexp that matches decomposed Unicode characters in STRING."
  (mapconcat
   (lambda (c0)
     (if (eq (get-char-code-property c0 'general-category) 'Mn)
         ;; Mark-Nonspacing chars like COMBINING ACUTE ACCENT are optional.
         (concat (string c0) "?")
       (let ((c1 c0) c2 chars)
         (while (and (setq c2 (aref (char-table-extra-slot
                                     decomposition-table 2) c1))
                     (not (eq c2 c0)))
           (push c2 chars)
           (setq c1 c2))
         (if chars
             ;; Character alternatives from the case equivalences table.
             (concat "[" (string c0) chars "]")
           (string c0)))))
   string ""))

(put 'isearch-decomposition-regexp 'isearch-message-prefix "deco ")

This uses the decomposition table created above but instead of
activating it, it's necessary to "shuffle" the equivalences table with
the following code that prepares the table but doesn't enable it in
the current buffer:

(with-temp-buffer
  (set-case-table decomposition-table))

The advantage of the regexp-based approach is making combining accents
optional in the search string.  But there is another problem: how to
ignore combining accents in the buffer when the search string doesn't
contain them.  With regexps this means adding a group of all possible
combining accents after every character in the search string like
turning a search string like "abc" into "a[́̂̃̄̆]?b[́̂̃̄̆]?c[́̂̃̄̆]?".
This would make the search slow, and I have no better idea.
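
The "abc" transformation described above can be sketched as follows.
This is only an illustration under stated assumptions: the names
`sketch-combining-marks' and `sketch-marks-optional-regexp' are
hypothetical, and the character class covers just the Combining
Diacritical Marks block (U+0300..U+036F) as a range rather than every
character with general-category Mn.  It does not address the speed
concern, and it does not match precomposed characters in the buffer;
that part would still need the equivalences table.

```elisp
(defconst sketch-combining-marks "[\u0300-\u036f]*"
  "Regexp matching any run of marks from U+0300..U+036F.
Hypothetical constant; a full solution would cover all Mn characters.")

(defun sketch-marks-optional-regexp (string)
  "Return a regexp matching STRING with optional combining marks.
Every character of STRING may be followed by combining marks in the
searched text.  Marks typed in STRING itself are skipped, because the
class after the preceding character already absorbs them."
  (mapconcat
   (lambda (c)
     (if (and (>= c #x300) (<= c #x36f))
         ""
       (concat (regexp-quote (string c)) sketch-combining-marks)))
   string ""))
```

For example, the regexp built from "abc" matches both plain "abc" and
text where "a" or "c" is followed by a combining accent.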