From: Juri Linkov <juri@jurta.org>
Newsgroups: gmane.emacs.bugs
Subject: bug#13041: 24.2; diacritic-fold-search
Date: Sun, 02 Dec 2012 02:27:32 +0200
Organization: JURTA
Message-ID: <87hao5jqu3.fsf@mail.jurta.org>
References: <20121130182205.C722F14B8D@panix1.panix.com>
	<87hao69b5r.fsf@mail.jurta.org>
	<20665.8224.844876.619203@panix5.panix.com>
	<87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org>
Cc: perin@panix.com, 13041@debbugs.gnu.org, perin@acm.org
To: Eli Zaretskii <eliz@gnu.org>
In-Reply-To: <83fw3qtboc.fsf@gnu.org> (Eli Zaretskii's message of "Sat, 01 Dec
	2012 10:32:35 +0200")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3.50 (x86_64-pc-linux-gnu)
Archived-At: <http://permalink.gmane.org/gmane.emacs.bugs/67749>

> Using these properties, every search string can be converted to a
> sequence of non-decomposable characters (this process is recursive,
> because the 'decomposition' property can use characters that
> themselves are decomposable).  If the user wants to ignore diacritics,
> then the diacritics should be dropped from the decomposition sequence
> before starting the search.  E.g., for the decomposition of è above,
> we will drop the 768 and will be left with 101, which is 'e'.  Then
> searching for that string should apply the same decomposition
> transformation to the text being searched, when comparing them.

Yes, using the `decomposition' property would be better than hard-coding
these decomposition mappings.  Though I'm surprised to see case mappings
hard-coded in lisp/international/characters.el instead of using the
properties `uppercase' and `lowercase' during creation of case tables.

But nevertheless the `decomposition' property should be used to find
all decomposable characters.  The question is how to use them in the search.
One solution is to use the case tables.  I tried to build the case table
with the decomposed characters retrieved using the `decomposition' property
recursively:

(defvar decomposition-table nil)

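(defun make-decomposition-table ()
  ;; Work on a copy so that the standard case table itself is not modified.
  (let ((table (copy-case-table (standard-case-table)))
        canon)
    (setq canon (copy-sequence table))
    ;; Map every character up to #xFFFD to its base character.
    (let ((c #x0000))
      (while (<= c #xFFFD)
        (make-decomposition-table-1 canon c c)
        (setq c (1+ c))))
    ;; Install the canonicalize table and let `set-case-table' rebuild
    ;; the equivalences table.
    (set-char-table-extra-slot table 1 canon)
    (set-char-table-extra-slot table 2 nil)
    (setq decomposition-table table)))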

(defun make-decomposition-table-1 (canon c0 c1)
  (let ((d (get-char-code-property c1 'decomposition)))
    (when d
      (unless (characterp (car d)) (pop d))
      (if (eq c1 (car d))
          (aset canon c0 (car d))
        (make-decomposition-table-1 canon c0 (car d))))))

(make-decomposition-table)
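
The recursive lookup can also be tried on a single character without
building a table.  The following is only a minimal sketch: the name
`my-base-char' is hypothetical, and it assumes that the `decomposition'
property of a non-decomposable character is a list containing just that
character.

```elisp
(defun my-base-char (char)
  "Follow the `decomposition' property of CHAR down to a base character."
  (let ((d (get-char-code-property char 'decomposition)))
    ;; Skip a leading compatibility tag such as `compat' or `font'.
    (unless (characterp (car d)) (pop d))
    (if (or (null d) (eq (car d) char))
        ;; Nothing further to decompose.
        char
      ;; Recurse on the first character of the decomposition.
      (my-base-char (car d)))))

;; U+01FB decomposes to U+00E5 + U+0301, and U+00E5 to ?a + U+030A.
(my-base-char #x1FB)   ; => 97 (?a)
```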

Then a new Isearch command (the existing `isearch-toggle-case-fold'
can't be used because it enables/disables the standard case table)
could toggle between the current case table and the decomposition
case table using

  (set-case-table decomposition-table)

After evaluating this, Isearch correctly finds all related characters
in every row of this example:

  http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html

But it seems using the case table for decomposition has one limitation.
I see no way to ignore combining accent characters in the case table,
i.e. to map combining accent characters to nothing.  These characters
have the general-category "Mn (Mark, Nonspacing)", so they should be ignored
in the search.
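
What "ignoring" these characters would mean can be shown outside the case
table.  The sketch below simply drops every character whose
`general-category' is Mn from a string.  The function name
`my-strip-nonspacing-marks' is hypothetical, and this is only an
illustration of the idea, not something a case table can express.

```elisp
(defun my-strip-nonspacing-marks (string)
  "Return STRING without characters of general category Mn."
  (apply #'string
         (delq nil
               (mapcar (lambda (c)
                         ;; Keep C unless it is a nonspacing mark.
                         (unless (eq (get-char-code-property
                                      c 'general-category)
                                     'Mn)
                           c))
                       string))))

;; "e" followed by COMBINING GRAVE ACCENT (U+0300, i.e. 768).
(my-strip-nonspacing-marks "e\u0300")   ; => "e"
```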

An alternative would be to build a regexp from the search string
like building a regexp for word-search:

(define-key isearch-mode-map "\M-sd" 'isearch-toggle-decomposition)

(defun isearch-toggle-decomposition ()
  "Toggle Unicode decomposition searching on or off."
  (interactive)
  (setq isearch-word (unless (eq isearch-word 'isearch-decomposition-regexp)
		       'isearch-decomposition-regexp))
  (if isearch-word (setq isearch-regexp nil))
  (setq isearch-success t isearch-adjusted t)
  (isearch-update))

(defun isearch-decomposition-regexp (string &optional _lax)
  "Return a regexp that matches decomposed Unicode characters in STRING."
  (mapconcat
   (lambda (c0)
     (if (eq (get-char-code-property c0 'general-category) 'Mn)
         ;; Mark-Nonspacing chars like COMBINING ACUTE ACCENT are optional.
         (concat (string c0) "?")
       (let ((c1 c0) c2 chars)
         (while (and (setq c2 (aref (char-table-extra-slot
                                     decomposition-table 2) c1))
                     (not (eq c2 c0)))
           (push c2 chars)
           (setq c1 c2))
         (if chars
             ;; Character alternatives from the case equivalences table.
             (concat "[" (string c0) chars "]")
           ;; Quote regexp-special characters such as "." or "*".
           (regexp-quote (string c0))))))
   string ""))

(put 'isearch-decomposition-regexp 'isearch-message-prefix "deco ")

This uses the decomposition table created above but instead of activating it,
it's necessary to "shuffle" the equivalences table with the following code
that prepares the table but doesn't enable it in the current buffer:

  (with-temp-buffer (set-case-table decomposition-table))

The advantage of the regexp-based approach is that combining accents become
optional in the search string.  But there is another problem: how to ignore
combining accents in the buffer when the search string doesn't contain them.
With regexps this means adding a group of all possible combining accents
after every character in the search string, e.g. turning a search string
like "abc" into "a[́̂̃̄̆]?b[́̂̃̄̆]?c[́̂̃̄̆]?".
This would make the search slow, and I have no better idea.
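
For illustration, such a regexp could be generated along the following
lines.  This is only a sketch under stated assumptions: the names
`my-combining-accents' and `my-accent-tolerant-regexp' are hypothetical,
the accent list holds just the five marks from the example rather than
all possible combining accents, and "?" allows at most one mark after
each character.

```elisp
;; Hypothetical sample set: U+0301, U+0302, U+0303, U+0304 and U+0306.
(defvar my-combining-accents "\u0301\u0302\u0303\u0304\u0306")

(defun my-accent-tolerant-regexp (string)
  "Return a regexp matching STRING with an optional accent after each char."
  (mapconcat
   (lambda (c)
     ;; Quote the character itself, then allow one optional accent.
     (concat (regexp-quote (string c)) "[" my-combining-accents "]?"))
   string ""))

;; "a" + COMBINING ACUTE ACCENT + "bc" is found by a search for "abc".
(string-match-p (my-accent-tolerant-regexp "abc") "a\u0301bc")   ; => 0
```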