From: Juri Linkov <juri@jurta.org>
To: Eli Zaretskii <eliz@gnu.org>
Cc: perin@panix.com, 13041@debbugs.gnu.org, perin@acm.org
Subject: bug#13041: 24.2; diacritic-fold-search
Date: Sun, 02 Dec 2012 02:27:32 +0200
Message-ID: <87hao5jqu3.fsf@mail.jurta.org>
In-Reply-To: <83fw3qtboc.fsf@gnu.org> (Eli Zaretskii's message of "Sat, 01 Dec 2012 10:32:35 +0200")

> Using these properties, every search string can be converted to a
> sequence of non-decomposable characters (this process is recursive,
> because the 'decomposition' property can use characters that
> themselves are decomposable).  If the user wants to ignore diacritics,
> then the diacritics should be dropped from the decomposition sequence
> before starting the search.  E.g., for the decomposition of è above,
> we will drop the 768 and will be left with 101, which is 'e'.  Then
> searching for that string should apply the same decomposition
> transformation to the text being searched, when comparing them.

Yes, using the `decomposition' property would be better than hard-coding
these decomposition mappings.  Though I'm surprised to see case mappings
hard-coded in lisp/international/characters.el instead of using the
properties `uppercase' and `lowercase' during creation of case tables.

But nevertheless the `decomposition' property should be used to find
all decomposable characters.  The question is how to use them in the search.
One solution is to use the case tables.  I tried to build the case table
with the decomposed characters retrieved using the `decomposition' property
recursively:

(defvar decomposition-table nil)

(defun make-decomposition-table ()
  ;; Work on a copy, so the standard case table itself is not modified.
  (let* ((table (copy-case-table (standard-case-table)))
         (canon (copy-sequence table))
         (c #x0000))
    ;; Map each character to the base character of its decomposition.
    (while (<= c #xFFFD)
      (make-decomposition-table-1 canon c c)
      (setq c (1+ c)))
    (set-char-table-extra-slot table 1 canon)
    ;; Clear the equivalences slot so that `set-case-table' recomputes it.
    (set-char-table-extra-slot table 2 nil)
    (setq decomposition-table table)))

(defun make-decomposition-table-1 (canon c0 c1)
  (let ((d (get-char-code-property c1 'decomposition)))
    (when d
      ;; Skip a leading compatibility tag such as `compat'.
      (unless (characterp (car d)) (pop d))
      (if (eq c1 (car d))
          (aset canon c0 (car d))
        (make-decomposition-table-1 canon c0 (car d))))))

(make-decomposition-table)

Then a new Isearch command (the existing `isearch-toggle-case-fold'
can't be used because it enables/disables the standard case table)
could toggle between the current case table and the decomposition
case table using

  (set-case-table decomposition-table)

After evaluating this, Isearch correctly finds all related characters
in every row of this example:

  http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html

But it seems using the case table for decomposition has one limitation.
I see no way to ignore combining accent characters in the case table,
i.e. to map combining accent characters to nothing.  These characters
have the general-category "Mn (Mark, Nonspacing)", so they should be ignored
in the search.
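
Outside the case table, on plain strings, dropping these characters is
straightforward.  The following is only a minimal sketch of the recursive
decomposition described in the quoted text, followed by removal of every
character whose general-category is Mn.  The names `my-decompose-char' and
`my-strip-diacritics' are hypothetical and exist only for this illustration;
they are not part of Emacs.

```elisp
;; Hypothetical helpers for illustration only; not part of Emacs.
(defun my-decompose-char (char)
  "Return the list of characters that CHAR fully decomposes into."
  (let ((d (get-char-code-property char 'decomposition)))
    ;; Skip a leading compatibility tag such as `compat'.
    (unless (characterp (car d)) (pop d))
    (if (or (null d) (equal d (list char)))
        ;; CHAR is not decomposable.
        (list char)
      ;; Decompose recursively, since the parts may be decomposable too.
      (apply #'append (mapcar #'my-decompose-char d)))))

(defun my-strip-diacritics (string)
  "Return STRING fully decomposed, with nonspacing marks (Mn) removed."
  (let (result)
    (dolist (c (apply #'append (mapcar #'my-decompose-char string)))
      (unless (eq (get-char-code-property c 'general-category) 'Mn)
        (push c result)))
    (apply #'string (nreverse result))))
```

With the decomposition of è quoted above, (my-decompose-char ?è) gives
(101 768), and (my-strip-diacritics "è") drops the 768 and returns "e".
This only normalizes a string; it does not by itself make the search ignore
accents in the buffer.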

An alternative would be to build a regexp from the search string,
much as is done for word search:

(define-key isearch-mode-map "\M-sd" 'isearch-toggle-decomposition)

(defun isearch-toggle-decomposition ()
  "Toggle Unicode decomposition searching on or off."
  (interactive)
  (setq isearch-word (unless (eq isearch-word 'isearch-decomposition-regexp)
		       'isearch-decomposition-regexp))
  (if isearch-word (setq isearch-regexp nil))
  (setq isearch-success t isearch-adjusted t)
  (isearch-update))

(defun isearch-decomposition-regexp (string &optional _lax)
  "Return a regexp that matches decomposed Unicode characters in STRING."
  (mapconcat
   (lambda (c0)
     (if (eq (get-char-code-property c0 'general-category) 'Mn)
         ;; Mark-Nonspacing chars like COMBINING ACUTE ACCENT are optional.
         (concat (string c0) "?")
       (let ((c1 c0) c2 chars)
         (while (and (setq c2 (aref (char-table-extra-slot
                                     decomposition-table 2) c1))
                     (not (eq c2 c0)))
           (push c2 chars)
           (setq c1 c2))
         (if chars
             ;; Character alternatives from the case equivalences table.
             (concat "[" (string c0) chars "]")
           (string c0)))))
   string ""))

(put 'isearch-decomposition-regexp 'isearch-message-prefix "deco ")

This uses the decomposition table created above, but without activating it.
Instead, its equivalences table must first be computed ("shuffled") with the
following code, which prepares the table but doesn't enable it in the current
buffer:

  (with-temp-buffer (set-case-table decomposition-table))

The advantage of the regexp-based approach is making combining accents
optional in the search string.  But there is another problem: how to ignore
combining accents in the buffer when the search string doesn't contain them.
With regexps this means adding a group of all possible combining accents
after every character in the search string, e.g. turning a search string
like "abc" into "a[́̂̃̄̆]?b[́̂̃̄̆]?c[́̂̃̄̆]?".
This would make the search slow, and I have no better idea.
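
For illustration, a minimal sketch of that transformation.  It assumes that
"all possible combining accents" can be approximated by the Combining
Diacritical Marks block U+0300..U+036F, and it uses "*" instead of "?" so
that several stacked marks are skipped.  The name
`my-accent-insensitive-regexp' is hypothetical, and how slow the resulting
search is has not been measured here.

```elisp
;; Hypothetical helper for illustration only; not part of Emacs.
(defun my-accent-insensitive-regexp (string)
  "Return a regexp matching STRING with optional combining marks.
After every character of STRING, any number of characters in the
range U+0300..U+036F (an assumed approximation) may follow."
  (mapconcat
   (lambda (c)
     ;; Quote the character itself, then allow trailing combining marks.
     (concat (regexp-quote (string c)) "[\u0300-\u036f]*"))
   string ""))
```

This only covers buffer text in decomposed form.  Precomposed characters
would still need the character alternatives built by
`isearch-decomposition-regexp' above.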





