all messages for Emacs-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed
From: Drew Adams <drew.adams@oracle.com>
To: emacs-devel@gnu.org
Subject: RE: char equivalence classes in search - why not symmetric?
Date: Thu, 10 Sep 2015 14:46:39 -0700 (PDT)	[thread overview]
Message-ID: <cc7c4bb6-fb38-4d35-97d4-b4ee2f5661d8@default> (raw)
In-Reply-To: <116512ec-bdec-43de-afa9-dc01a57715e8@default>

[-- Attachment #1: Type: text/plain, Size: 1707 bytes --]

Yesterday I said:

 > 2. The code I have is not sufficient for everything.  You can
 > use it to see what the behavior is for single-char entries in the
 > char table, which includes accented chars (chars with diacritics).
 > But it does not also handle multiple-char entries in the table.
 > 
 > For instance, you can search for "é" and get char folding, but you
 > cannot search for "é" and get char folding.  The first of these is
 > just the char named LATIN SMALL LETTER E WITH ACUTE.  The second is
 > plain "e" composed with "́" (the char named COMBINING ACUTE ACCENT).
 > 
 > Some more work would be needed to make such combinations work too.
 > As I said, I'm no expert on char tables.  But the attached code
 > should give you a good idea of what is involved.

The attached version seems to take care of this, so you can search
with, say, the decomposition "é" and get the same effect as
searching for the fully composed char "é".

Again, just load the file, to try it out.  Remember that M-s '
toggles char folding.

At the end of the file there are a few strings you can use to test.
When you see two consecutive strings there that look the same, the
first is a decomposition, and the second is the same char fully
composed.

For example: "é" "é".  (The first string is two chars, however it
might be displayed.)

`C-u C-x =' on the first char of the first string tells you:
LATIN SMALL LETTER E, decomposition: (101) ('e')
and on the second char it tells you:
COMBINING ACUTE ACCENT, decomposition: (769) ('́').

`C-u C-x =' on the single char of the second string tells you:
LATIN SMALL LETTER E WITH ACUTE, decomposition: (101 769) ('e' '́')

[-- Attachment #2: symmetric-char-fold.el --]
[-- Type: application/octet-stream, Size: 7320 bytes --]

(setq character-fold-search t)
(load-library "character-fold")

(defvar char-fold-decomps ()
  "List of conses of a decomposition and its base char.")

(defun update-char-fold-table ()
  "Update the value of variable `character-fold-table'.
The new value reflects the current value of `char-fold-symmetric'."
  (setq char-fold-decomps  ())
  (setq character-fold-table
        (let* ((equiv  (make-char-table 'character-fold-table))
               (table  (unicode-property-table-internal 'decomposition))
               (func   (char-table-extra-slot table 1)))
          ;; Ensure the table is populated.
          (map-char-table (lambda (i v) (when (consp i) (funcall func (car i) v table)))
                          table)
          ;; Compile a list of all complex chars that each simple char should match.
          (map-char-table
           (lambda (i dec)
             (when (consp dec)
               ;; Discard a possible formatting tag.
               (when (symbolp (car dec))
                 (setq dec  (cdr dec)))
               ;; Skip trivial cases like ?a decomposing to (?a).
               (unless (and (eq i (car dec))  (not  (cdr dec)))
                 (let ((d            dec)
                       (fold-decomp  t)
                       k found)
                   (while (and d  (not found))
                     (setq k  (pop d))
                     ;; Is k a number or letter, per unicode standard?
                     (setq found  (memq (get-char-code-property k 'general-category)
                                        '(Lu Ll Lt Lm Lo Nd Nl No))))
                   (if found
                       ;; Check if the decomposition has more than one letter,
                       ;; because then we don't want the first letter to match
                       ;; the decomposition.
                       (dolist (k d)
                         (when (and fold-decomp
                                    (memq (get-char-code-property k 'general-category)
                                          '(Lu Ll Lt Lm Lo Nd Nl No)))
                           (setq fold-decomp  nil)))
                     ;; If there's no number or letter on the
                     ;; decomposition, take the first character in it.
                     (setq found  (car-safe dec)))
                   ;; Finally, we only fold multi-char decomposition if at
                   ;; least one of the chars is non-spacing (combining).
                   (when fold-decomp
                     (setq fold-decomp  nil)
                     (dolist (k  dec)
                       (when (and (not fold-decomp)
                                  (> (get-char-code-property k 'canonical-combining-class) 0))
                         (setq fold-decomp  t))))
                   ;; Add i to the list of characters that k can
                   ;; represent. Also possibly add its decomposition, so we can
                   ;; match multi-char representations like (format "a%c" 769)
                   (when (and found  (not (eq i k)))
                     (let ((chr-strgs  (cons (char-to-string i) (aref equiv k))))
                       (aset equiv k (if fold-decomp
                                         (cons (apply #'string dec) chr-strgs)
                                       chr-strgs))))))))
           table)
          ;; Add some manual entries.
          (dolist (it '((?\" """ "“" "”" "”" "„" "⹂" "〞" "‟" "‟" "❞" "❝"
                         "❠" "“" "„" "〝" "〟" "🙷" "🙶" "🙸" "«" "»")
                        (?' "❟" "❛" "❜" "‘" "’" "‚" "‛" "‚" "󠀢" "❮" "❯" "‹" "›")
                        (?` "❛" "‘" "‛" "󠀢" "❮" "‹")))
            (let ((idx        (car it))
                  (chr-strgs  (cdr it)))
              (aset equiv idx (append chr-strgs (aref equiv idx)))))

          ;; --------8<------the only addition----------------
          (when char-fold-symmetric
            ;; Add an entry for each equivalent char.
            (let ((others  ()))
              (map-char-table
               (lambda (base v)
                 (let ((chr-strgs  (aref equiv base)))
                   (when (consp chr-strgs)
                     (dolist (strg  (cdr chr-strgs))
                       (if (< (length strg) 2)
                           (push (cons (string-to-char strg) (remove strg chr-strgs)) others)
                         ;; A decomposition.  Add it and its base char to `char-fold-decomps'.
                         (push (cons strg (char-to-string base)) char-fold-decomps))))))
               equiv)
              (dolist (it  others)
                (let ((base       (car it))
                      (chr-strgs  (cdr it)))
                  (aset equiv base (append chr-strgs (aref equiv base)))))))
          ;; --------8<---------------------------------------

          ;; Convert the lists of characters we compiled into regexps.
          (map-char-table
           (lambda (i v) (let ((re  (regexp-opt (cons (char-to-string i) v))))
                      (if (consp i)
                          (set-char-table-range equiv i re)
                        (aset equiv i re))))
           equiv)
          equiv)))

(defun character-fold-to-regexp (string &optional lax)
  "Return a regexp matching anything that character-folds into STRING.
If `character-fold-search' is nil, just `regexp-quote' STRING.
Otherwise:

Replace any decompositions in `character-fold-table' by their base
chars, so search will match all equivalents.  Then replace any chars
in STRING that have entries in `character-fold-table' by their
entries (which are regexps), and replace other chars in STRING by
`regexp-quote' applied to them.

Non-nil LAX means any whitespace char can match any number of times."
  (if (not character-fold-search)
      (regexp-quote string)
    (when char-fold-decomps
      (dolist (decomp  char-fold-decomps)
        (setq string  (replace-regexp-in-string
                       (regexp-quote (car decomp)) (cdr decomp) string 'FIXED-CASE 'LITERAL))))
    (apply #'concat
           (mapcar (lambda (c) (if (and lax (memq c '(?\s ?\t ?\r ?\n)))
                              "[ \t\n\r\xa0\x2002\x2d\x200a\x202f\x205f\x3000]+"
                            (or (aref character-fold-table c)
                                (regexp-quote (string c)))))
                   string))))

(defcustom char-fold-symmetric t
  "Non-nil means char-fold searching treats equivalent chars the same.
That is, use of any of a set of char-fold equivalent chars in a search
string finds any of them in the text being searched.

If nil then only the \"base\" or \"canonical\" char of the set matches
any of them.  The others match only themselves, even when char-folding
is turned on."
  :set (lambda (sym defs)
         (custom-set-default sym defs)
         (update-char-fold-table))
  :type 'boolean :group 'isearch)

;; Test by searching for these strings.
;; ("𝚎" "𝙚" "𝘦" "𝗲" "𝖾" "𝖊" "𝕖" "𝔢" "𝓮" "𝒆" "𝑒" "𝐞" "e" "㋎" "㋍" "ⓔ" "⒠"
;;  "ⅇ" "ℯ" "ₑ" "ẽ" "ẽ" "ẻ" "ẻ" "ẹ" "ẹ" "ḛ" "ḛ" "ḙ" "ḙ" "ᵉ" "ȩ" "ȩ" "ȇ" "ȇ"
;; "ȅ" "ȅ" "ě" "ě" "ę" "ę" "ė" "ė" "ĕ" "ĕ" "ē" "ē" "ë" "ë" "ê" "ê" "é" "é" "è" "è")


  reply	other threads:[~2015-09-10 21:46 UTC|newest]

Thread overview: 86+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-09-01 15:46 char equivalence classes in search - why not symmetric? Drew Adams
2015-09-01 15:52 ` Davis Herring
2015-09-01 16:51   ` Stefan Monnier
2015-09-01 17:51   ` Drew Adams
2015-09-01 18:40     ` Davis Herring
2015-09-01 19:09       ` Drew Adams
2015-09-01 22:45       ` Juri Linkov
2015-09-02  0:33         ` Drew Adams
2015-09-01 20:10     ` Stephen J. Turnbull
2015-09-01 16:16 ` Eli Zaretskii
     [not found]   ` <<38061f42-eaf1-47c6-b74d-f676ac952b18@default>
     [not found]     ` <<83r3miatvl.fsf@gnu.org>
     [not found]       ` <<21998.29683.916211.867479@a1i15.kph.uni-mainz.de>
     [not found]         ` <<9A972800-D8F0-4DA8-877E-07D5BDC2E1F9@gmail.com>
2015-09-01 17:50   ` Drew Adams
2015-09-01 18:15     ` Eli Zaretskii
2015-09-01 18:46       ` Drew Adams
2015-09-01 19:19         ` Eli Zaretskii
2015-09-01 20:15           ` Drew Adams
2015-09-08  5:36       ` Ulrich Mueller
2015-09-08  6:04         ` Jean-Christophe Helary
2015-09-08 13:31           ` Stephen J. Turnbull
2015-09-08 14:24             ` Drew Adams
2015-09-08 15:21               ` Stephen J. Turnbull
2015-09-08 16:58                 ` Drew Adams
2015-09-08 17:38                   ` Stephen J. Turnbull
2015-09-09 22:52                     ` Drew Adams
2015-09-10  3:12                       ` Drew Adams
2015-09-10 21:46                         ` Drew Adams [this message]
2015-09-08 20:15               ` Richard Stallman
2015-09-08 20:15               ` Richard Stallman
2015-09-08 21:25                 ` Drew Adams
2015-09-09 15:07                   ` Richard Stallman
2015-09-09 15:21                     ` Drew Adams
2015-09-10  2:03                       ` Richard Stallman
2015-09-10  3:23                         ` Drew Adams
2015-09-11 10:28                           ` Richard Stallman
2015-09-11 13:28                             ` Stefan Monnier
2015-09-11 16:33                               ` Drew Adams
2015-09-11 20:59                                 ` Juri Linkov
2015-09-11 23:11                                   ` Drew Adams
2015-09-12 15:28                               ` Richard Stallman
2015-09-11 16:31                             ` Drew Adams
2015-09-11 10:28                           ` Richard Stallman
2015-09-11 16:31                             ` Drew Adams
2015-09-12 15:29                               ` Richard Stallman
     [not found]             ` <<8cf269bc-69d8-4752-8506-de8d992512e1@default>
     [not found]               ` <<E1ZZPIS-0005rf-DJ@fencepost.gnu.org>
2015-09-08 21:46                 ` Drew Adams
     [not found]               ` <<E1ZZPIT-0005s6-ST@fencepost.gnu.org>
     [not found]                 ` <<da54a6cb-90eb-481d-aa20-acfad612e709@default>
     [not found]                   ` <<E1ZZgxz-0006X2-Bg@fencepost.gnu.org>
     [not found]                     ` <<cb107072-7f90-41fb-9aff-075d50eb65bb@default>
     [not found]                       ` <<E1ZZrCm-0001x4-9a@fencepost.gnu.org>
     [not found]                         ` <<4f3b1db3-d3d2-480f-8662-fbf7c74aa67f@default>
     [not found]                           ` <<E1ZaLZR-0002Bf-8q@fencepost.gnu.org>
     [not found]                             ` <<e77f8e7b-581f-436d-816a-c8daed734ff5@default>
     [not found]                               ` <<E1ZamkM-0005d4-RN@fencepost.gnu.org>
2015-09-12 15:59                                 ` Drew Adams
2015-09-08 13:39           ` Drew Adams
2015-09-08 21:19             ` Juri Linkov
2015-09-09 15:07               ` Richard Stallman
2015-09-08 15:47         ` Eli Zaretskii
2015-09-08 16:57           ` Drew Adams
2015-09-08 21:20           ` Juri Linkov
2015-09-09  2:42             ` Eli Zaretskii
2015-09-09 11:23               ` Artur Malabarba
2015-09-09 13:32                 ` Drew Adams
2015-09-09 15:12                 ` Richard Stallman
2015-09-11 20:50                   ` Juri Linkov
     [not found]               ` <<CAAdUY-JMQVsRFku8nwX8JcA9k6Y9sHWoVL6ZC60RHnjoj0cd+Q@mail.gmail.com>
     [not found]                 ` <<E1ZZh2a-0003u6-Fj@fencepost.gnu.org>
2015-09-09 15:22                   ` Drew Adams
2015-09-10  2:03                     ` Richard Stallman
2015-09-10  3:15                       ` Drew Adams
2015-09-10  6:57                         ` David Kastrup
2015-09-10 15:02                           ` Drew Adams
2015-09-10 15:50                         ` Richard Stallman
2015-09-08 20:09         ` Richard Stallman
2015-09-08 21:00           ` Drew Adams
2015-09-09 15:06             ` Richard Stallman
2015-09-08 21:47           ` Ulrich Mueller
2015-09-02 15:34   ` Richard Stallman
2015-09-02 15:56     ` Drew Adams
2015-09-02 16:05     ` Eli Zaretskii
2015-09-02 21:51       ` Jean-Christophe Helary
2015-09-02 22:15         ` Drew Adams
2015-09-03 15:37           ` Richard Stallman
2015-09-03  2:41         ` Eli Zaretskii
2015-09-03  3:08           ` Jean-Christophe Helary
2015-09-03  7:28             ` Artur Malabarba
2015-09-03 17:15               ` Drew Adams
2015-09-07 13:52                 ` Nix
2015-09-07 17:07                   ` Drew Adams
2015-09-07 23:23                     ` Nix
2015-09-08  2:17                   ` Richard Stallman
2015-09-03 14:33             ` Eli Zaretskii
2015-09-03 15:00         ` Stefan Monnier
2015-09-03 16:15           ` Drew Adams
2015-09-03 16:23             ` Eli Zaretskii
2015-09-03 16:46               ` Drew Adams
2015-09-02 16:10     ` Artur Malabarba
2015-09-03 19:49     ` Pip Cet
     [not found] <<2a7b9134-af2a-462d-af6c-d02bad60bbe8@default>

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cc7c4bb6-fb38-4d35-97d4-b4ee2f5661d8@default \
    --to=drew.adams@oracle.com \
    --cc=emacs-devel@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/emacs.git
	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.