Character group folding in searches

unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed

* Character group folding in searches
@ 2015-02-06 13:04 Artur Malabarba
  2015-02-06 14:32 ` Eli Zaretskii
  2015-02-07  0:07 ` Juri Linkov
  0 siblings, 2 replies; 25+ messages in thread
From: Artur Malabarba @ 2015-02-06 13:04 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 4679 bytes --]

This is a follow up on a previous discussion regarding Single quotes in Info.

I've been looking into ways of having the search functions fold
similar characters together. There are a few goals which I'm listing
above to facilitate comparision of possible approaches. Feel free to
mention other highly-important goals, but please don't go into
high-level abstractions (such as letting the user define groups),
these can always be done and are not relevant to this discussion.

1. Follow the `decomposition' char property. For instance, the
character "a" in the search string would match any one of  "aãáâ" (and
so on). This is easy to do, and one of the patches below already shows
how. Note that this won't handle symbols that are actually composed of
multiple characters.

2. Follow an intuitive sense of similarity which is not defined in the
unicode standard. For instance, an ascii single quote in the search
string should match any type of single quote (there are about a dozen
that I know of).

3. Ignore modifier (non-spacing) characters. Another way of writing
"á" is to write "a" followed by a special non-spacing accute. This
kind of thing (a symbol composed of multiple characters) is not
handled by item 1, so I'm listing as a separate point.

4. Perform the conversion two-ways. That is, item 1 should work even
if the search contained "á" instead of "a". Item 2 should match an
ascii quote if the search string contains a curly quote. This is
mostly useful when the user copies a fancy string from somewhere and
pastes it into the search field.

5. It should work for any searching, not just isearch.


Goals 1, 2, and 3 are the most important (in my opinion).
Goals 1 and 2 are achieved by all of the patches below, while the others vary.

-----------------------------------------------------------

Below, I'm attaching 3 patches, they each represent a different way of
achieving part of the above.

* group-folding-with-regexp-lisp.patch

This one takes each input character and either keeps it verbatim or
transform it into a regexp which matches the entire group that this
character represents. It is implemented in isearch.

+ It trivially handles goals 1, 2 and 3. Because regexps are quite
versatile, it is the only solution that handles item 3 (it allows each
character to match more than a single character).
+ Goal 4 can be achieved with a bit more work (the input just needs to
be normalized before turning it into a regexp).
- It is slower than the options below, but it should be fast enough for isearch.
- Goal 5 would take a lot more work. This character parsing would have
to be added to each of search functions (not to mention it might be
too slow for lisp-code searches).

(Note that the attached patch doesn't actually do item 1. That is NOT
a limitation, it can do item 1 quite trivially. I simply haven't done
it yet.)

* group-folding-with-case-table-lisp.patch

This patch is entirely in elisp. I've put it all inside `isearch.el'
for now, for the sake of simplicity, but it's not restricted to
isearch.

It creates a new case-table which performs group folding by borrowing
the case-folding machinery, so it is very fast. Then, group folding
can be achieved by running the search inside a `with-group-folding`
macro. There's also an example implementation which turns it on for
isearch by default.

+ It immediately satisfies items 1, 2, 4, and 5.
+ It is very fast.
- It has no simple way of achieving item 3.

(Note that the attached patch doesn't actually do item 2. That is NOT
a limitation, it can do item 2 quite trivially. I simply haven't done
it yet.)

* group-folding-with-case-table-C.patch

This patch defines a new char-table and uses it instead of
case_canon_table when the group-fold-search variable is non-nil.

This shares the advantages and disadvantages of the lisp patch above
but, in addition:
+ You don't need a `with-group-folding' macro, all you need is to (let
((group-fold-search t)) ...) around the search which is more in terms
with how case-folding works.
- If the user decides to set `group-fold-search' to t, this can break
existing code (a disadvantage that the lisp version above does not
have).
- It adds two extra fields to every buffer object (the boolean
variable and the char table).

(Note that compiling this last patch gives a crashing executable for
me. I'm just putting it here to showcase the option.)

---------------------

My question is:

Do any of these options seem good enough? Which would you all like to explore?
I like the second one best, but goal 3 is quite important.

[-- Attachment #2: group-folding-with-case-table-C.patch --]
[-- Type: text/x-patch, Size: 8983 bytes --]

From d49d03abdbb1bdb892a322ed6d9e25648edc3b56 Mon Sep 17 00:00:00 2001
From: Artur Malabarba <bruce.connor.am@gmail.com>
Date: Thu, 5 Feb 2015 22:50:52 -0200
Subject: [PATCH] hi

---
 src/buffer.c  | 18 ++++++++++++++++++
 src/buffer.h  |  9 +++++++++
 src/casetab.c |  7 +++++++
 src/editfns.c |  6 ++++--
 src/search.c  | 27 +++++++++++++++++----------
 5 files changed, 55 insertions(+), 12 deletions(-)

diff --git a/src/buffer.c b/src/buffer.c
index 67eda3e..7160850 100644
--- a/src/buffer.c
+++ b/src/buffer.c
@@ -182,6 +182,11 @@ bset_case_fold_search (struct buffer *b, Lisp_Object val)
   b->INTERNAL_FIELD (case_fold_search) = val;
 }
 static void
+bset_group_fold_search (struct buffer *b, Lisp_Object val)
+{
+  b->INTERNAL_FIELD (group_fold_search) = val;
+}
+static void
 bset_ctl_arrow (struct buffer *b, Lisp_Object val)
 {
   b->INTERNAL_FIELD (ctl_arrow) = val;
@@ -975,6 +980,7 @@ reset_buffer_local_variables (struct buffer *b, bool permanent_too)
   bset_upcase_table (b, XCHAR_TABLE (Vascii_downcase_table)->extras[0]);
   bset_case_canon_table (b, XCHAR_TABLE (Vascii_downcase_table)->extras[1]);
   bset_case_eqv_table (b, XCHAR_TABLE (Vascii_downcase_table)->extras[2]);
+  bset_group_canon_table (b, XCHAR_TABLE (Vascii_downcase_table)->extras[1]);
   bset_invisibility_spec (b, Qt);
 
   /* Reset all (or most) per-buffer variables to their defaults.  */
@@ -5053,6 +5059,7 @@ init_buffer_once (void)
   bset_upcase_table (&buffer_local_flags, make_number (0));
   bset_case_canon_table (&buffer_local_flags, make_number (0));
   bset_case_eqv_table (&buffer_local_flags, make_number (0));
+  bset_group_canon_table (&buffer_local_flags, make_number (0));
   bset_minor_modes (&buffer_local_flags, make_number (0));
   bset_width_table (&buffer_local_flags, make_number (0));
   bset_pt_marker (&buffer_local_flags, make_number (0));
@@ -5065,6 +5072,7 @@ init_buffer_once (void)
   XSETFASTINT (BVAR (&buffer_local_flags, abbrev_mode), idx); ++idx;
   XSETFASTINT (BVAR (&buffer_local_flags, overwrite_mode), idx); ++idx;
   XSETFASTINT (BVAR (&buffer_local_flags, case_fold_search), idx); ++idx;
+  XSETFASTINT (BVAR (&buffer_local_flags, group_fold_search), idx); ++idx;
   XSETFASTINT (BVAR (&buffer_local_flags, auto_fill_function), idx); ++idx;
   XSETFASTINT (BVAR (&buffer_local_flags, selective_display), idx); ++idx;
   XSETFASTINT (BVAR (&buffer_local_flags, selective_display_ellipses), idx); ++idx;
@@ -5143,6 +5151,7 @@ init_buffer_once (void)
   bset_abbrev_mode (&buffer_defaults, Qnil);
   bset_overwrite_mode (&buffer_defaults, Qnil);
   bset_case_fold_search (&buffer_defaults, Qt);
+  bset_group_fold_search (&buffer_defaults, Qnil);
   bset_auto_fill_function (&buffer_defaults, Qnil);
   bset_selective_display (&buffer_defaults, Qnil);
   bset_selective_display_ellipses (&buffer_defaults, Qt);
@@ -5486,6 +5495,11 @@ This is the same as (default-value 'tab-width).  */);
 			  doc: /* Default value of `case-fold-search' for buffers that don't override it.
 This is the same as (default-value 'case-fold-search).  */);
 
+  DEFVAR_BUFFER_DEFAULTS ("default-group-fold-search",
+			  group_fold_search,
+			  doc: /* Default value of `group-fold-search' for buffers that don't override it.
+This is the same as (default-value 'group-fold-search).  */);
+
   DEFVAR_BUFFER_DEFAULTS ("default-left-margin-width",
 			  left_margin_cols,
 			  doc: /* Default value of `left-margin-width' for buffers that don't override it.
@@ -5657,6 +5671,10 @@ Use the command `abbrev-mode' to change this variable.  */);
 		     Qnil,
 		     doc: /* Non-nil if searches and matches should ignore case.  */);
 
+  DEFVAR_PER_BUFFER ("group-fold-search", &BVAR (current_buffer, group_fold_search),
+		     Qnil,
+		     doc: /* Non-nil if searches and matches should ignore case.  */);
+
   DEFVAR_PER_BUFFER ("fill-column", &BVAR (current_buffer, fill_column),
 		     Qintegerp,
 		     doc: /* Column beyond which automatic line-wrapping should happen.
diff --git a/src/buffer.h b/src/buffer.h
index 81852ca..ff56a81 100644
--- a/src/buffer.h
+++ b/src/buffer.h
@@ -558,6 +558,7 @@ struct buffer
   /* tab-width is buffer-local so that redisplay can find it
      in buffers that are not current.  */
   Lisp_Object INTERNAL_FIELD (case_fold_search);
+  Lisp_Object INTERNAL_FIELD (group_fold_search);
   Lisp_Object INTERNAL_FIELD (tab_width);
   Lisp_Object INTERNAL_FIELD (fill_column);
   Lisp_Object INTERNAL_FIELD (left_margin);
@@ -578,6 +579,9 @@ struct buffer
   /* Char-table of equivalences for case-folding search.  */
   Lisp_Object INTERNAL_FIELD (case_eqv_table);
 
+  /* Char-table for conversion for group-folding search.  */
+  Lisp_Object INTERNAL_FIELD (group_canon_table);
+
   /* Non-nil means do not display continuation lines.  */
   Lisp_Object INTERNAL_FIELD (truncate_lines);
 
@@ -899,6 +903,11 @@ bset_case_eqv_table (struct buffer *b, Lisp_Object val)
   b->INTERNAL_FIELD (case_eqv_table) = val;
 }
 INLINE void
+bset_group_canon_table (struct buffer *b, Lisp_Object val)
+{
+  b->INTERNAL_FIELD (group_canon_table) = val;
+}
+INLINE void
 bset_directory (struct buffer *b, Lisp_Object val)
 {
   b->INTERNAL_FIELD (directory) = val;
diff --git a/src/casetab.c b/src/casetab.c
index b086abc..81a6476 100644
--- a/src/casetab.c
+++ b/src/casetab.c
@@ -63,6 +63,13 @@ check_case_table (Lisp_Object obj)
   return (obj);
 }
 
+DEFUN ("current-group-table", Fcurrent_group_table, Scurrent_group_table, 0, 0, 0,
+       doc: /* Return the group table of the current buffer.  */)
+  (void)
+{
+  return BVAR (current_buffer, group_canon_table);
+}
+
 DEFUN ("current-case-table", Fcurrent_case_table, Scurrent_case_table, 0, 0, 0,
        doc: /* Return the case table of the current buffer.  */)
   (void)
diff --git a/src/editfns.c b/src/editfns.c
index 7026ccc..9fe6bc6 100644
--- a/src/editfns.c
+++ b/src/editfns.c
@@ -2805,8 +2805,10 @@ determines whether case is significant or ignored.  */)
   register EMACS_INT begp1, endp1, begp2, endp2, temp;
   register struct buffer *bp1, *bp2;
   register Lisp_Object trt
-    = (!NILP (BVAR (current_buffer, case_fold_search))
-       ? BVAR (current_buffer, case_canon_table) : Qnil);
+    = (!NILP (BVAR (current_buffer, group_fold_search))
+       ? BVAR (current_buffer, group_canon_table) :
+       (!NILP (BVAR (current_buffer, case_fold_search))
+        ? BVAR (current_buffer, case_canon_table) : Qnil));
   ptrdiff_t chars = 0;
   ptrdiff_t i1, i2, i1_byte, i2_byte;
 
diff --git a/src/search.c b/src/search.c
index e961798..037f409 100644
--- a/src/search.c
+++ b/src/search.c
@@ -281,8 +281,10 @@ looking_at_1 (Lisp_Object string, bool posix)
   bufp = compile_pattern (string,
 			  (NILP (Vinhibit_changing_match_data)
 			   ? &search_regs : NULL),
-			  (!NILP (BVAR (current_buffer, case_fold_search))
-			   ? BVAR (current_buffer, case_canon_table) : Qnil),
+			  (!NILP (BVAR (current_buffer, group_fold_search))
+			   ? BVAR (current_buffer, group_canon_table) :
+                           (!NILP (BVAR (current_buffer, case_fold_search))
+                            ? BVAR (current_buffer, case_canon_table) : Qnil)),
 			  posix,
 			  !NILP (BVAR (current_buffer, enable_multibyte_characters)));
 
@@ -396,8 +398,10 @@ string_match_1 (Lisp_Object regexp, Lisp_Object string, Lisp_Object start,
   bufp = compile_pattern (regexp,
 			  (NILP (Vinhibit_changing_match_data)
 			   ? &search_regs : NULL),
-			  (!NILP (BVAR (current_buffer, case_fold_search))
-			   ? BVAR (current_buffer, case_canon_table) : Qnil),
+			  (!NILP (BVAR (current_buffer, group_fold_search))
+			   ? BVAR (current_buffer, group_canon_table) :
+                           (!NILP (BVAR (current_buffer, case_fold_search))
+                            ? BVAR (current_buffer, case_canon_table) : Qnil)),
 			  posix,
 			  STRING_MULTIBYTE (string));
   immediate_quit = 1;
@@ -1052,12 +1056,15 @@ search_command (Lisp_Object string, Lisp_Object bound, Lisp_Object noerror,
 			 BVAR (current_buffer, case_eqv_table));
 
   np = search_buffer (string, PT, PT_BYTE, lim, lim_byte, n, RE,
-		      (!NILP (BVAR (current_buffer, case_fold_search))
-		       ? BVAR (current_buffer, case_canon_table)
-		       : Qnil),
-		      (!NILP (BVAR (current_buffer, case_fold_search))
-		       ? BVAR (current_buffer, case_eqv_table)
-		       : Qnil),
+                      (!NILP (BVAR (current_buffer, group_fold_search))
+                       ? BVAR (current_buffer, group_canon_table) :
+                       (!NILP (BVAR (current_buffer, case_fold_search))
+                        ? BVAR (current_buffer, case_canon_table) : Qnil)),
+                      (!NILP (BVAR (current_buffer, group_fold_search))
+                       ? Qnil :
+                       (!NILP (BVAR (current_buffer, case_fold_search))
+                        ? BVAR (current_buffer, case_eqv_table)
+                        : Qnil)),
 		      posix);
   if (np <= 0)
     {
-- 
2.2.2


[-- Attachment #3: group-folding-with-regexp-lisp.patch --]
[-- Type: text/x-patch, Size: 3231 bytes --]

From f67ae7ed53e6a90cf4f97ac1bba9498b5d58e6dc Mon Sep 17 00:00:00 2001
From: Artur Malabarba <bruce.connor.am@gmail.com>
Date: Tue, 27 Jan 2015 14:08:01 -0200
Subject: [PATCH] (isearch-search-fun-default): Implement group folding in
 isearch.

Use `isearch-fold-groups', `isearch-groups-alist', and
`isearch--replace-groups-in-string'.
---
 lisp/isearch.el | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/lisp/isearch.el b/lisp/isearch.el
index 99ca73f..accfb72 100644
--- a/lisp/isearch.el
+++ b/lisp/isearch.el
@@ -58,6 +58,7 @@
 ;;; Code:
 
 (eval-when-compile (require 'cl-lib))
+(eval-when-compile (require 'subr-x))
 \f
 ;; Some additional options and constants.
 
@@ -272,6 +273,23 @@ Default value, nil, means edit the string instead."
   :version "23.1"
   :group 'isearch)
 
+(defcustom isearch-fold-groups t
+  "Whether regular isearch should do group folding.
+This means some characters will match entire groups of charactes,
+such as \" matching ”, for instance."
+  :type 'boolean
+  :group 'isearch
+  :version "25.1")
+
+(defvar isearch-groups-alist
+  ;; FIXME: Add all the latin accented letters like Ã.
+  '((?\" . "[\"＂“””„⹂〞‟‟❞❝❠“„〝〟🙷🙶🙸❮❯«»‹›󠀢]")
+    (?' . "[`'❟❛❜‘’’‚‛‛‚‘󠀢❮❯«»‹›]")
+    ;; `isearch-fold-groups' doesn't interact with
+    ;; `isearch-lax-whitespace' yet.  So we need to add this here.
+    (?\s . "[ 	\r\n]+"))
+  "Alist of groups to use when `isearch-fold-groups' is non-nil.")
+
 (defcustom isearch-lazy-highlight t
   "Controls the lazy-highlighting during incremental search.
 When non-nil, all text in the buffer matching the current search
@@ -2565,6 +2583,18 @@ search for the first occurrence of STRING or its translation.")
 Can be changed via `isearch-search-fun-function' for special needs."
   (funcall (or isearch-search-fun-function 'isearch-search-fun-default)))
 
+(defun isearch--replace-groups-in-string (string)
+  "Return a group-folded regexp version of STRING.
+Any character that has an entry in `isearch-groups-alist' is
+replaced with the cdr of that entry (which should be a regexp).
+Other characters are `regexp-quote'd."
+  (apply #'concat
+    (mapcar (lambda (c)
+              (if-let ((entry (assq c isearch-groups-alist)))
+                  (cdr entry)
+                (regexp-quote (string c))))
+      string)))
+
 (defun isearch-search-fun-default ()
   "Return default functions to use for the search."
   (cond
@@ -2591,6 +2621,13 @@ Can be changed via `isearch-search-fun-function' for special needs."
       're-search-backward-lax-whitespace))
    (isearch-regexp
     (if isearch-forward 're-search-forward 're-search-backward))
+   ;; `isearch-regexp' is essentially a superset of
+   ;; `isearch-fold-groups'.  So fold-groups comes after it.
+   (isearch-fold-groups
+    (lambda (string &optional bound noerror count)
+      (funcall (if isearch-forward #'re-search-forward #'re-search-backward)
+        (isearch--replace-groups-in-string string)
+        bound noerror count)))
    ((and isearch-lax-whitespace search-whitespace-regexp)
     (if isearch-forward
 	'search-forward-lax-whitespace
-- 
2.2.2


[-- Attachment #4: group-folding-with-case-table-lisp.patch --]
[-- Type: text/x-patch, Size: 2549 bytes --]

From 8f4be27dca714b168414171bde3eeee9fefc44e9 Mon Sep 17 00:00:00 2001
From: Artur Malabarba <bruce.connor.am@gmail.com>
Date: Tue, 27 Jan 2015 14:08:01 -0200
Subject: [PATCH] (isearch-search-fun-default): Implement group folding in
 isearch.

(isearch-fold-groups group-fold-table): New variables.

When `isearch-fold-groups' is non-nil `group-fold-table' is used as the
case table.
---
 lisp/isearch.el | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/lisp/isearch.el b/lisp/isearch.el
index 99ca73f..7d568dd 100644
--- a/lisp/isearch.el
+++ b/lisp/isearch.el
@@ -272,6 +272,38 @@ Default value, nil, means edit the string instead."
   :version "23.1"
   :group 'isearch)
 
+(defcustom isearch-fold-groups t
+  "Whether regular isearch should do group folding.
+This means some characters will match entire groups of charactes,
+such as \" matching ”, for instance."
+  :type 'boolean
+  :group 'isearch
+  :version "25.1")
+
+(defvar group-fold-table
+  (eval-when-compile
+    (let ((table (make-char-table 'case-table))
+          (eq (make-char-table 'equiv)))
+      (require 'subr-x)
+      ;; Build the group table.
+      (dotimes (i (length eq))
+        (when-let ((d (get-char-code-property i 'decomposition))
+                   (k (car-safe d)))
+          (unless (eq i k)
+            (aset eq i (if (characterp k) k (cadr d))))))
+      ;; Put it in the right place.
+      (set-char-table-extra-slot table 1 eq)
+      table))
+  "Used for folding characters of the same group during search.")
+
+(defmacro with-group-folding (&rest body)
+  "Execute BODY with character-group folding turned on.
+This sets `group-fold-table' as the case-table during the
+execution of BODY."
+  `(let ((case-fold-search t))
+     (with-case-table group-fold-table
+       ,@body)))
+
 (defcustom isearch-lazy-highlight t
   "Controls the lazy-highlighting during incremental search.
 When non-nil, all text in the buffer matching the current search
@@ -2568,6 +2600,12 @@ Can be changed via `isearch-search-fun-function' for special needs."
 (defun isearch-search-fun-default ()
   "Return default functions to use for the search."
   (cond
+   (isearch-fold-groups
+    (lambda (&rest args)
+      (let* ((isearch-fold-groups nil)
+             (function (isearch-search-fun-default)))
+        (with-group-folding
+         (apply function args)))))
    (isearch-word
     (lambda (string &optional bound noerror count)
       ;; Use lax versions to not fail at the end of the word while
-- 
2.2.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: Character group folding in searches
  2015-02-06 13:04 Character group folding in searches Artur Malabarba
@ 2015-02-06 14:32 ` Eli Zaretskii
  2015-02-06 16:18   ` Artur Malabarba
  2015-02-06 18:03   ` Stefan Monnier
  2015-02-07  0:07 ` Juri Linkov
  1 sibling, 2 replies; 25+ messages in thread
From: Eli Zaretskii @ 2015-02-06 14:32 UTC (permalink / raw)
  To: bruce.connor.am; +Cc: emacs-devel

> Date: Fri, 6 Feb 2015 11:04:03 -0200
> From: Artur Malabarba <bruce.connor.am@gmail.com>
> 
> 1. Follow the `decomposition' char property. For instance, the
> character "a" in the search string would match any one of  "aãáâ" (and
> so on). This is easy to do, and one of the patches below already shows
> how. Note that this won't handle symbols that are actually composed of
> multiple characters.
> 
> 2. Follow an intuitive sense of similarity which is not defined in the
> unicode standard. For instance, an ascii single quote in the search
> string should match any type of single quote (there are about a dozen
> that I know of).
> 
> 3. Ignore modifier (non-spacing) characters. Another way of writing
> "á" is to write "a" followed by a special non-spacing accute. This
> kind of thing (a symbol composed of multiple characters) is not
> handled by item 1, so I'm listing as a separate point.
> 
> 4. Perform the conversion two-ways. That is, item 1 should work even
> if the search contained "á" instead of "a". Item 2 should match an
> ascii quote if the search string contains a curly quote. This is
> mostly useful when the user copies a fancy string from somewhere and
> pastes it into the search field.
> 
> 5. It should work for any searching, not just isearch.

The full set of "folding" transformations is described in the Unicode
technical report UTR #30.  It was withdrawn, but its last draft is
still enlightening.

I think we should support some subset of what's described there.

The way to do it IMO is to generate a set of char-tables where each
character is mapped to its folded variant, one char-table for each
subset of folding.  A character whose folding is not a single
character should map to a vector or a string of characters (not sure
which one is best, we should choose the one that lends itself to the
most efficient use).

I think the best approach is to modify search.c to be able to handle
folding that produces more than a single character.  I think we will
also need search.c to support several alternative foldings for the
same search operation.  Making these changes would be relatively easy,
I think, and once it's done, all the rest will "just work", because
the basic search algorithms don't need to be touched.

As a final general remark, I don't think I like the "group" part of
the terminology.  Why not use "character-folding" instead, it's what
this is called out there.

> * group-folding-with-regexp-lisp.patch
> 
> This one takes each input character and either keeps it verbatim or
> transform it into a regexp which matches the entire group that this
> character represents. It is implemented in isearch.
> 
> + It trivially handles goals 1, 2 and 3. Because regexps are quite
> versatile, it is the only solution that handles item 3 (it allows each
> character to match more than a single character).

But the downside is that we will have to construct such regexps for
all the foldings of all the characters we want to support.  That will
be quite a large database, and a lot of work to construct it.

> * group-folding-with-case-table-lisp.patch
> 
> This patch is entirely in elisp. I've put it all inside `isearch.el'
> for now, for the sake of simplicity, but it's not restricted to
> isearch.
> 
> It creates a new case-table which performs group folding by borrowing
> the case-folding machinery, so it is very fast. Then, group folding
> can be achieved by running the search inside a `with-group-folding`
> macro. There's also an example implementation which turns it on for
> isearch by default.
> 
> + It immediately satisfies items 1, 2, 4, and 5.
> + It is very fast.
> - It has no simple way of achieving item 3.

It could use a separate case-table for item 3, couldn't it?

I think we will need separate tables for different foldings anyway,
because each use case calls for some specific folding.  In isearch,
the user will have to specify which foldings she wants to be in
effect.

> - If the user decides to set `group-fold-search' to t, this can break
> existing code (a disadvantage that the lisp version above does not
> have).
> - It adds two extra fields to every buffer object (the boolean
> variable and the char table).

I'm not sure we need to add these tables to the buffer object.  The
experience with using case-tables this way is not encouraging, because
in several important cases it is not at all clear which buffer is
relevant to the folding-match operation one needs to do.

> Do any of these options seem good enough? Which would you all like to explore?
> I like the second one best, but goal 3 is quite important.

I think we must lift the limitation of single-character folding
result, which means changes on the C level are inevitable.

I also think we need to talk a bit more about which kinds of folding
we would like to support.

Thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Character group folding in searches
  2015-02-06 14:32 ` Eli Zaretskii
@ 2015-02-06 16:18   ` Artur Malabarba
  2015-02-06 16:44     ` Eli Zaretskii
  2015-02-06 18:03   ` Stefan Monnier
  1 sibling, 1 reply; 25+ messages in thread
From: Artur Malabarba @ 2015-02-06 16:18 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

> The full set of "folding" transformations is described in the Unicode
> technical report UTR #30.  It was withdrawn, but its last draft is
> still enlightening.
>
> I think we should support some subset of what's described there.
>
> The way to do it IMO is to generate a set of char-tables where each
> character is mapped to its folded variant,
> one char-table for each subset of folding.

Although the attached patches only define one table for now, they all
support multiple tables (even the one that's not based on char-tables)
so the sky is the limit. For this reason, this detail probably won't
be an obstacle so we can decide later which subset of foldings we want
to provide by default.

> A character whose folding is not a single
> character should map to a vector or a string of characters (not sure
> which one is best, we should choose the one that lends itself to the
> most efficient use).
> I think the best approach is to modify search.c to be able to handle
> folding that produces more than a single character.  I think we will
> also need search.c to support several alternative foldings for the
> same search operation.  Making these changes would be relatively easy,

It's certainly doable, but I'm not sure it's easy. The `search_buffer'
function seems pretty focused on handling 1 char at time. Having a
single char suddenly turn into two might require significant changes
to the code flow.

Of course, if someone takes that up that's great!

>> * group-folding-with-regexp-lisp.patch
>>
>> This one takes each input character and either keeps it verbatim or
>> transform it into a regexp which matches the entire group that this
>> character represents. It is implemented in isearch.
>>
>> + It trivially handles goals 1, 2 and 3. Because regexps are quite
>> versatile, it is the only solution that handles item 3 (it allows each
>> character to match more than a single character).
>
> But the downside is that we will have to construct such regexps for
> all the foldings of all the characters we want to support.  That will
> be quite a large database, and a lot of work to construct it.

It's only a tiny bit more work than generating case-tables that are
also under discussion. Any information available to construct the case
tables is also available for building the regexps.


>> * group-folding-with-case-table-lisp.patch
>>
>> This patch is entirely in elisp. I've put it all inside `isearch.el'
>> for now, for the sake of simplicity, but it's not restricted to
>> isearch.
>>
>> It creates a new case-table which performs group folding by borrowing
>> the case-folding machinery, so it is very fast. Then, group folding
>> can be achieved by running the search inside a `with-group-folding`
>> macro. There's also an example implementation which turns it on for
>> isearch by default.
>>
>> + It immediately satisfies items 1, 2, 4, and 5.
>> + It is very fast.
>> - It has no simple way of achieving item 3.
>
> It could use a separate case-table for item 3, couldn't it?

Not that I can tell. You either need to tell emacs to either (1)
ignore the accute entirely, or (2) have the "a´" pair of characters
fold into "a". Case tables just can't do this right now AFAIK.

> I think we will need separate tables for different foldings anyway,
> because each use case calls for some specific folding.  In isearch,
> the user will have to specify which foldings she wants to be in
> effect.

Yes, multiple tables are fine and will be done regardless of the approach taken.

>> - If the user decides to set `group-fold-search' to t, this can break
>> existing code (a disadvantage that the lisp version above does not
>> have).
>> - It adds two extra fields to every buffer object (the boolean
>> variable and the char table).
>
> I'm not sure we need to add these tables to the buffer object.  The
> experience with using case-tables this way is not encouraging, because
> in several important cases it is not at all clear which buffer is
> relevant to the folding-match operation one needs to do.

Yes, I don't like this either. I was threading unknown waters here, so
I just tried to stays as close as possible to what case-fold-search
does.

>> Do any of these options seem good enough? Which would you all like to explore?
>> I like the second one best, but goal 3 is quite important.
>
> I think we must lift the limitation of single-character folding
> result, which means changes on the C level are inevitable.

I agree this is important. But if no one takes it up I'd rather have
single-character folding than none at all.

> I also think we need to talk a bit more about which kinds of folding
> we would like to support.

What do you mean? Which folding subsets to provide by default?



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Character group folding in searches
  2015-02-06 16:18   ` Artur Malabarba
@ 2015-02-06 16:44     ` Eli Zaretskii
  0 siblings, 0 replies; 25+ messages in thread
From: Eli Zaretskii @ 2015-02-06 16:44 UTC (permalink / raw)
  To: bruce.connor.am; +Cc: emacs-devel

> Date: Fri, 6 Feb 2015 14:18:39 -0200
> From: Artur Malabarba <bruce.connor.am@gmail.com>
> Cc: emacs-devel <emacs-devel@gnu.org>
> 
> > I also think we need to talk a bit more about which kinds of folding
> > we would like to support.
> 
> What do you mean? Which folding subsets to provide by default?

Yes, for starters.



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Character group folding in searches
  2015-02-06 14:32 ` Eli Zaretskii
  2015-02-06 16:18   ` Artur Malabarba
@ 2015-02-06 18:03   ` Stefan Monnier
  2015-02-06 19:03     ` Eli Zaretskii
  1 sibling, 1 reply; 25+ messages in thread
From: Stefan Monnier @ 2015-02-06 18:03 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: bruce.connor.am, emacs-devel

> subset of folding.  A character whose folding is not a single
> character should map to a vector or a string of characters (not sure
> which one is best, we should choose the one that lends itself to the
> most efficient use).

So "folding" can turn a single char into a sequence of chars?  Why not
do it the other way?


        Stefan



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Character group folding in searches
  2015-02-06 18:03   ` Stefan Monnier
@ 2015-02-06 19:03     ` Eli Zaretskii
  2015-02-06 19:27       ` Artur Malabarba
  2015-02-06 19:41       ` Stefan Monnier
  0 siblings, 2 replies; 25+ messages in thread
From: Eli Zaretskii @ 2015-02-06 19:03 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: bruce.connor.am, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: bruce.connor.am@gmail.com,  emacs-devel@gnu.org
> Date: Fri, 06 Feb 2015 13:03:12 -0500
> 
> > subset of folding.  A character whose folding is not a single
> > character should map to a vector or a string of characters (not sure
> > which one is best, we should choose the one that lends itself to the
> > most efficient use).
> 
> So "folding" can turn a single char into a sequence of chars?  Why not
> do it the other way?

Because the other way you cannot use char-tables.  And because
matching "a" and "á" will be hard the other way.




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Character group folding in searches
  2015-02-06 19:03     ` Eli Zaretskii
@ 2015-02-06 19:27       ` Artur Malabarba
  2015-02-06 21:38         ` Eli Zaretskii
  2015-02-06 19:41       ` Stefan Monnier
  1 sibling, 1 reply; 25+ messages in thread
From: Artur Malabarba @ 2015-02-06 19:27 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Stefan Monnier, emacs-devel

> Because the other way you cannot use char-tables.  And because
> matching "a" and "á" will be hard the other way.

Maybe I'm missing something, but if you have "á" expand to "a´", it
won't match "a", will it?



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Character group folding in searches
  2015-02-06 19:03     ` Eli Zaretskii
  2015-02-06 19:27       ` Artur Malabarba
@ 2015-02-06 19:41       ` Stefan Monnier
  2015-02-06 21:43         ` Eli Zaretskii
  1 sibling, 1 reply; 25+ messages in thread
From: Stefan Monnier @ 2015-02-06 19:41 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: bruce.connor.am, emacs-devel

>> > subset of folding.  A character whose folding is not a single
>> > character should map to a vector or a string of characters (not sure
>> > which one is best, we should choose the one that lends itself to the
>> > most efficient use).
>> So "folding" can turn a single char into a sequence of chars?  Why not
>> do it the other way?
> Because the other way you cannot use char-tables.

I don't see how turning single chars into char-sequences can help us
handle multi-char sequences (unless maybe you restrict it so that there
can only ever be one such sequence per equivalence class?).

> And because matching "a" and "á" will be hard the other way.

I don't see what's hard about it: map them both to the same char (e.g. ?a).


        Stefan



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Character group folding in searches
  2015-02-06 19:27       ` Artur Malabarba
@ 2015-02-06 21:38         ` Eli Zaretskii
  2015-02-06 22:08           ` Artur Malabarba
  0 siblings, 1 reply; 25+ messages in thread
From: Eli Zaretskii @ 2015-02-06 21:38 UTC (permalink / raw)
  To: bruce.connor.am; +Cc: monnier, emacs-devel

> Date: Fri, 6 Feb 2015 19:27:03 +0000
> From: Artur Malabarba <bruce.connor.am@gmail.com>
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel <emacs-devel@gnu.org>
> 
> > Because the other way you cannot use char-tables.  And because
> > matching "a" and "á" will be hard the other way.
> 
> Maybe I'm missing something, but if you have "á" expand to "a´", it
> won't match "a", will it?

It will, if you only pay attention to the base character.




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Character group folding in searches
  2015-02-06 19:41       ` Stefan Monnier
@ 2015-02-06 21:43         ` Eli Zaretskii
  2015-02-07  0:05           ` Stefan Monnier
  0 siblings, 1 reply; 25+ messages in thread
From: Eli Zaretskii @ 2015-02-06 21:43 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: bruce.connor.am, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: bruce.connor.am@gmail.com,  emacs-devel@gnu.org
> Date: Fri, 06 Feb 2015 14:41:14 -0500
> 
> I don't see how turning single chars into char-sequences can help us
> handle multi-char sequences (unless maybe you restrict it so that there
> can only ever be one such sequence per equivalence class?).

The decomposition sequence is unique, yes.

> > And because matching "a" and "á" will be hard the other way.
> 
> I don't see what's hard about it: map them both to the same char (e.g. ?a).

Then you won't be able to match "á" with "á".




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Character group folding in searches
  2015-02-06 21:38         ` Eli Zaretskii
@ 2015-02-06 22:08           ` Artur Malabarba
  2015-02-07  8:38             ` Eli Zaretskii
  0 siblings, 1 reply; 25+ messages in thread
From: Artur Malabarba @ 2015-02-06 22:08 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Stefan Monnier, emacs-devel

>> > Because the other way you cannot use char-tables.  And because
>> > matching "a" and "á" will be hard the other way.
>>
>> Maybe I'm missing something, but if you have "á" expand to "a´", it
>> won't match "a", will it?
>
> It will, if you only pay attention to the base character.

If you have the possibility of only paying attention to the base
character (if the machinery is in place) then there's no reason to
fold "á" into "a´" (folding 1 char into many).

Just fold everything into "a". Then (by only paying attention to the
base character) "á" and "á" will match, because "á" folds into "a"
which is the base character of "á".

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Character group folding in searches
  2015-02-06 21:43         ` Eli Zaretskii
@ 2015-02-07  0:05           ` Stefan Monnier
  2015-02-07  8:47             ` Eli Zaretskii
  0 siblings, 1 reply; 25+ messages in thread
From: Stefan Monnier @ 2015-02-07  0:05 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: bruce.connor.am, emacs-devel

>> I don't see how turning single chars into char-sequences can help us
>> handle multi-char sequences (unless maybe you restrict it so that there
>> can only ever be one such sequence per equivalence class?).
> The decomposition sequence is unique, yes.

Elsewhere:
> It will, if you only pay attention to the base character.

Hmm... could you characterize the kinds of equivalence classes you
intend to handle?

Could it handle for example an equivalence class which includes
→, ⇒, ->, and => ?


        Stefan



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Character group folding in searches
  2015-02-06 13:04 Character group folding in searches Artur Malabarba
  2015-02-06 14:32 ` Eli Zaretskii
@ 2015-02-07  0:07 ` Juri Linkov
  1 sibling, 0 replies; 25+ messages in thread
From: Juri Linkov @ 2015-02-07  0:07 UTC (permalink / raw)
  To: Artur Malabarba; +Cc: emacs-devel

> My question is:
>
> Do any of these options seem good enough? Which would you all like to explore?

This feature, as I see it, has several levels of complexity:

* 1-to-1 char-folding

  ?a <=> ?á

  This is already supported by char-tables, so there is no problem.

* 1-to-1 char-folding in combination with case-folding

  ?a <=> ?Á   (in this example one of them is in lower case
               and another in upper case with acute)

  I'm not sure how your patch handles this case.  We have to consult the
  information about case-folding from the case-table.  Otherwise, we would
  need two new tables instead of one: where character mappings are
  with and without case-folding.  In any case we have to take care of
  the correct interaction with case-fold-search.

* 1-to-1 char-folding plus a combining character

  ?a <=> "á"

  The simplest solution is just to ignore all combining characters in search.
  This should be easy to implement in the search engine by introducing
  a new list of ignorable characters to skip during the search.

* multi-character translation such as ligatures, etc.

  ?ﬄ <=> "ffl"

  This is the hardest case.  Maybe the existing translation tables
  from ucs-normalize.el could help.  Then configuring would be like

  (set-char-table-extra-slot case-table 3 (get 'ucs-normalize-nfd-table 'translation-table))

  But this requires a significant modification of the search engine to use
  the same logic in the search as is used in `translate-region-internal'
  to support multi-character translation in the search.

  Also it might require adding a new mode such as "lax-decomposition"
  that like lax-word mode will match partially, e.g. "f" will match "ﬄ".

  Or maybe better to use some external libraries like
  http://userguide.icu-project.org/collation/icu-string-search-service

I agree with your attempts to have something instead of having nothing
(as in an all-or-nothing attitude).  So to me it seems that the first
3 items would comprise the useful minimum, and the hardest last case
could be implemented afterwards.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Character group folding in searches
  2015-02-06 22:08           ` Artur Malabarba
@ 2015-02-07  8:38             ` Eli Zaretskii
  0 siblings, 0 replies; 25+ messages in thread
From: Eli Zaretskii @ 2015-02-07  8:38 UTC (permalink / raw)
  To: bruce.connor.am; +Cc: monnier, emacs-devel

> Date: Fri, 6 Feb 2015 22:08:19 +0000
> From: Artur Malabarba <bruce.connor.am@gmail.com>
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel <emacs-devel@gnu.org>
> 
> >> > Because the other way you cannot use char-tables.  And because
> >> > matching "a" and "á" will be hard the other way.
> >>
> >> Maybe I'm missing something, but if you have "á" expand to "a´", it
> >> won't match "a", will it?
> >
> > It will, if you only pay attention to the base character.
> 
> If you have the possibility of only paying attention to the base
> character (if the machinery is in place) then there's no reason to
> fold "á" into "a´" (folding 1 char into many).
> 
> Just fold everything into "a". Then (by only paying attention to the
> base character) "á" and "á" will match, because "á" folds into "a"
> which is the base character of "á".

But we need both capabilities, since whether or not a match of the
base character is enough depends on what the caller/user wants.
Folding everything into the base character supports only part of those
features.

As the simplest example, how can you have "á" and "a´" match, but "á"
and "a¨" fail to match, if you _only_ look at the base character?
(Btw, using ´ and ¨ here is incorrect, the correct characters are
their combining variants, u+0301 and u+0308; I left the ones you used
just for clarity, to prevent Emacs from composing a and the following
combining character.)

And then there are more complex examples, like "q̣̇" that should match
"q̣̇" (because the ordering of combining marks doesn't matter), etc.

What this tells to me is that we do need to fold "á" into "a´", and
then use a comparison function that pays attention to the "folding
options" specified by the caller, to decide which parts of the folded
sequence to ignore, and also how to compare the non-ignored parts
(e.g., with some options the order of non-base characters should not
matter).

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Character group folding in searches
  2015-02-07  0:05           ` Stefan Monnier
@ 2015-02-07  8:47             ` Eli Zaretskii
  2015-02-07 15:02               ` Stefan Monnier
  0 siblings, 1 reply; 25+ messages in thread
From: Eli Zaretskii @ 2015-02-07  8:47 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: bruce.connor.am, emacs-devel

> From: Stefan Monnier <monnier@IRO.UMontreal.CA>
> Cc: bruce.connor.am@gmail.com, emacs-devel@gnu.org
> Date: Fri, 06 Feb 2015 19:05:28 -0500
> 
> >> I don't see how turning single chars into char-sequences can help us
> >> handle multi-char sequences (unless maybe you restrict it so that there
> >> can only ever be one such sequence per equivalence class?).
> > The decomposition sequence is unique, yes.
> 
> Elsewhere:
> > It will, if you only pay attention to the base character.
> 
> Hmm... could you characterize the kinds of equivalence classes you
> intend to handle?

That was one part of what I suggested to discuss.  We don't yet have a
list we decided upon.  However, it is quite clear that comparing only
the base characters is one feature we would want, for the same reasons
we wanted to ignore diacriticals in string-collate-equalp.

> Could it handle for example an equivalence class which includes
> →, ⇒, ->, and => ?

These are our application-specific equivalences, they are not in
Unicode.  So we will need to have a list of equivalences in addition
to Unicode, that we will want to add to the "folding" char-table,
either permanently or as buffer-local customizations.

But once the decomposition is in the table, processing it should be
the same as with others, although it's possible we won't allow only
base-character matches for equivalences such as the ones above.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Character group folding in searches
  2015-02-07  8:47             ` Eli Zaretskii
@ 2015-02-07 15:02               ` Stefan Monnier
  2015-02-07 15:31                 ` Eli Zaretskii
  0 siblings, 1 reply; 25+ messages in thread
From: Stefan Monnier @ 2015-02-07 15:02 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: bruce.connor.am, emacs-devel

>> Could it handle for example an equivalence class which includes
>> →, ⇒, ->, and => ?
> These are our application-specific equivalences, they are not in
> Unicode.

I know.

> So we will need to have a list of equivalences in addition
> to Unicode, that we will want to add to the "folding" char-table,
> either permanently or as buffer-local customizations.

To me the simplest option is to have a DFA which returns an integer
(this integer being "the equivalence class number", and which will
usually be one of the characters in the equivalence class).

Each DFA node could be a char-table.  So if all equivalence classes are
made up of single-chars, the DFA collapses is just a plain-old
char-table mapping chars to the canonical element of their
equivalence classes.  For 2-char elements, we'll arrange for the
entry for the first char (in the main char-table) to be not an integer
but another char-table.  Being a DFA, this could easily handle complex
elements (matching arbitrary regular expressions), tho whether we'd make
much use of this particular feature is not very important.

Since some of the nodes in the DFA would likely only handle a very few
chars specially, we could later improve the representation so that those
nodes don't use up a whole char-table.

        Stefan

PS: And this same kind of "char-table extended into a DFA" could be
useful for syntax-tables in order to provide much more flexible support
for multi-character comment markers or "paren-like nested elements".

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Character group folding in searches
  2015-02-07 15:02               ` Stefan Monnier
@ 2015-02-07 15:31                 ` Eli Zaretskii
  2015-02-08 14:03                   ` Stefan Monnier
  0 siblings, 1 reply; 25+ messages in thread
From: Eli Zaretskii @ 2015-02-07 15:31 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: bruce.connor.am, emacs-devel

> From: Stefan Monnier <monnier@IRO.UMontreal.CA>
> Cc: bruce.connor.am@gmail.com, emacs-devel@gnu.org
> Date: Sat, 07 Feb 2015 10:02:52 -0500
> 
> To me the simplest option is to have a DFA which returns an integer
> (this integer being "the equivalence class number", and which will
> usually be one of the characters in the equivalence class).
> 
> Each DFA node could be a char-table.  So if all equivalence classes are
> made up of single-chars, the DFA collapses is just a plain-old
> char-table mapping chars to the canonical element of their
> equivalence classes.  For 2-char elements, we'll arrange for the
> entry for the first char (in the main char-table) to be not an integer
> but another char-table.  Being a DFA, this could easily handle complex
> elements (matching arbitrary regular expressions), tho whether we'd make
> much use of this particular feature is not very important.

I'm sorry, I don't understand how this will solve the use-cases
brought up in this thread.  Can you explain?

The use-cases I have in mind are:

  . exact match -- only exactly the same codepoints match

  . base-character match -- this ignores any combining marks,
    diacriticals, etc.

  . matching ligatures, such as ﬃ and ffi

  . ignoring punctuation, like string-collate-equalp does,
    i.e. "foobar" will match "foo.bar"

  . ignoring isolated zero-width or non-combining marks and
    directional controls

I understand very well how these can be handled by several different
char-tables, but you seem to say that a single char-table can do all
this, and I don't see how.

Also, what does DFA have to do with all this?

> Since some of the nodes in the DFA would likely only handle a very few
> chars specially, we could later improve the representation so that those
> nodes don't use up a whole char-table.

Now I'm completely confused: char-tables don't need this optimization,
as you well know: they already are space-efficient for storing
characters that map to the table's default value.  So I probably
misunderstand your whole idea, if it does need such an optimization.

> PS: And this same kind of "char-table extended into a DFA" could be
> useful for syntax-tables in order to provide much more flexible support
> for multi-character comment markers or "paren-like nested elements".

If that's your itch to scratch, I'm impatiently waiting for patches ;-)




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Character group folding in searches
  2015-02-07 15:31                 ` Eli Zaretskii
@ 2015-02-08 14:03                   ` Stefan Monnier
  2015-02-08 19:12                     ` Eli Zaretskii
  0 siblings, 1 reply; 25+ messages in thread
From: Stefan Monnier @ 2015-02-08 14:03 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: bruce.connor.am, emacs-devel

> I'm sorry, I don't understand how this will solve the use-cases
> brought up in this thread.  Can you explain?

Every equivalence class selected by such a DFA can match any set of
strings that can be described by a regular expression, so it should be
more than sufficiently powerful.

>   . exact match -- only exactly the same codepoints match

The DFA is trivial, matches any (and only) one-char sequences and
returns the char.

>   . base-character match -- this ignores any combining marks,
>     diacriticals, etc.

Admittedly, less trivial since we have to remember the base char after
matching it, while skipping subsequent combining marks and diacriticals.

>   . matching ligatures, such as ﬃ and ffi

Straightforward.

>   . ignoring punctuation, like string-collate-equalp does,
>     i.e. "foobar" will match "foo.bar"

Easy: the DFA will simply loop back when it sees a ".".

>   . ignoring isolated zero-width or non-combining marks and
>     directional controls

Same.

> I understand very well how these can be handled by several different
> char-tables, but you seem to say that a single char-table can do all
> this, and I don't see how.

Not sure what you mean by "single char-table" or why you think I said
something about single-vs-multiple char-tables.

A first implementation of DFAs could use internally char-tables (where
each node of the DFA is a char-table) but I think it's something
entirely different from what you mean by "different char-tables" or
"single char-table", since you'd choose one DFA (which may have any
number of char-tables inside).

> Now I'm completely confused: char-tables don't need this optimization,
> as you well know: they already are space-efficient for storing
> characters that map to the table's default value.  So I probably
> misunderstand your whole idea, if it does need such an optimization.

A DFA can have hundreds of nodes (hence hundreds of char-tables if we
use char-tables for that), most of which map one or two chars to
a special value while all others are mapped to "the default", so there
can be significant gains from using a more specialized representation.

>> PS: And this same kind of "char-table extended into a DFA" could be
>> useful for syntax-tables in order to provide much more flexible support
>> for multi-character comment markers or "paren-like nested elements".
> If that's your itch to scratch, I'm impatiently waiting for patches ;-)

It's been in the back of my mind for many years.

        Stefan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Character group folding in searches
  2015-02-08 14:03                   ` Stefan Monnier
@ 2015-02-08 19:12                     ` Eli Zaretskii
  2015-02-09  3:03                       ` Stefan Monnier
  0 siblings, 1 reply; 25+ messages in thread
From: Eli Zaretskii @ 2015-02-08 19:12 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: bruce.connor.am, emacs-devel

> From: Stefan Monnier <monnier@IRO.UMontreal.CA>
> Cc: bruce.connor.am@gmail.com, emacs-devel@gnu.org
> Date: Sun, 08 Feb 2015 09:03:23 -0500
> 
> > I'm sorry, I don't understand how this will solve the use-cases
> > brought up in this thread.  Can you explain?
> 
> Every equivalence class selected by such a DFA can match any set of
> strings that can be described by a regular expression, so it should be
> more than sufficiently powerful.

Who and how will create such a DFA?  (Or is it multiple DFAs?)
Are you thinking about DFAs created by compiling regular expressions,
or about some new infrastructure we don't yet have?

If the DFA will be the result of compiling regexps, we need quite a
few categories we don't yet have, I think, e.g., to express
diacriticals.  (The current set of categories is just a hodge-podge of
ad-hoc stuff that was needed by some feature at some point.)  We will
also need to decompose characters (NFD and NFKD at least).  That is,
if I at all understand what you have in mind.

> A first implementation of DFAs could use internally char-tables (where
> each node of the DFA is a char-table) but I think it's something
> entirely different from what you mean by "different char-tables" or
> "single char-table", since you'd choose one DFA (which may have any
> number of char-tables inside).

Char-tables are efficient, and at least for decomposition they seem to
be the perfect vehicle.  DFAs that come out of arbitrary regexps,
OTOH, can sometimes be very inefficient.  That's why I tend to think
about this in terms of char-tables.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Character group folding in searches
  2015-02-08 19:12                     ` Eli Zaretskii
@ 2015-02-09  3:03                       ` Stefan Monnier
  2015-02-09 15:40                         ` Eli Zaretskii
  0 siblings, 1 reply; 25+ messages in thread
From: Stefan Monnier @ 2015-02-09  3:03 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: bruce.connor.am, emacs-devel

> Char-tables are efficient, and at least for decomposition they seem to
> be the perfect vehicle.  DFAs that come out of arbitrary regexps,
> OTOH, can sometimes be very inefficient.  That's why I tend to think
> about this in terms of char-tables.

That's a false dichotomy.  DFA is about *recognizing* multi-char
entities.  If the input entities you care about are only single-char (as
is the case for decomposition), then your DFA will degenerate to
a single char-table (as is the case now).

But how do you use current char-tables to handle multi-char input
entities (i.e. to recognize things like "=>")?

> Who and how will create such a DFA?

They'd be mechanically constructed (by hand-written code), for example
driven by the existing Unicode tables.

        Stefan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Character group folding in searches
  2015-02-09  3:03                       ` Stefan Monnier
@ 2015-02-09 15:40                         ` Eli Zaretskii
  2015-02-09 16:33                           ` Stefan Monnier
  0 siblings, 1 reply; 25+ messages in thread
From: Eli Zaretskii @ 2015-02-09 15:40 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: bruce.connor.am, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: bruce.connor.am@gmail.com,  emacs-devel@gnu.org
> Date: Sun, 08 Feb 2015 22:03:08 -0500
> 
> > Char-tables are efficient, and at least for decomposition they seem to
> > be the perfect vehicle.  DFAs that come out of arbitrary regexps,
> > OTOH, can sometimes be very inefficient.  That's why I tend to think
> > about this in terms of char-tables.
> 
> That's a false dichotomy.

Actually, it's not a dichotomy at all.  I just explained why
char-tables seem to be a good basis on which to build this feature.

> DFA is about *recognizing* multi-char entities.  If the input
> entities you care about are only single-char (as is the case for
> decomposition), then your DFA will degenerate to a single char-table
> (as is the case now).

I think we have a miscommunication here.  I was talking about the
tables that are part of a DFA that drive its state machine.  Those
tables might become large and sparse, certainly if the input symbol
can be any Unicode character, most of which only match themselves.

I guess I'm still struggling to understand your idea of using DFAs.
E.g., you talk about each node of a DFA being a char-table, but AFAIK
a DFA node is just a state of the automaton, so how can that be
expressed as a char-table?  And above you are saying that a "DFA will
degenerate to a single char-table", which again is a stumbling block
for me, since a DFA is more than a table.  What am I missing?

> But how do you use current char-tables to handle multi-char input
> entities (i.e. to recognize things like "=>")?

I don't understand the question, sorry.  The simple answer is that a
char-table entry can be any Lisp object, including a string, but you
already know that.

If you mean how to compare "=>" with "⇒", then the latter will be
"folded" to the former using a char-table, and then the results will
be compared, either as strings or character by character.  Is this
what you were asking?

> > Who and how will create such a DFA?
> 
> They'd be mechanically constructed (by hand-written code), for example
> driven by the existing Unicode tables.

What would be the input language for specifying such a DFA?  I mean,
how would we specify which sequence of states are acceptable (yielding
a match for the search) and which aren't?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Character group folding in searches
  2015-02-09 15:40                         ` Eli Zaretskii
@ 2015-02-09 16:33                           ` Stefan Monnier
  2015-02-09 17:39                             ` Eli Zaretskii
  0 siblings, 1 reply; 25+ messages in thread
From: Stefan Monnier @ 2015-02-09 16:33 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: bruce.connor.am, emacs-devel

> I guess I'm still struggling to understand your idea of using DFAs.
> E.g., you talk about each node of a DFA being a char-table, but AFAIK
> a DFA node is just a state of the automaton, so how can that be

A DFA node is a state with labeled arcs going out to other states.

It's usually implemented as a "table" (array, hash-table, char-table, ...)
that maps the labels to the next state.

Does it make more sense, now?

>> But how do you use current char-tables to handle multi-char input
>> entities (i.e. to recognize things like "=>")?
> I don't understand the question, sorry.  The simple answer is that a
> char-table entry can be any Lisp object, including a string, but you
> already know that.

That doesn't tell me how you'd use it.  Would the ?= char be mapped to
a list of strings (one of them being "=>") and then you'd check if the
next (few) chars match one of those strings?
What I suggest is to map the ?= char to another char-table which then
maps the ?> char to (say) ?⇒.

> If you mean how to compare "=>" with "⇒", then the latter will be
> "folded" to the former using a char-table,

[ I always get confused by this terminology since "folding" to me
  implies making things smaller, so I'd call it "unfolding" in that
  direction.  ]

> and then the results will be compared, either as strings or character
> by character.  Is this what you were asking?

But how would this handle an equivalence class that includes both "=>"
and "->"?

>> > Who and how will create such a DFA?
>> They'd be mechanically constructed (by hand-written code), for example
>> driven by the existing Unicode tables.
> What would be the input language for specifying such a DFA?  I mean,
> how would we specify which sequence of states are acceptable (yielding
> a match for the search) and which aren't?

Depends.  For the Unicode-defined equivalence classes, we'd use the
Unicode tables directly and build the DFA nodes from it without going
through some intermediate "specification".

For other cases, we could specify the DFA with a list of strings.
Or with regular expressions.

        Stefan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Character group folding in searches
  2015-02-09 16:33                           ` Stefan Monnier
@ 2015-02-09 17:39                             ` Eli Zaretskii
  2015-02-10  2:15                               ` Stefan Monnier
  0 siblings, 1 reply; 25+ messages in thread
From: Eli Zaretskii @ 2015-02-09 17:39 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: bruce.connor.am, emacs-devel

> From: Stefan Monnier <monnier@IRO.UMontreal.CA>
> Cc: bruce.connor.am@gmail.com, emacs-devel@gnu.org
> Date: Mon, 09 Feb 2015 11:33:29 -0500
> 
> > I guess I'm still struggling to understand your idea of using DFAs.
> > E.g., you talk about each node of a DFA being a char-table, but AFAIK
> > a DFA node is just a state of the automaton, so how can that be
> 
> A DFA node is a state with labeled arcs going out to other states.
> 
> It's usually implemented as a "table" (array, hash-table, char-table, ...)
> that maps the labels to the next state.
> 
> Does it make more sense, now?

Yes.  (Your "node" seems to be both a node and some part of the
transition function, which is not the usual terminology, AFAIR.)

> >> But how do you use current char-tables to handle multi-char input
> >> entities (i.e. to recognize things like "=>")?
> > I don't understand the question, sorry.  The simple answer is that a
> > char-table entry can be any Lisp object, including a string, but you
> > already know that.
> 
> That doesn't tell me how you'd use it.  Would the ?= char be mapped to
> a list of strings (one of them being "=>") and then you'd check if the
> next (few) chars match one of those strings?
> What I suggest is to map the ?= char to another char-table which then
> maps the ?> char to (say) ?⇒.

I thought doing it the other way around, starting with ?⇒.

> > If you mean how to compare "=>" with "⇒", then the latter will be
> > "folded" to the former using a char-table,
> 
> [ I always get confused by this terminology since "folding" to me
>   implies making things smaller, so I'd call it "unfolding" in that
>   direction.  ]
> 
> > and then the results will be compared, either as strings or character
> > by character.  Is this what you were asking?
> 
> But how would this handle an equivalence class that includes both "=>"
> and "->"?

Why should we?  "->" could be equivalent to ?→, but I see no reason to
make it equivalent to "=>".

> >> > Who and how will create such a DFA?
> >> They'd be mechanically constructed (by hand-written code), for example
> >> driven by the existing Unicode tables.
> > What would be the input language for specifying such a DFA?  I mean,
> > how would we specify which sequence of states are acceptable (yielding
> > a match for the search) and which aren't?
> 
> Depends.  For the Unicode-defined equivalence classes, we'd use the
> Unicode tables directly and build the DFA nodes from it without going
> through some intermediate "specification".
> 
> For other cases, we could specify the DFA with a list of strings.
> Or with regular expressions.

And the DFA will be in C or in Lisp?

Anyway, I hope to see something like that landing on master,
preferably sooner than later.




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Character group folding in searches
  2015-02-09 17:39                             ` Eli Zaretskii
@ 2015-02-10  2:15                               ` Stefan Monnier
  2015-02-10 15:45                                 ` Eli Zaretskii
  0 siblings, 1 reply; 25+ messages in thread
From: Stefan Monnier @ 2015-02-10  2:15 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: bruce.connor.am, emacs-devel

> Why should we?  "->" could be equivalent to ?→, but I see no reason to
> make it equivalent to "=>".

The me the question goes the other way: how can you be sure noone will
ever need/want equivalence classes which include things like "->" and
"=>" at the same time?

> And the DFA will be in C or in Lisp?

The DFA itself will be a Lisp data-structure, at first I'd expect it to
be simply a bunch of inter-connected char-tables.  The code that "runs"
such a DFA on a chunk of text is usually very simple so I'd expect that
it can be implemented in C very easily, tho an Elisp implementation is
also an option.

> Anyway, I hope to see something like that landing on master,
> preferably sooner than later.

FWIW, elpa/packages/lex has Elisp code that builds such DFAs from
regular expressions, along with an Elisp loop that runs such DFAs to
perform regexp-matching and regexp-searching.

But I don't have any immediate plans to work on this, no.

        Stefan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Character group folding in searches
  2015-02-10  2:15                               ` Stefan Monnier
@ 2015-02-10 15:45                                 ` Eli Zaretskii
  0 siblings, 0 replies; 25+ messages in thread
From: Eli Zaretskii @ 2015-02-10 15:45 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: bruce.connor.am, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: bruce.connor.am@gmail.com,  emacs-devel@gnu.org
> Date: Mon, 09 Feb 2015 21:15:21 -0500
> 
> > Why should we?  "->" could be equivalent to ?→, but I see no reason to
> > make it equivalent to "=>".
> 
> The me the question goes the other way: how can you be sure noone will
> ever need/want equivalence classes which include things like "->" and
> "=>" at the same time?

When they do, we will indeed want your DFAs ;-)

Seriously, though: the real question here is whether this limitation
is serious enough to render the entire idea useless.  I don't think
so, I think the use-case you envision is rare at best.

> But I don't have any immediate plans to work on this, no.

Damn!




^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2015-02-10 15:45 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-02-06 13:04 Character group folding in searches Artur Malabarba
2015-02-06 14:32 ` Eli Zaretskii
2015-02-06 16:18   ` Artur Malabarba
2015-02-06 16:44     ` Eli Zaretskii
2015-02-06 18:03   ` Stefan Monnier
2015-02-06 19:03     ` Eli Zaretskii
2015-02-06 19:27       ` Artur Malabarba
2015-02-06 21:38         ` Eli Zaretskii
2015-02-06 22:08           ` Artur Malabarba
2015-02-07  8:38             ` Eli Zaretskii
2015-02-06 19:41       ` Stefan Monnier
2015-02-06 21:43         ` Eli Zaretskii
2015-02-07  0:05           ` Stefan Monnier
2015-02-07  8:47             ` Eli Zaretskii
2015-02-07 15:02               ` Stefan Monnier
2015-02-07 15:31                 ` Eli Zaretskii
2015-02-08 14:03                   ` Stefan Monnier
2015-02-08 19:12                     ` Eli Zaretskii
2015-02-09  3:03                       ` Stefan Monnier
2015-02-09 15:40                         ` Eli Zaretskii
2015-02-09 16:33                           ` Stefan Monnier
2015-02-09 17:39                             ` Eli Zaretskii
2015-02-10  2:15                               ` Stefan Monnier
2015-02-10 15:45                                 ` Eli Zaretskii
2015-02-07  0:07 ` Juri Linkov

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).