search-default-mode char-fold-to-regexp and Greek Extended block characters

unofficial mirror of emacs-devel@gnu.org 

* search-default-mode char-fold-to-regexp and Greek Extended block characters
@ 2019-07-19 14:18 Robert Pluim
  2019-07-19 14:37 ` Eli Zaretskii
  2019-07-19 18:53 ` Juri Linkov
  0 siblings, 2 replies; 32+ messages in thread
From: Robert Pluim @ 2019-07-19 14:18 UTC (permalink / raw)
  To: emacs-devel

This is an offshoot of the discussion in Bug#36717.

I have

(setq search-default-mode 'char-fold-to-regexp)

In a buffer containing

1. ί (\u03af GREEK SMALL LETTER IOTA WITH TONOS)
2. ί (\u1f77 GREEK SMALL LETTER IOTA WITH OXIA)

I do C-s C-x 8 RET 03b9

isearch will find the iota with tonos, but not the iota with oxia,
even though the decomposition of the decomposition of the latter
contains iota. Is that expected?

Thanks

Robert
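
The same check can be reproduced without isearch.  The snippet below is
a minimal sketch, assuming only that char-fold.el can be loaded.  It
uses code points rather than literal Greek characters so that the two
visually identical letters cannot be confused.  The result for U+1F77
depends on how the running Emacs builds char-fold-table.

```elisp
(require 'char-fold)

;; Folded regexp for plain GREEK SMALL LETTER IOTA (U+03B9).
(let ((regexp (char-fold-to-regexp (string #x3b9))))
  (list
   ;; U+03AF GREEK SMALL LETTER IOTA WITH TONOS: non-nil means "found".
   (string-match-p regexp (string #x3af))
   ;; U+1F77 GREEK SMALL LETTER IOTA WITH OXIA: nil reproduces the report.
   (string-match-p regexp (string #x1f77))))
```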


* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-19 14:18 search-default-mode char-fold-to-regexp and Greek Extended block characters Robert Pluim
@ 2019-07-19 14:37 ` Eli Zaretskii
  2019-07-19 16:03   ` Robert Pluim
  2019-07-19 18:53 ` Juri Linkov
  1 sibling, 1 reply; 32+ messages in thread
From: Eli Zaretskii @ 2019-07-19 14:37 UTC (permalink / raw)
  To: Robert Pluim; +Cc: emacs-devel

> From: Robert Pluim <rpluim@gmail.com>
> Date: Fri, 19 Jul 2019 16:18:52 +0200
> 
> In a buffer containing
> 
> 1. ί (\u03af GREEK SMALL LETTER IOTA WITH TONOS)
> 2. ί (\u1f77 GREEK SMALL LETTER IOTA WITH OXIA)
> 
> I do C-s C-x 8 RET 03b9
> 
> isearch will find the iota with tonos, but not the iota with oxia,
> even though the decomposition of the decomposition of the latter
> contains iota. Is that expected?

I suggest stepping through the loop in char-fold.el and seeing what
happens there for ί.




* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-19 14:37 ` Eli Zaretskii
@ 2019-07-19 16:03   ` Robert Pluim
  2019-07-19 18:13     ` Eli Zaretskii
  0 siblings, 1 reply; 32+ messages in thread
From: Robert Pluim @ 2019-07-19 16:03 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

>>>>> On Fri, 19 Jul 2019 17:37:24 +0300, Eli Zaretskii <eliz@gnu.org> said:

    >> From: Robert Pluim <rpluim@gmail.com>
    >> Date: Fri, 19 Jul 2019 16:18:52 +0200
    >> 
    >> In a buffer containing
    >> 
    >> 1. ί (\u03af GREEK SMALL LETTER IOTA WITH TONOS)
    >> 2. ί (\u1f77 GREEK SMALL LETTER IOTA WITH OXIA)
    >> 
    >> I do C-s C-x 8 RET 03b9
    >> 
    >> isearch will find the iota with tonos, but not the iota with oxia,
    >> even though the decomposition of the decomposition of the latter
    >> contains iota. Is that expected?

    Eli> I suggest to step through the loop in char-fold.el and see what
    Eli> happens there for ί.

After poking around, the char-fold-table entry for \u1f77 contains
\u1f77 and \u03af, but not \u03b9, so this is expected. I then started
looking into the further details of Unicode decomposition and
normalization, and decided that I definitely donʼt know enough about
this to decide if this is a bug or not :-)

Robert




* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-19 16:03   ` Robert Pluim
@ 2019-07-19 18:13     ` Eli Zaretskii
  2019-07-21 11:03       ` Robert Pluim
  0 siblings, 1 reply; 32+ messages in thread
From: Eli Zaretskii @ 2019-07-19 18:13 UTC (permalink / raw)
  To: Robert Pluim; +Cc: emacs-devel

> From: Robert Pluim <rpluim@gmail.com>
> Cc: emacs-devel@gnu.org
> Date: Fri, 19 Jul 2019 18:03:05 +0200
> 
>     >> 1. ί (\u03af GREEK SMALL LETTER IOTA WITH TONOS)
>     >> 2. ί (\u1f77 GREEK SMALL LETTER IOTA WITH OXIA)
>     >> 
>     >> I do C-s C-x 8 RET 03b9
>     >> 
>     >> isearch will find the iota with tonos, but not the iota with oxia,
>     >> even though the decomposition of the decomposition of the latter
>     >> contains iota. Is that expected?
> 
>     Eli> I suggest to step through the loop in char-fold.el and see what
>     Eli> happens there for ί.
> 
> After poking around, the char-fold-table entry for \u1f77 contains
> \u1f77 and \u03af, but not \u03b9, so this is expected. I then started
> looking into the further details of unicode decomposition and
> normalization, and decided that I definitely donʼt know enough about
> this to decide if this is a bug or not :-)

  (get-char-code-property ?\u1f77 'decomposition) => (943) ; (#x03af) i.e. (?\u03af)
  (get-char-code-property ?\u03af 'decomposition) => (953 769) ; (#x03b9 #x0301)

Do we expand the decomposition property recursively?  It sounds like
we don't, but maybe we should.
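
A minimal sketch of what following the chain would look like, assuming
a hypothetical helper named my/decomposition-chain (it is not part of
char-fold.el).  It relies on get-char-code-property returning a
one-element list holding the character itself when there is no further
decomposition.

```elisp
;; Hypothetical helper for illustration; not part of char-fold.el.
(defun my/decomposition-chain (char)
  "Return the successive decompositions reachable from CHAR.
Each element is the decomposition of the first character of the
previous one.  Leading compatibility tags (symbols) are dropped."
  (let ((chain nil)
        (current char)
        decomp)
    (while (progn
             (setq decomp (get-char-code-property current 'decomposition))
             ;; Drop a leading compatibility tag such as `compat'.
             (when (symbolp (car decomp))
               (setq decomp (cdr decomp)))
             ;; Stop once a character decomposes only to itself.
             (and decomp (not (eq (car decomp) current))))
      (push decomp chain)
      (setq current (car decomp)))
    (nreverse chain)))

;; Follows U+1F77 -> U+03AF -> U+03B9 + U+0301.
(my/decomposition-chain #x1f77)
```

With the two property values shown above, the call returns both levels,
((943) (953 769)), which is the information a one-level lookup misses.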




* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-19 14:18 search-default-mode char-fold-to-regexp and Greek Extended block characters Robert Pluim
  2019-07-19 14:37 ` Eli Zaretskii
@ 2019-07-19 18:53 ` Juri Linkov
  1 sibling, 0 replies; 32+ messages in thread
From: Juri Linkov @ 2019-07-19 18:53 UTC (permalink / raw)
  To: emacs-devel

> This is an offshoot of the discussion in Bug#36717.
>
> I have
>
> (setq search-default-mode 'char-fold-to-regexp)
>
> In a buffer containing
>
> 1. ί (\u03af GREEK SMALL LETTER IOTA WITH TONOS)
> 2. ί (\u1f77 GREEK SMALL LETTER IOTA WITH OXIA)
>
> I do C-s C-x 8 RET 03b9
>
> isearch will find the iota with tonos, but not the iota with oxia,
> even though the decomposition of the decomposition of the latter
> contains iota. Is that expected?

Decomposition of GREEK SMALL LETTER IOTA WITH TONOS is
GREEK SMALL LETTER IOTA + COMBINING ACUTE ACCENT, whereas
decomposition of GREEK SMALL LETTER IOTA WITH OXIA is
GREEK SMALL LETTER IOTA WITH TONOS, i.e. it's just an alias.
So far, several levels of indirection (from GREEK SMALL LETTER IOTA
to GREEK SMALL LETTER IOTA WITH OXIA via GREEK SMALL LETTER IOTA WITH TONOS)
were unsupported in char-fold.el.  This should be fixed in bug#35689.


* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-19 18:13     ` Eli Zaretskii
@ 2019-07-21 11:03       ` Robert Pluim
  2019-07-22 18:39         ` Robert Pluim
  0 siblings, 1 reply; 32+ messages in thread
From: Robert Pluim @ 2019-07-21 11:03 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

>>>>> On Fri, 19 Jul 2019 21:13:02 +0300, Eli Zaretskii <eliz@gnu.org> said:

    Eli>   (get-char-code-property ?ί 'decomposition) => (943) ; (#x03af) i.e (?ί)
    Eli>   (get-char-code-property ?ί 'decomposition) => (953 769) ; (#x03b9 #x0301)

    Eli> Do we expand the decomposition property recursively?  It sounds like
    Eli> we don't, but maybe we should.

We donʼt. The following patch allows searching for ι (0x3b9) to match
both ί (0x3af) and ί (0x1f77). It doesnʼt recurse, but I have no idea if
there are longer chains of decompositions.

It causes

(aref char-fold-table ?ι)

to expand from:
"\\(?:ι[̀́̄̆̈̓̔͂]\\|[ίιϊἰἱὶιῐῑῖ𝛊𝜄𝜾𝝸𝞲]\\)"
to:
"\\(?:ι[̀́̄̆̈̓̔͂]\\|[ΐίιϊἰ-ἷὶίιῐῑῒῖῗ𝛊𝜄𝜾𝝸𝞲]\\)"

where the additions are basically all the variants of IOTA + one or
more diacriticals.

Even if we donʼt apply this or something like it, itʼs been
educational.

diff --git i/lisp/char-fold.el w/lisp/char-fold.el
index 9d3ea17b41..bf2a4c2484 100644
--- i/lisp/char-fold.el
+++ w/lisp/char-fold.el
@@ -78,6 +78,20 @@
                               (cons (char-to-string char)
                                     (aref equiv (car decomp))))))))
                (funcall make-decomp-match-char decomp char)
+               ;; Check to see if the first char of the decomposition
+               ;; has a further decomposition.  If so, add a mapping
+               ;; back from that second decomposition to the original
+               ;; character.  This allows e.g. 'ι' (GREEK SMALL LETTER
+               ;; IOTA) to match both the Basic Greek block and
+               ;; Extended Greek block variants of IOTA +
+               ;; diacritical(s)
+               (let ((l2-decomp (char-table-range table (car decomp))))
+                 (when (consp l2-decomp)
+                   (when (symbolp (car l2-decomp))
+                     (setq l2-decomp (cdr l2-decomp)))
+                   (if (not (eq (car decomp)
+                                (car l2-decomp)))
+                       (funcall make-decomp-match-char (list (car l2-decomp)) char))))
                ;; Do it again, without the non-spacing characters.
                ;; This allows 'a' to match 'ä'.
                (let ((simpler-decomp nil)




* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-21 11:03       ` Robert Pluim
@ 2019-07-22 18:39         ` Robert Pluim
  2019-07-23 14:57           ` Eli Zaretskii
  0 siblings, 1 reply; 32+ messages in thread
From: Robert Pluim @ 2019-07-22 18:39 UTC (permalink / raw)
  To: emacs-devel


>>>>> On Sun, 21 Jul 2019 13:03:37 +0200, Robert Pluim <rpluim@gmail.com> said:

>>>>> On Fri, 19 Jul 2019 21:13:02 +0300, Eli Zaretskii <eliz@gnu.org> said:
    Eli> (get-char-code-property ?ί 'decomposition) => (943) ; (#x03af) i.e (?ί)
    Eli> (get-char-code-property ?ί 'decomposition) => (953 769) ; (#x03b9 #x0301)

    Eli> Do we expand the decomposition property recursively?  It sounds like
    Eli> we don't, but maybe we should.

    Robert> We donʼt. The following patch allows searching for ι (0x3b9) to match
    Robert> both ί (0x3af) and ί (1f77). It doesnʼt recurse, but I have no idea if
    Robert> there are longer chains of decompositions.

The answer to that, empirically, is 'yes', since with the following
patch the number of characters equivalent to ι increases, i.e.:

Standard =>
(aref char-fold-table ?ι)"\\(?:ι[̀́̄̆̈̓̔͂]\\|[ίιϊἰἱὶιῐῑῖ𝛊𝜄𝜾𝝸𝞲]\\)"

2 level decomposition =>
(aref char-fold-table ?ι)"\\(?:ι[̀́̄̆̈̓̔͂]\\|[ΐίιϊἰ-ἷὶίιῐῑῒῖῗ𝛊𝜄𝜾𝝸𝞲]\\)"

n level decomposition =>
(aref char-fold-table ?ι)"\\(?:ι[̀́̄̆̈̓̔͂]\\|[ΐίιϊἰ-ἷὶίιῐ-ΐῖῗ𝛊𝜄𝜾𝝸𝞲]\\)"


[-- Attachment #2: 0001-Follow-decomposition-chains-when-constructing-char-f.patch --]
[-- Type: text/x-patch, Size: 2983 bytes --]

From 3628379cf461805008b34e01dba751183c0b857c Mon Sep 17 00:00:00 2001
From: Robert Pluim <rpluim@gmail.com>
Date: Mon, 22 Jul 2019 20:27:59 +0200
Subject: [PATCH] Follow decomposition chains when constructing char-fold-table
To: emacs-devel@gnu.org

* lisp/char-fold.el (char-fold-make-table): Decompose the
decomposition of each character, adding equivalences to the original
character, until no more decompositions are left.
---
 etc/NEWS          |  8 ++++++++
 lisp/char-fold.el | 21 +++++++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/etc/NEWS b/etc/NEWS
index e9ec21bb4c..33fe7075ec 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -1169,6 +1169,14 @@ and case-sensitivity together with search strings in the search ring.
 +++
 *** 'flush-lines' prints and returns the number of deleted matching lines.
 
+---
+*** 'char-fold-to-regexp' now matches more variants of a base character.
+The table used to check for equivalence of characters is now built
+using the complete chain of unicode decompositions of a character,
+rather than stopping after one level, such that searching for
+e.g. GREEK SMALL LETTER IOTA will now also find GREEK SMALL LETTER
+IOTA WITH OXIA.
+
 ** Debugger
 
 +++
diff --git a/lisp/char-fold.el b/lisp/char-fold.el
index 9d3ea17b41..6842d38a62 100644
--- a/lisp/char-fold.el
+++ b/lisp/char-fold.el
@@ -78,6 +78,27 @@
                               (cons (char-to-string char)
                                     (aref equiv (car decomp))))))))
                (funcall make-decomp-match-char decomp char)
+               ;; Check to see if the first char of the decomposition
+               ;; has a further decomposition.  If so, add a mapping
+               ;; back from that second decomposition to the original
+               ;; character.  This allows e.g. 'ι' (GREEK SMALL LETTER
+               ;; IOTA) to match both the Basic Greek block and
+               ;; Extended Greek block variants of IOTA +
+               ;; diacritical(s).  Repeat until there are no more
+               ;; decompositions.
+               (let ((dec decomp)
+                     next-decomp)
+                 (catch 'done
+                   (while dec
+                     (setq next-decomp (char-table-range table (car dec)))
+                     (when (consp next-decomp)
+                       (when (symbolp (car next-decomp))
+                         (setq next-decomp (cdr next-decomp)))
+                       (if (not (eq (car dec)
+                                    (car next-decomp)))
+                           (funcall make-decomp-match-char (list (car next-decomp)) char)
+                         (throw 'done t)))
+                     (setq dec next-decomp))))
                ;; Do it again, without the non-spacing characters.
                ;; This allows 'a' to match 'ä'.
                (let ((simpler-decomp nil)
-- 
2.21.0.419.gffac537e6c



* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-22 18:39         ` Robert Pluim
@ 2019-07-23 14:57           ` Eli Zaretskii
  2019-07-23 17:43             ` Robert Pluim
  0 siblings, 1 reply; 32+ messages in thread
From: Eli Zaretskii @ 2019-07-23 14:57 UTC (permalink / raw)
  To: Robert Pluim; +Cc: emacs-devel

> From: Robert Pluim <rpluim@gmail.com>
> Date: Mon, 22 Jul 2019 20:39:22 +0200
> 
> The answer to that, empirically, is 'yes', since with the following
> patch the number of characters equivalent to ι increases, ie:
> 
> Standard =>
> (aref char-fold-table ?ι)"\\(?:ι[̀́̄̆̈̓̔͂]\\|[ίιϊἰἱὶιῐῑῖ𝛊𝜄𝜾𝝸𝞲]\\)"
> 
> 2 level decomposition =>
> (aref char-fold-table ?ι)"\\(?:ι[̀́̄̆̈̓̔͂]\\|[ΐίιϊἰ-ἷὶίιῐῑῒῖῗ𝛊𝜄𝜾𝝸𝞲]\\)"
> 
> n level decomposition =>
> (aref char-fold-table ?ι)"\\(?:ι[̀́̄̆̈̓̔͂]\\|[ΐίιϊἰ-ἷὶίιῐ-ΐῖῗ𝛊𝜄𝜾𝝸𝞲]\\)"

Thanks, I think you should install your patch.




* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-23 14:57           ` Eli Zaretskii
@ 2019-07-23 17:43             ` Robert Pluim
  2019-07-23 20:29               ` Juri Linkov
  0 siblings, 1 reply; 32+ messages in thread
From: Robert Pluim @ 2019-07-23 17:43 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

>>>>> On Tue, 23 Jul 2019 17:57:38 +0300, Eli Zaretskii <eliz@gnu.org> said:

    >> From: Robert Pluim <rpluim@gmail.com>
    >> Date: Mon, 22 Jul 2019 20:39:22 +0200
    >> 
    >> The answer to that, empirically, is 'yes', since with the following
    >> patch the number of characters equivalent to ι increases, ie:
    >> 
    >> Standard =>
    >> (aref char-fold-table ?ι)"\\(?:ι[̀́̄̆̈̓̔͂]\\|[ίιϊἰἱὶιῐῑῖ𝛊𝜄𝜾𝝸𝞲]\\)"
    >> 
    >> 2 level decomposition =>
    >> (aref char-fold-table ?ι)"\\(?:ι[̀́̄̆̈̓̔͂]\\|[ΐίιϊἰ-ἷὶίιῐῑῒῖῗ𝛊𝜄𝜾𝝸𝞲]\\)"
    >> 
    >> n level decomposition =>
    >> (aref char-fold-table ?ι)"\\(?:ι[̀́̄̆̈̓̔͂]\\|[ΐίιϊἰ-ἷὶίιῐ-ΐῖῗ𝛊𝜄𝜾𝝸𝞲]\\)"

    Eli> Thanks, I think you should install your patch.

Done as f9337bc36d

Thanks

Robert


* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-23 17:43             ` Robert Pluim
@ 2019-07-23 20:29               ` Juri Linkov
  2019-07-24  7:56                 ` Robert Pluim
  2019-07-24  9:04                 ` Robert Pluim
  0 siblings, 2 replies; 32+ messages in thread
From: Juri Linkov @ 2019-07-23 20:29 UTC (permalink / raw)
  To: emacs-devel

> Done as f9337bc36d

Thanks!  Could you please look into why the tests fail to validate
matching of n-level decomposition?  The character with 3-level
decomposition in char-fold--test-without-customization is currently
commented out as FIXME.  After uncommenting it, the test fails, and I
don't understand why.


* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-23 20:29               ` Juri Linkov
@ 2019-07-24  7:56                 ` Robert Pluim
  2019-07-24  7:59                   ` Robert Pluim
  2019-07-24  9:04                 ` Robert Pluim
  1 sibling, 1 reply; 32+ messages in thread
From: Robert Pluim @ 2019-07-24  7:56 UTC (permalink / raw)
  To: Juri Linkov; +Cc: emacs-devel

>>>>> On Tue, 23 Jul 2019 23:29:01 +0300, Juri Linkov <juri@linkov.net> said:

    >> Done as f9337bc36d
    Juri> Thanks!  Could you please look why tests fail to validate matching of
    Juri> n-level decomposition.  The character with 3 level decomposition in
    Juri> char-fold--test-without-customization is currently commented out as
    Juri> FIXME.  After uncommenting this test fails, and I don't understand why.

I canʼt find a test with that name. Is there a patch floating around I
should look at?

Robert




* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-24  7:56                 ` Robert Pluim
@ 2019-07-24  7:59                   ` Robert Pluim
  0 siblings, 0 replies; 32+ messages in thread
From: Robert Pluim @ 2019-07-24  7:59 UTC (permalink / raw)
  To: emacs-devel; +Cc: Juri Linkov

>>>>> On Wed, 24 Jul 2019 09:56:02 +0200, Robert Pluim <rpluim@gmail.com> said:

>>>>> On Tue, 23 Jul 2019 23:29:01 +0300, Juri Linkov <juri@linkov.net> said:
    >>> Done as f9337bc36d
    Juri> Thanks!  Could you please look why tests fail to validate matching of
    Juri> n-level decomposition.  The character with 3 level decomposition in
    Juri> char-fold--test-without-customization is currently commented out as
    Juri> FIXME.  After uncommenting this test fails, and I don't understand why.

    Robert> I canʼt find a test with that name. Is there a patch floating around I
    Robert> should look at?

Never mind, I needed to do a git pull.





* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-23 20:29               ` Juri Linkov
  2019-07-24  7:56                 ` Robert Pluim
@ 2019-07-24  9:04                 ` Robert Pluim
  2019-07-24 23:12                   ` Juri Linkov
  1 sibling, 1 reply; 32+ messages in thread
From: Robert Pluim @ 2019-07-24  9:04 UTC (permalink / raw)
  To: Juri Linkov; +Cc: emacs-devel

>>>>> On Tue, 23 Jul 2019 23:29:01 +0300, Juri Linkov <juri@linkov.net> said:

    >> Done as f9337bc36d
    Juri> Thanks!  Could you please look why tests fail to validate matching of
    Juri> n-level decomposition.  The character with 3 level decomposition in
    Juri> char-fold--test-without-customization is currently commented out as
    Juri> FIXME.  After uncommenting this test fails, and I don't understand why.

That test ends up doing

(string-match "\\`\\(?:ι[̀́̄̆̈̓̔͂]\\|[ΐίιϊἰ-ἷὶίιῐ-ΐῖῗ𝛊𝜄𝜾𝝸𝞲]\\)\\'" "Ϊ́")

because it does (upcase "ΐ") => Ϊ́

That character is GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA, and
as far as I can tell there is no CAPITAL variant of that letter, so
upcase canʼt return it, which means it returns GREEK CAPITAL LETTER
IOTA plus the diacriticals, which is obviously not going to
match.

Robert
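
The mismatch is easy to see by listing code points.  This is a minimal
sketch, assuming the test character is U+1FD3 (the name given above)
and an Emacs whose upcase applies Unicode special casing to strings.
my/upcase-code-points is a hypothetical helper.

```elisp
;; Hypothetical helper for illustration.
(defun my/upcase-code-points (char)
  "Return the code points of the upper-cased one-character string CHAR."
  (append (upcase (string char)) nil))

;; U+1FD3 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA has no
;; single-character capital, so the result has more than one element,
;; starting with U+0399 GREEK CAPITAL LETTER IOTA.
(my/upcase-code-points #x1fd3)

;; For comparison: LATIN SMALL LETTER SHARP S also upcases to a
;; multi-character string.
(my/upcase-code-points #xdf)
```

Because the upper-cased form is a sequence starting with a capital
letter, a regexp built from the small letter's fold table, anchored at
both ends, has nothing in its character class that covers it.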




* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-24  9:04                 ` Robert Pluim
@ 2019-07-24 23:12                   ` Juri Linkov
  2019-07-25  0:18                     ` Basil L. Contovounesios
                                       ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Juri Linkov @ 2019-07-24 23:12 UTC (permalink / raw)
  To: emacs-devel

>     >> Done as f9337bc36d
>     Juri> Thanks!  Could you please look why tests fail to validate matching of
>     Juri> n-level decomposition.  The character with 3 level decomposition in
>     Juri> char-fold--test-without-customization is currently commented out as
>     Juri> FIXME.  After uncommenting this test fails, and I don't understand why.
>
> That test ends up doing
>
> (string-match "\\`\\(?:ι[̀́̄̆̈̓̔͂]\\|[ΐίιϊἰ-ἷὶίιῐ-ΐῖῗ𝛊𝜄𝜾𝝸𝞲]\\)\\'" "Ϊ́")
>
> because it does (upcase "ΐ") => Ϊ́
>
> That character is GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA, and
> as far as I can tell there is no CAPITAL variant of that letter, so
> upcase canʼt return it, which means it returns GREEK CAPITAL LETTER
> IOTA plus the diacriticals, which is obviously not going to
> match.

This is an interesting case like (upcase "ß") => "SS" that required
adding (?ß "ss") to pass the tests.  So I guess we need to add (?ι "ΐ")
for the tests to pass:

diff --git a/test/lisp/char-fold-tests.el b/test/lisp/char-fold-tests.el
index e519435ef0..3819f3919d 100644
--- a/test/lisp/char-fold-tests.el
+++ b/test/lisp/char-fold-tests.el
@@ -166,6 +165,7 @@ char-fold--test-with-customization
   (let* ((char-fold-include
           '(
             (?ß "ss") ;; de
+            (?ι "ΐ")  ;; el
             (?o "ø")  ;; da no nb nn
             (?l "ł")  ;; pl
             ))
@@ -184,9 +184,7 @@ char-fold--test-with-customization
           '(
             ("e" "ℯ" "ḗ" "ë" "ë")
             ("е" "ё" "ё")
-            ("ι" "ί" "ί"
-             ;; FIXME: "ΐ"
-             )
+            ("ι" "ί" "ί" "ΐ")
             ("ß" "ss")
             ("o" "ø")
             ("l" "ł")


But this is only for char-fold--test-with-customization.  OTOH, for
char-fold--test-without-customization we also need to change the default
value in char-fold.el, like this:

diff --git a/lisp/char-fold.el b/lisp/char-fold.el
index f379229e6c..c4add03bd9 100644
--- a/lisp/char-fold.el
+++ b/lisp/char-fold.el
@@ -27,7 +27,8 @@
   (defconst char-fold--default-include
     '((?\" "＂" "“" "”" "”" "„" "⹂" "〞" "‟" "‟" "❞" "❝" "❠" "“" "„" "〝" "〟" "🙷" "🙶" "🙸" "«" "»")
       (?' "❟" "❛" "❜" "‘" "’" "‚" "‛" "‚" "󠀢" "❮" "❯" "‹" "›")
-      (?` "❛" "‘" "‛" "󠀢" "❮" "‹")))
+      (?` "❛" "‘" "‛" "󠀢" "❮" "‹")
+      (?ι "ΐ")))
   (defconst char-fold--default-exclude nil)
   (defconst char-fold--default-symmetric nil)
   (defconst char-fold--previous (list char-fold--default-include

diff --git a/test/lisp/char-fold-tests.el b/test/lisp/char-fold-tests.el
index e519435ef0..3819f3919d 100644
--- a/test/lisp/char-fold-tests.el
+++ b/test/lisp/char-fold-tests.el
@@ -154,8 +154,7 @@ char-fold--test-without-customization
             ("ι"
              "ί" ;; 1 level decomposition
              "ί" ;; 2 level decomposition
-             ;; FIXME:
-             ;; "ΐ" ;; 3 level decomposition
+             "ΐ" ;; 3 level decomposition
              )
             )))
     (dolist (strings matches)




* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-24 23:12                   ` Juri Linkov
@ 2019-07-25  0:18                     ` Basil L. Contovounesios
  2019-07-25 18:40                       ` Juri Linkov
  2019-07-25  2:36                     ` Eli Zaretskii
  2019-07-25  8:46                     ` Robert Pluim
  2 siblings, 1 reply; 32+ messages in thread
From: Basil L. Contovounesios @ 2019-07-25  0:18 UTC (permalink / raw)
  To: Juri Linkov; +Cc: emacs-devel

Juri Linkov <juri@linkov.net> writes:

>>     Juri> Thanks!  Could you please look why tests fail to validate matching of
>>     Juri> n-level decomposition.  The character with 3 level decomposition in
>>     Juri> char-fold--test-without-customization is currently commented out as
>>     Juri> FIXME.  After uncommenting this test fails, and I don't understand why.
>>
>> That test ends up doing
>>
>> (string-match "\\`\\(?:ι[̀́̄̆̈̓̔͂]\\|[ΐίιϊἰ-ἷὶίιῐ-ΐῖῗ𝛊𝜄𝜾𝝸𝞲]\\)\\'" "Ϊ́")
>>
>> because it does (upcase "ΐ") => Ϊ́
>>
>> That character is GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA, and
>> as far as I can tell there is no CAPITAL variant of that letter, so
>> upcase canʼt return it, which means it returns GREEK CAPITAL LETTER
>> IOTA plus the diacriticals, which is obviously not going to
>> match.
>
> This is an interesting case like (upcase "ß") => "SS" that required
> adding (?ß "ss") to pass the tests.

It is probably this way because all caps are not usually (if ever)
accented in Greek, so the only time upper-case letters take accents is
at the start of capitalised words, where dialytika can never appear, as
dialytika only make sense on the second of two consecutive vowels.

> So I guess we need to add (?ι "ΐ") for the tests to pass:

[...]

> But this is only for char-fold--test-with-customization.  OTOH, for
> char-fold--test-without-customization we need also to change the default
> value in char-fold.el like:

[...]

Can you please explain why iota with dialytika and tonos needs to be
special-cased in these places?

Thanks,

-- 
Basil




* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-24 23:12                   ` Juri Linkov
  2019-07-25  0:18                     ` Basil L. Contovounesios
@ 2019-07-25  2:36                     ` Eli Zaretskii
  2019-07-25  8:59                       ` Robert Pluim
  2019-07-25  8:46                     ` Robert Pluim
  2 siblings, 1 reply; 32+ messages in thread
From: Eli Zaretskii @ 2019-07-25  2:36 UTC (permalink / raw)
  To: Juri Linkov; +Cc: emacs-devel

> From: Juri Linkov <juri@linkov.net>
> Date: Thu, 25 Jul 2019 02:12:01 +0300
> 
> This is an interesting case like (upcase "ß") => "SS" that required
> adding (?ß "ss") to pass the tests.

Isn't there now an upper-case eszet that eliminates that problem?




* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-24 23:12                   ` Juri Linkov
  2019-07-25  0:18                     ` Basil L. Contovounesios
  2019-07-25  2:36                     ` Eli Zaretskii
@ 2019-07-25  8:46                     ` Robert Pluim
  2019-07-25 18:46                       ` Juri Linkov
  2 siblings, 1 reply; 32+ messages in thread
From: Robert Pluim @ 2019-07-25  8:46 UTC (permalink / raw)
  To: Juri Linkov; +Cc: emacs-devel

>>>>> On Thu, 25 Jul 2019 02:12:01 +0300, Juri Linkov <juri@linkov.net> said:
    Juri> This is an interesting case like (upcase "ß") => "SS" that required
    Juri> adding (?ß "ss") to pass the tests.  So I guess we need to add (?ι "ΐ")
    Juri> for the tests to pass:

This is OK

    Juri> diff --git a/test/lisp/char-fold-tests.el b/test/lisp/char-fold-tests.el
    Juri> index e519435ef0..3819f3919d 100644
    Juri> --- a/test/lisp/char-fold-tests.el
    Juri> +++ b/test/lisp/char-fold-tests.el
    Juri> @@ -166,6 +165,7 @@ char-fold--test-with-customization
    Juri>    (let* ((char-fold-include
    Juri>            '(
    Juri>              (?ß "ss") ;; de
    Juri> +            (?ι "ΐ")  ;; el
    Juri>              (?o "ø")  ;; da no nb nn
    Juri>              (?l "ł")  ;; pl
    Juri>              ))
    Juri> @@ -184,9 +184,7 @@ char-fold--test-with-customization
    Juri>            '(
    Juri>              ("e" "ℯ" "ḗ" "ë" "ë")
    Juri>              ("е" "ё" "ё")
    Juri> -            ("ι" "ί" "ί"
    Juri> -             ;; FIXME: "ΐ"
    Juri> -             )
    Juri> +            ("ι" "ί" "ί" "ΐ")
    Juri>              ("ß" "ss")
    Juri>              ("o" "ø")
    Juri>              ("l" "ł")


    Juri> But this is only for char-fold--test-with-customization.  OTOH, for
    Juri> char-fold--test-without-customization we need also to change the default
    Juri> value in char-fold.el like:

    Juri> diff --git a/lisp/char-fold.el b/lisp/char-fold.el
    Juri> index f379229e6c..c4add03bd9 100644
    Juri> --- a/lisp/char-fold.el
    Juri> +++ b/lisp/char-fold.el
    Juri> @@ -27,7 +27,8 @@
    Juri>    (defconst char-fold--default-include
    Juri>      '((?\" "＂" "“" "”" "”" "„" "⹂" "〞" "‟" "‟" "❞" "❝" "❠" "“" "„" "〝" "〟" "🙷" "🙶" "🙸" "«" "»")
    Juri>        (?' "❟" "❛" "❜" "‘" "’" "‚" "‛" "‚" "󠀢" "❮" "❯" "‹" "›")
    Juri> -      (?` "❛" "‘" "‛" "󠀢" "❮" "‹")))
    Juri> +      (?` "❛" "‘" "‛" "󠀢" "❮" "‹")
    Juri> +      (?ι "ΐ")))
    Juri>    (defconst char-fold--default-exclude nil)
    Juri>    (defconst char-fold--default-symmetric nil)
    Juri>    (defconst char-fold--previous (list char-fold--default-include

But this one I donʼt understand. Searching for iota (capital or small)
in a buffer containing ΐ ΐ or Ϊ́ already works with
char-fold-to-regexp, so why is this needed?

Robert




* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-25  2:36                     ` Eli Zaretskii
@ 2019-07-25  8:59                       ` Robert Pluim
  2019-07-25 12:53                         ` Eli Zaretskii
  0 siblings, 1 reply; 32+ messages in thread
From: Robert Pluim @ 2019-07-25  8:59 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, Juri Linkov

>>>>> On Thu, 25 Jul 2019 05:36:12 +0300, Eli Zaretskii <eliz@gnu.org> said:

    >> From: Juri Linkov <juri@linkov.net>
    >> Date: Thu, 25 Jul 2019 02:12:01 +0300
    >> 
    >> This is an interesting case like (upcase "ß") => "SS" that required
    >> adding (?ß "ss") to pass the tests.

    Eli> Isn't there now an upper-case eszet that eliminates that problem?

There is, but the mapping is one-way.

UnicodeData.txt has:

00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;
1E9E;LATIN CAPITAL LETTER SHARP S;Lu;0;L;;;;;N;;;;00DF;

and SpecialCasing.txt has

00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S

which gives us

(upcase "ß") => "SS" ; \udf => "SS"
(downcase "ẞ") => "ß" ; \u1e9e => \udf

Perhaps this is different in a later version of Unicode than what
weʼre using.

Robert




* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-25  8:59                       ` Robert Pluim
@ 2019-07-25 12:53                         ` Eli Zaretskii
  0 siblings, 0 replies; 32+ messages in thread
From: Eli Zaretskii @ 2019-07-25 12:53 UTC (permalink / raw)
  To: Robert Pluim; +Cc: emacs-devel

> From: Robert Pluim <rpluim@gmail.com>
> Cc: Juri Linkov <juri@linkov.net>,  emacs-devel@gnu.org
> Date: Thu, 25 Jul 2019 10:59:39 +0200
> 
>     Eli> Isn't there now an upper-case eszet that eliminates that problem?
> 
> There is, but the mapping is one-way.

Right you are, thanks.

> Perhaps this is different in a later version of Unicode than what
> weʼre using.

We are using the latest official version 12.0 of Unicode.  And this
data is unchanged in the current draft for Unicode 13.0, so I think we
are good.




* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-25  0:18                     ` Basil L. Contovounesios
@ 2019-07-25 18:40                       ` Juri Linkov
  2019-07-25 20:44                         ` search-default-mode char-fold-to-regexp and Greek Extended block characters, " Robert Pluim
  0 siblings, 1 reply; 32+ messages in thread
From: Juri Linkov @ 2019-07-25 18:40 UTC (permalink / raw)
  To: Basil L. Contovounesios; +Cc: emacs-devel

>>> because it does (upcase "ΐ") => Ϊ́
>>>
>>> That character is GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA, and
>>> as far as I can tell there is no CAPITAL variant of that letter, so
>>> upcase canʼt return it, which means it returns GREEK CAPITAL LETTER
>>> IOTA plus the diacriticals, which is obviously not going to
>>> match.
>>
>> This is an interesting case like (upcase "ß") => "SS" that required
>> adding (?ß "ss") to pass the tests.
>
> It is probably this way because all caps are not usually (if ever)
> accented in Greek, so the only time upper-case letters take accents is
> at the start of capitalised words, where dialytika can never appear, as
> dialytika only make sense on the second of two consecutive vowels.

Maybe only for searching purposes we could find all cases
where upper- and lower-case letters differ significantly and
add them to char-fold-include by default.

>> So I guess we need to add (?ι "ΐ") for the tests to pass:
>
> [...]
>
>> But this is only for char-fold--test-with-customization.  OTOH, for
>> char-fold--test-without-customization we need also to change the default
>> value in char-fold.el like:
>
> [...]
>
> Can you please explain why iota with dialytika and tonos needs to be
> special-cased in these places?

Here is the test case that demonstrates the need to add it
to char-fold-include:

0. emacs -Q
1. Paste this text to *scratch*: "ΐΐ"
2. Search for two IOTAs with char-fold, e.g.: C-s M-s ' ιι

When given the base char GREEK SMALL LETTER IOTA, the char-fold search
doesn't match the characters that carry combining accents.

However, after adding (?ι "ΐ") to char-fold-include it can match the
base character IOTA.
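
The recipe above can also be checked without isearch.  What follows is
a minimal sketch, assuming Emacs 27 or later (where the library is
named char-fold) and assuming the pasted text is the decomposed
sequence U+03B9 U+0308 U+0301.  The helper name
my/char-fold-matches-whole-p is hypothetical; anchoring the regexp at
both ends is what forces every combining accent to be consumed.

```elisp
(require 'char-fold)

(defun my/char-fold-matches-whole-p (search text)
  "Return non-nil if the char-folded regexp for SEARCH matches all of TEXT.
Hypothetical helper, not part of char-fold.el."
  (string-match-p (concat "\\`" (char-fold-to-regexp search) "\\'")
                  text))

;; Two plain iotas against two decomposed iotas with dialytika and
;; tonos.  The result depends on the value of `char-fold-include'.
(my/char-fold-matches-whole-p
 "\u03b9\u03b9"
 "\u03b9\u0308\u0301\u03b9\u0308\u0301")
```

No expected value is given for the last form, since it differs between
an Emacs with and without the extra mapping.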




* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-25  8:46                     ` Robert Pluim
@ 2019-07-25 18:46                       ` Juri Linkov
  2019-07-26  6:04                         ` Eli Zaretskii
  0 siblings, 1 reply; 32+ messages in thread
From: Juri Linkov @ 2019-07-25 18:46 UTC (permalink / raw)
  To: emacs-devel

> But this one I donʼt understand. Searching for iota (capital or small)
> in a buffer containing ΐ ΐ or Ϊ́ already works with
> char-fold-to-regexp, so why is this needed?

Searching for ι finds ΐ only when searching for a single letter ι
because the search matches the first part of ΐ that contains the base
character ι and ignores the remaining combining accents like ̈́

So for testing you need to search for longer strings, e.g.
in a buffer with this text "ΐΐΪ́." try to search for "ιιι."

It fails to find this text without adding (?ι "ΐ")
to char-fold-include.
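
The prefix-match effect described above can be illustrated with plain
regexps.  This is a simplified sketch, not what char-fold generates:
the function name my/prefix-match-demo is hypothetical, and the
decomposed form U+03B9 U+0308 U+0301 is assumed for the buffer text.

```elisp
(defun my/prefix-match-demo ()
  "Show how a bare iota matches only the start of a decomposed character.
Hypothetical helper using plain regexps, not char-fold itself."
  (let ((decomposed "\u03b9\u0308\u0301")) ; iota + dialytika + acute
    (list
     ;; A lone iota matches at position 0 and leaves the accents unmatched.
     (string-match-p "\u03b9" decomposed)
     ;; With a following literal period the match fails, because the
     ;; combining accents sit between the iota and the period.
     (string-match-p "\u03b9\\." (concat decomposed ".")))))

(my/prefix-match-demo) ; => (0 nil)
```

This is why a one-letter search appears to work while a longer search
string exposes the missing mapping.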




* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-25 18:40                       ` Juri Linkov
@ 2019-07-25 20:44                         ` Robert Pluim
  2019-07-25 21:35                           ` Juri Linkov
  0 siblings, 1 reply; 32+ messages in thread
From: Robert Pluim @ 2019-07-25 20:44 UTC (permalink / raw)
  To: Juri Linkov; +Cc: Basil L. Contovounesios, emacs-devel

>>>>> On Thu, 25 Jul 2019 21:40:12 +0300, Juri Linkov <juri@linkov.net> said:

    >> Can you please explain why iota with dialytika and tonos needs to be
    >> special-cased in these places?

    Juri> Here is the test case that demonstrates the need to add it
    Juri> to char-fold-include:

    Juri> 0. emacs -Q
    Juri> 1. Paste this text to *scratch*: "ΐΐ"
    Juri> 2. Search for two IOTAs with char-fold, e.g.: C-s M-s ' ιι

    Juri> When given the base char GREEK SMALL LETTER IOTA, the char-fold search
    Juri> doesn't match the characters that carry combining accents.

    Juri> However, after adding (?ι "ΐ") to char-fold-include it can match the
    Juri> base character IOTA.

Yes, I see the problem now. Maybe this can be solved by adding that
mapping when building char-fold-table. Or 'those mappings' I should
say, since there are going to be many cases like this.

How about the following? It passes your tests with the FIXMEs
uncommented (and isearch for multiple iotas matches multiple iotas +
combining diacriticals).

I deliberately restricted it to lower case characters, since the
roundtripping fails for İ and a large number of titlecase characters.

diff --git i/lisp/char-fold.el w/lisp/char-fold.el
index f379229e6c..91fd7ddc28 100644
--- i/lisp/char-fold.el
+++ w/lisp/char-fold.el
@@ -108,6 +108,17 @@
                                     (car next-decomp)))
                            (funcall make-decomp-match-char (list (car next-decomp)) char)))
                      (setq dec next-decomp)))
+               ;; If there is no precomposed uppercase version of a
+               ;; character with diacriticals, we also add a mapping
+               ;; from the base character to the base character with
+               ;; combining diacriticals
+               (when (eq (get-char-code-property char 'general-category) 'Ll)
+                 (let* ((str (char-to-string char))
+                        (upper (upcase str))
+                        (roundtrip (downcase upper)))
+                   (when (> (length roundtrip) 1)
+                     (aset equiv (aref roundtrip 0)
+                           (cons roundtrip (aref equiv (aref roundtrip 0)))))))
                ;; Do it again, without the non-spacing characters.
                ;; This allows 'a' to match 'ä'.
                (let ((simpler-decomp nil)




* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-25 20:44                         ` search-default-mode char-fold-to-regexp and Greek Extended block characters, " Robert Pluim
@ 2019-07-25 21:35                           ` Juri Linkov
  2019-07-26 11:09                             ` Robert Pluim
  0 siblings, 1 reply; 32+ messages in thread
From: Juri Linkov @ 2019-07-25 21:35 UTC (permalink / raw)
  To: emacs-devel

> How about the following? It passes your tests with the FIXMEs
> uncommented (and isearch for multiple iotas matches multiple iotas +
> combining diacriticals).
>
> I deliberately restricted it to lower case characters, since the
> roundtripping fails for İ and a large number of titlecase characters.
>
> diff --git i/lisp/char-fold.el w/lisp/char-fold.el
> index f379229e6c..91fd7ddc28 100644
> --- i/lisp/char-fold.el
> +++ w/lisp/char-fold.el
> @@ -108,6 +108,17 @@
>                                      (car next-decomp)))
>                             (funcall make-decomp-match-char (list (car next-decomp)) char)))
>                       (setq dec next-decomp)))
> +               ;; If there is no precomposed uppercase version of a
> +               ;; character with diacriticals, we also add a mapping
> +               ;; from the base character to the base character with
> +               ;; combining diacriticals
> +               (when (eq (get-char-code-property char 'general-category) 'Ll)
> +                 (let* ((str (char-to-string char))
> +                        (upper (upcase str))
> +                        (roundtrip (downcase upper)))
> +                   (when (> (length roundtrip) 1)
> +                     (aset equiv (aref roundtrip 0)
> +                           (cons roundtrip (aref equiv (aref roundtrip 0)))))))
>                 ;; Do it again, without the non-spacing characters.
>                 ;; This allows 'a' to match 'ä'.
>                 (let ((simpler-decomp nil)

If there are many such cases, then better to handle them automatically indeed
(if this doesn't cause slowdown too much) instead of adding them one by one
to the default values.  Does this handle ß as well?




* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-25 18:46                       ` Juri Linkov
@ 2019-07-26  6:04                         ` Eli Zaretskii
  2019-07-26 18:40                           ` Juri Linkov
  0 siblings, 1 reply; 32+ messages in thread
From: Eli Zaretskii @ 2019-07-26  6:04 UTC (permalink / raw)
  To: Juri Linkov; +Cc: emacs-devel

> From: Juri Linkov <juri@linkov.net>
> Date: Thu, 25 Jul 2019 21:46:20 +0300
> 
> > But this one I donʼt understand. Searching for iota (capital or small)
> > in a buffer containing ΐ ΐ or Ϊ́ already works with
> > char-fold-to-regexp, so why is this needed?
> 
> Searching for ι finds ΐ only when searching for a single letter ι
> because the search matches the first part of ΐ that contains the base
> character ι and ignores the remaining combining accents like ̈́
> 
> So for testing you need to search for longer strings, e.g.
> in a buffer with this text "ΐΐΪ́." try to search for "ιιι."
> 
> It fails to find this text without adding (?ι "ΐ")
> to char-fold-include.

Maybe we should decide that this is a limitation of the current
implementation, and instead work on a more correct implementation,
which actually "folds" characters to their base variants as the search
proceeds.

Let's not forget that the current implementation was known to be
limited from the get-go, and we only accepted it because the "full"
one was too complex and required non-trivial changes on the C level.
So we shouldn't go too far into making the current implementation
support everything that the full one will inherently support.


* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-25 21:35                           ` Juri Linkov
@ 2019-07-26 11:09                             ` Robert Pluim
  2019-07-26 18:38                               ` Juri Linkov
  0 siblings, 1 reply; 32+ messages in thread
From: Robert Pluim @ 2019-07-26 11:09 UTC (permalink / raw)
  To: Juri Linkov; +Cc: emacs-devel

>>>>> On Fri, 26 Jul 2019 00:35:51 +0300, Juri Linkov <juri@linkov.net> said:

    Juri> If there are many such cases, then better to handle them automatically indeed
    Juri> (if this doesn't cause slowdown too much) instead of adding them one by one
    Juri> to the default values.  Does this handle ß as well?

There are 74, and I donʼt want to maintain such a list by hand :-).

ß is not a complex character, so is never looked at here. But if we
hoist the checking out of the loop over complex characters, we can
make that work as well (this supersedes my previous patch).

I have no idea of the performance impact of all this.
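
For anyone who wants to look at the affected characters, here is a
minimal sketch that lists them.  It applies the same upcase/downcase
roundtrip as the patch, but walks the code points with a plain loop
instead of the general-category table.  The function name
my/multi-char-roundtrip-chars and the default limit of #x20000 are
assumptions made for this illustration.

```elisp
(defun my/multi-char-roundtrip-chars (&optional limit)
  "Return lowercase characters below LIMIT whose case roundtrip grows.
A character qualifies when (downcase (upcase STRING)) is longer than
one character.  LIMIT defaults to #x20000.  Hypothetical helper."
  (let ((result nil))
    (dotimes (char (or limit #x20000))
      (when (eq (get-char-code-property char 'general-category) 'Ll)
        (let ((roundtrip (downcase (upcase (char-to-string char)))))
          (when (> (length roundtrip) 1)
            (push char result)))))
    (nreverse result)))

;; Number of such characters; this depends on the Unicode data
;; shipped with the running Emacs.
(length (my/multi-char-roundtrip-chars))
```

The list includes both complex characters such as U+0390 and simple
ones such as ß, so it is not restricted to characters that have a
decomposition.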

diff --git i/lisp/char-fold.el w/lisp/char-fold.el
index f379229e6c..bee485c8ed 100644
--- i/lisp/char-fold.el
+++ w/lisp/char-fold.el
@@ -49,6 +49,36 @@
                             (funcall func (car char) v table)))
                         table))
 
+      (map-char-table
+       (lambda (char category)
+         ;; If the uppercase version of a character is not a single
+         ;; character, we add a mapping from the first character of
+         ;; the uppercase version to the lowercase character. This
+         ;; handles eg ß => SS and ῗ => Ϊ́
+
+         (when (eq category 'Ll)
+           (let ((start char)
+                 (end char))
+             (when (consp char)
+               (setq start (car char)
+                     end (cdr char)))
+             (while (<= start end)
+               (let* ((str (char-to-string start))
+                      (upper (upcase str))
+                      (roundtrip (downcase upper)))
+                 (when (> (length roundtrip) 1)
+                   ;; Complex characters map to the decomposed version
+                   ;; + diacriticals, simple characters map to the
+                   ;; base char.  Also add the reverse mapping for
+                   ;; simple characters.
+                   (if (cdr (get-char-code-property start 'decomposition))
+                       (setq str roundtrip)
+                     (aset equiv start
+                           (cons roundtrip (aref equiv start))))
+                   (aset equiv (aref roundtrip 0)
+                         (cons str (aref equiv (aref roundtrip 0))))))
+               (setq start (+ 1 start))))))
+             (unicode-property-table-internal 'general-category))
       ;; Compile a list of all complex characters that each simple
       ;; character should match.
       ;; In summary this loop does 3 things:
diff --git i/test/lisp/char-fold-tests.el w/test/lisp/char-fold-tests.el
index e519435ef0..cf155d3cb5 100644
--- i/test/lisp/char-fold-tests.el
+++ w/test/lisp/char-fold-tests.el
@@ -154,9 +154,9 @@ char-fold--test-without-customization
             ("ι"
              "ί" ;; 1 level decomposition
              "ί" ;; 2 level decomposition
-             ;; FIXME:
-             ;; "ΐ" ;; 3 level decomposition
+             "ΐ" ;; 3 level decomposition
              )
+            ("ß" "ss")
             )))
     (dolist (strings matches)
       (apply 'char-fold--test-match-exactly strings))))
@@ -165,7 +165,6 @@ char-fold--test-with-customization
   :tags '(:expensive-test)
   (let* ((char-fold-include
           '(
-            (?ß "ss") ;; de
             (?o "ø")  ;; da no nb nn
             (?l "ł")  ;; pl
             ))
@@ -184,8 +183,7 @@ char-fold--test-with-customization
           '(
             ("e" "ℯ" "ḗ" "ë" "ë")
             ("е" "ё" "ё")
-            ("ι" "ί" "ί"
-             ;; FIXME: "ΐ"
+            ("ι" "ί" "ί" "ΐ"
              )
             ("ß" "ss")
             ("o" "ø")




* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-26 11:09                             ` Robert Pluim
@ 2019-07-26 18:38                               ` Juri Linkov
  2019-07-29  8:32                                 ` Robert Pluim
  0 siblings, 1 reply; 32+ messages in thread
From: Juri Linkov @ 2019-07-26 18:38 UTC (permalink / raw)
  To: emacs-devel

>     Juri> If there are many such cases, then better to handle them automatically indeed
>     Juri> (if this doesn't cause slowdown too much) instead of adding them one by one
>     Juri> to the default values.  Does this handle ß as well?
>
> There are 74, and I donʼt want to maintain such a list by hand :-).

Yes, 74 is too tedious to maintain by hand, so better to install your
previous patch (if it doesn't have the problem mentioned below) since
there are only 3 such complex characters (handled by your newer patch)
that are easy to add by hand:

  '((?ß "ss")
    (?ΐ "ΐ")
    (?ΰ "ΰ"))

> ß is not a complex character, so is never looked at here. But if we
> hoist the checking out of the loop over complex characters, we can
> make that work as well (this supersedes my previous patch).
>
> I have no idea of the performance impact of all this.
> [...]
> +                   (aset equiv (aref roundtrip 0)
> +                         (cons str (aref equiv (aref roundtrip 0))))))

It seems this adds a symmetric decomposition from the first character of "ss",
i.e. from ?s to "ß".  Shouldn't this rather update 'equiv-multi' instead?

OTOH, I see no reason to add symmetric decompositions by default since
they are handled by the option 'char-fold-symmetric'.




* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-26  6:04                         ` Eli Zaretskii
@ 2019-07-26 18:40                           ` Juri Linkov
  2019-07-26 19:13                             ` Eli Zaretskii
  0 siblings, 1 reply; 32+ messages in thread
From: Juri Linkov @ 2019-07-26 18:40 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

>> > But this one I donʼt understand. Searching for iota (capital or small)
>> > in a buffer containing ΐ ΐ or Ϊ́ already works with
>> > char-fold-to-regexp, so why is this needed?
>>
>> Searching for ι finds ΐ only when searching for a single letter ι
>> because the search matches the first part of ΐ that contains the base
>> character ι and ignores the remaining combining accents like ̈́
>>
>> So for testing you need to search for longer strings, e.g.
>> in a buffer with this text "ΐΐΪ́." try to search for "ιιι."
>>
>> It fails to find this text without adding (?ι "ΐ")
>> to char-fold-include.
>
> Maybe we should decide that this is a limitation of the current
> implementation, and instead work on a more correct implementation,
> which actually "folds" characters to their base variants as the search
> proceeds.
>
> Let's not forget that the current implementation was known to be
> limited from the get-go, and we only accepted it because the "full"
> one was too complex and required non-trivial changes on the C level.
> So we shouldn't go too far into making the current implementation
> support everything that the full one will inherently support.

I consider the current regexp-based implementation as a fully functional
prototype with complete test coverage, so after switching the
implementation later to C, the same tests should still pass.




* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-26 18:40                           ` Juri Linkov
@ 2019-07-26 19:13                             ` Eli Zaretskii
  0 siblings, 0 replies; 32+ messages in thread
From: Eli Zaretskii @ 2019-07-26 19:13 UTC (permalink / raw)
  To: Juri Linkov; +Cc: emacs-devel

> From: Juri Linkov <juri@linkov.net>
> Cc: emacs-devel@gnu.org
> Date: Fri, 26 Jul 2019 21:40:58 +0300
> 
> > Let's not forget that the current implementation was known to be
> > limited from the get-go, and we only accepted it because the "full"
> > one was too complex and required non-trivial changes on the C level.
> > So we shouldn't go too far into making the current implementation
> > support everything that the full one will inherently support.
> 
> I consider the current regexp-based implementation as a fully functional
> prototype with complete test coverage, so after switching the
> implementation later to C, the same tests should still pass.

That's missing the point I was trying to make.  My point is that there
might be some corner use cases which we might decide not to support
with the current implementation, and encourage people who'd like to
have a better support to work on the full implementation as described
by Unicode.




* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-26 18:38                               ` Juri Linkov
@ 2019-07-29  8:32                                 ` Robert Pluim
  2019-07-29 18:09                                   ` Juri Linkov
  0 siblings, 1 reply; 32+ messages in thread
From: Robert Pluim @ 2019-07-29  8:32 UTC (permalink / raw)
  To: Juri Linkov; +Cc: emacs-devel

>>>>> On Fri, 26 Jul 2019 21:38:30 +0300, Juri Linkov <juri@linkov.net> said:

    Juri> If there are many such cases, then better to handle them automatically indeed
    Juri> (if this doesn't cause slowdown too much) instead of adding them one by one
    Juri> to the default values.  Does this handle ß as well?
    >> 
    >> There are 74, and I donʼt want to maintain such a list by hand :-).

    Juri> Yes, 74 is too tedious to maintain by hand, so better to install your
    Juri> previous patch (if it doesn't have the problem mentioned below) since

The only difference between v2 and v1 of the patch is that v2 handles
ß, so v1 is probably better.

    Juri> there are only 3 such complex characters (handled by your newer patch)
    Juri> that are easy to add by hand:

    Juri>   '((?ß "ss")
    Juri>     (?ΐ "ΐ")
    Juri>     (?ΰ "ΰ"))

I donʼt understand this comment. With v1 of the patch, ß is the only
one that would need to be added by hand to char-fold--default-include

    >> ß is not a complex character, so is never looked at here. But if we
    >> hoist the checking out of the loop over complex characters, we can
    >> make that work as well (this supersedes my previous patch).
    >> 
    >> I have no idea of the performance impact of all this.
    >> [...]
    >> +                   (aset equiv (aref roundtrip 0)
    >> +                         (cons str (aref equiv (aref roundtrip 0))))))

    Juri> It seems this adds a symmetric decomposition from the first character of "ss",
    Juri> i.e. from ?s to "ß".  Shouldn't this rather update 'equiv-multi' instead?

Yes, thinko on my part.

    Juri> OTOH, I see no reason to add symmetric decompositions by default since
    Juri> they are handled by the option 'char-fold-symmetric'.

OK. Sounds like v1 is the winner. Iʼll clean it up and commit when ready.

Robert




* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-29  8:32                                 ` Robert Pluim
@ 2019-07-29 18:09                                   ` Juri Linkov
  2019-07-30  8:09                                     ` Robert Pluim
  0 siblings, 1 reply; 32+ messages in thread
From: Juri Linkov @ 2019-07-29 18:09 UTC (permalink / raw)
  To: emacs-devel

> I donʼt understand this comment. With v1 of the patch, ß is the only
> one that would need to be added by hand to char-fold--default-include

You are right, v1 of the patch is needed to remove ι and υ
from the default values, not ß.

The only problem with v1 of the patch is that it folds "f" to "ff"
and "fl", so typing `C-s f' matches "ff", `C-s s' matches "st", etc.
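
The ligature case is easy to reproduce.  A small sketch, assuming an
Emacs whose upcase follows SpecialCasing.txt for strings; the helper
name my/case-roundtrip is hypothetical.

```elisp
(defun my/case-roundtrip (char)
  "Return (downcase (upcase CHAR)) as a string.  Hypothetical helper."
  (downcase (upcase (char-to-string char))))

;; U+FB00 ff ligature, U+FB02 fl ligature, U+FB06 st ligature.
(mapcar #'my/case-roundtrip '(#xfb00 #xfb02 #xfb06))
;; => ("ff" "fl" "st")
```

Each roundtrip is longer than one character and starts with a plain
ASCII letter, so the v1 logic registers it under ?f or ?s.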




* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-29 18:09                                   ` Juri Linkov
@ 2019-07-30  8:09                                     ` Robert Pluim
  2019-07-30 10:15                                       ` Eli Zaretskii
  0 siblings, 1 reply; 32+ messages in thread
From: Robert Pluim @ 2019-07-30  8:09 UTC (permalink / raw)
  To: Juri Linkov; +Cc: emacs-devel

>>>>> On Mon, 29 Jul 2019 21:09:03 +0300, Juri Linkov <juri@linkov.net> said:

    >> I donʼt understand this comment. With v1 of the patch, ß is the only
    >> one that would need to be added by hand to char-fold--default-include

    Juri> You are right, v1 of the patch is needed to remove ι and υ
    Juri> from the default values, not ß.

    Juri> The only problem with v1 of the patch is that it folds "f" to "ff"
    Juri> and "fl", so typing `C-s f' matches "ff", `C-s s' matches "st", etc.

So it does. More thought required. Is there a Unicode spec for this
sort of stuff?

Robert




* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
  2019-07-30  8:09                                     ` Robert Pluim
@ 2019-07-30 10:15                                       ` Eli Zaretskii
  0 siblings, 0 replies; 32+ messages in thread
From: Eli Zaretskii @ 2019-07-30 10:15 UTC (permalink / raw)
  To: emacs-devel, Robert Pluim, Juri Linkov

On July 30, 2019 11:09:14 AM GMT+03:00, Robert Pluim <rpluim@gmail.com> wrote:
> 
> Is there a Unicode spec for this
> sort of stuff?


The Unicode guidelines for this are in UTS#10, "Unicode Collation
Algorithm", specifically Section 11, "Searching and Matching".




Thread overview: 32+ messages
2019-07-19 14:18 search-default-mode char-fold-to-regexp and Greek Extended block characters Robert Pluim
2019-07-19 14:37 ` Eli Zaretskii
2019-07-19 16:03   ` Robert Pluim
2019-07-19 18:13     ` Eli Zaretskii
2019-07-21 11:03       ` Robert Pluim
2019-07-22 18:39         ` Robert Pluim
2019-07-23 14:57           ` Eli Zaretskii
2019-07-23 17:43             ` Robert Pluim
2019-07-23 20:29               ` Juri Linkov
2019-07-24  7:56                 ` Robert Pluim
2019-07-24  7:59                   ` Robert Pluim
2019-07-24  9:04                 ` Robert Pluim
2019-07-24 23:12                   ` Juri Linkov
2019-07-25  0:18                     ` Basil L. Contovounesios
2019-07-25 18:40                       ` Juri Linkov
2019-07-25 20:44                         ` search-default-mode char-fold-to-regexp and Greek Extended block characters, " Robert Pluim
2019-07-25 21:35                           ` Juri Linkov
2019-07-26 11:09                             ` Robert Pluim
2019-07-26 18:38                               ` Juri Linkov
2019-07-29  8:32                                 ` Robert Pluim
2019-07-29 18:09                                   ` Juri Linkov
2019-07-30  8:09                                     ` Robert Pluim
2019-07-30 10:15                                       ` Eli Zaretskii
2019-07-25  2:36                     ` Eli Zaretskii
2019-07-25  8:59                       ` Robert Pluim
2019-07-25 12:53                         ` Eli Zaretskii
2019-07-25  8:46                     ` Robert Pluim
2019-07-25 18:46                       ` Juri Linkov
2019-07-26  6:04                         ` Eli Zaretskii
2019-07-26 18:40                           ` Juri Linkov
2019-07-26 19:13                             ` Eli Zaretskii
2019-07-19 18:53 ` Juri Linkov

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git
