* search-default-mode char-fold-to-regexp and Greek Extended block characters
From: Robert Pluim @ 2019-07-19 14:18 UTC
To: emacs-devel

This is an offshoot of the discussion in Bug#36717.

I have

    (setq search-default-mode 'char-fold-to-regexp)

In a buffer containing

1. ί (\u03af GREEK SMALL LETTER IOTA WITH TONOS)
2. ί (\u1f77 GREEK SMALL LETTER IOTA WITH OXIA)

I do C-s C-x 8 RET 03b9

isearch will find the iota with tonos, but not the iota with oxia,
even though the decomposition of the decomposition of the latter
contains iota. Is that expected?

Thanks

Robert
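The report can also be checked outside isearch. What follows is a minimal sketch, assuming Emacs 27 or later (where the library is named char-fold); `my-char-fold-matches-p` is a hypothetical helper introduced here for illustration and is not part of Emacs.

```elisp
(require 'char-fold)

(defun my-char-fold-matches-p (needle haystack)
  "Return non-nil if the char-fold regexp for NEEDLE matches all of HAYSTACK.
Hypothetical helper for illustration; not part of char-fold.el."
  (let ((case-fold-search nil))
    ;; Anchor the regexp so that the whole of HAYSTACK has to match.
    (string-match-p (concat "\\`" (char-fold-to-regexp needle) "\\'")
                    haystack)))

;; Search for GREEK SMALL LETTER IOTA (U+03B9) against both accented forms.
(list (my-char-fold-matches-p (string #x3b9) (string #x3af))    ; iota with tonos
      (my-char-fold-matches-p (string #x3b9) (string #x1f77)))  ; iota with oxia
```

If the behaviour described in the report is present, the first element should be non-nil and the second nil.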
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
From: Eli Zaretskii @ 2019-07-19 14:37 UTC
To: Robert Pluim; +Cc: emacs-devel

> From: Robert Pluim <rpluim@gmail.com>
> Date: Fri, 19 Jul 2019 16:18:52 +0200
>
> In a buffer containing
>
> 1. ί (\u03af GREEK SMALL LETTER IOTA WITH TONOS)
> 2. ί (\u1f77 GREEK SMALL LETTER IOTA WITH OXIA)
>
> I do C-s C-x 8 RET 03b9
>
> isearch will find the iota with tonos, but not the iota with oxia,
> even though the decomposition of the decomposition of the latter
> contains iota. Is that expected?

I suggest stepping through the loop in char-fold.el to see what
happens there for ί.
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
From: Robert Pluim @ 2019-07-19 16:03 UTC
To: Eli Zaretskii; +Cc: emacs-devel

>>>>> On Fri, 19 Jul 2019 17:37:24 +0300, Eli Zaretskii <eliz@gnu.org> said:

    >> From: Robert Pluim <rpluim@gmail.com>
    >> Date: Fri, 19 Jul 2019 16:18:52 +0200
    >>
    >> In a buffer containing
    >>
    >> 1. ί (\u03af GREEK SMALL LETTER IOTA WITH TONOS)
    >> 2. ί (\u1f77 GREEK SMALL LETTER IOTA WITH OXIA)
    >>
    >> I do C-s C-x 8 RET 03b9
    >>
    >> isearch will find the iota with tonos, but not the iota with oxia,
    >> even though the decomposition of the decomposition of the latter
    >> contains iota. Is that expected?

    Eli> I suggest to step through the loop in char-fold.el and see what
    Eli> happens there for ί.

After poking around, the char-fold-table entry for \u1f77 contains
\u1f77 and \u03af, but not \u03b9, so this is expected. I then started
looking into the further details of Unicode decomposition and
normalization, and decided that I definitely donʼt know enough about
this to decide if this is a bug or not :-)

Robert
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
From: Eli Zaretskii @ 2019-07-19 18:13 UTC
To: Robert Pluim; +Cc: emacs-devel

> From: Robert Pluim <rpluim@gmail.com>
> Cc: emacs-devel@gnu.org
> Date: Fri, 19 Jul 2019 18:03:05 +0200
>
> >> 1. ί (\u03af GREEK SMALL LETTER IOTA WITH TONOS)
> >> 2. ί (\u1f77 GREEK SMALL LETTER IOTA WITH OXIA)
> >>
> >> I do C-s C-x 8 RET 03b9
> >>
> >> isearch will find the iota with tonos, but not the iota with oxia,
> >> even though the decomposition of the decomposition of the latter
> >> contains iota. Is that expected?
>
> Eli> I suggest to step through the loop in char-fold.el and see what
> Eli> happens there for ί.
>
> After poking around, the char-fold-table entry for \u1f77 contains
> \u1f77 and \u03af, but not \u03b9, so this is expected. I then started
> looking into the further details of unicode decomposition and
> normalization, and decided that I definitely donʼt know enough about
> this to decide if this is a bug or not :-)

The two characters look identical on screen, so they are written here
by code point:

  (get-char-code-property ?\u1f77 'decomposition) => (943)     ; (#x03af), i.e. (?ί)
  (get-char-code-property ?\u03af 'decomposition) => (953 769) ; (#x03b9 #x0301)

Do we expand the decomposition property recursively? It sounds like
we don't, but maybe we should.
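To make the question about recursive expansion concrete, here is a minimal sketch that follows the first character of each decomposition until it stops changing. `my-decomposition-chain` is a hypothetical helper, not the code in char-fold.el; it relies only on `get-char-code-property`, which is documented to return a list containing the character itself when there is no decomposition.

```elisp
(defun my-decomposition-chain (char)
  "Return CHAR followed by the first character of each further decomposition.
Hypothetical helper for illustration; not part of char-fold.el."
  (let ((chain (list char))
        (current char)
        (done nil))
    (while (not done)
      (let ((decomp (get-char-code-property current 'decomposition)))
        ;; A leading symbol is a compatibility formatting tag; skip it.
        (when (symbolp (car decomp))
          (setq decomp (cdr decomp)))
        ;; Stop when nothing is left or the character maps to itself.
        (if (or (null decomp) (eq (car decomp) current))
            (setq done t)
          (setq current (car decomp))
          (push current chain))))
    (nreverse chain)))

;; U+1F77 -> U+03AF -> U+03B9, matching the two property values quoted above.
(my-decomposition-chain #x1f77)
```

The chain for U+1F77 passes through U+03AF before reaching plain iota, which is why a one-level lookup never sees U+03B9.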
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters 2019-07-19 18:13 ` Eli Zaretskii @ 2019-07-21 11:03 ` Robert Pluim 2019-07-22 18:39 ` Robert Pluim 0 siblings, 1 reply; 32+ messages in thread From: Robert Pluim @ 2019-07-21 11:03 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel >>>>> On Fri, 19 Jul 2019 21:13:02 +0300, Eli Zaretskii <eliz@gnu.org> said: Eli> (get-char-code-property ?ί 'decomposition) => (943) ; (#x03af) i.e (?ί) Eli> (get-char-code-property ?ί 'decomposition) => (953 769) ; (#x03b9 #x0301) Eli> Do we expand the decomposition property recursively? It sounds like Eli> we don't, but maybe we should. We donʼt. The following patch allows searching for ι (0x3b9) to match both ί (0x3af) and ί (1f77). It doesnʼt recurse, but I have no idea if there are longer chains of decompositions. It causes (aref char-fold-table ?ι) to expand from: "\\(?:ι[̀́̄̆̈̓̔͂]\\|[ίιϊἰἱὶιῐῑῖ𝛊𝜄𝜾𝝸𝞲]\\)" to: "\\(?:ι[̀́̄̆̈̓̔͂]\\|[ΐίιϊἰ-ἷὶίιῐῑῒῖῗ𝛊𝜄𝜾𝝸𝞲]\\)" where the additions are basically all the variants of IOTA + one or more diacritical Even if we donʼt apply this or something like it, itʼs been educational. diff --git i/lisp/char-fold.el w/lisp/char-fold.el index 9d3ea17b41..bf2a4c2484 100644 --- i/lisp/char-fold.el +++ w/lisp/char-fold.el @@ -78,6 +78,20 @@ (cons (char-to-string char) (aref equiv (car decomp)))))))) (funcall make-decomp-match-char decomp char) + ;; Check to see if the first char of the decomposition + ;; has a further decomposition. If so, add a mapping + ;; back from that second decomposition to the original + ;; character. This allows e.g. 
'ι' (GREEK SMALL LETTER + ;; IOTA) to match both the Basic Greek block and + ;; Extended Greek block variants of IOTA + + ;; diacritical(s) + (let ((l2-decomp (char-table-range table (car decomp)))) + (when (consp l2-decomp) + (when (symbolp (car l2-decomp)) + (setq l2-decomp (cdr l2-decomp))) + (if (not (eq (car decomp) + (car l2-decomp))) + (funcall make-decomp-match-char (list (car l2-decomp)) char)))) ;; Do it again, without the non-spacing characters. ;; This allows 'a' to match 'ä'. (let ((simpler-decomp nil) ^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters 2019-07-21 11:03 ` Robert Pluim @ 2019-07-22 18:39 ` Robert Pluim 2019-07-23 14:57 ` Eli Zaretskii 0 siblings, 1 reply; 32+ messages in thread From: Robert Pluim @ 2019-07-22 18:39 UTC (permalink / raw) To: emacs-devel [-- Attachment #1: Type: text/plain, Size: 1244 bytes --] >>>>> On Sun, 21 Jul 2019 13:03:37 +0200, Robert Pluim <rpluim@gmail.com> said: >>>>> On Fri, 19 Jul 2019 21:13:02 +0300, Eli Zaretskii <eliz@gnu.org> said: Eli> (get-char-code-property ?ί 'decomposition) => (943) ; (#x03af) i.e (?ί) Eli> (get-char-code-property ?ί 'decomposition) => (953 769) ; (#x03b9 #x0301) Eli> Do we expand the decomposition property recursively? It sounds like Eli> we don't, but maybe we should. Robert> We donʼt. The following patch allows searching for ι (0x3b9) to match Robert> both ί (0x3af) and ί (1f77). It doesnʼt recurse, but I have no idea if Robert> there are longer chains of decompositions. The answer to that, empirically, is 'yes', since with the following patch the number of characters equivalent to ι increases, ie: Standard => (aref char-fold-table ?ι)"\\(?:ι[̀́̄̆̈̓̔͂]\\|[ίιϊἰἱὶιῐῑῖ𝛊𝜄𝜾𝝸𝞲]\\)" 2 level decomposition => (aref char-fold-table ?ι)"\\(?:ι[̀́̄̆̈̓̔͂]\\|[ΐίιϊἰ-ἷὶίιῐῑῒῖῗ𝛊𝜄𝜾𝝸𝞲]\\)" n level decomposition => (aref char-fold-table ?ι)"\\(?:ι[̀́̄̆̈̓̔͂]\\|[ΐίιϊἰ-ἷὶίιῐ-ΐῖῗ𝛊𝜄𝜾𝝸𝞲]\\)" [-- Attachment #2: 0001-Follow-decomposition-chains-when-constructing-char-f.patch --] [-- Type: text/x-patch, Size: 2983 bytes --] From 3628379cf461805008b34e01dba751183c0b857c Mon Sep 17 00:00:00 2001 From: Robert Pluim <rpluim@gmail.com> Date: Mon, 22 Jul 2019 20:27:59 +0200 Subject: [PATCH] Follow decomposition chains when constructing char-fold-table To: emacs-devel@gnu.org * lisp/char-fold.el (char-fold-make-table): Decompose the decomposition of each character, adding equivalences to the original character, until no more decompositions are left. 
--- etc/NEWS | 8 ++++++++ lisp/char-fold.el | 21 +++++++++++++++++++++ 2 files changed, 29 insertions(+) diff --git a/etc/NEWS b/etc/NEWS index e9ec21bb4c..33fe7075ec 100644 --- a/etc/NEWS +++ b/etc/NEWS @@ -1169,6 +1169,14 @@ and case-sensitivity together with search strings in the search ring. +++ *** 'flush-lines' prints and returns the number of deleted matching lines. +--- +*** 'char-fold-to-regexp' now matches more variants of a base character. +The table used to check for equivalence of characters is now built +using the complete chain of unicode decompositions of a character, +rather than stopping after one level, such that searching for +e.g. GREEK SMALL LETTER IOTA will now also find GREEK SMALL LETTER +IOTA WITH OXIA. + ** Debugger +++ diff --git a/lisp/char-fold.el b/lisp/char-fold.el index 9d3ea17b41..6842d38a62 100644 --- a/lisp/char-fold.el +++ b/lisp/char-fold.el @@ -78,6 +78,27 @@ (cons (char-to-string char) (aref equiv (car decomp)))))))) (funcall make-decomp-match-char decomp char) + ;; Check to see if the first char of the decomposition + ;; has a further decomposition. If so, add a mapping + ;; back from that second decomposition to the original + ;; character. This allows e.g. 'ι' (GREEK SMALL LETTER + ;; IOTA) to match both the Basic Greek block and + ;; Extended Greek block variants of IOTA + + ;; diacritical(s). Repeat until there are no more + ;; decompositions. + (let ((dec decomp) + next-decomp) + (catch 'done + (while dec + (setq next-decomp (char-table-range table (car dec))) + (when (consp next-decomp) + (when (symbolp (car next-decomp)) + (setq next-decomp (cdr next-decomp))) + (if (not (eq (car dec) + (car next-decomp))) + (funcall make-decomp-match-char (list (car next-decomp)) char) + (throw 'done t))) + (setq dec next-decomp)))) ;; Do it again, without the non-spacing characters. ;; This allows 'a' to match 'ä'. (let ((simpler-decomp nil) -- 2.21.0.419.gffac537e6c ^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters 2019-07-22 18:39 ` Robert Pluim @ 2019-07-23 14:57 ` Eli Zaretskii 2019-07-23 17:43 ` Robert Pluim 0 siblings, 1 reply; 32+ messages in thread From: Eli Zaretskii @ 2019-07-23 14:57 UTC (permalink / raw) To: Robert Pluim; +Cc: emacs-devel > From: Robert Pluim <rpluim@gmail.com> > Date: Mon, 22 Jul 2019 20:39:22 +0200 > > The answer to that, empirically, is 'yes', since with the following > patch the number of characters equivalent to ι increases, ie: > > Standard => > (aref char-fold-table ?ι)"\\(?:ι[̀́̄̆̈̓̔͂]\\|[ίιϊἰἱὶιῐῑῖ𝛊𝜄𝜾𝝸𝞲]\\)" > > 2 level decomposition => > (aref char-fold-table ?ι)"\\(?:ι[̀́̄̆̈̓̔͂]\\|[ΐίιϊἰ-ἷὶίιῐῑῒῖῗ𝛊𝜄𝜾𝝸𝞲]\\)" > > n level decomposition => > (aref char-fold-table ?ι)"\\(?:ι[̀́̄̆̈̓̔͂]\\|[ΐίιϊἰ-ἷὶίιῐ-ΐῖῗ𝛊𝜄𝜾𝝸𝞲]\\)" Thanks, I think you should install your patch. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters 2019-07-23 14:57 ` Eli Zaretskii @ 2019-07-23 17:43 ` Robert Pluim 2019-07-23 20:29 ` Juri Linkov 0 siblings, 1 reply; 32+ messages in thread From: Robert Pluim @ 2019-07-23 17:43 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel >>>>> On Tue, 23 Jul 2019 17:57:38 +0300, Eli Zaretskii <eliz@gnu.org> said: >> From: Robert Pluim <rpluim@gmail.com> >> Date: Mon, 22 Jul 2019 20:39:22 +0200 >> >> The answer to that, empirically, is 'yes', since with the following >> patch the number of characters equivalent to ι increases, ie: >> >> Standard => >> (aref char-fold-table ?ι)"\\(?:ι[̀́̄̆̈̓̔͂]\\|[ίιϊἰἱὶιῐῑῖ𝛊𝜄𝜾𝝸𝞲]\\)" >> >> 2 level decomposition => >> (aref char-fold-table ?ι)"\\(?:ι[̀́̄̆̈̓̔͂]\\|[ΐίιϊἰ-ἷὶίιῐῑῒῖῗ𝛊𝜄𝜾𝝸𝞲]\\)" >> >> n level decomposition => >> (aref char-fold-table ?ι)"\\(?:ι[̀́̄̆̈̓̔͂]\\|[ΐίιϊἰ-ἷὶίιῐ-ΐῖῗ𝛊𝜄𝜾𝝸𝞲]\\)" Eli> Thanks, I think you should install your patch. Done as f9337bc36d Thanks Robert ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
From: Juri Linkov @ 2019-07-23 20:29 UTC
To: emacs-devel

> Done as f9337bc36d

Thanks! Could you please look into why the tests fail to validate
matching of n-level decompositions? The character with a 3-level
decomposition in char-fold--test-without-customization is currently
commented out as FIXME. After uncommenting it, the test fails, and I
don't understand why.
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
From: Robert Pluim @ 2019-07-24 7:56 UTC
To: Juri Linkov; +Cc: emacs-devel

>>>>> On Tue, 23 Jul 2019 23:29:01 +0300, Juri Linkov <juri@linkov.net> said:

    >> Done as f9337bc36d

    Juri> Thanks! Could you please look why tests fail to validate matching of
    Juri> n-level decomposition. The character with 3 level decomposition in
    Juri> char-fold--test-without-customization is currently commented out as
    Juri> FIXME. After uncommenting this test fails, and I don't understand why.

I canʼt find a test with that name. Is there a patch floating around I
should look at?

Robert
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
From: Robert Pluim @ 2019-07-24 7:59 UTC
To: emacs-devel; +Cc: Juri Linkov

>>>>> On Wed, 24 Jul 2019 09:56:02 +0200, Robert Pluim <rpluim@gmail.com> said:

>>>>> On Tue, 23 Jul 2019 23:29:01 +0300, Juri Linkov <juri@linkov.net> said:

    >>> Done as f9337bc36d

    Juri> Thanks! Could you please look why tests fail to validate matching of
    Juri> n-level decomposition. The character with 3 level decomposition in
    Juri> char-fold--test-without-customization is currently commented out as
    Juri> FIXME. After uncommenting this test fails, and I don't understand why.

    Robert> I canʼt find a test with that name. Is there a patch floating around I
    Robert> should look at?

Never mind, I needed to do a git pull.
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters 2019-07-23 20:29 ` Juri Linkov 2019-07-24 7:56 ` Robert Pluim @ 2019-07-24 9:04 ` Robert Pluim 2019-07-24 23:12 ` Juri Linkov 1 sibling, 1 reply; 32+ messages in thread From: Robert Pluim @ 2019-07-24 9:04 UTC (permalink / raw) To: Juri Linkov; +Cc: emacs-devel >>>>> On Tue, 23 Jul 2019 23:29:01 +0300, Juri Linkov <juri@linkov.net> said: >> Done as f9337bc36d Juri> Thanks! Could you please look why tests fail to validate matching of Juri> n-level decomposition. The character with 3 level decomposition in Juri> char-fold--test-without-customization is currently commented out as Juri> FIXME. After uncommenting this test fails, and I don't understand why. That test ends up doing (string-match "\\`\\(?:ι[̀́̄̆̈̓̔͂]\\|[ΐίιϊἰ-ἷὶίιῐ-ΐῖῗ𝛊𝜄𝜾𝝸𝞲]\\)\\'" "Ϊ́") because it does (upcase "ΐ") => Ϊ́ That character is GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA, and as far as I can tell there is no CAPITAL variant of that letter, so upcase canʼt return it, which means it returns GREEK CAPITAL LETTER IOTA plus the diacriticals, which is obviously not going to match. Robert ^ permalink raw reply [flat|nested] 32+ messages in thread
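The effect described in this message can be checked directly. The sketch below assumes an Emacs in which `upcase` on strings applies the multi-character special-casing rules (as the `(upcase ...)` result quoted above implies); `my-upcase-expands-p` is a hypothetical helper, not an Emacs function.

```elisp
(defun my-upcase-expands-p (char)
  "Return non-nil if upcasing CHAR as a string yields more than one character.
Hypothetical helper for illustration only."
  ;; `upcase' on a string may return a longer string when no single
  ;; precomposed uppercase character exists.
  (> (length (upcase (char-to-string char))) 1))

(list (my-upcase-expands-p #x1fd3)  ; GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
      (my-upcase-expands-p ?a))     ; ordinary letter with a one-character uppercase
```

A character for which this returns non-nil cannot be matched by a regexp that only lists single precomposed characters, which is the failure the test ran into.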
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters 2019-07-24 9:04 ` Robert Pluim @ 2019-07-24 23:12 ` Juri Linkov 2019-07-25 0:18 ` Basil L. Contovounesios ` (2 more replies) 0 siblings, 3 replies; 32+ messages in thread From: Juri Linkov @ 2019-07-24 23:12 UTC (permalink / raw) To: emacs-devel > >> Done as f9337bc36d > Juri> Thanks! Could you please look why tests fail to validate matching of > Juri> n-level decomposition. The character with 3 level decomposition in > Juri> char-fold--test-without-customization is currently commented out as > Juri> FIXME. After uncommenting this test fails, and I don't understand why. > > That test ends up doing > > (string-match "\\`\\(?:ι[̀́̄̆̈̓̔͂]\\|[ΐίιϊἰ-ἷὶίιῐ-ΐῖῗ𝛊𝜄𝜾𝝸𝞲]\\)\\'" "Ϊ́") > > because it does (upcase "ΐ") => Ϊ́ > > That character is GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA, and > as far as I can tell there is no CAPITAL variant of that letter, so > upcase canʼt return it, which means it returns GREEK CAPITAL LETTER > IOTA plus the diacriticals, which is obviously not going to > match. This is an interesting case like (upcase "ß") => "SS" that required adding (?ß "ss") to pass the tests. So I guess we need to add (?ι "ΐ") for the tests to pass: diff --git a/test/lisp/char-fold-tests.el b/test/lisp/char-fold-tests.el index e519435ef0..3819f3919d 100644 --- a/test/lisp/char-fold-tests.el +++ b/test/lisp/char-fold-tests.el @@ -166,6 +165,7 @@ char-fold--test-with-customization (let* ((char-fold-include '( (?ß "ss") ;; de + (?ι "ΐ") ;; el (?o "ø") ;; da no nb nn (?l "ł") ;; pl )) @@ -184,9 +184,7 @@ char-fold--test-with-customization '( ("e" "ℯ" "ḗ" "ë" "ë") ("е" "ё" "ё") - ("ι" "ί" "ί" - ;; FIXME: "ΐ" - ) + ("ι" "ί" "ί" "ΐ") ("ß" "ss") ("o" "ø") ("l" "ł") But this is only for char-fold--test-with-customization. 
OTOH, for char-fold--test-without-customization we need also to change the default value in char-fold.el like: diff --git a/lisp/char-fold.el b/lisp/char-fold.el index f379229e6c..c4add03bd9 100644 --- a/lisp/char-fold.el +++ b/lisp/char-fold.el @@ -27,7 +27,8 @@ (defconst char-fold--default-include '((?\" """ "“" "”" "”" "„" "⹂" "〞" "‟" "‟" "❞" "❝" "❠" "“" "„" "〝" "〟" "🙷" "🙶" "🙸" "«" "»") (?' "❟" "❛" "❜" "‘" "’" "‚" "‛" "‚" "" "❮" "❯" "‹" "›") - (?` "❛" "‘" "‛" "" "❮" "‹"))) + (?` "❛" "‘" "‛" "" "❮" "‹") + (?ι "ΐ"))) (defconst char-fold--default-exclude nil) (defconst char-fold--default-symmetric nil) (defconst char-fold--previous (list char-fold--default-include diff --git a/test/lisp/char-fold-tests.el b/test/lisp/char-fold-tests.el index e519435ef0..3819f3919d 100644 --- a/test/lisp/char-fold-tests.el +++ b/test/lisp/char-fold-tests.el @@ -154,8 +154,7 @@ char-fold--test-without-customization ("ι" "ί" ;; 1 level decomposition "ί" ;; 2 level decomposition - ;; FIXME: - ;; "ΐ" ;; 3 level decomposition + "ΐ" ;; 3 level decomposition ) ))) (dolist (strings matches) ^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters 2019-07-24 23:12 ` Juri Linkov @ 2019-07-25 0:18 ` Basil L. Contovounesios 2019-07-25 18:40 ` Juri Linkov 2019-07-25 2:36 ` Eli Zaretskii 2019-07-25 8:46 ` Robert Pluim 2 siblings, 1 reply; 32+ messages in thread From: Basil L. Contovounesios @ 2019-07-25 0:18 UTC (permalink / raw) To: Juri Linkov; +Cc: emacs-devel Juri Linkov <juri@linkov.net> writes: >> Juri> Thanks! Could you please look why tests fail to validate matching of >> Juri> n-level decomposition. The character with 3 level decomposition in >> Juri> char-fold--test-without-customization is currently commented out as >> Juri> FIXME. After uncommenting this test fails, and I don't understand why. >> >> That test ends up doing >> >> (string-match "\\`\\(?:ι[̀́̄̆̈̓̔͂]\\|[ΐίιϊἰ-ἷὶίιῐ-ΐῖῗ𝛊𝜄𝜾𝝸𝞲]\\)\\'" "Ϊ́") >> >> because it does (upcase "ΐ") => Ϊ́ >> >> That character is GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA, and >> as far as I can tell there is no CAPITAL variant of that letter, so >> upcase canʼt return it, which means it returns GREEK CAPITAL LETTER >> IOTA plus the diacriticals, which is obviously not going to >> match. > > This is an interesting case like (upcase "ß") => "SS" that required > adding (?ß "ss") to pass the tests. It is probably this way because all caps are not usually (if ever) accented in Greek, so the only time upper-case letters take accents is at the start of capitalised words, where dialytika can never appear, as dialytika only make sense on the second of two consecutive vowels. > So I guess we need to add (?ι "ΐ") for the tests to pass: [...] > But this is only for char-fold--test-with-customization. OTOH, for > char-fold--test-without-customization we need also to change the default > value in char-fold.el like: [...] Can you please explain why iota with dialytika and tonos needs to be special-cased in these places? Thanks, -- Basil ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
From: Juri Linkov @ 2019-07-25 18:40 UTC
To: Basil L. Contovounesios; +Cc: emacs-devel

>>> because it does (upcase "ΐ") => Ϊ́
>>>
>>> That character is GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA, and
>>> as far as I can tell there is no CAPITAL variant of that letter, so
>>> upcase canʼt return it, which means it returns GREEK CAPITAL LETTER
>>> IOTA plus the diacriticals, which is obviously not going to
>>> match.
>>
>> This is an interesting case like (upcase "ß") => "SS" that required
>> adding (?ß "ss") to pass the tests.
>
> It is probably this way because all caps are not usually (if ever)
> accented in Greek, so the only time upper-case letters take accents is
> at the start of capitalised words, where dialytika can never appear, as
> dialytika only make sense on the second of two consecutive vowels.

Maybe, only for searching purposes, we could find all cases where
upper- and lower-case letters differ significantly and add them to
char-fold-include by default.

>> So I guess we need to add (?ι "ΐ") for the tests to pass:
>
> [...]
>
>> But this is only for char-fold--test-with-customization. OTOH, for
>> char-fold--test-without-customization we need also to change the default
>> value in char-fold.el like:
>
> [...]
>
> Can you please explain why iota with dialytika and tonos needs to be
> special-cased in these places?

Here is the test case that demonstrates the need to add it
to char-fold-include:

0. emacs -Q

1. Paste this text into *scratch*: "ΐΐ"

2. Search for two IOTAs with char-fold, e.g.: C-s M-s ' ιι

The char-fold search doesn't match the characters with combining accents
against their base character GREEK SMALL LETTER IOTA.

However, after adding (?ι "ΐ") to char-fold-include, the base character
IOTA matches them.
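The workaround described here can be sketched as follows, assuming Emacs 27 or later, where `char-fold-include` is a user option whose setter rebuilds the folding table. The exact code points of the pasted text are not spelled out in the thread, so the string below (iota followed by COMBINING DIAERESIS and COMBINING ACUTE ACCENT) is an assumption, and `my-iota-with-marks` is a hypothetical name.

```elisp
(require 'char-fold)

;; Assumption: iota + U+0308 + U+0301 stands in for the text used in
;; the recipe; the thread does not give the code points.
(defvar my-iota-with-marks (string #x3b9 #x308 #x301)
  "Hypothetical example string: iota followed by two combining marks.")

;; `customize-set-variable' runs the option's setter, so the table is
;; rebuilt; a plain `setq' would not be expected to have that effect.
(customize-set-variable
 'char-fold-include
 (cons (list #x3b9 my-iota-with-marks) char-fold-include))

;; Two plain iotas should now fold onto two decorated ones.
(string-match-p
 (concat "\\`" (char-fold-to-regexp (string #x3b9 #x3b9)) "\\'")
 (concat my-iota-with-marks my-iota-with-marks))
```

This changes the option for the running session only; it is meant as a way to experiment with the entry, not as a recommended default.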
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
From: Robert Pluim @ 2019-07-25 20:44 UTC
To: Juri Linkov; +Cc: Basil L. Contovounesios, emacs-devel

>>>>> On Thu, 25 Jul 2019 21:40:12 +0300, Juri Linkov <juri@linkov.net> said:

    >> Can you please explain why iota with dialytika and tonos needs to be
    >> special-cased in these places?

    Juri> Here is the test case that demonstrates the need to add it
    Juri> to char-fold-include:

    Juri> 0. emacs -Q

    Juri> 1. Paste this text to *scratch*: "ΐΐ"

    Juri> 2. Search for two IOTAs with char-fold, e.g.: C-s M-s ' ιι

    Juri> The char-fold search doesn't match the characters with combining accents
    Juri> with their base char GREEK SMALL LETTER IOTA.

    Juri> However, after adding (?ι "ΐ") to char-fold-include it can match the
    Juri> base character IOTA.

Yes, I see the problem now. Maybe this can be solved by adding that
mapping when building char-fold-table. Or 'those mappings' I should
say, since there are going to be many cases like this.

How about the following? It passes your tests with the FIXMEs
uncommented (and isearch for multiple iotas matches multiple iotas +
combining diacriticals).

I deliberately restricted it to lower case characters, since the
roundtripping fails for İ and a large number of titlecase characters.

diff --git i/lisp/char-fold.el w/lisp/char-fold.el
index f379229e6c..91fd7ddc28 100644
--- i/lisp/char-fold.el
+++ w/lisp/char-fold.el
@@ -108,6 +108,17 @@
                                     (car next-decomp)))
                        (funcall make-decomp-match-char (list (car next-decomp)) char)))
                  (setq dec next-decomp)))
+              ;; If there is no precomposed uppercase version of a
+              ;; character with diacriticals, we also add a mapping
+              ;; from the base character to the base character with
+              ;; combining diacriticals
+              (when (eq (get-char-code-property char 'general-category) 'Ll)
+                (let* ((str (char-to-string char))
+                       (upper (upcase str))
+                       (roundtrip (downcase upper)))
+                  (when (> (length roundtrip) 1)
+                    (aset equiv (aref roundtrip 0)
+                          (cons roundtrip (aref equiv (aref roundtrip 0)))))))
               ;; Do it again, without the non-spacing characters.
               ;; This allows 'a' to match 'ä'.
               (let ((simpler-decomp nil)
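The round-trip test used in the patch can be tried on its own to see which characters it picks up. The following is a minimal sketch over an arbitrary code point range; `my-roundtrip-expanding-chars` is a hypothetical helper and not part of the patch, and it scans every lower-case letter in the range rather than only those with a decomposition.

```elisp
(defun my-roundtrip-expanding-chars (from to)
  "Return lower-case characters in FROM..TO whose case round trip expands.
A character qualifies when (downcase (upcase STRING)) is longer than
one character.  Hypothetical helper for illustration only."
  (let ((result nil)
        (char from))
    (while (<= char to)
      ;; Same restriction as in the patch: general category Ll only.
      (when (eq (get-char-code-property char 'general-category) 'Ll)
        (let ((roundtrip (downcase (upcase (char-to-string char)))))
          (when (> (length roundtrip) 1)
            (push char result))))
      (setq char (1+ char)))
    (nreverse result)))

;; Scan the Greek and Coptic block, U+0370..U+03FF.
(my-roundtrip-expanding-chars #x370 #x3ff)
```

Running it over a larger range gives a rough idea of how many entries would otherwise have to be maintained by hand.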
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
From: Juri Linkov @ 2019-07-25 21:35 UTC
To: emacs-devel

> How about the following? It passes your tests with the FIXMEs
> uncommented (and isearch for multiple iotas matches multiple iotas +
> combining diacriticals).
>
> I deliberately restricted it to lower case characters, since the
> roundtripping fails for İ and a large number of titlecase characters.
>
> diff --git i/lisp/char-fold.el w/lisp/char-fold.el
> index f379229e6c..91fd7ddc28 100644
> --- i/lisp/char-fold.el
> +++ w/lisp/char-fold.el
> @@ -108,6 +108,17 @@
> (car next-decomp)))
> (funcall make-decomp-match-char (list (car next-decomp)) char)))
> (setq dec next-decomp)))
> + ;; If there is no precomposed uppercase version of a
> + ;; character with diacriticals, we also add a mapping
> + ;; from the base character to the base character with
> + ;; combining diacriticals
> + (when (eq (get-char-code-property char 'general-category) 'Ll)
> + (let* ((str (char-to-string char))
> + (upper (upcase str))
> + (roundtrip (downcase upper)))
> + (when (> (length roundtrip) 1)
> + (aset equiv (aref roundtrip 0)
> + (cons roundtrip (aref equiv (aref roundtrip 0)))))))
> ;; Do it again, without the non-spacing characters.
> ;; This allows 'a' to match 'ä'.
> (let ((simpler-decomp nil)

If there are many such cases, then it is indeed better to handle them
automatically (as long as this doesn't slow things down too much)
instead of adding them one by one to the default values. Does this
handle ß as well?
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters, Re: search-default-mode char-fold-to-regexp and Greek Extended block characters 2019-07-25 21:35 ` Juri Linkov @ 2019-07-26 11:09 ` Robert Pluim 2019-07-26 18:38 ` Juri Linkov 0 siblings, 1 reply; 32+ messages in thread From: Robert Pluim @ 2019-07-26 11:09 UTC (permalink / raw) To: Juri Linkov; +Cc: emacs-devel >>>>> On Fri, 26 Jul 2019 00:35:51 +0300, Juri Linkov <juri@linkov.net> said: Juri> If there are many such cases, then better to handle them automatically indeed Juri> (if this doesn't cause slowdown too much) instead of adding them one by one Juri> to the default values. Does this handle ß as well? There are 74, and I donʼt want to maintain such a list by hand :-). ß is not a complex character, so is never looked at here. But if we hoist the checking out of the loop over complex characters, we can make that work as well (this supersedes my previous patch). I have no idea of the performance impact of all this. diff --git i/lisp/char-fold.el w/lisp/char-fold.el index f379229e6c..bee485c8ed 100644 --- i/lisp/char-fold.el +++ w/lisp/char-fold.el @@ -49,6 +49,36 @@ (funcall func (car char) v table))) table)) + (map-char-table + (lambda (char category) + ;; If the uppercase version of a character is not a single + ;; character, we add a mapping from the first character of + ;; the uppercase version to the lowercase character. This + ;; handles eg ß => SS and ῗ => Ϊ́ + + (when (eq category 'Ll) + (let ((start char) + (end char)) + (when (consp char) + (setq start (car char) + end (cdr char))) + (while (<= start end) + (let* ((str (char-to-string start)) + (upper (upcase str)) + (roundtrip (downcase upper))) + (when (> (length roundtrip) 1) + ;; Complex characters map to the decomposed version + ;; + diacriticals, simple characters map to the + ;; base char. Also add the reverse mapping for + ;; simple characters. 
+ (if (cdr (get-char-code-property start 'decomposition)) + (setq str roundtrip) + (aset equiv start + (cons roundtrip (aref equiv start)))) + (aset equiv (aref roundtrip 0) + (cons str (aref equiv (aref roundtrip 0)))))) + (setq start (+ 1 start)))))) + (unicode-property-table-internal 'general-category)) ;; Compile a list of all complex characters that each simple ;; character should match. ;; In summary this loop does 3 things: diff --git i/test/lisp/char-fold-tests.el w/test/lisp/char-fold-tests.el index e519435ef0..cf155d3cb5 100644 --- i/test/lisp/char-fold-tests.el +++ w/test/lisp/char-fold-tests.el @@ -154,9 +154,9 @@ char-fold--test-without-customization ("ι" "ί" ;; 1 level decomposition "ί" ;; 2 level decomposition - ;; FIXME: - ;; "ΐ" ;; 3 level decomposition + "ΐ" ;; 3 level decomposition ) + ("ß" "ss") ))) (dolist (strings matches) (apply 'char-fold--test-match-exactly strings)))) @@ -165,7 +165,6 @@ char-fold--test-with-customization :tags '(:expensive-test) (let* ((char-fold-include '( - (?ß "ss") ;; de (?o "ø") ;; da no nb nn (?l "ł") ;; pl )) @@ -184,8 +183,7 @@ char-fold--test-with-customization '( ("e" "ℯ" "ḗ" "ë" "ë") ("е" "ё" "ё") - ("ι" "ί" "ί" - ;; FIXME: "ΐ" + ("ι" "ί" "ί" "ΐ" ) ("ß" "ss") ("o" "ø") ^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters
From: Juri Linkov @ 2019-07-26 18:38 UTC
To: emacs-devel

> Juri> If there are many such cases, then better to handle them automatically indeed
> Juri> (if this doesn't cause slowdown too much) instead of adding them one by one
> Juri> to the default values. Does this handle ß as well?
>
> There are 74, and I donʼt want to maintain such a list by hand :-).

Yes, 74 is too tedious to maintain by hand, so it is better to install
your previous patch (if it doesn't have the problem mentioned below),
since there are only 3 such complex characters (handled by your newer
patch) that are easy to add by hand:

  '((?ß "ss")
    (?ΐ "ΐ")
    (?ΰ "ΰ"))

> ß is not a complex character, so is never looked at here. But if we
> hoist the checking out of the loop over complex characters, we can
> make that work as well (this supersedes my previous patch).
>
> I have no idea of the performance impact of all this.
> [...]
> + (aset equiv (aref roundtrip 0)
> + (cons str (aref equiv (aref roundtrip 0))))))

It seems this adds a symmetric decomposition from the first character
of "ss", i.e. from ?s to "ß". Shouldn't this rather update
'equiv-multi' instead?

OTOH, I see no reason to add symmetric decompositions by default, since
they are handled by the option 'char-fold-symmetric'.
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters 2019-07-26 18:38 ` Juri Linkov @ 2019-07-29 8:32 ` Robert Pluim 2019-07-29 18:09 ` Juri Linkov 0 siblings, 1 reply; 32+ messages in thread From: Robert Pluim @ 2019-07-29 8:32 UTC (permalink / raw) To: Juri Linkov; +Cc: emacs-devel >>>>> On Fri, 26 Jul 2019 21:38:30 +0300, Juri Linkov <juri@linkov.net> said: Juri> If there are many such cases, then better to handle them automatically indeed Juri> (if this doesn't cause slowdown too much) instead of adding them one by one Juri> to the default values. Does this handle ß as well? >> >> There are 74, and I donʼt want to maintain such a list by hand :-). Juri> Yes, 74 is too tedious to maintain by hand, so better to install your Juri> previous patch (if it doesn't have the problem mentioned below) since The only difference between v2 and v1 of the patch is that v2 handles ß, so v1 is probably better. Juri> there are only 3 such complex characters (handled by your newer patch) Juri> that is easy to add by hand: Juri> '((?ß "ss") Juri> (?ΐ "ΐ") Juri> (?ΰ "ΰ")) I donʼt understand this comment. With v1 of the patch, ß is the only one that would need to be added by hand to char-fold--default-include >> ß is not a complex character, so is never looked at here. But if we >> hoist the checking out of the loop over complex characters, we can >> make that work as well (this supersedes my previous patch). >> >> I have no idea of the performance impact of all this. >> [...] >> + (aset equiv (aref roundtrip 0) >> + (cons str (aref equiv (aref roundtrip 0)))))) Juri> It seems this adds a symmetric decomposition from the first character of "ss", Juri> i.e. from ?s to "ß". Shouldn't this rather update 'equiv-multi' instead? Yes, thinko on my part. Juri> OTOH, I see no reason to add symmetric decompositions by default since Juri> they are handled by the option 'char-fold-symmetric'. OK. Sounds like v1 is the winner. Iʼll clean it up and commit when ready. 
Robert
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters 2019-07-29 8:32 ` Robert Pluim @ 2019-07-29 18:09 ` Juri Linkov 2019-07-30 8:09 ` Robert Pluim 0 siblings, 1 reply; 32+ messages in thread From: Juri Linkov @ 2019-07-29 18:09 UTC (permalink / raw) To: emacs-devel > I donʼt understand this comment. With v1 of the patch, ß is the only > one that would need to be added by hand to char-fold--default-include You are right, v1 of the patch is needed to remove ι and υ from the default values, not ß. The only problem with v1 of the patch is that it folds "f" to "ff" and "fl", so typing `C-s f' matches "ff", `C-s s' matches "st", etc. ^ permalink raw reply [flat|nested] 32+ messages in thread
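The ligature problem mentioned above can be seen by looking at how many letters a character's decomposition contains. The following is only a sketch: the helper names `my-letters-in-decomposition` and `my-ligature-like-p` are hypothetical and not part of char-fold.el, and the check (more than one letter in a one-step decomposition) is an assumed heuristic, not what any version of the patch does.

```elisp
(defun my-letters-in-decomposition (char)
  "Return the letters in CHAR's one-step decomposition, ignoring any tag."
  (let ((dec (get-char-code-property char 'decomposition))
        (letters nil))
    ;; Compatibility decompositions start with a tag symbol such as
    ;; `compat'; drop it so only character codes remain.
    (when (symbolp (car dec))
      (setq dec (cdr dec)))
    (dolist (c dec (nreverse letters))
      ;; Keep letters only; combining marks have category Mn and are skipped.
      (when (memq (get-char-code-property c 'general-category)
                  '(Lu Ll Lt Lm Lo))
        (push c letters)))))

(defun my-ligature-like-p (char)
  "Return non-nil if CHAR decomposes into more than one letter."
  (> (length (my-letters-in-decomposition char)) 1))

;; U+FB00 (LATIN SMALL LIGATURE FF) decomposes into two letters "f" "f",
;; while U+03AF (iota with tonos) is one letter plus a combining accent.
(my-ligature-like-p #xfb00) ; => t
(my-ligature-like-p #x03af) ; => nil
```

A filter of this kind would separate "f" matching "ff" from "ι" matching "ί", but whether that is the right rule is exactly the open question in this subthread.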
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters 2019-07-29 18:09 ` Juri Linkov @ 2019-07-30 8:09 ` Robert Pluim 2019-07-30 10:15 ` Eli Zaretskii 0 siblings, 1 reply; 32+ messages in thread From: Robert Pluim @ 2019-07-30 8:09 UTC (permalink / raw) To: Juri Linkov; +Cc: emacs-devel >>>>> On Mon, 29 Jul 2019 21:09:03 +0300, Juri Linkov <juri@linkov.net> said: >> I donʼt understand this comment. With v1 of the patch, ß is the only >> one that would need to be added by hand to char-fold--default-include Juri> You are right, v1 of the patch is needed to remove ι and υ Juri> from the default values, not ß. Juri> The only problem with v1 of the patch is that it folds "f" to "ff" Juri> and "fl", so typing `C-s f' matches "ff", `C-s s' matches "st", etc. So it does. More thought required. Is there a Unicode spec for this sort of stuff? Robert ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters 2019-07-30 8:09 ` Robert Pluim @ 2019-07-30 10:15 ` Eli Zaretskii 0 siblings, 0 replies; 32+ messages in thread From: Eli Zaretskii @ 2019-07-30 10:15 UTC (permalink / raw) To: emacs-devel, Robert Pluim, Juri Linkov On July 30, 2019 11:09:14 AM GMT+03:00, Robert Pluim <rpluim@gmail.com> wrote: > > Is there a Unicode spec for this > sort of stuff? The Unicode guidelines for this are in UTS#10 Unicode Collation Algorithm, and specifically Section 11 Searching and Matching there. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters 2019-07-24 23:12 ` Juri Linkov 2019-07-25 0:18 ` Basil L. Contovounesios @ 2019-07-25 2:36 ` Eli Zaretskii 2019-07-25 8:59 ` Robert Pluim 2019-07-25 8:46 ` Robert Pluim 2 siblings, 1 reply; 32+ messages in thread From: Eli Zaretskii @ 2019-07-25 2:36 UTC (permalink / raw) To: Juri Linkov; +Cc: emacs-devel > From: Juri Linkov <juri@linkov.net> > Date: Thu, 25 Jul 2019 02:12:01 +0300 > > This is an interesting case like (upcase "ß") => "SS" that required > adding (?ß "ss") to pass the tests. Isn't there now an upper-case eszet that eliminates that problem? ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters 2019-07-25 2:36 ` Eli Zaretskii @ 2019-07-25 8:59 ` Robert Pluim 2019-07-25 12:53 ` Eli Zaretskii 0 siblings, 1 reply; 32+ messages in thread From: Robert Pluim @ 2019-07-25 8:59 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel, Juri Linkov >>>>> On Thu, 25 Jul 2019 05:36:12 +0300, Eli Zaretskii <eliz@gnu.org> said: >> From: Juri Linkov <juri@linkov.net> >> Date: Thu, 25 Jul 2019 02:12:01 +0300 >> >> This is an interesting case like (upcase "ß") => "SS" that required >> adding (?ß "ss") to pass the tests. Eli> Isn't there now an upper-case eszet that eliminates that problem? There is, but the mapping is one-way. UnicodeData.txt has: 00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;; 1E9E;LATIN CAPITAL LETTER SHARP S;Lu;0;L;;;;;N;;;;00DF; and SpecialCasing.txt has 00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S which gives us (upcase "ß") => "SS" ; \udf => "SS" (downcase "ẞ") => "ß" ; \u1e9e => \udf Perhaps this is different in a later version of Unicode than what weʼre using. Robert ^ permalink raw reply [flat|nested] 32+ messages in thread
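The one-way mapping described above can be checked from Lisp. This is a small sketch, assuming an Emacs with special-casing support (the results quoted in the message above); the helper name `my-case-roundtrip-p` is hypothetical.

```elisp
(defun my-case-roundtrip-p (string)
  "Return non-nil if upcasing and then downcasing STRING gives STRING back."
  (string= (downcase (upcase string)) string))

;; Ordinary letters survive the round trip.
(my-case-roundtrip-p "abc")

;; U+00DF upcases to "SS", which downcases to "ss", so the round trip
;; does not return the original sharp s.
(my-case-roundtrip-p "\u00df")

;; The individual mappings reported in the message above:
(list (upcase "\u00df")     ; "SS"
      (downcase "\u1e9e"))  ; U+00DF
```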
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters 2019-07-25 8:59 ` Robert Pluim @ 2019-07-25 12:53 ` Eli Zaretskii 0 siblings, 0 replies; 32+ messages in thread From: Eli Zaretskii @ 2019-07-25 12:53 UTC (permalink / raw) To: Robert Pluim; +Cc: emacs-devel > From: Robert Pluim <rpluim@gmail.com> > Cc: Juri Linkov <juri@linkov.net>, emacs-devel@gnu.org > Date: Thu, 25 Jul 2019 10:59:39 +0200 > > Eli> Isn't there now an upper-case eszet that eliminates that problem? > > There is, but the mapping is one-way. Right you are, thanks. > Perhaps this is different in a later version of Unicode than what > weʼre using. We are using the latest official version 12.0 of Unicode. And this data is unchanged in the current draft for Unicode 13.0, so I think we are good. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters 2019-07-24 23:12 ` Juri Linkov 2019-07-25 0:18 ` Basil L. Contovounesios 2019-07-25 2:36 ` Eli Zaretskii @ 2019-07-25 8:46 ` Robert Pluim 2019-07-25 18:46 ` Juri Linkov 2 siblings, 1 reply; 32+ messages in thread From: Robert Pluim @ 2019-07-25 8:46 UTC (permalink / raw) To: Juri Linkov; +Cc: emacs-devel >>>>> On Thu, 25 Jul 2019 02:12:01 +0300, Juri Linkov <juri@linkov.net> said: Juri> This is an interesting case like (upcase "ß") => "SS" that required Juri> adding (?ß "ss") to pass the tests. So I guess we need to add (?ι "ΐ") Juri> for the tests to pass: This is OK Juri> diff --git a/test/lisp/char-fold-tests.el b/test/lisp/char-fold-tests.el Juri> index e519435ef0..3819f3919d 100644 Juri> --- a/test/lisp/char-fold-tests.el Juri> +++ b/test/lisp/char-fold-tests.el Juri> @@ -166,6 +165,7 @@ char-fold--test-with-customization Juri> (let* ((char-fold-include Juri> '( Juri> (?ß "ss") ;; de Juri> + (?ι "ΐ") ;; el Juri> (?o "ø") ;; da no nb nn Juri> (?l "ł") ;; pl Juri> )) Juri> @@ -184,9 +184,7 @@ char-fold--test-with-customization Juri> '( Juri> ("e" "ℯ" "ḗ" "ë" "ë") Juri> ("е" "ё" "ё") Juri> - ("ι" "ί" "ί" Juri> - ;; FIXME: "ΐ" Juri> - ) Juri> + ("ι" "ί" "ί" "ΐ") Juri> ("ß" "ss") Juri> ("o" "ø") Juri> ("l" "ł") Juri> But this is only for char-fold--test-with-customization. OTOH, for Juri> char-fold--test-without-customization we need also to change the default Juri> value in char-fold.el like: Juri> diff --git a/lisp/char-fold.el b/lisp/char-fold.el Juri> index f379229e6c..c4add03bd9 100644 Juri> --- a/lisp/char-fold.el Juri> +++ b/lisp/char-fold.el Juri> @@ -27,7 +27,8 @@ Juri> (defconst char-fold--default-include Juri> '((?\" """ "“" "”" "”" "„" "⹂" "〞" "‟" "‟" "❞" "❝" "❠" "“" "„" "〝" "〟" "🙷" "🙶" "🙸" "«" "»") Juri> (?' 
"❟" "❛" "❜" "‘" "’" "‚" "‛" "‚" "" "❮" "❯" "‹" "›") Juri> - (?` "❛" "‘" "‛" "" "❮" "‹"))) Juri> + (?` "❛" "‘" "‛" "" "❮" "‹") Juri> + (?ι "ΐ"))) Juri> (defconst char-fold--default-exclude nil) Juri> (defconst char-fold--default-symmetric nil) Juri> (defconst char-fold--previous (list char-fold--default-include But this one I donʼt understand. Searching for iota (capital or small) in a buffer containing ΐ ΐ or Ϊ́ already works with char-fold-to-regexp, so why is this needed? Robert ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters 2019-07-25 8:46 ` Robert Pluim @ 2019-07-25 18:46 ` Juri Linkov 2019-07-26 6:04 ` Eli Zaretskii 0 siblings, 1 reply; 32+ messages in thread From: Juri Linkov @ 2019-07-25 18:46 UTC (permalink / raw) To: emacs-devel > But this one I donʼt understand. Searching for iota (capital or small) > in a buffer containing ΐ ΐ or Ϊ́ already works with > char-fold-to-regexp, so why is this needed? Searching for ι finds ΐ only when searching for a single letter ι because the search matches the first part of ΐ that contains the base character ι and ignores the remaining combining accents like ̈́ So for testing you need to search for longer strings, e.g. in a buffer with this text "ΐΐΪ́." try to search for "ιιι." It fails to find this text without adding (?ι "ΐ") to char-fold-include. ^ permalink raw reply [flat|nested] 32+ messages in thread
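The test described above (searching for a longer string rather than a single letter) can be sketched outside isearch. The helper name `my-char-fold-matches-whole-p` is hypothetical, and the code points in the Greek example are an assumption about which characters the sample text contains. Anchoring the regexp at both ends is what exposes the problem: a match that covers only the base letter and leaves the combining accents behind no longer counts.

```elisp
(require 'char-fold)

(defun my-char-fold-matches-whole-p (search text)
  "Return non-nil if the char-folded regexp for SEARCH matches all of TEXT."
  (let ((case-fold-search t))
    ;; \\` and \\' anchor the match to the start and end of TEXT, so a
    ;; partial match that stops before trailing accents is rejected.
    (string-match-p (concat "\\`" (char-fold-to-regexp search) "\\'")
                    text)))

;; A simple case that folds: "e" against precomposed U+00E9.
(my-char-fold-matches-whole-p "e" "\u00e9")

;; The case from this message: three iotas and a period against
;; U+0390, U+1FD3, and capital U+03AA followed by U+0301.
;; The result depends on the Emacs version and on `char-fold-include'.
(my-char-fold-matches-whole-p "\u03b9\u03b9\u03b9."
                              "\u0390\u1fd3\u03aa\u0301.")
```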
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters 2019-07-25 18:46 ` Juri Linkov @ 2019-07-26 6:04 ` Eli Zaretskii 2019-07-26 18:40 ` Juri Linkov 0 siblings, 1 reply; 32+ messages in thread From: Eli Zaretskii @ 2019-07-26 6:04 UTC (permalink / raw) To: Juri Linkov; +Cc: emacs-devel > From: Juri Linkov <juri@linkov.net> > Date: Thu, 25 Jul 2019 21:46:20 +0300 > > > But this one I donʼt understand. Searching for iota (capital or small) > > in a buffer containing ΐ ΐ or Ϊ́ already works with > > char-fold-to-regexp, so why is this needed? > > Searching for ι finds ΐ only when searching for a single letter ι > because the search matches the first part of ΐ that contains the base > character ι and ignores the remaining combining accents like ̈́ > > So for testing you need to search for longer strings, e.g. > in a buffer with this text "ΐΐΪ́." try to search for "ιιι." > > It fails to find this text without adding (?ι "ΐ") > to char-fold-include. Maybe we should decide that this is a limitation of the current implementation, and instead work on a more correct implementation, which actually "folds" characters to their base variants as the search proceed. Let's not forget that the current implementation was known to be limited from the get-go, and we only accepted it because the "full" one was too complex and required non-trivial changes on the C level. So we shouldn't go too far into making the current implementation support everything that the full one will inherently support. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters 2019-07-26 6:04 ` Eli Zaretskii @ 2019-07-26 18:40 ` Juri Linkov 2019-07-26 19:13 ` Eli Zaretskii 0 siblings, 1 reply; 32+ messages in thread From: Juri Linkov @ 2019-07-26 18:40 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel >> > But this one I donʼt understand. Searching for iota (capital or small) >> > in a buffer containing ΐ ΐ or Ϊ́ already works with >> > char-fold-to-regexp, so why is this needed? >> >> Searching for ι finds ΐ only when searching for a single letter ι >> because the search matches the first part of ΐ that contains the base >> character ι and ignores the remaining combining accents like ̈́ >> >> So for testing you need to search for longer strings, e.g. >> in a buffer with this text "ΐΐΪ́." try to search for "ιιι." >> >> It fails to find this text without adding (?ι "ΐ") >> to char-fold-include. > > Maybe we should decide that this is a limitation of the current > implementation, and instead work on a more correct implementation, > which actually "folds" characters to their base variants as the search > proceed. > > Let's not forget that the current implementation was known to be > limited from the get-go, and we only accepted it because the "full" > one was too complex and required non-trivial changes on the C level. > So we shouldn't go too far into making the current implementation > support everything that the full one will inherently support. I consider the current regexp-based implementation as a fully functional prototype with complete test coverage, so after switching the implementation later to C, the same tests should still pass. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters 2019-07-26 18:40 ` Juri Linkov @ 2019-07-26 19:13 ` Eli Zaretskii 0 siblings, 0 replies; 32+ messages in thread From: Eli Zaretskii @ 2019-07-26 19:13 UTC (permalink / raw) To: Juri Linkov; +Cc: emacs-devel > From: Juri Linkov <juri@linkov.net> > Cc: emacs-devel@gnu.org > Date: Fri, 26 Jul 2019 21:40:58 +0300 > > > Let's not forget that the current implementation was known to be > > limited from the get-go, and we only accepted it because the "full" > > one was too complex and required non-trivial changes on the C level. > > So we shouldn't go too far into making the current implementation > > support everything that the full one will inherently support. > > I consider the current regexp-based implementation as a fully functional > prototype with complete test coverage, so after switching the > implementation later to C, the same tests should still pass. That's missing the point I was trying to make. My point is that there might be some corner use cases which we might decide not to support with the current implementation, and encourage people who'd like to have a better support to work on the full implementation as described by Unicode. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: search-default-mode char-fold-to-regexp and Greek Extended block characters 2019-07-19 14:18 search-default-mode char-fold-to-regexp and Greek Extended block characters Robert Pluim 2019-07-19 14:37 ` Eli Zaretskii @ 2019-07-19 18:53 ` Juri Linkov 1 sibling, 0 replies; 32+ messages in thread From: Juri Linkov @ 2019-07-19 18:53 UTC (permalink / raw) To: emacs-devel > This is an offshoot of the discussion in Bug#36717. > > I have > > (setq search-default-mode 'char-fold-to-regexp) > > In a buffer containing > > 1. ί (\u03af GREEK SMALL LETTER IOTA WITH TONOS) > 2. ί (\u1f77 GREEK SMALL LETTER IOTA WITH OXIA) > > I do C-s C-x 8 RET 03b9 > > isearch will find the iota with tonos, but not the iota with oxia, > even though the decomposition of the decomposition of the latter > contains iota. Is that expected? Decomposition of GREEK SMALL LETTER IOTA WITH TONOS is GREEK SMALL LETTER IOTA + COMBINING ACUTE ACCENT, whereas decomposition of GREEK SMALL LETTER IOTA WITH OXIA is GREEK SMALL LETTER IOTA WITH TONOS, i.e. it's just an alias, but so far several levels of indirection (from GREEK SMALL LETTER IOTA to GREEK SMALL LETTER IOTA WITH OXIA via GREEK SMALL LETTER IOTA WITH TONOS) was unsupported in char-fold.el. This should be fixed in bug#35689. ^ permalink raw reply [flat|nested] 32+ messages in thread
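The chain of decompositions described above can be followed from Lisp. This is a minimal sketch, assuming `get-char-code-property` returns a list holding just the character itself (or nil) when there is nothing to decompose; the helper name `my-full-decomposition` is hypothetical and this is not how char-fold.el builds its table.

```elisp
(defun my-full-decomposition (char)
  "Return the characters obtained by decomposing CHAR repeatedly."
  (let ((dec (get-char-code-property char 'decomposition)))
    ;; Drop a leading compatibility tag such as `compat', if present.
    (when (symbolp (car dec))
      (setq dec (cdr dec)))
    (if (or (null dec) (equal dec (list char)))
        ;; Nothing left to decompose.
        (list char)
      ;; Otherwise decompose every component in turn and concatenate.
      (apply #'append (mapcar #'my-full-decomposition dec)))))

;; One step from U+1F77 only reaches U+03AF (iota with tonos):
(get-char-code-property #x1f77 'decomposition) ; => (943), i.e. (#x3af)

;; Following the chain reaches the base iota plus the combining acute:
(my-full-decomposition #x1f77) ; => (953 769), i.e. (#x3b9 #x301)
```

Folding only on the one-step result is what leaves U+1F77 unreachable from a search for plain iota.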