* bug#13041: 24.2; diacritic-fold-search
@ 2012-11-30 18:22 Lewis Perin
2012-11-30 18:51 ` Juri Linkov
` (2 more replies)
0 siblings, 3 replies; 83+ messages in thread
From: Lewis Perin @ 2012-11-30 18:22 UTC (permalink / raw)
To: 13041
This is not a bug report but a feature request, so I am omitting
diagnostic information.
Emacs search has long been able to toggle between (a) ignoring the
distinction between upper- and lower-case characters
(case-fold-search) and (b) searching for only one of the pair. One
could say Emacs offers the choice between (a) searching for all
members of a (2-member) equivalence class and (b) searching for only
one member.
There are larger equivalence classes of characters with practical use
which Emacs is currently unaware of: the groups of characters
consisting of an unadorned (ASCII) character plus all its
diacritic-adorned versions. Currently, if I want to search for both
“apres” and “après”, I need an additive regular expression. I would
like to do this as easily as I can search for “apres” and “Apres”. I
would be delighted if Emacs implemented the equivalence classes
spelled out here:
http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html
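For illustration only, the regular-expression workaround mentioned above might look like the following sketch. The function name `sketch-match-apres' and the contents of the character class are assumptions made for this example; they are not part of Emacs.

```elisp
;; Sketch: emulate diacritic folding for a single word by writing the
;; character class by hand.  The class lists only a few variants of "e"
;; and is illustrative, not exhaustive.
(defun sketch-match-apres (text)
  "Return the position where \"apres\" (accented or not) matches in TEXT."
  (let ((case-fold-search nil))         ; keep letter case significant
    (string-match-p "apr[eèéêë]s" text)))

;; (sketch-match-apres "apres") ⇒ 0
;; (sketch-match-apres "après") ⇒ 0
;; (sketch-match-apres "Apres") ⇒ nil
```

Every letter that can carry a diacritic needs its own class, which is why this does not scale to interactive searching.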
I might add that diacritics folding is the default in web search
engines. It is also a feature of at least one Web browser in
searching the text of a displayed page (Chrome.)
I’m sure that maintaining the core of Emacs is a big job, and I’m
grateful for the skill and effort that go into that task, including
your consideration of this request!
/Lew
---
Lew Perin | perin@acm.org | http://babelcarp.org
^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search
2012-11-30 18:22 bug#13041: 24.2; diacritic-fold-search Lewis Perin
@ 2012-11-30 18:51 ` Juri Linkov
2012-11-30 21:07 ` Lewis Perin
2012-11-30 19:31 ` Stefan Monnier
2016-08-31 14:45 ` Michael Albinus
2 siblings, 1 reply; 83+ messages in thread
From: Juri Linkov @ 2012-11-30 18:51 UTC (permalink / raw)
To: Lewis Perin; +Cc: 13041, perin
> Currently, if I want to search for both “apres” and “après”,
> I need an additive regular expression. I would like to do this as
> easily as I can search for “apres” and “Apres”. I would be delighted
> if Emacs implemented the equivalence classes spelled out here:
>
> http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html
This could be implemented in isearch using a recipe from
http://thread.gmane.org/gmane.emacs.devel/117003/focus=117959
Instead of hard-coding a list of equivalent characters
I guess it should be possible to do this automatically
using Unicode information about characters.
* bug#13041: 24.2; diacritic-fold-search
2012-11-30 18:22 bug#13041: 24.2; diacritic-fold-search Lewis Perin
2012-11-30 18:51 ` Juri Linkov
@ 2012-11-30 19:31 ` Stefan Monnier
2016-08-31 14:45 ` Michael Albinus
2 siblings, 0 replies; 83+ messages in thread
From: Stefan Monnier @ 2012-11-30 19:31 UTC (permalink / raw)
To: Lewis Perin; +Cc: 13041, perin
severity 13041 wishlist
thanks
> diacritic-adorned versions. Currently, if I want to search for both
> “apres” and “après”, I need an additive regular expression. I would
> like to do this as easily as I can search for “apres” and “Apres”.
That would be a very welcome feature, indeed.
Stefan
* bug#13041: 24.2; diacritic-fold-search
2012-11-30 18:51 ` Juri Linkov
@ 2012-11-30 21:07 ` Lewis Perin
2012-12-01 0:27 ` Juri Linkov
0 siblings, 1 reply; 83+ messages in thread
From: Lewis Perin @ 2012-11-30 21:07 UTC (permalink / raw)
To: Juri Linkov; +Cc: 13041
Juri Linkov writes:
> > Currently, if I want to search for both “apres” and “après”,
> > I need an additive regular expression. I would like to do this as
> > easily as I can search for “apres” and “Apres”. I would be delighted
> > if Emacs implemented the equivalence classes spelled out here:
> >
> > http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html
>
> This could be implemented in isearch using a recipe from
>
> http://thread.gmane.org/gmane.emacs.devel/117003/focus=117959
>
> Instead of hard-coding a list of equivalent characters
> I guess it should be possible to do this automatically
> using Unicode information about characters.
I never thought I was the first to wonder about this!
In the last message of that thread, you say “Provided it doesn’t make
the search slow, it would be nice to add it to Emacs activating on
some user settings.” Do you remember if that technique turned out to
be tolerably speedy?
/Lew
---
Lew Perin | perin@acm.org | http://babelcarp.org
* bug#13041: 24.2; diacritic-fold-search
2012-11-30 21:07 ` Lewis Perin
@ 2012-12-01 0:27 ` Juri Linkov
2012-12-01 0:47 ` Drew Adams
2012-12-01 8:32 ` Eli Zaretskii
0 siblings, 2 replies; 83+ messages in thread
From: Juri Linkov @ 2012-12-01 0:27 UTC (permalink / raw)
To: Lewis Perin; +Cc: 13041, perin
> In the last message of that thread, you say “Provided it doesn’t make
> the search slow, it would be nice to add it to Emacs activating on
> some user settings.” Do you remember if that technique turned out to
> be tolerably speedy?
Yes, I have no problems with the speed. The problem is how to
disable this feature when it is active. We need a special key
to toggle it in Isearch. One variant is M-s ~ where the easy-to-type
TILDE character represents diacritics. Also it's unclear whether the
Isearch prompt should indicate its active state as e.g.
Diacritic I-search:
* bug#13041: 24.2; diacritic-fold-search
2012-12-01 0:27 ` Juri Linkov
@ 2012-12-01 0:47 ` Drew Adams
2012-12-01 0:49 ` Drew Adams
2012-12-01 8:32 ` Eli Zaretskii
1 sibling, 1 reply; 83+ messages in thread
From: Drew Adams @ 2012-12-01 0:47 UTC (permalink / raw)
To: 'Juri Linkov', 'Lewis Perin'; +Cc: 13041, perin
> it's unclear whether the Isearch prompt should indicate
> its active state
Ǐsearch
(But perhaps that suggests recognizing, rather than ignoring, diacritics.)
* bug#13041: 24.2; diacritic-fold-search
2012-12-01 0:47 ` Drew Adams
@ 2012-12-01 0:49 ` Drew Adams
2012-12-01 1:20 ` Lew Perin
0 siblings, 1 reply; 83+ messages in thread
From: Drew Adams @ 2012-12-01 0:49 UTC (permalink / raw)
To: 'Juri Linkov', 'Lewis Perin'; +Cc: 13041, perin
> > it's unclear whether the Isearch prompt should indicate
> > its active state
>
> Isearch
>
> (But perhaps that suggests recognizing, rather than ignoring,
> diacritics.)
Hm. That was a capital I with caron when I sent it...
* bug#13041: 24.2; diacritic-fold-search
2012-12-01 0:49 ` Drew Adams
@ 2012-12-01 1:20 ` Lew Perin
2012-12-01 6:50 ` Drew Adams
0 siblings, 1 reply; 83+ messages in thread
From: Lew Perin @ 2012-12-01 1:20 UTC (permalink / raw)
To: Drew Adams; +Cc: <13041@debbugs.gnu.org>, <perin@acm.org>
On Nov 30, 2012, at 7:49 PM, "Drew Adams" <drew.adams@oracle.com> wrote:
>>> it's unclear whether the Isearch prompt should indicate
>>> its active state
>>
>> Isearch
>>
>> (But perhaps that suggests recognizing, rather than ignoring,
>> diacritics.)
>
> Hm. That was a capital I with caron when I sent it...
A caron-topped capital I is exactly what I got (on my iPhone.)
/Lew
---
Lew Perin | perin@acm.org | http://babelcarp.org
* bug#13041: 24.2; diacritic-fold-search
2012-12-01 1:20 ` Lew Perin
@ 2012-12-01 6:50 ` Drew Adams
0 siblings, 0 replies; 83+ messages in thread
From: Drew Adams @ 2012-12-01 6:50 UTC (permalink / raw)
To: 'Lew Perin'; +Cc: 13041, perin
> >>> it's unclear whether the Isearch prompt should indicate
> >>> its active state
> >>
> >> Isearch
> >>
> >> (But perhaps that suggests recognizing, rather than ignoring,
> >> diacritics.)
> >
> > Hm. That was a capital I with caron when I sent it...
>
> A caron-topped capital I is exactly what I got (on my iPhone.)
Great. I guess it's the encoding used in my mail client that's showing it with
no marks.
* bug#13041: 24.2; diacritic-fold-search
2012-12-01 0:27 ` Juri Linkov
2012-12-01 0:47 ` Drew Adams
@ 2012-12-01 8:32 ` Eli Zaretskii
2012-12-01 9:09 ` Eli Zaretskii
` (2 more replies)
1 sibling, 3 replies; 83+ messages in thread
From: Eli Zaretskii @ 2012-12-01 8:32 UTC (permalink / raw)
To: Juri Linkov; +Cc: perin, 13041, perin
> From: Juri Linkov <juri@jurta.org>
> Date: Sat, 01 Dec 2012 02:27:40 +0200
> Cc: 13041@debbugs.gnu.org, perin@acm.org
>
> > In the last message of that thread, you say “Provided it doesn’t make
> > the search slow, it would be nice to add it to Emacs activating on
> > some user settings.” Do you remember if that technique turned out to
> > be tolerably speedy?
>
> Yes, I have no problems with the speed. The problem is how to
> disable this feature when it is active. We need a special key
> to toggle it in Isearch. One variant is M-s ~ where the easy-to-type
> TILDE character represents diacritics. Also it's unclear whether the
> Isearch prompt should indicate its active state as e.g.
I don't understand why this thread is talking only about Latin
characters with diacritics. That is a special case of what Unicode
calls "compatibility equivalence" (q.e.). For example, even in the
Latin environments, don't you want to find "sniﬀ" when searching for
"sniff", and vice versa? And there are similar issues in many
non-Latin scripts.
The decomposition of a character such as 'ﬀ' is given by the Unicode
database, for example:
FB00;LATIN SMALL LIGATURE FF;Ll;0;L;<compat> 0066 0066;;;;N;;;;;
^^^^^^^^^^^^^^^^^^
(66 hex, or 102 decimal, is the codepoint of 'f').
Emacs already supports these decomposition properties. E.g.:
(get-char-code-property ?ﬀ 'decomposition) => (compat 102 102)
Another example, closer to the issue that triggered this thread:
(get-char-code-property ?è 'decomposition) => (101 768)
(If you want to understand why the previous example included "compat"
in the result, while this one doesn't, read more about Unicode
normalization forms. The distinction is irrelevant for the current
discussion.)
Using these properties, every search string can be converted to a
sequence of non-decomposable characters (this process is recursive,
because the 'decomposition' property can use characters that
themselves are decomposable). If the user wants to ignore diacritics,
then the diacritics should be dropped from the decomposition sequence
before starting the search. E.g., for the decomposition of è above,
we will drop the 768 and will be left with 101, which is 'e'. Then
searching for that string should apply the same decomposition
transformation to the text being searched, when comparing them.
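A minimal sketch of the recursive decomposition just described, for illustration. The names `sketch-fold-char' and `sketch-fold-string' are hypothetical, and treating only general category Mn as "diacritics" is an assumption; a real implementation would live in the search primitives rather than convert strings up front.

```elisp
(defun sketch-fold-char (char)
  "Return CHAR fully decomposed as a list of characters, minus Mn marks."
  (let ((decomp (get-char-code-property char 'decomposition)))
    ;; Skip a leading tag symbol such as `compat'.
    (unless (characterp (car decomp))
      (setq decomp (cdr decomp)))
    (if (or (null decomp) (equal decomp (list char)))
        ;; Not decomposable: keep it unless it is a nonspacing mark.
        (unless (eq (get-char-code-property char 'general-category) 'Mn)
          (list char))
      ;; Decomposable: recurse, since the parts may decompose further.
      (apply #'append (mapcar #'sketch-fold-char decomp)))))

(defun sketch-fold-string (text)
  "Return TEXT with every character decomposed and diacritics dropped."
  (concat (apply #'append (mapcar #'sketch-fold-char text))))

;; (sketch-fold-string "après") ⇒ "apres"
;; (sketch-fold-string "sniﬀ")  ⇒ "sniff"
```

Applying the same function to both the search string and the text being searched gives the comparison described above, at the cost of copying the text.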
This would be the most general way of solving this issue, a way that
is not limited to diacritics nor to Latin scripts. And doing that
will move Emacs closer to the goal of being Unicode compatible, since
support for this is required by the Unicode Standard.
By contrast, building and using custom data bases of equivalences that
are limited to diacritics in Latin scripts is not moving Emacs towards
that goal. It's just a hack, IMO.
* bug#13041: 24.2; diacritic-fold-search
2012-12-01 8:32 ` Eli Zaretskii
@ 2012-12-01 9:09 ` Eli Zaretskii
2012-12-01 16:38 ` Drew Adams
2012-12-02 0:27 ` Juri Linkov
2 siblings, 0 replies; 83+ messages in thread
From: Eli Zaretskii @ 2012-12-01 9:09 UTC (permalink / raw)
To: juri, perin; +Cc: 13041, perin
> Date: Sat, 01 Dec 2012 10:32:35 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: perin@panix.com, 13041@debbugs.gnu.org, perin@acm.org
>
> I don't understand why this thread is talking only about Latin
> characters with diacritics. That is a special case of what Unicode
> calls "compatibility equivalence" (q.e.).
^^^^
I meant "q.v.", of course. Sorry.
* bug#13041: 24.2; diacritic-fold-search
2012-12-01 8:32 ` Eli Zaretskii
2012-12-01 9:09 ` Eli Zaretskii
@ 2012-12-01 16:38 ` Drew Adams
2012-12-02 0:27 ` Juri Linkov
2 siblings, 0 replies; 83+ messages in thread
From: Drew Adams @ 2012-12-01 16:38 UTC (permalink / raw)
To: 'Eli Zaretskii', 'Juri Linkov'; +Cc: perin, 13041, perin
> I don't understand why this thread is talking only about Latin
> characters with diacritics. That is a special case of what Unicode
> calls "compatibility equivalence" (q.e.). For example, even in the
> Latin environments, don't you want to find "sniﬀ" when searching for
> "sniff", and vice versa? And there are similar issues in many
> non-Latin scripts.
Actually, in the original thread I made the same point.
Please see that discussion for this and other points.
http://lists.gnu.org/archive/html/help-gnu-emacs/2012-11/msg00429.html
> The decomposition of a character such as 'ﬀ' is given by
> the Unicode database... Emacs already supports these
> decomposition properties.
That's good news (new to me). So it sounds like even the most hopeful
wanna-haves of the discussion could perhaps be realized without too much
trouble.
> Using these properties, every search string can be converted to a
> sequence of non-decomposable characters (this process is recursive,
> because the 'decomposition' property can use characters that
> themselves are decomposable). If the user wants to ignore diacritics,
> then the diacritics should be dropped from the decomposition sequence
> before starting the search. E.g., for the decomposition of è above,
> we will drop the 768 and will be left with 101, which is 'e'. Then
> searching for that string should apply the same decomposition
> transformation to the text being searched, when comparing them.
>
> This would be the most general way of solving this issue, a way that
> is not limited to diacritics nor to Latin scripts. And doing that
> will move Emacs closer to the goal of being Unicode compatible, since
> support for this is required by the Unicode Standard.
This sounds great. I really hope someone with the time and knowledge adds such
a feature soon (even though, to be clear, I personally do not have much need for
it). I think it would be very handy for many users - most welcome.
* bug#13041: 24.2; diacritic-fold-search
2012-12-01 8:32 ` Eli Zaretskii
2012-12-01 9:09 ` Eli Zaretskii
2012-12-01 16:38 ` Drew Adams
@ 2012-12-02 0:27 ` Juri Linkov
2012-12-02 17:45 ` martin rudalics
2012-12-02 18:16 ` Eli Zaretskii
2 siblings, 2 replies; 83+ messages in thread
From: Juri Linkov @ 2012-12-02 0:27 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: perin, 13041, perin
> Using these properties, every search string can be converted to a
> sequence of non-decomposable characters (this process is recursive,
> because the 'decomposition' property can use characters that
> themselves are decomposable). If the user wants to ignore diacritics,
> then the diacritics should be dropped from the decomposition sequence
> before starting the search. E.g., for the decomposition of è above,
> we will drop the 768 and will be left with 101, which is 'e'. Then
> searching for that string should apply the same decomposition
> transformation to the text being searched, when comparing them.
Yes, using the `decomposition' property would be better than hard-coding
these decomposition mappings. Though I'm surprised to see case mappings
hard-coded in lisp/international/characters.el instead of using the
properties `uppercase' and `lowercase' during creation of case tables.
But nevertheless the `decomposition' property should be used to find
all decomposable characters. The question is how to use them in the search.
One solution is to use the case tables. I tried to build the case table
with the decomposed characters retrieved using the `decomposition' property
recursively:
(defvar decomposition-table nil)
(defun make-decomposition-table ()
  (let ((table (standard-case-table))
        canon)
    (setq canon (copy-sequence table))
    (let ((c #x0000) d)
      (while (<= c #xFFFD)
        (make-decomposition-table-1 canon c c)
        (setq c (1+ c))))
    (set-char-table-extra-slot table 1 canon)
    (set-char-table-extra-slot table 2 nil)
    (setq decomposition-table table)))
(defun make-decomposition-table-1 (canon c0 c1)
  (let ((d (get-char-code-property c1 'decomposition)))
    (when d
      (unless (characterp (car d)) (pop d))
      (if (eq c1 (car d))
          (aset canon c0 (car d))
        (make-decomposition-table-1 canon c0 (car d))))))
(make-decomposition-table)
Then a new Isearch command (the existing `isearch-toggle-case-fold'
can't be used because it enables/disables the standard case table)
could toggle between the current case table and the decomposition
case table using
(set-case-table decomposition-table)
After evaluating this, Isearch correctly finds all related characters
in every row of this example:
http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html
But it seems using the case table for decomposition has one limitation.
I see no way to ignore combining accent characters in the case table,
i.e. to map combining accent characters to nothing. These characters
have the general-category "Mn (Mark, Nonspacing)", so they should be ignored
in the search.
An alternative would be to build a regexp from the search string
like building a regexp for word-search:
(define-key isearch-mode-map "\M-sd" 'isearch-toggle-decomposition)
(defun isearch-toggle-decomposition ()
  "Toggle Unicode decomposition searching on or off."
  (interactive)
  (setq isearch-word (unless (eq isearch-word 'isearch-decomposition-regexp)
                       'isearch-decomposition-regexp))
  (if isearch-word (setq isearch-regexp nil))
  (setq isearch-success t isearch-adjusted t)
  (isearch-update))
(defun isearch-decomposition-regexp (string &optional _lax)
  "Return a regexp that matches decomposed Unicode characters in STRING."
  (mapconcat
   (lambda (c0)
     (if (eq (get-char-code-property c0 'general-category) 'Mn)
         ;; Mark-Nonspacing chars like COMBINING ACUTE ACCENT are optional.
         (concat (string c0) "?")
       (let ((c1 c0) c2 chars)
         (while (and (setq c2 (aref (char-table-extra-slot
                                     decomposition-table 2) c1))
                     (not (eq c2 c0)))
           (push c2 chars)
           (setq c1 c2))
         (if chars
             ;; Character alternatives from the case equivalences table.
             (concat "[" (string c0) chars "]")
           (string c0)))))
   string ""))
(put 'isearch-decomposition-regexp 'isearch-message-prefix "deco ")
This uses the decomposition table created above but instead of activating it,
it's necessary to "shuffle" the equivalences table with the following code
that prepares the table but doesn't enable it in the current buffer:
(with-temp-buffer (set-case-table decomposition-table))
The advantage of the regexp-based approach is making combining accents
optional in the search string. But there is another problem: how to ignore
combining accents in the buffer when the search string doesn't contain them.
With regexps this means adding a group of all possible combining accents
after every character in the search string like turning a search string
like "abc" into "a[́̂̃̄̆]?b[́̂̃̄̆]?c[́̂̃̄̆]?".
This would make the search slow, and I have no better idea.
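To make the transformation above concrete, here is a rough sketch of how such a regexp could be generated. The names `sketch-combining-marks' and `sketch-accent-tolerant-regexp' are hypothetical, the list of marks is a small illustrative subset rather than the whole Mn category, and `*' is used instead of `?' so that stacked marks are also skipped (a deviation from the example string above).

```elisp
(defconst sketch-combining-marks
  (string #x0300 #x0301 #x0302 #x0303 #x0304 #x0306 #x0308)
  "A few combining marks; an illustrative subset of category Mn.")

(defun sketch-accent-tolerant-regexp (text)
  "Return a regexp matching TEXT with optional combining marks after each char."
  (mapconcat (lambda (c)
               (concat (regexp-quote (string c))
                       "[" sketch-combining-marks "]*"))
             text ""))

;; The result matches "apres" written with or without a separate
;; combining accent after any letter:
;; (string-match-p (sketch-accent-tolerant-regexp "apres")
;;                 (string ?a ?p ?r ?e #x0300 ?s)) ⇒ 0
```

Note that this only covers decomposed text in the buffer; a precomposed "è" is a single character and still needs the equivalence-class handling shown earlier.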
* bug#13041: 24.2; diacritic-fold-search
2012-12-02 0:27 ` Juri Linkov
@ 2012-12-02 17:45 ` martin rudalics
2012-12-02 18:02 ` Eli Zaretskii
2012-12-02 21:39 ` Juri Linkov
2012-12-02 18:16 ` Eli Zaretskii
1 sibling, 2 replies; 83+ messages in thread
From: martin rudalics @ 2012-12-02 17:45 UTC (permalink / raw)
To: Juri Linkov; +Cc: perin, perin, 13041
> But nevertheless the `decomposition' property should be used to find
> all decomposable characters. The question is how to use them in the search.
Whatever solution you find most suitable here, it would be nice to come
up with a similar solution for sorting. I've been playing around with a
function like
(defun decomposed-string-lessp (string1 string2)
  "Return t if STRING1 is decomposition-less than STRING2."
  (let* ((length1 (length string1))
         (length2 (length string2))
         (min-length (min length1 length2))
         (index 0)
         type1 type2)
    (catch 'found
      (while (< index min-length)
        (setq type1 (car (get-char-code-property
                          (elt string1 index) 'decomposition)))
        (setq type2 (car (get-char-code-property
                          (elt string2 index) 'decomposition)))
        (cond
         ((< type1 type2)
          (throw 'found t))
         ((> type1 type2)
          (throw 'found nil)))
        ;; Continue.
        (setq index (1+ index)))
      ;; Shorter is less.
      (< length1 length2))))
but am not sure whether I'm missing something wrt the return value of
`get-char-code-property'.
martin
* bug#13041: 24.2; diacritic-fold-search
2012-12-02 17:45 ` martin rudalics
@ 2012-12-02 18:02 ` Eli Zaretskii
2012-12-03 10:16 ` martin rudalics
2012-12-02 21:39 ` Juri Linkov
1 sibling, 1 reply; 83+ messages in thread
From: Eli Zaretskii @ 2012-12-02 18:02 UTC (permalink / raw)
To: martin rudalics; +Cc: perin, 13041, perin
> Date: Sun, 02 Dec 2012 18:45:38 +0100
> From: martin rudalics <rudalics@gmx.at>
> CC: Eli Zaretskii <eliz@gnu.org>, perin@panix.com, 13041@debbugs.gnu.org,
> perin@acm.org
>
>         (setq type1 (car (get-char-code-property
>                           (elt string1 index) 'decomposition)))
>         (setq type2 (car (get-char-code-property
>                           (elt string2 index) 'decomposition)))
>         (cond
>          ((< type1 type2)
>           (throw 'found t))
>          ((> type1 type2)
>           (throw 'found nil)))
>         ;; Continue.
>         (setq index (1+ index)))
>       ;; Shorter is less.
>       (< length1 length2))))
>
> but am not sure whether I'm missing something wrt the return value of
> `get-char-code-property'.
Maybe only the fact that it can return a list whose car is 'compat',
see the examples I posted.
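One way to account for the `compat' case, sketched under assumptions: skip any leading tag symbol before taking the first character of the decomposition. The helper names `sketch-decomposition-base' and `sketch-decomposed-string-lessp' are hypothetical, and comparing only the first decomposed character of each element is a simplification.

```elisp
(defun sketch-decomposition-base (char)
  "Return the first character of CHAR's decomposition, skipping any tag."
  (let ((d (get-char-code-property char 'decomposition)))
    ;; A tagged decomposition looks like (compat 102 102); drop the tag.
    (unless (characterp (car d))
      (setq d (cdr d)))
    (or (car d) char)))

(defun sketch-decomposed-string-lessp (string1 string2)
  "Return t if STRING1 sorts before STRING2, comparing base characters."
  (string-lessp (concat (mapcar #'sketch-decomposition-base string1))
                (concat (mapcar #'sketch-decomposition-base string2))))

;; (sketch-decomposition-base ?è) ⇒ 101
;; (sketch-decomposition-base ?ﬀ) ⇒ 102
;; (sketch-decomposed-string-lessp "été" "etude") ⇒ t
```

Plain `string-lessp' returns nil for the last pair, because the code point of "é" is larger than that of "e".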
* bug#13041: 24.2; diacritic-fold-search
2012-12-02 0:27 ` Juri Linkov
2012-12-02 17:45 ` martin rudalics
@ 2012-12-02 18:16 ` Eli Zaretskii
2012-12-02 21:31 ` Juri Linkov
2012-12-05 19:17 ` Drew Adams
1 sibling, 2 replies; 83+ messages in thread
From: Eli Zaretskii @ 2012-12-02 18:16 UTC (permalink / raw)
To: Juri Linkov; +Cc: perin, 13041, perin
> From: Juri Linkov <juri@jurta.org>
> Cc: perin@panix.com, 13041@debbugs.gnu.org, perin@acm.org
> Date: Sun, 02 Dec 2012 02:27:32 +0200
>
> I'm surprised to see case mappings hard-coded in
> lisp/international/characters.el instead of using the properties
> `uppercase' and `lowercase' during creation of case tables.
My guess is that this is because the code in characters.el was written
long before we had access to Unicode character properties in Emacs,
and in fact before Emacs was switched to character representation
based on Unicode codepoints. And no one bothered to rewrite that code
since then; volunteers are welcome.
> (defvar decomposition-table nil)
>
> (defun make-decomposition-table ()
>   (let ((table (standard-case-table))
>         canon)
>     (setq canon (copy-sequence table))
>     (let ((c #x0000) d)
>       (while (<= c #xFFFD)
>         (make-decomposition-table-1 canon c c)
>         (setq c (1+ c))))
>     (set-char-table-extra-slot table 1 canon)
>     (set-char-table-extra-slot table 2 nil)
>     (setq decomposition-table table)))
>
> (defun make-decomposition-table-1 (canon c0 c1)
>   (let ((d (get-char-code-property c1 'decomposition)))
>     (when d
>       (unless (characterp (car d)) (pop d))
>       (if (eq c1 (car d))
>           (aset canon c0 (car d))
>         (make-decomposition-table-1 canon c0 (car d))))))
>
> (make-decomposition-table)
>
> Then a new Isearch command (the existing `isearch-toggle-case-fold'
> can't be used because it enables/disables the standard case table)
> could toggle between the current case table and the decomposition
> case table using
>
> (set-case-table decomposition-table)
>
> After evaluating this, Isearch correctly finds all related characters
> in every row of this example:
>
> http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html
>
> But it seems using the case table for decomposition has one limitation.
> I see no way to ignore combining accent characters in the case table,
> i.e. to map combining accent characters to nothing. These characters
> have the general-category "Mn (Mark, Nonspacing)", so they should be ignored
> in the search.
IMO, using case tables for this is evil. If I want to "fold"
diacritics in search, that doesn't necessarily mean I want to fold the
letter-case as well. I might want to do that, or I might not; these
are two orthogonal features.
So we need a separate kind of char-table, one that could be installed
in addition to the case table, and one that will interpret nil as
an indication to ignore the character during search. Then we will be
able to ignore combining accents, as we indeed should. We also need
to modify the searching primitives to consult this new table, in
addition to case table.
IOW, I don't think we can implement this feature entirely in Lisp.
Some changes are needed on the C level as well.
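The separate table proposed above can be imitated in Lisp on whole strings, which may help to picture the design even though it does not touch the search primitives. Everything in this sketch is hypothetical: the subtype `sketch-fold-table', both function names, and the use of the symbol `ignore' (rather than nil, which is what an unset char-table entry returns) to mark characters to skip.

```elisp
(defun sketch-make-fold-table ()
  "Return a char-table describing a tiny, hypothetical folding.
An entry of `ignore' means skip the character, a character entry
means replace by it, and nil (unset) means keep the character."
  (let ((table (make-char-table 'sketch-fold-table)))
    ;; Skip the Combining Diacritical Marks block, U+0300..U+036F.
    (set-char-table-range table '(#x0300 . #x036F) 'ignore)
    ;; Fold one precomposed letter as an example.
    (aset table ?è ?e)
    table))

(defun sketch-apply-fold-table (table text)
  "Return TEXT transformed according to TABLE."
  (let (out)
    (dolist (c (append text nil))
      (let ((v (aref table c)))
        (cond ((eq v 'ignore))                ; skipped entirely
              ((characterp v) (push v out))   ; folded
              (t (push c out)))))             ; kept as is
    (concat (nreverse out))))

;; (sketch-apply-fold-table (sketch-make-fold-table) "après") ⇒ "apres"
;; (sketch-apply-fold-table (sketch-make-fold-table) "Apres") ⇒ "Apres"
```

Letter case is left alone here, which reflects the point that diacritic folding and case folding are orthogonal.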
* bug#13041: 24.2; diacritic-fold-search
2012-12-02 18:16 ` Eli Zaretskii
@ 2012-12-02 21:31 ` Juri Linkov
2012-12-05 19:17 ` Drew Adams
1 sibling, 0 replies; 83+ messages in thread
From: Juri Linkov @ 2012-12-02 21:31 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: perin, 13041, perin
> IMO, using case tables for this is evil. If I want to "fold"
> diacritics in search, that doesn't necessarily mean I want to fold the
> letter-case as well. I might want to do that, or I might not; these
> are two orthogonal features.
`decomposition-table' is a separate char-table that has the
subtype `case-table'. It should not conflict with the standard
case table, so using `isearch-toggle-case-fold' should still
toggle the usage of the standard case table.
To toggle folding in the diacritics search perhaps requires
having two decomposition tables: one where upper and lower case
letters belong to one equivalence set, and another where
they are in different sets, so `isearch-toggle-decomposition'
could toggle between them.
Or should the standard case table and the decomposition table
be combined some other way? Maybe we could add a new variable
`decomposition-search', analogous to the existing `case-fold-search',
to enable or disable diacritics in search.
> So we need a separate kind of char-table, one that could be installed
> in addition to the case table, and one that will interpret nil as
> an indication to ignore the character during search.
I believe this kind of char-table should be based on the existing
subtype `case-table' because it provides the features necessary for
decomposition search such as extra table EQUIVALENCES (that permutes
each equivalence class) and the extra table CANONICALIZE (where
the canonical character is the final character in the recursion
that traverses the `decomposition' property).
> Then we will be able to ignore combining accents, as we indeed should.
> We also need to modify the searching primitives to consult this new
> table, in addition to case table.
Yes, it seems the feature of ignoring combining accents (i.e. mapping
some characters to nil) can't be added to existing case tables
because for the case table this would mean that converting a string
to upper case might delete some characters (like combining accents)
and converting a string to lower case might add combining accents
to the string, which of course makes no sense.
> IOW, I don't think we can implement this feature entirely in Lisp.
> Some changes are needed on the C level as well.
A hack that abuses the standard case table is already possible
in Lisp. A complete implementation requires changes on the C level.
* bug#13041: 24.2; diacritic-fold-search
2012-12-02 17:45 ` martin rudalics
2012-12-02 18:02 ` Eli Zaretskii
@ 2012-12-02 21:39 ` Juri Linkov
2012-12-03 10:16 ` martin rudalics
1 sibling, 1 reply; 83+ messages in thread
From: Juri Linkov @ 2012-12-02 21:39 UTC (permalink / raw)
To: martin rudalics; +Cc: perin, perin, 13041
> Whatever solution you find most suitable here, it would be nice to come
> up with a similar solution for sorting. I've been playing around with a
> function like
Did you try to build the case table with the diacritics mappings? It should
affect the sorting as well without requiring any changes in sorting functions.
* bug#13041: 24.2; diacritic-fold-search
2012-12-02 18:02 ` Eli Zaretskii
@ 2012-12-03 10:16 ` martin rudalics
2012-12-03 16:47 ` Eli Zaretskii
0 siblings, 1 reply; 83+ messages in thread
From: martin rudalics @ 2012-12-03 10:16 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: perin, 13041, perin
> Maybe only the fact that it can return a list whose car is 'compat',
> see the examples I posted.
So I need two indices for looping. But what are the guidelines to
interpret `compat'? Does every list starting with a `compat' mean that
the remaining entries of that list represent the constituents of that
composite?
And how do I now call `put-char-code-property' to make the German sharp
"s" ("ß") equivalent to "ss"? Or am I not supposed to do such a thing?
martin
* bug#13041: 24.2; diacritic-fold-search
2012-12-02 21:39 ` Juri Linkov
@ 2012-12-03 10:16 ` martin rudalics
2012-12-04 0:17 ` Juri Linkov
0 siblings, 1 reply; 83+ messages in thread
From: martin rudalics @ 2012-12-03 10:16 UTC (permalink / raw)
To: Juri Linkov; +Cc: perin, perin, 13041
> Did you try to build the case table with the diacritics mappings? It should
> affect the sorting as well without requiring any changes in sorting functions.
I tried but it didn't work out. I have to understand your code first
before I can tell what happens. In any case, doing your
(set-case-table decomposition-table)
permanently for a buffer crashed Emacs here.
martin
* bug#13041: 24.2; diacritic-fold-search
2012-12-03 10:16 ` martin rudalics
@ 2012-12-03 16:47 ` Eli Zaretskii
2012-12-03 17:42 ` martin rudalics
0 siblings, 1 reply; 83+ messages in thread
From: Eli Zaretskii @ 2012-12-03 16:47 UTC (permalink / raw)
To: martin rudalics; +Cc: perin, 13041, perin
> Date: Mon, 03 Dec 2012 11:16:21 +0100
> From: martin rudalics <rudalics@gmx.at>
> CC: juri@jurta.org, perin@panix.com, 13041@debbugs.gnu.org,
> perin@acm.org
>
> But what are the guidelines to interpret `compat'?
For the purposes of comparing strings, both 'compatibility' and
'canonical' decompositions should be treated the same, AFAIU. You can
find the details here:
http://unicode.org/reports/tr15/
> Does every list starting with a `compat' mean that the remaining
> entries of that list represent the constituents of that composite?
Yes. This comes directly from UnicodeData.txt, e.g.:
0132;LATIN CAPITAL LIGATURE IJ;Lu;0;L;<compat> 0049 004A;;;;N;LATIN CAPITAL LETTER I J;;;0133;
^^^^^^^^^^^^^^^^^^
> And how do I now call `put-char-code-property' to make the German sharp
> "s" ("ß") equivalent to "ss"? Or am I not supposed to do such a thing?
That's already set up in the appropriate case table, I think. But it
is not a compatibility decomposition, AFAIK.
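Whether a given character has any decomposition can be checked directly, as in this small sketch. The predicate name `sketch-decomposable-p' is hypothetical; it treats a character whose `decomposition' property is nil or just the character itself as not decomposable.

```elisp
(defun sketch-decomposable-p (char)
  "Return non-nil if CHAR has a canonical or compatibility decomposition."
  (let ((d (get-char-code-property char 'decomposition)))
    (not (or (null d) (equal d (list char))))))

;; (sketch-decomposable-p ?ß)     ⇒ nil   ; no decomposition in UnicodeData.txt
;; (sketch-decomposable-p #x0132) ⇒ t     ; LATIN CAPITAL LIGATURE IJ
;; (sketch-decomposable-p ?è)     ⇒ t
```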
* bug#13041: 24.2; diacritic-fold-search
2012-12-03 16:47 ` Eli Zaretskii
@ 2012-12-03 17:42 ` martin rudalics
2012-12-03 17:59 ` Eli Zaretskii
0 siblings, 1 reply; 83+ messages in thread
From: martin rudalics @ 2012-12-03 17:42 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: perin, 13041, perin
>> And how do I now call `put-char-code-property' to make the German sharp
>> "s" ("ß") equivalent to "ss"? Or am I not supposed to do such a thing?
>
> That's already set up in the appropriate case table, I think.
Why in a case table? Both "ß" and "ss" are lower case.
> But it
> is not a compatibility decomposition, AFAIK.
But I can make it one?
martin
* bug#13041: 24.2; diacritic-fold-search
2012-12-03 17:42 ` martin rudalics
@ 2012-12-03 17:59 ` Eli Zaretskii
2012-12-04 17:54 ` martin rudalics
0 siblings, 1 reply; 83+ messages in thread
From: Eli Zaretskii @ 2012-12-03 17:59 UTC (permalink / raw)
To: martin rudalics; +Cc: perin, 13041, perin
> Date: Mon, 03 Dec 2012 18:42:53 +0100
> From: martin rudalics <rudalics@gmx.at>
> CC: juri@jurta.org, perin@panix.com, 13041@debbugs.gnu.org,
> perin@acm.org
>
> >> And how do I now call `put-char-code-property' to make the German sharp
> >> "s" ("ß") equivalent to "ss"? Or am I not supposed to do such a thing?
> >
> > That's already set up in the appropriate case table, I think.
>
> Why in a case table? Both "ß" and "ss" are lower case.
I meant the relation "ß" => "SS".
> > But it
> > is not a compatibility decomposition, AFAIK.
>
> But I can make it one?
Yes, you can modify the table set up by uni-decomposition.el. I
think.
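A hedged sketch of what such a modification might look like. It assumes
that `put-char-code-property' accepts, for the `decomposition' property,
the same list shape that `get-char-code-property' returns. Treating "ß"
as "ss" is a local choice here, not something UnicodeData.txt specifies,
and the change affects the whole session.

```elisp
;; Assumption: a value of the form (compat CHAR...) is stored as given.
;; #x00DF is LATIN SMALL LETTER SHARP S.
(put-char-code-property #x00DF 'decomposition '(compat ?s ?s))

;; Read the entry back to check what was stored.
(get-char-code-property #x00DF 'decomposition)
```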
* bug#13041: 24.2; diacritic-fold-search
2012-12-03 10:16 ` martin rudalics
@ 2012-12-04 0:17 ` Juri Linkov
2012-12-04 3:41 ` Eli Zaretskii
0 siblings, 1 reply; 83+ messages in thread
From: Juri Linkov @ 2012-12-04 0:17 UTC (permalink / raw)
To: martin rudalics; +Cc: perin, perin, 13041
> In any case, doing your
>
> (set-case-table decomposition-table)
>
> permanently for a buffer crashed Emacs here.
With more use I see crashes too. The backtrace says that crashes are in
boyer_moore.
* bug#13041: 24.2; diacritic-fold-search
2012-12-04 0:17 ` Juri Linkov
@ 2012-12-04 3:41 ` Eli Zaretskii
0 siblings, 0 replies; 83+ messages in thread
From: Eli Zaretskii @ 2012-12-04 3:41 UTC (permalink / raw)
To: Juri Linkov; +Cc: 13041, perin, perin
> From: Juri Linkov <juri@jurta.org>
> Cc: Eli Zaretskii <eliz@gnu.org>, perin@panix.com, 13041@debbugs.gnu.org, perin@acm.org
> Date: Tue, 04 Dec 2012 02:17:04 +0200
>
> > In any case, doing your
> >
> > (set-case-table decomposition-table)
> >
> > permanently for a buffer crashed Emacs here.
>
> With more use I see crashes too. The backtrace says that crashes are in
> boyer_moore.
Please file a bug report with a minimal reproducible recipe.
* bug#13041: 24.2; diacritic-fold-search
2012-12-03 17:59 ` Eli Zaretskii
@ 2012-12-04 17:54 ` martin rudalics
2012-12-04 19:28 ` Eli Zaretskii
2012-12-04 20:12 ` Drew Adams
0 siblings, 2 replies; 83+ messages in thread
From: martin rudalics @ 2012-12-04 17:54 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: perin, 13041, perin
> Yes, you can modify the table set up by uni-decomposition.el. I
> think.
Seems to work well. The function I came up with goes as below.
Thanks for the hints, martin
(defun decomposed-string-lessp (string1 string2)
  "Return t if STRING1 is decomposition-less than STRING2."
  (let* ((length1 (length string1))
         (length2 (length string2))
         (min-length (min length1 length2))
         (index1 0)
         (index2 0)
         prop1 prop2 type1 type2 compat1 compat2)
    (catch 'found
      (while (and (< index1 length1) (< index2 length2))
        (setq prop1 (get-char-code-property
                     (downcase (elt string1 index1)) 'decomposition))
        (setq type1 (car prop1))
        (setq prop2 (get-char-code-property
                     (downcase (elt string2 index2)) 'decomposition))
        (setq type2 (car prop2))
        (cond
         ((and (eq type1 'compat) (eq type2 'compat))
          (setq compat1 (concat (cdr prop1)))
          (setq compat2 (concat (cdr prop2)))
          (let ((value (compare-strings compat1 0 nil compat2 0 nil t)))
            (cond
             ((eq value t)
              (setq index1 (1+ index1))
              (setq index2 (1+ index2)))
             ((< value 0)
              (throw 'found t))
             ((< value 0)
              (throw 'found nil)))))
         ((eq type1 'compat)
          (setq compat1 (concat (cdr prop1)))
          (let ((value
                 (compare-strings
                  compat1 0 nil
                  string2 index2 (min (+ index2 (length compat1)) length2) t)))
            (cond
             ((eq value t)
              (setq index1 (1+ index1))
              (setq index2 (+ index2 (length compat1))))
             ((< value 0)
              (throw 'found t))
             ((< value 0)
              (throw 'found nil)))))
         ((eq type2 'compat)
          (setq compat2 (concat (cdr prop2)))
          (let ((value
                 (compare-strings
                  string1 index1 (min (+ index1 (length compat2)) length1)
                  compat2 0 nil t)))
            (cond
             ((eq value t)
              (setq index1 (+ index1 (length compat2)))
              (setq index2 (1+ index2)))
             ((< value 0)
              (throw 'found t))
             ((< value 0)
              (throw 'found nil)))))
         ((< type1 type2)
          (throw 'found t))
         ((> type1 type2)
          (throw 'found nil))
         (t
          (setq index1 (1+ index1))
          (setq index2 (1+ index2)))))
      ;; Shorter is less.
      (< length1 length2))))
* bug#13041: 24.2; diacritic-fold-search
2012-12-04 17:54 ` martin rudalics
@ 2012-12-04 19:28 ` Eli Zaretskii
2012-12-05 9:41 ` martin rudalics
2012-12-04 20:12 ` Drew Adams
1 sibling, 1 reply; 83+ messages in thread
From: Eli Zaretskii @ 2012-12-04 19:28 UTC (permalink / raw)
To: martin rudalics; +Cc: perin, 13041, perin
> Date: Tue, 04 Dec 2012 18:54:59 +0100
> From: martin rudalics <rudalics@gmx.at>
> CC: juri@jurta.org, perin@panix.com, 13041@debbugs.gnu.org,
> perin@acm.org
>
> > Yes, you can modify the table set up by uni-decomposition.el. I
> > think.
>
> Seems to work well. The function I came up with goes as below.
How about putting it in subr.el?
* bug#13041: 24.2; diacritic-fold-search
2012-12-04 17:54 ` martin rudalics
2012-12-04 19:28 ` Eli Zaretskii
@ 2012-12-04 20:12 ` Drew Adams
2012-12-04 23:15 ` Drew Adams
2012-12-05 9:42 ` martin rudalics
1 sibling, 2 replies; 83+ messages in thread
From: Drew Adams @ 2012-12-04 20:12 UTC (permalink / raw)
To: 'martin rudalics', 'Eli Zaretskii'; +Cc: perin, 13041, perin
> The function [Martin] came up with goes as below.
> (defun decomposed-string-lessp (string1 string2)
> "Return t if STRING1 is decomposition-less than STRING2."
> ...
I know nothing about character composition and have not tested this with
anything but a few western accents. But this seems like good stuff.
1. Assuming this or similar is added to Emacs (please do), please consider
modifying it to respect `case-fold-search'. These modified lines do that.
(setq prop1 (get-char-code-property
             (if case-fold-search
                 (downcase (elt string1 index1))
               (elt string1 index1))
             'decomposition))
[Same thing for prop2 with string2 and index2.]
(let ((value (compare-strings compat1 0 nil
                              compat2 0 nil case-fold-search)))
2. In addition, consider updating `string-lessp' to be sensitive to a variable
such as this:
(defvar ignore-diacritics nil
  "Non-nil means ignore diacritics for string comparisons.")
With that, an alternative to hard-coding a call to `decomposed-string-lessp' is
to bind `ignore-diacritics' and use `string-lessp'.
A similar change could be made for `compare-strings': reflect the value of
`ignore-diacritics'. Or since that function has made the choice to pass
case-sensitivity as a parameter instead of respecting `case-fold-search', pass
another parameter for diacritic sensitivity.
3. More general than #2 would be a function like this, which is sensitive to
both `ignore-diacritics' and `case-fold-search' (this assumes the change
suggested above in #1 for `decomposed-string-lessp').
(defun my-string-lessp (s1 s2)
  "..."
  (if ignore-diacritics
      (decomposed-string-lessp s1 s2)
    (when case-fold-search (setq s1 (upcase s1)
                                 s2 (upcase s2)))
    (string-lessp s1 s2)))
Dunno a good name for this. It's too late to let `string-lessp' itself act like
this - that would break stuff.
4. Even better than hard-coding `case-fold-search' in `my-string-lessp' and
`decomposed-string-lessp' would be to have those functions be sensitive to a
variable such as this:
(defvar string-case-variable 'case-fold-search
  "Value is a case-sensitivity variable such as `case-fold-search'.
The values of that variable must be like those for `case-fold-search':
nil means case-sensitive, non-nil means case-insensitive.")
Code could then bind `string-case-variable' to, say, `(not
completion-ignore-case)' or to any other case-sensitivity controlling sexp, when
appropriate.
This would have the advantages offered by passing an explicit case-sensitivity
parameter, as in `compare-strings', but also the advantages of dynamic scope:
binding `string-case-variable' to affect all comparisons within scope.
Comparers such as `(my-)string-lessp' are often used as arguments to
higher-order functions that treat them as (only) binary predicates, i.e.,
predicates where any additional parameters specifying case or diacritic
sensitivity are ignored.
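As a rough illustration of the dynamic-scope point, here is a sketch of a
two-argument predicate that consults `case-fold-search' at call time, so
that a caller such as `sort' can be influenced by a surrounding `let'.
The name `my-case-aware-lessp' is hypothetical; `compare-strings' and
`sort' are the standard functions.

```elisp
(defun my-case-aware-lessp (s1 s2)
  "Return non-nil if S1 sorts before S2 (hypothetical helper).
Case is ignored when `case-fold-search' is non-nil."
  (let ((cmp (compare-strings s1 nil nil s2 nil nil case-fold-search)))
    ;; `compare-strings' returns t for equal strings, otherwise a
    ;; negative or positive integer.
    (and (not (eq cmp t)) (< cmp 0))))

;; The same binary predicate gives different orders depending on the
;; dynamic binding in effect around the call to `sort'.
(let ((case-fold-search t))
  (sort (list "a" "B" "c") #'my-case-aware-lessp))   ; => ("a" "B" "c")

(let ((case-fold-search nil))
  (sort (list "a" "B" "c") #'my-case-aware-lessp))   ; => ("B" "a" "c")
```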
* bug#13041: 24.2; diacritic-fold-search
2012-12-04 20:12 ` Drew Adams
@ 2012-12-04 23:15 ` Drew Adams
2012-12-05 6:50 ` Drew Adams
2012-12-05 9:42 ` martin rudalics
2012-12-05 9:42 ` martin rudalics
1 sibling, 2 replies; 83+ messages in thread
From: Drew Adams @ 2012-12-04 23:15 UTC (permalink / raw)
To: 'martin rudalics', 'Eli Zaretskii'; +Cc: perin, 13041, perin
BTW, there are a couple of minor things to check wrt the code you sent, Martin:
* `min-length' is not used.
* The `cond's all repeat condition (< value 0) twice, with different actions.
* bug#13041: 24.2; diacritic-fold-search
2012-12-04 23:15 ` Drew Adams
@ 2012-12-05 6:50 ` Drew Adams
2012-12-05 9:42 ` martin rudalics
2012-12-06 9:25 ` Kenichi Handa
2012-12-05 9:42 ` martin rudalics
1 sibling, 2 replies; 83+ messages in thread
From: Drew Adams @ 2012-12-05 6:50 UTC (permalink / raw)
To: 'martin rudalics', 'Eli Zaretskii'; +Cc: perin, 13041, perin
This version of Martin's function (but respecting `case-fold-search') is maybe a
tiny bit simpler. It could also be a bit slower because of `substring'
returning a copy (vs just incrementing an offset). It should also be checked
for correctness - not really tested. FWIW/HTH.
(It does correct the two double `(< value 0)' typos I mentioned earlier.
That should be done in any case.)
(defun decomposed-string-lessp (string1 string2)
  "Return non-nil if decomposed STRING1 is less than decomposed STRING2.
Comparison respects `case-fold-search'."
  (let ((s1 string1)
        (s2 string2)
        prop1 prop2 type1 type2)
    (catch 'found
      (while (and (> (length s1) 0) (> (length s2) 0))
        (setq prop1 (get-char-code-property (if case-fold-search
                                                (downcase (elt s1 0))
                                              (elt s1 0))
                                            'decomposition)
              prop2 (get-char-code-property (if case-fold-search
                                                (downcase (elt s2 0))
                                              (elt s2 0))
                                            'decomposition)
              type1 (car prop1)
              type2 (car prop2))
        (when (eq type1 'compat) (setq s1 (concat (cdr prop1))))
        (when (eq type2 'compat) (setq s2 (concat (cdr prop2))))
        (cond ((eq type1 'compat)
               (let ((cs (compare-strings
                          s1 0 nil
                          s2 0 (and (not (eq type2 'compat))
                                    (min (length s1) (length s2)))
                          case-fold-search)))
                 (unless (eq cs t) (throw 'found (< cs 0)))))
              ((eq type2 'compat)
               (let ((cs (compare-strings
                          s1 0 (min (length s2) (length s1))
                          s2 0 nil
                          case-fold-search)))
                 (unless (eq cs t) (throw 'found (< cs 0)))))
              ((= type1 type2)
               (setq s1 (substring s1 1)
                     s2 (substring s2 1)))
              (t (throw 'found (< type1 type2)))))
      (< (length string1) (length string2)))))
* bug#13041: 24.2; diacritic-fold-search
2012-12-04 19:28 ` Eli Zaretskii
@ 2012-12-05 9:41 ` martin rudalics
2012-12-05 16:37 ` Eli Zaretskii
2012-12-05 23:05 ` Juri Linkov
0 siblings, 2 replies; 83+ messages in thread
From: martin rudalics @ 2012-12-05 9:41 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: perin, 13041, perin
> How about putting it in subr.el?
If I correctly understand Juri, I next have to deal with things like
(get-char-code-property #xff59 'decomposition)
and related issues we might unearth in the course of this.
Also, while currently sorting is stable in the sense that with respect
to diacritics text remains unchanged from the original order, this is
not nice for sorting larger pieces of text. So I'd rather have to use
the second list element returned by `get-char-code-property' to make
sure that, for example, "e" gets always sorted before "è" before "é".
martin
* bug#13041: 24.2; diacritic-fold-search
2012-12-04 20:12 ` Drew Adams
2012-12-04 23:15 ` Drew Adams
@ 2012-12-05 9:42 ` martin rudalics
2012-12-05 15:38 ` Drew Adams
2012-12-05 23:04 ` Juri Linkov
1 sibling, 2 replies; 83+ messages in thread
From: martin rudalics @ 2012-12-05 9:42 UTC (permalink / raw)
To: Drew Adams; +Cc: perin, perin, 13041
> 1. Assuming this or similar is added to Emacs (please do). Please consider
> modifying it to respect `case-fold-search'. These modified lines do that.
>
> (setq prop1 (get-char-code-property
> (if case-fold-search
> (downcase (elt string1 index1))
> (elt string1 index1))
> 'decomposition))
>
> [Same thing for prop2 with string2 and index2.]
This would have to be done, yes.
> (let ((value (compare-strings compat1 0 nil
> compat2 0 nil case-fold-search)))
>
>
> 2. In addition, consider updating `string-lessp' to be sensitive to a variable
> such as this:
>
> (defvar ignore-diacritics nil
> "Non-nil means ignore diacritics for string comparisons.")
>
> With that, an alternative to hard-coding a call to `decomposed-string-lessp' is
> to bind `ignore-diacritics' and use `string-lessp'.
`ignore-diacritics' is misleading. The variable would have to be called
`observe-decompositions' or something the like.
> A similar change could be made for `compare-strings': reflect the value of
> `ignore-diacritics'. Or since that function has made the choice to pass
> case-sensitivity as a parameter instead of respecting `case-fold-search', pass
> another parameter for diacritic sensitivity.
Indeed, `string-lessp' is too weak - we'd need a function to tell
whether two strings are equal disregarding "certain" decomposition
properties.
> 3. More general than #2 would be a function like this, which is sensitive to
> both `ignore-diacritics' and `case-fold-search' (this assumes the change
> suggested above in #1 for `decomposed-string-lessp').
>
> (defun my-string-lessp (s1 s2)
> "..."
> (if ignore-diacritics
> (decomposed-string-lessp s1 s2)
> (when case-fold-search (setq s1 (upcase s1)
> s2 (upcase s2)))
> (string-lessp s1 s2)))
>
> Dunno a good name for this. It's too late to let `string-lessp' itself act like
> this - that would break stuff.
`string-lessp' is in C. I wouldn't touch it anyway.
> 4. Even better than hard-coding `case-fold-search' in `my-string-less-p' and
> `decomposed-string-lessp' would be to have those functions be sensitive to a
> variable such as this:
>
> (defvar string-case-variable 'case-fold-search
> "Value is a case-sensitivity variable such as `case-fold-search'.
> The values of that variable must be like those for `case-fold-search':
> nil means case-sensitive, non-nil means case-insensitive.")
>
> Code could then bind `string-case-variable' to, say, `(not
> completion-ignore-case)' or to any other case-sensitivity controlling sexp, when
> appropriate.
>
> This would have the advantages offered by passing an explicit case-sensitivity
> parameter, as in `compare-strings', but also the advantages of dynamic scope:
> binding `string-case-var' to affect all comparisons within scope.
>
> Comparers such as `(my-)string-lessp' are often used as arguments to
> higher-order functions that treat them as (only) binary predicates, i.e.,
> predicates where any additional parameters specifying case or diacritic
> sensitivity are ignored.
I first have to solve the problems with the values returned by
`get-char-code-property'. Then I will look into this.
martin
* bug#13041: 24.2; diacritic-fold-search
2012-12-04 23:15 ` Drew Adams
2012-12-05 6:50 ` Drew Adams
@ 2012-12-05 9:42 ` martin rudalics
1 sibling, 0 replies; 83+ messages in thread
From: martin rudalics @ 2012-12-05 9:42 UTC (permalink / raw)
To: Drew Adams; +Cc: perin, perin, 13041
> BTW, there are a couple of minor things to check wrt the code you sent, Martin:
>
> * `min-length' is not used.
Leftover from a previous version.
> * The `cond's all repeat condition (< value 0) twice, with different actions.
These are clearly silly, yes. Funnily, they don't affect the result since
they are never taken and the return value is nil as intended.
martin
* bug#13041: 24.2; diacritic-fold-search
2012-12-05 6:50 ` Drew Adams
@ 2012-12-05 9:42 ` martin rudalics
2012-12-05 15:38 ` Drew Adams
2012-12-06 9:25 ` Kenichi Handa
1 sibling, 1 reply; 83+ messages in thread
From: martin rudalics @ 2012-12-05 9:42 UTC (permalink / raw)
To: Drew Adams; +Cc: perin, perin, 13041
> This version of Martin's function (but respecting `case-fold-search') is maybe a
> tiny bit simpler. It could also be a bit slower because of `substring'
> returning a copy (vs just incrementing an offset). It should also be checked
> for correctness - not really tested. FWIW/HTH.
The most important application I see for this is within `sort-subr'
where I want to compare buffer substrings in situ by passing their
boundaries. Hence I plan to provide a version working in terms of
buffer positions. For simple string checking your version might be
preferable.
martin
* bug#13041: 24.2; diacritic-fold-search
2012-12-05 9:42 ` martin rudalics
@ 2012-12-05 15:38 ` Drew Adams
2012-12-05 15:51 ` Lewis Perin
` (2 more replies)
2012-12-05 23:04 ` Juri Linkov
1 sibling, 3 replies; 83+ messages in thread
From: Drew Adams @ 2012-12-05 15:38 UTC (permalink / raw)
To: 'martin rudalics'; +Cc: perin, perin, 13041
> `ignore-diacritics' is misleading. The variable would have
> to be called `observe-decompositions' or something the like.
1. "Observe decompositions" doesn't mean anything to me. The verb should
probably be more active - what does it mean to observe the char decompositions
here?
BTW, if we use "decomposition" in the name and description then we should
probably also use "char" - this is not about decomposing strings in some way
(whatever that might mean); it involves decomposing Unicode characters.
2. But my confusion over the name/description is in fact wrt function
`decomposed-string-lessp': I guess it's not 100% clear to me what it does.
Your doc string said "STRING1 is decomposition-less than STRING2", which
confuses me. And it is a bit ambiguous wrt "-less":
a. decomposition-less as in comparing the strings only after
removing (some parts of) their decompositions (i.e., "-less"
as in "sans")?
or
b. -lessp as in `string<': a comparison ordering relation?
In the version of `decomposed-string-lessp' that I sent, I changed the doc
string to this: "decomposed STRING1 is less than decomposed STRING2". But that
is no doubt incorrect (less correct than yours, if perhaps clearer). In
particular, it says nothing about how we compare the two decompositions.
In practical (use) terms, this is typically about ignoring diacritics, keeping
only the "base" characters. Something about that should at least be mentioned
in the doc, so that users know they can use this for that.
But IIUC this is not just about diacritics; it sometimes might not be about
diacritics at all; and diacritics present are sometimes not ignored. E.g., the
ligature ffi gets treated the same as the 3 chars f f i. There are no
diacritics present in that case.
IIUC, we convert the two strings to their Unicode decompositions and then use
the Unicode char compatibility specs to compare the decompositions. IOW, we
treat equivalent chars, as defined by Unicode, as the same.
Perhaps the name/description should speak in terms of Unicode char compatibility
or equivalence. Perhaps a name like `string-less-compat-p'? Or
`Unicode-equivalent-p'? Or `string-equivalent-p'?
How would you characterize what the function does? No doubt Eli can help here.
It is important to try to get the function name and description right from the
outset, if we can. If the Unicode standard has some terminology that applies
here then perhaps we can/should leverage that.
Beyond the name and an accurate description, the doc should, as I say, at least
mention that you can use this to ignore diacritics (such as accents), as that
will be a common use case.
* bug#13041: 24.2; diacritic-fold-search
2012-12-05 9:42 ` martin rudalics
@ 2012-12-05 15:38 ` Drew Adams
0 siblings, 0 replies; 83+ messages in thread
From: Drew Adams @ 2012-12-05 15:38 UTC (permalink / raw)
To: 'martin rudalics'; +Cc: perin, perin, 13041
> The most important application I see for this is within `sort-subr'
> where I want to compare buffer substrings in situ by passing their
> boundaries. Hence I plan to provide a version working in terms of
> buffer positions. For simple string checking your version might be
> preferable.
Please do whatever is right - using positions as you intended.
* bug#13041: 24.2; diacritic-fold-search
2012-12-05 15:38 ` Drew Adams
@ 2012-12-05 15:51 ` Lewis Perin
2012-12-05 16:20 ` Drew Adams
2012-12-05 17:16 ` Drew Adams
2012-12-06 10:28 ` martin rudalics
2 siblings, 1 reply; 83+ messages in thread
From: Lewis Perin @ 2012-12-05 15:51 UTC (permalink / raw)
To: Drew Adams; +Cc: 13041
Drew Adams writes:
> > `ignore-diacritics' is misleading. The variable would have
> > to be called `observe-decompositions' or something the like.
>
>
> 1. "Observe decompositions" doesn't mean anything to me. The verb
> should probably be more active - what does it mean to observe the
> char decompositions here?
What about “heed”?
/Lew
---
Lew Perin | perin@acm.org | http://babelcarp.org
* bug#13041: 24.2; diacritic-fold-search
2012-12-05 15:51 ` Lewis Perin
@ 2012-12-05 16:20 ` Drew Adams
0 siblings, 0 replies; 83+ messages in thread
From: Drew Adams @ 2012-12-05 16:20 UTC (permalink / raw)
To: perin; +Cc: 13041
> > > `ignore-diacritics' is misleading. The variable would have
> > > to be called `observe-decompositions' or something the like.
> >
> > 1. "Observe decompositions" doesn't mean anything to me. The verb
> > should probably be more active - what does it mean to observe the
> > char decompositions here?
>
> What about "heed"?
"Respect" is a more common term with that meaning.
But the point (to me) is that we are not conveying much by that - too vague.
"Heed" meaning what? Heed how?
Those are terms, like "treat", "handle" and "process" (verb), that are generally
signs, in computer science as elsewhere, of insufficient understanding or
laziness in communication. They say essentially, "it does something".
Sometimes (not here though) such words can even be signals that the function in
question is a congeries of things that do not necessarily belong together.
We should be able to do better here. If I understood better what the function
does I might be able to offer better name suggestions.
* bug#13041: 24.2; diacritic-fold-search
2012-12-05 9:41 ` martin rudalics
@ 2012-12-05 16:37 ` Eli Zaretskii
2012-12-06 10:31 ` martin rudalics
2012-12-05 23:05 ` Juri Linkov
1 sibling, 1 reply; 83+ messages in thread
From: Eli Zaretskii @ 2012-12-05 16:37 UTC (permalink / raw)
To: martin rudalics; +Cc: perin, 13041, perin
> Date: Wed, 05 Dec 2012 10:41:40 +0100
> From: martin rudalics <rudalics@gmx.at>
> CC: juri@jurta.org, perin@panix.com, 13041@debbugs.gnu.org,
> perin@acm.org
>
> > How about putting it in subr.el?
>
> If I correctly understand Juri, I next have to deal with things like
>
> (get-char-code-property #xff59 'decomposition)
>
> and related issues we might unearth in the course of this.
My reading of the table in
http://www.unicode.org/reports/tr44/#Character_Decomposition_Mappings
is that you should ignore the car of the list returned by
get-char-code-property if it does not pass the characterp test (that
is, if it passes the symbolp test). So the character #xff59
should sort exactly like lower-case y.
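The rule above can be sketched as a small helper that drops a leading
tag symbol and keeps only the characters. The name
`my-decomposition-chars' is hypothetical; the only library call is
`get-char-code-property'.

```elisp
(defun my-decomposition-chars (char)
  "Return the decomposition of CHAR as a list of characters only.
Hypothetical helper: a leading tag symbol such as `compat' is dropped."
  (let ((decomp (get-char-code-property char 'decomposition)))
    (if (symbolp (car decomp))
        (cdr decomp)
      decomp)))

;; LATIN SMALL LIGATURE FFI carries a `compat' tag, which is removed.
(my-decomposition-chars #xFB03) ; => (102 102 105)

;; A canonical decomposition has no tag and is returned unchanged.
(my-decomposition-chars #x00E9) ; => (101 769)
```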
* bug#13041: 24.2; diacritic-fold-search
2012-12-05 15:38 ` Drew Adams
2012-12-05 15:51 ` Lewis Perin
@ 2012-12-05 17:16 ` Drew Adams
2012-12-05 18:00 ` Drew Adams
2012-12-06 10:28 ` martin rudalics
2 siblings, 1 reply; 83+ messages in thread
From: Drew Adams @ 2012-12-05 17:16 UTC (permalink / raw)
To: 'martin rudalics'; +Cc: perin, 13041, perin
> Perhaps the name/description should speak in terms of Unicode
> char compatibility or equivalence. Perhaps a name like
> `string-less-compat-p'? Or `Unicode-equivalent-p'? Or
> `string-equivalent-p'?
In the last two suggestions I forgot about the "less" part.
Taking a quick look at the Unicode specs, it seems that what we do involves
(Unicode) "compatibility equivalence". But it also seemed that Eli was saying
that for us this is not distinguished from (Unicode) "canonical equivalence".
So perhaps `unicode-equivalence-less-p'? Or if there is a risk of confusion
with char (not string) comparison, then perhaps `unicode-equiv-string-less-p'?
Or just `equiv-string-less-p'?
* bug#13041: 24.2; diacritic-fold-search
2012-12-05 17:16 ` Drew Adams
@ 2012-12-05 18:00 ` Drew Adams
2012-12-05 18:27 ` Eli Zaretskii
2012-12-06 10:31 ` martin rudalics
0 siblings, 2 replies; 83+ messages in thread
From: Drew Adams @ 2012-12-05 18:00 UTC (permalink / raw)
To: 'martin rudalics'; +Cc: perin, 13041, perin
FWIW - Some more browsing on the topic tells me that what we are trying to come
up with here is a predicate for the NFKD canonical ordering (as applied to a
char sequence, not to a single char).
IOW, a string-ordering predicate that uses the canonical ordering for a
character's decomposed normal code point sequence.
We are using compatibility normalization, not canonical normalization. So a
search (or a string comparison test) for `f' will match the ligature `ffi'
(whereas it would not match wrt canonical normalization).
Someone please correct me if any of this is wrong.
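A small sketch of the difference between the two normalizations, using
the ucs-normalize library that ships with Emacs. The ligature is written
with an escape (U+FB03, LATIN SMALL LIGATURE FFI) so that the example
does not depend on the encoding of this message.

```elisp
(require 'ucs-normalize)

;; Canonical decomposition (NFD) leaves the ligature alone, because its
;; decomposition is only a compatibility one.
(ucs-normalize-NFD-string "\ufb03")  ; => the same one-character string

;; Compatibility decomposition (NFKD) expands it to three letters.
(ucs-normalize-NFKD-string "\ufb03") ; => "ffi"
```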
* bug#13041: 24.2; diacritic-fold-search
2012-12-05 18:00 ` Drew Adams
@ 2012-12-05 18:27 ` Eli Zaretskii
2012-12-06 10:31 ` martin rudalics
1 sibling, 0 replies; 83+ messages in thread
From: Eli Zaretskii @ 2012-12-05 18:27 UTC (permalink / raw)
To: Drew Adams; +Cc: 13041, perin, perin
> From: "Drew Adams" <drew.adams@oracle.com>
> Date: Wed, 5 Dec 2012 10:00:14 -0800
> Cc: perin@panix.com, 13041@debbugs.gnu.org, perin@acm.org
>
> We are using compatibility normalization, not canonical normalization. So a
> search (or a string comparison test) for `f' will match the ligature `ffi'
> (whereas it would not match wrt canonical normalization).
>
> Someone please correct me if any of this is wrong.
I'm not sure who is wrong ;-), but I think when compatibility
decomposition exists, it should be used; if not, the canonical
decomposition should be used.
* bug#13041: 24.2; diacritic-fold-search
2012-12-02 18:16 ` Eli Zaretskii
2012-12-02 21:31 ` Juri Linkov
@ 2012-12-05 19:17 ` Drew Adams
2012-12-05 21:19 ` Eli Zaretskii
1 sibling, 1 reply; 83+ messages in thread
From: Drew Adams @ 2012-12-05 19:17 UTC (permalink / raw)
To: 'Eli Zaretskii', 'Juri Linkov'; +Cc: perin, 13041, perin
> > I'm surprised to see case mappings hard-coded in
> > lisp/international/characters.el instead of using the properties
> > `uppercase' and `lowercase' during creation of case tables.
>
> My guess is that this is because the code in characters.el was written
> long before we had access to Unicode character properties in Emacs,
> and in fact before Emacs was switched to character representation
> based on Unicode codepoints. And no one bothered to rewrite that code
> since then; volunteers are welcome.
Doesn't file CaseFolding.txt contain all the info needed?
If so, what about populating the case tables from the latest CaseFolding.txt
file at Emacs build time? Or if no Internet access during build, populate from
a copy of the file to be distributed with Emacs.
And provide the same population code as a Lisp function, in case someone wants
to refresh an old Emacs release to use a more recent CaseFolding.txt file.
Would this make any sense?
* bug#13041: 24.2; diacritic-fold-search
2012-12-05 19:17 ` Drew Adams
@ 2012-12-05 21:19 ` Eli Zaretskii
0 siblings, 0 replies; 83+ messages in thread
From: Eli Zaretskii @ 2012-12-05 21:19 UTC (permalink / raw)
To: Drew Adams; +Cc: perin, 13041, perin
> From: "Drew Adams" <drew.adams@oracle.com>
> Cc: <perin@panix.com>, <13041@debbugs.gnu.org>, <perin@acm.org>
> Date: Wed, 5 Dec 2012 11:17:04 -0800
>
> > > I'm surprised to see case mappings hard-coded in
> > > lisp/international/characters.el instead of using the properties
> > > `uppercase' and `lowercase' during creation of case tables.
> >
> > My guess is that this is because the code in characters.el was written
> > long before we had access to Unicode character properties in Emacs,
> > and in fact before Emacs was switched to character representation
> > based on Unicode codepoints. And no one bothered to rewrite that code
> > since then; volunteers are welcome.
>
> Doesn't file CaseFolding.txt contain all the info needed?
You don't need CaseFolding.txt, because UnicodeData.txt includes the
same information, and uni-lowercase.el, uni-uppercase.el, and
uni-titlecase.el already read that information into char-tables.
> If so, what about populating the case tables from the latest CaseFolding.txt
> file at Emacs build time? Or if no Internet access during build, populate from
> a copy of the file to be distributed with Emacs.
>
> And provide the same population code as a Lisp function, in case someone wants
> to refresh an old Emacs release to use a more recent CaseFolding.txt file.
>
> Would this make any sense?
It would make sense to load case tables from uni-*.el at Emacs build
time. Volunteers are welcome.
* bug#13041: 24.2; diacritic-fold-search
2012-12-05 9:42 ` martin rudalics
2012-12-05 15:38 ` Drew Adams
@ 2012-12-05 23:04 ` Juri Linkov
2012-12-06 10:31 ` martin rudalics
1 sibling, 1 reply; 83+ messages in thread
From: Juri Linkov @ 2012-12-05 23:04 UTC (permalink / raw)
To: martin rudalics; +Cc: perin, 13041, perin
> `ignore-diacritics' is misleading. The variable would have to be called
> `observe-decompositions' or something the like.
Since the existing variable that corresponds to the
Unicode file CaseFolding.txt is `case-fold-search',
its counterpart variable that corresponds to the Unicode file
Decomposition.txt could be called `decomposition-search'.
Also like the existing `sort-fold-case', its counterpart could be called
`sort-decomposition'.
* bug#13041: 24.2; diacritic-fold-search
2012-12-05 9:41 ` martin rudalics
2012-12-05 16:37 ` Eli Zaretskii
@ 2012-12-05 23:05 ` Juri Linkov
2012-12-06 10:32 ` martin rudalics
1 sibling, 1 reply; 83+ messages in thread
From: Juri Linkov @ 2012-12-05 23:05 UTC (permalink / raw)
To: martin rudalics; +Cc: perin, perin, 13041
> If I correctly understand Juri, I next have to deal with things like
>
> (get-char-code-property #xff59 'decomposition)
>
> and related issues we might unearth in the course of this.
Only until bug#13084 is fixed; that is a separate problem.
> Also, while currently sorting is stable in the sense that with respect
> to diacritics text remains unchanged from the original order, this is
> not nice for sorting larger pieces of text. So I'd rather have to use
> the second list element returned by `get-char-code-property' to make
> sure that, for example, "e" gets always sorted before "è" before "é".
In principle, you could do this by let-binding a new variable
`sort-decomposition' to non-nil for stable sorting.
And later let-bind `sort-decomposition' to nil for a
last-resort comparison where equal lines
(equal according to non-nil `sort-decomposition')
will be sorted without regard to decomposition.
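
A minimal sketch of that two-pass comparison, assuming some folding
function already exists. The name `my-two-pass-lessp' and its KEY-FN
argument are made up for this illustration, and `downcase' only stands
in for whatever decomposition-based key would eventually be used:

```elisp
(defun my-two-pass-lessp (s1 s2 key-fn)
  "Return non-nil if string S1 should sort before string S2.
The strings are first compared by the keys (funcall KEY-FN S1) and
\(funcall KEY-FN S2).  Only when the keys are equal does the plain
`string<' order of the original strings decide.
Hypothetical helper, for illustration only."
  (let ((k1 (funcall key-fn s1))
        (k2 (funcall key-fn s2)))
    (cond ((string< k1 k2) t)         ; first pass: folded keys differ
          ((string< k2 k1) nil)
          (t (string< s1 s2)))))      ; last resort: unfolded comparison

;; Usage, with `downcase' standing in for the folding function:
;; (sort (list "b" "B" "a" "A")
;;       (lambda (x y) (my-two-pass-lessp x y #'downcase)))
;; => ("A" "a" "B" "b")
```

Folding both passes into one predicate keeps equal-key lines in a
fixed relative order without requiring a second sort.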
^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search
2012-12-05 6:50 ` Drew Adams
2012-12-05 9:42 ` martin rudalics
@ 2012-12-06 9:25 ` Kenichi Handa
2012-12-06 10:34 ` martin rudalics
2012-12-07 0:58 ` Juri Linkov
1 sibling, 2 replies; 83+ messages in thread
From: Kenichi Handa @ 2012-12-06 9:25 UTC (permalink / raw)
To: Drew Adams; +Cc: perin, 13041, perin
In article <707786B35E94470FB727BCF7F3DDA41A@us.oracle.com>, "Drew Adams" <drew.adams@oracle.com> writes:
> This version of Martin's function (but respecting `case-fold-search') is maybe a
> tiny bit simpler. It could also be a bit slower because of `substring'
> returning a copy (vs just incrementing an offset). It should also be checked
> for correctness - not really tested. FWIW/HTH.
Emacs contains the ucs-normalize package, which provides various
normalization functions. For instance,
(require 'ucs-normalize)
(ucs-normalize-NFKD-string "Äﬃn") => "Äffin"
Isn't it usable?
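
As a rough sketch of how this could serve the original request,
assuming that "ignoring diacritics" means dropping nonspacing marks
(general category Mn) after NFKD normalization. The function name
`my-fold-diacritics' is hypothetical; only `ucs-normalize-NFKD-string'
and `get-char-code-property' are existing functions:

```elisp
(require 'ucs-normalize)

(defun my-fold-diacritics (string)
  "Return STRING in NFKD form with nonspacing marks removed.
Hypothetical helper, for illustration only.  Enclosing and spacing
marks (categories Me and Mc) are left in place."
  (mapconcat (lambda (ch)
               ;; Keep every character except nonspacing marks.
               (if (eq (get-char-code-property ch 'general-category) 'Mn)
                   ""
                 (string ch)))
             (ucs-normalize-NFKD-string string)
             ""))

;; (my-fold-diacritics "après") => "apres"
```

This works on strings only; it says nothing about how to search a
buffer without normalizing it first.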
---
Kenichi Handa
handa@gnu.org
^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search
2012-12-05 15:38 ` Drew Adams
2012-12-05 15:51 ` Lewis Perin
2012-12-05 17:16 ` Drew Adams
@ 2012-12-06 10:28 ` martin rudalics
2012-12-06 17:53 ` Eli Zaretskii
2 siblings, 1 reply; 83+ messages in thread
From: martin rudalics @ 2012-12-06 10:28 UTC (permalink / raw)
To: Drew Adams; +Cc: perin, perin, 13041
>> `ignore-diacritics' is misleading. The variable would have
>> to be called `observe-decompositions' or something the like.
>
>
> 1. "Observe decompositions" doesn't mean anything to me. The verb should
> probably be more active - what does it mean to observe the char decompositions
> here?
>
> BTW, if we use "decomposition" in the name and description then we should
> probably also use "char" - this is not about decomposing strings in some way
> (whatever that might mean); it involves decomposing Unicode characters.
`ignore-diacritics' is misleading because when we, for example,
sort/match ligatures we already do more than ignore diacritics. A
variable using the term `observe-decompositions' would express what the
underlying algorithm does - observe the decomposition properties
provided by `get-char-code-property'.
Bear in mind that a "correct" solution for searching and sorting would
have to be based on a correct implementation of a collation table (see
bug#12008) plus some options that make searching more convenient (aka
"asymmetric searching" http://www.unicode.org/reports/tr10/#Searching).
In that sense, Juri's approach for searching and my function can be
considered only as poor man's variants of what should be eventually
done.
For example my Austrian locale sorts
o < ö < p
while IIUC Swedish has
o < p ... < z < ö
which IIUC can't be done via the decomposition table. I don't know
whether this implies that searching for "o" in Swedish means to _not_
list results for "ö" either.
> 2. But my confusion over the name/description is in fact wrt function
> `decomposed-string-lessp': I guess it's not 100% clear to me what it does.
>
> Your doc string said "STRING1 is decomposition-less than STRING2", which
> confuses me. And it is a bit ambiguous wrt "-less":
>
> a. decomposition-less as in comparing the strings only after
> removing (some parts of) their decompositions (i.e., "-less"
> as in "sans")?
>
> or
>
> b. -lessp as in `string<': a comparison ordering relation?
I didn't think much about the wording. But I can't, in general, talk
about comparing characters because in the ligature case (or the "ß" vs
"ss" case) I do compare substrings.
> In the version of `decomposed-string-lessp' that I sent, I changed the doc
> string to this: "decomposed STRING1 is less than decomposed STRING2". But that
> is no doubt incorrect (less correct than yours, if perhaps clearer). In
> particular, it says nothing about how we compare the two decompositions.
>
> In practical (use) terms, this is typically about ignoring diacritics, keeping
> only the "base" characters. Something about that should at least be mentioned
> in the doc, so that users know they can use this for that.
Yes.
> But IIUC this is not just about diacritics; it sometimes might not be about
> diacritics at all; and diacritics present are sometimes not ignored. E.g., the
> ligature ffi gets treated the same as the 3 chars f f i. There are no
> diacritics present in that case.
That's why I want to just talk about decompositions for the moment.
> IIUC, we convert the two strings to their Unicode decompositions and then use
> the Unicode char compatibility specs to compare the decompositions. IOW, we
> treat equivalent chars, as defined by Unicode, as the same.
Character sequences, IIUC.
> Perhaps the name/description should speak in terms of Unicode char compatibility
> or equivalence. Perhaps a name like `string-less-compat-p'? Or
> `Unicode-equivalent-p'? Or `string-equivalent-p'?
>
> How would you characterize what the function does? No doubt Eli can help here.
> It is important to try to get the function name and description right from the
> outset, if we can. If the Unicode standard has some terminology that applies
> here then perhaps we can/should leverage that.
I'm not sure whether we can ever fully support Unicode here - the
weights you find in http://www.unicode.org/Public/UCA/6.2.0/allkeys.txt
appear hardly digestible for me (and my machine, presumably).
> Beyond the name and an accurate description, the doc should, as I say, at least
> mention that you can use this to ignore diacritics (such as accents), as that
> will be a common use case.
Sure.
martin
^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search
2012-12-05 16:37 ` Eli Zaretskii
@ 2012-12-06 10:31 ` martin rudalics
2012-12-06 17:48 ` Eli Zaretskii
0 siblings, 1 reply; 83+ messages in thread
From: martin rudalics @ 2012-12-06 10:31 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: perin, 13041, perin
> My reading of the table in
>
> http://www.unicode.org/reports/tr44/#Character_Decomposition_Mappings
>
> is that you should ignore any car of the list returned by
> get-char-code-property if it does not pass the characterp test (or
> those that do pass the symbolp test). That is, the character #xff59
> should sort exactly like lower-case y.
That is, `wide' and `compat' are completely equivalent in this regard?
martin
^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search
2012-12-05 18:00 ` Drew Adams
2012-12-05 18:27 ` Eli Zaretskii
@ 2012-12-06 10:31 ` martin rudalics
2012-12-06 15:59 ` Drew Adams
1 sibling, 1 reply; 83+ messages in thread
From: martin rudalics @ 2012-12-06 10:31 UTC (permalink / raw)
To: Drew Adams; +Cc: perin, 13041, perin
> We are using compatibility normalization, not canonical normalization. So a
> search (or a string comparison test) for `f' will match the ligature `ffi'
> (whereas it would not match wrt canonical normalization).
If it can be done, searching for "f" should match ligatures like "ff"
and "fi".
martin
^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search
2012-12-05 23:04 ` Juri Linkov
@ 2012-12-06 10:31 ` martin rudalics
2012-12-07 0:52 ` Juri Linkov
0 siblings, 1 reply; 83+ messages in thread
From: martin rudalics @ 2012-12-06 10:31 UTC (permalink / raw)
To: Juri Linkov; +Cc: perin, 13041, perin
> Since the existing variable that corresponds to the
> Unicode file CaseFolding.txt is `case-fold-search',
> its counterpart variable that corresponds to the Unicode file
> Decomposition.txt
Where is this file?
> could be called `decomposition-search'.
>
> Also like the existing `sort-fold-case', its counterpart could be called
> `sort-decomposition'.
martin
^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search
2012-12-05 23:05 ` Juri Linkov
@ 2012-12-06 10:32 ` martin rudalics
0 siblings, 0 replies; 83+ messages in thread
From: martin rudalics @ 2012-12-06 10:32 UTC (permalink / raw)
To: Juri Linkov; +Cc: perin, perin, 13041
> And later to let-bind `sort-decomposition' to nil for
> last-resort comparison where equal lines
> (equal according to non-nil `sort-decomposition')
> will be sorted without regard to decomposition.
Indeed. In any case, equal lines shouldn't be the rule - especially with
functions that remove duplicates ;-)
martin
^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search
2012-12-06 9:25 ` Kenichi Handa
@ 2012-12-06 10:34 ` martin rudalics
2012-12-06 17:50 ` Eli Zaretskii
2012-12-07 0:58 ` Juri Linkov
1 sibling, 1 reply; 83+ messages in thread
From: martin rudalics @ 2012-12-06 10:34 UTC (permalink / raw)
To: Kenichi Handa; +Cc: perin, perin, 13041
> Emacs contains the ucs-normalize package, which provides various
> normalization functions. For instance,
>
> (require 'ucs-normalize)
> (ucs-normalize-NFKD-string "Äffin") => "Äffin"
>
> Isn't it usable?
Actually, the function should do what we need. But I have no idea how
to integrate it into a searching algorithm. And when sorting, it seems
expensive for comparing buffer substrings. Also, the use of a temporary
buffer for normalizing every single string makes its weight quite heavy.
In any case, I would probably steal the entire decomposition property
handling part from it. So thanks a lot for this hint.
martin
^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search
2012-12-06 10:31 ` martin rudalics
@ 2012-12-06 15:59 ` Drew Adams
0 siblings, 0 replies; 83+ messages in thread
From: Drew Adams @ 2012-12-06 15:59 UTC (permalink / raw)
To: 'martin rudalics'; +Cc: perin, 13041, perin
> > We are using compatibility normalization, not canonical
> > normalization. So a search (or a string comparison test)
> > for `f' will match the ligature `ffi'
> > (whereas it would not match wrt canonical normalization).
>
> If it can be done, searching for "f" should match ligatures like "ff"
> and "fi".
That's what I thought you were planning/preparing to do.
On the other hand, as the Unicode spec points out (for level 2), sometimes
someone wants to distinguish searching for f from searching for the ligature.
Ideally (we might never get there), that would be possible as an alternative
(choice).
The spec also points to hybrid situations regarding case conversion (see sect
RL2.4) where, e.g., you might want to do full case matching on ß in a literal
name such as Strauß but simple case folding on ß when used in a character class,
such as [ß]. Dunno whether we would ever get there either.
There seems to be a lot in the Unicode regexp spec
(http://www.unicode.org/reports/tr18/) that could be food for thought for Emacs.
I imagine that some Emacs Dev folks have already taken a close look and given it
some thought.
^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search
2012-12-06 10:31 ` martin rudalics
@ 2012-12-06 17:48 ` Eli Zaretskii
0 siblings, 0 replies; 83+ messages in thread
From: Eli Zaretskii @ 2012-12-06 17:48 UTC (permalink / raw)
To: martin rudalics; +Cc: perin, 13041, perin
> Date: Thu, 06 Dec 2012 11:31:31 +0100
> From: martin rudalics <rudalics@gmx.at>
> CC: juri@jurta.org, perin@panix.com, 13041@debbugs.gnu.org,
> perin@acm.org
>
> > My reading of the table in
> >
> > http://www.unicode.org/reports/tr44/#Character_Decomposition_Mappings
> >
> > you should ignore any car of the list returned by
> > get-char-code-property if it does not pass the characterp test (or
> > those that do pass the symbolp test). That is, the character #xff59
> > should sort exactly like lower-case y.
>
> That is, `wide' and `compat' are completely equivalent in this regard?
Yes. They are all different forms of the same character, which should
all compare equal in this context.
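
A small sketch of that rule, assuming the property value is a list
whose first element is either a character or a formatting tag such as
`wide' or `compat'. The name `my-char-decomposition' is made up for
this illustration:

```elisp
(defun my-char-decomposition (char)
  "Return the decomposition of CHAR as a list of characters.
A leading element that is not a character (a tag such as `wide' or
`compat') is dropped, so tagged and untagged decompositions are
treated alike.  Hypothetical helper, for illustration only."
  (let ((decomposition (get-char-code-property char 'decomposition)))
    (if (characterp (car decomposition))
        decomposition
      (cdr decomposition))))

;; Usage: (my-char-decomposition #xff59)
```

The intent is that #xff59 and lower-case y then yield the same list,
so they compare equal when sorting.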
^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search
2012-12-06 10:34 ` martin rudalics
@ 2012-12-06 17:50 ` Eli Zaretskii
0 siblings, 0 replies; 83+ messages in thread
From: Eli Zaretskii @ 2012-12-06 17:50 UTC (permalink / raw)
To: martin rudalics; +Cc: 13041, perin, perin
> Date: Thu, 06 Dec 2012 11:34:26 +0100
> From: martin rudalics <rudalics@gmx.at>
> CC: Drew Adams <drew.adams@oracle.com>, eliz@gnu.org, perin@panix.com,
> 13041@debbugs.gnu.org, perin@acm.org
>
> > Emacs contains the ucs-normalize package, which provides various
> > normalization functions. For instance,
> >
> > (require 'ucs-normalize)
> > (ucs-normalize-NFKD-string "Äffin") => "Äffin"
> >
> > Isn't it usable?
>
> Actually, the function should do what we need. But I have no idea how
> to integrate it into a searching algorithm. And when sorting, it seems
> expensive for comparing buffer substrings. Also, the use of a temporary
> buffer for normalizing every single string makes its weight quite heavy.
Yes, I don't think this will be possible without changes on the C
level. Those changes should use code very similar to what we
currently do for case-insensitive search.
^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search
2012-12-06 10:28 ` martin rudalics
@ 2012-12-06 17:53 ` Eli Zaretskii
0 siblings, 0 replies; 83+ messages in thread
From: Eli Zaretskii @ 2012-12-06 17:53 UTC (permalink / raw)
To: martin rudalics; +Cc: perin, 13041, perin
> Date: Thu, 06 Dec 2012 11:28:05 +0100
> From: martin rudalics <rudalics@gmx.at>
> CC: 'Eli Zaretskii' <eliz@gnu.org>, perin@panix.com,
> 13041@debbugs.gnu.org, perin@acm.org
>
> >> `ignore-diacritics' is misleading. The variable would have
> >> to be called `observe-decompositions' or something the like.
> >
> >
> > 1. "Observe decompositions" doesn't mean anything to me. The verb should
> > probably be more active - what does it mean to observe the char decompositions
> > here?
> >
> > BTW, if we use "decomposition" in the name and description then we should
> > probably also use "char" - this is not about decomposing strings in some way
> > (whatever that might mean); it involves decomposing Unicode characters.
>
> `ignore-diacritics' is misleading because when we, for example,
> sort/match ligatures we already do more than ignore diacritics. A
> variable using the term `observe-decompositions' would express what the
> underlying algorithm does - observe the decomposition properties
> provided by `get-char-code-property'.
I would suggest something like equivalence-search or maybe
loose-match-search. The latter is slightly less suitable, since loose
matches include not just decompositions, see the Unicode Regular
Expressions report.
^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search
2012-12-06 10:31 ` martin rudalics
@ 2012-12-07 0:52 ` Juri Linkov
0 siblings, 0 replies; 83+ messages in thread
From: Juri Linkov @ 2012-12-07 0:52 UTC (permalink / raw)
To: martin rudalics; +Cc: perin, 13041, perin
>> Since the existing variable that corresponds to the
>> Unicode file CaseFolding.txt is `case-fold-search',
>> its counterpart variable that corresponds to the Unicode file
>> Decomposition.txt
>
> Where is this file?
There was a reference to
http://www.unicode.org/Public/UNIDATA/extracted/DerivedDecompositionType.txt
from http://www.unicode.org/faq/casemap_charprop.html
but it seems this file is redundant since you can get
the same information from admin/unidata/UnicodeData.txt
using (get-char-code-property ?? 'decomposition)
^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search
2012-12-06 9:25 ` Kenichi Handa
2012-12-06 10:34 ` martin rudalics
@ 2012-12-07 0:58 ` Juri Linkov
2012-12-07 6:33 ` Eli Zaretskii
2012-12-07 10:37 ` martin rudalics
1 sibling, 2 replies; 83+ messages in thread
From: Juri Linkov @ 2012-12-07 0:58 UTC (permalink / raw)
To: Kenichi Handa; +Cc: perin, 13041, perin
> Emacs contains the ucs-normalize package, which provides various
> normalization functions. For instance,
>
> (require 'ucs-normalize)
> (ucs-normalize-NFKD-string "Äffin") => "Äffin"
>
> Isn't it usable?
This is usable to sort and compare strings, but I don't see
how ucs-normalize.el could help in the search. I suppose the
searched buffer can't be normalized before starting a search.
So the search function somehow should be able to skip combining
characters in the buffer. But to do this, the translation table needs
to contain additional information about certain characters to ignore.
Also the translation table should be able to map a sequence of
characters like "ss" to "ß".
^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search
2012-12-07 0:58 ` Juri Linkov
@ 2012-12-07 6:33 ` Eli Zaretskii
2012-12-07 10:37 ` martin rudalics
1 sibling, 0 replies; 83+ messages in thread
From: Eli Zaretskii @ 2012-12-07 6:33 UTC (permalink / raw)
To: Juri Linkov; +Cc: 13041, perin, perin
> From: Juri Linkov <juri@jurta.org>
> Date: Fri, 07 Dec 2012 02:58:17 +0200
> Cc: perin@panix.com, 13041@debbugs.gnu.org, perin@acm.org
>
> > Emacs contains the ucs-normalize package, which provides various
> > normalization functions. For instance,
> >
> > (require 'ucs-normalize)
> > (ucs-normalize-NFKD-string "Äffin") => "Äffin"
> >
> > Isn't it usable?
>
> This is usable to sort and compare strings, but I don't see
> how ucs-normalize.el could help in the search.
I agree.
> I suppose the searched buffer can't be normalized before starting a
> search.
Yes, that's not acceptable.
> So the search function somehow should be able to skip combining
> characters in the buffer. But to do this, the translation table needs
> to contain additional information about certain characters to ignore.
Right. This is very similar to how the search primitives currently
use the case tables, except that they don't skip characters. But
adding such a skip operation should be easy.
> Also the translation table should be able to map a sequence of
> characters like "ss" to "ß".
I'd say the other way around: map ß to ss.
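
Until the search primitives can skip characters themselves, the
skipping can be approximated at the Lisp level with a regexp. This is
only a sketch under narrow assumptions: it covers the Combining
Diacritical Marks block (U+0300-U+036F) alone, it matches decomposed
buffer text but not precomposed characters such as "è", and the name
`my-combining-tolerant-regexp' is hypothetical:

```elisp
(defun my-combining-tolerant-regexp (string)
  "Return a regexp matching STRING with optional combining marks.
Each character of STRING may be followed in the buffer by any number
of characters from U+0300 to U+036F.
Hypothetical helper, for illustration only."
  (mapconcat (lambda (ch)
               ;; Quote the character, then allow trailing marks.
               (concat (regexp-quote (string ch)) "[\u0300-\u036f]*"))
             string
             ""))

;; Usage:
;; (re-search-forward (my-combining-tolerant-regexp "apres") nil t)
```

It is not the table-driven change described above, just a way to try
out the behavior on decomposed text.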
^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search
2012-12-07 0:58 ` Juri Linkov
2012-12-07 6:33 ` Eli Zaretskii
@ 2012-12-07 10:37 ` martin rudalics
2012-12-07 23:55 ` Juri Linkov
1 sibling, 1 reply; 83+ messages in thread
From: martin rudalics @ 2012-12-07 10:37 UTC (permalink / raw)
To: Juri Linkov; +Cc: 13041, perin, perin
> This is usable to sort and compare strings, but I don't see
> how ucs-normalize.el could help in the search. I suppose the
> searched buffer can't be normalized before starting a search.
You can either temporarily
- leave the text alone but give each string that should be handled
specially a text property with the normalized form. In this case
searching has to pay attention to these properties, if present.
- normalize the text and give each normalized string a text property
with the original text. In this case searching will proceed as usual
but you have to restore the original text when done.
I don't know how feasible these are for searching. But I used the
second approach for sorting without problems.
Also I don't know how to handle the return value and/or highlighting
when, for example, finding a match for "suf" within "suﬀer". For
example, replacing each occurrence of "suf" with the empty string should
leave us with "fer" here. So in this case, we have to deal with the
normalized string anyway. OTOH replacing a match for "res" in "résumé"
with the empty string should probably leave us with "umé".
> So the search function somehow should be able to skip combining
> characters in the buffer. But to do this, the translation table needs
> to contain additional information about certain characters to ignore.
> Also the translation table should be able to map a sequence of
> characters like "ss" to "ß".
I have no idea how many mappings like "ß" -> "ss" exist. The problem is
that we don't get them from UnicodeData.txt IIUC.
martin
^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search
2012-12-07 10:37 ` martin rudalics
@ 2012-12-07 23:55 ` Juri Linkov
2012-12-08 8:20 ` Eli Zaretskii
` (2 more replies)
0 siblings, 3 replies; 83+ messages in thread
From: Juri Linkov @ 2012-12-07 23:55 UTC (permalink / raw)
To: martin rudalics; +Cc: 13041, perin, perin
> - leave the text alone but give each string that should be handled
> specially a text property with the normalized form. In this case
> searching has to pay attention to these properties, if present.
>
> - normalize the text and give each normalized string a text property
> with the original text. In this case searching will proceed as usual
> but you have to restore the original text when done.
This reminds me of the idea that searching should take into account the text
displayed with the `display' property and other display-related properties.
It seems this is more difficult to implement.
> Also I don't know how to handle the return value and/or highlighting
> when, for example, finding a match for "suf" within "suffer". For
> example, replacing each occurrence of "suf" with the empty string should
> leave us with "fer" here.
I believe such ligature characters should be handled as a whole,
i.e. "suf" doesn't match "suﬀer", only "suff" should match it.
> I have no idea how many mappings like "ß" -> "ss" exist. The problem is
> that we don't get them from UnicodeData.txt IIUC.
I can't find them in UnicodeData.txt either. Looking at the files in
http://www.unicode.org/Public/UNIDATA/ I can find them in the file
http://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt
that is derived from
http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt
^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search
2012-12-07 23:55 ` Juri Linkov
@ 2012-12-08 8:20 ` Eli Zaretskii
2012-12-08 11:35 ` martin rudalics
2012-12-08 11:21 ` martin rudalics
2012-12-08 23:54 ` Stefan Monnier
2 siblings, 1 reply; 83+ messages in thread
From: Eli Zaretskii @ 2012-12-08 8:20 UTC (permalink / raw)
To: Juri Linkov; +Cc: 13041, perin, perin
> From: Juri Linkov <juri@jurta.org>
> Date: Sat, 08 Dec 2012 01:55:22 +0200
> Cc: 13041@debbugs.gnu.org, perin@panix.com, perin@acm.org
>
> This reminds an idea that searching should take into account the text
> displayed with the `display' property and other display-related properties.
> It seems this is more difficult to implement.
I don't know if it's more difficult. After all, the primitives you
need to (a) find out whether there's a display string at given buffer
position, and (b) access its text, are already there, ready to be
used. Moreover, there's even a C function that searches the current
buffer for a specific Lisp string, which you could use as a model for
this feature.
What is definitely true, though, is that searching display strings is a
separate feature, with an entirely different implementation. I
suggest, therefore, keeping it in mind but not mixing it with what's
being discussed here.
> > I have no idea how many mappings like "ß" -> "ss" exist. The problem is
> > that we don't get them from UnicodeData.txt IIUC.
>
> I can't find them in UnicodeData.txt too. Looking at the files in
> http://www.unicode.org/Public/UNIDATA/ can find them in the file
>
> http://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt
>
> that is derived from
>
> http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
> http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt
Maybe we should extend ucs-normalize.el to include that as well.
^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search
2012-12-07 23:55 ` Juri Linkov
2012-12-08 8:20 ` Eli Zaretskii
@ 2012-12-08 11:21 ` martin rudalics
2012-12-08 23:07 ` Juri Linkov
2012-12-08 23:54 ` Stefan Monnier
2 siblings, 1 reply; 83+ messages in thread
From: martin rudalics @ 2012-12-08 11:21 UTC (permalink / raw)
To: Juri Linkov; +Cc: 13041, perin, perin
>> - leave the text alone but give each string that should be handled
>> specially a text property with the normalized form. In this case
>> searching has to pay attention to these properties, if present.
>>
>> - normalize the text and give each normalized string a text property
>> with the original text. In this case searching will proceed as usual
>> but you have to restore the original text when done.
>
> This reminds an idea that searching should take into account the text
> displayed with the `display' property and other display-related properties.
> It seems this is more difficult to implement.
... and probably should include searching for overlays too.
>> Also I don't know how to handle the return value and/or highlighting
>> when, for example, finding a match for "suf" within "suffer". For
>> example, replacing each occurrence of "suf" with the empty string should
>> leave us with "fer" here.
>
> I believe such ligature characters should be handled as a whole,
> i.e. "suf" doesn't match "suﬀer", only "suff" should match it.
This means that when you type the second "f" you might get a match
before the present one. Consider a buffer containing the two lines
suﬀer
suffer
Typing "suf" as search string would go to "suffer". Adding an "f" to
the search string now would go back to "suﬀer" (or not). Disconcerting
in any case.
>> I have no idea how many mappings like "ß" -> "ss" exist. The problem is
>> that we don't get them from UnicodeData.txt IIUC.
>
> I can't find them in UnicodeData.txt too. Looking at the files in
> http://www.unicode.org/Public/UNIDATA/ can find them in the file
>
> http://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt
>
> that is derived from
>
> http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
> http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt
Case folding "ß" to "SS" (upper case "S") is not what I had in mind. I
was talking about the (weak?) equivalence of "ß" and "ss" (lower case
"s") which is much more important when searching. In particular so,
because many German words that were earlier written with an "ß" are now
written with "ss".
martin
^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search
2012-12-08 8:20 ` Eli Zaretskii
@ 2012-12-08 11:35 ` martin rudalics
2012-12-08 12:40 ` Eli Zaretskii
0 siblings, 1 reply; 83+ messages in thread
From: martin rudalics @ 2012-12-08 11:35 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: perin, 13041, perin
> I don't know if it's more difficult. After all, the primitives you
> need to (a) find out whether there's a display string at given buffer
> position, and (b) access its text, are already there, ready to be
> used. Moreover, there's even a C function that searches the current
> buffer for a specific Lisp string, which you could use as a model for
> this feature.
I think that mirroring/cloning (part of) the current buffer in a special
search buffer would be the cheapest solution. The search buffer would
contain the normalized text, be built only when normalization is
needed and be rebuilt whenever a search option or the buffer text
changes. I don't know whether `buffer-swap-text' could be used here.
martin
^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search
2012-12-08 11:35 ` martin rudalics
@ 2012-12-08 12:40 ` Eli Zaretskii
0 siblings, 0 replies; 83+ messages in thread
From: Eli Zaretskii @ 2012-12-08 12:40 UTC (permalink / raw)
To: martin rudalics; +Cc: perin, 13041, perin
> Date: Sat, 08 Dec 2012 12:35:37 +0100
> From: martin rudalics <rudalics@gmx.at>
> CC: Juri Linkov <juri@jurta.org>, 13041@debbugs.gnu.org, perin@panix.com,
> perin@acm.org
>
> > I don't know if it's more difficult. After all, the primitives you
> > need to (a) find out whether there's a display string at given buffer
> > position, and (b) access its text, are already there, ready to be
> > used. Moreover, there's even a C function that searches the current
> > buffer for a specific Lisp string, which you could use as a model for
> > this feature.
>
> I think that mirroring/cloning (part of) the current buffer in a special
> search buffer would be the cheapest solution. The search buffer would
> contain the normalized text, be built only when normalization is
> needed and be rebuilt whenever a search option or the buffer text
> changes.
Maybe this is the cheapest, but it still needs the same support the
other alternatives do.
^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search
2012-12-08 11:21 ` martin rudalics
@ 2012-12-08 23:07 ` Juri Linkov
2012-12-09 0:04 ` Drew Adams
2012-12-09 17:52 ` martin rudalics
0 siblings, 2 replies; 83+ messages in thread
From: Juri Linkov @ 2012-12-08 23:07 UTC (permalink / raw)
To: martin rudalics; +Cc: 13041, perin, perin
> This means that when you type the second "f" you might get a match
> before the present one. Consider a buffer containing the two lines
> suﬀer
> suffer
>
> Typing "suf" as search string would go to "suffer". Adding an "f" to
> the search string now would go back to "suﬀer" (or not).
Going back looks like backtracking in the regexp search.
OTOH, instead of using an approach of matching only a full match
like in Chromium, we could do like GEdit and OpenOffice that
match the whole ligature character in a partial match
(i.e. to match "ﬀ" when the search string is just "f").
Though this has a problem of highlighting the whole character for
a partial match that looks wrong, but perhaps no one can do better.
>> http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
>> http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt
>
> Case folding "ß" to "SS" (upper case "S") is not what I had in mind. I
> was talking about the (weak?) equivalence of "ß" and "ss" (lower case
> "s") which is much more important when searching. In particular so,
> because many German words that were earlier written with an "ß" are now
> written with "ss".
Yes, this is what I meant too. It is surprising but
http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
defines the equivalence of "ß" and "ss" (lower case "s")
instead of case-folding. The following line in CaseFolding.txt:
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
maps 00DF (LATIN SMALL LETTER SHARP S) to two characters
0073 0073 (LATIN SMALL LETTER S) keeping the lower case.
Maybe this is a bug in Unicode data?
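
For reference, the fields of such a line can be pulled apart as
follows. This is a sketch that assumes the "code; status; mapping;
# name" layout shown above; the function name is hypothetical. In that
file the status "F" marks a full case folding, and folded forms are
lower case by design, which would explain why the mapping above stays
lower case:

```elisp
(defun my-parse-case-folding-line (line)
  "Parse one data LINE of CaseFolding.txt.
Return a list (CODE STATUS MAPPING), where CODE is a character,
STATUS is a string such as \"C\" or \"F\", and MAPPING is a string.
Hypothetical helper, for illustration only."
  (let* ((fields (split-string line ";[ \t]*"))
         ;; First field: the code point, in hexadecimal.
         (code (string-to-number (nth 0 fields) 16))
         (status (nth 1 fields))
         ;; Third field: one or more hexadecimal code points.
         (mapping (mapcar (lambda (hex) (string-to-number hex 16))
                          (split-string (nth 2 fields) " " t))))
    (list code status (apply #'string mapping))))

;; (my-parse-case-folding-line
;;  "00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S")
;; => (223 "F" "ss")
```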
^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search
2012-12-07 23:55 ` Juri Linkov
2012-12-08 8:20 ` Eli Zaretskii
2012-12-08 11:21 ` martin rudalics
@ 2012-12-08 23:54 ` Stefan Monnier
2012-12-09 0:14 ` Drew Adams
2012-12-09 0:35 ` Juri Linkov
2 siblings, 2 replies; 83+ messages in thread
From: Stefan Monnier @ 2012-12-08 23:54 UTC (permalink / raw)
To: Juri Linkov; +Cc: 13041, perin, perin
> i.e. "suf" doesn't match "suﬀer", only "suff" should match it.
I completely disagree here. "suf" should match "suﬀer".
Stefan
^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search
2012-12-08 23:07 ` Juri Linkov
@ 2012-12-09 0:04 ` Drew Adams
2012-12-09 17:52 ` martin rudalics
1 sibling, 0 replies; 83+ messages in thread
From: Drew Adams @ 2012-12-09 0:04 UTC (permalink / raw)
To: 'Juri Linkov', 'martin rudalics'; +Cc: perin, 13041, perin
> > Typing "suf" as search string would go to "suffer". Adding
> > an "f" to the search string now would go back to "suﬀer" (or not).
>
> Going back looks like backtracking in the regexp search.
>
> OTOH, instead of using an approach of matching only a full match
> like in Chromium, we could do like GEdit and OpenOffice that
> match the whole ligature character in a partial match
> (i.e. to match "ﬀ" when the search string is just "f").
Seems to me that the starting point should be the Unicode Regexp spec, which
outlines the behavior of level 1 and level 2 searches. Emacs Dev can choose
what it wants to do, of course, but that is a good place to start, I think.
* bug#13041: 24.2; diacritic-fold-search
2012-12-08 23:54 ` Stefan Monnier
@ 2012-12-09 0:14 ` Drew Adams
2012-12-09 15:42 ` Stefan Monnier
2012-12-09 0:35 ` Juri Linkov
1 sibling, 1 reply; 83+ messages in thread
From: Drew Adams @ 2012-12-09 0:14 UTC (permalink / raw)
To: 'Stefan Monnier', 'Juri Linkov'; +Cc: perin, 13041, perin
> > i.e. "suf" doesn't match "suﬀer", only "suff" should match it.
>
> I completely disagree here. "suf" should match "suﬀer".
The Unicode Regexp spec says that it is best, if possible, to let users do
either. It discusses such different search possibilities explicitly.
We might not be able to support that superior level (level 2) for Emacs search,
but the point is that each kind of matching can be useful here.
At this stage of the discussion it should not, I think, be a case of "I
completely disagree" (or completely agree), unless you have already decided
something wrt design/implementation etc. Better to look at the possibilities
for users and then discuss what it might take to be able to support this or that
kind of search matching.
* bug#13041: 24.2; diacritic-fold-search
2012-12-08 23:54 ` Stefan Monnier
2012-12-09 0:14 ` Drew Adams
@ 2012-12-09 0:35 ` Juri Linkov
2012-12-09 11:35 ` Stephen Berman
2012-12-09 15:45 ` Stefan Monnier
1 sibling, 2 replies; 83+ messages in thread
From: Juri Linkov @ 2012-12-09 0:35 UTC (permalink / raw)
To: Stefan Monnier; +Cc: 13041, perin, perin
>> i.e. "suf" doesn't match "suﬀer", only "suff" should match it.
>
> I completely disagree here. "suf" should match "suﬀer".
AFAIS, there are more programs that find a partial match,
but none of them can do the right highlighting:
both possibilities (to highlight the whole ligature and not to highlight)
are wrong, and highlighting a part of the ligature is impossible.
* bug#13041: 24.2; diacritic-fold-search
2012-12-09 0:35 ` Juri Linkov
@ 2012-12-09 11:35 ` Stephen Berman
2012-12-09 17:52 ` martin rudalics
2012-12-09 15:45 ` Stefan Monnier
1 sibling, 1 reply; 83+ messages in thread
From: Stephen Berman @ 2012-12-09 11:35 UTC (permalink / raw)
To: Juri Linkov; +Cc: perin, 13041, perin
On Sun, 09 Dec 2012 02:35:46 +0200 Juri Linkov <juri@jurta.org> wrote:
>>> i.e. "suf" doesn't match "suﬀer", only "suff" should match it.
>>
>> I completely disagree here. "suf" should match "suﬀer".
>
> AFAIS, there are more programs that find a partial match,
> but none of them can do the right highlighting:
> both possibilities (to highlight the whole ligature and not to highlight)
> are wrong, and highlighting a part of the ligature is impossible.
Could a ligature be highlighted in a different way (different color or
additional attribute such as underlining) to indicate a partial or
potential match?
Steve Berman
* bug#13041: 24.2; diacritic-fold-search
2012-12-09 0:14 ` Drew Adams
@ 2012-12-09 15:42 ` Stefan Monnier
2012-12-09 18:00 ` Drew Adams
0 siblings, 1 reply; 83+ messages in thread
From: Stefan Monnier @ 2012-12-09 15:42 UTC (permalink / raw)
To: Drew Adams; +Cc: perin, 13041, perin
> The Unicode Regexp spec says that it is best, if possible, to let users do
> either.
We're talking about the (now misnamed) "diacritic-fold" search. If the
user wants to be more strict, there's always going to be the
"non-diacritic-fold" search.
Stefan
* bug#13041: 24.2; diacritic-fold-search
2012-12-09 0:35 ` Juri Linkov
2012-12-09 11:35 ` Stephen Berman
@ 2012-12-09 15:45 ` Stefan Monnier
2012-12-10 7:57 ` Juri Linkov
1 sibling, 1 reply; 83+ messages in thread
From: Stefan Monnier @ 2012-12-09 15:45 UTC (permalink / raw)
To: Juri Linkov; +Cc: 13041, perin, perin
>>> i.e. "suf" doesn't match "suﬀer", only "suff" should match it.
>> I completely disagree here. "suf" should match "suﬀer".
> AFAIS, there are more programs that find a partial match,
> but none of them can do the right highlighting:
> both possibilities (to highlight the whole ligature and not to highlight)
> are wrong, and highlighting a part of the ligature is impossible.
One step at a time: first, let's make sure we can match it. Then we'll
worry about what the match-boundaries should be and how to display it
(when we get to this point, we can even consider displaying suﬀer as
suffer temporarily, just like we do when point is in the middle of
a composition).
Stefan
* bug#13041: 24.2; diacritic-fold-search
2012-12-08 23:07 ` Juri Linkov
2012-12-09 0:04 ` Drew Adams
@ 2012-12-09 17:52 ` martin rudalics
2012-12-09 18:06 ` Drew Adams
1 sibling, 1 reply; 83+ messages in thread
From: martin rudalics @ 2012-12-09 17:52 UTC (permalink / raw)
To: Juri Linkov; +Cc: 13041, perin, perin
> OTOH, instead of using an approach of matching only a full match
> like in Chromium, we could do like GEdit and OpenOffice that
> match the whole ligature character in a partial match
> (i.e. to match "ﬀ" when the search string is just "f").
Strictly speaking, they should match the first "f" in "ﬀ". When matching
"suf" against "suﬀer", the `match-string' would be "suf", with
`match-end' after "ﬀ". That is, the match length would not increase
when adding an "f" to the search string now. But I don't know what
`match-string' should return - "suff" or "suﬀ".
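To make the boundary question concrete, here is a rough sketch in Python (not Emacs Lisp, and not a proposed implementation). It assumes that each buffer character is expanded with NFKD compatibility decomposition and that every expanded character remembers which buffer character it came from. `fold_with_offsets` and `fold_search` are hypothetical names used only here.

```python
import unicodedata


def fold_with_offsets(text):
    """Expand TEXT with NFKD; also return, per expanded char, its source index."""
    folded, origin = [], []
    for index, char in enumerate(text):
        for piece in unicodedata.normalize("NFKD", char):
            folded.append(piece)
            origin.append(index)
    return "".join(folded), origin


def fold_search(needle, text):
    """Return (start, end) indices into TEXT covering the match, or None."""
    folded, origin = fold_with_offsets(text)
    folded_needle, _ = fold_with_offsets(needle)
    position = folded.find(folded_needle)
    if not folded_needle or position < 0:
        return None
    last = position + len(folded_needle) - 1
    # The end is rounded up to a whole buffer character, so a match that
    # stops inside a ligature still covers the entire ligature.
    return origin[position], origin[last] + 1
```

With the ligature written as U+FB00, `fold_search("suf", "su\ufb00er")` and `fold_search("suff", "su\ufb00er")` both return `(0, 3)`: the match end sits after the ligature either way, and adding the second "f" does not lengthen the match. Note that NFKD only splits characters; it does not drop accents, so this sketch covers ligatures but not diacritic folding.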
> Though this has a problem of highlighting the whole character for
> a partial match that looks wrong, but perhaps no one can do better.
We would need a display string "ff" replacing "ﬀ" during highlighting,
and highlight only the first "f" in it.
> Yes, this is what I meant too. It is surprising but
> http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
> defines the equivalence of "ß" and "ss" (lower case "s")
> instead of case-folding. The following line in CaseFolding.txt:
>
> 00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
>
> maps 00DF (LATIN SMALL LETTER SHARP S) to two characters
> 0073 0073 (LATIN SMALL LETTER S) keeping the lower case.
> Maybe this is a bug in Unicode data?
Maybe it's explained here
http://www.unicode.org/faq/idn.html
in the answer to
Q: Why does IDNA2003 map final sigma (ς) to sigma (σ), map eszett (ß)
to "ss", and delete ZWJ/ZWNJ?
One possible interpretation of this is that mapping "ß" to "SS" would
imply that downcasing "SS" should produce "ß" and this is unwanted. But
I still wonder whether we are supposed to apply mappings recursively.
martin
* bug#13041: 24.2; diacritic-fold-search
2012-12-09 11:35 ` Stephen Berman
@ 2012-12-09 17:52 ` martin rudalics
0 siblings, 0 replies; 83+ messages in thread
From: martin rudalics @ 2012-12-09 17:52 UTC (permalink / raw)
To: Stephen Berman; +Cc: perin, 13041, perin
> Could a ligature be highlighted in a different way (different color or
> additional attribute such as underlining) to indicate a partial or
> potential match?
I think ligatures can be easily handled by displaying the corresponding
decomposed string. But a different color could be used to highlight the
"ß" with an incremental search string "Mas" and a match in "Maße".
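As a rough illustration of how such a partial match could be detected (a sketch in Python under my own assumptions, not Emacs code): fold each buffer character with full case folding, remember which buffer character each folded character came from, and flag the match as partial when it starts or stops inside one expanded character. `casefold_with_offsets` and `match_with_partial_flag` are hypothetical names.

```python
def casefold_with_offsets(text):
    """Case-fold TEXT char by char; also return each folded char's source index."""
    folded, origin = [], []
    for index, char in enumerate(text):
        for piece in char.casefold():
            folded.append(piece)
            origin.append(index)
    return "".join(folded), origin


def match_with_partial_flag(needle, text):
    """Return (start, end, partial) for the first folded match, or None.

    PARTIAL is True when the match begins or ends inside a buffer
    character that folds to more than one character.
    """
    folded, origin = casefold_with_offsets(text)
    key = needle.casefold()
    position = folded.find(key)
    if not key or position < 0:
        return None
    last = position + len(key) - 1
    cut_at_start = position > 0 and origin[position - 1] == origin[position]
    cut_at_end = last + 1 < len(folded) and origin[last + 1] == origin[last]
    return origin[position], origin[last] + 1, cut_at_start or cut_at_end
```

For the example above, `match_with_partial_flag("Mas", "Maße")` returns `(0, 3, True)`, while `"Mass"` gives `(0, 3, False)`. A caller could use the flag to pick a different face for the last highlighted character.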
martin
* bug#13041: 24.2; diacritic-fold-search
2012-12-09 15:42 ` Stefan Monnier
@ 2012-12-09 18:00 ` Drew Adams
0 siblings, 0 replies; 83+ messages in thread
From: Drew Adams @ 2012-12-09 18:00 UTC (permalink / raw)
To: 'Stefan Monnier'; +Cc: perin, 13041, perin
> > The Unicode Regexp spec says that it is best, if possible,
> > to let users do either.
>
> We're talking about the (now misnamed) "diacritic-fold" search.
> If the user wants to be more strict, there's always going to be
> the "non-diacritic-fold" search.
Yes, and? That ignoring of diacritics etc. is essentially what the Unicode
Regexp spec refers to as "loose matching", IIUC. And that means "at least the
simple, default Unicode case folding."
You are considering, among other things, whether `f' should match the ﬀ ligature
or whether only `ff' should match it. The standard deals with this question, I
believe.
(BTW, I cannot actually see that ligature with my mail client. So I copied the
char from another mail message and pasted it, above. If that copy+paste didn't
work, what I meant was the ligature for ff.)
* bug#13041: 24.2; diacritic-fold-search
2012-12-09 17:52 ` martin rudalics
@ 2012-12-09 18:06 ` Drew Adams
2012-12-11 7:19 ` Eli Zaretskii
0 siblings, 1 reply; 83+ messages in thread
From: Drew Adams @ 2012-12-09 18:06 UTC (permalink / raw)
To: 'martin rudalics', 'Juri Linkov'; +Cc: perin, 13041, perin
> Maybe it's explained here
> http://www.unicode.org/faq/idn.html
> in the answer to
>
> Q: Why does IDNA2003 map final sigma (ς) to sigma (σ), map
> eszett (ß) to "ss", and delete ZWJ/ZWNJ?
>
> One possible interpretation of this is that mapping "ß" to "SS" would
> imply that downcasing "SS" should produce "ß" and this is
> unwanted.
This is also covered in the Unicode Regexp spec.
http://www.unicode.org/reports/tr18/
* bug#13041: 24.2; diacritic-fold-search
2012-12-09 15:45 ` Stefan Monnier
@ 2012-12-10 7:57 ` Juri Linkov
2012-12-10 8:20 ` Eli Zaretskii
0 siblings, 1 reply; 83+ messages in thread
From: Juri Linkov @ 2012-12-10 7:57 UTC (permalink / raw)
To: Stefan Monnier; +Cc: 13041, perin, perin
> One step at a time: first, let's make sure we can match it. Then we'll
> worry about what the match-boundaries should be and how to display it
> (when we get to this point, we can even consider displaying suﬀer as
> suffer temporarily, just like we do when point is in the middle of
> a composition).
Isearch used to decompose a composition of a character with a combining
accent and display them separately in the middle of a composition
in Emacs 23. But as I see now, in the latest version Isearch in the
middle of a composition doesn't decompose them. It highlights the
matched character together with the still unmatched combining accent
as a whole. The current behavior seems better than the earlier one
because it doesn't change the displayed characters. This is more WYSIWYG.
* bug#13041: 24.2; diacritic-fold-search
2012-12-10 7:57 ` Juri Linkov
@ 2012-12-10 8:20 ` Eli Zaretskii
0 siblings, 0 replies; 83+ messages in thread
From: Eli Zaretskii @ 2012-12-10 8:20 UTC (permalink / raw)
To: Juri Linkov; +Cc: perin, 13041, perin
> From: Juri Linkov <juri@jurta.org>
> Date: Mon, 10 Dec 2012 09:57:49 +0200
> Cc: 13041@debbugs.gnu.org, perin@panix.com, perin@acm.org
>
> Isearch used to decompose a composition of a character with a combining
> accent and display them separately in the middle of a composition
> in Emacs 23.
AFAIR, this was due to problems in the display engine wrt composite
characters, and problems with composition support in general, problems
which are now solved.
* bug#13041: 24.2; diacritic-fold-search
2012-12-09 18:06 ` Drew Adams
@ 2012-12-11 7:19 ` Eli Zaretskii
0 siblings, 0 replies; 83+ messages in thread
From: Eli Zaretskii @ 2012-12-11 7:19 UTC (permalink / raw)
To: Drew Adams; +Cc: 13041, perin, perin
> From: "Drew Adams" <drew.adams@oracle.com>
> Date: Sun, 9 Dec 2012 10:06:44 -0800
> Cc: perin@panix.com, 13041@debbugs.gnu.org, perin@acm.org
>
> > Maybe it's explained here
> > http://www.unicode.org/faq/idn.html
> > in the answer to
> >
> > Q: Why does IDNA2003 map final sigma (ς) to sigma (σ), map
> > eszett (ß) to "ss", and delete ZWJ/ZWNJ?
> >
> > One possible interpretation of this is that mapping "ß" to "SS" would
> > imply that downcasing "SS" should produce "ß" and this is
> > unwanted.
>
> This is also covered in the Unicode Regexp spec.
> http://www.unicode.org/reports/tr18/
Another relevant Unicode document is the Unicode Collation Algorithm.
For the latest (yet unapproved) draft, see
http://www.unicode.org/reports/tr10/proposed.html
* bug#13041: 24.2; diacritic-fold-search
2012-11-30 18:22 bug#13041: 24.2; diacritic-fold-search Lewis Perin
2012-11-30 18:51 ` Juri Linkov
2012-11-30 19:31 ` Stefan Monnier
@ 2016-08-31 14:45 ` Michael Albinus
[not found] ` <22473.57245.883865.68491@panix5.panix.com>
2 siblings, 1 reply; 83+ messages in thread
From: Michael Albinus @ 2016-08-31 14:45 UTC (permalink / raw)
To: Lewis Perin; +Cc: 13041, perin
Lewis Perin <perin@panix.com> writes:
> Emacs search has long been able to toggle between (a) ignoring the
> distinction between upper- and lower-case characters
> (case-fold-search) and (b) searching for only one of the pair. One
> could say Emacs offers the choice between (a) searching for all
> members of a (2-member) equivalence class and (b) searching for only
> one member.
>
> There are larger equivalence classes of characters with practical use
> which Emacs is currently unaware of: the groups of characters
> consisting of an unadorned (ASCII) character plus all its
> diacritic-adorned versions. Currently, if I want to search for both
> “apres” and “après”, I need an additive regular expression. I would
> like to do this as easily as I can search for “apres” and “Apres”. I
> would be delighted if Emacs implemented the equivalence classes
> spelled out here:
>
> http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html
>
> I might add that diacritics folding is the default in web search
> engines. It is also a feature of at least one Web browser in
> searching the text of a displayed page (Chrome.)
Emacs 25.1 has introduced the new user option `search-default-mode'. If
set to `char-fold-to-regexp', the requested feature is available. See
etc/NEWS for further information.
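For readers without Emacs 25.1 at hand, the general idea of folding via a generated regexp can be sketched as follows. This is a rough Python analogue written for illustration, not the Emacs implementation; `fold_regexp` and `fold_find` are hypothetical names, and the sketch only handles the combining marks in the range U+0300 to U+036F.

```python
import re
import unicodedata


def fold_regexp(query):
    """Build a regexp matching QUERY with optional combining marks after each char."""
    parts = [re.escape(char) + r"[\u0300-\u036f]*" for char in query]
    return re.compile("".join(parts), re.IGNORECASE)


def fold_find(query, text):
    """Return the first substring of TEXT matching QUERY modulo accents, or None."""
    # Decompose first so that an accented letter becomes base letter + mark.
    decomposed = unicodedata.normalize("NFD", text)
    match = fold_regexp(query).search(decomposed)
    if match is None:
        return None
    # Recompose so the caller sees the text in its usual precomposed form.
    return unicodedata.normalize("NFC", match.group(0))
```

With this, `fold_find("apres", "Il arrive après midi")` returns "après", and the same query also finds "Apres", which is the behavior the original request asked for.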
So I propose to close this bug. There was a long discussion in the bug's
log back in 2012, but AFAICS, all proposals have been implemented.
> /Lew
Best regards, Michael.
* bug#13041: 24.2; diacritic-fold-search
[not found] ` <22473.57245.883865.68491@panix5.panix.com>
@ 2016-09-03 7:06 ` Michael Albinus
0 siblings, 0 replies; 83+ messages in thread
From: Michael Albinus @ 2016-09-03 7:06 UTC (permalink / raw)
To: perin; +Cc: 13041-done
Version: 25.1
nobody writes:
> This is great news! I’m afraid I’m not in a position to use 25.1 yet,
> but I look forward to it eagerly. Closing the bug seems right to me;
> if the new functionality has flaws, then they would be *new* bugs.
So I'm closing the bug.
> Thanks very much for letting me know!
>
> /Lew
Best regards, Michael.
end of thread, other threads:[~2016-09-03 7:06 UTC | newest]
Thread overview: 83+ messages
2012-11-30 18:22 bug#13041: 24.2; diacritic-fold-search Lewis Perin
2012-11-30 18:51 ` Juri Linkov
2012-11-30 21:07 ` Lewis Perin
2012-12-01 0:27 ` Juri Linkov
2012-12-01 0:47 ` Drew Adams
2012-12-01 0:49 ` Drew Adams
2012-12-01 1:20 ` Lew Perin
2012-12-01 6:50 ` Drew Adams
2012-12-01 8:32 ` Eli Zaretskii
2012-12-01 9:09 ` Eli Zaretskii
2012-12-01 16:38 ` Drew Adams
2012-12-02 0:27 ` Juri Linkov
2012-12-02 17:45 ` martin rudalics
2012-12-02 18:02 ` Eli Zaretskii
2012-12-03 10:16 ` martin rudalics
2012-12-03 16:47 ` Eli Zaretskii
2012-12-03 17:42 ` martin rudalics
2012-12-03 17:59 ` Eli Zaretskii
2012-12-04 17:54 ` martin rudalics
2012-12-04 19:28 ` Eli Zaretskii
2012-12-05 9:41 ` martin rudalics
2012-12-05 16:37 ` Eli Zaretskii
2012-12-06 10:31 ` martin rudalics
2012-12-06 17:48 ` Eli Zaretskii
2012-12-05 23:05 ` Juri Linkov
2012-12-06 10:32 ` martin rudalics
2012-12-04 20:12 ` Drew Adams
2012-12-04 23:15 ` Drew Adams
2012-12-05 6:50 ` Drew Adams
2012-12-05 9:42 ` martin rudalics
2012-12-05 15:38 ` Drew Adams
2012-12-06 9:25 ` Kenichi Handa
2012-12-06 10:34 ` martin rudalics
2012-12-06 17:50 ` Eli Zaretskii
2012-12-07 0:58 ` Juri Linkov
2012-12-07 6:33 ` Eli Zaretskii
2012-12-07 10:37 ` martin rudalics
2012-12-07 23:55 ` Juri Linkov
2012-12-08 8:20 ` Eli Zaretskii
2012-12-08 11:35 ` martin rudalics
2012-12-08 12:40 ` Eli Zaretskii
2012-12-08 11:21 ` martin rudalics
2012-12-08 23:07 ` Juri Linkov
2012-12-09 0:04 ` Drew Adams
2012-12-09 17:52 ` martin rudalics
2012-12-09 18:06 ` Drew Adams
2012-12-11 7:19 ` Eli Zaretskii
2012-12-08 23:54 ` Stefan Monnier
2012-12-09 0:14 ` Drew Adams
2012-12-09 15:42 ` Stefan Monnier
2012-12-09 18:00 ` Drew Adams
2012-12-09 0:35 ` Juri Linkov
2012-12-09 11:35 ` Stephen Berman
2012-12-09 17:52 ` martin rudalics
2012-12-09 15:45 ` Stefan Monnier
2012-12-10 7:57 ` Juri Linkov
2012-12-10 8:20 ` Eli Zaretskii
2012-12-05 9:42 ` martin rudalics
2012-12-05 9:42 ` martin rudalics
2012-12-05 15:38 ` Drew Adams
2012-12-05 15:51 ` Lewis Perin
2012-12-05 16:20 ` Drew Adams
2012-12-05 17:16 ` Drew Adams
2012-12-05 18:00 ` Drew Adams
2012-12-05 18:27 ` Eli Zaretskii
2012-12-06 10:31 ` martin rudalics
2012-12-06 15:59 ` Drew Adams
2012-12-06 10:28 ` martin rudalics
2012-12-06 17:53 ` Eli Zaretskii
2012-12-05 23:04 ` Juri Linkov
2012-12-06 10:31 ` martin rudalics
2012-12-07 0:52 ` Juri Linkov
2012-12-02 21:39 ` Juri Linkov
2012-12-03 10:16 ` martin rudalics
2012-12-04 0:17 ` Juri Linkov
2012-12-04 3:41 ` Eli Zaretskii
2012-12-02 18:16 ` Eli Zaretskii
2012-12-02 21:31 ` Juri Linkov
2012-12-05 19:17 ` Drew Adams
2012-12-05 21:19 ` Eli Zaretskii
2012-11-30 19:31 ` Stefan Monnier
2016-08-31 14:45 ` Michael Albinus
[not found] ` <22473.57245.883865.68491@panix5.panix.com>
2016-09-03 7:06 ` Michael Albinus