* bug#13041: 24.2; diacritic-fold-search @ 2012-11-30 18:22 Lewis Perin 2012-11-30 18:51 ` Juri Linkov ` (2 more replies) 0 siblings, 3 replies; 83+ messages in thread
From: Lewis Perin @ 2012-11-30 18:22 UTC (permalink / raw)
To: 13041

This is not a bug report but a feature request, so I am omitting diagnostic information.

Emacs search has long been able to toggle between (a) ignoring the distinction between upper- and lower-case characters (case-fold-search) and (b) searching for only one of the pair. One could say Emacs offers the choice between (a) searching for all members of a (2-member) equivalence class and (b) searching for only one member.

There are larger equivalence classes of characters with practical use which Emacs is currently unaware of: the groups of characters consisting of an unadorned (ASCII) character plus all its diacritic-adorned versions.

Currently, if I want to search for both “apres” and “après”, I need an additive regular expression. I would like to do this as easily as I can search for “apres” and “Apres”. I would be delighted if Emacs implemented the equivalence classes spelled out here:

http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html

I might add that diacritics folding is the default in web search engines. It is also a feature of at least one Web browser in searching the text of a displayed page (Chrome.)

I’m sure that maintaining the core of Emacs is a big job, and I’m grateful for the skill and effort that go into that task, including your consideration of this request!

/Lew
---
Lew Perin | perin@acm.org | http://babelcarp.org

^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-11-30 18:22 bug#13041: 24.2; diacritic-fold-search Lewis Perin @ 2012-11-30 18:51 ` Juri Linkov 2012-11-30 21:07 ` Lewis Perin 2012-11-30 19:31 ` Stefan Monnier 2016-08-31 14:45 ` Michael Albinus 2 siblings, 1 reply; 83+ messages in thread From: Juri Linkov @ 2012-11-30 18:51 UTC (permalink / raw) To: Lewis Perin; +Cc: 13041, perin > Currently, if I want to search for both “apres” and “après”, > I need an additive regular expression. I would like to do this as > easily as I can search for “apres” and “Apres”. I would be delighted > if Emacs implemented the equivalence classes spelled out here: > > http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html This could be implemented in isearch using a recipe from http://thread.gmane.org/gmane.emacs.devel/117003/focus=117959 Instead of hard-coding a list of equivalent characters I guess it should be possible to do this automatically using Unicode information about characters. ^ permalink raw reply [flat|nested] 83+ messages in thread
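For comparison with the automatic approach suggested above, here is a minimal sketch of the hard-coded variant: a small table of equivalence classes turned into a regexp with one character alternative per letter. The names `my-accent-classes' and `my-accent-fold-regexp' are hypothetical, and the table is a tiny illustrative subset, not the full list from the linked page.

```elisp
;; Sketch only: a hand-written table, not anything shipped with Emacs.
(defvar my-accent-classes
  '((?a . "aàáâãäå")
    (?c . "cç")
    (?e . "eèéêë")
    (?i . "iìíîï")
    (?n . "nñ")
    (?o . "oòóôõö")
    (?u . "uùúûü"))
  "Hypothetical alist mapping an ASCII letter to its accented variants.")

(defun my-accent-fold-regexp (string)
  "Return a regexp matching STRING with or without the listed diacritics."
  (mapconcat
   (lambda (char)
     (let ((class (cdr (assq char my-accent-classes))))
       (if class
           ;; Letter with known variants: match any member of the class.
           (concat "[" class "]")
         ;; Anything else must match literally.
         (regexp-quote (string char)))))
   string ""))
```

With this table, `(my-accent-fold-regexp "apres")' yields a pattern that matches both "apres" and "après". The obvious drawback is that every class has to be maintained by hand.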
* bug#13041: 24.2; diacritic-fold-search 2012-11-30 18:51 ` Juri Linkov @ 2012-11-30 21:07 ` Lewis Perin 2012-12-01 0:27 ` Juri Linkov 0 siblings, 1 reply; 83+ messages in thread From: Lewis Perin @ 2012-11-30 21:07 UTC (permalink / raw) To: Juri Linkov; +Cc: 13041 Juri Linkov writes: > > Currently, if I want to search for both “apres” and “après”, > > I need an additive regular expression. I would like to do this as > > easily as I can search for “apres” and “Apres”. I would be delighted > > if Emacs implemented the equivalence classes spelled out here: > > > > http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html > > This could be implemented in isearch using a recipe from > > http://thread.gmane.org/gmane.emacs.devel/117003/focus=117959 > > Instead of hard-coding a list of equivalent characters > I guess it should be possible to do this automatically > using Unicode information about characters. I never thought I was the first to wonder about this! In the last message of that thread, you say “Provided it doesn’t make the search slow, it would be nice to add it to Emacs activating on some user settings.” Do you remember if that technique turned out to be tolerably speedy? /Lew --- Lew Perin | perin@acm.org | http://babelcarp.org ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-11-30 21:07 ` Lewis Perin @ 2012-12-01 0:27 ` Juri Linkov 2012-12-01 0:47 ` Drew Adams 2012-12-01 8:32 ` Eli Zaretskii 0 siblings, 2 replies; 83+ messages in thread From: Juri Linkov @ 2012-12-01 0:27 UTC (permalink / raw) To: Lewis Perin; +Cc: 13041, perin > In the last message of that thread, you say “Provided it doesn’t make > the search slow, it would be nice to add it to Emacs activating on > some user settings.” Do you remember if that technique turned out to > be tolerably speedy? Yes, I have no problems with the speed. The problem is how to disable this feature when it is active. We need a special key to toggle it in Isearch. One variant is M-s ~ where the easy-to-type TILDE character represents diacritics. Also it's unclear whether the Isearch prompt should indicate its active state as e.g. Diacritic I-search: ^ permalink raw reply [flat|nested] 83+ messages in thread
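The toggle key proposed above could be wired up roughly as follows. This is only a sketch of the key binding: the flag `my-isearch-fold-diacritics' and the command `my-isearch-toggle-diacritics' are hypothetical names, and nothing here changes how Isearch actually matches text.

```elisp
;; Sketch only: flips a flag and reports it; the search itself is untouched.
(defvar my-isearch-fold-diacritics nil
  "Hypothetical flag: non-nil means the search should ignore diacritics.")

(defun my-isearch-toggle-diacritics ()
  "Toggle `my-isearch-fold-diacritics' and report the new state."
  (interactive)
  (setq my-isearch-fold-diacritics (not my-isearch-fold-diacritics))
  (message "Diacritic folding %s"
           (if my-isearch-fold-diacritics "on" "off")))

;; M-s ~ as suggested in the message above.
(define-key isearch-mode-map (kbd "M-s ~") #'my-isearch-toggle-diacritics)
```

A real implementation would also have to consult the flag when building the search pattern and, if desired, reflect it in the prompt.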
* bug#13041: 24.2; diacritic-fold-search 2012-12-01 0:27 ` Juri Linkov @ 2012-12-01 0:47 ` Drew Adams 2012-12-01 0:49 ` Drew Adams 2012-12-01 8:32 ` Eli Zaretskii 1 sibling, 1 reply; 83+ messages in thread From: Drew Adams @ 2012-12-01 0:47 UTC (permalink / raw) To: 'Juri Linkov', 'Lewis Perin'; +Cc: 13041, perin > it's unclear whether the Isearch prompt should indicate > its active state Ǐsearch (But perhaps that suggests recognizing, rather than ignoring, diacritics.) ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-01 0:47 ` Drew Adams @ 2012-12-01 0:49 ` Drew Adams 2012-12-01 1:20 ` Lew Perin 0 siblings, 1 reply; 83+ messages in thread From: Drew Adams @ 2012-12-01 0:49 UTC (permalink / raw) To: 'Juri Linkov', 'Lewis Perin'; +Cc: 13041, perin > > it's unclear whether the Isearch prompt should indicate > > its active state > > Isearch > > (But perhaps that suggests recognizing, rather than ignoring, > diacritics.) Hm. That was a capital I with caron when I sent it... ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-01 0:49 ` Drew Adams @ 2012-12-01 1:20 ` Lew Perin 2012-12-01 6:50 ` Drew Adams 0 siblings, 1 reply; 83+ messages in thread From: Lew Perin @ 2012-12-01 1:20 UTC (permalink / raw) To: Drew Adams; +Cc: <13041@debbugs.gnu.org>, <perin@acm.org> On Nov 30, 2012, at 7:49 PM, "Drew Adams" <drew.adams@oracle.com> wrote: >>> it's unclear whether the Isearch prompt should indicate >>> its active state >> >> Isearch >> >> (But perhaps that suggests recognizing, rather than ignoring, >> diacritics.) > > Hm. That was a capital I with caron when I sent it... A caron-topped capital I is exactly what I got (on my iPhone.) /Lew --- Lew Perin | perin@acm.org | http://babelcarp.org ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-01 1:20 ` Lew Perin @ 2012-12-01 6:50 ` Drew Adams 0 siblings, 0 replies; 83+ messages in thread From: Drew Adams @ 2012-12-01 6:50 UTC (permalink / raw) To: 'Lew Perin'; +Cc: 13041, perin > >>> it's unclear whether the Isearch prompt should indicate > >>> its active state > >> > >> Isearch > >> > >> (But perhaps that suggests recognizing, rather than ignoring, > >> diacritics.) > > > > Hm. That was a capital I with caron when I sent it... > > A caron-topped capital I is exactly what I got (on my iPhone.) Great. I guess it's the encoding used in my mail client that's showing it with no marks. ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-01 0:27 ` Juri Linkov 2012-12-01 0:47 ` Drew Adams @ 2012-12-01 8:32 ` Eli Zaretskii 2012-12-01 9:09 ` Eli Zaretskii ` (2 more replies) 1 sibling, 3 replies; 83+ messages in thread
From: Eli Zaretskii @ 2012-12-01 8:32 UTC (permalink / raw)
To: Juri Linkov; +Cc: perin, 13041, perin

> From: Juri Linkov <juri@jurta.org>
> Date: Sat, 01 Dec 2012 02:27:40 +0200
> Cc: 13041@debbugs.gnu.org, perin@acm.org
>
> > In the last message of that thread, you say “Provided it doesn’t make
> > the search slow, it would be nice to add it to Emacs activating on
> > some user settings.” Do you remember if that technique turned out to
> > be tolerably speedy?
>
> Yes, I have no problems with the speed. The problem is how to
> disable this feature when it is active. We need a special key
> to toggle it in Isearch. One variant is M-s ~ where the easy-to-type
> TILDE character represents diacritics. Also it's unclear whether the
> Isearch prompt should indicate its active state as e.g.

I don't understand why this thread is talking only about Latin characters with diacritics. That is a special case of what Unicode calls "compatibility equivalence" (q.e.). For example, even in the Latin environments, don't you want to find "sniﬀ" when searching for "sniff", and vice versa? And there are similar issues in many non-Latin scripts.

The decomposition of a character such as 'ﬀ' is given by the Unicode database, for example:

  FB00;LATIN SMALL LIGATURE FF;Ll;0;L;<compat> 0066 0066;;;;N;;;;;
                                      ^^^^^^^^^^^^^^^^^^

(66 hex, or 102 decimal, is the codepoint of 'f'). Emacs already supports these decomposition properties. E.g.:

  (get-char-code-property ?ﬀ 'decomposition)

   => (compat 102 102)

Another example, closer to the issue that triggered this thread:

  (get-char-code-property ?è 'decomposition)

   => (101 768)

(If you want to understand why the previous example included "compat" in the result, while this one doesn't, read more about Unicode normalization forms. The distinction is irrelevant for the current discussion.)

Using these properties, every search string can be converted to a sequence of non-decomposable characters (this process is recursive, because the 'decomposition' property can use characters that themselves are decomposable). If the user wants to ignore diacritics, then the diacritics should be dropped from the decomposition sequence before starting the search. E.g., for the decomposition of è above, we will drop the 768 and will be left with 101, which is 'e'. Then searching for that string should apply the same decomposition transformation to the text being searched, when comparing them.

This would be the most general way of solving this issue, a way that is not limited to diacritics nor to Latin scripts. And doing that will move Emacs closer to the goal of being Unicode compatible, since support for this is required by the Unicode Standard.

By contrast, building and using custom data bases of equivalences that are limited to diacritics in Latin scripts is not moving Emacs towards that goal. It's just a hack, IMO.
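The recursive decomposition described above can be sketched as follows. This is a minimal illustration, not an implementation from the thread: `my-fold-char' and `my-fold-string' are hypothetical names, and "diacritic" is assumed to mean any character whose general category is Mn (nonspacing mark).

```elisp
;; Sketch only: hypothetical helpers built on `get-char-code-property'.
(defun my-fold-char (char)
  "Return the list of base characters for CHAR, with nonspacing marks dropped.
The `decomposition' property is followed recursively."
  (let ((decomp (get-char-code-property char 'decomposition)))
    ;; A leading symbol such as `compat' is a tag, not a character.
    (while (and decomp (symbolp (car decomp)))
      (setq decomp (cdr decomp)))
    (cond
     ;; Nonspacing marks are the diacritics we want to ignore.
     ((eq (get-char-code-property char 'general-category) 'Mn) nil)
     ;; No decomposition, or the character decomposes to itself.
     ((or (null decomp) (equal decomp (list char))) (list char))
     ;; Otherwise fold each constituent in turn.
     (t (apply #'append (mapcar #'my-fold-char decomp))))))

(defun my-fold-string (string)
  "Return STRING with every character folded by `my-fold-char'."
  (concat (apply #'append (mapcar #'my-fold-char string))))
```

Applying `my-fold-string' to both the search string and the text being compared is one way to realise the "same transformation on both sides" idea; doing that efficiently inside the search primitives is a separate problem that this sketch does not address.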
* bug#13041: 24.2; diacritic-fold-search 2012-12-01 8:32 ` Eli Zaretskii @ 2012-12-01 9:09 ` Eli Zaretskii 2012-12-01 16:38 ` Drew Adams 2012-12-02 0:27 ` Juri Linkov 2 siblings, 0 replies; 83+ messages in thread From: Eli Zaretskii @ 2012-12-01 9:09 UTC (permalink / raw) To: juri, perin; +Cc: 13041, perin > Date: Sat, 01 Dec 2012 10:32:35 +0200 > From: Eli Zaretskii <eliz@gnu.org> > Cc: perin@panix.com, 13041@debbugs.gnu.org, perin@acm.org > > I don't understand why this thread is talking only about Latin > characters with diacritics. That is a special case of what Unicode > calls "compatibility equivalence" (q.e.). ^^^^ I meant "q.v.", of course. Sorry. ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-01 8:32 ` Eli Zaretskii 2012-12-01 9:09 ` Eli Zaretskii @ 2012-12-01 16:38 ` Drew Adams 2012-12-02 0:27 ` Juri Linkov 2 siblings, 0 replies; 83+ messages in thread From: Drew Adams @ 2012-12-01 16:38 UTC (permalink / raw) To: 'Eli Zaretskii', 'Juri Linkov'; +Cc: perin, 13041, perin > I don't understand why this thread is talking only about Latin > characters with diacritics. That is a special case of what Unicode > calls "compatibility equivalence" (q.e.). For example, even in the > Latin environments, don't you want to find "sni?" when searching for > "sniff", and vice versa? And there are similar issues in many > non-Latin scripts. Actually, in the original thread I made the same point. Please see that discussion for this and other points. http://lists.gnu.org/archive/html/help-gnu-emacs/2012-11/msg00429.html > The decomposition of a character such as '?' is given by > the Unicode database... Emacs already supports these > decomposition properties. That's good news (new to me). So it sounds like even the most hopeful wanna-haves of the discussion could perhaps be realized without too much trouble. > Using these properties, every search string can be converted to a > sequence of non-decomposable characters (this process is recursive, > because the 'decomposition' property can use characters that > themselves are decomposable). If the user wants to ignore diacritics, > then the diacritics should be dropped from the decomposition sequence > before starting the search. E.g., for the decomposition of è above, > we will drop the 768 and will be left with 101, which is 'e'. Then > searching for that string should apply the same decomposition > transformation to the text being searched, when comparing them. > > This would be the most general way of solving this issue, a way that > is not limited to diacritics nor to Latin scripts. 
And doing that > will move Emacs closer to the goal of being Unicode compatible, since > support for this is required by the Unicode Standard. This sounds great. I really hope someone with the time and knowledge adds such a feature soon (even though, to be clear, I personally do not have much need for it). I think it would be very handy for many users - most welcome. ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-01 8:32 ` Eli Zaretskii 2012-12-01 9:09 ` Eli Zaretskii 2012-12-01 16:38 ` Drew Adams @ 2012-12-02 0:27 ` Juri Linkov 2012-12-02 17:45 ` martin rudalics 2012-12-02 18:16 ` Eli Zaretskii 2 siblings, 2 replies; 83+ messages in thread
From: Juri Linkov @ 2012-12-02 0:27 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: perin, 13041, perin

> Using these properties, every search string can be converted to a
> sequence of non-decomposable characters (this process is recursive,
> because the 'decomposition' property can use characters that
> themselves are decomposable). If the user wants to ignore diacritics,
> then the diacritics should be dropped from the decomposition sequence
> before starting the search. E.g., for the decomposition of è above,
> we will drop the 768 and will be left with 101, which is 'e'. Then
> searching for that string should apply the same decomposition
> transformation to the text being searched, when comparing them.

Yes, using the `decomposition' property would be better than hard-coding these decomposition mappings. Though I'm surprised to see case mappings hard-coded in lisp/international/characters.el instead of using the properties `uppercase' and `lowercase' during creation of case tables.

But nevertheless the `decomposition' property should be used to find all decomposable characters. The question is how to use them in the search. One solution is to use the case tables. I tried to build the case table with the decomposed characters retrieved using the `decomposition' property recursively:

(defvar decomposition-table nil)

(defun make-decomposition-table ()
  (let ((table (standard-case-table))
        canon)
    (setq canon (copy-sequence table))
    (let ((c #x0000) d)
      (while (<= c #xFFFD)
        (make-decomposition-table-1 canon c c)
        (setq c (1+ c))))
    (set-char-table-extra-slot table 1 canon)
    (set-char-table-extra-slot table 2 nil)
    (setq decomposition-table table)))

(defun make-decomposition-table-1 (canon c0 c1)
  (let ((d (get-char-code-property c1 'decomposition)))
    (when d
      (unless (characterp (car d)) (pop d))
      (if (eq c1 (car d))
          (aset canon c0 (car d))
        (make-decomposition-table-1 canon c0 (car d))))))

(make-decomposition-table)

Then a new Isearch command (the existing `isearch-toggle-case-fold' can't be used because it enables/disables the standard case table) could toggle between the current case table and the decomposition case table using

(set-case-table decomposition-table)

After evaluating this, Isearch correctly finds all related characters in every row of this example:

http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html

But it seems using the case table for decomposition has one limitation. I see no way to ignore combining accent characters in the case table, i.e. to map combining accent characters to nothing. These characters have the general-category "Mn (Mark, Nonspacing)", so they should be ignored in the search.

An alternative would be to build a regexp from the search string like building a regexp for word-search:

(define-key isearch-mode-map "\M-sd" 'isearch-toggle-decomposition)

(defun isearch-toggle-decomposition ()
  "Toggle Unicode decomposition searching on or off."
  (interactive)
  (setq isearch-word (unless (eq isearch-word 'isearch-decomposition-regexp)
                       'isearch-decomposition-regexp))
  (if isearch-word (setq isearch-regexp nil))
  (setq isearch-success t isearch-adjusted t)
  (isearch-update))

(defun isearch-decomposition-regexp (string &optional _lax)
  "Return a regexp that matches decomposed Unicode characters in STRING."
  (mapconcat
   (lambda (c0)
     (if (eq (get-char-code-property c0 'general-category) 'Mn)
         ;; Mark-Nonspacing chars like COMBINING ACUTE ACCENT are optional.
         (concat (string c0) "?")
       (let ((c1 c0) c2 chars)
         (while (and (setq c2 (aref (char-table-extra-slot decomposition-table 2) c1))
                     (not (eq c2 c0)))
           (push c2 chars)
           (setq c1 c2))
         (if chars
             ;; Character alternatives from the case equivalences table.
             (concat "[" (string c0) chars "]")
           (string c0)))))
   string ""))

(put 'isearch-decomposition-regexp 'isearch-message-prefix "deco ")

This uses the decomposition table created above but instead of activating it, it's necessary to "shuffle" the equivalences table with the following code that prepares the table but doesn't enable it in the current buffer:

(with-temp-buffer
  (set-case-table decomposition-table))

The advantage of the regexp-based approach is making combining accents optional in the search string. But there is another problem: how to ignore combining accents in the buffer when the search string doesn't contain them. With regexps this means adding a group of all possible combining accents after every character in the search string like turning a search string like "abc" into "a[́̂̃̄̆]?b[́̂̃̄̆]?c[́̂̃̄̆]?". This would make the search slow, and I have no better idea.
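The last idea above, allowing combining accents in the buffer after every character of the search string, can be sketched with a character range instead of an explicit list. This is an illustration under stated assumptions: `my-marks-tolerant-regexp' is a hypothetical name, and only the Combining Diacritical Marks block (U+0300 to U+036F) is covered, which is narrower than "all characters of category Mn".

```elisp
;; Sketch only: one fixed range of combining marks, not the full Mn category.
(defun my-marks-tolerant-regexp (string)
  "Return a regexp matching STRING with optional combining marks after each char."
  (mapconcat
   (lambda (char)
     ;; Literal character, then zero or more marks from U+0300..U+036F.
     (concat (regexp-quote (string char)) "[\u0300-\u036f]*"))
   string ""))
```

This matches decomposed text such as "a" followed by U+0301, but it does not match precomposed characters like "á"; those still need the equivalence classes discussed earlier. Whether the resulting patterns are fast enough on large buffers is exactly the open question raised in the message.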
* bug#13041: 24.2; diacritic-fold-search 2012-12-02 0:27 ` Juri Linkov @ 2012-12-02 17:45 ` martin rudalics 2012-12-02 18:02 ` Eli Zaretskii 2012-12-02 21:39 ` Juri Linkov 2012-12-02 18:16 ` Eli Zaretskii 1 sibling, 2 replies; 83+ messages in thread
From: martin rudalics @ 2012-12-02 17:45 UTC (permalink / raw)
To: Juri Linkov; +Cc: perin, perin, 13041

> But nevertheless the `decomposition' property should be used to find
> all decomposable characters. The question is how to use them in the search.

Whatever solution you find most suitable here, it would be nice to come up with a similar solution for sorting. I've been playing around with a function like

(defun decomposed-string-lessp (string1 string2)
  "Return t if STRING1 is decomposition-less than STRING2."
  (let* ((length1 (length string1))
         (length2 (length string2))
         (min-length (min length1 length2))
         (index 0)
         type1 type2)
    (catch 'found
      (while (< index min-length)
        (setq type1 (car (get-char-code-property
                          (elt string1 index) 'decomposition)))
        (setq type2 (car (get-char-code-property
                          (elt string2 index) 'decomposition)))
        (cond
         ((< type1 type2)
          (throw 'found t))
         ((> type1 type2)
          (throw 'found nil)))
        ;; Continue.
        (setq index (1+ index)))
      ;; Shorter is less.
      (< length1 length2))))

but am not sure whether I'm missing something wrt the return value of `get-char-code-property'.

martin
* bug#13041: 24.2; diacritic-fold-search 2012-12-02 17:45 ` martin rudalics @ 2012-12-02 18:02 ` Eli Zaretskii 2012-12-03 10:16 ` martin rudalics 2012-12-02 21:39 ` Juri Linkov 1 sibling, 1 reply; 83+ messages in thread From: Eli Zaretskii @ 2012-12-02 18:02 UTC (permalink / raw) To: martin rudalics; +Cc: perin, 13041, perin > Date: Sun, 02 Dec 2012 18:45:38 +0100 > From: martin rudalics <rudalics@gmx.at> > CC: Eli Zaretskii <eliz@gnu.org>, perin@panix.com, 13041@debbugs.gnu.org, > perin@acm.org > > (setq type1 (car (get-char-code-property > (elt string1 index) 'decomposition))) > (setq type2 (car (get-char-code-property > (elt string2 index) 'decomposition))) > (cond > ((< type1 type2) > (throw 'found t)) > ((> type1 type2) > (throw 'found nil))) > ;; Continue. > (setq index (1+ index))) > ;; Shorter is less. > (< length1 length2)))) > > but am not sure whether I'm missing something wrt the return value of > `get-char-code-property'. Maybe only the fact that it can return a list whose car is 'compat', see the examples I posted. ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-02 18:02 ` Eli Zaretskii @ 2012-12-03 10:16 ` martin rudalics 2012-12-03 16:47 ` Eli Zaretskii 0 siblings, 1 reply; 83+ messages in thread
From: martin rudalics @ 2012-12-03 10:16 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: perin, 13041, perin

> Maybe only the fact that it can return a list whose car is 'compat',
> see the examples I posted.

So I need two indices for looping. But what are the guidelines to interpret `compat'? Does every list starting with a `compat' mean that the remaining entries of that list represent the constituents of that composite?

And how do I now call `put-char-code-property' to make the German sharp "s" ("ß") equivalent to "ss"? Or am I not supposed to do such a thing?

martin
* bug#13041: 24.2; diacritic-fold-search 2012-12-03 10:16 ` martin rudalics @ 2012-12-03 16:47 ` Eli Zaretskii 2012-12-03 17:42 ` martin rudalics 0 siblings, 1 reply; 83+ messages in thread
From: Eli Zaretskii @ 2012-12-03 16:47 UTC (permalink / raw)
To: martin rudalics; +Cc: perin, 13041, perin

> Date: Mon, 03 Dec 2012 11:16:21 +0100
> From: martin rudalics <rudalics@gmx.at>
> CC: juri@jurta.org, perin@panix.com, 13041@debbugs.gnu.org,
>  perin@acm.org
>
> But what are the guidelines to interpret `compat'?

For the purposes of comparing strings, both 'compatibility' and 'canonical' decompositions should be treated the same, AFAIU. You can find the details here:

  http://unicode.org/reports/tr15/

> Does every list starting with a `compat' mean that the remaining
> entries of that list represent the constituents of that composite?

Yes. This comes directly from UnicodeData.txt, e.g.:

  0132;LATIN CAPITAL LIGATURE IJ;Lu;0;L;<compat> 0049 004A;;;;N;LATIN CAPITAL LETTER I J;;;0133;
                                        ^^^^^^^^^^^^^^^^^^

> And how do I now call `put-char-code-property' to make the German sharp
> "s" ("ß") equivalent to "ss"? Or am I not supposed to do such a thing?

That's already set up in the appropriate case table, I think. But it is not a compatibility decomposition, AFAIK.
* bug#13041: 24.2; diacritic-fold-search 2012-12-03 16:47 ` Eli Zaretskii @ 2012-12-03 17:42 ` martin rudalics 2012-12-03 17:59 ` Eli Zaretskii 0 siblings, 1 reply; 83+ messages in thread From: martin rudalics @ 2012-12-03 17:42 UTC (permalink / raw) To: Eli Zaretskii; +Cc: perin, 13041, perin >> And how do I now call `put-char-code-property' to make the German sharp >> "s" ("ß") equivalent to "ss"? Or am I not supposed to do such a thing? > > That's already set up in the appropriate case table, I think. Why in a case table? Both "ß" and "ss" are lower case. > But it > is not a compatibility decomposition, AFAIK. But I can make it one? martin ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-03 17:42 ` martin rudalics @ 2012-12-03 17:59 ` Eli Zaretskii 2012-12-04 17:54 ` martin rudalics 0 siblings, 1 reply; 83+ messages in thread From: Eli Zaretskii @ 2012-12-03 17:59 UTC (permalink / raw) To: martin rudalics; +Cc: perin, 13041, perin > Date: Mon, 03 Dec 2012 18:42:53 +0100 > From: martin rudalics <rudalics@gmx.at> > CC: juri@jurta.org, perin@panix.com, 13041@debbugs.gnu.org, > perin@acm.org > > >> And how do I now call `put-char-code-property' to make the German sharp > >> "s" ("ß") equivalent to "ss"? Or am I not supposed to do such a thing? > > > > That's already set up in the appropriate case table, I think. > > Why in a case table? Both "ß" and "ss" are lower case. I meant the relation "ß" => "SS". > > But it > > is not a compatibility decomposition, AFAIK. > > But I can make it one? Yes, you can modify the table set up by uni-decomposition.el. I think. ^ permalink raw reply [flat|nested] 83+ messages in thread
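One way to get "ß" treated as "ss" without touching the table built by uni-decomposition.el is to keep user overrides in a separate list that is consulted first. This is a sketch of that alternative, not the approach suggested above: `my-extra-decompositions' and `my-decomposition' are hypothetical names.

```elisp
;; Sketch only: user overrides live in their own alist; the Unicode
;; property table itself is left unmodified.
(defvar my-extra-decompositions
  '((#xdf . (?s ?s)))                   ; #xdf is LATIN SMALL LETTER SHARP S
  "Hypothetical user overrides: alist of (CHAR . LIST-OF-CHARS).")

(defun my-decomposition (char)
  "Return the decomposition of CHAR as a list of characters.
User overrides win over the Unicode `decomposition' property, and a
leading tag symbol such as `compat' is dropped."
  (or (cdr (assq char my-extra-decompositions))
      (let ((decomp (get-char-code-property char 'decomposition)))
        (while (and decomp (symbolp (car decomp)))
          (setq decomp (cdr decomp)))
        (or decomp (list char)))))
```

Keeping the overrides separate makes it easy to see which equivalences come from Unicode and which are local additions.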
* bug#13041: 24.2; diacritic-fold-search 2012-12-03 17:59 ` Eli Zaretskii @ 2012-12-04 17:54 ` martin rudalics 2012-12-04 19:28 ` Eli Zaretskii 2012-12-04 20:12 ` Drew Adams 0 siblings, 2 replies; 83+ messages in thread
From: martin rudalics @ 2012-12-04 17:54 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: perin, 13041, perin

> Yes, you can modify the table set up by uni-decomposition.el. I
> think.

Seems to work well. The function I came up with goes as below.

Thanks for the hints, martin

(defun decomposed-string-lessp (string1 string2)
  "Return t if STRING1 is decomposition-less than STRING2."
  (let* ((length1 (length string1))
         (length2 (length string2))
         (min-length (min length1 length2))
         (index1 0)
         (index2 0)
         prop1 prop2 type1 type2 compat1 compat2)
    (catch 'found
      (while (and (< index1 length1) (< index2 length2))
        (setq prop1 (get-char-code-property
                     (downcase (elt string1 index1)) 'decomposition))
        (setq type1 (car prop1))
        (setq prop2 (get-char-code-property
                     (downcase (elt string2 index2)) 'decomposition))
        (setq type2 (car prop2))
        (cond
         ((and (eq type1 'compat) (eq type2 'compat))
          (setq compat1 (concat (cdr prop1)))
          (setq compat2 (concat (cdr prop2)))
          (let ((value (compare-strings compat1 0 nil compat2 0 nil t)))
            (cond
             ((eq value t)
              (setq index1 (1+ index1))
              (setq index2 (1+ index2)))
             ((< value 0)
              (throw 'found t))
             ((> value 0)
              (throw 'found nil)))))
         ((eq type1 'compat)
          (setq compat1 (concat (cdr prop1)))
          (let ((value (compare-strings
                        compat1 0 nil
                        string2 index2 (min (+ index2 (length compat1)) length2)
                        t)))
            (cond
             ((eq value t)
              (setq index1 (1+ index1))
              (setq index2 (+ index2 (length compat1))))
             ((< value 0)
              (throw 'found t))
             ((> value 0)
              (throw 'found nil)))))
         ((eq type2 'compat)
          (setq compat2 (concat (cdr prop2)))
          (let ((value (compare-strings
                        string1 index1 (min (+ index1 (length compat2)) length1)
                        compat2 0 nil
                        t)))
            (cond
             ((eq value t)
              (setq index1 (+ index1 (length compat2)))
              (setq index2 (1+ index2)))
             ((< value 0)
              (throw 'found t))
             ((> value 0)
              (throw 'found nil)))))
         ((< type1 type2)
          (throw 'found t))
         ((> type1 type2)
          (throw 'found nil))
         (t
          (setq index1 (1+ index1))
          (setq index2 (1+ index2)))))
      ;; Shorter is less.
      (< length1 length2))))
* bug#13041: 24.2; diacritic-fold-search 2012-12-04 17:54 ` martin rudalics @ 2012-12-04 19:28 ` Eli Zaretskii 2012-12-05 9:41 ` martin rudalics 2012-12-04 20:12 ` Drew Adams 1 sibling, 1 reply; 83+ messages in thread From: Eli Zaretskii @ 2012-12-04 19:28 UTC (permalink / raw) To: martin rudalics; +Cc: perin, 13041, perin > Date: Tue, 04 Dec 2012 18:54:59 +0100 > From: martin rudalics <rudalics@gmx.at> > CC: juri@jurta.org, perin@panix.com, 13041@debbugs.gnu.org, > perin@acm.org > > > Yes, you can modify the table set up by uni-decomposition.el. I > > think. > > Seems to work well. The function I came up with goes as below. How about putting it in subr.el? ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-04 19:28 ` Eli Zaretskii @ 2012-12-05 9:41 ` martin rudalics 2012-12-05 16:37 ` Eli Zaretskii 2012-12-05 23:05 ` Juri Linkov 0 siblings, 2 replies; 83+ messages in thread From: martin rudalics @ 2012-12-05 9:41 UTC (permalink / raw) To: Eli Zaretskii; +Cc: perin, 13041, perin > How about putting it in subr.el? If I correctly understand Juri, I next have to deal with things like (get-char-code-property #xff59 'decomposition) and related issues we might unearth in the course of this. Also, while currently sorting is stable in the sense that with respect to diacritics text remains unchanged from the original order, this is not nice for sorting larger pieces of text. So I'd rather have to use the second list element returned by `get-char-code-property' to make sure that, for example, "e" gets always sorted before "è" before "é". martin ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-05 9:41 ` martin rudalics @ 2012-12-05 16:37 ` Eli Zaretskii 2012-12-06 10:31 ` martin rudalics 2012-12-05 23:05 ` Juri Linkov 1 sibling, 1 reply; 83+ messages in thread
From: Eli Zaretskii @ 2012-12-05 16:37 UTC (permalink / raw)
To: martin rudalics; +Cc: perin, 13041, perin

> Date: Wed, 05 Dec 2012 10:41:40 +0100
> From: martin rudalics <rudalics@gmx.at>
> CC: juri@jurta.org, perin@panix.com, 13041@debbugs.gnu.org,
>  perin@acm.org
>
> > How about putting it in subr.el?
>
> If I correctly understand Juri, I next have to deal with things like
>
> (get-char-code-property #xff59 'decomposition)
>
> and related issues we might unearth in the course of this.

My reading of the table in

  http://www.unicode.org/reports/tr44/#Character_Decomposition_Mappings

is that you should ignore any car of the list returned by get-char-code-property if it does not pass the characterp test (or those that do pass the symbolp test). That is, the character #xff59 should sort exactly like lower-case y.
* bug#13041: 24.2; diacritic-fold-search 2012-12-05 16:37 ` Eli Zaretskii @ 2012-12-06 10:31 ` martin rudalics 2012-12-06 17:48 ` Eli Zaretskii 0 siblings, 1 reply; 83+ messages in thread From: martin rudalics @ 2012-12-06 10:31 UTC (permalink / raw) To: Eli Zaretskii; +Cc: perin, 13041, perin > My reading of the table in > > http://www.unicode.org/reports/tr44/#Character_Decomposition_Mappings > > you should ignore any car of the list returned by > get-char-code-property if it does not pass the characterp test (or > those that do pass the symbolp test). That is, the character #xff59 > should sort exactly like lower-case y. That is, `wide' and `compat' are completely equivalent in this regard? martin ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-06 10:31 ` martin rudalics @ 2012-12-06 17:48 ` Eli Zaretskii 0 siblings, 0 replies; 83+ messages in thread From: Eli Zaretskii @ 2012-12-06 17:48 UTC (permalink / raw) To: martin rudalics; +Cc: perin, 13041, perin > Date: Thu, 06 Dec 2012 11:31:31 +0100 > From: martin rudalics <rudalics@gmx.at> > CC: juri@jurta.org, perin@panix.com, 13041@debbugs.gnu.org, > perin@acm.org > > > My reading of the table in > > > > http://www.unicode.org/reports/tr44/#Character_Decomposition_Mappings > > > > you should ignore any car of the list returned by > > get-char-code-property if it does not pass the characterp test (or > > those that do pass the symbolp test). That is, the character #xff59 > > should sort exactly like lower-case y. > > That is, `wide' and `compat' are completely equivalent in this regard? Yes. They are all different forms of the same character, which should all compare equal in this context. ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-05 9:41 ` martin rudalics 2012-12-05 16:37 ` Eli Zaretskii @ 2012-12-05 23:05 ` Juri Linkov 2012-12-06 10:32 ` martin rudalics 1 sibling, 1 reply; 83+ messages in thread From: Juri Linkov @ 2012-12-05 23:05 UTC (permalink / raw) To: martin rudalics; +Cc: perin, perin, 13041 > If I correctly understand Juri, I next have to deal with things like > > (get-char-code-property #xff59 'decomposition) > > and related issues we might unearth in the course of this. Only until bug#13084 is fixed that is a separate problem. > Also, while currently sorting is stable in the sense that with respect > to diacritics text remains unchanged from the original order, this is > not nice for sorting larger pieces of text. So I'd rather have to use > the second list element returned by `get-char-code-property' to make > sure that, for example, "e" gets always sorted before "è" before "é". In principle, you could do this by let-binding a new variable `sort-decomposition' to non-nil for stable sorting. And later to let-bind `sort-decomposition' to nil for last-resort comparison where equal lines (equal according to non-nil `sort-decomposition') will be sorted without regard to decomposition. ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-05 23:05 ` Juri Linkov @ 2012-12-06 10:32 ` martin rudalics 0 siblings, 0 replies; 83+ messages in thread From: martin rudalics @ 2012-12-06 10:32 UTC (permalink / raw) To: Juri Linkov; +Cc: perin, perin, 13041 > And later to let-bind `sort-decomposition' to nil for > last-resort comparison where equal lines > (equal according to non-nil `sort-decomposition') > will be sorted without regard to decomposition. Indeed. In any case, equal lines shouldn't be the rule - especially with functions that remove duplicates ;-) martin ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-04 17:54 ` martin rudalics 2012-12-04 19:28 ` Eli Zaretskii @ 2012-12-04 20:12 ` Drew Adams 2012-12-04 23:15 ` Drew Adams 2012-12-05 9:42 ` martin rudalics 1 sibling, 2 replies; 83+ messages in thread From: Drew Adams @ 2012-12-04 20:12 UTC (permalink / raw) To: 'martin rudalics', 'Eli Zaretskii'; +Cc: perin, 13041, perin

> The function [Martin] came up with goes as below.
> (defun decomposed-string-lessp (string1 string2)
>   "Return t if STRING1 is decomposition-less than STRING2."
>   ...

I know nothing about character composition and have not tested this with anything but a few western accents. But this seems like good stuff.

1. Assuming this or similar is added to Emacs (please do), please consider modifying it to respect `case-fold-search'. These modified lines do that:

  (setq prop1 (get-char-code-property
               (if case-fold-search
                   (downcase (elt string1 index1))
                 (elt string1 index1))
               'decomposition))

  [Same thing for prop2 with string2 and index2.]

  (let ((value (compare-strings compat1 0 nil compat2 0 nil case-fold-search)))

2. In addition, consider updating `string-lessp' to be sensitive to a variable such as this:

  (defvar ignore-diacritics nil
    "Non-nil means ignore diacritics for string comparisons.")

With that, an alternative to hard-coding a call to `decomposed-string-lessp' is to bind `ignore-diacritics' and use `string-lessp'. A similar change could be made for `compare-strings': reflect the value of `ignore-diacritics'. Or since that function has made the choice to pass case-sensitivity as a parameter instead of respecting `case-fold-search', pass another parameter for diacritic sensitivity.

3. More general than #2 would be a function like this, which is sensitive to both `ignore-diacritics' and `case-fold-search' (this assumes the change suggested above in #1 for `decomposed-string-lessp').

  (defun my-string-lessp (s1 s2)
    "..."
    (if ignore-diacritics
        (decomposed-string-lessp s1 s2)
      (when case-fold-search (setq s1 (upcase s1) s2 (upcase s2)))
      (string-lessp s1 s2)))

Dunno a good name for this. It's too late to let `string-lessp' itself act like this - that would break stuff.

4. Even better than hard-coding `case-fold-search' in `my-string-lessp' and `decomposed-string-lessp' would be to have those functions be sensitive to a variable such as this:

  (defvar string-case-variable 'case-fold-search
    "Value is a case-sensitivity variable such as `case-fold-search'.
  The values of that variable must be like those for `case-fold-search':
  nil means case-sensitive, non-nil means case-insensitive.")

Code could then bind `string-case-variable' to, say, `(not completion-ignore-case)' or to any other case-sensitivity controlling sexp, when appropriate. This would have the advantages offered by passing an explicit case-sensitivity parameter, as in `compare-strings', but also the advantages of dynamic scope: binding `string-case-variable' to affect all comparisons within scope.

Comparers such as `(my-)string-lessp' are often used as arguments to higher-order functions that treat them as (only) binary predicates, i.e., predicates where any additional parameters specifying case or diacritic sensitivity are ignored. ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-04 20:12 ` Drew Adams @ 2012-12-04 23:15 ` Drew Adams 2012-12-05 6:50 ` Drew Adams 2012-12-05 9:42 ` martin rudalics 2012-12-05 9:42 ` martin rudalics 1 sibling, 2 replies; 83+ messages in thread From: Drew Adams @ 2012-12-04 23:15 UTC (permalink / raw) To: 'martin rudalics', 'Eli Zaretskii'; +Cc: perin, 13041, perin BTW, there are a couple of minor things to check wrt the code you sent, Martin: * `min-length' is not used. * The `cond's all repeat condition (< value 0) twice, with different actions. ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-04 23:15 ` Drew Adams @ 2012-12-05 6:50 ` Drew Adams 2012-12-05 9:42 ` martin rudalics 2012-12-06 9:25 ` Kenichi Handa 2012-12-05 9:42 ` martin rudalics 1 sibling, 2 replies; 83+ messages in thread From: Drew Adams @ 2012-12-05 6:50 UTC (permalink / raw) To: 'martin rudalics', 'Eli Zaretskii'; +Cc: perin, 13041, perin

This version of Martin's function (but respecting `case-fold-search') is maybe a tiny bit simpler. It could also be a bit slower because of `substring' returning a copy (vs just incrementing an offset). It should also be checked for correctness - not really tested. FWIW/HTH.

(It does correct the two double `(< value 0)' typos I mentioned earlier. That should be done in any case.)

(defun decomposed-string-lessp (string1 string2)
  "Return non-nil if decomposed STRING1 is less than decomposed STRING2.
Comparison respects `case-fold-search'."
  (let ((s1 string1)
        (s2 string2)
        prop1 prop2 type1 type2)
    (catch 'found
      (while (and (> (length s1) 0) (> (length s2) 0))
        (setq prop1 (get-char-code-property
                     (if case-fold-search (downcase (elt s1 0)) (elt s1 0))
                     'decomposition)
              prop2 (get-char-code-property
                     (if case-fold-search (downcase (elt s2 0)) (elt s2 0))
                     'decomposition)
              type1 (car prop1)
              type2 (car prop2))
        (when (eq type1 'compat) (setq s1 (concat (cdr prop1))))
        (when (eq type2 'compat) (setq s2 (concat (cdr prop2))))
        (cond ((eq type1 'compat)
               (let ((cs (compare-strings
                          s1 0 nil
                          s2 0 (and (not (eq type2 'compat))
                                    (min (length s1) (length s2)))
                          case-fold-search)))
                 (unless (eq cs t) (throw 'found (< cs 0)))))
              ((eq type2 'compat)
               (let ((cs (compare-strings
                          s1 0 (min (length s2) (length s1))
                          s2 0 nil
                          case-fold-search)))
                 (unless (eq cs t) (throw 'found (< cs 0)))))
              ((= type1 type2)
               (setq s1 (substring s1 1)
                     s2 (substring s2 1)))
              (t (throw 'found (< type1 type2)))))
      (< (length string1) (length string2)))))

^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-05 6:50 ` Drew Adams @ 2012-12-05 9:42 ` martin rudalics 2012-12-05 15:38 ` Drew Adams 2012-12-06 9:25 ` Kenichi Handa 1 sibling, 1 reply; 83+ messages in thread From: martin rudalics @ 2012-12-05 9:42 UTC (permalink / raw) To: Drew Adams; +Cc: perin, perin, 13041 > This version of Martin's function (but respecting `case-fold-search') is maybe a > tiny bit simpler. It could also be a bit slower because of `substring' > returning a copy (vs just incrementing an offset). It should also be checked > for correctness - not really tested. FWIW/HTH. The most important application I see for this is within `sort-subr' where I want to compare buffer substrings in situ by passing their boundaries. Hence I plan to provide a version working in terms of buffer positions. For simple string checking your version might be preferable. martin ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-05 9:42 ` martin rudalics @ 2012-12-05 15:38 ` Drew Adams 0 siblings, 0 replies; 83+ messages in thread From: Drew Adams @ 2012-12-05 15:38 UTC (permalink / raw) To: 'martin rudalics'; +Cc: perin, perin, 13041 > The most important application I see for this is within `sort-subr' > where I want to compare buffer substrings in situ by passing their > boundaries. Hence I plan to provide a version working in terms of > buffer positions. For simple string checking your version might be > preferable. Please do whatever is right - using positions as you intended. ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-05 6:50 ` Drew Adams 2012-12-05 9:42 ` martin rudalics @ 2012-12-06 9:25 ` Kenichi Handa 2012-12-06 10:34 ` martin rudalics 2012-12-07 0:58 ` Juri Linkov 1 sibling, 2 replies; 83+ messages in thread From: Kenichi Handa @ 2012-12-06 9:25 UTC (permalink / raw) To: Drew Adams; +Cc: perin, 13041, perin In article <707786B35E94470FB727BCF7F3DDA41A@us.oracle.com>, "Drew Adams" <drew.adams@oracle.com> writes: > This version of Martin's function (but respecting `case-fold-search') is maybe a > tiny bit simpler. It could also be a bit slower because of `substring' > returning a copy (vs just incrementing an offset). It should also be checked > for correctness - not really tested. FWIW/HTH. Emacs contains the ucs-normalize package, which provides various normalization functions. For instance, (require 'ucs-normalize) (ucs-normalize-NFKD-string "Äﬃn") => "Äffin" Isn't it usable? --- Kenichi Handa handa@gnu.org ^ permalink raw reply [flat|nested] 83+ messages in thread
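One way the NFKD function mentioned above could be used for diacritic-insensitive string comparison is sketched here. The approach of dropping every character in a "Mark" general category after decomposition is an assumption of this sketch, not something the thread settles on, and `my-fold-diacritics' is a hypothetical name.

```elisp
(require 'ucs-normalize)

;; Sketch only: `my-fold-diacritics' is a hypothetical helper name.
(defun my-fold-diacritics (string)
  "Return STRING decomposed with NFKD and stripped of combining marks."
  (concat
   (delq nil
         (mapcar (lambda (ch)
                   ;; Mn, Mc and Me are the Unicode "Mark" categories.
                   (unless (memq (get-char-code-property ch 'general-category)
                                 '(Mn Mc Me))
                     ch))
                 (ucs-normalize-NFKD-string string)))))

;; Intended use: the two folded strings are meant to compare equal.
(string= (my-fold-diacritics "après")
         (my-fold-diacritics "apres"))
```

Removing spacing marks (Mc) as well is probably too aggressive for scripts where they carry vowels, so a real implementation would likely need a narrower filter.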
* bug#13041: 24.2; diacritic-fold-search 2012-12-06 9:25 ` Kenichi Handa @ 2012-12-06 10:34 ` martin rudalics 2012-12-06 17:50 ` Eli Zaretskii 2012-12-07 0:58 ` Juri Linkov 1 sibling, 1 reply; 83+ messages in thread From: martin rudalics @ 2012-12-06 10:34 UTC (permalink / raw) To: Kenichi Handa; +Cc: perin, perin, 13041 > Emacs contains the ucs-normalize package, which provides various > normalization functions. For instance, > > (require 'ucs-normalize) > (ucs-normalize-NFKD-string "Äﬃn") => "Äffin" > > Isn't it usable? Actually, the function should do what we need. But I have no idea how to integrate it into a searching algorithm. And when sorting, it seems expensive for comparing buffer substrings. Also, the use of a temporary buffer for normalizing every single string makes its weight quite heavy. In any case, I would probably steal the entire decomposition property handling part from it. So thanks a lot for this hint. martin ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-06 10:34 ` martin rudalics @ 2012-12-06 17:50 ` Eli Zaretskii 0 siblings, 0 replies; 83+ messages in thread From: Eli Zaretskii @ 2012-12-06 17:50 UTC (permalink / raw) To: martin rudalics; +Cc: 13041, perin, perin > Date: Thu, 06 Dec 2012 11:34:26 +0100 > From: martin rudalics <rudalics@gmx.at> > CC: Drew Adams <drew.adams@oracle.com>, eliz@gnu.org, perin@panix.com, > 13041@debbugs.gnu.org, perin@acm.org > > > Emacs contains the ucs-normalize package, which provides various > > normalization functions. For instance, > > > > (require 'ucs-normalize) > > (ucs-normalize-NFKD-string "Äﬃn") => "Äffin" > > > > Isn't it usable? > > Actually, the function should do what we need. But I have no idea how > to integrate it into a searching algorithm. And when sorting, it seems > expensive for comparing buffer substrings. Also, the use of a temporary > buffer for normalizing every single string makes its weight quite heavy. Yes, I don't think this will be possible without changes on the C level. Those changes should use code very similar to what we currently do for case-insensitive search. ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-06 9:25 ` Kenichi Handa 2012-12-06 10:34 ` martin rudalics @ 2012-12-07 0:58 ` Juri Linkov 2012-12-07 6:33 ` Eli Zaretskii 2012-12-07 10:37 ` martin rudalics 1 sibling, 2 replies; 83+ messages in thread From: Juri Linkov @ 2012-12-07 0:58 UTC (permalink / raw) To: Kenichi Handa; +Cc: perin, 13041, perin > Emacs contains the ucs-normalize package, which provides various > normalization functions. For instance, > > (require 'ucs-normalize) > (ucs-normalize-NFKD-string "Äﬃn") => "Äffin" > > Isn't it usable? This is usable to sort and compare strings, but I don't see how ucs-normalize.el could help in the search. I suppose the searched buffer can't be normalized before starting a search. So the search function somehow should be able to skip combining characters in the buffer. But to do this, the translation table needs to contain additional information about certain characters to ignore. Also the translation table should be able to map a sequence of characters like "ss" to "ß". ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-07 0:58 ` Juri Linkov @ 2012-12-07 6:33 ` Eli Zaretskii 2012-12-07 10:37 ` martin rudalics 1 sibling, 0 replies; 83+ messages in thread From: Eli Zaretskii @ 2012-12-07 6:33 UTC (permalink / raw) To: Juri Linkov; +Cc: 13041, perin, perin > From: Juri Linkov <juri@jurta.org> > Date: Fri, 07 Dec 2012 02:58:17 +0200 > Cc: perin@panix.com, 13041@debbugs.gnu.org, perin@acm.org > > > Emacs contains the ucs-normalize package, which provides various > > normalization functions. For instance, > > > > (require 'ucs-normalize) > > (ucs-normalize-NFKD-string "Äﬃn") => "Äffin" > > > > Isn't it usable? > > This is usable to sort and compare strings, but I don't see > how ucs-normalize.el could help in the search. I agree. > I suppose the searched buffer can't be normalized before starting a > search. Yes, that's not acceptable. > So the search function somehow should be able to skip combining > characters in the buffer. But to do this, the translation table needs > to contain additional information about certain characters to ignore. Right. This is very similar to how the search primitives currently use the case tables, except that they don't skip characters. But adding such a skip operation should be easy. > Also the translation table should be able to map a sequence of > characters like "ss" to "ß". I'd say the other way around: map ß to ss. ^ permalink raw reply [flat|nested] 83+ messages in thread
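The "skip combining characters" idea can be approximated at the Lisp level, without touching the search primitives, by building a regexp that tolerates combining marks after each character. This is only a rough stand-in for the C-level change discussed above: it is an assumption of this sketch that the buffer text is in decomposed form, so a precomposed "é" is not matched, and `my-mark-tolerant-regexp' is a hypothetical name.

```elisp
;; Sketch only: `my-mark-tolerant-regexp' is a hypothetical helper name.
(defun my-mark-tolerant-regexp (string)
  "Return a regexp matching STRING with optional combining marks.
After each character of STRING, any run of characters from the
Combining Diacritical Marks block (U+0300..U+036F) is allowed."
  (mapconcat (lambda (ch)
               (concat (regexp-quote (string ch)) "[\u0300-\u036f]*"))
             string
             ""))

;; Usage: search the current buffer from point.
(re-search-forward (my-mark-tolerant-regexp "resume") nil t)
```

Handling precomposed characters as well would need either normalizing the text first or a per-character table of equivalents, which is the part the thread expects to live in a translation table.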
* bug#13041: 24.2; diacritic-fold-search 2012-12-07 0:58 ` Juri Linkov 2012-12-07 6:33 ` Eli Zaretskii @ 2012-12-07 10:37 ` martin rudalics 2012-12-07 23:55 ` Juri Linkov 1 sibling, 1 reply; 83+ messages in thread From: martin rudalics @ 2012-12-07 10:37 UTC (permalink / raw) To: Juri Linkov; +Cc: 13041, perin, perin > This is usable to sort and compare strings, but I don't see > how ucs-normalize.el could help in the search. I suppose the > searched buffer can't be normalized before starting a search. You can either temporarily - leave the text alone but give each string that should be handled specially a text property with the normalized form. In this case searching has to pay attention to these properties, if present. - normalize the text and give each normalized string a text property with the original text. In this case searching will proceed as usual but you have to restore the original text when done. I don't know how feasible these are for searching. But I used the second approach for sorting without problems. Also I don't know how to handle the return value and/or highlighting when, for example, finding a match for "suf" within "suﬀer". For example, replacing each occurrence of "suf" with the empty string should leave us with "fer" here. So in this case, we have to deal with the normalized string anyway. OTOH replacing a match for "res" in "résumé" with the empty string should probably leave us with "umé". > So the search function somehow should be able to skip combining > characters in the buffer. But to do this, the translation table needs > to contain additional information about certain characters to ignore. > Also the translation table should be able to map a sequence of > characters like "ss" to "ß". I have no idea how many mappings like "ß" -> "ss" exist. The problem is that we don't get them from UnicodeData.txt IIUC. martin ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-07 10:37 ` martin rudalics @ 2012-12-07 23:55 ` Juri Linkov 2012-12-08 8:20 ` Eli Zaretskii ` (2 more replies) 0 siblings, 3 replies; 83+ messages in thread From: Juri Linkov @ 2012-12-07 23:55 UTC (permalink / raw) To: martin rudalics; +Cc: 13041, perin, perin > - leave the text alone but give each string that should be handled > specially a text property with the normalized form. In this case > searching has to pay attention to these properties, if present. > > - normalize the text and give each normalized string a text property > with the original text. In this case searching will proceed as usual > but you have to restore the original text when done. This reminds an idea that searching should take into account the text displayed with the `display' property and other display-related properties. It seems this is more difficult to implement. > Also I don't know how to handle the return value and/or highlighting > when, for example, finding a match for "suf" within "suﬀer". For > example, replacing each occurrence of "suf" with the empty string should > leave us with "fer" here. I believe such ligature characters should be handled as a whole, i.e. "suf" doesn't match "suﬀer", only "suff" should match it. > I have no idea how many mappings like "ß" -> "ss" exist. The problem is > that we don't get them from UnicodeData.txt IIUC. I can't find them in UnicodeData.txt too. Looking at the files in http://www.unicode.org/Public/UNIDATA/ can find them in the file http://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt that is derived from http://www.unicode.org/Public/UNIDATA/CaseFolding.txt http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-07 23:55 ` Juri Linkov @ 2012-12-08 8:20 ` Eli Zaretskii 2012-12-08 11:35 ` martin rudalics 2012-12-08 11:21 ` martin rudalics 2012-12-08 23:54 ` Stefan Monnier 2 siblings, 1 reply; 83+ messages in thread From: Eli Zaretskii @ 2012-12-08 8:20 UTC (permalink / raw) To: Juri Linkov; +Cc: 13041, perin, perin > From: Juri Linkov <juri@jurta.org> > Date: Sat, 08 Dec 2012 01:55:22 +0200 > Cc: 13041@debbugs.gnu.org, perin@panix.com, perin@acm.org > > This reminds an idea that searching should take into account the text > displayed with the `display' property and other display-related properties. > It seems this is more difficult to implement. I don't know if it's more difficult. After all, the primitives you need to (a) find out whether there's a display string at given buffer position, and (b) access its text, are already there, ready to be used. Moreover, there's even a C function that searches the current buffer for a specific Lisp string, which you could use as a model for this feature. What is definitely true, though, is that searching display string is a separate feature, with an entirely different implementation. I suggest therefore to keep it in mind, but not mix with what's being discussed here. > > I have no idea how many mappings like "ß" -> "ss" exist. The problem is > > that we don't get them from UnicodeData.txt IIUC. > > I can't find them in UnicodeData.txt too. Looking at the files in > http://www.unicode.org/Public/UNIDATA/ can find them in the file > > http://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt > > that is derived from > > http://www.unicode.org/Public/UNIDATA/CaseFolding.txt > http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt Maybe we should extend ucs-normalize.el to include that as well. ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-08 8:20 ` Eli Zaretskii @ 2012-12-08 11:35 ` martin rudalics 2012-12-08 12:40 ` Eli Zaretskii 0 siblings, 1 reply; 83+ messages in thread From: martin rudalics @ 2012-12-08 11:35 UTC (permalink / raw) To: Eli Zaretskii; +Cc: perin, 13041, perin > I don't know if it's more difficult. After all, the primitives you > need to (a) find out whether there's a display string at given buffer > position, and (b) access its text, are already there, ready to be > used. Moreover, there's even a C function that searches the current > buffer for a specific Lisp string, which you could use as a model for > this feature. I think that mirroring/cloning (part of) the current buffer in a special search buffer would be the cheapest solution. The search buffer would contain the normalized text, be built only when normalization is needed and be rebuilt whenever a search option or the buffer text changes. I don't know whether `buffer-swap-text' could be used here. martin ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-08 11:35 ` martin rudalics @ 2012-12-08 12:40 ` Eli Zaretskii 0 siblings, 0 replies; 83+ messages in thread From: Eli Zaretskii @ 2012-12-08 12:40 UTC (permalink / raw) To: martin rudalics; +Cc: perin, 13041, perin > Date: Sat, 08 Dec 2012 12:35:37 +0100 > From: martin rudalics <rudalics@gmx.at> > CC: Juri Linkov <juri@jurta.org>, 13041@debbugs.gnu.org, perin@panix.com, > perin@acm.org > > > I don't know if it's more difficult. After all, the primitives you > > need to (a) find out whether there's a display string at given buffer > > position, and (b) access its text, are already there, ready to be > > used. Moreover, there's even a C function that searches the current > > buffer for a specific Lisp string, which you could use as a model for > > this feature. > > I think that mirroring/cloning (part of) the current buffer in a special > search buffer would be the cheapest solution. The search buffer would > contain the normalized text, be built only when normalization is > needed and be rebuilt whenever a search option or the buffer text > changes. Maybe this is the cheapest, but it still needs the same support the other alternatives do. ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-07 23:55 ` Juri Linkov 2012-12-08 8:20 ` Eli Zaretskii @ 2012-12-08 11:21 ` martin rudalics 2012-12-08 23:07 ` Juri Linkov 2012-12-08 23:54 ` Stefan Monnier 2 siblings, 1 reply; 83+ messages in thread From: martin rudalics @ 2012-12-08 11:21 UTC (permalink / raw) To: Juri Linkov; +Cc: 13041, perin, perin >> - leave the text alone but give each string that should be handled >> specially a text property with the normalized form. In this case >> searching has to pay attention to these properties, if present. >> >> - normalize the text and give each normalized string a text property >> with the original text. In this case searching will proceed as usual >> but you have to restore the original text when done. > > This reminds an idea that searching should take into account the text > displayed with the `display' property and other display-related properties. > It seems this is more difficult to implement. ... and probably should include searching for overlays too. >> Also I don't know how to handle the return value and/or highlighting >> when, for example, finding a match for "suf" within "suﬀer". For >> example, replacing each occurrence of "suf" with the empty string should >> leave us with "fer" here. > > I believe such ligature characters should be handled as a whole, > i.e. "suf" doesn't match "suﬀer", only "suff" should match it. This means that when you type the second "f" you might get a match before the present one. Consider a buffer containing the two lines

suﬀer
suffer

Typing "suf" as search string would go to "suffer". Adding an "f" to the search string now would go back to "suﬀer" (or not). Disconcerting in any case. >> I have no idea how many mappings like "ß" -> "ss" exist. The problem is >> that we don't get them from UnicodeData.txt IIUC. > > I can't find them in UnicodeData.txt too. Looking at the files in > http://www.unicode.org/Public/UNIDATA/ can find them in the file > > http://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt > > that is derived from > > http://www.unicode.org/Public/UNIDATA/CaseFolding.txt > http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt Case folding "ß" to "SS" (upper case "S") is not what I had in mind. I was talking about the (weak?) equivalence of "ß" and "ss" (lower case "s") which is much more important when searching. In particular so, because many German words that were earlier written with an "ß" are now written with "ss". martin ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-08 11:21 ` martin rudalics @ 2012-12-08 23:07 ` Juri Linkov 2012-12-09 0:04 ` Drew Adams 2012-12-09 17:52 ` martin rudalics 0 siblings, 2 replies; 83+ messages in thread From: Juri Linkov @ 2012-12-08 23:07 UTC (permalink / raw) To: martin rudalics; +Cc: 13041, perin, perin > This means that when you type the second "f" you might get a match > before the present one. Consider a buffer containing the two lines > suﬀer > suffer > > Typing "suf" as search string would go to "suffer". Adding an "f" to > the search string now would go back to "suﬀer" (or not). Going back looks like backtracking in the regexp search. OTOH, instead of using an approach of matching only a full match like in Chromium, we could do like GEdit and OpenOffice that match the whole ligature character in a partial match (i.e. to match "ﬀ" when the search string is just "f"). Though this has a problem of highlighting the whole character for a partial match that looks wrong, but perhaps no one can do better. >> http://www.unicode.org/Public/UNIDATA/CaseFolding.txt >> http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt > > Case folding "ß" to "SS" (upper case "S") is not what I had in mind. I > was talking about the (weak?) equivalence of "ß" and "ss" (lower case > "s") which is much more important when searching. In particular so, > because many German words that were earlier written with an "ß" are now > written with "ss". Yes, this is what I meant too. It is surprising but http://www.unicode.org/Public/UNIDATA/CaseFolding.txt defines the equivalence of "ß" and "ss" (lower case "s") instead of case-folding. The following line in CaseFolding.txt: 00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S maps 00DF (LATIN SMALL LETTER SHARP S) to two characters 0073 0073 (LATIN SMALL LETTER S) keeping the lower case. Maybe this is a bug in Unicode data? ^ permalink raw reply [flat|nested] 83+ messages in thread
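For reference, the CaseFolding.txt line quoted above has the shape "code; status; mapping; # name", and reading it mechanically might look like this. The function name `my-parse-case-folding-line' and the returned list layout are assumptions of this sketch, not anything Emacs provides.

```elisp
;; Sketch only: `my-parse-case-folding-line' is a hypothetical name.
(defun my-parse-case-folding-line (line)
  "Parse one data LINE of CaseFolding.txt.
Return (CODE STATUS . MAPPING), or nil if LINE does not match."
  (when (string-match
         "\\`\\([0-9A-F]+\\); \\([CFST]\\); \\([0-9A-F ]+\\);" line)
    (let ((code (string-to-number (match-string 1 line) 16))
          (status (match-string 2 line))
          ;; The mapping field is a space-separated list of code points.
          (mapping (mapcar (lambda (hex) (string-to-number hex 16))
                           (split-string (match-string 3 line)))))
      (cons code (cons status mapping)))))

(my-parse-case-folding-line
 "00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S")
;; => (223 "F" 115 115)
```

The status "F" marks a full folding, i.e. one whose result is longer than a single character, which is why "ß" maps to two code points (115 is lower-case "s").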
* bug#13041: 24.2; diacritic-fold-search 2012-12-08 23:07 ` Juri Linkov @ 2012-12-09 0:04 ` Drew Adams 2012-12-09 17:52 ` martin rudalics 1 sibling, 0 replies; 83+ messages in thread From: Drew Adams @ 2012-12-09 0:04 UTC (permalink / raw) To: 'Juri Linkov', 'martin rudalics'; +Cc: perin, 13041, perin > > Typing "suf" as search string would go to "suffer". Adding > > an "f" to the search string now would go back to "suﬀer" (or not). > > Going back looks like backtracking in the regexp search. > > OTOH, instead of using an approach of matching only a full match > like in Chromium, we could do like GEdit and OpenOffice that > match the whole ligature character in a partial match > (i.e. to match "ﬀ" when the search string is just "f"). Seems to me that the starting point should be the Unicode Regexp spec, which outlines the behavior of level 1 and level 2 searches. Emacs Dev can choose what it wants to do, of course, but that is a good place to start, I think. ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-08 23:07 ` Juri Linkov 2012-12-09 0:04 ` Drew Adams @ 2012-12-09 17:52 ` martin rudalics 2012-12-09 18:06 ` Drew Adams 1 sibling, 1 reply; 83+ messages in thread From: martin rudalics @ 2012-12-09 17:52 UTC (permalink / raw) To: Juri Linkov; +Cc: 13041, perin, perin > OTOH, instead of using an approach of matching only a full match > like in Chromium, we could do like GEdit and OpenOffice that > match the whole ligature character in a partial match > (i.e. to match "ﬀ" when the search string is just "f"). Strictly spoken, they should match the first "f" in "ﬀ". When matching "suf" against "suﬀer", the `match-string' would be "suf", with `match-end' after "ﬀ". That is, the match length would not increase when adding an "f" to the search string now. But I don't know what `match-string' should return - "suff" or "suﬀ". > Though this has a problem of highlighting the whole character for > a partial match that looks wrong, but perhaps no one can do better. We needed a display string "ff" replacing "ﬀ" during highlighting and highlight only the first "f" in it. > Yes, this is what I meant too. It is surprising but > http://www.unicode.org/Public/UNIDATA/CaseFolding.txt > defines the equivalence of "ß" and "ss" (lower case "s") > instead of case-folding. The following line in CaseFolding.txt: > > 00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S > > maps 00DF (LATIN SMALL LETTER SHARP S) to two characters > 0073 0073 (LATIN SMALL LETTER S) keeping the lower case. > Maybe this is a bug in Unicode data? Maybe it's explained here http://www.unicode.org/faq/idn.html in the answer to Q: Why does IDNA2003 map final sigma (ς) to sigma (σ), map eszett (ß) to "ss", and delete ZWJ/ZWNJ? One possible interpretation of this is that mapping "ß" to "SS" would imply that downcasing "SS" should produce "ß" and this is unwanted. But I still wonder whether we are supposed to apply mappings recursively. 
martin ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-09 17:52 ` martin rudalics @ 2012-12-09 18:06 ` Drew Adams 2012-12-11 7:19 ` Eli Zaretskii 0 siblings, 1 reply; 83+ messages in thread From: Drew Adams @ 2012-12-09 18:06 UTC (permalink / raw) To: 'martin rudalics', 'Juri Linkov'; +Cc: perin, 13041, perin > Maybe it's explained here > http://www.unicode.org/faq/idn.html > in the answer to > > Q: Why does IDNA2003 map final sigma (ς) to sigma (σ), map > eszett (ß) to "ss", and delete ZWJ/ZWNJ? > > One possible interpretation of this is that mapping "ß" to "SS" would > imply that downcasing "SS" should produce "ß" and this is > unwanted. This is also covered in the Unicode Regexp spec. http://www.unicode.org/reports/tr18/ ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-09 18:06 ` Drew Adams @ 2012-12-11 7:19 ` Eli Zaretskii 0 siblings, 0 replies; 83+ messages in thread From: Eli Zaretskii @ 2012-12-11 7:19 UTC (permalink / raw) To: Drew Adams; +Cc: 13041, perin, perin > From: "Drew Adams" <drew.adams@oracle.com> > Date: Sun, 9 Dec 2012 10:06:44 -0800 > Cc: perin@panix.com, 13041@debbugs.gnu.org, perin@acm.org > > > Maybe it's explained here > > http://www.unicode.org/faq/idn.html > > in the answer to > > > > Q: Why does IDNA2003 map final sigma (ς) to sigma (σ), map > > eszett (ß) to "ss", and delete ZWJ/ZWNJ? > > > > One possible interpretation of this is that mapping "ß" to "SS" would > > imply that downcasing "SS" should produce "ß" and this is > > unwanted. > > This is also covered in the Unicode Regexp spec. > http://www.unicode.org/reports/tr18/ Another relevant Unicode document is the Unicode Collation Algorithm. For the latest (yet unapproved) draft, see http://www.unicode.org/reports/tr10/proposed.html ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-07 23:55 ` Juri Linkov 2012-12-08 8:20 ` Eli Zaretskii 2012-12-08 11:21 ` martin rudalics @ 2012-12-08 23:54 ` Stefan Monnier 2012-12-09 0:14 ` Drew Adams 2012-12-09 0:35 ` Juri Linkov 2 siblings, 2 replies; 83+ messages in thread From: Stefan Monnier @ 2012-12-08 23:54 UTC (permalink / raw) To: Juri Linkov; +Cc: 13041, perin, perin > i.e. "suf" doesn't match "suﬀer", only "suff" should match it. I completely disagree here. "suf" should match "suﬀer". Stefan ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-08 23:54 ` Stefan Monnier @ 2012-12-09 0:14 ` Drew Adams 2012-12-09 15:42 ` Stefan Monnier 2012-12-09 0:35 ` Juri Linkov 1 sibling, 1 reply; 83+ messages in thread From: Drew Adams @ 2012-12-09 0:14 UTC (permalink / raw) To: 'Stefan Monnier', 'Juri Linkov'; +Cc: perin, 13041, perin > > i.e. "suf" doesn't match "suﬀer", only "suff" should match it. > > I completely disagree here. "suf" should match "suﬀer". The Unicode Regexp spec says that it is best, if possible, to let users do either. It discusses such different search possibilities explicitly. We might not be able to support that superior level (level 2) for Emacs search, but the point is that each kind of matching can be useful here. At this stage of the discussion it should not, I think, be a case of "I completely disagree" (or completely agree), unless you have already decided something wrt design/implementation etc. Better to look at the possibilities for users and then discuss what it might take to be able to support this or that kind of search matching. ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-09 0:14 ` Drew Adams @ 2012-12-09 15:42 ` Stefan Monnier 2012-12-09 18:00 ` Drew Adams 0 siblings, 1 reply; 83+ messages in thread From: Stefan Monnier @ 2012-12-09 15:42 UTC (permalink / raw) To: Drew Adams; +Cc: perin, 13041, perin > The Unicode Regexp spec says that it is best, if possible, to let users do > either. We're talking about the (now misnamed) "diacritic-fold" search. If the user wants to be more strict, there's always going to be the "non-diacritic-fold" search. Stefan ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-09 15:42 ` Stefan Monnier @ 2012-12-09 18:00 ` Drew Adams 0 siblings, 0 replies; 83+ messages in thread From: Drew Adams @ 2012-12-09 18:00 UTC (permalink / raw) To: 'Stefan Monnier'; +Cc: perin, 13041, perin > > The Unicode Regexp spec says that it is best, if possible, > > to let users do either. > > We're talking about the (now misnamed) "diacritic-fold" search. > If the user wants to be more strict, there's always going to be > the "non-diacritic-fold" search. Yes, and? That ignoring of diacritics etc. is essentially what the Unicode Regexp spec refers to as "loose matching", IIUC. And that means "at least the simple, default Unicode case folding." You are considering, among other things, whether `f' should match the ? ligature or whether only `ff' should match it. The standard deals with this question, I believe. (BTW, I cannot actually see that ligature with my mail client. So I copied the char from another mail message and pasted it, above. If that copy+paste didn't work, what I meant was the ligature for ff.) ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-08 23:54 ` Stefan Monnier 2012-12-09 0:14 ` Drew Adams @ 2012-12-09 0:35 ` Juri Linkov 2012-12-09 11:35 ` Stephen Berman 2012-12-09 15:45 ` Stefan Monnier 1 sibling, 2 replies; 83+ messages in thread From: Juri Linkov @ 2012-12-09 0:35 UTC (permalink / raw) To: Stefan Monnier; +Cc: 13041, perin, perin >> i.e. "suf" doesn't match "suﬀer", only "suff" should match it. > > I completely disagree here. "suf" should match "suﬀer". AFAIS, there are more programs that find a partial match, but neither of them can do the right highlighting: both possibilities (to highlight the whole ligature and not to highlight) are wrong, and highlighting a part of the ligature is impossible. ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-09 0:35 ` Juri Linkov @ 2012-12-09 11:35 ` Stephen Berman 2012-12-09 17:52 ` martin rudalics 2012-12-09 15:45 ` Stefan Monnier 1 sibling, 1 reply; 83+ messages in thread From: Stephen Berman @ 2012-12-09 11:35 UTC (permalink / raw) To: Juri Linkov; +Cc: perin, 13041, perin On Sun, 09 Dec 2012 02:35:46 +0200 Juri Linkov <juri@jurta.org> wrote: >>> i.e. "suf" doesn't match "suﬀer", only "suff" should match it. >> >> I completely disagree here. "suf" should match "suﬀer". > > AFAIS, there are more programs that find a partial match, > but neither of them can do the right highlighting: > both possibilities (to highlight the whole ligature and not to highlight) > are wrong, and highlighting a part of the ligature is impossible. Could a ligature be highlighted in a different way (different color or additional attribute such as underlining) to indicate a partial or potential match? Steve Berman ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-09 11:35 ` Stephen Berman @ 2012-12-09 17:52 ` martin rudalics 0 siblings, 0 replies; 83+ messages in thread From: martin rudalics @ 2012-12-09 17:52 UTC (permalink / raw) To: Stephen Berman; +Cc: perin, 13041, perin > Could a ligature be highlighted in a different way (different color or > additional attribute such as underlining) to indicate a partial or > potential match? I think ligatures can be easily handled by displaying the corresponding decomposed string. But a different color could be used to highlight the "ß" with an incremental search string "Mas" and a match in "Maße". martin ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-09 0:35 ` Juri Linkov 2012-12-09 11:35 ` Stephen Berman @ 2012-12-09 15:45 ` Stefan Monnier 2012-12-10 7:57 ` Juri Linkov 1 sibling, 1 reply; 83+ messages in thread From: Stefan Monnier @ 2012-12-09 15:45 UTC (permalink / raw) To: Juri Linkov; +Cc: 13041, perin, perin >>> i.e. "suf" doesn't match "suﬀer", only "suff" should match it. >> I completely disagree here. "suf" should match "suﬀer". > AFAIS, there are more programs that find a partial match, > but neither of them can do the right highlighting: > both possibilities (to highlight the whole ligature and not to highlight) > are wrong, and highlighting a part of the ligature is impossible. One step at a time: first, let's make sure we can match it. Then we'll worry about what the match-boundaries should be and how to display it (when we get to this point, we can even consider displaying suﬀer as suffer temporarily, just like we do when point is in the middle of a composition). Stefan ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-09 15:45 ` Stefan Monnier @ 2012-12-10 7:57 ` Juri Linkov 2012-12-10 8:20 ` Eli Zaretskii 0 siblings, 1 reply; 83+ messages in thread From: Juri Linkov @ 2012-12-10 7:57 UTC (permalink / raw) To: Stefan Monnier; +Cc: 13041, perin, perin > One step at a time: first, let's make sure we can match it. Then we'll > worry about what the match-boundaries should be and how to display it > (when we get to this point, we can even consider displaying suﬀer as > suffer temporarily, just like we do when point is in the middle of > a composition). Isearch used to decompose a composition of a character with a combining accent and displaying them separately in the middle of a composition in Emacs 23. But as I see now in the latest version Isearch in the middle of a composition doesn't decompose them. It highlights the matched character with still unmatched combining accent as a whole. It seems the current behavior is better than earlier because it doesn't change the displayed characters. This is more WYSIWYG. ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-10 7:57 ` Juri Linkov @ 2012-12-10 8:20 ` Eli Zaretskii 0 siblings, 0 replies; 83+ messages in thread From: Eli Zaretskii @ 2012-12-10 8:20 UTC (permalink / raw) To: Juri Linkov; +Cc: perin, 13041, perin > From: Juri Linkov <juri@jurta.org> > Date: Mon, 10 Dec 2012 09:57:49 +0200 > Cc: 13041@debbugs.gnu.org, perin@panix.com, perin@acm.org > > Isearch used to decompose a composition of a character with a combining > accent and displaying them separately in the middle of a composition > in Emacs 23. AFAIR, this was due to problems in the display engine wrt composite characters, and problems with composition support in general, problems which are now solved. ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-04 23:15 ` Drew Adams 2012-12-05 6:50 ` Drew Adams @ 2012-12-05 9:42 ` martin rudalics 1 sibling, 0 replies; 83+ messages in thread From: martin rudalics @ 2012-12-05 9:42 UTC (permalink / raw) To: Drew Adams; +Cc: perin, perin, 13041 > BTW, there are a couple of minor things to check wrt the code you sent, Martin: > > * `min-length' is not used. Leftover from a previous version. > * The `cond's all repeat condition (< value 0) twice, with different actions. These are clearly silly, yes. Funnily, they don't affect the result since they are never taken and the return value is nil as intended. martin ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-04 20:12 ` Drew Adams 2012-12-04 23:15 ` Drew Adams @ 2012-12-05 9:42 ` martin rudalics 2012-12-05 15:38 ` Drew Adams 2012-12-05 23:04 ` Juri Linkov 1 sibling, 2 replies; 83+ messages in thread From: martin rudalics @ 2012-12-05 9:42 UTC (permalink / raw) To: Drew Adams; +Cc: perin, perin, 13041 > 1. Assuming this or similar is added to Emacs (please do). Please consider > modifying it to respect `case-fold-search'. These modified lines do that. > > (setq prop1 (get-char-code-property > (if case-fold-search > (downcase (elt string1 index1)) > (elt string1 index1)) > 'decomposition)) > > [Same thing for prop2 with string2 and index2.] This would have to be done, yes. > (let ((value (compare-strings compat1 0 nil > compat2 0 nil case-fold-search))) > > > 2. In addition, consider updating `string-lessp' to be sensitive to a variable > such as this: > > (defvar ignore-diacritics nil > "Non-nil means ignore diacritics for string comparisons.") > > With that, an alternative to hard-coding a call to `decomposed-string-lessp' is > to bind `ignore-diacritics' and use `string-lessp'. `ignore-diacritics' is misleading. The variable would have to be called `observe-decompositions' or something the like. > A similar change could be made for `compare-strings': reflect the value of > `ignore-diacritics'. Or since that function has made the choice to pass > case-sensitivity as a parameter instead of respecting `case-fold-search', pass > another parameter for diacritic sensitivity. Indeed, `string-lessp' is too weak - we'd need a function to tell whether two strings are equal disregarding "certain" decomposition properties. > 3. More general than #2 would be a function like this, which is sensitive to > both `ignore-diacritics' and `case-fold-search' (this assumes the change > suggested above in #1 for `decomposed-string-lessp'). > > (defun my-string-lessp (s1 s2) > "..." 
> (if ignore-diacritics > (decomposed-string-lessp s1 s2) > (when case-fold-search (setq s1 (upcase s1) > s2 (upcase s2))) > (string-lessp s1 s2))) > > Dunno a good name for this. It's too late to let `string-lessp' itself act like > this - that would break stuff. `string-lessp' is in C. I wouldn't touch it anyway. > 4. Even better than hard-coding `case-fold-search' in `my-string-lessp' and > `decomposed-string-lessp' would be to have those functions be sensitive to a > variable such as this: > > (defvar string-case-variable 'case-fold-search > "Value is a case-sensitivity variable such as `case-fold-search'. > The values of that variable must be like those for `case-fold-search': > nil means case-sensitive, non-nil means case-insensitive.") > > Code could then bind `string-case-variable' to, say, `(not > completion-ignore-case)' or to any other case-sensitivity controlling sexp, when > appropriate. > > This would have the advantages offered by passing an explicit case-sensitivity > parameter, as in `compare-strings', but also the advantages of dynamic scope: > binding `string-case-variable' to affect all comparisons within scope. > > Comparers such as `(my-)string-lessp' are often used as arguments to > higher-order functions that treat them as (only) binary predicates, i.e., > predicates where any additional parameters specifying case or diacritic > sensitivity are ignored. I first have to solve the problems with the values returned by `get-char-code-property'. Then I will look into this. martin ^ permalink raw reply [flat|nested] 83+ messages in thread
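[Editorial sketch] The binary predicate discussed in points 1 to 4 of the message above could look roughly like the following. This is only an illustration under stated assumptions: `my-ignore-diacritics', `my-fold-string' and this version of `my-string-lessp' are hypothetical names, and the folding step uses `ucs-normalize-NFKD-string' from ucs-normalize.el plus removal of nonspacing marks, not martin's `decomposed-string-lessp' (whose code is not shown in this part of the thread).

```elisp
(require 'ucs-normalize)

;; Hypothetical variable, named here only for illustration.
(defvar my-ignore-diacritics nil
  "Non-nil means fold character decompositions before comparing strings.")

(defun my-fold-string (string)
  "Return STRING in NFKD form with nonspacing marks (category Mn) removed.
This is an assumed folding rule, not the one proposed in the thread."
  (let (chars)
    (dolist (c (string-to-list (ucs-normalize-NFKD-string string)))
      (unless (eq (get-char-code-property c 'general-category) 'Mn)
        (push c chars)))
    (apply #'string (nreverse chars))))

(defun my-string-lessp (s1 s2)
  "Return non-nil if S1 sorts before S2.
Respects `case-fold-search' and the hypothetical `my-ignore-diacritics'."
  (when my-ignore-diacritics
    (setq s1 (my-fold-string s1)
          s2 (my-fold-string s2)))
  ;; `compare-strings' returns t for equality, a negative number if
  ;; S1 is less, and a positive number if S1 is greater.
  (let ((result (compare-strings s1 0 nil s2 0 nil case-fold-search)))
    (and (integerp result) (< result 0))))
```

Because both settings are dynamic variables, the predicate stays binary and can be passed to `sort' unchanged, for example (let ((my-ignore-diacritics t)) (sort (list "aprez" "après") #'my-string-lessp)).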
* bug#13041: 24.2; diacritic-fold-search 2012-12-05 9:42 ` martin rudalics @ 2012-12-05 15:38 ` Drew Adams 2012-12-05 15:51 ` Lewis Perin ` (2 more replies) 2012-12-05 23:04 ` Juri Linkov 1 sibling, 3 replies; 83+ messages in thread From: Drew Adams @ 2012-12-05 15:38 UTC (permalink / raw) To: 'martin rudalics'; +Cc: perin, perin, 13041 > `ignore-diacritics' is misleading. The variable would have > to be called `observe-decompositions' or something the like. 1. "Observe decompositions" doesn't mean anything to me. The verb should probably be more active - what does it mean to observe the char decompositions here? BTW, if we use "decomposition" in the name and description then we should probably also use "char" - this is not about decomposing strings in some way (whatever that might mean); it involves decomposing Unicode characters. 2. But my confusion over the name/description is in fact wrt function `decomposed-string-lessp': I guess it's not 100% clear to me what it does. Your doc string said "STRING1 is decomposition-less than STRING2", which confuses me. And it is a bit ambiguous wrt "-less": a. decomposition-less as in comparing the strings only after removing (some parts of) their decompositions (i.e., "-less" as in "sans")? or b. -lessp as in `string<': a comparison ordering relation? In the version of `decomposed-string-lessp' that I sent, I changed the doc string to this: "decomposed STRING1 is less than decomposed STRING2". But that is no doubt incorrect (less correct than yours, if perhaps clearer). In particular, it says nothing about how we compare the two decompositions. In practical (use) terms, this is typically about ignoring diacritics, keeping only the "base" characters. Something about that should at least be mentioned in the doc, so that users know they can use this for that. But IIUC this is not just about diacritics; it sometimes might not be about diacritics at all; and diacritics present are sometimes not ignored. 
E.g., the ligature ffi gets treated the same as the 3 chars f f i. There are no diacritics present in that case. IIUC, we convert the two strings to their Unicode decompositions and then use the Unicode char compatibility specs to compare the decompositions. IOW, we treat equivalent chars, as defined by Unicode, as the same. Perhaps the name/description should speak in terms of Unicode char compatibility or equivalence. Perhaps a name like `string-less-compat-p'? Or `Unicode-equivalent-p'? Or `string-equivalent-p'? How would you characterize what the function does? No doubt Eli can help here. It is important to try to get the function name and description right from the outset, if we can. If the Unicode standard has some terminology that applies here then perhaps we can/should leverage that. Beyond the name and an accurate description, the doc should, as I say, at least mention that you can use this to ignore diacritics (such as accents), as that will be a common use case. ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-05 15:38 ` Drew Adams @ 2012-12-05 15:51 ` Lewis Perin 2012-12-05 16:20 ` Drew Adams 2012-12-05 17:16 ` Drew Adams 2012-12-06 10:28 ` martin rudalics 2 siblings, 1 reply; 83+ messages in thread From: Lewis Perin @ 2012-12-05 15:51 UTC (permalink / raw) To: Drew Adams; +Cc: 13041 Drew Adams writes: > > `ignore-diacritics' is misleading. The variable would have > > to be called `observe-decompositions' or something the like. > > > 1. "Observe decompositions" doesn't mean anything to me. The verb > should probably be more active - what does it mean to observe the > char decompositions here? What about “heed”? /Lew --- Lew Perin | perin@acm.org | http://babelcarp.org ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-05 15:51 ` Lewis Perin @ 2012-12-05 16:20 ` Drew Adams 0 siblings, 0 replies; 83+ messages in thread From: Drew Adams @ 2012-12-05 16:20 UTC (permalink / raw) To: perin; +Cc: 13041 > > > `ignore-diacritics' is misleading. The variable would have > > > to be called `observe-decompositions' or something the like. > > > > 1. "Observe decompositions" doesn't mean anything to me. The verb > > should probably be more active - what does it mean to observe the > > char decompositions here? > > What about "heed"? "Respect" is a more common term with that meaning. But the point (to me) is that we are not conveying much by that - too vague. "Heed" meaning what? Heed how? Those are terms, like "treat", "handle" and "process" (verb), that are generally signs, in computer science as elsewhere, of insufficient understanding or laziness in communication. They say essentially, "it does something". Sometimes (not here though) such words can even be signals that the function in question is a congeries of things that do not necessarily belong together. We should be able to do better here. If I understood better what the function does I might be able to offer better name suggestions. ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-05 15:38 ` Drew Adams 2012-12-05 15:51 ` Lewis Perin @ 2012-12-05 17:16 ` Drew Adams 2012-12-05 18:00 ` Drew Adams 2012-12-06 10:28 ` martin rudalics 2 siblings, 1 reply; 83+ messages in thread From: Drew Adams @ 2012-12-05 17:16 UTC (permalink / raw) To: 'martin rudalics'; +Cc: perin, 13041, perin > Perhaps the name/description should speak in terms of Unicode > char compatibility or equivalence. Perhaps a name like > `string-less-compat-p'? Or `Unicode-equivalent-p'? Or > `string-equivalent-p'? In the last two suggestions I forgot about the "less" part. Taking a quick look at the Unicode specs, it seems that what we do involves (Unicode) "compatibility equivalence". But it also seemed that Eli was saying that for us this is not distinguished from (Unicode) "canonical equivalence". So perhaps `unicode-equivalence-less-p'? Or if there is a risk of confusion with char (not string) comparison, then perhaps `unicode-equiv-string-less-p'? Or just `equiv-string-less-p'? ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-05 17:16 ` Drew Adams @ 2012-12-05 18:00 ` Drew Adams 2012-12-05 18:27 ` Eli Zaretskii 2012-12-06 10:31 ` martin rudalics 0 siblings, 2 replies; 83+ messages in thread From: Drew Adams @ 2012-12-05 18:00 UTC (permalink / raw) To: 'martin rudalics'; +Cc: perin, 13041, perin FWIW - Some more browsing on the topic tells me that what we are trying to come up with here is a predicate for the NFKD canonical ordering (as applied to a char sequence, not to a single char). IOW, a string-ordering predicate that uses the canonical ordering for a character's decomposed normal code point sequence. We are using compatibility normalization, not canonical normalization. So a search (or a string comparison test) for `f' will match the ligature `ffi' (whereas it would not match wrt canonical normalization). Someone please correct me if any of this is wrong. ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-05 18:00 ` Drew Adams @ 2012-12-05 18:27 ` Eli Zaretskii 2012-12-06 10:31 ` martin rudalics 1 sibling, 0 replies; 83+ messages in thread From: Eli Zaretskii @ 2012-12-05 18:27 UTC (permalink / raw) To: Drew Adams; +Cc: 13041, perin, perin > From: "Drew Adams" <drew.adams@oracle.com> > Date: Wed, 5 Dec 2012 10:00:14 -0800 > Cc: perin@panix.com, 13041@debbugs.gnu.org, perin@acm.org > > We are using compatibility normalization, not canonical normalization. So a > search (or a string comparison test) for `f' will match the ligature `ffi' > (whereas it would not match wrt canonical normalization). > > Someone please correct me if any of this is wrong. I'm not sure who is wrong ;-), but I think when compatibility decomposition exists, it should be used; if not, the canonical decomposition should be used. ^ permalink raw reply [flat|nested] 83+ messages in thread
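[Editorial sketch] The difference between canonical and compatibility decomposition mentioned above can be seen with the normalization functions in ucs-normalize.el. The helper name `my-decomposition-demo' is hypothetical; the escapes stand for the ﬃ ligature (U+FB03) and è (U+00E8).

```elisp
(require 'ucs-normalize)

(defun my-decomposition-demo ()
  "Return a list of (FORM INPUT OUTPUT) triples for two sample characters."
  (let ((ligature "\ufb03")   ; LATIN SMALL LIGATURE FFI
        (e-grave "\u00e8"))   ; LATIN SMALL LETTER E WITH GRAVE
    (list
     ;; Canonical decomposition leaves the ligature alone ...
     (list 'NFD ligature (ucs-normalize-NFD-string ligature))
     ;; ... while compatibility decomposition expands it to "ffi".
     (list 'NFKD ligature (ucs-normalize-NFKD-string ligature))
     ;; An accented letter decomposes the same way under both forms:
     ;; "e" followed by U+0300 COMBINING GRAVE ACCENT.
     (list 'NFD e-grave (ucs-normalize-NFD-string e-grave))
     (list 'NFKD e-grave (ucs-normalize-NFKD-string e-grave)))))
```

So a search that should let "f" match the ligature needs the compatibility form, while ignoring accents alone would already work with the canonical one.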
* bug#13041: 24.2; diacritic-fold-search 2012-12-05 18:00 ` Drew Adams 2012-12-05 18:27 ` Eli Zaretskii @ 2012-12-06 10:31 ` martin rudalics 2012-12-06 15:59 ` Drew Adams 1 sibling, 1 reply; 83+ messages in thread From: martin rudalics @ 2012-12-06 10:31 UTC (permalink / raw) To: Drew Adams; +Cc: perin, 13041, perin > We are using compatibility normalization, not canonical normalization. So a > search (or a string comparison test) for `f' will match the ligature `ffi' > (whereas it would not match wrt canonical normalization). If it can be done, searching for "f" should match ligatures like "ﬀ" and "ﬁ". martin ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-06 10:31 ` martin rudalics @ 2012-12-06 15:59 ` Drew Adams 0 siblings, 0 replies; 83+ messages in thread From: Drew Adams @ 2012-12-06 15:59 UTC (permalink / raw) To: 'martin rudalics'; +Cc: perin, 13041, perin > > We are using compatibility normalization, not canonical > > normalization. So a search (or a string comparison test) > > for `f' will match the ligature `ffi' > > (whereas it would not match wrt canonical normalization). > > If it can be done, searching for "f" should match ligatures like "ff" > and "fi". That's what I thought you were planning/preparing to do. On the other hand, as the Unicode spec points out (for level 2), sometimes someone wants to distinguish searching for f from searching for the ligature. Ideally (we might never get there), that would be possible as an alternative (choice). The spec also points to hybrid situations regarding case conversion (see sect RL2.4) where, e.g., you might want to do full case matching on ß in a literal name such as Strauß but simple case folding on ß when used in a character class, such as [ß]. Dunno whether we would ever get there either. There seems to be a lot in the Unicode regexp spec (http://www.unicode.org/reports/tr18/) that could be food for thought for Emacs. I imagine that some Emacs Dev folks have already taken a close look and given it some thought. ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-05 15:38 ` Drew Adams 2012-12-05 15:51 ` Lewis Perin 2012-12-05 17:16 ` Drew Adams @ 2012-12-06 10:28 ` martin rudalics 2012-12-06 17:53 ` Eli Zaretskii 2 siblings, 1 reply; 83+ messages in thread From: martin rudalics @ 2012-12-06 10:28 UTC (permalink / raw) To: Drew Adams; +Cc: perin, perin, 13041 >> `ignore-diacritics' is misleading. The variable would have >> to be called `observe-decompositions' or something the like. > > > 1. "Observe decompositions" doesn't mean anything to me. The verb should > probably be more active - what does it mean to observe the char decompositions > here? > > BTW, if we use "decomposition" in the name and description then we should > probably also use "char" - this is not about decomposing strings in some way > (whatever that might mean); it involves decomposing Unicode characters. `ignore-diacritics' is misleading because when we, for example, sort/match ligatures we already do more than ignore diacritics. A variable using the term `observe-decompositions' would express what the underlying algorithm does - observe the decomposition properties provided by `get-char-code-property'. Bear in mind that a "correct" solution for searching and sorting would have to be based on a correct implementation of a collation table (see bug#12008) plus some options that make searching more convenient (aka "asymmetric searching" http://www.unicode.org/reports/tr10/#Searching). In that sense, Juri's approach for searching and my function can be considered only as poor man's variants of what should be eventually done. For example my Austrian locale sorts o < ö < p while IIUC Swedish has o < p ... < z < ö which IIUC can't be done via the decomposition table. I don't know whether this implies that searching for "o" in Swedish means to _not_ list results for "ö" either. > 2. 
But my confusion over the name/description is in fact wrt function > `decomposed-string-lessp': I guess it's not 100% clear to me what it does. > > Your doc string said "STRING1 is decomposition-less than STRING2", which > confuses me. And it is a bit ambiguous wrt "-less": > > a. decomposition-less as in comparing the strings only after > removing (some parts of) their decompositions (i.e., "-less" > as in "sans")? > > or > > b. -lessp as in `string<': a comparison ordering relation? I didn't think much about the wording. But I can't, in general, talk about comparing characters because in the ligature case (or the "ß" vs "ss" case) I do compare substrings. > In the version of `decomposed-string-lessp' that I sent, I changed the doc > string to this: "decomposed STRING1 is less than decomposed STRING2". But that > is no doubt incorrect (less correct than yours, if perhaps clearer). In > particular, it says nothing about how we compare the two decompositions. > > In practical (use) terms, this is typically about ignoring diacritics, keeping > only the "base" characters. Something about that should at least be mentioned > in the doc, so that users know they can use this for that. Yes. > But IIUC this is not just about diacritics; it sometimes might not be about > diacritics at all; and diacritics present are sometimes not ignored. E.g., the > ligature ffi gets treated the same as the 3 chars f f i. There are no > diacritics present in that case. That's why I want to just talk about decompositions for the moment. > IIUC, we convert the two strings to their Unicode decompositions and then use > the Unicode char compatibility specs to compare the decompositions. IOW, we > treat equivalent chars, as defined by Unicode, as the same. Character sequences, IIUC. > Perhaps the name/description should speak in terms of Unicode char compatibility > or equivalence. Perhaps a name like `string-less-compat-p'? Or > `Unicode-equivalent-p'? Or `string-equivalent-p'? 
> > How would you characterize what the function does? No doubt Eli can help here. > It is important to try to get the function name and description right from the > outset, if we can. If the Unicode standard has some terminology that applies > here then perhaps we can/should leverage that. I'm not sure whether we can ever fully support Unicode here - the weights you find in http://www.unicode.org/Public/UCA/6.2.0/allkeys.txt appear hardly digestible for me (and my machine, presumably). > Beyond the name and an accurate description, the doc should, as I say, at least > mention that you can use this to ignore diacritics (such as accents), as that > will be a common use case. Sure. martin ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-06 10:28 ` martin rudalics @ 2012-12-06 17:53 ` Eli Zaretskii 0 siblings, 0 replies; 83+ messages in thread From: Eli Zaretskii @ 2012-12-06 17:53 UTC (permalink / raw) To: martin rudalics; +Cc: perin, 13041, perin > Date: Thu, 06 Dec 2012 11:28:05 +0100 > From: martin rudalics <rudalics@gmx.at> > CC: 'Eli Zaretskii' <eliz@gnu.org>, perin@panix.com, > 13041@debbugs.gnu.org, perin@acm.org > > >> `ignore-diacritics' is misleading. The variable would have > >> to be called `observe-decompositions' or something the like. > > > > > > 1. "Observe decompositions" doesn't mean anything to me. The verb should > > probably be more active - what does it mean to observe the char decompositions > > here? > > > > BTW, if we use "decomposition" in the name and description then we should > > probably also use "char" - this is not about decomposing strings in some way > > (whatever that might mean); it involves decomposing Unicode characters. > > `ignore-diacritics' is misleading because when we, for example, > sort/match ligatures we already do more than ignore diacritics. A > variable using the term `observe-decompositions' would express what the > underlying algorithm does - observe the decomposition properties > provided by `get-char-code-property'. I would suggest something like equivalence-search or maybe loose-match-search. The latter is slightly less suitable, since loose matches include not just decompositions, see the Unicode Regular Expressions report. ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-05 9:42 ` martin rudalics 2012-12-05 15:38 ` Drew Adams @ 2012-12-05 23:04 ` Juri Linkov 2012-12-06 10:31 ` martin rudalics 1 sibling, 1 reply; 83+ messages in thread From: Juri Linkov @ 2012-12-05 23:04 UTC (permalink / raw) To: martin rudalics; +Cc: perin, 13041, perin > `ignore-diacritics' is misleading. The variable would have to be called > `observe-decompositions' or something the like. Since the existing variable that corresponds to the Unicode file CaseFolding.txt is `case-fold-search', its counterpart variable that corresponds to the Unicode file Decomposition.txt could be called `decomposition-search'. Also like the existing `sort-fold-case', its counterpart could be called `sort-decomposition'. ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-05 23:04 ` Juri Linkov @ 2012-12-06 10:31 ` martin rudalics 2012-12-07 0:52 ` Juri Linkov 0 siblings, 1 reply; 83+ messages in thread From: martin rudalics @ 2012-12-06 10:31 UTC (permalink / raw) To: Juri Linkov; +Cc: perin, 13041, perin > Since the existing variable that corresponds to the > Unicode file CaseFolding.txt is `case-fold-search', > its counterpart variable that corresponds to the Unicode file > Decomposition.txt Where is this file? > could be called `decomposition-search'. > > Also like the existing `sort-fold-case', its counterpart could be called > `sort-decomposition'. martin ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-06 10:31 ` martin rudalics @ 2012-12-07 0:52 ` Juri Linkov 0 siblings, 0 replies; 83+ messages in thread From: Juri Linkov @ 2012-12-07 0:52 UTC (permalink / raw) To: martin rudalics; +Cc: perin, 13041, perin >> Since the existing variable that corresponds to the >> Unicode file CaseFolding.txt is `case-fold-search', >> its counterpart variable that corresponds to the Unicode file >> Decomposition.txt > > Where is this file? There was a reference to http://www.unicode.org/Public/UNIDATA/extracted/DerivedDecompositionType.txt from http://www.unicode.org/faq/casemap_charprop.html but it seems this file is redundant since you can get the same information from admin/unidata/UnicodeData.txt using (get-char-code-property ?? 'decomposition) ^ permalink raw reply [flat|nested] 83+ messages in thread
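[Editorial sketch] As a rough illustration of what the `decomposition' property returns and how it can be followed down to a base character: the helper `my-base-char' below is a hypothetical name, and the sample characters are chosen here for illustration rather than taken from the message.

```elisp
;; The property value is a list of characters, optionally preceded by a
;; tag symbol such as `compat'.  A character without a decomposition
;; yields a list containing just itself.
;;   (get-char-code-property ?\u00e8 'decomposition)  ; (101 768)
;;   (get-char-code-property ?\ufb03 'decomposition)  ; (compat 102 102 105)
;;   (get-char-code-property ?a 'decomposition)       ; (97)

(defun my-base-char (char)
  "Follow CHAR's `decomposition' property down to its first base character."
  (let ((d (get-char-code-property char 'decomposition)))
    ;; Drop a leading tag symbol such as `compat' or `font'.
    (unless (characterp (car d))
      (setq d (cdr d)))
    (if (or (null d) (eq (car d) char))
        char
      ;; Decompositions can nest (U+1EC7 -> U+1EB9 -> e), so recurse.
      (my-base-char (car d)))))
```

Keeping only the first element is a deliberate simplification: it maps è to e, but it also maps the ﬃ ligature to a single f, which is one reason ligatures need string-level rather than character-level handling.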
* bug#13041: 24.2; diacritic-fold-search 2012-12-02 17:45 ` martin rudalics 2012-12-02 18:02 ` Eli Zaretskii @ 2012-12-02 21:39 ` Juri Linkov 2012-12-03 10:16 ` martin rudalics 1 sibling, 1 reply; 83+ messages in thread From: Juri Linkov @ 2012-12-02 21:39 UTC (permalink / raw) To: martin rudalics; +Cc: perin, perin, 13041 > Whatever solution you find most suitable here, it would be nice to come > up with a similar solution for sorting. I've been playing around with a > function like Did you try to build the case table with the diacritics mappings? It should affect the sorting as well without requiring any changes in sorting functions. ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-02 21:39 ` Juri Linkov @ 2012-12-03 10:16 ` martin rudalics 2012-12-04 0:17 ` Juri Linkov 0 siblings, 1 reply; 83+ messages in thread From: martin rudalics @ 2012-12-03 10:16 UTC (permalink / raw) To: Juri Linkov; +Cc: perin, perin, 13041 > Did you try to build the case table with the diacritics mappings? It should > affect the sorting as well without requiring any changes in sorting functions. I tried but it didn't work out. I have to understand your code first before I can tell what happens. In any case, doing your (set-case-table decomposition-table) permanently for a buffer crashed Emacs here. martin ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search 2012-12-03 10:16 ` martin rudalics @ 2012-12-04 0:17 ` Juri Linkov 2012-12-04 3:41 ` Eli Zaretskii 0 siblings, 1 reply; 83+ messages in thread From: Juri Linkov @ 2012-12-04 0:17 UTC (permalink / raw) To: martin rudalics; +Cc: perin, perin, 13041 > In any case, doing your > > (set-case-table decomposition-table) > > permanently for a buffer crashed Emacs here. With more use I see crashes too. The backtrace says that crashes are in boyer_moore. ^ permalink raw reply [flat|nested] 83+ messages in thread
* bug#13041: 24.2; diacritic-fold-search
  2012-12-04 0:17 ` Juri Linkov
@ 2012-12-04 3:41 ` Eli Zaretskii
  0 siblings, 0 replies; 83+ messages in thread
From: Eli Zaretskii @ 2012-12-04 3:41 UTC
To: Juri Linkov; +Cc: 13041, perin, perin

> From: Juri Linkov <juri@jurta.org>
> Cc: Eli Zaretskii <eliz@gnu.org>, perin@panix.com, 13041@debbugs.gnu.org, perin@acm.org
> Date: Tue, 04 Dec 2012 02:17:04 +0200
>
> > In any case, doing your
> >
> > (set-case-table decomposition-table)
> >
> > permanently for a buffer crashed Emacs here.
>
> With more use I see crashes too.  The backtrace says that crashes are in
> boyer_moore.

Please file a bug report with a minimal reproducible recipe.
* bug#13041: 24.2; diacritic-fold-search
  2012-12-02 0:27 ` Juri Linkov
  2012-12-02 17:45 ` martin rudalics
@ 2012-12-02 18:16 ` Eli Zaretskii
  2012-12-02 21:31 ` Juri Linkov
  2012-12-05 19:17 ` Drew Adams
  1 sibling, 2 replies; 83+ messages in thread
From: Eli Zaretskii @ 2012-12-02 18:16 UTC
To: Juri Linkov; +Cc: perin, 13041, perin

> From: Juri Linkov <juri@jurta.org>
> Cc: perin@panix.com, 13041@debbugs.gnu.org, perin@acm.org
> Date: Sun, 02 Dec 2012 02:27:32 +0200
>
> I'm surprised to see case mappings hard-coded in
> lisp/international/characters.el instead of using the properties
> `uppercase' and `lowercase' during creation of case tables.

My guess is that this is because the code in characters.el was written
long before we had access to Unicode character properties in Emacs,
and in fact before Emacs was switched to character representation
based on Unicode codepoints.  And no one bothered to rewrite that code
since then; volunteers are welcome.

> (defvar decomposition-table nil)
>
> (defun make-decomposition-table ()
>   (let ((table (standard-case-table))
>         canon)
>     (setq canon (copy-sequence table))
>     (let ((c #x0000) d)
>       (while (<= c #xFFFD)
>         (make-decomposition-table-1 canon c c)
>         (setq c (1+ c))))
>     (set-char-table-extra-slot table 1 canon)
>     (set-char-table-extra-slot table 2 nil)
>     (setq decomposition-table table)))
>
> (defun make-decomposition-table-1 (canon c0 c1)
>   (let ((d (get-char-code-property c1 'decomposition)))
>     (when d
>       (unless (characterp (car d)) (pop d))
>       (if (eq c1 (car d))
>           (aset canon c0 (car d))
>         (make-decomposition-table-1 canon c0 (car d))))))
>
> (make-decomposition-table)
>
> Then a new Isearch command (the existing `isearch-toggle-case-fold'
> can't be used because it enables/disables the standard case table)
> could toggle between the current case table and the decomposition
> case table using
>
> (set-case-table decomposition-table)
>
> After evaluating this, Isearch correctly finds all related characters
> in every row of this example:
>
> http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html
>
> But it seems using the case table for decomposition has one limitation.
> I see no way to ignore combining accent characters in the case table,
> i.e. to map combining accent characters to nothing.  These characters
> have the general-category "Mn (Mark, Nonspacing)", so they should be ignored
> in the search.

IMO, using case tables for this is evil.  If I want to "fold"
diacritics in search, that doesn't necessarily mean I want to fold the
letter-case as well.  I might want to do that, or I might not; these
are two orthogonal features.

So we need a separate kind of char-table, one that could be installed
in addition to the case table, and one that will interpret nil as
an indication to ignore the character during search.  Then we will be
able to ignore combining accents, as we indeed should.  We also need
to modify the searching primitives to consult this new table, in
addition to case table.

IOW, I don't think we can implement this feature entirely in Lisp.
Some changes are needed on the C level as well.
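Note on the message above: the two requirements stated there (combining accents are ignored, and letter case is left alone) can be illustrated at the string level. The following is only a sketch under that assumption; it normalizes a string rather than changing the search primitives, so it is not the char-table mechanism the message asks for. The helper name `my-strip-diacritics' is hypothetical.

```elisp
(require 'ucs-normalize)

;; Hypothetical helper, for illustration only.
(defun my-strip-diacritics (string)
  "Return STRING decomposed to NFD with nonspacing marks removed.
Letter case is not changed."
  (concat
   (delq nil
         (mapcar (lambda (c)
                   ;; Drop characters whose general category is
                   ;; Mn (Mark, Nonspacing); keep everything else.
                   (unless (eq (get-char-code-property c 'general-category)
                               'Mn)
                     c))
                 (ucs-normalize-NFD-string string)))))

;; Example call: the accent is dropped, the capital letter is kept.
(my-strip-diacritics "Après")
```

Comparing two strings after passing both through such a function gives diacritic folding without case folding; combining it with `downcase' gives both.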
* bug#13041: 24.2; diacritic-fold-search
  2012-12-02 18:16 ` Eli Zaretskii
@ 2012-12-02 21:31 ` Juri Linkov
  2012-12-05 19:17 ` Drew Adams
  1 sibling, 0 replies; 83+ messages in thread
From: Juri Linkov @ 2012-12-02 21:31 UTC
To: Eli Zaretskii; +Cc: perin, 13041, perin

> IMO, using case tables for this is evil.  If I want to "fold"
> diacritics in search, that doesn't necessarily mean I want to fold the
> letter-case as well.  I might want doing that, or I might not; these
> are two orthogonal features.

`decomposition-table' is a separate char-table that has the subtype
`case-table'.  It should not conflict with the standard case table,
so using `isearch-toggle-case-fold' should still toggle the usage of
the standard case table.

To toggle folding in the diacritics search perhaps requires having two
decomposition tables: one where upper and lower case letters belong to
one equivalence set, and another where they are in different sets,
so `isearch-toggle-decomposition' could toggle between them.

Or should the standard case table and the decomposition table be
combined some other way?  Maybe like the existing variable
`case-fold-search' to add a new variable `decomposition-search'
to enable/disable diacritics in search.

> So we need a separate kind of char-table, one that could be installed
> in addition to the case table, and one that will interpret nil as
> an indication to ignore the character during search.

I believe this kind of char-table should be based on the existing subtype
`case-table' because it provides the features necessary for decomposition
search such as extra table EQUIVALENCES (that permutes each equivalence
class) and the extra table CANONICALIZE (where the canonical character
is the final character in the recursion that traverses the `decomposition'
property).

> Then we will be able to ignore combining accents, as we indeed should.
> We also need to modify the searching primitives to consult this new
> table, in addition to case table.

Yes, it seems the feature of ignoring combining accents (i.e. mapping
some characters to nil) can't be added to existing case tables
because for the case table this would mean that converting a string
to upper case might delete some characters (like combining accents)
and converting a string to lower case might add combining accents
to the string that of course makes no sense.

> IOW, I don't think we can implement this feature entirely in Lisp.
> Some changes are needed on the C level as well.

A hack that abuses the standard case table is already possible in Lisp.
A complete implementation requires changes on the C level.
* bug#13041: 24.2; diacritic-fold-search
  2012-12-02 18:16 ` Eli Zaretskii
  2012-12-02 21:31 ` Juri Linkov
@ 2012-12-05 19:17 ` Drew Adams
  2012-12-05 21:19 ` Eli Zaretskii
  1 sibling, 1 reply; 83+ messages in thread
From: Drew Adams @ 2012-12-05 19:17 UTC
To: 'Eli Zaretskii', 'Juri Linkov'; +Cc: perin, 13041, perin

> > I'm surprised to see case mappings hard-coded in
> > lisp/international/characters.el instead of using the properties
> > `uppercase' and `lowercase' during creation of case tables.
>
> My guess is that this is because the code in characters.el was written
> long before we had access to Unicode character properties in Emacs,
> and in fact before Emacs was switched to character representation
> based on Unicode codepoints.  And no one bothered to rewrite that code
> since then; volunteers are welcome.

Doesn't file CaseFolding.txt contain all the info needed?

If so, what about populating the case tables from the latest CaseFolding.txt
file at Emacs build time?  Or if no Internet access during build, populate from
a copy of the file to be distributed with Emacs.

And provide the same population code as a Lisp function, in case someone wants
to refresh an old Emacs release to use a more recent CaseFolding.txt file.

Would this make any sense?
* bug#13041: 24.2; diacritic-fold-search
  2012-12-05 19:17 ` Drew Adams
@ 2012-12-05 21:19 ` Eli Zaretskii
  0 siblings, 0 replies; 83+ messages in thread
From: Eli Zaretskii @ 2012-12-05 21:19 UTC
To: Drew Adams; +Cc: perin, 13041, perin

> From: "Drew Adams" <drew.adams@oracle.com>
> Cc: <perin@panix.com>, <13041@debbugs.gnu.org>, <perin@acm.org>
> Date: Wed, 5 Dec 2012 11:17:04 -0800
>
> > > I'm surprised to see case mappings hard-coded in
> > > lisp/international/characters.el instead of using the properties
> > > `uppercase' and `lowercase' during creation of case tables.
> >
> > My guess is that this is because the code in characters.el was written
> > long before we had access to Unicode character properties in Emacs,
> > and in fact before Emacs was switched to character representation
> > based on Unicode codepoints.  And no one bothered to rewrite that code
> > since then; volunteers are welcome.
>
> Doesn't file CaseFolding.txt contain all the info needed?

You don't need CaseFolding.txt, because UnicodeData.txt includes the
same information, and uni-lowercase.el, uni-uppercase.el, and
uni-titlecase.el already read that information into char-tables.

> If so, what about populating the case tables from the latest CaseFolding.txt
> file at Emacs build time?  Or if no Internet access during build, populate from
> a copy of the file to be distributed with Emacs.
>
> And provide the same population code as a Lisp function, in case someone wants
> to refresh an old Emacs release to use a more recent CaseFolding.txt file.
>
> Would this make any sense?

It would make sense to load case tables from uni-*.el at Emacs build
time.  Volunteers are welcome.
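Note on the message above: the Unicode case data it refers to can be read from Lisp through the `lowercase' character property. The following is a minimal sketch, not code from the thread, that compares this data with what `downcase' returns under the current case table. Both helper names are hypothetical, and the result of the comparison depends on the Emacs version in use.

```elisp
;; Hypothetical helpers, for illustration only.
(defun my-unicode-lowercase (char)
  "Return CHAR's Unicode simple lowercase mapping.
Return CHAR itself when the `lowercase' property is nil."
  (or (get-char-code-property char 'lowercase) char))

(defun my-case-table-differences (from to)
  "Return the characters in FROM..TO for which `downcase' and the
Unicode `lowercase' property disagree."
  (let ((c from)
        result)
    (while (<= c to)
      (unless (eq (downcase c) (my-unicode-lowercase c))
        (push c result))
      (setq c (1+ c)))
    (nreverse result)))

;; Example calls: one property lookup, then a scan of part of Latin-1.
(my-unicode-lowercase ?É)
(my-case-table-differences ?À ?þ)
```

An empty list from the second call would mean the case table and the Unicode data agree over that range.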
* bug#13041: 24.2; diacritic-fold-search
  2012-11-30 18:22 bug#13041: 24.2; diacritic-fold-search Lewis Perin
  2012-11-30 18:51 ` Juri Linkov
@ 2012-11-30 19:31 ` Stefan Monnier
  2016-08-31 14:45 ` Michael Albinus
  2 siblings, 0 replies; 83+ messages in thread
From: Stefan Monnier @ 2012-11-30 19:31 UTC
To: Lewis Perin; +Cc: 13041, perin

severity 13041 wishlist
thanks

> diacritic-adorned versions.  Currently, if I want to search for both
> “apres” and “après”, I need an additive regular expression.  I would
> like to do this as easily as I can search for “apres” and “Apres”.

That would be a very welcome feature, indeed.

        Stefan
* bug#13041: 24.2; diacritic-fold-search
  2012-11-30 18:22 bug#13041: 24.2; diacritic-fold-search Lewis Perin
  2012-11-30 18:51 ` Juri Linkov
  2012-11-30 19:31 ` Stefan Monnier
@ 2016-08-31 14:45 ` Michael Albinus
  [not found] ` <22473.57245.883865.68491@panix5.panix.com>
  2 siblings, 1 reply; 83+ messages in thread
From: Michael Albinus @ 2016-08-31 14:45 UTC
To: Lewis Perin; +Cc: 13041, perin

Lewis Perin <perin@panix.com> writes:

> Emacs search has long been able to toggle between (a) ignoring the
> distinction between upper- and lower-case characters
> (case-fold-search) and (b) searching for only one of the pair.  One
> could say Emacs offers the choice between (a) searching for all
> members of a (2-member) equivalence class and (b) searching for only
> one member.
>
> There are larger equivalence classes of characters with practical use
> which Emacs is currently unaware of: the groups of characters
> consisting of an unadorned (ASCII) character plus all its
> diacritic-adorned versions.  Currently, if I want to search for both
> “apres” and “après”, I need an additive regular expression.  I would
> like to do this as easily as I can search for “apres” and “Apres”.  I
> would be delighted if Emacs implemented the equivalence classes
> spelled out here:
>
> http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html
>
> I might add that diacritics folding is the default in web search
> engines.  It is also a feature of at least one Web browser in
> searching the text of a displayed page (Chrome.)

Emacs 25.1 has introduced the new user option `search-default-mode'.  If
set to `char-fold-to-regexp', the requested feature is available.  See
etc/NEWS for further information.

So I propose to close this bug.  There was a long discussion in the
bug's log back in 2012, but AFAICS, all proposals have been implemented.

> /Lew

Best regards, Michael.
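Note on the message above: the setting it describes can be written as below. This is a minimal sketch assuming Emacs 25.1 or later, where `char-fold-to-regexp' is provided by char-fold.el. The second form calls the same function directly on the example words from the original request; binding `case-fold-search' to nil there is an illustrative choice to show that character folding and case folding are separate switches.

```elisp
(require 'char-fold)

;; Make character folding the default Isearch mode.
(setq search-default-mode #'char-fold-to-regexp)

;; Build a folding regexp by hand and try it on two strings.
;; With case folding off, only the lower-case word is expected to match.
(let ((case-fold-search nil)
      (regexp (char-fold-to-regexp "apres")))
  (list (string-match-p regexp "après")
        (string-match-p regexp "Après")))
```

In an init file the `setq' line alone is enough; within a running search, folding can also be toggled interactively.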
[parent not found: <22473.57245.883865.68491@panix5.panix.com>]
* bug#13041: 24.2; diacritic-fold-search
  [not found] ` <22473.57245.883865.68491@panix5.panix.com>
@ 2016-09-03 7:06 ` Michael Albinus
  0 siblings, 0 replies; 83+ messages in thread
From: Michael Albinus @ 2016-09-03 7:06 UTC
To: perin; +Cc: 13041-done

Version: 25.1

nobody writes:

> This is great news!  I’m afraid I’m not in a position to use 25.1 yet,
> but I look forward to it eagerly.  Closing the bug seems right to me;
> if the new functionality has flaws, then they would be *new* bugs.

So I'm closing the bug.

> Thanks very much for letting me know!
>
> /Lew

Best regards, Michael.