* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
2015-02-05 23:17 ` Artur Malabarba
@ 2015-02-06 0:54 ` Juri Linkov
2015-02-06 2:32 ` Artur Malabarba
2015-02-06 4:58 ` Stephen J. Turnbull
2015-02-06 7:35 ` Eli Zaretskii
2 siblings, 1 reply; 18+ messages in thread
From: Juri Linkov @ 2015-02-06 0:54 UTC (permalink / raw)
To: Artur Malabarba; +Cc: emacs-devel
> Something essentially identical to this was being discussed here a
> couple of weeks ago. Look for the thread "Single quotes in Info". I
> wrote a small elisp solution for building this into isearch (which you
> can find on the "scratch/isearch-character-group-folding" branch). It
> took a different approach to yours, relating characters to regexp, but
> it works.
I see that your branch contains nothing more than what was already implemented
a long time ago in bug#13041, where the major stumbling block was
the inefficiency of the regexp-based solution. Could you help improve it?
> The bright side is that I think this two-char way of writing latin
> accents is much less common (not 100% sure though, it's hard to tell
> the difference). The downside is that I know nothing about other
> languages, so maybe using two chars to represent one char is the
> default behavior in some other languages?
As https://emacs.stackexchange.com/q/7992/478 indicates,
other languages require insertion/deletion of special characters
like diacritics/accents from the search string/buffer for normalization.
When looking for a solution, I recommend checking ucs-normalize.
For example, evaluating:
(require 'ucs-normalize)
ucs-normalize-combining-chars
you can see exactly the same characters
1616 1615 1619 1648 1618 1612 1613 1611 1617 1614
mentioned in https://emacs.stackexchange.com/a/8001/478
Using its corresponding regexp `ucs-normalize-combining-chars-regexp'
is easy in isearch, e.g.:
;; Decomposition search for accented letters.
(define-key isearch-mode-map "\M-sd" 'isearch-toggle-decomposition)
(defun isearch-toggle-decomposition ()
  "Toggle Unicode decomposition searching on or off."
  (interactive)
  (setq isearch-word (unless (eq isearch-word 'isearch-decomposition-regexp)
                       'isearch-decomposition-regexp))
  (if isearch-word (setq isearch-regexp nil))
  (setq isearch-success t isearch-adjusted t)
  (isearch-update))
(defun isearch-decomposition-regexp (string &optional _lax)
  "Return a regexp that matches decomposed Unicode characters in STRING."
  (let ((accents (substring ucs-normalize-combining-chars-regexp 0 -1)))
    (mapconcat
     (lambda (c0)
       (concat (string c0) accents "?"))
     (replace-regexp-in-string accents "" string) "")))
(put 'isearch-decomposition-regexp 'isearch-message-prefix "deco ")
But this is less efficient than properly implementing it using case tables.
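For illustration only, here is a minimal stand-alone sketch of what such a regexp does and does not match. `my-decomposition-regexp' is a hypothetical name, not part of isearch or of any branch mentioned in this thread; it repeats the idea above and additionally passes each literal character through `regexp-quote', which is an assumption about what one would want for search strings containing "." or "*".

```elisp
(require 'ucs-normalize)

;; Hypothetical helper: same idea as the function above, usable outside isearch.
(defun my-decomposition-regexp (string)
  "Return a regexp matching STRING with an optional combining char after each char."
  ;; Drop the trailing "+" so the bracket expression matches one combining char.
  (let ((accents (substring ucs-normalize-combining-chars-regexp 0 -1)))
    (mapconcat (lambda (c)
                 (concat (regexp-quote (string c)) accents "?"))
               ;; Strip combining chars the user may have typed.
               (replace-regexp-in-string accents "" string)
               "")))

;; Decomposed text: "a" followed by U+0301 COMBINING ACUTE ACCENT.
(let ((re (my-decomposition-regexp "a")))
  (and (string-match re (string ?a #x301))
       (match-end 0)))                       ; => 2, the accent is consumed

;; Precomposed text: the single char U+00E1 is not matched by this regexp.
(string-match-p (my-decomposition-regexp "a") (string #xE1))   ; => nil
```

The second form shows the limit of this approach as written: it only helps when the buffer text is already decomposed.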
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
2015-02-06 0:54 ` Juri Linkov
@ 2015-02-06 2:32 ` Artur Malabarba
2015-02-06 2:51 ` Artur Malabarba
2015-02-06 7:48 ` Eli Zaretskii
0 siblings, 2 replies; 18+ messages in thread
From: Artur Malabarba @ 2015-02-06 2:32 UTC (permalink / raw)
To: Juri Linkov; +Cc: emacs-devel
2015-02-05 22:54 GMT-02:00 Juri Linkov <juri@linkov.net>:
>> Something essentially identical to this was being discussed here a
>> couple of weeks ago. Look for the thread "Single quotes in Info". I
>> wrote a small elisp solution for building this into isearch (which you
>> can find on the "scratch/isearch-character-group-folding" branch). It
>> took a different approach to yours, relating characters to regexp, but
>> it works.
>
> I see that your branch contains nothing more than was already implemented
> a long time ago in bug#13041 where the major stumbling block was
> an inefficiency of the regexp-based solution. Could you help to improve it?
I'll have a look. The code I wrote was fast enough for isearch and I'm
starting to convince myself it was the best solution.
The motivation behind extending case-fold tables was to make it fast
enough to use in any search, and also to have it work in some real
corner-case situations. Combine this with the core-dump issue I hit
while trying to implement it, and you have a recipe for my
fast-diminishing motivation to do this.
>> The bright side is that I think this two-char way of writing latin
>> accents is much less common (not 100% sure though, it's hard to tell
>> the difference). The downside is that I know nothing about other
>> languages, so maybe using two chars to represent one char is the
>> default behavior in some other languages?
>
> As https://emacs.stackexchange.com/q/7992/478 indicates,
> other languages require insertion/deletion of special characters
> like diacritics/accents from the search string/buffer for normalization.
>
> When looking for a solution I recommend you to check ucs-normalize.
> For example, evaluating:
>
> (require 'ucs-normalize)
> ucs-normalize-combining-chars
>
> you can see exactly the same characters
>
> 1616 1615 1619 1648 1618 1612 1613 1611 1617 1614
>
> mentioned in https://emacs.stackexchange.com/a/8001/478
>
> Using its corresponding regexp `ucs-normalize-combining-chars-regexp'
> is easy in isearch, e.g.:
>
> ;; Decomposition search for accented letters.
> (define-key isearch-mode-map "\M-sd" 'isearch-toggle-decomposition)
>
> (defun isearch-toggle-decomposition ()
>   "Toggle Unicode decomposition searching on or off."
>   (interactive)
>   (setq isearch-word (unless (eq isearch-word 'isearch-decomposition-regexp)
>                        'isearch-decomposition-regexp))
>   (if isearch-word (setq isearch-regexp nil))
>   (setq isearch-success t isearch-adjusted t)
>   (isearch-update))
>
> (defun isearch-decomposition-regexp (string &optional _lax)
>   "Return a regexp that matches decomposed Unicode characters in STRING."
>   (let ((accents (substring ucs-normalize-combining-chars-regexp 0 -1)))
>     (mapconcat
>      (lambda (c0)
>        (concat (string c0) accents "?"))
>      (replace-regexp-in-string accents "" string) "")))
>
> (put 'isearch-decomposition-regexp 'isearch-message-prefix "deco ")
>
> But this is more inefficient than properly implementing it using case tables.
There's probably a way of handling these in C code, but it'll have to
be done manually (translation tables won't do it). And by someone who
understands this better than I do. :-)
* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
2015-02-06 2:32 ` Artur Malabarba
@ 2015-02-06 2:51 ` Artur Malabarba
2015-02-06 7:48 ` Eli Zaretskii
1 sibling, 0 replies; 18+ messages in thread
From: Artur Malabarba @ 2015-02-06 2:51 UTC (permalink / raw)
To: Juri Linkov; +Cc: emacs-devel
2015-02-06 0:32 GMT-02:00 Artur Malabarba <bruce.connor.am@gmail.com>:
> 2015-02-05 22:54 GMT-02:00 Juri Linkov <juri@linkov.net>:
>>> Something essentially identical to this was being discussed here a
>>> couple of weeks ago. Look for the thread "Single quotes in Info". I
>>> wrote a small elisp solution for building this into isearch (which you
>>> can find on the "scratch/isearch-character-group-folding" branch). It
>>> took a different approach to yours, relating characters to regexp, but
>>> it works.
>>
>> I see that your branch contains nothing more than was already implemented
>> a long time ago in bug#13041 where the major stumbling block was
>> an inefficiency of the regexp-based solution. Could you help to improve it?
>
> I'll have a look.
Scratch that. That thread is way too long to be read on anything but a
holiday. :-)
* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
2015-02-06 2:32 ` Artur Malabarba
2015-02-06 2:51 ` Artur Malabarba
@ 2015-02-06 7:48 ` Eli Zaretskii
2015-02-06 9:06 ` Artur Malabarba
1 sibling, 1 reply; 18+ messages in thread
From: Eli Zaretskii @ 2015-02-06 7:48 UTC (permalink / raw)
To: bruce.connor.am; +Cc: emacs-devel, juri
> Date: Fri, 6 Feb 2015 02:32:46 +0000
> From: Artur Malabarba <bruce.connor.am@gmail.com>
> Cc: emacs-devel <emacs-devel@gnu.org>
>
> There's probably a way of handling these in c code, but it'll have to
> be done manually (translation tables won't do it).
We already have the decomposition in our database, so nothing needs to
be done manually.
> And by someone who understands this more than me. :-)
The "understands" part has been taken care of when constructing those
databases. You just need to use them.
* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
2015-02-06 7:48 ` Eli Zaretskii
@ 2015-02-06 9:06 ` Artur Malabarba
2015-02-06 9:41 ` Eli Zaretskii
0 siblings, 1 reply; 18+ messages in thread
From: Artur Malabarba @ 2015-02-06 9:06 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: emacs-devel, Juri Linkov
On 6 Feb 2015 07:48, "Eli Zaretskii" <eliz@gnu.org> wrote:
>
> > Date: Fri, 6 Feb 2015 02:32:46 +0000
> > From: Artur Malabarba <bruce.connor.am@gmail.com>
> > Cc: emacs-devel <emacs-devel@gnu.org>
> >
> > There's probably a way of handling these in c code, but it'll have to
> > be done manually (translation tables won't do it).
>
> We already have the decomposition in our database, so nothing needs to
> be done manually.
Yes.
By "manually", I wasn't referring to the database; I was referring to the
C code necessary (in the sense that it's not a matter of simply using
translation tables; it will need some ad-hoc coding).
> > And by someone who understands this more than me. :-)
>
> The "understands" part has been taken care of when constructing those
> databases. You just need to use them.
Yes. By "understanding" I was referring to the C code necessary, not the
database. I've actually become quite familiar with the database. :-P
* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
2015-02-06 9:06 ` Artur Malabarba
@ 2015-02-06 9:41 ` Eli Zaretskii
2015-02-06 10:03 ` Artur Malabarba
2015-02-06 10:04 ` Eli Zaretskii
0 siblings, 2 replies; 18+ messages in thread
From: Eli Zaretskii @ 2015-02-06 9:41 UTC (permalink / raw)
To: bruce.connor.am; +Cc: emacs-devel, juri
> Date: Fri, 6 Feb 2015 07:06:27 -0200
> From: Artur Malabarba <bruce.connor.am@gmail.com>
> Cc: Juri Linkov <juri@linkov.net>, emacs-devel <emacs-devel@gnu.org>
>
> > We already have the decomposition in our database, so nothing needs to
> > be done manually.
>
> Yes.
> By "manually", I wasn't referring to the database, I was referring to the c
> code necessary (in the sense that it's not a matter of simply using translation
> tables, it will just need some ad-hoc coding).
>
> > > And by someone who understands this more than me. :-)
> >
> > The "understands" part has been taken care of when constructing those
> > databases. You just need to use them.
>
> Yes. By "understanding" I was referring to the c code necessary, not the
> database. I've actually got quite familiar with the database. :-P
OK, then please don't hesitate to post questions and ask for help,
including for writing some code, if needed. This doesn't have to be a
single-person effort (unless you want it to be ;-). Certainly advice
and answers to questions are abundantly available here, for any code
that is in the core, and for issues related to Unicode.
TIA
* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
2015-02-06 9:41 ` Eli Zaretskii
@ 2015-02-06 10:03 ` Artur Malabarba
2015-02-06 10:04 ` Eli Zaretskii
1 sibling, 0 replies; 18+ messages in thread
From: Artur Malabarba @ 2015-02-06 10:03 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: emacs-devel, Juri Linkov
> OK, then please don't hesitate to post questions and ask for help,
> including for writing some code, if needed. This doesn't have to be a
> single-person effort (unless you want it to be ;-). Certainly advice
> and answers to questions are abundantly available here, for any code
> that is in the core, and for issues related to Unicode.
Yes, I've been meaning to. I have 3 solutions which I can (as in "I
have enough motivation to") implement.
One of them can handle multi-char symbols, but it's only for isearch
and would need to be specifically extended to other searches (it's
essentially an improvement of the previous regexp proposal).
The other two options use char tables, so they are faster and
immediately apply to any searching but only handle single-char
symbols.
I'll write up an email with diffs and help requests next time I have a minute.
* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
2015-02-06 9:41 ` Eli Zaretskii
2015-02-06 10:03 ` Artur Malabarba
@ 2015-02-06 10:04 ` Eli Zaretskii
1 sibling, 0 replies; 18+ messages in thread
From: Eli Zaretskii @ 2015-02-06 10:04 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: juri, bruce.connor.am, emacs-devel
> Date: Fri, 06 Feb 2015 11:41:37 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: emacs-devel@gnu.org, juri@linkov.net
>
> > Yes. By "understanding" I was referring to the c code necessary, not the
> > database. I've actually got quite familiar with the database. :-P
>
> OK, then please don't hesitate to post questions and ask for help,
> including for writing some code, if needed. This doesn't have to be a
> single-person effort (unless you want it to be ;-). Certainly advice
> and answers to questions are abundantly available here, for any code
> that is in the core, and for issues related to Unicode.
Btw, an alternative idea which might be worth exploring is to use
string-collate-equalp for comparison during search, or the wcscoll
function it calls at the C level. That will use the character databases
of the underlying libraries instead of Emacs's own databases (and so
will be less customizable), but that might be good enough in this case.
* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
2015-02-05 23:17 ` Artur Malabarba
2015-02-06 0:54 ` Juri Linkov
@ 2015-02-06 4:58 ` Stephen J. Turnbull
2015-02-06 7:51 ` Eli Zaretskii
2015-02-06 7:35 ` Eli Zaretskii
2 siblings, 1 reply; 18+ messages in thread
From: Stephen J. Turnbull @ 2015-02-06 4:58 UTC (permalink / raw)
To: emacs-devel
Artur Malabarba writes:
> The bright side is that I think this two-char way of writing latin
> accents is much less common (not 100% sure though, it's hard to
> tell the difference).
Yes, it's less common if you take a random sample of the storage in
the world, but there are specific places where the canonical NFD form
is standardized, such as Apple's default file system (at least for Mac
OS). I'm not sure how common that is (NFC is more friendly to casual
hackers), but in any case there is a need to be able to deal with
decomposed characters because not all composition sequences have
precomposed forms.
I would assume that Emacs's character handling machinery knows about
this stuff, though, or at least the underlying libraries do. It's
probably just a matter of incorporating an appropriate library call.
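To make the NFC/NFD distinction concrete, here is a small sketch using the ucs-normalize library that ships with Emacs. It only shows that the two forms of the same letter differ in length and are not `string=', which is why a plain search misses one when given the other; it is an illustration, not a proposed fix.

```elisp
(require 'ucs-normalize)

(let* ((nfc (string #xE1))                     ; precomposed a-acute, one char
       (nfd (ucs-normalize-NFD-string nfc)))   ; "a" + U+0301, two chars
  (list (length nfc)
        (length nfd)
        (string= nfc nfd)
        ;; Recomposing the NFD form gives back the original string.
        (string= nfc (ucs-normalize-NFC-string nfd))))
;; => (1 2 nil t)
```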
* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
2015-02-06 4:58 ` Stephen J. Turnbull
@ 2015-02-06 7:51 ` Eli Zaretskii
2015-02-06 14:50 ` Stefan Monnier
0 siblings, 1 reply; 18+ messages in thread
From: Eli Zaretskii @ 2015-02-06 7:51 UTC (permalink / raw)
To: Stephen J. Turnbull; +Cc: emacs-devel
> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Date: Fri, 06 Feb 2015 13:58:00 +0900
>
> I would assume that Emacs's character handling machinery knows about
> this stuff, though, or at least the underlying libraries do. It's
> probably just a matter of incorporating an appropriate library call.
It's actually even easier than that: the decompositions are already in
the char-tables we build from the Unicode character database. They
just need to be used.
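As a rough sketch of what "using them" could look like from Lisp, the `decomposition' character property can be queried with `get-char-code-property'. The helper `my-base-char' below is a hypothetical name introduced only for this example; it follows canonical decompositions down to the first base character and deliberately leaves compatibility decompositions (whose list starts with a tag symbol) alone.

```elisp
;; The property value is a list of chars; for U+00E1 it is "a" plus U+0301.
(get-char-code-property #xE1 'decomposition)   ; => (97 769)

;; Hypothetical helper, not an Emacs function.
(defun my-base-char (char)
  "Return the base character of CHAR according to its canonical decomposition."
  (let ((dec (get-char-code-property char 'decomposition)))
    ;; A char with no decomposition maps to a list containing itself, and a
    ;; compatibility decomposition starts with a symbol; stop in both cases.
    (if (and (integerp (car dec)) (/= (car dec) char))
        (my-base-char (car dec))
      char)))

;; U+01D8 decomposes in two steps: u-diaeresis-acute, u-diaeresis, then u.
(my-base-char #x1D8)                           ; => 117, i.e. ?u
```

The recursion matters because the stored decomposition is only one level deep.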
* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
2015-02-06 7:51 ` Eli Zaretskii
@ 2015-02-06 14:50 ` Stefan Monnier
2015-02-06 14:54 ` Eli Zaretskii
0 siblings, 1 reply; 18+ messages in thread
From: Stefan Monnier @ 2015-02-06 14:50 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: Stephen J. Turnbull, emacs-devel
> It's actually even easier than that: the decompositions are already in
> the char-tables we build from the Unicode character database. They
> just need to be used.
I think adapting regex.c to it might be a bit more tricky than that.
Stefan
* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
2015-02-06 14:50 ` Stefan Monnier
@ 2015-02-06 14:54 ` Eli Zaretskii
0 siblings, 0 replies; 18+ messages in thread
From: Eli Zaretskii @ 2015-02-06 14:54 UTC (permalink / raw)
To: Stefan Monnier; +Cc: stephen, emacs-devel
> From: Stefan Monnier <monnier@IRO.UMontreal.CA>
> Cc: "Stephen J. Turnbull" <stephen@xemacs.org>, emacs-devel@gnu.org
> Date: Fri, 06 Feb 2015 09:50:07 -0500
>
> > It's actually even easier than that: the decompositions are already in
> > the char-tables we build from the Unicode character database. They
> > just need to be used.
>
> I think adapting regex.c to it might be a bit more tricky than that.
Yes, but let's start with isearch, and then move to more complex
tasks.
* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
2015-02-05 23:17 ` Artur Malabarba
2015-02-06 0:54 ` Juri Linkov
2015-02-06 4:58 ` Stephen J. Turnbull
@ 2015-02-06 7:35 ` Eli Zaretskii
2 siblings, 0 replies; 18+ messages in thread
From: Eli Zaretskii @ 2015-02-06 7:35 UTC (permalink / raw)
To: bruce.connor.am; +Cc: emacs-devel
> Date: Thu, 5 Feb 2015 23:17:42 +0000
> From: Artur Malabarba <bruce.connor.am@gmail.com>
>
> As for answering your questions:
>
> >> implementing it for users so it works like `case-fold-search' (you just
> >> set something in Customize and all search commands DWYM) seems much
> >> harder.
>
> Doing it as part of Emacs is not terribly hard, but it has
> disadvantages. Namely, the case-fold-search machinery only relates one
> character to another character (1 to 1). At least for latin this would
> be enough a lot of the time, e.g. you can use it to relate "á" to "a".
> However, there's another way of writing "á" which takes two
> characters, and this situation can't be handled (AFAIK) by the
> case-fold-search machinery.
This just means you cannot implement that without changes to the C
level. Changing the C code to lift the one-character restriction is
not very hard.
> The bright side is that I think this two-char way of writing latin
> accents is much less common (not 100% sure though, it's hard to tell
> the difference). The downside is that I know nothing about other
> languages, so maybe using two chars to represent one char is the
> default behavior in some other languages?
It can be more than 2 characters, e.g. in scripts that use diacritics:
there could be more than one diacritic combined with one base character.
And then there are characters to be ignored, like ZWJ and bidi
directional controls.
So I think ad-hoc rules like the above are not going to cut it. We
must use the decomposed forms, whatever they are, and we should also
consult the character properties to ignore the ignorables.
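A minimal Lisp-level sketch of that idea, under stated assumptions: decompose with `ucs-normalize-NFD-string', then drop every character whose `general-category' property is Mn (nonspacing mark) or Cf (format control, which covers ZWJ and the bidi directional marks). The name `my-fold-string' is hypothetical, and treating exactly Mn and Cf as "ignorable" is a simplification chosen for the example, not a decision made in this thread.

```elisp
(require 'ucs-normalize)

;; Hypothetical helper, not an Emacs function.
(defun my-fold-string (string)
  "Return STRING decomposed, without nonspacing marks and format controls."
  (concat
   (delq nil
         (mapcar (lambda (c)
                   ;; Keep C unless the Unicode database says it is Mn or Cf.
                   (unless (memq (get-char-code-property c 'general-category)
                                 '(Mn Cf))
                     c))
                 (ucs-normalize-NFD-string string)))))

;; a-acute, ZWJ (U+200D), "b", RIGHT-TO-LEFT MARK (U+200F)
(my-fold-string (string #xE1 #x200D ?b #x200F))   ; => "ab"

;; More than one diacritic on a single base character (U+01D8).
(my-fold-string (string #x1D8))                   ; => "u"
```

Folding both the search string and the candidate text this way before comparing handles any number of combining marks per base character, but it works on whole strings and so says nothing about how to do the same inside the C search code.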
> >> Does anyone have suggestions? Maybe some defadvice magic?
>
> You can use a defadvice around one of the isearch internal functions
> (check out the branch I mentioned) to implement something in elisp.
> And you can redefine the buffer's case-folding table and use that in
> the advice, but that will require that you generate the entire table.
Please don't kludge around the problem. If it is important enough for
you to solve it, let's solve it as God intended.