extending case-fold-search to remove nonspacing marks (diacritics etc.)

unofficial mirror of emacs-devel@gnu.org 

* extending case-fold-search to remove nonspacing marks (diacritics etc.)
@ 2015-02-05 22:16 Ted Zlatanov
  2015-02-05 23:06 ` Artur Malabarba
  2015-02-06  7:29 ` Eli Zaretskii
  0 siblings, 2 replies; 18+ messages in thread
From: Ted Zlatanov @ 2015-02-05 22:16 UTC (permalink / raw)
  To: emacs-devel

https://emacs.stackexchange.com/questions/7992/how-to-search-an-arabic-word-in-text-without-its-diacritics-accents
suggested it would be useful if diacritics were ignored when searching
for text in various situations. This is similar to `case-fold-search'
but more generic. Here's what I suggested as the answer at the ELisp
level:

#+begin_src emacs-lisp
(require 'cl-lib)         ; for `cl-loop'
(require 'ucs-normalize)  ; for `ucs-normalize-NFKD-string'

(defun kill-marks (string)
  "Return STRING without its nonspacing marks (general category Mn)."
  (concat (cl-loop for c across string
                   unless (eq 'Mn (get-char-code-property c 'general-category))
                   collect c)))

;; Decompose both strings first, so that precomposed characters are
;; split into a base character followed by its combining marks.
(let* ((original1 "your Arabic string here")
       (normalized1 (ucs-normalize-NFKD-string original1))
       (original2 "your other Arabic string here")
       (normalized2 (ucs-normalize-NFKD-string original2)))
  (equal (kill-marks normalized1)
         (kill-marks normalized2)))
#+end_src

This would probably be useful for other languages, not just Arabic. But
implementing it for users so it works like `case-fold-search' (you just
set something in Customize and all search commands DWYM) seems much
harder. Does anyone have suggestions? Maybe some defadvice magic? Or is
it not possible?
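
To illustrate that the idea is not specific to Arabic, here is a minimal
sketch applied to accented Latin text. The name `my-strip-marks' is
hypothetical, and the sketch assumes that NFKD decomposition followed by
dropping characters of general category Mn is an acceptable approximation
of "ignoring diacritics".

```emacs-lisp
(require 'ucs-normalize)  ; provides `ucs-normalize-NFKD-string'

(defun my-strip-marks (string)
  "Return STRING decomposed (NFKD) with nonspacing marks (Mn) removed.
Hypothetical helper, for illustration only."
  (let (chars)
    (dolist (c (string-to-list (ucs-normalize-NFKD-string string)))
      (unless (eq 'Mn (get-char-code-property c 'general-category))
        (push c chars)))
    (concat (nreverse chars))))

;; "r\u00e9sum\u00e9" is "résumé" written with precomposed characters.
(equal (my-strip-marks "r\u00e9sum\u00e9")
       (my-strip-marks "resume"))  ; → t
```

This only compares two strings; it does not by itself change how any
search command behaves.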

Thanks
Ted


* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
  2015-02-05 22:16 extending case-fold-search to remove nonspacing marks (diacritics etc.) Ted Zlatanov
@ 2015-02-05 23:06 ` Artur Malabarba
  2015-02-05 23:17   ` Artur Malabarba
  2015-02-06  7:29 ` Eli Zaretskii
  1 sibling, 1 reply; 18+ messages in thread
From: Artur Malabarba @ 2015-02-05 23:06 UTC (permalink / raw)
  To: emacs-devel

Something essentially identical to this was being discussed here a
couple of weeks ago. Look for the thread "Single quotes in Info". I
wrote a small elisp solution for building this into isearch (which you
can find on the "scratch/isearch-character-group-folding" branch). It
took a different approach to yours, relating characters to regexp, but
it works.

It's not merged because I was advised to look into using the
case-fold-search machinery.



* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
  2015-02-05 23:06 ` Artur Malabarba
@ 2015-02-05 23:17   ` Artur Malabarba
  2015-02-06  0:54     ` Juri Linkov
                       ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Artur Malabarba @ 2015-02-05 23:17 UTC (permalink / raw)
  To: emacs-devel

As for answering your questions:

>> implementing it for users so it works like `case-fold-search' (you just
>> set something in Customize and all search commands DWYM) seems much
>> harder.

Doing it as part of Emacs is not terribly hard, but it has
disadvantages. Namely, the case-fold-search machinery only relates one
character to another character (1 to 1). At least for Latin this would
be enough a lot of the time, e.g. you can use it to relate "á" to "a".
However, there's another way of writing "á" which takes two
characters, and this situation can't be handled (AFAIK) by the
case-fold-search machinery.
The bright side is that I think this two-char way of writing Latin
accents is much less common (not 100% sure though, it's hard to tell
the difference). The downside is that I know nothing about other
languages, so maybe using two chars to represent one char is the
default behavior in some other languages?
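
The two ways of writing "á" can be compared directly. This is a small
sketch, assuming only the standard `ucs-normalize' library; the variable
names are illustrative.

```emacs-lisp
(require 'ucs-normalize)  ; provides `ucs-normalize-NFC-string'

(let ((precomposed (string #xe1))       ; U+00E1, a single character
      (decomposed  (string ?a #x301)))  ; "a" + U+0301 COMBINING ACUTE ACCENT
  (list (length precomposed)
        (length decomposed)
        ;; The two spellings are different strings...
        (equal precomposed decomposed)
        ;; ...until both are brought to the same normalization form.
        (equal precomposed (ucs-normalize-NFC-string decomposed))))
;; → (1 2 nil t)
```

A one-to-one character table can relate U+00E1 to "a", but it has no way
to express that the two-character sequence should also match.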

>> Does anyone have suggestions? Maybe some defadvice magic?

You can use a defadvice around one of the isearch internal functions
(check out the branch I mentioned) to implement something in elisp.
And you can redefine the buffer's case-folding table and use that in
the advice, but that will require that you generate the entire table.


* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
  2015-02-05 23:17   ` Artur Malabarba
@ 2015-02-06  0:54     ` Juri Linkov
  2015-02-06  2:32       ` Artur Malabarba
  2015-02-06  4:58     ` Stephen J. Turnbull
  2015-02-06  7:35     ` Eli Zaretskii
  2 siblings, 1 reply; 18+ messages in thread
From: Juri Linkov @ 2015-02-06  0:54 UTC (permalink / raw)
  To: Artur Malabarba; +Cc: emacs-devel

> Something essentially identical to this was being discussed here a
> couple of weeks ago. Look for the thread "Single quotes in Info". I
> wrote a small elisp solution for building this into isearch (which you
> can find on the "scratch/isearch-character-group-folding" branch). It
> took a different approach to yours, relating characters to regexp, but
> it works.

I see that your branch contains nothing more than was already implemented
a long time ago in bug#13041 where the major stumbling block was
an inefficiency of the regexp-based solution.  Could you help to improve it?

> The bright side is that I think this two-char way of writing latin
> accents is much less common (not 100% sure though, it's hard to tell
> the difference). The downside is that I know nothing about other
> languages, so maybe using two chars to represent one char is the
> default behavior in some other languages?

As https://emacs.stackexchange.com/q/7992/478 indicates,
other languages require insertion/deletion of special characters
like diacritics/accents from the search string/buffer for normalization.

When looking for a solution, I recommend checking ucs-normalize.
For example, evaluating:

  (require 'ucs-normalize)
  ucs-normalize-combining-chars

you can see exactly the same characters

  1616 1615 1619 1648 1618 1612 1613 1611 1617 1614

mentioned in https://emacs.stackexchange.com/a/8001/478
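
For readers who do not recognize the code points, their Unicode names and
general categories can be looked up in the character database that ships
with Emacs. This is only a sketch for inspection; it checks three of the
ten characters listed above.

```emacs-lisp
(mapcar (lambda (c)
          (list c
                (get-char-code-property c 'name)
                (get-char-code-property c 'general-category)))
        '(1611 1614 1617))
;; → ((1611 "ARABIC FATHATAN" Mn)
;;    (1614 "ARABIC FATHA" Mn)
;;    (1617 "ARABIC SHADDA" Mn))
```

All of them are nonspacing marks (Mn), which is why a test on the general
category finds the same set for this text.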

Using its corresponding regexp `ucs-normalize-combining-chars-regexp'
is easy in isearch, e.g.:

  ;; Decomposition search for accented letters.
  (define-key isearch-mode-map "\M-sd" 'isearch-toggle-decomposition)

  (defun isearch-toggle-decomposition ()
    "Toggle Unicode decomposition searching on or off."
    (interactive)
    (setq isearch-word (unless (eq isearch-word 'isearch-decomposition-regexp)
                         'isearch-decomposition-regexp))
    (if isearch-word (setq isearch-regexp nil))
    (setq isearch-success t isearch-adjusted t)
    (isearch-update))

  (defun isearch-decomposition-regexp (string &optional _lax)
    "Return a regexp that matches decomposed Unicode characters in STRING."
    (let ((accents (substring ucs-normalize-combining-chars-regexp 0 -1)))
      (mapconcat
       (lambda (c0)
         ;; Quote C0 so that regexp-special characters match literally.
         (concat (regexp-quote (string c0)) accents "?"))
       (replace-regexp-in-string accents "" string) "")))

  (put 'isearch-decomposition-regexp 'isearch-message-prefix "deco ")

But this is more inefficient than properly implementing it using case tables.
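
The regexp idea can be tried outside isearch. The following is a minimal
sketch, not the code above: `my-decomposition-regexp' is a hypothetical
name, and it assumes that only the range U+0300..U+036F counts as a
combining mark, which is far narrower than what ucs-normalize knows about.

```emacs-lisp
(require 'ucs-normalize)  ; provides `ucs-normalize-NFD-string'

(defun my-decomposition-regexp (string)
  "Return a regexp matching STRING with optional marks after each char.
Hypothetical helper; only U+0300..U+036F is treated as a mark."
  (mapconcat (lambda (c)
               (concat (regexp-quote (string c)) "[\u0300-\u036f]*"))
             string ""))

;; Matches when the searched text is in decomposed (NFD) form:
(string-match-p (my-decomposition-regexp "resume")
                (ucs-normalize-NFD-string "r\u00e9sum\u00e9"))  ; → 0

;; Does not match the precomposed spelling of the same word:
(string-match-p (my-decomposition-regexp "resume")
                "r\u00e9sum\u00e9")                             ; → nil
```

The second call shows the limitation: the regexp only helps when the
text itself is decomposed, and every character of the search string
grows an extra bracket expression.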




* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
  2015-02-06  0:54     ` Juri Linkov
@ 2015-02-06  2:32       ` Artur Malabarba
  2015-02-06  2:51         ` Artur Malabarba
  2015-02-06  7:48         ` Eli Zaretskii
  0 siblings, 2 replies; 18+ messages in thread
From: Artur Malabarba @ 2015-02-06  2:32 UTC (permalink / raw)
  To: Juri Linkov; +Cc: emacs-devel

2015-02-05 22:54 GMT-02:00 Juri Linkov <juri@linkov.net>:
>> Something essentially identical to this was being discussed here a
>> couple of weeks ago. Look for the thread "Single quotes in Info". I
>> wrote a small elisp solution for building this into isearch (which you
>> can find on the "scratch/isearch-character-group-folding" branch). It
>> took a different approach to yours, relating characters to regexp, but
>> it works.
>
> I see that your branch contains nothing more than was already implemented
> a long time ago in bug#13041 where the major stumbling block was
> an inefficiency of the regexp-based solution.  Could you help to improve it?

I'll have a look. The code I wrote was fast enough for isearch and I'm
starting to convince myself it was the best solution.

The motivation behind extending case-fold tables was to make it fast
enough to use on any search, and also have it work on some very
corner-case situations. Combine this with the core-dump issue I've hit
while trying to implement it, and you have a recipe for my fast
diminishing motivation to do this.

>> The bright side is that I think this two-char way of writing latin
>> accents is much less common (not 100% sure though, it's hard to tell
>> the difference). The downside is that I know nothing about other
>> languages, so maybe using two chars to represent one char is the
>> default behavior in some other languages?
>
> As https://emacs.stackexchange.com/q/7992/478 indicates,
> other languages require insertion/deletion of special characters
> like diacritics/accents from the search string/buffer for normalization.
>
> When looking for a solution I recommend you to check ucs-normalize.
> For example, evaluating:
>
>   (require 'ucs-normalize)
>   ucs-normalize-combining-chars
>
> you can see exactly the same characters
>
>   1616 1615 1619 1648 1618 1612 1613 1611 1617 1614
>
> mentioned in https://emacs.stackexchange.com/a/8001/478
>
> Using its corresponding regexp `ucs-normalize-combining-chars-regexp'
> is easy in isearch, e.g.:
>
>   ;; Decomposition search for accented letters.
>   (define-key isearch-mode-map "\M-sd" 'isearch-toggle-decomposition)
>
>   (defun isearch-toggle-decomposition ()
>     "Toggle Unicode decomposition searching on or off."
>     (interactive)
>     (setq isearch-word (unless (eq isearch-word 'isearch-decomposition-regexp)
>                          'isearch-decomposition-regexp))
>     (if isearch-word (setq isearch-regexp nil))
>     (setq isearch-success t isearch-adjusted t)
>     (isearch-update))
>
>   (defun isearch-decomposition-regexp (string &optional _lax)
>     "Return a regexp that matches decomposed Unicode characters in STRING."
>     (let ((accents (substring ucs-normalize-combining-chars-regexp 0 -1)))
>       (mapconcat
>        (lambda (c0)
>          (concat (string c0) accents "?"))
>        (replace-regexp-in-string accents "" string) "")))
>
>   (put 'isearch-decomposition-regexp 'isearch-message-prefix "deco ")
>
> But this is more inefficient than properly implementing it using case tables.

There's probably a way of handling these in C code, but it'll have to
be done manually (translation tables won't do it). And by someone who
understands this more than me. :-)




* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
  2015-02-06  2:32       ` Artur Malabarba
@ 2015-02-06  2:51         ` Artur Malabarba
  2015-02-06  7:48         ` Eli Zaretskii
  1 sibling, 0 replies; 18+ messages in thread
From: Artur Malabarba @ 2015-02-06  2:51 UTC (permalink / raw)
  To: Juri Linkov; +Cc: emacs-devel

2015-02-06 0:32 GMT-02:00 Artur Malabarba <bruce.connor.am@gmail.com>:
> 2015-02-05 22:54 GMT-02:00 Juri Linkov <juri@linkov.net>:
>>> Something essentially identical to this was being discussed here a
>>> couple of weeks ago. Look for the thread "Single quotes in Info". I
>>> wrote a small elisp solution for building this into isearch (which you
>>> can find on the "scratch/isearch-character-group-folding" branch). It
>>> took a different approach to yours, relating characters to regexp, but
>>> it works.
>>
>> I see that your branch contains nothing more than was already implemented
>> a long time ago in bug#13041 where the major stumbling block was
>> an inefficiency of the regexp-based solution.  Could you help to improve it?
>
> I'll have a look.

Scratch that. That thread is way too long to be read on anything but a
holiday. :-)




* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
  2015-02-05 23:17   ` Artur Malabarba
  2015-02-06  0:54     ` Juri Linkov
@ 2015-02-06  4:58     ` Stephen J. Turnbull
  2015-02-06  7:51       ` Eli Zaretskii
  2015-02-06  7:35     ` Eli Zaretskii
  2 siblings, 1 reply; 18+ messages in thread
From: Stephen J. Turnbull @ 2015-02-06  4:58 UTC (permalink / raw)
  To: emacs-devel

Artur Malabarba writes:

 > The bright side is that I think this two-char way of writing latin
 > accents is much less common (not 100% sure though, it's hard to
 > tell the difference).

Yes, it's less common if you take a random sample of the storage in
the world, but there are specific places where the canonical NFD form
is standardized, such as Apple's default file system (at least for Mac
OS).  I'm not sure how common that is (NFC is more friendly to casual
hackers), but in any case there is a need to be able to deal with
decomposed characters because not all composition sequences have
precomposed forms.
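
A small sketch of that last point, assuming the standard `ucs-normalize'
library: "a" followed by a combining acute accent composes to a single
character under NFC, while "q" followed by the same accent has no
precomposed form and stays as two characters.

```emacs-lisp
(require 'ucs-normalize)  ; provides `ucs-normalize-NFC-string'

(list (length (ucs-normalize-NFC-string (string ?a #x301)))   ; composes to U+00E1
      (length (ucs-normalize-NFC-string (string ?q #x301))))  ; nothing to compose to
;; → (1 2)
```

So normalizing everything to NFC does not remove the need to handle
combining sequences.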

I would assume that Emacs's character handling machinery knows about
this stuff, though, or at least the underlying libraries do.  It's
probably just a matter of incorporating an appropriate library call.


* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
  2015-02-05 22:16 extending case-fold-search to remove nonspacing marks (diacritics etc.) Ted Zlatanov
  2015-02-05 23:06 ` Artur Malabarba
@ 2015-02-06  7:29 ` Eli Zaretskii
  2015-02-07 12:59   ` Ted Zlatanov
  1 sibling, 1 reply; 18+ messages in thread
From: Eli Zaretskii @ 2015-02-06  7:29 UTC (permalink / raw)
  To: emacs-devel

> From: Ted Zlatanov <tzz@lifelogs.com>
> Date: Thu, 05 Feb 2015 17:16:04 -0500
> 
> https://emacs.stackexchange.com/questions/7992/how-to-search-an-arabic-word-in-text-without-its-diacritics-accents
> suggested it would be useful if diacritics were ignored when searching
> for text in various situations. This is similar to `case-fold-search'
> but more generic. Here's what I suggested as the answer at the ELisp
> level:
> 
> #+begin_src emacs-lisp
> (defun kill-marks (string)
>   (concat (loop for c across string
>                 when (not (eq 'Mn (get-char-code-property c 'general-category)))
>                 collect c)))
> 
> (let* ((original1 "your Arabic string here")
>       (normalized1 (ucs-normalize-NFKD-string original1))
>       (original2 "your other Arabic string here")
>       (normalized2 (ucs-normalize-NFKD-string original2)))
>   (equal
>    (replace-regexp-in-string "." 'kill-marks normalized1)
>    (replace-regexp-in-string "." 'kill-marks normalized2)))
> #+end_src

That doesn't do what we want, it's only a partial solution to that
problem.  E.g., it doesn't equate the initial, medial, and final
variants of the letters used by Arabic and other Semitic scripts.
Moreover, you cannot even search for "a" and find "á", AFAICS.

The way to solve this correctly and generally was discussed here some
time ago, so if there are people here for whom this is an itch to
scratch, please let's do this as discussed there.  We already have all
the necessary information for that in Emacs databases.





* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
  2015-02-05 23:17   ` Artur Malabarba
  2015-02-06  0:54     ` Juri Linkov
  2015-02-06  4:58     ` Stephen J. Turnbull
@ 2015-02-06  7:35     ` Eli Zaretskii
  2 siblings, 0 replies; 18+ messages in thread
From: Eli Zaretskii @ 2015-02-06  7:35 UTC (permalink / raw)
  To: bruce.connor.am; +Cc: emacs-devel

> Date: Thu, 5 Feb 2015 23:17:42 +0000
> From: Artur Malabarba <bruce.connor.am@gmail.com>
> 
> As for answering your questions:
> 
> >> implementing it for users so it works like `case-fold-search' (you just
> >> set something in Customize and all search commands DWYM) seems much
> >> harder.
> 
> Doing it as part of Emacs is not terribly hard, but it has
> disadvantages. Namely, the case-fold-search machinery only relates one
> character to another character (1 to 1). At least for latin this would
> be enough a lot of the time, e.g. you can use it to relate "á" to "a".
> However, there's another way of writing "á" which takes two
> characters, and this situation can't be handled (AFAIK) by the
> case-fold-search machinery.

This just means you cannot implement that without changes to the C
level.  Changing the C code to lift the one-character restriction is
not very hard.

> The bright side is that I think this two-char way of writing latin
> accents is much less common (not 100% sure though, it's hard to tell
> the difference). The downside is that I know nothing about other
> languages, so maybe using two chars to represent one char is the
> default behavior in some other languages?

It can be more than 2 characters, e.g. in scripts that use diacritics:
there could be more than one diacritic combined with one base character.

And then there are characters to be ignored, like ZWJ and bidi
directional controls.

So I think ad-hoc rules like the above are not going to cut it.  We
must use the decomposed forms, whatever they are, and we should also
consult the character properties to ignore the ignorables.
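
As a rough Lisp-level sketch of that rule (not the C-level change under
discussion): decompose each character recursively, then drop nonspacing
marks and format controls. The names `my-fold-char' and `my-fold-string'
are hypothetical, and treating general categories Mn and Cf as "the
ignorables" is an assumption made only for this example.

```emacs-lisp
(defun my-fold-char (c)
  "Return the list of base characters that C folds to.
Hypothetical helper: decomposes C recursively, then drops nonspacing
marks (Mn) and format controls (Cf, e.g. ZWJ and bidi controls)."
  (let ((dec (get-char-code-property c 'decomposition)))
    ;; A compatibility decomposition starts with a tag symbol; skip it.
    (when (and dec (symbolp (car dec)))
      (setq dec (cdr dec)))
    (if (or (null dec) (equal dec (list c)))
        ;; C does not decompose further: keep it unless it is ignorable.
        (unless (memq (get-char-code-property c 'general-category) '(Mn Cf))
          (list c))
      (apply #'append (mapcar #'my-fold-char dec)))))

(defun my-fold-string (string)
  "Return STRING with every character folded by `my-fold-char'."
  (mapconcat (lambda (c) (concat (my-fold-char c))) string ""))

(my-fold-string (string #x1ea5))           ; U+1EA5, two diacritics → "a"
(my-fold-string (string ?a #x301 #x200d))  ; a + acute + ZWJ        → "a"
```

Applying the same folding to both the search string and the searched text
makes the comparison independent of how many marks were attached or of
which normalization form either side used.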

> >> Does anyone have suggestions? Maybe some defadvice magic?
> 
> You can use a defadvice around one of the isearch internal functions
> (check out the branch I mentioned) to implement something in elisp.
> And you can redefine the buffer's case-folding table and use that in
> the advice, but that will require that you generate the entire table.

Please don't kludge around the problem.  If it is important enough for
you to solve it, let's solve it as God intended.





* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
  2015-02-06  2:32       ` Artur Malabarba
  2015-02-06  2:51         ` Artur Malabarba
@ 2015-02-06  7:48         ` Eli Zaretskii
  2015-02-06  9:06           ` Artur Malabarba
  1 sibling, 1 reply; 18+ messages in thread
From: Eli Zaretskii @ 2015-02-06  7:48 UTC (permalink / raw)
  To: bruce.connor.am; +Cc: emacs-devel, juri

> Date: Fri, 6 Feb 2015 02:32:46 +0000
> From: Artur Malabarba <bruce.connor.am@gmail.com>
> Cc: emacs-devel <emacs-devel@gnu.org>
> 
> There's probably a way of handling these in c code, but it'll have to
> be done manually (translation tables won't do it).

We already have the decomposition in our database, so nothing needs to
be done manually.

> And by someone who understands this more than me. :-)

The "understands" part has been taken care of when constructing those
databases.  You just need to use them.




* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
  2015-02-06  4:58     ` Stephen J. Turnbull
@ 2015-02-06  7:51       ` Eli Zaretskii
  2015-02-06 14:50         ` Stefan Monnier
  0 siblings, 1 reply; 18+ messages in thread
From: Eli Zaretskii @ 2015-02-06  7:51 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Date: Fri, 06 Feb 2015 13:58:00 +0900
> 
> I would assume that Emacs's character handling machinery knows about
> this stuff, though, or at least the underlying libraries do.  It's
> probably just a matter of incorporating an appropriate library call.

It's actually even easier than that: the decompositions are already in
the char-tables we build from the Unicode character database.  They
just need to be used.
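
At the Lisp level those char-tables are reachable through
`get-char-code-property'. A minimal sketch of what the data looks like:

```emacs-lisp
(list
 ;; U+00E1 (a with acute): canonical decomposition, "a" + U+0301.
 (get-char-code-property #xe1 'decomposition)     ; → (97 769)
 ;; U+FB01 (fi ligature): compatibility decomposition, returned as a
 ;; list whose first element is a tag symbol, followed by ?f and ?i.
 (get-char-code-property #xfb01 'decomposition))
```

The leading tag symbol distinguishes compatibility decompositions from
canonical ones, so code that consumes the list has to skip it.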




* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
  2015-02-06  7:48         ` Eli Zaretskii
@ 2015-02-06  9:06           ` Artur Malabarba
  2015-02-06  9:41             ` Eli Zaretskii
  0 siblings, 1 reply; 18+ messages in thread
From: Artur Malabarba @ 2015-02-06  9:06 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, Juri Linkov

On 6 Feb 2015 07:48, "Eli Zaretskii" <eliz@gnu.org> wrote:
>
> > Date: Fri, 6 Feb 2015 02:32:46 +0000
> > From: Artur Malabarba <bruce.connor.am@gmail.com>
> > Cc: emacs-devel <emacs-devel@gnu.org>
> >
> > There's probably a way of handling these in c code, but it'll have to
> > be done manually (translation tables won't do it).
>
> We already have the decomposition in our database, so nothing needs to
> be done manually.

Yes.
By "manually", I wasn't referring to the database, I was referring to the C
code necessary (in the sense that it's not a matter of simply using
translation tables, it will just need some ad-hoc coding).

> > And by someone who understands this more than me. :-)
>
> The "understands" part has been taken care of when constructing those
> databases.  You just need to use them.

Yes. By "understanding" I was referring to the C code necessary, not the
database. I've actually got quite familiar with the database. :-P


* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
  2015-02-06  9:06           ` Artur Malabarba
@ 2015-02-06  9:41             ` Eli Zaretskii
  2015-02-06 10:03               ` Artur Malabarba
  2015-02-06 10:04               ` Eli Zaretskii
  0 siblings, 2 replies; 18+ messages in thread
From: Eli Zaretskii @ 2015-02-06  9:41 UTC (permalink / raw)
  To: bruce.connor.am; +Cc: emacs-devel, juri

> Date: Fri, 6 Feb 2015 07:06:27 -0200
> From: Artur Malabarba <bruce.connor.am@gmail.com>
> Cc: Juri Linkov <juri@linkov.net>, emacs-devel <emacs-devel@gnu.org>
> 
> > We already have the decomposition in our database, so nothing needs to
> > be done manually.
> 
> Yes. 
> By "manually", I wasn't referring to the database, I was referring to the c
> code necessary (in the sense that it's not a matter of simply using translation
> tables, it will just need some ad-hoc coding). 
> 
> > > And by someone who understands this more than me. :-)
> >
> > The "understands" part has been taken care of when constructing those
> > databases. You just need to use them.
> 
> Yes. By "understanding" I was referring to the c code necessary, not the
> database. I've actually got quite familiar with the database. :-P 

OK, then please don't hesitate to post questions and ask for help,
including for writing some code, if needed.  This doesn't have to be a
single-person effort (unless you want it to be ;-).  Certainly advice
and answers to questions are abundantly available here, for any code
that is in the core, and for issues related to Unicode.

TIA




* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
  2015-02-06  9:41             ` Eli Zaretskii
@ 2015-02-06 10:03               ` Artur Malabarba
  2015-02-06 10:04               ` Eli Zaretskii
  1 sibling, 0 replies; 18+ messages in thread
From: Artur Malabarba @ 2015-02-06 10:03 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, Juri Linkov

> OK, then please don't hesitate to post questions and ask for help,
> including for writing some code, if needed.  This doesn't have to be a
> single-person effort (unless you want it to be ;-).  Certainly advice
> and answers to questions are abundantly available here, for any code
> that is in the core, and for issues related to Unicode.

Yes, I've been meaning to. I have 3 solutions which I can (as in "I
have enough motivation to") implement.
One of them can handle multi-char symbols, but it's only for isearch and
would need to be specifically extended to other searches (it's
essentially an improvement of the previous regexp proposal).
The other two options use char tables, so they are faster and
immediately apply to any searching but only handle single-char
symbols.

I'll write up an email with diffs and help requests next time I have a minute.


* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
  2015-02-06  9:41             ` Eli Zaretskii
  2015-02-06 10:03               ` Artur Malabarba
@ 2015-02-06 10:04               ` Eli Zaretskii
  1 sibling, 0 replies; 18+ messages in thread
From: Eli Zaretskii @ 2015-02-06 10:04 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: juri, bruce.connor.am, emacs-devel

> Date: Fri, 06 Feb 2015 11:41:37 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: emacs-devel@gnu.org, juri@linkov.net
> 
> > Yes. By "understanding" I was referring to the c code necessary, not the
> > database. I've actually got quite familiar with the database. :-P 
> 
> OK, then please don't hesitate to post questions and ask for help,
> including for writing some code, if needed.  This doesn't have to be a
> single-person effort (unless you want it to be ;-).  Certainly advice
> and answers to questions are abundantly available here, for any code
> that is in the core, and for issues related to Unicode.

Btw, an alternative idea which might be worth exploring is to use
string-collate-equalp for comparison during search, or wcscoll it
calls on the C level.  That will use the character databases of the
underlying libraries, instead of using Emacs's own databases (and
so will be less prone to customization), but that might be good enough
in this case.
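
A minimal sketch of how such a comparison could be tried from Lisp. It
assumes an Emacs that provides `string-collate-equalp'; the result is
deliberately not shown, because it depends on the collation rules of the
current locale on the underlying system.

```emacs-lisp
;; Guarded so that the form is harmless where the function is missing.
(when (fboundp 'string-collate-equalp)
  ;; Compare "resume" with "résumé" under the current locale's
  ;; collation rules.  Whether accents are ignored is up to the
  ;; platform's tables, not to Emacs.
  (string-collate-equalp "resume" "r\u00e9sum\u00e9"))
```

Trying this under a few locales is a cheap way to see whether the
platform's notion of equality is close enough for searching.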




* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
  2015-02-06  7:51       ` Eli Zaretskii
@ 2015-02-06 14:50         ` Stefan Monnier
  2015-02-06 14:54           ` Eli Zaretskii
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan Monnier @ 2015-02-06 14:50 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Stephen J. Turnbull, emacs-devel

> It's actually even easier than that: the decompositions are already in
> the char-tables we build from the Unicode character database.  They
> just need to be used.

I think adapting regex.c to it might be a bit more tricky than that.


        Stefan




* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
  2015-02-06 14:50         ` Stefan Monnier
@ 2015-02-06 14:54           ` Eli Zaretskii
  0 siblings, 0 replies; 18+ messages in thread
From: Eli Zaretskii @ 2015-02-06 14:54 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: stephen, emacs-devel

> From: Stefan Monnier <monnier@IRO.UMontreal.CA>
> Cc: "Stephen J. Turnbull" <stephen@xemacs.org>, emacs-devel@gnu.org
> Date: Fri, 06 Feb 2015 09:50:07 -0500
> 
> > It's actually even easier than that: the decompositions are already in
> > the char-tables we build from the Unicode character database.  They
> > just need to be used.
> 
> I think adapting regex.c to it might be a bit more tricky than that.

Yes, but let's start with isearch, and then move to more complex
tasks.




* Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
  2015-02-06  7:29 ` Eli Zaretskii
@ 2015-02-07 12:59   ` Ted Zlatanov
  0 siblings, 0 replies; 18+ messages in thread
From: Ted Zlatanov @ 2015-02-07 12:59 UTC (permalink / raw)
  To: emacs-devel

On Fri, 06 Feb 2015 09:29:33 +0200 Eli Zaretskii <eliz@gnu.org> wrote: 

>> From: Ted Zlatanov <tzz@lifelogs.com>
>> Date: Thu, 05 Feb 2015 17:16:04 -0500
>> 
>> https://emacs.stackexchange.com/questions/7992/how-to-search-an-arabic-word-in-text-without-its-diacritics-accents
>> suggested it would be useful if diacritics were ignored when searching
>> for text in various situations. This is similar to `case-fold-search'
>> but more generic. Here's what I suggested as the answer at the ELisp
>> level:
...

EZ> That doesn't do what we want, it's only a partial solution to that
EZ> problem.  E.g., it doesn't equate the initial, medial, and final
EZ> variants of the letters used by Arabic and other Semitic scripts.
EZ> Moreover, you cannot even search for "a" and find "á", AFAICS.

Thanks for explaining. I am certainly not an expert in this area and
don't even speak or write Arabic, but my solution did work for the given
parameters so I thought it might be useful.

EZ> The way to solve this correctly and generally was discussed here some
EZ> time ago, so if there are people here for whom this is an itch to
EZ> scratch, please let's do this as discussed there.  We already have all
EZ> the necessary information for that in Emacs databases.

I am not one of those people. There's little I can contribute other than
this suggestion and testing for Romance languages with accents.

The general need seems to be for extending `case-fold-search', perhaps
with a new variable like `fold-search' that's a set of symbols. But I'm
sure you've already thought of that.

The performance concerns are justified but IMHO a correct solution is
easy to optimize later, so I wouldn't worry too much about it.

Ted
