Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)

From: Artur Malabarba <bruce.connor.am@gmail.com>
To: Juri Linkov <juri@linkov.net>
Cc: emacs-devel <emacs-devel@gnu.org>
Subject: Re: extending case-fold-search to remove nonspacing marks (diacritics etc.)
Date: Fri, 6 Feb 2015 02:32:46 +0000
Message-ID: <CAAdUY-L228zdyHLp0V5O0xN47qGSKVZmq-RKCC0L5Xc48gYbwA@mail.gmail.com>
In-Reply-To: <87386jx2m2.fsf@mail.linkov.net>

2015-02-05 22:54 GMT-02:00 Juri Linkov <juri@linkov.net>:
>> Something essentially identical to this was being discussed here a
>> couple of weeks ago. Look for the thread "Single quotes in Info". I
>> wrote a small elisp solution for building this into isearch (which you
>> can find on the "scratch/isearch-character-group-folding" branch). It
>> took a different approach to yours, relating characters to regexp, but
>> it works.
>
> I see that your branch contains nothing beyond what was implemented
> long ago in bug#13041, where the major stumbling block was the
> inefficiency of the regexp-based solution.  Could you help improve it?

I'll have a look. The code I wrote was fast enough for isearch and I'm
starting to convince myself it was the best solution.

The motivation for extending the case-fold tables was to make this fast
enough to use in any search, and to have it work in a few real corner
cases. Combine that with the core dump I hit while trying to implement
it, and you have a recipe for my fast-diminishing motivation to do
this.

>> The bright side is that I think this two-char way of writing Latin
>> accents is much less common (not 100% sure though, it's hard to tell
>> the difference). The downside is that I know nothing about other
>> languages, so maybe using two chars to represent one char is the
>> default behavior in some other languages?
>
> As https://emacs.stackexchange.com/q/7992/478 indicates,
> other languages require insertion/deletion of special characters
> like diacritics/accents from the search string/buffer for normalization.
>
> When looking for a solution, I recommend looking at ucs-normalize.
> For example, evaluating:
>
>   (require 'ucs-normalize)
>   ucs-normalize-combining-chars
>
> you can see exactly the same characters
>
>   1616 1615 1619 1648 1618 1612 1613 1611 1617 1614
>
> mentioned in https://emacs.stackexchange.com/a/8001/478
>
> Using its corresponding regexp `ucs-normalize-combining-chars-regexp'
> is easy in isearch, e.g.:
>
>   ;; Decomposition search for accented letters.
>   (define-key isearch-mode-map "\M-sd" 'isearch-toggle-decomposition)
>
>   (defun isearch-toggle-decomposition ()
>     "Toggle Unicode decomposition searching on or off."
>     (interactive)
>     (setq isearch-word (unless (eq isearch-word 'isearch-decomposition-regexp)
>                          'isearch-decomposition-regexp))
>     (if isearch-word (setq isearch-regexp nil))
>     (setq isearch-success t isearch-adjusted t)
>     (isearch-update))
>
>   (defun isearch-decomposition-regexp (string &optional _lax)
>     "Return a regexp that matches decomposed Unicode characters in STRING."
>     ;; Drop the trailing "+" so the class can be followed by "?" below.
>     (let ((accents (substring ucs-normalize-combining-chars-regexp 0 -1)))
>       (mapconcat
>        (lambda (c0)
>          ;; Quote C0 so regexp-special characters are matched literally.
>          (concat (regexp-quote (string c0)) accents "?"))
>        (replace-regexp-in-string accents "" string) "")))
>
>   (put 'isearch-decomposition-regexp 'isearch-message-prefix "deco ")
>
> But this is less efficient than properly implementing it using case tables.
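
To make the cost of the regexp route concrete, here is a minimal sketch of
what such a builder produces. The names `my/decomposition-regexp' and
`my/combining-marks' are hypothetical, and the mark class is assumed to be
just the Combining Diacritical Marks block (U+0300..U+036F) rather than the
full `ucs-normalize-combining-chars-regexp'. Every input character gains an
optional character class, so the regexp grows with the search string.

```elisp
;; Sketch only: hypothetical names, not part of Emacs or of the code
;; quoted above.
(require 'ucs-normalize)

(defconst my/combining-marks "[\u0300-\u036f]"
  "Regexp class for the Combining Diacritical Marks block (assumed subset).")

(defun my/decomposition-regexp (string)
  "Return a regexp matching STRING with an optional mark after each character."
  (mapconcat (lambda (c)
               ;; Quote C, then allow at most one combining mark after it.
               (concat (regexp-quote (string c)) my/combining-marks "?"))
             string ""))

;; The unaccented query matches text that is in decomposed (NFD) form.
(let ((case-fold-search nil))
  (string-match-p (my/decomposition-regexp "resume")
                  (ucs-normalize-NFD-string "r\u00e9sum\u00e9")))
```

Note that in this sketch a precomposed character such as U+00E9 in the
searched text is not matched by "e" plus an optional mark, so the approach
only helps when the text itself is decomposed.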

There's probably a way of handling these in C code, but it'll have to
be done manually (translation tables won't do it). And by someone who
understands this better than I do. :-)
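
Short of C-level support, one Lisp-level way to compare short strings while
ignoring marks is to normalize both sides first. The following is a minimal
sketch under that assumption, not what is proposed in this thread:
`my/strip-nonspacing-marks' is a hypothetical helper that decomposes with
`ucs-normalize-NFD-string' and drops every character whose general category
is Mn. It makes a full pass over its input, so it suits search strings more
than whole buffers.

```elisp
;; Sketch only: `my/strip-nonspacing-marks' is a hypothetical helper.
(require 'ucs-normalize)

(defun my/strip-nonspacing-marks (string)
  "Return STRING in NFD form with nonspacing marks (category Mn) removed."
  (concat
   (delq nil
         (mapcar (lambda (c)
                   ;; Keep every character that is not a nonspacing mark.
                   (unless (eq (get-char-code-property c 'general-category)
                               'Mn)
                     c))
                 (ucs-normalize-NFD-string string)))))

;; The precomposed and the decomposed spelling reduce to the same key.
(string= (my/strip-nonspacing-marks "caf\u00e9")
         (my/strip-nonspacing-marks "cafe\u0301"))
```

The Arabic marks in the quoted list (for example 1616, U+0650) belong to the
same category, so they would be dropped as well.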