Re: Char-folding: how can we implement matching multiple characters as a single "thing"?

unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed

From: "Clément Pit--Claudel" <clement.pit@gmail.com>
To: emacs-devel@gnu.org
Subject: Re: Char-folding: how can we implement matching multiple characters as a single "thing"?
Date: Mon, 30 Nov 2015 17:49:35 +0100	[thread overview]
Message-ID: <565C7E1F.10204@gmail.com> (raw)
In-Reply-To: <CAAdUY-KR93As68v-_0TkUYtJSOjBJLD5_VfEiejD8HAsp6Cqgg@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2706 bytes --]

On 11/30/2015 04:54 PM, Artur Malabarba wrote:
> As you can see. The number of branches will scale badly with the 
> number of input characters. Which means the number of output 
> characters will be much more than 40 times the input. And even if
> the next character does NOT combine with 'x', it's hard to write an 
> algorithm that can identify that and close all branches before 
> continuing with the conversion.
> 
> And that is why I disabled this code for now. With a short string 
> like "endless-visit-notifications" it was return regexps longer than 
> 10k characters. Which is more than Emacs handle.

Hi Arthur,

Thanks for your efforts on this; it would be very convenient for "ff" or "fi" to match their ligature counterparts.

Are the difficulties due to the sheer size of the regular expressions, or to their complexity (too much branching?). If the size is the real issue (and if that size mostly comes from repeated folding of individual characters), then I can think of two options:

* Extend the C-level implementation of regular expressions to make character-folding a capability of the C engine, allowing for succinct regexps + a flag.
* Add a new backslash to regular expressions for matching char-folded characters (custom character categories almost already allow for this, so the implementation would probably be similar).

These two solutions would not help with the branching, but they would allow for smaller input to the regexp engine (again, assuming that the size issue is due to the large brackets, not to the branching).

On the other hand, if excessive branching is the issue (either due to the size of the regexps even with the ideas above applied, or because the engine has trouble with all the backtracking), then maybe normalization can help: instead of searching the actual buffer, the regexp engine could be made to search the buffer itself, and a copy of it with all text normalized, for some "good" definition of normalization. So looking for "iff" in "diﬃcult" would really look for "iff" in both "diﬃcult" and "difficult". Since normalization is a per-character operation, this does not require maintaining a second copy of the buffer: instead, one can just feed two streams of characters to the regexp engine: the un-normalized one, and the normalized one.

As a side note, it may also be possible to track at the buffer level whether the whole file is ascii; if it is, then case folding can be omitted, making any performance degradation stemming from character folding more acceptable. One can also use finer-grained control, especially if character folding happens at the C level in the regexp engine.

Cheers,
Clément.

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

next prev parent reply	other threads:[~2015-11-30 16:49 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-11-30 15:54 Char-folding: how can we implement matching multiple characters as a single "thing"? Artur Malabarba
2015-11-30 16:12 ` Paul Eggert
2015-11-30 16:49 ` Clément Pit--Claudel [this message]
2015-11-30 17:55   ` Eli Zaretskii
2015-11-30 21:48     ` John Wiegley
2015-12-01 14:18       ` Artur Malabarba
2015-12-01 15:50         ` Eli Zaretskii
2015-12-01 16:31 ` GIT mirror of Lisp dev sources [was: Char-folding: how can we implement matching...] Drew Adams
2015-12-01 16:43   ` Steinar Bang
2015-12-01 17:14     ` Drew Adams
2015-12-01 17:32   ` Artur Malabarba
2015-12-01 18:03     ` Drew Adams
2015-12-01 18:29       ` Karl Fogel
2015-12-01 18:52         ` Artur Malabarba
2015-12-01 21:18           ` Drew Adams
2015-12-01 23:37             ` Artur Malabarba
2015-12-02  0:14               ` Drew Adams
2015-12-02  0:59                 ` Artur Malabarba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=565C7E1F.10204@gmail.com \
    --to=clement.pit@gmail.com \
    --cc=emacs-devel@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).