unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed
* Char-folding: how can we implement matching multiple characters as a single "thing"?
@ 2015-11-30 15:54 Artur Malabarba
  2015-11-30 16:12 ` Paul Eggert
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Artur Malabarba @ 2015-11-30 15:54 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

I pushed this feature a couple of days ago, but I had to disable it
now (the code is in `character-fold-to-regexp').

For those who don't know, the way char-folding works is by replacing
each character in the search string with a regexp of the characters it
can match.
For instance, a character-folding search for "fi" is actually a regexp
search for "\\(?:ḟ\\|[fᶠḟⓕf𝐟𝑓𝒇𝒻𝓯𝔣𝕗𝖋𝖿𝗳𝘧𝙛𝚏]\\)\\(?:i[̀-̨̣̰̄̆̈̉̌̏̑]\\|[iì-ïĩīĭįǐȉȋᵢḭỉịⁱℹⅈⅰⓘi𝐢𝑖𝒊𝒾𝓲𝔦𝕚𝖎𝗂𝗶𝘪𝙞𝚒]\\)".

As you can see, this multiplies the length of the string by a factor
of about 45. But that's still acceptable.

However, this only includes foldings for the 'f' character and the 'i'
character independently. As it happens, the string "fi" can also match
the ligature 'fi'.
Now, for the sake of conciseness, let me denote the f and i regexps
above as [f] and [i].

The code I pushed a couple of days ago added support for multi-char
matches (such as "fi" matching that ligature), but returning a regexp
like this: "\\([f][i]\\|fi\\)"
This might not look like much. It only adds a few chars to a regexp
that already has about 90. But the problem is that it introduces a new
branch.

Let's say we want to search for the word "fix". The letters 'i' and
'x' could also represent the character 'ⅸ' (roman numeral nine). So
the only way to write that regexp is:
"\\([f]\\([i][x]\\|ⅸ\\)\\|fi[x]\\)".
Now the pattern [x] (which I remind you has ~ 40 chars) appears twice
in the regexp, and we've gained another branch. If the next character
the search string can also combine with 'x', then we go from 3 to 5
branches.

As you can see. The number of branches will scale badly with the
number of input characters. Which means the number of output
characters will be much more than 40 times the input.
And even if the next character does NOT combine with 'x', it's hard to
write an algorithm that can identify that and close all branches
before continuing with the conversion.

And that is why I disabled this code for now. With a short string like
"endless-visit-notifications" it was return regexps longer than 10k
characters. Which is more than Emacs handle.

Does anyone have alternative ideas?



^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2015-12-02  0:59 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-11-30 15:54 Char-folding: how can we implement matching multiple characters as a single "thing"? Artur Malabarba
2015-11-30 16:12 ` Paul Eggert
2015-11-30 16:49 ` Clément Pit--Claudel
2015-11-30 17:55   ` Eli Zaretskii
2015-11-30 21:48     ` John Wiegley
2015-12-01 14:18       ` Artur Malabarba
2015-12-01 15:50         ` Eli Zaretskii
2015-12-01 16:31 ` GIT mirror of Lisp dev sources [was: Char-folding: how can we implement matching...] Drew Adams
2015-12-01 16:43   ` Steinar Bang
2015-12-01 17:14     ` Drew Adams
2015-12-01 17:32   ` Artur Malabarba
2015-12-01 18:03     ` Drew Adams
2015-12-01 18:29       ` Karl Fogel
2015-12-01 18:52         ` Artur Malabarba
2015-12-01 21:18           ` Drew Adams
2015-12-01 23:37             ` Artur Malabarba
2015-12-02  0:14               ` Drew Adams
2015-12-02  0:59                 ` Artur Malabarba

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).