all messages for Emacs-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed
* Regular expression libraries
@ 2016-12-15 19:00 Clément Pit--Claudel
  2016-12-15 20:10 ` Eli Zaretskii
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Clément Pit--Claudel @ 2016-12-15 19:00 UTC (permalink / raw)
  To: Emacs developers

[-- Attachment #1: Type: text/plain, Size: 3877 bytes --]

Hi emacs-devel,

The top of regex.c says this:

    /* Ignore some GCC warnings for now.  This section should go away
       once the Emacs and Gnulib regex code is merged.  */

And indeed I've seen brief discussions of this merge in the past (https://lists.gnu.org/archive/html/emacs-devel/2002-04/msg00083.html , https://lists.gnu.org/archive/html/emacs-devel/2016-07/msg01115.html).  Looking at this is more detail, though, it is not clear how such a merge could happen:

* Emacs has special regexp features based on syntax classes, or the position of the point.  These features don't seem to be supported in gnulib nor glibc
* Emacs uses a gap buffer, whereas gnulib and glibc expects a char* for the subject string

Stefan also expressed a desire to move to a better regular expression engine — ideally one that uses a non-backtracking implementation when possible, to avoid the exponential blowups that Emacs currently suffers from on certain regular expressions.

I've had a quick look at the existing options.  Here's what I gathered (the list below includes only libraries whose licensing is compatible with Emacs):

* There are only two usable, non-backtracking regexp libraries:
  - RE2 (from Google, C++) doesn't support any form of backtracking, though, so "\\(.*\\)\1" won't work anymore.  Additionally, RE2 does not support matching over a gap buffer (the subject string needs to be a contiguous chunk of memory).  Discussion here: https://github.com/google/re2/issues/126
  - TRE (http://laurikari.net/tre/) uses backtracking if needed, but otherwise uses a finite automaton.  It's stable, can be relatively easily linked into Emacs, and performance is similar to what we have (I benchmarked it).  It does support matching on a gap buffer (it has an interface to use the moral equivalent of a random iterator, reguexec).  It doesn't support Emacs' native encoding out of the box, though, and the library is unmaintained; there are multiple forks which fix various security vulnerabilities.
  In both cases, the syntax isn't compatible with that of Emacs, which would force us to write a compatibility layer.  And Emacs' internal encoding isn't supported either.

* Backtracking implementations are more common, but none seem to fit the bill:
  - Oniguruma and Onigmo (C, used in Ruby, the Atom text editor, and others) support a wide range of encodings, including allegedly Emacs' internal encoding, as well as many syntaxes, including Emacs' syntax.  But they don't support matching on a gap buffer either (they take a contiguous string).  Discussion here: https://github.com/k-takata/Onigmo/issues/83
  - PCRE (C, used in Perl and others) doesn't have Emacs encoding, nor Emacs syntax.  It requires a contiguous buffer, but it does support partial matches.  Partial matches can apparently be restarted, which could almost work in our case, but the semantics of restarting a search from a partial match are broken.
  - Boost (C++) can use a custom random-access iterator for the subject string, which should work for gap buffers; but I'm not sure how good the encoding story is.

If I understand our current implementation, our regexp.c functions take the moral equivalent of two strings, and match over that, pretending that it's just one large string.  In practice, these two strings are the two halves of the gap buffer.  Correct?

The bottom line is that, even though it should be relatively easy to change string-match-p to use onigmo, changing re-search-forward sounds tricky.  Did I miss something?  Was there a concrete plan for merging with glibc, or for using an automaton-based matching strategy?  The advantages of switching could be support for a more common syntax, more flexible regular expressions (named groups, assertions, …), better performance, and better handling of special cases.

Cheers,
Clément.


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 1979 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2016-12-16 21:25 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-12-15 19:00 Regular expression libraries Clément Pit--Claudel
2016-12-15 20:10 ` Eli Zaretskii
2016-12-15 20:30 ` Paul Eggert
2016-12-15 22:00   ` Andreas Schwab
2016-12-16  7:20     ` Paul Eggert
2016-12-16 14:31       ` Clément Pit--Claudel
2016-12-16 14:54       ` Clément Pit--Claudel
2016-12-16 15:42         ` Lars Ingebrigtsen
2016-12-16 20:06           ` Clément Pit--Claudel
2016-12-16 21:25             ` Eli Zaretskii
2016-12-16 17:43         ` Paul Eggert
2016-12-15 22:16   ` Clément Pit--Claudel
2016-12-16  5:15 ` Stefan Monnier
2016-12-16 14:41   ` Clément Pit--Claudel
2016-12-16 17:45     ` Paul Eggert
2016-12-16 20:07       ` Clément Pit--Claudel

Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/emacs.git
	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.