unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed
From: "Clément Pit--Claudel" <cpitcla@mit.edu>
To: Emacs developers <emacs-devel@gnu.org>
Subject: Regular expression libraries
Date: Thu, 15 Dec 2016 14:00:25 -0500	[thread overview]
Message-ID: <01d7e608-04d2-84a4-6143-e954bc9d569f@mit.edu> (raw)

[-- Attachment #1: Type: text/plain, Size: 3877 bytes --]

Hi emacs-devel,

The top of regex.c says this:

    /* Ignore some GCC warnings for now.  This section should go away
       once the Emacs and Gnulib regex code is merged.  */

And indeed I've seen brief discussions of this merge in the past (https://lists.gnu.org/archive/html/emacs-devel/2002-04/msg00083.html , https://lists.gnu.org/archive/html/emacs-devel/2016-07/msg01115.html).  Looking at this is more detail, though, it is not clear how such a merge could happen:

* Emacs has special regexp features based on syntax classes, or the position of the point.  These features don't seem to be supported in gnulib nor glibc
* Emacs uses a gap buffer, whereas gnulib and glibc expects a char* for the subject string

Stefan also expressed a desire to move to a better regular expression engine — ideally one that uses a non-backtracking implementation when possible, to avoid the exponential blowups that Emacs currently suffers from on certain regular expressions.

I've had a quick look at the existing options.  Here's what I gathered (the list below includes only libraries whose licensing is compatible with Emacs):

* There are only two usable, non-backtracking regexp libraries:
  - RE2 (from Google, C++) doesn't support any form of backtracking, though, so "\\(.*\\)\1" won't work anymore.  Additionally, RE2 does not support matching over a gap buffer (the subject string needs to be a contiguous chunk of memory).  Discussion here: https://github.com/google/re2/issues/126
  - TRE (http://laurikari.net/tre/) uses backtracking if needed, but otherwise uses a finite automaton.  It's stable, can be relatively easily linked into Emacs, and performance is similar to what we have (I benchmarked it).  It does support matching on a gap buffer (it has an interface to use the moral equivalent of a random iterator, reguexec).  It doesn't support Emacs' native encoding out of the box, though, and the library is unmaintained; there are multiple forks which fix various security vulnerabilities.
  In both cases, the syntax isn't compatible with that of Emacs, which would force us to write a compatibility layer.  And Emacs' internal encoding isn't supported either.

* Backtracking implementations are more common, but none seem to fit the bill:
  - Oniguruma and Onigmo (C, used in Ruby, the Atom text editor, and others) support a wide range of encodings, including allegedly Emacs' internal encoding, as well as many syntaxes, including Emacs' syntax.  But they don't support matching on a gap buffer either (they take a contiguous string).  Discussion here: https://github.com/k-takata/Onigmo/issues/83
  - PCRE (C, used in Perl and others) doesn't have Emacs encoding, nor Emacs syntax.  It requires a contiguous buffer, but it does support partial matches.  Partial matches can apparently be restarted, which could almost work in our case, but the semantics of restarting a search from a partial match are broken.
  - Boost (C++) can use a custom random-access iterator for the subject string, which should work for gap buffers; but I'm not sure how good the encoding story is.

If I understand our current implementation, our regexp.c functions take the moral equivalent of two strings, and match over that, pretending that it's just one large string.  In practice, these two strings are the two halves of the gap buffer.  Correct?

The bottom line is that, even though it should be relatively easy to change string-match-p to use onigmo, changing re-search-forward sounds tricky.  Did I miss something?  Was there a concrete plan for merging with glibc, or for using an automaton-based matching strategy?  The advantages of switching could be support for a more common syntax, more flexible regular expressions (named groups, assertions, …), better performance, and better handling of special cases.

Cheers,
Clément.


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 1979 bytes --]

             reply	other threads:[~2016-12-15 19:00 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-12-15 19:00 Clément Pit--Claudel [this message]
2016-12-15 20:10 ` Regular expression libraries Eli Zaretskii
2016-12-15 20:30 ` Paul Eggert
2016-12-15 22:00   ` Andreas Schwab
2016-12-16  7:20     ` Paul Eggert
2016-12-16 14:31       ` Clément Pit--Claudel
2016-12-16 14:54       ` Clément Pit--Claudel
2016-12-16 15:42         ` Lars Ingebrigtsen
2016-12-16 20:06           ` Clément Pit--Claudel
2016-12-16 21:25             ` Eli Zaretskii
2016-12-16 17:43         ` Paul Eggert
2016-12-15 22:16   ` Clément Pit--Claudel
2016-12-16  5:15 ` Stefan Monnier
2016-12-16 14:41   ` Clément Pit--Claudel
2016-12-16 17:45     ` Paul Eggert
2016-12-16 20:07       ` Clément Pit--Claudel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=01d7e608-04d2-84a4-6143-e954bc9d569f@mit.edu \
    --to=cpitcla@mit.edu \
    --cc=emacs-devel@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).