unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed
* Regular expression libraries
@ 2016-12-15 19:00 Clément Pit--Claudel
  2016-12-15 20:10 ` Eli Zaretskii
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Clément Pit--Claudel @ 2016-12-15 19:00 UTC (permalink / raw)
  To: Emacs developers

[-- Attachment #1: Type: text/plain, Size: 3877 bytes --]

Hi emacs-devel,

The top of regex.c says this:

    /* Ignore some GCC warnings for now.  This section should go away
       once the Emacs and Gnulib regex code is merged.  */

And indeed I've seen brief discussions of this merge in the past (https://lists.gnu.org/archive/html/emacs-devel/2002-04/msg00083.html , https://lists.gnu.org/archive/html/emacs-devel/2016-07/msg01115.html).  Looking at this is more detail, though, it is not clear how such a merge could happen:

* Emacs has special regexp features based on syntax classes, or the position of the point.  These features don't seem to be supported in gnulib nor glibc
* Emacs uses a gap buffer, whereas gnulib and glibc expects a char* for the subject string

Stefan also expressed a desire to move to a better regular expression engine — ideally one that uses a non-backtracking implementation when possible, to avoid the exponential blowups that Emacs currently suffers from on certain regular expressions.

I've had a quick look at the existing options.  Here's what I gathered (the list below includes only libraries whose licensing is compatible with Emacs):

* There are only two usable, non-backtracking regexp libraries:
  - RE2 (from Google, C++) doesn't support any form of backtracking, though, so "\\(.*\\)\1" won't work anymore.  Additionally, RE2 does not support matching over a gap buffer (the subject string needs to be a contiguous chunk of memory).  Discussion here: https://github.com/google/re2/issues/126
  - TRE (http://laurikari.net/tre/) uses backtracking if needed, but otherwise uses a finite automaton.  It's stable, can be relatively easily linked into Emacs, and performance is similar to what we have (I benchmarked it).  It does support matching on a gap buffer (it has an interface to use the moral equivalent of a random iterator, reguexec).  It doesn't support Emacs' native encoding out of the box, though, and the library is unmaintained; there are multiple forks which fix various security vulnerabilities.
  In both cases, the syntax isn't compatible with that of Emacs, which would force us to write a compatibility layer.  And Emacs' internal encoding isn't supported either.

* Backtracking implementations are more common, but none seem to fit the bill:
  - Oniguruma and Onigmo (C, used in Ruby, the Atom text editor, and others) support a wide range of encodings, including allegedly Emacs' internal encoding, as well as many syntaxes, including Emacs' syntax.  But they don't support matching on a gap buffer either (they take a contiguous string).  Discussion here: https://github.com/k-takata/Onigmo/issues/83
  - PCRE (C, used in Perl and others) doesn't have Emacs encoding, nor Emacs syntax.  It requires a contiguous buffer, but it does support partial matches.  Partial matches can apparently be restarted, which could almost work in our case, but the semantics of restarting a search from a partial match are broken.
  - Boost (C++) can use a custom random-access iterator for the subject string, which should work for gap buffers; but I'm not sure how good the encoding story is.

If I understand our current implementation, our regexp.c functions take the moral equivalent of two strings, and match over that, pretending that it's just one large string.  In practice, these two strings are the two halves of the gap buffer.  Correct?

The bottom line is that, even though it should be relatively easy to change string-match-p to use onigmo, changing re-search-forward sounds tricky.  Did I miss something?  Was there a concrete plan for merging with glibc, or for using an automaton-based matching strategy?  The advantages of switching could be support for a more common syntax, more flexible regular expressions (named groups, assertions, …), better performance, and better handling of special cases.

Cheers,
Clément.


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 1979 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regular expression libraries
  2016-12-15 19:00 Regular expression libraries Clément Pit--Claudel
@ 2016-12-15 20:10 ` Eli Zaretskii
  2016-12-15 20:30 ` Paul Eggert
  2016-12-16  5:15 ` Stefan Monnier
  2 siblings, 0 replies; 16+ messages in thread
From: Eli Zaretskii @ 2016-12-15 20:10 UTC (permalink / raw)
  To: Clément Pit--Claudel; +Cc: emacs-devel

> From: Clément Pit--Claudel <cpitcla@mit.edu>
> Date: Thu, 15 Dec 2016 14:00:25 -0500
> 
> * Emacs has special regexp features based on syntax classes, or the position of the point.  These features don't seem to be supported in gnulib nor glibc
> * Emacs uses a gap buffer, whereas gnulib and glibc expects a char* for the subject string

There's one other important aspect: Emacs doesn't use the locale to
implement the likes of [:alnum:], [:print:], etc.  We use our own
functions that access Unicode data stored in specialized char-tables,
see the uni-*.el files.  We also use our own macros to fetch multibyte
characters from buffers and strings, instead of functions from the
standard C library.

> If I understand our current implementation, our regexp.c functions take the moral equivalent of two strings, and match over that, pretending that it's just one large string.  In practice, these two strings are the two halves of the gap buffer.  Correct?

Yes.



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regular expression libraries
  2016-12-15 19:00 Regular expression libraries Clément Pit--Claudel
  2016-12-15 20:10 ` Eli Zaretskii
@ 2016-12-15 20:30 ` Paul Eggert
  2016-12-15 22:00   ` Andreas Schwab
  2016-12-15 22:16   ` Clément Pit--Claudel
  2016-12-16  5:15 ` Stefan Monnier
  2 siblings, 2 replies; 16+ messages in thread
From: Paul Eggert @ 2016-12-15 20:30 UTC (permalink / raw)
  To: Clément Pit--Claudel, Emacs developers

On 12/15/2016 11:00 AM, Clément Pit--Claudel wrote:
> I've had a quick look at the existing options.  Here's what I gathered (the list below includes only libraries whose licensing is compatible with Emacs):
>
> * There are only two usable, non-backtracking regexp libraries:
>    - RE2 ...
>    - TRE ...

Another library is in that list:

   - The GNU C library, which ordinarily uses a DFA but which uses 
backtracking if needed.

That is, the current glibc implementation is now quite different from 
the old version that the Emacs matcher is derived from.



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regular expression libraries
  2016-12-15 20:30 ` Paul Eggert
@ 2016-12-15 22:00   ` Andreas Schwab
  2016-12-16  7:20     ` Paul Eggert
  2016-12-15 22:16   ` Clément Pit--Claudel
  1 sibling, 1 reply; 16+ messages in thread
From: Andreas Schwab @ 2016-12-15 22:00 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Emacs developers, Clément Pit--Claudel

On Dez 15 2016, Paul Eggert <eggert@cs.ucla.edu> wrote:

> Another library is in that list:
>
>   - The GNU C library, which ordinarily uses a DFA but which uses
> backtracking if needed.

But it doesn't support a gap.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regular expression libraries
  2016-12-15 20:30 ` Paul Eggert
  2016-12-15 22:00   ` Andreas Schwab
@ 2016-12-15 22:16   ` Clément Pit--Claudel
  1 sibling, 0 replies; 16+ messages in thread
From: Clément Pit--Claudel @ 2016-12-15 22:16 UTC (permalink / raw)
  To: Paul Eggert, Emacs developers


[-- Attachment #1.1: Type: text/plain, Size: 864 bytes --]

On 2016-12-15 15:30, Paul Eggert wrote:
> On 12/15/2016 11:00 AM, Clément Pit--Claudel wrote:
>> I've had a quick look at the existing options.  Here's what I
>> gathered (the list below includes only libraries whose licensing is
>> compatible with Emacs):
>> 
>> * There are only two usable, non-backtracking regexp libraries: -
>> RE2 ... - TRE ...
> 
> Another library is in that list:
> 
> - The GNU C library, which ordinarily uses a DFA but which uses
> backtracking if needed.
> 
> That is, the current glibc implementation is now quite different from
> the old version that the Emacs matcher is derived from.

Thanks! I'm surprised that I didn't find out about this while looking around.

Does it support something like PCRE_PARTIAL in pcre2_dfa_match?  That would make it trivial to extend it to a gap buffer.

Cheers,
Clément.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regular expression libraries
  2016-12-15 19:00 Regular expression libraries Clément Pit--Claudel
  2016-12-15 20:10 ` Eli Zaretskii
  2016-12-15 20:30 ` Paul Eggert
@ 2016-12-16  5:15 ` Stefan Monnier
  2016-12-16 14:41   ` Clément Pit--Claudel
  2 siblings, 1 reply; 16+ messages in thread
From: Stefan Monnier @ 2016-12-16  5:15 UTC (permalink / raw)
  To: emacs-devel

> * Emacs has special regexp features based on syntax classes, or the position of the point.  These features don't seem to be supported in gnulib nor glibc
> * Emacs uses a gap buffer, whereas gnulib and glibc expects a char* for the subject string

The gnulib regexp code used to be a derivative of the code we have in
Emacs (or rather they both derived from the same codebase).
Since then it's been replaced with a completely different codebase.

But IIRC the two did merge back (or close enough) before gnulib' code
was replaced with something else.

> Stefan also expressed a desire to move to a better regular expression
> engine — ideally one that uses a non-backtracking implementation when
> possible, to avoid the exponential blowups that Emacs currently
> suffers from on certain regular expressions.

From my point of view, the main problem with the current code is the
blowup in pathological cases, and it's the only one that justifies moving
to another implementation.

Some users may want to change the engine so as to get new features, but
FWIW, I think it'd be much easier to just add the features to the
current engine.  The main reason such features are currently lacking is
that I've been resisting their addition because I don't want such
features to later make it harder to switch to a non-backtracking engine.

> * There are only two usable, non-backtracking regexp libraries:

What about the one in glibc?

>   - TRE (http://laurikari.net/tre/) uses backtracking if needed, but
>     otherwise uses a finite automaton.  It's stable, can be relatively
>     easily linked into Emacs, and performance is similar to what we
>     have (I benchmarked it).  It does support matching on a gap buffer
>     (it has an interface to use the moral equivalent of a random
>     iterator, reguexec).  It doesn't support Emacs' native encoding
>     out of the box, though, and the library is unmaintained; there are
>     multiple forks which fix various security vulnerabilities.

So, this would be acceptable, but would mean we'd still use "our own
code" and maintain it ourselves.

> If I understand our current implementation, our regexp.c functions
> take the moral equivalent of two strings, and match over that,
> pretending that it's just one large string.  In practice, these two
> strings are the two halves of the gap buffer.  Correct?

That's right.


        Stefan




^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regular expression libraries
  2016-12-15 22:00   ` Andreas Schwab
@ 2016-12-16  7:20     ` Paul Eggert
  2016-12-16 14:31       ` Clément Pit--Claudel
  2016-12-16 14:54       ` Clément Pit--Claudel
  0 siblings, 2 replies; 16+ messages in thread
From: Paul Eggert @ 2016-12-16  7:20 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Emacs developers, Clément Pit--Claudel

Andreas Schwab wrote:
>> Another library is in that list:
>>
>>   - The GNU C library, which ordinarily uses a DFA but which uses
>> backtracking if needed.
> But it doesn't support a gap.

True, but none of the others do either. Only the old glibc code (which survives 
only in Emacs now) does that, as far as I know.

I doubt whether it'd be that hard to add gap support to any regexp matcher, as 
it's simply interposing an address calculation.



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regular expression libraries
  2016-12-16  7:20     ` Paul Eggert
@ 2016-12-16 14:31       ` Clément Pit--Claudel
  2016-12-16 14:54       ` Clément Pit--Claudel
  1 sibling, 0 replies; 16+ messages in thread
From: Clément Pit--Claudel @ 2016-12-16 14:31 UTC (permalink / raw)
  To: emacs-devel


[-- Attachment #1.1: Type: text/plain, Size: 885 bytes --]

On 2016-12-16 02:20, Paul Eggert wrote:
> Andreas Schwab wrote:
>>> Another library is in that list:
>>> 
>>> - The GNU C library, which ordinarily uses a DFA but which uses 
>>> backtracking if needed.
>> But it doesn't support a gap.
> 
> True, but none of the others do either. Only the old glibc code
> (which survives only in Emacs now) does that, as far as I know.

TRE does.

> I doubt whether it'd be that hard to add gap support to any regexp
> matcher, as it's simply interposing an address calculation.

I did ask both the Oniguruma and the RE2 people; neither were optimistic (admittedly, my question was about supporting a slightly more general form than 2 chunks).  In RE2, there are optimizations like using memchr to find the next character within a memory region, which are tricky to use if address calculations are added into the mix.

Clément.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regular expression libraries
  2016-12-16  5:15 ` Stefan Monnier
@ 2016-12-16 14:41   ` Clément Pit--Claudel
  2016-12-16 17:45     ` Paul Eggert
  0 siblings, 1 reply; 16+ messages in thread
From: Clément Pit--Claudel @ 2016-12-16 14:41 UTC (permalink / raw)
  To: emacs-devel


[-- Attachment #1.1: Type: text/plain, Size: 347 bytes --]

On 2016-12-16 00:15, Stefan Monnier wrote:
> What about the one in glibc?

Paul pointed to it too.  Is it documented anywhere?  That is, are its implementation details documented anywhere?

The only thing I can find is this:

  https://www.gnu.org/software/libc/manual/html_node/Regular-Expressions.html#Regular-Expressions

Clément.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regular expression libraries
  2016-12-16  7:20     ` Paul Eggert
  2016-12-16 14:31       ` Clément Pit--Claudel
@ 2016-12-16 14:54       ` Clément Pit--Claudel
  2016-12-16 15:42         ` Lars Ingebrigtsen
  2016-12-16 17:43         ` Paul Eggert
  1 sibling, 2 replies; 16+ messages in thread
From: Clément Pit--Claudel @ 2016-12-16 14:54 UTC (permalink / raw)
  To: emacs-devel


[-- Attachment #1.1: Type: text/plain, Size: 1222 bytes --]

On 2016-12-16 02:20, Paul Eggert wrote:
> Andreas Schwab wrote:
>>> Another library is in that list:
>>> 
>>> - The GNU C library, which ordinarily uses a DFA but which uses 
>>> backtracking if needed.
>> 
>> But it doesn't support a gap.
> 
> True, but none of the others do either. Only the old glibc code
> (which survives only in Emacs now) does that, as far as I know.

Wait, now I'm confused.  The master branch of glibc *does* include the following in regex.h:

extern int __re_search_2
  (struct re_pattern_buffer *buffer, const char *string1,
   int length1, const char *string2, int length2,
   int start, int range, struct re_registers *regs, int stop);

… but the implementation just allocates a large buffer and concatenates both strings?

Also, one naive question: the code does this:

    int len = length1 + length2;
    char *s = NULL;

    if (BE (length1 < 0 || length2 < 0 || stop < 0 || len < length1, 0))
      return -2;

According to the git history, the last check (len < length1) was added to "avoid overflow in computing sum of lengths". But I thought int overflow was undefined behaviour, so why does the len < length1 check make sense?

Thanks!
Clément.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regular expression libraries
  2016-12-16 14:54       ` Clément Pit--Claudel
@ 2016-12-16 15:42         ` Lars Ingebrigtsen
  2016-12-16 20:06           ` Clément Pit--Claudel
  2016-12-16 17:43         ` Paul Eggert
  1 sibling, 1 reply; 16+ messages in thread
From: Lars Ingebrigtsen @ 2016-12-16 15:42 UTC (permalink / raw)
  To: Clément Pit--Claudel; +Cc: emacs-devel

Clément Pit--Claudel <clement.pit@gmail.com> writes:

> Wait, now I'm confused.  The master branch of glibc *does* include the following in regex.h:
>
> extern int __re_search_2
>   (struct re_pattern_buffer *buffer, const char *string1,
>    int length1, const char *string2, int length2,
>    int start, int range, struct re_registers *regs, int stop);
>
> … but the implementation just allocates a large buffer and concatenates both strings?

Yes.  If I remember correctly, a glibc developer popped up here a couple
of years ago announcing that they were going to do that change (thereby
making __re_search_2 incredibly slow, but retained for compatibility).

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regular expression libraries
  2016-12-16 14:54       ` Clément Pit--Claudel
  2016-12-16 15:42         ` Lars Ingebrigtsen
@ 2016-12-16 17:43         ` Paul Eggert
  1 sibling, 0 replies; 16+ messages in thread
From: Paul Eggert @ 2016-12-16 17:43 UTC (permalink / raw)
  To: Clément Pit--Claudel; +Cc: Gnulib bugs, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 535 bytes --]

Clément Pit–Claudel wrote:
> the last check (len < length1) was added to "avoid overflow in computing sum of lengths". But I thought int overflow was undefined behaviour, so why does the len < length1 check make sense?

You're right, it doesn't. I fixed this by installing the attached patch into 
Gnulib. You should look at the Gnulib version and not the glibc version: Gnulib 
regex has portability and other fixes that glibc regex doesn't. I never have 
gotten up the energy to propagate the Gnulib fixes back into glibc.

[-- Attachment #2: 0001-regex-fix-integer-overflow-bug-in-never-used-code.txt --]
[-- Type: text/plain, Size: 2873 bytes --]

From d2e639dd5a5586e1daad32d30624d98c2805ffb6 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Fri, 16 Dec 2016 09:34:02 -0800
Subject: [PATCH] regex: fix integer-overflow bug in never-used code
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Problem reported by Clément Pit–Claudel in:
http://lists.gnu.org/archive/html/emacs-devel/2016-12/msg00654.html
* lib/regex_internal.h: Include intprops.h.
* lib/regexec.c (re_search_2_stub): Use it to avoid undefined
behavior on integer overflow.
* modules/regex (Depends-on): Add intprops.
---
 ChangeLog            | 8 ++++++++
 lib/regex_internal.h | 2 ++
 lib/regexec.c        | 6 ++++--
 modules/regex        | 1 +
 4 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index dd67dba..c588296 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,13 @@
 2016-12-16  Paul Eggert  <eggert@cs.ucla.edu>
 
+	regex: fix integer-overflow bug in never-used code
+	Problem reported by Clément Pit–Claudel in:
+	http://lists.gnu.org/archive/html/emacs-devel/2016-12/msg00654.html
+	* lib/regex_internal.h: Include intprops.h.
+	* lib/regexec.c (re_search_2_stub): Use it to avoid undefined
+	behavior on integer overflow.
+	* modules/regex (Depends-on): Add intprops.
+
 	fpending: fix port to MinGW on Emacs
 	* lib/stdio-impl.h [__MINGW32__]: Do not include errno.h.
 	Problem reported by Eli Zaretskii in:
diff --git a/lib/regex_internal.h b/lib/regex_internal.h
index 56a315a..7ac5f92 100644
--- a/lib/regex_internal.h
+++ b/lib/regex_internal.h
@@ -33,6 +33,8 @@
 #include <stdbool.h>
 #include <stdint.h>
 
+#include "intprops.h"
+
 #ifdef _LIBC
 # include <libc-lock.h>
 # define lock_define(name) __libc_lock_define (, name)
diff --git a/lib/regexec.c b/lib/regexec.c
index afdc173..f1e61af 100644
--- a/lib/regexec.c
+++ b/lib/regexec.c
@@ -354,10 +354,12 @@ re_search_2_stub (struct re_pattern_buffer *bufp, const char *string1,
 {
   const char *str;
   regoff_t rval;
-  Idx len = length1 + length2;
+  Idx len;
   char *s = NULL;
 
-  if (BE (length1 < 0 || length2 < 0 || stop < 0 || len < length1, 0))
+  if (BE ((length1 < 0 || length2 < 0 || stop < 0
+           || INT_ADD_WRAPV (length1, length2, &len)),
+          0))
     return -2;
 
   /* Concatenate the strings.  */
diff --git a/modules/regex b/modules/regex
index c5ae1b2..e1c9b58 100644
--- a/modules/regex
+++ b/modules/regex
@@ -18,6 +18,7 @@ ssize_t
 alloca-opt      [test $ac_use_included_regex = yes]
 btowc           [test $ac_use_included_regex = yes]
 gettext-h       [test $ac_use_included_regex = yes]
+intprops        [test $ac_use_included_regex = yes]
 lock      [test "$ac_cv_gnu_library_2_1:$ac_use_included_regex" = no:yes]
 memcmp          [test $ac_use_included_regex = yes]
 memmove         [test $ac_use_included_regex = yes]
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: Regular expression libraries
  2016-12-16 14:41   ` Clément Pit--Claudel
@ 2016-12-16 17:45     ` Paul Eggert
  2016-12-16 20:07       ` Clément Pit--Claudel
  0 siblings, 1 reply; 16+ messages in thread
From: Paul Eggert @ 2016-12-16 17:45 UTC (permalink / raw)
  To: Clément Pit--Claudel, emacs-devel

Clément Pit--Claudel wrote:
> Paul pointed to it too.  Is it documented anywhere?  That is, are its implementation details documented anywhere?

There is no secret or mysteriously hidden documentation, no. The comments are 
all in the source code. Although it is not easy reading, that's par for the 
course with fast regexp matchers.



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regular expression libraries
  2016-12-16 15:42         ` Lars Ingebrigtsen
@ 2016-12-16 20:06           ` Clément Pit--Claudel
  2016-12-16 21:25             ` Eli Zaretskii
  0 siblings, 1 reply; 16+ messages in thread
From: Clément Pit--Claudel @ 2016-12-16 20:06 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: emacs-devel


[-- Attachment #1.1: Type: text/plain, Size: 1373 bytes --]

On 2016-12-16 10:42, Lars Ingebrigtsen wrote:
> Clément Pit--Claudel <clement.pit@gmail.com> writes:
> 
>> Wait, now I'm confused.  The master branch of glibc *does* include the following in regex.h:
>>
>> extern int __re_search_2
>>   (struct re_pattern_buffer *buffer, const char *string1,
>>    int length1, const char *string2, int length2,
>>    int start, int range, struct re_registers *regs, int stop);
>>
>> … but the implementation just allocates a large buffer and concatenates both strings?
> 
> Yes.  If I remember correctly, a glibc developer popped up here a couple
> of years ago announcing that they were going to do that change (thereby
> making __re_search_2 incredibly slow, but retained for compatibility).

I see, thanks.  This code seems a bit beyond my abilities to easily extend :/
Eli started listing things that we would need from an external library before we could move to it.  AFAICT, there is:

* Support for searching a gap buffer (essentially two strings).
* Support for Emacs' internal utf-8 based encoding
* Support for Emacs' regexp syntax (though we could imagine writing a translator)
* Support for Emacs' syntax properties, and Emacs-specific extensions like matching the position of the point, etc.

An I missing anything else? Which ones of these things does gnulib already have?

Thanks!
Clément.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regular expression libraries
  2016-12-16 17:45     ` Paul Eggert
@ 2016-12-16 20:07       ` Clément Pit--Claudel
  0 siblings, 0 replies; 16+ messages in thread
From: Clément Pit--Claudel @ 2016-12-16 20:07 UTC (permalink / raw)
  To: Paul Eggert, emacs-devel


[-- Attachment #1.1: Type: text/plain, Size: 650 bytes --]

On 2016-12-16 12:45, Paul Eggert wrote:
> Clément Pit--Claudel wrote:
>> Paul pointed to it too.  Is it documented anywhere?  That is, are
>> its implementation details documented anywhere?
> 
> There is no secret or mysteriously hidden documentation, no. The
> comments are all in the source code. Although it is not easy reading,
> that's par for the course with fast regexp matchers.

Thanks! This wasn't meant as criticism.  I was just surprised to learn that there seems to be a very good regex implementation heavily used in the wild, but it's not commonly discussed in benchmarks, nor loudly advertised :)

Cheers,
Clément.



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Regular expression libraries
  2016-12-16 20:06           ` Clément Pit--Claudel
@ 2016-12-16 21:25             ` Eli Zaretskii
  0 siblings, 0 replies; 16+ messages in thread
From: Eli Zaretskii @ 2016-12-16 21:25 UTC (permalink / raw)
  To: Clément Pit--Claudel; +Cc: larsi, emacs-devel

> From: Clément Pit--Claudel <clement.pit@gmail.com>
> Date: Fri, 16 Dec 2016 15:06:13 -0500
> Cc: emacs-devel@gnu.org
> 
> Eli started listing things that we would need from an external library before we could move to it.  AFAICT, there is:
> 
> * Support for searching a gap buffer (essentially two strings).
> * Support for Emacs' internal utf-8 based encoding
> * Support for Emacs' regexp syntax (though we could imagine writing a translator)
> * Support for Emacs' syntax properties, and Emacs-specific extensions like matching the position of the point, etc.
> 
> An I missing anything else? Which ones of these things does gnulib already have?

 * Support for custom functions, independent of the current locale,
   that implement [:alpha:], [:print:], iswupper, etc.
 * Support for Emacs-specific character classes, like [:multibyte:]
   and [:unibyte:]
 * Support for Emacs character categories
 * Support for quitting in the middle of a regex operation
 * Support for treating a single blank as standing for any stretch of
   whitespace

Basically, search for "ifdef emacs", and you will find all the
Emacs-specific features we use.



^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2016-12-16 21:25 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-15 19:00 Regular expression libraries Clément Pit--Claudel
2016-12-15 20:10 ` Eli Zaretskii
2016-12-15 20:30 ` Paul Eggert
2016-12-15 22:00   ` Andreas Schwab
2016-12-16  7:20     ` Paul Eggert
2016-12-16 14:31       ` Clément Pit--Claudel
2016-12-16 14:54       ` Clément Pit--Claudel
2016-12-16 15:42         ` Lars Ingebrigtsen
2016-12-16 20:06           ` Clément Pit--Claudel
2016-12-16 21:25             ` Eli Zaretskii
2016-12-16 17:43         ` Paul Eggert
2016-12-15 22:16   ` Clément Pit--Claudel
2016-12-16  5:15 ` Stefan Monnier
2016-12-16 14:41   ` Clément Pit--Claudel
2016-12-16 17:45     ` Paul Eggert
2016-12-16 20:07       ` Clément Pit--Claudel

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).