unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed
* prior work on non-backtracking regex engine?
@ 2024-03-10 15:41 Danny McClanahan
  2024-03-12 23:45 ` Danny McClanahan
  2024-04-17 14:23 ` Clément Pit-Claudel
  0 siblings, 2 replies; 11+ messages in thread
From: Danny McClanahan @ 2024-03-10 15:41 UTC (permalink / raw)
  To: emacs-devel@gnu.org

Hello emacs-devel,

I've been learning a lot about regex implementations recently and was interested in learning:

(1) What kinds of inputs does the current emacs regex engine do poorly on?
  This "Regexp Problems" page is super helpful (https://www.gnu.org/software/emacs/manual/html_node/elisp/Regexp-Problems.html), but I wanted to know if there was anywhere else e.g. in the source I should look for further notes on current performance TODOs? Is there a benchmark suite?

(2) What feature requirements does emacs impose of the regex engine?
  In particular, someone on the internet claimed (without citing any sources) several years ago (https://github.com/google/re2/issues/127#issuecomment-267676672) that the emacs regex engine needs the ability to search through non-contiguous text data to incorporate the gap buffer (which re2 along with many other engines do not support). When I recently asked the rust regex maintainer about this idea (https://github.com/rust-lang/regex/discussions/1167#discussioncomment-8585027), I found that emacs's ability to dynamically rebind word \< and symbol \_< boundaries according to mode is also unsupported by many general-purpose engines which e.g. only support Unicode word boundaries.

  I am also aware emacs supports backrefs in pattern strings, which cannot be implemented without some level of backtracking, but I am hoping this functionality can be implemented on top of an otherwise non-backtracking engine and retain the performance improvements.

(3) Have there been prior investigations of non-backtracking regex engines in emacs, or trying to use an external regex engine in general? What was the outcome, and does it seem like a useful research direction?

Thanks,
Danny



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: prior work on non-backtracking regex engine?
  2024-03-10 15:41 prior work on non-backtracking regex engine? Danny McClanahan
@ 2024-03-12 23:45 ` Danny McClanahan
  2024-03-13 13:23   ` Ihor Radchenko
  2024-04-17 14:23 ` Clément Pit-Claudel
  1 sibling, 1 reply; 11+ messages in thread
From: Danny McClanahan @ 2024-03-12 23:45 UTC (permalink / raw)
  To: emacs-devel@gnu.org

Came up with some more practical/specific questions about implementation. I haven't written any code for this yet and I haven't demonstrated any performance improvement yet, so this is all still totally subject to change. Feel free to ignore for now and I'll post again when I have a convincing demo!

(4) What is the best way to package third-party code for emacs?

I was thinking of architecting this regex engine as a third-party codebase exposing a C ABI shared library, which the emacs build system could detect as an optional dependency (like libjansson). I was hoping to use rust to implement this regex engine, but I know that a cargo package alone isn't enough for non-rust code to depend on: I was also planning to maintain package recipes for this regex engine for multiple linux distros, so that emacs could easily add it to the build system (like librsvg). Are there any further constraints I should know about for optional dependencies in the emacs build system?

`cbindgen' (https://github.com/mozilla/cbindgen) is a mature piece of software to generate C headers for rust code, and cargo/rustc already supports generating C ABI shared libs with the `cdylib' linkage (https://doc.rust-lang.org/reference/linkage.html). I'm also hoping to use rust's `no_std' feature (https://docs.rust-embedded.org/book/intro/no-std.html) to produce very small binaries without any dependency on the rust stdlib or even libc. Is binary size a concern at all for shared libraries emacs depends on?

Note: my goal is to make emacs faster for all users, which is why I'm thinking this non-backtracking engine makes the most sense as a direct replacement for the code in `regex-emacs.c'. However, I understand it would be nicer if this performance improvement didn't require an external dependency at all. If this code ends up being simple enough, I would absolutely consider translating it to C code and contributing it to the emacs codebase directly. Depending on the performance results, we could also consider contributing a "simple" version of this non-backtracking engine in C in emacs itself, with an optional dependency enabling the use of the more performant rust engine. We'll know more about this after we have a working prototype.

(6) How should we handle allocation?

Along with `no_std', I'm also hoping to use the Allocator API (https://doc.rust-lang.org/std/alloc/trait.Allocator.html) with rust collections in the implementation of this regex engine. This should make it feasible to present a C ABI interface which delegates any dynamic allocations to the caller, so that emacs can track regex engine state using its own GC. To be clear, I suspect this just means emacs would provide an `alloc()' function pointer to some methods when calling into the regex engine, along with a `free()' function pointer to other methods. Does that sound like a reasonable approach, or is there more complexity around correctly allocating GC-able objects that I should be aware of? Are there examples of this kind of allocation delegation in the source code that I should model the regex engine API after?

Thanks,
Danny



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: prior work on non-backtracking regex engine?
  2024-03-12 23:45 ` Danny McClanahan
@ 2024-03-13 13:23   ` Ihor Radchenko
  2024-04-07  4:42     ` Danny McClanahan
  0 siblings, 1 reply; 11+ messages in thread
From: Ihor Radchenko @ 2024-03-13 13:23 UTC (permalink / raw)
  To: Danny McClanahan; +Cc: emacs-devel@gnu.org

Danny McClanahan <dmcc2@hypnicjerk.ai> writes:

> (4) What is the best way to package third-party code for emacs?
>
> I was thinking of architecting this regex engine as a third-party
> codebase exposing a C ABI shared library, which the emacs build system
> could detect as an optional dependency (like libjansson). I was hoping
> to use rust to implement this regex engine, but I know that a cargo
> package alone isn't enough for non-rust code to depend on: I was also
> planning to maintain package recipes for this regex engine for
> multiple linux distros, so that emacs could easily add it to the build
> system (like librsvg). Are there any further constraints I should know
> about for optional dependencies in the emacs build system?

AFAIR, previous discussions raised some concerns about using Rust in
particular:
https://yhetil.org/emacs-devel/E1pKAM0-0001Ss-W6@fencepost.gnu.org/
Because of Rust volatility and some license issues.

That said, the linked discussion was not about linking to rust libraries
via C ABI, but about contributing Rust code to Emacs core directly.

Another somewhat relevant discussion is
https://yhetil.org/emacs-devel/95980ffc-86e7-ad54-4a20-539d8c6ea5d0@mailo.com/

P.S. Better regexp engine would be very welcome.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: prior work on non-backtracking regex engine?
  2024-03-13 13:23   ` Ihor Radchenko
@ 2024-04-07  4:42     ` Danny McClanahan
  2024-04-07 14:15       ` Ihor Radchenko
                         ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Danny McClanahan @ 2024-04-07  4:42 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: emacs-devel@gnu.org

Thanks so much for this helpful feedback, Ihor! This makes me much more confident about packaging this as a dependency!

I've written out a summary of the packaging story for this proposed new dependency, along with a tentative work outline at the bottom.

On Wednesday, March 13th, 2024 at 09:19, Ihor Radchenko <yantar92@posteo.net> wrote:
> Danny McClanahan dmcc2@hypnicjerk.ai writes:
> 
> > (4) What is the best way to package third-party code for emacs?
> > 
> > I was thinking of architecting this regex engine as a third-party
> > codebase exposing a C ABI shared library, which the emacs build system
> > could detect as an optional dependency (like libjansson). I was hoping
> > to use rust to implement this regex engine, but I know that a cargo
> > package alone isn't enough for non-rust code to depend on: I was also
> > planning to maintain package recipes for this regex engine for
> > multiple linux distros, so that emacs could easily add it to the build
> > system (like librsvg). Are there any further constraints I should know
> > about for optional dependencies in the emacs build system?
> 
> 
> AFAIR, previous discussions raised some concerns about using Rust in
> particular:
> https://yhetil.org/emacs-devel/E1pKAM0-0001Ss-W6@fencepost.gnu.org/
> Because of Rust volatility and some license issues.
> 
> That said, the linked discussion was not about linking to rust libraries
> via C ABI, but about contributing Rust code to Emacs core directly.

Thanks so much for this context! It sounds like an optional C ABI library dependency is absolutely the ideal way to package something for emacs to depend on, both for legal and maintainability reasons. I absolutely agree with the quote from that linked discussion that "Rust changes too fast" for emacs to reliably expect to build its own rust code. The rust language makes it quite difficult to avoid involving cargo when packaging rust code for distribution, but the emacs build system itself should not need to be aware of cargo or rust at all, as per "we should not publicly refer to its existence" (to avoid hidden dependencies on its unstable behaviors). Each *distro packager* for this rust regex library would need to invoke cargo to compile/link the C ABI library + generated C headers, but those would just be installed into somewhere like /usr/include and /usr/lib like any other C library, so emacs should not need to be made aware of any rust- or cargo-specific behavior even if I choose to use rust as an implementation language. I am extremely enthusiastic about achieving a rust packaging solution for this work that can serve as a role model for other portable rust code exposing a C ABI library interface.

I'm particularly hoping to package any rust output as a shared and/or static C ABI library using rust's no_std configuration, which removes a large portion of rust stdlib code as well as avoiding any implicit libc dependency. Rust actually has a libc crate (https://github.com/rust-lang/libc) which I believe can be used in no_std environments to invoke libc if needed, but since I am hoping to produce a C ABI library interface which delegates to the caller for alloc/free callbacks (instead of performing any dynamic allocation in the regex engine library code), it may be feasible to avoid even a libc dep.

If I employ any exotic instructions, it would be via rust's core::simd API (see e.g. https://mcyoung.xyz/2023/11/27/simd-base64/) for platform-specific vectorized instruction selection. This thoughtful and simple interface falls back to portable SWAR techniques for generic platforms (see https://doc.rust-lang.org/stable/core/simd/index.html#portable-simd-works-on-every-target), so this feature of the rust language would actually *improve* our portability. However, since llvm (and therefore rust) still supports fewer platforms than gcc, I also intend to incorporate rustc_codegen_gcc (which compiles rust with libgccjit) and/or the gccrs frontend (not yet functional but improving rapidly) into the packaging scripts for this regex engine library, to ensure that emacs does not begin to implicitly disadvantage non-llvm platforms.

> Another somewhat relevant discussion is
> https://yhetil.org/emacs-devel/95980ffc-86e7-ad54-4a20-539d8c6ea5d0@mailo.com/

Wow, this is super neat!! A lot of my programming experience has actually been on developing build tools or integrating them with editors and this discussion/library looks super thoughtful! I will have to chew on this discussion more.

> P.S. Better regexp engine would be very welcome.

^_^ !!! This makes me so happy to hear! I took the time to learn how regex-emacs.c as well as other emacs C code interacts with the gap buffer last week, and was super glad to see how simple the gap buffer data structure is. I am not an expert on string search performance yet, but I believe the gap buffer (which is always allocated within a single block, and has at most two non-adjacent data sections) is likely much more amenable to high-performance search techniques than a more complex data structure such as a rope. And I was also *super* pleased to see that regex-emacs.h itself doesn't expose any dependency on the gap buffer or other internal emacs representations (except regarding multibyte encoding). So in my amateur evaluation, emacs actually seems very well-placed to take advantage of high-performance regex engine techniques without any big structural changes.

As a concrete example, I currently maintain helm-rg on MELPA (https://github.com/cosmicexplorer/helm-rg), which itself performs quite a few regex matches to extract process output from ripgrep, and have previously contributed to helm-ag and helm-swoop (the latter of which searches within emacs buffers instead of hitting the filesystem). While ripgrep is extremely fast (even over very large corporate monorepos) and emacs makes it simple enough to communicate efficiently with an asynchronous subprocess, I realized that regex matching over buffer contents (which emacs has already mapped into its virtual memory space) might be an improvement for multiple reasons:
(a) it avoids an unnecessary round-trip through the filesystem (also friendlier to TRAMP),
(b) it allows the emacs user to narrow the search via their preferred method of buffer list selection (as opposed to maintaining a separate interface to configure ripgrep's fs traversal with CLI flags and ignore files).

So practically, I'm hoping this work allows me to effectively deprecate helm-rg and replace a lot of finicky lisp code (which currently performs asynchronous IPC, globbing files, and writing back to the filesystem when editing match results inline) with a simple loop over a provided buffer list that performs a very fast zero-copy in-memory search and batched pattern replacement.

I would *really like* to eventually have emacs depend on an existing regex engine like re2 or rust regex to take advantage of their bugfixes and optimizations, but both of those engines (and most others) require utf-8 input, and (I'm pretty sure) can't easily be made to support emacs's multibyte functionality. So I think there is a strong case for a new engine here, especially one licensed with the GPLv3 (or any later version) as opposed to the LGPL or other more permissive license.

-------
Informal sketch of work outline:
(1) simple, robust non-backtracking regex matcher with a backwards-compatible API for regex-emacs.h (no support for backrefs, maybe just a single NFA matcher with the PikeVM to test the API)
(2) basic benchmark harness + integration into emacs build system (so we can get a concrete performance comparison and %improvement, and start creating distro packages)
(3) implement a lazy DFA / other specialized automata to get performance generally on par with re2/rust regex
(4) implement a SIMD prefilter for literal substrings (the prototype should now be significantly faster than regex-emacs.c on all supported inputs)
(5) separate out a regex pattern parsing API, and expose regex objects to lisp code
(6) try implementing backrefs
(7) batched search-replace API for replacement of pattern strings in buffer contents

In general, I've decided to put off these features at first:
(a) backrefs
(b) per-mode customizable char classes & word/symbol boundaries
(c) regex matching against syntax properties (not the raw string data)

- I will probably propose deprecating (c) entirely, especially if tree-sitter ends up being able to take over syntax propertization in general.
- (b) is non-standard, and if we deprecate it we may be able to more easily lean on an existing regex pattern parser like regex-syntax (https://docs.rs/regex-syntax/), although regex-syntax also does not support (a) backrefs and it probably makes sense to make our own pattern parsing library anyway as part of (5).
- I personally would like to avoid entirely deprecating (a) backrefs even if by definition they require some level of backtracking, because I personally consider them very handy for small bounded inputs. If I can't figure out a clean way to support backrefs in this new engine, I would like to simply retain the regex-emacs.c engine for backref patterns (and maybe add some defvar that warns/errors if a backref pattern is compiled when set).



Thanks so much for providing such thoughtful and patient feedback!
--Danny



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: prior work on non-backtracking regex engine?
  2024-04-07  4:42     ` Danny McClanahan
@ 2024-04-07 14:15       ` Ihor Radchenko
  2024-04-08 12:19       ` Helmut Eller
  2024-04-08 14:00       ` Po Lu
  2 siblings, 0 replies; 11+ messages in thread
From: Ihor Radchenko @ 2024-04-07 14:15 UTC (permalink / raw)
  To: Danny McClanahan, Mattias Engdegård, Eli Zaretskii,
	stefankangas
  Cc: emacs-devel@gnu.org

Danny McClanahan <dmcc2@hypnicjerk.ai> writes:

>> P.S. Better regexp engine would be very welcome.
>
> ... So in my amateur evaluation, emacs actually seems very well-placed to take advantage of high-performance regex engine techniques without any big structural changes.
> ...
> I would *really like* to eventually have emacs depend on an existing
> regex engine like re2 or rust regex to take advantage of their
> bugfixes and optimizations...
> ...
> Informal sketch of work outline: ...

Please, be aware that I have little to do with Emacs regexp maintenance.
CCing the actual maintainers.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: prior work on non-backtracking regex engine?
  2024-04-07  4:42     ` Danny McClanahan
  2024-04-07 14:15       ` Ihor Radchenko
@ 2024-04-08 12:19       ` Helmut Eller
  2024-04-08 13:13         ` Eli Zaretskii
  2024-04-08 14:00       ` Po Lu
  2 siblings, 1 reply; 11+ messages in thread
From: Helmut Eller @ 2024-04-08 12:19 UTC (permalink / raw)
  To: Danny McClanahan; +Cc: Ihor Radchenko, emacs-devel@gnu.org

On Sun, Apr 07 2024, Danny McClanahan wrote:

> And I was also *super*
> pleased to see that regex-emacs.h itself doesn't expose any dependency
> on the gap buffer or other internal emacs representations (except
> regarding multibyte encoding). So in my amateur evaluation, emacs
> actually seems very well-placed to take advantage of high-performance
> regex engine techniques without any big structural changes.

What's the history of regex-emacs.h?  It seems like in the past it was
regex.h from Gnulib.  That would explain why the regex engine is
relatively well decoupled from the rest.  But it also leads to the
question: why does Emacs no longer use Gnulib's regex engine?

My guess is that it has something to do with the way Emacs's performs
case-insensitive matches.  Another complication may be that Gnulib
doesn't support the non-greedy variants of some operators.

It seems that these days Gnulib uses a DFA based algorithm when possible
and falls back to backtracking for backrefs (and presumably for POSIX
compatible sub-match rules).  So one could argue that Gnulib already has
a lot of what we want and would be the natural place to add a clean API
for the features that Emacs needs.

So does somebody know the details why Gnulib and Emacs went separate
ways in the past?

Helmut



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: prior work on non-backtracking regex engine?
  2024-04-08 12:19       ` Helmut Eller
@ 2024-04-08 13:13         ` Eli Zaretskii
  0 siblings, 0 replies; 11+ messages in thread
From: Eli Zaretskii @ 2024-04-08 13:13 UTC (permalink / raw)
  To: Helmut Eller; +Cc: dmcc2, yantar92, emacs-devel

> From: Helmut Eller <eller.helmut@gmail.com>
> Cc: Ihor Radchenko <yantar92@posteo.net>,  "emacs-devel@gnu.org"
>  <emacs-devel@gnu.org>
> Date: Mon, 08 Apr 2024 14:19:13 +0200
> 
> On Sun, Apr 07 2024, Danny McClanahan wrote:
> 
> > And I was also *super*
> > pleased to see that regex-emacs.h itself doesn't expose any dependency
> > on the gap buffer or other internal emacs representations (except
> > regarding multibyte encoding). So in my amateur evaluation, emacs
> > actually seems very well-placed to take advantage of high-performance
> > regex engine techniques without any big structural changes.
> 
> What's the history of regex-emacs.h?  It seems like in the past it was
> regex.h from Gnulib.  That would explain why the regex engine is
> relatively well decoupled from the rest.  But it also leads to the
> question: why does Emacs no longer use Gnulib's regex engine?

The Gnulib regex is basically the glibc regex.

Why Emacs cannot use the glibc code was discussed here (among other
places):

  https://lists.gnu.org/archive/html/emacs-devel/2002-04/msg00084.html

> My guess is that it has something to do with the way Emacs's performs
> case-insensitive matches.

No, it's because Emacs has some unique requirements for regular
expressions.

> So does somebody know the details why Gnulib and Emacs went separate
> ways in the past?

Basically, because no one was interested in merging them (which would
include some macro-ization, so that Emacs could have the features it
needs, while other programs could have those special features compile
to trivial code).  regex-emacs.c started that way, but when glibc and
Gnulib switched to a new engine, they never bothered to keep the
Emacs-specific features we had in the original code.



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: prior work on non-backtracking regex engine?
  2024-04-07  4:42     ` Danny McClanahan
  2024-04-07 14:15       ` Ihor Radchenko
  2024-04-08 12:19       ` Helmut Eller
@ 2024-04-08 14:00       ` Po Lu
  2024-04-08 14:23         ` Eli Zaretskii
  2 siblings, 1 reply; 11+ messages in thread
From: Po Lu @ 2024-04-08 14:00 UTC (permalink / raw)
  To: Danny McClanahan; +Cc: Ihor Radchenko, emacs-devel@gnu.org

Danny McClanahan <dmcc2@hypnicjerk.ai> writes:

>> P.S. Better regexp engine would be very welcome.
>
> ^_^ !!! This makes me so happy to hear! I took the time to learn how
> regex-emacs.c as well as other emacs C code interacts with the gap
> buffer last week, and was super glad to see how simple the gap buffer
> data structure is. I am not an expert on string search performance
> yet, but I believe the gap buffer (which is always allocated within a
> single block, and has at most two non-adjacent data sections) is
> likely much more amenable to high-performance search techniques than a
> more complex data structure such as a rope. And I was also *super*
> pleased to see that regex-emacs.h itself doesn't expose any dependency
> on the gap buffer or other internal emacs representations (except
> regarding multibyte encoding). So in my amateur evaluation, emacs
> actually seems very well-placed to take advantage of high-performance
> regex engine techniques without any big structural changes.

[...]

> I would *really like* to eventually have emacs depend on an existing
> regex engine like re2 or rust regex to take advantage of their
> bugfixes and optimizations, but both of those engines (and most
> others) require utf-8 input, and (I'm pretty sure) can't easily be
> made to support emacs's multibyte functionality. So I think there is a
> strong case for a new engine here, especially one licensed with the
> GPLv3 (or any later version) as opposed to the LGPL or other more
> permissive license.

I hate to rain on people's parades, but from where I stand, introducing
one more mandatory dependency, not least a dependency for so fundamental
a component of Emacs as the regexp matcher, is not acceptable, even if
this library is maintained under the auspices of the GNU project or
under the GPL, license or stewardship being really immaterial issues.
The options being tendered are still less so: the one is written in C++,
and relies on a library that gives an awful impression of being a
boost-in-the-making, and the other is written in an immature language
that is not portable, especially to older systems we support for users
with antiquarian interests.

Don't let's delude ourselves that a solution is waiting in the wings in
the shape of some mystical library, and embark our hopes on this library
alone, to the detriment of improving the regex engine we already have.
It's a recurring trap that we should have learned to avoid by now.  If
the sabotage of xz, for example, has taught us anything, it's that
software is generally improved by a reduction in dependencies.

Thanks.



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: prior work on non-backtracking regex engine?
  2024-04-08 14:00       ` Po Lu
@ 2024-04-08 14:23         ` Eli Zaretskii
  2024-04-12  0:12           ` Danny McClanahan
  0 siblings, 1 reply; 11+ messages in thread
From: Eli Zaretskii @ 2024-04-08 14:23 UTC (permalink / raw)
  To: Po Lu; +Cc: dmcc2, yantar92, emacs-devel

> From: Po Lu <luangruo@yahoo.com>
> Cc: Ihor Radchenko <yantar92@posteo.net>,  "emacs-devel@gnu.org"
>  <emacs-devel@gnu.org>
> Date: Mon, 08 Apr 2024 22:00:27 +0800
> 
> I hate to rain on people's parades, but from where I stand, introducing
> one more mandatory dependency, not least a dependency for so fundamental
> a component of Emacs as the regexp matcher, is not acceptable, even if
> this library is maintained under the auspices of the GNU project or
> under the GPL, license or stewardship being really immaterial issues.

If it's an optional dependency, I don't see why not.

Of course, an optional dependency will have to either be 110%
compatible with what we have now, or expose a separate and distinct
API to Lisp programs.  Which makes this project a larger undertaking.
But otherwise, I don't see why not.



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: prior work on non-backtracking regex engine?
  2024-04-08 14:23         ` Eli Zaretskii
@ 2024-04-12  0:12           ` Danny McClanahan
  0 siblings, 0 replies; 11+ messages in thread
From: Danny McClanahan @ 2024-04-12  0:12 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Po Lu, emacs-devel

Thanks so much for this thorough feedback!

In reply to Po Lu:

> I hate to rain on people's parades,

I don't think any parades are being rained on! This is exactly the kind of feedback I was hoping to get from this list!

> a dependency for so fundamental
> a component of Emacs as the regexp matcher, is not acceptable,

I completely agree that the performance of the regexp matcher is fundamentally linked to the rest of Emacs in ways that make it very difficult to depend on external libraries the way other software can. When I proposed extending the rust regex crate to support the features needed for the Emacs regex engine, I indeed found that external libraries are generally uninterested in or unable to support many of the features we need (especially matching over non-contiguous data like the gap buffer). Later, after investigating even further, I now also believe that Emacs's (super cool!!) multibyte encoding would be mostly or completely incompatible with existing regex engines (although it may be possible to delegate this to a byte-at-a-time automaton--not sure yet).

> Don't let's delude ourselves that a solution is waiting in the wings in
> the shape of some mystical library, and embark our hopes on this library
> alone, to the detriment of improving the regex engine we already have.
> It's a recurring trap that we should have learned to avoid by now.

Especially after becoming more familiar with the existing regex-emacs.c engine (and confirming that we likely won't need to employ too many complex/non-portable techniques to achieve performance goals), I now agree that focusing more on a simple engine that Emacs can extend is much more likely to produce something of value, rather than e.g. a complex engine that Emacs will have to work to conform to. It remains *intellectually* very exciting to me to think about how to extend other regex engines to support Emacs's technical constraints, especially regarding non-contiguous data (and I am actually hoping to study this sort of thing in graduate research, so please anyone feel free to reach out if you know any programs sponsoring research on efficient string search under various constraints). But since my initial posting I have actually become more convinced (in agreement with you) that if we truly intend to supplant regex-emacs.c, the replacement should focus on supporting Emacs first before trying to be extremely general (because "general" regex engines tend to have very different constraints!).

> If
> the sabotage of xz, for example, has taught us anything, it's that
> software is generally improved by a reduction in dependencies.

I am extremely sympathetic to this argument, especially from my experience advocating for open source within corporations. In an off-list reply I described having recently implemented FFI wrapper libraries for the re2 (https://docs.rs/re2) and hyperscan (https://docs.rs/vectorscan-async) libraries, mostly as a learning exercise, but also to see how they handled various use cases such as non-contiguous input. While hyperscan actually has one very important feature we want (searching over non-contiguous data such as the gap buffer: see my documentation for the relevant method at https://docs.rs/vectorscan-async/latest/vectorscan/state/struct.Scratch.html#method.scan_sync_vectored), I believe it is unfortunately completely unsuitable for our purposes due to its unclear maintainership: intel relicensed it as proprietary less than a year ago, and the vectorscan fork which retains the original BSD-3 license is not clearly trustworthy and could easily be a target for attack, just like the xz sabotage (and Emacs is absolutely a high value target).

While I have already made nontrivial changes to the hyperscan source code for the purposes of my wrapper library and I think the code is worth reviewing as a technical reference, I do not think it will be safe for Emacs to depend on for the precise reasons you mention. I mention it here to describe how that exact concern has been guiding my investigations.

> an immature language
> that is not portable, especially to older systems we support for users
> with antiquarian interests.

Right now, I'm justifying my use of rust purely as a prototyping platform. I hadn't actually considered whether the external engine could eventually become a mandatory dependency, so thanks for raising that concern early on!

I agree that rustc cannot be relied upon to support the same wealth of platforms as gcc, but I am under the impression that gccrs will eventually close this gap. If I were to ever propose this as a mandatory dependency, I understand that would require gccrs along with any other necessary build dependency (e.g. cargo to download external code) to be generally available for all Emacs platforms. For example, one practical effect that might have is that we might need to disable any use of the rust core::simd portable SIMD instruction selection functionality at first, since it is an experimental feature and will need to be separately reimplemented for gccrs. However, SIMD instructions are more of a reach goal for this work anyway since the primary goal is just to avoid backtracking except for backrefs.

I was thinking that a rust-based regex engine could become portable enough to consider making mandatory if either:
(a) the engine's C ABI library becomes widely available via distros like librsvg, built via rustc_codegen_gcc (which uses llvm and libgccjit) or the gccrs frontend.
(b) the engine's source code (+ deps) can be checked into and built as part of Emacs (likely requiring gccrs specifically to avoid introducing an llvm dependency).

I hadn't considered (b) at all until now, but even if we were interested in performance alone, it seems plausible that building from source would enable greater opportunities for compiler optimizations across function calls. It also ensures we maintain Emacs's portability guarantees by taking responsibility for building the code ourselves (what other mandatory third-party dependencies does Emacs even have?). I think if I were to propose inclusion of any rust code into Emacs as a mandatory dependency, I would first implement support for rust source files in the Emacs build system, probably using whatever linking techniques the Linux kernel ends up deciding on, so that we could implement (b).

In reply to Eli Zaretskii:

> Of course, an optional dependency will have to either be 110%
> compatible with what we have now, or expose a separate and distinct
> API to Lisp programs. Which makes this project a larger undertaking.
> But otherwise, I don't see why not.

I *was* hoping the result would achieve both of these: improving performance of existing regexp code using the high-level single-pattern API "for free", but also building a foundation for a future distinct API for Lisp programs that would enable more powerful search capabilities (such as multi-pattern matching) as well as more performance for specific easy-to-optimize use cases (such as finding a set of literal strings vs requiring users to manually OR together their patterns with \|).

If the existing regex-emacs.h interface were more complex/fragile than it is (e.g. if it had exposed a dependency on the gap buffer), then I was definitely going to propose just introducing this work as a new incompatible Lisp API. But as of now, it seems like relatively little work to extend any new matching engine to the regex-emacs.h API, so it seems worth doing. But definitely I suspect there could be a lot of value performance-wise to extract from a more complex Lisp-level search API and that's what I see as the long-term potential value of this work.

However, I would definitely feel uncomfortable developing such a framework for Lisp code that required the presence of an optional dependency at Emacs build time, and would strongly prefer a result like the use of libjansson for json parsing, which is an implementation detail noted via `(json--available-p)` but no more. I can see now that this would be more difficult to achieve with rust alone until gccrs arrives, and that goal (complex Lisp search API to achieve greater search/scan performance under specific constraints) might actually be best served by writing C/Lisp code that delegates to regex-emacs.c in the interim!

As you describe, aiming for "110% compatibility" first should allow us to more easily justify an optional dependency, which allows me to continue leaning on rust as a prototyping language. If I can demonstrate robust performance improvements with complete backwards compatibility, I think it might be worthwhile to enable as an optional dependency for greater review (as was done with libgccjit for the native compiler). At that point I think I will have more experience with packaging rust code for interop as well as the Emacs codebase itself. But it's beginning to seem more plausible that the Lisp-level API might be mostly orthogonal to the regexp matching engine itself, so any native code to support that might well be done in C instead! So I'm probably putting that out of scope for now, and will figure out how to reliably introduce an incompatible API later (or in parallel).

----
My goal with doing this work in Emacs was to get the perf improvement out to as many people as possible, and my hope with implementing the regex matching API was that it would speed up a lot of existing code. I'm not deeply wedded to using rust here, but I'm partially hoping to use this work as an excuse to work out some of the remaining technical difficulties with reliably packaging and consuming rust code. I am hoping that gccrs makes it feasible for Emacs to incorporate rust source code as a mandatory dependency in the long term, but in the interim (and even after that) I'm definitely planning to be very realistic about how the use of rust limits our eventual portability.


Thanks,
Danny



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: prior work on non-backtracking regex engine?
  2024-03-10 15:41 prior work on non-backtracking regex engine? Danny McClanahan
  2024-03-12 23:45 ` Danny McClanahan
@ 2024-04-17 14:23 ` Clément Pit-Claudel
  1 sibling, 0 replies; 11+ messages in thread
From: Clément Pit-Claudel @ 2024-04-17 14:23 UTC (permalink / raw)
  To: emacs-devel

On 3/10/24 16:41, Danny McClanahan wrote:
> (3) Have there been prior investigations of non-backtracking regex engines in emacs, or trying to use an external regex engine in general? What was the outcome, and does it seem like a useful research direction?

Here is a relevant thread:

  https://lists.gnu.org/archive/html/emacs-devel/2016-12/msg00622.html

The gap is indeed a significant issue.

If you wanted linear-time support, my guess is your best bets today would be RE2, rust-regex, or the linear engine in Chromium/V8.  All of them would require sizable changes, and none would replace the existing engine because they don't support backreferences.  The one in Chromium is fairly small and simple.

Clément.



^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2024-04-17 14:23 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-03-10 15:41 prior work on non-backtracking regex engine? Danny McClanahan
2024-03-12 23:45 ` Danny McClanahan
2024-03-13 13:23   ` Ihor Radchenko
2024-04-07  4:42     ` Danny McClanahan
2024-04-07 14:15       ` Ihor Radchenko
2024-04-08 12:19       ` Helmut Eller
2024-04-08 13:13         ` Eli Zaretskii
2024-04-08 14:00       ` Po Lu
2024-04-08 14:23         ` Eli Zaretskii
2024-04-12  0:12           ` Danny McClanahan
2024-04-17 14:23 ` Clément Pit-Claudel

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).