Re: prior work on non-backtracking regex engine?

unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed

From: Danny McClanahan <dmcc2@hypnicjerk.ai>
To: Eli Zaretskii <eliz@gnu.org>
Cc: Po Lu <luangruo@yahoo.com>, emacs-devel@gnu.org
Subject: Re: prior work on non-backtracking regex engine?
Date: Fri, 12 Apr 2024 00:12:30 +0000	[thread overview]
Message-ID: <JmA2ZvuusvNthi4NtnGTKlV1BU6hWoDD-yuyvmBmPwkuoeAby7oXRtS4ZmHmRmd_hCZdIZOXj7wcgaB6ljpdDjJrgH4GcF9m-JMD1Ek5Pxw=@hypnicjerk.ai> (raw)
In-Reply-To: <86sezwx7ao.fsf@gnu.org>

Thanks so much for this thorough feedback!

In reply to Po Lu:

> I hate to rain on people's parades,

I don't think any parades are being rained on! This is exactly the kind of feedback I was hoping to get from this list!

> a dependency for so fundamental
> a component of Emacs as the regexp matcher, is not acceptable,

I completely agree that the performance of the regexp matcher is fundamentally linked to the rest of Emacs in ways that make it very difficult to depend on external libraries the way other software can. When I proposed extending the rust regex crate to support the features needed for the Emacs regex engine, I indeed found that external libraries are generally uninterested in or unable to support many of the features we need (especially matching over non-contiguous data like the gap buffer). Later, after investigating even further, I now also believe that Emacs's (super cool!!) multibyte encoding would be mostly or completely incompatible with existing regex engines (although it may be possible to delegate this to a byte-at-a-time automaton--not sure yet).

> Don't let's delude ourselves that a solution is waiting in the wings in
> the shape of some mystical library, and embark our hopes on this library
> alone, to the detriment of improving the regex engine we already have.
> It's a recurring trap that we should have learned to avoid by now.

Especially after becoming more familiar with the existing regex-emacs.c engine (and confirming that we likely won't need to employ too many complex/non-portable techniques to achieve performance goals), I now agree that focusing more on a simple engine that Emacs can extend is much more likely to produce something of value, rather than e.g. a complex engine that Emacs will have to work to conform to. It remains *intellectually* very exciting to me to think about how to extend other regex engines to support Emacs's technical constraints, especially regarding non-contiguous data (and I am actually hoping to study this sort of thing in graduate research, so please anyone feel free to reach out if you know any programs sponsoring research on efficient string search under various constraints). But since my initial posting I have actually become more convinced (in agreement with you) that if we truly intend to supplant regex-emacs.c, the replacement should focus on supporting Emacs first before trying to be extremely general (because "general" regex engines tend to have very different constraints!).

> If
> the sabotage of xz, for example, has taught us anything, it's that
> software is generally improved by a reduction in dependencies.

I am extremely sympathetic to this argument, especially from my experience advocating for open source within corporations. In an off-list reply I described having recently implemented FFI wrapper libraries for the re2 (https://docs.rs/re2) and hyperscan (https://docs.rs/vectorscan-async) libraries, mostly as a learning exercise, but also to see how they handled various use cases such as non-contiguous input. While hyperscan actually has one very important feature we want (searching over non-contiguous data such as the gap buffer: see my documentation for the relevant method at https://docs.rs/vectorscan-async/latest/vectorscan/state/struct.Scratch.html#method.scan_sync_vectored), I believe it is unfortunately completely unsuitable for our purposes due to its unclear maintainership: intel relicensed it as proprietary less than a year ago, and the vectorscan fork which retains the original BSD-3 license is not clearly trustworthy and could easily be a target for attack, just like the xz sabotage (and Emacs is absolutely a high value target).

While I have already made nontrivial changes to the hyperscan source code for the purposes of my wrapper library and I think the code is worth reviewing as a technical reference, I do not think it will be safe for Emacs to depend on for the precise reasons you mention. I mention it here to describe how that exact concern has been guiding my investigations.

> an immature language
> that is not portable, especially to older systems we support for users
> with antiquarian interests.

Right now, I'm justifying my use of rust purely as a prototyping platform. I hadn't actually considered whether the external engine could eventually become a mandatory dependency, so thanks for raising that concern early on!

I agree that rustc cannot be relied upon to support the same wealth of platforms as gcc, but I am under the impression that gccrs will eventually close this gap. If I were to ever propose this as a mandatory dependency, I understand that would require gccrs along with any other necessary build dependency (e.g. cargo to download external code) to be generally available for all Emacs platforms. For example, one practical effect that might have is that we might need to disable any use of the rust core::simd portable SIMD instruction selection functionality at first, since it is an experimental feature and will need to be separately reimplemented for gccrs. However, SIMD instructions are more of a reach goal for this work anyway since the primary goal is just to avoid backtracking except for backrefs.

I was thinking that a rust-based regex engine could become portable enough to consider making mandatory if either:
(a) the engine's C ABI library becomes widely available via distros like librsvg, built via rustc_codegen_gcc (which uses llvm and libgccjit) or the gccrs frontend.
(b) the engine's source code (+ deps) can be checked into and built as part of Emacs (likely requiring gccrs specifically to avoid introducing an llvm dependency).

I hadn't considered (b) at all until now, but even if we were interested in performance alone, it seems plausible that building from source would enable greater opportunities for compiler optimizations across function calls. It also ensures we maintain Emacs's portability guarantees by taking responsibility for building the code ourselves (what other mandatory third-party dependencies does Emacs even have?). I think if I were to propose inclusion of any rust code into Emacs as a mandatory dependency, I would first implement support for rust source files in the Emacs build system, probably using whatever linking techniques the Linux kernel ends up deciding on, so that we could implement (b).

In reply to Eli Zaretskii:

> Of course, an optional dependency will have to either be 110%
> compatible with what we have now, or expose a separate and distinct
> API to Lisp programs. Which makes this project a larger undertaking.
> But otherwise, I don't see why not.

I *was* hoping the result would achieve both of these: improving performance of existing regexp code using the high-level single-pattern API "for free", but also building a foundation for a future distinct API for Lisp programs that would enable more powerful search capabilities (such as multi-pattern matching) as well as more performance for specific easy-to-optimize use cases (such as finding a set of literal strings vs requiring users to manually OR together their patterns with \|).

If the existing regex-emacs.h interface were more complex/fragile than it is (e.g. if it had exposed a dependency on the gap buffer), then I was definitely going to propose just introducing this work as a new incompatible Lisp API. But as of now, it seems like relatively little work to extend any new matching engine to the regex-emacs.h API, so it seems worth doing. But definitely I suspect there could be a lot of value performance-wise to extract from a more complex Lisp-level search API and that's what I see as the long-term potential value of this work.

However, I would definitely feel uncomfortable developing such a framework for Lisp code that required the presence of an optional dependency at Emacs build time, and would strongly prefer a result like the use of libjansson for json parsing, which is an implementation detail noted via `(json--available-p)` but no more. I can see now that this would be more difficult to achieve with rust alone until gccrs arrives, and that goal (complex Lisp search API to achieve greater search/scan performance under specific constraints) might actually be best served by writing C/Lisp code that delegates to regex-emacs.c in the interim!

As you describe, aiming for "110% compatibility" first should allow us to more easily justify an optional dependency, which allows me to continue leaning on rust as a prototyping language. If I can demonstrate robust performance improvements with complete backwards compatibility, I think it might be worthwhile to enable as an optional dependency for greater review (as was done with libgccjit for the native compiler). At that point I think I will have more experience with packaging rust code for interop as well as the Emacs codebase itself. But it's beginning to seem more plausible that the Lisp-level API might be mostly orthogonal to the regexp matching engine itself, so any native code to support that might well be done in C instead! So I'm probably putting that out of scope for now, and will figure out how to reliably introduce an incompatible API later (or in parallel).

----
My goal with doing this work in Emacs was to get the perf improvement out to as many people as possible, and my hope with implementing the regex matching API was that it would speed up a lot of existing code. I'm not deeply wedded to using rust here, but I'm partially hoping to use this work as an excuse to work out some of the remaining technical difficulties with reliably packaging and consuming rust code. I am hoping that gccrs makes it feasible for Emacs to incorporate rust source code as a mandatory dependency in the long term, but in the interim (and even after that) I'm definitely planning to be very realistic about how the use of rust limits our eventual portability.

Thanks,
Danny

next prev parent reply	other threads:[~2024-04-12  0:12 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-10 15:41 prior work on non-backtracking regex engine? Danny McClanahan
2024-03-12 23:45 ` Danny McClanahan
2024-03-13 13:23   ` Ihor Radchenko
2024-04-07  4:42     ` Danny McClanahan
2024-04-07 14:15       ` Ihor Radchenko
2024-04-08 12:19       ` Helmut Eller
2024-04-08 13:13         ` Eli Zaretskii
2024-04-08 14:00       ` Po Lu
2024-04-08 14:23         ` Eli Zaretskii
2024-04-12  0:12           ` Danny McClanahan [this message]
2024-04-17 14:23 ` Clément Pit-Claudel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='JmA2ZvuusvNthi4NtnGTKlV1BU6hWoDD-yuyvmBmPwkuoeAby7oXRtS4ZmHmRmd_hCZdIZOXj7wcgaB6ljpdDjJrgH4GcF9m-JMD1Ek5Pxw=@hypnicjerk.ai' \
    --to=dmcc2@hypnicjerk.ai \
    --cc=eliz@gnu.org \
    --cc=emacs-devel@gnu.org \
    --cc=luangruo@yahoo.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).