all messages for Emacs-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed
* rosie/libpexl library for regex pattern composition
@ 2024-07-27 13:04 Danny McClanahan
  2024-07-28  7:08 ` Helmut Eller
  0 siblings, 1 reply; 5+ messages in thread
From: Danny McClanahan @ 2024-07-27 13:04 UTC (permalink / raw)
  To: emacs-devel@gnu.org

Hello emacs-devel,

I have recently become familiar with the Rosie Pattern Language (https://rosie-lang.org/) by Prof. Jamie Jennings at NCSU. I pinged this list a few months ago about improving the performance and worst-case behavior of regex-emacs.c, and while I'm still working on a prototype for that, others on this list also responded that a method to compose patterns in lisp code might be quite useful as well.

While I understand that tree-sitter tends to be the more accepted way to parse program source, there remain many use cases for which regex or something like it remains applicable, especially parsing the output of external processes (like the built-in M-x grep, or my extension https://github.com/cosmicexplorer/helm-rg). I became especially interested reading https://rosie-lang.org/about/ regarding its focus on enabling maintainable/testable libraries of patterns, which seemed to correspond to my vision of what pattern composition might look like for Emacs extension developers.

While I believe Rosie has a build-time (and possibly run-time) dependency on Lua, PEXL (https://gitlab.com/pexlang/libpexl) is the author's new implementation and is written in very portable C99. It also has several new features and implementation techniques over Rosie. I'm still getting familiar with the project, so I can't speak to any standout features yet, but on its face it seems like a potential substrate we could build lisp-level composable pattern abstractions on top of.

Rosie/PEXL's goals are explicitly focused more on maintainability than sheer performance, so I'm thinking it might make sense to introduce Rosie as a separate interface to the regex engine, while we can keep the regex engine narrowly focused on patterns that we can more easily optimize. For example, I was glad to hear in my previous communications with emacs-devel that there was some receptiveness to deprecating features like runtime lookup of mode-specific word boundaries from the regex engine if it would ease optimization (I'm not sure if that's necessary yet), but one way we could avoid removing more complex functionality like backrefs that extension devs depend on is to direct them to a lisp interface wrapping Rosie, which supports backrefs (it actually supports a strictly more powerful formalization of backrefs than regex engines do; see the author's post on it at https://jamiejennings.com/posts/2023-10-01-dont-look-back-3/).

Like I said, I'm still becoming familiar with Rosie/PEXL, so I don't quite have enough info yet to make a more thorough proposal. But I'd love to know if others are familiar with this project and whether it might correspond to the use cases for lisp-level pattern composition brought up in response to my previous communications about improving the regex engine.

Thanks,
Danny



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: rosie/libpexl library for regex pattern composition
  2024-07-27 13:04 rosie/libpexl library for regex pattern composition Danny McClanahan
@ 2024-07-28  7:08 ` Helmut Eller
  2024-07-28  7:51   ` Eli Zaretskii
  0 siblings, 1 reply; 5+ messages in thread
From: Helmut Eller @ 2024-07-28  7:08 UTC (permalink / raw)
  To: Danny McClanahan; +Cc: emacs-devel@gnu.org

On Sat, Jul 27 2024, Danny McClanahan wrote:

> Rosie/PEXL's goals are explicitly focused more on maintainability than
> sheer performance, so I'm thinking it might make sense to introduce
> Rosie as a separate interface to the regex engine, while we can keep
> the regex engine narrowly focused on patterns that we can more easily
> optimize.

AFAIU, Rosie is based on parsing expression grammars, i.e. it's not a
regular language and it's in a different complexity class.

I think it would be best for Emacs to provide an efficient C API for
dynamic modules to buffer text with the explicit goal to support
alternative regexp engines and parsing libraries.

It seems that tree-sitter only needs a single function:
treesit_read_buffer.  If it works for tree-sitter, then maybe it's also
good enough for Rosie.

Helmut



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: rosie/libpexl library for regex pattern composition
  2024-07-28  7:08 ` Helmut Eller
@ 2024-07-28  7:51   ` Eli Zaretskii
  2024-07-29 13:58     ` Danny McClanahan
  0 siblings, 1 reply; 5+ messages in thread
From: Eli Zaretskii @ 2024-07-28  7:51 UTC (permalink / raw)
  To: Helmut Eller; +Cc: dmcc2, emacs-devel

> From: Helmut Eller <eller.helmut@gmail.com>
> Cc: "emacs-devel@gnu.org" <emacs-devel@gnu.org>
> Date: Sun, 28 Jul 2024 09:08:06 +0200
> 
> It seems that tree-sitter only needs a single function:
> treesit_read_buffer.  If it works for tree-sitter, then maybe it's also
> good enough for Rosie.

Tree-sitter accesses buffer text one character at a time, for which an
API is very simple, and we already have such an API modules could use:
char-after.  But I very much doubt that libraries like Rosie could do
well with such APIs for its needs.  The alternative is to give modules
direct access to the entire buffer text, which IMO is unthinkable.



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: rosie/libpexl library for regex pattern composition
  2024-07-28  7:51   ` Eli Zaretskii
@ 2024-07-29 13:58     ` Danny McClanahan
  2024-07-29 19:33       ` Helmut Eller
  0 siblings, 1 reply; 5+ messages in thread
From: Danny McClanahan @ 2024-07-29 13:58 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Helmut Eller, emacs-devel

> From: Helmut Eller <eller.helmut@gmail.com>
> Cc: "emacs-devel@gnu.org" <emacs-devel@gnu.org>
> Date: Sun, 28 Jul 2024 09:08:06 +0200
>
> AFAIU, Rosie is based on parsing expression grammars, i.e. it's not a
regular language and it's in a different complexity class.

That's true! However, the PEXL project (essentially Rosie v2) is actually looking to move past the PEG formalism for being insufficiently expressive as well as presenting some implementation issues, particularly with respect to performance (see brief description at https://gitlab.com/pexlang/libpexl#libpexl-is-a-work-in-progress-library-for-text-matching-and-simple-parsing).

> On Sunday, July 28th, 2024 at 03:51, Eli Zaretskii <eliz@gnu.org> wrote:
>
> Tree-sitter accesses buffer text one character at a time, for which an
> API is very simple, and we already have such an API modules could use:
> char-after. But I very much doubt that libraries like Rosie could do
> well with such APIs for its needs.

One of the improvements PEXL has over Rosie v1 is the `find` VM instruction (described in the above-linked README), which uses SIMD to search many bytes ahead at once instead of iterating char-by-char. This functionality something I am also hoping to demonstrate with my regex-emacs v2 engine, although I am working on a char-by-char NFA first. IIUC, regex engines tend to employ SIMD searching as a "prefilter" (see discussion in https://github.com/BurntSushi/rebar/blob/master/README.md) before employing the char-by-char matchers, which may be easier to separate out into some explicit multi-phase search API if needed.

This is mostly to say that both regex engines using automata as well as Rosie could likely be made to work with a char-by-char API alone, although it would obviously hamper some optimization opportunities. It may be an easier way to prototype new matching engines than building them within Emacs.

> The alternative is to give modules
> direct access to the entire buffer text, which IMO is unthinkable.

I had not considered the possibility of exposing a module API for extensible parsing engines as opposed to building it directly into Emacs like the current regex-emacs engine, but I definitely appreciate the idea of decoupling from a specific framework, especially with an eye to reducing the maintenance burden. From my current understanding, the benefits of building Rosie/PEXL directly into Emacs would be:
(1) could take advantage of e.g. SIMD search by directly accessing the gap buffer memory,
(2) since it would be reliably available to all Emacs users, could form a composeable elisp search API.

My impression from prior discussion on this list was that one long-term desire for Emacs lisp was a lisp-level API to explicitly represent and compose search patterns, vs e.g. the implicit caching performed for matchers from regex-emacs.c (currently represented to elisp developers as lisp strings). If that is a goal, it seems difficult to realize that without building the matching logic and the pattern compiler into Emacs itself (please correct me if I'm wrong about this).

However, if *that* is the goal, then it also seems appropriate to prototype such a lisp-level API first, before introducing external code like Rosie/PEXL; at a first glance, it should be possible to do this with the current regex-emacs.c. Once such a layer is established, we could even imagine exposing a module API that allows for pluggable parsing frameworks. If Rosie/PEXL offers significant benefits and is easy enough to integrate, we could then compare the pros and cons of building it in directly (and then offering a more expressive lisp-level matching API with its power) vs exposing a module API (which may be easier to maintain).

We would then be comparing the maintenance burden of pluggability over the maintenance burden of integrating Rosie/PEXL, but I believe the lisp-level matcher API (with explicit construction over implicit caching) can be prototyped right now without that bikeshedding, and seems to be something Emacs desires in any case.

Does that correctly outline the goals/desires for lisp-level matcher construction? I like the idea of figuring out that API first, since all it needs to start is "just" an explicit lisp object for compiled regexps, which alone was mentioned as a goal in prior discussions. I think in any case having established that API surface will definitely make it easier to integrate other regex/parsing frameworks, be they built into Emacs or via external module. As it stands, I would like to have made this relatively straightforward improvement to Emacs before proposing the integration of other more complex external dependencies.

Thanks,
Danny



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: rosie/libpexl library for regex pattern composition
  2024-07-29 13:58     ` Danny McClanahan
@ 2024-07-29 19:33       ` Helmut Eller
  0 siblings, 0 replies; 5+ messages in thread
From: Helmut Eller @ 2024-07-29 19:33 UTC (permalink / raw)
  To: Danny McClanahan; +Cc: Eli Zaretskii, emacs-devel

On Mon, Jul 29 2024, Danny McClanahan wrote:

[...]
> I had not considered the possibility of exposing a module API for
> extensible parsing engines as opposed to building it directly into
> Emacs like the current regex-emacs engine, but I definitely appreciate
> the idea of decoupling from a specific framework, especially with an
> eye to reducing the maintenance burden. From my current understanding,
> the benefits of building Rosie/PEXL directly into Emacs would be:
> (1) could take advantage of e.g. SIMD search by directly accessing the
> gap buffer memory,

It depends on how the module API exposes the buffer text.  E.g. I would
imagine that a Rust function that receives the text as two &[u8],
i.e. two byte array slices borrowed read-only for the duration of a
call, could have similar efficiency as something built directly into
Emacs.

> (2) since it would be reliably available to all Emacs users, could
> form a composeable elisp search API.

I think Rosie/PEXL would stay an optional feature for a long time.

[...]
> Does that correctly outline the goals/desires for lisp-level matcher
> construction?

I don't know.  This sounds like an new API would be needed and the
difficulty, beside designing the API, would be to make it popular;
usually the most difficult part.  The old API would still be used for a
long time.

I think the Emacs community is mostly a do-ocracy, i.e., people work on
things they are interested in or know how to do; in contrast to work on
goals/ideas that somebody else defined.  That's quite ok as long as the
boring tasks get done eventually.  And I would suggest the same here: do
what you think is useful and that you know how to do.



^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2024-07-29 19:33 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-27 13:04 rosie/libpexl library for regex pattern composition Danny McClanahan
2024-07-28  7:08 ` Helmut Eller
2024-07-28  7:51   ` Eli Zaretskii
2024-07-29 13:58     ` Danny McClanahan
2024-07-29 19:33       ` Helmut Eller

Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/emacs.git
	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.