From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Danny McClanahan <dmcc2@hypnicjerk.ai>
Newsgroups: gmane.emacs.devel
Subject: Re: [PATCH] add compiled regexp primitive lisp object
Date: Wed, 07 Aug 2024 07:59:15 +0000
Message-ID: <7TES32-QbXZEtj4F5ClAcMZAjkGgR4H08Opv0luyXSYDF4QeOQwHn7oa7tNGRKIoDCpkw3UdQTJnkXRpxEtX0i0_jw3UcyVMaKbTzOYHkjA=@hypnicjerk.ai>
References: <DX05HWbmw37nQEdshSH46KqWNPDwwK5xaKBcjZr7XBx293dQtroHGmYssDsOVTcMgLLqHbbrvkBpJcWuBgXsDHkvhAHLsTPyQ2j5wZpeL8M=@hypnicjerk.ai>
 <87mslxxddk.fsf@protonmail.com>
 <5He97LtsyeyQoTLU7d91oP2CLO8s_2afdgcNxozsFjzu8qGbB_7nXmsZL5O6Ej7K-tuEmngCcPKJpDAjxeKz4jk1DvqSUbdOLpw5U1vo1SY=@hypnicjerk.ai>
 <87le1avopk.fsf@protonmail.com>
 <2LOLmIp1X8w4CGbqq3qDrzmKVA0KzYNL1N9lBtWdB-MtEv9oCuYgJMYprG170wMPjYxeQImAmWOPatGTTl4KxZMlptNo9A9hnHt84vdN9EA=@hypnicjerk.ai>
 <87ttfxtszi.fsf@protonmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214";
	logging-data="29508"; mail-complaints-to="usenet@ciao.gmane.io"
Cc: "emacs-devel@gnu.org" <emacs-devel@gnu.org>,
 =?utf-8?Q?Mattias_Engdeg=C3=A5rd?= <mattiase@acm.org>
To: Pip Cet <pipcet@protonmail.com>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Wed Aug 07 12:52:09 2024
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane-mx.org
Original-Received: from lists.gnu.org ([209.51.188.17])
	by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
	(Exim 4.92)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>)
	id 1sbeH3-0007Ob-9J
	for ged-emacs-devel@m.gmane-mx.org; Wed, 07 Aug 2024 12:52:09 +0200
Original-Received: from localhost ([::1] helo=lists1p.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.90_1)
	(envelope-from <emacs-devel-bounces@gnu.org>)
	id 1sbeGN-0005Di-Mi; Wed, 07 Aug 2024 06:51:27 -0400
Original-Received: from eggs.gnu.org ([2001:470:142:3::10])
 by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <dmcc2@hypnicjerk.ai>)
 id 1sbbZx-0000Vw-LJ
 for emacs-devel@gnu.org; Wed, 07 Aug 2024 03:59:29 -0400
Original-Received: from mail-40136.proton.ch ([185.70.40.136])
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <dmcc2@hypnicjerk.ai>)
 id 1sbbZr-00043K-J3
 for emacs-devel@gnu.org; Wed, 07 Aug 2024 03:59:29 -0400
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=hypnicjerk.ai;
 s=protonmail2; t=1723017559; x=1723276759;
 bh=X2MXwm3CxaXFy0MS2zSGs8+Tx7Ca7SYWr63L91cvWfE=;
 h=Date:To:From:Cc:Subject:Message-ID:In-Reply-To:References:
 Feedback-ID:From:To:Cc:Date:Subject:Reply-To:Feedback-ID:
 Message-ID:BIMI-Selector;
 b=mjg6jwMylzX3l3lUwRP6fsZpru//lUXRmqILQKMv7nCaZlwTqEfdrBphuI0rtZ9Wt
 hE4bazNcpwNYnvulcnyXh4F1OPfhqNLnAZ3m1vFUXCZr3IlDpjLC6PK+ANQ3QkuLGG
 zF+bi4ZhJx8IB62I01uqf0Fwl4f8sGy7Q6rJf6m/rw7n0/Mxk7ZZS9FDyOp7YssksM
 5PNdH3qEC86pYamAxKg94LcDihiEnfKbY3OyETUVJJt7ZmFYTHUaZ+lxGnNgnAzx6x
 SB4qKDTCvpl6jdy+oVp0huEfNT4CJ1TYBG2TAdTsbzyEQS6Ql6Beugmtb++sQDchBN
 CjJIz6mL+nCUA==
In-Reply-To: <87ttfxtszi.fsf@protonmail.com>
Feedback-ID: 27837847:user:proton
X-Pm-Message-ID: 6ec8bd8263e3b5015df4bdb21b374f5b8011fbdd
Received-SPF: pass client-ip=185.70.40.136; envelope-from=dmcc2@hypnicjerk.ai;
 helo=mail-40136.proton.ch
X-Spam_score_int: -20
X-Spam_score: -2.1
X-Spam_bar: --
X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1,
 DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1,
 RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H4=0.001, RCVD_IN_MSPIKE_WL=0.001,
 RCVD_IN_VALIDITY_CERTIFIED_BLOCKED=0.001, RCVD_IN_VALIDITY_RPBL_BLOCKED=0.001,
 SPF_HELO_PASS=-0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no
X-Spam_action: no action
X-Mailman-Approved-At: Wed, 07 Aug 2024 06:51:24 -0400
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
 <mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
 <mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org
Xref: news.gmane.io gmane.emacs.devel:322489
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/322489>

> From: Pip Cet pipcet@protonmail.com

Thanks so much again for the thorough feedback Pip! I am having difficulty
making this email shorter, so I have split it into labeled sections. The mo=
st
useful part is at the bottom, where I try to focus on specific goals this
patchset could turn into. It's a lot of brainstorming that I'm sending to
emacs-devel to record for later, but you do not need to read it.

Summary:
- I have attached the formatted patch which now avoids special purecopy log=
ic
  and fixes pdumping, thanks to Pip's fixes.
- I will try using this patch's `make-regexp' for jit-lock-mode and see if =
that
  produces any performance improvement.
- I will look at implementing Pip's proposal for a Lisp reimplementation of
  regex-emacs.c as a separate patch.

Please let me know if emails of this length are not suitable for emacs-deve=
l and
I would be happy to oblige. ^_^ <3


=0C
-------------------------------------------
This part is about maintainer interactions:
-------------------------------------------
> Thanks again. I'm hoping the maintainers will have something positive to
> say about this patch.
>
> Pip

Pip, I continue to heavily appreciate your input (and please do not stop do=
ing
so!), but I would please ask you not to use words like "positive" to mean "=
accepts
the changes", as I consider this to be emotionally manipulative and if I wa=
s
a maintainer it would bias me against acceptance. The scrutiny I am receivi=
ng
for introducing new primitive objects which would modify bytecode and take =
up
space in the `pvec_type' enum seems perfectly appropriate and by no
means would I consider it "negative" or even bureaucratic. These discussion=
s
spurred by Eli and Andrea have already helped me to realize:

(1) Improved performance for jit-lock-mode (if possible) is likely achievab=
le
    without these objects, by introducing new caching methods.
(2) Much of the SIMD techniques that treat multibyte text as a string of by=
tes
    cannot be easily applied in the face of translation tables (and therefo=
re
    improved performance would require a separate breaking API change than
    simply compiled regexp objects).
(3) The coding-system has nothing to do with regex matching.

I would not consider it a failure if this patch was rejected as I now reali=
ze
much of the benefits do not require that, and I consider this a necessary p=
art
of review as well as a learning experience for me about Emacs internals (mu=
ch as
you have educated me about purecopying). I do not consider it antagonistic.


=0C
-------------------------------------------------
This part is direct responses to review comments:
-------------------------------------------------
> > > I'm a sworn enemy of pure space (see scratch/no-pure-space), but I'm
> > > pretty sure you can make purecopy simply return the object in questio=
n
> > > (rather than going through the generic vectorlike code) and things
> > > should work. But I haven't tried it...
> >
> > Just tried it and pdumping went into what appears to be an infinite loo=
p in the
> > gc when allocating buffer space for a vectorlike (which may be the
> > pseudovec
>
>
> I misremembered, you have to pin objects you want to pretend you
> purecopied, just as we do for normal hash tables.

This works great! We no longer do any special purecopying at all!


> > regexp type, and may be because of the `make-regexp' call I added in
> > image.el to test exactly this) and then errored after about 10 seconds.=
 From
> > scratch/no-pure-space I see that pure space relates to zero-length vect=
ors,
> > which may have helped me to figure out why` make_lisp_ptr()' does not r=
eturn
> > Qnil when passed a null reference and Lisp_Vectorlike as arguments.
>
>
> We use tagged NULL pointers for known-to-be-invalid objects in a few
> places, actually, so we can't make make_lisp_ptr return nil...

Ah, that makes sense! It's easy enough to just check for NULL in gc and pdu=
mping
^_^!


> > > Just to reveal my conflict of interest, I'm currently playing with th=
e
> > > scratch/igc branch, and unaligned pointers are a big no for the MPS
> > > garbage collector we're using in that branch.
> >
> > Could I also ask you to please explain the importance of and requiremen=
ts for
> > alignment there? I believe alignment is useful for tagged pointers as i=
t gives
> > you more tag bits (and I see scratch/igc uses those), but is it just th=
at?
>
>
> Most CPU architectures require things to be naturally aligned; x86 and
> x86_64 are the exception there (though, of course, there's an exception
> to the exception, because the %xmmN registers can be accessed using
> special instructions that require alignment). MPS also requires all
> pointers to be stored in aligned words, but that's a separate
> requirement.
>
> The short story is you want to pass -alignof(type) to the pure_alloc
> function or Apple users will get segfaults.

Thanks so much!!! This is something I have tried to understand elsewhere so
I appreciate the free lesson ^_^!


> + /* `translate' being NULL produces an invalid object from +` make_lisp_=
ptr()'. */
> + if (bufp->translate)
>
> + {
> + const Lisp_Object translate =3D
> + make_lisp_ptr ((void *) &bufp->translate, Lisp_Vectorlike);
>
> + dump_field_lv (ctx, &out, bufp, &translate, WEIGHT_NORMAL);
> + }
> +
>
> I don't think you can do that: dump_field_lv will write to a random
> address in the dump based on the difference between bufp and
> "&translate", which is unpredictable as it depends on the stack layout
> of unrelated variables. IIUC, that is.
>
> Can you use dump_field_lv_rawptr instead?

Yes! See commit 4 of patch.

> If you want to remove the purecopy support, here's a patch to do that:

I didn't notice this patch before trying it myself, and elected to use a st=
atic
helper method instead (also in commit 4 of patch). I assume this is
also acceptable?

static Lisp_Object
pin_object (Lisp_Object obj)
{
  struct pinned_object *o =3D xmalloc (sizeof *o);
  o->object =3D obj;
  o->next =3D pinned_objects;
  pinned_objects =3D o;
  return obj;
}


> Well, I don't see why the read syntax can't be the uncompiled pattern
> for now, plus the extra data. Compiling it in lread.c would probably not
> be a huge performance issue. But see below for an alternative: use xr.

> > `xr' is SUPER cool and I've just installed it, thanks so much for
> > enlightening me!
>
> Possibly a better read syntax than the pattern string. Mattias, how
> hard would it be to make the rx + xr combination the identity, or at
> least idempotent?

I really like the idea of using `rx' for read syntax (or at least for print=
ing)
and will look at implementing this in the current patch for
`struct Lisp_Regexp'.

> I believe Apache Lucene implements the relevant mechanisms as well. It
> didn't look too complicated when I looked at it.

Thanks so much for this tip!!!! Will definitely spelunk through Lucene for =
ideas
here ^_^!


=0C
------------------------------------------------------------
This part focuses on modifications to regexp matching logic:
------------------------------------------------------------

Asking for further clarification on a previous message of yours:
> > > (My main problem with the current regexp implementation is it's
> > > intimately tied to the gap representation of buffers, and it'd be eas=
ier
> > > to hack on that if we could "just" switch to a representation-agnosti=
c,
> > > slow, Lisp implementation of regexps. Or translate the regexp itself =
to
> > > Lisp which we could JIT...)

So my impression is actually that the only coupling to the gap buffer is th=
at
the `re_match_2_internal()' method accepts two separate `string1' and
`string2' params (I was actually quite happily surprised when I first saw t=
his,
because it seemed very easy to implement with an external engine). It accep=
ts
just a single search string quite nicely as well, although I agree that the=
re is
some tricky conditional logic to cover the gap buffer case. But unless I'm
mistaken (please correct me!), I wouldn't consider it "intimately tied" to =
the
gap buffer. But this may just be pedantry on my part.

> That sounds very good. I don't think there's any particular rush, to be
> honest, though I would like to reiterate how helpful it would be for me
> to be able to disable the regex-emacs.c code and use a slow Lisp
> implementation instead.

I am very open to this for several reasons, but could I please ask you to
elaborate on the utility here? Maybe you explained it and I'm forgetting
(sorry!); is this because regex-emacs.c is hard to introduce new functional=
ity
to? Or does it affect the GC? A short answer is fine, I just want to unders=
tand
the issue regex-emacs.c specifically poses.


=0C
---------------------------------------------------------------
This part describes how I considered what makes sense for Emacs:
---------------------------------------------------------------
> I take it you're aware of remacs, which hasn't seen development lately
> but was a good way to discover the problems, in particular, of
> integrating rust with the Emacs GC. There might be some salvageable
> code there, though I'm not sure about the copyright status.

I typed up an incredibly long brainstorming session in response to this, wh=
ich
I'm reproducing here because I believe it to be useful, but is only relevan=
t if
you're interested in my thought process. Please feel free to skip down to t=
he
next labeled section.

I was aware of it but was hoping to avoid the "rewrite it in rust" paradigm=
 that
tends to leave important infrastructural projects unmaintained after an ini=
tial
burst of outside enthusiasm. The rust regex crate is more performant than
anything I could expect to write myself, but it is uninterested in supporti=
ng
the kind of flexibility that Emacs requires, and also does not expose an FF=
I and
is therefore limited strictly to rust programs, another unfortunate conceit=
 of
the rust language community. I am currently most interested in novel ways t=
o
perform string matching, so I was hoping that an FFI would avoid needing to
impose any dependency on rust or the GC.

As background which I may not have provided, I am looking to pursue a phd i=
n
text parsing/string matching and have many goals not satisfied by any prior=
 art,
and I am hoping to expose a cross-language FFI so that my work is usable fr=
om
any language and indeed possible to integrate into language implementations
themselves, instead of requiring each language to provide varying and separ=
ate
levels of support in their own bespoke text matching engine. Emacs provides
a delightful set of constraints (most notably the gap buffer and the fantas=
tic
multibyte encoding) which are simply not supported by any current regex eng=
ines,
and in particular it heavily makes use of regexps internally to implement e=
ditor
functionality, because of course Emacs uses text to form its UI. So improvi=
ng
Emacs would be a really fantastic demonstration of the generality of my wor=
k,
and if my hypothesis is correct, I would be able to produce a meaningful
performance improvement without loss of the generality Emacs users
expect. I also have other goals including some novel work in parsing theory=
 and
hygienically composing grammars.

The remacs project at least demonstrates an example of how to provide
a rust-level interface to add new Emacs functionality, but I am heavily
interested in language-agnostic methods to define grammars for parsing and
string searching, which can then be tested, analyzed, compiled, and compose=
d
into libraries of parse techniques, because parsers are hard and error-pron=
e to
write. Because Emacs relies on text in such a deep way, I believe it's a re=
ally
fertile ground for examples of how programmers compose text search/replacem=
ent
with conditional logic, which is something I believe can be made easier and=
 more
fun with thoughtful high-level text manipulation operations.

I'm typing a lot here because I'm having difficulty describing how the exte=
rnal
project implementing these ideas is likely to be useful for inclusion into =
the
Emacs project itself as opposed to an external module. The improvements I w=
as
thinking of (converting lots of `(and (looking-at ...) (replace-match ...))=
`
into some sort of separate grammar description which then invokes the exter=
nal
matching library to perform high-level text ops) would necessarily involve
rewriting a lot of lisp implementation code. The magnitude of changes would=
 then
necessarily end up producing an effort similar to remacs, which is a cool
experiment but not a sustainable infrastructural project like Emacs proper.


=0C
---------------------------------------------------------------------------=
---
This part proposes specific research goals that could turn into Emacs patch=
es:
---------------------------------------------------------------------------=
---

I think I really have several concrete goals here relevant to the Emacs pro=
ject:

(1) Make things like jit-lock-mode (which use regexps in a very specific wa=
y)
    significantly faster by introducing a less-general code path for regex
    matching.
    - `struct Lisp_Regexp' and `struct Lisp_Match' from this patch could be=
 nice
      for this, but it's actually almost completely orthogonal to this patc=
h.
    - I thought we would need to introduce a separate regexp engine to achi=
eve
      this, but I actually think doing this in C and exposing
      a highly-restricted interface to lisp code is likely the best way to
      achieve this. We could even do platform-specific SIMD stuff this way.
    - Because jit-lock-mode uses regexps in a restricted way, it seems less
      likely that we would need to e.g. compile a generic NFA/DFA as oppose=
d to
      some handwritten parsing logic.
    - Importantly, it remains to be demonstrated that jit-lock-mode perform=
ance
      can be significantly improved as I suspect. This may not be a signifi=
cant
      performance bottleneck for emacs.

(2) Improve the interface for regexp searching and matching in Elisp code.
    - This is what this patch offers. As I noted in a prior email:
> Right now, I think the optional STRING argument to `match-string' and
> `replace-match' is the strongest argument that we are currently working a=
round
> a confusing implicit dependency on implicitly-clobbered global match data=
 by
> adding new optional arguments to low-level search functions to index into=
 this
> implicit global result cache which may be invalidated at any time. I thin=
k it
> would make elisp programs easier to write if we had the `struct Lisp_Matc=
h'
> object introduced by this patch.
>=20
> Separately, I think the fact that it's possible to mix up regexp patterns
> (interpreted as a regexp string) and string arguments to search against
> (interpreted as a literal string), and thereby induce silent search error=
s, is
> the strongest argument for introducing a separate `struct Lisp_Regexp'
> object. The fact that `search-spaces-regexp' can be set by a user and int=
roduce
> errors to modes which don't explicitly set it seems to bolster the argume=
nt that
> introducing an explicit "compile" method for regexp patterns would make i=
t
> easier to write correct elisp extensions.
    - The main benefit of `struct Lisp_Regexp' and `struct Lisp_Match' woul=
d be
      to avoid user code having to manage implicit global state. Subjective=
ly,
      I believe this is much cleaner and less error-prone, and would improv=
e the
      experience of writing Elisp extensions.
    - I haven't yet received any feedback from anyone about how to construc=
t
      a realistic benchmark that tests regex performance from precompilatio=
n,
      but I do believe this would also produce performance improvements ove=
r the
      global regexp cache, without introducing complex caching heuristics.
      - I will see whether precompiling `struct Lisp_Regexp' objects will i=
mprove
        jit-lock-mode performance "for free".

(3) Implement an Elisp-level matching engine instead of regex-emacs.c.
    - Now that I think about it, the simplicity of regex-emacs.c and its
      char-by-char iteration seems to lend itself quite well to a Lisp
      implementation, and it seems quite possible for the JIT to generate g=
ood
      code for that as well.
      - I think this is actually a really interesting idea and I will try
        hacking away at this.
      - This can be totally separate from this patch (and in fact would rep=
lace
        it).


=0C
Thanks,
Danny