From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: Danny McClanahan Newsgroups: gmane.emacs.devel Subject: Re: [PATCH] add compiled regexp primitive lisp object Date: Wed, 07 Aug 2024 07:59:15 +0000 Message-ID: <7TES32-QbXZEtj4F5ClAcMZAjkGgR4H08Opv0luyXSYDF4QeOQwHn7oa7tNGRKIoDCpkw3UdQTJnkXRpxEtX0i0_jw3UcyVMaKbTzOYHkjA=@hypnicjerk.ai> References: <87mslxxddk.fsf@protonmail.com> <5He97LtsyeyQoTLU7d91oP2CLO8s_2afdgcNxozsFjzu8qGbB_7nXmsZL5O6Ej7K-tuEmngCcPKJpDAjxeKz4jk1DvqSUbdOLpw5U1vo1SY=@hypnicjerk.ai> <87le1avopk.fsf@protonmail.com> <2LOLmIp1X8w4CGbqq3qDrzmKVA0KzYNL1N9lBtWdB-MtEv9oCuYgJMYprG170wMPjYxeQImAmWOPatGTTl4KxZMlptNo9A9hnHt84vdN9EA=@hypnicjerk.ai> <87ttfxtszi.fsf@protonmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="29508"; mail-complaints-to="usenet@ciao.gmane.io" Cc: "emacs-devel@gnu.org" , =?utf-8?Q?Mattias_Engdeg=C3=A5rd?= To: Pip Cet Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Wed Aug 07 12:52:09 2024 Return-path: Envelope-to: ged-emacs-devel@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1sbeH3-0007Ob-9J for ged-emacs-devel@m.gmane-mx.org; Wed, 07 Aug 2024 12:52:09 +0200 Original-Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1sbeGN-0005Di-Mi; Wed, 07 Aug 2024 06:51:27 -0400 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1sbbZx-0000Vw-LJ for emacs-devel@gnu.org; Wed, 07 Aug 2024 03:59:29 -0400 Original-Received: from mail-40136.proton.ch ([185.70.40.136]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1sbbZr-00043K-J3 for emacs-devel@gnu.org; Wed, 07 Aug 2024 03:59:29 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=hypnicjerk.ai; s=protonmail2; t=1723017559; x=1723276759; bh=X2MXwm3CxaXFy0MS2zSGs8+Tx7Ca7SYWr63L91cvWfE=; h=Date:To:From:Cc:Subject:Message-ID:In-Reply-To:References: Feedback-ID:From:To:Cc:Date:Subject:Reply-To:Feedback-ID: Message-ID:BIMI-Selector; b=mjg6jwMylzX3l3lUwRP6fsZpru//lUXRmqILQKMv7nCaZlwTqEfdrBphuI0rtZ9Wt hE4bazNcpwNYnvulcnyXh4F1OPfhqNLnAZ3m1vFUXCZr3IlDpjLC6PK+ANQ3QkuLGG zF+bi4ZhJx8IB62I01uqf0Fwl4f8sGy7Q6rJf6m/rw7n0/Mxk7ZZS9FDyOp7YssksM 5PNdH3qEC86pYamAxKg94LcDihiEnfKbY3OyETUVJJt7ZmFYTHUaZ+lxGnNgnAzx6x SB4qKDTCvpl6jdy+oVp0huEfNT4CJ1TYBG2TAdTsbzyEQS6Ql6Beugmtb++sQDchBN CjJIz6mL+nCUA== In-Reply-To: <87ttfxtszi.fsf@protonmail.com> Feedback-ID: 27837847:user:proton X-Pm-Message-ID: 6ec8bd8263e3b5015df4bdb21b374f5b8011fbdd Received-SPF: pass client-ip=185.70.40.136; envelope-from=dmcc2@hypnicjerk.ai; helo=mail-40136.proton.ch X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H4=0.001, RCVD_IN_MSPIKE_WL=0.001, RCVD_IN_VALIDITY_CERTIFIED_BLOCKED=0.001, RCVD_IN_VALIDITY_RPBL_BLOCKED=0.001, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-Mailman-Approved-At: Wed, 07 Aug 2024 06:51:24 -0400 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Xref: news.gmane.io gmane.emacs.devel:322489 Archived-At: > From: Pip Cet pipcet@protonmail.com Thanks so much again for the thorough feedback Pip! I am having difficulty making this email shorter, so I have split it into labeled sections. The mo= st useful part is at the bottom, where I try to focus on specific goals this patchset could turn into. It's a lot of brainstorming that I'm sending to emacs-devel to record for later, but you do not need to read it. Summary: - I have attached the formatted patch which now avoids special purecopy log= ic and fixes pdumping, thanks to Pip's fixes. - I will try using this patch's `make-regexp' for jit-lock-mode and see if = that produces any performance improvement. - I will look at implementing Pip's proposal for a Lisp reimplementation of regex-emacs.c as a separate patch. Please let me know if emails of this length are not suitable for emacs-deve= l and I would be happy to oblige. ^_^ <3 =0C ------------------------------------------- This part is about maintainer interactions: ------------------------------------------- > Thanks again. I'm hoping the maintainers will have something positive to > say about this patch. > > Pip Pip, I continue to heavily appreciate your input (and please do not stop do= ing so!), but I would please ask you not to use words like "positive" to mean "= accepts the changes", as I consider this to be emotionally manipulative and if I wa= s a maintainer it would bias me against acceptance. The scrutiny I am receivi= ng for introducing new primitive objects which would modify bytecode and take = up space in the `pvec_type' enum seems perfectly appropriate and by no means would I consider it "negative" or even bureaucratic. These discussion= s spurred by Eli and Andrea have already helped me to realize: (1) Improved performance for jit-lock-mode (if possible) is likely achievab= le without these objects, by introducing new caching methods. (2) Much of the SIMD techniques that treat multibyte text as a string of by= tes cannot be easily applied in the face of translation tables (and therefo= re improved performance would require a separate breaking API change than simply compiled regexp objects). (3) The coding-system has nothing to do with regex matching. I would not consider it a failure if this patch was rejected as I now reali= ze much of the benefits do not require that, and I consider this a necessary p= art of review as well as a learning experience for me about Emacs internals (mu= ch as you have educated me about purecopying). I do not consider it antagonistic. =0C ------------------------------------------------- This part is direct responses to review comments: ------------------------------------------------- > > > I'm a sworn enemy of pure space (see scratch/no-pure-space), but I'm > > > pretty sure you can make purecopy simply return the object in questio= n > > > (rather than going through the generic vectorlike code) and things > > > should work. But I haven't tried it... > > > > Just tried it and pdumping went into what appears to be an infinite loo= p in the > > gc when allocating buffer space for a vectorlike (which may be the > > pseudovec > > > I misremembered, you have to pin objects you want to pretend you > purecopied, just as we do for normal hash tables. This works great! We no longer do any special purecopying at all! > > regexp type, and may be because of the `make-regexp' call I added in > > image.el to test exactly this) and then errored after about 10 seconds.= From > > scratch/no-pure-space I see that pure space relates to zero-length vect= ors, > > which may have helped me to figure out why` make_lisp_ptr()' does not r= eturn > > Qnil when passed a null reference and Lisp_Vectorlike as arguments. > > > We use tagged NULL pointers for known-to-be-invalid objects in a few > places, actually, so we can't make make_lisp_ptr return nil... Ah, that makes sense! It's easy enough to just check for NULL in gc and pdu= mping ^_^! > > > Just to reveal my conflict of interest, I'm currently playing with th= e > > > scratch/igc branch, and unaligned pointers are a big no for the MPS > > > garbage collector we're using in that branch. > > > > Could I also ask you to please explain the importance of and requiremen= ts for > > alignment there? I believe alignment is useful for tagged pointers as i= t gives > > you more tag bits (and I see scratch/igc uses those), but is it just th= at? > > > Most CPU architectures require things to be naturally aligned; x86 and > x86_64 are the exception there (though, of course, there's an exception > to the exception, because the %xmmN registers can be accessed using > special instructions that require alignment). MPS also requires all > pointers to be stored in aligned words, but that's a separate > requirement. > > The short story is you want to pass -alignof(type) to the pure_alloc > function or Apple users will get segfaults. Thanks so much!!! This is something I have tried to understand elsewhere so I appreciate the free lesson ^_^! > + /* `translate' being NULL produces an invalid object from +` make_lisp_= ptr()'. */ > + if (bufp->translate) > > + { > + const Lisp_Object translate =3D > + make_lisp_ptr ((void *) &bufp->translate, Lisp_Vectorlike); > > + dump_field_lv (ctx, &out, bufp, &translate, WEIGHT_NORMAL); > + } > + > > I don't think you can do that: dump_field_lv will write to a random > address in the dump based on the difference between bufp and > "&translate", which is unpredictable as it depends on the stack layout > of unrelated variables. IIUC, that is. > > Can you use dump_field_lv_rawptr instead? Yes! See commit 4 of patch. > If you want to remove the purecopy support, here's a patch to do that: I didn't notice this patch before trying it myself, and elected to use a st= atic helper method instead (also in commit 4 of patch). I assume this is also acceptable? static Lisp_Object pin_object (Lisp_Object obj) { struct pinned_object *o =3D xmalloc (sizeof *o); o->object =3D obj; o->next =3D pinned_objects; pinned_objects =3D o; return obj; } > Well, I don't see why the read syntax can't be the uncompiled pattern > for now, plus the extra data. Compiling it in lread.c would probably not > be a huge performance issue. But see below for an alternative: use xr. > > `xr' is SUPER cool and I've just installed it, thanks so much for > > enlightening me! > > Possibly a better read syntax than the pattern string. Mattias, how > hard would it be to make the rx + xr combination the identity, or at > least idempotent? I really like the idea of using `rx' for read syntax (or at least for print= ing) and will look at implementing this in the current patch for `struct Lisp_Regexp'. > I believe Apache Lucene implements the relevant mechanisms as well. It > didn't look too complicated when I looked at it. Thanks so much for this tip!!!! Will definitely spelunk through Lucene for = ideas here ^_^! =0C ------------------------------------------------------------ This part focuses on modifications to regexp matching logic: ------------------------------------------------------------ Asking for further clarification on a previous message of yours: > > > (My main problem with the current regexp implementation is it's > > > intimately tied to the gap representation of buffers, and it'd be eas= ier > > > to hack on that if we could "just" switch to a representation-agnosti= c, > > > slow, Lisp implementation of regexps. Or translate the regexp itself = to > > > Lisp which we could JIT...) So my impression is actually that the only coupling to the gap buffer is th= at the `re_match_2_internal()' method accepts two separate `string1' and `string2' params (I was actually quite happily surprised when I first saw t= his, because it seemed very easy to implement with an external engine). It accep= ts just a single search string quite nicely as well, although I agree that the= re is some tricky conditional logic to cover the gap buffer case. But unless I'm mistaken (please correct me!), I wouldn't consider it "intimately tied" to = the gap buffer. But this may just be pedantry on my part. > That sounds very good. I don't think there's any particular rush, to be > honest, though I would like to reiterate how helpful it would be for me > to be able to disable the regex-emacs.c code and use a slow Lisp > implementation instead. I am very open to this for several reasons, but could I please ask you to elaborate on the utility here? Maybe you explained it and I'm forgetting (sorry!); is this because regex-emacs.c is hard to introduce new functional= ity to? Or does it affect the GC? A short answer is fine, I just want to unders= tand the issue regex-emacs.c specifically poses. =0C --------------------------------------------------------------- This part describes how I considered what makes sense for Emacs: --------------------------------------------------------------- > I take it you're aware of remacs, which hasn't seen development lately > but was a good way to discover the problems, in particular, of > integrating rust with the Emacs GC. There might be some salvageable > code there, though I'm not sure about the copyright status. I typed up an incredibly long brainstorming session in response to this, wh= ich I'm reproducing here because I believe it to be useful, but is only relevan= t if you're interested in my thought process. Please feel free to skip down to t= he next labeled section. I was aware of it but was hoping to avoid the "rewrite it in rust" paradigm= that tends to leave important infrastructural projects unmaintained after an ini= tial burst of outside enthusiasm. The rust regex crate is more performant than anything I could expect to write myself, but it is uninterested in supporti= ng the kind of flexibility that Emacs requires, and also does not expose an FF= I and is therefore limited strictly to rust programs, another unfortunate conceit= of the rust language community. I am currently most interested in novel ways t= o perform string matching, so I was hoping that an FFI would avoid needing to impose any dependency on rust or the GC. As background which I may not have provided, I am looking to pursue a phd i= n text parsing/string matching and have many goals not satisfied by any prior= art, and I am hoping to expose a cross-language FFI so that my work is usable fr= om any language and indeed possible to integrate into language implementations themselves, instead of requiring each language to provide varying and separ= ate levels of support in their own bespoke text matching engine. Emacs provides a delightful set of constraints (most notably the gap buffer and the fantas= tic multibyte encoding) which are simply not supported by any current regex eng= ines, and in particular it heavily makes use of regexps internally to implement e= ditor functionality, because of course Emacs uses text to form its UI. So improvi= ng Emacs would be a really fantastic demonstration of the generality of my wor= k, and if my hypothesis is correct, I would be able to produce a meaningful performance improvement without loss of the generality Emacs users expect. I also have other goals including some novel work in parsing theory= and hygienically composing grammars. The remacs project at least demonstrates an example of how to provide a rust-level interface to add new Emacs functionality, but I am heavily interested in language-agnostic methods to define grammars for parsing and string searching, which can then be tested, analyzed, compiled, and compose= d into libraries of parse techniques, because parsers are hard and error-pron= e to write. Because Emacs relies on text in such a deep way, I believe it's a re= ally fertile ground for examples of how programmers compose text search/replacem= ent with conditional logic, which is something I believe can be made easier and= more fun with thoughtful high-level text manipulation operations. I'm typing a lot here because I'm having difficulty describing how the exte= rnal project implementing these ideas is likely to be useful for inclusion into = the Emacs project itself as opposed to an external module. The improvements I w= as thinking of (converting lots of `(and (looking-at ...) (replace-match ...))= ` into some sort of separate grammar description which then invokes the exter= nal matching library to perform high-level text ops) would necessarily involve rewriting a lot of lisp implementation code. The magnitude of changes would= then necessarily end up producing an effort similar to remacs, which is a cool experiment but not a sustainable infrastructural project like Emacs proper. =0C ---------------------------------------------------------------------------= --- This part proposes specific research goals that could turn into Emacs patch= es: ---------------------------------------------------------------------------= --- I think I really have several concrete goals here relevant to the Emacs pro= ject: (1) Make things like jit-lock-mode (which use regexps in a very specific wa= y) significantly faster by introducing a less-general code path for regex matching. - `struct Lisp_Regexp' and `struct Lisp_Match' from this patch could be= nice for this, but it's actually almost completely orthogonal to this patc= h. - I thought we would need to introduce a separate regexp engine to achi= eve this, but I actually think doing this in C and exposing a highly-restricted interface to lisp code is likely the best way to achieve this. We could even do platform-specific SIMD stuff this way. - Because jit-lock-mode uses regexps in a restricted way, it seems less likely that we would need to e.g. compile a generic NFA/DFA as oppose= d to some handwritten parsing logic. - Importantly, it remains to be demonstrated that jit-lock-mode perform= ance can be significantly improved as I suspect. This may not be a signifi= cant performance bottleneck for emacs. (2) Improve the interface for regexp searching and matching in Elisp code. - This is what this patch offers. As I noted in a prior email: > Right now, I think the optional STRING argument to `match-string' and > `replace-match' is the strongest argument that we are currently working a= round > a confusing implicit dependency on implicitly-clobbered global match data= by > adding new optional arguments to low-level search functions to index into= this > implicit global result cache which may be invalidated at any time. I thin= k it > would make elisp programs easier to write if we had the `struct Lisp_Matc= h' > object introduced by this patch. >=20 > Separately, I think the fact that it's possible to mix up regexp patterns > (interpreted as a regexp string) and string arguments to search against > (interpreted as a literal string), and thereby induce silent search error= s, is > the strongest argument for introducing a separate `struct Lisp_Regexp' > object. The fact that `search-spaces-regexp' can be set by a user and int= roduce > errors to modes which don't explicitly set it seems to bolster the argume= nt that > introducing an explicit "compile" method for regexp patterns would make i= t > easier to write correct elisp extensions. - The main benefit of `struct Lisp_Regexp' and `struct Lisp_Match' woul= d be to avoid user code having to manage implicit global state. Subjective= ly, I believe this is much cleaner and less error-prone, and would improv= e the experience of writing Elisp extensions. - I haven't yet received any feedback from anyone about how to construc= t a realistic benchmark that tests regex performance from precompilatio= n, but I do believe this would also produce performance improvements ove= r the global regexp cache, without introducing complex caching heuristics. - I will see whether precompiling `struct Lisp_Regexp' objects will i= mprove jit-lock-mode performance "for free". (3) Implement an Elisp-level matching engine instead of regex-emacs.c. - Now that I think about it, the simplicity of regex-emacs.c and its char-by-char iteration seems to lend itself quite well to a Lisp implementation, and it seems quite possible for the JIT to generate g= ood code for that as well. - I think this is actually a really interesting idea and I will try hacking away at this. - This can be totally separate from this patch (and in fact would rep= lace it). =0C Thanks, Danny