From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: Pip Cet Newsgroups: gmane.emacs.devel Subject: Re: [PATCH] add compiled regexp primitive lisp object Date: Mon, 05 Aug 2024 17:55:54 +0000 Message-ID: <87le1avopk.fsf@protonmail.com> References: <87mslxxddk.fsf@protonmail.com> <5He97LtsyeyQoTLU7d91oP2CLO8s_2afdgcNxozsFjzu8qGbB_7nXmsZL5O6Ej7K-tuEmngCcPKJpDAjxeKz4jk1DvqSUbdOLpw5U1vo1SY=@hypnicjerk.ai> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="22098"; mail-complaints-to="usenet@ciao.gmane.io" Cc: "emacs-devel@gnu.org" To: Danny McClanahan Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Mon Aug 05 19:57:21 2024 Return-path: Envelope-to: ged-emacs-devel@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1sb1xR-0005Yg-DS for ged-emacs-devel@m.gmane-mx.org; Mon, 05 Aug 2024 19:57:21 +0200 Original-Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1sb1wf-0006qD-EC; Mon, 05 Aug 2024 13:56:33 -0400 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1sb1wF-00065V-VE for emacs-devel@gnu.org; Mon, 05 Aug 2024 13:56:08 -0400 Original-Received: from mail-4322.protonmail.ch ([185.70.43.22]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1sb1w9-00059P-JD for emacs-devel@gnu.org; Mon, 05 Aug 2024 13:56:06 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=protonmail.com; s=protonmail3; t=1722880557; x=1723139757; bh=L0l0raoAgtJaIUux/zbq970s+8ZArhxckE5ByTx9vy4=; h=Date:To:From:Cc:Subject:Message-ID:In-Reply-To:References: Feedback-ID:From:To:Cc:Date:Subject:Reply-To:Feedback-ID: Message-ID:BIMI-Selector; b=umJweG347eDoNWETd6+OqE84oRw1BdYfW3m9KP+nCrGgF7cZfPJ2NPmrjvCV4YZuo e3aYpZpg+CsoFiXKEG74x+8PSsct3eDaA2Mym5tSPWLL4oNG6PJirLOcFORbdGB+dQ Zb5bIGWou5PSJYr4qW7mim9YNZcxr/CsLDHjzHzSrx47yvd1dhIOQ4bYSZWX1D0IEm hk9SHPypRHnXMN4pl0bsroKQ2amiwMzHeu59S/iTwZj/WuWU3/lyXKUcF81GgB40YO YCo1PQTL7Of7dAfXLxipiPm1wTi7B7Xi7ezsyQmWncJuHRP3OQP1j3hDSzHB397hc0 EMjfvw3u6guXA== In-Reply-To: <5He97LtsyeyQoTLU7d91oP2CLO8s_2afdgcNxozsFjzu8qGbB_7nXmsZL5O6Ej7K-tuEmngCcPKJpDAjxeKz4jk1DvqSUbdOLpw5U1vo1SY=@hypnicjerk.ai> Feedback-ID: 112775352:user:proton X-Pm-Message-ID: 1a9941fb83ed26da4a56062201f9a44a1ebb10e1 Received-SPF: pass client-ip=185.70.43.22; envelope-from=pipcet@protonmail.com; helo=mail-4322.protonmail.ch X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H4=0.001, RCVD_IN_MSPIKE_WL=0.001, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-Mailman-Approved-At: Mon, 05 Aug 2024 13:56:31 -0400 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Xref: news.gmane.io gmane.emacs.devel:322403 Archived-At: "Danny McClanahan" writes: > Ok, here's a thorough response to this wonderful review from Pip a few da= ys ago: > >> On Wednesday, July 31st, 2024 at 21:04, Pip Cet = wrote: >> >> That part I think I understand, but what data precisely is part of those >> regexp objects, and why? If I change case-fold-search, do I have to >> recompile my regexp or will its case sensitivity stick? > > In this case, the case-sensitivity is "stuck" into the regexp (...mostly)= . There > are three variables at play here, some of which affect regexp compilation= , some > of which affect the input string, and some of which do both: > > - `search-spaces-regexp': if set, then during pattern compilation, litera= l > spaces ' ' will be replaced with the value of this string. Afterwards, = it has > no effect. > > - `case-fold-search': if set, then a C-level buffer-local variable named > `case_canon_table' is applied. This `translate` table is applied to the= input > pattern when compiled, as well as to a string when it is matched. > > - Note that because this translation table is necessary for pattern mat= ching, > it is (already) stored in re_pattern_buffer, not in the new Lisp_Rege= xp. > > - `syntax_table': this one makes my head hurt. `re_pattern_buffer' record= s > `used_syntax' when `RECC_SPACE' or `RECC_WORD' are encountered by the p= attern > compiler, and it is noted that if `used_syntax' is set, then the compil= ed > pattern is *only* valid for that syntax table. `regexp_cache' in search= .c > records the `syntax_table' so that it can check for invalidation (if th= e > current buffer's syntax table is different than the compiled pattern's,= then > the pattern is invalidated and must be recompiled). > > - I'm not sure how to handle this most effectively, except my suspicion= is > that it would probably be a very good idea to move any syntax table l= ookup > out of the regex matching process entirely. I believe there was some > openness to this approach in previous discussions about regex-emacs. > > `posix' behavior is also technically set at compile-time, but it's not se= t via > an environment variable, and the current search APIs expose separate meth= ods for > posix search behavior. But that's also a "sticky" variable, although it's= much > more straightforward than the rest. > > To directly answer your question: if you change `case-fold-search' after > compiling a regexp, the case sensitivity *will* remain in the compiled > regexp. In general, most of the tricky behavior here sticks along with th= e > compiled regexp, and the existing `regexp_cache' setup has already done m= ost of > the work to figure out what makes a regexp invalid (which is very helpful= !). Thank you for that wonderful summary. FWIW, I agree that a regexp should have as much as possible baked into it. >> > - make Lisp_Regexp purecopyable and pdumpable >> >> >> Is the purecopy part really necessary? It may be easier not to purecopy >> regexp objects for now... > > I actually have no idea what purecopy is doing at all; I just thought it = was > related to the pdumper since I found other pseudovec types doing it. For = some > reason or other, though, I found that the `purecopy()` method was errorin= g out > on compiled regexps during the bootstrap process, so I implemented pureco= py here > as a practical matter to get emacs building again. It may have been becau= se of > the strings I converted into `make-regexp' calls in lisp/image.el, so if > purecopy is not something we want to do, we may be able to avoid it by re= verting > that. I'm a sworn enemy of pure space (see scratch/no-pure-space), but I'm pretty sure you can make purecopy simply return the object in question (rather than going through the generic vectorlike code) and things should work. But I haven't tried it... >> I think I can mostly provide silly questions, so that's what I'll do. >> However, my first one should come before the notes on the patch: why >> does the new PVEC store a pointer to re_pattern_buffer rather than >> containing re_pattern_buffer directly? Doing that would simplify the >> code, particularly the dumping and purecopying part. Or is there no >> one-to-one correspondence between pvecs and re_pattern_buffers? > > This is a great question; the reason I did this was because the compiler > complained because lisp.h doesn't have access to the definition of the > `re_pattern_buffer' struct for some reason. It would absolutely be prefer= able to > store an `re_pattern_buffer' directly, and I'll look into that now. I don't think the definition needs to be in lisp.h. >> > +{ >> > + struct re_pattern_buffer *pure =3D pure_alloc (sizeof *pure, -1); >> >> >> I don't think -1 is correct here, as the result may well be unaligned, w= hich >> will crash on some architectures (but not x86). > > ^_^ Thought this might be the case! The methods are at least reasonably w= ell > documented. Just to reveal my conflict of interest, I'm currently playing with the scratch/igc branch, and unaligned pointers are a big no for the MPS garbage collector we're using in that branch. >> I think you changed some whitespace here, or maybe I'm missing what >> changed for PVEC_SQLITE etc. > > I did change some whitespace; I couldn't figure out what had happened at > the time. This can be reverted. Absolutely, no problem. >> > dump_field_fixup_later (ctx, &out, node, &node->parent); >> > + / FIXME: should these both be &node->{left,right} instead of &node->= parent? */ >> > if (node->left) >> > dump_field_fixup_later (ctx, &out, node, &node->parent); >> > if (node->right) >> >> >> Yes, I believe those should be node->left and node->right... > > Great! Thanks! Do you have commit access yet? That would be worth fixing independently of the rest of the patch (and getting the copyright assignment in order takes a while, so it's probably best to start early). IIUC, it doesn't have any effect at present, but the code is confusing... >> I think this change is unrelated, and may have a small performance >> impact. > > My impression is that the new `dump_cold_bytes()' performs exactly these = two > operations in a row, so I thought it would be the same result. I will rev= iew > this again. COI again, we also touched this code in the scratch/igc branch so all I see was more work :-) I actually like the cold_bytes abstraction, though. >> I'm not sure we should be using the a=3Db pattern here. Maybe one day we >> want to read such objects and it'd be easier to do that if the output >> syntax were more Lisp-like (in fact, I'd convert the whole thing to a >> hash table and print that...) > > Yes! *sheepish grin* sorry, this was incredibly hasty work (modifying pri= nt.c > was the first change I made to the codebase after adding PVEC_REGEXP). I = will > definitely look again at improving this ^_^. I said "hash table" above, but even that's too complicated. Just cons up a list :-) >> And that look-up happens at the time make-regexp is called, not when it >> is used, right? That might be surprising if case-fold-search changes, >> for example. > > Yes, the lookup from the current buffer occurs at compile time, and that = case > folding stays with the regexp after compilation (see top). I also had the= same > concern regarding this wording; thanks for raising it as truly the intera= ctions > between environment variables are some of the more confusing parts of > regex behavior. > As above, I think the existing `regexp_cache' work has actually done a gr= eat job > at nailing down what invalidates a regexp, so I think we can extend that > framework to ensure compiled regexps have all of the configuration set at > compile time to ensure intuitive behavior. Indeed. Again, my preference is to pretend the world is UTF-8, because charset interactions make my head hurt, and declare that a compiled regexp simply matches or does not match a given array of bytes (plus a marker position and BOL/EOL flags, but you get the idea), and that changing the flags results in a new and different compiled regexp. >> > @@ -3405,16 +3473,24 @@ DEFUN ("re--describe-compiled", Fre__describe_= compiled, Sre__describe_compiled, >> > If RAW is non-nil, just return the actual bytecode. */) >> > (Lisp_Object regexp, Lisp_Object raw) >> > { >> > - struct regexp_cache *cache_entry >> > - =3D compile_pattern (regexp, NULL, >> > - (!NILP (Vcase_fold_search) >> > - ? BVAR (current_buffer, case_canon_table) : Qnil), >> > - false, >> > - !NILP (BVAR (current_buffer, >> > - enable_multibyte_characters))); >> > + struct re_pattern_buffer *bufp; >> > + >> > + if (REGEXP_P (regexp)) >> > + bufp =3D XREGEXP (regexp)->buffer; >> > + else >> > + { >> > + struct regexp_cache *cache_entry >> > + =3D compile_pattern (regexp, NULL, >> > + (!NILP (Vcase_fold_search) >> > + ? BVAR (current_buffer, case_canon_table) : Qnil), >> > + false, >> > + !NILP (BVAR (current_buffer, >> > + enable_multibyte_characters))); >> > + bufp =3D &cache_entry->buf; >> > + } >> > + >> >> We never CHECK_STRING in re--describe-compiled, so >> (re--describe-compiled nil) currently appears to crash Emacs (without >> the patch). Can we fix that while we're in there? > > I accidentally caused my own crash this way yesterday by removing a CHECK= _STRING > from string_match_1! Will definitely keep an eye out for other argument > validation in this area. Let me say how impressed I am by all this. You're probably aware of it, but I'm a huge fan of the rx and xr functions (the latter is available from ELPA). I believe the regexp engine in Emacs is somewhat outdated and in need of a replacement, and I think this abstraction plus xr are good first steps. I'm saying this because it might help to think of a compiled regexp as a GC-able reference to a DFA (or, my preference, an NFA), and then we could do cool things like performing a Levenshtein transformation on a regexp to capture typos. (My main problem with the current regexp implementation is it's intimately tied to the gap representation of buffers, and it'd be easier to hack on that if we could "just" switch to a representation-agnostic, slow, Lisp implementation of regexps. Or translate the regexp itself to Lisp which we could JIT...) Anyway, thanks for this, I'm looking forward to the next patch! Pip