From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: Pip Cet Newsgroups: gmane.emacs.devel Subject: Re: [PATCH] add compiled regexp primitive lisp object Date: Tue, 06 Aug 2024 18:18:44 +0000 Message-ID: <87ttfxtszi.fsf@protonmail.com> References: <87mslxxddk.fsf@protonmail.com> <5He97LtsyeyQoTLU7d91oP2CLO8s_2afdgcNxozsFjzu8qGbB_7nXmsZL5O6Ej7K-tuEmngCcPKJpDAjxeKz4jk1DvqSUbdOLpw5U1vo1SY=@hypnicjerk.ai> <87le1avopk.fsf@protonmail.com> <2LOLmIp1X8w4CGbqq3qDrzmKVA0KzYNL1N9lBtWdB-MtEv9oCuYgJMYprG170wMPjYxeQImAmWOPatGTTl4KxZMlptNo9A9hnHt84vdN9EA=@hypnicjerk.ai> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="35442"; mail-complaints-to="usenet@ciao.gmane.io" Cc: "emacs-devel@gnu.org" , =?utf-8?Q?Mattias_Engdeg=C3=A5rd?= To: Danny McClanahan Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Tue Aug 06 20:27:03 2024 Return-path: Envelope-to: ged-emacs-devel@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1sbOtj-00093H-7r for ged-emacs-devel@m.gmane-mx.org; Tue, 06 Aug 2024 20:27:03 +0200 Original-Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1sbOt3-000140-GV; Tue, 06 Aug 2024 14:26:21 -0400 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1sbOlz-0001AB-6Z for emacs-devel@gnu.org; Tue, 06 Aug 2024 14:19:06 -0400 Original-Received: from mail-40134.protonmail.ch ([185.70.40.134]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1sbOlv-00051e-3Z for emacs-devel@gnu.org; Tue, 06 Aug 2024 14:19:02 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=protonmail.com; s=protonmail3; t=1722968328; x=1723227528; bh=7Uzkp0L9HryhrgZ7jjBqfYXRbheYglkQgbnzCxFRv/Y=; h=Date:To:From:Cc:Subject:Message-ID:In-Reply-To:References: Feedback-ID:From:To:Cc:Date:Subject:Reply-To:Feedback-ID: Message-ID:BIMI-Selector; b=vz0BtzcRIbQase4L9Lgm7jokmBWgAoGT8JBkg40KhcOs+7vn0YzO497WcdenDBbOk Hq1MW7lkta1uDqSUinXQxYSqi28+4aq2TKkGCm7TIxTwlUGALl656GCm4WItmCBZc9 zaeOKOcnsmE73cQxLcgyW0sLvauUeQM2iWdTVTpUdVEfeIJWsV1wjUGxYJd7SzF14R JCTHMOQ1UEh0uJrlen8AlEo3xwalOH9WjGa8yUFZjPteRhmakT369wfHGVXEINzIqD FBAhWGA8tqjYTTuiJoqH9s64OBDRp+eDnf8qJwU2Cj5wtteRPVqCgW5gTLlsEyVEPs eEDCa2VYuWRfQ== In-Reply-To: <2LOLmIp1X8w4CGbqq3qDrzmKVA0KzYNL1N9lBtWdB-MtEv9oCuYgJMYprG170wMPjYxeQImAmWOPatGTTl4KxZMlptNo9A9hnHt84vdN9EA=@hypnicjerk.ai> Feedback-ID: 112775352:user:proton X-Pm-Message-ID: 74798b05b3dde162fb91797e022d1f004658fd72 Received-SPF: pass client-ip=185.70.40.134; envelope-from=pipcet@protonmail.com; helo=mail-40134.protonmail.ch X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H4=0.001, RCVD_IN_MSPIKE_WL=0.001, RCVD_IN_VALIDITY_CERTIFIED_BLOCKED=0.001, RCVD_IN_VALIDITY_RPBL_BLOCKED=0.001, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-Mailman-Approved-At: Tue, 06 Aug 2024 14:26:19 -0400 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Xref: news.gmane.io gmane.emacs.devel:322456 Archived-At: "Danny McClanahan" writes: >> On Monday, August 5th, 2024 at 13:55, Pip Cet wr= ote: >> >> Thank you for that wonderful summary. FWIW, I agree that a regexp >> should have as much as possible baked into it. > > Upon checking out how the `translate' table is used, I actually think it = would > be possible and even reasonable to avoid keeping track of it at all after > compiling the regexp, if the regexp is instead compiled into a form that = relies > on character sets instead of translating input chars (this is how the rus= t regex > crate handles its equivalent of `case-fold-search'). This would also remo= ve > another blocker to enabling SIMD searching in some ways, although pattern= s with > case- or char-folding also tend to foil SIMD search techniques (maybe som= ething > clever can be done about this). > >> I'm a sworn enemy of pure space (see scratch/no-pure-space), but I'm >> pretty sure you can make purecopy simply return the object in question >> (rather than going through the generic vectorlike code) and things >> should work. But I haven't tried it... > > Just tried it and pdumping went into what appears to be an infinite loop = in the > gc when allocating buffer space for a vectorlike (which may be the > pseudovec I misremembered, you have to pin objects you want to pretend you purecopied, just as we do for normal hash tables. > regexp type, and may be because of the `make-regexp' call I added in imag= e.el to > test exactly this) and then errored after about 10 seconds. From > scratch/no-pure-space I see that pure space relates to zero-length vector= s, > which may have helped me to figure out why `make_lisp_ptr()' does not ret= urn > Qnil when passed a null reference and Lisp_Vectorlike as arguments. We use tagged NULL pointers for known-to-be-invalid objects in a few places, actually, so we can't make make_lisp_ptr return nil... >> Just to reveal my conflict of interest, I'm currently playing with the >> scratch/igc branch, and unaligned pointers are a big no for the MPS >> garbage collector we're using in that branch. > > Could I also ask you to please explain the importance of and requirements= for > alignment there? I believe alignment is useful for tagged pointers as it = gives > you more tag bits (and I see scratch/igc uses those), but is it just that= ? Most CPU architectures require things to be naturally aligned; x86 and x86_64 are the exception there (though, of course, there's an exception to the exception, because the %xmmN registers can be accessed using special instructions that require alignment). MPS also requires all pointers to be stored in aligned words, but that's a separate requirement. The short story is you want to pass -alignof(type) to the pure_alloc function or Apple users will get segfaults. >> Do you have commit access yet? That would be worth fixing independently >> of the rest of the patch (and getting the copyright assignment in order >> takes a while, so it's probably best to start early). IIUC, it doesn't >> have any effect at present, but the code is confusing... > > I have no problems with assigning copyright to the FSF (I even see that t= hey no > longer require physical mail) but will wait for maintainers to resolve wh= ether > this set of changes is considered useful, for which I am in no hurry. In = any > case, I consider splitting up large changes into easily-reviewable chunks= one of > my specialties and will definitely be able to separate this sort of chang= e from > the rest of the patch. Even if this change is not accepted, it's good to = hear > that changes are welcome here and I would easily be able to look at impro= ving > them now that I'm vaguely up to speed. That sounds very good. I don't think there's any particular rush, to be honest, though I would like to reiterate how helpful it would be for me to be able to disable the regex-emacs.c code and use a slow Lisp implementation instead. >> > My impression is that the new `dump_cold_bytes()' performs exactly the= se two >> > operations in a row, so I thought it would be the same result. I will = review >> > this again. >> >> COI again, we also touched this code in the scratch/igc branch so all I >> see was more work :-) I actually like the cold_bytes abstraction, though= . > > Not a problem! I was pleasantly surprised at how hackable low-level thing= s like > the pdumper and gc are but very easy to consider scratch/igc in the futur= e now > that you've mentioned it ^_^! Excellent. >> > > I'm not sure we should be using the a=3Db pattern here. Maybe one da= y we >> > > want to read such objects and it'd be easier to do that if the outpu= t >> > > syntax were more Lisp-like (in fact, I'd convert the whole thing to = a >> > > hash table and print that...) >> >> I said "hash table" above, but even that's too complicated. Just cons up >> a list :-) > > It would be wonderful to consider a read syntax for this, although as I n= oted in > my other reply to Andrea the current imposition of `make-regexp' imposes > a natural barrier with which to employ more time-intensive compilation > techniques e.g. DFAs. Most regexp engines employ some form of heuristics = for > this anyway, although I personally don't really like how that produces > nonintuitive performance behavior and would strongly prefer just requirin= g > an explicit compilation step (perhaps with optimization flags to denote a= nything > that's hard for heuristics to guess). Well, I don't see why the read syntax can't be the uncompiled pattern for now, plus the extra data. Compiling it in lread.c would probably not be a huge performance issue. But see below for an alternative: use xr. >> > > And that look-up happens at the time make-regexp is called, not when= it >> > > is used, right? That might be surprising if case-fold-search changes= , >> > > for example. >> > >> > Yes, the lookup from the current buffer occurs at compile time, and th= at case >> > folding stays with the regexp after compilation (see top). I also had = the same >> > concern regarding this wording; thanks for raising it as truly the int= eractions >> > between environment variables are some of the more confusing parts of >> > regex behavior. >> >> > As above, I think the existing `regexp_cache' work has actually done a= great job >> > at nailing down what invalidates a regexp, so I think we can extend th= at >> > framework to ensure compiled regexps have all of the configuration set= at >> > compile time to ensure intuitive behavior. >> >> Indeed. Again, my preference is to pretend the world is UTF-8, because >> charset interactions make my head hurt, and declare that a compiled >> regexp simply matches or does not match a given array of bytes (plus a >> marker position and BOL/EOL flags, but you get the idea), and that >> changing the flags results in a new and different compiled regexp. > > I have actually already created a rust crate for emacs multibyte en/decod= ing > (currently very spare: https://docs.rs/emacs-multibyte) for use with the = regexp > compiler I'm implementing in rust and hoping to introduce to emacs as an > optional dependency (https://github.com/cosmicexplorer/emacs-regexp). On = the > face of it, I'm under the impression that multibyte encoding still produc= es > a deterministic representation for any string of bytes not mappable to UT= F-8, as > those characters are just stored in the high bits not used by UTF-8 (I am= really > unfamiliar with charsets and char-tables though). However, this is indeed= what > most regex engines I am aware of do in order to support UTF-8 and still j= ust > operate on an array of bytes, although encoding unicode-aware character c= lasses > still requires dropping down to a char-by-char loop instead of fancy SIMD= , which > is why many offer the ability to turn off unicode-aware character classes > (although I think we can do better for performance of non-ASCII users tha= n > this, perhaps by enabling the compilation of character classes to a speci= fic > language/unicode subset to enable further optimization). I think I'll have to have a look at those projects before I'm able to say anything useful. >> Let me say how impressed I am by all this. You're probably aware of it, >> but I'm a huge fan of the rx and xr functions (the latter is available >> from ELPA). I believe the regexp engine in Emacs is somewhat outdated >> and in need of a replacement, and I think this abstraction plus xr are >> good first steps. I'm saying this because it might help to think of a >> compiled regexp as a GC-able reference to a DFA (or, my preference, an >> NFA), and then we could do cool things like performing a Levenshtein >> transformation on a regexp to capture typos. > > Hyperscan (which recently moved to a proprietary license, very sad but we= can > still steal ideas from them) offers edit distance and Hamming distance as > experimental extensions to their pattern compiler (see > https://intel.github.io/hyperscan/dev-reference/compilation.html#extended= -parameters). I believe Apache Lucene implements the relevant mechanisms as well. It didn't look too complicated when I looked at it. > `xr' is SUPER cool and I've just installed it, thanks so much for > enlightening me! Possibly a better read syntax than the pattern string. Mattias, how hard would it be to make the rx + xr combination the identity, or at least idempotent? >> (My main problem with the current regexp implementation is it's >> intimately tied to the gap representation of buffers, and it'd be easier >> to hack on that if we could "just" switch to a representation-agnostic, >> slow, Lisp implementation of regexps. Or translate the regexp itself to >> Lisp which we could JIT...) > > While rust is unfortunately less portable than C due to its reliance on L= LVM > (and therefore unsuitable for Emacs), rustc_codegen_gcc uses libgccjit to > achieve this, and the gccrs frontend requires lots more work but will be = very > exciting upon release. I was hoping to use something like rust in order t= o more > easily achieve competitive performance, but I absolutely agree it makes t= hings > much less hackable than a lisp implementation, which is definitely JITabl= e as > you say. There would be a lot of benefit to that approach if it could app= roach > the performance of a native implementation of NFA/DFA compilation, especi= ally as > it would enable further development of regexp features from Emacs itself = instead > of an external dependency in an unfamiliar language. I personally have a = kind of > grudge for complex reasons against people pushing an external module in r= ust as > the only way to achieve performant code from an interpreted language, and= Emacs > incorporating a JIT internally makes it feasible to do the opposite, whic= h is > vaguely exciting to me. I will definitely consider this approach before > proposing a rust dependency to emacs, although I will likely continue to = work on > the rust engine for fun. I take it you're aware of remacs, which hasn't seen development lately but was a good way to discover the problems, in particular, of integrating rust with the Emacs GC. There might be some salvageable code there, though I'm not sure about the copyright status. > Really appreciate your feedback and will be checking out the igc patches = from > now on, insight into memory allocation is one thing I wish regexp engines > exposed more to users. Excellent! > [2. text/x-patch; lisp-regex-and-match-objects.patch]... + + /* `translate' being NULL produces an invalid object from + `make_lisp_ptr()'. */ + if (bufp->translate) + { + const Lisp_Object translate =3D +=09make_lisp_ptr ((void *) &bufp->translate, Lisp_Vectorlike); + dump_field_lv (ctx, &out, bufp, &translate, WEIGHT_NORMAL); + } + I don't think you can do that: dump_field_lv will write to a random address in the dump based on the difference between bufp and "&translate", which is unpredictable as it depends on the stack layout of unrelated variables. IIUC, that is. Can you use dump_field_lv_rawptr instead? If you want to remove the purecopy support, here's a patch to do that: diff --git a/src/alloc.c b/src/alloc.c index 7cf711da0dc..798fdf7077c 100644 --- a/src/alloc.c +++ b/src/alloc.c @@ -5895,73 +5895,6 @@ make_pure_c_string (const char *data, ptrdiff_t ncha= rs) =20 static Lisp_Object purecopy (Lisp_Object obj); =20 -static struct re_pattern_buffer -make_pure_re_pattern_buffer (const struct re_pattern_buffer *bufp) -{ - struct re_pattern_buffer pure =3D *bufp; - - pure.buffer =3D pure_alloc (bufp->used, -1); - memcpy (pure.buffer, bufp->buffer, bufp->used); - pure.allocated =3D bufp->used; - /* The fastmap is sometimes unset when using the `regexp_cache'. */ - if (pure.fastmap) - { - pure.fastmap =3D pure_alloc (FASTMAP_SIZE, -1); - memcpy (pure.fastmap, bufp->fastmap, FASTMAP_SIZE); - } - /* `translate' being NULL produces an invalid object from - `make_lisp_ptr()'. */ - if (pure.translate) - { - const Lisp_Object translate =3D - make_lisp_ptr (bufp->translate, Lisp_Vectorlike); - pure.translate =3D XLP (purecopy (translate)); - } - - return pure; -} - -static Lisp_Object -make_pure_regexp (const struct Lisp_Regexp *r) -{ - struct Lisp_Regexp *pure =3D pure_alloc (sizeof *pure, Lisp_Vectorlike); - *pure =3D *r; - - pure->pattern =3D purecopy (r->pattern); - pure->whitespace_regexp =3D purecopy (r->whitespace_regexp); - pure->syntax_table =3D purecopy (r->syntax_table); - pure->default_match_target =3D purecopy (r->default_match_target); - pure->buffer =3D make_pure_re_pattern_buffer (&r->buffer); - - return make_lisp_ptr (pure, Lisp_Vectorlike); -} - -static struct re_registers -make_pure_re_registers (const struct re_registers *regs) -{ - struct re_registers pure =3D *regs; - - ptrdiff_t reg_size =3D regs->num_regs * sizeof (ptrdiff_t); - pure.start =3D pure_alloc (reg_size, -1); - memcpy (pure.start, regs->start, reg_size); - pure.end =3D pure_alloc (reg_size, -1); - memcpy (pure.end, regs->end, reg_size); - - return pure; -} - -static Lisp_Object -make_pure_match (const struct Lisp_Match *m) -{ - struct Lisp_Match *pure =3D pure_alloc (sizeof *pure, Lisp_Vectorlike); - *pure =3D *m; - - pure->haystack =3D purecopy (m->haystack); - pure->regs =3D make_pure_re_registers (&m->regs); - - return make_lisp_ptr (pure, Lisp_Vectorlike); -} - /* Return a cons allocated from pure space. Give it pure copies of CAR as car and CDR as cdr. */ =20 @@ -6120,9 +6053,9 @@ purecopy (Lisp_Object obj) SBYTES (obj), STRING_MULTIBYTE (obj)); else if (REGEXP_P (obj)) - obj =3D make_pure_regexp (XREGEXP (obj)); + goto pin; else if (MATCH_P (obj)) - obj =3D make_pure_match (XMATCH (obj)); + goto pin; else if (HASH_TABLE_P (obj)) { struct Lisp_Hash_Table *table =3D XHASH_TABLE (obj); @@ -6133,6 +6066,7 @@ purecopy (Lisp_Object obj) { /* Instead, add the hash table to the list of pinned objects, so that it will be marked during GC. */ + pin: struct pinned_object *o =3D xmalloc (sizeof *o); o->object =3D obj; o->next =3D pinned_objects; Thanks again. I'm hoping the maintainers will have something positive to say about this patch. Pip