From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Pip Cet <pipcet@protonmail.com>
Newsgroups: gmane.emacs.devel
Subject: Re: [PATCH] add compiled regexp primitive lisp object
Date: Tue, 06 Aug 2024 18:18:44 +0000
Message-ID: <87ttfxtszi.fsf@protonmail.com>
References: <DX05HWbmw37nQEdshSH46KqWNPDwwK5xaKBcjZr7XBx293dQtroHGmYssDsOVTcMgLLqHbbrvkBpJcWuBgXsDHkvhAHLsTPyQ2j5wZpeL8M=@hypnicjerk.ai>
 <87mslxxddk.fsf@protonmail.com>
 <5He97LtsyeyQoTLU7d91oP2CLO8s_2afdgcNxozsFjzu8qGbB_7nXmsZL5O6Ej7K-tuEmngCcPKJpDAjxeKz4jk1DvqSUbdOLpw5U1vo1SY=@hypnicjerk.ai>
 <87le1avopk.fsf@protonmail.com>
 <2LOLmIp1X8w4CGbqq3qDrzmKVA0KzYNL1N9lBtWdB-MtEv9oCuYgJMYprG170wMPjYxeQImAmWOPatGTTl4KxZMlptNo9A9hnHt84vdN9EA=@hypnicjerk.ai>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214";
	logging-data="35442"; mail-complaints-to="usenet@ciao.gmane.io"
Cc: "emacs-devel@gnu.org" <emacs-devel@gnu.org>,
 =?utf-8?Q?Mattias_Engdeg=C3=A5rd?= <mattiase@acm.org>
To: Danny McClanahan <dmcc2@hypnicjerk.ai>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Tue Aug 06 20:27:03 2024
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane-mx.org
Original-Received: from lists.gnu.org ([209.51.188.17])
	by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
	(Exim 4.92)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>)
	id 1sbOtj-00093H-7r
	for ged-emacs-devel@m.gmane-mx.org; Tue, 06 Aug 2024 20:27:03 +0200
Original-Received: from localhost ([::1] helo=lists1p.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.90_1)
	(envelope-from <emacs-devel-bounces@gnu.org>)
	id 1sbOt3-000140-GV; Tue, 06 Aug 2024 14:26:21 -0400
Original-Received: from eggs.gnu.org ([2001:470:142:3::10])
 by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <pipcet@protonmail.com>)
 id 1sbOlz-0001AB-6Z
 for emacs-devel@gnu.org; Tue, 06 Aug 2024 14:19:06 -0400
Original-Received: from mail-40134.protonmail.ch ([185.70.40.134])
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <pipcet@protonmail.com>)
 id 1sbOlv-00051e-3Z
 for emacs-devel@gnu.org; Tue, 06 Aug 2024 14:19:02 -0400
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=protonmail.com;
 s=protonmail3; t=1722968328; x=1723227528;
 bh=7Uzkp0L9HryhrgZ7jjBqfYXRbheYglkQgbnzCxFRv/Y=;
 h=Date:To:From:Cc:Subject:Message-ID:In-Reply-To:References:
 Feedback-ID:From:To:Cc:Date:Subject:Reply-To:Feedback-ID:
 Message-ID:BIMI-Selector;
 b=vz0BtzcRIbQase4L9Lgm7jokmBWgAoGT8JBkg40KhcOs+7vn0YzO497WcdenDBbOk
 Hq1MW7lkta1uDqSUinXQxYSqi28+4aq2TKkGCm7TIxTwlUGALl656GCm4WItmCBZc9
 zaeOKOcnsmE73cQxLcgyW0sLvauUeQM2iWdTVTpUdVEfeIJWsV1wjUGxYJd7SzF14R
 JCTHMOQ1UEh0uJrlen8AlEo3xwalOH9WjGa8yUFZjPteRhmakT369wfHGVXEINzIqD
 FBAhWGA8tqjYTTuiJoqH9s64OBDRp+eDnf8qJwU2Cj5wtteRPVqCgW5gTLlsEyVEPs
 eEDCa2VYuWRfQ==
In-Reply-To: <2LOLmIp1X8w4CGbqq3qDrzmKVA0KzYNL1N9lBtWdB-MtEv9oCuYgJMYprG170wMPjYxeQImAmWOPatGTTl4KxZMlptNo9A9hnHt84vdN9EA=@hypnicjerk.ai>
Feedback-ID: 112775352:user:proton
X-Pm-Message-ID: 74798b05b3dde162fb91797e022d1f004658fd72
Received-SPF: pass client-ip=185.70.40.134; envelope-from=pipcet@protonmail.com;
 helo=mail-40134.protonmail.ch
X-Spam_score_int: -20
X-Spam_score: -2.1
X-Spam_bar: --
X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1,
 DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001,
 RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H4=0.001, RCVD_IN_MSPIKE_WL=0.001,
 RCVD_IN_VALIDITY_CERTIFIED_BLOCKED=0.001, RCVD_IN_VALIDITY_RPBL_BLOCKED=0.001,
 SPF_HELO_PASS=-0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no
X-Spam_action: no action
X-Mailman-Approved-At: Tue, 06 Aug 2024 14:26:19 -0400
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
 <mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
 <mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org
Xref: news.gmane.io gmane.emacs.devel:322456
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/322456>

"Danny McClanahan" <dmcc2@hypnicjerk.ai> writes:

>> On Monday, August 5th, 2024 at 13:55, Pip Cet <pipcet@protonmail.com> wr=
ote:
>>
>> Thank you for that wonderful summary. FWIW, I agree that a regexp
>> should have as much as possible baked into it.
>
> Upon checking out how the `translate' table is used, I actually think it =
would
> be possible and even reasonable to avoid keeping track of it at all after
> compiling the regexp, if the regexp is instead compiled into a form that =
relies
> on character sets instead of translating input chars (this is how the rus=
t regex
> crate handles its equivalent of `case-fold-search'). This would also remo=
ve
> another blocker to enabling SIMD searching in some ways, although pattern=
s with
> case- or char-folding also tend to foil SIMD search techniques (maybe som=
ething
> clever can be done about this).
>
>> I'm a sworn enemy of pure space (see scratch/no-pure-space), but I'm
>> pretty sure you can make purecopy simply return the object in question
>> (rather than going through the generic vectorlike code) and things
>> should work. But I haven't tried it...
>
> Just tried it and pdumping went into what appears to be an infinite loop =
in the
> gc when allocating buffer space for a vectorlike (which may be the
> pseudovec

I misremembered, you have to pin objects you want to pretend you
purecopied, just as we do for normal hash tables.

> regexp type, and may be because of the `make-regexp' call I added in imag=
e.el to
> test exactly this) and then errored after about 10 seconds. From
> scratch/no-pure-space I see that pure space relates to zero-length vector=
s,
> which may have helped me to figure out why `make_lisp_ptr()' does not ret=
urn
> Qnil when passed a null reference and Lisp_Vectorlike as arguments.

We use tagged NULL pointers for known-to-be-invalid objects in a few
places, actually, so we can't make make_lisp_ptr return nil...

>> Just to reveal my conflict of interest, I'm currently playing with the
>> scratch/igc branch, and unaligned pointers are a big no for the MPS
>> garbage collector we're using in that branch.
>
> Could I also ask you to please explain the importance of and requirements=
 for
> alignment there? I believe alignment is useful for tagged pointers as it =
gives
> you more tag bits (and I see scratch/igc uses those), but is it just that=
?

Most CPU architectures require things to be naturally aligned; x86 and
x86_64 are the exception there (though, of course, there's an exception
to the exception, because the %xmmN registers can be accessed using
special instructions that require alignment). MPS also requires all
pointers to be stored in aligned words, but that's a separate
requirement.

The short story is you want to pass -alignof(type) to the pure_alloc
function or Apple users will get segfaults.

>> Do you have commit access yet? That would be worth fixing independently
>> of the rest of the patch (and getting the copyright assignment in order
>> takes a while, so it's probably best to start early). IIUC, it doesn't
>> have any effect at present, but the code is confusing...
>
> I have no problems with assigning copyright to the FSF (I even see that t=
hey no
> longer require physical mail) but will wait for maintainers to resolve wh=
ether
> this set of changes is considered useful, for which I am in no hurry. In =
any
> case, I consider splitting up large changes into easily-reviewable chunks=
 one of
> my specialties and will definitely be able to separate this sort of chang=
e from
> the rest of the patch. Even if this change is not accepted, it's good to =
hear
> that changes are welcome here and I would easily be able to look at impro=
ving
> them now that I'm vaguely up to speed.

That sounds very good.  I don't think there's any particular rush, to be
honest, though I would like to reiterate how helpful it would be for me
to be able to disable the regex-emacs.c code and use a slow Lisp
implementation instead.

>> > My impression is that the new `dump_cold_bytes()' performs exactly the=
se two
>> > operations in a row, so I thought it would be the same result. I will =
review
>> > this again.
>>
>> COI again, we also touched this code in the scratch/igc branch so all I
>> see was more work :-) I actually like the cold_bytes abstraction, though=
.
>
> Not a problem! I was pleasantly surprised at how hackable low-level thing=
s like
> the pdumper and gc are but very easy to consider scratch/igc in the futur=
e now
> that you've mentioned it ^_^!

Excellent.

>> > > I'm not sure we should be using the a=3Db pattern here. Maybe one da=
y we
>> > > want to read such objects and it'd be easier to do that if the outpu=
t
>> > > syntax were more Lisp-like (in fact, I'd convert the whole thing to =
a
>> > > hash table and print that...)
>>
>> I said "hash table" above, but even that's too complicated. Just cons up
>> a list :-)
>
> It would be wonderful to consider a read syntax for this, although as I n=
oted in
> my other reply to Andrea the current imposition of `make-regexp' imposes
> a natural barrier with which to employ more time-intensive compilation
> techniques e.g. DFAs. Most regexp engines employ some form of heuristics =
for
> this anyway, although I personally don't really like how that produces
> nonintuitive performance behavior and would strongly prefer just requirin=
g
> an explicit compilation step (perhaps with optimization flags to denote a=
nything
> that's hard for heuristics to guess).

Well, I don't see why the read syntax can't be the uncompiled pattern
for now, plus the extra data. Compiling it in lread.c would probably not
be a huge performance issue.  But see below for an alternative: use xr.

>> > > And that look-up happens at the time make-regexp is called, not when=
 it
>> > > is used, right? That might be surprising if case-fold-search changes=
,
>> > > for example.
>> >
>> > Yes, the lookup from the current buffer occurs at compile time, and th=
at case
>> > folding stays with the regexp after compilation (see top). I also had =
the same
>> > concern regarding this wording; thanks for raising it as truly the int=
eractions
>> > between environment variables are some of the more confusing parts of
>> > regex behavior.
>>
>> > As above, I think the existing `regexp_cache' work has actually done a=
 great job
>> > at nailing down what invalidates a regexp, so I think we can extend th=
at
>> > framework to ensure compiled regexps have all of the configuration set=
 at
>> > compile time to ensure intuitive behavior.
>>
>> Indeed. Again, my preference is to pretend the world is UTF-8, because
>> charset interactions make my head hurt, and declare that a compiled
>> regexp simply matches or does not match a given array of bytes (plus a
>> marker position and BOL/EOL flags, but you get the idea), and that
>> changing the flags results in a new and different compiled regexp.
>
> I have actually already created a rust crate for emacs multibyte en/decod=
ing
> (currently very spare: https://docs.rs/emacs-multibyte) for use with the =
regexp
> compiler I'm implementing in rust and hoping to introduce to emacs as an
> optional dependency (https://github.com/cosmicexplorer/emacs-regexp). On =
the
> face of it, I'm under the impression that multibyte encoding still produc=
es
> a deterministic representation for any string of bytes not mappable to UT=
F-8, as
> those characters are just stored in the high bits not used by UTF-8 (I am=
 really
> unfamiliar with charsets and char-tables though). However, this is indeed=
 what
> most regex engines I am aware of do in order to support UTF-8 and still j=
ust
> operate on an array of bytes, although encoding unicode-aware character c=
lasses
> still requires dropping down to a char-by-char loop instead of fancy SIMD=
, which
> is why many offer the ability to turn off unicode-aware character classes
> (although I think we can do better for performance of non-ASCII users tha=
n
> this, perhaps by enabling the compilation of character classes to a speci=
fic
> language/unicode subset to enable further optimization).

I think I'll have to have a look at those projects before I'm able to
say anything useful.

>> Let me say how impressed I am by all this. You're probably aware of it,
>> but I'm a huge fan of the rx and xr functions (the latter is available
>> from ELPA). I believe the regexp engine in Emacs is somewhat outdated
>> and in need of a replacement, and I think this abstraction plus xr are
>> good first steps. I'm saying this because it might help to think of a
>> compiled regexp as a GC-able reference to a DFA (or, my preference, an
>> NFA), and then we could do cool things like performing a Levenshtein
>> transformation on a regexp to capture typos.
>
> Hyperscan (which recently moved to a proprietary license, very sad but we=
 can
> still steal ideas from them) offers edit distance and Hamming distance as
> experimental extensions to their pattern compiler (see
> https://intel.github.io/hyperscan/dev-reference/compilation.html#extended=
-parameters).

I believe Apache Lucene implements the relevant mechanisms as well. It
didn't look too complicated when I looked at it.

> `xr' is SUPER cool and I've just installed it, thanks so much for
> enlightening me!

Possibly a better read syntax than the pattern string.  Mattias, how
hard would it be to make the rx + xr combination the identity, or at
least idempotent?

>> (My main problem with the current regexp implementation is it's
>> intimately tied to the gap representation of buffers, and it'd be easier
>> to hack on that if we could "just" switch to a representation-agnostic,
>> slow, Lisp implementation of regexps. Or translate the regexp itself to
>> Lisp which we could JIT...)
>
> While rust is unfortunately less portable than C due to its reliance on L=
LVM
> (and therefore unsuitable for Emacs), rustc_codegen_gcc uses libgccjit to
> achieve this, and the gccrs frontend requires lots more work but will be =
very
> exciting upon release. I was hoping to use something like rust in order t=
o more
> easily achieve competitive performance, but I absolutely agree it makes t=
hings
> much less hackable than a lisp implementation, which is definitely JITabl=
e as
> you say. There would be a lot of benefit to that approach if it could app=
roach
> the performance of a native implementation of NFA/DFA compilation, especi=
ally as
> it would enable further development of regexp features from Emacs itself =
instead
> of an external dependency in an unfamiliar language. I personally have a =
kind of
> grudge for complex reasons against people pushing an external module in r=
ust as
> the only way to achieve performant code from an interpreted language, and=
 Emacs
> incorporating a JIT internally makes it feasible to do the opposite, whic=
h is
> vaguely exciting to me. I will definitely consider this approach before
> proposing a rust dependency to emacs, although I will likely continue to =
work on
> the rust engine for fun.

I take it you're aware of remacs, which hasn't seen development lately
but was a good way to discover the problems, in particular, of
integrating rust with the Emacs GC.  There might be some salvageable
code there, though I'm not sure about the copyright status.

> Really appreciate your feedback and will be checking out the igc patches =
from
> now on, insight into memory allocation is one thing I wish regexp engines
> exposed more to users.

Excellent!

> [2. text/x-patch; lisp-regex-and-match-objects.patch]...

+
+  /* `translate' being NULL produces an invalid object from
+     `make_lisp_ptr()'. */
+  if (bufp->translate)
+    {
+      const Lisp_Object translate =3D
+=09make_lisp_ptr ((void *) &bufp->translate, Lisp_Vectorlike);
+      dump_field_lv (ctx, &out, bufp, &translate, WEIGHT_NORMAL);
+    }
+

I don't think you can do that: dump_field_lv will write to a random
address in the dump based on the difference between bufp and
"&translate", which is unpredictable as it depends on the stack layout
of unrelated variables.  IIUC, that is.

Can you use dump_field_lv_rawptr instead?

If you want to remove the purecopy support, here's a patch to do that:

diff --git a/src/alloc.c b/src/alloc.c
index 7cf711da0dc..798fdf7077c 100644
--- a/src/alloc.c
+++ b/src/alloc.c
@@ -5895,73 +5895,6 @@ make_pure_c_string (const char *data, ptrdiff_t ncha=
rs)
=20
 static Lisp_Object purecopy (Lisp_Object obj);
=20
-static struct re_pattern_buffer
-make_pure_re_pattern_buffer (const struct re_pattern_buffer *bufp)
-{
-  struct re_pattern_buffer pure =3D *bufp;
-
-  pure.buffer =3D pure_alloc (bufp->used, -1);
-  memcpy (pure.buffer, bufp->buffer, bufp->used);
-  pure.allocated =3D bufp->used;
-  /* The fastmap is sometimes unset when using the `regexp_cache'. */
-  if (pure.fastmap)
-    {
-      pure.fastmap =3D pure_alloc (FASTMAP_SIZE, -1);
-      memcpy (pure.fastmap, bufp->fastmap, FASTMAP_SIZE);
-    }
-  /* `translate' being NULL produces an invalid object from
-     `make_lisp_ptr()'. */
-  if (pure.translate)
-    {
-      const Lisp_Object translate =3D
-       make_lisp_ptr (bufp->translate, Lisp_Vectorlike);
-      pure.translate =3D XLP (purecopy (translate));
-    }
-
-  return pure;
-}
-
-static Lisp_Object
-make_pure_regexp (const struct Lisp_Regexp *r)
-{
-  struct Lisp_Regexp *pure =3D pure_alloc (sizeof *pure, Lisp_Vectorlike);
-  *pure =3D *r;
-
-  pure->pattern =3D purecopy (r->pattern);
-  pure->whitespace_regexp =3D purecopy (r->whitespace_regexp);
-  pure->syntax_table =3D purecopy (r->syntax_table);
-  pure->default_match_target =3D purecopy (r->default_match_target);
-  pure->buffer =3D make_pure_re_pattern_buffer (&r->buffer);
-
-  return make_lisp_ptr (pure, Lisp_Vectorlike);
-}
-
-static struct re_registers
-make_pure_re_registers (const struct re_registers *regs)
-{
-  struct re_registers pure =3D *regs;
-
-  ptrdiff_t reg_size =3D regs->num_regs * sizeof (ptrdiff_t);
-  pure.start =3D pure_alloc (reg_size, -1);
-  memcpy (pure.start, regs->start, reg_size);
-  pure.end =3D pure_alloc (reg_size, -1);
-  memcpy (pure.end, regs->end, reg_size);
-
-  return pure;
-}
-
-static Lisp_Object
-make_pure_match (const struct Lisp_Match *m)
-{
-  struct Lisp_Match *pure =3D pure_alloc (sizeof *pure, Lisp_Vectorlike);
-  *pure =3D *m;
-
-  pure->haystack =3D purecopy (m->haystack);
-  pure->regs =3D make_pure_re_registers (&m->regs);
-
-  return make_lisp_ptr (pure, Lisp_Vectorlike);
-}
-
 /* Return a cons allocated from pure space.  Give it pure copies
    of CAR as car and CDR as cdr.  */
=20
@@ -6120,9 +6053,9 @@ purecopy (Lisp_Object obj)
                            SBYTES (obj),
                            STRING_MULTIBYTE (obj));
   else if (REGEXP_P (obj))
-    obj =3D make_pure_regexp (XREGEXP (obj));
+    goto pin;
   else if (MATCH_P (obj))
-    obj =3D make_pure_match (XMATCH (obj));
+    goto pin;
   else if (HASH_TABLE_P (obj))
     {
       struct Lisp_Hash_Table *table =3D XHASH_TABLE (obj);
@@ -6133,6 +6066,7 @@ purecopy (Lisp_Object obj)
         {
           /* Instead, add the hash table to the list of pinned objects,
              so that it will be marked during GC.  */
+       pin:
           struct pinned_object *o =3D xmalloc (sizeof *o);
           o->object =3D obj;
           o->next =3D pinned_objects;

Thanks again. I'm hoping the maintainers will have something positive to
say about this patch.

Pip