From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: [PATCH] add compiled regexp primitive lisp object
Date: Tue, 06 Aug 2024 18:57:36 +0300
Message-ID: <861q31prtb.fsf@gnu.org>
In-Reply-To: <2LOLmIp1X8w4CGbqq3qDrzmKVA0KzYNL1N9lBtWdB-MtEv9oCuYgJMYprG170wMPjYxeQImAmWOPatGTTl4KxZMlptNo9A9hnHt84vdN9EA=@hypnicjerk.ai> (message from Danny McClanahan on Tue, 06 Aug 2024 15:15:31 +0000)
To: Danny McClanahan
Cc: pipcet@protonmail.com, emacs-devel@gnu.org

> Date: Tue, 06 Aug 2024 15:15:31 +0000
> From: Danny McClanahan
> Cc: "emacs-devel@gnu.org"
>
> Upon checking out how the `translate' table is used, I actually think it would
> be possible and even reasonable to avoid keeping track of it at all after
> compiling the regexp, if the regexp is instead compiled into a form that relies
> on character sets instead of translating input chars (this is how the rust regex
> crate handles its equivalent of `case-fold-search'). This would also remove
> another blocker to enabling SIMD searching in some ways, although patterns with
> case- or char-folding also tend to foil SIMD search techniques (maybe something
> clever can be done about this).

Are you aware that in Emacs case-conversion uses the buffer-local
case-tables if they are defined?  This means that the case-table in
effect when the regex was compiled and when it is executed could be
different, and your character set trick will not do what the user
expects.
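To make the concern concrete, here is a minimal sketch in Python (not Emacs code) of the character-set approach. Everything in it is an assumption made for illustration: the case tables are toy dictionaries, and `fold_to_charsets`, `ascii_case_table` and `turkish_like_case_table` are hypothetical helpers, not part of Emacs or of the rust regex crate. It only shows that a pattern expanded under one case table keeps that table's equivalences after a different table comes into effect.

```python
import re
import string


def ascii_case_table():
    """Toy case table: each ASCII letter maps to the set of its other-case forms."""
    table = {}
    for lower, upper in zip(string.ascii_lowercase, string.ascii_uppercase):
        table[lower] = {upper}
        table[upper] = {lower}
    return table


def turkish_like_case_table():
    """Toy table where 'i' pairs with dotted capital I and 'I' with dotless i."""
    table = ascii_case_table()
    table["i"] = {"\u0130"}   # i <-> LATIN CAPITAL LETTER I WITH DOT ABOVE
    table["\u0130"] = {"i"}
    table["\u0131"] = {"I"}   # LATIN SMALL LETTER DOTLESS I <-> I
    table["I"] = {"\u0131"}
    return table


def fold_to_charsets(literal, case_table):
    """Expand a literal string into character classes of case variants.

    The case table is consulted only here, at "compile" time; the
    resulting pattern no longer depends on any table.
    """
    parts = []
    for ch in literal:
        variants = sorted({ch} | case_table.get(ch, set()))
        if len(variants) == 1:
            parts.append(re.escape(ch))
        else:
            parts.append("[" + "".join(re.escape(v) for v in variants) + "]")
    return "".join(parts)


# Compiled while the ASCII-style table is in effect.
compiled_early = re.compile(fold_to_charsets("file", ascii_case_table()))

# What the user would expect if the Turkish-like table were honored at match time.
compiled_late = re.compile(fold_to_charsets("file", turkish_like_case_table()))

text = "F\u0130LE"
print("early pattern:", compiled_early.pattern, bool(compiled_early.fullmatch(text)))
print("late pattern: ", compiled_late.pattern, bool(compiled_late.fullmatch(text)))
```

Under these assumptions the two patterns disagree on the same text, so a compiled regexp object would either have to record the case table it was built with or be invalidated when the table changes.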
> > > As above, I think the existing `regexp_cache' work has actually done a great job
> > > at nailing down what invalidates a regexp, so I think we can extend that
> > > framework to ensure compiled regexps have all of the configuration set at
> > > compile time to ensure intuitive behavior.
> >
> > Indeed. Again, my preference is to pretend the world is UTF-8, because
> > charset interactions make my head hurt, and declare that a compiled
> > regexp simply matches or does not match a given array of bytes (plus a
> > marker position and BOL/EOL flags, but you get the idea), and that
> > changing the flags results in a new and different compiled regexp.
>
> I have actually already created a rust crate for emacs multibyte en/decoding
> (currently very spare: https://docs.rs/emacs-multibyte) for use with the regexp
> compiler I'm implementing in rust and hoping to introduce to emacs as an
> optional dependency (https://github.com/cosmicexplorer/emacs-regexp). On the
> face of it, I'm under the impression that multibyte encoding still produces
> a deterministic representation for any string of bytes not mappable to UTF-8, as
> those characters are just stored in the high bits not used by UTF-8 (I am really
> unfamiliar with charsets and char-tables though). However, this is indeed what
> most regex engines I am aware of do in order to support UTF-8 and still just
> operate on an array of bytes, although encoding unicode-aware character classes
> still requires dropping down to a char-by-char loop instead of fancy SIMD, which
> is why many offer the ability to turn off unicode-aware character classes
> (although I think we can do better for performance of non-ASCII users than
> this, perhaps by enabling the compilation of character classes to a specific
> language/unicode subset to enable further optimization).

I don't understand this part at all.
As long as you are dealing with Emacs buffers and strings, there's only one "encoding": the internal multibyte representation of characters Emacs uses. It is a superset of UTF-8, and you need never care about anything else. (Well, there's unibyte strings, but that can be addressed later.)
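For readers unfamiliar with that representation, the following Python sketch illustrates what "superset of UTF-8" means in practice. It is an illustration, not Emacs source: the constants and the two-byte form for raw bytes reflect my reading of `src/character.h` and should be verified there, and the names `emacs_encode_char` and `raw_byte_to_char` are hypothetical. Code points in the Unicode range encode exactly as in UTF-8; code points beyond it use a five-byte form; raw 8-bit bytes are mapped to the top of the character range and stored as two-byte sequences that never occur in valid UTF-8.

```python
# Assumed constants (check against src/character.h before relying on them).
MAX_4_BYTE_CHAR = 0x1FFFFF
MAX_5_BYTE_CHAR = 0x3FFF7F
MAX_CHAR = 0x3FFFFF
BYTE8_OFFSET = 0x3FFF00  # raw byte b (0x80..0xFF) <-> character b + BYTE8_OFFSET


def raw_byte_to_char(b):
    """Map a raw 8-bit byte (0x80..0xFF) to its character code."""
    if not 0x80 <= b <= 0xFF:
        raise ValueError("raw bytes are 0x80..0xFF")
    return b + BYTE8_OFFSET


def emacs_encode_char(c):
    """Encode one character code into the internal multibyte byte sequence."""
    if c < 0:
        raise ValueError("negative character code")
    if c < 0x80:
        return bytes([c])
    if c < 0x800:
        return bytes([0xC0 | (c >> 6), 0x80 | (c & 0x3F)])
    if c < 0x10000:
        return bytes([0xE0 | (c >> 12),
                      0x80 | ((c >> 6) & 0x3F),
                      0x80 | (c & 0x3F)])
    if c <= MAX_4_BYTE_CHAR:
        return bytes([0xF0 | (c >> 18),
                      0x80 | ((c >> 12) & 0x3F),
                      0x80 | ((c >> 6) & 0x3F),
                      0x80 | (c & 0x3F)])
    if c <= MAX_5_BYTE_CHAR:
        # Five-byte form with a fixed 0xF8 lead byte; not valid UTF-8.
        return bytes([0xF8,
                      0x80 | ((c >> 18) & 0x0F),
                      0x80 | ((c >> 12) & 0x3F),
                      0x80 | ((c >> 6) & 0x3F),
                      0x80 | (c & 0x3F)])
    if c <= MAX_CHAR:
        # Raw byte: two-byte form with lead byte 0xC0 or 0xC1, which
        # valid UTF-8 never uses (it would be an overlong encoding).
        b = c - BYTE8_OFFSET
        return bytes([0xC0 | ((b >> 6) & 0x01), 0x80 | (b & 0x3F)])
    raise ValueError("character code out of range")


for ch in "a\u00e9\u20ac\U0001F600":
    print(hex(ord(ch)), emacs_encode_char(ord(ch)).hex(), ch.encode("utf-8").hex())
print("raw 0xFF ->", emacs_encode_char(raw_byte_to_char(0xFF)).hex())
```

If this reading is right, a matcher that walks the byte array can treat ordinary text exactly as UTF-8 and only needs extra cases for the `0xC0`/`0xC1` and `0xF8` lead bytes.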