From: Ihor Radchenko
Newsgroups: gmane.emacs.bugs
Subject: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
Date: Mon, 08 May 2023 19:38:59 +0000
Message-ID: <875y923964.fsf@localhost>
To: Mattias Engdegård
Cc: 63225@debbugs.gnu.org

Mattias Engdegård writes:

>> (save-excursion
>>   (beginning-of-line 0)
>>   (not (looking-at-p "[[:blank:]]*$"))))
>
> I wonder if that last part isn't better written as
>
>   (save-excursion
>     (forward-line 0)                  ; faster than beginning-of-line

Fair point. It is probably worth looking through Org sources and
replacing all those `beginning-of-line' and `end-of-line' calls. I doubt
that we even intend to use fields for real.

>     (skip-chars-forward "[:blank:]")  ; faster than looking-at-p
>     (not (eolp)))                     ; very cheap

Hmm. I feel confused. Does this imply that simple regexps like

  (looking-at-p (rx (seq bol (zero-or-more (any "\t ")) "#" (or " " eol))))

should better be implemented as the following?

  (and (bolp)
       (skip-chars-forward " \t")
       (eq (char-after) ?#)
       (progn (forward-char) t)        ; `forward-char' itself returns nil
       (or (eolp) (eq (char-after) ?\s)))

(I now start thinking that it might be more efficient to create a bunch
of char tables and step over them to match "regexp", just like any
finite automaton would)

> But yes, I sort of understand what you are getting at (except the
> business with the MODE parameter which is still a bit mysterious to me).

The MODE parameter is used because we constrain which kinds of syntax
elements are allowed inside others. For example, see
`org-element-object-restrictions'.
And within `org-element--current-element', MODE is used, for example, to
constrain planning/property drawer to be only the first child of a
parent heading, parsed earlier.

>> [now part of the giant rx]
>>
>> (rx line-start (0+ (any ?\s ?\t))
>>     ":" (1+ (any ?- ?_ word)) ":"
>>     (0+ (any ?\s ?\t)) line-end)
>
> Any reason you don't capture the part between the colons here, so that
> you don't need to match it later on?

That would require all the callers of `org-element-drawer-parser' to set
match data appropriately, which is doable but a headache for
maintenance. We sometimes do call parsers explicitly, not from inside
`org-element--current-element'.

>> But why? Aren't (in word ?_ ?-) and (or word ?_ ?-) not the same?
>
> "[-_[:word:]]" and "\\w\\|[_-]" indeed match the same thing but they
> don't generate the same regexp bytecode -- the former is faster. (In
> this case rx makes a literal translation to those strings but we should
> probably make it optimise to the faster regexp.)
>
> There is a regexp disassembler for the really curious but it doesn't
> come with Emacs.

I really wish I did not need to do all these workarounds specific to the
current implementation pitfalls of the Emacs regexp compiler.

>>> Maybe, but you still cons each time. (And remember that the plist-get
>>> equality funarg is new in Emacs 29.)
>>
>> Sure it does.
>> It is just one of the variable parts of Org syntax that might be
>> changed. There are ways to make this into constant, but it is a fragile
>> area of the code that I do not want to touch without a reason.
>> (Especially given that I am not familiar with org-list.el)
>
> So it's fine to use elisp constructs new in Emacs 29 in Org? Then the line
>
> ;; Package-Requires: ((emacs "26.1"))
>
> in org.el should probably be updated, right?

Nope. It is just that the link I shared is for a WIP branch I am
developing using Emacs master. I will ensure backwards compatibility
later. For example, by converting `plist-get' to `assoc'.
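
As a rough illustration of the backwards-compatibility point above, here
is a minimal sketch of a key lookup with a custom equality predicate that
does not rely on the three-argument `plist-get' from Emacs 29. The name
`my/plist-get-compat' is hypothetical and is not part of Org or Emacs;
the trailing `assoc' form only shows the alist-based alternative under
the assumption that the data can be stored as an alist.

```elisp
;;; -*- lexical-binding: t; -*-

;; Hypothetical helper (an assumption for illustration, not Org code):
;; walk PLIST two cells at a time and compare each key to PROP using
;; PREDICATE, defaulting to `eq' like the classic two-argument `plist-get'.
(defun my/plist-get-compat (plist prop &optional predicate)
  "Return the value of PROP in PLIST, comparing keys with PREDICATE."
  (let ((pred (or predicate #'eq))
        (tail plist)
        (found nil))
    (while (and (consp tail) (not found))
      (if (funcall pred (car tail) prop)
          (setq found tail)          ; remember the cell holding the key
        (setq tail (cddr tail))))    ; skip over this key/value pair
    (cadr found)))                   ; nil when the key is absent

;; String keys need `equal', which `eq' would not match reliably.
(my/plist-get-compat '("a" 1 "b" 2) "b" #'equal)

;; Alist alternative: `assoc' compares keys with `equal' by default.
(cdr (assoc "b" '(("a" . 1) ("b" . 2))))
```

Both forms avoid any dependency on Emacs 29; which one fits better
depends on whether the callers can accept a change of data layout.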

>> I hope we do. If only Emacs had a way to define `case-fold-search' right
>> within the regexp itself.
>
> I would like that too, but changing that isn't easy.

I am sure that it is easy. For example, a regexp command may optionally
accept a vector whose first element is the regexp and whose second
element sets a flag for `case-fold-search'. Of course, it is just one,
maybe not the best, way to implement this. My suggestion in
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=63225#56 is also
compatible with this kind of approach.

> By the way, it seems that org-element-node-property-parser binds
> case-fold-search without actually using it. Bug?

It actually does nothing given that `org-property-re' does not contain
letters. I will remove it.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at .
Support Org development at ,
or support my work at