unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed
From: Ihor Radchenko <yantar92@posteo.net>
To: "Mattias Engdegård" <mattias.engdegard@gmail.com>
Cc: 63225@debbugs.gnu.org
Subject: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
Date: Mon, 08 May 2023 19:38:59 +0000	[thread overview]
Message-ID: <875y923964.fsf@localhost> (raw)
In-Reply-To: <C037659A-DBF7-4071-A54F-84CC1D7D18AF@gmail.com>

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

>> 		 (save-excursion
>> 		   (beginning-of-line 0)
>> 		   (not (looking-at-p "[[:blank:]]*$"))))
>
> I wonder if that last part isn't better written as
>
>   (save-excursion
>     (forward-line 0)   ; faster than beginning-of-line

Fair point. It is probably worth looking through Org sources and
replacing all those `beginning-of-line' and `end-of-line' calls. I doubt
that we even intend to use fields for real.

>     (skip-chars-forward "[:blank:]") ; faster than looking-at-p
>     (not (eolp)))   ; very cheap

Hmm. I feel confused.
Does this imply that simple regexps like

(looking-at-p (rx (seq bol (zero-or-more (any "\t ")) "#" (or " " eol))))

should better be implemented as the following?

(and (bolp)
     (skip-chars-forward " \t")
     (eq (char-after) ?#) (forward-char)
     (or (eolp) (eq (char-after) ?\s)))

(I now start thinking that it might be more efficient to create a bunch
or char tables and step over them to match "regexp", just like any
finite automaton would)

> But yes, I sort of understand what you are getting at (except the business with the MODE parameter which is still a bit mysterious to me).

MODE parameter is used because we constrain what kinds of syntax
elements are allowed inside other. For example, see
`org-element-object-restrictions'. And within
`org-element--current-element', MODE is used, for example, to constrain
planning/property drawer to be only the first child of a parent heading,
parsed earlier.

>> [now part of the giant rx]
>> 
>> (rx line-start (0+ (any ?\s ?\t))
>>      ":" (1+ (any ?- ?_ word)) ":"
>>      (0+ (any ?\s ?\t)) line-end)
>
> Any reason you don't capture the part between the colons here, so that you don't need to match it later on?

That would demand all the callers of `org-element-drawer-parser' to set
match data appropriately. Which is doable, but headache for maintenance.
We sometimes do call parsers explicitly not from inside
`org-element--current-element'.

>> But why? Aren't (in word ?_ ?-) and (or word ?_ ?-) not the same?
>
> "[-_[:word:]]" and "\\w\\|[_-]" indeed match the same thing but they don't generate the same regexp bytecode -- the former is faster. (In this case rx makes a literal translation to those strings but we should probably make it optimise to the faster regexp.)
>
> There is a regexp disassembler for the really curious but it doesn't come with Emacs.

I really hope that I did not need to do all these workarounds specific to
current implementation pitfalls of Emacs regexp compiler.

>>> Maybe, but you still cons each time. (And remember that the plist-get equality funarg is new in Emacs 29.)
>> 
>> Sure it does.
>> It is just one of the variable parts of Org syntax that might be
>> changed. There are ways to make this into constant, but it is a fragile
>> area of the code that I do not want to touch without a reason.
>> (Especially given that I am not familiar with org-list.el)
>
> So it's fine to use elisp constructs new in Emacs 29 in Org? Then the line
>
>  ;; Package-Requires: ((emacs "26.1"))
>
> in org.el should probably be updated, right?

Nope. It is just that the link I shared is for WIP branch I am
developing using Emacs master. I will ensure backwards compatibility
later. For example, by converting `plist-get' to `assoc'.

>> I hope we do. If only Emacs had a way to define `case-fold-search' right
>> within the regexp itself.
>
> I would like that too, but changing that isn't easy.

I am sure that it is easy.
For example, regexp command may optionally accept a vector with first
element being regexp and second element setting a flag for
`case-fold-search'.

Of course, it is just one, maybe not the best way to implement this.

My suggestion in https://debbugs.gnu.org/cgi/bugreport.cgi?bug=63225#56
is also compatible with this kind of approach.

> By the way, it seems that org-element-node-property-parser binds case-fold-search without actually using it. Bug?

It actually does nothing given that `org-property-re' does not contain
letters. I will remove it.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>





  reply	other threads:[~2023-05-08 19:38 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-02  7:37 bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) Ihor Radchenko
2023-05-02 14:33 ` Mattias Engdegård
2023-05-02 15:25   ` Eli Zaretskii
2023-05-02 15:28     ` Mattias Engdegård
2023-05-02 17:30       ` Eli Zaretskii
2023-05-02 17:58         ` Ihor Radchenko
2023-05-02 16:14   ` Ihor Radchenko
2023-05-02 21:00     ` Mattias Engdegård
2023-05-02 21:21       ` Ihor Radchenko
2023-05-03  8:39         ` Mattias Engdegård
2023-05-03  9:36           ` Ihor Radchenko
2023-05-03 13:59             ` Mattias Engdegård
2023-05-03 15:05               ` Ihor Radchenko
2023-05-03 15:20                 ` Mattias Engdegård
2023-05-03 16:02                   ` Ihor Radchenko
2023-05-04  9:24                     ` Mattias Engdegård
2023-05-05 10:31                       ` Ihor Radchenko
2023-05-05 16:26                         ` Mattias Engdegård
2023-05-06 13:38                           ` Ihor Radchenko
2023-05-07 10:32                             ` Mattias Engdegård
2023-05-08 11:58                               ` Ihor Radchenko
2023-05-08 18:21                                 ` Mattias Engdegård
2023-05-08 19:38                                   ` Ihor Radchenko [this message]
2023-05-08 19:53                                     ` Mattias Engdegård
2023-05-09  8:36                                       ` bug#63225: Using char table-based finite-state machines as a replacement for re-search-forward (was: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)) Ihor Radchenko
2023-05-09 12:02                                       ` bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) Ihor Radchenko
2023-05-09 15:05                                         ` Mattias Engdegård
2023-05-09 15:56                                           ` Ihor Radchenko
2023-05-09 15:57                                             ` Mattias Engdegård
2023-05-07 12:45                           ` Mattias Engdegård
2023-05-08 13:56                             ` Ihor Radchenko
2023-05-08 19:32                               ` Mattias Engdegård
2023-05-08 19:44                                 ` Ihor Radchenko
2023-05-04 12:58               ` Ihor Radchenko
2023-05-02 23:36   ` Po Lu via Bug reports for GNU Emacs, the Swiss army knife of text editors

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=875y923964.fsf@localhost \
    --to=yantar92@posteo.net \
    --cc=63225@debbugs.gnu.org \
    --cc=mattias.engdegard@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).