unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed
From: Ihor Radchenko <yantar92@posteo.net>
To: "Mattias Engdegård" <mattias.engdegard@gmail.com>
Cc: 63225@debbugs.gnu.org
Subject: bug#63225: Using char table-based finite-state machines as a replacement for re-search-forward (was: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c))
Date: Tue, 09 May 2023 08:36:25 +0000	[thread overview]
Message-ID: <87ednp296e.fsf@localhost> (raw)
In-Reply-To: <67AAB661-8B27-4D09-BF0D-6B76ABB54477@gmail.com>

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

>> (I now start thinking that it might be more efficient to create a bunch
>> or char tables and step over them to match "regexp", just like any
>> finite automaton would)
>
> I wish elisp were fast enough that we could do that in general.

I tried the following simple implementation for searching via char
tables:

(defun yant/make-char-re (string)
  "Create a regexp matching STRING using char tables.
Return entry char table. The values are either char tables or return value for
terminals.  This is a dumb version not trying to account for substring cycles."
  (let (root (table (make-char-table 'regexp-table)) (idx 0))
    (setq root table)
    (set-char-table-range root t root)
    (seq-map
     (lambda (char)
       (let ((next-table (make-char-table 'regexp-table root)))
	 (aset table char next-table)
         (setq table next-table)))
     string)
    (set-char-table-range table t t)
    root))

(defun yant/search-forward (regexp-table)
  "Move point after next string matching REGEXP-TABLE."
  (let ((pos (point-marker)))
    (while (and (< pos (point-max))
		(char-table-p
		 (setq regexp-table
		       (aref regexp-table (char-after pos))))
		(move-marker pos (1+ pos))))
    (unless (char-table-p regexp-table) (goto-char pos))))

DEFUN ("auto-search-forward", Fauto_search_forward, Sauto_search_forward, 1, 1, 0,
       doc: /* Search forward using CHAR-TABLE. (WIP) */)
  (Lisp_Object table)
{
  CHECK_TYPE (CHAR_TABLE_P (table), Qchar_table_p, table);
  ptrdiff_t pos_byte = PT_BYTE;
  ptrdiff_t pos_char = PT;
  if (pos_byte < BEGV_BYTE || pos_byte >= ZV_BYTE)
    return Qnil;
  table = CHAR_TABLE_REF (table, FETCH_CHAR (pos_byte));
  while (pos_byte < ZV_BYTE && CHAR_TABLE_P (table))
    {
      pos_byte += next_char_len(pos_byte);
      pos_char ++;
      table = CHAR_TABLE_REF (table, FETCH_CHAR (pos_byte));
    }
  if (pos_byte >= ZV_BYTE)
    return Qnil;
  else
    {
      SET_PT (pos_char);
      return Qt;
    }
}

(setq yant/test (yant/make-char-re ":ID:"))
(benchmark-run 10 (org-with-wide-buffer (goto-char 1) (while (yant/search-forward yant/test))))
(benchmark-run 10 (org-with-wide-buffer (goto-char 1) (while (auto-search-forward yant/test))))
(benchmark-run 10 (org-with-wide-buffer (goto-char (point-min)) (while (re-search-forward ":ID:" nil t))))

- yant/search-forward :: (24.677463432 0 0.0)
- re-search-forward   :: (0.569693718 0 0.0)
- auto-search-forward   :: (1.177098129 0 0.0)

The lisp version suffers from (1) Extra redirection when checking
function symbol value; (2) moving markers, as there is no efficient way
to move point forward from Elisp; `forward-char' is even slower.

auto-search-forward is twice slower than re-search-forward, but I expect
this to change if we ask for a more complex regexp. auto-search-forward
will not care about the complexity of the finite-state machine provided.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>





  reply	other threads:[~2023-05-09  8:36 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-02  7:37 bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) Ihor Radchenko
2023-05-02 14:33 ` Mattias Engdegård
2023-05-02 15:25   ` Eli Zaretskii
2023-05-02 15:28     ` Mattias Engdegård
2023-05-02 17:30       ` Eli Zaretskii
2023-05-02 17:58         ` Ihor Radchenko
2023-05-02 16:14   ` Ihor Radchenko
2023-05-02 21:00     ` Mattias Engdegård
2023-05-02 21:21       ` Ihor Radchenko
2023-05-03  8:39         ` Mattias Engdegård
2023-05-03  9:36           ` Ihor Radchenko
2023-05-03 13:59             ` Mattias Engdegård
2023-05-03 15:05               ` Ihor Radchenko
2023-05-03 15:20                 ` Mattias Engdegård
2023-05-03 16:02                   ` Ihor Radchenko
2023-05-04  9:24                     ` Mattias Engdegård
2023-05-05 10:31                       ` Ihor Radchenko
2023-05-05 16:26                         ` Mattias Engdegård
2023-05-06 13:38                           ` Ihor Radchenko
2023-05-07 10:32                             ` Mattias Engdegård
2023-05-08 11:58                               ` Ihor Radchenko
2023-05-08 18:21                                 ` Mattias Engdegård
2023-05-08 19:38                                   ` Ihor Radchenko
2023-05-08 19:53                                     ` Mattias Engdegård
2023-05-09  8:36                                       ` Ihor Radchenko [this message]
2023-05-09 12:02                                       ` Ihor Radchenko
2023-05-09 15:05                                         ` Mattias Engdegård
2023-05-09 15:56                                           ` Ihor Radchenko
2023-05-09 15:57                                             ` Mattias Engdegård
2023-05-07 12:45                           ` Mattias Engdegård
2023-05-08 13:56                             ` Ihor Radchenko
2023-05-08 19:32                               ` Mattias Engdegård
2023-05-08 19:44                                 ` Ihor Radchenko
2023-05-04 12:58               ` Ihor Radchenko
2023-05-02 23:36   ` Po Lu via Bug reports for GNU Emacs, the Swiss army knife of text editors

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87ednp296e.fsf@localhost \
    --to=yantar92@posteo.net \
    --cc=63225@debbugs.gnu.org \
    --cc=mattias.engdegard@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).