From: "Mattias Engdegård" <mattiase@acm.org>
To: emacs-devel <emacs-devel@gnu.org>
Subject: Regexp bytecode disassembler
Date: Fri, 20 Mar 2020 13:27:35 +0100 [thread overview]
Message-ID: <4201DF24-BCC4-4C08-9857-38207B7C10B4@acm.org> (raw)
[-- Attachment #1: Type: text/plain, Size: 595 bytes --]
It is sometimes useful to inspect the generated regexp engine bytecode, but doing so currently involves recompiling with REGEX_EMACS_DEBUG configured, setting an internal variable using a debugger, and watching data scrolling past on stderr.
This patch adds a lisp-based regexp bytecode disassembler which is always available without any runtime cost to the regexp engine. It is mainly a tool for maintainers but curious users may find it useful as well. It has already revealed one bug in the regexp compiler, now fixed (f189e5dc10).
Any objections against it being added (to master)?
[-- Attachment #2: 0001-Add-regexp-bytecode-disassembler.patch --]
[-- Type: application/octet-stream, Size: 15711 bytes --]
From f7eba6c64acee0d9a445d6e139e86750e8d36bd5 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Tue, 17 Mar 2020 15:09:10 +0100
Subject: [PATCH] Add regexp bytecode disassembler
This is mainly intended for debugging and understanding the regexp
engine, and for anyone curious about how their regexps really are
interpreted and why they are slow.
* src/search.c (Fregexp_bytecode): New function.
(syms_of_search): Define the symbol.
* lisp/emacs-lisp/regexp-disasm.el (regexp-disasm--classes)
(regexp-disasm--syntax-codes, regexp-disasm, regexp-disassemble): New.
---
lisp/emacs-lisp/regexp-disasm.el | 267 +++++++++++++++++++++++++++++++
src/search.c | 20 +++
2 files changed, 287 insertions(+)
create mode 100644 lisp/emacs-lisp/regexp-disasm.el
diff --git a/lisp/emacs-lisp/regexp-disasm.el b/lisp/emacs-lisp/regexp-disasm.el
new file mode 100644
index 0000000000..aee26db381
--- /dev/null
+++ b/lisp/emacs-lisp/regexp-disasm.el
@@ -0,0 +1,267 @@
+;;; regexp-disasm -- disassemble regexp bytecode -*- lexical-binding: t -*-
+
+;; Copyright (C) 2020 Free Software Foundation, Inc.
+
+;; This file is part of GNU Emacs.
+
+;; GNU Emacs is free software: you can redistribute it and/or modify
+;; it under the terms of the GNU General Public License as published by
+;; the Free Software Foundation, either version 3 of the License, or
+;; (at your option) any later version.
+
+;; GNU Emacs is distributed in the hope that it will be useful,
+;; but WITHOUT ANY WARRANTY; without even the implied warranty of
+;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+;; GNU General Public License for more details.
+
+;; You should have received a copy of the GNU General Public License
+;; along with GNU Emacs. If not, see <https://www.gnu.org/licenses/>.
+
+;;; Commentary:
+
+;; Decode compiled Emacs regexp bytecode and pretty-print.
+
+(defconst regexp-disasm--classes
+ [word lower punct space upper multibyte alpha alnum graph print blank]
+ "Vector of character classes, corresponding to BIT_* in regex-emacs.c.")
+
+(defconst regexp-disasm--syntax-codes
+ [whitespace punctuation word symbol
+ open-parenthesis close-parenthesis expression-prefix string-quote
+ paired-delimiter escape character-quote comment-start comment-end
+ inherit comment-delimiter string-delimiter]
+ "Vector of syntax codes, corresponding to enum syntaxcode in syntax.h
+but using names from `rx'.")
+
+;;;###autoload
+(defun regexp-disasm (regexp)
+ "Disassemble REGEXP; return list of instructions.
+Instructions are on the form (ADDRESS . INSTR) where ADDRESS is the
+byte offset and INSTR an S-expression representing the instruction."
+ (let* ((bc (regexp-bytecode regexp))
+ (read-u16 (lambda (ofs) (+ (aref bc ofs)
+ (ash (aref bc (1+ ofs)) 8))))
+ (read-u24 (lambda (ofs) (+ (aref bc ofs)
+ (ash (aref bc (+ ofs 1)) 8)
+ (ash (aref bc (+ ofs 2)) 16))))
+ (read-s16 (lambda (ofs) (let ((x (funcall read-u16 ofs)))
+ (- x (ash (logand x #x8000) 1)))))
+ (mb (multibyte-string-p regexp))
+ (len (length bc))
+ (i 0)
+ (entries nil))
+ (while (< i len)
+ (let* ((opcode (aref bc i))
+ (entry-and-size
+ (pcase opcode
+ (0 (cons 'no-op 1))
+ (1 (cons 'succeed 1))
+ (2 (let* ((nbytes (aref bc (1+ i)))
+ (raw (substring bc (+ i 2) (+ i 2 nbytes)))
+ (str (if mb
+ (decode-coding-string raw 'utf-8-emacs)
+ raw)))
+ (cons (list 'exactn str) (+ 2 nbytes))))
+ (3 (cons 'nonl 1)) ; `anychar' is a misnomer
+ ((or 4 5) ; `charset', `notcharset'
+ (let* ((negated (= opcode 5))
+ (bitmap-len-raw (aref bc (1+ i)))
+ (bitmap-len (logand bitmap-len-raw #x7f))
+ (have-range-table (/= (logand bitmap-len-raw #x80) 0))
+ (npairs (if have-range-table
+ (funcall read-u16 (+ i 2 bitmap-len 2))
+ 0))
+ (bitmap-pairs nil)
+ (classes nil)
+ (pairs nil))
+
+ ;; Convert the bitmap to ranges.
+ (let ((first nil))
+ (dotimes (j (* bitmap-len 8))
+ (if (/= (logand (aref bc (+ i 2 (ash j -3)))
+ (ash 1 (logand j 7)))
+ 0)
+ (unless first
+ (setq first j))
+ (when first
+ (push (cons first (1- j)) bitmap-pairs)
+ (setq first nil))))
+ (when first
+ (push (cons first (1- (* bitmap-len 8))) bitmap-pairs)))
+
+ (when have-range-table
+ ;; Convert class bits to list of classes.
+ (let ((class-bits (funcall read-u16 (+ i 2 bitmap-len))))
+ (dotimes (j (length regexp-disasm--classes))
+ (when (/= (logand class-bits (ash 1 j)) 0)
+ (push (aref regexp-disasm--classes j) classes))))
+
+ ;; Read range table.
+ (dotimes (j npairs)
+ (let* ((ofs (+ i 2 bitmap-len 4 (* j 6)))
+ (from (funcall read-u24 ofs))
+ (to (funcall read-u24 (+ ofs 3))))
+ (push (cons from to) pairs))))
+
+ (cons (list (if negated 'notcharset 'charset)
+ (reverse bitmap-pairs)
+ (reverse classes)
+ (reverse pairs))
+ (+ 2 bitmap-len
+ (if have-range-table 4 0) (* npairs 6)))))
+ (6 (cons (list 'start-memory (aref bc (1+ i)))
+ 2))
+ (7 (cons (list 'stop-memory (aref bc (1+ i)))
+ 2))
+ (8 (cons (list 'duplicate (aref bc (1+ i)))
+ 2))
+ (9 (cons 'begline 1))
+ (10 (cons 'endline 1))
+ (11 (cons 'begbuf 1))
+ (12 (cons 'endbuf 1))
+ (13 (cons (list 'jump
+ (+ (funcall read-s16 (1+ i)) i 3))
+ 3))
+ (14 (cons (list 'on-failure-jump
+ (+ (funcall read-s16 (1+ i)) i 3))
+ 3))
+ (15 (cons (list 'on-failure-keep-string-jump
+ (+ (funcall read-s16 (1+ i)) i 3))
+ 3))
+ (16 (cons (list 'on-failure-jump-loop
+ (+ (funcall read-s16 (1+ i)) i 3))
+ 3))
+ (17 (cons (list 'on-failure-jump-nastyloop
+ (+ (funcall read-s16 (1+ i)) i 3))
+ 3))
+ (18 (cons (list 'on-failure-jump-smart
+ (+ (funcall read-s16 (1+ i)) i 3))
+ 3))
+ (19 (cons (list 'succeed-n
+ (+ (funcall read-s16 (1+ i)) i 3)
+ (funcall read-u16 (+ i 3)))
+ 5))
+ (20 (cons (list 'jump-n
+ (+ (funcall read-s16 (1+ i)) i 3)
+ (funcall read-u16 (+ i 3)))
+ 5))
+ (21 (cons (list 'set-number-at
+ (+ (funcall read-s16 (1+ i)) i 3)
+ (funcall read-u16 (+ i 3)))
+ 5))
+ (22 (cons 'wordbeg 1))
+ (23 (cons 'wordend 1))
+ (24 (cons 'wordbound 1))
+ (25 (cons 'notwordbound 1))
+ (26 (cons 'symbeg 1))
+ (27 (cons 'symend 1))
+ (28 (cons (list 'syntaxspec
+ (aref regexp-disasm--syntax-codes
+ (aref bc (1+ i))))
+ 2))
+ (29 (cons (list 'notsyntaxspec
+ (aref regexp-disasm--syntax-codes
+ (aref bc (1+ i))))
+ 2))
+ (30 (cons 'at-dot 1))
+ (31 (cons (list 'categoryspec (aref bc (1+ i)))
+ 2))
+ (32 (cons (list 'notcategoryspec (aref bc (1+ i)))
+ 2))
+ (_ (error "bad opcode at ofs %d: 0x%02x" i opcode))))
+ (entry (car entry-and-size))
+ (size (cdr entry-and-size)))
+ (push (cons i entry) entries)
+ (setq i (+ i size))))
+ (reverse entries)))
+
+;;;###autoload
+(defun regexp-disassemble (regexp)
+ "Compile REGEXP and print the disassembled bytecode."
+ (interactive "XRegexp (evaluated): ")
+ (let* ((instructions (regexp-disasm regexp))
+ (control-chars '((?\b . ?b)
+ (?\t . ?t)
+ (?\n . ?n)
+ (?\v . ?v)
+ (?\f . ?f)
+ (?\r . ?r)
+ (?\e . ?e)))
+ (quote-byte (lambda (c)
+ (let ((esc (assq c control-chars)))
+ (cond (esc (string ?\\ (cdr esc)))
+ ((or (<= c 31) (<= #x7f c #xff))
+ (format "\\%03o" c))
+ (t (string c))))))
+ (quote-string-char (lambda (c)
+ (let ((esc (assq c control-chars)))
+ (cond (esc (string ?\\ (cdr esc)))
+ ((memq c '(?\\ ?\"))
+ (string ?\\ c))
+ ((or (<= c 31) (= c 127)
+ (>= c #x3fff80))
+ (format "\\%03o" (logand c #xff)))
+ (t (string c))))))
+ (quote-string (lambda (s)
+ (concat "\""
+ (mapconcat quote-string-char s "")
+ "\"")))
+ (quote-range (lambda (range quote-char)
+ (if (eq (car range) (cdr range))
+ (funcall quote-char (car range))
+ (format "%s-%s"
+ (funcall quote-char (car range))
+ (funcall quote-char (cdr range))))))
+ (quote-range-uni
+ (lambda (range) (funcall quote-range range quote-byte)))
+ (quote-range-multi
+ (lambda (range) (funcall quote-range range #'string))))
+ (with-output-to-temp-buffer "*Regexp-disassemble*"
+ (with-current-buffer standard-output
+ (insert (format "Disassembly of regexp %s\n\n"
+ (funcall quote-string regexp)))
+ (dolist (instr instructions)
+ (let* ((addr (car instr))
+ (op (cdr instr))
+ (line
+ (pcase op
+ ((pred symbolp) (symbol-name op))
+ (`(exactn ,s) (format "exactn %s" (funcall quote-string s)))
+ (`(,(or 'charset 'notcharset)
+ ,bitmap-pairs ,classes ,pairs)
+ ;; FIXME: Maybe use a less ambiguous charset syntax.
+ ;; Avoid ranges when endpoints are adjacent.
+ ;; What to do about metachars like `]' and `-'?
+ (concat (format "%s [%s]"
+ (car op)
+ (mapconcat quote-range-uni
+ bitmap-pairs ""))
+ (and classes
+ (concat " [:"
+ (mapconcat
+ #'symbol-name classes ",")
+ ":]"))
+ (and pairs
+ (concat " ["
+ (mapconcat quote-range-multi pairs "")
+ "]"))))
+ (`(,(or 'start-memory 'stop-memory 'duplicate) ,n)
+ (format "%s group %d" (car op) n))
+ (`(,(or 'jump 'on-failure-jump 'on-failure-keep-string-jump
+ 'on-failure-jump-loop 'on-failure-jump-nastyloop
+ 'on-failure-jump-smart)
+ ,dest)
+ (format "%s to %d" (car op) dest))
+ (`(,(or 'succeed-n 'jump-n 'set-number-at)
+ ,dest ,val)
+ (format "%s addr %d, value %d" (car op) dest val))
+ (`(,(or 'syntaxspec 'notsyntaxspec) ,syn)
+ (format "%s %s" (car op) syn))
+ (`(,(or 'categoryspec 'notcategoryspec) ,ch)
+ (format "%s '%c'" (car op) ch))
+ (_ (error "unrecognised opcode: %S" op)))))
+ (insert (format "%5d %s\n" addr line))))))))
+
+(provide 'regexp-disasm)
+
+;;; regexp-disasm.el ends here
diff --git a/src/search.c b/src/search.c
index 818bb4af24..ee5c0432f0 100644
--- a/src/search.c
+++ b/src/search.c
@@ -3106,6 +3106,25 @@ record_unwind_save_match_data (void)
Fmatch_data (Qnil, Qnil, Qnil));
}
+DEFUN ("regexp-bytecode", Fregexp_bytecode, Sregexp_bytecode, 1, 1, 0,
+ doc: /* Compile REGEXP and return the compiled bytecode.
+The compiled bytecode is returned as a string; its format is
+implementation-dependent. Cached bytecode may be returned if available. */)
+ (Lisp_Object regexp)
+{
+ CHECK_STRING (regexp);
+ struct regexp_cache *cache_entry = compile_pattern (
+ regexp,
+ NULL,
+ (!NILP (BVAR (current_buffer, case_fold_search))
+ ? BVAR (current_buffer, case_canon_table) : Qnil),
+ false,
+ true);
+ struct re_pattern_buffer *pb = &cache_entry->buf;
+ return make_unibyte_string (pb->buffer, pb->used);
+}
+
+
/* Quote a string to deactivate reg-expr chars */
DEFUN ("regexp-quote", Fregexp_quote, Sregexp_quote, 1, 1, 0,
@@ -3400,6 +3419,7 @@ syms_of_search (void)
defsubr (&Smatch_end);
defsubr (&Smatch_data);
defsubr (&Sset_match_data);
+ defsubr (&Sregexp_bytecode);
defsubr (&Sregexp_quote);
defsubr (&Snewline_cache_check);
--
2.21.1 (Apple Git-122.3)
next reply other threads:[~2020-03-20 12:27 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-03-20 12:27 Mattias Engdegård [this message]
2020-03-20 12:58 ` Regexp bytecode disassembler Andreas Schwab
2020-03-20 14:34 ` Eli Zaretskii
2020-03-21 16:52 ` Mattias Engdegård
2020-03-21 19:19 ` Eli Zaretskii
2020-03-21 20:16 ` Štěpán Němec
2020-03-21 20:30 ` Eli Zaretskii
2020-03-21 20:40 ` Mattias Engdegård
2020-03-21 20:44 ` Štěpán Němec
2020-03-22 14:12 ` Eli Zaretskii
2020-03-22 14:43 ` Štěpán Němec
2020-03-22 16:55 ` Eli Zaretskii
2020-03-22 17:16 ` Štěpán Němec
2020-03-22 17:30 ` Eli Zaretskii
2020-03-22 18:34 ` Paul Eggert
2020-03-22 18:36 ` Dmitry Gutov
2020-03-21 20:50 ` Dmitry Gutov
2020-03-21 23:58 ` Drew Adams
2020-03-22 0:02 ` Drew Adams
2020-03-21 20:37 ` Mattias Engdegård
2020-03-22 3:28 ` Eli Zaretskii
2020-03-22 9:23 ` Mattias Engdegård
2020-03-22 10:37 ` Eli Zaretskii
2020-03-22 15:24 ` Mattias Engdegård
2020-03-22 17:06 ` Eli Zaretskii
2020-03-22 19:39 ` Mattias Engdegård
2020-03-22 20:12 ` Eli Zaretskii
2020-03-22 20:22 ` Corwin Brust
2020-03-20 15:39 ` Pip Cet
2020-03-21 16:56 ` Mattias Engdegård
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4201DF24-BCC4-4C08-9857-38207B7C10B4@acm.org \
--to=mattiase@acm.org \
--cc=emacs-devel@gnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this external index
https://git.savannah.gnu.org/cgit/emacs.git
https://git.savannah.gnu.org/cgit/emacs/org-mode.git
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.