Regexp bytecode disassembler

all messages for Emacs-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed

* Regexp bytecode disassembler
@ 2020-03-20 12:27 Mattias Engdegård
  2020-03-20 12:58 ` Andreas Schwab
                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Mattias Engdegård @ 2020-03-20 12:27 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 595 bytes --]

It is sometimes useful to inspect the generated regexp engine bytecode, but doing so currently involves recompiling with REGEX_EMACS_DEBUG configured, setting an internal variable using a debugger, and watching data scrolling past on stderr.

This patch adds a lisp-based regexp bytecode disassembler which is always available without any runtime cost to the regexp engine. It is mainly a tool for maintainers but curious users may find it useful as well. It has already revealed one bug in the regexp compiler, now fixed (f189e5dc10).

Any objections against it being added (to master)?


[-- Attachment #2: 0001-Add-regexp-bytecode-disassembler.patch --]
[-- Type: application/octet-stream, Size: 15711 bytes --]

From f7eba6c64acee0d9a445d6e139e86750e8d36bd5 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Tue, 17 Mar 2020 15:09:10 +0100
Subject: [PATCH] Add regexp bytecode disassembler

This is mainly intended for debugging and understanding the regexp
engine, and for anyone curious about how their regexps really are
interpreted and why they are slow.

* src/search.c (Fregexp_bytecode): New function.
(syms_of_search): Define the symbol.
* lisp/emacs-lisp/regexp-disasm.el (regexp-disasm--classes)
(regexp-disasm--syntax-codes, regexp-disasm, regexp-disassemble): New.
---
 lisp/emacs-lisp/regexp-disasm.el | 267 +++++++++++++++++++++++++++++++
 src/search.c                     |  20 +++
 2 files changed, 287 insertions(+)
 create mode 100644 lisp/emacs-lisp/regexp-disasm.el

diff --git a/lisp/emacs-lisp/regexp-disasm.el b/lisp/emacs-lisp/regexp-disasm.el
new file mode 100644
index 0000000000..aee26db381
--- /dev/null
+++ b/lisp/emacs-lisp/regexp-disasm.el
@@ -0,0 +1,267 @@
+;;; regexp-disasm -- disassemble regexp bytecode  -*- lexical-binding: t -*-
+
+;; Copyright (C) 2020 Free Software Foundation, Inc.
+
+;; This file is part of GNU Emacs.
+
+;; GNU Emacs is free software: you can redistribute it and/or modify
+;; it under the terms of the GNU General Public License as published by
+;; the Free Software Foundation, either version 3 of the License, or
+;; (at your option) any later version.
+
+;; GNU Emacs is distributed in the hope that it will be useful,
+;; but WITHOUT ANY WARRANTY; without even the implied warranty of
+;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+;; GNU General Public License for more details.
+
+;; You should have received a copy of the GNU General Public License
+;; along with GNU Emacs.  If not, see <https://www.gnu.org/licenses/>.
+
+;;; Commentary:
+
+;; Decode compiled Emacs regexp bytecode and pretty-print.
+
+(defconst regexp-disasm--classes
+  [word lower punct space upper multibyte alpha alnum graph print blank]
+  "Vector of character classes, corresponding to BIT_* in regex-emacs.c.")
+
+(defconst regexp-disasm--syntax-codes
+  [whitespace punctuation word symbol
+   open-parenthesis close-parenthesis expression-prefix string-quote
+   paired-delimiter escape character-quote comment-start comment-end
+   inherit comment-delimiter string-delimiter]
+  "Vector of syntax codes, corresponding to enum syntaxcode in syntax.h
+but using names from `rx'.")
+
+;;;###autoload
+(defun regexp-disasm (regexp)
+  "Disassemble REGEXP; return list of instructions.
+Instructions are on the form (ADDRESS . INSTR) where ADDRESS is the
+byte offset and INSTR an S-expression representing the instruction."
+  (let* ((bc (regexp-bytecode regexp))
+         (read-u16 (lambda (ofs) (+ (aref bc ofs)
+                                    (ash (aref bc (1+ ofs)) 8))))
+         (read-u24 (lambda (ofs) (+ (aref bc ofs)
+                                    (ash (aref bc (+ ofs 1)) 8)
+                                    (ash (aref bc (+ ofs 2)) 16))))
+         (read-s16 (lambda (ofs) (let ((x (funcall read-u16 ofs)))
+                                   (- x (ash (logand x #x8000) 1)))))
+         (mb (multibyte-string-p regexp))
+         (len (length bc))
+         (i 0)
+         (entries nil))
+    (while (< i len)
+      (let* ((opcode (aref bc i))
+             (entry-and-size
+               (pcase opcode
+                 (0 (cons 'no-op 1))
+                 (1 (cons 'succeed 1))
+                 (2 (let* ((nbytes (aref bc (1+ i)))
+                           (raw (substring bc (+ i 2) (+ i 2 nbytes)))
+                           (str (if mb
+                                    (decode-coding-string raw 'utf-8-emacs)
+                                  raw)))
+                      (cons (list 'exactn str) (+ 2 nbytes))))
+                 (3 (cons 'nonl 1))     ; `anychar' is a misnomer
+                 ((or 4 5)              ; `charset', `notcharset'
+                  (let* ((negated (= opcode 5))
+                         (bitmap-len-raw (aref bc (1+ i)))
+                         (bitmap-len (logand bitmap-len-raw #x7f))
+                         (have-range-table (/= (logand bitmap-len-raw #x80) 0))
+                         (npairs (if have-range-table
+                                     (funcall read-u16 (+ i 2 bitmap-len 2))
+                                   0))
+                         (bitmap-pairs nil)
+                         (classes nil)
+                         (pairs nil))
+
+                    ;; Convert the bitmap to ranges.
+                    (let ((first nil))
+                      (dotimes (j (* bitmap-len 8))
+                        (if (/= (logand (aref bc (+ i 2 (ash j -3)))
+                                        (ash 1 (logand j 7)))
+                                0)
+                            (unless first
+                              (setq first j))
+                          (when first
+                            (push (cons first (1- j)) bitmap-pairs)
+                            (setq first nil))))
+                      (when first
+                        (push (cons first (1- (* bitmap-len 8))) bitmap-pairs)))
+
+                    (when have-range-table
+                      ;; Convert class bits to list of classes.
+                      (let ((class-bits (funcall read-u16 (+ i 2 bitmap-len))))
+                        (dotimes (j (length regexp-disasm--classes))
+                          (when (/= (logand class-bits (ash 1 j)) 0)
+                            (push (aref regexp-disasm--classes j) classes))))
+
+                      ;; Read range table.
+                      (dotimes (j npairs)
+                        (let* ((ofs (+ i 2 bitmap-len 4 (* j 6)))
+                               (from (funcall read-u24 ofs))
+                               (to   (funcall read-u24 (+ ofs 3))))
+                          (push (cons from to) pairs))))
+
+                    (cons (list (if negated 'notcharset 'charset)
+                                (reverse bitmap-pairs)
+                                (reverse classes)
+                                (reverse pairs))
+                          (+ 2 bitmap-len
+                             (if have-range-table 4 0) (* npairs 6)))))
+                 (6 (cons (list 'start-memory (aref bc (1+ i)))
+                          2))
+                 (7 (cons (list 'stop-memory (aref bc (1+ i)))
+                          2))
+                 (8 (cons (list 'duplicate (aref bc (1+ i)))
+                          2))
+                 (9 (cons 'begline 1))
+                 (10 (cons 'endline 1))
+                 (11 (cons 'begbuf 1))
+                 (12 (cons 'endbuf 1))
+                 (13 (cons (list 'jump
+                                 (+ (funcall read-s16 (1+ i)) i 3))
+                           3))
+                 (14 (cons (list 'on-failure-jump
+                                 (+ (funcall read-s16 (1+ i)) i 3))
+                           3))
+                 (15 (cons (list 'on-failure-keep-string-jump
+                                 (+ (funcall read-s16 (1+ i)) i 3))
+                           3))
+                 (16 (cons (list 'on-failure-jump-loop
+                                 (+ (funcall read-s16 (1+ i)) i 3))
+                           3))
+                 (17 (cons (list 'on-failure-jump-nastyloop
+                                 (+ (funcall read-s16 (1+ i)) i 3))
+                           3))
+                 (18 (cons (list 'on-failure-jump-smart
+                                 (+ (funcall read-s16 (1+ i)) i 3))
+                           3))
+                 (19 (cons (list 'succeed-n
+                                 (+ (funcall read-s16 (1+ i)) i 3)
+                                 (funcall read-u16 (+ i 3)))
+                           5))
+                 (20 (cons (list 'jump-n
+                                 (+ (funcall read-s16 (1+ i)) i 3)
+                                 (funcall read-u16 (+ i 3)))
+                           5))
+                 (21 (cons (list 'set-number-at
+                                 (+ (funcall read-s16 (1+ i)) i 3)
+                                 (funcall read-u16 (+ i 3)))
+                           5))
+                 (22 (cons 'wordbeg 1))
+                 (23 (cons 'wordend 1))
+                 (24 (cons 'wordbound 1))
+                 (25 (cons 'notwordbound 1))
+                 (26 (cons 'symbeg 1))
+                 (27 (cons 'symend 1))
+                 (28 (cons (list 'syntaxspec
+                                 (aref regexp-disasm--syntax-codes
+                                       (aref bc (1+ i))))
+                           2))
+                 (29 (cons (list 'notsyntaxspec
+                                 (aref regexp-disasm--syntax-codes
+                                       (aref bc (1+ i))))
+                           2))
+                 (30 (cons 'at-dot 1))
+                 (31 (cons (list 'categoryspec (aref bc (1+ i)))
+                           2))
+                 (32 (cons (list 'notcategoryspec (aref bc (1+ i)))
+                           2))
+                 (_ (error "bad opcode at ofs %d: 0x%02x" i opcode))))
+             (entry (car entry-and-size))
+             (size (cdr entry-and-size)))
+        (push (cons i entry) entries)
+        (setq i (+ i size))))
+    (reverse entries)))
+
+;;;###autoload
+(defun regexp-disassemble (regexp)
+  "Compile REGEXP and print the disassembled bytecode."
+  (interactive "XRegexp (evaluated): ")
+  (let* ((instructions (regexp-disasm regexp))
+         (control-chars '((?\b . ?b)
+                          (?\t . ?t)
+                          (?\n . ?n)
+                          (?\v . ?v)
+                          (?\f . ?f)
+                          (?\r . ?r)
+                          (?\e . ?e)))
+         (quote-byte (lambda (c)
+                       (let ((esc (assq c control-chars)))
+                         (cond (esc (string ?\\ (cdr esc)))
+                               ((or (<= c 31) (<= #x7f c #xff))
+                                (format "\\%03o" c))
+                               (t (string c))))))
+         (quote-string-char (lambda (c)
+                              (let ((esc (assq c control-chars)))
+                                (cond (esc (string ?\\ (cdr esc)))
+                                      ((memq c '(?\\ ?\"))
+                                       (string ?\\ c))
+                                      ((or (<= c 31) (= c 127)
+                                           (>= c #x3fff80))
+                                       (format "\\%03o" (logand c #xff)))
+                                      (t (string c))))))
+         (quote-string (lambda (s)
+                         (concat "\""
+                                 (mapconcat quote-string-char s "")
+                                 "\"")))
+         (quote-range (lambda (range quote-char)
+                        (if (eq (car range) (cdr range))
+                            (funcall quote-char (car range))
+                          (format "%s-%s"
+                                  (funcall quote-char (car range))
+                                  (funcall quote-char (cdr range))))))
+         (quote-range-uni
+          (lambda (range) (funcall quote-range range quote-byte)))
+         (quote-range-multi
+          (lambda (range) (funcall quote-range range #'string))))
+    (with-output-to-temp-buffer "*Regexp-disassemble*"
+      (with-current-buffer standard-output
+        (insert (format "Disassembly of regexp %s\n\n"
+                        (funcall quote-string regexp)))
+        (dolist (instr instructions)
+          (let* ((addr (car instr))
+                 (op (cdr instr))
+                 (line
+                  (pcase op
+                    ((pred symbolp) (symbol-name op))
+                    (`(exactn ,s) (format "exactn %s" (funcall quote-string s)))
+                    (`(,(or 'charset 'notcharset)
+                       ,bitmap-pairs ,classes ,pairs)
+                     ;; FIXME: Maybe use a less ambiguous charset syntax.
+                     ;; Avoid ranges when endpoints are adjacent.
+                     ;; What to do about metachars like `]' and `-'?
+                     (concat (format "%s [%s]"
+                                     (car op)
+                                     (mapconcat quote-range-uni
+                                                bitmap-pairs ""))
+                             (and classes
+                                  (concat " [:"
+                                          (mapconcat
+                                           #'symbol-name classes ",")
+                                          ":]"))
+                             (and pairs
+                                  (concat " ["
+                                          (mapconcat quote-range-multi pairs "")
+                                          "]"))))
+                    (`(,(or 'start-memory 'stop-memory 'duplicate) ,n)
+                     (format "%s group %d" (car op) n))
+                    (`(,(or 'jump 'on-failure-jump 'on-failure-keep-string-jump
+                            'on-failure-jump-loop 'on-failure-jump-nastyloop
+                            'on-failure-jump-smart)
+                       ,dest)
+                     (format "%s to %d" (car op) dest))
+                    (`(,(or 'succeed-n 'jump-n 'set-number-at)
+                       ,dest ,val)
+                     (format "%s addr %d, value %d" (car op) dest val))
+                    (`(,(or 'syntaxspec 'notsyntaxspec) ,syn)
+                     (format "%s %s" (car op) syn))
+                    (`(,(or 'categoryspec 'notcategoryspec) ,ch)
+                     (format "%s '%c'" (car op) ch))
+                    (_ (error "unrecognised opcode: %S" op)))))
+            (insert (format "%5d  %s\n" addr line))))))))
+
+(provide 'regexp-disasm)
+
+;;; regexp-disasm.el ends here
diff --git a/src/search.c b/src/search.c
index 818bb4af24..ee5c0432f0 100644
--- a/src/search.c
+++ b/src/search.c
@@ -3106,6 +3106,25 @@ record_unwind_save_match_data (void)
 			 Fmatch_data (Qnil, Qnil, Qnil));
 }
 
+DEFUN ("regexp-bytecode", Fregexp_bytecode, Sregexp_bytecode, 1, 1, 0,
+       doc: /* Compile REGEXP and return the compiled bytecode.
+The compiled bytecode is returned as a string; its format is
+implementation-dependent.  Cached bytecode may be returned if available.  */)
+  (Lisp_Object regexp)
+{
+  CHECK_STRING (regexp);
+  struct regexp_cache *cache_entry = compile_pattern (
+    regexp,
+    NULL,
+    (!NILP (BVAR (current_buffer, case_fold_search))
+     ? BVAR (current_buffer, case_canon_table) : Qnil),
+    false,
+    true);
+  struct re_pattern_buffer *pb = &cache_entry->buf;
+  return make_unibyte_string (pb->buffer, pb->used);
+}
+
+
 /* Quote a string to deactivate reg-expr chars */
 
 DEFUN ("regexp-quote", Fregexp_quote, Sregexp_quote, 1, 1, 0,
@@ -3400,6 +3419,7 @@ syms_of_search (void)
   defsubr (&Smatch_end);
   defsubr (&Smatch_data);
   defsubr (&Sset_match_data);
+  defsubr (&Sregexp_bytecode);
   defsubr (&Sregexp_quote);
   defsubr (&Snewline_cache_check);
 
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-20 12:27 Regexp bytecode disassembler Mattias Engdegård
@ 2020-03-20 12:58 ` Andreas Schwab
  2020-03-20 14:34 ` Eli Zaretskii
  2020-03-20 15:39 ` Pip Cet
  2 siblings, 0 replies; 30+ messages in thread
From: Andreas Schwab @ 2020-03-20 12:58 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: emacs-devel

On Mär 20 2020, Mattias Engdegård wrote:

> +    (while (< i len)
> +      (let* ((opcode (aref bc i))
> +             (entry-and-size
> +               (pcase opcode
> +                 (0 (cons 'no-op 1))
> +                 (1 (cons 'succeed 1))

This loop would create less garbage if you used '(no-op . 1) instead of
(cons 'no-op 1).

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-20 12:27 Regexp bytecode disassembler Mattias Engdegård
  2020-03-20 12:58 ` Andreas Schwab
@ 2020-03-20 14:34 ` Eli Zaretskii
  2020-03-21 16:52   ` Mattias Engdegård
  2020-03-20 15:39 ` Pip Cet
  2 siblings, 1 reply; 30+ messages in thread
From: Eli Zaretskii @ 2020-03-20 14:34 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: emacs-devel

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Fri, 20 Mar 2020 13:27:35 +0100
> 
> This patch adds a lisp-based regexp bytecode disassembler which is always available without any runtime cost to the regexp engine. It is mainly a tool for maintainers but curious users may find it useful as well. It has already revealed one bug in the regexp compiler, now fixed (f189e5dc10).
> 
> Any objections against it being added (to master)?

Thanks.  I'm far from being an expert on the subject, so my comments
are limited to secondary aspects of this.

First, please document this in NEWS and in the ELisp manual.  IMNSHO,
this feature will be much less useful without documentation.

Second, please document the code more than you did.  I especially miss
any documentation of the regexp bytecode that is being disassembled;
it is IMO sub-optimal to ask people to read the regex C code to glean
that information while trying to understand what the disassembler does
and why.

> * src/search.c (Fregexp_bytecode): New function.
> (syms_of_search): Define the symbol.

Which symbol is that?

> * lisp/emacs-lisp/regexp-disasm.el (regexp-disasm--classes)
> (regexp-disasm--syntax-codes, regexp-disasm, regexp-disassemble): New.

This is a new file, so it is customary just to say "New file" without
listing the functions, which are all new.

> +;; Decode compiled Emacs regexp bytecode and pretty-print.

  "Decode compiled Emacs regexp bytecode, and pretty-print them."

> +(defconst regexp-disasm--classes
> +  [word lower punct space upper multibyte alpha alnum graph print blank]
> +  "Vector of character classes, corresponding to BIT_* in regex-emacs.c.")

This is one place where a more detailed description in the doc string
could be most beneficial.

> +(defconst regexp-disasm--syntax-codes
> +  [whitespace punctuation word symbol
> +   open-parenthesis close-parenthesis expression-prefix string-quote
> +   paired-delimiter escape character-quote comment-start comment-end
> +   inherit comment-delimiter string-delimiter]
> +  "Vector of syntax codes, corresponding to enum syntaxcode in syntax.h
> +but using names from `rx'.")

And this is another.  (Btw, the first line should be a complete
sentence.)

Any way to make sure these two get updated, or at least the build
displays an alert when the C sources are modified without also
updating the above defconst's?  At the very least, we should have
comments in the C sources to the effect that any change there needs an
update here.

> +;;;###autoload
> +(defun regexp-disasm (regexp)

Why do we need to auto-load this?

> +Instructions are on the form (ADDRESS . INSTR) where ADDRESS is the
                    ^^^^^^^^^^^
"of the form"

> +         (read-u16 (lambda (ofs) (+ (aref bc ofs)
> +                                    (ash (aref bc (1+ ofs)) 8))))
> +         (read-u24 (lambda (ofs) (+ (aref bc ofs)
> +                                    (ash (aref bc (+ ofs 1)) 8)
> +                                    (ash (aref bc (+ ofs 2)) 16))))
> +         (read-s16 (lambda (ofs) (let ((x (funcall read-u16 ofs)))
> +                                   (- x (ash (logand x #x8000) 1)))))

Why lambda-forms and not functions (or desfsubst)?

> +         (mb (multibyte-string-p regexp))

Please use more self-describing names of variables.  E.g., how about
multibyte-p in this case?  Likewise regarding "bc".

> +                           (str (if mb
> +                                    (decode-coding-string raw 'utf-8-emacs)
> +                                  raw)))

This call to decode-coding-string needs a comment that explains why
it's needed.

> +               (pcase opcode
> +                 (0 (cons 'no-op 1))
> +                 (1 (cons 'succeed 1))

Is pcase really needed here?  It looks like a simple cond will do.

> +                 (_ (error "bad opcode at ofs %d: 0x%02x" i opcode))))
                                             ^^^
"offset", please, not "ofs".

> +(defun regexp-disassemble (regexp)
> +  "Compile REGEXP and print the disassembled bytecode."

I think the fact that it compiles PATTERN is an implementation
detail.  The real purpose of this command is different.  Can you
propose a better description of that purpose?

> +  (interactive "XRegexp (evaluated): ")

This prompt should do a better job describing what kind of input is
expected here.

> +DEFUN ("regexp-bytecode", Fregexp_bytecode, Sregexp_bytecode, 1, 1, 0,
> +       doc: /* Compile REGEXP and return the compiled bytecode.
> +The compiled bytecode is returned as a string; its format is
> +implementation-dependent.  Cached bytecode may be returned if available.  */)

Is this function useful on its own, or is it an internal API?  If the
latter, let's use an internal notation for its name, and let's
document that it's an internal function.

> +  struct regexp_cache *cache_entry = compile_pattern (
> +    regexp,
> +    NULL,
> +    (!NILP (BVAR (current_buffer, case_fold_search))
> +     ? BVAR (current_buffer, case_canon_table) : Qnil),
> +    false,
> +    true);

That's not our style of formatting such function calls.  It is hard to
read.

And why are you using variables related to the current buffer here?
It sounds like something that will cause confusion in subtle cases
down the road.  I'm aware that we do something like that in search
APIs, but (a) that is controversial as well; and (b) this API is
different: it is not intended to search a buffer, and thus what the
current buffer-local settings are is almost completely irrelevant.  I
suggest instead to expose this argument to Lisp, so that callers could
decide what they want there.

Thanks again for working on this.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-20 12:27 Regexp bytecode disassembler Mattias Engdegård
  2020-03-20 12:58 ` Andreas Schwab
  2020-03-20 14:34 ` Eli Zaretskii
@ 2020-03-20 15:39 ` Pip Cet
  2020-03-21 16:56   ` Mattias Engdegård
  2 siblings, 1 reply; 30+ messages in thread
From: Pip Cet @ 2020-03-20 15:39 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: emacs-devel

On Fri, Mar 20, 2020 at 12:28 PM Mattias Engdegård <mattiase@acm.org> wrote:
> It is sometimes useful to inspect the generated regexp engine bytecode, but doing so currently involves recompiling with REGEX_EMACS_DEBUG configured, setting an internal variable using a debugger, and watching data scrolling past on stderr.
>
> This patch adds a lisp-based regexp bytecode disassembler which is always available without any runtime cost to the regexp engine. It is mainly a tool for maintainers but curious users may find it useful as well. It has already revealed one bug in the regexp compiler, now fixed (f189e5dc10).

This looks excellent!

I think we should warn more about the non-reentrancy of our regexp
code, though: the disassembled text of a regexp may change when it is
used to match a string. Alternatively, we could omit volatile state
information from the disassembled text.

I don't think
  exactn "a"
is very readable, since there's no n on the right hand side. exactn 1,
"a" would reflect the bytecode more precisely, while exact "a" would
work better as a description, IMHO.

I'd use nreverse rather than reverse, if we're worried about garbage
collecting a few cells :-)

I'd print the address of the "value" of succeed-n etc separately: that
makes it easier to find the corresponding set-number-at.  So instead
of printing

   10  succeed-n addr 23, value 0

we could print

  10  succeed-n addr 23, value 0 at addr 13

Or similar.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-20 14:34 ` Eli Zaretskii
@ 2020-03-21 16:52   ` Mattias Engdegård
  2020-03-21 19:19     ` Eli Zaretskii
  0 siblings, 1 reply; 30+ messages in thread
From: Mattias Engdegård @ 2020-03-21 16:52 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 6544 bytes --]

20 mars 2020 kl. 13.58 skrev Andreas Schwab <schwab@linux-m68k.org>:

> This loop would create less garbage if you used '(no-op . 1) instead of
> (cons 'no-op 1).

Thank you, we can't have that now can we. Fixed.

20 mars 2020 kl. 15.34 skrev Eli Zaretskii <eliz@gnu.org>:

> First, please document this in NEWS and in the ELisp manual.  IMNSHO,
> this feature will be much less useful without documentation.

Sorry, I should have been clear on the point that this is primarily a debug and maintenance aid for the regexp-engine developer and not intended as a user-facing feature. Nobody is barred from using it, but they are expected to read the circuit schematics that comes with Emacs (ie, the source code).

In particular, there is no user interface to the regexp bytecode at all; users can't write program in it and have Emacs run them. It is also not stable in the slightest. Documenting the inner workings of the regexp engine would only put a burden on its maintainers.

>> * src/search.c (Fregexp_bytecode): New function.
>> (syms_of_search): Define the symbol.
> 
> Which symbol is that?

I meant regexp-bytecode, sorry. Reworded.

>> * lisp/emacs-lisp/regexp-disasm.el (regexp-disasm--classes)
>> (regexp-disasm--syntax-codes, regexp-disasm, regexp-disassemble): New.
> 
> This is a new file, so it is customary just to say "New file" without
> listing the functions, which are all new.

Understood and corrected.

>> +;; Decode compiled Emacs regexp bytecode and pretty-print.
> 
>  "Decode compiled Emacs regexp bytecode, and pretty-print them."

Reworded.

>> +(defconst regexp-disasm--classes
>> +  [word lower punct space upper multibyte alpha alnum graph print blank]
>> +  "Vector of character classes, corresponding to BIT_* in regex-emacs.c.")
> 
> This is one place where a more detailed description in the doc string
> could be most beneficial.

Expanded and made more precise.

>> +(defconst regexp-disasm--syntax-codes
>> +  [whitespace punctuation word symbol
>> +   open-parenthesis close-parenthesis expression-prefix string-quote
>> +   paired-delimiter escape character-quote comment-start comment-end
>> +   inherit comment-delimiter string-delimiter]
>> +  "Vector of syntax codes, corresponding to enum syntaxcode in syntax.h
>> +but using names from `rx'.")
> 
> And this is another.  (Btw, the first line should be a complete
> sentence.)

Fixed.

> Any way to make sure these two get updated, or at least the build
> displays an alert when the C sources are modified without also
> updating the above defconst's?  At the very least, we should have
> comments in the C sources to the effect that any change there needs an
> update here.

I added comments to the C sources, since I was unsure about whether it was worth the trouble to add Lisp accessors to the respective constants. After all, if the definitions get out of sync, then regexp-disasm stops working which isn't a disaster.

>> +;;;###autoload
>> +(defun regexp-disasm (regexp)
> 
> Why do we need to auto-load this?

Actually, a function that returns the bytecode in symbolic form turned out to be useful in its own right, and I found it handy for some programmatic uses like comparing the bytecodes of two regexps.

>> +Instructions are on the form (ADDRESS . INSTR) where ADDRESS is the
>                    ^^^^^^^^^^^
> "of the form"

Reworded.

>> +         (read-u16 (lambda (ofs) (+ (aref bc ofs)
>> +                                    (ash (aref bc (1+ ofs)) 8))))
> 
> Why lambda-forms and not functions (or desfsubst)?

Because they need to close over variables in scope. With lexical binding, elisp almost feels like a real programming language!

>> +         (mb (multibyte-string-p regexp))
> 
> Please use more self-describing names of variables.  E.g., how about
> multibyte-p in this case?  Likewise regarding "bc".

Agreed about mb, thanks. I'm keeping 'bc' because it's used everywhere, and the shorthand definitely increases readability.

>> +                           (str (if mb
>> +                                    (decode-coding-string raw 'utf-8-emacs)
>> +                                  raw)))
> 
> This call to decode-coding-string needs a comment that explains why
> it's needed.

You are quite right, and decoding turned out to be more complicated as well. Fixed.

>> +               (pcase opcode
>> +                 (0 (cons 'no-op 1))
>> +                 (1 (cons 'succeed 1))
> 
> Is pcase really needed here?  It looks like a simple cond will do.

Well, pcase is a lot more readable here, don't you think?

>> +                 (_ (error "bad opcode at ofs %d: 0x%02x" i opcode))))
>                                             ^^^
> "offset", please, not "ofs".

Right, thanks.

>> +(defun regexp-disassemble (regexp)
>> +  "Compile REGEXP and print the disassembled bytecode."
> 
> I think the fact that it compiles PATTERN is an implementation
> detail.  The real purpose of this command is different.  Can you
> propose a better description of that purpose?

Agreed, and rewritten.

>> +  (interactive "XRegexp (evaluated): ")
> 
> This prompt should do a better job describing what kind of input is
> expected here.

I'm not sure what else to say in the prompt. I found it more useful to input the regexp as a lisp expression than a string (for cut-and-paste from source code, or for rx) but maybe that's just me.

>> +DEFUN ("regexp-bytecode", Fregexp_bytecode, Sregexp_bytecode, 1, 1, 0,
>> +       doc: /* Compile REGEXP and return the compiled bytecode.
>> +The compiled bytecode is returned as a string; its format is
>> +implementation-dependent.  Cached bytecode may be returned if available.  */)
> 
> Is this function useful on its own, or is it an internal API?  If the
> latter, let's use an internal notation for its name, and let's
> document that it's an internal function.

It's probably not useful for anything else. I made it internal.

>> +  struct regexp_cache *cache_entry = compile_pattern (
>> +    regexp,
>> +    NULL,
>> +    (!NILP (BVAR (current_buffer, case_fold_search))
>> +     ? BVAR (current_buffer, case_canon_table) : Qnil),
>> +    false,
>> +    true);
> 
> That's not our style of formatting such function calls.  It is hard to
> read.
> 
> And why are you using variables related to the current buffer here?

You are right, the case folding is better made explicit. Fixed.

Revised patch attached.


[-- Attachment #2: 0001-Add-regexp-bytecode-disassembler.patch --]
[-- Type: application/octet-stream, Size: 20181 bytes --]

From 50daa712e6740a60bf73840f43b7c4a7d782523f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Tue, 17 Mar 2020 15:09:10 +0100
Subject: [PATCH] Add regexp bytecode disassembler

This is mainly intended as a regexp engine maintenance and debugging
aid.  See discussion at
https://lists.gnu.org/archive/html/emacs-devel/2020-03/msg00461.html .

* lisp/emacs-lisp/regexp-disasm.el: New file.
* src/search.c (Finternal__regexp_bytecode): New function.
(syms_of_search): Define it as Lisp function.
* src/regex-emacs.c:
* src/syntax.h: Add comments to keep definitions in sync with
regexp-disasm.el.
---
 lisp/emacs-lisp/regexp-disasm.el | 317 +++++++++++++++++++++++++++++++
 src/regex-emacs.c                |   7 +-
 src/search.c                     |  17 ++
 src/syntax.h                     |   2 +
 4 files changed, 341 insertions(+), 2 deletions(-)
 create mode 100644 lisp/emacs-lisp/regexp-disasm.el

diff --git a/lisp/emacs-lisp/regexp-disasm.el b/lisp/emacs-lisp/regexp-disasm.el
new file mode 100644
index 0000000000..4c467141e8
--- /dev/null
+++ b/lisp/emacs-lisp/regexp-disasm.el
@@ -0,0 +1,317 @@
+;;; regexp-disasm -- disassemble regexp bytecode  -*- lexical-binding: t -*-
+
+;; Copyright (C) 2020 Free Software Foundation, Inc.
+
+;; This file is part of GNU Emacs.
+
+;; GNU Emacs is free software: you can redistribute it and/or modify
+;; it under the terms of the GNU General Public License as published by
+;; the Free Software Foundation, either version 3 of the License, or
+;; (at your option) any later version.
+
+;; GNU Emacs is distributed in the hope that it will be useful,
+;; but WITHOUT ANY WARRANTY; without even the implied warranty of
+;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+;; GNU General Public License for more details.
+
+;; You should have received a copy of the GNU General Public License
+;; along with GNU Emacs.  If not, see <https://www.gnu.org/licenses/>.
+
+;;; Commentary:
+
+;; Decode compiled Emacs regexp bytecode, and pretty-print it.
+;; Effective use of this code requires some knowledge of the internals
+;; of the regexp engine.  See re_opcode_t in regex-emacs.c.
+
+(defconst regexp-disasm--classes
+  [word lower punct space upper multibyte alpha alnum graph print blank]
+  "Vector of character classes, corresponding to BIT_* in regex-emacs.c.
+The vector index corresponds to the bit number.")
+
+(defconst regexp-disasm--syntax-codes
+  [whitespace punctuation word symbol
+   open-parenthesis close-parenthesis expression-prefix string-quote
+   paired-delimiter escape character-quote comment-start comment-end
+   inherit comment-delimiter string-delimiter]
+  "Vector of syntax codes, corresponding to enum syntaxcode in syntax.h.
+The vector index is the enum value; the symbol names are from `rx'.")
+
+(defun regexp-disasm--decode-multibyte (bytes)
+  "Decode BYTES from the internal string format to the string it represents."
+  ;; Each raw byte in the 80..ff range is internally represented as the
+  ;; two bytes 1100000x 10xxxxxx, which are decoded as separate raw bytes
+  ;; by utf-8-emacs, so we have to post-process the results.
+  (let* ((s (decode-coding-string bytes 'utf-8-emacs))
+         (len (length s))
+         (i 0)
+         (start 0)
+         (parts nil))
+    (while (< i (1- len))
+      (if (and (<= #x3fffc0 (aref s i) #x3fffc1)
+               (<= #x3fff80 (aref s (1+ i)) #x3fffbf))
+          ;; A pair of chars representing a raw byte.
+          (progn
+            (when (> i start)
+              (push (substring s start i) parts))
+            (push (string (+ #x3fff80
+                             (ash (logand (aref s i) 1) 6)
+                             (logand (aref s (1+ i)) #x3f)))
+                  parts)
+            (setq start (+ i 2))
+            (setq i start))
+        (setq i (1+ i))))
+    (when (> len start)
+      (push (substring s start) parts))
+    (apply #'concat (nreverse parts))))
+
+;;;###autoload
+(defun regexp-disasm (regexp &optional case-table)
+  "Disassemble the bytecode for REGEXP; return list of instructions.
+CASE-TABLE, if non-nil, is a translation table for case-folding.
+Instructions take the form (ADDRESS . INSTR) where ADDRESS is the
+byte offset and INSTR an S-expression representing the instruction."
+  (let* ((bc (internal--regexp-bytecode regexp case-table))
+         (read-u16 (lambda (ofs) (+ (aref bc ofs)
+                                    (ash (aref bc (1+ ofs)) 8))))
+         (read-u24 (lambda (ofs) (+ (aref bc ofs)
+                                    (ash (aref bc (+ ofs 1)) 8)
+                                    (ash (aref bc (+ ofs 2)) 16))))
+         (read-s16 (lambda (ofs) (let ((x (funcall read-u16 ofs)))
+                                   (- x (ash (logand x #x8000) 1)))))
+         (len (length bc))
+         (i 0)
+         (entries nil))
+    (while (< i len)
+      ;; This code depends on the exact details of the regexp bytecode
+      ;; representation; see re_opcode_t in regex-emacs.c.
+      (let* ((opcode (aref bc i))
+             (entry-and-size
+               (pcase opcode
+                 (0 '(no-op . 1))
+                 (1 '(succeed . 1))
+                 (2 (let* ((nbytes (aref bc (1+ i)))
+                           (raw (substring bc (+ i 2) (+ i 2 nbytes)))
+                           ;; Exact strings are multibyte-coded iff the
+                           ;; original regexp is.
+                           (str (if (multibyte-string-p regexp)
+                                    (regexp-disasm--decode-multibyte raw)
+                                  raw)))
+                      (cons (list 'exact str) (+ 2 nbytes))))
+                 (3 '(not-newline . 1))        ; `anychar' is a misnomer
+                 ((or 4 5)              ; `charset', `charset-not'
+                  (let* ((negated (= opcode 5))
+                         (bitmap-len-raw (aref bc (1+ i)))
+                         (bitmap-len (logand bitmap-len-raw #x7f))
+                         (have-range-table (/= (logand bitmap-len-raw #x80) 0))
+                         (npairs (if have-range-table
+                                     (funcall read-u16 (+ i 2 bitmap-len 2))
+                                   0))
+                         (bitmap-pairs nil)
+                         (classes nil)
+                         (pairs nil))
+
+                    ;; Convert the bitmap to ranges.
+                    (let ((first nil))
+                      (dotimes (j (* bitmap-len 8))
+                        (if (/= (logand (aref bc (+ i 2 (ash j -3)))
+                                        (ash 1 (logand j 7)))
+                                0)
+                            (unless first
+                              (setq first j))
+                          (when first
+                            (push (cons first (1- j)) bitmap-pairs)
+                            (setq first nil))))
+                      (when first
+                        (push (cons first (1- (* bitmap-len 8))) bitmap-pairs)))
+
+                    (when have-range-table
+                      ;; Convert class bits to list of classes.
+                      (let ((class-bits (funcall read-u16 (+ i 2 bitmap-len))))
+                        (dotimes (j (length regexp-disasm--classes))
+                          (when (/= (logand class-bits (ash 1 j)) 0)
+                            (push (aref regexp-disasm--classes j) classes))))
+
+                      ;; Read range table.
+                      (dotimes (j npairs)
+                        (let* ((ofs (+ i 2 bitmap-len 4 (* j 6)))
+                               (from (funcall read-u24 ofs))
+                               (to   (funcall read-u24 (+ ofs 3))))
+                          (push (cons from to) pairs))))
+
+                    (cons (list (if negated 'charset-not 'charset)
+                                (nreverse bitmap-pairs)
+                                (nreverse classes)
+                                (nreverse pairs))
+                          (+ 2 bitmap-len
+                             (if have-range-table 4 0) (* npairs 6)))))
+                 (6 (cons (list 'start-memory (aref bc (1+ i)))
+                          2))
+                 (7 (cons (list 'stop-memory (aref bc (1+ i)))
+                          2))
+                 (8 (cons (list 'duplicate (aref bc (1+ i)))
+                          2))
+                 (9 '(begline . 1))
+                 (10 '(endline . 1))
+                 (11 '(begbuf . 1))
+                 (12 '(endbuf . 1))
+                 (13 (cons (list 'jump
+                                 (+ (funcall read-s16 (1+ i)) i 3))
+                           3))
+                 (14 (cons (list 'on-failure-jump
+                                 (+ (funcall read-s16 (1+ i)) i 3))
+                           3))
+                 (15 (cons (list 'on-failure-keep-string-jump
+                                 (+ (funcall read-s16 (1+ i)) i 3))
+                           3))
+                 (16 (cons (list 'on-failure-jump-loop
+                                 (+ (funcall read-s16 (1+ i)) i 3))
+                           3))
+                 (17 (cons (list 'on-failure-jump-nastyloop
+                                 (+ (funcall read-s16 (1+ i)) i 3))
+                           3))
+                 (18 (cons (list 'on-failure-jump-smart
+                                 (+ (funcall read-s16 (1+ i)) i 3))
+                           3))
+                 (19 (cons (list 'succeed-n
+                                 (+ (funcall read-s16 (1+ i)) i 3)
+                                 (funcall read-u16 (+ i 3)))
+                           5))
+                 (20 (cons (list 'jump-n
+                                 (+ (funcall read-s16 (1+ i)) i 3)
+                                 (funcall read-u16 (+ i 3)))
+                           5))
+                 (21 (cons (list 'set-number-at
+                                 (+ (funcall read-s16 (1+ i)) i 3)
+                                 (funcall read-u16 (+ i 3)))
+                           5))
+                 (22 '(wordbeg . 1))
+                 (23 '(wordend . 1))
+                 (24 '(wordbound . 1))
+                 (25 '(notwordbound . 1))
+                 (26 '(symbeg . 1))
+                 (27 '(symend . 1))
+                 ;; Use symbolic names for syntax classes.
+                 (28 (cons (list 'syntaxspec
+                                 (aref regexp-disasm--syntax-codes
+                                       (aref bc (1+ i))))
+                           2))
+                 (29 (cons (list 'notsyntaxspec
+                                 (aref regexp-disasm--syntax-codes
+                                       (aref bc (1+ i))))
+                           2))
+                 (30 '(at-dot . 1))
+                 ;; Use the category code char for categories.
+                 (31 (cons (list 'categoryspec (aref bc (1+ i)))
+                           2))
+                 (32 (cons (list 'notcategoryspec (aref bc (1+ i)))
+                           2))
+                 (_ (error "bad opcode at offset %d: 0x%02x" i opcode))))
+             (entry (car entry-and-size))
+             (size (cdr entry-and-size)))
+        (push (cons i entry) entries)
+        (setq i (+ i size))))
+    (nreverse entries)))
+
+;;;###autoload
+(defun regexp-disassemble (regexp &optional case-table)
+  "Print the disassembled bytecode for REGEXP.
+If CASE-TABLE is non-nil, use it as translation table for case-folding.
+
+Cached bytecode is returned if available.  Since a compiled regexp can
+be modified when it is used in matching, the exact output of this function
+may vary, but it should be operationally equivalent."
+  (interactive "XRegexp (evaluated): ")
+  (let* ((instructions (regexp-disasm regexp case-table))
+         (control-chars '((?\b . ?b)
+                          (?\t . ?t)
+                          (?\n . ?n)
+                          (?\v . ?v)
+                          (?\f . ?f)
+                          (?\r . ?r)
+                          (?\e . ?e)))
+         (quote-byte (lambda (c)
+                       (let ((esc (assq c control-chars)))
+                         (cond (esc (string ?\\ (cdr esc)))
+                               ((or (<= c 31) (<= #x7f c #xff))
+                                (format "\\%03o" c))
+                               (t (string c))))))
+         (quote-string-char (lambda (c)
+                              (let ((esc (assq c control-chars)))
+                                (cond (esc (string ?\\ (cdr esc)))
+                                      ((memq c '(?\\ ?\"))
+                                       (string ?\\ c))
+                                      ((or (<= c 31) (= c 127)
+                                           (>= c #x3fff80))
+                                       (format "\\%03o" (logand c #xff)))
+                                      (t (string c))))))
+         (quote-string (lambda (s)
+                         (concat "\""
+                                 (mapconcat quote-string-char
+                                            ;; Make multibyte, to distinguish
+                                            ;; raw chars from U+0080..00ff.
+                                            (string-to-multibyte s)
+                                            "")
+                                 "\"")))
+         (quote-range (lambda (range quote-char)
+                        (if (eq (car range) (cdr range))
+                            (funcall quote-char (car range))
+                          (format "%s-%s"
+                                  (funcall quote-char (car range))
+                                  (funcall quote-char (cdr range))))))
+         (quote-range-uni
+          (lambda (range) (funcall quote-range range quote-byte)))
+         (quote-range-multi
+          (lambda (range) (funcall quote-range range #'string))))
+    (with-output-to-temp-buffer "*Regexp-disassemble*"
+      (with-current-buffer standard-output
+        (insert (format "Disassembly of regexp %s\n\n"
+                        (funcall quote-string regexp)))
+        (dolist (instr instructions)
+          (let* ((addr (car instr))
+                 (op (cdr instr))
+                 (line
+                  (pcase op
+                    ((pred symbolp) (symbol-name op))
+                    (`(exact ,s) (format "exact %s" (funcall quote-string s)))
+                    (`(,(or 'charset 'charset-not)
+                       ,bitmap-pairs ,classes ,pairs)
+                     ;; FIXME: Maybe use a less ambiguous charset syntax.
+                     ;; Avoid ranges when endpoints are adjacent.
+                     ;; What to do about metachars like `]' and `-'?
+                     (concat (format "%s [%s]"
+                                     (car op)
+                                     (mapconcat quote-range-uni
+                                                bitmap-pairs ""))
+                             (and classes
+                                  (concat " [:"
+                                          (mapconcat
+                                           #'symbol-name classes ",")
+                                          ":]"))
+                             (and pairs
+                                  (concat " ["
+                                          (mapconcat quote-range-multi pairs "")
+                                          "]"))))
+                    (`(,(or 'start-memory 'stop-memory 'duplicate) ,n)
+                     (format "%s group %d" (car op) n))
+                    (`(,(or 'jump 'on-failure-jump 'on-failure-keep-string-jump
+                            'on-failure-jump-loop 'on-failure-jump-nastyloop
+                            'on-failure-jump-smart)
+                       ,dest)
+                     (format "%s to %d" (car op) dest))
+                    (`(,(or 'succeed-n 'jump-n) ,dest ,val)
+                     (format "%s to %d, value %d" (car op) dest val))
+                    (`(set-number-at ,dest ,val)
+                     ;; We adjust the destination address so that it
+                     ;; refers to the instruction, not to the offset
+                     ;; of the number.
+                     (format "%s instr %d to value %d" (car op) (- dest 3) val))
+                    (`(,(or 'syntaxspec 'notsyntaxspec) ,syn)
+                     (format "%s %s" (car op) syn))
+                    (`(,(or 'categoryspec 'notcategoryspec) ,ch)
+                     (format "%s '%c'" (car op) ch))
+                    (_ (error "unrecognised opcode: %S at %S" op addr)))))
+            (insert (format "%5d  %s\n" addr line))))))))
+
+(provide 'regexp-disasm)
+
+;;; regexp-disasm.el ends here
diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index 0ae004e463..f12b1f9246 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -196,7 +196,9 @@ #define BYTEWIDTH 8 /* In bits.  */
 /* These are the command codes that appear in compiled regular
    expressions.  Some opcodes are followed by argument bytes.  A
    command code can specify any interpretation whatsoever for its
-   arguments.  Zero bytes may appear in the compiled regular expression.  */
+   arguments.  Zero bytes may appear in the compiled regular expression.
+
+   Any changes here should be reflected in regexp-disasm.el as well.  */
 
 typedef enum
 {
@@ -1342,7 +1344,8 @@ #define RANGE_TABLE_WORK_ELT(work_area, i) ((work_area).table[i])
 
 /* Bits used to implement the multibyte-part of the various character classes
    such as [:alnum:] in a charset's range table.  The code currently assumes
-   that only the low 16 bits are used.  */
+   that only the low 16 bits are used.
+   Keep this table in sync with regexp-disasm--classes.  */
 #define BIT_WORD	0x1
 #define BIT_LOWER	0x2
 #define BIT_PUNCT	0x4
diff --git a/src/search.c b/src/search.c
index 818bb4af24..a5804a17b6 100644
--- a/src/search.c
+++ b/src/search.c
@@ -3106,6 +3106,22 @@ record_unwind_save_match_data (void)
 			 Fmatch_data (Qnil, Qnil, Qnil));
 }
 
+DEFUN ("internal--regexp-bytecode", Finternal__regexp_bytecode,
+       Sinternal__regexp_bytecode, 2, 2, 0,
+       doc: /* Return the compiled bytecode of REGEXP.
+If CASE_TABLE is non-nil, use it as translation table for ignoring case.
+The bytecode is returned as a string; its format is implementation-dependent.
+Cached bytecode may be returned if available.  */)
+  (Lisp_Object regexp, Lisp_Object case_table)
+{
+  CHECK_STRING (regexp);
+  struct regexp_cache *cache_entry =
+    compile_pattern (regexp, NULL, case_table, false, true);
+  struct re_pattern_buffer *pb = &cache_entry->buf;
+  return make_unibyte_string (pb->buffer, pb->used);
+}
+
+
 /* Quote a string to deactivate reg-expr chars */
 
 DEFUN ("regexp-quote", Fregexp_quote, Sregexp_quote, 1, 1, 0,
@@ -3400,6 +3416,7 @@ syms_of_search (void)
   defsubr (&Smatch_end);
   defsubr (&Smatch_data);
   defsubr (&Sset_match_data);
+  defsubr (&Sinternal__regexp_bytecode);
   defsubr (&Sregexp_quote);
   defsubr (&Snewline_cache_check);
 
diff --git a/src/syntax.h b/src/syntax.h
index a2ec3301ba..4817e4c288 100644
--- a/src/syntax.h
+++ b/src/syntax.h
@@ -37,6 +37,8 @@ #define Vstandard_syntax_table BVAR (&buffer_defaults, syntax_table)
    (CODE+FLAGS . MATCHING-CHAR).  MATCHING-CHAR can be nil if the char
    is not a kind of parenthesis.
 
+   Keep this table in sync with regexp-disasm--syntax-codes.
+
    The low 8 bits of CODE+FLAGS is a code, as follows:  */
 
 enum syntaxcode
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-20 15:39 ` Pip Cet
@ 2020-03-21 16:56   ` Mattias Engdegård
  0 siblings, 0 replies; 30+ messages in thread
From: Mattias Engdegård @ 2020-03-21 16:56 UTC (permalink / raw)
  To: Pip Cet; +Cc: emacs-devel

20 mars 2020 kl. 16.39 skrev Pip Cet <pipcet@gmail.com>:

> I think we should warn more about the non-reentrancy of our regexp
> code, though: the disassembled text of a regexp may change when it is
> used to match a string. Alternatively, we could omit volatile state
> information from the disassembled text.

Yes, I added some vague verbiage to the doc string about this.

> I don't think
>  exactn "a"
> is very readable, since there's no n on the right hand side. exactn 1,
> "a" would reflect the bytecode more precisely, while exact "a" would
> work better as a description, IMHO.

OK, 'exact' it is.

> I'd use nreverse rather than reverse, if we're worried about garbage
> collecting a few cells :-)

Correctness first, but since the code now is correct!, we can nreverse.

> I'd print the address of the "value" of succeed-n etc separately: that
> makes it easier to find the corresponding set-number-at.

This irked me as well, so I went the other way and changed the set-number-at pretty-printer to use the address of the target instruction rather than that of the value field.




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-21 16:52   ` Mattias Engdegård
@ 2020-03-21 19:19     ` Eli Zaretskii
  2020-03-21 20:16       ` Štěpán Němec
  2020-03-21 20:37       ` Mattias Engdegård
  0 siblings, 2 replies; 30+ messages in thread
From: Eli Zaretskii @ 2020-03-21 19:19 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: emacs-devel

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sat, 21 Mar 2020 17:52:51 +0100
> Cc: emacs-devel@gnu.org
> 
> > First, please document this in NEWS and in the ELisp manual.  IMNSHO,
> > this feature will be much less useful without documentation.
> 
> Sorry, I should have been clear on the point that this is primarily a debug and maintenance aid for the regexp-engine developer and not intended as a user-facing feature. Nobody is barred from using it, but they are expected to read the circuit schematics that comes with Emacs (ie, the source code).
> 
> In particular, there is no user interface to the regexp bytecode at all; users can't write program in it and have Emacs run them. It is also not stable in the slightest. Documenting the inner workings of the regexp engine would only put a burden on its maintainers.

I didn't mean the user manual, I meant the ELisp manual.  I don't
agree that this command should remain undocumented, and I don't
understand your opposition to making this more visible and more easily
used.  Having users read the C code is quite an obstacle to some.

> >> +;;;###autoload
> >> +(defun regexp-disasm (regexp)
> > 
> > Why do we need to auto-load this?
> 
> Actually, a function that returns the bytecode in symbolic form turned out to be useful in its own right, and I found it handy for some programmatic uses like comparing the bytecodes of two regexps.

I don't think this answers the question.  Not every useful function is
auto-loaded, is it?  Why is it a problem to have to require this
package?

> >> +         (read-u16 (lambda (ofs) (+ (aref bc ofs)
> >> +                                    (ash (aref bc (1+ ofs)) 8))))
> > 
> > Why lambda-forms and not functions (or desfsubst)?
> 
> Because they need to close over variables in scope.

So you are "saving" one more argument?

> With lexical binding, elisp almost feels like a real programming language!

Maybe so, but this style makes the code harder to read and modify,
IMO.

> >> +               (pcase opcode
> >> +                 (0 (cons 'no-op 1))
> >> +                 (1 (cons 'succeed 1))
> > 
> > Is pcase really needed here?  It looks like a simple cond will do.
> 
> Well, pcase is a lot more readable here, don't you think?

No, I don't, not in this case.  You are just selecting from a list of
fixed values.

> >> +  (interactive "XRegexp (evaluated): ")
> > 
> > This prompt should do a better job describing what kind of input is
> > expected here.
> 
> I'm not sure what else to say in the prompt. I found it more useful to input the regexp as a lisp expression than a string (for cut-and-paste from source code, or for rx) but maybe that's just me.

I envision many people will think a string is expected, thus my
comment.

> +   Any changes here should be reflected in regexp-disasm.el as well.  */

I think the same comment should be near the definition of re_opcode_t.

Thanks.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-21 19:19     ` Eli Zaretskii
@ 2020-03-21 20:16       ` Štěpán Němec
  2020-03-21 20:30         ` Eli Zaretskii
  2020-03-21 20:37       ` Mattias Engdegård
  1 sibling, 1 reply; 30+ messages in thread
From: Štěpán Němec @ 2020-03-21 20:16 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Mattias Engdegård, emacs-devel

On Sat, 21 Mar 2020 21:19:16 +0200
Eli Zaretskii wrote:

>> >> +               (pcase opcode
>> >> +                 (0 (cons 'no-op 1))
>> >> +                 (1 (cons 'succeed 1))
>> > 
>> > Is pcase really needed here?  It looks like a simple cond will do.
>> 
>> Well, pcase is a lot more readable here, don't you think?
>
> No, I don't, not in this case.  You are just selecting from a list of
> fixed values.

Please forgive my intrusion, but, while I tend to agree with most of
your other comments, I find this one very surprising.

Do you mean you would prefer to use `cond' and rewrite all those clauses
to something like the following?

(cond
 ((eql opcode 0) (cons 'no-op 1))
 ((eql opcode 1) (cons 'succeed 1))
  ...

Maybe readability is much more subjective than I thought, but I find the
latter very suboptimal, to say the least.

Also, isn't "just selecting from a list of fixed values" precisely the
reason to use some sort of case/switch instead of the general `cond'?

Certainly `pcase' can also be useful in more complicated use cases, but
it will expand to the cond form anyway, so I also don't see any
performance concerns.

-- 
Štěpán

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-21 20:16       ` Štěpán Němec
@ 2020-03-21 20:30         ` Eli Zaretskii
  2020-03-21 20:40           ` Mattias Engdegård
                             ` (3 more replies)
  0 siblings, 4 replies; 30+ messages in thread
From: Eli Zaretskii @ 2020-03-21 20:30 UTC (permalink / raw)
  To: Štěpán Němec; +Cc: mattiase, emacs-devel

> From: Štěpán Němec <stepnem@gmail.com>
> Cc: Mattias Engdegård <mattiase@acm.org>,
>   emacs-devel@gnu.org
> Date: Sat, 21 Mar 2020 21:16:02 +0100
> 
> Do you mean you would prefer to use `cond' and rewrite all those clauses
> to something like the following?
> 
> (cond
>  ((eql opcode 0) (cons 'no-op 1))
>  ((eql opcode 1) (cons 'succeed 1))

Yes.

> Maybe readability is much more subjective than I thought, but I find the
> latter very suboptimal, to say the least.

Why "suboptimal"?

> Also, isn't "just selecting from a list of fixed values" precisely the
> reason to use some sort of case/switch instead of the general `cond'?

'cond' _is_ a case/switch construct.

> Certainly `pcase' can also be useful in more complicated use cases, but
> it will expand to the cond form anyway, so I also don't see any
> performance concerns.

I said nothing about the performance.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-21 19:19     ` Eli Zaretskii
  2020-03-21 20:16       ` Štěpán Němec
@ 2020-03-21 20:37       ` Mattias Engdegård
  2020-03-22  3:28         ` Eli Zaretskii
  1 sibling, 1 reply; 30+ messages in thread
From: Mattias Engdegård @ 2020-03-21 20:37 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

21 mars 2020 kl. 20.19 skrev Eli Zaretskii <eliz@gnu.org>:

> I didn't mean the user manual, I meant the ELisp manual.  I don't
> agree that this command should remain undocumented, and I don't
> understand your opposition to making this more visible and more easily
> used.  Having users read the C code is quite an obstacle to some.

It's not meant for Elisp users. It's meant for those who edit regex-emacs.c.
Actually it was mostly meant for me because I was curious about how the matcher worked.

I didn't expect this amount of opposition though, so I'm just going to add the internal--regexp-bytecode function (which I expect to be fully unobjectionable). Problem solved!

Thank you very much for the reviews!

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-21 20:30         ` Eli Zaretskii
@ 2020-03-21 20:40           ` Mattias Engdegård
  2020-03-21 20:44           ` Štěpán Němec
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 30+ messages in thread
From: Mattias Engdegård @ 2020-03-21 20:40 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Štěpán Němec, emacs-devel

21 mars 2020 kl. 21.30 skrev Eli Zaretskii <eliz@gnu.org>:

> 'cond' _is_ a case/switch construct.

Actually, it's an IF...ELSE IF... construct.




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-21 20:30         ` Eli Zaretskii
  2020-03-21 20:40           ` Mattias Engdegård
@ 2020-03-21 20:44           ` Štěpán Němec
  2020-03-22 14:12             ` Eli Zaretskii
  2020-03-21 20:50           ` Dmitry Gutov
  2020-03-21 23:58           ` Drew Adams
  3 siblings, 1 reply; 30+ messages in thread
From: Štěpán Němec @ 2020-03-21 20:44 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: mattiase, emacs-devel

On Sat, 21 Mar 2020 22:30:49 +0200
Eli Zaretskii wrote:

>> Do you mean you would prefer to use `cond' and rewrite all those clauses
>> to something like the following?
>> 
>> (cond
>>  ((eql opcode 0) (cons 'no-op 1))
>>  ((eql opcode 1) (cons 'succeed 1))
>
> Yes.
>
>> Maybe readability is much more subjective than I thought, but I find the
>> latter very suboptimal, to say the least.
>
> Why "suboptimal"?

It's unnecessarily verbose and most of that is just repetition, i.e.
noise, which makes it harder to read. It _begs_ to be rewritten with
(p)case. :-)

>> Also, isn't "just selecting from a list of fixed values" precisely the
>> reason to use some sort of case/switch instead of the general `cond'?
>
> 'cond' _is_ a case/switch construct.

I have to disagree. `cond' is a general conditional construct: all the
clauses are completely independent, testing for anything, whereas in a
`case' you have one object that is tested against each clause. Which is
precisely the case here.

>> Certainly `pcase' can also be useful in more complicated use cases, but
>> it will expand to the cond form anyway, so I also don't see any
>> performance concerns.
>
> I said nothing about the performance.

OK. I was just trying to come up with some other possible factors, as I
found the readability argument so surprising.

-- 
Štěpán



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-21 20:30         ` Eli Zaretskii
  2020-03-21 20:40           ` Mattias Engdegård
  2020-03-21 20:44           ` Štěpán Němec
@ 2020-03-21 20:50           ` Dmitry Gutov
  2020-03-21 23:58           ` Drew Adams
  3 siblings, 0 replies; 30+ messages in thread
From: Dmitry Gutov @ 2020-03-21 20:50 UTC (permalink / raw)
  To: Eli Zaretskii, Štěpán Němec; +Cc: mattiase, emacs-devel

On 21.03.2020 22:30, Eli Zaretskii wrote:
> 'cond'_is_  a case/switch construct.

I'd say it's a switch construct, but not a case construct (unlike 
cl-case or pcase).

I think the pcase approach is shorter, just as (or more) readable, and 
thus less error-prone.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Regexp bytecode disassembler
  2020-03-21 20:30         ` Eli Zaretskii
                             ` (2 preceding siblings ...)
  2020-03-21 20:50           ` Dmitry Gutov
@ 2020-03-21 23:58           ` Drew Adams
  2020-03-22  0:02             ` Drew Adams
  3 siblings, 1 reply; 30+ messages in thread
From: Drew Adams @ 2020-03-21 23:58 UTC (permalink / raw)
  To: Eli Zaretskii, Štěpán Němec; +Cc: mattiase, emacs-devel

> > Do you mean you would prefer to use `cond' and rewrite all those
> > clauses to something like the following?
> >
> > (cond
> >  ((eql opcode 0) (cons 'no-op 1))
> >  ((eql opcode 1) (cons 'succeed 1))
> 
> Yes.

Sorry for butting in, and I haven't followed the
thread.  But if it were I, I'd just use `case'
(aka `cl-case' now), assuming that all of the
clauses just test the value of `opcode' using
`eql'.

`case' automatically uses `eql', and it makes
very clear that the _only_ tests are `opcode'
values.

To me, `pcase' would be overkill here and `cond'
would be unnecessarily verbose and wouldn't
emphasize what I said in the previous paragraph.

(case opcode
  (0  (cons 'no-op   1))
  (1  (cons 'succeed 1))
  ...)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Regexp bytecode disassembler
  2020-03-21 23:58           ` Drew Adams
@ 2020-03-22  0:02             ` Drew Adams
  0 siblings, 0 replies; 30+ messages in thread
From: Drew Adams @ 2020-03-22  0:02 UTC (permalink / raw)
  To: Eli Zaretskii, Štěpán Němec; +Cc: mattiase, emacs-devel

> To me, `pcase' would be overkill here 

... because it is more general.  Like `cond', it
doesn't make clear that this is a simple case that
is exactly the case that `cl-case' handles.

Yes, the code in this case looks the same, whether
it is `pcase' or `cl-case', but the latter does
only this, so it's clear that only this is involved.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-21 20:37       ` Mattias Engdegård
@ 2020-03-22  3:28         ` Eli Zaretskii
  2020-03-22  9:23           ` Mattias Engdegård
  0 siblings, 1 reply; 30+ messages in thread
From: Eli Zaretskii @ 2020-03-22  3:28 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: emacs-devel

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sat, 21 Mar 2020 21:37:32 +0100
> Cc: emacs-devel@gnu.org
> 
> 21 mars 2020 kl. 20.19 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > I didn't mean the user manual, I meant the ELisp manual.  I don't
> > agree that this command should remain undocumented, and I don't
> > understand your opposition to making this more visible and more easily
> > used.  Having users read the C code is quite an obstacle to some.
> 
> It's not meant for Elisp users. It's meant for those who edit regex-emacs.c.

How is this different from the "Internals" chapter in the ELisp
manual?

> I didn't expect this amount of opposition though, so I'm just going to add the internal--regexp-bytecode function (which I expect to be fully unobjectionable). Problem solved!

I meant to document the command.  Please do document it.  Making
important functions internal solves no problem.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-22  3:28         ` Eli Zaretskii
@ 2020-03-22  9:23           ` Mattias Engdegård
  2020-03-22 10:37             ` Eli Zaretskii
  0 siblings, 1 reply; 30+ messages in thread
From: Mattias Engdegård @ 2020-03-22  9:23 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

22 mars 2020 kl. 04.28 skrev Eli Zaretskii <eliz@gnu.org>:

> I meant to document the command.  Please do document it.  Making
> important functions internal solves no problem.

The rest of the patch has been retracted. There is nothing to document.




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-22  9:23           ` Mattias Engdegård
@ 2020-03-22 10:37             ` Eli Zaretskii
  2020-03-22 15:24               ` Mattias Engdegård
  0 siblings, 1 reply; 30+ messages in thread
From: Eli Zaretskii @ 2020-03-22 10:37 UTC (permalink / raw)
  To: emacs-devel, Mattias Engdegård

On March 22, 2020 11:23:23 AM GMT+02:00, "Mattias Engdegård" <mattiase@acm.org> wrote:
> 22 mars 2020 kl. 04.28 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > I meant to document the command.  Please do document it.  Making
> > important functions internal solves no problem.
> 
> The rest of the patch has been retracted. There is nothing to
> document.

That's too bad, but if this is the case, I see no reason to have a primitive that is not used by any command or public API.  We don't have any other primitives like that, AFAIR, we only add primitives that have some uses.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-21 20:44           ` Štěpán Němec
@ 2020-03-22 14:12             ` Eli Zaretskii
  2020-03-22 14:43               ` Štěpán Němec
  0 siblings, 1 reply; 30+ messages in thread
From: Eli Zaretskii @ 2020-03-22 14:12 UTC (permalink / raw)
  To: Štěpán Němec; +Cc: mattiase, emacs-devel

> From: Štěpán Němec <stepnem@gmail.com>
> Cc: mattiase@acm.org, emacs-devel@gnu.org
> Date: Sat, 21 Mar 2020 21:44:49 +0100
> 
> On Sat, 21 Mar 2020 22:30:49 +0200
> Eli Zaretskii wrote:
> 
> >> Do you mean you would prefer to use `cond' and rewrite all those clauses
> >> to something like the following?
> >> 
> >> (cond
> >>  ((eql opcode 0) (cons 'no-op 1))
> >>  ((eql opcode 1) (cons 'succeed 1))
> >
> > Yes.

(Btw, I'd use '=' here instead of 'eql', after verifying the value is
a number and rejecting non-numbers.)

> >> Maybe readability is much more subjective than I thought, but I find the
> >> latter very suboptimal, to say the least.
> >
> > Why "suboptimal"?
> 
> It's unnecessarily verbose and most of that is just repetition, i.e.
> noise, which makes it harder to read. It _begs_ to be rewritten with
> (p)case. :-)

In that case, it's the other way around IMO: using pcase here is like
firing cannons at birds -- it's un-economical (a.k.a. "suboptimal").

Please think about people who are not 100% at home with advanced
macros such as pcase: they will have a harder time understanding what
the code does.  This is especially relevant to this feature, since it
is quite probable that it would have been used by someone working on
the Emacs regex code, so Lisp proficiency may not be guaranteed.  That
is one reason why I asked to have the code prominently commented and
documented.

> >> Also, isn't "just selecting from a list of fixed values" precisely the
> >> reason to use some sort of case/switch instead of the general `cond'?
> >
> > 'cond' _is_ a case/switch construct.
> 
> I have to disagree. `cond' is a general conditional construct: all the
> clauses are completely independent, testing for anything, whereas in a
> `case' you have one object that is tested against each clause. Which is
> precisely the case here.

Then the same applies to pcase.

(Sadly, all this is now a moot point.)



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-22 14:12             ` Eli Zaretskii
@ 2020-03-22 14:43               ` Štěpán Němec
  2020-03-22 16:55                 ` Eli Zaretskii
  0 siblings, 1 reply; 30+ messages in thread
From: Štěpán Němec @ 2020-03-22 14:43 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: mattiase, emacs-devel

On Sun, 22 Mar 2020 16:12:47 +0200
Eli Zaretskii wrote:

>> It's unnecessarily verbose and most of that is just repetition, i.e.
>> noise, which makes it harder to read. It _begs_ to be rewritten with
>> (p)case. :-)
>
> In that case, it's the other way around IMO: using pcase here is like
> firing cannons at birds -- it's un-economical (a.k.a. "suboptimal").
>
> Please think about people who are not 100% at home with advanced
> macros such as pcase: they will have a harder time understanding what
> the code does.  This is especially relevant to this feature, since it
> is quite probable that it would have been used by someone working on
> the Emacs regex code, so Lisp proficiency may not be guaranteed.  That
> is one reason why I asked to have the code prominently commented and
> documented.

I am aware of that; I intentionally put the (p) in parentheses, and I
did say _some_ kind of case/switch instead of general conditional.

IMHO the complexity of `pcase' is completely transparent in this case:
it does what I believe anyone without any knowledge about the macro
would expect it to do just looking at it (and you can easily check the
expansion e.g. with M-x pp-macroexpand-last-sexp).

That said, there is always `cl-case', of course, as Drew suggested.

-- 
Štěpán



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-22 10:37             ` Eli Zaretskii
@ 2020-03-22 15:24               ` Mattias Engdegård
  2020-03-22 17:06                 ` Eli Zaretskii
  0 siblings, 1 reply; 30+ messages in thread
From: Mattias Engdegård @ 2020-03-22 15:24 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

22 mars 2020 kl. 11.37 skrev Eli Zaretskii <eliz@gnu.org>:

> That's too bad, but if this is the case, I see no reason to have a primitive that is not used by any command or public API.  We don't have any other primitives like that, AFAIR, we only add primitives that have some uses.

Surely adding it won't hurt Emacs in the slightest or bother anyone; it's just a few lines. Not having it is a bit of an inconvenience to a few people (requires local patching).




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-22 14:43               ` Štěpán Němec
@ 2020-03-22 16:55                 ` Eli Zaretskii
  2020-03-22 17:16                   ` Štěpán Němec
  0 siblings, 1 reply; 30+ messages in thread
From: Eli Zaretskii @ 2020-03-22 16:55 UTC (permalink / raw)
  To: Štěpán Němec; +Cc: mattiase, emacs-devel

> From: Štěpán Němec <stepnem@gmail.com>
> Cc: mattiase@acm.org,  emacs-devel@gnu.org
> Date: Sun, 22 Mar 2020 15:43:03 +0100
> 
> IMHO the complexity of `pcase' is completely transparent in this case:

I disagree.  And please also consider the easiness (or lack thereof)
of changing the code, not just reading it.  pcase is far from being
simple, its syntax is tricky to understand fully.  it's no accident
that it took us 2-3 iterations and many discussions to converge on
what is now its documentation.  Just the volume of that documentation
should tell you how NOT transparent is its complexity.

> you can easily check the expansion e.g. with M-x
> pp-macroexpand-last-sexp

So you are saying that reading code should be accompanied by expanding
the macros it uses?  (Assuming the reader even knows about
pp-macroexpand-last-sexp.)  Seriously, I don't think this is something
I'd call "code that explains itself".  In any language.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-22 15:24               ` Mattias Engdegård
@ 2020-03-22 17:06                 ` Eli Zaretskii
  2020-03-22 19:39                   ` Mattias Engdegård
  0 siblings, 1 reply; 30+ messages in thread
From: Eli Zaretskii @ 2020-03-22 17:06 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: emacs-devel

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sun, 22 Mar 2020 16:24:38 +0100
> Cc: emacs-devel@gnu.org
> 
> 22 mars 2020 kl. 11.37 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > That's too bad, but if this is the case, I see no reason to have a primitive that is not used by any command or public API.  We don't have any other primitives like that, AFAIR, we only add primitives that have some uses.
> 
> Surely adding it won't hurt Emacs in the slightest or bother anyone; it's just a few lines. Not having it is a bit of an inconvenience to a few people (requires local patching).

Having code we don't use is a maintenance burden without benefits.  If
local patching is an issue, perhaps you could rewrite the code as a
module?

But I actually still don't understand why you took such a radical step
and removed everything except the primitive.  I think this is a
valuable debugging feature, of which we don't have enough in Emacs.
Would it help if I volunteered to produce documentation from some
explanatory unformatted text that you could provide?  Or are you
refusing to document this out of some principle?  Or is something else
a matter here?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-22 16:55                 ` Eli Zaretskii
@ 2020-03-22 17:16                   ` Štěpán Němec
  2020-03-22 17:30                     ` Eli Zaretskii
  0 siblings, 1 reply; 30+ messages in thread
From: Štěpán Němec @ 2020-03-22 17:16 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: mattiase, emacs-devel

On Sun, 22 Mar 2020 18:55:53 +0200
Eli Zaretskii wrote:

>> From: Štěpán Němec <stepnem@gmail.com>
>> Cc: mattiase@acm.org,  emacs-devel@gnu.org
>> Date: Sun, 22 Mar 2020 15:43:03 +0100
>> 
>> IMHO the complexity of `pcase' is completely transparent in this case:
>
> I disagree.  And please also consider the easiness (or lack thereof)
> of changing the code, not just reading it.  pcase is far from being
> simple, its syntax is tricky to understand fully.  it's no accident
> that it took us 2-3 iterations and many discussions to converge on
> what is now its documentation.  Just the volume of that documentation
> should tell you how NOT transparent is its complexity.

Changing code that _uses_ pcase doesn't involve changing code that
implements pcase, unless I'm missing something.

Given that using pcase here makes the code in question shorter (and for
most of those who commented apparently also more readable) compared to
the cond version, doesn't that _add_ to the easiness of changing it?

>> you can easily check the expansion e.g. with M-x
>> pp-macroexpand-last-sexp
>
> So you are saying that reading code should be accompanied by expanding
> the macros it uses?

Certainly not. I already said that I believe the code was perfectly
understandable even for people knowing nothing about pcase (at least in
this case).

The macroexpand check suggestion was addressing the "economy" or avian
warfare concerns (it expands to the simple cond form), or those who want
to check if the intuition matches reality.

In any case, you seem to still focus on the "why _pcase_" issue, which I
thought was really a minor point here, and was never what I argued for
specifically, instead of the "why _some_ case/switch rather than cond".

-- 
Štěpán

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-22 17:16                   ` Štěpán Němec
@ 2020-03-22 17:30                     ` Eli Zaretskii
  2020-03-22 18:34                       ` Paul Eggert
  2020-03-22 18:36                       ` Dmitry Gutov
  0 siblings, 2 replies; 30+ messages in thread
From: Eli Zaretskii @ 2020-03-22 17:30 UTC (permalink / raw)
  To: Štěpán Němec; +Cc: mattiase, emacs-devel

> From: Štěpán Němec <stepnem@gmail.com>
> Cc: mattiase@acm.org, emacs-devel@gnu.org
> Date: Sun, 22 Mar 2020 18:16:58 +0100
> 
> > I disagree.  And please also consider the easiness (or lack thereof)
> > of changing the code, not just reading it.  pcase is far from being
> > simple, its syntax is tricky to understand fully.  it's no accident
> > that it took us 2-3 iterations and many discussions to converge on
> > what is now its documentation.  Just the volume of that documentation
> > should tell you how NOT transparent is its complexity.
> 
> Changing code that _uses_ pcase doesn't involve changing code that
> implements pcase, unless I'm missing something.

It depends on the change.  And in any case, how can you reliably
change code whose syntax you don't sufficiently understand?

> Given that using pcase here makes the code in question shorter (and for
> most of those who commented apparently also more readable) compared to
> the cond version, doesn't that _add_ to the easiness of changing it?

For you and me, maybe.  But we must think of others.

I'm not saying we shouldn't use pcase -- we use it quite a lot in pur
sources.  I'm saying that using it where simpler, more basic
constructs will do reasonably well is more economical.

> > So you are saying that reading code should be accompanied by expanding
> > the macros it uses?
> 
> Certainly not. I already said that I believe the code was perfectly
> understandable even for people knowing nothing about pcase (at least in
> this case).

How can we, who master this already, know what is or isn't
understandable for people who don't?  I don't think we can.  You
cannot "unlearn" something, at least not easily.  The only practical
strategy is not to use devices that are overkill.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-22 17:30                     ` Eli Zaretskii
@ 2020-03-22 18:34                       ` Paul Eggert
  2020-03-22 18:36                       ` Dmitry Gutov
  1 sibling, 0 replies; 30+ messages in thread
From: Paul Eggert @ 2020-03-22 18:34 UTC (permalink / raw)
  To: Eli Zaretskii, Štěpán Němec; +Cc: mattiase, emacs-devel

On 3/22/20 10:30 AM, Eli Zaretskii wrote:
> How can we, who master this already, know what is or isn't
> understandable for people who don't?

I don't use pcase and haven't read pcase's documentation, and I found the pcase 
version to be clear, and to be easier to read than the corresponding cond.

I'm just one data point of course. That being said, let's not go overboard with 
skepticism about pcase's utility.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-22 17:30                     ` Eli Zaretskii
  2020-03-22 18:34                       ` Paul Eggert
@ 2020-03-22 18:36                       ` Dmitry Gutov
  1 sibling, 0 replies; 30+ messages in thread
From: Dmitry Gutov @ 2020-03-22 18:36 UTC (permalink / raw)
  To: Eli Zaretskii, Štěpán Němec; +Cc: mattiase, emacs-devel

On 22.03.2020 19:30, Eli Zaretskii wrote:
> How can we, who master this already, know what is or isn't
> understandable for people who don't?  I don't think we can.  You
> cannot "unlearn" something, at least not easily.  The only practical
> strategy is not to use devices that are overkill.

Most high-level languages have a "case" construct which compares a 
single value with a list of options, even C does (switch ... case). So a 
lot of developers are used to this idea. And even expect it.

And in Emacs Lisp, the closest built-in construct we have is pcase. 
Although cl-case is also an option.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-22 17:06                 ` Eli Zaretskii
@ 2020-03-22 19:39                   ` Mattias Engdegård
  2020-03-22 20:12                     ` Eli Zaretskii
  2020-03-22 20:22                     ` Corwin Brust
  0 siblings, 2 replies; 30+ messages in thread
From: Mattias Engdegård @ 2020-03-22 19:39 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

22 mars 2020 kl. 18.06 skrev Eli Zaretskii <eliz@gnu.org>:

> If local patching is an issue, perhaps you could rewrite the code as a
> module?

No, Fregexp_bytecode cannot be written as a module.

You have made your point abundantly clear, and the code is not important. I'm not going to fight.




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-22 19:39                   ` Mattias Engdegård
@ 2020-03-22 20:12                     ` Eli Zaretskii
  2020-03-22 20:22                     ` Corwin Brust
  1 sibling, 0 replies; 30+ messages in thread
From: Eli Zaretskii @ 2020-03-22 20:12 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: emacs-devel

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sun, 22 Mar 2020 20:39:27 +0100
> Cc: emacs-devel@gnu.org
> 
> You have made your point abundantly clear, and the code is not important. I'm not going to fight.

FWIW, I had no intention to fight, I hoped we could add a useful
feature to Emacs.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Regexp bytecode disassembler
  2020-03-22 19:39                   ` Mattias Engdegård
  2020-03-22 20:12                     ` Eli Zaretskii
@ 2020-03-22 20:22                     ` Corwin Brust
  1 sibling, 0 replies; 30+ messages in thread
From: Corwin Brust @ 2020-03-22 20:22 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: Eli Zaretskii, Emacs developers

Forgive my sticking my nose in.

On Sun, Mar 22, 2020 at 2:39 PM Mattias Engdegård <mattiase@acm.org> wrote:
>
> 22 mars 2020 kl. 18.06 skrev Eli Zaretskii <eliz@gnu.org>:
>
> > If local patching is an issue, perhaps you could rewrite the code as a
> > module?
>
> No, Fregexp_bytecode cannot be written as a module.
>
> You have made your point abundantly clear, and the code is not important. I'm not going to fight.

I've been holding my breathing reading this thread in order and am I'm
devastated to see this at the moment at the conclusion.

I think this feature is really cool and very close to being aligned in
each detail.

Would it help to have someone else work at adding some information to
the internals section?

FWIW, I think this type of feature could really help people who
understand some c and some elisp to better understand how their regex
code is treated by Emacs internally.  More importantly, as people do
this and share their experiences we'll gain more understanding of the
Emacs C and elisp sources and become more able to participate in
conversations and development on this list and elsewhere.

While there are still a few points of contention that really do need
alignment, I think most of these decisions are simple to understand,
if not to make.   For me, the answers feel obvious too but I'm very,
very green.

* What should be autoloaded, if anything?

For me its nbd for to (require ) it so to be able to bind the interactive.

* Should there be an interactive command?  What should be it's inputs?

Yes!  It would be most cool if could handle either string or elisp or
maybe have a prefix arg if magic there is a bad idea?  In any case,
yes.  If that's a poor idea for whatever reason that I guess I'd side
with with Eli's view in that string is what most will expect (thus
easier to prompt for) even though I also would want to use `rx' to
make such strings.\

* What should be the documentation?

Following my own views, this is easy.  Document the interactive
command with a trail of giger-snaps inviting the reader into the elisp
and c sources, where the extra commentary already developed may indeed
be instructive.

I really appreciate the work on this feature and on this type of
feature, which helps introspect the system and leads one to better
understanding the whole of it.

Could we have another go if it, please?

--
Corwin
612-217-1742
612-298-0615 (fax)
612-695-4276 (mobile)
corwin.brust (skype)
corwin@bru.st

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2020-03-22 20:22 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-03-20 12:27 Regexp bytecode disassembler Mattias Engdegård
2020-03-20 12:58 ` Andreas Schwab
2020-03-20 14:34 ` Eli Zaretskii
2020-03-21 16:52   ` Mattias Engdegård
2020-03-21 19:19     ` Eli Zaretskii
2020-03-21 20:16       ` Štěpán Němec
2020-03-21 20:30         ` Eli Zaretskii
2020-03-21 20:40           ` Mattias Engdegård
2020-03-21 20:44           ` Štěpán Němec
2020-03-22 14:12             ` Eli Zaretskii
2020-03-22 14:43               ` Štěpán Němec
2020-03-22 16:55                 ` Eli Zaretskii
2020-03-22 17:16                   ` Štěpán Němec
2020-03-22 17:30                     ` Eli Zaretskii
2020-03-22 18:34                       ` Paul Eggert
2020-03-22 18:36                       ` Dmitry Gutov
2020-03-21 20:50           ` Dmitry Gutov
2020-03-21 23:58           ` Drew Adams
2020-03-22  0:02             ` Drew Adams
2020-03-21 20:37       ` Mattias Engdegård
2020-03-22  3:28         ` Eli Zaretskii
2020-03-22  9:23           ` Mattias Engdegård
2020-03-22 10:37             ` Eli Zaretskii
2020-03-22 15:24               ` Mattias Engdegård
2020-03-22 17:06                 ` Eli Zaretskii
2020-03-22 19:39                   ` Mattias Engdegård
2020-03-22 20:12                     ` Eli Zaretskii
2020-03-22 20:22                     ` Corwin Brust
2020-03-20 15:39 ` Pip Cet
2020-03-21 16:56   ` Mattias Engdegård

Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/emacs.git
	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.