all messages for Emacs-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed
From: Eli Zaretskii <eliz@gnu.org>
To: "Mattias Engdegård" <mattiase@acm.org>
Cc: emacs-devel@gnu.org
Subject: Re: Regexp bytecode disassembler
Date: Fri, 20 Mar 2020 16:34:16 +0200	[thread overview]
Message-ID: <83mu8bdriv.fsf@gnu.org> (raw)
In-Reply-To: <4201DF24-BCC4-4C08-9857-38207B7C10B4@acm.org> (message from Mattias Engdegård on Fri, 20 Mar 2020 13:27:35 +0100)

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Fri, 20 Mar 2020 13:27:35 +0100
> 
> This patch adds a lisp-based regexp bytecode disassembler which is always available without any runtime cost to the regexp engine. It is mainly a tool for maintainers but curious users may find it useful as well. It has already revealed one bug in the regexp compiler, now fixed (f189e5dc10).
> 
> Any objections against it being added (to master)?

Thanks.  I'm far from being an expert on the subject, so my comments
are limited to secondary aspects of this.

First, please document this in NEWS and in the ELisp manual.  IMNSHO,
this feature will be much less useful without documentation.

Second, please document the code more than you did.  I especially miss
any documentation of the regexp bytecode that is being disassembled;
it is IMO sub-optimal to ask people to read the regex C code to glean
that information while trying to understand what the disassembler does
and why.

> * src/search.c (Fregexp_bytecode): New function.
> (syms_of_search): Define the symbol.

Which symbol is that?

> * lisp/emacs-lisp/regexp-disasm.el (regexp-disasm--classes)
> (regexp-disasm--syntax-codes, regexp-disasm, regexp-disassemble): New.

This is a new file, so it is customary just to say "New file" without
listing the functions, which are all new.

> +;; Decode compiled Emacs regexp bytecode and pretty-print.

  "Decode compiled Emacs regexp bytecode, and pretty-print them."

> +(defconst regexp-disasm--classes
> +  [word lower punct space upper multibyte alpha alnum graph print blank]
> +  "Vector of character classes, corresponding to BIT_* in regex-emacs.c.")

This is one place where a more detailed description in the doc string
could be most beneficial.

> +(defconst regexp-disasm--syntax-codes
> +  [whitespace punctuation word symbol
> +   open-parenthesis close-parenthesis expression-prefix string-quote
> +   paired-delimiter escape character-quote comment-start comment-end
> +   inherit comment-delimiter string-delimiter]
> +  "Vector of syntax codes, corresponding to enum syntaxcode in syntax.h
> +but using names from `rx'.")

And this is another.  (Btw, the first line should be a complete
sentence.)

Any way to make sure these two get updated, or at least the build
displays an alert when the C sources are modified without also
updating the above defconst's?  At the very least, we should have
comments in the C sources to the effect that any change there needs an
update here.

> +;;;###autoload
> +(defun regexp-disasm (regexp)

Why do we need to auto-load this?

> +Instructions are on the form (ADDRESS . INSTR) where ADDRESS is the
                    ^^^^^^^^^^^
"of the form"

> +         (read-u16 (lambda (ofs) (+ (aref bc ofs)
> +                                    (ash (aref bc (1+ ofs)) 8))))
> +         (read-u24 (lambda (ofs) (+ (aref bc ofs)
> +                                    (ash (aref bc (+ ofs 1)) 8)
> +                                    (ash (aref bc (+ ofs 2)) 16))))
> +         (read-s16 (lambda (ofs) (let ((x (funcall read-u16 ofs)))
> +                                   (- x (ash (logand x #x8000) 1)))))

Why lambda-forms and not functions (or desfsubst)?

> +         (mb (multibyte-string-p regexp))

Please use more self-describing names of variables.  E.g., how about
multibyte-p in this case?  Likewise regarding "bc".

> +                           (str (if mb
> +                                    (decode-coding-string raw 'utf-8-emacs)
> +                                  raw)))

This call to decode-coding-string needs a comment that explains why
it's needed.

> +               (pcase opcode
> +                 (0 (cons 'no-op 1))
> +                 (1 (cons 'succeed 1))

Is pcase really needed here?  It looks like a simple cond will do.

> +                 (_ (error "bad opcode at ofs %d: 0x%02x" i opcode))))
                                             ^^^
"offset", please, not "ofs".

> +(defun regexp-disassemble (regexp)
> +  "Compile REGEXP and print the disassembled bytecode."

I think the fact that it compiles PATTERN is an implementation
detail.  The real purpose of this command is different.  Can you
propose a better description of that purpose?

> +  (interactive "XRegexp (evaluated): ")

This prompt should do a better job describing what kind of input is
expected here.

> +DEFUN ("regexp-bytecode", Fregexp_bytecode, Sregexp_bytecode, 1, 1, 0,
> +       doc: /* Compile REGEXP and return the compiled bytecode.
> +The compiled bytecode is returned as a string; its format is
> +implementation-dependent.  Cached bytecode may be returned if available.  */)

Is this function useful on its own, or is it an internal API?  If the
latter, let's use an internal notation for its name, and let's
document that it's an internal function.

> +  struct regexp_cache *cache_entry = compile_pattern (
> +    regexp,
> +    NULL,
> +    (!NILP (BVAR (current_buffer, case_fold_search))
> +     ? BVAR (current_buffer, case_canon_table) : Qnil),
> +    false,
> +    true);

That's not our style of formatting such function calls.  It is hard to
read.

And why are you using variables related to the current buffer here?
It sounds like something that will cause confusion in subtle cases
down the road.  I'm aware that we do something like that in search
APIs, but (a) that is controversial as well; and (b) this API is
different: it is not intended to search a buffer, and thus what the
current buffer-local settings are is almost completely irrelevant.  I
suggest instead to expose this argument to Lisp, so that callers could
decide what they want there.

Thanks again for working on this.



  parent reply	other threads:[~2020-03-20 14:34 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-20 12:27 Regexp bytecode disassembler Mattias Engdegård
2020-03-20 12:58 ` Andreas Schwab
2020-03-20 14:34 ` Eli Zaretskii [this message]
2020-03-21 16:52   ` Mattias Engdegård
2020-03-21 19:19     ` Eli Zaretskii
2020-03-21 20:16       ` Štěpán Němec
2020-03-21 20:30         ` Eli Zaretskii
2020-03-21 20:40           ` Mattias Engdegård
2020-03-21 20:44           ` Štěpán Němec
2020-03-22 14:12             ` Eli Zaretskii
2020-03-22 14:43               ` Štěpán Němec
2020-03-22 16:55                 ` Eli Zaretskii
2020-03-22 17:16                   ` Štěpán Němec
2020-03-22 17:30                     ` Eli Zaretskii
2020-03-22 18:34                       ` Paul Eggert
2020-03-22 18:36                       ` Dmitry Gutov
2020-03-21 20:50           ` Dmitry Gutov
2020-03-21 23:58           ` Drew Adams
2020-03-22  0:02             ` Drew Adams
2020-03-21 20:37       ` Mattias Engdegård
2020-03-22  3:28         ` Eli Zaretskii
2020-03-22  9:23           ` Mattias Engdegård
2020-03-22 10:37             ` Eli Zaretskii
2020-03-22 15:24               ` Mattias Engdegård
2020-03-22 17:06                 ` Eli Zaretskii
2020-03-22 19:39                   ` Mattias Engdegård
2020-03-22 20:12                     ` Eli Zaretskii
2020-03-22 20:22                     ` Corwin Brust
2020-03-20 15:39 ` Pip Cet
2020-03-21 16:56   ` Mattias Engdegård

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=83mu8bdriv.fsf@gnu.org \
    --to=eliz@gnu.org \
    --cc=emacs-devel@gnu.org \
    --cc=mattiase@acm.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/emacs.git
	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.