From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.io!.POSTED.ciao.gmane.io!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.devel
Subject: Re: Regexp bytecode disassembler
Date: Fri, 20 Mar 2020 16:34:16 +0200
Message-ID: <83mu8bdriv.fsf@gnu.org>
References: <4201DF24-BCC4-4C08-9857-38207B7C10B4@acm.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Info: ciao.gmane.io; posting-host="ciao.gmane.io:159.69.161.202";
	logging-data="62706"; mail-complaints-to="usenet@ciao.gmane.io"
Cc: emacs-devel@gnu.org
To: Mattias =?utf-8?Q?Engdeg=C3=A5rd?= <mattiase@acm.org>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Fri Mar 20 15:35:17 2020
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane-mx.org
Original-Received: from lists.gnu.org ([209.51.188.17])
	by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
	(Exim 4.92)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>)
	id 1jFIk0-000G9i-9q
	for ged-emacs-devel@m.gmane-mx.org; Fri, 20 Mar 2020 15:35:16 +0100
Original-Received: from localhost ([::1]:53604 helo=lists1p.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.90_1)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>)
	id 1jFIjz-00024W-9Z
	for ged-emacs-devel@m.gmane-mx.org; Fri, 20 Mar 2020 10:35:15 -0400
Original-Received: from eggs.gnu.org ([2001:470:142:3::10]:49627)
 by lists.gnu.org with esmtp (Exim 4.90_1)
 (envelope-from <eliz@gnu.org>) id 1jFIjM-0001CZ-6R
 for emacs-devel@gnu.org; Fri, 20 Mar 2020 10:34:37 -0400
Original-Received: from fencepost.gnu.org ([2001:470:142:3::e]:51620)
 by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from <eliz@gnu.org>)
 id 1jFIjL-0006ys-UQ; Fri, 20 Mar 2020 10:34:35 -0400
Original-Received: from [176.228.60.248] (port=3258 helo=home-c4e4a596f7)
 by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256)
 (Exim 4.82) (envelope-from <eliz@gnu.org>)
 id 1jFIjJ-0003MK-1U; Fri, 20 Mar 2020 10:34:35 -0400
In-Reply-To: <4201DF24-BCC4-4C08-9857-38207B7C10B4@acm.org> (message from
 Mattias =?utf-8?Q?Engdeg=C3=A5rd?= on Fri, 20 Mar 2020 13:27:35 +0100)
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
 <mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
 <mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org
Original-Sender: "Emacs-devel"
 <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>
Xref: news.gmane.io gmane.emacs.devel:245603
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/245603>

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Fri, 20 Mar 2020 13:27:35 +0100
> 
> This patch adds a lisp-based regexp bytecode disassembler which is always available without any runtime cost to the regexp engine. It is mainly a tool for maintainers but curious users may find it useful as well. It has already revealed one bug in the regexp compiler, now fixed (f189e5dc10).
> 
> Any objections against it being added (to master)?

Thanks.  I'm far from being an expert on the subject, so my comments
are limited to secondary aspects of this.

First, please document this in NEWS and in the ELisp manual.  IMNSHO,
this feature will be much less useful without documentation.

Second, please document the code more than you did.  I especially miss
any documentation of the regexp bytecode that is being disassembled;
it is IMO sub-optimal to ask people to read the regex C code to glean
that information while trying to understand what the disassembler does
and why.

> * src/search.c (Fregexp_bytecode): New function.
> (syms_of_search): Define the symbol.

Which symbol is that?

> * lisp/emacs-lisp/regexp-disasm.el (regexp-disasm--classes)
> (regexp-disasm--syntax-codes, regexp-disasm, regexp-disassemble): New.

This is a new file, so it is customary just to say "New file" without
listing the functions, which are all new.

> +;; Decode compiled Emacs regexp bytecode and pretty-print.

  "Decode compiled Emacs regexp bytecode, and pretty-print them."

> +(defconst regexp-disasm--classes
> +  [word lower punct space upper multibyte alpha alnum graph print blank]
> +  "Vector of character classes, corresponding to BIT_* in regex-emacs.c.")

This is one place where a more detailed description in the doc string
could be most beneficial.

> +(defconst regexp-disasm--syntax-codes
> +  [whitespace punctuation word symbol
> +   open-parenthesis close-parenthesis expression-prefix string-quote
> +   paired-delimiter escape character-quote comment-start comment-end
> +   inherit comment-delimiter string-delimiter]
> +  "Vector of syntax codes, corresponding to enum syntaxcode in syntax.h
> +but using names from `rx'.")

And this is another.  (Btw, the first line should be a complete
sentence.)

Any way to make sure these two get updated, or at least the build
displays an alert when the C sources are modified without also
updating the above defconst's?  At the very least, we should have
comments in the C sources to the effect that any change there needs an
update here.

> +;;;###autoload
> +(defun regexp-disasm (regexp)

Why do we need to auto-load this?

> +Instructions are on the form (ADDRESS . INSTR) where ADDRESS is the
                    ^^^^^^^^^^^
"of the form"

> +         (read-u16 (lambda (ofs) (+ (aref bc ofs)
> +                                    (ash (aref bc (1+ ofs)) 8))))
> +         (read-u24 (lambda (ofs) (+ (aref bc ofs)
> +                                    (ash (aref bc (+ ofs 1)) 8)
> +                                    (ash (aref bc (+ ofs 2)) 16))))
> +         (read-s16 (lambda (ofs) (let ((x (funcall read-u16 ofs)))
> +                                   (- x (ash (logand x #x8000) 1)))))

Why lambda-forms and not functions (or desfsubst)?

> +         (mb (multibyte-string-p regexp))

Please use more self-describing names of variables.  E.g., how about
multibyte-p in this case?  Likewise regarding "bc".

> +                           (str (if mb
> +                                    (decode-coding-string raw 'utf-8-emacs)
> +                                  raw)))

This call to decode-coding-string needs a comment that explains why
it's needed.

> +               (pcase opcode
> +                 (0 (cons 'no-op 1))
> +                 (1 (cons 'succeed 1))

Is pcase really needed here?  It looks like a simple cond will do.

> +                 (_ (error "bad opcode at ofs %d: 0x%02x" i opcode))))
                                             ^^^
"offset", please, not "ofs".

> +(defun regexp-disassemble (regexp)
> +  "Compile REGEXP and print the disassembled bytecode."

I think the fact that it compiles PATTERN is an implementation
detail.  The real purpose of this command is different.  Can you
propose a better description of that purpose?

> +  (interactive "XRegexp (evaluated): ")

This prompt should do a better job describing what kind of input is
expected here.

> +DEFUN ("regexp-bytecode", Fregexp_bytecode, Sregexp_bytecode, 1, 1, 0,
> +       doc: /* Compile REGEXP and return the compiled bytecode.
> +The compiled bytecode is returned as a string; its format is
> +implementation-dependent.  Cached bytecode may be returned if available.  */)

Is this function useful on its own, or is it an internal API?  If the
latter, let's use an internal notation for its name, and let's
document that it's an internal function.

> +  struct regexp_cache *cache_entry = compile_pattern (
> +    regexp,
> +    NULL,
> +    (!NILP (BVAR (current_buffer, case_fold_search))
> +     ? BVAR (current_buffer, case_canon_table) : Qnil),
> +    false,
> +    true);

That's not our style of formatting such function calls.  It is hard to
read.

And why are you using variables related to the current buffer here?
It sounds like something that will cause confusion in subtle cases
down the road.  I'm aware that we do something like that in search
APIs, but (a) that is controversial as well; and (b) this API is
different: it is not intended to search a buffer, and thus what the
current buffer-local settings are is almost completely irrelevant.  I
suggest instead to expose this argument to Lisp, so that callers could
decide what they want there.

Thanks again for working on this.