From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.ciao.gmane.io!not-for-mail From: Eli Zaretskii Newsgroups: gmane.emacs.devel Subject: Re: Regexp bytecode disassembler Date: Fri, 20 Mar 2020 16:34:16 +0200 Message-ID: <83mu8bdriv.fsf@gnu.org> References: <4201DF24-BCC4-4C08-9857-38207B7C10B4@acm.org> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Injection-Info: ciao.gmane.io; posting-host="ciao.gmane.io:159.69.161.202"; logging-data="62706"; mail-complaints-to="usenet@ciao.gmane.io" Cc: emacs-devel@gnu.org To: Mattias =?utf-8?Q?Engdeg=C3=A5rd?= Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Fri Mar 20 15:35:17 2020 Return-path: Envelope-to: ged-emacs-devel@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1jFIk0-000G9i-9q for ged-emacs-devel@m.gmane-mx.org; Fri, 20 Mar 2020 15:35:16 +0100 Original-Received: from localhost ([::1]:53604 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1jFIjz-00024W-9Z for ged-emacs-devel@m.gmane-mx.org; Fri, 20 Mar 2020 10:35:15 -0400 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]:49627) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1jFIjM-0001CZ-6R for emacs-devel@gnu.org; Fri, 20 Mar 2020 10:34:37 -0400 Original-Received: from fencepost.gnu.org ([2001:470:142:3::e]:51620) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1jFIjL-0006ys-UQ; Fri, 20 Mar 2020 10:34:35 -0400 Original-Received: from [176.228.60.248] (port=3258 helo=home-c4e4a596f7) by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256) (Exim 4.82) (envelope-from ) id 1jFIjJ-0003MK-1U; Fri, 20 Mar 2020 10:34:35 -0400 In-Reply-To: <4201DF24-BCC4-4C08-9857-38207B7C10B4@acm.org> (message from Mattias =?utf-8?Q?Engdeg=C3=A5rd?= on Fri, 20 Mar 2020 13:27:35 +0100) X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.io gmane.emacs.devel:245603 Archived-At: > From: Mattias EngdegÄrd > Date: Fri, 20 Mar 2020 13:27:35 +0100 > > This patch adds a lisp-based regexp bytecode disassembler which is always available without any runtime cost to the regexp engine. It is mainly a tool for maintainers but curious users may find it useful as well. It has already revealed one bug in the regexp compiler, now fixed (f189e5dc10). > > Any objections against it being added (to master)? Thanks. I'm far from being an expert on the subject, so my comments are limited to secondary aspects of this. First, please document this in NEWS and in the ELisp manual. IMNSHO, this feature will be much less useful without documentation. Second, please document the code more than you did. I especially miss any documentation of the regexp bytecode that is being disassembled; it is IMO sub-optimal to ask people to read the regex C code to glean that information while trying to understand what the disassembler does and why. > * src/search.c (Fregexp_bytecode): New function. > (syms_of_search): Define the symbol. Which symbol is that? > * lisp/emacs-lisp/regexp-disasm.el (regexp-disasm--classes) > (regexp-disasm--syntax-codes, regexp-disasm, regexp-disassemble): New. This is a new file, so it is customary just to say "New file" without listing the functions, which are all new. > +;; Decode compiled Emacs regexp bytecode and pretty-print. "Decode compiled Emacs regexp bytecode, and pretty-print them." > +(defconst regexp-disasm--classes > + [word lower punct space upper multibyte alpha alnum graph print blank] > + "Vector of character classes, corresponding to BIT_* in regex-emacs.c.") This is one place where a more detailed description in the doc string could be most beneficial. > +(defconst regexp-disasm--syntax-codes > + [whitespace punctuation word symbol > + open-parenthesis close-parenthesis expression-prefix string-quote > + paired-delimiter escape character-quote comment-start comment-end > + inherit comment-delimiter string-delimiter] > + "Vector of syntax codes, corresponding to enum syntaxcode in syntax.h > +but using names from `rx'.") And this is another. (Btw, the first line should be a complete sentence.) Any way to make sure these two get updated, or at least the build displays an alert when the C sources are modified without also updating the above defconst's? At the very least, we should have comments in the C sources to the effect that any change there needs an update here. > +;;;###autoload > +(defun regexp-disasm (regexp) Why do we need to auto-load this? > +Instructions are on the form (ADDRESS . INSTR) where ADDRESS is the ^^^^^^^^^^^ "of the form" > + (read-u16 (lambda (ofs) (+ (aref bc ofs) > + (ash (aref bc (1+ ofs)) 8)))) > + (read-u24 (lambda (ofs) (+ (aref bc ofs) > + (ash (aref bc (+ ofs 1)) 8) > + (ash (aref bc (+ ofs 2)) 16)))) > + (read-s16 (lambda (ofs) (let ((x (funcall read-u16 ofs))) > + (- x (ash (logand x #x8000) 1))))) Why lambda-forms and not functions (or desfsubst)? > + (mb (multibyte-string-p regexp)) Please use more self-describing names of variables. E.g., how about multibyte-p in this case? Likewise regarding "bc". > + (str (if mb > + (decode-coding-string raw 'utf-8-emacs) > + raw))) This call to decode-coding-string needs a comment that explains why it's needed. > + (pcase opcode > + (0 (cons 'no-op 1)) > + (1 (cons 'succeed 1)) Is pcase really needed here? It looks like a simple cond will do. > + (_ (error "bad opcode at ofs %d: 0x%02x" i opcode)))) ^^^ "offset", please, not "ofs". > +(defun regexp-disassemble (regexp) > + "Compile REGEXP and print the disassembled bytecode." I think the fact that it compiles PATTERN is an implementation detail. The real purpose of this command is different. Can you propose a better description of that purpose? > + (interactive "XRegexp (evaluated): ") This prompt should do a better job describing what kind of input is expected here. > +DEFUN ("regexp-bytecode", Fregexp_bytecode, Sregexp_bytecode, 1, 1, 0, > + doc: /* Compile REGEXP and return the compiled bytecode. > +The compiled bytecode is returned as a string; its format is > +implementation-dependent. Cached bytecode may be returned if available. */) Is this function useful on its own, or is it an internal API? If the latter, let's use an internal notation for its name, and let's document that it's an internal function. > + struct regexp_cache *cache_entry = compile_pattern ( > + regexp, > + NULL, > + (!NILP (BVAR (current_buffer, case_fold_search)) > + ? BVAR (current_buffer, case_canon_table) : Qnil), > + false, > + true); That's not our style of formatting such function calls. It is hard to read. And why are you using variables related to the current buffer here? It sounds like something that will cause confusion in subtle cases down the road. I'm aware that we do something like that in search APIs, but (a) that is controversial as well; and (b) this API is different: it is not intended to search a buffer, and thus what the current buffer-local settings are is almost completely irrelevant. I suggest instead to expose this argument to Lisp, so that callers could decide what they want there. Thanks again for working on this.