From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: Eric Abrahamsen Newsgroups: gmane.emacs.devel Subject: [PATCH] Re: Make peg.el a built-in library? Date: Tue, 15 Nov 2022 20:27:56 -0800 Message-ID: <87leobplpv.fsf_-_@ericabrahamsen.net> References: <875yvtbbn3.fsf@ericabrahamsen.net> <877d07a16u.fsf@localhost> <87tu3asg2r.fsf@ericabrahamsen.net> <87edud25ov.fsf@localhost> <87a6511ku0.fsf@ericabrahamsen.net> <87wn85z0zl.fsf@ericabrahamsen.net> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="16608"; mail-complaints-to="usenet@ciao.gmane.io" User-Agent: Gnus/5.13 (Gnus v5.13) To: emacs-devel@gnu.org Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Wed Nov 16 05:29:08 2022 Return-path: Envelope-to: ged-emacs-devel@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1ovA2t-00048X-C7 for ged-emacs-devel@m.gmane-mx.org; Wed, 16 Nov 2022 05:29:07 +0100 Original-Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1ovA20-0006bb-JR; Tue, 15 Nov 2022 23:28:12 -0500 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1ovA1y-0006bK-Iy for emacs-devel@gnu.org; Tue, 15 Nov 2022 23:28:11 -0500 Original-Received: from mail.ericabrahamsen.net ([52.70.2.18]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1ovA1t-0008M0-AC for emacs-devel@gnu.org; Tue, 15 Nov 2022 23:28:09 -0500 Original-Received: from localhost (c-71-197-232-41.hsd1.wa.comcast.net [71.197.232.41]) (Authenticated sender: eric@ericabrahamsen.net) by mail.ericabrahamsen.net (Postfix) with ESMTPSA id 9AB19FA095 for ; Wed, 16 Nov 2022 04:27:56 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ericabrahamsen.net; s=mail; t=1668572876; bh=8RPC/BpPO3hAq7gvlawVwRs+PLL0+qKmvCVjKBKmDgU=; h=From:To:Subject:In-Reply-To:References:Date:From; b=cDXPodhr9cDZGN09fAAbqNCN+w6IZETv4WL2H+FtgbcnW7bJF3L2vA4O/+rondSng I74eK1Lt2d8S2mh+Q5w1vVVcanAuZ5H7gD6k5cserTDdFlOp+R5MO0uhakyqpAwxvJ fc9AZ1+9epW1nGN2x2WIG9nUna+5lHJB9dLD1J68= In-Reply-To: <87wn85z0zl.fsf@ericabrahamsen.net> (Eric Abrahamsen's message of "Tue, 08 Nov 2022 11:42:54 -0800") Received-SPF: pass client-ip=52.70.2.18; envelope-from=eric@ericabrahamsen.net; helo=mail.ericabrahamsen.net X-Spam_score_int: -43 X-Spam_score: -4.4 X-Spam_bar: ---- X-Spam_report: (-4.4 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Xref: news.gmane.io gmane.emacs.devel:299895 Archived-At: --=-=-= Content-Type: text/plain Eric Abrahamsen writes: > writes: > >> On Tue, Nov 08, 2022 at 08:18:15AM -0800, Eric Abrahamsen wrote: >>> Ihor Radchenko writes: >>> >>> > Eric Abrahamsen writes: >>> > >>> >>> Is there any progress merging peg.el to Emacs? >>> >>> I do not see any obvious blockers in the discussion, but the merge never >>> >>> happened? >> >> [...] >> >>> > As the comment in peg.el states, the definitions are adapted from the >>> > original PEG paper [...] >> >>> This is what I was saying in my original message, though: if peg.el is >>> going to go into core, it probably needs more/better docs than code >>> comments and "read this paper". Its likely users will be Elisp library >>> authors like me, who are just trying to free themselves from regexp hell >>> and want a relatively straightforward alternative. >> >> Yes. Coming from regexp they are deceivingly similar but frustratingly >> different. >> >> The best way I found to wrap my head around them is that they are a >> fancy notation for a recursive descent parser. Thus slightly more >> powerful than regexps, but slightly less than a full YACC (i.e. LALR >> or thereabouts). >> >> What is attractive about them is that one can do "full" parsers >> (as long as your grammar is roughly LL(k)) without having to build >> two storey buildings. I guess it takes some practice, though (I >> haven't). >> >> I think comparing them to treesitter is a category error. > > Okay, this is all sounding good. I'm going to read the paper, try to get > my head around all this, and write some docs for peg.el. Okay, here's a first stab. I read the paper, and understood about half of it, which seemed like enough. It was interesting to see that the paper explicitly calls out the exact greedy-matching behavior I'd encountered. I'm sure I've got some of the conventions wrong, here, and it's unfortunate that there's already a manual node called "Expression Parsing", but I don't know what to call this except "Expression Parsing Grammars"... Eric --=-=-= Content-Type: text/x-patch Content-Disposition: attachment; filename=pexmanual.diff diff --git a/doc/lispref/elisp.texi b/doc/lispref/elisp.texi index a3d1d80408..6440728541 100644 --- a/doc/lispref/elisp.texi +++ b/doc/lispref/elisp.texi @@ -222,6 +222,7 @@ Top * Non-ASCII Characters:: Non-ASCII text in buffers and strings. * Searching and Matching:: Searching buffers for strings or regexps. * Syntax Tables:: The syntax table controls word and list parsing. +* Parsing Expression Grammars:: Parsing buffer text. * Abbrevs:: How Abbrev mode works, and its data structures. * Threads:: Concurrency in Emacs Lisp. @@ -1703,6 +1704,7 @@ Top @include searching.texi @include syntax.texi +@include peg.texi @include abbrevs.texi @include threads.texi @include processes.texi diff --git a/doc/lispref/peg.texi b/doc/lispref/peg.texi new file mode 100644 index 0000000000..ec3962d7bf --- /dev/null +++ b/doc/lispref/peg.texi @@ -0,0 +1,314 @@ +@c -*-texinfo-*- +@c This is part of the GNU Emacs Lisp Reference Manual. +@c Copyright (C) 1990--1995, 1998--1999, 2001--2022 Free Software +@c Foundation, Inc. +@c See the file elisp.texi for copying conditions. +@node Parsing Expression Grammars +@chapter Parsing Expression Grammars +@cindex text parsing + + Emacs Lisp provide several tools for parsing and matching text, from +regular expressions (@pxref{Regular Expressions}) to full @acronym{LL} +grammar parsers (@pxref{Top,, Bovine parser development, bovine}). +@dfn{Parsing Expression Grammars} (@acronym{PEG}) are another approach +to text parsing that offer more structure and composibility than +regular expressions, but less complexity than context-free grammars. + +A @acronym{PEG} parser is defined as a list of named rules, each of +which match text patterns, and/or contain references to other rules. +Parsing is initiated with the function @code{peg-run} or the macro +@code{peg-parse}, and parses text after point in the current buffer, +using a given set of rules. + +The definition of each rule is referred to as a @dfn{parsing +expression} (@acronym{PEX}), and can consist of a literal string, a +regexp-like character range or set, a peg-specific construct +resembling an elisp function call, a reference to another rule, or a +combination of any of these. A grammar is expressed as a set of rules +in which one rule is typically treated as a ``top-level'' or +``entry-point'' rule. For instance: + +@example +@group +((number sign digit (* digit)) + (sign (or "+" "-" "")) + (digit [0-9])) +@end group +@end example + +The above grammar could be used directly in a call to +@code{peg-parse}, in which the first rule is considered the +``entry-point'' rule: + +@example +(peg-parse + ((number sign digit (* digit)) + (sign (or "+" "-" "")) + (digit [0-9]))) +@end example + +Or set as the value of a variable, and the variable used in a +combination of calls to @code{with-peg-rules} and @code{peg-run}, +where the ``entry-point'' rule is given explicitly: + +@example +(defvar number-grammar + '((number sign digit (* digit)) + (sign (or "+" "-" "")) + (digit [0-9]))) + +(with-peg-rules number-grammar + (peg-run (peg number))) +@end example + +By default, calls to @code{peg-run} or @code{peg-parse} produce no +output: parsing simply moves point. In order to return or otherwise +act upon parsed strings, rules can include @dfn{actions}, see +@xref{Parsing Actions} for more information. + +Individual rules can also be defined using a more @code{defun}-like +syntax, using the macro @code{define-peg-rule}: + +@example +(define-peg-rule digit () + [0-9]) +@end example + +This allows the rule to be referred to by name within calls to +@code{peg-run} or @code{peg-parse} elsewhere, and also allows the use +of function arguments in the rule body. + +@node PEX Definitions +@section PEX Definitions + +Parsing expressions can be defined using the following syntax: + +@table @code +@item (and E1 E2 ...) +A sequence of PEXs that must all be matched. The @code{and} form is +optional and implicit. + +@item (or E1 E2 ...) +Prioritized choices, meaning that, as in Elisp, the choices are tried +in order, and the first successful match is used. + +@item (any) +Matches any single character, as the regexp ``.''. + +@item "abc" +A literal string. + +@item (char C) +A single character, as an Elisp character literal. + +@item (* E) +Zero or more of an expression, as the regexp ``*''. + +@item (+ E) +One or more of an expression, as the regexp ``+''. + +@item (opt E) +Zero or one of an expression, as the regexp ``?''. + +@item SYMBOL +A symbol representing a previously-define PEG rule. + +@item (range A B) +The character range between A and B, as the regexp ``[A-B]''. + +@item [a-b "+*" ?x] +A character set, including ranges, literal characters, or strings of +characters. + +@item [ascii cntrl] +A list of named character classes (see below). + +@item (syntax-class NAME) +A single syntax class. + +@item (null) +The empty string. +@end table + +The following expressions are used as anchors -- they do not move +point. + +@table @code +@item (bob) +Beginning of buffer. + +@item (eob) +End of buffer. + +@item (bol) +Beginning of line. + +@item (eol) +End of line. + +@item (bow) +Beginning of word. + +@item (eow) +End of word. + +@item (bos) +Beginning of symbol. + +@item (eos) +End of symbol. +@end table + +The following expressions are used as booleans, to constrain matching +(@pxref{Writing PEG Rules}), and do not move point. + +@table @code +@item (not E) +@item (if E) +@item (guard EXP) +@end table + +@vindex peg-char-classes +Named character classes include the following: + +@itemize +@item ascii +@item alnum +@item alpha +@item blank +@item cntrl +@item digit +@item graph +@item lower +@item multibyte +@item nonascii +@item print +@item punct +@item space +@item unibyte +@item upper +@item word +@item xdigit +@end itemize + +@node Parsing Actions +@section Parsing Actions + +By default the process of parsing simply moves point in the current +buffer, ultimately returning @code{t} if the parsing succeeds, and +@code{nil} if it doesn't. It's also possible to define ``actions'' +that can run arbitrary Elisp at certain points during parsing. These +actions can affect something called the @dfn{parsing stack}: a list of +values built up during the course of parsing. If the stack is +non-@code{nil} at the end of parsing, it is returned as the final +value of the parsing process. + +Actions can be added anywhere in the definition of a rule. They are +distinguished from parsing expressions by an initial backquote +(@samp{`}), followed by a parenthetical form that must contain a pair +of hyphens (@samp{--}) somewhere within it. Symbols to the left of +the hyphens are bound to values popped from the stack (they are +somewhat analogous to the argument list in a lambda). Values produced +by code to the right are pushed to the stack (analogous to the return +value of the lambda). For instance, the previous grammar can be +augmented with actions to return the parsed number as an actual +integer: + +@example +(with-peg-rules ((number sign digit (* digit + `(a b -- (+ (* a 10) b))) + `(sign val -- (* sign val))) + (sign (or (and "+" `(-- 1)) + (and "-" `(-- -1)) + (and "" `(-- 1)))) + (digit [0-9] `(-- (- (char-before) ?0)))) + (peg-run (peg number))) +@end example + +There must be values on the stack before they can be popped and +returned. An action with no left-hand terms will only push values to +the stack; an action with no right-hand terms will consume (and +discard) values from the stack. + +To return the string matched by a PEX (instead of simply moving point +over it), a rule like this can be used: + +@example +(one-word + `(-- (point)) + (+ [word]) + `(start -- (buffer-substring start (point)))) +@end example + +The first action pushes the initial value of point to the stack. The +intervening PEX moves point over the next word. The second action pops +the previous value from the stack (binding it to the variable +@code{start}), and uses that value to extract a substring from the +buffer and push it to the stack. This pattern is so common that +peg.el provides a shorthand function that does exactly the above, +along with a few other shorthands for common scenarios: + +@table @code +@item (substring E) +Match PEX E and push the matched string to the stack. + +@item (region E) +Match E and push the start and end positions of the matched region to +the stack. + +@item (replace E "repl") +Match E and replaced the matched region with the string "repl". + +@item (list E) +Match E, collect all values produced by E (and its sub-expressions) +into a list, and push that list to the stack. +@end table + +It is up to the grammar author to keep track of which rules and +sub-rules push values to the stack, and the state of the stack at any +given point in the parsing. If an action pops values from an empty +stack, the symbols will be bound to @code{nil}. + +@node Writing PEG Rules +@section Writing PEG Rules + +Something to be aware of when writing PEG rules is that they are +greedy. Rules which consume a variable amount of text will always +consume the maximum amount possible, even if that causes a rule that +might otherwise have matched to fail later on. For instance, this +rule will never succeed: + +@example +(forest (+ "tree" (* [blank])) "tree" (eol)) +@end example + +The @acronym{PEX} @code{(+ "tree" (* [blank]))} will consume all +repetitions of the word ``tree'', leaving none to match the final +@code{"tree"}. + +In these situations, the desired result can be obtained by using +predicates and guards -- namely the @code{not}, @code{if} and +@code{guard} expressions -- to restrict behavior. For instance: + +@example +(forest (+ "tree" (* [blank])) (not (eol)) "tree" (eol)) +@end example + +The @code{if} and @code{not} operators accept a parsing expression and +interpret it as a boolean, without moving point. The contents of a +@code{guard} operator are evaluated as regular Elisp (not a +@acronym{PEX}) and should return a boolean value. A @code{nil} value +causes the match to fail. + +Another potentially unexpected behavior is that parsing will move +point as far as possible, even if the parsing ultimately fails. This +rule: + +@example +(end-game "game" (eob)) +@end example + +when run in a buffer containing the text ``game over'' after point, +will move point to just after ``game'' and halt parsing, returning +@code{nil}. Successful parsing will always return @code{t}, or the +contexts of the parsing stack. --=-=-=--