diff --git a/doc/lispref/elisp.texi b/doc/lispref/elisp.texi index a3d1d80408..6440728541 100644 --- a/doc/lispref/elisp.texi +++ b/doc/lispref/elisp.texi @@ -222,6 +222,7 @@ Top * Non-ASCII Characters:: Non-ASCII text in buffers and strings. * Searching and Matching:: Searching buffers for strings or regexps. * Syntax Tables:: The syntax table controls word and list parsing. +* Parsing Expression Grammars:: Parsing buffer text. * Abbrevs:: How Abbrev mode works, and its data structures. * Threads:: Concurrency in Emacs Lisp. @@ -1703,6 +1704,7 @@ Top @include searching.texi @include syntax.texi +@include peg.texi @include abbrevs.texi @include threads.texi @include processes.texi diff --git a/doc/lispref/peg.texi b/doc/lispref/peg.texi new file mode 100644 index 0000000000..ec3962d7bf --- /dev/null +++ b/doc/lispref/peg.texi @@ -0,0 +1,314 @@ +@c -*-texinfo-*- +@c This is part of the GNU Emacs Lisp Reference Manual. +@c Copyright (C) 1990--1995, 1998--1999, 2001--2022 Free Software +@c Foundation, Inc. +@c See the file elisp.texi for copying conditions. +@node Parsing Expression Grammars +@chapter Parsing Expression Grammars +@cindex text parsing + + Emacs Lisp provide several tools for parsing and matching text, from +regular expressions (@pxref{Regular Expressions}) to full @acronym{LL} +grammar parsers (@pxref{Top,, Bovine parser development, bovine}). +@dfn{Parsing Expression Grammars} (@acronym{PEG}) are another approach +to text parsing that offer more structure and composibility than +regular expressions, but less complexity than context-free grammars. + +A @acronym{PEG} parser is defined as a list of named rules, each of +which match text patterns, and/or contain references to other rules. +Parsing is initiated with the function @code{peg-run} or the macro +@code{peg-parse}, and parses text after point in the current buffer, +using a given set of rules. + +The definition of each rule is referred to as a @dfn{parsing +expression} (@acronym{PEX}), and can consist of a literal string, a +regexp-like character range or set, a peg-specific construct +resembling an elisp function call, a reference to another rule, or a +combination of any of these. A grammar is expressed as a set of rules +in which one rule is typically treated as a ``top-level'' or +``entry-point'' rule. For instance: + +@example +@group +((number sign digit (* digit)) + (sign (or "+" "-" "")) + (digit [0-9])) +@end group +@end example + +The above grammar could be used directly in a call to +@code{peg-parse}, in which the first rule is considered the +``entry-point'' rule: + +@example +(peg-parse + ((number sign digit (* digit)) + (sign (or "+" "-" "")) + (digit [0-9]))) +@end example + +Or set as the value of a variable, and the variable used in a +combination of calls to @code{with-peg-rules} and @code{peg-run}, +where the ``entry-point'' rule is given explicitly: + +@example +(defvar number-grammar + '((number sign digit (* digit)) + (sign (or "+" "-" "")) + (digit [0-9]))) + +(with-peg-rules number-grammar + (peg-run (peg number))) +@end example + +By default, calls to @code{peg-run} or @code{peg-parse} produce no +output: parsing simply moves point. In order to return or otherwise +act upon parsed strings, rules can include @dfn{actions}, see +@xref{Parsing Actions} for more information. + +Individual rules can also be defined using a more @code{defun}-like +syntax, using the macro @code{define-peg-rule}: + +@example +(define-peg-rule digit () + [0-9]) +@end example + +This allows the rule to be referred to by name within calls to +@code{peg-run} or @code{peg-parse} elsewhere, and also allows the use +of function arguments in the rule body. + +@node PEX Definitions +@section PEX Definitions + +Parsing expressions can be defined using the following syntax: + +@table @code +@item (and E1 E2 ...) +A sequence of PEXs that must all be matched. The @code{and} form is +optional and implicit. + +@item (or E1 E2 ...) +Prioritized choices, meaning that, as in Elisp, the choices are tried +in order, and the first successful match is used. + +@item (any) +Matches any single character, as the regexp ``.''. + +@item "abc" +A literal string. + +@item (char C) +A single character, as an Elisp character literal. + +@item (* E) +Zero or more of an expression, as the regexp ``*''. + +@item (+ E) +One or more of an expression, as the regexp ``+''. + +@item (opt E) +Zero or one of an expression, as the regexp ``?''. + +@item SYMBOL +A symbol representing a previously-define PEG rule. + +@item (range A B) +The character range between A and B, as the regexp ``[A-B]''. + +@item [a-b "+*" ?x] +A character set, including ranges, literal characters, or strings of +characters. + +@item [ascii cntrl] +A list of named character classes (see below). + +@item (syntax-class NAME) +A single syntax class. + +@item (null) +The empty string. +@end table + +The following expressions are used as anchors -- they do not move +point. + +@table @code +@item (bob) +Beginning of buffer. + +@item (eob) +End of buffer. + +@item (bol) +Beginning of line. + +@item (eol) +End of line. + +@item (bow) +Beginning of word. + +@item (eow) +End of word. + +@item (bos) +Beginning of symbol. + +@item (eos) +End of symbol. +@end table + +The following expressions are used as booleans, to constrain matching +(@pxref{Writing PEG Rules}), and do not move point. + +@table @code +@item (not E) +@item (if E) +@item (guard EXP) +@end table + +@vindex peg-char-classes +Named character classes include the following: + +@itemize +@item ascii +@item alnum +@item alpha +@item blank +@item cntrl +@item digit +@item graph +@item lower +@item multibyte +@item nonascii +@item print +@item punct +@item space +@item unibyte +@item upper +@item word +@item xdigit +@end itemize + +@node Parsing Actions +@section Parsing Actions + +By default the process of parsing simply moves point in the current +buffer, ultimately returning @code{t} if the parsing succeeds, and +@code{nil} if it doesn't. It's also possible to define ``actions'' +that can run arbitrary Elisp at certain points during parsing. These +actions can affect something called the @dfn{parsing stack}: a list of +values built up during the course of parsing. If the stack is +non-@code{nil} at the end of parsing, it is returned as the final +value of the parsing process. + +Actions can be added anywhere in the definition of a rule. They are +distinguished from parsing expressions by an initial backquote +(@samp{`}), followed by a parenthetical form that must contain a pair +of hyphens (@samp{--}) somewhere within it. Symbols to the left of +the hyphens are bound to values popped from the stack (they are +somewhat analogous to the argument list in a lambda). Values produced +by code to the right are pushed to the stack (analogous to the return +value of the lambda). For instance, the previous grammar can be +augmented with actions to return the parsed number as an actual +integer: + +@example +(with-peg-rules ((number sign digit (* digit + `(a b -- (+ (* a 10) b))) + `(sign val -- (* sign val))) + (sign (or (and "+" `(-- 1)) + (and "-" `(-- -1)) + (and "" `(-- 1)))) + (digit [0-9] `(-- (- (char-before) ?0)))) + (peg-run (peg number))) +@end example + +There must be values on the stack before they can be popped and +returned. An action with no left-hand terms will only push values to +the stack; an action with no right-hand terms will consume (and +discard) values from the stack. + +To return the string matched by a PEX (instead of simply moving point +over it), a rule like this can be used: + +@example +(one-word + `(-- (point)) + (+ [word]) + `(start -- (buffer-substring start (point)))) +@end example + +The first action pushes the initial value of point to the stack. The +intervening PEX moves point over the next word. The second action pops +the previous value from the stack (binding it to the variable +@code{start}), and uses that value to extract a substring from the +buffer and push it to the stack. This pattern is so common that +peg.el provides a shorthand function that does exactly the above, +along with a few other shorthands for common scenarios: + +@table @code +@item (substring E) +Match PEX E and push the matched string to the stack. + +@item (region E) +Match E and push the start and end positions of the matched region to +the stack. + +@item (replace E "repl") +Match E and replaced the matched region with the string "repl". + +@item (list E) +Match E, collect all values produced by E (and its sub-expressions) +into a list, and push that list to the stack. +@end table + +It is up to the grammar author to keep track of which rules and +sub-rules push values to the stack, and the state of the stack at any +given point in the parsing. If an action pops values from an empty +stack, the symbols will be bound to @code{nil}. + +@node Writing PEG Rules +@section Writing PEG Rules + +Something to be aware of when writing PEG rules is that they are +greedy. Rules which consume a variable amount of text will always +consume the maximum amount possible, even if that causes a rule that +might otherwise have matched to fail later on. For instance, this +rule will never succeed: + +@example +(forest (+ "tree" (* [blank])) "tree" (eol)) +@end example + +The @acronym{PEX} @code{(+ "tree" (* [blank]))} will consume all +repetitions of the word ``tree'', leaving none to match the final +@code{"tree"}. + +In these situations, the desired result can be obtained by using +predicates and guards -- namely the @code{not}, @code{if} and +@code{guard} expressions -- to restrict behavior. For instance: + +@example +(forest (+ "tree" (* [blank])) (not (eol)) "tree" (eol)) +@end example + +The @code{if} and @code{not} operators accept a parsing expression and +interpret it as a boolean, without moving point. The contents of a +@code{guard} operator are evaluated as regular Elisp (not a +@acronym{PEX}) and should return a boolean value. A @code{nil} value +causes the match to fail. + +Another potentially unexpected behavior is that parsing will move +point as far as possible, even if the parsing ultimately fails. This +rule: + +@example +(end-game "game" (eob)) +@end example + +when run in a buffer containing the text ``game over'' after point, +will move point to just after ``game'' and halt parsing, returning +@code{nil}. Successful parsing will always return @code{t}, or the +contexts of the parsing stack.