unofficial mirror of emacs-devel@gnu.org

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
 @c -*-texinfo-*-
@c This is part of the GNU Emacs Lisp Reference Manual.
@c Copyright (C) 1990--1995, 1998--1999, 2001--2022 Free Software
@c Foundation, Inc.
@c See the file elisp.texi for copying conditions.
@node Parsing Expression Grammars
@chapter Parsing Expression Grammars
@cindex text parsing

  Emacs Lisp provide several tools for parsing and matching text, from
regular expressions (@pxref{Regular Expressions}) to full @acronym{LL}
grammar parsers (@pxref{Top,, Bovine parser development, bovine}).
@dfn{Parsing Expression Grammars} (@acronym{PEG}) are another approach
to text parsing that offer more structure and composibility than
regular expressions, but less complexity than context-free grammars.

A @acronym{PEG} parser is defined as a list of named rules, each of
which match text patterns, and/or contain references to other rules.
Parsing is initiated with the function @code{peg-run} or the macro
@code{peg-parse}, and parses text after point in the current buffer,
using a given set of rules.

The definition of each rule is referred to as a @dfn{parsing
expression} (@acronym{PEX}), and can consist of a literal string, a
regexp-like character range or set, a peg-specific construct
resembling an elisp function call, a reference to another rule, or a
combination of any of these.  A grammar is expressed as a set of rules
in which one rule is typically treated as a ``top-level'' or
``entry-point'' rule.  For instance:

@example
@group
((number sign digit (* digit))
 (sign   (or "+" "-" ""))
 (digit  [0-9]))
@end group
@end example

The above grammar could be used directly in a call to
@code{peg-parse}, in which the first rule is considered the
``entry-point'' rule:

@example
(peg-parse
  ((number sign digit (* digit))
   (sign   (or "+" "-" ""))
   (digit  [0-9])))
@end example

Or set as the value of a variable, and the variable used in a
combination of calls to @code{with-peg-rules} and @code{peg-run},
where the ``entry-point'' rule is given explicitly:

@example
(defvar number-grammar
        '((number sign digit (* digit))
          (sign (or "+" "-" ""))
          (digit [0-9])))

(with-peg-rules number-grammar
  (peg-run (peg number)))
@end example

By default, calls to @code{peg-run} or @code{peg-parse} produce no
output: parsing simply moves point.  In order to return or otherwise
act upon parsed strings, rules can include @dfn{actions}, see
@xref{Parsing Actions} for more information.

Individual rules can also be defined using a more @code{defun}-like
syntax, using the macro @code{define-peg-rule}:

@example
(define-peg-rule digit ()
  [0-9])
@end example

This allows the rule to be referred to by name within calls to
@code{peg-run} or @code{peg-parse} elsewhere, and also allows the use
of function arguments in the rule body.

@node PEX Definitions
@section PEX Definitions

Parsing expressions can be defined using the following syntax:

@table @code
@item (and E1 E2 ...)
A sequence of PEXs that must all be matched.  The @code{and} form is
optional and implicit.

@item (or E1 E2 ...)
Prioritized choices, meaning that, as in Elisp, the choices are tried
in order, and the first successful match is used.

@item (any)
Matches any single character, as the regexp ``.''.

@item "abc"
A literal string.

@item (char C)
A single character, as an Elisp character literal.

@item (* E)
Zero or more of an expression, as the regexp ``*''.

@item (+ E)
One or more of an expression, as the regexp ``+''.

@item (opt E)
Zero or one of an expression, as the regexp ``?''.

@item SYMBOL
A symbol representing a previously-define PEG rule.

@item (range A B)
The character range between A and B, as the regexp ``[A-B]''.

@item [a-b "+*" ?x]
A character set, including ranges, literal characters, or strings of
characters.

@item [ascii cntrl]
A list of named character classes (see below).

@item (syntax-class NAME)
A single syntax class.

@item (null)
The empty string.
@end table

The following expressions are used as anchors -- they do not move
point.

@table @code
@item (bob)
Beginning of buffer.

@item (eob)
End of buffer.

@item (bol)
Beginning of line.

@item (eol)
End of line.

@item (bow)
Beginning of word.

@item (eow)
End of word.

@item (bos)
Beginning of symbol.

@item (eos)
End of symbol.
@end table

The following expressions are used as booleans, to constrain matching
(@pxref{Writing PEG Rules}), and do not move point.

@table @code
@item (not E)
@item (if E)
@item (guard EXP)		
@end table

@vindex peg-char-classes
Named character classes include the following:

@itemize
@item ascii
@item alnum
@item alpha
@item blank
@item cntrl
@item digit
@item graph
@item lower
@item multibyte
@item nonascii
@item print
@item punct
@item space
@item unibyte
@item upper
@item word
@item xdigit
@end itemize

@node Parsing Actions
@section Parsing Actions

By default the process of parsing simply moves point in the current
buffer, ultimately returning @code{t} if the parsing succeeds, and
@code{nil} if it doesn't.  It's also possible to define ``actions''
that can run arbitrary Elisp at certain points during parsing.  These
actions can affect something called the @dfn{parsing stack}: a list of
values built up during the course of parsing.  If the stack is
non-@code{nil} at the end of parsing, it is returned as the final
value of the parsing process.

Actions can be added anywhere in the definition of a rule.  They are
distinguished from parsing expressions by an initial backquote
(@samp{`}), followed by a parenthetical form that must contain a pair
of hyphens (@samp{--}) somewhere within it.  Symbols to the left of
the hyphens are bound to values popped from the stack (they are
somewhat analogous to the argument list in a lambda).  Values produced
by code to the right are pushed to the stack (analogous to the return
value of the lambda).  For instance, the previous grammar can be
augmented with actions to return the parsed number as an actual
integer:

@example
(with-peg-rules ((number sign digit (* digit
                                       `(a b -- (+ (* a 10) b)))
                         `(sign val -- (* sign val)))
                 (sign (or (and "+" `(-- 1))
                           (and "-" `(-- -1))
                           (and ""  `(-- 1))))
                 (digit [0-9] `(-- (- (char-before) ?0))))
  (peg-run (peg number)))
@end example

There must be values on the stack before they can be popped and
returned.  An action with no left-hand terms will only push values to
the stack; an action with no right-hand terms will consume (and
discard) values from the stack.

To return the string matched by a PEX (instead of simply moving point
over it), a rule like this can be used:

@example
(one-word
  `(-- (point))
  (+ [word])
  `(start -- (buffer-substring start (point))))
@end example

The first action pushes the initial value of point to the stack.  The
intervening PEX moves point over the next word. The second action pops
the previous value from the stack (binding it to the variable
@code{start}), and uses that value to extract a substring from the
buffer and push it to the stack.  This pattern is so common that
peg.el provides a shorthand function that does exactly the above,
along with a few other shorthands for common scenarios:

@table @code
@item (substring E)
Match PEX E and push the matched string to the stack.

@item (region E)
Match E and push the start and end positions of the matched region to
the stack.

@item (replace E "repl")
Match E and replaced the matched region with the string "repl".

@item (list E)
Match E, collect all values produced by E (and its sub-expressions)
into a list, and push that list to the stack.
@end table

It is up to the grammar author to keep track of which rules and
sub-rules push values to the stack, and the state of the stack at any
given point in the parsing.  If an action pops values from an empty
stack, the symbols will be bound to @code{nil}.

@node Writing PEG Rules
@section Writing PEG Rules

Something to be aware of when writing PEG rules is that they are
greedy.  Rules which consume a variable amount of text will always
consume the maximum amount possible, even if that causes a rule that
might otherwise have matched to fail later on.  For instance, this
rule will never succeed:

@example
(forest (+ "tree" (* [blank])) "tree" (eol))
@end example

The @acronym{PEX} @code{(+ "tree" (* [blank]))} will consume all
repetitions of the word ``tree'', leaving none to match the final
@code{"tree"}.

In these situations, the desired result can be obtained by using
predicates and guards -- namely the @code{not}, @code{if} and
@code{guard} expressions -- to restrict behavior.  For instance:

@example
(forest (+ "tree" (* [blank])) (not (eol)) "tree" (eol))
@end example

The @code{if} and @code{not} operators accept a parsing expression and
interpret it as a boolean, without moving point. The contents of a
@code{guard} operator are evaluated as regular Elisp (not a
@acronym{PEX}) and should return a boolean value.  A @code{nil} value
causes the match to fail.

Another potentially unexpected behavior is that parsing will move
point as far as possible, even if the parsing ultimately fails.  This
rule:

@example
(end-game "game" (eob))
@end example

when run in a buffer containing the text ``game over'' after point,
will move point to just after ``game'' and halt parsing, returning
@code{nil}.  Successful parsing will always return @code{t}, or the
contexts of the parsing stack.