* Make peg.el a built-in library?
@ 2021-08-25 18:52 Eric Abrahamsen
  2021-08-26  6:17 ` Eli Zaretskii
                   ` (3 more replies)
  0 siblings, 4 replies; 100+ messages in thread
From: Eric Abrahamsen @ 2021-08-25 18:52 UTC (permalink / raw)
  To: emacs-devel; +Cc: Stefan Monnier

Hi all,

In my on-again-off-again quest to not have to write text parsers myself,
I was pointed towards the PEG library (in ELPA), which does pretty much
exactly what I want (Parsing Expression Grammars).

Would the maintainers consider moving this into Emacs proper? I ask
mostly because this would be very useful to have in Gnus, both to
replace the home-made parser in gnus-search.el, and I would hope to
parse eg IMAP server responses more fully and reliably.

I pinged the original author Helmut Eller, and he said the library
pretty much belongs to Stefan now, though he'd be happy to have it in
core. He also said he didn't think it was the most ergonomic or
efficient thing out there. It looks fine to me, but I haven't
benchmarked it.

I understand it might be redundant with bovine/wisent, but TBH I've
never been able to make them work at all.

Anyway, plenty of reasons to say no, but I thought I'd check!

Thanks,
Eric

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library?
  2021-08-25 18:52 Make peg.el a built-in library? Eric Abrahamsen
@ 2021-08-26  6:17 ` Eli Zaretskii
  2021-08-26 15:34   ` Eric Abrahamsen
  2021-08-26 17:02 ` Adam Porter
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 100+ messages in thread
From: Eli Zaretskii @ 2021-08-26  6:17 UTC (permalink / raw)
  To: Eric Abrahamsen; +Cc: monnier, emacs-devel

> From: Eric Abrahamsen <eric@ericabrahamsen.net>
> Date: Wed, 25 Aug 2021 11:52:00 -0700
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>
>
> In my on-again-off-again quest to not have to write text parsers myself,
> I was pointed towards the PEG library (in ELPA), which does pretty much
> exactly what I want (Parsing Expression Grammars).
>
> Would the maintainers consider moving this into Emacs proper? I ask
> mostly because this would be very useful to have in Gnus, both to
> replace the home-made parser in gnus-search.el, and I would hope to
> parse eg IMAP server responses more fully and reliably.

Fine with me, but please update the (outdated) Wiki page to say where
the latest peg.el is, when it is imported.

> I understand it might be redundant with bovine/wisent, but TBH I've
> never been able to make them work at all.

That should at least warrant a bug report, IMO.

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library?
  2021-08-26  6:17 ` Eli Zaretskii
@ 2021-08-26 15:34   ` Eric Abrahamsen
  2021-09-09  4:36     ` Eric Abrahamsen
  0 siblings, 1 reply; 100+ messages in thread
From: Eric Abrahamsen @ 2021-08-26 15:34 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: monnier, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Eric Abrahamsen <eric@ericabrahamsen.net>
>> Date: Wed, 25 Aug 2021 11:52:00 -0700
>> Cc: Stefan Monnier <monnier@iro.umontreal.ca>
>>
>> In my on-again-off-again quest to not have to write text parsers myself,
>> I was pointed towards the PEG library (in ELPA), which does pretty much
>> exactly what I want (Parsing Expression Grammars).
>>
>> Would the maintainers consider moving this into Emacs proper? I ask
>> mostly because this would be very useful to have in Gnus, both to
>> replace the home-made parser in gnus-search.el, and I would hope to
>> parse eg IMAP server responses more fully and reliably.
>
> Fine with me, but please update the (outdated) Wiki page to say where
> the latest peg.el is, when it is imported.

Will do. Stefan also asked me to make sure the library actually does
what I expect it to do, before making this move, so I'll write the code
first.

>> I understand it might be redundant with bovine/wisent, but TBH I've
>> never been able to make them work at all.
>
> That should at least warrant a bug report, IMO.

I'll take another look and remind myself of where I got lost.

Thanks,
Eric

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-08-26 15:34 ` Eric Abrahamsen @ 2021-09-09 4:36 ` Eric Abrahamsen 2021-09-19 15:25 ` Eric Abrahamsen 2021-09-30 19:44 ` Stefan Monnier 0 siblings, 2 replies; 100+ messages in thread From: Eric Abrahamsen @ 2021-09-09 4:36 UTC (permalink / raw) To: Eli Zaretskii; +Cc: monnier, emacs-devel [-- Attachment #1: Type: text/plain, Size: 2892 bytes --] On 08/26/21 08:34 AM, Eric Abrahamsen wrote: > Eli Zaretskii <eliz@gnu.org> writes: > >>> From: Eric Abrahamsen <eric@ericabrahamsen.net> >>> Date: Wed, 25 Aug 2021 11:52:00 -0700 >>> Cc: Stefan Monnier <monnier@iro.umontreal.ca> >>> >>> In my on-again-off-again quest to not have to write text parsers myself, >>> I was pointed towards the PEG library (in ELPA), which does pretty much >>> exactly what I want (Parsing Expression Grammars). >>> >>> Would the maintainers consider moving this into Emacs proper? I ask >>> mostly because this would be very useful to have in Gnus, both to >>> replace the home-made parser in gnus-search.el, and I would hope to >>> parse eg IMAP server responses more fully and reliably. >> >> Fine with me, but please update the (outdated) Wiki page to say where >> the latest peg.el is, when it is imported. > > Will do. Stefan also asked me to make sure the library actually does > what I expect it to do, before making this move, so I'll write the code > first. Okay, I wrote some code: the "use-peg-in-gnus-search.diff" attachment is the result of that. It works really well! A net removal of ~100 LOC (obviously we're still in deficit with the addition of peg.el), it already fixes some wrong behavior of the old parser, and it's much easier to reason about and add new behavior to. It's the shiny declarative future I was looking forward to. Whether or not PEG gets added to core I'd like to propose some patches. 
The "peg-doc-patch.diff" attachment adds some documentation to the
Commentary section, including an example grammar based on a
much-simplified version of what gnus-search does.

The peg-allow-symbols patch is more tentative. The issue is that _all_
of the entry-points to peg code are macros, meaning you can't build your
grammar up in a variable, and then pass that variable to any of
`peg-run', `peg-parse', `with-peg-rules', etc. Nobody will evaluate the
variable; you have to literally write the rules inside the
`with-peg-rules' form. It seems like a fairly plausible use-case to
store the rules in a variable or an option, even if you're not doing
run-time manipulation of them. The only solution, as Adam found with
org-ql, is to `eval' one of the macros.

This doesn't seem necessary! The patch has `with-peg-rules' check if the
rules are a symbol, and take the `symbol-value' if so. But I wonder if
it wouldn't be nicer to break some of the code out: `peg-normalize'
seems to be the entry-point for "compile this grammar", and that could
be modified to work the way that some languages provide for pre-compiled
regexps: a way to let the developer build and compile the grammar at
load-time or launch-time, then feed the stored compiled version to
parsing routines.

`peg-parse' could be a function, or maybe it could also just check
if its argument is a symbol.

I hope someone will have some thoughts on this!

Eric [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: use-peg-in-gnus-search.diff --] [-- Type: text/x-patch, Size: 8107 bytes --] diff --git a/lisp/gnus/gnus-search.el b/lisp/gnus/gnus-search.el index 2a8069d400..5574061457 100644 --- a/lisp/gnus/gnus-search.el +++ b/lisp/gnus/gnus-search.el @@ -82,6 +82,7 @@ (require 'gnus-sum) (require 'message) (require 'gnus-util) +(require 'peg) (require 'eieio) (eval-when-compile (require 'cl-lib)) (autoload 'eieio-build-class-alist "eieio-opt") @@ -390,8 +391,29 @@ gnus-search-contact-tables ;;; Search language -;; This "language" was generalized from the original IMAP search query -;; parsing routine. +;; Here's our attempt at using the PEG library to rewrite the parser. + +(defvar gnus-search-query-pexs + '((query (+ (or compound-term term))) + (term (or subquery prefixed-term kv-term value) term-end) + (subquery "(" query ")" + `(query -- (if (= 1 (length query)) query (list query)))) + (prefixed-term (or negated-term near-term)) + (negated-term (or "not " "-") term + `(term -- (list 'not term))) + (near-term "near " term + `(term -- (list 'near term))) + (compound-term (or or-terms and-terms)) + (or-terms (or subquery prefixed-term term) "or " (or subquery prefixed-term term) + `(t1 t2 -- (list 'or t1 t2))) + (and-terms (or subquery prefixed-term term) "and " (or subquery prefixed-term term) + `(t1 t2 -- (list 'and t1 t2))) + (value (or quoted-value plain-value)) + (plain-value (substring (+ [word]))) + (quoted-value "\"" (substring (+ (not "\"") (any))) "\"") + (kv-term plain-value ":" value + `(k v -- (gnus-search-query-parse-kv k v))) + (term-end (opt (+ [space]))))) (defun gnus-search-parse-query (string) "Turn STRING into an s-expression based query. @@ -459,108 +481,26 @@ gnus-search-parse-query structured query. Malformed, unusable or invalid queries will typically be silently ignored." (with-temp-buffer - ;; Set up the parsing environment. 
(insert string) (goto-char (point-min)) - ;; Now, collect the output terms and return them. - (let (out) - (while (not (gnus-search-query-end-of-input)) - (push (gnus-search-query-next-expr) out)) - (reverse out)))) - -(defun gnus-search-query-next-expr (&optional count halt) - "Return the next expression from the current buffer." - (let ((term (gnus-search-query-next-term count)) - (next (gnus-search-query-peek-symbol))) - ;; Deal with top-level expressions. And, or, not, near... What - ;; else? Notmuch also provides xor and adj. It also provides a - ;; "nearness" parameter for near and adj. - (cond - ;; Handle 'expr or expr' - ((and (eq next 'or) - (null halt)) - (list 'or term (gnus-search-query-next-expr 2))) - ;; Handle 'near operator. - ((eq next 'near) - (let ((near-next (gnus-search-query-next-expr 2))) - (if (and (stringp term) - (stringp near-next)) - (list 'near term near-next) - (signal 'gnus-search-parse-error - (list "\"Near\" keyword must appear between two plain strings."))))) - ;; Anything else - (t term)))) - -(defun gnus-search-query-next-term (&optional count) - "Return the next TERM from the current buffer." - (let ((term (gnus-search-query-next-symbol count))) - ;; What sort of term is this? - (cond - ;; negated term - ((eq term 'not) (list 'not (gnus-search-query-next-expr nil 'halt))) - ;; generic term - (t term)))) - -(defun gnus-search-query-peek-symbol () - "Return the next symbol from the current buffer, but don't consume it." - (save-excursion - (gnus-search-query-next-symbol))) - -(defun gnus-search-query-next-symbol (&optional count) - "Return the next symbol from the current buffer, or nil if we are -at the end of the buffer. If supplied COUNT skips some symbols before -returning the one at the supplied position." - (when (and (numberp count) (> count 1)) - (gnus-search-query-next-symbol (1- count))) - (let ((case-fold-search t)) - ;; end of input stream? 
- (unless (gnus-search-query-end-of-input) - ;; No, return the next symbol from the stream. - (cond - ;; Negated expression -- return it and advance one char. - ((looking-at "-") (forward-char 1) 'not) - ;; List expression -- we parse the content and return this as a list. - ((looking-at "(") - (gnus-search-parse-query (gnus-search-query-return-string ")" t))) - ;; Keyword input -- return a symbol version. - ((looking-at "\\band\\b") (forward-char 3) 'and) - ((looking-at "\\bor\\b") (forward-char 2) 'or) - ((looking-at "\\bnot\\b") (forward-char 3) 'not) - ((looking-at "\\bnear\\b") (forward-char 4) 'near) - ;; Plain string, no keyword - ((looking-at "[\"/]?\\b[^:]+\\([[:blank:]]\\|\\'\\)") - (gnus-search-query-return-string - (when (looking-at-p "[\"/]") t))) - ;; Assume a K:V expression. - (t (let ((key (gnus-search-query-expand-key - (buffer-substring - (point) - (progn - (re-search-forward ":" (point-at-eol) t) - (1- (point)))))) - (value (gnus-search-query-return-string - (when (looking-at-p "[\"/]") t)))) - (gnus-search-query-parse-kv key value))))))) + (with-peg-rules gnus-search-query-pexs + peg-run (peg query)))) (defun gnus-search-query-parse-kv (key value) "Handle KEY and VALUE, parsing and expanding as necessary. -This may result in (key value) being turned into a larger query -structure. - In the simplest case, they are simply consed together. String KEY is converted to a symbol." 
- (let () ;; return - (cond - ((member key gnus-search-date-keys) - (when (string= "after" key) - (setq key "since")) - (setq value (gnus-search-query-parse-date value))) - ((equal key "mark") - (setq value (gnus-search-query-parse-mark value))) - ((string= "message-id" key) - (setq key "id"))) - (or nil ;; return - (cons (intern key) value)))) + (setq key (gnus-search-query-expand-key key)) + (cond + ((member key gnus-search-date-keys) + (when (string= "after" key) + (setq key "since")) + (setq value (gnus-search-query-parse-date value))) + ((equal key "mark") + (setq value (gnus-search-query-parse-mark value))) + ((string= "message-id" key) + (setq key "id"))) + (cons (intern key) value)) (defun gnus-search-query-parse-date (value &optional rel-date) "Interpret VALUE as a date specification. @@ -647,44 +587,6 @@ gnus-search-query-expand-key ;; We completed to a unique known key. comp)))) -(defun gnus-search-query-return-string (&optional delimited trim) - "Return a string from the current buffer. -If DELIMITED is non-nil, assume the next character is a delimiter -character, and return everything between point and the next -occurrence of the delimiter, including the delimiters themselves. -If TRIM is non-nil, do not return the delimiters. Otherwise, -return one word." - ;; This function cannot handle nested delimiters, as it's not a - ;; proper parser. Ie, you cannot parse "to:bob or (from:bob or - ;; (cc:bob or bcc:bob))". - (let ((start (point)) - (delimiter (if (stringp delimited) - delimited - (when delimited - (char-to-string (char-after))))) - end) - (if delimiter - (progn - (when trim - ;; Skip past first delimiter if we're trimming. 
- (forward-char 1)) - (while (not end) - (unless (search-forward delimiter nil t (unless trim 2)) - (signal 'gnus-search-parse-error - (list (format "Unmatched delimited input with %s in query" delimiter)))) - (let ((here (point))) - (unless (equal (buffer-substring (- here 2) (- here 1)) "\\") - (setq end (if trim (1- (point)) (point)) - start (if trim (1+ start) start)))))) - (setq end (progn (re-search-forward "\\([[:blank:]]+\\|$\\)" (point-max) t) - (match-beginning 0)))) - (buffer-substring-no-properties start end))) - -(defun gnus-search-query-end-of-input () - "Are we at the end of input?" - (skip-chars-forward "[:blank:]") - (looking-at "$")) - ;;; Search engines ;; Search engines are implemented as classes. This is good for two [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #3: peg-doc-patch.diff --] [-- Type: text/x-patch, Size: 4172 bytes --] diff --git a/peg.el b/peg.el index d71c707dc0..0e4221eeb7 100644 --- a/peg.el +++ b/peg.el @@ -79,17 +79,69 @@ ;; Beginning-of-Symbol (bos) ;; End-of-Symbol (eos) ;; -;; PEXs also support parsing actions, i.e. Lisp snippets which -;; are executed when a pex matches. This can be used to construct -;; syntax trees or for similar tasks. Actions are written as +;; Rules can refer to other rules, and a grammar is often structured +;; as a tree, with a root rule referring to one or more "branch +;; rules", all the way down to the "leaf rules" that deal with actual +;; buffer text. Rules can be recursive or mutually referential, +;; though care must be taken not to create infinite loops. +;; +;; PEXs also support parsing actions, i.e. Lisp snippets which are +;; executed when a pex matches. This can be used to construct syntax +;; trees or for similar tasks. The most basic form of action is +;; written as: ;; ;; (action FORM) ; evaluate FORM for its side-effects -;; `(VAR... -- FORM...) ; stack action ;; ;; Actions don't consume input, but are executed at the point of -;; match. 
A "stack action" takes VARs from the "value stack" and -;; pushes the result of evaluating FORMs to that stack. -;; See `peg-ex-parse-int' in `peg-tests.el' for an example. +;; match. Another kind of action is called a "stack action", and +;; looks like this: +;; +;; `(VAR... -- FORM...) ; stack action +;; +;; A stack action takes VARs from the "value stack" and pushes the +;; results of evaluating FORMs to that stack. + +;; The value stack is created during the course of parsing. Certain +;; operators (see below) that match buffer text can push values onto +;; this stack. "Upstream" rules can then draw values from the stack, +;; and optionally push new ones back. For instance, consider this +;; very simple grammar: +;; +;; (with-peg-rules +;; ((query (+ term) (eol)) +;; (term key ":" value (opt (+ [space])) +;; `(k v -- (cons (intern k) v))) +;; (key (substring (and (not ":") (+ [word])))) +;; (value (or string-value number-value)) +;; (string-value (substring (+ [alpha]))) +;; (number-value (substring (+ [digit])) +;; `(val -- (string-to-number val)))) +;; (peg-run (peg query))) +;; +;; This invocation of `peg-run' would parse this buffer text: +;; +;; name:Jane age:30 +;; +;; And return this Elisp sexp: +;; +;; ((age . 30) (name . "Jane")) +;; +;; Note that, in complex grammars, some care must be taken to make +;; sure that the number and type of values drawn from the stack always +;; match those pushed. In the example above, both `string-value' and +;; `number-value' push a single value to the stack. Since the `value' +;; rule only includes these two sub-rules, any upstream rule that +;; makes use of `value' can be confident it will always and only push +;; a single value to the stack. +;; +;; Stack action forms are in a sense analogous to lambda forms: the +;; symbols before the "--" are the equivalent of lambda arguments, +;; while the forms after the "--" are return values. 
The difference +;; being that a lambda form can only return a single value, while a +;; stack action can push multiple values onto the stack. It's also +;; perfectly valid to use `(-- FORM...)' or `(VAR... --)': the former +;; pushes values to the stack without consuming any, and the latter +;; pops values from the stack and discards them. ;; ;; Derived Operators: ;; @@ -101,6 +153,8 @@ ;; (replace E RPL); Match E and replace the matched region with RPL. ;; (list E) ; Match E and push a list of the items that E produced. ;; +;; See `peg-ex-parse-int' in `peg-tests.el' for further examples. +;; ;; Regexp equivalents: ;; ;; Here a some examples for regexps and how those could be written as pex. @@ -177,7 +231,7 @@ EXPS is a list of rules/expressions that failed.") ;;;; Main entry points -;; Sometimes (with-peg-rule ... (peg-run (peg ...))) is too +;; Sometimes (with-peg-rules ... (peg-run (peg ...))) is too ;; longwinded for the task at hand, so `peg-parse' comes in handy. (defmacro peg-parse (&rest pexs) "Match PEXS at point. [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #4: peg-allow-symbols.diff --] [-- Type: text/x-patch, Size: 919 bytes --] diff --git a/peg.el b/peg.el index 0e4221eeb7..fa7e23619f 100644 --- a/peg.el +++ b/peg.el @@ -314,10 +314,14 @@ RULES is a list of rules of the form (NAME . PEXS), where PEXS is a sequence of PEG expressions, implicitly combined with `and'." (declare (indent 1) (debug (sexp form))) ;FIXME: `sexp' is not good enough! (let ((rules - ;; First, macroexpand the rules. - (mapcar (lambda (rule) - (cons (car rule) (peg-normalize `(and . ,(cdr rule))))) - rules)) + (progn + ;; Handle RULES as a variable. + (when (symbolp rules) + (setq rules (symbol-value rules))) + ;; Then macroexpand the rules. + (mapcar (lambda (rule) + (cons (car rule) (peg-normalize `(and . 
,(cdr rule))))) + rules))) (ctx (assq :peg-rules macroexpand-all-environment))) (macroexpand-all `(cl-labels ^ permalink raw reply related [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-09-09 4:36 ` Eric Abrahamsen @ 2021-09-19 15:25 ` Eric Abrahamsen 2021-09-30 19:44 ` Stefan Monnier 1 sibling, 0 replies; 100+ messages in thread From: Eric Abrahamsen @ 2021-09-19 15:25 UTC (permalink / raw) To: monnier; +Cc: emacs-devel Bumping this up in case it slid off radars: I'd like to at least push the documentation patch to peg.el... On 09/08/21 21:36 PM, Eric Abrahamsen wrote: > On 08/26/21 08:34 AM, Eric Abrahamsen wrote: >> Eli Zaretskii <eliz@gnu.org> writes: >> >>>> From: Eric Abrahamsen <eric@ericabrahamsen.net> >>>> Date: Wed, 25 Aug 2021 11:52:00 -0700 >>>> Cc: Stefan Monnier <monnier@iro.umontreal.ca> >>>> >>>> In my on-again-off-again quest to not have to write text parsers myself, >>>> I was pointed towards the PEG library (in ELPA), which does pretty much >>>> exactly what I want (Parsing Expression Grammars). >>>> >>>> Would the maintainers consider moving this into Emacs proper? I ask >>>> mostly because this would be very useful to have in Gnus, both to >>>> replace the home-made parser in gnus-search.el, and I would hope to >>>> parse eg IMAP server responses more fully and reliably. >>> >>> Fine with me, but please update the (outdated) Wiki page to say where >>> the latest peg.el is, when it is imported. >> >> Will do. Stefan also asked me to make sure the library actually does >> what I expect it to do, before making this move, so I'll write the code >> first. > > Okay, I wrote some code: the "use-peg-in-gnus-search.diff" attachment is > the result of that. It works really well! A net removal of ~100 LOC > (obviously we're still in deficit with the addition of peg.el), it > already fixes some wrong behavior of the old parser, and it's much > easier to reason about and add new behavior to. It's the shiny > declarative future I was looking forward to. > > Whether or not PEG gets added to core I'd like to propose some patches. 
> The "peg-doc-patches.diff" attachment adds some documentation to the > Commentary section, including an example grammar based on a > much-simplified version of what gnus-search does. > > The peg-allow-symbols patch is more tentative. The issue is that _all_ > of the entry-points to peg code are macros, meaning you can't build your > grammar up in a variable, and then pass that variable to any of > `peg-run', `peg-parse', `with-peg-rules', etc. Nobody will evaluate the > variable; you have to literally write the rules inside the > `with-peg-rules' form. It seems like a fairly plausible use-case to > store the rules in a variable or an option, even if you're not doing > run-time manipulation of them. The only solution, as Adam found with > org-ql, is to `eval' one of the macros. > > This doesn't seem necessary! The patch has `with-peg-rules' check if the > rules are a symbol, and take the `symbol-value' if so. But I wonder if > it wouldn't be nicer to break some of the code out: `peg-normalize' > seems to be the entry-point for "compile this grammar", and that could > be modified to work the way that some languages provide for pre-compiled > regexps: a way to let the developer build and compile the grammar at > load-time or launch-time, then feed the stored compiled version to > parsing routines. > > `peg-parse' could be a function, or maybe it also could also just check > if its argument is a symbol. > > I hope someone will have some thoughts on this! > > Eric ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-09-09 4:36 ` Eric Abrahamsen 2021-09-19 15:25 ` Eric Abrahamsen @ 2021-09-30 19:44 ` Stefan Monnier 2021-09-30 20:34 ` Adam Porter 1 sibling, 1 reply; 100+ messages in thread From: Stefan Monnier @ 2021-09-30 19:44 UTC (permalink / raw) To: Eric Abrahamsen; +Cc: Eli Zaretskii, emacs-devel > Whether or not PEG gets added to core I'd like to propose some patches. > The "peg-doc-patches.diff" attachment adds some documentation to the > Commentary section, including an example grammar based on a > much-simplified version of what gnus-search does. Looks great, thanks. > The peg-allow-symbols patch is more tentative. The issue is that _all_ > of the entry-points to peg code are macros, meaning you can't build your > grammar up in a variable, and then pass that variable to any of > `peg-run', `peg-parse', `with-peg-rules', etc. Nobody will evaluate the > variable; you have to literally write the rules inside the > `with-peg-rules' form. It seems like a fairly plausible use-case to > store the rules in a variable or an option, even if you're not doing > run-time manipulation of them. The only solution, as Adam found with > org-ql, is to `eval' one of the macros. > > This doesn't seem necessary! The patch has `with-peg-rules' check if the > rules are a symbol, and take the `symbol-value' if so. But I wonder if > it wouldn't be nicer to break some of the code out: `peg-normalize' > seems to be the entry-point for "compile this grammar", and that could > be modified to work the way that some languages provide for pre-compiled > regexps: a way to let the developer build and compile the grammar at > load-time or launch-time, then feed the stored compiled version to > parsing routines. `peg` is the macro that's supposed to be this compilation step: you pass it a PEX and you receive a value in return. It's a bit like `lambda`. You can then use this value (a "peg matcher") to parse something by passing it to `peg-run`. 
So you can do

    (let ((parser (peg PEX)))
      ...
      (peg-run parser ...)
      ...)

What might still be missing, tho is a way to invoke this `parser` from
within a PEX. So we might want to add a new PEX form that would be akin
to `funcall`. We could name it `call`:

    (let* ((parser (peg PEX)))
      ...
      (with-peg-rules
          ((foo ...)
           (bar ... (call parser) ...)
           (baz ...))
        ...))

so (peg-parse (call FORM)) would end up equivalent to (peg-run FORM ...).
WDYT?


        Stefan

^ permalink raw reply	[flat|nested] 100+ messages in thread
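The "compile with `peg`, run later with `peg-run`" pattern described in this message can be tried with a small sketch like the one below. It assumes peg.el is installed and loadable; the names `my-demo-int-matcher` and `my-demo-parse-int` are hypothetical and exist only for this illustration.

```elisp
(require 'peg)

;; `peg' is a macro: the PEX is written literally here and compiled
;; into a matcher value, which can be stored like any other value.
;; Hypothetical example: match digits and push them as a number.
(defvar my-demo-int-matcher
  (peg (substring (+ [digit]))
       `(digits -- (string-to-number digits)))
  "Matcher for an unsigned integer at point (illustration only).")

(defun my-demo-parse-int (string)
  "Run `my-demo-int-matcher' at the start of STRING.
Return the resulting value stack, or nil if there is no match."
  (with-temp-buffer
    (insert string)
    (goto-char (point-min))
    ;; `peg-run' is an ordinary function, so the matcher can come
    ;; from a variable.
    (peg-run my-demo-int-matcher)))
```

With these definitions, `(my-demo-parse-int "42")` should return the value stack `(42)`, and a string that does not start with a digit should return nil, since `peg-run` ignores failures unless a failure function is given. Note that the PEX itself is still fixed when the `peg` form is macroexpanded; only the resulting matcher is passed around as a value.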
* Re: Make peg.el a built-in library? 2021-09-30 19:44 ` Stefan Monnier @ 2021-09-30 20:34 ` Adam Porter 2021-10-01 8:14 ` Augusto Stoffel 2021-10-01 18:05 ` Stefan Monnier 0 siblings, 2 replies; 100+ messages in thread From: Adam Porter @ 2021-09-30 20:34 UTC (permalink / raw) To: emacs-devel Stefan Monnier <monnier@iro.umontreal.ca> writes: >> The peg-allow-symbols patch is more tentative. The issue is that _all_ >> of the entry-points to peg code are macros, meaning you can't build your >> grammar up in a variable, and then pass that variable to any of >> `peg-run', `peg-parse', `with-peg-rules', etc. Nobody will evaluate the >> variable; you have to literally write the rules inside the >> `with-peg-rules' form. It seems like a fairly plausible use-case to >> store the rules in a variable or an option, even if you're not doing >> run-time manipulation of them. The only solution, as Adam found with >> org-ql, is to `eval' one of the macros. >> >> This doesn't seem necessary! The patch has `with-peg-rules' check if the >> rules are a symbol, and take the `symbol-value' if so. But I wonder if >> it wouldn't be nicer to break some of the code out: `peg-normalize' >> seems to be the entry-point for "compile this grammar", and that could >> be modified to work the way that some languages provide for pre-compiled >> regexps: a way to let the developer build and compile the grammar at >> load-time or launch-time, then feed the stored compiled version to >> parsing routines. > > `peg` is the macro that's supposed to be this compilation step: you pass > it a PEX and you receive a value in return. It's a bit like `lambda`. > > You can then use this value (a "peg matcher") to parse something by > passing it to `peg-run`. > > So you can do > > (let ((parser (peg PEX))) > ... > (peg-run parser ...) > ...) > > What might still be missing, tho is a way to invoke this `parser` from > within a PEX. So we might want to add a new PEX form that would be akin > to `funcall`. 
We could name it `call`: > > (let* ((parser (peg PEX)) > ... > (with-peg-rules > ((foo ...) > (bar ... (call parser) ...) > (baz ...)) > ...)) > > so (peg-parse (call FORM)) would end up equivalent to (peg-run FORM ...). > WDYT? In org-ql, the PEX is redefined at load time and/or run time, being derived from search keywords that are defined by the package and possibly by the user. So the PEX can't be defined in advance, at compile time. So having to use `with-peg-rules' means having to use `eval'. That's why it would be nice to have a `peg' function that could be called with a PEX form, to return a function that could be stored in a variable and later be called with a string argument, that would parse the string with the PEG. Sort of like Python's re.compile. ^ permalink raw reply [flat|nested] 100+ messages in thread
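The `eval` workaround mentioned in this message can be sketched roughly as follows. This is not org-ql's actual code: the function name `my-demo-make-term-parser` and the tiny KEY:VALUE grammar are hypothetical, and the sketch assumes peg.el is installed. It only shows why `eval` comes into play when the rules are assembled from run-time data.

```elisp
(require 'peg)

(defun my-demo-make-term-parser (keywords)
  "Return a function that parses \"KEY:VALUE\" strings.
KEYWORDS is a list of strings naming the allowed keys.  It is only
known at run time, so the grammar is built as data and handed to
`eval'.  Hypothetical helper, for illustration only."
  (let ((rules `((term key ":" value (eob))
                 ;; Splice the run-time keyword list into an `or'.
                 (key (substring (or ,@keywords)))
                 (value (substring (+ [word]))))))
    ;; `with-peg-rules' is a macro, so the rules must appear literally
    ;; in the form being macroexpanded; `eval' makes that happen.
    (eval `(lambda (string)
             (with-temp-buffer
               (insert string)
               (goto-char (point-min))
               (with-peg-rules ,rules
                 (peg-run (peg term)))))
          t)))
```

A parser built with `(my-demo-make-term-parser '("to" "from"))` can then be stored in a variable and applied with `funcall`, which is roughly the `re.compile`-like workflow asked for here. Each `substring` pushes one value, so a successful parse returns the value stack with the most recently pushed item first.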
* Re: Make peg.el a built-in library?
  2021-09-30 20:34 ` Adam Porter
@ 2021-10-01  8:14   ` Augusto Stoffel
  2021-10-01 18:05   ` Stefan Monnier
  1 sibling, 0 replies; 100+ messages in thread
From: Augusto Stoffel @ 2021-10-01  8:14 UTC (permalink / raw)
  To: Adam Porter; +Cc: emacs-devel

On Thu, 30 Sep 2021 at 15:34, Adam Porter <adam@alphapapa.net> wrote:

> In org-ql, the PEX is redefined at load time and/or run time, being
> derived from search keywords that are defined by the package and
> possibly by the user. So the PEX can't be defined in advance, at
> compile time. So having to use `with-peg-rules' means having to use
> `eval'.
>
> That's why it would be nice to have a `peg' function that could be
> called with a PEX form, to return a function that could be stored in a
> variable and later be called with a string argument, that would parse
> the string with the PEG. Sort of like Python's re.compile.

FWIW, in my use of PEGs (which is outside of Emacs, in a code
analyzer/language server for TeX), such on-the-fly generation of parsers
is used all the time as well.

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library?
  2021-09-30 20:34 ` Adam Porter
  2021-10-01  8:14   ` Augusto Stoffel
@ 2021-10-01 18:05   ` Stefan Monnier
  2021-10-01 18:40     ` Eric Abrahamsen
  2021-10-02  7:32     ` Adam Porter
  1 sibling, 2 replies; 100+ messages in thread
From: Stefan Monnier @ 2021-10-01 18:05 UTC (permalink / raw)
  To: Adam Porter; +Cc: emacs-devel

> In org-ql, the PEX is redefined at load time and/or run time, being
> derived from search keywords that are defined by the package and
> possibly by the user. So the PEX can't be defined in advance, at
> compile time. So having to use `with-peg-rules' means having to use
> `eval'.

If the grammar changes radically at run time, based on external/user
data there's probably no better way than via `eval` or similar (`load`,
`byte-compile`, you name it).

But if the changes are sufficiently limited (e.g. have an (or "foo"
"bar" ....) with a variable set of strings that can match), then we can
do better.

E.g. we could have a PEX of the form (re FORM) where FORM can be any
ELisp expression that returns a regular expression. Then `org-ql.el`
could do

    (let ((predicate-re (regexp-opt predicate-names)))
      (peg-parse
       ((query (+ term
                  (opt (+ (syntax-class whitespace) (any)))))
        [...]
        (predicate (re predicate-re))
        [...])))

-- Stefan


PS: BTW, regarding your comment:

    ;; Sort the keywords longest-first to work around what seems to be an
    ;; obscure bug in `peg': when one keyword is a substring of another,
    ;; and the shorter one is listed first, the shorter one fails to match.

The behavior you describe indeed seems like a bug, but maybe what you see
is slightly different (and not a bug): if you have a PEX like

    (and (or "foo" "foobar") "X")

the "foo" will match when faced with "foobarX" and the parser won't
backtrack to try and match the "foobar" when the "X" fails to match.
It's one of those differences between BNF and PEG grammars.

So indeed you do want to sort from longest to shortest to avoid
this problem.

^ permalink raw reply [flat|nested] 100+ messages in thread
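For reference, the `regexp-opt` half of the suggestion above can be tried in plain Emacs Lisp, without peg.el. The `(re FORM)` PEX is only a proposal at this point in the thread, so the sketch below does not use it; the keyword list and the `my-` names are made up for illustration. Per its docstring in recent Emacs versions, `regexp-opt` called without KEEP-ORDER returns a regexp that matches the longest possible string, so the ordering issue from the PS should not arise on this path.

```elisp
;;; -*- lexical-binding: t -*-

(defvar my-predicate-names '("h" "heading" "ts-a" "ts-active")
  "Hypothetical, user-extensible list of predicate names.")

(defun my-match-predicate (string)
  "Return the predicate name found at the start of STRING, or nil.
The regexp is rebuilt from `my-predicate-names' on every call, so
names added at run time are picked up without any `eval'."
  (let ((re (concat "\\`" (regexp-opt my-predicate-names))))
    (when (string-match re string)
      (match-string 0 string))))

;; Example calls; the list order above is deliberately shortest-first.
(my-match-predicate "heading:foo")
(my-match-predicate "h:foo")
```

A real grammar would still need the rest of the query syntax around this; the point is only that the variable part can be an ordinary value computed at run time.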
* Re: Make peg.el a built-in library? 2021-10-01 18:05 ` Stefan Monnier @ 2021-10-01 18:40 ` Eric Abrahamsen 2021-10-02 3:57 ` Stefan Monnier 2021-10-02 7:32 ` Adam Porter 1 sibling, 1 reply; 100+ messages in thread From: Eric Abrahamsen @ 2021-10-01 18:40 UTC (permalink / raw) To: Stefan Monnier; +Cc: Adam Porter, emacs-devel Stefan Monnier <monnier@iro.umontreal.ca> writes: >> In org-ql, the PEX is redefined at load time and/or run time, being >> derived from search keywords that are defined by the package and >> possibly by the user. So the PEX can't be defined in advance, at >> compile time. So having to use `with-peg-rules' means having to use >> `eval'. > > If the grammar changes radically at run time, based on external/user > data there's probably no better way than via `eval` or similar (`load`, > `byte-compile`, you name it). Can you explain why a function plus some sort of pre-compilation step won't work? Maybe if I just tried to write the patch I would naturally see the problem, but theoretically I don't get it... > But if the changes are sufficiently limited (e.g. have an (or "foo" > "bar" ....) with a variable set of strings that can match), then we can > do better. > > E.g. we could have a PEX of the form (re FORM) where FORM can be any > ELisp expression that returns a regular expression. I suppose the `call' pex you mentioned up-thread could also ease things a bit. I'll hold off on the documentation patch until we know whether any code will change. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-10-01 18:40 ` Eric Abrahamsen @ 2021-10-02 3:57 ` Stefan Monnier 0 siblings, 0 replies; 100+ messages in thread From: Stefan Monnier @ 2021-10-02 3:57 UTC (permalink / raw) To: Eric Abrahamsen; +Cc: Adam Porter, emacs-devel Eric Abrahamsen [2021-10-01 11:40:47] wrote: > Stefan Monnier <monnier@iro.umontreal.ca> writes: >>> In org-ql, the PEX is redefined at load time and/or run time, being >>> derived from search keywords that are defined by the package and >>> possibly by the user. So the PEX can't be defined in advance, at >>> compile time. So having to use `with-peg-rules' means having to use >>> `eval'. >> >> If the grammar changes radically at run time, based on external/user >> data there's probably no better way than via `eval` or similar (`load`, >> `byte-compile`, you name it). > > Can you explain why a function plus some sort of pre-compilation step > won't work? That "function plus precompilation step" would do the equivalent of `eval` ;-) > I suppose the `call' pex you mentioned up-thread could also ease things > a bit. Indeed, with it you can define a function like `peg-and` such that (peg-and (peg PEX1) (peg PEX2)) === (peg (and PEX1 PEX2)) but using such functions to build a PEG would result in substantially slower code (because it gets split into many small functions, thus increasing the function call overheads). Stefan ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-10-01 18:05 ` Stefan Monnier 2021-10-01 18:40 ` Eric Abrahamsen @ 2021-10-02 7:32 ` Adam Porter 2021-10-02 14:45 ` Stefan Monnier 1 sibling, 1 reply; 100+ messages in thread From: Adam Porter @ 2021-10-02 7:32 UTC (permalink / raw) To: Stefan Monnier; +Cc: emacs-devel On Fri, Oct 1, 2021 at 1:05 PM Stefan Monnier <monnier@iro.umontreal.ca> wrote: > > > In org-ql, the PEX is redefined at load time and/or run time, being > > derived from search keywords that are defined by the package and > > possibly by the user. So the PEX can't be defined in advance, at > > compile time. So having to use `with-peg-rules' means having to use > > `eval'. > > If the grammar changes radically at run time, based on external/user > data there's probably no better way than via `eval` or similar (`load`, > `byte-compile`, you name it). > > But if the changes are sufficiently limited (e.g. have an (or "foo" > "bar" ....) with a variable set of strings that can match), then we can > do better. In org-ql's case, it's the latter: the grammar doesn't fundamentally change, only the list of strings that can be matched in a certain expression: https://github.com/alphapapa/org-ql/blob/31aeb0a2505acf8044c07824888ddec7f3e529c1/org-ql.el#L869 > E.g. we could have a PEX of the form (re FORM) where FORM can be any > ELisp expression that returns a regular expression. Then `org-ql.el` > could do > > (let ((predicate-re (regexp-opt predicate-names))) > (peg-parse > ((query (+ term > (opt (+ (syntax-class whitespace) (any))))) > [...] > (predicate (re predicate-re)) > [...]))) That would be helpful, yes. > PS: BTW, regarding your comment: > > ;; Sort the keywords longest-first to work around what seems to be an > ;; obscure bug in `peg': when one keyword is a substring of another, > ;; and the shorter one is listed first, the shorter one fails to match. 
> > The behavior you describe indeed seems like a bug, but maybe what you > see is slightly different (and not a bug): if you have a PEX like > (and (or "foo" "foobar") "X") > the "foo" will match when faced with "foobarX" and the parser won't > backtrack to try and match the "foobar" when the "X" fails to match. Hmm, thanks. I think an example of the problem is that a predicate in org-ql might have a shorter alias, e.g. "heading" has the alias "h", and predicates are followed by arguments, like "heading:foo", so IIRC, without sorting them there, "heading:foo" would work, while "h:foo" wouldn't. (Or maybe a better example is predicates that optionally accept keyword-style arguments, like "ts-active:from=2021-10-01", which has the alias "ts-a", and could also be called without arguments, like "ts-a:".) > It's one of those differences between BNF and PEG grammars. > So indeed you do want to sort from longest to shortest to avoid > this problem. Thanks, I didn't realize that. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-10-02 7:32 ` Adam Porter @ 2021-10-02 14:45 ` Stefan Monnier 2021-10-02 15:13 ` Adam Porter 0 siblings, 1 reply; 100+ messages in thread From: Stefan Monnier @ 2021-10-02 14:45 UTC (permalink / raw) To: Adam Porter; +Cc: emacs-devel >> E.g. we could have a PEX of the form (re FORM) where FORM can be any >> ELisp expression that returns a regular expression. Then `org-ql.el` >> could do >> >> (let ((predicate-re (regexp-opt predicate-names))) >> (peg-parse >> ((query (+ term >> (opt (+ (syntax-class whitespace) (any))))) >> [...] >> (predicate (re predicate-re)) >> [...]))) > > That would be helpful, yes. Thanks, I'll think about what can be done here. >> PS: BTW, regarding your comment: >> >> ;; Sort the keywords longest-first to work around what seems to be an >> ;; obscure bug in `peg': when one keyword is a substring of another, >> ;; and the shorter one is listed first, the shorter one fails to match. >> >> The behavior you describe indeed seems like a bug, but maybe what you >> see is slightly different (and not a bug): if you have a PEX like >> (and (or "foo" "foobar") "X") >> the "foo" will match when faced with "foobarX" and the parser won't >> backtrack to try and match the "foobar" when the "X" fails to match. > > Hmm, thanks. I think an example of the problem is that a predicate in > org-ql might have a shorter alias, e.g. "heading" has the alias > "h", and predicates are followed by arguments, like "heading:foo", so > IIRC, without sorting them there, "heading:foo" would work, while > "h:foo" wouldn't. Odd. If you have (or "h" "heading") in the grammar then I'd expect "h:foo" to be recognized but "heading:foo" to be rejected (IOW, that would be a bug in the grammar rather than in `peg.el`). But you describe the exact opposite for which I don't have an explanation. So maybe it's a bug in `peg.el`. Could you try and distill it into a bug report? Stefan ^ permalink raw reply [flat|nested] 100+ messages in thread
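The expectation described in this message can be checked with a small sketch along the following lines. It assumes peg.el is installed, and that `peg-run` returns nil on failure when no failure function is passed; the helper and variable names are made up. It only exercises the two orderings of the alias pair, not org-ql's real grammar.

```elisp
;;; -*- lexical-binding: t -*-
(require 'peg)

(defun my-peg-match-p (matcher string)
  "Return non-nil if MATCHER (made with `peg') matches at the start of STRING."
  (with-temp-buffer
    (insert string)
    (goto-char (point-min))
    (peg-run matcher)))

;; Shorter alias first: once "h" has matched, the choice is committed,
;; so a following ":" is checked right after the "h".
(defvar my-short-first (peg (or "h" "heading") ":"))

;; Longest first: "heading" is tried before falling back to "h".
(defvar my-long-first (peg (or "heading" "h") ":"))

(list (my-peg-match-p my-short-first "h:foo")
      (my-peg-match-p my-short-first "heading:foo")
      (my-peg-match-p my-long-first "h:foo")
      (my-peg-match-p my-long-first "heading:foo"))
```

With textbook ordered-choice semantics, only the second call (short alias first, long input) would be expected to fail, which is the grammar-level problem that sorting longest-first avoids.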
* Re: Make peg.el a built-in library? 2021-10-02 14:45 ` Stefan Monnier @ 2021-10-02 15:13 ` Adam Porter 0 siblings, 0 replies; 100+ messages in thread From: Adam Porter @ 2021-10-02 15:13 UTC (permalink / raw) To: Stefan Monnier; +Cc: emacs-devel On Sat, Oct 2, 2021 at 9:45 AM Stefan Monnier <monnier@iro.umontreal.ca> wrote: > > >> PS: BTW, regarding your comment: > >> > >> ;; Sort the keywords longest-first to work around what seems to be an > >> ;; obscure bug in `peg': when one keyword is a substring of another, > >> ;; and the shorter one is listed first, the shorter one fails to match. > >> > >> The behavior you describe indeed seems like a bug, but maybe what you > >> see is slightly different (and not a bug): if you have a PEX like > >> (and (or "foo" "foobar") "X") > >> the "foo" will match when faced with "foobarX" and the parser won't > >> backtrack to try and match the "foobar" when the "X" fails to match. > > > > Hmm, thanks. I think an example of the problem is that a predicate in > > org-ql might have a shorter alias, e.g. "heading" has the alias > > "h", and predicates are followed by arguments, like "heading:foo", so > > IIRC, without sorting them there, "heading:foo" would work, while > > "h:foo" wouldn't. > > Odd. If you have (or "h" "heading") in the grammar then I'd expect > "h:foo" to be recognized but "heading:foo" to be rejected (IOW, that > would be a bug in the grammar rather than in `peg.el`). > > But you describe the exact opposite for which I don't have > an explanation. So maybe it's a bug in `peg.el`. Could you try and > distill it into a bug report? Frankly, probably not. :) I worked on that code a long time ago and haven't touched it since, so my recollection might not even be accurate. For me, it Just Works(TM), and I have other Emacs-related projects that are higher priority, so I don't expect to be able to work on that part of org-ql or peg.el anytime soon. Sorry. :( (e.g. 
I'd really like to make progress on this bug report, so I could reasonably submit plz.el to ELPA (though I might do that anyway, since it mostly works fine): https://debbugs.gnu.org/cgi/bugreport.cgi?bug=50166 But it's stumped me so far. Maybe someone else would have some ideas sometime...) ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-08-25 18:52 Make peg.el a built-in library? Eric Abrahamsen 2021-08-26 6:17 ` Eli Zaretskii @ 2021-08-26 17:02 ` Adam Porter 2021-08-26 17:25 ` Eric Abrahamsen 2021-08-27 3:17 ` Eric Abrahamsen 2021-10-09 1:31 ` Michael Heerdegen 2022-11-07 3:33 ` Ihor Radchenko 3 siblings, 2 replies; 100+ messages in thread From: Adam Porter @ 2021-08-26 17:02 UTC (permalink / raw) To: emacs-devel FWIW, I've been happy using peg.el in org-ql. I use it to parse strings like: "todo:WAITING scheduled:from=2021-08-01,to=2021-08-31" into a sexp like: (and (todo "WAITING") (scheduled :from "2021-08-01" :to "2021-08-31")) You can see the code I use here: https://github.com/alphapapa/org-ql/blob/master/org-ql.el#L854 I can't speak much to its performance, because it's plenty fast enough for the relatively light usage it gets in org-ql. My only, minor complaint is that I ended up having to use `eval' on its `with-peg-rules' macro at runtime: https://github.com/alphapapa/org-ql/blob/94f9e6f3031b32cf5e2149beca7074807235dcb0/org-ql.el#L908 I tried many, many things before resorting to that, so I don't think I missed any alternatives at the time. If that aspect of the API could be improved, it would be welcomed, but I don't think it's necessary to do so before adding it to Emacs. (The issue is that the tokens that are parsed can be added to at runtime, so they are stored in a variable, and the parsing function is redefined as necessary at runtime, so it's not possible to define the parser fully at expansion time.) Thanks for suggesting this, Eric. And thanks to Helmut and Stefan for their work on peg. It's a great tool. ^ permalink raw reply [flat|nested] 100+ messages in thread
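The run-time situation described in this message can be sketched roughly as follows. This is not org-ql's code: the function names and the keyword list are hypothetical, peg.el is assumed to be installed, and the grammar is reduced to "one keyword followed by a colon". Because `peg` is a macro, the keyword list has to be spliced into a form that is then passed to `eval`, which is the step the message calls a minor complaint.

```elisp
;;; -*- lexical-binding: t -*-
(require 'peg)

(defun my-compile-keyword-matcher (keywords)
  "Return a PEG matcher for one of KEYWORDS followed by a colon.
KEYWORDS is a non-empty list of strings known only at run time."
  (let ((sorted (sort (copy-sequence keywords)
                      (lambda (a b) (> (length a) (length b))))))
    ;; Splice the (longest-first) keywords into the PEX and evaluate
    ;; the resulting macro call with lexical binding enabled.
    (eval `(peg (substring (or ,@sorted)) ":") t)))

(defun my-parse-keyword (matcher string)
  "Run MATCHER at the start of STRING; return the value stack or nil."
  (with-temp-buffer
    (insert string)
    (goto-char (point-min))
    (peg-run matcher)))

;; The matcher is an ordinary value that can be stored and reused.
(defvar my-matcher (my-compile-keyword-matcher '("h" "heading" "todo")))
(my-parse-keyword my-matcher "heading:foo")
```

Storing the result in a variable and rebuilding it only when the keyword list changes keeps the `eval` cost out of the parsing path.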
* Re: Make peg.el a built-in library? 2021-08-26 17:02 ` Adam Porter @ 2021-08-26 17:25 ` Eric Abrahamsen 2021-08-27 3:17 ` Eric Abrahamsen 1 sibling, 0 replies; 100+ messages in thread From: Eric Abrahamsen @ 2021-08-26 17:25 UTC (permalink / raw) To: Adam Porter; +Cc: emacs-devel Adam Porter <adam@alphapapa.net> writes: > FWIW, I've been happy using peg.el in org-ql. I use it to parse strings > like: > > "todo:WAITING scheduled:from=2021-08-01,to=2021-08-31" > > into a sexp like: > > (and (todo "WAITING") > (scheduled :from "2021-08-01" :to "2021-08-31")) > > You can see the code I use here: > > https://github.com/alphapapa/org-ql/blob/master/org-ql.el#L854 This is very helpful, thanks. The peg.el comments are not very helpful when it comes to actions, so it's great to have more examples. I'll try to provide a comment patch to the library once I've made more progress. > I can't speak much to its performance, because it's plenty fast enough for the > relatively light usage it gets in org-ql. In gnus-search.el it would be the same situation: performance wouldn't matter at all. If I can use it for IMAP server parsing, though, it would be important not to be too slow. > My only, minor complaint is that I ended up having to use `eval' on > its `with-peg-rules' macro at runtime: > > https://github.com/alphapapa/org-ql/blob/94f9e6f3031b32cf5e2149beca7074807235dcb0/org-ql.el#L908 > > I tried many, many things before resorting to that, so I don't think I > missed any alternatives at the time. If that aspect of the API could be > improved, it would be welcomed, but I don't think it's necessary to do > so before adding it to Emacs. > > (The issue is that the tokens that are parsed can be added to at > runtime, so they are stored in a variable, and the parsing function is > redefined as necessary at runtime, so it's not possible to define the > parser fully at expansion time.) This doesn't mean much to me (yet), but I'll keep an eye out! > Thanks for suggesting this, Eric. 
Thanks for pointing out that it exists! ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-08-26 17:02 ` Adam Porter 2021-08-26 17:25 ` Eric Abrahamsen @ 2021-08-27 3:17 ` Eric Abrahamsen 2021-08-27 6:41 ` Helmut Eller 1 sibling, 1 reply; 100+ messages in thread From: Eric Abrahamsen @ 2021-08-27 3:17 UTC (permalink / raw) To: Adam Porter; +Cc: emacs-devel Adam Porter <adam@alphapapa.net> writes: > FWIW, I've been happy using peg.el in org-ql. I use it to parse strings > like: > > "todo:WAITING scheduled:from=2021-08-01,to=2021-08-31" > > into a sexp like: > > (and (todo "WAITING") > (scheduled :from "2021-08-01" :to "2021-08-31")) > > You can see the code I use here: > > https://github.com/alphapapa/org-ql/blob/master/org-ql.el#L854 Whoo, I've been trying to get enough of a handle on the parsing actions to write a documentation patch for them -- now I'm seeing what Helmut meant by "semantically unintuitive". The sum total of docs regarding actions is: A "stack action" takes VARs from the "value stack" and pushes the result of evaluating FORMs to that stack. So lower-level pexs need to explicitly push values onto the stack. They can do that with either one of the built-in "operators" (substring, region, replace, list), or by using the pattern: (and <your pex> `(VARS... -- FORM...)) Which confused me mightily until I realized that the backquoted sexp was essentially a lambda with funny syntax: `(VARS... -- FORM...) ==> (lambda (vars...) form...) `(-- FORM...) ==> (lambda () form...) You don't actually need the leading `and' if you're writing a top-level pex, it only seems necessary if you're lining up a series of them under an `or'. A built-in operator pushes a value onto the stack. No operator (or stack action) means no push. An action lambda with no argument but a return value simply pushes that value onto the stack. An action lambda that accepts arguments consumes values from the stack, and then pushes a new value (its return value) onto the stack. 
So lower-level pexs can take values from the stack and push new ones back onto the stack, and higher-level pexs can pick those up later. But because higher-level pexs often simply "or" lower-level pexs, the developer has to be consistent with the number and type of pushed values: if a high-level pex looks like: (foo (or baz bar) `(str -- (upcase str))) Then the contract is that both the "baz" and "bar" pexs (or an even lower-level pex referred to by them) will push a single string value onto the stack (probably with the "substring" operator). Essentially we need to be calling our upper-level lambda with the right number/type of argument(s). If this email makes no sense, it's because I'm halfway through trying to understand this library. I guess I could wish that these action forms were simply callables, since they're clearly modeled after function calls. Anyhoo, I'm going to try to confirm all of the above, and then at least add to the commentary section for the main package file. Eric ^ permalink raw reply [flat|nested] 100+ messages in thread
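The contract described above (every alternative pushes exactly one string, and the higher-level rule consumes it) can be written out as a small sketch. It assumes peg.el is installed; the rule names `word`, `lower` and `upper` and the function name are made up, standing in for the `foo`, `baz` and `bar` of the message.

```elisp
;;; -*- lexical-binding: t -*-
(require 'peg)

(defun my-peg-upcase-word (string)
  "Parse a run of letters at the start of STRING; return the value stack.
Each alternative of the rule `word' pushes one string, and the
stack action pops it and pushes its upcased form."
  (with-temp-buffer
    (insert string)
    (goto-char (point-min))
    (with-peg-rules
        ((word (or lower upper) `(str -- (upcase str)))
         ;; `substring' pushes the text matched by its argument.
         (lower (substring (+ [a-z])))
         (upper (substring (+ [A-Z]))))
      (peg-run (peg word)))))

(my-peg-upcase-word "hello world")
```

If one of the lower-level rules pushed nothing, or pushed two values, the action in `word` would pop the wrong thing, which is the consistency burden the message points out.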
* Re: Make peg.el a built-in library? 2021-08-27 3:17 ` Eric Abrahamsen @ 2021-08-27 6:41 ` Helmut Eller 2021-08-27 16:57 ` Eric Abrahamsen 2021-09-26 10:59 ` Augusto Stoffel 0 siblings, 2 replies; 100+ messages in thread From: Helmut Eller @ 2021-08-27 6:41 UTC (permalink / raw) To: emacs-devel On Thu, Aug 26 2021, Eric Abrahamsen wrote: > Whoo, I've been trying to get enough of a handle on the parsing actions > to write a documentation patch for them -- now I'm seeing what Helmut > meant by "semantically unintuitive". What I actually meant with "semantically unintuitive" are issues described in Roman Redziejowski's "Trying to understand PEG" paper[*]. He writes: The problem with limited backtracking is that by not trying hard it may miss some inputs that it should accept. A notorious example is the rule A = aAa | aa that defines the set of strings of a’s of even length. Implemented with limited backtracking, this rule accepts only strings of length 2^n. > The sum total of docs regarding > actions is: > > A "stack action" takes VARs from the "value stack" and pushes the result > of evaluating FORMs to that stack. Using an "open stack" for actions was my rather idiosyncratic choice and I'm sure that many people will not like it. The syntax ( a b -- b a ) should be familiar to Forth programmers, where it's used to describe the stack-effect of commands. The example would be the SWAP operator. If you have never, or not recently, written some Forth or Postscript, then mentally keeping track of the stack state can be challenging. As for "documentation" of actions: there are also some examples. I think that the s-exp parsing example turned out quite elegant. Helmut [*] http://www.romanredz.se/papers/FI2017.pdf ^ permalink raw reply [flat|nested] 100+ messages in thread
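The rule quoted from the paper can be tried directly. The sketch below assumes peg.el is installed and follows the usual PEG semantics; the function name is made up, and `(eob)` is added so that only a match of the whole input counts.

```elisp
;;; -*- lexical-binding: t -*-
(require 'peg)
(require 'cl-lib)

(defun my-peg-all-as-p (n)
  "Return non-nil if A = aAa | aa matches a string of exactly N a's."
  (with-temp-buffer
    (insert (make-string n ?a))
    (goto-char (point-min))
    (with-peg-rules
        ;; Ordered choice: once the inner A has consumed its share,
        ;; there is no backtracking into it to make room for the
        ;; trailing "a".
        ((A (or (and "a" A "a") "aa")))
      (peg-run (peg A (eob))))))

;; Lengths from 1 to 16 on which the rule accepts the whole input.
(cl-loop for n from 1 to 16
         when (my-peg-all-as-p n)
         collect n)
```

Working through the rule by hand for these lengths gives full matches only at 2, 4, 8 and 16, in line with the 2^n claim; a length such as 6 leaves part of the input unconsumed.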
* Re: Make peg.el a built-in library? 2021-08-27 6:41 ` Helmut Eller @ 2021-08-27 16:57 ` Eric Abrahamsen 2021-09-26 10:59 ` Augusto Stoffel 1 sibling, 0 replies; 100+ messages in thread From: Eric Abrahamsen @ 2021-08-27 16:57 UTC (permalink / raw) To: Helmut Eller; +Cc: emacs-devel Helmut Eller <eller.helmut@gmail.com> writes: Thanks for jumping in (and thanks for the package)! > On Thu, Aug 26 2021, Eric Abrahamsen wrote: > >> Whoo, I've been trying to get enough of a handle on the parsing actions >> to write a documentation patch for them -- now I'm seeing what Helmut >> meant by "semantically unintuitive". > > What I actually meant with "semantically unintuitive" are issues > described in Roman Redziejowski's "Trying to understand PEG" paper[*]. > He writes: > > The problem with limited backtracking is that by not trying hard it > may miss some inputs that it should accept. A notorious example is > the rule A = aAa | aa that defines the set of strings of a’s of even > length. Implemented with limited backtracking, this rule accepts only > strings of length 2^n. Oh... well personally I haven't got to the stage where this is an issue... >> The sum total of docs regarding >> actions is: >> >> A "stack action" takes VARs from the "value stack" and pushes the result >> of evaluating FORMs to that stack. > > Using an "open stack" for actions was my rather idiosyncratic choice and > I'm sure that many people will not like it. The syntax ( a b -- b a ) > should be familiar to Forth programmers, where it's used to describe the > stack-effect of commands. The example would be the SWAP operator. If > you have never, or not recently, written some Forth or Postscript, then > mentally keeping track of the stack state can be challenging. The stack itself isn't that hard to handle, but I do think the documentation could be fleshed out with a little hand-holding. The examples are good, _after_ you understand the basics. 
I've never written Forth, and we probably shouldn't expect anyone else to have, either. I originally assumed the `(a b -- b a) bits could just be replaced with lambda forms, but I suppose the problem there is that a lambda has a single return value, and we'd have to do something ugly if we wanted to push multiple values back onto the stack. Eric ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-08-27 6:41 ` Helmut Eller 2021-08-27 16:57 ` Eric Abrahamsen @ 2021-09-26 10:59 ` Augusto Stoffel 2021-09-26 15:06 ` Eric Abrahamsen 2021-09-27 22:34 ` Richard Stallman 1 sibling, 2 replies; 100+ messages in thread From: Augusto Stoffel @ 2021-09-26 10:59 UTC (permalink / raw) To: Helmut Eller; +Cc: Eric Abrahamsen, emacs-devel I think it would be really cool to have PEGs built into Emacs. Things like json.el could be simplified by at least (log10 2) orders of magnitude with PEGs. Whatever the use case of `rx' is, PEGs are probably the "real" solution. But I suspect this would only take traction with a fast and robust C implementation like Lua's LPEG (see below for a reason). On Fri, 27 Aug 2021 at 08:41, Helmut Eller <eller.helmut@gmail.com> wrote: > On Thu, Aug 26 2021, Eric Abrahamsen wrote: > >> Whoo, I've been trying to get enough of a handle on the parsing actions >> to write a documentation patch for them -- now I'm seeing what Helmut >> meant by "semantically unintuitive". > > What I actually meant with "semantically unintuitive" are issues > described in Roman Redziejowski's "Trying to understand PEG" paper[*]. > He writes: > > The problem with limited backtracking is that by not trying hard it > may miss some inputs that it should accept. A notorious example is > the rule A = aAa | aa that defines the set of strings of a’s of even > length. Implemented with limited backtracking, this rule accepts only > strings of length 2^n. When I started to write PEGs intensively, I thought the limited backtracking would be a problem. It's not. In fact, I find the regexp-style backtracking great, but only for “quick and dirty” things (e.g., those throw-away little programs one writes for grep or isearch). But if you are trying to write a more complex parser, aggressive backtracking actually gets in the way. The example above is kind of silly. You can parse an even number of a's with the rule A = aaA | ε. 
This is still kind of bad, because (unless peg.el is way fancier than I'm imagining), it consumes the call stack. LPEG has a kind of “tail call optimization” that allows you to do this. Obviously, the sane way to parse an even number of a's is the rule (aa)*, aka (* "aa"). But there are many justifiable use-cases for the tail call optimization. For instance, given a pattern P, produce a new pattern that looks ahead for the first match of P. This would be the recursive rule S = P | .S, or a rule S defined as (or P (and (any) S)) in peg.el notation. Is there a simple and efficient way to do this in peg.el, one that allows skipping over thousands of characters without a new call stack entry for each one of them? ^ permalink raw reply [flat|nested] 100+ messages in thread
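One idiom that might address the question above is the usual PEG way of writing a search without recursion, `(!P .)* P` in textbook notation: skip characters while P does not match, then match P. The sketch below assumes peg.el is installed and that its `*` operator is compiled to a loop rather than a recursive call, which has not been checked here; P is taken to be a run of digits and the function name is made up.

```elisp
;;; -*- lexical-binding: t -*-
(require 'peg)

(defun my-peg-skip-to-digits (string)
  "Return a list holding the first run of digits in STRING, or nil."
  (with-temp-buffer
    (insert string)
    (goto-char (point-min))
    (peg-run
     (peg
      ;; Skip one character at a time while no digit is ahead.
      (* (and (not [0-9]) (any)))
      ;; Then match the pattern itself and push its text.
      (substring (+ [0-9]))))))

(my-peg-skip-to-digits "abc 123 def")
```

Unlike the recursive formulation, nothing here refers to a named rule, so the amount of skipped text should not translate into nested rule calls.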
* Re: Make peg.el a built-in library? 2021-09-26 10:59 ` Augusto Stoffel @ 2021-09-26 15:06 ` Eric Abrahamsen 2021-09-26 18:36 ` Augusto Stoffel 2021-09-27 22:34 ` Richard Stallman 1 sibling, 1 reply; 100+ messages in thread From: Eric Abrahamsen @ 2021-09-26 15:06 UTC (permalink / raw) To: emacs-devel Augusto Stoffel <arstoffel@gmail.com> writes: > I think it would be really cool to have PEGs built into Emacs. Things > like json.el could be simplified by at least (log10 2) orders of > magnitude with PEGs. Whatever the use case of `rx' is, PEGs are > probably the "real" solution. > > But I suspect this would only take traction with a fast and robust C > implementation like Lua's LPEG (see below for a reason). I wonder if it would make sense to adopt this elisp library for now, see if people use it (or want to use it but complain about speed), and consider translating to C if they do? The elisp version has generic methods for `peg-normalize' (and `peg--macroexpand', though I guess that's private) which would allow library authors to write new peg expressions. We'd lose that with C, though I suppose speed vs extensibility is always the tradeoff with C vs Elisp. In a previous message I complained a little bit about the entry-points to PEG as it stands now -- they're all macros. Maybe if we were thinking in terms of a future C translation, we could narrow the API down a little and lock it down, and discourage authors from using anything that wouldn't be made available by the future version. Eric ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-09-26 15:06 ` Eric Abrahamsen @ 2021-09-26 18:36 ` Augusto Stoffel 2021-09-27 16:18 ` Eric Abrahamsen 0 siblings, 1 reply; 100+ messages in thread From: Augusto Stoffel @ 2021-09-26 18:36 UTC (permalink / raw) To: Eric Abrahamsen; +Cc: emacs-devel On Sun, 26 Sep 2021 at 08:06, Eric Abrahamsen <eric@ericabrahamsen.net> wrote: > Augusto Stoffel <arstoffel@gmail.com> writes: > >> I think it would be really cool to have PEGs built into Emacs. Things >> like json.el could be simplified by at least (log10 2) orders of >> magnitude with PEGs. Whatever the use case of `rx' is, PEGs are >> probably the "real" solution. >> >> But I suspect this would only take traction with a fast and robust C >> implementation like Lua's LPEG (see below for a reason). > > I wonder if it would make sense to adopt this elisp library for now, see > if people use it (or want to use it but complain about speed), and > consider translating to C if they do? Yes, that sounds reasonable. But the efficiency problem isn't even just about speed, it's also about which patterns you can run at all without exhausting the call stack. Without the “tail call optimization” that I mentioned in the previous message, I think much of the appeal of PEGs is gone... > > The elisp version has generic methods for `peg-normalize' (and > `peg--macroexpand', though I guess that's private) which would allow > library authors to write new peg expressions. We'd lose that with C, > though I suppose speed vs extensibility is always the tradeoff with > C vs Elisp. I'm not sure I understand this comment, and I confess I didn't look closely at peg.el. But there's a difference between _defining_ a pattern and _executing_ it. If the basic PEG vocabulary (sequence, ordered choice, repetition, grammars, etc.) 
is implemented in C, you can define all sorts of combinators, such as (define-peg-rule search (patt) (or patt (and (any) (search patt)))) [or whatever the syntax is for grammars/recursive definitions], and executing the patterns doesn't involve any Lisp calls. > > In a previous message I complained a little bit about the entry-points > to PEG as it stands now -- they're all macros. Maybe if we were thinking > in terms of a future C translation, we could narrow the API down a > little and lock it down, and discourage authors from using anything that > wouldn't be made available by the future version. I can't say anything useful here without studying peg.el a bit, but I think it would be ideal if PEGs are just values (which, in particular, you can manipulate without naming) and there are functions that allow making new PEGs out of old ones. And once again, Lua's LPEG is a fantastic library. It might be worth taking a look at how it works. > > Eric ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-09-26 18:36 ` Augusto Stoffel @ 2021-09-27 16:18 ` Eric Abrahamsen 0 siblings, 0 replies; 100+ messages in thread From: Eric Abrahamsen @ 2021-09-27 16:18 UTC (permalink / raw) To: emacs-devel Augusto Stoffel <arstoffel@gmail.com> writes: > On Sun, 26 Sep 2021 at 08:06, Eric Abrahamsen <eric@ericabrahamsen.net> wrote: > >> Augusto Stoffel <arstoffel@gmail.com> writes: >> >>> I think it would be really cool to have PEGs built into Emacs. Things >>> like json.el could be simplified by at least (log10 2) orders of >>> magnitude with PEGs. Whatever the use case of `rx' is, PEGs are >>> probably the "real" solution. >>> >>> But I suspect this would only take traction with a fast and robust C >>> implementation like Lua's LPEG (see below for a reason). >> >> I wonder if it would make sense to adopt this elisp library for now, see >> if people use it (or want to use it but complain about speed), and >> consider translating to C if they do? > > Yes, that sounds reasonable. But the efficiency problem isn't even just > about speed, it's also about which patterns you can run at all without > exhausting the call stack. Without the “tail call optimization” that I > mentioned in the previous message, I think much of the appeal of PEGs is > gone... For someone hoping to use PEG to simplify parsing of very regular (though possibly complex) text (me), it's still pretty appealing. >> >> The elisp version has generic methods for `peg-normalize' (and >> `peg--macroexpand', though I guess that's private) which would allow >> library authors to write new peg expressions. We'd lose that with C, >> though I suppose speed vs extensibility is always the tradeoff with >> C vs Elisp. > > I'm not sure I understand this comment, and I confess I didn't look > closely at peg.el. But there's a difference between _defining_ a > pattern and _executing_ it. If the basic PEG vocabulary (sequence, > ordered choice, repetition, grammars, etc.) 
is implemented in C, you can > define all sorts of combinators, such as > > (define-peg-rule search (patt) > (or patt (and (any) (search patt)))) > > [or whatever the syntax is for grammars/recursive definitions], and > executing the patterns doesn't involve any Lisp calls. Yes, that's all I meant. So long as rules can still be defined in Lisp, this isn't an issue. >> In a previous message I complained a little bit about the entry-points >> to PEG as it stands now -- they're all macros. Maybe if we were thinking >> in terms of a future C translation, we could narrow the API down a >> little and lock it down, and discourage authors from using anything that >> wouldn't be made available by the future version. > > I can't say anything useful here without studying peg.el a bit, but I > think it would be ideal if PEGs are just values (which, in particular, > you can manipulate without naming) and there are functions that allow > making new PEGs out of old ones. > > And once again, Lua's LPEG is a fantastic library. It might be worth > taking a look at how it works. I don't really know anything about PEGs or the theory behind them, and was just hoping to be the squeaky wheel in this case. It would be great to improve peg.el, but I still think it would be nice to get it into Emacs first. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-09-26 10:59 ` Augusto Stoffel 2021-09-26 15:06 ` Eric Abrahamsen @ 2021-09-27 22:34 ` Richard Stallman 2021-09-28 3:52 ` Eric Abrahamsen 1 sibling, 1 reply; 100+ messages in thread From: Richard Stallman @ 2021-09-27 22:34 UTC (permalink / raw) To: Augusto Stoffel; +Cc: eric, eller.helmut, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] What is a PEG? -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-09-27 22:34 ` Richard Stallman @ 2021-09-28 3:52 ` Eric Abrahamsen 2021-09-28 8:09 ` tomas 2021-09-30 6:04 ` Richard Stallman 0 siblings, 2 replies; 100+ messages in thread From: Eric Abrahamsen @ 2021-09-28 3:52 UTC (permalink / raw) To: Richard Stallman; +Cc: eller.helmut, Augusto Stoffel, emacs-devel On 09/27/21 18:34 PM, Richard Stallman wrote: > [[[ To any NSA and FBI agents reading my email: please consider ]]] > [[[ whether defending the US Constitution against all enemies, ]]] > [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > What is a PEG? A Parsing Expression Grammar: https://en.wikipedia.org/wiki/Parsing_expression_grammar Basically a way of composing a parser out of smaller regexp-like expressions. They can be very useful in a wide variety of situations. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-09-28 3:52 ` Eric Abrahamsen @ 2021-09-28 8:09 ` tomas 2021-09-28 9:32 ` Helmut Eller 2021-09-28 15:24 ` Augusto Stoffel 2021-09-30 6:04 ` Richard Stallman 1 sibling, 2 replies; 100+ messages in thread From: tomas @ 2021-09-28 8:09 UTC (permalink / raw) To: emacs-devel [-- Attachment #1: Type: text/plain, Size: 1269 bytes --]

On Mon, Sep 27, 2021 at 08:52:38PM -0700, Eric Abrahamsen wrote:
>
> On 09/27/21 18:34 PM, Richard Stallman wrote:
> > [[[ To any NSA and FBI agents reading my email: please consider ]]]
> > [[[ whether defending the US Constitution against all enemies, ]]]
> > [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
> >
> > What is a PEG?
>
> A Parsing Expression Grammar:
> https://en.wikipedia.org/wiki/Parsing_expression_grammar
>
> Basically a way of composing a parser out of smaller regexp-like
> expressions. They can be very useful in a wide variety of situations.

In the Chomsky hierarchy, they live in some funny place between
regular (Type-3) and context-free (Type-2). They are strictly
more powerful than regular grammars (but can eat memory for
breakfast [1]), but (quoting the Wikipedia ref above):

  "It is an open problem to give a concrete example of a
   context-free language which cannot be recognized by a
   parsing expression grammar."

I don't know at the moment whether there is a (non-constructive)
proof that CFGs be strictly more expressive than PEGs?

Cheers

[1] Memory has become significantly cheaper since Thompson, this
    might have a practical significance or not ;-)

- t

[-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-09-28 8:09 ` tomas @ 2021-09-28 9:32 ` Helmut Eller 2021-09-28 10:45 ` tomas 2021-09-28 15:24 ` Augusto Stoffel 1 sibling, 1 reply; 100+ messages in thread From: Helmut Eller @ 2021-09-28 9:32 UTC (permalink / raw) To: emacs-devel On Tue, Sep 28 2021, tomas@tuxteam.de wrote: > I don't know at the moment whether there is a (non-constructive) > proof that CFGs be strictly more expressive than PEGs? You could ask this question on the PEG mailing list [1]. Apparently it has been proven[2] that for every CFG in LL(1) there is a corresponding PEG. This is very nice, because in practice we are mostly interested in grammars that can be parsed efficiently. Unfortunately, it seems[3] difficult/impossible to tell (statically) if a given PEG corresponds to LL(1) or how much backtracking it needs. Helmut [1] https://lists.csail.mit.edu/mailman/listinfo/peg [2] https://arxiv.org/abs/1304.3177 [3] Trying to understand PEG Fundamenta Informaticae 157, 4 (2018) 463-475. http://www.romanredz.se/papers/FI2017.pdf ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-09-28 9:32 ` Helmut Eller @ 2021-09-28 10:45 ` tomas 0 siblings, 0 replies; 100+ messages in thread From: tomas @ 2021-09-28 10:45 UTC (permalink / raw) To: emacs-devel [-- Attachment #1: Type: text/plain, Size: 412 bytes --] On Tue, Sep 28, 2021 at 11:32:58AM +0200, Helmut Eller wrote: > On Tue, Sep 28 2021, tomas@tuxteam.de wrote: > > > I don't know at the moment whether there is a (non-constructive) > > proof that CFGs be strictly more expressive than PEGs? > > You could ask this question on the PEG mailing list [1]. Uh, thanks for the links. They'll possibly fill most long evenings this winter ;-) Cheers - t [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-09-28 8:09 ` tomas 2021-09-28 9:32 ` Helmut Eller @ 2021-09-28 15:24 ` Augusto Stoffel 1 sibling, 0 replies; 100+ messages in thread From: Augusto Stoffel @ 2021-09-28 15:24 UTC (permalink / raw) To: tomas; +Cc: emacs-devel On Tue, 28 Sep 2021 at 10:09, <tomas@tuxteam.de> wrote: > On Mon, Sep 27, 2021 at 08:52:38PM -0700, Eric Abrahamsen wrote: >> >> On 09/27/21 18:34 PM, Richard Stallman wrote: >> > [[[ To any NSA and FBI agents reading my email: please consider ]]] >> > [[[ whether defending the US Constitution against all enemies, ]]] >> > [[[ foreign or domestic, requires you to follow Snowden's example. ]]] >> > >> > What is a PEG? >> >> A Parsing Expression Grammar: >> https://en.wikipedia.org/wiki/Parsing_expression_grammar >> >> Basically a way of composing a parser out of smaller regexp-like >> expressions. They can be very useful in a wide variety of situations. > > In the Chomsky hierarchy, they live in some funny place between > regular (Type-3) and context free (Type-2). They are strictly > more powerful than regular grammars (but can eat memory for > breakfast [1], but (quoting the Wikipedia ref above: > > "It is an open problem to give a concrete example of a > context-free language which cannot be recognized by a > parsing expression grammar." Perhaps more interesting in practice: a PEG can compute and return a value as it parses the subject string. So one can (easily) write a PEG that recognizes well-formed arithmetic expressions _and_ computes the value of the arithmetic expression along the way. Or a PEG that recognizes email headers and returns those headers as an alist. Regexps usually only produce substrings of the subject string (in Emacs regexps can also call Lisp code, but this is not as general.) [Also note that a PEG defines a parser for a grammar, not just a grammar.] > > I don't know at the moment whether there is a (non-constructive) > proof that CFGs be strictly more expressive than PEGs? 
> > Cheers > > [1] Memory has become significantly cheaper since Thompson, this > might have a practical significance or not ;-) > > - t ^ permalink raw reply [flat|nested] 100+ messages in thread
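A minimal sketch of the point made above, that a PEG can compute a value while it parses. It assumes the `peg' package (GNU ELPA) is installed and that the code is evaluated with lexical binding; `my-peg-eval-sum' is a hypothetical name used only for this illustration, and the grammar handles nothing beyond sums of non-negative integers.

```elisp
;;; -*- lexical-binding: t; -*-
(require 'peg)

(defun my-peg-eval-sum (string)
  "Parse STRING as a sum of integers such as \"1+2+39\" and return its value.
Hypothetical helper, named for this sketch only."
  (with-temp-buffer
    (insert string)
    (goto-char (point-min))
    ;; The first rule (`sum') is the entry point.  On success `peg-parse'
    ;; returns the value stack, which here holds the single result.
    (car (peg-parse
          ;; Each "+ number" pops two values and pushes their sum.
          (sum number (* "+" number `(a b -- (+ a b))) (eob))
          ;; Match digits, push them as a string, then convert to a number.
          (number (substring (+ [0-9]))
                  `(s -- (string-to-number s)))))))
```

The recognizer and the evaluator are the same grammar: the stack actions run only once the whole input has been accepted, so a malformed string signals a parse error instead of yielding a partial value.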
* Re: Make peg.el a built-in library? 2021-09-28 3:52 ` Eric Abrahamsen 2021-09-28 8:09 ` tomas @ 2021-09-30 6:04 ` Richard Stallman 2021-10-01 3:27 ` Eric Abrahamsen 1 sibling, 1 reply; 100+ messages in thread From: Richard Stallman @ 2021-09-30 6:04 UTC (permalink / raw) To: Eric Abrahamsen; +Cc: eller.helmut, arstoffel, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider ]]]
[[[ whether defending the US Constitution against all enemies, ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > Basically a way of composing a parser out of smaller regexp-like
  > expressions. They can be very useful in a wide variety of situations.

It does sound useful. Can you post a description of a specific simple
example where this approach is advantageous?

--
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-09-30 6:04 ` Richard Stallman @ 2021-10-01 3:27 ` Eric Abrahamsen 0 siblings, 0 replies; 100+ messages in thread From: Eric Abrahamsen @ 2021-10-01 3:27 UTC (permalink / raw) To: emacs-devel

Richard Stallman <rms@gnu.org> writes:

> [[[ To any NSA and FBI agents reading my email: please consider ]]]
> [[[ whether defending the US Constitution against all enemies, ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
>   > Basically a way of composing a parser out of smaller regexp-like
>   > expressions. They can be very useful in a wide variety of situations.
>
> It does sound useful. Can you post a description of a specific simple
> example where this approach is advantageous?

I feel like I've ended up advocating for this thing when I know less
about it than anyone here, but...

My sense is that really powerful PEG systems are the sort of thing you
use to parse source code into ASTs, or do syntax highlighting, etc. We
don't need that, and the use-cases I have in mind, anyway, are simpler
situations where I want to parse a stream of
well-defined-but-still-pretty-complicated text. The sort of thing where
a regexp solution turns into a rat's nest very quickly.

One theoretical example is parsing IMAP server responses. The response
text is fully defined, but could vary enormously depending on the
capabilities of the server. Writing naive regexps is a headache.

Another non-theoretical example is the homemade token-parser in
lisp/gnus/gnus-search.el:390-680, which turns a string like:

  from:bob (subject:lunch or subject:dinner)

into the sexp

  ((from . "bob") (or (subject . "lunch") (subject . "dinner")))

There are many, many libraries that need to do something similar. With
peg.el I can parse the above (including arbitrarily-nested
sub-expressions) with twenty lines of peg definition, which is
comprehensible to look at (once you've got the basics), easier to reason
about, and easier to modify.
I guess it's sort of equivalent to a BNF. PEGs and their implementation are the subject of academic research, obviously, but for my modest uses, anyway, almost anything will do. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-08-25 18:52 Make peg.el a built-in library? Eric Abrahamsen 2021-08-26 6:17 ` Eli Zaretskii 2021-08-26 17:02 ` Adam Porter @ 2021-10-09 1:31 ` Michael Heerdegen 2021-10-09 5:28 ` Michael Heerdegen ` (2 more replies) 2022-11-07 3:33 ` Ihor Radchenko 3 siblings, 3 replies; 100+ messages in thread From: Michael Heerdegen @ 2021-10-09 1:31 UTC (permalink / raw) To: Eric Abrahamsen; +Cc: Stefan Monnier, emacs-devel Eric Abrahamsen <eric@ericabrahamsen.net> writes: > Hi all, > > In my on-again-off-again quest to not have to write text parsers myself, > I was pointed towards the PEG library (in ELPA), which does pretty much > exactly what I want (Parsing Expression Grammars). I like the idea, and I have some remarks: (1) Can we improve the introduction in the file header a bit? I would add a link to the wikipedia page: https://en.wikipedia.org/wiki/Parsing_expression_grammar it explains some background. And: one example could contain the (non-standard if you only know regexps, but very educative) solution to the problem: "how do you jump over arbitrary text preceding a match?" (the answer seems to be: "use `or' and recursion", at least this is what I found out by myself after a while). (2) Would (replace E RPL) not be much more useful if it would be allowed to pop from the stack? Something like (replace E [VAR...] -- REPL) where REPL could use the VAR bindings? Background is of course that a replacement may depend on intermediate parsing results. (3) `(_ --) seems to produce an "Empty let body" compiler warning - can we silence it? (4) How hard would it be to parse regexps (or translate `rx' forms) into an equivalent peg? (5) We need to add a Game-like tutorial to PEGs called Peg-Man. Ok, that one was only a joke. WDYT? Thanks, Michael. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-10-09 1:31 ` Michael Heerdegen @ 2021-10-09 5:28 ` Michael Heerdegen 2021-10-09 8:12 ` Helmut Eller 2021-10-09 12:54 ` Stefan Monnier 2021-10-09 16:49 ` Eric Abrahamsen 2 siblings, 1 reply; 100+ messages in thread From: Michael Heerdegen @ 2021-10-09 5:28 UTC (permalink / raw) To: Eric Abrahamsen; +Cc: Stefan Monnier, emacs-devel

Michael Heerdegen <michael_heerdegen@web.de> writes:

> "how do you jump over arbitrary text preceding a match?" (the answer
> seems to be: "use `or' and recursion", at least this is what I found
> out by myself after a while).

No - using recursive rules of the kind

  (rule [matches what I want])
  (search (or rule (and (any) search)))

to advance over preceding text is not a good method in Emacs, this hits
Emacs' maximum recursion level after a bunch of lines if we advance one
character each time (which can't be avoided when searching text). Is
there a better solution for this kind of problem?

Thanks,

Michael.

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-10-09 5:28 ` Michael Heerdegen @ 2021-10-09 8:12 ` Helmut Eller 2021-10-09 12:52 ` Stefan Monnier 2021-10-14 10:25 ` Michael Heerdegen 0 siblings, 2 replies; 100+ messages in thread From: Helmut Eller @ 2021-10-09 8:12 UTC (permalink / raw) To: emacs-devel

On Sat, Oct 09 2021, Michael Heerdegen wrote:

>> "how do you jump over arbitrary text preceding a match?" (the answer
>> seems to be: "use `or' and recursion", at least this is what I found
>> out by myself after a while).
>
> No - using recursive rules of the kind
>
>   (rule [matches what I want])
>   (search (or rule (and (any) search)))
>
> to advance over preceding text is not a good method in Emacs, this hits
> Emacs' maximum recursion level after a bunch of lines if we advance one
> character each time (which can't be avoided when searching text). Is
> there a better solution for this kind of problem?

Self-recursion can sometimes be rewritten using *. In peg.el, * is
"inlined" and so doesn't run out of stack:

  (rule [matches what I want])
  (search (and (* (not rule) (any)) rule))

It's kinda like rewriting a self tail call to a while loop.

For the general case, peg.el would need some form of proper tail calls.

Helmut

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-10-09 8:12 ` Helmut Eller @ 2021-10-09 12:52 ` Stefan Monnier 2021-10-10 5:49 ` Helmut Eller 2021-10-14 10:25 ` Michael Heerdegen 1 sibling, 1 reply; 100+ messages in thread From: Stefan Monnier @ 2021-10-09 12:52 UTC (permalink / raw) To: Helmut Eller; +Cc: emacs-devel > For the general case, peg.el would need some form of proper tail calls. Maybe we could (re)use the tail-call elimination that I implemented for `named-let`. Stefan ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-10-09 12:52 ` Stefan Monnier @ 2021-10-10 5:49 ` Helmut Eller 0 siblings, 0 replies; 100+ messages in thread From: Helmut Eller @ 2021-10-10 5:49 UTC (permalink / raw) To: Stefan Monnier; +Cc: emacs-devel On Sat, Oct 09 2021, Stefan Monnier wrote: >> For the general case, peg.el would need some form of proper tail calls. > > Maybe we could (re)use the tail-call elimination that I implemented for > `named-let`. This reminds me of a question I wanted to ask. Suppose we want to implement a LPeg-like virtual machine[*] as a dynamic module. Is there a reasonably efficient API to read a buffer's content? Maybe something like a FILE* stream backed by an Emacs buffer? Helmut [*] http://www.inf.puc-rio.br/~roberto/docs/ry08-4.pdf ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-10-09 8:12 ` Helmut Eller 2021-10-09 12:52 ` Stefan Monnier @ 2021-10-14 10:25 ` Michael Heerdegen 1 sibling, 0 replies; 100+ messages in thread From: Michael Heerdegen @ 2021-10-14 10:25 UTC (permalink / raw) To: Helmut Eller; +Cc: emacs-devel

Helmut Eller <eller.helmut@gmail.com> writes:

> Self-recursion can sometimes be rewritten using *. In peg.el, * is
> "inlined" and so doesn't run out of stack:
>
>   (rule [matches what I want])
>   (search (and (* (not rule) (any)) rule))
>
> It's kinda like rewriting a self tail call to a while loop.

Yes, that works well in my case, thanks.

Michael.

^ permalink raw reply [flat|nested] 100+ messages in thread
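The `*'-based search idiom from this sub-thread can be sketched as a self-contained function. This is an illustration only, assuming the `peg' package is available and the code is evaluated with lexical binding; `my-peg-first-number' and the rule names `digits' and `search' are hypothetical, chosen for this example.

```elisp
;;; -*- lexical-binding: t; -*-
(require 'peg)

(defun my-peg-first-number (string)
  "Return the first run of digits in STRING as a string.
Hypothetical helper, named for this sketch only."
  (with-temp-buffer
    (insert string)
    (goto-char (point-min))
    (with-peg-rules
        ((digits (+ [0-9]))
         ;; Skip one character at a time while `digits' does not match
         ;; here.  `*' loops instead of recursing once per character,
         ;; so long inputs do not exhaust the Lisp stack.
         (search (* (not digits) (any)) (substring digits)))
      ;; On success `peg-run' returns the value stack; take its top item.
      (car-safe (peg-run (peg search))))))
```

For example, `(my-peg-first-number "abc 42 def")' evaluates to "42". The `(not digits)' guard is what makes the loop stop at the first match rather than consuming the whole buffer.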
* Re: Make peg.el a built-in library? 2021-10-09 1:31 ` Michael Heerdegen 2021-10-09 5:28 ` Michael Heerdegen @ 2021-10-09 12:54 ` Stefan Monnier 2021-10-09 16:47 ` Eric Abrahamsen 2021-10-09 16:49 ` Eric Abrahamsen 2 siblings, 1 reply; 100+ messages in thread From: Stefan Monnier @ 2021-10-09 12:54 UTC (permalink / raw) To: Michael Heerdegen; +Cc: Eric Abrahamsen, emacs-devel > (1) Can we improve the introduction in the file header a bit? I would > add a link to the wikipedia page: > > https://en.wikipedia.org/wiki/Parsing_expression_grammar > > it explains some background. I can't speak for Helmut, but I think you should feel free to make such a change, yes. Stefan ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-10-09 12:54 ` Stefan Monnier @ 2021-10-09 16:47 ` Eric Abrahamsen 2021-10-10 4:20 ` Michael Heerdegen 0 siblings, 1 reply; 100+ messages in thread From: Eric Abrahamsen @ 2021-10-09 16:47 UTC (permalink / raw) To: emacs-devel; +Cc: Michael Heerdegen [-- Attachment #1: Type: text/plain, Size: 456 bytes --]

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> (1) Can we improve the introduction in the file header a bit? I would
>> add a link to the wikipedia page:
>>
>> https://en.wikipedia.org/wiki/Parsing_expression_grammar
>>
>> it explains some background.
>
> I can't speak for Helmut, but I think you should feel free to make such
> a change, yes.

I've still got this documentation patch I haven't applied, I can just
add that link to this patch?

[-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: peg-doc-patch.diff --] [-- Type: text/x-patch, Size: 4172 bytes --]

diff --git a/peg.el b/peg.el
index d71c707dc0..0e4221eeb7 100644
--- a/peg.el
+++ b/peg.el
@@ -79,17 +79,69 @@
 ;; Beginning-of-Symbol (bos)
 ;; End-of-Symbol (eos)
 ;;
-;; PEXs also support parsing actions, i.e. Lisp snippets which
-;; are executed when a pex matches. This can be used to construct
-;; syntax trees or for similar tasks. Actions are written as
+;; Rules can refer to other rules, and a grammar is often structured
+;; as a tree, with a root rule referring to one or more "branch
+;; rules", all the way down to the "leaf rules" that deal with actual
+;; buffer text. Rules can be recursive or mutually referential,
+;; though care must be taken not to create infinite loops.
+;;
+;; PEXs also support parsing actions, i.e. Lisp snippets which are
+;; executed when a pex matches. This can be used to construct syntax
+;; trees or for similar tasks. The most basic form of action is
+;; written as:
 ;;
 ;; (action FORM) ; evaluate FORM for its side-effects
-;; `(VAR... -- FORM...) ; stack action
 ;;
 ;; Actions don't consume input, but are executed at the point of
-;; match. A "stack action" takes VARs from the "value stack" and
-;; pushes the result of evaluating FORMs to that stack.
-;; See `peg-ex-parse-int' in `peg-tests.el' for an example.
+;; match. Another kind of action is called a "stack action", and
+;; looks like this:
+;;
+;; `(VAR... -- FORM...) ; stack action
+;;
+;; A stack action takes VARs from the "value stack" and pushes the
+;; results of evaluating FORMs to that stack.
+
+;; The value stack is created during the course of parsing. Certain
+;; operators (see below) that match buffer text can push values onto
+;; this stack. "Upstream" rules can then draw values from the stack,
+;; and optionally push new ones back. For instance, consider this
+;; very simple grammar:
+;;
+;; (with-peg-rules
+;;     ((query (+ term) (eol))
+;;      (term key ":" value (opt (+ [space]))
+;;            `(k v -- (cons (intern k) v)))
+;;      (key (substring (and (not ":") (+ [word]))))
+;;      (value (or string-value number-value))
+;;      (string-value (substring (+ [alpha])))
+;;      (number-value (substring (+ [digit]))
+;;                    `(val -- (string-to-number val))))
+;;   (peg-run (peg query)))
+;;
+;; This invocation of `peg-run' would parse this buffer text:
+;;
+;; name:Jane age:30
+;;
+;; And return this Elisp sexp:
+;;
+;; ((age . 30) (name . "Jane"))
+;;
+;; Note that, in complex grammars, some care must be taken to make
+;; sure that the number and type of values drawn from the stack always
+;; match those pushed. In the example above, both `string-value' and
+;; `number-value' push a single value to the stack. Since the `value'
+;; rule only includes these two sub-rules, any upstream rule that
+;; makes use of `value' can be confident it will always and only push
+;; a single value to the stack.
+;;
+;; Stack action forms are in a sense analogous to lambda forms: the
+;; symbols before the "--" are the equivalent of lambda arguments,
+;; while the forms after the "--" are return values. The difference
+;; being that a lambda form can only return a single value, while a
+;; stack action can push multiple values onto the stack. It's also
+;; perfectly valid to use `(-- FORM...)' or `(VAR... --)': the former
+;; pushes values to the stack without consuming any, and the latter
+;; pops values from the stack and discards them.
 ;;
 ;; Derived Operators:
 ;;
@@ -101,6 +153,8 @@
 ;; (replace E RPL); Match E and replace the matched region with RPL.
 ;; (list E) ; Match E and push a list of the items that E produced.
 ;;
+;; See `peg-ex-parse-int' in `peg-tests.el' for further examples.
+;;
 ;; Regexp equivalents:
 ;;
 ;; Here a some examples for regexps and how those could be written as pex.
@@ -177,7 +231,7 @@ EXPS is a list of rules/expressions that failed.")

 ;;;; Main entry points

-;; Sometimes (with-peg-rule ... (peg-run (peg ...))) is too
+;; Sometimes (with-peg-rules ... (peg-run (peg ...))) is too
 ;; longwinded for the task at hand, so `peg-parse' comes in handy.
 (defmacro peg-parse (&rest pexs)
   "Match PEXS at point.

^ permalink raw reply related [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-10-09 16:47 ` Eric Abrahamsen @ 2021-10-10 4:20 ` Michael Heerdegen 2021-10-10 21:40 ` Eric Abrahamsen 0 siblings, 1 reply; 100+ messages in thread From: Michael Heerdegen @ 2021-10-10 4:20 UTC (permalink / raw) To: emacs-devel Eric Abrahamsen <eric@ericabrahamsen.net> writes: > >> (1) Can we improve the introduction in the file header a bit? I would > >> add a link to the wikipedia page: > >> > >> https://en.wikipedia.org/wiki/Parsing_expression_grammar > >> > >> it explains some background. > > > > I can't speak for Helmut, but I think you should feel free to make such > > a change, yes. > > I've still got this documentation patch I haven't applied, I can just > add that link to this patch? From my side, nothing against that. I have quickly skimmed over your text and found nothing obviously wrong or confusing, and it makes some things a bit clearer. Should we say something about how to use globally defined pegs? AFAIU you can use them like (my-peg) in parens, contrary to rules, which appear as plain symbols. At least, this was one of the things I wondered while trying this out: what do I have to wrap in parens. Michael. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-10-10 4:20 ` Michael Heerdegen @ 2021-10-10 21:40 ` Eric Abrahamsen 2021-10-13 2:58 ` Michael Heerdegen 0 siblings, 1 reply; 100+ messages in thread From: Eric Abrahamsen @ 2021-10-10 21:40 UTC (permalink / raw) To: emacs-devel Michael Heerdegen <michael_heerdegen@web.de> writes: > Eric Abrahamsen <eric@ericabrahamsen.net> writes: > >> >> (1) Can we improve the introduction in the file header a bit? I would >> >> add a link to the wikipedia page: >> >> >> >> https://en.wikipedia.org/wiki/Parsing_expression_grammar >> >> >> >> it explains some background. >> > >> > I can't speak for Helmut, but I think you should feel free to make such >> > a change, yes. >> >> I've still got this documentation patch I haven't applied, I can just >> add that link to this patch? > > From my side, nothing against that. I have quickly skimmed over your > text and found nothing obviously wrong or confusing, and it makes some > things a bit clearer. > > Should we say something about how to use globally defined pegs? AFAIU > you can use them like (my-peg) in parens, contrary to rules, which > appear as plain symbols. At least, this was one of the things I > wondered while trying this out: what do I have to wrap in > parens. I'm not quite sure what you mean here. If you use the `define-peg-rule' you can use the symbol plain, you don't have to wrap it in parentheses. If you want to use one of the built-in action functions, like "substring", then you have to wrap your symbol in that, same as if you were defining a rule on the spot. But that's just for convenience. The shorthand: (substring <my-peg-symbol>) is defined as: (and `(-- (point)) <my-peg-symbol> `(start -- (buffer-substring-no-properties start (point)))) I don't think you have to wrap anything in parentheses, though you *can* if you want to, and it will work correctly. Am I misunderstanding you? Eric ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-10-10 21:40 ` Eric Abrahamsen @ 2021-10-13 2:58 ` Michael Heerdegen 0 siblings, 0 replies; 100+ messages in thread From: Michael Heerdegen @ 2021-10-13 2:58 UTC (permalink / raw) To: Eric Abrahamsen; +Cc: emacs-devel Eric Abrahamsen <eric@ericabrahamsen.net> writes: > I don't think you have to wrap anything in parentheses, though you > *can* if you want to, and it will work correctly. > > Am I misunderstanding you? No, thanks for making that clear, I just didn't know. Michael. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-10-09 1:31 ` Michael Heerdegen 2021-10-09 5:28 ` Michael Heerdegen 2021-10-09 12:54 ` Stefan Monnier @ 2021-10-09 16:49 ` Eric Abrahamsen 2021-10-10 3:43 ` Stefan Monnier 2 siblings, 1 reply; 100+ messages in thread From: Eric Abrahamsen @ 2021-10-09 16:49 UTC (permalink / raw) To: Michael Heerdegen; +Cc: Stefan Monnier, emacs-devel Michael Heerdegen <michael_heerdegen@web.de> writes: > Eric Abrahamsen <eric@ericabrahamsen.net> writes: > >> Hi all, >> >> In my on-again-off-again quest to not have to write text parsers myself, >> I was pointed towards the PEG library (in ELPA), which does pretty much >> exactly what I want (Parsing Expression Grammars). > > I like the idea, and I have some remarks: > > (1) Can we improve the introduction in the file header a bit? I would > add a link to the wikipedia page: > > https://en.wikipedia.org/wiki/Parsing_expression_grammar > > it explains some background. [...] > (4) How hard would it be to parse regexps (or translate `rx' forms) into > an equivalent peg? I had this idea as well -- we've already got "regexps that look like forms", it seems like it would be a natural to integrate this with rx. One thing we're not short of here is new ideas for code, but I do think this would make a lot of sense. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-10-09 16:49 ` Eric Abrahamsen @ 2021-10-10 3:43 ` Stefan Monnier 2021-10-10 4:46 ` Michael Heerdegen 0 siblings, 1 reply; 100+ messages in thread From: Stefan Monnier @ 2021-10-10 3:43 UTC (permalink / raw) To: Eric Abrahamsen; +Cc: Michael Heerdegen, emacs-devel

>> (4) How hard would it be to parse regexps (or translate `rx' forms) into
>> an equivalent peg?
> I had this idea as well -- we've already got "regexps that look like
> forms", it seems like it would be a natural to integrate this with rx.
> One thing we're not short of here is new ideas for code, but I do think
> this would make a lot of sense.

I think turning a regexp into a PEG should be easy, but on one
condition: you shouldn't expect that PEG to be *equivalent* to the
regexp. E.g. when matching

  (string-match "\\(ab\\|a\\)bc" "abc")

the "natural" PEG for that regexp will fail to match (because it will
see a success to match "ab" and will hence just skip the "a"
alternative).

Correctly matching regexps requires a deeper form of backtracking than
provided by PEGs.

Stefan

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-10-10 3:43 ` Stefan Monnier @ 2021-10-10 4:46 ` Michael Heerdegen 2021-10-10 5:58 ` Helmut Eller 0 siblings, 1 reply; 100+ messages in thread From: Michael Heerdegen @ 2021-10-10 4:46 UTC (permalink / raw) To: Stefan Monnier; +Cc: Eric Abrahamsen, emacs-devel Stefan Monnier <monnier@iro.umontreal.ca> writes: > Correctly matching regexps requires a deeper form of backtracking than > provided by PEGs. I learned PEGs are able to accept any type 3 language. I also learned that PEGs alternatives work differently. Is it practically possible to transform a regexp into a really equivalent PEG, or is it too difficult, or would the resulting PEG just be too large or inefficient? Michael. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-10-10 4:46 ` Michael Heerdegen @ 2021-10-10 5:58 ` Helmut Eller 2021-10-10 13:56 ` Stefan Monnier ` (2 more replies) 0 siblings, 3 replies; 100+ messages in thread From: Helmut Eller @ 2021-10-10 5:58 UTC (permalink / raw) To: emacs-devel On Sun, Oct 10 2021, Michael Heerdegen wrote: > Is it practically possible to transform a regexp into a really > equivalent PEG, or is it too difficult, or would the resulting PEG just > be too large or inefficient? The LPEG people wrote a paper[*] about this problem. But I haven't read it. I think, that regexp without backrefs can be implemented with DFAs, and, hence, shouldn't need any backtracking. The problem probably are backrefs and other extensions. Helmut http://www.lua.inf.puc-rio.br/publications/medeiros11regular.pdf ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-10-10 5:58 ` Helmut Eller @ 2021-10-10 13:56 ` Stefan Monnier 2021-10-22 16:33 ` Michael Heerdegen 2021-10-31 23:43 ` Michael Heerdegen 2 siblings, 0 replies; 100+ messages in thread From: Stefan Monnier @ 2021-10-10 13:56 UTC (permalink / raw) To: Helmut Eller; +Cc: emacs-devel >> Is it practically possible to transform a regexp into a really >> equivalent PEG, or is it too difficult, or would the resulting PEG just >> be too large or inefficient? > The LPEG people wrote a paper[*] about this problem. But I haven't read This is similar to turning the regexp into an NFA and then using the PEG backtracking to run the NFA. Our regexp engine tries to remove some simple forms of backtracking (e.g. for regexps like "\\(.foo\\)*\nbar" because \n and . are mutually exclusive). This significantly reduces the amount of stack use. We could/should perform similar optimizations in `peg.el`. Stefan ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-10-10 5:58 ` Helmut Eller 2021-10-10 13:56 ` Stefan Monnier @ 2021-10-22 16:33 ` Michael Heerdegen 2021-10-31 23:43 ` Michael Heerdegen 2 siblings, 0 replies; 100+ messages in thread From: Michael Heerdegen @ 2021-10-22 16:33 UTC (permalink / raw) To: Helmut Eller; +Cc: emacs-devel Helmut Eller <eller.helmut@gmail.com> writes: > On Sun, Oct 10 2021, Michael Heerdegen wrote: > > > Is it practically possible to transform a regexp into a really > > equivalent PEG, or is it too difficult, or would the resulting PEG just > > be too large or inefficient? > > The LPEG people wrote a paper[*] about this problem. IIUC their answer to the ordered `or' operator problem is simply, at the end, to apply the distributive law when performing the transcription. So e.g. (and (or "a" "aa") "b") doesn't match "aab" as a peg, but (or (and "a" "b") (and "aa" "b")) does. Michael. ^ permalink raw reply [flat|nested] 100+ messages in thread
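The ordered-choice difference discussed in this sub-thread can be checked directly. The following is a sketch only, assuming the `peg' package is available and lexical binding is in effect; `my-peg-ordered-choice-demo' and the rule names `naive' and `distributed' are hypothetical names for this illustration. It compares the regexp from the earlier message with its "natural" PEG and with the variant obtained by distributing the trailing "bc" over the alternatives.

```elisp
;;; -*- lexical-binding: t; -*-
(require 'peg)

(defun my-peg-ordered-choice-demo ()
  "Return (REGEXP NAIVE DISTRIBUTED) match results for the input \"abc\".
Hypothetical helper, named for this sketch only."
  (with-temp-buffer
    (insert "abc")
    (with-peg-rules
        (;; Ordered choice commits to "ab", so "bc" can no longer match.
         (naive (or "ab" "a") "bc")
         ;; Distributing "bc" over the alternatives restores the match.
         (distributed (or (and "ab" "bc") (and "a" "bc"))))
      (list (progn (goto-char (point-min))
                   ;; The regexp engine backtracks into the alternation.
                   (looking-at-p "\\(ab\\|a\\)bc"))
            (progn (goto-char (point-min))
                   (and (peg-run (peg naive)) t))
            (progn (goto-char (point-min))
                   (and (peg-run (peg distributed)) t))))))
```

Evaluating `(my-peg-ordered-choice-demo)' gives `(t nil t)': the regexp matches, the direct transcription does not, and the distributed form does. The cost of the distributed form is that the shared suffix is duplicated in every alternative.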
* Re: Make peg.el a built-in library? 2021-10-10 5:58 ` Helmut Eller 2021-10-10 13:56 ` Stefan Monnier 2021-10-22 16:33 ` Michael Heerdegen @ 2021-10-31 23:43 ` Michael Heerdegen 2021-11-15 23:16 ` Michael Heerdegen 2 siblings, 1 reply; 100+ messages in thread From: Michael Heerdegen @ 2021-10-31 23:43 UTC (permalink / raw) To: Helmut Eller; +Cc: emacs-devel [-- Attachment #1: Type: text/plain, Size: 720 bytes --] Helmut Eller <eller.helmut@gmail.com> writes: > The LPEG people wrote a paper[*] about this problem. I tried to convert their transcription function to Elisp. See below. Seems to work - but so far only basic regexp constructs are supported. > The problem probably are backrefs and other extensions. I think backrefs can be implemented in peg.el in a simple way. They can't be a standard extension though because matching the backref needs to advance point (so they are not just a certain `guard'). If that is supported by peg.el I think backrefs could just be transcribed more or less directly but I'm not sure about equivalence. > Helmut > > http://www.lua.inf.puc-rio.br/publications/medeiros11regular.pdf [-- Attachment #2: rx-to-peg.el --] [-- Type: application/emacs-lisp, Size: 4283 bytes --] [-- Attachment #3: Type: text/plain, Size: 12 bytes --] Michael. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-10-31 23:43 ` Michael Heerdegen @ 2021-11-15 23:16 ` Michael Heerdegen 0 siblings, 0 replies; 100+ messages in thread From: Michael Heerdegen @ 2021-11-15 23:16 UTC (permalink / raw) To: Helmut Eller; +Cc: emacs-devel [-- Attachment #1: Type: text/plain, Size: 1227 bytes --] Michael Heerdegen <michael_heerdegen@web.de> writes: > > The LPEG people wrote a paper[*] about this problem. The converter is more or less done - see below. Feedback welcome! Nearly everything regexps support is implemented. I tried to make everything so that the resulting peg is really equivalent to the given regexp - please tell me if you find a translation where this is not respected. Remaining problems: (1) Group numbering currently has to be explicit - unnumbered groups are silently treated as shy. That's because getting the numbering right is not trivial. I implemented groups and backrefs using an uninterned global variable owned by the peg. It would be better to add built-in support to peg.el if we want that feature. (2) Transforming character ranges to the vector representation that peg.el uses is not trivial. I would welcome help to get it done correctly. A possible (slow) fallback solution is a guard calling `looking-at' followed by an (any). Oh - why do I think this conversion code is useful? It's nice for learning but also for cases where a regexp would almost suffice but you need some Elisp guard somewhere in the middle of matching the regexp to examine the buffer at that position. [-- Attachment #2: rx-to-peg.el --] [-- Type: application/emacs-lisp, Size: 13406 bytes --] [-- Attachment #3: Type: text/plain, Size: 11 bytes --] Michael. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2021-08-25 18:52 Make peg.el a built-in library? Eric Abrahamsen ` (2 preceding siblings ...) 2021-10-09 1:31 ` Michael Heerdegen @ 2022-11-07 3:33 ` Ihor Radchenko 2022-11-07 19:46 ` Eric Abrahamsen 3 siblings, 1 reply; 100+ messages in thread From: Ihor Radchenko @ 2022-11-07 3:33 UTC (permalink / raw) To: Eric Abrahamsen; +Cc: emacs-devel, Stefan Monnier Eric Abrahamsen <eric@ericabrahamsen.net> writes: > Would the maintainers consider moving this into Emacs proper? I ask > mostly because this would be very useful to have in Gnus, both to > replace the home-made parser in gnus-search.el, and I would hope to > parse eg IMAP server responses more fully and reliably. Is there any progress merging peg.el to Emacs? I do not see any obvious blockers in the discussion, but the merge never happened? -- Ihor Radchenko // yantar92, Org mode contributor, Learn more about Org mode at <https://orgmode.org/>. Support Org development at <https://liberapay.com/org-mode>, or support my work at <https://liberapay.com/yantar92> ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2022-11-07 3:33 ` Ihor Radchenko @ 2022-11-07 19:46 ` Eric Abrahamsen 2022-11-08 6:57 ` Helmut Eller ` (2 more replies) 0 siblings, 3 replies; 100+ messages in thread From: Eric Abrahamsen @ 2022-11-07 19:46 UTC (permalink / raw) To: Ihor Radchenko; +Cc: emacs-devel, Stefan Monnier Ihor Radchenko <yantar92@posteo.net> writes: > Eric Abrahamsen <eric@ericabrahamsen.net> writes: > >> Would the maintainers consider moving this into Emacs proper? I ask >> mostly because this would be very useful to have in Gnus, both to >> replace the home-made parser in gnus-search.el, and I would hope to >> parse eg IMAP server responses more fully and reliably. > > Is there any progress merging peg.el to Emacs? > I do not see any obvious blockers in the discussion, but the merge never > happened? It certainly did lose momentum. I think there were some issues regarding implementation and API, some open questions, and then whoever would have needed to take ownership of the ticket and see it through did not do so. Probably that should have been whoever opened the bug report to begin with! I believe peg.el does a few things in non-standard ways. I'm not very familiar with parsing expression grammars, and I don't feel qualified to judge just how non-standard those ways are, and whether it's a real issue. But if no one has any massive objections (or plausible fixes) then personally I'd be okay with it going in like this. I'm not a maintainer though! I will say that I tried to use PEG to resolve some gruesome text-parsing issues in EBDB very recently, and failed to make it work in the hour or two I'd allotted to the problem. The file-comment docs are pretty good, but I think they would need to be expanded in a few crucial ways, particularly to help those who don't necessarily know how PEGs work. Specifically, it is not obvious (to me) the ways in which PEGs (or maybe just peg.el) are not fully declarative. 
It doesn't backtrack, and I suspect it won't ever backtrack or isn't even supposed to, which means users should be made explicitly aware of the ways in which their rules can fail, and the ways in which declaration order matters. The comment for the `or' construct reads: Prioritized Choice And that's about the only hint you get. I was trying to parse a multiword name like Eric Edwin Abrahamsen into the structure (("Eric" "Edwin") "Abrahamsen") using rules like (plain-name (substring (+ [word])) (* [space])) (full-name (list (+ plain-name) plain-name) `(names -- (list (butlast names) (car (last names))))) Which always fails to match because (+ plain-name) is greedy and eats up all the words. It doesn't ever try leaving out the last word in an attempt to make the rule match. I'm happy to write the docs (should it have its own info manual section?), if we really think there are no other necessary fixes/improvements. Eric ^ permalink raw reply [flat|nested] 100+ messages in thread
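[Illustrative sketch, not part of the original message.] The greedy behaviour described above can be reproduced with a much smaller grammar. The sketch below assumes peg.el is installed and that `peg-run' returns nil on failure; it uses the `(if E)' and-predicate listed in the peg.el commentary, and all function and rule names (`my-peg-demo-greedy', `my-peg-demo-lookahead', `greedy', `guarded') are hypothetical.

```elisp
(require 'peg)

(defun my-peg-demo-greedy ()
  "Try one-or-more \"a\" followed by a final \"a\" against \"aaa\"."
  (with-temp-buffer
    (insert "aaa")
    (goto-char (point-min))
    ;; (+ "a") consumes all three characters and never gives one
    ;; back, so the trailing "a" has nothing left to match.
    (with-peg-rules ((greedy (+ "a") "a"))
      (peg-run (peg greedy)))))

(defun my-peg-demo-lookahead ()
  "Same input, but each repetition first checks that more input follows."
  (with-temp-buffer
    (insert "aaa")
    (goto-char (point-min))
    ;; The `if' form is a lookahead: it must match but consumes
    ;; nothing.  The loop therefore stops before the last "a",
    ;; leaving it for the final element of the sequence.
    (with-peg-rules ((guarded (+ (and "a" (if "a"))) "a"))
      (peg-run (peg guarded)))))
```

Under these assumptions the first function should return nil and the second a non-nil value. The same idea, saying explicitly where the repetition must stop, is what a rule for "all words except the last" needs.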
* Re: Make peg.el a built-in library? 2022-11-07 19:46 ` Eric Abrahamsen @ 2022-11-08 6:57 ` Helmut Eller 2022-11-08 8:51 ` Ihor Radchenko 2022-11-10 4:04 ` Richard Stallman 2022-11-08 8:47 ` Ihor Radchenko 2022-11-08 14:01 ` Stefan Monnier 2 siblings, 2 replies; 100+ messages in thread From: Helmut Eller @ 2022-11-08 6:57 UTC (permalink / raw) To: Eric Abrahamsen; +Cc: Ihor Radchenko, emacs-devel On Mon, Nov 07 2022, Eric Abrahamsen wrote: >> Is there any progress merging peg.el to Emacs? >> I do not see any obvious blockers in the discussion, but the merge never >> happened? > > It certainly did lose momentum. I think there were some issues regarding > implementation and API, some open questions, and then whoever would have > needed to take ownership of the ticket and see it through did not do so. > Probably that should have been whoever opened the bug report to begin > with! Isn't Tree-sitter a better alternative to peg.el? I've never used Tree-sitter, but from the few things I read about it, it seems to be more "declarative", more efficient, and actually supported in core. Helmut ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2022-11-08 6:57 ` Helmut Eller @ 2022-11-08 8:51 ` Ihor Radchenko 2022-11-10 4:04 ` Richard Stallman 1 sibling, 0 replies; 100+ messages in thread From: Ihor Radchenko @ 2022-11-08 8:51 UTC (permalink / raw) To: Helmut Eller; +Cc: Eric Abrahamsen, emacs-devel Helmut Eller <eller.helmut@gmail.com> writes: > Isn't Tree-sitter a better alternative to peg.el? I've never used > Tree-sitter, but from the few things I read about it, it seems to be > more "declarative", more efficient, and actually supported in core. Tree-sitter is a massive overkill when you need to parse something just slightly more complex than can be done via regexps. Tree-sitter requires a whole separate .so file with compiled parser + buffer setup. Bovine is a bit easier to use (you can, at least, define grammar in Elisp), but you also need to setup parser in a separate buffer with existing documentation being even more limited compared to peg.el. -- Ihor Radchenko // yantar92, Org mode contributor, Learn more about Org mode at <https://orgmode.org/>. Support Org development at <https://liberapay.com/org-mode>, or support my work at <https://liberapay.com/yantar92> ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2022-11-08 6:57 ` Helmut Eller 2022-11-08 8:51 ` Ihor Radchenko @ 2022-11-10 4:04 ` Richard Stallman 2022-11-10 5:25 ` tomas 1 sibling, 1 reply; 100+ messages in thread From: Richard Stallman @ 2022-11-10 4:04 UTC (permalink / raw) To: Helmut Eller; +Cc: eric, yantar92, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] Would someone like to tell me in 10 lines what job peg.el does? -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2022-11-10 4:04 ` Richard Stallman @ 2022-11-10 5:25 ` tomas 2022-11-10 8:15 ` Eli Zaretskii 2022-11-11 4:36 ` Richard Stallman 0 siblings, 2 replies; 100+ messages in thread From: tomas @ 2022-11-10 5:25 UTC (permalink / raw) To: emacs-devel [-- Attachment #1: Type: text/plain, Size: 1142 bytes --] On Wed, Nov 09, 2022 at 11:04:48PM -0500, Richard Stallman wrote: > [[[ To any NSA and FBI agents reading my email: please consider ]]] > [[[ whether defending the US Constitution against all enemies, ]]] > [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > Would someone like to tell me in 10 lines what job peg.el does? PEG (Parsing Expression Grammars [1]) is a grammar notation which can be automatically translated into a parser (think regular expressions). The notation is actually similar to that of regexps. The main difference is that the "alternative" operator is an "ordered" choice instead of an ambiguous choice. To compensate for this, the notation provides for a (potential) lookahead mechanism, which, in the naive implementation would lead to exponential running time in the worst case. The canonical implementation (nicknamed "packrat") addresses that by memoizing. Basically they can do what a recursive descent parser can, are thus slightly more powerful than regexps. They lead to nice little grammars, but they do take some practice to be useful. Cheers -- t [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2022-11-10 5:25 ` tomas @ 2022-11-10 8:15 ` Eli Zaretskii 2022-11-10 8:29 ` tomas 2022-11-11 4:36 ` Richard Stallman 1 sibling, 1 reply; 100+ messages in thread From: Eli Zaretskii @ 2022-11-10 8:15 UTC (permalink / raw) To: tomas; +Cc: emacs-devel > Date: Thu, 10 Nov 2022 06:25:55 +0100 > From: <tomas@tuxteam.de> > > > Would someone like to tell me in 10 lines what job peg.el does? > > PEG (Parsing Expression Grammars [1]) is a grammar notation which can > be automatically translated into a parser (think regular expressions). The reference [1] was probably meant to be https://en.wikipedia.org/wiki/Parsing_expression_grammar or somesuch > The notation is actually similar to that of regexps. I believe you meant "similar to regular expressions in rx form"? > The main difference > is that the "alternative" operator is an "ordered" choice instead of an > ambiguous choice. To compensate for this, the notation provides for a > (potential) lookahead mechanism, which, in the naive implementation would > lead to exponential running time in the worst case. The canonical > implementation (nicknamed "packrat") addresses that by memoizing. > > Basically they can do what a recursive descent parser can, are thus > slightly more powerful than regexps. They lead to nice little grammars, > but they do take some practice to be useful. I think an example from peg.el will clarify the issue: ;; This file implements the macros `define-peg-rule', `with-peg-rules', and ;; `peg-parse' which parses the current buffer according to a PEG. ;; E.g. we can match integers with: ;; ;; (with-peg-rules ;; ((number sign digit (* digit)) ;; (sign (or "+" "-" "")) ;; (digit [0-9])) ;; (peg-run (peg number))) ;; or ;; (define-peg-rule digit () ;; [0-9]) ;; (peg-parse (number sign digit (* digit)) ;; (sign (or "+" "-" ""))) HTH ^ permalink raw reply [flat|nested] 100+ messages in thread
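[Illustrative sketch, not part of the original message.] To see the integer grammar quoted above at work, it can be run against a temporary buffer. This is a minimal sketch, assuming peg.el is installed, that `peg-run' returns non-nil on a successful parse, and that point is left at the end of the match; the function name `my-peg-demo-integer' and the sample text are made up.

```elisp
(require 'peg)

(defun my-peg-demo-integer ()
  "Return the integer text at the start of a sample buffer, or nil."
  (with-temp-buffer
    (insert "-42 apples")
    (goto-char (point-min))
    (with-peg-rules
        ((number sign digit (* digit))
         (sign (or "+" "-" ""))
         (digit [0-9]))
      ;; The grammar has no actions, so the useful result here is
      ;; where point ends up after a successful parse.
      (when (peg-run (peg number))
        (buffer-substring (point-min) (point))))))
```

With the sample text the rules consume the sign and the two digits and stop at the space, so the function is expected to return the matched prefix of the buffer.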
* Re: Make peg.el a built-in library? 2022-11-10 8:15 ` Eli Zaretskii @ 2022-11-10 8:29 ` tomas 0 siblings, 0 replies; 100+ messages in thread From: tomas @ 2022-11-10 8:29 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel [-- Attachment #1: Type: text/plain, Size: 1738 bytes --] On Thu, Nov 10, 2022 at 10:15:23AM +0200, Eli Zaretskii wrote: > > Date: Thu, 10 Nov 2022 06:25:55 +0100 > > From: <tomas@tuxteam.de> > > > > > Would someone like to tell me in 10 lines what job peg.el does? > > > > PEG (Parsing Expression Grammars [1]) is a grammar notation which can > > be automatically translated into a parser (think regular expressions). > > The reference [1] was probably meant to be > > https://en.wikipedia.org/wiki/Parsing_expression_grammar Thanks. -ENOCOFFEE, possibly :) > or somesuch > > > The notation is actually similar to that of regexps. > > I believe you meant "similar to regular expressions in rx form"? I wasn't particularly thinking of rx. PEGs in general also have a "classical" spelling which looks deceivingly similar to "classic" regular expressions. [...] > I think an example from peg.el will clarify the issue: > > ;; This file implements the macros `define-peg-rule', `with-peg-rules', and > ;; `peg-parse' which parses the current buffer according to a PEG. > ;; E.g. we can match integers with: > ;; > ;; (with-peg-rules > ;; ((number sign digit (* digit)) > ;; (sign (or "+" "-" "")) > ;; (digit [0-9])) > ;; (peg-run (peg number))) > ;; or > ;; (define-peg-rule digit () > ;; [0-9]) > ;; (peg-parse (number sign digit (* digit)) > ;; (sign (or "+" "-" ""))) Thanks for the example. This one stresses the main "selling point" of PEGs, that they can "do" a grown up parser without separating lex and "parse" into two "floors". The downside is that they aren't as "tall" (as the customary LALR/regexp combo). 
Cheers -- t [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2022-11-10 5:25 ` tomas 2022-11-10 8:15 ` Eli Zaretskii @ 2022-11-11 4:36 ` Richard Stallman 1 sibling, 0 replies; 100+ messages in thread From: Richard Stallman @ 2022-11-11 4:36 UTC (permalink / raw) To: tomas; +Cc: emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > Would someone like to tell me in 10 lines what job peg.el does? > PEG (Parsing Expression Grammars [1]) is a grammar notation which can > be automatically translated into a parser (think regular expressions). ... Thanks. -- Dr Richard Stallman (https://stallman.org) Chief GNUisance of the GNU Project (https://gnu.org) Founder, Free Software Foundation (https://fsf.org) Internet Hall-of-Famer (https://internethalloffame.org) ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2022-11-07 19:46 ` Eric Abrahamsen 2022-11-08 6:57 ` Helmut Eller @ 2022-11-08 8:47 ` Ihor Radchenko 2022-11-08 16:18 ` Eric Abrahamsen 2022-11-08 14:01 ` Stefan Monnier 2 siblings, 1 reply; 100+ messages in thread From: Ihor Radchenko @ 2022-11-08 8:47 UTC (permalink / raw) To: Eric Abrahamsen; +Cc: emacs-devel, Stefan Monnier Eric Abrahamsen <eric@ericabrahamsen.net> writes: >> Is there any progress merging peg.el to Emacs? >> I do not see any obvious blockers in the discussion, but the merge never >> happened? > > I will say that I tried to use PEG to resolve some gruesome text-parsing > issues in EBDB very recently, and failed to make it work in the hour or > two I'd allotted to the problem. The file-comment docs are pretty good, > but I think they would need to be expanded in a few crucial ways, > particularly to help those who don't necessarily know how PEGs work. > > Specifically, it is not obvious (to me) the ways in which PEGs (or maybe > just peg.el) are not fully declarative. It doesn't backtrack, and I > suspect it won't ever backtrack or isn't even supposed to, which means > users should be made explicitly aware of the ways in which their rules > can fail, and the ways in which declaration order matters. The comment > for the `or' construct reads: > > Prioritized Choice > > And that's about the only hint you get. As the comment in peg.el states, the definitions are adapted from the original PEG paper. There is even a link to the paper and also to a presentation explaining how peg works. I strongly advise you to read that. Prioritized Choice is explained there. 
> I was trying to parse a > multiword name like > > Eric Edwin Abrahamsen > > into the structure > > (("Eric" "Edwin") "Abrahamsen") > > using rules like > > (plain-name (substring (+ [word])) (* [space])) > (full-name (list (+ plain-name) plain-name) > `(names -- (list (butlast names) (car (last names))))) > > Which always fails to match because (+ plain-name) is greedy and eats up > all the words. It doesn't ever try leaving out the last word in an > attempt to make the rule match. One way is (with-peg-rules ((name (substring (+ [word])) (* [blank])) (given-name name (not (eol))) (last-name name (and (eol))) (full-name (list (+ given-name) last-name) `(names -- (list (butlast names) (car (last names)))))) (peg-run (peg full-name))) A simple-minded non-greedy version would be ambiguous. You must necessarily indicate end of input. A more appropriate non-ambiguous non-greedy statement would involve or (which you admittedly did not understand): (with-peg-rules ((name (substring (+ [word])) (* [blank])) (given-name name) (last-name name (and (eol))) (full-name (list (+ (or last-name given-name)) (and (eol))) `(names -- (list (butlast names) (car (last names)))))) ;;;;;;;;;;;;;;;;;;;;;^^ (peg-run (peg full-name))) > I'm happy to write the docs (should it have its own info manual > section?), if we really think there are no other necessary > fixes/improvements. I find PEG to be a nice addition when regexps do not cut the necessary parsing, while using Bovine or tree-sitter is an overkill. -- Ihor Radchenko // yantar92, Org mode contributor, Learn more about Org mode at <https://orgmode.org/>. Support Org development at <https://liberapay.com/org-mode>, or support my work at <https://liberapay.com/yantar92> ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2022-11-08 8:47 ` Ihor Radchenko @ 2022-11-08 16:18 ` Eric Abrahamsen 2022-11-08 19:08 ` tomas 0 siblings, 1 reply; 100+ messages in thread From: Eric Abrahamsen @ 2022-11-08 16:18 UTC (permalink / raw) To: emacs-devel Ihor Radchenko <yantar92@posteo.net> writes: > Eric Abrahamsen <eric@ericabrahamsen.net> writes: > >>> Is there any progress merging peg.el to Emacs? >>> I do not see any obvious blockers in the discussion, but the merge never >>> happened? >> >> I will say that I tried to use PEG to resolve some gruesome text-parsing >> issues in EBDB very recently, and failed to make it work in the hour or >> two I'd allotted to the problem. The file-comment docs are pretty good, >> but I think they would need to be expanded in a few crucial ways, >> particularly to help those who don't necessarily know how PEGs work. >> >> Specifically, it is not obvious (to me) the ways in which PEGs (or maybe >> just peg.el) are not fully declarative. It doesn't backtrack, and I >> suspect it won't ever backtrack or isn't even supposed to, which means >> users should be made explicitly aware of the ways in which their rules >> can fail, and the ways in which declaration order matters. The comment >> for the `or' construct reads: >> >> Prioritized Choice >> >> And that's about the only hint you get. > > As the comment in peg.el states, the definitions are adapted from the > original PEG paper. There is even a link to the paper and also to a > presentation explaining how peg works. I strongly advise you to read > that. Prioritized Choice is explained there. This is what I was saying in my original message, though: if peg.el is going to go into core, it probably needs more/better docs than code comments and "read this paper". Its likely users will be Elisp library authors like me, who are just trying to free themselves from regexp hell and want a relatively straightforward alternative. 
I used peg.el to prototype search-string parsing in Gnus and everything Just Worked the first time and it was pretty amazing. In my later example below everything did not Just Work, but I think with some more hand-holdy documentation it would have. >> I was trying to parse a >> multiword name like >> >> Eric Edwin Abrahamsen >> >> into the structure >> >> (("Eric" "Edwin") "Abrahamsen") >> >> using rules like >> >> (plain-name (substring (+ [word])) (* [space])) >> (full-name (list (+ plain-name) plain-name) >> `(names -- (list (butlast names) (car (last names))))) >> >> Which always fails to match because (+ plain-name) is greedy and eats up >> all the words. It doesn't ever try leaving out the last word in an >> attempt to make the rule match. > > One way is > > (with-peg-rules > ((name (substring (+ [word])) (* [blank])) > (given-name name (not (eol))) > (last-name name (and (eol))) > (full-name (list (+ given-name) last-name) `(names -- (list (butlast names) (car (last names)))))) > (peg-run (peg full-name))) > > A simple-minded non-greedy version would be ambiguous. You must > necessarily indicate end of input. > > A more appropriate non-ambiguous non-greedy statement would involve or > (which you admittedly did not understand): > > (with-peg-rules > ((name (substring (+ [word])) (* [blank])) > (given-name name) > (last-name name (and (eol))) > (full-name (list (+ (or last-name given-name)) (and (eol))) `(names -- (list (butlast names) (car (last names)))))) > ;;;;;;;;;;;;;;;;;;;;;^^ > (peg-run (peg full-name))) Thanks! This is very helpful to my understanding. In this particular case I'm putting strings in a temporary buffer, so signals like (eol) or more likely (eob) are present and reliable. >> I'm happy to write the docs (should it have its own info manual >> section?), if we really think there are no other necessary >> fixes/improvements. 
> > I find PEG to be a nice addition when regexps do not cut the necessary > parsing, while using Bovine or tree-sitter is an overkill. I've never tried tree-sitter, but I have tried and failed to make Bovine do this sort of thing more than once over the years. I also agree that a middle ground is needed. Eric ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2022-11-08 16:18 ` Eric Abrahamsen @ 2022-11-08 19:08 ` tomas 2022-11-08 19:42 ` Eric Abrahamsen 0 siblings, 1 reply; 100+ messages in thread From: tomas @ 2022-11-08 19:08 UTC (permalink / raw) To: emacs-devel [-- Attachment #1: Type: text/plain, Size: 1398 bytes --] On Tue, Nov 08, 2022 at 08:18:15AM -0800, Eric Abrahamsen wrote: > Ihor Radchenko <yantar92@posteo.net> writes: > > > Eric Abrahamsen <eric@ericabrahamsen.net> writes: > > > >>> Is there any progress merging peg.el to Emacs? > >>> I do not see any obvious blockers in the discussion, but the merge never > >>> happened? [...] > > As the comment in peg.el states, the definitions are adapted from the > > original PEG paper [...] > This is what I was saying in my original message, though: if peg.el is > going to go into core, it probably needs more/better docs than code > comments and "read this paper". Its likely users will be Elisp library > authors like me, who are just trying to free themselves from regexp hell > and want a relatively straightforward alternative. Yes. Coming from regexp they are deceivingly similar but frustratingly different. The best way I found to wrap my head around them is that they are a fancy notation for a recursive descent parser. Thus slightly more powerful than regexps, but slightly less than a full YACC (i.e. LALR or thereabouts). What is attractive about them is that one can do "full" parsers (as long as your grammar is roughly LL(k)) without having to build two storey buildings. I guess it takes some practice, though (I haven't). I think comparing them to treesitter is a category error. Cheers -- t [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2022-11-08 19:08 ` tomas @ 2022-11-08 19:42 ` Eric Abrahamsen 2022-11-16 4:27 ` [PATCH] " Eric Abrahamsen 0 siblings, 1 reply; 100+ messages in thread From: Eric Abrahamsen @ 2022-11-08 19:42 UTC (permalink / raw) To: emacs-devel <tomas@tuxteam.de> writes: > On Tue, Nov 08, 2022 at 08:18:15AM -0800, Eric Abrahamsen wrote: >> Ihor Radchenko <yantar92@posteo.net> writes: >> >> > Eric Abrahamsen <eric@ericabrahamsen.net> writes: >> > >> >>> Is there any progress merging peg.el to Emacs? >> >>> I do not see any obvious blockers in the discussion, but the merge never >> >>> happened? > > [...] > >> > As the comment in peg.el states, the definitions are adapted from the >> > original PEG paper [...] > >> This is what I was saying in my original message, though: if peg.el is >> going to go into core, it probably needs more/better docs than code >> comments and "read this paper". Its likely users will be Elisp library >> authors like me, who are just trying to free themselves from regexp hell >> and want a relatively straightforward alternative. > > Yes. Coming from regexp they are deceivingly similar but frustratingly > different. > > The best way I found to wrap my head around them is that they are a > fancy notation for a recursive descent parser. Thus slightly more > powerful than regexps, but slightly less than a full YACC (i.e. LALR > or thereabouts). > > What is attractive about them is that one can do "full" parsers > (as long as your grammar is roughly LL(k)) without having to build > two storey buildings. I guess it takes some practice, though (I > haven't). > > I think comparing them to treesitter is a category error. Okay, this is all sounding good. I'm going to read the paper, try to get my head around all this, and write some docs for peg.el. ^ permalink raw reply [flat|nested] 100+ messages in thread
* [PATCH] Re: Make peg.el a built-in library? 2022-11-08 19:42 ` Eric Abrahamsen @ 2022-11-16 4:27 ` Eric Abrahamsen 2022-11-16 5:07 ` tomas ` (2 more replies) 0 siblings, 3 replies; 100+ messages in thread From: Eric Abrahamsen @ 2022-11-16 4:27 UTC (permalink / raw) To: emacs-devel [-- Attachment #1: Type: text/plain, Size: 2085 bytes --] Eric Abrahamsen <eric@ericabrahamsen.net> writes: > <tomas@tuxteam.de> writes: > >> On Tue, Nov 08, 2022 at 08:18:15AM -0800, Eric Abrahamsen wrote: >>> Ihor Radchenko <yantar92@posteo.net> writes: >>> >>> > Eric Abrahamsen <eric@ericabrahamsen.net> writes: >>> > >>> >>> Is there any progress merging peg.el to Emacs? >>> >>> I do not see any obvious blockers in the discussion, but the merge never >>> >>> happened? >> >> [...] >> >>> > As the comment in peg.el states, the definitions are adapted from the >>> > original PEG paper [...] >> >>> This is what I was saying in my original message, though: if peg.el is >>> going to go into core, it probably needs more/better docs than code >>> comments and "read this paper". Its likely users will be Elisp library >>> authors like me, who are just trying to free themselves from regexp hell >>> and want a relatively straightforward alternative. >> >> Yes. Coming from regexp they are deceivingly similar but frustratingly >> different. >> >> The best way I found to wrap my head around them is that they are a >> fancy notation for a recursive descent parser. Thus slightly more >> powerful than regexps, but slightly less than a full YACC (i.e. LALR >> or thereabouts). >> >> What is attractive about them is that one can do "full" parsers >> (as long as your grammar is roughly LL(k)) without having to build >> two storey buildings. I guess it takes some practice, though (I >> haven't). >> >> I think comparing them to treesitter is a category error. > > Okay, this is all sounding good. I'm going to read the paper, try to get > my head around all this, and write some docs for peg.el. 
Okay, here's a first stab. I read the paper, and understood about half of it, which seemed like enough. It was interesting to see that the paper explicitly calls out the exact greedy-matching behavior I'd encountered. I'm sure I've got some of the conventions wrong, here, and it's unfortunate that there's already a manual node called "Expression Parsing", but I don't know what to call this except "Expression Parsing Grammars"... Eric [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: pexmanual.diff --] [-- Type: text/x-patch, Size: 10722 bytes --] diff --git a/doc/lispref/elisp.texi b/doc/lispref/elisp.texi index a3d1d80408..6440728541 100644 --- a/doc/lispref/elisp.texi +++ b/doc/lispref/elisp.texi @@ -222,6 +222,7 @@ Top * Non-ASCII Characters:: Non-ASCII text in buffers and strings. * Searching and Matching:: Searching buffers for strings or regexps. * Syntax Tables:: The syntax table controls word and list parsing. +* Parsing Expression Grammars:: Parsing buffer text. * Abbrevs:: How Abbrev mode works, and its data structures. * Threads:: Concurrency in Emacs Lisp. @@ -1703,6 +1704,7 @@ Top @include searching.texi @include syntax.texi +@include peg.texi @include abbrevs.texi @include threads.texi @include processes.texi diff --git a/doc/lispref/peg.texi b/doc/lispref/peg.texi new file mode 100644 index 0000000000..ec3962d7bf --- /dev/null +++ b/doc/lispref/peg.texi @@ -0,0 +1,314 @@ +@c -*-texinfo-*- +@c This is part of the GNU Emacs Lisp Reference Manual. +@c Copyright (C) 1990--1995, 1998--1999, 2001--2022 Free Software +@c Foundation, Inc. +@c See the file elisp.texi for copying conditions. +@node Parsing Expression Grammars +@chapter Parsing Expression Grammars +@cindex text parsing + + Emacs Lisp provides several tools for parsing and matching text, from +regular expressions (@pxref{Regular Expressions}) to full @acronym{LL} +grammar parsers (@pxref{Top,, Bovine parser development, bovine}). 
+@dfn{Parsing Expression Grammars} (@acronym{PEG}) are another approach +to text parsing that offer more structure and composibility than +regular expressions, but less complexity than context-free grammars. + +A @acronym{PEG} parser is defined as a list of named rules, each of +which match text patterns, and/or contain references to other rules. +Parsing is initiated with the function @code{peg-run} or the macro +@code{peg-parse}, and parses text after point in the current buffer, +using a given set of rules. + +The definition of each rule is referred to as a @dfn{parsing +expression} (@acronym{PEX}), and can consist of a literal string, a +regexp-like character range or set, a peg-specific construct +resembling an elisp function call, a reference to another rule, or a +combination of any of these. A grammar is expressed as a set of rules +in which one rule is typically treated as a ``top-level'' or +``entry-point'' rule. For instance: + +@example +@group +((number sign digit (* digit)) + (sign (or "+" "-" "")) + (digit [0-9])) +@end group +@end example + +The above grammar could be used directly in a call to +@code{peg-parse}, in which the first rule is considered the +``entry-point'' rule: + +@example +(peg-parse + ((number sign digit (* digit)) + (sign (or "+" "-" "")) + (digit [0-9]))) +@end example + +Or set as the value of a variable, and the variable used in a +combination of calls to @code{with-peg-rules} and @code{peg-run}, +where the ``entry-point'' rule is given explicitly: + +@example +(defvar number-grammar + '((number sign digit (* digit)) + (sign (or "+" "-" "")) + (digit [0-9]))) + +(with-peg-rules number-grammar + (peg-run (peg number))) +@end example + +By default, calls to @code{peg-run} or @code{peg-parse} produce no +output: parsing simply moves point. In order to return or otherwise +act upon parsed strings, rules can include @dfn{actions}, see +@xref{Parsing Actions} for more information. 
+ +Individual rules can also be defined using a more @code{defun}-like +syntax, using the macro @code{define-peg-rule}: + +@example +(define-peg-rule digit () + [0-9]) +@end example + +This allows the rule to be referred to by name within calls to +@code{peg-run} or @code{peg-parse} elsewhere, and also allows the use +of function arguments in the rule body. + +@node PEX Definitions +@section PEX Definitions + +Parsing expressions can be defined using the following syntax: + +@table @code +@item (and E1 E2 ...) +A sequence of PEXs that must all be matched. The @code{and} form is +optional and implicit. + +@item (or E1 E2 ...) +Prioritized choices, meaning that, as in Elisp, the choices are tried +in order, and the first successful match is used. + +@item (any) +Matches any single character, as the regexp ``.''. + +@item "abc" +A literal string. + +@item (char C) +A single character, as an Elisp character literal. + +@item (* E) +Zero or more of an expression, as the regexp ``*''. + +@item (+ E) +One or more of an expression, as the regexp ``+''. + +@item (opt E) +Zero or one of an expression, as the regexp ``?''. + +@item SYMBOL +A symbol representing a previously-define PEG rule. + +@item (range A B) +The character range between A and B, as the regexp ``[A-B]''. + +@item [a-b "+*" ?x] +A character set, including ranges, literal characters, or strings of +characters. + +@item [ascii cntrl] +A list of named character classes (see below). + +@item (syntax-class NAME) +A single syntax class. + +@item (null) +The empty string. +@end table + +The following expressions are used as anchors -- they do not move +point. + +@table @code +@item (bob) +Beginning of buffer. + +@item (eob) +End of buffer. + +@item (bol) +Beginning of line. + +@item (eol) +End of line. + +@item (bow) +Beginning of word. + +@item (eow) +End of word. + +@item (bos) +Beginning of symbol. + +@item (eos) +End of symbol. 
+@end table + +The following expressions are used as booleans, to constrain matching +(@pxref{Writing PEG Rules}), and do not move point. + +@table @code +@item (not E) +@item (if E) +@item (guard EXP) +@end table + +@vindex peg-char-classes +Named character classes include the following: + +@itemize +@item ascii +@item alnum +@item alpha +@item blank +@item cntrl +@item digit +@item graph +@item lower +@item multibyte +@item nonascii +@item print +@item punct +@item space +@item unibyte +@item upper +@item word +@item xdigit +@end itemize + +@node Parsing Actions +@section Parsing Actions + +By default the process of parsing simply moves point in the current +buffer, ultimately returning @code{t} if the parsing succeeds, and +@code{nil} if it doesn't. It's also possible to define ``actions'' +that can run arbitrary Elisp at certain points during parsing. These +actions can affect something called the @dfn{parsing stack}: a list of +values built up during the course of parsing. If the stack is +non-@code{nil} at the end of parsing, it is returned as the final +value of the parsing process. + +Actions can be added anywhere in the definition of a rule. They are +distinguished from parsing expressions by an initial backquote +(@samp{`}), followed by a parenthetical form that must contain a pair +of hyphens (@samp{--}) somewhere within it. Symbols to the left of +the hyphens are bound to values popped from the stack (they are +somewhat analogous to the argument list in a lambda). Values produced +by code to the right are pushed to the stack (analogous to the return +value of the lambda). 
For instance, the previous grammar can be +augmented with actions to return the parsed number as an actual +integer: + +@example +(with-peg-rules ((number sign digit (* digit + `(a b -- (+ (* a 10) b))) + `(sign val -- (* sign val))) + (sign (or (and "+" `(-- 1)) + (and "-" `(-- -1)) + (and "" `(-- 1)))) + (digit [0-9] `(-- (- (char-before) ?0)))) + (peg-run (peg number))) +@end example + +There must be values on the stack before they can be popped and +returned. An action with no left-hand terms will only push values to +the stack; an action with no right-hand terms will consume (and +discard) values from the stack. + +To return the string matched by a PEX (instead of simply moving point +over it), a rule like this can be used: + +@example +(one-word + `(-- (point)) + (+ [word]) + `(start -- (buffer-substring start (point)))) +@end example + +The first action pushes the initial value of point to the stack. The +intervening PEX moves point over the next word. The second action pops +the previous value from the stack (binding it to the variable +@code{start}), and uses that value to extract a substring from the +buffer and push it to the stack. This pattern is so common that +peg.el provides a shorthand function that does exactly the above, +along with a few other shorthands for common scenarios: + +@table @code +@item (substring E) +Match PEX E and push the matched string to the stack. + +@item (region E) +Match E and push the start and end positions of the matched region to +the stack. + +@item (replace E "repl") +Match E and replaced the matched region with the string "repl". + +@item (list E) +Match E, collect all values produced by E (and its sub-expressions) +into a list, and push that list to the stack. +@end table + +It is up to the grammar author to keep track of which rules and +sub-rules push values to the stack, and the state of the stack at any +given point in the parsing. 
If an action pops values from an empty +stack, the symbols will be bound to @code{nil}. + +@node Writing PEG Rules +@section Writing PEG Rules + +Something to be aware of when writing PEG rules is that they are +greedy. Rules which consume a variable amount of text will always +consume the maximum amount possible, even if that causes a rule that +might otherwise have matched to fail later on. For instance, this +rule will never succeed: + +@example +(forest (+ "tree" (* [blank])) "tree" (eol)) +@end example + +The @acronym{PEX} @code{(+ "tree" (* [blank]))} will consume all +repetitions of the word ``tree'', leaving none to match the final +@code{"tree"}. + +In these situations, the desired result can be obtained by using +predicates and guards -- namely the @code{not}, @code{if} and +@code{guard} expressions -- to restrict behavior. For instance: + +@example +(forest (+ "tree" (* [blank]) (not (eol))) "tree" (eol)) +@end example + +The @code{if} and @code{not} operators accept a parsing expression and +interpret it as a boolean, without moving point. The contents of a +@code{guard} operator are evaluated as regular Elisp (not a +@acronym{PEX}) and should return a boolean value. A @code{nil} value +causes the match to fail. + +Another potentially unexpected behavior is that parsing will move +point as far as possible, even if the parsing ultimately fails. This +rule: + +@example +(end-game "game" (eob)) +@end example + +when run in a buffer containing the text ``game over'' after point, +will move point to just after ``game'' and halt parsing, returning +@code{nil}. Successful parsing will always return @code{t}, or the +contents of the parsing stack.
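The greedy-matching behavior described in the "Writing PEG Rules" section can be checked in a scratch buffer. What follows is only a sketch, not part of the patch: it assumes peg.el is installed (from GNU ELPA at the time of this thread) and that the code is evaluated with lexical-binding enabled. The helper name `my/peg-forest-demo' and the sample text are made up for illustration. The second rule places the `(not (eol))' predicate inside the repetition, so that each iteration is rejected once it reaches the end of the line.

```elisp
;;; -*- lexical-binding: t -*-
(require 'peg)

(defun my/peg-forest-demo (text)
  "Hypothetical helper: parse TEXT with two variants of the `forest' rule.
Return a list (GREEDY GUARDED) holding the two `peg-run' results."
  (list
   ;; Greedy variant: the `+' loop consumes every "tree", so the
   ;; trailing "tree" has nothing left to match.
   (with-temp-buffer
     (insert text)
     (goto-char (point-min))
     (with-peg-rules ((forest (+ "tree" (* [blank])) "tree" (eol)))
       (peg-run (peg forest))))
   ;; Guarded variant: an iteration that ends at end of line fails,
   ;; which leaves the last "tree" for the rest of the rule.
   (with-temp-buffer
     (insert text)
     (goto-char (point-min))
     (with-peg-rules ((forest (+ "tree" (* [blank]) (not (eol)))
                              "tree" (eol)))
       (peg-run (peg forest))))))

(my/peg-forest-demo "tree tree tree")
```

If the description in the patch holds, the first element of the result is nil (the greedy rule fails) and the second is non-nil (the guarded rule succeeds).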
* Re: [PATCH] Re: Make peg.el a built-in library?
From: tomas @ 2022-11-16 5:07 UTC
To: emacs-devel

On Tue, Nov 15, 2022 at 08:27:56PM -0800, Eric Abrahamsen wrote:

[...]

> Okay, here's a first stab. I read the paper, and understood about half
> of it, which seemed like enough. It was interesting to see that the
> paper explicitly calls out the exact greedy-matching behavior I'd
> encountered.

Half of it sounds like double as much as I understood ;-)

Seriously: thanks for your work. And thanks to the original authors!

> I'm sure I've got some of the conventions wrong, here, and it's
> unfortunate that there's already a manual node called "Expression
> Parsing", but I don't know what to call this except "Expression Parsing
> Grammars"...

Hm. Perhaps "Parsing Expression Grammars" might be less confusing,
since it is the moniker which has established itself. Admittedly,
it grammars a bit awkwardly, but people having seen it once will
rather recognize that one.

Cheers
--
t
* Re: [PATCH] Re: Make peg.el a built-in library?
From: Eric Abrahamsen @ 2022-11-16 5:39 UTC
To: emacs-devel

<tomas@tuxteam.de> writes:

> On Tue, Nov 15, 2022 at 08:27:56PM -0800, Eric Abrahamsen wrote:
>
> [...]
>
>> Okay, here's a first stab. I read the paper, and understood about half
>> of it, which seemed like enough. It was interesting to see that the
>> paper explicitly calls out the exact greedy-matching behavior I'd
>> encountered.
>
> Half of it sounds like double as much as I understood ;-)

The second half is all mathematical notation. I only understand
mathematics when it's written in plain English :)

> Seriously: thanks for your work. And thanks to the original authors!
>
>> I'm sure I've got some of the conventions wrong, here, and it's
>> unfortunate that there's already a manual node called "Expression
>> Parsing", but I don't know what to call this except "Expression Parsing
>> Grammars"...
>
> Hm. Perhaps "Parsing Expression Grammars" might be less confusing,
> since it is the moniker which has established itself. Admittedly,
> it grammars a bit awkwardly, but people having seen it once will
> rather recognize that one.

I foolishly reversed both of those: the new node *is* called "Parsing
Expression Grammars", and the existing node is called "Parsing
Expressions". Same issue, just reversed...
* Re: [PATCH] Re: Make peg.el a built-in library?
From: tomas @ 2022-11-16 15:53 UTC
To: emacs-devel

On Tue, Nov 15, 2022 at 09:39:25PM -0800, Eric Abrahamsen wrote:
> <tomas@tuxteam.de> writes:

[...]

> > Hm. Perhaps "Parsing Expression Grammars" might be less confusing,

[...]

> I foolishly reversed both of those: the new node *is* called "Parsing
> Expression Grammars", and the existing node is called "Parsing
> Expressions". Same issue, just reversed...

Kind of makes sense :)

Thanks
--
t
* Re: [PATCH] Re: Make peg.el a built-in library?
From: Ihor Radchenko @ 2022-11-16 6:24 UTC
To: Eric Abrahamsen; +Cc: emacs-devel

Eric Abrahamsen <eric@ericabrahamsen.net> writes:

> Okay, here's a first stab. I read the paper, and understood about half
> of it, which seemed like enough. It was interesting to see that the
> paper explicitly calls out the exact greedy-matching behavior I'd
> encountered.

Thanks!

> + Emacs Lisp provide several tools for parsing and matching text, from

provides

> +regular expressions (@pxref{Regular Expressions}) to full @acronym{LL}
> +grammar parsers (@pxref{Top,, Bovine parser development, bovine}).
> +@dfn{Parsing Expression Grammars} (@acronym{PEG}) are another approach
> +to text parsing that offer more structure and composibility than
> +regular expressions, but less complexity than context-free grammars.
> +
> +A @acronym{PEG} parser is defined as a list of named rules, each of
> +which match text patterns, and/or contain references to other rules.
> +Parsing is initiated with the function @code{peg-run} or the macro
> +@code{peg-parse}, and parses text after point in the current buffer,
> +using a given set of rules.
> +
> +The definition of each rule is referred to as a @dfn{parsing
> +expression} (@acronym{PEX}), and can consist of a literal string, a
> +regexp-like character range or set, a peg-specific construct
> +resembling an elisp function call, a reference to another rule, or a
> +combination of any of these. A grammar is expressed as a set of rules
> +in which one rule is typically treated as a ``top-level'' or
> +``entry-point'' rule. For instance:
> +
> +@example
> +@group
> +((number sign digit (* digit))
> + (sign (or "+" "-" ""))
> + (digit [0-9]))
> +@end group
> +@end example
> +
> +The above grammar could be used directly in a call to
> +@code{peg-parse}, in which the first rule is considered the
> +``entry-point'' rule:
> +
> +@example
> +(peg-parse
> +  ((number sign digit (* digit))
> +   (sign (or "+" "-" ""))
> +   (digit [0-9])))
> +@end example
> +
> +Or set as the value of a variable, and the variable used in a
> +combination of calls to @code{with-peg-rules} and @code{peg-run},
> +where the ``entry-point'' rule is given explicitly:
> +
> +@example
> +(defvar number-grammar
> +  '((number sign digit (* digit))
> +    (sign (or "+" "-" ""))
> +    (digit [0-9])))
> +
> +(with-peg-rules number-grammar
> +  (peg-run (peg number)))
> +@end example
> +
> +By default, calls to @code{peg-run} or @code{peg-parse} produce no
> +output: parsing simply moves point. In order to return or otherwise
> +act upon parsed strings, rules can include @dfn{actions}, see
> +@xref{Parsing Actions} for more information.
> +
> +Individual rules can also be defined using a more @code{defun}-like
> +syntax, using the macro @code{define-peg-rule}:
> +
> +@example
> +(define-peg-rule digit ()
> +  [0-9])
> +@end example
> +
> +This allows the rule to be referred to by name within calls to
> +@code{peg-run} or @code{peg-parse} elsewhere, and also allows the use
> +of function arguments in the rule body.
> +
> +@node PEX Definitions
> +@section PEX Definitions
> +
> +Parsing expressions can be defined using the following syntax:
> +
> +@table @code
> +@item (and E1 E2 ...)
> +A sequence of PEXs that must all be matched. The @code{and} form is
> +optional and implicit.
> +
> +@item (or E1 E2 ...)
> +Prioritized choices, meaning that, as in Elisp, the choices are tried
> +in order, and the first successful match is used.

It is worth highlighting that it is different from CFGs.

> +@item (* E)
> +Zero or more of an expression, as the regexp ``*''.
> +
> +@item (+ E)
> +One or more of an expression, as the regexp ``+''.

It is worth highlighting the greedy part here and referring to &A and
!A.

> +@item SYMBOL
> +A symbol representing a previously-define PEG rule.

defined

> +By default the process of parsing simply moves point in the current
> +buffer, ultimately returning @code{t} if the parsing succeeds, and
> +@code{nil} if it doesn't. It's also possible to define ``actions''
> +that can run arbitrary Elisp at certain points during parsing. These
> +actions can affect something called the @dfn{parsing stack}: a list of
> +values built up during the course of parsing. If the stack is
> +non-@code{nil} at the end of parsing, it is returned as the final
> +value of the parsing process.

Actions are only run when the expression matches; with point moved after
the match, right? What about &A and !A?

> +There must be values on the stack before they can be popped and
> +returned.

What if there is just one value in the stack while the action required two?

> +@item (list E)
> +Match E, collect all values produced by E (and its sub-expressions)
> +into a list, and push that list to the stack.
> +@end table

This one is not very clear. Does it imply that E is recursively wrapped
into substring?

> +It is up to the grammar author to keep track of which rules and
> +sub-rules push values to the stack, and the state of the stack at any
> +given point in the parsing. If an action pops values from an empty
> +stack, the symbols will be bound to @code{nil}.

The part about popping out of empty stack looks out of scope. Maybe move
it to earlier discussion of variable bindings in actions?

--
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>
* Re: [PATCH] Re: Make peg.el a built-in library?
From: Eric Abrahamsen @ 2022-11-16 18:15 UTC
To: emacs-devel

Ihor Radchenko <yantar92@posteo.net> writes:

> Eric Abrahamsen <eric@ericabrahamsen.net> writes:
>
>> Okay, here's a first stab. I read the paper, and understood about half
>> of it, which seemed like enough. It was interesting to see that the
>> paper explicitly calls out the exact greedy-matching behavior I'd
>> encountered.
>
> Thanks!

And thanks for the review! I'll add in all your simpler notes; more
responses below.

>> + Emacs Lisp provide several tools for parsing and matching text, from
>
> provides
>
>> +regular expressions (@pxref{Regular Expressions}) to full @acronym{LL}
>> +grammar parsers (@pxref{Top,, Bovine parser development, bovine}).
>> +@dfn{Parsing Expression Grammars} (@acronym{PEG}) are another approach
>> +to text parsing that offer more structure and composibility than
>> +regular expressions, but less complexity than context-free grammars.
>> +
>> +A @acronym{PEG} parser is defined as a list of named rules, each of
>> +which match text patterns, and/or contain references to other rules.
>> +Parsing is initiated with the function @code{peg-run} or the macro
>> +@code{peg-parse}, and parses text after point in the current buffer,
>> +using a given set of rules.
>> +
>> +The definition of each rule is referred to as a @dfn{parsing
>> +expression} (@acronym{PEX}), and can consist of a literal string, a
>> +regexp-like character range or set, a peg-specific construct
>> +resembling an elisp function call, a reference to another rule, or a
>> +combination of any of these. A grammar is expressed as a set of rules
>> +in which one rule is typically treated as a ``top-level'' or
>> +``entry-point'' rule. For instance:
>> +
>> +@example
>> +@group
>> +((number sign digit (* digit))
>> + (sign (or "+" "-" ""))
>> + (digit [0-9]))
>> +@end group
>> +@end example
>> +
>> +The above grammar could be used directly in a call to
>> +@code{peg-parse}, in which the first rule is considered the
>> +``entry-point'' rule:
>> +
>> +@example
>> +(peg-parse
>> +  ((number sign digit (* digit))
>> +   (sign (or "+" "-" ""))
>> +   (digit [0-9])))
>> +@end example
>> +
>> +Or set as the value of a variable, and the variable used in a
>> +combination of calls to @code{with-peg-rules} and @code{peg-run},
>> +where the ``entry-point'' rule is given explicitly:
>> +
>> +@example
>> +(defvar number-grammar
>> +  '((number sign digit (* digit))
>> +    (sign (or "+" "-" ""))
>> +    (digit [0-9])))
>> +
>> +(with-peg-rules number-grammar
>> +  (peg-run (peg number)))
>> +@end example
>> +
>> +By default, calls to @code{peg-run} or @code{peg-parse} produce no
>> +output: parsing simply moves point. In order to return or otherwise
>> +act upon parsed strings, rules can include @dfn{actions}, see
>> +@xref{Parsing Actions} for more information.
>> +
>> +Individual rules can also be defined using a more @code{defun}-like
>> +syntax, using the macro @code{define-peg-rule}:
>> +
>> +@example
>> +(define-peg-rule digit ()
>> +  [0-9])
>> +@end example
>> +
>> +This allows the rule to be referred to by name within calls to
>> +@code{peg-run} or @code{peg-parse} elsewhere, and also allows the use
>> +of function arguments in the rule body.
>> +
>> +@node PEX Definitions
>> +@section PEX Definitions
>> +
>> +Parsing expressions can be defined using the following syntax:
>> +
>> +@table @code
>> +@item (and E1 E2 ...)
>> +A sequence of PEXs that must all be matched. The @code{and} form is
>> +optional and implicit.
>> +
>> +@item (or E1 E2 ...)
>> +Prioritized choices, meaning that, as in Elisp, the choices are tried
>> +in order, and the first successful match is used.
>
> It is worth highlighting that it is different from CFGs.
>
>> +@item (* E)
>> +Zero or more of an expression, as the regexp ``*''.
>> +
>> +@item (+ E)
>> +One or more of an expression, as the regexp ``+''.
>
> It is worth highlighting the greedy part here and referring to &A and
> !A.

I don't believe there is separate syntax for &A and !A -- those are
written (if A) and (not A).

>> +@item SYMBOL
>> +A symbol representing a previously-define PEG rule.
>
> defined
>
>> +By default the process of parsing simply moves point in the current
>> +buffer, ultimately returning @code{t} if the parsing succeeds, and
>> +@code{nil} if it doesn't. It's also possible to define ``actions''
>> +that can run arbitrary Elisp at certain points during parsing. These
>> +actions can affect something called the @dfn{parsing stack}: a list of
>> +values built up during the course of parsing. If the stack is
>> +non-@code{nil} at the end of parsing, it is returned as the final
>> +value of the parsing process.
>
> Actions are only run when the expression matches; with point moved after
> the match, right? What about &A and !A?

That's right, actions only run if the parsing succeeds, and they run all
at once at the end. Maybe I can move all discussions of parsing success
vs failure into one place.

>> +There must be values on the stack before they can be popped and
>> +returned.
>
> What if there is just one value in the stack while the action required two?
>
>> +@item (list E)
>> +Match E, collect all values produced by E (and its sub-expressions)
>> +into a list, and push that list to the stack.
>> +@end table
>
> This one is not very clear. Does it imply that E is recursively wrapped
> into substring?

It's not very clear because I don't fully understand it! It does not
implicitly create any value-returning calls (such as `substring'). I
think what it means is that, by default, values returned by actions are
all spliced into a single flat list. If you need some of those values to
be returned in a sub-list, you can use this form.

It's a bit tricky to use because the E in (list E) could potentially
descend many levels and branch out into any number of sub-expressions,
so you need to have a clear mental model of what values might ultimately
be coming out of E. I guess that's also true for the whole thing,
though.

>> +It is up to the grammar author to keep track of which rules and
>> +sub-rules push values to the stack, and the state of the stack at any
>> +given point in the parsing. If an action pops values from an empty
>> +stack, the symbols will be bound to @code{nil}.
>
> The part about popping out of empty stack looks out of scope. Maybe move
> it to earlier discussion of variable bindings in actions?

Okay, I'll remove this, and just add a shorter note up above about empty
stacks.

Thanks again,
Eric
* Re: [PATCH] Re: Make peg.el a built-in library?
From: Ihor Radchenko @ 2022-11-17 12:21 UTC
To: Eric Abrahamsen; +Cc: emacs-devel

Eric Abrahamsen <eric@ericabrahamsen.net> writes:

>>> +@item (* E)
>>> +Zero or more of an expression, as the regexp ``*''.
>>> +
>>> +@item (+ E)
>>> +One or more of an expression, as the regexp ``+''.
>>
>> It is worth highlighting the greedy part here and referring to &A and
>> !A.
>
> I don't believe there is separate syntax for &A and !A -- those are
> written (if A) and (not A).

Indeed. I just felt lazy to write (if A) and (not A) and wrote &A and !A :)

The comment is suggesting to add reference to the (if A)/(not A) and the
"Writing PEGs" section.

>> Actions are only run when the expression matches; with point moved after
>> the match, right? What about &A and !A?
>
> That's right, actions only run if the parsing succeeds, and they run all
> at once at the end. Maybe I can move all discussions of parsing success
> vs failure into one place.

I think that there might be confusion here because people are used to
full success/full failure but not to partial success.

And (if A) feels even more confusing because it does not actually move
point and does not advance the parser. So, it is unclear what success
means and what is the buffer/stack context when action is executed.

>>> +@item (list E)
>>> +Match E, collect all values produced by E (and its sub-expressions)
>>> +into a list, and push that list to the stack.
>>> +@end table
>>
>> This one is not very clear. Does it imply that E is recursively wrapped
>> into substring?
>
> It's not very clear because I don't fully understand it! It does not
> implicitly create any value-returning calls (such as `substring'). I
> think what it means is that, by default, values returned by actions are
> all spliced into a single flat list. If you need some of those values to
> be returned in a sub-list, you can use this form.
>
> It's a bit tricky to use because the E in (list E) could potentially
> descend many levels and branch out into any number of sub-expressions,
> so you need to have a clear mental model of what values might ultimately
> be coming out of E. I guess that's also true for the whole thing,
> though.

I also don't fully understand this, but I tried to play around with the
following:

(with-peg-rules
    ((name (substring (+ [word])) (* [blank]))
     (given-name name (not (eol)))
     (last-name (list name) (if (eol)))
     (full-name (list (+ given-name)) last-name))
  (peg-run (peg full-name)))
;; <point>Eric Edwin Abrahamsen
;; => (("Abrahamsen") ("Eric" "Edwin"))

;; Suggested stack states:
;; 1. nil
;; 2. Match Eric via given-name: ("Eric")
;; 3. Match Edwin via given-name: ("Edwin" "Eric")
;; 4. No more match for given-name. List operation: (("Eric" "Edwin"))
;; 5. Match Abrahamsen via last-name. ("Abrahamsen" ("Eric" "Edwin"))
;; 6. Done with last-name. List operation: (("Abrahamsen") ("Eric" "Edwin"))
;; 7. done

So, one may think that the stack values coming from E in (list E) are
simply reversed, wrapped into a list, and pushed back into the stack.
Kind of group operation.

--
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>
* Re: [PATCH] Re: Make peg.el a built-in library?
From: Eric Abrahamsen @ 2022-11-27 1:46 UTC
To: Ihor Radchenko; +Cc: emacs-devel

Ihor Radchenko <yantar92@posteo.net> writes:

> Eric Abrahamsen <eric@ericabrahamsen.net> writes:
>
>>>> +@item (* E)
>>>> +Zero or more of an expression, as the regexp ``*''.
>>>> +
>>>> +@item (+ E)
>>>> +One or more of an expression, as the regexp ``+''.
>>>
>>> It is worth highlighting the greedy part here and referring to &A and
>>> !A.
>>
>> I don't believe there is separate syntax for &A and !A -- those are
>> written (if A) and (not A).
>
> Indeed. I just felt lazy to write (if A) and (not A) and wrote &A and !A :)
>
> The comment is suggesting to add reference to the (if A)/(not A) and the
> "Writing PEGs" section.
>
>>> Actions are only run when the expression matches; with point moved after
>>> the match, right? What about &A and !A?
>>
>> That's right, actions only run if the parsing succeeds, and they run all
>> at once at the end. Maybe I can move all discussions of parsing success
>> vs failure into one place.
>
> I think that there might be confusion here because people are used to
> full success/full failure but not to partial success.
>
> And (if A) feels even more confusing because it does not actually move
> point and does not advance the parser. So, it is unclear what success
> means and what is the buffer/stack context when action is executed.

Here's a new version, that I hope clarifies these questions (instead of
doing the opposite).

Note that there's an open peg.el bug now (#59345), about whether the
"syntax-class" PEX is supposed to advance point or not -- you'd think
that it would, but it doesn't. No word from the author yet.

Lastly, nobody with a maintainer's hat on has actually given the green
light on this, and I assume we'll want to hold off until the next
version of Emacs is released; anyway it would be good to know what
Eli/Lars think. I haven't done any NEWS additions or anything, either.

Thanks!
Eric

[-- Attachment #2: peg.texi --]
[-- Type: application/x-texinfo, Size: 10028 bytes --]
* Re: [PATCH] Re: Make peg.el a built-in library?
From: Eli Zaretskii @ 2022-11-27 8:57 UTC
To: Eric Abrahamsen; +Cc: yantar92, emacs-devel

> From: Eric Abrahamsen <eric@ericabrahamsen.net>
> Cc: emacs-devel@gnu.org
> Date: Sat, 26 Nov 2022 17:46:04 -0800
>
> Here's a new version, that I hope clarifies these questions (instead of
> doing the opposite).

Thanks, a few minor comments below.

> Lastly, nobody with a maintainer's hat on has actually given the green
> light on this, and I assume we'll want to hold off until the next
> version of Emacs is released; anyway it would be good to know what
> Eli/Lars think. I haven't done any NEWS additions or anything, either.

What exactly are you asking about here?

> @c -*-texinfo-*-
> @c This is part of the GNU Emacs Lisp Reference Manual.

This would mean a suitable change to elisp.texi at the least, and
probably also to another file that is part of the ELisp reference manual
sources?

> A @acronym{PEG} parser is defined as a list of named rules, each of
> which match text patterns, and/or contain references to other rules.
        ^^^^^                       ^^^^^^^
"matches" and "contains", in singular.

> Parsing is initiated with the function @code{peg-run} or the macro
> @code{peg-parse}, and parses text after point in the current buffer,
> using a given set of rules.

This function and this macro need to be formally documented with @defun
and @defmac, as we do elsewhere in the ELisp reference.

> The definition of each rule is referred to as a @dfn{parsing
> expression} (@acronym{PEX}), and can consist of a literal string, a

Ideally, each @dfn in the manual should have a @cindex entry, because
people are likely to look up these terms.

> Or set as the value of a variable, and the variable used in a
> combination of calls to @code{with-peg-rules} and @code{peg-run},
> where the ``entry-point'' rule is given explicitly:

This sentence reads awkwardly, because it starts with "Or set". Suggest
to rephrase:

  Alternatively, use a variable whose value is a grammar, and use it
  in a combination of calls to...

> @example
> (defvar number-grammar
>   '((number sign digit (* digit))
>     (sign (or "+" "-" ""))
>     (digit [0-9])))

Btw, this begs a question: how come the value of the variable is a
(quoted) list, but the value you pass to peg-parse in the previous
example was not quoted?

> By default, calls to @code{peg-run} or @code{peg-parse} produce no
> output: parsing simply moves point. In order to return or otherwise
> act upon parsed strings, rules can include @dfn{actions}, see
> @xref{Parsing Actions} for more information.

Again, a @cindex for "actions" is in order here.

Also, @xref produces a Capitalized "See", so you want a @ref here, not
@xref. And please always follow the closing brace of a cross-reference
with a period or a comma, because some versions of Texinfo insist on
that. (The only exception from this rule is @pxref inside parentheses.)

> Individual rules can also be defined using a more @code{defun}-like
> syntax, using the macro @code{define-peg-rule}:
>
> @example
> (define-peg-rule digit ()
>   [0-9])
> @end example

define-peg-rule should be documented with a @defmac.

> @node PEX Definitions
> @section PEX Definitions

There should be a @menu in the parent @chapter's node for all the child
@section nodes. Otherwise, makeinfo will barf.

> @item "abc"
> A literal string.

You don't mean "abc" literally here, do you? The correct way of
expressing "a string" is

  @item @var{string}

> @item (char C)
> A single character, as an Elisp character literal.

Likewise here:

  @item @var{C}
  A single character @var{C}, as a Lisp character literal.

> @item (* E)
> Zero or more of an expression, as the regexp ``*''. Matching is
> always ``greedy''.

Likewise. Basically, all the elements here are meta-syntactic
variables: they stand for something else. The right markup for them is
@var.

Also, "zero or more of an expression" reads awkwardly. I don't even
think I understand what you mean.

And please quote regexps using @samp, not literal quotes (here and
elsewhere).

> @item (+ E)
> One or more of an expression, as the regexp ``+''. Matching is always
> ``greedy''.

Likewise about "one or more of an expression".

> @item (opt E)
> Zero or one of an expression, as the regexp ``?''.

Same.

> @item (range A B)
> The character range between A and B, as the regexp ``[A-B]''.

It is better to use CH1 and CH2 instead of A and B.

> @item [a-b "+*" ?x]
> A character set, including ranges, literal characters, or strings of
> characters.

Same comment about a and b.

> @vindex peg-char-classes
> Named character classes include the following:

Instead of listing them, just use a cross-reference to the node where
classes are documented as part of regexp syntax.

> The first action pushes the initial value of point to the stack. The
> intervening @acronym{PEX} moves point over the next word. The second
                                                           ^^
Two spaces there.

> action pops the previous value from the stack (binding it to the
> variable @code{start}), and uses that value to extract a substring
> from the buffer and push it to the stack. This pattern is so common
> that peg.el provides a shorthand function that does exactly the above,
       ^^^^^^
@file{peg.el}. Or maybe just @acronym{PEG}.

> @item (substring E)
> Match @acronym{PEX} E and push the matched string to the stack.

Same comments here regarding @var markup of meta-syntactic variables.

> @item (replace E "repl")
> Match E and replaced the matched region with the string "repl".

"repl" is not a literal string, it's a meta-syntactic variable, just
like E.

Finally, this needs a lot of index entries to make it a useful reference
that is easily looked up for stuff.

Thanks.
* Re: [PATCH] Re: Make peg.el a built-in library? 2022-11-27 8:57 ` Eli Zaretskii @ 2022-11-28 1:09 ` Eric Abrahamsen 2022-11-28 12:16 ` Eli Zaretskii 0 siblings, 1 reply; 100+ messages in thread From: Eric Abrahamsen @ 2022-11-28 1:09 UTC (permalink / raw) To: Eli Zaretskii; +Cc: yantar92, emacs-devel On 11/27/22 10:57 AM, Eli Zaretskii wrote: >> From: Eric Abrahamsen <eric@ericabrahamsen.net> >> Cc: emacs-devel@gnu.org >> Date: Sat, 26 Nov 2022 17:46:04 -0800 >> >> Here's a new version, that I hope clarifies these questions (instead of >> doing the opposite). > > Thanks, a few minor comments below. Thank you! I feel like you've given me many of the same notes in the past (particularly @xref/@ref), I'll get it eventually. >> Lastly, nobody with a maintainer's hat on has actually given the green >> light on this, and I assume we'll want to hold off until the next >> version of Emacs is released; anyway it would be good to know what >> Eli/Lars think. I haven't done any NEWS additions or anything, either. > > What exactly are you asking about here? Making peg.el a built-in. I looked back over this whole thread and it turns out you already gave the OK early on, but now I'm not sure if this would go in Emacs proper, or as a built-in package... So that's my question. Where is the natural place to put it? >> @c -*-texinfo-*- >> @c This is part of the GNU Emacs Lisp Reference Manual. > > This would mean a suitable change to elisp.texi at the least, and probably > also to another file that is part of the ELisp reference manual sources? This would depend on how, exactly, it gets included. [...] >> @example >> (defvar number-grammar >> '((number sign digit (* digit)) >> (sign (or "+" "-" "")) >> (digit [0-9]))) > > Btw, this begs a question: how come the value of the variable is a (quoted) > list, but the value you pass to peg-parse in the previous example was not > quoted? peg-parse is a macro, peg-run is a function. 
peg-parse constructs a call to peg-run, passing in the car of whatever list you've given to it as the argument. The rest of your comments seem straightforward, I'll make those edits now. Thanks, Eric
* Re: [PATCH] Re: Make peg.el a built-in library? 2022-11-28 1:09 ` Eric Abrahamsen @ 2022-11-28 12:16 ` Eli Zaretskii 2023-09-25 1:30 ` Eric Abrahamsen 0 siblings, 1 reply; 100+ messages in thread From: Eli Zaretskii @ 2022-11-28 12:16 UTC (permalink / raw) To: Eric Abrahamsen; +Cc: yantar92, emacs-devel > From: Eric Abrahamsen <eric@ericabrahamsen.net> > Cc: yantar92@posteo.net, emacs-devel@gnu.org > Date: Sun, 27 Nov 2022 17:09:38 -0800 > > >> Lastly, nobody with a maintainer's hat on has actually given the green > >> light on this, and I assume we'll want to hold off until the next > >> version of Emacs is released; anyway it would be good to know what > >> Eli/Lars think. I haven't done any NEWS additions or anything, either. > > > > What exactly are you asking about here? > > Making peg.el a built-in. I looked back over this whole thread and it > turns out you already gave the OK early on, but now I'm not sure if this > would go in Emacs proper, or as a built-in package... The former, of course. I'd defer to Stefan if I thought it should go to ELPA. > Where is the natural place to put it? Either in lisp/progmodes or in lisp/emacs-lisp. I prefer the former, FWIW.
* Re: [PATCH] Re: Make peg.el a built-in library? 2022-11-28 12:16 ` Eli Zaretskii @ 2023-09-25 1:30 ` Eric Abrahamsen 2023-09-25 2:27 ` Adam Porter 2024-03-24 14:19 ` Ihor Radchenko 0 siblings, 2 replies; 100+ messages in thread From: Eric Abrahamsen @ 2023-09-25 1:30 UTC (permalink / raw) To: emacs-devel; +Cc: Michael Heerdegen, Eli Zaretskii, Stefan Monnier, yantar92 [-- Attachment #1: Type: text/plain, Size: 1615 bytes --] Before another year goes by... Since my last attempt at this, Stefan has made some additions to the ELPA version of peg.el (adding him to cc in case he wants to look at this), and I have realized that my last stab at the manual inadvertently documented some local changes I had made and then forgotten about. So here's a commit adding package, tests, and manual all at once. I've cc'd the people who indicated interest. The manual should be up to date with the code, I hope I've managed to follow all the pointers, and I believe I've done a better job of explaining how to use the various entry points of the library. I hope this looks okay! Thanks, Eric On 11/28/22 14:16 PM, Eli Zaretskii wrote: >> From: Eric Abrahamsen <eric@ericabrahamsen.net> >> Cc: yantar92@posteo.net, emacs-devel@gnu.org >> Date: Sun, 27 Nov 2022 17:09:38 -0800 >> >> >> Lastly, nobody with a maintainer's hat on has actually given the green >> >> light on this, and I assume we'll want to hold off until the next >> >> version of Emacs is released; anyway it would be good to know what >> >> Eli/Lars think. I haven't done any NEWS additions or anything, either. >> > >> > What exactly are you asking about here? >> >> Making peg.el a built-in. I looked back over this whole thread and it >> turns out you already gave the OK early on, but now I'm not sure if this >> would go in Emacs proper, or as a built-in package... > > The former, of course. I'd defer to Stefan if I thought it should go to > ELPA. > >> Where is the natural place to put it? 
> > Either in lisp/progmodes or in lisp/emacs-lisp. I prefer the former, FWIW. [-- Attachment #2: 0001-Add-peg.el-as-a-built-in-library.patch --] [-- Type: text/x-patch, Size: 65704 bytes --] From a8d1b3ad3162e92b4f8c8dd52690d9c1f3333661 Mon Sep 17 00:00:00 2001 From: Eric Abrahamsen <eric@ericabrahamsen.net> Date: Mon, 5 Dec 2022 21:59:03 -0800 Subject: [PATCH] Add peg.el as a built-in library * lisp/progmodes/peg.el: New file, taken from ELPA package. * test/lisp/peg-tests.el: Package tests. * doc/lispref/peg.texi: Documentation. --- doc/lispref/Makefile.in | 1 + doc/lispref/elisp.texi | 2 + doc/lispref/peg.texi | 351 +++++++++++++++ lisp/progmodes/peg.el | 944 ++++++++++++++++++++++++++++++++++++++++ test/lisp/peg-tests.el | 367 ++++++++++++++++ 5 files changed, 1665 insertions(+) create mode 100644 doc/lispref/peg.texi create mode 100644 lisp/progmodes/peg.el create mode 100644 test/lisp/peg-tests.el diff --git a/doc/lispref/Makefile.in b/doc/lispref/Makefile.in index 325f23a3c0f..8ac1242996d 100644 --- a/doc/lispref/Makefile.in +++ b/doc/lispref/Makefile.in @@ -112,6 +112,7 @@ srcs = $(srcdir)/os.texi \ $(srcdir)/package.texi \ $(srcdir)/parsing.texi \ + $(srcdir)/peg.texi \ $(srcdir)/positions.texi \ $(srcdir)/processes.texi \ $(srcdir)/records.texi \ diff --git a/doc/lispref/elisp.texi b/doc/lispref/elisp.texi index 72441c8d442..e12f61fc7eb 100644 --- a/doc/lispref/elisp.texi +++ b/doc/lispref/elisp.texi @@ -222,6 +222,7 @@ Top * Non-ASCII Characters:: Non-ASCII text in buffers and strings. * Searching and Matching:: Searching buffers for strings or regexps. * Syntax Tables:: The syntax table controls word and list parsing. +* Parsing Expression Grammars:: Parsing structured buffer text. * Parsing Program Source:: Generate syntax tree for program sources. * Abbrevs:: How Abbrev mode works, and its data structures. 
@@ -1719,6 +1720,7 @@ Top @include searching.texi @include syntax.texi +@include peg.texi @include parsing.texi @include abbrevs.texi @include threads.texi diff --git a/doc/lispref/peg.texi b/doc/lispref/peg.texi new file mode 100644 index 00000000000..64950f148b1 --- /dev/null +++ b/doc/lispref/peg.texi @@ -0,0 +1,351 @@ +@c -*-texinfo-*- +@c This is part of the GNU Emacs Lisp Reference Manual. +@c Copyright (C) 1990--1995, 1998--1999, 2001--2023 Free Software +@c Foundation, Inc. +@c See the file elisp.texi for copying conditions. +@node Parsing Expression Grammars +@chapter Parsing Expression Grammars +@cindex text parsing +@cindex parsing expression grammar + + Emacs Lisp provides several tools for parsing and matching text, +from regular expressions (@pxref{Regular Expressions}) to full +@acronym{LL} grammar parsers (@pxref{Top,, Bovine parser +development,bovine}). @dfn{Parsing Expression Grammars} +(@acronym{PEG}) are another approach to text parsing that offer more +structure and composability than regular expressions, but less +complexity than context-free grammars. + +A @acronym{PEG} parser is defined as a list of named rules, each of +which matches text patterns, and/or contains references to other +rules. Parsing is initiated with the function @code{peg-run} or the +macro @code{peg-parse} (see below), and parses text after point in the +current buffer, using a given set of rules. + +@cindex parsing expression +The definition of each rule is referred to as a @dfn{parsing +expression} (@acronym{PEX}), and can consist of a literal string, a +regexp-like character range or set, a peg-specific construct +resembling an elisp function call, a reference to another rule, or a +combination of any of these. A grammar is expressed as a tree of +rules in which one rule is typically treated as a ``root'' or +``entry-point'' rule.
For instance: + +@example +@group +((number sign digit (* digit)) + (sign (or "+" "-" "")) + (digit [0-9])) +@end group +@end example + +Once defined, grammars can be used to parse text after point in the +current buffer, in the following ways: + +@defmac peg-parse &rest pexs +Match @var{pexs} at point. If @var{pexs} is a list of PEG rules, the +first rule is considered the ``entry-point'': +@end defmac + +@example +@group +(peg-parse + ((number sign digit (* digit)) + (sign (or "+" "-" "")) + (digit [0-9]))) +@end group +@end example + +This macro represents the simplest use of the @acronym{PEG} library, +but also the least flexible, as the rules must be written directly +into the source code. A more flexible approach involves use of three +macros in conjunction: @code{with-peg-rules}, a @code{let}-like +construct that makes a set of rules available within the macro body; +@code{peg-run}, which initiates parsing given a single rule; and +@code{peg}, which is used to wrap the entry-point rule name. In fact, +a call to @code{peg-parse} expands to just this set of calls. The +above example could be written as: + +@example +@group +(with-peg-rules + ((number sign digit (* digit)) + (sign (or "+" "-" "")) + (digit [0-9])) + (peg-run (peg number))) +@end group +@end example + +This allows more explicit control over the ``entry-point'' of parsing, +and allows the combination of rules from different sources. + +Individual rules can also be defined using a more @code{defun}-like +syntax, using the macro @code{define-peg-rule}: + +@example +(define-peg-rule digit () + [0-9]) +@end example + +This also allows for rules that accept an argument (supplied by the +@code{funcall} PEG rule). + +Another possibility is to define a named set of rules with +@code{define-peg-ruleset}: + +@example +(define-peg-ruleset number-grammar + '((number sign digit (* digit)) + digit ;; A reference to the definition above. 
+ (sign (or "+" "-" "")))) +@end example + +Rules and rulesets defined this way can be referred to by name in +later calls to @code{peg-run} or @code{with-peg-rules}: + +@example +(with-peg-rules number-grammar + (peg-run (peg number))) +@end example + +By default, calls to @code{peg-run} or @code{peg-parse} produce no +output: parsing simply moves point. In order to return or otherwise +act upon parsed strings, rules can include @dfn{actions}, see +@ref{Parsing Actions}. + +@menu +* PEX Definitions:: The syntax of PEX rules. +* Parsing Actions:: Running actions upon successful parsing. +* Writing PEG Rules:: Tips for writing parsing rules. +@end menu + +@node PEX Definitions +@section PEX Definitions + +Parsing expressions can be defined using the following syntax: + +@table @code +@item (and E1 E2 ...) +A sequence of @acronym{PEX}s that must all be matched. The @code{and} form is +optional and implicit. + +@item (or E1 E2 ...) +Prioritized choices, meaning that, as in Elisp, the choices are tried +in order, and the first successful match is used. Note that this is +distinct from context-free grammars, in which selection between +multiple matches is indeterminate. + +@item (any) +Matches any single character, as the regexp ``.''. + +@item @var{string} +A literal string. + +@item (char @var{C}) +A single character @var{C}, as an Elisp character literal. + +@item (* @var{E}) +Zero or more instances of expression @var{E}, as the regexp @samp{*}. +Matching is always ``greedy''. + +@item (+ @var{E}) +One or more instances of expression @var{E}, as the regexp @samp{+}. +Matching is always ``greedy''. + +@item (opt @var{E}) +Zero or one instance of expression @var{E}, as the regexp @samp{?}. + +@item SYMBOL +A symbol representing a previously-defined PEG rule. + +@item (range CH1 CH2) +The character range between CH1 and CH2, as the regexp @samp{[CH1-CH2]}. 
+ +@item [CH1-CH2 "+*" ?x] +A character set, which can include ranges, character literals, or +strings of characters. + +@item [ascii cntrl] +A list of named character classes. + +@item (syntax-class @var{NAME}) +A single syntax class. + +@item (funcall E ARGS...) +Call @acronym{PEX} E (previously defined with @code{define-peg-rule}) +with arguments @var{ARGS}. + +@item (null) +The empty string. + +@end table + +The following expressions are used as anchors or tests -- they do not +move point, but return a boolean value which can be used to constrain +matches as a way of controlling the parsing process (@pxref{Writing +PEG Rules}). + +@table @code +@item (bob) +Beginning of buffer. + +@item (eob) +End of buffer. + +@item (bol) +Beginning of line. + +@item (eol) +End of line. + +@item (bow) +Beginning of word. + +@item (eow) +End of word. + +@item (bos) +Beginning of symbol. + +@item (eos) +End of symbol. + +@item (if E) +Returns non-@code{nil} if parsing @acronym{PEX} E from point succeeds (point +is not moved). + +@item (not E) +Returns non-@code{nil} if parsing @acronym{PEX} E from point fails (point +is not moved). + +@item (guard EXP) +Treats the value of the Lisp expression EXP as a boolean. + +@end table + +@vindex peg-char-classes +Character class matching can use the same named character classes as +in regular expressions (@pxref{Top,, Character Classes,elisp}) + +@node Parsing Actions +@section Parsing Actions + +@cindex parsing actions +@cindex parsing stack +By default the process of parsing simply moves point in the current +buffer, ultimately returning @code{t} if the parsing succeeds, and +@code{nil} if it doesn't. It's also possible to define ``actions'' +that can run arbitrary Elisp at certain points in the parsed text. +These actions can optionally affect something called the @dfn{parsing +stack}, which is a list of values returned by the parsing process. 
+These actions only run (and only return values) if the parsing process +ultimately succeeds; if it fails the action code is not run at all. + +Actions can be added anywhere in the definition of a rule. They are +distinguished from parsing expressions by an initial backquote +(@samp{`}), followed by a parenthetical form that must contain a pair +of hyphens (@samp{--}) somewhere within it. Symbols to the left of +the hyphens are bound to values popped from the stack (they are +somewhat analogous to the argument list of a lambda form). Values +produced by code to the right are pushed to the stack (analogous to +the return value of the lambda). For instance, the previous grammar +can be augmented with actions to return the parsed number as an actual +integer: + +@example +(with-peg-rules ((number sign digit (* digit + `(a b -- (+ (* a 10) b))) + `(sign val -- (* sign val))) + (sign (or (and "+" `(-- 1)) + (and "-" `(-- -1)) + (and "" `(-- 1)))) + (digit [0-9] `(-- (- (char-before) ?0)))) + (peg-run (peg number))) +@end example + +There must be values on the stack before they can be popped and +returned -- if there aren't enough stack values to bind to an action's +left-hand terms, they will be bound to @code{nil}. An action with +only right-hand terms will push values to the stack; an action with +only left-hand terms will consume (and discard) values from the stack. +At the end of parsing, stack values are returned as a flat list. + +To return the string matched by a @acronym{PEX} (instead of simply +moving point over it), a rule like this can be used: + +@example +(one-word + `(-- (point)) + (+ [word]) + `(start -- (buffer-substring start (point)))) +@end example + +The first action pushes the initial value of point to the stack. The +intervening @acronym{PEX} moves point over the next word. 
The second +action pops the previous value from the stack (binding it to the +variable @code{start}), and uses that value to extract a substring +from the buffer and push it to the stack. This pattern is so common +that @acronym{PEG} provides a shorthand function that does exactly the +above, along with a few other shorthands for common scenarios: + +@table @code +@item (substring @var{E}) +Match @acronym{PEX} @var{E} and push the matched string to the stack. + +@item (region @var{E}) +Match @var{E} and push the start and end positions of the matched +region to the stack. + +@item (replace @var{E} @var{replacement}) +Match @var{E} and replace the matched region with the string @var{replacement}. + +@item (list @var{E}) +Match @var{E}, collect all values produced by @var{E} (and its +sub-expressions) into a list, and push that list to the stack. Stack +values are typically returned as a flat list; this is a way of +``grouping'' values together. +@end table + +@node Writing PEG Rules +@section Writing PEG Rules + +Something to be aware of when writing PEG rules is that they are +greedy. Rules which can consume a variable amount of text will always +consume the maximum amount possible, even if that causes a rule that +might otherwise have matched to fail later on -- there is no +backtracking. For instance, this rule will never succeed: + +@example +(forest (+ "tree" (* [blank])) "tree" (eol)) +@end example + +The @acronym{PEX} @code{(+ "tree" (* [blank]))} will consume all +repetitions of the word ``tree'', leaving none to match the final +@code{"tree"}. + +In these situations, the desired result can be obtained by using +predicates and guards -- namely the @code{not}, @code{if} and +@code{guard} expressions -- to constrain behavior. For instance: + +@example +(forest (+ "tree" (* [blank]) (not (eol))) "tree" (eol)) +@end example + +The @code{if} and @code{not} operators accept a parsing expression and +interpret it as a boolean, without moving point.
The contents of a +@code{guard} operator are evaluated as regular Lisp (not a +@acronym{PEX}) and should return a boolean value. A @code{nil} value +causes the match to fail. + +Another potentially unexpected behavior is that parsing will move +point as far as possible, even if the parsing ultimately fails. This +rule: + +@example +(end-game "game" (eob)) +@end example + +when run in a buffer containing the text ``game over'' after point, +will move point to just after ``game'' then halt parsing, returning +@code{nil}. Successful parsing will always return @code{t}, or the +contents of the parsing stack. diff --git a/lisp/progmodes/peg.el b/lisp/progmodes/peg.el new file mode 100644 index 00000000000..2eb4a7384d0 --- /dev/null +++ b/lisp/progmodes/peg.el @@ -0,0 +1,944 @@ +;;; peg.el --- Parsing Expression Grammars in Emacs Lisp -*- lexical-binding:t -*- + +;; Copyright (C) 2008-2023 Free Software Foundation, Inc. +;; +;; Author: Helmut Eller <eller.helmut@gmail.com> +;; Maintainer: Stefan Monnier <monnier@iro.umontreal.ca> +;; Version: 1.0.1 +;; +;; This program is free software: you can redistribute it and/or modify +;; it under the terms of the GNU General Public License as published by +;; the Free Software Foundation, either version 3 of the License, or +;; (at your option) any later version. +;; +;; This program is distributed in the hope that it will be useful, +;; but WITHOUT ANY WARRANTY; without even the implied warranty of +;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +;; GNU General Public License for more details. +;; +;; You should have received a copy of the GNU General Public License +;; along with this program. If not, see <https://www.gnu.org/licenses/>. +;; +;;; Commentary: +;; +;; This package implements Parsing Expression Grammars for Emacs Lisp.
+ +;; Parsing Expression Grammars (PEG) are a formalism in the spirit of +;; Context Free Grammars (CFG) with some simplifications which make +;; the implementation of PEGs as recursive descent parsers particularly +;; simple and easy to understand [Ford, Baker]. +;; PEGs are more expressive than regexps and potentially easier to use. +;; +;; This file implements the macros `define-peg-rule', `with-peg-rules', and +;; `peg-parse' which parses the current buffer according to a PEG. +;; E.g. we can match integers with: +;; +;; (with-peg-rules +;; ((number sign digit (* digit)) +;; (sign (or "+" "-" "")) +;; (digit [0-9])) +;; (peg-run (peg number))) +;; or +;; (define-peg-rule digit () +;; [0-9]) +;; (peg-parse (number sign digit (* digit)) +;; (sign (or "+" "-" ""))) +;; +;; In contrast to regexps, PEGs allow us to define recursive "rules". +;; A "grammar" is a set of rules. A rule is written as (NAME PEX...) +;; E.g. (sign (or "+" "-" "")) is a rule with the name "sign". +;; The syntax for PEX (Parsing Expression) is as follows: +;; +;; Description Lisp Traditional, as in Ford's paper +;; =========== ==== =========== +;; Sequence (and E1 E2) e1 e2 +;; Prioritized Choice (or E1 E2) e1 / e2 +;; Not-predicate (not E) !e +;; And-predicate (if E) &e +;; Any character (any) . +;; Literal string "abc" "abc" +;; Character C (char C) 'c' +;; Zero-or-more (* E) e* +;; One-or-more (+ E) e+ +;; Optional (opt E) e? +;; Non-terminal SYMBOL A +;; Character range (range A B) [a-b] +;; Character set [a-b "+*" ?x] [a-b+*x] ;Note: it's a vector +;; Character classes [ascii cntrl] +;; Boolean-guard (guard EXP) +;; Syntax-Class (syntax-class NAME) +;; Local definitions (with RULES PEX...) +;; Indirect call (funcall EXP ARGS...) 
+;; and +;; Empty-string (null) ε +;; Beginning-of-Buffer (bob) +;; End-of-Buffer (eob) +;; Beginning-of-Line (bol) +;; End-of-Line (eol) +;; Beginning-of-Word (bow) +;; End-of-Word (eow) +;; Beginning-of-Symbol (bos) +;; End-of-Symbol (eos) +;; +;; Rules can refer to other rules, and a grammar is often structured +;; as a tree, with a root rule referring to one or more "branch +;; rules", all the way down to the "leaf rules" that deal with actual +;; buffer text. Rules can be recursive or mutually referential, +;; though care must be taken not to create infinite loops. +;; +;;;; Named rulesets: +;; +;; You can define a set of rules for later use with: +;; +;; (define-peg-ruleset myrules +;; (sign () (or "+" "-" "")) +;; (digit () [0-9]) +;; (nat () digit (* digit)) +;; (int () sign digit (* digit)) +;; (float () int "." nat)) +;; +;; and later refer to it: +;; +;; (with-peg-rules +;; (myrules +;; (complex float "+i" float)) +;; ... (peg-parse nat "," nat "," complex) ...) +;; +;;;; Parsing actions: +;; +;; PEXs also support parsing actions, i.e. Lisp snippets which are +;; executed when a pex matches. This can be used to construct syntax +;; trees or for similar tasks. The most basic form of action is +;; written as: +;; +;; (action FORM) ; evaluate FORM for its side-effects +;; +;; Actions don't consume input, but are executed at the point of +;; match. Another kind of action is called a "stack action", and +;; looks like this: +;; +;; `(VAR... -- FORM...) ; stack action +;; +;; A stack action takes VARs from the "value stack" and pushes the +;; results of evaluating FORMs to that stack. + +;; The value stack is created during the course of parsing. Certain +;; operators (see below) that match buffer text can push values onto +;; this stack. "Upstream" rules can then draw values from the stack, +;; and optionally push new ones back. 
For instance, consider this +;; very simple grammar: +;; +;; (with-peg-rules +;; ((query (+ term) (eol)) +;; (term key ":" value (opt (+ [space])) +;; `(k v -- (cons (intern k) v))) +;; (key (substring (and (not ":") (+ [word])))) +;; (value (or string-value number-value)) +;; (string-value (substring (+ [alpha]))) +;; (number-value (substring (+ [digit])) +;; `(val -- (string-to-number val)))) +;; (peg-run (peg query))) +;; +;; This invocation of `peg-run' would parse this buffer text: +;; +;; name:Jane age:30 +;; +;; And return this Elisp sexp: +;; +;; ((age . 30) (name . "Jane")) +;; +;; Note that, in complex grammars, some care must be taken to make +;; sure that the number and type of values drawn from the stack always +;; match those pushed. In the example above, both `string-value' and +;; `number-value' push a single value to the stack. Since the `value' +;; rule only includes these two sub-rules, any upstream rule that +;; makes use of `value' can be confident it will always and only push +;; a single value to the stack. +;; +;; Stack action forms are in a sense analogous to lambda forms: the +;; symbols before the "--" are the equivalent of lambda arguments, +;; while the forms after the "--" are return values. The difference +;; being that a lambda form can only return a single value, while a +;; stack action can push multiple values onto the stack. It's also +;; perfectly valid to use `(-- FORM...)' or `(VAR... --)': the former +;; pushes values to the stack without consuming any, and the latter +;; pops values from the stack and discards them. +;; +;;;; Derived Operators: +;; +;; The following operators are implemented as combinations of +;; primitive expressions: +;; +;; (substring E) ; Match E and push the substring for the matched region. +;; (region E) ; Match E and push the start and end positions. +;; (replace E RPL); Match E and replace the matched region with RPL. +;; (list E) ; Match E and push a list of the items that E produced. 
+;; +;; See `peg-ex-parse-int' in `peg-tests.el' for further examples. +;; +;; Regexp equivalents: +;; +;; Here are some examples for regexps and how those could be written as pex. +;; [Most are taken from rx.el] +;; +;; "^[a-z]*" +;; (and (bol) (* [a-z])) +;; +;; "\n[^ \t]" +;; (and "\n" (not [" \t"]) (any)) +;; +;; "\\*\\*\\* EOOH \\*\\*\\*\n" +;; "*** EOOH ***\n" +;; +;; "\\<\\(catch\\|finally\\)\\>[^_]" +;; (and (bow) (or "catch" "finally") (eow) (not "_") (any)) +;; +;; "[ \t\n]*:\\([^:]+\\|$\\)" +;; (and (* [" \t\n"]) ":" (or (+ (not ":") (any)) (eol))) +;; +;; "^content-transfer-encoding:\\(\n?[\t ]\\)*quoted-printable\\(\n?[\t ]\\)*" +;; (and (bol) +;; "content-transfer-encoding:" +;; (* (opt "\n") ["\t "]) +;; "quoted-printable" +;; (* (opt "\n") ["\t "])) +;; +;; "\\$[I]d: [^ ]+ \\([^ ]+\\) " +;; (and "$Id: " (+ (not " ") (any)) " " (+ (not " ") (any)) " ") +;; +;; "^;;\\s-*\n\\|^\n" +;; (or (and (bol) ";;" (* (syntax-class whitespace)) "\n") +;; (and (bol) "\n")) +;; +;; "\\\\\\\\\\[\\w+" +;; (and "\\\\[" (+ (syntax-class word))) +;; +;; See ";;; Examples" in `peg-tests.el' for other examples. +;; +;;;; Rule argument and indirect calls: +;; +;; Rules can take arguments and those arguments can themselves be PEGs. +;; For example: +;; +;; (define-peg-rule 2-or-more (peg) +;; (funcall peg) +;; (funcall peg) +;; (* (funcall peg))) +;; +;; ... (peg-parse +;; ... +;; (2-or-more (peg foo)) +;; ... +;; (2-or-more (peg bar)) +;; ...) +;; +;;;; References: +;; +;; [Ford] Bryan Ford. Parsing Expression Grammars: a Recognition-Based +;; Syntactic Foundation. In POPL'04: Proceedings of the 31st ACM +;; SIGPLAN-SIGACT symposium on Principles of Programming Languages, +;; pages 111-122, New York, NY, USA, 2004. ACM Press. +;; http://pdos.csail.mit.edu/~baford/packrat/ +;; +;; [Baker] Baker, Henry G. "Pragmatic Parsing in Common Lisp". ACM Lisp +;; Pointers 4(2), April--June 1991, pp. 3--15. 
+;; http://home.pipeline.com/~hbaker1/Prag-Parse.html +;; +;; Roman Redziejowski does good PEG related research +;; http://www.romanredz.se/pubs.htm + +;;;; Todo: + +;; - Fix the exponential blowup in `peg-translate-exp'. +;; - Add a proper debug-spec for PEXs. + +;;; News: + +;; Since 1.0.1: +;; - Use OClosures to represent PEG rules when available, and let cl-print +;; display their source code. +;; - New PEX form (with RULES PEX...). +;; - Named rulesets. +;; - You can pass arguments to rules. +;; - New `funcall' rule to call rules indirectly (e.g. a peg you received +;; as argument). + +;; Version 1.0: +;; - New official entry points `peg` and `peg-run`. + +;;; Code: + +(eval-when-compile (require 'cl-lib)) + +(defvar peg--actions nil + "Actions collected along the current parse. +Used at runtime for backtracking. It's a list ((POS . THUNK)...). +Each THUNK is executed at the corresponding POS. Thunks are +executed in a postprocessing step, not during parsing.") + +(defvar peg--errors nil + "Data keeping track of the rightmost parse failure location. +It's a pair (POSITION . EXPS ...). POSITION is the buffer position and +EXPS is a list of rules/expressions that failed.") + +;;;; Main entry points + +(defmacro peg--when-fboundp (f &rest body) + (declare (indent 1) (debug (sexp body))) + (when (fboundp f) + (macroexp-progn body))) + +(peg--when-fboundp oclosure-define + (oclosure-define peg-function + "Parsing function built from PEG rule." + pexs) + + (cl-defmethod cl-print-object ((peg peg-function) stream) + (princ "#f<peg " stream) + (let ((args (help-function-arglist peg 'preserve-names))) + (if args + (prin1 args stream) + (princ "()" stream))) + (princ " " stream) + (prin1 (peg-function--pexs peg) stream) + (princ ">" stream))) + +(defmacro peg--lambda (pexs args &rest body) + (declare (indent 2) + (debug (&define form lambda-list def-body))) + (if (fboundp 'oclosure-lambda) + `(oclosure-lambda (peg-function (pexs ,pexs)) ,args . ,body) + `(lambda ,args . 
,body))) + +;; Sometimes (with-peg-rules ... (peg-run (peg ...))) is too +;; longwinded for the task at hand, so `peg-parse' comes in handy. +(defmacro peg-parse (&rest pexs) + "Match PEXS at point. +PEXS is a sequence of PEG expressions, implicitly combined with `and'. +Returns STACK if the match succeeds and signals an error on failure, +moving point along the way. +PEXS can also be a list of PEG rules, in which case the first rule is used." + (if (and (consp (car pexs)) + (symbolp (caar pexs)) + (not (ignore-errors (peg-normalize (car pexs))))) + ;; `pexs' is a list of rules: use the first rule as entry point. + `(with-peg-rules ,pexs (peg-run (peg ,(caar pexs)) #'peg-signal-failure)) + `(peg-run (peg ,@pexs) #'peg-signal-failure))) + +(defmacro peg (&rest pexs) + "Return a PEG-matcher that matches PEXS." + (pcase (peg-normalize `(and . ,pexs)) + (`(call ,name) `#',(peg--rule-id name)) ;Optimize this case by η-reduction! + (exp `(peg--lambda ',pexs () ,(peg-translate-exp exp))))) + +;; There are several "infos we want to return" when parsing a given PEX: +;; 1- We want to return the success/failure of the parse. +;; 2- We want to return the data of the successful parse (the stack). +;; 3- We want to return the diagnostic of the failures. +;; 4- We want to perform the actions (upon parse success)! +;; `peg-parse' used an error signal to encode the (1) boolean, which +;; lets it return all the info conveniently but the error signal was sometimes +;; inconvenient. Other times one wants to just know (1) maybe without even +;; performing (4). +;; `peg-run' lets you choose all that, and by default gives you +;; (1) as a simple boolean, while also doing (2), and (4). + +(defun peg-run (peg-matcher &optional failure-function success-function) + "Parse with PEG-MATCHER at point and run the success/failure function. 
+If a match was found, move to the end of the match and call SUCCESS-FUNCTION +with one argument: a function which will perform all the actions collected +during the parse and then return the resulting stack (or t if empty). +If no match was found, move to the (rightmost) point of parse failure and call +FAILURE-FUNCTION with one argument, which is a list of PEG expressions that +failed at this point. +SUCCESS-FUNCTION defaults to `funcall' and FAILURE-FUNCTION +defaults to `ignore'." + (let ((peg--actions '()) (peg--errors '(-1))) + (if (funcall peg-matcher) + ;; Found a parse: run the actions collected along the way. + (funcall (or success-function #'funcall) + (lambda () + (save-excursion (peg-postprocess peg--actions)))) + (goto-char (car peg--errors)) + (when failure-function + (funcall failure-function (peg-merge-errors (cdr peg--errors))))))) + +(defmacro define-peg-rule (name args &rest pexs) + "Define PEG rule NAME as equivalent to PEXS. +The PEG expressions in PEXS are implicitly combined with the +sequencing `and' operator of PEG grammars." + (declare (indent 1)) + (let ((inline nil)) + (while (keywordp (car pexs)) + (pcase (pop pexs) + (:inline (setq inline (car pexs)))) + (setq pexs (cdr pexs))) + (let ((id (peg--rule-id name)) + (exp (peg-normalize `(and . ,pexs)))) + `(progn + (defalias ',id + (peg--lambda ',pexs ,args + ,(if inline + ;; Short-circuit to peg--translate in order to skip + ;; the extra failure-recording of `peg-translate-exp'. + ;; It also skips the cycle detection of + ;; `peg--translate-rule-body', which is not the main + ;; purpose but we can live with it. + (apply #'peg--translate exp) + (peg--translate-rule-body name exp)))) + (eval-and-compile + ;; FIXME: We shouldn't need this any more since the info is now + ;; stored in the function, but sadly we need to find a name's EXP + ;; during compilation (i.e. before the `defalias' is executed) + ;; as part of cycle-detection! 
+ (put ',id 'peg--rule-definition ',exp) + ,@(when inline + ;; FIXME: Copied from `defsubst'. + `(;; Never native-compile defsubsts as we need the byte + ;; definition in `byte-compile-unfold-bcf' to perform the + ;; inlining (Bug#42664, Bug#43280, Bug#44209). + ,(byte-run--set-speed id nil -1) + (put ',id 'byte-optimizer #'byte-compile-inline-expand)))))))) + +(defmacro define-peg-ruleset (name &rest rules) + "Define a set of PEG rules for later use, e.g., in `with-peg-rules'." + (declare (indent 1)) + (let ((defs ()) + (aliases ())) + (dolist (rule rules) + (let* ((rname (car rule)) + (full-rname (format "%s %s" name rname))) + (push `(define-peg-rule ,full-rname . ,(cdr rule)) defs) + (push `(,(peg--rule-id rname) #',(peg--rule-id full-rname)) aliases))) + `(cl-flet ,aliases + ,@defs + (eval-and-compile (put ',name 'peg--rules ',aliases))))) + +(defmacro with-peg-rules (rules &rest body) + "Make PEG rules RULES available within the scope of BODY. +RULES is a list of rules of the form (NAME . PEXS), where PEXS is a sequence +of PEG expressions, implicitly combined with `and'. +RULES can also contain symbols in which case these must name +rulesets defined previously with `define-peg-ruleset'." + (declare (indent 1) (debug (sexp form))) ;FIXME: `sexp' is not good enough! + (let* ((rulesets nil) + (rules + ;; First, macroexpand the rules. + (delq nil + (mapcar (lambda (rule) + (if (symbolp rule) + (progn (push rule rulesets) nil) + (cons (car rule) (peg-normalize `(and . ,(cdr rule)))))) + rules))) + (ctx (assq :peg-rules macroexpand-all-environment))) + (macroexpand-all + `(cl-labels + ,(mapcar (lambda (rule) + ;; FIXME: Use `peg--lambda' as well. + `(,(peg--rule-id (car rule)) + () + ,(peg--translate-rule-body (car rule) (cdr rule)))) + rules) + ,@body) + `((:peg-rules ,@(append rules (cdr ctx))) + ,@macroexpand-all-environment)))) + +;;;;; Old entry points + +(defmacro peg-parse-exp (exp) + "Match the parsing expression EXP at point." 
+ (declare (obsolete peg-parse "peg-0.9")) + `(peg-run (peg ,exp))) + +;;;; The actual implementation + +(defun peg--lookup-rule (name) + (or (cdr (assq name (cdr (assq :peg-rules macroexpand-all-environment)))) + ;; With `peg-function' objects, we can recover the PEG from which it was + ;; defined, but this info is not yet available at compile-time. :-( + ;;(let ((id (peg--rule-id name))) + ;; (peg-function--pexs (symbol-function id))) + (get (peg--rule-id name) 'peg--rule-definition))) + +(defun peg--rule-id (name) + (intern (format "peg-rule %s" name))) + +(define-error 'peg-search-failed "Parse error at %d (expecting %S)") + +(defun peg-signal-failure (failures) + (signal 'peg-search-failed (list (point) failures))) + +(defun peg-parse-at-point (peg-matcher) + "Parse text at point according to the PEG rule PEG-MATCHER." + (declare (obsolete peg-run "peg-1.0")) + (peg-run peg-matcher + #'peg-signal-failure + (lambda (f) (let ((r (funcall f))) (if (listp r) r))))) + +;; Internally we use a regularized syntax, e.g. we only have binary OR +;; nodes. Regularized nodes are lists of the form (OP ARGS...). +(cl-defgeneric peg-normalize (exp) + "Return a \"normalized\" form of EXP." + (error "Invalid parsing expression: %S" exp)) + +(cl-defmethod peg-normalize ((exp string)) + (let ((len (length exp))) + (cond ((zerop len) '(guard t)) + ((= len 1) `(char ,(aref exp 0))) + (t `(str ,exp))))) + +(cl-defmethod peg-normalize ((exp symbol)) + ;; (peg--lookup-rule exp) + `(call ,exp)) + +(cl-defmethod peg-normalize ((exp vector)) + (peg-normalize `(set . 
,(append exp '())))) + +(cl-defmethod peg-normalize ((exp cons)) + (apply #'peg--macroexpand exp)) + +(defconst peg-leaf-types '(any call action char range str set + guard syntax-class = funcall)) + +(cl-defgeneric peg--macroexpand (head &rest args) + (cond + ((memq head peg-leaf-types) (cons head args)) + (t `(call ,head ,@args)))) + +(cl-defmethod peg--macroexpand ((_ (eql or)) &rest args) + (cond ((null args) '(guard nil)) + ((null (cdr args)) (peg-normalize (car args))) + (t `(or ,(peg-normalize (car args)) + ,(peg-normalize `(or . ,(cdr args))))))) + +(cl-defmethod peg--macroexpand ((_ (eql and)) &rest args) + (cond ((null args) '(guard t)) + ((null (cdr args)) (peg-normalize (car args))) + (t `(and ,(peg-normalize (car args)) + ,(peg-normalize `(and . ,(cdr args))))))) + +(cl-defmethod peg--macroexpand ((_ (eql *)) &rest args) + `(* ,(peg-normalize `(and . ,args)))) + +;; FIXME: this duplicates code; could use some loop to avoid that +(cl-defmethod peg--macroexpand ((_ (eql +)) &rest args) + (let ((e (peg-normalize `(and . ,args)))) + `(and ,e (* ,e)))) + +(cl-defmethod peg--macroexpand ((_ (eql opt)) &rest args) + (let ((e (peg-normalize `(and . ,args)))) + `(or ,e (guard t)))) + +(cl-defmethod peg--macroexpand ((_ (eql if)) &rest args) + `(if ,(peg-normalize `(and . ,args)))) + +(cl-defmethod peg--macroexpand ((_ (eql not)) &rest args) + `(not ,(peg-normalize `(and . 
,args)))) + +(cl-defmethod peg--macroexpand ((_ (eql \`)) form) + (peg-normalize `(stack-action ,form))) + +(cl-defmethod peg--macroexpand ((_ (eql stack-action)) form) + (unless (member '-- form) + (error "Malformed stack action: %S" form)) + (let ((args (cdr (member '-- (reverse form)))) + (values (cdr (member '-- form)))) + (let ((form `(let ,(mapcar (lambda (var) `(,var (pop peg--stack))) args) + ,@(mapcar (lambda (val) `(push ,val peg--stack)) values)))) + `(action ,form)))) + +(defvar peg-char-classes + '(ascii alnum alpha blank cntrl digit graph lower multibyte nonascii print + punct space unibyte upper word xdigit)) + +(cl-defmethod peg--macroexpand ((_ (eql set)) &rest specs) + (cond ((null specs) '(guard nil)) + ((and (null (cdr specs)) + (let ((range (peg-range-designator (car specs)))) + (and range `(range ,(car range) ,(cdr range)))))) + (t + (let ((chars '()) (ranges '()) (classes '())) + (while specs + (let* ((spec (pop specs)) + (range (peg-range-designator spec))) + (cond (range + (push range ranges)) + ((peg-characterp spec) + (push spec chars)) + ((stringp spec) + (setq chars (append (reverse (append spec ())) chars))) + ((memq spec peg-char-classes) + (push spec classes)) + (t (error "Invalid set specifier: %S" spec))))) + (setq ranges (reverse ranges)) + (setq chars (delete-dups (reverse chars))) + (setq classes (reverse classes)) + (cond ((and (null ranges) + (null classes) + (cond ((null chars) '(guard nil)) + ((null (cdr chars)) `(char ,(car chars)))))) + (t `(set ,ranges ,chars ,classes))))))) + +(defun peg-range-designator (x) + (and (symbolp x) + (let ((str (symbol-name x))) + (and (= (length str) 3) + (eq (aref str 1) ?-) + (< (aref str 0) (aref str 2)) + (cons (aref str 0) (aref str 2)))))) + +;; characterp is new in Emacs 23. 
+(defun peg-characterp (x) + (if (fboundp 'characterp) + (characterp x) + (integerp x))) + +(cl-defmethod peg--macroexpand ((_ (eql list)) &rest args) + (peg-normalize + (let ((marker (make-symbol "magic-marker"))) + `(and (stack-action (-- ',marker)) + ,@args + (stack-action (-- + (let ((l '())) + (while + (let ((e (pop peg--stack))) + (cond ((eq e ',marker) nil) + ((null peg--stack) + (error "No marker on stack")) + (t (push e l) t)))) + l))))))) + +(cl-defmethod peg--macroexpand ((_ (eql substring)) &rest args) + (peg-normalize + `(and `(-- (point)) + ,@args + `(start -- (buffer-substring-no-properties start (point)))))) + +(cl-defmethod peg--macroexpand ((_ (eql region)) &rest args) + (peg-normalize + `(and `(-- (point)) + ,@args + `(-- (point))))) + +(cl-defmethod peg--macroexpand ((_ (eql replace)) pe replacement) + (peg-normalize + `(and (stack-action (-- (point))) + ,pe + (stack-action (start -- (progn + (delete-region start (point)) + (insert-before-markers ,replacement)))) + (stack-action (_ --))))) + +(cl-defmethod peg--macroexpand ((_ (eql quote)) _form) + (error "quote is reserved for future use")) + +(cl-defgeneric peg--translate (head &rest args) + (error "No translator for: %S" (cons head args))) + +(defun peg--translate-rule-body (name exp) + (let ((msg (condition-case err + (progn (peg-detect-cycles exp (list name)) nil) + (error (error-message-string err)))) + (code (peg-translate-exp exp))) + (cond + ((null msg) code) + ((fboundp 'macroexp--warn-and-return) + (macroexp--warn-and-return msg code)) + (t + (message "%s" msg) + code)))) + +;; This is the main translation function. +(defun peg-translate-exp (exp) + "Return the ELisp code to match the PE EXP." + ;; FIXME: This expansion basically duplicates `exp' in the output, which is + ;; a serious problem because it's done recursively, so it makes the output + ;; code's size exponentially larger than the input! 
+ `(or ,(apply #'peg--translate exp) + (peg--record-failure ',exp))) ; for error reporting + +(define-obsolete-function-alias 'peg-record-failure + #'peg--record-failure "peg-1.0") +(defun peg--record-failure (exp) + (cond ((= (point) (car peg--errors)) + (setcdr peg--errors (cons exp (cdr peg--errors)))) + ((> (point) (car peg--errors)) + (setq peg--errors (list (point) exp)))) + nil) + +(cl-defmethod peg--translate ((_ (eql and)) e1 e2) + `(and ,(peg-translate-exp e1) + ,(peg-translate-exp e2))) + +;; Choicepoints are used for backtracking. At a choicepoint we save +;; enough state, so that we can continue from there if needed. +(defun peg--choicepoint-moved-p (choicepoint) + `(/= ,(car choicepoint) (point))) + +(defun peg--choicepoint-restore (choicepoint) + `(progn + (goto-char ,(car choicepoint)) + (setq peg--actions ,(cdr choicepoint)))) + +(defmacro peg--with-choicepoint (var &rest body) + (declare (indent 1) (debug (symbolp form))) + `(let ((,var (cons (make-symbol "point") (make-symbol "actions")))) + `(let ((,(car ,var) (point)) + (,(cdr ,var) peg--actions)) + ,@(list ,@body)))) + +(cl-defmethod peg--translate ((_ (eql or)) e1 e2) + (peg--with-choicepoint cp + `(or ,(peg-translate-exp e1) + (,@(peg--choicepoint-restore cp) + ,(peg-translate-exp e2))))) + +(cl-defmethod peg--translate ((_ (eql with)) rules &rest exps) + `(with-peg-rules ,rules ,(peg--translate `(and . ,exps)))) + +(cl-defmethod peg--translate ((_ (eql guard)) exp) exp) + +(defvar peg-syntax-classes + '((whitespace ?-) (word ?w) (symbol ?s) (punctuation ?.) + (open ?\() (close ?\)) (string ?\") (escape ?\\) (charquote ?/) + (math ?$) (prefix ?') (comment ?<) (endcomment ?>) + (comment-fence ?!) 
(string-fence ?|))) + +(cl-defmethod peg--translate ((_ (eql syntax-class)) class) + (let ((probe (assoc class peg-syntax-classes))) + (cond (probe `(when (looking-at ,(format "\\s%c" (cadr probe))) + (forward-char) + t)) + (t (error "Invalid syntax class: %S\nMust be one of: %s" class + (mapcar #'car peg-syntax-classes)))))) + +(cl-defmethod peg--translate ((_ (eql =)) string) + `(let ((str ,string)) + (when (zerop (length str)) + (error "Empty strings not allowed for =")) + (search-forward str (+ (point) (length str)) t))) + +(cl-defmethod peg--translate ((_ (eql *)) e) + `(progn (while ,(peg--with-choicepoint cp + `(if ,(peg-translate-exp e) + ;; Just as regexps do for the `*' operator, + ;; we allow the body of `*' loops to match + ;; the empty string, but we don't repeat the loop if + ;; we haven't moved, to avoid inf-loops. + ,(peg--choicepoint-moved-p cp) + ,(peg--choicepoint-restore cp) + nil))) + t)) + +(cl-defmethod peg--translate ((_ (eql if)) e) + (peg--with-choicepoint cp + `(when ,(peg-translate-exp e) + ,(peg--choicepoint-restore cp) + t))) + +(cl-defmethod peg--translate ((_ (eql not)) e) + (peg--with-choicepoint cp + `(unless ,(peg-translate-exp e) + ,(peg--choicepoint-restore cp) + t))) + +(cl-defmethod peg--translate ((_ (eql any)) ) + '(when (not (eobp)) + (forward-char) + t)) + +(cl-defmethod peg--translate ((_ (eql char)) c) + `(when (eq (char-after) ',c) + (forward-char) + t)) + +(cl-defmethod peg--translate ((_ (eql set)) ranges chars classes) + `(when (looking-at ',(peg-make-charset-regexp ranges chars classes)) + (forward-char) + t)) + +(defun peg-make-charset-regexp (ranges chars classes) + (when (and (not ranges) (not classes) (<= (length chars) 1)) + (error "Bug")) + (let ((rbracket (member ?\] chars)) + (minus (member ?- chars)) + (hat (member ?^ chars))) + (dolist (c '(?\] ?- ?^)) + (setq chars (remove c chars))) + (format "[%s%s%s%s%s%s]" + (if rbracket "]" "") + (if minus "-" "") + (mapconcat (lambda (x) (format "%c-%c" (car x) (cdr 
x))) ranges "") + (mapconcat (lambda (c) (format "[:%s:]" c)) classes "") + (mapconcat (lambda (c) (format "%c" c)) chars "") + (if hat "^" "")))) + +(cl-defmethod peg--translate ((_ (eql range)) from to) + `(when (and (char-after) + (<= ',from (char-after)) + (<= (char-after) ',to)) + (forward-char) + t)) + +(cl-defmethod peg--translate ((_ (eql str)) str) + `(when (looking-at ',(regexp-quote str)) + (goto-char (match-end 0)) + t)) + +(cl-defmethod peg--translate ((_ (eql call)) name &rest args) + `(,(peg--rule-id name) ,@args)) + +(cl-defmethod peg--translate ((_ (eql funcall)) exp &rest args) + `(funcall ,exp ,@args)) + +(cl-defmethod peg--translate ((_ (eql action)) form) + `(progn + (push (cons (point) (lambda () ,form)) peg--actions) + t)) + +(defvar peg--stack nil) +(defun peg-postprocess (actions) + "Execute \"actions\"." + (let ((peg--stack '()) + (forw-actions ())) + (pcase-dolist (`(,pos . ,thunk) actions) + (push (cons (copy-marker pos) thunk) forw-actions)) + (pcase-dolist (`(,pos . ,thunk) forw-actions) + (goto-char pos) + (funcall thunk)) + (or peg--stack t))) + +;; Left recursion is presumably a common mistake when using PEGs. +;; Here we try to detect such mistakes. Essentially we traverse the +;; graph as long as we can without consuming input. When we find a +;; recursive call we signal an error. + +(defun peg-detect-cycles (exp path) + "Signal an error on a cycle. +Otherwise traverse EXP recursively and return T if EXP can match +without consuming input. Return nil if EXP definitely consumes +input. PATH is the list of rules that we have visited so far." 
+ (apply #'peg--detect-cycles path exp)) + +(cl-defgeneric peg--detect-cycles (head _path &rest args) + (error "No detect-cycle method for: %S" (cons head args))) + +(cl-defmethod peg--detect-cycles (path (_ (eql call)) name) + (if (member name path) + (error "Possible left recursion: %s" + (mapconcat (lambda (x) (format "%s" x)) + (reverse (cons name path)) " -> ")) + (let ((exp (peg--lookup-rule name))) + (if (null exp) + ;; If there's no rule by that name, either we'll fail at + ;; run-time or it will be defined later. In any case, at this + ;; point there's no evidence of a cycle, and if a cycle appears + ;; later we'll hopefully catch it when the rule gets defined. + ;; FIXME: In practice, if `name' is part of the cycle, we will + ;; indeed detect it when it gets defined, but OTOH if `name' + ;; is not part of a cycle but it *enables* a cycle because + ;; it matches the empty string (i.e. we should have returned t + ;; here), then we may not catch the problem at all :-( + nil + (peg-detect-cycles exp (cons name path)))))) + +(cl-defmethod peg--detect-cycles (path (_ (eql and)) e1 e2) + (and (peg-detect-cycles e1 path) + (peg-detect-cycles e2 path))) + +(cl-defmethod peg--detect-cycles (path (_ (eql or)) e1 e2) + (or (peg-detect-cycles e1 path) + (peg-detect-cycles e2 path))) + +(cl-defmethod peg--detect-cycles (path (_ (eql *)) e) + (peg-detect-cycles e path) + t) + +(cl-defmethod peg--detect-cycles (path (_ (eql if)) e) + (peg-unary-nullable e path)) +(cl-defmethod peg--detect-cycles (path (_ (eql not)) e) + (peg-unary-nullable e path)) + +(defun peg-unary-nullable (exp path) + (peg-detect-cycles exp path) + t) + +(cl-defmethod peg--detect-cycles (_path (_ (eql any))) nil) +(cl-defmethod peg--detect-cycles (_path (_ (eql char)) _c) nil) +(cl-defmethod peg--detect-cycles (_path (_ (eql set)) _r _c _k) nil) +(cl-defmethod peg--detect-cycles (_path (_ (eql range)) _c1 _c2) nil) +(cl-defmethod peg--detect-cycles (_path (_ (eql str)) s) (equal s "")) 
+(cl-defmethod peg--detect-cycles (_path (_ (eql guard)) _e) t) +(cl-defmethod peg--detect-cycles (_path (_ (eql =)) _s) nil) +(cl-defmethod peg--detect-cycles (_path (_ (eql syntax-class)) _n) nil) +(cl-defmethod peg--detect-cycles (_path (_ (eql action)) _form) t) + +(defun peg-merge-errors (exps) + "Build a more readable error message out of failed expression." + (let ((merged '())) + (dolist (exp exps) + (setq merged (peg-merge-error exp merged))) + merged)) + +(defun peg-merge-error (exp merged) + (apply #'peg--merge-error merged exp)) + +(cl-defgeneric peg--merge-error (_merged head &rest args) + (error "No merge-error method for: %S" (cons head args))) + +(cl-defmethod peg--merge-error (merged (_ (eql or)) e1 e2) + (peg-merge-error e2 (peg-merge-error e1 merged))) + +(cl-defmethod peg--merge-error (merged (_ (eql and)) e1 _e2) + ;; FIXME: Why is `e2' not used? + (peg-merge-error e1 merged)) + +(cl-defmethod peg--merge-error (merged (_ (eql str)) str) + ;;(add-to-list 'merged str) + (cl-adjoin str merged :test #'equal)) + +(cl-defmethod peg--merge-error (merged (_ (eql call)) rule) + ;; (add-to-list 'merged rule) + (cl-adjoin rule merged :test #'equal)) + +(cl-defmethod peg--merge-error (merged (_ (eql char)) char) + ;; (add-to-list 'merged (string char)) + (cl-adjoin (string char) merged :test #'equal)) + +(cl-defmethod peg--merge-error (merged (_ (eql set)) r c k) + ;; (add-to-list 'merged (peg-make-charset-regexp r c k)) + (cl-adjoin (peg-make-charset-regexp r c k) merged :test #'equal)) + +(cl-defmethod peg--merge-error (merged (_ (eql range)) from to) + ;; (add-to-list 'merged (format "[%c-%c]" from to)) + (cl-adjoin (format "[%c-%c]" from to) merged :test #'equal)) + +(cl-defmethod peg--merge-error (merged (_ (eql *)) exp) + (peg-merge-error exp merged)) + +(cl-defmethod peg--merge-error (merged (_ (eql any))) + ;; (add-to-list 'merged '(any)) + (cl-adjoin '(any) merged :test #'equal)) + +(cl-defmethod peg--merge-error (merged (_ (eql not)) x) + ;; 
(add-to-list 'merged `(not ,x)) + (cl-adjoin `(not ,x) merged :test #'equal)) + +(cl-defmethod peg--merge-error (merged (_ (eql action)) _action) merged) +(cl-defmethod peg--merge-error (merged (_ (eql null))) merged) + +(provide 'peg) +(require 'peg) + +(define-peg-rule null () :inline t (guard t)) +(define-peg-rule fail () :inline t (guard nil)) +(define-peg-rule bob () :inline t (guard (bobp))) +(define-peg-rule eob () :inline t (guard (eobp))) +(define-peg-rule bol () :inline t (guard (bolp))) +(define-peg-rule eol () :inline t (guard (eolp))) +(define-peg-rule bow () :inline t (guard (looking-at "\\<"))) +(define-peg-rule eow () :inline t (guard (looking-at "\\>"))) +(define-peg-rule bos () :inline t (guard (looking-at "\\_<"))) +(define-peg-rule eos () :inline t (guard (looking-at "\\_>"))) + +;;; peg.el ends here diff --git a/test/lisp/peg-tests.el b/test/lisp/peg-tests.el new file mode 100644 index 00000000000..864e09b4200 --- /dev/null +++ b/test/lisp/peg-tests.el @@ -0,0 +1,367 @@ +;;; peg-tests.el --- Tests of PEG parsers -*- lexical-binding: t; -*- + +;; Copyright (C) 2008-2023 Free Software Foundation, Inc. + +;; This program is free software; you can redistribute it and/or modify +;; it under the terms of the GNU General Public License as published by +;; the Free Software Foundation, either version 3 of the License, or +;; (at your option) any later version. + +;; This program is distributed in the hope that it will be useful, +;; but WITHOUT ANY WARRANTY; without even the implied warranty of +;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +;; GNU General Public License for more details. + +;; You should have received a copy of the GNU General Public License +;; along with this program. If not, see <https://www.gnu.org/licenses/>. + +;;; Commentary: + +;; Tests and examples, that used to live in peg.el wrapped inside an `eval'. 
+ +;;; Code: + +(require 'peg) +(require 'ert) + +;;; Tests: + +(defmacro peg-parse-string (pex string &optional noerror) + "Parse STRING according to PEX. +If NOERROR is non-nil, push nil resp. t if the parse failed +resp. succeeded instead of signaling an error." + (let ((oldstyle (consp (car-safe pex)))) ;PEX is really a list of rules. + `(with-temp-buffer + (insert ,string) + (goto-char (point-min)) + ,(if oldstyle + `(with-peg-rules ,pex + (peg-run (peg ,(caar pex)) + ,(unless noerror '#'peg-signal-failure))) + `(peg-run (peg ,pex) + ,(unless noerror '#'peg-signal-failure)))))) + +(define-peg-rule peg-test-natural () + [0-9] (* [0-9])) + +(ert-deftest peg-test () + (should (peg-parse-string peg-test-natural "99 bottles" t)) + (should (peg-parse-string ((s "a")) "a" t)) + (should (not (peg-parse-string ((s "a")) "b" t))) + (should (peg-parse-string ((s (not "a"))) "b" t)) + (should (not (peg-parse-string ((s (not "a"))) "a" t))) + (should (peg-parse-string ((s (if "a"))) "a" t)) + (should (not (peg-parse-string ((s (if "a"))) "b" t))) + (should (peg-parse-string ((s "ab")) "ab" t)) + (should (not (peg-parse-string ((s "ab")) "ba" t))) + (should (not (peg-parse-string ((s "ab")) "a" t))) + (should (peg-parse-string ((s (range ?0 ?9))) "0" t)) + (should (not (peg-parse-string ((s (range ?0 ?9))) "a" t))) + (should (peg-parse-string ((s [0-9])) "0" t)) + (should (not (peg-parse-string ((s [0-9])) "a" t))) + (should (not (peg-parse-string ((s [0-9])) "" t))) + (should (peg-parse-string ((s (any))) "0" t)) + (should (not (peg-parse-string ((s (any))) "" t))) + (should (peg-parse-string ((s (eob))) "" t)) + (should (peg-parse-string ((s (not (eob)))) "a" t)) + (should (peg-parse-string ((s (or "a" "b"))) "a" t)) + (should (peg-parse-string ((s (or "a" "b"))) "b" t)) + (should (not (peg-parse-string ((s (or "a" "b"))) "c" t))) + (should (peg-parse-string (and "a" "b") "ab" t)) + (should (peg-parse-string ((s (and "a" "b"))) "abc" t)) + (should (not (peg-parse-string 
(and "a" "b") "ba" t))) + (should (peg-parse-string ((s (and "a" "b" "c"))) "abc" t)) + (should (peg-parse-string ((s (* "a") "b" (eob))) "b" t)) + (should (peg-parse-string ((s (* "a") "b" (eob))) "ab" t)) + (should (peg-parse-string ((s (* "a") "b" (eob))) "aaab" t)) + (should (not (peg-parse-string ((s (* "a") "b" (eob))) "abc" t))) + (should (peg-parse-string ((s "")) "abc" t)) + (should (peg-parse-string ((s "" (eob))) "" t)) + (should (peg-parse-string ((s (opt "a") "b")) "abc" t)) + (should (peg-parse-string ((s (opt "a") "b")) "bc" t)) + (should (not (peg-parse-string ((s (or))) "ab" t))) + (should (peg-parse-string ((s (and))) "ab" t)) + (should (peg-parse-string ((s (and))) "" t)) + (should (peg-parse-string ((s ["^"])) "^" t)) + (should (peg-parse-string ((s ["^a"])) "a" t)) + (should (peg-parse-string ["-"] "-" t)) + (should (peg-parse-string ((s ["]-"])) "]" t)) + (should (peg-parse-string ((s ["^]"])) "^" t)) + (should (peg-parse-string ((s [alpha])) "z" t)) + (should (not (peg-parse-string ((s [alpha])) "0" t))) + (should (not (peg-parse-string ((s [alpha])) "" t))) + (should (not (peg-parse-string ((s ["][:alpha:]"])) "z" t))) + (should (peg-parse-string ((s (bob))) "" t)) + (should (peg-parse-string ((s (bos))) "x" t)) + (should (not (peg-parse-string ((s (bos))) " x" t))) + (should (peg-parse-string ((s "x" (eos))) "x" t)) + (should (peg-parse-string ((s (syntax-class whitespace))) " " t)) + (should (peg-parse-string ((s (= "foo"))) "foo" t)) + (should (let ((f "foo")) (peg-parse-string ((s (= f))) "foo" t))) + (should (not (peg-parse-string ((s (= "foo"))) "xfoo" t))) + (should (equal (peg-parse-string ((s `(-- 1 2))) "") '(2 1))) + (should (equal (peg-parse-string ((s `(-- 1 2) `(a b -- a b))) "") '(2 1))) + (should (equal (peg-parse-string ((s (or (and (any) s) + (substring [0-9])))) + "ab0cd1ef2gh") + '("2"))) + ;; The PEG rule `other' doesn't exist, which will cause a byte-compiler + ;; warning, but not an error at run time because the rule 
is not actually + ;; used in this particular case. + (should (equal (peg-parse-string ((s (substring (or "a" other))) + ;; Unused left-recursive rule, should + ;; cause a byte-compiler warning. + (r (* "a") r)) + "af") + '("a"))) + (should (equal (peg-parse-string ((s (list x y)) + (x `(-- 1)) + (y `(-- 2))) + "") + '((1 2)))) + (should (equal (peg-parse-string ((s (list (* x))) + (x "" `(-- 'x))) + "xxx") + ;; The empty loop body should be matched once! + '((x)))) + (should (equal (peg-parse-string ((s (list (* x))) + (x "x" `(-- 'x))) + "xxx") + '((x x x)))) + (should (equal (peg-parse-string ((s (region (* x))) + (x "x" `(-- 'x))) + "xxx") + ;; FIXME: Since string positions start at 0, this should + ;; really be '(3 x x x 0) !! + '(4 x x x 1))) + (should (equal (peg-parse-string ((s (region (list (* x)))) + (x "x" `(-- 'x 'y))) + "xxx") + '(4 (x y x y x y) 1))) + (should (equal (with-temp-buffer + (save-excursion (insert "abcdef")) + (list + (peg-run (peg "a" + (replace "bc" "x") + (replace "de" "y") + "f")) + (buffer-string))) + '(t "axyf"))) + (with-temp-buffer + (insert "toro") + (goto-char (point-min)) + (should (peg-run (peg "to"))) + (should-not (peg-run (peg "to"))) + (should (peg-run (peg "ro"))) + (should (eobp))) + (with-temp-buffer + (insert " ") + (goto-char (point-min)) + (peg-run (peg (+ (syntax-class whitespace)))) + (should (eobp))) + ) + +;;; Examples: + +;; peg-ex-recognize-int recognizes integers. An integer begins with an +;; optional sign, followed by one or more digits. Digits are all +;; characters from 0 to 9. +;; +;; Notes: +;; 1) "" matches the empty sequence, i.e. matches without consuming +;; input. +;; 2) [0-9] is the character range from 0 to 9. This can also be +;; written as (range ?0 ?9). Note that 0-9 is a symbol. 
+(defun peg-ex-recognize-int () + (with-peg-rules ((number sign digit (* digit)) + (sign (or "+" "-" "")) + (digit [0-9])) + (peg-run (peg number)))) + +;; peg-ex-parse-int recognizes integers and computes the corresponding +;; value. The grammar is the same as for `peg-ex-recognize-int' +;; augmented with parsing actions. Unfortunately, the actions add +;; quite a bit of clutter. +;; +;; The actions for the sign rule push -1 on the stack for a minus sign +;; and 1 for plus or no sign. +;; +;; The action for the digit rule pushes the value for a single digit. +;; +;; The action `(a b -- (+ (* a 10) b)), takes two items from the stack +;; and pushes the first digit times 10 added to the second digit. +;; +;; The action `(sign val -- (* sign val)), multiplies val with the +;; sign (1 or -1). +(defun peg-ex-parse-int () + (with-peg-rules ((number sign digit (* digit + `(a b -- (+ (* a 10) b))) + `(sign val -- (* sign val))) + (sign (or (and "+" `(-- 1)) + (and "-" `(-- -1)) + (and "" `(-- 1)))) + (digit [0-9] `(-- (- (char-before) ?0)))) + (peg-run (peg number)))) + +;; Put point after the ) and press C-x C-e +;; (peg-ex-parse-int)-234234 + +;; Parse arithmetic expressions and compute the result as a side effect. +(defun peg-ex-arith () + (peg-parse + (expr _ sum eol) + (sum product (* (or (and "+" _ product `(a b -- (+ a b))) + (and "-" _ product `(a b -- (- a b)))))) + (product value (* (or (and "*" _ value `(a b -- (* a b))) + (and "/" _ value `(a b -- (/ a b)))))) + (value (or (and (substring number) `(string -- (string-to-number string))) + (and "(" _ sum ")" _))) + (number (+ [0-9]) _) + (_ (* [" \t"])) + (eol (or "\n" "\r\n" "\r")))) + +;; (peg-ex-arith) 1 + 2 * 3 * (4 + 5) +;; (peg-ex-arith) 1 + 2 ^ 3 * (4 + 5) ; fails to parse + +;; Parse URI according to RFC 2396. 
+(defun peg-ex-uri () + (peg-parse + (URI-reference (or absoluteURI relativeURI) + (or (and "#" (substring fragment)) + `(-- nil)) + `(scheme user host port path query fragment -- + (list :scheme scheme :user user + :host host :port port + :path path :query query + :fragment fragment))) + (absoluteURI (substring scheme) ":" (or hier-part opaque-part)) + (hier-part ;(-- user host port path query) + (or net-path + (and `(-- nil nil nil) + abs-path)) + (or (and "?" (substring query)) + `(-- nil))) + (net-path "//" authority (or abs-path `(-- nil))) + (abs-path "/" path-segments) + (path-segments segment (list (* "/" segment)) `(s l -- (cons s l))) + (segment (substring (* pchar) (* ";" param))) + (param (* pchar)) + (pchar (or unreserved escaped [":@&=+$,"])) + (query (* uric)) + (fragment (* uric)) + (relativeURI (or net-path abs-path rel-path) (opt "?" query)) + (rel-path rel-segment (opt abs-path)) + (rel-segment (+ unreserved escaped [";@&=+$,"])) + (authority (or server reg-name)) + (server (or (and (or (and (substring userinfo) "@") + `(-- nil)) + hostport) + `(-- nil nil nil))) + (userinfo (* (or unreserved escaped [";:&=+$,"]))) + (hostport (substring host) (or (and ":" (substring port)) + `(-- nil))) + (host (or hostname ipv4address)) + (hostname (* domainlabel ".") toplabel (opt ".")) + (domainlabel alphanum + (opt (* (or alphanum "-") (if alphanum)) + alphanum)) + (toplabel alpha + (* (or alphanum "-") (if alphanum)) + alphanum) + (ipv4address (+ digit) "." (+ digit) "." (+ digit) "." 
(+ digit)) + (port (* digit)) + (scheme alpha (* (or alpha digit ["+-."]))) + (reg-name (or unreserved escaped ["$,;:@&=+"])) + (opaque-part uric-no-slash (* uric)) + (uric (or reserved unreserved escaped)) + (uric-no-slash (or unreserved escaped [";?:@&=+$,"])) + (reserved (set ";/?:@&=+$,")) + (unreserved (or alphanum mark)) + (escaped "%" hex hex) + (hex (or digit [A-F] [a-f])) + (mark (set "-_.!~*'()")) + (alphanum (or alpha digit)) + (alpha (or lowalpha upalpha)) + (lowalpha [a-z]) + (upalpha [A-Z]) + (digit [0-9]))) + +;; (peg-ex-uri)http://luser@www.foo.com:8080/bar/baz.html?x=1#foo +;; (peg-ex-uri)file:/bar/baz.html?foo=df#x + +;; Split STRING where SEPARATOR occurs. +(defun peg-ex-split (string separator) + (peg-parse-string ((s (list (* (* sep) elt))) + (elt (substring (+ (not sep) (any)))) + (sep (= separator))) + string)) + +;; (peg-ex-split "-abc-cd-" "-") + +;; Parse a lisp style Sexp. +;; [To keep the example short, ' and . are handled as ordinary symbol.] +(defun peg-ex-lisp () + (peg-parse + (sexp _ (or string list number symbol)) + (_ (* (or [" \n\t"] comment))) + (comment ";" (* (not (or "\n" (eob))) (any))) + (string "\"" (substring (* (not "\"") (any))) "\"") + (number (substring (opt (set "+-")) (+ digit)) + (if terminating) + `(string -- (string-to-number string))) + (symbol (substring (and symchar (* (not terminating) symchar))) + `(s -- (intern s))) + (symchar [a-z A-Z 0-9 "-;!#%&'*+,./:;<=>?@[]^_`{|}~"]) + (list "(" `(-- (cons nil nil)) `(hd -- hd hd) + (* sexp `(tl e -- (setcdr tl (list e)))) + _ ")" `(hd _tl -- (cdr hd))) + (digit [0-9]) + (terminating (or (set " \n\t();\"'") (eob))))) + +;; (peg-ex-lisp) + +;; We try to detect left recursion and report it as error. 
+(defun peg-ex-left-recursion () + (eval '(peg-parse (exp (or term + (and exp "+" exp))) + (term (or digit + (and term "*" term))) + (digit [0-9])) + t)) + +(defun peg-ex-infinite-loop () + (eval '(peg-parse (exp (* (or "x" + "y" + (action (foo)))))) + t)) + +;; Some efficiency problems: + +;; Find the last digit in a string. +;; Recursive definition with excessive stack usage. +(defun peg-ex-last-digit (string) + (peg-parse-string ((s (or (and (any) s) + (substring [0-9])))) + string)) + +;; (peg-ex-last-digit "ab0cd1ef2gh") +;; (peg-ex-last-digit (make-string 50 ?-)) +;; (peg-ex-last-digit (make-string 1000 ?-)) + +;; Find the last digit without recursion. Doesn't run out of stack, +;; but probably still too inefficient for large inputs. +(defun peg-ex-last-digit2 (string) + (peg-parse-string ((s `(-- nil) + (+ (* (not digit) (any)) + (substring digit) + `(_d1 d2 -- d2))) + (digit [0-9])) + string)) + +;; (peg-ex-last-digit2 "ab0cd1ef2gh") +;; (peg-ex-last-digit2 (concat (make-string 500000 ?-) "8a9b")) +;; (peg-ex-last-digit2 (make-string 500000 ?-)) +;; (peg-ex-last-digit2 (make-string 500000 ?5)) + +(provide 'peg-tests) +;;; peg-tests.el ends here -- 2.42.0 ^ permalink raw reply related [flat|nested] 100+ messages in thread
* Re: [PATCH] Re: Make peg.el a built-in library?
From: Adam Porter @ 2023-09-25 2:27 UTC
To: eric; +Cc: eliz, emacs-devel, michael_heerdegen, monnier, yantar92

Hi Eric,

Thanks for picking this up again. I recently used peg.el in another package of mine to rewrite and simplify the parsing of a simple query syntax, and I was reminded of how useful it is. I think Emacs would definitely benefit from having it in core.

--Adam
* Re: [PATCH] Re: Make peg.el a built-in library?
From: Alexander Adolf @ 2023-09-25 13:00 UTC
To: Adam Porter; +Cc: eric, eliz, emacs-devel, michael_heerdegen, monnier, yantar92

Hello,

I fully second Adam’s comments, and would thus also be in favour of including peg.el in core.

--alex

--
www.condition-alpha.com / @c_alpha
Sent from my iPhone; apologies for brevity and autocorrect weirdness.

> On 25. Sep 2023, at 04:28, Adam Porter <adam@alphapapa.net> wrote:
>
> Hi Eric,
>
> Thanks for picking this up again. I recently used peg.el in another package of mine to rewrite and simplify the parsing of a simple query syntax, and I was reminded of how useful it is. I think Emacs would definitely benefit from having it in core.
>
> --Adam
* Re: [PATCH] Re: Make peg.el a built-in library?
From: Ihor Radchenko @ 2024-03-24 14:19 UTC
To: Eric Abrahamsen
Cc: emacs-devel, Michael Heerdegen, Eli Zaretskii, Stefan Monnier

Eric Abrahamsen <eric@ericabrahamsen.net> writes:

> So here's a commit adding package, tests, and manual all at once. I've
> cc'd the people who indicated interest. The manual should be up to date
> with the code, I hope I've managed to follow all the pointers, and I
> believe I've done a better job of explaining how to use the various
> entry points of the library.

It has been a while since the last message in this thread.
I am wondering if there is anything wrong with the latest version of the
patch. Or maybe something else should be done to move forward towards
merging peg.el?

--
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>
* Re: [PATCH] Re: Make peg.el a built-in library?
From: Eli Zaretskii @ 2024-03-24 15:32 UTC
To: Ihor Radchenko; +Cc: eric, emacs-devel, michael_heerdegen, monnier

> From: Ihor Radchenko <yantar92@posteo.net>
> Cc: emacs-devel@gnu.org, Michael Heerdegen <michael_heerdegen@web.de>, Eli
> Zaretskii <eliz@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Sun, 24 Mar 2024 14:19:58 +0000
>
> Eric Abrahamsen <eric@ericabrahamsen.net> writes:
>
> > So here's a commit adding package, tests, and manual all at once. I've
> > cc'd the people who indicated interest. The manual should be up to date
> > with the code, I hope I've managed to follow all the pointers, and I
> > believe I've done a better job of explaining how to use the various
> > entry points of the library.
>
> It has been a while since the last message in this thread.
> I am wondering if there is anything wrong with the latest version of the
> patch. Or maybe something else should be done to move forward towards
> merging peg.el?

If the patch is still good to go, the only thing that's missing,
AFAICT, is a NEWS entry.
* Re: [PATCH] Re: Make peg.el a built-in library? 2024-03-24 15:32 ` Eli Zaretskii @ 2024-03-25 1:45 ` Eric Abrahamsen 0 siblings, 0 replies; 100+ messages in thread From: Eric Abrahamsen @ 2024-03-25 1:45 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Ihor Radchenko, emacs-devel, michael_heerdegen, monnier [-- Attachment #1: Type: text/plain, Size: 1460 bytes --] On 03/24/24 17:32 PM, Eli Zaretskii wrote: >> From: Ihor Radchenko <yantar92@posteo.net> >> Cc: emacs-devel@gnu.org, Michael Heerdegen <michael_heerdegen@web.de>, Eli >> Zaretskii <eliz@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> >> Date: Sun, 24 Mar 2024 14:19:58 +0000 >> >> Eric Abrahamsen <eric@ericabrahamsen.net> writes: >> >> > So here's a commit adding package, tests, and manual all at once. I've >> > cc'd the people who indicated interest. The manual should be up to date >> > with the code, I hope I've managed to follow all the pointers, and I >> > believe I've done a better job of explaining how to use the various >> > entry points of the library. >> >> It has been a while since the last message in this thread. >> I am wondering if there is anything wrong with the latest version of the >> patch. Or maybe something else should be done to move forward towards >> merging peg.el? > > If the patch is still good to go, the only thing that's missing, > AFAICT, is a NEWS entry. Huh, I'm not sure what I was expecting to happen after my last message. Anyway, thanks for the nudge! The code itself reached a stable state a while ago; the last feedback on the patch was from Eli regarding improvements to the manual, all of which I incorporated. Just so we're all on the same page I'm reattaching the last version of the patch. I'm assuming all this is okay, and in a little bit I'll add a NEWS entry and push. Thanks to all! 
Eric [-- Attachment #2: 0001-Add-peg.el-as-a-built-in-library.patch --] [-- Type: text/x-patch, Size: 65704 bytes --] From a8d1b3ad3162e92b4f8c8dd52690d9c1f3333661 Mon Sep 17 00:00:00 2001 From: Eric Abrahamsen <eric@ericabrahamsen.net> Date: Mon, 5 Dec 2022 21:59:03 -0800 Subject: [PATCH] Add peg.el as a built-in library * lisp/progmodes/peg.el: New file, taken from ELPA package. * test/lisp/peg-tests.el: Package tests. * doc/lispref/peg.texi: Documentation. --- doc/lispref/Makefile.in | 1 + doc/lispref/elisp.texi | 2 + doc/lispref/peg.texi | 351 +++++++++++++++ lisp/progmodes/peg.el | 944 ++++++++++++++++++++++++++++++++++++++++ test/lisp/peg-tests.el | 367 ++++++++++++++++ 5 files changed, 1665 insertions(+) create mode 100644 doc/lispref/peg.texi create mode 100644 lisp/progmodes/peg.el create mode 100644 test/lisp/peg-tests.el diff --git a/doc/lispref/Makefile.in b/doc/lispref/Makefile.in index 325f23a3c0f..8ac1242996d 100644 --- a/doc/lispref/Makefile.in +++ b/doc/lispref/Makefile.in @@ -112,6 +112,7 @@ srcs = $(srcdir)/os.texi \ $(srcdir)/package.texi \ $(srcdir)/parsing.texi \ + $(srcdir)/peg.texi \ $(srcdir)/positions.texi \ $(srcdir)/processes.texi \ $(srcdir)/records.texi \ diff --git a/doc/lispref/elisp.texi b/doc/lispref/elisp.texi index 72441c8d442..e12f61fc7eb 100644 --- a/doc/lispref/elisp.texi +++ b/doc/lispref/elisp.texi @@ -222,6 +222,7 @@ Top * Non-ASCII Characters:: Non-ASCII text in buffers and strings. * Searching and Matching:: Searching buffers for strings or regexps. * Syntax Tables:: The syntax table controls word and list parsing. +* Parsing Expression Grammars:: Parsing structured buffer text. * Parsing Program Source:: Generate syntax tree for program sources. * Abbrevs:: How Abbrev mode works, and its data structures. 
@@ -1719,6 +1720,7 @@ Top @include searching.texi @include syntax.texi +@include peg.texi @include parsing.texi @include abbrevs.texi @include threads.texi diff --git a/doc/lispref/peg.texi b/doc/lispref/peg.texi new file mode 100644 index 00000000000..64950f148b1 --- /dev/null +++ b/doc/lispref/peg.texi @@ -0,0 +1,351 @@ +@c -*-texinfo-*- +@c This is part of the GNU Emacs Lisp Reference Manual. +@c Copyright (C) 1990--1995, 1998--1999, 2001--2023 Free Software +@c Foundation, Inc. +@c See the file elisp.texi for copying conditions. +@node Parsing Expression Grammars +@chapter Parsing Expression Grammars +@cindex text parsing +@cindex parsing expression grammar + + Emacs Lisp provides several tools for parsing and matching text, +from regular expressions (@pxref{Regular Expressions}) to full +@acronym{LL} grammar parsers (@pxref{Top,, Bovine parser +development,bovine}). @dfn{Parsing Expression Grammars} +(@acronym{PEG}) are another approach to text parsing that offer more +structure and composability than regular expressions, but less +complexity than context-free grammars. + +A @acronym{PEG} parser is defined as a list of named rules, each of +which matches text patterns, and/or contains references to other +rules. Parsing is initiated with the function @code{peg-run} or the +macro @code{peg-parse} (see below), and parses text after point in the +current buffer, using a given set of rules. + +@cindex parsing expression +The definition of each rule is referred to as a @dfn{parsing +expression} (@acronym{PEX}), and can consist of a literal string, a +regexp-like character range or set, a peg-specific construct +resembling an elisp function call, a reference to another rule, or a +combination of any of these. A grammar is expressed as a tree of +rules in which one rule is typically treated as a ``root'' or +``entry-point'' rule. 
For instance: + +@example +@group +((number sign digit (* digit)) + (sign (or "+" "-" "")) + (digit [0-9])) +@end group +@end example + +Once defined, grammars can be used to parse text after point in the +current buffer, in the following ways: + +@defmac peg-parse &rest pexs +Match @var{pexs} at point. If @var{pexs} is a list of PEG rules, the +first rule is considered the ``entry-point'': +@end defmac + +@example +@group +(peg-parse + ((number sign digit (* digit)) + (sign (or "+" "-" "")) + (digit [0-9]))) +@end group +@end example + +This macro represents the simplest use of the @acronym{PEG} library, +but also the least flexible, as the rules must be written directly +into the source code. A more flexible approach involves use of three +macros in conjunction: @code{with-peg-rules}, a @code{let}-like +construct that makes a set of rules available within the macro body; +@code{peg-run}, which initiates parsing given a single rule; and +@code{peg}, which is used to wrap the entry-point rule name. In fact, +a call to @code{peg-parse} expands to just this set of calls. The +above example could be written as: + +@example +@group +(with-peg-rules + ((number sign digit (* digit)) + (sign (or "+" "-" "")) + (digit [0-9])) + (peg-run (peg number))) +@end group +@end example + +This allows more explicit control over the ``entry-point'' of parsing, +and allows the combination of rules from different sources. + +Individual rules can also be defined using a more @code{defun}-like +syntax, using the macro @code{define-peg-rule}: + +@example +(define-peg-rule digit () + [0-9]) +@end example + +This also allows for rules that accept an argument (supplied by the +@code{funcall} PEG rule). + +Another possibility is to define a named set of rules with +@code{define-peg-ruleset}: + +@example +(define-peg-ruleset number-grammar + '((number sign digit (* digit)) + digit ;; A reference to the definition above. 
+ (sign (or "+" "-" "")))) +@end example + +Rules and rulesets defined this way can be referred to by name in +later calls to @code{peg-run} or @code{with-peg-rules}: + +@example +(with-peg-rules number-grammar + (peg-run (peg number))) +@end example + +By default, calls to @code{peg-run} or @code{peg-parse} produce no +output: parsing simply moves point. In order to return or otherwise +act upon parsed strings, rules can include @dfn{actions}, see +@ref{Parsing Actions}. + +@menu +* PEX Definitions:: The syntax of PEX rules. +* Parsing Actions:: Running actions upon successful parsing. +* Writing PEG Rules:: Tips for writing parsing rules. +@end menu + +@node PEX Definitions +@section PEX Definitions + +Parsing expressions can be defined using the following syntax: + +@table @code +@item (and E1 E2 ...) +A sequence of @acronym{PEX}s that must all be matched. The @code{and} form is +optional and implicit. + +@item (or E1 E2 ...) +Prioritized choices, meaning that, as in Elisp, the choices are tried +in order, and the first successful match is used. Note that this is +distinct from context-free grammars, in which selection between +multiple matches is indeterminate. + +@item (any) +Matches any single character, as the regexp ``.''. + +@item @var{string} +A literal string. + +@item (char @var{C}) +A single character @var{C}, as an Elisp character literal. + +@item (* @var{E}) +Zero or more instances of expression @var{E}, as the regexp @samp{*}. +Matching is always ``greedy''. + +@item (+ @var{E}) +One or more instances of expression @var{E}, as the regexp @samp{+}. +Matching is always ``greedy''. + +@item (opt @var{E}) +Zero or one instance of expression @var{E}, as the regexp @samp{?}. + +@item SYMBOL +A symbol representing a previously-defined PEG rule. + +@item (range CH1 CH2) +The character range between CH1 and CH2, as the regexp @samp{[CH1-CH2]}. 
+ +@item [CH1-CH2 "+*" ?x] +A character set, which can include ranges, character literals, or +strings of characters. + +@item [ascii cntrl] +A list of named character classes. + +@item (syntax-class @var{NAME}) +A single syntax class. + +@item (funcall E ARGS...) +Call @acronym{PEX} E (previously defined with @code{define-peg-rule}) +with arguments @var{ARGS}. + +@item (null) +The empty string. + +@end table + +The following expressions are used as anchors or tests -- they do not +move point, but return a boolean value which can be used to constrain +matches as a way of controlling the parsing process (@pxref{Writing +PEG Rules}). + +@table @code +@item (bob) +Beginning of buffer. + +@item (eob) +End of buffer. + +@item (bol) +Beginning of line. + +@item (eol) +End of line. + +@item (bow) +Beginning of word. + +@item (eow) +End of word. + +@item (bos) +Beginning of symbol. + +@item (eos) +End of symbol. + +@item (if E) +Returns non-@code{nil} if parsing @acronym{PEX} E from point succeeds (point +is not moved). + +@item (not E) +Returns non-@code{nil} if parsing @acronym{PEX} E from point fails (point +is not moved). + +@item (guard EXP) +Treats the value of the Lisp expression EXP as a boolean. + +@end table + +@vindex peg-char-classes +Character class matching can use the same named character classes as +in regular expressions (@pxref{Top,, Character Classes,elisp}) + +@node Parsing Actions +@section Parsing Actions + +@cindex parsing actions +@cindex parsing stack +By default the process of parsing simply moves point in the current +buffer, ultimately returning @code{t} if the parsing succeeds, and +@code{nil} if it doesn't. It's also possible to define ``actions'' +that can run arbitrary Elisp at certain points in the parsed text. +These actions can optionally affect something called the @dfn{parsing +stack}, which is a list of values returned by the parsing process. 
+These actions only run (and only return values) if the parsing process +ultimately succeeds; if it fails the action code is not run at all. + +Actions can be added anywhere in the definition of a rule. They are +distinguished from parsing expressions by an initial backquote +(@samp{`}), followed by a parenthetical form that must contain a pair +of hyphens (@samp{--}) somewhere within it. Symbols to the left of +the hyphens are bound to values popped from the stack (they are +somewhat analogous to the argument list of a lambda form). Values +produced by code to the right are pushed to the stack (analogous to +the return value of the lambda). For instance, the previous grammar +can be augmented with actions to return the parsed number as an actual +integer: + +@example +(with-peg-rules ((number sign digit (* digit + `(a b -- (+ (* a 10) b))) + `(sign val -- (* sign val))) + (sign (or (and "+" `(-- 1)) + (and "-" `(-- -1)) + (and "" `(-- 1)))) + (digit [0-9] `(-- (- (char-before) ?0)))) + (peg-run (peg number))) +@end example + +There must be values on the stack before they can be popped and +returned -- if there aren't enough stack values to bind to an action's +left-hand terms, they will be bound to @code{nil}. An action with +only right-hand terms will push values to the stack; an action with +only left-hand terms will consume (and discard) values from the stack. +At the end of parsing, stack values are returned as a flat list. + +To return the string matched by a @acronym{PEX} (instead of simply +moving point over it), a rule like this can be used: + +@example +(one-word + `(-- (point)) + (+ [word]) + `(start -- (buffer-substring start (point)))) +@end example + +The first action pushes the initial value of point to the stack. The +intervening @acronym{PEX} moves point over the next word. 
The second +action pops the previous value from the stack (binding it to the +variable @code{start}), and uses that value to extract a substring +from the buffer and push it to the stack. This pattern is so common +that @acronym{PEG} provides a shorthand function that does exactly the +above, along with a few other shorthands for common scenarios: + +@table @code +@item (substring @var{E}) +Match @acronym{PEX} @var{E} and push the matched string to the stack. + +@item (region @var{E}) +Match @var{E} and push the start and end positions of the matched +region to the stack. + +@item (replace @var{E} @var{replacement}) +Match @var{E} and replace the matched region with the string @var{replacement}. + +@item (list @var{E}) +Match @var{E}, collect all values produced by @var{E} (and its +sub-expressions) into a list, and push that list to the stack. Stack +values are typically returned as a flat list; this is a way of +``grouping'' values together. +@end table + +@node Writing PEG Rules +@section Writing PEG Rules + +Something to be aware of when writing PEG rules is that they are +greedy. Rules which can consume a variable amount of text will always +consume the maximum amount possible, even if that causes a rule that +might otherwise have matched to fail later on -- a repetition never +gives back text it has already consumed. For instance, this rule will never succeed: + +@example +(forest (+ "tree" (* [blank])) "tree" (eol)) +@end example + +The @acronym{PEX} @code{(+ "tree" (* [blank]))} will consume all +repetitions of the word ``tree'', leaving none to match the final +@code{"tree"}. + +In these situations, the desired result can be obtained by using +predicates and guards -- namely the @code{not}, @code{if} and +@code{guard} expressions -- to constrain behavior. For instance: + +@example +(forest (+ "tree" (* [blank]) (not (eol))) "tree" (eol)) +@end example + +The @code{if} and @code{not} operators accept a parsing expression and +interpret it as a boolean, without moving point. 
The contents of a +@code{guard} operator are evaluated as regular Lisp (not a +@acronym{PEX}) and should return a boolean value. A @code{nil} value +causes the match to fail. + +Another potentially unexpected behavior is that parsing will move +point as far as possible, even if the parsing ultimately fails. This +rule: + +@example +(end-game "game" (eob)) +@end example + +when run in a buffer containing the text ``game over'' after point, +will move point to just after ``game'' then halt parsing, returning +@code{nil}. Successful parsing will always return @code{t}, or the +contents of the parsing stack. diff --git a/lisp/progmodes/peg.el b/lisp/progmodes/peg.el new file mode 100644 index 00000000000..2eb4a7384d0 --- /dev/null +++ b/lisp/progmodes/peg.el @@ -0,0 +1,944 @@ +;;; peg.el --- Parsing Expression Grammars in Emacs Lisp -*- lexical-binding:t -*- + +;; Copyright (C) 2008-2023 Free Software Foundation, Inc. +;; +;; Author: Helmut Eller <eller.helmut@gmail.com> +;; Maintainer: Stefan Monnier <monnier@iro.umontreal.ca> +;; Version: 1.0.1 +;; +;; This program is free software: you can redistribute it and/or modify +;; it under the terms of the GNU General Public License as published by +;; the Free Software Foundation, either version 3 of the License, or +;; (at your option) any later version. +;; +;; This program is distributed in the hope that it will be useful, +;; but WITHOUT ANY WARRANTY; without even the implied warranty of +;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +;; GNU General Public License for more details. +;; +;; You should have received a copy of the GNU General Public License +;; along with this program. If not, see <https://www.gnu.org/licenses/>. +;; +;;; Commentary: +;; +;; This package implements Parsing Expression Grammars for Emacs Lisp. 
+ +;; Parsing Expression Grammars (PEG) are a formalism in the spirit of +;; Context Free Grammars (CFG) with some simplifications which make +;; the implementation of PEGs as recursive descent parsers particularly +;; simple and easy to understand [Ford, Baker]. +;; PEGs are more expressive than regexps and potentially easier to use. +;; +;; This file implements the macros `define-peg-rule', `with-peg-rules', and +;; `peg-parse' which parses the current buffer according to a PEG. +;; E.g. we can match integers with: +;; +;; (with-peg-rules +;; ((number sign digit (* digit)) +;; (sign (or "+" "-" "")) +;; (digit [0-9])) +;; (peg-run (peg number))) +;; or +;; (define-peg-rule digit () +;; [0-9]) +;; (peg-parse (number sign digit (* digit)) +;; (sign (or "+" "-" ""))) +;; +;; In contrast to regexps, PEGs allow us to define recursive "rules". +;; A "grammar" is a set of rules. A rule is written as (NAME PEX...) +;; E.g. (sign (or "+" "-" "")) is a rule with the name "sign". +;; The syntax for PEX (Parsing Expression) is as follows: +;; +;; Description Lisp Traditional, as in Ford's paper +;; =========== ==== =========== +;; Sequence (and E1 E2) e1 e2 +;; Prioritized Choice (or E1 E2) e1 / e2 +;; Not-predicate (not E) !e +;; And-predicate (if E) &e +;; Any character (any) . +;; Literal string "abc" "abc" +;; Character C (char C) 'c' +;; Zero-or-more (* E) e* +;; One-or-more (+ E) e+ +;; Optional (opt E) e? +;; Non-terminal SYMBOL A +;; Character range (range A B) [a-b] +;; Character set [a-b "+*" ?x] [a-b+*x] ;Note: it's a vector +;; Character classes [ascii cntrl] +;; Boolean-guard (guard EXP) +;; Syntax-Class (syntax-class NAME) +;; Local definitions (with RULES PEX...) +;; Indirect call (funcall EXP ARGS...) 
+;; and +;; Empty-string (null) ε +;; Beginning-of-Buffer (bob) +;; End-of-Buffer (eob) +;; Beginning-of-Line (bol) +;; End-of-Line (eol) +;; Beginning-of-Word (bow) +;; End-of-Word (eow) +;; Beginning-of-Symbol (bos) +;; End-of-Symbol (eos) +;; +;; Rules can refer to other rules, and a grammar is often structured +;; as a tree, with a root rule referring to one or more "branch +;; rules", all the way down to the "leaf rules" that deal with actual +;; buffer text. Rules can be recursive or mutually referential, +;; though care must be taken not to create infinite loops. +;; +;;;; Named rulesets: +;; +;; You can define a set of rules for later use with: +;; +;; (define-peg-ruleset myrules +;; (sign () (or "+" "-" "")) +;; (digit () [0-9]) +;; (nat () digit (* digit)) +;; (int () sign digit (* digit)) +;; (float () int "." nat)) +;; +;; and later refer to it: +;; +;; (with-peg-rules +;; (myrules +;; (complex float "+i" float)) +;; ... (peg-parse nat "," nat "," complex) ...) +;; +;;;; Parsing actions: +;; +;; PEXs also support parsing actions, i.e. Lisp snippets which are +;; executed when a pex matches. This can be used to construct syntax +;; trees or for similar tasks. The most basic form of action is +;; written as: +;; +;; (action FORM) ; evaluate FORM for its side-effects +;; +;; Actions don't consume input, but are executed at the point of +;; match. Another kind of action is called a "stack action", and +;; looks like this: +;; +;; `(VAR... -- FORM...) ; stack action +;; +;; A stack action takes VARs from the "value stack" and pushes the +;; results of evaluating FORMs to that stack. + +;; The value stack is created during the course of parsing. Certain +;; operators (see below) that match buffer text can push values onto +;; this stack. "Upstream" rules can then draw values from the stack, +;; and optionally push new ones back. 
For instance, consider this +;; very simple grammar: +;; +;; (with-peg-rules +;; ((query (+ term) (eol)) +;; (term key ":" value (opt (+ [space])) +;; `(k v -- (cons (intern k) v))) +;; (key (substring (and (not ":") (+ [word])))) +;; (value (or string-value number-value)) +;; (string-value (substring (+ [alpha]))) +;; (number-value (substring (+ [digit])) +;; `(val -- (string-to-number val)))) +;; (peg-run (peg query))) +;; +;; This invocation of `peg-run' would parse this buffer text: +;; +;; name:Jane age:30 +;; +;; And return this Elisp sexp: +;; +;; ((age . 30) (name . "Jane")) +;; +;; Note that, in complex grammars, some care must be taken to make +;; sure that the number and type of values drawn from the stack always +;; match those pushed. In the example above, both `string-value' and +;; `number-value' push a single value to the stack. Since the `value' +;; rule only includes these two sub-rules, any upstream rule that +;; makes use of `value' can be confident it will always and only push +;; a single value to the stack. +;; +;; Stack action forms are in a sense analogous to lambda forms: the +;; symbols before the "--" are the equivalent of lambda arguments, +;; while the forms after the "--" are return values. The difference +;; being that a lambda form can only return a single value, while a +;; stack action can push multiple values onto the stack. It's also +;; perfectly valid to use `(-- FORM...)' or `(VAR... --)': the former +;; pushes values to the stack without consuming any, and the latter +;; pops values from the stack and discards them. +;; +;;;; Derived Operators: +;; +;; The following operators are implemented as combinations of +;; primitive expressions: +;; +;; (substring E) ; Match E and push the substring for the matched region. +;; (region E) ; Match E and push the start and end positions. +;; (replace E RPL); Match E and replace the matched region with RPL. +;; (list E) ; Match E and push a list of the items that E produced. 
+;; +;; See `peg-ex-parse-int' in `peg-tests.el' for further examples. +;; +;; Regexp equivalents: +;; +;; Here are some examples for regexps and how those could be written as pex. +;; [Most are taken from rx.el] +;; +;; "^[a-z]*" +;; (and (bol) (* [a-z])) +;; +;; "\n[^ \t]" +;; (and "\n" (not [" \t"]) (any)) +;; +;; "\\*\\*\\* EOOH \\*\\*\\*\n" +;; "*** EOOH ***\n" +;; +;; "\\<\\(catch\\|finally\\)\\>[^_]" +;; (and (bow) (or "catch" "finally") (eow) (not "_") (any)) +;; +;; "[ \t\n]*:\\([^:]+\\|$\\)" +;; (and (* [" \t\n"]) ":" (or (+ (not ":") (any)) (eol))) +;; +;; "^content-transfer-encoding:\\(\n?[\t ]\\)*quoted-printable\\(\n?[\t ]\\)*" +;; (and (bol) +;; "content-transfer-encoding:" +;; (* (opt "\n") ["\t "]) +;; "quoted-printable" +;; (* (opt "\n") ["\t "])) +;; +;; "\\$[I]d: [^ ]+ \\([^ ]+\\) " +;; (and "$Id: " (+ (not " ") (any)) " " (+ (not " ") (any)) " ") +;; +;; "^;;\\s-*\n\\|^\n" +;; (or (and (bol) ";;" (* (syntax-class whitespace)) "\n") +;; (and (bol) "\n")) +;; +;; "\\\\\\\\\\[\\w+" +;; (and "\\\\[" (+ (syntax-class word))) +;; +;; See ";;; Examples" in `peg-tests.el' for other examples. +;; +;;;; Rule argument and indirect calls: +;; +;; Rules can take arguments and those arguments can themselves be PEGs. +;; For example: +;; +;; (define-peg-rule 2-or-more (peg) +;; (funcall peg) +;; (funcall peg) +;; (* (funcall peg))) +;; +;; ... (peg-parse +;; ... +;; (2-or-more (peg foo)) +;; ... +;; (2-or-more (peg bar)) +;; ...) +;; +;;;; References: +;; +;; [Ford] Bryan Ford. Parsing Expression Grammars: a Recognition-Based +;; Syntactic Foundation. In POPL'04: Proceedings of the 31st ACM +;; SIGPLAN-SIGACT symposium on Principles of Programming Languages, +;; pages 111-122, New York, NY, USA, 2004. ACM Press. +;; http://pdos.csail.mit.edu/~baford/packrat/ +;; +;; [Baker] Baker, Henry G. "Pragmatic Parsing in Common Lisp". ACM Lisp +;; Pointers 4(2), April--June 1991, pp. 3--15. 
+;; http://home.pipeline.com/~hbaker1/Prag-Parse.html +;; +;; Roman Redziejowski does good PEG related research +;; http://www.romanredz.se/pubs.htm + +;;;; Todo: + +;; - Fix the exponential blowup in `peg-translate-exp'. +;; - Add a proper debug-spec for PEXs. + +;;; News: + +;; Since 1.0.1: +;; - Use OClosures to represent PEG rules when available, and let cl-print +;; display their source code. +;; - New PEX form (with RULES PEX...). +;; - Named rulesets. +;; - You can pass arguments to rules. +;; - New `funcall' rule to call rules indirectly (e.g. a peg you received +;; as argument). + +;; Version 1.0: +;; - New official entry points `peg` and `peg-run`. + +;;; Code: + +(eval-when-compile (require 'cl-lib)) + +(defvar peg--actions nil + "Actions collected along the current parse. +Used at runtime for backtracking. It's a list ((POS . THUNK)...). +Each THUNK is executed at the corresponding POS. Thunks are +executed in a postprocessing step, not during parsing.") + +(defvar peg--errors nil + "Data keeping track of the rightmost parse failure location. +It's a pair (POSITION . EXPS ...). POSITION is the buffer position and +EXPS is a list of rules/expressions that failed.") + +;;;; Main entry points + +(defmacro peg--when-fboundp (f &rest body) + (declare (indent 1) (debug (sexp body))) + (when (fboundp f) + (macroexp-progn body))) + +(peg--when-fboundp oclosure-define + (oclosure-define peg-function + "Parsing function built from PEG rule." + pexs) + + (cl-defmethod cl-print-object ((peg peg-function) stream) + (princ "#f<peg " stream) + (let ((args (help-function-arglist peg 'preserve-names))) + (if args + (prin1 args stream) + (princ "()" stream))) + (princ " " stream) + (prin1 (peg-function--pexs peg) stream) + (princ ">" stream))) + +(defmacro peg--lambda (pexs args &rest body) + (declare (indent 2) + (debug (&define form lambda-list def-body))) + (if (fboundp 'oclosure-lambda) + `(oclosure-lambda (peg-function (pexs ,pexs)) ,args . ,body) + `(lambda ,args . 
,body))) + +;; Sometimes (with-peg-rules ... (peg-run (peg ...))) is too +;; longwinded for the task at hand, so `peg-parse' comes in handy. +(defmacro peg-parse (&rest pexs) + "Match PEXS at point. +PEXS is a sequence of PEG expressions, implicitly combined with `and'. +Returns STACK if the match succeeds and signals an error on failure, +moving point along the way. +PEXS can also be a list of PEG rules, in which case the first rule is used." + (if (and (consp (car pexs)) + (symbolp (caar pexs)) + (not (ignore-errors (peg-normalize (car pexs))))) + ;; `pexs' is a list of rules: use the first rule as entry point. + `(with-peg-rules ,pexs (peg-run (peg ,(caar pexs)) #'peg-signal-failure)) + `(peg-run (peg ,@pexs) #'peg-signal-failure))) + +(defmacro peg (&rest pexs) + "Return a PEG-matcher that matches PEXS." + (pcase (peg-normalize `(and . ,pexs)) + (`(call ,name) `#',(peg--rule-id name)) ;Optimize this case by η-reduction! + (exp `(peg--lambda ',pexs () ,(peg-translate-exp exp))))) + +;; There are several "infos we want to return" when parsing a given PEX: +;; 1- We want to return the success/failure of the parse. +;; 2- We want to return the data of the successful parse (the stack). +;; 3- We want to return the diagnostic of the failures. +;; 4- We want to perform the actions (upon parse success)! +;; `peg-parse' used an error signal to encode the (1) boolean, which +;; lets it return all the info conveniently but the error signal was sometimes +;; inconvenient. Other times one wants to just know (1) maybe without even +;; performing (4). +;; `peg-run' lets you choose all that, and by default gives you +;; (1) as a simple boolean, while also doing (2), and (4). + +(defun peg-run (peg-matcher &optional failure-function success-function) + "Parse with PEG-MATCHER at point and run the success/failure function. 
+If a match was found, move to the end of the match and call SUCCESS-FUNCTION +with one argument: a function which will perform all the actions collected +during the parse and then return the resulting stack (or t if empty). +If no match was found, move to the (rightmost) point of parse failure and call +FAILURE-FUNCTION with one argument, which is a list of PEG expressions that +failed at this point. +SUCCESS-FUNCTION defaults to `funcall' and FAILURE-FUNCTION +defaults to `ignore'." + (let ((peg--actions '()) (peg--errors '(-1))) + (if (funcall peg-matcher) + ;; Found a parse: run the actions collected along the way. + (funcall (or success-function #'funcall) + (lambda () + (save-excursion (peg-postprocess peg--actions)))) + (goto-char (car peg--errors)) + (when failure-function + (funcall failure-function (peg-merge-errors (cdr peg--errors))))))) + +(defmacro define-peg-rule (name args &rest pexs) + "Define PEG rule NAME as equivalent to PEXS. +The PEG expressions in PEXS are implicitly combined with the +sequencing `and' operator of PEG grammars." + (declare (indent 1)) + (let ((inline nil)) + (while (keywordp (car pexs)) + (pcase (pop pexs) + (:inline (setq inline (car pexs)))) + (setq pexs (cdr pexs))) + (let ((id (peg--rule-id name)) + (exp (peg-normalize `(and . ,pexs)))) + `(progn + (defalias ',id + (peg--lambda ',pexs ,args + ,(if inline + ;; Short-circuit to peg--translate in order to skip + ;; the extra failure-recording of `peg-translate-exp'. + ;; It also skips the cycle detection of + ;; `peg--translate-rule-body', which is not the main + ;; purpose but we can live with it. + (apply #'peg--translate exp) + (peg--translate-rule-body name exp)))) + (eval-and-compile + ;; FIXME: We shouldn't need this any more since the info is now + ;; stored in the function, but sadly we need to find a name's EXP + ;; during compilation (i.e. before the `defalias' is executed) + ;; as part of cycle-detection! 
+ (put ',id 'peg--rule-definition ',exp) + ,@(when inline + ;; FIXME: Copied from `defsubst'. + `(;; Never native-compile defsubsts as we need the byte + ;; definition in `byte-compile-unfold-bcf' to perform the + ;; inlining (Bug#42664, Bug#43280, Bug#44209). + ,(byte-run--set-speed id nil -1) + (put ',id 'byte-optimizer #'byte-compile-inline-expand)))))))) + +(defmacro define-peg-ruleset (name &rest rules) + "Define a set of PEG rules for later use, e.g., in `with-peg-rules'." + (declare (indent 1)) + (let ((defs ()) + (aliases ())) + (dolist (rule rules) + (let* ((rname (car rule)) + (full-rname (format "%s %s" name rname))) + (push `(define-peg-rule ,full-rname . ,(cdr rule)) defs) + (push `(,(peg--rule-id rname) #',(peg--rule-id full-rname)) aliases))) + `(cl-flet ,aliases + ,@defs + (eval-and-compile (put ',name 'peg--rules ',aliases))))) + +(defmacro with-peg-rules (rules &rest body) + "Make PEG rules RULES available within the scope of BODY. +RULES is a list of rules of the form (NAME . PEXS), where PEXS is a sequence +of PEG expressions, implicitly combined with `and'. +RULES can also contain symbols in which case these must name +rulesets defined previously with `define-peg-ruleset'." + (declare (indent 1) (debug (sexp form))) ;FIXME: `sexp' is not good enough! + (let* ((rulesets nil) + (rules + ;; First, macroexpand the rules. + (delq nil + (mapcar (lambda (rule) + (if (symbolp rule) + (progn (push rule rulesets) nil) + (cons (car rule) (peg-normalize `(and . ,(cdr rule)))))) + rules))) + (ctx (assq :peg-rules macroexpand-all-environment))) + (macroexpand-all + `(cl-labels + ,(mapcar (lambda (rule) + ;; FIXME: Use `peg--lambda' as well. + `(,(peg--rule-id (car rule)) + () + ,(peg--translate-rule-body (car rule) (cdr rule)))) + rules) + ,@body) + `((:peg-rules ,@(append rules (cdr ctx))) + ,@macroexpand-all-environment)))) + +;;;;; Old entry points + +(defmacro peg-parse-exp (exp) + "Match the parsing expression EXP at point." 
+ (declare (obsolete peg-parse "peg-0.9")) + `(peg-run (peg ,exp))) + +;;;; The actual implementation + +(defun peg--lookup-rule (name) + (or (cdr (assq name (cdr (assq :peg-rules macroexpand-all-environment)))) + ;; With `peg-function' objects, we can recover the PEG from which it was + ;; defined, but this info is not yet available at compile-time. :-( + ;;(let ((id (peg--rule-id name))) + ;; (peg-function--pexs (symbol-function id))) + (get (peg--rule-id name) 'peg--rule-definition))) + +(defun peg--rule-id (name) + (intern (format "peg-rule %s" name))) + +(define-error 'peg-search-failed "Parse error at %d (expecting %S)") + +(defun peg-signal-failure (failures) + (signal 'peg-search-failed (list (point) failures))) + +(defun peg-parse-at-point (peg-matcher) + "Parse text at point according to the PEG rule PEG-MATCHER." + (declare (obsolete peg-run "peg-1.0")) + (peg-run peg-matcher + #'peg-signal-failure + (lambda (f) (let ((r (funcall f))) (if (listp r) r))))) + +;; Internally we use a regularized syntax, e.g. we only have binary OR +;; nodes. Regularized nodes are lists of the form (OP ARGS...). +(cl-defgeneric peg-normalize (exp) + "Return a \"normalized\" form of EXP." + (error "Invalid parsing expression: %S" exp)) + +(cl-defmethod peg-normalize ((exp string)) + (let ((len (length exp))) + (cond ((zerop len) '(guard t)) + ((= len 1) `(char ,(aref exp 0))) + (t `(str ,exp))))) + +(cl-defmethod peg-normalize ((exp symbol)) + ;; (peg--lookup-rule exp) + `(call ,exp)) + +(cl-defmethod peg-normalize ((exp vector)) + (peg-normalize `(set . 
,(append exp '())))) + +(cl-defmethod peg-normalize ((exp cons)) + (apply #'peg--macroexpand exp)) + +(defconst peg-leaf-types '(any call action char range str set + guard syntax-class = funcall)) + +(cl-defgeneric peg--macroexpand (head &rest args) + (cond + ((memq head peg-leaf-types) (cons head args)) + (t `(call ,head ,@args)))) + +(cl-defmethod peg--macroexpand ((_ (eql or)) &rest args) + (cond ((null args) '(guard nil)) + ((null (cdr args)) (peg-normalize (car args))) + (t `(or ,(peg-normalize (car args)) + ,(peg-normalize `(or . ,(cdr args))))))) + +(cl-defmethod peg--macroexpand ((_ (eql and)) &rest args) + (cond ((null args) '(guard t)) + ((null (cdr args)) (peg-normalize (car args))) + (t `(and ,(peg-normalize (car args)) + ,(peg-normalize `(and . ,(cdr args))))))) + +(cl-defmethod peg--macroexpand ((_ (eql *)) &rest args) + `(* ,(peg-normalize `(and . ,args)))) + +;; FIXME: this duplicates code; could use some loop to avoid that +(cl-defmethod peg--macroexpand ((_ (eql +)) &rest args) + (let ((e (peg-normalize `(and . ,args)))) + `(and ,e (* ,e)))) + +(cl-defmethod peg--macroexpand ((_ (eql opt)) &rest args) + (let ((e (peg-normalize `(and . ,args)))) + `(or ,e (guard t)))) + +(cl-defmethod peg--macroexpand ((_ (eql if)) &rest args) + `(if ,(peg-normalize `(and . ,args)))) + +(cl-defmethod peg--macroexpand ((_ (eql not)) &rest args) + `(not ,(peg-normalize `(and . 
,args)))) + +(cl-defmethod peg--macroexpand ((_ (eql \`)) form) + (peg-normalize `(stack-action ,form))) + +(cl-defmethod peg--macroexpand ((_ (eql stack-action)) form) + (unless (member '-- form) + (error "Malformed stack action: %S" form)) + (let ((args (cdr (member '-- (reverse form)))) + (values (cdr (member '-- form)))) + (let ((form `(let ,(mapcar (lambda (var) `(,var (pop peg--stack))) args) + ,@(mapcar (lambda (val) `(push ,val peg--stack)) values)))) + `(action ,form)))) + +(defvar peg-char-classes + '(ascii alnum alpha blank cntrl digit graph lower multibyte nonascii print + punct space unibyte upper word xdigit)) + +(cl-defmethod peg--macroexpand ((_ (eql set)) &rest specs) + (cond ((null specs) '(guard nil)) + ((and (null (cdr specs)) + (let ((range (peg-range-designator (car specs)))) + (and range `(range ,(car range) ,(cdr range)))))) + (t + (let ((chars '()) (ranges '()) (classes '())) + (while specs + (let* ((spec (pop specs)) + (range (peg-range-designator spec))) + (cond (range + (push range ranges)) + ((peg-characterp spec) + (push spec chars)) + ((stringp spec) + (setq chars (append (reverse (append spec ())) chars))) + ((memq spec peg-char-classes) + (push spec classes)) + (t (error "Invalid set specifier: %S" spec))))) + (setq ranges (reverse ranges)) + (setq chars (delete-dups (reverse chars))) + (setq classes (reverse classes)) + (cond ((and (null ranges) + (null classes) + (cond ((null chars) '(guard nil)) + ((null (cdr chars)) `(char ,(car chars)))))) + (t `(set ,ranges ,chars ,classes))))))) + +(defun peg-range-designator (x) + (and (symbolp x) + (let ((str (symbol-name x))) + (and (= (length str) 3) + (eq (aref str 1) ?-) + (< (aref str 0) (aref str 2)) + (cons (aref str 0) (aref str 2)))))) + +;; characterp is new in Emacs 23. 
+(defun peg-characterp (x) + (if (fboundp 'characterp) + (characterp x) + (integerp x))) + +(cl-defmethod peg--macroexpand ((_ (eql list)) &rest args) + (peg-normalize + (let ((marker (make-symbol "magic-marker"))) + `(and (stack-action (-- ',marker)) + ,@args + (stack-action (-- + (let ((l '())) + (while + (let ((e (pop peg--stack))) + (cond ((eq e ',marker) nil) + ((null peg--stack) + (error "No marker on stack")) + (t (push e l) t)))) + l))))))) + +(cl-defmethod peg--macroexpand ((_ (eql substring)) &rest args) + (peg-normalize + `(and `(-- (point)) + ,@args + `(start -- (buffer-substring-no-properties start (point)))))) + +(cl-defmethod peg--macroexpand ((_ (eql region)) &rest args) + (peg-normalize + `(and `(-- (point)) + ,@args + `(-- (point))))) + +(cl-defmethod peg--macroexpand ((_ (eql replace)) pe replacement) + (peg-normalize + `(and (stack-action (-- (point))) + ,pe + (stack-action (start -- (progn + (delete-region start (point)) + (insert-before-markers ,replacement)))) + (stack-action (_ --))))) + +(cl-defmethod peg--macroexpand ((_ (eql quote)) _form) + (error "quote is reserved for future use")) + +(cl-defgeneric peg--translate (head &rest args) + (error "No translator for: %S" (cons head args))) + +(defun peg--translate-rule-body (name exp) + (let ((msg (condition-case err + (progn (peg-detect-cycles exp (list name)) nil) + (error (error-message-string err)))) + (code (peg-translate-exp exp))) + (cond + ((null msg) code) + ((fboundp 'macroexp--warn-and-return) + (macroexp--warn-and-return msg code)) + (t + (message "%s" msg) + code)))) + +;; This is the main translation function. +(defun peg-translate-exp (exp) + "Return the ELisp code to match the PE EXP." + ;; FIXME: This expansion basically duplicates `exp' in the output, which is + ;; a serious problem because it's done recursively, so it makes the output + ;; code's size exponentially larger than the input! 
+ `(or ,(apply #'peg--translate exp) + (peg--record-failure ',exp))) ; for error reporting + +(define-obsolete-function-alias 'peg-record-failure + #'peg--record-failure "peg-1.0") +(defun peg--record-failure (exp) + (cond ((= (point) (car peg--errors)) + (setcdr peg--errors (cons exp (cdr peg--errors)))) + ((> (point) (car peg--errors)) + (setq peg--errors (list (point) exp)))) + nil) + +(cl-defmethod peg--translate ((_ (eql and)) e1 e2) + `(and ,(peg-translate-exp e1) + ,(peg-translate-exp e2))) + +;; Choicepoints are used for backtracking. At a choicepoint we save +;; enough state, so that we can continue from there if needed. +(defun peg--choicepoint-moved-p (choicepoint) + `(/= ,(car choicepoint) (point))) + +(defun peg--choicepoint-restore (choicepoint) + `(progn + (goto-char ,(car choicepoint)) + (setq peg--actions ,(cdr choicepoint)))) + +(defmacro peg--with-choicepoint (var &rest body) + (declare (indent 1) (debug (symbolp form))) + `(let ((,var (cons (make-symbol "point") (make-symbol "actions")))) + `(let ((,(car ,var) (point)) + (,(cdr ,var) peg--actions)) + ,@(list ,@body)))) + +(cl-defmethod peg--translate ((_ (eql or)) e1 e2) + (peg--with-choicepoint cp + `(or ,(peg-translate-exp e1) + (,@(peg--choicepoint-restore cp) + ,(peg-translate-exp e2))))) + +(cl-defmethod peg--translate ((_ (eql with)) rules &rest exps) + `(with-peg-rules ,rules ,(peg--translate `(and . ,exps)))) + +(cl-defmethod peg--translate ((_ (eql guard)) exp) exp) + +(defvar peg-syntax-classes + '((whitespace ?-) (word ?w) (symbol ?s) (punctuation ?.) + (open ?\() (close ?\)) (string ?\") (escape ?\\) (charquote ?/) + (math ?$) (prefix ?') (comment ?<) (endcomment ?>) + (comment-fence ?!) 
(string-fence ?|))) + +(cl-defmethod peg--translate ((_ (eql syntax-class)) class) + (let ((probe (assoc class peg-syntax-classes))) + (cond (probe `(when (looking-at ,(format "\\s%c" (cadr probe))) + (forward-char) + t)) + (t (error "Invalid syntax class: %S\nMust be one of: %s" class + (mapcar #'car peg-syntax-classes)))))) + +(cl-defmethod peg--translate ((_ (eql =)) string) + `(let ((str ,string)) + (when (zerop (length str)) + (error "Empty strings not allowed for =")) + (search-forward str (+ (point) (length str)) t))) + +(cl-defmethod peg--translate ((_ (eql *)) e) + `(progn (while ,(peg--with-choicepoint cp + `(if ,(peg-translate-exp e) + ;; Just as regexps do for the `*' operator, + ;; we allow the body of `*' loops to match + ;; the empty string, but we don't repeat the loop if + ;; we haven't moved, to avoid inf-loops. + ,(peg--choicepoint-moved-p cp) + ,(peg--choicepoint-restore cp) + nil))) + t)) + +(cl-defmethod peg--translate ((_ (eql if)) e) + (peg--with-choicepoint cp + `(when ,(peg-translate-exp e) + ,(peg--choicepoint-restore cp) + t))) + +(cl-defmethod peg--translate ((_ (eql not)) e) + (peg--with-choicepoint cp + `(unless ,(peg-translate-exp e) + ,(peg--choicepoint-restore cp) + t))) + +(cl-defmethod peg--translate ((_ (eql any)) ) + '(when (not (eobp)) + (forward-char) + t)) + +(cl-defmethod peg--translate ((_ (eql char)) c) + `(when (eq (char-after) ',c) + (forward-char) + t)) + +(cl-defmethod peg--translate ((_ (eql set)) ranges chars classes) + `(when (looking-at ',(peg-make-charset-regexp ranges chars classes)) + (forward-char) + t)) + +(defun peg-make-charset-regexp (ranges chars classes) + (when (and (not ranges) (not classes) (<= (length chars) 1)) + (error "Bug")) + (let ((rbracket (member ?\] chars)) + (minus (member ?- chars)) + (hat (member ?^ chars))) + (dolist (c '(?\] ?- ?^)) + (setq chars (remove c chars))) + (format "[%s%s%s%s%s%s]" + (if rbracket "]" "") + (if minus "-" "") + (mapconcat (lambda (x) (format "%c-%c" (car x) (cdr 
x))) ranges "") + (mapconcat (lambda (c) (format "[:%s:]" c)) classes "") + (mapconcat (lambda (c) (format "%c" c)) chars "") + (if hat "^" "")))) + +(cl-defmethod peg--translate ((_ (eql range)) from to) + `(when (and (char-after) + (<= ',from (char-after)) + (<= (char-after) ',to)) + (forward-char) + t)) + +(cl-defmethod peg--translate ((_ (eql str)) str) + `(when (looking-at ',(regexp-quote str)) + (goto-char (match-end 0)) + t)) + +(cl-defmethod peg--translate ((_ (eql call)) name &rest args) + `(,(peg--rule-id name) ,@args)) + +(cl-defmethod peg--translate ((_ (eql funcall)) exp &rest args) + `(funcall ,exp ,@args)) + +(cl-defmethod peg--translate ((_ (eql action)) form) + `(progn + (push (cons (point) (lambda () ,form)) peg--actions) + t)) + +(defvar peg--stack nil) +(defun peg-postprocess (actions) + "Execute \"actions\"." + (let ((peg--stack '()) + (forw-actions ())) + (pcase-dolist (`(,pos . ,thunk) actions) + (push (cons (copy-marker pos) thunk) forw-actions)) + (pcase-dolist (`(,pos . ,thunk) forw-actions) + (goto-char pos) + (funcall thunk)) + (or peg--stack t))) + +;; Left recursion is presumably a common mistake when using PEGs. +;; Here we try to detect such mistakes. Essentially we traverse the +;; graph as long as we can without consuming input. When we find a +;; recursive call we signal an error. + +(defun peg-detect-cycles (exp path) + "Signal an error on a cycle. +Otherwise traverse EXP recursively and return T if EXP can match +without consuming input. Return nil if EXP definitely consumes +input. PATH is the list of rules that we have visited so far." 
+ (apply #'peg--detect-cycles path exp)) + +(cl-defgeneric peg--detect-cycles (head _path &rest args) + (error "No detect-cycle method for: %S" (cons head args))) + +(cl-defmethod peg--detect-cycles (path (_ (eql call)) name) + (if (member name path) + (error "Possible left recursion: %s" + (mapconcat (lambda (x) (format "%s" x)) + (reverse (cons name path)) " -> ")) + (let ((exp (peg--lookup-rule name))) + (if (null exp) + ;; If there's no rule by that name, either we'll fail at + ;; run-time or it will be defined later. In any case, at this + ;; point there's no evidence of a cycle, and if a cycle appears + ;; later we'll hopefully catch it when the rule gets defined. + ;; FIXME: In practice, if `name' is part of the cycle, we will + ;; indeed detect it when it gets defined, but OTOH if `name' + ;; is not part of a cycle but it *enables* a cycle because + ;; it matches the empty string (i.e. we should have returned t + ;; here), then we may not catch the problem at all :-( + nil + (peg-detect-cycles exp (cons name path)))))) + +(cl-defmethod peg--detect-cycles (path (_ (eql and)) e1 e2) + (and (peg-detect-cycles e1 path) + (peg-detect-cycles e2 path))) + +(cl-defmethod peg--detect-cycles (path (_ (eql or)) e1 e2) + (or (peg-detect-cycles e1 path) + (peg-detect-cycles e2 path))) + +(cl-defmethod peg--detect-cycles (path (_ (eql *)) e) + (peg-detect-cycles e path) + t) + +(cl-defmethod peg--detect-cycles (path (_ (eql if)) e) + (peg-unary-nullable e path)) +(cl-defmethod peg--detect-cycles (path (_ (eql not)) e) + (peg-unary-nullable e path)) + +(defun peg-unary-nullable (exp path) + (peg-detect-cycles exp path) + t) + +(cl-defmethod peg--detect-cycles (_path (_ (eql any))) nil) +(cl-defmethod peg--detect-cycles (_path (_ (eql char)) _c) nil) +(cl-defmethod peg--detect-cycles (_path (_ (eql set)) _r _c _k) nil) +(cl-defmethod peg--detect-cycles (_path (_ (eql range)) _c1 _c2) nil) +(cl-defmethod peg--detect-cycles (_path (_ (eql str)) s) (equal s "")) 
+(cl-defmethod peg--detect-cycles (_path (_ (eql guard)) _e) t) +(cl-defmethod peg--detect-cycles (_path (_ (eql =)) _s) nil) +(cl-defmethod peg--detect-cycles (_path (_ (eql syntax-class)) _n) nil) +(cl-defmethod peg--detect-cycles (_path (_ (eql action)) _form) t) + +(defun peg-merge-errors (exps) + "Build a more readable error message out of failed expression." + (let ((merged '())) + (dolist (exp exps) + (setq merged (peg-merge-error exp merged))) + merged)) + +(defun peg-merge-error (exp merged) + (apply #'peg--merge-error merged exp)) + +(cl-defgeneric peg--merge-error (_merged head &rest args) + (error "No merge-error method for: %S" (cons head args))) + +(cl-defmethod peg--merge-error (merged (_ (eql or)) e1 e2) + (peg-merge-error e2 (peg-merge-error e1 merged))) + +(cl-defmethod peg--merge-error (merged (_ (eql and)) e1 _e2) + ;; FIXME: Why is `e2' not used? + (peg-merge-error e1 merged)) + +(cl-defmethod peg--merge-error (merged (_ (eql str)) str) + ;;(add-to-list 'merged str) + (cl-adjoin str merged :test #'equal)) + +(cl-defmethod peg--merge-error (merged (_ (eql call)) rule) + ;; (add-to-list 'merged rule) + (cl-adjoin rule merged :test #'equal)) + +(cl-defmethod peg--merge-error (merged (_ (eql char)) char) + ;; (add-to-list 'merged (string char)) + (cl-adjoin (string char) merged :test #'equal)) + +(cl-defmethod peg--merge-error (merged (_ (eql set)) r c k) + ;; (add-to-list 'merged (peg-make-charset-regexp r c k)) + (cl-adjoin (peg-make-charset-regexp r c k) merged :test #'equal)) + +(cl-defmethod peg--merge-error (merged (_ (eql range)) from to) + ;; (add-to-list 'merged (format "[%c-%c]" from to)) + (cl-adjoin (format "[%c-%c]" from to) merged :test #'equal)) + +(cl-defmethod peg--merge-error (merged (_ (eql *)) exp) + (peg-merge-error exp merged)) + +(cl-defmethod peg--merge-error (merged (_ (eql any))) + ;; (add-to-list 'merged '(any)) + (cl-adjoin '(any) merged :test #'equal)) + +(cl-defmethod peg--merge-error (merged (_ (eql not)) x) + ;; 
(add-to-list 'merged `(not ,x)) + (cl-adjoin `(not ,x) merged :test #'equal)) + +(cl-defmethod peg--merge-error (merged (_ (eql action)) _action) merged) +(cl-defmethod peg--merge-error (merged (_ (eql null))) merged) + +(provide 'peg) +(require 'peg) + +(define-peg-rule null () :inline t (guard t)) +(define-peg-rule fail () :inline t (guard nil)) +(define-peg-rule bob () :inline t (guard (bobp))) +(define-peg-rule eob () :inline t (guard (eobp))) +(define-peg-rule bol () :inline t (guard (bolp))) +(define-peg-rule eol () :inline t (guard (eolp))) +(define-peg-rule bow () :inline t (guard (looking-at "\\<"))) +(define-peg-rule eow () :inline t (guard (looking-at "\\>"))) +(define-peg-rule bos () :inline t (guard (looking-at "\\_<"))) +(define-peg-rule eos () :inline t (guard (looking-at "\\_>"))) + +;;; peg.el ends here diff --git a/test/lisp/peg-tests.el b/test/lisp/peg-tests.el new file mode 100644 index 00000000000..864e09b4200 --- /dev/null +++ b/test/lisp/peg-tests.el @@ -0,0 +1,367 @@ +;;; peg-tests.el --- Tests of PEG parsers -*- lexical-binding: t; -*- + +;; Copyright (C) 2008-2023 Free Software Foundation, Inc. + +;; This program is free software; you can redistribute it and/or modify +;; it under the terms of the GNU General Public License as published by +;; the Free Software Foundation, either version 3 of the License, or +;; (at your option) any later version. + +;; This program is distributed in the hope that it will be useful, +;; but WITHOUT ANY WARRANTY; without even the implied warranty of +;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +;; GNU General Public License for more details. + +;; You should have received a copy of the GNU General Public License +;; along with this program. If not, see <https://www.gnu.org/licenses/>. + +;;; Commentary: + +;; Tests and examples, that used to live in peg.el wrapped inside an `eval'. 
+ +;;; Code: + +(require 'peg) +(require 'ert) + +;;; Tests: + +(defmacro peg-parse-string (pex string &optional noerror) + "Parse STRING according to PEX. +If NOERROR is non-nil, push nil resp. t if the parse failed +resp. succeeded instead of signaling an error." + (let ((oldstyle (consp (car-safe pex)))) ;PEX is really a list of rules. + `(with-temp-buffer + (insert ,string) + (goto-char (point-min)) + ,(if oldstyle + `(with-peg-rules ,pex + (peg-run (peg ,(caar pex)) + ,(unless noerror '#'peg-signal-failure))) + `(peg-run (peg ,pex) + ,(unless noerror '#'peg-signal-failure)))))) + +(define-peg-rule peg-test-natural () + [0-9] (* [0-9])) + +(ert-deftest peg-test () + (should (peg-parse-string peg-test-natural "99 bottles" t)) + (should (peg-parse-string ((s "a")) "a" t)) + (should (not (peg-parse-string ((s "a")) "b" t))) + (should (peg-parse-string ((s (not "a"))) "b" t)) + (should (not (peg-parse-string ((s (not "a"))) "a" t))) + (should (peg-parse-string ((s (if "a"))) "a" t)) + (should (not (peg-parse-string ((s (if "a"))) "b" t))) + (should (peg-parse-string ((s "ab")) "ab" t)) + (should (not (peg-parse-string ((s "ab")) "ba" t))) + (should (not (peg-parse-string ((s "ab")) "a" t))) + (should (peg-parse-string ((s (range ?0 ?9))) "0" t)) + (should (not (peg-parse-string ((s (range ?0 ?9))) "a" t))) + (should (peg-parse-string ((s [0-9])) "0" t)) + (should (not (peg-parse-string ((s [0-9])) "a" t))) + (should (not (peg-parse-string ((s [0-9])) "" t))) + (should (peg-parse-string ((s (any))) "0" t)) + (should (not (peg-parse-string ((s (any))) "" t))) + (should (peg-parse-string ((s (eob))) "" t)) + (should (peg-parse-string ((s (not (eob)))) "a" t)) + (should (peg-parse-string ((s (or "a" "b"))) "a" t)) + (should (peg-parse-string ((s (or "a" "b"))) "b" t)) + (should (not (peg-parse-string ((s (or "a" "b"))) "c" t))) + (should (peg-parse-string (and "a" "b") "ab" t)) + (should (peg-parse-string ((s (and "a" "b"))) "abc" t)) + (should (not (peg-parse-string 
(and "a" "b") "ba" t))) + (should (peg-parse-string ((s (and "a" "b" "c"))) "abc" t)) + (should (peg-parse-string ((s (* "a") "b" (eob))) "b" t)) + (should (peg-parse-string ((s (* "a") "b" (eob))) "ab" t)) + (should (peg-parse-string ((s (* "a") "b" (eob))) "aaab" t)) + (should (not (peg-parse-string ((s (* "a") "b" (eob))) "abc" t))) + (should (peg-parse-string ((s "")) "abc" t)) + (should (peg-parse-string ((s "" (eob))) "" t)) + (should (peg-parse-string ((s (opt "a") "b")) "abc" t)) + (should (peg-parse-string ((s (opt "a") "b")) "bc" t)) + (should (not (peg-parse-string ((s (or))) "ab" t))) + (should (peg-parse-string ((s (and))) "ab" t)) + (should (peg-parse-string ((s (and))) "" t)) + (should (peg-parse-string ((s ["^"])) "^" t)) + (should (peg-parse-string ((s ["^a"])) "a" t)) + (should (peg-parse-string ["-"] "-" t)) + (should (peg-parse-string ((s ["]-"])) "]" t)) + (should (peg-parse-string ((s ["^]"])) "^" t)) + (should (peg-parse-string ((s [alpha])) "z" t)) + (should (not (peg-parse-string ((s [alpha])) "0" t))) + (should (not (peg-parse-string ((s [alpha])) "" t))) + (should (not (peg-parse-string ((s ["][:alpha:]"])) "z" t))) + (should (peg-parse-string ((s (bob))) "" t)) + (should (peg-parse-string ((s (bos))) "x" t)) + (should (not (peg-parse-string ((s (bos))) " x" t))) + (should (peg-parse-string ((s "x" (eos))) "x" t)) + (should (peg-parse-string ((s (syntax-class whitespace))) " " t)) + (should (peg-parse-string ((s (= "foo"))) "foo" t)) + (should (let ((f "foo")) (peg-parse-string ((s (= f))) "foo" t))) + (should (not (peg-parse-string ((s (= "foo"))) "xfoo" t))) + (should (equal (peg-parse-string ((s `(-- 1 2))) "") '(2 1))) + (should (equal (peg-parse-string ((s `(-- 1 2) `(a b -- a b))) "") '(2 1))) + (should (equal (peg-parse-string ((s (or (and (any) s) + (substring [0-9])))) + "ab0cd1ef2gh") + '("2"))) + ;; The PEG rule `other' doesn't exist, which will cause a byte-compiler + ;; warning, but not an error at run time because the rule 
is not actually + ;; used in this particular case. + (should (equal (peg-parse-string ((s (substring (or "a" other))) + ;; Unused left-recursive rule, should + ;; cause a byte-compiler warning. + (r (* "a") r)) + "af") + '("a"))) + (should (equal (peg-parse-string ((s (list x y)) + (x `(-- 1)) + (y `(-- 2))) + "") + '((1 2)))) + (should (equal (peg-parse-string ((s (list (* x))) + (x "" `(-- 'x))) + "xxx") + ;; The empty loop body should be matched once! + '((x)))) + (should (equal (peg-parse-string ((s (list (* x))) + (x "x" `(-- 'x))) + "xxx") + '((x x x)))) + (should (equal (peg-parse-string ((s (region (* x))) + (x "x" `(-- 'x))) + "xxx") + ;; FIXME: Since string positions start at 0, this should + ;; really be '(3 x x x 0) !! + '(4 x x x 1))) + (should (equal (peg-parse-string ((s (region (list (* x)))) + (x "x" `(-- 'x 'y))) + "xxx") + '(4 (x y x y x y) 1))) + (should (equal (with-temp-buffer + (save-excursion (insert "abcdef")) + (list + (peg-run (peg "a" + (replace "bc" "x") + (replace "de" "y") + "f")) + (buffer-string))) + '(t "axyf"))) + (with-temp-buffer + (insert "toro") + (goto-char (point-min)) + (should (peg-run (peg "to"))) + (should-not (peg-run (peg "to"))) + (should (peg-run (peg "ro"))) + (should (eobp))) + (with-temp-buffer + (insert " ") + (goto-char (point-min)) + (peg-run (peg (+ (syntax-class whitespace)))) + (should (eobp))) + ) + +;;; Examples: + +;; peg-ex-recognize-int recognizes integers. An integer begins with an +;; optional sign, then follows one or more digits. Digits are all +;; characters from 0 to 9. +;; +;; Notes: +;; 1) "" matches the empty sequence, i.e. matches without consuming +;; input. +;; 2) [0-9] is the character range from 0 to 9. This can also be +;; written as (range ?0 ?9). Note that 0-9 is a symbol. 
+(defun peg-ex-recognize-int () + (with-peg-rules ((number sign digit (* digit)) + (sign (or "+" "-" "")) + (digit [0-9])) + (peg-run (peg number)))) + +;; peg-ex-parse-int recognizes integers and computes the corresponding +;; value. The grammar is the same as for `peg-ex-recognize-int' +;; augmented with parsing actions. Unfortunately, the actions add +;; quite a bit of clutter. +;; +;; The actions for the sign rule push -1 on the stack for a minus sign +;; and 1 for plus or no sign. +;; +;; The action for the digit rule pushes the value for a single digit. +;; +;; The action `(a b -- (+ (* a 10) b)), takes two items from the stack +;; and pushes the first digit times 10 added to the second digit. +;; +;; The action `(sign val -- (* sign val)), multiplies val with the +;; sign (1 or -1). +(defun peg-ex-parse-int () + (with-peg-rules ((number sign digit (* digit + `(a b -- (+ (* a 10) b))) + `(sign val -- (* sign val))) + (sign (or (and "+" `(-- 1)) + (and "-" `(-- -1)) + (and "" `(-- 1)))) + (digit [0-9] `(-- (- (char-before) ?0)))) + (peg-run (peg number)))) + +;; Put point after the ) and press C-x C-e +;; (peg-ex-parse-int)-234234 + +;; Parse arithmetic expressions and compute the result as a side effect. +(defun peg-ex-arith () + (peg-parse + (expr _ sum eol) + (sum product (* (or (and "+" _ product `(a b -- (+ a b))) + (and "-" _ product `(a b -- (- a b)))))) + (product value (* (or (and "*" _ value `(a b -- (* a b))) + (and "/" _ value `(a b -- (/ a b)))))) + (value (or (and (substring number) `(string -- (string-to-number string))) + (and "(" _ sum ")" _))) + (number (+ [0-9]) _) + (_ (* [" \t"])) + (eol (or "\n" "\r\n" "\r")))) + +;; (peg-ex-arith) 1 + 2 * 3 * (4 + 5) +;; (peg-ex-arith) 1 + 2 ^ 3 * (4 + 5) ; fails to parse + +;; Parse URI according to RFC 2396. 
+(defun peg-ex-uri () + (peg-parse + (URI-reference (or absoluteURI relativeURI) + (or (and "#" (substring fragment)) + `(-- nil)) + `(scheme user host port path query fragment -- + (list :scheme scheme :user user + :host host :port port + :path path :query query + :fragment fragment))) + (absoluteURI (substring scheme) ":" (or hier-part opaque-part)) + (hier-part ;(-- user host port path query) + (or net-path + (and `(-- nil nil nil) + abs-path)) + (or (and "?" (substring query)) + `(-- nil))) + (net-path "//" authority (or abs-path `(-- nil))) + (abs-path "/" path-segments) + (path-segments segment (list (* "/" segment)) `(s l -- (cons s l))) + (segment (substring (* pchar) (* ";" param))) + (param (* pchar)) + (pchar (or unreserved escaped [":@&=+$,"])) + (query (* uric)) + (fragment (* uric)) + (relativeURI (or net-path abs-path rel-path) (opt "?" query)) + (rel-path rel-segment (opt abs-path)) + (rel-segment (+ unreserved escaped [";@&=+$,"])) + (authority (or server reg-name)) + (server (or (and (or (and (substring userinfo) "@") + `(-- nil)) + hostport) + `(-- nil nil nil))) + (userinfo (* (or unreserved escaped [";:&=+$,"]))) + (hostport (substring host) (or (and ":" (substring port)) + `(-- nil))) + (host (or hostname ipv4address)) + (hostname (* domainlabel ".") toplabel (opt ".")) + (domainlabel alphanum + (opt (* (or alphanum "-") (if alphanum)) + alphanum)) + (toplabel alpha + (* (or alphanum "-") (if alphanum)) + alphanum) + (ipv4address (+ digit) "." (+ digit) "." (+ digit) "." 
(+ digit)) + (port (* digit)) + (scheme alpha (* (or alpha digit ["+-."]))) + (reg-name (or unreserved escaped ["$,;:@&=+"])) + (opaque-part uric-no-slash (* uric)) + (uric (or reserved unreserved escaped)) + (uric-no-slash (or unreserved escaped [";?:@&=+$,"])) + (reserved (set ";/?:@&=+$,")) + (unreserved (or alphanum mark)) + (escaped "%" hex hex) + (hex (or digit [A-F] [a-f])) + (mark (set "-_.!~*'()")) + (alphanum (or alpha digit)) + (alpha (or lowalpha upalpha)) + (lowalpha [a-z]) + (upalpha [A-Z]) + (digit [0-9]))) + +;; (peg-ex-uri)http://luser@www.foo.com:8080/bar/baz.html?x=1#foo +;; (peg-ex-uri)file:/bar/baz.html?foo=df#x + +;; Split STRING where SEPARATOR occurs. +(defun peg-ex-split (string separator) + (peg-parse-string ((s (list (* (* sep) elt))) + (elt (substring (+ (not sep) (any)))) + (sep (= separator))) + string)) + +;; (peg-ex-split "-abc-cd-" "-") + +;; Parse a lisp style Sexp. +;; [To keep the example short, ' and . are handled as ordinary symbol.] +(defun peg-ex-lisp () + (peg-parse + (sexp _ (or string list number symbol)) + (_ (* (or [" \n\t"] comment))) + (comment ";" (* (not (or "\n" (eob))) (any))) + (string "\"" (substring (* (not "\"") (any))) "\"") + (number (substring (opt (set "+-")) (+ digit)) + (if terminating) + `(string -- (string-to-number string))) + (symbol (substring (and symchar (* (not terminating) symchar))) + `(s -- (intern s))) + (symchar [a-z A-Z 0-9 "-;!#%&'*+,./:;<=>?@[]^_`{|}~"]) + (list "(" `(-- (cons nil nil)) `(hd -- hd hd) + (* sexp `(tl e -- (setcdr tl (list e)))) + _ ")" `(hd _tl -- (cdr hd))) + (digit [0-9]) + (terminating (or (set " \n\t();\"'") (eob))))) + +;; (peg-ex-lisp) + +;; We try to detect left recursion and report it as error. 
+(defun peg-ex-left-recursion () + (eval '(peg-parse (exp (or term + (and exp "+" exp))) + (term (or digit + (and term "*" term))) + (digit [0-9])) + t)) + +(defun peg-ex-infinite-loop () + (eval '(peg-parse (exp (* (or "x" + "y" + (action (foo)))))) + t)) + +;; Some efficiency problems: + +;; Find the last digit in a string. +;; Recursive definition with excessive stack usage. +(defun peg-ex-last-digit (string) + (peg-parse-string ((s (or (and (any) s) + (substring [0-9])))) + string)) + +;; (peg-ex-last-digit "ab0cd1ef2gh") +;; (peg-ex-last-digit (make-string 50 ?-)) +;; (peg-ex-last-digit (make-string 1000 ?-)) + +;; Find the last digit without recursion. Doesn't run out of stack, +;; but probably still too inefficient for large inputs. +(defun peg-ex-last-digit2 (string) + (peg-parse-string ((s `(-- nil) + (+ (* (not digit) (any)) + (substring digit) + `(_d1 d2 -- d2))) + (digit [0-9])) + string)) + +;; (peg-ex-last-digit2 "ab0cd1ef2gh") +;; (peg-ex-last-digit2 (concat (make-string 500000 ?-) "8a9b")) +;; (peg-ex-last-digit2 (make-string 500000 ?-)) +;; (peg-ex-last-digit2 (make-string 500000 ?5)) + +(provide 'peg-tests) +;;; peg-tests.el ends here -- 2.42.0 ^ permalink raw reply related [flat|nested] 100+ messages in thread
* Re: [PATCH] Re: Make peg.el a built-in library? 2022-11-16 4:27 ` [PATCH] " Eric Abrahamsen 2022-11-16 5:07 ` tomas 2022-11-16 6:24 ` Ihor Radchenko @ 2023-01-11 7:39 ` Michael Heerdegen 2023-01-11 8:04 ` Ihor Radchenko 2 siblings, 1 reply; 100+ messages in thread From: Michael Heerdegen @ 2023-01-11 7:39 UTC (permalink / raw) To: Eric Abrahamsen; +Cc: emacs-devel Eric Abrahamsen <eric@ericabrahamsen.net> writes: > Okay, here's a first stab. I read the paper, and understood about half > of it, which seemed like enough. It was interesting to see that the > paper explicitly calls out the exact greedy-matching behavior I'd > encountered. I missed this discussion. Two points from my side: - When you have worked in all comments could you please post an up-to-date version of your additions to the manual for me to review? - When I had read that paper the outcome had been an rx-to-peg.el translator. If someone is interested I can attach it. This was some time ago and I don't know that much about pegs any more than the person at that time. Michael. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] Re: Make peg.el a built-in library? 2023-01-11 7:39 ` Michael Heerdegen @ 2023-01-11 8:04 ` Ihor Radchenko 2023-01-11 11:01 ` Michael Heerdegen 0 siblings, 1 reply; 100+ messages in thread From: Ihor Radchenko @ 2023-01-11 8:04 UTC (permalink / raw) To: Michael Heerdegen; +Cc: Eric Abrahamsen, emacs-devel Michael Heerdegen <michael_heerdegen@web.de> writes: > - When I had read that paper the outcome had been an rx-to-peg.el > translator. If someone is interested I can attach it. This was some > time ago and I don't know that much about pegs any more than the person > at that time. I am wondering if we may instead just support traditional regexps as an extra PEG construct. Considering that regexp support is anyway built-in, why not? -- Ihor Radchenko // yantar92, Org mode contributor, Learn more about Org mode at <https://orgmode.org/>. Support Org development at <https://liberapay.com/org-mode>, or support my work at <https://liberapay.com/yantar92> ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] Re: Make peg.el a built-in library? 2023-01-11 8:04 ` Ihor Radchenko @ 2023-01-11 11:01 ` Michael Heerdegen 2023-01-11 11:32 ` tomas 2023-02-05 12:10 ` Ihor Radchenko 0 siblings, 2 replies; 100+ messages in thread From: Michael Heerdegen @ 2023-01-11 11:01 UTC (permalink / raw) To: emacs-devel Ihor Radchenko <yantar92@posteo.net> writes: > I am wondering if we may instead just support traditional regexps as an > extra PEG construct. Considering that regexp support is anyway built-in, > why not? Dunno. I wrote the translator for academic interest and for learning. AFAIR not all Emacs regexp features work in PEGs - backrefs for example. Or match data handling. Michael. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] Re: Make peg.el a built-in library? 2023-01-11 11:01 ` Michael Heerdegen @ 2023-01-11 11:32 ` tomas 2023-02-05 12:10 ` Ihor Radchenko 1 sibling, 0 replies; 100+ messages in thread From: tomas @ 2023-01-11 11:32 UTC (permalink / raw) To: emacs-devel [-- Attachment #1: Type: text/plain, Size: 596 bytes --] On Wed, Jan 11, 2023 at 12:01:43PM +0100, Michael Heerdegen wrote: > Ihor Radchenko <yantar92@posteo.net> writes: > > > I am wondering if we may instead just support traditional regexps as an > > extra PEG construct. Considering that regexp support is anyway built-in, > > why not? > > Dunno. I wrote the translator for academic interest and for > learning. > > AFAIR not all Emacs regexp features work in PEGs - backrefs for example. > Or match data handling. Plus, I don't think a PEG packrat parser is always as efficient as we know and love our regexps. Cheers -- t [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] Re: Make peg.el a built-in library? 2023-01-11 11:01 ` Michael Heerdegen 2023-01-11 11:32 ` tomas @ 2023-02-05 12:10 ` Ihor Radchenko 2023-02-05 15:41 ` Eduardo Ochs 2023-02-06 0:33 ` Michael Heerdegen 1 sibling, 2 replies; 100+ messages in thread From: Ihor Radchenko @ 2023-02-05 12:10 UTC (permalink / raw) To: Michael Heerdegen; +Cc: emacs-devel Michael Heerdegen <michael_heerdegen@web.de> writes: > Ihor Radchenko <yantar92@posteo.net> writes: > >> I am wondering if we may instead just support traditional regexps as an >> extra PEG construct. Considering that regexp support is anyway built-in, >> why not? > > Dunno. I wrote the translator for academic interest and for > learning. > > AFAIR not all Emacs regexp features work in PEGs - backrefs for example. > Or match data handling. Sure. But if we make Emacs regexp a valid PEG construct, they will work. It is the whole point. -- Ihor Radchenko // yantar92, Org mode contributor, Learn more about Org mode at <https://orgmode.org/>. Support Org development at <https://liberapay.com/org-mode>, or support my work at <https://liberapay.com/yantar92> ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] Re: Make peg.el a built-in library? 2023-02-05 12:10 ` Ihor Radchenko @ 2023-02-05 15:41 ` Eduardo Ochs 2023-02-05 15:45 ` Ihor Radchenko 2023-02-09 5:44 ` Jean Louis 2023-02-06 0:33 ` Michael Heerdegen 1 sibling, 2 replies; 100+ messages in thread From: Eduardo Ochs @ 2023-02-05 15:41 UTC (permalink / raw) To: Ihor Radchenko; +Cc: Michael Heerdegen, emacs-devel On Sun, 5 Feb 2023 at 09:10, Ihor Radchenko <yantar92@posteo.net> wrote: > > Michael Heerdegen <michael_heerdegen@web.de> writes: > > > Ihor Radchenko <yantar92@posteo.net> writes: > > > >> I am wondering if we may instead just support traditional regexps as an > >> extra PEG construct. Considering that regexp support is anyway built-in, > >> why not? > > > > Dunno. I wrote the translator for academic interest and for > > learning. > > > > AFAIR not all Emacs regexp features work in PEGs - backrefs for example. > > Or match data handling. > > Sure. But if we make Emacs regexp a valid PEG construct, they will work. > It is the whole point. I played a bit with peg.el some time ago - it is very elegant and it's very easy to inspect how it does things, but it is much slower than Lua's LPEG. I'm now using this to write my parsers: https://github.com/edubart/lpegrex https://github.com/edubart/lpegrex/blob/main/parsers/lua.lua The second link above is an example - a parser for Lua written in Lpegrex. I'm starting to use this thing, that lets me run a Lua interpreter inside Emacs as a module, https://github.com/edrx/emlua/#introduction to call lpegrex parsers to parse parts of Emacs buffers. The result - let me call it lpegrex+emlua - is very fragile because I'm too bad & lazy with C programming to implement better error handling in emlua, but if anyone else wants to play with lpegrex+emlua I can create a page with instructions... Cheers, Eduardo Ochs http://anggtwu.net/#eev http://anggtwu.net/eepitch.html ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] Re: Make peg.el a built-in library? 2023-02-05 15:41 ` Eduardo Ochs @ 2023-02-05 15:45 ` Ihor Radchenko 2023-02-05 16:19 ` Eduardo Ochs 2023-02-09 5:44 ` Jean Louis 1 sibling, 1 reply; 100+ messages in thread From: Ihor Radchenko @ 2023-02-05 15:45 UTC (permalink / raw) To: Eduardo Ochs; +Cc: Michael Heerdegen, emacs-devel Eduardo Ochs <eduardoochs@gmail.com> writes: >> > AFAIR not all Emacs regexp features work in PEGs - backrefs for example. >> > Or match data handling. >> >> Sure. But if we make Emacs regexp a valid PEG construct, they will work. >> It is the whole point. > > I played a bit with peg.el some time ago - it is very elegant and it's > very easy to inspect how it does things, but it is much slower than > Lua's LPEG... What do you mean by slower? -- Ihor Radchenko // yantar92, Org mode contributor, Learn more about Org mode at <https://orgmode.org/>. Support Org development at <https://liberapay.com/org-mode>, or support my work at <https://liberapay.com/yantar92> ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] Re: Make peg.el a built-in library? 2023-02-05 15:45 ` Ihor Radchenko @ 2023-02-05 16:19 ` Eduardo Ochs 2023-02-05 16:50 ` Ihor Radchenko 0 siblings, 1 reply; 100+ messages in thread From: Eduardo Ochs @ 2023-02-05 16:19 UTC (permalink / raw) To: Ihor Radchenko; +Cc: Michael Heerdegen, emacs-devel On Sun, 5 Feb 2023 at 12:44, Ihor Radchenko <yantar92@posteo.net> wrote: > > Eduardo Ochs <eduardoochs@gmail.com> writes: > > >> > AFAIR not all Emacs regexp features work in PEGs - backrefs for example. > >> > Or match data handling. > >> > >> Sure. But if we make Emacs regexp a valid PEG construct, they will work. > >> It is the whole point. > > > > I played a bit with peg.el some time ago - it is very elegant and it's > > very easy to inspect how it does things, but it is much slower than > > Lua's LPEG... > > What do you mean by slower? I wrote a simple peg.el parser and it took two seconds to parse an input that has just 2KB. The lpeg parser that I use to htmlize some of my files takes about 0.5s to parse a file with 3MB and to return the htmlized version. [[]], E. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] Re: Make peg.el a built-in library? 2023-02-05 16:19 ` Eduardo Ochs @ 2023-02-05 16:50 ` Ihor Radchenko 0 siblings, 0 replies; 100+ messages in thread From: Ihor Radchenko @ 2023-02-05 16:50 UTC (permalink / raw) To: Eduardo Ochs; +Cc: Michael Heerdegen, emacs-devel Eduardo Ochs <eduardoochs@gmail.com> writes: >> What do you mean by slower? > > I wrote a simple peg.el parser and it took two seconds to parse an > input that has just 2KB. The lpeg parser that I use to htmlize some of > my files takes about 0.5s to parse a file with 3MB and to return the > htmlized version. There is no fundamental reason why Elisp peg implementation should be that much slower. I suggest filing a bug report. I am also wondering if you can share the example file and grammar. -- Ihor Radchenko // yantar92, Org mode contributor, Learn more about Org mode at <https://orgmode.org/>. Support Org development at <https://liberapay.com/org-mode>, or support my work at <https://liberapay.com/yantar92> ^ permalink raw reply [flat|nested] 100+ messages in thread
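A timing claim like the one above is easier to act on in a bug report when it comes with a reproducible measurement. The following is a minimal sketch of such a measurement, assuming peg.el is loadable; the toy grammar and the names `my-peg-words' and `my-peg-benchmark' are hypothetical and only stand in for whatever grammar is actually slow. `benchmark-run' is the standard macro from benchmark.el.

```elisp
;;; -*- lexical-binding: t -*-
(require 'peg)
(require 'benchmark)

;; Hypothetical toy grammar: split a string into lowercase words.
(defun my-peg-words (string)
  "Return the list of lowercase words in STRING, or nil if it does not parse."
  (with-temp-buffer
    (insert string)
    (goto-char (point-min))
    (with-peg-rules
        ((words (list (* ws) (* word (* ws))) (eob))
         (word (substring (+ [a-z])))
         (ws [" \n"]))
      ;; `peg-run' returns the value stack on success and nil on
      ;; failure; the single stack entry is the collected list.
      (car (peg-run (peg words))))))

;; Hypothetical helper: time one parse of an input made of N words.
(defun my-peg-benchmark (n)
  "Parse N space-separated words once; return (ELAPSED GC-COUNT GC-ELAPSED)."
  (let ((input (mapconcat #'identity (make-list n "word") " ")))
    (benchmark-run 1 (my-peg-words input))))
```

Evaluating (my-peg-benchmark 500) gives a list whose first element is the elapsed time in seconds. Interpreted and byte-compiled Elisp can differ a lot in speed, so it is worth compiling the file that holds the grammar before comparing numbers.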
* Re: [PATCH] Re: Make peg.el a built-in library? 2023-02-05 15:41 ` Eduardo Ochs 2023-02-05 15:45 ` Ihor Radchenko @ 2023-02-09 5:44 ` Jean Louis 1 sibling, 0 replies; 100+ messages in thread From: Jean Louis @ 2023-02-09 5:44 UTC (permalink / raw) To: Eduardo Ochs; +Cc: Ihor Radchenko, Michael Heerdegen, emacs-devel * Eduardo Ochs <eduardoochs@gmail.com> [2023-02-05 18:42]: > https://github.com/edrx/emlua/#introduction > > to call lpegrex parsers to parse parts of Emacs buffers. The result - > let me call it lpegrex+emlua - is very fragile because I'm too bad & > lazy with C programming to implement better error handling in emlua, > but if anyone else wants to play with lpegrex+emlua I can create a > page with instructions... I just recommend not relying on Github, rather on free software repositories. Savannah on nongnu.org: https://savannah.nongnu.org Savannah, the software forge for people committed to free software: https://savannah.gnu.org Codeberg.org (Germany): https://codeberg.org Sourcehut.org: https://sourcehut.org Pagure: https://pagure.io/pagure Trisquel GNU/Linux-libre Git Repositories: https://devel.trisquel.info/groups/trisquel GitGud - Fast and Free Git Hosting: https://gitgud.io/users/sign_in Fosshost: https://fosshost.org/ -- Jean Take action in Free Software Foundation campaigns: https://www.fsf.org/campaigns In support of Richard M. Stallman https://stallmansupport.org/ ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] Re: Make peg.el a built-in library? 2023-02-05 12:10 ` Ihor Radchenko 2023-02-05 15:41 ` Eduardo Ochs @ 2023-02-06 0:33 ` Michael Heerdegen 1 sibling, 0 replies; 100+ messages in thread From: Michael Heerdegen @ 2023-02-06 0:33 UTC (permalink / raw) To: Ihor Radchenko; +Cc: emacs-devel Ihor Radchenko <yantar92@posteo.net> writes: > Sure. But if we make Emacs regexp a valid PEG construct, they will work. > It is the whole point. I misunderstood what you meant. That could make sense, and it should not be hard to do. Michael. ^ permalink raw reply [flat|nested] 100+ messages in thread
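The idea of using an Emacs regexp as a PEG construct can be tried without changing peg.el, because the existing `guard' form accepts an arbitrary Lisp expression. The following is a minimal sketch, assuming peg.el is loadable and that the guard expression is evaluated with point at the current parse position; the helper `my-peg-match-regexp' and the function `my-peg-key-value' are hypothetical names, not part of peg.el, and this is not the construct proposed in the thread, only an approximation of it.

```elisp
;;; -*- lexical-binding: t -*-
(require 'peg)

;; Hypothetical helper: behave like a PEG terminal backed by a regexp.
(defun my-peg-match-regexp (regexp)
  "Match REGEXP at point and move point past the match.
Return non-nil on success; return nil and leave point alone otherwise."
  (when (looking-at regexp)
    (goto-char (match-end 0))))

;; Hypothetical example grammar mixing literal strings and regexps.
(defun my-peg-key-value (string)
  "Parse \"KEY=DIGITS\" in STRING.
Return the value stack (most recently pushed item first), or nil
if STRING does not match."
  (with-temp-buffer
    (insert string)
    (goto-char (point-min))
    (with-peg-rules
        ((pair key "=" value (eob))
         ;; The regexp consumes input; `substring' captures what it consumed.
         (key (substring (guard (my-peg-match-regexp "[a-z]+"))))
         (value (substring (guard (my-peg-match-regexp "[0-9]+")))
                `(s -- (string-to-number s))))
      (peg-run (peg pair)))))
```

Since the regexp runs through the normal matcher, match data is available inside the helper right after `looking-at' succeeds. A built-in construct would presumably also have to decide how that data is exposed to parsing actions, which this sketch leaves open.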
* Re: Make peg.el a built-in library? 2022-11-07 19:46 ` Eric Abrahamsen 2022-11-08 6:57 ` Helmut Eller 2022-11-08 8:47 ` Ihor Radchenko @ 2022-11-08 14:01 ` Stefan Monnier 2022-11-08 14:42 ` tomas ` (2 more replies) 2 siblings, 3 replies; 100+ messages in thread From: Stefan Monnier @ 2022-11-08 14:01 UTC (permalink / raw) To: Eric Abrahamsen; +Cc: Ihor Radchenko, emacs-devel > And that's about the only hint you get. I was trying to parse a > multiword name like > > Eric Edwin Abrahamsen Side note: the division between "given name" and "family name" is not a universal property, so as a general rule I'd advise against trying to do it (and treat the whole thing as just "the name" without trying to analyze its structure) unless there's some strong external factor that requires it. Stefan ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2022-11-08 14:01 ` Stefan Monnier @ 2022-11-08 14:42 ` tomas 2022-11-08 15:08 ` Visuwesh 2022-11-08 16:10 ` Eric Abrahamsen 2 siblings, 0 replies; 100+ messages in thread From: tomas @ 2022-11-08 14:42 UTC (permalink / raw) To: emacs-devel [-- Attachment #1: Type: text/plain, Size: 657 bytes --] On Tue, Nov 08, 2022 at 09:01:00AM -0500, Stefan Monnier wrote: > > And that's about the only hint you get. I was trying to parse a > > multiword name like > > > > Eric Edwin Abrahamsen > > Side note: the division between "given name" and "family name" is not > a universal property [...] HAH! That's what I try to tell all my customers. But they won't listen. I cheat: the display says "given name" and "family name", but search goes just over a combination of those. Users don't complain :-) The different conventions in middle/western Europe and USA are already pretty dizzying. Including Africa and East Asia well... Cheers -- t [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2022-11-08 14:01 ` Stefan Monnier 2022-11-08 14:42 ` tomas @ 2022-11-08 15:08 ` Visuwesh 2022-11-08 16:29 ` Juanma Barranquero 2022-11-08 16:10 ` Eric Abrahamsen 2 siblings, 1 reply; 100+ messages in thread From: Visuwesh @ 2022-11-08 15:08 UTC (permalink / raw) To: Stefan Monnier; +Cc: Eric Abrahamsen, Ihor Radchenko, emacs-devel [செவ்வாய் நவம்பர் 08, 2022] Stefan Monnier wrote: >> And that's about the only hint you get. I was trying to parse a >> multiword name like >> >> Eric Edwin Abrahamsen > > Side note: the division between "given name" and "family name" is not > a universal property, so as a general rule I'd advise against trying to do > it (and treat the whole thing as just "the name" without trying to > analyze its structure) unless there's some strong external factor that > requires it. +1. Nothing annoys me more than a form that says "First name", "Surname", and "Last name": I don't have a last name, just an initial. I usually put my father's "first name" as my last name when there's an absolute need to but then all the mails addressed to me make *zero* sense since they are addressed as Dear <Father's "first name"> rather than Dear Visuwesh which always makes me doubt that I got my father's mail *somehow* instead. (Side side note: every single time my family needs to fill up a form, we have a ten minute meeting about what to do with the first-name-last-name situation; it is not fun as you can imagine.) I sighed a breath of relief when the FSF CA form did not have anything like "last name". [ Ever since I came to the university campus, explaining that I have no "last name" has been a recurring and fun activity. ] ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2022-11-08 15:08 ` Visuwesh @ 2022-11-08 16:29 ` Juanma Barranquero 2022-12-02 20:20 ` Augusto Stoffel 0 siblings, 1 reply; 100+ messages in thread From: Juanma Barranquero @ 2022-11-08 16:29 UTC (permalink / raw) To: Visuwesh; +Cc: Stefan Monnier, Eric Abrahamsen, Ihor Radchenko, emacs-devel [-- Attachment #1: Type: text/plain, Size: 913 bytes --] > [ Ever since I came to the university campus, explaining that I have no > "last name" has been a recurring and fun activity. ] My name, following Spanish uses, is "Juan Manuel" (name) "Barranquero Ríos" (two surnames). It's uncommon here to refer to someone by the two surnames, other than in specific situations. And I don't like my double name, so I *always* use Juanma, except in official documents. I introduce myself as Juanma Barranquero and that's how I self-identify. ...Except that I went to Buenos Aires, coming from São Paulo, and I don't know what did the travel agency assume about my origins. All I know is that I was in a hotel lobby and suddenly they called for a "Mr. Juan Ríos" and I thought for a moment "curious, that guy's got the same name as my maternal grandfather"... Until it dawned on me, a few seconds later, that *I* was supposed to be "Mr. Juan Ríos". [-- Attachment #2: Type: text/html, Size: 1681 bytes --] ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2022-11-08 16:29 ` Juanma Barranquero @ 2022-12-02 20:20 ` Augusto Stoffel 0 siblings, 0 replies; 100+ messages in thread From: Augusto Stoffel @ 2022-12-02 20:20 UTC (permalink / raw) To: Juanma Barranquero Cc: Visuwesh, Stefan Monnier, Eric Abrahamsen, Ihor Radchenko, emacs-devel On Tue, 8 Nov 2022 at 17:29, Juanma Barranquero wrote: > ...Except that I went to Buenos Aires, coming from São Paulo, and I don't know what did > the travel agency assume about my origins. All I know is that I was in a hotel lobby and > suddenly they called for a "Mr. Juan Ríos" and I thought for a moment "curious, that guy's > got the same name as my maternal grandfather"... Until it dawned on me, a few seconds > later, that *I* was supposed to be "Mr. Juan Ríos". That's because the Portuguese system is similar to the Spanish one, except that the order of names is reversed. Typically, one's last last name comes from the paternal side, etc. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2022-11-08 14:01 ` Stefan Monnier 2022-11-08 14:42 ` tomas 2022-11-08 15:08 ` Visuwesh @ 2022-11-08 16:10 ` Eric Abrahamsen 2022-11-08 18:59 ` tomas 2 siblings, 1 reply; 100+ messages in thread From: Eric Abrahamsen @ 2022-11-08 16:10 UTC (permalink / raw) To: emacs-devel Stefan Monnier <monnier@iro.umontreal.ca> writes: >> And that's about the only hint you get. I was trying to parse a >> multiword name like >> >> Eric Edwin Abrahamsen > > Side note: the division between "given name" and "family name" is not > a universal property, so as a general rule I'd advise against trying to do > it (and treat the whole thing as just "the name" without trying to > analyze its structure) unless there's some strong external factor that > requires it. Oh, I've gone down all the rabbit holes... EBDB doesn't force this, it distinguishes between "complex" and "simple" names, and also allows "complex" names that only have a given name, or a list of given names, etc. Input can be done explicitly, field by field, or you can just chunk a string in there and see what happens. This peg.el adventure is only about "seeing what happens" with complex names. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2022-11-08 16:10 ` Eric Abrahamsen @ 2022-11-08 18:59 ` tomas 2022-11-08 19:42 ` Eric Abrahamsen 0 siblings, 1 reply; 100+ messages in thread From: tomas @ 2022-11-08 18:59 UTC (permalink / raw) To: emacs-devel [-- Attachment #1: Type: text/plain, Size: 904 bytes --] On Tue, Nov 08, 2022 at 08:10:55AM -0800, Eric Abrahamsen wrote: > Stefan Monnier <monnier@iro.umontreal.ca> writes: > > >> And that's about the only hint you get. I was trying to parse a > >> multiword name like > >> > >> Eric Edwin Abrahamsen > > > > Side note: the division between "given name" and "family name" is not > > a universal property [...] > Oh, I've gone down all the rabbit holes... ;-D And this all because a small bunch of PEGs.., > EBDB doesn't force this, it > distinguishes between "complex" and "simple" names, and also allows > "complex" names that only have a given name, or a list of given names, > etc. Input can be done explicitly, field by field, or you can just chunk > a string in there and see what happens. This peg.el adventure is only > about "seeing what happens" with complex names. I think we all got that. Still... ;-) Cheers -- t [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2022-11-08 18:59 ` tomas @ 2022-11-08 19:42 ` Eric Abrahamsen 2022-11-08 22:03 ` Tim Cross 0 siblings, 1 reply; 100+ messages in thread From: Eric Abrahamsen @ 2022-11-08 19:42 UTC (permalink / raw) To: emacs-devel <tomas@tuxteam.de> writes: > On Tue, Nov 08, 2022 at 08:10:55AM -0800, Eric Abrahamsen wrote: >> Stefan Monnier <monnier@iro.umontreal.ca> writes: >> >> >> And that's about the only hint you get. I was trying to parse a >> >> multiword name like >> >> >> >> Eric Edwin Abrahamsen >> > >> > Side note: the division between "given name" and "family name" is not >> > a universal property [...] > >> Oh, I've gone down all the rabbit holes... > > ;-D > > And this all because a small bunch of PEGs.., Oh the rabbit holes started as soon as I started EBDB! Personal information is complicated -- I won't claim it's as bad as timezones and calendars, but it's pretty messy... ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Make peg.el a built-in library? 2022-11-08 19:42 ` Eric Abrahamsen @ 2022-11-08 22:03 ` Tim Cross 0 siblings, 0 replies; 100+ messages in thread From: Tim Cross @ 2022-11-08 22:03 UTC (permalink / raw) To: emacs-devel Eric Abrahamsen <eric@ericabrahamsen.net> writes: > <tomas@tuxteam.de> writes: > >> On Tue, Nov 08, 2022 at 08:10:55AM -0800, Eric Abrahamsen wrote: >>> Stefan Monnier <monnier@iro.umontreal.ca> writes: >>> >>> >> And that's about the only hint you get. I was trying to parse a >>> >> multiword name like >>> >> >>> >> Eric Edwin Abrahamsen >>> > >>> > Side note: the division between "given name" and "family name" is not >>> > a universal property [...] >> >>> Oh, I've gone down all the rabbit holes... >> >> ;-D >> >> And this all because a small bunch of PEGs.., > > Oh the rabbit holes started as soon as I started EBDB! Personal > information is complicated -- I won't claim it's as bad as timezones and > calendars, but it's pretty messy... Yes, a definite minefield. I worked in the identity management space for a few years and this was a constant challenge. As Stefan noted, there is nothing intrinsic about the name which tells you what case the letters should have, the relationship between first/last name, cultural differences - some locales don't have anything which corresponds to first/last and some vary the order depending on the context or have different names depending on the level of perceived formality etc. To make it even more difficult, oddly enough, names are very personal and people get upset when you get it wrong. Then you can add in things like title e.g. Mr, Mrs, Ms etc and you open the whole gender identity issue. Our general solution at the time was twofold - As far as possible, allow the user to specify how they wanted to be addressed or how their name was to be displayed 'on-line'. This may require formal and informal versions - Train/educate staff and developers to avoid unnecessary use of names, title etc. 
We also tried to avoid using culturally biased terms like 'surname' or even 'first name' 'last name' as this simply doesn't map to anything consistent for some locations. Where I found the wheels often dropped off was when the legal department got involved. My experience was they were the least culturally aware area in the organisation. Not only did they often fail to recognise external cultural differences, they were also slow to acknowledge internal cultural evolution. ^ permalink raw reply [flat|nested] 100+ messages in thread