unofficial mirror of guile-devel@gnu.org 
 help / color / mirror / Atom feed
* [PATCH] ice-9/peg: Extend PEG for production use.
@ 2020-05-11 13:41 Rutger van Beusekom
  2024-10-14  7:26 ` [PATCH v2 0/5] " Janneke Nieuwenhuizen
  0 siblings, 1 reply; 11+ messages in thread
From: Rutger van Beusekom @ 2020-05-11 13:41 UTC (permalink / raw)
  To: guile-devel

[-- Attachment #1: Type: text/plain, Size: 332 bytes --]

Hi,

First of all, I am aware I am submitting a sizeable patch here, and I am
happy to break it up if necessary.

Currently I am using these changes to the Parser Expression Grammar
library in GNU Guile for a production parser.

Hopefully you will find this interesting and useful. If so I would
appreciate your feedback.

Rutger.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-ice-9-peg-Extend-PEG-for-production-use.patch --]
[-- Type: text/x-diff, Size: 34651 bytes --]

From 091915cc0cfbe73333b2c486b254628f37d52bb1 Mon Sep 17 00:00:00 2001
From: Rutger van Beusekom <rutger.van.beusekom@verum.com>
Date: Mon, 11 May 2020 14:30:49 +0200
Subject: [PATCH] ice-9/peg: Extend PEG for production use.

* doc/ref/api-peg.texi: Document debug tracing, locations, skip parsing,
  expect parsing and fallback parsing.
* module/ice-9/peg.scm (ice-9): Re-export added parameters and macro.
* module/ice-9/peg/codegen.scm (%peg:debug?): Add exported parameter to
  toggle parser debug tracing.
(%peg:locations?): Add exported parameter to toggle adding locations to
the parse tree.
(%peg:skip?): Add exported parameter to pass a skip parser to the main
grammar.
(%peg:fallback?): Add exported parameter to toggle fallback parsing.
(%peg:error): Add exported parameter to handle fallback recovery.
(%continuation): Add parameter to maintain the parser productions
following the current production.
(format-error): New function which wraps (%peg:error).
(fallback-skip): New function which implements parser recovery.
(partial-match): New function to allow continuations to match only
partially.
(cg-and-int): Rewrite to add fallback parsing.
(cg-*): Rewrite to add fallback parsing.
(cg-+): Rewrite to add fallback parsing.
(cg-expect-int): New function implementing expect parsing interfals.
(cg-expect): New function implementing expect parsing.
(expect): Add expect parsing to PEG
(trace?): New function determining debug tracing verbosity.
(indent): New variable maintaining debug trace indentation level.
(wrap-parser-for-users): Add debug tracing, locations and skip parsing.
(define-skip-parser): New macro allows definging skip parsers.
* module/ice-9/peg/string-peg.scm (peg-as-peg): Add #
(peg-secondary->defn): Add # for expect parsing.
* test-suite/tests/peg.test: Test it.
---
 doc/ref/api-peg.texi            | 307 +++++++++++++++++++++++++++++---
 module/ice-9/peg.scm            |   9 +-
 module/ice-9/peg/codegen.scm    | 193 +++++++++++++++-----
 module/ice-9/peg/string-peg.scm |  10 +-
 test-suite/tests/peg.test       | 120 ++++++++++++-
 5 files changed, 562 insertions(+), 77 deletions(-)

diff --git a/doc/ref/api-peg.texi b/doc/ref/api-peg.texi
index 82e2758b4..811c4d216 100644
--- a/doc/ref/api-peg.texi
+++ b/doc/ref/api-peg.texi
@@ -159,6 +159,22 @@ Would be:
  (+ "e"))
 @end lisp
 
+@deftp {PEG Pattern} {expect} a
+Expect to parse @var{a}.  Upon failure, throw a @code{'syntax-error}
+including its offset position and its symbol.
+
+@code{"a#"}
+
+@code{(expect a)}
+@end deftp
+
+Example:
+
+@example
+"a b# (c / d)#"
+@end example
+
+
 @subsubheading Extended Syntax
 
 There is some extra syntax for S-expressions.
@@ -221,11 +237,11 @@ character (in normal PEGs nonterminals can only be alphabetic).
 
 For example, if we:
 @lisp
-(define-peg-string-patterns 
+(define-peg-string-patterns
   "as <- 'a'+
 bs <- 'b'+
 as-or-bs <- as/bs")
-(define-peg-string-patterns 
+(define-peg-string-patterns
   "as-tag <-- 'a'+
 bs-tag <-- 'b'+
 as-or-bs-tag <-- as-tag/bs-tag")
@@ -271,9 +287,9 @@ For Example, if we:
 @end lisp
 Then:
 @lisp
-(match-pattern as-or-bs "aabbcc") @result{} 
+(match-pattern as-or-bs "aabbcc") @result{}
 #<peg start: 0 end: 2 string: aabbcc tree: aa>
-(match-pattern as-or-bs-tag "aabbcc") @result{} 
+(match-pattern as-or-bs-tag "aabbcc") @result{}
 #<peg start: 0 end: 2 string: aabbcc tree: (as-or-bs-tag (as-tag aa))>
 @end lisp
 
@@ -325,7 +341,7 @@ find a valid substring starting at index 0 and @code{search-for-pattern} keeps
 looking.  They are both equally capable of ``parsing'' and ``matching''
 given those constraints.
 
-@deffn {Scheme Procedure} match-pattern nonterm string 
+@deffn {Scheme Procedure} match-pattern nonterm string
 Parses @var{string} using the PEG stored in @var{nonterm}.  If no match
 was found, @code{match-pattern} returns false.  If a match was found, a PEG
 match record is returned.
@@ -344,19 +360,19 @@ nothing
 
 @lisp
 (define-peg-pattern as all (+ "a"))
-(match-pattern as "aabbcc") @result{} 
+(match-pattern as "aabbcc") @result{}
 #<peg start: 0 end: 2 string: aabbcc tree: (as aa)>
 
 (define-peg-pattern as body (+ "a"))
-(match-pattern as "aabbcc") @result{} 
+(match-pattern as "aabbcc") @result{}
 #<peg start: 0 end: 2 string: aabbcc tree: aa>
 
 (define-peg-pattern as none (+ "a"))
-(match-pattern as "aabbcc") @result{} 
+(match-pattern as "aabbcc") @result{}
 #<peg start: 0 end: 2 string: aabbcc tree: ()>
 
 (define-peg-pattern bs body (+ "b"))
-(match-pattern bs "aabbcc") @result{} 
+(match-pattern bs "aabbcc") @result{}
 #f
 @end lisp
 @end deffn
@@ -371,31 +387,31 @@ was found, a PEG match record is returned.
 
 @lisp
 (define-peg-pattern as body (+ "a"))
-(search-for-pattern as "aabbcc") @result{} 
+(search-for-pattern as "aabbcc") @result{}
 #<peg start: 0 end: 2 string: aabbcc tree: aa>
-(search-for-pattern (+ "a") "aabbcc") @result{} 
+(search-for-pattern (+ "a") "aabbcc") @result{}
 #<peg start: 0 end: 2 string: aabbcc tree: aa>
-(search-for-pattern "'a'+" "aabbcc") @result{} 
+(search-for-pattern "'a'+" "aabbcc") @result{}
 #<peg start: 0 end: 2 string: aabbcc tree: aa>
 
 (define-peg-pattern as all (+ "a"))
-(search-for-pattern as "aabbcc") @result{} 
+(search-for-pattern as "aabbcc") @result{}
 #<peg start: 0 end: 2 string: aabbcc tree: (as aa)>
 
 (define-peg-pattern bs body (+ "b"))
-(search-for-pattern bs "aabbcc") @result{} 
+(search-for-pattern bs "aabbcc") @result{}
 #<peg start: 2 end: 4 string: aabbcc tree: bb>
-(search-for-pattern (+ "b") "aabbcc") @result{} 
+(search-for-pattern (+ "b") "aabbcc") @result{}
 #<peg start: 2 end: 4 string: aabbcc tree: bb>
-(search-for-pattern "'b'+" "aabbcc") @result{} 
+(search-for-pattern "'b'+" "aabbcc") @result{}
 #<peg start: 2 end: 4 string: aabbcc tree: bb>
 
 (define-peg-pattern zs body (+ "z"))
-(search-for-pattern zs "aabbcc") @result{} 
+(search-for-pattern zs "aabbcc") @result{}
 #f
-(search-for-pattern (+ "z") "aabbcc") @result{} 
+(search-for-pattern (+ "z") "aabbcc") @result{}
 #f
-(search-for-pattern "'z'+" "aabbcc") @result{} 
+(search-for-pattern "'z'+" "aabbcc") @result{}
 #f
 @end lisp
 @end deffn
@@ -468,9 +484,9 @@ itself satisfies @var{tst}, @code{(list lst)} is returned (this is a
 flat list whose only element satisfies @var{tst}).
 
 @lisp
-(context-flatten (lambda (x) (and (number? (car x)) (= (car x) 1))) '(2 2 (1 1 (2 2)) (2 2 (1 1)))) @result{} 
+(context-flatten (lambda (x) (and (number? (car x)) (= (car x) 1))) '(2 2 (1 1 (2 2)) (2 2 (1 1)))) @result{}
 (2 2 (1 1 (2 2)) 2 2 (1 1))
-(context-flatten (lambda (x) (and (number? (car x)) (= (car x) 1))) '(1 1 (1 1 (2 2)) (2 2 (1 1)))) @result{} 
+(context-flatten (lambda (x) (and (number? (car x)) (= (car x) 1))) '(1 1 (1 1 (2 2)) (2 2 (1 1)))) @result{}
 ((1 1 (1 1 (2 2)) (2 2 (1 1))))
 @end lisp
 
@@ -624,7 +640,7 @@ symbol for strings)..
   (entry "nobody:x:65534:65534:nobody:/nonexistent:/bin/sh")
   (entry "messagebus:x:103:107::/var/run/dbus:/bin/false"))
 (peg:tree (match-pattern tag-passwd "one entry"))
-(tag-passwd 
+(tag-passwd
   (entry "one entry"))
 @end lisp
 
@@ -841,7 +857,7 @@ number <-- [0-9]+")
 		(apply next-func (car first))
 		(apply next-func (car rest)))
           ;; form (sum (((product ...) "+") ((product ...) "+")) (product ...))
-	  (car 
+	  (car
 	   (reduce ;; walk through the list and build a left-associative tree
 	    (lambda (l r)
 	      (list (list (cadr r) (car r) (apply next-func (car l)))
@@ -1035,3 +1051,248 @@ symbol function)}, where @code{symbol} is the symbol that will indicate
 a form of this type and @code{function} is the code generating function
 described above. The function @code{add-peg-compiler!} is exported from
 the @code{(ice-9 peg codegen)} module.
+
+@subsubheading Debug tracing
+
+Due to the backtracking nature of PEG, the parser result is @code{#f}
+when it cannot match the input text.  It proves to be a big pain
+determining whether the problem is actually in the input or in the
+grammar, especially when changing the grammar itself.  Setting the
+parameter @code{%peg:debug?} to @code{#t} enables debug tracing of all
+non-terminals, alternatively parameter @code{%peg:debug?} can be set to
+a list of non-terminal sysmbols.  Debug traring will make the PEG parser
+print for each selected non-terminal: its name, the input remaining, as
+well as the parse result.
+
+@lisp
+(use-modules (ice-9 peg))
+(use-modules (ice-9 pretty-print))
+
+(define-peg-string-patterns
+                                     ;; missing: / term
+  "expression <-- term (PLUS / MINUS) expression
+   term       <-- factor (DIV / MULT) term / factor
+   factor     <-- LPAR expression RPAR / number
+   number     <-  MINUS? [0-9]+
+   PLUS       <-  '+'
+   MINUS      <-  '-'
+   DIV        <-  '/'
+   MULT       <-  '*'
+   LPAR       <  '('
+   RPAR       <  ')'")
+
+(parameterize ((%peg:debug? #t))
+  (pretty-print (match-pattern expression "1+")))
+@result{}
+expression
+    term
+        factor
+            LPAR
+            number
+                MINUS
+            number := "1"	next: "+"
+        factor := "1"	next: "+"
+        DIV
+        MULT
+    term := "1"	next: "+"
+    PLUS
+    PLUS := "+"	next: ""
+    expression
+        term
+            factor
+                LPAR
+                number
+                    MINUS
+#f
+
+@end lisp
+
+@subsubheading Locations
+
+Having the corresponding location of non-terminal matches in the parse
+tree proves to be useful when providing the user with warning or error
+messages. Location generation in the parse tree is enabled by setting
+the parameter @code{%peg:locations?} to @code{#t}.
+
+Example:
+
+@lisp
+(parameterize ((%peg:locations? #t))
+  (and=> (match-pattern expression "1+2/3*4") peg:tree))
+@result{}
+(expression
+  (term (factor "1" (location 0 1)) (location 0 1))
+  "+"
+  (expression
+    (term (factor "2" (location 2 3))
+          "*"
+          (term (factor "3" (location 4 5))
+                "/"
+                (term (factor "4" (location 6 7)) (location 6 7))
+                (location 4 7))
+          (location 2 7))
+    (location 2 7))
+  (location 0 7))
+@end lisp
+
+@subsubheading Skip parsing
+
+To write a PEG parser for a whitespace invariant language or a language
+which includes line and block comments requires littering the grammar
+with whitespace and comment parser expressions, which not only violates
+the DRY principle, but is hard to get right by hand.
+
+To this end, the @code{%peg:skip?} parameter is available.  When setting
+this parameter to an @emph{unwrapped parser} using
+@code{define-skip-parser}, this parser will be interleaved
+with the @emph{main parser}.
+
+Example:
+
+@lisp
+(use-modules (ice-9 match))
+(use-modules (ice-9 peg))
+(use-modules (ice-9 pretty-print))
+
+(define-peg-string-patterns
+  "expression <-- term (PLUS / MINUS) expression / term
+   term       <-- factor (DIV / MULT) term / factor
+   factor     <-- LPAR expression RPAR / number
+   number     <-  MINUS? [0-9]+
+   PLUS       <-  '+'
+   MINUS      <-  '-'
+   DIV        <-  '/'
+   MULT       <-  '*'
+   LPAR       <  '('
+   RPAR       <  ')'")
+
+(define (transform e)
+  (match
+   e
+   (((? symbol?) lhs op rhs) `(,(transform op) ,(transform lhs) ,(transform rhs)))
+   (((? symbol?) lhs) (transform lhs))
+   ("+" '+)
+   ("-" '-)
+   ("*" '*)
+   ("/" '/)
+   ((? string?) (string->number e))))
+
+(pretty-print (transform (peg:tree (match-pattern expression "1/(2+3)*4"))))
+@result{}
+(/ 1 (* (+ 2 3) 4))
+
+(define-skip-parser peg-ws none (* (or " " "\t" "\n")))
+
+(parameterize ((%peg:skip? peg-ws))
+  (pretty-print (transform (peg:tree (match-pattern expression "1 / (2 + 3) * 4")))))
+@result{}
+(/ 1 (* (+ 2 3) 4))
+@end lisp
+
+@subsubheading Expect parsing
+
+The best thing about PEG is its backtracking nature giving it
+@emph{infinite} look ahead capability.  At the same time it severely
+limits the debugability of the grammar, as mentioned in debug tracing.
+The ability to short circuit backtracking is achieved by introducing the
+@code{#} operator to the PEG language.  Putting a @code{#} behind a
+terminal or non-terminal indicates that its parsing must succeed,
+otherwise an exception is thrown containing the current parser state
+providing a hook to produce informative parse errors.
+
+Example:
+
+@lisp
+(use-modules (ice-9 peg))
+
+(define-peg-string-patterns
+  "expression <-- term (PLUS / MINUS) expression# / term
+   term       <-- factor (DIV / MULT) term# / factor
+   factor     <-- LPAR expression# RPAR# / number#
+   number     <-  MINUS? [0-9]+
+   PLUS       <-  '+'
+   MINUS      <-  '-'
+   DIV        <-  '/'
+   MULT       <-  '*'
+   LPAR       <  '('
+   RPAR       <  ')'")
+
+(let ((input "1/(2+var)*4"))
+  (catch 'syntax-error
+        (lambda _ (peg:tree (match-pattern expression input)))
+        (lambda (key args) (format #t "at position ~a, expected a ~a, not: ~s\n"
+                                   (car args)
+                                   (cadr args)
+                                   (substring input (car args))))))
+@result{}
+at position 5, expected a number, not: "var)*4"
+@end lisp
+
+@subsubheading Fallback parsing
+
+A natural extension to expect parsing is fallback parsing. Which
+is implemented by catching the exception thrown by the expect operator.
+The parser will catch the exception in the peg @code{and}, @code{+} and
+@code{*} operator to attempt to recover its state by eating away at the
+input until it runs out of input or until one of the grammar
+continuations matches and it will continue parsing from there.
+
+By setting parameter @code{%peg:fallback?} to @code{#t} fallback parsing
+will be enabled when it encouters a mismatch in the input on an expect
+production. Parameter @code{%peg:error} is used to handle the error and
+it receives the position, the input and a list representing the error as
+pair. When the @code{car} is @code{'skipped}, the @code{cdr} is the
+substring which is ignored by the parser. When the @code{car} is
+@code{'missing} the @code{cdr} is the parser snippet that was expected,
+but is missing in the input.
+
+Example:
+
+@lisp
+(use-modules (ice-9 peg))
+(use-modules (ice-9 pretty-print))
+
+(define-peg-string-patterns
+  "expression <-- term (PLUS / MINUS) expression# / term
+   term       <-- factor (DIV / MULT) term# / factor
+   factor     <-- LPAR expression RPAR# / number
+   number     <-  MINUS? [0-9]+
+   PLUS       <-  '+'
+   MINUS      <-  '-'
+   DIV        <-  '/'
+   MULT       <-  '*'
+   LPAR       <   '('
+   RPAR       <   ')'")
+
+(define (handle-error pos str error)
+  (format #t "at postion ~a, ~a: \"~a\"\n"
+          pos (car error) (cdr error)))
+
+(let ((input "1/(2+var)*4"))
+  (parameterize ((%peg:fallback? #t)
+                 (%peg:error handle-error))
+    (pretty-print (peg:tree (match-pattern expression input)))))
+@result{}
+at postion 5, skipped: "var)"
+(expression
+  (term (factor "1")
+        "/"
+        (term (factor
+                (expression (term (factor "2")) "+" expression))
+              "*"
+              (term (factor "4")))))
+
+(let ((input "1/(2+)*4"))
+  (parameterize ((%peg:fallback? #t)
+                 (%peg:error handle-error))
+    (pretty-print (peg:tree (match-pattern expression input)))))
+@result{}
+at postion 5, missing: "expression"
+(expression
+  (term (factor "1")
+        "/"
+        (term (factor
+                (expression (term (factor "2")) "+" expression))
+              "*"
+              (term (factor "4")))))
+@end lisp
diff --git a/module/ice-9/peg.scm b/module/ice-9/peg.scm
index 4e03131cd..ca7d091bf 100644
--- a/module/ice-9/peg.scm
+++ b/module/ice-9/peg.scm
@@ -38,5 +38,10 @@
                peg:string
                peg:tree
                peg:substring
-               peg-record?))
-
+               peg-record?
+               %peg:debug?
+               %peg:fallback?
+               %peg:locations?
+               %peg:skip?
+               %peg:error
+               define-skip-parser))
diff --git a/module/ice-9/peg/codegen.scm b/module/ice-9/peg/codegen.scm
index d80c3e849..fcfb71422 100644
--- a/module/ice-9/peg/codegen.scm
+++ b/module/ice-9/peg/codegen.scm
@@ -18,10 +18,27 @@
 ;;;;
 
 (define-module (ice-9 peg codegen)
-  #:export (compile-peg-pattern wrap-parser-for-users add-peg-compiler!)
+  #:export (compile-peg-pattern
+            wrap-parser-for-users
+            add-peg-compiler!
+            define-skip-parser
+            %peg:debug?
+            %peg:locations?
+            %peg:skip?
+            %peg:fallback?
+            %peg:error)
+  #:use-module (srfi srfi-1)
   #:use-module (ice-9 pretty-print)
   #:use-module (system base pmatch))
 
+(define %peg:debug? (make-parameter #f))
+(define %peg:locations? (make-parameter #f))
+(define %peg:skip? (make-parameter (lambda (str strlen at) `(,at ()))))
+(define %peg:fallback? (make-parameter #f))
+(define %peg:error (make-parameter (lambda (pos str error) #f)))
+
+(define %continuation (make-parameter (lambda (str streln at) #f)))
+
 (define-syntax single?
   (syntax-rules ()
     "Return #t if X is a list of one element."
@@ -162,6 +179,38 @@ return EXP."
    ((eq? accum 'none) 'none)))
 (define baf builtin-accum-filter)
 
+(define (format-error missing str)
+  (lambda (from to)
+    (unless (and (< from to)
+                 (string-every char-set:whitespace str from (1+ to)))
+      ((%peg:error) from str
+       (if (< from to) (cons 'skipped (substring str from (1+ to)))
+           (cons 'missing missing))))))
+
+(define* (fallback-skip kernel #:optional sequence?)
+  (if (not (%peg:fallback?)) kernel
+      (lambda (str strlen start)
+        (catch 'syntax-error
+               (lambda _
+                 (cond ((or #t (< start strlen)) (kernel str strlen start))
+                       ((not sequence?) `(,strlen ()))
+                       (else #f)))
+               (lambda (key . args)
+                 (let* ((expected (cadar args))
+                        (format-error (format-error expected str)))
+                   (let loop ((at start))
+                     (cond ((or (= at strlen) ((%continuation) str strlen at)) (format-error start at) (if sequence? `(,at ()) `(,at (,expected))))
+                           (else (or (let ((res (false-if-exception (kernel str strlen (1+ at))))) (when res(format-error start at)) res)
+                                     ;;if kernel matches, we have skipped over: (substring str start (1+ at)))
+                                     (loop (1+ at))))))))))))
+
+
+(define (partial-match kernel)
+  (lambda (str strlen at)
+    (catch #t
+           (lambda _ (kernel str strlen at))
+           (lambda (key . args) (and (< at (caar args)) (car args))))))
+
 ;; Top-level function builder for AND.  Reduces to a call to CG-AND-INT.
 (define (cg-and clauses accum)
   #`(lambda (str len pos)
@@ -174,8 +223,14 @@ return EXP."
     (()
      (cggr accum 'cg-and #`(reverse #,body) at))
     ((first rest ...)
-     #`(let ((res (#,(compile-peg-pattern #'first accum) #,str #,strlen #,at)))
-         (and res 
+     #`(let* ((next #,(cg-or #'(rest ...) 'body))
+              (kernel #,(compile-peg-pattern #'first accum))
+              (res (parameterize ((%continuation (let ((after-that (%continuation)))
+                                                   (lambda (str strlen at)
+                                                     (or ((partial-match next) str strlen at)
+                                                         ((partial-match after-that) str strlen at))))))
+                     ((fallback-skip kernel) #,str #,strlen #,at))))
+         (and res
               ;; update AT and BODY then recurse
               (let ((newat (car res))
                     (newbody (cadr res)))
@@ -200,42 +255,40 @@ return EXP."
 (define (cg-* args accum)
   (syntax-case args ()
     ((pat)
-     #`(lambda (str strlen at)
-         (let ((body '()))
-           (let lp ((end at) (count 0))
-             (let* ((match (#,(compile-peg-pattern #'pat (baf accum))
-                            str strlen end))
-                    (new-end (if match (car match) end))
-                    (count (if (> new-end end) (1+ count) count)))
-               (if (> new-end end)
-                   (push-not-null! body (single-filter (cadr match))))
-               (if (and (> new-end end)
-                        #,#t)
-                   (lp new-end count)
-                   (let ((success #,#t))
-                     #,#`(and success
-                                 #,(cggr (baf accum) 'cg-body
-                                         #'(reverse body) #'new-end)))))))))))
+     #`(let* ((kernel #,(compile-peg-pattern #'pat (baf accum)))
+              (kleene (lambda (str strlen at)
+                        (let ((body '()))
+                          (let lp ((end at) (count 0))
+                            (let* ((match ((fallback-skip kernel #t) str strlen end))
+                                   (new-end (if match (car match) end))
+                                   (count (if (> new-end end) (1+ count) count)))
+                              (when (> new-end end)
+                                (push-not-null! body (single-filter (cadr match))))
+                              (if (and (> new-end end) #,#t) (lp new-end count)
+                                  (let ((success #,#t))
+                                    #,#`(and success
+                                             #,(cggr (baf accum) 'cg-body
+                                                     #'(reverse body) #'new-end))))))))))
+         kleene))))
 
 (define (cg-+ args accum)
   (syntax-case args ()
     ((pat)
-     #`(lambda (str strlen at)
-         (let ((body '()))
-           (let lp ((end at) (count 0))
-             (let* ((match (#,(compile-peg-pattern #'pat (baf accum))
-                            str strlen end))
-                    (new-end (if match (car match) end))
-                    (count (if (> new-end end) (1+ count) count)))
-               (if (> new-end end)
-                   (push-not-null! body (single-filter (cadr match))))
-               (if (and (> new-end end)
-                        #,#t)
-                   (lp new-end count)
-                   (let ((success #,#'(>= count 1)))
-                     #,#`(and success
-                                 #,(cggr (baf accum) 'cg-body
-                                         #'(reverse body) #'new-end)))))))))))
+     #`(let* ((kernel #,(compile-peg-pattern #'pat (baf accum)))
+              (multiple (lambda (str strlen at)
+                          (let ((body '()))
+                            (let lp ((end at) (count 0))
+                              (let* ((match ((fallback-skip kernel #t) str strlen end))
+                                     (new-end (if match (car match) end))
+                                     (count (if (> new-end end) (1+ count) count)))
+                                (when (> new-end end)
+                                  (push-not-null! body (single-filter (cadr match))))
+                                (if (and (> new-end end) #,#t) (lp new-end count)
+                                    (let ((success #,#'(>= count 1)))
+                                      #,#`(and success
+                                               #,(cggr (baf accum) 'cg-body
+                                                       #'(reverse body) #'new-end))))))))))
+         multiple))))
 
 (define (cg-? args accum)
   (syntax-case args ()
@@ -296,6 +349,16 @@ return EXP."
                                 #f
                                 #,(cggr (baf accum) 'cg-body #''() #'at)))))))))))
 
+(define (cg-expect-int clauses accum str strlen at)
+  (syntax-case clauses ()
+    ((pat)
+     #`(or (#,(compile-peg-pattern #'pat accum) #,str #,strlen #,at)
+           (throw 'syntax-error (list #,at (syntax->datum #'pat)))))))
+
+(define (cg-expect clauses accum)
+  #`(lambda (str len pos)
+      #,(cg-expect-int clauses (baf accum) #'str #'len #'pos)))
+
 ;; Association list of functions to handle different expressions as PEGs
 (define peg-compiler-alist '())
 
@@ -313,6 +376,7 @@ return EXP."
 (add-peg-compiler! '? cg-?)
 (add-peg-compiler! 'followed-by cg-followed-by)
 (add-peg-compiler! 'not-followed-by cg-not-followed-by)
+(add-peg-compiler! 'expect cg-expect)
 
 ;; Takes an arbitrary expressions and accumulation variable, then parses it.
 ;; E.g.: (compile-peg-pattern syntax '(and "abc" (or "-" (range #\a #\z))) 'all)
@@ -332,28 +396,65 @@ return EXP."
                                 "Not one of" (map car peg-compiler-alist)))))))
 
 ;; Packages the results of a parser
+
+(define (trace? symbol)
+  (cond ((pair? (%peg:debug?)) (memq symbol (%peg:debug?)))
+        ((or (null? (%peg:debug?)) (%peg:debug?)) #t)
+        (else #f)))
+
+(define indent 0)
+
 (define (wrap-parser-for-users for-syntax parser accumsym s-syn)
-   #`(lambda (str strlen at)
-      (let ((res (#,parser str strlen at)))
+  #`(lambda (str strlen at)
+      (when (trace? '#,s-syn)
+        (format (current-error-port) "~a~a\n"
+                (make-string indent #\space)
+                '#,s-syn))
+      (set! indent (+ indent 4))
+      (let* ((comment-res ((%peg:skip?) str strlen at))
+             (comment-loc (and (%peg:locations?) comment-res `(location ,at ,(car comment-res))))
+             (at (or (and comment-res (car comment-res)) at))
+             (res (#,parser str strlen at)))
+        (set! indent (- indent 4))
+        (let ((pos (or (and res (car res)) 0)))
+          (when (and (trace? '#,s-syn) (< at pos))
+            (format (current-error-port) "~a~a := ~s\tnext: ~s\n"
+                    (make-string indent #\space)
+                    '#,s-syn
+                    (substring str at pos)
+                    (substring str pos (min strlen (+ pos 10))))))
         ;; Try to match the nonterminal.
         (if res
             ;; If we matched, do some post-processing to figure out
             ;; what data to propagate upward.
-            (let ((at (car res))
-                  (body (cadr res)))
+            (let* ((body (cadr res))
+                   (loc `(location ,at ,(car res)))
+                   (annotate (if (not (%peg:locations?)) '()
+                                 (if (null? (cadr comment-res)) `(,loc)
+                                     `((comment ,(cdr comment-res) ,comment-loc) ,loc))))
+                   (at (car res)))
               #,(cond
                  ((eq? accumsym 'name)
-                  #`(list at '#,s-syn))
+                  #`(list at '#,s-syn ,@annotate))
                  ((eq? accumsym 'all)
                   #`(list (car res)
                           (cond
                            ((not (list? body))
-                            (list '#,s-syn body))
-                           ((null? body) '#,s-syn)
+                            `(,'#,s-syn ,body ,@annotate))
+                           ((null? body)
+                            `(,'#,s-syn ,@annotate))
                            ((symbol? (car body))
-                            (list '#,s-syn body))
-                           (else (cons '#,s-syn body)))))
-                 ((eq? accumsym 'none) #`(list (car res) '()))
-                 (else #`(begin res))))
+                            `(,'#,s-syn ,body ,@annotate))
+                           (else
+                            (cons '#,s-syn (append body annotate))))))
+                 ((eq? accumsym 'none) #``(,at () ,@annotate))
+                 (else #``(,at ,body ,@annotate))))
             ;; If we didn't match, just return false.
             #f))))
+
+(define-syntax define-skip-parser
+  (lambda (x)
+    (syntax-case x ()
+      ((_ sym accum pat)
+       (let* ((matchf (compile-peg-pattern #'pat (syntax->datum #'accum))))
+         #`(define sym #,matchf))))))
diff --git a/module/ice-9/peg/string-peg.scm b/module/ice-9/peg/string-peg.scm
index 45ed14bb1..e8af6bb3d 100644
--- a/module/ice-9/peg/string-peg.scm
+++ b/module/ice-9/peg/string-peg.scm
@@ -42,7 +42,8 @@
 pattern <-- alternative (SLASH sp alternative)*
 alternative <-- ([!&]? sp suffix)+
 suffix <-- primary ([*+?] sp)*
-primary <-- '(' sp pattern ')' sp / '.' sp / literal / charclass / nonterminal !'<'
+primary <-- secondary ([#] sp)?
+secondary <-- '(' sp pattern ')' sp / '.' sp / literal / charclass / nonterminal !'<'
 literal <-- ['] (!['] .)* ['] sp
 charclass <-- LB (!']' (CCrange / CCsingle))* RB sp
 CCrange <-- . '-' .
@@ -73,6 +74,8 @@ RB < ']'
 (define-sexp-parser peg-suffix all
   (and peg-primary (* (and (or "*" "+" "?") peg-sp))))
 (define-sexp-parser peg-primary all
+  (and peg-secondary (? (and "#" peg-sp))))
+(define-sexp-parser peg-secondary all
   (or (and "(" peg-sp peg-pattern ")" peg-sp)
       (and "." peg-sp)
       peg-literal
@@ -188,6 +191,11 @@ RB < ']'
 
 ;; Parse a primary.
 (define (peg-primary->defn lst for-syntax)
+  (let ((inner-defn (peg-secondary->defn (cadr lst) for-syntax)))
+    (if (and (pair? (cddr lst)) (equal? (caddr lst) "#")) #`(expect #,inner-defn)
+        inner-defn)))
+
+(define (peg-secondary->defn lst for-syntax)
   (let ((el (cadr lst)))
   (cond
    ((list? el)
diff --git a/test-suite/tests/peg.test b/test-suite/tests/peg.test
index f516571e8..3f2ea70e1 100644
--- a/test-suite/tests/peg.test
+++ b/test-suite/tests/peg.test
@@ -1,5 +1,5 @@
+;;;;; PEG test suite. -*- scheme -*-
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
-;;;;; PEG test suite.
 ;; Tests the parsing capabilities of (ice-9 peg).  Could use more
 ;; tests for edge cases.
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
@@ -33,6 +33,7 @@
     (alternative peg-alternative)
     (suffix peg-suffix)
     (primary peg-primary)
+    (secondary peg-secondary)
     (literal peg-literal)
     (charclass peg-charclass)
     (CCrange charclass-range)
@@ -63,13 +64,12 @@
   (pass-if
    "defining PEGs with PEG"
    (and (eeval `(define-peg-string-patterns ,(@@ (ice-9 peg) peg-as-peg))) #t))
-  (pass-if
+  (pass-if-equal
    "equivalence of definitions"
-   (equal?
-    (peg:tree (match-pattern (@@ (ice-9 peg) peg-grammar) (@@ (ice-9 peg) peg-as-peg)))
+      (peg:tree (match-pattern (@@ (ice-9 peg) peg-grammar) (@@ (ice-9 peg) peg-as-peg)))
     (tree-map
      grammar-transform
-     (peg:tree (match-pattern grammar (@@ (ice-9 peg) peg-as-peg)))))))
+     (peg:tree (match-pattern grammar (@@ (ice-9 peg) peg-as-peg))))))
 
 ;; A grammar for pascal-style comments from Wikipedia.
 (define comment-grammar
@@ -276,3 +276,113 @@ number <-- [0-9]+")
    (equal? (eq-parse "1+1/2*3+(1+1)/2")
 	   '(+ (+ 1 (* (/ 1 2) 3)) (/ (+ 1 1) 2)))))
 
+(define-peg-string-patterns
+  "trace-grammar <-- foo bar* baz
+foo <-- 'foo'
+bar <-- 'bar'
+baz <-- 'baz'")
+
+(use-modules (ice-9 peg codegen)) ; %peg:debug?
+(with-test-prefix "Parse tracing"
+  (pass-if-equal
+   "trace"
+"trace-grammar
+    foo
+    foo := \"foo\"	next: \"baz\"
+    bar
+    baz
+    baz := \"baz\"	next: \"\"
+trace-grammar := \"foobaz\"	next: \"\"
+"
+   (parameterize ((%peg:debug? #t))
+      (with-error-to-string
+        (lambda _ (and=> (match-pattern trace-grammar "foobaz")
+                         peg:tree))))))
+
+(define-peg-string-patterns
+  "expect-grammar <-- one# two# three#
+one <-- 'one'
+two <-- 'two'
+three <-- 'three'"
+)
+
+(with-test-prefix "Parsing expect"
+  (pass-if-equal "expect okay"
+      '(expect-grammar (one "one") (two "two") (three "three"))
+      (and=> (match-pattern expect-grammar "onetwothree")
+             peg:tree))
+  (pass-if-equal "expect one"
+      '(syntax-error (0 one))
+    (catch 'syntax-error
+      (lambda _
+        (and=> (match-pattern expect-grammar "twothree")
+               peg:tree))
+      (lambda args args)))
+  (pass-if-equal "expect two"
+      '(syntax-error (3 two))
+    (catch 'syntax-error
+      (lambda _
+        (and=> (match-pattern expect-grammar "onethree")
+               peg:tree))
+      (lambda args args))))
+
+(define program-text "
+/*
+ CopyLeft (L) Acme
+*/
+foo // the first
+bar
+	bar
+baz
+")
+
+(define-skip-parser peg-ws all (or " " "\t"))
+(define-skip-parser peg-eol all (or "\r" "\n"))
+(define-skip-parser peg-line all (and "//" (* (and (not-followed-by peg-eol) peg-any))))
+(define-skip-parser peg-block all (and "/*" (* (or peg-block (and (not-followed-by "*/") peg-any))) (expect "*/")))
+(define-skip-parser peg-skip all (* (or peg-ws peg-eol peg-line peg-block)))
+
+(with-test-prefix "Skip parser"
+  (pass-if-equal "skip comments and whitespace"
+      '(trace-grammar (foo "foo") ((bar "bar") (bar "bar")) (baz "baz"))
+    (and=> (parameterize ((%peg:skip? peg-skip))
+             (match-pattern trace-grammar program-text))
+           peg:tree))
+  (pass-if-equal "preserve comments and whitespace"
+      '(trace-grammar (foo "foo" (location 26 29))
+                      ((bar "bar"
+                            (comment " // the first\n" (location 29 43))
+                            (location 43 46))
+                       (bar "bar"
+                            (comment "\n\t" (location 46 48))
+                            (location 48 51)))
+                      (baz "baz"
+                           (comment "\n" (location 51 52))
+                           (location 52 55))
+                      (comment "\n/*\n CopyLeft (L) Acme\n*/\n" (location 0 26))
+                      (location 26 55))
+    (and=> (parameterize ((%peg:skip? peg-skip)
+                          (%peg:locations? #t))
+             (match-pattern trace-grammar program-text))
+           peg:tree)))
+
+(with-test-prefix "Fallback parser"
+  (pass-if-equal "only one"
+      '(expect-grammar (one "one") two three)
+    (and=> (parameterize ((%peg:skip? peg-skip)
+                          (%peg:fallback? #t))
+             (match-pattern expect-grammar "one"))
+           peg:tree))
+  (pass-if-equal "no two"
+      '(expect-grammar (one "one") two (three "three"))
+    (and=> (parameterize ((%peg:skip? peg-skip)
+                          (%peg:fallback? #t))
+             (match-pattern expect-grammar "one three"))
+           peg:tree))
+  (pass-if-equal "missing one (skipping zero)"
+      '(expect-grammar (one (two "two") (three "three")))
+    (and=> (parameterize ((%peg:skip? peg-skip)
+                          (%peg:fallback? #t))
+             (match-pattern expect-grammar "zero two three"))
+           peg:tree)))
+
-- 
2.26.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2024-10-18  9:58 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-05-11 13:41 [PATCH] ice-9/peg: Extend PEG for production use Rutger van Beusekom
2024-10-14  7:26 ` [PATCH v2 0/5] " Janneke Nieuwenhuizen
2024-10-14  7:31   ` [PATCH v2 1/5] Remove trailing whitespace in PEG texinfo Janneke Nieuwenhuizen
2024-10-14  7:31     ` [PATCH v2 2/5] peg: Add debug tracing Janneke Nieuwenhuizen
2024-10-14  7:31     ` [PATCH v2 3/5] peg: Add expect Janneke Nieuwenhuizen
2024-10-14  7:31     ` [PATCH v2 4/5] peg: Add whitespace and comment skip parsers Janneke Nieuwenhuizen
2024-10-14  7:31     ` [PATCH v2 5/5] peg: Add fall-back parsing Janneke Nieuwenhuizen
2024-10-14  7:36       ` Janneke Nieuwenhuizen
2024-10-14 15:58         ` Ekaitz Zarraga
2024-10-16  6:44           ` Rutger van Beusekom
2024-10-18  9:58             ` Ekaitz Zarraga

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).