unofficial mirror of bug-guile@gnu.org 
 help / color / mirror / Atom feed
* bug#73188: PEG parser does not support full PEG grammar
@ 2024-09-11 22:03 Ekaitz Zarraga
  2024-09-12 20:57 ` bug#73188: [PATCH v2] PEG: Add full support for PEG + some extensions Ekaitz Zarraga
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Ekaitz Zarraga @ 2024-09-11 22:03 UTC (permalink / raw)
  To: 73188

[-- Attachment #1: Type: text/plain, Size: 565 bytes --]

Hi,

I noticed the PEG parser does not support the full PEG grammar, this 
includes comments, double quotes, escaping and underscores.

The attached patch adds support for those (I sent it to guile-devel but 
I guess it's better to do it here).

I'm trying to test it with Zig's grammar but I'm having some issues and 
the way this is generated it's really hard to test. As is, this patch 
passes all the current tests in Guile which includes being able to parse 
peg-as-peg, which now has comments and features that were not 
implemented before.

Thank you,
Ekaitz

[-- Attachment #2: 0001-PEG-Add-full-support-for-PEG-some-extensions.patch --]
[-- Type: text/x-patch, Size: 19155 bytes --]

From 2eef66ffdfc0728a84fa1e8a37b2a748bf464324 Mon Sep 17 00:00:00 2001
From: Ekaitz Zarraga <ekaitz@elenq.tech>
Date: Wed, 11 Sep 2024 21:19:26 +0200
Subject: [PATCH] PEG: Add full support for PEG + some extensions

This commit adds support for PEG as described in:

    <https://bford.info/pub/lang/peg.pdf>

It adds support for the missing features (comments, underscores in
identifiers and escaping) while keeping the extensions (dashes in
identifiers, < and <--).

The naming system tries to be as close as possible to the one proposed
in the paper.

* module/ice-9/peg/string-peg.scm: Rewrite PEG parser.
* test-suite/tests/peg.test: Fix import
---
 module/ice-9/peg/string-peg.scm | 410 +++++++++++++++++++-------------
 test-suite/tests/peg.test       |  32 ++-
 2 files changed, 263 insertions(+), 179 deletions(-)

diff --git a/module/ice-9/peg/string-peg.scm b/module/ice-9/peg/string-peg.scm
index 45ed14bb1..47202064b 100644
--- a/module/ice-9/peg/string-peg.scm
+++ b/module/ice-9/peg/string-peg.scm
@@ -1,6 +1,6 @@
 ;;;; string-peg.scm --- representing PEG grammars as strings
 ;;;;
-;;;; 	Copyright (C) 2010, 2011 Free Software Foundation, Inc.
+;;;; 	Copyright (C) 2010, 2011, 2023 Free Software Foundation, Inc.
 ;;;;
 ;;;; This library is free software; you can redistribute it and/or
 ;;;; modify it under the terms of the GNU Lesser General Public
@@ -22,6 +22,7 @@
             define-peg-string-patterns
             peg-grammar)
   #:use-module (ice-9 peg using-parsers)
+  #:use-module (srfi srfi-1)
   #:use-module (ice-9 peg codegen)
   #:use-module (ice-9 peg simplify-tree))
 
@@ -38,21 +39,55 @@
 
 ;; Grammar for PEGs in PEG grammar.
 (define peg-as-peg
-"grammar <-- (nonterminal ('<--' / '<-' / '<') sp pattern)+
-pattern <-- alternative (SLASH sp alternative)*
-alternative <-- ([!&]? sp suffix)+
-suffix <-- primary ([*+?] sp)*
-primary <-- '(' sp pattern ')' sp / '.' sp / literal / charclass / nonterminal !'<'
-literal <-- ['] (!['] .)* ['] sp
-charclass <-- LB (!']' (CCrange / CCsingle))* RB sp
-CCrange <-- . '-' .
-CCsingle <-- .
-nonterminal <-- [a-zA-Z0-9-]+ sp
-sp < [ \t\n]*
-SLASH < '/'
-LB < '['
-RB < ']'
-")
+"# Hierarchical syntax
+Grammar <-- Spacing Definition+ EndOfFile
+Definition <-- Identifier LEFTARROW Expression
+
+Expression <-- Sequence (SLASH Sequence)*
+Sequence <-- Prefix*
+Prefix <-- (AND / NOT)? Suffix
+Suffix <-- Primary (QUESTION / STAR / PLUS)?
+Primary <-- Identifier !LEFTARROW
+           / OPEN Expression CLOSE
+           / Literal / Class / DOT
+
+# Lexical syntax
+Identifier <-- IdentStart IdentCont* Spacing
+# NOTE: `-` is an extension
+IdentStart <- [a-zA-Z_-]
+IdentCont <- IdentStart / [0-9]
+
+Literal <-- SQUOTE (!SQUOTE Char)* SQUOTE Spacing
+        / DQUOTE (!DQUOTE Char)* DQUOTE Spacing
+Class <-- '[' (!']' Range)* ']' Spacing
+Range <-- Char '-' Char / Char
+Char <-- '\\\\' [nrt'\"\\[\\]\\\\]
+       / '\\\\' [0-2][0-7][0-7]
+       / '\\\\' [0-7][0-7]?
+       / !'\\\\' .
+
+# NOTE: `<--` and `<` are extensions
+LEFTARROW <- ('<--' / '<-' / '<') Spacing
+SQUOTE <-- [']
+DQUOTE <-- [\"]
+OPENBRACKET < '['
+CLOSEBRACKET < ']'
+SLASH < '/' Spacing
+AND <-- '&' Spacing
+NOT <-- '!' Spacing
+QUESTION <-- '?' Spacing
+STAR <-- '*' Spacing
+PLUS <-- '+' Spacing
+OPEN < '(' Spacing
+CLOSE < ')' Spacing
+DOT <-- '.' Spacing
+
+Spacing < (Space / Comment)*
+Comment < '#' (!EndOfLine .)* EndOfLine
+Space < ' ' / '\t' / EndOfLine
+EndOfLine < '\r\n' / '\n' / '\r'
+EndOfFile < !.")
+
 
 (define-syntax define-sexp-parser
   (lambda (x)
@@ -63,35 +98,78 @@ RB < ']'
               (syn (wrap-parser-for-users x matchf accumsym #'sym)))
            #`(define sym #,syn))))))
 
-(define-sexp-parser peg-grammar all
-  (+ (and peg-nonterminal (or "<--" "<-" "<") peg-sp peg-pattern)))
-(define-sexp-parser peg-pattern all
-  (and peg-alternative
-       (* (and (ignore "/") peg-sp peg-alternative))))
-(define-sexp-parser peg-alternative all
-  (+ (and (? (or "!" "&")) peg-sp peg-suffix)))
-(define-sexp-parser peg-suffix all
-  (and peg-primary (* (and (or "*" "+" "?") peg-sp))))
-(define-sexp-parser peg-primary all
-  (or (and "(" peg-sp peg-pattern ")" peg-sp)
-      (and "." peg-sp)
-      peg-literal
-      peg-charclass
-      (and peg-nonterminal (not-followed-by "<"))))
-(define-sexp-parser peg-literal all
-  (and "'" (* (and (not-followed-by "'") peg-any)) "'" peg-sp))
-(define-sexp-parser peg-charclass all
-  (and (ignore "[")
-       (* (and (not-followed-by "]")
-               (or charclass-range charclass-single)))
-       (ignore "]")
-       peg-sp))
-(define-sexp-parser charclass-range all (and peg-any "-" peg-any))
-(define-sexp-parser charclass-single all peg-any)
-(define-sexp-parser peg-nonterminal all
-  (and (+ (or (range #\a #\z) (range #\A #\Z) (range #\0 #\9) "-")) peg-sp))
-(define-sexp-parser peg-sp none
-  (* (or " " "\t" "\n")))
+(define-sexp-parser Grammar all
+  (and Spacing (+ Definition) EndOfFile))
+(define-sexp-parser Definition all
+  (and Identifier LEFTARROW Expression))
+(define-sexp-parser Expression all
+  (and Sequence (* (and SLASH Sequence))))
+(define-sexp-parser Sequence all
+  (* Prefix))
+(define-sexp-parser Prefix all
+  (and (? (or AND NOT)) Suffix))
+(define-sexp-parser Suffix all
+  (and Primary (? (or QUESTION STAR PLUS))))
+(define-sexp-parser Primary all
+  (or (and Identifier (not-followed-by LEFTARROW))
+      (and OPEN Expression CLOSE)
+      Literal
+      Class
+      DOT))
+(define-sexp-parser Identifier all
+  (and IdentStart (* IdentCont) Spacing))
+(define-sexp-parser IdentStart body
+  (or (range #\a #\z) (range #\A #\Z) "_" "-"))
+(define-sexp-parser IdentCont body
+  (or IdentStart (range #\0 #\9)))
+(define-sexp-parser Literal all
+  (or (and SQUOTE (* (and (not-followed-by SQUOTE) Char)) SQUOTE Spacing)
+      (and DQUOTE (* (and (not-followed-by DQUOTE) Char)) DQUOTE Spacing)))
+(define-sexp-parser Class all
+  (and OPENBRACKET (* (and (not-followed-by CLOSEBRACKET) Range)) CLOSEBRACKET Spacing))
+(define-sexp-parser Range all
+  (or (and Char DASH Char) Char))
+(define-sexp-parser Char all
+  (or (and "\\" (or "n" "r" "t" "'" "[" "]" "\\"))
+      (and "\\" (range #\0 #\2) (range #\0 #\7) (range #\0 #\7))
+      (and "\\" (range #\0 #\7) (? (range #\0 #\7)))
+      (and (not-followed-by "\\") peg-any)))
+(define-sexp-parser LEFTARROW body
+  (and (or "<--" "<-" "<") Spacing)) ; NOTE: <-- and < are extensions
+(define-sexp-parser SLASH none
+  (and "/" Spacing))
+(define-sexp-parser AND all
+  (and "&" Spacing))
+(define-sexp-parser NOT all
+  (and "!" Spacing))
+(define-sexp-parser QUESTION all
+  (and "?" Spacing))
+(define-sexp-parser STAR all
+  (and "*" Spacing))
+(define-sexp-parser PLUS all
+  (and "+" Spacing))
+(define-sexp-parser OPEN none
+  (and "(" Spacing))
+(define-sexp-parser CLOSE none
+  (and ")" Spacing))
+(define-sexp-parser DOT all
+  (and "." Spacing))
+(define-sexp-parser SQUOTE none "'")
+(define-sexp-parser DQUOTE none "\"")
+(define-sexp-parser OPENBRACKET none "[")
+(define-sexp-parser CLOSEBRACKET none "]")
+(define-sexp-parser DASH none "-")
+(define-sexp-parser Spacing none
+  (* (or Space Comment)))
+(define-sexp-parser Comment none
+  (and "#" (* (and (not-followed-by EndOfLine) peg-any)) EndOfLine))
+(define-sexp-parser Space none
+  (or " " "\t" EndOfLine))
+(define-sexp-parser EndOfLine none
+  (or "\r\n" "\n" "\r"))
+(define-sexp-parser EndOfFile none
+  (not-followed-by peg-any))
+
 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 ;;;;; PARSE STRING PEGS
@@ -101,7 +179,7 @@ RB < ']'
 ;; will define all of the nonterminals in the grammar with equivalent
 ;; PEG s-expressions.
 (define (peg-parser str for-syntax)
-  (let ((parsed (match-pattern peg-grammar str)))
+  (let ((parsed (match-pattern Grammar str)))
     (if (not parsed)
         (begin
           ;; (display "Invalid PEG grammar!\n")
@@ -110,11 +188,123 @@ RB < ']'
           (cond
            ((or (not (list? lst)) (null? lst))
             lst)
-           ((eq? (car lst) 'peg-grammar)
-            #`(begin
-                #,@(map (lambda (x) (peg-nonterm->defn x for-syntax))
-                        (context-flatten (lambda (lst) (<= (depth lst) 2))
-                                         (cdr lst))))))))))
+           ((eq? (car lst) 'Grammar)
+            (Grammar->defn lst for-syntax)))))))
+
+;; (Grammar (Definition ...) (Definition ...))
+(define (Grammar->defn lst for-syntax)
+  #`(begin
+      #,@(map (lambda (x) (Definition->defn x for-syntax))
+              (context-flatten (lambda (lst) (<= (depth lst) 1))
+                               (cdr lst)))))
+
+;; (Definition (Identifier "Something") "<-" (Expression ...))
+;;  `-> (define-peg-pattern Something 'all ...)
+(define (Definition->defn lst for-syntax)
+  (let ((identifier (second (second lst)))
+        (grabber    (third  lst))
+        (expression (fourth lst)))
+    #`(define-peg-pattern #,(datum->syntax for-syntax
+                                           (string->symbol identifier))
+       #,(cond
+          ((string=? grabber "<--") (datum->syntax for-syntax 'all))
+          ((string=? grabber "<-") (datum->syntax for-syntax 'body))
+          (else (datum->syntax for-syntax 'none)))
+       #,(compressor (Expression->defn expression for-syntax) for-syntax))))
+
+;; (Expression (Sequence X))
+;;  `-> (X)
+;; (Expression (Sequence X) (Sequence Y))
+;;  `-> (or X Y)
+;; (Expression (Sequence X) ((Sequence Y) (Sequence Z) ...))
+;;  `-> (or X Y Z ...)
+(define (Expression->defn lst for-syntax)
+  (let ((first-sequence (second lst))
+        (rest           (cddr  lst)))
+    #`(or #,(Sequence->defn first-sequence for-syntax)
+          #,@(map (lambda (x)
+                    (Sequence->defn x for-syntax))
+                  (keyword-flatten '(Sequence) rest)))))
+
+
+(define (Sequence->defn lst for-syntax)
+  #`(and #,@(map (lambda (x) (Prefix->defn x for-syntax)) (cdr lst))))
+
+
+;; (Prefix (Suffix ...))
+;;  `-> (...)
+;; (Prefix (NOT "!") (Suffix ...))
+;;  `-> (not-followed-by ...)
+;; (Prefix (AND "&") (Suffix ...))
+;;  `-> (followed-by ...)
+(define (Prefix->defn lst for-syntax)
+  (let ((suffix (second lst)))
+    (case (car suffix)
+      ('AND #`(followed-by #,(Suffix->defn (third lst) for-syntax)))
+      ('NOT #`(not-followed-by #,(Suffix->defn (third lst) for-syntax)))
+      (else (Suffix->defn suffix for-syntax)))))
+
+;; (Suffix (Primary ...))
+;;  `-> (...)
+;; (Suffix (Primary ...) (STAR "*"))
+;;  `-> (* ...)
+;; (Suffix (Primary ...) (QUESTION "?"))
+;;  `-> (? ...)
+;; (Suffix (Primary ...) (PLUS "+"))
+;;  `-> (+ ...)
+(define (Suffix->defn lst for-syntax)
+  (let* ((primary (second lst))
+         (out     (Primary->defn primary for-syntax))
+         (extra   (cddr lst)))
+    (if (null? extra)
+      out
+      (case (caar extra)
+        ('QUESTION #`(? #,out))
+        ('STAR     #`(* #,out))
+        ('PLUS     #`(+ #,out))))))
+
+(define (Primary->defn lst for-syntax)
+  (let ((value (second lst)))
+    (case (car value)
+      ('DOT        #'peg-any)
+      ('Identifier (Identifier->defn value for-syntax))
+      ('Expression (Expression->defn value for-syntax))
+      ('Literal    (Literal->defn value for-syntax))
+      ('Class      (Class->defn value for-syntax)))))
+
+;; (Identifier "hello")
+;;  `-> hello
+(define (Identifier->defn lst for-syntax)
+  (datum->syntax for-syntax (string->symbol (second lst))))
+
+;; (Literal (Char "a") (Char "b") (Char "c"))
+;;  `-> "abc"
+(define (Literal->defn lst for-syntax)
+  (apply string-append (map second (cdr lst))))
+
+;; TODO: empty Class can happen: `[]`, but what does it represent?
+;; (Class ...)
+;;  `-> (or ...)
+(define (Class->defn lst for-syntax)
+  #`(or #,@(map (lambda (x)
+                  (Range->defn x for-syntax))
+                (cdr lst))))
+
+;; For one character:
+;; (Range (Char "a"))
+;;  `-> "a"
+;; Or for a range:
+;; (Range (Char "a") (Char "b"))
+;;  `-> (range #\a #\b)
+(define (Range->defn lst for-syntax)
+  (cond
+    ((= 2 (length lst))
+     (second (second lst)))
+    ((= 3 (length lst))
+     #`(range #,(string-ref (second (second lst)) 0)
+              #,(string-ref (second (third lst)) 0)))))
+
+(define peg-grammar Grammar)
 
 ;; Macro wrapper for PEG-PARSER.  Parses PEG grammars expressed as strings and
 ;; defines all the appropriate nonterminals.
@@ -124,119 +314,6 @@ RB < ']'
       ((_ str)
        (peg-parser (syntax->datum #'str) x)))))
 
-;; lst has format (nonterm grabber pattern), where
-;;   nonterm is a symbol (the name of the nonterminal),
-;;   grabber is a string (either "<", "<-" or "<--"), and
-;;   pattern is the parse of a PEG pattern expressed as as string.
-(define (peg-nonterm->defn lst for-syntax)
-  (let* ((nonterm (car lst))
-         (grabber (cadr lst))
-         (pattern (caddr lst))
-         (nonterm-name (datum->syntax for-syntax
-                                      (string->symbol (cadr nonterm)))))
-    #`(define-peg-pattern #,nonterm-name
-       #,(cond
-          ((string=? grabber "<--") (datum->syntax for-syntax 'all))
-          ((string=? grabber "<-") (datum->syntax for-syntax 'body))
-          (else (datum->syntax for-syntax 'none)))
-       #,(compressor (peg-pattern->defn pattern for-syntax) for-syntax))))
-
-;; lst has format ('peg-pattern ...).
-;; After the context-flatten, (cdr lst) has format
-;;   (('peg-alternative ...) ...), where the outer list is a collection
-;;   of elements from a '/' alternative.
-(define (peg-pattern->defn lst for-syntax)
-  #`(or #,@(map (lambda (x) (peg-alternative->defn x for-syntax))
-                (context-flatten (lambda (x) (eq? (car x) 'peg-alternative))
-                                 (cdr lst)))))
-
-;; lst has format ('peg-alternative ...).
-;; After the context-flatten, (cdr lst) has the format
-;;   (item ...), where each item has format either ("!" ...), ("&" ...),
-;;   or ('peg-suffix ...).
-(define (peg-alternative->defn lst for-syntax)
-  #`(and #,@(map (lambda (x) (peg-body->defn x for-syntax))
-                 (context-flatten (lambda (x) (or (string? (car x))
-                                             (eq? (car x) 'peg-suffix)))
-                                  (cdr lst)))))
-
-;; lst has the format either
-;;   ("!" ('peg-suffix ...)), ("&" ('peg-suffix ...)), or
-;;     ('peg-suffix ...).
-(define (peg-body->defn lst for-syntax)
-    (cond
-      ((equal? (car lst) "&")
-       #`(followed-by #,(peg-suffix->defn (cadr lst) for-syntax)))
-      ((equal? (car lst) "!")
-       #`(not-followed-by #,(peg-suffix->defn (cadr lst) for-syntax)))
-      ((eq? (car lst) 'peg-suffix)
-       (peg-suffix->defn lst for-syntax))
-      (else `(peg-parse-body-fail ,lst))))
-
-;; lst has format ('peg-suffix <peg-primary> (? (/ "*" "?" "+")))
-(define (peg-suffix->defn lst for-syntax)
-  (let ((inner-defn (peg-primary->defn (cadr lst) for-syntax)))
-    (cond
-      ((null? (cddr lst))
-       inner-defn)
-      ((equal? (caddr lst) "*")
-       #`(* #,inner-defn))
-      ((equal? (caddr lst) "?")
-       #`(? #,inner-defn))
-      ((equal? (caddr lst) "+")
-       #`(+ #,inner-defn)))))
-
-;; Parse a primary.
-(define (peg-primary->defn lst for-syntax)
-  (let ((el (cadr lst)))
-  (cond
-   ((list? el)
-    (cond
-     ((eq? (car el) 'peg-literal)
-      (peg-literal->defn el for-syntax))
-     ((eq? (car el) 'peg-charclass)
-      (peg-charclass->defn el for-syntax))
-     ((eq? (car el) 'peg-nonterminal)
-      (datum->syntax for-syntax (string->symbol (cadr el))))))
-   ((string? el)
-    (cond
-     ((equal? el "(")
-      (peg-pattern->defn (caddr lst) for-syntax))
-     ((equal? el ".")
-      (datum->syntax for-syntax 'peg-any))
-     (else (datum->syntax for-syntax
-                          `(peg-parse-any unknown-string ,lst)))))
-   (else (datum->syntax for-syntax
-                        `(peg-parse-any unknown-el ,lst))))))
-
-;; Trims characters off the front and end of STR.
-;; (trim-1chars "'ab'") -> "ab"
-(define (trim-1chars str) (substring str 1 (- (string-length str) 1)))
-
-;; Parses a literal.
-(define (peg-literal->defn lst for-syntax)
-  (datum->syntax for-syntax (trim-1chars (cadr lst))))
-
-;; Parses a charclass.
-(define (peg-charclass->defn lst for-syntax)
-  #`(or
-     #,@(map
-         (lambda (cc)
-           (cond
-            ((eq? (car cc) 'charclass-range)
-             #`(range #,(datum->syntax
-                         for-syntax
-                         (string-ref (cadr cc) 0))
-                      #,(datum->syntax
-                         for-syntax
-                         (string-ref (cadr cc) 2))))
-            ((eq? (car cc) 'charclass-single)
-             (datum->syntax for-syntax (cadr cc)))))
-         (context-flatten
-          (lambda (x) (or (eq? (car x) 'charclass-range)
-                          (eq? (car x) 'charclass-single)))
-          (cdr lst)))))
-
 ;; Compresses a list to save the optimizer work.
 ;; e.g. (or (and a)) -> a
 (define (compressor-core lst)
@@ -263,11 +340,10 @@ RB < ']'
      (let ((string (syntax->datum #'str-stx)))
        (compile-peg-pattern
         (compressor
-         (peg-pattern->defn
-          (peg:tree (match-pattern peg-pattern string)) #'str-stx)
+         (Expression->defn
+          (peg:tree (match-pattern Expression string)) #'str-stx)
          #'str-stx)
         (if (eq? accum 'all) 'body accum))))
      (else (error "Bad embedded PEG string" args))))
 
 (add-peg-compiler! 'peg peg-string-compile)
-
diff --git a/test-suite/tests/peg.test b/test-suite/tests/peg.test
index f516571e8..556145e72 100644
--- a/test-suite/tests/peg.test
+++ b/test-suite/tests/peg.test
@@ -28,17 +28,25 @@
 ;; the nonterminals defined in the PEG parser written with
 ;; S-expressions.
 (define grammar-mapping
-  '((grammar peg-grammar)
-    (pattern peg-pattern)
-    (alternative peg-alternative)
-    (suffix peg-suffix)
-    (primary peg-primary)
-    (literal peg-literal)
-    (charclass peg-charclass)
-    (CCrange charclass-range)
-    (CCsingle charclass-single)
-    (nonterminal peg-nonterminal)
-    (sp peg-sp)))
+  '((Grammar Grammar)
+    (Definition Definition)
+    (Expression Expression)
+    (Sequence Sequence)
+    (Prefix Prefix)
+    (Suffix Suffix)
+    (Primary Primary)
+    (Identifier Identifier)
+    (Literal Literal)
+    (Class Class)
+    (Range Range)
+    (Char Char)
+    (LEFTARROW LEFTARROW)
+    (AND AND)
+    (NOT NOT)
+    (QUESTION QUESTION)
+    (STAR STAR)
+    (PLUS PLUS)
+    (DOT DOT)))
 
 ;; Transforms the nonterminals defined in the PEG parser written as a PEG to the nonterminals defined in the PEG parser written with S-expressions.
 (define (grammar-transform x)
@@ -69,7 +77,7 @@
     (peg:tree (match-pattern (@@ (ice-9 peg) peg-grammar) (@@ (ice-9 peg) peg-as-peg)))
     (tree-map
      grammar-transform
-     (peg:tree (match-pattern grammar (@@ (ice-9 peg) peg-as-peg)))))))
+     (peg:tree (match-pattern (@@ (ice-9 peg) peg-grammar) (@@ (ice-9 peg) peg-as-peg)))))))
 
 ;; A grammar for pascal-style comments from Wikipedia.
 (define comment-grammar
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* bug#73188: [PATCH v2] PEG: Add full support for PEG + some extensions
  2024-09-11 22:03 bug#73188: PEG parser does not support full PEG grammar Ekaitz Zarraga
@ 2024-09-12 20:57 ` Ekaitz Zarraga
  2024-10-13 20:29   ` bug#73188: PEG parser does not support full PEG grammar Ludovic Courtès
  2024-10-11 12:31 ` bug#73188: [PATCH] PEG: Add support for `not-in-range` and [^...] Ekaitz Zarraga
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Ekaitz Zarraga @ 2024-09-12 20:57 UTC (permalink / raw)
  To: 73188; +Cc: Ekaitz Zarraga

This commit adds support for PEG as described in:

    <https://bford.info/pub/lang/peg.pdf>

It adds support for the missing features (comments, underscores in
identifiers and escaping) while keeping the extensions (dashes in
identifiers, < and <--).

The naming system tries to be as close as possible to the one proposed
in the paper.

* module/ice-9/peg/string-peg.scm: Rewrite PEG parser.
* test-suite/tests/peg.test: Fix import
---

Previus version did not make proper character and escaping control.

 module/ice-9/peg/string-peg.scm | 438 ++++++++++++++++++++------------
 test-suite/tests/peg.test       |  32 ++-
 2 files changed, 292 insertions(+), 178 deletions(-)

diff --git a/module/ice-9/peg/string-peg.scm b/module/ice-9/peg/string-peg.scm
index 45ed14bb1..05755693a 100644
--- a/module/ice-9/peg/string-peg.scm
+++ b/module/ice-9/peg/string-peg.scm
@@ -1,6 +1,6 @@
 ;;;; string-peg.scm --- representing PEG grammars as strings
 ;;;;
-;;;; 	Copyright (C) 2010, 2011 Free Software Foundation, Inc.
+;;;; 	Copyright (C) 2010, 2011, 2023 Free Software Foundation, Inc.
 ;;;;
 ;;;; This library is free software; you can redistribute it and/or
 ;;;; modify it under the terms of the GNU Lesser General Public
@@ -22,6 +22,7 @@
             define-peg-string-patterns
             peg-grammar)
   #:use-module (ice-9 peg using-parsers)
+  #:use-module (srfi srfi-1)
   #:use-module (ice-9 peg codegen)
   #:use-module (ice-9 peg simplify-tree))
 
@@ -38,22 +39,57 @@
 
 ;; Grammar for PEGs in PEG grammar.
 (define peg-as-peg
-"grammar <-- (nonterminal ('<--' / '<-' / '<') sp pattern)+
-pattern <-- alternative (SLASH sp alternative)*
-alternative <-- ([!&]? sp suffix)+
-suffix <-- primary ([*+?] sp)*
-primary <-- '(' sp pattern ')' sp / '.' sp / literal / charclass / nonterminal !'<'
-literal <-- ['] (!['] .)* ['] sp
-charclass <-- LB (!']' (CCrange / CCsingle))* RB sp
-CCrange <-- . '-' .
-CCsingle <-- .
-nonterminal <-- [a-zA-Z0-9-]+ sp
-sp < [ \t\n]*
-SLASH < '/'
-LB < '['
-RB < ']'
+"# Hierarchical syntax
+Grammar <-- Spacing Definition+ EndOfFile
+Definition <-- Identifier LEFTARROW Expression
+
+Expression <-- Sequence (SLASH Sequence)*
+Sequence <-- Prefix*
+Prefix <-- (AND / NOT)? Suffix
+Suffix <-- Primary (QUESTION / STAR / PLUS)?
+Primary <-- Identifier !LEFTARROW
+           / OPEN Expression CLOSE
+           / Literal / Class / DOT
+
+# Lexical syntax
+Identifier <-- IdentStart IdentCont* Spacing
+# NOTE: `-` is an extension
+IdentStart <- [a-zA-Z_] / '-'
+IdentCont <- IdentStart / [0-9]
+
+Literal <-- SQUOTE (!SQUOTE Char)* SQUOTE Spacing
+        / DQUOTE (!DQUOTE Char)* DQUOTE Spacing
+Class <-- '[' (!']' Range)* ']' Spacing
+Range <-- Char '-' Char / Char
+Char <-- '\\\\' [nrt'\"\\[\\]\\\\]
+       / '\\\\' [0-7][0-7][0-7]
+       / '\\\\' [0-7][0-7]?
+       / !'\\\\' .
+
+# NOTE: `<--` and `<` are extensions
+LEFTARROW <- ('<--' / '<-' / '<') Spacing
+SQUOTE <-- [']
+DQUOTE <-- [\"]
+OPENBRACKET < '['
+CLOSEBRACKET < ']'
+SLASH < '/' Spacing
+AND <-- '&' Spacing
+NOT <-- '!' Spacing
+QUESTION <-- '?' Spacing
+STAR <-- '*' Spacing
+PLUS <-- '+' Spacing
+OPEN < '(' Spacing
+CLOSE < ')' Spacing
+DOT <-- '.' Spacing
+
+Spacing < (Space / Comment)*
+Comment < '#' (!EndOfLine .)* EndOfLine
+Space < ' ' / '\t' / EndOfLine
+EndOfLine < '\\r\\n' / '\\n' / '\\r'
+EndOfFile < !.
 ")
 
+
 (define-syntax define-sexp-parser
   (lambda (x)
     (syntax-case x ()
@@ -63,35 +99,78 @@ RB < ']'
               (syn (wrap-parser-for-users x matchf accumsym #'sym)))
            #`(define sym #,syn))))))
 
-(define-sexp-parser peg-grammar all
-  (+ (and peg-nonterminal (or "<--" "<-" "<") peg-sp peg-pattern)))
-(define-sexp-parser peg-pattern all
-  (and peg-alternative
-       (* (and (ignore "/") peg-sp peg-alternative))))
-(define-sexp-parser peg-alternative all
-  (+ (and (? (or "!" "&")) peg-sp peg-suffix)))
-(define-sexp-parser peg-suffix all
-  (and peg-primary (* (and (or "*" "+" "?") peg-sp))))
-(define-sexp-parser peg-primary all
-  (or (and "(" peg-sp peg-pattern ")" peg-sp)
-      (and "." peg-sp)
-      peg-literal
-      peg-charclass
-      (and peg-nonterminal (not-followed-by "<"))))
-(define-sexp-parser peg-literal all
-  (and "'" (* (and (not-followed-by "'") peg-any)) "'" peg-sp))
-(define-sexp-parser peg-charclass all
-  (and (ignore "[")
-       (* (and (not-followed-by "]")
-               (or charclass-range charclass-single)))
-       (ignore "]")
-       peg-sp))
-(define-sexp-parser charclass-range all (and peg-any "-" peg-any))
-(define-sexp-parser charclass-single all peg-any)
-(define-sexp-parser peg-nonterminal all
-  (and (+ (or (range #\a #\z) (range #\A #\Z) (range #\0 #\9) "-")) peg-sp))
-(define-sexp-parser peg-sp none
-  (* (or " " "\t" "\n")))
+(define-sexp-parser Grammar all
+  (and Spacing (+ Definition) EndOfFile))
+(define-sexp-parser Definition all
+  (and Identifier LEFTARROW Expression))
+(define-sexp-parser Expression all
+  (and Sequence (* (and SLASH Sequence))))
+(define-sexp-parser Sequence all
+  (* Prefix))
+(define-sexp-parser Prefix all
+  (and (? (or AND NOT)) Suffix))
+(define-sexp-parser Suffix all
+  (and Primary (? (or QUESTION STAR PLUS))))
+(define-sexp-parser Primary all
+  (or (and Identifier (not-followed-by LEFTARROW))
+      (and OPEN Expression CLOSE)
+      Literal
+      Class
+      DOT))
+(define-sexp-parser Identifier all
+  (and IdentStart (* IdentCont) Spacing))
+(define-sexp-parser IdentStart body
+  (or (range #\a #\z) (range #\A #\Z) "_" "-")) ; NOTE: - is an extension
+(define-sexp-parser IdentCont body
+  (or IdentStart (range #\0 #\9)))
+(define-sexp-parser Literal all
+  (or (and SQUOTE (* (and (not-followed-by SQUOTE) Char)) SQUOTE Spacing)
+      (and DQUOTE (* (and (not-followed-by DQUOTE) Char)) DQUOTE Spacing)))
+(define-sexp-parser Class all
+  (and OPENBRACKET (* (and (not-followed-by CLOSEBRACKET) Range)) CLOSEBRACKET Spacing))
+(define-sexp-parser Range all
+  (or (and Char DASH Char) Char))
+(define-sexp-parser Char all
+  (or (and "\\" (or "n" "r" "t" "'" "[" "]" "\\"))
+      (and "\\" (range #\0 #\7) (range #\0 #\7) (range #\0 #\7))
+      (and "\\" (range #\0 #\7) (? (range #\0 #\7)))
+      (and (not-followed-by "\\") peg-any)))
+(define-sexp-parser LEFTARROW body
+  (and (or "<--" "<-" "<") Spacing)) ; NOTE: <-- and < are extensions
+(define-sexp-parser SLASH none
+  (and "/" Spacing))
+(define-sexp-parser AND all
+  (and "&" Spacing))
+(define-sexp-parser NOT all
+  (and "!" Spacing))
+(define-sexp-parser QUESTION all
+  (and "?" Spacing))
+(define-sexp-parser STAR all
+  (and "*" Spacing))
+(define-sexp-parser PLUS all
+  (and "+" Spacing))
+(define-sexp-parser OPEN none
+  (and "(" Spacing))
+(define-sexp-parser CLOSE none
+  (and ")" Spacing))
+(define-sexp-parser DOT all
+  (and "." Spacing))
+(define-sexp-parser SQUOTE none "'")
+(define-sexp-parser DQUOTE none "\"")
+(define-sexp-parser OPENBRACKET none "[")
+(define-sexp-parser CLOSEBRACKET none "]")
+(define-sexp-parser DASH none "-")
+(define-sexp-parser Spacing none
+  (* (or Space Comment)))
+(define-sexp-parser Comment none
+  (and "#" (* (and (not-followed-by EndOfLine) peg-any)) EndOfLine))
+(define-sexp-parser Space none
+  (or " " "\t" EndOfLine))
+(define-sexp-parser EndOfLine none
+  (or "\r\n" "\n" "\r"))
+(define-sexp-parser EndOfFile none
+  (not-followed-by peg-any))
+
 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 ;;;;; PARSE STRING PEGS
@@ -101,7 +180,7 @@ RB < ']'
 ;; will define all of the nonterminals in the grammar with equivalent
 ;; PEG s-expressions.
 (define (peg-parser str for-syntax)
-  (let ((parsed (match-pattern peg-grammar str)))
+  (let ((parsed (match-pattern Grammar str)))
     (if (not parsed)
         (begin
           ;; (display "Invalid PEG grammar!\n")
@@ -110,132 +189,160 @@ RB < ']'
           (cond
            ((or (not (list? lst)) (null? lst))
             lst)
-           ((eq? (car lst) 'peg-grammar)
-            #`(begin
-                #,@(map (lambda (x) (peg-nonterm->defn x for-syntax))
-                        (context-flatten (lambda (lst) (<= (depth lst) 2))
-                                         (cdr lst))))))))))
+           ((eq? (car lst) 'Grammar)
+            (Grammar->defn lst for-syntax)))))))
 
-;; Macro wrapper for PEG-PARSER.  Parses PEG grammars expressed as strings and
-;; defines all the appropriate nonterminals.
-(define-syntax define-peg-string-patterns
-  (lambda (x)
-    (syntax-case x ()
-      ((_ str)
-       (peg-parser (syntax->datum #'str) x)))))
+;; (Grammar (Definition ...) (Definition ...))
+(define (Grammar->defn lst for-syntax)
+  #`(begin
+      #,@(map (lambda (x) (Definition->defn x for-syntax))
+              (context-flatten (lambda (lst) (<= (depth lst) 1))
+                               (cdr lst)))))
 
-;; lst has format (nonterm grabber pattern), where
-;;   nonterm is a symbol (the name of the nonterminal),
-;;   grabber is a string (either "<", "<-" or "<--"), and
-;;   pattern is the parse of a PEG pattern expressed as as string.
-(define (peg-nonterm->defn lst for-syntax)
-  (let* ((nonterm (car lst))
-         (grabber (cadr lst))
-         (pattern (caddr lst))
-         (nonterm-name (datum->syntax for-syntax
-                                      (string->symbol (cadr nonterm)))))
-    #`(define-peg-pattern #,nonterm-name
+;; (Definition (Identifier "Something") "<-" (Expression ...))
+;;  `-> (define-peg-pattern Something 'all ...)
+(define (Definition->defn lst for-syntax)
+  (let ((identifier (second (second lst)))
+        (grabber    (third  lst))
+        (expression (fourth lst)))
+    #`(define-peg-pattern #,(datum->syntax for-syntax
+                                           (string->symbol identifier))
        #,(cond
           ((string=? grabber "<--") (datum->syntax for-syntax 'all))
           ((string=? grabber "<-") (datum->syntax for-syntax 'body))
           (else (datum->syntax for-syntax 'none)))
-       #,(compressor (peg-pattern->defn pattern for-syntax) for-syntax))))
-
-;; lst has format ('peg-pattern ...).
-;; After the context-flatten, (cdr lst) has format
-;;   (('peg-alternative ...) ...), where the outer list is a collection
-;;   of elements from a '/' alternative.
-(define (peg-pattern->defn lst for-syntax)
-  #`(or #,@(map (lambda (x) (peg-alternative->defn x for-syntax))
-                (context-flatten (lambda (x) (eq? (car x) 'peg-alternative))
-                                 (cdr lst)))))
-
-;; lst has format ('peg-alternative ...).
-;; After the context-flatten, (cdr lst) has the format
-;;   (item ...), where each item has format either ("!" ...), ("&" ...),
-;;   or ('peg-suffix ...).
-(define (peg-alternative->defn lst for-syntax)
-  #`(and #,@(map (lambda (x) (peg-body->defn x for-syntax))
-                 (context-flatten (lambda (x) (or (string? (car x))
-                                             (eq? (car x) 'peg-suffix)))
-                                  (cdr lst)))))
-
-;; lst has the format either
-;;   ("!" ('peg-suffix ...)), ("&" ('peg-suffix ...)), or
-;;     ('peg-suffix ...).
-(define (peg-body->defn lst for-syntax)
-    (cond
-      ((equal? (car lst) "&")
-       #`(followed-by #,(peg-suffix->defn (cadr lst) for-syntax)))
-      ((equal? (car lst) "!")
-       #`(not-followed-by #,(peg-suffix->defn (cadr lst) for-syntax)))
-      ((eq? (car lst) 'peg-suffix)
-       (peg-suffix->defn lst for-syntax))
-      (else `(peg-parse-body-fail ,lst))))
-
-;; lst has format ('peg-suffix <peg-primary> (? (/ "*" "?" "+")))
-(define (peg-suffix->defn lst for-syntax)
-  (let ((inner-defn (peg-primary->defn (cadr lst) for-syntax)))
-    (cond
-      ((null? (cddr lst))
-       inner-defn)
-      ((equal? (caddr lst) "*")
-       #`(* #,inner-defn))
-      ((equal? (caddr lst) "?")
-       #`(? #,inner-defn))
-      ((equal? (caddr lst) "+")
-       #`(+ #,inner-defn)))))
-
-;; Parse a primary.
-(define (peg-primary->defn lst for-syntax)
-  (let ((el (cadr lst)))
+       #,(compressor (Expression->defn expression for-syntax) for-syntax))))
+
+;; (Expression (Sequence X))
+;;  `-> (X)
+;; (Expression (Sequence X) (Sequence Y))
+;;  `-> (or X Y)
+;; (Expression (Sequence X) ((Sequence Y) (Sequence Z) ...))
+;;  `-> (or X Y Z ...)
+(define (Expression->defn lst for-syntax)
+  (let ((first-sequence (second lst))
+        (rest           (cddr  lst)))
+    #`(or #,(Sequence->defn first-sequence for-syntax)
+          #,@(map (lambda (x)
+                    (Sequence->defn x for-syntax))
+                  (keyword-flatten '(Sequence) rest)))))
+
+
+(define (Sequence->defn lst for-syntax)
+  #`(and #,@(map (lambda (x) (Prefix->defn x for-syntax)) (cdr lst))))
+
+
+;; (Prefix (Suffix ...))
+;;  `-> (...)
+;; (Prefix (NOT "!") (Suffix ...))
+;;  `-> (not-followed-by ...)
+;; (Prefix (AND "&") (Suffix ...))
+;;  `-> (followed-by ...)
+(define (Prefix->defn lst for-syntax)
+  (let ((suffix (second lst)))
+    (case (car suffix)
+      ('AND #`(followed-by #,(Suffix->defn (third lst) for-syntax)))
+      ('NOT #`(not-followed-by #,(Suffix->defn (third lst) for-syntax)))
+      (else (Suffix->defn suffix for-syntax)))))
+
+;; (Suffix (Primary ...))
+;;  `-> (...)
+;; (Suffix (Primary ...) (STAR "*"))
+;;  `-> (* ...)
+;; (Suffix (Primary ...) (QUESTION "?"))
+;;  `-> (? ...)
+;; (Suffix (Primary ...) (PLUS "+"))
+;;  `-> (+ ...)
+(define (Suffix->defn lst for-syntax)
+  (let* ((primary (second lst))
+         (out     (Primary->defn primary for-syntax))
+         (extra   (cddr lst)))
+    (if (null? extra)
+      out
+      (case (caar extra)
+        ('QUESTION #`(? #,out))
+        ('STAR     #`(* #,out))
+        ('PLUS     #`(+ #,out))))))
+
+(define (Primary->defn lst for-syntax)
+  (let ((value (second lst)))
+    (case (car value)
+      ('DOT        #'peg-any)
+      ('Identifier (Identifier->defn value for-syntax))
+      ('Expression (Expression->defn value for-syntax))
+      ('Literal    (Literal->defn value for-syntax))
+      ('Class      (Class->defn value for-syntax)))))
+
+;; (Identifier "hello")
+;;  `-> hello
+(define (Identifier->defn lst for-syntax)
+  (datum->syntax for-syntax (string->symbol (second lst))))
+
+;; (Literal (Char "a") (Char "b") (Char "c"))
+;;  `-> "abc"
+(define (Literal->defn lst for-syntax)
+  (apply string (map (lambda (x) (Char->defn x for-syntax)) (cdr lst))))
+
+;; TODO: empty Class can happen: `[]`, but what does it represent?
+;; (Class ...)
+;;  `-> (or ...)
+(define (Class->defn lst for-syntax)
+  #`(or #,@(map (lambda (x) (Range->defn x for-syntax))
+                (cdr lst))))
+
+;; For one character:
+;; (Range (Char "a"))
+;;  `-> "a"
+;; Or for a range:
+;; (Range (Char "a") (Char "b"))
+;;  `-> (range #\a #\b)
+(define (Range->defn lst for-syntax)
   (cond
-   ((list? el)
-    (cond
-     ((eq? (car el) 'peg-literal)
-      (peg-literal->defn el for-syntax))
-     ((eq? (car el) 'peg-charclass)
-      (peg-charclass->defn el for-syntax))
-     ((eq? (car el) 'peg-nonterminal)
-      (datum->syntax for-syntax (string->symbol (cadr el))))))
-   ((string? el)
+    ((= 2 (length lst))
+     (second (second lst)))
+    ((= 3 (length lst))
+     #`(range
+         #,(Char->defn (second lst) for-syntax)
+         #,(Char->defn (third lst) for-syntax)))))
+
+;; (Char "a")
+;;  `-> #\a
+;; (Char "\\n")
+;;  `-> #\newline
+;; (Char "\\135")
+;;  `-> #\]
+(define (Char->defn lst for-syntax)
+  (let* ((charstr (second lst))
+         (first   (string-ref charstr 0)))
     (cond
-     ((equal? el "(")
-      (peg-pattern->defn (caddr lst) for-syntax))
-     ((equal? el ".")
-      (datum->syntax for-syntax 'peg-any))
-     (else (datum->syntax for-syntax
-                          `(peg-parse-any unknown-string ,lst)))))
-   (else (datum->syntax for-syntax
-                        `(peg-parse-any unknown-el ,lst))))))
-
-;; Trims characters off the front and end of STR.
-;; (trim-1chars "'ab'") -> "ab"
-(define (trim-1chars str) (substring str 1 (- (string-length str) 1)))
-
-;; Parses a literal.
-(define (peg-literal->defn lst for-syntax)
-  (datum->syntax for-syntax (trim-1chars (cadr lst))))
-
-;; Parses a charclass.
-(define (peg-charclass->defn lst for-syntax)
-  #`(or
-     #,@(map
-         (lambda (cc)
-           (cond
-            ((eq? (car cc) 'charclass-range)
-             #`(range #,(datum->syntax
-                         for-syntax
-                         (string-ref (cadr cc) 0))
-                      #,(datum->syntax
-                         for-syntax
-                         (string-ref (cadr cc) 2))))
-            ((eq? (car cc) 'charclass-single)
-             (datum->syntax for-syntax (cadr cc)))))
-         (context-flatten
-          (lambda (x) (or (eq? (car x) 'charclass-range)
-                          (eq? (car x) 'charclass-single)))
-          (cdr lst)))))
+      ((= 1 (string-length charstr)) first)
+      ((char-numeric? (string-ref charstr 1))
+       (integer->char
+         (reduce + 0
+                 (map
+                   (lambda (x y)
+                     (* (- (char->integer x) (char->integer #\0)) y))
+                   (reverse (string->list charstr 1))
+                   '(1 8 64)))))
+      (else
+        (case (string-ref charstr 1)
+          ((#\n) #\newline)
+          ((#\r) #\return)
+          ((#\t) #\tab)
+          ((#\') #\')
+          ((#\]) #\])
+          ((#\\) #\\)
+          ((#\[) #\[))))))
+
+(define peg-grammar Grammar)
+
+;; Macro wrapper for PEG-PARSER.  Parses PEG grammars expressed as strings and
+;; defines all the appropriate nonterminals.
+(define-syntax define-peg-string-patterns
+  (lambda (x)
+    (syntax-case x ()
+      ((_ str)
+       (peg-parser (syntax->datum #'str) x)))))
 
 ;; Compresses a list to save the optimizer work.
 ;; e.g. (or (and a)) -> a
@@ -263,11 +370,10 @@ RB < ']'
      (let ((string (syntax->datum #'str-stx)))
        (compile-peg-pattern
         (compressor
-         (peg-pattern->defn
-          (peg:tree (match-pattern peg-pattern string)) #'str-stx)
+         (Expression->defn
+          (peg:tree (match-pattern Expression string)) #'str-stx)
          #'str-stx)
         (if (eq? accum 'all) 'body accum))))
      (else (error "Bad embedded PEG string" args))))
 
 (add-peg-compiler! 'peg peg-string-compile)
-
diff --git a/test-suite/tests/peg.test b/test-suite/tests/peg.test
index f516571e8..556145e72 100644
--- a/test-suite/tests/peg.test
+++ b/test-suite/tests/peg.test
@@ -28,17 +28,25 @@
 ;; the nonterminals defined in the PEG parser written with
 ;; S-expressions.
 (define grammar-mapping
-  '((grammar peg-grammar)
-    (pattern peg-pattern)
-    (alternative peg-alternative)
-    (suffix peg-suffix)
-    (primary peg-primary)
-    (literal peg-literal)
-    (charclass peg-charclass)
-    (CCrange charclass-range)
-    (CCsingle charclass-single)
-    (nonterminal peg-nonterminal)
-    (sp peg-sp)))
+  '((Grammar Grammar)
+    (Definition Definition)
+    (Expression Expression)
+    (Sequence Sequence)
+    (Prefix Prefix)
+    (Suffix Suffix)
+    (Primary Primary)
+    (Identifier Identifier)
+    (Literal Literal)
+    (Class Class)
+    (Range Range)
+    (Char Char)
+    (LEFTARROW LEFTARROW)
+    (AND AND)
+    (NOT NOT)
+    (QUESTION QUESTION)
+    (STAR STAR)
+    (PLUS PLUS)
+    (DOT DOT)))
 
 ;; Transforms the nonterminals defined in the PEG parser written as a PEG to the nonterminals defined in the PEG parser written with S-expressions.
 (define (grammar-transform x)
@@ -69,7 +77,7 @@
     (peg:tree (match-pattern (@@ (ice-9 peg) peg-grammar) (@@ (ice-9 peg) peg-as-peg)))
     (tree-map
      grammar-transform
-     (peg:tree (match-pattern grammar (@@ (ice-9 peg) peg-as-peg)))))))
+     (peg:tree (match-pattern (@@ (ice-9 peg) peg-grammar) (@@ (ice-9 peg) peg-as-peg)))))))
 
 ;; A grammar for pascal-style comments from Wikipedia.
 (define comment-grammar
-- 
2.45.2






^ permalink raw reply related	[flat|nested] 19+ messages in thread

* bug#73188: [PATCH] PEG: Add support for `not-in-range` and [^...]
  2024-09-11 22:03 bug#73188: PEG parser does not support full PEG grammar Ekaitz Zarraga
  2024-09-12 20:57 ` bug#73188: [PATCH v2] PEG: Add full support for PEG + some extensions Ekaitz Zarraga
@ 2024-10-11 12:31 ` Ekaitz Zarraga
  2024-10-30 19:04 ` bug#73188: PEG: Fix bugs and add complex PEG for testing Ekaitz Zarraga
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Ekaitz Zarraga @ 2024-10-11 12:31 UTC (permalink / raw)
  To: 73188; +Cc: Ekaitz Zarraga

Modern PEG supports inversed class like `[^a-z]` that would get any
character not in the `a-z` range. This commit adds support for that and
also for a new `not-in-range` PEG pattern for scheme.

* module/ice-9/peg/codegen.scm (cg-not-in-range): New function.
* module/ice-9/peg/string-peg.scm: Add support for `[^...]`
* test-suite/tests/peg.test: Test it.
* doc/ref/api-peg.texi: Document accordingly.
---
 doc/ref/api-peg.texi            |  8 +++++++
 module/ice-9/peg/codegen.scm    | 22 +++++++++++++++++++
 module/ice-9/peg/string-peg.scm | 39 +++++++++++++++++++++++++++++----
 test-suite/tests/peg.test       |  6 ++++-
 4 files changed, 70 insertions(+), 5 deletions(-)

diff --git a/doc/ref/api-peg.texi b/doc/ref/api-peg.texi
index d34ddc64c..f2707442c 100644
--- a/doc/ref/api-peg.texi
+++ b/doc/ref/api-peg.texi
@@ -143,6 +143,14 @@ Parses any character falling between @var{a} and @var{z}.
 @code{(range #\a #\z)}
 @end deftp
 
+@deftp {PEG Pattern} {inversed range of characters} a z
+Parses any character not falling between @var{a} and @var{z}.
+
+@code{"[^a-z]"}
+
+@code{(not-in-range #\a #\z)}
+@end deftp
+
 Example:
 
 @example
diff --git a/module/ice-9/peg/codegen.scm b/module/ice-9/peg/codegen.scm
index d80c3e849..82367ef55 100644
--- a/module/ice-9/peg/codegen.scm
+++ b/module/ice-9/peg/codegen.scm
@@ -140,6 +140,27 @@ return EXP."
                          ((none) #`(list (1+ pos) '()))
                          (else (error "bad accum" accum))))))))))
 
+;; Generates code for matching a range of characters not between start and end.
+;; E.g.: (cg-not-in-range syntax #\a #\z 'body)
+(define (cg-not-in-range pat accum)
+  (syntax-case pat ()
+    ((start end)
+     (if (not (and (char? (syntax->datum #'start))
+                   (char? (syntax->datum #'end))))
+         (error "range PEG should have characters after it; instead got"
+                #'start #'end))
+     #`(lambda (str len pos)
+         (and (< pos len)
+              (let ((c (string-ref str pos)))
+                (and (or (char<? c start) (char>? c end))
+                     #,(case accum
+                         ((all) #`(list (1+ pos)
+                                        (list 'cg-not-in-range (string c))))
+                         ((name) #`(list (1+ pos) 'cg-not-in-range))
+                         ((body) #`(list (1+ pos) (string c)))
+                         ((none) #`(list (1+ pos) '()))
+                         (else (error "bad accum" accum))))))))))
+
 ;; Generate code to match a pattern and do nothing with the result
 (define (cg-ignore pat accum)
   (syntax-case pat ()
@@ -304,6 +325,7 @@ return EXP."
         (assq-set! peg-compiler-alist symbol function)))
 
 (add-peg-compiler! 'range cg-range)
+(add-peg-compiler! 'not-in-range cg-not-in-range)
 (add-peg-compiler! 'ignore cg-ignore)
 (add-peg-compiler! 'capture cg-capture)
 (add-peg-compiler! 'and cg-and)
diff --git a/module/ice-9/peg/string-peg.scm b/module/ice-9/peg/string-peg.scm
index 05755693a..f88d2f7d8 100644
--- a/module/ice-9/peg/string-peg.scm
+++ b/module/ice-9/peg/string-peg.scm
@@ -49,7 +49,7 @@ Prefix <-- (AND / NOT)? Suffix
 Suffix <-- Primary (QUESTION / STAR / PLUS)?
 Primary <-- Identifier !LEFTARROW
            / OPEN Expression CLOSE
-           / Literal / Class / DOT
+           / Literal / Class / NotInClass / DOT
 
 # Lexical syntax
 Identifier <-- IdentStart IdentCont* Spacing
@@ -59,7 +59,8 @@ IdentCont <- IdentStart / [0-9]
 
 Literal <-- SQUOTE (!SQUOTE Char)* SQUOTE Spacing
         / DQUOTE (!DQUOTE Char)* DQUOTE Spacing
-Class <-- '[' (!']' Range)* ']' Spacing
+NotInClass <-- '[' NOTIN  (!']' Range)* ']' Spacing
+Class <-- '[' !NOTIN (!']' Range)* ']' Spacing
 Range <-- Char '-' Char / Char
 Char <-- '\\\\' [nrt'\"\\[\\]\\\\]
        / '\\\\' [0-7][0-7][0-7]
@@ -72,6 +73,7 @@ SQUOTE <-- [']
 DQUOTE <-- [\"]
 OPENBRACKET < '['
 CLOSEBRACKET < ']'
+NOTIN < '^'
 SLASH < '/' Spacing
 AND <-- '&' Spacing
 NOT <-- '!' Spacing
@@ -116,6 +118,7 @@ EndOfFile < !.
       (and OPEN Expression CLOSE)
       Literal
       Class
+      NotInClass
       DOT))
 (define-sexp-parser Identifier all
   (and IdentStart (* IdentCont) Spacing))
@@ -127,7 +130,11 @@ EndOfFile < !.
   (or (and SQUOTE (* (and (not-followed-by SQUOTE) Char)) SQUOTE Spacing)
       (and DQUOTE (* (and (not-followed-by DQUOTE) Char)) DQUOTE Spacing)))
 (define-sexp-parser Class all
-  (and OPENBRACKET (* (and (not-followed-by CLOSEBRACKET) Range)) CLOSEBRACKET Spacing))
+  (and OPENBRACKET (not-followed-by NOTIN)
+       (* (and (not-followed-by CLOSEBRACKET) Range)) CLOSEBRACKET Spacing))
+(define-sexp-parser NotInClass all
+  (and OPENBRACKET NOTIN
+       (* (and (not-followed-by CLOSEBRACKET) Range)) CLOSEBRACKET Spacing))
 (define-sexp-parser Range all
   (or (and Char DASH Char) Char))
 (define-sexp-parser Char all
@@ -137,6 +144,8 @@ EndOfFile < !.
       (and (not-followed-by "\\") peg-any)))
 (define-sexp-parser LEFTARROW body
   (and (or "<--" "<-" "<") Spacing)) ; NOTE: <-- and < are extensions
+(define-sexp-parser NOTIN none
+  (and "^"))
 (define-sexp-parser SLASH none
   (and "/" Spacing))
 (define-sexp-parser AND all
@@ -271,6 +280,7 @@ EndOfFile < !.
       ('Identifier (Identifier->defn value for-syntax))
       ('Expression (Expression->defn value for-syntax))
       ('Literal    (Literal->defn value for-syntax))
+      ('NotInClass (NotInClass->defn value for-syntax))
       ('Class      (Class->defn value for-syntax)))))
 
 ;; (Identifier "hello")
@@ -283,13 +293,34 @@ EndOfFile < !.
 (define (Literal->defn lst for-syntax)
   (apply string (map (lambda (x) (Char->defn x for-syntax)) (cdr lst))))
 
-;; TODO: empty Class can happen: `[]`, but what does it represent?
+;; (NotInClass ...)
+;;  `-> (and ...)
+(define (NotInClass->defn lst for-syntax)
+  #`(and #,@(map (lambda (x) (NotInRange->defn x for-syntax))
+                 (cdr lst))))
+
 ;; (Class ...)
 ;;  `-> (or ...)
 (define (Class->defn lst for-syntax)
   #`(or #,@(map (lambda (x) (Range->defn x for-syntax))
                 (cdr lst))))
 
+;; For one character:
+;; (NotInRange (Char "a"))
+;;  `-> (not-in-range #\a #\a)
+;; Or for a range:
+;; (NotInRange (Char "a") (Char "b"))
+;;  `-> (not-in-range #\a #\b)
+(define (NotInRange->defn lst for-syntax)
+  (cond
+    ((= 2 (length lst))
+     (let ((ch (Char->defn (second lst) for-syntax)))
+       #`(not-in-range #,ch #,ch)))
+    ((= 3 (length lst))
+     #`(not-in-range
+         #,(Char->defn (second lst) for-syntax)
+         #,(Char->defn (third lst) for-syntax)))))
+
 ;; For one character:
 ;; (Range (Char "a"))
 ;;  `-> "a"
diff --git a/test-suite/tests/peg.test b/test-suite/tests/peg.test
index 556145e72..965e1c12f 100644
--- a/test-suite/tests/peg.test
+++ b/test-suite/tests/peg.test
@@ -38,6 +38,7 @@
     (Identifier Identifier)
     (Literal Literal)
     (Class Class)
+    (NotInClass NotInClass)
     (Range Range)
     (Char Char)
     (LEFTARROW LEFTARROW)
@@ -85,7 +86,7 @@
 End <-- '*)'
 C <- Begin N* End
 N <- C / (!Begin !End Z)
-Z <- .")
+Z <- [^X-Z]") ;; Forbid some characters to test not-in-range
 
 ;; A short /etc/passwd file.
 (define *etc-passwd*
@@ -125,6 +126,9 @@ SLASH < '/'")
     (match-pattern C "(*blah*)")
     (make-prec 0 8 "(*blah*)"
 	       '((Begin "(*") "blah" (End "*)")))))
+  (pass-if
+   "simple comment with forbidden char"
+   (not (match-pattern C "(*blYh*)")))
   (pass-if
    "simple comment padded"
    (equal?
-- 
2.46.0






^ permalink raw reply related	[flat|nested] 19+ messages in thread

* bug#73188: PEG parser does not support full PEG grammar
  2024-09-12 20:57 ` bug#73188: [PATCH v2] PEG: Add full support for PEG + some extensions Ekaitz Zarraga
@ 2024-10-13 20:29   ` Ludovic Courtès
  2024-10-13 20:59     ` Ekaitz Zarraga
  0 siblings, 1 reply; 19+ messages in thread
From: Ludovic Courtès @ 2024-10-13 20:29 UTC (permalink / raw)
  To: Ekaitz Zarraga; +Cc: 73188

Hi Ekaitz,

Ekaitz Zarraga <ekaitz@elenq.tech> skribis:

> This commit adds support for PEG as described in:
>
>     <https://bford.info/pub/lang/peg.pdf>

I would make this a comment below the ‘define-module’ form.

> It adds support for the missing features (comments, underscores in
> identifiers and escaping) while keeping the extensions (dashes in
> identifiers, < and <--).
>
> The naming system tries to be as close as possible to the one proposed
> in the paper.
>
> * module/ice-9/peg/string-peg.scm: Rewrite PEG parser.
> * test-suite/tests/peg.test: Fix import

Nice work!

Questions:

  1. Is the name change for lexical elements (camel case instead of
     lower-case + hyphens) user-visible?  I guess no but better be safe
     than sorry.

     I have a preference for lower-case + hyphens out of consistency
     with the rest of Scheme, but I can see how keeping the same names
     as in the reference material helps.

  2. Could you add tests for the missing features that this adds, and
     maybe extend ‘api-peg.texi’ accordingly too?

  3. You can choose to assign copyright to the FSF or to not do that¹.
     In the latter case, please add a copyright line for you where
     appropriate.

I’m really not a PEG expert though so I’d prefer more eyeballs here, but
I trust your judgment.

There are three (guix import *) modules that use (ice-9 peg) and that
come with tests.  It would be nice to check that those tests still pass
with the modified string-peg.scm.

Thanks!

Ludo’.

¹ https://lists.gnu.org/archive/html/guile-devel/2022-10/msg00008.html





^ permalink raw reply	[flat|nested] 19+ messages in thread

* bug#73188: PEG parser does not support full PEG grammar
  2024-10-13 20:29   ` bug#73188: PEG parser does not support full PEG grammar Ludovic Courtès
@ 2024-10-13 20:59     ` Ekaitz Zarraga
  2024-10-14 11:56       ` Ludovic Courtès
  0 siblings, 1 reply; 19+ messages in thread
From: Ekaitz Zarraga @ 2024-10-13 20:59 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 73188

Saluton Ludovic,

On 2024-10-13 22:29, Ludovic Courtès wrote:
> Hi Ekaitz,
> 
> Ekaitz Zarraga <ekaitz@elenq.tech> skribis:
> 
>> This commit adds support for PEG as described in:
>>
>>      <https://bford.info/pub/lang/peg.pdf>
> 
> I would make this a comment below the ‘define-module’ form.

Okay I will add it.

>> It adds support for the missing features (comments, underscores in
>> identifiers and escaping) while keeping the extensions (dashes in
>> identifiers, < and <--).
>>
>> The naming system tries to be as close as possible to the one proposed
>> in the paper.
>>
>> * module/ice-9/peg/string-peg.scm: Rewrite PEG parser.
>> * test-suite/tests/peg.test: Fix import
> 
> Nice work!

Thank you

> Questions:
> 
>    1. Is the name change for lexical elements (camel case instead of
>       lower-case + hyphens) user-visible?  I guess no but better be safe
>       than sorry.

I think they can be, in a very weird way. If using `peg-as-peg` or 
something they can be used, but the ones coming from the PEG in text, 
which makes way more sense written like in the paper. I'm not sure if 
there's another way to make them available, but I don't think there is.

I exported `Grammar` as `peg-grammar` because of this. So the users 
should just use `peg-grammar` for their things.

>       I have a preference for lower-case + hyphens out of consistency
>       with the rest of Scheme, but I can see how keeping the same names
>       as in the reference material helps.

Yeah, I wasn't sure about it. And I'm still not very sure. But I think 
keeping the same names the paper does helps.

>    2. Could you add tests for the missing features that this adds, and
>       maybe extend ‘api-peg.texi’ accordingly too?

It doesn't really add much new in this first case, but it makes it work 
as expected in PEG, which is what documentation already claimed to do, 
and the code didn't actually implement. Mostly what this commit adds is 
escaping support in the PEG string literals.

>    3. You can choose to assign copyright to the FSF or to not do that¹.
>       In the latter case, please add a copyright line for you where
>       appropriate.

I don't care (maybe I should?). I just want this to work properly.

> I’m really not a PEG expert though so I’d prefer more eyeballs here, but
> I trust your judgment.

I don't know how wise this is :)

> There are three (guix import *) modules that use (ice-9 peg) and that
> come with tests.  It would be nice to check that those tests still pass
> with the modified string-peg.scm.

This is very interesting and I didn't know. All tests from Guile pass, 
but I'll try these and report here.

> Thanks!
> 
> Ludo’.
> 
> ¹ https://lists.gnu.org/archive/html/guile-devel/2022-10/msg00008.html


Thanks for the review, Ludo.

Next days I plan to review some older patches in the mailing list that 
offer a very straightforward solution for error reporting and try to 
include them with this and the extra features I added.

For the time being, I'll review what you mentioned, rebase the support 
for `[^...]` and resend them together.

I appreciate your time spent here, Ludo. Really.





^ permalink raw reply	[flat|nested] 19+ messages in thread

* bug#73188: PEG parser does not support full PEG grammar
  2024-10-13 20:59     ` Ekaitz Zarraga
@ 2024-10-14 11:56       ` Ludovic Courtès
  2024-10-14 14:00         ` Ekaitz Zarraga
  0 siblings, 1 reply; 19+ messages in thread
From: Ludovic Courtès @ 2024-10-14 11:56 UTC (permalink / raw)
  To: Ekaitz Zarraga; +Cc: 73188

Saluton!

Ekaitz Zarraga <ekaitz@elenq.tech> skribis:

> On 2024-10-13 22:29, Ludovic Courtès wrote:
>> Hi Ekaitz,
>> Ekaitz Zarraga <ekaitz@elenq.tech> skribis:

[...]

>>> It adds support for the missing features (comments, underscores in
>>> identifiers and escaping) while keeping the extensions (dashes in
>>> identifiers, < and <--).
>>>
>>> The naming system tries to be as close as possible to the one proposed
>>> in the paper.
>>>
>>> * module/ice-9/peg/string-peg.scm: Rewrite PEG parser.
>>> * test-suite/tests/peg.test: Fix import

[...]

>>    1. Is the name change for lexical elements (camel case instead of
>>       lower-case + hyphens) user-visible?  I guess no but better be safe
>>       than sorry.
>
> I think they can be, in a very weird way. If using `peg-as-peg` or
> something they can be used, but the ones coming from the PEG in text,
> which makes way more sense written like in the paper. I'm not sure if
> there's another way to make them available, but I don't think there
> is.
>
> I exported `Grammar` as `peg-grammar` because of this. So the users
> should just use `peg-grammar` for their things.

Sounds good.  As long as we don’t unwillingly introduce API
incompatibilities, that is fine.

>>    2. Could you add tests for the missing features that this adds, and
>>       maybe extend ‘api-peg.texi’ accordingly too?
>
> It doesn't really add much new in this first case, but it makes it
> work as expected in PEG, which is what documentation already claimed
> to do, and the code didn't actually implement. Mostly what this commit
> adds is escaping support in the PEG string literals.

I was referring to the features mentioned in the commit log, namely
comments, underscores in identifiers, and escaping.

>>    3. You can choose to assign copyright to the FSF or to not do that¹.
>>       In the latter case, please add a copyright line for you where
>>       appropriate.
>
> I don't care (maybe I should?). I just want this to work properly.

So, copyright line I guess.  :-)

Thanks,
Ludo’.





^ permalink raw reply	[flat|nested] 19+ messages in thread

* bug#73188: PEG parser does not support full PEG grammar
  2024-10-14 11:56       ` Ludovic Courtès
@ 2024-10-14 14:00         ` Ekaitz Zarraga
  2024-10-20 10:10           ` Ludovic Courtès
  0 siblings, 1 reply; 19+ messages in thread
From: Ekaitz Zarraga @ 2024-10-14 14:00 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 73188

[-- Attachment #1: Type: text/plain, Size: 2157 bytes --]

Hi,

So I reviewed the suggestions and this what I have:

- Added the comment

- Added my copyright line

- Nothing in Guix is affected by the changes proposed here, because 
there's no use of `string-peg` and I didn't change how the S-Expression 
based PEG works.

- The changes applied here only affect to the PEG written as PEG strings 
and their processing, which was introduced in the documentation as "read 
the wikipedia", but our PEG definition wasn't implementing the real PEG 
explained in that wikipedia link. Our documentation claimed that we were 
able to do things we were not able to do.

- I added a link to the paper we are following below the link to the 
wikipedia page just in case, to let our users know exactly what does our 
PEG support. The paper says:

	Literals and character classes can contain C-like escape codes

That should be enough for anyone. Also we support `-` in identifiers, 
but our documentation already said that. I replaced the description of 
what normal PEG supports. Our docs said PEG only supported alphabetic 
characters but it has never been true, the paper explains PEG supports 
alpha, numeric and `_`, but the numeric cannot be the first character 
(like C identifiers).

- What it is true is if people were using `peg-as-peg` for their things, 
which isn't even exported by the module, their code would have problems. 
But we can't predict that.

- In summary, what I implemented here is a PEG string to PEG Sexp 
compiler. So it shouldn't affect anyone that uses the PEG Sexp directly 
and if I did my work properly it shouldn't affect those that used their 
PEG patterns in strings either, because I only made it work closely to 
what the paper said, adding a couple of missing features.

EXTRA:

I rebased the support for the `not-in-range` part, that does touch the 
PEG sexp part, but only adds more functionality without affecting 
anything that already existed.

In following days I'll review Error reporting and captures that were 
added in a different thread in guile-devel and see if we can integrate 
everything to create a production ready PEG parser.

Thanks!
Ekaitz

I attached both patches.

[-- Attachment #2: v3-0001-PEG-Add-full-support-for-PEG-some-extensions.patch --]
[-- Type: text/x-patch, Size: 21671 bytes --]

From 81944c2037e0d6d8c972def2ceb1aa4c51a61431 Mon Sep 17 00:00:00 2001
From: Ekaitz Zarraga <ekaitz@elenq.tech>
Date: Wed, 11 Sep 2024 21:19:26 +0200
Subject: [PATCH v3 1/2] PEG: Add full support for PEG + some extensions

This commit adds support for PEG as described in:

    <https://bford.info/pub/lang/peg.pdf>

It adds support for the missing features (comments, underscores in
identifiers and escaping) while keeping the extensions (dashes in
identifiers, < and <--).

The naming system tries to be as close as possible to the one proposed
in the paper.

* module/ice-9/peg/string-peg.scm: Rewrite PEG parser.
* test-suite/tests/peg.test: Fix import
---
 doc/ref/api-peg.texi            |   8 +-
 module/ice-9/peg/string-peg.scm | 442 ++++++++++++++++++++------------
 test-suite/tests/peg.test       |  32 ++-
 3 files changed, 302 insertions(+), 180 deletions(-)

diff --git a/doc/ref/api-peg.texi b/doc/ref/api-peg.texi
index d34ddc64c..84a9e6c6b 100644
--- a/doc/ref/api-peg.texi
+++ b/doc/ref/api-peg.texi
@@ -17,6 +17,10 @@ Wikipedia has a clear and concise introduction to PEGs if you want to
 familiarize yourself with the syntax:
 @url{http://en.wikipedia.org/wiki/Parsing_expression_grammar}.
 
+The paper that introduced PEG contains a more detailed description of how PEG
+works, and describes its syntax in detail:
+@url{https://bford.info/pub/lang/peg.pdf}
+
 The @code{(ice-9 peg)} module works by compiling PEGs down to lambda
 expressions.  These can either be stored in variables at compile-time by
 the define macros (@code{define-peg-pattern} and
@@ -216,8 +220,8 @@ should propagate up the parse tree.  The normal @code{<-} propagates the
 matched text up the parse tree, @code{<--} propagates the matched text
 up the parse tree tagged with the name of the nonterminal, and @code{<}
 discards that matched text and propagates nothing up the parse tree.
-Also, nonterminals may consist of any alphanumeric character or a ``-''
-character (in normal PEGs nonterminals can only be alphabetic).
+Also, nonterminals may include ``-'' character, while in normal PEG it is not
+allowed.
 
 For example, if we:
 @lisp
diff --git a/module/ice-9/peg/string-peg.scm b/module/ice-9/peg/string-peg.scm
index 45ed14bb1..f688653ef 100644
--- a/module/ice-9/peg/string-peg.scm
+++ b/module/ice-9/peg/string-peg.scm
@@ -1,6 +1,7 @@
 ;;;; string-peg.scm --- representing PEG grammars as strings
 ;;;;
-;;;; 	Copyright (C) 2010, 2011 Free Software Foundation, Inc.
+;;;; 	Copyright (C) 2010, 2011, Free Software Foundation, Inc.
+;;;; 	Copyright (C) 2024 Ekaitz Zarraga <ekaitz@elenq.tech>
 ;;;;
 ;;;; This library is free software; you can redistribute it and/or
 ;;;; modify it under the terms of the GNU Lesser General Public
@@ -22,9 +23,13 @@
             define-peg-string-patterns
             peg-grammar)
   #:use-module (ice-9 peg using-parsers)
+  #:use-module (srfi srfi-1)
   #:use-module (ice-9 peg codegen)
   #:use-module (ice-9 peg simplify-tree))
 
+;; This module provides support for PEG as described in:
+;;   <https://bford.info/pub/lang/peg.pdf>
+
 ;; Gets the left-hand depth of a list.
 (define (depth lst)
   (if (or (not (list? lst)) (null? lst))
@@ -38,22 +43,57 @@
 
 ;; Grammar for PEGs in PEG grammar.
 (define peg-as-peg
-"grammar <-- (nonterminal ('<--' / '<-' / '<') sp pattern)+
-pattern <-- alternative (SLASH sp alternative)*
-alternative <-- ([!&]? sp suffix)+
-suffix <-- primary ([*+?] sp)*
-primary <-- '(' sp pattern ')' sp / '.' sp / literal / charclass / nonterminal !'<'
-literal <-- ['] (!['] .)* ['] sp
-charclass <-- LB (!']' (CCrange / CCsingle))* RB sp
-CCrange <-- . '-' .
-CCsingle <-- .
-nonterminal <-- [a-zA-Z0-9-]+ sp
-sp < [ \t\n]*
-SLASH < '/'
-LB < '['
-RB < ']'
+"# Hierarchical syntax
+Grammar <-- Spacing Definition+ EndOfFile
+Definition <-- Identifier LEFTARROW Expression
+
+Expression <-- Sequence (SLASH Sequence)*
+Sequence <-- Prefix*
+Prefix <-- (AND / NOT)? Suffix
+Suffix <-- Primary (QUESTION / STAR / PLUS)?
+Primary <-- Identifier !LEFTARROW
+           / OPEN Expression CLOSE
+           / Literal / Class / DOT
+
+# Lexical syntax
+Identifier <-- IdentStart IdentCont* Spacing
+# NOTE: `-` is an extension
+IdentStart <- [a-zA-Z_] / '-'
+IdentCont <- IdentStart / [0-9]
+
+Literal <-- SQUOTE (!SQUOTE Char)* SQUOTE Spacing
+        / DQUOTE (!DQUOTE Char)* DQUOTE Spacing
+Class <-- '[' (!']' Range)* ']' Spacing
+Range <-- Char '-' Char / Char
+Char <-- '\\\\' [nrt'\"\\[\\]\\\\]
+       / '\\\\' [0-7][0-7][0-7]
+       / '\\\\' [0-7][0-7]?
+       / !'\\\\' .
+
+# NOTE: `<--` and `<` are extensions
+LEFTARROW <- ('<--' / '<-' / '<') Spacing
+SQUOTE <-- [']
+DQUOTE <-- [\"]
+OPENBRACKET < '['
+CLOSEBRACKET < ']'
+SLASH < '/' Spacing
+AND <-- '&' Spacing
+NOT <-- '!' Spacing
+QUESTION <-- '?' Spacing
+STAR <-- '*' Spacing
+PLUS <-- '+' Spacing
+OPEN < '(' Spacing
+CLOSE < ')' Spacing
+DOT <-- '.' Spacing
+
+Spacing < (Space / Comment)*
+Comment < '#' (!EndOfLine .)* EndOfLine
+Space < ' ' / '\t' / EndOfLine
+EndOfLine < '\\r\\n' / '\\n' / '\\r'
+EndOfFile < !.
 ")
 
+
 (define-syntax define-sexp-parser
   (lambda (x)
     (syntax-case x ()
@@ -63,35 +103,78 @@ RB < ']'
               (syn (wrap-parser-for-users x matchf accumsym #'sym)))
            #`(define sym #,syn))))))
 
-(define-sexp-parser peg-grammar all
-  (+ (and peg-nonterminal (or "<--" "<-" "<") peg-sp peg-pattern)))
-(define-sexp-parser peg-pattern all
-  (and peg-alternative
-       (* (and (ignore "/") peg-sp peg-alternative))))
-(define-sexp-parser peg-alternative all
-  (+ (and (? (or "!" "&")) peg-sp peg-suffix)))
-(define-sexp-parser peg-suffix all
-  (and peg-primary (* (and (or "*" "+" "?") peg-sp))))
-(define-sexp-parser peg-primary all
-  (or (and "(" peg-sp peg-pattern ")" peg-sp)
-      (and "." peg-sp)
-      peg-literal
-      peg-charclass
-      (and peg-nonterminal (not-followed-by "<"))))
-(define-sexp-parser peg-literal all
-  (and "'" (* (and (not-followed-by "'") peg-any)) "'" peg-sp))
-(define-sexp-parser peg-charclass all
-  (and (ignore "[")
-       (* (and (not-followed-by "]")
-               (or charclass-range charclass-single)))
-       (ignore "]")
-       peg-sp))
-(define-sexp-parser charclass-range all (and peg-any "-" peg-any))
-(define-sexp-parser charclass-single all peg-any)
-(define-sexp-parser peg-nonterminal all
-  (and (+ (or (range #\a #\z) (range #\A #\Z) (range #\0 #\9) "-")) peg-sp))
-(define-sexp-parser peg-sp none
-  (* (or " " "\t" "\n")))
+(define-sexp-parser Grammar all
+  (and Spacing (+ Definition) EndOfFile))
+(define-sexp-parser Definition all
+  (and Identifier LEFTARROW Expression))
+(define-sexp-parser Expression all
+  (and Sequence (* (and SLASH Sequence))))
+(define-sexp-parser Sequence all
+  (* Prefix))
+(define-sexp-parser Prefix all
+  (and (? (or AND NOT)) Suffix))
+(define-sexp-parser Suffix all
+  (and Primary (? (or QUESTION STAR PLUS))))
+(define-sexp-parser Primary all
+  (or (and Identifier (not-followed-by LEFTARROW))
+      (and OPEN Expression CLOSE)
+      Literal
+      Class
+      DOT))
+(define-sexp-parser Identifier all
+  (and IdentStart (* IdentCont) Spacing))
+(define-sexp-parser IdentStart body
+  (or (range #\a #\z) (range #\A #\Z) "_" "-")) ; NOTE: - is an extension
+(define-sexp-parser IdentCont body
+  (or IdentStart (range #\0 #\9)))
+(define-sexp-parser Literal all
+  (or (and SQUOTE (* (and (not-followed-by SQUOTE) Char)) SQUOTE Spacing)
+      (and DQUOTE (* (and (not-followed-by DQUOTE) Char)) DQUOTE Spacing)))
+(define-sexp-parser Class all
+  (and OPENBRACKET (* (and (not-followed-by CLOSEBRACKET) Range)) CLOSEBRACKET Spacing))
+(define-sexp-parser Range all
+  (or (and Char DASH Char) Char))
+(define-sexp-parser Char all
+  (or (and "\\" (or "n" "r" "t" "'" "[" "]" "\\"))
+      (and "\\" (range #\0 #\7) (range #\0 #\7) (range #\0 #\7))
+      (and "\\" (range #\0 #\7) (? (range #\0 #\7)))
+      (and (not-followed-by "\\") peg-any)))
+(define-sexp-parser LEFTARROW body
+  (and (or "<--" "<-" "<") Spacing)) ; NOTE: <-- and < are extensions
+(define-sexp-parser SLASH none
+  (and "/" Spacing))
+(define-sexp-parser AND all
+  (and "&" Spacing))
+(define-sexp-parser NOT all
+  (and "!" Spacing))
+(define-sexp-parser QUESTION all
+  (and "?" Spacing))
+(define-sexp-parser STAR all
+  (and "*" Spacing))
+(define-sexp-parser PLUS all
+  (and "+" Spacing))
+(define-sexp-parser OPEN none
+  (and "(" Spacing))
+(define-sexp-parser CLOSE none
+  (and ")" Spacing))
+(define-sexp-parser DOT all
+  (and "." Spacing))
+(define-sexp-parser SQUOTE none "'")
+(define-sexp-parser DQUOTE none "\"")
+(define-sexp-parser OPENBRACKET none "[")
+(define-sexp-parser CLOSEBRACKET none "]")
+(define-sexp-parser DASH none "-")
+(define-sexp-parser Spacing none
+  (* (or Space Comment)))
+(define-sexp-parser Comment none
+  (and "#" (* (and (not-followed-by EndOfLine) peg-any)) EndOfLine))
+(define-sexp-parser Space none
+  (or " " "\t" EndOfLine))
+(define-sexp-parser EndOfLine none
+  (or "\r\n" "\n" "\r"))
+(define-sexp-parser EndOfFile none
+  (not-followed-by peg-any))
+
 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 ;;;;; PARSE STRING PEGS
@@ -101,7 +184,7 @@ RB < ']'
 ;; will define all of the nonterminals in the grammar with equivalent
 ;; PEG s-expressions.
 (define (peg-parser str for-syntax)
-  (let ((parsed (match-pattern peg-grammar str)))
+  (let ((parsed (match-pattern Grammar str)))
     (if (not parsed)
         (begin
           ;; (display "Invalid PEG grammar!\n")
@@ -110,132 +193,160 @@ RB < ']'
           (cond
            ((or (not (list? lst)) (null? lst))
             lst)
-           ((eq? (car lst) 'peg-grammar)
-            #`(begin
-                #,@(map (lambda (x) (peg-nonterm->defn x for-syntax))
-                        (context-flatten (lambda (lst) (<= (depth lst) 2))
-                                         (cdr lst))))))))))
+           ((eq? (car lst) 'Grammar)
+            (Grammar->defn lst for-syntax)))))))
 
-;; Macro wrapper for PEG-PARSER.  Parses PEG grammars expressed as strings and
-;; defines all the appropriate nonterminals.
-(define-syntax define-peg-string-patterns
-  (lambda (x)
-    (syntax-case x ()
-      ((_ str)
-       (peg-parser (syntax->datum #'str) x)))))
+;; (Grammar (Definition ...) (Definition ...))
+(define (Grammar->defn lst for-syntax)
+  #`(begin
+      #,@(map (lambda (x) (Definition->defn x for-syntax))
+              (context-flatten (lambda (lst) (<= (depth lst) 1))
+                               (cdr lst)))))
 
-;; lst has format (nonterm grabber pattern), where
-;;   nonterm is a symbol (the name of the nonterminal),
-;;   grabber is a string (either "<", "<-" or "<--"), and
-;;   pattern is the parse of a PEG pattern expressed as as string.
-(define (peg-nonterm->defn lst for-syntax)
-  (let* ((nonterm (car lst))
-         (grabber (cadr lst))
-         (pattern (caddr lst))
-         (nonterm-name (datum->syntax for-syntax
-                                      (string->symbol (cadr nonterm)))))
-    #`(define-peg-pattern #,nonterm-name
+;; (Definition (Identifier "Something") "<-" (Expression ...))
+;;  `-> (define-peg-pattern Something 'all ...)
+(define (Definition->defn lst for-syntax)
+  (let ((identifier (second (second lst)))
+        (grabber    (third  lst))
+        (expression (fourth lst)))
+    #`(define-peg-pattern #,(datum->syntax for-syntax
+                                           (string->symbol identifier))
        #,(cond
           ((string=? grabber "<--") (datum->syntax for-syntax 'all))
           ((string=? grabber "<-") (datum->syntax for-syntax 'body))
           (else (datum->syntax for-syntax 'none)))
-       #,(compressor (peg-pattern->defn pattern for-syntax) for-syntax))))
-
-;; lst has format ('peg-pattern ...).
-;; After the context-flatten, (cdr lst) has format
-;;   (('peg-alternative ...) ...), where the outer list is a collection
-;;   of elements from a '/' alternative.
-(define (peg-pattern->defn lst for-syntax)
-  #`(or #,@(map (lambda (x) (peg-alternative->defn x for-syntax))
-                (context-flatten (lambda (x) (eq? (car x) 'peg-alternative))
-                                 (cdr lst)))))
-
-;; lst has format ('peg-alternative ...).
-;; After the context-flatten, (cdr lst) has the format
-;;   (item ...), where each item has format either ("!" ...), ("&" ...),
-;;   or ('peg-suffix ...).
-(define (peg-alternative->defn lst for-syntax)
-  #`(and #,@(map (lambda (x) (peg-body->defn x for-syntax))
-                 (context-flatten (lambda (x) (or (string? (car x))
-                                             (eq? (car x) 'peg-suffix)))
-                                  (cdr lst)))))
-
-;; lst has the format either
-;;   ("!" ('peg-suffix ...)), ("&" ('peg-suffix ...)), or
-;;     ('peg-suffix ...).
-(define (peg-body->defn lst for-syntax)
-    (cond
-      ((equal? (car lst) "&")
-       #`(followed-by #,(peg-suffix->defn (cadr lst) for-syntax)))
-      ((equal? (car lst) "!")
-       #`(not-followed-by #,(peg-suffix->defn (cadr lst) for-syntax)))
-      ((eq? (car lst) 'peg-suffix)
-       (peg-suffix->defn lst for-syntax))
-      (else `(peg-parse-body-fail ,lst))))
-
-;; lst has format ('peg-suffix <peg-primary> (? (/ "*" "?" "+")))
-(define (peg-suffix->defn lst for-syntax)
-  (let ((inner-defn (peg-primary->defn (cadr lst) for-syntax)))
-    (cond
-      ((null? (cddr lst))
-       inner-defn)
-      ((equal? (caddr lst) "*")
-       #`(* #,inner-defn))
-      ((equal? (caddr lst) "?")
-       #`(? #,inner-defn))
-      ((equal? (caddr lst) "+")
-       #`(+ #,inner-defn)))))
-
-;; Parse a primary.
-(define (peg-primary->defn lst for-syntax)
-  (let ((el (cadr lst)))
+       #,(compressor (Expression->defn expression for-syntax) for-syntax))))
+
+;; (Expression (Sequence X))
+;;  `-> (X)
+;; (Expression (Sequence X) (Sequence Y))
+;;  `-> (or X Y)
+;; (Expression (Sequence X) ((Sequence Y) (Sequence Z) ...))
+;;  `-> (or X Y Z ...)
+(define (Expression->defn lst for-syntax)
+  (let ((first-sequence (second lst))
+        (rest           (cddr  lst)))
+    #`(or #,(Sequence->defn first-sequence for-syntax)
+          #,@(map (lambda (x)
+                    (Sequence->defn x for-syntax))
+                  (keyword-flatten '(Sequence) rest)))))
+
+
+(define (Sequence->defn lst for-syntax)
+  #`(and #,@(map (lambda (x) (Prefix->defn x for-syntax)) (cdr lst))))
+
+
+;; (Prefix (Suffix ...))
+;;  `-> (...)
+;; (Prefix (NOT "!") (Suffix ...))
+;;  `-> (not-followed-by ...)
+;; (Prefix (AND "&") (Suffix ...))
+;;  `-> (followed-by ...)
+(define (Prefix->defn lst for-syntax)
+  (let ((suffix (second lst)))
+    (case (car suffix)
+      ('AND #`(followed-by #,(Suffix->defn (third lst) for-syntax)))
+      ('NOT #`(not-followed-by #,(Suffix->defn (third lst) for-syntax)))
+      (else (Suffix->defn suffix for-syntax)))))
+
+;; (Suffix (Primary ...))
+;;  `-> (...)
+;; (Suffix (Primary ...) (STAR "*"))
+;;  `-> (* ...)
+;; (Suffix (Primary ...) (QUESTION "?"))
+;;  `-> (? ...)
+;; (Suffix (Primary ...) (PLUS "+"))
+;;  `-> (+ ...)
+(define (Suffix->defn lst for-syntax)
+  (let* ((primary (second lst))
+         (out     (Primary->defn primary for-syntax))
+         (extra   (cddr lst)))
+    (if (null? extra)
+      out
+      (case (caar extra)
+        ('QUESTION #`(? #,out))
+        ('STAR     #`(* #,out))
+        ('PLUS     #`(+ #,out))))))
+
+(define (Primary->defn lst for-syntax)
+  (let ((value (second lst)))
+    (case (car value)
+      ('DOT        #'peg-any)
+      ('Identifier (Identifier->defn value for-syntax))
+      ('Expression (Expression->defn value for-syntax))
+      ('Literal    (Literal->defn value for-syntax))
+      ('Class      (Class->defn value for-syntax)))))
+
+;; (Identifier "hello")
+;;  `-> hello
+(define (Identifier->defn lst for-syntax)
+  (datum->syntax for-syntax (string->symbol (second lst))))
+
+;; (Literal (Char "a") (Char "b") (Char "c"))
+;;  `-> "abc"
+(define (Literal->defn lst for-syntax)
+  (apply string (map (lambda (x) (Char->defn x for-syntax)) (cdr lst))))
+
+;; TODO: empty Class can happen: `[]`, but what does it represent?
+;; (Class ...)
+;;  `-> (or ...)
+(define (Class->defn lst for-syntax)
+  #`(or #,@(map (lambda (x) (Range->defn x for-syntax))
+                (cdr lst))))
+
+;; For one character:
+;; (Range (Char "a"))
+;;  `-> "a"
+;; Or for a range:
+;; (Range (Char "a") (Char "b"))
+;;  `-> (range #\a #\b)
+(define (Range->defn lst for-syntax)
   (cond
-   ((list? el)
-    (cond
-     ((eq? (car el) 'peg-literal)
-      (peg-literal->defn el for-syntax))
-     ((eq? (car el) 'peg-charclass)
-      (peg-charclass->defn el for-syntax))
-     ((eq? (car el) 'peg-nonterminal)
-      (datum->syntax for-syntax (string->symbol (cadr el))))))
-   ((string? el)
+    ((= 2 (length lst))
+     (second (second lst)))
+    ((= 3 (length lst))
+     #`(range
+         #,(Char->defn (second lst) for-syntax)
+         #,(Char->defn (third lst) for-syntax)))))
+
+;; (Char "a")
+;;  `-> #\a
+;; (Char "\\n")
+;;  `-> #\newline
+;; (Char "\\135")
+;;  `-> #\]
+(define (Char->defn lst for-syntax)
+  (let* ((charstr (second lst))
+         (first   (string-ref charstr 0)))
     (cond
-     ((equal? el "(")
-      (peg-pattern->defn (caddr lst) for-syntax))
-     ((equal? el ".")
-      (datum->syntax for-syntax 'peg-any))
-     (else (datum->syntax for-syntax
-                          `(peg-parse-any unknown-string ,lst)))))
-   (else (datum->syntax for-syntax
-                        `(peg-parse-any unknown-el ,lst))))))
-
-;; Trims characters off the front and end of STR.
-;; (trim-1chars "'ab'") -> "ab"
-(define (trim-1chars str) (substring str 1 (- (string-length str) 1)))
-
-;; Parses a literal.
-(define (peg-literal->defn lst for-syntax)
-  (datum->syntax for-syntax (trim-1chars (cadr lst))))
-
-;; Parses a charclass.
-(define (peg-charclass->defn lst for-syntax)
-  #`(or
-     #,@(map
-         (lambda (cc)
-           (cond
-            ((eq? (car cc) 'charclass-range)
-             #`(range #,(datum->syntax
-                         for-syntax
-                         (string-ref (cadr cc) 0))
-                      #,(datum->syntax
-                         for-syntax
-                         (string-ref (cadr cc) 2))))
-            ((eq? (car cc) 'charclass-single)
-             (datum->syntax for-syntax (cadr cc)))))
-         (context-flatten
-          (lambda (x) (or (eq? (car x) 'charclass-range)
-                          (eq? (car x) 'charclass-single)))
-          (cdr lst)))))
+      ((= 1 (string-length charstr)) first)
+      ((char-numeric? (string-ref charstr 1))
+       (integer->char
+         (reduce + 0
+                 (map
+                   (lambda (x y)
+                     (* (- (char->integer x) (char->integer #\0)) y))
+                   (reverse (string->list charstr 1))
+                   '(1 8 64)))))
+      (else
+        (case (string-ref charstr 1)
+          ((#\n) #\newline)
+          ((#\r) #\return)
+          ((#\t) #\tab)
+          ((#\') #\')
+          ((#\]) #\])
+          ((#\\) #\\)
+          ((#\[) #\[))))))
+
+(define peg-grammar Grammar)
+
+;; Macro wrapper for PEG-PARSER.  Parses PEG grammars expressed as strings and
+;; defines all the appropriate nonterminals.
+(define-syntax define-peg-string-patterns
+  (lambda (x)
+    (syntax-case x ()
+      ((_ str)
+       (peg-parser (syntax->datum #'str) x)))))
 
 ;; Compresses a list to save the optimizer work.
 ;; e.g. (or (and a)) -> a
@@ -263,11 +374,10 @@ RB < ']'
      (let ((string (syntax->datum #'str-stx)))
        (compile-peg-pattern
         (compressor
-         (peg-pattern->defn
-          (peg:tree (match-pattern peg-pattern string)) #'str-stx)
+         (Expression->defn
+          (peg:tree (match-pattern Expression string)) #'str-stx)
          #'str-stx)
         (if (eq? accum 'all) 'body accum))))
      (else (error "Bad embedded PEG string" args))))
 
 (add-peg-compiler! 'peg peg-string-compile)
-
diff --git a/test-suite/tests/peg.test b/test-suite/tests/peg.test
index f516571e8..556145e72 100644
--- a/test-suite/tests/peg.test
+++ b/test-suite/tests/peg.test
@@ -28,17 +28,25 @@
 ;; the nonterminals defined in the PEG parser written with
 ;; S-expressions.
 (define grammar-mapping
-  '((grammar peg-grammar)
-    (pattern peg-pattern)
-    (alternative peg-alternative)
-    (suffix peg-suffix)
-    (primary peg-primary)
-    (literal peg-literal)
-    (charclass peg-charclass)
-    (CCrange charclass-range)
-    (CCsingle charclass-single)
-    (nonterminal peg-nonterminal)
-    (sp peg-sp)))
+  '((Grammar Grammar)
+    (Definition Definition)
+    (Expression Expression)
+    (Sequence Sequence)
+    (Prefix Prefix)
+    (Suffix Suffix)
+    (Primary Primary)
+    (Identifier Identifier)
+    (Literal Literal)
+    (Class Class)
+    (Range Range)
+    (Char Char)
+    (LEFTARROW LEFTARROW)
+    (AND AND)
+    (NOT NOT)
+    (QUESTION QUESTION)
+    (STAR STAR)
+    (PLUS PLUS)
+    (DOT DOT)))
 
 ;; Transforms the nonterminals defined in the PEG parser written as a PEG to the nonterminals defined in the PEG parser written with S-expressions.
 (define (grammar-transform x)
@@ -69,7 +77,7 @@
     (peg:tree (match-pattern (@@ (ice-9 peg) peg-grammar) (@@ (ice-9 peg) peg-as-peg)))
     (tree-map
      grammar-transform
-     (peg:tree (match-pattern grammar (@@ (ice-9 peg) peg-as-peg)))))))
+     (peg:tree (match-pattern (@@ (ice-9 peg) peg-grammar) (@@ (ice-9 peg) peg-as-peg)))))))
 
 ;; A grammar for pascal-style comments from Wikipedia.
 (define comment-grammar
-- 
2.46.0


[-- Attachment #3: v3-0002-PEG-Add-support-for-not-in-range-and.patch --]
[-- Type: text/x-patch, Size: 7693 bytes --]

From 64a17be08581465d11185b4a0ca636354d2f944c Mon Sep 17 00:00:00 2001
From: Ekaitz Zarraga <ekaitz@elenq.tech>
Date: Fri, 11 Oct 2024 14:24:30 +0200
Subject: [PATCH v3 2/2] PEG: Add support for `not-in-range` and [^...]

Modern PEG supports inversed class like `[^a-z]` that would get any
character not in the `a-z` range. This commit adds support for that and
also for a new `not-in-range` PEG pattern for scheme.

* module/ice-9/peg/codegen.scm (cg-not-in-range): New function.
* module/ice-9/peg/string-peg.scm: Add support for `[^...]`
* test-suite/tests/peg.test: Test it.
* doc/ref/api-peg.texi: Document accordingly.
---
 doc/ref/api-peg.texi            |  8 +++++++
 module/ice-9/peg/codegen.scm    | 22 +++++++++++++++++++
 module/ice-9/peg/string-peg.scm | 39 +++++++++++++++++++++++++++++----
 test-suite/tests/peg.test       |  6 ++++-
 4 files changed, 70 insertions(+), 5 deletions(-)

diff --git a/doc/ref/api-peg.texi b/doc/ref/api-peg.texi
index 84a9e6c6b..edb090b20 100644
--- a/doc/ref/api-peg.texi
+++ b/doc/ref/api-peg.texi
@@ -147,6 +147,14 @@ Parses any character falling between @var{a} and @var{z}.
 @code{(range #\a #\z)}
 @end deftp
 
+@deftp {PEG Pattern} {inverse range of characters} a z
+Parses any character not falling between @var{a} and @var{z}.
+
+@code{"[^a-z]"}
+
+@code{(not-in-range #\a #\z)}
+@end deftp
+
 Example:
 
 @example
diff --git a/module/ice-9/peg/codegen.scm b/module/ice-9/peg/codegen.scm
index d80c3e849..82367ef55 100644
--- a/module/ice-9/peg/codegen.scm
+++ b/module/ice-9/peg/codegen.scm
@@ -140,6 +140,27 @@ return EXP."
                          ((none) #`(list (1+ pos) '()))
                          (else (error "bad accum" accum))))))))))
 
+;; Generates code for matching a range of characters not between start and end.
+;; E.g.: (cg-not-in-range syntax #\a #\z 'body)
+(define (cg-not-in-range pat accum)
+  (syntax-case pat ()
+    ((start end)
+     (if (not (and (char? (syntax->datum #'start))
+                   (char? (syntax->datum #'end))))
+         (error "range PEG should have characters after it; instead got"
+                #'start #'end))
+     #`(lambda (str len pos)
+         (and (< pos len)
+              (let ((c (string-ref str pos)))
+                (and (or (char<? c start) (char>? c end))
+                     #,(case accum
+                         ((all) #`(list (1+ pos)
+                                        (list 'cg-not-in-range (string c))))
+                         ((name) #`(list (1+ pos) 'cg-not-in-range))
+                         ((body) #`(list (1+ pos) (string c)))
+                         ((none) #`(list (1+ pos) '()))
+                         (else (error "bad accum" accum))))))))))
+
 ;; Generate code to match a pattern and do nothing with the result
 (define (cg-ignore pat accum)
   (syntax-case pat ()
@@ -304,6 +325,7 @@ return EXP."
         (assq-set! peg-compiler-alist symbol function)))
 
 (add-peg-compiler! 'range cg-range)
+(add-peg-compiler! 'not-in-range cg-not-in-range)
 (add-peg-compiler! 'ignore cg-ignore)
 (add-peg-compiler! 'capture cg-capture)
 (add-peg-compiler! 'and cg-and)
diff --git a/module/ice-9/peg/string-peg.scm b/module/ice-9/peg/string-peg.scm
index f688653ef..2d2f972ff 100644
--- a/module/ice-9/peg/string-peg.scm
+++ b/module/ice-9/peg/string-peg.scm
@@ -53,7 +53,7 @@ Prefix <-- (AND / NOT)? Suffix
 Suffix <-- Primary (QUESTION / STAR / PLUS)?
 Primary <-- Identifier !LEFTARROW
            / OPEN Expression CLOSE
-           / Literal / Class / DOT
+           / Literal / Class / NotInClass / DOT
 
 # Lexical syntax
 Identifier <-- IdentStart IdentCont* Spacing
@@ -63,7 +63,8 @@ IdentCont <- IdentStart / [0-9]
 
 Literal <-- SQUOTE (!SQUOTE Char)* SQUOTE Spacing
         / DQUOTE (!DQUOTE Char)* DQUOTE Spacing
-Class <-- '[' (!']' Range)* ']' Spacing
+NotInClass <-- '[' NOTIN  (!']' Range)* ']' Spacing
+Class <-- '[' !NOTIN (!']' Range)* ']' Spacing
 Range <-- Char '-' Char / Char
 Char <-- '\\\\' [nrt'\"\\[\\]\\\\]
        / '\\\\' [0-7][0-7][0-7]
@@ -76,6 +77,7 @@ SQUOTE <-- [']
 DQUOTE <-- [\"]
 OPENBRACKET < '['
 CLOSEBRACKET < ']'
+NOTIN < '^'
 SLASH < '/' Spacing
 AND <-- '&' Spacing
 NOT <-- '!' Spacing
@@ -120,6 +122,7 @@ EndOfFile < !.
       (and OPEN Expression CLOSE)
       Literal
       Class
+      NotInClass
       DOT))
 (define-sexp-parser Identifier all
   (and IdentStart (* IdentCont) Spacing))
@@ -131,7 +134,11 @@ EndOfFile < !.
   (or (and SQUOTE (* (and (not-followed-by SQUOTE) Char)) SQUOTE Spacing)
       (and DQUOTE (* (and (not-followed-by DQUOTE) Char)) DQUOTE Spacing)))
 (define-sexp-parser Class all
-  (and OPENBRACKET (* (and (not-followed-by CLOSEBRACKET) Range)) CLOSEBRACKET Spacing))
+  (and OPENBRACKET (not-followed-by NOTIN)
+       (* (and (not-followed-by CLOSEBRACKET) Range)) CLOSEBRACKET Spacing))
+(define-sexp-parser NotInClass all
+  (and OPENBRACKET NOTIN
+       (* (and (not-followed-by CLOSEBRACKET) Range)) CLOSEBRACKET Spacing))
 (define-sexp-parser Range all
   (or (and Char DASH Char) Char))
 (define-sexp-parser Char all
@@ -141,6 +148,8 @@ EndOfFile < !.
       (and (not-followed-by "\\") peg-any)))
 (define-sexp-parser LEFTARROW body
   (and (or "<--" "<-" "<") Spacing)) ; NOTE: <-- and < are extensions
+(define-sexp-parser NOTIN none
+  (and "^"))
 (define-sexp-parser SLASH none
   (and "/" Spacing))
 (define-sexp-parser AND all
@@ -275,6 +284,7 @@ EndOfFile < !.
       ('Identifier (Identifier->defn value for-syntax))
       ('Expression (Expression->defn value for-syntax))
       ('Literal    (Literal->defn value for-syntax))
+      ('NotInClass (NotInClass->defn value for-syntax))
       ('Class      (Class->defn value for-syntax)))))
 
 ;; (Identifier "hello")
@@ -287,13 +297,34 @@ EndOfFile < !.
 (define (Literal->defn lst for-syntax)
   (apply string (map (lambda (x) (Char->defn x for-syntax)) (cdr lst))))
 
-;; TODO: empty Class can happen: `[]`, but what does it represent?
+;; (NotInClass ...)
+;;  `-> (and ...)
+(define (NotInClass->defn lst for-syntax)
+  #`(and #,@(map (lambda (x) (NotInRange->defn x for-syntax))
+                 (cdr lst))))
+
 ;; (Class ...)
 ;;  `-> (or ...)
 (define (Class->defn lst for-syntax)
   #`(or #,@(map (lambda (x) (Range->defn x for-syntax))
                 (cdr lst))))
 
+;; For one character:
+;; (NotInRange (Char "a"))
+;;  `-> (not-in-range #\a #\a)
+;; Or for a range:
+;; (NotInRange (Char "a") (Char "b"))
+;;  `-> (not-in-range #\a #\b)
+(define (NotInRange->defn lst for-syntax)
+  (cond
+    ((= 2 (length lst))
+     (let ((ch (Char->defn (second lst) for-syntax)))
+       #`(not-in-range #,ch #,ch)))
+    ((= 3 (length lst))
+     #`(not-in-range
+         #,(Char->defn (second lst) for-syntax)
+         #,(Char->defn (third lst) for-syntax)))))
+
 ;; For one character:
 ;; (Range (Char "a"))
 ;;  `-> "a"
diff --git a/test-suite/tests/peg.test b/test-suite/tests/peg.test
index 556145e72..965e1c12f 100644
--- a/test-suite/tests/peg.test
+++ b/test-suite/tests/peg.test
@@ -38,6 +38,7 @@
     (Identifier Identifier)
     (Literal Literal)
     (Class Class)
+    (NotInClass NotInClass)
     (Range Range)
     (Char Char)
     (LEFTARROW LEFTARROW)
@@ -85,7 +86,7 @@
 End <-- '*)'
 C <- Begin N* End
 N <- C / (!Begin !End Z)
-Z <- .")
+Z <- [^X-Z]") ;; Forbid some characters to test not-in-range
 
 ;; A short /etc/passwd file.
 (define *etc-passwd*
@@ -125,6 +126,9 @@ SLASH < '/'")
     (match-pattern C "(*blah*)")
     (make-prec 0 8 "(*blah*)"
 	       '((Begin "(*") "blah" (End "*)")))))
+  (pass-if
+   "simple comment with forbidden char"
+   (not (match-pattern C "(*blYh*)")))
   (pass-if
    "simple comment padded"
    (equal?
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* bug#73188: PEG parser does not support full PEG grammar
  2024-10-14 14:00         ` Ekaitz Zarraga
@ 2024-10-20 10:10           ` Ludovic Courtès
  2024-10-20 20:18             ` Ekaitz Zarraga
  0 siblings, 1 reply; 19+ messages in thread
From: Ludovic Courtès @ 2024-10-20 10:10 UTC (permalink / raw)
  To: Ekaitz Zarraga; +Cc: 73188

Egun on Ekaitz,

Ekaitz Zarraga <ekaitz@elenq.tech> skribis:

> - What it is true is if people were using `peg-as-peg` for their
>   things, which isn't even exported by the module, their code would
>   have problems. But we can't predict that.

Since ‘peg-as-peg’ has always been private, that’s fine.

> From 81944c2037e0d6d8c972def2ceb1aa4c51a61431 Mon Sep 17 00:00:00 2001
> From: Ekaitz Zarraga <ekaitz@elenq.tech>
> Date: Wed, 11 Sep 2024 21:19:26 +0200
> Subject: [PATCH v3 1/2] PEG: Add full support for PEG + some extensions
>
> This commit adds support for PEG as described in:
>
>     <https://bford.info/pub/lang/peg.pdf>
>
> It adds support for the missing features (comments, underscores in
> identifiers and escaping) while keeping the extensions (dashes in
> identifiers, < and <--).
>
> The naming system tries to be as close as possible to the one proposed
> in the paper.
>
> * module/ice-9/peg/string-peg.scm: Rewrite PEG parser.
> * test-suite/tests/peg.test: Fix import

[...]

> +(define (Primary->defn lst for-syntax)
> +  (let ((value (second lst)))
> +    (case (car value)
> +      ('DOT        #'peg-any)
> +      ('Identifier (Identifier->defn value for-syntax))
> +      ('Expression (Expression->defn value for-syntax))
> +      ('Literal    (Literal->defn value for-syntax))
> +      ('Class      (Class->defn value for-syntax)))))

I get these compiler warnings:

--8<---------------cut here---------------start------------->8---
ice-9/peg/string-peg.scm:258:7: warning: duplicate datum quote in clause ((quote NOT) (quasisyntax (not-followed-by (unsyntax (Suffix->defn (third lst) for-syntax))))) of case expression (case (car suffix) ((quote AND) (quasisyntax (followed-by (unsyntax (Suffix->defn (third lst) for-syntax))))) ((quote NOT) (quasisyntax (not-followed-by (unsyntax (Suffix->defn (third lst) for-syntax))))) (else (Suffix->defn suffix for-syntax)))
ice-9/peg/string-peg.scm:277:9: warning: duplicate datum quote in clause ((quote STAR) (quasisyntax (* (unsyntax out)))) of case expression (case (caar extra) ((quote QUESTION) (quasisyntax (? (unsyntax out)))) ((quote STAR) (quasisyntax (* (unsyntax out)))) ((quote PLUS) (quasisyntax (+ (unsyntax out)))))
ice-9/peg/string-peg.scm:278:9: warning: duplicate datum quote in clause ((quote PLUS) (quasisyntax (+ (unsyntax out)))) of case expression (case (caar extra) ((quote QUESTION) (quasisyntax (? (unsyntax out)))) ((quote STAR) (quasisyntax (* (unsyntax out)))) ((quote PLUS) (quasisyntax (+ (unsyntax out)))))
ice-9/peg/string-peg.scm:284:7: warning: duplicate datum quote in clause ((quote Identifier) (Identifier->defn value for-syntax)) of case expression (case (car value) ((quote DOT) (syntax peg-any)) ((quote Identifier) (Identifier->defn value for-syntax)) ((quote Expression) (Expression->defn value for-syntax)) ((quote Literal) (Literal->defn value for-syntax)) ((quote NotInClass) (NotInClass->defn value for-syntax)) ((quote Class) (Class->defn value for-syntax)))
ice-9/peg/string-peg.scm:285:7: warning: duplicate datum quote in clause ((quote Expression) (Expression->defn value for-syntax)) of case expression (case (car value) ((quote DOT) (syntax peg-any)) ((quote Identifier) (Identifier->defn value for-syntax)) ((quote Expression) (Expression->defn value for-syntax)) ((quote Literal) (Literal->defn value for-syntax)) ((quote NotInClass) (NotInClass->defn value for-syntax)) ((quote Class) (Class->defn value for-syntax)))
ice-9/peg/string-peg.scm:286:7: warning: duplicate datum quote in clause ((quote Literal) (Literal->defn value for-syntax)) of case expression (case (car value) ((quote DOT) (syntax peg-any)) ((quote Identifier) (Identifier->defn value for-syntax)) ((quote Expression) (Expression->defn value for-syntax)) ((quote Literal) (Literal->defn value for-syntax)) ((quote NotInClass) (NotInClass->defn value for-syntax)) ((quote Class) (Class->defn value for-syntax)))
ice-9/peg/string-peg.scm:287:7: warning: duplicate datum quote in clause ((quote NotInClass) (NotInClass->defn value for-syntax)) of case expression (case (car value) ((quote DOT) (syntax peg-any)) ((quote Identifier) (Identifier->defn value for-syntax)) ((quote Expression) (Expression->defn value for-syntax)) ((quote Literal) (Literal->defn value for-syntax)) ((quote NotInClass) (NotInClass->defn value for-syntax)) ((quote Class) (Class->defn value for-syntax)))
ice-9/peg/string-peg.scm:288:7: warning: duplicate datum quote in clause ((quote Class) (Class->defn value for-syntax)) of case expression (case (car value) ((quote DOT) (syntax peg-any)) ((quote Identifier) (Identifier->defn value for-syntax)) ((quote Expression) (Expression->defn value for-syntax)) ((quote Literal) (Literal->defn value for-syntax)) ((quote NotInClass) (NotInClass->defn value for-syntax)) ((quote Class) (Class->defn value for-syntax)))
--8<---------------cut here---------------end--------------->8---

And indeed, the correct syntax is:

  (case value (DOT …) (Identifier …) …)

or:

  (match value ('DOT …) ('Identifier …) …)

The former returns *unspecified* when passed a value not matched by any
clause, whereas the latter throws an error.

So I would recommend ‘match’.

At any rate, that makes me wonder whether this code path is tested at
all.  As written, ‘Primary->defn’ would always return *unspecified*.
Is there a test we could add?

> +(define (Range->defn lst for-syntax)
>    (cond
> -   ((list? el)
> -    (cond
> -     ((eq? (car el) 'peg-literal)
> -      (peg-literal->defn el for-syntax))
> -     ((eq? (car el) 'peg-charclass)
> -      (peg-charclass->defn el for-syntax))
> -     ((eq? (car el) 'peg-nonterminal)
> -      (datum->syntax for-syntax (string->symbol (cadr el))))))
> -   ((string? el)
> +    ((= 2 (length lst))
> +     (second (second lst)))
> +    ((= 3 (length lst))
> +     #`(range
> +         #,(Char->defn (second lst) for-syntax)
> +         #,(Char->defn (third lst) for-syntax)))))

Keep in mind that ‘length’ is O(N), and that car/first, second/cadr are
frowned upon.  I would write it as:

  (match lst
    ((x y) y)
    ((x y z) #`(range …)))

(Ideally with identifiers more meaningful than x, y, and z. :-))

>  ;; Transforms the nonterminals defined in the PEG parser written as a PEG to the nonterminals defined in the PEG parser written with S-expressions.
>  (define (grammar-transform x)
> @@ -69,7 +77,7 @@
>      (peg:tree (match-pattern (@@ (ice-9 peg) peg-grammar) (@@ (ice-9 peg) peg-as-peg)))
>      (tree-map
>       grammar-transform
> -     (peg:tree (match-pattern grammar (@@ (ice-9 peg) peg-as-peg)))))))
> +     (peg:tree (match-pattern (@@ (ice-9 peg) peg-grammar) (@@ (ice-9 peg) peg-as-peg)))))))

What happened to the ‘grammar’ binding?  I can’t see where it was coming
from.

> From 64a17be08581465d11185b4a0ca636354d2f944c Mon Sep 17 00:00:00 2001
> From: Ekaitz Zarraga <ekaitz@elenq.tech>
> Date: Fri, 11 Oct 2024 14:24:30 +0200
> Subject: [PATCH v3 2/2] PEG: Add support for `not-in-range` and [^...]
>
> Modern PEG supports inversed class like `[^a-z]` that would get any
> character not in the `a-z` range. This commit adds support for that and
> also for a new `not-in-range` PEG pattern for scheme.
>
> * module/ice-9/peg/codegen.scm (cg-not-in-range): New function.
> * module/ice-9/peg/string-peg.scm: Add support for `[^...]`
> * test-suite/tests/peg.test: Test it.
> * doc/ref/api-peg.texi: Document accordingly.

This one LGTM.

In addition to the issues mentioned above, could you add an entry in the
‘NEWS’ file, probably under a new “New interfaces and functionality”
heading?

Thanks,
Ludo’.





^ permalink raw reply	[flat|nested] 19+ messages in thread

* bug#73188: PEG parser does not support full PEG grammar
  2024-10-20 10:10           ` Ludovic Courtès
@ 2024-10-20 20:18             ` Ekaitz Zarraga
  2024-12-09 17:23               ` Ludovic Courtès
  0 siblings, 1 reply; 19+ messages in thread
From: Ekaitz Zarraga @ 2024-10-20 20:18 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 73188

[-- Attachment #1: Type: text/plain, Size: 8781 bytes --]

Hi Ludovic,

Thanks for the review

On 2024-10-20 12:10, Ludovic Courtès wrote:

>> +(define (Primary->defn lst for-syntax)
>> +  (let ((value (second lst)))
>> +    (case (car value)
>> +      ('DOT        #'peg-any)
>> +      ('Identifier (Identifier->defn value for-syntax))
>> +      ('Expression (Expression->defn value for-syntax))
>> +      ('Literal    (Literal->defn value for-syntax))
>> +      ('Class      (Class->defn value for-syntax)))))
> 
> I get these compiler warnings:
> 
> --8<---------------cut here---------------start------------->8---
> ice-9/peg/string-peg.scm:258:7: warning: duplicate datum quote in clause ((quote NOT) (quasisyntax (not-followed-by (unsyntax (Suffix->defn (third lst) for-syntax))))) of case expression (case (car suffix) ((quote AND) (quasisyntax (followed-by (unsyntax (Suffix->defn (third lst) for-syntax))))) ((quote NOT) (quasisyntax (not-followed-by (unsyntax (Suffix->defn (third lst) for-syntax))))) (else (Suffix->defn suffix for-syntax)))
> ice-9/peg/string-peg.scm:277:9: warning: duplicate datum quote in clause ((quote STAR) (quasisyntax (* (unsyntax out)))) of case expression (case (caar extra) ((quote QUESTION) (quasisyntax (? (unsyntax out)))) ((quote STAR) (quasisyntax (* (unsyntax out)))) ((quote PLUS) (quasisyntax (+ (unsyntax out)))))
> ice-9/peg/string-peg.scm:278:9: warning: duplicate datum quote in clause ((quote PLUS) (quasisyntax (+ (unsyntax out)))) of case expression (case (caar extra) ((quote QUESTION) (quasisyntax (? (unsyntax out)))) ((quote STAR) (quasisyntax (* (unsyntax out)))) ((quote PLUS) (quasisyntax (+ (unsyntax out)))))
> ice-9/peg/string-peg.scm:284:7: warning: duplicate datum quote in clause ((quote Identifier) (Identifier->defn value for-syntax)) of case expression (case (car value) ((quote DOT) (syntax peg-any)) ((quote Identifier) (Identifier->defn value for-syntax)) ((quote Expression) (Expression->defn value for-syntax)) ((quote Literal) (Literal->defn value for-syntax)) ((quote NotInClass) (NotInClass->defn value for-syntax)) ((quote Class) (Class->defn value for-syntax)))
> ice-9/peg/string-peg.scm:285:7: warning: duplicate datum quote in clause ((quote Expression) (Expression->defn value for-syntax)) of case expression (case (car value) ((quote DOT) (syntax peg-any)) ((quote Identifier) (Identifier->defn value for-syntax)) ((quote Expression) (Expression->defn value for-syntax)) ((quote Literal) (Literal->defn value for-syntax)) ((quote NotInClass) (NotInClass->defn value for-syntax)) ((quote Class) (Class->defn value for-syntax)))
> ice-9/peg/string-peg.scm:286:7: warning: duplicate datum quote in clause ((quote Literal) (Literal->defn value for-syntax)) of case expression (case (car value) ((quote DOT) (syntax peg-any)) ((quote Identifier) (Identifier->defn value for-syntax)) ((quote Expression) (Expression->defn value for-syntax)) ((quote Literal) (Literal->defn value for-syntax)) ((quote NotInClass) (NotInClass->defn value for-syntax)) ((quote Class) (Class->defn value for-syntax)))
> ice-9/peg/string-peg.scm:287:7: warning: duplicate datum quote in clause ((quote NotInClass) (NotInClass->defn value for-syntax)) of case expression (case (car value) ((quote DOT) (syntax peg-any)) ((quote Identifier) (Identifier->defn value for-syntax)) ((quote Expression) (Expression->defn value for-syntax)) ((quote Literal) (Literal->defn value for-syntax)) ((quote NotInClass) (NotInClass->defn value for-syntax)) ((quote Class) (Class->defn value for-syntax)))
> ice-9/peg/string-peg.scm:288:7: warning: duplicate datum quote in clause ((quote Class) (Class->defn value for-syntax)) of case expression (case (car value) ((quote DOT) (syntax peg-any)) ((quote Identifier) (Identifier->defn value for-syntax)) ((quote Expression) (Expression->defn value for-syntax)) ((quote Literal) (Literal->defn value for-syntax)) ((quote NotInClass) (NotInClass->defn value for-syntax)) ((quote Class) (Class->defn value for-syntax)))
> --8<---------------cut here---------------end--------------->8---
> 
> And indeed, the correct syntax is:
> 
>    (case value (DOT …) (Identifier …) …)
> 
> or:
> 
>    (match value ('DOT …) ('Identifier …) …)
> 
> The former returns *unspecified* when passed a value not matched by any
> clause, whereas the latter throws an error.
> 
> So I would recommend ‘match’.

I'm always in doubt with this but the thing worked.
I always check in the internet how to do the case, and I always get the 
warning... my bad!

> At any rate, that makes me wonder whether this code path is tested at
> all.  As written, ‘Primary->defn’ would always return *unspecified*.
> Is there a test we could add?

I made some tests and it works, maybe unexpectedly:

	scheme@(guile-user)> (define a 'b)
	scheme@(guile-user)> (case a ('c 3)('b 1)('d 2))
	;;; <stdin>:12:15: warning: duplicate datum quote in clause ((quote b) 
1) of case expression (case a ((quote c) 3) ((quote b) 1) ((quote d) 2))
	;;; <stdin>:12:21: warning: duplicate datum quote in clause ((quote d) 
2) of case expression (case a ((quote c) 3) ((quote b) 1) ((quote d) 2))
	$8 = 1

While your proposal doesn't:
	
	scheme@(guile-user)> (case a (c 3)(b 1)(d 2))
	While compiling expression:
	Syntax error:
	unknown file:14:8: case: invalid clause in subform (c 3) of (case a (c 
3) (b 1) (d 2))

I tested further and this is what it should be:

	scheme@(guile-user)> (case a ((b) 1)((d) 2))
	$1 = 1

I used match instead as you proposed and made it all work with no 
warnings and better error reporting.

>> +(define (Range->defn lst for-syntax)
>>     (cond
>> -   ((list? el)
>> -    (cond
>> -     ((eq? (car el) 'peg-literal)
>> -      (peg-literal->defn el for-syntax))
>> -     ((eq? (car el) 'peg-charclass)
>> -      (peg-charclass->defn el for-syntax))
>> -     ((eq? (car el) 'peg-nonterminal)
>> -      (datum->syntax for-syntax (string->symbol (cadr el))))))
>> -   ((string? el)
>> +    ((= 2 (length lst))
>> +     (second (second lst)))
>> +    ((= 3 (length lst))
>> +     #`(range
>> +         #,(Char->defn (second lst) for-syntax)
>> +         #,(Char->defn (third lst) for-syntax)))))
> 
> Keep in mind that ‘length’ is O(N), and that car/first, second/cadr are
> frowned upon.  I would write it as:
> 
>    (match lst
>      ((x y) y)
>      ((x y z) #`(range …)))

Yeah, I was very bad at `match` when I wrote this thing :(
But! Now I spent some hours with it and rewrote the thing for `match`!

> (Ideally with identifiers more meaningful than x, y, and z. :-))
> 
>>   ;; Transforms the nonterminals defined in the PEG parser written as a PEG to the nonterminals defined in the PEG parser written with S-expressions.
>>   (define (grammar-transform x)
>> @@ -69,7 +77,7 @@
>>       (peg:tree (match-pattern (@@ (ice-9 peg) peg-grammar) (@@ (ice-9 peg) peg-as-peg)))
>>       (tree-map
>>        grammar-transform
>> -     (peg:tree (match-pattern grammar (@@ (ice-9 peg) peg-as-peg)))))))
>> +     (peg:tree (match-pattern (@@ (ice-9 peg) peg-grammar) (@@ (ice-9 peg) peg-as-peg)))))))
> 
> What happened to the ‘grammar’ binding?  I can’t see where it was coming
> from.

This is a very good catch!
I "fixed" a missing identifier by this, but it happened to be skipping a 
really important check: Is our PEG written as PEG understood like our 
PEG definition in sexps?

The `Grammar` is coming from the processing of `peg-as-peg` as a 
peg-string. This is what I was missing.

This was hiding some errors that I fixed, and now I'm pretty confident 
of the thing. Wow! This was a very good one! Thanks for pointing it out.

>>  From 64a17be08581465d11185b4a0ca636354d2f944c Mon Sep 17 00:00:00 2001
>> From: Ekaitz Zarraga <ekaitz@elenq.tech>
>> Date: Fri, 11 Oct 2024 14:24:30 +0200
>> Subject: [PATCH v3 2/2] PEG: Add support for `not-in-range` and [^...]
>>
>> Modern PEG supports inversed class like `[^a-z]` that would get any
>> character not in the `a-z` range. This commit adds support for that and
>> also for a new `not-in-range` PEG pattern for scheme.
>>
>> * module/ice-9/peg/codegen.scm (cg-not-in-range): New function.
>> * module/ice-9/peg/string-peg.scm: Add support for `[^...]`
>> * test-suite/tests/peg.test: Test it.
>> * doc/ref/api-peg.texi: Document accordingly.
> 
> This one LGTM.
> 
> In addition to the issues mentioned above, could you add an entry in the
> ‘NEWS’ file, probably under a new “New interfaces and functionality”
> heading?

Yes!
Added it in both commits: In the first commit I added that PEG was 
improved, and in the second that `not-in-range` was added.

> Thanks,
> Ludo’.

Attached new version of both patches.

Thanks for the help, the kind of things you point out are the ones that 
I wanted help with! Thanks! Now I'm a better matcher.

This helps me improve,
Ekaitz

[-- Attachment #2: v4-0002-PEG-Add-support-for-not-in-range-and.patch --]
[-- Type: text/x-patch, Size: 8198 bytes --]

From 7941589941fe7a0fb6f71b8681ccb1d976946912 Mon Sep 17 00:00:00 2001
From: Ekaitz Zarraga <ekaitz@elenq.tech>
Date: Fri, 11 Oct 2024 14:24:30 +0200
Subject: [PATCH v4 2/2] PEG: Add support for `not-in-range` and [^...]

Modern PEG supports inversed class like `[^a-z]` that would get any
character not in the `a-z` range. This commit adds support for that and
also for a new `not-in-range` PEG pattern for scheme.

* module/ice-9/peg/codegen.scm (cg-not-in-range): New function.
* module/ice-9/peg/string-peg.scm: Add support for `[^...]`
* test-suite/tests/peg.test: Test it.
* doc/ref/api-peg.texi: Document accordingly.
---
 NEWS                            |  3 ++-
 doc/ref/api-peg.texi            |  8 +++++++
 module/ice-9/peg/codegen.scm    | 22 +++++++++++++++++++
 module/ice-9/peg/string-peg.scm | 38 ++++++++++++++++++++++++++++++---
 test-suite/tests/peg.test       |  6 +++++-
 5 files changed, 72 insertions(+), 5 deletions(-)

diff --git a/NEWS b/NEWS
index df43f3754..17ef560b1 100644
--- a/NEWS
+++ b/NEWS
@@ -32,7 +32,8 @@ Changes in 3.0.11 (since 3.0.10)
 ** PEG parser
 
 PEG parser has been rewritten to cover all the functionality defined in
-<https://bford.info/pub/lang/peg.pdf>.
+<https://bford.info/pub/lang/peg.pdf>. Also added the `not-in-range` pattern
+to `(ice-9 peg)` that is also available from PEG strings via `[^...]`.
 
 \f
 Changes in 3.0.10 (since 3.0.9)
diff --git a/doc/ref/api-peg.texi b/doc/ref/api-peg.texi
index 84a9e6c6b..edb090b20 100644
--- a/doc/ref/api-peg.texi
+++ b/doc/ref/api-peg.texi
@@ -147,6 +147,14 @@ Parses any character falling between @var{a} and @var{z}.
 @code{(range #\a #\z)}
 @end deftp
 
+@deftp {PEG Pattern} {inverse range of characters} a z
+Parses any character not falling between @var{a} and @var{z}.
+
+@code{"[^a-z]"}
+
+@code{(not-in-range #\a #\z)}
+@end deftp
+
 Example:
 
 @example
diff --git a/module/ice-9/peg/codegen.scm b/module/ice-9/peg/codegen.scm
index d80c3e849..82367ef55 100644
--- a/module/ice-9/peg/codegen.scm
+++ b/module/ice-9/peg/codegen.scm
@@ -140,6 +140,27 @@ return EXP."
                          ((none) #`(list (1+ pos) '()))
                          (else (error "bad accum" accum))))))))))
 
+;; Generates code for matching a range of characters not between start and end.
+;; E.g.: (cg-not-in-range syntax #\a #\z 'body)
+(define (cg-not-in-range pat accum)
+  (syntax-case pat ()
+    ((start end)
+     (if (not (and (char? (syntax->datum #'start))
+                   (char? (syntax->datum #'end))))
+         (error "range PEG should have characters after it; instead got"
+                #'start #'end))
+     #`(lambda (str len pos)
+         (and (< pos len)
+              (let ((c (string-ref str pos)))
+                (and (or (char<? c start) (char>? c end))
+                     #,(case accum
+                         ((all) #`(list (1+ pos)
+                                        (list 'cg-not-in-range (string c))))
+                         ((name) #`(list (1+ pos) 'cg-not-in-range))
+                         ((body) #`(list (1+ pos) (string c)))
+                         ((none) #`(list (1+ pos) '()))
+                         (else (error "bad accum" accum))))))))))
+
 ;; Generate code to match a pattern and do nothing with the result
 (define (cg-ignore pat accum)
   (syntax-case pat ()
@@ -304,6 +325,7 @@ return EXP."
         (assq-set! peg-compiler-alist symbol function)))
 
 (add-peg-compiler! 'range cg-range)
+(add-peg-compiler! 'not-in-range cg-not-in-range)
 (add-peg-compiler! 'ignore cg-ignore)
 (add-peg-compiler! 'capture cg-capture)
 (add-peg-compiler! 'and cg-and)
diff --git a/module/ice-9/peg/string-peg.scm b/module/ice-9/peg/string-peg.scm
index 4b923220a..0026f8930 100644
--- a/module/ice-9/peg/string-peg.scm
+++ b/module/ice-9/peg/string-peg.scm
@@ -54,7 +54,7 @@ Prefix <-- (AND / NOT)? Suffix
 Suffix <-- Primary (QUESTION / STAR / PLUS)?
 Primary <-- Identifier !LEFTARROW
            / OPEN Expression CLOSE
-           / Literal / Class / DOT
+           / Literal / Class / NotInClass / DOT
 
 # Lexical syntax
 Identifier <-- IdentStart IdentCont* Spacing
@@ -64,6 +64,7 @@ IdentCont <- IdentStart / [0-9]
 
 Literal <-- SQUOTE (!SQUOTE Char)* SQUOTE Spacing
         / DQUOTE (!DQUOTE Char)* DQUOTE Spacing
+NotInClass <-- OPENBRACKET NOTIN  (!CLOSEBRACKET Range)* CLOSEBRACKET Spacing
 Class <-- OPENBRACKET !NOTIN  (!CLOSEBRACKET Range)* CLOSEBRACKET Spacing
 Range <-- Char DASH Char / Char
 Char <-- '\\\\' [nrt'\"\\[\\]\\\\]
@@ -78,6 +79,7 @@ DQUOTE < [\"]
 DASH < '-'
 OPENBRACKET < '['
 CLOSEBRACKET < ']'
+NOTIN < '^'
 SLASH < '/' Spacing
 AND <-- '&' Spacing
 NOT <-- '!' Spacing
@@ -122,6 +124,7 @@ EndOfFile < !.
       (and OPEN Expression CLOSE)
       Literal
       Class
+      NotInClass
       DOT))
 (define-sexp-parser Identifier all
   (and IdentStart (* IdentCont) Spacing))
@@ -133,7 +136,11 @@ EndOfFile < !.
   (or (and SQUOTE (* (and (not-followed-by SQUOTE) Char)) SQUOTE Spacing)
       (and DQUOTE (* (and (not-followed-by DQUOTE) Char)) DQUOTE Spacing)))
 (define-sexp-parser Class all
-  (and OPENBRACKET (* (and (not-followed-by CLOSEBRACKET) Range)) CLOSEBRACKET Spacing))
+  (and OPENBRACKET (not-followed-by NOTIN)
+       (* (and (not-followed-by CLOSEBRACKET) Range)) CLOSEBRACKET Spacing))
+(define-sexp-parser NotInClass all
+  (and OPENBRACKET NOTIN
+       (* (and (not-followed-by CLOSEBRACKET) Range)) CLOSEBRACKET Spacing))
 (define-sexp-parser Range all
   (or (and Char DASH Char) Char))
 (define-sexp-parser Char all
@@ -143,6 +150,8 @@ EndOfFile < !.
       (and (not-followed-by "\\") peg-any)))
 (define-sexp-parser LEFTARROW body
   (and (or "<--" "<-" "<") Spacing)) ; NOTE: <-- and < are extensions
+(define-sexp-parser NOTIN none
+  (and "^"))
 (define-sexp-parser SLASH none
   (and "/" Spacing))
 (define-sexp-parser AND all
@@ -279,6 +288,7 @@ EndOfFile < !.
       ('Identifier (Identifier->defn value for-syntax))
       ('Expression (Expression->defn value for-syntax))
       ('Literal    (Literal->defn value for-syntax))
+      ('NotInClass (NotInClass->defn value for-syntax))
       ('Class      (Class->defn value for-syntax)))))
 
 ;; (Identifier "hello")
@@ -291,13 +301,35 @@ EndOfFile < !.
 (define (Literal->defn lst for-syntax)
   (apply string (map (lambda (x) (Char->defn x for-syntax)) (cdr lst))))
 
-;; TODO: empty Class can happen: `[]`, but what does it represent?
+;; (NotInClass ...)
+;;  `-> (and ...)
+(define (NotInClass->defn lst for-syntax)
+  #`(and #,@(map (lambda (x) (NotInRange->defn x for-syntax))
+                 (cdr lst))))
+
 ;; (Class ...)
 ;;  `-> (or ...)
 (define (Class->defn lst for-syntax)
   #`(or #,@(map (lambda (x) (Range->defn x for-syntax))
                 (cdr lst))))
 
+;; NOTE: It's coming from NotInClass.
+;; For one character:
+;; (Range (Char "a"))
+;;  `-> (not-in-range #\a #\a)
+;; Or for a range:
+;; (Range (Char "a") (Char "b"))
+;;  `-> (not-in-range #\a #\b)
+(define (NotInRange->defn lst for-syntax)
+  (match lst
+    (('Range c)
+     (let ((ch (Char->defn c for-syntax)))
+       #`(not-in-range #,ch #,ch)))
+    (('Range range-beginning range-end)
+     #`(not-in-range
+         #,(Char->defn range-beginning for-syntax)
+         #,(Char->defn range-end       for-syntax)))))
+
 ;; For one character:
 ;; (Range (Char "a"))
 ;;  `-> "a"
diff --git a/test-suite/tests/peg.test b/test-suite/tests/peg.test
index 1136c03f1..d9e3e1b22 100644
--- a/test-suite/tests/peg.test
+++ b/test-suite/tests/peg.test
@@ -38,6 +38,7 @@
     (Identifier Identifier)
     (Literal Literal)
     (Class Class)
+    (NotInClass NotInClass)
     (Range Range)
     (Char Char)
     (LEFTARROW LEFTARROW)
@@ -85,7 +86,7 @@
 End <-- '*)'
 C <- Begin N* End
 N <- C / (!Begin !End Z)
-Z <- .")
+Z <- [^X-Z]") ;; Forbid some characters to test not-in-range
 
 ;; A short /etc/passwd file.
 (define *etc-passwd*
@@ -125,6 +126,9 @@ SLASH < '/'")
     (match-pattern C "(*blah*)")
     (make-prec 0 8 "(*blah*)"
 	       '((Begin "(*") "blah" (End "*)")))))
+  (pass-if
+   "simple comment with forbidden char"
+   (not (match-pattern C "(*blYh*)")))
   (pass-if
    "simple comment padded"
    (equal?
-- 
2.46.0


[-- Attachment #3: v4-0001-PEG-Add-full-support-for-PEG-some-extensions.patch --]
[-- Type: text/x-patch, Size: 22245 bytes --]

From 459d83ec2710bed5feb3d2a9ca28da8ede9da007 Mon Sep 17 00:00:00 2001
From: Ekaitz Zarraga <ekaitz@elenq.tech>
Date: Wed, 11 Sep 2024 21:19:26 +0200
Subject: [PATCH v4 1/2] PEG: Add full support for PEG + some extensions

This commit adds support for PEG as described in:

    <https://bford.info/pub/lang/peg.pdf>

It adds support for the missing features (comments, underscores in
identifiers and escaping) while keeping the extensions (dashes in
identifiers, < and <--).

The naming system tries to be as close as possible to the one proposed
in the paper.

* module/ice-9/peg/string-peg.scm: Rewrite PEG parser.
* test-suite/tests/peg.test: Fix import
---
 NEWS                            |   7 +
 doc/ref/api-peg.texi            |   8 +-
 module/ice-9/peg/string-peg.scm | 446 ++++++++++++++++++++------------
 test-suite/tests/peg.test       |  32 ++-
 4 files changed, 313 insertions(+), 180 deletions(-)

diff --git a/NEWS b/NEWS
index 9fd14c39d..df43f3754 100644
--- a/NEWS
+++ b/NEWS
@@ -27,6 +27,13 @@ Changes in 3.0.11 (since 3.0.10)
 ** Guile is compiled with -fexcess-precision=standard for i[3456]86 when possible
    (<https://debbugs.gnu.org/43262>)
 
+* New interfaces and functionality
+
+** PEG parser
+
+PEG parser has been rewritten to cover all the functionality defined in
+<https://bford.info/pub/lang/peg.pdf>.
+
 \f
 Changes in 3.0.10 (since 3.0.9)
 
diff --git a/doc/ref/api-peg.texi b/doc/ref/api-peg.texi
index d34ddc64c..84a9e6c6b 100644
--- a/doc/ref/api-peg.texi
+++ b/doc/ref/api-peg.texi
@@ -17,6 +17,10 @@ Wikipedia has a clear and concise introduction to PEGs if you want to
 familiarize yourself with the syntax:
 @url{http://en.wikipedia.org/wiki/Parsing_expression_grammar}.
 
+The paper that introduced PEG contains a more detailed description of how PEG
+works, and describes its syntax in detail:
+@url{https://bford.info/pub/lang/peg.pdf}
+
 The @code{(ice-9 peg)} module works by compiling PEGs down to lambda
 expressions.  These can either be stored in variables at compile-time by
 the define macros (@code{define-peg-pattern} and
@@ -216,8 +220,8 @@ should propagate up the parse tree.  The normal @code{<-} propagates the
 matched text up the parse tree, @code{<--} propagates the matched text
 up the parse tree tagged with the name of the nonterminal, and @code{<}
 discards that matched text and propagates nothing up the parse tree.
-Also, nonterminals may consist of any alphanumeric character or a ``-''
-character (in normal PEGs nonterminals can only be alphabetic).
+Also, nonterminals may include ``-'' character, while in normal PEG it is not
+allowed.
 
 For example, if we:
 @lisp
diff --git a/module/ice-9/peg/string-peg.scm b/module/ice-9/peg/string-peg.scm
index 45ed14bb1..4b923220a 100644
--- a/module/ice-9/peg/string-peg.scm
+++ b/module/ice-9/peg/string-peg.scm
@@ -1,6 +1,7 @@
 ;;;; string-peg.scm --- representing PEG grammars as strings
 ;;;;
-;;;; 	Copyright (C) 2010, 2011 Free Software Foundation, Inc.
+;;;; 	Copyright (C) 2010, 2011, Free Software Foundation, Inc.
+;;;; 	Copyright (C) 2024 Ekaitz Zarraga <ekaitz@elenq.tech>
 ;;;;
 ;;;; This library is free software; you can redistribute it and/or
 ;;;; modify it under the terms of the GNU Lesser General Public
@@ -21,10 +22,15 @@
   #:export (peg-as-peg
             define-peg-string-patterns
             peg-grammar)
+  #:use-module (ice-9 match)
   #:use-module (ice-9 peg using-parsers)
+  #:use-module (srfi srfi-1)
   #:use-module (ice-9 peg codegen)
   #:use-module (ice-9 peg simplify-tree))
 
+;; This module provides support for PEG as described in:
+;;   <https://bford.info/pub/lang/peg.pdf>
+
 ;; Gets the left-hand depth of a list.
 (define (depth lst)
   (if (or (not (list? lst)) (null? lst))
@@ -38,22 +44,58 @@
 
 ;; Grammar for PEGs in PEG grammar.
 (define peg-as-peg
-"grammar <-- (nonterminal ('<--' / '<-' / '<') sp pattern)+
-pattern <-- alternative (SLASH sp alternative)*
-alternative <-- ([!&]? sp suffix)+
-suffix <-- primary ([*+?] sp)*
-primary <-- '(' sp pattern ')' sp / '.' sp / literal / charclass / nonterminal !'<'
-literal <-- ['] (!['] .)* ['] sp
-charclass <-- LB (!']' (CCrange / CCsingle))* RB sp
-CCrange <-- . '-' .
-CCsingle <-- .
-nonterminal <-- [a-zA-Z0-9-]+ sp
-sp < [ \t\n]*
-SLASH < '/'
-LB < '['
-RB < ']'
+"# Hierarchical syntax
+Grammar <-- Spacing Definition+ EndOfFile
+Definition <-- Identifier LEFTARROW Expression
+
+Expression <-- Sequence (SLASH Sequence)*
+Sequence <-- Prefix*
+Prefix <-- (AND / NOT)? Suffix
+Suffix <-- Primary (QUESTION / STAR / PLUS)?
+Primary <-- Identifier !LEFTARROW
+           / OPEN Expression CLOSE
+           / Literal / Class / DOT
+
+# Lexical syntax
+Identifier <-- IdentStart IdentCont* Spacing
+# NOTE: `-` is an extension
+IdentStart <- [a-zA-Z_] / '-'
+IdentCont <- IdentStart / [0-9]
+
+Literal <-- SQUOTE (!SQUOTE Char)* SQUOTE Spacing
+        / DQUOTE (!DQUOTE Char)* DQUOTE Spacing
+Class <-- OPENBRACKET !NOTIN  (!CLOSEBRACKET Range)* CLOSEBRACKET Spacing
+Range <-- Char DASH Char / Char
+Char <-- '\\\\' [nrt'\"\\[\\]\\\\]
+       / '\\\\' [0-7][0-7][0-7]
+       / '\\\\' [0-7][0-7]?
+       / !'\\\\' .
+
+# NOTE: `<--` and `<` are extensions
+LEFTARROW <- ('<--' / '<-' / '<') Spacing
+SQUOTE < [']
+DQUOTE < [\"]
+DASH < '-'
+OPENBRACKET < '['
+CLOSEBRACKET < ']'
+SLASH < '/' Spacing
+AND <-- '&' Spacing
+NOT <-- '!' Spacing
+QUESTION <-- '?' Spacing
+STAR <-- '*' Spacing
+PLUS <-- '+' Spacing
+OPEN < '(' Spacing
+CLOSE < ')' Spacing
+DOT <-- '.' Spacing
+
+Spacing < (Space / Comment)*
+Comment < '#' (!EndOfLine .)* EndOfLine
+Space < ' ' / '\t' / EndOfLine
+EndOfLine < '\\r\\n' / '\\n' / '\\r'
+EndOfFile < !.
 ")
 
+
 (define-syntax define-sexp-parser
   (lambda (x)
     (syntax-case x ()
@@ -63,35 +105,78 @@ RB < ']'
               (syn (wrap-parser-for-users x matchf accumsym #'sym)))
            #`(define sym #,syn))))))
 
-(define-sexp-parser peg-grammar all
-  (+ (and peg-nonterminal (or "<--" "<-" "<") peg-sp peg-pattern)))
-(define-sexp-parser peg-pattern all
-  (and peg-alternative
-       (* (and (ignore "/") peg-sp peg-alternative))))
-(define-sexp-parser peg-alternative all
-  (+ (and (? (or "!" "&")) peg-sp peg-suffix)))
-(define-sexp-parser peg-suffix all
-  (and peg-primary (* (and (or "*" "+" "?") peg-sp))))
-(define-sexp-parser peg-primary all
-  (or (and "(" peg-sp peg-pattern ")" peg-sp)
-      (and "." peg-sp)
-      peg-literal
-      peg-charclass
-      (and peg-nonterminal (not-followed-by "<"))))
-(define-sexp-parser peg-literal all
-  (and "'" (* (and (not-followed-by "'") peg-any)) "'" peg-sp))
-(define-sexp-parser peg-charclass all
-  (and (ignore "[")
-       (* (and (not-followed-by "]")
-               (or charclass-range charclass-single)))
-       (ignore "]")
-       peg-sp))
-(define-sexp-parser charclass-range all (and peg-any "-" peg-any))
-(define-sexp-parser charclass-single all peg-any)
-(define-sexp-parser peg-nonterminal all
-  (and (+ (or (range #\a #\z) (range #\A #\Z) (range #\0 #\9) "-")) peg-sp))
-(define-sexp-parser peg-sp none
-  (* (or " " "\t" "\n")))
+(define-sexp-parser Grammar all
+  (and Spacing (+ Definition) EndOfFile))
+(define-sexp-parser Definition all
+  (and Identifier LEFTARROW Expression))
+(define-sexp-parser Expression all
+  (and Sequence (* (and SLASH Sequence))))
+(define-sexp-parser Sequence all
+  (* Prefix))
+(define-sexp-parser Prefix all
+  (and (? (or AND NOT)) Suffix))
+(define-sexp-parser Suffix all
+  (and Primary (? (or QUESTION STAR PLUS))))
+(define-sexp-parser Primary all
+  (or (and Identifier (not-followed-by LEFTARROW))
+      (and OPEN Expression CLOSE)
+      Literal
+      Class
+      DOT))
+(define-sexp-parser Identifier all
+  (and IdentStart (* IdentCont) Spacing))
+(define-sexp-parser IdentStart body
+  (or (or (range #\a #\z) (range #\A #\Z) "_") "-")) ; NOTE: - is an extension
+(define-sexp-parser IdentCont body
+  (or IdentStart (range #\0 #\9)))
+(define-sexp-parser Literal all
+  (or (and SQUOTE (* (and (not-followed-by SQUOTE) Char)) SQUOTE Spacing)
+      (and DQUOTE (* (and (not-followed-by DQUOTE) Char)) DQUOTE Spacing)))
+(define-sexp-parser Class all
+  (and OPENBRACKET (* (and (not-followed-by CLOSEBRACKET) Range)) CLOSEBRACKET Spacing))
+(define-sexp-parser Range all
+  (or (and Char DASH Char) Char))
+(define-sexp-parser Char all
+  (or (and "\\" (or "n" "r" "t" "'" "\"" "[" "]" "\\"))
+      (and "\\" (range #\0 #\7) (range #\0 #\7) (range #\0 #\7))
+      (and "\\" (range #\0 #\7) (? (range #\0 #\7)))
+      (and (not-followed-by "\\") peg-any)))
+(define-sexp-parser LEFTARROW body
+  (and (or "<--" "<-" "<") Spacing)) ; NOTE: <-- and < are extensions
+(define-sexp-parser SLASH none
+  (and "/" Spacing))
+(define-sexp-parser AND all
+  (and "&" Spacing))
+(define-sexp-parser NOT all
+  (and "!" Spacing))
+(define-sexp-parser QUESTION all
+  (and "?" Spacing))
+(define-sexp-parser STAR all
+  (and "*" Spacing))
+(define-sexp-parser PLUS all
+  (and "+" Spacing))
+(define-sexp-parser OPEN none
+  (and "(" Spacing))
+(define-sexp-parser CLOSE none
+  (and ")" Spacing))
+(define-sexp-parser DOT all
+  (and "." Spacing))
+(define-sexp-parser SQUOTE none "'")
+(define-sexp-parser DQUOTE none "\"")
+(define-sexp-parser OPENBRACKET none "[")
+(define-sexp-parser CLOSEBRACKET none "]")
+(define-sexp-parser DASH none "-")
+(define-sexp-parser Spacing none
+  (* (or Space Comment)))
+(define-sexp-parser Comment none
+  (and "#" (* (and (not-followed-by EndOfLine) peg-any)) EndOfLine))
+(define-sexp-parser Space none
+  (or " " "\t" EndOfLine))
+(define-sexp-parser EndOfLine none
+  (or "\r\n" "\n" "\r"))
+(define-sexp-parser EndOfFile none
+  (not-followed-by peg-any))
+
 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 ;;;;; PARSE STRING PEGS
@@ -101,7 +186,7 @@ RB < ']'
 ;; will define all of the nonterminals in the grammar with equivalent
 ;; PEG s-expressions.
 (define (peg-parser str for-syntax)
-  (let ((parsed (match-pattern peg-grammar str)))
+  (let ((parsed (match-pattern Grammar str)))
     (if (not parsed)
         (begin
           ;; (display "Invalid PEG grammar!\n")
@@ -110,11 +195,154 @@ RB < ']'
           (cond
            ((or (not (list? lst)) (null? lst))
             lst)
-           ((eq? (car lst) 'peg-grammar)
-            #`(begin
-                #,@(map (lambda (x) (peg-nonterm->defn x for-syntax))
-                        (context-flatten (lambda (lst) (<= (depth lst) 2))
-                                         (cdr lst))))))))))
+           ((eq? (car lst) 'Grammar)
+            (Grammar->defn lst for-syntax)))))))
+
+;; (Grammar (Definition ...) (Definition ...))
+(define (Grammar->defn lst for-syntax)
+  #`(begin
+      #,@(map (lambda (x) (Definition->defn x for-syntax))
+              (context-flatten (lambda (lst) (<= (depth lst) 1))
+                               (cdr lst)))))
+
+;; (Definition (Identifier "Something") "<-" (Expression ...))
+;;  `-> (define-peg-pattern Something 'all ...)
+(define (Definition->defn lst for-syntax)
+  (match lst
+    (('Definition ('Identifier identifier) grabber expression)
+     #`(define-peg-pattern
+         #,(datum->syntax for-syntax (string->symbol identifier))
+         #,(match grabber
+                  ("<--" (datum->syntax for-syntax 'all))
+                  ("<-"  (datum->syntax for-syntax 'body))
+                  ("<"   (datum->syntax for-syntax 'none)))
+         #,(compressor
+             (Expression->defn expression for-syntax)
+             for-syntax)))))
+
+;; (Expression X)
+;;  `-> (or X)
+;; (Expression X Y)
+;;  `-> (or X Y)
+;; (Expression X (Y Z ...))
+;;  `-> (or X Y Z ...)
+(define (Expression->defn lst for-syntax)
+  (match lst
+    (('Expression seq ...)
+     #`(or #,@(map (lambda (x) (Sequence->defn x for-syntax))
+                   (keyword-flatten '(Sequence) seq))))))
+
+;; (Sequence X)
+;;  `-> (and X)
+;; (Sequence X Y)
+;;  `-> (and X Y)
+;; (Sequence X (Y Z ...))
+;;  `-> (and X Y Z ...)
+(define (Sequence->defn lst for-syntax)
+  (match lst
+    (('Sequence pre ...)
+     #`(and #,@(map (lambda (x) (Prefix->defn x for-syntax))
+                    (keyword-flatten '(Prefix) pre))))))
+
+;; (Prefix (Suffix ...))
+;;  `-> (...)
+;; (Prefix (NOT "!") (Suffix ...))
+;;  `-> (not-followed-by ...)
+;; (Prefix (AND "&") (Suffix ...))
+;;  `-> (followed-by ...)
+(define (Prefix->defn lst for-syntax)
+  (match lst
+    (('Prefix ('AND _) su) #`(followed-by     #,(Suffix->defn su for-syntax)))
+    (('Prefix ('NOT _) su) #`(not-followed-by #,(Suffix->defn su for-syntax)))
+    (('Prefix suffix) (Suffix->defn suffix for-syntax))))
+
+;; (Suffix (Primary ...))
+;;  `-> (...)
+;; (Suffix (Primary ...) (STAR "*"))
+;;  `-> (* ...)
+;; (Suffix (Primary ...) (QUESTION "?"))
+;;  `-> (? ...)
+;; (Suffix (Primary ...) (PLUS "+"))
+;;  `-> (+ ...)
+(define (Suffix->defn lst for-syntax)
+  (match lst
+    (('Suffix prim)               (Primary->defn prim for-syntax))
+    (('Suffix prim ('STAR     _)) #`(* #,(Primary->defn prim for-syntax)))
+    (('Suffix prim ('QUESTION _)) #`(? #,(Primary->defn prim for-syntax)))
+    (('Suffix prim ('PLUS     _)) #`(+ #,(Primary->defn prim for-syntax)))))
+
+
+(define (Primary->defn lst for-syntax)
+  (let ((value (second lst)))
+    (match (car value)
+      ('DOT        #'peg-any)
+      ('Identifier (Identifier->defn value for-syntax))
+      ('Expression (Expression->defn value for-syntax))
+      ('Literal    (Literal->defn value for-syntax))
+      ('Class      (Class->defn value for-syntax)))))
+
+;; (Identifier "hello")
+;;  `-> hello
+(define (Identifier->defn lst for-syntax)
+  (datum->syntax for-syntax (string->symbol (second lst))))
+
+;; (Literal (Char "a") (Char "b") (Char "c"))
+;;  `-> "abc"
+(define (Literal->defn lst for-syntax)
+  (apply string (map (lambda (x) (Char->defn x for-syntax)) (cdr lst))))
+
+;; TODO: empty Class can happen: `[]`, but what does it represent?
+;; (Class ...)
+;;  `-> (or ...)
+(define (Class->defn lst for-syntax)
+  #`(or #,@(map (lambda (x) (Range->defn x for-syntax))
+                (cdr lst))))
+
+;; For one character:
+;; (Range (Char "a"))
+;;  `-> "a"
+;; Or for a range:
+;; (Range (Char "a") (Char "b"))
+;;  `-> (range #\a #\b)
+(define (Range->defn lst for-syntax)
+  (match lst
+    (('Range ch)
+     (string (Char->defn ch for-syntax)))
+    (('Range range-beginning range-end)
+     #`(range
+         #,(Char->defn range-beginning for-syntax)
+         #,(Char->defn range-end       for-syntax)))))
+
+;; (Char "a")
+;;  `-> #\a
+;; (Char "\\n")
+;;  `-> #\newline
+;; (Char "\\135")
+;;  `-> #\]
+(define (Char->defn lst for-syntax)
+  (let* ((charstr (second lst))
+         (first   (string-ref charstr 0)))
+    (cond
+      ((= 1 (string-length charstr)) first)
+      ((char-numeric? (string-ref charstr 1))
+       (integer->char
+         (reduce + 0
+                 (map
+                   (lambda (x y)
+                     (* (- (char->integer x) (char->integer #\0)) y))
+                   (reverse (string->list charstr 1))
+                   '(1 8 64)))))
+      (else
+        (case (string-ref charstr 1)
+          ((#\n) #\newline)
+          ((#\r) #\return)
+          ((#\t) #\tab)
+          ((#\') #\')
+          ((#\]) #\])
+          ((#\\) #\\)
+          ((#\[) #\[))))))
+
+(define peg-grammar Grammar)
 
 ;; Macro wrapper for PEG-PARSER.  Parses PEG grammars expressed as strings and
 ;; defines all the appropriate nonterminals.
@@ -124,119 +352,6 @@ RB < ']'
       ((_ str)
        (peg-parser (syntax->datum #'str) x)))))
 
-;; lst has format (nonterm grabber pattern), where
-;;   nonterm is a symbol (the name of the nonterminal),
-;;   grabber is a string (either "<", "<-" or "<--"), and
-;;   pattern is the parse of a PEG pattern expressed as as string.
-(define (peg-nonterm->defn lst for-syntax)
-  (let* ((nonterm (car lst))
-         (grabber (cadr lst))
-         (pattern (caddr lst))
-         (nonterm-name (datum->syntax for-syntax
-                                      (string->symbol (cadr nonterm)))))
-    #`(define-peg-pattern #,nonterm-name
-       #,(cond
-          ((string=? grabber "<--") (datum->syntax for-syntax 'all))
-          ((string=? grabber "<-") (datum->syntax for-syntax 'body))
-          (else (datum->syntax for-syntax 'none)))
-       #,(compressor (peg-pattern->defn pattern for-syntax) for-syntax))))
-
-;; lst has format ('peg-pattern ...).
-;; After the context-flatten, (cdr lst) has format
-;;   (('peg-alternative ...) ...), where the outer list is a collection
-;;   of elements from a '/' alternative.
-(define (peg-pattern->defn lst for-syntax)
-  #`(or #,@(map (lambda (x) (peg-alternative->defn x for-syntax))
-                (context-flatten (lambda (x) (eq? (car x) 'peg-alternative))
-                                 (cdr lst)))))
-
-;; lst has format ('peg-alternative ...).
-;; After the context-flatten, (cdr lst) has the format
-;;   (item ...), where each item has format either ("!" ...), ("&" ...),
-;;   or ('peg-suffix ...).
-(define (peg-alternative->defn lst for-syntax)
-  #`(and #,@(map (lambda (x) (peg-body->defn x for-syntax))
-                 (context-flatten (lambda (x) (or (string? (car x))
-                                             (eq? (car x) 'peg-suffix)))
-                                  (cdr lst)))))
-
-;; lst has the format either
-;;   ("!" ('peg-suffix ...)), ("&" ('peg-suffix ...)), or
-;;     ('peg-suffix ...).
-(define (peg-body->defn lst for-syntax)
-    (cond
-      ((equal? (car lst) "&")
-       #`(followed-by #,(peg-suffix->defn (cadr lst) for-syntax)))
-      ((equal? (car lst) "!")
-       #`(not-followed-by #,(peg-suffix->defn (cadr lst) for-syntax)))
-      ((eq? (car lst) 'peg-suffix)
-       (peg-suffix->defn lst for-syntax))
-      (else `(peg-parse-body-fail ,lst))))
-
-;; lst has format ('peg-suffix <peg-primary> (? (/ "*" "?" "+")))
-(define (peg-suffix->defn lst for-syntax)
-  (let ((inner-defn (peg-primary->defn (cadr lst) for-syntax)))
-    (cond
-      ((null? (cddr lst))
-       inner-defn)
-      ((equal? (caddr lst) "*")
-       #`(* #,inner-defn))
-      ((equal? (caddr lst) "?")
-       #`(? #,inner-defn))
-      ((equal? (caddr lst) "+")
-       #`(+ #,inner-defn)))))
-
-;; Parse a primary.
-(define (peg-primary->defn lst for-syntax)
-  (let ((el (cadr lst)))
-  (cond
-   ((list? el)
-    (cond
-     ((eq? (car el) 'peg-literal)
-      (peg-literal->defn el for-syntax))
-     ((eq? (car el) 'peg-charclass)
-      (peg-charclass->defn el for-syntax))
-     ((eq? (car el) 'peg-nonterminal)
-      (datum->syntax for-syntax (string->symbol (cadr el))))))
-   ((string? el)
-    (cond
-     ((equal? el "(")
-      (peg-pattern->defn (caddr lst) for-syntax))
-     ((equal? el ".")
-      (datum->syntax for-syntax 'peg-any))
-     (else (datum->syntax for-syntax
-                          `(peg-parse-any unknown-string ,lst)))))
-   (else (datum->syntax for-syntax
-                        `(peg-parse-any unknown-el ,lst))))))
-
-;; Trims characters off the front and end of STR.
-;; (trim-1chars "'ab'") -> "ab"
-(define (trim-1chars str) (substring str 1 (- (string-length str) 1)))
-
-;; Parses a literal.
-(define (peg-literal->defn lst for-syntax)
-  (datum->syntax for-syntax (trim-1chars (cadr lst))))
-
-;; Parses a charclass.
-(define (peg-charclass->defn lst for-syntax)
-  #`(or
-     #,@(map
-         (lambda (cc)
-           (cond
-            ((eq? (car cc) 'charclass-range)
-             #`(range #,(datum->syntax
-                         for-syntax
-                         (string-ref (cadr cc) 0))
-                      #,(datum->syntax
-                         for-syntax
-                         (string-ref (cadr cc) 2))))
-            ((eq? (car cc) 'charclass-single)
-             (datum->syntax for-syntax (cadr cc)))))
-         (context-flatten
-          (lambda (x) (or (eq? (car x) 'charclass-range)
-                          (eq? (car x) 'charclass-single)))
-          (cdr lst)))))
-
 ;; Compresses a list to save the optimizer work.
 ;; e.g. (or (and a)) -> a
 (define (compressor-core lst)
@@ -263,11 +378,10 @@ RB < ']'
      (let ((string (syntax->datum #'str-stx)))
        (compile-peg-pattern
         (compressor
-         (peg-pattern->defn
-          (peg:tree (match-pattern peg-pattern string)) #'str-stx)
+         (Expression->defn
+          (peg:tree (match-pattern Expression string)) #'str-stx)
          #'str-stx)
         (if (eq? accum 'all) 'body accum))))
      (else (error "Bad embedded PEG string" args))))
 
 (add-peg-compiler! 'peg peg-string-compile)
-
diff --git a/test-suite/tests/peg.test b/test-suite/tests/peg.test
index f516571e8..1136c03f1 100644
--- a/test-suite/tests/peg.test
+++ b/test-suite/tests/peg.test
@@ -28,17 +28,25 @@
 ;; the nonterminals defined in the PEG parser written with
 ;; S-expressions.
 (define grammar-mapping
-  '((grammar peg-grammar)
-    (pattern peg-pattern)
-    (alternative peg-alternative)
-    (suffix peg-suffix)
-    (primary peg-primary)
-    (literal peg-literal)
-    (charclass peg-charclass)
-    (CCrange charclass-range)
-    (CCsingle charclass-single)
-    (nonterminal peg-nonterminal)
-    (sp peg-sp)))
+  '((Grammar Grammar)
+    (Definition Definition)
+    (Expression Expression)
+    (Sequence Sequence)
+    (Prefix Prefix)
+    (Suffix Suffix)
+    (Primary Primary)
+    (Identifier Identifier)
+    (Literal Literal)
+    (Class Class)
+    (Range Range)
+    (Char Char)
+    (LEFTARROW LEFTARROW)
+    (AND AND)
+    (NOT NOT)
+    (QUESTION QUESTION)
+    (STAR STAR)
+    (PLUS PLUS)
+    (DOT DOT)))
 
 ;; Transforms the nonterminals defined in the PEG parser written as a PEG to the nonterminals defined in the PEG parser written with S-expressions.
 (define (grammar-transform x)
@@ -69,7 +77,7 @@
     (peg:tree (match-pattern (@@ (ice-9 peg) peg-grammar) (@@ (ice-9 peg) peg-as-peg)))
     (tree-map
      grammar-transform
-     (peg:tree (match-pattern grammar (@@ (ice-9 peg) peg-as-peg)))))))
+     (peg:tree (match-pattern Grammar (@@ (ice-9 peg) peg-as-peg)))))))
 
 ;; A grammar for pascal-style comments from Wikipedia.
 (define comment-grammar
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* bug#73188: PEG: Fix bugs and add complex PEG for testing
  2024-09-11 22:03 bug#73188: PEG parser does not support full PEG grammar Ekaitz Zarraga
  2024-09-12 20:57 ` bug#73188: [PATCH v2] PEG: Add full support for PEG + some extensions Ekaitz Zarraga
  2024-10-11 12:31 ` bug#73188: [PATCH] PEG: Add support for `not-in-range` and [^...] Ekaitz Zarraga
@ 2024-10-30 19:04 ` Ekaitz Zarraga
  2024-12-22 17:45 ` bug#73188: PEG parser does not support full PEG grammar Ekaitz Zarraga
  2024-12-22 20:01 ` bug#73188: [PATCH 1/3] PEG: fix [^...] Ekaitz Zarraga
  4 siblings, 0 replies; 19+ messages in thread
From: Ekaitz Zarraga @ 2024-10-30 19:04 UTC (permalink / raw)
  To: 73188; +Cc: Ludovic Courtès

[-- Attachment #1: Type: text/plain, Size: 425 bytes --]

Hi,

I decided to improve the tests of the PEG module because I wasn't very 
confident about the [^...] functionality, and I found I had some minor 
bugs in the previous patch.

I attach a new version of the previous commits with an extra one that 
adds an HTML parser and tests against it. That's what made me find some 
of the errors and missing bits.

With the test I feel more confident about the changes.

Thanks,
Ekaitz

[-- Attachment #2: v5-0003-PEG-Add-a-complex-PEG-grammar-test.patch --]
[-- Type: text/x-patch, Size: 7996 bytes --]

From e682544a68ff1d9e41fb954539884b7163124589 Mon Sep 17 00:00:00 2001
From: Ekaitz Zarraga <ekaitz@elenq.tech>
Date: Wed, 30 Oct 2024 19:52:28 +0100
Subject: [PATCH v5 3/3] PEG: Add a complex PEG grammar test

Add a complex PEG grammar for HTML parsing and test against it.
This properly tests for complex constructs, specially `[^...]`.

* test-suite/tests/peg.test (html-grammar): New variable.
(html-example): New variable.
(Parsing with complex grammars): New test.
---
 test-suite/tests/peg.test | 113 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 113 insertions(+)

diff --git a/test-suite/tests/peg.test b/test-suite/tests/peg.test
index 5570fbfa8..d8d047288 100644
--- a/test-suite/tests/peg.test
+++ b/test-suite/tests/peg.test
@@ -284,3 +284,116 @@ number <-- [0-9]+")
    "1+1/2*3+(1+1)/2"
    (equal? (eq-parse "1+1/2*3+(1+1)/2")
 	   '(+ (+ 1 (* (/ 1 2) 3)) (/ (+ 1 1) 2)))))
+
+
+(define html-grammar
+"
+# Based on code from https://github.com/Fantom-Factory/afHtmlParser
+# 2014-2023 Steve Eynon. This code was originally released under the following
+# terms:
+#
+#      Permission to use, copy, modify, and/or distribute this software for any
+#      purpose with or without fee is hereby granted, provided that the above
+#      copyright notice and this permission notice appear in all copies.
+#
+#      THE SOFTWARE IS PROVIDED \"AS IS\" AND THE AUTHOR DISCLAIMS ALL
+#      WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES
+#      OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE
+#      FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY
+#      DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER
+#      IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING
+#      OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+
+# PEG Rules for parsing well formed HTML 5 documents
+# https://html.spec.whatwg.org/multipage/syntax.html
+
+html        <-- bom? blurb* doctype? blurb* xmlProlog? blurb* elem blurb*
+bom         <-- \"\\uFEFF\"
+xmlProlog   <-- \"<?xml\" (!\"?>\" .)+ \"?>\"
+
+# ---- Doctype ----
+
+doctype           <-- \"<!DOCTYPE\" [ \\t\\n\\f\\r]+ [a-zA-Z0-9]+ (doctypePublicId / doctypeSystemId)* [ \\t\\n\\f\\r]* \">\"
+doctypePublicId   <-- [ \\t\\n\\f\\r]+  \"PUBLIC\" [ \\t\\n\\f\\r]+   ((\"\\\"\" [^\"]* \"\\\"\") / (\"'\" [^']* \"'\"))
+doctypeSystemId   <-- [ \\t\\n\\f\\r]+ (\"SYSTEM\" [ \\t\\n\\f\\r]+)? ((\"\\\"\" [^\"]* \"\\\"\") / (\"'\" [^']* \"'\"))
+
+# ---- Elems ----
+
+elem              <-- voidElem / rawTextElem / escRawTextElem / selfClosingElem / normalElem
+voidElem          <-- \"<\"  voidElemName       attributes  \">\"
+rawTextElem       <-- \"<\"  rawTextElemName    attributes  \">\" rawTextContent endElem
+escRawTextElem    <-- \"<\"  escRawTextElemName attributes  \">\" escRawTextContent endElem
+selfClosingElem   <-- \"<\"  elemName           attributes \"/>\"
+normalElem        <-- \"<\"  elemName           attributes  \">\" normalContent? endElem?
+endElem           <-- \"</\" elemName                       \">\"
+
+elemName            <-- [a-zA-Z] [^\\t\\n\\f />]*
+voidElemName        <-- \"area\" / \"base\" / \"br\" / \"col\" / \"embed\" /
+                      \"hr\" / \"img\" / \"input\" / \"keygen\" / \"link\" /
+                      \"meta\" / \"param\" / \"source\" / \"track\" / \"wbr\"
+rawTextElemName     <-- \"script\" / \"style\"
+escRawTextElemName  <-- \"textarea\" / \"title\"
+
+rawTextContent      <-- (!(\"</script>\" / \"</style>\") .)+
+escRawTextContent   <-- ((!(\"</textarea>\" / \"</title>\" / \"&\") .)+ / charRef)*
+normalContent       <-- !\"</\" (([^<&]+ / charRef) / comment / cdata / elem)*
+
+# ---- Attributes ----
+
+attributes        <-- (&[^/>] ([ \\t]+ / doubleQuoteAttr / singleQuoteAttr / unquotedAttr / emptyAttr))*
+attrName          <-- [^ \\t\\n\\r\\f\"'>/=]+
+emptyAttr         <-- attrName+
+unquotedAttr      <-- attrName [ \\t]* \"=\" [ \\t]*      (charRef / [^ \\t\\n\\r\\f\"'=<>`&]+)+
+singleQuoteAttr   <-- attrName [ \\t]* \"=\" [ \\t]* \"'\"  (charRef / [^'&]+)* \"'\"
+doubleQuoteAttr   <-- attrName [ \\t]* \"=\" [ \\t]* \"\\\"\" (charRef / [^\"&]+)* \"\\\"\"
+
+# ---- Character References ----
+
+charRef         <-- &\"&\" (decNumCharRef / hexNumCharRef / namedCharRef / borkedRef)
+namedCharRef    <-- \"&\"   [^;>]+ \";\"
+decNumCharRef   <-- \"&#\"  [0-9]+ \";\"
+hexNumCharRef   <-- \"&#x\" [a-fA-F0-9]+ \";\"
+borkedRef       <-- \"&\"  &[ \\t]
+
+# ---- Misc ----
+
+cdata       <-- \"<![CDATA[\" (!\"]]>\" .)+ \"]]>\"
+comment     <-- \"<!--\" (!\"--\" .)+ \"-->\"
+blurb       <-- [ \\t\\n\\f\\r]+ / comment")
+
+(define html-example "
+<!DOCTYPE html>
+<html>
+<head>
+    <title>Example Domain</title>
+    <meta charset=\"utf-8\" />
+    <meta http-equiv=\"Content-type\" content=\"text/html; charset=utf-8\" />
+    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\" />
+    <style type=\"text/css\">
+    body {
+        background-color: #f0f0f2;
+        margin: 0;
+        padding: 0;
+    }
+    </style>
+</head>
+
+<body>
+<div>
+    <h1>Example Domain</h1>
+    <p>This domain is for use in illustrative examples in documents. You may
+    use this domain in literature without prior coordination or asking for
+    permission.</p> <p><a href=\"https://www.iana.org/domains/example\">More
+    information...</a></p>
+</div>
+</body>
+</html>
+")
+
+(with-test-prefix "Parsing with complex grammars"
+  (eeval `(define-peg-string-patterns ,html-grammar))
+  (pass-if
+    "HTML parsing"
+    (equal?
+      (peg:tree (match-pattern html html-example))
+      '(html (blurb "\n") (doctype "<!DOCTYPE html>") (blurb "\n") (elem (normalElem "<" (elemName "html") attributes ">" (normalContent "\n" (elem (normalElem "<" (elemName "head") attributes ">" (normalContent "\n    " (elem (escRawTextElem "<" (escRawTextElemName "title") attributes ">" (escRawTextContent "Example Domain") (endElem "</" (elemName "title") ">"))) "\n    " (elem (selfClosingElem "<" (elemName "meta") (attributes " " (doubleQuoteAttr (attrName "charset") "=\"utf-8\"") " ") "/>")) "\n    " (elem (selfClosingElem "<" (elemName "meta") (attributes " " (doubleQuoteAttr (attrName "http-equiv") "=\"Content-type\"") " " (doubleQuoteAttr (attrName "content") "=\"text/html; charset=utf-8\"") " ") "/>")) "\n    " (elem (selfClosingElem "<" (elemName "meta") (attributes " " (doubleQuoteAttr (attrName "name") "=\"viewport\"") " " (doubleQuoteAttr (attrName "content") "=\"width=device-width, initial-scale=1\"") " ") "/>")) "\n    " (elem (rawTextElem "<" (rawTextElemName "style") (attributes " " (doubleQuoteAttr (attrName "type") "=\"text/css\"")) ">" (rawTextContent "\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n    }\n    ") (endElem "</" (elemName "style") ">"))) "\n") (endElem "</" (elemName "head") ">"))) "\n\n" (elem (normalElem "<" (elemName "body") attributes ">" (normalContent "\n" (elem (normalElem "<" (elemName "div") attributes ">" (normalContent "\n    " (elem (normalElem "<" (elemName "h1") attributes ">" (normalContent "Example Domain") (endElem "</" (elemName "h1") ">"))) "\n    " (elem (normalElem "<" (elemName "p") attributes ">" (normalContent "This domain is for use in illustrative examples in documents. You may\n    use this domain in literature without prior coordination or asking for\n    permission.") (endElem "</" (elemName "p") ">"))) " " (elem (normalElem "<" (elemName "p") attributes ">" (normalContent (elem (normalElem "<" (elemName "a") (attributes " " (doubleQuoteAttr (attrName "href") "=\"https://www.iana.org/domains/example\"")) ">" (normalContent "More\n    information...") (endElem "</" (elemName "a") ">")))) (endElem "</" (elemName "p") ">"))) "\n") (endElem "</" (elemName "div") ">"))) "\n") (endElem "</" (elemName "body") ">"))) "\n") (endElem "</" (elemName "html") ">"))) (blurb "\n")))))
-- 
2.46.0


[-- Attachment #3: v5-0002-PEG-Add-support-for-not-in-range-and.patch --]
[-- Type: text/x-patch, Size: 8383 bytes --]

From d894303ed01de72703b815061f826da97dd303f0 Mon Sep 17 00:00:00 2001
From: Ekaitz Zarraga <ekaitz@elenq.tech>
Date: Fri, 11 Oct 2024 14:24:30 +0200
Subject: [PATCH v5 2/3] PEG: Add support for `not-in-range` and [^...]

Modern PEG supports inversed class like `[^a-z]` that would get any
character not in the `a-z` range. This commit adds support for that and
also for a new `not-in-range` PEG pattern for scheme.

* module/ice-9/peg/codegen.scm (cg-not-in-range): New function.
* module/ice-9/peg/string-peg.scm: Add support for `[^...]`
* test-suite/tests/peg.test: Add NotInClass to grammar-mapping.
* doc/ref/api-peg.texi: Document accordingly.
---
 NEWS                            |  3 ++-
 doc/ref/api-peg.texi            |  8 ++++++
 module/ice-9/peg/codegen.scm    | 22 +++++++++++++++
 module/ice-9/peg/string-peg.scm | 48 ++++++++++++++++++++++++++++++---
 test-suite/tests/peg.test       |  2 +-
 5 files changed, 77 insertions(+), 6 deletions(-)

diff --git a/NEWS b/NEWS
index df43f3754..17ef560b1 100644
--- a/NEWS
+++ b/NEWS
@@ -32,7 +32,8 @@ Changes in 3.0.11 (since 3.0.10)
 ** PEG parser
 
 PEG parser has been rewritten to cover all the functionality defined in
-<https://bford.info/pub/lang/peg.pdf>.
+<https://bford.info/pub/lang/peg.pdf>. Also added the `not-in-range` pattern
+to `(ice-9 peg)` that is also available from PEG strings via `[^...]`.
 
 \f
 Changes in 3.0.10 (since 3.0.9)
diff --git a/doc/ref/api-peg.texi b/doc/ref/api-peg.texi
index 84a9e6c6b..edb090b20 100644
--- a/doc/ref/api-peg.texi
+++ b/doc/ref/api-peg.texi
@@ -147,6 +147,14 @@ Parses any character falling between @var{a} and @var{z}.
 @code{(range #\a #\z)}
 @end deftp
 
+@deftp {PEG Pattern} {inverse range of characters} a z
+Parses any character not falling between @var{a} and @var{z}.
+
+@code{"[^a-z]"}
+
+@code{(not-in-range #\a #\z)}
+@end deftp
+
 Example:
 
 @example
diff --git a/module/ice-9/peg/codegen.scm b/module/ice-9/peg/codegen.scm
index d80c3e849..82367ef55 100644
--- a/module/ice-9/peg/codegen.scm
+++ b/module/ice-9/peg/codegen.scm
@@ -140,6 +140,27 @@ return EXP."
                          ((none) #`(list (1+ pos) '()))
                          (else (error "bad accum" accum))))))))))
 
+;; Generates code for matching a range of characters not between start and end.
+;; E.g.: (cg-not-in-range syntax #\a #\z 'body)
+(define (cg-not-in-range pat accum)
+  (syntax-case pat ()
+    ((start end)
+     (if (not (and (char? (syntax->datum #'start))
+                   (char? (syntax->datum #'end))))
+         (error "range PEG should have characters after it; instead got"
+                #'start #'end))
+     #`(lambda (str len pos)
+         (and (< pos len)
+              (let ((c (string-ref str pos)))
+                (and (or (char<? c start) (char>? c end))
+                     #,(case accum
+                         ((all) #`(list (1+ pos)
+                                        (list 'cg-not-in-range (string c))))
+                         ((name) #`(list (1+ pos) 'cg-not-in-range))
+                         ((body) #`(list (1+ pos) (string c)))
+                         ((none) #`(list (1+ pos) '()))
+                         (else (error "bad accum" accum))))))))))
+
 ;; Generate code to match a pattern and do nothing with the result
 (define (cg-ignore pat accum)
   (syntax-case pat ()
@@ -304,6 +325,7 @@ return EXP."
         (assq-set! peg-compiler-alist symbol function)))
 
 (add-peg-compiler! 'range cg-range)
+(add-peg-compiler! 'not-in-range cg-not-in-range)
 (add-peg-compiler! 'ignore cg-ignore)
 (add-peg-compiler! 'capture cg-capture)
 (add-peg-compiler! 'and cg-and)
diff --git a/module/ice-9/peg/string-peg.scm b/module/ice-9/peg/string-peg.scm
index ede24181c..4b92b393c 100644
--- a/module/ice-9/peg/string-peg.scm
+++ b/module/ice-9/peg/string-peg.scm
@@ -54,7 +54,7 @@ Prefix <-- (AND / NOT)? Suffix
 Suffix <-- Primary (QUESTION / STAR / PLUS)?
 Primary <-- Identifier !LEFTARROW
            / OPEN Expression CLOSE
-           / Literal / Class / DOT
+           / Literal / Class / NotInClass / DOT
 
 # Lexical syntax
 Identifier <-- IdentStart IdentCont* Spacing
@@ -64,7 +64,8 @@ IdentCont <- IdentStart / [0-9]
 
 Literal <-- SQUOTE (!SQUOTE Char)* SQUOTE Spacing
         / DQUOTE (!DQUOTE Char)* DQUOTE Spacing
-Class <-- OPENBRACKET (!CLOSEBRACKET Range)* CLOSEBRACKET Spacing
+NotInClass <-- OPENBRACKET NOTIN  (!CLOSEBRACKET Range)* CLOSEBRACKET Spacing
+Class <-- OPENBRACKET !NOTIN  (!CLOSEBRACKET Range)* CLOSEBRACKET Spacing
 Range <-- Char DASH Char / Char
 Char <-- '\\\\' [nrtf'\"\\[\\]\\\\]
        / '\\\\' [0-7][0-7][0-7]
@@ -80,6 +81,7 @@ DASH < '-'
 OPENBRACKET < '['
 CLOSEBRACKET < ']'
 HEX <- [0-9a-fA-F]
+NOTIN < '^'
 SLASH < '/' Spacing
 AND <-- '&' Spacing
 NOT <-- '!' Spacing
@@ -124,6 +126,7 @@ EndOfFile < !.
       (and OPEN Expression CLOSE)
       Literal
       Class
+      NotInClass
       DOT))
 (define-sexp-parser Identifier all
   (and IdentStart (* IdentCont) Spacing))
@@ -135,7 +138,11 @@ EndOfFile < !.
   (or (and SQUOTE (* (and (not-followed-by SQUOTE) Char)) SQUOTE Spacing)
       (and DQUOTE (* (and (not-followed-by DQUOTE) Char)) DQUOTE Spacing)))
 (define-sexp-parser Class all
-  (and OPENBRACKET (* (and (not-followed-by CLOSEBRACKET) Range)) CLOSEBRACKET Spacing))
+  (and OPENBRACKET (not-followed-by NOTIN)
+       (* (and (not-followed-by CLOSEBRACKET) Range)) CLOSEBRACKET Spacing))
+(define-sexp-parser NotInClass all
+  (and OPENBRACKET NOTIN
+       (* (and (not-followed-by CLOSEBRACKET) Range)) CLOSEBRACKET Spacing))
 (define-sexp-parser Range all
   (or (and Char DASH Char) Char))
 (define-sexp-parser Char all
@@ -148,6 +155,8 @@ EndOfFile < !.
   (and (or "<--" "<-" "<") Spacing)) ; NOTE: <-- and < are extensions
 (define-sexp-parser HEX body
   (or (range #\0 #\9) (range #\a #\f) (range #\A #\F)))
+(define-sexp-parser NOTIN none
+  (and "^"))
 (define-sexp-parser SLASH none
   (and "/" Spacing))
 (define-sexp-parser AND all
@@ -284,6 +293,7 @@ EndOfFile < !.
       ('Identifier (Identifier->defn value for-syntax))
       ('Expression (Expression->defn value for-syntax))
       ('Literal    (Literal->defn value for-syntax))
+      ('NotInClass (NotInClass->defn value for-syntax))
       ('Class      (Class->defn value for-syntax)))))
 
 ;; (Identifier "hello")
@@ -296,13 +306,43 @@ EndOfFile < !.
 (define (Literal->defn lst for-syntax)
   (apply string (map (lambda (x) (Char->defn x for-syntax)) (cdr lst))))
 
-;; TODO: empty Class can happen: `[]`, but what does it represent?
+;; (NotInClass (Range ...) (Range ...))
+;;  `-> (and (followed-by (not-in-range ...))
+;;           (followed-by (not-in-range ...))
+;;           ...
+;;           (not-in-range ...))
+;; NOTE: the order doesn't matter, because all `not-in-range`s will always
+;; parse exactly one character, but all the elements but the last need not to
+;; consume the input.
+(define (NotInClass->defn lst for-syntax)
+  #`(and
+      #,@(map (lambda (x) #`(followed-by #,(NotInRange->defn x for-syntax)))
+              (cddr lst))
+      #,(NotInRange->defn (cadr lst) for-syntax)))
+
 ;; (Class ...)
 ;;  `-> (or ...)
 (define (Class->defn lst for-syntax)
   #`(or #,@(map (lambda (x) (Range->defn x for-syntax))
                 (cdr lst))))
 
+;; For one character:
+;; (Range (Char "a"))
+;;  `-> (not-in-range #\a #\a)
+;; Or for a range:
+;; (Range (Char "a") (Char "b"))
+;;  `-> (not-in-range #\a #\b)
+;; NOTE: It's coming from NotInClass.
+(define (NotInRange->defn lst for-syntax)
+  (match lst
+    (('Range c)
+     (let ((ch (Char->defn c for-syntax)))
+       #`(not-in-range #,ch #,ch)))
+    (('Range range-beginning range-end)
+     #`(not-in-range
+         #,(Char->defn range-beginning for-syntax)
+         #,(Char->defn range-end       for-syntax)))))
+
 ;; For one character:
 ;; (Range (Char "a"))
 ;;  `-> "a"
diff --git a/test-suite/tests/peg.test b/test-suite/tests/peg.test
index 1136c03f1..5570fbfa8 100644
--- a/test-suite/tests/peg.test
+++ b/test-suite/tests/peg.test
@@ -38,6 +38,7 @@
     (Identifier Identifier)
     (Literal Literal)
     (Class Class)
+    (NotInClass NotInClass)
     (Range Range)
     (Char Char)
     (LEFTARROW LEFTARROW)
@@ -283,4 +284,3 @@ number <-- [0-9]+")
    "1+1/2*3+(1+1)/2"
    (equal? (eq-parse "1+1/2*3+(1+1)/2")
 	   '(+ (+ 1 (* (/ 1 2) 3)) (/ (+ 1 1) 2)))))
-
-- 
2.46.0


[-- Attachment #4: v5-0001-PEG-Add-full-support-for-PEG-some-extensions.patch --]
[-- Type: text/x-patch, Size: 23039 bytes --]

From aa35db3cbc691b57c85374d6a269e8d344a0440a Mon Sep 17 00:00:00 2001
From: Ekaitz Zarraga <ekaitz@elenq.tech>
Date: Wed, 11 Sep 2024 21:19:26 +0200
Subject: [PATCH v5 1/3] PEG: Add full support for PEG + some extensions

This commit adds support for PEG as described in:

    <https://bford.info/pub/lang/peg.pdf>

It adds support for the missing features (comments, underscores in
identifiers and escaping) while keeping the extensions (dashes in
identifiers, < and <--).

The naming system tries to be as close as possible to the one proposed
in the paper.

* module/ice-9/peg/string-peg.scm: Rewrite PEG parser.
* test-suite/tests/peg.test: Fix import
---
 NEWS                            |   7 +
 doc/ref/api-peg.texi            |   8 +-
 module/ice-9/peg/string-peg.scm | 466 ++++++++++++++++++++------------
 test-suite/tests/peg.test       |  32 ++-
 4 files changed, 333 insertions(+), 180 deletions(-)

diff --git a/NEWS b/NEWS
index 9fd14c39d..df43f3754 100644
--- a/NEWS
+++ b/NEWS
@@ -27,6 +27,13 @@ Changes in 3.0.11 (since 3.0.10)
 ** Guile is compiled with -fexcess-precision=standard for i[3456]86 when possible
    (<https://debbugs.gnu.org/43262>)
 
+* New interfaces and functionality
+
+** PEG parser
+
+PEG parser has been rewritten to cover all the functionality defined in
+<https://bford.info/pub/lang/peg.pdf>.
+
 \f
 Changes in 3.0.10 (since 3.0.9)
 
diff --git a/doc/ref/api-peg.texi b/doc/ref/api-peg.texi
index d34ddc64c..84a9e6c6b 100644
--- a/doc/ref/api-peg.texi
+++ b/doc/ref/api-peg.texi
@@ -17,6 +17,10 @@ Wikipedia has a clear and concise introduction to PEGs if you want to
 familiarize yourself with the syntax:
 @url{http://en.wikipedia.org/wiki/Parsing_expression_grammar}.
 
+The paper that introduced PEG contains a more detailed description of how PEG
+works, and describes its syntax in detail:
+@url{https://bford.info/pub/lang/peg.pdf}
+
 The @code{(ice-9 peg)} module works by compiling PEGs down to lambda
 expressions.  These can either be stored in variables at compile-time by
 the define macros (@code{define-peg-pattern} and
@@ -216,8 +220,8 @@ should propagate up the parse tree.  The normal @code{<-} propagates the
 matched text up the parse tree, @code{<--} propagates the matched text
 up the parse tree tagged with the name of the nonterminal, and @code{<}
 discards that matched text and propagates nothing up the parse tree.
-Also, nonterminals may consist of any alphanumeric character or a ``-''
-character (in normal PEGs nonterminals can only be alphabetic).
+Also, nonterminals may include ``-'' character, while in normal PEG it is not
+allowed.
 
 For example, if we:
 @lisp
diff --git a/module/ice-9/peg/string-peg.scm b/module/ice-9/peg/string-peg.scm
index 45ed14bb1..ede24181c 100644
--- a/module/ice-9/peg/string-peg.scm
+++ b/module/ice-9/peg/string-peg.scm
@@ -1,6 +1,7 @@
 ;;;; string-peg.scm --- representing PEG grammars as strings
 ;;;;
-;;;; 	Copyright (C) 2010, 2011 Free Software Foundation, Inc.
+;;;; 	Copyright (C) 2010, 2011, Free Software Foundation, Inc.
+;;;; 	Copyright (C) 2024 Ekaitz Zarraga <ekaitz@elenq.tech>
 ;;;;
 ;;;; This library is free software; you can redistribute it and/or
 ;;;; modify it under the terms of the GNU Lesser General Public
@@ -21,10 +22,15 @@
   #:export (peg-as-peg
             define-peg-string-patterns
             peg-grammar)
+  #:use-module (ice-9 match)
   #:use-module (ice-9 peg using-parsers)
+  #:use-module (srfi srfi-1)
   #:use-module (ice-9 peg codegen)
   #:use-module (ice-9 peg simplify-tree))
 
+;; This module provides support for PEG as described in:
+;;   <https://bford.info/pub/lang/peg.pdf>
+
 ;; Gets the left-hand depth of a list.
 (define (depth lst)
   (if (or (not (list? lst)) (null? lst))
@@ -38,22 +44,60 @@
 
 ;; Grammar for PEGs in PEG grammar.
 (define peg-as-peg
-"grammar <-- (nonterminal ('<--' / '<-' / '<') sp pattern)+
-pattern <-- alternative (SLASH sp alternative)*
-alternative <-- ([!&]? sp suffix)+
-suffix <-- primary ([*+?] sp)*
-primary <-- '(' sp pattern ')' sp / '.' sp / literal / charclass / nonterminal !'<'
-literal <-- ['] (!['] .)* ['] sp
-charclass <-- LB (!']' (CCrange / CCsingle))* RB sp
-CCrange <-- . '-' .
-CCsingle <-- .
-nonterminal <-- [a-zA-Z0-9-]+ sp
-sp < [ \t\n]*
-SLASH < '/'
-LB < '['
-RB < ']'
+"# Hierarchical syntax
+Grammar <-- Spacing Definition+ EndOfFile
+Definition <-- Identifier LEFTARROW Expression
+
+Expression <-- Sequence (SLASH Sequence)*
+Sequence <-- Prefix*
+Prefix <-- (AND / NOT)? Suffix
+Suffix <-- Primary (QUESTION / STAR / PLUS)?
+Primary <-- Identifier !LEFTARROW
+           / OPEN Expression CLOSE
+           / Literal / Class / DOT
+
+# Lexical syntax
+Identifier <-- IdentStart IdentCont* Spacing
+# NOTE: `-` is an extension
+IdentStart <- [a-zA-Z_] / '-'
+IdentCont <- IdentStart / [0-9]
+
+Literal <-- SQUOTE (!SQUOTE Char)* SQUOTE Spacing
+        / DQUOTE (!DQUOTE Char)* DQUOTE Spacing
+Class <-- OPENBRACKET (!CLOSEBRACKET Range)* CLOSEBRACKET Spacing
+Range <-- Char DASH Char / Char
+Char <-- '\\\\' [nrtf'\"\\[\\]\\\\]
+       / '\\\\' [0-7][0-7][0-7]
+       / '\\\\' [0-7][0-7]?
+       / '\\\\' 'u' HEX HEX HEX HEX
+       / !'\\\\' .
+
+# NOTE: `<--` and `<` are extensions
+LEFTARROW <- ('<--' / '<-' / '<') Spacing
+SQUOTE < [']
+DQUOTE < [\"]
+DASH < '-'
+OPENBRACKET < '['
+CLOSEBRACKET < ']'
+HEX <- [0-9a-fA-F]
+SLASH < '/' Spacing
+AND <-- '&' Spacing
+NOT <-- '!' Spacing
+QUESTION <-- '?' Spacing
+STAR <-- '*' Spacing
+PLUS <-- '+' Spacing
+OPEN < '(' Spacing
+CLOSE < ')' Spacing
+DOT <-- '.' Spacing
+
+Spacing < (Space / Comment)*
+Comment < '#' (!EndOfLine .)* EndOfLine
+Space < ' ' / '\\t' / EndOfLine
+EndOfLine < '\\r\\n' / '\\n' / '\\r'
+EndOfFile < !.
 ")
 
+
 (define-syntax define-sexp-parser
   (lambda (x)
     (syntax-case x ()
@@ -63,35 +107,81 @@ RB < ']'
               (syn (wrap-parser-for-users x matchf accumsym #'sym)))
            #`(define sym #,syn))))))
 
-(define-sexp-parser peg-grammar all
-  (+ (and peg-nonterminal (or "<--" "<-" "<") peg-sp peg-pattern)))
-(define-sexp-parser peg-pattern all
-  (and peg-alternative
-       (* (and (ignore "/") peg-sp peg-alternative))))
-(define-sexp-parser peg-alternative all
-  (+ (and (? (or "!" "&")) peg-sp peg-suffix)))
-(define-sexp-parser peg-suffix all
-  (and peg-primary (* (and (or "*" "+" "?") peg-sp))))
-(define-sexp-parser peg-primary all
-  (or (and "(" peg-sp peg-pattern ")" peg-sp)
-      (and "." peg-sp)
-      peg-literal
-      peg-charclass
-      (and peg-nonterminal (not-followed-by "<"))))
-(define-sexp-parser peg-literal all
-  (and "'" (* (and (not-followed-by "'") peg-any)) "'" peg-sp))
-(define-sexp-parser peg-charclass all
-  (and (ignore "[")
-       (* (and (not-followed-by "]")
-               (or charclass-range charclass-single)))
-       (ignore "]")
-       peg-sp))
-(define-sexp-parser charclass-range all (and peg-any "-" peg-any))
-(define-sexp-parser charclass-single all peg-any)
-(define-sexp-parser peg-nonterminal all
-  (and (+ (or (range #\a #\z) (range #\A #\Z) (range #\0 #\9) "-")) peg-sp))
-(define-sexp-parser peg-sp none
-  (* (or " " "\t" "\n")))
+(define-sexp-parser Grammar all
+  (and Spacing (+ Definition) EndOfFile))
+(define-sexp-parser Definition all
+  (and Identifier LEFTARROW Expression))
+(define-sexp-parser Expression all
+  (and Sequence (* (and SLASH Sequence))))
+(define-sexp-parser Sequence all
+  (* Prefix))
+(define-sexp-parser Prefix all
+  (and (? (or AND NOT)) Suffix))
+(define-sexp-parser Suffix all
+  (and Primary (? (or QUESTION STAR PLUS))))
+(define-sexp-parser Primary all
+  (or (and Identifier (not-followed-by LEFTARROW))
+      (and OPEN Expression CLOSE)
+      Literal
+      Class
+      DOT))
+(define-sexp-parser Identifier all
+  (and IdentStart (* IdentCont) Spacing))
+(define-sexp-parser IdentStart body
+  (or (or (range #\a #\z) (range #\A #\Z) "_") "-")) ; NOTE: - is an extension
+(define-sexp-parser IdentCont body
+  (or IdentStart (range #\0 #\9)))
+(define-sexp-parser Literal all
+  (or (and SQUOTE (* (and (not-followed-by SQUOTE) Char)) SQUOTE Spacing)
+      (and DQUOTE (* (and (not-followed-by DQUOTE) Char)) DQUOTE Spacing)))
+(define-sexp-parser Class all
+  (and OPENBRACKET (* (and (not-followed-by CLOSEBRACKET) Range)) CLOSEBRACKET Spacing))
+(define-sexp-parser Range all
+  (or (and Char DASH Char) Char))
+(define-sexp-parser Char all
+  (or (and "\\" (or "n" "r" "t" "f" "'" "\"" "[" "]" "\\"))
+      (and "\\" (range #\0 #\7) (range #\0 #\7) (range #\0 #\7))
+      (and "\\" (range #\0 #\7) (? (range #\0 #\7)))
+      (and "\\" "u" HEX HEX HEX HEX)
+      (and (not-followed-by "\\") peg-any)))
+(define-sexp-parser LEFTARROW body
+  (and (or "<--" "<-" "<") Spacing)) ; NOTE: <-- and < are extensions
+(define-sexp-parser HEX body
+  (or (range #\0 #\9) (range #\a #\f) (range #\A #\F)))
+(define-sexp-parser SLASH none
+  (and "/" Spacing))
+(define-sexp-parser AND all
+  (and "&" Spacing))
+(define-sexp-parser NOT all
+  (and "!" Spacing))
+(define-sexp-parser QUESTION all
+  (and "?" Spacing))
+(define-sexp-parser STAR all
+  (and "*" Spacing))
+(define-sexp-parser PLUS all
+  (and "+" Spacing))
+(define-sexp-parser OPEN none
+  (and "(" Spacing))
+(define-sexp-parser CLOSE none
+  (and ")" Spacing))
+(define-sexp-parser DOT all
+  (and "." Spacing))
+(define-sexp-parser SQUOTE none "'")
+(define-sexp-parser DQUOTE none "\"")
+(define-sexp-parser OPENBRACKET none "[")
+(define-sexp-parser CLOSEBRACKET none "]")
+(define-sexp-parser DASH none "-")
+(define-sexp-parser Spacing none
+  (* (or Space Comment)))
+(define-sexp-parser Comment none
+  (and "#" (* (and (not-followed-by EndOfLine) peg-any)) EndOfLine))
+(define-sexp-parser Space none
+  (or " " "\t" EndOfLine))
+(define-sexp-parser EndOfLine none
+  (or "\r\n" "\n" "\r"))
+(define-sexp-parser EndOfFile none
+  (not-followed-by peg-any))
+
 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 ;;;;; PARSE STRING PEGS
@@ -101,7 +191,7 @@ RB < ']'
 ;; will define all of the nonterminals in the grammar with equivalent
 ;; PEG s-expressions.
 (define (peg-parser str for-syntax)
-  (let ((parsed (match-pattern peg-grammar str)))
+  (let ((parsed (match-pattern Grammar str)))
     (if (not parsed)
         (begin
           ;; (display "Invalid PEG grammar!\n")
@@ -110,11 +200,169 @@ RB < ']'
           (cond
            ((or (not (list? lst)) (null? lst))
             lst)
-           ((eq? (car lst) 'peg-grammar)
-            #`(begin
-                #,@(map (lambda (x) (peg-nonterm->defn x for-syntax))
-                        (context-flatten (lambda (lst) (<= (depth lst) 2))
-                                         (cdr lst))))))))))
+           ((eq? (car lst) 'Grammar)
+            (Grammar->defn lst for-syntax)))))))
+
+;; (Grammar (Definition ...) (Definition ...))
+(define (Grammar->defn lst for-syntax)
+  #`(begin
+      #,@(map (lambda (x) (Definition->defn x for-syntax))
+              (context-flatten (lambda (lst) (<= (depth lst) 1))
+                               (cdr lst)))))
+
+;; (Definition (Identifier "Something") "<-" (Expression ...))
+;;  `-> (define-peg-pattern Something 'all ...)
+(define (Definition->defn lst for-syntax)
+  (match lst
+    (('Definition ('Identifier identifier) grabber expression)
+     #`(define-peg-pattern
+         #,(datum->syntax for-syntax (string->symbol identifier))
+         #,(match grabber
+                  ("<--" (datum->syntax for-syntax 'all))
+                  ("<-"  (datum->syntax for-syntax 'body))
+                  ("<"   (datum->syntax for-syntax 'none)))
+         #,(compressor
+             (Expression->defn expression for-syntax)
+             for-syntax)))))
+
+;; (Expression X)
+;;  `-> (or X)
+;; (Expression X Y)
+;;  `-> (or X Y)
+;; (Expression X (Y Z ...))
+;;  `-> (or X Y Z ...)
+(define (Expression->defn lst for-syntax)
+  (match lst
+    (('Expression seq ...)
+     #`(or #,@(map (lambda (x) (Sequence->defn x for-syntax))
+                   (keyword-flatten '(Sequence) seq))))))
+
+;; (Sequence X)
+;;  `-> (and X)
+;; (Sequence X Y)
+;;  `-> (and X Y)
+;; (Sequence X (Y Z ...))
+;;  `-> (and X Y Z ...)
+(define (Sequence->defn lst for-syntax)
+  (match lst
+    (('Sequence pre ...)
+     #`(and #,@(map (lambda (x) (Prefix->defn x for-syntax))
+                    (keyword-flatten '(Prefix) pre))))))
+
+;; (Prefix (Suffix ...))
+;;  `-> (...)
+;; (Prefix (NOT "!") (Suffix ...))
+;;  `-> (not-followed-by ...)
+;; (Prefix (AND "&") (Suffix ...))
+;;  `-> (followed-by ...)
+(define (Prefix->defn lst for-syntax)
+  (match lst
+    (('Prefix ('AND _) su) #`(followed-by     #,(Suffix->defn su for-syntax)))
+    (('Prefix ('NOT _) su) #`(not-followed-by #,(Suffix->defn su for-syntax)))
+    (('Prefix suffix) (Suffix->defn suffix for-syntax))))
+
+;; (Suffix (Primary ...))
+;;  `-> (...)
+;; (Suffix (Primary ...) (STAR "*"))
+;;  `-> (* ...)
+;; (Suffix (Primary ...) (QUESTION "?"))
+;;  `-> (? ...)
+;; (Suffix (Primary ...) (PLUS "+"))
+;;  `-> (+ ...)
+(define (Suffix->defn lst for-syntax)
+  (match lst
+    (('Suffix prim)               (Primary->defn prim for-syntax))
+    (('Suffix prim ('STAR     _)) #`(* #,(Primary->defn prim for-syntax)))
+    (('Suffix prim ('QUESTION _)) #`(? #,(Primary->defn prim for-syntax)))
+    (('Suffix prim ('PLUS     _)) #`(+ #,(Primary->defn prim for-syntax)))))
+
+
+(define (Primary->defn lst for-syntax)
+  (let ((value (second lst)))
+    (match (car value)
+      ('DOT        #'peg-any)
+      ('Identifier (Identifier->defn value for-syntax))
+      ('Expression (Expression->defn value for-syntax))
+      ('Literal    (Literal->defn value for-syntax))
+      ('Class      (Class->defn value for-syntax)))))
+
+;; (Identifier "hello")
+;;  `-> hello
+(define (Identifier->defn lst for-syntax)
+  (datum->syntax for-syntax (string->symbol (second lst))))
+
+;; (Literal (Char "a") (Char "b") (Char "c"))
+;;  `-> "abc"
+(define (Literal->defn lst for-syntax)
+  (apply string (map (lambda (x) (Char->defn x for-syntax)) (cdr lst))))
+
+;; TODO: empty Class can happen: `[]`, but what does it represent?
+;; (Class ...)
+;;  `-> (or ...)
+(define (Class->defn lst for-syntax)
+  #`(or #,@(map (lambda (x) (Range->defn x for-syntax))
+                (cdr lst))))
+
+;; For one character:
+;; (Range (Char "a"))
+;;  `-> "a"
+;; Or for a range:
+;; (Range (Char "a") (Char "b"))
+;;  `-> (range #\a #\b)
+(define (Range->defn lst for-syntax)
+  (match lst
+    (('Range ch)
+     (string (Char->defn ch for-syntax)))
+    (('Range range-beginning range-end)
+     #`(range
+         #,(Char->defn range-beginning for-syntax)
+         #,(Char->defn range-end       for-syntax)))))
+
+;; (Char "a")
+;;  `-> #\a
+;; (Char "\\n")
+;;  `-> #\newline
+;; (Char "\\135")
+;;  `-> #\]
+(define (Char->defn lst for-syntax)
+  (let* ((charstr (second lst))
+         (first   (string-ref charstr 0)))
+    (cond
+      ((= 1 (string-length charstr)) first)
+      ((char-numeric? (string-ref charstr 1))
+       (integer->char
+         (reduce + 0
+                 (map
+                   (lambda (x y)
+                     (* (- (char->integer x) (char->integer #\0)) y))
+                   (reverse (string->list charstr 1))
+                   '(1 8 64)))))
+      ((char=? #\u (string-ref charstr 1))
+       (integer->char
+         (reduce + 0
+                 (map
+                   (lambda (x y)
+                     (* (cond
+                          ((char-numeric? x)
+                           (- (char->integer x) (char->integer #\0)))
+                          ((char-alphabetic? x)
+                           (+ 10 (- (char->integer x) (char->integer #\a)))))
+                        y))
+                   (reverse (string->list (string-downcase charstr) 2))
+                   '(1 16 256 4096)))))
+      (else
+        (case (string-ref charstr 1)
+          ((#\n) #\newline)
+          ((#\r) #\return)
+          ((#\t) #\tab)
+          ((#\f) #\page)
+          ((#\') #\')
+          ((#\") #\")
+          ((#\]) #\])
+          ((#\\) #\\)
+          ((#\[) #\[))))))
+
+(define peg-grammar Grammar)
 
 ;; Macro wrapper for PEG-PARSER.  Parses PEG grammars expressed as strings and
 ;; defines all the appropriate nonterminals.
@@ -124,119 +372,6 @@ RB < ']'
       ((_ str)
        (peg-parser (syntax->datum #'str) x)))))
 
-;; lst has format (nonterm grabber pattern), where
-;;   nonterm is a symbol (the name of the nonterminal),
-;;   grabber is a string (either "<", "<-" or "<--"), and
-;;   pattern is the parse of a PEG pattern expressed as as string.
-(define (peg-nonterm->defn lst for-syntax)
-  (let* ((nonterm (car lst))
-         (grabber (cadr lst))
-         (pattern (caddr lst))
-         (nonterm-name (datum->syntax for-syntax
-                                      (string->symbol (cadr nonterm)))))
-    #`(define-peg-pattern #,nonterm-name
-       #,(cond
-          ((string=? grabber "<--") (datum->syntax for-syntax 'all))
-          ((string=? grabber "<-") (datum->syntax for-syntax 'body))
-          (else (datum->syntax for-syntax 'none)))
-       #,(compressor (peg-pattern->defn pattern for-syntax) for-syntax))))
-
-;; lst has format ('peg-pattern ...).
-;; After the context-flatten, (cdr lst) has format
-;;   (('peg-alternative ...) ...), where the outer list is a collection
-;;   of elements from a '/' alternative.
-(define (peg-pattern->defn lst for-syntax)
-  #`(or #,@(map (lambda (x) (peg-alternative->defn x for-syntax))
-                (context-flatten (lambda (x) (eq? (car x) 'peg-alternative))
-                                 (cdr lst)))))
-
-;; lst has format ('peg-alternative ...).
-;; After the context-flatten, (cdr lst) has the format
-;;   (item ...), where each item has format either ("!" ...), ("&" ...),
-;;   or ('peg-suffix ...).
-(define (peg-alternative->defn lst for-syntax)
-  #`(and #,@(map (lambda (x) (peg-body->defn x for-syntax))
-                 (context-flatten (lambda (x) (or (string? (car x))
-                                             (eq? (car x) 'peg-suffix)))
-                                  (cdr lst)))))
-
-;; lst has the format either
-;;   ("!" ('peg-suffix ...)), ("&" ('peg-suffix ...)), or
-;;     ('peg-suffix ...).
-(define (peg-body->defn lst for-syntax)
-    (cond
-      ((equal? (car lst) "&")
-       #`(followed-by #,(peg-suffix->defn (cadr lst) for-syntax)))
-      ((equal? (car lst) "!")
-       #`(not-followed-by #,(peg-suffix->defn (cadr lst) for-syntax)))
-      ((eq? (car lst) 'peg-suffix)
-       (peg-suffix->defn lst for-syntax))
-      (else `(peg-parse-body-fail ,lst))))
-
-;; lst has format ('peg-suffix <peg-primary> (? (/ "*" "?" "+")))
-(define (peg-suffix->defn lst for-syntax)
-  (let ((inner-defn (peg-primary->defn (cadr lst) for-syntax)))
-    (cond
-      ((null? (cddr lst))
-       inner-defn)
-      ((equal? (caddr lst) "*")
-       #`(* #,inner-defn))
-      ((equal? (caddr lst) "?")
-       #`(? #,inner-defn))
-      ((equal? (caddr lst) "+")
-       #`(+ #,inner-defn)))))
-
-;; Parse a primary.
-(define (peg-primary->defn lst for-syntax)
-  (let ((el (cadr lst)))
-  (cond
-   ((list? el)
-    (cond
-     ((eq? (car el) 'peg-literal)
-      (peg-literal->defn el for-syntax))
-     ((eq? (car el) 'peg-charclass)
-      (peg-charclass->defn el for-syntax))
-     ((eq? (car el) 'peg-nonterminal)
-      (datum->syntax for-syntax (string->symbol (cadr el))))))
-   ((string? el)
-    (cond
-     ((equal? el "(")
-      (peg-pattern->defn (caddr lst) for-syntax))
-     ((equal? el ".")
-      (datum->syntax for-syntax 'peg-any))
-     (else (datum->syntax for-syntax
-                          `(peg-parse-any unknown-string ,lst)))))
-   (else (datum->syntax for-syntax
-                        `(peg-parse-any unknown-el ,lst))))))
-
-;; Trims characters off the front and end of STR.
-;; (trim-1chars "'ab'") -> "ab"
-(define (trim-1chars str) (substring str 1 (- (string-length str) 1)))
-
-;; Parses a literal.
-(define (peg-literal->defn lst for-syntax)
-  (datum->syntax for-syntax (trim-1chars (cadr lst))))
-
-;; Parses a charclass.
-(define (peg-charclass->defn lst for-syntax)
-  #`(or
-     #,@(map
-         (lambda (cc)
-           (cond
-            ((eq? (car cc) 'charclass-range)
-             #`(range #,(datum->syntax
-                         for-syntax
-                         (string-ref (cadr cc) 0))
-                      #,(datum->syntax
-                         for-syntax
-                         (string-ref (cadr cc) 2))))
-            ((eq? (car cc) 'charclass-single)
-             (datum->syntax for-syntax (cadr cc)))))
-         (context-flatten
-          (lambda (x) (or (eq? (car x) 'charclass-range)
-                          (eq? (car x) 'charclass-single)))
-          (cdr lst)))))
-
 ;; Compresses a list to save the optimizer work.
 ;; e.g. (or (and a)) -> a
 (define (compressor-core lst)
@@ -263,11 +398,10 @@ RB < ']'
      (let ((string (syntax->datum #'str-stx)))
        (compile-peg-pattern
         (compressor
-         (peg-pattern->defn
-          (peg:tree (match-pattern peg-pattern string)) #'str-stx)
+         (Expression->defn
+          (peg:tree (match-pattern Expression string)) #'str-stx)
          #'str-stx)
         (if (eq? accum 'all) 'body accum))))
      (else (error "Bad embedded PEG string" args))))
 
 (add-peg-compiler! 'peg peg-string-compile)
-
diff --git a/test-suite/tests/peg.test b/test-suite/tests/peg.test
index f516571e8..1136c03f1 100644
--- a/test-suite/tests/peg.test
+++ b/test-suite/tests/peg.test
@@ -28,17 +28,25 @@
 ;; the nonterminals defined in the PEG parser written with
 ;; S-expressions.
 (define grammar-mapping
-  '((grammar peg-grammar)
-    (pattern peg-pattern)
-    (alternative peg-alternative)
-    (suffix peg-suffix)
-    (primary peg-primary)
-    (literal peg-literal)
-    (charclass peg-charclass)
-    (CCrange charclass-range)
-    (CCsingle charclass-single)
-    (nonterminal peg-nonterminal)
-    (sp peg-sp)))
+  '((Grammar Grammar)
+    (Definition Definition)
+    (Expression Expression)
+    (Sequence Sequence)
+    (Prefix Prefix)
+    (Suffix Suffix)
+    (Primary Primary)
+    (Identifier Identifier)
+    (Literal Literal)
+    (Class Class)
+    (Range Range)
+    (Char Char)
+    (LEFTARROW LEFTARROW)
+    (AND AND)
+    (NOT NOT)
+    (QUESTION QUESTION)
+    (STAR STAR)
+    (PLUS PLUS)
+    (DOT DOT)))
 
 ;; Transforms the nonterminals defined in the PEG parser written as a PEG to the nonterminals defined in the PEG parser written with S-expressions.
 (define (grammar-transform x)
@@ -69,7 +77,7 @@
     (peg:tree (match-pattern (@@ (ice-9 peg) peg-grammar) (@@ (ice-9 peg) peg-as-peg)))
     (tree-map
      grammar-transform
-     (peg:tree (match-pattern grammar (@@ (ice-9 peg) peg-as-peg)))))))
+     (peg:tree (match-pattern Grammar (@@ (ice-9 peg) peg-as-peg)))))))
 
 ;; A grammar for pascal-style comments from Wikipedia.
 (define comment-grammar
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* bug#73188: PEG parser does not support full PEG grammar
  2024-10-20 20:18             ` Ekaitz Zarraga
@ 2024-12-09 17:23               ` Ludovic Courtès
  0 siblings, 0 replies; 19+ messages in thread
From: Ludovic Courtès @ 2024-12-09 17:23 UTC (permalink / raw)
  To: Ekaitz Zarraga; +Cc: 73188-done

Egun on!

Finally pushed v4 with slight edits to ‘NEWS’:

  25504ba21 PEG: Add support for `not-in-range` and [^...]
  ff11753df PEG: Add full support for PEG + some extensions

Thank you!

Ludo’.





^ permalink raw reply	[flat|nested] 19+ messages in thread

* bug#73188: PEG parser does not support full PEG grammar
  2024-09-11 22:03 bug#73188: PEG parser does not support full PEG grammar Ekaitz Zarraga
                   ` (2 preceding siblings ...)
  2024-10-30 19:04 ` bug#73188: PEG: Fix bugs and add complex PEG for testing Ekaitz Zarraga
@ 2024-12-22 17:45 ` Ekaitz Zarraga
  2024-12-22 20:09   ` Ekaitz Zarraga
  2024-12-22 20:01 ` bug#73188: [PATCH 1/3] PEG: fix [^...] Ekaitz Zarraga
  4 siblings, 1 reply; 19+ messages in thread
From: Ekaitz Zarraga @ 2024-12-22 17:45 UTC (permalink / raw)
  To: 73188; +Cc: Ludovic Courtès

hi!

I didn't realize but you pushed the v4, and I detected a bug in it so I 
sent v5.

The bug is in the [^...] implementation.

v5 fixes it and it even simplifies the code.

Please revert 25504ba21 and apply the v5 instead.





^ permalink raw reply	[flat|nested] 19+ messages in thread

* bug#73188: [PATCH 1/3] PEG: fix [^...]
  2024-09-11 22:03 bug#73188: PEG parser does not support full PEG grammar Ekaitz Zarraga
                   ` (3 preceding siblings ...)
  2024-12-22 17:45 ` bug#73188: PEG parser does not support full PEG grammar Ekaitz Zarraga
@ 2024-12-22 20:01 ` Ekaitz Zarraga
  2024-12-22 20:01   ` bug#73188: [PATCH 2/3] PEG: string-peg: better support for escaping Ekaitz Zarraga
                     ` (2 more replies)
  4 siblings, 3 replies; 19+ messages in thread
From: Ekaitz Zarraga @ 2024-12-22 20:01 UTC (permalink / raw)
  To: 73188; +Cc: ludo, Ekaitz Zarraga

---
 module/ice-9/peg/string-peg.scm | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/module/ice-9/peg/string-peg.scm b/module/ice-9/peg/string-peg.scm
index 0026f8930..745d8e8e7 100644
--- a/module/ice-9/peg/string-peg.scm
+++ b/module/ice-9/peg/string-peg.scm
@@ -301,11 +301,19 @@ EndOfFile < !.
 (define (Literal->defn lst for-syntax)
   (apply string (map (lambda (x) (Char->defn x for-syntax)) (cdr lst))))
 
-;; (NotInClass ...)
-;;  `-> (and ...)
+;; (NotInClass (Range ...) (Range ...))
+;;  `-> (and (followed-by (not-in-range ...))
+;;           (followed-by (not-in-range ...))
+;;           ...
+;;           (not-in-range ...))
+;; NOTE: the order doesn't matter, because all `not-in-range`s will always
+;; parse exactly one character, but all the elements but the last need not to
+;; consume the input.
 (define (NotInClass->defn lst for-syntax)
-  #`(and #,@(map (lambda (x) (NotInRange->defn x for-syntax))
-                 (cdr lst))))
+  #`(and
+      #,@(map (lambda (x) #`(followed-by #,(NotInRange->defn x for-syntax)))
+              (cddr lst))
+      #,(NotInRange->defn (cadr lst) for-syntax)))
 
 ;; (Class ...)
 ;;  `-> (or ...)
-- 
2.46.0






^ permalink raw reply related	[flat|nested] 19+ messages in thread

* bug#73188: [PATCH 2/3] PEG: string-peg: better support for escaping
  2024-12-22 20:01 ` bug#73188: [PATCH 1/3] PEG: fix [^...] Ekaitz Zarraga
@ 2024-12-22 20:01   ` Ekaitz Zarraga
  2024-12-22 20:01   ` bug#73188: [PATCH 3/3] PEG: add large string-peg patch Ekaitz Zarraga
  2024-12-22 21:22   ` bug#73188: PEG parser does not support full PEG grammar Ludovic Courtès
  2 siblings, 0 replies; 19+ messages in thread
From: Ekaitz Zarraga @ 2024-12-22 20:01 UTC (permalink / raw)
  To: 73188; +Cc: ludo, Ekaitz Zarraga

---
 module/ice-9/peg/string-peg.scm | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/module/ice-9/peg/string-peg.scm b/module/ice-9/peg/string-peg.scm
index 745d8e8e7..9891f2ae5 100644
--- a/module/ice-9/peg/string-peg.scm
+++ b/module/ice-9/peg/string-peg.scm
@@ -67,9 +67,10 @@ Literal <-- SQUOTE (!SQUOTE Char)* SQUOTE Spacing
 NotInClass <-- OPENBRACKET NOTIN  (!CLOSEBRACKET Range)* CLOSEBRACKET Spacing
 Class <-- OPENBRACKET !NOTIN  (!CLOSEBRACKET Range)* CLOSEBRACKET Spacing
 Range <-- Char DASH Char / Char
-Char <-- '\\\\' [nrt'\"\\[\\]\\\\]
+Char <-- '\\\\' [nrtf'\"\\[\\]\\\\]
        / '\\\\' [0-7][0-7][0-7]
        / '\\\\' [0-7][0-7]?
+       / '\\\\' 'u' HEX HEX HEX HEX
        / !'\\\\' .
 
 # NOTE: `<--` and `<` are extensions
@@ -79,6 +80,7 @@ DQUOTE < [\"]
 DASH < '-'
 OPENBRACKET < '['
 CLOSEBRACKET < ']'
+HEX <- [0-9a-fA-F]
 NOTIN < '^'
 SLASH < '/' Spacing
 AND <-- '&' Spacing
@@ -92,7 +94,7 @@ DOT <-- '.' Spacing
 
 Spacing < (Space / Comment)*
 Comment < '#' (!EndOfLine .)* EndOfLine
-Space < ' ' / '\t' / EndOfLine
+Space < ' ' / '\\t' / EndOfLine
 EndOfLine < '\\r\\n' / '\\n' / '\\r'
 EndOfFile < !.
 ")
@@ -144,12 +146,15 @@ EndOfFile < !.
 (define-sexp-parser Range all
   (or (and Char DASH Char) Char))
 (define-sexp-parser Char all
-  (or (and "\\" (or "n" "r" "t" "'" "\"" "[" "]" "\\"))
+  (or (and "\\" (or "n" "r" "t" "f" "'" "\"" "[" "]" "\\"))
       (and "\\" (range #\0 #\7) (range #\0 #\7) (range #\0 #\7))
       (and "\\" (range #\0 #\7) (? (range #\0 #\7)))
+      (and "\\" "u" HEX HEX HEX HEX)
       (and (not-followed-by "\\") peg-any)))
 (define-sexp-parser LEFTARROW body
   (and (or "<--" "<-" "<") Spacing)) ; NOTE: <-- and < are extensions
+(define-sexp-parser HEX body
+  (or (range #\0 #\9) (range #\a #\f) (range #\A #\F)))
 (define-sexp-parser NOTIN none
   (and "^"))
 (define-sexp-parser SLASH none
@@ -372,12 +377,27 @@ EndOfFile < !.
                      (* (- (char->integer x) (char->integer #\0)) y))
                    (reverse (string->list charstr 1))
                    '(1 8 64)))))
+      ((char=? #\u (string-ref charstr 1))
+       (integer->char
+         (reduce + 0
+                 (map
+                   (lambda (x y)
+                     (* (cond
+                          ((char-numeric? x)
+                           (- (char->integer x) (char->integer #\0)))
+                          ((char-alphabetic? x)
+                           (+ 10 (- (char->integer x) (char->integer #\a)))))
+                        y))
+                   (reverse (string->list (string-downcase charstr) 2))
+                   '(1 16 256 4096)))))
       (else
         (case (string-ref charstr 1)
           ((#\n) #\newline)
           ((#\r) #\return)
           ((#\t) #\tab)
+          ((#\f) #\page)
           ((#\') #\')
+          ((#\") #\")
           ((#\]) #\])
           ((#\\) #\\)
           ((#\[) #\[))))))
-- 
2.46.0






^ permalink raw reply related	[flat|nested] 19+ messages in thread

* bug#73188: [PATCH 3/3] PEG: add large string-peg patch
  2024-12-22 20:01 ` bug#73188: [PATCH 1/3] PEG: fix [^...] Ekaitz Zarraga
  2024-12-22 20:01   ` bug#73188: [PATCH 2/3] PEG: string-peg: better support for escaping Ekaitz Zarraga
@ 2024-12-22 20:01   ` Ekaitz Zarraga
  2024-12-22 21:22   ` bug#73188: PEG parser does not support full PEG grammar Ludovic Courtès
  2 siblings, 0 replies; 19+ messages in thread
From: Ekaitz Zarraga @ 2024-12-22 20:01 UTC (permalink / raw)
  To: 73188; +Cc: ludo, Ekaitz Zarraga

---
 test-suite/tests/peg.test | 117 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 113 insertions(+), 4 deletions(-)

diff --git a/test-suite/tests/peg.test b/test-suite/tests/peg.test
index d9e3e1b22..d8d047288 100644
--- a/test-suite/tests/peg.test
+++ b/test-suite/tests/peg.test
@@ -86,7 +86,7 @@
 End <-- '*)'
 C <- Begin N* End
 N <- C / (!Begin !End Z)
-Z <- [^X-Z]") ;; Forbid some characters to test not-in-range
+Z <- .")
 
 ;; A short /etc/passwd file.
 (define *etc-passwd*
@@ -126,9 +126,6 @@ SLASH < '/'")
     (match-pattern C "(*blah*)")
     (make-prec 0 8 "(*blah*)"
 	       '((Begin "(*") "blah" (End "*)")))))
-  (pass-if
-   "simple comment with forbidden char"
-   (not (match-pattern C "(*blYh*)")))
   (pass-if
    "simple comment padded"
    (equal?
@@ -288,3 +285,115 @@ number <-- [0-9]+")
    (equal? (eq-parse "1+1/2*3+(1+1)/2")
 	   '(+ (+ 1 (* (/ 1 2) 3)) (/ (+ 1 1) 2)))))
 
+
+(define html-grammar
+"
+# Based on code from https://github.com/Fantom-Factory/afHtmlParser
+# 2014-2023 Steve Eynon. This code was originally released under the following
+# terms:
+#
+#      Permission to use, copy, modify, and/or distribute this software for any
+#      purpose with or without fee is hereby granted, provided that the above
+#      copyright notice and this permission notice appear in all copies.
+#
+#      THE SOFTWARE IS PROVIDED \"AS IS\" AND THE AUTHOR DISCLAIMS ALL
+#      WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES
+#      OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE
+#      FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY
+#      DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER
+#      IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING
+#      OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+
+# PEG Rules for parsing well formed HTML 5 documents
+# https://html.spec.whatwg.org/multipage/syntax.html
+
+html        <-- bom? blurb* doctype? blurb* xmlProlog? blurb* elem blurb*
+bom         <-- \"\\uFEFF\"
+xmlProlog   <-- \"<?xml\" (!\"?>\" .)+ \"?>\"
+
+# ---- Doctype ----
+
+doctype           <-- \"<!DOCTYPE\" [ \\t\\n\\f\\r]+ [a-zA-Z0-9]+ (doctypePublicId / doctypeSystemId)* [ \\t\\n\\f\\r]* \">\"
+doctypePublicId   <-- [ \\t\\n\\f\\r]+  \"PUBLIC\" [ \\t\\n\\f\\r]+   ((\"\\\"\" [^\"]* \"\\\"\") / (\"'\" [^']* \"'\"))
+doctypeSystemId   <-- [ \\t\\n\\f\\r]+ (\"SYSTEM\" [ \\t\\n\\f\\r]+)? ((\"\\\"\" [^\"]* \"\\\"\") / (\"'\" [^']* \"'\"))
+
+# ---- Elems ----
+
+elem              <-- voidElem / rawTextElem / escRawTextElem / selfClosingElem / normalElem
+voidElem          <-- \"<\"  voidElemName       attributes  \">\"
+rawTextElem       <-- \"<\"  rawTextElemName    attributes  \">\" rawTextContent endElem
+escRawTextElem    <-- \"<\"  escRawTextElemName attributes  \">\" escRawTextContent endElem
+selfClosingElem   <-- \"<\"  elemName           attributes \"/>\"
+normalElem        <-- \"<\"  elemName           attributes  \">\" normalContent? endElem?
+endElem           <-- \"</\" elemName                       \">\"
+
+elemName            <-- [a-zA-Z] [^\\t\\n\\f />]*
+voidElemName        <-- \"area\" / \"base\" / \"br\" / \"col\" / \"embed\" /
+                      \"hr\" / \"img\" / \"input\" / \"keygen\" / \"link\" /
+                      \"meta\" / \"param\" / \"source\" / \"track\" / \"wbr\"
+rawTextElemName     <-- \"script\" / \"style\"
+escRawTextElemName  <-- \"textarea\" / \"title\"
+
+rawTextContent      <-- (!(\"</script>\" / \"</style>\") .)+
+escRawTextContent   <-- ((!(\"</textarea>\" / \"</title>\" / \"&\") .)+ / charRef)*
+normalContent       <-- !\"</\" (([^<&]+ / charRef) / comment / cdata / elem)*
+
+# ---- Attributes ----
+
+attributes        <-- (&[^/>] ([ \\t]+ / doubleQuoteAttr / singleQuoteAttr / unquotedAttr / emptyAttr))*
+attrName          <-- [^ \\t\\n\\r\\f\"'>/=]+
+emptyAttr         <-- attrName+
+unquotedAttr      <-- attrName [ \\t]* \"=\" [ \\t]*      (charRef / [^ \\t\\n\\r\\f\"'=<>`&]+)+
+singleQuoteAttr   <-- attrName [ \\t]* \"=\" [ \\t]* \"'\"  (charRef / [^'&]+)* \"'\"
+doubleQuoteAttr   <-- attrName [ \\t]* \"=\" [ \\t]* \"\\\"\" (charRef / [^\"&]+)* \"\\\"\"
+
+# ---- Character References ----
+
+charRef         <-- &\"&\" (decNumCharRef / hexNumCharRef / namedCharRef / borkedRef)
+namedCharRef    <-- \"&\"   [^;>]+ \";\"
+decNumCharRef   <-- \"&#\"  [0-9]+ \";\"
+hexNumCharRef   <-- \"&#x\" [a-fA-F0-9]+ \";\"
+borkedRef       <-- \"&\"  &[ \\t]
+
+# ---- Misc ----
+
+cdata       <-- \"<![CDATA[\" (!\"]]>\" .)+ \"]]>\"
+comment     <-- \"<!--\" (!\"--\" .)+ \"-->\"
+blurb       <-- [ \\t\\n\\f\\r]+ / comment")
+
+(define html-example "
+<!DOCTYPE html>
+<html>
+<head>
+    <title>Example Domain</title>
+    <meta charset=\"utf-8\" />
+    <meta http-equiv=\"Content-type\" content=\"text/html; charset=utf-8\" />
+    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\" />
+    <style type=\"text/css\">
+    body {
+        background-color: #f0f0f2;
+        margin: 0;
+        padding: 0;
+    }
+    </style>
+</head>
+
+<body>
+<div>
+    <h1>Example Domain</h1>
+    <p>This domain is for use in illustrative examples in documents. You may
+    use this domain in literature without prior coordination or asking for
+    permission.</p> <p><a href=\"https://www.iana.org/domains/example\">More
+    information...</a></p>
+</div>
+</body>
+</html>
+")
+
+(with-test-prefix "Parsing with complex grammars"
+  (eeval `(define-peg-string-patterns ,html-grammar))
+  (pass-if
+    "HTML parsing"
+    (equal?
+      (peg:tree (match-pattern html html-example))
+      '(html (blurb "\n") (doctype "<!DOCTYPE html>") (blurb "\n") (elem (normalElem "<" (elemName "html") attributes ">" (normalContent "\n" (elem (normalElem "<" (elemName "head") attributes ">" (normalContent "\n    " (elem (escRawTextElem "<" (escRawTextElemName "title") attributes ">" (escRawTextContent "Example Domain") (endElem "</" (elemName "title") ">"))) "\n    " (elem (selfClosingElem "<" (elemName "meta") (attributes " " (doubleQuoteAttr (attrName "charset") "=\"utf-8\"") " ") "/>")) "\n    " (elem (selfClosingElem "<" (elemName "meta") (attributes " " (doubleQuoteAttr (attrName "http-equiv") "=\"Content-type\"") " " (doubleQuoteAttr (attrName "content") "=\"text/html; charset=utf-8\"") " ") "/>")) "\n    " (elem (selfClosingElem "<" (elemName "meta") (attributes " " (doubleQuoteAttr (attrName "name") "=\"viewport\"") " " (doubleQuoteAttr (attrName "content") "=\"width=device-width, initial-scale=1\"") " ") "/>")) "\n    " (elem (rawTextElem "<" (rawTextElemName "style") (attributes " " (doubleQuoteAttr (attrName "type") "=\"text/css\"")) ">" (rawTextContent "\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n    }\n    ") (endElem "</" (elemName "style") ">"))) "\n") (endElem "</" (elemName "head") ">"))) "\n\n" (elem (normalElem "<" (elemName "body") attributes ">" (normalContent "\n" (elem (normalElem "<" (elemName "div") attributes ">" (normalContent "\n    " (elem (normalElem "<" (elemName "h1") attributes ">" (normalContent "Example Domain") (endElem "</" (elemName "h1") ">"))) "\n    " (elem (normalElem "<" (elemName "p") attributes ">" (normalContent "This domain is for use in illustrative examples in documents. You may\n    use this domain in literature without prior coordination or asking for\n    permission.") (endElem "</" (elemName "p") ">"))) " " (elem (normalElem "<" (elemName "p") attributes ">" (normalContent (elem (normalElem "<" (elemName "a") (attributes " " (doubleQuoteAttr (attrName "href") "=\"https://www.iana.org/domains/example\"")) ">" (normalContent "More\n    information...") (endElem "</" (elemName "a") ">")))) (endElem "</" (elemName "p") ">"))) "\n") (endElem "</" (elemName "div") ">"))) "\n") (endElem "</" (elemName "body") ">"))) "\n") (endElem "</" (elemName "html") ">"))) (blurb "\n")))))
-- 
2.46.0






^ permalink raw reply related	[flat|nested] 19+ messages in thread

* bug#73188: PEG parser does not support full PEG grammar
  2024-12-22 17:45 ` bug#73188: PEG parser does not support full PEG grammar Ekaitz Zarraga
@ 2024-12-22 20:09   ` Ekaitz Zarraga
  0 siblings, 0 replies; 19+ messages in thread
From: Ekaitz Zarraga @ 2024-12-22 20:09 UTC (permalink / raw)
  To: 73188; +Cc: Ludovic Courtès

The three previous messages (SORRY) add the changes that were missing in 
when v4 was applied instead of v5.





^ permalink raw reply	[flat|nested] 19+ messages in thread

* bug#73188: PEG parser does not support full PEG grammar
  2024-12-22 20:01 ` bug#73188: [PATCH 1/3] PEG: fix [^...] Ekaitz Zarraga
  2024-12-22 20:01   ` bug#73188: [PATCH 2/3] PEG: string-peg: better support for escaping Ekaitz Zarraga
  2024-12-22 20:01   ` bug#73188: [PATCH 3/3] PEG: add large string-peg patch Ekaitz Zarraga
@ 2024-12-22 21:22   ` Ludovic Courtès
  2024-12-23 22:04     ` bokr
  2 siblings, 1 reply; 19+ messages in thread
From: Ludovic Courtès @ 2024-12-22 21:22 UTC (permalink / raw)
  To: Ekaitz Zarraga; +Cc: 73188

Hi Ekaitz,

Apologies for applying the wrong version of the patch series!

I pushed the 3 patches you just sent:

  6750f6cc8 * PEG: string-peg: Add HTML5 grammar test.
  38ad26497 * PEG: string-peg: Better support for escaping.
  c86a48a92 * PEG: string-peg: Fix [^...] interpretation.

I added commit logs that follow the project’s conventions (same as
Guix).

In the future, when a patch fixes a bug, please include a test case that
reproduces the bug being fixed; possibly add information in the commit
log about the commit that introduced the bug/regression.  This is useful
to get a good understanding of the situation.  (I understand in this
case the problem was mostly me applying an earlier version.)

And bonus points if you provide commit logs.  :-)

Thank you!

Ludo’.





^ permalink raw reply	[flat|nested] 19+ messages in thread

* bug#73188: PEG parser does not support full PEG grammar
  2024-12-22 21:22   ` bug#73188: PEG parser does not support full PEG grammar Ludovic Courtès
@ 2024-12-23 22:04     ` bokr
  2024-12-23 22:13       ` Ekaitz Zarraga
  0 siblings, 1 reply; 19+ messages in thread
From: bokr @ 2024-12-23 22:04 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 73188, Ekaitz Zarraga

<tl;dr>
    • Late kudos to Ekaitz for this great PEG contribution, especially RISCV,
      which will IMO play a key part in the future of human secure control over machines :)
    • Is there a trustable git repo I could clone to follow future developments of this work?
    • IWBN to have an example doc for how to define a DSL and its interpreter a la brainf*ck,
      but compiled suitable for running in qemu bullet-proof container. Scroll to end of this email
      for what I was surprised worked, though it may have compromised my system, I'm not sure: CAVEAT! :)
</tl;dr>

On +2024-12-22 22:22:09 +0100, Ludovic Courtès wrote:
> Hi Ekaitz,
> 
> Apologies for applying the wrong version of the patch series!
> 
> I pushed the 3 patches you just sent:
> 
>   6750f6cc8 * PEG: string-peg: Add HTML5 grammar test.
>   38ad26497 * PEG: string-peg: Better support for escaping.
>   c86a48a92 * PEG: string-peg: Fix [^...] interpretation.
> 
> I added commit logs that follow the project’s conventions (same as
> Guix).
> 
> In the future, when a patch fixes a bug, please include a test case that
> reproduces the bug being fixed; possibly add information in the commit
> log about the commit that introduced the bug/regression.  This is useful
> to get a good understanding of the situation.  (I understand in this
> case the problem was mostly me applying an earlier version.)
> 
> And bonus points if you provide commit logs.  :-)
> 
> Thank you!
> 
> Ludo’.
> 
> 
> 

-=-=================================-=-

Here is what worked way back when ludo published
    <https://lists.gnu.org/archive/html/info-gnu/2022-12/msg00007.html>
(a release announcement email that IMO is a paragon of release announcements)

I'm not suggesting running this as I did, since I am not sure about security bugs,
then and since, but I did it back when Ludo realeased it. CAVEAT! YOU HAVE BEEN WARNED!
(Tips on published vulns appreciated)

But runvm (below) seemed to work amazingly under (debian-based) pureos wayland.

$ cd ~/wb/gxqemu/
$ file *
guix-system-vm-image-1.4.0.x86_64-linux.qcow2: QEMU QCOW2 Image (v3), 32255246336 bytes
runvm:                                         Bourne-Again shell script, ASCII text executable

$ ls -ltrad *
-rwxr-xr-x 1 bokr bokr        211 Feb  7  2024 runvm
-rw-r--r-- 1 bokr bokr 8862695424 Oct 14 02:17 guix-system-vm-image-1.4.0.x86_64-linux.qcow2

$ du -h guix-system-vm-image-1.4.0.x86_64-linux.qcow2 
8.3G	guix-system-vm-image-1.4.0.x86_64-linux.qcow2

$ cat -nA runvm 
     1	#!/usr/bin/bash$
     2	$
     3	qemu-system-x86_64 \$
     4	   -nic user,model=virtio-net-pci \$
     5	   -enable-kvm -m 2048 \$
     6	   -device virtio-blk,drive=myhd \$
     7	   -drive if=none,file=guix-system-vm-image-1.4.0.x86_64-linux.qcow2,id=myhd$
$

$ # I'm not doing the following, since I am not sure about security bugs, but I did it
$ # when Ludo realeased it (with a realease announcement email that IMO is a paragon of
$ # relase announcements):
    <https://lists.gnu.org/archive/html/info-gnu/2022-12/msg00007.html>
$ # 

$ # ./runvm &
$ # in a system with Wayland as the display compositor, this runs the image in the foreground AND
$ # continues running the term CLI '(bash)' in the background, amazingly compositing both anything
$ # running via X-wayland headless Xorg, like maybe firefox-esr, and the qcow image, so you can
$ # mouse around and switch between the two.
$ 
$ # NOTICE: Not responsible for consequences of trying this: YOU HAVE BEEN WARNED :)
$

$ uname -a
Linux BRL14v1 5.10.0-33-amd64 #1 SMP Debian 5.10.226-1 (2024-10-03) x86_64 GNU/Linux

$ pwd
/home/bokr/wb/gxqemu

$ cd ~/wb/guix/guix
$ git log|head
commit e92b20a41a026b8af7dd2031eb61267b061617b5
Author: Tomas Volf <~@wolfsden.cz>
Date:   Fri Dec 13 17:27:46 2024 +0100

    services: mingetty: Support waiting on shepherd services.
    
    For auto-login on systems with elogind, dbus-system needs to be started.  This
    commit adds ability to express that ordering.
    
    * gnu/services/base.scm (<mingetty-configuration>): Add shepherd-requirement
$ 
-=-=================================-=-

I am interested in using Ekaitz's Peg work and predecessor guile work of Andy Wingo and Ludo
to produce secure minimal-code guile extensions displaying popups showing e.g. progress
graphics based on direct wayland event protocols, to be able to show status of "hung"
looping or deadlocked threads, subject ot user privilege authentication.

IIRC the display ran at 60hz before starting the cow2 image and dropped to 30hz when showing
both display outputs at the same time. If I get time, I'll try to take a video with my phone
to show it, but don't hold your breath ;-)

It would be interesting to try a minimal wayland for mes too :)

Obviously these are wip-thoughts ;-)

Thanks for reading :)
Happy Holidays
--
With kind regards,
Bengt Richter





^ permalink raw reply	[flat|nested] 19+ messages in thread

* bug#73188: PEG parser does not support full PEG grammar
  2024-12-23 22:04     ` bokr
@ 2024-12-23 22:13       ` Ekaitz Zarraga
  0 siblings, 0 replies; 19+ messages in thread
From: Ekaitz Zarraga @ 2024-12-23 22:13 UTC (permalink / raw)
  To: bokr; +Cc: 73188

Hi,


On 2024-12-23 23:04, bokr@bokr.com wrote:
>      • Late kudos to Ekaitz for this great PEG contribution, especially RISCV,
>        which will IMO play a key part in the future of human secure control over machines 🙂

I think I didn't understand your message very well, but just as a 
clarification: the PEG work and RISC-V work are completely independent, 
one doesn't have anything to do with the other, and the RISC-V work is 
not merged in Guile yet.

The RISC-V work for the Guile JIT (guile already works in riscv without 
it) is in here:
https://gitlab.com/wingo/lightening/-/merge_requests/14

Hope this clarifies things.





^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2024-12-23 22:13 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-11 22:03 bug#73188: PEG parser does not support full PEG grammar Ekaitz Zarraga
2024-09-12 20:57 ` bug#73188: [PATCH v2] PEG: Add full support for PEG + some extensions Ekaitz Zarraga
2024-10-13 20:29   ` bug#73188: PEG parser does not support full PEG grammar Ludovic Courtès
2024-10-13 20:59     ` Ekaitz Zarraga
2024-10-14 11:56       ` Ludovic Courtès
2024-10-14 14:00         ` Ekaitz Zarraga
2024-10-20 10:10           ` Ludovic Courtès
2024-10-20 20:18             ` Ekaitz Zarraga
2024-12-09 17:23               ` Ludovic Courtès
2024-10-11 12:31 ` bug#73188: [PATCH] PEG: Add support for `not-in-range` and [^...] Ekaitz Zarraga
2024-10-30 19:04 ` bug#73188: PEG: Fix bugs and add complex PEG for testing Ekaitz Zarraga
2024-12-22 17:45 ` bug#73188: PEG parser does not support full PEG grammar Ekaitz Zarraga
2024-12-22 20:09   ` Ekaitz Zarraga
2024-12-22 20:01 ` bug#73188: [PATCH 1/3] PEG: fix [^...] Ekaitz Zarraga
2024-12-22 20:01   ` bug#73188: [PATCH 2/3] PEG: string-peg: better support for escaping Ekaitz Zarraga
2024-12-22 20:01   ` bug#73188: [PATCH 3/3] PEG: add large string-peg patch Ekaitz Zarraga
2024-12-22 21:22   ` bug#73188: PEG parser does not support full PEG grammar Ludovic Courtès
2024-12-23 22:04     ` bokr
2024-12-23 22:13       ` Ekaitz Zarraga

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).