unofficial mirror of guile-devel@gnu.org 
 help / color / mirror / Atom feed
From: Noah Lavine <noah.b.lavine@gmail.com>
To: Andy Wingo <wingo@pobox.com>
Cc: guile-devel <guile-devel@gnu.org>
Subject: Re: PEG Patches
Date: Thu, 31 Mar 2011 17:48:26 -0400	[thread overview]
Message-ID: <AANLkTimd+BRh4XNm6jp1-rhAXTuMcWpqG0mUxC=UWfNW@mail.gmail.com> (raw)
In-Reply-To: <m31v1q83n7.fsf@unquote.localdomain>

[-- Attachment #1: Type: text/plain, Size: 1426 bytes --]

Hello again,

I was about to do this, and then I discovered that it wouldn't work,
because there are a few special case PEGs that don't make sense as
macros. Specifically, in the context of a PEG, we interpret strings as
matching themselves, and those can't be made into macros.

So I went ahead and implemented a simple way to extend
peg-sexp-compile. It turned out to be much less difficult than I was
afraid of.

The first attached patch adds the interface to (ice-9 peg codegen) and
changes most of the functions there to use it, and also adds some
documentation in the PEG Internals section. The second one updates
(ice-9 peg string-peg) to use it as well, and gets rid of
peg-extended-compile from peg.scm since it's no longer needed.

I wrote the patches on top of the last two that I sent, because those
included some cleanups that I wanted to keep.

Noah

On Tue, Mar 29, 2011 at 9:20 AM, Andy Wingo <wingo@pobox.com> wrote:
> On Tue 29 Mar 2011 14:47, Noah Lavine <noah.b.lavine@gmail.com> writes:
>
>>> (define-peg-matcher and cg-and)
>>
>> That's doable. But if we're going to choose what to do entirely based
>> on the first element of the list, then we could also just not define
>> peg-sexp-compile at all and make each of the code generation functions
>> into macros.
>>
>> How does that sound?
>
> Good idea.  Sounds great to me!
>
> Andy
> --
> http://wingolog.org/
>

[-- Attachment #2: 0001-Extensible-PEG-Syntax.patch --]
[-- Type: application/octet-stream, Size: 11741 bytes --]

From 0f28602bf3fde35e7bfc6fa38b7608b97af3c017 Mon Sep 17 00:00:00 2001
From: Noah Lavine <nlavine@haverford.edu>
Date: Thu, 31 Mar 2011 17:04:06 -0400
Subject: [PATCH 1/2] Extensible PEG Syntax

* module/ice-9/peg/codegen.scm: Make the PEG syntax extensible, and
    move most of the current code generators to the new interface
* doc/ref/api-peg.texi: Document PEG extensions in the PEG Internals
    section of the manual
---
 doc/ref/api-peg.texi         |   32 ++++++++
 module/ice-9/peg/codegen.scm |  176 ++++++++++++++++++++++++------------------
 2 files changed, 133 insertions(+), 75 deletions(-)

diff --git a/doc/ref/api-peg.texi b/doc/ref/api-peg.texi
index 0c83365..6d0a346 100644
--- a/doc/ref/api-peg.texi
+++ b/doc/ref/api-peg.texi
@@ -992,3 +992,35 @@ interface.
 
 The above function can be used to match a string by running
 @code{(peg-parse match-a-b "ab")}.
+
+@subsubheading Code Generators and Extensible Syntax
+
+PEG expressions, such as those in a @code{define-nonterm} form, are
+interpreted internally in two steps.
+
+First, any string PEG is expanded into an s-expression PEG by the code
+in the @code{(ice-9 peg string-peg)} module.
+
+Then, then s-expression PEG that results is compiled into a parsing
+function by the @code{(ice-9 peg codegen)} module. In particular, the
+function @code{peg-sexp-compile} is called on the s-expression. It then
+decides what to do based on the form it is passed.
+
+The PEG syntax can be expanded by providing @code{peg-sexp-compile} more
+options for what to do with its forms. The extended syntax will be
+associated with a symbol, for instance @code{my-parsing-form}, and will
+be called on all PEG expressions of the form
+@lisp
+(my-parsing-form ...)
+@end lisp
+
+The parsing function should take two arguments. The first will be a
+syntax object containing a list with all of the arguments to the form
+(but not the form's name), and the second will be the
+@code{capture-type} argument that is passed to @code{define-nonterm}.
+
+New functions can be registered by calling @code{(add-peg-compiler!
+symbol function)}, where @code{symbol} is the symbol that will indicate
+a form of this type and @code{function} is the code generating function
+described above. The function @code{add-peg-compiler!} is exported from
+the @code{(ice-9 peg codegen)} module.
diff --git a/module/ice-9/peg/codegen.scm b/module/ice-9/peg/codegen.scm
index 8dd507c..597ead9 100644
--- a/module/ice-9/peg/codegen.scm
+++ b/module/ice-9/peg/codegen.scm
@@ -18,9 +18,7 @@
 ;;;;
 
 (define-module (ice-9 peg codegen)
-  #:export (peg-sexp-compile wrap-parser-for-users)
-  #:use-module (ice-9 peg)
-  #:use-module (ice-9 peg string-peg)
+  #:export (peg-sexp-compile wrap-parser-for-users add-peg-compiler!)
   #:use-module (ice-9 pretty-print)
   #:use-module (system base pmatch))
 
@@ -123,18 +121,35 @@ return EXP."
 
 ;; Generates code for matching a range of characters between start and end.
 ;; E.g.: (cg-range syntax #\a #\z 'body)
-(define (cg-range start end accum)
-  #`(lambda (str len pos)
-      (and (< pos len)
-           (let ((c (string-ref str pos)))
-             (and (char>=? c #,start)
-                  (char<=? c #,end)
-                  #,(case accum
-                      ((all) #`(list (1+ pos) (list 'cg-range (string c))))
-                      ((name) #`(list (1+ pos) 'cg-range))
-                      ((body) #`(list (1+ pos) (string c)))
-                      ((none) #`(list (1+ pos) '()))
-                      (else (error "bad accum" accum))))))))
+(define (cg-range pat accum)
+  (syntax-case pat ()
+    ((start end)
+     (if (not (and (char? (syntax->datum #'start))
+                   (char? (syntax->datum #'end))))
+         (error "range PEG should have characters after it; instead got"
+                #'start #'end))
+     #`(lambda (str len pos)
+         (and (< pos len)
+              (let ((c (string-ref str pos)))
+                (and (char>=? c start)
+                     (char<=? c end)
+                     #,(case accum
+                         ((all) #`(list (1+ pos) (list 'cg-range (string c))))
+                         ((name) #`(list (1+ pos) 'cg-range))
+                         ((body) #`(list (1+ pos) (string c)))
+                         ((none) #`(list (1+ pos) '()))
+                         (else (error "bad accum" accum))))))))))
+
+;; Generate code to match a pattern and do nothing with the result
+(define (cg-ignore pat accum)
+  (syntax-case pat ()
+    ((inner)
+     (peg-sexp-compile #'inner 'none))))
+
+(define (cg-capture pat accum)
+  (syntax-case pat ()
+    ((inner)
+     (peg-sexp-compile #'inner 'body))))
 
 ;; Filters the accum argument to peg-sexp-compile for buildings like string
 ;; literals (since we don't want to tag them with their name if we're doing an
@@ -147,35 +162,11 @@ return EXP."
    ((eq? accum 'none) 'none)))
 (define baf builtin-accum-filter)
 
-;; Takes an arbitrary expressions and accumulation variable, then parses it.
-;; E.g.: (peg-sexp-compile syntax '(and "abc" (or "-" (range #\a #\z))) 'all)
-(define (peg-sexp-compile pat accum)
-  (syntax-case pat (peg-any range ignore capture peg and or body)
-    (peg-any
-     (cg-peg-any (baf accum)))
-    (sym (identifier? #'sym) ;; nonterminal
-     #'sym)
-    (str (string? (syntax->datum #'str)) ;; literal string
-     (cg-string (syntax->datum #'str) (baf accum)))
-    ((range start end) ;; range of characters (e.g. [a-z])
-     (and (char? (syntax->datum #'start)) (char? (syntax->datum #'end)))
-     (cg-range (syntax->datum #'start) (syntax->datum #'end) (baf accum)))
-    ((ignore pat) ;; match but don't parse
-     (peg-sexp-compile #'pat 'none))
-    ((capture pat) ;; parse
-     (peg-sexp-compile #'pat 'body))
-    ((and pat ...)
-     (cg-and #'(pat ...) (baf accum)))
-    ((or pat ...)
-     (cg-or #'(pat ...) (baf accum)))
-    ((body type pat num)
-     (cg-body (baf accum) #'type #'pat #'num))))
-
 ;; Top-level function builder for AND.  Reduces to a call to CG-AND-INT.
 (define (cg-and clauses accum)
   #`(lambda (str len pos)
       (let ((body '()))
-        #,(cg-and-int clauses accum #'str #'len #'pos #'body))))
+        #,(cg-and-int clauses (baf accum) #'str #'len #'pos #'body))))
 
 ;; Internal function builder for AND (calls itself).
 (define (cg-and-int clauses accum str strlen at body)
@@ -195,7 +186,7 @@ return EXP."
 ;; Top-level function builder for OR.  Reduces to a call to CG-OR-INT.
 (define (cg-or clauses accum)
   #`(lambda (str len pos)
-      #,(cg-or-int clauses accum #'str #'len #'pos)))
+      #,(cg-or-int clauses (baf accum) #'str #'len #'pos)))
 
 ;; Internal function builder for OR (calls itself).
 (define (cg-or-int clauses accum str strlen at)
@@ -207,40 +198,75 @@ return EXP."
            #,(cg-or-int #'(rest ...) accum str strlen at)))))
 
 ;; Returns a function that parses a BODY element.
-(define (cg-body accum type pat num)
-  #`(lambda (str strlen at)
-      (let ((body '()))
-        (let lp ((end at) (count 0))
-          (let* ((match (#,(peg-sexp-compile pat accum) str strlen end))
-                 (new-end (if match (car match) end))
-                 (count (if (> new-end end) (1+ count) count)))
-            (if (> new-end end)
-                (push-not-null! body (single-filter (cadr match))))
-            (if (and (> new-end end)
-                     #,(syntax-case num (+ * ?)
-                         (n (number? (syntax->datum #'n))
-                            #'(< count n))
-                         (+ #t)
-                         (* #t)
-                         (? #'(< count 1))))
-                (lp new-end count)
-                (let ((success #,(syntax-case num (+ * ?)
-                                   (n (number? (syntax->datum #'n))
-                                      #'(= count n))
-                                   (+ #'(>= count 1))
-                                   (* #t)
-                                   (? #t))))
-                  #,(syntax-case type (! & lit)
-                      (!
-                       #`(if success
-                             #f
-                             #,(cggr accum 'cg-body #''() #'at)))
-                      (&
-                       #`(and success
-                              #,(cggr accum 'cg-body #''() #'at)))
-                      (lit
-                       #`(and success
-                              #,(cggr accum 'cg-body #'(reverse body) #'new-end)))))))))))
+(define (cg-body args accum)
+  (syntax-case args ()
+    ((type pat num)
+     #`(lambda (str strlen at)
+         (let ((body '()))
+           (let lp ((end at) (count 0))
+             (let* ((match (#,(peg-sexp-compile #'pat (baf accum))
+                            str strlen end))
+                    (new-end (if match (car match) end))
+                    (count (if (> new-end end) (1+ count) count)))
+               (if (> new-end end)
+                   (push-not-null! body (single-filter (cadr match))))
+               (if (and (> new-end end)
+                        #,(syntax-case #'num (+ * ?)
+                            (n (number? (syntax->datum #'n))
+                               #'(< count n))
+                            (+ #t)
+                            (* #t)
+                            (? #'(< count 1))))
+                   (lp new-end count)
+                   (let ((success #,(syntax-case #'num (+ * ?)
+                                      (n (number? (syntax->datum #'n))
+                                         #'(= count n))
+                                      (+ #'(>= count 1))
+                                      (* #t)
+                                      (? #t))))
+                     #,(syntax-case #'type (! & lit)
+                         (!
+                          #`(if success
+                                #f
+                                #,(cggr (baf accum) 'cg-body #''() #'at)))
+                         (&
+                          #`(and success
+                                 #,(cggr (baf accum) 'cg-body #''() #'at)))
+                         (lit
+                          #`(and success
+                                 #,(cggr (baf accum) 'cg-body
+                                         #'(reverse body) #'new-end)))))))))))))
+
+;; Association list of functions to handle different expressions as PEGs
+(define peg-compiler-alist '())
+
+(define (add-peg-compiler! symbol function)
+  (set! peg-compiler-alist
+        (assq-set! peg-compiler-alist symbol function)))
+
+(add-peg-compiler! 'range cg-range)
+(add-peg-compiler! 'ignore cg-ignore)
+(add-peg-compiler! 'capture cg-capture)
+(add-peg-compiler! 'and cg-and)
+(add-peg-compiler! 'or cg-or)
+(add-peg-compiler! 'body cg-body)
+
+;; Takes an arbitrary expressions and accumulation variable, then parses it.
+;; E.g.: (peg-sexp-compile syntax '(and "abc" (or "-" (range #\a #\z))) 'all)
+(define (peg-sexp-compile pat accum)
+  (syntax-case pat (peg-any range ignore capture peg and or body)
+    (peg-any
+     (cg-peg-any (baf accum)))
+    (sym (identifier? #'sym) ;; nonterminal
+     #'sym)
+    (str (string? (syntax->datum #'str)) ;; literal string
+     (cg-string (syntax->datum #'str) (baf accum)))
+    ((name . args) (let* ((nm (syntax->datum #'name))
+                          (entry (assq-ref peg-compiler-alist nm)))
+                     (if entry
+                         (entry #'args accum)
+                         (error "Bad peg form" nm #'args
+                                "Not one of" (map car peg-compiler-alist)))))))
 
 ;; Packages the results of a parser
 (define (wrap-parser-for-users for-syntax parser accumsym s-syn)
-- 
1.7.4.1


[-- Attachment #3: 0002-Update-String-PEGs.patch --]
[-- Type: application/octet-stream, Size: 4595 bytes --]

From a34f30694462ed7965cb885781dcfe6c45b04646 Mon Sep 17 00:00:00 2001
From: Noah Lavine <nlavine@haverford.edu>
Date: Thu, 31 Mar 2011 17:42:36 -0400
Subject: [PATCH 2/2] Update String PEGs

* module/ice-9/peg/string-peg.scm: use new interface for extending PEG
   syntax
* module/ice-9/peg.scm: remove peg-extended-compile
---
 module/ice-9/peg.scm            |   20 +++++++-------------
 module/ice-9/peg/string-peg.scm |   25 +++++++++++++++----------
 2 files changed, 22 insertions(+), 23 deletions(-)

diff --git a/module/ice-9/peg.scm b/module/ice-9/peg.scm
index 58e35ce..730e048 100644
--- a/module/ice-9/peg.scm
+++ b/module/ice-9/peg.scm
@@ -22,7 +22,6 @@
             define-nonterm
 ;            define-nonterm-f
             peg-match)
-;  #:export-syntax (define-nonterm)
   #:use-module (ice-9 peg codegen)
   #:use-module (ice-9 peg string-peg)
   #:use-module (ice-9 peg simplify-tree)
@@ -30,7 +29,6 @@
   #:re-export (peg-sexp-compile
                define-grammar
                define-grammar-f
-;               define-nonterm
                keyword-flatten
                context-flatten
                peg:start
@@ -67,13 +65,6 @@ execute the STMTs and try again."
         #f
         (make-prec 0 (car res) string (string-collapse (cadr res))))))
 
-(define (peg-extended-compile pattern accum)
-  (syntax-case pattern (peg)
-    ((peg str)
-     (string? (syntax->datum #'str))
-     (peg-string-compile #'str (if (eq? accum 'all) 'body accum)))
-    (else (peg-sexp-compile pattern accum))))
-
 ;; The results of parsing using a nonterminal are cached.  Think of it like a
 ;; hash with no conflict resolution.  Process for deciding on the cache size
 ;; wasn't very scientific; just ran the benchmarks and stopped a little after
@@ -85,7 +76,7 @@ execute the STMTs and try again."
   (lambda (x)
     (syntax-case x ()
       ((_ sym accum pat)
-       (let ((matchf (peg-extended-compile #'pat (syntax->datum #'accum)))
+       (let ((matchf (peg-sexp-compile #'pat (syntax->datum #'accum)))
              (accumsym (syntax->datum #'accum))
              (c (datum->syntax x (gensym))));; the cache
          ;; CODE is the code to parse the string if the result isn't cached.
@@ -103,6 +94,11 @@ execute the STMTs and try again."
                                       (list str at fres))
                          fres)))))))))))
 
+(define (peg-like->peg pat)
+  (syntax-case pat ()
+    (str (string? (syntax->datum #'str)) #'(peg str))
+    (else pat)))
+
 ;; Searches through STRING for something that parses to PEG-MATCHER.  Think
 ;; regexp search.
 (define-syntax peg-match
@@ -110,9 +106,7 @@ execute the STMTs and try again."
     (syntax-case x ()
       ((_ pattern string-uncopied)
        (let ((pmsym (syntax->datum #'pattern)))
-         (let ((matcher (if (string? (syntax->datum #'pattern))
-                            (peg-string-compile #'pattern 'body)
-                            (peg-sexp-compile #'pattern 'body))))
+         (let ((matcher (peg-sexp-compile (peg-like->peg #'pattern) 'body)))
            ;; We copy the string before using it because it might have been
            ;; modified in-place since the last time it was parsed, which would
            ;; invalidate the cache.  Guile uses copy-on-write for strings, so
diff --git a/module/ice-9/peg/string-peg.scm b/module/ice-9/peg/string-peg.scm
index a899727..181ec05 100644
--- a/module/ice-9/peg/string-peg.scm
+++ b/module/ice-9/peg/string-peg.scm
@@ -18,8 +18,7 @@
 ;;;;
 
 (define-module (ice-9 peg string-peg)
-  #:export (peg-string-compile
-            peg-as-peg
+  #:export (peg-as-peg
             define-grammar
             define-grammar-f
             peg-grammar)
@@ -248,11 +247,17 @@ RB < ']'
                  (compressor-core (syntax->datum syn))))
 
 ;; Builds a lambda-expressions for the pattern STR using accum.
-(define (peg-string-compile str-stx accum)
-  (let ((string (syntax->datum str-stx)))
-    (peg-sexp-compile
-     (compressor
-      (peg-pattern->defn
-       (peg:tree (peg-parse peg-pattern string)) str-stx)
-      str-stx)
-     accum)))
+(define (peg-string-compile args accum)
+  (syntax-case args ()
+    ((str-stx) (string? (syntax->datum #'str-stx))
+     (let ((string (syntax->datum #'str-stx)))
+       (peg-sexp-compile
+        (compressor
+         (peg-pattern->defn
+          (peg:tree (peg-parse peg-pattern string)) #'str-stx)
+         #'str-stx)
+        (if (eq? accum 'all) 'body accum))))
+     (else (error "Bad embedded PEG string" args))))
+
+(add-peg-compiler! 'peg peg-string-compile)
+
-- 
1.7.4.1


      reply	other threads:[~2011-03-31 21:48 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-03-06  5:25 PEG Patches Noah Lavine
2011-03-07  1:28 ` Noah Lavine
2011-03-25 18:06 ` Andy Wingo
2011-03-28 20:44   ` Noah Lavine
2011-03-28 20:46     ` Noah Lavine
2011-03-28 22:17     ` Michael Lucy
2011-03-29  8:00     ` Andy Wingo
2011-03-29 12:47       ` Noah Lavine
2011-03-29 13:20         ` Andy Wingo
2011-03-31 21:48           ` Noah Lavine [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/guile/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='AANLkTimd+BRh4XNm6jp1-rhAXTuMcWpqG0mUxC=UWfNW@mail.gmail.com' \
    --to=noah.b.lavine@gmail.com \
    --cc=guile-devel@gnu.org \
    --cc=wingo@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).