unofficial mirror of help-gnu-emacs@gnu.org
 help / color / mirror / Atom feed
* early preview of a recursive descent parser compiler implemented as a macro
@ 2008-01-08 10:34 Mike Mattie
  2008-01-11 23:07 ` Mike Mattie
  0 siblings, 1 reply; 3+ messages in thread
From: Mike Mattie @ 2008-01-08 10:34 UTC (permalink / raw)
  To: emacs users ml


[-- Attachment #1.1.1: Type: text/plain, Size: 20599 bytes --]

Hello,

I recently found myself needing a top down parser, so I wrote up a parser compiler
as a macro. It is turning out quite nicely for my purposes so I would like to submit it 
for comments and suggestions. I am still working out a couple of kinks, but it is at the
point where I can start integrating feedback.

The goal is a lightweight parser for simple jobs such as scraping a data structure
out of essentially static buffers, which is the problem I was solving when I developed it.

I am in-lining the actual parser compiler, and attaching a testing file so people can play
with it if they find it interesting.

Status:

It's my first medium sized macro so it may have some thinko's or oddities. criticism
welcome.

Simple tests work. The backtracking hasn't been extensively tested yet. The and form
doesn't seem to return the full AST it should. Other than that all the basic bits
are in place.

The catch for semantic-error throws is not in place yet. I need something like
(condition-case) however that construct seemed a little over-powered for my
purpose. Suggestions on handling multiple error conditions in a single form
would be appreciated.

TODO:

1. add ? for optional productions. Not a big deal to implement once the basic
   implementation is sound.

2. Full Documentation

Review Questions:

Do the Emacs developers consider this tool useful ? Does it fill a niche
in Emacs ? If so I will document it.

Is it elegant ?

Is the terminology correct ? I am not entirely sure I used curry in the strict
sense.

Are there any essential features missing ? keeping in mind the goal is simplicity,
not a fully optimized parser.

Is it interesting in a creative sense, or have I just ground out another
lisp right of passage ?

Examples:

A couple of contrived examples I have used for testing. A couple of things to note.
Keywords are token|and|or. token defines a regex match, while or and and construct
terminals. The definition of a token or terminal creates a match object at the
point of definition.

There is a built-in start symbol that is a or-terminal, so the first example
will match to the start symbol either token. 

(parser-compile test-parser
  (token whitespace "[[:blank:]]+")
  (token word "[[:alpha:]]+"))

The parser is compiled to the function value of test-parser. You can call
it like this: (test-parser (point))

It parses a single start production. Looping is done outside of the parser
itself. 

as the second example shows you can reference previously created tokens.

(parser-compile test-parser
  (and indented (token whitespace "[[:blank:]]+") (token word "[[:alpha:]]+"))
  (and inverted word whitespace))

The attached file test-parser.el has some test drivers for playing with the parser
interactively. Some of the features for defining tokens such as defining your own
actions for tokens are shown there.

P.S, here is a fun little lisp comic in the spirit of finding amusement in your work:
     http://xkcd.com/297/

Without further ado here is parser.el , and enjoy !

;;----------------------------------------------------------------------
;; parser.el
;; Primary Author: Mike Mattie (codermattie@gmail.com)
;; Copyright: Mike Mattie (2008)
;;----------------------------------------------------------------------

;; ->Summary

;; parser.el aims to be a lightweight implementation of parsing.

;; My requirements involve building/extracting a data structure from
;; analysis of a given text. For this purpose a "top-down" parse
;; makes it easier to assemble a data structure. [TODO: why is it easier
;; this way ?]

;; ->Related Works

;; Much of the parsing facilities that I have seen so far in Emacs
;; have been various forms of regex tables.

;; There are many reasons why Emacs parsing tools would evolve away
;; from the usual parser generator tool-set.

;; 1. The requirement to work incrementally: parse errors are not
;;    errors, the user just has not finished typing :)

;; 2. Emacs analysis is often partial by requirement, usually only
;;    significant parts of the text need to be identified for
;;    features.

;; 3. Emacs can leverage the annotation facilities of
;;    text properties and overlays making analysis extremely
;;    modular and robust while editing.

;; CEDET.
;;
;; CEDET appears to be a parser generator implemented on-top of a
;; CLOS emulation. The fact that this sort of design is not "lispy"
;; does not negate it's value, but limits it IMHO. This tool-set
;; has also "grown around" the problem of analyzing source code as
;; a project which is bloat for simpler requirements.

;; ->Concept

;; A parser compiler that combines lexical (regex) analysis with LL
;; parsing compiled by recursive macro expansion.

;; The concept started out as an idea to build a parser as nested cond
;; forms. The matches would go in the predicate part of the clause
;; while the action part would go in the body.

;; This idea is expressed here with cond forms mutated into lambdas so
;; that previously defined matches can be referenced by productions.

;; -> Requirements

;; 1. Easy to use for trivial problems.

;; 2. Data structures are orthogonal to the parser and not hard-wired
;;    in. Leafs and nodes of the AST are constructed with beginning
;;    and ending positions of the text in the buffer.
;;
;;    This allows the user to choose between positions, markers, overlays
;;    according to the requirements.

;; -> Naming Conventions

;; parser-*          internal functions
;; parser-build-*    Construct the AST structures returned by the parser.
;; parser-interp-*   expand definitions of terminals and non-terminals into lisp
;; parser-compile-*  compile the expanded definitions into interchangeable match functions.

;; -> Key data structures

;; Matches

;; matches are a cons pair of ( parser . AST ).

;; Parser is used internally to indicate match success or failure. It
;; should be the distance matched from the current parser position.
;; It can, and often will be zero for a successful match. nil indicates
;; a failure to match.

;; AST is the data returned from the match ( production . ( start . end ) | list )
;; the car of the pair (production) is the production's symbol.

;;----------------------------------------------------------------------
;; diagnostics
;;----------------------------------------------------------------------

(defun parser-expr-diagnostic ( form ) ;; tested
  (format "type(%s) %s" (symbol-name (type-of form)) (pp (eval form))))

(defmacro parser-diagnostic ( form from expected ) ;; tested
  "syntax: (parser-diagnostic form from expected)

   Where form is the expr received, from is the component issuing the diagnostic,
   and expected is a message describing what the component expected"
  `(concat (format "[%s] expected: " ,from)  ,expected " not: " ,(parser-expr-diagnostic form)))

;;----------------------------------------------------------------------
;; AST constructors
;;----------------------------------------------------------------------

(defun parser-build-token ( identifier ) ;; tested
  "parser-make-token is a built-in constructor that records the analysis
   and the location of the text (id . (begin . end))"
  (cons identifier (cons (match-beginning 0) (match-end 0))))

;;----------------------------------------------------------------------
;; compiler internals
;;----------------------------------------------------------------------

;; make-match and get-match implement a table of productions and
;; tokens. Both production and tokens are interchangeable as far as
;; matching and referencing is concerned. This table gives the ability
;; to reference previously defined matches in other productions.

;; match-table is dynamically scoped by the macro, and does not appear
;; in the compiled form.

(defun parser-make-match ( symbol function ) ;; tested
  "parser-make-match takes ( symbol function ) and returns a symbol
   stored in the parser's match-table with the evaluated lambda
   bound"
  (lexical-let
    ((new-name (symbol-name symbol)))

    (if (intern-soft new-name match-table)
      (throw 'semantic-error (format "illegal redefinition of match %s" new-name)))

    (fset (intern new-name match-table) (eval function))
    (intern new-name match-table)
    ))

(defun parser-make-anon-func ( sexp ) ;; tested
  "bind an un-evaluated anonymous function to an un-interned symbol"
  (let
    ((anon-func (make-symbol "parser-lambda")))
    (fset anon-func (eval sexp))
    anon-func))

(defun parser-get-match ( symbol ) ;; tested
  "return the compiled match for symbol, or throw a semantic-error if it does not
   exist"
  (lexical-let
    ((existing-name (symbol-name symbol)))

    (unless (intern-soft existing-name match-table)
      (throw 'semantic-error (format "unkown match %s" existing-name)))

    (intern existing-name match-table)))

;;----------------------------------------------------------------------
;; compiler run-time
;;----------------------------------------------------------------------

;; these functions are defined independent of the compilation of a parser
;; for simplicity, and to avoid wasteful duplication in the macro expansion.

;; parser-position

;; The position of the parser in the input stream is maintained as a
;; stack of indexes into the buffer. As matching progresses the top of
;; the stack (car) is updated. The push and pop operations copy a
;; position to a next or previous stack position. backtrack discards
;; the top of the stack.

(defun parser-pos () ;; tested
  "Return the current position of the parser in the buffer"
  (car parser-position))

(defun parser-push ()
  "Copy the parser position to a new stack level so the parser can backtrack when necessary."
  (setq parser-position (cons (car parser-position) parser-position)))

(defun parser-pop ()
  "Copy the parser position to a previous stack level When the possibility of a backtrack
   has been eliminated by matching."
  (let
    ((current (parser-pos)))
    (setq parser-position (cons (car parser-position) (cddr parser-position)))
    ))

(defun parser-backtrack ()
  "Restore the previous parser position by discarding the top of the parser-position stack."
  (setq parser-position (cdr parser-position))
  (goto-char (parser-pos)))

(defun parser-advance ( distance )
  "Add distance to the parsing position on the current stack level.
   The advancing of the parser is done relative to the current
   position so that a successful match can advance 0 characters.
   This is important for optional matches such as the \"?\" meta-symbol.
   They want to indicate a successful match without moving the parser position."

  ;; we are going to get zero's as a unfortunate side-effect of
  ;; keeping the nesting of combination operators simple. Avoid NOP
  ;; work with a quick test.

  (if (> distance 0)
    (progn
      (setq parser-position (cons (+ distance (car parser-position)) (cdr parser-position)))
      (goto-char (parser-pos)))
    ))

(defun parser-next ()
  "Compute the next parser position from the length of the entire regex's current match, plus one."
  (+ 1 (- (match-end 0) (match-beginning 0))))

;; parser combination operators

;; the combination operators group matches by and,or operators as the
;; underlying mechanics of non-terminal productions. As these
;; functions are not compiled into the generated parser they have no
;; knowledge of the symbol associated with the match list, only the
;; match list itself.

(defun parser-or ( &rest match-list )
  "Combine match objects by or where the first successful match is returned.
   nil is returned if no matches are found"
  (catch 'match
    (dolist (match match-list)
      (lexical-let
        ((match-result (funcall match)))
        (if (car match-result)
          (progn
            (parser-advance (car match-result))
            (throw 'match (cons 0 (cdr match-result)))))
        ))
    (throw 'match (cons nil nil)) ;; this is what failure looks like :)
    ))

(defun parser-and ( &rest match-list )
  "combine the matches with and. all of the match objects must return non-nil
   in the parser part or the parser will backtrack and return nil."
  (parser-push)

  (lexical-let
    ((ast (catch 'backtrack
            ;; we want to gather all the matches, so mapcar across the match objects.
            (mapcar
              (lambda (match)
                (lexical-let
                  ((match-result (funcall match)))

                  (if (car match-result)
                    (progn
                      ;; on match advance the parser and return the AST
                      (parser-advance (car match-result))
                      (cdr match-result))

                    ;; on fail backtrack and return nil
                    (progn
                      (parser-backtrack)
                      (throw 'backtrack nil)))
                  ))
              match-list)
            (parser-pop)
            )))
   (if (car ast)
     ;; would be nice to run map-filter-nil on ast.
     (cons 0 ast)
     (cons nil nil))
   ))

;;----------------------------------------------------------------------
;; syntax interpretation
;;----------------------------------------------------------------------

;; interpretation as expansion of a form into lisp is separated from
;; compilation to make the parser easier to debug and verify by
;; phase.

;; the token interpreter was split into two functions to isolate the
;; flexibility of tokens (user functions or return values for
;; constructing AST) from the hard-wired parser part.

(defun parser-interp-token-action ( identifier constructor ) ;; tested
  "Translate the AST constructor definition for a token into Elisp."

  (unless (symbolp identifier)
    (throw 'syntax-error (parser-diagnostic identifier
                           "parser token"
                           "identifier: An unbound symbol used as an identifier"
                           )))

  ;; Warning: this code is very touchy, double quotation of AST
  ;; symbols is required. the saving grace is that the symbols don't
  ;; have a variable value bound so it fails noisily when the
  ;; quotation is incorrect.
  (cond
    ((eq nil constructor)    `(parser-build-token (quote ',identifier)))
    ((listp constructor)     `(,(parser-make-anon-func constructor) (match-beginning 0) (match-end 0)))
    ((functionp constructor) `(,constructor (match-beginning 0) (match-end 0)))
    ((symbolp constructor)   `(quote ',constructor))

    ;; all other constructor types are un-handled.
    ((throw 'syntax-error (parser-diagnostic identifier
                           "parser token: identifier" "A symbol"))))
  )

(defun parser-interp-token ( syntax ) ;; tested
  "Translate a token definition into a parser match function."

  (lexical-let
    ((identifier  (car syntax))
     (regex       (cadr syntax))    ;; eval ? many regexs are stored in variables.
     (constructor (caddr syntax)))

    `(lambda ()
       (if (looking-at ,regex)
         (cons (parser-next) ,(parser-interp-token-action identifier constructor))
         (cons nil nil)
         ))
    ))

(defun parser-interp-production ( production )
  "Translate the definition of a production, or a reference to a production
   into a match object symbol."

  (cond
    ((listp production) (parser-compile-definition production))
    ((symbolp production) (parser-get-match production))

    (throw 'syntax-error
      (parser-daignostic production
        "interpret definition"
        "expected a definition as a list, or a symbol as a production/token reference"))
    ))

;;----------------------------------------------------------------------
;; compilation
;;----------------------------------------------------------------------

;; after definitions are interpreted they are evaluated or compiled and
;; stored as match functions.

(defun parser-compile-token ( syntax ) ;; tested
  "compile a token definition into a match object"
  (parser-make-match (car syntax) (parser-interp-token syntax)))

(defun parser-curry-production ( identifier combine-operator &rest grammar )
  "Compile a match object with a combine operator and a match function list.
   I think curry is applicable, but largely it was named curry so I could
   create the parser-compile-production macro."
  (unless (symbolp identifier)
    (parser-diagnostic identifier
      "compile production"
      "match identifier"))

    (parser-make-match identifier
      `(lambda ()
         (lexical-let
           ((result (apply ',combine-operator ',grammar)))
         (if (car result)
           (cons (car result) (cons ',identifier (cdr result)))
           result)
           ))
      ))

(defmacro parser-compile-production ( combine-function production-list )
  "parser-compile-production simplifies the syntax of interpreting and compiling
   a production. The construct looks hairy because it combines two operations
   with quoting necessitated by apply. This macro mechanizes the tricky part
   to enhance code readability."
  `(apply 'parser-curry-production      ;; make a match function
     (car ,production-list)             ;; the identifier of the production
     ',combine-function                 ;; the combine operator
     (mapcar 'parser-interp-production (cdr ,production-list)))) ;; interpret the matching definition.

(defun parser-compile-definition ( term )
  "parser-compile-definition is the recursive heart of the compiler."
  (unless (listp term)
    (throw 'syntax-error (parser-diagnostic term
                           "parser definition"
                           "expected a definition of token|or|and")))

  (lexical-let
    ((keyword (car term))
      (syntax (cdr term)))

    (if (listp keyword)
      (parser-compile-definition keyword)

      (cond
        ((eq keyword 'token) (parser-compile-token syntax))
        ((eq keyword 'or)    (parser-compile-production parser-or syntax))
        ((eq keyword 'and)   (parser-compile-production parser-and syntax))

        ((throw 'syntax-error (parser-diagnostic term
                                "parser definition"
                                "definition keyword token|or|and")))
        ))
    ))

(defvar parser-mtable-init-size 13
  "initial size of the match-table objarray for storing match functions. the value
   was chosen based on the recommendation of prime numbers for good hashing.")

(defmacro parser-compile ( parser &rest definition )
  "compile a LL parser from the given grammar."
  (let
    ;; create a symbol table to store compiled terminal and
    ;; non-terminal match functions
    ((match-table (make-vector parser-mtable-init-size 0)))

    (lexical-let
      ((compiled (catch 'syntax-error
                   `(lambda ( start-pos )
                      (let
                        ((parser-position (cons start-pos nil))) ;; initialize the backtrack stack
                        (save-excursion
                          (goto-char start-pos)
                          ;; note that the start symbol of the grammar is built in as an or combination
                          ;; of the top-level definitions.
                          (lexical-let
                            ((parse (,(apply 'parser-curry-production 'start 'parser-or
                                        (mapcar 'parser-compile-definition definition))
                                      )))
                            (if (car parse)
                              ;; if we have a production return the position at which the
                              ;; parser stopped along with the AST.
                              (cons (car parser-position) (cdr parse))
                              nil))
                          )))
                   )))
      (if (stringp compiled)
        ;; error path, print a message and return nil.
        (progn
          (message "parser-compile failed! %s" compiled)
          nil)

        ;; success path, bind the compiled parser to the parser symbol and return t.
        (progn
          (fset parser (eval compiled))
          t))
      )))

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.1.2: test-parser.el --]
[-- Type: text/x-emacs-lisp; name=test-parser.el, Size: 2622 bytes --]

;;----------------------------------------------------------------------
;; diagnostics
;;----------------------------------------------------------------------

;; test the diagnostics of the parser. I think some sort of eval may be necessary
;; in diagnostics.

;; see what the expansion looks like
(pp (macroexpand
      (parser-diagnostic `(foo bar)
        "test"
        "the right thing")))

;; see what it looks like when handled by message.
(message "%s"
      (parser-diagnostic `(foo bar)
        "test"
        "the right thing"))))

;;----------------------------------------------------------------------
;; token interp phase
;;----------------------------------------------------------------------

(defmacro test-interp-token ( &rest syntax )
  (lexical-let
    ((compile (catch 'syntax-error
                (parser-interp-token (cdr syntax)))))

    (if (stringp compile)
      (message "%s" compile)
      compile)
    ))

(defun run-test-interp-token ( match-function )
  (lexical-let
    ((result (funcall match-function)))
    (message "TI match? %s AST %s" (if (car result) "yes" "no") (pp (cdr result)))
    ))

(defun run-test-token ()
  (interactive)
  (run-test-interp-token (test-interp-token token whitespace "[[:blank:]]+")))

(pp (macroexpand
      (test-interp-token token whitespace "[[:blank:]]")))

;;----------------------------------------------------------------------
;; test tokens
;;----------------------------------------------------------------------

;; eval this, then eval various tests.

(defun run-parser ()
  "run test-parser interactively for testing and debugging."
  (interactive)
  (lexical-let
    ((parse-result (test-parser (point))))

    (message "PROD match? %s AST %s"
      (if (car parse-result)
        (format "matched to: %s" (car parse-result))
        "no")
      (pp (cdr parse-result)))
    ))

(parser-compile test-parser
  (token whitespace "[[:blank:]]+"))

(parser-compile test-parser
  (token whitespace "[[:blank:]]+" (lambda ( start stop ) (message "bar by far"))))

(parser-compile test-parser
  (token whitespace "[[:blank:]]+" bingo))

;;----------------------------------------------------------------------
;; test productions
;;----------------------------------------------------------------------

(parser-compile test-parser
  (token whitespace "[[:blank:]]+")
  (token word "[[:alpha:]]+"))

(parser-compile test-parser
  (and indented (token whitespace "[[:blank:]]+") (token word "[[:alpha:]]+"))
  (and inverted word whitespace))

fooo   bar


[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

[-- Attachment #2: Type: text/plain, Size: 152 bytes --]

_______________________________________________
help-gnu-emacs mailing list
help-gnu-emacs@gnu.org
http://lists.gnu.org/mailman/listinfo/help-gnu-emacs

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: early preview of a recursive descent parser compiler implemented as a macro
  2008-01-08 10:34 early preview of a recursive descent parser compiler implemented as a macro Mike Mattie
@ 2008-01-11 23:07 ` Mike Mattie
  2008-01-14  4:23   ` Mike Mattie
  0 siblings, 1 reply; 3+ messages in thread
From: Mike Mattie @ 2008-01-11 23:07 UTC (permalink / raw)
  To: emacs users ml


[-- Attachment #1.1.1: Type: text/plain, Size: 21272 bytes --]

On Tue, 8 Jan 2008 02:34:36 -0800
Mike Mattie <codermattie@gmail.com> wrote:

> Hello,
> 
> I recently found myself needing a top down parser, so I wrote up a
> parser compiler as a macro. It is turning out quite nicely for my
> purposes so I would like to submit it for comments and suggestions. I
> am still working out a couple of kinks, but it is at the point where
> I can start integrating feedback.
> 
> The goal is a lightweight parser for simple jobs such as scraping a
> data structure out of essentially static buffers, which is the
> problem I was solving when I developed it.
> 
> I am in-lining the actual parser compiler, and attaching a testing
> file so people can play with it if they find it interesting.
> 
> Status:
> 
> It's my first medium sized macro so it may have some thinko's or
> oddities. criticism welcome.
> 
> Simple tests work. The backtracking hasn't been extensively tested
> yet. The and form doesn't seem to return the full AST it should.
> Other than that all the basic bits are in place.

I reworked the runtime functions a bit and the and operator works correctly
now. test-parser.el has some minor changes so I have included the latest
version.

> The catch for semantic-error throws is not in place yet. I need
> something like (condition-case) however that construct seemed a
> little over-powered for my purpose. Suggestions on handling multiple
> error conditions in a single form would be appreciated.

My next priority is completing the error handling.

> TODO:
> 
> 1. add ? for optional productions. Not a big deal to implement once
> the basic implementation is sound.

> 2. Full Documentation

3. add a (define ) list so that tokens and productions can be defined
   without using them in place.

> Review Questions:
> 
> Do the Emacs developers consider this tool useful ? Does it fill a
> niche in Emacs ? If so I will document it.
> 
> Is it elegant ?
> 
> Is the terminology correct ? I am not entirely sure I used curry in
> the strict sense.
> 
> Are there any essential features missing ? keeping in mind the goal
> is simplicity, not a fully optimized parser.
> 
> Is it interesting in a creative sense, or have I just ground out
> another lisp right of passage ?
> 
> Examples:
> 
> A couple of contrived examples I have used for testing. A couple of
> things to note. Keywords are token|and|or. token defines a regex
> match, while or and and construct terminals. The definition of a
> token or terminal creates a match object at the point of definition.
> 
> There is a built-in start symbol that is a or-terminal, so the first
> example will match to the start symbol either token. 
> 
> (parser-compile test-parser
>   (token whitespace "[[:blank:]]+")
>   (token word "[[:alpha:]]+"))
> 
> The parser is compiled to the function value of test-parser. You can
> call it like this: (test-parser (point))
> 
> It parses a single start production. Looping is done outside of the
> parser itself. 
> 
> as the second example shows you can reference previously created
> tokens.
> 
> (parser-compile test-parser
>   (and indented (token whitespace "[[:blank:]]+") (token word
> "[[:alpha:]]+")) (and inverted word whitespace))
> 
> The attached file test-parser.el has some test drivers for playing
> with the parser interactively. Some of the features for defining
> tokens such as defining your own actions for tokens are shown there.
> 
> P.S, here is a fun little lisp comic in the spirit of finding
> amusement in your work: http://xkcd.com/297/
> 
> Without further ado here is parser.el , and enjoy !
>

Here is the second posted version with the changes indicated. If you need
a version number that is not ambiguous this posting is version 636. 

;;----------------------------------------------------------------------
;; parser.el
;; Primary Author: Mike Mattie (codermattie@gmail.com)
;; Copyright: Mike Mattie (2008)
;;----------------------------------------------------------------------

;; ->Summary

;; parser.el aims to be a lightweight implementation of parsing.

;; My requirements involve building/extracting a data structure from
;; analysis of a given text. For this purpose a "top-down" parse
;; makes it easier to assemble a data structure. [TODO: why is it easier
;; this way ?]

;; ->Related Works

;; Emacs regex tables.
;;
;; Much of the parsing facilities that I have seen so far in Emacs
;; have been various forms of regex tables.

;; There are many reasons why Emacs parsing tools would evolve away
;; from the usual parser generator tool-set.

;; 1. The requirement to work incrementally: parse errors are not
;;    errors, the user just has not finished typing :)

;; 2. Emacs analysis is often partial by requirement, usually only
;;    significant parts of the text need to be identified for
;;    features.

;; 3. Emacs can leverage the annotation facilities of
;;    text properties and overlays making analysis extremely
;;    modular and robust while editing.

;; CEDET.
;;
;; CEDET appears to be a parser generator implemented on-top of a
;; CLOS emulation. The fact that this sort of design is not "lispy"
;; does not negate it's value, but limits it IMHO. This tool-set
;; has also "grown around" the problem of analyzing source code as
;; a project which is bloat for simpler requirements.

;; ->Concept

;; A parser compiler that combines lexical (regex) analysis with LL
;; parsing compiled by recursive macro expansion.

;; The concept started out as an idea to build a parser as nested cond
;; forms. The matches would go in the predicate part of the clause
;; while the action part would go in the body.

;; This idea is expressed here with cond forms mutated into lambdas so
;; that previously defined matches can be referenced by productions.

;; -> Requirements

;; 1. Easy to use for trivial problems.

;; 2. Data structures are orthogonal to the parser and not hard-wired
;;    in. Leafs and nodes of the AST are constructed with beginning
;;    and ending positions of the text in the buffer.
;;
;;    This allows the user to choose between positions, markers, overlays
;;    according to the requirements.

;; -> Naming Conventions

;; parser-*          internal functions
;; parser-build-*    Construct the AST structures returned by the parser.
;; parser-interp-*   expand definitions of terminals and non-terminals into lisp
;; parser-compile-*  compile the expanded definitions into interchangeable match functions.

;; -> Key data structures

;; Matches

;; matches are a cons pair of ( parser . AST ).

;; Parser is used internally to indicate match success or failure. It
;; should be the distance matched from the current parser position.
;; It can, and often will be zero for a successful match. nil indicates
;; a failure to match.

;; AST is the data returned from the match ( production . ( start . end ) | list )
;; the car of the pair (production) is the production's symbol.

;;----------------------------------------------------------------------
;; diagnostics
;;----------------------------------------------------------------------

(defun parser-expr-diagnostic ( form ) ;; tested
  (format "type(%s) %s" (symbol-name (type-of form)) (pp (eval form))))

(defmacro parser-diagnostic ( form from expected ) ;; tested
  "syntax: (parser-diagnostic form from expected)

   Where form is the expr received, from is the component issuing the diagnostic,
   and expected is a message describing what the component expected"
  `(concat (format "[%s] expected: " ,from)  ,expected " not: " ,(parser-expr-diagnostic form)))

;;----------------------------------------------------------------------
;; AST constructors
;;----------------------------------------------------------------------

(defun parser-build-token ( identifier ) ;; tested
  "parser-make-token is a built-in constructor that records the analysis
   and the location of the text (id . (begin . end))"
  (cons identifier (cons (match-beginning 0) (match-end 0))))

;;----------------------------------------------------------------------
;; compiler internals
;;----------------------------------------------------------------------

;; make-match and get-match implement a table of productions and
;; tokens. Both production and tokens are interchangeable as far as
;; matching and referencing is concerned. This table gives the ability
;; to reference previously defined matches in other productions.

;; match-table is dynamically scoped by the macro, and does not appear
;; in the compiled form.

(defun parser-make-match ( symbol function ) ;; tested
  "parser-make-match takes ( symbol function ) and returns a symbol
   stored in the parser's match-table with the evaluated lambda
   bound"
  (lexical-let
    ((new-name (symbol-name symbol)))

    (if (intern-soft new-name match-table)
      (throw 'semantic-error (format "illegal redefinition of match %s" new-name)))

    (fset (intern new-name match-table) (eval function))
    (intern new-name match-table)
    ))

(defun parser-make-anon-func ( sexp ) ;; tested
  "bind an un-evaluated anonymous function to an un-interned symbol"
  (let
    ((anon-func (make-symbol "parser-lambda")))
    (fset anon-func (eval sexp))
    anon-func))

(defun parser-get-match ( symbol ) ;; tested
  "return the compiled match for symbol, or throw a semantic-error if it does not
   exist"
  (lexical-let
    ((existing-name (symbol-name symbol)))

    (unless (intern-soft existing-name match-table)
      (throw 'semantic-error (format "unkown match %s" existing-name)))

    (intern existing-name match-table)))

;;----------------------------------------------------------------------
;; compiler run-time
;;----------------------------------------------------------------------

;; these functions are defined independent of the compilation of a parser
;; for simplicity, and to avoid wasteful duplication in the macro expansion.

;; parser-position

;; The position of the parser in the input stream is maintained as a
;; stack of indexes into the buffer. As matching progresses the top of
;; the stack (car) is updated. The push and pop operations copy a
;; position to a next or previous stack position. backtrack discards
;; the top of the stack.

(defun parser-pos () ;; tested
  "Return the current position of the parser in the buffer"
  (car parser-position))

(defun parser-push ()
  "Copy the parser position to a new stack level so the parser can backtrack when necessary."
  (setq parser-position (cons (car parser-position) parser-position)))

(defun parser-pop ()
  "Copy the parser position to a previous stack level When the possibility of a backtrack
   has been eliminated by matching."
  (let
    ((current (parser-pos)))
    (setq parser-position (cons (car parser-position) (cddr parser-position)))
    ))

(defun parser-backtrack ()
  "Restore the previous parser position by discarding the top of the parser-position stack."
  (setq parser-position (cdr parser-position))
  (goto-char (parser-pos)))

(defun parser-advance ( distance )
  "Add distance to the parsing position on the current stack level.
   The advancing of the parser is done relative to the current
   position so that a successful match can advance 0 characters.
   This is important for optional matches such as the \"?\" meta-symbol.
   They want to indicate a successful match without moving the parser position."

  ;; we are going to get zero's as a unfortunate side-effect of
  ;; keeping the nesting of combination operators simple. Avoid NOP
  ;; work with a quick test.

  (if (> distance 0)
    (progn
      (setq parser-position (cons (+ distance (car parser-position)) (cdr parser-position)))
      (goto-char (parser-pos)))
    ))

(defun parser-next ()
  "Compute the next parser position from the length of the entire regex's current match, plus one."
  (+ 1 (- (match-end 0) (match-beginning 0))))

;; parser combination operators

;; the combination operators group matches by and,or operators as the
;; underlying mechanics of non-terminal productions. As these
;; functions are not compiled into the generated parser they have no
;; knowledge of the symbol associated with the match list, only the
;; match list itself.

(defun parser-or ( &rest match-list )
  "Combine match objects by or where the first successful match is returned.
   nil is returned if no matches are found"
  (catch 'match
    (dolist (match match-list)
      (lexical-let
        ((match-result (funcall match)))
        (if match-result
          (progn
            (parser-advance (car match-result))
            (throw 'match (cons 0 (cdr match-result)))))
        ))
    (throw 'match nil) ;; this is what failure looks like :)
    ))

(defun parser-and ( &rest match-list )
  "combine the matches with and. all of the match objects must return non-nil
   in the parser part or the parser will backtrack and return nil."
  (parser-push)

  (lexical-let
    ((ast (catch 'backtrack
            ;; we want to gather all the matches, so mapcar across the match objects.
            (mapcar
              (lambda (match)
                (lexical-let
                  ((match-result (funcall match)))

                  (if match-result
                    (progn
                      ;; on match advance the parser and return the AST
                      (parser-advance (car match-result))
                      (cdr match-result))

                    ;; on fail backtrack and return nil
                    (throw 'backtrack nil))
                  ))
              match-list)
            )))
   (if ast
     (progn
       (parser-pop)
       (cons 0 ast)) ;; would be nice to filter optional matches here.
     (progn
       (parser-backtrack)
       nil))
   ))

;;----------------------------------------------------------------------
;; syntax interpretation
;;----------------------------------------------------------------------

;; interpretation as expansion of a form into lisp is separated from
;; compilation to make the parser easier to debug and verify by
;; phase.

;; the token interpreter was split into two functions to isolate the
;; flexibility of tokens (user functions or return values for
;; constructing AST) from the hard-wired parser part.

(defun parser-interp-token-action ( identifier constructor ) ;; tested
  "Translate the AST constructor definition for a token into Elisp."

  (unless (symbolp identifier)
    (throw 'syntax-error (parser-diagnostic identifier
                           "parser token"
                           "identifier: An unbound symbol used as an identifier"
                           )))

  ;; Warning: this code is very touchy, double quotation of AST
  ;; symbols is required. the saving grace is that the symbols don't
  ;; have a variable value bound so it fails noisily when the
  ;; quotation is incorrect.
  (cond
    ((eq nil constructor)    `(parser-build-token (quote ',identifier)))
    ((listp constructor)     `(,(parser-make-anon-func constructor) (match-beginning 0) (match-end 0)))
    ((functionp constructor) `(,constructor (match-beginning 0) (match-end 0)))
    ((symbolp constructor)   `(quote ',constructor))

    ;; all other constructor types are un-handled.
    ((throw 'syntax-error (parser-diagnostic identifier
                           "parser token: identifier" "A symbol"))))
  )

(defun parser-interp-token ( syntax ) ;; tested
  "Translate a token definition into a parser match function."

  (lexical-let
    ((identifier  (car syntax))
     (regex       (cadr syntax))    ;; eval ? many regexs are stored in variables.
     (constructor (caddr syntax)))

    `(lambda ()
       (if (looking-at ,regex)
         (cons (parser-next) ,(parser-interp-token-action identifier constructor))
         nil
         ))
    ))

(defun parser-interp-production ( production )
  "Translate the definition of a production, or a reference to a production
   into a match object symbol."

  (cond
    ((listp production) (parser-compile-definition production))
    ((symbolp production) (parser-get-match production))

    (throw 'syntax-error
      (parser-daignostic production
        "interpret definition"
        "expected a definition as a list, or a symbol as a production/token reference"))
    ))

;;----------------------------------------------------------------------
;; compilation
;;----------------------------------------------------------------------

;; after definitions are interpreted they are evaluated or compiled and
;; stored as match functions.

(defun parser-compile-token ( syntax ) ;; tested
  "compile a token definition into a match object"
  (parser-make-match (car syntax) (parser-interp-token syntax)))

(defun parser-curry-production ( identifier combine-operator &rest grammar )
  "Compile a match object with a combine operator and a match function list.
   I think curry is applicable, but largely it was named curry so I could
   create the parser-compile-production macro."
  (unless (symbolp identifier)
    (parser-diagnostic identifier
      "compile production"
      "match identifier"))

    (parser-make-match identifier
      `(lambda ()
         (lexical-let
           ((result (apply ',combine-operator ',grammar)))
         (if result
           (cons (car result) (cons ',identifier (cdr result)))
           nil)
           ))
      ))

(defmacro parser-compile-production ( combine-function production-list )
  "parser-compile-production simplifies the syntax of interpreting and compiling
   a production. The construct looks hairy because it combines two operations
   with quoting necessitated by apply. This macro mechanizes the tricky part
   to enhance code readability."
  `(apply 'parser-curry-production      ;; make a match function
     (car ,production-list)             ;; the identifier of the production
     ',combine-function                 ;; the combine operator
     (mapcar 'parser-interp-production (cdr ,production-list)))) ;; interpret the matching definition.

(defun parser-compile-definition ( term )
  "parser-compile-definition is the recursive heart of the compiler."
  (unless (listp term)
    (throw 'syntax-error (parser-diagnostic term
                           "parser definition"
                           "expected a definition of token|or|and")))

  (lexical-let
    ((keyword (car term))
      (syntax (cdr term)))

    (if (listp keyword)
      (parser-compile-definition keyword)

      (cond
        ((eq keyword 'token) (parser-compile-token syntax))
        ((eq keyword 'or)    (parser-compile-production parser-or syntax))
        ((eq keyword 'and)   (parser-compile-production parser-and syntax))

        ((throw 'syntax-error (parser-diagnostic term
                                "parser definition"
                                "definition keyword token|or|and")))
        ))
    ))

(defvar parser-mtable-init-size 13
  "initial size of the match-table objarray for storing match functions. the value
   was chosen based on the recommendation of prime numbers for good hashing.")

(defmacro parser-compile ( parser &rest definition )
  "compile a LL parser from the given grammar."
  (let
    ;; create a symbol table to store compiled terminal and
    ;; non-terminal match functions
    ((match-table (make-vector parser-mtable-init-size 0)))

    (lexical-let
      ((compiled (catch 'syntax-error
                   `(lambda ( start-pos )
                      (let
                        ((parser-position (cons start-pos nil))) ;; initialize the backtrack stack
                        (save-excursion
                          (goto-char start-pos)
                          ;; note that the start symbol of the grammar is built in as an or combination
                          ;; of the top-level definitions.
                          (lexical-let
                            ((parse (,(apply 'parser-curry-production 'start 'parser-or
                                        (mapcar 'parser-compile-definition definition))
                                      )))
                            (if parse
                              ;; if we have a production return the position at which the
                              ;; parser stopped along with the AST.
                              (cons (car parser-position) (cdr parse))
                              nil))
                          )))
                   )))
      (if (stringp compiled)
        ;; error path, print a message and return nil.
        (progn
          (message "parser-compile failed! %s" compiled)
          nil)

        ;; success path, bind the compiled parser to the parser symbol and return t.
        (progn
          (fset parser (eval compiled))
          t))
      )))

(provide 'parser)

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.1.2: test-parser.el --]
[-- Type: text/x-emacs-lisp; name=test-parser.el, Size: 2623 bytes --]

;;----------------------------------------------------------------------
;; diagnostics
;;----------------------------------------------------------------------

;; test the diagnostics of the parser. I think some sort of eval may be necessary
;; in diagnostics.

;; see what the expansion looks like
(pp (macroexpand
      (parser-diagnostic `(foo bar)
        "test"
        "the right thing")))

;; see what it looks like when handled by message.
(message "%s"
      (parser-diagnostic `(foo bar)
        "test"
        "the right thing"))))

;;----------------------------------------------------------------------
;; token interp phase
;;----------------------------------------------------------------------

(defmacro test-interp-token ( &rest syntax )
  (lexical-let
    ((compile (catch 'syntax-error
                (parser-interp-token (cdr syntax)))))

    (if (stringp compile)
      (message "%s" compile)
      compile)
    ))

(defun run-test-interp-token ( match-function )
  (lexical-let
    ((result (funcall match-function)))
    (message "TI match? %s AST %s" (if (car result) "yes" "no") (pp (cdr result)))
    ))

(defun run-test-token ()
  (interactive)
  (run-test-interp-token (test-interp-token token whitespace "[[:blank:]]+")))

(pp (macroexpand
      (test-interp-token token whitespace "[[:blank:]]")))

;;----------------------------------------------------------------------
;; test tokens
;;----------------------------------------------------------------------

;; eval this, then eval various tests.

(defun run-parser ()
  "run test-parser interactively for testing and debugging."
  (interactive)
  (lexical-let
    ((parse-result (test-parser (point))))

    (message "PROD match? %s"
      (if parse-result
        (format "Yes matched to: %s, AST: %s" (car parse-result) (pp (cdr parse-result))
        "No")
      ))
    ))

(parser-compile test-parser
  (token whitespace "[[:blank:]]+"))

(parser-compile test-parser
  (token whitespace "[[:blank:]]+" (lambda ( start stop ) (message "bar by far"))))

(parser-compile test-parser
  (token whitespace "[[:blank:]]+" bingo))

;;----------------------------------------------------------------------
;; test productions
;;----------------------------------------------------------------------

(parser-compile test-parser
  (token whitespace "[[:blank:]]+")
  (token word "[[:alpha:]]+"))

(parser-compile test-parser
  (and indented (token whitespace "[[:blank:]]+") (token word "[[:alpha:]]+"))
  (and inverted word whitespace))

fooo   bar


[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

[-- Attachment #2: Type: text/plain, Size: 152 bytes --]

_______________________________________________
help-gnu-emacs mailing list
help-gnu-emacs@gnu.org
http://lists.gnu.org/mailman/listinfo/help-gnu-emacs

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: early preview of a recursive descent parser compiler implemented as a macro
  2008-01-11 23:07 ` Mike Mattie
@ 2008-01-14  4:23   ` Mike Mattie
  0 siblings, 0 replies; 3+ messages in thread
From: Mike Mattie @ 2008-01-14  4:23 UTC (permalink / raw)
  To: emacs users ml


[-- Attachment #1.1.1: Type: text/plain, Size: 24541 bytes --]

On Fri, 11 Jan 2008 15:07:38 -0800
Mike Mattie <codermattie@gmail.com> wrote:

> On Tue, 8 Jan 2008 02:34:36 -0800
> Mike Mattie <codermattie@gmail.com> wrote:
> 
> > Hello,
> > 
> > I recently found myself needing a top down parser, so I wrote up a
> > parser compiler as a macro. It is turning out quite nicely for my
> > purposes so I would like to submit it for comments and suggestions.
> > I am still working out a couple of kinks, but it is at the point
> > where I can start integrating feedback.
> > 
> > The goal is a lightweight parser for simple jobs such as scraping a
> > data structure out of essentially static buffers, which is the
> > problem I was solving when I developed it.
> > 
> > I am in-lining the actual parser compiler, and attaching a testing
> > file so people can play with it if they find it interesting.
> > 
> > Status:
> > 
> > It's my first medium sized macro so it may have some thinko's or
> > oddities. criticism welcome.
> > 
> > Simple tests work. The backtracking hasn't been extensively tested
> > yet. The and form doesn't seem to return the full AST it should.
> > Other than that all the basic bits are in place.
> 
> I reworked the runtime functions a bit and the and operator works
> correctly now. test-parser.el has some minor changes so I have
> included the latest version.

quoting is fixed for the production symbols.

> > The catch for semantic-error throws is not in place yet. I need
> > something like (condition-case) however that construct seemed a
> > little over-powered for my purpose. Suggestions on handling multiple
> > error conditions in a single form would be appreciated.

> My next priority is completing the error handling.

I chose the condition-case route. I ended up writing up a quick and dirty define-error which
is evidently in XEmacs. Any advice for how to handle this compatibility issue would be
appreciated.

After some good suggestions following review on IRC (thanks guys), I have cleaned up the
code for readability and re-organized it.

> > TODO:
> > 
> > 1. add ? for optional productions. Not a big deal to implement once
> > the basic implementation is sound.
> 
> > 2. Full Documentation
> 
> 3. add a (define ) list so that tokens and productions can be defined
>    without using them in place.

It's implemented but not tested yet.

> > Review Questions:
> > 
> > Do the Emacs developers consider this tool useful ? Does it fill a
> > niche in Emacs ? If so I will document it.
> > 
> > Is it elegant ?
> > 
> > Is the terminology correct ? I am not entirely sure I used curry in
> > the strict sense.

This version cleans up the use of parsing terminology.

> > Are there any essential features missing ? keeping in mind the goal
> > is simplicity, not a fully optimized parser.
> > 
> > Is it interesting in a creative sense, or have I just ground out
> > another lisp right of passage ?

With some pointers from #emacs on freenode I found one similar work on On Lisp
by Paul Graham. They are technical similarities of course, but the interface
and implementation are quite different. Any other pointers to Related works
would be greatly appreciated.
 
> > Examples:
> > 
> > A couple of contrived examples I have used for testing. A couple of
> > things to note. Keywords are token|and|or. token defines a regex
> > match, while or and and construct terminals. The definition of a
> > token or terminal creates a match object at the point of definition.
> > 
> > There is a built-in start symbol that is a or-terminal, so the first
> > example will match to the start symbol either token. 
> > 
> > (parser-compile test-parser
> >   (token whitespace "[[:blank:]]+")
> >   (token word "[[:alpha:]]+"))
> > 
> > The parser is compiled to the function value of test-parser. You can
> > call it like this: (test-parser (point))
> > 
> > It parses a single start production. Looping is done outside of the
> > parser itself. 
> > 
> > as the second example shows you can reference previously created
> > tokens.
> > 
> > (parser-compile test-parser
> >   (and indented (token whitespace "[[:blank:]]+") (token word
> > "[[:alpha:]]+")) (and inverted word whitespace))
> > 
> > The attached file test-parser.el has some test drivers for playing
> > with the parser interactively. Some of the features for defining
> > tokens such as defining your own actions for tokens are shown there.
> > 
> > P.S, here is a fun little lisp comic in the spirit of finding
> > amusement in your work: http://xkcd.com/297/
> > 
> > Without further ado here is parser.el , and enjoy !
> >
> 

Here is the third posting, v669. It will probably be the last mailing list post.
If I continue to publish it I will probably put it on EmacsWiki. I wanted to drop
off a decently clean version of the code in case anybody stumbles across this thread
and is interested.

;;----------------------------------------------------------------------
;; parser.el
;; Primary Author: Mike Mattie (codermattie@gmail.com)
;; Copyright: Mike Mattie (2008)
;;----------------------------------------------------------------------

;; ->Summary.

;; parser.el aims to be a lightweight implementation of Recursive
;; Descent parsing. The parser-compile macro is given a symbol and a
;; grammar definition. It returns a compiled parser bound to the
;; function of the symbol.

;; -> Application.

;; Building or extracting a nested data structure by parsing static
;; text.

;; -> Comparison with Emacs parsing facilities.

;; Emacs supports parsing in two forms: Regular Expressions and tools
;; like Syntax Tables geared towards using the parse results to
;; annotate buffers with text-properties and overlays.

;; This Dynamic Programming approach works well for overlaying
;; functions that interpret the meaning of text in the buffer, such as
;; syntax highlighting, on a paradigm of unstructured text. The
;; analysis is preserved when the buffer is edited at a character
;; level.

;; When you need to build an interface that rests entirely on the
;; parse analysis to the degree that the user or program does not
;; modify or traverse the buffer at a character level, this tool
;; simplifies construction of a nested data structure that maps tokens
;; to beginning and ending positions of the match.

;; -> Related Works

;; * CEDET.

;; CEDET appears to be a parser generator implemented on-top of a CLOS
;; emulation. This tool-set has developed towards the problem of
;; analyzing source code as a project which is bloated for simpler
;; requirements.

;; CEDET is likely to offer much higher performance than an un-optimized
;; recursive descent parser. If backtracking needs to be optimized
;; something like CEDET is a better choice than parser-compile.

;; ->Concept

;; A parser compiler that combines lexical (regex) analysis with recursive
;; descent parsing compiled by macro expansion.

;; The concept started with the idea to build a parser as nested cond
;; forms. The matching would go in the predicate part and the action
;; in the body.

;; This essential simplicity is still present in the design but the
;; implementation is now more complex with Match Functions replacing
;; the cond clause form. Match Functions allow previous definitions of
;; a token or rule to be referenced later in later rules. Backtracking
;; was another essential complexity once the and non-terminal was
;; added.

;; ->Requirements

;; 1. Easy to use for trivial problems, powerful enough to solve
;;    most of them.

;; 2. Data structures are orthogonal to the parser and not hard-wired
;;    in. Tokens can perform user defined functions or construct custom
;;    data structures with the beginning and ending positions of the
;;    match in the input.
;;
;;    This allows the user to choose between positions, markers, overlays
;;    according to the requirements.

;; ->Characteristics

;; parser-compile produces a no frills recursive descent parser.

;; ->Terminology

;; My reference for parsing terminology is the Dragon Book:

;; Compilers
;; Principles,Techniques,and Tools
;; Alfred V.Aho, Ravi Sethi, Jeffrey D.Ullman
;; 1986, Addison Wesley

;; ->TODO

;; 1. define list where Matches can be defined without inserting a
;;    match at the definition point [implemented, but not tested]

;; 2. optional matching with ? for Match Function references. [easy]

;; 3. Canonical tree walk implemented as parser-ast-node.

(require 'cl)

;; not defined in Gnu Emacs evidently.

(defmacro define-error ( symbol message &rest isa-list )
  "define a error symbol with a isa list and a error message"
  `(progn
     (put ',symbol
       'error-conditions (append '(error ,symbol) ',isa-list))
     (put ',symbol 'error-message ,message)
     ))

;; A recursive macro expansion would be nice for creating a hierarchy.
(define-error parser-compile-error  "parser error")
  (define-error parser-syntactic-error  "syntactic error" parser-compile-error)
  (define-error parser-semantic-error   "semantic error" parser-compile-error)

;;----------------------------------------------------------------------
;; Backtracking.
;;----------------------------------------------------------------------

;; parser-position

;; The position of the parser in the input stream is maintained as a
;; stack of indexes into the buffer. As matching progresses the top of
;; the stack (car) is updated. The push and pop operations copy a
;; position to a next or previous stack position. backtrack discards
;; the top of the stack.

(defsubst parser-pos () ;; tested
  "Return the current position of the parser in the buffer"
  (car parser-position))

(defun parser-push ()
  "Copy the parser position to a new stack level so the parser can backtrack when necessary."
  (push (parser-pos) parser-position))

(defun parser-pop ()
  "Copy the parser position to a previous stack level When the possibility of a backtrack
   has been eliminated by matching."
  (let
    ((current (pop parser-position)))
    (setcar parser-position current)
    ))

(defun parser-backtrack ()
  "Restore the previous parser position by discarding the top of the parser-position stack."
  (pop parser-position)
  (goto-char (parser-pos)))

(defun parser-advance ( consumed )
  "Advance the input position of the parser to the next un-matched character: input consumed + 1."

  (if (> consumed 0)
    (lexical-let
      ((pos (+ 1 consumed (parser-pos))))
      (progn
        (setcar parser-position pos)
        (goto-char pos)))
    ))

(defun parser-consumed ()
  "The number of input characters consumed by the token's match in the input."
  (- (match-end 0) (match-beginning 0)))

;;----------------------------------------------------------------------
;; Match Result
;;----------------------------------------------------------------------

;; The parser functions have a standard structure for returning the
;; result of a Match Function.

;; nil | (parser . AST ).

;; nil indicates failure to match.

;; A successful match has a parser and AST parts. The parser is
;; the ending position of the match relative to the starting
;; match position.

;; The AST part is created by a token action handler or a combination
;; operator like and.

;; token ( match-symbol . ( start . end ) | "user value" | "user function return value")
;; and   ( match-symbol . "a list of Match Result AST parts" )

;; This consistent return value structure allows token/and/or
;; match-functions to nest arbitrarily [except that tokens are
;; strictly leafs].

;; My rationale for using inline functions is that I am essentially naming
;; the car and cdr of my cons cell for readability.

(defsubst parser-make-match ( consumed data )
  "create a match result from the input consumed by the match, and the match data."
  (cons consumed data))

(defsubst parser-match-consumed ( match-result )
  "return the input consumed by the match"
  (car match-result))

(defsubst parser-match-data ( match-result )
  "return the data of the match"
  (cdr match-result))

(defsubst parser-make-match-data ( name data )
  (cons name data))

;;----------------------------------------------------------------------
;; Combination Operators
;;----------------------------------------------------------------------

;; The parser uses two combination operators: parser-and, parser-or as
;; nodes in the parser tree. Significantly parser-and can backtrack
;; and parser-or never does.

;; and/or have the same essential meaning as the lisp and/or forms
;; with two specializations. Both functions treat their argument lists
;; as a list of Match Functions. Also the parser-and function returns
;; a production consisting of the AST parts of the Match Results,
;; instead of returning the last Match Result.

(defun parser-or ( &rest match-list )
  "Combine Match Functions by or ; the first successful match is returned.
   nil is returned if no matches are found"
  (catch 'match
    (dolist (match match-list)
      (lexical-let
        ((production (funcall match)))
        (if production
          (progn
            (parser-advance (parser-match-consumed production))
            (throw 'match (parser-make-match 0 (parser-match-data production))))
          )))
    (throw 'match nil)
    ))

(defun parser-and ( &rest match-list )
  "combine the matches with and. all of the match objects must return non-nil
   in the parser part or the parser will backtrack and return nil."
  (parser-push)

  (lexical-let
    ((production (catch 'backtrack
            ;; we want to gather all the matches, so mapcar across the match objects.
            (mapcar
              (lambda (match)
                (lexical-let
                  ((production (funcall match)))

                  (if production
                    (progn
                      (parser-advance (parser-match-consumed production))
                      (parser-match-data production))

                    (throw 'backtrack nil))
                  )) match-list)
            )))
   (if production
     (progn
       (parser-pop)
       (parser-make-match 0 production)) ;; would be nice to filter optional matches here.
     (progn
       (parser-backtrack)
       nil))
   ))

;;----------------------------------------------------------------------
;; compiler diagnostics
;;----------------------------------------------------------------------

;; construct meaningful compiler error messages in the
;; "expected: foo got: bar" form.

(defun parser-expr-diagnostic ( form ) ;; tested
  (format "type(%s) %s" (symbol-name (type-of form)) (pp (eval form))))

(defmacro parser-diagnostic ( form from expected ) ;; tested
  "syntax: (parser-diagnostic form from expected)

   Where form is the expr received, from is the component issuing the diagnostic,
   and expected is a message describing what the component expected"
  `(concat (format "[%s] expected: " ,from)  ,expected " not: " ,(parser-expr-diagnostic form)))

;;----------------------------------------------------------------------
;; Match Functions
;;----------------------------------------------------------------------

;; make-match and get-match implement a table of match functions.

;; Match functions are lambdas that take no arguments and return Match
;; Result values. They assume that the compiled parser has scoped a
;; parser-position stack to determine the parsing position in input.

;; match-table is objarray created at macro scope, and does not appear
;; in the compiled form.

(defun parser-match-function ( identifier &optional definition )
  "Retrieve or define a Match Function."
  (lexical-let*
    ((id (symbol-name identifier))
      (lookup (intern-soft id match-table)))

    (if lookup
      (if (eq definition nil)
        lookup
        (signal 'parser-semantic-error (format "illegal redefinition of Match Function %s" id)))

      (progn
        (fset (intern id match-table) (eval definition))
        (intern id match-table))
      )))

;;----------------------------------------------------------------------
;; tokens
;;----------------------------------------------------------------------

;; EXPERIMENT: allow other functions to be used for token matching
;;             other than looking-at, such as re-search.

(defun parser-build-token ( identifier ) ;; tested
  "parser-make-token is a built-in constructor that records the analysis
   and the location of the text (id . (begin . end))"
  (parser-make-match-data identifier (cons (match-beginning 0) (match-end 0))))

(defun parser-make-anon-func ( sexp ) ;; tested
  "bind an un-evaluated anonymous function to an un-interned symbol"
  (let
    ((anon-func (make-symbol "parser-user-handler")))
    (fset anon-func (eval sexp))
    anon-func))

;; the token interpreter was split into two functions to isolate the
;; flexibility of tokens (user functions or return values for
;; constructing AST) from the hard-wired parser part.

(defun parser-interp-token-action ( identifier constructor ) ;; tested
  "Translate the AST constructor part of a token definition into Elisp."

  (unless (symbolp identifier)
    (signal 'parser-syntactic-error
      (parser-diagnostic identifier
        "parser token"
        "identifier: An unbound symbol used as an identifier"
        )))

  ;; Warning: this code is very touchy, double quotation of AST
  ;; symbols is required. the saving grace is that the symbols don't
  ;; have a variable value bound so it fails noisily when the
  ;; quotation is incorrect.
  (cond
    ((eq nil constructor)    `(parser-build-token '',identifier))
    ((listp constructor)     `(,(parser-make-anon-func constructor) (match-beginning 0) (match-end 0)))
    ((functionp constructor) `(,constructor (match-beginning 0) (match-end 0)))
    ((symbolp constructor)   `(quote ',constructor))

    ;; all other constructor types are un-handled.
    ((signal 'parser-syntactic-error
       (parser-diagnostic identifier
         "parser token: identifier" "A symbol"))))
  )

(defun parser-interp-token ( syntax ) ;; tested
  "Translate a token definition into a Match Function.

   The matching part is hard-wired into the function, while the
   AST part which contains a great degree of flexibility is
   translated by parser-interp-token-action"

  (lexical-let
    ((identifier  (car syntax))
     (regex       (cadr syntax))    ;; eval ? many regexs are stored in variables.
     (constructor (caddr syntax)))

    `(lambda ()
       (if (looking-at ,regex)
         (parser-make-match (parser-consumed) ,(parser-interp-token-action identifier constructor))
         nil
         ))
    ))

(defun parser-compile-token ( syntax ) ;; tested
  "Compile a token into a Match Function."
  (parser-match-function (car syntax) (parser-interp-token syntax)))

;;----------------------------------------------------------------------
;; rules
;;----------------------------------------------------------------------

;; rules are non-terminals, with a left side or identifier, and a
;; right side containing matches that recognize a production of the
;; rule.

(defun list-filter-nil ( list )
  "filter nil symbols from a list"
  (if (consp list)
    (lexical-let
      ((head (car list)))

      (if (eq head 'nil)
        (list-filter-nil (cdr list))
        (cons head (list-filter-nil (cdr list)))
        ))
    nil
    ))

(defun parser-rule-left ( prod-left combine-operator prod-right )
  "Translate the Left Side of a Rule into a Match Function
   currying the Right Side of a rule."

  (unless (symbolp prod-left)
    (parser-diagnostic prod-left
      "Rule Left interpreter"
      "left side of the production or an identifier"))

  (lexical-let
    ((matchf-list (list-filter-nil prod-right))) ;; filter nils created by define lists.

    (parser-match-function prod-left
      `(lambda ()
         (lexical-let
           ((production (apply ',combine-operator ',matchf-list)))
           (if production
             (parser-make-match
               (parser-match-consumed production)
               (parser-make-match-data '',prod-left (parser-match-data production)))
             nil)
           )))
      ))

(defun parser-rule-right ( rule )
  "Translate a match in the Right Side of the rule into a
   compiled Match Function by retrieving the Match Function or
   recursively interpreting the grammar definition."

  (cond
    ((listp rule) (parser-compile-definition rule))
    ((symbolp rule) (parser-match-function rule))

    (signal 'parser-syntactic-error
      (parser-daignostic rule
        "Rule Right interpreter"
        "expected a grammar list e.g: token,and,or ; or a symbol as a rule or token reference"))
    ))

(defun parser-compile-rule ( combine-function prod-right )
  (parser-rule-left      ;; make a match function
    (car prod-right)     ;; the identifier of the production
    combine-function     ;; the combine operator
    (mapcar 'parser-rule-right (cdr prod-right)) ;; interpret the matching definition.
  ))

;;----------------------------------------------------------------------
;; grammar definition.
;;----------------------------------------------------------------------

(defun parser-compile-definition ( term )
  "parser-compile-definition is the recursive heart of the compiler."
  (unless (listp term)
    (signal 'parser-syntactic-error
      (parser-diagnostic term
        "parser definition"
        "expected a definition of token|or|and|define")))

  (lexical-let
    ((keyword (car term))
      (syntax (cdr term)))

    (if (listp keyword)
      (parser-compile-definition keyword)

      (cond
        ((eq keyword 'token)  (parser-compile-token syntax))
        ((eq keyword 'or)     (parser-compile-rule 'parser-or syntax))
        ((eq keyword 'and)    (parser-compile-rule 'parser-and syntax))

        ((eq keyword 'define)
          ;; define discards the match functions as a return value so
          ;; tokens and rules can be defined before they are used.
          (mapcar 'parser-rule-right syntax)
          nil)

        ((signal 'parser-syntactic-error
           (parser-diagnostic term
             "parser definition"
             "definition keyword token|or|and|define")))
        ))
    ))

(defvar parser-mtable-init-size 13
  "initial size of the match-table objarray for storing match functions. the value
   was chosen based on the recommendation of prime numbers for good hashing.")

(defmacro parser-compile ( parser &rest definition )
  "compile a LL parser from the given grammar."
  (let
    ;; create a symbol table to store compiled terminal and
    ;; non-terminal match functions
    ((match-table (make-vector parser-mtable-init-size 0)))

    (condition-case diagnostic
      (progn
        (fset parser
          (eval
            `(lambda ( start-pos )
               (let
                 ((parser-position (cons start-pos nil))) ;; initialize the backtrack stack
                 (save-excursion
                   (goto-char start-pos)
                   ;; note that the start symbol of the grammar is built in as an or combination
                   ;; of the top-level definitions.
                   (lexical-let
                     ((parse (,(parser-rule-left 'start 'parser-or
                                 (mapcar 'parser-compile-definition definition))
                               )))
                     (if parse
                       ;; if we have a production return the position at which the
                       ;; parser stopped along with the AST.
                       (parser-make-match (parser-pos) (parser-match-data parse))
                       nil))
                   )))
            ))
        t)

      (parser-syntactic-error
        (progn
          (message "parser-compile Syntax Error %s" diagnostic)
          nil
          ))
      (parser-semantic-error
        (progn
          (message "parser-compile invalid statement %s" diagnostic)
        ))
      )))

(provide 'parser)

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.1.2: test-parser.el --]
[-- Type: text/x-emacs-lisp; name=test-parser.el, Size: 3298 bytes --]

;;----------------------------------------------------------------------
;; diagnostics
;;----------------------------------------------------------------------

;; test the diagnostics of the parser. I think some sort of eval may be necessary
;; in diagnostics.

;; see what the expansion looks like
(pp (macroexpand
      (parser-diagnostic `(foo bar)
        "test"
        "the right thing")))

;; see what it looks like when handled by message.
(message "%s"
      (parser-diagnostic `(foo bar)
        "test"
        "the right thing"))))

;;----------------------------------------------------------------------
;; token interp phase
;;----------------------------------------------------------------------

(defmacro test-interp-token ( &rest syntax )
  (lexical-let
    ((compile (catch 'syntax-error
                (parser-interp-token (cdr syntax)))))

    (if (stringp compile)
      (message "%s" compile)
      compile)
    ))

(defun run-test-interp-token ( match-function )
  (lexical-let
    ((result (funcall match-function)))
    (message "TI match? %s AST %s" (if (car result) "yes" "no") (pp (cdr result)))
    ))

(defun run-test-token ()
  (interactive)
  (run-test-interp-token (test-interp-token token whitespace "[[:blank:]]+")))

(pp (macroexpand
      (test-interp-token token whitespace "[[:blank:]]")))

;;----------------------------------------------------------------------
;; test tokens
;;----------------------------------------------------------------------

;; eval this, then eval various tests.

(defun run-parser ()
  "run test-parser interactively for testing and debugging."
  (interactive)
  (lexical-let
    ((parse-result (test-parser (point))))

    (message "PROD match? %s"
      (if parse-result
        (format "Yes matched to: %s, AST: %s" (car parse-result) (pp (cdr parse-result))
        "No")
      ))
    ))

(parser-compile test-parser
  (token whitespace "[[:blank:]]+")
  (token word "[[:alpha:]]+"))

(parser-compile test-parser
  (and indented (token whitespace "[[:blank:]]+") (token word "[[:alpha:]]+"))
  (and inverted word whitespace))

(parser-compile test-parser
  (token whitespace "[[:blank:]]+"))

(parser-compile test-parser
  (token whitespace "[[:blank:]]+" (lambda ( start stop ) (message "bar by far"))))

(parser-compile test-parser
  (token whitespace "[[:blank:]]+" bingo))

;;----------------------------------------------------------------------
;; test productions
;;----------------------------------------------------------------------

(parser-compile test-parser
  (token whitespace "[[:blank:]]+")
  (token word "[[:alpha:]]+"))

(parser-compile test-parser
  (and indented (token whitespace "[[:blank:]]+") (token word "[[:alpha:]]+"))
  (and inverted word whitespace))

fooo   bar

;;----------------------------------------------------------------------
;; experimental
;;----------------------------------------------------------------------

now create a spiffy function that walks the AST tree for you.

something like start/indented/whitespace
or start/indented

this should lay the grounds for verifying wether the AST is generated as expected,
with parser-walk defining a canonical traversal.

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

[-- Attachment #2: Type: text/plain, Size: 152 bytes --]

_______________________________________________
help-gnu-emacs mailing list
help-gnu-emacs@gnu.org
http://lists.gnu.org/mailman/listinfo/help-gnu-emacs

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2008-01-14  4:23 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-01-08 10:34 early preview of a recursive descent parser compiler implemented as a macro Mike Mattie
2008-01-11 23:07 ` Mike Mattie
2008-01-14  4:23   ` Mike Mattie

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).