unofficial mirror of help-gnu-emacs@gnu.org
 help / color / mirror / Atom feed
From: Mike Mattie <codermattie@gmail.com>
To: emacs users ml <help-gnu-emacs@gnu.org>
Subject: Re: early preview of a recursive descent parser compiler implemented as a macro
Date: Fri, 11 Jan 2008 15:07:38 -0800	[thread overview]
Message-ID: <20080111150738.773df4e4@reforged> (raw)
In-Reply-To: <20080108023436.714c4289@reforged>


[-- Attachment #1.1.1: Type: text/plain, Size: 21272 bytes --]

On Tue, 8 Jan 2008 02:34:36 -0800
Mike Mattie <codermattie@gmail.com> wrote:

> Hello,
> 
> I recently found myself needing a top down parser, so I wrote up a
> parser compiler as a macro. It is turning out quite nicely for my
> purposes so I would like to submit it for comments and suggestions. I
> am still working out a couple of kinks, but it is at the point where
> I can start integrating feedback.
> 
> The goal is a lightweight parser for simple jobs such as scraping a
> data structure out of essentially static buffers, which is the
> problem I was solving when I developed it.
> 
> I am in-lining the actual parser compiler, and attaching a testing
> file so people can play with it if they find it interesting.
> 
> Status:
> 
> It's my first medium sized macro so it may have some thinko's or
> oddities. criticism welcome.
> 
> Simple tests work. The backtracking hasn't been extensively tested
> yet. The and form doesn't seem to return the full AST it should.
> Other than that all the basic bits are in place.

I reworked the runtime functions a bit and the and operator works correctly
now. test-parser.el has some minor changes so I have included the latest
version.

> The catch for semantic-error throws is not in place yet. I need
> something like (condition-case) however that construct seemed a
> little over-powered for my purpose. Suggestions on handling multiple
> error conditions in a single form would be appreciated.

My next priority is completing the error handling.

> TODO:
> 
> 1. add ? for optional productions. Not a big deal to implement once
> the basic implementation is sound.

> 2. Full Documentation

3. add a (define ) list so that tokens and productions can be defined
   without using them in place.

> Review Questions:
> 
> Do the Emacs developers consider this tool useful ? Does it fill a
> niche in Emacs ? If so I will document it.
> 
> Is it elegant ?
> 
> Is the terminology correct ? I am not entirely sure I used curry in
> the strict sense.
> 
> Are there any essential features missing ? keeping in mind the goal
> is simplicity, not a fully optimized parser.
> 
> Is it interesting in a creative sense, or have I just ground out
> another lisp right of passage ?
> 
> Examples:
> 
> A couple of contrived examples I have used for testing. A couple of
> things to note. Keywords are token|and|or. token defines a regex
> match, while or and and construct terminals. The definition of a
> token or terminal creates a match object at the point of definition.
> 
> There is a built-in start symbol that is a or-terminal, so the first
> example will match to the start symbol either token. 
> 
> (parser-compile test-parser
>   (token whitespace "[[:blank:]]+")
>   (token word "[[:alpha:]]+"))
> 
> The parser is compiled to the function value of test-parser. You can
> call it like this: (test-parser (point))
> 
> It parses a single start production. Looping is done outside of the
> parser itself. 
> 
> as the second example shows you can reference previously created
> tokens.
> 
> (parser-compile test-parser
>   (and indented (token whitespace "[[:blank:]]+") (token word
> "[[:alpha:]]+")) (and inverted word whitespace))
> 
> The attached file test-parser.el has some test drivers for playing
> with the parser interactively. Some of the features for defining
> tokens such as defining your own actions for tokens are shown there.
> 
> P.S, here is a fun little lisp comic in the spirit of finding
> amusement in your work: http://xkcd.com/297/
> 
> Without further ado here is parser.el , and enjoy !
>

Here is the second posted version with the changes indicated. If you need
a version number that is not ambiguous this posting is version 636. 

;;----------------------------------------------------------------------
;; parser.el
;; Primary Author: Mike Mattie (codermattie@gmail.com)
;; Copyright: Mike Mattie (2008)
;;----------------------------------------------------------------------

;; ->Summary

;; parser.el aims to be a lightweight implementation of parsing.

;; My requirements involve building/extracting a data structure from
;; analysis of a given text. For this purpose a "top-down" parse
;; makes it easier to assemble a data structure. [TODO: why is it easier
;; this way ?]

;; ->Related Works

;; Emacs regex tables.
;;
;; Much of the parsing facilities that I have seen so far in Emacs
;; have been various forms of regex tables.

;; There are many reasons why Emacs parsing tools would evolve away
;; from the usual parser generator tool-set.

;; 1. The requirement to work incrementally: parse errors are not
;;    errors, the user just has not finished typing :)

;; 2. Emacs analysis is often partial by requirement, usually only
;;    significant parts of the text need to be identified for
;;    features.

;; 3. Emacs can leverage the annotation facilities of
;;    text properties and overlays making analysis extremely
;;    modular and robust while editing.

;; CEDET.
;;
;; CEDET appears to be a parser generator implemented on-top of a
;; CLOS emulation. The fact that this sort of design is not "lispy"
;; does not negate it's value, but limits it IMHO. This tool-set
;; has also "grown around" the problem of analyzing source code as
;; a project which is bloat for simpler requirements.

;; ->Concept

;; A parser compiler that combines lexical (regex) analysis with LL
;; parsing compiled by recursive macro expansion.

;; The concept started out as an idea to build a parser as nested cond
;; forms. The matches would go in the predicate part of the clause
;; while the action part would go in the body.

;; This idea is expressed here with cond forms mutated into lambdas so
;; that previously defined matches can be referenced by productions.

;; -> Requirements

;; 1. Easy to use for trivial problems.

;; 2. Data structures are orthogonal to the parser and not hard-wired
;;    in. Leafs and nodes of the AST are constructed with beginning
;;    and ending positions of the text in the buffer.
;;
;;    This allows the user to choose between positions, markers, overlays
;;    according to the requirements.

;; -> Naming Conventions

;; parser-*          internal functions
;; parser-build-*    Construct the AST structures returned by the parser.
;; parser-interp-*   expand definitions of terminals and non-terminals into lisp
;; parser-compile-*  compile the expanded definitions into interchangeable match functions.

;; -> Key data structures

;; Matches

;; matches are a cons pair of ( parser . AST ).

;; Parser is used internally to indicate match success or failure. It
;; should be the distance matched from the current parser position.
;; It can, and often will be zero for a successful match. nil indicates
;; a failure to match.

;; AST is the data returned from the match ( production . ( start . end ) | list )
;; the car of the pair (production) is the production's symbol.

;;----------------------------------------------------------------------
;; diagnostics
;;----------------------------------------------------------------------

(defun parser-expr-diagnostic ( form ) ;; tested
  (format "type(%s) %s" (symbol-name (type-of form)) (pp (eval form))))

(defmacro parser-diagnostic ( form from expected ) ;; tested
  "syntax: (parser-diagnostic form from expected)

   Where form is the expr received, from is the component issuing the diagnostic,
   and expected is a message describing what the component expected"
  `(concat (format "[%s] expected: " ,from)  ,expected " not: " ,(parser-expr-diagnostic form)))

;;----------------------------------------------------------------------
;; AST constructors
;;----------------------------------------------------------------------

(defun parser-build-token ( identifier ) ;; tested
  "parser-make-token is a built-in constructor that records the analysis
   and the location of the text (id . (begin . end))"
  (cons identifier (cons (match-beginning 0) (match-end 0))))

;;----------------------------------------------------------------------
;; compiler internals
;;----------------------------------------------------------------------

;; make-match and get-match implement a table of productions and
;; tokens. Both production and tokens are interchangeable as far as
;; matching and referencing is concerned. This table gives the ability
;; to reference previously defined matches in other productions.

;; match-table is dynamically scoped by the macro, and does not appear
;; in the compiled form.

(defun parser-make-match ( symbol function ) ;; tested
  "parser-make-match takes ( symbol function ) and returns a symbol
   stored in the parser's match-table with the evaluated lambda
   bound"
  (lexical-let
    ((new-name (symbol-name symbol)))

    (if (intern-soft new-name match-table)
      (throw 'semantic-error (format "illegal redefinition of match %s" new-name)))

    (fset (intern new-name match-table) (eval function))
    (intern new-name match-table)
    ))

(defun parser-make-anon-func ( sexp ) ;; tested
  "bind an un-evaluated anonymous function to an un-interned symbol"
  (let
    ((anon-func (make-symbol "parser-lambda")))
    (fset anon-func (eval sexp))
    anon-func))

(defun parser-get-match ( symbol ) ;; tested
  "return the compiled match for symbol, or throw a semantic-error if it does not
   exist"
  (lexical-let
    ((existing-name (symbol-name symbol)))

    (unless (intern-soft existing-name match-table)
      (throw 'semantic-error (format "unkown match %s" existing-name)))

    (intern existing-name match-table)))

;;----------------------------------------------------------------------
;; compiler run-time
;;----------------------------------------------------------------------

;; these functions are defined independent of the compilation of a parser
;; for simplicity, and to avoid wasteful duplication in the macro expansion.

;; parser-position

;; The position of the parser in the input stream is maintained as a
;; stack of indexes into the buffer. As matching progresses the top of
;; the stack (car) is updated. The push and pop operations copy a
;; position to a next or previous stack position. backtrack discards
;; the top of the stack.

(defun parser-pos () ;; tested
  "Return the current position of the parser in the buffer"
  (car parser-position))

(defun parser-push ()
  "Copy the parser position to a new stack level so the parser can backtrack when necessary."
  (setq parser-position (cons (car parser-position) parser-position)))

(defun parser-pop ()
  "Copy the parser position to a previous stack level When the possibility of a backtrack
   has been eliminated by matching."
  (let
    ((current (parser-pos)))
    (setq parser-position (cons (car parser-position) (cddr parser-position)))
    ))

(defun parser-backtrack ()
  "Restore the previous parser position by discarding the top of the parser-position stack."
  (setq parser-position (cdr parser-position))
  (goto-char (parser-pos)))

(defun parser-advance ( distance )
  "Add distance to the parsing position on the current stack level.
   The advancing of the parser is done relative to the current
   position so that a successful match can advance 0 characters.
   This is important for optional matches such as the \"?\" meta-symbol.
   They want to indicate a successful match without moving the parser position."

  ;; we are going to get zero's as a unfortunate side-effect of
  ;; keeping the nesting of combination operators simple. Avoid NOP
  ;; work with a quick test.

  (if (> distance 0)
    (progn
      (setq parser-position (cons (+ distance (car parser-position)) (cdr parser-position)))
      (goto-char (parser-pos)))
    ))

(defun parser-next ()
  "Compute the next parser position from the length of the entire regex's current match, plus one."
  (+ 1 (- (match-end 0) (match-beginning 0))))

;; parser combination operators

;; the combination operators group matches by and,or operators as the
;; underlying mechanics of non-terminal productions. As these
;; functions are not compiled into the generated parser they have no
;; knowledge of the symbol associated with the match list, only the
;; match list itself.

(defun parser-or ( &rest match-list )
  "Combine match objects by or where the first successful match is returned.
   nil is returned if no matches are found"
  (catch 'match
    (dolist (match match-list)
      (lexical-let
        ((match-result (funcall match)))
        (if match-result
          (progn
            (parser-advance (car match-result))
            (throw 'match (cons 0 (cdr match-result)))))
        ))
    (throw 'match nil) ;; this is what failure looks like :)
    ))

(defun parser-and ( &rest match-list )
  "combine the matches with and. all of the match objects must return non-nil
   in the parser part or the parser will backtrack and return nil."
  (parser-push)

  (lexical-let
    ((ast (catch 'backtrack
            ;; we want to gather all the matches, so mapcar across the match objects.
            (mapcar
              (lambda (match)
                (lexical-let
                  ((match-result (funcall match)))

                  (if match-result
                    (progn
                      ;; on match advance the parser and return the AST
                      (parser-advance (car match-result))
                      (cdr match-result))

                    ;; on fail backtrack and return nil
                    (throw 'backtrack nil))
                  ))
              match-list)
            )))
   (if ast
     (progn
       (parser-pop)
       (cons 0 ast)) ;; would be nice to filter optional matches here.
     (progn
       (parser-backtrack)
       nil))
   ))

;;----------------------------------------------------------------------
;; syntax interpretation
;;----------------------------------------------------------------------

;; interpretation as expansion of a form into lisp is separated from
;; compilation to make the parser easier to debug and verify by
;; phase.

;; the token interpreter was split into two functions to isolate the
;; flexibility of tokens (user functions or return values for
;; constructing AST) from the hard-wired parser part.

(defun parser-interp-token-action ( identifier constructor ) ;; tested
  "Translate the AST constructor definition for a token into Elisp."

  (unless (symbolp identifier)
    (throw 'syntax-error (parser-diagnostic identifier
                           "parser token"
                           "identifier: An unbound symbol used as an identifier"
                           )))

  ;; Warning: this code is very touchy, double quotation of AST
  ;; symbols is required. the saving grace is that the symbols don't
  ;; have a variable value bound so it fails noisily when the
  ;; quotation is incorrect.
  (cond
    ((eq nil constructor)    `(parser-build-token (quote ',identifier)))
    ((listp constructor)     `(,(parser-make-anon-func constructor) (match-beginning 0) (match-end 0)))
    ((functionp constructor) `(,constructor (match-beginning 0) (match-end 0)))
    ((symbolp constructor)   `(quote ',constructor))

    ;; all other constructor types are un-handled.
    ((throw 'syntax-error (parser-diagnostic identifier
                           "parser token: identifier" "A symbol"))))
  )

(defun parser-interp-token ( syntax ) ;; tested
  "Translate a token definition into a parser match function."

  (lexical-let
    ((identifier  (car syntax))
     (regex       (cadr syntax))    ;; eval ? many regexs are stored in variables.
     (constructor (caddr syntax)))

    `(lambda ()
       (if (looking-at ,regex)
         (cons (parser-next) ,(parser-interp-token-action identifier constructor))
         nil
         ))
    ))

(defun parser-interp-production ( production )
  "Translate the definition of a production, or a reference to a production
   into a match object symbol."

  (cond
    ((listp production) (parser-compile-definition production))
    ((symbolp production) (parser-get-match production))

    (throw 'syntax-error
      (parser-daignostic production
        "interpret definition"
        "expected a definition as a list, or a symbol as a production/token reference"))
    ))

;;----------------------------------------------------------------------
;; compilation
;;----------------------------------------------------------------------

;; after definitions are interpreted they are evaluated or compiled and
;; stored as match functions.

(defun parser-compile-token ( syntax ) ;; tested
  "compile a token definition into a match object"
  (parser-make-match (car syntax) (parser-interp-token syntax)))

(defun parser-curry-production ( identifier combine-operator &rest grammar )
  "Compile a match object with a combine operator and a match function list.
   I think curry is applicable, but largely it was named curry so I could
   create the parser-compile-production macro."
  (unless (symbolp identifier)
    (parser-diagnostic identifier
      "compile production"
      "match identifier"))

    (parser-make-match identifier
      `(lambda ()
         (lexical-let
           ((result (apply ',combine-operator ',grammar)))
         (if result
           (cons (car result) (cons ',identifier (cdr result)))
           nil)
           ))
      ))

(defmacro parser-compile-production ( combine-function production-list )
  "parser-compile-production simplifies the syntax of interpreting and compiling
   a production. The construct looks hairy because it combines two operations
   with quoting necessitated by apply. This macro mechanizes the tricky part
   to enhance code readability."
  `(apply 'parser-curry-production      ;; make a match function
     (car ,production-list)             ;; the identifier of the production
     ',combine-function                 ;; the combine operator
     (mapcar 'parser-interp-production (cdr ,production-list)))) ;; interpret the matching definition.

(defun parser-compile-definition ( term )
  "parser-compile-definition is the recursive heart of the compiler."
  (unless (listp term)
    (throw 'syntax-error (parser-diagnostic term
                           "parser definition"
                           "expected a definition of token|or|and")))

  (lexical-let
    ((keyword (car term))
      (syntax (cdr term)))

    (if (listp keyword)
      (parser-compile-definition keyword)

      (cond
        ((eq keyword 'token) (parser-compile-token syntax))
        ((eq keyword 'or)    (parser-compile-production parser-or syntax))
        ((eq keyword 'and)   (parser-compile-production parser-and syntax))

        ((throw 'syntax-error (parser-diagnostic term
                                "parser definition"
                                "definition keyword token|or|and")))
        ))
    ))

(defvar parser-mtable-init-size 13
  "initial size of the match-table objarray for storing match functions. the value
   was chosen based on the recommendation of prime numbers for good hashing.")

(defmacro parser-compile ( parser &rest definition )
  "compile a LL parser from the given grammar."
  (let
    ;; create a symbol table to store compiled terminal and
    ;; non-terminal match functions
    ((match-table (make-vector parser-mtable-init-size 0)))

    (lexical-let
      ((compiled (catch 'syntax-error
                   `(lambda ( start-pos )
                      (let
                        ((parser-position (cons start-pos nil))) ;; initialize the backtrack stack
                        (save-excursion
                          (goto-char start-pos)
                          ;; note that the start symbol of the grammar is built in as an or combination
                          ;; of the top-level definitions.
                          (lexical-let
                            ((parse (,(apply 'parser-curry-production 'start 'parser-or
                                        (mapcar 'parser-compile-definition definition))
                                      )))
                            (if parse
                              ;; if we have a production return the position at which the
                              ;; parser stopped along with the AST.
                              (cons (car parser-position) (cdr parse))
                              nil))
                          )))
                   )))
      (if (stringp compiled)
        ;; error path, print a message and return nil.
        (progn
          (message "parser-compile failed! %s" compiled)
          nil)

        ;; success path, bind the compiled parser to the parser symbol and return t.
        (progn
          (fset parser (eval compiled))
          t))
      )))

(provide 'parser)

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.1.2: test-parser.el --]
[-- Type: text/x-emacs-lisp; name=test-parser.el, Size: 2623 bytes --]

;;----------------------------------------------------------------------
;; diagnostics
;;----------------------------------------------------------------------

;; test the diagnostics of the parser. I think some sort of eval may be necessary
;; in diagnostics.

;; see what the expansion looks like
(pp (macroexpand
      (parser-diagnostic `(foo bar)
        "test"
        "the right thing")))

;; see what it looks like when handled by message.
(message "%s"
      (parser-diagnostic `(foo bar)
        "test"
        "the right thing"))))

;;----------------------------------------------------------------------
;; token interp phase
;;----------------------------------------------------------------------

(defmacro test-interp-token ( &rest syntax )
  (lexical-let
    ((compile (catch 'syntax-error
                (parser-interp-token (cdr syntax)))))

    (if (stringp compile)
      (message "%s" compile)
      compile)
    ))

(defun run-test-interp-token ( match-function )
  (lexical-let
    ((result (funcall match-function)))
    (message "TI match? %s AST %s" (if (car result) "yes" "no") (pp (cdr result)))
    ))

(defun run-test-token ()
  (interactive)
  (run-test-interp-token (test-interp-token token whitespace "[[:blank:]]+")))

(pp (macroexpand
      (test-interp-token token whitespace "[[:blank:]]")))

;;----------------------------------------------------------------------
;; test tokens
;;----------------------------------------------------------------------

;; eval this, then eval various tests.

(defun run-parser ()
  "run test-parser interactively for testing and debugging."
  (interactive)
  (lexical-let
    ((parse-result (test-parser (point))))

    (message "PROD match? %s"
      (if parse-result
        (format "Yes matched to: %s, AST: %s" (car parse-result) (pp (cdr parse-result))
        "No")
      ))
    ))

(parser-compile test-parser
  (token whitespace "[[:blank:]]+"))

(parser-compile test-parser
  (token whitespace "[[:blank:]]+" (lambda ( start stop ) (message "bar by far"))))

(parser-compile test-parser
  (token whitespace "[[:blank:]]+" bingo))

;;----------------------------------------------------------------------
;; test productions
;;----------------------------------------------------------------------

(parser-compile test-parser
  (token whitespace "[[:blank:]]+")
  (token word "[[:alpha:]]+"))

(parser-compile test-parser
  (and indented (token whitespace "[[:blank:]]+") (token word "[[:alpha:]]+"))
  (and inverted word whitespace))

fooo   bar


[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

[-- Attachment #2: Type: text/plain, Size: 152 bytes --]

_______________________________________________
help-gnu-emacs mailing list
help-gnu-emacs@gnu.org
http://lists.gnu.org/mailman/listinfo/help-gnu-emacs

  reply	other threads:[~2008-01-11 23:07 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-01-08 10:34 early preview of a recursive descent parser compiler implemented as a macro Mike Mattie
2008-01-11 23:07 ` Mike Mattie [this message]
2008-01-14  4:23   ` Mike Mattie

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20080111150738.773df4e4@reforged \
    --to=codermattie@gmail.com \
    --cc=help-gnu-emacs@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).