On Tue, 8 Jan 2008 02:34:36 -0800 Mike Mattie wrote: > Hello, > > I recently found myself needing a top down parser, so I wrote up a > parser compiler as a macro. It is turning out quite nicely for my > purposes so I would like to submit it for comments and suggestions. I > am still working out a couple of kinks, but it is at the point where > I can start integrating feedback. > > The goal is a lightweight parser for simple jobs such as scraping a > data structure out of essentially static buffers, which is the > problem I was solving when I developed it. > > I am in-lining the actual parser compiler, and attaching a testing > file so people can play with it if they find it interesting. > > Status: > > It's my first medium sized macro so it may have some thinko's or > oddities. criticism welcome. > > Simple tests work. The backtracking hasn't been extensively tested > yet. The and form doesn't seem to return the full AST it should. > Other than that all the basic bits are in place. I reworked the runtime functions a bit and the and operator works correctly now. test-parser.el has some minor changes so I have included the latest version. > The catch for semantic-error throws is not in place yet. I need > something like (condition-case) however that construct seemed a > little over-powered for my purpose. Suggestions on handling multiple > error conditions in a single form would be appreciated. My next priority is completing the error handling. > TODO: > > 1. add ? for optional productions. Not a big deal to implement once > the basic implementation is sound. > 2. Full Documentation 3. add a (define ) list so that tokens and productions can be defined without using them in place. > Review Questions: > > Do the Emacs developers consider this tool useful ? Does it fill a > niche in Emacs ? If so I will document it. > > Is it elegant ? > > Is the terminology correct ? I am not entirely sure I used curry in > the strict sense. > > Are there any essential features missing ? keeping in mind the goal > is simplicity, not a fully optimized parser. > > Is it interesting in a creative sense, or have I just ground out > another lisp right of passage ? > > Examples: > > A couple of contrived examples I have used for testing. A couple of > things to note. Keywords are token|and|or. token defines a regex > match, while or and and construct terminals. The definition of a > token or terminal creates a match object at the point of definition. > > There is a built-in start symbol that is a or-terminal, so the first > example will match to the start symbol either token. > > (parser-compile test-parser > (token whitespace "[[:blank:]]+") > (token word "[[:alpha:]]+")) > > The parser is compiled to the function value of test-parser. You can > call it like this: (test-parser (point)) > > It parses a single start production. Looping is done outside of the > parser itself. > > as the second example shows you can reference previously created > tokens. > > (parser-compile test-parser > (and indented (token whitespace "[[:blank:]]+") (token word > "[[:alpha:]]+")) (and inverted word whitespace)) > > The attached file test-parser.el has some test drivers for playing > with the parser interactively. Some of the features for defining > tokens such as defining your own actions for tokens are shown there. > > P.S, here is a fun little lisp comic in the spirit of finding > amusement in your work: http://xkcd.com/297/ > > Without further ado here is parser.el , and enjoy ! > Here is the second posted version with the changes indicated. If you need a version number that is not ambiguous this posting is version 636. ;;---------------------------------------------------------------------- ;; parser.el ;; Primary Author: Mike Mattie (codermattie@gmail.com) ;; Copyright: Mike Mattie (2008) ;;---------------------------------------------------------------------- ;; ->Summary ;; parser.el aims to be a lightweight implementation of parsing. ;; My requirements involve building/extracting a data structure from ;; analysis of a given text. For this purpose a "top-down" parse ;; makes it easier to assemble a data structure. [TODO: why is it easier ;; this way ?] ;; ->Related Works ;; Emacs regex tables. ;; ;; Much of the parsing facilities that I have seen so far in Emacs ;; have been various forms of regex tables. ;; There are many reasons why Emacs parsing tools would evolve away ;; from the usual parser generator tool-set. ;; 1. The requirement to work incrementally: parse errors are not ;; errors, the user just has not finished typing :) ;; 2. Emacs analysis is often partial by requirement, usually only ;; significant parts of the text need to be identified for ;; features. ;; 3. Emacs can leverage the annotation facilities of ;; text properties and overlays making analysis extremely ;; modular and robust while editing. ;; CEDET. ;; ;; CEDET appears to be a parser generator implemented on-top of a ;; CLOS emulation. The fact that this sort of design is not "lispy" ;; does not negate it's value, but limits it IMHO. This tool-set ;; has also "grown around" the problem of analyzing source code as ;; a project which is bloat for simpler requirements. ;; ->Concept ;; A parser compiler that combines lexical (regex) analysis with LL ;; parsing compiled by recursive macro expansion. ;; The concept started out as an idea to build a parser as nested cond ;; forms. The matches would go in the predicate part of the clause ;; while the action part would go in the body. ;; This idea is expressed here with cond forms mutated into lambdas so ;; that previously defined matches can be referenced by productions. ;; -> Requirements ;; 1. Easy to use for trivial problems. ;; 2. Data structures are orthogonal to the parser and not hard-wired ;; in. Leafs and nodes of the AST are constructed with beginning ;; and ending positions of the text in the buffer. ;; ;; This allows the user to choose between positions, markers, overlays ;; according to the requirements. ;; -> Naming Conventions ;; parser-* internal functions ;; parser-build-* Construct the AST structures returned by the parser. ;; parser-interp-* expand definitions of terminals and non-terminals into lisp ;; parser-compile-* compile the expanded definitions into interchangeable match functions. ;; -> Key data structures ;; Matches ;; matches are a cons pair of ( parser . AST ). ;; Parser is used internally to indicate match success or failure. It ;; should be the distance matched from the current parser position. ;; It can, and often will be zero for a successful match. nil indicates ;; a failure to match. ;; AST is the data returned from the match ( production . ( start . end ) | list ) ;; the car of the pair (production) is the production's symbol. ;;---------------------------------------------------------------------- ;; diagnostics ;;---------------------------------------------------------------------- (defun parser-expr-diagnostic ( form ) ;; tested (format "type(%s) %s" (symbol-name (type-of form)) (pp (eval form)))) (defmacro parser-diagnostic ( form from expected ) ;; tested "syntax: (parser-diagnostic form from expected) Where form is the expr received, from is the component issuing the diagnostic, and expected is a message describing what the component expected" `(concat (format "[%s] expected: " ,from) ,expected " not: " ,(parser-expr-diagnostic form))) ;;---------------------------------------------------------------------- ;; AST constructors ;;---------------------------------------------------------------------- (defun parser-build-token ( identifier ) ;; tested "parser-make-token is a built-in constructor that records the analysis and the location of the text (id . (begin . end))" (cons identifier (cons (match-beginning 0) (match-end 0)))) ;;---------------------------------------------------------------------- ;; compiler internals ;;---------------------------------------------------------------------- ;; make-match and get-match implement a table of productions and ;; tokens. Both production and tokens are interchangeable as far as ;; matching and referencing is concerned. This table gives the ability ;; to reference previously defined matches in other productions. ;; match-table is dynamically scoped by the macro, and does not appear ;; in the compiled form. (defun parser-make-match ( symbol function ) ;; tested "parser-make-match takes ( symbol function ) and returns a symbol stored in the parser's match-table with the evaluated lambda bound" (lexical-let ((new-name (symbol-name symbol))) (if (intern-soft new-name match-table) (throw 'semantic-error (format "illegal redefinition of match %s" new-name))) (fset (intern new-name match-table) (eval function)) (intern new-name match-table) )) (defun parser-make-anon-func ( sexp ) ;; tested "bind an un-evaluated anonymous function to an un-interned symbol" (let ((anon-func (make-symbol "parser-lambda"))) (fset anon-func (eval sexp)) anon-func)) (defun parser-get-match ( symbol ) ;; tested "return the compiled match for symbol, or throw a semantic-error if it does not exist" (lexical-let ((existing-name (symbol-name symbol))) (unless (intern-soft existing-name match-table) (throw 'semantic-error (format "unkown match %s" existing-name))) (intern existing-name match-table))) ;;---------------------------------------------------------------------- ;; compiler run-time ;;---------------------------------------------------------------------- ;; these functions are defined independent of the compilation of a parser ;; for simplicity, and to avoid wasteful duplication in the macro expansion. ;; parser-position ;; The position of the parser in the input stream is maintained as a ;; stack of indexes into the buffer. As matching progresses the top of ;; the stack (car) is updated. The push and pop operations copy a ;; position to a next or previous stack position. backtrack discards ;; the top of the stack. (defun parser-pos () ;; tested "Return the current position of the parser in the buffer" (car parser-position)) (defun parser-push () "Copy the parser position to a new stack level so the parser can backtrack when necessary." (setq parser-position (cons (car parser-position) parser-position))) (defun parser-pop () "Copy the parser position to a previous stack level When the possibility of a backtrack has been eliminated by matching." (let ((current (parser-pos))) (setq parser-position (cons (car parser-position) (cddr parser-position))) )) (defun parser-backtrack () "Restore the previous parser position by discarding the top of the parser-position stack." (setq parser-position (cdr parser-position)) (goto-char (parser-pos))) (defun parser-advance ( distance ) "Add distance to the parsing position on the current stack level. The advancing of the parser is done relative to the current position so that a successful match can advance 0 characters. This is important for optional matches such as the \"?\" meta-symbol. They want to indicate a successful match without moving the parser position." ;; we are going to get zero's as a unfortunate side-effect of ;; keeping the nesting of combination operators simple. Avoid NOP ;; work with a quick test. (if (> distance 0) (progn (setq parser-position (cons (+ distance (car parser-position)) (cdr parser-position))) (goto-char (parser-pos))) )) (defun parser-next () "Compute the next parser position from the length of the entire regex's current match, plus one." (+ 1 (- (match-end 0) (match-beginning 0)))) ;; parser combination operators ;; the combination operators group matches by and,or operators as the ;; underlying mechanics of non-terminal productions. As these ;; functions are not compiled into the generated parser they have no ;; knowledge of the symbol associated with the match list, only the ;; match list itself. (defun parser-or ( &rest match-list ) "Combine match objects by or where the first successful match is returned. nil is returned if no matches are found" (catch 'match (dolist (match match-list) (lexical-let ((match-result (funcall match))) (if match-result (progn (parser-advance (car match-result)) (throw 'match (cons 0 (cdr match-result))))) )) (throw 'match nil) ;; this is what failure looks like :) )) (defun parser-and ( &rest match-list ) "combine the matches with and. all of the match objects must return non-nil in the parser part or the parser will backtrack and return nil." (parser-push) (lexical-let ((ast (catch 'backtrack ;; we want to gather all the matches, so mapcar across the match objects. (mapcar (lambda (match) (lexical-let ((match-result (funcall match))) (if match-result (progn ;; on match advance the parser and return the AST (parser-advance (car match-result)) (cdr match-result)) ;; on fail backtrack and return nil (throw 'backtrack nil)) )) match-list) ))) (if ast (progn (parser-pop) (cons 0 ast)) ;; would be nice to filter optional matches here. (progn (parser-backtrack) nil)) )) ;;---------------------------------------------------------------------- ;; syntax interpretation ;;---------------------------------------------------------------------- ;; interpretation as expansion of a form into lisp is separated from ;; compilation to make the parser easier to debug and verify by ;; phase. ;; the token interpreter was split into two functions to isolate the ;; flexibility of tokens (user functions or return values for ;; constructing AST) from the hard-wired parser part. (defun parser-interp-token-action ( identifier constructor ) ;; tested "Translate the AST constructor definition for a token into Elisp." (unless (symbolp identifier) (throw 'syntax-error (parser-diagnostic identifier "parser token" "identifier: An unbound symbol used as an identifier" ))) ;; Warning: this code is very touchy, double quotation of AST ;; symbols is required. the saving grace is that the symbols don't ;; have a variable value bound so it fails noisily when the ;; quotation is incorrect. (cond ((eq nil constructor) `(parser-build-token (quote ',identifier))) ((listp constructor) `(,(parser-make-anon-func constructor) (match-beginning 0) (match-end 0))) ((functionp constructor) `(,constructor (match-beginning 0) (match-end 0))) ((symbolp constructor) `(quote ',constructor)) ;; all other constructor types are un-handled. ((throw 'syntax-error (parser-diagnostic identifier "parser token: identifier" "A symbol")))) ) (defun parser-interp-token ( syntax ) ;; tested "Translate a token definition into a parser match function." (lexical-let ((identifier (car syntax)) (regex (cadr syntax)) ;; eval ? many regexs are stored in variables. (constructor (caddr syntax))) `(lambda () (if (looking-at ,regex) (cons (parser-next) ,(parser-interp-token-action identifier constructor)) nil )) )) (defun parser-interp-production ( production ) "Translate the definition of a production, or a reference to a production into a match object symbol." (cond ((listp production) (parser-compile-definition production)) ((symbolp production) (parser-get-match production)) (throw 'syntax-error (parser-daignostic production "interpret definition" "expected a definition as a list, or a symbol as a production/token reference")) )) ;;---------------------------------------------------------------------- ;; compilation ;;---------------------------------------------------------------------- ;; after definitions are interpreted they are evaluated or compiled and ;; stored as match functions. (defun parser-compile-token ( syntax ) ;; tested "compile a token definition into a match object" (parser-make-match (car syntax) (parser-interp-token syntax))) (defun parser-curry-production ( identifier combine-operator &rest grammar ) "Compile a match object with a combine operator and a match function list. I think curry is applicable, but largely it was named curry so I could create the parser-compile-production macro." (unless (symbolp identifier) (parser-diagnostic identifier "compile production" "match identifier")) (parser-make-match identifier `(lambda () (lexical-let ((result (apply ',combine-operator ',grammar))) (if result (cons (car result) (cons ',identifier (cdr result))) nil) )) )) (defmacro parser-compile-production ( combine-function production-list ) "parser-compile-production simplifies the syntax of interpreting and compiling a production. The construct looks hairy because it combines two operations with quoting necessitated by apply. This macro mechanizes the tricky part to enhance code readability." `(apply 'parser-curry-production ;; make a match function (car ,production-list) ;; the identifier of the production ',combine-function ;; the combine operator (mapcar 'parser-interp-production (cdr ,production-list)))) ;; interpret the matching definition. (defun parser-compile-definition ( term ) "parser-compile-definition is the recursive heart of the compiler." (unless (listp term) (throw 'syntax-error (parser-diagnostic term "parser definition" "expected a definition of token|or|and"))) (lexical-let ((keyword (car term)) (syntax (cdr term))) (if (listp keyword) (parser-compile-definition keyword) (cond ((eq keyword 'token) (parser-compile-token syntax)) ((eq keyword 'or) (parser-compile-production parser-or syntax)) ((eq keyword 'and) (parser-compile-production parser-and syntax)) ((throw 'syntax-error (parser-diagnostic term "parser definition" "definition keyword token|or|and"))) )) )) (defvar parser-mtable-init-size 13 "initial size of the match-table objarray for storing match functions. the value was chosen based on the recommendation of prime numbers for good hashing.") (defmacro parser-compile ( parser &rest definition ) "compile a LL parser from the given grammar." (let ;; create a symbol table to store compiled terminal and ;; non-terminal match functions ((match-table (make-vector parser-mtable-init-size 0))) (lexical-let ((compiled (catch 'syntax-error `(lambda ( start-pos ) (let ((parser-position (cons start-pos nil))) ;; initialize the backtrack stack (save-excursion (goto-char start-pos) ;; note that the start symbol of the grammar is built in as an or combination ;; of the top-level definitions. (lexical-let ((parse (,(apply 'parser-curry-production 'start 'parser-or (mapcar 'parser-compile-definition definition)) ))) (if parse ;; if we have a production return the position at which the ;; parser stopped along with the AST. (cons (car parser-position) (cdr parse)) nil)) ))) ))) (if (stringp compiled) ;; error path, print a message and return nil. (progn (message "parser-compile failed! %s" compiled) nil) ;; success path, bind the compiled parser to the parser symbol and return t. (progn (fset parser (eval compiled)) t)) ))) (provide 'parser)