From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Mike Mattie Newsgroups: gmane.emacs.help Subject: early preview of a recursive descent parser compiler implemented as a macro Date: Tue, 8 Jan 2008 02:34:36 -0800 Message-ID: <20080108023436.714c4289@reforged> NNTP-Posting-Host: lo.gmane.org Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============2111957042==" X-Trace: ger.gmane.org 1199788734 16090 80.91.229.12 (8 Jan 2008 10:38:54 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Tue, 8 Jan 2008 10:38:54 +0000 (UTC) To: emacs users ml Original-X-From: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Tue Jan 08 11:39:14 2008 Return-path: Envelope-to: geh-help-gnu-emacs@m.gmane.org Original-Received: from lists.gnu.org ([199.232.76.165]) by lo.gmane.org with esmtp (Exim 4.50) id 1JCBrU-0005ZX-Rt for geh-help-gnu-emacs@m.gmane.org; Tue, 08 Jan 2008 11:39:13 +0100 Original-Received: from localhost ([127.0.0.1] helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1JCBr7-0001iG-IN for geh-help-gnu-emacs@m.gmane.org; Tue, 08 Jan 2008 05:38:49 -0500 Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43) id 1JCBqW-0001cI-SW for help-gnu-emacs@gnu.org; Tue, 08 Jan 2008 05:38:12 -0500 Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43) id 1JCBqU-0001ak-RJ for help-gnu-emacs@gnu.org; Tue, 08 Jan 2008 05:38:12 -0500 Original-Received: from [199.232.76.173] (helo=monty-python.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1JCBqU-0001aJ-2W for help-gnu-emacs@gnu.org; Tue, 08 Jan 2008 05:38:10 -0500 Original-Received: from wa-out-1112.google.com ([209.85.146.176]) by monty-python.gnu.org with esmtp (Exim 4.60) (envelope-from ) id 1JCBqT-0003lM-CQ for help-gnu-emacs@gnu.org; Tue, 08 Jan 2008 05:38:10 -0500 Original-Received: by wa-out-1112.google.com with SMTP id k34so13163999wah.10 for ; Tue, 08 Jan 2008 02:38:07 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:date:from:to:subject:message-id:x-mailer:mime-version:content-type; bh=RLY5z7DgNeyjeXhY39u2UGzkgpsTtVHxgzIqWCKplps=; b=hYL9i6rjItSAZvPy2hs0dydDl6wV9WYfjqmUo7dJXGiQuz6XFfwEmnsyDknCf7XiVTYLxbUOYaDt1FYhlm3u8zq2l6we9bP+NNUzXGf7OA2H9AShuO7Eh4pVMwQ8iMHLxKrCCTu0jnsejF71tVesDyeIRG0vahrvBCUqdC6HwIA= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=date:from:to:subject:message-id:x-mailer:mime-version:content-type; b=BvMBriIDcCI2e0K6NaKdO022qB8N2ab0JMVhe8Z/uNEKkzwosX+W0cc+q+9OUc+gvOt6sb/IXZGfvQGA8XxTrfSG61orYxNOw7DgzUDamJNqPUyzFqenX4HNsFLXhS9hQg7Jg4Fmrrl9O02jEDCX0O3Ho4ELP0sBu9sjIS2GGTk= Original-Received: by 10.115.90.1 with SMTP id s1mr3976440wal.41.1199788687815; Tue, 08 Jan 2008 02:38:07 -0800 (PST) Original-Received: from reforged ( [71.217.198.62]) by mx.google.com with ESMTPS id j28sm36788413waf.36.2008.01.08.02.38.06 (version=SSLv3 cipher=OTHER); Tue, 08 Jan 2008 02:38:07 -0800 (PST) X-Mailer: Claws Mail 3.0.2 (GTK+ 2.12.1; i686-pc-linux-gnu) X-detected-kernel: by monty-python.gnu.org: Linux 2.6 (newer, 2) X-BeenThere: help-gnu-emacs@gnu.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Users list for the GNU Emacs text editor List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Original-Sender: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Errors-To: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane.org@gnu.org Xref: news.gmane.org gmane.emacs.help:50586 Archived-At: --===============2111957042== Content-Type: multipart/signed; boundary="Sig_/4_9SDgemM+Y2sPQ46FO.OBc"; protocol="application/pgp-signature"; micalg=PGP-SHA1 --Sig_/4_9SDgemM+Y2sPQ46FO.OBc Content-Type: multipart/mixed; boundary="MP_/qFLIAZ=Tpt0VhHzt3V0QdvC" --MP_/qFLIAZ=Tpt0VhHzt3V0QdvC Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: quoted-printable Content-Disposition: inline Hello, I recently found myself needing a top down parser, so I wrote up a parser c= ompiler as a macro. It is turning out quite nicely for my purposes so I would like = to submit it=20 for comments and suggestions. I am still working out a couple of kinks, but= it is at the point where I can start integrating feedback. The goal is a lightweight parser for simple jobs such as scraping a data st= ructure out of essentially static buffers, which is the problem I was solving when = I developed it. I am in-lining the actual parser compiler, and attaching a testing file so = people can play with it if they find it interesting. Status: It's my first medium sized macro so it may have some thinko's or oddities. = criticism welcome. Simple tests work. The backtracking hasn't been extensively tested yet. The= and form doesn't seem to return the full AST it should. Other than that all the basi= c bits are in place. The catch for semantic-error throws is not in place yet. I need something l= ike (condition-case) however that construct seemed a little over-powered for my purpose. Suggestions on handling multiple error conditions in a single form would be appreciated. TODO: 1. add ? for optional productions. Not a big deal to implement once the bas= ic implementation is sound. 2. Full Documentation Review Questions: Do the Emacs developers consider this tool useful ? Does it fill a niche in Emacs ? If so I will document it. Is it elegant ? Is the terminology correct ? I am not entirely sure I used curry in the str= ict sense. Are there any essential features missing ? keeping in mind the goal is simp= licity, not a fully optimized parser. Is it interesting in a creative sense, or have I just ground out another lisp right of passage ? Examples: A couple of contrived examples I have used for testing. A couple of things = to note. Keywords are token|and|or. token defines a regex match, while or and and co= nstruct terminals. The definition of a token or terminal creates a match object at = the point of definition. There is a built-in start symbol that is a or-terminal, so the first example will match to the start symbol either token.=20 (parser-compile test-parser (token whitespace "[[:blank:]]+") (token word "[[:alpha:]]+")) The parser is compiled to the function value of test-parser. You can call it like this: (test-parser (point)) It parses a single start production. Looping is done outside of the parser itself.=20 as the second example shows you can reference previously created tokens. (parser-compile test-parser (and indented (token whitespace "[[:blank:]]+") (token word "[[:alpha:]]+= ")) (and inverted word whitespace)) The attached file test-parser.el has some test drivers for playing with the= parser interactively. Some of the features for defining tokens such as defining yo= ur own actions for tokens are shown there. P.S, here is a fun little lisp comic in the spirit of finding amusement in = your work: http://xkcd.com/297/ Without further ado here is parser.el , and enjoy ! ;;---------------------------------------------------------------------- ;; parser.el ;; Primary Author: Mike Mattie (codermattie@gmail.com) ;; Copyright: Mike Mattie (2008) ;;---------------------------------------------------------------------- ;; ->Summary ;; parser.el aims to be a lightweight implementation of parsing. ;; My requirements involve building/extracting a data structure from ;; analysis of a given text. For this purpose a "top-down" parse ;; makes it easier to assemble a data structure. [TODO: why is it easier ;; this way ?] ;; ->Related Works ;; Much of the parsing facilities that I have seen so far in Emacs ;; have been various forms of regex tables. ;; There are many reasons why Emacs parsing tools would evolve away ;; from the usual parser generator tool-set. ;; 1. The requirement to work incrementally: parse errors are not ;; errors, the user just has not finished typing :) ;; 2. Emacs analysis is often partial by requirement, usually only ;; significant parts of the text need to be identified for ;; features. ;; 3. Emacs can leverage the annotation facilities of ;; text properties and overlays making analysis extremely ;; modular and robust while editing. ;; CEDET. ;; ;; CEDET appears to be a parser generator implemented on-top of a ;; CLOS emulation. The fact that this sort of design is not "lispy" ;; does not negate it's value, but limits it IMHO. This tool-set ;; has also "grown around" the problem of analyzing source code as ;; a project which is bloat for simpler requirements. ;; ->Concept ;; A parser compiler that combines lexical (regex) analysis with LL ;; parsing compiled by recursive macro expansion. ;; The concept started out as an idea to build a parser as nested cond ;; forms. The matches would go in the predicate part of the clause ;; while the action part would go in the body. ;; This idea is expressed here with cond forms mutated into lambdas so ;; that previously defined matches can be referenced by productions. ;; -> Requirements ;; 1. Easy to use for trivial problems. ;; 2. Data structures are orthogonal to the parser and not hard-wired ;; in. Leafs and nodes of the AST are constructed with beginning ;; and ending positions of the text in the buffer. ;; ;; This allows the user to choose between positions, markers, overlays ;; according to the requirements. ;; -> Naming Conventions ;; parser-* internal functions ;; parser-build-* Construct the AST structures returned by the parser. ;; parser-interp-* expand definitions of terminals and non-terminals into= lisp ;; parser-compile-* compile the expanded definitions into interchangeable = match functions. ;; -> Key data structures ;; Matches ;; matches are a cons pair of ( parser . AST ). ;; Parser is used internally to indicate match success or failure. It ;; should be the distance matched from the current parser position. ;; It can, and often will be zero for a successful match. nil indicates ;; a failure to match. ;; AST is the data returned from the match ( production . ( start . end ) |= list ) ;; the car of the pair (production) is the production's symbol. ;;---------------------------------------------------------------------- ;; diagnostics ;;---------------------------------------------------------------------- (defun parser-expr-diagnostic ( form ) ;; tested (format "type(%s) %s" (symbol-name (type-of form)) (pp (eval form)))) (defmacro parser-diagnostic ( form from expected ) ;; tested "syntax: (parser-diagnostic form from expected) Where form is the expr received, from is the component issuing the diagn= ostic, and expected is a message describing what the component expected" `(concat (format "[%s] expected: " ,from) ,expected " not: " ,(parser-ex= pr-diagnostic form))) ;;---------------------------------------------------------------------- ;; AST constructors ;;---------------------------------------------------------------------- (defun parser-build-token ( identifier ) ;; tested "parser-make-token is a built-in constructor that records the analysis and the location of the text (id . (begin . end))" (cons identifier (cons (match-beginning 0) (match-end 0)))) ;;---------------------------------------------------------------------- ;; compiler internals ;;---------------------------------------------------------------------- ;; make-match and get-match implement a table of productions and ;; tokens. Both production and tokens are interchangeable as far as ;; matching and referencing is concerned. This table gives the ability ;; to reference previously defined matches in other productions. ;; match-table is dynamically scoped by the macro, and does not appear ;; in the compiled form. (defun parser-make-match ( symbol function ) ;; tested "parser-make-match takes ( symbol function ) and returns a symbol stored in the parser's match-table with the evaluated lambda bound" (lexical-let ((new-name (symbol-name symbol))) (if (intern-soft new-name match-table) (throw 'semantic-error (format "illegal redefinition of match %s" new= -name))) (fset (intern new-name match-table) (eval function)) (intern new-name match-table) )) (defun parser-make-anon-func ( sexp ) ;; tested "bind an un-evaluated anonymous function to an un-interned symbol" (let ((anon-func (make-symbol "parser-lambda"))) (fset anon-func (eval sexp)) anon-func)) (defun parser-get-match ( symbol ) ;; tested "return the compiled match for symbol, or throw a semantic-error if it do= es not exist" (lexical-let ((existing-name (symbol-name symbol))) (unless (intern-soft existing-name match-table) (throw 'semantic-error (format "unkown match %s" existing-name))) (intern existing-name match-table))) ;;---------------------------------------------------------------------- ;; compiler run-time ;;---------------------------------------------------------------------- ;; these functions are defined independent of the compilation of a parser ;; for simplicity, and to avoid wasteful duplication in the macro expansion. ;; parser-position ;; The position of the parser in the input stream is maintained as a ;; stack of indexes into the buffer. As matching progresses the top of ;; the stack (car) is updated. The push and pop operations copy a ;; position to a next or previous stack position. backtrack discards ;; the top of the stack. (defun parser-pos () ;; tested "Return the current position of the parser in the buffer" (car parser-position)) (defun parser-push () "Copy the parser position to a new stack level so the parser can backtrac= k when necessary." (setq parser-position (cons (car parser-position) parser-position))) (defun parser-pop () "Copy the parser position to a previous stack level When the possibility = of a backtrack has been eliminated by matching." (let ((current (parser-pos))) (setq parser-position (cons (car parser-position) (cddr parser-position= ))) )) (defun parser-backtrack () "Restore the previous parser position by discarding the top of the parser= -position stack." (setq parser-position (cdr parser-position)) (goto-char (parser-pos))) (defun parser-advance ( distance ) "Add distance to the parsing position on the current stack level. The advancing of the parser is done relative to the current position so that a successful match can advance 0 characters. This is important for optional matches such as the \"?\" meta-symbol. They want to indicate a successful match without moving the parser posit= ion." ;; we are going to get zero's as a unfortunate side-effect of ;; keeping the nesting of combination operators simple. Avoid NOP ;; work with a quick test. (if (> distance 0) (progn (setq parser-position (cons (+ distance (car parser-position)) (cdr p= arser-position))) (goto-char (parser-pos))) )) (defun parser-next () "Compute the next parser position from the length of the entire regex's c= urrent match, plus one." (+ 1 (- (match-end 0) (match-beginning 0)))) ;; parser combination operators ;; the combination operators group matches by and,or operators as the ;; underlying mechanics of non-terminal productions. As these ;; functions are not compiled into the generated parser they have no ;; knowledge of the symbol associated with the match list, only the ;; match list itself. (defun parser-or ( &rest match-list ) "Combine match objects by or where the first successful match is returned. nil is returned if no matches are found" (catch 'match (dolist (match match-list) (lexical-let ((match-result (funcall match))) (if (car match-result) (progn (parser-advance (car match-result)) (throw 'match (cons 0 (cdr match-result))))) )) (throw 'match (cons nil nil)) ;; this is what failure looks like :) )) (defun parser-and ( &rest match-list ) "combine the matches with and. all of the match objects must return non-n= il in the parser part or the parser will backtrack and return nil." (parser-push) (lexical-let ((ast (catch 'backtrack ;; we want to gather all the matches, so mapcar across the matc= h objects. (mapcar (lambda (match) (lexical-let ((match-result (funcall match))) (if (car match-result) (progn ;; on match advance the parser and return the AST (parser-advance (car match-result)) (cdr match-result)) ;; on fail backtrack and return nil (progn (parser-backtrack) (throw 'backtrack nil))) )) match-list) (parser-pop) ))) (if (car ast) ;; would be nice to run map-filter-nil on ast. (cons 0 ast) (cons nil nil)) )) ;;---------------------------------------------------------------------- ;; syntax interpretation ;;---------------------------------------------------------------------- ;; interpretation as expansion of a form into lisp is separated from ;; compilation to make the parser easier to debug and verify by ;; phase. ;; the token interpreter was split into two functions to isolate the ;; flexibility of tokens (user functions or return values for ;; constructing AST) from the hard-wired parser part. (defun parser-interp-token-action ( identifier constructor ) ;; tested "Translate the AST constructor definition for a token into Elisp." (unless (symbolp identifier) (throw 'syntax-error (parser-diagnostic identifier "parser token" "identifier: An unbound symbol used as an identi= fier" ))) ;; Warning: this code is very touchy, double quotation of AST ;; symbols is required. the saving grace is that the symbols don't ;; have a variable value bound so it fails noisily when the ;; quotation is incorrect. (cond ((eq nil constructor) `(parser-build-token (quote ',identifier))) ((listp constructor) `(,(parser-make-anon-func constructor) (match-= beginning 0) (match-end 0))) ((functionp constructor) `(,constructor (match-beginning 0) (match-end = 0))) ((symbolp constructor) `(quote ',constructor)) ;; all other constructor types are un-handled. ((throw 'syntax-error (parser-diagnostic identifier "parser token: identifier" "A symbol")))) ) (defun parser-interp-token ( syntax ) ;; tested "Translate a token definition into a parser match function." (lexical-let ((identifier (car syntax)) (regex (cadr syntax)) ;; eval ? many regexs are stored in var= iables. (constructor (caddr syntax))) `(lambda () (if (looking-at ,regex) (cons (parser-next) ,(parser-interp-token-action identifier constr= uctor)) (cons nil nil) )) )) (defun parser-interp-production ( production ) "Translate the definition of a production, or a reference to a production into a match object symbol." (cond ((listp production) (parser-compile-definition production)) ((symbolp production) (parser-get-match production)) (throw 'syntax-error (parser-daignostic production "interpret definition" "expected a definition as a list, or a symbol as a production/token= reference")) )) ;;---------------------------------------------------------------------- ;; compilation ;;---------------------------------------------------------------------- ;; after definitions are interpreted they are evaluated or compiled and ;; stored as match functions. (defun parser-compile-token ( syntax ) ;; tested "compile a token definition into a match object" (parser-make-match (car syntax) (parser-interp-token syntax))) (defun parser-curry-production ( identifier combine-operator &rest grammar ) "Compile a match object with a combine operator and a match function list. I think curry is applicable, but largely it was named curry so I could create the parser-compile-production macro." (unless (symbolp identifier) (parser-diagnostic identifier "compile production" "match identifier")) (parser-make-match identifier `(lambda () (lexical-let ((result (apply ',combine-operator ',grammar))) (if (car result) (cons (car result) (cons ',identifier (cdr result))) result) )) )) (defmacro parser-compile-production ( combine-function production-list ) "parser-compile-production simplifies the syntax of interpreting and comp= iling a production. The construct looks hairy because it combines two operatio= ns with quoting necessitated by apply. This macro mechanizes the tricky part to enhance code readability." `(apply 'parser-curry-production ;; make a match function (car ,production-list) ;; the identifier of the production ',combine-function ;; the combine operator (mapcar 'parser-interp-production (cdr ,production-list)))) ;; interpr= et the matching definition. (defun parser-compile-definition ( term ) "parser-compile-definition is the recursive heart of the compiler." (unless (listp term) (throw 'syntax-error (parser-diagnostic term "parser definition" "expected a definition of token|or|and"))) (lexical-let ((keyword (car term)) (syntax (cdr term))) (if (listp keyword) (parser-compile-definition keyword) (cond ((eq keyword 'token) (parser-compile-token syntax)) ((eq keyword 'or) (parser-compile-production parser-or syntax)) ((eq keyword 'and) (parser-compile-production parser-and syntax)) ((throw 'syntax-error (parser-diagnostic term "parser definition" "definition keyword token|or|and"))) )) )) (defvar parser-mtable-init-size 13 "initial size of the match-table objarray for storing match functions. th= e value was chosen based on the recommendation of prime numbers for good hashing= .") (defmacro parser-compile ( parser &rest definition ) "compile a LL parser from the given grammar." (let ;; create a symbol table to store compiled terminal and ;; non-terminal match functions ((match-table (make-vector parser-mtable-init-size 0))) (lexical-let ((compiled (catch 'syntax-error `(lambda ( start-pos ) (let ((parser-position (cons start-pos nil))) ;; initial= ize the backtrack stack (save-excursion (goto-char start-pos) ;; note that the start symbol of the grammar is b= uilt in as an or combination ;; of the top-level definitions. (lexical-let ((parse (,(apply 'parser-curry-production 'star= t 'parser-or (mapcar 'parser-compile-definition = definition)) ))) (if (car parse) ;; if we have a production return the positio= n at which the ;; parser stopped along with the AST. (cons (car parser-position) (cdr parse)) nil)) ))) ))) (if (stringp compiled) ;; error path, print a message and return nil. (progn (message "parser-compile failed! %s" compiled) nil) ;; success path, bind the compiled parser to the parser symbol and = return t. (progn (fset parser (eval compiled)) t)) ))) --MP_/qFLIAZ=Tpt0VhHzt3V0QdvC Content-Type: text/x-emacs-lisp; name=test-parser.el Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename=test-parser.el ;;---------------------------------------------------------------------- ;; diagnostics ;;---------------------------------------------------------------------- ;; test the diagnostics of the parser. I think some sort of eval may be nec= essary ;; in diagnostics. ;; see what the expansion looks like (pp (macroexpand (parser-diagnostic `(foo bar) "test" "the right thing"))) ;; see what it looks like when handled by message. (message "%s" (parser-diagnostic `(foo bar) "test" "the right thing")))) ;;---------------------------------------------------------------------- ;; token interp phase ;;---------------------------------------------------------------------- (defmacro test-interp-token ( &rest syntax ) (lexical-let ((compile (catch 'syntax-error (parser-interp-token (cdr syntax))))) (if (stringp compile) (message "%s" compile) compile) )) (defun run-test-interp-token ( match-function ) (lexical-let ((result (funcall match-function))) (message "TI match? %s AST %s" (if (car result) "yes" "no") (pp (cdr re= sult))) )) (defun run-test-token () (interactive) (run-test-interp-token (test-interp-token token whitespace "[[:blank:]]+"= ))) (pp (macroexpand (test-interp-token token whitespace "[[:blank:]]"))) ;;---------------------------------------------------------------------- ;; test tokens ;;---------------------------------------------------------------------- ;; eval this, then eval various tests. (defun run-parser () "run test-parser interactively for testing and debugging." (interactive) (lexical-let ((parse-result (test-parser (point)))) (message "PROD match? %s AST %s" (if (car parse-result) (format "matched to: %s" (car parse-result)) "no") (pp (cdr parse-result))) )) (parser-compile test-parser (token whitespace "[[:blank:]]+")) (parser-compile test-parser (token whitespace "[[:blank:]]+" (lambda ( start stop ) (message "bar by = far")))) (parser-compile test-parser (token whitespace "[[:blank:]]+" bingo)) ;;---------------------------------------------------------------------- ;; test productions ;;---------------------------------------------------------------------- (parser-compile test-parser (token whitespace "[[:blank:]]+") (token word "[[:alpha:]]+")) (parser-compile test-parser (and indented (token whitespace "[[:blank:]]+") (token word "[[:alpha:]]+= ")) (and inverted word whitespace)) fooo bar --MP_/qFLIAZ=Tpt0VhHzt3V0QdvC-- --Sig_/4_9SDgemM+Y2sPQ46FO.OBc Content-Type: application/pgp-signature; name=signature.asc Content-Disposition: attachment; filename=signature.asc -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) iD8DBQFHg1G8dfRchrkBInkRAsxrAJ9wiGpKPtrgrEHn5J0U3da9V3W1rgCeMJ1S GfpySsnC2HHMvuTs6mMgKi8= =rKKt -----END PGP SIGNATURE----- --Sig_/4_9SDgemM+Y2sPQ46FO.OBc-- --===============2111957042== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ help-gnu-emacs mailing list help-gnu-emacs@gnu.org http://lists.gnu.org/mailman/listinfo/help-gnu-emacs --===============2111957042==--