From: Mike Mattie <codermattie@gmail.com>
To: emacs users ml <help-gnu-emacs@gnu.org>
Subject: Re: early preview of a recursive descent parser compiler implemented as a macro
Date: Fri, 11 Jan 2008 15:07:38 -0800 [thread overview]
Message-ID: <20080111150738.773df4e4@reforged> (raw)
In-Reply-To: <20080108023436.714c4289@reforged>
[-- Attachment #1.1.1: Type: text/plain, Size: 21272 bytes --]
On Tue, 8 Jan 2008 02:34:36 -0800
Mike Mattie <codermattie@gmail.com> wrote:
> Hello,
>
> I recently found myself needing a top down parser, so I wrote up a
> parser compiler as a macro. It is turning out quite nicely for my
> purposes so I would like to submit it for comments and suggestions. I
> am still working out a couple of kinks, but it is at the point where
> I can start integrating feedback.
>
> The goal is a lightweight parser for simple jobs such as scraping a
> data structure out of essentially static buffers, which is the
> problem I was solving when I developed it.
>
> I am in-lining the actual parser compiler, and attaching a testing
> file so people can play with it if they find it interesting.
>
> Status:
>
> It's my first medium sized macro so it may have some thinko's or
> oddities. criticism welcome.
>
> Simple tests work. The backtracking hasn't been extensively tested
> yet. The and form doesn't seem to return the full AST it should.
> Other than that all the basic bits are in place.
I reworked the runtime functions a bit and the and operator works correctly
now. test-parser.el has some minor changes so I have included the latest
version.
> The catch for semantic-error throws is not in place yet. I need
> something like (condition-case) however that construct seemed a
> little over-powered for my purpose. Suggestions on handling multiple
> error conditions in a single form would be appreciated.
My next priority is completing the error handling.
> TODO:
>
> 1. add ? for optional productions. Not a big deal to implement once
> the basic implementation is sound.
> 2. Full Documentation
3. add a (define ) list so that tokens and productions can be defined
without using them in place.
> Review Questions:
>
> Do the Emacs developers consider this tool useful ? Does it fill a
> niche in Emacs ? If so I will document it.
>
> Is it elegant ?
>
> Is the terminology correct ? I am not entirely sure I used curry in
> the strict sense.
>
> Are there any essential features missing ? keeping in mind the goal
> is simplicity, not a fully optimized parser.
>
> Is it interesting in a creative sense, or have I just ground out
> another lisp right of passage ?
>
> Examples:
>
> A couple of contrived examples I have used for testing. A couple of
> things to note. Keywords are token|and|or. token defines a regex
> match, while or and and construct terminals. The definition of a
> token or terminal creates a match object at the point of definition.
>
> There is a built-in start symbol that is a or-terminal, so the first
> example will match to the start symbol either token.
>
> (parser-compile test-parser
> (token whitespace "[[:blank:]]+")
> (token word "[[:alpha:]]+"))
>
> The parser is compiled to the function value of test-parser. You can
> call it like this: (test-parser (point))
>
> It parses a single start production. Looping is done outside of the
> parser itself.
>
> as the second example shows you can reference previously created
> tokens.
>
> (parser-compile test-parser
> (and indented (token whitespace "[[:blank:]]+") (token word
> "[[:alpha:]]+")) (and inverted word whitespace))
>
> The attached file test-parser.el has some test drivers for playing
> with the parser interactively. Some of the features for defining
> tokens such as defining your own actions for tokens are shown there.
>
> P.S, here is a fun little lisp comic in the spirit of finding
> amusement in your work: http://xkcd.com/297/
>
> Without further ado here is parser.el , and enjoy !
>
Here is the second posted version with the changes indicated. If you need
a version number that is not ambiguous this posting is version 636.
;;----------------------------------------------------------------------
;; parser.el
;; Primary Author: Mike Mattie (codermattie@gmail.com)
;; Copyright: Mike Mattie (2008)
;;----------------------------------------------------------------------
;; ->Summary
;; parser.el aims to be a lightweight implementation of parsing.
;; My requirements involve building/extracting a data structure from
;; analysis of a given text. For this purpose a "top-down" parse
;; makes it easier to assemble a data structure. [TODO: why is it easier
;; this way ?]
;; ->Related Works
;; Emacs regex tables.
;;
;; Much of the parsing facilities that I have seen so far in Emacs
;; have been various forms of regex tables.
;; There are many reasons why Emacs parsing tools would evolve away
;; from the usual parser generator tool-set.
;; 1. The requirement to work incrementally: parse errors are not
;; errors, the user just has not finished typing :)
;; 2. Emacs analysis is often partial by requirement, usually only
;; significant parts of the text need to be identified for
;; features.
;; 3. Emacs can leverage the annotation facilities of
;; text properties and overlays making analysis extremely
;; modular and robust while editing.
;; CEDET.
;;
;; CEDET appears to be a parser generator implemented on-top of a
;; CLOS emulation. The fact that this sort of design is not "lispy"
;; does not negate it's value, but limits it IMHO. This tool-set
;; has also "grown around" the problem of analyzing source code as
;; a project which is bloat for simpler requirements.
;; ->Concept
;; A parser compiler that combines lexical (regex) analysis with LL
;; parsing compiled by recursive macro expansion.
;; The concept started out as an idea to build a parser as nested cond
;; forms. The matches would go in the predicate part of the clause
;; while the action part would go in the body.
;; This idea is expressed here with cond forms mutated into lambdas so
;; that previously defined matches can be referenced by productions.
;; -> Requirements
;; 1. Easy to use for trivial problems.
;; 2. Data structures are orthogonal to the parser and not hard-wired
;; in. Leafs and nodes of the AST are constructed with beginning
;; and ending positions of the text in the buffer.
;;
;; This allows the user to choose between positions, markers, overlays
;; according to the requirements.
;; -> Naming Conventions
;; parser-* internal functions
;; parser-build-* Construct the AST structures returned by the parser.
;; parser-interp-* expand definitions of terminals and non-terminals into lisp
;; parser-compile-* compile the expanded definitions into interchangeable match functions.
;; -> Key data structures
;; Matches
;; matches are a cons pair of ( parser . AST ).
;; Parser is used internally to indicate match success or failure. It
;; should be the distance matched from the current parser position.
;; It can, and often will be zero for a successful match. nil indicates
;; a failure to match.
;; AST is the data returned from the match ( production . ( start . end ) | list )
;; the car of the pair (production) is the production's symbol.
;;----------------------------------------------------------------------
;; diagnostics
;;----------------------------------------------------------------------
(defun parser-expr-diagnostic ( form ) ;; tested
(format "type(%s) %s" (symbol-name (type-of form)) (pp (eval form))))
(defmacro parser-diagnostic ( form from expected ) ;; tested
"syntax: (parser-diagnostic form from expected)
Where form is the expr received, from is the component issuing the diagnostic,
and expected is a message describing what the component expected"
`(concat (format "[%s] expected: " ,from) ,expected " not: " ,(parser-expr-diagnostic form)))
;;----------------------------------------------------------------------
;; AST constructors
;;----------------------------------------------------------------------
(defun parser-build-token ( identifier ) ;; tested
"parser-make-token is a built-in constructor that records the analysis
and the location of the text (id . (begin . end))"
(cons identifier (cons (match-beginning 0) (match-end 0))))
;;----------------------------------------------------------------------
;; compiler internals
;;----------------------------------------------------------------------
;; make-match and get-match implement a table of productions and
;; tokens. Both production and tokens are interchangeable as far as
;; matching and referencing is concerned. This table gives the ability
;; to reference previously defined matches in other productions.
;; match-table is dynamically scoped by the macro, and does not appear
;; in the compiled form.
(defun parser-make-match ( symbol function ) ;; tested
"parser-make-match takes ( symbol function ) and returns a symbol
stored in the parser's match-table with the evaluated lambda
bound"
(lexical-let
((new-name (symbol-name symbol)))
(if (intern-soft new-name match-table)
(throw 'semantic-error (format "illegal redefinition of match %s" new-name)))
(fset (intern new-name match-table) (eval function))
(intern new-name match-table)
))
(defun parser-make-anon-func ( sexp ) ;; tested
"bind an un-evaluated anonymous function to an un-interned symbol"
(let
((anon-func (make-symbol "parser-lambda")))
(fset anon-func (eval sexp))
anon-func))
(defun parser-get-match ( symbol ) ;; tested
"return the compiled match for symbol, or throw a semantic-error if it does not
exist"
(lexical-let
((existing-name (symbol-name symbol)))
(unless (intern-soft existing-name match-table)
(throw 'semantic-error (format "unkown match %s" existing-name)))
(intern existing-name match-table)))
;;----------------------------------------------------------------------
;; compiler run-time
;;----------------------------------------------------------------------
;; these functions are defined independent of the compilation of a parser
;; for simplicity, and to avoid wasteful duplication in the macro expansion.
;; parser-position
;; The position of the parser in the input stream is maintained as a
;; stack of indexes into the buffer. As matching progresses the top of
;; the stack (car) is updated. The push and pop operations copy a
;; position to a next or previous stack position. backtrack discards
;; the top of the stack.
(defun parser-pos () ;; tested
"Return the current position of the parser in the buffer"
(car parser-position))
(defun parser-push ()
"Copy the parser position to a new stack level so the parser can backtrack when necessary."
(setq parser-position (cons (car parser-position) parser-position)))
(defun parser-pop ()
"Copy the parser position to a previous stack level When the possibility of a backtrack
has been eliminated by matching."
(let
((current (parser-pos)))
(setq parser-position (cons (car parser-position) (cddr parser-position)))
))
(defun parser-backtrack ()
"Restore the previous parser position by discarding the top of the parser-position stack."
(setq parser-position (cdr parser-position))
(goto-char (parser-pos)))
(defun parser-advance ( distance )
"Add distance to the parsing position on the current stack level.
The advancing of the parser is done relative to the current
position so that a successful match can advance 0 characters.
This is important for optional matches such as the \"?\" meta-symbol.
They want to indicate a successful match without moving the parser position."
;; we are going to get zero's as a unfortunate side-effect of
;; keeping the nesting of combination operators simple. Avoid NOP
;; work with a quick test.
(if (> distance 0)
(progn
(setq parser-position (cons (+ distance (car parser-position)) (cdr parser-position)))
(goto-char (parser-pos)))
))
(defun parser-next ()
"Compute the next parser position from the length of the entire regex's current match, plus one."
(+ 1 (- (match-end 0) (match-beginning 0))))
;; parser combination operators
;; the combination operators group matches by and,or operators as the
;; underlying mechanics of non-terminal productions. As these
;; functions are not compiled into the generated parser they have no
;; knowledge of the symbol associated with the match list, only the
;; match list itself.
(defun parser-or ( &rest match-list )
"Combine match objects by or where the first successful match is returned.
nil is returned if no matches are found"
(catch 'match
(dolist (match match-list)
(lexical-let
((match-result (funcall match)))
(if match-result
(progn
(parser-advance (car match-result))
(throw 'match (cons 0 (cdr match-result)))))
))
(throw 'match nil) ;; this is what failure looks like :)
))
(defun parser-and ( &rest match-list )
"combine the matches with and. all of the match objects must return non-nil
in the parser part or the parser will backtrack and return nil."
(parser-push)
(lexical-let
((ast (catch 'backtrack
;; we want to gather all the matches, so mapcar across the match objects.
(mapcar
(lambda (match)
(lexical-let
((match-result (funcall match)))
(if match-result
(progn
;; on match advance the parser and return the AST
(parser-advance (car match-result))
(cdr match-result))
;; on fail backtrack and return nil
(throw 'backtrack nil))
))
match-list)
)))
(if ast
(progn
(parser-pop)
(cons 0 ast)) ;; would be nice to filter optional matches here.
(progn
(parser-backtrack)
nil))
))
;;----------------------------------------------------------------------
;; syntax interpretation
;;----------------------------------------------------------------------
;; interpretation as expansion of a form into lisp is separated from
;; compilation to make the parser easier to debug and verify by
;; phase.
;; the token interpreter was split into two functions to isolate the
;; flexibility of tokens (user functions or return values for
;; constructing AST) from the hard-wired parser part.
(defun parser-interp-token-action ( identifier constructor ) ;; tested
"Translate the AST constructor definition for a token into Elisp."
(unless (symbolp identifier)
(throw 'syntax-error (parser-diagnostic identifier
"parser token"
"identifier: An unbound symbol used as an identifier"
)))
;; Warning: this code is very touchy, double quotation of AST
;; symbols is required. the saving grace is that the symbols don't
;; have a variable value bound so it fails noisily when the
;; quotation is incorrect.
(cond
((eq nil constructor) `(parser-build-token (quote ',identifier)))
((listp constructor) `(,(parser-make-anon-func constructor) (match-beginning 0) (match-end 0)))
((functionp constructor) `(,constructor (match-beginning 0) (match-end 0)))
((symbolp constructor) `(quote ',constructor))
;; all other constructor types are un-handled.
((throw 'syntax-error (parser-diagnostic identifier
"parser token: identifier" "A symbol"))))
)
(defun parser-interp-token ( syntax ) ;; tested
"Translate a token definition into a parser match function."
(lexical-let
((identifier (car syntax))
(regex (cadr syntax)) ;; eval ? many regexs are stored in variables.
(constructor (caddr syntax)))
`(lambda ()
(if (looking-at ,regex)
(cons (parser-next) ,(parser-interp-token-action identifier constructor))
nil
))
))
(defun parser-interp-production ( production )
"Translate the definition of a production, or a reference to a production
into a match object symbol."
(cond
((listp production) (parser-compile-definition production))
((symbolp production) (parser-get-match production))
(throw 'syntax-error
(parser-daignostic production
"interpret definition"
"expected a definition as a list, or a symbol as a production/token reference"))
))
;;----------------------------------------------------------------------
;; compilation
;;----------------------------------------------------------------------
;; after definitions are interpreted they are evaluated or compiled and
;; stored as match functions.
(defun parser-compile-token ( syntax ) ;; tested
"compile a token definition into a match object"
(parser-make-match (car syntax) (parser-interp-token syntax)))
(defun parser-curry-production ( identifier combine-operator &rest grammar )
"Compile a match object with a combine operator and a match function list.
I think curry is applicable, but largely it was named curry so I could
create the parser-compile-production macro."
(unless (symbolp identifier)
(parser-diagnostic identifier
"compile production"
"match identifier"))
(parser-make-match identifier
`(lambda ()
(lexical-let
((result (apply ',combine-operator ',grammar)))
(if result
(cons (car result) (cons ',identifier (cdr result)))
nil)
))
))
(defmacro parser-compile-production ( combine-function production-list )
"parser-compile-production simplifies the syntax of interpreting and compiling
a production. The construct looks hairy because it combines two operations
with quoting necessitated by apply. This macro mechanizes the tricky part
to enhance code readability."
`(apply 'parser-curry-production ;; make a match function
(car ,production-list) ;; the identifier of the production
',combine-function ;; the combine operator
(mapcar 'parser-interp-production (cdr ,production-list)))) ;; interpret the matching definition.
(defun parser-compile-definition ( term )
"parser-compile-definition is the recursive heart of the compiler."
(unless (listp term)
(throw 'syntax-error (parser-diagnostic term
"parser definition"
"expected a definition of token|or|and")))
(lexical-let
((keyword (car term))
(syntax (cdr term)))
(if (listp keyword)
(parser-compile-definition keyword)
(cond
((eq keyword 'token) (parser-compile-token syntax))
((eq keyword 'or) (parser-compile-production parser-or syntax))
((eq keyword 'and) (parser-compile-production parser-and syntax))
((throw 'syntax-error (parser-diagnostic term
"parser definition"
"definition keyword token|or|and")))
))
))
(defvar parser-mtable-init-size 13
"initial size of the match-table objarray for storing match functions. the value
was chosen based on the recommendation of prime numbers for good hashing.")
(defmacro parser-compile ( parser &rest definition )
"compile a LL parser from the given grammar."
(let
;; create a symbol table to store compiled terminal and
;; non-terminal match functions
((match-table (make-vector parser-mtable-init-size 0)))
(lexical-let
((compiled (catch 'syntax-error
`(lambda ( start-pos )
(let
((parser-position (cons start-pos nil))) ;; initialize the backtrack stack
(save-excursion
(goto-char start-pos)
;; note that the start symbol of the grammar is built in as an or combination
;; of the top-level definitions.
(lexical-let
((parse (,(apply 'parser-curry-production 'start 'parser-or
(mapcar 'parser-compile-definition definition))
)))
(if parse
;; if we have a production return the position at which the
;; parser stopped along with the AST.
(cons (car parser-position) (cdr parse))
nil))
)))
)))
(if (stringp compiled)
;; error path, print a message and return nil.
(progn
(message "parser-compile failed! %s" compiled)
nil)
;; success path, bind the compiled parser to the parser symbol and return t.
(progn
(fset parser (eval compiled))
t))
)))
(provide 'parser)
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.1.2: test-parser.el --]
[-- Type: text/x-emacs-lisp; name=test-parser.el, Size: 2623 bytes --]
;;----------------------------------------------------------------------
;; diagnostics
;;----------------------------------------------------------------------
;; test the diagnostics of the parser. I think some sort of eval may be necessary
;; in diagnostics.
;; see what the expansion looks like
(pp (macroexpand
(parser-diagnostic `(foo bar)
"test"
"the right thing")))
;; see what it looks like when handled by message.
(message "%s"
(parser-diagnostic `(foo bar)
"test"
"the right thing"))))
;;----------------------------------------------------------------------
;; token interp phase
;;----------------------------------------------------------------------
(defmacro test-interp-token ( &rest syntax )
(lexical-let
((compile (catch 'syntax-error
(parser-interp-token (cdr syntax)))))
(if (stringp compile)
(message "%s" compile)
compile)
))
(defun run-test-interp-token ( match-function )
(lexical-let
((result (funcall match-function)))
(message "TI match? %s AST %s" (if (car result) "yes" "no") (pp (cdr result)))
))
(defun run-test-token ()
(interactive)
(run-test-interp-token (test-interp-token token whitespace "[[:blank:]]+")))
(pp (macroexpand
(test-interp-token token whitespace "[[:blank:]]")))
;;----------------------------------------------------------------------
;; test tokens
;;----------------------------------------------------------------------
;; eval this, then eval various tests.
(defun run-parser ()
"run test-parser interactively for testing and debugging."
(interactive)
(lexical-let
((parse-result (test-parser (point))))
(message "PROD match? %s"
(if parse-result
(format "Yes matched to: %s, AST: %s" (car parse-result) (pp (cdr parse-result))
"No")
))
))
(parser-compile test-parser
(token whitespace "[[:blank:]]+"))
(parser-compile test-parser
(token whitespace "[[:blank:]]+" (lambda ( start stop ) (message "bar by far"))))
(parser-compile test-parser
(token whitespace "[[:blank:]]+" bingo))
;;----------------------------------------------------------------------
;; test productions
;;----------------------------------------------------------------------
(parser-compile test-parser
(token whitespace "[[:blank:]]+")
(token word "[[:alpha:]]+"))
(parser-compile test-parser
(and indented (token whitespace "[[:blank:]]+") (token word "[[:alpha:]]+"))
(and inverted word whitespace))
fooo bar
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
[-- Attachment #2: Type: text/plain, Size: 152 bytes --]
_______________________________________________
help-gnu-emacs mailing list
help-gnu-emacs@gnu.org
http://lists.gnu.org/mailman/listinfo/help-gnu-emacs
next prev parent reply other threads:[~2008-01-11 23:07 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-01-08 10:34 early preview of a recursive descent parser compiler implemented as a macro Mike Mattie
2008-01-11 23:07 ` Mike Mattie [this message]
2008-01-14 4:23 ` Mike Mattie
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: https://www.gnu.org/software/emacs/
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20080111150738.773df4e4@reforged \
--to=codermattie@gmail.com \
--cc=help-gnu-emacs@gnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).