From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: Michael Heerdegen Newsgroups: gmane.emacs.devel Subject: Re: Make peg.el a built-in library? Date: Tue, 16 Nov 2021 00:16:15 +0100 Message-ID: <87h7cdggcg.fsf@web.de> References: <875yvtbbn3.fsf@ericabrahamsen.net> <87a6jjc7c8.fsf@web.de> <87v926w3bs.fsf@ericabrahamsen.net> <87zgrhh4he.fsf@web.de> <87mtmobw1a.fsf@web.de> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="23698"; mail-complaints-to="usenet@ciao.gmane.io" User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/29.0.50 (gnu/linux) Cc: emacs-devel@gnu.org To: Helmut Eller Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Tue Nov 16 00:18:50 2021 Return-path: Envelope-to: ged-emacs-devel@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1mmlFS-0005zI-67 for ged-emacs-devel@m.gmane-mx.org; Tue, 16 Nov 2021 00:18:50 +0100 Original-Received: from localhost ([::1]:44866 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1mmlFQ-0005eY-9s for ged-emacs-devel@m.gmane-mx.org; Mon, 15 Nov 2021 18:18:48 -0500 Original-Received: from eggs.gnu.org ([209.51.188.92]:34172) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1mmlDJ-0004f7-0x for emacs-devel@gnu.org; Mon, 15 Nov 2021 18:16:37 -0500 Original-Received: from mout.web.de ([212.227.15.3]:45663) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1mmlD5-0007mR-Fq for emacs-devel@gnu.org; Mon, 15 Nov 2021 18:16:36 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=web.de; s=dbaedf251592; t=1637018177; bh=s4dQyCQXAmTQl2OG+QxtwLbk7AEoqV85qOunjS6C6l0=; h=X-UI-Sender-Class:From:To:Cc:Subject:References:Date:In-Reply-To; b=m/0kndaGULgKowGhYLnMvQA0CC1nWLSB8EWPMXI6HDKVzlpfToqj3LL4+7ouNgMQz a4G3pRrBByUXOfyHH+pT50JHxHfQNyEyI3ZNP09O6Kfh5xcTCYlAaoVFNBqZTuATw0 AZ+4T5MfdhWKKdvNRY5CTJEtE17morXGglEj/yoE= X-UI-Sender-Class: c548c8c5-30a9-4db5-a2e7-cb6cb037b8f9 Original-Received: from drachen.dragon ([92.208.225.87]) by smtp.web.de (mrweb005 [213.165.67.108]) with ESMTPSA (Nemesis) id 1MN6FV-1n5fdd35U5-00JF35; Tue, 16 Nov 2021 00:16:16 +0100 In-Reply-To: <87mtmobw1a.fsf@web.de> (Michael Heerdegen's message of "Mon, 01 Nov 2021 00:43:13 +0100") X-Provags-ID: V03:K1:gCYGdxvCE8h0Rd0q/vWuv3CnsiH13dMV7fhALRqgDLTQ7eXRVul tsmDGSSo95k4sQ/mExcl84/8oclqXvmXGth5oNtfv8AgRJGyMMMWklKKync5G+J1x+yIWN3 3w562Q1W+ramtTU17WRuPBeMK7ubiqIH+4wl1QmzXbhzY/ean5P0FfHfHWVgX3BzljuVqfW f155CSBJtTZtct+jAIb4w== X-UI-Out-Filterresults: notjunk:1;V03:K0:T78AL4YCTR0=:KeQonbJ1AR8pZYkOHSVe0j ZvuP+BwsJsASbJkWq9BB93M4MMTBRux0ecKYGsL7o4s4GHsq2TIdSPfoIlZ+DYGLrLsRmn6R2 cggr2exHTfNQX92iVapq43ct7ZL/FTCPr+EOK/89Ub2+DZh/7xjhEuxj41gNm02TRnIDWY7Rv 6jF0K4GKfsBT41JIQMne8cQtlmag6htTXhHZGx++cE4E4a5eSlim1ZHy0LyR6C97Rh81rZYJJ P1OXxFgScwA+K6tJXlmqULtEi2/3uMjVcr0vR8Fs8FIo8uKz8A0D4HgO7YDKG1LkS5hjU/MGG LAiPtVCtHtdIsaIW2G0XmEPpGwoFd0yFKMfI1LlzMneV3SDrHkXnKULvekShMpR1KzeYaUGRt /epn8twAyyWhpAD88oD1MBc6oTtT3l4t3nqlwWrH9aIVzK9wDgztPWe8wxqvMZI5PLD1PPNmk YPFYt3dytKvbnC8jQGvvMxzHLh9RWD+O2CvLZs2JuLjFqhuBvPXzxNBm4x/R6iYS55qBEKGeU dyX+OJTFkAWNQRVG2iVR0WAy2A1BZH7rGrSPeGwie5yor3PZ+7KYD5tcPWpLstbc8KDz7wZtq HlEdQIbMJ1OiTRte7lXCowmddJ3DLEYXDtsjsz/KG/ENweDWqP9Pc1XZEIm83zqvybkvkJXQf 3kQpBOd2cA3PWXt7XK0UCEeYiYCqSucrtAWobQatS557FlmSpPEo2+MSfg/lo/l6aCsRNQ2fL 2w86WNIFIs7yB6+ltvFQsyxVKKGUGVWFuKsdG+tDH9Flb/xnz/BkMBPzOjGb5EsJx8AVppSx Received-SPF: pass client-ip=212.227.15.3; envelope-from=michael_heerdegen@web.de; helo=mout.web.de X-Spam_score_int: -27 X-Spam_score: -2.8 X-Spam_bar: -- X-Spam_report: (-2.8 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, RCVD_IN_DNSWL_LOW=-0.7, RCVD_IN_MSPIKE_H3=0.001, RCVD_IN_MSPIKE_WL=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.io gmane.emacs.devel:279523 Archived-At: --=-=-= Content-Type: text/plain Michael Heerdegen writes: > > The LPEG people wrote a paper[*] about this problem. The converter is more or less done - see below. Feedback welcome! Nearly everything regexps support is implemented. I tried to make everything so that the resulting peg is really equivalent to the given regexp - please tell me if you find a translation where this is not respected. Remaining problems: (1) Group numbering currently has to be explicit - unnumbered groups are silently treated as shy. That's because getting the numbering right is not trivial. I implemented groups and backrefs using an uninterned global variable owned by the peg. It would be better to add built-in support to peg.el if we want that feature. (2) Transforming character ranges to the vector representation that peg.el uses is not trivial. I would welcome help to get it done correctly. A possible (slow) fallback solution is a guard calling `looking-at' followed by an (any). Oh - why I think this conversion code is useful? It's nice for learning but also for cases were a regexp would almost suffice but you need some Elisp guard somewhere in the middle of matching the regexp to examine the buffer at that position. --=-=-= Content-Type: application/emacs-lisp Content-Disposition: inline; filename=rx-to-peg.el Content-Transfer-Encoding: quoted-printable ;;; rx-to-peg.el --- Transform simple rx forms into really equivalent pegs = -*- lexical-binding: t -*- ;; Copyright (C) 2021 Free Software Foundation, Inc ;; For a peg implementation for Emacs see peg.el in Gnu Elpa. ;; The idea for the algorithm is from here: ;; http://www.lua.inf.puc-rio.br/publications/medeiros11regular.pdf ;; ;; We use the terms "peg" and "grammar" synonymously. A peg is ;; represented as a list (r p) where r the list of rule definitions ;; and p the expression to match. ;; ;; Use like this: ;; ;; (rx-to-peg '(and (* (and (or "a" "aa") "b")) eos)) ;; =3D=3D> ;; (((a (or (and "ab" a) ;; (and "aab" a) ;; (eob)))) ;; a) ;;; Code: (require 'rx) ;; Helpers for handling grammars. (defun rxtp--listify-for (op) (lambda (xs) (cl-mapcan (lambda (x) (if (eq (car-safe x) op) (cdr x) (list x))) xs))) ;; `rxtp--and' and `rxtp--or' help to build nice `and' and `or' expressions (defun rxtp--and (&rest xs) (cl-flet* ((ensure-string (obj) (pcase obj (`(char ,c) (string c)) (_ obj))) (stringishp (obj) (or (stringp obj) (eq 'char (car-safe obj)))) (concat-subsequent-strings-in (list) (let ((rest list)) (while rest (when (stringishp (car rest)) (while (stringishp (cadr rest)) (setcar rest (concat (ensure-string (car rest)) (ensure-string (cadr rest)))) (setcdr rest (cddr rest)))) (cl-callf cdr rest))) list)) (let ((operands (concat-subsequent-strings-in (delete "" (funcall (rxtp--listify-for 'and) xs))))) (if (cdr operands) `(and ,@operands) (or (car operands) ""))))) (defun rxtp--or (&rest xs) (let ((operands (funcall (rxtp--listify-for 'or) xs))) (if (cdr operands) `(or ,@operands) (car operands)))) (defun rxtp--do (expr) ;; An immediately performed action for side effects only. Never fails `(guard (progn ,expr t))) (defvar rxtp-symbol-ctr) (defun g--new-rule-name () "Return a fresh non-terminal symbol." (let ((n (cl-incf rxtp-symbol-ctr))) (intern (if (<=3D n 26) (string (+ (- ?a 1) n)) (format "r%d" n))))) (defvar rxtp--temp-rule-names) (defun g--new-temp-rule-name () (let ((d (cl-gensym "rule-"))) (push d rxtp--temp-rule-names) d)) (defun rxtp-make-grammar (r p) (list r p)) (defun rxtp-grammar-rules (g) ;; Return G's rule defintions (car g)) (gv-define-simple-setter rxtp-grammar-rules setcar) (defun rxtp-grammar-expression (g) ;; Return G's match expression (cadr g)) (defun rxtp-w/extended-expr (g f) ;; apply F to G's match expression and return the result as new peg (pcase-let ((`(,r ,p) g)) `(,r ,(funcall f p)))) (defvar rxtp--group-associations) (defvar rxtp--group-associations-used?) (defun rxtp-w/renamed-rule (g oname nname) "Return a copy of G with rule ONAME renamed to NNAME." (cl-labels ((subst-rule (thing) (pcase-exhaustive thing ((pred (eq oname)) nname) ((or (pred symbolp) (pred vectorp) (pred stringp) (pred c= haracterp) `(guard . ,_)) thing) ((pred listp) (mapcar #'subst-rule thing))))) (subst-rule g))) ;; Definition of the transcription function phi (defun rxtp-null () "Return a trivial grammar with language {\"\"}." (rxtp-make-grammar () "")) (defun rxtp--trivial-rx-p (rx) ;; All the stuff that can be translated directly goes here (pcase rx ((pred stringp) rx) ((pred characterp) `(char ,rx)) ((or 'anychar 'anything) '(any)) ;; beginning-of-*, end-of-* stuff ((or 'eos 'string-end 'eot 'buffer-end) '(eob)) ;pegs search buffer= s, so eos gets eob ((or 'bos 'string-start 'bot 'buffer-start) '(bob)) ;likewise ((or 'bol 'line-start) '(bol)) ((or 'eol 'line-end) '(eol)) ((or 'bow 'word-start) '(bow)) ((or 'eow 'word-end) '(eow)) ('symbol-start '(bos)) ('symbol-end '(eos)) ('word-boundary '(guard (looking-at "\\b"))) ('not-word-boundary '(guard (looking-at "\\B"))) ('not-wordchar '(and (guard (looking-at "\\W")) (any))) (`(syntax ,name) `(syntax-class ,name)) ((and (pred symbolp) (let (and (pred identity) c) (cdr (assq rx rx--char-classes)))) (vector c)) ((pred rx--charset-p) ;; good enough for now (real support would be some work...) (let* ((regexp-stringish (let ((print-escape-newlines t)) (prin1-to-string (rx-to-string rx)))) (try-as-vector (and-let* (((string-match (rx bos "\"[" (group (= * anychar)) "]\"" eos) regexp-stringish)) (ms (match-string 1 regexp-stringish)= )) (if (string-match-p (rx bos "^" ) ms) `(and (not ,(vector (substring ms 1))) (an= y)) (vector ms))))) (or try-as-vector `(and (guard (looking-at ,regexp-stringish)) '(any))))))) (defun rxtp--operator-greedyness (op) (cond ((memq op (list '* '+ '\? ?\s)) t) ((memq op (list '*? '+? '\?? ??)) nil) (t rx--greedy))) (defun rxtp-phi (expr cont) ;; EXPR is a regular expression (an `rx' s-exp). The second argument ;; CONT is a continuation grammar (when the function is recursively ;; called this argument is bound to the already transcribed ;; following parts of the grammar). ;; ;; The rewrite is so that when the following two conditions hold: ;; ;; (1) For no sub-rx (* E) of EXPR does E match the empty string. ;; ;; (2) The language of EXPR has the "prefix property": There are no ;; distinct string S1 and S2 in the language of EXPR so that S1 is a ;; prefix of S2. ;; ;; then the returned peg is equivalent to EXPR in the sense that both mat= ch the ;; same set of strings. (2) can be ensured by matching the end of the ;; string (`eos' in `rx'). (cl-flet ((cont-w/prepended (&rest exps) ;; prepend the peg EXPS to the expression of CONT (rxtp-w/extended-expr cont (lambda (p) (apply #'rxtp--and (append exps (list p))))))) (pcase expr ((let (and (pred identity) transformed) (rxtp--trivial-rx-p expr)) (cont-w/prepended transformed)) ((or 'nonl 'not-newline 'any) (cont-w/prepended '(not ["\n"])) '(any)) ;; `and' and `or' (`(,(or 'and 'seq 'sequence ':) . ,operands) (cl-reduce #'rxtp-phi operands :from-end t :initial-value cont)) (`(,(or 'or '|) ,e1 . ,es) ;; Build a grammar that matches when one of (rxtp-phi E CONT) would ;; match - E is any expression in the given `or` - starting with E1 (let ((g (rxtp-phi e1 cont))) (dolist (e2 es) (setq g (rxtp-w/extended-expr (rxtp-phi e2 (rxtp-make-grammar (rxtp-grammar-rules g) (rxtp-grammar-expression cont))) (lambda (p2) (rxtp--or (rxtp-grammar-expression g) p2))))) g)) ;; repetition operators (definition of `rxtp--**' see below) (`(,(and (or '* '*? (or 'zero-or-more '0+)) op) ,e) (rxtp--** e cont 0 'infinity (not (rxtp--operator-greedyness op)))) (`(,(and (or '+ '+? (or 'one-or-more '1+)) op) ,e) (rxtp--** e cont 1 'infinity (not (rxtp--operator-greedyness op)))) (`(,(and (or '\? ?\s '\?? ?? 'zero-or-one 'opt 'optional) op) ,e) (rxtp--** e cont 0 1 (not (rxtp--operator-greedyness op)))) (`(=3D ,n ,e) (rxtp--** e cont n n)) (`(>=3D ,n ,e) (rxtp--** e cont n 'infinity)) (`(,(or '** 'repeat) ,n ,m ,e) (rxtp--** e cont n m)) ;; Groups and Backrefs (`(,(or 'group 'submatch) . ,elts) (rxtp-phi `(and . ,elts) cont)) ;i= ncomplete! (no ref created) (`(,(or 'group-n 'submatch-n) ,n . ,elts) (let ((g1 (rxtp-phi `(and . ,elts) (rxtp-null)))) (rxtp-make-grammar (cl-union (rxtp-grammar-rules g1) (rxtp-grammar-rules cont)) (rxtp--and ;; we save the ref in a gensym owned by the peg (rxtp--do `(setf (alist-get ,n ,rxtp--group-associations) (point= ))) (rxtp-grammar-expression g1) (rxtp--do `(cl-callf (lambda (start) (buffer-substring-no-properties star= t (point))) (alist-get ,n ,rxtp--group-associations))) (rxtp-grammar-expression cont))))) (`(backref ,n) (setq rxtp--group-associations-used? t) (rxtp-w/extended-expr cont (let ((s (make-symbol "s"))) (lambda (e) (rxtp--and `(guard (let ((,s (alist-get ,n ,rxtp--group-associa= tions))) (when (looking-at (regexp-quote ,s)) (forward-char (length ,s)) t))) e))))) (unknown (error "Don't know how to handle `%S'" unknown))))) (defun rxtp--** (e cont n m &optional non-greedy) ;; Transform repetitions of E, N to M times (M can be the symbol ;; `infinity`) ;; For a finite number of repetitions of a nontrivial expression we ;; use a separate rule matching it once. (let* ((loop* (and (eq m 'infinity) (g--new-temp-rule-name))) ;; a grammar matching E exactly once, using either a "leaf term" or ;; a single rule name as expression: (ge (if-let ((trivial-e-translation (rxtp--trivial-rx-p e))) (rxtp-make-grammar () trivial-e-translation) (let ((g (rxtp-phi e (rxtp-null)))) (if (not (listp (rxtp-grammar-expression g))) g (let ((rule-e (g--new-temp-rule-name))) (rxtp-make-grammar (cons (list rule-e (rxtp-grammar-ex= pression g)) (rxtp-grammar-rules g)) rule-e))))))) (cl-flet ((g--or (reverse operands) (apply #'rxtp--or (funcall (if reverse #'reverse #'identity= ) operands)))) (rxtp-make-grammar (cl-reduce #'cl-union (list (and loop* (let ((loop-e-then-cont (rxtp-phi e (rxtp-w/extended-expr cont (lambda (_) loo= p*))))) (list (list loop* (g--or non-greedy (list (rxtp-grammar-expression loop-e-th= en-cont) (rxtp-grammar-expression cont)))))= )) (and (not (and (=3D n 0) (memq m (list 0 'infinity)))) (rxtp-grammar-rules ge)) (rxtp-grammar-rules cont))) (g--or (not non-greedy) (mapcar (lambda (k) (apply #'rxtp--and (append (make-list k (rxtp-grammar-expressio= n ge)) (list (or loop* (rxtp-grammar-expres= sion cont)))))) (apply #'number-sequence (if loop* (list n) (list n m= ))))))))) (declare-function xr "xr") ;; Our entry point: (defun rx-to-peg (regexp) "Transform regular expression REGEXP into a PEG. REGEXP is either a regular expression in string form or an `rx' expression." (when (stringp regexp) (cl-callf xr regexp)) (let ((rxtp--temp-rule-names ()) (peg) ) (let ((rxtp--group-associations (make-symbol "group-associations")) (rxtp--group-associations-used? nil)) (set rxtp--group-associations ()) (setq peg ;; call phi with REGEXP and continuation grammar with language = {""} (rxtp-phi (if (eq (car-safe regexp) 'rx) `(and ,@(cdr regexp)) = regexp) (rxtp-null))) (when rxtp--group-associations-used? (cl-callf rxtp-w/extended-expr peg (lambda (e) (rxtp--and `(guard (progn (setq ,rxtp--group-associat= ions ()) t)) e)))) ;; Try to make the result a bit less nasty: (cl-callf g--delete-equal-rules peg) (let ((rxtp-symbol-ctr 0)) (dolist (rulename (mapcar #'car (rxtp-grammar-rules peg))) (cl-callf rxtp-w/renamed-rule peg rulename (g--new-rule-name))))) peg)) ;;;; trivial beautifications (defun g--rules-equiv-p (r1 r2) (equal (cdr r1) (cdr r2))) (defun g--delete-equal-rules (g) (cl-maplist (pcase-lambda (`(,r1 . ,more-rules)) (mapc (lambda (r2) (when (g--rules-equiv-p r1 r2) (cl-callf2 delq r2 (rxtp-grammar-rules g)) (cl-callf rxtp-w/renamed-rule g (car r2) (car r1)= ))) more-rules)) (car g)) g) (provide 'rx-to-peg) ;;; rx-to-peg.el ends here --=-=-= Content-Type: text/plain Michael. --=-=-=--