From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.ciao.gmane.io!not-for-mail From: Rutger van Beusekom Newsgroups: gmane.lisp.guile.devel Subject: [PATCH] ice-9/peg: Extend PEG for production use. Date: Mon, 11 May 2020 15:41:48 +0200 Message-ID: <874ksmio6b.fsf@verum.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" Injection-Info: ciao.gmane.io; posting-host="ciao.gmane.io:159.69.161.202"; logging-data="70821"; mail-complaints-to="usenet@ciao.gmane.io" User-Agent: mu4e 1.2.0; emacs 26.3 To: guile-devel@gnu.org Original-X-From: guile-devel-bounces+guile-devel=m.gmane-mx.org@gnu.org Mon May 11 16:09:13 2020 Return-path: Envelope-to: guile-devel@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1jY97H-000IEy-80 for guile-devel@m.gmane-mx.org; Mon, 11 May 2020 16:09:11 +0200 Original-Received: from localhost ([::1]:39498 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1jY97G-0003yM-1m for guile-devel@m.gmane-mx.org; Mon, 11 May 2020 10:09:10 -0400 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]:50384) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1jY8gw-0001NB-VV for guile-devel@gnu.org; Mon, 11 May 2020 09:42:00 -0400 Original-Received: from mail-am6eur05on2113.outbound.protection.outlook.com ([40.107.22.113]:38497 helo=EUR05-AM6-obe.outbound.protection.outlook.com) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1jY8gs-0002bB-Pb for guile-devel@gnu.org; Mon, 11 May 2020 09:41:58 -0400 ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=ZFvBsEetGNAR5NGwx28OdB5cwmpZueb6I772EaNymGpEuakGvkRR6J7fD1s1OQ1Vbia5MwvTfDolxb10cYu50EiyMqw/+XbYYvvovCUpa1LX6Loi4dmmTjiv79J95+Onhehosxcub7Ys1jeX0aDr2+D3+P9bDQWs+LZW7M9iLgadt5M3dULjM+IWrrbEXPlKjiBo3ovJGvUskyQSMEsjoG3brwzSiX4L8Zec730U8+sE9x6CfbRBZeqjzpvwddtdnrA60WF5y5DMZbWTzBSHMMLUv0t+KLq7/TE/Y5DlWaG/HnlDe9SgDQ7Pb8Ii+oBiRxJDc+6ixUghKI4tetuEtw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=HqF7SljnCx2aZ5lcQEP2o4q6Lyfj+QsH7p2p82xGqOQ=; b=kUirQnyy/Kvrv1UOMJ8QbanToFJWXzcxSuXsF0Eshwr4lHOkKZIA0epLSlSrEtj7yknULNYhPulNy6KOGskmgJ2XN9fbbFdO4fatgDOID2cV5RbVlbiBdJkp7eG0erefnyLAkxNh1NbighdtS9Dv/xvqe0PF2n8MnfIEkWCOxmn7WuV62XFX7qr3taJLRJKOqo2df3xT4Xy72gxj6eOB1I6M7rD+B74tltlN00TmRDWGvLUxEkHOy5XCofsQ2UmEJAi5BF+ylQYW2GoQqbri9MiF+0j6i0Cjk31NJXH/miaN1IJcjyBR5F7zEFugChFTBKzWy8qG23G5fA61znabbw== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=verum.com; dmarc=pass action=none header.from=verum.com; dkim=pass header.d=verum.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=verumst.onmicrosoft.com; s=selector2-verumst-onmicrosoft-com; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=HqF7SljnCx2aZ5lcQEP2o4q6Lyfj+QsH7p2p82xGqOQ=; b=cSXlzA8naLjfqjUHLg7eO9aMFsGay6ByqSrMJbPx+31JqjSNocqSp6PYj2+qA1WN4iXY3295sVYwOnqwAT0EtJ6w6oWNBmln9Wrpiex49huPZq+9OM4g9kw2ImEHKy3sXepmCruA/oeF1TPRvdXvanCFNErCFrPewr+1UXJq22g= Authentication-Results: gnu.org; dkim=none (message not signed) header.d=none;gnu.org; dmarc=none action=none header.from=verum.com; Original-Received: from DB7PR08MB4585.eurprd08.prod.outlook.com (2603:10a6:10:7b::24) by DB7PR08MB3305.eurprd08.prod.outlook.com (2603:10a6:5:1c::31) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.2979.26; Mon, 11 May 2020 13:41:50 +0000 Original-Received: from DB7PR08MB4585.eurprd08.prod.outlook.com ([fe80::f8f2:244a:b07f:cad0]) by DB7PR08MB4585.eurprd08.prod.outlook.com ([fe80::f8f2:244a:b07f:cad0%7]) with mapi id 15.20.2979.033; Mon, 11 May 2020 13:41:50 +0000 X-ClientProxiedBy: AM0PR02CA0003.eurprd02.prod.outlook.com (2603:10a6:208:3e::16) To DB7PR08MB4585.eurprd08.prod.outlook.com (2603:10a6:10:7b::24) X-MS-Exchange-MessageSentRepresentingType: 1 Original-Received: from eigen (2001:1c03:4f0a:b200:b097:cf26:7736:44c5) by AM0PR02CA0003.eurprd02.prod.outlook.com (2603:10a6:208:3e::16) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.2979.26 via Frontend Transport; Mon, 11 May 2020 13:41:50 +0000 X-Originating-IP: [2001:1c03:4f0a:b200:b097:cf26:7736:44c5] X-MS-PublicTrafficType: Email X-MS-Office365-Filtering-Correlation-Id: 02bc8b95-bf95-48c7-e8fb-08d7f5b10ec2 X-MS-TrafficTypeDiagnostic: DB7PR08MB3305: X-Microsoft-Antispam-PRVS: X-MS-Oob-TLC-OOBClassifiers: OLM:7219; X-Forefront-PRVS: 04004D94E2 X-MS-Exchange-SenderADCheck: 1 X-Microsoft-Antispam: BCL:0; X-Microsoft-Antispam-Message-Info: mU8grR8Dm0uGDqRlemPr8GtWqN2QFf5SAX0NtW2dIzo0VAa7SapP8JRdD90Kut0ixnFnqQ+Pgr9pwcvlkYEUdyWow6fHwiNnjPocFJQbIDvNX5dF6M31E7p3psa37JvzD0FMCFGoA2giPOHqauEvfBrETrODtcSgIML5IwtkUU835nrvTXtUjLKuEk76GjEjpjRuLjodPpwPEuEDC6EpJu7+EmmJXGKlwa3vsBHFkONm6B2nBztkTyEWGujgVVJ3jQLcRSZpPHB6MAavH/q4Rz9o0nkQdnyK2xIADmd9wrtRdxhtohe47FwPtggS2WFTKdczYnVPnQZhFiMim+Ufx3KyaaMYD0e6osMt+y56Bxj1dlKHbRUpshFV3Fi4AufOWzDo0NfPBIjnfcYTsWVABfvViRC9p7dSS1vlDrmpCSaZCOjEjtb6h09SvpDrj7meWJT8Voxbf0eD00+o4LmAF81xP50usZo4QXUT4Bqd/Q2apPCAnRU6a3EIK16t5o1efDRYbMzjhqle0lVXs02ojw== X-Forefront-Antispam-Report: CIP:255.255.255.255; CTRY:; LANG:en; SCL:1; SRV:; IPV:NLI; SFV:NSPM; H:DB7PR08MB4585.eurprd08.prod.outlook.com; PTR:; CAT:NONE; SFTY:; SFS:(396003)(39830400003)(136003)(376002)(366004)(346002)(33430700001)(66556008)(66946007)(5660300002)(8936002)(36756003)(33440700001)(66616009)(66476007)(6916009)(16526019)(86362001)(186003)(30864003)(2906002)(316002)(6496006)(508600001)(8676002)(53546011)(2616005)(52116002)(6486002); DIR:OUT; SFP:1102; X-MS-Exchange-AntiSpam-MessageData: V5Dms3juWr3v+R+9ZurcUSu7nCjqhMl4PCuS/nIH3dQky+3HkpaBakXgkC7Q5iEf7dOGM3czNu3wrAAMGIv0THSACEVfRwkbBCqPsg9RDYSn5A0CK+54NRbgaaJukhuJyC4xVBK85gwfdPHEAwxAYI3i/IjHIvNUJaD12JGaFdqqiQvdBJcwLTH11axr/k7ad/vu0qGbvRwzDry452KtSA4YXGKyWTFL2eOZ2ZsWmXyXPYtljmwEP+neYfP7dU4FcztZ+5WuwsPUahNdtnE0f/3qF3yOZ+PnV/JCCBrmCCd2dk+vu4LGJHb6sP/HMKiTpkRXfO3YdG1vfRKRbqxWvjrlP8EgwhF4n27cIh/pbIqmn/7cPqmeXJuyYB39uoG0dSObPLBwRIKGTA6FNMbEO7aLnnAvDrmpDxqqiN17td0UaSiaCKYagsg5P3vJfUnpy+Gp/J5iYb76y7Y/HLg6vFPqTuQsr8RKZpGf1VnqLfVikdIXiLY+9jKu3pfGjV7AwYTdxqATMV81X9Ey3FuyrhUokICHpzPzCho2oXSVFyhYaKQmOnnf2AGgdhkkDNCq X-OriginatorOrg: verum.com X-MS-Exchange-CrossTenant-Network-Message-Id: 02bc8b95-bf95-48c7-e8fb-08d7f5b10ec2 X-MS-Exchange-CrossTenant-OriginalArrivalTime: 11 May 2020 13:41:50.5233 (UTC) X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted X-MS-Exchange-CrossTenant-Id: a12adc2b-ece0-4282-b9ce-ad23864dd7c5 X-MS-Exchange-CrossTenant-MailboxType: HOSTED X-MS-Exchange-CrossTenant-UserPrincipalName: pDKg827Ewd87/LYJt7tdtiJCpf/LeP6/PtFBYf+7FIiXEg5HPWxK5dEwXybWmUUXmM0aVEuLrNg4kNco+dvZ5wU7NUhDT2hhrteyNcwb2oE= X-MS-Exchange-Transport-CrossTenantHeadersStamped: DB7PR08MB3305 Received-SPF: pass client-ip=40.107.22.113; envelope-from=rutger.van.beusekom@verum.com; helo=EUR05-AM6-obe.outbound.protection.outlook.com X-detected-operating-system: by eggs.gnu.org: First seen = 2020/05/11 09:41:52 X-ACL-Warn: Detected OS = Windows NT kernel [generic] [fuzzy] X-Spam_score_int: -25 X-Spam_score: -2.6 X-Spam_bar: -- X-Spam_report: (-2.6 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_MED=0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, MSGID_FROM_MTA_HEADER=0.001, RCVD_IN_DNSWL_LOW=-0.7, RCVD_IN_MSPIKE_H2=-0.001, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001 autolearn=_AUTOLEARN X-Spam_action: no action X-Mailman-Approved-At: Mon, 11 May 2020 10:08:06 -0400 X-BeenThere: guile-devel@gnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Developers list for Guile, the GNU extensibility library" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guile-devel-bounces+guile-devel=m.gmane-mx.org@gnu.org Original-Sender: "guile-devel" Xref: news.gmane.io gmane.lisp.guile.devel:20514 Archived-At: --=-=-= Content-Type: text/plain Hi, First of all, I am aware I am submitting a sizeable patch here, and I am happy to break it up if necessary. Currently I am using these changes to the Parser Expression Grammar library in GNU Guile for a production parser. Hopefully you will find this interesting and useful. If so I would appreciate your feedback. Rutger. --=-=-= Content-Type: text/x-diff Content-Disposition: inline; filename=0001-ice-9-peg-Extend-PEG-for-production-use.patch >From 091915cc0cfbe73333b2c486b254628f37d52bb1 Mon Sep 17 00:00:00 2001 From: Rutger van Beusekom Date: Mon, 11 May 2020 14:30:49 +0200 Subject: [PATCH] ice-9/peg: Extend PEG for production use. * doc/ref/api-peg.texi: Document debug tracing, locations, skip parsing, expect parsing and fallback parsing. * module/ice-9/peg.scm (ice-9): Re-export added parameters and macro. * module/ice-9/peg/codegen.scm (%peg:debug?): Add exported parameter to toggle parser debug tracing. (%peg:locations?): Add exported parameter to toggle adding locations to the parse tree. (%peg:skip?): Add exported parameter to pass a skip parser to the main grammar. (%peg:fallback?): Add exported parameter to toggle fallback parsing. (%peg:error): Add exported parameter to handle fallback recovery. (%continuation): Add parameter to maintain the parser productions following the current production. (format-error): New function which wraps (%peg:error). (fallback-skip): New function which implements parser recovery. (partial-match): New function to allow continuations to match only partially. (cg-and-int): Rewrite to add fallback parsing. (cg-*): Rewrite to add fallback parsing. (cg-+): Rewrite to add fallback parsing. (cg-expect-int): New function implementing expect parsing interfals. (cg-expect): New function implementing expect parsing. (expect): Add expect parsing to PEG (trace?): New function determining debug tracing verbosity. (indent): New variable maintaining debug trace indentation level. (wrap-parser-for-users): Add debug tracing, locations and skip parsing. (define-skip-parser): New macro allows definging skip parsers. * module/ice-9/peg/string-peg.scm (peg-as-peg): Add # (peg-secondary->defn): Add # for expect parsing. * test-suite/tests/peg.test: Test it. --- doc/ref/api-peg.texi | 307 +++++++++++++++++++++++++++++--- module/ice-9/peg.scm | 9 +- module/ice-9/peg/codegen.scm | 193 +++++++++++++++----- module/ice-9/peg/string-peg.scm | 10 +- test-suite/tests/peg.test | 120 ++++++++++++- 5 files changed, 562 insertions(+), 77 deletions(-) diff --git a/doc/ref/api-peg.texi b/doc/ref/api-peg.texi index 82e2758b4..811c4d216 100644 --- a/doc/ref/api-peg.texi +++ b/doc/ref/api-peg.texi @@ -159,6 +159,22 @@ Would be: (+ "e")) @end lisp +@deftp {PEG Pattern} {expect} a +Expect to parse @var{a}. Upon failure, throw a @code{'syntax-error} +including its offset position and its symbol. + +@code{"a#"} + +@code{(expect a)} +@end deftp + +Example: + +@example +"a b# (c / d)#" +@end example + + @subsubheading Extended Syntax There is some extra syntax for S-expressions. @@ -221,11 +237,11 @@ character (in normal PEGs nonterminals can only be alphabetic). For example, if we: @lisp -(define-peg-string-patterns +(define-peg-string-patterns "as <- 'a'+ bs <- 'b'+ as-or-bs <- as/bs") -(define-peg-string-patterns +(define-peg-string-patterns "as-tag <-- 'a'+ bs-tag <-- 'b'+ as-or-bs-tag <-- as-tag/bs-tag") @@ -271,9 +287,9 @@ For Example, if we: @end lisp Then: @lisp -(match-pattern as-or-bs "aabbcc") @result{} +(match-pattern as-or-bs "aabbcc") @result{} # -(match-pattern as-or-bs-tag "aabbcc") @result{} +(match-pattern as-or-bs-tag "aabbcc") @result{} # @end lisp @@ -325,7 +341,7 @@ find a valid substring starting at index 0 and @code{search-for-pattern} keeps looking. They are both equally capable of ``parsing'' and ``matching'' given those constraints. -@deffn {Scheme Procedure} match-pattern nonterm string +@deffn {Scheme Procedure} match-pattern nonterm string Parses @var{string} using the PEG stored in @var{nonterm}. If no match was found, @code{match-pattern} returns false. If a match was found, a PEG match record is returned. @@ -344,19 +360,19 @@ nothing @lisp (define-peg-pattern as all (+ "a")) -(match-pattern as "aabbcc") @result{} +(match-pattern as "aabbcc") @result{} # (define-peg-pattern as body (+ "a")) -(match-pattern as "aabbcc") @result{} +(match-pattern as "aabbcc") @result{} # (define-peg-pattern as none (+ "a")) -(match-pattern as "aabbcc") @result{} +(match-pattern as "aabbcc") @result{} # (define-peg-pattern bs body (+ "b")) -(match-pattern bs "aabbcc") @result{} +(match-pattern bs "aabbcc") @result{} #f @end lisp @end deffn @@ -371,31 +387,31 @@ was found, a PEG match record is returned. @lisp (define-peg-pattern as body (+ "a")) -(search-for-pattern as "aabbcc") @result{} +(search-for-pattern as "aabbcc") @result{} # -(search-for-pattern (+ "a") "aabbcc") @result{} +(search-for-pattern (+ "a") "aabbcc") @result{} # -(search-for-pattern "'a'+" "aabbcc") @result{} +(search-for-pattern "'a'+" "aabbcc") @result{} # (define-peg-pattern as all (+ "a")) -(search-for-pattern as "aabbcc") @result{} +(search-for-pattern as "aabbcc") @result{} # (define-peg-pattern bs body (+ "b")) -(search-for-pattern bs "aabbcc") @result{} +(search-for-pattern bs "aabbcc") @result{} # -(search-for-pattern (+ "b") "aabbcc") @result{} +(search-for-pattern (+ "b") "aabbcc") @result{} # -(search-for-pattern "'b'+" "aabbcc") @result{} +(search-for-pattern "'b'+" "aabbcc") @result{} # (define-peg-pattern zs body (+ "z")) -(search-for-pattern zs "aabbcc") @result{} +(search-for-pattern zs "aabbcc") @result{} #f -(search-for-pattern (+ "z") "aabbcc") @result{} +(search-for-pattern (+ "z") "aabbcc") @result{} #f -(search-for-pattern "'z'+" "aabbcc") @result{} +(search-for-pattern "'z'+" "aabbcc") @result{} #f @end lisp @end deffn @@ -468,9 +484,9 @@ itself satisfies @var{tst}, @code{(list lst)} is returned (this is a flat list whose only element satisfies @var{tst}). @lisp -(context-flatten (lambda (x) (and (number? (car x)) (= (car x) 1))) '(2 2 (1 1 (2 2)) (2 2 (1 1)))) @result{} +(context-flatten (lambda (x) (and (number? (car x)) (= (car x) 1))) '(2 2 (1 1 (2 2)) (2 2 (1 1)))) @result{} (2 2 (1 1 (2 2)) 2 2 (1 1)) -(context-flatten (lambda (x) (and (number? (car x)) (= (car x) 1))) '(1 1 (1 1 (2 2)) (2 2 (1 1)))) @result{} +(context-flatten (lambda (x) (and (number? (car x)) (= (car x) 1))) '(1 1 (1 1 (2 2)) (2 2 (1 1)))) @result{} ((1 1 (1 1 (2 2)) (2 2 (1 1)))) @end lisp @@ -624,7 +640,7 @@ symbol for strings).. (entry "nobody:x:65534:65534:nobody:/nonexistent:/bin/sh") (entry "messagebus:x:103:107::/var/run/dbus:/bin/false")) (peg:tree (match-pattern tag-passwd "one entry")) -(tag-passwd +(tag-passwd (entry "one entry")) @end lisp @@ -841,7 +857,7 @@ number <-- [0-9]+") (apply next-func (car first)) (apply next-func (car rest))) ;; form (sum (((product ...) "+") ((product ...) "+")) (product ...)) - (car + (car (reduce ;; walk through the list and build a left-associative tree (lambda (l r) (list (list (cadr r) (car r) (apply next-func (car l))) @@ -1035,3 +1051,248 @@ symbol function)}, where @code{symbol} is the symbol that will indicate a form of this type and @code{function} is the code generating function described above. The function @code{add-peg-compiler!} is exported from the @code{(ice-9 peg codegen)} module. + +@subsubheading Debug tracing + +Due to the backtracking nature of PEG, the parser result is @code{#f} +when it cannot match the input text. It proves to be a big pain +determining whether the problem is actually in the input or in the +grammar, especially when changing the grammar itself. Setting the +parameter @code{%peg:debug?} to @code{#t} enables debug tracing of all +non-terminals, alternatively parameter @code{%peg:debug?} can be set to +a list of non-terminal sysmbols. Debug traring will make the PEG parser +print for each selected non-terminal: its name, the input remaining, as +well as the parse result. + +@lisp +(use-modules (ice-9 peg)) +(use-modules (ice-9 pretty-print)) + +(define-peg-string-patterns + ;; missing: / term + "expression <-- term (PLUS / MINUS) expression + term <-- factor (DIV / MULT) term / factor + factor <-- LPAR expression RPAR / number + number <- MINUS? [0-9]+ + PLUS <- '+' + MINUS <- '-' + DIV <- '/' + MULT <- '*' + LPAR < '(' + RPAR < ')'") + +(parameterize ((%peg:debug? #t)) + (pretty-print (match-pattern expression "1+"))) +@result{} +expression + term + factor + LPAR + number + MINUS + number := "1" next: "+" + factor := "1" next: "+" + DIV + MULT + term := "1" next: "+" + PLUS + PLUS := "+" next: "" + expression + term + factor + LPAR + number + MINUS +#f + +@end lisp + +@subsubheading Locations + +Having the corresponding location of non-terminal matches in the parse +tree proves to be useful when providing the user with warning or error +messages. Location generation in the parse tree is enabled by setting +the parameter @code{%peg:locations?} to @code{#t}. + +Example: + +@lisp +(parameterize ((%peg:locations? #t)) + (and=> (match-pattern expression "1+2/3*4") peg:tree)) +@result{} +(expression + (term (factor "1" (location 0 1)) (location 0 1)) + "+" + (expression + (term (factor "2" (location 2 3)) + "*" + (term (factor "3" (location 4 5)) + "/" + (term (factor "4" (location 6 7)) (location 6 7)) + (location 4 7)) + (location 2 7)) + (location 2 7)) + (location 0 7)) +@end lisp + +@subsubheading Skip parsing + +To write a PEG parser for a whitespace invariant language or a language +which includes line and block comments requires littering the grammar +with whitespace and comment parser expressions, which not only violates +the DRY principle, but is hard to get right by hand. + +To this end, the @code{%peg:skip?} parameter is available. When setting +this parameter to an @emph{unwrapped parser} using +@code{define-skip-parser}, this parser will be interleaved +with the @emph{main parser}. + +Example: + +@lisp +(use-modules (ice-9 match)) +(use-modules (ice-9 peg)) +(use-modules (ice-9 pretty-print)) + +(define-peg-string-patterns + "expression <-- term (PLUS / MINUS) expression / term + term <-- factor (DIV / MULT) term / factor + factor <-- LPAR expression RPAR / number + number <- MINUS? [0-9]+ + PLUS <- '+' + MINUS <- '-' + DIV <- '/' + MULT <- '*' + LPAR < '(' + RPAR < ')'") + +(define (transform e) + (match + e + (((? symbol?) lhs op rhs) `(,(transform op) ,(transform lhs) ,(transform rhs))) + (((? symbol?) lhs) (transform lhs)) + ("+" '+) + ("-" '-) + ("*" '*) + ("/" '/) + ((? string?) (string->number e)))) + +(pretty-print (transform (peg:tree (match-pattern expression "1/(2+3)*4")))) +@result{} +(/ 1 (* (+ 2 3) 4)) + +(define-skip-parser peg-ws none (* (or " " "\t" "\n"))) + +(parameterize ((%peg:skip? peg-ws)) + (pretty-print (transform (peg:tree (match-pattern expression "1 / (2 + 3) * 4"))))) +@result{} +(/ 1 (* (+ 2 3) 4)) +@end lisp + +@subsubheading Expect parsing + +The best thing about PEG is its backtracking nature giving it +@emph{infinite} look ahead capability. At the same time it severely +limits the debugability of the grammar, as mentioned in debug tracing. +The ability to short circuit backtracking is achieved by introducing the +@code{#} operator to the PEG language. Putting a @code{#} behind a +terminal or non-terminal indicates that its parsing must succeed, +otherwise an exception is thrown containing the current parser state +providing a hook to produce informative parse errors. + +Example: + +@lisp +(use-modules (ice-9 peg)) + +(define-peg-string-patterns + "expression <-- term (PLUS / MINUS) expression# / term + term <-- factor (DIV / MULT) term# / factor + factor <-- LPAR expression# RPAR# / number# + number <- MINUS? [0-9]+ + PLUS <- '+' + MINUS <- '-' + DIV <- '/' + MULT <- '*' + LPAR < '(' + RPAR < ')'") + +(let ((input "1/(2+var)*4")) + (catch 'syntax-error + (lambda _ (peg:tree (match-pattern expression input))) + (lambda (key args) (format #t "at position ~a, expected a ~a, not: ~s\n" + (car args) + (cadr args) + (substring input (car args)))))) +@result{} +at position 5, expected a number, not: "var)*4" +@end lisp + +@subsubheading Fallback parsing + +A natural extension to expect parsing is fallback parsing. Which +is implemented by catching the exception thrown by the expect operator. +The parser will catch the exception in the peg @code{and}, @code{+} and +@code{*} operator to attempt to recover its state by eating away at the +input until it runs out of input or until one of the grammar +continuations matches and it will continue parsing from there. + +By setting parameter @code{%peg:fallback?} to @code{#t} fallback parsing +will be enabled when it encouters a mismatch in the input on an expect +production. Parameter @code{%peg:error} is used to handle the error and +it receives the position, the input and a list representing the error as +pair. When the @code{car} is @code{'skipped}, the @code{cdr} is the +substring which is ignored by the parser. When the @code{car} is +@code{'missing} the @code{cdr} is the parser snippet that was expected, +but is missing in the input. + +Example: + +@lisp +(use-modules (ice-9 peg)) +(use-modules (ice-9 pretty-print)) + +(define-peg-string-patterns + "expression <-- term (PLUS / MINUS) expression# / term + term <-- factor (DIV / MULT) term# / factor + factor <-- LPAR expression RPAR# / number + number <- MINUS? [0-9]+ + PLUS <- '+' + MINUS <- '-' + DIV <- '/' + MULT <- '*' + LPAR < '(' + RPAR < ')'") + +(define (handle-error pos str error) + (format #t "at postion ~a, ~a: \"~a\"\n" + pos (car error) (cdr error))) + +(let ((input "1/(2+var)*4")) + (parameterize ((%peg:fallback? #t) + (%peg:error handle-error)) + (pretty-print (peg:tree (match-pattern expression input))))) +@result{} +at postion 5, skipped: "var)" +(expression + (term (factor "1") + "/" + (term (factor + (expression (term (factor "2")) "+" expression)) + "*" + (term (factor "4"))))) + +(let ((input "1/(2+)*4")) + (parameterize ((%peg:fallback? #t) + (%peg:error handle-error)) + (pretty-print (peg:tree (match-pattern expression input))))) +@result{} +at postion 5, missing: "expression" +(expression + (term (factor "1") + "/" + (term (factor + (expression (term (factor "2")) "+" expression)) + "*" + (term (factor "4"))))) +@end lisp diff --git a/module/ice-9/peg.scm b/module/ice-9/peg.scm index 4e03131cd..ca7d091bf 100644 --- a/module/ice-9/peg.scm +++ b/module/ice-9/peg.scm @@ -38,5 +38,10 @@ peg:string peg:tree peg:substring - peg-record?)) - + peg-record? + %peg:debug? + %peg:fallback? + %peg:locations? + %peg:skip? + %peg:error + define-skip-parser)) diff --git a/module/ice-9/peg/codegen.scm b/module/ice-9/peg/codegen.scm index d80c3e849..fcfb71422 100644 --- a/module/ice-9/peg/codegen.scm +++ b/module/ice-9/peg/codegen.scm @@ -18,10 +18,27 @@ ;;;; (define-module (ice-9 peg codegen) - #:export (compile-peg-pattern wrap-parser-for-users add-peg-compiler!) + #:export (compile-peg-pattern + wrap-parser-for-users + add-peg-compiler! + define-skip-parser + %peg:debug? + %peg:locations? + %peg:skip? + %peg:fallback? + %peg:error) + #:use-module (srfi srfi-1) #:use-module (ice-9 pretty-print) #:use-module (system base pmatch)) +(define %peg:debug? (make-parameter #f)) +(define %peg:locations? (make-parameter #f)) +(define %peg:skip? (make-parameter (lambda (str strlen at) `(,at ())))) +(define %peg:fallback? (make-parameter #f)) +(define %peg:error (make-parameter (lambda (pos str error) #f))) + +(define %continuation (make-parameter (lambda (str streln at) #f))) + (define-syntax single? (syntax-rules () "Return #t if X is a list of one element." @@ -162,6 +179,38 @@ return EXP." ((eq? accum 'none) 'none))) (define baf builtin-accum-filter) +(define (format-error missing str) + (lambda (from to) + (unless (and (< from to) + (string-every char-set:whitespace str from (1+ to))) + ((%peg:error) from str + (if (< from to) (cons 'skipped (substring str from (1+ to))) + (cons 'missing missing)))))) + +(define* (fallback-skip kernel #:optional sequence?) + (if (not (%peg:fallback?)) kernel + (lambda (str strlen start) + (catch 'syntax-error + (lambda _ + (cond ((or #t (< start strlen)) (kernel str strlen start)) + ((not sequence?) `(,strlen ())) + (else #f))) + (lambda (key . args) + (let* ((expected (cadar args)) + (format-error (format-error expected str))) + (let loop ((at start)) + (cond ((or (= at strlen) ((%continuation) str strlen at)) (format-error start at) (if sequence? `(,at ()) `(,at (,expected)))) + (else (or (let ((res (false-if-exception (kernel str strlen (1+ at))))) (when res(format-error start at)) res) + ;;if kernel matches, we have skipped over: (substring str start (1+ at))) + (loop (1+ at)))))))))))) + + +(define (partial-match kernel) + (lambda (str strlen at) + (catch #t + (lambda _ (kernel str strlen at)) + (lambda (key . args) (and (< at (caar args)) (car args)))))) + ;; Top-level function builder for AND. Reduces to a call to CG-AND-INT. (define (cg-and clauses accum) #`(lambda (str len pos) @@ -174,8 +223,14 @@ return EXP." (() (cggr accum 'cg-and #`(reverse #,body) at)) ((first rest ...) - #`(let ((res (#,(compile-peg-pattern #'first accum) #,str #,strlen #,at))) - (and res + #`(let* ((next #,(cg-or #'(rest ...) 'body)) + (kernel #,(compile-peg-pattern #'first accum)) + (res (parameterize ((%continuation (let ((after-that (%continuation))) + (lambda (str strlen at) + (or ((partial-match next) str strlen at) + ((partial-match after-that) str strlen at)))))) + ((fallback-skip kernel) #,str #,strlen #,at)))) + (and res ;; update AT and BODY then recurse (let ((newat (car res)) (newbody (cadr res))) @@ -200,42 +255,40 @@ return EXP." (define (cg-* args accum) (syntax-case args () ((pat) - #`(lambda (str strlen at) - (let ((body '())) - (let lp ((end at) (count 0)) - (let* ((match (#,(compile-peg-pattern #'pat (baf accum)) - str strlen end)) - (new-end (if match (car match) end)) - (count (if (> new-end end) (1+ count) count))) - (if (> new-end end) - (push-not-null! body (single-filter (cadr match)))) - (if (and (> new-end end) - #,#t) - (lp new-end count) - (let ((success #,#t)) - #,#`(and success - #,(cggr (baf accum) 'cg-body - #'(reverse body) #'new-end))))))))))) + #`(let* ((kernel #,(compile-peg-pattern #'pat (baf accum))) + (kleene (lambda (str strlen at) + (let ((body '())) + (let lp ((end at) (count 0)) + (let* ((match ((fallback-skip kernel #t) str strlen end)) + (new-end (if match (car match) end)) + (count (if (> new-end end) (1+ count) count))) + (when (> new-end end) + (push-not-null! body (single-filter (cadr match)))) + (if (and (> new-end end) #,#t) (lp new-end count) + (let ((success #,#t)) + #,#`(and success + #,(cggr (baf accum) 'cg-body + #'(reverse body) #'new-end)))))))))) + kleene)))) (define (cg-+ args accum) (syntax-case args () ((pat) - #`(lambda (str strlen at) - (let ((body '())) - (let lp ((end at) (count 0)) - (let* ((match (#,(compile-peg-pattern #'pat (baf accum)) - str strlen end)) - (new-end (if match (car match) end)) - (count (if (> new-end end) (1+ count) count))) - (if (> new-end end) - (push-not-null! body (single-filter (cadr match)))) - (if (and (> new-end end) - #,#t) - (lp new-end count) - (let ((success #,#'(>= count 1))) - #,#`(and success - #,(cggr (baf accum) 'cg-body - #'(reverse body) #'new-end))))))))))) + #`(let* ((kernel #,(compile-peg-pattern #'pat (baf accum))) + (multiple (lambda (str strlen at) + (let ((body '())) + (let lp ((end at) (count 0)) + (let* ((match ((fallback-skip kernel #t) str strlen end)) + (new-end (if match (car match) end)) + (count (if (> new-end end) (1+ count) count))) + (when (> new-end end) + (push-not-null! body (single-filter (cadr match)))) + (if (and (> new-end end) #,#t) (lp new-end count) + (let ((success #,#'(>= count 1))) + #,#`(and success + #,(cggr (baf accum) 'cg-body + #'(reverse body) #'new-end)))))))))) + multiple)))) (define (cg-? args accum) (syntax-case args () @@ -296,6 +349,16 @@ return EXP." #f #,(cggr (baf accum) 'cg-body #''() #'at))))))))))) +(define (cg-expect-int clauses accum str strlen at) + (syntax-case clauses () + ((pat) + #`(or (#,(compile-peg-pattern #'pat accum) #,str #,strlen #,at) + (throw 'syntax-error (list #,at (syntax->datum #'pat))))))) + +(define (cg-expect clauses accum) + #`(lambda (str len pos) + #,(cg-expect-int clauses (baf accum) #'str #'len #'pos))) + ;; Association list of functions to handle different expressions as PEGs (define peg-compiler-alist '()) @@ -313,6 +376,7 @@ return EXP." (add-peg-compiler! '? cg-?) (add-peg-compiler! 'followed-by cg-followed-by) (add-peg-compiler! 'not-followed-by cg-not-followed-by) +(add-peg-compiler! 'expect cg-expect) ;; Takes an arbitrary expressions and accumulation variable, then parses it. ;; E.g.: (compile-peg-pattern syntax '(and "abc" (or "-" (range #\a #\z))) 'all) @@ -332,28 +396,65 @@ return EXP." "Not one of" (map car peg-compiler-alist))))))) ;; Packages the results of a parser + +(define (trace? symbol) + (cond ((pair? (%peg:debug?)) (memq symbol (%peg:debug?))) + ((or (null? (%peg:debug?)) (%peg:debug?)) #t) + (else #f))) + +(define indent 0) + (define (wrap-parser-for-users for-syntax parser accumsym s-syn) - #`(lambda (str strlen at) - (let ((res (#,parser str strlen at))) + #`(lambda (str strlen at) + (when (trace? '#,s-syn) + (format (current-error-port) "~a~a\n" + (make-string indent #\space) + '#,s-syn)) + (set! indent (+ indent 4)) + (let* ((comment-res ((%peg:skip?) str strlen at)) + (comment-loc (and (%peg:locations?) comment-res `(location ,at ,(car comment-res)))) + (at (or (and comment-res (car comment-res)) at)) + (res (#,parser str strlen at))) + (set! indent (- indent 4)) + (let ((pos (or (and res (car res)) 0))) + (when (and (trace? '#,s-syn) (< at pos)) + (format (current-error-port) "~a~a := ~s\tnext: ~s\n" + (make-string indent #\space) + '#,s-syn + (substring str at pos) + (substring str pos (min strlen (+ pos 10)))))) ;; Try to match the nonterminal. (if res ;; If we matched, do some post-processing to figure out ;; what data to propagate upward. - (let ((at (car res)) - (body (cadr res))) + (let* ((body (cadr res)) + (loc `(location ,at ,(car res))) + (annotate (if (not (%peg:locations?)) '() + (if (null? (cadr comment-res)) `(,loc) + `((comment ,(cdr comment-res) ,comment-loc) ,loc)))) + (at (car res))) #,(cond ((eq? accumsym 'name) - #`(list at '#,s-syn)) + #`(list at '#,s-syn ,@annotate)) ((eq? accumsym 'all) #`(list (car res) (cond ((not (list? body)) - (list '#,s-syn body)) - ((null? body) '#,s-syn) + `(,'#,s-syn ,body ,@annotate)) + ((null? body) + `(,'#,s-syn ,@annotate)) ((symbol? (car body)) - (list '#,s-syn body)) - (else (cons '#,s-syn body))))) - ((eq? accumsym 'none) #`(list (car res) '())) - (else #`(begin res)))) + `(,'#,s-syn ,body ,@annotate)) + (else + (cons '#,s-syn (append body annotate)))))) + ((eq? accumsym 'none) #``(,at () ,@annotate)) + (else #``(,at ,body ,@annotate)))) ;; If we didn't match, just return false. #f)))) + +(define-syntax define-skip-parser + (lambda (x) + (syntax-case x () + ((_ sym accum pat) + (let* ((matchf (compile-peg-pattern #'pat (syntax->datum #'accum)))) + #`(define sym #,matchf)))))) diff --git a/module/ice-9/peg/string-peg.scm b/module/ice-9/peg/string-peg.scm index 45ed14bb1..e8af6bb3d 100644 --- a/module/ice-9/peg/string-peg.scm +++ b/module/ice-9/peg/string-peg.scm @@ -42,7 +42,8 @@ pattern <-- alternative (SLASH sp alternative)* alternative <-- ([!&]? sp suffix)+ suffix <-- primary ([*+?] sp)* -primary <-- '(' sp pattern ')' sp / '.' sp / literal / charclass / nonterminal !'<' +primary <-- secondary ([#] sp)? +secondary <-- '(' sp pattern ')' sp / '.' sp / literal / charclass / nonterminal !'<' literal <-- ['] (!['] .)* ['] sp charclass <-- LB (!']' (CCrange / CCsingle))* RB sp CCrange <-- . '-' . @@ -73,6 +74,8 @@ RB < ']' (define-sexp-parser peg-suffix all (and peg-primary (* (and (or "*" "+" "?") peg-sp)))) (define-sexp-parser peg-primary all + (and peg-secondary (? (and "#" peg-sp)))) +(define-sexp-parser peg-secondary all (or (and "(" peg-sp peg-pattern ")" peg-sp) (and "." peg-sp) peg-literal @@ -188,6 +191,11 @@ RB < ']' ;; Parse a primary. (define (peg-primary->defn lst for-syntax) + (let ((inner-defn (peg-secondary->defn (cadr lst) for-syntax))) + (if (and (pair? (cddr lst)) (equal? (caddr lst) "#")) #`(expect #,inner-defn) + inner-defn))) + +(define (peg-secondary->defn lst for-syntax) (let ((el (cadr lst))) (cond ((list? el) diff --git a/test-suite/tests/peg.test b/test-suite/tests/peg.test index f516571e8..3f2ea70e1 100644 --- a/test-suite/tests/peg.test +++ b/test-suite/tests/peg.test @@ -1,5 +1,5 @@ +;;;;; PEG test suite. -*- scheme -*- ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; -;;;;; PEG test suite. ;; Tests the parsing capabilities of (ice-9 peg). Could use more ;; tests for edge cases. ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; @@ -33,6 +33,7 @@ (alternative peg-alternative) (suffix peg-suffix) (primary peg-primary) + (secondary peg-secondary) (literal peg-literal) (charclass peg-charclass) (CCrange charclass-range) @@ -63,13 +64,12 @@ (pass-if "defining PEGs with PEG" (and (eeval `(define-peg-string-patterns ,(@@ (ice-9 peg) peg-as-peg))) #t)) - (pass-if + (pass-if-equal "equivalence of definitions" - (equal? - (peg:tree (match-pattern (@@ (ice-9 peg) peg-grammar) (@@ (ice-9 peg) peg-as-peg))) + (peg:tree (match-pattern (@@ (ice-9 peg) peg-grammar) (@@ (ice-9 peg) peg-as-peg))) (tree-map grammar-transform - (peg:tree (match-pattern grammar (@@ (ice-9 peg) peg-as-peg))))))) + (peg:tree (match-pattern grammar (@@ (ice-9 peg) peg-as-peg)))))) ;; A grammar for pascal-style comments from Wikipedia. (define comment-grammar @@ -276,3 +276,113 @@ number <-- [0-9]+") (equal? (eq-parse "1+1/2*3+(1+1)/2") '(+ (+ 1 (* (/ 1 2) 3)) (/ (+ 1 1) 2))))) +(define-peg-string-patterns + "trace-grammar <-- foo bar* baz +foo <-- 'foo' +bar <-- 'bar' +baz <-- 'baz'") + +(use-modules (ice-9 peg codegen)) ; %peg:debug? +(with-test-prefix "Parse tracing" + (pass-if-equal + "trace" +"trace-grammar + foo + foo := \"foo\" next: \"baz\" + bar + baz + baz := \"baz\" next: \"\" +trace-grammar := \"foobaz\" next: \"\" +" + (parameterize ((%peg:debug? #t)) + (with-error-to-string + (lambda _ (and=> (match-pattern trace-grammar "foobaz") + peg:tree)))))) + +(define-peg-string-patterns + "expect-grammar <-- one# two# three# +one <-- 'one' +two <-- 'two' +three <-- 'three'" +) + +(with-test-prefix "Parsing expect" + (pass-if-equal "expect okay" + '(expect-grammar (one "one") (two "two") (three "three")) + (and=> (match-pattern expect-grammar "onetwothree") + peg:tree)) + (pass-if-equal "expect one" + '(syntax-error (0 one)) + (catch 'syntax-error + (lambda _ + (and=> (match-pattern expect-grammar "twothree") + peg:tree)) + (lambda args args))) + (pass-if-equal "expect two" + '(syntax-error (3 two)) + (catch 'syntax-error + (lambda _ + (and=> (match-pattern expect-grammar "onethree") + peg:tree)) + (lambda args args)))) + +(define program-text " +/* + CopyLeft (L) Acme +*/ +foo // the first +bar + bar +baz +") + +(define-skip-parser peg-ws all (or " " "\t")) +(define-skip-parser peg-eol all (or "\r" "\n")) +(define-skip-parser peg-line all (and "//" (* (and (not-followed-by peg-eol) peg-any)))) +(define-skip-parser peg-block all (and "/*" (* (or peg-block (and (not-followed-by "*/") peg-any))) (expect "*/"))) +(define-skip-parser peg-skip all (* (or peg-ws peg-eol peg-line peg-block))) + +(with-test-prefix "Skip parser" + (pass-if-equal "skip comments and whitespace" + '(trace-grammar (foo "foo") ((bar "bar") (bar "bar")) (baz "baz")) + (and=> (parameterize ((%peg:skip? peg-skip)) + (match-pattern trace-grammar program-text)) + peg:tree)) + (pass-if-equal "preserve comments and whitespace" + '(trace-grammar (foo "foo" (location 26 29)) + ((bar "bar" + (comment " // the first\n" (location 29 43)) + (location 43 46)) + (bar "bar" + (comment "\n\t" (location 46 48)) + (location 48 51))) + (baz "baz" + (comment "\n" (location 51 52)) + (location 52 55)) + (comment "\n/*\n CopyLeft (L) Acme\n*/\n" (location 0 26)) + (location 26 55)) + (and=> (parameterize ((%peg:skip? peg-skip) + (%peg:locations? #t)) + (match-pattern trace-grammar program-text)) + peg:tree))) + +(with-test-prefix "Fallback parser" + (pass-if-equal "only one" + '(expect-grammar (one "one") two three) + (and=> (parameterize ((%peg:skip? peg-skip) + (%peg:fallback? #t)) + (match-pattern expect-grammar "one")) + peg:tree)) + (pass-if-equal "no two" + '(expect-grammar (one "one") two (three "three")) + (and=> (parameterize ((%peg:skip? peg-skip) + (%peg:fallback? #t)) + (match-pattern expect-grammar "one three")) + peg:tree)) + (pass-if-equal "missing one (skipping zero)" + '(expect-grammar (one (two "two") (three "three"))) + (and=> (parameterize ((%peg:skip? peg-skip) + (%peg:fallback? #t)) + (match-pattern expect-grammar "zero two three")) + peg:tree))) + -- 2.26.0 --=-=-=--