From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: Zelphir Kaltstahl Newsgroups: gmane.lisp.guile.user Subject: Re: Re: Re parse-result Date: Fri, 18 Jan 2019 23:27:27 +0100 Message-ID: References: NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-Trace: blaine.gmane.org 1547850367 17944 195.159.176.226 (18 Jan 2019 22:26:07 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Fri, 18 Jan 2019 22:26:07 +0000 (UTC) User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.2.1 Cc: guile-user@gnu.org To: swedebugia Original-X-From: guile-user-bounces+guile-user=m.gmane.org@gnu.org Fri Jan 18 23:26:03 2019 Return-path: Envelope-to: guile-user@m.gmane.org Original-Received: from lists.gnu.org ([209.51.188.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1gkcaP-0004XI-60 for guile-user@m.gmane.org; Fri, 18 Jan 2019 23:26:01 +0100 Original-Received: from localhost ([127.0.0.1]:48191 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gkccW-0002xf-71 for guile-user@m.gmane.org; Fri, 18 Jan 2019 17:28:12 -0500 Original-Received: from eggs.gnu.org ([209.51.188.92]:41399) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gkcc9-0002x1-N9 for guile-user@gnu.org; Fri, 18 Jan 2019 17:27:51 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1gkcc6-0001I8-O6 for guile-user@gnu.org; Fri, 18 Jan 2019 17:27:49 -0500 Original-Received: from mail-lj1-x235.google.com ([2a00:1450:4864:20::235]:36900) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1gkcc5-00018x-QM for guile-user@gnu.org; Fri, 18 Jan 2019 17:27:46 -0500 Original-Received: by mail-lj1-x235.google.com with SMTP id t18-v6so12939807ljd.4 for ; Fri, 18 Jan 2019 14:27:30 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=subject:references:from:cc:to:message-id:date:user-agent :mime-version:in-reply-to:content-language; bh=wv/J60qDAFSCFrVdHtPOhNBA59T0sAsChGL6UFqmNWU=; b=IS2111bwDC+mbsD/jQWhFBBX9JUf+4jWtK7IAz9tRErH02Lj3oK1f0zs3ejontj/uX b21ukUbaSWOJavHFnDWNnH8BaP6OyUqb9+ForD1AsGdDSzm5ypbyKVfueXT+QIl+leO7 hNTnqtedXWk1/iEdnUWZ/ylEcXe3Yq7aUb+lo+kxG3kmRAW84M9Z9yETMHV5FH3tNH4O ZHv08G1GXXVdZ7A0Kp+vJPxxjt1avdykzzDLQmOC5+5hORLc6alA/9L7KjozHkie2Pfe iS4b4aP5V6Ws1S3YaGtdVGQCy486U0BTXRD9qQwMf9Ca2zdSUtO533HsuqHV+DrY1NK/ WxzA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:references:from:cc:to:message-id:date :user-agent:mime-version:in-reply-to:content-language; bh=wv/J60qDAFSCFrVdHtPOhNBA59T0sAsChGL6UFqmNWU=; b=R5OlM/8amqn5AJAdDPK7IPdtBwFgtDU7oo6xP9440dJ1xKpkWVwT8ZXpUXZehh+uJo sSX4RFSueDFFKOil00BaScCKnydLMCCnihK3Rtb7FAkT4u+trEcGhWYuPNWXk9N7jHqx HWY9sojFHkSDtODMuMg3V7X79yFy5CDkXCn1sbfLVZjrxImFUl0GxIByIZmRJiiwVaBi BIEF4I7dwb5+rVKoywW0H9CO8PfLx3ApsA2/HdcREVXPkZp7vhh9rim142lhw597Q+af xVzJzzi2OZOKhLRRYX3cYiXmjZXoJtuBOjj8L5IGLimei53DDJqi2Uljqcm1QlhVWe8+ PM1Q== X-Gm-Message-State: AJcUukdWQXgaxHxaNVOwkJtU7OS0k1WbFKeJGrY80y32CZki/mMm9CLU fjO/RAH1Iu8rRZW7Himy7YpfxKHRG/8= X-Google-Smtp-Source: ALg8bN6uaQXUy4N9w/AKRBrMGP3Ap3dQ0HC26q6JOliR4pkz1F+TPgubXeGEXu6Ipmyw5NcVqw6R7w== X-Received: by 2002:a2e:3012:: with SMTP id w18-v6mr12649045ljw.75.1547850449092; Fri, 18 Jan 2019 14:27:29 -0800 (PST) Original-Received: from [10.9.0.3] ([185.65.135.170]) by smtp.googlemail.com with ESMTPSA id t22sm1063587lfb.0.2019.01.18.14.27.28 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 18 Jan 2019 14:27:28 -0800 (PST) In-Reply-To: Content-Language: en-US X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 2a00:1450:4864:20::235 X-Content-Filtered-By: Mailman/MimeDel 2.1.21 X-BeenThere: guile-user@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: General Guile related discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guile-user-bounces+guile-user=m.gmane.org@gnu.org Original-Sender: "guile-user" Xref: news.gmane.org gmane.lisp.guile.user:15236 Archived-At: On 1/18/19 6:00 PM, guile-user-request@gnu.org wrote: > For example, what is the difference >> between PEG parsing and parser combinators? >> > Both of them do top-down parsing (try to match the top-level grammar rule, > which is > done by trying to match the lower-level rules which make it up, and so on > until we > get down to single tokens. They differ in the amount of lookahead that's > provided. > > Parser combinators are a facade over recursive-descent parsing, where you > write > a procedure corresponding to each rule in the parser, and you can only look > ahead > one token to guide parsing. This is known as LL(1) parsing. It's very easy > to write > these procedures, so you don't really need an engine like yacc to do the > job. > One character of lookahead isn't usually enough, so there is a lower-level > lexer > that uses (typically) regular expressions to aggregate things like numbers > and > identifiers into single parsing tokens. > > PEG parsing allows *infinite* lookahead, all the way to the end of the > input. > However, it memoizes the results of parsing to avoid reparsing repeatedly. > To make this possible, alternatives are ordered: instead of rules like A | > B, > where the first token has to be enough to tell you if you have an A or a B, > you have rules like A / B, which means "assume it's an A, and only if it > fails, assume it's a B". If A and B overlap, non-PEG parsing has to treat > A | B as a grammar ambiguity and recover, but A / B is never ambiguous. Hi swedebugia, Thanks for that explanation. I understand lookahead, but wait, somehow this description seems to go against what I have seen with parser combinators so far: I could have "or-connected" procedures where each of them requires more than one character or else the parser fails for the string. For example one expects "aa0" the other "aa1", but if neither is present, then the parsing fails. This seems like a lookahead of more than one character is possible with parser combinators. In code using Dave Thompson's parser-combinators library: ~~~ (load "parser-combinators.scm") (use-modules (parser-combinators)              (srfi srfi-41)) (define string->stream (compose list->stream string->list)) ((parse-any (parse-string "aa0") (parse-string "aa1")) (string->stream "aa2"))  ; fails ((parse-any (parse-string "aa0") (parse-string "aa1")) (string->stream "aa1"))  ; succeeds ((parse-any (parse-each (parse-string "aa") (parse-string "0")) (parse-each (parse-string "aa") (parse-string "1"))) (string->stream "aa1")) ; succeeds ~~~ Instead of parsing a string, I could also have a complex parser there, which parses all kinds of numbers and the second thing could be something else, which itself contains an "or-connected" thing inside. It seems that the abstract examples you wrote are already possible with the parser combinators then. I am probably misunderstanding it somewhere. In the examples A and B overlap, so maybe it already has to "recover" (make a decision which branch to go next)? But then again the part after "aa" is not ambiguous, so it does not seem like a problem in the example. Maybe if I had an example like the following: The string "aaa" could become the tokens "aa" and "a" or could become "a" and "aa", basically A is "aa" and B is "a" and so the parsing process has to decide between the two. Would that be such a case, where in PEG parsing there is no ambiguity, but in parser-combinators approach there is? I think the recovery here is, that the first mentioned parser in the code is the first one that is tried and if it fails the next one is tried. First one who succeeds "wins". This seems to be a backtracking approach. The wiki (https://en.wikipedia.org/wiki/Recursive_descent_parser) article says: "Recursive descent with backtracking is a technique that determines which production to use by trying each production in turn. Recursive descent with backtracking is not limited to LL(k) grammars, but is not guaranteed to terminate unless the grammar is LL(k). Even when they terminate, parsers that use recursive descent with backtracking may require exponential time." So I guess "Yes, I can handle these cases using parser-combinators, however, in the general case it may take exponential time." -- Which is usually bad. I wonder how common ambiguity in languages we use are. For example if I wanted to parse markdown, would there be difficult to handle ambiguities? I think there might be some about indentation, for example a block quote inside a list inside a list or things like that, but I guess those could be easily handled by putting in an assumption, that the indentation is always foremost for the list and only after that for anythind indented inside the list, like a blockquote. One could argue, that the specification has a different than or no assumption like that (I did not actually go and read it now) and that the language being parsed is slightly different. Maybe if one wanted 100% compliance PEG parsing might be necessary? Would that be such a case? Regards, Zelphir