unofficial mirror of guile-devel@gnu.org 
* Parsing
@ 2014-09-15 17:36 Ian Grant
  2014-09-15 18:29 ` Parsing Stefan Israelsson Tampe
  2014-10-09 23:20 ` Parsing Mark H Weaver
  0 siblings, 2 replies; 8+ messages in thread
From: Ian Grant @ 2014-09-15 17:36 UTC (permalink / raw)
  To: guile-devel

[-- Attachment #1: Type: text/plain, Size: 2705 bytes --]

LALR parser generators are not the last word in parsing technology. Many
unambiguous context-free languages are not well suited to LALR parsing. The
C language is an exception: the C grammar from the ISO standard can be
typed almost verbatim into a YACC/Bison file, and it will produce just the
one well-documented shift/reduce conflict in the IF ELSE statement, which
is fortuitously resolved the right way. But this is achieved at some cost,
because the C grammar does not actually constrain the parsed statements to
those which are well-formed. In addition to the grammar, the language
definition includes a slew of extra conditions which need to be checked.
These checks must be coded into a post-parsing phase, because they are not
things that the LALR tables can encode.
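For instance, a "declared before use" condition is exactly the kind of thing the tables cannot encode, so it goes in a pass over the parse result. A toy sketch (hypothetical AST shape, Python for brevity, not the ISO C rules):

```python
# A parser accepts any identifier wherever an expression is allowed, so
# "declared before use" must be checked in a separate post-parsing pass.

def check_declared(stmts):
    """stmts: a flattened AST as ('decl', name) or ('use', name) tuples."""
    declared = set()
    errors = []
    for kind, name in stmts:
        if kind == 'decl':
            declared.add(name)
        elif kind == 'use' and name not in declared:
            errors.append(name)
    return errors

# 'y' is used before any declaration: the grammar alone cannot reject
# this program, but the post-parse check can.
print(check_declared([('decl', 'x'), ('use', 'x'), ('use', 'y')]))  # → ['y']
```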

Important examples of grammars which are not LALR parsable include ASN.1
and Standard ML. When Claudio Russo added higher order functors to Moscow
ML, he apparently wasn't able to figure out how to get the LALR parser to
parse the grammar for the language he'd invented. So the Moscow ML parser
produces 37 shift/reduce conflicts. I don't think anyone actually knows how
those higher order declarations are parsed in practice.

But I think the problem of parsing has actually been solved. Tom Ridge's
"p3" combinator parsers with oracles will parse any unambiguous CFG in
polynomial time: at most O(n^3), or possibly O(n^2), I need to check, in
the length n of the input. The advantage of combinator parsers is that
they parse top-down, in a way which follows the natural structure of the
grammar. So errors can be explained to the user (and the developer!) in
terms of the actual derivations (parsing decisions) taken on the input.
This is not the case with bottom-up parsers such as LALR.
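The way the code mirrors the grammar is easy to see in a minimal combinator sketch (just the basic idea, in Python for brevity; this is not Ridge's p3):

```python
# Each parser maps (input, position) to the list of positions it can
# reach, so the combinators compose rule by rule, exactly following the
# shape of the grammar.

def lit(c):
    return lambda s, i: [i + 1] if s[i:i + 1] == c else []

def seq(p, q):
    return lambda s, i: [k for j in p(s, i) for k in q(s, j)]

def alt(p, q):
    return lambda s, i: p(s, i) + q(s, i)

# Grammar: E -> '1' E | '1'   (one or more '1's)
def E(s, i):
    return alt(seq(lit('1'), E), lit('1'))(s, i)

print(E('111', 0))   # → [3, 2, 1]: every prefix of '1's is a parse of E
```

Note that naive combinators like these loop forever on a left-recursive rule such as E -> E '+' E; deciding how to split such productions is precisely the job p3 delegates to the oracle.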

It would be really good to have a p3-like parser generator working in
Guile. The only technical part is integrating the oracle with the
combinator parsers. Ridge's example uses an Earley parser, but he points
out that any parser generator could be used as the oracle. So perhaps LALR
parsers would do. The grammar the oracles need to be able to parse is much
simpler than the general CFGs the combinators work on: the oracles are
just there to make the decisions about where to split the left-recursive
productions. But if the LALR oracle doesn't work out, then it is easy to
write an Earley parser in Scheme. See
http://www.cavar.me/damir/charty/scheme/ which is LGPLed.
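An Earley recognizer really is only a few dozen lines. A minimal sketch (a generic textbook recognizer in Python for brevity, not Ridge's oracle interface) which handles left recursion directly:

```python
# Grammar: dict from nonterminal to a list of productions; a production
# is a tuple of symbols.  A chart item is (lhs, rhs, dot, origin).

def earley(grammar, start, tokens):
    n = len(tokens)
    chart = [set() for _ in range(n + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))
    for i in range(n + 1):
        added = True
        while added:                # close chart[i] under predict/complete
            added = False
            for item in list(chart[i]):
                lhs, rhs, dot, origin = item
                if dot < len(rhs) and rhs[dot] in grammar:        # predict
                    for prod in grammar[rhs[dot]]:
                        new = (rhs[dot], prod, 0, i)
                        if new not in chart[i]:
                            chart[i].add(new); added = True
                elif dot == len(rhs):                             # complete
                    for l2, r2, d2, o2 in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            new = (l2, r2, d2 + 1, o2)
                            if new not in chart[i]:
                                chart[i].add(new); added = True
        if i < n:                                                 # scan
            for lhs, rhs, dot, origin in chart[i]:
                if dot < len(rhs) and rhs[dot] == tokens[i]:
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in chart[n])

# A left-recursive, ambiguous grammar is no problem for Earley:
g = {'E': [('E', '+', 'E'), ('1',)]}
print(earley(g, 'E', list('1+1+1')))   # → True
print(earley(g, 'E', list('1+')))      # → False
```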

See Ridge's papers at http://www.tom-ridge.com/parsing.html

Again, one could implement the recursive parser combinators in
scheme-generated lightning. If this were done abstractly then Guile could
generate fast machine code parsers for any host language.

Ian

[-- Attachment #2: Type: text/html, Size: 3013 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Parsing
  2014-09-15 17:36 Parsing Ian Grant
@ 2014-09-15 18:29 ` Stefan Israelsson Tampe
  2014-09-15 19:59   ` Parsing Ian Grant
  2014-10-09 23:20 ` Parsing Mark H Weaver
  1 sibling, 1 reply; 8+ messages in thread
From: Stefan Israelsson Tampe @ 2014-09-15 18:29 UTC (permalink / raw)
  To: Ian Grant; +Cc: guile-devel

[-- Attachment #1: Type: text/plain, Size: 3753 bytes --]

Hi,

I find that logic programs together with memoization and auto-cut idioms
help in making powerful parsers, and are really fun at least. I implemented
a Python parser using those, and you may find some docs in subsections
inside the guile-log framework. I will try to bring the parser combinators
up to date in a few weeks. To see it in action you can look at
https://gitorious.org/python-on-guile/python-on-guile/source/e32e72dfa7a09c4a791e49d816d52c483d12e5f6:modules/language/python/parser.scm
You may think that it will backtrack horribly, but thanks to memoization it
will not, and it gives the AST quite quickly.
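The effect can be sketched in a few lines (a toy grammar in Python for brevity, not the python-on-guile code): memoization makes the number of distinct parse states linear even when the number of parses grows exponentially.

```python
# Ambiguous grammar S -> 'a' S | 'a' 'a' S | end: the number of ways to
# segment a run of n 'a's into 1s and 2s is a Fibonacci number, yet the
# cache means each position is visited only once.

from functools import lru_cache

def count_parses(s):
    calls = [0]                      # count cache misses
    @lru_cache(maxsize=None)
    def ways(i):
        calls[0] += 1
        if i == len(s):
            return 1
        total = 0
        if s[i:i + 1] == 'a':        # S -> 'a' S
            total += ways(i + 1)
        if s[i:i + 2] == 'aa':       # S -> 'a' 'a' S
            total += ways(i + 2)
        return total
    return ways(0), calls[0]

print(count_parses('a' * 10))   # → (89, 11): 89 parses, only 11 calls
```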

Oh, guile-log is based on C code that is pretty restrictive in the kind of
CPU you can use, basically only amd64 at the moment. That can be changed,
but there is really not much interest in this software beyond being a nice
little playground for this little schemer.

Cheers!

[-- Attachment #2: Type: text/html, Size: 4682 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Parsing
  2014-09-15 18:29 ` Parsing Stefan Israelsson Tampe
@ 2014-09-15 19:59   ` Ian Grant
  2014-09-15 20:20     ` Parsing Ian Grant
                       ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Ian Grant @ 2014-09-15 19:59 UTC (permalink / raw)
  To: Stefan Israelsson Tampe; +Cc: guile-devel

[-- Attachment #1: Type: text/plain, Size: 3220 bytes --]

On Mon, Sep 15, 2014 at 2:29 PM, Stefan Israelsson Tampe <
stefan.itampe@gmail.com> wrote:

> i find that logic programs together with memoization and auto cut ideoms
> helps in making powerful parsers, really fun at least.
>

You should definitely look at Tom's paper then, the "director's cut" is
here:

   http://www.tom-ridge.com/doc/ridge14sle_extended.pdf

He explicitly mentions the analogy with cut, and the parsers are memoised.
He's a theorem prover by trade, and he started this by producing
machine-checked formal proofs in HOL4 of the soundness and completeness of
combinator parsers.

> I implemented a python parser
> using those [...] you may think that it will backtrack horribly, but
> thanks to memoization it will not, and it gives the AST quite quickly.
>

No, I believe you! And you should see what you can do with ambiguous
grammars. Ridge gives some crazy Ackermann-like numbers. The parser's cache
can be used as a compact representation of the otherwise hyper-exponential
size of the ASTs an ambiguous grammar generates.

"For example, for input length 19, there are 441152315040444150 parse trees,
but as with most exponential behaviours it is not feasible to actually
compute all these parse trees. Now let us consider the following parser,
which is identical except that it computes (in all possible ways) the
length of the input parsed [...] Naively we might expect that this also
exhibits exponential behaviour, since presumably the parse trees must all
be generated, and the actions applied. This expectation is wrong. Running
this example parser on an input of size 19 returns in 0.02 seconds with a
single result 19. For an input of size 100, this parser returns a single
result 100 in 5 seconds, and over a range of inputs this parser exhibits
polynomial behaviour rather than exponential behaviour"
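The same trick is easy to reproduce on a toy grammar (not Ridge's example grammar, so the counts differ; Python for brevity): for E -> E E | '1' the number of parse trees of a string of n '1's is a Catalan number, yet a memo table counts them in O(n^3) without building a single tree.

```python
from functools import lru_cache

def count_trees(n):
    @lru_cache(maxsize=None)
    def c(i, j):
        # number of parses of the span [i, j) of a string of n '1's
        total = 1 if j - i == 1 else 0    # production E -> '1'
        for k in range(i + 1, j):         # production E -> E E
            total += c(i, k) * c(k, j)
        return total
    return c(0, n)

print(count_trees(5))    # → 14, the Catalan number C_4
print(count_trees(19))   # an exponential count, computed instantly
```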

This amused me, because I implemented the earlier (pre-oracle) parser he
described in Standard ML, and I did something which I thought later was a
bit stupid: I executed all the semantic actions _during the parse_ and
cached the semantic results in the "memo-pad" along with the syntactic
derivation. It wasn't so slow, because of the memoization. But I am
wondering now whether there is not some application to proof-search that
one could use this technique on. A parse is a derivation of a kind, so with
the right kind of grammar it could be a derivation of some sort of proof,
and if that were an ambiguous grammar then it could explore
non-deterministic relations, like a Prolog interpreter does.

I am talking out of my hat again, but maybe someone who has a clearer idea
of these things will see a spark amongst all the smoke. But I hope no-one
would spend a year or something looking for a way to prove P=NP because of
what I'd said. (Unless they actually proved P=NP, which would be very cool!)

> Oh, guile-log is based on C code that is pretty restrictive in the kind
> of CPU you can use, basically only amd64 atm. That can be changed but
> there is really not much interest in this software beyond being a nice
> little playground for this little schemer.
>

Is it a lot of C? Maybe you could replace that with lightning assembler,
and then it would run on a lot of hardware?

Ian

[-- Attachment #2: Type: text/html, Size: 4178 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Parsing
  2014-09-15 19:59   ` Parsing Ian Grant
@ 2014-09-15 20:20     ` Ian Grant
  2014-09-15 20:38     ` Parsing Stefan Israelsson Tampe
  2014-09-15 20:45     ` Parsing Ian Grant
  2 siblings, 0 replies; 8+ messages in thread
From: Ian Grant @ 2014-09-15 20:20 UTC (permalink / raw)
  To: Stefan Israelsson Tampe; +Cc: guile-devel

[-- Attachment #1: Type: text/plain, Size: 523 bytes --]

On Mon, Sep 15, 2014 at 3:59 PM, Ian Grant <ian.a.n.grant@googlemail.com>
wrote:
> The parser's cache can be used as a compact representation of
> the otherwise hyper-exponential size of the ASTs ...

This is wrong. They're not hyper-exponential, they're just exponential.

> it could explore non-deterministic relations, like a Prolog interpreter
> does.

This is also wrong. They're non-deterministic _functions_ defined by
deterministic relations.

> I am talking out of my hat again, ...

This is definitely right.

Ian

[-- Attachment #2: Type: text/html, Size: 1080 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Parsing
  2014-09-15 19:59   ` Parsing Ian Grant
  2014-09-15 20:20     ` Parsing Ian Grant
@ 2014-09-15 20:38     ` Stefan Israelsson Tampe
  2014-09-17  0:10       ` Parsing Ian Grant
  2014-09-15 20:45     ` Parsing Ian Grant
  2 siblings, 1 reply; 8+ messages in thread
From: Stefan Israelsson Tampe @ 2014-09-15 20:38 UTC (permalink / raw)
  To: Ian Grant; +Cc: guile-devel

[-- Attachment #1: Type: text/plain, Size: 4495 bytes --]

> This amused me, because I implemented the earlier (pre-oracle) parser he
> described in Standard ML, and I did something which I thought later was a
> bit stupid: I executed all the semantic actions _during the parse_ and
> cached the semantic results in the "memo-pad" along with the syntactic
> derivation. It wasn't so slow, because of the memoization. But I am
> wondering now whether there is not some application to proof-search that
> one could use this technique on. A parse is a derivation of a kind, so
> with the right kind of grammar it could be a derivation of some sort of
> proof, and if that were an ambiguous grammar then it could explore
> non-deterministic relations, like a Prolog interpreter does.

I have been wondering if one could speed up proofs myself, and especially
for type provers. Note however, if you check out leanCoP you can see that
some memoisation can be done. But surely one could probably build an
ultra-smart theorem prover by compiling small logic programs and matching
the terms smartly. The problem is that I'm not smart enough to code such a
thing. :-)


[-- Attachment #2: Type: text/html, Size: 6471 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Parsing
  2014-09-15 19:59   ` Parsing Ian Grant
  2014-09-15 20:20     ` Parsing Ian Grant
  2014-09-15 20:38     ` Parsing Stefan Israelsson Tampe
@ 2014-09-15 20:45     ` Ian Grant
  2 siblings, 0 replies; 8+ messages in thread
From: Ian Grant @ 2014-09-15 20:45 UTC (permalink / raw)
  To: Stefan Israelsson Tampe; +Cc: guile-devel

[-- Attachment #1: Type: text/plain, Size: 811 bytes --]

I'll make this the last reply to my own message.

Here is a memoizing (with 'eager' semantic actions) implementation of
Ridge's combinator parsers in 221 lines of Standard ML. Scheme wouldn't be
much more. Adding the binarizing grammar generator and an Earley oracle
wouldn't add more than another 300 lines, perhaps.

https://code.google.com/p/metaprogramming/source/browse/Parser.sml

and there are the parser monads here, which are tiny:

https://code.google.com/p/metaprogramming/source/browse/ParserMonad.sml

and here's the parser-generator, which does a totally unnecessary two-stage
bootstrap that is just showing off. Without this it would be a third of the
length it is.

https://code.google.com/p/metaprogramming/source/browse/ParserModule.sml

Happy software engineering!

Ian

[-- Attachment #2: Type: text/html, Size: 1196 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Parsing
  2014-09-15 20:38     ` Parsing Stefan Israelsson Tampe
@ 2014-09-17  0:10       ` Ian Grant
  0 siblings, 0 replies; 8+ messages in thread
From: Ian Grant @ 2014-09-17  0:10 UTC (permalink / raw)
  To: Stefan Israelsson Tampe; +Cc: guile-devel

[-- Attachment #1: Type: text/plain, Size: 795 bytes --]

On Mon, Sep 15, 2014 at 4:38 PM, Stefan Israelsson Tampe <
stefan.itampe@gmail.com> wrote:

> I have been wondering if one could speed proofs myself, and especially for
> type provers.
>

What do you mean by 'type provers'? Do you mean type inference? I did a
purely functional implementation of Hindley-Milner type inference:
http://prooftoys.org/ian-grant/hm/ and its performance is good. It can
infer the types of pathological examples such as Mairson's, in less than a
minute. The following expression has a type which takes 50,000 lines of
output on an 80-column terminal:

let fun pair x y = fn z => z x y
    val x1 = fn y => pair y y
    val x2 = fn y => x1 (x1 y)
    val x3 = fn y => x2 (x2 y)
    val x4 = fn y => x3 (x3 y)
    val x5 = fn y => x4 (x4 y)
in
    x5 (fn z => z)
end

[-- Attachment #2: Type: text/html, Size: 1869 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Parsing
  2014-09-15 17:36 Parsing Ian Grant
  2014-09-15 18:29 ` Parsing Stefan Israelsson Tampe
@ 2014-10-09 23:20 ` Mark H Weaver
  1 sibling, 0 replies; 8+ messages in thread
From: Mark H Weaver @ 2014-10-09 23:20 UTC (permalink / raw)
  To: Ian Grant; +Cc: guile-devel

Ian Grant <ian.a.n.grant@googlemail.com> writes:

> LALR parser generators are not the last word in parsing technology.
[...]
> Important examples of grammars which are not LALR parsable include
> ASN.1 and Standard ML.
[...]
> But I think the problem of parsing has actually been solved. Tom
> Ridge's "p3" combinator parsers with oracles will parse any
> unambiguous CFG in polynomial time complexity at most O^3, or possibly
> O^2, I need to check, in the length of the input. The advantage of
> combinator parsers is that they parse top-down, in a way which follows
> the natural structure of the grammar. So errors can be explained to
> the user (and the developer!) in terms of the actual derivations
> (parsing decisions) taken on the input. This is not the case with
> bottom-up parsers such as LALR.
>
> It would be really good to have a p3-like parser generator working in
> Guile.

I agree that combinator parsers would be a useful addition to Guile.

> Again, one could implement the recursive parser combinators in
> scheme-generated lightning. If this were done abstractly then Guile
> could generate fast machine code parsers for any host language.

I would prefer to work within the framework of our compiler tower,
and have the combinator parsers generate one of our higher-level
intermediate languages, if not Scheme itself (via syntax-case macros).
This should enable our compiler to analyze the combined program and
perform partial evaluation between the generated parser and other Scheme
code.  For many reasons, it makes sense to keep our code generation in
one place.

      Thanks,
        Mark
        



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2014-10-09 23:20 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-15 17:36 Parsing Ian Grant
2014-09-15 18:29 ` Parsing Stefan Israelsson Tampe
2014-09-15 19:59   ` Parsing Ian Grant
2014-09-15 20:20     ` Parsing Ian Grant
2014-09-15 20:38     ` Parsing Stefan Israelsson Tampe
2014-09-17  0:10       ` Parsing Ian Grant
2014-09-15 20:45     ` Parsing Ian Grant
2014-10-09 23:20 ` Parsing Mark H Weaver

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).