unofficial mirror of emacs-devel@gnu.org 
* Using the wisent parser-generator, as it creates faster parsers
@ 2022-12-26  4:02 ambulajan
  2022-12-26 13:54 ` Lynn Winebarger
  0 siblings, 1 reply; 4+ messages in thread
From: ambulajan @ 2022-12-26  4:02 UTC (permalink / raw)
  To: ericludlam; +Cc: emacs-devel

A follow-up note after reading earlier discussion with a subject "Re:
Why tree-sitter instead of Semantic?".

> If you want to build a parser that sits on the lexer, there is more
> to it, as I recommend using the wisent parser-generator, as it
> creates faster parsers. In the wisent .wy files, you define %tokens
> using a bison-like syntax, and that in turn builds analyzers that
> you include in your lexer.

I doubt that the problem with Semantic parsers was ever that Elisp is
too slow for the purpose.
For me it was writing an LALR parser. Everything else was logical -
lexers, SemanticDB, etc. But a grammar under development that stalls
at every step with shift/reduce and reduce/reduce conflicts is like
pushing against a wall. The LALR algorithm was never meant to be an
interface for a developer; it was rather a workaround for the slow
CPUs and small memories of the 1980s.
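For readers who haven't hit one: the textbook case of such a conflict
is the "dangling else", sketched below in bison-like notation as one
would write it in a wisent .wy file (the rule names are only
illustrative):

```
stmt : "if" expr "then" stmt
     | "if" expr "then" stmt "else" stmt
     | simple_stmt
     ;
```

After reading `if e1 then if e2 then s`, the generator cannot decide
whether a following "else" should be shifted (attaching it to the
inner if) or the inner stmt reduced first (attaching it to the outer
if) - a shift/reduce conflict, even though the intended language is
clear to a human.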

I've written an Earley parser, and so far it looks to be in the same
performance category as the LALR parser (wisent) written in Elisp.
An Earley parser works with any grammar you throw at it. No conflicts.
Each token gets the full context of the rules in effect at that point.
It seems there is no need to build parse trees - the list of states
and tokens can be thought of as a flattened parse tree.
Though there's a lot of testing of this concept still ahead.
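For concreteness, here is a minimal Earley recognizer sketch (Python
rather than Elisp, purely illustrative; the toy grammar and all names
are mine, not the implementation discussed here):

```python
# Minimal Earley recognizer.  An item (lhs, rhs, dot, origin) is a
# grammar rule with a "dot" marking progress and the state where it
# started.  Toy grammar, hypothetical names.
GRAMMAR = {
    "Sum":  [["Sum", "+", "Prod"], ["Prod"]],
    "Prod": [["Prod", "*", "NUM"], ["NUM"]],
}
START = "Sum"

def earley(tokens):
    states = [[] for _ in range(len(tokens) + 1)]
    def add(i, item):
        if item not in states[i]:
            states[i].append(item)
    for rhs in GRAMMAR[START]:
        add(0, (START, tuple(rhs), 0, 0))
    for i in range(len(tokens) + 1):
        for item in states[i]:          # states[i] grows as we iterate
            lhs, rhs, dot, origin = item
            if dot < len(rhs):
                nxt = rhs[dot]
                if nxt in GRAMMAR:      # predict: expand nonterminal
                    for alt in GRAMMAR[nxt]:
                        add(i, (nxt, tuple(alt), 0, i))
                elif i < len(tokens) and tokens[i] == nxt:
                    add(i + 1, (lhs, rhs, dot + 1, origin))  # scan
            else:                       # complete: advance waiting rules
                for l2, r2, d2, o2 in list(states[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        add(i, (l2, r2, d2 + 1, o2))
    return states

states = earley(["NUM", "+", "NUM", "*", "NUM"])
# Recognized iff a completed start rule spans the whole input:
ok = any(l == START and d == len(r) and o == 0
         for l, r, d, o in states[-1])
print(ok)  # True
```

The `states` list it returns is exactly the "list of states per
token" mentioned above: every dotted rule in effect at each input
position, together with its origin.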

Semantic is the only such system that is conceptualized as
approachable (in the Emacs way). Everything else is some combination
of "black boxes" connected with wires.
Lexers can be created with "block" tokens, where a function's body is
consumed as one token. Potentially that allows invoking the parser
with different variants of lexers - one mode with block tokens for
exploring a project's structure, and another mode for indentation and
error checking.
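A block-token lexer can be sketched like this (illustrative Python;
the token names and regexp are made up, not Semantic's lexer API):

```python
# Sketch of a lexer with "block" tokens: a brace-delimited function
# body is consumed as ONE token instead of being tokenized in full.
import re

WORD_OR_PUNCT = re.compile(r"\s*([A-Za-z_]\w*|[(){};])")

def lex_with_blocks(src):
    tokens, i = [], 0
    while i < len(src):
        m = WORD_OR_PUNCT.match(src, i)
        if not m:
            break
        text, i = m.group(1), m.end()
        if text == "{":
            # Consume up to the matching close brace as a single token;
            # the body can be re-lexed later if its contents change.
            depth, start = 1, i
            while i < len(src) and depth:
                if src[i] == "{":
                    depth += 1
                elif src[i] == "}":
                    depth -= 1
                i += 1
            tokens.append(("BLOCK", src[start:i - 1].strip()))
        else:
            tokens.append(("IDENT" if text[0].isalpha() else "PUNCT",
                           text))
    return tokens

print(lex_with_blocks("int main ( ) { return 0 ; }"))
```

The BLOCK token carries the raw body text, so a second lexer mode can
reparse it recursively when needed.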



^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Using the wisent parser-generator, as it creates faster parsers
  2022-12-26  4:02 Using the wisent parser-generator, as it creates faster parsers ambulajan
@ 2022-12-26 13:54 ` Lynn Winebarger
  2022-12-26 18:13   ` Alexandr Karbivnichyi
  0 siblings, 1 reply; 4+ messages in thread
From: Lynn Winebarger @ 2022-12-26 13:54 UTC (permalink / raw)
  To: ambulajan; +Cc: Eric Ludlam, emacs-devel


On Sun, Dec 25, 2022, 11:03 PM <ambulajan@gmail.com> wrote:

> A follow-up note after reading earlier discussion with a subject "Re:
> Why tree-sitter instead of Semantic?".
>
> > If you want to build a parser that sits on the lexer, there is more
> > to it, as I recommend using the wisent parser-generator, as it
> > creates faster parsers. In the wisent .wy files, you define %tokens
> > using a bison-like syntax, and that in turn builds analyzers that
> > you include in your lexer.
>
> I doubt that the problem with Semantic parsers was ever that Elisp is
> too slow for the purpose.
> For me it was writing an LALR parser. Everything else was logical -
> lexers, SemanticDB, etc. But a grammar under development that stalls
> at every step with shift/reduce and reduce/reduce conflicts is like
> pushing against a wall. The LALR algorithm was never meant to be an
> interface for a developer; it was rather a workaround for the slow
> CPUs and small memories of the 1980s.
>

I don't agree - conflicts detected by the parser generator indicate that
distinct ASTs map to the same text, at least according to the reduction
method being used.  I like using LR derivations (including ones produced by
LALR) because of the bottom-up recognition of grammar symbols.


> I've written an Earley parser, and so far it looks to be in the same
> performance category as the LALR parser (wisent) written in Elisp.
> An Earley parser works with any grammar you throw at it. No conflicts.
> Each token gets the full context of the rules in effect at that point.
> It seems there is no need to build parse trees - the list of states
> and tokens can be thought of as a flattened parse tree.
> Though there's a lot of testing of this concept still ahead.
>


The above is not to discourage adding Earley parsing to the toolkit.
However, it just defers the resolution of ambiguities to your code
rather than resolving them by refining your grammar. For specifying a
programming language, that seems to me like an invitation to
difficult-to-debug cases.

Lynn



* Re: Using the wisent parser-generator, as it creates faster parsers
  2022-12-26 13:54 ` Lynn Winebarger
@ 2022-12-26 18:13   ` Alexandr Karbivnichyi
  2022-12-28  3:22     ` Lynn Winebarger
  0 siblings, 1 reply; 4+ messages in thread
From: Alexandr Karbivnichyi @ 2022-12-26 18:13 UTC (permalink / raw)
  To: Lynn Winebarger; +Cc: Eric Ludlam, emacs-devel

On Mon, 2022-12-26 at 08:54 -0500, Lynn Winebarger wrote:
> On Sun, Dec 25, 2022, 11:03 PM <ambulajan@gmail.com> wrote:
> 
> I don't agree - conflicts detected by the parser generator indicate
> that distinct ASTs map to the same text, at least according to the
> reduction method being used.  I like using LR derivations (including
> ones produced by LALR) because of the bottom-up recognition of
> grammar symbols. 

True for LALR parsers: a tried-and-tested system for industrial
compilers built on a formal standard, with a simple grammar that
doesn't change often.
I have explored parsing from time to time with an idealistic view in
mind: a system where, given some grammar, a lot of automatic features
appear almost ready for use.
So, the AST. It must be unambiguous in order to generate assembler
mnemonics, and in LALR it's unavoidable to build an AST in order to
make any sense of the parse.
An Earley parser creates a state (a list of grammar rules) at each
scan step (shift). I discovered that this is all that's needed at this
stage, as long as I don't intend to create an executable. The state
contains all the grammar rules currently in effect, each with a "dot"
inside it (the rule's progress at point), from all possible parse
trees. That is enough to compute indentation, create smart completion,
and compose explicative error messages.
The beauty of Earley's algorithm is that one can write some grammar
and then, keeping the algorithm in mind, write the states by hand in a
text file for each token. The parser concept prints those states in
the *Messages* buffer, providing a human-readable interface. It is
analogous to the 'syntax-ppss' function (the Elisp reader), but with
full grammar context.
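As a concrete miniature of that use of states (hand-written,
hypothetical items in the style of such a text file; a toy if/then
grammar, not Semantic's data structures):

```python
# Interpreting one Earley state by hand: the terminals immediately
# after a dot are exactly the tokens valid at point, usable for
# completion or an "expected X or Y" error message.
TERMINALS = {"if", "then", "else", "+", "NUM", "("}

# State after scanning "if NUM"; each item is (lhs, rhs, dot position).
state = [
    ("stmt", ("if", "expr", "then", "stmt"), 2),
    ("expr", ("expr", "+", "NUM"), 1),
]

def expected_tokens(state):
    """Collect the terminals right after the dot in each item."""
    return {rhs[dot] for lhs, rhs, dot in state
            if dot < len(rhs) and rhs[dot] in TERMINALS}

print(sorted(expected_tokens(state)))  # ['+', 'then']
```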

After all, the possibility of invoking one parser with different
lexers turns out to be an interesting idea, for different purposes.
Some tokens can be just blocks of text, either not contributing to the
current purpose or set aside to be reparsed recursively when the
contents of that block token change during editing.

> The above is not to discourage adding Earley parsing to the toolkit.
> However, it just defers the resolution of ambiguities to your code
> rather than resolving them by refining your grammar. For specifying a
> programming language, that seems to me like an invitation to
> difficult-to-debug cases.

And reciprocally, this is not to discourage the use of existing
tried-and-tested parsers. Semantic allows that a parser may (should)
be a separate library. All in all, it's Emacs.

The GLR algorithm doesn't reject a grammar with conflicts and doesn't
use defaults to resolve them. It is an LALR parser that starts
parallel branches at conflicts. After subsequent shifts only one
branch survives - the conflict is resolved. The problem is that all
parsing branches invoke actions, so all actions must be cached somehow
during parsing, and only the actions of the surviving branch included
in the parser's output.
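That branch-forking-with-cached-actions mechanism can be sketched in a
few lines (a toy illustration in Python, not real GLR with
graph-structured stacks; the grammar and all names are made up):

```python
# Each branch is (stack, pending_actions).  At every step we fork a
# branch for each applicable reduction, then shift the next token on
# all branches; actions are cached per branch and only a surviving
# branch's actions count.
RULES = [("E", ("E", "+", "E")), ("E", ("n",))]

def parse(tokens):
    branches = [((), ())]
    for tok in list(tokens) + [None]:           # None marks end of input
        work, reduced = list(branches), []
        while work:                             # closure over reductions
            stack, acts = work.pop()
            if (stack, acts) in reduced:
                continue
            reduced.append((stack, acts))
            for lhs, rhs in RULES:
                if len(stack) >= len(rhs) and \
                        stack[len(stack) - len(rhs):] == rhs:
                    nb = (stack[:len(stack) - len(rhs)] + (lhs,),
                          acts + ("reduce %s -> %s" %
                                  (lhs, " ".join(rhs)),))
                    if nb not in reduced:
                        work.append(nb)
        if tok is None:
            branches = reduced
        else:                                   # shift on every branch
            branches = [(stack + (tok,), acts + ("shift " + tok,))
                        for stack, acts in reduced]
    # Survivors: branches that reduced the whole input to the start symbol.
    return [acts for stack, acts in branches if stack == ("E",)]

print(len(parse(["n", "+", "n"])))         # 1: one branch survives
print(len(parse(["n", "+", "n", "+", "n"])))  # 2: genuinely ambiguous
```

On "n+n" exactly one branch survives, so its cached actions become the
output. On "n+n+n" two branches survive (left- and right-associative
parses) - precisely the ambiguity an LALR generator would have
reported as a conflict up front.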

In any case, there are no tools in existence other than Emacs that
would even allow reasoning about these ideas with a practical
implementation in view.




* Re: Using the wisent parser-generator, as it creates faster parsers
  2022-12-26 18:13   ` Alexandr Karbivnichyi
@ 2022-12-28  3:22     ` Lynn Winebarger
  0 siblings, 0 replies; 4+ messages in thread
From: Lynn Winebarger @ 2022-12-28  3:22 UTC (permalink / raw)
  To: Alexandr Karbivnichyi; +Cc: Eric Ludlam, emacs-devel


On Mon, Dec 26, 2022, 1:13 PM Alexandr Karbivnichyi <ambulajan@gmail.com>
wrote:

> On Mon, 2022-12-26 at 08:54 -0500, Lynn Winebarger wrote:
> > On Sun, Dec 25, 2022, 11:03 PM <ambulajan@gmail.com> wrote:
> >
> > I don't agree - conflicts detected by the parser generator indicate
> > that distinct ASTs map to the same text, at least according to the
> > reduction method being used.  I like using LR derivations (including
> > ones produced by LALR) because of the bottom-up recognition of
> > grammar symbols.
>
> True for LALR parsers: a tried-and-tested system for industrial
> compilers built on a formal standard, with a simple grammar that
> doesn't change often.
>
I suppose it depends on the tools you are using.  I tend to roll my
own implementation of the LALR(1) construction of DeRemer and Penello,
then use the relations they define to determine which strings will be
parsed ambiguously by the grammar. Then that ambiguity can be
resolved. It's not that different from what you describe doing at
run-time with GLR, just performing the analysis ahead of time (on
LR(0) states, aka item sets) and encoding the disambiguation logic
into the grammar itself.  The DeRemer-Penello construction is just a
control-flow analysis.  I'm not familiar with any work that uses it to
automatically construct test cases that illustrate ambiguities in
grammars, but it would be surprising if there isn't any, given the
tools for test-case generation for programs in general-purpose
programming languages.
There is then the second question of how to map productions of the
disambiguated grammar back to an AST data structure that is much
"flatter".  Usually the grammar for that AST data structure is the
origin of ambiguities in the concrete syntax.

It's a problem in noncommutative algebra.  The ambiguous grammar
presumably specifies the language (strings of terminal symbols, or
tokens) being recognized correctly.  A CFG is a system of equations,
and a parse is a particular solution to those equations for a given
element of that language.  It's just a matter of finding an equivalent
system of equations such that every solution is unique.  At least,
that formulation helps me.
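A worked miniature of that view (my own notation, not from the thread):
take the ambiguous grammar E -> E + E | n, read as equations over
languages, with concatenation noncommutative and union for alternation:

```latex
\begin{align*}
  \text{ambiguous:}\quad E &= E \cdot \{+\} \cdot E \;\cup\; \{n\}
      && n{+}n{+}n \text{ has two solutions (two parses)}\\
  \text{unique:}\quad E &= E \cdot \{+\} \cdot T \;\cup\; T,
      \quad T = \{n\}
      && \text{same language, every solution unique}
\end{align*}
```

The second system is the familiar left-associative refactoring: it
generates the same strings, but each string now has exactly one
solution, i.e. one parse.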

Lynn




Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git
