Overlay mechanic improvements

unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed

* Overlay mechanic improvements
@ 2014-09-19 14:59 Vladimir Kazanov
  2014-09-19 17:22 ` Stefan Monnier
  2014-09-19 18:03 ` Richard Stallman
  0 siblings, 2 replies; 72+ messages in thread
From: Vladimir Kazanov @ 2014-09-19 14:59 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1579 bytes --]

Hello everyone!

Couple of days ago I submitted a question to Stackoverflow regarding an
adequate way to handle many (thousands, that is) overlays. I wanted to know
if there are any alternatives to using overlays (which is slow) for cases
similar to mine. To make a long story short, I am building a generalized
incremental tokenizer, details can be read on Stackoverflow (
http://stackoverflow.com/questions/25915908/best-way-to-save-lots-of-position-pairs-in-emacs-lisp).

It turns out that the only viable alternative to overlays is using text
properties, and text properties were not meant for the purpose. All I need
is an ability to save position pairs, the positions should survive text
insertion/deletion and there should be a way to find those pairs given a
buffer point. Overlay interface is perfect (consider overlays-at,
overlays-in and the dynamic (re-)positioning system).

After answering my question Stefan (Monnier?) mentioned that overlay
behavior can be fixed from N^2 to N*logN, which would make them
asymptotically equivalent to text properties; adding that he has no time to
implement the fix.

I believe that given some time, a few tips from experienced Emacs
developers and enough coffee I can at least try to fulfil the task. I am
not in a hurry, and wanted to get into Emacs internals anyway.

What do you think?

PS I already have my Cedet/Emacs FSF papers signed - sent a minor patch to
Cedet some time ago.

-- 
Yours sincerely,

Vladimir Kazanov

--
С уважением,

Владимир Казанов

[-- Attachment #2: Type: text/html, Size: 1784 bytes --]

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-19 14:59 Overlay mechanic improvements Vladimir Kazanov
@ 2014-09-19 17:22 ` Stefan Monnier
  2014-09-20 13:19   ` Richard Stallman
  2014-09-19 18:03 ` Richard Stallman
  1 sibling, 1 reply; 72+ messages in thread
From: Stefan Monnier @ 2014-09-19 17:22 UTC (permalink / raw)
  To: Vladimir Kazanov; +Cc: emacs-devel

> After answering my question Stefan (Monnier?) mentioned that overlay
> behavior can be fixed from N^2 to N*logN, which would make them
> asymptotically equivalent to text properties; adding that he has no time to
> implement the fix.

So, here's the idea:

Currently overlays are implemented as (two) sorted singly linked lists (one
for overlays_before some position and one for overlay_after that
position, for some quirky definition of "before" and "after").
The function `overlay-recenter' changes the position used for the split
(and is called internally in various situations).

Each overlay is itself implemented with two markers (which keep track of
the overlay-start and overlay-end).  Markers are implemented as
a non-sorted singly linked list of markers.  So every text
insertion/deletion requires O(N) time, where N is the number of markers
since we have to go down that list to update those markers that are
affected by the modification.

You can start in src/buffer.[ch], maybe grepping for overlays_before for
a starting point.

Text-properties, OTOH, are implemented with a (mostly) balanced binary
tree.  This is implemented in src/intervals.[ch].

So we'd like to change overlays so that they don't use markers (and we
don't keep them in two sorted singly-linked lists) any more.  Instead,
we'll store them inside the balanced binary tree used for
text-properties.  I think we can use the "augmented tree" approach
described in https://en.wikipedia.org/wiki/Interval_tree.

To ease up debugging during development, I'd guess the implementation
would first add the new stuff, keeping the old stuff (i.e. add to
Lisp_Overlay whichever fields are needed for the new code, while keeping
the old ones, add needed overlay fields to the intervals tree, but keep
the old fields, the overlays_before etc...).  This way, you can add
consistency checks that make sure the new code computes the same results
as the old code.  And once that works well, we can remove the old code
and old fields.

So I suggest you start looking at src/buffer.[ch] and
src/intervals.[ch].  That should make you ask many more questions.

        Stefan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-19 17:22 ` Stefan Monnier
@ 2014-09-20 13:19   ` Richard Stallman
  2014-09-20 13:37     ` David Kastrup
                       ` (2 more replies)
  0 siblings, 3 replies; 72+ messages in thread
From: Richard Stallman @ 2014-09-20 13:19 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: vekazanov, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

I don't see that it makes sense to store overlays like text properties.
The existing data structure is fine for overlays used as they are intended
to be used.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-20 13:19   ` Richard Stallman
@ 2014-09-20 13:37     ` David Kastrup
  2014-09-21 13:35       ` Richard Stallman
  2014-09-20 15:56     ` Eli Zaretskii
  2014-09-20 19:49     ` Stefan Monnier
  2 siblings, 1 reply; 72+ messages in thread
From: David Kastrup @ 2014-09-20 13:37 UTC (permalink / raw)
  To: emacs-devel

Richard Stallman <rms@gnu.org> writes:

> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
> I don't see that it makes sense to store overlays like text properties.
> The existing data structure is fine for overlays used as they are intended
> to be used.

preview-latex <URL:http://www.gnu.org/software/auctex/preview-latex>
extensively uses overlays (and they can easily grow to a few thousand
per document in a mathematically oriented paper), and it is very clearly
_overlays_ which are wanted here since the associated images are only
correct in the context and position of a particular document and would
not make any sense at all to copy and paste.

Quadratic behavior for the markers used as start- and end markers is
clearly undesirable.

Naturally, this could be addressed by changing the data
structures/algorithms for updating markers (thus killing two birds with
one stone).  But whether solved together with marker behavior or not,
this is worth addressing.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-20 13:37     ` David Kastrup
@ 2014-09-21 13:35       ` Richard Stallman
  2014-09-21 13:52         ` David Kastrup
  2014-09-21 16:07         ` Stefan Monnier
  0 siblings, 2 replies; 72+ messages in thread
From: Richard Stallman @ 2014-09-21 13:35 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    preview-latex <URL:http://www.gnu.org/software/auctex/preview-latex>
    extensively uses overlays (and they can easily grow to a few thousand
    per document in a mathematically oriented paper),

How many overlays would appear on one screen?

    Quadratic behavior for the markers used as start- and end markers is
    clearly undesirable.

The existing code should be fat if you recenter the overlays
frequently at point, unless you have lots of overlays on one screen.
Do you recenter them?  If so, why is it slow?

Why aren't text properties right for this?

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-21 13:35       ` Richard Stallman
@ 2014-09-21 13:52         ` David Kastrup
  2014-09-21 21:48           ` Richard Stallman
  2014-09-21 16:07         ` Stefan Monnier
  1 sibling, 1 reply; 72+ messages in thread
From: David Kastrup @ 2014-09-21 13:52 UTC (permalink / raw)
  To: Richard Stallman; +Cc: emacs-devel

Richard Stallman <rms@gnu.org> writes:

> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
>     preview-latex <URL:http://www.gnu.org/software/auctex/preview-latex>
>     extensively uses overlays (and they can easily grow to a few thousand
>     per document in a mathematically oriented paper),
>
> How many overlays would appear on one screen?

Probably rarely more than a few dozen (that would mostly be the case
when using lots of inline math for small expressions like $a_i$).

>     Quadratic behavior for the markers used as start- and end markers is
>     clearly undesirable.
>
> The existing code should be fat if you recenter the overlays
> frequently at point, unless you have lots of overlays on one screen.
> Do you recenter them?

Not that I know of.

> If so, why is it slow?
>
> Why aren't text properties right for this?

Because you don't want _anything_ that text properties do.  You don't
want to have the stuff cut&paste, you don't want to have the buffer
modified because images are switched on and off, you most definitely
don't want ever to split its identity in two if there are insertions in
the middle, you don't want anything inserted anywhere inheriting
anything from it.

"We did not bother making overlays work efficiently" is not the right
criterion for choosing between overlays and text properties.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-21 13:52         ` David Kastrup
@ 2014-09-21 21:48           ` Richard Stallman
  2014-09-21 22:06             ` David Kastrup
  0 siblings, 1 reply; 72+ messages in thread
From: Richard Stallman @ 2014-09-21 21:48 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    Because you don't want _anything_ that text properties do.  You don't
    want to have the stuff cut&paste,

Why not?  If the images are thought of as part of the buffer contents,
I'd expect it to be desirable that they follow text that is copied.
If you kill text that contains some of these images of math
and then yank  it back, shouldn't it come back with the images?

This is what lead me to think of text properties first for this job.

     you don't want to have the buffer
    modified because images are switched on and off,

We can provide a feature for turning off and on the display
of these images without changing the text properties themselves.
I think there already is a way.

		you don't want anything inserted anywhere inheriting
    anything from it.

We already have ways to specify no inheritance.

    you most definitely
    don't want ever to split its identity in two if there are insertions in
    the middle,

That's true.  However, isn't such insertion anomalous anyway?  How bad
is it, if that anomalous act causes the image to appear twice?

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-21 21:48           ` Richard Stallman
@ 2014-09-21 22:06             ` David Kastrup
  2014-09-22 23:11               ` Richard Stallman
  0 siblings, 1 reply; 72+ messages in thread
From: David Kastrup @ 2014-09-21 22:06 UTC (permalink / raw)
  To: Richard Stallman; +Cc: emacs-devel

Richard Stallman <rms@gnu.org> writes:

> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
>     Because you don't want _anything_ that text properties do.  You don't
>     want to have the stuff cut&paste,
>
> Why not?  If the images are thought of as part of the buffer contents,

They most emphatically aren't.  Search and replace works on the
underlying text (invalidating the image when a replace happens),
incremental search works on the underlying text (temporarily removing
the image), saving saves the underlying text, all editing commands apply
to the underlying text.

> I'd expect it to be desirable that they follow text that is copied.
> If you kill text that contains some of these images of math
> and then yank  it back, shouldn't it come back with the images?

No.  The same input text at a different location in the same document
might show a different equation number.  In a different document it
might not compile at all or have a different size.

> This is what lead me to think of text properties first for this job.
>
>      you don't want to have the buffer
>     modified because images are switched on and off,
>
> We can provide a feature for turning off and on the display
> of these images without changing the text properties themselves.
> I think there already is a way.
>
> 		you don't want anything inserted anywhere inheriting
>     anything from it.
>
> We already have ways to specify no inheritance.
>
>     you most definitely
>     don't want ever to split its identity in two if there are insertions in
>     the middle,
>
> That's true.  However, isn't such insertion anomalous anyway?  How bad
> is it, if that anomalous act causes the image to appear twice?

Bad.  Text properties and overlays have different semantics.  You
basically say "let's ignore all the mismatching semantics because the
performance of overlays is bad".

That does not make sense.  If it did, we would not have overlays in the
first place.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-21 22:06             ` David Kastrup
@ 2014-09-22 23:11               ` Richard Stallman
  2014-09-22 23:50                 ` David Kastrup
  0 siblings, 1 reply; 72+ messages in thread
From: Richard Stallman @ 2014-09-22 23:11 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    > Why not?  If the images are thought of as part of the buffer contents,

    They most emphatically aren't.  Search and replace works on the
    underlying text (invalidating the image when a replace happens),
    incremental search works on the underlying text (temporarily removing
    the image), saving saves the underlying text, all editing commands apply
    to the underlying text.

Could you explain to me why this is right?
I never used that feature, so I don't understand the context.

      You
    basically say "let's ignore all the mismatching semantics because the
    performance of overlays is bad".

No, I am not saying that.  I am saying something else:

    > I'd expect it to be desirable that they follow text that is copied.
    > If you kill text that contains some of these images of math
    > and then yank  it back, shouldn't it come back with the images?

    > This is what lead me to think of text properties first for this job.

Since you say that is not the desired behavior, I'd like to understand
why not.

What is a scenario for editing the text that is under the image
overlay, and what is the right behavior?

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.




^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-22 23:11               ` Richard Stallman
@ 2014-09-22 23:50                 ` David Kastrup
  2014-09-23 19:15                   ` Richard Stallman
  0 siblings, 1 reply; 72+ messages in thread
From: David Kastrup @ 2014-09-22 23:50 UTC (permalink / raw)
  To: Richard Stallman; +Cc: emacs-devel

Richard Stallman <rms@gnu.org> writes:

> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
>     > Why not?  If the images are thought of as part of the buffer contents,
>
>     They most emphatically aren't.  Search and replace works on the
>     underlying text (invalidating the image when a replace happens),
>     incremental search works on the underlying text (temporarily removing
>     the image), saving saves the underlying text, all editing commands apply
>     to the underlying text.
>
> Could you explain to me why this is right?  I never used that feature,
> so I don't understand the context.

The text that is being edited is LaTeX source code.  preview-latex
displays previews of the edited material in the source buffer.  The
source are things like $\sum_{i=0}^\infty \frac{1}{i+i}$ that are easy
to enter and edit and somewhat cumbersome to read in context. Replacing
those fragments with the rendered images in the source code while the
particular corresponding source code is not being edited makes it easier
to peruse the mathematical meaning of the source while continuing to
compose further text and mathematics.

This is a visual aid, not something changing the text in any manner.
Like a tooltip, it has to get out of the way whenever you actually want
to do something to the source code of the text.

> No, I am not saying that.  I am saying something else:
>
>     > I'd expect it to be desirable that they follow text that is copied.
>     > If you kill text that contains some of these images of math
>     > and then yank  it back, shouldn't it come back with the images?
>
>     > This is what lead me to think of text properties first for this job.
>
> Since you say that is not the desired behavior, I'd like to understand
> why not.
>
> What is a scenario for editing the text that is under the image
> overlay,

Any editing at all.  The image is a visualization, you cannot edit that.
It is output, not input.

> and what is the right behavior?

The same as without preview-latex.  This is just like a document
previewer, only scattered over the corresponding input in order to save
place and concentration.  It's pointless for plain text: hopefully you
have selected a more readable font in your editor for that on a
low-contrast 100dpi color screen than some washed-out downsampling of a
font intended to look best printed with 1200dpi on a high-contrast
two-color output device.  But for stuff that becomes harder to follow
when looking at its source code rather than the rendered rendition, it's
helpful.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-22 23:50                 ` David Kastrup
@ 2014-09-23 19:15                   ` Richard Stallman
  0 siblings, 0 replies; 72+ messages in thread
From: Richard Stallman @ 2014-09-23 19:15 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    > Since you say that is not the desired behavior, I'd like to understand
    > why not.
    >
    > What is a scenario for editing the text that is under the image
    > overlay,

    Any editing at all.  The image is a visualization, you cannot edit that.
    It is output, not input.

I think we are miscommunicating here.  Could you answer in a less
terse way?

Is the idea that the user goes on editing, and the overlay images
stick around even if they become invalid, and the user knows
to ignore them?

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-21 13:35       ` Richard Stallman
  2014-09-21 13:52         ` David Kastrup
@ 2014-09-21 16:07         ` Stefan Monnier
  2014-09-21 16:14           ` David Kastrup
                             ` (2 more replies)
  1 sibling, 3 replies; 72+ messages in thread
From: Stefan Monnier @ 2014-09-21 16:07 UTC (permalink / raw)
  To: Richard Stallman; +Cc: David Kastrup, emacs-devel

> The existing code should be fat if you recenter the overlays
> frequently at point, unless you have lots of overlays on one screen.
> Do you recenter them?  If so, why is it slow?

Reasons for overlays being slow are:
- if you have N overlays, you have 2N markers, so every
  insertion/deletion of text will involve a traversal of those 2N elements.
- when we need to move from bytepos to charpos (or vice versa), we go
  through the 2N markers again.
- when the Elisp code didn't know to overlay-recenter, or when overlays
  are added/removed from the "wrong" place (w.r.t to the overlay
  "center"), every overlay creation/removal takes O(N) time.

We rarely suffer from overlay's performance issues because most codes
that could suffer from it simply avoid using overlays (or they try to
minimize the number of overlays used, e.g. by flushing those overlays
currently not visible), even when they would be the right tool.  Or they
add calls to overlay-recenter when the performance problem is diagnosed.

Moving overlays into a balanced binary tree will remove those
algorithmic performance problems.  I'm not sue why you'd be opposed,
since it will provide the exact same functionality anyway.  It's just an
internal detail of implementation.

Note that I suggest to use the balanced binary tree used by
text-properties, but there is no need for the two trees to be linked.
We could as well use a separate tree for overlays.  I'm not sure which
of the two options is best in this regard, but at least I don't see any
obvious reason why re-using the existing intervals.c tree wouldn't
work well.

Of course, I don't know for a fact that the end result will be better
than what we currently have.  We will of course need to experiment with
it a bit.

> Why aren't text properties right for this?

There are various things which are difficult to do with text-properties.
E.g. the lack of after/before_string is the most obvious difference.

        Stefan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-21 16:07         ` Stefan Monnier
@ 2014-09-21 16:14           ` David Kastrup
  2014-09-21 21:48             ` Richard Stallman
  2014-09-21 16:56           ` Eli Zaretskii
  2014-09-21 21:48           ` Richard Stallman
  2 siblings, 1 reply; 72+ messages in thread
From: David Kastrup @ 2014-09-21 16:14 UTC (permalink / raw)
  To: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> Why aren't text properties right for [preview-latex]?
>
> There are various things which are difficult to do with text-properties.
> E.g. the lack of after/before_string is the most obvious difference.

Uh, that too.  preview-latex makes use of them as well for several
things.  Supporting after/before_string with text properties would be
really weird as inserting text in the middle of some text with the
respective properties would cause those strings to appear ex nihilo.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-21 16:14           ` David Kastrup
@ 2014-09-21 21:48             ` Richard Stallman
  2014-09-21 22:19               ` David Kastrup
  0 siblings, 1 reply; 72+ messages in thread
From: Richard Stallman @ 2014-09-21 21:48 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

      Supporting after/before_string with text properties would be
    really weird as inserting text in the middle of some text with the
    respective properties would cause those strings to appear ex nihilo.

That is true -- but does it matter, given that inserting text there is
a mistake anyway?

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.




^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-21 21:48             ` Richard Stallman
@ 2014-09-21 22:19               ` David Kastrup
  2014-09-23 19:16                 ` Richard Stallman
  0 siblings, 1 reply; 72+ messages in thread
From: David Kastrup @ 2014-09-21 22:19 UTC (permalink / raw)
  To: Richard Stallman; +Cc: emacs-devel

Richard Stallman <rms@gnu.org> writes:

> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
>       Supporting after/before_string with text properties would be
>     really weird as inserting text in the middle of some text with the
>     respective properties would cause those strings to appear ex nihilo.
>
> That is true -- but does it matter, given that inserting text there is
> a mistake anyway?

Inserting a text in the middle of the text underlying a preview-latex
image is not a mistake but the intended mode of operation.  I think you
are fundamentally misunderstanding what preview-latex is.  It is a
previewing mode that previews selected elements (like math) of
typesetting a LaTeX document right in the source code so that you do not
need to keep switching your focus between source code and preview window
when proofreading/deriving your math.

The whole point is to work with an editable source and have the images
as a transient view.  In order to make movement unsurprising, only some
movement commands (like cursor left/right) will actually disable the
move-out-of-images-in-the-main-loop behavior of Emacs in order to "open"
images and display the source: most commands will avoid the inside of
images in the way Emacs usually does.  But that does not extend to any
_modifying_ commands or to isearch.  And of course not to macro
execution.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-21 22:19               ` David Kastrup
@ 2014-09-23 19:16                 ` Richard Stallman
  2014-09-23 19:27                   ` David Kastrup
  0 siblings, 1 reply; 72+ messages in thread
From: Richard Stallman @ 2014-09-23 19:16 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    The whole point is to work with an editable source and have the images
    as a transient view.  In order to make movement unsurprising, only some
    movement commands (like cursor left/right) will actually disable the
    move-out-of-images-in-the-main-loop behavior of Emacs in order to "open"
    images and display the source: most commands will avoid the inside of
    images in the way Emacs usually does.

What is the _desired_ behavior when the user edits text that corresponds
to an image?  Should the image disappear?

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.




^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-23 19:16                 ` Richard Stallman
@ 2014-09-23 19:27                   ` David Kastrup
  2014-09-28 23:24                     ` Richard Stallman
  0 siblings, 1 reply; 72+ messages in thread
From: David Kastrup @ 2014-09-23 19:27 UTC (permalink / raw)
  To: Richard Stallman; +Cc: emacs-devel

Richard Stallman <rms@gnu.org> writes:

> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
>     The whole point is to work with an editable source and have the images
>     as a transient view.  In order to make movement unsurprising, only some
>     movement commands (like cursor left/right) will actually disable the
>     move-out-of-images-in-the-main-loop behavior of Emacs in order to "open"
>     images and display the source: most commands will avoid the inside of
>     images in the way Emacs usually does.
>
> What is the _desired_ behavior when the user edits text that corresponds
> to an image?  Should the image disappear?

Editing involves making the image disappear anyway since otherwise you
don't have the text for editing visible (either one or the other is
displayed).  There are exceptions, like when typing C-d with the cursor
on an image.  In that case, the first character of the underlying text
will get deleted and, since the overlay content no longer corresponds to
the image, the image will get undisplayed and the source text (with the
first character already deleted) will reappear.

When the text corresponding to an image is displayed, the image itself
is not visible.  Instead the text is marked in a recognizable color and
it is preceded by a clickable icon with a generic "image" look.  Once
you finished editing, you can click that icon, the image will get
regenerated from the text covered by the overlay and will again replace
it in the source view.  Or you can use key sequences for regeneration.
Or just let the text be text and edit other things.  The overlays stick
around unless explicitly removed in order to make it easier to
regenerate an image based on exactly the region covered by the overlay.

It's not that hard to try that out yourself if you have LaTeX installed
on your system and some files preferably using mathematics written in
it.  Just use ELPA to install AUCTeX, open a LaTeX file, and use the
Preview menu (should be available by default when AUCTeX is installed
via ELPA, I think).

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-23 19:27                   ` David Kastrup
@ 2014-09-28 23:24                     ` Richard Stallman
  2014-09-29  5:45                       ` David Kastrup
  0 siblings, 1 reply; 72+ messages in thread
From: Richard Stallman @ 2014-09-28 23:24 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    Editing involves making the image disappear anyway since otherwise you
    don't have the text for editing visible (either one or the other is
    displayed).

Does this mean you turn off display of the image on the overlay when
the text in that region is changed?

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.




^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-28 23:24                     ` Richard Stallman
@ 2014-09-29  5:45                       ` David Kastrup
  2014-09-29 20:48                         ` Richard Stallman
  0 siblings, 1 reply; 72+ messages in thread
From: David Kastrup @ 2014-09-29  5:45 UTC (permalink / raw)
  To: Richard Stallman; +Cc: emacs-devel

Richard Stallman <rms@gnu.org> writes:

> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
>     Editing involves making the image disappear anyway since otherwise you
>     don't have the text for editing visible (either one or the other is
>     displayed).
>
> Does this mean you turn off display of the image on the overlay when
> the text in that region is changed?

Very much so, yes.  Changing the text in the region would be a pain if
you could not see it, and changing it breaks the correspondence between
text and image anyway.

-- 
David Kastrup



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-29  5:45                       ` David Kastrup
@ 2014-09-29 20:48                         ` Richard Stallman
  2014-09-30  1:21                           ` Stephen J. Turnbull
  2014-09-30  5:52                           ` David Kastrup
  0 siblings, 2 replies; 72+ messages in thread
From: Richard Stallman @ 2014-09-29 20:48 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    > Does this mean you turn off display of the image on the overlay when
    > the text in that region is changed?

    Very much so, yes.  Changing the text in the region would be a pain if
    you could not see it, and changing it breaks the correspondence between
    text and image anyway.

I see.

But you keep the overlay in existence -- how come?

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.




^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-29 20:48                         ` Richard Stallman
@ 2014-09-30  1:21                           ` Stephen J. Turnbull
  2014-09-30  8:43                             ` David Kastrup
  2014-09-30 19:23                             ` Richard Stallman
  2014-09-30  5:52                           ` David Kastrup
  1 sibling, 2 replies; 72+ messages in thread
From: Stephen J. Turnbull @ 2014-09-30  1:21 UTC (permalink / raw)
  To: rms; +Cc: David Kastrup, emacs-devel

Richard Stallman writes:

 >     > Does this mean you turn off display of the image on the
 >     > overlay when the text in that region is changed?
 > 
 >     Very much so, yes.  Changing the text in the region would be a
 >     pain if you could not see it, and changing it breaks the
 >     correspondence between text and image anyway.
 > 
 > I see.
 > 
 > But you keep the overlay in existence -- how come?

Why do you care?  David knows his business, isn't that good enough?
(These are real questions, not an attack.  It would be easier to
answer your questions informatively if we knew where you are going.)

In most cases the object is something, such as an equation, which is
marked by explicit delimiters, so unless you actually delete the
delimiters or add new ones, it makes sense to keep the overlay which
provably still contains the correct boundaries[1], rather than re-parse
the relevant region, and change only the image associated with the
delimited object at next request for the formatted display.

Unlike a fundamentally WYSIWYG application, preview-latex provides no
facility for displaying syntactically incorrect formatted output.[2]
Instead, preview-latex gets out of the user's way, and allows them to
take advantage of AUCTeX's[3] features to ensure syntactically correct
source editing, then provides immediate feedback as to whether the
input LaTeX code is *semantically* correct.  Ie, the source entered
displays as the user expects.

Because it is based on the device-independent rendering system which
will be used to produce formatted content, this feedback is as
accurate as possible in most cases.  Compare with WYSIWYG office
suites, which typically not only produce ugly math, but produce
*different* ugliness on different devices.  They even auto-bogotify
macro features like whether a table gets split across pages depending
on whether it's output to screen or to a printer.

In preview-latex, Emacs is at its text-editing best.

Footnotes: 
[1]  And it in theory could provide "protection" for the delimiters
against inadvertant corruption, although AFAIK such a feature has not
been added.

[2]  Eg, if you are entering a superscript, theoretically a WYSIWYG
editor could provide a transient raised text-entry field and move the
cursor there.  I don't know if any actually do that.

[3]  AFAIK, preview-latex is heavily dependent on AUCTeX features,
although it probably could be ported to other TeX modes, even *ML
modes, with a more or less great expenditure of effort.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-30  1:21                           ` Stephen J. Turnbull
@ 2014-09-30  8:43                             ` David Kastrup
  2014-09-30 10:35                               ` Rasmus
  2014-09-30 19:23                             ` Richard Stallman
  1 sibling, 1 reply; 72+ messages in thread
From: David Kastrup @ 2014-09-30  8:43 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: rms, emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Richard Stallman writes:
>
>  >     > Does this mean you turn off display of the image on the
>  >     > overlay when the text in that region is changed?
>  > 
>  >     Very much so, yes.  Changing the text in the region would be a
>  >     pain if you could not see it, and changing it breaks the
>  >     correspondence between text and image anyway.
>  > 
>  > I see.
>  > 
>  > But you keep the overlay in existence -- how come?
>
> Why do you care?  David knows his business, isn't that good enough?
> (These are real questions, not an attack.  It would be easier to
> answer your questions informatively if we knew where you are going.)

Well, there are, of course, places where one might want to go.
Historically, preview-latex was inspired by Emacs 21'st upcoming ability
to integrate EPS images into a buffer, with a proof-of-concept up in
about a week.  Getting the whole thing to work as one smooth application
took quite more than a man-year and a number of collaborateurs, however.
And part of the reason for that is that is that a lot of functionality
had to be implemented from scratch to be usable: static screenshots
don't really reflect the interactive nature and efficiency with which
one could work on complex documents with 1000+ equations while having a
single 200MHz processor at one's disposal.

One key element is using DSC comments in the generated single PostScript
file for out-of-order rendering of images: Emacs entertains a pipeline
into a Ghostscript process that first gets all on-screen images
converted into PNG graphics before it reverts to conversion of the
remaining off-screen material.  If the on-screen regions change, the
rendering pipeline switches the workload accordingly.  With today's
processors, it is actually hard to see that effect in action, but at the
time it was implemented, scrolling around in a partially rendered buffer
lead to an xroach-like effect where changes in the window lead to
new images scurrying in until the display was static again.

This sort of display-oriented rendering pipeline is not in any way
specific to working with TeX/LaTeX.  It is infrastructure that would be
nice for a whole lot of things.

In its current form, the code works in AUCTeX's LaTeX mode.  It does not
work for Emacs' builtin TeX modes, not even its LaTeX mode (mostly
because of being tied in the process management and keymaps of AUCTeX).
It does not work for AUCTeX's TeX or Texinfo modes (and the latter would
be nice for working on GNU LilyPond which uses lots of images in its
manuals generated from source code in the manuals).  It does not work
for calc's TeX output (which would increase its readability and
usability significantly).

It does not work for other image-generating modes like LilyPond-mode.
I think that something like some Maxima-mode or similar copied an early
version of preview-latex in order to implement some similar
functionality.  It's probably fallen victim to bitrot long ago.

preview-latex does an excellent job, but much of its code is not at all
related to the AUCTeX user interface or the LaTeX input language, and
yet it will work for nothing but the AUCTeX user interface and the LaTeX
input language.

And without generally available facilities to do that sort of work, it
will remain a singularly useful application without peers.  Because it's
far too much work to do a peer from scratch for similar needs.  It was
supposed to make it easier for me to work on my PhD thesis, but I ended
up doing it _instead_ of my PhD thesis.

The main shortcoming of preview-latex in my book is that it stands out.
It should be one of possibly many thin applications on top of rendering
frameworks provided by Emacs proper.

Like I said: the barely-workable proof-of-concept (good for screenshots
but not actual work) took about a week.  The ideal state of Emacs would
be that implementing similar functionality for a comparable text input
language that can be rendered into PostScript graphics with DSC comments
(a non-trivial subset) or into PNG images or whatever, would take about
that long.

That would get us to see a whole lot more editing modes with
supplementary graphic support.

That this shortcoming of Emacs can be cured on the Elisp side alone (as
preview-latex does not need more than a working Emacs 21 binary) does
not make it less of a shortcoming.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-30  8:43                             ` David Kastrup
@ 2014-09-30 10:35                               ` Rasmus
  2014-09-30 14:22                                 ` Eli Zaretskii
                                                   ` (2 more replies)
  0 siblings, 3 replies; 72+ messages in thread
From: Rasmus @ 2014-09-30 10:35 UTC (permalink / raw)
  To: emacs-devel

Hi,

This is a bit OT, sorry.

David Kastrup <dak@gnu.org> writes:

> [...]
> It does not work for other image-generating modes like LilyPond-mode.
> I think that something like some Maxima-mode or similar copied an early
> version of preview-latex in order to implement some similar
> functionality.  It's probably fallen victim to bitrot long ago.

Imaxima has a similar functionality.  I don't know if it uses
preview-latex internally.

> preview-latex does an excellent job, but much of its code is not at all
> related to the AUCTeX user interface or the LaTeX input language, and
> yet it will work for nothing but the AUCTeX user interface and the LaTeX
> input language.
 >
> And without generally available facilities to do that sort of work, it
> will remain a singularly useful application without peers.  Because it's
> far too much work to do a peer from scratch for similar needs.  It was
> supposed to make it easier for me to work on my PhD thesis, but I ended
> up doing it _instead_ of my PhD thesis.
>
> The main shortcoming of preview-latex in my book is that it stands out.
> It should be one of possibly many thin applications on top of rendering
> frameworks provided by Emacs proper.

I agree wholeheartedly.  Org has a similar mechanism as well, but it
*sucks* compared to preview-latex.  I tried to get
preview-latex working with Org (which uses LaTeX-syntax for math), but
it was non-trivial.  Thus, I don't interact much with preview-latex
these days, but I have very fond memories of using it.

If you ever decide to allocate time to generalize preview-latex and
have it shipped with Emacs-core I'd happily support that work!

—Rasmus

PS: AUCTeX and preview-latex was what got me into Emacs initially.  At
the time, and probably still, there was no LaTeX editor that was on
par with AUCTeX.

Compiling the newest version of AUCTeX pushed me towards GNU/Linux
(the windows version was .01 behind and it was Really Hardᵀᴹ to
compile it on Windows).

-- 
Need more coffee. . .

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-30 10:35                               ` Rasmus
@ 2014-09-30 14:22                                 ` Eli Zaretskii
  2014-09-30 16:20                                   ` David Kastrup
  2014-09-30 14:32                                 ` Stefan Monnier
  2014-10-02 16:12                                 ` Uwe Brauer
  2 siblings, 1 reply; 72+ messages in thread
From: Eli Zaretskii @ 2014-09-30 14:22 UTC (permalink / raw)
  To: Rasmus; +Cc: emacs-devel

> From: Rasmus <rasmus@gmx.us>
> Date: Tue, 30 Sep 2014 12:35:08 +0200
> 
> > preview-latex does an excellent job, but much of its code is not at all
> > related to the AUCTeX user interface or the LaTeX input language, and
> > yet it will work for nothing but the AUCTeX user interface and the LaTeX
> > input language.
>  >
> > And without generally available facilities to do that sort of work, it
> > will remain a singularly useful application without peers.  Because it's
> > far too much work to do a peer from scratch for similar needs.  It was
> > supposed to make it easier for me to work on my PhD thesis, but I ended
> > up doing it _instead_ of my PhD thesis.
> >
> > The main shortcoming of preview-latex in my book is that it stands out.
> > It should be one of possibly many thin applications on top of rendering
> > frameworks provided by Emacs proper.
> 
> I agree wholeheartedly.  Org has a similar mechanism as well, but it
> *sucks* compared to preview-latex.  I tried to get
> preview-latex working with Org (which uses LaTeX-syntax for math), but
> it was non-trivial.  Thus, I don't interact much with preview-latex
> these days, but I have very fond memories of using it.
> 
> If you ever decide to allocate time to generalize preview-latex and
> have it shipped with Emacs-core I'd happily support that work!

Feel free to expand on the problem(s) you hint on, because just by
reading this exchange, I don't really understand what kind of
facilities are missing in the core.  (I don't use preview-latex and
probably never will.)



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-30 14:22                                 ` Eli Zaretskii
@ 2014-09-30 16:20                                   ` David Kastrup
  2014-09-30 16:35                                     ` Eli Zaretskii
  0 siblings, 1 reply; 72+ messages in thread
From: David Kastrup @ 2014-09-30 16:20 UTC (permalink / raw)
  To: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Rasmus <rasmus@gmx.us>
>> Date: Tue, 30 Sep 2014 12:35:08 +0200
>> 
>> > preview-latex does an excellent job, but much of its code is not at all
>> > related to the AUCTeX user interface or the LaTeX input language, and
>> > yet it will work for nothing but the AUCTeX user interface and the LaTeX
>> > input language.
>>  >
>> > And without generally available facilities to do that sort of work, it
>> > will remain a singularly useful application without peers.  Because it's
>> > far too much work to do a peer from scratch for similar needs.  It was
>> > supposed to make it easier for me to work on my PhD thesis, but I ended
>> > up doing it _instead_ of my PhD thesis.
>> >
>> > The main shortcoming of preview-latex in my book is that it stands out.
>> > It should be one of possibly many thin applications on top of rendering
>> > frameworks provided by Emacs proper.
>> 
>> I agree wholeheartedly.  Org has a similar mechanism as well, but it
>> *sucks* compared to preview-latex.  I tried to get
>> preview-latex working with Org (which uses LaTeX-syntax for math), but
>> it was non-trivial.  Thus, I don't interact much with preview-latex
>> these days, but I have very fond memories of using it.
>> 
>> If you ever decide to allocate time to generalize preview-latex and
>> have it shipped with Emacs-core I'd happily support that work!
>
> Feel free to expand on the problem(s) you hint on, because just by
> reading this exchange, I don't really understand what kind of
> facilities are missing in the core.  (I don't use preview-latex and
> probably never will.)

Since preview-latex is trivial to install via ELPA (as it's included in
AUCTeX), I am somewhat at a loss figuring out what point people are
trying to make by stating that they are not going to bother looking at
it, preferring me to type descriptions for hours.

There is a paper
<URL:https://www.tug.org/TUGboat/Articles/tb23-1/kastrup.pdf> sketching
the working mode of preview-latex some more.

But the problem is that static images/screenshots are not really much
help in seeing how the generation of images is actually integrated into
the interactive workflow.  The screenshots would not have looked
significantly different after the first prototype was running, so they
don't really do a lot more than give the information "Emacs can show
images".

I found some conference video Kaveh Bazargan made of me at
<URL:https://www.youtube.com/watch?v=avn7-aVpP9I#t=500>, picture quality
pretty flaky.  While it does not really say a lot about the Emacs parts
of the process, it is pretty obvious that the editing is not at all
disturbed by the images and that any images are regenerated
instantaneously without any obvious delays or interference with the
editing workflow.

That's probably 2007 or 2008, so the processor I use is probably
something like 600MHz.

Emacs does not offer any default support for rendering images on-demand
from source code in a buffer.  The pipelines that preview-latex can be
something like

buffer -> LaTeX -> Dvips -> Ghostscript -> PNG
buffer -> PDFLaTeX -> pdftodsc -> Ghostscript -> PNG
buffer -> LaTeX -> Dvipng -> PNG

It will also run several of those in parallel, like running
LaTeX->Dvipng but falling back to Dvips->Ghostscript for those images
which contain embedded PostScript.  The Ghostscript pipeline will render
the on-screen images first to make for better interactive response.  In
addition to the images, LaTeX produces geometry, baseline, source
location and various other info (for example, default font size in order
to match the rendering scale to the current default screen font) which
is parsed by Emacs and used for integrating the resulting images in the
right place and scale into the originating buffer.

So there are prioritized rendering pipelines and information side paths
that deliver information crucial for tying the image stream back into
the proper buffer locations, with proper scale and baseline.

There is other stuff like being able to paste text plus images into MIME
mails (with the obvious loss of baseline and other formatting
information, but for display math it is somewhat helpful), and desktop
mode saving and restoring the images associated with a buffer.

All of that is not specific to LaTeX, and Emacs does not really provide
anything helpful for creating that kind of facility apart from the
display feature itself and the possibility to run processes and process
filters.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-30 16:20                                   ` David Kastrup
@ 2014-09-30 16:35                                     ` Eli Zaretskii
  0 siblings, 0 replies; 72+ messages in thread
From: Eli Zaretskii @ 2014-09-30 16:35 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

> From: David Kastrup <dak@gnu.org>
> Date: Tue, 30 Sep 2014 18:20:06 +0200
> 
> Since preview-latex is trivial to install via ELPA (as it's included in
> AUCTeX), I am somewhat at a loss figuring out what point people are
> trying to make by stating that they are not going to bother looking at
> it, preferring me to type descriptions for hours.

I doubt that just installing preview-latex would give me the insight
into the issues you were alluding to, without some serious study of
the code.

> Emacs does not offer any default support for rendering images on-demand
> from source code in a buffer.  The pipelines that preview-latex can be
> something like
> 
> buffer -> LaTeX -> Dvips -> Ghostscript -> PNG
> buffer -> PDFLaTeX -> pdftodsc -> Ghostscript -> PNG
> buffer -> LaTeX -> Dvipng -> PNG
> 
> It will also run several of those in parallel, like running
> LaTeX->Dvipng but falling back to Dvips->Ghostscript for those images
> which contain embedded PostScript.  The Ghostscript pipeline will render
> the on-screen images first to make for better interactive response.  In
> addition to the images, LaTeX produces geometry, baseline, source
> location and various other info (for example, default font size in order
> to match the rendering scale to the current default screen font) which
> is parsed by Emacs and used for integrating the resulting images in the
> right place and scale into the originating buffer.
> 
> So there are prioritized rendering pipelines and information side paths
> that deliver information crucial for tying the image stream back into
> the proper buffer locations, with proper scale and baseline.
> 
> There is other stuff like being able to paste text plus images into MIME
> mails (with the obvious loss of baseline and other formatting
> information, but for display math it is somewhat helpful), and desktop
> mode saving and restoring the images associated with a buffer.
> 
> All of that is not specific to LaTeX, and Emacs does not really provide
> anything helpful for creating that kind of facility apart from the
> display feature itself and the possibility to run processes and process
> filters.

Thanks, this is the information I was missing.




^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-30 10:35                               ` Rasmus
  2014-09-30 14:22                                 ` Eli Zaretskii
@ 2014-09-30 14:32                                 ` Stefan Monnier
  2014-10-02 16:12                                 ` Uwe Brauer
  2 siblings, 0 replies; 72+ messages in thread
From: Stefan Monnier @ 2014-09-30 14:32 UTC (permalink / raw)
  To: Rasmus; +Cc: emacs-devel

> If you ever decide to allocate time to generalize preview-latex and
> have it shipped with Emacs-core I'd happily support that work!

Indeed, that would be great.


        Stefan



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-30 10:35                               ` Rasmus
  2014-09-30 14:22                                 ` Eli Zaretskii
  2014-09-30 14:32                                 ` Stefan Monnier
@ 2014-10-02 16:12                                 ` Uwe Brauer
  2 siblings, 0 replies; 72+ messages in thread
From: Uwe Brauer @ 2014-10-02 16:12 UTC (permalink / raw)
  To: emacs-devel

>> "rasmus" == rasmus  <Rasmus> writes:

   > Hi,
   > This is a bit OT, sorry.

   > David Kastrup <dak@gnu.org> writes:

   >> [...]
   >> 
   >> The main shortcoming of preview-latex in my book is that it stands out.
   >> It should be one of possibly many thin applications on top of rendering
   >> frameworks provided by Emacs proper.

   > I agree wholeheartedly.  Org has a similar mechanism as well, but it
   > *sucks* compared to preview-latex.  I tried to get
do you mean?

org-preview-latex-fragment and friends?

Yes it is inferior to preview-latex, but can be used in non latex
buffers, such as matlab, and even more useful, can be used in emails and
together with org-mime-htmlize allows to have mathematical equations
presented as png, sent over email, which comes in handy in email
discussions about math.





^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-30  1:21                           ` Stephen J. Turnbull
  2014-09-30  8:43                             ` David Kastrup
@ 2014-09-30 19:23                             ` Richard Stallman
  2014-10-01  3:38                               ` Stephen J. Turnbull
  1 sibling, 1 reply; 72+ messages in thread
From: Richard Stallman @ 2014-09-30 19:23 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: dak, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

     > But you keep the overlay in existence -- how come?

    Why do you care?

Because I want to understand what this program is doing.

      David knows his business, isn't that good enough?

Good enough to tell me the program logic???

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.




^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-30 19:23                             ` Richard Stallman
@ 2014-10-01  3:38                               ` Stephen J. Turnbull
  2014-10-01 12:53                                 ` Richard Stallman
  0 siblings, 1 reply; 72+ messages in thread
From: Stephen J. Turnbull @ 2014-10-01  3:38 UTC (permalink / raw)
  To: rms; +Cc: dak, emacs-devel

Richard Stallman writes:

 >      > But you keep the overlay in existence -- how come?
 > 
 >     Why do you care?
 > 
 > Because I want to understand what this program is doing.

That much is obvious.  But why do you care what the program is doing?
Do you plan to become an AUCTeX developer?  Do you want to check
whether it's possible for AUCTeX to use text properties rather than
overlays?  Do you want to check if AUCTeX is highly wasteful of
overlays, leading to the efficiency issue, so that an improvement to
Emacs overlays is actually unnecessary for the use case?

The first motivation seems implausible given your time constraints and
non-use of AUCTeX, and I am quite sure the second two are wastes of
your time -- you will discover that AUCTeX's overlay implementation is
preferable to using text properties and that it doesn't leak overlays,
it just uses a lot of them in a big document.  For two reasons: I know
a little bit about the implementation, and independently come to those
conclusions, and as confirmation, I know that David knows his business.

Do you have yet another motivation for studying AUCTeX's implementation?

 >       David knows his business, isn't that good enough?
 > 
 > Good enough to tell me the program logic???

Of course not.  But knowing the program logic tells you what it is
doing, not why it does it that way.  If the program is a poor program,
you could probably find ways to optimize it, and so avoid a large
change to Emacs.  But I see every reason to believe that is not the
case.

At this point, from my point of view, we've been going around in a
tight spiral for several days.  It would be nice if we could go
straight to the point, that's all.  Or, it would be nice if you could
show that we are already on the straight path to the point.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-10-01  3:38                               ` Stephen J. Turnbull
@ 2014-10-01 12:53                                 ` Richard Stallman
  2014-10-01 13:11                                   ` David Kastrup
  2014-10-02  1:26                                   ` Stephen J. Turnbull
  0 siblings, 2 replies; 72+ messages in thread
From: Richard Stallman @ 2014-10-01 12:53 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: dak, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    That much is obvious.  But why do you care what the program is doing?

You are awfully arrogant.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.




^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-10-01 12:53                                 ` Richard Stallman
@ 2014-10-01 13:11                                   ` David Kastrup
  2014-10-02  1:26                                   ` Stephen J. Turnbull
  1 sibling, 0 replies; 72+ messages in thread
From: David Kastrup @ 2014-10-01 13:11 UTC (permalink / raw)
  To: Richard Stallman; +Cc: Stephen J. Turnbull, emacs-devel

Richard Stallman <rms@gnu.org> writes:

> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
>     That much is obvious.  But why do you care what the program is doing?
>
> You are awfully arrogant.

Actually, the same question occured to me.  Since I was not able to
figure out the answer in the course of the ongoing exchange, I basically
resorted to an extensive information dump in the hope that the required
information would be somewhere among it.

Since you have a large number of things to keep track of, it may be a
bit of a stretch to assume that you'll find what you were looking for
originally in that dump.  But maybe at least somebody found something
worth finding in it, whether or not he was looking for it.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-10-01 12:53                                 ` Richard Stallman
  2014-10-01 13:11                                   ` David Kastrup
@ 2014-10-02  1:26                                   ` Stephen J. Turnbull
  1 sibling, 0 replies; 72+ messages in thread
From: Stephen J. Turnbull @ 2014-10-02  1:26 UTC (permalink / raw)
  To: rms; +Cc: dak, emacs-devel

Richard Stallman writes:

 >     That much is obvious.  But why do you care what the program is doing?
 > 
 > You are awfully arrogant.

Even if accurate, that's an ad hominem, and doesn't help communicate.
Furthermore, I explained *why* I want to know your reason in some
detail -- is that arrogant?  I think that true arrogance involves
demanding to know your reason without giving a reason of my own.

All I want to do is get to the point.  If you just want to know what
it's doing, that's a perfectly good reason -- but then I don't see how
it helps Emacs development, and I excuse myself from the conversation,
which probably should move to an AUCTeX channel.

OTOH, if you have a reason that bears on Emacs development, it would
be useful in framing the conversation to know what that reason is.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-29 20:48                         ` Richard Stallman
  2014-09-30  1:21                           ` Stephen J. Turnbull
@ 2014-09-30  5:52                           ` David Kastrup
  2014-10-06 19:14                             ` Richard Stallman
  1 sibling, 1 reply; 72+ messages in thread
From: David Kastrup @ 2014-09-30  5:52 UTC (permalink / raw)
  To: Richard Stallman; +Cc: emacs-devel

Richard Stallman <rms@gnu.org> writes:

> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
>     > Does this mean you turn off display of the image on the overlay when
>     > the text in that region is changed?
>
>     Very much so, yes.  Changing the text in the region would be a pain if
>     you could not see it, and changing it breaks the correspondence between
>     text and image anyway.
>
> I see.
>
> But you keep the overlay in existence -- how come?

Convenience.  preview-latex has a mechanism to run only a region through
LaTeX, so it is possible to regenerate only images in a region.  This
comes with its own shortfalls compared with regenerating the whole
document but is much faster and less disruptive as then only images in
that region flicker as they appear.

When editing a particular equation or other element, reusing its
original boundaries as the region to run through LaTeX is a natural
choice, so the retained overlay preserves those boundaries (and offsets
them visually) for the sake of the commands regenerating a region.

If you split an equation into two by editing it, "regenerating" the
image will replace the original overlay with two new ones: in fact, even
if the boundaries stay the same, the overlay will get replaced.  It
sticks around only until some new image is generated touching any of the
original region.

If you write, say, word_text instead of word\_text (throwing LaTeX into
inadvertant math mode), the resulting nuisance overlay will need to get
manually removed (with keyboard sequence or mouse click) after
correcting the syntax error since no image will reappear in that area
after the correction.  That tends to be a smaller net distraction than
having to hand-select your regions to regenerate in the common case.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-30  5:52                           ` David Kastrup
@ 2014-10-06 19:14                             ` Richard Stallman
  2014-10-06 21:02                               ` David Kastrup
  0 siblings, 1 reply; 72+ messages in thread
From: Richard Stallman @ 2014-10-06 19:14 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

Thanks for the explanation.  Now I understand, more or less.

    If you write, say, word_text instead of word\_text (throwing LaTeX into
    inadvertant math mode), the resulting nuisance overlay will need to get
    manually removed (with keyboard sequence or mouse click) after
    correcting the syntax error since no image will reappear in that area
    after the correction.

If you regenerate that region, and it makes no image, can't that delete
the overlay automatically?

    preview-latex does an excellent job, but much of its code is not at all
    related to the AUCTeX user interface or the LaTeX input language, and
    yet it will work for nothing but the AUCTeX user interface and the LaTeX
    input language.

What is the obstacle to supporting other interfaces?

Would preview-latex be of any use with Texinfo?
Maybe not, since generally there are few equations in Texinfo.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-10-06 19:14                             ` Richard Stallman
@ 2014-10-06 21:02                               ` David Kastrup
  0 siblings, 0 replies; 72+ messages in thread
From: David Kastrup @ 2014-10-06 21:02 UTC (permalink / raw)
  To: Richard Stallman; +Cc: emacs-devel

Richard Stallman <rms@gnu.org> writes:

> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
> Thanks for the explanation.  Now I understand, more or less.
>
>     If you write, say, word_text instead of word\_text (throwing LaTeX into
>     inadvertant math mode), the resulting nuisance overlay will need to get
>     manually removed (with keyboard sequence or mouse click) after
>     correcting the syntax error since no image will reappear in that area
>     after the correction.
>
> If you regenerate that region, and it makes no image, can't that delete
> the overlay automatically?

Conceivably.  However, a region might just be

    \input{bozo}
    \input{clown}
    \input{summary}

with the actual images appearing in respectively other buffers than the
region.  So clearing out the original region would be a heuristic rather
than the full deal and should not be mandatory for satisfactory
operation.

>     preview-latex does an excellent job, but much of its code is not
>     at all related to the AUCTeX user interface or the LaTeX input
>     language, and yet it will work for nothing but the AUCTeX user
>     interface and the LaTeX input language.
>
> What is the obstacle to supporting other interfaces?

The job control that is used is that of AUCTeX.  That involves
management of various process buffers (without getting "A compilation is
already in progress" problems), expansion of commands with variable
elements based on the involved buffers and files.

The command sequences which are executed are based on the various ways
of generating graphical output from TeX input.  These kind of chains are
not really specified as some sort of abstract interface but hardcoded.

There are sidepaths from LaTeX for geometry and source location
information: Emacs parses the resulting error/log files in order to get
all those out.  There is no other source for this information.

A lot of convenience functions from AUCTeX are used that are not
actually related to TeX at all.  The user interface parts are not done
in any way independent from the AUCTeX user interface.

> Would preview-latex be of any use with Texinfo?
> Maybe not, since generally there are few equations in Texinfo.

The LilyPond (music typesetter) documentation is generated via Texinfo.
There are probably not all that many people using the Info version with
graphical images but it beats the pants off the generally used online
HTML version (all the web pages, not just the manuals, are also
generated from Texinfo).

But the _only_ usable reader for the Info version with graphics is
Emacs.  Yelp (GNOME help viewer) purports to support images in Info but
does not scale to ten thousands of images.  One never gets across the
"Loading" message.

Now the LilyPond documentation is probably the best use of images in
Info anywhere.  Still I would not want to bet my life savings on more
than a dozen people actively using the Info variants of those manuals.

But while few people actually use the Info output format, lots more use
Texinfo with other backends.  I could imagine the Texinfo/preview-latex
combination to be popular with our documentation editors even though
most of them probaly have never looked at Info output and consider HTML
and PDF the "normal" output formats.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-21 16:07         ` Stefan Monnier
  2014-09-21 16:14           ` David Kastrup
@ 2014-09-21 16:56           ` Eli Zaretskii
  2014-09-21 18:42             ` Stefan Monnier
  2014-09-21 21:48           ` Richard Stallman
  2 siblings, 1 reply; 72+ messages in thread
From: Eli Zaretskii @ 2014-09-21 16:56 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: dak, rms, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Sun, 21 Sep 2014 12:07:48 -0400
> Cc: David Kastrup <dak@gnu.org>, emacs-devel@gnu.org
> 
> We rarely suffer from overlay's performance issues because most codes
> that could suffer from it simply avoid using overlays (or they try to
> minimize the number of overlays used, e.g. by flushing those overlays
> currently not visible), even when they would be the right tool.  Or they
> add calls to overlay-recenter when the performance problem is diagnosed.

We should distinguish between 2 use cases here.  One is when overlays
need to be displayed.  In that case, indeed only the overlays in the
visible region count, which are normally not too many, and the display
engine already does the equivalent of overlay-recenter for each
display line.

The other use case is when some Lisp program, or even some part of
Emacs, needs to search for the next/previous change in character
properties, or find the beginning or end of an overlay.  In those
cases, you cannot assume that only the visible portion of a buffer
matters, so in large buffers with many overlays this could be rather
painful.

> I don't see any obvious reason why re-using the existing intervals.c
> tree wouldn't work well.

Overlays can overlap, while text properties cannot.  Does that matter
for what intervals.c implement?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-21 16:56           ` Eli Zaretskii
@ 2014-09-21 18:42             ` Stefan Monnier
  2014-09-21 18:58               ` Eli Zaretskii
  0 siblings, 1 reply; 72+ messages in thread
From: Stefan Monnier @ 2014-09-21 18:42 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, rms, emacs-devel

> Overlays can overlap, while text properties cannot.  Does that matter
> for what intervals.c implement?

Of course it does, but the algorithm I suggest we use handles it
just fine.


        Stefan



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-21 18:42             ` Stefan Monnier
@ 2014-09-21 18:58               ` Eli Zaretskii
  2014-09-21 20:12                 ` Stefan Monnier
  0 siblings, 1 reply; 72+ messages in thread
From: Eli Zaretskii @ 2014-09-21 18:58 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: dak, rms, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Sun, 21 Sep 2014 14:42:57 -0400
> Cc: dak@gnu.org, rms@gnu.org, emacs-devel@gnu.org
> 
> > Overlays can overlap, while text properties cannot.  Does that matter
> > for what intervals.c implement?
> 
> Of course it does, but the algorithm I suggest we use handles it
> just fine.

What algorithm?



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-21 18:58               ` Eli Zaretskii
@ 2014-09-21 20:12                 ` Stefan Monnier
  0 siblings, 0 replies; 72+ messages in thread
From: Stefan Monnier @ 2014-09-21 20:12 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, rms, emacs-devel

>> Of course it does, but the algorithm I suggest we use handles it
>> just fine.
> What algorithm?

The one described in the "Augmented tree" section of the "Interval tree"
wikipedia page.


        Stefan



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-21 16:07         ` Stefan Monnier
  2014-09-21 16:14           ` David Kastrup
  2014-09-21 16:56           ` Eli Zaretskii
@ 2014-09-21 21:48           ` Richard Stallman
  2014-09-22  0:31             ` Stefan Monnier
  2 siblings, 1 reply; 72+ messages in thread
From: Richard Stallman @ 2014-09-21 21:48 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: dak, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    Reasons for overlays being slow are:
    - if you have N overlays, you have 2N markers, so every
      insertion/deletion of text will involve a traversal of those 2N elements.
    - when we need to move from bytepos to charpos (or vice versa), we go
      through the 2N markers again.
    - when the Elisp code didn't know to overlay-recenter, or when overlays
      are added/removed from the "wrong" place (w.r.t to the overlay
      "center"), every overlay creation/removal takes O(N) time.

Yes, that's true.

    Moving overlays into a balanced binary tree will remove those
    algorithmic performance problems.  I'm not sue why you'd be opposed,
    since it will provide the exact same functionality anyway.  It's just an
    internal detail of implementation.

Why not do it for all markers, then?  A large number of markers will
cause the same slowdown even if they are not made for overlays.

    Note that I suggest to use the balanced binary tree used by
    text-properties, but there is no need for the two trees to be linked.

I don't think the data structure of intervals lends itself directly to
markers.  A marker, whether it's on its own or an end of an overlay,
has an identity that is permanent, whereas intervals are split and
recombined whenever useful.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.




^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-21 21:48           ` Richard Stallman
@ 2014-09-22  0:31             ` Stefan Monnier
  2014-09-22 23:11               ` Richard Stallman
  0 siblings, 1 reply; 72+ messages in thread
From: Stefan Monnier @ 2014-09-22  0:31 UTC (permalink / raw)
  To: Richard Stallman; +Cc: dak, emacs-devel

>     Reasons for overlays being slow are:
>     - if you have N overlays, you have 2N markers, so every
>       insertion/deletion of text will involve a traversal of those 2N elements.
>     - when we need to move from bytepos to charpos (or vice versa), we go
>       through the 2N markers again.
>     - when the Elisp code didn't know to overlay-recenter, or when overlays
>       are added/removed from the "wrong" place (w.r.t to the overlay
>       "center"), every overlay creation/removal takes O(N) time.
> Yes, that's true.
>     Moving overlays into a balanced binary tree will remove those
>     algorithmic performance problems.  I'm not sue why you'd be opposed,
>     since it will provide the exact same functionality anyway.  It's just an
>     internal detail of implementation.
> Why not do it for all markers, then?

We could try, but it wouldn't address the last problem above.

> A large number of markers will cause the same slowdown even if they
> are not made for overlays.

Indeed.  But I've never seen this problem, except via the creation of
many overlays.

Note that if/when overlays are made efficient and markers's efficiency
becomes a problem, we can use the new overlay implementation for markers
(in the worst case, representing markers as degenerate overlays).

> A marker, whether it's on its own or an end of an overlay, has an
> identity that is permanent, whereas intervals are split and recombined
> whenever useful.

I must say I'm troubled by your questions and objections.  What is the
problem with my proposal?  Are you afraid it will make Emacs worse?
Or are you afraid it simply won't work?  Why do you care if we keep
overlays in a tree instead of singly-linked lists?


        Stefan



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-22  0:31             ` Stefan Monnier
@ 2014-09-22 23:11               ` Richard Stallman
  0 siblings, 0 replies; 72+ messages in thread
From: Richard Stallman @ 2014-09-22 23:11 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: dak, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    I must say I'm troubled by your questions and objections.  What is the
    problem with my proposal?  Are you afraid it will make Emacs worse?
    Or are you afraid it simply won't work?  Why do you care if we keep
    overlays in a tree instead of singly-linked lists?

We may be miscommunicating.  I think you're interpreting each thing I
say as an argument against reimplementing overlays, but I'm mostly
talking about other issues that came up in the discussion.

I must not have noticed the message that made a specific proposal for
how to reimplement overlays.  I'm concerned about code complexity and
bugs that might introduce, but I'm not necessarily against it.

Mostly I'm not talking about that.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-20 13:19   ` Richard Stallman
  2014-09-20 13:37     ` David Kastrup
@ 2014-09-20 15:56     ` Eli Zaretskii
  2014-09-20 19:49     ` Stefan Monnier
  2 siblings, 0 replies; 72+ messages in thread
From: Eli Zaretskii @ 2014-09-20 15:56 UTC (permalink / raw)
  To: rms; +Cc: vekazanov, monnier, emacs-devel

> Date: Sat, 20 Sep 2014 09:19:02 -0400
> From: Richard Stallman <rms@gnu.org>
> Cc: vekazanov@gmail.com, emacs-devel@gnu.org
> 
> The existing data structure is fine for overlays used as they are intended
> to be used.

The existing data structure doesn't scale well.  What Stefan suggest
will scale better, while preserving all the features and properties of
overlays as we have them today.



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-20 13:19   ` Richard Stallman
  2014-09-20 13:37     ` David Kastrup
  2014-09-20 15:56     ` Eli Zaretskii
@ 2014-09-20 19:49     ` Stefan Monnier
  2014-09-21 13:36       ` Richard Stallman
  2 siblings, 1 reply; 72+ messages in thread
From: Stefan Monnier @ 2014-09-20 19:49 UTC (permalink / raw)
  To: Richard Stallman; +Cc: vekazanov, emacs-devel

> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
> I don't see that it makes sense to store overlays like text properties.
> The existing data structure is fine for overlays used as they are intended
> to be used.

The data structure used is too naive, with an algorithmic complexity
that can be problematic.  Moving it to a balanced binary tree with turn
O(N^2) complexity into O(N log N).


        Stefan



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-20 19:49     ` Stefan Monnier
@ 2014-09-21 13:36       ` Richard Stallman
  0 siblings, 0 replies; 72+ messages in thread
From: Richard Stallman @ 2014-09-21 13:36 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: vekazanov, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

      Moving it to a balanced binary tree with turn
    O(N^2) complexity into O(N log N).

I am not sure that is true, here.
What value does N stand for?

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.




^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-19 14:59 Overlay mechanic improvements Vladimir Kazanov
  2014-09-19 17:22 ` Stefan Monnier
@ 2014-09-19 18:03 ` Richard Stallman
  2014-09-20  8:08   ` Vladimir Kazanov
  1 sibling, 1 reply; 72+ messages in thread
From: Richard Stallman @ 2014-09-19 18:03 UTC (permalink / raw)
  To: Vladimir Kazanov; +Cc: emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    It turns out that the only viable alternative to overlays is using text
    properties, and text properties were not meant for the purpose.

I designed text properties precisely for uses like this.

     All I need
    is an ability to save position pairs, the positions should survive text
    insertion/deletion

I see multiple meanings for that; could you clarify?

		       and there should be a way to find those pairs given a
    buffer point.

Text properties are designed to be preserved through copying of text.
Overlays are not.  So it seems to me that you must use text properties.

For each token, you put a text property 'token' onto the characters in
the token.  The value of the property would say what token they are.
The property would be eq for all the characters in one token.

Then you can use 'next-single-char-property-change' and
'previous-single-char-property-change' to find the end and the
beginning of the token.

If you run into any difficulties using the existing interfaces
for text properties, we should improve the interfaces to make your
program easier to write.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-19 18:03 ` Richard Stallman
@ 2014-09-20  8:08   ` Vladimir Kazanov
  2014-09-20 13:21     ` Richard Stallman
  2014-09-20 13:21     ` Tokenizing Richard Stallman
  0 siblings, 2 replies; 72+ messages in thread
From: Vladimir Kazanov @ 2014-09-20  8:08 UTC (permalink / raw)
  To: rms; +Cc: emacs-devel

The problem here is that tokens are not just lexemes(token text). To
be general and granular enough one also has to remember dependencies
between tokens. So, a token is: ["lexeme", lookback, lookahead,
position]. Here

 - "lexeme" is the string itself;

 - "lookahead" is a number of chars *after* the lexeme used to define
the lexeme (in the simplest case - just a whitespace between symbol
names, i.e. a single char);

 - "lookback" a number of previous tokens whose lookaheads span over
current token's zone of responsibility (this one can be recalculated
on the fly)

 - "position" is just a token beginning/end

Thus, we can define "a token" as a sum of a lexeme and context
required to extract the lexeme. Whenever something happens within a
particular lexem+lookahead interval -> the token has to be reparsed
and n=lookback previous tokens also have to be fixed.

More on this can be found in this paper:
http://harmonia.cs.berkeley.edu/papers/twagner-lexing.pdf. Quite a
lengthy read, I must warn. What I like about it is that it *proves*
token stream consistency after fixing. My system will hopefully be
simplified compared to the original idea while keeping the proved
part.

>      All I need
>     is an ability to save position pairs, the positions should survive text
>     insertion/deletion
>
> I see multiple meanings for that; could you clarify?

I mean something like relative positioning of markers/overlays:
changes in unrelated buffer parts should change positions accordingly.

>                        and there should be a way to find those pairs given a
>     buffer point.
>
> Text properties are designed to be preserved through copying of text.
> Overlays are not.  So it seems to me that you must use text properties.

Yes, properties do survive copying, but token context might change and
the whole thing will have to be reparsed anyway. We can take a piece
of comments and drop it into the middle of an expression. I don't care
whether the repositioned text was a part of a token or not, I just
need to know that something changed within a particular interval
(lexem+lookahead)

> For each token, you put a text property 'token' onto the characters in
> the token.  The value of the property would say what token they are.
> The property would be eq for all the characters in one token.
>
> Then you can use 'next-single-char-property-change' and
> 'previous-single-char-property-change' to find the end and the
> beginning of the token.

I chose overlays mostly because they allow to control *intervals of
text*, and the intervals can overlap. For example, I can add a
modification-hook to an overlay - and it won't be called for every
character, just for an interval(-s) overlapping. Citing the "Special
properties" documentation page for text properties: "you can't predict
how many times the function will be called". One can imagine a
workaround for this, but it would be just too cumbersome.

> If you run into any difficulties using the existing interfaces
> for text properties, we should improve the interfaces to make your
> program easier to write.
>

Should I just do that or try both (improving overlays on the way)?
After all overlays do have to be fixed anyway.

-- 
Yours sincerely,

Vladimir Kazanov

--
С уважением,

Владимир Казанов

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-20  8:08   ` Vladimir Kazanov
@ 2014-09-20 13:21     ` Richard Stallman
  2014-09-20 16:28       ` Stephen Leake
  2014-09-20 13:21     ` Tokenizing Richard Stallman
  1 sibling, 1 reply; 72+ messages in thread
From: Richard Stallman @ 2014-09-20 13:21 UTC (permalink / raw)
  To: Vladimir Kazanov; +Cc: emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    The problem here is that tokens are not just lexemes(token text). To
    be general and granular enough one also has to remember dependencies
    between tokens. So, a token is: ["lexeme", lookback, lookahead,
    position].

Ok, but why is it a "problem"?  It seems easy enough to construct a
list (LEXEME LOOKBACK LOOKAHEAD) and put it as the value of the
`token' property for all the characters in the token.

There is no need to include POSITION in this list, since you can
find the start of the token by calling `previous-char-property-change'.

    >      All I need
    >     is an ability to save position pairs, the positions should survive text
    >     insertion/deletion
    >
    > I see multiple meanings for that; could you clarify?

    I mean something like relative positioning of markers/overlays:
    changes in unrelated buffer parts should change positions accordingly.

Text properties do that.

    Yes, properties do survive copying, but token context might change and
    the whole thing will have to be reparsed anyway.

If it needs to be reparsed, that's no problem; it is easy to delete
the `token' property from a region.  In fact, we can specify automatic
deletion of a certain text property from text copied into the buffer.

But there are many cases where it does not need to be reparsed.  If
you can detect such cases, and reparse only near the boundaries, it
would be a nice optimization for cases like killing a whole function
and yanking it between two other functions.

You could use `yank-handler' to hook this in.  It would check whether
the existing parsing of the text being yanked fits the new context; if
so, it would preserve the `token' property; if not, it would erase the
`token' property of the yanked text and cause it all to be reparsed.

Another idea: even if the beginning of the yanked text has to be
interpreted differently in the new context, maybe tokens later on
in the yanked text will still be right in the new context.  If you can
detect when this happens, you might be able to skip reparsing of
some of the yanked text.  That could be a substantial speedup.

    Citing the "Special
    properties" documentation page for text properties: "you can't predict
    how many times the function will be called". One can imagine a
    workaround for this, but it would be just too cumbersome.

If you put a timestamp into the token data structure, it will be
easy to detect duplicate calls, and make them do nothing.

I think it would be less work to change text properties to give you
whatever interface features are convenient, than to redesign how
overlays are stored.

    After all overlays do have to be fixed anyway.

Why so?  Aside from not being suitable for this use,
what is wrong with overlays that needs to be fixed?

The feature you want to implement is very useful, and we should
provide the low-level support needed to make it clean and fast.  But
if that is easy to do with text properties, let's do it with text
properties.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Overlay mechanic improvements
  2014-09-20 13:21     ` Richard Stallman
@ 2014-09-20 16:28       ` Stephen Leake
  0 siblings, 0 replies; 72+ messages in thread
From: Stephen Leake @ 2014-09-20 16:28 UTC (permalink / raw)
  To: emacs-devel

Richard Stallman <rms@gnu.org> writes:

> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
>     The problem here is that tokens are not just lexemes(token text). To
>     be general and granular enough one also has to remember dependencies
>     between tokens. So, a token is: ["lexeme", lookback, lookahead,
>     position].
>
> Ok, but why is it a "problem"?  It seems easy enough to construct a
> list (LEXEME LOOKBACK LOOKAHEAD) and put it as the value of the
> `token' property for all the characters in the token.
>
> There is no need to include POSITION in this list, since you can
> find the start of the token by calling
> `previous-char-property-change'.

No need to store LEXME, either; just call buffer-substring-no-properties
when you need the text.


-- 
-- Stephe



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Tokenizing
  2014-09-20  8:08   ` Vladimir Kazanov
  2014-09-20 13:21     ` Richard Stallman
@ 2014-09-20 13:21     ` Richard Stallman
  2014-09-20 16:24       ` Tokenizing Stephen Leake
  2014-09-20 16:36       ` Tokenizing Vladimir Kazanov
  1 sibling, 2 replies; 72+ messages in thread
From: Richard Stallman @ 2014-09-20 13:21 UTC (permalink / raw)
  To: Vladimir Kazanov; +Cc: emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

Do you plan to reparse continuously to keep the token structure up to
date all the time?

Given that reparsing is likely to take time, it may not be the best approach.
It could sometimes make Emacs become nonresponsive while it is reparsing
a substantial amount of text.

Ideally it should be like Font Lock -- reparsing parts of the buffer
when time is available.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Tokenizing
  2014-09-20 13:21     ` Tokenizing Richard Stallman
@ 2014-09-20 16:24       ` Stephen Leake
  2014-09-20 16:40         ` Tokenizing Vladimir Kazanov
  2014-09-20 16:36       ` Tokenizing Vladimir Kazanov
  1 sibling, 1 reply; 72+ messages in thread
From: Stephen Leake @ 2014-09-20 16:24 UTC (permalink / raw)
  To: emacs-devel

Richard Stallman <rms@gnu.org> writes:

> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
> Do you plan to reparse continuously to keep the token structure up to
> date all the time?

That's the approach I'm using in Ada mode. 

Tokenizing the whole buffer after any change is easily fast enough (on
modern hardware), even on a 7000 line buffer. Semantic parsing gets a
lot slower.

But for the Ada language, incremental semantic parsing really isn't an
option. 

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Tokenizing
  2014-09-20 16:24       ` Tokenizing Stephen Leake
@ 2014-09-20 16:40         ` Vladimir Kazanov
  2014-09-20 20:16           ` Tokenizing Eric Ludlam
  0 siblings, 1 reply; 72+ messages in thread
From: Vladimir Kazanov @ 2014-09-20 16:40 UTC (permalink / raw)
  To: Stephen Leake; +Cc: emacs-devel

> Tokenizing the whole buffer after any change is easily fast enough (on
> modern hardware), even on a 7000 line buffer. Semantic parsing gets a
> lot slower.
>

This is what I do right now in my prototype of a smarter Python mode.
The tokenizing process itself is usually fast enough. But parsing is
more complicated, and may take some time to rebuild the parse tree.
Incremental approach is a natural step here.

-- 
Yours sincerely,


Vladimir Kazanov


--
С уважением,

Владимир Казанов



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Tokenizing
  2014-09-20 16:40         ` Tokenizing Vladimir Kazanov
@ 2014-09-20 20:16           ` Eric Ludlam
  2014-09-20 20:35             ` Tokenizing Vladimir Kazanov
  2014-09-21 15:13             ` parsing (was tokenizing) Stephen Leake
  0 siblings, 2 replies; 72+ messages in thread
From: Eric Ludlam @ 2014-09-20 20:16 UTC (permalink / raw)
  To: Vladimir Kazanov; +Cc: Stephen Leake, emacs-devel

  On 09/20/2014 12:40 PM, Vladimir Kazanov wrote:
>> Tokenizing the whole buffer after any change is easily fast enough (on
>> modern hardware), even on a 7000 line buffer. Semantic parsing gets a
>> lot slower.
>>
> This is what I do right now in my prototype of a smarter Python mode.
> The tokenizing process itself is usually fast enough. But parsing is
> more complicated, and may take some time to rebuild the parse tree.
> Incremental approach is a natural step here.
>

I caught only the tail of this thread, so I apologize if I refer to the 
incorrect thing.

A year ago or so we were talking about ada-mode, a modified parser, and 
how it might integrate with CEDET/Semantic on the CEDET mailing list.  
Is it still 'wisi', a different flavor of 'wisent' ?

If calls into your parser are being handled by parts of CEDET/Semantic 
for creating tags, then there is an incremental parser that you can 
enable that works with two of the other parser types included.  The act 
of tagging the buffer enables it to track edits, and only reparse the 
bits that change.  It handles incomplete code, and waits for it to be 
completed.  There are many other possible states as well.

I certainly recommend trying to take advantage of that if the parsing 
you refer to is somehow related to Semantic.  I'll be happy to help you 
figure out how to make your parser work in that framework.

If this thread isn't about that, then carry on... :)

Eric

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Tokenizing
  2014-09-20 20:16           ` Tokenizing Eric Ludlam
@ 2014-09-20 20:35             ` Vladimir Kazanov
  2014-09-21 15:13             ` parsing (was tokenizing) Stephen Leake
  1 sibling, 0 replies; 72+ messages in thread
From: Vladimir Kazanov @ 2014-09-20 20:35 UTC (permalink / raw)
  To: Eric Ludlam; +Cc: Stephen Leake, emacs-devel

> I certainly recommend trying to take advantage of that if the parsing you
> refer to is somehow related to Semantic.  I'll be happy to help you figure
> out how to make your parser work in that framework.
>
> If this thread isn't about that, then carry on... :)
>
> Eric

Actually, this is not about parsing itself, but a lower level layer -
tokenizing. :-) Couple of letters ago I formulated what I intend to
build, you can easily find it.

-- 
Yours sincerely,


Vladimir Kazanov


--
С уважением,

Владимир Казанов



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: parsing (was tokenizing)
  2014-09-20 20:16           ` Tokenizing Eric Ludlam
  2014-09-20 20:35             ` Tokenizing Vladimir Kazanov
@ 2014-09-21 15:13             ` Stephen Leake
  1 sibling, 0 replies; 72+ messages in thread
From: Stephen Leake @ 2014-09-21 15:13 UTC (permalink / raw)
  To: emacs-devel

Eric Ludlam <ericludlam@gmail.com> writes:

>  On 09/20/2014 12:40 PM, Vladimir Kazanov wrote:
>>> Tokenizing the whole buffer after any change is easily fast enough (on
>>> modern hardware), even on a 7000 line buffer. Semantic parsing gets a
>>> lot slower.
>>>
>> This is what I do right now in my prototype of a smarter Python mode.
>> The tokenizing process itself is usually fast enough. But parsing is
>> more complicated, and may take some time to rebuild the parse tree.
>> Incremental approach is a natural step here.
>>
>
> I caught only the tail of this thread, so I apologize if I refer to
> the incorrect thing.
>
> A year ago or so we were talking about ada-mode, a modified parser,
> and how it might integrate with CEDET/Semantic on the CEDET mailing
> list.  Is it still 'wisi', a different flavor of 'wisent' ?

Yes.

> If calls into your parser are being handled by parts of CEDET/Semantic
> for creating tags, then there is an incremental parser that you can
> enable that works with two of the other parser types included.  

No, I'm not using the CEDET front end UI. Partly because that
incremental parser does not handle the Ada language (or I did not want
to put in the effort to modify my grammar to make it work).

The wisi parser is generalized LALR, and it does not lend itself to
incremental parsing.

I also didn't need the full flexibility of the Semantic lexer; the wisi
lexer relies solely on the elisp syntax properties, and is faster.

On the other hand, the wisi parser is unacceptably slow for 7000 line
Ada files (and I have customers complaining about files 10 times that
size (or they are using really slow machines)), so I'm looking into
ways to use compiled Ada code instead of elisp to run the parser.

> I'll be happy to help
> you figure out how to make your parser work in that framework.

Once I get the speed acceptable, I intend to look again at the CEDET UI,
and try to use as much as possible with Ada mode. In particular, smart
completion would be a nice feature.

-- 
-- Stephe

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Tokenizing
  2014-09-20 13:21     ` Tokenizing Richard Stallman
  2014-09-20 16:24       ` Tokenizing Stephen Leake
@ 2014-09-20 16:36       ` Vladimir Kazanov
  2014-09-20 19:55         ` Tokenizing Stefan Monnier
                           ` (2 more replies)
  1 sibling, 3 replies; 72+ messages in thread
From: Vladimir Kazanov @ 2014-09-20 16:36 UTC (permalink / raw)
  To: rms; +Cc: emacs-devel

Okay, I'll give text properties a try.

Right now my vision for this mode is the following:

- avoid retokenizing undamaged buffer parts at all costs (as a main
feature meant for incremental parsing);

- collect damages and do reparsing only when user stops editing,
similar to the font-lock-mode (js2-mode, nxml-mode...);

- the incremental logic should have two interfaces, the first one
meant for language-specific tokenizing code and a second one - for the
user code, be it code beautifiers or advanced incremental parsers;

- it should be possible to completely replace the font-lock-mode with
this mode, given a concrete language tokenizer;

You said two things basically: 1) I must use text properties, 2) it is
possible to improve text properties interfaces to help the tokenizer.
I suggest the following plan:

1) try to implement the tokenizer using available text property mechanics;

2) see if there are slow-downs or problems, or space for improvements
on the Emacs side.

Does that sound reasonable?

On Sat, Sep 20, 2014 at 4:21 PM, Richard Stallman <rms@gnu.org> wrote:
> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
> Do you plan to reparse continuously to keep the token structure up to
> date all the time?
>
> Given that reparsing is likely to take time, it may not be the best approach.
> It could sometimes make Emacs become nonresponsive while it is reparsing
> a substantial amount of text.
>
> Ideally it should be like Font Lock -- reparsing parts of the buffer
> when time is available.
>
> --
> Dr Richard Stallman
> President, Free Software Foundation
> 51 Franklin St
> Boston MA 02110
> USA
> www.fsf.org  www.gnu.org
> Skype: No way! That's nonfree (freedom-denying) software.
>   Use Ekiga or an ordinary phone call.
>



-- 
Yours sincerely,


Vladimir Kazanov


--
С уважением,

Владимир Казанов



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Tokenizing
  2014-09-20 16:36       ` Tokenizing Vladimir Kazanov
@ 2014-09-20 19:55         ` Stefan Monnier
  2014-09-21 15:35           ` Tokenizing Stephen Leake
  2014-09-21 13:35         ` Tokenizing Richard Stallman
  2014-09-21 15:32         ` Tokenizing Stephen Leake
  2 siblings, 1 reply; 72+ messages in thread
From: Stefan Monnier @ 2014-09-20 19:55 UTC (permalink / raw)
  To: Vladimir Kazanov; +Cc: rms, emacs-devel

> - it should be possible to completely replace the font-lock-mode with
> this mode, given a concrete language tokenizer;

The intention for font-lock highlighting is to show
syntactic structure.  In many cases, we end up limited to lexical
highlighting, but that's a problem.


        Stefan



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Tokenizing
  2014-09-20 19:55         ` Tokenizing Stefan Monnier
@ 2014-09-21 15:35           ` Stephen Leake
  2014-09-21 16:43             ` Tokenizing Stefan Monnier
  0 siblings, 1 reply; 72+ messages in thread
From: Stephen Leake @ 2014-09-21 15:35 UTC (permalink / raw)
  To: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> - it should be possible to completely replace the font-lock-mode with
>> this mode, given a concrete language tokenizer;
>
> The intention for font-lock highlighting is to show
> syntactic structure.  In many cases, we end up limited to lexical
> highlighting, but that's a problem.

syntax-propertize provides hooks to let modes add to that; Ada mode uses
it for cases where lexical highlighting is ambiguous. Which is a problem
when parsing is too slow :(.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Tokenizing
  2014-09-21 15:35           ` Tokenizing Stephen Leake
@ 2014-09-21 16:43             ` Stefan Monnier
  2014-09-22 14:05               ` Tokenizing Stephen Leake
  0 siblings, 1 reply; 72+ messages in thread
From: Stefan Monnier @ 2014-09-21 16:43 UTC (permalink / raw)
  To: Stephen Leake; +Cc: emacs-devel

>>> - it should be possible to completely replace the font-lock-mode with
>>> this mode, given a concrete language tokenizer;
>> The intention for font-lock highlighting is to show
>> syntactic structure.  In many cases, we end up limited to lexical
>> highlighting, but that's a problem.
> syntax-propertize provides hooks to let modes add to that;

Maybe you use it for that, but it's definitely not its intention:
`syntax-propertize' is not meant for highlighting at all.


        Stefan



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Tokenizing
  2014-09-21 16:43             ` Tokenizing Stefan Monnier
@ 2014-09-22 14:05               ` Stephen Leake
  0 siblings, 0 replies; 72+ messages in thread
From: Stephen Leake @ 2014-09-22 14:05 UTC (permalink / raw)
  To: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>>> - it should be possible to completely replace the font-lock-mode with
>>>> this mode, given a concrete language tokenizer;
>>> The intention for font-lock highlighting is to show
>>> syntactic structure.  In many cases, we end up limited to lexical
>>> highlighting, but that's a problem.
>> syntax-propertize provides hooks to let modes add to that;
>
> Maybe you use it for that, but it's definitely not its intention:
> `syntax-propertize' is not meant for highlighting at all.

Sorry, I meant "font-lock-add-keywords"; you can add arbitrary functions
that use the parser results to choose a face.

The parser does rely on syntax-propertize.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Tokenizing
  2014-09-20 16:36       ` Tokenizing Vladimir Kazanov
  2014-09-20 19:55         ` Tokenizing Stefan Monnier
@ 2014-09-21 13:35         ` Richard Stallman
  2014-09-21 14:24           ` Tokenizing Vladimir Kazanov
  2014-09-21 15:32         ` Tokenizing Stephen Leake
  2 siblings, 1 reply; 72+ messages in thread
From: Richard Stallman @ 2014-09-21 13:35 UTC (permalink / raw)
  To: Vladimir Kazanov; +Cc: emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    I suggest the following plan:

    1) try to implement the tokenizer using available text property mechanics;

    2) see if there are slow-downs or problems, or space for improvements
    on the Emacs side.

Yes.

At various points you might have ideas for more convenient or faster
predefined functions for accessing the text properties.  It will be
very nice to add them for Emacs and thus make this sort of tokenizing
more convenient.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Tokenizing
  2014-09-21 13:35         ` Tokenizing Richard Stallman
@ 2014-09-21 14:24           ` Vladimir Kazanov
  0 siblings, 0 replies; 72+ messages in thread
From: Vladimir Kazanov @ 2014-09-21 14:24 UTC (permalink / raw)
  To: rms; +Cc: emacs-devel

Great, okay, I'll start working on it. This might take some time.

On Sun, Sep 21, 2014 at 4:35 PM, Richard Stallman <rms@gnu.org> wrote:
> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
>     I suggest the following plan:
>
>     1) try to implement the tokenizer using available text property mechanics;
>
>     2) see if there are slow-downs or problems, or space for improvements
>     on the Emacs side.
>
> Yes.
>
> At various points you might have ideas for more convenient or faster
> predefined functions for accessing the text properties.  It will be
> very nice to add them for Emacs and thus make this sort of tokenizing
> more convenient.
>
> --
> Dr Richard Stallman
> President, Free Software Foundation
> 51 Franklin St
> Boston MA 02110
> USA
> www.fsf.org  www.gnu.org
> Skype: No way! That's nonfree (freedom-denying) software.
>   Use Ekiga or an ordinary phone call.
>



-- 
Yours sincerely,


Vladimir Kazanov


--
С уважением,

Владимир Казанов



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Tokenizing
  2014-09-20 16:36       ` Tokenizing Vladimir Kazanov
  2014-09-20 19:55         ` Tokenizing Stefan Monnier
  2014-09-21 13:35         ` Tokenizing Richard Stallman
@ 2014-09-21 15:32         ` Stephen Leake
  2014-09-21 16:42           ` Tokenizing Stefan Monnier
  2014-09-21 18:55           ` Tokenizing Vladimir Kazanov
  2 siblings, 2 replies; 72+ messages in thread
From: Stephen Leake @ 2014-09-21 15:32 UTC (permalink / raw)
  To: emacs-devel

Vladimir Kazanov <vekazanov@gmail.com> writes:

> Okay, I'll give text properties a try.
>
> Right now my vision for this mode is the following:
>
> - avoid retokenizing undamaged buffer parts at all costs (as a main
> feature meant for incremental parsing);

You might look at what I did in Ada mode (current source in ELPA); see
wisi.el wisi-before/after-change.

> - collect damages and do reparsing only when user stops editing,
> similar to the font-lock-mode (js2-mode, nxml-mode...);

Ada mode only reparses when the user requests an action that requires a
parse. How else do you tell when a user "stops editing"?

font-lock runs after 'idle-time', which appears to be about 2 seconds (I
could not figure out from the structure of 'timer-idle-list' what the
actual idle time is). I guess that's the approximation of when the user
stops editing.

I don't normally edit 7000 line files, so the Ada mode parsing delay is
not noticeable to me, so I prefer the current Ada mode approach of not
using the idle timer to trigger a parse. But it could be a user option. 

> - the incremental logic should have two interfaces, the first one
> meant for language-specific tokenizing code and a second one - for the
> user code, be it code beautifiers or advanced incremental parsers;
>
> - it should be possible to completely replace the font-lock-mode with
> this mode, given a concrete language tokenizer;
>
> You said two things basically: 1) I must use text properties, 2) it is
> possible to improve text properties interfaces to help the tokenizer.
> I suggest the following plan:
>
> 1) try to implement the tokenizer using available text property
> mechanics;

Ada mode uses text properties to store parse results; the tokenizer
results are part of that, but are not stored separately. I don't see
much point in separating the tokenizer from the parser; the tokenizer
results are not useful by themselves (at least, not in Ada mode).

> 2) see if there are slow-downs or problems, or space for improvements
> on the Emacs side.

I have not noticed any problems with the text properties interface; in
particular, storing and retrieving text properties is fast compared to
parsing. Ada mode stores about two parse result text properties per
source line on average.

-- 
-- Stephe

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Tokenizing
  2014-09-21 15:32         ` Tokenizing Stephen Leake
@ 2014-09-21 16:42           ` Stefan Monnier
  2014-09-21 18:55           ` Tokenizing Vladimir Kazanov
  1 sibling, 0 replies; 72+ messages in thread
From: Stefan Monnier @ 2014-09-21 16:42 UTC (permalink / raw)
  To: Stephen Leake; +Cc: emacs-devel

> font-lock runs after 'idle-time', which appears to be about 2 seconds (I

Not really: font-lock is run "on the next redisplay",
i.e. "immediately", on the modified lines.  There is also a delayed
re-highlighting of the rest of buffer (the part after the modified
text) controlled by jit-lock-context-time (defaults to 0.5s).

        Stefan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Tokenizing
  2014-09-21 15:32         ` Tokenizing Stephen Leake
  2014-09-21 16:42           ` Tokenizing Stefan Monnier
@ 2014-09-21 18:55           ` Vladimir Kazanov
  2014-09-21 22:01             ` Tokenizing Daniel Colascione
  2014-09-22 13:15             ` Tokenizing Stephen Leake
  1 sibling, 2 replies; 72+ messages in thread
From: Vladimir Kazanov @ 2014-09-21 18:55 UTC (permalink / raw)
  To: Stephen Leake; +Cc: emacs-devel

> I don't normally edit 7000 line files, so the Ada mode parsing delay is
> not noticeable to me, so I prefer the current Ada mode approach of not
> using the idle timer to trigger a parse. But it could be a user option.
>

I will look into that. Although the main idea is to *keep the token
list consistent* most of the time. There will definitely be
customization possibilities.

>
> Ada mode uses text properties to store parse results; the tokenizer
> results are part of that, but are not stored separately. I don't see
> much point in separating the tokenizer from the parser; the tokenizer
> results are not useful by themselves (at least, not in Ada mode).
>

First, this not quite right. Tokenization results can be used, for
example, for granular syntax highlighting. Font Lock basically just
uses regexps to catch something that looks like
comments/keywords/whatever. Tokenizer already *knows* for sure what it
found. And you don't have to build a full parser to use the results.

Second, it not a tokenizer I want to build, there is a
misunderstanding of sorts. It is a helper mode (similar to Font Lock,
in a way) for keeping token lists up to date all the time, easy and
fast. User code - the tokenizer itself - will just have to provide an
interface to the mode (be restartable and supply required restart
information in resulting tokens). The mode will use the information to
avoid extra tokenizing.

> I have not noticed any problems with the text properties interface; in
> particular, storing and retrieving text properties is fast compared to
> parsing. Ada mode stores about two parse result text properties per
> source line on average.

I did not know about your mode - and parsers are sort of my hobby :-)
I will definitely check it out, especially because it uses GLR(it
really does?!), which can non-trivial to implement.

-- 
Yours sincerely,

Vladimir Kazanov

--
С уважением,

Владимир Казанов

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Tokenizing
  2014-09-21 18:55           ` Tokenizing Vladimir Kazanov
@ 2014-09-21 22:01             ` Daniel Colascione
  2014-09-22 10:21               ` Tokenizing Vladimir Kazanov
  2014-09-22 14:02               ` Tokenizing Stephen Leake
  2014-09-22 13:15             ` Tokenizing Stephen Leake
  1 sibling, 2 replies; 72+ messages in thread
From: Daniel Colascione @ 2014-09-21 22:01 UTC (permalink / raw)
  To: Vladimir Kazanov, Stephen Leake; +Cc: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 8905 bytes --]

On 09/21/2014 11:55 AM, Vladimir Kazanov wrote:
>> I don't normally edit 7000 line files, so the Ada mode parsing delay is
>> not noticeable to me, so I prefer the current Ada mode approach of not
>> using the idle timer to trigger a parse. But it could be a user option.
>>
> 
> I will look into that. Although the main idea is to *keep the token
> list consistent* most of the time. There will definitely be
> customization possibilities.

I've been working (very, very, very slowly) on similar functionality.
The basic idea is based on the incremental lexing algorithm that Tim A.
Wagner sets out in chapter 5 of his thesis [1].  The key is dynamically
tracking lookahead used while we generate each token.  Wagner's
algorithm allows us to incorporate arbitrary lookahead into the
invalidation state, so supporting something like flex's unlimited
trailing context is no problem.

The nice thing about this algorithm is that like the parser, it's an
online algorithm and arbitrarily restartable.

We can't use the built-in Emacs regular expression engine for this
facility because there's no way to tell regex.c to record how much
lookahead it used, and even if there were, I'd rather not rely on its
backtracking matcher. I've written a DFA-based regex matcher that should
help in implementing this algorithm; of course, a DFA matcher has a
lookahead of zero.

Mine supports zero-width predicates though, so we can actually use
achieve nonzero lookahead (and lookbehind!) if needed.

Where my thing departs from flex is that I want to use a regular
expression (in the rx sense) to describe the higher-level parsing
automaton instead of making mode authors fiddle with start states.  This
way, it's easy to incorporate support for things like JavaScript's
regular expression syntax, in which "/" can mean one of two tokens
depending on the previous token.

(Another way of dealing with lexical ambiguity is to let the lexer
return an arbitrary number of tokens for a given position and let the
GLR parser sort it out, but I'm not as happy with that solution.)

>> Ada mode uses text properties to store parse results; the tokenizer
>> results are part of that, but are not stored separately. I don't see
>> much point in separating the tokenizer from the parser; the tokenizer
>> results are not useful by themselves (at least, not in Ada mode).
>>
> 
> First, this not quite right. Tokenization results can be used, for
> example, for granular syntax highlighting. Font Lock basically just
> uses regexps to catch something that looks like
> comments/keywords/whatever. Tokenizer already *knows* for sure what it
> found. And you don't have to build a full parser to use the results.

There are two stages here: you want in *some* cases for fontification to
use the results of tokenization directly; in other cases, you want to
apply fontification rules to the result of parsing that token stream.
Splitting the fontification rules between terminals and non-terminals
this way helps us maintain rudimentary fontification even for invalid
buffer contents --- that is, if the user types gibberish in a C-mode
buffer, we want constructs that look like keywords and strings in that
gibberish stream to be highlighted.

> Second, it not a tokenizer I want to build, there is a
> misunderstanding of sorts. It is a helper mode (similar to Font Lock,
> in a way) for keeping token lists up to date all the time, easy and
> fast. User code - the tokenizer itself - will just have to provide an
> interface to the mode (be restartable and supply required restart
> information in resulting tokens). The mode will use the information to
> avoid extra tokenizing.

Indeed. I don't imagine replacing font-lock per se, but instead just
implementing a font-lock matcher that *lazily* brings the token list
up-to-date and pulls keywords out of it for fontification. It's
important to keep font-lock around for backward compatibility, but
there's no reason that font-lock's data source has to be a bunch of
simple regex matches.

cc-mode, for example, uses font-lock to apply highlights that come from
a parser that looks oddly like a CYK parser.  (We start parsing at all
points in the buffer that *might* start a declaration; there's also some
limited, ad-hoc backtracking.)

>> I have not noticed any problems with the text properties interface; in
>> particular, storing and retrieving text properties is fast compared to
>> parsing. Ada mode stores about two parse result text properties per
>> source line on average.
> 
> I did not know about your mode - and parsers are sort of my hobby :-)

Mine too. :-)

> I will definitely check it out, especially because it uses GLR(it
> really does?!), which can non-trivial to implement.

Wagner's thesis contains a description of a few alternative incremental
GLR algorithms that look very promising.

I have a few extensions in mind too.  It's important to be able to
quickly fontify a particular region of the buffer --- e.g., while scrolling.

If we've already built a parse tree and damage part of the buffer, we
can repair the tree and re-fontify fairly quickly. But what if we
haven't parsed the whole buffer yet?

We can start parsing at some region we quickly determine is semantically
equivalent to the start of the buffer using something like
font-lock-beginning-of-syntax-function.  But what about completing the
parse? Well, it's possible to determine whether any particular buffer
state is being parsed "dynamically" (the GSS is split) or "statically"
(the GSS has one head). In practice, ambiguous regions are short, so we
only have to worry about the latter case.  Here, we can reliably
partially parse the buffer by parsing forward (remember, GLR parsing is
an online algorithm) until the parsing limit is after the visible
portion of the buffer; we can then use a "shortest-to-completion"
production (which we can pre-compute from the grammar) to insert virtual
tokens that complete the language.  If we were parsing C++, we'd be able
to automatically insert virtual "}" tokens and complete the parse tree,
giving us a partial parse tree we can then use for fontification.

Of course, it's better to have a parse tree that covers the whole
buffer, but we can compute that in the background.  One really nice
property of GLR parsing (and LR parsers generally) is that we can
trivially pause and restart them, so we can parse in the background with
very little impact on user activity, using `while-no-input' to back out
to the event loop when we detect we have something more important to do.

Error recovery is a challenge --- we want to do something useful on
buffers that are temporarily invalided by user edits.  One of the
biggest challenges is dealing with mismatched brace constructs, since
one missing "}" in C-mode can cause the entire rest of the buffer to no
longer conform to the grammar.

Bridge parsing is particularly interesting here: the basic idea is that
we perform one pass over the buffer using a simplified grammar that
looks only at nesting constructs, insert virtual braces to make
everything balance, and then apply the usual parsing algorithm (with
ad-hoc error recovery) to the result. I haven't tried it, but it
apparently works pretty well.

I like what SMIE does, but I don't think it's powerful enough to parse
C++. GLR parsers give us all the power we need.

GLR parsers also leave us with a parse forest instead of a parse tree,
and we really need a parse *tree* if we're going to apply fontification
rules specified in terms of AST nodes.  We need to prune the parse
forest down to a parse tree before using it for fontification.

I actually have a pretty neat idea for doing that: we can attach
assertions to parts of the parse forest and prune the tree by deleting
alternatives that contradict what we know.  cc-mode already does a very
limited form of this analysis by recording types it sees during parsing.

Consider the most vexing parse:

  type variable(function())

We don't know whether that's a function declaration or a variable
declaration. In GLR-land, we generate a parse forest with branches for
both options. If later in the buffer we see something that looks like this:

  variable->foo();

That we know that "variable" cannot be a function, so we can prune the
part of the parse forest that assumes that "variable" is a function,
leaving us with an unambiguous parse tree.

This way, we should be able to come up with reasonable parses for C++
buffers without having an IDE-style global view of types declarations,
instead assuming that programmers, most of the time, don't contradict
themselves.

[1] harmonia.cs.berkeley.edu/papers/twagner-thesis.pdf
[2] fileadmin.cs.lth.se/cs/Personal/Emma_Soderberg/docs/SLE08pres.pdf

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Tokenizing
  2014-09-21 22:01             ` Tokenizing Daniel Colascione
@ 2014-09-22 10:21               ` Vladimir Kazanov
  2014-09-22 13:55                 ` Tokenizing Daniel Colascione
  2014-09-22 14:02               ` Tokenizing Stephen Leake
  1 sibling, 1 reply; 72+ messages in thread
From: Vladimir Kazanov @ 2014-09-22 10:21 UTC (permalink / raw)
  To: Daniel Colascione; +Cc: Stephen Leake, emacs-devel

On Mon, Sep 22, 2014 at 1:01 AM, Daniel Colascione <dancol@dancol.org> wrote:

> I've been working (very, very, very slowly) on similar functionality.
> The basic idea is based on the incremental lexing algorithm that Tim A.
> Wagner sets out in chapter 5 of his thesis [1].  The key is dynamically
> tracking lookahead used while we generate each token.  Wagner's
> algorithm allows us to incorporate arbitrary lookahead into the
> invalidation state, so supporting something like flex's unlimited
> trailing context is no problem.
>
> The nice thing about this algorithm is that like the parser, it's an
> online algorithm and arbitrarily restartable.

I have already mentioned Wagner's paper in the previous letters.
Actually, it is the main source of inspiration :-) But I think it is a
bit over-complicated, and the only implementation I saw (Netbean's
Lexer API) does not even try to implement it completely. Which is
okay, academic papers tend to idealize things.

> We can't use the built-in Emacs regular expression engine for this
> facility because there's no way to tell regex.c to record how much
> lookahead it used, and even if there were, I'd rather not rely on its
> backtracking matcher. I've written a DFA-based regex matcher that should
> help in implementing this algorithm; of course, a DFA matcher has a
> lookahead of zero.
>
> Mine supports zero-width predicates though, so we can actually use
> achieve nonzero lookahead (and lookbehind!) if needed.

You do realize that this is a client's code problem? We can only
recommend to use this or that regex engine, or even set the lookahead
value for various token types by hand; and the latter case would
probably work for most real-life cases.

I am not even sure that it is possible to do it the Wagner's way (have
a real next_char() function) in Emacs. I would check Lexer API
solution as a starting point.

> Where my thing departs from flex is that I want to use a regular
> expression (in the rx sense) to describe the higher-level parsing
> automaton instead of making mode authors fiddle with start states.  This
> way, it's easy to incorporate support for things like JavaScript's
> regular expression syntax, in which "/" can mean one of two tokens
> depending on the previous token.
>
> (Another way of dealing with lexical ambiguity is to let the lexer
> return an arbitrary number of tokens for a given position and let the
> GLR parser sort it out, but I'm not as happy with that solution.)

I do not want to solve any concrete lexing problems. The whole point
is about supplying a way to do it incrementally. I do not want to know
anything about the code above or below , be it GLR/LR/flex/etc.

>
> There are two stages here: you want in *some* cases for fontification to
> use the results of tokenization directly; in other cases, you want to
> apply fontification rules to the result of parsing that token stream.
> Splitting the fontification rules between terminals and non-terminals
> this way helps us maintain rudimentary fontification even for invalid
> buffer contents --- that is, if the user types gibberish in a C-mode
> buffer, we want constructs that look like keywords and strings in that
> gibberish stream to be highlighted.

Yes, and this is a client's code that has to decide those things, be
it using only the token list to do fontification or let a higher-level
a parser do it.

>
>> I will definitely check it out, especially because it uses GLR(it
>> really does?!), which can non-trivial to implement.
>
> Wagner's thesis contains a description of a few alternative incremental
> GLR algorithms that look very promising.

Yes, and a lot more :-) I want to concentrate on a smaller problem -
don't feel like implementing the whole thesis right now.

> I have a few extensions in mind too.  It's important to be able to
> quickly fontify a particular region of the buffer --- e.g., while scrolling.
>
> If we've already built a parse tree and damage part of the buffer, we
> can repair the tree and re-fontify fairly quickly. But what if we
> haven't parsed the whole buffer yet?
>

Nice. And I will definitely need to discuss all the optimization
possibilities later. First, the core logic has to be implemented.

Bottom line: I want to take this particular narrow problem, a few user
code examples (for me it is a port of CPython's LL(1) parser) and see
if I can solve in an optimal way. A working prototype will take some
time, a month or more - I am not in a hurry.

As much as I understand, you want to cooperate on it, right..?

-- 
Yours sincerely,

Vladimir Kazanov

--
С уважением,

Владимир Казанов

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Tokenizing
  2014-09-22 10:21               ` Tokenizing Vladimir Kazanov
@ 2014-09-22 13:55                 ` Daniel Colascione
  0 siblings, 0 replies; 72+ messages in thread
From: Daniel Colascione @ 2014-09-22 13:55 UTC (permalink / raw)
  To: Vladimir Kazanov; +Cc: Stephen Leake, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 5065 bytes --]

On 09/22/2014 03:21 AM, Vladimir Kazanov wrote:
> On Mon, Sep 22, 2014 at 1:01 AM, Daniel Colascione <dancol@dancol.org> wrote:
> 
>> I've been working (very, very, very slowly) on similar functionality.
>> The basic idea is based on the incremental lexing algorithm that Tim A.
>> Wagner sets out in chapter 5 of his thesis [1].  The key is dynamically
>> tracking lookahead used while we generate each token.  Wagner's
>> algorithm allows us to incorporate arbitrary lookahead into the
>> invalidation state, so supporting something like flex's unlimited
>> trailing context is no problem.
>>
>> The nice thing about this algorithm is that like the parser, it's an
>> online algorithm and arbitrarily restartable.
> 
> I have already mentioned Wagner's paper in the previous letters.
> Actually, it is the main source of inspiration :-) But I think it is a
> bit over-complicated, and the only implementation I saw (Netbean's
> Lexer API) does not even try to implement it completely. Which is
> okay, academic papers tend to idealize things.

That Lexer is a dumb state matcher, last time I checked. So is
Eclipse's. Neither is adequate, at least not if you want to support
lexing *arbitrary* languages (e.g., Python and JavaScript) with
guaranteed correctness in the face of arbitrary buffer modification.

> You do realize that this is a client's code problem? We can only
> recommend to use this or that regex engine, or even set the lookahead
> value for various token types by hand; and the latter case would
> probably work for most real-life cases.
> 
> I am not even sure that it is possible to do it the Wagner's way (have
> a real next_char() function) in Emacs. I would check Lexer API
> solution as a starting point.

Of course it's possible to implement in Emacs. Buffers are strictly more
powerful than character streams.

> 
>> Where my thing departs from flex is that I want to use a regular
>> expression (in the rx sense) to describe the higher-level parsing
>> automaton instead of making mode authors fiddle with start states.  This
>> way, it's easy to incorporate support for things like JavaScript's
>> regular expression syntax, in which "/" can mean one of two tokens
>> depending on the previous token.
>>
>> (Another way of dealing with lexical ambiguity is to let the lexer
>> return an arbitrary number of tokens for a given position and let the
>> GLR parser sort it out, but I'm not as happy with that solution.)
> 
> I do not want to solve any concrete lexing problems. The whole point
> is about supplying a way to do it incrementally. I do not want to know
> anything about the code above or below , be it GLR/LR/flex/etc.
> 
>>
>> There are two stages here: you want in *some* cases for fontification to
>> use the results of tokenization directly; in other cases, you want to
>> apply fontification rules to the result of parsing that token stream.
>> Splitting the fontification rules between terminals and non-terminals
>> this way helps us maintain rudimentary fontification even for invalid
>> buffer contents --- that is, if the user types gibberish in a C-mode
>> buffer, we want constructs that look like keywords and strings in that
>> gibberish stream to be highlighted.
> 
> Yes, and this is a client's code that has to decide those things, be
> it using only the token list to do fontification or let a higher-level
> a parser do it.

Unless the parser itself is incremental, you're going to have
interactivity problems.

>>> I will definitely check it out, especially because it uses GLR(it
>>> really does?!), which can non-trivial to implement.
>>
>> Wagner's thesis contains a description of a few alternative incremental
>> GLR algorithms that look very promising.
> 
> Yes, and a lot more :-) I want to concentrate on a smaller problem -
> don't feel like implementing the whole thesis right now.
> 
>> I have a few extensions in mind too.  It's important to be able to
>> quickly fontify a particular region of the buffer --- e.g., while scrolling.
>>
>> If we've already built a parse tree and damage part of the buffer, we
>> can repair the tree and re-fontify fairly quickly. But what if we
>> haven't parsed the whole buffer yet?
>>
> 
> Nice. And I will definitely need to discuss all the optimization
> possibilities later. First, the core logic has to be implemented.
> 
> Bottom line: I want to take this particular narrow problem, a few user
> code examples (for me it is a port of CPython's LL(1) parser) and see
> if I can solve in an optimal way. A working prototype will take some
> time, a month or more - I am not in a hurry.
> 
> As much as I understand, you want to cooperate on it, right..?

*sigh* It sounds like you want to create something simple. You'll run
into the same problems I did, or you'll produce something less than
fully general. I don't have enough time to work on something that isn't
fully general. I'm sick of writing language-specific text parsing code.

Have fun.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Tokenizing
  2014-09-21 22:01             ` Tokenizing Daniel Colascione
  2014-09-22 10:21               ` Tokenizing Vladimir Kazanov
@ 2014-09-22 14:02               ` Stephen Leake
  2014-09-22 14:14                 ` Tokenizing Daniel Colascione
  1 sibling, 1 reply; 72+ messages in thread
From: Stephen Leake @ 2014-09-22 14:02 UTC (permalink / raw)
  To: emacs-devel

Daniel Colascione <dancol@dancol.org> writes:

> Wagner's thesis contains a description of a few alternative incremental
> GLR algorithms that look very promising.

Thanks for the pointer; I really need better error handling in the Ada
mode parser. It currently defaults to a _really_ dumb indentation
algorithm (same as previous line) when the parse fails.

> We can start parsing at some region we quickly determine is semantically
> equivalent to the start of the buffer using something like
> font-lock-beginning-of-syntax-function.  

For Ada, that's only start of buffer; the standard convention is only
one compilation-unit (the single grammar start symbol) per file.

> Here, we can reliably partially parse the buffer by parsing
> forward (remember, GLR parsing is an online algorithm) until the
> parsing limit is after the visible portion of the buffer; we can then
> use a "shortest-to-completion" production (which we can pre-compute
> from the grammar) to insert virtual tokens that complete the language.

I had the same idea; I was hoping it was new and I could make money :).
Is that in Wagner's thesis (I didn't see it on a quick scan), or written
up somewhere else?

> One really nice
> property of GLR parsing (and LR parsers generally) is that we can
> trivially pause and restart them, so we can parse in the background with
> very little impact on user activity, using `while-no-input' to back out
> to the event loop when we detect we have something more important to
> do.

That's what I need! Although a really fast GLR parser implemented in
compiled Ada that accesses the buffer via an Emacs FFI would be better.

> Error recovery is a challenge --- we want to do something useful on
> buffers that are temporarily invalided by user edits.  One of the
> biggest challenges is dealing with mismatched brace constructs, since
> one missing "}" in C-mode can cause the entire rest of the buffer to no
> longer conform to the grammar.

I'm hoping to use the same "shortest-to-completion" algorithm for this.
It may distort the rest of the buffer, but it will be better than what
it does now.

> Bridge parsing is particularly interesting here: the basic idea is that
> we perform one pass over the buffer using a simplified grammar that
> looks only at nesting constructs, insert virtual braces to make
> everything balance, and then apply the usual parsing algorithm (with
> ad-hoc error recovery) to the result. I haven't tried it, but it
> apparently works pretty well.

That could work for Ada as well.

> I like what SMIE does, but I don't think it's powerful enough to parse
> C++. 

I wrote a SMIE parser for Ada, but it ended up always parsing the whole
buffer. So I switched to LALR. Then to generalized so I would not have
to mangle the grammar in the Ada standard as much.

> GLR parsers also leave us with a parse forest instead of a parse tree,
> and we really need a parse *tree* if we're going to apply fontification
> rules specified in terms of AST nodes.  We need to prune the parse
> forest down to a parse tree before using it for fontification.

In Ada mode, I just assume the results of parallel parses that end in
the same state are equivalent for the purposes of Ada mode
(fontification and navigation). So I just throw away all but one result.
I haven't really investigated whether this is true.

> Consider the most vexing parse:
>
>   type variable(function())

One reason I like Ada; the equivalent declarations are not ambiguous.

-- 
-- Stephe

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Tokenizing
  2014-09-22 14:02               ` Tokenizing Stephen Leake
@ 2014-09-22 14:14                 ` Daniel Colascione
  0 siblings, 0 replies; 72+ messages in thread
From: Daniel Colascione @ 2014-09-22 14:14 UTC (permalink / raw)
  To: Stephen Leake, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 4467 bytes --]

On 09/22/2014 07:02 AM, Stephen Leake wrote:
> Daniel Colascione <dancol@dancol.org> writes:
> 
>> Wagner's thesis contains a description of a few alternative incremental
>> GLR algorithms that look very promising.
> 
> Thanks for the pointer; I really need better error handling in the Ada
> mode parser. It currently defaults to a _really_ dumb indentation
> algorithm (same as previous line) when the parse fails.

Yep. falling back to counting parens seems reasonable enough.

>> We can start parsing at some region we quickly determine is semantically
>> equivalent to the start of the buffer using something like
>> font-lock-beginning-of-syntax-function.  
> 
> For Ada, that's only start of buffer; the standard convention is only
> one compilation-unit (the single grammar start symbol) per file.
> 
>> Here, we can reliably partially parse the buffer by parsing
>> forward (remember, GLR parsing is an online algorithm) until the
>> parsing limit is after the visible portion of the buffer; we can then
>> use a "shortest-to-completion" production (which we can pre-compute
>> from the grammar) to insert virtual tokens that complete the language.
> 
> I had the same idea; I was hoping it was new and I could make money :).
> Is that in Wagner's thesis (I didn't see it on a quick scan), or written
> up somewhere else?

Not AFAIK.

>> One really nice
>> property of GLR parsing (and LR parsers generally) is that we can
>> trivially pause and restart them, so we can parse in the background with
>> very little impact on user activity, using `while-no-input' to back out
>> to the event loop when we detect we have something more important to
>> do.
> 
> That's what I need! Although a really fast GLR parser implemented in
> compiled Ada that accesses the buffer via an Emacs FFI would be better.

Another nice thing about lexing with an automaton and parsing with a
generic algorithm is that the parsing machine can be pushed (if needed)
to C core with no loss of customizability or generality. I'd prefer that
individual modes not rely on the FFI for basic fontification and
indentation services.

>> Error recovery is a challenge --- we want to do something useful on
>> buffers that are temporarily invalided by user edits.  One of the
>> biggest challenges is dealing with mismatched brace constructs, since
>> one missing "}" in C-mode can cause the entire rest of the buffer to no
>> longer conform to the grammar.
> 
> I'm hoping to use the same "shortest-to-completion" algorithm for this.
> It may distort the rest of the buffer, but it will be better than what
> it does now.

Maybe it's better than nothing, but bridge parsing is a bit more
promising, I think. :-)

>> Bridge parsing is particularly interesting here: the basic idea is that
>> we perform one pass over the buffer using a simplified grammar that
>> looks only at nesting constructs, insert virtual braces to make
>> everything balance, and then apply the usual parsing algorithm (with
>> ad-hoc error recovery) to the result. I haven't tried it, but it
>> apparently works pretty well.
> 
> That could work for Ada as well.
> 
>> I like what SMIE does, but I don't think it's powerful enough to parse
>> C++. 
> 
> I wrote a SMIE parser for Ada, but it ended up always parsing the whole
> buffer. So I switched to LALR. Then to generalized so I would not have
> to mangle the grammar in the Ada standard as much.
> 
>> GLR parsers also leave us with a parse forest instead of a parse tree,
>> and we really need a parse *tree* if we're going to apply fontification
>> rules specified in terms of AST nodes.  We need to prune the parse
>> forest down to a parse tree before using it for fontification.
> 
> In Ada mode, I just assume the results of parallel parses that end in
> the same state are equivalent for the purposes of Ada mode
> (fontification and navigation). So I just throw away all but one result.
> I haven't really investigated whether this is true.

You can still end up in the same state having shifted different
subtrees, so if you do it that way, you can flop between different
parsing alternatives as you edit the buffer for no reason at all.

>> Consider the most vexing parse:
>>
>>   type variable(function())
> 
> One reason I like Ada; the equivalent declarations are not ambiguous.

Writing unambiguous languages seems to have fallen out of fashion.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: Tokenizing
  2014-09-21 18:55           ` Tokenizing Vladimir Kazanov
  2014-09-21 22:01             ` Tokenizing Daniel Colascione
@ 2014-09-22 13:15             ` Stephen Leake
  1 sibling, 0 replies; 72+ messages in thread
From: Stephen Leake @ 2014-09-22 13:15 UTC (permalink / raw)
  To: emacs-devel

Vladimir Kazanov <vekazanov@gmail.com> writes:

>> Ada mode uses text properties to store parse results; the tokenizer
>> results are part of that, but are not stored separately. I don't see
>> much point in separating the tokenizer from the parser; the tokenizer
>> results are not useful by themselves (at least, not in Ada mode).
>>
>
> First, this not quite right. Tokenization results can be used, for
> example, for granular syntax highlighting. 

Ada requires semantic parsing, not just tokenizing, for syntax
highlighting (it's a complex language :).

Hmm, I'm not sure what you mean by "granualar" here; if you mean "less
than totally accurate", then you are right, we don't need a parser for
that. On the other hand, we don't need a tokenizer, either :).

In Ada, a parser is required to distinguish between these two instances
of 'return':

function Foo (...) return Bar
is begin
  Baz := ...;
  return Baz;
end Foo;

In the first instance, "Bar" is a type; in the second, "Baz" is a
variable. They should have different faces.

Finding the 'function' keyword from just token information is
non-trivial. The parser tags Bar as a type.

> Font Lock basically just
> uses regexps to catch something that looks like
> comments/keywords/whatever. 

But that can be extended by arbitrary functions in
font-lock-add-keywords; Ada mode does that to use the parser information
(when available; it doesn't force a parse just for font-lock).

> Second, it not a tokenizer I want to build, there is a
> misunderstanding of sorts. It is a helper mode (similar to Font Lock,
> in a way) for keeping token lists up to date all the time, easy and
> fast. User code - the tokenizer itself - will just have to provide an
> interface to the mode (be restartable and supply required restart
> information in resulting tokens). The mode will use the information to
> avoid extra tokenizing.

Ok. Maybe I can use that, and have it run the parser whenever needed.
Just replace "token lists" with "some text properties" in the above; the
helper mode should not care if they are "tokenizer results" or "parser
results".

>> I have not noticed any problems with the text properties interface; in
>> particular, storing and retrieving text properties is fast compared to
>> parsing. Ada mode stores about two parse result text properties per
>> source line on average.
>
> I did not know about your mode - and parsers are sort of my hobby :-)
> I will definitely check it out, especially because it uses GLR(it
> really does?!), which can non-trivial to implement.

Yes, implementing GLR was complicated, and therefore fun :).

-- 
-- Stephe

^ permalink raw reply	[flat|nested] 72+ messages in thread

end of thread, other threads:[~2014-10-06 21:02 UTC | newest]

Thread overview: 72+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-09-19 14:59 Overlay mechanic improvements Vladimir Kazanov
2014-09-19 17:22 ` Stefan Monnier
2014-09-20 13:19   ` Richard Stallman
2014-09-20 13:37     ` David Kastrup
2014-09-21 13:35       ` Richard Stallman
2014-09-21 13:52         ` David Kastrup
2014-09-21 21:48           ` Richard Stallman
2014-09-21 22:06             ` David Kastrup
2014-09-22 23:11               ` Richard Stallman
2014-09-22 23:50                 ` David Kastrup
2014-09-23 19:15                   ` Richard Stallman
2014-09-21 16:07         ` Stefan Monnier
2014-09-21 16:14           ` David Kastrup
2014-09-21 21:48             ` Richard Stallman
2014-09-21 22:19               ` David Kastrup
2014-09-23 19:16                 ` Richard Stallman
2014-09-23 19:27                   ` David Kastrup
2014-09-28 23:24                     ` Richard Stallman
2014-09-29  5:45                       ` David Kastrup
2014-09-29 20:48                         ` Richard Stallman
2014-09-30  1:21                           ` Stephen J. Turnbull
2014-09-30  8:43                             ` David Kastrup
2014-09-30 10:35                               ` Rasmus
2014-09-30 14:22                                 ` Eli Zaretskii
2014-09-30 16:20                                   ` David Kastrup
2014-09-30 16:35                                     ` Eli Zaretskii
2014-09-30 14:32                                 ` Stefan Monnier
2014-10-02 16:12                                 ` Uwe Brauer
2014-09-30 19:23                             ` Richard Stallman
2014-10-01  3:38                               ` Stephen J. Turnbull
2014-10-01 12:53                                 ` Richard Stallman
2014-10-01 13:11                                   ` David Kastrup
2014-10-02  1:26                                   ` Stephen J. Turnbull
2014-09-30  5:52                           ` David Kastrup
2014-10-06 19:14                             ` Richard Stallman
2014-10-06 21:02                               ` David Kastrup
2014-09-21 16:56           ` Eli Zaretskii
2014-09-21 18:42             ` Stefan Monnier
2014-09-21 18:58               ` Eli Zaretskii
2014-09-21 20:12                 ` Stefan Monnier
2014-09-21 21:48           ` Richard Stallman
2014-09-22  0:31             ` Stefan Monnier
2014-09-22 23:11               ` Richard Stallman
2014-09-20 15:56     ` Eli Zaretskii
2014-09-20 19:49     ` Stefan Monnier
2014-09-21 13:36       ` Richard Stallman
2014-09-19 18:03 ` Richard Stallman
2014-09-20  8:08   ` Vladimir Kazanov
2014-09-20 13:21     ` Richard Stallman
2014-09-20 16:28       ` Stephen Leake
2014-09-20 13:21     ` Tokenizing Richard Stallman
2014-09-20 16:24       ` Tokenizing Stephen Leake
2014-09-20 16:40         ` Tokenizing Vladimir Kazanov
2014-09-20 20:16           ` Tokenizing Eric Ludlam
2014-09-20 20:35             ` Tokenizing Vladimir Kazanov
2014-09-21 15:13             ` parsing (was tokenizing) Stephen Leake
2014-09-20 16:36       ` Tokenizing Vladimir Kazanov
2014-09-20 19:55         ` Tokenizing Stefan Monnier
2014-09-21 15:35           ` Tokenizing Stephen Leake
2014-09-21 16:43             ` Tokenizing Stefan Monnier
2014-09-22 14:05               ` Tokenizing Stephen Leake
2014-09-21 13:35         ` Tokenizing Richard Stallman
2014-09-21 14:24           ` Tokenizing Vladimir Kazanov
2014-09-21 15:32         ` Tokenizing Stephen Leake
2014-09-21 16:42           ` Tokenizing Stefan Monnier
2014-09-21 18:55           ` Tokenizing Vladimir Kazanov
2014-09-21 22:01             ` Tokenizing Daniel Colascione
2014-09-22 10:21               ` Tokenizing Vladimir Kazanov
2014-09-22 13:55                 ` Tokenizing Daniel Colascione
2014-09-22 14:02               ` Tokenizing Stephen Leake
2014-09-22 14:14                 ` Tokenizing Daniel Colascione
2014-09-22 13:15             ` Tokenizing Stephen Leake

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).