* Overlay mechanic improvements @ 2014-09-19 14:59 Vladimir Kazanov 2014-09-19 17:22 ` Stefan Monnier 2014-09-19 18:03 ` Richard Stallman 0 siblings, 2 replies; 72+ messages in thread From: Vladimir Kazanov @ 2014-09-19 14:59 UTC (permalink / raw) To: emacs-devel [-- Attachment #1: Type: text/plain, Size: 1579 bytes --] Hello everyone! Couple of days ago I submitted a question to Stackoverflow regarding an adequate way to handle many (thousands, that is) overlays. I wanted to know if there are any alternatives to using overlays (which is slow) for cases similar to mine. To make a long story short, I am building a generalized incremental tokenizer, details can be read on Stackoverflow ( http://stackoverflow.com/questions/25915908/best-way-to-save-lots-of-position-pairs-in-emacs-lisp). It turns out that the only viable alternative to overlays is using text properties, and text properties were not meant for the purpose. All I need is an ability to save position pairs, the positions should survive text insertion/deletion and there should be a way to find those pairs given a buffer point. Overlay interface is perfect (consider overlays-at, overlays-in and the dynamic (re-)positioning system). After answering my question Stefan (Monnier?) mentioned that overlay behavior can be fixed from N^2 to N*logN, which would make them asymptotically equivalent to text properties; adding that he has no time to implement the fix. I believe that given some time, a few tips from experienced Emacs developers and enough coffee I can at least try to fulfil the task. I am not in a hurry, and wanted to get into Emacs internals anyway. What do you think? PS I already have my Cedet/Emacs FSF papers signed - sent a minor patch to Cedet some time ago. -- Yours sincerely, Vladimir Kazanov -- С уважением, Владимир Казанов [-- Attachment #2: Type: text/html, Size: 1784 bytes --] ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-19 14:59 Overlay mechanic improvements Vladimir Kazanov @ 2014-09-19 17:22 ` Stefan Monnier 2014-09-20 13:19 ` Richard Stallman 2014-09-19 18:03 ` Richard Stallman 1 sibling, 1 reply; 72+ messages in thread From: Stefan Monnier @ 2014-09-19 17:22 UTC (permalink / raw) To: Vladimir Kazanov; +Cc: emacs-devel > After answering my question Stefan (Monnier?) mentioned that overlay > behavior can be fixed from N^2 to N*logN, which would make them > asymptotically equivalent to text properties; adding that he has no time to > implement the fix. So, here's the idea: Currently overlays are implemented as (two) sorted singly linked lists (one for overlays_before some position and one for overlay_after that position, for some quirky definition of "before" and "after"). The function `overlay-recenter' changes the position used for the split (and is called internally in various situations). Each overlay is itself implemented with two markers (which keep track of the overlay-start and overlay-end). Markers are implemented as a non-sorted singly linked list of markers. So every text insertion/deletion requires O(N) time, where N is the number of markers since we have to go down that list to update those markers that are affected by the modification. You can start in src/buffer.[ch], maybe grepping for overlays_before for a starting point. Text-properties, OTOH, are implemented with a (mostly) balanced binary tree. This is implemented in src/intervals.[ch]. So we'd like to change overlays so that they don't use markers (and we don't keep them in two sorted singly-linked lists) any more. Instead, we'll store them inside the balanced binary tree used for text-properties. I think we can use the "augmented tree" approach described in https://en.wikipedia.org/wiki/Interval_tree. To ease up debugging during development, I'd guess the implementation would first add the new stuff, keeping the old stuff (i.e. add to Lisp_Overlay whichever fields are needed for the new code, while keeping the old ones, add needed overlay fields to the intervals tree, but keep the old fields, the overlays_before etc...). This way, you can add consistency checks that make sure the new code computes the same results as the old code. And once that works well, we can remove the old code and old fields. So I suggest you start looking at src/buffer.[ch] and src/intervals.[ch]. That should make you ask many more questions. Stefan ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-19 17:22 ` Stefan Monnier @ 2014-09-20 13:19 ` Richard Stallman 2014-09-20 13:37 ` David Kastrup ` (2 more replies) 0 siblings, 3 replies; 72+ messages in thread From: Richard Stallman @ 2014-09-20 13:19 UTC (permalink / raw) To: Stefan Monnier; +Cc: vekazanov, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] I don't see that it makes sense to store overlays like text properties. The existing data structure is fine for overlays used as they are intended to be used. -- Dr Richard Stallman President, Free Software Foundation 51 Franklin St Boston MA 02110 USA www.fsf.org www.gnu.org Skype: No way! That's nonfree (freedom-denying) software. Use Ekiga or an ordinary phone call. ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-20 13:19 ` Richard Stallman @ 2014-09-20 13:37 ` David Kastrup 2014-09-21 13:35 ` Richard Stallman 2014-09-20 15:56 ` Eli Zaretskii 2014-09-20 19:49 ` Stefan Monnier 2 siblings, 1 reply; 72+ messages in thread From: David Kastrup @ 2014-09-20 13:37 UTC (permalink / raw) To: emacs-devel Richard Stallman <rms@gnu.org> writes: > [[[ To any NSA and FBI agents reading my email: please consider ]]] > [[[ whether defending the US Constitution against all enemies, ]]] > [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > I don't see that it makes sense to store overlays like text properties. > The existing data structure is fine for overlays used as they are intended > to be used. preview-latex <URL:http://www.gnu.org/software/auctex/preview-latex> extensively uses overlays (and they can easily grow to a few thousand per document in a mathematically oriented paper), and it is very clearly _overlays_ which are wanted here since the associated images are only correct in the context and position of a particular document and would not make any sense at all to copy and paste. Quadratic behavior for the markers used as start- and end markers is clearly undesirable. Naturally, this could be addressed by changing the data structures/algorithms for updating markers (thus killing two birds with one stone). But whether solved together with marker behavior or not, this is worth addressing. -- David Kastrup ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-20 13:37 ` David Kastrup @ 2014-09-21 13:35 ` Richard Stallman 2014-09-21 13:52 ` David Kastrup 2014-09-21 16:07 ` Stefan Monnier 0 siblings, 2 replies; 72+ messages in thread From: Richard Stallman @ 2014-09-21 13:35 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] preview-latex <URL:http://www.gnu.org/software/auctex/preview-latex> extensively uses overlays (and they can easily grow to a few thousand per document in a mathematically oriented paper), How many overlays would appear on one screen? Quadratic behavior for the markers used as start- and end markers is clearly undesirable. The existing code should be fat if you recenter the overlays frequently at point, unless you have lots of overlays on one screen. Do you recenter them? If so, why is it slow? Why aren't text properties right for this? -- Dr Richard Stallman President, Free Software Foundation 51 Franklin St Boston MA 02110 USA www.fsf.org www.gnu.org Skype: No way! That's nonfree (freedom-denying) software. Use Ekiga or an ordinary phone call. ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-21 13:35 ` Richard Stallman @ 2014-09-21 13:52 ` David Kastrup 2014-09-21 21:48 ` Richard Stallman 2014-09-21 16:07 ` Stefan Monnier 1 sibling, 1 reply; 72+ messages in thread From: David Kastrup @ 2014-09-21 13:52 UTC (permalink / raw) To: Richard Stallman; +Cc: emacs-devel Richard Stallman <rms@gnu.org> writes: > [[[ To any NSA and FBI agents reading my email: please consider ]]] > [[[ whether defending the US Constitution against all enemies, ]]] > [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > preview-latex <URL:http://www.gnu.org/software/auctex/preview-latex> > extensively uses overlays (and they can easily grow to a few thousand > per document in a mathematically oriented paper), > > How many overlays would appear on one screen? Probably rarely more than a few dozen (that would mostly be the case when using lots of inline math for small expressions like $a_i$). > Quadratic behavior for the markers used as start- and end markers is > clearly undesirable. > > The existing code should be fat if you recenter the overlays > frequently at point, unless you have lots of overlays on one screen. > Do you recenter them? Not that I know of. > If so, why is it slow? > > Why aren't text properties right for this? Because you don't want _anything_ that text properties do. You don't want to have the stuff cut&paste, you don't want to have the buffer modified because images are switched on and off, you most definitely don't want ever to split its identity in two if there are insertions in the middle, you don't want anything inserted anywhere inheriting anything from it. "We did not bother making overlays work efficiently" is not the right criterion for choosing between overlays and text properties. -- David Kastrup ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-21 13:52 ` David Kastrup @ 2014-09-21 21:48 ` Richard Stallman 2014-09-21 22:06 ` David Kastrup 0 siblings, 1 reply; 72+ messages in thread From: Richard Stallman @ 2014-09-21 21:48 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] Because you don't want _anything_ that text properties do. You don't want to have the stuff cut&paste, Why not? If the images are thought of as part of the buffer contents, I'd expect it to be desirable that they follow text that is copied. If you kill text that contains some of these images of math and then yank it back, shouldn't it come back with the images? This is what lead me to think of text properties first for this job. you don't want to have the buffer modified because images are switched on and off, We can provide a feature for turning off and on the display of these images without changing the text properties themselves. I think there already is a way. you don't want anything inserted anywhere inheriting anything from it. We already have ways to specify no inheritance. you most definitely don't want ever to split its identity in two if there are insertions in the middle, That's true. However, isn't such insertion anomalous anyway? How bad is it, if that anomalous act causes the image to appear twice? -- Dr Richard Stallman President, Free Software Foundation 51 Franklin St Boston MA 02110 USA www.fsf.org www.gnu.org Skype: No way! That's nonfree (freedom-denying) software. Use Ekiga or an ordinary phone call. ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-21 21:48 ` Richard Stallman @ 2014-09-21 22:06 ` David Kastrup 2014-09-22 23:11 ` Richard Stallman 0 siblings, 1 reply; 72+ messages in thread From: David Kastrup @ 2014-09-21 22:06 UTC (permalink / raw) To: Richard Stallman; +Cc: emacs-devel Richard Stallman <rms@gnu.org> writes: > [[[ To any NSA and FBI agents reading my email: please consider ]]] > [[[ whether defending the US Constitution against all enemies, ]]] > [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > Because you don't want _anything_ that text properties do. You don't > want to have the stuff cut&paste, > > Why not? If the images are thought of as part of the buffer contents, They most emphatically aren't. Search and replace works on the underlying text (invalidating the image when a replace happens), incremental search works on the underlying text (temporarily removing the image), saving saves the underlying text, all editing commands apply to the underlying text. > I'd expect it to be desirable that they follow text that is copied. > If you kill text that contains some of these images of math > and then yank it back, shouldn't it come back with the images? No. The same input text at a different location in the same document might show a different equation number. In a different document it might not compile at all or have a different size. > This is what lead me to think of text properties first for this job. > > you don't want to have the buffer > modified because images are switched on and off, > > We can provide a feature for turning off and on the display > of these images without changing the text properties themselves. > I think there already is a way. > > you don't want anything inserted anywhere inheriting > anything from it. > > We already have ways to specify no inheritance. > > you most definitely > don't want ever to split its identity in two if there are insertions in > the middle, > > That's true. However, isn't such insertion anomalous anyway? How bad > is it, if that anomalous act causes the image to appear twice? Bad. Text properties and overlays have different semantics. You basically say "let's ignore all the mismatching semantics because the performance of overlays is bad". That does not make sense. If it did, we would not have overlays in the first place. -- David Kastrup ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-21 22:06 ` David Kastrup @ 2014-09-22 23:11 ` Richard Stallman 2014-09-22 23:50 ` David Kastrup 0 siblings, 1 reply; 72+ messages in thread From: Richard Stallman @ 2014-09-22 23:11 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > Why not? If the images are thought of as part of the buffer contents, They most emphatically aren't. Search and replace works on the underlying text (invalidating the image when a replace happens), incremental search works on the underlying text (temporarily removing the image), saving saves the underlying text, all editing commands apply to the underlying text. Could you explain to me why this is right? I never used that feature, so I don't understand the context. You basically say "let's ignore all the mismatching semantics because the performance of overlays is bad". No, I am not saying that. I am saying something else: > I'd expect it to be desirable that they follow text that is copied. > If you kill text that contains some of these images of math > and then yank it back, shouldn't it come back with the images? > This is what lead me to think of text properties first for this job. Since you say that is not the desired behavior, I'd like to understand why not. What is a scenario for editing the text that is under the image overlay, and what is the right behavior? -- Dr Richard Stallman President, Free Software Foundation 51 Franklin St Boston MA 02110 USA www.fsf.org www.gnu.org Skype: No way! That's nonfree (freedom-denying) software. Use Ekiga or an ordinary phone call. ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-22 23:11 ` Richard Stallman @ 2014-09-22 23:50 ` David Kastrup 2014-09-23 19:15 ` Richard Stallman 0 siblings, 1 reply; 72+ messages in thread From: David Kastrup @ 2014-09-22 23:50 UTC (permalink / raw) To: Richard Stallman; +Cc: emacs-devel Richard Stallman <rms@gnu.org> writes: > [[[ To any NSA and FBI agents reading my email: please consider ]]] > [[[ whether defending the US Constitution against all enemies, ]]] > [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > > Why not? If the images are thought of as part of the buffer contents, > > They most emphatically aren't. Search and replace works on the > underlying text (invalidating the image when a replace happens), > incremental search works on the underlying text (temporarily removing > the image), saving saves the underlying text, all editing commands apply > to the underlying text. > > Could you explain to me why this is right? I never used that feature, > so I don't understand the context. The text that is being edited is LaTeX source code. preview-latex displays previews of the edited material in the source buffer. The source are things like $\sum_{i=0}^\infty \frac{1}{i+i}$ that are easy to enter and edit and somewhat cumbersome to read in context. Replacing those fragments with the rendered images in the source code while the particular corresponding source code is not being edited makes it easier to peruse the mathematical meaning of the source while continuing to compose further text and mathematics. This is a visual aid, not something changing the text in any manner. Like a tooltip, it has to get out of the way whenever you actually want to do something to the source code of the text. > No, I am not saying that. I am saying something else: > > > I'd expect it to be desirable that they follow text that is copied. > > If you kill text that contains some of these images of math > > and then yank it back, shouldn't it come back with the images? > > > This is what lead me to think of text properties first for this job. > > Since you say that is not the desired behavior, I'd like to understand > why not. > > What is a scenario for editing the text that is under the image > overlay, Any editing at all. The image is a visualization, you cannot edit that. It is output, not input. > and what is the right behavior? The same as without preview-latex. This is just like a document previewer, only scattered over the corresponding input in order to save place and concentration. It's pointless for plain text: hopefully you have selected a more readable font in your editor for that on a low-contrast 100dpi color screen than some washed-out downsampling of a font intended to look best printed with 1200dpi on a high-contrast two-color output device. But for stuff that becomes harder to follow when looking at its source code rather than the rendered rendition, it's helpful. -- David Kastrup ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-22 23:50 ` David Kastrup @ 2014-09-23 19:15 ` Richard Stallman 0 siblings, 0 replies; 72+ messages in thread From: Richard Stallman @ 2014-09-23 19:15 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > Since you say that is not the desired behavior, I'd like to understand > why not. > > What is a scenario for editing the text that is under the image > overlay, Any editing at all. The image is a visualization, you cannot edit that. It is output, not input. I think we are miscommunicating here. Could you answer in a less terse way? Is the idea that the user goes on editing, and the overlay images stick around even if they become invalid, and the user knows to ignore them? -- Dr Richard Stallman President, Free Software Foundation 51 Franklin St Boston MA 02110 USA www.fsf.org www.gnu.org Skype: No way! That's nonfree (freedom-denying) software. Use Ekiga or an ordinary phone call. ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-21 13:35 ` Richard Stallman 2014-09-21 13:52 ` David Kastrup @ 2014-09-21 16:07 ` Stefan Monnier 2014-09-21 16:14 ` David Kastrup ` (2 more replies) 1 sibling, 3 replies; 72+ messages in thread From: Stefan Monnier @ 2014-09-21 16:07 UTC (permalink / raw) To: Richard Stallman; +Cc: David Kastrup, emacs-devel > The existing code should be fat if you recenter the overlays > frequently at point, unless you have lots of overlays on one screen. > Do you recenter them? If so, why is it slow? Reasons for overlays being slow are: - if you have N overlays, you have 2N markers, so every insertion/deletion of text will involve a traversal of those 2N elements. - when we need to move from bytepos to charpos (or vice versa), we go through the 2N markers again. - when the Elisp code didn't know to overlay-recenter, or when overlays are added/removed from the "wrong" place (w.r.t to the overlay "center"), every overlay creation/removal takes O(N) time. We rarely suffer from overlay's performance issues because most codes that could suffer from it simply avoid using overlays (or they try to minimize the number of overlays used, e.g. by flushing those overlays currently not visible), even when they would be the right tool. Or they add calls to overlay-recenter when the performance problem is diagnosed. Moving overlays into a balanced binary tree will remove those algorithmic performance problems. I'm not sue why you'd be opposed, since it will provide the exact same functionality anyway. It's just an internal detail of implementation. Note that I suggest to use the balanced binary tree used by text-properties, but there is no need for the two trees to be linked. We could as well use a separate tree for overlays. I'm not sure which of the two options is best in this regard, but at least I don't see any obvious reason why re-using the existing intervals.c tree wouldn't work well. Of course, I don't know for a fact that the end result will be better than what we currently have. We will of course need to experiment with it a bit. > Why aren't text properties right for this? There are various things which are difficult to do with text-properties. E.g. the lack of after/before_string is the most obvious difference. Stefan ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-21 16:07 ` Stefan Monnier @ 2014-09-21 16:14 ` David Kastrup 2014-09-21 21:48 ` Richard Stallman 2014-09-21 16:56 ` Eli Zaretskii 2014-09-21 21:48 ` Richard Stallman 2 siblings, 1 reply; 72+ messages in thread From: David Kastrup @ 2014-09-21 16:14 UTC (permalink / raw) To: emacs-devel Stefan Monnier <monnier@iro.umontreal.ca> writes: >> Why aren't text properties right for [preview-latex]? > > There are various things which are difficult to do with text-properties. > E.g. the lack of after/before_string is the most obvious difference. Uh, that too. preview-latex makes use of them as well for several things. Supporting after/before_string with text properties would be really weird as inserting text in the middle of some text with the respective properties would cause those strings to appear ex nihilo. -- David Kastrup ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-21 16:14 ` David Kastrup @ 2014-09-21 21:48 ` Richard Stallman 2014-09-21 22:19 ` David Kastrup 0 siblings, 1 reply; 72+ messages in thread From: Richard Stallman @ 2014-09-21 21:48 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] Supporting after/before_string with text properties would be really weird as inserting text in the middle of some text with the respective properties would cause those strings to appear ex nihilo. That is true -- but does it matter, given that inserting text there is a mistake anyway? -- Dr Richard Stallman President, Free Software Foundation 51 Franklin St Boston MA 02110 USA www.fsf.org www.gnu.org Skype: No way! That's nonfree (freedom-denying) software. Use Ekiga or an ordinary phone call. ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-21 21:48 ` Richard Stallman @ 2014-09-21 22:19 ` David Kastrup 2014-09-23 19:16 ` Richard Stallman 0 siblings, 1 reply; 72+ messages in thread From: David Kastrup @ 2014-09-21 22:19 UTC (permalink / raw) To: Richard Stallman; +Cc: emacs-devel Richard Stallman <rms@gnu.org> writes: > [[[ To any NSA and FBI agents reading my email: please consider ]]] > [[[ whether defending the US Constitution against all enemies, ]]] > [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > Supporting after/before_string with text properties would be > really weird as inserting text in the middle of some text with the > respective properties would cause those strings to appear ex nihilo. > > That is true -- but does it matter, given that inserting text there is > a mistake anyway? Inserting a text in the middle of the text underlying a preview-latex image is not a mistake but the intended mode of operation. I think you are fundamentally misunderstanding what preview-latex is. It is a previewing mode that previews selected elements (like math) of typesetting a LaTeX document right in the source code so that you do not need to keep switching your focus between source code and preview window when proofreading/deriving your math. The whole point is to work with an editable source and have the images as a transient view. In order to make movement unsurprising, only some movement commands (like cursor left/right) will actually disable the move-out-of-images-in-the-main-loop behavior of Emacs in order to "open" images and display the source: most commands will avoid the inside of images in the way Emacs usually does. But that does not extend to any _modifying_ commands or to isearch. And of course not to macro execution. -- David Kastrup ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-21 22:19 ` David Kastrup @ 2014-09-23 19:16 ` Richard Stallman 2014-09-23 19:27 ` David Kastrup 0 siblings, 1 reply; 72+ messages in thread From: Richard Stallman @ 2014-09-23 19:16 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] The whole point is to work with an editable source and have the images as a transient view. In order to make movement unsurprising, only some movement commands (like cursor left/right) will actually disable the move-out-of-images-in-the-main-loop behavior of Emacs in order to "open" images and display the source: most commands will avoid the inside of images in the way Emacs usually does. What is the _desired_ behavior when the user edits text that corresponds to an image? Should the image disappear? -- Dr Richard Stallman President, Free Software Foundation 51 Franklin St Boston MA 02110 USA www.fsf.org www.gnu.org Skype: No way! That's nonfree (freedom-denying) software. Use Ekiga or an ordinary phone call. ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-23 19:16 ` Richard Stallman @ 2014-09-23 19:27 ` David Kastrup 2014-09-28 23:24 ` Richard Stallman 0 siblings, 1 reply; 72+ messages in thread From: David Kastrup @ 2014-09-23 19:27 UTC (permalink / raw) To: Richard Stallman; +Cc: emacs-devel Richard Stallman <rms@gnu.org> writes: > [[[ To any NSA and FBI agents reading my email: please consider ]]] > [[[ whether defending the US Constitution against all enemies, ]]] > [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > The whole point is to work with an editable source and have the images > as a transient view. In order to make movement unsurprising, only some > movement commands (like cursor left/right) will actually disable the > move-out-of-images-in-the-main-loop behavior of Emacs in order to "open" > images and display the source: most commands will avoid the inside of > images in the way Emacs usually does. > > What is the _desired_ behavior when the user edits text that corresponds > to an image? Should the image disappear? Editing involves making the image disappear anyway since otherwise you don't have the text for editing visible (either one or the other is displayed). There are exceptions, like when typing C-d with the cursor on an image. In that case, the first character of the underlying text will get deleted and, since the overlay content no longer corresponds to the image, the image will get undisplayed and the source text (with the first character already deleted) will reappear. When the text corresponding to an image is displayed, the image itself is not visible. Instead the text is marked in a recognizable color and it is preceded by a clickable icon with a generic "image" look. Once you finished editing, you can click that icon, the image will get regenerated from the text covered by the overlay and will again replace it in the source view. Or you can use key sequences for regeneration. Or just let the text be text and edit other things. The overlays stick around unless explicitly removed in order to make it easier to regenerate an image based on exactly the region covered by the overlay. It's not that hard to try that out yourself if you have LaTeX installed on your system and some files preferably using mathematics written in it. Just use ELPA to install AUCTeX, open a LaTeX file, and use the Preview menu (should be available by default when AUCTeX is installed via ELPA, I think). -- David Kastrup ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-23 19:27 ` David Kastrup @ 2014-09-28 23:24 ` Richard Stallman 2014-09-29 5:45 ` David Kastrup 0 siblings, 1 reply; 72+ messages in thread From: Richard Stallman @ 2014-09-28 23:24 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] Editing involves making the image disappear anyway since otherwise you don't have the text for editing visible (either one or the other is displayed). Does this mean you turn off display of the image on the overlay when the text in that region is changed? -- Dr Richard Stallman President, Free Software Foundation 51 Franklin St Boston MA 02110 USA www.fsf.org www.gnu.org Skype: No way! That's nonfree (freedom-denying) software. Use Ekiga or an ordinary phone call. ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-28 23:24 ` Richard Stallman @ 2014-09-29 5:45 ` David Kastrup 2014-09-29 20:48 ` Richard Stallman 0 siblings, 1 reply; 72+ messages in thread From: David Kastrup @ 2014-09-29 5:45 UTC (permalink / raw) To: Richard Stallman; +Cc: emacs-devel Richard Stallman <rms@gnu.org> writes: > [[[ To any NSA and FBI agents reading my email: please consider ]]] > [[[ whether defending the US Constitution against all enemies, ]]] > [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > Editing involves making the image disappear anyway since otherwise you > don't have the text for editing visible (either one or the other is > displayed). > > Does this mean you turn off display of the image on the overlay when > the text in that region is changed? Very much so, yes. Changing the text in the region would be a pain if you could not see it, and changing it breaks the correspondence between text and image anyway. -- David Kastrup ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-29 5:45 ` David Kastrup @ 2014-09-29 20:48 ` Richard Stallman 2014-09-30 1:21 ` Stephen J. Turnbull 2014-09-30 5:52 ` David Kastrup 0 siblings, 2 replies; 72+ messages in thread From: Richard Stallman @ 2014-09-29 20:48 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > Does this mean you turn off display of the image on the overlay when > the text in that region is changed? Very much so, yes. Changing the text in the region would be a pain if you could not see it, and changing it breaks the correspondence between text and image anyway. I see. But you keep the overlay in existence -- how come? -- Dr Richard Stallman President, Free Software Foundation 51 Franklin St Boston MA 02110 USA www.fsf.org www.gnu.org Skype: No way! That's nonfree (freedom-denying) software. Use Ekiga or an ordinary phone call. ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-29 20:48 ` Richard Stallman @ 2014-09-30 1:21 ` Stephen J. Turnbull 2014-09-30 8:43 ` David Kastrup 2014-09-30 19:23 ` Richard Stallman 2014-09-30 5:52 ` David Kastrup 1 sibling, 2 replies; 72+ messages in thread From: Stephen J. Turnbull @ 2014-09-30 1:21 UTC (permalink / raw) To: rms; +Cc: David Kastrup, emacs-devel Richard Stallman writes: > > Does this mean you turn off display of the image on the > > overlay when the text in that region is changed? > > Very much so, yes. Changing the text in the region would be a > pain if you could not see it, and changing it breaks the > correspondence between text and image anyway. > > I see. > > But you keep the overlay in existence -- how come? Why do you care? David knows his business, isn't that good enough? (These are real questions, not an attack. It would be easier to answer your questions informatively if we knew where you are going.) In most cases the object is something, such as an equation, which is marked by explicit delimiters, so unless you actually delete the delimiters or add new ones, it makes sense to keep the overlay which provably still contains the correct boundaries[1], rather than re-parse the relevant region, and change only the image associated with the delimited object at next request for the formatted display. Unlike a fundamentally WYSIWYG application, preview-latex provides no facility for displaying syntactically incorrect formatted output.[2] Instead, preview-latex gets out of the user's way, and allows them to take advantage of AUCTeX's[3] features to ensure syntactically correct source editing, then provides immediate feedback as to whether the input LaTeX code is *semantically* correct. Ie, the source entered displays as the user expects. Because it is based on the device-independent rendering system which will be used to produce formatted content, this feedback is as accurate as possible in most cases. Compare with WYSIWYG office suites, which typically not only produce ugly math, but produce *different* ugliness on different devices. They even auto-bogotify macro features like whether a table gets split across pages depending on whether it's output to screen or to a printer. In preview-latex, Emacs is at its text-editing best. Footnotes: [1] And it in theory could provide "protection" for the delimiters against inadvertant corruption, although AFAIK such a feature has not been added. [2] Eg, if you are entering a superscript, theoretically a WYSIWYG editor could provide a transient raised text-entry field and move the cursor there. I don't know if any actually do that. [3] AFAIK, preview-latex is heavily dependent on AUCTeX features, although it probably could be ported to other TeX modes, even *ML modes, with a more or less great expenditure of effort. ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-30 1:21 ` Stephen J. Turnbull @ 2014-09-30 8:43 ` David Kastrup 2014-09-30 10:35 ` Rasmus 2014-09-30 19:23 ` Richard Stallman 1 sibling, 1 reply; 72+ messages in thread From: David Kastrup @ 2014-09-30 8:43 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: rms, emacs-devel "Stephen J. Turnbull" <stephen@xemacs.org> writes: > Richard Stallman writes: > > > > Does this mean you turn off display of the image on the > > > overlay when the text in that region is changed? > > > > Very much so, yes. Changing the text in the region would be a > > pain if you could not see it, and changing it breaks the > > correspondence between text and image anyway. > > > > I see. > > > > But you keep the overlay in existence -- how come? > > Why do you care? David knows his business, isn't that good enough? > (These are real questions, not an attack. It would be easier to > answer your questions informatively if we knew where you are going.) Well, there are, of course, places where one might want to go. Historically, preview-latex was inspired by Emacs 21'st upcoming ability to integrate EPS images into a buffer, with a proof-of-concept up in about a week. Getting the whole thing to work as one smooth application took quite more than a man-year and a number of collaborateurs, however. And part of the reason for that is that is that a lot of functionality had to be implemented from scratch to be usable: static screenshots don't really reflect the interactive nature and efficiency with which one could work on complex documents with 1000+ equations while having a single 200MHz processor at one's disposal. One key element is using DSC comments in the generated single PostScript file for out-of-order rendering of images: Emacs entertains a pipeline into a Ghostscript process that first gets all on-screen images converted into PNG graphics before it reverts to conversion of the remaining off-screen material. If the on-screen regions change, the rendering pipeline switches the workload accordingly. With today's processors, it is actually hard to see that effect in action, but at the time it was implemented, scrolling around in a partially rendered buffer lead to an xroach-like effect where changes in the window lead to new images scurrying in until the display was static again. This sort of display-oriented rendering pipeline is not in any way specific to working with TeX/LaTeX. It is infrastructure that would be nice for a whole lot of things. In its current form, the code works in AUCTeX's LaTeX mode. It does not work for Emacs' builtin TeX modes, not even its LaTeX mode (mostly because of being tied in the process management and keymaps of AUCTeX). It does not work for AUCTeX's TeX or Texinfo modes (and the latter would be nice for working on GNU LilyPond which uses lots of images in its manuals generated from source code in the manuals). It does not work for calc's TeX output (which would increase its readability and usability significantly). It does not work for other image-generating modes like LilyPond-mode. I think that something like some Maxima-mode or similar copied an early version of preview-latex in order to implement some similar functionality. It's probably fallen victim to bitrot long ago. preview-latex does an excellent job, but much of its code is not at all related to the AUCTeX user interface or the LaTeX input language, and yet it will work for nothing but the AUCTeX user interface and the LaTeX input language. And without generally available facilities to do that sort of work, it will remain a singularly useful application without peers. Because it's far too much work to do a peer from scratch for similar needs. It was supposed to make it easier for me to work on my PhD thesis, but I ended up doing it _instead_ of my PhD thesis. The main shortcoming of preview-latex in my book is that it stands out. It should be one of possibly many thin applications on top of rendering frameworks provided by Emacs proper. Like I said: the barely-workable proof-of-concept (good for screenshots but not actual work) took about a week. The ideal state of Emacs would be that implementing similar functionality for a comparable text input language that can be rendered into PostScript graphics with DSC comments (a non-trivial subset) or into PNG images or whatever, would take about that long. That would get us to see a whole lot more editing modes with supplementary graphic support. That this shortcoming of Emacs can be cured on the Elisp side alone (as preview-latex does not need more than a working Emacs 21 binary) does not make it less of a shortcoming. -- David Kastrup ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-30 8:43 ` David Kastrup @ 2014-09-30 10:35 ` Rasmus 2014-09-30 14:22 ` Eli Zaretskii ` (2 more replies) 0 siblings, 3 replies; 72+ messages in thread From: Rasmus @ 2014-09-30 10:35 UTC (permalink / raw) To: emacs-devel Hi, This is a bit OT, sorry. David Kastrup <dak@gnu.org> writes: > [...] > It does not work for other image-generating modes like LilyPond-mode. > I think that something like some Maxima-mode or similar copied an early > version of preview-latex in order to implement some similar > functionality. It's probably fallen victim to bitrot long ago. Imaxima has a similar functionality. I don't know if it uses preview-latex internally. > preview-latex does an excellent job, but much of its code is not at all > related to the AUCTeX user interface or the LaTeX input language, and > yet it will work for nothing but the AUCTeX user interface and the LaTeX > input language. > > And without generally available facilities to do that sort of work, it > will remain a singularly useful application without peers. Because it's > far too much work to do a peer from scratch for similar needs. It was > supposed to make it easier for me to work on my PhD thesis, but I ended > up doing it _instead_ of my PhD thesis. > > The main shortcoming of preview-latex in my book is that it stands out. > It should be one of possibly many thin applications on top of rendering > frameworks provided by Emacs proper. I agree wholeheartedly. Org has a similar mechanism as well, but it *sucks* compared to preview-latex. I tried to get preview-latex working with Org (which uses LaTeX-syntax for math), but it was non-trivial. Thus, I don't interact much with preview-latex these days, but I have very fond memories of using it. If you ever decide to allocate time to generalize preview-latex and have it shipped with Emacs-core I'd happily support that work! —Rasmus PS: AUCTeX and preview-latex was what got me into Emacs initially. At the time, and probably still, there was no LaTeX editor that was on par with AUCTeX. Compiling the newest version of AUCTeX pushed me towards GNU/Linux (the windows version was .01 behind and it was Really Hardᵀᴹ to compile it on Windows). -- Need more coffee. . . ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-30 10:35 ` Rasmus @ 2014-09-30 14:22 ` Eli Zaretskii 2014-09-30 16:20 ` David Kastrup 2014-09-30 14:32 ` Stefan Monnier 2014-10-02 16:12 ` Uwe Brauer 2 siblings, 1 reply; 72+ messages in thread From: Eli Zaretskii @ 2014-09-30 14:22 UTC (permalink / raw) To: Rasmus; +Cc: emacs-devel > From: Rasmus <rasmus@gmx.us> > Date: Tue, 30 Sep 2014 12:35:08 +0200 > > > preview-latex does an excellent job, but much of its code is not at all > > related to the AUCTeX user interface or the LaTeX input language, and > > yet it will work for nothing but the AUCTeX user interface and the LaTeX > > input language. > > > > And without generally available facilities to do that sort of work, it > > will remain a singularly useful application without peers. Because it's > > far too much work to do a peer from scratch for similar needs. It was > > supposed to make it easier for me to work on my PhD thesis, but I ended > > up doing it _instead_ of my PhD thesis. > > > > The main shortcoming of preview-latex in my book is that it stands out. > > It should be one of possibly many thin applications on top of rendering > > frameworks provided by Emacs proper. > > I agree wholeheartedly. Org has a similar mechanism as well, but it > *sucks* compared to preview-latex. I tried to get > preview-latex working with Org (which uses LaTeX-syntax for math), but > it was non-trivial. Thus, I don't interact much with preview-latex > these days, but I have very fond memories of using it. > > If you ever decide to allocate time to generalize preview-latex and > have it shipped with Emacs-core I'd happily support that work! Feel free to expand on the problem(s) you hint on, because just by reading this exchange, I don't really understand what kind of facilities are missing in the core. (I don't use preview-latex and probably never will.) ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-30 14:22 ` Eli Zaretskii @ 2014-09-30 16:20 ` David Kastrup 2014-09-30 16:35 ` Eli Zaretskii 0 siblings, 1 reply; 72+ messages in thread From: David Kastrup @ 2014-09-30 16:20 UTC (permalink / raw) To: emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: Rasmus <rasmus@gmx.us> >> Date: Tue, 30 Sep 2014 12:35:08 +0200 >> >> > preview-latex does an excellent job, but much of its code is not at all >> > related to the AUCTeX user interface or the LaTeX input language, and >> > yet it will work for nothing but the AUCTeX user interface and the LaTeX >> > input language. >> > >> > And without generally available facilities to do that sort of work, it >> > will remain a singularly useful application without peers. Because it's >> > far too much work to do a peer from scratch for similar needs. It was >> > supposed to make it easier for me to work on my PhD thesis, but I ended >> > up doing it _instead_ of my PhD thesis. >> > >> > The main shortcoming of preview-latex in my book is that it stands out. >> > It should be one of possibly many thin applications on top of rendering >> > frameworks provided by Emacs proper. >> >> I agree wholeheartedly. Org has a similar mechanism as well, but it >> *sucks* compared to preview-latex. I tried to get >> preview-latex working with Org (which uses LaTeX-syntax for math), but >> it was non-trivial. Thus, I don't interact much with preview-latex >> these days, but I have very fond memories of using it. >> >> If you ever decide to allocate time to generalize preview-latex and >> have it shipped with Emacs-core I'd happily support that work! > > Feel free to expand on the problem(s) you hint on, because just by > reading this exchange, I don't really understand what kind of > facilities are missing in the core. (I don't use preview-latex and > probably never will.) Since preview-latex is trivial to install via ELPA (as it's included in AUCTeX), I am somewhat at a loss figuring out what point people are trying to make by stating that they are not going to bother looking at it, preferring me to type descriptions for hours. There is a paper <URL:https://www.tug.org/TUGboat/Articles/tb23-1/kastrup.pdf> sketching the working mode of preview-latex some more. But the problem is that static images/screenshots are not really much help in seeing how the generation of images is actually integrated into the interactive workflow. The screenshots would not have looked significantly different after the first prototype was running, so they don't really do a lot more than give the information "Emacs can show images". I found some conference video Kaveh Bazargan made of me at <URL:https://www.youtube.com/watch?v=avn7-aVpP9I#t=500>, picture quality pretty flaky. While it does not really say a lot about the Emacs parts of the process, it is pretty obvious that the editing is not at all disturbed by the images and that any images are regenerated instantaneously without any obvious delays or interference with the editing workflow. That's probably 2007 or 2008, so the processor I use is probably something like 600MHz. Emacs does not offer any default support for rendering images on-demand from source code in a buffer. The pipelines that preview-latex can be something like buffer -> LaTeX -> Dvips -> Ghostscript -> PNG buffer -> PDFLaTeX -> pdftodsc -> Ghostscript -> PNG buffer -> LaTeX -> Dvipng -> PNG It will also run several of those in parallel, like running LaTeX->Dvipng but falling back to Dvips->Ghostscript for those images which contain embedded PostScript. The Ghostscript pipeline will render the on-screen images first to make for better interactive response. In addition to the images, LaTeX produces geometry, baseline, source location and various other info (for example, default font size in order to match the rendering scale to the current default screen font) which is parsed by Emacs and used for integrating the resulting images in the right place and scale into the originating buffer. So there are prioritized rendering pipelines and information side paths that deliver information crucial for tying the image stream back into the proper buffer locations, with proper scale and baseline. There is other stuff like being able to paste text plus images into MIME mails (with the obvious loss of baseline and other formatting information, but for display math it is somewhat helpful), and desktop mode saving and restoring the images associated with a buffer. All of that is not specific to LaTeX, and Emacs does not really provide anything helpful for creating that kind of facility apart from the display feature itself and the possibility to run processes and process filters. -- David Kastrup ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-30 16:20 ` David Kastrup @ 2014-09-30 16:35 ` Eli Zaretskii 0 siblings, 0 replies; 72+ messages in thread From: Eli Zaretskii @ 2014-09-30 16:35 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel > From: David Kastrup <dak@gnu.org> > Date: Tue, 30 Sep 2014 18:20:06 +0200 > > Since preview-latex is trivial to install via ELPA (as it's included in > AUCTeX), I am somewhat at a loss figuring out what point people are > trying to make by stating that they are not going to bother looking at > it, preferring me to type descriptions for hours. I doubt that just installing preview-latex would give me the insight into the issues you were alluding to, without some serious study of the code. > Emacs does not offer any default support for rendering images on-demand > from source code in a buffer. The pipelines that preview-latex can be > something like > > buffer -> LaTeX -> Dvips -> Ghostscript -> PNG > buffer -> PDFLaTeX -> pdftodsc -> Ghostscript -> PNG > buffer -> LaTeX -> Dvipng -> PNG > > It will also run several of those in parallel, like running > LaTeX->Dvipng but falling back to Dvips->Ghostscript for those images > which contain embedded PostScript. The Ghostscript pipeline will render > the on-screen images first to make for better interactive response. In > addition to the images, LaTeX produces geometry, baseline, source > location and various other info (for example, default font size in order > to match the rendering scale to the current default screen font) which > is parsed by Emacs and used for integrating the resulting images in the > right place and scale into the originating buffer. > > So there are prioritized rendering pipelines and information side paths > that deliver information crucial for tying the image stream back into > the proper buffer locations, with proper scale and baseline. > > There is other stuff like being able to paste text plus images into MIME > mails (with the obvious loss of baseline and other formatting > information, but for display math it is somewhat helpful), and desktop > mode saving and restoring the images associated with a buffer. > > All of that is not specific to LaTeX, and Emacs does not really provide > anything helpful for creating that kind of facility apart from the > display feature itself and the possibility to run processes and process > filters. Thanks, this is the information I was missing. ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-30 10:35 ` Rasmus 2014-09-30 14:22 ` Eli Zaretskii @ 2014-09-30 14:32 ` Stefan Monnier 2014-10-02 16:12 ` Uwe Brauer 2 siblings, 0 replies; 72+ messages in thread From: Stefan Monnier @ 2014-09-30 14:32 UTC (permalink / raw) To: Rasmus; +Cc: emacs-devel > If you ever decide to allocate time to generalize preview-latex and > have it shipped with Emacs-core I'd happily support that work! Indeed, that would be great. Stefan ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-30 10:35 ` Rasmus 2014-09-30 14:22 ` Eli Zaretskii 2014-09-30 14:32 ` Stefan Monnier @ 2014-10-02 16:12 ` Uwe Brauer 2 siblings, 0 replies; 72+ messages in thread From: Uwe Brauer @ 2014-10-02 16:12 UTC (permalink / raw) To: emacs-devel >> "rasmus" == rasmus <Rasmus> writes: > Hi, > This is a bit OT, sorry. > David Kastrup <dak@gnu.org> writes: >> [...] >> >> The main shortcoming of preview-latex in my book is that it stands out. >> It should be one of possibly many thin applications on top of rendering >> frameworks provided by Emacs proper. > I agree wholeheartedly. Org has a similar mechanism as well, but it > *sucks* compared to preview-latex. I tried to get do you mean? org-preview-latex-fragment and friends? Yes it is inferior to preview-latex, but can be used in non latex buffers, such as matlab, and even more useful, can be used in emails and together with org-mime-htmlize allows to have mathematical equations presented as png, sent over email, which comes in handy in email discussions about math. ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-30 1:21 ` Stephen J. Turnbull 2014-09-30 8:43 ` David Kastrup @ 2014-09-30 19:23 ` Richard Stallman 2014-10-01 3:38 ` Stephen J. Turnbull 1 sibling, 1 reply; 72+ messages in thread From: Richard Stallman @ 2014-09-30 19:23 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: dak, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > But you keep the overlay in existence -- how come? Why do you care? Because I want to understand what this program is doing. David knows his business, isn't that good enough? Good enough to tell me the program logic??? -- Dr Richard Stallman President, Free Software Foundation 51 Franklin St Boston MA 02110 USA www.fsf.org www.gnu.org Skype: No way! That's nonfree (freedom-denying) software. Use Ekiga or an ordinary phone call. ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-30 19:23 ` Richard Stallman @ 2014-10-01 3:38 ` Stephen J. Turnbull 2014-10-01 12:53 ` Richard Stallman 0 siblings, 1 reply; 72+ messages in thread From: Stephen J. Turnbull @ 2014-10-01 3:38 UTC (permalink / raw) To: rms; +Cc: dak, emacs-devel Richard Stallman writes: > > But you keep the overlay in existence -- how come? > > Why do you care? > > Because I want to understand what this program is doing. That much is obvious. But why do you care what the program is doing? Do you plan to become an AUCTeX developer? Do you want to check whether it's possible for AUCTeX to use text properties rather than overlays? Do you want to check if AUCTeX is highly wasteful of overlays, leading to the efficiency issue, so that an improvement to Emacs overlays is actually unnecessary for the use case? The first motivation seems implausible given your time constraints and non-use of AUCTeX, and I am quite sure the second two are wastes of your time -- you will discover that AUCTeX's overlay implementation is preferable to using text properties and that it doesn't leak overlays, it just uses a lot of them in a big document. For two reasons: I know a little bit about the implementation, and independently come to those conclusions, and as confirmation, I know that David knows his business. Do you have yet another motivation for studying AUCTeX's implementation? > David knows his business, isn't that good enough? > > Good enough to tell me the program logic??? Of course not. But knowing the program logic tells you what it is doing, not why it does it that way. If the program is a poor program, you could probably find ways to optimize it, and so avoid a large change to Emacs. But I see every reason to believe that is not the case. At this point, from my point of view, we've been going around in a tight spiral for several days. It would be nice if we could go straight to the point, that's all. Or, it would be nice if you could show that we are already on the straight path to the point. ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-10-01 3:38 ` Stephen J. Turnbull @ 2014-10-01 12:53 ` Richard Stallman 2014-10-01 13:11 ` David Kastrup 2014-10-02 1:26 ` Stephen J. Turnbull 0 siblings, 2 replies; 72+ messages in thread From: Richard Stallman @ 2014-10-01 12:53 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: dak, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] That much is obvious. But why do you care what the program is doing? You are awfully arrogant. -- Dr Richard Stallman President, Free Software Foundation 51 Franklin St Boston MA 02110 USA www.fsf.org www.gnu.org Skype: No way! That's nonfree (freedom-denying) software. Use Ekiga or an ordinary phone call. ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-10-01 12:53 ` Richard Stallman @ 2014-10-01 13:11 ` David Kastrup 2014-10-02 1:26 ` Stephen J. Turnbull 1 sibling, 0 replies; 72+ messages in thread From: David Kastrup @ 2014-10-01 13:11 UTC (permalink / raw) To: Richard Stallman; +Cc: Stephen J. Turnbull, emacs-devel Richard Stallman <rms@gnu.org> writes: > [[[ To any NSA and FBI agents reading my email: please consider ]]] > [[[ whether defending the US Constitution against all enemies, ]]] > [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > That much is obvious. But why do you care what the program is doing? > > You are awfully arrogant. Actually, the same question occured to me. Since I was not able to figure out the answer in the course of the ongoing exchange, I basically resorted to an extensive information dump in the hope that the required information would be somewhere among it. Since you have a large number of things to keep track of, it may be a bit of a stretch to assume that you'll find what you were looking for originally in that dump. But maybe at least somebody found something worth finding in it, whether or not he was looking for it. -- David Kastrup ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-10-01 12:53 ` Richard Stallman 2014-10-01 13:11 ` David Kastrup @ 2014-10-02 1:26 ` Stephen J. Turnbull 1 sibling, 0 replies; 72+ messages in thread From: Stephen J. Turnbull @ 2014-10-02 1:26 UTC (permalink / raw) To: rms; +Cc: dak, emacs-devel Richard Stallman writes: > That much is obvious. But why do you care what the program is doing? > > You are awfully arrogant. Even if accurate, that's an ad hominem, and doesn't help communicate. Furthermore, I explained *why* I want to know your reason in some detail -- is that arrogant? I think that true arrogance involves demanding to know your reason without giving a reason of my own. All I want to do is get to the point. If you just want to know what it's doing, that's a perfectly good reason -- but then I don't see how it helps Emacs development, and I excuse myself from the conversation, which probably should move to an AUCTeX channel. OTOH, if you have a reason that bears on Emacs development, it would be useful in framing the conversation to know what that reason is. ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-29 20:48 ` Richard Stallman 2014-09-30 1:21 ` Stephen J. Turnbull @ 2014-09-30 5:52 ` David Kastrup 2014-10-06 19:14 ` Richard Stallman 1 sibling, 1 reply; 72+ messages in thread From: David Kastrup @ 2014-09-30 5:52 UTC (permalink / raw) To: Richard Stallman; +Cc: emacs-devel Richard Stallman <rms@gnu.org> writes: > [[[ To any NSA and FBI agents reading my email: please consider ]]] > [[[ whether defending the US Constitution against all enemies, ]]] > [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > > Does this mean you turn off display of the image on the overlay when > > the text in that region is changed? > > Very much so, yes. Changing the text in the region would be a pain if > you could not see it, and changing it breaks the correspondence between > text and image anyway. > > I see. > > But you keep the overlay in existence -- how come? Convenience. preview-latex has a mechanism to run only a region through LaTeX, so it is possible to regenerate only images in a region. This comes with its own shortfalls compared with regenerating the whole document but is much faster and less disruptive as then only images in that region flicker as they appear. When editing a particular equation or other element, reusing its original boundaries as the region to run through LaTeX is a natural choice, so the retained overlay preserves those boundaries (and offsets them visually) for the sake of the commands regenerating a region. If you split an equation into two by editing it, "regenerating" the image will replace the original overlay with two new ones: in fact, even if the boundaries stay the same, the overlay will get replaced. It sticks around only until some new image is generated touching any of the original region. If you write, say, word_text instead of word\_text (throwing LaTeX into inadvertant math mode), the resulting nuisance overlay will need to get manually removed (with keyboard sequence or mouse click) after correcting the syntax error since no image will reappear in that area after the correction. That tends to be a smaller net distraction than having to hand-select your regions to regenerate in the common case. -- David Kastrup ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-30 5:52 ` David Kastrup @ 2014-10-06 19:14 ` Richard Stallman 2014-10-06 21:02 ` David Kastrup 0 siblings, 1 reply; 72+ messages in thread From: Richard Stallman @ 2014-10-06 19:14 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] Thanks for the explanation. Now I understand, more or less. If you write, say, word_text instead of word\_text (throwing LaTeX into inadvertant math mode), the resulting nuisance overlay will need to get manually removed (with keyboard sequence or mouse click) after correcting the syntax error since no image will reappear in that area after the correction. If you regenerate that region, and it makes no image, can't that delete the overlay automatically? preview-latex does an excellent job, but much of its code is not at all related to the AUCTeX user interface or the LaTeX input language, and yet it will work for nothing but the AUCTeX user interface and the LaTeX input language. What is the obstacle to supporting other interfaces? Would preview-latex be of any use with Texinfo? Maybe not, since generally there are few equations in Texinfo. -- Dr Richard Stallman President, Free Software Foundation 51 Franklin St Boston MA 02110 USA www.fsf.org www.gnu.org Skype: No way! That's nonfree (freedom-denying) software. Use Ekiga or an ordinary phone call. ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-10-06 19:14 ` Richard Stallman @ 2014-10-06 21:02 ` David Kastrup 0 siblings, 0 replies; 72+ messages in thread From: David Kastrup @ 2014-10-06 21:02 UTC (permalink / raw) To: Richard Stallman; +Cc: emacs-devel Richard Stallman <rms@gnu.org> writes: > [[[ To any NSA and FBI agents reading my email: please consider ]]] > [[[ whether defending the US Constitution against all enemies, ]]] > [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > Thanks for the explanation. Now I understand, more or less. > > If you write, say, word_text instead of word\_text (throwing LaTeX into > inadvertant math mode), the resulting nuisance overlay will need to get > manually removed (with keyboard sequence or mouse click) after > correcting the syntax error since no image will reappear in that area > after the correction. > > If you regenerate that region, and it makes no image, can't that delete > the overlay automatically? Conceivably. However, a region might just be \input{bozo} \input{clown} \input{summary} with the actual images appearing in respectively other buffers than the region. So clearing out the original region would be a heuristic rather than the full deal and should not be mandatory for satisfactory operation. > preview-latex does an excellent job, but much of its code is not > at all related to the AUCTeX user interface or the LaTeX input > language, and yet it will work for nothing but the AUCTeX user > interface and the LaTeX input language. > > What is the obstacle to supporting other interfaces? The job control that is used is that of AUCTeX. That involves management of various process buffers (without getting "A compilation is already in progress" problems), expansion of commands with variable elements based on the involved buffers and files. The command sequences which are executed are based on the various ways of generating graphical output from TeX input. These kind of chains are not really specified as some sort of abstract interface but hardcoded. There are sidepaths from LaTeX for geometry and source location information: Emacs parses the resulting error/log files in order to get all those out. There is no other source for this information. A lot of convenience functions from AUCTeX are used that are not actually related to TeX at all. The user interface parts are not done in any way independent from the AUCTeX user interface. > Would preview-latex be of any use with Texinfo? > Maybe not, since generally there are few equations in Texinfo. The LilyPond (music typesetter) documentation is generated via Texinfo. There are probably not all that many people using the Info version with graphical images but it beats the pants off the generally used online HTML version (all the web pages, not just the manuals, are also generated from Texinfo). But the _only_ usable reader for the Info version with graphics is Emacs. Yelp (GNOME help viewer) purports to support images in Info but does not scale to ten thousands of images. One never gets across the "Loading" message. Now the LilyPond documentation is probably the best use of images in Info anywhere. Still I would not want to bet my life savings on more than a dozen people actively using the Info variants of those manuals. But while few people actually use the Info output format, lots more use Texinfo with other backends. I could imagine the Texinfo/preview-latex combination to be popular with our documentation editors even though most of them probaly have never looked at Info output and consider HTML and PDF the "normal" output formats. -- David Kastrup ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-21 16:07 ` Stefan Monnier 2014-09-21 16:14 ` David Kastrup @ 2014-09-21 16:56 ` Eli Zaretskii 2014-09-21 18:42 ` Stefan Monnier 2014-09-21 21:48 ` Richard Stallman 2 siblings, 1 reply; 72+ messages in thread From: Eli Zaretskii @ 2014-09-21 16:56 UTC (permalink / raw) To: Stefan Monnier; +Cc: dak, rms, emacs-devel > From: Stefan Monnier <monnier@iro.umontreal.ca> > Date: Sun, 21 Sep 2014 12:07:48 -0400 > Cc: David Kastrup <dak@gnu.org>, emacs-devel@gnu.org > > We rarely suffer from overlay's performance issues because most codes > that could suffer from it simply avoid using overlays (or they try to > minimize the number of overlays used, e.g. by flushing those overlays > currently not visible), even when they would be the right tool. Or they > add calls to overlay-recenter when the performance problem is diagnosed. We should distinguish between 2 use cases here. One is when overlays need to be displayed. In that case, indeed only the overlays in the visible region count, which are normally not too many, and the display engine already does the equivalent of overlay-recenter for each display line. The other use case is when some Lisp program, or even some part of Emacs, needs to search for the next/previous change in character properties, or find the beginning or end of an overlay. In those cases, you cannot assume that only the visible portion of a buffer matters, so in large buffers with many overlays this could be rather painful. > I don't see any obvious reason why re-using the existing intervals.c > tree wouldn't work well. Overlays can overlap, while text properties cannot. Does that matter for what intervals.c implement? ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-21 16:56 ` Eli Zaretskii @ 2014-09-21 18:42 ` Stefan Monnier 2014-09-21 18:58 ` Eli Zaretskii 0 siblings, 1 reply; 72+ messages in thread From: Stefan Monnier @ 2014-09-21 18:42 UTC (permalink / raw) To: Eli Zaretskii; +Cc: dak, rms, emacs-devel > Overlays can overlap, while text properties cannot. Does that matter > for what intervals.c implement? Of course it does, but the algorithm I suggest we use handles it just fine. Stefan ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-21 18:42 ` Stefan Monnier @ 2014-09-21 18:58 ` Eli Zaretskii 2014-09-21 20:12 ` Stefan Monnier 0 siblings, 1 reply; 72+ messages in thread From: Eli Zaretskii @ 2014-09-21 18:58 UTC (permalink / raw) To: Stefan Monnier; +Cc: dak, rms, emacs-devel > From: Stefan Monnier <monnier@iro.umontreal.ca> > Date: Sun, 21 Sep 2014 14:42:57 -0400 > Cc: dak@gnu.org, rms@gnu.org, emacs-devel@gnu.org > > > Overlays can overlap, while text properties cannot. Does that matter > > for what intervals.c implement? > > Of course it does, but the algorithm I suggest we use handles it > just fine. What algorithm? ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-21 18:58 ` Eli Zaretskii @ 2014-09-21 20:12 ` Stefan Monnier 0 siblings, 0 replies; 72+ messages in thread From: Stefan Monnier @ 2014-09-21 20:12 UTC (permalink / raw) To: Eli Zaretskii; +Cc: dak, rms, emacs-devel >> Of course it does, but the algorithm I suggest we use handles it >> just fine. > What algorithm? The one described in the "Augmented tree" section of the "Interval tree" wikipedia page. Stefan ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-21 16:07 ` Stefan Monnier 2014-09-21 16:14 ` David Kastrup 2014-09-21 16:56 ` Eli Zaretskii @ 2014-09-21 21:48 ` Richard Stallman 2014-09-22 0:31 ` Stefan Monnier 2 siblings, 1 reply; 72+ messages in thread From: Richard Stallman @ 2014-09-21 21:48 UTC (permalink / raw) To: Stefan Monnier; +Cc: dak, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] Reasons for overlays being slow are: - if you have N overlays, you have 2N markers, so every insertion/deletion of text will involve a traversal of those 2N elements. - when we need to move from bytepos to charpos (or vice versa), we go through the 2N markers again. - when the Elisp code didn't know to overlay-recenter, or when overlays are added/removed from the "wrong" place (w.r.t to the overlay "center"), every overlay creation/removal takes O(N) time. Yes, that's true. Moving overlays into a balanced binary tree will remove those algorithmic performance problems. I'm not sue why you'd be opposed, since it will provide the exact same functionality anyway. It's just an internal detail of implementation. Why not do it for all markers, then? A large number of markers will cause the same slowdown even if they are not made for overlays. Note that I suggest to use the balanced binary tree used by text-properties, but there is no need for the two trees to be linked. I don't think the data structure of intervals lends itself directly to markers. A marker, whether it's on its own or an end of an overlay, has an identity that is permanent, whereas intervals are split and recombined whenever useful. -- Dr Richard Stallman President, Free Software Foundation 51 Franklin St Boston MA 02110 USA www.fsf.org www.gnu.org Skype: No way! That's nonfree (freedom-denying) software. Use Ekiga or an ordinary phone call. ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-21 21:48 ` Richard Stallman @ 2014-09-22 0:31 ` Stefan Monnier 2014-09-22 23:11 ` Richard Stallman 0 siblings, 1 reply; 72+ messages in thread From: Stefan Monnier @ 2014-09-22 0:31 UTC (permalink / raw) To: Richard Stallman; +Cc: dak, emacs-devel > Reasons for overlays being slow are: > - if you have N overlays, you have 2N markers, so every > insertion/deletion of text will involve a traversal of those 2N elements. > - when we need to move from bytepos to charpos (or vice versa), we go > through the 2N markers again. > - when the Elisp code didn't know to overlay-recenter, or when overlays > are added/removed from the "wrong" place (w.r.t to the overlay > "center"), every overlay creation/removal takes O(N) time. > Yes, that's true. > Moving overlays into a balanced binary tree will remove those > algorithmic performance problems. I'm not sue why you'd be opposed, > since it will provide the exact same functionality anyway. It's just an > internal detail of implementation. > Why not do it for all markers, then? We could try, but it wouldn't address the last problem above. > A large number of markers will cause the same slowdown even if they > are not made for overlays. Indeed. But I've never seen this problem, except via the creation of many overlays. Note that if/when overlays are made efficient and markers's efficiency becomes a problem, we can use the new overlay implementation for markers (in the worst case, representing markers as degenerate overlays). > A marker, whether it's on its own or an end of an overlay, has an > identity that is permanent, whereas intervals are split and recombined > whenever useful. I must say I'm troubled by your questions and objections. What is the problem with my proposal? Are you afraid it will make Emacs worse? Or are you afraid it simply won't work? Why do you care if we keep overlays in a tree instead of singly-linked lists? Stefan ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-22 0:31 ` Stefan Monnier @ 2014-09-22 23:11 ` Richard Stallman 0 siblings, 0 replies; 72+ messages in thread From: Richard Stallman @ 2014-09-22 23:11 UTC (permalink / raw) To: Stefan Monnier; +Cc: dak, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] I must say I'm troubled by your questions and objections. What is the problem with my proposal? Are you afraid it will make Emacs worse? Or are you afraid it simply won't work? Why do you care if we keep overlays in a tree instead of singly-linked lists? We may be miscommunicating. I think you're interpreting each thing I say as an argument against reimplementing overlays, but I'm mostly talking about other issues that came up in the discussion. I must not have noticed the message that made a specific proposal for how to reimplement overlays. I'm concerned about code complexity and bugs that might introduce, but I'm not necessarily against it. Mostly I'm not talking about that. -- Dr Richard Stallman President, Free Software Foundation 51 Franklin St Boston MA 02110 USA www.fsf.org www.gnu.org Skype: No way! That's nonfree (freedom-denying) software. Use Ekiga or an ordinary phone call. ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-20 13:19 ` Richard Stallman 2014-09-20 13:37 ` David Kastrup @ 2014-09-20 15:56 ` Eli Zaretskii 2014-09-20 19:49 ` Stefan Monnier 2 siblings, 0 replies; 72+ messages in thread From: Eli Zaretskii @ 2014-09-20 15:56 UTC (permalink / raw) To: rms; +Cc: vekazanov, monnier, emacs-devel > Date: Sat, 20 Sep 2014 09:19:02 -0400 > From: Richard Stallman <rms@gnu.org> > Cc: vekazanov@gmail.com, emacs-devel@gnu.org > > The existing data structure is fine for overlays used as they are intended > to be used. The existing data structure doesn't scale well. What Stefan suggest will scale better, while preserving all the features and properties of overlays as we have them today. ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-20 13:19 ` Richard Stallman 2014-09-20 13:37 ` David Kastrup 2014-09-20 15:56 ` Eli Zaretskii @ 2014-09-20 19:49 ` Stefan Monnier 2014-09-21 13:36 ` Richard Stallman 2 siblings, 1 reply; 72+ messages in thread From: Stefan Monnier @ 2014-09-20 19:49 UTC (permalink / raw) To: Richard Stallman; +Cc: vekazanov, emacs-devel > [[[ To any NSA and FBI agents reading my email: please consider ]]] > [[[ whether defending the US Constitution against all enemies, ]]] > [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > I don't see that it makes sense to store overlays like text properties. > The existing data structure is fine for overlays used as they are intended > to be used. The data structure used is too naive, with an algorithmic complexity that can be problematic. Moving it to a balanced binary tree with turn O(N^2) complexity into O(N log N). Stefan ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-20 19:49 ` Stefan Monnier @ 2014-09-21 13:36 ` Richard Stallman 0 siblings, 0 replies; 72+ messages in thread From: Richard Stallman @ 2014-09-21 13:36 UTC (permalink / raw) To: Stefan Monnier; +Cc: vekazanov, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] Moving it to a balanced binary tree with turn O(N^2) complexity into O(N log N). I am not sure that is true, here. What value does N stand for? -- Dr Richard Stallman President, Free Software Foundation 51 Franklin St Boston MA 02110 USA www.fsf.org www.gnu.org Skype: No way! That's nonfree (freedom-denying) software. Use Ekiga or an ordinary phone call. ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-19 14:59 Overlay mechanic improvements Vladimir Kazanov 2014-09-19 17:22 ` Stefan Monnier @ 2014-09-19 18:03 ` Richard Stallman 2014-09-20 8:08 ` Vladimir Kazanov 1 sibling, 1 reply; 72+ messages in thread From: Richard Stallman @ 2014-09-19 18:03 UTC (permalink / raw) To: Vladimir Kazanov; +Cc: emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] It turns out that the only viable alternative to overlays is using text properties, and text properties were not meant for the purpose. I designed text properties precisely for uses like this. All I need is an ability to save position pairs, the positions should survive text insertion/deletion I see multiple meanings for that; could you clarify? and there should be a way to find those pairs given a buffer point. Text properties are designed to be preserved through copying of text. Overlays are not. So it seems to me that you must use text properties. For each token, you put a text property 'token' onto the characters in the token. The value of the property would say what token they are. The property would be eq for all the characters in one token. Then you can use 'next-single-char-property-change' and 'previous-single-char-property-change' to find the end and the beginning of the token. If you run into any difficulties using the existing interfaces for text properties, we should improve the interfaces to make your program easier to write. -- Dr Richard Stallman President, Free Software Foundation 51 Franklin St Boston MA 02110 USA www.fsf.org www.gnu.org Skype: No way! That's nonfree (freedom-denying) software. Use Ekiga or an ordinary phone call. ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-19 18:03 ` Richard Stallman @ 2014-09-20 8:08 ` Vladimir Kazanov 2014-09-20 13:21 ` Richard Stallman 2014-09-20 13:21 ` Tokenizing Richard Stallman 0 siblings, 2 replies; 72+ messages in thread From: Vladimir Kazanov @ 2014-09-20 8:08 UTC (permalink / raw) To: rms; +Cc: emacs-devel The problem here is that tokens are not just lexemes(token text). To be general and granular enough one also has to remember dependencies between tokens. So, a token is: ["lexeme", lookback, lookahead, position]. Here - "lexeme" is the string itself; - "lookahead" is a number of chars *after* the lexeme used to define the lexeme (in the simplest case - just a whitespace between symbol names, i.e. a single char); - "lookback" a number of previous tokens whose lookaheads span over current token's zone of responsibility (this one can be recalculated on the fly) - "position" is just a token beginning/end Thus, we can define "a token" as a sum of a lexeme and context required to extract the lexeme. Whenever something happens within a particular lexem+lookahead interval -> the token has to be reparsed and n=lookback previous tokens also have to be fixed. More on this can be found in this paper: http://harmonia.cs.berkeley.edu/papers/twagner-lexing.pdf. Quite a lengthy read, I must warn. What I like about it is that it *proves* token stream consistency after fixing. My system will hopefully be simplified compared to the original idea while keeping the proved part. > All I need > is an ability to save position pairs, the positions should survive text > insertion/deletion > > I see multiple meanings for that; could you clarify? I mean something like relative positioning of markers/overlays: changes in unrelated buffer parts should change positions accordingly. > and there should be a way to find those pairs given a > buffer point. > > Text properties are designed to be preserved through copying of text. > Overlays are not. So it seems to me that you must use text properties. Yes, properties do survive copying, but token context might change and the whole thing will have to be reparsed anyway. We can take a piece of comments and drop it into the middle of an expression. I don't care whether the repositioned text was a part of a token or not, I just need to know that something changed within a particular interval (lexem+lookahead) > For each token, you put a text property 'token' onto the characters in > the token. The value of the property would say what token they are. > The property would be eq for all the characters in one token. > > Then you can use 'next-single-char-property-change' and > 'previous-single-char-property-change' to find the end and the > beginning of the token. I chose overlays mostly because they allow to control *intervals of text*, and the intervals can overlap. For example, I can add a modification-hook to an overlay - and it won't be called for every character, just for an interval(-s) overlapping. Citing the "Special properties" documentation page for text properties: "you can't predict how many times the function will be called". One can imagine a workaround for this, but it would be just too cumbersome. > If you run into any difficulties using the existing interfaces > for text properties, we should improve the interfaces to make your > program easier to write. > Should I just do that or try both (improving overlays on the way)? After all overlays do have to be fixed anyway. -- Yours sincerely, Vladimir Kazanov -- С уважением, Владимир Казанов ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-20 8:08 ` Vladimir Kazanov @ 2014-09-20 13:21 ` Richard Stallman 2014-09-20 16:28 ` Stephen Leake 2014-09-20 13:21 ` Tokenizing Richard Stallman 1 sibling, 1 reply; 72+ messages in thread From: Richard Stallman @ 2014-09-20 13:21 UTC (permalink / raw) To: Vladimir Kazanov; +Cc: emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] The problem here is that tokens are not just lexemes(token text). To be general and granular enough one also has to remember dependencies between tokens. So, a token is: ["lexeme", lookback, lookahead, position]. Ok, but why is it a "problem"? It seems easy enough to construct a list (LEXEME LOOKBACK LOOKAHEAD) and put it as the value of the `token' property for all the characters in the token. There is no need to include POSITION in this list, since you can find the start of the token by calling `previous-char-property-change'. > All I need > is an ability to save position pairs, the positions should survive text > insertion/deletion > > I see multiple meanings for that; could you clarify? I mean something like relative positioning of markers/overlays: changes in unrelated buffer parts should change positions accordingly. Text properties do that. Yes, properties do survive copying, but token context might change and the whole thing will have to be reparsed anyway. If it needs to be reparsed, that's no problem; it is easy to delete the `token' property from a region. In fact, we can specify automatic deletion of a certain text property from text copied into the buffer. But there are many cases where it does not need to be reparsed. If you can detect such cases, and reparse only near the boundaries, it would be a nice optimization for cases like killing a whole function and yanking it between two other functions. You could use `yank-handler' to hook this in. It would check whether the existing parsing of the text being yanked fits the new context; if so, it would preserve the `token' property; if not, it would erase the `token' property of the yanked text and cause it all to be reparsed. Another idea: even if the beginning of the yanked text has to be interpreted differently in the new context, maybe tokens later on in the yanked text will still be right in the new context. If you can detect when this happens, you might be able to skip reparsing of some of the yanked text. That could be a substantial speedup. Citing the "Special properties" documentation page for text properties: "you can't predict how many times the function will be called". One can imagine a workaround for this, but it would be just too cumbersome. If you put a timestamp into the token data structure, it will be easy to detect duplicate calls, and make them do nothing. I think it would be less work to change text properties to give you whatever interface features are convenient, than to redesign how overlays are stored. After all overlays do have to be fixed anyway. Why so? Aside from not being suitable for this use, what is wrong with overlays that needs to be fixed? The feature you want to implement is very useful, and we should provide the low-level support needed to make it clean and fast. But if that is easy to do with text properties, let's do it with text properties. -- Dr Richard Stallman President, Free Software Foundation 51 Franklin St Boston MA 02110 USA www.fsf.org www.gnu.org Skype: No way! That's nonfree (freedom-denying) software. Use Ekiga or an ordinary phone call. ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Overlay mechanic improvements 2014-09-20 13:21 ` Richard Stallman @ 2014-09-20 16:28 ` Stephen Leake 0 siblings, 0 replies; 72+ messages in thread From: Stephen Leake @ 2014-09-20 16:28 UTC (permalink / raw) To: emacs-devel Richard Stallman <rms@gnu.org> writes: > [[[ To any NSA and FBI agents reading my email: please consider ]]] > [[[ whether defending the US Constitution against all enemies, ]]] > [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > The problem here is that tokens are not just lexemes(token text). To > be general and granular enough one also has to remember dependencies > between tokens. So, a token is: ["lexeme", lookback, lookahead, > position]. > > Ok, but why is it a "problem"? It seems easy enough to construct a > list (LEXEME LOOKBACK LOOKAHEAD) and put it as the value of the > `token' property for all the characters in the token. > > There is no need to include POSITION in this list, since you can > find the start of the token by calling > `previous-char-property-change'. No need to store LEXME, either; just call buffer-substring-no-properties when you need the text. -- -- Stephe ^ permalink raw reply [flat|nested] 72+ messages in thread
* Tokenizing 2014-09-20 8:08 ` Vladimir Kazanov 2014-09-20 13:21 ` Richard Stallman @ 2014-09-20 13:21 ` Richard Stallman 2014-09-20 16:24 ` Tokenizing Stephen Leake 2014-09-20 16:36 ` Tokenizing Vladimir Kazanov 1 sibling, 2 replies; 72+ messages in thread From: Richard Stallman @ 2014-09-20 13:21 UTC (permalink / raw) To: Vladimir Kazanov; +Cc: emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] Do you plan to reparse continuously to keep the token structure up to date all the time? Given that reparsing is likely to take time, it may not be the best approach. It could sometimes make Emacs become nonresponsive while it is reparsing a substantial amount of text. Ideally it should be like Font Lock -- reparsing parts of the buffer when time is available. -- Dr Richard Stallman President, Free Software Foundation 51 Franklin St Boston MA 02110 USA www.fsf.org www.gnu.org Skype: No way! That's nonfree (freedom-denying) software. Use Ekiga or an ordinary phone call. ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Tokenizing 2014-09-20 13:21 ` Tokenizing Richard Stallman @ 2014-09-20 16:24 ` Stephen Leake 2014-09-20 16:40 ` Tokenizing Vladimir Kazanov 2014-09-20 16:36 ` Tokenizing Vladimir Kazanov 1 sibling, 1 reply; 72+ messages in thread From: Stephen Leake @ 2014-09-20 16:24 UTC (permalink / raw) To: emacs-devel Richard Stallman <rms@gnu.org> writes: > [[[ To any NSA and FBI agents reading my email: please consider ]]] > [[[ whether defending the US Constitution against all enemies, ]]] > [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > Do you plan to reparse continuously to keep the token structure up to > date all the time? That's the approach I'm using in Ada mode. Tokenizing the whole buffer after any change is easily fast enough (on modern hardware), even on a 7000 line buffer. Semantic parsing gets a lot slower. But for the Ada language, incremental semantic parsing really isn't an option. -- -- Stephe ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Tokenizing 2014-09-20 16:24 ` Tokenizing Stephen Leake @ 2014-09-20 16:40 ` Vladimir Kazanov 2014-09-20 20:16 ` Tokenizing Eric Ludlam 0 siblings, 1 reply; 72+ messages in thread From: Vladimir Kazanov @ 2014-09-20 16:40 UTC (permalink / raw) To: Stephen Leake; +Cc: emacs-devel > Tokenizing the whole buffer after any change is easily fast enough (on > modern hardware), even on a 7000 line buffer. Semantic parsing gets a > lot slower. > This is what I do right now in my prototype of a smarter Python mode. The tokenizing process itself is usually fast enough. But parsing is more complicated, and may take some time to rebuild the parse tree. Incremental approach is a natural step here. -- Yours sincerely, Vladimir Kazanov -- С уважением, Владимир Казанов ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Tokenizing 2014-09-20 16:40 ` Tokenizing Vladimir Kazanov @ 2014-09-20 20:16 ` Eric Ludlam 2014-09-20 20:35 ` Tokenizing Vladimir Kazanov 2014-09-21 15:13 ` parsing (was tokenizing) Stephen Leake 0 siblings, 2 replies; 72+ messages in thread From: Eric Ludlam @ 2014-09-20 20:16 UTC (permalink / raw) To: Vladimir Kazanov; +Cc: Stephen Leake, emacs-devel On 09/20/2014 12:40 PM, Vladimir Kazanov wrote: >> Tokenizing the whole buffer after any change is easily fast enough (on >> modern hardware), even on a 7000 line buffer. Semantic parsing gets a >> lot slower. >> > This is what I do right now in my prototype of a smarter Python mode. > The tokenizing process itself is usually fast enough. But parsing is > more complicated, and may take some time to rebuild the parse tree. > Incremental approach is a natural step here. > I caught only the tail of this thread, so I apologize if I refer to the incorrect thing. A year ago or so we were talking about ada-mode, a modified parser, and how it might integrate with CEDET/Semantic on the CEDET mailing list. Is it still 'wisi', a different flavor of 'wisent' ? If calls into your parser are being handled by parts of CEDET/Semantic for creating tags, then there is an incremental parser that you can enable that works with two of the other parser types included. The act of tagging the buffer enables it to track edits, and only reparse the bits that change. It handles incomplete code, and waits for it to be completed. There are many other possible states as well. I certainly recommend trying to take advantage of that if the parsing you refer to is somehow related to Semantic. I'll be happy to help you figure out how to make your parser work in that framework. If this thread isn't about that, then carry on... :) Eric ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Tokenizing 2014-09-20 20:16 ` Tokenizing Eric Ludlam @ 2014-09-20 20:35 ` Vladimir Kazanov 2014-09-21 15:13 ` parsing (was tokenizing) Stephen Leake 1 sibling, 0 replies; 72+ messages in thread From: Vladimir Kazanov @ 2014-09-20 20:35 UTC (permalink / raw) To: Eric Ludlam; +Cc: Stephen Leake, emacs-devel > I certainly recommend trying to take advantage of that if the parsing you > refer to is somehow related to Semantic. I'll be happy to help you figure > out how to make your parser work in that framework. > > If this thread isn't about that, then carry on... :) > > Eric Actually, this is not about parsing itself, but a lower level layer - tokenizing. :-) Couple of letters ago I formulated what I intend to build, you can easily find it. -- Yours sincerely, Vladimir Kazanov -- С уважением, Владимир Казанов ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: parsing (was tokenizing) 2014-09-20 20:16 ` Tokenizing Eric Ludlam 2014-09-20 20:35 ` Tokenizing Vladimir Kazanov @ 2014-09-21 15:13 ` Stephen Leake 1 sibling, 0 replies; 72+ messages in thread From: Stephen Leake @ 2014-09-21 15:13 UTC (permalink / raw) To: emacs-devel Eric Ludlam <ericludlam@gmail.com> writes: > On 09/20/2014 12:40 PM, Vladimir Kazanov wrote: >>> Tokenizing the whole buffer after any change is easily fast enough (on >>> modern hardware), even on a 7000 line buffer. Semantic parsing gets a >>> lot slower. >>> >> This is what I do right now in my prototype of a smarter Python mode. >> The tokenizing process itself is usually fast enough. But parsing is >> more complicated, and may take some time to rebuild the parse tree. >> Incremental approach is a natural step here. >> > > I caught only the tail of this thread, so I apologize if I refer to > the incorrect thing. > > A year ago or so we were talking about ada-mode, a modified parser, > and how it might integrate with CEDET/Semantic on the CEDET mailing > list. Is it still 'wisi', a different flavor of 'wisent' ? Yes. > If calls into your parser are being handled by parts of CEDET/Semantic > for creating tags, then there is an incremental parser that you can > enable that works with two of the other parser types included. No, I'm not using the CEDET front end UI. Partly because that incremental parser does not handle the Ada language (or I did not want to put in the effort to modify my grammar to make it work). The wisi parser is generalized LALR, and it does not lend itself to incremental parsing. I also didn't need the full flexibility of the Semantic lexer; the wisi lexer relies solely on the elisp syntax properties, and is faster. On the other hand, the wisi parser is unacceptably slow for 7000 line Ada files (and I have customers complaining about files 10 times that size (or they are using really slow machines)), so I'm looking into ways to use compiled Ada code instead of elisp to run the parser. > I'll be happy to help > you figure out how to make your parser work in that framework. Once I get the speed acceptable, I intend to look again at the CEDET UI, and try to use as much as possible with Ada mode. In particular, smart completion would be a nice feature. -- -- Stephe ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Tokenizing 2014-09-20 13:21 ` Tokenizing Richard Stallman 2014-09-20 16:24 ` Tokenizing Stephen Leake @ 2014-09-20 16:36 ` Vladimir Kazanov 2014-09-20 19:55 ` Tokenizing Stefan Monnier ` (2 more replies) 1 sibling, 3 replies; 72+ messages in thread From: Vladimir Kazanov @ 2014-09-20 16:36 UTC (permalink / raw) To: rms; +Cc: emacs-devel Okay, I'll give text properties a try. Right now my vision for this mode is the following: - avoid retokenizing undamaged buffer parts at all costs (as a main feature meant for incremental parsing); - collect damages and do reparsing only when user stops editing, similar to the font-lock-mode (js2-mode, nxml-mode...); - the incremental logic should have two interfaces, the first one meant for language-specific tokenizing code and a second one - for the user code, be it code beautifiers or advanced incremental parsers; - it should be possible to completely replace the font-lock-mode with this mode, given a concrete language tokenizer; You said two things basically: 1) I must use text properties, 2) it is possible to improve text properties interfaces to help the tokenizer. I suggest the following plan: 1) try to implement the tokenizer using available text property mechanics; 2) see if there are slow-downs or problems, or space for improvements on the Emacs side. Does that sound reasonable? On Sat, Sep 20, 2014 at 4:21 PM, Richard Stallman <rms@gnu.org> wrote: > [[[ To any NSA and FBI agents reading my email: please consider ]]] > [[[ whether defending the US Constitution against all enemies, ]]] > [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > Do you plan to reparse continuously to keep the token structure up to > date all the time? > > Given that reparsing is likely to take time, it may not be the best approach. > It could sometimes make Emacs become nonresponsive while it is reparsing > a substantial amount of text. > > Ideally it should be like Font Lock -- reparsing parts of the buffer > when time is available. > > -- > Dr Richard Stallman > President, Free Software Foundation > 51 Franklin St > Boston MA 02110 > USA > www.fsf.org www.gnu.org > Skype: No way! That's nonfree (freedom-denying) software. > Use Ekiga or an ordinary phone call. > -- Yours sincerely, Vladimir Kazanov -- С уважением, Владимир Казанов ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Tokenizing 2014-09-20 16:36 ` Tokenizing Vladimir Kazanov @ 2014-09-20 19:55 ` Stefan Monnier 2014-09-21 15:35 ` Tokenizing Stephen Leake 2014-09-21 13:35 ` Tokenizing Richard Stallman 2014-09-21 15:32 ` Tokenizing Stephen Leake 2 siblings, 1 reply; 72+ messages in thread From: Stefan Monnier @ 2014-09-20 19:55 UTC (permalink / raw) To: Vladimir Kazanov; +Cc: rms, emacs-devel > - it should be possible to completely replace the font-lock-mode with > this mode, given a concrete language tokenizer; The intention for font-lock highlighting is to show syntactic structure. In many cases, we end up limited to lexical highlighting, but that's a problem. Stefan ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Tokenizing 2014-09-20 19:55 ` Tokenizing Stefan Monnier @ 2014-09-21 15:35 ` Stephen Leake 2014-09-21 16:43 ` Tokenizing Stefan Monnier 0 siblings, 1 reply; 72+ messages in thread From: Stephen Leake @ 2014-09-21 15:35 UTC (permalink / raw) To: emacs-devel Stefan Monnier <monnier@iro.umontreal.ca> writes: >> - it should be possible to completely replace the font-lock-mode with >> this mode, given a concrete language tokenizer; > > The intention for font-lock highlighting is to show > syntactic structure. In many cases, we end up limited to lexical > highlighting, but that's a problem. syntax-propertize provides hooks to let modes add to that; Ada mode uses it for cases where lexical highlighting is ambiguous. Which is a problem when parsing is too slow :(. -- -- Stephe ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Tokenizing 2014-09-21 15:35 ` Tokenizing Stephen Leake @ 2014-09-21 16:43 ` Stefan Monnier 2014-09-22 14:05 ` Tokenizing Stephen Leake 0 siblings, 1 reply; 72+ messages in thread From: Stefan Monnier @ 2014-09-21 16:43 UTC (permalink / raw) To: Stephen Leake; +Cc: emacs-devel >>> - it should be possible to completely replace the font-lock-mode with >>> this mode, given a concrete language tokenizer; >> The intention for font-lock highlighting is to show >> syntactic structure. In many cases, we end up limited to lexical >> highlighting, but that's a problem. > syntax-propertize provides hooks to let modes add to that; Maybe you use it for that, but it's definitely not its intention: `syntax-propertize' is not meant for highlighting at all. Stefan ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Tokenizing 2014-09-21 16:43 ` Tokenizing Stefan Monnier @ 2014-09-22 14:05 ` Stephen Leake 0 siblings, 0 replies; 72+ messages in thread From: Stephen Leake @ 2014-09-22 14:05 UTC (permalink / raw) To: emacs-devel Stefan Monnier <monnier@iro.umontreal.ca> writes: >>>> - it should be possible to completely replace the font-lock-mode with >>>> this mode, given a concrete language tokenizer; >>> The intention for font-lock highlighting is to show >>> syntactic structure. In many cases, we end up limited to lexical >>> highlighting, but that's a problem. >> syntax-propertize provides hooks to let modes add to that; > > Maybe you use it for that, but it's definitely not its intention: > `syntax-propertize' is not meant for highlighting at all. Sorry, I meant "font-lock-add-keywords"; you can add arbitrary functions that use the parser results to choose a face. The parser does rely on syntax-propertize. -- -- Stephe ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Tokenizing 2014-09-20 16:36 ` Tokenizing Vladimir Kazanov 2014-09-20 19:55 ` Tokenizing Stefan Monnier @ 2014-09-21 13:35 ` Richard Stallman 2014-09-21 14:24 ` Tokenizing Vladimir Kazanov 2014-09-21 15:32 ` Tokenizing Stephen Leake 2 siblings, 1 reply; 72+ messages in thread From: Richard Stallman @ 2014-09-21 13:35 UTC (permalink / raw) To: Vladimir Kazanov; +Cc: emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] I suggest the following plan: 1) try to implement the tokenizer using available text property mechanics; 2) see if there are slow-downs or problems, or space for improvements on the Emacs side. Yes. At various points you might have ideas for more convenient or faster predefined functions for accessing the text properties. It will be very nice to add them for Emacs and thus make this sort of tokenizing more convenient. -- Dr Richard Stallman President, Free Software Foundation 51 Franklin St Boston MA 02110 USA www.fsf.org www.gnu.org Skype: No way! That's nonfree (freedom-denying) software. Use Ekiga or an ordinary phone call. ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Tokenizing 2014-09-21 13:35 ` Tokenizing Richard Stallman @ 2014-09-21 14:24 ` Vladimir Kazanov 0 siblings, 0 replies; 72+ messages in thread From: Vladimir Kazanov @ 2014-09-21 14:24 UTC (permalink / raw) To: rms; +Cc: emacs-devel Great, okay, I'll start working on it. This might take some time. On Sun, Sep 21, 2014 at 4:35 PM, Richard Stallman <rms@gnu.org> wrote: > [[[ To any NSA and FBI agents reading my email: please consider ]]] > [[[ whether defending the US Constitution against all enemies, ]]] > [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > > I suggest the following plan: > > 1) try to implement the tokenizer using available text property mechanics; > > 2) see if there are slow-downs or problems, or space for improvements > on the Emacs side. > > Yes. > > At various points you might have ideas for more convenient or faster > predefined functions for accessing the text properties. It will be > very nice to add them for Emacs and thus make this sort of tokenizing > more convenient. > > -- > Dr Richard Stallman > President, Free Software Foundation > 51 Franklin St > Boston MA 02110 > USA > www.fsf.org www.gnu.org > Skype: No way! That's nonfree (freedom-denying) software. > Use Ekiga or an ordinary phone call. > -- Yours sincerely, Vladimir Kazanov -- С уважением, Владимир Казанов ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Tokenizing 2014-09-20 16:36 ` Tokenizing Vladimir Kazanov 2014-09-20 19:55 ` Tokenizing Stefan Monnier 2014-09-21 13:35 ` Tokenizing Richard Stallman @ 2014-09-21 15:32 ` Stephen Leake 2014-09-21 16:42 ` Tokenizing Stefan Monnier 2014-09-21 18:55 ` Tokenizing Vladimir Kazanov 2 siblings, 2 replies; 72+ messages in thread From: Stephen Leake @ 2014-09-21 15:32 UTC (permalink / raw) To: emacs-devel Vladimir Kazanov <vekazanov@gmail.com> writes: > Okay, I'll give text properties a try. > > Right now my vision for this mode is the following: > > - avoid retokenizing undamaged buffer parts at all costs (as a main > feature meant for incremental parsing); You might look at what I did in Ada mode (current source in ELPA); see wisi.el wisi-before/after-change. > - collect damages and do reparsing only when user stops editing, > similar to the font-lock-mode (js2-mode, nxml-mode...); Ada mode only reparses when the user requests an action that requires a parse. How else do you tell when a user "stops editing"? font-lock runs after 'idle-time', which appears to be about 2 seconds (I could not figure out from the structure of 'timer-idle-list' what the actual idle time is). I guess that's the approximation of when the user stops editing. I don't normally edit 7000 line files, so the Ada mode parsing delay is not noticeable to me, so I prefer the current Ada mode approach of not using the idle timer to trigger a parse. But it could be a user option. > - the incremental logic should have two interfaces, the first one > meant for language-specific tokenizing code and a second one - for the > user code, be it code beautifiers or advanced incremental parsers; > > - it should be possible to completely replace the font-lock-mode with > this mode, given a concrete language tokenizer; > > You said two things basically: 1) I must use text properties, 2) it is > possible to improve text properties interfaces to help the tokenizer. > I suggest the following plan: > > 1) try to implement the tokenizer using available text property > mechanics; Ada mode uses text properties to store parse results; the tokenizer results are part of that, but are not stored separately. I don't see much point in separating the tokenizer from the parser; the tokenizer results are not useful by themselves (at least, not in Ada mode). > 2) see if there are slow-downs or problems, or space for improvements > on the Emacs side. I have not noticed any problems with the text properties interface; in particular, storing and retrieving text properties is fast compared to parsing. Ada mode stores about two parse result text properties per source line on average. -- -- Stephe ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Tokenizing 2014-09-21 15:32 ` Tokenizing Stephen Leake @ 2014-09-21 16:42 ` Stefan Monnier 2014-09-21 18:55 ` Tokenizing Vladimir Kazanov 1 sibling, 0 replies; 72+ messages in thread From: Stefan Monnier @ 2014-09-21 16:42 UTC (permalink / raw) To: Stephen Leake; +Cc: emacs-devel > font-lock runs after 'idle-time', which appears to be about 2 seconds (I Not really: font-lock is run "on the next redisplay", i.e. "immediately", on the modified lines. There is also a delayed re-highlighting of the rest of buffer (the part after the modified text) controlled by jit-lock-context-time (defaults to 0.5s). Stefan ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Tokenizing 2014-09-21 15:32 ` Tokenizing Stephen Leake 2014-09-21 16:42 ` Tokenizing Stefan Monnier @ 2014-09-21 18:55 ` Vladimir Kazanov 2014-09-21 22:01 ` Tokenizing Daniel Colascione 2014-09-22 13:15 ` Tokenizing Stephen Leake 1 sibling, 2 replies; 72+ messages in thread From: Vladimir Kazanov @ 2014-09-21 18:55 UTC (permalink / raw) To: Stephen Leake; +Cc: emacs-devel > I don't normally edit 7000 line files, so the Ada mode parsing delay is > not noticeable to me, so I prefer the current Ada mode approach of not > using the idle timer to trigger a parse. But it could be a user option. > I will look into that. Although the main idea is to *keep the token list consistent* most of the time. There will definitely be customization possibilities. > > Ada mode uses text properties to store parse results; the tokenizer > results are part of that, but are not stored separately. I don't see > much point in separating the tokenizer from the parser; the tokenizer > results are not useful by themselves (at least, not in Ada mode). > First, this not quite right. Tokenization results can be used, for example, for granular syntax highlighting. Font Lock basically just uses regexps to catch something that looks like comments/keywords/whatever. Tokenizer already *knows* for sure what it found. And you don't have to build a full parser to use the results. Second, it not a tokenizer I want to build, there is a misunderstanding of sorts. It is a helper mode (similar to Font Lock, in a way) for keeping token lists up to date all the time, easy and fast. User code - the tokenizer itself - will just have to provide an interface to the mode (be restartable and supply required restart information in resulting tokens). The mode will use the information to avoid extra tokenizing. > I have not noticed any problems with the text properties interface; in > particular, storing and retrieving text properties is fast compared to > parsing. Ada mode stores about two parse result text properties per > source line on average. I did not know about your mode - and parsers are sort of my hobby :-) I will definitely check it out, especially because it uses GLR(it really does?!), which can non-trivial to implement. -- Yours sincerely, Vladimir Kazanov -- С уважением, Владимир Казанов ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Tokenizing 2014-09-21 18:55 ` Tokenizing Vladimir Kazanov @ 2014-09-21 22:01 ` Daniel Colascione 2014-09-22 10:21 ` Tokenizing Vladimir Kazanov 2014-09-22 14:02 ` Tokenizing Stephen Leake 2014-09-22 13:15 ` Tokenizing Stephen Leake 1 sibling, 2 replies; 72+ messages in thread From: Daniel Colascione @ 2014-09-21 22:01 UTC (permalink / raw) To: Vladimir Kazanov, Stephen Leake; +Cc: emacs-devel [-- Attachment #1: Type: text/plain, Size: 8905 bytes --] On 09/21/2014 11:55 AM, Vladimir Kazanov wrote: >> I don't normally edit 7000 line files, so the Ada mode parsing delay is >> not noticeable to me, so I prefer the current Ada mode approach of not >> using the idle timer to trigger a parse. But it could be a user option. >> > > I will look into that. Although the main idea is to *keep the token > list consistent* most of the time. There will definitely be > customization possibilities. I've been working (very, very, very slowly) on similar functionality. The basic idea is based on the incremental lexing algorithm that Tim A. Wagner sets out in chapter 5 of his thesis [1]. The key is dynamically tracking lookahead used while we generate each token. Wagner's algorithm allows us to incorporate arbitrary lookahead into the invalidation state, so supporting something like flex's unlimited trailing context is no problem. The nice thing about this algorithm is that like the parser, it's an online algorithm and arbitrarily restartable. We can't use the built-in Emacs regular expression engine for this facility because there's no way to tell regex.c to record how much lookahead it used, and even if there were, I'd rather not rely on its backtracking matcher. I've written a DFA-based regex matcher that should help in implementing this algorithm; of course, a DFA matcher has a lookahead of zero. Mine supports zero-width predicates though, so we can actually use achieve nonzero lookahead (and lookbehind!) if needed. Where my thing departs from flex is that I want to use a regular expression (in the rx sense) to describe the higher-level parsing automaton instead of making mode authors fiddle with start states. This way, it's easy to incorporate support for things like JavaScript's regular expression syntax, in which "/" can mean one of two tokens depending on the previous token. (Another way of dealing with lexical ambiguity is to let the lexer return an arbitrary number of tokens for a given position and let the GLR parser sort it out, but I'm not as happy with that solution.) >> Ada mode uses text properties to store parse results; the tokenizer >> results are part of that, but are not stored separately. I don't see >> much point in separating the tokenizer from the parser; the tokenizer >> results are not useful by themselves (at least, not in Ada mode). >> > > First, this not quite right. Tokenization results can be used, for > example, for granular syntax highlighting. Font Lock basically just > uses regexps to catch something that looks like > comments/keywords/whatever. Tokenizer already *knows* for sure what it > found. And you don't have to build a full parser to use the results. There are two stages here: you want in *some* cases for fontification to use the results of tokenization directly; in other cases, you want to apply fontification rules to the result of parsing that token stream. Splitting the fontification rules between terminals and non-terminals this way helps us maintain rudimentary fontification even for invalid buffer contents --- that is, if the user types gibberish in a C-mode buffer, we want constructs that look like keywords and strings in that gibberish stream to be highlighted. > Second, it not a tokenizer I want to build, there is a > misunderstanding of sorts. It is a helper mode (similar to Font Lock, > in a way) for keeping token lists up to date all the time, easy and > fast. User code - the tokenizer itself - will just have to provide an > interface to the mode (be restartable and supply required restart > information in resulting tokens). The mode will use the information to > avoid extra tokenizing. Indeed. I don't imagine replacing font-lock per se, but instead just implementing a font-lock matcher that *lazily* brings the token list up-to-date and pulls keywords out of it for fontification. It's important to keep font-lock around for backward compatibility, but there's no reason that font-lock's data source has to be a bunch of simple regex matches. cc-mode, for example, uses font-lock to apply highlights that come from a parser that looks oddly like a CYK parser. (We start parsing at all points in the buffer that *might* start a declaration; there's also some limited, ad-hoc backtracking.) >> I have not noticed any problems with the text properties interface; in >> particular, storing and retrieving text properties is fast compared to >> parsing. Ada mode stores about two parse result text properties per >> source line on average. > > I did not know about your mode - and parsers are sort of my hobby :-) Mine too. :-) > I will definitely check it out, especially because it uses GLR(it > really does?!), which can non-trivial to implement. Wagner's thesis contains a description of a few alternative incremental GLR algorithms that look very promising. I have a few extensions in mind too. It's important to be able to quickly fontify a particular region of the buffer --- e.g., while scrolling. If we've already built a parse tree and damage part of the buffer, we can repair the tree and re-fontify fairly quickly. But what if we haven't parsed the whole buffer yet? We can start parsing at some region we quickly determine is semantically equivalent to the start of the buffer using something like font-lock-beginning-of-syntax-function. But what about completing the parse? Well, it's possible to determine whether any particular buffer state is being parsed "dynamically" (the GSS is split) or "statically" (the GSS has one head). In practice, ambiguous regions are short, so we only have to worry about the latter case. Here, we can reliably partially parse the buffer by parsing forward (remember, GLR parsing is an online algorithm) until the parsing limit is after the visible portion of the buffer; we can then use a "shortest-to-completion" production (which we can pre-compute from the grammar) to insert virtual tokens that complete the language. If we were parsing C++, we'd be able to automatically insert virtual "}" tokens and complete the parse tree, giving us a partial parse tree we can then use for fontification. Of course, it's better to have a parse tree that covers the whole buffer, but we can compute that in the background. One really nice property of GLR parsing (and LR parsers generally) is that we can trivially pause and restart them, so we can parse in the background with very little impact on user activity, using `while-no-input' to back out to the event loop when we detect we have something more important to do. Error recovery is a challenge --- we want to do something useful on buffers that are temporarily invalided by user edits. One of the biggest challenges is dealing with mismatched brace constructs, since one missing "}" in C-mode can cause the entire rest of the buffer to no longer conform to the grammar. Bridge parsing is particularly interesting here: the basic idea is that we perform one pass over the buffer using a simplified grammar that looks only at nesting constructs, insert virtual braces to make everything balance, and then apply the usual parsing algorithm (with ad-hoc error recovery) to the result. I haven't tried it, but it apparently works pretty well. I like what SMIE does, but I don't think it's powerful enough to parse C++. GLR parsers give us all the power we need. GLR parsers also leave us with a parse forest instead of a parse tree, and we really need a parse *tree* if we're going to apply fontification rules specified in terms of AST nodes. We need to prune the parse forest down to a parse tree before using it for fontification. I actually have a pretty neat idea for doing that: we can attach assertions to parts of the parse forest and prune the tree by deleting alternatives that contradict what we know. cc-mode already does a very limited form of this analysis by recording types it sees during parsing. Consider the most vexing parse: type variable(function()) We don't know whether that's a function declaration or a variable declaration. In GLR-land, we generate a parse forest with branches for both options. If later in the buffer we see something that looks like this: variable->foo(); That we know that "variable" cannot be a function, so we can prune the part of the parse forest that assumes that "variable" is a function, leaving us with an unambiguous parse tree. This way, we should be able to come up with reasonable parses for C++ buffers without having an IDE-style global view of types declarations, instead assuming that programmers, most of the time, don't contradict themselves. [1] harmonia.cs.berkeley.edu/papers/twagner-thesis.pdf [2] fileadmin.cs.lth.se/cs/Personal/Emma_Soderberg/docs/SLE08pres.pdf [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 819 bytes --] ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Tokenizing 2014-09-21 22:01 ` Tokenizing Daniel Colascione @ 2014-09-22 10:21 ` Vladimir Kazanov 2014-09-22 13:55 ` Tokenizing Daniel Colascione 2014-09-22 14:02 ` Tokenizing Stephen Leake 1 sibling, 1 reply; 72+ messages in thread From: Vladimir Kazanov @ 2014-09-22 10:21 UTC (permalink / raw) To: Daniel Colascione; +Cc: Stephen Leake, emacs-devel On Mon, Sep 22, 2014 at 1:01 AM, Daniel Colascione <dancol@dancol.org> wrote: > I've been working (very, very, very slowly) on similar functionality. > The basic idea is based on the incremental lexing algorithm that Tim A. > Wagner sets out in chapter 5 of his thesis [1]. The key is dynamically > tracking lookahead used while we generate each token. Wagner's > algorithm allows us to incorporate arbitrary lookahead into the > invalidation state, so supporting something like flex's unlimited > trailing context is no problem. > > The nice thing about this algorithm is that like the parser, it's an > online algorithm and arbitrarily restartable. I have already mentioned Wagner's paper in the previous letters. Actually, it is the main source of inspiration :-) But I think it is a bit over-complicated, and the only implementation I saw (Netbean's Lexer API) does not even try to implement it completely. Which is okay, academic papers tend to idealize things. > We can't use the built-in Emacs regular expression engine for this > facility because there's no way to tell regex.c to record how much > lookahead it used, and even if there were, I'd rather not rely on its > backtracking matcher. I've written a DFA-based regex matcher that should > help in implementing this algorithm; of course, a DFA matcher has a > lookahead of zero. > > Mine supports zero-width predicates though, so we can actually use > achieve nonzero lookahead (and lookbehind!) if needed. You do realize that this is a client's code problem? We can only recommend to use this or that regex engine, or even set the lookahead value for various token types by hand; and the latter case would probably work for most real-life cases. I am not even sure that it is possible to do it the Wagner's way (have a real next_char() function) in Emacs. I would check Lexer API solution as a starting point. > Where my thing departs from flex is that I want to use a regular > expression (in the rx sense) to describe the higher-level parsing > automaton instead of making mode authors fiddle with start states. This > way, it's easy to incorporate support for things like JavaScript's > regular expression syntax, in which "/" can mean one of two tokens > depending on the previous token. > > (Another way of dealing with lexical ambiguity is to let the lexer > return an arbitrary number of tokens for a given position and let the > GLR parser sort it out, but I'm not as happy with that solution.) I do not want to solve any concrete lexing problems. The whole point is about supplying a way to do it incrementally. I do not want to know anything about the code above or below , be it GLR/LR/flex/etc. > > There are two stages here: you want in *some* cases for fontification to > use the results of tokenization directly; in other cases, you want to > apply fontification rules to the result of parsing that token stream. > Splitting the fontification rules between terminals and non-terminals > this way helps us maintain rudimentary fontification even for invalid > buffer contents --- that is, if the user types gibberish in a C-mode > buffer, we want constructs that look like keywords and strings in that > gibberish stream to be highlighted. Yes, and this is a client's code that has to decide those things, be it using only the token list to do fontification or let a higher-level a parser do it. > >> I will definitely check it out, especially because it uses GLR(it >> really does?!), which can non-trivial to implement. > > Wagner's thesis contains a description of a few alternative incremental > GLR algorithms that look very promising. Yes, and a lot more :-) I want to concentrate on a smaller problem - don't feel like implementing the whole thesis right now. > I have a few extensions in mind too. It's important to be able to > quickly fontify a particular region of the buffer --- e.g., while scrolling. > > If we've already built a parse tree and damage part of the buffer, we > can repair the tree and re-fontify fairly quickly. But what if we > haven't parsed the whole buffer yet? > Nice. And I will definitely need to discuss all the optimization possibilities later. First, the core logic has to be implemented. Bottom line: I want to take this particular narrow problem, a few user code examples (for me it is a port of CPython's LL(1) parser) and see if I can solve in an optimal way. A working prototype will take some time, a month or more - I am not in a hurry. As much as I understand, you want to cooperate on it, right..? -- Yours sincerely, Vladimir Kazanov -- С уважением, Владимир Казанов ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Tokenizing 2014-09-22 10:21 ` Tokenizing Vladimir Kazanov @ 2014-09-22 13:55 ` Daniel Colascione 0 siblings, 0 replies; 72+ messages in thread From: Daniel Colascione @ 2014-09-22 13:55 UTC (permalink / raw) To: Vladimir Kazanov; +Cc: Stephen Leake, emacs-devel [-- Attachment #1: Type: text/plain, Size: 5065 bytes --] On 09/22/2014 03:21 AM, Vladimir Kazanov wrote: > On Mon, Sep 22, 2014 at 1:01 AM, Daniel Colascione <dancol@dancol.org> wrote: > >> I've been working (very, very, very slowly) on similar functionality. >> The basic idea is based on the incremental lexing algorithm that Tim A. >> Wagner sets out in chapter 5 of his thesis [1]. The key is dynamically >> tracking lookahead used while we generate each token. Wagner's >> algorithm allows us to incorporate arbitrary lookahead into the >> invalidation state, so supporting something like flex's unlimited >> trailing context is no problem. >> >> The nice thing about this algorithm is that like the parser, it's an >> online algorithm and arbitrarily restartable. > > I have already mentioned Wagner's paper in the previous letters. > Actually, it is the main source of inspiration :-) But I think it is a > bit over-complicated, and the only implementation I saw (Netbean's > Lexer API) does not even try to implement it completely. Which is > okay, academic papers tend to idealize things. That Lexer is a dumb state matcher, last time I checked. So is Eclipse's. Neither is adequate, at least not if you want to support lexing *arbitrary* languages (e.g., Python and JavaScript) with guaranteed correctness in the face of arbitrary buffer modification. > You do realize that this is a client's code problem? We can only > recommend to use this or that regex engine, or even set the lookahead > value for various token types by hand; and the latter case would > probably work for most real-life cases. > > I am not even sure that it is possible to do it the Wagner's way (have > a real next_char() function) in Emacs. I would check Lexer API > solution as a starting point. Of course it's possible to implement in Emacs. Buffers are strictly more powerful than character streams. > >> Where my thing departs from flex is that I want to use a regular >> expression (in the rx sense) to describe the higher-level parsing >> automaton instead of making mode authors fiddle with start states. This >> way, it's easy to incorporate support for things like JavaScript's >> regular expression syntax, in which "/" can mean one of two tokens >> depending on the previous token. >> >> (Another way of dealing with lexical ambiguity is to let the lexer >> return an arbitrary number of tokens for a given position and let the >> GLR parser sort it out, but I'm not as happy with that solution.) > > I do not want to solve any concrete lexing problems. The whole point > is about supplying a way to do it incrementally. I do not want to know > anything about the code above or below , be it GLR/LR/flex/etc. > >> >> There are two stages here: you want in *some* cases for fontification to >> use the results of tokenization directly; in other cases, you want to >> apply fontification rules to the result of parsing that token stream. >> Splitting the fontification rules between terminals and non-terminals >> this way helps us maintain rudimentary fontification even for invalid >> buffer contents --- that is, if the user types gibberish in a C-mode >> buffer, we want constructs that look like keywords and strings in that >> gibberish stream to be highlighted. > > Yes, and this is a client's code that has to decide those things, be > it using only the token list to do fontification or let a higher-level > a parser do it. Unless the parser itself is incremental, you're going to have interactivity problems. >>> I will definitely check it out, especially because it uses GLR(it >>> really does?!), which can non-trivial to implement. >> >> Wagner's thesis contains a description of a few alternative incremental >> GLR algorithms that look very promising. > > Yes, and a lot more :-) I want to concentrate on a smaller problem - > don't feel like implementing the whole thesis right now. > >> I have a few extensions in mind too. It's important to be able to >> quickly fontify a particular region of the buffer --- e.g., while scrolling. >> >> If we've already built a parse tree and damage part of the buffer, we >> can repair the tree and re-fontify fairly quickly. But what if we >> haven't parsed the whole buffer yet? >> > > Nice. And I will definitely need to discuss all the optimization > possibilities later. First, the core logic has to be implemented. > > Bottom line: I want to take this particular narrow problem, a few user > code examples (for me it is a port of CPython's LL(1) parser) and see > if I can solve in an optimal way. A working prototype will take some > time, a month or more - I am not in a hurry. > > As much as I understand, you want to cooperate on it, right..? *sigh* It sounds like you want to create something simple. You'll run into the same problems I did, or you'll produce something less than fully general. I don't have enough time to work on something that isn't fully general. I'm sick of writing language-specific text parsing code. Have fun. [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 819 bytes --] ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Tokenizing 2014-09-21 22:01 ` Tokenizing Daniel Colascione 2014-09-22 10:21 ` Tokenizing Vladimir Kazanov @ 2014-09-22 14:02 ` Stephen Leake 2014-09-22 14:14 ` Tokenizing Daniel Colascione 1 sibling, 1 reply; 72+ messages in thread From: Stephen Leake @ 2014-09-22 14:02 UTC (permalink / raw) To: emacs-devel Daniel Colascione <dancol@dancol.org> writes: > Wagner's thesis contains a description of a few alternative incremental > GLR algorithms that look very promising. Thanks for the pointer; I really need better error handling in the Ada mode parser. It currently defaults to a _really_ dumb indentation algorithm (same as previous line) when the parse fails. > We can start parsing at some region we quickly determine is semantically > equivalent to the start of the buffer using something like > font-lock-beginning-of-syntax-function. For Ada, that's only start of buffer; the standard convention is only one compilation-unit (the single grammar start symbol) per file. > Here, we can reliably partially parse the buffer by parsing > forward (remember, GLR parsing is an online algorithm) until the > parsing limit is after the visible portion of the buffer; we can then > use a "shortest-to-completion" production (which we can pre-compute > from the grammar) to insert virtual tokens that complete the language. I had the same idea; I was hoping it was new and I could make money :). Is that in Wagner's thesis (I didn't see it on a quick scan), or written up somewhere else? > One really nice > property of GLR parsing (and LR parsers generally) is that we can > trivially pause and restart them, so we can parse in the background with > very little impact on user activity, using `while-no-input' to back out > to the event loop when we detect we have something more important to > do. That's what I need! Although a really fast GLR parser implemented in compiled Ada that accesses the buffer via an Emacs FFI would be better. > Error recovery is a challenge --- we want to do something useful on > buffers that are temporarily invalided by user edits. One of the > biggest challenges is dealing with mismatched brace constructs, since > one missing "}" in C-mode can cause the entire rest of the buffer to no > longer conform to the grammar. I'm hoping to use the same "shortest-to-completion" algorithm for this. It may distort the rest of the buffer, but it will be better than what it does now. > Bridge parsing is particularly interesting here: the basic idea is that > we perform one pass over the buffer using a simplified grammar that > looks only at nesting constructs, insert virtual braces to make > everything balance, and then apply the usual parsing algorithm (with > ad-hoc error recovery) to the result. I haven't tried it, but it > apparently works pretty well. That could work for Ada as well. > I like what SMIE does, but I don't think it's powerful enough to parse > C++. I wrote a SMIE parser for Ada, but it ended up always parsing the whole buffer. So I switched to LALR. Then to generalized so I would not have to mangle the grammar in the Ada standard as much. > GLR parsers also leave us with a parse forest instead of a parse tree, > and we really need a parse *tree* if we're going to apply fontification > rules specified in terms of AST nodes. We need to prune the parse > forest down to a parse tree before using it for fontification. In Ada mode, I just assume the results of parallel parses that end in the same state are equivalent for the purposes of Ada mode (fontification and navigation). So I just throw away all but one result. I haven't really investigated whether this is true. > Consider the most vexing parse: > > type variable(function()) One reason I like Ada; the equivalent declarations are not ambiguous. -- -- Stephe ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Tokenizing 2014-09-22 14:02 ` Tokenizing Stephen Leake @ 2014-09-22 14:14 ` Daniel Colascione 0 siblings, 0 replies; 72+ messages in thread From: Daniel Colascione @ 2014-09-22 14:14 UTC (permalink / raw) To: Stephen Leake, emacs-devel [-- Attachment #1: Type: text/plain, Size: 4467 bytes --] On 09/22/2014 07:02 AM, Stephen Leake wrote: > Daniel Colascione <dancol@dancol.org> writes: > >> Wagner's thesis contains a description of a few alternative incremental >> GLR algorithms that look very promising. > > Thanks for the pointer; I really need better error handling in the Ada > mode parser. It currently defaults to a _really_ dumb indentation > algorithm (same as previous line) when the parse fails. Yep. falling back to counting parens seems reasonable enough. >> We can start parsing at some region we quickly determine is semantically >> equivalent to the start of the buffer using something like >> font-lock-beginning-of-syntax-function. > > For Ada, that's only start of buffer; the standard convention is only > one compilation-unit (the single grammar start symbol) per file. > >> Here, we can reliably partially parse the buffer by parsing >> forward (remember, GLR parsing is an online algorithm) until the >> parsing limit is after the visible portion of the buffer; we can then >> use a "shortest-to-completion" production (which we can pre-compute >> from the grammar) to insert virtual tokens that complete the language. > > I had the same idea; I was hoping it was new and I could make money :). > Is that in Wagner's thesis (I didn't see it on a quick scan), or written > up somewhere else? Not AFAIK. >> One really nice >> property of GLR parsing (and LR parsers generally) is that we can >> trivially pause and restart them, so we can parse in the background with >> very little impact on user activity, using `while-no-input' to back out >> to the event loop when we detect we have something more important to >> do. > > That's what I need! Although a really fast GLR parser implemented in > compiled Ada that accesses the buffer via an Emacs FFI would be better. Another nice thing about lexing with an automaton and parsing with a generic algorithm is that the parsing machine can be pushed (if needed) to C core with no loss of customizability or generality. I'd prefer that individual modes not rely on the FFI for basic fontification and indentation services. >> Error recovery is a challenge --- we want to do something useful on >> buffers that are temporarily invalided by user edits. One of the >> biggest challenges is dealing with mismatched brace constructs, since >> one missing "}" in C-mode can cause the entire rest of the buffer to no >> longer conform to the grammar. > > I'm hoping to use the same "shortest-to-completion" algorithm for this. > It may distort the rest of the buffer, but it will be better than what > it does now. Maybe it's better than nothing, but bridge parsing is a bit more promising, I think. :-) >> Bridge parsing is particularly interesting here: the basic idea is that >> we perform one pass over the buffer using a simplified grammar that >> looks only at nesting constructs, insert virtual braces to make >> everything balance, and then apply the usual parsing algorithm (with >> ad-hoc error recovery) to the result. I haven't tried it, but it >> apparently works pretty well. > > That could work for Ada as well. > >> I like what SMIE does, but I don't think it's powerful enough to parse >> C++. > > I wrote a SMIE parser for Ada, but it ended up always parsing the whole > buffer. So I switched to LALR. Then to generalized so I would not have > to mangle the grammar in the Ada standard as much. > >> GLR parsers also leave us with a parse forest instead of a parse tree, >> and we really need a parse *tree* if we're going to apply fontification >> rules specified in terms of AST nodes. We need to prune the parse >> forest down to a parse tree before using it for fontification. > > In Ada mode, I just assume the results of parallel parses that end in > the same state are equivalent for the purposes of Ada mode > (fontification and navigation). So I just throw away all but one result. > I haven't really investigated whether this is true. You can still end up in the same state having shifted different subtrees, so if you do it that way, you can flop between different parsing alternatives as you edit the buffer for no reason at all. >> Consider the most vexing parse: >> >> type variable(function()) > > One reason I like Ada; the equivalent declarations are not ambiguous. Writing unambiguous languages seems to have fallen out of fashion. [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 819 bytes --] ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: Tokenizing 2014-09-21 18:55 ` Tokenizing Vladimir Kazanov 2014-09-21 22:01 ` Tokenizing Daniel Colascione @ 2014-09-22 13:15 ` Stephen Leake 1 sibling, 0 replies; 72+ messages in thread From: Stephen Leake @ 2014-09-22 13:15 UTC (permalink / raw) To: emacs-devel Vladimir Kazanov <vekazanov@gmail.com> writes: >> Ada mode uses text properties to store parse results; the tokenizer >> results are part of that, but are not stored separately. I don't see >> much point in separating the tokenizer from the parser; the tokenizer >> results are not useful by themselves (at least, not in Ada mode). >> > > First, this not quite right. Tokenization results can be used, for > example, for granular syntax highlighting. Ada requires semantic parsing, not just tokenizing, for syntax highlighting (it's a complex language :). Hmm, I'm not sure what you mean by "granualar" here; if you mean "less than totally accurate", then you are right, we don't need a parser for that. On the other hand, we don't need a tokenizer, either :). In Ada, a parser is required to distinguish between these two instances of 'return': function Foo (...) return Bar is begin Baz := ...; return Baz; end Foo; In the first instance, "Bar" is a type; in the second, "Baz" is a variable. They should have different faces. Finding the 'function' keyword from just token information is non-trivial. The parser tags Bar as a type. > Font Lock basically just > uses regexps to catch something that looks like > comments/keywords/whatever. But that can be extended by arbitrary functions in font-lock-add-keywords; Ada mode does that to use the parser information (when available; it doesn't force a parse just for font-lock). > Second, it not a tokenizer I want to build, there is a > misunderstanding of sorts. It is a helper mode (similar to Font Lock, > in a way) for keeping token lists up to date all the time, easy and > fast. User code - the tokenizer itself - will just have to provide an > interface to the mode (be restartable and supply required restart > information in resulting tokens). The mode will use the information to > avoid extra tokenizing. Ok. Maybe I can use that, and have it run the parser whenever needed. Just replace "token lists" with "some text properties" in the above; the helper mode should not care if they are "tokenizer results" or "parser results". >> I have not noticed any problems with the text properties interface; in >> particular, storing and retrieving text properties is fast compared to >> parsing. Ada mode stores about two parse result text properties per >> source line on average. > > I did not know about your mode - and parsers are sort of my hobby :-) > I will definitely check it out, especially because it uses GLR(it > really does?!), which can non-trivial to implement. Yes, implementing GLR was complicated, and therefore fun :). -- -- Stephe ^ permalink raw reply [flat|nested] 72+ messages in thread
end of thread, other threads:[~2014-10-06 21:02 UTC | newest] Thread overview: 72+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2014-09-19 14:59 Overlay mechanic improvements Vladimir Kazanov 2014-09-19 17:22 ` Stefan Monnier 2014-09-20 13:19 ` Richard Stallman 2014-09-20 13:37 ` David Kastrup 2014-09-21 13:35 ` Richard Stallman 2014-09-21 13:52 ` David Kastrup 2014-09-21 21:48 ` Richard Stallman 2014-09-21 22:06 ` David Kastrup 2014-09-22 23:11 ` Richard Stallman 2014-09-22 23:50 ` David Kastrup 2014-09-23 19:15 ` Richard Stallman 2014-09-21 16:07 ` Stefan Monnier 2014-09-21 16:14 ` David Kastrup 2014-09-21 21:48 ` Richard Stallman 2014-09-21 22:19 ` David Kastrup 2014-09-23 19:16 ` Richard Stallman 2014-09-23 19:27 ` David Kastrup 2014-09-28 23:24 ` Richard Stallman 2014-09-29 5:45 ` David Kastrup 2014-09-29 20:48 ` Richard Stallman 2014-09-30 1:21 ` Stephen J. Turnbull 2014-09-30 8:43 ` David Kastrup 2014-09-30 10:35 ` Rasmus 2014-09-30 14:22 ` Eli Zaretskii 2014-09-30 16:20 ` David Kastrup 2014-09-30 16:35 ` Eli Zaretskii 2014-09-30 14:32 ` Stefan Monnier 2014-10-02 16:12 ` Uwe Brauer 2014-09-30 19:23 ` Richard Stallman 2014-10-01 3:38 ` Stephen J. Turnbull 2014-10-01 12:53 ` Richard Stallman 2014-10-01 13:11 ` David Kastrup 2014-10-02 1:26 ` Stephen J. Turnbull 2014-09-30 5:52 ` David Kastrup 2014-10-06 19:14 ` Richard Stallman 2014-10-06 21:02 ` David Kastrup 2014-09-21 16:56 ` Eli Zaretskii 2014-09-21 18:42 ` Stefan Monnier 2014-09-21 18:58 ` Eli Zaretskii 2014-09-21 20:12 ` Stefan Monnier 2014-09-21 21:48 ` Richard Stallman 2014-09-22 0:31 ` Stefan Monnier 2014-09-22 23:11 ` Richard Stallman 2014-09-20 15:56 ` Eli Zaretskii 2014-09-20 19:49 ` Stefan Monnier 2014-09-21 13:36 ` Richard Stallman 2014-09-19 18:03 ` Richard Stallman 2014-09-20 8:08 ` Vladimir Kazanov 2014-09-20 13:21 ` Richard Stallman 2014-09-20 16:28 ` Stephen Leake 2014-09-20 13:21 ` Tokenizing Richard Stallman 2014-09-20 16:24 ` Tokenizing Stephen Leake 2014-09-20 16:40 ` Tokenizing Vladimir Kazanov 2014-09-20 20:16 ` Tokenizing Eric Ludlam 2014-09-20 20:35 ` Tokenizing Vladimir Kazanov 2014-09-21 15:13 ` parsing (was tokenizing) Stephen Leake 2014-09-20 16:36 ` Tokenizing Vladimir Kazanov 2014-09-20 19:55 ` Tokenizing Stefan Monnier 2014-09-21 15:35 ` Tokenizing Stephen Leake 2014-09-21 16:43 ` Tokenizing Stefan Monnier 2014-09-22 14:05 ` Tokenizing Stephen Leake 2014-09-21 13:35 ` Tokenizing Richard Stallman 2014-09-21 14:24 ` Tokenizing Vladimir Kazanov 2014-09-21 15:32 ` Tokenizing Stephen Leake 2014-09-21 16:42 ` Tokenizing Stefan Monnier 2014-09-21 18:55 ` Tokenizing Vladimir Kazanov 2014-09-21 22:01 ` Tokenizing Daniel Colascione 2014-09-22 10:21 ` Tokenizing Vladimir Kazanov 2014-09-22 13:55 ` Tokenizing Daniel Colascione 2014-09-22 14:02 ` Tokenizing Stephen Leake 2014-09-22 14:14 ` Tokenizing Daniel Colascione 2014-09-22 13:15 ` Tokenizing Stephen Leake
Code repositories for project(s) associated with this public inbox https://git.savannah.gnu.org/cgit/emacs.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).