emacs-orgmode@gnu.org archives
 help / color / mirror / code / Atom feed
From: Ihor Radchenko <yantar92@posteo.net>
To: Ihor Radchenko <yantar92@gmail.com>
Cc: emacs-orgmode@gnu.org
Subject: [RFC] Refactoring org-element API (was: [DISCUSSION] Refactoring fontification system)
Date: Tue, 30 May 2023 11:25:46 +0000	[thread overview]
Message-ID: <874jnudps5.fsf@localhost> (raw)
In-Reply-To: <87k09ycc7n.fsf@localhost>

[-- Attachment #1: Type: text/plain, Size: 3156 bytes --]

Ihor Radchenko <yantar92@gmail.com> writes:

> Ihor Radchenko <yantar92@gmail.com> writes:
>
>> Instead of fontifying elements individually via regexps, we can leverage
>> org-element-map, org-element-parse-buffer, org-element-cache, and
>> jit-lock-mode. Each type of Org element/object can be assigned with a
>> fontification function accepting a single argument - the element datum.
>
> I have been working on the new fontification library for the last
> several months and I have noticed several confusing things in the
> current fontification settings.

It has been a while since the last update on the fontification... for a
reason.

As the fontification should be very tightly optimized to avoid Emacs
freezes, I stumbled upon a need to improve the performance of Org parser
further than we already have. This triggered a number of significant
changes in listp/org-element.el. So many that the current logic of the
library becomes a mess.

In order not to turn org-element into another org-agenda, I am proposing
to factor out Org syntax tree API into a separate file
org-element-ast.el and add a number of breaking changes how the syntax
tree is structured.

The most important changes are the following:

1. Frequently used element properties will no longer be stored directly
   in the property list. Instead, they will be placed inside a special
   vector that is much faster to access. `org-element-property' and
   other accessor functions are changed accordingly to inline the
   property queries into `aref' calls, when possible.

   This is a major breaking change.

2. Org parser will now be able to parse elements partially, with some
   parts of the parser executed later, only when necessary.
   The downside is relying upon the original buffer to be live even
   after parsing.

   This is also a major breaking change.

3. The order of multiple affiliated keywords in Org parse tree will be
   reversed. The main purpose here is to avoid special cases when fine
   details of Org syntax had to be accounted for when traversing Org
   parse trees.

   This is a breaking change.

4. Org cache will be available even when `org-element-use-cache' is
   non-nil. This is not a breaking change and simplifies Org code
   greatly without compromising performance.

5. The usage of regular expressions is by Org parser is now tightly
   optimized. See the discussion in https://debbugs.gnu.org/cgi/bugreport.cgi?bug=63225
   This is also not a breaking change.

6. Memory footprint is reduced for Org AST. This is achieved using
   shared string objects in Org parser. With low probability, this might
   cause problems if third-party code modifies these string objects by
   side effect.

This refactoring is large, spanning >80 commits.
The full commit log can be found at
https://git.sr.ht/~yantar92/org-mode/log/feature/org-element-ast-tidy

Here, I am attaching ORG-NEWS diff, summarizing important changes.
I am also adding the commentary section of org-element-ast.el with more
details about the new syntax tree structure and about the new concept of
deferred parsed values.

I am sure that I missed things, so comments are welcome.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: ORG-NEWS.diff --]
[-- Type: text/x-patch, Size: 9763 bytes --]

diff --git a/etc/ORG-NEWS b/etc/ORG-NEWS
index ddf1e9110..396935283 100644
--- a/etc/ORG-NEWS
+++ b/etc/ORG-NEWS
@@ -13,6 +13,187 @@ Please send Org bug reports to mailto:emacs-orgmode@gnu.org.
 
 * Version 9.7 (not released yet)
 ** Important announcements and breaking changes
+*** Major changes and additions to Org API
+**** New term: "syntax node"
+
+To reduce confusion with "element" referring to both "syntax element"
+and "element/object" class, we now prefer using "syntax node" when
+referring to generic Org syntax elements.  "Elements" and "objects"
+now refer to different syntax node classes of paragraph-like nodes and
+markup-like nodes.
+
+**** New element type ~anonymous~
+
+Secondary strings can now be recognized as ~anonymous~ type to
+distinguish from non-elements.  With a new optional argument for
+~org-element-type~ will return ~anonymous~ for secondary strings
+instead of nil.
+
+The new element type can be used in ~org-element-lineage~,
+~org-element-map~, and other functions that filter by element type.
+**** Internal structure of Org parse tree has been changed
+
+The code relying upon the previously used =(TYPE PROPERTIES-PLIST CONTENTS-LIST)=
+structure may no longer work.  Please use ~org-element-create~,
+~org-element-property~, and other Org element API functions to work
+with Org syntax trees.
+
+Some syntax node properties are no longer stored as property list elements.
+Instead, they are kept in a special vector value of a new
+=:standard-properties= property.  This is done to improve performance.
+
+Properties and their values can now be deferred to avoid overheads when
+parsing.  They are calculated lazily, when the value/property is
+requested by getters.
+
+New special property =:secondary= is used internally to record which
+properties store secondary objects.
+
+New special property =:deferred= is used to keep information how to
+calculate property names lazily.
+
+See the commentary in =lisp/org-element-ast.el= for more details.
+
+**** Multiple affiliated keyword values are now stored in the order they appear in buffer
+
+Previously,
+
+: #+caption: foo
+: #+caption: bar
+: Paragraph
+
+would have its =:caption= property set to ~(("bar") ("foo"))~ in reverse order.
+
+Now, the order is not reversed: ~(("foo") ("bar"))~.
+
+**** Some property values may now be calculated lazily and require original Org buffer to be live
+
+~org-element-at-point~, ~org-element-context~, and
+~org-element-at-point-no-context~ may now not calculate all the
+property values at the call time.  Instead, the calculation will be
+deferred until ~org-element-property~ or the equivalent getter
+function is called.  The property names may not all be calculated as
+well.
+
+It may often be necessary to have the original Org buffer open when
+resolving the deferred values.
+
+One can ensure that all the deferred values are resolved using new
+function ~org-element-resolve-deferred~ and new optional argument for
+~org-element-property~.
+
+~org-element-parse-buffer~ and ~org-element-parse-secondary-string~
+will resolve all the deferred values by default.  No adjustment is
+needed for their users.
+
+**** New API functions and macros
+***** New property accessors and setters
+
+New functions to retrieve and set (via ~setf~) commonly used element properties:
+- =:begin= :: ~org-element-begin~
+- =:end= :: ~org-element-end~
+- =:contents-begin= :: ~org-element-contents-begin~
+- =:contents-end= :: ~org-element-contents-end~
+- =:contents-post-affiliated= :: ~org-element-post-affiliated~
+- =:contents-post-blank= :: ~org-element-post-blank~
+- =:parent= :: ~org-element-parent~
+ 
+***** New macro ~org-element-with-enabled-cache~
+
+The macro arranges the element cache to be active during =BODY= execution.
+When cache is enabled, the macro is identical to ~progn~.  When cache
+is disabled, the macro arranges a new fresh cache that is discarded
+upon completion of =BODY=.
+
+***** New function ~org-element-property-1~
+
+This function is like ~org-element-property~ but does not try to
+resolve deferred properties.
+
+~org-element-property-1~ can be used with ~setf~.
+
+***** New function ~org-element-put-property-2~
+
+Like ~org-element-put-property~, but the argument list is changed to have
+=NODE= as the last argument.  Useful with threading macros like
+~thread-last~.
+
+***** New function ~org-element-properties-resolve~
+
+This function resolves all the deferred values in a =NODE=, modifying
+the =NODE= for side effect.
+
+***** New functions ~org-element-properties-map~ and ~org-element-properties-mapc~
+
+New functions to map over =NODE= properties.
+
+***** New function ~org-element-ast-map~
+
+This is a more general equivalent of ~org-element-map~.  It allows
+more precise control over recursion into secondary strings.
+
+***** New function ~org-element-lineage-map~
+
+Traverse syntax tree ancestor list, applying arbitrary function to
+each ancestor.
+
+***** New function ~org-element-property-inherited~
+
+Like ~org-element-property~, but can be used to retrieve and combine
+multiple different properties for a given =NODE= and its parents.
+
+**** ~org-element-cache-map~ can now be used even when element cache is disabled
+**** =org-element= API functions and macros can now accept syntax elements as =POM= argument
+
+The following functions are updated:
+- ~org-agenda-entry-get-agenda-timestamp~
+- ~org-element-at-point~
+- ~org-is-habit-p~
+- ~org-id-get~
+- ~org-with-point-at~
+- ~org-entry-properties~
+- ~org-entry-get~
+- ~org-entry-delete~
+- ~org-entry-add-to-multivalued-property~
+- ~org-entry-remove-from-multivalued-property~
+- ~org-entry-member-in-multivalued-property~
+- ~org-entry-put-multivalued-property~
+- ~org-entry-get-with-inheritance~
+- ~org-entry-put~
+- ~org-read-property-value~
+- ~org-property-get-allowed-values~
+
+**** ~org-element-map~ now traverses main value in dual keywords before the secondary value
+
+The traverse order for dual keywords is reversed.  The main value is
+now traversed first, followed by the secondary value.
+
+**** Org parse tree is now non-printable
+
+Org parser now assigns a new property =:buffer= that holds
+non-printable buffer object.  This makes syntax tree non-printable.
+Using ~print~/~read~ is no longer safe.
+
+**** Some Org API functions no longer preserve match data
+
+~org-element-at-point~, ~org-element-context~, ~org-get-category~, ~org-get-tags~
+
+The relevant function docstrings now explicitly mention that match
+data may be modified.
+**** ~org-element-create~ now treats a single ~anonymous~ =CHILDREN= argument as a list of child nodes
+
+When =CHILDREN= is a single anonymous node, use its contents as children
+nodes.  This way,
+
+: (org-element-create 'section nil (org-element-contents node))
+
+will yield expected results with contents of another node adopted into
+a newly created one.
+
+Previously, one had to use
+
+: (apply #'org-element-create 'section nil (org-element-contents node))
+
 *** "Priority" used to sort items in agenda is renamed to "urgency"
 
 Previously, ~priority-up~ and ~priority-down~ in
@@ -225,7 +406,64 @@ editing with Emacs while a ~:session~ block executes.
 
 When ~org-return-follows-link~ is non-nil and cursor is over an
 org-cite citation, ~org-return~ will call ~org-open-at-point~.
+** New functions and changes in function arguments
+*** =TYPES= argument in ~org-element-lineage~ can now be a symbol
+
+When =TYPES= is symbol, only check syntax nodes of that type.
+
+*** New optional argument =KEEP-CONTENTS= for ~org-element-copy~
+
+With the new argument, the contents is copied recursively.
+
+*** ~org-element-property~ can now be used with ~setf~
+*** New optional arguments for ~org-element-property~
+
+The value of the new optional argument =DFLT= is returned if the
+property with given name is not present.  Same as =DEFAULT= argument
+for ~alist-get~.
+
+New optional argument =FORCE-UNDEFER= modifies the =NODE=, storing the
+resolved deferred values.
+
+*** New optional argument =NO-UNDEFER= in ~org-element-map~ and changed argument conventions
+
+New optional argument =NO-UNDEFER=, when non-nil, will make
+~org-element-map~ keep deferred secondary string values in their raw form.
+
+=TYPES= argument can now be set to ~t~.  This will match all the
+syntax nodes when traversing the tree.
+
+~FUN~ can now be a lisp form that will be evaluated with symbol ~node~
+assigned to the current syntax node.
+
+~FUN~ can now throw ~:org-element-skip~ signal to skip recursing into
+current element children and secondary strings.
+
+*** New optional argument =KEEP-DEFERRED= in ~org-element-parse-buffer~
+
+When non-nil, the deferred values and properties will not be resolved.
+
+*** New optional argument =ANONYMOUS= for ~org-element-type~
+
+When the new argument is non-nil, return symbol ~anonymous~ for anonymous elements.
+
+*** ~org-element-adopt-elements~ is renamed to ~org-element-adopt~
+
+The old name is kept as an alias.  The new name creates less confusion
+as the function can also act on objects.
+
+*** ~org-element-extract-element~ is renamed to ~org-element-extract~
+
+The old name is kept as an alias.  The new name creates less confusion
+as the function can also act on objects.
+
+*** ~org-element-set-element~ is renamed to ~org-element-set~
+
+The old name is kept as an alias.  The new name creates less confusion
+as the function can also act on objects.
 
+*** ~org-export-get-parent~ is renamed to ~org-element-parent~ and moved to =lisp/org-element.el=
+*** ~org-export-get-parent-element~ is renamed to ~org-element-parent-element~ and moved to =lisp/org-element.el=
 ** Miscellaneous
 *** =org-crypt.el= now applies initial visibility settings to decrypted entries
 

[-- Attachment #3: Type: text/plain, Size: 7620 bytes --]


;;; org-element-ast.el --- Abstract syntax tree for Org  -*- lexical-binding: t; -*-

;; ...
;; This file implements Org abstract syntax tree (AST) data structure.
;;
;; Only the most generic aspect of the syntax tree are considered
;; below.  The fine details of Org syntax are implemented elsewhere.
;;
;; Org AST is composed of nested syntax nodes.
;; Within actual Org syntax, the nodes can be either headings,
;; elements, or objects.  However, historically, we often call syntax
;; nodes simply "elements", unless the context requires clarification
;; about the node type.  In particular, many functions below will have
;; naming pattern `org-element-X', implying `org-element-node-X' --
;; they will apply to all the node types, not just to elements.
;;
;; 1. Syntax nodes
;; ------------------
;; Each Org syntax node can be represented as a string or list.
;;
;; The main node representation follows the pattern
;; (TYPE PROPERTIES CONTENTS), where
;;   TYPE is a symbol describing the node type.
;;   PROPERTIES is the property list attached to it.
;;   CONTENTS is a list of child syntax nodes contained within the
;;            current node, when applicable.
;;
;;; For example, "*bold text*  " node can be represented as
;;
;;    (bold (:begin 1 :end 14 :post-blank 2 ...) "bold text")
;;
;; TYPE can be any symbol, including symbol not explicitly defined by
;; Org syntax.  If TYPE is not a part of the syntax, the syntax
;; node is called "pseudo element/object", but otherwise considered a
;; valid part of Org syntax tree.  Search "Pseudo objects and
;; elements" in lisp/ox-latex.el for an example of using pseudo
;; elements.
;;
;; PROPERTIES is a property list (:property1 value1 :property2 value2 ...)
;; holding properties and value.
;;
;; `:standard-properties', `:parent', `:deferred', and `:secondary'
;; properties are treated specially in the code below.
;;
;; `:standard-properties' holds an array with
;; `org-element--standard-properties' values, in the same order.  The
;; values in the array have priority over the same properties
;; specified in the property list.  You should not rely on the value
;; of `org-element--standard-propreties' in the code.
;; `:standard-properties' may or may not be actually present in
;; PROPERTIES.  It is mostly used to speed up property access in
;; performance-critical code, as most of the code requesting property
;; values by constant name is inlined.
;;
;; The previous example can also be presented in more compact form as:
;;
;;    (bold (:standard-properties [1 10 ... 2 ...]) "bold text")
;;
;; Using an array allows faster access to frequently used properties.
;;
;; `:parent' holds the containing node, for a child node within the
;; AST.  It may or may not be present in PROPERTIES.
;;
;; `:secondary' holds a list of properties that may contain extra AST
;; nodes, in addition to the node contents.
;;
;; `deferred' property describes how to update not-yet-calculated
;; properties on request.
;;
;;
;; Syntax node can also be represented by a string.  Strings always
;; represent syntax node of `plain-text' type with contents being nil
;; and properties represented as string properties at position 0.
;; `:standard-properties' are not considered for `plain-text' nodes as
;; `plain-text' nodes tend to hold much fewer properties.
;;
;; In the above example, `plain-text' node "bold text" is more
;; accurately represented as
;;
;;    #("bold text" 0 9 (:parent (bold ...)))
;;
;; with :parent property value pointing back to the containing `bold'
;; node.
;;
;; `anonymous' syntax node is represented as a list with `car'
;; containing another syntax node.  Such node has nil type, does not
;; have properties, and its contents is a list of the contained syntax
;; node.  `:parent' property of the contained nodes point back to the
;; list itself, except when `anonymous' node holds secondary value
;; (see below), in which case the `:parent' property is set to be the
;; containing node in the AST.
;;
;; Any node representation other then described above is not
;; considered as Org syntax node.
;;
;; 2. Deferred values
;; ------------------
;; Sometimes, it is computationally expensive or even not possible to
;; calculate property values when creating an AST node.  The value
;; calculation can be deferred to the time the value is requested.
;;
;; Property values and contained nodes may have a special value of
;; `org-element-deferred' type.  Such values are computed dynamically.
;; Either every time the property value is requested or just the first
;; time.  In the latter case, the `org-element-deferred' property
;; value is auto-replaced with the dynamically computed result.
;;
;; Sometimes, even property names (not just property values) cannot, or
;; should not be computed in advance.  If a special property
;; `:deferred' has the value of `org-element-deferred-type', it is
;; first resolved for side effects of setting the missing properties.
;; The resolved value is re-assigned to the `:deferred' property.
;;
;; Note that `org-element-copy' unconditionally resolves deferred
;; properties.  This is useful to generate pure (in functional sense)
;; AST.
;;
;; The properties listed in `org-element--standard-properties', except
;; `:deferred' and `:parent' are never considered to have deferred value.
;; This constraint makes org-element API significantly faster.
;;
;; 3. Org document representation
;; ------------------------------
;; Document AST is represented by nested Org syntax nodes.
;;
;; Each node in the AST can hold the contained node in its CONTENTS or
;; as values of properties.
;;
;; For example, (bold (...) "bold text") `bold' node contains
;; `plain-text' node in CONTENTS.
;;
;; The containing node is called "parent node".
;;
;; The contained nodes held inside CONTENTS are called "child nodes".
;; They must have their `:parent' property set to the containing
;; parent node.
;;
;; The contained nodes can also be held as property values.  Such
;; nodes are called "secondary nodes".  Only certain properties
;; can contribute to AST - the property names listed as the value of
;; special property `:secondary'
;;
;; For example,
;;
;;   (headline ((:secondary (:title)
;;               :title (#("text" 0 4 (:parent (headline ...)))))))
;;
;; is a parent headline node containing "text" secondary string node
;; inside `:title' property.  Note that `:title' is listed in
;; `:secondary' value.
;;
;; The following example illustrates an example AST for Org document:
;;
;; ---- Org document --------
;; * Heading with *bold* text
;; Paragraph.
;; ---- end -----------------
;;
;; (org-data (...) ; `org-data' node.
;;   (headline
;;     (
;;      ;; `:secondary' property lists property names that contain other
;;      ;; syntax tree nodes.
;;
;;      :secondary (:title)
;;
;;      ;; `:title' property is set to anonymous node containing:
;;      ;; `plain-text', `bold', `plain-text'.
;;
;;      :title ("Heading with " (bold (:post-blank 1 ...) "bold") "text"))
;;
;;      ;; `headline' contents
;;     (section (...)
;;       (paragraph
;;         ;; `:parent' property set to the containing section.
;;         (:parent (section ...))
;;         ;; paragraph contents is a `plain-text' node.
;;         "Paragraph1."))))
;;
;; Try calling M-: (org-element-parse-buffer) on the above example Org
;; document to explore a more complete version of Org AST.


-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>

  parent reply	other threads:[~2023-05-30 11:22 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-19 14:12 [DISCUSSION] Refactoring fontification system Ihor Radchenko
2021-11-19 14:18 ` Bruce D'Arcus
2021-11-19 16:09 ` Tim Cross
2021-11-24 22:03 ` Nicolas Goaziou
2022-06-03  9:45 ` Ihor Radchenko
2022-06-03 20:37   ` Ted Reed via General discussions about Org-mode.
2022-06-04 13:45     ` Ihor Radchenko
2022-06-04 23:28       ` Ted Reed via General discussions about Org-mode.
2022-06-03 21:38   ` Tim Cross
2022-06-07 16:48   ` Max Nikulin
2022-06-08  2:02     ` Ihor Radchenko
2022-06-08  4:23       ` Tom Gillespie
2022-06-08  6:35         ` Tim Cross
2022-06-09 15:31         ` Max Nikulin
2022-06-10  2:06           ` [PATCH] #+begin_example lang used in manual and worg (was: [DISCUSSION] Refactoring fontification system) Ihor Radchenko
2022-06-15  3:40             ` Max Nikulin
2022-06-16 12:31               ` Ihor Radchenko
2022-06-16 12:33               ` [BUG] Unescaped #+ lines in WORG example blocks (was: [PATCH] #+begin_example lang used in manual and worg (was: [DISCUSSION] Refactoring fontification system)) Ihor Radchenko
2022-06-16 16:33                 ` Tim Cross
2024-04-15 13:44                 ` Ihor Radchenko
2022-06-16 15:08       ` [DISCUSSION] Refactoring fontification system Max Nikulin
2022-06-08  6:52   ` Phil Estival
2023-05-30 11:25   ` Ihor Radchenko [this message]
2023-05-30 11:32     ` [RFC] Refactoring org-element API (was: [DISCUSSION] Refactoring fontification system) Ihor Radchenko
2023-05-30 15:00     ` [RFC] Refactoring org-element API Stefan Nobis
2023-05-31  8:57       ` Ihor Radchenko
2023-06-23 12:20         ` Ihor Radchenko
2023-06-30 13:53           ` Ihor Radchenko
2023-07-01 11:44     ` [RFC] Refactoring org-element API (was: [DISCUSSION] Refactoring fontification system) Ihor Radchenko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.orgmode.org/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=874jnudps5.fsf@localhost \
    --to=yantar92@posteo.net \
    --cc=emacs-orgmode@gnu.org \
    --cc=yantar92@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).