From: Ihor Radchenko <yantar92@gmail.com>
To: "Mattias Engdegård" <mattiase@acm.org>
Cc: 52753@debbugs.gnu.org
Subject: bug#52753: 29.0.50; Printing long list-like structures fails
Date: Sun, 23 Jan 2022 20:44:48 +0800 [thread overview]
Message-ID: <8735le7ha7.fsf@localhost> (raw)
In-Reply-To: <5D2E995F-7A93-4A75-8AE9-AFCE02399D4E@acm.org>
Mattias Engdegård <mattiase@acm.org> writes:
> An alternative would be writing a special reader and printer specifically for marshalling Lisp data structures: they could be faster and simpler because they wouldn't need to recognise the full Lisp syntax, nor respect the various reader and printer options. The result could also be made more compact since it wouldn't be intended for human readers.
>
> However, the code would still need to detect shared structure (your skip-list use these a lot!) and the basic Lisp syntax is fairly space-efficient, so it may be better to fix the stuff we already have. Maybe you want to give it a go?
The idea to provide a special reader/printer sounds reasonable. If there
was a way to tell Emacs how to write some specific data structure,
things would be easier. However, I am not sure what should be the
mechanism to deal with malformed data structures and with circular
objects.
I tried to understand the printer code, but it appears to be too complex
for me. Maybe I am just not familiar enough with the C part of Emacs.
>> Skip list is conceptually very similar to ordinary list. Maybe the
>> existing list reader functionality can be somehow adapted to read
>> list-like structures?
>
> Your skip list is nothing like an ordinary list; it is deeply recursive in the 'car' slot, and even alternates conses and vectors for each level. Inserting the numbers 1..4 results in
>
> [org-skip-list- 16 2 0.367879441171 #3=(org-skip-list--head . [(1 . [#2=(2 . [#1=(3 . [(4 . [nil]) nil]) #1#])]) #2# nil nil nil nil nil nil nil nil nil nil nil nil nil nil]) [#1# #1# #3# #3# #3# #3# #3# #3# #3# #3# #3# #3# #3# #3# #3# #3#] <]
>
> where the hurting part is [(1 . [#2=(2 . [#1=(3 . [(4 . [... -- a depth of 2N for N elements. There is also a lot of shared structure (all the #k#) imposing further work on the printer and reader.
While I understand how the illustrated example is difficult for the
reader, note that recursion is _not_ in the car slot. Each list element
is a structure like (key . forward-vector) with key being the stored
value and forward-vector storing forward references to the next
elements. Conceptually, ordinary list is just a degenerate case of skip
list with forward-vector storing a single forward-reference to the next
element in the list.
Of course, my remark does not help much to solve the observed bug.
With better understanding of Elisp printer, I can probably construct an
alternative data structure that can be printed without issues.
I think that a "dummy" ordinary list storing all the keys can be used to
prevent the recursion:
[org-skip-list- 16 2 0.367879441171 (#1=1 #2=2 #3=3 #4=4) <- dummy list
(org-skip-list--head . [(#1 . [(#2 . [(#3 . [(#4 . [nil]) nil]) #3])]) #2 nil ...])]
>> I know that balanced trees have the same asymptotic behaviour with skip
>> lists and better worst-case behaviour. However, the original Pugh's paper
>> introducing skip lists did a comparison between skip list and
>> self-balancing trees.
>
> That's true (I remember being fascinated by that paper) but it was written in 1989 and some things have changed since. In particular, memory locality matters more today. Random number generators have made strides but they are often expensive, and the one in Emacs isn't likely to be particularly fast (nor good).
>
> Skip-lists have some properties that make them interesting for some uses, such as being fairly amenable to lock-free operation, but that's not of interest in Emacs today.
Since your last message, I have dived into the more recent literature on
data structures. It looks like skip lists are not that bad - similar to
even older AVL-tree they got updated with finger search facilities and
auto-adjustments to non-random data [1,2]. However, a systematic
comparison between AVL-trees, skip lists, and splay trees reveals that
insertion into skip lists perform up to 2x worse compared to AVL-trees,
especially for random insertion [3]. Ref. [3] generally agrees with my
measurements. On the other hand, Ref. [3] showed that average lookup
time should be 2x faster in skip lists - very different from my
implementation. I may be doing something wrong.
Ref. [3] also introduces some attractive behaviour of splay trees: they
outperform AVL-trees in random+non-random search and non-random
insertion.
[1] Jonathan J. Pittard, Alan L. Tharp [Springer Berlin Heidelberg]
(2010) Simplified Self-Adapting Skip Lists
https://doi.org/10.1007/978-3-642-15381-5_16
[2] W. Pugh (1990) A skip list cookbook
https://www.semanticscholar.org/paper/A-skip-list-cookbook-Pugh/83ebde23871ce9f839ad82e0247a7481410f994e
[3] Hassan Adelyar. Sayed (2008) The Complexity of Splay Trees and Skip
Lists
https://www.semanticscholar.org/paper/The-Complexity-of-Splay-Trees-and-Skip-Lists-Sayed/a0b7d3662afa5b8906861d33f22fdd34fee8f402
>> The common operations on the cache are: (1) random element
>> lookups (element at point while user edits the Org buffer); (2)
>> iterating cache elements one-by-one to filter them (agenda views,
>> column views, sparse trees, etc); (3) Deleting or inserting a series of
>> adjacent elements upon file modification (normal user edits). (4)
>> Combination of cache iteration with adding/removing elements in-place
>> (agenda lookups when parts of the cache should be calculated as we find
>> that some elements are not yet in cache).
>
> (3) is the killer here since it rules out hash tables, or does it? Is maintaining element order a strict requirement? Otherwise just go with hash tables.
I think I need to explain the Org element cache structure a but.
The cache stores syntactic elements in Org buffers. For example:
* headline 1
** subheadline
* headline 2
will be represented in cache like the following:
(:begin 1 "headline 1" ...)
(:begin 13 "subheadline" ...)
(:begin 28 "headline 2" ...)
The :begin values are buffer positions where the headlines begin. They
are also used as data structure sorting keys to allow quick queries about
syntax element at point.
If the user edits the file:
* headline 1
** new
** subheadline
* headline 2
we have to insert a new element and shift the :begin keys of all the
elements below:
(:begin 1 "headline 1" ...)
(:begin 13 "new" ...)
(:begin 13+7 "subheadline" ...)
(:begin 28+7 "headline 2" ...)
As you can see, cache does not just have to handle frequent lookups and
modifications, but also support key modification. Hash tables cannot be
used.
The cache queries and element additions/deletions generally happen
interchangeably. A typical pattern happens when user types in buffer:
something like flyspell will query cache about syntax element at point
followed by file modification by user; then again query, modification,
...
>> - Insertion of 200k subsequent numbers (0, 1, 2, ...): 0.44 sec (often triggers garbage collection)
>
> Most data structures allow for fast insertion of consecutive elements as a special operation, typically in more-or-less linear time. I'm sure we could add it to the standard AVL implementation if it would make a difference.
Adding search fingers to built-in AVL trees will certainly make a
difference. From example above, you can see that typical pattern is
query+modification in certain narrow key range (in narrow region in
buffer). If Emacs AVL tree supported search fingers stable against
modifications, we could improve typical search time to O(1) and somewhat
improve insertion time.
> Since I'm not sure what your keys are (integers, strings, ...) nor their distribution (dense, sparse, clustered...) it's hard to recommend precise structures but some alternatives are:
We use integers (buffer positions). Distribution is unknown a priori,
but may not be uniform (at least, it is not uniform in my files).
> - red-black-trees: typically similar to AVL trees but often easier to implement
> - radix trees: very fast for dense or sparse-but-clustered integers
> - B-trees: often very fast because of their good locality, many variants
> - tries: many variants, typically tailored for a specific key type (eg ASCII strings).
I am currently thinking about splay trees because of their nice
performance for local queries. B-trees may be an option, but I do not
see how they would be advantageous compared to AVL trees. We do
everything in memory.
> Then there are implementation aspects to consider. For example, suppose you want to store three values in a tree node (value and left and right subtrees). You could use:
>
> - vector: [L R X]
> - list: (L R X)
> - improper list: (L R . X)
This is really counter-intuitive. I am wondering how much can be the
difference in practice. At least by orders of magnitude. Do you have
any examples?
> They all have their advantages and drawbacks, and you don't know which one is best until you measure. The result is often counter-intuitive. Also remember that Elisp has efficient boolean vectors, so if you need an extra bit per node it may be more economical to gather them all in one vector instead of wasting 8-16 bytes for each bit. Even bignums can be harnessed for the occasional profit.
I need to think about it. org-element.el is really not using memory
effectively. A typical syntax element is structured as a list:
(type (:prop1 val1 :prop2 val2 ... ))
with a dozen+ properties.
Accessing the element properties may take comparable time to the actual
data structure queries, but changing the element data structure may be
hard - there may be back-compatibility issues...
Best,
Ihor
next prev parent reply other threads:[~2022-01-23 12:44 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-12-23 11:07 bug#52753: 29.0.50; Printing long list-like structures fails Ihor Radchenko
2021-12-23 16:11 ` Mattias Engdegård
2021-12-24 10:56 ` Ihor Radchenko
2021-12-24 16:54 ` Mattias Engdegård
2021-12-25 11:15 ` Mattias Engdegård
2022-01-23 12:44 ` Ihor Radchenko [this message]
2022-01-23 16:28 ` Mattias Engdegård
2022-01-30 9:16 ` Ihor Radchenko
2022-01-30 10:16 ` Mattias Engdegård
2022-05-31 10:31 ` Mattias Engdegård
2022-06-02 13:12 ` Ihor Radchenko
2022-06-02 16:10 ` Mattias Engdegård
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=8735le7ha7.fsf@localhost \
--to=yantar92@gmail.com \
--cc=52753@debbugs.gnu.org \
--cc=mattiase@acm.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this external index
https://git.savannah.gnu.org/cgit/emacs.git
https://git.savannah.gnu.org/cgit/emacs/org-mode.git
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.