On Aug 18, 2023, at 1:21 AM, Yuan Fu <casouri@gmail.com> wrote:


to go to the parent node from a child node, what tree-sitter actually does is go down from the root node until it hits the parent node. This process is linear to the height of the tree.

Do you mean linear in the width of the tree?  I’ve color-coded the plot by tree depth (i.e. how many ancestors a given node has back to root).  Or maybe you are thinking of the tree lying on its side (like vundo ;)?

Going down from the root node is proportion to the height of the tree, no?


Sure, but as you can see from the plot, depth (“height”) is subdominant to the starting line number.  To me that says the width/breadth of the tree must also matter a lot here. 


Also, getting the node at point isn’t free either. To get the node at point, we actually iterates from the first child node of the root node until reaching one that contains the point, then iterate from the first child node of that node until reaching one that contains the point, etc, until we reach a leaf node. So log(N) time complexity is expected.

I tested node-at-point on the same file, and it is quite fast, with worst case performance only 30µs (vs. 3ms for the full parent navigation), and growing very slowly, between sqrt(log(N)) and log(N).  Check the gist for a new figure.

Unless I am misunderstanding, for the common case of finding parent of the node at point, it seems the algorithm you describe could be tweaked to work well.  I.e. instead of "stop when reaching a leaf node containing point", just "stop when you reach a node containing point who has the original node as a child”?   This should give (hypothetical) “parent-of-node-at-point” the same great speed as node-at-point.  

I should’ve been more clear. For finding the parent of a node x, we stop at the parent of x. We only go to the leaf node when finding the node at point, which always returns a leaf node.

Thanks for the clarification. But does parent of x make use of the positional information node’s store, or only look for “some node who has a matching child x”?  I understood the latter, which would make sense why finding a node’s parent is more expensive than finding mode at point.   But this post seems to indicate it does use the bounds to accelerate all searches:


So I added a plot for the ratio in time taken to find a node’s (single) direct parent, and find the node at point.  Parent is more expensive, and the ratio does grow to about 10 at high line number.  But better than the 100x growth in time-to-root.  So I’d conclude some of the increase in the navigate to root time with line is due to the natural growing tree depth with line number, and some is just being at the “end” of a long list of children at many levels.  That said, I can’t understand why finding one parent is not just as cheap as finding a leaf node at point. Maybe I’ll take it up on treesitter to confirm.


Then “parent-of-node-at-point-until” could do something quite clever: accumulate parent nodes all the way from root to the child’s direct parent into a list (same low cost, modulo the node storage).  Then run the predicate on that list (in reverse), returning the first match.  Could of course return that full list too (“ancestors-of-node-at-point”), for other uses.  These should all be quite fast compared to a full breadth and depth searching of every nook and cranny in the syntax tree that node-parent seems to do (having no positional information to wield).

Sure, there hasn’t been a clever version because so far no one have complained they can’t find the parent of a node fast enough. Also I doubt the amount of gain we can get with a more clever algorithm. You can try implement one and benchmark it. I’m curious to know the result :-)

Yeah it seems this is baked into the design of TS.


I’m not too worried tho, because IIRC the absolute time is very short. The 100x variability doesn’t mean much if the 100x is still very fast.


This is on a brand new fast machine, and 3ms is pretty slow for things that need to run dozens of times per keystroke (e.g. font-lock).

Font-lock doesn’t need to find the parent of a node. So that hasn’t been a problem. In use-cases where finding the parent of a node is useful, 3ms hasn’t been a problem AFAIK. (Well, I guess indentation could benefit from a faster parent-finding function.)

And any algorithm that responds to containing syntactic units. For example, I’m working on my indent-bars package to allow it to find tree-sitter “containing nodes of certain types” to alter the bar depth, inside lists, function parameters/argument, etc.


These are fundamental limits of tree-sitter, unless it changes its data structure.

That is an interesting limitation for sure.  It seems that perhaps each node's start..end information can be really helpful here for winnowing the tree, when you are mostly concerned about nodes relevant to and covering particular positions in the buffer.

BTW, https://tree-sitter.github.io/tree-sitter/using-parsers says:
   •
A TSNode represents a single node in the syntax tree. It tracks its start and end positions in the source code, as well as its relation to other nodes like its parent, siblings and children.

I’m not entirely sure what exactly do you have in mind. A node’s start and end position doesn’t really help us finding it. Suppose you have an array of numbers, and you know one of the numbers is 3. How do you find the index of the number 3 in the array?

This was to some degree my confusion over whether node-at-point and node-parent are really that different in terms of how they search the tree.  But the start/end very much do help as you described in the node-at-point algorithm, and as mentioned in the post above:

it does pretty effective stepping down over a tree to a node by using checks of predecessor ranges

I.e. you can skip searching entire branches that do not contain node’s start..end range.

While the documentation say it tracks its relation to other nodes, I think it’s more pedagogical than factual. Behind the scenes, tree-sitter’s own node API does the same thing as I described: it goes from the root node and traverses down.

I see, kind of an interesting choice to call that “tracking”. 

Thanks again for the insights.