bug#69188: 30.0.50; project-files + project-find-file is slow in large repositories

all messages for Emacs-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed

From: Dmitry Gutov <dmitry@gutov.dev>
To: Spencer Baugh <sbaugh@janestreet.com>, 69188@debbugs.gnu.org
Subject: bug#69188: 30.0.50; project-files + project-find-file is slow in large repositories
Date: Sat, 13 Apr 2024 05:34:18 +0300	[thread overview]
Message-ID: <1b566e9e-eca5-4746-8e31-4155d35ce7a8@gutov.dev> (raw)
In-Reply-To: <iera5o11gnh.fsf@igm-qws-u22796a.mail-host-address-is-not-set>

Hi Spencer,

Sorry about the wait.

On 16/02/2024 00:55, Spencer Baugh wrote:
> 
> (project-files (project-current)) takes around 1 second in Linux (80k
> files) and 7 seconds in my larger (500k file) repository.
> 
> With this patch:
> diff --git a/lisp/progmodes/project.el b/lisp/progmodes/project.el
> index c7c07c3d34c..037beaa835a 100644
> --- a/lisp/progmodes/project.el
> +++ b/lisp/progmodes/project.el
> @@ -667,12 +667,15 @@
>                                                 (setq i (concat i "**"))))
>                                           i)))
>                                      extra-ignores)))))
> -       (setq files
> -             (mapcar
> -              (lambda (file) (concat default-directory file))
> -              (split-string
> -               (apply #'vc-git--run-command-string nil "ls-files" args)
> -               "\0" t)))
> +       (with-temp-buffer
> +         (let ((ok (apply #'vc-git--out-ok "ls-files" args))
> +               (pt (point-min)))
> +           (unless ok
> +             (error "File listing failed: %s" (buffer-string)))
> +           (goto-char pt)
> +           (while (search-forward "\0" nil t)
> +             (push (concat default-directory (buffer-substring-no-properties pt (1- (point)))) files)
> +             (setq pt (point)))))
>          (when (project--vc-merge-submodules-p default-directory)
>            ;; Unfortunately, 'ls-files --recurse-submodules' conflicts with '-o'.
>            (let* ((submodules (project--git-submodules))
> 
> project-files in Linux takes around .75 seconds.

The patch makes sense (and the approach works okay in 
project--files-in-directory), so this is something I've made a few 
attempts to use in the past.

However, the measurements on my machine show a much smaller improvement 
-- just 3-4%. I.e. if I just evaluate the functions interpreted or run 
them just once, the variations between the runs far exceed the 
difference in runtimes (around ~450ms with a Linux repository checkout 
from 2021, 70k files).

A stricter comparison works out like this:

1. Apply the patch (or not),
2. M-x byte-compile-file
3. (load "project.elc")
4. (benchmark-run 10 (project-files (project-current)))

When run these in my working session one after another, the 10 iteration 
benchmark works out to 4.09s vs 3.93s (master vs your patch).

(4.093848777 44 1.6119981489999944)

vs

(3.9392906549999998 41 1.499010061)

With 'emacs -Q', however, it's vice versa:

(3.777694389 130 1.2422826310000001)

vs

(3.889905663 165 1.46846598)

It seems like, maybe, the longer running session is more sensitive to 
the allocation of the initial long string than the fresh session.

In any case, I don't mind switching to the other approach. Just 
wondering where the difference between our machines might come from.

Last but not least, when/if we apply this, we should keep the fix for 
bug#66806 in there. Good news is it doesn't seem to affect performance.

> If I further remove the (concat default-directory ...) around each file,
> it speeds up to .5 seconds.
> 
> (Note that git ls-files itself takes only around 20 milliseconds)
> 
> My large repository (which uses Mercurial) has a custom project-files
> which is basically:
> 
> (with-temp-buffer
>    (unless (zerop (apply #'call-process "rhg" nil t nil "files"))
>      (error "File listing failed: %s" (buffer-string)))
>    (goto-char (point-min))
>    (let ((pt (point))
>          res)
>      (while (search-forward "\n" nil t)
>        (push (file-name-concat default-directory (buffer-substring-no-properties pt (1- (point)))) res)
>        (setq pt (point)))
>      res))
> 
> Likewise, removing the (concat default-directory ...) speeds my
> project-files up from 7 seconds to 4.5 seconds.
> 
> This is especially silly because project-find-file then just removes
> this default-directory again from all the files, which has yet more
> overhead.

This is, indeed, something that should show a universal improvement. 
Around 20% here with the Linux repository test.

> My proposal: Could we find a way to make the default-directory not
> necessary for the files returned from project-files?
> 
> Perhaps project-files could be allowed to return relative file paths
> which are relative to the project root.  Then in the common case where
> all the files are within the project root, project-find-file would be
> way faster.  Happy to implement this, if it makes sense.

Yep, that should make sense. Originally the idea was to keep it more 
universal so that lists of files coming from the "external roots" could 
be handled the same way (used in the two *-or-external-* commands).

But indeed it's the relatively rare case, so it'd be better to avoid 
paying the performance penalty, especially when the subsequent handling 
could do without the added prefix. And even the "rare case" could be 
split into separate calls instead of having all files returned at once.

My main concern is backward compatibility, so that 3rd party callers 
don't break after the update.

I think there are basically two approaches:
- A new devar like project-use-relative-names,
- Or a new argument for 'project-files', e.g. called RELATIVE.

Both options are relatively clunky, and the second one might also fail 
to work when DIRS is non-nil (or would have to fall back to absolute 
names anyway), so I'm leaning toward the first one. It might also allow 
certain code to be written supporting both relative and absolute names 
(e.g. a process call both binds default-directory to root and keeps the 
file names as-is -- the relative ones would be interpreted as such, the 
rest just as they are interpreted now).

Both project-find-file and project-find-regexp should be able to 
benefit. Although the former might require a bigger update, given that 
the current project-read-file-name-function options don't expect 
relative names. Ideally we'd have a smoother migration for custom 
p-r-f-n-f functions, but I don't have any good ideas there.

> Another optimization I've considered: We could run the process
> asynchronously so project-files parsing can be parallel with the
> process; but the process is usually very fast anyway, that's not most of
> the overhead, so that won't be a big win.

Right. This came up in bug#64735, and together with patch in bug#66020 
the asynchronous file listing can run a bit faster than the synchronous 
implementation.

I'm guessing the difference won't be huge in your case, since either way 
most time remains spent in Lisp code and GC. But if we take advantage of 
this by improving the UIs at the same time, this can be a real win.

This should go into a separate discussion, I think, but to quickly sum 
up my thinking on the subject:

- Ideally project-files implementations for sync and async UIs should 
always look the same. Hopefully the "async" implementation looks the 
same or almost the same as the "sync" one. Threads might help.
- project-find-regexp could benefit from this a lot, first by running 
the search in parallel to the file listing, and second by showing the 
results right away (the current advantage of 'M-x grep'). The difficult 
part is have the "async" Xref interface as well (can we do this without 
extending the current one? probably not). The UI also needs to have some 
"running ..." indicator, as well as a way to abort the search, killing 
both processes - that adds requirements to "async Xref" as well.

> However, that would make it easy for project-files as a whole to be
> asynchronous.  Then that would allow project-find-file to start the
> listing in the background, and then we'd write a completion table which
> completes only over whatever files we've already read into Emacs.  I
> think this would be a lot nicer for most use-cases, and I'd again be
> happy to implement this.

Could this be that simple?

Whatever the source of the file listing, as soon as the UI (or 
completion styles) calls try-completion or all-completions, the search 
has to finish first, shouldn't it? That seems like the semantics of this 
API. Or if perhaps we allow it to operate on incomplete results, how 
would we indicate to the user at the end that the scan has finished, and 
they can press TAB once more to refresh the results? Or perhaps to be 
able to find a file they hadn't managed to find in the incomplete set.

This seems like it might require both a new UI and an extension of 
completion table API. E.g. in certain cases we could say that we only 
need N matches, so if the current incomplete set can provide as many, we 
don't have to wait until the end. But 'try-completion' would become 
unreliable either way.

Even if keeping to the most conservative approach, though, it should be 
possible to at least render the prompt before the file listing is 
finished. That could make the UI look a bit more responsive.

next prev parent reply	other threads:[~2024-04-13  2:34 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-02-15 22:55 bug#69188: 30.0.50; project-files + project-find-file is slow in large repositories Spencer Baugh
2024-02-18 18:56 ` bug#69233: " Eli Zaretskii
2024-02-18 19:42   ` Dmitry Gutov
2024-02-18 19:45     ` Eli Zaretskii
2024-02-18 20:11       ` Dmitry Gutov
2024-02-18 20:18         ` Eli Zaretskii
2024-02-23 21:34           ` bug#69233: bug#69188: " Spencer Baugh
2024-04-13  2:34 ` Dmitry Gutov [this message]
2024-04-16 23:48   ` Dmitry Gutov
2024-04-29 20:27     ` bug#69188: bug#69233: " Spencer Baugh
2024-05-05  0:29       ` Dmitry Gutov
2024-04-29 21:04   ` Spencer Baugh
2024-05-05  3:32     ` Dmitry Gutov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1b566e9e-eca5-4746-8e31-4155d35ce7a8@gutov.dev \
    --to=dmitry@gutov.dev \
    --cc=69188@debbugs.gnu.org \
    --cc=sbaugh@janestreet.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/emacs.git
	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.