unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed
* bug#69188: 30.0.50; project-files + project-find-file is slow in large repositories
@ 2024-02-15 22:55 Spencer Baugh
  2024-02-18 18:56 ` bug#69233: " Eli Zaretskii
  2024-04-13  2:34 ` Dmitry Gutov
  0 siblings, 2 replies; 13+ messages in thread
From: Spencer Baugh @ 2024-02-15 22:55 UTC (permalink / raw)
  To: 69188; +Cc: Dmitry Gutov


(project-files (project-current)) takes around 1 second in Linux (80k
files) and 7 seconds in my larger (500k file) repository.

With this patch:
diff --git a/lisp/progmodes/project.el b/lisp/progmodes/project.el
index c7c07c3d34c..037beaa835a 100644
--- a/lisp/progmodes/project.el
+++ b/lisp/progmodes/project.el
@@ -667,12 +667,15 @@
                                               (setq i (concat i "**"))))
                                         i)))
                                    extra-ignores)))))
-       (setq files
-             (mapcar
-              (lambda (file) (concat default-directory file))
-              (split-string
-               (apply #'vc-git--run-command-string nil "ls-files" args)
-               "\0" t)))
+       (with-temp-buffer
+         (let ((ok (apply #'vc-git--out-ok "ls-files" args))
+               (pt (point-min)))
+           (unless ok
+             (error "File listing failed: %s" (buffer-string)))
+           (goto-char pt)
+           (while (search-forward "\0" nil t)
+             (push (concat default-directory (buffer-substring-no-properties pt (1- (point)))) files)
+             (setq pt (point)))))
        (when (project--vc-merge-submodules-p default-directory)
          ;; Unfortunately, 'ls-files --recurse-submodules' conflicts with '-o'.
          (let* ((submodules (project--git-submodules))

project-files in Linux takes around .75 seconds.

If I further remove the (concat default-directory ...) around each file,
it speeds up to .5 seconds.

(Note that git ls-files itself takes only around 20 milliseconds)

My large repository (which uses Mercurial) has a custom project-files
which is basically:

(with-temp-buffer
  (unless (zerop (apply #'call-process "rhg" nil t nil "files"))
    (error "File listing failed: %s" (buffer-string)))
  (goto-char (point-min))
  (let ((pt (point))
        res)
    (while (search-forward "\n" nil t)
      (push (file-name-concat default-directory (buffer-substring-no-properties pt (1- (point)))) res)
      (setq pt (point)))
    res))

Likewise, removing the (concat default-directory ...) speeds my
project-files up from 7 seconds to 4.5 seconds.

This is especially silly because project-find-file then just removes
this default-directory again from all the files, which has yet more
overhead.

My proposal: Could we find a way to make the default-directory not
necessary for the files returned from project-files?

Perhaps project-files could be allowed to return relative file paths
which are relative to the project root.  Then in the common case where
all the files are within the project root, project-find-file would be
way faster.  Happy to implement this, if it makes sense.

Another optimization I've considered: We could run the process
asynchronously so project-files parsing can be parallel with the
process; but the process is usually very fast anyway, that's not most of
the overhead, so that won't be a big win.

However, that would make it easy for project-files as a whole to be
asynchronous.  Then that would allow project-find-file to start the
listing in the background, and then we'd write a completion table which
completes only over whatever files we've already read into Emacs.  I
think this would be a lot nicer for most use-cases, and I'd again be
happy to implement this.

Also happy to implement any other optimizations you think might make
sense.


In GNU Emacs 30.0.50 (build 37, x86_64-pc-linux-gnu, X toolkit, cairo
 version 1.15.12, Xaw scroll bars) of 2024-02-13 built on
 igm-qws-u22796a
Repository revision: a24a2b1ceb12f11c9d345190fbf554f27c4ec186
Repository branch: master
Windowing system distributor 'The X.Org Foundation', version 11.0.12011000
System Description: Rocky Linux 8.9 (Green Obsidian)

Configured using:
 'configure -C --with-x-toolkit=lucid 'CFLAGS=-O0 -g3'
 --without-native-compilation --without-gif'

Configured features:
CAIRO DBUS FREETYPE GLIB GMP GNUTLS GSETTINGS HARFBUZZ JPEG JSON
LIBSELINUX LIBSYSTEMD LIBXML2 MODULES NOTIFY INOTIFY PDUMPER PNG RSVG
SECCOMP SOUND SQLITE3 THREADS TIFF TOOLKIT_SCROLL_BARS X11 XDBE XIM
XINPUT2 XPM LUCID ZLIB

Important settings:
  value of $LANG: en_US.UTF-8
  locale-coding-system: utf-8-unix

Major mode: Lisp Interaction

Minor modes in effect:
  tooltip-mode: t
  global-eldoc-mode: t
  eldoc-mode: t
  show-paren-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  minibuffer-regexp-mode: t
  line-number-mode: t
  indent-tabs-mode: t
  transient-mark-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t

Load-path shadows:
None found.

Features:
(shadow sort mail-extr emacsbug message mailcap yank-media puny dired
dired-loaddefs rfc822 mml mml-sec password-cache epa derived epg rfc6068
epg-config gnus-util text-property-search time-date subr-x mm-decode
mm-bodies mm-encode mail-parse rfc2231 mailabbrev gmm-utils mailheader
cl-loaddefs cl-lib sendmail rfc2047 rfc2045 ietf-drums mm-util
mail-prsvr mail-utils rmc iso-transl tooltip cconv eldoc paren electric
uniquify ediff-hook vc-hooks lisp-float-type elisp-mode mwheel
term/x-win x-win term/common-win x-dnd touch-screen tool-bar dnd fontset
image regexp-opt fringe tabulated-list replace newcomment text-mode
lisp-mode prog-mode register page tab-bar menu-bar rfn-eshadow isearch
easymenu timer select scroll-bar mouse jit-lock font-lock syntax
font-core term/tty-colors frame minibuffer nadvice seq simple cl-generic
indonesian philippine cham georgian utf-8-lang misc-lang vietnamese
tibetan thai tai-viet lao korean japanese eucjp-ms cp51932 hebrew greek
romanian slovak czech european ethiopic indian cyrillic chinese
composite emoji-zwj charscript charprop case-table epa-hook
jka-cmpr-hook help abbrev obarray oclosure cl-preloaded button loaddefs
theme-loaddefs faces cus-face macroexp files window text-properties
overlay sha1 md5 base64 format env code-pages mule custom widget keymap
hashtable-print-readable backquote threads dbusbind inotify
dynamic-setting system-font-setting font-render-setting cairo x-toolkit
xinput2 x multi-tty move-toolbar make-network-process emacs)

Memory information:
((conses 16 65052 9318) (symbols 48 9539 0) (strings 32 22452 1449)
 (string-bytes 1 659675) (vectors 16 9245)
 (vector-slots 8 111110 9295) (floats 8 40 17) (intervals 56 262 0)
 (buffers 976 10))





^ permalink raw reply related	[flat|nested] 13+ messages in thread

* bug#69233: 30.0.50; project-files + project-find-file is slow in large repositories
  2024-02-15 22:55 bug#69188: 30.0.50; project-files + project-find-file is slow in large repositories Spencer Baugh
@ 2024-02-18 18:56 ` Eli Zaretskii
  2024-02-18 19:42   ` Dmitry Gutov
  2024-04-13  2:34 ` Dmitry Gutov
  1 sibling, 1 reply; 13+ messages in thread
From: Eli Zaretskii @ 2024-02-18 18:56 UTC (permalink / raw)
  To: Spencer Baugh; +Cc: dmitry, 69233

merge 69233 69188
thanks

> Cc: Dmitry Gutov <dmitry@gutov.dev>
> From: Spencer Baugh <sbaugh@janestreet.com>
> Date: Thu, 15 Feb 2024 17:55:46 -0500
> 
> 
> (project-files (project-current)) takes around 1 second in Linux (80k
> files) and 7 seconds in my larger (500k file) repository.

This is a duplicate of another bug report you submitted not long ago.





^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#69233: 30.0.50; project-files + project-find-file is slow in large repositories
  2024-02-18 18:56 ` bug#69233: " Eli Zaretskii
@ 2024-02-18 19:42   ` Dmitry Gutov
  2024-02-18 19:45     ` Eli Zaretskii
  0 siblings, 1 reply; 13+ messages in thread
From: Dmitry Gutov @ 2024-02-18 19:42 UTC (permalink / raw)
  To: Eli Zaretskii, Spencer Baugh; +Cc: 69233

On 18/02/2024 20:56, Eli Zaretskii wrote:
> This is a duplicate of another bug report you submitted not long ago.

Any reason I didn't receive the first one to my inbox?





^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#69233: 30.0.50; project-files + project-find-file is slow in large repositories
  2024-02-18 19:42   ` Dmitry Gutov
@ 2024-02-18 19:45     ` Eli Zaretskii
  2024-02-18 20:11       ` Dmitry Gutov
  0 siblings, 1 reply; 13+ messages in thread
From: Eli Zaretskii @ 2024-02-18 19:45 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: sbaugh, 69233

> Date: Sun, 18 Feb 2024 21:42:37 +0200
> Cc: 69233@debbugs.gnu.org
> From: Dmitry Gutov <dmitry@gutov.dev>
> 
> On 18/02/2024 20:56, Eli Zaretskii wrote:
> > This is a duplicate of another bug report you submitted not long ago.
> 
> Any reason I didn't receive the first one to my inbox?

I don't have the foggiest, sorry.





^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#69233: 30.0.50; project-files + project-find-file is slow in large repositories
  2024-02-18 19:45     ` Eli Zaretskii
@ 2024-02-18 20:11       ` Dmitry Gutov
  2024-02-18 20:18         ` Eli Zaretskii
  0 siblings, 1 reply; 13+ messages in thread
From: Dmitry Gutov @ 2024-02-18 20:11 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: sbaugh, 69233

On 18/02/2024 21:45, Eli Zaretskii wrote:
>> Date: Sun, 18 Feb 2024 21:42:37 +0200
>> Cc:69233@debbugs.gnu.org
>> From: Dmitry Gutov<dmitry@gutov.dev>
>>
>> On 18/02/2024 20:56, Eli Zaretskii wrote:
>>> This is a duplicate of another bug report you submitted not long ago.
>> Any reason I didn't receive the first one to my inbox?
> I don't have the foggiest, sorry.

It seems Spencer didn't get the confirmation email either, or he 
wouldn't resubmit.





^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#69233: 30.0.50; project-files + project-find-file is slow in large repositories
  2024-02-18 20:11       ` Dmitry Gutov
@ 2024-02-18 20:18         ` Eli Zaretskii
  2024-02-23 21:34           ` bug#69233: bug#69188: " Spencer Baugh
  0 siblings, 1 reply; 13+ messages in thread
From: Eli Zaretskii @ 2024-02-18 20:18 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: sbaugh, 69233

> Date: Sun, 18 Feb 2024 22:11:43 +0200
> Cc: sbaugh@janestreet.com, 69233@debbugs.gnu.org
> From: Dmitry Gutov <dmitry@gutov.dev>
> 
> On 18/02/2024 21:45, Eli Zaretskii wrote:
> >> Date: Sun, 18 Feb 2024 21:42:37 +0200
> >> Cc:69233@debbugs.gnu.org
> >> From: Dmitry Gutov<dmitry@gutov.dev>
> >>
> >> On 18/02/2024 20:56, Eli Zaretskii wrote:
> >>> This is a duplicate of another bug report you submitted not long ago.
> >> Any reason I didn't receive the first one to my inbox?
> > I don't have the foggiest, sorry.
> 
> It seems Spencer didn't get the confirmation email either, or he 
> wouldn't resubmit.

One can know if debbugs received a report via the Web interface.





^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#69233: bug#69188: 30.0.50; project-files + project-find-file is slow in large repositories
  2024-02-18 20:18         ` Eli Zaretskii
@ 2024-02-23 21:34           ` Spencer Baugh
  0 siblings, 0 replies; 13+ messages in thread
From: Spencer Baugh @ 2024-02-23 21:34 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Dmitry Gutov, 69233, 69188

Eli Zaretskii <eliz@gnu.org> writes:
>> Date: Sun, 18 Feb 2024 22:11:43 +0200
>> Cc: sbaugh@janestreet.com, 69233@debbugs.gnu.org
>> From: Dmitry Gutov <dmitry@gutov.dev>
>> 
>> On 18/02/2024 21:45, Eli Zaretskii wrote:
>> >> Date: Sun, 18 Feb 2024 21:42:37 +0200
>> >> Cc:69233@debbugs.gnu.org
>> >> From: Dmitry Gutov<dmitry@gutov.dev>
>> >>
>> >> On 18/02/2024 20:56, Eli Zaretskii wrote:
>> >>> This is a duplicate of another bug report you submitted not long ago.
>> >> Any reason I didn't receive the first one to my inbox?
>> > I don't have the foggiest, sorry.
>> 
>> It seems Spencer didn't get the confirmation email either, or he 
>> wouldn't resubmit.
>
> One can know if debbugs received a report via the Web interface.

Yes, it seems that all my email was backed up for a day or so, for
whatever reason.  Sorry for the noise.

(Or maybe I just think this is such an important bug that I submitted it
twice :) )





^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#69188: 30.0.50; project-files + project-find-file is slow in large repositories
  2024-02-15 22:55 bug#69188: 30.0.50; project-files + project-find-file is slow in large repositories Spencer Baugh
  2024-02-18 18:56 ` bug#69233: " Eli Zaretskii
@ 2024-04-13  2:34 ` Dmitry Gutov
  2024-04-16 23:48   ` Dmitry Gutov
  2024-04-29 21:04   ` Spencer Baugh
  1 sibling, 2 replies; 13+ messages in thread
From: Dmitry Gutov @ 2024-04-13  2:34 UTC (permalink / raw)
  To: Spencer Baugh, 69188

Hi Spencer,

Sorry about the wait.

On 16/02/2024 00:55, Spencer Baugh wrote:
> 
> (project-files (project-current)) takes around 1 second in Linux (80k
> files) and 7 seconds in my larger (500k file) repository.
> 
> With this patch:
> diff --git a/lisp/progmodes/project.el b/lisp/progmodes/project.el
> index c7c07c3d34c..037beaa835a 100644
> --- a/lisp/progmodes/project.el
> +++ b/lisp/progmodes/project.el
> @@ -667,12 +667,15 @@
>                                                 (setq i (concat i "**"))))
>                                           i)))
>                                      extra-ignores)))))
> -       (setq files
> -             (mapcar
> -              (lambda (file) (concat default-directory file))
> -              (split-string
> -               (apply #'vc-git--run-command-string nil "ls-files" args)
> -               "\0" t)))
> +       (with-temp-buffer
> +         (let ((ok (apply #'vc-git--out-ok "ls-files" args))
> +               (pt (point-min)))
> +           (unless ok
> +             (error "File listing failed: %s" (buffer-string)))
> +           (goto-char pt)
> +           (while (search-forward "\0" nil t)
> +             (push (concat default-directory (buffer-substring-no-properties pt (1- (point)))) files)
> +             (setq pt (point)))))
>          (when (project--vc-merge-submodules-p default-directory)
>            ;; Unfortunately, 'ls-files --recurse-submodules' conflicts with '-o'.
>            (let* ((submodules (project--git-submodules))
> 
> project-files in Linux takes around .75 seconds.

The patch makes sense (and the approach works okay in 
project--files-in-directory), so this is something I've made a few 
attempts to use in the past.

However, the measurements on my machine show a much smaller improvement 
-- just 3-4%. I.e. if I just evaluate the functions interpreted or run 
them just once, the variations between the runs far exceed the 
difference in runtimes (around ~450ms with a Linux repository checkout 
from 2021, 70k files).

A stricter comparison works out like this:

1. Apply the patch (or not),
2. M-x byte-compile-file
3. (load "project.elc")
4. (benchmark-run 10 (project-files (project-current)))

When run these in my working session one after another, the 10 iteration 
benchmark works out to 4.09s vs 3.93s (master vs your patch).

(4.093848777 44 1.6119981489999944)

vs

(3.9392906549999998 41 1.499010061)

With 'emacs -Q', however, it's vice versa:

(3.777694389 130 1.2422826310000001)

vs

(3.889905663 165 1.46846598)

It seems like, maybe, the longer running session is more sensitive to 
the allocation of the initial long string than the fresh session.

In any case, I don't mind switching to the other approach. Just 
wondering where the difference between our machines might come from.

Last but not least, when/if we apply this, we should keep the fix for 
bug#66806 in there. Good news is it doesn't seem to affect performance.

> If I further remove the (concat default-directory ...) around each file,
> it speeds up to .5 seconds.
> 
> (Note that git ls-files itself takes only around 20 milliseconds)
> 
> My large repository (which uses Mercurial) has a custom project-files
> which is basically:
> 
> (with-temp-buffer
>    (unless (zerop (apply #'call-process "rhg" nil t nil "files"))
>      (error "File listing failed: %s" (buffer-string)))
>    (goto-char (point-min))
>    (let ((pt (point))
>          res)
>      (while (search-forward "\n" nil t)
>        (push (file-name-concat default-directory (buffer-substring-no-properties pt (1- (point)))) res)
>        (setq pt (point)))
>      res))
> 
> Likewise, removing the (concat default-directory ...) speeds my
> project-files up from 7 seconds to 4.5 seconds.
> 
> This is especially silly because project-find-file then just removes
> this default-directory again from all the files, which has yet more
> overhead.

This is, indeed, something that should show a universal improvement. 
Around 20% here with the Linux repository test.

> My proposal: Could we find a way to make the default-directory not
> necessary for the files returned from project-files?
> 
> Perhaps project-files could be allowed to return relative file paths
> which are relative to the project root.  Then in the common case where
> all the files are within the project root, project-find-file would be
> way faster.  Happy to implement this, if it makes sense.

Yep, that should make sense. Originally the idea was to keep it more 
universal so that lists of files coming from the "external roots" could 
be handled the same way (used in the two *-or-external-* commands).

But indeed it's the relatively rare case, so it'd be better to avoid 
paying the performance penalty, especially when the subsequent handling 
could do without the added prefix. And even the "rare case" could be 
split into separate calls instead of having all files returned at once.

My main concern is backward compatibility, so that 3rd party callers 
don't break after the update.

I think there are basically two approaches:
- A new devar like project-use-relative-names,
- Or a new argument for 'project-files', e.g. called RELATIVE.

Both options are relatively clunky, and the second one might also fail 
to work when DIRS is non-nil (or would have to fall back to absolute 
names anyway), so I'm leaning toward the first one. It might also allow 
certain code to be written supporting both relative and absolute names 
(e.g. a process call both binds default-directory to root and keeps the 
file names as-is -- the relative ones would be interpreted as such, the 
rest just as they are interpreted now).

Both project-find-file and project-find-regexp should be able to 
benefit. Although the former might require a bigger update, given that 
the current project-read-file-name-function options don't expect 
relative names. Ideally we'd have a smoother migration for custom 
p-r-f-n-f functions, but I don't have any good ideas there.

> Another optimization I've considered: We could run the process
> asynchronously so project-files parsing can be parallel with the
> process; but the process is usually very fast anyway, that's not most of
> the overhead, so that won't be a big win.

Right. This came up in bug#64735, and together with patch in bug#66020 
the asynchronous file listing can run a bit faster than the synchronous 
implementation.

I'm guessing the difference won't be huge in your case, since either way 
most time remains spent in Lisp code and GC. But if we take advantage of 
this by improving the UIs at the same time, this can be a real win.

This should go into a separate discussion, I think, but to quickly sum 
up my thinking on the subject:

- Ideally project-files implementations for sync and async UIs should 
always look the same. Hopefully the "async" implementation looks the 
same or almost the same as the "sync" one. Threads might help.
- project-find-regexp could benefit from this a lot, first by running 
the search in parallel to the file listing, and second by showing the 
results right away (the current advantage of 'M-x grep'). The difficult 
part is have the "async" Xref interface as well (can we do this without 
extending the current one? probably not). The UI also needs to have some 
"running ..." indicator, as well as a way to abort the search, killing 
both processes - that adds requirements to "async Xref" as well.

> However, that would make it easy for project-files as a whole to be
> asynchronous.  Then that would allow project-find-file to start the
> listing in the background, and then we'd write a completion table which
> completes only over whatever files we've already read into Emacs.  I
> think this would be a lot nicer for most use-cases, and I'd again be
> happy to implement this.

Could this be that simple?

Whatever the source of the file listing, as soon as the UI (or 
completion styles) calls try-completion or all-completions, the search 
has to finish first, shouldn't it? That seems like the semantics of this 
API. Or if perhaps we allow it to operate on incomplete results, how 
would we indicate to the user at the end that the scan has finished, and 
they can press TAB once more to refresh the results? Or perhaps to be 
able to find a file they hadn't managed to find in the incomplete set.

This seems like it might require both a new UI and an extension of 
completion table API. E.g. in certain cases we could say that we only 
need N matches, so if the current incomplete set can provide as many, we 
don't have to wait until the end. But 'try-completion' would become 
unreliable either way.

Even if keeping to the most conservative approach, though, it should be 
possible to at least render the prompt before the file listing is 
finished. That could make the UI look a bit more responsive.





^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#69188: 30.0.50; project-files + project-find-file is slow in large repositories
  2024-04-13  2:34 ` Dmitry Gutov
@ 2024-04-16 23:48   ` Dmitry Gutov
  2024-04-29 20:27     ` bug#69188: bug#69233: " Spencer Baugh
  2024-04-29 21:04   ` Spencer Baugh
  1 sibling, 1 reply; 13+ messages in thread
From: Dmitry Gutov @ 2024-04-16 23:48 UTC (permalink / raw)
  To: Spencer Baugh, 69188

[-- Attachment #1: Type: text/plain, Size: 812 bytes --]

On 13/04/2024 05:34, Dmitry Gutov wrote:
> Both options are relatively clunky, and the second one might also fail 
> to work when DIRS is non-nil (or would have to fall back to absolute 
> names anyway), so I'm leaning toward the first one. It might also allow 
> certain code to be written supporting both relative and absolute names 
> (e.g. a process call both binds default-directory to root and keeps the 
> file names as-is -- the relative ones would be interpreted as such, the 
> rest just as they are interpreted now).

Here's how that change can look.

The patch should demonstrate both the performance improvements for 
project-find-file and project-find-regexp, and some awkwardness in the 
implementation, chiefly due to backward compatibility.

Guess more tests will be required, at the very least.

[-- Attachment #2: project-files-relative-names.diff --]
[-- Type: text/x-patch, Size: 6189 bytes --]

diff --git a/lisp/progmodes/project.el b/lisp/progmodes/project.el
index 000a05804a8..567a25e0906 100644
--- a/lisp/progmodes/project.el
+++ b/lisp/progmodes/project.el
@@ -323,6 +323,12 @@ project--file-completion-table
 (cl-defmethod project-root ((project (head transient)))
   (cdr project))
 
+(defvar project-files-relative-names nil
+  "When non-nil, `project-files' is allowed to return relative names.
+The names will be relative to the project root.  And this can only
+happen when all returned files are in the same directory. Meaning, the
+DIRS argument has to be nil or have only one element.")
+
 (cl-defgeneric project-files (project &optional dirs)
   "Return a list of files in directories DIRS in PROJECT.
 DIRS is a list of absolute directories; it should be some
@@ -380,8 +386,10 @@ project--files-in-directory
                 res)
           (setq pt (point)))))
     (project--remote-file-names
-     (mapcar (lambda (s) (concat dfn s))
-             (sort res #'string<)))))
+     (if project-files-relative-names
+         (sort res #'string<)
+       (mapcar (lambda (s) (concat dfn s))
+               (sort res #'string<))))))
 
 (defun project--remote-file-names (local-files)
   "Return LOCAL-FILES as if they were on the system of `default-directory'.
@@ -689,7 +697,9 @@ project--vc-list-files
                    (mapcar
                     (lambda (file)
                       (unless (member file submodules)
-                        (concat default-directory file)))
+                        (if project-files-relative-names
+                            file
+                          (concat default-directory file))))
                     (split-string
                      (apply #'vc-git--run-command-string nil "ls-files" args)
                      "\0" t))))
@@ -716,7 +726,8 @@ project--vc-list-files
                                 dir))
             (args (list (concat "-mcard" (and include-untracked "u"))
                         "--no-status"
-                        "-0")))
+                        "-0"))
+            files)
        (when extra-ignores
          (setq args (nconc args
                            (mapcan
@@ -725,9 +736,12 @@ project--vc-list-files
                             extra-ignores))))
        (with-temp-buffer
          (apply #'vc-hg-command t 0 "." "status" args)
-         (mapcar
-          (lambda (s) (concat default-directory s))
-          (split-string (buffer-string) "\0" t)))))))
+         (setq files (split-string (buffer-string) "\0" t))
+         (unless project-files-relative-names
+           (setq files (mapcar
+                        (lambda (s) (concat default-directory s))
+                        files)))
+         files)))))
 
 (defun project--vc-merge-submodules-p (dir)
   (project--value-in-dir
@@ -970,6 +984,7 @@ project-find-regexp
   (let* ((caller-dir default-directory)
          (pr (project-current t))
          (default-directory (project-root pr))
+         (project-files-relative-names t)
          (files
           (if (not current-prefix-arg)
               (project-files pr)
@@ -1000,6 +1015,8 @@ project-or-external-find-regexp
   (require 'xref)
   (let* ((pr (project-current t))
          (default-directory (project-root pr))
+         ;; TODO: Make use of `project-files-relative-names' by
+         ;; searching each root separately (maybe in parallel, too).
          (files
           (project-files pr (cons
                              (project-root pr)
@@ -1054,7 +1071,8 @@ project-find-file
   (interactive "P")
   (let* ((pr (project-current t))
          (root (project-root pr))
-         (dirs (list root)))
+         (dirs (list root))
+         (project-files-relative-names t))
     (project-find-file-in
      (or (thing-at-point 'filename)
          (and buffer-file-name (project--find-default-from buffer-file-name pr)))
@@ -1130,7 +1148,12 @@ project--read-file-cpd-relative
             (if (> (length common-prefix) 0)
                 (file-name-directory common-prefix))))
          (cpd-length (length common-parent-directory))
-         (prompt (if (zerop cpd-length)
+         (common-parent-directory (if (file-name-absolute-p (car all-files))
+                                      common-parent-directory
+                                    (concat default-directory common-parent-directory)))
+         (prompt (if (and (zerop cpd-length)
+                          all-files
+                          (file-name-absolute-p (car all-files)))
                      prompt
                    (concat prompt (format " in %s" common-parent-directory))))
          (included-cpd (when (member common-parent-directory all-files)
@@ -1168,6 +1191,7 @@ project--read-file-absolute
                                     all-files &optional predicate
                                     hist mb-default)
   (project--completing-read-strict prompt
+                                   ;; FIXME: Map relative names to absolute?
                                    (project--file-completion-table all-files)
                                    predicate
                                    hist mb-default))
@@ -1215,6 +1239,7 @@ project-find-file-in
                dirs)
             (project-files project dirs)))
          (completion-ignore-case read-file-name-completion-ignore-case)
+         (default-directory (project-root project))
          (file (project--read-file-name
                 project "Find file"
                 all-files nil 'file-name-history
diff --git a/lisp/progmodes/xref.el b/lisp/progmodes/xref.el
index 755c3db04fd..29fc6cd560f 100644
--- a/lisp/progmodes/xref.el
+++ b/lisp/progmodes/xref.el
@@ -1922,7 +1922,8 @@ xref-matches-in-files
        (hits nil)
        ;; Support for remote files.  The assumption is that, if the
        ;; first file is remote, they all are, and on the same host.
-       (dir (file-name-directory (car files)))
+       (dir (or (file-name-directory (car files))
+                default-directory))
        (remote-id (file-remote-p dir))
        ;; The 'auto' default would be fine too, but ripgrep can't handle
        ;; the options we pass in that case.

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* bug#69188: bug#69233: 30.0.50; project-files + project-find-file is slow in large repositories
  2024-04-16 23:48   ` Dmitry Gutov
@ 2024-04-29 20:27     ` Spencer Baugh
  2024-05-05  0:29       ` Dmitry Gutov
  0 siblings, 1 reply; 13+ messages in thread
From: Spencer Baugh @ 2024-04-29 20:27 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: 69233, 69188

Dmitry Gutov <dmitry@gutov.dev> writes:
> On 13/04/2024 05:34, Dmitry Gutov wrote:
>> Both options are relatively clunky, and the second one might also
>> fail to work when DIRS is non-nil (or would have to fall back to
>> absolute names anyway), so I'm leaning toward the first one. It
>> might also allow certain code to be written supporting both relative
>> and absolute names (e.g. a process call both binds default-directory
>> to root and keeps the file names as-is -- the relative ones would be
>> interpreted as such, the rest just as they are interpreted now).
>
> Here's how that change can look.
>
> The patch should demonstrate both the performance improvements for
> project-find-file and project-find-regexp, and some awkwardness in the
> implementation, chiefly due to backward compatibility.
>
> Guess more tests will be required, at the very least.

I see almost a 50% performance improvement with this patch in my large
private repository, once adding support for project-files-relative-names
in my internal project backend.  Seems great so far.

My benchmarking:

(let ((proj (project-current)))
  (list (benchmark-run 10 (let ((project-files-relative-names t)) (length (project-files proj))))
        (benchmark-run 10 (let ((project-files-relative-names nil)) (length (project-files proj))))))

((17.605295389 28 7.647366087000023)
 (29.918302167 57 19.246283027999993))







^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#69233: 30.0.50; project-files + project-find-file is slow in large repositories
  2024-04-13  2:34 ` Dmitry Gutov
  2024-04-16 23:48   ` Dmitry Gutov
@ 2024-04-29 21:04   ` Spencer Baugh
  2024-05-05  3:32     ` Dmitry Gutov
  1 sibling, 1 reply; 13+ messages in thread
From: Spencer Baugh @ 2024-04-29 21:04 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: 69233, 69188

Dmitry Gutov <dmitry@gutov.dev> writes:

> Hi Spencer,
>
> Sorry about the wait.
>
> On 16/02/2024 00:55, Spencer Baugh wrote:
>> (project-files (project-current)) takes around 1 second in Linux
>> (80k
>> files) and 7 seconds in my larger (500k file) repository.
>> With this patch:
>> diff --git a/lisp/progmodes/project.el b/lisp/progmodes/project.el
>> index c7c07c3d34c..037beaa835a 100644
>> --- a/lisp/progmodes/project.el
>> +++ b/lisp/progmodes/project.el
>> @@ -667,12 +667,15 @@
>>                                                 (setq i (concat i "**"))))
>>                                           i)))
>>                                      extra-ignores)))))
>> -       (setq files
>> -             (mapcar
>> -              (lambda (file) (concat default-directory file))
>> -              (split-string
>> -               (apply #'vc-git--run-command-string nil "ls-files" args)
>> -               "\0" t)))
>> +       (with-temp-buffer
>> +         (let ((ok (apply #'vc-git--out-ok "ls-files" args))
>> +               (pt (point-min)))
>> +           (unless ok
>> +             (error "File listing failed: %s" (buffer-string)))
>> +           (goto-char pt)
>> +           (while (search-forward "\0" nil t)
>> +             (push (concat default-directory (buffer-substring-no-properties pt (1- (point)))) files)
>> +             (setq pt (point)))))
>>          (when (project--vc-merge-submodules-p default-directory)
>>            ;; Unfortunately, 'ls-files --recurse-submodules' conflicts with '-o'.
>>            (let* ((submodules (project--git-submodules))
>> project-files in Linux takes around .75 seconds.
>
> The patch makes sense (and the approach works okay in
> project--files-in-directory), so this is something I've made a few
> attempts to use in the past.
>
> However, the measurements on my machine show a much smaller
> improvement -- just 3-4%. I.e. if I just evaluate the functions
> interpreted or run them just once, the variations between the runs far
> exceed the difference in runtimes (around ~450ms with a Linux
> repository checkout from 2021, 70k files).
>
> A stricter comparison works out like this:
>
> 1. Apply the patch (or not),
> 2. M-x byte-compile-file
> 3. (load "project.elc")
> 4. (benchmark-run 10 (project-files (project-current)))
>
> When run these in my working session one after another, the 10
> iteration benchmark works out to 4.09s vs 3.93s (master vs your
> patch).
>
> (4.093848777 44 1.6119981489999944)
>
> vs
>
> (3.9392906549999998 41 1.499010061)
>
> With 'emacs -Q', however, it's vice versa:
>
> (3.777694389 130 1.2422826310000001)
>
> vs
>
> (3.889905663 165 1.46846598)
>
> It seems like, maybe, the longer running session is more sensitive to
> the allocation of the initial long string than the fresh session.
>
> In any case, I don't mind switching to the other approach. Just
> wondering where the difference between our machines might come from.
>
> Last but not least, when/if we apply this, we should keep the fix for
> bug#66806 in there. Good news is it doesn't seem to affect
> performance.

Oh, interesting, I see roughly the same result.

Benchmarking with:
(benchmark-run 10 (project-files (project-current)))

Running in my long-lived existing Emacs 29 session:
Old:
(4.434228319 14 2.850654906999921)
New:
(4.983809167 16 3.2989908669999295)

In Emacs 29 emacs -Q:
Old:
(3.5112438729999997 130 1.9230644630000002)
New:
(3.819248509 171 2.309731412)

But, in Emacs 30 emacs -Q:
Old:
(7.949549188 65 3.3445626799999992)
New:
(7.270785783999999 87 4.0610532379999995)

So... the performance improvement seems highly unreliable.  Probably not
worth changing this area, then - the other patch to allow relative files
will probably be more worth it.

>> My proposal: Could we find a way to make the default-directory not
>> necessary for the files returned from project-files?
>> Perhaps project-files could be allowed to return relative file paths
>> which are relative to the project root.  Then in the common case where
>> all the files are within the project root, project-find-file would be
>> way faster.  Happy to implement this, if it makes sense.
>
> Yep, that should make sense. Originally the idea was to keep it more
> universal so that lists of files coming from the "external roots"
> could be handled the same way (used in the two *-or-external-*
> commands).
>
> But indeed it's the relatively rare case, so it'd be better to avoid
> paying the performance penalty, especially when the subsequent
> handling could do without the added prefix. And even the "rare case"
> could be split into separate calls instead of having all files
> returned at once.
>
> My main concern is backward compatibility, so that 3rd party callers
> don't break after the update.
>
> I think there are basically two approaches:
> - A new devar like project-use-relative-names,
> - Or a new argument for 'project-files', e.g. called RELATIVE.
>
> Both options are relatively clunky, and the second one might also fail
> to work when DIRS is non-nil (or would have to fall back to absolute
> names anyway), so I'm leaning toward the first one. It might also
> allow certain code to be written supporting both relative and absolute
> names (e.g. a process call both binds default-directory to root and
> keeps the file names as-is -- the relative ones would be interpreted
> as such, the rest just as they are interpreted now).
>
> Both project-find-file and project-find-regexp should be able to
> benefit. Although the former might require a bigger update, given that
> the current project-read-file-name-function options don't expect
> relative names. Ideally we'd have a smoother migration for custom
> p-r-f-n-f functions, but I don't have any good ideas there.

I think the defvar approach seems reasonable.

The existing project-read-file-name-function certainly don't expect
relative names, but they do actually work OK.  e.g.

(project--read-file-cpd-relative "" '("foo/bar" "foo1/bar") nil 'minibuffer-history)
(project--read-file-absolute "" '("foo/bar" "foo1/bar") nil 'minibuffer-history)

Both complete fine and return a filename fine.  read-file-cpd-relative
returns an absolute filename, read-file-absolute reutrns a relative
filename.

Maybe the same is true for any custom project-read-file-name-functions
that exist?  Maybe they will just work?

>> Another optimization I've considered: We could run the process
>> asynchronously so project-files parsing can be parallel with the
>> process; but the process is usually very fast anyway, that's not most of
>> the overhead, so that won't be a big win.
>
> Right. This came up in bug#64735, and together with patch in bug#66020
> the asynchronous file listing can run a bit faster than the
> synchronous implementation.
>
> I'm guessing the difference won't be huge in your case, since either
> way most time remains spent in Lisp code and GC. But if we take
> advantage of this by improving the UIs at the same time, this can be a
> real win.

Right.

> This should go into a separate discussion, I think, but to quickly sum
> up my thinking on the subject:
>
> - Ideally project-files implementations for sync and async UIs should
>   always look the same. Hopefully the "async" implementation looks the
>   same or almost the same as the "sync" one. Threads might help.
> - project-find-regexp could benefit from this a lot, first by running
>   the search in parallel to the file listing, and second by showing
>   the results right away (the current advantage of 'M-x grep'). The
>   difficult part is have the "async" Xref interface as well (can we do
>   this without extending the current one? probably not). The UI also
>   needs to have some "running ..." indicator, as well as a way to
>   abort the search, killing both processes - that adds requirements to
>  "async Xref" as well.

All seems reasonable.

>> However, that would make it easy for project-files as a whole to be
>> asynchronous.  Then that would allow project-find-file to start the
>> listing in the background, and then we'd write a completion table which
>> completes only over whatever files we've already read into Emacs.  I
>> think this would be a lot nicer for most use-cases, and I'd again be
>> happy to implement this.
>
> Could this be that simple?
>
> Whatever the source of the file listing, as soon as the UI (or
> completion styles) calls try-completion or all-completions, the search
> has to finish first, shouldn't it? That seems like the semantics of
> this API. Or if perhaps we allow it to operate on incomplete results,
> how would we indicate to the user at the end that the scan has
> finished, and they can press TAB once more to refresh the results? Or
> perhaps to be able to find a file they hadn't managed to find in the
> incomplete set.
>
> This seems like it might require both a new UI and an extension of
> completion table API. E.g. in certain cases we could say that we only
> need N matches, so if the current incomplete set can provide as many,
> we don't have to wait until the end. But 'try-completion' would become
> unreliable either way.

Yes, that's all true, and this is definitely not the intended semantics
of the API, but I vaguely suspect it might be fine in practice?  That
vague suspicion can wait until later, though, because I think the more
conservative approach you suggest is also a good improvement on its own.

> Even if keeping to the most conservative approach, though, it should
> be possible to at least render the prompt before the file listing is
> finished. That could make the UI look a bit more responsive.

True, that would be pretty nice.  And further I suppose in the case of
the default completion UI (which doesn't automatically display
completions), the user can even type some input before hitting TAB and
waiting.

Also, I suppose that even non-default completion UIs would allow the
user to type input, if the non-default completion UI uses
while-no-input.  So it would be a pretty responsive experience for such
UIs (assuming we are careful in our implementation and don't have bugs
when being interrupted).

That sounds pretty great, actually.  We avoid the blocking part of the
UI without needing to think about how to surface "incomplete completion"
in all the different completion UIs.





^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#69188: bug#69233: 30.0.50; project-files + project-find-file is slow in large repositories
  2024-04-29 20:27     ` bug#69188: bug#69233: " Spencer Baugh
@ 2024-05-05  0:29       ` Dmitry Gutov
  0 siblings, 0 replies; 13+ messages in thread
From: Dmitry Gutov @ 2024-05-05  0:29 UTC (permalink / raw)
  To: Spencer Baugh; +Cc: 69233, 69188

On 29/04/2024 23:27, Spencer Baugh wrote:
> Dmitry Gutov<dmitry@gutov.dev>  writes:
>> On 13/04/2024 05:34, Dmitry Gutov wrote:
>>> Both options are relatively clunky, and the second one might also
>>> fail to work when DIRS is non-nil (or would have to fall back to
>>> absolute names anyway), so I'm leaning toward the first one. It
>>> might also allow certain code to be written supporting both relative
>>> and absolute names (e.g. a process call both binds default-directory
>>> to root and keeps the file names as-is -- the relative ones would be
>>> interpreted as such, the rest just as they are interpreted now).
>> Here's how that change can look.
>>
>> The patch should demonstrate both the performance improvements for
>> project-find-file and project-find-regexp, and some awkwardness in the
>> implementation, chiefly due to backward compatibility.
>>
>> Guess more tests will be required, at the very least.
> I see almost a 50% performance improvement with this patch in my large
> private repository, once adding support for project-files-relative-names
> in my internal project backend.  Seems great so far.
> 
> My benchmarking:
> 
> (let ((proj (project-current)))
>    (list (benchmark-run 10 (let ((project-files-relative-names t)) (length (project-files proj))))
>          (benchmark-run 10 (let ((project-files-relative-names nil)) (length (project-files proj))))))
> 
> ((17.605295389 28 7.647366087000023)
>   (29.918302167 57 19.246283027999993))

Nice!

Too bad it's still takes ~1.7s to list all the files in the project. 
Well above the comfortable wait time (ideally <100ms or at least <500ms, 
I guess).





^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#69233: 30.0.50; project-files + project-find-file is slow in large repositories
  2024-04-29 21:04   ` Spencer Baugh
@ 2024-05-05  3:32     ` Dmitry Gutov
  0 siblings, 0 replies; 13+ messages in thread
From: Dmitry Gutov @ 2024-05-05  3:32 UTC (permalink / raw)
  To: Spencer Baugh; +Cc: 69233, 69188

On 30/04/2024 00:04, Spencer Baugh wrote:

> Oh, interesting, I see roughly the same result.
> 
> Benchmarking with:
> (benchmark-run 10 (project-files (project-current)))
> 
> Running in my long-lived existing Emacs 29 session:
> Old:
> (4.434228319 14 2.850654906999921)
> New:
> (4.983809167 16 3.2989908669999295)
> 
> In Emacs 29 emacs -Q:
> Old:
> (3.5112438729999997 130 1.9230644630000002)
> New:
> (3.819248509 171 2.309731412)
> 
> But, in Emacs 30 emacs -Q:
> Old:
> (7.949549188 65 3.3445626799999992)
> New:
> (7.270785783999999 87 4.0610532379999995)
> 
> So... the performance improvement seems highly unreliable.  Probably not
> worth changing this area, then - the other patch to allow relative files
> will probably be more worth it.

All right then, let's hold off on this potential change for now, and 
maybe revisit it later. Maybe the new GC engine will swing the needle in 
one or the other direction.

> I think the defvar approach seems reasonable.
> 
> The existing project-read-file-name-function certainly don't expect
> relative names, but they do actually work OK.  e.g.
> 
> (project--read-file-cpd-relative "" '("foo/bar" "foo1/bar") nil 'minibuffer-history)

Evaluating this one with the version in master results in

Debugger entered--Lisp error: (wrong-type-argument stringp nil)
   expand-file-name(nil)

hence the associated change in the patch.

> (project--read-file-absolute "" '("foo/bar" "foo1/bar") nil 'minibuffer-history)

No errors here, but two problems are that a) it doesn't show the 
default-directory [meaning no indication in which project the read is 
happening], and b) returning the relative name will mess up the 
file-name-history entry.

Good thing you noted the latter, it needs explicit handling. The former 
can be be shown in the prompt, at least.

> Both complete fine and return a filename fine.  read-file-cpd-relative
> returns an absolute filename, read-file-absolute reutrns a relative
> filename.
> 
> Maybe the same is true for any custom project-read-file-name-functions
> that exist?  Maybe they will just work?

So, apparently not.

Anyway, I've pushed the patch in commit 370b216f086. Here's hoping the 
breakage will be minimal.

>>> However, that would make it easy for project-files as a whole to be
>>> asynchronous.  Then that would allow project-find-file to start the
>>> listing in the background, and then we'd write a completion table which
>>> completes only over whatever files we've already read into Emacs.  I
>>> think this would be a lot nicer for most use-cases, and I'd again be
>>> happy to implement this.
>>
>> Could this be that simple?
>>
>> Whatever the source of the file listing, as soon as the UI (or
>> completion styles) calls try-completion or all-completions, the search
>> has to finish first, shouldn't it? That seems like the semantics of
>> this API. Or if perhaps we allow it to operate on incomplete results,
>> how would we indicate to the user at the end that the scan has
>> finished, and they can press TAB once more to refresh the results? Or
>> perhaps to be able to find a file they hadn't managed to find in the
>> incomplete set.
>>
>> This seems like it might require both a new UI and an extension of
>> completion table API. E.g. in certain cases we could say that we only
>> need N matches, so if the current incomplete set can provide as many,
>> we don't have to wait until the end. But 'try-completion' would become
>> unreliable either way.
> 
> Yes, that's all true, and this is definitely not the intended semantics
> of the API, but I vaguely suspect it might be fine in practice?  That
> vague suspicion can wait until later, though, because I think the more
> conservative approach you suggest is also a good improvement on its own.

Some async stuff could make a big improvement on top of it, but it seems 
to require a fair bit more complexity.

>> Even if keeping to the most conservative approach, though, it should
>> be possible to at least render the prompt before the file listing is
>> finished. That could make the UI look a bit more responsive.
> 
> True, that would be pretty nice.  And further I suppose in the case of
> the default completion UI (which doesn't automatically display
> completions), the user can even type some input before hitting TAB and
> waiting.

It could be advantageous if the search process starts right when (or 
before) the prompt is shown, then by the type the first input is entered 
the search could either be finished or have found some matches at least.

> Also, I suppose that even non-default completion UIs would allow the
> user to type input, if the non-default completion UI uses
> while-no-input.  So it would be a pretty responsive experience for such
> UIs (assuming we are careful in our implementation and don't have bugs
> when being interrupted).

Not sure about this one:

1) If you only do the search while the user is not typing, it will 
finish later compared to the scheme in the previous paragraph.
2) Suppose you type a char, pause, then another one. Will the search 
start, abort, and then start again? That seems wasteful.

I'd ultimately prefer a scheme where work isn't thrown away - but that 
would require a more complex API. Including a way to abort the 
background computation (since typing won't do that anymore).

For some UIs and commands that makes sense (e.g. incremental interfaces 
like counsel-rg) because they perform the search with different inputs 
each time you type a new character. That kinds of works for 
small-to-medium projects, and you can enjoy the responsiveness of the 
process. I'm not sure about this approach for big projects.






^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-05-05  3:32 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-02-15 22:55 bug#69188: 30.0.50; project-files + project-find-file is slow in large repositories Spencer Baugh
2024-02-18 18:56 ` bug#69233: " Eli Zaretskii
2024-02-18 19:42   ` Dmitry Gutov
2024-02-18 19:45     ` Eli Zaretskii
2024-02-18 20:11       ` Dmitry Gutov
2024-02-18 20:18         ` Eli Zaretskii
2024-02-23 21:34           ` bug#69233: bug#69188: " Spencer Baugh
2024-04-13  2:34 ` Dmitry Gutov
2024-04-16 23:48   ` Dmitry Gutov
2024-04-29 20:27     ` bug#69188: bug#69233: " Spencer Baugh
2024-05-05  0:29       ` Dmitry Gutov
2024-04-29 21:04   ` Spencer Baugh
2024-05-05  3:32     ` Dmitry Gutov

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).