From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Spencer Baugh <sbaugh@janestreet.com>
Newsgroups: gmane.emacs.bugs
Subject: bug#69233: 30.0.50;
 project-files + project-find-file is slow in large repositories
Date: Mon, 29 Apr 2024 17:04:24 -0400
Message-ID: <ierfrv3op7r.fsf_-_@janestreet.com>
References: <iera5o11gnh.fsf@igm-qws-u22796a.mail-host-address-is-not-set>
 <1b566e9e-eca5-4746-8e31-4155d35ce7a8@gutov.dev>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214";
	logging-data="27392"; mail-complaints-to="usenet@ciao.gmane.io"
User-Agent: Gnus/5.13 (Gnus v5.13)
Cc: 69233@debbugs.gnu.org, 69188@debbugs.gnu.org
To: Dmitry Gutov <dmitry@gutov.dev>
Original-X-From: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Mon Apr 29 23:04:56 2024
Return-path: <bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org>
Envelope-to: geb-bug-gnu-emacs@m.gmane-mx.org
Original-Received: from lists.gnu.org ([209.51.188.17])
	by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
	(Exim 4.92)
	(envelope-from <bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org>)
	id 1s1YBE-0006oo-6z
	for geb-bug-gnu-emacs@m.gmane-mx.org; Mon, 29 Apr 2024 23:04:56 +0200
Original-Received: from localhost ([::1] helo=lists1p.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.90_1)
	(envelope-from <bug-gnu-emacs-bounces@gnu.org>)
	id 1s1YB2-0006wR-Ir; Mon, 29 Apr 2024 17:04:44 -0400
Original-Received: from eggs.gnu.org ([2001:470:142:3::10])
 by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <Debian-debbugs@debbugs.gnu.org>)
 id 1s1YB0-0006vq-Pn
 for bug-gnu-emacs@gnu.org; Mon, 29 Apr 2024 17:04:43 -0400
Original-Received: from debbugs.gnu.org ([2001:470:142:5::43])
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128)
 (Exim 4.90_1) (envelope-from <Debian-debbugs@debbugs.gnu.org>)
 id 1s1YB0-0007IQ-GS
 for bug-gnu-emacs@gnu.org; Mon, 29 Apr 2024 17:04:42 -0400
Original-Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2)
 (envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1s1YBK-0003Ev-2T
 for bug-gnu-emacs@gnu.org; Mon, 29 Apr 2024 17:05:02 -0400
X-Loop: help-debbugs@gnu.org
Resent-From: Spencer Baugh <sbaugh@janestreet.com>
Original-Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
Resent-CC: bug-gnu-emacs@gnu.org
Resent-Date: Mon, 29 Apr 2024 21:05:02 +0000
Resent-Message-ID: <handler.69233.B69233.171442469312438@debbugs.gnu.org>
Resent-Sender: help-debbugs@gnu.org
X-GNU-PR-Message: followup 69233
X-GNU-PR-Package: emacs
Original-Received: via spool by 69233-submit@debbugs.gnu.org id=B69233.171442469312438
 (code B ref 69233); Mon, 29 Apr 2024 21:05:02 +0000
Original-Received: (at 69233) by debbugs.gnu.org; 29 Apr 2024 21:04:53 +0000
Original-Received: from localhost ([127.0.0.1]:58802 helo=debbugs.gnu.org)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
 id 1s1YBA-0003ET-6a
 for submit@debbugs.gnu.org; Mon, 29 Apr 2024 17:04:52 -0400
Original-Received: from mxout1.mail.janestreet.com ([38.105.200.78]:46737)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <sbaugh@janestreet.com>) id 1s1YB8-0003EK-6d
 for 69233@debbugs.gnu.org; Mon, 29 Apr 2024 17:04:51 -0400
In-Reply-To: <1b566e9e-eca5-4746-8e31-4155d35ce7a8@gutov.dev> (Dmitry Gutov's
 message of "Sat, 13 Apr 2024 05:34:18 +0300")
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=janestreet.com;
 s=waixah; t=1714424664;
 bh=tGMuXHjWDMUVZnBREMf1JpXGB3rrB1vP2l9A8HxuyVk=;
 h=From:To:Cc:Subject:In-Reply-To:References:Date;
 b=US5I6dBNZwwmikp9TSCNjZllYxBFckKGpbj8KxSJEo6ZjN9qzCraT3ZhuX66ZxOo1
 DJ37ojXEcFMVjNaUJ6Ip90OA4VS1E9TGX3i0EoC1EFKsloqUrf1Lr0yzsSGEg1tX1y
 +ZpvXdkxXIaJK5XQs4Lwx+mrqz6wAh+wokVkc6ZQ5KJI5C6x10CLpLePziqOBzozWN
 YEUjh2Yqand71O8KaC3tWeTtJhC0oG6rAWmeoqyvNxM75GGGHG7QMc21IBCWsmewto
 QTM/OKKieK7uWPaG/htgHXVavI3fl2qHMMdT7IH+UdSYXOiFwspW3W8GzCMWw4zENL
 gcRQBV0d7cz6A==
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
X-BeenThere: bug-gnu-emacs@gnu.org
List-Id: "Bug reports for GNU Emacs,
 the Swiss army knife of text editors" <bug-gnu-emacs.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/bug-gnu-emacs>,
 <mailto:bug-gnu-emacs-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/bug-gnu-emacs>
List-Post: <mailto:bug-gnu-emacs@gnu.org>
List-Help: <mailto:bug-gnu-emacs-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/bug-gnu-emacs>,
 <mailto:bug-gnu-emacs-request@gnu.org?subject=subscribe>
Errors-To: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org
Original-Sender: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org
Xref: news.gmane.io gmane.emacs.bugs:284179
Archived-At: <http://permalink.gmane.org/gmane.emacs.bugs/284179>

Dmitry Gutov <dmitry@gutov.dev> writes:

> Hi Spencer,
>
> Sorry about the wait.
>
> On 16/02/2024 00:55, Spencer Baugh wrote:
>> (project-files (project-current)) takes around 1 second in Linux
>> (80k
>> files) and 7 seconds in my larger (500k file) repository.
>> With this patch:
>> diff --git a/lisp/progmodes/project.el b/lisp/progmodes/project.el
>> index c7c07c3d34c..037beaa835a 100644
>> --- a/lisp/progmodes/project.el
>> +++ b/lisp/progmodes/project.el
>> @@ -667,12 +667,15 @@
>>                                                 (setq i (concat i "**"))))
>>                                           i)))
>>                                      extra-ignores)))))
>> -       (setq files
>> -             (mapcar
>> -              (lambda (file) (concat default-directory file))
>> -              (split-string
>> -               (apply #'vc-git--run-command-string nil "ls-files" args)
>> -               "\0" t)))
>> +       (with-temp-buffer
>> +         (let ((ok (apply #'vc-git--out-ok "ls-files" args))
>> +               (pt (point-min)))
>> +           (unless ok
>> +             (error "File listing failed: %s" (buffer-string)))
>> +           (goto-char pt)
>> +           (while (search-forward "\0" nil t)
>> +             (push (concat default-directory (buffer-substring-no-properties pt (1- (point)))) files)
>> +             (setq pt (point)))))
>>          (when (project--vc-merge-submodules-p default-directory)
>>            ;; Unfortunately, 'ls-files --recurse-submodules' conflicts with '-o'.
>>            (let* ((submodules (project--git-submodules))
>> project-files in Linux takes around .75 seconds.
>
> The patch makes sense (and the approach works okay in
> project--files-in-directory), so this is something I've made a few
> attempts to use in the past.
>
> However, the measurements on my machine show a much smaller
> improvement -- just 3-4%. I.e. if I just evaluate the functions
> interpreted or run them just once, the variations between the runs far
> exceed the difference in runtimes (around ~450ms with a Linux
> repository checkout from 2021, 70k files).
>
> A stricter comparison works out like this:
>
> 1. Apply the patch (or not),
> 2. M-x byte-compile-file
> 3. (load "project.elc")
> 4. (benchmark-run 10 (project-files (project-current)))
>
> When run these in my working session one after another, the 10
> iteration benchmark works out to 4.09s vs 3.93s (master vs your
> patch).
>
> (4.093848777 44 1.6119981489999944)
>
> vs
>
> (3.9392906549999998 41 1.499010061)
>
> With 'emacs -Q', however, it's vice versa:
>
> (3.777694389 130 1.2422826310000001)
>
> vs
>
> (3.889905663 165 1.46846598)
>
> It seems like, maybe, the longer running session is more sensitive to
> the allocation of the initial long string than the fresh session.
>
> In any case, I don't mind switching to the other approach. Just
> wondering where the difference between our machines might come from.
>
> Last but not least, when/if we apply this, we should keep the fix for
> bug#66806 in there. Good news is it doesn't seem to affect
> performance.

Oh, interesting, I see roughly the same result.

Benchmarking with:
(benchmark-run 10 (project-files (project-current)))

Running in my long-lived existing Emacs 29 session:
Old:
(4.434228319 14 2.850654906999921)
New:
(4.983809167 16 3.2989908669999295)

In Emacs 29 emacs -Q:
Old:
(3.5112438729999997 130 1.9230644630000002)
New:
(3.819248509 171 2.309731412)

But, in Emacs 30 emacs -Q:
Old:
(7.949549188 65 3.3445626799999992)
New:
(7.270785783999999 87 4.0610532379999995)

So... the performance improvement seems highly unreliable.  Probably not
worth changing this area, then - the other patch to allow relative files
will probably be more worth it.

>> My proposal: Could we find a way to make the default-directory not
>> necessary for the files returned from project-files?
>> Perhaps project-files could be allowed to return relative file paths
>> which are relative to the project root.  Then in the common case where
>> all the files are within the project root, project-find-file would be
>> way faster.  Happy to implement this, if it makes sense.
>
> Yep, that should make sense. Originally the idea was to keep it more
> universal so that lists of files coming from the "external roots"
> could be handled the same way (used in the two *-or-external-*
> commands).
>
> But indeed it's the relatively rare case, so it'd be better to avoid
> paying the performance penalty, especially when the subsequent
> handling could do without the added prefix. And even the "rare case"
> could be split into separate calls instead of having all files
> returned at once.
>
> My main concern is backward compatibility, so that 3rd party callers
> don't break after the update.
>
> I think there are basically two approaches:
> - A new devar like project-use-relative-names,
> - Or a new argument for 'project-files', e.g. called RELATIVE.
>
> Both options are relatively clunky, and the second one might also fail
> to work when DIRS is non-nil (or would have to fall back to absolute
> names anyway), so I'm leaning toward the first one. It might also
> allow certain code to be written supporting both relative and absolute
> names (e.g. a process call both binds default-directory to root and
> keeps the file names as-is -- the relative ones would be interpreted
> as such, the rest just as they are interpreted now).
>
> Both project-find-file and project-find-regexp should be able to
> benefit. Although the former might require a bigger update, given that
> the current project-read-file-name-function options don't expect
> relative names. Ideally we'd have a smoother migration for custom
> p-r-f-n-f functions, but I don't have any good ideas there.

I think the defvar approach seems reasonable.

The existing project-read-file-name-function certainly don't expect
relative names, but they do actually work OK.  e.g.

(project--read-file-cpd-relative "" '("foo/bar" "foo1/bar") nil 'minibuffer-history)
(project--read-file-absolute "" '("foo/bar" "foo1/bar") nil 'minibuffer-history)

Both complete fine and return a filename fine.  read-file-cpd-relative
returns an absolute filename, read-file-absolute reutrns a relative
filename.

Maybe the same is true for any custom project-read-file-name-functions
that exist?  Maybe they will just work?

>> Another optimization I've considered: We could run the process
>> asynchronously so project-files parsing can be parallel with the
>> process; but the process is usually very fast anyway, that's not most of
>> the overhead, so that won't be a big win.
>
> Right. This came up in bug#64735, and together with patch in bug#66020
> the asynchronous file listing can run a bit faster than the
> synchronous implementation.
>
> I'm guessing the difference won't be huge in your case, since either
> way most time remains spent in Lisp code and GC. But if we take
> advantage of this by improving the UIs at the same time, this can be a
> real win.

Right.

> This should go into a separate discussion, I think, but to quickly sum
> up my thinking on the subject:
>
> - Ideally project-files implementations for sync and async UIs should
>   always look the same. Hopefully the "async" implementation looks the
>   same or almost the same as the "sync" one. Threads might help.
> - project-find-regexp could benefit from this a lot, first by running
>   the search in parallel to the file listing, and second by showing
>   the results right away (the current advantage of 'M-x grep'). The
>   difficult part is have the "async" Xref interface as well (can we do
>   this without extending the current one? probably not). The UI also
>   needs to have some "running ..." indicator, as well as a way to
>   abort the search, killing both processes - that adds requirements to
>  "async Xref" as well.

All seems reasonable.

>> However, that would make it easy for project-files as a whole to be
>> asynchronous.  Then that would allow project-find-file to start the
>> listing in the background, and then we'd write a completion table which
>> completes only over whatever files we've already read into Emacs.  I
>> think this would be a lot nicer for most use-cases, and I'd again be
>> happy to implement this.
>
> Could this be that simple?
>
> Whatever the source of the file listing, as soon as the UI (or
> completion styles) calls try-completion or all-completions, the search
> has to finish first, shouldn't it? That seems like the semantics of
> this API. Or if perhaps we allow it to operate on incomplete results,
> how would we indicate to the user at the end that the scan has
> finished, and they can press TAB once more to refresh the results? Or
> perhaps to be able to find a file they hadn't managed to find in the
> incomplete set.
>
> This seems like it might require both a new UI and an extension of
> completion table API. E.g. in certain cases we could say that we only
> need N matches, so if the current incomplete set can provide as many,
> we don't have to wait until the end. But 'try-completion' would become
> unreliable either way.

Yes, that's all true, and this is definitely not the intended semantics
of the API, but I vaguely suspect it might be fine in practice?  That
vague suspicion can wait until later, though, because I think the more
conservative approach you suggest is also a good improvement on its own.

> Even if keeping to the most conservative approach, though, it should
> be possible to at least render the prompt before the file listing is
> finished. That could make the UI look a bit more responsive.

True, that would be pretty nice.  And further I suppose in the case of
the default completion UI (which doesn't automatically display
completions), the user can even type some input before hitting TAB and
waiting.

Also, I suppose that even non-default completion UIs would allow the
user to type input, if the non-default completion UI uses
while-no-input.  So it would be a pretty responsive experience for such
UIs (assuming we are careful in our implementation and don't have bugs
when being interrupted).

That sounds pretty great, actually.  We avoid the blocking part of the
UI without needing to think about how to surface "incomplete completion"
in all the different completion UIs.