unofficial mirror of emacs-devel@gnu.org 
* Should project delegate project-find-regexp?
@ 2022-04-07 11:48 Joel Reicher
  2022-04-07 12:34 ` Ergus
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Joel Reicher @ 2022-04-07 11:48 UTC (permalink / raw)
  To: emacs-devel

It seems to me that, at least in the case of git, 'git grep' offers a superior implementation to anything offered by the generic implementation of project-find-regexp.

At the moment project delegates the list of files to vc (for example) but perhaps it should delegate the regexp search itself?

Regards,

        - Joel




* Re: Should project delegate project-find-regexp?
  2022-04-07 11:48 Should project delegate project-find-regexp? Joel Reicher
@ 2022-04-07 12:34 ` Ergus
  2022-04-07 12:55   ` Joel Reicher
  2022-04-07 14:30 ` Dmitry Gutov
  2022-04-07 16:56 ` Sean Whitton
  2 siblings, 1 reply; 11+ messages in thread
From: Ergus @ 2022-04-07 12:34 UTC (permalink / raw)
  To: Joel Reicher; +Cc: emacs-devel

On Thu, Apr 07, 2022 at 09:48:33PM +1000, Joel Reicher wrote:
>It seems to me that, at least in the case of git, 'git grep' offers a
>superior implementation to anything offered by the generic
>implementation of project-find-regexp.
>
>At the moment project delegates the list of files to vc (for example)
>but perhaps it should delegate the regexp search itself?
>
I think it could, and the implementation itself would not be very
complex for this specific use case. The problem is that vc is a general
frontend for many VCSs, and most of them do not support regex search...
On the other end, project.el itself is agnostic with respect to the vc
(or the backend) in use, so supporting this may require some kind of
decision between the two ends to use git-specific code on the vc side,
adding a regex function wrapper that will only work for git.

OTOH the current implementation relies on xref-matches-in-files, which
respects all the relevant xref user options (like xref-search-program)
and reuses all the existing code to match patterns and read the output
from the processes.

In practice I know that git grep is good, but I am not sure how
"superior" it is compared to what we already have, or whether it is
worth using.

Is there any real difference?

>Regards,
>
>        - Joel
>
Best,
Ergus




* Re: Should project delegate project-find-regexp?
  2022-04-07 12:34 ` Ergus
@ 2022-04-07 12:55   ` Joel Reicher
  0 siblings, 0 replies; 11+ messages in thread
From: Joel Reicher @ 2022-04-07 12:55 UTC (permalink / raw)
  To: Ergus; +Cc: emacs-devel

Ergus <spacibba@aol.com> writes:

> On Thu, Apr 07, 2022 at 09:48:33PM +1000, Joel Reicher wrote:
>>It seems to me that, at least in the case of git, 'git grep' offers a
>>superior implementation to anything offered by the generic
>>implementation of project-find-regexp.
>>
>>At the moment project delegates the list of files to vc (for example)
>>but perhaps it should delegate the regexp search itself?
>>
> I think it could, and the implementation itself would not be very
> complex for this specific use case. The problem is that vc is a general
> frontend for many VCSs, and most of them do not support regex search...
> On the other end, project.el itself is agnostic with respect to the vc
> (or the backend) in use, so supporting this may require some kind of
> decision between the two ends to use git-specific code on the vc side,
> adding a regex function wrapper that will only work for git.
>
> OTOH the current implementation relies on xref-matches-in-files, which
> respects all the relevant xref user options (like xref-search-program)
> and reuses all the existing code to match patterns and read the output
> from the processes.
>
> In practice I know that git grep is good, but I am not sure how
> "superior" it is compared to what we already have, or whether it is
> worth using.
>
> Is there any real difference?

The real difference is the future proofing. Git grep will track any improvements available in the fs representation of the repo and working tree. If project doesn't take advantage of that, it has to duplicate it.

I agree/concede that other vcs do not support this, but the right solution might be a record of "capabilities", similar to what's in LSP. If project can delegate, it does. If not, it falls back on a generic method. So I think we need a list of capabilities for vc backends, and I would argue this should not be a surprise. They do indeed have different capabilities.
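
A minimal shell sketch of the "delegate if capable, otherwise fall back" idea, under stated assumptions: the function names search_backend and project_search are hypothetical and are not part of project.el or vc, and a real version would be written in Emacs Lisp against the vc backends. It only shows the shape of the dispatch.

```shell
# Hypothetical helpers for illustration; not part of project.el or vc.

# Report which search "capability" applies to directory $1.
search_backend() {
    dir=$1
    if command -v git >/dev/null 2>&1 &&
       git -C "$dir" rev-parse --is-inside-work-tree >/dev/null 2>&1; then
        echo git-grep
    else
        echo generic
    fi
}

# Search directory $2 for pattern $1, delegating to 'git grep' when possible
# and falling back to a plain recursive grep otherwise.
project_search() {
    pattern=$1 dir=$2
    case $(search_backend "$dir") in
        git-grep) (cd "$dir" && git grep -n -e "$pattern") ;;
        generic)  (cd "$dir" && grep -rn -e "$pattern" .) ;;
    esac
}
```

Usage would be something like `project_search symlinks ~/src/some-project`; the caller does not need to know which branch was taken.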

Thanks and regards,

       - Joel




* Re: Should project delegate project-find-regexp?
  2022-04-07 11:48 Should project delegate project-find-regexp? Joel Reicher
  2022-04-07 12:34 ` Ergus
@ 2022-04-07 14:30 ` Dmitry Gutov
  2022-04-07 16:10   ` Ergus
                     ` (2 more replies)
  2022-04-07 16:56 ` Sean Whitton
  2 siblings, 3 replies; 11+ messages in thread
From: Dmitry Gutov @ 2022-04-07 14:30 UTC (permalink / raw)
  To: Joel Reicher, emacs-devel

On 07.04.2022 14:48, Joel Reicher wrote:
> It seems to me that, at least in the case of git, 'git grep' offers a superior implementation to anything offered by the generic implementation of project-find-regexp.

Last I checked, there was no way to make 'git grep' search in untracked 
files.

That would be a violation of the contract.

And of course we'd have to spend the effort to teach it about 
user-customized error patterns.

If 'git ls-files' + 'xargs rg' gives similar performance, that's 
probably good enough. Though serializing the list of files back and 
forth adds its overhead, which is unfortunate.




* Re: Should project delegate project-find-regexp?
  2022-04-07 14:30 ` Dmitry Gutov
@ 2022-04-07 16:10   ` Ergus
  2022-04-07 16:33     ` Dmitry Gutov
  2022-04-08  8:40   ` Joel Reicher
  2022-04-09 23:01   ` Jim Porter
  2 siblings, 1 reply; 11+ messages in thread
From: Ergus @ 2022-04-07 16:10 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: Joel Reicher, emacs-devel

On Thu, Apr 07, 2022 at 05:30:21PM +0300, Dmitry Gutov wrote:
>On 07.04.2022 14:48, Joel Reicher wrote:
>>It seems to me that, at least in the case of git, 'git grep' offers a superior implementation to anything offered by the generic implementation of project-find-regexp.
>
>Last I checked, there was no way to make 'git grep' search in 
>untracked files.
>
>That would be a violation of the contract.
>
>And of course we'd have to spend the effort to teach it about 
>user-customized error patterns.
>
>If 'git ls-files' + 'xargs rg' gives similar performance, that's 
>probably good enough. Though serializing the list of files back and 
>forth adds its overhead, which is unfortunate.
>
Just a question here: why use 'xargs rg'? AFAIK rg already performs
a parallel search.




* Re: Should project delegate project-find-regexp?
  2022-04-07 16:10   ` Ergus
@ 2022-04-07 16:33     ` Dmitry Gutov
  0 siblings, 0 replies; 11+ messages in thread
From: Dmitry Gutov @ 2022-04-07 16:33 UTC (permalink / raw)
  To: Ergus; +Cc: Joel Reicher, emacs-devel

On 07.04.2022 19:10, Ergus wrote:
> Just a question here: why use 'xargs rg'? AFAIK rg already performs
> a parallel search.

It's a command line argument handling thing. xargs is not used for 
parallelism.
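
To illustrate the point with a small sketch (show_batches is a made-up name, and 'echo' stands in for the search program): xargs reads NUL-separated names from stdin and turns them into command-line arguments, starting another invocation when a batch is full. Without -P the invocations run one after another.

```shell
# Pass each NUL-terminated name on stdin to 'echo', at most two per invocation.
# 'echo' is a stand-in for rg or grep; -n 2 just makes the batching visible.
show_batches() {
    printf '%s\0' "$@" | xargs -0 -n 2 echo
}

show_batches a.c b.c c.c
# prints:
#   a.c b.c
#   c.c
```

In real use there is no -n; xargs splits only when the argument list would exceed the system limit, which is what matters for a very long file list.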




* Re: Should project delegate project-find-regexp?
  2022-04-07 11:48 Should project delegate project-find-regexp? Joel Reicher
  2022-04-07 12:34 ` Ergus
  2022-04-07 14:30 ` Dmitry Gutov
@ 2022-04-07 16:56 ` Sean Whitton
  2 siblings, 0 replies; 11+ messages in thread
From: Sean Whitton @ 2022-04-07 16:56 UTC (permalink / raw)
  To: Joel Reicher, emacs-devel

Hello,

On Thu 07 Apr 2022 at 09:48pm +10, Joel Reicher wrote:

> It seems to me that, at least in the case of git, 'git grep' offers a superior implementation to anything offered by the generic implementation of project-find-regexp.
>
> At the moment project delegates the list of files to vc (for example) but perhaps it should delegate the regexp search itself?

Interesting.  One thing that comes to mind is that git-grep(1) is
line-based, i.e., it can't do multiline regexps.  Though, I am not sure
that project-find-regexp can either.

-- 
Sean Whitton




* Re: Should project delegate project-find-regexp?
  2022-04-07 14:30 ` Dmitry Gutov
  2022-04-07 16:10   ` Ergus
@ 2022-04-08  8:40   ` Joel Reicher
  2022-04-18  3:01     ` Dmitry Gutov
  2022-04-09 23:01   ` Jim Porter
  2 siblings, 1 reply; 11+ messages in thread
From: Joel Reicher @ 2022-04-08  8:40 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: emacs-devel

Dmitry Gutov <dgutov@yandex.ru> writes:

> On 07.04.2022 14:48, Joel Reicher wrote:
>> It seems to me that, at least in the case of git, 'git grep' offers a superior implementation to anything offered by the generic implementation of project-find-regexp.
>
> Last I checked, there was no way to make 'git grep' search in
> untracked files.

There's a --untracked option, at least now.

> And of course we'd have to spend the effort to teach it about
> user-customized error patterns.

Sorry, not sure what you mean by "error patterns"? Can you refer me to some code or docs?

Thanks and regards,

       - Joel




* Re: Should project delegate project-find-regexp?
  2022-04-07 14:30 ` Dmitry Gutov
  2022-04-07 16:10   ` Ergus
  2022-04-08  8:40   ` Joel Reicher
@ 2022-04-09 23:01   ` Jim Porter
  2022-04-18  3:06     ` Dmitry Gutov
  2 siblings, 1 reply; 11+ messages in thread
From: Jim Porter @ 2022-04-09 23:01 UTC (permalink / raw)
  To: Dmitry Gutov, Joel Reicher, emacs-devel

On 4/7/2022 7:30 AM, Dmitry Gutov wrote:
> On 07.04.2022 14:48, Joel Reicher wrote:
>> It seems to me that, at least in the case of git, 'git grep' offers a 
>> superior implementation to anything offered by the generic 
>> implementation of project-find-regexp.
> 
> Last I checked, there was no way to make 'git grep' search in untracked 
> files.

I use `git grep --no-index --exclude-standard', which lets you search in 
untracked files (and non-Git directories too!), but also respects 
`.gitignore'. I think that provides the most similarity to find+grep. Of 
course, there may be some problems with these flags that I haven't 
discovered yet...
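
For anyone who wants to compare the flags mentioned in this thread, here is a small throwaway sketch (demo_untracked is a hypothetical name, and the file names are invented). It assumes git is installed, creates a scratch repository with one tracked, one untracked and one ignored file, and lists which files each variant reports.

```shell
# Build a scratch repository and compare three 'git grep' invocations.
demo_untracked() {
    repo=$(mktemp -d) || return 1
    (
        cd "$repo" || exit 1
        git init -q .
        echo 'needle tracked'   > tracked.txt
        echo 'needle untracked' > untracked.txt
        echo 'needle ignored'   > ignored.log
        echo '*.log' > .gitignore
        git add tracked.txt .gitignore

        echo '--- git grep'                      # tracked files only
        git grep -l -e needle
        echo '--- git grep --untracked'          # plus untracked, minus ignored
        git grep -l --untracked -e needle
        echo '--- git grep --no-index --exclude-standard'
        git grep -l --no-index --exclude-standard -e needle
    )
    rm -rf "$repo"
}
```

Running demo_untracked should show untracked.txt only in the last two listings, with ignored.log in none of them.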

- Jim




* Re: Should project delegate project-find-regexp?
  2022-04-08  8:40   ` Joel Reicher
@ 2022-04-18  3:01     ` Dmitry Gutov
  0 siblings, 0 replies; 11+ messages in thread
From: Dmitry Gutov @ 2022-04-18  3:01 UTC (permalink / raw)
  To: Joel Reicher; +Cc: emacs-devel

On 08.04.2022 11:40, Joel Reicher wrote:
> Dmitry Gutov <dgutov@yandex.ru> writes:
> 
>> On 07.04.2022 14:48, Joel Reicher wrote:
>>> It seems to me that, at least in the case of git, 'git grep' offers a superior implementation to anything offered by the generic implementation of project-find-regexp.
>>
>> Last I checked, there was no way to make 'git grep' search in
>> untracked files.
> 
> There's a --untracked option, at least now.

Thanks, that works. And we could try to support it. Supporting "ignore 
patterns" would require some code duplication, but that's doable. (I 
meant "ignore patterns", not "error patterns"; sorry, that was a typo.)

But I've benchmarked searching through a large project (200000 files), 
and the results seem mixed.

--untracked does slow it down noticeably.

Examples:

$ time git grep -z -e symlinks >/dev/null

________________________________________________________
Executed in    1,11 secs    fish           external
    usr time    2,16 secs  720,00 micros    2,16 secs
    sys time    3,65 secs  192,00 micros    3,65 secs

$ time git grep -z --untracked -e symlinks >/dev/null

________________________________________________________
Executed in    1,81 secs    fish           external
    usr time    2,42 secs    0,00 micros    2,42 secs
    sys time    4,00 secs  938,00 micros    4,00 secs

At the same time, if I pipe the results of 'git ls-files' to ripgrep:

$ time git ls-files -z -c -o --exclude-standard | xargs -0 rg --null --no-messages -g '!*/' -nH -e symlinks >/dev/null

________________________________________________________
Executed in    2,50 secs    fish           external
    usr time    2,91 secs    1,40 millis    2,90 secs
    sys time    3,02 secs    0,37 millis    3,02 secs

...it looks a little worse. But what if I add some forced parallelism?

$ time git ls-files -z -c -o --exclude-standard | xargs -0 -P8 rg --null --no-messages -g '!*/' -nH -e symlinks >/dev/null

________________________________________________________
Executed in    1,08 secs    fish           external
    usr time    4,03 secs    1,50 millis    4,03 secs
    sys time    3,60 secs    0,42 millis    3,60 secs

...it shows better performance. Unfortunately, using the -P argument of 
xargs for grepping is problematic because of synchronization problems, 
but I've written about this on ripgrep's issue tracker 
(https://github.com/BurntSushi/ripgrep/issues/273#issuecomment-1100792783), 
and we might get such a feature there natively someday.

YMMV, but on this machine, at least, this seems to demonstrate that 
'git grep' isn't always better. And its '--threads' argument doesn't 
seem to make any difference.

Now, the default searcher (grep) is a little slower than ripgrep, but at 
least we have a faster option present.

Now, when it comes to Emacs, we also lose a fair amount of time on 
parsing the list of files internally (the output of 'git ls-files') 
before sending it to 'xargs rg' or 'xargs grep'.

There are a few ways to deal with this. Maybe we'd have a generic 
function which constructs the shell command for listing the files (which 
we'd simply concatenate when constructing the shell command for search). 
Or we'd have 'project-files' return some opaque value with a bunch of 
accessors which would allow parsing the list of files lazily, and simply 
reuse the output buffer as input without parsing it (this would save 
~500ms in my measurements in this scenario). Or we'd cache the list of 
files, and cut the whole 1s with that.
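
A rough shell sketch of the last two ideas, reusing the listing output as-is and caching it. Everything here is an assumption for illustration: the helper names, the cache file and the lack of any invalidation are made up, plain grep stands in for the configured search program, and the real thing would live inside Emacs.

```shell
# Hypothetical helpers; names and cache handling are illustrative only.

# Write the NUL-separated file list for directory $1 to cache file $2,
# unless a non-empty cache already exists. $2 should be an absolute path.
refresh_file_list() {
    dir=$1 cache=$2
    [ -s "$cache" ] && return 0
    (cd "$dir" && git ls-files -z -c -o --exclude-standard) > "$cache"
}

# Search the files named in cache file $2 for pattern $3, feeding the raw
# bytes of the list straight to xargs without parsing them first.
search_cached() {
    dir=$1 cache=$2 pattern=$3
    (cd "$dir" && xargs -0 grep -nH -e "$pattern" < "$cache")
}
```

The sketch never refreshes a stale cache, which is exactly the hard part of the caching approach.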

We've discussed some of this before (like the caching thing) but so far 
it's up in the air.

But given the possibility of being able to choose a faster search 
program, I'm not sure about making the search a project method (which 
would lock such projects into one search implementation). I'd rather try 
to work on other inefficiencies first.

Do try installing ripgrep, though. The search program is configured 
through the xref-search-program defcustom.




* Re: Should project delegate project-find-regexp?
  2022-04-09 23:01   ` Jim Porter
@ 2022-04-18  3:06     ` Dmitry Gutov
  0 siblings, 0 replies; 11+ messages in thread
From: Dmitry Gutov @ 2022-04-18  3:06 UTC (permalink / raw)
  To: Jim Porter, Joel Reicher, emacs-devel

On 10.04.2022 02:01, Jim Porter wrote:
> I use `git grep --no-index --exclude-standard', which lets you search in 
> untracked files (and non-Git directories too!), but also respects 
> `.gitignore'. I think that provides the most similarity to find+grep. Of 
> course, there may be some problems with these flags that I haven't 
> discovered yet...

As another experiment, I've tried using 'git grep --no-index' together 
with xargs as another alternative to grep and ripgrep, and its 
performance wasn't great.

Something like 11s versus ripgrep's 2.5s when used without '-P8'.

I suppose we could include this option anyway? For users who don't have 
any of the standard tools installed, only Git.





Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git
