* Re: master 544db1e: Faster grep pattern for identifiers
@ 2021-09-15 15:56 Eli Zaretskii
2021-09-15 16:25 ` Dmitry Gutov
2021-09-15 16:29 ` Mattias Engdegård
0 siblings, 2 replies; 11+ messages in thread
From: Eli Zaretskii @ 2021-09-15 15:56 UTC (permalink / raw)
To: Mattias Engdegård; +Cc: emacs-devel
> branch: master
> commit 544db1ee8679eec9edd5cee81a340ee1c4d70158
> Author: Mattias Engdegård <mattiase@acm.org>
>
> Faster grep pattern for identifiers
>
> * lisp/cedet/semantic/symref/grep.el (semantic-symref-perform-search):
> Use the `-w` flag instead of wrapping the pattern in regexps that make
> matching much slower. This speeds up `xref-find-references` by about
> 3× on macOS.
Doesn't this change the semantics of the "word"? The Grep notion of
the word is not necessarily identical to that of Emacs, since the
latter depends on the major mode. The comment in the deleted code
says that much, AFAICT. Or what am I missing?
* Re: master 544db1e: Faster grep pattern for identifiers
2021-09-15 15:56 master 544db1e: Faster grep pattern for identifiers Eli Zaretskii
@ 2021-09-15 16:25 ` Dmitry Gutov
2021-09-15 16:33 ` Eli Zaretskii
2021-09-15 16:29 ` Mattias Engdegård
1 sibling, 1 reply; 11+ messages in thread
From: Dmitry Gutov @ 2021-09-15 16:25 UTC (permalink / raw)
To: Eli Zaretskii, Mattias Engdegård; +Cc: emacs-devel
On 15.09.2021 18:56, Eli Zaretskii wrote:
>> branch: master
>> commit 544db1ee8679eec9edd5cee81a340ee1c4d70158
>> Author: Mattias Engdegård <mattiase@acm.org>
>>
>> Faster grep pattern for identifiers
>>
>> * lisp/cedet/semantic/symref/grep.el (semantic-symref-perform-search):
>> Use the `-w` flag instead of wrapping the pattern in regexps that make
>> matching much slower. This speeds up `xref-find-references` by about
>> 3× on macOS.
> Doesn't this change the semantics of the "word"? The Grep notion of
> the word is not necessarily identical to that of Emacs, since the
> latter depends on the major mode. The comment in the deleted code
> says that much, AFAICT. Or what am I missing?
Luckily, -w actually corresponds to the regexp which the previous
version of the code was using, rather than to \<...\>, which one might
surmise from reading the docs for some versions of Grep (or Ripgrep).
And the comment in the deleted code was about \< and \>.
The latest Grep manual describes it correctly:
    -w, --word-regexp
        Select only those lines containing matches that form whole
        words. The test is that the matching substring must either be
        at the beginning of the line, or preceded by a non-word
        constituent character. Similarly, it must be either at the end
        of the line or followed by a non-word constituent character.
        Word-constituent characters are letters, digits, and the
        underscore.
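
The test described in the manual excerpt can be sketched roughly as
follows. This is only an illustration in Python, not the code grep or
Emacs uses: the function name `word_match` is made up, and the
word-constituent set is approximated as ASCII letters, digits, and
underscore, whereas a real grep consults the locale.

```python
import re

# ASCII-only approximation of the word-constituent set (assumption).
WORD_CHARS = "A-Za-z0-9_"


def word_match(identifier: str, line: str) -> bool:
    """Return True if `identifier` occurs in `line` with no
    word-constituent character immediately before or after it."""
    pattern = (
        rf"(?<![{WORD_CHARS}])"    # line start, or a non-word char before
        + re.escape(identifier)    # the identifier, taken literally
        + rf"(?![{WORD_CHARS}])"   # line end, or a non-word char after
    )
    return re.search(pattern, line) is not None
```

With this approximation, `word_match("foo", "x = foo + 1")` holds while
`word_match("foo", "foobar")` does not, which is the same acceptance
rule as wrapping the pattern in "not preceded/followed by a word char"
regexps.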
* Re: master 544db1e: Faster grep pattern for identifiers
2021-09-15 16:25 ` Dmitry Gutov
@ 2021-09-15 16:33 ` Eli Zaretskii
2021-09-15 18:06 ` Dmitry Gutov
0 siblings, 1 reply; 11+ messages in thread
From: Eli Zaretskii @ 2021-09-15 16:33 UTC (permalink / raw)
To: Dmitry Gutov; +Cc: mattiase, emacs-devel
> Cc: emacs-devel@gnu.org
> From: Dmitry Gutov <dgutov@yandex.ru>
> Date: Wed, 15 Sep 2021 19:25:25 +0300
>
> The latest Grep manual describes it correctly:
>
>     -w, --word-regexp
>         Select only those lines containing matches that form whole
>         words. The test is that the matching substring must either be
>         at the beginning of the line, or preceded by a non-word
>         constituent character. Similarly, it must be either at the end
>         of the line or followed by a non-word constituent character.
>         Word-constituent characters are letters, digits, and the
>         underscore.
That's only GNU Grep, no?
And what about the "alternative Grep's"?
* Re: master 544db1e: Faster grep pattern for identifiers
2021-09-15 16:33 ` Eli Zaretskii
@ 2021-09-15 18:06 ` Dmitry Gutov
2021-09-15 18:14 ` Eli Zaretskii
0 siblings, 1 reply; 11+ messages in thread
From: Dmitry Gutov @ 2021-09-15 18:06 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: mattiase, emacs-devel
On 15.09.2021 19:33, Eli Zaretskii wrote:
> And what about the "alternative Grep's"?
The author of the commit uses one such Grep.
I also tested with ripgrep, to similar success.
* Re: master 544db1e: Faster grep pattern for identifiers
2021-09-15 18:06 ` Dmitry Gutov
@ 2021-09-15 18:14 ` Eli Zaretskii
2021-09-15 18:39 ` Dmitry Gutov
2021-09-16 7:28 ` master 544db1e: Faster grep pattern for identifiers Omar Polo
0 siblings, 2 replies; 11+ messages in thread
From: Eli Zaretskii @ 2021-09-15 18:14 UTC (permalink / raw)
To: Dmitry Gutov; +Cc: mattiase, emacs-devel
> Cc: mattiase@acm.org, emacs-devel@gnu.org
> From: Dmitry Gutov <dgutov@yandex.ru>
> Date: Wed, 15 Sep 2021 21:06:09 +0300
>
> On 15.09.2021 19:33, Eli Zaretskii wrote:
> > And what about the "alternative Grep's"?
>
> The author of the commit uses one such Grep.
>
> I also tested with ripgrep, to similar success.
So they all have, miraculously, the same notion of what is a word,
regardless of the programming language and the script/character set?
Amazing.
* Re: master 544db1e: Faster grep pattern for identifiers
2021-09-15 18:14 ` Eli Zaretskii
@ 2021-09-15 18:39 ` Dmitry Gutov
2021-09-17 16:07 ` bug#49836: Support ripgrep in semantic-symref-tool-grep Juri Linkov
2021-09-16 7:28 ` master 544db1e: Faster grep pattern for identifiers Omar Polo
1 sibling, 1 reply; 11+ messages in thread
From: Dmitry Gutov @ 2021-09-15 18:39 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: mattiase, emacs-devel
On 15.09.2021 21:14, Eli Zaretskii wrote:
>> Cc: mattiase@acm.org, emacs-devel@gnu.org
>> From: Dmitry Gutov <dgutov@yandex.ru>
>> Date: Wed, 15 Sep 2021 21:06:09 +0300
>>
>> On 15.09.2021 19:33, Eli Zaretskii wrote:
>>> And what about the "alternative Grep's"?
>> The author of the commit uses one such Grep.
>>
>> I also tested with ripgrep, to similar success.
> So they all have, miraculously, the same notion of what is a word,
> regardless of the programming language and the script/character set?
> Amazing.
Not exactly (e.g. Grep includes international chars in the "word" set,
and Ripgrep does not), but the notions of "not word" are compatible
enough for our purposes.
Speaking of Ripgrep, the compatible behavior of -w is only available in
recent versions, starting with 0.10.0 (reported and fixed in
https://github.com/BurntSushi/ripgrep/issues/389).
Debian 10 and Fedora 31 include that version or newer
(https://repology.org/project/ripgrep/versions).
Not that it's really important: we don't support Ripgrep officially.
* Re: master 544db1e: Faster grep pattern for identifiers
2021-09-15 18:14 ` Eli Zaretskii
2021-09-15 18:39 ` Dmitry Gutov
@ 2021-09-16 7:28 ` Omar Polo
1 sibling, 0 replies; 11+ messages in thread
From: Omar Polo @ 2021-09-16 7:28 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: mattiase, emacs-devel, Dmitry Gutov
Eli Zaretskii <eliz@gnu.org> writes:
>> Cc: mattiase@acm.org, emacs-devel@gnu.org
>> From: Dmitry Gutov <dgutov@yandex.ru>
>> Date: Wed, 15 Sep 2021 21:06:09 +0300
>>
>> On 15.09.2021 19:33, Eli Zaretskii wrote:
>> > And what about the "alternative Grep's"?
>>
>> The author of the commit uses one such Grep.
>>
>> I also tested with ripgrep, to similar success.
>
> So they all have, miraculously, the same notion of what is a word,
> regardless of the programming language and the script/character set?
> Amazing.
It seems to be available in OpenBSD's grep:
> -w The expression is searched for as a word (as if surrounded by
> ‘[[:<:]]’ and ‘[[:>:]]’; see re_format(7)).
it also seems to be available on NetBSD and FreeBSD judging from the
manpages.
don't know about other grep implementations and honestly haven't tested
the code (yet).
Cheers,
* Re: master 544db1e: Faster grep pattern for identifiers
2021-09-15 15:56 master 544db1e: Faster grep pattern for identifiers Eli Zaretskii
2021-09-15 16:25 ` Dmitry Gutov
@ 2021-09-15 16:29 ` Mattias Engdegård
1 sibling, 0 replies; 11+ messages in thread
From: Mattias Engdegård @ 2021-09-15 16:29 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: Emacs developers
15 sep. 2021 kl. 17.56 skrev Eli Zaretskii <eliz@gnu.org>:
> Doesn't this change the semantics of the "word"? The Grep notion of
> the word is not necessarily identical to that of Emacs, since the
> latter depends on the major mode. The comment in the deleted code
> says that much, AFAICT. Or what am I missing?
Sorry, I should have written a more descriptive commit message.
First of all, there is no risk of false positives because the grep output is filtered for occurrences of the sought identifier in post-processing. Thus, the only correctness risk is false negatives.
The effect of -w is to reject matches with a word char immediately before or after a match. This is exactly what the previous glued-on regexps did.
Both the old and new approaches are sound with respect to the programming languages they are used for, because what grep considers to be word chars are alphanumeric characters (as determined by the locale) and underline. Thus, a false negative would require an identifier to occur immediately before or after such a character, and the lexical rules for supported languages don't allow that.
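
The two-stage argument above (a coarse grep filter followed by a
stricter post-processing pass) can be illustrated with a small Python
sketch. Everything here is an assumption made for illustration: the
function names, the ASCII-only word set, and the `LISP_SYMBOL_CHARS`
string standing in for the mode-dependent notion of a symbol. It is not
the Emacs implementation.

```python
import re

# Hypothetical stand-in for characters a Lisp-like mode treats as part
# of a symbol in addition to letters and digits.
LISP_SYMBOL_CHARS = "_-*+/<>=!?"


def grep_w(identifier, lines):
    """Stage 1: coarse filter approximating `grep -w` (ASCII word chars)."""
    pat = re.compile(
        r"(?<![A-Za-z0-9_])" + re.escape(identifier) + r"(?![A-Za-z0-9_])"
    )
    return [ln for ln in lines if pat.search(ln)]


def post_filter(identifier, lines, symbol_chars):
    """Stage 2: keep only lines where the identifier is delimited
    according to the language's own symbol characters."""
    cls = re.escape(symbol_chars)  # safe to embed in a character class
    pat = re.compile(
        rf"(?<![A-Za-z0-9{cls}])" + re.escape(identifier) + rf"(?![A-Za-z0-9{cls}])"
    )
    return [ln for ln in lines if pat.search(ln)]


lines = ["(foo 1)", "(foo-bar 2)", "(defun foobar ())", "(setq x foo)"]
stage1 = grep_w("foo", lines)
stage2 = post_filter("foo", stage1, LISP_SYMBOL_CHARS)
```

In this sketch stage 1 lets `(foo-bar 2)` through, because `-` is not a
word character for grep, and stage 2 discards it. Stage 1 only drops
`(defun foobar ())`, where `foo` is followed by a word character, so no
genuine occurrence is lost before post-processing.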
There could be exceptions. For example, ancient Smalltalk used _ as assignment operator because Xerox's character set was based on the 1963 ASCII draft where that code was used for a left-pointing arrow. That wouldn't work with our scheme, now or before.
One might wonder why we use -w at all given the post-processing. It reduces the grep output so that the post-processor isn't overwhelmed by false positives: consider a search for the identifier `i`. That said, -w has a nonzero cost, so omitting it for searches of identifiers above a certain length is likely to be advantageous, especially when the grep tool is slow. We haven't done that at this time.
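
As a rough illustration of the length-based idea (which, as noted, is
not implemented), a command builder could add -w only for short
identifiers. The function name `grep_args` and the threshold of 8
characters are hypothetical choices for this sketch; the thread does
not propose a specific value, and the real command line built by Emacs
contains other options.

```python
from typing import List


def grep_args(identifier: str, min_len_without_w: int = 8) -> List[str]:
    """Build a grep argument list, using -w only for short identifiers.

    The threshold is an arbitrary illustrative value, not a measured one.
    """
    args = ["grep", "-n"]
    if len(identifier) < min_len_without_w:
        # Short names such as `i` would otherwise flood the post-processor.
        args.append("-w")
    # -e keeps an identifier starting with "-" from being read as an option.
    args += ["-e", identifier]
    return args
```

For example, `grep_args("i")` includes -w, while
`grep_args("semantic_symref_perform_search")` omits it and relies on
post-processing alone to reject partial matches.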