bug#73484: 31.0.50; Abolishing etags-regen-file-extensions

unofficial mirror of bug-gnu-emacs@gnu.org 

* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
       [not found]         ` <b8001a72-8fc9-4e4e-a2d7-5da94a92f250@yandex.ru>
@ 2024-09-25 19:27           ` Sean Whitton
  2024-09-25 22:30             ` Dmitry Gutov
  0 siblings, 1 reply; 48+ messages in thread
From: Sean Whitton @ 2024-09-25 19:27 UTC (permalink / raw)
  To: 73484

Hello,

On Wed 25 Sep 2024 at 02:41pm +03, Dmitry Gutov wrote:

> On 25/09/2024 09:21, Sean Whitton wrote:
>>> We would probably also discuss etags' auto-detection and its list of default
>>> extensions, during the next release's development.
>> Okay, cool!  Should we have a bug to track this?

We want to replace etags-regen-file-extensions with enabling etags's
hashbang detection support.  That requires disabling its Fortran
fallback.

-- 
Sean Whitton





^ permalink raw reply	[flat|nested] 48+ messages in thread

* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-09-25 19:27           ` bug#73484: 31.0.50; Abolishing etags-regen-file-extensions Sean Whitton
@ 2024-09-25 22:30             ` Dmitry Gutov
  2024-09-26  7:43               ` Francesco Potortì
  2024-09-29  8:25               ` Eli Zaretskii
  0 siblings, 2 replies; 48+ messages in thread
From: Dmitry Gutov @ 2024-09-25 22:30 UTC (permalink / raw)
  To: Sean Whitton, 73484

Hi!

On 25/09/2024 22:27, Sean Whitton wrote:

> On Wed 25 Sep 2024 at 02:41pm +03, Dmitry Gutov wrote:
> 
>> On 25/09/2024 09:21, Sean Whitton wrote:
>>>> We would probably also discuss etags' auto-detection and its list of default
>>>> extensions, during the next release's development.
>>> Okay, cool!  Should we have a bug to track this?
> 
> We want to replace etags-regen-file-extensions with enabling etags's
> hashbang detection support.  That requires disabling its Fortran
> fallback.

Thanks, a fuller plan would look something like this:

- Implement the --no-fortran-fallback flag in etags (or an environment 
variable, or similar). Use it conditionally in etags-regen-mode.
- Revisit the default lists of extensions that etags recognizes, keeping 
in mind the recent thread where we discussed this - e.g. *.a seems out 
of place for ASM (someone more familiar with assembly dialects, please 
feel free to correct me).
- Add new possible value t to etags-regen-file-extensions, and switch 
the default to it.
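
To make the last item concrete, here is a minimal Python model of how the file filter could treat the proposed value t. It is only a sketch: EXAMPLE_EXTENSIONS is a made-up, shortened list (not the real default), and should_pass_to_etags is a hypothetical name, not a function in etags-regen.el.

```python
import os

# Hypothetical, shortened whitelist used only for this sketch.
EXAMPLE_EXTENSIONS = ["c", "h", "el", "rb", "py"]


def should_pass_to_etags(filename, file_extensions):
    """Return True if FILENAME would be handed to etags.

    file_extensions is either True (modelling the proposed value t: pass
    every file and leave language detection to etags) or a list of
    extensions acting as a whitelist.
    """
    if file_extensions is True:
        return True
    ext = os.path.splitext(filename)[1].lstrip(".")
    return ext in file_extensions
```

With the whitelist, an extensionless script such as bin/script is never seen by etags, so its shebang cannot be examined; with the t value it is passed through and detection is left to etags.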






* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-09-25 22:30             ` Dmitry Gutov
@ 2024-09-26  7:43               ` Francesco Potortì
  2024-09-26 12:18                 ` Dmitry Gutov
  2024-09-29  8:25               ` Eli Zaretskii
  1 sibling, 1 reply; 48+ messages in thread
From: Francesco Potortì @ 2024-09-26  7:43 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: 73484, Sean Whitton

>- Implement the --no-fortran-fallback flag in etags (or an environment 
>variable, or similar). Use it conditionally in etags-regen-mode.

If your purpose is to avoid Etags creating false tags on files whose language it cannot detect, you need to disable all fallbacks, rather than just Fortran.

Sorry if I got lost and missed something.

-- 
fp






* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-09-26  7:43               ` Francesco Potortì
@ 2024-09-26 12:18                 ` Dmitry Gutov
  0 siblings, 0 replies; 48+ messages in thread
From: Dmitry Gutov @ 2024-09-26 12:18 UTC (permalink / raw)
  To: Francesco Potortì; +Cc: 73484, Sean Whitton

On 26/09/2024 10:43, Francesco Potortì wrote:
>> - Implement the --no-fortran-fallback flag in etags (or an environment
>> variable, or similar). Use it conditionally in etags-regen-mode.
> If your purpose is to avoid Etags creating false tags on files whose language it cannot detect, you need to disable all fallbacks, rather than just Fortran.

Yeah, sorry, I guess the next fallback is C?

We'll want to disable both, so the flag would be --no-fallbacks, I guess.






* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-09-25 22:30             ` Dmitry Gutov
  2024-09-26  7:43               ` Francesco Potortì
@ 2024-09-29  8:25               ` Eli Zaretskii
  2024-09-29 10:56                 ` Eli Zaretskii
  2024-09-30 23:19                 ` Dmitry Gutov
  1 sibling, 2 replies; 48+ messages in thread
From: Eli Zaretskii @ 2024-09-29  8:25 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: 73484, spwhitton

> Date: Thu, 26 Sep 2024 01:30:55 +0300
> From: Dmitry Gutov <dmitry@gutov.dev>
> 
> > We want to replace etags-regen-file-extensions with enabling etags's
> > hashbang detection support.  That requires disabling its Fortran
> > fallback.
> 
> Thanks, a fuller plan would look something like this:
> 
> - Implement the --no-fortran-fallback flag in etags (or an environment 
> variable, or similar). Use it conditionally in etags-regen-mode.
> - Revisit the default lists of extensions that etags recognizes, keeping 
> in mind the recent thread where we discussed this - e.g. *.a seems out 
> of place for ASM (someone more familiar with assembly dialects, please 
> feel free to correct me).
> - Add new possible value t to etags-regen-file-extensions, and switch 
> the default to it.

I understand that we need to disable the Fortran and C fallbacks to
avoid false positives, but what do we want to do if the fallbacks are
disabled and no suitable language parser is found using the file name?
Just skip the file and do nothing? emit a warning? something else?

I also don't understand why enabling etags' shebang detection
requires disabling the Fortran and C fallbacks: etags looks for a
shebang _before_ it falls back to Fortran and C, so what am I missing?






* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-09-29  8:25               ` Eli Zaretskii
@ 2024-09-29 10:56                 ` Eli Zaretskii
  2024-09-29 17:15                   ` Francesco Potortì
  2024-09-30 23:19                 ` Dmitry Gutov
  1 sibling, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2024-09-29 10:56 UTC (permalink / raw)
  To: dmitry; +Cc: 73484, spwhitton

> Cc: 73484@debbugs.gnu.org, spwhitton@spwhitton.name
> Date: Sun, 29 Sep 2024 11:25:45 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> 
> I understand that we need to disable the Fortran and C fallbacks to
> avoid false positives, but what do we want to do if the fallbacks are
> disabled and no suitable language parser is found using the file name?
> Just skip the file and do nothing? emit a warning? something else?

Wait a minute... we already have "--language=none", which means only
do regexp processing, if any.  If no regexps were specified, 'none'
produces a single entry for a file, stating just its name, like this:

  ^L
  foo,0

where ^L is a literal \f character.  Is the intent here to prevent
even that from being written to TAGS?  If not, then we don't need any
new command-line option; instead, etags-regen could simply pass the
"--language=none" option before each file with no extension, and be
done, no?

Or maybe this is "the missing link" between this and the shebang
processing?
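
For readers unfamiliar with the TAGS layout, the entry shown above can be modelled in a few lines of Python. This is a sketch of the file format only, with hypothetical helper names; it does not call etags.

```python
def empty_tags_section(filename):
    """Build the TAGS section for a file that produced no tags."""
    # A section starts with a form feed on its own line, followed by
    # "NAME,SIZE", where SIZE is the byte length of the tag lines that
    # follow (0 here, because there are none).
    return "\f\n" + filename + ",0\n"


def files_in_tags(tags_text):
    """Return the file names mentioned in TAGS section headers."""
    sections = tags_text.split("\f\n")[1:]
    # The header is the first line of each section; the size follows the
    # last comma, so split from the right.
    return [s.split("\n", 1)[0].rsplit(",", 1)[0] for s in sections]
```

A file that only has such a header still counts as "mentioned in TAGS", which is what commands that walk the list of files in the tags table depend on.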


* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-09-29 10:56                 ` Eli Zaretskii
@ 2024-09-29 17:15                   ` Francesco Potortì
  0 siblings, 0 replies; 48+ messages in thread
From: Francesco Potortì @ 2024-09-29 17:15 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dmitry, 73484, spwhitton

Eli Zaretskii:
>> I understand that we need to disable the Fortran and C fallbacks to
>> avoid false positives, but what do we want to do if the fallbacks are
>> disabled and no suitable language parser is found using the file name?
>> Just skip the file and do nothing? emit a warning? something else?

Eli Zaretskii:
>Wait a minute... we already have "--language=none", which means only
>do regexp processing, if any.  If no regexps were specified, 'none'
>produces a single entry for a file, stating just its name, like this:
>
>  ^L
>  foo,0
>
>where ^L is a literal \f character.  Is the intent here to prevent
>even that from being written to TAGS?  If not, then we don't need any
>new command-line option; instead, etags-regen could simply pass the
>"--language=none" option before each file with no extension, and be
>done, no?
>
>Or maybe this is "the missing link" between this and the shebang
>processing?

If you set language=none for files whose extension is unknown to Etags, then you give up on shebang processing.  If you do not set language=none and Etags does not recognise any shebang, it defaults to Fortran.  If it does not find any Fortran tags, it defaults to C/C++.  When default processing happens on a file which is neither Fortran nor C/C++, it usually generates no tags, but may occasionally generate fake tags.

AFAIU, the problem is that there are use cases when you have to feed Etags with files that should generate no tags, yet the occasional fake tags are not tolerable.
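
As a rough illustration of the order described above (known extension first, then the shebang, then the Fortran and C/C++ fallbacks), here is a small Python model. The tables are made up for the sketch and are far smaller than what etags.c uses, and fallbacks=False stands in for the proposed option, which does not exist yet.

```python
import os

# Illustrative tables only; the real ones in etags.c are much larger.
EXTENSION_LANGUAGES = {".c": "c", ".h": "c", ".el": "lisp", ".py": "python"}
INTERPRETER_LANGUAGES = {"sh": "shell", "python3": "python", "perl": "perl"}


def detect_language(filename, first_line, fallbacks=True):
    """Model the detection order: extension, shebang, then fallbacks."""
    ext = os.path.splitext(filename)[1]
    if ext in EXTENSION_LANGUAGES:
        return EXTENSION_LANGUAGES[ext]
    if first_line.startswith("#!"):
        words = first_line[2:].split()
        if words:
            interpreter = os.path.basename(words[0])
            if interpreter in INTERPRETER_LANGUAGES:
                return INTERPRETER_LANGUAGES[interpreter]
    if fallbacks:
        # Fortran is tried first; C/C++ follows if no Fortran tags appear.
        # This is the step that can produce fake tags.
        return "fortran-then-c"
    return "none"
```

Forcing language=none up front would skip the shebang branch entirely, which is why it cannot replace an option that only turns off the final step.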


* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-09-29  8:25               ` Eli Zaretskii
  2024-09-29 10:56                 ` Eli Zaretskii
@ 2024-09-30 23:19                 ` Dmitry Gutov
  2024-10-01 15:00                   ` Eli Zaretskii
  2024-10-02 11:28                   ` Eli Zaretskii
  1 sibling, 2 replies; 48+ messages in thread
From: Dmitry Gutov @ 2024-09-30 23:19 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 73484, spwhitton

On 29/09/2024 11:25, Eli Zaretskii wrote:
> I understand that we need to disable the Fortran and C fallbacks to
> avoid false positives, but what do we want to do if the fallbacks are
> disabled and no suitable language parser is found using the file name?
> Just skip the file and do nothing? emit a warning? something else?

Just do nothing. We'd really want to delegate language detection to 
etags rather than doing it inside Elisp - the latter is slower and 
ultimately more limited. But for that etags needs to have a reliable 
detection logic, one without too many false positives (and IME false 
positives here are worse than false negatives, because scanning too much 
can often mean both wrong tags and long scans, and a completion table 
that gets too large because of bogus tags).

For shebangs in particular, however, see Francesco's very good 
explanation. And detecting shebangs in Lisp would not be practical -- 
too slow.


* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-09-30 23:19                 ` Dmitry Gutov
@ 2024-10-01 15:00                   ` Eli Zaretskii
  2024-10-01 22:01                     ` Dmitry Gutov
  2024-10-02 11:28                   ` Eli Zaretskii
  1 sibling, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2024-10-01 15:00 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: 73484, spwhitton

> Date: Tue, 1 Oct 2024 02:19:17 +0300
> Cc: spwhitton@spwhitton.name, 73484@debbugs.gnu.org
> From: Dmitry Gutov <dmitry@gutov.dev>
> 
> On 29/09/2024 11:25, Eli Zaretskii wrote:
> > I understand that we need to disable the Fortran and C fallbacks to
> > avoid false positives, but what do we want to do if the fallbacks are
> > disabled and no suitable language parser is found using the file name?
> > Just skip the file and do nothing? emit a warning? something else?
> 
> Just do nothing. We'd really want to delegate language detection to 
> etags rather than doing it inside Elisp - the latter is slower and 
> ultimately more limited. But for that etags needs to have a reliable 
> detection logic, one without too many false positives (and IME false 
> positives here are worse than false negatives, because scanning too much 
> can often mean both wrong tags and long scans, and a completion table 
> that gets too large because of bogus tags).

I'm not sure I understand: if you worry about performance, then
disabling fallbacks will not eliminate all of the cases where etags
scans the entire file or at least some of its portions.

Can you explain to me again what exactly is the problem with the
fallbacks in the context of etags-regen?






* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-01 15:00                   ` Eli Zaretskii
@ 2024-10-01 22:01                     ` Dmitry Gutov
  0 siblings, 0 replies; 48+ messages in thread
From: Dmitry Gutov @ 2024-10-01 22:01 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 73484, spwhitton

On 01/10/2024 18:00, Eli Zaretskii wrote:

>> Just do nothing. We'd really want to delegate language detection to
>> etags rather than doing it inside Elisp - the latter is slower and
>> ultimately more limited. But for that etags needs to have a reliable
>> detection logic, one without too many false positives (and IME false
>> positives here are worse than false negatives, because scanning too much
>> can often mean both wrong tags and long scans, and a completion table
>> that gets too large because of bogus tags).
> 
> I'm not sure I understand: if you worry about performance, then
> disabling fallbacks will not eliminate all of the cases where etags
> scans the entire file or at least some of its portions.

etags's scanning should still be faster than doing it in Lisp, or the 
subsequent calls to tags-completion-table or etags--xref-find-definitions.

Further, the last function would repeatedly search through the tags 
file, so it's important to keep the accuracy of etags' scanner high: 
no incorrectly recognized files and no wrong index entries.

> Can you explain to me again what exactly is the problem with the
> fallbacks in the context of etags-regen?

We've talked about this before, here's my previous reply: 
https://lists.gnu.org/archive/html/emacs-devel/2018-01/msg00387.html

I don't have the same experiment at hand, but the past me seems to be 
saying that scanning files incorrectly can also make the whole scan take 
considerably longer. It also makes the resulting file bigger, which in 
turn makes parsing it from Emacs slower, and so on.


* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-09-30 23:19                 ` Dmitry Gutov
  2024-10-01 15:00                   ` Eli Zaretskii
@ 2024-10-02 11:28                   ` Eli Zaretskii
  2024-10-02 18:00                     ` Dmitry Gutov
  1 sibling, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2024-10-02 11:28 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: 73484, spwhitton

> Date: Tue, 1 Oct 2024 02:19:17 +0300
> Cc: spwhitton@spwhitton.name, 73484@debbugs.gnu.org
> From: Dmitry Gutov <dmitry@gutov.dev>
> 
> On 29/09/2024 11:25, Eli Zaretskii wrote:
> > I understand that we need to disable the Fortran and C fallbacks to
> > avoid false positives, but what do we want to do if the fallbacks are
> > disabled and no suitable language parser is found using the file name?
> > Just skip the file and do nothing? emit a warning? something else?
> 
> Just do nothing.

Doing nothing means the file's name will not appear at all in TAGS.  I
don't think that's TRT, since every file submitted to etags should be
mentioned in TAGS for the benefit of tags-search and similar features.

So I currently tend to modify etags such that if no language was
detected by the file's name/extension, and this new no-fallbacks
option was specified, etags will behave as if given --language=none
(which also means that if any regexps were specified, they will be
processed correctly for such files).  If no regexps were specified or
none matched, this means only the file's name will appear in TAGS, and
that's all.

If the above is not a good plan for some reason, feel free to holler.
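
The plan above can be modelled as follows. This is a hedged Python sketch of the intended output, not of etags' implementation: the function name is hypothetical, regexps are plain Python regular expressions rather than etags --regex arguments, and only the simple "text, DEL, line, offset" form of a tag line is produced.

```python
import re


def tags_section_for_undetected(filename, text, regexps):
    """Model the section written for a file with no detected language.

    With fallbacks disabled, the file is treated as if --language=none
    had been given: only the user's regexps are applied, and the section
    header is written even when nothing matches.
    """
    tag_lines = []
    offset = 0  # byte offset of the current line within the file
    for lineno, line in enumerate(text.splitlines(keepends=True), start=1):
        for rx in regexps:
            m = re.search(rx, line)
            if m:
                # Tag line: matched text, DEL, line number, comma, offset.
                tag_lines.append(f"{m.group(0)}\x7f{lineno},{offset}\n")
                break
        offset += len(line.encode())
    body = "".join(tag_lines)
    return f"\f\n{filename},{len(body.encode())}\n{body}"
```

With no regexps, or none that match, the result is just the header with a size of 0, so the file name still appears in TAGS.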


* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-02 11:28                   ` Eli Zaretskii
@ 2024-10-02 18:00                     ` Dmitry Gutov
  2024-10-02 18:56                       ` Eli Zaretskii
  0 siblings, 1 reply; 48+ messages in thread
From: Dmitry Gutov @ 2024-10-02 18:00 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 73484, spwhitton

On 02/10/2024 14:28, Eli Zaretskii wrote:
>> Date: Tue, 1 Oct 2024 02:19:17 +0300
>> Cc: spwhitton@spwhitton.name, 73484@debbugs.gnu.org
>> From: Dmitry Gutov <dmitry@gutov.dev>
>>
>> On 29/09/2024 11:25, Eli Zaretskii wrote:
>>> I understand that we need to disable the Fortran and C fallbacks to
>>> avoid false positives, but what do we want to do if the fallbacks are
>>> disabled and no suitable language parser is found using the file name?
>>> Just skip the file and do nothing? emit a warning? something else?
>>
>> Just do nothing.
> 
> Doing nothing means the file's name will not appear at all in TAGS.  I
> don't think that's TRT, since every file submitted to etags should be
> mentioned in TAGS for the benefit of tags-search and similar features.

Hmm, maybe another flag, then?

Including many unrelated files would just bloat the tags file for little 
reason. And unlike manual generation, it's not like the user asked for 
all of them to be included.

> So I currently tend to modify etags such that if no language was
> detected by the file's name/extension, and this new no-fallbacks
> option was specified, etags will behave as if given --language=none
> (which also means that if any regexps were specified, they will be
> processed correctly for such files).

Any regexps for "all" files, right? For our etags-regen configuration in 
the Emacs repo, for example, we add 2 regexps, but for specific file 
types only.

If regexps are configured for 'none', and they match something, 
certainly the file should be in the index.

> If no regexps were specified or
> none matched, this means only the file's name will appear in TAGS, and
> that's all.

...but if there are no matches I'd prefer the files to be skipped. At 
least the files detected as type 'none', anyway.






* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-02 18:00                     ` Dmitry Gutov
@ 2024-10-02 18:56                       ` Eli Zaretskii
  2024-10-02 22:03                         ` Dmitry Gutov
  0 siblings, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2024-10-02 18:56 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: 73484, spwhitton

> Date: Wed, 2 Oct 2024 21:00:58 +0300
> Cc: spwhitton@spwhitton.name, 73484@debbugs.gnu.org
> From: Dmitry Gutov <dmitry@gutov.dev>
> 
> On 02/10/2024 14:28, Eli Zaretskii wrote:
> >> Date: Tue, 1 Oct 2024 02:19:17 +0300
> >> Cc: spwhitton@spwhitton.name, 73484@debbugs.gnu.org
> >> From: Dmitry Gutov <dmitry@gutov.dev>
> >>
> >> Just do nothing.
> > 
> > Doing nothing means the file's name will not appear at all in TAGS.  I
> > don't think that's TRT, since every file submitted to etags should be
> > mentioned in TAGS for the benefit of tags-search and similar features.
> 
> Hmm, maybe another flag, then?
> 
> Including many unrelated files would just bloat the tags file for little 
> reason. And unlike manual generation, it's not like the user asked for 
> all of them to be included.

What do we tell users of tags-search and its ilk?

> > So I currently tend to modify etags such that if no language was
> > detected by the file's name/extension, and this new no-fallbacks
> > option was specified, etags will behave as if given --language=none
> > (which also means that if any regexps were specified, they will be
> > processed correctly for such files).
> 
> Any regexps for "all" files, right?

The rules for regexps don't change: each regexp applies to the files
that follow it on the command line.
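
A small Python model of that ordering rule, for illustration only. It handles just the --regex=REGEXP and --no-regex spellings and treats every other argument as a file name; the function name is hypothetical.

```python
def regexps_per_file(argv):
    """Map each file argument to the regexps in effect when it is reached."""
    active = []
    result = {}
    for arg in argv:
        if arg.startswith("--regex="):
            # A regexp applies to the files that follow it.
            active.append(arg[len("--regex="):])
        elif arg == "--no-regex":
            # Stop regexp matching for the files that follow.
            active = []
        else:
            result[arg] = list(active)
    return result
```

So for the arguments a.c --regex=/foo/ b.c, only b.c is scanned with /foo/.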

> ...but if there are no matches I'd prefer the files to be skipped. The 
> files detected as type 'none' anyway.

I don't like this, and I think this is misguided.  I also don't see
any special problem with having lines that name files in TAGS, it
isn't like the size of TAGS will grow significantly or its processing
will be significantly slower.  IOW, this sounds like a clear case of
premature optimization.






* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-02 18:56                       ` Eli Zaretskii
@ 2024-10-02 22:03                         ` Dmitry Gutov
  2024-10-03  6:27                           ` Eli Zaretskii
  0 siblings, 1 reply; 48+ messages in thread
From: Dmitry Gutov @ 2024-10-02 22:03 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 73484, spwhitton

On 02/10/2024 21:56, Eli Zaretskii wrote:
>> Date: Wed, 2 Oct 2024 21:00:58 +0300
>> Cc: spwhitton@spwhitton.name, 73484@debbugs.gnu.org
>> From: Dmitry Gutov <dmitry@gutov.dev>
>>
>> On 02/10/2024 14:28, Eli Zaretskii wrote:
>>>> Date: Tue, 1 Oct 2024 02:19:17 +0300
>>>> Cc: spwhitton@spwhitton.name, 73484@debbugs.gnu.org
>>>> From: Dmitry Gutov <dmitry@gutov.dev>
>>>>
>>>> Just do nothing.
>>>
>>> Doing nothing means the file's name will not appear at all in TAGS.  I
>>> don't think that's TRT, since every file submitted to etags should be
>>> mentioned in TAGS for the benefit of tags-search and similar features.
>>
>> Hmm, maybe another flag, then?
>>
>> Including many unrelated files would just bloat the tags file for little
>> reason. And unlike manual generation, it's not like the user asked for
>> all of them to be included.
> 
> What do we tell users of tags-search and its ilk?

We can consider how most of such users' indexes look. See below.

>>> So I currently tend to modify etags such that if no language was
>>> detected by the file's name/extension, and this new no-fallbacks
>>> option was specified, etags will behave as if given --language=none
>>> (which also means that if any regexps were specified, they will be
>>> processed correctly for such files).
>>
>> Any regexps for "all" files, right?
> 
> The rules for regexps don't change: each regexp applies to the files
> that follow it on the command line.

This seems okay.

>> ...but if there are no matches I'd prefer the files to be skipped. The
>> files detected as type 'none' anyway.
> 
> I don't like this, and I think this is misguided.  I also don't see
> any special problem with having lines that name files in TAGS, it
> isn't like the size of TAGS will grow significantly or its processing
> will be significantly slower.  IOW, this sounds like a clear case of
> premature optimization.

I could do some experiments, if you post preliminary support of that 
flag, with "empty" files in TAGS and without.

But here's how I'm looking at it:

Imagine a straightforward C project, one that has .c files, .h, maybe 
.y, and also a bunch of docs, build artefacts (some of them checked in), 
and maybe other data files as well. Also README, ChangeLog, Makefile, 
config.bat, some .txt files, many other files without extensions, etc.

Previously, when building a TAGS file manually, a developer in such a 
project specified a list of file globs by hand. One that would be 
limited to .[ch] files, and maybe .y as well, but not all the files in 
the directory.

To use Emacs itself as an example, the 'tags' target in our own Makefile 
only includes .[hc], .m, .cc, .el and (surprising to me) .texi files. 
But not any of the others. The number of such files is ~3K, if I'm 
counting correctly.

The total number of all non-ignored files in our repo is ~5K. That's 2K 
more files that would be present in the 'M-x tags-search' or 'M-x 
list-tags' outputs, if an Emacs developer simply switches to using 
etags-regen-mode, and etags-regen-mode drops the file extensions 
whitelist, and etags keeps all passed files' names in its output.
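
The comparison above can be reproduced for any project with a short script. This is a sketch under stated assumptions: the extension set mirrors the ones named for the Makefile 'tags' target in this message, the sample paths are invented, and the ~3K and ~5K figures above were not produced with it.

```python
import os

# Extensions named above for the Makefile 'tags' target.
MAKEFILE_EXTENSIONS = {"h", "c", "m", "cc", "el", "texi"}


def count_tags_files(paths, whitelist):
    """Return (whitelisted, total) counts for a list of project files."""
    listed = sum(
        1 for p in paths
        if os.path.splitext(p)[1].lstrip(".") in whitelist
    )
    return listed, len(paths)


# Invented sample; a real run would use the project's non-ignored files.
sample = ["src/xdisp.c", "src/lisp.h", "lisp/files.el", "README",
          "ChangeLog.1", "doc/emacs/emacs.texi", "etc/NEWS"]
listed, total = count_tags_files(sample, MAKEFILE_EXTENSIONS)
extra = total - listed  # files that appear only once the whitelist is dropped
```

The difference between the two counts is the number of additional file names that would show up if every passed file is kept in TAGS.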


* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-02 22:03                         ` Dmitry Gutov
@ 2024-10-03  6:27                           ` Eli Zaretskii
  2024-10-04  1:25                             ` Dmitry Gutov
  0 siblings, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2024-10-03  6:27 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: 73484, spwhitton

> Date: Thu, 3 Oct 2024 01:03:14 +0300
> Cc: spwhitton@spwhitton.name, 73484@debbugs.gnu.org
> From: Dmitry Gutov <dmitry@gutov.dev>
> 
> >> ...but if there are no matches I'd prefer the files to be skipped. The
> >> files detected as type 'none' anyway.
> > 
> > I don't like this, and I think this is misguided.  I also don't see
> > any special problem with having lines that name files in TAGS, it
> > isn't like the size of TAGS will grow significantly or its processing
> > will be significantly slower.  IOW, this sounds like a clear case of
> > premature optimization.
> 
> I could do some experiments, if you post preliminary support of that 
> flag, with "empty" files in TAGS and without.

OK.

> But here's how I'm looking at it:
> 
> Imagine a straightforward C project, one that has .c files, .h, maybe 
> .y, and also a bunch of docs, build artefacts (some of them checked in), 
> and maybe other data files as well. Also README, ChangeLog, Makefile, 
> config.bat, some .txt files, many other files without extensions, etc.
> 
> Previously, when building a TAGS file manually, a developer in such a 
> project specified a list of file globs by hand. One that would be 
> limited to .[ch] files, and maybe .y as well, but not all the files in 
> the directory.

If they definitely do NOT want the other files to be present in TAGS,
they can keep using those globs.  Nothing will change in that case.

> To use Emacs itself as an example, the 'tags' target in our own Makefile 
> only includes .[hc], .m, .cc, .el and (surprising to me) .texi files. 
> But not any of the others. The number of such files is ~3K, if I'm 
> counting correctly.
> 
> The total number of all non-ignored files in our repo is ~5K. That's 2K 
> more files that would be present in the 'M-x tags-search' or 'M-x 
> list-tags' outputs, if an Emacs developer simply switches to using 
> etags-regen-mode, and etags-regen-mode drops the file extensions 
> whitelist, and etags keeps all passed files' names in its output.

OTOH, if a file with a known extension has no taggable symbols, you
still get its file name in TAGS.  So omitting files whose language we
could not recognize would be an incompatible change in behavior.

The fact that in the scenario you describe above 2K more files will
appear in tags-search is, from my POV, an argument _for_ including
them, not against: we have no reason to assume that users don't want
to search those files for some regexp, because regexps specified in
tags-search don't necessarily have anything to do with the identifiers
we tag.  A valid case in point is to look up all references to some
file when the file is deleted, or references to some version when the
version is updated: we definitely want files like README and INSTALL
to be included in the search.






* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-03  6:27                           ` Eli Zaretskii
@ 2024-10-04  1:25                             ` Dmitry Gutov
  2024-10-04  6:45                               ` Eli Zaretskii
  0 siblings, 1 reply; 48+ messages in thread
From: Dmitry Gutov @ 2024-10-04  1:25 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 73484, spwhitton

On 03/10/2024 09:27, Eli Zaretskii wrote:

>> But here's how I'm looking at it:
>>
>> Imagine a straightforward C project, one that has .c files, .h, maybe
>> .y, and also a bunch of docs, build artefacts (some of them checked in),
>> and maybe other data files as well. Also README, ChangeLog, Makefile,
>> config.bat, some .txt files, many other files without extensions, etc.
>>
>> Previously, when building a TAGS file manually, a developer in such a
>> project specified a list of file globs by hand. One that would be
>> limited to .[ch] files, and maybe .y as well, but not all the files in
>> the directory.
> 
> If they definitely do NOT want the other files to be present in TAGS,
> they can keep using those globs.  Nothing will change in that case.

a) They would have to produce the same list of file extensions that we 
are using now, and they will need to find out which variable to 
customize, to set to that list.

b) They won't get the shebang detection capability, unless we add a new 
option where they will have to enumerate all their shebang-enabled file 
names as well.

So it seems like they would have to choose between the one and the 
other, with the end behavior that I'm describing not being supported 
by any combination of user options.

>> To use Emacs itself as an example, the 'tags' target in our own Makefile
>> only includes .[hc], .m, .cc, .el and (surprising to me) .texi files.
>> But not any of the others. The number of such files is ~3K, if I'm
>> counting correctly.
>>
>> The total number of all non-ignored files in our repo is ~5K. That's 2K
>> more files that would be present in the 'M-x tags-search' or 'M-x
>> list-tags' outputs, if an Emacs developer simply switches to using
>> etags-regen-mode, and etags-regen-mode drops the file extensions
>> whitelist, and etags keeps all passed files' names in its output.
> 
> OTOH, if a file with a known extension has no taggable symbols, you
> still get its file name in TAGS.  So omitting files whose language we
> could not recognize would be an incompatible change in behavior.

Incompatible change in etags' behavior, but likely a more compatible 
change in the behavior of the default Emacs.

For etags, though, we could add an opt-in flag.

> The fact that in the scenario you describe above 2K more files will
> appear in tags-search is, from my POV, an argument _for_ including
> them, not against: we have no reason to assume that users don't want
> to search those files for some regexp, because regexps specified in
> tags-search don't necessarily have anything to do with the identifiers
> we tag.  A valid case in point is to look up all references to some
> file when the file is deleted, or references to some version when the
> version is updated: we definitely want files like README and INSTALL
> to be included in the search.

I would hope that project-find-regexp works well enough for that. Or 
'M-x project-search' for the fans of the classic interface.

README and INSTALL are not currently included in TAGS. You seem to be 
making a case that all files in our dev repository should be included, 
but for some reason the current build rules are very different?






* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-04  1:25                             ` Dmitry Gutov
@ 2024-10-04  6:45                               ` Eli Zaretskii
  2024-10-04 23:01                                 ` Dmitry Gutov
  0 siblings, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2024-10-04  6:45 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: 73484, spwhitton

> Date: Fri, 4 Oct 2024 04:25:15 +0300
> Cc: spwhitton@spwhitton.name, 73484@debbugs.gnu.org
> From: Dmitry Gutov <dmitry@gutov.dev>
> 
> >> Previously, when building a TAGS file manually, a developer in such a
> >> project specified a list of file globs by hand. One that would be
> >> limited to .[ch] files, and maybe .y as well, but not all the files in
> >> the directory.
> > 
> > If they definitely do NOT want the other files to be present in TAGS,
> > they can keep using those globs.  Nothing will change in that case.
> 
> a) They would have to produce the same list of file extensions that we 
> are using now, and they will need to find out which variable to 
> customize, to set to that list.
> 
> b) They won't get the shebang detection capability, unless we add a new 
> option where they will have to enumerate all their shebang-enabled file 
> names as well.
> 
> So it seems like they would have to choose between the one and the 
> other, with the end behavior that I'm describing not being supported 
> even by any combination of user options.

They will need to choose only if they want improvements.  To have the
same behavior, with the same downsides as before, they need not change
anything.  IOW, the change I propose does no harm to those projects.

And if shebang detection is desired, the choice is quite obvious, if
you ask me: submit all the files.  The downside is making TAGS larger
and having more file names in it, which I think is a very small
downside, if at all, compared to advantages.

So once again, I think this is a premature optimization.  The downside
of a larger TAGS will only have tangible effects in huge trees.

> > The fact that in the scenario you describe above 2K more files will
> > appear in tags-search is, from my POV, an argument _for_ including
> > them, not against: we have no reason to assume that users don't want
> > to search those files for some regexp, because regexps specified in
> > tags-search don't necessarily have anything to do with the identifiers
> > we tag.  A valid case in point is to look up all references to some
> > file when the file is deleted, or references to some version when the
> > version is updated: we definitely want files like README and INSTALL
> > to be included in the search.
> 
> I would hope that project-find-regexp works well enough for that. Or 
> 'M-x project-search' for the fans of the classic interface.

Maybe, but we do still want to keep tags-search, so the existence of
other commands doesn't invalidate my argument above.

> README and INSTALL are not currently included in TAGS. You seem to be 
> making a case that all files in our dev repository should be included, 
> but for some reason the current build rules are very different?

I'm not talking specifically about Emacs, because README and INSTALL
are typically present in many packages.  In our case, we don't pass
them to etags for historical reasons (we have admin/*.el stuff to help
us modify the version string in all the files that reference it, for
example), but it is quite plausible that if we had this option back
then, we could have used etags to help.  For example, one downside of
what we have in admin.el is that the list of files to edit when we
bump the version is maintained by hand, which is error-prone: we just
had an instance of this when exec/configure.ac was added and we forgot
to update admin.el accordingly.  Using etags would have allowed us to
avoid such problems.

If we want a separate optional behavior that prevents files with no
tags from being mentioned in TAGS, I'd argue that such an option
should affect all the scanned files, not just those whose language
could not be determined from their names.


* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-04  6:45                               ` Eli Zaretskii
@ 2024-10-04 23:01                                 ` Dmitry Gutov
  2024-10-05  7:02                                   ` Eli Zaretskii
  0 siblings, 1 reply; 48+ messages in thread
From: Dmitry Gutov @ 2024-10-04 23:01 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 73484, spwhitton

On 04/10/2024 09:45, Eli Zaretskii wrote:

> They will need to choose only if they want improvements.  To have the
> same behavior, with the same downsides as before, they need not change
> anything.  IOW, the change I propose does no harm to those projects.

We did talk about changing the default of etags-regen-file-extensions to 
t. I suppose it's debatable.

> And if shebang detection is desired, the choice is quite obvious, if
> you ask me: submit all the files.  The downside is making TAGS larger
> and having more file names in it, which I think is a very small
> downside, if at all, compared to advantages.
> 
> So once again, I think this is a premature optimization.  The downside
> of a larger TAGS will only have tangible effects in huge trees.

FWIW, TAGS for gecko-dev (Mozilla's repository which I have here for 
testing) takes ~30 seconds to generate and ~400ms to find a definition 
for the set of files to scan that I currently have set up. Both timings 
seem quite impactful for user experience. I imagine some Emacs users 
work at Mozilla, though that's only a guess.

If someone were to provide a patch for etags with new functionality 
(disabling fallbacks, at least), I could benchmark and come back with 
numbers. And if experimental flags are available, with numbers for those 
as well.

>>> The fact that in the scenario you describe above 2K more files will
>>> appear in tags-search is, from my POV, an argument _for_ including
>>> them, not against: we have no reason to assume that users don't want
>>> to search those files for some regexp, because regexps specified in
>>> tags-search don't necessarily have anything to do with the identifiers
>>> we tag.  A valid case in point is to look up all references to some
>>> file when the file is deleted, or references to some version when the
>>> version is updated: we definitely want files like README and INSTALL
>>> to be included in the search.
>>
>> I would hope that project-find-regexp works well enough for that. Or
>> 'M-x project-search' for the fans of the classic interface.
> 
> Maybe, but we do still want to keep tags-search, so the existence of
> other commands doesn't invalidate my argument above.

In my mind, tags-search is for files that are code-related. Actual users 
might differ, though.

>> README and INSTALL are not currently included in TAGS. You seem to be
>> making a case that all files in our dev repository should be included,
>> but for some reason the current build rules are very different?
> 
> I'm not talking specifically about Emacs, because README and INSTALL
> are typically present in many packages.  In our case, we don't pass
> them to etags for historical reasons (we have admin/*.el stuff to help
> us modify the version string in all the files that reference it, for
> example), but it is quite plausible that if we had this option back
> then, we could have used etags to help.  For example, one downside of
> what we have in admin.el is that the list of files to edit when we
> bump the version is maintained by hand, which is error-prone: we just
> had an instance of this when exec/configure.ac was added and we forgot
> to update admin.el accordingly.  Using etags would have allowed us to
> avoid such problems.

Some other aspects of having more false positives would come up as a 
result, probably. But it might be worth testing.

> If we want a separate optional behavior that prevents files with no
> tags from being mentioned in TAGS, I'd argue that such an option
> should affect all the scanned files, not just those whose language
> could not be determined from their names.

I don't have a strong opinion here, just that it would depart from my 
mental model mentioned above, of having all code-related files listed. 
For example by missing some newly added .c file where no function 
definitions have been added yet; 'M-x tags-search' would skip it.

If that makes sense to you, okay.






* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-04 23:01                                 ` Dmitry Gutov
@ 2024-10-05  7:02                                   ` Eli Zaretskii
  2024-10-05 14:29                                     ` Dmitry Gutov
  0 siblings, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2024-10-05  7:02 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: 73484, spwhitton

> Date: Sat, 5 Oct 2024 02:01:14 +0300
> Cc: spwhitton@spwhitton.name, 73484@debbugs.gnu.org
> From: Dmitry Gutov <dmitry@gutov.dev>
> 
> On 04/10/2024 09:45, Eli Zaretskii wrote:
> 
> > So once again, I think this is a premature optimization.  The downside
> > of a larger TAGS will only have tangible effects in huge trees.
> 
> FWIW, TAGS for gecko-dev (Mozilla's repository which I have here for 
> testing) takes ~30 seconds to generate and ~400ms to find a definition 
> for the set of files to scan that I currently have set up. Both timings 
> seem quite impactful for user experience. I imagine some Emacs users 
> work at Mozilla, though that's only a guess.

Like I said: in huge trees this might matter.

But in any case, I don't understand the significance of the timings
you show: we are discussing the increase in processing time which will
be caused by adding files with no tags, which produce a single line in
TAGS.  Therefore the interesting figures are time differences in
processing some commands with and without those additional lines.  Are
the times you show above related to any of that?

> If someone were to provide a patch for etags with new functionality 
> (disabling fallbacks, at least), I could benchmark and come back with 
> numbers. And if experimental flags are available, with numbers for those 
> as well.

How hard is it to add to a live TAGS file fake lines which look like
this:

  ^L
  foo,0

(with random strings instead of "foo"), and then time some TAGS-using
commands with and without these additions?
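
A minimal sketch of that experiment, assuming the TAGS layout shown
above (a form feed line followed by "NAME,0").  The helper name
append_fake_entries and the "fake-file-N.txt" names are hypothetical,
and the names are numbered rather than random, which should not
matter for timing purposes:

```c
#include <assert.h>
#include <stdio.h>

/* Append COUNT fake, tag-less file sections to an open TAGS stream.
   Each section is a form feed line followed by "NAME,0", i.e. a file
   entry with zero bytes of tag data.  The "fake-file-N.txt" names are
   made up for this test.  Returns 0 on success, -1 on a write error.  */
int
append_fake_entries (FILE *out, unsigned long count)
{
  assert (out != NULL);
  for (unsigned long i = 0; i < count; i++)
    if (fprintf (out, "\f\nfake-file-%lu.txt,0\n", i) < 0)
      return -1;
  return fflush (out) == 0 ? 0 : -1;
}
```

One would call this on a copy of the real TAGS file opened in append
mode, then time the same commands against the original and the
padded copy.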

> >> I would hope that project-find-regexp works well enough for that. Or
> >> 'M-x project-search' for the fans of the classic interface.
> > 
> > Maybe, but we do still want to keep tags-search, so the existence of
> > other commands doesn't invalidate my argument above.
> 
> In my mind, tags-search is for files that are code-related. Actual users 
> might differ, though.

The fact that we pass *.texi files to etags should already tell you
that this mental model is incomplete.  The fact that etags supports
HTML, TeX, and PostScript files (in addition to Texinfo) is further
evidence to that effect.  And that's even before we consider the
regexp feature, which could be used to tag anything in any kind of
file.

I agree that these use cases are relatively rare, but that doesn't
make them invalid or even unimportant.

> > If we want a separate optional behavior that prevents files with no
> > tags from being mentioned in TAGS, I'd argue that such an option
> > should affect all the scanned files, not just those whose language
> > could not be determined from their names.
> 
> I don't have a strong opinion here, just that it would depart from my 
> mental model mentioned above, of having all code-related files listed. 
> For example by missing some newly added .c file where no function 
> definitions have been added yet; 'M-x tags-search' would skip it.

This matches my impression that this option (which skips files with no
tags) should rarely if ever be used.


* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-05  7:02                                   ` Eli Zaretskii
@ 2024-10-05 14:29                                     ` Dmitry Gutov
  2024-10-05 15:27                                       ` Eli Zaretskii
  2024-10-05 16:38                                       ` Francesco Potortì
  0 siblings, 2 replies; 48+ messages in thread
From: Dmitry Gutov @ 2024-10-05 14:29 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 73484, spwhitton

On 05/10/2024 10:02, Eli Zaretskii wrote:

> Like I said: in huge trees this might matter.

We do want to support them, right? Or anyway make the project size 
cutoff (where it remains practical to use Emacs) as high as feasible.

> But in any case, I don't understand the significance of the timings
> you show: we are discussing the increase in processing time which will
> be caused by adding files with no tags, which produce a single line in
> TAGS.

If there are an order of magnitude more "other" files, and an average source file 
contains only several definitions, this can make a difference.

> Therefore the interesting figures are time differences in
> processing some commands with and without those additional lines.  Are
> the times you show above related to any of that?

The time to generate is relevant. The time to visit the tags table gets 
non-trivial too, and it can increase.

>> If someone were to provide a patch for etags with new functionality
>> (disabling fallbacks, at least), I could benchmark and come back with
>> numbers. And if experimental flags are available, with numbers for those
>> as well.
> 
> How hard is it to add to a live TAGS file fake lines which look like
> this:
> 
>    ^L
>    foo,0
> 
> (with random strings instead of "foo"), and then time some TAGS-using
> commands with and without these additions?

Okay, done that.

'M-.' takes more or less the same time.

The file size of TAGS increased from 66 MB to 85 MiB.

Won't measure time to generate now - because the current method and the 
"real" one will be different, but note that it's more relevant with 
etags-regen-mode because the scan is performed lazily: every time the 
user does the first search in a new project.


* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-05 14:29                                     ` Dmitry Gutov
@ 2024-10-05 15:27                                       ` Eli Zaretskii
  2024-10-05 20:27                                         ` Dmitry Gutov
  2024-10-05 16:38                                       ` Francesco Potortì
  1 sibling, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2024-10-05 15:27 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: 73484, spwhitton

> Date: Sat, 5 Oct 2024 17:29:44 +0300
> Cc: spwhitton@spwhitton.name, 73484@debbugs.gnu.org
> From: Dmitry Gutov <dmitry@gutov.dev>
> 
> On 05/10/2024 10:02, Eli Zaretskii wrote:
> 
> > How hard is it to add to a live TAGS file fake lines which look like
> > this:
> > 
> >    ^L
> >    foo,0
> > 
> > (with random strings instead of "foo"), and then time some TAGS-using
> > commands with and without these additions?
> 
> Okay, done that.
> 
> 'M-.' takes more or less the same.
> 
> The file size of TAGS increased from 66 MB to 85 MiB.
> 
> Won't measure time to generate now - because the current method and the 
> "real" one will be different, but note that it's more relevant with 
> etags-regen-mode because the scan is performed lazily: every time the 
> user does the first search in a new project.

Thanks.  What about the time it takes tags-search to show the prompt:
is that affected in any way?






* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-05 14:29                                     ` Dmitry Gutov
  2024-10-05 15:27                                       ` Eli Zaretskii
@ 2024-10-05 16:38                                       ` Francesco Potortì
  2024-10-05 17:12                                         ` Eli Zaretskii
  2024-10-06  0:56                                         ` Dmitry Gutov
  1 sibling, 2 replies; 48+ messages in thread
From: Francesco Potortì @ 2024-10-05 16:38 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: Eli Zaretskii, 73484, spwhitton

Eli Zaretskii:
>> How hard is it to add to a live TAGS file fake lines which look like
>> this:
>> 
>>    ^L
>>    foo,0
>> 
>> (with random strings instead of "foo"), and then time some TAGS-using
>> commands with and without these additions?

Dmitry Gutov:
>Okay, done that.
>
>'M-.' takes more or less the same.
>
>The file size of TAGS increased from 66 MB to 85 MiB.
>
>Won't measure time to generate now - because the current method and the 
>"real" one will be different, but note that it's more relevant with 
>etags-regen-mode because the scan is performed lazily: every time the 
>user does the first search in a new project.

Removing the Fortran and C/C++ fallbacks just for testing requires recompiling etags.c after removing the code beginning with /* Else try Fortran or C. */.  This would avoid parsing the file (except for detecting the sharp-bang) and would leave the file name in the tags file, without tags.






* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-05 16:38                                       ` Francesco Potortì
@ 2024-10-05 17:12                                         ` Eli Zaretskii
  2024-10-06  0:56                                         ` Dmitry Gutov
  1 sibling, 0 replies; 48+ messages in thread
From: Eli Zaretskii @ 2024-10-05 17:12 UTC (permalink / raw)
  To: Francesco Potortì; +Cc: dmitry, 73484, spwhitton

> From: Francesco Potortì <pot@gnu.org>
> Date: Sat, 05 Oct 2024 18:38:22 +0200
> Cc: spwhitton@spwhitton.name,
> 	73484@debbugs.gnu.org,
> 	Eli Zaretskii <eliz@gnu.org>
> 
> Eli Zaretskii:
> >> How hard is it to add to a live TAGS file fake lines which look like
> >> this:
> >> 
> >>    ^L
> >>    foo,0
> >> 
> >> (with random strings instead of "foo"), and then time some TAGS-using
> >> commands with and without these additions?
> 
> Dmitry Gutov:
> >Okay, done that.
> >
> >'M-.' takes more or less the same.
> >
> >The file size of TAGS increased from 66 MB to 85 MiB.
> >
> >Won't measure time to generate now - because the current method and the 
> >"real" one will be different, but note that it's more relevant with 
> >etags-regen-mode because the scan is performed lazily: every time the 
> >user does the first search in a new project.
> 
> Removing the Fortran and C/C++ fallbacks just for testing requires recompiling etags.c after removing the code beginning with /* Else try Fortran or C. */.  This would avoid parsing the file (except for detecting the sharp-bang) and would leave the file name in the tags file, without tags.

We are not talking about disabling the fallbacks, we are talking about
something else: the impact of having in TAGS names of files where no
tags were found (e.g., because their language was not recognized and
the fallbacks are disabled).






* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-05 15:27                                       ` Eli Zaretskii
@ 2024-10-05 20:27                                         ` Dmitry Gutov
  0 siblings, 0 replies; 48+ messages in thread
From: Dmitry Gutov @ 2024-10-05 20:27 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 73484, spwhitton

On 05/10/2024 18:27, Eli Zaretskii wrote:
> Thanks.  What about the time it takes tags-search to show the prompt:
> is that affected in any way?

No, that's still instant, just like project-find-regexp. All the work 
happens after typing the input.






* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-05 16:38                                       ` Francesco Potortì
  2024-10-05 17:12                                         ` Eli Zaretskii
@ 2024-10-06  0:56                                         ` Dmitry Gutov
  2024-10-06  6:22                                           ` Eli Zaretskii
  1 sibling, 1 reply; 48+ messages in thread
From: Dmitry Gutov @ 2024-10-06  0:56 UTC (permalink / raw)
  To: Francesco Potortì; +Cc: Eli Zaretskii, 73484, spwhitton

On 05/10/2024 19:38, Francesco Potortì wrote:
> Eli Zaretskii:
>>> How hard is it to add to a live TAGS file fake lines which look like
>>> this:
>>>
>>>     ^L
>>>     foo,0
>>>
>>> (with random strings instead of "foo"), and then time some TAGS-using
>>> commands with and without these additions?
> 
> Dmitry Gutov:
>> Okay, done that.
>>
>> 'M-.' takes more or less the same.
>>
>> The file size of TAGS increased from 66 MB to 85 MiB.
>>
>> Won't measure time to generate now - because the current method and the
>> "real" one will be different, but note that it's more relevant with
>> etags-regen-mode because the scan is performed lazily: every time the
>> user does the first search in a new project.
> 
> Removing the Fortran and C/C++ fallbacks just for testing requires recompiling etags.c after removing the code beginning with /* Else try Fortran or C. */.  This would avoid parsing the file (except for detecting the sharp-bang) and would leave the file name in the tags file, without tags.

Thank you, this is useful for another kind of test (parsing the same 
project with the list of all enabled file types). The below was also 
needed to avoid a segfault:

diff --git a/lib-src/etags.c b/lib-src/etags.c
index 7f652790261..08c6037b9d7 100644
--- a/lib-src/etags.c
+++ b/lib-src/etags.c
@@ -1830,6 +1830,7 @@ process_file (FILE *fh, char *fn, language *lang)
       curfdp. */
    if (!CTAGS
        && curfdp->usecharno	/* no #line directives in this file */
+      && curfdp->lang
        && !curfdp->lang->metasource)
      {
        node *np, *prev;

Then, the total time increased a lot: from 30 s to 30-40 min. This cuts 
it down in half, if I measured correctly:

diff --git a/lib-src/etags.c b/lib-src/etags.c
index 7f652790261..5c2be2b9574 100644
--- a/lib-src/etags.c
+++ b/lib-src/etags.c
@@ -1902,21 +1903,21 @@ find_entries (FILE *inf)

    /* Else look for sharp-bang as the first two characters. */
    if (parser == NULL
+      && getc (inf) == '#'
+      && getc (inf) == '!'
        && readline_internal (&lb, inf, infilename, false) > 0
-      && lb.len >= 2
-      && lb.buffer[0] == '#'
-      && lb.buffer[1] == '!')
+      )
      {
        char *lp;

        /* Set lp to point at the first char after the last slash in the
           line or, if no slashes, at the first nonblank.  Then set cp to
  	 the first successive blank and terminate the string. */
-      lp = strrchr (lb.buffer+2, '/');
+      lp = strrchr (lb.buffer, '/');
        if (lp != NULL)
  	lp += 1;
        else
-	lp = skip_spaces (lb.buffer + 2);
+	lp = skip_spaces (lb.buffer);
        cp = skip_non_spaces (lp);
       /* If the "interpreter" turns out to be "env", the real interpreter is
  	 the next word.  */

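The idea behind the second patch (look at only the first two bytes
before deciding whether to read the whole first line) can be sketched
in isolation.  This is a standalone illustration, not the etags code:
starts_with_sharp_bang is a hypothetical helper, and it assumes a
seekable stream so that it can rewind for whichever parser runs next.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Return true if the stream starts with the two bytes "#!".
   At most two bytes are read, and the stream is rewound afterwards
   so that a later parser sees the file from the beginning.  */
bool
starts_with_sharp_bang (FILE *inf)
{
  assert (inf != NULL);
  rewind (inf);
  bool found = getc (inf) == '#' && getc (inf) == '!';
  rewind (inf);
  return found;
}
```

Rewinding matters here: a bare getc test consumes one or two bytes
even when the check fails, so any later scan of the same stream has
to start over from the beginning.
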
But parsing HTML files seems to remain the slowest part. There are a lot 
of them in that project (many test cases), but maybe 3x the number of 
code files, not 60x their number. And they're pretty small, on average. 
If somebody wants to test that locally, here's the repository: 
https://github.com/mozilla/gecko-dev






* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-06  0:56                                         ` Dmitry Gutov
@ 2024-10-06  6:22                                           ` Eli Zaretskii
  2024-10-06 19:14                                             ` Dmitry Gutov
  0 siblings, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2024-10-06  6:22 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: pot, 73484, spwhitton

> Date: Sun, 6 Oct 2024 03:56:58 +0300
> Cc: spwhitton@spwhitton.name, 73484@debbugs.gnu.org,
>  Eli Zaretskii <eliz@gnu.org>
> From: Dmitry Gutov <dmitry@gutov.dev>
> 
> On 05/10/2024 19:38, Francesco Potortì wrote:
> > Eli Zaretskii:
> >>> How hard is it to add to a live TAGS file fake lines which look like
> >>> this:
> >>>
> >>>     ^L
> >>>     foo,0
> >>>
> >>> (with random strings instead of "foo"), and then time some TAGS-using
> >>> commands with and without these additions?
> > 
> > Dmitry Gutov:
> >> Okay, done that.
> >>
> >> 'M-.' takes more or less the same.
> >>
> >> The file size of TAGS increased from 66 MB to 85 MiB.
> >>
> >> Won't measure time to generate now - because the current method and the
> >> "real" one will be different, but note that it's more relevant with
> >> etags-regen-mode because the scan is performed lazily: every time the
> >> user does the first search in a new project.
> > 
> > Removing the Fortran and C/C++ fallbacks just for testing requires recompiling etags.c after removing the code beginning with /* Else try Fortran or C. */.  This would avoid parsing the file (except for detecting the sharp-bang) and would leave the file name in the tags file, without tags.

That would also remove the ability to scan files of no language for
regexps.  So this is not what I intend to do for this feature request,
FWIW.

> Then, the total time increased a lot: from 30 s to 30-40 min.

I don't understand why.  How many files with no extensions are in that
tree, and what was the etags command line in both cases?

> But parsing HTML files seems to remain the slowest part. There are a lot 
> of them in that project (many test cases), but maybe 3x the number of 
> code files, not 60x their number. And they're pretty small, on average. 
> If somebody wants to test that locally, here's the repository: 
> https://github.com/mozilla/gecko-dev

If HTML files are what explains the slowdown, then why did this change
trigger it?  HTML files are supposed to have extensions that tell
etags they are HTML.  And if they don't have extensions, the code you
removed would have caused etags to scan these files anyway, looking
for Fortran or C tags.  So how come the change slowed down etags so
much?  What am I missing?






* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-06  6:22                                           ` Eli Zaretskii
@ 2024-10-06 19:14                                             ` Dmitry Gutov
  2024-10-07  2:33                                               ` Eli Zaretskii
  0 siblings, 1 reply; 48+ messages in thread
From: Dmitry Gutov @ 2024-10-06 19:14 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: pot, 73484, spwhitton

On 06/10/2024 09:22, Eli Zaretskii wrote:

>> Then, the total time increased a lot: from 30 s to 30-40 min.
> 
> I don't understand why.  How many files with no extensions are in that
> tree, and what was the etags command line in both cases?

Sorry, I have to add a correction: it's about 15 min either way. Seems 
like the first time I either messed up the start time, or the directory 
was in "cold" cache, or I used some much older version of etags.

So to reiterate: the current etags-regen scans in around 30s, and the 
simple switch scans the directory in 15 minutes. Retesting the change 
from previous email, it doesn't really help.

And the 'find-tag' scan did become slower - i.e. from 400 ms to 1200 ms. 
Not clear about the mechanics (the size of TAGS only went up from 65 to 
88 MB).

>> But parsing HTML files seems to remain the slowest part. There are a lot
>> of them in that project (many test cases), but maybe 3x the number of
>> code files, not 60x their number. And they're pretty small, on average.
>> If somebody wants to test that locally, here's the repository:
>> https://github.com/mozilla/gecko-dev
> 
> If HTML files are what explains the slowdown, then why did this change
> trigger it?  HTML files are supposed to have extensions that tell
> etags they are HTML.

Okay, I've commented out the most obvious suspects (html, asm, makefile) 
- all their entries in 'lang_names' - but the scan still takes too long.

Maybe it's some other file type, which I haven't found yet.

But what I see when monitoring the running scan with 'tail -f TAGS' is 
that the output sometimes stops for about 20 seconds, in the middle of 
outputting tags of some common code file (like .cpp or .py, a common 
type), and then resumes, with files of the same type around this one.

> And if they don't have extensions, the code you
> removed would have caused etags to scan these files anyway, looking
> for Fortran or C tags.  So how come the change slowed down etags so
> much?  What am I missing?

I think it would also concern "unknown" extensions, right? Like .txt, 
.png and so on.

Anyway, the difference is either due to the different set of files (all 
project files, rather than files in the specified list of extensions), 
or due to all file names being printed. Not sure how to verify, yet.


* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-06 19:14                                             ` Dmitry Gutov
@ 2024-10-07  2:33                                               ` Eli Zaretskii
  2024-10-07  7:11                                                 ` Dmitry Gutov
  0 siblings, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2024-10-07  2:33 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: pot, 73484, spwhitton

> Date: Sun, 6 Oct 2024 22:14:46 +0300
> Cc: pot@gnu.org, spwhitton@spwhitton.name, 73484@debbugs.gnu.org
> From: Dmitry Gutov <dmitry@gutov.dev>
> 
> On 06/10/2024 09:22, Eli Zaretskii wrote:
> 
> >> Then, the total time increased a lot: from 30 s to 30-40 min.
> > 
> > I don't understand why.  How many files with no extensions are in that
> > tree, and what was the etags command line in both cases?
> 
> Sorry, I have to add a correction: it's about 15 min either way. Seems 
> like the first time I either messed up the start time, or the directory 
> was in "cold" cache, or I used some much older version of etags.
> 
> So to reiterate: the current etags-regen scans in around 30s, and the 
> simple switch scans the directory in 15 minutes. Retesting the change 
> from previous email, it doesn't really help.

Can you please show the etags command line in each of these two cases
that you are comparing?

> > And if they don't have extensions, the code you
> > removed would have caused etags to scan these files anyway, looking
> > for Fortran or C tags.  So how come the change slowed down etags so
> > much?  What am I missing?
> 
> I think it would also concern "unknown" extensions, right? Like .txt, 
> .png and so on.

I have difficulty reasoning about this without knowing the command
lines you used.  E.g., I don't understand why in one case it would
scan files with unknown extensions that were not scanned in the other.






* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-07  2:33                                               ` Eli Zaretskii
@ 2024-10-07  7:11                                                 ` Dmitry Gutov
  2024-10-07 16:05                                                   ` Eli Zaretskii
  0 siblings, 1 reply; 48+ messages in thread
From: Dmitry Gutov @ 2024-10-07  7:11 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: pot, 73484, spwhitton

On 07/10/2024 05:33, Eli Zaretskii wrote:
>> Sorry, I have to add a correction: it's about 15 min either way. Seems
>> like the first time I either messed up the start time, or the directory
>> was in "cold" cache, or I used some much older version of etags.
>>
>> So to reiterate: the current etags-regen scans in around 30s, and the
>> simple switch scans the directory in 15 minutes. Retesting the change
>> from previous email, it doesn't really help.
> Can you please show the etags command line in each of these two cases
> that you are comparing?

Both commands end with a '-' (scanning the list of files passed from stdin).

>>> And if they don't have extensions, the code you
>>> removed would have caused etags to scan these files anyway, looking
>>> for Fortran or C tags.  So how come the change slowed down etags so
>>> much?  What am I missing?
>> I think it would also concern "unknown" extensions, right? Like .txt,
>> .png and so on.
> I have difficulty reasoning about this without knowing the command
> lines you used.  E.g., I don't understand why in one case it would
> scan files with unknown extensions that were not scanned in the other.

In one case the list is pre-filtered with etags-regen-file-extensions 
(see 'etags-regen--all-files'), in the other - it is not, and all files 
in project are passed.





^ permalink raw reply	[flat|nested] 48+ messages in thread

* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-07  7:11                                                 ` Dmitry Gutov
@ 2024-10-07 16:05                                                   ` Eli Zaretskii
  2024-10-07 17:36                                                     ` Dmitry Gutov
  0 siblings, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2024-10-07 16:05 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: pot, 73484, spwhitton

> Date: Mon, 7 Oct 2024 10:11:08 +0300
> Cc: pot@gnu.org, spwhitton@spwhitton.name, 73484@debbugs.gnu.org
> From: Dmitry Gutov <dmitry@gutov.dev>
> 
> > Can you please show the etags command line in each of these two cases
> > that you are comparing?
> 
> Both commands end with a '-' (scanning the list of files passed from stdin).
> 
> >>> And if they don't have extensions, the code you
> >>> removed would have caused etags to scan these files anyway, looking
> >>> for Fortran or C tags.  So how come the change slowed down etags so
> >>> much?  What am I missing?
> >> I think it would also concern "unknown" extensions, right? Like .txt,
> >> .png and so on.
> > I have difficulty reasoning about this without knowing the command
> > lines you used.  E.g., I don't understand why in one case it would
> > scan files with unknown extensions that were not scanned in the other.
> 
> In one case the list is pre-filtered with etags-regen-file-extensions 
> (see 'etags-regen--all-files'), in the other - it is not, and all files 
> in project are passed.

So you are comparing the speed of scanning ~60K files with the speed
of scanning ~375K files?  I'm not generally surprised that the
latter takes much longer, only that the slowdown is not proportional
to the number of scanned files.  But see below.

Btw, did you exclude the .git/* files from the list submitted to
etags?

Here, scanning, with the unmodified etags from Emacs 30, of only those
files with extensions in etags-regen-file-extensions takes 16.7 sec
and produces an 80.5MB tags table, whereas scanning all the files with
the same etags takes almost 16 min and produces a 304MB tags table, of
which more than 200MB are from files whose language is not recognized.

From my testing, it seems like the elapsed time depends non-linearly
on the length of the list of files submitted to etags.  For example,
if I break the list of files in two, I get 3 min 20 sec and 1 min 40
sec, together 5 min.  But if I submit a single list with all the files
in those two lists, I get 14 min 30 sec.  I guess some internal
processing etags does depends non-linearly on the number of files it
scans.  The various loops in etags that scan all of the known files
and/or the tags it previously found seem to confirm this hypothesis.
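
To put a rough number on that hypothesis, here is a minimal sketch in C.
It assumes (this is only a model, not something measured in etags) that
each new file is compared against every file seen before it; the
function name pairwise_comparisons is made up for illustration.

```c
#include <assert.h>

/* Worst-case number of name comparisons if every new file is checked
   against all the files seen before it (a hypothetical cost model):
   0 + 1 + ... + (n - 1) = n * (n - 1) / 2.  */
unsigned long long
pairwise_comparisons (unsigned long long n)
{
  /* Keep n * (n - 1) within 64 bits.  */
  assert (n < 4000000000ULL);
  return n * (n - 1) / 2;
}
```

Under that assumption, ~60K files need about 1.8 billion comparisons and
~375K files about 70 billion, i.e. roughly 39 times the work for 6.25
times the files.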

So what is the conclusion from this?  Are you saying that the long
scan times in this large tree basically make this new no-fallbacks
option not very useful, since we still need to carefully include or
exclude certain files from the scan?  Or should I go ahead and install
these changes?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-07 16:05                                                   ` Eli Zaretskii
@ 2024-10-07 17:36                                                     ` Dmitry Gutov
  2024-10-07 19:05                                                       ` Eli Zaretskii
  0 siblings, 1 reply; 48+ messages in thread
From: Dmitry Gutov @ 2024-10-07 17:36 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: pot, 73484, spwhitton

On 07/10/2024 19:05, Eli Zaretskii wrote:

> So you are comparing the speed of scanning ~60K files with the speed
> of scanning ~375K files?  I'm not generally surprised that the
> latter takes much longer, only that the slowdown is not proportional
> to the number of scanned files.  But see below.

I forgot one thing: all .js files are actually set to be ignored there. 
And my tree is a little old, so it's 200K files total. Otherwise -- yes.

Note, however, that the time is really not proportional: 30 s vs 15 min 
is a 30x difference.

And I've been assuming that the "other" files would mostly fall in the 
non-recognized category, and most of them would only have the 2 first 
characters read (then, recognizing that those chars are not '#!', etags 
would skip the file).

> Btw, did you exclude the .git/* files from the list submitted to
> etags?

Yes, it's excluded. And the files matching the .gitignore entries are 
excluded as well.

> Here, scanning, with the unmodified etags from Emacs 30, of only those
> files with extensions in etags-regen-file-extensions takes 16.7 sec
> and produces an 80.5MB tags table, whereas scanning all the files with
> the same etags takes almost 16 min and produces a 304MB tags table, of
> which more than 200MB are from files whose language is not recognized.

My result in the latter case was only 88 MB. Maybe the many .js files 
make the difference. I've put them into the "ignored" category long ago 
because most of them are used for tests, and there are a lot of those 
files, and some of them are generated one-long-line files.

>  From my testing, it seems like the elapsed time depends non-linearly
> on the length of the list of files submitted to etags.  For example,
> if I break the list of files in two, I get 3 min 20 sec and 1 min 40
> sec, together 5 min.  But if I submit a single list with all the files
> in those two lists, I get 14 min 30 sec.  I guess some internal
> processing etags does depends non-linearly on the number of files it
> scans.  The various loops in etags that scan all of the known files
> and/or the tags it previously found seem to confirm this hypothesis.

Makes sense! It sounds like some N^2 complexity somewhere.

> So what is the conclusion from this?  Are you saying that the long
> scan times in this large tree basically make this new no-fallbacks
> option not very useful, since we still need to carefully include or
> exclude certain files from the scan?  Or should I go ahead and install
> these changes?

I think that option will be useful, but for better benchmarks and for 
end usability as well, I think we need the N^2 thing fixed as well. 
Maybe before the rest of the changes.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-07 17:36                                                     ` Dmitry Gutov
@ 2024-10-07 19:05                                                       ` Eli Zaretskii
  2024-10-07 22:08                                                         ` Dmitry Gutov
  0 siblings, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2024-10-07 19:05 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: pot, 73484, spwhitton

> Date: Mon, 7 Oct 2024 20:36:47 +0300
> Cc: pot@gnu.org, spwhitton@spwhitton.name, 73484@debbugs.gnu.org
> From: Dmitry Gutov <dmitry@gutov.dev>
> 
> On 07/10/2024 19:05, Eli Zaretskii wrote:
> 
> > So what is the conclusion from this?  Are you saying that the long
> > scan times in this large tree basically make this new no-fallbacks
> > option not very useful, since we still need to carefully include or
> > exclude certain files from the scan?  Or should I go ahead and install
> > these changes?
> 
> I think that option will be useful, but for better benchmarks and for 
> end usability as well, I think we need the N^2 thing fixed as well. 
> Maybe before the rest of the changes.

If this latter part is a precondition, then someone else will have to
work on this.  I have the new option coded and tested (and
documented), but I don't intend to work on redesigning the core etags
algorithms to remove the non-linear behavior, that's a much larger
project which I currently cannot afford, sorry.





^ permalink raw reply	[flat|nested] 48+ messages in thread

* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-07 19:05                                                       ` Eli Zaretskii
@ 2024-10-07 22:08                                                         ` Dmitry Gutov
  2024-10-08 13:04                                                           ` Eli Zaretskii
  0 siblings, 1 reply; 48+ messages in thread
From: Dmitry Gutov @ 2024-10-07 22:08 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: pot, 73484, spwhitton

On 07/10/2024 22:05, Eli Zaretskii wrote:
>> Date: Mon, 7 Oct 2024 20:36:47 +0300
>> Cc: pot@gnu.org, spwhitton@spwhitton.name, 73484@debbugs.gnu.org
>> From: Dmitry Gutov <dmitry@gutov.dev>
>>
>> On 07/10/2024 19:05, Eli Zaretskii wrote:
>>
>>> So what is the conclusion from this?  Are you saying that the long
>>> scan times in this large tree basically make this new no-fallbacks
>>> option not very useful, since we still need to carefully include or
>>> exclude certain files from the scan?  Or should I go ahead and install
>>> these changes?
>>
>> I think that option will be useful, but for better benchmarks and for
>> end usability as well, I think we need the N^2 thing fixed as well.
>> Maybe before the rest of the changes.
> 
> If this latter part is a precondition,

I think we still could use the new flag, just not switch to it (no 
extension filtering) by default yet.

> then someone else will have to
> work on this.  I have the new option coded and tested (and
> documented), but I don't intend to work on redesigning the core etags
> algorithms to remove the non-linear behavior, that's a much larger
> project which I currently cannot afford, sorry.

Do you mind pointing at the places in the code where you already noticed 
non-linear performance coming from?





^ permalink raw reply	[flat|nested] 48+ messages in thread

* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-07 22:08                                                         ` Dmitry Gutov
@ 2024-10-08 13:04                                                           ` Eli Zaretskii
  2024-10-09 18:23                                                             ` Dmitry Gutov
  0 siblings, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2024-10-08 13:04 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: pot, 73484, spwhitton

> Date: Tue, 8 Oct 2024 01:08:00 +0300
> Cc: pot@gnu.org, spwhitton@spwhitton.name, 73484@debbugs.gnu.org
> From: Dmitry Gutov <dmitry@gutov.dev>
> 
> On 07/10/2024 22:05, Eli Zaretskii wrote:
> >> Date: Mon, 7 Oct 2024 20:36:47 +0300
> >> Cc: pot@gnu.org, spwhitton@spwhitton.name, 73484@debbugs.gnu.org
> >> From: Dmitry Gutov <dmitry@gutov.dev>
> >>
> >> On 07/10/2024 19:05, Eli Zaretskii wrote:
> >>
> >>> So what is the conclusion from this?  Are you saying that the long
> >>> scan times in this large tree basically make this new no-fallbacks
> >>> option not very useful, since we still need to carefully include or
> >>> exclude certain files from the scan?  Or should I go ahead and install
> >>> these changes?
> >>
> >> I think that option will be useful, but for better benchmarks and for
> >> end usability as well, I think we need the N^2 thing fixed as well.
> >> Maybe before the rest of the changes.
> > 
> > If this latter part is a precondition,
> 
> I think we still could use the new flag, just not switch to it (no 
> extension filtering) by default yet.

OK, installed on master.  I leave it up to you whether to close the
bug.

> > then someone else will have to
> > work on this.  I have the new option coded and tested (and
> > documented), but I don't intend to work on redesigning the core etags
> > algorithms to remove the non-linear behavior, that's a much larger
> > project which I currently cannot afford, sorry.
> 
> Do you mind pointing at the places in the code where you already noticed 
> non-linear performance coming from?

The while-loop near line 2020, for example.

Another one is the for-loop near line 1420, which deals with writing
into TAGS the entries of files with no tags.

There may be others, but those are what I saw.  Perhaps it is a good
idea to profile etags while it scans the files during those 15 min, to
see where it spends that much time, because I'm not sure even those
loops can account for that.  It's possible there's something else at
work here which we don't yet understand.

Two aspects that I found trying to understand the long scan times, and
I'd like to mention so they don't become forgotten:

 . If there are compressed files in the directory, etags will
   uncompress them before it attempts to identify their language.
   There are 20 such files in the gecko-dev tree (removing them from
   the list of scanned files had only minor effect on the elapsed
   time, but it could be different in other cases, especially if
   uncompressing them produces very large files).
 . Some files have their language identified by means other than their
   names or extensions: those are the languages that have
   "interpreters" defined in etags.c.  Shell scripts is one such case,
   but not the only one.  So when etags-regen.el passes only files
   with known extensions to etags, it misses those files from TAGS.
   As one example, the file js/src/devtools/rootAnalysis/run_complete
   in the gecko-dev tree is a Perl script, but has no .pl extension.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-08 13:04                                                           ` Eli Zaretskii
@ 2024-10-09 18:23                                                             ` Dmitry Gutov
  2024-10-09 19:11                                                               ` Eli Zaretskii
                                                                                 ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Dmitry Gutov @ 2024-10-09 18:23 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: pot, 73484, spwhitton

On 08/10/2024 16:04, Eli Zaretskii wrote:

>>>> I think that option will be useful, but for better benchmarks and for
>>>> end usability as well, I think we need the N^2 thing fixed as well.
>>>> Maybe before the rest of the changes.
>>>
>>> If this latter part is a precondition,
>>
>> I think we still could use the new flag, just not switch to it (no
>> extension filtering) by default yet.
> 
> OK, installed on master.  I leave it up to you whether to close the
> bug.

Thank you!

Before closing though, I'd like to look into the performance issue more.

>>> then someone else will have to
>>> work on this.  I have the new option coded and tested (and
>>> documented), but I don't intend to work on redesigning the core etags
>>> algorithms to remove the non-linear behavior, that's a much larger
>>> project which I currently cannot afford, sorry.
>>
>> Do you mind pointing at the places in the code where you already noticed
>> non-linear performance coming from?
> 
> The while-loop near line 2020, for example.

Thanks. This one must be proportional to the number of files such as 
*.y. There are only 2 in our big repo.

> Another one is the for-loop near line 1420, which deals with writing
> into TAGS the entries of files with no tags.

It's not a nested 'for' loop, though (right?), and it's called from 
'main'. That seems to mean it's just O(N) - also fine.

> There may be others, but those are what I saw.  Perhaps it is a good
> idea to profile etags while it scans the files during those 15 min, to
> see where it spends that much time, because I'm not sure even those
> loops can account for that.  It's possible there's something else at
> work here which we don't yet understand.

'perf' shows me a profile like this:

   67.31%  etags    libc.so.6          [.] __strcmp_avx2
   26.29%  etags    etags              [.] process_file_name
    2.00%  etags    etags              [.] streq
    0.96%  etags    etags              [.] strcmp@plt
    0.32%  etags    etags              [.] readline_internal
    0.11%  etags    etags              [.] HTML_labels
    0.08%  etags    [kernel.kallsyms]  [k] syscall_return_via_sysret
    0.07%  etags    [kernel.kallsyms]  [k] kmem_cache_alloc
    0.06%  etags    [kernel.kallsyms]  [k] entry_SYSRETQ_unsafe_stack
    0.05%  etags    [kernel.kallsyms]  [k] perf_adjust_freq_unthr_context
    0.04%  etags    etags              [.] c_strncasecmp

So... most of the time is spent in string comparison.

Here is the nested loop; if I comment it out, the parse finishes in
~20 seconds with all the extra files (except *.js), or in 15s when used
with the new flags.

diff --git a/lib-src/etags.c b/lib-src/etags.c
index a822a823a90..331e3ffe816 100644
--- a/lib-src/etags.c
+++ b/lib-src/etags.c
@@ -1697,14 +1697,14 @@ process_file_name (char *file, language *lang)
        uncompressed_name = file;
      }

-  /* If the canonicalized uncompressed name
-     has already been dealt with, skip it silently. */
-  for (fdp = fdhead; fdp != NULL; fdp = fdp->next)
-    {
-      assert (fdp->infname != NULL);
-      if (streq (uncompressed_name, fdp->infname))
-	goto cleanup;
-    }
+  /* /\* If the canonicalized uncompressed name */
+  /*    has already been dealt with, skip it silently. *\/ */
+  /* for (fdp = fdhead; fdp != NULL; fdp = fdp->next) */
+  /*   { */
+  /*     assert (fdp->infname != NULL); */
+  /*     if (streq (uncompressed_name, fdp->infname)) */
+  /* 	goto cleanup; */
+  /*   } */

    inf = fopen (file, "r" FOPEN_BINARY);
    if (inf)

This is basically a "uniqueness" operation using linear search, O(N^2).

Is there a hash table we could use?

Or perhaps we would skip the search when the canonicalized name is the 
same as the original one.
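
For illustration, a minimal sketch of what such a hash-based check could
look like.  This is not code from etags.c: the names name_set,
name_set_new and name_set_add, the fixed bucket count and the FNV-1a
hash are all assumptions made for the example.  The idea is that each
canonicalized name is looked up in its own bucket only, instead of being
compared against the whole fdhead list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Fixed bucket count; this sketch never resizes the table.  */
enum { NAME_SET_BUCKETS = 1 << 16 };

struct name_node
{
  struct name_node *next;
  char *name;
};

struct name_set
{
  struct name_node *buckets[NAME_SET_BUCKETS];
};

/* Allocate an empty set (all buckets zeroed).  */
struct name_set *
name_set_new (void)
{
  struct name_set *set = calloc (1, sizeof *set);
  if (set == NULL)
    abort ();
  return set;
}

/* 32-bit FNV-1a string hash, reduced to a bucket index.  */
static size_t
name_hash (const char *s)
{
  uint32_t h = 2166136261u;
  for (; *s != '\0'; s++)
    {
      h ^= (unsigned char) *s;
      h *= 16777619u;
    }
  return h % NAME_SET_BUCKETS;
}

/* Record NAME in SET.  Return true if it was newly added, false if an
   equal name was already present (the caller would then skip the file).  */
bool
name_set_add (struct name_set *set, const char *name)
{
  assert (set != NULL && name != NULL);
  size_t i = name_hash (name);

  /* Only the names that hash to the same bucket are compared.  */
  for (struct name_node *p = set->buckets[i]; p != NULL; p = p->next)
    if (strcmp (p->name, name) == 0)
      return false;

  size_t len = strlen (name) + 1;
  struct name_node *node = malloc (sizeof *node);
  char *copy = malloc (len);
  if (node == NULL || copy == NULL)
    abort ();
  memcpy (copy, name, len);
  node->name = copy;
  node->next = set->buckets[i];
  set->buckets[i] = node;
  return true;
}
```

In process_file_name the loop would then become something like
"if (!name_set_add (seen, uncompressed_name)) goto cleanup;", assuming a
set created once at startup.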

> Two aspects that I found trying to understand the long scan times, and
> I'd like to mention so they don't become forgotten:
> 
>   . If there are compressed files in the directory, etags will
>     uncompress them before it attempts to identify their language.
>     There are 20 such files in the gecko-dev tree (removing them from
>     the list of scanned files had only minor effect on the elapsed
>     time, but it could be different in other cases, especially if
>     uncompressing them produces very large files).

I guess someone might ask for flag "--no-decompress", sometime.

>   . Some files have their language identified by means other than their
>     names or extensions: those are the languages that have
>     "interpreters" defined in etags.c.  Shell scripts is one such case,
>     but not the only one.  So when etags-regen.el passes only files
>     with known extensions to etags, it misses those files from TAGS.
>     As one example, the file js/src/devtools/rootAnalysis/run_complete
>     in the gecko-dev tree is a Perl script, but has no .pl extension.

This sounds the same as the "hashbang" files that we mentioned 
previously. It makes sense for the scan to take longer, of course, 
proportional to the number of the detected files.





^ permalink raw reply related	[flat|nested] 48+ messages in thread

* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-09 18:23                                                             ` Dmitry Gutov
@ 2024-10-09 19:11                                                               ` Eli Zaretskii
  2024-10-09 22:22                                                                 ` Dmitry Gutov
  2024-10-10  1:07                                                               ` Francesco Potortì
  2024-10-10  1:39                                                               ` Francesco Potortì
  2 siblings, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2024-10-09 19:11 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: pot, 73484, spwhitton

> Date: Wed, 9 Oct 2024 21:23:37 +0300
> Cc: pot@gnu.org, spwhitton@spwhitton.name, 73484@debbugs.gnu.org
> From: Dmitry Gutov <dmitry@gutov.dev>
> 
> 'perf' shows me a profile like this:
> 
>    67.31%  etags    libc.so.6          [.] __strcmp_avx2
>    26.29%  etags    etags              [.] process_file_name
>     2.00%  etags    etags              [.] streq
>     0.96%  etags    etags              [.] strcmp@plt
>     0.32%  etags    etags              [.] readline_internal
>     0.11%  etags    etags              [.] HTML_labels
>     0.08%  etags    [kernel.kallsyms]  [k] syscall_return_via_sysret
>     0.07%  etags    [kernel.kallsyms]  [k] kmem_cache_alloc
>     0.06%  etags    [kernel.kallsyms]  [k] entry_SYSRETQ_unsafe_stack
>     0.05%  etags    [kernel.kallsyms]  [k] perf_adjust_freq_unthr_context
>     0.04%  etags    etags              [.] c_strncasecmp
> 
> So... most of the time is spent in string comparison.
> 
> Here is the nested loop; if I comment it out, the parse finishes in
> ~20 seconds with all the extra files (except *.js), or in 15s when used
> with the new flags.
> 
> diff --git a/lib-src/etags.c b/lib-src/etags.c
> index a822a823a90..331e3ffe816 100644
> --- a/lib-src/etags.c
> +++ b/lib-src/etags.c
> @@ -1697,14 +1697,14 @@ process_file_name (char *file, language *lang)
>         uncompressed_name = file;
>       }
> 
> -  /* If the canonicalized uncompressed name
> -     has already been dealt with, skip it silently. */
> -  for (fdp = fdhead; fdp != NULL; fdp = fdp->next)
> -    {
> -      assert (fdp->infname != NULL);
> -      if (streq (uncompressed_name, fdp->infname))
> -	goto cleanup;
> -    }
> +  /* /\* If the canonicalized uncompressed name */
> +  /*    has already been dealt with, skip it silently. *\/ */
> +  /* for (fdp = fdhead; fdp != NULL; fdp = fdp->next) */
> +  /*   { */
> +  /*     assert (fdp->infname != NULL); */
> +  /*     if (streq (uncompressed_name, fdp->infname)) */
> +  /* 	goto cleanup; */
> +  /*   } */
> 
>     inf = fopen (file, "r" FOPEN_BINARY);
>     if (inf)
> 
> This is basically a "uniqueness" operation using linear search, O(N^2).

Yes, this seems to be a protection against the same file name
mentioned more than once on the command line.

> Is there a hash table we could use?

Something like that should do, yes.

> Or perhaps we would skip the search when the canonicalized name is the 
> same as the original one.

That's not the same as the loop above does, I think.

> > Two aspects that I found trying to understand the long scan times, and
> > I'd like to mention so they don't become forgotten:
> > 
> >   . If there are compressed files in the directory, etags will
> >     uncompress them before it attempts to identify their language.
> >     There are 20 such files in the gecko-dev tree (removing them from
> >     the list of scanned files had only minor effect on the elapsed
> >     time, but it could be different in other cases, especially if
> >     uncompressing them produces very large files).
> 
> I guess someone might ask for flag "--no-decompress", sometime.

Yes, but it's also easy to exclude them via 'find'.

> >   . Some files have their language identified by means other than their
> >     names or extensions: those are the languages that have
> >     "interpreters" defined in etags.c.  Shell scripts is one such case,
> >     but not the only one.  So when etags-regen.el passes only files
> >     with known extensions to etags, it misses those files from TAGS.
> >     As one example, the file js/src/devtools/rootAnalysis/run_complete
> >     in the gecko-dev tree is a Perl script, but has no .pl extension.
> 
> This sounds the same as the "hashbang" files that we mentioned 
> previously. It makes sense for the scan to take longer, of course, 
> proportional to the number of the detected files.

My point was that if someone wants all the Python files, say,
submitting only Python extensions to etags might miss some Python
scripts.





^ permalink raw reply	[flat|nested] 48+ messages in thread

* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-09 19:11                                                               ` Eli Zaretskii
@ 2024-10-09 22:22                                                                 ` Dmitry Gutov
  2024-10-10  5:13                                                                   ` Eli Zaretskii
  0 siblings, 1 reply; 48+ messages in thread
From: Dmitry Gutov @ 2024-10-09 22:22 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: pot, 73484, spwhitton

On 09/10/2024 22:11, Eli Zaretskii wrote:

>> This is basically a "uniqueness" operation using linear search, O(N^2).
> 
> Yes, this seems to be a protection against the same file name
> mentioned more than once on the command line.

Or, maybe more likely, against having symlinks scanned if the symlink 
target is also in the passed list.

>> Is there a hash table we could use?
> 
> Something like that should do, yes.

Can we use search.h? hcreate/hsearch/etc. IIUC it's not in the C standard, 
and 
https://www.gnu.org/savannah-checkouts/gnu/gnulib/manual/html_node/hcreate.html 
says it's available on certain platforms.

>> Or perhaps we would skip the search when the canonicalized name is the
>> same as the original one.
> 
> That's not the same as the loop above does, I think.

If we assumed the duplicate check is only necessary for symlinks, and 
there is on average a small number of them, I think we could avoid using 
a hash table. But passing the same exact file 2 times would result in 
duplicate tags.

>> I guess someone might ask for flag "--no-decompress", sometime.
> 
> Yes, but it's also easy to exclude them via 'find'.

Or through etags-regen-ignores.

>>>    . Some files have their language identified by means other than their
>>>      names or extensions: those are the languages that have
>>>      "interpreters" defined in etags.c.  Shell scripts is one such case,
>>>      but not the only one.  So when etags-regen.el passes only files
>>>      with known extensions to etags, it misses those files from TAGS.
>>>      As one example, the file js/src/devtools/rootAnalysis/run_complete
>>>      in the gecko-dev tree is a Perl script, but has no .pl extension.
>>
>> This sounds the same as the "hashbang" files that we mentioned
>> previously. It makes sense for the scan to take longer, of course,
>> proportional to the number of the detected files.
> 
> My point was that if someone wants all the Python files, say,
> submitting only Python extensions to etags might miss some Python
> scripts.

Yes, that's the problem from the first comments of this report: to have 
hashbang files scanned, one can't use a whitelist of extensions. Using a 
blacklist should be fine, though.





^ permalink raw reply	[flat|nested] 48+ messages in thread

* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-09 18:23                                                             ` Dmitry Gutov
  2024-10-09 19:11                                                               ` Eli Zaretskii
@ 2024-10-10  1:07                                                               ` Francesco Potortì
  2024-10-10  5:41                                                                 ` Eli Zaretskii
  2024-10-10 10:17                                                                 ` Dmitry Gutov
  2024-10-10  1:39                                                               ` Francesco Potortì
  2 siblings, 2 replies; 48+ messages in thread
From: Francesco Potortì @ 2024-10-10  1:07 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: Eli Zaretskii, 73484, spwhitton

>Here is the nested loop; if I comment it out, the parse finishes in
>~20 seconds with all the extra files (except *.js), or in 15s when used
>with the new flags.
>
>diff --git a/lib-src/etags.c b/lib-src/etags.c
>index a822a823a90..331e3ffe816 100644
>--- a/lib-src/etags.c
>+++ b/lib-src/etags.c
>@@ -1697,14 +1697,14 @@ process_file_name (char *file, language *lang)
>        uncompressed_name = file;
>      }
>
>-  /* If the canonicalized uncompressed name
>-     has already been dealt with, skip it silently. */
>-  for (fdp = fdhead; fdp != NULL; fdp = fdp->next)
>-    {
>-      assert (fdp->infname != NULL);
>-      if (streq (uncompressed_name, fdp->infname))
>-	goto cleanup;
>-    }
>+  /* /\* If the canonicalized uncompressed name */
>+  /*    has already been dealt with, skip it silently. *\/ */
>+  /* for (fdp = fdhead; fdp != NULL; fdp = fdp->next) */
>+  /*   { */
>+  /*     assert (fdp->infname != NULL); */
>+  /*     if (streq (uncompressed_name, fdp->infname)) */
>+  /* 	goto cleanup; */
>+  /*   } */
>
>    inf = fopen (file, "r" FOPEN_BINARY);
>    if (inf)
>
>This is basically a "uniqueness" operation using linear search, O(N^2).

This is only for dealing with the case when the same file exists in both compressed and uncompressed form, and we are currently hitting the second one.  In that case, we should skip it.  Yes, this is a uniqueness test and yes, it is O(N^2) in the number of file names, but I doubt that this can explain a serious slowdown.

>Is there a hash table we could use?

No, we have a hash table for C tags, and that's all.  It is useful because there are 34 keywords against which most strings in a C/C++ file are compared.  It makes sense to build hash tables for other languages where a similar situation happens.

I do not think that it makes sense to build a hash table for file names given on the command line, because the number of comparisons made on those names is generally vastly smaller than the number of comparisons used to search for tags.

>>   . Some files have their language identified by means other than their
>>     names or extensions: those are the languages that have
>>     "interpreters" defined in etags.c

The interpreter is the token that comes after #!, with the possible exception of "env", in which case the interpreter is the second token after #!.
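
As an illustration of that rule, here is a minimal sketch in C.  It is
not the code from etags.c; the names next_token, base_name and
hashbang_interpreter are invented for the example, and it additionally
assumes that the directory part of the interpreter path is dropped.  It
does not try to handle options passed to "env".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Copy the next blank-delimited token starting at *P into BUF (at most
   BUFSIZE - 1 chars), advance *P past it, and return the stored length.  */
static size_t
next_token (const char **p, char *buf, size_t bufsize)
{
  const char *s = *p;
  size_t n = 0;

  while (*s == ' ' || *s == '\t')
    s++;
  while (*s != '\0' && *s != ' ' && *s != '\t' && *s != '\n' && *s != '\r')
    {
      if (n + 1 < bufsize)
        buf[n++] = *s;
      s++;
    }
  buf[n] = '\0';
  *p = s;
  return n;
}

/* Strip any directory part: "/usr/bin/perl" -> "perl".  */
static const char *
base_name (const char *path)
{
  const char *slash = strrchr (path, '/');
  return slash ? slash + 1 : path;
}

/* If FIRST_LINE starts with "#!", store the interpreter's base name in
   BUF and return true; otherwise return false.  When the first token
   is "env", the token after it is used instead.  */
bool
hashbang_interpreter (const char *first_line, char *buf, size_t bufsize)
{
  assert (first_line != NULL && buf != NULL && bufsize > 0);
  if (first_line[0] != '#' || first_line[1] != '!')
    return false;

  const char *p = first_line + 2;
  char token[256];
  if (next_token (&p, token, sizeof token) == 0)
    return false;
  if (strcmp (base_name (token), "env") == 0
      && next_token (&p, token, sizeof token) == 0)
    return false;

  const char *name = base_name (token);
  size_t len = strlen (name);
  if (len >= bufsize)
    len = bufsize - 1;
  memcpy (buf, name, len);
  buf[len] = '\0';
  return len > 0;
}
```

The returned name would then be matched against the per-language lists
of interpreters.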

There are two O(N^2) tests in the number of tags in C/C++ files, which depend on the two options "no-line-directive" and "no-duplicates".  Both options can be used to disable those checks, and both are off by default because the checks help produce a saner tags file and have no practical impact in most cases.  Both are there because, in principle, the checks can cause significant slowdown in huge tags files.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-09 18:23                                                             ` Dmitry Gutov
  2024-10-09 19:11                                                               ` Eli Zaretskii
  2024-10-10  1:07                                                               ` Francesco Potortì
@ 2024-10-10  1:39                                                               ` Francesco Potortì
  2024-10-10  5:45                                                                 ` Eli Zaretskii
  2 siblings, 1 reply; 48+ messages in thread
From: Francesco Potortì @ 2024-10-10  1:39 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: Eli Zaretskii, 73484, spwhitton

I have just written:
>There are two O(N^2) tests in the number of tags in C/C++ files, which depend on the two options "no-line-directive" and "no-duplicates".  Both options can be used to disable those checks, and both are off by default because the checks help produce a saner tags file and have no practical impact in most cases.  Both are there because, in principle, the checks can cause significant slowdown in huge tags files.

However, --no-line-directive exhibits the O(N^2) behaviour in the number of tags only for languages with the "metafile" property, currently only yacc files.  Unless you have a significant number of yacc files, the impact is O(N) in the number of tag candidates.  And --no-duplicates only matters when creating a ctags file.

Maybe you could give it a try and check whether --no-line-directive has any impact.





^ permalink raw reply	[flat|nested] 48+ messages in thread

* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-09 22:22                                                                 ` Dmitry Gutov
@ 2024-10-10  5:13                                                                   ` Eli Zaretskii
  0 siblings, 0 replies; 48+ messages in thread
From: Eli Zaretskii @ 2024-10-10  5:13 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: pot, 73484, spwhitton

> Date: Thu, 10 Oct 2024 01:22:13 +0300
> Cc: pot@gnu.org, spwhitton@spwhitton.name, 73484@debbugs.gnu.org
> From: Dmitry Gutov <dmitry@gutov.dev>
> 
> On 09/10/2024 22:11, Eli Zaretskii wrote:
> 
> >> This is basically a "uniqueness" operation using linear search, O(N^2).
> > 
> > Yes, this seems to be a protection against the same file name
> > mentioned more than once on the command line.
> 
> Or, maybe more likely, against having symlinks scanned if the symlink 
> target is also in the passed list.

Yes, that, but also any other possible ways of specifying the same
file twice, like having a file both compressed and uncompressed, etc.

> >> Is there a hash table we could use?
> > 
> > Something like that should do, yes.
> 
> Can we use search.h? hcreate/hsearch/etc. IIUC it's not in the C standard,
> and 
> https://www.gnu.org/savannah-checkouts/gnu/gnulib/manual/html_node/hcreate.html 
> says it's available on certain platforms.

I think we shouldn't: it is not sufficiently portable and Gnulib
doesn't have an implementation for it for those platforms that don't
have it.

We could perhaps use the standard tsearch (although it will be more
expensive).  Alternatively, we could steal the hash table code from
somewhere, for example, from Gawk.
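
To make the tsearch option concrete, here is a minimal sketch of a "have we seen this name already?" set built on POSIX tsearch.  It is only an illustration: remember_name and compare_names are hypothetical names that do not exist in etags.c, and the sketch assumes a platform that provides <search.h>.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <search.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* tsearch passes the keys themselves, so plain strcmp is enough. */
int compare_names(const void *a, const void *b)
{
  return strcmp(a, b);
}

/* Hypothetical helper: returns true if NAME was newly added to the
   tree rooted at *ROOT, false if an equal name was already there. */
bool remember_name(void **root, const char *name)
{
  char *copy = strdup(name);
  assert(copy != NULL);

  /* tsearch returns a pointer to a tree node whose first member is
     the stored key; it inserts COPY only when no equal key exists. */
  void *node = tsearch(copy, root, compare_names);
  assert(node != NULL);

  if (*(char **) node != copy)
    {
      free(copy);   /* an older equal key is already stored */
      return false;
    }
  return true;
}
```

Each lookup costs O(log N) string comparisons instead of a walk over every previously seen name, which is the "more expensive than a hash table" trade-off mentioned above.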

> >> Or perhaps we would skip the search when the canonicalized name is the
> >> same as the original one.
> > 
> > That's not the same as the loop above does, I think.
> 
> If we assumed the duplicate check is only necessary for symlinks, and 
> there is on average a small number of them, I think we could avoid using 
> a hash table. But passing the same exact file 2 times would result in 
> duplicate tags.

canonicalize_filename in etags.c does not resolve symlinks, AFAICT, so
the symlink scenario will not be solved by that.  We'd need realpath
or its equivalent, I think?
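
As a rough sketch of what the realpath approach could look like (same_resolved_path is a hypothetical helper, not something present in etags.c, and the code assumes a POSIX.1-2008 realpath that accepts a NULL buffer):

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: true when both names resolve, symlinks and
   "." / ".." components included, to the same absolute path.
   Names that cannot be resolved never compare equal. */
bool same_resolved_path(const char *a, const char *b)
{
  assert(a != NULL && b != NULL);

  char *ra = realpath(a, NULL);   /* malloc'ed result, or NULL on error */
  char *rb = realpath(b, NULL);
  bool same = ra != NULL && rb != NULL && strcmp(ra, rb) == 0;

  free(ra);
  free(rb);
  return same;
}
```

Note that realpath only works for files that exist, so it would have to be called on the name as found on disk, not on a derived uncompressed name.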





^ permalink raw reply	[flat|nested] 48+ messages in thread

* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-10  1:07                                                               ` Francesco Potortì
@ 2024-10-10  5:41                                                                 ` Eli Zaretskii
  2024-10-10  8:27                                                                   ` Francesco Potortì
  2024-10-10 10:17                                                                 ` Dmitry Gutov
  1 sibling, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2024-10-10  5:41 UTC (permalink / raw)
  To: Francesco Potortì; +Cc: dmitry, 73484, spwhitton

> From: Francesco Potortì <pot@gnu.org>
> Date: Thu, 10 Oct 2024 03:07:31 +0200
> Cc: 73484@debbugs.gnu.org,
> 	spwhitton@spwhitton.name,
> 	Eli Zaretskii <eliz@gnu.org>
> 
> >+  /* /\* If the canonicalized uncompressed name */
> >+  /*    has already been dealt with, skip it silently. *\/ */
> >+  /* for (fdp = fdhead; fdp != NULL; fdp = fdp->next) */
> >+  /*   { */
> >+  /*     assert (fdp->infname != NULL); */
> >+  /*     if (streq (uncompressed_name, fdp->infname)) */
> >+  /* 	goto cleanup; */
> >+  /*   } */
> >
> >    inf = fopen (file, "r" FOPEN_BINARY);
> >    if (inf)
> >
> >This is basically a "uniqueness" operation using linear search, O(N^2).
> 
> This is only for dealing with the case when the same file exists in both compressed and uncompressed form, and we are currently hitting the second one.  In that case, we should skip it.  Yes, this is a uniqueness test and yes, it is O^2 in the number of file names, but I doubt that this can explain a serious slowdown.

Are you sure this is executed only for compressed files?  Maybe I'm
missing something, but that's not my reading of the code:

  compr = get_compressor_from_suffix (file, &ext);
  if (compr)
    {
      compressed_name = file;
      uncompressed_name = savenstr (file, ext - file);
    }
  else
    {
      compressed_name = NULL;
      uncompressed_name = file;
    }

  /* If the canonicalized uncompressed name
     has already been dealt with, skip it silently. */
  for (fdp = fdhead; fdp != NULL; fdp = fdp->next)
    {
      assert (fdp->infname != NULL);
      if (streq (uncompressed_name, fdp->infname))
	goto cleanup;
    }

As you see, if the file is not compressed by any known method, the
code sets compressed_name to NULL and uncompressed_name to the
canonicalized file.  But the loop doesn't test compressed_name, so it
is executed for all the files, compressed and uncompressed.  Thus, I
believe the intent is to avoid duplicate tags if the same file was
encountered twice in some way.

Note that canonicalize_filename in this case doesn't really do what
its name seems to imply, e.g., relative file names will generally stay
relative.  So specifying the same file once as relative and the other
time as absolute will still process the file more than once.  We need
to use an inode test or equivalent, and probably use realpath or
equivalent, to make the duplicate test reliable.  Or maybe having the
same file processed under different names is okay, since TAGS is for
helping Emacs find the file, and so using relative names and symlinks
is okay?

> >Is there a hash table we could use?
> 
> No, we have a hash table for C tags, and that's all.  It is useful because there are 34 keywords against which most strings in a C/C++ file are compared.  It makes sense to build hash tables for other languages where a similar situation happens.

The hash table we have was built by gperf, and that method can only be
used for fixed sets of strings known in advance.  We need a different
hash table for storing file names.

> I do not think that it makes sense to build a hash table for file names given on the command line, because the number of comparisons made on those names is generally vastly inferior to the number of comparisons used to search for tags.

That's not what I see in the code.  But it should be easy to count the
number of loop iterations in the use case we are talking about
(running etags on the gecko-dev tree), so we don't need to argue about
facts.

> >>   . Some files have their language identified by means other than their
> >>     names or extensions: those are the languages that have
> >>     "interpreters" defined in etags.c
> 
> The interpreter is the token that comes after #!, with the possible exception of "env", in which case the interpreter is the second token after #!.
> 
> There are two O^2 tests in the number of tags in C/C++ files, which depend on the two options "no-line-directive" and "no-duplicates".  Both options can be used to disable those checks, and both are off by default because the checks help produce a saner tags file and have no practical impact in most cases.  Both options exist because, in principle, the checks can cause a significant slowdown with huge tags files.

AFAIU, --no-duplicates is only for ctags, not for etags.  I don't see
how --no-duplicates could be relevant to the loop described above.  Am
I missing something?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-10  1:39                                                               ` Francesco Potortì
@ 2024-10-10  5:45                                                                 ` Eli Zaretskii
  0 siblings, 0 replies; 48+ messages in thread
From: Eli Zaretskii @ 2024-10-10  5:45 UTC (permalink / raw)
  To: Francesco Potortì; +Cc: dmitry, 73484, spwhitton

> From: Francesco Potortì <pot@gnu.org>
> Date: Thu, 10 Oct 2024 03:39:47 +0200
> Cc: 73484@debbugs.gnu.org,
> 	spwhitton@spwhitton.name,
> 	Eli Zaretskii <eliz@gnu.org>
> 
> I have just written:
> >There are two O^2 tests in the number of tags in C/C++ files, which depend on the two options "no-line-directive" and "no-duplicates".  Both options can be used to disable those checks, and both are off by default because the checks help produce a saner tags file and have no practical impact in most cases.  Both options exist because, in principle, the checks can cause a significant slowdown with huge tags files.
> 
> However, the check disabled by --no-line-directive exhibits the O^2 behaviour in the number of tags only for languages with the "metafile" property, currently only yacc files.  Unless you have a significant number of yacc files, the impact is O^1 in the number of tag candidates.  And --no-duplicates only matters when creating a ctags file.
> 
> Maybe you could give it a try and check whether --no-line-directive has any impact.

I already did that: the effect is null and void.  Which is not a
surprise, since there are only 3 Yacc files in this tree.





^ permalink raw reply	[flat|nested] 48+ messages in thread

* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-10  5:41                                                                 ` Eli Zaretskii
@ 2024-10-10  8:27                                                                   ` Francesco Potortì
  2024-10-10  8:35                                                                     ` Eli Zaretskii
  0 siblings, 1 reply; 48+ messages in thread
From: Francesco Potortì @ 2024-10-10  8:27 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dmitry, 73484, spwhitton

>> >This is basically a "uniqueness" operation using linear search, O(N^2).

> Thus, I
>believe the intent is to avoid duplicate tags if the same file was
>encountered twice in some way.

Yes.  Sorry, I spoke from memory and I was inaccurate.

>Note that canonicalize_filename in this case doesn't really do what
>its name seems to imply, e.g., relative file names will generally stay
>relative.

It canonicalises, that is, it reduces the name to a standard common form, but it retains the difference between relative and absolute names.

>So specifying the same file once as relative and the other
>time as absolute will still process the file more than once.

From memory, I would say so, yes.  I have not checked right now.

>We need
>to use an inode test or equivalent, and probably use realpath or
>equivalent, to make the duplicate test reliable.
>Or maybe having the
>same file processed under different names is okay, since TAGS is for
>helping Emacs find the file, and so using relative names and symlinks
>is okay?

Yes, I think so.  And from memory I think it should be left unchanged.

>> I do not think that it makes sense to build a hash table for file names given on the command line, because the number of comparisons made on those names is generally vastly inferior to the number of comparisons used to search for tags.
>
>That's not what I see in the code.  But it should be easy to count the
>number of loop iterations in the use case we are talking about
>(running etags on the gecko-dev tree), so we don't need to argue about
>facts.

Yes.  If finding a bottleneck is the objective, you should maybe instrument the string comparison functions so that you can count how many times they are called from different places.
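
A minimal sketch of such instrumentation follows.  It assumes that the comparisons of interest go through a single string-equality helper; the enum values and the names counted_streq and report_streq_calls are invented for this example and are not taken from etags.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical call sites to tell apart when counting. */
enum streq_site { SITE_FILE_NAMES, SITE_TAGS, SITE_COUNT };

static unsigned long long streq_calls[SITE_COUNT];

/* Same result as a plain string-equality test, plus a per-site tally. */
bool counted_streq(enum streq_site site, const char *s, const char *t)
{
  assert(s != NULL && t != NULL);
  streq_calls[site]++;
  return strcmp(s, t) == 0;
}

/* Meant to be called once at exit, for example via atexit. */
void report_streq_calls(void)
{
  fprintf(stderr, "streq calls: file names %llu, tags %llu\n",
          streq_calls[SITE_FILE_NAMES], streq_calls[SITE_TAGS]);
}
```

Comparing the two totals after a run on the large tree would show directly whether file name comparison dominates.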

I had a quick look at the whole code and in fact the only place I can find where you have O^2 behaviour seems to be file name comparison, and it still looks so strange to me that this can in fact cause significant delay.

I may certainly have missed something, but if that's really the case, the first thing is to look for code inefficiencies.  If this is really structural, one should first read all filenames, canonicalise and uniquify them, and only then create the tags.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-10  8:27                                                                   ` Francesco Potortì
@ 2024-10-10  8:35                                                                     ` Eli Zaretskii
  2024-10-10 14:25                                                                       ` Francesco Potortì
  0 siblings, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2024-10-10  8:35 UTC (permalink / raw)
  To: Francesco Potortì; +Cc: dmitry, 73484, spwhitton

> From: Francesco Potortì <pot@gnu.org>
> Date: Thu, 10 Oct 2024 10:27:57 +0200
> Cc: spwhitton@spwhitton.name,
> 	73484@debbugs.gnu.org,
> 	dmitry@gutov.dev
> 
> >That's not what I see in the code.  But it should be easy to count the
> >number of loop iterations in the use case we are talking about
> >(running etags on the gecko-dev tree), so we don't need to argue about
> >facts.
> 
> Yes.  If finding a bottleneck is the objective, you should maybe instrument the string comparison functions so that you can count how many times they are called from different places.
> 
> I had a quick look at the whole code and in fact the only place I can find where you have O^2 behaviour seems to be file name comparison, and it still looks so strange to me that this can in fact cause significant delay.

We are using etags on a huge tree: about 375K files.  I think that's
the reason, because non-linear behaviors are like that: they are
insignificant with small sets, but huge with larger ones...

Profiles don't lie...





^ permalink raw reply	[flat|nested] 48+ messages in thread

* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-10  1:07                                                               ` Francesco Potortì
  2024-10-10  5:41                                                                 ` Eli Zaretskii
@ 2024-10-10 10:17                                                                 ` Dmitry Gutov
  1 sibling, 0 replies; 48+ messages in thread
From: Dmitry Gutov @ 2024-10-10 10:17 UTC (permalink / raw)
  To: Francesco Potortì; +Cc: Eli Zaretskii, 73484, spwhitton

On Thu, Oct 10, 2024, at 3:07 AM, Francesco Potortì wrote:
> >Here is the nested loop, which if I comment out, makes the parse finish 
> >in ~20 seconds, with all the extra files (except *.js), or in 15s when 
> >using with new flags.
> >
> >diff --git a/lib-src/etags.c b/lib-src/etags.c
> >index a822a823a90..331e3ffe816 100644
> >--- a/lib-src/etags.c
> >+++ b/lib-src/etags.c
> >@@ -1697,14 +1697,14 @@ process_file_name (char *file, language *lang)
> >        uncompressed_name = file;
> >      }
> >
> >-  /* If the canonicalized uncompressed name
> >-     has already been dealt with, skip it silently. */
> >-  for (fdp = fdhead; fdp != NULL; fdp = fdp->next)
> >-    {
> >-      assert (fdp->infname != NULL);
> >-      if (streq (uncompressed_name, fdp->infname))
> >-	goto cleanup;
> >-    }
> >+  /* /\* If the canonicalized uncompressed name */
> >+  /*    has already been dealt with, skip it silently. *\/ */
> >+  /* for (fdp = fdhead; fdp != NULL; fdp = fdp->next) */
> >+  /*   { */
> >+  /*     assert (fdp->infname != NULL); */
> >+  /*     if (streq (uncompressed_name, fdp->infname)) */
> >+  /* 	goto cleanup; */
> >+  /*   } */
> >
> >    inf = fopen (file, "r" FOPEN_BINARY);
> >    if (inf)
> >
> >This is basically a "uniqueness" operation using linear search, O(N^2).
> 
> This is only for dealing with the case when the same file exists in both compressed and uncompressed form, and we are currently hitting the second one.  In that case, we should skip it.  Yes, this is a uniqueness test and yes, it is O^2 in the number of file names, but I doubt that this can explain a serious slowdown.

As mentioned in a previous email, I did recompile with that step removed, and the slowdown was gone.

The whole scan went down to ~20 seconds.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-10  8:35                                                                     ` Eli Zaretskii
@ 2024-10-10 14:25                                                                       ` Francesco Potortì
  2024-10-10 16:28                                                                         ` Eli Zaretskii
  0 siblings, 1 reply; 48+ messages in thread
From: Francesco Potortì @ 2024-10-10 14:25 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dmitry, 73484, spwhitton

>> I had a quick look at the whole code and in fact the only place I can find where you have O^2 behaviour seems to be file name comparison, and it still looks so strange to me that this can in fact cause significant delay.
>
>We are using etags on a huge tree: about 375K files.  I think that's
>the reason, because non-linear behaviors are like that: they are
>insignificant with small sets, but huge with larger ones...
>
>Profiles don't lie...

Ok, makes sense.  I must have missed the number of files in your previous explanations, sorry.  The only other place where I found O^2 behaviour is when managing #line directives, but you already tried to disable them without much change.  So let's concentrate on file name comparison which is done in process_file_name at

  for (fdp = fdhead; fdp != NULL; fdp = fdp->next)
    {
      assert (fdp->infname != NULL);
      if (streq (uncompressed_name, fdp->infname))
	goto cleanup;
    }

This is a simple O^2 comparison: the i-th file name is compared with the i-1 names before it, so the comparison is repeated N(N-1)/2 =~ N^2/2 times, which for ~375k files means ~70G comparisons.  If you can count the number of times streq is called and 70G is a substantial portion of that number, then we have the culprit.  To check, just remove the above test and see if the running time drops.
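
The estimate can be checked with a few lines of C.  This is only the arithmetic for the worst case, where no duplicate is ever found and every new name is compared with all previous ones; worst_case_comparisons is a name made up for this example.

```c
#include <assert.h>

/* Number of streq calls made by the linear duplicate search when N
   distinct file names are processed: 0 + 1 + ... + (N-1). */
unsigned long long worst_case_comparisons(unsigned long long n)
{
  /* Keep n * (n - 1) within 64 bits. */
  assert(n <= 4294967295ULL);
  return n == 0 ? 0 : n * (n - 1) / 2;
}
```

For N = 375000 this gives 70312312500, that is, about 70G comparisons.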

In that case, using a hash rather than a comparison would probably make sense.  Alternatively, rather than managing file names in a single loop, do a first loop on all file names to canonicalise them, but without searching for tags (essentially, remove the call to process_file from process_file_name), then uniquify the list of canonicalised file names, then run process_file on them.
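
The "uniquify" step of that alternative could be sketched like this, assuming the canonicalised names have already been collected in an array.  The function names are hypothetical, and sorting is just one possible way to find the duplicates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* qsort hands us pointers to the array elements, i.e. char **. */
int compare_name_ptrs(const void *a, const void *b)
{
  return strcmp(*(char *const *) a, *(char *const *) b);
}

/* Hypothetical helper: sorts NAMES in place, drops repeated names,
   and returns how many distinct names remain at the front. */
size_t uniquify_names(char **names, size_t n)
{
  assert(names != NULL || n == 0);
  if (n == 0)
    return 0;

  qsort(names, n, sizeof *names, compare_name_ptrs);

  size_t kept = 1;
  for (size_t i = 1; i < n; i++)
    if (strcmp(names[i], names[kept - 1]) != 0)
      names[kept++] = names[i];
  return kept;
}
```

This costs O(N log N) comparisons, but sorting discards the order in which the names appeared on the command line, so a real implementation would need to preserve that order in some other way.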

^ permalink raw reply	[flat|nested] 48+ messages in thread

* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-10 14:25                                                                       ` Francesco Potortì
@ 2024-10-10 16:28                                                                         ` Eli Zaretskii
  2024-10-11 10:37                                                                           ` Francesco Potortì
  0 siblings, 1 reply; 48+ messages in thread
From: Eli Zaretskii @ 2024-10-10 16:28 UTC (permalink / raw)
  To: Francesco Potortì; +Cc: dmitry, 73484, spwhitton

> From: Francesco Potortì <pot@gnu.org>
> Date: Thu, 10 Oct 2024 16:25:28 +0200
> Cc: dmitry@gutov.dev,
> 	73484@debbugs.gnu.org,
> 	spwhitton@spwhitton.name
> 
>   for (fdp = fdhead; fdp != NULL; fdp = fdp->next)
>     {
>       assert (fdp->infname != NULL);
>       if (streq (uncompressed_name, fdp->infname))
> 	goto cleanup;
>     }
> 
> This is a simple O^2 comparison, which is repeated sum(1,N,N-1)=~N^2/2, which for ~375k files means ~70G comparisons.  If you can count the number of times streq is called and 70G is a substantial portion of that number, then we have the culprit.  To check, just remove the above test and see if the running time drops.

Dmitry already made this check, and the run time did drop, see
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=73484#107

> In that case, using a hash rather than a comparison would probably make sense.

Right.

> Alternatively, rather than managing file names in a single loop, do a first loop on all file names to canonicalise them, but without searching for tags (essentially, remove the call to process_file from process_file_name), then uniquify the list of canonicalised file names, then run process_file on them.

I don't think this is possible because command-line options can be
interspersed with file names, and each option affects the processing
of the files whose names follow the option.





^ permalink raw reply	[flat|nested] 48+ messages in thread

* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-10 16:28                                                                         ` Eli Zaretskii
@ 2024-10-11 10:37                                                                           ` Francesco Potortì
  0 siblings, 0 replies; 48+ messages in thread
From: Francesco Potortì @ 2024-10-11 10:37 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dmitry, 73484, spwhitton

>> From: Francesco Potortì <pot@gnu.org>
>> Date: Thu, 10 Oct 2024 16:25:28 +0200
>> Cc: dmitry@gutov.dev,
>> 	73484@debbugs.gnu.org,
>> 	spwhitton@spwhitton.name
>> 
>>   for (fdp = fdhead; fdp != NULL; fdp = fdp->next)
>>     {
>>       assert (fdp->infname != NULL);
>>       if (streq (uncompressed_name, fdp->infname))
>> 	goto cleanup;
>>     }
>> 
>> This is a simple O^2 comparison, which is repeated sum(1,N,N-1)=~N^2/2, which for ~375k files means ~70G comparisons.  If you can count the number of times streq is called and 70G is a substantial portion of that number, then we have the culprit.  To check, just remove the above test and see if the running time drops.
>
>Dmitry already made this check, and the run time did drop, see
>https://debbugs.gnu.org/cgi/bugreport.cgi?bug=73484#107

Yes, sorry, I am travelling and I had missed that email.

>> In that case, using a hash rather than a comparison would probably make sense.
>
>Right.

If I recall correctly, etags depends on libc only.  If that is really the case, it would be nice to create an ad hoc hash function without relying on additional libraries.

>> Alternatively, rather than managing file names in a single loop, do a first loop on all file names to canonicalise them, but without searching for tags (essentially, remove the call to process_file from process_file_name), then uniquify the list of canonicalised file names, then run process_file on them.
>
>I don't think this is possible because command-line options can be
>interspersed with file names, and each option affects the processing
>of the files whose names follow the option.

It should be possible as I have outlined above.  When the command line is parsed, process_file_name is called on each file name.  It canonicalises the current name, compares it with the previous file names, adds a new node containing the canonicalised name to a linked list and calls process_file on the file name.  It is possible to remove the last step and instead call process_file in a second loop, but I do not know if it is convenient.

The uniquify solution would be nonparametric, if I am not wrong, while the hash solution requires choosing the size of the hash table.

I guess that the hash solution is simpler and equally efficient in the great majority of cases, provided that the size of the hash table is appropriate.  Probably it would be reasonable to start with a 20-bit hash, and to increase that number if, in some years, doing so looks reasonable.
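
A minimal sketch of such a libc-only hash set, with a fixed 20-bit table and chained buckets, is given below.  Everything in it is an assumption made for illustration: the names hash_name and name_seen_before are invented, and FNV-1a is just one simple hash function that fits the purpose, not something taken from etags.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { HASH_BITS = 20, HASH_SIZE = 1 << HASH_BITS };

struct name_node
{
  struct name_node *next;
  char *name;
};

static struct name_node **name_table;   /* HASH_SIZE buckets */

/* 32-bit FNV-1a, xor-folded down to HASH_BITS bits. */
uint32_t hash_name(const char *s)
{
  uint32_t h = 2166136261u;
  for (; *s != '\0'; s++)
    {
      h ^= (unsigned char) *s;
      h *= 16777619u;
    }
  return (h ^ (h >> HASH_BITS)) & (HASH_SIZE - 1);
}

/* Returns true if NAME was recorded by an earlier call; otherwise
   records a private copy of it and returns false. */
bool name_seen_before(const char *name)
{
  if (name_table == NULL)
    {
      name_table = calloc(HASH_SIZE, sizeof *name_table);
      assert(name_table != NULL);
    }

  struct name_node **bucket = &name_table[hash_name(name)];
  for (struct name_node *p = *bucket; p != NULL; p = p->next)
    if (strcmp(p->name, name) == 0)
      return true;

  size_t len = strlen(name) + 1;
  struct name_node *node = malloc(sizeof *node);
  assert(node != NULL);
  node->name = malloc(len);
  assert(node->name != NULL);
  memcpy(node->name, name, len);
  node->next = *bucket;
  *bucket = node;
  return false;
}
```

With about a million buckets and ~375k names, most buckets hold at most one entry, so each lookup needs only a handful of string comparisons.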

^ permalink raw reply	[flat|nested] 48+ messages in thread

