* etags-regen-mode: handling extensionless files @ 2024-09-20 9:20 Sean Whitton 2024-09-20 18:23 ` Dmitry Gutov 0 siblings, 1 reply; 57+ messages in thread From: Sean Whitton @ 2024-09-20 9:20 UTC (permalink / raw) To: Dmitry Gutov; +Cc: emacs-devel Hello Dmitry, I'm working on a Perl project which has .pm files (which should be added to etags-regen-file-extensions; I've written mail about that) but also programs intended for execution, which have no file extension: <https://salsa.debian.org/dgit-team/dgit>. E.g. 'dgit' and 'git-debrebase' in the project root. On the one hand, it makes sense not to index these files because they're programs not libraries, so their internal definitions won't ever be referenced elsewhere. On the other hand, it would be nice just to be able to use a simple M-. to jump to definition, and not have to think about whether the function is defined in the program or in a library. Should we have some etags defcustom that allows specifying extra files to index? Or, as another idea, maybe etags could somehow include looking at what imenu finds? Might be too clever. Let me know what you think about this use case. -- Sean Whitton ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: etags-regen-mode: handling extensionless files 2024-09-20 9:20 etags-regen-mode: handling extensionless files Sean Whitton @ 2024-09-20 18:23 ` Dmitry Gutov 2024-09-22 12:02 ` Sean Whitton 0 siblings, 1 reply; 57+ messages in thread From: Dmitry Gutov @ 2024-09-20 18:23 UTC (permalink / raw) To: Sean Whitton; +Cc: emacs-devel Hi Sean, On 20/09/2024 12:20, Sean Whitton wrote: > I'm working on a Perl project which has .pm files (which should be added > to etags-regen-file-extensions; I've written mail about that) but also > programs intended for execution, which have no file extension: > <https://salsa.debian.org/dgit-team/dgit>. > E.g. 'dgit' and 'git-debrebase' in the project root. > > On the one hand, it makes sense not to index these files because they're > programs not libraries, so their internal definitions won't ever be > referenced elsewhere. They're probably referenced internally though, at least once. > On the other hand, it would be nice just to be able to use a simple > M-. to jump to definition, and not have to think about whether the > function is defined in the program or in a library. > > Should we have some etags defcustom that allows specifying extra files > to index? Or, as another idea, maybe etags could somehow include > looking at what imenu finds? Might be too clever. I guess we could add that extra option. But see my other email regarding etags' hashbang detection. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: etags-regen-mode: handling extensionless files 2024-09-20 18:23 ` Dmitry Gutov @ 2024-09-22 12:02 ` Sean Whitton 2024-09-23 17:00 ` Dmitry Gutov 0 siblings, 1 reply; 57+ messages in thread From: Sean Whitton @ 2024-09-22 12:02 UTC (permalink / raw) To: Dmitry Gutov; +Cc: emacs-devel Hello, On Fri 20 Sep 2024 at 09:23pm +03, Dmitry Gutov wrote: > Hi Sean, > > On 20/09/2024 12:20, Sean Whitton wrote: > >> I'm working on a Perl project which has .pm files (which should be added >> to etags-regen-file-extensions; I've written mail about that) but also >> programs intended for execution, which have no file extension: >> <https://salsa.debian.org/dgit-team/dgit>. >> E.g. 'dgit' and 'git-debrebase' in the project root. >> On the one hand, it makes sense not to index these files because they're >> programs not libraries, so their internal definitions won't ever be >> referenced elsewhere. > > They're probably referenced internally though, at least once. Good point. >> On the other hand, it would be nice just to be able to use a simple >> M-. to jump to definition, and not have to think about whether the >> function is defined in the program or in a library. >> Should we have some etags defcustom that allows specifying extra files >> to index? Or, as another idea, maybe etags could somehow include >> looking at what imenu finds? Might be too clever. > > I guess we could add that extra option. > > But see my other email regarding etags' hashbang detection. Hashbang detection would solve my problem elegantly. Is my reading of the other thread correct that if we can fix the fortran fallback then we can enable the hashbang detection? -- Sean Whitton ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: etags-regen-mode: handling extensionless files 2024-09-22 12:02 ` Sean Whitton @ 2024-09-23 17:00 ` Dmitry Gutov 2024-09-25 6:21 ` Sean Whitton 0 siblings, 1 reply; 57+ messages in thread From: Dmitry Gutov @ 2024-09-23 17:00 UTC (permalink / raw) To: Sean Whitton; +Cc: emacs-devel On 22/09/2024 15:02, Sean Whitton wrote: >> But see my other email regarding etags' hashbang detection. > > Hashbang detection would solve my problem elegantly. > > Is my reading of the other thread correct that if we can fix the fortran > fallback then we can enable the hashbang detection? Yep, I think so. We would probably also discuss etags' auto-detection and its list of default extensions, during the next release's development. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: etags-regen-mode: handling extensionless files 2024-09-23 17:00 ` Dmitry Gutov @ 2024-09-25 6:21 ` Sean Whitton 2024-09-25 11:41 ` Dmitry Gutov 2024-09-25 12:10 ` etags-regen-mode: handling extensionless files Eli Zaretskii 0 siblings, 2 replies; 57+ messages in thread From: Sean Whitton @ 2024-09-25 6:21 UTC (permalink / raw) To: Dmitry Gutov; +Cc: emacs-devel Hello, On Mon 23 Sep 2024 at 08:00pm +03, Dmitry Gutov wrote: > On 22/09/2024 15:02, Sean Whitton wrote: > >>> But see my other email regarding etags' hashbang detection. >> Hashbang detection would solve my problem elegantly. >> Is my reading of the other thread correct that if we can fix the fortran >> fallback then we can enable the hashbang detection? > > Yep, I think so. > > We would probably also discuss etags' auto-detection and its list of default > extensions, during the next release's development. Okay, cool! Should we have a bug to track this? -- Sean Whitton ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: etags-regen-mode: handling extensionless files 2024-09-25 6:21 ` Sean Whitton @ 2024-09-25 11:41 ` Dmitry Gutov 2024-09-25 19:27 ` bug#73484: 31.0.50; Abolishing etags-regen-file-extensions Sean Whitton 2024-09-25 12:10 ` etags-regen-mode: handling extensionless files Eli Zaretskii 1 sibling, 1 reply; 57+ messages in thread From: Dmitry Gutov @ 2024-09-25 11:41 UTC (permalink / raw) To: Sean Whitton; +Cc: emacs-devel On 25/09/2024 09:21, Sean Whitton wrote: >> We would probably also discuss etags' auto-detection and its list of default >> extensions, during the next release's development. > Okay, cool! Should we have a bug to track this? Sure, please go ahead. ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-09-25 11:41 ` Dmitry Gutov @ 2024-09-25 19:27 ` Sean Whitton 2024-09-25 22:30 ` Dmitry Gutov 0 siblings, 1 reply; 57+ messages in thread From: Sean Whitton @ 2024-09-25 19:27 UTC (permalink / raw) To: 73484 Hello, On Wed 25 Sep 2024 at 02:41pm +03, Dmitry Gutov wrote: > On 25/09/2024 09:21, Sean Whitton wrote: >>> We would probably also discuss etags' auto-detection and its list of default >>> extensions, during the next release's development. >> Okay, cool! Should we have a bug to track this? We want to replace etags-regen-file-extensions with enabling etags's hashbang detection support. That requires disabling its Fortran fallback. -- Sean Whitton ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-09-25 19:27 ` bug#73484: 31.0.50; Abolishing etags-regen-file-extensions Sean Whitton @ 2024-09-25 22:30 ` Dmitry Gutov 2024-09-26 7:43 ` Francesco Potortì 2024-09-29 8:25 ` Eli Zaretskii 0 siblings, 2 replies; 57+ messages in thread From: Dmitry Gutov @ 2024-09-25 22:30 UTC (permalink / raw) To: Sean Whitton, 73484 Hi! On 25/09/2024 22:27, Sean Whitton wrote: > On Wed 25 Sep 2024 at 02:41pm +03, Dmitry Gutov wrote: > >> On 25/09/2024 09:21, Sean Whitton wrote: >>>> We would probably also discuss etags' auto-detection and its list of default >>>> extensions, during the next release's development. >>> Okay, cool! Should we have a bug to track this? > > We want to replace etags-regen-file-extensions with enabling etags's > hashbang detection support. That requires disabling its Fortran > fallback. Thanks, a fuller plan would look something like this: - Implement the --no-fortran-fallback flag in etags. Or an environment variable, etc. Use it conditionally in etags-regen-mode. - Revisit the default lists of extensions that etags recognizes, keeping in mind the recent thread where we talked about this, e.g. *.a seems out of place for ASM (someone more familiar with assembly dialects please feel free to correct me). - Add new possible value t to etags-regen-file-extensions, and switch the default to it. ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-09-25 22:30 ` Dmitry Gutov @ 2024-09-26 7:43 ` Francesco Potortì 2024-09-26 12:18 ` Dmitry Gutov 2024-09-29 8:25 ` Eli Zaretskii 1 sibling, 1 reply; 57+ messages in thread From: Francesco Potortì @ 2024-09-26 7:43 UTC (permalink / raw) To: Dmitry Gutov; +Cc: 73484, Sean Whitton >- Implement the --no-fortran-fallback flag in etags. Or an environment >variable, or etc. Use it conditionally in etags-regen-mode. If your purpose is to avoid Etags creating false tags on files whose language it cannot detect, you need to disable all fallbacks, rather than just Fortran. Sorry if I got lost and missed something. -- fp ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-09-26 7:43 ` Francesco Potortì @ 2024-09-26 12:18 ` Dmitry Gutov 0 siblings, 0 replies; 57+ messages in thread From: Dmitry Gutov @ 2024-09-26 12:18 UTC (permalink / raw) To: Francesco Potortì; +Cc: 73484, Sean Whitton On 26/09/2024 10:43, Francesco Potortì wrote: >> - Implement the --no-fortran-fallback flag in etags. Or an environment >> variable, or etc. Use it conditionally in etags-regen-mode. > If your purpose is to avoid Etags creating false tags on files whose language it cannot detect, you need to disable all fallbacks, rather than just Fortran. Yeah, sorry, I guess the next fallback is C? We'll want to disable both, so the flag would be --no-fallbacks, I guess. ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-09-25 22:30 ` Dmitry Gutov 2024-09-26 7:43 ` Francesco Potortì @ 2024-09-29 8:25 ` Eli Zaretskii 2024-09-29 10:56 ` Eli Zaretskii 2024-09-30 23:19 ` Dmitry Gutov 1 sibling, 2 replies; 57+ messages in thread From: Eli Zaretskii @ 2024-09-29 8:25 UTC (permalink / raw) To: Dmitry Gutov; +Cc: 73484, spwhitton > Date: Thu, 26 Sep 2024 01:30:55 +0300 > From: Dmitry Gutov <dmitry@gutov.dev> > > > We want to replace etags-regen-file-extensions with enabling etags's > > hashbang detection support. That requires disabling its Fortran > > fallback. > > Thanks, a fuller plan would look something like this: > > - Implement the --no-fortran-fallback flag in etags. Or an environment > variable, etc. Use it conditionally in etags-regen-mode. > - Revisit the default lists of extensions that etags recognizes, keeping > in mind the recent thread where we talked about this, e.g. *.a seems out > of place for ASM (someone more familiar with assembly dialects please > feel free to correct me). > - Add new possible value t to etags-regen-file-extensions, and switch > the default to it. I understand that we need to disable the Fortran and C fallbacks to avoid false positives, but what do we want to do if the fallbacks are disabled and no suitable language parser is found using the file name? Just skip the file and do nothing? emit a warning? something else? I also don't understand why enabling the etags' shebang detection requires disabling the Fortran and C fallbacks: etags looks for shebang _before_ it falls back to Fortran and C, so what am I missing? ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-09-29 8:25 ` Eli Zaretskii @ 2024-09-29 10:56 ` Eli Zaretskii 2024-09-29 17:15 ` Francesco Potortì 2024-09-30 23:19 ` Dmitry Gutov 1 sibling, 1 reply; 57+ messages in thread From: Eli Zaretskii @ 2024-09-29 10:56 UTC (permalink / raw) To: dmitry; +Cc: 73484, spwhitton > Cc: 73484@debbugs.gnu.org, spwhitton@spwhitton.name > Date: Sun, 29 Sep 2024 11:25:45 +0300 > From: Eli Zaretskii <eliz@gnu.org> > > I understand that we need to disable the Fortran and C fallbacks to > avoid false positives, but what do we want to do if the fallbacks are > disabled and no suitable language parser is found using the file name? > Just skip the file and do nothing? emit a warning? something else? Wait a minute... we already have "--language=none", which means only do regexp processing, if any. If no regexps were specified, 'none' produces a single entry for a file, stating just its name, like this: ^L foo,0 where ^L is a literal \f character. Is the intent here to prevent even that from being written to TAGS? If not, then we don't need any new command-line option; instead, etags-regen could simply pass the "--language=none" option before each file with no extension, and be done, no? Or maybe this is "the missing link" between this and the shebang processing? ^ permalink raw reply [flat|nested] 57+ messages in thread
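To make the file-name-only entry described above concrete, here is a minimal Python sketch that splits TAGS data into per-file sections. It assumes only the layout quoted in the message (a form feed on its own line, then a "name,size" header line) plus ordinary tag lines after the header. The function name `tags_sections` and the sample bytes are made up for illustration; this is not code from etags or Emacs, and it would mis-split a tag line that itself contained a form feed followed by a newline.

```python
def tags_sections(data: bytes):
    """Return a list of (file_name, tag_lines) pairs found in TAGS data.

    Assumed layout: each file section starts with a form feed (\\x0c)
    on its own line, followed by a "name,size" header line, followed
    by zero or more tag lines.
    """
    sections = []
    # Everything before the first form feed is ignored.
    for chunk in data.split(b"\x0c\n")[1:]:
        header, _, body = chunk.partition(b"\n")
        # The size is whatever follows the last comma of the header.
        name, _, _size = header.rpartition(b",")
        lines = [line.decode() for line in body.splitlines() if line]
        sections.append((name.decode(), lines))
    return sections


# Hypothetical sample: "foo" has no tags (the --language=none case
# with no regexps), "bar.c" has a single tag line.
SAMPLE = b"\x0c\nfoo,0\n\x0c\nbar.c,14\nint main(\x7f1,0\n"

for file_name, tag_lines in tags_sections(SAMPLE):
    print(file_name, len(tag_lines))
```

A section with an empty tag list is exactly the "only the file's name" case: the file is still known to the tags table, but contributes no identifiers.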
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-09-29 10:56 ` Eli Zaretskii @ 2024-09-29 17:15 ` Francesco Potortì 0 siblings, 0 replies; 57+ messages in thread From: Francesco Potortì @ 2024-09-29 17:15 UTC (permalink / raw) To: Eli Zaretskii; +Cc: dmitry, 73484, spwhitton Eli Zaretskii: >> I understand that we need to disable the Fortran and C fallbacks to >> avoid false positives, but what do we want to do if the fallbacks are >> disabled and no suitable language parser is found using the file name? >> Just skip the file and do nothing? emit a warning? something else? Eli Zaretskii: >Wait a minute... we already have "--language=none", which means only >do regexp processing, if any. If no regexps were specified, 'none' >produces a single entry for a file, stating just its name, like this: > > ^L > foo,0 > >where ^L is a literal \f character. Is the intent here to prevent >even that from being written to TAGS? If not, then we don't need any >new command-line option; instead, etags-regen could simply pass the >"--language=none" option before each file with no extension, and be >done, no? > >Or maybe this is "the missing link" between this and the shebang >processing? If you set language=none for files whose extension is unknown to Etags, then you give up on shebang processing. If you do not set language=none and Etags does not recognise any shebang, it defaults to Fortran. If it does not find any Fortran tags, it defaults to C/C++. When default processing happens on a file which is neither Fortran nor C/C++, it usually generates no tags, but may occasionally generate fake tags. AFAIU, the problem is that there are use cases when you have to feed Etags with files that should generate no tags, yet the occasional fake tags are not tolerable. ^ permalink raw reply [flat|nested] 57+ messages in thread
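The detection order described in the message above (known extension, then shebang, then Fortran, then C/C++) can be sketched in Python as follows. This is a model of the description only, not the etags source: the two lookup tables are tiny hypothetical stand-ins, `looks_like_fortran` stands in for "the Fortran parser found some tags", and `no_fallbacks` models the flag proposed in this thread, which does not exist at the time of writing.

```python
import os
import re

# Hypothetical, heavily abbreviated tables; the real ones are larger.
EXTENSIONS = {".c": "c", ".h": "c", ".el": "lisp", ".pl": "perl", ".pm": "perl"}
INTERPRETERS = {"perl": "perl", "python": "python", "sh": "shell"}


def interpreter_from_shebang(first_line):
    """Return the base name of the program named on a '#!' line, or None."""
    match = re.match(r"#!\s*(\S+)", first_line)
    if match is None:
        return None
    return match.group(1).rsplit("/", 1)[-1]


def detect_language(file_name, first_line, looks_like_fortran=False,
                    no_fallbacks=False):
    """Model the detection order described in the thread.

    Returns a language name, or None when nothing was detected and
    fallbacks are disabled (what to do then is the open question).
    """
    ext = os.path.splitext(file_name)[1]
    if ext in EXTENSIONS:
        return EXTENSIONS[ext]
    interpreter = interpreter_from_shebang(first_line)
    if interpreter in INTERPRETERS:
        return INTERPRETERS[interpreter]
    if no_fallbacks:
        return None
    # Fallback chain: try Fortran first, then C/C++.
    return "fortran" if looks_like_fortran else "c"


print(detect_language("dgit", "#!/usr/bin/perl -w"))
print(detect_language("README", "Installation notes"))
print(detect_language("README", "Installation notes", no_fallbacks=True))
```

In this model an extensionless Perl program is recognized through its shebang, while a README ends up in the C parser unless fallbacks are switched off, which is where the occasional fake tags would come from.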
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-09-29 8:25 ` Eli Zaretskii 2024-09-29 10:56 ` Eli Zaretskii @ 2024-09-30 23:19 ` Dmitry Gutov 2024-10-01 15:00 ` Eli Zaretskii 2024-10-02 11:28 ` Eli Zaretskii 1 sibling, 2 replies; 57+ messages in thread From: Dmitry Gutov @ 2024-09-30 23:19 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 73484, spwhitton On 29/09/2024 11:25, Eli Zaretskii wrote: > I understand that we need to disable the Fortran and C fallbacks to > avoid false positives, but what do we want to do if the fallbacks are > disabled and no suitable language parser is found using the file name? > Just skip the file and do nothing? emit a warning? something else? Just do nothing. We'd really want to delegate language detection to etags rather than doing it inside Elisp - the latter is slower and ultimately more limited. But for that etags needs to have a reliable detection logic, one without too many false positives (and IME false positives here are worse than false negatives, because scanning too much can often mean both wrong tags and long scans, and a completion table that gets too large because of bogus tags). For shebangs in particular, however, see Francesco's very good explanation. And detecting shebangs in Lisp would not be practical -- too slow. ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-09-30 23:19 ` Dmitry Gutov @ 2024-10-01 15:00 ` Eli Zaretskii 2024-10-01 22:01 ` Dmitry Gutov 2024-10-02 11:28 ` Eli Zaretskii 1 sibling, 1 reply; 57+ messages in thread From: Eli Zaretskii @ 2024-10-01 15:00 UTC (permalink / raw) To: Dmitry Gutov; +Cc: 73484, spwhitton > Date: Tue, 1 Oct 2024 02:19:17 +0300 > Cc: spwhitton@spwhitton.name, 73484@debbugs.gnu.org > From: Dmitry Gutov <dmitry@gutov.dev> > > On 29/09/2024 11:25, Eli Zaretskii wrote: > > I understand that we need to disable the Fortran and C fallbacks to > > avoid false positives, but what do we want to do if the fallbacks are > > disabled and no suitable language parser is found using the file name? > > Just skip the file and do nothing? emit a warning? something else? > > Just do nothing. We'd really want to delegate language detection to > etags rather than doing it inside Elisp - the latter is slower and > ultimately more limited. But for that etags needs to have a reliable > detection logic, one without too many false positives (and IME false > positives here are worse than false negatives, because scanning too much > can often mean both wrong tags and long scans, and a completion table > that gets too large because of bogus tags). I'm not sure I understand: if you worry about performance, then disabling fallbacks will not eliminate all of the cases where etags scans the entire file or at least some of its portions. Can you explain to me again what exactly is the problem with the fallbacks in the context of etags-regen? ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-01 15:00 ` Eli Zaretskii @ 2024-10-01 22:01 ` Dmitry Gutov 0 siblings, 0 replies; 57+ messages in thread From: Dmitry Gutov @ 2024-10-01 22:01 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 73484, spwhitton On 01/10/2024 18:00, Eli Zaretskii wrote: >> Just do nothing. We'd really want to delegate language detection to >> etags rather than doing it inside Elisp - the latter is slower and >> ultimately more limited. But for that etags needs to have a reliable >> detection logic, one without too many false positives (and IME false >> positives here are worse than false negatives, because scanning too much >> can often mean both wrong tags and long scans, and a completion table >> that gets too large because of bogus tags). > > I'm not sure I understand: if you worry about performance, then > disabling fallbacks will not eliminate all of the cases where etags > scans the entire file or at least some of its portions. etags's scanning should still be faster than doing it in Lisp, or the subsequent calls to tags-completion-table or etags--xref-find-definitions. Further, the last function would repeatedly search through the tags file, so it's important to keep tags' scanner accuracy high: without incorrectly recognized files, and without wrong index entries. > Can you explain to me again what exactly is the problem with the > fallbacks in the context of etags-regen? We've talked about this before, here's my previous reply: https://lists.gnu.org/archive/html/emacs-devel/2018-01/msg00387.html I don't have the same experiment at hand, but the past me seems to be saying that scanning files incorrectly can also make the whole scan take longer, considerably. And make the resulting file bigger, which makes its parsing from Emacs slower as well, and so on. ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-09-30 23:19 ` Dmitry Gutov 2024-10-01 15:00 ` Eli Zaretskii @ 2024-10-02 11:28 ` Eli Zaretskii 2024-10-02 18:00 ` Dmitry Gutov 1 sibling, 1 reply; 57+ messages in thread From: Eli Zaretskii @ 2024-10-02 11:28 UTC (permalink / raw) To: Dmitry Gutov; +Cc: 73484, spwhitton > Date: Tue, 1 Oct 2024 02:19:17 +0300 > Cc: spwhitton@spwhitton.name, 73484@debbugs.gnu.org > From: Dmitry Gutov <dmitry@gutov.dev> > > On 29/09/2024 11:25, Eli Zaretskii wrote: > > I understand that we need to disable the Fortran and C fallbacks to > > avoid false positives, but what do we want to do if the fallbacks are > > disabled and no suitable language parser is found using the file name? > > Just skip the file and do nothing? emit a warning? something else? > > Just do nothing. Doing nothing means the file's name will not appear at all in TAGS. I don't think that's TRT, since every file submitted to etags should be mentioned in TAGS for the benefit of tags-search and similar features. So I currently tend to modify etags such that if no language was detected by the file's name/extension, and this new no-fallbacks option was specified, etags will behave as if given --language=none (which also means that if any regexps were specified, they will be processed correctly for such files). If no regexps were specified or none matched, this means only the file's name will appear in TAGS, and that's all. If the above is not a good plan for some reason, feel free to holler. ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-02 11:28 ` Eli Zaretskii @ 2024-10-02 18:00 ` Dmitry Gutov 2024-10-02 18:56 ` Eli Zaretskii 0 siblings, 1 reply; 57+ messages in thread From: Dmitry Gutov @ 2024-10-02 18:00 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 73484, spwhitton On 02/10/2024 14:28, Eli Zaretskii wrote: >> Date: Tue, 1 Oct 2024 02:19:17 +0300 >> Cc: spwhitton@spwhitton.name, 73484@debbugs.gnu.org >> From: Dmitry Gutov <dmitry@gutov.dev> >> >> On 29/09/2024 11:25, Eli Zaretskii wrote: >>> I understand that we need to disable the Fortran and C fallbacks to >>> avoid false positives, but what do we want to do if the fallbacks are >>> disabled and no suitable language parser is found using the file name? >>> Just skip the file and do nothing? emit a warning? something else? >> >> Just do nothing. > > Doing nothing means the file's name will not appear at all in TAGS. I > don't think that's TRT, since every file submitted to etags should be > mentioned in TAGS for the benefit of tags-search and similar features. Hmm, maybe another flag, then? Including many unrelated files would just bloat the tags file for little reason. And unlike manual generation, it's not like the user asked for all of them to be included. > So I currently tend to modify etags such that if no language was > detected by the file's name/extension, and this new no-fallbacks > option was specified, etags will behave as if given --language=none > (which also means that if any regexps were specified, they will be > processed correctly for such files). Any regexps for "all" files, right? For our etags-regen configuration in the Emacs repo, for example, we add 2 regexps, but for specific file types only. If regexps are configured for 'none', and they match something, certainly the file should be in the index. > If no regexps were specified or > none matched, this means only the file's name will appear in TAGS, and > that's all. 
...but if there are no matches I'd prefer the files to be skipped. The files detected as type 'none' anyway. ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-02 18:00 ` Dmitry Gutov @ 2024-10-02 18:56 ` Eli Zaretskii 2024-10-02 22:03 ` Dmitry Gutov 0 siblings, 1 reply; 57+ messages in thread From: Eli Zaretskii @ 2024-10-02 18:56 UTC (permalink / raw) To: Dmitry Gutov; +Cc: 73484, spwhitton > Date: Wed, 2 Oct 2024 21:00:58 +0300 > Cc: spwhitton@spwhitton.name, 73484@debbugs.gnu.org > From: Dmitry Gutov <dmitry@gutov.dev> > > On 02/10/2024 14:28, Eli Zaretskii wrote: > >> Date: Tue, 1 Oct 2024 02:19:17 +0300 > >> Cc: spwhitton@spwhitton.name, 73484@debbugs.gnu.org > >> From: Dmitry Gutov <dmitry@gutov.dev> > >> > >> Just do nothing. > > > > Doing nothing means the file's name will not appear at all in TAGS. I > > don't think that's TRT, since every file submitted to etags should be > > mentioned in TAGS for the benefit of tags-search and similar features. > > Hmm, maybe another flag, then? > > Including many unrelated files would just bloat the tags file for little > reason. And unlike manual generation, it's not like the user asked for > all of them to be included. What do we tell to users of tags-search and its ilk? > > So I currently tend to modify etags such that if no language was > > detected by the file's name/extension, and this new no-fallbacks > > option was specified, etags will behave as if given --language=none > > (which also means that if any regexps were specified, they will be > > processed correctly for such files). > > Any regexps for "all" files, right? The rules for regexps don't change: each regexp applies to the files that follow it on the command line. > ...but if there are no matches I'd prefer the files to be skipped. The > files detected as type 'none' anyway. I don't like this, and I think this is misguided. I also don't see any special problem with having lines that name files in TAGS, it isn't like the size of TAGS will grow significantly or its processing will be significantly slower. 
IOW, this sounds like a clear case of premature optimization. ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-02 18:56 ` Eli Zaretskii @ 2024-10-02 22:03 ` Dmitry Gutov 2024-10-03 6:27 ` Eli Zaretskii 0 siblings, 1 reply; 57+ messages in thread From: Dmitry Gutov @ 2024-10-02 22:03 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 73484, spwhitton On 02/10/2024 21:56, Eli Zaretskii wrote: >> Date: Wed, 2 Oct 2024 21:00:58 +0300 >> Cc: spwhitton@spwhitton.name, 73484@debbugs.gnu.org >> From: Dmitry Gutov <dmitry@gutov.dev> >> >> On 02/10/2024 14:28, Eli Zaretskii wrote: >>>> Date: Tue, 1 Oct 2024 02:19:17 +0300 >>>> Cc: spwhitton@spwhitton.name, 73484@debbugs.gnu.org >>>> From: Dmitry Gutov <dmitry@gutov.dev> >>>> >>>> Just do nothing. >>> >>> Doing nothing means the file's name will not appear at all in TAGS. I >>> don't think that's TRT, since every file submitted to etags should be >>> mentioned in TAGS for the benefit of tags-search and similar features. >> >> Hmm, maybe another flag, then? >> >> Including many unrelated files would just bloat the tags file for little >> reason. And unlike manual generation, it's not like the user asked for >> all of them to be included. > > What do we tell to users of tags-search and its ilk? We can consider how most of such users' indexes look. See below. >>> So I currently tend to modify etags such that if no language was >>> detected by the file's name/extension, and this new no-fallbacks >>> option was specified, etags will behave as if given --language=none >>> (which also means that if any regexps were specified, they will be >>> processed correctly for such files). >> >> Any regexps for "all" files, right? > > The rules for regexps don't change: each regexp applies to the files > that follow it on the command line. This seems okay. >> ...but if there are no matches I'd prefer the files to be skipped. The >> files detected as type 'none' anyway. > > I don't like this, and I think this is misguided. 
I also don't see > any special problem with having lines that name files in TAGS, it > isn't like the size of TAGS will grow significantly or its processing > will be significantly slower. IOW, this sounds like a clear case of > premature optimization. I could do some experiments, if you post preliminary support of that flag, with "empty" files in TAGS and without. But here's how I'm looking at it: Imagine a straightforward C project, one that has .c files, .h, maybe .y, and also a bunch of docs, build artefacts (some of them checked in), and maybe other data files as well. Also README, ChangeLog, Makefile, config.bat, some .txt files, many other files without extensions, etc. Previously, when building a TAGS file manually, a developer in such a project specified a list of file globs by hand. One that would be limited to .[ch] files, and maybe .y as well, but not all the files in the directory. To use Emacs itself as an example, the 'tags' target in our own Makefile only includes .[hc], .m, .cc, .el and (surprising to me) .texi files. But not any of the others. The number of such files is ~3K, if I'm counting correctly. The total number of all non-ignored files in our repo is ~5K. That's 2K more files that would be present in the 'M-x tags-search' or 'M-x list-tags' outputs, if an Emacs developer simply switches to using etags-regen-mode, and etags-regen-mode drops the file extensions whitelist, and etags keeps all passed files' names in its output. ^ permalink raw reply [flat|nested] 57+ messages in thread
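The three outcomes being compared in this exchange can be sketched in Python on a toy file list. The extension whitelist is the one named above for the Makefile 'tags' target; the policy names, the function `listed_files`, and the sample project are all made up for illustration, and the per-file language and tag counts are given as input rather than computed.

```python
import os

# Extensions named in the message for the Makefile 'tags' target.
WHITELIST = {".c", ".h", ".m", ".cc", ".el", ".texi"}


def listed_files(scan_results, policy):
    """Return the sorted file names that would appear in TAGS.

    scan_results maps file name -> (detected language or None,
    number of tags found). Policies (hypothetical names):
      "whitelist"       only files with a whitelisted extension
      "keep-all-names"  every file passed in, tags or not
      "skip-empty"      drop files with no detected language and no tags
    """
    if policy == "whitelist":
        return sorted(name for name in scan_results
                      if os.path.splitext(name)[1] in WHITELIST)
    if policy == "keep-all-names":
        return sorted(scan_results)
    if policy == "skip-empty":
        return sorted(name for name, (language, count) in scan_results.items()
                      if language is not None or count > 0)
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical project contents.
PROJECT = {
    "src/emacs.c": ("c", 120),
    "lisp/simple.el": ("lisp", 900),
    "bin/tool": ("perl", 40),   # extensionless, recognized via shebang
    "README": (None, 0),
    "ChangeLog": (None, 0),
}

for policy in ("whitelist", "keep-all-names", "skip-empty"):
    print(policy, listed_files(PROJECT, policy))
```

The whitelist misses the extensionless program, keeping all names adds README and ChangeLog to what tags-search and list-tags would see, and skipping unrecognized files with no tags keeps the program while leaving those two out.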
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-02 22:03 ` Dmitry Gutov @ 2024-10-03 6:27 ` Eli Zaretskii 2024-10-04 1:25 ` Dmitry Gutov 0 siblings, 1 reply; 57+ messages in thread From: Eli Zaretskii @ 2024-10-03 6:27 UTC (permalink / raw) To: Dmitry Gutov; +Cc: 73484, spwhitton > Date: Thu, 3 Oct 2024 01:03:14 +0300 > Cc: spwhitton@spwhitton.name, 73484@debbugs.gnu.org > From: Dmitry Gutov <dmitry@gutov.dev> > > >> ...but if there are no matches I'd prefer the files to be skipped. The > >> files detected as type 'none' anyway. > > > > I don't like this, and I think this is misguided. I also don't see > > any special problem with having lines that name files in TAGS, it > > isn't like the size of TAGS will grow significantly or its processing > > will be significantly slower. IOW, this sounds like a clear case of > > premature optimization. > > I could do some experiments, if you post preliminary support of that > flag, with "empty" files in TAGS and without. OK. > But here's how I'm looking at it: > > Imagine a straightforward C project, one that has .c files, .h, maybe > .y, and also a bunch of docs, build artefacts (some of them checked in), > and maybe other data files as well. Also README, ChangeLog, Makefile, > config.bat, some .txt files, many other files without extensions, etc. > > Previously, when building a TAGS file manually, a developer in such a > project specified a list of file globs by hand. One that would be > limited to .[ch] files, and maybe .y as well, but not all the files in > the directory. If they definitely do NOT want the other files to be present in TAGS, they can keep using those globs. Nothing will change in that case. > To use Emacs itself as an example, the 'tags' target in our own Makefile > only includes .[hc], .m, .cc, .el and (surprising to me) .texi files. > But not any of the others. The number of such files is ~3K, if I'm > counting correctly. 
> > The total number of all non-ignored files in our repo is ~5K. That's 2K > more files that would be present in the 'M-x tags-search' or 'M-x > list-tags' outputs, if an Emacs developer simply switches to using > etags-regen-mode, and etags-regen-mode drops the file extensions > whitelist, and etags keeps all passed files' names in its output. OTOH, if a file with a known extension has no taggable symbols, you still get its file name in TAGS. So omitting files whose language we could not recognize would be an incompatible change in behavior. The fact that in the scenario you describe above 2K more files will appear in tags-search is, from my POV, an argument _for_ including them, not against: we have no reason to assume that users don't want to search those files for some regexp, because regexps specified in tags-search don't necessarily have anything to do with the identifiers we tag. A valid case in point is to look up all references to some file when the file is deleted, or references to some version when the version is updated: we definitely want files like README and INSTALL to be included in the search. ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-03 6:27 ` Eli Zaretskii @ 2024-10-04 1:25 ` Dmitry Gutov 2024-10-04 6:45 ` Eli Zaretskii 0 siblings, 1 reply; 57+ messages in thread From: Dmitry Gutov @ 2024-10-04 1:25 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 73484, spwhitton On 03/10/2024 09:27, Eli Zaretskii wrote: >> But here's how I'm looking at it: >> >> Imagine a straightforward C project, one that has .c files, .h, maybe >> .y, and also a bunch of docs, build artefacts (some of them checked in), >> and maybe other data files as well. Also README, ChangeLog, Makefile, >> config.bat, some .txt files, many other files without extensions, etc. >> >> Previously, when building a TAGS file manually, a developer in such a >> project specified a list of file globs by hand. One that would be >> limited to .[ch] files, and maybe .y as well, but not all the files in >> the directory. > > If they definitely do NOT want the other files to be present in TAGS, > they can keep using those globs. Nothing will change in that case. a) They would have to produce the same list of file extensions that we are using now, and they will need to find out which variable to customize, to set to that list. b) They won't get the shebang detection capability, unless we add a new option where they will have to enumerate all their shebang-enabled file names as well. So it seems like they would have to choose between the one and the other, with the end behavior that I'm describing not being supported even by any combination of user options. >> To use Emacs itself as an example, the 'tags' target in our own Makefile >> only includes .[hc], .m, .cc, .el and (surprising to me) .texi files. >> But not any of the others. The number of such files is ~3K, if I'm >> counting correctly. >> >> The total number of all non-ignored files in our repo is ~5K. 
That's 2K >> more files that would be present in the 'M-x tags-search' or 'M-x >> list-tags' outputs, if an Emacs developer simply switches to using >> etags-regen-mode, and etags-regen-mode drops the file extensions >> whitelist, and etags keeps all passed files' names in its output. > > OTOH, if a file with a known extension has no taggable symbols, you > still get its file name in TAGS. So omitting files whose language we > could not recognize would be an incompatible change in behavior. Incompatible change in etags' behavior, but likely a more compatible change in the behavior of the default Emacs. For etags, though, we could add an opt-in flag. > The fact that in the scenario you describe above 2K more files will > appear in tags-search is, from my POV, an argument _for_ including > them, not against: we have no reason to assume that users don't want > to search those files for some regexp, because regexps specified in > tags-search don't necessarily have anything to do with the identifiers > we tag. A valid case in point is to look up all references to some > file when the file is deleted, or references to some version when the > version is updated: we definitely want files like README and INSTALL > to be included in the search. I would hope that project-find-regexp works well enough for that. Or 'M-x project-search' for the fans of the classic interface. README and INSTALL are not currently included in TAGS. You seem to be making a case that all files in our dev repository should be included, but for some reason the current build rules are very different? ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-04 1:25 ` Dmitry Gutov @ 2024-10-04 6:45 ` Eli Zaretskii 2024-10-04 23:01 ` Dmitry Gutov 0 siblings, 1 reply; 57+ messages in thread From: Eli Zaretskii @ 2024-10-04 6:45 UTC (permalink / raw) To: Dmitry Gutov; +Cc: 73484, spwhitton > Date: Fri, 4 Oct 2024 04:25:15 +0300 > Cc: spwhitton@spwhitton.name, 73484@debbugs.gnu.org > From: Dmitry Gutov <dmitry@gutov.dev> > > >> Previously, when building a TAGS file manually, a developer in such a > >> project specified a list of file globs by hand. One that would be > >> limited to .[ch] files, and maybe .y as well, but not all the files in > >> the directory. > > > > If they definitely do NOT want the other files to be present in TAGS, > > they can keep using those globs. Nothing will change in that case. > > a) They would have to produce the same list of file extensions that we > are using now, and they will need to find out which variable to > customize, to set to that list. > > b) They won't get the shebang detection capability, unless we add a new > option where they will have to enumerate all their shebang-enabled file > names as well. > > So it seems like they would have to choose between the one and the > other, with the end behavior that I'm describing not being supported > even any combination of user options. They will need to choose only if they want improvements. To have the same behavior, with the same downsides as before, they need not change anything. IOW, the change I propose does no harm to those projects. And if shebang detection is desired, the choice is quite obvious, if you ask me: submit all the files. The downside is making TAGS larger and having more file names in it, which I think is a very small downside, if at all, compared to advantages. So once again, I think this is a premature optimization. The downside of a larger TAGS will only have tangible effects in huge trees. 
> > The fact that in the scenario you describe above 2K more files will > > appear in tags-search is, from my POV, an argument _for_ including > > them, not against: we have no reason to assume that users don't want > > to search those files for some regexp, because regexps specified in > > tags-search don't necessarily have anything to do with the identifiers > > we tag. A valid case in point is to look up all references to some > > file when the file is deleted, or references to some version when the > > version is updated: we definitely want files like README and INSTALL > > to be included in the search. > > I would hope that project-find-regexp works well enough for that. Or > 'M-x project-search' for the fans of the classic interface. Maybe, but we do still want to keep tags-search, so the existence of other commands doesn't invalidate my argument above. > README and INSTALL are not currently included in TAGS. You seem to be > making a case that all files in our dev repository should be included, > but for some reason the current build rules are very different? I'm not talking specifically about Emacs, because README and INSTALL are typically present in many packages. In our case, we don't pass them to etags for historical reasons (we have admin/*.el stuff to help us modify the version string in all the files that reference it, for example), but it is quite plausible that if we had this option back then, we could have used etags to help. For example, one downside of what we have in admin.el is that the list of files to edit when we bump the version is maintained by hand, which is error-prone: we just had an instance of this when exec/configure.ac was added and we forgot to update admin.el accordingly. Using etags would have allowed us to avoid such problems. 
If we want a separate optional behavior that prevents files with no tags from being mentioned in TAGS, I'd argue that such an option should affect all the scanned files, not just those whose language could not be determined from their names. ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-04 6:45 ` Eli Zaretskii @ 2024-10-04 23:01 ` Dmitry Gutov 2024-10-05 7:02 ` Eli Zaretskii 0 siblings, 1 reply; 57+ messages in thread From: Dmitry Gutov @ 2024-10-04 23:01 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 73484, spwhitton On 04/10/2024 09:45, Eli Zaretskii wrote: > They will need to choose only if they want improvements. To have the > same behavior, with the same downsides as before, they need not change > anything. IOW, the change I propose does no harm to those projects. We did talk about changing the default of etags-regen-file-extensions to t. I suppose it's debatable. > And if shebang detection is desired, the choice is quite obvious, if > you ask me: submit all the files. The downside is making TAGS larger > and having more file names in it, which I think is a very small > downside, if at all, compared to advantages. > > So once again, I think this is a premature optimization. The downside > of a larger TAGS will only have tangible effects in huge trees. FWIW, TAGS for gecko-dev (Mozilla's repository which I have here for testing) takes ~30 seconds to generate and ~400ms to find a definition for the set of files to scan that I currently have set up. Both timings seem quite impactful for user experience. I imagine some Emacs users work at Mozilla, though that's only a guess. If someone were to provide a patch for etags with new functionality (disabling fallbacks, at least), I could benchmark and come back with numbers. And if experimental flags are available, with numbers for those as well. >>> The fact that in the scenario you describe above 2K more files will >>> appear in tags-search is, from my POV, an argument _for_ including >>> them, not against: we have no reason to assume that users don't want >>> to search those files for some regexp, because regexps specified in >>> tags-search don't necessarily have anything to do with the identifiers >>> we tag. 
A valid case in point is to look up all references to some >>> file when the file is deleted, or references to some version when the >>> version is updated: we definitely want files like README and INSTALL >>> to be included in the search. >> >> I would hope that project-find-regexp works well enough for that. Or >> 'M-x project-search' for the fans of the classic interface. > > Maybe, but we do still want to keep tags-search, so the existence of > other commands don't invalidate my argument above. In my mind, tags-search is for files that are code-related. Actual users might differ, though. >> README and INSTALL are not currently included in TAGS. You seem to be >> making a case that all files in our dev repository should be included, >> but for some reason the current build rules are very different? > > I'm not talking specifically about Emacs, because README and INSTALL > are typically present in many packages. In our case, we don't pass > them to etags for historical reasons (we have admin/*.el stuff to help > us modify the version string in all the files that reference it, for > example), but it is quite plausible that if we had this option back > then, we could have used etags to help. For example, one downside of > what we have in admin.el is that the list of files to edit when we > bump the version is maintained by hand, which is error-prone: we just > had an instance of this when exec/configure.ac was added and we forgot > to update admin.el according. Using etags would have allowed us to > avoid such problems. Some other aspects of having more false positives would come up as a result, probably. But it might be worth testing. > If we want a separate optional behavior that prevents files with no > tags from being mentioned in TAGS, I'd argue that such an option > should affect all the scanned files, not just those whose language > could not be determined from their names. 
I don't have a strong opinion here, just that it would depart from my mental model mentioned above, of having all code-related files listed. For example by missing some newly added .c file where no function definitions have been added yet; 'M-x tags-search' would skip it. If that makes sense to you, okay. ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-04 23:01 ` Dmitry Gutov @ 2024-10-05 7:02 ` Eli Zaretskii 2024-10-05 14:29 ` Dmitry Gutov 0 siblings, 1 reply; 57+ messages in thread From: Eli Zaretskii @ 2024-10-05 7:02 UTC (permalink / raw) To: Dmitry Gutov; +Cc: 73484, spwhitton > Date: Sat, 5 Oct 2024 02:01:14 +0300 > Cc: spwhitton@spwhitton.name, 73484@debbugs.gnu.org > From: Dmitry Gutov <dmitry@gutov.dev> > > On 04/10/2024 09:45, Eli Zaretskii wrote: > > > So once again, I think this is a premature optimization. The downside > > of a larger TAGS will only have tangible effects in huge trees. > > FWIW, TAGS for gecko-dev (Mozilla's repository which I have here for > testing) takes ~30 seconds to generate and ~400ms to find a definition > for the set of files to scan that I currently have set up. Both timings > seem quite impactful for user experience. I imagine some Emacs users > work at Mozilla, though that's only a guess. Like I said: in huge trees this might matter. But in any case, I don't understand the significance of the timings you show: we are discussing the increase in processing time which will be caused by adding files with no tags, which produce a single line in TAGS. Therefore the interesting figures are time differences in processing some commands with and without those additional lines. Are the times you show above related to any of that? > If someone were to provide a patch for etags with new functionality > (disabling fallbacks, at least), I could benchmark and come back with > numbers. And if experimental flags are available, with numbers for those > as well. How hard is it to add to a live TAGS file fake lines which look like this: ^L foo,0 (with random strings instead of "foo"), and then time some TAGS-using commands with and without these additions? > >> I would hope that project-find-regexp works well enough for that. Or > >> 'M-x project-search' for the fans of the classic interface. 
> > > > Maybe, but we do still want to keep tags-search, so the existence of > > other commands don't invalidate my argument above. > > In my mind, tags-search is for files that are code-related. Actual users > might differ, though. The fact that we pass *.texi files to etags should already tell you that this mental model is incomplete. The fact that etags supports HTML, TeX, and PostScript files (in addition to Texinfo) is further evidence to that effect. And that's even before we consider the regexp feature, which could be used to tag anything in any kind of file. I agree that these use cases are relatively rare, but that doesn't make them invalid or even unimportant. > > If we want a separate optional behavior that prevents files with no > > tags from being mentioned in TAGS, I'd argue that such an option > > should affect all the scanned files, not just those whose language > > could not be determined from their names. > > I don't have a strong opinion here, just that it would depart from my > mental model mentioned above, of having all code-related files listed. > For example by missing some newly added .c file where no function > definitions have been added yet; 'M-x tags-search' would skip it. This matches my impression that this option (which skips files with no tags) should rarely if ever be used. ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-05 7:02 ` Eli Zaretskii @ 2024-10-05 14:29 ` Dmitry Gutov 2024-10-05 15:27 ` Eli Zaretskii 2024-10-05 16:38 ` Francesco Potortì 0 siblings, 2 replies; 57+ messages in thread From: Dmitry Gutov @ 2024-10-05 14:29 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 73484, spwhitton On 05/10/2024 10:02, Eli Zaretskii wrote: > Like I said: in huge trees this might matter. We do want to support them, right? Or anyway make the project size cutoff (where it remains practical to use Emacs) as high as feasible. > But in any case, I don't understand the significance of the timings > you show: we are discussing the increase in processing time which will > be caused by adding files with no tags, which produce a single line in > TAGS. If there are an order of magnitude more "other" files, and an average source file contains only several definitions, this can make a difference. > Therefore the interesting figures are time differences in > processing some commands with and without those additional lines. Are > the times you show above related to any of that? The time to generate is relevant. The time to visit the tags table gets non-trivial too, and it can increase. >> If someone were to provide a patch for etags with new functionality >> (disabling fallbacks, at least), I could benchmark and come back with >> numbers. And if experimental flags are available, with numbers for those >> as well. > > How hard is it to add to a live TAGS file fake lines which look like > this: > > ^L > foo,0 > > (with random strings instead of "foo"), and then time some TAGS-using > commands with and without these additions? Okay, done that. 'M-.' takes more or less the same. The file size of TAGS increased from 66 MB to 85 MiB. 
Won't measure time to generate now - because the current method and the "real" one will be different, but note that it's more relevant with etags-regen-mode because the scan is performed lazily: every time the user does the first search in a new project. ^ permalink raw reply [flat|nested] 57+ messages in thread
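[Editorial sketch] The padding experiment discussed in this exchange is easy to script. What follows is only a minimal sketch under stated assumptions, not what was actually run in the thread: it appends tag-less file sections of the shape quoted above (a form-feed line followed by `name,0`) to an already open TAGS stream. The function name `append_empty_sections` and the `fake/file-N.dat` names are made up for illustration.

```c
#include <stdio.h>

/* Append COUNT tag-less file sections to OUT.  Each section is a line
   holding a form feed (^L) followed by a "NAME,0" line, where 0 is the
   size in bytes of the (empty) tag data for that file.  The file names
   are hypothetical placeholders.  Returns 0 on success, -1 on a write
   error.  */
int
append_empty_sections (FILE *out, int count)
{
  for (int i = 0; i < count; i++)
    if (fprintf (out, "\f\nfake/file-%d.dat,0\n", i) < 0)
      return -1;
  return 0;
}
```

Opening a copy of TAGS in append mode and calling this with the number of extra files gives a table that can be compared against the untouched one when timing 'M-.' or 'M-x tags-search'.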
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-05 14:29 ` Dmitry Gutov @ 2024-10-05 15:27 ` Eli Zaretskii 2024-10-05 20:27 ` Dmitry Gutov 2024-10-05 16:38 ` Francesco Potortì 1 sibling, 1 reply; 57+ messages in thread From: Eli Zaretskii @ 2024-10-05 15:27 UTC (permalink / raw) To: Dmitry Gutov; +Cc: 73484, spwhitton > Date: Sat, 5 Oct 2024 17:29:44 +0300 > Cc: spwhitton@spwhitton.name, 73484@debbugs.gnu.org > From: Dmitry Gutov <dmitry@gutov.dev> > > On 05/10/2024 10:02, Eli Zaretskii wrote: > > > How hard is it to add to a live TAGS file fake lines which look like > > this: > > > > ^L > > foo,0 > > > > (with random strings instead of "foo"), and then time some TAGS-using > > commands with and without these additions? > > Okay, done that. > > 'M-.' takes more or less the same. > > The file size of TAGS increased from 66 MB to 85 MiB. > > Won't measure time to generate now - because the current method and the > "real" one will be different, but note that it's more relevant with > etags-regen-mode because the scan is performed lazily: every time the > user does the first search in a new project. Thanks. What about the time it takes tags-search to show the prompt: is that affected in any way? ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-05 15:27 ` Eli Zaretskii @ 2024-10-05 20:27 ` Dmitry Gutov 0 siblings, 0 replies; 57+ messages in thread From: Dmitry Gutov @ 2024-10-05 20:27 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 73484, spwhitton On 05/10/2024 18:27, Eli Zaretskii wrote: > Thanks. What about the time it takes tags-search to show the prompt: > is that affected in any way? No, that's still instant, just like project-find-regexp. All the work happens after typing the input. ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-05 14:29 ` Dmitry Gutov 2024-10-05 15:27 ` Eli Zaretskii @ 2024-10-05 16:38 ` Francesco Potortì 2024-10-05 17:12 ` Eli Zaretskii 2024-10-06 0:56 ` Dmitry Gutov 1 sibling, 2 replies; 57+ messages in thread From: Francesco Potortì @ 2024-10-05 16:38 UTC (permalink / raw) To: Dmitry Gutov; +Cc: Eli Zaretskii, 73484, spwhitton Eli Zaretskii: >> How hard is it to add to a live TAGS file fake lines which look like >> this: >> >> ^L >> foo,0 >> >> (with random strings instead of "foo"), and then time some TAGS-using >> commands with and without these additions? Dmitry Gutov: >Okay, done that. > >'M-.' takes more or less the same. > >The file size of TAGS increased from 66 MB to 85 MiB. > >Won't measure time to generate now - because the current method and the >"real" one will be different, but note that it's more relevant with >etags-regen-mode because the scan is performed lazily: every time the >user does the first search in a new project. Removing the Fortran and C/C++ fallbacks just for testing requires recompiling etags.c after removing the code beginning with /* Else try Fortran or C. */. This would avoid parsing the file (except for detecting the sharp-bang) and would leave the file name in the tags file, without tags. ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-05 16:38 ` Francesco Potortì @ 2024-10-05 17:12 ` Eli Zaretskii 2024-10-06 0:56 ` Dmitry Gutov 1 sibling, 0 replies; 57+ messages in thread From: Eli Zaretskii @ 2024-10-05 17:12 UTC (permalink / raw) To: Francesco Potortì; +Cc: dmitry, 73484, spwhitton > From: Francesco Potortì <pot@gnu.org> > Date: Sat, 05 Oct 2024 18:38:22 +0200 > Cc: spwhitton@spwhitton.name, > 73484@debbugs.gnu.org, > Eli Zaretskii <eliz@gnu.org> > > Eli Zaretskii: > >> How hard is it to add to a live TAGS file fake lines which look like > >> this: > >> > >> ^L > >> foo,0 > >> > >> (with random strings instead of "foo"), and then time some TAGS-using > >> commands with and without these additions? > > Dmitry Gutov: > >Okay, done that. > > > >'M-.' takes more or less the same. > > > >The file size of TAGS increased from 66 MB to 85 MiB. > > > >Won't measure time to generate now - because the current method and the > >"real" one will be different, but note that it's more relevant with > >etags-regen-mode because the scan is performed lazily: every time the > >user does the first search in a new project. > > Removing the Fortran and C/C++ fallbacks just for testing requires recompiling etags.c after removing the code beginning with /* Else try Fortran or C. */. This would avoid parsing the file (except for detecting the sharp-bang) and would leave the file name in the tags file, without tags. We are not talking about disabling the fallbacks, we are talking about something else: the impact of having in TAGS names of files where no tags were found (e.g., because their language was not recognized and the fallbacks are disabled). ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-05 16:38 ` Francesco Potortì 2024-10-05 17:12 ` Eli Zaretskii @ 2024-10-06 0:56 ` Dmitry Gutov 2024-10-06 6:22 ` Eli Zaretskii 1 sibling, 1 reply; 57+ messages in thread From: Dmitry Gutov @ 2024-10-06 0:56 UTC (permalink / raw) To: Francesco Potortì; +Cc: Eli Zaretskii, 73484, spwhitton On 05/10/2024 19:38, Francesco Potortì wrote: > Eli Zaretskii: >>> How hard is it to add to a live TAGS file fake lines which look like >>> this: >>> >>> ^L >>> foo,0 >>> >>> (with random strings instead of "foo"), and then time some TAGS-using >>> commands with and without these additions? > > Dmitry Gutov: >> Okay, done that. >> >> 'M-.' takes more or less the same. >> >> The file size of TAGS increased from 66 MB to 85 MiB. >> >> Won't measure time to generate now - because the current method and the >> "real" one will be different, but note that it's more relevant with >> etags-regen-mode because the scan is performed lazily: every time the >> user does the first search in a new project. > > Removing the Fortran and C/C++ fallbacks just for testing requires recompiling etags.c after removing the code beginning with /* Else try Fortran or C. */. This would avoid parsing the file (except for detecting the sharp-bang) and would leave the file name in the tags file, without tags. Thank you, this is useful for another kind of test (parsing the same project with the list of all enabled file types). The below was also needed to avoid a segfault:

diff --git a/lib-src/etags.c b/lib-src/etags.c
index 7f652790261..08c6037b9d7 100644
--- a/lib-src/etags.c
+++ b/lib-src/etags.c
@@ -1830,6 +1830,7 @@ process_file (FILE *fh, char *fn, language *lang)
      curfdp. */
   if (!CTAGS
       && curfdp->usecharno /* no #line directives in this file */
+      && curfdp->lang
       && !curfdp->lang->metasource)
     {
       node *np, *prev;

Then, the total time increased a lot: from 30 s to 30-40 min. 
This cuts it down in half, if I measured correctly:

diff --git a/lib-src/etags.c b/lib-src/etags.c
index 7f652790261..5c2be2b9574 100644
--- a/lib-src/etags.c
+++ b/lib-src/etags.c
@@ -1902,21 +1903,21 @@ find_entries (FILE *inf)
 
   /* Else look for sharp-bang as the first two characters. */
   if (parser == NULL
+      && getc (inf) == '#'
+      && getc (inf) == '!'
       && readline_internal (&lb, inf, infilename, false) > 0
-      && lb.len >= 2
-      && lb.buffer[0] == '#'
-      && lb.buffer[1] == '!')
+      )
     {
       char *lp;
 
       /* Set lp to point at the first char after the last slash in the
          line or, if no slashes, at the first nonblank. Then set cp to
          the first successive blank and terminate the string. */
-      lp = strrchr (lb.buffer+2, '/');
+      lp = strrchr (lb.buffer, '/');
       if (lp != NULL)
         lp += 1;
       else
-        lp = skip_spaces (lb.buffer + 2);
+        lp = skip_spaces (lb.buffer);
       cp = skip_non_spaces (lp);
       /* If the "interpreter" turns out to be "env", the real interpreter is
          the next word. */

But parsing HTML files seems to remain the slowest part. There are a lot of them in that project (many test cases), but maybe 3x the number of code files, not 60x their number. And they're pretty small, on average. If somebody wants to test that locally, here's the repository: https://github.com/mozilla/gecko-dev ^ permalink raw reply related [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-06 0:56 ` Dmitry Gutov @ 2024-10-06 6:22 ` Eli Zaretskii 2024-10-06 19:14 ` Dmitry Gutov 0 siblings, 1 reply; 57+ messages in thread From: Eli Zaretskii @ 2024-10-06 6:22 UTC (permalink / raw) To: Dmitry Gutov; +Cc: pot, 73484, spwhitton > Date: Sun, 6 Oct 2024 03:56:58 +0300 > Cc: spwhitton@spwhitton.name, 73484@debbugs.gnu.org, > Eli Zaretskii <eliz@gnu.org> > From: Dmitry Gutov <dmitry@gutov.dev> > > On 05/10/2024 19:38, Francesco Potortì wrote: > > Eli Zaretskii: > >>> How hard is it to add to a live TAGS file fake lines which look like > >>> this: > >>> > >>> ^L > >>> foo,0 > >>> > >>> (with random strings instead of "foo"), and then time some TAGS-using > >>> commands with and without these additions? > > > > Dmitry Gutov: > >> Okay, done that. > >> > >> 'M-.' takes more or less the same. > >> > >> The file size of TAGS increased from 66 MB to 85 MiB. > >> > >> Won't measure time to generate now - because the current method and the > >> "real" one will be different, but note that it's more relevant with > >> etags-regen-mode because the scan is performed lazily: every time the > >> user does the first search in a new project. > > > > Removing the Fortran and C/C++ fallbacks just for testing requires recompiling etags.c after removing the code beginning with /* Else try Fortran or C. */. This would avoid parsing the file (except for detecting the sharp-bang) and would leave the file name in the tags file, without tags. That would also remove the ability to scan files of no language for regexps. So this is not what I intend to do for this feature request, FWIW. > Then, the total time increased a lot: from 30 s to 30-40 min. I don't understand why. How many files with no extensions are in that tree, and what was the etags command line in both cases? > But parsing HTML files seems to remain the slowest part. 
There are a lot > of them in that project (many test cases), but maybe 3x the number of > code files, not 60x their number. And they're pretty small, on average. > If somebody wants to test that locally, here's the repository: > https://github.com/mozilla/gecko-dev If HTML files are what explains the slowdown, then why did this change trigger it? HTML files are supposed to have extensions that tell etags they are HTML. And if they don't have extensions, the code you removed would have caused etags to scan these files anyway, looking for Fortran or C tags. So how come the change slowed down etags so much? What am I missing? ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-06 6:22 ` Eli Zaretskii @ 2024-10-06 19:14 ` Dmitry Gutov 2024-10-07 2:33 ` Eli Zaretskii 0 siblings, 1 reply; 57+ messages in thread From: Dmitry Gutov @ 2024-10-06 19:14 UTC (permalink / raw) To: Eli Zaretskii; +Cc: pot, 73484, spwhitton On 06/10/2024 09:22, Eli Zaretskii wrote: >> Then, the total time increased a lot: from 30 s to 30-40 min. > > I don't understand why. How many files with no extensions are in that > tree, and what was the etags command line in both cases? Sorry, I have to add a correction: it's about 15 min either way. Seems like the first time I either messed up the start time, or the directory was in "cold" cache, or the etags used was some much older version. So to reiterate: the current etags-regen scans in around 30s, and the simple switch scans the directory in 15 minutes. Retesting the change from previous email, it doesn't really help. And the 'find-tag' scan did become slower - i.e. from 400 ms to 1200 ms. Not clear about the mechanics (the size of TAGS only went up from 65 to 88 MB). >> But parsing HTML files seems to remain the slowest part. There are a lot >> of them in that project (many test cases), but maybe 3x the number of >> code files, not 60x their number. And they're pretty small, on average. >> If somebody wants to test that locally, here's the repository: >> https://github.com/mozilla/gecko-dev > > If HTML files is what explains the slowdown, then why this change > triggered it? HTML files are supposed to have extensions that tell > etags they are HTML. Okay, I've commented out the most obvious suspects (html, asm, makefile) - all their entries in 'lang_names' - but the scan still takes too long. Maybe it's some other file type, which I haven't found yet. 
But what I see when monitoring the running scan with 'tail -f TAGS', is that the output stops sometimes for like 20 seconds, in the middle of outputting tags of some common code file (like .cpp or .py, a common type), and then resumes, with files of the same type around this one. > And if they don't have extensions, the code you > removed would have caused etags to scan these files anyway, looking > for Fortran or C tags. So how come the change slowed down etags so > much? What am I missing? I think it would also concern "unknown" extensions, right? Like .txt, .png and so on. Anyway, the difference is either due to the different set of files (all project files, rather than files in the specified list of extensions), or due to all file names being printed. Not sure how to verify, yet. ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-06 19:14 ` Dmitry Gutov @ 2024-10-07 2:33 ` Eli Zaretskii 2024-10-07 7:11 ` Dmitry Gutov 0 siblings, 1 reply; 57+ messages in thread From: Eli Zaretskii @ 2024-10-07 2:33 UTC (permalink / raw) To: Dmitry Gutov; +Cc: pot, 73484, spwhitton > Date: Sun, 6 Oct 2024 22:14:46 +0300 > Cc: pot@gnu.org, spwhitton@spwhitton.name, 73484@debbugs.gnu.org > From: Dmitry Gutov <dmitry@gutov.dev> > > On 06/10/2024 09:22, Eli Zaretskii wrote: > > >> Then, the total time increased a lot: from 30 s to 30-40 min. > > > > I don't understand why. How many files with no extensions are in that > > tree, and what was the etags command line in both cases? > > Sorry, I have to add a correction: it's about 15 min either way. Seems > like the first time I either messed up the start time, or the directory > was in "cold" cache, or the used etags some much older version. > > So to reiterate: the current etags-regen scans in around 30s, and the > simple switch scans the directory in 15 minutes. Retesting the change > from previous email, it doesn't really help. Can you please show the etags command line in each of these two cases that you are comparing? > > And if they don't have extensions, the code you > > removed would have caused etags to scan these files anyway, looking > > for Fortran or C tags. So how come the change slowed down etags so > > much? What am I missing? > > I think it would also concern "unknown" extensions, right? Like .txt, > .png and so on. I have difficulty reasoning about this without knowing the command lines you used. E.g., I don't understand why in one case it would scan files with unknown extensions that were not scanned in the other. ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-07 2:33 ` Eli Zaretskii @ 2024-10-07 7:11 ` Dmitry Gutov 2024-10-07 16:05 ` Eli Zaretskii 0 siblings, 1 reply; 57+ messages in thread From: Dmitry Gutov @ 2024-10-07 7:11 UTC (permalink / raw) To: Eli Zaretskii; +Cc: pot, 73484, spwhitton On 07/10/2024 05:33, Eli Zaretskii wrote: >> Sorry, I have to add a correction: it's about 15 min either way. Seems >> like the first time I either messed up the start time, or the directory >> was in "cold" cache, or the used etags some much older version. >> >> So to reiterate: the current etags-regen scans in around 30s, and the >> simple switch scans the directory in 15 minutes. Retesting the change >> from previous email, it doesn't really help. > Can you please show the etags command line in each of these two cases > that you are comparing? Both commands end with a '-' (scanning the list of files passed from stdin). >>> And if they don't have extensions, the code you >>> removed would have caused etags to scan these files anyway, looking >>> for Fortran or C tags. So how come the change slowed down etags so >>> much? What am I missing? >> I think it would also concern "unknown" extensions, right? Like .txt, >> .png and so on. > I have difficulty reasoning about this without knowing the command > lines you used. E.g., I don't understand why in one case it would > scan files with unknown extensions that were not scanned in the other. In one case the list is pre-filtered with etags-regen-file-extensions (see 'etags-regen--all-files'), in the other - it is not, and all files in project are passed. ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-07 7:11 ` Dmitry Gutov @ 2024-10-07 16:05 ` Eli Zaretskii 2024-10-07 17:36 ` Dmitry Gutov 0 siblings, 1 reply; 57+ messages in thread From: Eli Zaretskii @ 2024-10-07 16:05 UTC (permalink / raw) To: Dmitry Gutov; +Cc: pot, 73484, spwhitton > Date: Mon, 7 Oct 2024 10:11:08 +0300 > Cc: pot@gnu.org, spwhitton@spwhitton.name, 73484@debbugs.gnu.org > From: Dmitry Gutov <dmitry@gutov.dev> > > > Can you please show the etags command line in each of these two cases > > that you are comparing? > > Both commands end with a '-' (scanning the list of files passed from stdin). > > >>> And if they don't have extensions, the code you > >>> removed would have caused etags to scan these files anyway, looking > >>> for Fortran or C tags. So how come the change slowed down etags so > >>> much? What am I missing? > >> I think it would also concern "unknown" extensions, right? Like .txt, > >> .png and so on. > > I have difficulty reasoning about this without knowing the command > > lines you used. E.g., I don't understand why in one case it would > > scan files with unknown extensions that were not scanned in the other. > > In one case the list is pre-filtered with etags-regen-file-extensions > (see 'etags-regen--all-files'), in the other - it is not, and all files > in project are passed. So you are comparing the speed of scanning ~60K files with the speed of scanning ~375K of files? I'm not generally surprised that the latter takes much longer, only that the slowdown is not proportional to the number of scanned files. But see below. Btw, did you exclude the .git/* files from the list submitted to etags? 
Here, scanning, with the unmodified etags from Emacs 30, of only those files with extensions in etags-regen-file-extensions takes 16.7 sec and produces an 80.5MB tags table, whereas scanning all the files with the same etags takes almost 16 min and produces a 304MB tags table, of which more than 200MB are from files whose language is not recognized. From my testing, it seems like the elapsed time depends non-linearly on the length of the list of files submitted to etags. For example, if I break the list of files in two, I get 3 min 20 sec and 1 min 40 sec, together 5 min. But if I submit a single list with all the files in those two lists, I get 14 min 30 sec. I guess some internal processing etags does depends non-linearly on the number of files it scans. The various loops in etags that scan all of the known files and/or the tags it previously found seem to confirm this hypothesis. So what is the conclusion from this? Are you saying that the long scan times in this large tree basically make this new no-fallbacks option not very useful, since we still need to carefully include or exclude certain files from the scan? Or should I go ahead and install these changes? ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-07 16:05 ` Eli Zaretskii @ 2024-10-07 17:36 ` Dmitry Gutov 2024-10-07 19:05 ` Eli Zaretskii 0 siblings, 1 reply; 57+ messages in thread From: Dmitry Gutov @ 2024-10-07 17:36 UTC (permalink / raw) To: Eli Zaretskii; +Cc: pot, 73484, spwhitton On 07/10/2024 19:05, Eli Zaretskii wrote: > So you are comparing the speed of scanning ~60K files with the speed > of scanning ~375K of files? I'm not generally surprised that the > latter takes much longer, only that the slowdown is not proportional > to the number of scanned files. But see below. I forgot one thing: all .js files are actually set to be ignored there. And my tree is a little old, so it's 200K files total. Otherwise -- yes. Note, however, that the time is really not proportional: 30 s vs 15 min is a 30x difference. And I've been assuming that the "other" files would mostly fall in the non-recognized category, and most of them would only have the 2 first characters read (then, recognizing that those chars are not '#!', etags would skip the file). > Btw, did you exclude the .git/* files from the list submitted to > etags? Yes, it's excluded. And the files matching the .gitignore entries are excluded as well. > Here, scanning, with the unmodified etags from Emacs 30, of only those > files with extensions in etags-regen-file-extensions takes 16.7 sec > and produces a 80.5MB tags table, whereas scanning all the files with > the same etags takes almost 16 min and produces 304MB tags table, of > which more than 200MB are from files whose language is not recognized. My result in the latter case was only 88 MB. Maybe the many .js files make the difference. I've put them into the "ignored" category long ago because most of them are used for tests, and there are a lot of those files, and there are generated one-long-line files. > From my testing, it seems like the elapsed time depends non-linearly > on the length of the list of files submitted to etags. 
For example, > if I break the list of files in two, I get 3 min 20 sec and 1 min 40 > sec, together 5 min. But if I submit a single list with all the files > in those two lists, I get 14 min 30 sec. I guess some internal > processing etags does depends non-linearly on the number of files it > scans. The various loops in etags that scan all of the known files > and/or the tags it previously found seem to confirm this hypothesis. Makes sense! It sounds like some N^2 complexity somewhere. > So what is the conclusion from this? Are you saying that the long > scan times in this large tree basically make this new no-fallbacks > option not very useful, since we still need to carefully include or > exclude certain files from the scan? Or should I go ahead and install > these changes? I think that option will be useful, but for better benchmarks and for end usability as well, I think we need the N^2 thing fixed as well. Maybe before the rest of the changes. ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-07 17:36 ` Dmitry Gutov @ 2024-10-07 19:05 ` Eli Zaretskii 2024-10-07 22:08 ` Dmitry Gutov 0 siblings, 1 reply; 57+ messages in thread From: Eli Zaretskii @ 2024-10-07 19:05 UTC (permalink / raw) To: Dmitry Gutov; +Cc: pot, 73484, spwhitton > Date: Mon, 7 Oct 2024 20:36:47 +0300 > Cc: pot@gnu.org, spwhitton@spwhitton.name, 73484@debbugs.gnu.org > From: Dmitry Gutov <dmitry@gutov.dev> > > On 07/10/2024 19:05, Eli Zaretskii wrote: > > > So what is the conclusion from this? Are you saying that the long > > scan times in this large tree basically make this new no-fallbacks > > option not very useful, since we still need to carefully include or > > exclude certain files from the scan? Or should I go ahead and install > > these changes? > > I think that option will be useful, but for better benchmarks and for > end usability as well, I think we need the N^2 thing fixed as well. > Maybe before the rest of the changes. If this latter part is a precondition, then someone else will have to work on this. I have the new option coded and tested (and documented), but I don't intend to work on redesigning the core etags algorithms to remove the non-linear behavior; that's a much larger project which I currently cannot afford, sorry. ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-07 19:05 ` Eli Zaretskii @ 2024-10-07 22:08 ` Dmitry Gutov 2024-10-08 13:04 ` Eli Zaretskii 0 siblings, 1 reply; 57+ messages in thread From: Dmitry Gutov @ 2024-10-07 22:08 UTC (permalink / raw) To: Eli Zaretskii; +Cc: pot, 73484, spwhitton On 07/10/2024 22:05, Eli Zaretskii wrote: >> Date: Mon, 7 Oct 2024 20:36:47 +0300 >> Cc: pot@gnu.org, spwhitton@spwhitton.name, 73484@debbugs.gnu.org >> From: Dmitry Gutov <dmitry@gutov.dev> >> >> On 07/10/2024 19:05, Eli Zaretskii wrote: >> >>> So what is the conclusion from this? Are you saying that the long >>> scan times in this large tree basically make this new no-fallbacks >>> option not very useful, since we still need to carefully include or >>> exclude certain files from the scan? Or should I go ahead and install >>> these changes? >> >> I think that option will be useful, but for better benchmarks and for >> end usability as well, I think we need the N^2 thing fixed as well. >> Maybe before the rest of the changes. > > If this latter part is a precodintion, I think we still could use the new flag, just not switch to it (no extension filtering) by default yet. > then someone else will have to > work on this. I have the new option coded and tested (and > documented), but I don't intend to work on redesigning the core etags > algorithms to remove the non-linear behavior, that's a much larger > project which I currently cannot afford, sorry. Do you mind pointing at the places in the code where you already noticed non-linear performance coming from? ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-07 22:08 ` Dmitry Gutov @ 2024-10-08 13:04 ` Eli Zaretskii 2024-10-09 18:23 ` Dmitry Gutov 0 siblings, 1 reply; 57+ messages in thread From: Eli Zaretskii @ 2024-10-08 13:04 UTC (permalink / raw) To: Dmitry Gutov; +Cc: pot, 73484, spwhitton > Date: Tue, 8 Oct 2024 01:08:00 +0300 > Cc: pot@gnu.org, spwhitton@spwhitton.name, 73484@debbugs.gnu.org > From: Dmitry Gutov <dmitry@gutov.dev> > > On 07/10/2024 22:05, Eli Zaretskii wrote: > >> Date: Mon, 7 Oct 2024 20:36:47 +0300 > >> Cc: pot@gnu.org, spwhitton@spwhitton.name, 73484@debbugs.gnu.org > >> From: Dmitry Gutov <dmitry@gutov.dev> > >> > >> On 07/10/2024 19:05, Eli Zaretskii wrote: > >> > >>> So what is the conclusion from this? Are you saying that the long > >>> scan times in this large tree basically make this new no-fallbacks > >>> option not very useful, since we still need to carefully include or > >>> exclude certain files from the scan? Or should I go ahead and install > >>> these changes? > >> > >> I think that option will be useful, but for better benchmarks and for > >> end usability as well, I think we need the N^2 thing fixed as well. > >> Maybe before the rest of the changes. > > > > If this latter part is a precodintion, > > I think we still could use the new flag, just not switch to it (no > extension filtering) by default yet. OK, installed on master. I leave it up to you whether to close the bug. > > then someone else will have to > > work on this. I have the new option coded and tested (and > > documented), but I don't intend to work on redesigning the core etags > > algorithms to remove the non-linear behavior, that's a much larger > > project which I currently cannot afford, sorry. > > Do you mind pointing at the places in the code where you already noticed > non-linear performance coming from? The while-loop near line 2020, for example. 
Another one is the for-loop near line 1420, which deals with writing into TAGS the entries of files with no tags. There may be others, but those are what I saw. Perhaps it is a good idea to profile etags while it scans the files during those 15 min, to see where it spends that much time, because I'm not sure even those loops can account for that. It's possible there's something else at work here which we don't yet understand. Two aspects that I found trying to understand the long scan times, and I'd like to mention so they don't become forgotten: . If there are compressed files in the directory, etags will uncompress them before it attempts to identify their language. There are 20 such files in the gecko-dev tree (removing them from the list of scanned files had only minor effect on the elapsed time, but it could be different in other cases, especially if uncompressing them produces very large files). . Some files have their language identified by means other than their names or extensions: those are the languages that have "interpreters" defined in etags.c. Shell scripts is one such case, but not the only one. So when etags-regen.el passes only files with known extensions to etags, it misses those files from TAGS. As one example, the file js/src/devtools/rootAnalysis/run_complete in the gecko-dev tree is a Perl script, but has no .pl extension. ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-08 13:04 ` Eli Zaretskii @ 2024-10-09 18:23 ` Dmitry Gutov 2024-10-09 19:11 ` Eli Zaretskii ` (2 more replies) 0 siblings, 3 replies; 57+ messages in thread From: Dmitry Gutov @ 2024-10-09 18:23 UTC (permalink / raw) To: Eli Zaretskii; +Cc: pot, 73484, spwhitton On 08/10/2024 16:04, Eli Zaretskii wrote: >>>> I think that option will be useful, but for better benchmarks and for >>>> end usability as well, I think we need the N^2 thing fixed as well. >>>> Maybe before the rest of the changes. >>> >>> If this latter part is a precodintion, >> >> I think we still could use the new flag, just not switch to it (no >> extension filtering) by default yet. > > OK, installed on master. I leave it up to you whether to close the > bug. Thank you! Before closing though, I'd like to look into the performance issue more. >>> then someone else will have to >>> work on this. I have the new option coded and tested (and >>> documented), but I don't intend to work on redesigning the core etags >>> algorithms to remove the non-linear behavior, that's a much larger >>> project which I currently cannot afford, sorry. >> >> Do you mind pointing at the places in the code where you already noticed >> non-linear performance coming from? > > The while-loop near line 2020, for example. Thanks. This one must be proportional to the number of files such as *.y. There are only 2 in our big repo. > Another one is the for-loop near line 1420, which deals with writing > into TAGS the entries of files with no tags. It's not a nested 'for' loop, though (right?), and it's called from 'main'. That seems to mean it's just O(N) - also fine. > There may be others, but those are what I saw. Perhaps it is a good > idea to profile etags while it scans the files during those 15 min, to > see where it spends that much time, because I'm not sure even those > loops can account for that. 
It's possible there's something else at
> work here which we don't yet understand.

'perf' shows me a profile like this:

  67.31%  etags  libc.so.6          [.] __strcmp_avx2
  26.29%  etags  etags              [.] process_file_name
   2.00%  etags  etags              [.] streq
   0.96%  etags  etags              [.] strcmp@plt
   0.32%  etags  etags              [.] readline_internal
   0.11%  etags  etags              [.] HTML_labels
   0.08%  etags  [kernel.kallsyms]  [k] syscall_return_via_sysret
   0.07%  etags  [kernel.kallsyms]  [k] kmem_cache_alloc
   0.06%  etags  [kernel.kallsyms]  [k] entry_SYSRETQ_unsafe_stack
   0.05%  etags  [kernel.kallsyms]  [k] perf_adjust_freq_unthr_context
   0.04%  etags  etags              [.] c_strncasecmp

So... most of the time is spent in string comparison.

Here is the nested loop which, if I comment it out, makes the parse finish
in ~20 seconds with all the extra files (except *.js), or in 15s when
used with the new flags.

diff --git a/lib-src/etags.c b/lib-src/etags.c
index a822a823a90..331e3ffe816 100644
--- a/lib-src/etags.c
+++ b/lib-src/etags.c
@@ -1697,14 +1697,14 @@ process_file_name (char *file, language *lang)
       uncompressed_name = file;
     }
 
-  /* If the canonicalized uncompressed name
-     has already been dealt with, skip it silently. */
-  for (fdp = fdhead; fdp != NULL; fdp = fdp->next)
-    {
-      assert (fdp->infname != NULL);
-      if (streq (uncompressed_name, fdp->infname))
-        goto cleanup;
-    }
+  /* /\* If the canonicalized uncompressed name */
+  /*    has already been dealt with, skip it silently. *\/ */
+  /* for (fdp = fdhead; fdp != NULL; fdp = fdp->next) */
+  /*   { */
+  /*     assert (fdp->infname != NULL); */
+  /*     if (streq (uncompressed_name, fdp->infname)) */
+  /*       goto cleanup; */
+  /*   } */
 
   inf = fopen (file, "r" FOPEN_BINARY);
   if (inf)

This is basically a "uniqueness" operation using linear search, O(N^2).

Is there a hash table we could use?

Or perhaps we would skip the search when the canonicalized name is the
same as the original one.

> Two aspects that I found trying to understand the long scan times, and
> I'd like to mention so they don't become forgotten:
>
> .
If there are compressed files in the directory, etags will > uncompress them before it attempts to identify their language. > There are 20 such files in the gecko-dev tree (removing them from > the list of scanned files had only minor effect on the elapsed > time, but it could be different in other cases, especially if > uncompressing them produces very large files). I guess someone might ask for flag "--no-decompress", sometime. > . Some files have their language identified by means other than their > names or extensions: those are the languages that have > "interpreters" defined in etags.c. Shell scripts is one such case, > but not the only one. So when etags-regen.el passes only files > with known extensions to etags, it misses those files from TAGS. > As one example, the file js/src/devtools/rootAnalysis/run_complete > in the gecko-dev tree is a Perl script, but has no .pl extension. This sounds the same as the "hashbang" files that we mentioned previously. It makes sense for the scan to take longer, of course, proportional to the number of the detected files. ^ permalink raw reply related [flat|nested] 57+ messages in thread
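For illustration only, the hash-table idea raised in the message above could look roughly like the following. This is a minimal sketch, not code from etags.c: the names seen_table, seen_hash and seen_before are hypothetical, the table has a fixed size, and abort() stands in for whatever out-of-memory handling etags really uses. It replaces the linear walk over previously seen names with an expected constant-time lookup per file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout; nothing here exists in etags.c.  */
enum { SEEN_BUCKETS = 1 << 16 };  /* fixed size, good enough for a sketch */

struct seen_entry
{
  struct seen_entry *next;      /* next entry in the same bucket */
  char *name;                   /* private copy of the file name */
};

static struct seen_entry *seen_table[SEEN_BUCKETS];

/* 32-bit FNV-1a hash of the string S.  */
static uint32_t
seen_hash (const char *s)
{
  uint32_t h = 2166136261u;
  for (; *s != '\0'; s++)
    {
      h ^= (unsigned char) *s;
      h *= 16777619u;
    }
  return h;
}

/* Return true if NAME was recorded earlier.  Otherwise record a copy
   of NAME and return false.  */
static bool
seen_before (const char *name)
{
  assert (name != NULL);
  struct seen_entry **bucket = &seen_table[seen_hash (name) % SEEN_BUCKETS];

  for (struct seen_entry *e = *bucket; e != NULL; e = e->next)
    if (strcmp (name, e->name) == 0)
      return true;

  struct seen_entry *e = malloc (sizeof *e);
  size_t len = strlen (name) + 1;
  if (e == NULL || (e->name = malloc (len)) == NULL)
    abort ();                   /* stand-in for real error handling */
  memcpy (e->name, name, len);
  e->next = *bucket;
  *bucket = e;
  return false;
}
```

Under these assumptions, the loop in process_file_name would shrink to a single test along the lines of "if (seen_before (uncompressed_name)) goto cleanup;". Like the loop it replaces, this compares names as strings, so two different spellings of the same file still count as different.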
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-09 18:23 ` Dmitry Gutov @ 2024-10-09 19:11 ` Eli Zaretskii 2024-10-09 22:22 ` Dmitry Gutov 2024-10-10 1:07 ` Francesco Potortì 2024-10-10 1:39 ` Francesco Potortì 2 siblings, 1 reply; 57+ messages in thread From: Eli Zaretskii @ 2024-10-09 19:11 UTC (permalink / raw) To: Dmitry Gutov; +Cc: pot, 73484, spwhitton > Date: Wed, 9 Oct 2024 21:23:37 +0300 > Cc: pot@gnu.org, spwhitton@spwhitton.name, 73484@debbugs.gnu.org > From: Dmitry Gutov <dmitry@gutov.dev> > > 'perf' shows me a profile like this: > > 67.31% etags libc.so.6 [.] __strcmp_avx2 > 26.29% etags etags [.] process_file_name > 2.00% etags etags [.] streq > 0.96% etags etags [.] strcmp@plt > 0.32% etags etags [.] readline_internal > 0.11% etags etags [.] HTML_labels > 0.08% etags [kernel.kallsyms] [k] syscall_return_via_sysret > 0.07% etags [kernel.kallsyms] [k] kmem_cache_alloc > 0.06% etags [kernel.kallsyms] [k] entry_SYSRETQ_unsafe_stack > 0.05% etags [kernel.kallsyms] [k] perf_adjust_freq_unthr_context > 0.04% etags etags [.] c_strncasecmp > > So... most of the time is spent in string comparison. > > Here is the nested loop, which if I comment out, makes the parse finish > in ~20 seconds, with all the extra files (except *.js), or in 15s when > using with new flags. > > diff --git a/lib-src/etags.c b/lib-src/etags.c > index a822a823a90..331e3ffe816 100644 > --- a/lib-src/etags.c > +++ b/lib-src/etags.c > @@ -1697,14 +1697,14 @@ process_file_name (char *file, language *lang) > uncompressed_name = file; > } > > - /* If the canonicalized uncompressed name > - has already been dealt with, skip it silently. */ > - for (fdp = fdhead; fdp != NULL; fdp = fdp->next) > - { > - assert (fdp->infname != NULL); > - if (streq (uncompressed_name, fdp->infname)) > - goto cleanup; > - } > + /* /\* If the canonicalized uncompressed name */ > + /* has already been dealt with, skip it silently. 
*\/ */ > + /* for (fdp = fdhead; fdp != NULL; fdp = fdp->next) */ > + /* { */ > + /* assert (fdp->infname != NULL); */ > + /* if (streq (uncompressed_name, fdp->infname)) */ > + /* goto cleanup; */ > + /* } */ > > inf = fopen (file, "r" FOPEN_BINARY); > if (inf) > > This is basically a "uniqueness" operation using linear search, O(N^2). Yes, this seems to be a protection against the same file name mentioned more than once on the command line.. > Is there a hash table we could use? Something like that should do, yes. > Or perhaps we would skip the search when the canonicalized name is the > same as the original one. That's not the same as the loop above does, I think. > > Two aspects that I found trying to understand the long scan times, and > > I'd like to mention so they don't become forgotten: > > > > . If there are compressed files in the directory, etags will > > uncompress them before it attempts to identify their language. > > There are 20 such files in the gecko-dev tree (removing them from > > the list of scanned files had only minor effect on the elapsed > > time, but it could be different in other cases, especially if > > uncompressing them produces very large files). > > I guess someone might ask for flag "--no-decompress", sometime. Yes, but it's also easy to exclude them via 'find'. > > . Some files have their language identified by means other than their > > names or extensions: those are the languages that have > > "interpreters" defined in etags.c. Shell scripts is one such case, > > but not the only one. So when etags-regen.el passes only files > > with known extensions to etags, it misses those files from TAGS. > > As one example, the file js/src/devtools/rootAnalysis/run_complete > > in the gecko-dev tree is a Perl script, but has no .pl extension. > > This sounds the same as the "hashbang" files that we mentioned > previously. It makes sense for the scan to take longer, of course, > proportional to the number of the detected files. 
My point was that if someone wants all the Python files, say, submitting only Python extensions to etags might miss some Python scripts. ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-09 19:11 ` Eli Zaretskii @ 2024-10-09 22:22 ` Dmitry Gutov 2024-10-10 5:13 ` Eli Zaretskii 0 siblings, 1 reply; 57+ messages in thread From: Dmitry Gutov @ 2024-10-09 22:22 UTC (permalink / raw) To: Eli Zaretskii; +Cc: pot, 73484, spwhitton On 09/10/2024 22:11, Eli Zaretskii wrote: >> This is basically a "uniqueness" operation using linear search, O(N^2). > > Yes, this seems to be a protection against the same file name > mentioned more than once on the command line. Or, maybe more likely, against having symlinks scanned if the symlink target is also in the passed list. >> Is there a hash table we could use? > > Something like that should do, yes. Can we use search.h? hcreate/hsearch/etc. IIUC it's not in the C standard, and https://www.gnu.org/savannah-checkouts/gnu/gnulib/manual/html_node/hcreate.html says it's available on certain platforms. >> Or perhaps we would skip the search when the canonicalized name is the >> same as the original one. > > That's not the same as the loop above does, I think. If we assumed the duplicate check is only necessary for symlinks, and there is on average a small number of them, I think we could avoid using a hash table. But passing the same exact file 2 times would result in duplicate tags. >> I guess someone might ask for flag "--no-decompress", sometime. > > Yes, but it's also easy to exclude them via 'find'. Or through etags-regen-ignores. >>> . Some files have their language identified by means other than their >>> names or extensions: those are the languages that have >>> "interpreters" defined in etags.c. Shell scripts is one such case, >>> but not the only one. So when etags-regen.el passes only files >>> with known extensions to etags, it misses those files from TAGS. >>> As one example, the file js/src/devtools/rootAnalysis/run_complete >>> in the gecko-dev tree is a Perl script, but has no .pl extension.
>> >> This sounds the same as the "hashbang" files that we mentioned >> previously. It makes sense for the scan to take longer, of course, >> proportional to the number of the detected files. > > My point was that if someone wants all the Python files, say, > submitting only Python extensions to etags might miss some Python > scripts. Yes, that's the problem from the first comments of this report: to have hashbang files scanned, one can't use a whitelist of extensions. Using a blacklist should be fine, though. ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-09 22:22 ` Dmitry Gutov @ 2024-10-10 5:13 ` Eli Zaretskii 0 siblings, 0 replies; 57+ messages in thread From: Eli Zaretskii @ 2024-10-10 5:13 UTC (permalink / raw) To: Dmitry Gutov; +Cc: pot, 73484, spwhitton > Date: Thu, 10 Oct 2024 01:22:13 +0300 > Cc: pot@gnu.org, spwhitton@spwhitton.name, 73484@debbugs.gnu.org > From: Dmitry Gutov <dmitry@gutov.dev> > > On 09/10/2024 22:11, Eli Zaretskii wrote: > > >> This is basically a "uniqueness" operation using linear search, O(N^2). > > > > Yes, this seems to be a protection against the same file name > > mentioned more than once on the command line.. > > Or, maybe more likely, against having symlinks scanned if the symlink > target is also in the passed list. Yes, that, but also any other possible ways of specifying the same file twice, like having a file both compressed and uncompressed, etc. > >> Is there a hash table we could use? > > > > Something like that should do, yes. > > Can we use search.h? hcreate/hsearch/etc. IIUC it's on in the C stndard, > and > https://www.gnu.org/savannah-checkouts/gnu/gnulib/manual/html_node/hcreate.html > says it's available on certain platforms. I think we shouldn't: it is not sufficiently portable and Gnulib doesn't have an implementation for it for those platforms that don't have it. We could perhaps use the standard tsearch (although it will be more expensive). Alternatively, we could steal the hash table code from somewhere, for example, from Gawk. > >> Or perhaps we would skip the search when the canonicalized name is the > >> same as the original one. > > > > That's not the same as the loop above does, I think. > > If we assumed the duplicate check is only necessary for symlinks, and > there is on average a small number of them, I think we could avoid using > a hash table. But passing the same exact file 2 times would result in > duplicate tags. 
canonicalize_filename in etags.c does not resolve symlinks, AFAICT, so the symlink scenario will not be solved by that. We'd need realpath or its equivalent, I think? ^ permalink raw reply [flat|nested] 57+ messages in thread
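As a rough illustration of the tsearch alternative mentioned above, here is a minimal sketch using the POSIX <search.h> tree functions. The names seen_root, compare_names and seen_before_tree are hypothetical and do not exist in etags.c. The sketch keys the tree on the name string as given, so it does nothing about symlinks; that would need realpath or an equivalent, as the message says.

```c
#include <assert.h>
#include <search.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names; a sketch, not etags.c code.  */
static void *seen_root = NULL;  /* root of the tree kept by tsearch */

/* Comparison callback: the keys are NUL-terminated file names.  */
static int
compare_names (const void *a, const void *b)
{
  return strcmp (a, b);
}

/* Return true if NAME is already in the tree.  Otherwise insert a
   copy of NAME and return false.  */
static bool
seen_before_tree (const char *name)
{
  assert (name != NULL);
  if (tfind (name, &seen_root, compare_names) != NULL)
    return true;

  size_t len = strlen (name) + 1;
  char *copy = malloc (len);
  if (copy == NULL)
    abort ();                   /* stand-in for real error handling */
  memcpy (copy, name, len);
  if (tsearch (copy, &seen_root, compare_names) == NULL)
    abort ();                   /* tsearch could not allocate a node */
  return false;
}
```

Each lookup costs a number of string comparisons that grows with the depth of the tree, rather than with the number of files seen so far. How deep the tree gets depends on the C library: POSIX does not require the tree to be balanced, so the actual cost is an assumption worth measuring on the target platforms.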
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-09 18:23 ` Dmitry Gutov 2024-10-09 19:11 ` Eli Zaretskii @ 2024-10-10 1:07 ` Francesco Potortì 2024-10-10 5:41 ` Eli Zaretskii 2024-10-10 10:17 ` Dmitry Gutov 2024-10-10 1:39 ` Francesco Potortì 2 siblings, 2 replies; 57+ messages in thread From: Francesco Potortì @ 2024-10-10 1:07 UTC (permalink / raw) To: Dmitry Gutov; +Cc: Eli Zaretskii, 73484, spwhitton >Here is the nested loop, which if I comment out, makes the parse finish >in ~20 seconds, with all the extra files (except *.js), or in 15s when >using with new flags. > >diff --git a/lib-src/etags.c b/lib-src/etags.c >index a822a823a90..331e3ffe816 100644 >--- a/lib-src/etags.c >+++ b/lib-src/etags.c >@@ -1697,14 +1697,14 @@ process_file_name (char *file, language *lang) > uncompressed_name = file; > } > >- /* If the canonicalized uncompressed name >- has already been dealt with, skip it silently. */ >- for (fdp = fdhead; fdp != NULL; fdp = fdp->next) >- { >- assert (fdp->infname != NULL); >- if (streq (uncompressed_name, fdp->infname)) >- goto cleanup; >- } >+ /* /\* If the canonicalized uncompressed name */ >+ /* has already been dealt with, skip it silently. *\/ */ >+ /* for (fdp = fdhead; fdp != NULL; fdp = fdp->next) */ >+ /* { */ >+ /* assert (fdp->infname != NULL); */ >+ /* if (streq (uncompressed_name, fdp->infname)) */ >+ /* goto cleanup; */ >+ /* } */ > > inf = fopen (file, "r" FOPEN_BINARY); > if (inf) > >This is basically a "uniqueness" operation using linear search, O(N^2). This is only for dealing with the case when the same file exists in both compressed and uncompressed form, and we are currently hitting the second one. In that case, we should skip it. Yes, this is a uniqueness test and yes, it is O^2 in the number of file names, but I doubt that this can explain a serious slowdown. >Is there a hash table we could use? No, we have a hash table for C tags, and that's all. 
It is useful because there are 34 keywords against which most strings in a C/C++ file are compared. It makes sense to build hash tables for other languages where a similar situation happens. I do not think that it makes sense to build a hash table for file names given on the command line, because the number of comparisons made on those names is generally far smaller than the number of comparisons used to search for tags. >> . Some files have their language identified by means other than their >> names or extensions: those are the languages that have >> "interpreters" defined in etags.c The interpreter is the token that comes after #!, with the possible exception of "env", in which case the interpreter is the second token after #!. There are two O^2 tests in the number of tags in C/C++ files, which depend on the two options "no-line-directive" and "no-duplicates". Both options can be used to disable those checks, and both are off by default because the checks help produce a saner tags file and have no practical impact in most cases. The options exist because, in principle, the checks can cause a significant slowdown with huge tags files. ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions 2024-10-10 1:07 ` Francesco Potortì @ 2024-10-10 5:41 ` Eli Zaretskii 2024-10-10 8:27 ` Francesco Potortì 2024-10-10 10:17 ` Dmitry Gutov 1 sibling, 1 reply; 57+ messages in thread From: Eli Zaretskii @ 2024-10-10 5:41 UTC (permalink / raw) To: Francesco Potortì; +Cc: dmitry, 73484, spwhitton

> From: Francesco Potortì <pot@gnu.org>
> Date: Thu, 10 Oct 2024 03:07:31 +0200
> Cc: 73484@debbugs.gnu.org,
> spwhitton@spwhitton.name,
> Eli Zaretskii <eliz@gnu.org>
>
> >+ /* /\* If the canonicalized uncompressed name */
> >+ /* has already been dealt with, skip it silently. *\/ */
> >+ /* for (fdp = fdhead; fdp != NULL; fdp = fdp->next) */
> >+ /* { */
> >+ /* assert (fdp->infname != NULL); */
> >+ /* if (streq (uncompressed_name, fdp->infname)) */
> >+ /* goto cleanup; */
> >+ /* } */
> >
> > inf = fopen (file, "r" FOPEN_BINARY);
> > if (inf)
> >
> >This is basically a "uniqueness" operation using linear search, O(N^2).
>
> This is only for dealing with the case when the same file exists in both compressed and uncompressed form, and we are currently hitting the second one. In that case, we should skip it. Yes, this is a uniqueness test and yes, it is O^2 in the number of file names, but I doubt that this can explain a serious slowdown.

Are you sure this is executed only for compressed files? Maybe I'm
missing something, but that's not my reading of the code:

  compr = get_compressor_from_suffix (file, &ext);
  if (compr)
    {
      compressed_name = file;
      uncompressed_name = savenstr (file, ext - file);
    }
  else
    {
      compressed_name = NULL;
      uncompressed_name = file;
    }

  /* If the canonicalized uncompressed name
     has already been dealt with, skip it silently. */
  for (fdp = fdhead; fdp != NULL; fdp = fdp->next)
    {
      assert (fdp->infname != NULL);
      if (streq (uncompressed_name, fdp->infname))
        goto cleanup;
    }

As you see, if the file is not compressed by any known method, the
code sets compressed_name to NULL and uncompressed_name to the
canonicalized file. But the loop doesn't test compressed_name, so it
is executed for all the files, compressed and uncompressed. Thus, I
believe the intent is to avoid duplicate tags if the same file was
encountered twice in some way.

Note that canonicalize_filename in this case doesn't really do what
its name seems to imply, e.g., relative file names will generally stay
relative. So specifying the same file once as relative and the other
time as absolute will still process the file more than once. We need
to use an inode test or equivalent, and probably use realpath or
equivalent, to make the duplicate test reliable. Or maybe having the
same file processed under different names is okay, since TAGS is for
helping Emacs find the file, and so using relative names and symlinks
is okay?

> >Is there a hash table we could use?
>
> No, we have a hash table for C tags, and that's all. It is useful because there are 34 keywords against which most strings in a C/C++ file are compared. It makes sense to build hash tables for other languages where a similar situation happens.

The hash table we have was built by gperf, and that method can only be
used for fixed sets of strings known in advance. We need a different
hash table for storing file names.

> I do not think that it makes sense to build a hash table for file names given on the command line, because the number of comparisons made on those names is generally vastly inferior to the number of comparisons used to search for tags.

That's not what I see in the code. But it should be easy to count the
number of loop iterations in the use case we are talking about
(running etags on the gecko-dev tree), so we don't need to argue about
facts.

> >> .
Some files have their language identified by means other than their > >> names or extensions: those are the languages that have > >> "interpreters" defined in etags.c > > The interpreter is the token what comes after #!, with The possible exception for "env", in which case the interpreter is the second token after #! > > There are two O^2 test in the number of tags in C/C++ files which depend on the two options "no-line-directive" and "no-duplicates". Both options are usable to disable those checks and both are off by default because they help producing a more sane tags file and have no practical impact in most cases. Both are there because, in principle, they cause significant slowdown in huge tags files. AFAIU, --no-duplicates is only for ctags, not for etags. I don't see how --no-duplicates could be relevant to the loop described above. Am I missing something? ^ permalink raw reply [flat|nested] 57+ messages in thread
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-10  5:41 ` Eli Zaretskii
@ 2024-10-10  8:27 ` Francesco Potortì
  2024-10-10  8:35   ` Eli Zaretskii
  0 siblings, 1 reply; 57+ messages in thread
From: Francesco Potortì @ 2024-10-10  8:27 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dmitry, 73484, spwhitton

>> >This is basically a "uniqueness" operation using linear search, O(N^2).
>
>Thus, I
>believe the intent is to avoid duplicate tags if the same file was
>encountered twice in some way.

Yes. Sorry, I spoke from memory and I was inaccurate.

>Note that canonicalize_filename in this case doesn't really do what
>its name seems to imply, e.g., relative file names will generally stay
>relative.

It canonicalises, that is, reduces to a standard common form. It retains the relative vs absolute difference.

>So specifying the same file once as relative and the other
>time as absolute will still process the file more than once.

From memory, I would say so, yes. I have not checked right now.

>We need
>to use an inode test or equivalent, and probably use realpath or
>equivalent, to make the duplicate test reliable.

>Or maybe having the
>same file processed under different names is okay, since TAGS is for
>helping Emacs find the file, and so using relative names and symlinks
>is okay?

Yes, I think so. And from memory I think it should be left unchanged.

>> I do not think that it makes sense to build a hash table for file names given on the command line, because the number of comparisons made on those names is generally far smaller than the number of comparisons used to search for tags.
>
>That's not what I see in the code.  But it should be easy to count the
>number of loop iterations in the use case we are talking about
>(running etags on the geck-dev tree), so we don't need to argue about
>facts.

Yes. If finding a bottleneck is the objective, you should maybe instrument the string comparison functions so that you can count how many times they are called from different places.

I had a quick look at the whole code and in fact the only place I can find where you have O(N^2) behaviour seems to be file name comparison, and it still looks strange to me that this can in fact cause a significant delay. I may certainly have missed something, but if that's really the case, the first thing is to look for code inefficiencies. If this is really structural, one should first read all file names, canonicalise and uniquify them, and only then create the tags.

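One way to do the instrumentation suggested above is a thin counting wrapper around the comparison, with one counter per call site. This is only a sketch under assumptions: the names counted_streq, call_site and streq_calls are hypothetical, and the wrapper uses strcmp directly rather than whatever streq expands to in etags.c.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical call-site labels; a real experiment would use one
   label per place in the source that compares strings.  */
enum call_site { SITE_FILE_NAMES, SITE_OTHER, SITE_COUNT };

/* Number of comparisons made from each call site.  */
static unsigned long long streq_calls[SITE_COUNT];

/* Compare S and T for equality and count the call under SITE.  */
static int
counted_streq (enum call_site site, const char *s, const char *t)
{
  assert (s != NULL && t != NULL);
  assert (site >= 0 && site < SITE_COUNT);
  streq_calls[site]++;
  return strcmp (s, t) == 0;
}
```

Printing streq_calls just before exit would show directly whether the file name loop dominates the number of comparisons.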
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-10  8:27 ` Francesco Potortì
@ 2024-10-10  8:35 ` Eli Zaretskii
  2024-10-10 14:25   ` Francesco Potortì
  0 siblings, 1 reply; 57+ messages in thread
From: Eli Zaretskii @ 2024-10-10  8:35 UTC (permalink / raw)
  To: Francesco Potortì; +Cc: dmitry, 73484, spwhitton

> From: Francesco Potortì <pot@gnu.org>
> Date: Thu, 10 Oct 2024 10:27:57 +0200
> Cc: spwhitton@spwhitton.name,
>  73484@debbugs.gnu.org,
>  dmitry@gutov.dev
>
> >That's not what I see in the code.  But it should be easy to count the
> >number of loop iterations in the use case we are talking about
> >(running etags on the geck-dev tree), so we don't need to argue about
> >facts.
>
> Yes. If finding a bottleneck is the objective, you should maybe instrument the string comparison functions so that you can count how many times they are called from different places.
>
> I had a quick look at the whole code and in fact the only place I can find where you have O(N^2) behaviour seems to be file name comparison, and it still looks strange to me that this can in fact cause a significant delay.

We are using etags on a huge tree: about 375K files.  I think that's
the reason, because non-linear behaviors are like that: they are
insignificant with small sets, but huge with larger ones...

Profiles don't lie...

* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-10  8:35 ` Eli Zaretskii
@ 2024-10-10 14:25 ` Francesco Potortì
  2024-10-10 16:28   ` Eli Zaretskii
  0 siblings, 1 reply; 57+ messages in thread
From: Francesco Potortì @ 2024-10-10 14:25 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dmitry, 73484, spwhitton

>> I had a quick look at the whole code and in fact the only place I can find where you have O(N^2) behaviour seems to be file name comparison, and it still looks strange to me that this can in fact cause a significant delay.
>
>We are using etags on a huge tree: about 375K files.  I think that's
>the reason, because non-linear behaviors are like that: they are
>insignificant with small sets, but huge with larger ones...
>
>Profiles don't lie...

Ok, makes sense. I must have missed the number of files in your previous explanations, sorry.

The only other place where I found O(N^2) behaviour is when managing #line directives, but you already tried to disable them without much change. So let's concentrate on file name comparison, which is done in process_file_name at

      for (fdp = fdhead; fdp != NULL; fdp = fdp->next)
        {
          assert (fdp->infname != NULL);
          if (streq (uncompressed_name, fdp->infname))
            goto cleanup;
        }

This is a simple O(N^2) comparison, which is repeated sum(1,N,N-1)=~N^2/2, which for ~375k files means ~70G comparisons. If you can count the number of times streq is called and 70G is a substantial portion of that number, then we have the culprit. To check, just remove the above test and see if the running time drops.

In that case, using a hash rather than a comparison would probably make sense.

Alternatively, rather than managing file names in a single loop, do a first loop on all file names to canonicalise them, but without searching for tags (essentially, remove the call to process_file from process_file_name), then uniquify the list of canonicalised file names, then run process_file on them.

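The ~70G figure above can be checked with a few lines of C. The sketch assumes the worst case in which all N names are distinct, so the k-th name is compared against the k-1 names already in the list; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case number of streq calls when each of N distinct names is
   compared against every name seen before it:
   0 + 1 + ... + (N - 1) = N * (N - 1) / 2.  */
static uint64_t
linear_dedup_comparisons (uint64_t n)
{
  /* Keeps n * (n - 1) within 64 bits.  */
  assert (n < (UINT64_C (1) << 32));
  return n == 0 ? 0 : n * (n - 1) / 2;
}
```

For N = 375000 this gives 70,312,312,500, which matches the "~70G comparisons" estimate in the message.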
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-10 14:25 ` Francesco Potortì
@ 2024-10-10 16:28 ` Eli Zaretskii
  2024-10-11 10:37   ` Francesco Potortì
  0 siblings, 1 reply; 57+ messages in thread
From: Eli Zaretskii @ 2024-10-10 16:28 UTC (permalink / raw)
  To: Francesco Potortì; +Cc: dmitry, 73484, spwhitton

> From: Francesco Potortì <pot@gnu.org>
> Date: Thu, 10 Oct 2024 16:25:28 +0200
> Cc: dmitry@gutov.dev,
>  73484@debbugs.gnu.org,
>  spwhitton@spwhitton.name
>
>       for (fdp = fdhead; fdp != NULL; fdp = fdp->next)
>         {
>           assert (fdp->infname != NULL);
>           if (streq (uncompressed_name, fdp->infname))
>             goto cleanup;
>         }
>
> This is a simple O(N^2) comparison, which is repeated sum(1,N,N-1)=~N^2/2, which for ~375k files means ~70G comparisons. If you can count the number of times streq is called and 70G is a substantial portion of that number, then we have the culprit. To check, just remove the above test and see if the running time drops.

Dmitry already made this check, and the run time did drop, see

  https://debbugs.gnu.org/cgi/bugreport.cgi?bug=73484#107

> In that case, using a hash rather than a comparison would probably make sense.

Right.

> Alternatively, rather than managing file names in a single loop, do a first loop on all file names to canonicalise them, but without searching for tags (essentially, remove the call to process_file from process_file_name), then uniquify the list of canonicalised file names, then run process_file on them.

I don't think this is possible because command-line options can be
interspersed with file names, and each option affects the processing
of the files whose names follow the option.

* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-10 16:28 ` Eli Zaretskii
@ 2024-10-11 10:37 ` Francesco Potortì
  0 siblings, 0 replies; 57+ messages in thread
From: Francesco Potortì @ 2024-10-11 10:37 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dmitry, 73484, spwhitton

>> From: Francesco Potortì <pot@gnu.org>
>> Date: Thu, 10 Oct 2024 16:25:28 +0200
>> Cc: dmitry@gutov.dev,
>>  73484@debbugs.gnu.org,
>>  spwhitton@spwhitton.name
>>
>>       for (fdp = fdhead; fdp != NULL; fdp = fdp->next)
>>         {
>>           assert (fdp->infname != NULL);
>>           if (streq (uncompressed_name, fdp->infname))
>>             goto cleanup;
>>         }
>>
>> This is a simple O(N^2) comparison, which is repeated sum(1,N,N-1)=~N^2/2, which for ~375k files means ~70G comparisons. If you can count the number of times streq is called and 70G is a substantial portion of that number, then we have the culprit. To check, just remove the above test and see if the running time drops.
>
>Dmitry already made this check, and the run time did drop, see
>https://debbugs.gnu.org/cgi/bugreport.cgi?bug=73484#107

Yes, sorry, I am travelling and I had missed that email.

>> In that case, using a hash rather than a comparison would probably make sense.
>
>Right.

If I recall correctly, etags depends on libc only. If that is really the case, it would be nice to create an ad hoc hash function without relying on additional libraries.

>> Alternatively, rather than managing file names in a single loop, do a first loop on all file names to canonicalise them, but without searching for tags (essentially, remove the call to process_file from process_file_name), then uniquify the list of canonicalised file names, then run process_file on them.
>
>I don't think this is possible because command-line options can be
>interspersed with file names, and each option affects the processing
>of the files whose names follow the option.

It should be possible as I have outlined above. When the command line is parsed, process_file_name is called on each file name. It canonicalises the current name, compares it with the previous file names, adds a new node containing the canonicalised name to a linked list and calls process_file on the file name. It is possible to remove the last step and instead call process_file in a second loop, but I do not know if it is convenient.

The uniquify solution would be nonparametric, if I am not wrong, while the hash solution requires choosing the size of the hash table. I guess that the hash solution is simpler and equally efficient in the great majority of cases, provided that the size of the hash table is appropriate. It would probably be reasonable to start with a 20-bit hash, and to increase that number if in some years it looks reasonable to do so.

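The libc-only, 20-bit hash idea above could be sketched as a chained hash set of file names. Everything here is an assumption for illustration: the names name_hash, name_seen_or_record and name_table are hypothetical, and the FNV-1a hash is one possible choice of ad hoc hash function, not something proposed in the thread.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { NAME_HASH_BITS = 20, NAME_HASH_SIZE = 1 << NAME_HASH_BITS };

struct name_node
{
  struct name_node *next;   /* next name in the same bucket */
  char *name;               /* private copy of the file name */
};

/* 2^20 buckets, zero-initialized because the array is static.  */
static struct name_node *name_table[NAME_HASH_SIZE];

/* 32-bit FNV-1a, xor-folded down to NAME_HASH_BITS bits.  */
static uint32_t
name_hash (const char *s)
{
  uint32_t h = 2166136261u;
  for (; *s != '\0'; s++)
    {
      h ^= (unsigned char) *s;
      h *= 16777619u;
    }
  return (h ^ (h >> NAME_HASH_BITS)) & (NAME_HASH_SIZE - 1);
}

/* Return 1 if NAME was recorded before.  Otherwise record a copy of
   it and return 0.  Return -1 if memory is exhausted.  */
static int
name_seen_or_record (const char *name)
{
  assert (name != NULL);
  uint32_t slot = name_hash (name);

  /* Only names that landed in the same bucket are compared.  */
  for (struct name_node *np = name_table[slot]; np != NULL; np = np->next)
    if (strcmp (name, np->name) == 0)
      return 1;

  size_t len = strlen (name) + 1;
  struct name_node *np = malloc (sizeof *np);
  char *copy = malloc (len);
  if (np == NULL || copy == NULL)
    {
      free (np);
      free (copy);
      return -1;
    }
  memcpy (copy, name, len);
  np->name = copy;
  np->next = name_table[slot];
  name_table[slot] = np;
  return 0;
}
```

With about a million buckets and ~375k names, most buckets hold at most one name, so each lookup costs a handful of string comparisons instead of a scan of the whole list. The table itself takes 8 MB on a system with 64-bit pointers, which is the price of the fixed size discussed in the message.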
* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-10  1:07 ` Francesco Potortì
  2024-10-10  5:41   ` Eli Zaretskii
@ 2024-10-10 10:17 ` Dmitry Gutov
  1 sibling, 0 replies; 57+ messages in thread
From: Dmitry Gutov @ 2024-10-10 10:17 UTC (permalink / raw)
  To: Francesco Potortì; +Cc: Eli Zaretskii, 73484, spwhitton

On Thu, Oct 10, 2024, at 3:07 AM, Francesco Potortì wrote:
> >Here is the nested loop, which if I comment out, makes the parse finish
> >in ~20 seconds, with all the extra files (except *.js), or in 15s when
> >using with new flags.
> >
> >diff --git a/lib-src/etags.c b/lib-src/etags.c
> >index a822a823a90..331e3ffe816 100644
> >--- a/lib-src/etags.c
> >+++ b/lib-src/etags.c
> >@@ -1697,14 +1697,14 @@ process_file_name (char *file, language *lang)
> >       uncompressed_name = file;
> >     }
> >
> >-  /* If the canonicalized uncompressed name
> >-     has already been dealt with, skip it silently. */
> >-  for (fdp = fdhead; fdp != NULL; fdp = fdp->next)
> >-    {
> >-      assert (fdp->infname != NULL);
> >-      if (streq (uncompressed_name, fdp->infname))
> >-        goto cleanup;
> >-    }
> >+  /* /\* If the canonicalized uncompressed name */
> >+  /*    has already been dealt with, skip it silently. *\/ */
> >+  /* for (fdp = fdhead; fdp != NULL; fdp = fdp->next) */
> >+  /*   { */
> >+  /*     assert (fdp->infname != NULL); */
> >+  /*     if (streq (uncompressed_name, fdp->infname)) */
> >+  /*       goto cleanup; */
> >+  /*   } */
> >
> >   inf = fopen (file, "r" FOPEN_BINARY);
> >   if (inf)
> >
> >This is basically a "uniqueness" operation using linear search, O(N^2).
>
> This is only for dealing with the case when the same file exists in both compressed and uncompressed form, and we are currently hitting the second one. In that case, we should skip it. Yes, this is a uniqueness test and yes, it is O(N^2) in the number of file names, but I doubt that this can explain a serious slowdown.

As mentioned in a previous email, I did recompile with that step removed, and the slowdown was gone. The whole scan went down to ~20 seconds.

* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-09 18:23 ` Dmitry Gutov
  2024-10-09 19:11   ` Eli Zaretskii
  2024-10-10  1:07   ` Francesco Potortì
@ 2024-10-10  1:39 ` Francesco Potortì
  2024-10-10  5:45   ` Eli Zaretskii
  2 siblings, 1 reply; 57+ messages in thread
From: Francesco Potortì @ 2024-10-10  1:39 UTC (permalink / raw)
  To: Dmitry Gutov; +Cc: Eli Zaretskii, 73484, spwhitton

I have just written:
>There are two O(N^2) tests in the number of tags in C/C++ files, which depend on the two options "no-line-directive" and "no-duplicates". Both options can be used to disable those checks, and both are off by default because the checks help produce a saner tags file and have no practical impact in most cases. Both options exist because, in principle, the checks can cause a significant slowdown with huge tags files.

However, --no-line-directive exhibits the O(N^2) behaviour in the number of tags only for languages with the "metafile" property, currently only yacc files. Unless you have a significant number of yacc files, the impact is O(N) in the number of tag candidates. And --no-duplicates only matters when creating a ctags file.

Maybe you could give it a try and check whether --no-line-directive has any impact.

* bug#73484: 31.0.50; Abolishing etags-regen-file-extensions
  2024-10-10  1:39 ` Francesco Potortì
@ 2024-10-10  5:45 ` Eli Zaretskii
  0 siblings, 0 replies; 57+ messages in thread
From: Eli Zaretskii @ 2024-10-10  5:45 UTC (permalink / raw)
  To: Francesco Potortì; +Cc: dmitry, 73484, spwhitton

> From: Francesco Potortì <pot@gnu.org>
> Date: Thu, 10 Oct 2024 03:39:47 +0200
> Cc: 73484@debbugs.gnu.org,
>  spwhitton@spwhitton.name,
>  Eli Zaretskii <eliz@gnu.org>
>
> I have just written:
> >There are two O(N^2) tests in the number of tags in C/C++ files, which depend on the two options "no-line-directive" and "no-duplicates". Both options can be used to disable those checks, and both are off by default because the checks help produce a saner tags file and have no practical impact in most cases. Both options exist because, in principle, the checks can cause a significant slowdown with huge tags files.
>
> However, --no-line-directive exhibits the O(N^2) behaviour in the number of tags only for languages with the "metafile" property, currently only yacc files. Unless you have a significant number of yacc files, the impact is O(N) in the number of tag candidates. And --no-duplicates only matters when creating a ctags file.
>
> Maybe you could give it a try and check whether --no-line-directive has any impact.

I already did that: the effect is null and void.  Which is not a
surprise, since there are only 3 Yacc files in this tree.

* Re: etags-regen-mode: handling extensionless files
  2024-09-25  6:21 ` Sean Whitton
  2024-09-25 11:41   ` Dmitry Gutov
@ 2024-09-25 12:10 ` Eli Zaretskii
  2024-09-25 21:19   ` Francesco Potortì
  1 sibling, 1 reply; 57+ messages in thread
From: Eli Zaretskii @ 2024-09-25 12:10 UTC (permalink / raw)
  To: Sean Whitton; +Cc: dgutov, emacs-devel

> From: Sean Whitton <spwhitton@spwhitton.name>
> Cc: emacs-devel@gnu.org
> Date: Wed, 25 Sep 2024 07:21:58 +0100
>
> Hello,
>
> On Mon 23 Sep 2024 at 08:00pm +03, Dmitry Gutov wrote:
>
> > On 22/09/2024 15:02, Sean Whitton wrote:
> >
> >>> But see my other email regarding etags' hashbang detection.
> >> Hashbang detection would solve my problem elegantly.
> >> Is my reading of the other thread correct that if we can fix the fortran
> >> fallback then we can enable the hashbang detection?
> >
> > Yep, I think so.
> >
> > We would probably also discuss etags' auto-detection and its list of default
> > extensions, during the next release's development.
>
> Okay, cool!  Should we have a bug to track this?

We could, but adding an option to disable the Fortran fallback is so
easy that I hope someone will just do it...

* Re: etags-regen-mode: handling extensionless files
  2024-09-25 12:10 ` etags-regen-mode: handling extensionless files Eli Zaretskii
@ 2024-09-25 21:19 ` Francesco Potortì
  2024-09-26  6:22   ` Eli Zaretskii
  0 siblings, 1 reply; 57+ messages in thread
From: Francesco Potortì @ 2024-09-25 21:19 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, dgutov, Sean Whitton

Eli Zaretskii <eliz@gnu.org>
>> > We would probably also discuss etags' auto-detection and its list of default
>> > extensions, during the next release's development.

Sean Whitton <spwhitton@spwhitton.name>
>> Okay, cool!  Should we have a bug to track this?
>
>We could, but adding an option to disable the Fortran fallback is so
>easy that I hope someone will just do it...

How about just going with the backward-incompatible change of disabling both fallbacks entirely? In my opinion the whole fallback idea was already obsolete when I worked on it in 1993.

Today, I can't imagine a situation where it can be useful, that is, where you work on Fortran or C sources without an extension.

On the other hand, if you are working on thirty-year-old sources, I argue you should use thirty-year-old tools, rather than assuming that today's tools do the right thing on them.

-- fp

* Re: etags-regen-mode: handling extensionless files
  2024-09-25 21:19 ` Francesco Potortì
@ 2024-09-26  6:22 ` Eli Zaretskii
  0 siblings, 0 replies; 57+ messages in thread
From: Eli Zaretskii @ 2024-09-26  6:22 UTC (permalink / raw)
  To: Francesco Potortì; +Cc: emacs-devel, dgutov, spwhitton

> From: Francesco Potortì <pot@gnu.org>
> Date: Wed, 25 Sep 2024 23:19:10 +0200
> Cc: emacs-devel@gnu.org,
>  dgutov@yandex.ru,
>  Sean Whitton <spwhitton@spwhitton.name>
>
> Eli Zaretskii <eliz@gnu.org>
> >We could, but adding an option to disable the Fortran fallback is so
> >easy that I hope someone will just do it...
>
> How about just going with the backward-incompatible change of disabling both fallbacks entirely? In my opinion the whole fallback idea was already obsolete when I worked on it in 1993.

Since no one complained about it, and the only real use case is when
invoking 'etags' from a Lisp program, which can easily use a
non-standard option, I don't see a compelling reason for a
backward-incompatible change in behavior.  In some future version, we
can then flip the default and make the fallback disabled by default.

> Today, I can't imagine a situation where it can be useful, that is, where you work on Fortran or C sources without an extension.

My gray hair tells me that our ability to imagine such situations is
severely limited or biased, and we have enough evidence of this bias
to not trust our imagination in these matters anymore.

> On the other hand, if you are working on thirty-year-old sources, I argue you should use thirty-year-old tools, rather than assuming that today's tools do the right thing on them.

These arguments are usually not well taken, IME.  People want new
tools because they give them more functionality and performance, but
they do NOT want incompatibilities in the package.  IOW, everyone
likes to have the cake and eat it, too.