* [BUG] Trailing dash is not included in link [9.7.3 (9.7.3-2f1844 @ /home/mwillcock/.emacs.d/elpa/org-9.7.3/)]
From: Morgan Willcock @ 2024-06-13 13:32 UTC
To: emacs-orgmode

When web links are inserted into an org buffer, if the link ends in a
trailing dash this seems to be omitted from the link target.

i.e. Inserting "https://domain/test-" into the buffer will create a
clickable link for "https://domain/test".

These types of links will likely be encountered for sites where anchor
targets are automatically generated from documentation headings which
are questions.

e.g. https://learn.microsoft.com/en-us/entra/identity/hybrid/connect/how-to-connect-sso-faq#how-can-i-roll-over-the-kerberos-decryption-key-of-the--azureadsso--computer-account-

It seems straight-forward to verify that the trailing dash of the link
is not considered part of the link:

  (with-temp-buffer
    (org-mode)
    (insert "https://domain/test-")
    (goto-char (point-min))
    (let ((context (org-element-context)))
      (cl-assert (eq (org-element-type context) 'link))
      (buffer-substring-no-properties
       (org-element-property :begin context)
       (org-element-property :end context))))

Emacs  : GNU Emacs 29.3 (build 2, x86_64-pc-linux-gnu, X toolkit, cairo version 1.16.0, Xaw3d scroll bars)
 of 2024-03-25
Package: Org mode version 9.7.3 (9.7.3-2f1844 @ /home/mwillcock/.emacs.d/elpa/org-9.7.3/)

* Re: [BUG] Trailing dash is not included in link [9.7.3 (9.7.3-2f1844 @ /home/mwillcock/.emacs.d/elpa/org-9.7.3/)]
From: Ihor Radchenko @ 2024-06-14 14:04 UTC
To: Morgan Willcock; +Cc: emacs-orgmode

Morgan Willcock <morgan@ice9.digital> writes:

> When web links are inserted into an org buffer, if the link ends in a
> trailing dash this seems to be omitted from the link target.
>
> i.e. Inserting "https://domain/test-" into the buffer will create a
> clickable link for "https://domain/test".
>
> These types of links will likely be encountered for sites where anchor
> targets are automatically generated from documentation headings which
> are questions.

This makes sense.
I improved the heuristics we use to detect plain links.
Fixed, on main.
https://git.savannah.gnu.org/cgit/emacs/org-mode.git/commit/?id=73da6beb5

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>

* Re: [BUG] Trailing dash is not included in link [9.7.3 (9.7.3-2f1844 @ /home/mwillcock/.emacs.d/elpa/org-9.7.3/)]
From: Max Nikulin @ 2024-06-16 15:43 UTC
To: emacs-orgmode

On 14/06/2024 21:04, Ihor Radchenko wrote:
> Morgan Willcock writes:
>
>> i.e. Inserting "https://domain/test-" into the buffer will create a
>> clickable link for "https://domain/test".
>>
> I improved the heuristics we use to detect plain links.
> Fixed, on main.
> https://git.savannah.gnu.org/cgit/emacs/org-mode.git/commit/?id=73da6beb5

> +++ b/etc/ORG-NEWS
[...]
> +*** Trailing =-= is now allowed in plain links

After a look into

7dcb1afb6 2021-03-24 21:27:24 +0800 Ihor Radchenko: Improve
org-link-plain-re

I suspect, it worked prior to v9.5. Without a unit test it may be
accidentally broken again.

> +: https://domain/test-

example.org, example.net, example.com are domains reserved for usage in
examples:
<https://www.iana.org/assignments/special-use-domain-names/special-use-domain-names.xhtml>

> (or (regexp "[^[:punct:] \t\n]")

I have realized that some Org regexps use [:punct:] *regexp class* and
others *syntax class*, see latex math regexp. I am in doubts if the
discrepancy is intentional.

I have noticed that the following change

09ced6d2c 2024-02-03 15:15:46 +0100 Ihor Radchenko: org-link-plain-re:
Improve regexp heuristics

that causes

    (link http://example.org/a<b)

input is exported as

    <p>
    (link <a href="http://example.org/a%3Cb)">http://example.org/a%3Cb)</a></p>

I expect that ")" should not be parsed as a part of the link. Balanced
brackets are tricky with regexps (and it is not possible to match
arbitrary nested ones).

Perhaps "[^[:punct:] \t\n]" is too strict in respect to spaces. It does
not allow the recommended workaround with zero width space:

    (org-export-string-as
     "http://example.org\N{ZERO WIDTH SPACE}[fn::footnote]"
     'html 'body)
    "<p>
    <a href=\"http://example.org[fn::footnote]\">http://example.org[fn::footnote]</a></p>
    "

Actually some kind of non-breakable space should be better in such cases:

    (org-export-string-as
     "http://example.org\N{NO-BREAK SPACE}[fn::footnote]"
     'html 'body)
    "<p>
    <a href=\"http://example.org [fn::footnote]\">http://example.org [fn::footnote]</a></p>
    "

I would consider [:space:] or \s-.

As to the original bug report, while reading it, I noticed that
thunderbird includes dash into the recognized link for

"https://domain/test-"

I decided to look into its implementation and to my surprise I found:
``punctation chars and "-" at the end are stipped off.'' I realized that
double quotes along with angle brackets are treated as a recommended way
to mark URLs in plain text. Thunderbird does not consider dash as a part
of links for e.g. http://example.org/t- It might be an attempt to
reserve possibility to assemble URLs wrapped into several lines with
added hyphenation marks, but it has not been implemented (RFC2396
appendix E warns about accidentally added hyphens).

https://www.bucksch.org/1/projects/mozilla/16507/
https://searchfox.org/mozilla-central/source/netwerk/streamconv/converters/mozTXTToHTMLConv.cpp#line-243
mozTXTToHTMLConv::FindURLEnd

Implementation is tricky, I have not noticed anything that may be reused
to improve heuristics for Org. Nowadays it is likely better to inspect
autolinking code for GitHub/GitLab or widely used python packages.

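[Editor's note] The unit test suggested in the message above could look
roughly like the following. This is only a minimal sketch built on the
snippet from the original report, not the test that exists in Org's own
test suite. The helper `my/org-plain-link-at-start' and the test name
are hypothetical, and the expectation is assumed to hold only on an Org
version that includes the fix from commit 73da6beb5.

```elisp
(require 'ert)
(require 'org)
(require 'org-element)

(defun my/org-plain-link-at-start (text)
  "Return the raw link Org parses at the start of TEXT, or nil.
Hypothetical helper for illustration; it is not part of Org."
  (with-temp-buffer
    (org-mode)
    (insert text)
    (goto-char (point-min))
    (let ((context (org-element-context)))
      ;; Only report a result when the element at point is a link.
      (when (eq (org-element-type context) 'link)
        (org-element-property :raw-link context)))))

(ert-deftest my/org-plain-link-trailing-dash ()
  "A trailing dash should stay part of a plain link.
Assumes an Org version that contains the fix discussed in the thread."
  (should (equal "http://example.org/dash-"
                 (my/org-plain-link-at-start "http://example.org/dash-"))))
```

Running `M-x ert RET my/org-plain-link-trailing-dash RET' on Org 9.7.3
should fail in the way the report describes, since the parsed link
there stops before the dash.
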
* Re: [BUG] Trailing dash is not included in link [9.7.3 (9.7.3-2f1844 @ /home/mwillcock/.emacs.d/elpa/org-9.7.3/)]
From: Ihor Radchenko @ 2024-06-16 15:59 UTC
To: Max Nikulin; +Cc: emacs-orgmode

Max Nikulin <manikulin@gmail.com> writes:

>> +*** Trailing =-= is now allowed in plain links
>
> After a look into
>
> 7dcb1afb6 2021-03-24 21:27:24 +0800 Ihor Radchenko: Improve
> org-link-plain-re
>
> I suspect, it worked prior to v9.5. Without a unit test it may be
> accidentally broken again.

No, it did not work.
If you can, please do not make such assertions without testing.

>> +: https://domain/test-
>
> example.org, example.net, example.com are domains reserved for usage in
> examples:
> <https://www.iana.org/assignments/special-use-domain-names/special-use-domain-names.xhtml>

And so?

>> (or (regexp "[^[:punct:] \t\n]")
>
> I have realized that some Org regexps use [:punct:] *regexp class* and
> others *syntax class*, see latex math regexp. I am in doubts if the
> discrepancy is intentional.

It is not intentional, but using syntax classes can sometimes be
fragile.

> I have noticed that the following change
>
> 09ced6d2c 2024-02-03 15:15:46 +0100 Ihor Radchenko: org-link-plain-re:
> Improve regexp heuristics
>
> that causes
>
>     (link http://example.org/a<b)
>
> input is exported as
>
>     <p>
>     (link <a href="http://example.org/a%3Cb)">http://example.org/a%3Cb)</a></p>
>
> I expect that ")" should not be parsed as a part of the link. Balanced
> brackets are tricky with regexps (and it is not possible to match
> arbitrary nested ones).

It is heuristics. We cannot be 100% right. So, it is what it is.

> Perhaps "[^[:punct:] \t\n]" is too strict in respect to spaces. It does
> not allow the recommended workaround with zero width space:

You don't need zero width space for links. Just use <bracket link>.

> As to the original bug report, while reading it, I noticed that
> thunderbird includes dash into the recognized link for
>
> "https://domain/test-"
>
> I decided to look into its implementation and to my surprise I found:
> ``punctation chars and "-" at the end are stipped off.'' I realized that
> double quotes along with angle brackets are treated as a recommended way
> to mark URLs in plain text. Thunderbird does not consider dash as a part
> of links for e.g. http://example.org/t- It might be an attempt to
> reserve possibility to assemble URLs wrapped into several lines with
> added hyphenation marks, but it has not been implemented (RFC2396
> appendix E warns about accidentally added hyphens).
>
> https://www.bucksch.org/1/projects/mozilla/16507/
> https://searchfox.org/mozilla-central/source/netwerk/streamconv/converters/mozTXTToHTMLConv.cpp#line-243
> mozTXTToHTMLConv::FindURLEnd
>
> Implementation is tricky, I have not noticed anything that may be reused
> to improve heuristics for Org. Nowadays it is likely better to inspect
> autolinking code for GitHub/GitLab or widely used python packages.

If you have concrete proposals, please share them.

> I would consider [:space:] or \s-.

Do you mean "[^[:punct:][:space:]\t\n]"?

* Re: [BUG] Trailing dash is not included in link [9.7.3 (9.7.3-2f1844 @ /home/mwillcock/.emacs.d/elpa/org-9.7.3/)]
From: Max Nikulin @ 2024-06-20 12:15 UTC
To: emacs-orgmode

On 16/06/2024 22:59, Ihor Radchenko wrote:
> Max Nikulin writes:
>>
>> I suspect, it worked prior to v9.5. Without a unit test it may be
>> accidentally broken again.
>
> No, it did not work.
> If you can, please do not make such assertions without testing.

I am sorry, I had no intention to offend you. I missed that the removed
line with explicit list of punctuation characters was commented out. I
have tried the regexp used before (a part of v6.34)

facedba05 2009-12-09 15:13:50 +0100 Carsten Dominik: Use John Gruber's
regular expression for URL's

and it seems trailing dash was allowed.

>>> +: https://domain/test-
>>
>> example.org, example.net, example.com are domains reserved for usage in
>> examples:
>> <https://www.iana.org/assignments/special-use-domain-names/special-use-domain-names.xhtml>
>
> And so?

http://example.org/dash- may be a bit better for docs. (For IPv6
addresses the difference should be more noticeable, but I do not
remember what range is reserved for usage in examples there.)

>> I have realized that some Org regexps use [:punct:] *regexp class* and
>> others *syntax class*, see latex math regexp. I am in doubts if the
>> discrepancy is intentional.
>
> It is not intentional, but using syntax classes can sometimes be
> fragile.

Do you mean that result depends on current buffer? I do not have strong
opinion what variant should be used. What I do not like is that in the
case of $n$-th the character after second "$" is tested against syntax
class, while regexp class is used for links. This subtle difference is
almost certainly ignored in alternative implementations of the parser.
However I am not sure what characters besides dash and apostrophe are
affected and whether it depends on locale.

>> 09ced6d2c 2024-02-03 15:15:46 +0100 Ihor Radchenko: org-link-plain-re:
>> Improve regexp heuristics
[...]
>> (link http://example.org/a<b)
[...]
> It is heuristics. We cannot be 100% right. So, it is what it is.

From my point of view it is at least close to a regression. I do not
have any argument against http://example.org/a<b>, but the regexp should
not match whole "http://example.org/a<b)"

[...]
>> Nowadays it is likely better to inspect
>> autolinking code for GitHub/GitLab or widely used python packages.
>
> If you have concrete proposals, please share them.

Not yet. I consider inspecting mozilla's code as a kind of negative
result from the point of view of usefulness for Org. Expanding test
suite by gathering examples of failed heuristics from bug reports
require enough reports.
https://wpt.live/url/resources/urltestdata.json
(https://github.com/web-platform-tests/wpt) is too specific for browsers
and HTML/JS.

>> I would consider [:space:] or \s-.
>
> Do you mean "[^[:punct:][:space:]\t\n]"?

I believe it might be an improvement ([:space:] includes \t).

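[Editor's note] The two character classes compared at the end of the
message above can be probed directly. The sketch below is illustrative
only: the names `my/last-char-original', `my/last-char-proposed' and
`my/classify-char' are hypothetical, the regexps are the fragments
quoted in the thread rather than the full `org-link-plain-re', and the
standard syntax table is assumed so that the result does not depend on
the current buffer.

```elisp
(defconst my/last-char-original "[^[:punct:] \t\n]"
  "Character class quoted in the thread (hypothetical constant name).")

(defconst my/last-char-proposed "[^[:punct:][:space:]\t\n]"
  "Variant with [:space:] discussed in the thread (hypothetical name).")

(defun my/classify-char (char)
  "Return a plist telling whether CHAR matches each character class.
Matching is done under the standard syntax table."
  (with-syntax-table (standard-syntax-table)
    (let ((s (char-to-string char)))
      (list :original (and (string-match-p my/last-char-original s) t)
            :proposed (and (string-match-p my/last-char-proposed s) t)))))

;; ASCII characters are treated identically by both classes:
(my/classify-char ?a)   ; => (:original t :proposed t)
(my/classify-char ?\t)  ; => (:original nil :proposed nil)
(my/classify-char ?-)   ; => (:original nil :proposed nil)

;; The characters from the export examples are left for the reader to
;; evaluate, since the result depends on their syntax in Emacs:
(my/classify-char ?\N{ZERO WIDTH SPACE})
(my/classify-char ?\N{NO-BREAK SPACE})
```

Any difference between the two classes can therefore only show up for
characters outside the ASCII space, tab and newline already listed
explicitly.
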
* Re: [BUG] Trailing dash is not included in link [9.7.3 (9.7.3-2f1844 @ /home/mwillcock/.emacs.d/elpa/org-9.7.3/)]
From: Ihor Radchenko @ 2024-06-22 13:41 UTC
To: Max Nikulin; +Cc: emacs-orgmode

Max Nikulin <manikulin@gmail.com> writes:

>> If you can, please do not make such assertions without testing.
>
> I am sorry, I had no intention to offend you. I missed that the removed
> line with explicit list of punctuation characters was commented out. I
> have tried the regexp used before (a part of v6.34)
>
> facedba05 2009-12-09 15:13:50 +0100 Carsten Dominik: Use John
> Gruber's regular expression for URL's
>
> and it seems trailing dash was allowed.

Hmm. That's a really long time ago, earlier than built-in Org in Emacs
versions that are available in various distros. My reading of "prior to
v9.5" was more like "not too far before v9.5" (and I tested everything
down to Org mode included into Emacs 26).

>>>> +: https://domain/test-
>>>
>>> example.org, example.net, example.com are domains reserved for usage in
>>> examples:
>>> <https://www.iana.org/assignments/special-use-domain-names/special-use-domain-names.xhtml>
>>
>> And so?
>
> http://example.org/dash- may be a bit better for docs. (For IPv6
> addresses the difference should be more noticeable, but I do not
> remember what range is reserved for usage in examples there.)

I see. I would not mind installing a patch, if you submit it.

>>> I have realized that some Org regexps use [:punct:] *regexp class* and
>>> others *syntax class*, see latex math regexp. I am in doubts if the
>>> discrepancy is intentional.
>>
>> It is not intentional, but using syntax classes can sometimes be
>> fragile.
>
> Do you mean that result depends on current buffer? I do not have strong
> opinion what variant should be used.

Not current buffer. Current syntax table, inherited from outline-mode.
And that syntax table is customized by some users, leading to Org parser
behaving unexpectedly in some scenarios. Also, there is 'syntax-table
text property, and I have managed to break Org parser in the past by
trying to apply 'syntax-table property to code blocks in Org mode (I was
trying to solve `forward-sexp' bug people frequently report).

So, we should generally avoid using syntax tables, so that Org syntax
becomes independent of user customizations in that area. Or, at least,
we should not introduce more syntax class uses when possible.

> ... What I do not like is that in the
> case of $n$-th the character after second "$" is tested against syntax
> class, while regexp class is used for links. This subtle difference is
> almost certainly ignored in alternative implementations of the parser.
> However I am not sure what characters besides dash and apostrophe are
> affected and whether it depends on locale.

These kinds of inconsistencies should be solved eventually.
We should not use locale, but UTF syntax classes; and document it in
org-syntax document.

>>> 09ced6d2c 2024-02-03 15:15:46 +0100 Ihor Radchenko: org-link-plain-re:
>>> Improve regexp heuristics
> [...]
>>> (link http://example.org/a<b)
> [...]
>> It is heuristics. We cannot be 100% right. So, it is what it is.
>
> From my point of view it is at least close to a regression. I do not
> have any argument against http://example.org/a<b>, but the regexp should
> not match whole "http://example.org/a<b)"

No bug reports, so your point is rather theoretical.
I do not mind improving the regexp, of course, but I am afraid that we
will need PEG or `org-element--parse-paired-brackets' to match paired
brackets accurately. And that kind of change will be breaking - we will
need to trash the regexp variable.

>>> I would consider [:space:] or \s-.
>>
>> Do you mean "[^[:punct:][:space:]\t\n]"?
>
> I believe it might be an improvement ([:space:] includes \t).

https://git.savannah.gnu.org/cgit/emacs/org-mode.git/commit/?id=6cada29c0

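[Editor's note] The point about syntax classes depending on the current
syntax table can be illustrated in isolation. This is a minimal sketch
under stated assumptions: the helper `my/dash-matches-p' and both
syntax tables are made up for the demonstration and do not reproduce
Org's or outline-mode's actual tables.

```elisp
(defun my/dash-matches-p (regexp table)
  "Return t if REGEXP matches \"-\" while TABLE is the syntax table.
Hypothetical helper for illustration only."
  (with-syntax-table table
    (and (string-match-p regexp "-") t)))

(let ((dash-punct (make-syntax-table))
      (dash-word (make-syntax-table)))
  ;; Two customizations a user might plausibly make for the dash.
  (modify-syntax-entry ?- "." dash-punct) ; dash as punctuation
  (modify-syntax-entry ?- "w" dash-word)  ; dash as word constituent
  (list
   ;; Syntax class \s. follows whatever the active table says.
   (my/dash-matches-p "\\s." dash-punct)
   (my/dash-matches-p "\\s." dash-word)
   ;; The [:punct:] regexp class treats ASCII "-" the same either way.
   (my/dash-matches-p "[[:punct:]]" dash-punct)
   (my/dash-matches-p "[[:punct:]]" dash-word)))
;; => (t nil t t)
```

For ASCII input the regexp class gives the same answer under both
tables, which is the property the message above asks for: parsing that
does not change when a user adjusts the syntax table.
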