* Another issue with thingatpt
From: Werner LEMBERG @ 2006-12-27 10:50 UTC

Here's another problematic URL:

  http://mousai.kanji.zinbun.kyoto-u.ac.jp/ids-find?components=&U+20207;

thingatpt ignores the final `;'.

    Werner

* Another issue with thingatpt
From: Bob Rogers @ 2006-12-27 20:29 UTC
Cc: emacs-devel

   From: Werner LEMBERG <wl@gnu.org>
   Date: Wed, 27 Dec 2006 11:50:42 +0100 (CET)

   Here's another problematic URL:

     http://mousai.kanji.zinbun.kyoto-u.ac.jp/ids-find?components=&U+20207;

   thingatpt ignores the final `;'.

       Werner

According to RFC3986 (aka STD066), this is wrong; ";" is legitimate
anywhere in a path or query part, including the end.  So are "." and
",", but thing-at-point-url-path-regexp also refuses to match these
characters at the end of the string.  Doing (ffap-string-at-point 'url)
drops these characters plus ":", "!", and (questionably) "?".

   It may not be possible to find a tradeoff between RFC compliance and
parsing dwimmery that would satisfy everybody.  Since stripping off
trailing punctuation is useful behavior (ISTR it's worked this way for a
while now), I would recommend against changing it now.  However, a case
could be made for making thing-at-point and ffap-string-at-point
consistent.  Perhaps "!:;.," would be best?  This is just the union of
the two sets but without the dubious inclusion of "?".

					-- Bob Rogers
					   http://rgrjr.dyndns.org/

* Re: Another issue with thingatpt
From: Werner LEMBERG @ 2006-12-28 6:39 UTC
Cc: emacs-devel

> Here's another problematic URL:
>
>   http://mousai.kanji.zinbun.kyoto-u.ac.jp/ids-find?components=&U+20207;
>
> thingatpt ignores the final `;'.
>
>     Werner
>
> According to RFC3986 (aka STD066), this is wrong; [...]
>
> It may not be possible to find a tradeoff between RFC compliance
> and parsing dwimmery that would satisfy everybody.  Since stripping
> off trailing punctuation is useful behavior (ISTR it's worked this
> way for a while now), I would recommend against changing it now.
> However, a case could be made for making thing-at-point and
> ffap-string-at-point consistent.  Perhaps "!:;.," would be best?
> This is just the union of the two sets but without the dubious
> inclusion of "?".

I suggest handling `;' and friends as part of the URL if the URL is on
a line of its own, that is, if the line starts with `http://' or a
sibling of it, and the URL is followed only by whitespace and a newline.

    Werner

* Re: Another issue with thingatpt
From: Piet van Oostrum @ 2006-12-29 21:23 UTC

>>>>> Bob Rogers <rogers-emacs@rgrjr.dyndns.org> (BR) wrote:

>BR>    From: Werner LEMBERG <wl@gnu.org>
>BR>    Date: Wed, 27 Dec 2006 11:50:42 +0100 (CET)

>BR>    Here's another problematic URL:

>BR>      http://mousai.kanji.zinbun.kyoto-u.ac.jp/ids-find?components=&U+20207;

>BR>    thingatpt ignores the final `;'.

>BR>        Werner

>BR> According to RFC3986 (aka STD066), this is wrong; ";" is legitimate
>BR> anywhere in a path or query part, including the end.  So are "." and
>BR> ",", but thing-at-point-url-path-regexp also refuses to match these
>BR> characters at the end of the string.  Doing (ffap-string-at-point 'url)
>BR> drops these characters plus ":", "!", and (questionably) "?".

>BR>    It may not be possible to find a tradeoff between RFC compliance and
>BR> parsing dwimmery that would satisfy everybody.  Since stripping off
>BR> trailing punctuation is useful behavior (ISTR it's worked this way for a
>BR> while now), I would recommend against changing it now.  However, a case
>BR> could be made for making thing-at-point and ffap-string-at-point
>BR> consistent.  Perhaps "!:;.," would be best?  This is just the union of
>BR> the two sets but without the dubious inclusion of "?".

The way to reconcile these would be to customize it, I think. For example
have a string variable that contains the punctuation characters to be
included at the end. Or a regexp.

By the way, thing-at-point-url-path-regexp also disallows : inside a url.
These would be necessary to accept IPv6 IP addresses.
-- 
Piet van Oostrum <piet@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
Private email: piet@vanoostrum.org
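
Piet's idea of a string variable holding the punctuation set could be
sketched roughly as follows.  This is only an illustration: the names
my-url-trailing-punctuation and my-url-strip-trailing-punctuation are
hypothetical and exist in neither thingatpt.el nor ffap.el, and the
default value is simply the "!:;.," set proposed earlier in the thread.

```elisp
;; Hypothetical names; nothing here is part of thingatpt.el or ffap.el.
(defvar my-url-trailing-punctuation "!:;.,"
  "Characters to drop when they appear at the end of a URL candidate.")

(defun my-url-strip-trailing-punctuation (url)
  "Return URL without trailing characters from `my-url-trailing-punctuation'."
  (let ((drop (string-to-list my-url-trailing-punctuation))
        (end (length url)))
    ;; Walk backwards while the last remaining character is in the set.
    (while (and (> end 0)
                (memq (aref url (1- end)) drop))
      (setq end (1- end)))
    (substring url 0 end)))

;; With the default set the final ";" is dropped:
(my-url-strip-trailing-punctuation "http://example.org/find?x=1;")

;; A user who wants to keep ";" can rebind the variable:
(let ((my-url-trailing-punctuation ".,"))
  (my-url-strip-trailing-punctuation "http://example.org/find?x=1;"))
```

Because the variable is declared with defvar, the let binding in the
second call is dynamic and is seen by the function, so the same helper
serves both the "strip punctuation" and the "keep `;'" camps.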

* Re: Another issue with thingatpt
From: Bob Rogers @ 2006-12-31 3:08 UTC
Cc: emacs-devel

   From: Piet van Oostrum <piet@cs.uu.nl>
   Date: Fri, 29 Dec 2006 22:23:55 +0100

   >>>>> Bob Rogers <rogers-emacs@rgrjr.dyndns.org> (BR) wrote:

   >BR>    From: Werner LEMBERG <wl@gnu.org>
   >BR>    Date: Wed, 27 Dec 2006 11:50:42 +0100 (CET)

   >BR>    . . .

   >BR>    thingatpt ignores the final `;'.

   >BR>        Werner

   >BR> According to RFC3986 (aka STD066), this is wrong; ";" is legitimate
   >BR> anywhere in a path or query part, including the end.  So are "." and
   >BR> ",", but thing-at-point-url-path-regexp also refuses to match these
   >BR> characters at the end of the string.  Doing (ffap-string-at-point 'url)
   >BR> drops these characters plus ":", "!", and (questionably) "?".

   >BR>    It may not be possible to find a tradeoff between RFC compliance and
   >BR> parsing dwimmery that would satisfy everybody.  Since stripping off
   >BR> trailing punctuation is useful behavior (ISTR it's worked this way for a
   >BR> while now), I would recommend against changing it now.  However, a case
   >BR> could be made for making thing-at-point and ffap-string-at-point
   >BR> consistent.  Perhaps "!:;.," would be best?  This is just the union of
   >BR> the two sets but without the dubious inclusion of "?".

   The way to reconcile these would be to customize it, I think. For example
   have a string variable that contains the punctuation characters to be
   included at the end. Or a regexp.

Both interfaces (ffap and thing-at-point) are already customizable,
though in different ways.  ffap-string-at-point uses
ffap-string-at-point-mode-alist, which maps a thing type symbol or mode
name symbol to a list of three character sets; the last string in each
alist entry is the set of characters to exclude at the end.  On the
other hand, thing-at-point uses pure regexps, but they are constructed
from each other, which makes thing-at-point harder to customize.

   Note that neither of these implementations is really mode-sensitive,
AFAICS; ffap-string-at-point-mode-alist is poorly named.  If editing
something XML-like, for example, you would want the attribute in
<tag attr='http://...'> to be parsed without dropping ANY characters at
the end -- and any embedded '&apos;' to be translated to a literal
apostrophe.  But even if this is TRT, it is clearly too risky to attempt
now.

   But is there any objection to unifying these two implementations
after the release?  And if so, which is the better implementation?  I
believe the difference is only historical; ffap.el is much older than
thingatpt.el (IIRC).

   By the way, thing-at-point-url-path-regexp also disallows : inside a url.
   These would be necessary to accept IPv6 IP addresses.

It works for me (though in an emacs built two weeks ago):

    (string-match thing-at-point-url-path-regexp "http://::1/foo/bar.html")
    => 0
    (string-match thing-at-point-url-regexp "http://::1/foo/bar.html")
    => 0

Do you have an example of failure?

					-- Bob
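
For anyone who wants to check what ffap currently strips, the `url'
entry of ffap-string-at-point-mode-alist can be inspected directly.
The sketch below assumes the entry layout described in this message (a
key followed by three strings, the last being the characters removed
from the end).  The helper name my-ffap-trailing-chars is hypothetical,
and the strings it returns depend on the Emacs version, so no expected
value is shown.

```elisp
(require 'ffap)

(defun my-ffap-trailing-chars (key)
  "Return the trailing-character set for KEY, or nil if KEY has no entry.
KEY is a thing or mode symbol such as `url', looked up in
`ffap-string-at-point-mode-alist'."
  ;; Each entry has the shape (KEY CHARS BEG END); END is element 3.
  (nth 3 (assq key ffap-string-at-point-mode-alist)))

;; Characters ffap removes from the end of a URL candidate:
(my-ffap-trailing-chars 'url)
```

Changing what gets stripped then amounts to replacing that one string
in the alist entry, which is the ffap counterpart of editing the second
character class in thing-at-point-url-path-regexp.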

* Re: Another issue with thingatpt
From: Andreas Roehler @ 2006-12-31 9:25 UTC
Cc: emacs-devel

> Both interfaces (ffap and thing-at-point) are already customizable,
> though in different ways.

There is no `defcustom'-form in thingatpt.el,
it's done mostly with `defvar'. Wouldn't conceive that
as customizable.

The problem mentioned originally however shouldn't occur, as

,----
| (defvar thing-at-point-url-path-regexp
|   "[^]\t\n \"'()<>[^`{}]*[^]\t\n \"'()<>[^`{}.,;]+"
|   "A regular expression probably matching the host and filename or e-mail part of a URL.")
`----

includes that char. The error must reside elsewhere.

Regards,

Andreas Roehler

* Re: Another issue with thingatpt
From: Bob Rogers @ 2006-12-31 17:24 UTC
Cc: emacs-devel

   From: Andreas Roehler <andreas.roehler@easy-emacs.de>
   Date: Sun, 31 Dec 2006 10:25:35 +0100

      > Both interfaces (ffap and thing-at-point) are already customizable,
      > though in different ways.

   There is no `defcustom'-form in thingatpt.el,
   it's done mostly with `defvar'. Wouldn't conceive that
   as customizable.

Not in the sense of defcustom, no.  But someone who can't "customize" it
themselves via setq is probably not going to be able to change these
hairy regexps and/or char-classes without shooting themselves in the
foot.  It's not just a matter of understanding Emacs regexps, but
understanding how thing-at-point uses them.

   In any case, it seems to me that users shouldn't need to change the
regexp proper, since that is defined by RFC3986, just the set of
punctuation characters to drop at the end.  The only thing that needs to
be customized is just the "lose the punctuation" heuristic, IMHO.  And
the definition of "punctuation" should be enlarged so that it addresses
Slawomir's issue with parens, which are not even allowed internally.

   The problem mentioned originally however shouldn't occur, as

   ,----
   | (defvar thing-at-point-url-path-regexp
   |   "[^]\t\n \"'()<>[^`{}]*[^]\t\n \"'()<>[^`{}.,;]+"
   |   "A regular expression probably matching the host and filename or e-mail part of a URL.")
   `----

   includes that char. The error must reside elsewhere.

   Regards,

   Andreas Roehler

It does include a ";" in the second character class, but both are
inverted.  The second set is the same as the first set with the addition
of ".,;", which is why it refuses to match any of these characters at
the end of the URL.  This would be easier to see if the regexp were
written this way:

    (defvar thing-at-point-url-path-regexp
      (concat "[^]\t\n \"'()<>[^`{}]*"
              "[^]\t\n \"'()<>[^`{}.,;]+")
      "A regular expression probably matching the host and filename or e-mail part of a URL.")

					-- Bob
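
The effect of the two inverted classes can be checked without touching
thingatpt.el by matching a copy of the regexp against Werner's URL.
This is a small sketch only: my-url-path-regexp and my-url-path-match
are hypothetical names, and the regexp is the one quoted in this
thread, which may differ from the value in any particular Emacs build.

```elisp
;; Local copy of the regexp quoted in this thread (hypothetical name).
(defconst my-url-path-regexp
  (concat "[^]\t\n \"'()<>[^`{}]*"        ; any run of allowed characters
          "[^]\t\n \"'()<>[^`{}.,;]+")    ; must end on a char other than . , ;
  "Copy of the `thing-at-point-url-path-regexp' value discussed here.")

(defun my-url-path-match (string)
  "Return the part of STRING matched by `my-url-path-regexp', or nil."
  (when (string-match my-url-path-regexp string)
    (match-string 0 string)))

;; The first class happily consumes ";", but the match must finish in
;; the second class, so the regexp backs off and leaves the final ";"
;; outside the match.
(my-url-path-match
 "http://mousai.kanji.zinbun.kyoto-u.ac.jp/ids-find?components=&U+20207;")
;; => "http://mousai.kanji.zinbun.kyoto-u.ac.jp/ids-find?components=&U+20207"
```

An interior ";" is unaffected, since it is matched by the first class;
only the last character of the match is constrained.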

* Re: Another issue with thingatpt
From: Andreas Roehler @ 2007-01-02 13:34 UTC
Cc: emacs-devel

Bob Rogers wrote:
>    From: Andreas Roehler <andreas.roehler@easy-emacs.de>
>    Date: Sun, 31 Dec 2006 10:25:35 +0100
>
>       > Both interfaces (ffap and thing-at-point) are already customizable,
>       > though in different ways.
>
>    There is no `defcustom'-form in thingatpt.el,
>    it's done mostly with `defvar'. Wouldn't conceive that
>    as customizable.
>
> Not in the sense of defcustom, no.  But someone who can't "customize" it
> themselves via setq is probably not going to be able to change these
> hairy regexps and/or char-classes without shooting themselves in the
> foot.  It's not just a matter of understanding Emacs regexps, but
> understanding how thing-at-point uses them.

You are probably right.

>
>    In any case, it seems to me that users shouldn't need to change the
> regexp proper, since that is defined by RFC3986, just the set of
> punctuation characters to drop at the end.

Maybe I am missing something, but AFAICS the regexp in question is not
derived from the RFC in a strict sense.  Here is the description from
the RFC:

;;;;;;;;;;;;;;

reserved    = gen-delims / sub-delims

gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="

...

Characters that are allowed in a URI but do not have a reserved
purpose are called unreserved.  These include uppercase and lowercase
letters, decimal digits, hyphen, period, underscore, and tilde.

unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

;;;;;;;;;;;;;;;

That's basically what I could find there on the matter.

> The only thing that needs to
> be customized is just the "lose the punctuation" heuristic, IMHO.  And
> the definition of "punctuation" should be enlarged so that it addresses
> Slawomir's issue with parens, which are not even allowed internally.
>
>    The problem mentioned originally however shouldn't occur, as
>
>    ,----
>    | (defvar thing-at-point-url-path-regexp
>    |   "[^]\t\n \"'()<>[^`{}]*[^]\t\n \"'()<>[^`{}.,;]+"
>    |   "A regular expression probably matching the host and filename or
>    e-mail part of a URL.")
>    `----
>
>    includes that char. The error must reside elsewhere.
>
>    Regards,
>
>    Andreas Roehler
>
> It does include a ";" in the second character class, but both are
> inverted.  The second set is the same as the first set with the addition
> of ".,;", which is why it refuses to match any of these characters at
> the end of the URL.  This would be easier to see if the regexp were
> written this way:
>
>     (defvar thing-at-point-url-path-regexp
>       (concat "[^]\t\n \"'()<>[^`{}]*"
>               "[^]\t\n \"'()<>[^`{}.,;]+")
>       "A regular expression probably matching the host and filename or e-mail part of a URL.")
>
> 					-- Bob

Now I see it, thanks a lot.

BTW: What about dropping the `;' from the regexp?  Maybe together with
the comma, as that character is mentioned as a sub-delimiter too.

Other problems:

- Char ' (39, #o47, #x27) now seems to be excluded, whereas the RFC
  mentions it as a sub-delimiter too.

- (defvar thing-at-point-short-url-regexp
    (concat "[-A-Za-z0-9.]+" thing-at-point-url-path-regexp)

  misses the underscore in its bracket expression (unreserved according
  to the RFC).

Andreas

* Re: Another issue with thingatpt
From: Andreas Roehler @ 2007-01-03 14:50 UTC
Cc: emacs-devel

Employing the forms below together with the standard thingatpt
functions, I see several advantages:

- url-at-point no longer needs a complicated corpus of regexps; the
  whole special handling of URLs can be dropped

- it specifies no more than RFC3986 does, which may help to avoid
  trouble in the future if new URL schemes are invented

- it is easy to read and understand from the user's side, and easier
  to extend and maintain

(put 'url 'beginning-op
     (lambda ()
       (skip-chars-backward ":/?#[]@!$&'()*+,;=[:alnum:]-._~")))

(put 'url 'end-op
     (lambda ()
       (skip-chars-forward ":/?#[]@!$&'()*+,;=[:alnum:]-._~")))

I have already implemented this in thingatpt-utils.el, which is
available at gnu.emacs.sources.

__
Andreas Roehler
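
To get a feel for what the character set in these two forms picks up,
the same skip-chars calls can be tried on a temporary buffer.  This is
a sketch under stated assumptions: my-uri-chars and my-uri-chars-around
are hypothetical names, the functions are called directly rather than
through thing-at-point (whose handling of `url' may differ between
Emacs versions), and the character set is copied unchanged from the
message above.

```elisp
;; Character set copied from the message above (hypothetical name).
(defconst my-uri-chars ":/?#[]@!$&'()*+,;=[:alnum:]-._~"
  "RFC 3986 reserved and unreserved characters, in skip-chars syntax.")

(defun my-uri-chars-around (text pos)
  "Return the run of `my-uri-chars' in TEXT around buffer position POS.
POS is 1-based, as buffer positions are."
  (with-temp-buffer
    (insert text)
    (goto-char pos)
    (let ((beg (progn (skip-chars-backward my-uri-chars) (point))))
      (goto-char pos)
      (skip-chars-forward my-uri-chars)
      (buffer-substring-no-properties beg (point)))))

;; The trailing ";" is now kept, as Werner wanted:
(my-uri-chars-around "See http://example.org/a?b=c; for details." 10)
;; => "http://example.org/a?b=c;"

;; But surrounding parens and a sentence-final period are kept as well:
(my-uri-chars-around "(http://example.org/)." 5)
;; => "(http://example.org/)."
```

The second call shows the other side of the tradeoff discussed earlier
in the thread: a purely RFC-driven character set gives up the trailing
punctuation heuristic entirely.  Note also that the set has no "%", so
percent-encoded octets, which RFC 3986 allows, would end the run.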

* Re: Another issue with thingatpt
From: Piet van Oostrum @ 2006-12-31 20:07 UTC

>>>>> Bob Rogers <rogers-emacs@rgrjr.dyndns.org> (BR) wrote:

>BR>    From: Piet van Oostrum <piet@cs.uu.nl>
[...]
>BR>    By the way, thing-at-point-url-path-regexp also disallows : inside
>BR>    a url.  These would be necessary to accept IPv6 IP addresses.

>BR> It works for me (though in an emacs built two weeks ago):

>BR>     (string-match thing-at-point-url-path-regexp "http://::1/foo/bar.html")
>BR>     => 0
>BR>     (string-match thing-at-point-url-regexp "http://::1/foo/bar.html")
>BR>     => 0

>BR> Do you have an example of failure?

Sorry, I was wrong.
-- 
Piet van Oostrum <piet@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
Private email: piet@vanoostrum.org