unofficial mirror of emacs-devel@gnu.org 
* Another issue with thingatpt
       [not found]   ` <htx7iwdn717.fsf@urania.kanji.zinbun.kyoto-u.ac.jp>
@ 2006-12-27 10:50     ` Werner LEMBERG
  2006-12-27 20:29       ` Bob Rogers
  0 siblings, 1 reply; 10+ messages in thread
From: Werner LEMBERG @ 2006-12-27 10:50 UTC



Here's another problematic URL:

  http://mousai.kanji.zinbun.kyoto-u.ac.jp/ids-find?components=&U+20207;

thingatpt ignores the final `;'.


    Werner


* Another issue with thingatpt
  2006-12-27 10:50     ` Another issue with thingatpt Werner LEMBERG
@ 2006-12-27 20:29       ` Bob Rogers
  2006-12-28  6:39         ` Werner LEMBERG
  2006-12-29 21:23         ` Piet van Oostrum
  0 siblings, 2 replies; 10+ messages in thread
From: Bob Rogers @ 2006-12-27 20:29 UTC
  Cc: emacs-devel

   From: Werner LEMBERG <wl@gnu.org>
   Date: Wed, 27 Dec 2006 11:50:42 +0100 (CET)

   Here's another problematic URL:

     http://mousai.kanji.zinbun.kyoto-u.ac.jp/ids-find?components=&U+20207;

   thingatpt ignores the final `;'.

       Werner

According to RFC3986 (aka STD066), this is wrong; ";" is legitimate
anywhere in a path or query part, including the end.  So are "." and
",", but thing-at-point-url-path-regexp also refuses to match these
characters at the end of the string.  Doing (ffap-string-at-point 'url)
drops these characters plus ":", "!", and (questionably) "?".

   It may not be possible to find a tradeoff between RFC compliance and
parsing dwimmery that would satisfy everybody.  Since stripping off
trailing punctuation is useful behavior (ISTR it's worked this way for a
while now), I would recommend against changing it now.  However, a case
could be made for making thing-at-point and ffap-string-at-point
consistent.  Perhaps "!:;.," would be best?  This is just the union of
the two sets but without the dubious inclusion of "?".

					-- Bob Rogers
					   http://rgrjr.dyndns.org/


* Re: Another issue with thingatpt
  2006-12-27 20:29       ` Bob Rogers
@ 2006-12-28  6:39         ` Werner LEMBERG
  2006-12-29 21:23         ` Piet van Oostrum
  1 sibling, 0 replies; 10+ messages in thread
From: Werner LEMBERG @ 2006-12-28  6:39 UTC
  Cc: emacs-devel


>    Here's another problematic URL:
> 
>      http://mousai.kanji.zinbun.kyoto-u.ac.jp/ids-find?components=&U+20207;
> 
>    thingatpt ignores the final `;'.
> 
>        Werner
> 
> According to RFC3986 (aka STD066), this is wrong; [...]
>
>    It may not be possible to find a tradeoff between RFC compliance
> and parsing dwimmery that would satisfy everybody.  Since stripping
> off trailing punctuation is useful behavior (ISTR it's worked this
> way for a while now), I would recommend against changing it now.
> However, a case could be made for making thing-at-point and
> ffap-string-at-point consistent.  Perhaps "!:;.," would be best?
> This is just the union of the two sets but without the dubious
> inclusion of "?".

I suggest handling `;' and friends as part of the URL if the URL is on
a line of its own, that is, if the line starts with `http://' or one of
its siblings and the URL is followed only by whitespace and a newline.
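
A minimal sketch of that test, for illustration only.  The helper name
`my-url-alone-on-line-p' is hypothetical, the scheme list (http, https,
ftp) is an assumption about what "a sibling" covers, and leading
whitespace is tolerated so that indented URLs still qualify:

```elisp
(defun my-url-alone-on-line-p (line)
  "Return non-nil if LINE holds nothing but a URL and surrounding whitespace.
Hypothetical helper; only http, https and ftp count as schemes here."
  (string-match-p
   ;; Optional indentation, a scheme, a run of non-blank characters,
   ;; then only trailing whitespace up to the end of the string.
   "\\`[ \t]*\\(?:https?\\|ftp\\)://[^ \t\n]+[ \t]*\\'"
   line))

;; A URL alone on its line qualifies, so a trailing `;' could be kept.
(my-url-alone-on-line-p "  http://example.org/find?q=&U+20207;")
;; A URL embedded in running text does not qualify.
(my-url-alone-on-line-p "See http://example.org/a; then b.")
```

A caller could consult this predicate before deciding whether to strip
trailing punctuation from the match.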


    Werner


* Re: Another issue with thingatpt
  2006-12-27 20:29       ` Bob Rogers
  2006-12-28  6:39         ` Werner LEMBERG
@ 2006-12-29 21:23         ` Piet van Oostrum
  2006-12-31  3:08           ` Bob Rogers
  1 sibling, 1 reply; 10+ messages in thread
From: Piet van Oostrum @ 2006-12-29 21:23 UTC


>>>>> Bob Rogers <rogers-emacs@rgrjr.dyndns.org> (BR) wrote:

>BR>    From: Werner LEMBERG <wl@gnu.org>
>BR>    Date: Wed, 27 Dec 2006 11:50:42 +0100 (CET)

>BR>    Here's another problematic URL:

>BR>      http://mousai.kanji.zinbun.kyoto-u.ac.jp/ids-find?components=&U+20207;

>BR>    thingatpt ignores the final `;'.

>BR>        Werner

>BR> According to RFC3986 (aka STD066), this is wrong; ";" is legitimate
>BR> anywhere in a path or query part, including the end.  So are "." and
>BR> ",", but thing-at-point-url-path-regexp also refuses to match these
>BR> characters at the end of the string.  Doing (ffap-string-at-point 'url)
>BR> drops these characters plus ":", "!", and (questionably) "?".

>BR>    It may not be possible to find a tradeoff between RFC compliance and
>BR> parsing dwimmery that would satisfy everybody.  Since stripping off
>BR> trailing punctuation is useful behavior (ISTR it's worked this way for a
>BR> while now), I would recommend against changing it now.  However, a case
>BR> could be made for making thing-at-point and ffap-string-at-point
>BR> consistent.  Perhaps "!:;.," would be best?  This is just the union of
>BR> the two sets but without the dubious inclusion of "?".

The way to reconcile these would be to customize it, I think. For example
have a string variable that contains the punctuation characters to be
included at the end. Or a regexp.
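
One way such a string variable could look, as a sketch only.  Both
`my-url-trailing-punctuation' and `my-url-strip-trailing-punctuation' are
hypothetical names; nothing in thingatpt.el or ffap.el consults them, and
the default value is simply the "!:;.," set proposed earlier in the
thread, read here as the characters to drop at the end:

```elisp
(defcustom my-url-trailing-punctuation "!:;.,"
  "Characters to strip from the end of a URL found at point.
Hypothetical option; nothing in Emacs consults it."
  :type 'string
  :group 'convenience)

(defun my-url-strip-trailing-punctuation (url)
  "Return URL without trailing characters listed in `my-url-trailing-punctuation'."
  (let ((drop (string-to-list my-url-trailing-punctuation))
        (end (length url)))
    ;; Walk back from the end while the last character is in the drop set.
    (while (and (> end 0)
                (memq (aref url (1- end)) drop))
      (setq end (1- end)))
    (substring url 0 end)))

(my-url-strip-trailing-punctuation "http://example.org/a;.")
;; => "http://example.org/a"
```

Setting the variable to the empty string keeps every trailing character,
which is the behavior the original report asked for.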

By the way, thing-at-point-url-path-regexp also disallows `:' inside a URL.
Colons would be necessary to accept IPv6 addresses.
-- 
Piet van Oostrum <piet@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
Private email: piet@vanoostrum.org


* Re: Another issue with thingatpt
  2006-12-29 21:23         ` Piet van Oostrum
@ 2006-12-31  3:08           ` Bob Rogers
  2006-12-31  9:25             ` Andreas Roehler
  2006-12-31 20:07             ` Piet van Oostrum
  0 siblings, 2 replies; 10+ messages in thread
From: Bob Rogers @ 2006-12-31  3:08 UTC
  Cc: emacs-devel

   From: Piet van Oostrum <piet@cs.uu.nl>
   Date: Fri, 29 Dec 2006 22:23:55 +0100

   >>>>> Bob Rogers <rogers-emacs@rgrjr.dyndns.org> (BR) wrote:

   >BR>    From: Werner LEMBERG <wl@gnu.org>
   >BR>    Date: Wed, 27 Dec 2006 11:50:42 +0100 (CET)

   >BR>    . . .

   >BR>    thingatpt ignores the final `;'.

   >BR>        Werner

   >BR> According to RFC3986 (aka STD066), this is wrong; ";" is legitimate
   >BR> anywhere in a path or query part, including the end.  So are "." and
   >BR> ",", but thing-at-point-url-path-regexp also refuses to match these
   >BR> characters at the end of the string.  Doing (ffap-string-at-point 'url)
   >BR> drops these characters plus ":", "!", and (questionably) "?".

   >BR>    It may not be possible to find a tradeoff between RFC compliance and
   >BR> parsing dwimmery that would satisfy everybody.  Since stripping off
   >BR> trailing punctuation is useful behavior (ISTR it's worked this way for a
   >BR> while now), I would recommend against changing it now.  However, a case
   >BR> could be made for making thing-at-point and ffap-string-at-point
   >BR> consistent.  Perhaps "!:;.," would be best?  This is just the union of
   >BR> the two sets but without the dubious inclusion of "?".

   The way to reconcile these would be to customize it, I think. For example
   have a string variable that contains the punctuation characters to be
   included at the end. Or a regexp.

Both interfaces (ffap and thing-at-point) are already customizable,
though in different ways.  ffap-string-at-point uses
ffap-string-at-point-mode-alist, which maps a thing type symbol or mode
name symbol to a list of three character sets; the last string in each
alist entry is the set of characters to exclude at the end.  On the
other hand, thing-at-point uses pure regexps, but they are constructed
from each other, which makes thing-at-point harder to customize.
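
To illustrate the ffap side, here is a sketch of how a user might try a
different set of trailing characters for the `url' entry.  The helper
name `my-ffap-set-url-trailing-chars' is hypothetical, and the code
assumes the alist entries have the shape (SYMBOL CHARS BEG END) described
above:

```elisp
(require 'ffap)

(defun my-ffap-set-url-trailing-chars (chars)
  "Make `ffap-string-at-point' strip CHARS from the end of `url' strings.
Hypothetical helper: pushes a fresh `url' entry rather than mutating the old one."
  (let ((old (assq 'url ffap-string-at-point-mode-alist)))
    (when old
      ;; `assq' returns the first match, so the new entry shadows the old.
      (push (list 'url (nth 1 old) (nth 2 old) chars)
            ffap-string-at-point-mode-alist))))

;; Try the "!:;.," set suggested earlier in the thread.
(my-ffap-set-url-trailing-chars "!:;.,")
```

Pushing a new entry instead of modifying the existing list in place
leaves the original definition available further down the alist.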

   Note that neither of these implementations is really mode-sensitive,
AFAICS; ffap-string-at-point-mode-alist is poorly named.  If editing
something XML-like, for example, you would want the attribute in

	<tag attr='http://...'>

to be parsed without dropping ANY characters at the end -- and any
embedded '&apos;' to be translated to a literal apostrophe.  But even if
this is TRT, it is clearly too risky to attempt now.

   But is there any objection to unifying these two implementations
after the release?  And if so, which is the better implementation?  I
believe the difference is only historical; ffap.el is much older than
thingatpt.el (IIRC).

   By the way, thing-at-point-url-path-regexp also disallows : inside a url.
   These would be necessary to accept IPv6 IP addresses.

It works for me (though in an emacs built two weeks ago):

	(string-match thing-at-point-url-path-regexp "http://::1/foo/bar.html")
	    => 0
	(string-match thing-at-point-url-regexp "http://::1/foo/bar.html")
	    => 0

Do you have an example of failure?

					-- Bob


* Re: Another issue with thingatpt
  2006-12-31  3:08           ` Bob Rogers
@ 2006-12-31  9:25             ` Andreas Roehler
  2006-12-31 17:24               ` Bob Rogers
  2006-12-31 20:07             ` Piet van Oostrum
  1 sibling, 1 reply; 10+ messages in thread
From: Andreas Roehler @ 2006-12-31  9:25 UTC
  Cc: emacs-devel


> Both interfaces (ffap and thing-at-point) are already customizable,
> though in different ways. 


There is no `defcustom' form in thingatpt.el;
it's done mostly with `defvar'. I wouldn't consider that
customizable.

The problem mentioned originally however shouldn't occur, as

,----
| (defvar thing-at-point-url-path-regexp
|   "[^]\t\n \"'()<>[^`{}]*[^]\t\n \"'()<>[^`{}.,;]+"
|   "A regular expression probably matching the host and filename or e-mail part of a URL.")
`----

includes that char. The error must reside elsewhere.

Regards,

Andreas Roehler


* Re: Another issue with thingatpt
  2006-12-31  9:25             ` Andreas Roehler
@ 2006-12-31 17:24               ` Bob Rogers
  2007-01-02 13:34                 ` Andreas Roehler
  2007-01-03 14:50                 ` Andreas Roehler
  0 siblings, 2 replies; 10+ messages in thread
From: Bob Rogers @ 2006-12-31 17:24 UTC
  Cc: emacs-devel

   From: Andreas Roehler <andreas.roehler@easy-emacs.de>
   Date: Sun, 31 Dec 2006 10:25:35 +0100

   > Both interfaces (ffap and thing-at-point) are already customizable,
   > though in different ways. 

   There is no `defcustom'-form in thingatpt.el,
   it's done mostly with `defvar'. Wouldn't conceive that
   as customizable.

Not in the sense of defcustom, no.  But someone who can't "customize" it
themselves via setq is probably not going to be able to change these
hairy regexps and/or char-classes without shooting themselves in the
foot.  It's not just a matter of understanding Emacs regexps, but
understanding how thing-at-point uses them.

   In any case, it seems to me that users shouldn't need to change the
regexp proper, since that is defined by RFC3986, just the set of
punctuation characters to drop at the end.  The only thing that needs to
be customized is just the "lose the punctuation" heuristic, IMHO.  And
the definition of "punctuation" should be enlarged so that it addresses
Slawomir's issue with parens, which are not even allowed internally.

   The problem mentioned originally however shouldn't occur, as

   ,----
   | (defvar thing-at-point-url-path-regexp
   |   "[^]\t\n \"'()<>[^`{}]*[^]\t\n \"'()<>[^`{}.,;]+"
   |   "A regular expression probably matching the host and filename or e-mail part of a URL.")
   `----

   includes that char. The error must reside elsewhere.

   Regards,

   Andreas Roehler

It does include a ";" in the second character class, but both are
inverted.  The second set is the same as the first set with the addition
of ".,;", which is why it refuses to match any of these characters at
the end of the URL.  This would be easier to see if the regexp were
written this way:

	(defvar thing-at-point-url-path-regexp
		(concat "[^]\t\n \"'()<>[^`{}]*"
			"[^]\t\n \"'()<>[^`{}.,;]+")
	  "A regular expression probably matching the host and filename or e-mail part of a URL.")

					-- Bob
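
The effect of the second, stricter character class can be checked
directly.  The sketch below uses a private copy of the regexp quoted in
this message (`my-url-path-regexp' and `my-url-path-match' are
hypothetical names, and the value in a given Emacs version may differ)
together with a shortened example.org stand-in for the reported URL:

```elisp
(defconst my-url-path-regexp
  (concat "[^]\t\n \"'()<>[^`{}]*"      ; any run of allowed characters
          "[^]\t\n \"'()<>[^`{}.,;]+")  ; must end in one that is not . , ;
  "Copy of the path regexp quoted in this thread; newer Emacs versions may differ.")

(defun my-url-path-match (string)
  "Return the part of STRING matched by `my-url-path-regexp', or nil."
  (when (string-match my-url-path-regexp string)
    (match-string 0 string)))

(my-url-path-match "http://example.org/ids-find?components=&U+20207;")
;; => "http://example.org/ids-find?components=&U+20207"
```

The first class happily consumes the `;', but the match must end with a
character from the second class, so the regexp engine backs off by one
character and the final `;' is left out.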


* Re: Another issue with thingatpt
  2006-12-31  3:08           ` Bob Rogers
  2006-12-31  9:25             ` Andreas Roehler
@ 2006-12-31 20:07             ` Piet van Oostrum
  1 sibling, 0 replies; 10+ messages in thread
From: Piet van Oostrum @ 2006-12-31 20:07 UTC


>>>>> Bob Rogers <rogers-emacs@rgrjr.dyndns.org> (BR) wrote:

>BR>    From: Piet van Oostrum <piet@cs.uu.nl>
[...]
>BR>    By the way, thing-at-point-url-path-regexp also disallows : inside
>BR>    a url. These would be necessary to accept IPv6 IP addresses.

>BR> It works for me (though in an emacs built two weeks ago):

>BR> 	(string-match thing-at-point-url-path-regexp "http://::1/foo/bar.html")
>BR> 	    => 0
>BR> 	(string-match thing-at-point-url-regexp "http://::1/foo/bar.html")
>BR> 	    => 0

>BR> Do you have an example of failure?

Sorry, I was wrong.
-- 
Piet van Oostrum <piet@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
Private email: piet@vanoostrum.org


* Re: Another issue with thingatpt
  2006-12-31 17:24               ` Bob Rogers
@ 2007-01-02 13:34                 ` Andreas Roehler
  2007-01-03 14:50                 ` Andreas Roehler
  1 sibling, 0 replies; 10+ messages in thread
From: Andreas Roehler @ 2007-01-02 13:34 UTC
  Cc: emacs-devel

Bob Rogers wrote:
>    From: Andreas Roehler <andreas.roehler@easy-emacs.de>
>    Date: Sun, 31 Dec 2006 10:25:35 +0100
>
>    > Both interfaces (ffap and thing-at-point) are already customizable,
>    > though in different ways. 
>
>    There is no `defcustom'-form in thingatpt.el,
>    it's done mostly with `defvar'. Wouldn't conceive that
>    as customizable.
>
> Not in the sense of defcustom, no.  But someone who can't "customize" it
> themselves via setq is probably not going to be able to change these
> hairy regexps and/or char-classes without shooting themselves in the
> foot.  It's not just a matter of understanding Emacs regexps, but
> understanding how thing-at-point uses them.
Probably you are right.
>
>    In any case, it seems to me that users shouldn't need to change the
> regexp proper, since that is defined by RFC3986, just the set of
> punctuation characters to drop at the end. 
Maybe I'm missing something, but AFAICS the regexp in question is not
derived from the RFC in a strict sense. Here is the relevant description
from RFC3986:

;;;;;;;;;;;;;;

      reserved    = gen-delims / sub-delims

      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

...


   Characters that are allowed in a URI but do not have a reserved
   purpose are called unreserved.  These include uppercase and lowercase
   letters, decimal digits, hyphen, period, underscore, and tilde.

      unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

;;;;;;;;;;;;;;;

That's basically all I can find on the matter there.

>  The only thing that needs to
> be customized is just the "lose the punctuation" heuristic, IMHO.  And
> the definition of "punctuation" should be enlarged so that it addresses
> Slawomir's issue with parens, which are not even allowed internally.
>
>    The problem mentioned originally however shouldn't occur, as
>
>    ,----
>    | (defvar thing-at-point-url-path-regexp
>    |   "[^]\t\n \"'()<>[^`{}]*[^]\t\n \"'()<>[^`{}.,;]+"
>    |   "A regular expression probably matching the host and filename or e-mail part of a URL.")
>    `----
>
>    includes that char. The error must reside elsewhere.
>
>    Regards,
>
>    Andreas Roehler
>
> It does include a ";" in the second character class, but both are
> inverted.  The second set is the same as the first set with the addition
> of ".,;", which is why it refuses to match any of these characters at
> the end of the URL.  This would be easier to see if the regexp were
> written this way:
>
> 	(defvar thing-at-point-url-path-regexp
> 		(concat "[^]\t\n \"'()<>[^`{}]*"
> 			"[^]\t\n \"'()<>[^`{}.,;]+")
> 	  "A regular expression probably matching the host and filename or e-mail part of a URL.")
>
> 					-- Bob
Now I see it, thanks a lot.

BTW: What about dropping the `;' from the regexp?

Maybe together with the comma, as this char is also listed as a
sub-delimiter.

Other problems:

- Char ' (39, #o47, #x27) currently seems to be excluded, whereas the RFC
lists it as a sub-delimiter too.

- (defvar thing-at-point-short-url-regexp
  (concat "[-A-Za-z0-9.]+" thing-at-point-url-path-regexp)

lacks the underscore in its bracket expression (unreserved according to the RFC).



Andreas


* Re: Another issue with thingatpt
  2006-12-31 17:24               ` Bob Rogers
  2007-01-02 13:34                 ` Andreas Roehler
@ 2007-01-03 14:50                 ` Andreas Roehler
  1 sibling, 0 replies; 10+ messages in thread
From: Andreas Roehler @ 2007-01-03 14:50 UTC
  Cc: emacs-devel


 Employing the forms below together with the standard
 thingatpt functions, I see several advantages:

- url-at-point no longer needs a complicated set of
  regexps; the whole special handling of URLs can be
  dropped

- it doesn't specify more than RFC3986 does, which
  may help to avoid trouble in the future if new URL
  schemes are invented

- it is easier for users to read and understand, and
  easier to extend and maintain


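;; Characters allowed in a URI per RFC3986: gen-delims, sub-delims and
;; unreserved.  The `-' is placed last so that it cannot be read as a range.
(put 'url 'beginning-op
     (lambda ()
       (skip-chars-backward ":/?#[]@!$&'()*+,;=[:alnum:]._~-")))

(put 'url 'end-op
     (lambda ()
       (skip-chars-forward ":/?#[]@!$&'()*+,;=[:alnum:]._~-")))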
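
A usage sketch of this approach follows.  It registers the same two
operations under a hypothetical thing symbol `my-url' instead of `url',
on the assumption that `url' may already carry its own bounds function
that would take precedence; `my-url-chars' and `my-url-demo' are likewise
made-up names:

```elisp
(require 'thingatpt)

(defconst my-url-chars ":/?#[]@!$&'()*+,;=[:alnum:]._~-"
  "RFC3986 gen-delims, sub-delims and unreserved characters, for `skip-chars-*'.")

;; Same idea as the forms above, under a private thing symbol.
(put 'my-url 'beginning-op
     (lambda () (skip-chars-backward my-url-chars)))
(put 'my-url 'end-op
     (lambda () (skip-chars-forward my-url-chars)))

(defun my-url-demo ()
  "Return the `my-url' thing found inside a sample sentence."
  (with-temp-buffer
    (insert "see http://example.org/find?q=&U+20207; for details")
    (goto-char (point-min))
    (search-forward "example")          ; leave point inside the URL
    (thing-at-point 'my-url)))

(my-url-demo)
;; => "http://example.org/find?q=&U+20207;"
```

The final `;' is now kept, but so would be a sentence-ending period or a
closing parenthesis right after a URL, which is the tradeoff discussed
earlier in the thread.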

Already implemented it into thingatpt-utils.el, which
is available at gnu.emacs.sources.

__
Andreas Roehler

