unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#6252: Emacs does not implement URL (aka "percent") decoding correctly.
@ 2010-05-22 23:46 José A. Romero L.
  2010-05-24  3:33 ` YAMAMOTO Mitsuharu
  2011-09-21 20:17 ` Lars Magne Ingebrigtsen
  0 siblings, 2 replies; 10+ messages in thread
From: José A. Romero L. @ 2010-05-22 23:46 UTC
  To: 6252

On May 18, 20:14, Xah Lee <xah...@gmail.com>  wrote:

> Is there an Emacs Lisp function that decodes URL percent-encoding?
> E.g. http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem
> should become
> http://en.wikipedia.org/wiki/Sylvester–Gallai_theorem
> That's an EN DASH (Unicode 8211, #o20023, #x2013).
> I know there's
>   (require 'gnus-util)
>   gnus-url-unhex-string
> but that just unhexes, and generates gibberish if the URL contains
> Unicode chars.
(...)

Seems that RFC 3986 has not been implemented correctly in Emacs. IMHO
that is an important hole you have found there. The standard requires
that all unreserved characters be encoded/decoded as UTF8 bytes. Even
though the encoding part looks OK (in url-util.el), the decoding does
not go that last mile to interpret the decoded bytes as UTF-8.

Until a proper implementation is done, I guess you could work around
the problem with something like this:

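    ;; Unhex the %XX escapes, force the result into a unibyte string,
    ;; then interpret those octets as UTF-8.
    (decode-coding-string
     (apply 'unibyte-string
            (string-to-list
             (url-unhex-string
              "http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem")))
     'utf-8)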

(yes, it's ugly as hell but hey, it's free ;])

I've just sent this very message as a bug report to the Emacs team.

Cheers,
-- 
José A. Romero L.
escherdragon@gmail.com
"We who cut mere stones must always be envisioning cathedrals."
(Quarry worker's creed)






* bug#6252: Emacs does not implement URL (aka "percent") decoding correctly.
  2010-05-22 23:46 bug#6252: Emacs does not implement URL (aka "percent") decoding correctly José A. Romero L.
@ 2010-05-24  3:33 ` YAMAMOTO Mitsuharu
       [not found]   ` <AANLkTilUODUArElMVd6FzgL08u1TmCVw2kar1Gf9z1Z9@mail.gmail.com>
  2011-09-21 20:17 ` Lars Magne Ingebrigtsen
  1 sibling, 1 reply; 10+ messages in thread
From: YAMAMOTO Mitsuharu @ 2010-05-24  3:33 UTC
  To: José A. Romero L.; +Cc: 6252

>>>>> On Sun, 23 May 2010 01:46:54 +0200, José A. Romero L. <escherdragon@gmail.com> said:

> Seems that RFC 3986 has not been implemented correctly in
> Emacs. IMHO that is an important hole you have found there. The
> standard requires that all unreserved characters be encoded/decoded
> as UTF8 bytes.

If you are referring to the following part of RFC 3986, it doesn't say
anything about existing URI schemes (as opposed to "a new URI
scheme"), those defining a component that does NOT represent textual
data, or even for textual data, those NOT consisting of characters
from the Universal Character Set.

  When a new URI scheme defines a component that represents textual
  data consisting of characters from the Universal Character Set
  [UCS], the data should first be encoded as octets according to the
  UTF-8 character encoding [STD63]; then only those octets that do not
  correspond to characters in the unreserved set should be percent-
  encoded.

(See also http://lists.gnu.org/archive/html/emacs-devel/2006-08/msg00065.html)

Though returning a multibyte string decoded as UTF-8 would be useful
in many cases, I think some "unhex"ing function should also provide a
way to return a unibyte string.

				     YAMAMOTO Mitsuharu
				mituharu@math.s.chiba-u.ac.jp






* bug#6252: Fwd: bug#6252: Emacs does not implement URL (aka "percent") decoding correctly.
       [not found]   ` <AANLkTilUODUArElMVd6FzgL08u1TmCVw2kar1Gf9z1Z9@mail.gmail.com>
@ 2010-05-25  8:56     ` José A. Romero L.
  0 siblings, 0 replies; 10+ messages in thread
From: José A. Romero L. @ 2010-05-25  8:56 UTC
  To: 6252

(sorry, forgot to fwd this to the bugtrack)
---------- Forwarded message ----------
From: José A. Romero L. <escherdragon@gmail.com>
Date: 2010/5/24
Subject: Re: bug#6252: Emacs does not implement URL (aka "percent")
decoding correctly.
To: YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp>


2010/5/24 YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp>:
>>>>>> On Sun, 23 May 2010 01:46:54 +0200, José A. Romero L. <escherdragon@gmail.com> said:
(...)
> If you are referring to the following part of RFC 3986, it doesn't say
> anything about existing URI schemes (as opposed to "a new URI
> scheme"), those defining a component that does NOT represent textual
> data, or even for textual data, those NOT consisting of characters
> from the Universal Character Set.

You are right. The standard *doesn't say anything* about existing URI
schemes on that matter. Thus the question is rather whether to make
the language more or less useful, especially in light of the fragment
you've just quoted:

     >  When a new URI scheme defines a component that represents textual
     >  data consisting of characters from the Universal Character Set
     >  [UCS], the data should first be encoded as octets according to the
     >  UTF-8 character encoding [STD63]; then only those octets that do not
     >  correspond to characters in the unreserved set should be percent-
     >  encoded.

and the example that immediately follows:

   (...) For example, the character A would be represented as "A",
   the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
   as "%C3%80", and the character KATAKANA LETTER A would be represented
   as "%E3%82%A2".
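
As a minimal sketch of the rule quoted above (the helper name
`my-percent-encode-utf8' is hypothetical and not part of url-util.el),
the text is first encoded as UTF-8 octets, and only the octets that fall
outside the unreserved set are percent-encoded:

```elisp
;; Hypothetical helper for illustration only; not an Emacs function.
(defun my-percent-encode-utf8 (string)
  "Percent-encode STRING following the RFC 3986 recommendation.
STRING is encoded as UTF-8 first; octets that are not in the
unreserved set (ALPHA / DIGIT / \"-\" / \".\" / \"_\" / \"~\") become %XX."
  (mapconcat
   (lambda (byte)
     (if (or (and (<= ?a byte) (<= byte ?z))
             (and (<= ?A byte) (<= byte ?Z))
             (and (<= ?0 byte) (<= byte ?9))
             (memq byte '(?- ?. ?_ ?~)))
         ;; Unreserved octets are emitted unchanged.
         (char-to-string byte)
       ;; Everything else is written as an uppercase hex escape.
       (format "%%%02X" byte)))
   ;; `encode-coding-string' returns a unibyte string, so each
   ;; element seen by the lambda is one octet (0..255).
   (encode-coding-string string 'utf-8)
   ""))

(my-percent-encode-utf8 "A")        ; "A"
(my-percent-encode-utf8 "\u00C0")   ; LATIN CAPITAL LETTER A WITH GRAVE
(my-percent-encode-utf8 "\u30A2")   ; KATAKANA LETTER A
```

With these inputs the three calls should give "A", "%C3%80" and
"%E3%82%A2", matching the RFC's own examples.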

>
> (See also http://lists.gnu.org/archive/html/emacs-devel/2006-08/msg00065.html)
>
> Though returning a multibyte string decoded as UTF-8 would be useful
> in many cases, I think some "unhex"ing function should also provide a
> way to return a unibyte string.
(...)

That's perfectly valid. OTOH some other "unhex"-ing function (or even
the same one) could also provide a way to return a multibyte string,
and even allow choosing the character encoding (UCS or not) for the
resulting string. After all, don't you think there should be a better
way to decode a Katakana A than using a kludge like this?

    (decode-coding-string
     (apply 'unibyte-string
            (string-to-list
             (url-unhex-string "%E3%82%A2")))
     'utf-8)

Cheers,
--
José A. Romero L.
escherdragon@gmail.com
"We who cut mere stones must always be envisioning cathedrals."
(Quarry worker's creed)






* bug#6252: Emacs does not implement URL (aka "percent") decoding correctly.
  2010-05-22 23:46 bug#6252: Emacs does not implement URL (aka "percent") decoding correctly José A. Romero L.
  2010-05-24  3:33 ` YAMAMOTO Mitsuharu
@ 2011-09-21 20:17 ` Lars Magne Ingebrigtsen
       [not found]   ` <CAJ_WfoU_JAjf-cTruu3fL1OO5fSMRHsjmnB=-aUK6h6RM7e8AA@mail.gmail.com>
  1 sibling, 1 reply; 10+ messages in thread
From: Lars Magne Ingebrigtsen @ 2011-09-21 20:17 UTC
  To: José A. Romero L.; +Cc: 6252

José A. Romero L. <escherdragon@gmail.com> writes:

> On May 18, 20:14, Xah Lee <xah...@gmail.com>  wrote:
>
>> Is there an Emacs Lisp function that decodes URL percent-encoding?
>> E.g. http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem
>> should become
>> http://en.wikipedia.org/wiki/Sylvester–Gallai_theorem
>> That's an EN DASH (Unicode 8211, #o20023, #x2013).
>> I know there's
>>   (require 'gnus-util)
>>   gnus-url-unhex-string
>> but that just unhexes, and generates gibberish if the URL contains
>> Unicode chars.
> (...)
>
> Seems that RFC 3986 has not been implemented correctly in Emacs. IMHO
> that is an important hole you have found there. The standard requires
> that all unreserved characters be encoded/decoded as UTF8 bytes. Even
> though the encoding part looks OK (in url-util.el), the decoding does
> not go that last mile to interpret the decoded bytes as UTF-8.

I'm not quite sure I understand what the problem is.  Do you have a test
case that illustrates what url.el does wrong?

-- 
(domestic pets only, the antidote for overdose, milk.)
  bloggy blog http://lars.ingebrigtsen.no/






* bug#6252: Emacs does not implement URL (aka "percent") decoding correctly.
       [not found]   ` <CAJ_WfoU_JAjf-cTruu3fL1OO5fSMRHsjmnB=-aUK6h6RM7e8AA@mail.gmail.com>
@ 2011-09-22  7:38     ` Lars Magne Ingebrigtsen
       [not found]       ` <CAJ_WfoV04Bis9msxHYLdu5VTONZ7hO_LsKy097MqfnmLf4kfjQ@mail.gmail.com>
  0 siblings, 1 reply; 10+ messages in thread
From: Lars Magne Ingebrigtsen @ 2011-09-22  7:38 UTC
  To: José A. Romero L.; +Cc: 6252

José A. Romero L. <escherdragon@gmail.com> writes:

> in short, there seems to be currently no way to perform the opposite
> of url-hexify-string for UTF-8 encoded strings:
>
>     (url-unhex-string (url-hexify-string "ä"))
>     => "ä"

`url-unhex-string' can't know which character encoding the %xx-escaped
octets are in, can it?  The local part of a URL can use a different
encoding, I think.
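
To illustrate that ambiguity with a small sketch (the helper name
`my-decode-two-ways' is made up for this example): the two octets
obtained by unhexing "%C3%A4" yield different text depending on which
coding system the caller assumes.

```elisp
;; Hypothetical helper for illustration only.
(defun my-decode-two-ways (octets)
  "Return a list with OCTETS decoded as UTF-8 and as ISO-8859-1."
  (list (decode-coding-string octets 'utf-8)
        (decode-coding-string octets 'iso-8859-1)))

;; "\303\244" is a unibyte string holding the octets #xC3 #xA4.
(my-decode-two-ways "\303\244")
```

Read as UTF-8 the octets are the single character U+00E4; read as
ISO-8859-1 they are the two characters U+00C3 and U+00A4.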

But is that the test case for the bug?  I thought somebody had problems
retrieving something...

-- 
(domestic pets only, the antidote for overdose, milk.)
  bloggy blog http://lars.ingebrigtsen.no/






* bug#6252: Emacs does not implement URL (aka "percent") decoding correctly.
       [not found]       ` <CAJ_WfoV04Bis9msxHYLdu5VTONZ7hO_LsKy097MqfnmLf4kfjQ@mail.gmail.com>
@ 2011-09-23  8:34         ` Lars Magne Ingebrigtsen
  2011-09-23 11:12           ` José A. Romero L.
  0 siblings, 1 reply; 10+ messages in thread
From: Lars Magne Ingebrigtsen @ 2011-09-23  8:34 UTC
  To: José A. Romero L.; +Cc: 6252

José A. Romero L. <escherdragon@gmail.com> writes:

>>>     (url-unhex-string (url-hexify-string "ä"))
>>>     => "ä"

[...]

> Well, if you write a script that transforms URLs to/from strings
> (especially round-trip) you will probably encounter problems
> retrieving stuff from the web if you're not aware of this issue.

So this bug report is purely about the return value of
`url-unhex-string'?  At the beginning it sounded as if url.el had
problems fetching something.

If this is just about `url-unhex-string', the obvious solution would be
to add a CODING-SYSTEM parameter to that function.

And please don't keep removing the debbugs address from the Cc list.
Your messages aren't going to the bug tracker if you do that.

-- 
(domestic pets only, the antidote for overdose, milk.)
  bloggy blog http://lars.ingebrigtsen.no/






* bug#6252: Emacs does not implement URL (aka "percent") decoding correctly.
  2011-09-23  8:34         ` Lars Magne Ingebrigtsen
@ 2011-09-23 11:12           ` José A. Romero L.
  2011-09-25 22:16             ` Lars Magne Ingebrigtsen
  0 siblings, 1 reply; 10+ messages in thread
From: José A. Romero L. @ 2011-09-23 11:12 UTC
  To: Lars Magne Ingebrigtsen; +Cc: 6252

2011/9/23 Lars Magne Ingebrigtsen <larsi@gnus.org>:
(...)
> If this is just about `url-unhex-string', the obvious solution would be
> to add a CODING-SYSTEM parameter to that function.

Yes, as I see it, that's definitely it.

> And please don't keep removing the debbugs address from the Cc list.
> Your messages aren't going to the bug tracker if you do that.
(...)

Oops, sorry, I didn't notice it before -- won't happen again.

Cheers,
-- 
José A. Romero L.
escherdragon@gmail.com
"We who cut mere stones must always be envisioning cathedrals."
(Quarry worker's creed)






* bug#6252: Emacs does not implement URL (aka "percent") decoding correctly.
  2011-09-23 11:12           ` José A. Romero L.
@ 2011-09-25 22:16             ` Lars Magne Ingebrigtsen
  2011-09-25 22:25               ` José A. Romero L.
  2012-04-10  2:14               ` Lars Magne Ingebrigtsen
  0 siblings, 2 replies; 10+ messages in thread
From: Lars Magne Ingebrigtsen @ 2011-09-25 22:16 UTC
  To: José A. Romero L.; +Cc: 6252

José A. Romero L. <escherdragon@gmail.com> writes:

>> If this is just about `url-unhex-string', the obvious solution would be
>> to add a CODING-SYSTEM parameter to that function.
>
> Yes, as I see it, that's definitely it.

I think that's a reasonable thing to add, but Emacs is in a feature
freeze, so it'll probably have to wait until after Emacs 24 has been
released.  I'll mark the bug report as "pending".

-- 
(domestic pets only, the antidote for overdose, milk.)
  bloggy blog http://lars.ingebrigtsen.no/






* bug#6252: Emacs does not implement URL (aka "percent") decoding correctly.
  2011-09-25 22:16             ` Lars Magne Ingebrigtsen
@ 2011-09-25 22:25               ` José A. Romero L.
  2012-04-10  2:14               ` Lars Magne Ingebrigtsen
  1 sibling, 0 replies; 10+ messages in thread
From: José A. Romero L. @ 2011-09-25 22:25 UTC
  To: Lars Magne Ingebrigtsen; +Cc: 6252

2011/9/26 Lars Magne Ingebrigtsen <larsi@gnus.org>:
(...)
> I think that's a reasonable thing to add, but Emacs is in a feature
> freeze, so it'll probably have to wait until after Emacs 24 has been
> released.  I'll mark the bug report as "pending".
(...)

Cool, thanks a lot :)

Cheers,
-- 
José A. Romero L.
escherdragon@gmail.com
"We who cut mere stones must always be envisioning cathedrals."
(Quarry worker's creed)






* bug#6252: Emacs does not implement URL (aka "percent") decoding correctly.
  2011-09-25 22:16             ` Lars Magne Ingebrigtsen
  2011-09-25 22:25               ` José A. Romero L.
@ 2012-04-10  2:14               ` Lars Magne Ingebrigtsen
  1 sibling, 0 replies; 10+ messages in thread
From: Lars Magne Ingebrigtsen @ 2012-04-10  2:14 UTC
  To: José A. Romero L.; +Cc: 6252

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> I think that's a reasonable thing to add, but Emacs is in a feature
> freeze, so it'll probably have to wait until after Emacs 24 has been
> released.  I'll mark the bug report as "pending".

I've now added an optional coding-system parameter to the function in
the Emacs trunk.
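
The thread does not spell out the new argument's position or exact
behavior, so the following is only a sketch of the idea as a separate
wrapper, not the trunk change itself. The name `my-url-unhex-string'
and its argument order are assumptions; the body reuses the workaround
posted earlier in this thread and assumes the input string is pure
ASCII, as a percent-encoded URL normally is.

```elisp
(require 'url-util)

;; Hypothetical wrapper; name and argument order are assumptions.
(defun my-url-unhex-string (str &optional coding-system)
  "Unhex the %XX escapes in STR.
If CODING-SYSTEM is non-nil, decode the resulting octets with it and
return a multibyte string; otherwise return what `url-unhex-string'
returns."
  (let ((octets (url-unhex-string str)))
    (if coding-system
        (decode-coding-string
         ;; Force a unibyte string so every element counts as one octet.
         (apply #'unibyte-string (string-to-list octets))
         coding-system)
      octets)))

;; Decode the Katakana A example from this thread as UTF-8.
(my-url-unhex-string "%E3%82%A2" 'utf-8)
```

Keeping CODING-SYSTEM optional preserves the raw-octet result for
callers whose data is not text, which was the concern raised earlier in
the thread.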

-- 
(domestic pets only, the antidote for overdose, milk.)
  bloggy blog http://lars.ingebrigtsen.no/




