[davidsmith@acm.org: [patch] url-hexify-string does not follow W3C spec]

* [davidsmith@acm.org: [patch] url-hexify-string does not follow W3C spec]
@ 2006-07-30 13:14 Richard Stallman
  2006-07-30 20:24 ` Thien-Thi Nguyen
  0 siblings, 1 reply; 22+ messages in thread
From: Richard Stallman @ 2006-07-30 13:14 UTC (permalink / raw


Would someone please DTRT?

------- Start of forwarded message -------
From: "David Smith" <davidsmith@acm.org>
To: emacs-pretest-bug@gnu.org
Date: Sun, 30 Jul 2006 05:29:50 +0900
Subject: [patch] url-hexify-string does not follow W3C spec

url-hexify-string does not handle non-latin
characters. According to
http://www.w3.org/International/O-URL-code.html , the string
must be converted to hexadecimal UTF-8 and every hexadecimal
byte must be prefixed with a % character. Rewritten
url-hexify-string is below:

(defun url-hexify-string (str)
  "Escape characters in a string."
  (mapconcat
   (lambda (char)
     ;; Fixme: use a char table instead.
     (if (not (memq char url-unreserved-chars))
         (if (< char 16)
             (format "%%0%x" char)
           (let ((ins nil))
             (mapconcat
              (lambda (charhex)
                (progn (setq ins (not ins))
                       (if ins (concat "%" (char-to-string charhex))
                         (char-to-string charhex))))
              (format "%x" char) "")))
       (char-to-string char)))
   (encode-coding-string str 'utf-8) ""))
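
As a worked example of the rule described above (convert to UTF-8,
then prefix each byte with %), here is a minimal sketch for a single
non-Latin character, U+65E5.  The helper names demo-utf8-bytes and
demo-percent-encode-bytes are made up for illustration and are not
part of url-util.el; unlike the patch, the sketch hexifies every byte
and ignores url-unreserved-chars.

```elisp
;; Hypothetical helpers for illustration only.
(defun demo-utf8-bytes (text)
  "Return the UTF-8 bytes of TEXT as a list of integers."
  (append (encode-coding-string text 'utf-8) nil))

(defun demo-percent-encode-bytes (bytes)
  "Return BYTES as a string of %xx escapes (lowercase hex)."
  (mapconcat (lambda (byte) (format "%%%02x" byte)) bytes ""))

(demo-utf8-bytes "\u65e5")                              ; => (230 151 165)
(demo-percent-encode-bytes (demo-utf8-bytes "\u65e5"))  ; => "%e6%97%a5"
```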

Important settings:
  value of $LC_ALL: ja_JP.utf8
  value of $LC_COLLATE: nil
  value of $LC_CTYPE: ja_JP.utf8
  value of $LC_MESSAGES: nil
  value of $LC_MONETARY: nil
  value of $LC_NUMERIC: nil
  value of $LC_TIME: nil
  value of $LANG: ja_JP
  locale-coding-system: utf-8-unix
  default-enable-multibyte-characters: t

Thanks,
-- 
  David D. Smith

------- End of forwarded message -------


* Re: [davidsmith@acm.org: [patch] url-hexify-string does not follow W3C spec]
  2006-07-30 13:14 [davidsmith@acm.org: [patch] url-hexify-string does not follow W3C spec] Richard Stallman
@ 2006-07-30 20:24 ` Thien-Thi Nguyen
  2006-07-31  0:59   ` YAMAMOTO Mitsuharu
  0 siblings, 1 reply; 22+ messages in thread
From: Thien-Thi Nguyen @ 2006-07-30 20:24 UTC (permalink / raw
  Cc: emacs-devel

> From: "David Smith" <davidsmith@acm.org>
> Date: Sun, 30 Jul 2006 05:29:50 +0900
>
> the string must be converted to hexadecimal UTF-8 and every
> hexadecimal byte must be prefixed with a % character.

thanks for pointing this out.
i have rewritten `url-hexify-string'.

thi


* Re: [davidsmith@acm.org: [patch] url-hexify-string does not follow W3C spec]
  2006-07-30 20:24 ` Thien-Thi Nguyen
@ 2006-07-31  0:59   ` YAMAMOTO Mitsuharu
  2006-07-31 10:13     ` Thien-Thi Nguyen
  0 siblings, 1 reply; 22+ messages in thread
From: YAMAMOTO Mitsuharu @ 2006-07-31  0:59 UTC (permalink / raw
  Cc: davidsmith, emacs-devel

>>>>> On 30 Jul 2006 16:24:07 -0400, Thien-Thi Nguyen <ttn@gnu.org> said:

>> the string must be converted to hexadecimal UTF-8 and every
>> hexadecimal byte must be prefixed with a % character.

> thanks for pointing this out.  i have rewritten `url-hexify-string'.

This change breaks the following case:

(concat
 "file://localhost"
 (mapconcat 'url-hexify-string
	    (split-string
	     (encode-coding-string "/SOME/NONASCII/FILE/NAME"
				   (or file-name-coding-system
				       default-file-name-coding-system))
	     "/")
	    "/"))

Maybe suppress encoding with UTF-8 for unibyte strings?

				     YAMAMOTO Mitsuharu
				mituharu@math.s.chiba-u.ac.jp


* Re: [davidsmith@acm.org: [patch] url-hexify-string does not follow W3C spec]
  2006-07-31  0:59   ` YAMAMOTO Mitsuharu
@ 2006-07-31 10:13     ` Thien-Thi Nguyen
  2006-07-31 10:46       ` Jason Rumney
  0 siblings, 1 reply; 22+ messages in thread
From: Thien-Thi Nguyen @ 2006-07-31 10:13 UTC (permalink / raw
  Cc: davidsmith, emacs-devel

YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp> writes:

> This change breaks the following case:
> 
> (concat
>  "file://localhost"
>  (mapconcat 'url-hexify-string
> 	    (split-string
> 	     (encode-coding-string "/SOME/NONASCII/FILE/NAME"
> 				   (or file-name-coding-system
> 				       default-file-name-coding-system))
> 	     "/")
> 	    "/"))
> 
> Maybe suppress encoding with UTF-8 for unibyte strings?

if the result of this expression is to be used as a URI, then that means
the change exposes improper use of `url-hexify-string'; according to the
RFC (as i understand it) URIs require utf-8.

if we want `url-hexify-string' to handle "URI-like" transformations
(i.e., not strictly produce URI-conformant results), we can add an
optional arg MAKE-UNIBYTE that specifies a function to do the conversion
to unibyte.  in most cases, i guess that would be `string-as-unibyte',
but i don't know for sure.

so i suppose it comes down to the question: is this case supposed to
produce a valid URI?  if so, we should fix the call and not the called.

thi


* Re: [davidsmith@acm.org: [patch] url-hexify-string does not follow W3C spec]
  2006-07-31 10:13     ` Thien-Thi Nguyen
@ 2006-07-31 10:46       ` Jason Rumney
  2006-07-31 16:08         ` Stefan Monnier
  0 siblings, 1 reply; 22+ messages in thread
From: Jason Rumney @ 2006-07-31 10:46 UTC (permalink / raw
  Cc: davidsmith, YAMAMOTO Mitsuharu, emacs-devel



Thien-Thi Nguyen wrote:
> YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp> writes:
>
>   
>> This change breaks the following case:
>>
>> (concat
>>  "file://localhost"
>>  (mapconcat 'url-hexify-string
>> 	    (split-string
>> 	     (encode-coding-string "/SOME/NONASCII/FILE/NAME"
>> 				   (or file-name-coding-system
>> 				       default-file-name-coding-system))
>> 	     "/")
>> 	    "/"))
>>
>> Maybe suppress encoding with UTF-8 for unibyte strings?
>>     
>
> if the result of this expression is to be used as a URI, then that means
> the change exposes improper use of `url-hexify-string'; according to the
> RFC (as i understand it) URIs require utf-8.
>   
There is a recent RFC that mandates utf-8 encoding for URIs, but 
previous RFCs either said nothing, or specified Latin-1, so there are 
many implementations that do not use utf-8. We need some way to 
interoperate with such implementations.

> if we want `url-hexify-string' to handle "URI-like" transformations
> (i.e., not strictly produce URI-conformant results), we can add an
> optional arg MAKE-UNIBYTE that specifies a function to do the conversion
> to unibyte.  in most cases, i guess that would be `string-as-unibyte',
> but i don't know for sure.
>   
Alternatively, we could add an optional arg ENCODING, for specifying an 
encoding other than utf-8. That might be a cleaner interface than 
requiring the user to make the string unibyte before passing it to 
url-hexify-string.
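
A minimal sketch of what such an ENCODING argument could look like.
Everything here is hypothetical: demo-hexify-with-coding and
demo-unreserved-chars are illustrative names, and the unreserved set
is a stand-in, not the actual value of `url-unreserved-chars'.  The
input is assumed to be a multibyte string.

```elisp
;; Illustrative stand-in for `url-unreserved-chars' (an assumption).
(defvar demo-unreserved-chars
  (append "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_.~" nil)
  "List of characters left unescaped by the demo functions.")

(defun demo-hexify-with-coding (string &optional coding-system)
  "Percent-encode STRING after encoding it with CODING-SYSTEM.
CODING-SYSTEM defaults to utf-8."
  (mapconcat (lambda (byte)
               (if (memq byte demo-unreserved-chars)
                   (char-to-string byte)
                 (format "%%%02x" byte)))
             (encode-coding-string string (or coding-system 'utf-8))
             ""))

(demo-hexify-with-coding "caf\u00e9")              ; => "caf%c3%a9"
(demo-hexify-with-coding "caf\u00e9" 'iso-8859-1)  ; => "caf%e9"
```

With this shape the caller never handles a unibyte string directly;
the cost is one more argument on a widely used function.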



* Re: [davidsmith@acm.org: [patch] url-hexify-string does not follow W3C spec]
  2006-07-31 10:46       ` Jason Rumney
@ 2006-07-31 16:08         ` Stefan Monnier
  2006-07-31 16:35           ` David Smith
  0 siblings, 1 reply; 22+ messages in thread
From: Stefan Monnier @ 2006-07-31 16:08 UTC (permalink / raw
  Cc: davidsmith, Thien-Thi Nguyen, YAMAMOTO Mitsuharu, emacs-devel

> Alternatively, we could add an optional arg ENCODING, for specifying an
> encoding other than utf-8. That might be a cleaner interface than requiring
> the user to make the string unibyte before passing it to url-hexify-string.

I'd rather not add any arg and simply encode with utf-8 if it's not
already unibyte.  After all, non-utf-8 uses should be nonexistent right now
since the code signals an error, and requiring an explicit call to
encode-coding-string for those rare cases where you want something other than
utf-8 (rare and getting rarer in the future, most likely) is really not
a big deal.
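
The rule can be sketched as a choice of which octets get hexified.
This is only an illustration under assumptions: demo-octets is a
made-up helper, not code from url-util.el.

```elisp
;; Hypothetical helper: pick the byte sequence that would be hexified.
(defun demo-octets (string)
  "Return STRING as octets: UTF-8 encoded if multibyte, else unchanged."
  (if (multibyte-string-p string)
      (encode-coding-string string 'utf-8)
    string))

;; Multibyte text is encoded to UTF-8 (e-acute becomes two bytes).
(demo-octets "caf\u00e9")   ; => "caf\303\251"
;; A unibyte string (here already Latin-1 encoded) passes through.
(demo-octets "caf\351")     ; => "caf\351"
```

A caller who wants a non-utf-8 result encodes first, for example with
(encode-coding-string text 'iso-8859-1), and passes the unibyte result.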

        Stefan


* Re: [davidsmith@acm.org: [patch] url-hexify-string does not follow W3C spec]
  2006-07-31 16:08         ` Stefan Monnier
@ 2006-07-31 16:35           ` David Smith
  2006-07-31 20:49             ` Thien-Thi Nguyen
  0 siblings, 1 reply; 22+ messages in thread
From: David Smith @ 2006-07-31 16:35 UTC (permalink / raw
  Cc: Thien-Thi Nguyen, emacs-devel, YAMAMOTO Mitsuharu, Jason Rumney



Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> Alternatively, we could add an optional arg ENCODING, for specifying an
>> encoding other than utf-8. That might be a cleaner interface than requiring
>> the user to make the string unibyte before passing it to url-hexify-string.
>
> I'd rather not add any arg and simply encode with utf-8 if it's not
> already unibyte.  After all, non-utf-8 uses should be nonexistent right now
> since the code signals an error, and requiring an explicit call to
> encode-coding-string for those rare cases where you want something other than
> utf-8 (rare and getting rarer in the future, most likely) is really not
> a big deal.
>
>

So far, this suggestion does seem to satisfy all
issues. Nguyen, is this as easy to implement as it sounds?
Yamamoto-san, your issue is valid and this is what you
suggested originally, right?

-- 
  David D. Smith


* Re: [davidsmith@acm.org: [patch] url-hexify-string does not follow W3C spec]
  2006-07-31 16:35           ` David Smith
@ 2006-07-31 20:49             ` Thien-Thi Nguyen
  2006-08-01  3:55               ` YAMAMOTO Mitsuharu
  0 siblings, 1 reply; 22+ messages in thread
From: Thien-Thi Nguyen @ 2006-07-31 20:49 UTC (permalink / raw
  Cc: Jason Rumney, Stefan Monnier, YAMAMOTO Mitsuharu, emacs-devel

"David Smith" <davidsmith@acm.org> writes:

> Nguyen, is this as easy to implement as it sounds?

i don't know how easy it sounds.  that depends on each listener.
to me, it seems very easy to implement, because it is already done.
here is the function as it stands in lisp/url/url-util.el:

(defun url-hexify-string (string)
  "Return a new string that is STRING URI-encoded.
First, STRING is converted to utf-8, if necessary.  Then, for each
character in the utf-8 string, those found in `url-unreserved-chars'
are left as-is, all others are represented as a three-character
string: \"%\" followed by two lowercase hex digits."
  (mapconcat (lambda (char)
               (if (memq char url-unreserved-chars)
                   (char-to-string char)
                 (format "%%%02x" char)))
             (encode-coding-string string 'utf-8 t)
             ""))

i suggest we leave it alone.

btw, i missed some of this thread so i assume "it" refers
to `url-hexify-string'.  if discussion is actually about some
other function (that calls `url-hexify-string'), i have no
opinion on those matters.

thi


* Re: [davidsmith@acm.org: [patch] url-hexify-string does not follow W3C spec]
  2006-07-31 20:49             ` Thien-Thi Nguyen
@ 2006-08-01  3:55               ` YAMAMOTO Mitsuharu
  2006-08-01  4:20                 ` Stefan Monnier
  2006-08-01 14:47                 ` Thien-Thi Nguyen
  0 siblings, 2 replies; 22+ messages in thread
From: YAMAMOTO Mitsuharu @ 2006-08-01  3:55 UTC (permalink / raw
  Cc: David Smith, Jason Rumney, Stefan Monnier, emacs-devel

>>>>> On 31 Jul 2006 16:49:19 -0400, Thien-Thi Nguyen <ttn@gnu.org> said:

> btw, i missed some of this thread so i assume "it" refers to
> `url-hexify-string'.  if discussion is actually about some other
> function (that calls `url-hexify-string'), i have no opinion on
> those matters.

Let me summarize:

  * Rev 1.12
    The argument is assumed to be a sequence of octets.  Namely,
    either a unibyte string or a multibyte string only containing
    ascii, eight-bit-control, or eight-bit-graphic.  A multibyte
    string containing characters in other charsets causes an error.

  * Rev 1.13
    The argument is assumed to be a sequence of characters.
    Incompatible change for non-ASCII strings.

  * Rev 1.14
    The argument is assumed to be either a sequence of characters or a
    sequence of octets depending on the multibyteness of the string.
    Incompatibility still remains for a multibyte string containing
    eight-bit-control or eight-bit-graphic, but usually negligible.

I'm not sure if encoding with UTF-8 is really useful, but I don't
strongly oppose it if compatibility for the unibyte case is preserved.

				     YAMAMOTO Mitsuharu
				mituharu@math.s.chiba-u.ac.jp


* Re: [davidsmith@acm.org: [patch] url-hexify-string does not follow W3C spec]
  2006-08-01  3:55               ` YAMAMOTO Mitsuharu
@ 2006-08-01  4:20                 ` Stefan Monnier
  2006-08-01  4:34                   ` YAMAMOTO Mitsuharu
  2006-08-01 14:47                 ` Thien-Thi Nguyen
  1 sibling, 1 reply; 22+ messages in thread
From: Stefan Monnier @ 2006-08-01  4:20 UTC (permalink / raw
  Cc: David Smith, Thien-Thi Nguyen, emacs-devel, Jason Rumney

>   * Rev 1.14
>     The argument is assumed to be either a sequence of characters or a
>     sequence of octets depending on the multibyteness of the string.
>     Incompatibility still remains for a multibyte string containing
>     eight-bit-control or eight-bit-graphic, but usually negligible.

What incompatibility?  If the string only contains ASCII and eight-bit-*,
then encoding it with utf-8 will return the same string of bytes (except
in a unibyte string rather than multibyte string).


        Stefan


* Re: [davidsmith@acm.org: [patch] url-hexify-string does not follow W3C spec]
  2006-08-01  4:20                 ` Stefan Monnier
@ 2006-08-01  4:34                   ` YAMAMOTO Mitsuharu
  2006-08-01  6:50                     ` Stefan Monnier
  0 siblings, 1 reply; 22+ messages in thread
From: YAMAMOTO Mitsuharu @ 2006-08-01  4:34 UTC (permalink / raw
  Cc: David Smith, Thien-Thi Nguyen, emacs-devel, Jason Rumney

>>>>> On Tue, 01 Aug 2006 00:20:50 -0400, Stefan Monnier <monnier@iro.umontreal.ca> said:

> What incompatibility?  If the string only contains ASCII and
> eight-bit-*, then encoding it with utf-8 will return the same string
> of bytes (except in a unibyte string rather than multibyte string).

Here's an example:

(encode-coding-string "\x80" 'utf-8)
=> "\302\200"

				     YAMAMOTO Mitsuharu
				mituharu@math.s.chiba-u.ac.jp


* Re: [davidsmith@acm.org: [patch] url-hexify-string does not follow W3C spec]
  2006-08-01  4:34                   ` YAMAMOTO Mitsuharu
@ 2006-08-01  6:50                     ` Stefan Monnier
  2006-08-01  7:14                       ` Kenichi Handa
  2006-08-01  8:42                       ` Jason Rumney
  0 siblings, 2 replies; 22+ messages in thread
From: Stefan Monnier @ 2006-08-01  6:50 UTC (permalink / raw
  Cc: emacs-devel

>> What incompatibility?  If the string only contains ASCII and
>> eight-bit-*, then encoding it with utf-8 will return the same string
>> of bytes (except in a unibyte string rather than multibyte string).

> Here's an example:

> (encode-coding-string "\x80" 'utf-8)
> => "\302\200"

Duh!  Looks like a serious bug to me.
Handa-san, what's up with that?


        Stefan


* Re: [davidsmith@acm.org: [patch] url-hexify-string does not follow W3C spec]
  2006-08-01  6:50                     ` Stefan Monnier
@ 2006-08-01  7:14                       ` Kenichi Handa
  2006-08-01 14:32                         ` Stefan Monnier
  2006-08-01  8:42                       ` Jason Rumney
  1 sibling, 1 reply; 22+ messages in thread
From: Kenichi Handa @ 2006-08-01  7:14 UTC (permalink / raw
  Cc: mituharu, emacs-devel

In article <jwv4pwxhrx3.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>> What incompatibility?  If the string only contains ASCII and
>>> eight-bit-*, then encoding it with utf-8 will return the same string
>>> of bytes (except in a unibyte string rather than multibyte string).

>> Here's an example:

>> (encode-coding-string "\x80" 'utf-8)
>> => "\302\200"

> Duh!  Looks like a serious bug to me.
> Handa-san, what's up with that?

??? \x80 == U+0080 is a valid Unicode character in "C1
Controls" block.
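
The two bytes follow from the UTF-8 two-byte form 110xxxxx 10xxxxxx.
A minimal sketch of the arithmetic, using a made-up helper name
(demo-utf8-two-byte) and assuming a code point in the range
U+0080..U+07FF:

```elisp
;; Hypothetical helper showing the two-byte UTF-8 layout by hand.
(defun demo-utf8-two-byte (code-point)
  "Return the two UTF-8 bytes for CODE-POINT, assumed in #x80..#x7ff."
  (list (logior #xc0 (ash code-point -6))         ; 110xxxxx: upper 5 bits
        (logior #x80 (logand code-point #x3f))))  ; 10xxxxxx: lower 6 bits

(demo-utf8-two-byte #x80)  ; => (194 128), i.e. octal \302 \200
```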

However, I agree that the following is very questionable
behaviour:

>> (encode-coding-string (string-as-unibyte "\x80") 'utf-8)
>> => "\302\200"

But, that is a long standing problem, and should be fixed
(if necessary) after the release.

---
Kenichi Handa
handa@m17n.org


* Re: [davidsmith@acm.org: [patch] url-hexify-string does not follow W3C spec]
  2006-08-01  6:50                     ` Stefan Monnier
  2006-08-01  7:14                       ` Kenichi Handa
@ 2006-08-01  8:42                       ` Jason Rumney
  1 sibling, 0 replies; 22+ messages in thread
From: Jason Rumney @ 2006-08-01  8:42 UTC (permalink / raw
  Cc: YAMAMOTO Mitsuharu, emacs-devel



Stefan Monnier wrote:
>> (encode-coding-string "\x80" 'utf-8)
>> => "\302\200"
>>     
>
> Duh!  Looks like a serious bug to me.
> Handa-san, what's up with that?
>   

Why is it a serious bug? Are you expecting invalid utf-8 to be output 
from that function?



* Re: [davidsmith@acm.org: [patch] url-hexify-string does not follow W3C spec]
  2006-08-01  7:14                       ` Kenichi Handa
@ 2006-08-01 14:32                         ` Stefan Monnier
  0 siblings, 0 replies; 22+ messages in thread
From: Stefan Monnier @ 2006-08-01 14:32 UTC (permalink / raw
  Cc: mituharu, emacs-devel

>>>> What incompatibility?  If the string only contains ASCII and
>>>> eight-bit-*, then encoding it with utf-8 will return the same string
>>>> of bytes (except in a unibyte string rather than multibyte string).

>>> Here's an example:

>>> (encode-coding-string "\x80" 'utf-8)
>>> => "\302\200"

>> Duh!  Looks like a serious bug to me.
>> Handa-san, what's up with that?

> ??? \x80 == U+0080 is a valid Unicode character in "C1
> Controls" block.

Why was it chosen to represent U+0080 with \x80?
The problem with it is that it makes it impossible to reliably carry
byte-streams embedded in multibyte strings.  Oh well, I guess that EBCDIC
and friends also make it impossible anyway :-(

> However, I agree that the following is very questionable
> behaviour:

>>> (encode-coding-string (string-as-unibyte "\x80") 'utf-8)
>>> => "\302\200"

> But, that is a long standing problem, and should be fixed
> (if necessary) after the release.

It should be fixed by signalling an error: if the string is unibyte it's
already encoded.


        Stefan


* Re: [davidsmith@acm.org: [patch] url-hexify-string does not follow W3C spec]
  2006-08-01  3:55               ` YAMAMOTO Mitsuharu
  2006-08-01  4:20                 ` Stefan Monnier
@ 2006-08-01 14:47                 ` Thien-Thi Nguyen
  2006-08-01 15:10                   ` Stefan Monnier
  2006-08-02  2:06                   ` YAMAMOTO Mitsuharu
  1 sibling, 2 replies; 22+ messages in thread
From: Thien-Thi Nguyen @ 2006-08-01 14:47 UTC (permalink / raw
  Cc: David Smith, emacs-devel, Stefan Monnier, Jason Rumney

YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp> writes:

>   [review]

thanks, that was very pleasant to read.

>   * Rev 1.14
>     The argument is assumed to be either a sequence of characters or a
>     sequence of octets depending on the multibyteness of the string.
>     Incompatibility still remains for a multibyte string containing
>     eight-bit-control or eight-bit-graphic, but usually negligible.
> 
> I'm not sure if encoding with UTF-8 is really useful, but I don't
> strongly oppose it if compatibility for the unibyte case is preserved.

conversion to utf-8 is per the RFC, which seems to be the primary context for
this function; avoiding that conversion means noncompliance w/ the RFC.

i think rev 1.14 is almost ok; anything that deviates from the RFC should be
under user control (via optional arg) and should be documented.  i assume that
(a) conversion of multibyte utf-8 is unconditionally desirable (a "negligible"
problem is no problem), and (b) there exist non-utf-8 unibyte encodings
that callers wish to "hexify as is".  please correct me if these
assumptions do not hold.  on the other hand, if they do hold, how about:

(defun ... (string &optional unibyte-as-is-p)
   ...
   (if (or (multibyte-string-p string)
           (not unibyte-as-is-p))
       (encode-coding-string string 'utf-8 t)
     string)
   ...)

?

this way, RFC-compliance is the default, but suppressing the conversion to
utf-8 is still possible for unibyte strings by specifying UNIBYTE-AS-IS-P.
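
filled in, the proposal might look like the sketch below.  the names
demo-hexify-as-is and demo-unreserved-chars are hypothetical, and the
unreserved set is a stand-in rather than the real
`url-unreserved-chars'.  the examples avoid the disputed case (a
non-ASCII unibyte string without the flag), since what
`encode-coding-string' does with unibyte input is exactly what is
being debated here.

```elisp
;; Illustrative stand-in for `url-unreserved-chars' (an assumption).
(defvar demo-unreserved-chars
  (append "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_.~" nil)
  "List of characters left unescaped by the demo functions.")

(defun demo-hexify-as-is (string &optional unibyte-as-is-p)
  "Percent-encode STRING, converting to utf-8 first by default.
If UNIBYTE-AS-IS-P is non-nil and STRING is unibyte, skip the conversion."
  (mapconcat (lambda (byte)
               (if (memq byte demo-unreserved-chars)
                   (char-to-string byte)
                 (format "%%%02x" byte)))
             (if (or (multibyte-string-p string)
                     (not unibyte-as-is-p))
                 (encode-coding-string string 'utf-8 t)
               string)
             ""))

(demo-hexify-as-is "caf\u00e9")    ; => "caf%c3%a9"
(demo-hexify-as-is "caf\351" t)    ; => "caf%e9"
```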

thi


* Re: [davidsmith@acm.org: [patch] url-hexify-string does not follow W3C spec]
  2006-08-01 14:47                 ` Thien-Thi Nguyen
@ 2006-08-01 15:10                   ` Stefan Monnier
  2006-08-01 15:14                     ` David Kastrup
  2006-08-02  2:06                   ` YAMAMOTO Mitsuharu
  1 sibling, 1 reply; 22+ messages in thread
From: Stefan Monnier @ 2006-08-01 15:10 UTC (permalink / raw
  Cc: David Smith, emacs-devel, YAMAMOTO Mitsuharu, Jason Rumney

>    (if (or (multibyte-string-p string)
>            (not unibyte-as-is-p))
>        (encode-coding-string string 'utf-8 t)

Encoding a unibyte string doesn't make any sense (IMHO it should
signal an error, and indeed in my locally hacked Emacs it does ;-).


        Stefan


* Re: [davidsmith@acm.org: [patch] url-hexify-string does not follow W3C spec]
  2006-08-01 15:10                   ` Stefan Monnier
@ 2006-08-01 15:14                     ` David Kastrup
  2006-08-01 15:54                       ` Stefan Monnier
  2006-08-09  3:48                       ` Kenichi Handa
  0 siblings, 2 replies; 22+ messages in thread
From: David Kastrup @ 2006-08-01 15:14 UTC (permalink / raw
  Cc: David Smith, Thien-Thi Nguyen, Jason Rumney, YAMAMOTO Mitsuharu,
	emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>    (if (or (multibyte-string-p string)
>>            (not unibyte-as-is-p))
>>        (encode-coding-string string 'utf-8 t)
>
> Encoding a unibyte string doesn't make any sense (IMHO it should
> signal an error, and indeed in my locally hacked Emacs it does ;-).

It sounds to me like this would be a sensible setting, and we should
have it before entering pretest: we would like to catch the cases
where this happens.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: [davidsmith@acm.org: [patch] url-hexify-string does not follow W3C spec]
  2006-08-01 15:14                     ` David Kastrup
@ 2006-08-01 15:54                       ` Stefan Monnier
  2006-08-01 16:07                         ` David Kastrup
  2006-08-09  3:48                       ` Kenichi Handa
  1 sibling, 1 reply; 22+ messages in thread
From: Stefan Monnier @ 2006-08-01 15:54 UTC (permalink / raw
  Cc: David Smith, Thien-Thi Nguyen, Jason Rumney, YAMAMOTO Mitsuharu,
	emacs-devel

>>>    (if (or (multibyte-string-p string)
>>>            (not unibyte-as-is-p))
>>>        (encode-coding-string string 'utf-8 t)
>> 
>> Encoding a unibyte string doesn't make any sense (IMHO it should
>> signal an error, and indeed in my locally hacked Emacs it does ;-).

> It sounds to me like this would be a sensible setting, and we should
> have it before entering pretest: we would like to catch the cases
> where this happens.

IIRC it breaks Gnus and fixing it requires non-trivial changes.
But maybe I'm confusing it with some other local change of mine (e.g. maybe
the problem is only that Gnus decodes multibyte text, but not that it
encodes unibyte strings).   Try it.


        Stefan


* Re: [davidsmith@acm.org: [patch] url-hexify-string does not follow W3C spec]
  2006-08-01 15:54                       ` Stefan Monnier
@ 2006-08-01 16:07                         ` David Kastrup
  0 siblings, 0 replies; 22+ messages in thread
From: David Kastrup @ 2006-08-01 16:07 UTC (permalink / raw
  Cc: David Smith, Thien-Thi Nguyen, Jason Rumney, YAMAMOTO Mitsuharu,
	emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>>>    (if (or (multibyte-string-p string)
>>>>            (not unibyte-as-is-p))
>>>>        (encode-coding-string string 'utf-8 t)
>>> 
>>> Encoding a unibyte string doesn't make any sense (IMHO it should
>>> signal an error, and indeed in my locally hacked Emacs it does ;-).
>
>> It sounds to me like this would be a sensible setting, and we should
>> have it before entering pretest: we would like to catch the cases
>> where this happens.
>
> IIRC it breaks Gnus and fixing it requires non-trivial changes.

But it sounds like absent those changes it would be doubtful that
Gnus actually does the correct thing.  It is probably something which
muddles through if there is enough accidental overlap between latin1,
MULE encoding and other systems.

> But maybe I'm confusing it with some other local change of mine
> (e.g. maybe the problem is only that Gnus decodes multibyte text,
> but not that it encodes unibyte strings).  Try it.

Probably easy to do using advice.  But I am at the moment busy
finishing something else.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: [davidsmith@acm.org: [patch] url-hexify-string does not follow W3C spec]
  2006-08-01 14:47                 ` Thien-Thi Nguyen
  2006-08-01 15:10                   ` Stefan Monnier
@ 2006-08-02  2:06                   ` YAMAMOTO Mitsuharu
  1 sibling, 0 replies; 22+ messages in thread
From: YAMAMOTO Mitsuharu @ 2006-08-02  2:06 UTC (permalink / raw
  Cc: David Smith, emacs-devel, Stefan Monnier, Jason Rumney

>>>>> On 01 Aug 2006 10:47:07 -0400, Thien-Thi Nguyen <ttn@gnu.org> said:

> conversion to utf-8 is per the RFC, which seems to be the primary
> context for this function; avoiding that conversion means
> noncompliance w/ the RFC.

Do you mean RFC 3986 by "the RFC"?  IIUC, it refers to UTF-8 in the
following 3 parts:

  * 2.5 Identifying Data, 3rd paragraph
    How to interpret an unreserved character as an octet.  (It also
    refers to other supersets of the US-ASCII character encoding).

  * 2.5 Identifying Data, last paragraph
    Encoding of a URI component that represents *textual data*
    consisting of characters from UCS for *a new URI scheme*.

  * 3.2.2 Host
    Encoding of a registered name that represents a host.

So, I don't think that avoiding UTF-8 conversion for non-textual data
or for a URI scheme that has existed as of RFC 3986 deviates from the
RFC.

				     YAMAMOTO Mitsuharu
				mituharu@math.s.chiba-u.ac.jp


* Re: [davidsmith@acm.org: [patch] url-hexify-string does not follow W3C spec]
  2006-08-01 15:14                     ` David Kastrup
  2006-08-01 15:54                       ` Stefan Monnier
@ 2006-08-09  3:48                       ` Kenichi Handa
  1 sibling, 0 replies; 22+ messages in thread
From: Kenichi Handa @ 2006-08-09  3:48 UTC (permalink / raw
  Cc: davidsmith, emacs-devel, monnier, ttn, jasonr, mituharu

In article <85k65sxzcq.fsf@lola.goethe.zz>, David Kastrup <dak@gnu.org> writes:

> Stefan Monnier <monnier@iro.umontreal.ca> writes:
>>>    (if (or (multibyte-string-p string)
>>>            (not unibyte-as-is-p))
>>>        (encode-coding-string string 'utf-8 t)
>> 
>> Encoding a unibyte string doesn't make any sense (IMHO it should
>> signal an error, and indeed in my locally hacked Emacs it does ;-).

> It sounds to me like this would be a sensible setting, and we should
> have it before entering pretest: we would like to catch the cases
> where this happens.

This situation has been there for a long time (since Emacs 20).  I
don't want to touch this part before the release.  Emacs 23
is a good chance to clean this matter up.

---
Kenichi Handa
handa@m17n.org
