* fixing url-unhex-string for unicode/multi-byte charsets
From: Boruch Baum @ 2020-11-06  7:47 UTC
  To: Emacs-Devel List

In the thread "Friendlier dired experience", Michael Albinus noted that
the new emacs feature that places remote files in the local trash
performs hex-encoding on remote file names as if they were URLs, which
led me to discover that the same was happening for local file names in
multi-byte (e.g. unicode) character-set encodings. Neither of these
cases was being handled properly by the current emacs function
`url-unhex-string'. We noticed this in the case of restoring a trashed
file, but it can be expected to show up in other cases.

I've solved the problem for diredc, using code from the emacs-w3m
project (thanks). Whether the general emacs case should be handled by
altering the function `url-unhex-string', or by creating a second
function, isn't for me to decide, so here's my fix for you to discuss,
decide on, and apply.

--8<--cut here-(start)------------------------------------------- >8
(defun diredc--decode-hexlated-string (str)
  "Convert hexlated STR to human-readable form, with charset coding support.
This function improves upon `url-unhex-string' by handling
hexlated multi-byte and unicode characters.  Credit to the
`emacs-w3m' project for the core code, in
`w3m-url-decode-string'."
  ;; NOTE: This technique should be used by `url-unhex-string' itself,
  ;;       or otherwise integrated into emacs.
  (let ((start 0)
        (case-fold-search t)
        (regexp "%\\(?:\\([0-9a-f][0-9a-f]\\)\\|0d%0a\\)"))
    (with-temp-buffer
      (set-buffer-multibyte nil)
      ;; Replace each %XX with the byte it encodes (and %0D%0A with a
      ;; newline), accumulating the raw bytes in a unibyte buffer.
      (while (string-match regexp str start)
        (insert (substring str start (match-beginning 0))
                (if (match-beginning 1)
                    (string-to-number (match-string 1 str) 16)
                  ?\n))
        (setq start (match-end 0)))
      (insert (substring str start))
      ;; Decode the accumulated bytes with the best-guess coding system.
      (decode-coding-string
       (buffer-string)
       (with-coding-priority nil
         (car (detect-coding-region (point-min) (point-max))))))))
--8<--cut here-(end)--------------------------------------------- >8
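
For example, with the hexlated UTF-8 bytes of two Hebrew words (a
sketch, not part of the fix; it assumes the coding-system detection
settles on utf-8 for these bytes):

  (diredc--decode-hexlated-string
   "%d7%a9%d7%9c%d7%95%d7%9d%20%d7%a2%d7%95%d7%9c%d7%9d")
    => "שלום עולם"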

--
hkp://keys.gnupg.net
CA45 09B5 5351 7C11 A9D1  7286 0036 9E45 1595 8BC0




* fixing url-unhex-string for unicode/multi-byte charsets
From: Boruch Baum @ 2020-11-06  7:54 UTC
  To: Emacs-Devel List

Katsumi Yamaoka at the emacs-w3m project points out that emacs has a
function `eww-decode-url-file-name' that solves this issue. Maybe that
function should become the canonical emacs solution?

--
hkp://keys.gnupg.net
CA45 09B5 5351 7C11 A9D1  7286 0036 9E45 1595 8BC0




* Re: fixing url-unhex-string for unicode/multi-byte charsets
From: Eli Zaretskii @ 2020-11-06  8:02 UTC
  To: Boruch Baum; +Cc: emacs-devel

> Date: Fri, 6 Nov 2020 02:47:42 -0500
> From: Boruch Baum <boruch_baum@gmx.com>
> 
> In the thread "Friendlier dired experience", Michael Albinus noted that
> the new emacs feature that places remote files in the local trash
> performs hex-encoding on remote file names as if they were URLs, which
> led me to discover that the same was happening for local file names in
> multi-byte (e.g. unicode) character-set encodings. Neither of these
> cases was being handled properly by the current emacs function
> `url-unhex-string'. We noticed this in the case of restoring a trashed
> file, but it can be expected to show up in other cases.

I see no problem in url-unhex-string, because its job is very simple:
convert hex codes into bytes with the same value.  It doesn't know
what to do with the result because it has no idea what the string
stands for: it could be a file name, or some text, or anything else.
The details of the rules for decoding each kind of string vary a
little, so for optimal results the caller should apply the rules that
are relevant.
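
To illustrate (a minimal sketch; the Hebrew example and the choice of
utf-8 here are mine, not something url-unhex-string can know):

  (url-unhex-string "%d7%a9%d7%9c%d7%95%d7%9d")
    => "\327\251\327\234\327\225\327\235"  ; raw unibyte bytes
  (decode-coding-string (url-unhex-string "%d7%a9%d7%9c%d7%95%d7%9d")
                        'utf-8)
    => "שלום"                              ; the caller applies the decoding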

> I've solved the problem for diredc, using code from the emacs-w3m
> project (thanks). Whether the general emacs case should be handled by
> altering the function `url-unhex-string', or by creating a second
> function, isn't for me to decide, so here's my fix for you to discuss,
> decide on, and apply.

I made some suggestions in that discussion; I will repeat some of them
here:

>     (with-temp-buffer
>       (set-buffer-multibyte nil)
>       (while (string-match regexp str start)
>         (insert (substring str start (match-beginning 0))
>                 (if (match-beginning 1)
>                     (string-to-number (match-string 1 str) 16)
>                   ?\n))
>         (setq start (match-end 0)))
>       (insert (substring str start))
>       (decode-coding-string
>        (buffer-string)
>        (with-coding-priority nil
>          (car (detect-coding-region (point-min) (point-max))))))))

There's no need to insert the string into a buffer and then decode it.
It sounds like you did that because you wanted to invoke
detect-coding-region?  But then we have detect-coding-string as well.
Or maybe it was because you wanted to make sure you work with unibyte
text?  But then url-unhex-string already returns a unibyte string.

The use of detect-coding-region/string in this case is also
sub-optimal: depending on the exact contents of the string, it can fail
to detect the correct encoding, if more than one encoding can support
the bytes.  By contrast, variables like file-name-coding-system already
tell us how to decode file names, and they are used all the time in
Emacs, so they are almost certainly correct (if they aren't, lots of
stuff in Emacs will break).

So, for file names, something like the below should do the job more
simply:

  (decode-coding-string (url-unhex-string STR)
                        (or file-name-coding-system
			    (default-value 'file-name-coding-system)))




* Re: fixing url-unhex-string for unicode/multi-byte charsets
From: Eli Zaretskii @ 2020-11-06  8:05 UTC
  To: Boruch Baum; +Cc: emacs-devel

> Date: Fri, 6 Nov 2020 02:54:57 -0500
> From: Boruch Baum <boruch_baum@gmx.com>
> 
> Katsumi Yamaoka at the emacs-w3m project points out that emacs has a
> function `eww-decode-url-file-name' that solves this issue. Maybe that
> function should become the canonical emacs solution?

eww-decode-url-file-name solves a slightly different problem: URLs we
find in Web pages.  There, the encoding is predominantly UTF-8, so we
mainly use that, and fall back to other possibilities as backup.

I believe in the trash case we already know these are file names, so
at least some of what eww-decode-url-file-name does is unnecessary,
IMO.
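
For reference, a usage sketch (assuming eww is loaded, and that the
bytes happen to decode as UTF-8):

  (require 'eww)
  (eww-decode-url-file-name
   "%d7%a9%d7%9c%d7%95%d7%9d%20%d7%a2%d7%95%d7%9c%d7%9d")
    => "שלום עולם"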




* Re: fixing url-unhex-string for unicode/multi-byte charsets
From: Boruch Baum @ 2020-11-06 10:27 UTC
  To: Eli Zaretskii; +Cc: emacs-devel

On 2020-11-06 10:02, Eli Zaretskii wrote:
> I made some suggestions in that discussion; I will repeat some of them
> here:

Yeah, but they don't work.

> So, for file names, something like the below should do the job more
> simply:
>
>   (decode-coding-string (url-unhex-string STR)
>                         (or file-name-coding-system
> 			    (default-value 'file-name-coding-system)))

Try it. To reproduce, touch and then trash a file whose name is two
Hebrew words delimited by a space. Navigate to the trash directory's
'info' sub-directory and extract the 'Path' value from the file's
meta-data .trashinfo file. That's the string we need to decode. Feed
the string to your solution and see that you do not get back the two
space-delimited Hebrew words.
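
Here's a sketch of that extraction in Lisp (NAME.trashinfo is a
placeholder; it assumes the usual Freedesktop layout,
~/.local/share/Trash/info/NAME.trashinfo, containing a Path= line):

  (with-temp-buffer
    (insert-file-contents "~/.local/share/Trash/info/NAME.trashinfo")
    (when (re-search-forward "^Path=\\(.*\\)$" nil t)
      (match-string 1)))  ; the hexlated string to decode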

--
hkp://keys.gnupg.net
CA45 09B5 5351 7C11 A9D1  7286 0036 9E45 1595 8BC0




* Re: fixing url-unhex-string for unicode/multi-byte charsets
From: Boruch Baum @ 2020-11-06 10:34 UTC
  To: Eli Zaretskii; +Cc: emacs-devel

On 2020-11-06 10:05, Eli Zaretskii wrote:
> > Date: Fri, 6 Nov 2020 02:54:57 -0500
> > From: Boruch Baum <boruch_baum@gmx.com>
> >
> > Katsumi Yamaoka at the emacs-w3m project points out that emacs has a
> > function `eww-decode-url-file-name' that solves this issue. Maybe that
> > function should become the canonical emacs solution?
>
> eww-decode-url-file-name solves a slightly different problem: URLs we
> find in Web pages.  There, the encoding is predominantly UTF-8, so we
> mainly use that, and fall back to other possibilities as backup.
>
> I believe in the trash case we already know these are file names, so
> at least some of what eww-decode-url-file-name does is unnecessary,
> IMO.

This all started from Arthur Miller's observation that restoring a
'remote' file was failing. He said that it's a new feature in emacs:
one can trash a file over ssh or some other protocol, and the file is
trashed to your local file system. In that case, Arthur pointed out to
the list that the colon character of the protocol wasn't being decoded.
Once emacs needs to account for remotes, it needs to account for the
protocols and URLs of those remotes.

--
hkp://keys.gnupg.net
CA45 09B5 5351 7C11 A9D1  7286 0036 9E45 1595 8BC0




* Re: fixing url-unhex-string for unicode/multi-byte charsets
From: Eli Zaretskii @ 2020-11-06 12:04 UTC
  To: Boruch Baum; +Cc: emacs-devel

> Date: Fri, 6 Nov 2020 05:27:56 -0500
> From: Boruch Baum <boruch_baum@gmx.com>
> Cc: emacs-devel@gnu.org
> 
> > I made some suggestions in that discussion; I will repeat some of them
> > here:
> 
> Yeah, but they don't work.

I said "something like that", because I don't know the full context.
If "don't work" means "needs minor adaptations", the suggestions are
still valid.

> > So, for file names, something like the below should do the job more
> > simply:
> >
> >   (decode-coding-string (url-unhex-string STR)
> >                         (or file-name-coding-system
> > 			    (default-value 'file-name-coding-system)))
> 
> Try it.

I can't, not in full: I don't have a Freedesktop trash anywhere I have
access to.  I did try the 2 file names you posted, including the one
with Hebrew characters, and it did work for me, on the assumption that
file-name-coding-system is UTF-8.

> To reproduce, touch and then trash a file whose name is two Hebrew
> words delimited by a space. Navigate to the trash directory's 'info'
> sub-directory and extract the 'Path' value from the file's meta-data
> .trashinfo file. That's the string we need to decode. Feed the string
> to your solution and see that you do not get back the two
> space-delimited Hebrew words.

A stand-alone test case, which doesn't require an actual trash, would
be appreciated, so I could see which part doesn't work and how to
fix it.

Alternatively, maybe you could explain why you needed to insert the
text into a temporary buffer and then extract it from there?  AFAIK,
we have the same primitives that work on decoding strings as we have
for decoding buffer text.




* Re: fixing url-unhex-string for unicode/multi-byte charsets
From: Eli Zaretskii @ 2020-11-06 12:06 UTC
  To: Boruch Baum; +Cc: emacs-devel

> Date: Fri, 6 Nov 2020 05:34:46 -0500
> From: Boruch Baum <boruch_baum@gmx.com>
> Cc: emacs-devel@gnu.org
> 
> > I believe in the trash case we already know these are file names, so
> > at least some of what eww-decode-url-file-name does is unnecessary,
> > IMO.
> 
> This all started from Arthur Miller's observation that restoring a
> 'remote' file was failing. He said that it's a new feature in emacs:
> one can trash a file over ssh or some other protocol, and the file is
> trashed to your local file system. In that case, Arthur pointed out to
> the list that the colon character of the protocol wasn't being decoded.
> Once emacs needs to account for remotes, it needs to account for the
> protocols and URLs of those remotes.

A remote file name is not a URL, especially not when we talk about
encoding non-ASCII characters.  The conventions and the defaults are
different.




* Re: fixing url-unhex-string for unicode/multi-byte charsets
From: Boruch Baum @ 2020-11-06 12:28 UTC
  To: Eli Zaretskii; +Cc: emacs-devel

On 2020-11-06 14:04, Eli Zaretskii wrote:
> > Date: Fri, 6 Nov 2020 05:27:56 -0500
> > From: Boruch Baum <boruch_baum@gmx.com>
> > Cc: emacs-devel@gnu.org

> I can't, not in full: I don't have a Freedesktop trash anywhere I have
> access to.  I did try the 2 file names you posted, including the one
> with Hebrew characters, and it did work for me, on the assumption that
> file-name-coding-system is UTF-8.
>
> > To reproduce, touch and then trash a file whose name is two Hebrew
> > words delimited by a space. Navigate to the trash directory's 'info'
> > sub-directory and extract the 'Path' value from the file's meta-data
> > .trashinfo file. That's the string we need to decode. Feed the string
> > to your solution and see that you do not get back the two
> > space-delimited Hebrew words.
>
> A stand-alone test case, which doesn't require an actual trash, would
> be appreciated, so I could see which parrt doesn't work, and how to
> fix it.

That would be the two file names that I previously posted. You say that
they succeeded for you, but they didn't for me. The result I got was
good for the first case (two English words) and garbage for the second
case (two Hebrew words).

> Alternatively, maybe you could explain why you needed to insert the
> text into a temporary buffer and then extract it from there?  AFAIK,
> we have the same primitives that work on decoding strings as we have
> for decoding buffer text.

I don't need to; that's the implementation in emacs-w3m. I also pointed
out that eww does it differently. I think the need in emacs-w3m is to
mix ASCII characters with selected binary output, which can't be done
with, say, replace-regexp-in-string. So what they do is use a temporary
buffer, set `buffer-multibyte' to nil, and, instead of
replace-regexp-in-string, build the result in the temporary buffer.
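
For comparison, a string-only sketch of the same detection step,
without a temporary buffer (assuming detect-coding-string picks a
suitable coding system for the bytes):

  (let ((bytes (url-unhex-string "%d7%a9%d7%9c%d7%95%d7%9d")))
    (decode-coding-string bytes (car (detect-coding-string bytes))))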

--
hkp://keys.gnupg.net
CA45 09B5 5351 7C11 A9D1  7286 0036 9E45 1595 8BC0




* Re: fixing url-unhex-string for unicode/multi-byte charsets
From: Eli Zaretskii @ 2020-11-06 13:34 UTC
  To: Boruch Baum; +Cc: emacs-devel

> Date: Fri, 6 Nov 2020 07:28:46 -0500
> From: Boruch Baum <boruch_baum@gmx.com>
> Cc: emacs-devel@gnu.org
> 
> > A stand-alone test case, which doesn't require an actual trash, would
> > be appreciated, so I could see which part doesn't work and how to
> > fix it.
> 
> That would be the two file names that I previously posted. You say that
> they succeeded for you, but they didn't for me. The result I got was
> good for the first case (two English words) and garbage for the second
> case (two Hebrew words).

I tried that before posting the suggestion.  FTR, the below works for
me on the current emacs-27 branch and on master, both on MS-Windows
(where I used a literal 'utf-8 instead of file-name-coding-system)
and on GNU/Linux:

 (dolist (str '("hello%20world"
                "%d7%a9%d7%9c%d7%95%d7%9d%20%d7%a2%d7%95%d7%9c%d7%9d"))
   (insert (decode-coding-string (url-unhex-string str)
                                 (or file-name-coding-system
                                     default-file-name-coding-system))
           "\n"))

The result of evaluating this is two lines inserted into the current
buffer:

  hello world
  שלום עולם

If this doesn't work for you, or if you tried something slightly
different, I'd like to hear the details, perhaps there's some
subtlety I'm missing.

> > Alternatively, maybe you could explain why you needed to insert the
> > text into a temporary buffer and then extract it from there?  AFAIK,
> > we have the same primitives that work on decoding strings as we have
> > for decoding buffer text.
> 
> I don't need to; that's the implementation in emacs-w3m. I also pointed
> out that eww does it differently. I think the need in emacs-w3m is to
> mix ASCII characters with selected binary output, which can't be done
> with, say, replace-regexp-in-string. So what they do is use a temporary
> buffer, set `buffer-multibyte' to nil, and, instead of
> replace-regexp-in-string, build the result in the temporary buffer.

As a rule of thumb, any Lisp code that needs to do something with a
string and does that by inserting it into a temporary buffer and
working on that instead should raise the "missing primitive" alarm.
In this case, I see no missing primitives for decoding a string, so
using a temp buffer looks like an unnecessary complication to me.




* Re: fixing url-unhex-string for unicode/multi-byte charsets
From: Stefan Monnier @ 2020-11-06 14:38 UTC
  To: Boruch Baum; +Cc: Eli Zaretskii, emacs-devel

>> I made some suggestions in that discussion; I will repeat some of them
>> here:
> Yeah, but they don't work.

Could you try to figure out which part doesn't work as expected, and
describe in what way it fails (e.g. show the decoded result along with
the expected result)?
"Don't work" is not a very useful starting point for debugging a
problem, as we all know.


        Stefan





* Re: fixing url-unhex-string for unicode/multi-byte charsets
From: Stefan Monnier @ 2020-11-06 14:59 UTC
  To: Eli Zaretskii; +Cc: Boruch Baum, emacs-devel

>  (dolist (str '("hello%20world"
>                 "%d7%a9%d7%9c%d7%95%d7%9d%20%d7%a2%d7%95%d7%9c%d7%9d"))
>    (insert (decode-coding-string (url-unhex-string str)
>                                  (or file-name-coding-system
>                                      default-file-name-coding-system))
>            "\n"))
>
> The result of evaluating this is two lines inserted into the current
> buffer:
>
>   hello world
>   שלום עולם
>
> If this doesn't work for you, or if you tried something slightly
> different, I'd like to hear the details, perhaps there's some
> subtlety I'm missing.

My guess is that his `file-name-coding-system` is set to something
different from utf-8.
[ BTW, I wouldn't be surprised to hear that the Freedesktop spec
  documents that the file names in the Trash should use utf-8, in which
  case the code should hard-code utf-8 rather than use
  `file-name-coding-system` ;-)  ]

> As a rule of thumb, any Lisp code that needs to do something with a
> string and does that by inserting it into a temporary buffer and
> working on that instead should raise the "missing primitive" alarm.

Tho only for reasonably trivial amounts of "working on that" ;-)


        Stefan





* Re: fixing url-unhex-string for unicode/multi-byte charsets
From: Eli Zaretskii @ 2020-11-06 15:04 UTC
  To: Stefan Monnier; +Cc: boruch_baum, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Boruch Baum <boruch_baum@gmx.com>,  emacs-devel@gnu.org
> Date: Fri, 06 Nov 2020 09:59:02 -0500
> 
> >  (dolist (str '("hello%20world"
> >                 "%d7%a9%d7%9c%d7%95%d7%9d%20%d7%a2%d7%95%d7%9c%d7%9d"))
> >    (insert (decode-coding-string (url-unhex-string str)
> >                                  (or file-name-coding-system
> >                                      default-file-name-coding-system))
> >            "\n"))
> >
> > The result of evaluating this is two lines inserted into the current
> > buffer:
> >
> >   hello world
> >   שלום עולם
> >
> > If this doesn't work for you, or if you tried something slightly
> > different, I'd like to hear the details, perhaps there's some
> > subtlety I'm missing.
> 
> My guess is that his `file-name-coding-system` is set to something
> different from utf-8.
> [ BTW, I wouldn't be surprised to hear that the Freedesktop spec
>   documents that the file names in the Trash should use utf-8, in which
>   case the code should hard-code utf-8 rather than use
>   `file-name-coding-system` ;-)  ]

If the trash spec says it must be UTF-8, then yes, TRT is to use that
unconditionally.  But if the spec says nothing, I'd expect the file
names to be in whatever encoding they were on disk, which usually
should coincide with file-name-coding-system.




* Re: fixing url-unhex-string for unicode/multi-byte charsets
From: Boruch Baum @ 2020-11-08  9:12 UTC
  To: Eli Zaretskii; +Cc: Stefan Monnier, emacs-devel

On 2020-11-06 17:04, Eli Zaretskii wrote:
> > From: Stefan Monnier <monnier@iro.umontreal.ca>
> > Cc: Boruch Baum <boruch_baum@gmx.com>,  emacs-devel@gnu.org
> > Date: Fri, 06 Nov 2020 09:59:02 -0500
> >
> > My guess is that his `file-name-coding-system` is set to something
> > different from utf-8.

That's correct, kind of. The setting isn't 'mine'; it's the emacs
default. In both emacs 26.1 (debian) and emacs-snapshot (v28),
file-name-coding-system defaults to nil, and
default-file-name-coding-system defaults to utf-8-unix, so we have:

                     file-name-coding-system  => nil
   (default-value 'file-name-coding-system)  => nil
             default-file-name-coding-system  => utf-8-unix

Anybody besides me find that amusing? It reminds me of Bug report
#43294, in that both seem intentionally designed to cause confusion and
trip up developers.
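
So a fallback that actually yields a coding system here has to name the
second variable explicitly, as in the dolist test earlier in this
thread:

  (or file-name-coding-system default-file-name-coding-system)
    => utf-8-unix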

> > [ BTW, I wouldn't be surprised to hear that the Freedesktop spec
> >   documents that the file names in the Trash should use utf-8, in which
> >   case the code should hard-code utf-8 rather than use
> >   `file-name-coding-system` ;-)  ]
>
> If the trash spec says it must be UTF-8, then yes, TRT is to use that
> unconditionally.

The FreeDesktop.org Trash specification[1] says this about the trash
restore Path key:

   "The value type for this key is “string”; it SHOULD store the file
   name as the sequence of bytes produced by the file system, with
   characters escaped as in URLs (as defined by RFC 2396, section 2)."

The RFC says (section 2.1):

   "... there is currently no provision within the generic URI syntax to
   accomplish this identification ... It is expected that a systematic
   treatment of character encoding within URI will be developed as a
   future modification of this specification."

> But if the spec says nothing, I'd expect the file names to be in
> whatever encoding they were on disk, which usually should coincide
> with file-name-coding-system.

[1] https://specifications.freedesktop.org/trash-spec/trashspec-latest.html
[2] http://www.faqs.org/rfcs/rfc2396.html

--
hkp://keys.gnupg.net
CA45 09B5 5351 7C11 A9D1  7286 0036 9E45 1595 8BC0




* Re: fixing url-unhex-string for unicode/multi-byte charsets
From: Stefan Monnier @ 2020-11-08 13:39 UTC
  To: Boruch Baum; +Cc: Eli Zaretskii, emacs-devel

>> > My guess is that his `file-name-coding-system` is set to something
>> > different from utf-8.
>
> That's correct, kind of. The setting isn't 'mine'; it's the emacs
> default. In both emacs 26.1 (debian) and emacs-snapshot (v28),
> file-name-coding-system defaults to nil, and
> default-file-name-coding-system defaults to utf-8-unix, so we have:
>
>                      file-name-coding-system  => nil
>    (default-value 'file-name-coding-system)  => nil
>              default-file-name-coding-system  => utf-8-unix

This means your file name coding system *is* set to utf-8, so that
doesn't explain the problem.

>    "The value type for this key is “string”; it SHOULD store the file
>    name as the sequence of bytes produced by the file system, with
>    characters escaped as in URLs (as defined by RFC 2396, section 2)."

Thanks.  So we should indeed obey `file-name-coding-system`.


        Stefan





* Re: fixing url-unhex-string for unicode/multi-byte charsets
  2020-11-08  9:12               ` Boruch Baum
  2020-11-08 13:39                 ` Stefan Monnier
@ 2020-11-08 15:07                 ` Eli Zaretskii
  1 sibling, 0 replies; 16+ messages in thread
From: Eli Zaretskii @ 2020-11-08 15:07 UTC (permalink / raw)
  To: Boruch Baum; +Cc: monnier, emacs-devel

> Date: Sun, 8 Nov 2020 04:12:16 -0500
> From: Boruch Baum <boruch_baum@gmx.com>
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel@gnu.org
> 
> > > [ BTW, I wouldn't be surprised to hear that the Freedesktop spec
> > >   documents that the file names in the Trash should use utf-8, in which
> > >   case the code should hard-code utf-8 rather than use
> > >   `file-name-coding-system` ;-)  ]
> >
> > If the trash spec says it must be UTF-8, then yes, TRT is to use that
> > unconditionally.
> 
> The FreeDesktop.org Trash specification[1] says this about the trash
> restore Path key:
> 
>    "The value type for this key is “string”; it SHOULD store the file
>    name as the sequence of bytes produced by the file system, with
>    characters escaped as in URLs (as defined by RFC 2396, section 2)."
> 
> The RFC says (section 2.1):
> 
>    "... there is currently no provision within the generic URI syntax to
>    accomplish this identification ... It is expected that a systematic
>    treatment of character encoding within URI will be developed as a
>    future modification of this specification."

This means the Trash uses the same byte sequence as stored in the
filesystem, without imposing any encoding restrictions.  In which case
decoding with this:

   (or file-name-coding-system default-file-name-coding-system)

will produce the expected results.

Please note that in general you should be able to use the (unibyte)
string produced by url-unhex-string directly, without decoding it.  It
will work just fine if you pass it to APIs that expect file names; the
only disadvantage is that the file name will not be human-readable.
Depending on the application, this may or may not be a problem.
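
For instance (a sketch; the hexlated name is a made-up example):

  (let ((name (url-unhex-string "%d7%a9%d7%9c%d7%95%d7%9d")))
    (file-exists-p name))  ; works, though `name' prints as raw bytes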



