From: Eli Zaretskii <eliz@gnu.org>
To: Boruch Baum <boruch_baum@gmx.com>
Cc: emacs-devel@gnu.org
Subject: Re: fixing url-unhex-string for unicode/multi-byte charsets
Date: Fri, 06 Nov 2020 10:02:55 +0200
Message-ID: <83wnyy9akw.fsf@gnu.org>
In-Reply-To: <20201106074742.jq3h4uujm7oce7af@E15-2016.optimum.net> (message from Boruch Baum on Fri, 6 Nov 2020 02:47:42 -0500)

> Date: Fri, 6 Nov 2020 02:47:42 -0500
> From: Boruch Baum <boruch_baum@gmx.com>
> 
> In the thread "Friendlier dired experience", Michael Albinus noted that
> the new emacs feature to place remote files in the local trash performs
> hex-encoding on remote file-names as if they were URLs, which led me to
> discover that was also happening for local files encoded in multi-byte
> (eg. unicode) character-set encodings. Neither of these cases were being
> properly handled by the current emacs function `url-unhex-string'. We
> noticed this for the case of restoring a trashed file, but it can be
> expected to exhibit in other cases.

I see no problem in url-unhex-string, because its job is very simple:
convert hex codes into bytes with the same value.  It doesn't know
what to do with the result because it has no idea what the string
stands for: it could be a file name, or some text, or anything else.
The details of the rules for decoding each kind of string vary a
little, so for optimal results the caller should apply the rules that
are relevant.
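
As a minimal sketch of that division of labor (the sample string and
the choice of utf-8 are assumptions made only for illustration),
url-unhex-string yields the raw bytes and the caller then applies
whichever decoding suits the kind of string it has:

```elisp
(require 'url-util)

;; "%C3%A9" is the percent-encoded UTF-8 form of "é" (U+00E9).
(let* ((raw (url-unhex-string "caf%C3%A9"))       ; bytes only: c a f #xC3 #xA9
       (text (decode-coding-string raw 'utf-8)))  ; the caller picks the rule
  (list (length raw) (length text)))
;; => (5 4)
```

The first number counts bytes, the second counts characters after
decoding.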

> I've solved the problem for diredc, using code from the emacs-w3m
> project (thanks). Whether for the general emacs case it should be
> handled by altering function `url-unhex-string', or whether a second
> function should be created isn't for me to decide, so here's my fix for
> you to discuss, decide, apply.

I made some suggestions in that discussion; I will repeat some of them
here:

>     (with-temp-buffer
>       (set-buffer-multibyte nil)
>       (while (string-match regexp str start)
>         (insert (substring str start (match-beginning 0))
>                 (if (match-beginning 1)
>                     (string-to-number (match-string 1 str) 16)
>                   ?\n))
>         (setq start (match-end 0)))
>       (insert (substring str start))
>       (decode-coding-string
>        (buffer-string)
>        (with-coding-priority nil
>          (car (detect-coding-region (point-min) (point-max))))))))

There's no need to insert the string into a buffer and then decode it.
If you did that because you wanted to invoke detect-coding-region,
we have detect-coding-string as well.  If it was because you wanted
to make sure you work with unibyte text, url-unhex-string already
returns a unibyte string.
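
For illustration only (the sample input is an assumption), detection
can be run directly on the unhexed string, with no temporary buffer
involved:

```elisp
(require 'url-util)

(let ((raw (url-unhex-string "caf%C3%A9")))
  ;; With a non-nil second argument, detect-coding-string returns only
  ;; the highest-priority candidate instead of the whole list.
  (detect-coding-string raw t))
```

Which coding system comes back depends on the current coding
priorities, so no particular result is shown here.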

The use of detect-coding-region/string in this case is also
sub-optimal: depending on the exact content of the string, it can fail
to detect the correct encoding when more than one encoding is valid
for the bytes.  By contrast, variables like file-name-coding-system
already tell us how to decode file names, and they are used all the
time in Emacs, so they are almost certainly correct (if they weren't,
lots of stuff in Emacs would break).
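
A small sketch of that ambiguity (the byte pair is chosen only as an
example): the same two bytes are valid in both UTF-8 and Latin-1, and
nothing in the bytes themselves says which reading was intended.

```elisp
;; The two bytes #xC3 #xA9, written as a unibyte string literal.
(let ((raw "\303\251"))
  (list (decode-coding-string raw 'utf-8)         ; one character, U+00E9
        (decode-coding-string raw 'iso-8859-1)))  ; two characters, U+00C3 U+00A9
```

Both calls succeed, which is why a detector has to guess and why a
variable that already records the intended encoding is more reliable.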

So, for file names, something like the below should do the job more
simply:

  (decode-coding-string (url-unhex-string STR)
                        (or file-name-coding-system
                            (default-value 'file-name-coding-system)))
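
Wrapped up as a function, the expression above might look like the
following sketch.  The name my-unhex-file-name is hypothetical and is
not an existing Emacs function; the let-binding in the usage example
is there only to make the result independent of the locale.

```elisp
(require 'url-util)

;; `my-unhex-file-name' is a hypothetical name used for illustration.
(defun my-unhex-file-name (str)
  "Return STR with %XX sequences unhexed and decoded as a file name."
  (decode-coding-string (url-unhex-string str)
                        (or file-name-coding-system
                            (default-value 'file-name-coding-system))))

;; Usage, with the file-name coding system bound explicitly:
(let ((file-name-coding-system 'utf-8))
  (my-unhex-file-name "caf%C3%A9.txt"))
```

Under that binding the call should return the file name with "é" as a
single character rather than two raw bytes.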


