From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: fixing url-unhex-string for unicode/multi-byte charsets
Date: Fri, 06 Nov 2020 10:02:55 +0200
Message-ID: <83wnyy9akw.fsf@gnu.org>
References: <20201106074742.jq3h4uujm7oce7af@E15-2016.optimum.net>
In-Reply-To: <20201106074742.jq3h4uujm7oce7af@E15-2016.optimum.net> (message from Boruch Baum on Fri, 6 Nov 2020 02:47:42 -0500)
To: Boruch Baum
Cc: emacs-devel@gnu.org

> Date: Fri, 6 Nov 2020 02:47:42 -0500
> From: Boruch Baum
>
> In the thread "Friendlier dired experience", Michael Albinus noted that
> the new emacs feature to place remote files in the local trash performs
> hex-encoding on remote file-names as if they were URLs, which led me to
> discover that was also happening for local files encoded in multi-byte
> (eg. unicode) character-set encodings. Neither of these cases were being
> properly handled by the current emacs function `url-unhex-string'. We
> noticed this for the case of restoring a trashed file, but it can be
> expected to exhibit in other cases.

I see no problem in url-unhex-string, because its job is very simple:
convert hex codes into bytes with the same value.  It doesn't know
what to do with the result because it has no idea what the string
stands for: it could be a file name, or some text, or anything else.
The details of the rules for decoding each kind of string vary a
little, so for optimal results the caller should apply the rules that
are relevant.

> I've solved the problem for diredc, using code from the emacs-w3m
> project (thanks). Whether for the general emacs case it should be
> handled by altering function `url-unhex-string', or whether a second
> function should be created isn't for me to decide, so here's my fix for
> you to discuss, decide, apply.

I made several suggestions in that discussion; I will repeat some of
them here:

>     (with-temp-buffer
>       (set-buffer-multibyte nil)
>       (while (string-match regexp str start)
>         (insert (substring str start (match-beginning 0))
>                 (if (match-beginning 1)
>                     (string-to-number (match-string 1 str) 16)
>                   ?\n))
>         (setq start (match-end 0)))
>       (insert (substring str start))
>       (decode-coding-string
>         (buffer-string)
>         (with-coding-priority nil
>           (car (detect-coding-region (point-min) (point-max))))))))

There's no need to insert the string into a buffer and then decode it.
It sounds like you did that because you wanted to invoke
detect-coding-region, but we have detect-coding-string as well.  Or
maybe it was because you wanted to make sure you work with unibyte
text, but url-unhex-string already returns a unibyte string.

The use of detect-coding-region/string in this case is also
sub-optimal: depending on the exact content of the string, it can fail
to detect the correct encoding if more than one encoding can support
the bytes.  By contrast, variables like file-name-coding-system
already tell us how to decode file names, and they are used all the
time in Emacs, so they are almost certainly correct (if they aren't,
lots of stuff in Emacs will break).

So, for file names, something like the below should do the job more
simply:

  (decode-coding-string (url-unhex-string STR)
                        (or file-name-coding-system
                            default-file-name-coding-system))
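
For illustration only, here is a minimal sketch of the file-name case
discussed above.  The helper name my-unhex-file-name is hypothetical
(it is not part of Emacs or diredc), and the sample string is made up.
The fallback to default-file-name-coding-system is the variable Emacs
consults when file-name-coding-system is nil.  The second form shows
why guessing the encoding can go wrong: the same bytes decode without
error under more than one coding system.

```elisp
(require 'url-util)

;; Hypothetical helper, named here only for illustration.
(defun my-unhex-file-name (str)
  "Return percent-encoded STR decoded as a file name.
The unhexed bytes are decoded with the file-name coding system."
  (decode-coding-string (url-unhex-string str)
                        (or file-name-coding-system
                            default-file-name-coding-system)))

;; url-unhex-string only turns %XX sequences into raw bytes; what
;; those bytes mean depends on the coding system the caller applies.
(let ((bytes (url-unhex-string "caf%C3%A9.txt")))
  (list (decode-coding-string bytes 'utf-8)        ; "café.txt"
        (decode-coding-string bytes 'iso-8859-1))) ; "cafÃ©.txt"
```

Callers that deal with something other than file names (say, text
from a Web page) would pass whatever coding system applies to that
kind of data instead.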