From: Eli Zaretskii
To: Boruch Baum
Cc: emacs-devel@gnu.org
Newsgroups: gmane.emacs.devel
Subject: Re: fixing url-unhex-string for unicode/multi-byte charsets
Date: Fri, 06 Nov 2020 15:34:01 +0200
Message-ID: <83lffe8v92.fsf@gnu.org>

> Date: Fri, 6 Nov 2020 07:28:46 -0500
> From: Boruch Baum
> Cc: emacs-devel@gnu.org
>
> > A stand-alone test case, which doesn't require an actual trash, would
> > be appreciated, so I could see which part doesn't work, and how to
> > fix it.
>
> That would be the two file names that I previously posted. You say that
> they succeeded for you, but they didn't for me. The result I got was
> good for the first case (English two words), and garbage for the second
> case (Hebrew two words).

I tried that before posting the suggestion.  FTR, the below works for
me on the current emacs-27 branch and on master, both on MS-Windows
(where I used a literal 'utf-8 instead of file-name-coding-system) and
on GNU/Linux:

  (dolist (str '("hello%20world"
                 "%d7%a9%d7%9c%d7%95%d7%9d%20%d7%a2%d7%95%d7%9c%d7%9d"))
    (insert (decode-coding-string
             (url-unhex-string str)
             (or file-name-coding-system
                 default-file-name-coding-system))
            "\n"))

The result of evaluating this is two lines inserted into the current
buffer:

  hello world
  שלום עולם

If this doesn't work for you, or if you tried something slightly
different, I'd like to hear the details; perhaps there's some subtlety
I'm missing.

> > Alternatively, maybe you could explain why you needed to insert the
> > text into a temporary buffer and then extract it from there?  AFAIK,
> > we have the same primitives that work on decoding strings as we have
> > for decoding buffer text.
>
> I don't need to.
> It's the implementation done in emacs-w3m. I also pointed
> out that eww does it differently. I think the need in emacs-w3m is to
> mix the ascii characters and selected binary output, which can't be done
> with say replace-regexp-in-string. So what they do is use a temporary
> buffer, set `buffer-multibyte' to nil, and instead of
> replace-regexp-in-string build the result in the temporary buffer.

As a rule of thumb, any Lisp code that needs to do something with a
string, and does that by inserting it into a temporary buffer and
working on that instead, should raise the "missing primitive" alarm.
In this case, I see no missing primitives for decoding a string, so
using a temp buffer looks like an unnecessary complication to me.
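
To make the point concrete, here is a minimal sketch of a string-only
helper that wraps the two calls used above, with no temporary buffer
involved.  The name `my-url-unhex-decode' and its optional CODING
argument are made up for illustration and are not part of Emacs; the
final fallback to utf-8 is likewise an assumption of the sketch.

```elisp
(require 'url-util)   ; provides `url-unhex-string'

;; Hypothetical helper, for illustration only; not an existing Emacs function.
(defun my-url-unhex-decode (str &optional coding)
  "Percent-decode STR, then decode the resulting bytes using CODING.
If CODING is nil, use the file-name coding system, falling back to
utf-8 (the fallback is an assumption of this sketch)."
  (decode-coding-string
   (url-unhex-string str)
   (or coding
       file-name-coding-system
       default-file-name-coding-system
       'utf-8)))

;; Usage: pass 'utf-8 explicitly so the result does not depend on the
;; locale or on the file-name coding system in effect.
(my-url-unhex-decode "hello%20world" 'utf-8)
(my-url-unhex-decode "%d7%a9%d7%9c%d7%95%d7%9d%20%d7%a2%d7%95%d7%9c%d7%9d" 'utf-8)
```

With an explicit 'utf-8 these two calls should return the same two
strings as the test above.  Everything operates on strings from start
to finish, which is the reason I don't see what a temp buffer would
buy here.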