From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: Passing unicode filenames to start-process on Windows?
Date: Wed, 06 Jan 2016 18:13:35 +0200
Message-ID: <83si2a3cuo.fsf@gnu.org>
Reply-To: Eli Zaretskii
To: Klaus-Dieter Bauer
Cc: emacs-devel@gnu.org
In-reply-to: (message from Klaus-Dieter Bauer on Wed, 6 Jan 2016 16:20:29 +0100)
List-Id: "Emacs development discussions."
Xref: news.gmane.org gmane.emacs.devel:197705

> From: Klaus-Dieter Bauer
> Date: Wed, 6 Jan 2016 16:20:29 +0100
>
> Is there a reliable way to pass unicode file names as
> arguments through `start-process'?

No, not at the moment, not in the native Windows build of Emacs.
Arguments to subprocesses are forced to be encoded in the current
system codepage.  This commentary in w32.c tells a few more details:

   . Running subprocesses in non-ASCII directories and with non-ASCII
     file arguments is limited to the current codepage (even though
     Emacs is perfectly capable of finding an executable program file
     in a directory whose name cannot be encoded in the current
     codepage).  This is because the command-line arguments are
     encoded _before_ they get to the w32-specific level, and the
     encoding is not known in advance (it doesn't have to be the
     current ANSI codepage), so w32proc.c functions cannot re-encode
     them in UTF-16.  This should be fixed, but will also require
     changes in cmdproxy.  The current limitation is not terribly bad
     anyway, since very few, if any, Windows console programs that are
     likely to be invoked by Emacs support UTF-16 encoded command
     lines.

   . For similar reasons, server.el and emacsclient are also limited
     to the current ANSI codepage for now.

   . Emacs itself can only handle command-line arguments encoded in
     the current codepage.

The main reason for this being a low-priority problem is that the
vast majority of console programs Emacs might invoke don't support
UTF-16 encoded command-line arguments anyway, so the effort to enable
this would yield very little gain.  However, patches to do that will
be welcome.  (Note that, as the comment above says, the changes will
also need to touch cmdproxy, since we invoke all the programs through
it.)

> I realized two limitations:
>
> 1. Using `prefer-coding-system' with anything other than
>    `locale-default-encoding', e.g.
>        (prefer-coding-system 'utf-8),
>    causes a file name "Ö.txt" to be misdecoded by
>    subprocesses -- notably including "emacs.exe", but also
>    all other executables I tried (both Windows builtins like
>    where.exe and third party executables like ffmpeg.exe or
>    GnuWin32 utilities).
>    In my case (German locale, 'utf-8 preferred coding
>    system) it is mis-decoded as "Ã–.txt", i.e. emacs encodes
>    the process argument as 'utf-8 but the subprocess decodes
>    it as 'latin-1 (in my case).
>    While this can be fixed by an explicit encoding
>        (start-process ...
>           (encode-coding-string filename locale-coding-system))
>    such code will probably not be used in most projects, as
>    the issue occurs only on Windows, dependent on the user
>    configuration (-> hard-to-find bug?). I have added some
>    elisp for demonstration at the end of the mail.
>
> 2. When a file-name contains characters that cannot be
>    encoded in the locale's encoding, e.g. Japanese
>    characters in a German locale, I cannot find any way to
>    pass the file name through the `start-process' interface;
>    Unlike for characters that are supported by the locale,
>    it fails even in a clean "emacs -Q" session.
>    Curiously the file name can still be used in cmd.exe,
>    though entering it may require TAB-completion (even
>    though the active codepage shouldn't support them).

Does the program which you invoke support UTF-16 encoded command-line
arguments?
It would need to either use '_wmain' instead of 'main', or
access the command-line arguments via GetCommandLineW or the like,
and process them as wchar_t strings.  If the program doesn't have
these capabilities, it won't help that Emacs passes it UTF-16 encoded
arguments, because Windows will attempt to convert them to strings
encoded in the current codepage, and will replace any un-encodable
characters with question marks or blanks.

> ;; Set the preferred coding system.
> (prefer-coding-system 'utf-8)

You cannot use UTF-8 to encode command-line arguments on Windows, not
in general, even if the program you invoke does understand UTF-8
strings as its command-line arguments.  (I can explain if you want.)

> ;; On Unix (tested with cygwin), it works fine; Presumably because
> ;; the file name is decoded (in `directory-files') and encoded (in
> ;; `start-process') with the same preferred coding system.

It works with Cygwin because Cygwin does support UTF-8 for passing
strings to subprograms.  That support lives inside the Cygwin DLL,
which replaces most of the Windows runtime with Posix-compatible APIs.
The native Windows build of Emacs doesn't have that luxury.
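
The two failure modes discussed in this thread can be reproduced at the byte level without Windows or Emacs. The sketch below is only an illustration, written in Python because its codecs are easy to run anywhere. It assumes cp1252 as the "current ANSI codepage" of a German system, and it uses Python's "replace" error handler as a stand-in for the lossy conversion Windows performs. The helper names are hypothetical and are not part of Emacs or of any Windows API.

```python
# Assumption: cp1252 plays the role of the current ANSI codepage.
# Python's codecs only approximate what Windows does internally.
ANSI_CODEPAGE = "cp1252"


def seen_by_subprocess(filename: str, sent_as: str) -> str:
    """Encode FILENAME as the parent process would (SENT_AS), then
    decode the bytes the way a codepage-only child program would."""
    raw = filename.encode(sent_as)
    return raw.decode(ANSI_CODEPAGE, errors="replace")


def fits_codepage(filename: str) -> bool:
    """Return True if every character of FILENAME can be encoded in
    the assumed ANSI codepage."""
    try:
        filename.encode(ANSI_CODEPAGE)
    except UnicodeEncodeError:
        return False
    return True


def lossy_ansi(filename: str) -> str:
    """Model the lossy conversion to the codepage: characters that
    cannot be encoded come out as question marks."""
    raw = filename.encode(ANSI_CODEPAGE, errors="replace")
    return raw.decode(ANSI_CODEPAGE)


if __name__ == "__main__":
    # Limitation 1: argument sent as UTF-8, read as the ANSI codepage.
    print(seen_by_subprocess("Ö.txt", "utf-8"))
    # The explicit-encoding workaround: sender and receiver agree.
    print(seen_by_subprocess("Ö.txt", ANSI_CODEPAGE))
    # Limitation 2: Japanese characters do not fit the codepage at all.
    print(fits_codepage("日本語.txt"), lossy_ansi("日本語.txt"))
```

The first call shows why encoding with `locale-coding-system' helps in case 1: the name survives only when both sides use the same encoding. The last line shows why no choice of encoding on the Emacs side can help in case 2 as long as the argument has to pass through the codepage.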