From: Klaus-Dieter Bauer
Newsgroups: gmane.emacs.devel
Subject: Re: Passing unicode filenames to start-process on Windows?
Date: Wed, 6 Jan 2016 22:19:39 +0100
To: Eli Zaretskii
Cc: emacs-devel@gnu.org
In-Reply-To: <83si2a3cuo.fsf@gnu.org>
References: <83si2a3cuo.fsf@gnu.org>
Xref: news.gmane.org gmane.emacs.devel:197710

2016-01-06 17:13 GMT+01:00 Eli Zaretskii <eliz@gnu.org>:

> > From: Klaus-Dieter Bauer <bauer.klaus.dieter@gmail.com>
> > Date: Wed, 6 Jan 2016 16:20:29 +0100
> >
> > Is there a reliable way to pass unicode file names as
> > arguments through `start-process'?
>
> No, not at the moment, not in the native Windows build of Emacs.
> Arguments to subprocesses are forced to be encoded in the current
> system codepage.  This commentary in w32.c tells a few more details:
>
>    . Running subprocesses in non-ASCII directories and with non-ASCII
>      file arguments is limited to the current codepage (even though
>      Emacs is perfectly capable of finding an executable program file
>      in a directory whose name cannot be encoded in the current
>      codepage).  This is because the command-line arguments are
>      encoded _before_ they get to the w32-specific level, and the
>      encoding is not known in advance (it doesn't have to be the
>      current ANSI codepage), so w32proc.c functions cannot re-encode
>      them in UTF-16.  This should be fixed, but will also require
>      changes in cmdproxy.  The current limitation is not terribly bad
>      anyway, since very few, if any, Windows console programs that are
>      likely to be invoked by Emacs support UTF-16 encoded command
>      lines.
>
>    . For similar reasons, server.el and emacsclient are also limited
>      to the current ANSI codepage for now.
>
>    . Emacs itself can only handle command-line arguments encoded in
>      the current codepage.
>
> The main reason for this being a low-priority problem is that the
> absolute majority of console programs Emacs might invoke don't support
> UTF-16 encoded command-line arguments anyway, so the efforts to enable
> this would yield very little gains.  However, patches to do that will
> be welcome.  (Note that, as the comment above says, the changes will
> also need to touch cmdproxy, since we invoke all the programs through
> it.)
>
> > I realized two limitations:
> >
> > 1. Using `prefer-coding-system' with anything other than
> >    `locale-default-encoding', e.g.
> >        (prefer-coding-system 'utf-8),
> >    causes a file name "Ö.txt" to be misdecoded as by
> >    subprocesses -- notably including "emacs.exe", but also
> >    all other executables I tried (both Windows builtins like
> >    where.exe and third party executables like ffmpeg.exe or
> >    GnuWin32 utilities).
> >    In my case (German locale, 'utf-8 preferred coding
> >    system) it is mis-decoded as "Ã–.txt", i.e. emacs encodes
> >    the process argument as 'utf-8 but the subprocess decodes
> >    it as 'latin-1 (in my case).
> >    While this can be fixed by an explicit encoding
> >        (start-process ...
> >          (encode-coding-string filename locale-coding-system))
> >    such code will probably not be used in most projects, as
> >    the issue occurs only on Windows, dependent on the user
> >    configuration (-> hard-to-find bug?). I have added some
> >    elisp for demonstration at the end of the mail.
> >
> > 2. When a file-name contains characters that cannot be
> >    encoded in the locale's encoding, e.g. Japanese
> >    characters in a German locale, I cannot find any way to
> >    pass the file name through the `start-process' interface;
> >    Unlike for characters, that are supported by the locale,
> >    it fails even in a clean "emacs -Q" session.
> >    Curiously the file name can still be used in cmd.exe,
> >    though entering it may require TAB-completion (even
> >    though the active codepage shouldn't support them).
>
> Does the program which you invoke support UTF-16 encoded command-line
> arguments?  It would need to either use '_wmain' instead of 'main', or
> access the command-line arguments via GetCommandLineW or such likes,
> and process them as wchar_t strings.
>
> If the program doesn't have these capabilities, it won't help that
> Emacs passes it UTF-16 encoded arguments, because Windows will attempt
> to convert them to strings encoded in the current codepage, and will
> replace any un-encodable characters with question marks or blanks.
>
> > ;; Set the preferred coding system.
> > (prefer-coding-system 'utf-8)
>
> You cannot use UTF-8 to encode command-line arguments on Windows, not
> in general, even if the program you invoke does understand UTF-8
> strings as its command-line arguments.  (I can explain if you want.)
>
> > ;; On Unix (tested with cygwin), it works fine; Presumably because
> > ;; the file name is decoded (in `directory-files') and encoded (in
> > ;; `start-process') with the same preferred coding system.
>
> It works with Cygwin because Cygwin does support UTF-8 for passing
> strings to subprograms.  That support lives inside the Cygwin DLL,
> which replaces most of the Windows runtime with Posix-compatible
> APIs.  The native Windows build of Emacs doesn't have that luxury.

I checked again and found that some of the utilities I tested before
(specifically the GnuWin32 tools) indeed can't handle Japanese
characters when called from cmd.exe.  ffmpeg, on the other hand,
supports Unicode file names in cmd.exe, but I agree that this is
quite a niche usage.

I thought up some workarounds, but they all run into limitations:

- w32-short-file-name: Doesn't work, because on modern Windows
  systems 8.3 file names may not be generated, so it may just return
  the file name unchanged.
- rename-file: Allows working with the file under a temporary name
  that the codepage supports.  Sadly, there is no way to guarantee
  that the renaming is undone afterwards.
- copy-file (to a temporary directory): Would work for the current
  application, but is unviable when larger amounts of data are
  involved.

Would you happen to know any other possible workaround?

Thanks for the explanations,

- Klaus
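For reference, the two limitations discussed in the quoted text can be illustrated with a short Emacs Lisp sketch. This is only a sketch under stated assumptions: the `my-' functions are hypothetical helpers, not part of Emacs, and `windows-1252' stands in for the ANSI codepage of a German Windows installation. The first helper reproduces the "Ö.txt" mis-decoding, the second checks whether an argument fits the target codepage, and the third applies the explicit encoding mentioned above while refusing names that cannot be encoded.

```elisp
;; Hypothetical helpers for illustration only; the `my-' names are not
;; part of Emacs.  The coding systems are passed in explicitly, so the
;; behaviour can be tried on any platform.

(defun my-simulate-misdecode (name sent-as read-as)
  "Return NAME as a program decoding with READ-AS would see it.
SENT-AS is the coding system used to encode the argument."
  (decode-coding-string (encode-coding-string name sent-as) read-as))

(defun my-arg-encodable-p (arg coding-system)
  "Return non-nil if every character of ARG is encodable in CODING-SYSTEM."
  (null (unencodable-char-position 0 (length arg) coding-system nil arg)))

(defun my-encode-process-arg (arg &optional coding-system)
  "Encode ARG explicitly for use as a subprocess argument.
CODING-SYSTEM defaults to `locale-coding-system'.  Signal an error
rather than pass on a name that cannot be represented."
  (let ((cs (or coding-system locale-coding-system)))
    (unless (my-arg-encodable-p arg cs)
      (error "Cannot encode %S in %s" arg cs))
    (encode-coding-string arg cs)))

;; Limitation 1: UTF-8 bytes read back in the ANSI codepage.
;; (my-simulate-misdecode "Ö.txt" 'utf-8 'windows-1252)

;; Limitation 2: characters outside the codepage are detected up front.
;; (my-arg-encodable-p "日本語.txt" 'windows-1252)

;; Usage sketch; program and buffer names are placeholders:
;; (start-process "ffmpeg" "*ffmpeg*" "ffmpeg" "-i"
;;                (my-encode-process-arg filename))
```

The check does not make unencodable names usable. It only turns a silent mangling into an explicit error, at which point one of the workarounds listed above would still be needed.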