* on eshell's encoding
@ 2016-07-26 14:25 Daniel Bastos
2016-07-26 15:05 ` Eli Zaretskii
[not found] ` <mailman.2058.1469545530.26859.help-gnu-emacs@gnu.org>
0 siblings, 2 replies; 14+ messages in thread
From: Daniel Bastos @ 2016-07-26 14:25 UTC (permalink / raw)
To: help-gnu-emacs
I'm running eshell. My current modeline is
U\--- *eshell* [...]
But after a git commit, I get garbage out from my utf-8 string given in
the command line. It must be git's fault. Do you confirm? (I don't
have the same problem if I input the string in a file.)
%gc -a -m 'Função pra esvaziar a fila.'
[cooper 95bca82] Função pra esvaziar a fila.
2 files changed, 5 insertions(+), 1 deletion(-)
%
(*) My encoding in detail
U -- utf-8-dos (alias: mule-utf-8-dos)
UTF-8 (no signature (BOM))
Type: utf-8 (UTF-8: Emacs internal multibyte form)
EOL type: CRLF
This coding system encodes the following charsets:
unicode
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: on eshell's encoding
2016-07-26 14:25 on eshell's encoding Daniel Bastos
@ 2016-07-26 15:05 ` Eli Zaretskii
[not found] ` <mailman.2058.1469545530.26859.help-gnu-emacs@gnu.org>
1 sibling, 0 replies; 14+ messages in thread
From: Eli Zaretskii @ 2016-07-26 15:05 UTC (permalink / raw)
To: help-gnu-emacs
> From: Daniel Bastos <dbastos@toledo.com>
> Date: Tue, 26 Jul 2016 11:25:55 -0300
>
> I'm running eshell. My current modeline is
>
> U\--- *eshell* [...]
>
> But after a git commit, I get garbage out from my utf-8 string given in
> the command line. It must be git's fault. Do you confirm? (I don't
> have the same problem if I input the string in a file.)
>
> %gc -a -m 'Função pra esvaziar a fila.'
> [cooper 95bca82] Função pra esvaziar a fila.
> 2 files changed, 5 insertions(+), 1 deletion(-)
> %
Is this on MS-Windows? If so, you cannot invoke programs from Emacs
with command-line arguments encoded in anything but the system
codepage. And UTF-8 cannot be a system codepage on Windows.
I suggest putting the commit message in a file and using the -F switch to
"git commit". Or use the built-in VC commands, they will do this
automatically for you (if you have Emacs 25).
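The file-based workaround can be sketched in Python. This is only an illustration under stated assumptions: the helper name write_commit_message is hypothetical, the temporary file location is whatever tempfile picks, and the caller is expected to run the returned command and delete the file afterwards. The point is that the non-ASCII text travels through file I/O, while the command line itself carries only the switch and the file name.

```python
import os
import tempfile
from typing import List


def write_commit_message(message: str) -> List[str]:
    """Write the message to a UTF-8 file and return an argv for `git commit -F`.

    Hypothetical helper: the name and the temp-file location are assumptions.
    """
    fd, path = tempfile.mkstemp(suffix=".txt")
    # The message is stored as UTF-8 bytes in the file, so it never has to
    # pass through the command line.
    with os.fdopen(fd, "w", encoding="utf-8", newline="\n") as f:
        f.write(message)
    return ["git", "commit", "-a", "-F", path]


if __name__ == "__main__":
    argv = write_commit_message("Função pra esvaziar a fila.")
    print(argv)  # pass this list to subprocess.run(), then remove the file
```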
* Re: on eshell's encoding
[not found] ` <mailman.2058.1469545530.26859.help-gnu-emacs@gnu.org>
@ 2016-07-26 16:49 ` Daniel Bastos
2016-07-26 17:17 ` Eli Zaretskii
[not found] ` <mailman.2074.1469553449.26859.help-gnu-emacs@gnu.org>
0 siblings, 2 replies; 14+ messages in thread
From: Daniel Bastos @ 2016-07-26 16:49 UTC (permalink / raw)
To: help-gnu-emacs
Hi, Eli.
Eli Zaretskii <eliz@gnu.org> writes:
>> From: Daniel Bastos <dbastos@toledo.com>
>> Date: Tue, 26 Jul 2016 11:25:55 -0300
>>
>> I'm running eshell. My current modeline is
>>
>> U\--- *eshell* [...]
>>
>> But after a git commit, I get garbage out from my utf-8 string given in
>> the command line. It must be git's fault. Do you confirm? (I don't
>> have the same problem if I input the string in a file.)
>>
>> %gc -a -m 'Função pra esvaziar a fila.'
>> [cooper 95bca82] Função pra esvaziar a fila.
>> 2 files changed, 5 insertions(+), 1 deletion(-)
>> %
>
> Is this on MS-Windows? If so, you cannot invoke programs from Emacs
> with command-line arguments encoded in anything but the system
> codepage. And UTF-8 cannot be a system codepage on Windows.
You're right. This is MS-Windows. But I thought MS-Windows would not
interfere here. Why does it interfere? I thought the messages would go
straight into git's ARGV. Does Windows read() and write() interpret the
bytes?
> I suggest putting the commit message in a file and using the -F switch to
> "git commit". Or use the built-in VC commands, they will do this
> automatically for you (if you have Emacs 25).
If I put the commit message in a file, even without using -F switch, it
works as expected.
(*) Version
GNU Emacs 24.3.1 (i386-mingw-nt6.2.9200) of 2013-03-17 on MARVIN
* Re: on eshell's encoding
2016-07-26 16:49 ` Daniel Bastos
@ 2016-07-26 17:17 ` Eli Zaretskii
2016-07-26 18:26 ` Yuri Khan
[not found] ` <mailman.2074.1469553449.26859.help-gnu-emacs@gnu.org>
1 sibling, 1 reply; 14+ messages in thread
From: Eli Zaretskii @ 2016-07-26 17:17 UTC (permalink / raw)
To: help-gnu-emacs
> From: Daniel Bastos <dbastos@toledo.com>
> Date: Tue, 26 Jul 2016 13:49:15 -0300
>
> > Is this on MS-Windows? If so, you cannot invoke programs from Emacs
> > with command-line arguments encoded in anything but the system
> > codepage. And UTF-8 cannot be a system codepage on Windows.
>
> You're right. This is MS-Windows. But I thought MS-Windows would not
> interfere here. Why does it interfere? I thought the messages would go
> straight into git's ARGV.
How can it go "straight"? Eshell is not a real shell, it's a Lisp
program that pretends to be a shell. When you type RET at the end of
a command line, Eshell takes the command and calls a Windows API that
invokes programs, passing it the command you typed. But the API that
Emacs calls accepts strings encoded in the system codepage. So the
UTF-8 string you typed is interpreted as encoded in that codepage, and
that's why you get it back garbled.
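The garbling described above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the system codepage is cp1252 (the thread never names it); the function name is made up for the illustration. It encodes the text as UTF-8 and then reads the same bytes back as if they were in the codepage.

```python
def misread_utf8_as_codepage(text: str, codepage: str = "cp1252") -> str:
    """Model UTF-8 bytes being interpreted as text in a system codepage.

    cp1252 is an assumed codepage, not something stated in the thread.
    """
    raw = text.encode("utf-8")   # the bytes the user meant to send
    return raw.decode(codepage)  # what a codepage-based reader makes of them


if __name__ == "__main__":
    print(misread_utf8_as_codepage("Função pra esvaziar a fila."))
    # prints: FunÃ§Ã£o pra esvaziar a fila.
```

Under that assumption the result is the same garbage shown in the original report: each two-byte UTF-8 sequence turns into two separate codepage characters.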
If the characters you typed can be encoded by your system codepage,
then what you do should still work, if you tell Git that log messages
are encoded in that codepage. Read about the i18n.commitEncoding
configuration parameter in the Git documentation. However, I don't
recommend doing that, because you (and whoever else participates in
that project) will then have to confine yourselves to that encoding.
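Whether a given message "can be encoded by your system codepage" is easy to check. The sketch below is an illustration only: fits_codepage is a hypothetical helper, and cp1252 and cp1251 are example codepages chosen for the demonstration, not values taken from the thread.

```python
def fits_codepage(text: str, codepage: str) -> bool:
    """Return True if every character of text exists in the given codepage."""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # Portuguese text fits a Western European codepage ...
    print(fits_codepage("Função pra esvaziar a fila.", "cp1252"))
    # ... but Cyrillic text does not, although it fits cp1251.
    print(fits_codepage("Функция", "cp1252"), fits_codepage("Функция", "cp1251"))
```

A project whose contributors use different codepages would fail this check for at least some of them, which is the practical reason behind the recommendation against relying on i18n.commitEncoding here.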
There's no way of safely passing UTF-8 encoded command-line arguments
to a Windows program. The only way to break the limitations of the
system codepage is to use the Unicode (a.k.a. "wide") APIs, which
expect strings in UTF-16 encoding. But that is not currently
supported in Emacs, due to boring technical problems.
> > I suggest putting the commit message in a file and using the -F switch to
> > "git commit". Or use the built-in VC commands, they will do this
> > automatically for you (if you have Emacs 25).
>
> If I put the commit message in a file, even without using -F switch, it
> works as expected.
It will always work from a file, because file I/O doesn't have this
limitation.
* Re: on eshell's encoding
2016-07-26 17:17 ` Eli Zaretskii
@ 2016-07-26 18:26 ` Yuri Khan
2016-07-26 18:35 ` Eli Zaretskii
0 siblings, 1 reply; 14+ messages in thread
From: Yuri Khan @ 2016-07-26 18:26 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: help-gnu-emacs@gnu.org
On Wed, Jul 27, 2016 at 12:17 AM, Eli Zaretskii <eliz@gnu.org> wrote:
> The only way to break the limitations of the
> system codepage is to use the Unicode (a.k.a. "wide") APIs, which
> expect strings in UTF-16 encoding. But that is not currently
> supported in Emacs, due to boring technical problems.
It’s not even clear if using the wide API on the caller side will
suffice. The callee also needs to cooperate, by using the
corresponding wide API to retrieve the command line arguments.
* Re: on eshell's encoding
2016-07-26 18:26 ` Yuri Khan
@ 2016-07-26 18:35 ` Eli Zaretskii
0 siblings, 0 replies; 14+ messages in thread
From: Eli Zaretskii @ 2016-07-26 18:35 UTC (permalink / raw)
To: help-gnu-emacs
> From: Yuri Khan <yuri.v.khan@gmail.com>
> Date: Wed, 27 Jul 2016 00:26:42 +0600
> Cc: "help-gnu-emacs@gnu.org" <help-gnu-emacs@gnu.org>
>
> On Wed, Jul 27, 2016 at 12:17 AM, Eli Zaretskii <eliz@gnu.org> wrote:
>
> > The only way to break the limitations of the
> > system codepage is to use the Unicode (a.k.a. "wide") APIs, which
> > expect strings in UTF-16 encoding. But that is not currently
> > supported in Emacs, due to boring technical problems.
>
> It’s not even clear if using the wide API on the caller side will
> suffice. The callee also needs to cooperate, by using the
> corresponding wide API to retrieve the command line arguments.
Yes, and that's one of the few reasons why Emacs on Windows doesn't
bother to use the wide APIs: too few programs Emacs users normally
invoke can cooperate like that. But if Emacs did use the wide APIs,
it wouldn't have been a loss, because programs that use ANSI APIs to
access their command-line arguments would have them converted to the
system codepage by Windows, and so it would have worked or not exactly
as it does or doesn't now.
* Re: on eshell's encoding
[not found] ` <mailman.2074.1469553449.26859.help-gnu-emacs@gnu.org>
@ 2016-07-27 11:56 ` Daniel Bastos
2016-07-27 13:15 ` Yuri Khan
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Daniel Bastos @ 2016-07-27 11:56 UTC (permalink / raw)
To: help-gnu-emacs
Eli Zaretskii <eliz@gnu.org> writes:
>> From: Daniel Bastos <dbastos@toledo.com>
>> Date: Tue, 26 Jul 2016 13:49:15 -0300
>>
>> > Is this on MS-Windows? If so, you cannot invoke programs from Emacs
>> > with command-line arguments encoded in anything but the system
>> > codepage. And UTF-8 cannot be a system codepage on Windows.
>>
>> You're right. This is MS-Windows. But I thought MS-Windows would not
>> interfere here. Why does it interfere? I thought the messages would go
>> straight into git's ARGV.
>
> How can it go "straight"?
I meant not being messed with. I don't know anything about MS-Windows.
In UNIX the creation of a new process by a shell is likely to call
execve, which won't touch the caller strings passed in through the
argv-argument.
* Re: on eshell's encoding
2016-07-27 11:56 ` Daniel Bastos
@ 2016-07-27 13:15 ` Yuri Khan
2016-07-27 16:22 ` Eli Zaretskii
2016-07-27 16:14 ` Eli Zaretskii
[not found] ` <mailman.2119.1469636078.26859.help-gnu-emacs@gnu.org>
2 siblings, 1 reply; 14+ messages in thread
From: Yuri Khan @ 2016-07-27 13:15 UTC (permalink / raw)
To: Daniel Bastos; +Cc: help-gnu-emacs@gnu.org
On Wed, Jul 27, 2016 at 6:56 PM, Daniel Bastos <dbastos@toledo.com> wrote:
> I meant not being messed with. I don't know anything about MS-Windows.
> In UNIX the creation of a new process by a shell is likely to call
> execve, which won't touch the caller strings passed in through the
> argv-argument.
Well Windows is a different beast entirely. The basic premise is the
same, in that the parent invokes CreateProcessW, passing a
UTF-16-encoded command line, and the child process invokes
GetCommandLineW and then optionally CommandLineToArgvW to split the
command line into arguments.
Problem is, most programs prefer to work internally with 8-bit-based
encodings, and the Win32 API makes it very easy by providing backward
compatibility wrapper functions CreateProcessA and GetCommandLineA,
which unfortunately convert from/to the ANSI or OEM encoding defined
by the locale. And there is no Win32 locale for which UTF-8 is either
the ANSI or the OEM encoding.
This one point makes it very difficult to use Windows in the Unix Way:
you get to worry about encodings on every process boundary.
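The effect of a wide-API parent talking to an ANSI-API child can be modelled in Python. This is a rough sketch, not the Win32 implementation: the function name is invented, cp1252 is an assumed ANSI codepage, and Python's "replace" error handler stands in for the conversion Windows performs, which may also substitute look-alike characters instead of question marks.

```python
def as_seen_through_ansi_api(command_line: str, ansi_codepage: str = "cp1252") -> str:
    """Model a UTF-16 command line being read through an ANSI wrapper.

    Assumptions: cp1252 as the ANSI codepage; '?' for unrepresentable
    characters (real Windows may pick a best-fit character instead).
    """
    wide = command_line.encode("utf-16-le")  # what a CreateProcessW caller passes
    text = wide.decode("utf-16-le")
    # Conversion to the ANSI codepage for a child that asks for narrow strings.
    narrow = text.encode(ansi_codepage, errors="replace")
    return narrow.decode(ansi_codepage)


if __name__ == "__main__":
    print(as_seen_through_ansi_api("git commit -m Função"))  # survives in cp1252
    print(as_seen_through_ansi_api("git commit -m Привет"))  # Cyrillic is lost
```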
* Re: on eshell's encoding
2016-07-27 11:56 ` Daniel Bastos
2016-07-27 13:15 ` Yuri Khan
@ 2016-07-27 16:14 ` Eli Zaretskii
[not found] ` <mailman.2119.1469636078.26859.help-gnu-emacs@gnu.org>
2 siblings, 0 replies; 14+ messages in thread
From: Eli Zaretskii @ 2016-07-27 16:14 UTC (permalink / raw)
To: help-gnu-emacs
> From: Daniel Bastos <dbastos@toledo.com>
> Date: Wed, 27 Jul 2016 08:56:31 -0300
>
> >> You're right. This is MS-Windows. But I thought MS-Windows would not
> >> interfere here. Why does it interfere? I thought the messages would go
> >> straight into git's ARGV.
> >
> > How can it go "straight"?
>
> I meant not being messed with. I don't know anything about MS-Windows.
> In UNIX the creation of a new process by a shell is likely to call
> execve, which won't touch the caller strings passed in through the
> argv-argument.
Like I said, Eshell is not a shell, it just pretends to be one. It
will eventually cause execve, or something like it, to be called, but
before it, the command-line arguments will be encoded in the locale's
encoding, since that's what execve expects. This is true on Windows
and on Unix alike. So in this case, the command-line arguments are
always "messed with" in Emacs. If your locale happens to use UTF-8,
then it will _almost_ look as if the arguments were passed to execve
untouched, but that's an illusion, and is certainly incorrect when the
locale's codeset is not UTF-8 (which is always true on Windows).
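To see why an untouched-looking argument under a UTF-8 locale is a coincidence, compare the bytes the same string produces under different codesets. The sketch below is only a model of "encode the argument in the locale's codeset"; encode_argument is a hypothetical name and the three codesets are examples, not a description of Emacs internals.

```python
def encode_argument(arg: str, codeset: str) -> bytes:
    """Encode one command-line argument in the given locale codeset."""
    return arg.encode(codeset)


if __name__ == "__main__":
    for codeset in ("utf-8", "iso-8859-1", "cp1252"):
        # The same text yields different bytes depending on the codeset.
        print(codeset, encode_argument("Função", codeset))
```

Only the first line of output contains the UTF-8 byte sequence; under the other two codesets the child process receives different bytes for the same text.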
* Re: on eshell's encoding
2016-07-27 13:15 ` Yuri Khan
@ 2016-07-27 16:22 ` Eli Zaretskii
2016-07-27 16:47 ` Yuri Khan
0 siblings, 1 reply; 14+ messages in thread
From: Eli Zaretskii @ 2016-07-27 16:22 UTC (permalink / raw)
To: help-gnu-emacs
> From: Yuri Khan <yuri.v.khan@gmail.com>
> Date: Wed, 27 Jul 2016 19:15:45 +0600
> Cc: "help-gnu-emacs@gnu.org" <help-gnu-emacs@gnu.org>
>
> On Wed, Jul 27, 2016 at 6:56 PM, Daniel Bastos <dbastos@toledo.com> wrote:
>
> > I meant not being messed with. I don't know anything about MS-Windows.
> > In UNIX the creation of a new process by a shell is likely to call
> > execve, which won't touch the caller strings passed in through the
> > argv-argument.
>
> Well Windows is a different beast entirely. The basic premise is the
> same, in that the parent invokes CreateProcessW, passing a
> UTF-16-encoded command line, and the child process invokes
> GetCommandLineW and then optionally CommandLineToArgvW to split the
> command line into arguments.
So it isn't a different beast, really. Both on Unix and on Windows,
Emacs encodes the command line before passing it to system APIs. The
details differ, but not the basic idea.
> Problem is, most programs prefer to work internally with 8-bit-based
> encodings, and the Win32 API makes it very easy by providing backward
> compatibility wrapper functions CreateProcessA and GetCommandLineA,
> which unfortunately convert from/to the ANSI or OEM encoding defined
> by the locale.
Nitpicking: always ANSI, never the OEM.
> And there is no Win32 locale for which UTF-8 is either the ANSI or
> the OEM encoding.
It's actually worse than that: the Windows locale implementation
doesn't support variable-length encodings, so UTF-8 cannot be a
locale's encoding, unless MS change their related runtime libraries in
a radical way.
> This one point makes it very difficult to use Windows in the Unix Way:
> you get to worry about encodings on every process boundary.
Same on Unix, unless you are willing to bet on UTF-8 being the
locale's codeset.
* Re: on eshell's encoding
2016-07-27 16:22 ` Eli Zaretskii
@ 2016-07-27 16:47 ` Yuri Khan
2016-07-27 17:12 ` Eli Zaretskii
0 siblings, 1 reply; 14+ messages in thread
From: Yuri Khan @ 2016-07-27 16:47 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: help-gnu-emacs@gnu.org
On Wed, Jul 27, 2016 at 11:22 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> It's actually worse than that: the Windows locale implementation
> doesn't support variable-length encodings
It sort of does, as long as the variable in question never exceeds 2.
See, for example, cp932.
> so UTF-8 cannot be a
> locale's encoding, unless MS change their related runtime libraries in
> a radical way.
* Re: on eshell's encoding
2016-07-27 16:47 ` Yuri Khan
@ 2016-07-27 17:12 ` Eli Zaretskii
0 siblings, 0 replies; 14+ messages in thread
From: Eli Zaretskii @ 2016-07-27 17:12 UTC (permalink / raw)
To: help-gnu-emacs
> From: Yuri Khan <yuri.v.khan@gmail.com>
> Date: Wed, 27 Jul 2016 22:47:01 +0600
> Cc: "help-gnu-emacs@gnu.org" <help-gnu-emacs@gnu.org>
>
> On Wed, Jul 27, 2016 at 11:22 PM, Eli Zaretskii <eliz@gnu.org> wrote:
>
> > It's actually worse than that: the Windows locale implementation
> > doesn't support variable-length encodings
>
> It sort of does, as long as the variable in question never exceeds 2.
> See, for example, cp932.
cp932 is a DBCS character set, so not relevant to the above.
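The difference between a double-byte character set and UTF-8 shows up when counting bytes per character. A small sketch, using Python's cp932 codec as a stand-in for the Windows codepage of the same name (the helper name is made up):

```python
from typing import List


def bytes_per_char(text: str, encoding: str) -> List[int]:
    """Return the encoded length in bytes of each character of text."""
    return [len(ch.encode(encoding)) for ch in text]


if __name__ == "__main__":
    # In cp932 a character takes one or two bytes, never more.
    print(bytes_per_char("Aあ", "cp932"))
    # In UTF-8 a character can take up to four bytes.
    print(bytes_per_char("Aあ\U0001F600", "utf-8"))
```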
* Re: on eshell's encoding
[not found] ` <mailman.2119.1469636078.26859.help-gnu-emacs@gnu.org>
@ 2016-08-02 13:24 ` Daniel Bastos
2016-08-02 15:12 ` Eli Zaretskii
0 siblings, 1 reply; 14+ messages in thread
From: Daniel Bastos @ 2016-08-02 13:24 UTC (permalink / raw)
To: help-gnu-emacs
Eli Zaretskii <eliz@gnu.org> writes:
>> From: Daniel Bastos <dbastos@toledo.com>
>> Date: Wed, 27 Jul 2016 08:56:31 -0300
>>
>> >> You're right. This is MS-Windows. But I thought MS-Windows would not
>> >> interfere here. Why does it interfere? I thought the messages would go
>> >> straight into git's ARGV.
>> >
>> > How can it go "straight"?
>>
>> I meant not being messed with. I don't know anything about MS-Windows.
>> In UNIX the creation of a new process by a shell is likely to call
>> execve, which won't touch the caller strings passed in through the
>> argv-argument.
>
> Like I said, Eshell is not a shell, it just pretends to be one. It
> will eventually cause execve, or something like it, to be called, but
> before it, the command-line arguments will be encoded in the locale's
> encoding, since that's what execve expects. This is true on Windows
> and on Unix alike.
That's true of EMACS. You're saying EMACS always encodes the command
line arguments. But what I said about UNIX is that whatever execve
receives in argv[] will remain as such, which apparently is not the
MS-Windows behavior.
Precisely: if on UNIX I use EMACS to call /program/ with argv[] encoded
in X, then /program/ will definitely receive its argv[] as prepared by
EMACS. That does not happen on MS-Windows. EMACS encodes the command
line in utf-8, but /program/ receives it in another encoding.
This surprises me. MS-Windows should not care what a program puts in
argv[]. I think it violates an important principle: an operating system
should help programs to communicate, but it should not care what they're
saying to each other. That's an important principle UNIX has given us.
Even if I'm not totally correct now, I'm certainly better educated.
Thank you.
* Re: on eshell's encoding
2016-08-02 13:24 ` Daniel Bastos
@ 2016-08-02 15:12 ` Eli Zaretskii
0 siblings, 0 replies; 14+ messages in thread
From: Eli Zaretskii @ 2016-08-02 15:12 UTC (permalink / raw)
To: help-gnu-emacs
> From: Daniel Bastos <dbastos@toledo.com>
> Date: Tue, 02 Aug 2016 10:24:32 -0300
>
> > Like I said, Eshell is not a shell, it just pretends to be one. It
> > will eventually cause execve, or something like it, to be called, but
> > before it, the command-line arguments will be encoded in the locale's
> > encoding, since that's what execve expects. This is true on Windows
> > and on Unix alike.
>
> That's true of EMACS. You're saying EMACS always encodes the command
> line arguments. But what I said about UNIX is that whatever execve
> receives in argv[] will remain as such, which apparently is not the
> MS-Windows behavior.
>
> Precisely: if on UNIX I use EMACS to call /program/ with argv[] encoded
> in X, then /program/ will definitely receive its argv[] as prepared by
> EMACS. That does not happen on MS-Windows. EMACS encodes the command
> line in utf-8, but /program/ receives it in another encoding.
That's not true. Emacs encodes the command line passed to
subprocesses on Windows and Unix alike. On each OS, it always encodes
them in the locale's codeset. If the Unix locale specified UTF-8 as
its codeset, then the command line will be encoded in UTF-8, but
that's no more than a coincidence. (On Windows, the locale's codeset,
a.k.a. "system codepage", can never be UTF-8, but that's the only
difference between Unix and Windows wrt encoding command lines of
subprocesses by Emacs.)
So, as long as you launch processes from Emacs, the difference between
Windows and Unix in this respect is all but non-existent.
The difference between the 2 OSes comes into play when you put
arbitrary byte sequences into argv[] passed to execve etc. (This
cannot be easily done in Emacs, but you can do that in your own
programs.) If those bytes are not valid for the locale's codeset,
Unix will nevertheless pass them verbatim to the subprogram. By
contrast, Windows will convert those bytes to UTF-16, assuming they
are in the current locale's codeset, then convert back to that codeset
when it invokes the subprogram. This conversion is lossy when the
bytes are not valid for the locale, as Windows will replace the
invalid bytes with either their close equivalents or with blanks or
with question marks. (When these bytes are all valid in the current
locale, this conversion happens as well, but it's not lossy, and
therefore its effect is exactly as on Unix.)
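The round trip described above can be approximated in Python. This is a model under stated assumptions, not the Windows conversion itself: the function name is invented, cp1252 is an assumed codepage, and Python's cp1252 table (which treats byte 0x81 as undefined) and its "replace" handler stand in for Windows's own tables and substitution rules, which may behave differently for particular bytes.

```python
def windows_style_round_trip(raw: bytes, codepage: str = "cp1252") -> bytes:
    """Model bytes -> UTF-16 text -> bytes, both steps assuming one codepage.

    Assumptions: cp1252 as the codepage; Python's codec table decides which
    bytes count as invalid; invalid input ends up as '?'.
    """
    text = raw.decode(codepage, errors="replace")   # to text, as if for UTF-16
    return text.encode(codepage, errors="replace")  # back for a non-Unicode child


if __name__ == "__main__":
    valid = "Função".encode("utf-8")  # every byte happens to be valid cp1252
    print(windows_style_round_trip(valid) == valid)  # True: nothing is lost
    print(windows_style_round_trip(b"a\x81b"))       # b'a?b': the byte is lost
```

On Unix the second argument would reach the child as the original three bytes, since nothing decodes them on the way.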
> This surprises me. MS-Windows should not care what a program puts in
> argv[].
It cares, because it attempts to transparently support both Unicode
programs, which expect their arguments in UTF-16, and non-Unicode
programs which expect their arguments in the locale's codeset.
> I think it violates an important principle: an operating system
> should help programs to communicate, but it should not care what they're
> saying to each other. That's an important principle UNIX has given us.
Clearly, Unix and Windows differ in their philosophy in this regard.
Each alternative has its advantages and disadvantages; which one you
like better is up to you.