* on eshell's encoding
From: Daniel Bastos @ 2016-07-26 14:25 UTC
To: help-gnu-emacs

I'm running eshell. My current modeline is

U\--- *eshell* [...]

But after a git commit, I get garbage out from my utf-8 string given in
the command line. It must be git's fault. Do you confirm? (I don't
have the same problem if I input the string in a file.)

%gc -a -m 'Função pra esvaziar a fila.'
[cooper 95bca82] Função pra esvaziar a fila.
2 files changed, 5 insertions(+), 1 deletion(-)
%

(*) My encoding in details

U -- utf-8-dos (alias: mule-utf-8-dos)

UTF-8 (no signature (BOM))
Type: utf-8 (UTF-8: Emacs internal multibyte form)
EOL type: CRLF
This coding system encodes the following charsets:
unicode

* Re: on eshell's encoding
From: Eli Zaretskii @ 2016-07-26 15:05 UTC
To: help-gnu-emacs

> From: Daniel Bastos <dbastos@toledo.com>
> Date: Tue, 26 Jul 2016 11:25:55 -0300
>
> I'm running eshell. My current modeline is
>
> U\--- *eshell* [...]
>
> But after a git commit, I get garbage out from my utf-8 string given in
> the command line. It must be git's fault. Do you confirm? (I don't
> have the same problem if I input the string in a file.)
>
> %gc -a -m 'Função pra esvaziar a fila.'
> [cooper 95bca82] Função pra esvaziar a fila.
> 2 files changed, 5 insertions(+), 1 deletion(-)
> %

Is this on MS-Windows? If so, you cannot invoke programs from Emacs
with command-line arguments encoded in anything but the system
codepage. And UTF-8 cannot be a system codepage on Windows.

I suggest to put the commit message in a file and use the -F switch to
"git commit". Or use the built-in VC commands, they will do this
automatically for you (if you have Emacs 25).

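The -F suggestion above can be sketched as a small Python helper. This is only an illustration under stated assumptions: the helper name and the file name COMMIT_MSG.txt are hypothetical, and it is assumed that the `gc` alias in the original post stands for "git commit". The helper writes the message to a file as UTF-8 and builds an argument list that contains no non-ASCII text beyond whatever the directory path itself holds.

```python
import tempfile
from pathlib import Path


def commit_command_with_message_file(message: str, directory: str) -> list[str]:
    """Write `message` as UTF-8 to a file and return a `git commit -F` argv."""
    # Hypothetical file name, chosen for this sketch only.
    msg_path = Path(directory) / "COMMIT_MSG.txt"
    # The non-ASCII text travels through file I/O, not through the command line.
    msg_path.write_bytes(message.encode("utf-8"))
    return ["git", "commit", "-a", "-F", str(msg_path)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        argv = commit_command_with_message_file("Função pra esvaziar a fila.", tmp)
        # Inside a repository, this list could be handed to subprocess.run().
        print(argv)
```

The design choice is simply to keep the message out of the argument list, so that only the file path has to survive the command-line encoding.
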
* Re: on eshell's encoding
From: Daniel Bastos @ 2016-07-26 16:49 UTC
To: help-gnu-emacs

Hi, Eli.

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Daniel Bastos <dbastos@toledo.com>
>> Date: Tue, 26 Jul 2016 11:25:55 -0300
>>
>> I'm running eshell. My current modeline is
>>
>> U\--- *eshell* [...]
>>
>> But after a git commit, I get garbage out from my utf-8 string given in
>> the command line. It must be git's fault. Do you confirm? (I don't
>> have the same problem if I input the string in a file.)
>>
>> %gc -a -m 'Função pra esvaziar a fila.'
>> [cooper 95bca82] Função pra esvaziar a fila.
>> 2 files changed, 5 insertions(+), 1 deletion(-)
>> %
>
> Is this on MS-Windows? If so, you cannot invoke programs from Emacs
> with command-line arguments encoded in anything but the system
> codepage. And UTF-8 cannot be a system codepage on Windows.

You're right. This is MS-Windows. But I thought MS-Windows would not
interfere here. Why does it interfere? I thought the messages would go
straight into git's ARGV. Does Windows read() and write() interpret the
bytes?

> I suggest to put the commit message in a file and use the -F switch to
> "git commit". Or use the built-in VC commands, they will do this
> automatically for you (if you have Emacs 25).

If I put the commit message in a file, even without using -F switch, it
works as expected.

(*) Version

GNU Emacs 24.3.1 (i386-mingw-nt6.2.9200) of 2013-03-17 on MARVIN

* Re: on eshell's encoding
From: Eli Zaretskii @ 2016-07-26 17:17 UTC
To: help-gnu-emacs

> From: Daniel Bastos <dbastos@toledo.com>
> Date: Tue, 26 Jul 2016 13:49:15 -0300
>
> > Is this on MS-Windows? If so, you cannot invoke programs from Emacs
> > with command-line arguments encoded in anything but the system
> > codepage. And UTF-8 cannot be a system codepage on Windows.
>
> You're right. This is MS-Windows. But I thought MS-Windows would not
> interfere here. Why does it interfere? I thought the messages would go
> straight into git's ARGV.

How can it go "straight"? Eshell is not a real shell, it's a Lisp
program that pretends to be a shell. When you type RET at the end of
a command line, Eshell takes the command and calls a Windows API that
invokes programs, passing it the command you typed. But the API that
Emacs calls accepts strings encoded in the system codepage. So the
UTF-8 string you typed is interpreted as encoded in that codepage, and
that's why you get it back garbled.

If the characters you typed can be encoded by your system codepage,
then what you do should still work, if you tell Git that log messages
are encoded in that codepage. Read about the i18n.commitEncoding
configuration parameter in the Git documentation. However, I don't
recommend doing that, because you (and whoever else participates in
that project) will then have to confine yourselves to that encoding.

There's no way of safely passing UTF-8 encoded command-line arguments
to a Windows program. The only way to break the limitations of the
system codepage is to use the Unicode (a.k.a. "wide") APIs, which
expect strings in UTF-16 encoding. But that is not currently
supported in Emacs, due to boring technical problems.

> > I suggest to put the commit message in a file and use the -F switch to
> > "git commit". Or use the built-in VC commands, they will do this
> > automatically for you (if you have Emacs 25).
>
> If I put the commit message in a file, even without using -F switch, it
> works as expected.

It will always work from a file, because file I/O doesn't have this
limitation.

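The garbling described in this message can be reproduced in a few lines of Python. This is a minimal sketch, assuming the system codepage is cp1252 (the thread never states which codepage the poster's machine uses); the function name is made up for the illustration. It encodes a string as UTF-8 and then reads those same bytes back as if they were codepage text.

```python
def misread_utf8_as_codepage(text: str, codepage: str = "cp1252") -> str:
    """Encode `text` as UTF-8, then decode the bytes as `codepage` text."""
    utf8_bytes = text.encode("utf-8")
    # Each multi-byte UTF-8 sequence is now read as several one-byte characters.
    return utf8_bytes.decode(codepage)


if __name__ == "__main__":
    print(misread_utf8_as_codepage("Função"))  # prints: FunÃ§Ã£o
```

The two-byte UTF-8 sequences for "ç" and "ã" each turn into two codepage characters, which is the usual shape of this kind of garbage.
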
* Re: on eshell's encoding
From: Yuri Khan @ 2016-07-26 18:26 UTC
To: Eli Zaretskii; +Cc: help-gnu-emacs@gnu.org

On Wed, Jul 27, 2016 at 12:17 AM, Eli Zaretskii <eliz@gnu.org> wrote:

> The only way to break the limitations of the
> system codepage is to use the Unicode (a.k.a. "wide") APIs, which
> expect strings in UTF-16 encoding. But that is not currently
> supported in Emacs, due to boring technical problems.

It’s not even clear if using the wide API on the caller side will
suffice. The callee also needs to cooperate, by using the
corresponding wide API to retrieve the command line arguments.

* Re: on eshell's encoding
From: Eli Zaretskii @ 2016-07-26 18:35 UTC
To: help-gnu-emacs

> From: Yuri Khan <yuri.v.khan@gmail.com>
> Date: Wed, 27 Jul 2016 00:26:42 +0600
> Cc: "help-gnu-emacs@gnu.org" <help-gnu-emacs@gnu.org>
>
> On Wed, Jul 27, 2016 at 12:17 AM, Eli Zaretskii <eliz@gnu.org> wrote:
>
> > The only way to break the limitations of the
> > system codepage is to use the Unicode (a.k.a. "wide") APIs, which
> > expect strings in UTF-16 encoding. But that is not currently
> > supported in Emacs, due to boring technical problems.
>
> It’s not even clear if using the wide API on the caller side will
> suffice. The callee also needs to cooperate, by using the
> corresponding wide API to retrieve the command line arguments.

Yes, and that's one of the few reasons why Emacs on Windows doesn't
bother to use the wide APIs: too few programs Emacs users normally
invoke can cooperate like that.

But if Emacs did use the wide APIs, it wouldn't have been a loss,
because programs that use ANSI APIs to access their command-line
arguments would have them converted to the system codepage by Windows,
and so it would have worked or not exactly as it does or doesn't now.

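The conversion to the system codepage mentioned above can be approximated in Python. This is a rough sketch, not a model of the actual Windows routines: it assumes cp1252 as the system codepage, uses a hypothetical function name, and relies on Python's codec with errors="replace", which substitutes "?" but does not imitate any substitution of close equivalents that Windows may perform.

```python
def as_seen_by_ansi_program(arg: str, codepage: str = "cp1252") -> str:
    """Approximate what a codepage-based program receives for a Unicode argument."""
    # Characters the codepage cannot represent are replaced by "?".
    return arg.encode(codepage, errors="replace").decode(codepage)


if __name__ == "__main__":
    print(as_seen_by_ansi_program("Função"))  # prints: Função
    print(as_seen_by_ansi_program("Привет"))  # prints: ??????
```

Text that fits the codepage survives, and text that does not is lost, which matches the "worked or not exactly as it does or doesn't now" remark.
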
* Re: on eshell's encoding
From: Daniel Bastos @ 2016-07-27 11:56 UTC
To: help-gnu-emacs

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Daniel Bastos <dbastos@toledo.com>
>> Date: Tue, 26 Jul 2016 13:49:15 -0300
>>
>> > Is this on MS-Windows? If so, you cannot invoke programs from Emacs
>> > with command-line arguments encoded in anything but the system
>> > codepage. And UTF-8 cannot be a system codepage on Windows.
>>
>> You're right. This is MS-Windows. But I thought MS-Windows would not
>> interfere here. Why does it interfere? I thought the messages would go
>> straight into git's ARGV.
>
> How can it go "straight"?

I meant not being messed with. I don't know anything about MS-Windows.
In UNIX the creation of a new process by a shell is likely to call
execve, which won't touch the caller strings passed in through the
argv-argument.

* Re: on eshell's encoding
From: Yuri Khan @ 2016-07-27 13:15 UTC
To: Daniel Bastos; +Cc: help-gnu-emacs@gnu.org

On Wed, Jul 27, 2016 at 6:56 PM, Daniel Bastos <dbastos@toledo.com> wrote:

> I meant not being messed with. I don't know anything about MS-Windows.
> In UNIX the creation of a new process by a shell is likely to call
> execve, which won't touch the caller strings passed in through the
> argv-argument.

Well Windows is a different beast entirely. The basic premise is the
same, in that the parent invokes CreateProcessW, passing a
UTF-16-encoded command line, and the child process invokes
GetCommandLineW and then optionally CommandLineToArgvW to split the
command line into arguments.

Problem is, most programs prefer to work internally with 8-bit-based
encodings, and the Win32 API makes it very easy by providing backward
compatibility wrapper functions CreateProcessA and GetCommandLineA,
which unfortunately convert from/to the ANSI or OEM encoding defined
by the locale. And there is no Win32 locale for which UTF-8 is either
the ANSI or the OEM encoding.

This one point makes it very difficult to use Windows in the Unix Way:
you get to worry about encodings on every process boundary.

* Re: on eshell's encoding
From: Eli Zaretskii @ 2016-07-27 16:22 UTC
To: help-gnu-emacs

> From: Yuri Khan <yuri.v.khan@gmail.com>
> Date: Wed, 27 Jul 2016 19:15:45 +0600
> Cc: "help-gnu-emacs@gnu.org" <help-gnu-emacs@gnu.org>
>
> On Wed, Jul 27, 2016 at 6:56 PM, Daniel Bastos <dbastos@toledo.com> wrote:
>
> > I meant not being messed with. I don't know anything about MS-Windows.
> > In UNIX the creation of a new process by a shell is likely to call
> > execve, which won't touch the caller strings passed in through the
> > argv-argument.
>
> Well Windows is a different beast entirely. The basic premise is the
> same, in that the parent invokes CreateProcessW, passing a
> UTF-16-encoded command line, and the child process invokes
> GetCommandLineW and then optionally CommandLineToArgvW to split the
> command line into arguments.

So it isn't a different beast, really. Both on Unix and on Windows,
Emacs encodes the command line before passing it to system APIs. The
details differ, but not the basic idea.

> Problem is, most programs prefer to work internally with 8-bit-based
> encodings, and the Win32 API makes it very easy by providing backward
> compatibility wrapper functions CreateProcessA and GetCommandLineA,
> which unfortunately convert from/to the ANSI or OEM encoding defined
> by the locale.

Nitpicking: always ANSI, never the OEM.

> And there is no Win32 locale for which UTF-8 is either the ANSI or
> the OEM encoding.

It's actually worse than that: the Windows locale implementation
doesn't support variable-length encodings, so UTF-8 cannot be a
locale's encoding, unless MS change their related runtime libraries in
a radical way.

> This one point makes it very difficult to use Windows in the Unix Way:
> you get to worry about encodings on every process boundary.

Same on Unix, unless you are willing to bet on UTF-8 being the
locale's codeset.

* Re: on eshell's encoding
From: Yuri Khan @ 2016-07-27 16:47 UTC
To: Eli Zaretskii; +Cc: help-gnu-emacs@gnu.org

On Wed, Jul 27, 2016 at 11:22 PM, Eli Zaretskii <eliz@gnu.org> wrote:

> It's actually worse than that: the Windows locale implementation
> doesn't support variable-length encodings

It sort of does, as long as the variable in question never exceeds 2.
See, for example, cp932.

> so UTF-8 cannot be a
> locale's encoding, unless MS change their related runtime libraries in
> a radical way.

* Re: on eshell's encoding
From: Eli Zaretskii @ 2016-07-27 17:12 UTC
To: help-gnu-emacs

> From: Yuri Khan <yuri.v.khan@gmail.com>
> Date: Wed, 27 Jul 2016 22:47:01 +0600
> Cc: "help-gnu-emacs@gnu.org" <help-gnu-emacs@gnu.org>
>
> On Wed, Jul 27, 2016 at 11:22 PM, Eli Zaretskii <eliz@gnu.org> wrote:
>
> > It's actually worse than that: the Windows locale implementation
> > doesn't support variable-length encodings
>
> It sort of does, as long as the variable in question never exceeds 2.
> See, for example, cp932.

cp932 is a DBCS character set, so not relevant to the above.

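The difference in byte widths under discussion can be checked with Python's codecs. This is a small sketch with a hypothetical helper name; it assumes Python's cp932 codec is close enough to the Windows codepage for the purpose of counting bytes. It lists how many bytes each character takes in a given encoding.

```python
def byte_lengths(text: str, encoding: str) -> list[int]:
    """Return the encoded length, in bytes, of each character in `text`."""
    return [len(ch.encode(encoding)) for ch in text]


if __name__ == "__main__":
    print(byte_lengths("Aあ", "cp932"))    # prints: [1, 2]
    print(byte_lengths("Aあ€😀", "utf-8"))  # prints: [1, 3, 3, 4]
```

In cp932 a character takes one or two bytes, while UTF-8 needs up to four, which is the distinction between a double-byte character set and UTF-8 that the two posters are arguing over.
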
* Re: on eshell's encoding
From: Eli Zaretskii @ 2016-07-27 16:14 UTC
To: help-gnu-emacs

> From: Daniel Bastos <dbastos@toledo.com>
> Date: Wed, 27 Jul 2016 08:56:31 -0300
>
> >> You're right. This is MS-Windows. But I thought MS-Windows would not
> >> interfere here. Why does it interfere? I thought the messages would go
> >> straight into git's ARGV.
> >
> > How can it go "straight"?
>
> I meant not being messed with. I don't know anything about MS-Windows.
> In UNIX the creation of a new process by a shell is likely to call
> execve, which won't touch the caller strings passed in through the
> argv-argument.

Like I said, Eshell is not a shell, it just pretends to be one. It
will eventually cause execve, or something like it, to be called, but
before it, the command-line arguments will be encoded in the locale's
encoding, since that's what execve expects. This is true on Windows
and on Unix alike.

So in this case, the command-line arguments are always "messed with"
in Emacs. If your locale happens to use UTF-8, then it will _almost_
look as if the arguments were passed to execve untouched, but that's
an illusion, and is certainly incorrect when the locale's codeset is
not UTF-8 (which is always true on Windows).

* Re: on eshell's encoding
From: Daniel Bastos @ 2016-08-02 13:24 UTC
To: help-gnu-emacs

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Daniel Bastos <dbastos@toledo.com>
>> Date: Wed, 27 Jul 2016 08:56:31 -0300
>>
>> >> You're right. This is MS-Windows. But I thought MS-Windows would not
>> >> interfere here. Why does it interfere? I thought the messages would go
>> >> straight into git's ARGV.
>> >
>> > How can it go "straight"?
>>
>> I meant not being messed with. I don't know anything about MS-Windows.
>> In UNIX the creation of a new process by a shell is likely to call
>> execve, which won't touch the caller strings passed in through the
>> argv-argument.
>
> Like I said, Eshell is not a shell, it just pretends to be one. It
> will eventually cause execve, or something like it, to be called, but
> before it, the command-line arguments will be encoded in the locale's
> encoding, since that's what execve expects. This is true on Windows
> and on Unix alike.

That's true of EMACS. You're saying EMACS always encodes the command
line arguments. But what I said about UNIX is that whatever execve
receives in argv[] will remain as such, which apparently is not the
MS-Windows behavior.

Precisely: if on UNIX I use EMACS to call /program/ with argv[] encoded
in X, then /program/ will definitely receive its argv[] as prepared by
EMACS. That does not happen on MS-Windows. EMACS encodes the command
line in utf-8, but /program/ receives it in another encoding.

This surprises me. MS-Windows should not care what a program puts in
argv[]. I think it violates an important principle: an operating system
should help programs to communicate, but it should not care what they're
saying to each other. That's an important principle UNIX has given us.

Even if I'm not totally correct now, I'm certainly better educated.
Thank you.

* Re: on eshell's encoding
From: Eli Zaretskii @ 2016-08-02 15:12 UTC
To: help-gnu-emacs

> From: Daniel Bastos <dbastos@toledo.com>
> Date: Tue, 02 Aug 2016 10:24:32 -0300
>
> > Like I said, Eshell is not a shell, it just pretends to be one. It
> > will eventually cause execve, or something like it, to be called, but
> > before it, the command-line arguments will be encoded in the locale's
> > encoding, since that's what execve expects. This is true on Windows
> > and on Unix alike.
>
> That's true of EMACS. You're saying EMACS always encodes the command
> line arguments. But what I said about UNIX is that whatever execve
> receives in argv[] will remain as such, which apparently is not the
> MS-Windows behavior.
>
> Precisely: if on UNIX I use EMACS to call /program/ with argv[] encoded
> in X, then /program/ will definitely receive its argv[] as prepared by
> EMACS. That does not happen on MS-Windows. EMACS encodes the command
> line in utf-8, but /program/ receives it in another encoding.

That's not true. Emacs encodes the command line passed to
subprocesses on Windows and Unix alike. On each OS, it always encodes
them in the locale's codeset. If the Unix locale specified UTF-8 as
its codeset, then the command line will be encoded in UTF-8, but
that's no more than a coincidence. (On Windows, the locale's codeset,
a.k.a. "system codepage", can never be UTF-8, but that's the only
difference between Unix and Windows wrt encoding command lines of
subprocesses by Emacs.)

So, as long as you launch processes from Emacs, the difference between
Windows and Unix in this respect is all but non-existent.

The difference between the 2 OSes comes into play when you put
arbitrary byte sequences into argv[] passed to execve etc. (This
cannot be easily done in Emacs, but you can do that in your own
programs.) If those bytes are not valid for the locale's codeset,
Unix will nevertheless pass them verbatim to the subprogram. By
contrast, Windows will convert those bytes to UTF-16, assuming they
are in the current locale's codeset, then convert back to that codeset
when it invokes the subprogram. This conversion is lossy when the
bytes are not valid for the locale, as Windows will replace the
invalid bytes with either their close equivalents or with blanks or
with question marks. (When these bytes are all valid in the current
locale, this conversion happens as well, but it's not lossy, and
therefore its effect is exactly as on Unix.)

> This surprises me. MS-Windows should not care what a program puts in
> argv[].

It cares, because it attempts to transparently support both Unicode
programs, which expect their arguments in UTF-16, and non-Unicode
programs which expect their arguments in the locale's codeset.

> I think it violates an important principle: an operating system
> should help programs to communicate, but it should not care what they're
> saying to each other. That's an important principle UNIX has given us.

Clearly, Unix and Windows differ in their philosophy in this regard.
Each alternative has its advantages and disadvantages; which one you
like better is up to you.

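The Unix half of this comparison, where arbitrary bytes in argv[] reach the subprogram verbatim, can be observed from a short Python program. This is a sketch for POSIX systems only, with a hypothetical function name. It relies on the fact that Python's subprocess module accepts bytes arguments on POSIX, and that the child interpreter keeps undecodable argv bytes recoverable through os.fsencode.

```python
import os
import subprocess
import sys

# Child program: write the raw bytes of its first argument to stdout.
CHILD = "import os, sys; sys.stdout.buffer.write(os.fsencode(sys.argv[1]))"


def argv_bytes_seen_by_child(raw: bytes) -> bytes:
    """POSIX only: start a child Python and return the bytes of its argv[1]."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, raw],
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout


if __name__ == "__main__" and os.name == "posix":
    raw = b"caf\xe9 \xff\xfe"  # not valid UTF-8
    print(argv_bytes_seen_by_child(raw) == raw)
```

On Windows the same experiment is not directly possible from Python, since arguments there are passed as text rather than as raw bytes.
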