bug#6546: win32 grep/shell utf-8 encoding

unofficial mirror of bug-gnu-emacs@gnu.org 

* bug#6546: win32 grep/shell utf-8 encoding
@ 2010-07-01  8:46 Laimonas Vėbra
  2010-07-01 17:26 ` Eli Zaretskii
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Laimonas Vėbra @ 2010-07-01  8:46 UTC (permalink / raw)
  To: 6546

Maybe this is not actually a bug (but missing functionality), but how
should one set up Emacs and .emacs to grep files in utf-8 encoding?

grep 2.6.3 (cygwin) at last works correctly (coloring multibyte matches)
from the win32 console (according to the LANG environment setting), but
no matter how I've tried to push Emacs (set-language-environment,
coding-system-for-(read|write), set-env in grep-setup-hook), it just
doesn't work, because somewhere inside the Emacs win32 code it sticks to
the Windows locale codepage and tries hard to convert to this
codepage/encoding before it passes arguments to the shell. No wonder it
fails when it comes to unicode.

How to reproduce:

Create a utf-8 file with some unicode characters (Cyrillic, Baltic,
whatever; not only ascii) and try to grep for some utf-8 strings from
Emacs (M-x grep).


* bug#6546: win32 grep/shell utf-8 encoding
  2010-07-01  8:46 bug#6546: win32 grep/shell utf-8 encoding Laimonas Vėbra
@ 2010-07-01 17:26 ` Eli Zaretskii
  2010-07-01 18:05   ` Laimonas Vėbra
  2010-07-22 12:50 ` Juanma Barranquero
  2022-04-24 12:01 ` bug#6546: bug#6705: w32 cmdproxy.c pass args to cygwin; erroneous charset conversion (problem description, solution/suggestion) Lars Ingebrigtsen
  2 siblings, 1 reply; 20+ messages in thread
From: Eli Zaretskii @ 2010-07-01 17:26 UTC (permalink / raw)
  To: Laimonas Vėbra; +Cc: 6546

> Date: Thu, 01 Jul 2010 11:46:37 +0300
> From: Laimonas Vėbra <laimonas.vebra@gmail.com>
> Cc: 
> 
> Maybe it's actually not the bug (but missing functionality), but how do 
> one should/could setup ones emacs && .emacs to grep files in utf-8 
> encoding?
> 
> grep 2.6.3 (cygwin) at last works correctly (coloring multibyte matches) 
> from win32 console (according to LANG environment settings), but no 
> matter how i've tried to push emacs (set-language-environment, 
> coding-system-for-(read|write), set-env in grep-setup-hook), it just 
> don't work, because somewhere inside the Emacs win32 stuff it sticks to 
> windows locale codepage and tries hard to convert to this 
> codepage/encoding before it passes arguments to shell. No wonder -- it 
> fails when it comes to unicode.

Did you try set-process-coding-system?







* bug#6546: win32 grep/shell utf-8 encoding
  2010-07-01 17:26 ` Eli Zaretskii
@ 2010-07-01 18:05   ` Laimonas Vėbra
  0 siblings, 0 replies; 20+ messages in thread
From: Laimonas Vėbra @ 2010-07-01 18:05 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 6546

Eli Zaretskii wrote:

> Did you try set-process-coding-system?

No, but isn't it coding-system-for-(read|write) that specifies the
(synchronous) subprocess input|output coding system?
And how am I supposed to do that (set-process-coding-system) a priori
(when no process exists yet) for a single grep command which executes
and returns (process terminates)?


* bug#6546: win32 grep/shell utf-8 encoding
  2010-07-01  8:46 bug#6546: win32 grep/shell utf-8 encoding Laimonas Vėbra
  2010-07-01 17:26 ` Eli Zaretskii
@ 2010-07-22 12:50 ` Juanma Barranquero
  2010-07-22 14:11   ` Laimonas Vėbra
  2022-04-24 12:01 ` bug#6546: bug#6705: w32 cmdproxy.c pass args to cygwin; erroneous charset conversion (problem description, solution/suggestion) Lars Ingebrigtsen
  2 siblings, 1 reply; 20+ messages in thread
From: Juanma Barranquero @ 2010-07-22 12:50 UTC (permalink / raw)
  To: Laimonas Vėbra; +Cc: 6546

On Thu, Jul 1, 2010 at 10:46, Laimonas Vėbra <laimonas.vebra@gmail.com> wrote:

> Create utf-8 file with some unicode characters (Cyrillic, Baltic, whatever;
> not only ascii) and try to grep for some utf-8 strings from Emacs (M-x
> grep).

File 6546.txt (in utf-8, no BOM):

--------------------------------
Cyrillic follows:
ЁШејҘҘ
--------------------------------

M-x grep <RET> ШејҘ 6546.txt<RET>

=>

-*- mode: grep; default-directory: "c:/emacs/repo/" -*-
Grep started at Thu Jul 22 14:46:58

grep -nH -e ШејҘ 6546.txt
6546.txt:2:ЁШејҘҘ

Grep finished (matches found) at Thu Jul 22 14:46:58


so I cannot reproduce it. Could you please send a step-by-step recipe,
starting from emacs -Q?

Thanks,

    Juanma






* bug#6546: win32 grep/shell utf-8 encoding
  2010-07-22 12:50 ` Juanma Barranquero
@ 2010-07-22 14:11   ` Laimonas Vėbra
  2010-07-22 15:02     ` Juanma Barranquero
  0 siblings, 1 reply; 20+ messages in thread
From: Laimonas Vėbra @ 2010-07-22 14:11 UTC (permalink / raw)
  To: Juanma Barranquero; +Cc: 6546

Juanma Barranquero wrote:

> --------------------------------
> Cyrillic follows:
> ЁШејҘҘ
> --------------------------------
>
> M-x grep<RET>  ШејҘ 6546.txt<RET>
>
> =>
>
> -*- mode: grep; default-directory: "c:/emacs/repo/" -*-
> Grep started at Thu Jul 22 14:46:58
>
> grep -nH -e ШејҘ 6546.txt
> 6546.txt:2:ЁШејҘҘ
>
> Grep finished (matches found) at Thu Jul 22 14:46:58
>
>
> so I cannot reproduce it. Could you please send a step-by-step recipe,
> starting from emacs -Q?

That means you are using gnu-win32 grep. Some older (2.5.4) and newer
(2.6.3) cygwin greps won't work.

I don't believe cygwin grep (or any other cygwin app) is going to be
fixed/coded like the gnu-win32 app, because it's a matter of how
arguments are passed through the winapi->cygwin (whole system) layers.

Besides, older (2.5.x) greps don't correctly color (multibyte) matches
(try grep -nH -e 'Ш*' 6546.txt)







* bug#6546: win32 grep/shell utf-8 encoding
  2010-07-22 14:11   ` Laimonas Vėbra
@ 2010-07-22 15:02     ` Juanma Barranquero
  2010-07-22 18:24       ` Laimonas Vėbra
  0 siblings, 1 reply; 20+ messages in thread
From: Juanma Barranquero @ 2010-07-22 15:02 UTC (permalink / raw)
  To: Laimonas Vėbra; +Cc: 6546

On Thu, Jul 22, 2010 at 16:11, Laimonas Vėbra <laimonas.vebra@gmail.com> wrote:

> That means you are using gnu-win32 grep. Some older (2.5.4) and newer
> (2.6.3) cygwin greps won't work.

Sorry, I missed that in your original report.

Did you try adding an entry to `process-coding-system-alist'?

    Juanma






* bug#6546: win32 grep/shell utf-8 encoding
  2010-07-22 15:02     ` Juanma Barranquero
@ 2010-07-22 18:24       ` Laimonas Vėbra
  2010-07-22 19:53         ` Eli Zaretskii
  0 siblings, 1 reply; 20+ messages in thread
From: Laimonas Vėbra @ 2010-07-22 18:24 UTC (permalink / raw)
  To: Juanma Barranquero; +Cc: 6546

Juanma Barranquero wrote:
> On Thu, Jul 22, 2010 at 16:11, Laimonas Vėbra<laimonas.vebra@gmail.com>  wrote:
>
>> That means you are using gnu-win32 grep. Some older (2.5.4) and newer
>> (2.6.3) cygwin greps won't work.
>
> Sorry, I missed that in your original report.
>
> Did you try adding an entry to `process-coding-system-alist'?

The problem is not there. I can change the encoding of the command
string (which is passed to external cygwin apps) using
coding-system-for-write. That works (it is converted correctly
utf-8->cp1257, cp1251, etc.), but it doesn't help, because of the way
the args (command line) are passed/transcoded through the winapi
(CreateProcessA) and cygwin layers.
This bug is related to bug#6705 (which has a detailed description of
what's happening).


* bug#6546: win32 grep/shell utf-8 encoding
  2010-07-22 18:24       ` Laimonas Vėbra
@ 2010-07-22 19:53         ` Eli Zaretskii
  2010-07-22 21:48           ` Laimonas Vėbra
  0 siblings, 1 reply; 20+ messages in thread
From: Eli Zaretskii @ 2010-07-22 19:53 UTC (permalink / raw)
  To: Laimonas Vėbra; +Cc: lekktu, 6546

> Date: Thu, 22 Jul 2010 21:24:12 +0300
> From: Laimonas Vėbra <laimonas.vebra@gmail.com>
> Cc: 6546@debbugs.gnu.org
> 
> The problem is not here. I can change the encoding of the command string 
> (which is passed to external cygwin apps) using
> coding-system-for-write. It works (converted correctly utf-8->cp1257, 
> cp1251, etc), but it doesn't help, because of the way the args (command 
> line) are passed/transcoded through the winapi (CreateProcessA) and 
> cygwin layer.

Did you try to add a suitably-valued LANG variable to
process-environment?  That would at least force Cygwin executables to
work in the Windows codepage.

> This bug is related to bug#6705 (there are detailed description of 
> what's happening)

Then please merge them.







* bug#6546: win32 grep/shell utf-8 encoding
  2010-07-22 19:53         ` Eli Zaretskii
@ 2010-07-22 21:48           ` Laimonas Vėbra
  2010-07-22 23:00             ` Juanma Barranquero
  2010-07-23 10:24             ` Eli Zaretskii
  0 siblings, 2 replies; 20+ messages in thread
From: Laimonas Vėbra @ 2010-07-22 21:48 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 6546

Eli Zaretskii wrote:
>> Date: Thu, 22 Jul 2010 21:24:12 +0300
>> From: Laimonas Vėbra<laimonas.vebra@gmail.com>
>> Cc: 6546@debbugs.gnu.org
>>
>> The problem is not here. I can change the encoding of the command string
>> (which is passed to external cygwin apps) using
>> coding-system-for-write. It works (converted correctly utf-8->cp1257,
>> cp1251, etc), but it doesn't help, because of the way the args (command
>> line) are passed/transcoded through the winapi (CreateProcessA) and
>> cygwin layer.
>
> Did you try to add a suitably-valued LANG variable to
> process-environment?  That would at least force Cygwin executables to
> work in the Windows codepage.

The only way it works is when I set the LANG process-environment
variable to the current Windows locale codepage and
'coding-system-for-write' to the encoding/charset in which I'd like to
grep.
That way it works, but I'm not sure (I seriously doubt) that a
LANG/locale codepage which differs from the actual args encoding won't
result in ugly problems/bugs (e.g. sorting, piping to other apps).
If it really won't, and this setup is "as it should be, intended", then
this bug could be closed.


>> This bug is related to bug#6705 (there are detailed description of
>> what's happening)
>
> Then please merge them.

How can i do that?






* bug#6546: win32 grep/shell utf-8 encoding
  2010-07-22 21:48           ` Laimonas Vėbra
@ 2010-07-22 23:00             ` Juanma Barranquero
  2010-07-23 10:24             ` Eli Zaretskii
  1 sibling, 0 replies; 20+ messages in thread
From: Juanma Barranquero @ 2010-07-22 23:00 UTC (permalink / raw)
  To: Laimonas Vėbra; +Cc: 6546

On Thu, Jul 22, 2010 at 23:48, Laimonas Vėbra <laimonas.vebra@gmail.com> wrote:

> How can i do that?

You can send a message to control@debbugs.gnu.org, starting with

merge 6705 6546
quit

If both bugs aren't in the same state (open/closed, etc.), you can use
"forcemerge" instead.

Control commands for debbugs are documented in the file admin/notes/bugtracker.

    Juanma


* bug#6546: win32 grep/shell utf-8 encoding
  2010-07-22 21:48           ` Laimonas Vėbra
  2010-07-22 23:00             ` Juanma Barranquero
@ 2010-07-23 10:24             ` Eli Zaretskii
  2010-07-23 12:54               ` Laimonas Vėbra
  1 sibling, 1 reply; 20+ messages in thread
From: Eli Zaretskii @ 2010-07-23 10:24 UTC (permalink / raw)
  To: Laimonas Vėbra; +Cc: 6546

> Date: Fri, 23 Jul 2010 00:48:28 +0300
> From: Laimonas Vėbra <laimonas.vebra@gmail.com>
> CC: 6546@debbugs.gnu.org
> 
> > Did you try to add a suitably-valued LANG variable to
> > process-environment?  That would at least force Cygwin executables to
> > work in the Windows codepage.
> 
> The only way it works is when i set LANG process-environment variable to 
> the current windows locale codepage and 'coding-system-for-write' to the 
> encoding/charset in which i'd like to grep.

That's the only way it's _supposed_ to work.

> That way it works, but i'm not sure (seriously doubt) if LANG/locale 
> codepage, which differs from the actual args encoding, won't result in 
> any ugly problems/bugs (e.g. sorting, piping to other apps)

You should set LANG to the current codepage and make sure
locale-coding-system is set to the same codepage.  Then the Cygwin
programs invoked as Emacs subprocesses should do what you expect.

> If it really won't and this setup is "as it should be, intended", then 
> this bug could be closed.

Yes, this is the only setup that is supposed to work.







* bug#6546: win32 grep/shell utf-8 encoding
  2010-07-23 10:24             ` Eli Zaretskii
@ 2010-07-23 12:54               ` Laimonas Vėbra
  2010-07-23 14:23                 ` Eli Zaretskii
  0 siblings, 1 reply; 20+ messages in thread
From: Laimonas Vėbra @ 2010-07-23 12:54 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 6546

Eli Zaretskii wrote:
>> Date: Fri, 23 Jul 2010 00:48:28 +0300
>> From: Laimonas Vėbra<laimonas.vebra@gmail.com>
>> CC: 6546@debbugs.gnu.org
>>
>>> Did you try to add a suitably-valued LANG variable to
>>> process-environment?  That would at least force Cygwin executables to
>>> work in the Windows codepage.
>>
>> The only way it works is when i set LANG process-environment variable to
>> the current windows locale codepage and 'coding-system-for-write' to the
>> encoding/charset in which i'd like to grep.
>
> That's the only way it's _supposed_ to work.

Then I suppose that is the wrong/incorrect way for it to operate.

Why? Because for correct behaviour we (external app, Emacs) shouldn't
be required to set the locale to some fixed setting; it should be freely
changeable, as many cygwin apps rely on that. For example, how do you
sort data with improper locale settings (which are required to be
fixed)? Will you look for another workaround?

Example:
echo -e "-ĔĿİ-\n_ĔĿİ_\nELI\nĔĿİ" > file.txt

$ export LANG=lt_LT.cp1257
$ cat file.txt
-Ä”ÄæÄ°-
_Ä”ÄæÄ°_
ELI
Ä”ÄæÄ°

$ cat file.txt | sort
_Ä”ÄæÄ°_
Ä”ÄæÄ°
-Ä”ÄæÄ°-
ELI

$ export LANG=lt_LT.utf-8
$ cat file.txt
-ĔĿİ-
_ĔĿİ_
ELI
ĔĿİ

$ cat file.txt | sort
_ĔĿİ_
ELI
ĔĿİ
-ĔĿİ-
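
The garbled `cat` output under LANG=lt_LT.cp1257 is what one would expect if
the UTF-8 bytes of the file are displayed as cp1257. The following is a minimal
sketch of that effect only; Python and the helper name `shown_as_cp1257` are
illustrative assumptions and not part of the original report, and it does not
model the locale-dependent collation done by `sort`.

```python
# Illustration only: show how UTF-8 bytes look when they are interpreted
# as cp1257 (the LANG=lt_LT.cp1257 case in the transcript above).
lines = ["-ĔĿİ-", "_ĔĿİ_", "ELI", "ĔĿİ"]


def shown_as_cp1257(text: str) -> str:
    """Encode text as UTF-8, then decode those same bytes as cp1257."""
    return text.encode("utf-8").decode("cp1257")


garbled = [shown_as_cp1257(line) for line in lines]
for line in garbled:
    print(line)
```

Pure-ASCII lines such as "ELI" come out unchanged, which is why only the
non-ASCII lines look different between the two LANG settings.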

> Yes, this is the only setup that is supposed to work.

Maybe it is/was supposed to work (at all) like that in the sense of a
workaround, but I doubt that it was/is supposed to be correct.







* bug#6546: win32 grep/shell utf-8 encoding
  2010-07-23 12:54               ` Laimonas Vėbra
@ 2010-07-23 14:23                 ` Eli Zaretskii
  2010-07-23 15:50                   ` Laimonas Vėbra
  0 siblings, 1 reply; 20+ messages in thread
From: Eli Zaretskii @ 2010-07-23 14:23 UTC (permalink / raw)
  To: Laimonas Vėbra; +Cc: 6546

> Date: Fri, 23 Jul 2010 15:54:34 +0300
> From: Laimonas Vėbra <laimonas.vebra@gmail.com>
> CC: 6546@debbugs.gnu.org
> 
> >> The only way it works is when i set LANG process-environment variable to
> >> the current windows locale codepage and 'coding-system-for-write' to the
> >> encoding/charset in which i'd like to grep.
> >
> > That's the only way it's _supposed_ to work.
> 
> Then i suppose it's wrong/incorrect way of what is supposed to operate 
> like that.
> 
> Why? Because for the correct behaviour we (external app, Emacs) 
> shouldn't require to set locale to some fixed setting; it should be 
> freely changed as many cygwin apps relies on that.

You cannot easily change the locale of a Windows system by specifying
some environment variable.  You need to actually switch it
system-wide.  As long as we use ANSI APIs on Windows, we can only
support a single Windows locale, and that locale must be the current
user's locale.

> For example, how do you sort data with improper locale settings
> (which are required to be fixed)?

You can't, sorry.

> > Yes, this is the only setup that is supposed to work.
> 
> Maybe it is/was suppose to work (at all) like that in the sense of 
> workaround, but i doubt if it was/is supposed to be correct.

It cannot work in any other way with ANSI APIs.
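
The single-codepage limitation can be sketched as follows. This is an
illustration under assumptions, not Emacs or Windows code: Python's strict
codecs stand in for the ANSI conversion (the real Windows conversion may
substitute best-fit characters instead of failing), and `fits_codepage` is a
hypothetical helper name.

```python
# Illustration only: an 8-bit ("ANSI") code page can represent only its
# own repertoire, so text outside it cannot pass through unchanged.
def fits_codepage(text: str, codepage: str) -> bool:
    """Return True if every character of text exists in codepage."""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


baltic = "žą"     # Lithuanian letters, present in cp1257
cyrillic = "ЁШ"   # Cyrillic letters, absent from cp1257

baltic_in_1257 = fits_codepage(baltic, "cp1257")
cyrillic_in_1257 = fits_codepage(cyrillic, "cp1257")
cyrillic_in_1251 = fits_codepage(cyrillic, "cp1251")

print(baltic_in_1257, cyrillic_in_1257, cyrillic_in_1251)
```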







* bug#6546: win32 grep/shell utf-8 encoding
  2010-07-23 14:23                 ` Eli Zaretskii
@ 2010-07-23 15:50                   ` Laimonas Vėbra
  2010-07-23 18:09                     ` Eli Zaretskii
  0 siblings, 1 reply; 20+ messages in thread
From: Laimonas Vėbra @ 2010-07-23 15:50 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 6546

Eli Zaretskii wrote:

> You cannot easily change the locale of a Windows system by specifying
> some environment variable.  You need to actually switch it
> system-wide.  As long as we use ANSI APIs on Windows, we can only

I am talking about the LANG env setting, which we can freely change to
make cygwin apps act differently (as we need).

> You can't, sorry.

You can. That example was supposed to show that you can freely change
the LANG variable, and the cygwin utils which rely on it act
appropriately.

Well, you can't change it freely in the sense of the Emacs setup
("workaround"), which requires that LANG be set the same as the current
system locale in order for Emacs to pass unicode/non-system-encoding
args.

So, I'm asking the same question again -- why do you think it's not
worth fixing this Emacs setup restriction, so that Emacs works with
cygwin apps the way they are intended to work from the cygwin/cmd shell
(setting whatever supported locale is needed on the fly)?


* bug#6546: win32 grep/shell utf-8 encoding
  2010-07-23 15:50                   ` Laimonas Vėbra
@ 2010-07-23 18:09                     ` Eli Zaretskii
  2010-07-23 19:07                       ` Laimonas Vėbra
  0 siblings, 1 reply; 20+ messages in thread
From: Eli Zaretskii @ 2010-07-23 18:09 UTC (permalink / raw)
  To: Laimonas Vėbra; +Cc: 6546

> Date: Fri, 23 Jul 2010 18:50:54 +0300
> From: Laimonas Vėbra <laimonas.vebra@gmail.com>
> CC: 6546@debbugs.gnu.org
> 
> Eli Zaretskii wrote:
> 
> > You cannot easily change the locale of a Windows system by specifying
> > some environment variable.  You need to actually switch it
> > system-wide.  As long as we use ANSI APIs on Windows, we can only
> 
> I am talking about LANG env settings, which we can freely change for the 
> cygwin apps to act differently (as we need).

You are talking about Cygwin programs, while I'm talking about the
native w32 build of Emacs.  The effect of LANG and the way to change
the locale is different for each one of these two.

> > You can't, sorry.
> 
> You can. That example was supposed to show, that you can freely change 
> LANG variable and cygwin utils, which relies on it, acts appropriately.

Again, I was not talking about Cygwin, I was talking about the native
w32 build of Emacs.  It doesn't use the Unicode (UTF-16) APIs, so it
can only support the current codepage when it invokes programs through
the Windows APIs.

> So, i'm asking the same question again -- why do you think it's not 
> worth to fix this Emacs setup restriction in order to work with cygwin 
> apps like it's intended from cygwin/cmd shell (setting on the fly as 
> needed whatever supported locale)?

I already answered that.  I have nothing to add to what I said.







* bug#6546: win32 grep/shell utf-8 encoding
  2010-07-23 18:09                     ` Eli Zaretskii
@ 2010-07-23 19:07                       ` Laimonas Vėbra
  0 siblings, 0 replies; 20+ messages in thread
From: Laimonas Vėbra @ 2010-07-23 19:07 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 6546

Eli Zaretskii wrote:
>> Date: Fri, 23 Jul 2010 18:50:54 +0300
>> From: Laimonas Vėbra<laimonas.vebra@gmail.com>
>> CC: 6546@debbugs.gnu.org
>>
>> Eli Zaretskii wrote:
>>
>>> You cannot easily change the locale of a Windows system by specifying
>>> some environment variable.  You need to actually switch it
>>> system-wide.  As long as we use ANSI APIs on Windows, we can only
>>
>> I am talking about LANG env settings, which we can freely change for the
>> cygwin apps to act differently (as we need).
>
> You are talking about Cygwin programs, while I'm talking about the
> native w32 build of Emacs.  The effect of LANG and the way to change
> the locale is different for each one of these two.

I am talking about the LANG setting restrictions that Emacs imposes. I
think it shouldn't.

>
>>> You can't, sorry.
>>
>> You can. That example was supposed to show, that you can freely change
>> LANG variable and cygwin utils, which relies on it, acts appropriately.
>
> Again, I was not talking about Cygwin, I was talking about the native
> w32 build of Emacs.  It doesn't use the Unicode (UTF-16) APIs, so it
> can only support the current codepage when it invokes programs through
> the Windows APIs.

It *can* (try the mingw example that I posted) pass utf-8 encoded
arguments (and arguments in other encodings) when it invokes external
programs, and for that it doesn't need to use the UTF-16 API
_everywhere_. Like I said, it now works (perfectly) with native/mingw
apps without any change.







* bug#6546: bug#6705: w32 cmdproxy.c pass args to cygwin; erroneous charset conversion (problem description, solution/suggestion)
  2010-07-01  8:46 bug#6546: win32 grep/shell utf-8 encoding Laimonas Vėbra
  2010-07-01 17:26 ` Eli Zaretskii
  2010-07-22 12:50 ` Juanma Barranquero
@ 2022-04-24 12:01 ` Lars Ingebrigtsen
  2022-04-24 12:31   ` Eli Zaretskii
  2 siblings, 1 reply; 20+ messages in thread
From: Lars Ingebrigtsen @ 2022-04-24 12:01 UTC (permalink / raw)
  To: Laimonas Vėbra; +Cc: 6546, 6705

Laimonas Vėbra <laimonas.vebra@gmail.com> writes:

> Create utf-8 file with some unicode characters (Cyrillic, Baltic,
> whatever; not only ascii) and try to grep for some utf-8 strings from
> Emacs (M-x grep).

(I'm going through old bug reports that unfortunately weren't resolved
at the time.)

This was eleven years ago -- is this still an issue in recent
Emacs/Cygwin versions?  (I can't recall seeing any recent reports about
this.)

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#6546: bug#6705: w32 cmdproxy.c pass args to cygwin; erroneous charset conversion (problem description, solution/suggestion)
  2022-04-24 12:01 ` bug#6546: bug#6705: w32 cmdproxy.c pass args to cygwin; erroneous charset conversion (problem description, solution/suggestion) Lars Ingebrigtsen
@ 2022-04-24 12:31   ` Eli Zaretskii
  2022-04-24 13:25     ` Lars Ingebrigtsen
  0 siblings, 1 reply; 20+ messages in thread
From: Eli Zaretskii @ 2022-04-24 12:31 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: laimonas.vebra, 6546, 6705

> From: Lars Ingebrigtsen <larsi@gnus.org>
> Date: Sun, 24 Apr 2022 14:01:32 +0200
> Cc: 6546@debbugs.gnu.org, 6705@debbugs.gnu.org
> 
> Laimonas Vėbra <laimonas.vebra@gmail.com> writes:
> 
> > Create utf-8 file with some unicode characters (Cyrillic, Baltic,
> > whatever; not only ascii) and try to grep for some utf-8 strings from
> > Emacs (M-x grep).
> 
> (I'm going through old bug reports that unfortunately weren't resolved
> at the time.)
> 
> This was eleven years ago -- is this still an issue in recent
> Emacs/Cygwin versions?  (I can't recall seeing any recent reports about
> this.)

I think this bug should be closed.  Support for mixing a native w32
Emacs with Cygwin external programs is limited where character
encoding is involved because of the limitations of the APIs we use in
Emacs to invoke external programs, and because native w32 builds of
external programs in most cases support only a single system codepage.

So people who want to be able to invoke Cygwin programs from Emacs and
play by Cygwin LANG and locale rules (which emulate quite well the
Posix environment) should use a Cygwin build of Emacs.


* bug#6546: bug#6705: w32 cmdproxy.c pass args to cygwin; erroneous charset conversion (problem description, solution/suggestion)
  2022-04-24 12:31   ` Eli Zaretskii
@ 2022-04-24 13:25     ` Lars Ingebrigtsen
  0 siblings, 0 replies; 20+ messages in thread
From: Lars Ingebrigtsen @ 2022-04-24 13:25 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: laimonas.vebra, 6546, 6705

Eli Zaretskii <eliz@gnu.org> writes:

> I think this bug should be closed.  Support for mixing a native w32
> Emacs with Cygwin external programs is limited where character
> encoding is involved because of the limitations of the APIs we use in
> Emacs to invoke external programs, and because native w32 builds of
> external programs in most cases support only a single system codepage.
>
> So people who want to be able to invoke Cygwin programs from Emacs and
> play by Cygwin LANG and locale rules (which emulate quite well the
> Posix environment) should use a Cygwin build of Emacs.

OK; closing this bug report, then.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#6705: w32 cmdproxy.c pass args to cygwin; erroneous charset conversion (problem description, solution/suggestion)
@ 2010-07-22 12:31 Laimonas Vėbra
  2010-07-22 14:33 ` Jason Rumney
  0 siblings, 1 reply; 20+ messages in thread
From: Laimonas Vėbra @ 2010-07-22 12:31 UTC (permalink / raw)
  To: 6705

Below is a comment that I wrote for myself in cmdproxy.c (it explains
the problem). I have half of a working solution/fix for it (portable
enough, I suppose) using the MultiByteToWideChar API function, but I
won't send a (partly working) patch unless one of the developers who
agrees with the problem and intends to fix it asks for it.
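
The idea behind the proposed fix can be modelled as below. This is only a
sketch under assumptions, not the patch itself: Python codecs stand in for the
Windows conversion routines, cp1257 is taken as the system code page as in the
report, and decoding as UTF-8 is used as a rough stand-in for what
MultiByteToWideChar does when given CP_UTF8.

```python
# Illustration only: model the two ways a byte-string argument can be
# widened to UTF-16 before the process is created.
arg_bytes = "žą".encode("utf-8")  # argument bytes, UTF-8 as in the report

# Model of the ANSI path: the bytes are assumed to be in the system
# code page (cp1257 here), so each byte becomes a separate character.
via_ansi = arg_bytes.decode("cp1257")

# Model of the proposed path: the bytes are explicitly treated as UTF-8
# before being handed to a wide (UTF-16) API.
via_utf8 = arg_bytes.decode("utf-8")
wide_hex = via_utf8.encode("utf-16-le").hex()

print(repr(via_ansi))
print(repr(via_utf8))
print(wide_hex)
```

In this model only the second path hands the intended two characters to the
wide API; the first path hands over four unrelated characters.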

Besides, the patch itself is larger than 10 diff lines and it uses 
(duplicates by copying) some helper functions/declarations 
(open_input_file(), close_file_data(), rva_to_section(), 
w32_executable_type, RVA_TO_PTR) from unexw32.c, so it may need some 
code refactoring.

This problem certainly needs some discussion (how best to solve it)
because it touches on unicode communication aspects/issues. For those
who won't bother reading the whole description, here is a simple
question -- how can/should one (clearly) pass utf-8 arguments to an
external (cygwin) app on windows? I suppose that is not possible now.

Thank you for your attention.


> /* When calling a cygwin executable we need to explicitly convert utf-8
>    arguments (it's the encoding that Emacs uses internally and passes args in
>    to external commands, when coding-system-for-write is nil) to utf-16 and
>    call the unicode (wide) API function CreateProcess(W).
>    That needs to be done, because of this transcoding chain which
>    might (and it definitely WILL if args contain unicode, i.e. non
>    ascii/locale_charset characters) result in corrupted args:
>
>    WINAPI/OS layer:
>    multibyte string args (utf-8) -> CreateProcessA():
>    locale_codepage -> unicode (utf-16)
>
>    ->
>
>    CYGWIN layer:
>    unicode (utf-16) <-> utf-8 ->
>    cygwin locale env (LC_XXX, LANG; default: C.UTF-8)
>
>
>    Example #1:
>    utf-8 string 'žą'; 'ž'(0xC5, 0xBE) 'ą'(0xC4, 0x85) transcoding
>    (to cygwin locale env charset) chain:
>
>    converting #1:
>    locale_codepage (lt, LCID: 1063, ansi/oem cp: cp1257/cp775) -> utf-16;
>
>    utf-8 string 'žą' in locale codepage (cp1257) representation: 'Å¾Ä…'
>    'Å'(0xC5), '¾'(0xBE), 'Ä'(0xC4), '…'(0x85).
>
>    string converted to utf-16: 'Å¾Ä…'
>    U+00C5(Å), U+00BE(¾), U+00C4(Ä), U+2026(…).
>
>    utf-16: 'Å¾Ä…': 'Å'(U+00C5), '¾'(U+00BE), 'Ä'(U+00C4), '…'(U+2026).
>    <->
>    utf-8 : 'Å¾Ä…': 'Å'(0xC385), '¾'(0xC2BE), 'Ä'(0xC384), '…'(0xE280A6).
>
>    converting #2:
>    utf-16/utf-8 -> cygwin locale env (LANG = lt_LT.cp1257);
>
>    utf-8 string 'Å¾Ä…' (0xC3, 0x85, 0xC2, 0xBE, 0xC3, 0x84, 0xE2, 0x80, 0xA6)
>    converted to cp1257: 'Å¾Ä…' (0xC5, 0xBE, 0xC4, 0x85)
>
>    cp1257 string 'Å¾Ä…' in utf-8 representation: 'žą'; 'ž'(0xC5BE), 'ą'(0xC485)
>
>    Although the string was (should be) converted to cp1257 (according to
>    cygwin locale env variables), its original value ('žą'), after transcoding
>    to cp1257 (in cp1257 representation as it should be), is corrupted and indeed
>    the passed args are (were preserved) in utf-8 encoding.
>    It's important to note that such "original value preservation" happens
>    only under lucky circumstances, when we are converting to the windows
>    locale codepage/charset and the arg string (utf-8) in windows locale
>    representation doesn't result in some unconvertible character/combination
>    (e.g. undefined characters) and it's possible to convert back (from utf-16/utf-8
>    to locale charset). Corruption _always_ occurs if we are converting to a
>    codepage/charset other than the current windows locale codepage.
>
>    Consider unsuccessful/erroneous conversion example:
>    utf-8 string/character 'ĥ' (U+0125) passed to cygwin (utf-8):
>
>    utf-8 string 'ĥ'(0xC4A5) in locale codepage (cp1257) representation: 'Ä'
>    (0xA5('') is undefined in cp1257 and it doesn't map to unicode)
>
>    converting #1:
>    locale_codepage (lt, LCID: 1063, ansi/oem cp: cp1257/cp775) -> utf-16;
>
>    utf-8 string 'ĥ' in cp1257 representation: 'Ä'
>
>    string converted to utf-16: 'Ä' (0x00C4, 0xF8FD)
>    (http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1257.txt)
>    0xA5 (cp1257) is mapped to 0xF8FD in Unicode (Private Use Area Range: E000–F8FF)
>
>    utf-16: 'Ä': 'Ä'(U+00C4), ''(U+F8FD)
>    <->
>    utf-8 : 'Ä': 'Ä'(0xC384), ''(0xEFA3BD)
>
>    converting #2:
>    utf-16/utf-8 -> cygwin locale env (LANG = C.UTF-8);
>
>
>    utf-16 string 'Ä': 'Ä'(U+00C4), ''(U+F8FD)
>    converted to utf-8: 'Ä': 'Ä'(0xC384), ''(0xEFA3BD)
>
>    So, original string value 'ĥ' is transcoded to an invalid 'Ä' although that
>    shouldn't happen (as no conversion is supposed; neither implicitly, nor
>    explicitly)
>
>
>    Concluding all: erroneous conversion _always_ occurs, when we are converting
>    to codepage/charset other than the current windows locale codepage, although
>    corruption might occur even if we are not supposed to convert at all
>    (just pass utf-8 encoded arguments).
>
>
> */
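
The two-step chain described in the comment can be modelled roughly as follows.
This is a sketch under stated assumptions: Python's strict codecs stand in for
the Windows and Cygwin conversion routines (the real ones may apply best-fit or
private-use mappings, such as U+F8FD for 0xA5, instead of failing), and
`through_chain` is a hypothetical helper, not a function from Emacs or Cygwin.

```python
# Illustration only: model the two-step transcoding chain from the
# comment above with Python codecs.
def through_chain(text: str, win_codepage: str, cygwin_charset: str):
    """Return the bytes the Cygwin program would receive in this model,
    or None if one of the conversion steps fails."""
    raw = text.encode("utf-8")  # argument bytes, UTF-8 as in the report
    try:
        # Step 1 (WINAPI layer): bytes are read as the locale code page.
        widened = raw.decode(win_codepage)
        # Step 2 (Cygwin layer): text is converted to the LANG charset.
        return widened.encode(cygwin_charset)
    except (UnicodeDecodeError, UnicodeEncodeError):
        return None


lucky = through_chain("žą", "cp1257", "cp1257")
unlucky = through_chain("ĥ", "cp1257", "cp1257")
mismatch = through_chain("žą", "cp1257", "utf-8")

print(lucky == "žą".encode("utf-8"))     # bytes survive the round trip
print(unlucky)                           # 0xA5 is undefined in cp1257
print(mismatch == "žą".encode("utf-8"))  # charsets differ, bytes change
```

The first case corresponds to the "original value preservation" in the comment,
the second to the 'ĥ' example, and the third to the case where LANG names a
charset other than the Windows locale codepage.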






* bug#6705: w32 cmdproxy.c pass args to cygwin; erroneous charset conversion (problem description, solution/suggestion)
  2010-07-22 12:31 Laimonas Vėbra
@ 2010-07-22 14:33 ` Jason Rumney
  2010-07-22 18:14   ` bug#6546: " Laimonas Vėbra
  0 siblings, 1 reply; 20+ messages in thread
From: Jason Rumney @ 2010-07-22 14:33 UTC (permalink / raw)
  To: Laimonas Vėbra; +Cc: 6705

Laimonas Vėbra <laimonas.vebra@gmail.com> writes:

> This problem certainly needs some discussion (how best to solve it)
> because it addresses unicode communication aspects/issues. If some
> won't bother reading all the description, then here is a simple
> question --
> how do one can/should (clearly) pass utf-8 arguments to an external
> (cygwin) app on windows? I suppose, now it's not possible.

Don't use cmdproxy with Cygwin programs. If you need a shell in between,
use Cygwin bash.  cmdproxy is a wrapper to get around some problems with
various versions of the Windows native cmd.exe and command.com shell
programs.  Mixing Cygwin and native Windows is not advised.






* bug#6546: bug#6705: w32 cmdproxy.c pass args to cygwin; erroneous charset conversion (problem description, solution/suggestion)
  2010-07-22 14:33 ` Jason Rumney
@ 2010-07-22 18:14   ` Laimonas Vėbra
  0 siblings, 0 replies; 20+ messages in thread
From: Laimonas Vėbra @ 2010-07-22 18:14 UTC (permalink / raw)
  Cc: 6546

Jason Rumney wrote:

> Don't use cmdproxy with Cygwin programs. If you need a shell in
> between, use Cygwin bash.  cmdproxy is a wrapper to get around some
> problems with various versions of the Windows native cmd.exe and
> command.com shell programs.  Mixing Cygwin and native Windows is not
> advised.

That doesn't solve the problem (try to pass a utf-8 string from Emacs to
cygwin/bin/(ba)sh.exe or any other cygwin app), nor does cmdproxy in any
way complicate the matter (cmdproxy just passes the command line to
CreateProcess(); the same happens in w32proc.c when calling /bin/sh
instead of cmdproxy.exe). The problem is not cmdproxy itself, but the
winapi/cygwin layer and the way the args are passed/transcoded through
CreateProcess(A) -> cygwin layer.





