unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#6546: win32 grep/shell utf-8 encoding
@ 2010-07-01  8:46 Laimonas Vėbra
  2010-07-01 17:26 ` Eli Zaretskii
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Laimonas Vėbra @ 2010-07-01  8:46 UTC (permalink / raw)
  To: 6546

Maybe this is not actually a bug (but missing functionality), but how 
should/could one set up Emacs and .emacs to grep files in utf-8 
encoding?

grep 2.6.3 (cygwin) at last works correctly (coloring multibyte matches) 
from the win32 console (according to the LANG environment setting), but no 
matter how I've tried to push Emacs (set-language-environment, 
coding-system-for-(read|write), set-env in grep-setup-hook), it just 
doesn't work, because somewhere inside the Emacs win32 code it sticks to 
the Windows locale codepage and tries hard to convert to this 
codepage/encoding before it passes arguments to the shell. No wonder it 
fails when it comes to Unicode.

How to reproduce:

Create a utf-8 file with some non-ASCII characters (Cyrillic, Baltic, 
whatever) and try to grep for some utf-8 strings from Emacs (M-x grep).
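
The mismatch behind this reproduction can be illustrated outside Emacs. The following is a minimal Python sketch, not what Emacs itself does: it assumes the search pattern gets re-encoded in the Windows ANSI codepage (cp1257 is used here purely as an example) before it reaches grep, and shows that those bytes cannot match the bytes stored in a utf-8 file. The sample text and file name are hypothetical.

```python
import os
import tempfile

# Hypothetical sample data: "\u017e\u0105" is the Lithuanian letter pair used
# later in this thread; any non-ASCII text would do.
pattern = "\u017e\u0105"
text = "labas " + pattern + "sis\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)
    with open(path, "rb") as f:
        file_bytes = f.read()

# What a grep running in a UTF-8 locale needs to receive.
pattern_utf8 = pattern.encode("utf-8")
# What the pattern looks like after conversion to an ANSI codepage
# (assumption: cp1257 stands in for the Windows locale codepage).
pattern_ansi = pattern.encode("cp1257")

found_utf8 = pattern_utf8 in file_bytes
found_ansi = pattern_ansi in file_bytes
print(pattern_utf8.hex(), found_utf8)
print(pattern_ansi.hex(), found_ansi)
```

A byte-level search only succeeds when the pattern arrives in the same encoding as the file contents.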





* bug#6705: w32 cmdproxy.c pass args to cygwin; erroneous charset conversion (problem description, solution/suggestion)
@ 2010-07-22 12:31 Laimonas Vėbra
  2010-07-22 14:33 ` Jason Rumney
  0 siblings, 1 reply; 20+ messages in thread
From: Laimonas Vėbra @ 2010-07-22 12:31 UTC (permalink / raw)
  To: 6705

Below is a comment that I wrote for myself in cmdproxy.c (it 
explains the problem). I have half of a working solution/fix for it 
(portable enough, I suppose) using the MultiByteToWideChar API function, 
but I won't send a partly working patch unless one of the developers 
who agrees that this is a problem and intends to fix it asks for it.
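
To make the conversion step concrete, here is a minimal Python sketch of the idea only; it is not the patch referred to in this message. It models the difference between interpreting argument bytes as utf-8 and interpreting them in the ANSI codepage when producing utf-16 code units for a wide API call. The function names are illustrative assumptions, and Python's codecs stand in for the Windows API.

```python
from typing import List


def to_utf16_units(text: str) -> List[int]:
    """Return the UTF-16 code units of text as integers."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def decode_args(arg_bytes: bytes, codec: str) -> List[int]:
    """Interpret arg_bytes in the given codec and return UTF-16 code units."""
    return to_utf16_units(arg_bytes.decode(codec))


# The argument as utf-8 bytes (0xC5 0xBE 0xC4 0x85).
arg = "\u017e\u0105".encode("utf-8")

as_utf8 = decode_args(arg, "utf-8")    # bytes treated as utf-8
as_ansi = decode_args(arg, "cp1257")   # bytes treated as ANSI codepage text

print([hex(u) for u in as_utf8])
print([hex(u) for u in as_ansi])
```

Only the first interpretation yields the two code units of the original string; the second yields four unrelated characters.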

Besides, the patch itself is larger than 10 diff lines and it uses 
(duplicates by copying) some helper functions/declarations 
(open_input_file(), close_file_data(), rva_to_section(), 
w32_executable_type, RVA_TO_PTR) from unexw32.c, so it may need some 
code refactoring.

This problem certainly needs some discussion (how best to solve it) 
because it touches on Unicode communication issues. For those who won't 
bother reading the whole description, here is a simple question: 
how can/should one (cleanly) pass utf-8 arguments to an external 
(cygwin) app on Windows? I suppose it is not possible now.

Thank you for your attention.


> /* When calling a cygwin executable we need to explicitly convert utf-8
>    arguments (utf-8 is the encoding that Emacs uses internally and in
>    which it passes args to external commands when coding-system-for-write
>    is nil) to utf-16 and call the unicode (wide) API function
>    CreateProcessW. That needs to be done because of the following
>    transcoding chain, which might (and definitely WILL if the args contain
>    unicode, i.e. non ascii/locale_charset characters) result in corrupted args:
>
>    WINAPI/OS layer:
>    multibyte string args (utf-8) -> CreateProcessA():
>    locale_codepage -> unicode (utf-16)
>
>    ->
>
>    CYGWIN layer:
>    unicode (utf-16) <-> utf-8 ->
>    cygwin locale env (LC_XXX, LANG; default: C.UTF-8)
>
>
>    Example #1:
>    utf-8 string 'žą'; 'ž'(0xC5, 0xBE) 'ą'(0xC4, 0x85) transcoding
>    (to cygwin locale env charset) chain:
>
>    converting #1:
>    locale_codepage (lt, LCID: 1063, ansi/oem cp: cp1257/cp775) -> utf-16;
>
>    utf-8 string 'žą' in locale codepage (cp1257) representation: 'Å¾Ä…'
>    'Å'(0xC5), '¾'(0xBE), 'Ä'(0xC4), '…'(0x85).
>
>    string converted to utf-16: 'Å¾Ä…'
>    U+00C5(Å), U+00BE(¾), U+00C4(Ä), U+2026(…).
>
>    utf-16: 'Å¾Ä…': 'Å'(U+00C5), '¾'(U+00BE), 'Ä'(U+00C4), '…'(U+2026).
>    <->
>    utf-8 : 'Å¾Ä…': 'Å'(0xC385), '¾'(0xC2BE), 'Ä'(0xC384), '…'(0xE280A6).
>
>    converting #2:
>    utf-16/utf-8 -> cygwin locale env (LANG = lt_LT.cp1257);
>
>    utf-8 string 'Å¾Ä…' (0xC3, 0x85, 0xC2, 0xBE, 0xC3, 0x84, 0xE2, 0x80, 0xA6)
>    converted to cp1257: 'Å¾Ä…' (0xC5, 0xBE, 0xC4, 0x85)
>
>    the same cp1257 bytes read as utf-8: 'žą'; 'ž'(0xC5BE), 'ą'(0xC485)
>
>    Although the string should have been converted to cp1257 (according to
>    the cygwin locale env variables), the bytes that arrive are the original
>    utf-8 bytes. Read as cp1257 (as they should be), the original value 'žą'
>    is corrupted; in effect the args were passed through in utf-8 encoding.
>    It's important to note that such "original value preservation" happens
>    only under lucky circumstances: when we are converting to the windows
>    locale codepage/charset and the arg string (utf-8), read in the windows
>    locale codepage, doesn't contain an unconvertible character/combination
>    (e.g. undefined characters), so that it's possible to convert back (from
>    utf-16/utf-8 to the locale charset). Corruption _always_ occurs if we are
>    converting to a codepage/charset other than the current windows locale
>    codepage.
>
>    Consider unsuccessful/erroneous conversion example:
>    utf-8 string/character 'ĥ' (U+0125) passed to cygwin (utf-8):
>
>    utf-8 string 'ĥ'(0xC4A5) in locale codepage (cp1257) representation: 'Ä'
>    (0xA5('') is undefined in cp1257 and it doesn't map to unicode)
>
>    converting #1:
>    locale_codepage (lt, LCID: 1063, ansi/oem cp: cp1257/cp775) -> utf-16;
>
>    utf-8 string 'ĥ' in cp1257 representation: 'Ä'
>
>    string converted to utf-16: 'Ä' (0x00C4, 0xF8FD)
>    (http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1257.txt)
>    0xA5 (cp1257) is mapped to 0xF8FD in Unicode (Private Use Area Range: E000–F8FF)
>
>    utf-16: 'Ä': 'Ä'(U+00C4), ''(U+F8FD)
>    <->
>    utf-8 : 'Ä': 'Ä'(0xC384), ''(0xEFA3BD)
>
>    converting #2:
>    utf-16/utf-8 -> cygwin locale env (LANG = C.UTF-8);
>
>
>    utf-16 string 'Ä': 'Ä'(U+00C4), ''(U+F8FD)
>    converted to utf-8: 'Ä': 'Ä'(0xC384), ''(0xEFA3BD)
>
>    So the original string value 'ĥ' is transcoded to an invalid 'Ä', although
>    that shouldn't happen (no conversion is supposed to take place, either
>    implicitly or explicitly).
>
>
>    In conclusion: erroneous conversion _always_ occurs when we are converting
>    to a codepage/charset other than the current windows locale codepage, and
>    corruption might occur even if we are not supposed to convert at all
>    (just pass utf-8 encoded arguments).
>
>
> */
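
The two examples in the comment above can be replayed with a small Python model of the chain. This is a sketch under stated assumptions, not the Windows or Cygwin implementation: Python's `cp1257` codec stands in for the ANSI codepage, the single best-fit entry (0xA5 to U+F8FD) is taken from the table cited in the comment, and the function names are hypothetical.

```python
# Byte that the comment describes as undefined in cp1257, together with the
# private-use code point the cited best-fit table assigns to it (assumption
# taken from the comment, not from Python's codec).
BEST_FIT_CP1257 = {0xA5: "\uf8fd"}


def ansi_to_wide(data: bytes) -> str:
    """Model of step 1: bytes read in the ANSI codepage become Unicode."""
    chars = []
    for b in data:
        if b in BEST_FIT_CP1257:
            chars.append(BEST_FIT_CP1257[b])
        else:
            chars.append(bytes([b]).decode("cp1257"))
    return "".join(chars)


def launch_chain(arg_utf8: bytes, cygwin_charset: str) -> bytes:
    """Model of the whole chain: ANSI -> Unicode -> cygwin locale charset."""
    wide = ansi_to_wide(arg_utf8)
    return wide.encode(cygwin_charset)


za = "\u017e\u0105".encode("utf-8")   # 0xC5 0xBE 0xC4 0x85
h = "\u0125".encode("utf-8")          # 0xC4 0xA5

# Example #1: cygwin charset equals the windows codepage, bytes round-trip.
same_cp = launch_chain(za, "cp1257")
# Same argument, but cygwin runs in a UTF-8 locale: the bytes change.
other_cp = launch_chain(za, "utf-8")
# Example #2: a byte undefined in cp1257 ends up in the private use area.
undefined = launch_chain(h, "utf-8")

print(same_cp.hex(), other_cp.hex(), undefined.hex())
```

In this model the argument survives only when the two conversions happen to cancel out, which matches the conclusion of the comment.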







Thread overview: 20+ messages
2010-07-01  8:46 bug#6546: win32 grep/shell utf-8 encoding Laimonas Vėbra
2010-07-01 17:26 ` Eli Zaretskii
2010-07-01 18:05   ` Laimonas Vėbra
2010-07-22 12:50 ` Juanma Barranquero
2010-07-22 14:11   ` Laimonas Vėbra
2010-07-22 15:02     ` Juanma Barranquero
2010-07-22 18:24       ` Laimonas Vėbra
2010-07-22 19:53         ` Eli Zaretskii
2010-07-22 21:48           ` Laimonas Vėbra
2010-07-22 23:00             ` Juanma Barranquero
2010-07-23 10:24             ` Eli Zaretskii
2010-07-23 12:54               ` Laimonas Vėbra
2010-07-23 14:23                 ` Eli Zaretskii
2010-07-23 15:50                   ` Laimonas Vėbra
2010-07-23 18:09                     ` Eli Zaretskii
2010-07-23 19:07                       ` Laimonas Vėbra
2022-04-24 12:01 ` bug#6546: bug#6705: w32 cmdproxy.c pass args to cygwin; erroneous charset conversion (problem description, solution/suggestion) Lars Ingebrigtsen
2022-04-24 12:31   ` Eli Zaretskii
2022-04-24 13:25     ` Lars Ingebrigtsen
  -- strict thread matches above, loose matches on Subject: below --
2010-07-22 12:31 Laimonas Vėbra
2010-07-22 14:33 ` Jason Rumney
2010-07-22 18:14   ` bug#6546: " Laimonas Vėbra
