From mboxrd@z Thu Jan 1 00:00:00 1970
From: Laimonas =?UTF-8?Q?V=C4=97bra?=
Newsgroups: gmane.emacs.bugs
Subject: bug#6705: w32 cmdproxy.c pass args to cygwin; erroneous charset conversion (problem description, solution/suggestion)
Date: Thu, 22 Jul 2010 15:31:44 +0300
Message-ID: <4C483A30.9010804@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
To: 6705@debbugs.gnu.org
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.11) Gecko/20100701 SeaMonkey/2.0.6

Below is a comment that I wrote for myself in cmdproxy.c (it
explains the problem). I have a half-working fix for it (portable
enough, I suppose) that uses the MultiByteToWideChar API function, but I
won't send a partly working patch unless a developer who agrees that
this is a problem and intends to fix it asks for it. Besides, the patch
is larger than 10 diff lines and duplicates (by copying) some helper
functions/declarations from unexw32.c (open_input_file(),
close_file_data(), rva_to_section(), w32_executable_type, RVA_TO_PTR),
so it may need some code refactoring. The problem certainly needs
discussion of how best to solve it, because it concerns how unicode is
communicated between processes.

For those who won't read the whole description, here is the simple
question: how can/should one cleanly pass utf-8 arguments to an external
(cygwin) app on windows? I suppose that right now it's not possible.

Thank you for your attention.

> /* When calling a cygwin executable we need to explicitly convert utf-8
> arguments (utf-8 is the encoding that Emacs uses internally and in which
> it passes args to external commands when coding-system-for-write is nil)
> to utf-16 and call the unicode (wide) API function CreateProcessW().
> That needs to be done because of the following transcoding chain, which
> might (and definitely WILL, if the args contain unicode, i.e. characters
> outside ascii/locale_charset) result in corrupted args:
>
> WINAPI/OS layer:
> multibyte string args (utf-8) -> CreateProcessA():
> locale_codepage -> unicode (utf-16)
>
> ->
>
> CYGWIN layer:
> unicode (utf-16) <-> utf-8 ->
> cygwin locale env (LC_XXX, LANG; default: C.UTF-8)
>
>
> Example #1:
> utf-8 string 'žą'; 'ž'(0xC5, 0xBE) 'ą'(0xC4, 0x85) transcoding
> (to cygwin locale env charset) chain:
>
> converting #1:
> locale_codepage (lt, LCID: 1063, ansi/oem cp: cp1257/cp775) -> utf-16;
>
> utf-8 string 'žą' in locale codepage (cp1257) representation: 'Å¾Ä…'
> 'Å'(0xC5), '¾'(0xBE), 'Ä'(0xC4), '…'(0x85).
>
> string converted to utf-16: 'Å¾Ä…'
> U+00C5(Å), U+00BE(¾), U+00C4(Ä), U+2026(…).
>
> utf-16: 'Å¾Ä…': 'Å'(U+00C5), '¾'(U+00BE), 'Ä'(U+00C4), '…'(U+2026).
> <->
> utf-8 : 'Å¾Ä…': 'Å'(0xC385), '¾'(0xC2BE), 'Ä'(0xC384), '…'(0xE280A6).
>
> converting #2:
> utf-16/utf-8 -> cygwin locale env (LANG = lt_LT.cp1257);
>
> utf-8 string 'Å¾Ä…' (0xC3, 0x85, 0xC2, 0xBE, 0xC3, 0x84, 0xE2, 0x80, 0xA6)
> converted to cp1257: 'Å¾Ä…' (0xC5, 0xBE, 0xC4, 0x85)
>
> cp1257 string 'Å¾Ä…' in utf-8 representation: 'žą'; 'ž'(0xC5BE), 'ą'(0xC485)
>
> Although the string should have been converted to cp1257 (according to
> the cygwin locale env variables), its original value ('žą') is corrupted
> when read as cp1257, as it should be read: the passed args were in fact
> preserved in utf-8 encoding.
> It's important to note that such "original value preservation" is only
> a lucky coincidence. It happens when we are converting to the windows
> locale codepage/charset and the arg string (utf-8), read in the windows
> locale codepage, contains no unconvertible character/combination
> (e.g. undefined characters), so that it can be converted back (from
> utf-16/utf-8 to the locale charset). Corruption _always_ occurs if we are
> converting to a codepage/charset other than the current windows locale
> codepage.
>
> Consider an unsuccessful/erroneous conversion example:
> utf-8 string/character 'ĥ' (U+0125) passed to cygwin (utf-8):
>
> utf-8 string 'ĥ'(0xC4A5) in locale codepage (cp1257) representation: 'Ä'
> followed by byte 0xA5 (0xA5 is undefined in cp1257 and doesn't map to
> unicode)
>
> converting #1:
> locale_codepage (lt, LCID: 1063, ansi/oem cp: cp1257/cp775) -> utf-16;
>
> utf-8 string 'ĥ' in cp1257 representation: 'Ä<0xA5>'
>
> string converted to utf-16: 'Ä<U+F8FD>' (0x00C4, 0xF8FD)
> (http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1257.txt)
> 0xA5 (cp1257) is mapped to U+F8FD in Unicode (Private Use Area Range: E000–F8FF)
>
> utf-16: 'Ä<U+F8FD>': 'Ä'(U+00C4), '<U+F8FD>'(U+F8FD)
> <->
> utf-8 : 'Ä<U+F8FD>': 'Ä'(0xC384), '<U+F8FD>'(0xEFA3BD)
>
> converting #2:
> utf-16/utf-8 -> cygwin locale env (LANG = C.UTF-8);
>
>
> utf-16 string 'Ä<U+F8FD>': 'Ä'(U+00C4), '<U+F8FD>'(U+F8FD)
> converted to utf-8: 'Ä<U+F8FD>': 'Ä'(0xC384), '<U+F8FD>'(0xEFA3BD)
>
> So the original string value 'ĥ' is transcoded to an invalid 'Ä<U+F8FD>',
> although that shouldn't happen (no conversion, implicit or explicit, is
> supposed to take place).
>
>
> To conclude: erroneous conversion _always_ occurs when we are converting
> to a codepage/charset other than the current windows locale codepage, and
> corruption may occur even when we are not supposed to convert at all
> (just pass utf-8 encoded arguments).
>
>
> */
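
The two examples in the comment can be reproduced without Windows using a small portable C sketch. This is not the cmdproxy.c patch and it calls no Windows API. cp1257_to_unicode() is a hypothetical hand-written table that covers only the bytes used in the examples (0xA5 is mapped to U+F8FD, as in the bestfit1257.txt mapping quoted above). simulate_ansi_roundtrip() is an assumed stand-in for the effect of CreateProcessA followed by a cygwin process running in a UTF-8 locale.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical partial cp1257 -> Unicode table: only the bytes that occur
   in the examples.  0xA5 is undefined in cp1257; the Windows best-fit
   table maps it to the private-use code point U+F8FD. */
static uint32_t cp1257_to_unicode(unsigned char b)
{
    if (b < 0x80)
        return b;                 /* ASCII maps to itself */
    switch (b) {
    case 0x85: return 0x2026;     /* horizontal ellipsis */
    case 0xA5: return 0xF8FD;     /* undefined byte -> private use area */
    case 0xBE: return 0x00BE;     /* three quarters sign */
    case 0xC4: return 0x00C4;     /* A with diaeresis */
    case 0xC5: return 0x00C5;     /* A with ring above */
    default:   return 0xFFFD;     /* byte not covered by this sketch */
    }
}

/* Encode one BMP code point as UTF-8; returns the number of bytes written. */
static size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Simulate the chain: utf-8 argument bytes are misread as cp1257 (the
   CreateProcessA step), then re-encoded as utf-8 (the cygwin step).
   'out' must have room for 3 * n bytes.  Returns the output length. */
static size_t simulate_ansi_roundtrip(const unsigned char *in, size_t n,
                                      unsigned char *out)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++)
        len += utf8_encode(cp1257_to_unicode(in[i]), out + len);
    return len;
}
```

Given the utf-8 bytes of 'žą' (0xC5 0xBE 0xC4 0x85), the function produces the nine bytes 0xC3 0x85 0xC2 0xBE 0xC3 0x84 0xE2 0x80 0xA6. Given the bytes of 'ĥ' (0xC4 0xA5), it produces 0xC3 0x84 0xEF 0xA3 0xBD. Both results match the chains described in the comment.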