From mboxrd@z Thu Jan 1 00:00:00 1970
From: Alexis
Newsgroups: gmane.emacs.devel
Subject: Re: "args-out-of-range" error when using data from external process on Windows
Date: Thu, 18 Apr 2024 21:20:55 +1000
Message-ID: <87mspqzzl4.fsf@gmail.com>
References: <87bk671b7l.fsf@gmail.com> <86msprfbul.fsf@gnu.org> <87y19bywr6.fsf@gmail.com> <86il0ff4qe.fsf@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
User-Agent: mu4e 1.12.4; emacs 29.3
Cc: emacs-devel
To: Eli Zaretskii
In-Reply-To:
<86il0ff4qe.fsf@gnu.org> (Eli Zaretskii's message of "Thu, 18 Apr 2024 11:35:21 +0300")
List-Id: "Emacs development discussions."
Xref: news.gmane.io gmane.emacs.devel:317812

Thanks again for your assistance!

As some additional context: i haven't actively used a Windows
system in more than a decade - it was Windows 7 - and even then, i
was running it in a VM in order to run some other software. i've
also never used Windows outside of an "Australian English"
context, and have never done any dev work on the Windows
platform. So i've got only a minimal idea of how Windows does
various things nowadays, and have never needed to become familiar
with sysadmin-/dev-level Windows documentation. Until now. :-)

Specific responses inline below.

> I don't think I understand the setting of LC_ALL part. First,
> AFAIK Windows programs generally ignore LC_* environment
> variables. If you read the Microsoft documentation of
> 'setlocale', here:
>
> https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale?view=msvc-170
>
> you will not see any reference to environment variables there.

Thanks for this link; it gives me a good starting point to explore
the Win docs on this issue.

> The Windows 'setlocale' supports only LC_* _categories_ in
> direct calls to the function, and doesn't consider the
> corresponding environment variables. The Emacs source code
> doesn't reference LC_* environment variables on MS-Windows,
> either. So how did the user set LC_ALL, and why did it have any
> effect whatsoever on the issue?

They didn't say; all they wrote
(https://github.com/flexibeast/ebuku/issues/31#issuecomment-2058171986)
was:

> I ... changed my LC_ALL to zh_CN.UTF-8. Ebuku can find the db
> now.

i'll ask them.

> Second, the user sets a UTF-8 locale, which as I wrote up-thread
> is not a good idea on MS-Windows. It could well cause failures
> in invoking external programs from Emacs, if the arguments to
> those programs include non-ASCII characters. In general, on
> MS-Windows Emacs can only safely invoke programs with non-ASCII
> characters in the command-line arguments if those characters can
> be encoded by the system codepage, in this case codepage-936
> AFAIU.

Thanks, i'll add that to the information i pass back to the user
on that GitHub issue.

> Regarding the "invalid string for collation: Invalid argument"
> error: how does ebuku determine the LOCALE argument with which
> it calls string-collate-lessp? It is important to understand
> what was the locale with which w32_compare_strings was called in
> that case.

The single use of `string-collate-lessp` doesn't pass any LOCALE
argument, as i just wanted it to use the user's current locale for
sorting a given bookmark's tags into the appropriate
lexicographical order.
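
For reference, the codepage constraint quoted above (arguments are
only safe if they can be encoded by the system codepage) can be
checked mechanically. The following is a minimal Python sketch,
Python only because `buku` itself is Python-based; it is not how
Emacs or ebuku does anything. It assumes that Python's `cp936` codec
is a reasonable stand-in for the Windows system codepage 936, and the
helper name `encodable_in_codepage` is hypothetical:

```python
def encodable_in_codepage(text, codepage="cp936"):
    """Return True if every character of TEXT can be encoded in CODEPAGE.

    Assumption: Python's codec of the same name approximates what the
    Windows system codepage accepts; this is an illustration only.
    """
    try:
        text.encode(codepage)
    except UnicodeEncodeError:
        return False
    return True


# Hypothetical command-line arguments, chosen purely for illustration:
# plain ASCII, two common simplified-Chinese characters, and the CRAB emoji.
for arg in ["emacs", "\u4e66\u7b7e", "\N{CRAB}"]:
    print(ascii(arg), encodable_in_codepage(arg))
```

Under that assumption, any argument for which the helper returns
False could not be passed safely on a command line on such a system,
whatever locale-related settings are in place.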

> Finally, the issues with Windows-style file names with drive
> letters and with file names that begin with "~" lead me to
> believe that perhaps the underlying program 'buku' is not a
> native Windows program, but a Cygwin or MSYS program, in which
> case there could be incompatibilities both regarding file names
> and regarding handling of non-ASCII characters (Cygwin and MSYS
> use UTF-8 by default, whereas the native Windows build of Emacs
> does not).

Sorry; i mentioned in my first email, but didn't reiterate in my
second, that `buku` is Python-based.

> You need to take a good look at whether non-ASCII characters are
> passed to 'buku' in this case, and how the output from 'buku' is
> decoded.

👍

> Also, ebuku-buku-path and ebuku-database-path should both be
> quoted with shell-quote-argument (but I don't think this is a
> problem in this case). Can ARGS include whitespace or characters
> special for the Windows shell? if so, each argument should be
> quoted with shell-quote-argument as well.

Thanks, noted.

> How output is decoded when it is put into the temporary buffer
> is also of interest -- what is the value of
> buffer-file-coding-system in the temporary buffer after reading
> output, in the OP's case?

*nod*

> Emacs on MS-Windows
> cannot use UTF-8 when encoding command-line arguments for
> sub-programs, it can only use the system codepage. Using
> set-language-environment as above will force Emacs to encode
> command-line arguments in UTF-8, which could very well be the
> reason for some of these problems.

Ah okay.

> No.
>
> The issue is complicated by several factors and will take a long
> post to explain. The upshot is that for passing non-ASCII
> characters safely to subprograms on their command lines, Emacs
> should use the system codepage, not UTF-8 or anything else (and
> definitely not UTF-16).
> This might require some tricky juggling
> with coding-system related settings when you call call-process,
> because coding-system-for-write is used for both encoding of the
> command-line arguments and of the stuff we send to the
> sub-program, so if they both can include non-ASCII characters,
> some care is in order. (By contrast, coding-system-for-read can
> be always bound to UTF-8 to decode the output correctly --
> assuming 'buku' outputs UTF-8 encoded text on MS-Windows.)

That's very helpful, thank you.

> The more important question is: can CRAB emoji be safely encoded
> by codepage 936, the system codepage of the OP? If not, and if
> that emoji can appear in the command-line arguments of a 'buku'
> invocation (as opposed to in the text we write to or read from
> 'buku'), then this character cannot be used at all with this
> package on MS-Windows.
>
> (And please note that Emacs now has a native SQLite support,
> which should make many of these complications simply disappear.)

It would certainly make many things easier to just interact with
the db directly. That said, doing so would involve a substantial
rewrite, and i've got many things on my plate nowadays, including
supporting disabled loved ones while having chronic health issues
myself. But maybe i can open an issue requesting help to start and
develop a branch doing such a rewrite.

> As for why the problems disappear when the CRAB emoji is
> removed: as I wrote elsewhere, that's probably because all the
> other characters are plain ASCII, so all the encoding-related
> issues don't matter.

*nod*

> They don't have any effect on Emacs on MS-Windows, that's for
> sure.
> Whether they have effect on 'buku' depends on whether
> it's a native MS-Windows program or Cygwin/MSYS program, and
> also on its code (a program could potentially augment the MS
> 'setlocale' function with its own code which looks at the LC_*
> environment variables, and does TRT in the application code).

*nod*

>> But what should i do to handle the more general case of an
>> arbitrary encoding? Do i need to have a defcustom, with
>> 'reasonable defaults', that the user can set if necessary,
>> which i use as the value to pass to coding-system-for-read?
>
> That depends on what encoding does 'buku' expect on input and
> what encoding does it use on output. If it always uses UTF-8,
> you just need to make sure Emacs uses UTF-8 when encoding and
> decoding text passed to and from 'buku' (but note the caveat
> about encoding the command-line arguments -- these _must_ be
> encoded in the system codepage). If, OTOH, the encoding used by
> 'buku' can be changed dynamically, and Emacs cannot know what it
> is (for example, if it is determined by the encoding of the text
> put in the SQL database by the user), then a user option is in
> order.

Great, thank you.

>> As i interpret their comments in the above discussions so far,
>> yes, they had themselves set LANG to "zh_CN.UTF-8" (and yes, as
>> described above, had definitely `set-language-environment` as
>> "UTF-8".
>
> NOT RECOMMENDED!

*chuckle* i'll be sure to pass this on. :-)

Thanks again!


Alexis.
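
P.S. The reading side of the quoted advice (decode the program's
output as UTF-8, with a user option as a fallback if the encoding can
vary) might look roughly like this. It is a Python sketch rather than
Emacs Lisp, the function name `read_program_output` and its
`encoding` parameter are hypothetical stand-ins for such a user
option, and it assumes the program writes UTF-8 by default, which has
not been confirmed for `buku` on MS-Windows:

```python
import subprocess
import sys


def read_program_output(argv, encoding="utf-8"):
    """Run ARGV and decode its standard output using ENCODING.

    ENCODING plays the role of a user option: UTF-8 by default
    (an assumption), overridable when the program emits something else.
    """
    result = subprocess.run(argv, stdout=subprocess.PIPE, check=True)
    return result.stdout.decode(encoding)


# Stand-in child process that writes UTF-8 bytes; the argument list itself
# is pure ASCII, so the command-line encoding question does not arise here.
child = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write('crab: \\U0001F980'.encode('utf-8'))",
]
print(ascii(read_program_output(child)))
```

Capturing raw bytes and decoding them explicitly keeps the result
independent of whatever the current locale happens to be; it says
nothing about how the command-line arguments themselves get encoded,
which is the separate caveat discussed above.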