From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: "args-out-of-range" error when using data from external process on Windows
Date: Thu, 18 Apr 2024 11:35:21 +0300
Message-ID: <86il0ff4qe.fsf@gnu.org>
References: <87bk671b7l.fsf@gmail.com> <86msprfbul.fsf@gnu.org> <87y19bywr6.fsf@gmail.com>
In-Reply-To: <87y19bywr6.fsf@gmail.com> (message from Alexis on Thu, 18 Apr 2024 17:07:25 +1000)
To: Alexis
Cc: emacs-devel@gnu.org

> From: Alexis
> Cc: emacs-devel@gnu.org
> Date: Thu, 18 Apr 2024 17:07:25 +1000
>
> https://github.com/flexibeast/ebuku/issues/32
>
> as there is already an extended discussion there about this issue,
> which itself links to a previous issue and discussion:
>
> https://github.com/flexibeast/ebuku/issues/31
>
> in which the user first reported an "Invalid string for collation"
> issue. That issue was addressed, after some discussion, by setting
> LC_ALL to the same value that the user had set LANG,
> i.e. "zh_CN.UTF-8". That left us with issue 32, which is the one
> i'm asking about here.

I don't think I understand the setting of LC_ALL part.

First, AFAIK Windows programs generally ignore LC_* environment
variables. If you read the Microsoft documentation of 'setlocale',
here:

  https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale?view=msvc-170

you will not see any reference to environment variables there. The
Windows 'setlocale' supports only LC_* _categories_ in direct calls to
the function, and doesn't consider the corresponding environment
variables. The Emacs source code doesn't reference LC_* environment
variables on MS-Windows, either. So how did the user set LC_ALL, and
why did it have any effect whatsoever on the issue?

Second, the user sets a UTF-8 locale, which as I wrote up-thread is
not a good idea on MS-Windows. It could well cause failures in
invoking external programs from Emacs, if the arguments to those
programs include non-ASCII characters. In general, on MS-Windows
Emacs can only safely invoke programs with non-ASCII characters in the
command-line arguments if those characters can be encoded by the
system codepage, in this case codepage-936 AFAIU.

Regarding the "invalid string for collation: Invalid argument" error:
how does ebuku determine the LOCALE argument with which it calls
string-collate-lessp? It is important to understand what was the
locale with which w32_compare_strings was called in that case.

Finally, the issues with Windows-style file names with drive letters
and with file names that begin with "~" lead me to believe that
perhaps the underlying program 'buku' is not a native Windows program,
but a Cygwin or MSYS program, in which case there could be
incompatibilities both regarding file names and regarding handling of
non-ASCII characters (Cygwin and MSYS use UTF-8 by default, whereas
the native Windows build of Emacs does not).

> `buku` provides a command-line interface to an SQLite-based
> database of Web bookmarks, allowing one to save, delete and search
> for bookmarks, with each bookmark able to have a comment and tags
> associated with it.
>
> `Ebuku` is a package that provides an Emacs-based UI for buku. It
> allows the user to add bookmarks, edit them, remove them, search
> them etc. without actually leaving Emacs. It does so by running
> `call-process` to call `buku` with the appropriate options,
> receiving the resulting output in a buffer, then processing the
> data in that buffer in order to present the user with the relevant
> results.
>
> ebuku.el has a function:
>
> (defun ebuku--call-buku (args)
>   "Internal function for calling `buku' with list ARGS."
>   (unless ebuku-buku-path
>     (error "Couldn't find buku: check 'ebuku-buku-path'"))
>   (apply #'call-process
>          `(,ebuku-buku-path nil t nil
>            "--np" "--nc" "--db"
>            ,ebuku-database-path ,@args)))
>
> which gets called in several places - e.g.
> https://github.com/flexibeast/ebuku/blob/c854d128cba8576fe9693c19109b5deafb573e99/ebuku.el#L534
> - to put the contents inside a temp buffer, which is then 'parsed'
> for the information to be presented to the user.

You need to take a good look at whether non-ASCII characters are
passed to 'buku' in this case, and how the output from 'buku' is
decoded. Also, ebuku-buku-path and ebuku-database-path should both be
quoted with shell-quote-argument (but I don't think this is a problem
in this case). Can ARGS include whitespace or characters special for
the Windows shell? If so, each argument should be quoted with
shell-quote-argument as well.

How output is decoded when it is put into the temporary buffer is also
of interest -- what is the value of buffer-file-coding-system in the
temporary buffer after reading output, in the OP's case?

> In a comment from a couple of days ago, and after having noted in
> a comment on issue 31:
>
> https://github.com/flexibeast/ebuku/issues/31#issuecomment-2053557703
>
> that they'd set LANG on their system to "zh_CN.UTF-8", the user
> wrote
> (https://github.com/flexibeast/ebuku/issues/32#issuecomment-2058289816):
>
> > I set the value with (set-language-environment "UTF-8"). I
> > remember I set up this value because I don't want my files
> > containing Chinese to be encoded by GBK encoding.

This is not a good idea, as I mentioned before. Emacs on MS-Windows
cannot use UTF-8 when encoding command-line arguments for
sub-programs, it can only use the system codepage. Using
set-language-environment as above will force Emacs to encode
command-line arguments in UTF-8, which could very well be the reason
for some of these problems.

> Then, in
> https://github.com/flexibeast/ebuku/issues/32#issuecomment-2058498373,
> i wrote:
>
> > if i remember correctly, the default encoding used by Windows is
> > UTF-16, not UTF-8. So i'm wondering if that's somehow being used
> > to transfer data from the buku process to the Emacs process,
> > regardless of the value of LANG and LC_ALL, and regardless of
> > the encoding of the buku database itself?
>
> to which the user responded:
>
> > I think the Powershell will use UTF-16 to encode instead of
> > UTF-8.
>
> Is that correct?

No. The issue is complicated by several factors and will take a long
post to explain. The upshot is that for passing non-ASCII characters
safely to subprograms on their command lines, Emacs should use the
system codepage, not UTF-8 or anything else (and definitely not
UTF-16). This might require some tricky juggling with coding-system
related settings when you call call-process, because
coding-system-for-write is used for encoding both the command-line
arguments and the stuff we send to the sub-program, so if they both
can include non-ASCII characters, some care is in order. (By
contrast, coding-system-for-read can always be bound to UTF-8 to
decode the output correctly -- assuming 'buku' outputs UTF-8 encoded
text on MS-Windows.)

> Is that the case despite the user having
> specified "zh_CN.UTF-8"? But if that's the case, why does removing
> the CRAB emoji from text being operated on by string-match /
> match-string make the issue disappear? Is it perhaps something to
> do with
> the code point for the CRAB emoji being outside the BMP?

The more important question is: can the CRAB emoji be safely encoded
by codepage 936, the system codepage of the OP? If not, and if that
emoji can appear in the command-line arguments of a 'buku' invocation
(as opposed to in the text we write to or read from 'buku'), then this
character cannot be used at all with this package on MS-Windows.

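One way to answer that question before invoking 'buku' is to test
whether each argument can be encoded in the system codepage. The
following is only a sketch under stated assumptions:
`my-args-encodable-p' is a hypothetical helper (it is not part of
ebuku or Emacs), and `chinese-gbk' is used as the Emacs coding system
corresponding to codepage 936. On a real system,
locale-coding-system may be a closer stand-in for the system codepage,
but that is an assumption as well.

```elisp
(require 'seq)

;; Hypothetical helper for illustration; not part of ebuku or Emacs.
(defun my-args-encodable-p (args coding)
  "Return non-nil if every string in ARGS can be encoded by CODING."
  (seq-every-p
   (lambda (arg)
     ;; `unencodable-char-position' returns nil when the whole string
     ;; can be encoded, else the index of the first offending character.
     (null (unencodable-char-position 0 (length arg) coding nil arg)))
   args))

;; Two CJK characters that GBK covers.
(my-args-encodable-p '("--np" "\u4e2d\u6587") 'chinese-gbk)

;; U+1F980 (CRAB), which GBK is not expected to cover.
(my-args-encodable-p '("--np" "\U0001F980") 'chinese-gbk)
```

A caller could run such a check on ARGS and signal a user-visible
error instead of passing arguments that the codepage cannot represent.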
(And please note that Emacs now has a native SQLite support, which
should make many of these complications simply disappear.)

As for why the problems disappear when the CRAB emoji is removed: as I
wrote elsewhere, that's probably because all the other characters are
plain ASCII, so all the encoding-related issues don't matter.

> > Suggest that you ask the user who reported that to show the
> > actual output of the sub-process (e.g., by running the same
> > command outside of Emacs and redirecting output to a file), and
> > if the output looks correct, examine the Lisp code which
> > processes that output, with an eye on how the text is decoded.
> > For example, if the text from the sub-process is supposed to be
> > UTF-8 encoded, your Lisp code should bind coding-system-for-read
> > to 'utf-8', to make sure it is decoded correctly.
>
> Thanks, i can certainly do that, modulo the issue of whether the
> LANG and LC_ALL variables have any effect on data transferred between
> the `buku` sub-process and Emacs.

They don't have any effect on Emacs on MS-Windows, that's for sure.
Whether they have effect on 'buku' depends on whether it's a native
MS-Windows program or Cygwin/MSYS program, and also on its code (a
program could potentially augment the MS 'setlocale' function with its
own code which looks at the LC_* environment variables, and does TRT
in the application code).

> But what should i do to handle the more general case of an arbitrary
> encoding? Do i need to have a defcustom, with 'reasonable defaults',
> that the user can set if necessary, which i use as the value to pass
> to coding-system-for-read?

That depends on what encoding 'buku' expects on input and what
encoding it uses on output. If it always uses UTF-8, you just need to
make sure Emacs uses UTF-8 when encoding and decoding text passed to
and from 'buku' (but note the caveat about encoding the command-line
arguments -- these _must_ be encoded in the system codepage).
If, OTOH, the encoding used by 'buku' can be changed dynamically, and
Emacs cannot know what it is (for example, if it is determined by the
encoding of the text put in the SQL database by the user), then a user
option is in order.

> > Btw: using UTF-8 by default on MS-Windows is not a very good
> > idea, even with Windows 11 where one can enable UTF-8 support
> > (did they do it, btw?). Windows still doesn't support UTF-8
> > well, even after the improvements in Windows 11, so the above
> > settings might very well cause trouble. Suggest to ask the user
> > to try the same recipe in "emacs -Q", and if the zh_CN.UTF-8
> > stuff is set up outside Emacs, to try without it.
>
> As i interpret their comments in the above discussions so far,
> yes, they had themselves set LANG to "zh_CN.UTF-8" (and yes, as
> described above, had definitely `set-language-environment` as
> "UTF-8".

NOT RECOMMENDED!
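
To make the user-option suggestion above concrete, here is a minimal
sketch of how decoding of the 'buku' output could be made
configurable. All names (`my-buku' group,
`my-buku-output-coding-system', `my-buku-call') are hypothetical and
do not exist in ebuku; the UTF-8 default is itself an assumption about
what 'buku' emits.

```elisp
;; Hypothetical names throughout; ebuku does not define these.
(defgroup my-buku nil
  "Sketch of options for talking to an external buku process."
  :group 'external)

(defcustom my-buku-output-coding-system 'utf-8
  "Coding system used to decode output from the buku process.
The default assumes buku writes UTF-8, which may not hold everywhere."
  :type 'coding-system
  :group 'my-buku)

(defun my-buku-call (program &rest args)
  "Run PROGRAM with ARGS, inserting its decoded output at point.
Return the exit status from `call-process'."
  ;; Only the read side is bound.  `coding-system-for-write' is left
  ;; alone because, per the discussion above, it also governs how the
  ;; command-line arguments are encoded.
  (let ((coding-system-for-read my-buku-output-coding-system))
    (apply #'call-process program nil t nil args)))
```

Such a function would typically be called inside with-temp-buffer, so
that the decoded output can be parsed and then discarded.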