From: Alexis <flexibeast@gmail.com>
To: Eli Zaretskii <eliz@gnu.org>
Cc: emacs-devel <emacs-devel@gnu.org>
Subject: Re: "args-out-of-range" error when using data from external process on Windows
Date: Thu, 18 Apr 2024 21:20:55 +1000
Message-ID: <87mspqzzl4.fsf@gmail.com>
In-Reply-To: <86il0ff4qe.fsf@gnu.org> (Eli Zaretskii's message of "Thu, 18 Apr 2024 11:35:21 +0300")


Thanks again for your assistance!

As some additional context: i haven't actively used a Windows 
system in more than a decade - it was Windows 7 - and even then, i 
was running it in a VM in order to run some other software. i've 
also never used Windows outside of an "Australian English" 
context, and have never done any dev work on the Windows 
platform. So i've got only a minimal idea of how Windows does 
various things nowadays, and have never needed to become familiar 
with sysadmin-/dev-level Windows documentation. Until now. :-)

Specific responses inline below.

> I don't think I understand the setting of LC_ALL part.  First, 
> AFAIK Windows programs generally ignore LC_* environment 
> variables.  If you read the Microsoft documentation of 
> 'setlocale', here:
>
>   https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale?view=msvc-170
>
> you will not see any reference to environment variables there.

Thanks for this link; it gives me a good starting point to explore 
the Win docs on this issue.

> The Windows 'setlocale' supports only LC_* _categories_ in 
> direct calls to the function, and doesn't consider the 
> corresponding environment variables.  The Emacs source code 
> doesn't reference LC_* environment variables on MS-Windows, 
> either.  So how did the user set LC_ALL, and why did it have any 
> effect whatsoever on the issue?

They didn't say; all they wrote 
(https://github.com/flexibeast/ebuku/issues/31#issuecomment-2058171986) 
was:

> I ... changed my LC_ALL to zh_CN.UTF-8. Ebuku can find the db 
> now.

i'll ask them.

> Second, the user sets a UTF-8 locale, which as I wrote up-thread 
> is not a good idea on MS-Windows.  It could well cause failures 
> in invoking external programs from Emacs, if the arguments to 
> those programs include non-ASCII characters.  In general, on 
> MS-Windows Emacs can only safely invoke programs with non-ASCII 
> characters in the command-line arguments if those characters can 
> be encoded by the system codepage, in this case codepage-936 
> AFAIU.

Thanks, i'll add that to the information i pass back to the user 
on that GitHub issue.

> Regarding the "invalid string for collation: Invalid argument" 
> error: how does ebuku determine the LOCALE argument with which 
> it calls string-collate-lessp?  It is important to understand 
> what was the locale with which w32_compare_strings was called in 
> that case.

The single use of `string-collate-lessp` doesn't pass any LOCALE 
argument, as i just wanted it to use the user's current locale for 
sorting a given bookmark's tags into the appropriate 
lexicographical order.

> Finally, the issues with Windows-style file names with drive 
> letters and with file names that begin with "~" lead me to 
> believe that perhaps the underlying program 'buku' is not a 
> native Windows program, but a Cygwin or MSYS program, in which 
> case there could be incompatibilities both regarding file names 
> and regarding handling of non-ASCII characters (Cygwin and MSYS 
> use UTF-8 by default, whereas the native Windows build of Emacs 
> does not).

Sorry; i mentioned in my first email, but didn't reiterate in my 
second, that `buku` is Python-based.

> You need to take a good look at whether non-ASCII characters are 
> passed to 'buku' in this case, and how the output from 'buku' is 
> decoded.

👍

> Also, ebuku-buku-path and ebuku-database-path should both be 
> quoted with shell-quote-argument (but I don't think this is a 
> problem in this case). Can ARGS include whitespace or characters 
> special for the Windows shell? If so, each argument should be 
> quoted with shell-quote-argument as well.

Thanks, noted.

> How output is decoded when it is put into the temporary buffer 
> is also of interest -- what is the value of 
> buffer-file-coding-system in the temporary buffer after reading 
> output, in the OP's case?

*nod*

> Emacs on MS-Windows 
> cannot use UTF-8 when encoding command-line arguments for 
> sub-programs, it can only use the system codepage.  Using 
> set-language-environment as above will force Emacs to encode 
> command-line arguments in UTF-8, which could very well be the 
> reason for some of these problems.

Ah okay.

> No.
>
> The issue is complicated by several factors and will take a long 
> post to explain.  The upshot is that for passing non-ASCII 
> characters safely to subprograms on their command lines, Emacs 
> should use the system codepage, not UTF-8 or anything else (and 
> definitely not UTF-16).  This might require some tricky juggling 
> with coding-system related settings when you call call-process, 
> because coding-system-for-write is used for both encoding of the 
> command-line arguments and of the stuff we send to the 
> sub-program, so if they both can include non-ASCII characters, 
> some care is in order.  (By contrast, coding-system-for-read can 
> be always bound to UTF-8 to decode the output correctly -- 
> assuming 'buku' outputs UTF-8 encoded text on MS-Windows.)

That's very helpful, thank you.

> The more important question is: can CRAB emoji be safely encoded 
> by codepage 936, the system codepage of the OP?  If not, and if 
> that emoji can appear in the command-line arguments of a 'buku' 
> invocation (as opposed to in the text we write to or read from 
> 'buku'), then this character cannot be used at all with this 
> package on MS-Windows.
>
> (And please note that Emacs now has native SQLite support, 
> which should make many of these complications simply disappear.)

It would certainly make many things easier to just interact with 
the db directly. That said, doing so would involve a substantial 
rewrite, and i've got many things on my plate nowadays, including 
supporting disabled loved ones while having chronic health issues 
myself. But maybe i can open an issue requesting help to start and 
develop a branch doing such a rewrite. 

> As for why the problems disappear when the CRAB emoji is 
> removed: as I wrote elsewhere, that's probably because all the 
> other characters are plain ASCII, so all the encoding-related 
> issues don't matter.

*nod*
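
As a quick check of the encodability question above, here is a minimal sketch in Python (chosen only because `buku` is Python-based; the question itself is language-independent). It assumes Python's `cp936` codec is a close enough stand-in for the Windows system codepage 936. The helper name `encodable` and the sample strings are made up for illustration.

```python
def encodable(text: str, codec: str) -> bool:
    """Return True if every character of TEXT can be encoded with CODEC."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


CRAB = "\N{CRAB}"  # U+1F980, the emoji discussed in this thread

# Plain ASCII and common Chinese characters are representable in cp936.
print(encodable("emacs", "cp936"))   # True
print(encodable("书签", "cp936"))     # True

# The emoji lies outside the codepage's repertoire.
print(encodable(CRAB, "cp936"))      # False
print(encodable(CRAB, "utf-8"))      # True

# ASCII text encodes to identical bytes in both encodings, which fits
# the observation that the problems vanish once the emoji is removed.
print("emacs".encode("cp936") == "emacs".encode("utf-8"))  # True
```

If the same holds for the real codepage, a tag containing the CRAB emoji could not be passed safely on the command line, while pure-ASCII arguments would be unaffected by the choice of encoding.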

> They don't have any effect on Emacs on MS-Windows, that's for 
> sure.  Whether they have effect on 'buku' depends on whether 
> it's a native MS-Windows program or Cygwin/MSYS program, and 
> also on its code (a program could potentially augment the MS 
> 'setlocale' function with its own code which looks at the LC_* 
> environment variables, and does TRT in the application code).

*nod*
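
To see which encodings a Python program such as `buku` would pick up by default in a given environment, something like the following could be run with the same interpreter, once with and once without `LC_ALL` set. This is only a diagnostic sketch: `report_encodings` is a hypothetical helper, and it says nothing about what `buku`'s own code does on top of these defaults.

```python
import locale
import sys


def report_encodings() -> dict:
    """Collect the interpreter's default text encodings for inspection."""
    return {
        # Encoding used when printing to / reading from the standard streams;
        # may be None if a stream has been replaced or is missing.
        "stdout": getattr(sys.stdout, "encoding", None),
        "stdin": getattr(sys.stdin, "encoding", None),
        # Encoding Python would choose for text files by default.
        "preferred": locale.getpreferredencoding(False),
        # Encoding used to convert file names to and from bytes.
        "filesystem": sys.getfilesystemencoding(),
        # Whether Python's UTF-8 mode is enabled (Python 3.7 and later).
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


if __name__ == "__main__":
    for name, value in report_encodings().items():
        print(f"{name}: {value}")
```

Comparing the two runs would show whether the environment variable changes anything at the interpreter level on the user's system.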

>> But what should i do to handle the more general case of an 
>> arbitrary encoding? Do i need to have a defcustom, with 
>> 'reasonable defaults', that the user can set if necessary, 
>> which i use as the value to pass to coding-system-for-read?
>
> That depends on what encoding 'buku' expects on input and what 
> encoding it uses on output.  If it always uses UTF-8, 
> you just need to make sure Emacs uses UTF-8 when encoding and 
> decoding text passed to and from 'buku' (but note the caveat 
> about encoding the command-line arguments -- these _must_ be 
> encoded in the system codepage).  If, OTOH, the encoding used by 
> 'buku' can be changed dynamically, and Emacs cannot know what it 
> is (for example, if it is determined by the encoding of the text 
> put in the SQL database by the user), then a user option is in 
> order.

Great, thank you.
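
A small illustration of why the decoding side has to match what the program emits, sketched in Python under the assumption (not verified here) that `buku` writes UTF-8. `decode_output` and the sample title are hypothetical and stand in for whatever reads the process output.

```python
def decode_output(raw: bytes, codec: str) -> str:
    """Decode process output, replacing bytes that are invalid in CODEC."""
    return raw.decode(codec, errors="replace")


title = "书签 \N{CRAB}"

# Assumption: the external program emits UTF-8 bytes.
raw = title.encode("utf-8")

# Decoding with the matching encoding recovers the original text.
print(decode_output(raw, "utf-8") == title)   # True

# Decoding the same bytes with the system codepage does not.
print(decode_output(raw, "cp936") == title)   # False
```

If the output encoding were instead not known in advance, the second argument is the piece that a user option would have to supply.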
 
>> As i interpret their comments in the above discussions so far, 
>> yes, they had themselves set LANG to "zh_CN.UTF-8" (and yes, as 
>> described above, had definitely called `set-language-environment` 
>> with "UTF-8").
>
> NOT RECOMMENDED!

*chuckle* i'll be sure to pass this on. :-)

Thanks again!


Alexis.


