From: Eli Zaretskii <eliz@gnu.org>
To: Alexis <flexibeast@gmail.com>
Cc: emacs-devel@gnu.org
Subject: Re: "args-out-of-range" error when using data from external process on Windows
Date: Thu, 18 Apr 2024 11:35:21 +0300
Message-ID: <86il0ff4qe.fsf@gnu.org>
In-Reply-To: <87y19bywr6.fsf@gmail.com> (message from Alexis on Thu, 18 Apr 2024 17:07:25 +1000)

> From: Alexis <flexibeast@gmail.com>
> Cc: emacs-devel@gnu.org
> Date: Thu, 18 Apr 2024 17:07:25 +1000
> 
> 
>   https://github.com/flexibeast/ebuku/issues/32
> 
> as there is already an extended discussion there about this issue, 
> which itself links to a previous issue and discussion:
> 
>   https://github.com/flexibeast/ebuku/issues/31
> 
> in which the user first reported an "Invalid string for collation" 
> issue. That issue was addressed, after some discussion, by setting 
> LC_ALL to the same value that the user had set LANG, 
> i.e. "zh_CN.UTF-8". That left us with issue 32, which is the one 
> i'm asking about here.

I don't think I understand the setting of LC_ALL part.  First, AFAIK
Windows programs generally ignore LC_* environment variables.  If you
read the Microsoft documentation of 'setlocale', here:

  https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale?view=msvc-170

you will not see any reference to environment variables there.  The
Windows 'setlocale' supports only LC_* _categories_ in direct calls to
the function, and doesn't consider the corresponding environment
variables.  The Emacs source code doesn't reference LC_* environment
variables on MS-Windows, either.  So how did the user set LC_ALL, and
why did it have any effect whatsoever on the issue?

Second, the user sets a UTF-8 locale, which as I wrote up-thread is
not a good idea on MS-Windows.  It could well cause failures in
invoking external programs from Emacs, if the arguments to those
programs include non-ASCII characters.  In general, on MS-Windows
Emacs can only safely invoke programs with non-ASCII characters in the
command-line arguments if those characters can be encoded by the
system codepage, in this case codepage-936 AFAIU.

Regarding the "invalid string for collation: Invalid argument" error:
how does ebuku determine the LOCALE argument with which it calls
string-collate-lessp?  It is important to understand what was the
locale with which w32_compare_strings was called in that case.

Finally, the issues with Windows-style file names with drive letters
and with file names that begin with "~" lead me to believe that
perhaps the underlying program 'buku' is not a native Windows program,
but a Cygwin or MSYS program, in which case there could be
incompatibilities both regarding file names and regarding handling of
non-ASCII characters (Cygwin and MSYS use UTF-8 by default, whereas
the native Windows build of Emacs does not).

> `buku` provides a command-line interface to an SQLite-based 
> database of Web bookmarks, allowing one to save, delete and search 
> for bookmarks, with each bookmark able to have a comment and tags 
> associated with it.
> 
> `Ebuku` is a package that provides an Emacs-based UI for buku. It 
> allows the user to add bookmarks, edit them, remove them, search 
> them etc. without actually leaving Emacs. It does so by running 
> `call-process` to call `buku` with the appropriate options, 
> receiving the resulting output in a buffer, then processing the 
> data in that buffer in order to present the user with the relevant 
> results.
> 
> ebuku.el has a function:
> 
> (defun ebuku--call-buku (args)
>   "Internal function for calling `buku' with list ARGS."
>   (unless ebuku-buku-path
>     (error "Couldn't find buku: check 'ebuku-buku-path'"))
>   (apply #'call-process
>          `(,ebuku-buku-path nil t nil
>                             "--np" "--nc" "--db"
>                             ,ebuku-database-path ,@args)))
> 
> which gets called in several places - e.g. 
> https://github.com/flexibeast/ebuku/blob/c854d128cba8576fe9693c19109b5deafb573e99/ebuku.el#L534 
> - to put the contents inside a temp buffer, which is then 'parsed' 
> for the information to be presented to the user.

You need to take a good look at whether non-ASCII characters are
passed to 'buku' in this case, and how the output from 'buku' is
decoded.  Also, ebuku-buku-path and ebuku-database-path should both be
quoted with shell-quote-argument (but I don't think this is a problem
in this case).  Can ARGS include whitespace or characters special for
the Windows shell?  If so, each argument should be quoted with
shell-quote-argument as well.

How output is decoded when it is put into the temporary buffer is also
of interest -- what is the value of buffer-file-coding-system in the
temporary buffer after reading output, in the OP's case?

> In a comment from a couple of days ago, and after having noted in 
> a comment on issue 31:
> 
>   https://github.com/flexibeast/ebuku/issues/31#issuecomment-2053557703
> 
> that they'd set LANG on their system to "zh_CN.UTF-8", the user 
> wrote 
> (https://github.com/flexibeast/ebuku/issues/32#issuecomment-2058289816):
> 
> > I set the value with (set-language-environment "UTF-8").  I 
> > remember I set up this value because I don't want my files 
> > containing Chinese to be encoded by GBK encoding.

This is not a good idea, as I mentioned before.  Emacs on MS-Windows
cannot use UTF-8 when encoding command-line arguments for
sub-programs, it can only use the system codepage.  Using
set-language-environment as above will force Emacs to encode
command-line arguments in UTF-8, which could very well be the reason
for some of these problems.

> Then, in 
> https://github.com/flexibeast/ebuku/issues/32#issuecomment-2058498373, 
> i wrote:
> 
> > if i remember correctly, the default encoding used by Windows is 
> > UTF-16, not UTF-8. So i'm wondering if that's somehow being used 
> > to transfer data from the buku process to the Emacs process, 
> > regardless of the value of LANG and LC_ALL, and regardless of 
> > the encoding of the buku database itself?
> 
> to which the user responded:
> 
> > I think the Powershell will use UTF-16 to encode instead of 
> > UTF-8.
> 
> Is that correct?

No.

The issue is complicated by several factors and will take a long post
to explain.  The upshot is that for passing non-ASCII characters
safely to subprograms on their command lines, Emacs should use the
system codepage, not UTF-8 or anything else (and definitely not
UTF-16).  This might require some tricky juggling with coding-system
related settings when you call call-process, because
coding-system-for-write is used for both encoding of the command-line
arguments and of the stuff we send to the sub-program, so if they both
can include non-ASCII characters, some care is in order.  (By
contrast, coding-system-for-read can always be bound to UTF-8 to
decode the output correctly -- assuming 'buku' outputs UTF-8 encoded
text on MS-Windows.)
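
A minimal sketch of the bindings described above, for illustration
only.  The function name is hypothetical and is not part of ebuku.  It
assumes that the program emits UTF-8 and that `locale-coding-system'
reflects the system codepage on MS-Windows; both assumptions should be
verified on the affected system.

```elisp
;; Hypothetical helper, not part of ebuku or Emacs.
(defun my-sketch-call-program (program &rest args)
  "Call PROGRAM with ARGS, inserting its output into the current buffer.
Output is decoded as UTF-8.  Command-line arguments (and any text sent
to PROGRAM) are encoded with `locale-coding-system'."
  (let (;; Assumption: PROGRAM writes UTF-8 to its standard output.
        (coding-system-for-read 'utf-8)
        ;; Assumption: `locale-coding-system' matches the system
        ;; codepage.  This one variable governs both the arguments
        ;; and the data sent to the sub-process.
        (coding-system-for-write locale-coding-system))
    (apply #'call-process program nil t nil args)))
```

It would be called inside `with-temp-buffer', in the same way that
ebuku--call-buku is called.  If text with a different encoding must
also be sent to the sub-process, a single binding of
coding-system-for-write is not enough, which is the juggling mentioned
above.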

> Is that the case despite the user having 
> specified "zh_CN.UTF-8"? But if that's the case, why does removing 
> the CRAB emoji from text being operated on by string-match / 
> match-string make the issue disappear? Is it perhaps something to 
> do with the code point for the CRAB emoji being outside the BMP?

The more important question is: can the CRAB emoji be safely encoded by
codepage 936, the system codepage of the OP?  If not, and if that
emoji can appear in the command-line arguments of a 'buku' invocation
(as opposed to in the text we write to or read from 'buku'), then this
character cannot be used at all with this package on MS-Windows.
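
One way to check this from Lisp is sketched below.  The helper name is
hypothetical.  The coding system cp936 is hard-coded in the example
only because it is the codepage mentioned in this thread; real code
would need to determine the system codepage instead.

```elisp
;; Hypothetical helper, not part of ebuku or Emacs.
(defun my-sketch-unencodable-args (args coding)
  "Return the strings in ARGS that CODING cannot encode."
  (delq nil
        (mapcar (lambda (arg)
                  ;; `unencodable-char-position' returns nil when every
                  ;; character of ARG can be encoded by CODING.
                  (and (unencodable-char-position
                        0 (length arg) coding nil arg)
                       arg))
                args)))

;; Example call: which of these arguments cannot be encoded by cp936?
(my-sketch-unencodable-args '("--np" "rust \N{CRAB}") 'cp936)
```

An empty result means that all of the arguments can be passed on the
command line with that codepage.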

(And please note that Emacs now has native SQLite support, which
should make many of these complications simply disappear.)

As for why the problems disappear when the CRAB emoji is removed: as I
wrote elsewhere, that's probably because all the other characters are
plain ASCII, so all the encoding-related issues don't matter.

> > Suggest that you ask the user who reported that to show the 
> > actual output of the sub-process (e.g., by running the same 
> > command outside of Emacs and redirecting output to a file), and 
> > if the output looks correct, examine the Lisp code which 
> > processes that output, with an eye on how the text is decoded. 
> > For example, if the text from the sub-process is supposed to be 
> > UTF-8 encoded, your Lisp code should bind coding-system-for-read 
> > to 'utf-8', to make sure it is decoded correctly.
> 
> Thanks, i can certainly do that, modulo the issue of whether the 
> LANG and LC_ALL variables have any effect on data transferred 
> between the `buku` sub-process and Emacs.

They don't have any effect on Emacs on MS-Windows, that's for sure.
Whether they have effect on 'buku' depends on whether it's a native
MS-Windows program or Cygwin/MSYS program, and also on its code (a
program could potentially augment the MS 'setlocale' function with its
own code which looks at the LC_* environment variables, and does TRT
in the application code).

> But what should i do to handle the more general case of an arbitrary
> encoding? Do i need to have a defcustom, with 'reasonable defaults',
> that the user can set if necessary, which i use as the value to pass
> to coding-system-for-read?

That depends on what encoding 'buku' expects on input and what
encoding it uses on output.  If it always uses UTF-8, you just
need to make sure Emacs uses UTF-8 when encoding and decoding text
passed to and from 'buku' (but note the caveat about encoding the
command-line arguments -- these _must_ be encoded in the system
codepage).  If, OTOH, the encoding used by 'buku' can be changed
dynamically, and Emacs cannot know what it is (for example, if it is
determined by the encoding of the text put in the SQL database by the
user), then a user option is in order.
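
If a user option does turn out to be needed, it could look roughly
like the sketch below.  The option and function names are hypothetical,
and the default of UTF-8 is an assumption about what 'buku' outputs.

```elisp
;; Hypothetical option, not part of ebuku.
(defcustom my-sketch-buku-output-coding 'utf-8
  "Coding system used to decode the output of `buku'.
The default assumes that `buku' emits UTF-8."
  :type 'coding-system
  :group 'processes)

;; Hypothetical helper showing how the option would be consumed.
(defun my-sketch-call-decoding-output (program &rest args)
  "Call PROGRAM with ARGS, inserting its output into the current buffer.
Output is decoded with `my-sketch-buku-output-coding'."
  (let ((coding-system-for-read my-sketch-buku-output-coding))
    (apply #'call-process program nil t nil args)))
```

This covers only the decoding of output; it leaves the encoding of
command-line arguments to the existing defaults.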

> > Btw: using UTF-8 by default on MS-Windows is not a very good 
> > idea, even with Windows 11 where one can enable UTF-8 support 
> > (did they do it, btw?).  Windows still doesn't support UTF-8 
> > well, even after the improvements in Windows 11, so the above 
> > settings might very well cause trouble.  Suggest to ask the user 
> > to try the same recipe in "emacs -Q", and if the zh_CN.UTF-8 
> > stuff is set up outside Emacs, to try without it.
> 
> As i interpret their comments in the above discussions so far, 
> yes, they had themselves set LANG to "zh_CN.UTF-8" (and yes, as 
> described above, had definitely called `set-language-environment` 
> with "UTF-8").

NOT RECOMMENDED!


