unofficial mirror of emacs-devel@gnu.org 
* "args-out-of-range" error when using data from external process on Windows
@ 2024-04-18  5:39 Alexis
  2024-04-18  6:01 ` Eli Zaretskii
  2024-04-18  6:05 ` Eli Zaretskii
  0 siblings, 2 replies; 11+ messages in thread
From: Alexis @ 2024-04-18  5:39 UTC (permalink / raw)
  To: emacs-devel


[Not currently subscribed to the list, so please cc me on 
replies.]

Hi all,

A user of my `Ebuku` package has reported an "args-out-of-range" 
error that i'm out of my depth trying to diagnose. Here's the 
GitHub issue:

  https://github.com/flexibeast/ebuku/issues/32

i can't reproduce the issue on my own system:

* Gentoo + Emacs 29.3
* LANG=en_AU.UTF-8
* The only set LC_* variables are:
  LC_MESSAGES=C
  LC_TIME=en_AU.UTF-8
* current-language-environment: "English"
* locale-coding-system: utf-8-unix

Their system:

* Windows 11, using Emacs 29.2, obtained via Scoop package 
  manager; not using WSL
* LANG=zh_CN.UTF-8, LC_ALL=zh_CN.UTF-8
* current-language-environment: UTF-8
* locale-coding-system: cp936
* default-process-coding-system: '(utf-8-dos . utf-8-unix)

`Ebuku` uses `call-process` to call the Python-based `buku` 
bookmark database manager and present the resulting output in 
Emacs. buku stores data in an SQLite database.

  https://github.com/jarun/buku/

The link:

  https://google.github.io/comprehensive-rust/

in the buku database results in:

```
Debugger entered--Lisp error: (args-out-of-range "1884. Welcome to 
Comprehensive Rust 🦀 - Comprehens..." 15862 15893)
  match-string(1 "1884. Welcome to Comprehensive Rust 🦀 - 
  Comprehensive Rust 🦀")
  ebuku--search-helper("--print" "[all]" "-1000" "")
  ebuku-show-all()
  ebuku()
  funcall-interactively(ebuku)
  command-execute(ebuku record)
  execute-extended-command(nil "ebuku" "ebuku")
  funcall-interactively(execute-extended-command nil "ebuku" 
  "ebuku")
  command-execute(execute-extended-command)
```

Once the Unicode CRAB emoji is removed, there's no issue. The 
link:

  https://coredumped.dev/2021/05/26/taking-org-roam-everywhere-with-logseq/

in the buku database results in:

```
Debugger entered--Lisp error:
  (args-out-of-range "2027. Taking org-roam everywhere with logseq 
  • Core Dumped" 32318 32355)
  match-string(1 "2027. Taking org-roam everywhere with logseq • 
  Cor...")
  (setq tags (match-string 1 line))
  (progn (string-match "^\\s-*[#] \\(.*\\)$" line)
    (setq tags (match-string 1 line)))
  [snip rest of traceback]
```

The user has confirmed that the buku database is UTF-8.

Does anyone have any suggestions about what might be happening? i 
presume my code is making some incorrect assumptions, or not doing 
some encoding stuff that it should be. i really want to get 
encoding and language support right, so even outside of this 
specific issue, general comments about things i need to fix in 
this regard would be most welcome. :-) 


Alexis.



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: "args-out-of-range" error when using data from external process on Windows
  2024-04-18  5:39 Alexis
@ 2024-04-18  6:01 ` Eli Zaretskii
  2024-04-18  7:07   ` Alexis
  2024-04-18  6:05 ` Eli Zaretskii
  1 sibling, 1 reply; 11+ messages in thread
From: Eli Zaretskii @ 2024-04-18  6:01 UTC (permalink / raw)
  To: Alexis; +Cc: emacs-devel

> From: Alexis <flexibeast@gmail.com>
> Date: Thu, 18 Apr 2024 15:39:10 +1000
> 
> [Not currently subscribed to the list, so please cc me on 
> replies.]
> 
> Hi all,
> 
> A user of my `Ebuku` package has reported an "args-out-of-range" 
> error that i'm out of my depth trying to diagnose. Here's the 
> GitHub issue:
> 
>   https://github.com/flexibeast/ebuku/issues/32
> 
> i can't reproduce the issue on my own system:
> 
> * Gentoo + Emacs 29.3
> * LANG=en_AU.UTF-8
> * The only set LC_* variables are:
>   LC_MESSAGES=C
>   LC_TIME=en_AU.UTF-8
> * current-language-environment: "English"
> * locale-coding-system: utf-8-unix
> 
> Their system:
> 
> * Windows 11, using Emacs 29.2, obtained via Scoop package 
>   manager; not using WSL
> * LANG=zh_CN.UTF-8, LC_ALL=zh_CN.UTF-8
> * current-language-environment: UTF-8
> * locale-coding-system: cp936
> * default-process-coding-system: '(utf-8-dos . utf-8-unix)
> 
> `Ebuku` uses `call-process` to call the Python-based `buku` 
> bookmark database manager and present the resulting output in 
> Emacs. buku stores data in an SQLite database.
> 
>   https://github.com/jarun/buku/
> 
> The link:
> 
>   https://google.github.io/comprehensive-rust/
> 
> in the buku database results in:
> 
> ```
> Debugger entered--Lisp error: (args-out-of-range "1884. Welcome to 
> Comprehensive Rust 🦀 - Comprehens..." 15862 15893)
>   match-string(1 "1884. Welcome to Comprehensive Rust 🦀 - 
>   Comprehensive Rust 🦀")
>   ebuku--search-helper("--print" "[all]" "-1000" "")
>   ebuku-show-all()
>   ebuku()
>   funcall-interactively(ebuku)
>   command-execute(ebuku record)
>   execute-extended-command(nil "ebuku" "ebuku")
>   funcall-interactively(execute-extended-command nil "ebuku" 
>   "ebuku")
>   command-execute(execute-extended-command)
> ```
> 
> Once the Unicode CRAB emoji is removed, there's no issue. The 
> link:
> 
>   https://coredumped.dev/2021/05/26/taking-org-roam-everywhere-with-logseq/
> 
> in the buku database results in:
> 
> ```
> Debugger entered--Lisp error:
>   (args-out-of-range "2027. Taking org-roam everywhere with logseq 
>   • Core Dumped" 32318 32355)
>   match-string(1 "2027. Taking org-roam everywhere with logseq • 
>   Cor...")
>   (setq tags (match-string 1 line))
>   (progn (string-match "^\\s-*[#] \\(.*\\)$" line)
>     (setq tags (match-string 1 line)))
>   [snip rest of traceback]
> ```
> 
> The user has confirmed that the buku database is UTF-8.
> 
> Does anyone have any suggestions about what might be happening?

Crystal ball says the package assumes UTF-8 encoding of the text from
the sub-process, which is generally not what happens on Windows.  Or
maybe the package assumes that UTF-8 text from a sub-process will
necessarily be decoded as UTF-8, which again can fail if the default
coding-systems are not UTF-8 (which happens on Windows).  The upshot
is that the Lisp code expects some number of characters, but gets a
different number of characters instead.

But this is all basically stabbing in the dark, since I have no idea
what that package does and what the program whose output it reads
does.  Suggest that you ask the user who reported that to show the
actual output of the sub-process (e.g., by running the same command
outside of Emacs and redirecting output to a file), and if the output
looks correct, examine the Lisp code which processes that output, with
an eye on how the text is decoded.  For example, if the text from the
sub-process is supposed to be UTF-8 encoded, your Lisp code should
bind coding-system-for-read to 'utf-8', to make sure it is decoded
correctly.
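
As a rough illustration of the "different number of characters" point,
here is a minimal Emacs Lisp sketch.  The helper name `my/decode-demo'
and the sample string are made up for this example and are not part of
Ebuku; `cp936' stands in for the reporter's locale coding system.  The
helper encodes a string as UTF-8, as a UTF-8-emitting sub-process
would, and then decodes the bytes with whichever coding system is
given:

```elisp
;; Hypothetical helper for illustration; not part of Ebuku.
(defun my/decode-demo (text coding)
  "Encode TEXT as UTF-8, then decode the resulting bytes with CODING."
  (decode-coding-string (encode-coding-string text 'utf-8) coding))

;; Decoding with the matching coding system recovers the text.
(my/decode-demo "Rust 🦀" 'utf-8)

;; Decoding the same bytes as cp936 yields a different string, so
;; anything that counts characters sees different positions.
(my/decode-demo "Rust 🦀" 'cp936)
```

Comparing `length' of the two results against the original string
shows whether the decoding step changed the character count.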




* Re: "args-out-of-range" error when using data from external process on Windows
  2024-04-18  5:39 Alexis
  2024-04-18  6:01 ` Eli Zaretskii
@ 2024-04-18  6:05 ` Eli Zaretskii
  1 sibling, 0 replies; 11+ messages in thread
From: Eli Zaretskii @ 2024-04-18  6:05 UTC (permalink / raw)
  To: Alexis; +Cc: emacs-devel

> From: Alexis <flexibeast@gmail.com>
> Date: Thu, 18 Apr 2024 15:39:10 +1000
> 
> Their system:
> 
> * Windows 11, using Emacs 29.2, obtained via Scoop package 
>   manager; not using WSL
> * LANG=zh_CN.UTF-8, LC_ALL=zh_CN.UTF-8
> * current-language-environment: UTF-8
> * locale-coding-system: cp936
> * default-process-coding-system: '(utf-8-dos . utf-8-unix)

Btw: using UTF-8 by default on MS-Windows is not a very good idea,
even with Windows 11 where one can enable UTF-8 support (did they do
it, btw?).  Windows still doesn't support UTF-8 well, even after the
improvements in Windows 11, so the above settings might very well
cause trouble.  Suggest to ask the user to try the same recipe in
"emacs -Q", and if the zh_CN.UTF-8 stuff is set up outside Emacs, to
try without it.




* "args-out-of-range" error when using data from external process on Windows
@ 2024-04-18  6:11 Alexis
  2024-04-18  7:08 ` Eli Zaretskii
  0 siblings, 1 reply; 11+ messages in thread
From: Alexis @ 2024-04-18  6:11 UTC (permalink / raw)
  To: emacs-devel


[Second attempt, after the formatting somehow got messed up by the 
sending process on the first attempt, sorry ....]

[Not currently subscribed to the list, so please cc me on 
replies.]
 
Hi all,

A user of my `Ebuku` package has reported an "args-out-of-range" 
error that i'm out of my depth trying to diagnose. Here's the 
GitHub issue: 

  https://github.com/flexibeast/ebuku/issues/32

i can't reproduce the issue on my own system:

* Gentoo + Emacs 29.3
* LANG=en_AU.UTF-8
* The only set LC_* variables are:
  LC_MESSAGES=C
  LC_TIME=en_AU.UTF-8
* current-language-environment: "English"
* locale-coding-system: utf-8-unix

Their system:

* Windows 11, using Emacs 29.2, obtained via Scoop package 
  manager; not using WSL
* LANG=zh_CN.UTF-8, LC_ALL=zh_CN.UTF-8
* current-language-environment: UTF-8
* locale-coding-system: cp936
* default-process-coding-system: '(utf-8-dos . utf-8-unix)

`Ebuku` uses `call-process` to call the Python-based `buku` 
bookmark database manager and present the resulting output in 
Emacs. buku stores data in an SQLite database.

  https://github.com/jarun/buku/

The link: 

  https://google.github.io/comprehensive-rust/

in the buku database results in:

```
Debugger entered--Lisp error: (args-out-of-range "1884. Welcome to 
Comprehensive Rust 🦀 - Comprehens..." 15862 15893) 
  match-string(1 "1884. Welcome to Comprehensive Rust 🦀 - 
  Comprehensive Rust 🦀")
  ebuku--search-helper("--print" "[all]" "-1000" "")
  ebuku-show-all()
  ebuku()
  funcall-interactively(ebuku)
  command-execute(ebuku record)
  execute-extended-command(nil "ebuku" "ebuku")
  funcall-interactively(execute-extended-command nil "ebuku" 
  "ebuku")
  command-execute(execute-extended-command)
```

Once the Unicode CRAB emoji is removed, there's no issue. The 
link:
 
  https://coredumped.dev/2021/05/26/taking-org-roam-everywhere-with-logseq/ 

in the buku database results in:

```
Debugger entered--Lisp error:
  (args-out-of-range "2027. Taking org-roam everywhere with logseq 
  • Core Dumped" 32318 32355)
  match-string(1 "2027. Taking org-roam everywhere with logseq • 
  Cor...")
  (setq tags (match-string 1 line))
  (progn (string-match "^\\s-*[#] \\(.*\\)$" line)
    (setq tags (match-string 1 line)))
  [snip rest of traceback]
```

The user has confirmed that the buku database is UTF-8.

Does anyone have any suggestions about what might be happening? i 
presume my code is making some incorrect assumptions, or not doing 
some encoding stuff that it should be. i really want to get 
encoding and language support right, so even outside of this 
specific issue, general comments about things i need to fix in 
this regard would be most welcome. :-)


Alexis.




* Re: "args-out-of-range" error when using data from external process on Windows
  2024-04-18  6:01 ` Eli Zaretskii
@ 2024-04-18  7:07   ` Alexis
  2024-04-18  8:35     ` Eli Zaretskii
  0 siblings, 1 reply; 11+ messages in thread
From: Alexis @ 2024-04-18  7:07 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel


Eli Zaretskii <eliz@gnu.org> writes:

> Crystal ball says the package assumes UTF-8 encoding of the text 
> from the sub-process, which is generally not what happens on 
> Windows.  Or maybe the package assumes that UTF-8 text from a 
> sub-process will necessarily be decoded as UTF-8, which again 
> can fail if the default coding-systems are not UTF-8 (which 
> happens on Windows).  The upshot is that the Lisp code expects 
> some number of characters, but gets a different number of 
> characters instead.
>
> But this is all basically stabbing in the dark, since I have no 
> idea what that package does and what the program whose output it 
> reads does. 

Hi Eli,

Thanks for your prompt reply. Sorry for my email not being more 
descriptive and self-contained. i linked to the GitHub issue:

  https://github.com/flexibeast/ebuku/issues/32

as there is already an extended discussion there about this issue, 
which itself links to a previous issue and discussion:

  https://github.com/flexibeast/ebuku/issues/31

in which the user first reported an "Invalid string for collation" 
issue. That issue was addressed, after some discussion, by setting 
LC_ALL to the same value that the user had set LANG, 
i.e. "zh_CN.UTF-8". That left us with issue 32, which is the one 
i'm asking about here.

Some better background about the software involved:

`buku` provides a command-line interface to an SQLite-based 
database of Web bookmarks, allowing one to save, delete and search 
for bookmarks, with each bookmark able to have a comment and tags 
associated with it.

`Ebuku` is a package that provides an Emacs-based UI for buku. It 
allows the user to add bookmarks, edit them, remove them, search 
them etc. without actually leaving Emacs. It does so by running 
`call-process` to call `buku` with the appropriate options, 
receiving the resulting output in a buffer, then processing the 
data in that buffer in order to present the user with the relevant 
results.

ebuku.el has a function:

(defun ebuku--call-buku (args)
  "Internal function for calling `buku' with list ARGS."
  (unless ebuku-buku-path
    (error "Couldn't find buku: check 'ebuku-buku-path'"))
  (apply #'call-process
         `(,ebuku-buku-path nil t nil
                            "--np" "--nc" "--db"
                            ,ebuku-database-path ,@args)))

which gets called in several places - e.g. 
https://github.com/flexibeast/ebuku/blob/c854d128cba8576fe9693c19109b5deafb573e99/ebuku.el#L534 
- to put the contents inside a temp buffer, which is then 'parsed' 
for the information to be presented to the user.

In a comment from a couple of days ago, and after having noted in 
a comment on issue 31:

  https://github.com/flexibeast/ebuku/issues/31#issuecomment-2053557703

that they'd set LANG on their system to "zh_CN.UTF-8", the user 
wrote 
(https://github.com/flexibeast/ebuku/issues/32#issuecomment-2058289816):

> I set the value with (set-language-environment "UTF-8").  I 
> remember I set up this value because I don't want my files 
> containing Chinese to be encoded by GBK encoding.

Then, in 
https://github.com/flexibeast/ebuku/issues/32#issuecomment-2058498373, 
i wrote:

> if i remember correctly, the default encoding used by Windows is 
> UTF-16, not UTF-8. So i'm wondering if that's somehow being used 
> to transfer data from the buku process to the Emacs process, 
> regardless of the value of LANG and LC_ALL, and regardless of 
> the encoding of the buku database itself?

to which the user responded:

> I think the Powershell will use UTF-16 to encode instead of 
> UTF-8.

Is that correct? Is that the case despite the user having 
specified "zh_CN.UTF-8"? But if that's the case, why does removing 
the CRAB emoji from text being operated on by string-match / 
match-string make the issue disappear? Is it perhaps something to 
do with
the code point for the CRAB emoji being outside the BMP?

> Suggest that you ask the user who reported that to show the 
> actual output of the sub-process (e.g., by running the same 
> command outside of Emacs and redirecting output to a file), and 
> if the output looks correct, examine the Lisp code which 
> processes that output, with an eye on how the text is decoded. 
> For example, if the text from the sub-process is supposed to be 
> UTF-8 encoded, your Lisp code should bind coding-system-for-read 
> to 'utf-8', to make sure it is decoded correctly.

Thanks, i can certainly do that, modulo the issue of whether the 
LANG and LC_ALL variables have any effect on data transferred 
between the `buku` sub-process and Emacs. But what should i do to handle 
the more general case of an arbitrary encoding? Do i need to have 
a defcustom, with 'reasonable defaults', that the user can set if 
necessary, which i use as the value to pass to 
coding-system-for-read?

> Btw: using UTF-8 by default on MS-Windows is not a very good 
> idea, even with Windows 11 where one can enable UTF-8 support 
> (did they do it, btw?).  Windows still doesn't support UTF-8 
> well, even after the improvements in Windows 11, so the above 
> settings might very well cause trouble.  Suggest to ask the user 
> to try the same recipe in "emacs -Q", and if the zh_CN.UTF-8 
> stuff is set up outside Emacs, to try without it.

As i interpret their comments in the above discussions so far, 
yes, they had themselves set LANG to "zh_CN.UTF-8" (and yes, as 
described above, had definitely called `set-language-environment` 
with "UTF-8").

i'll certainly take your suggestions back to the user.

Thanks again,


Alexis.




* Re: "args-out-of-range" error when using data from external process on Windows
  2024-04-18  6:11 "args-out-of-range" error when using data from external process on Windows Alexis
@ 2024-04-18  7:08 ` Eli Zaretskii
  0 siblings, 0 replies; 11+ messages in thread
From: Eli Zaretskii @ 2024-04-18  7:08 UTC (permalink / raw)
  To: Alexis; +Cc: emacs-devel

> From: Alexis <flexibeast@gmail.com>
> Date: Thu, 18 Apr 2024 16:11:02 +1000
> 
> Once the Unicode CRAB emoji is removed, there's no issue.

Most probably because the rest of the text is plain ASCII, so decoding
doesn't come into play.




* Re: "args-out-of-range" error when using data from external process on Windows
  2024-04-18  7:07   ` Alexis
@ 2024-04-18  8:35     ` Eli Zaretskii
  2024-04-18 11:20       ` Alexis
  2024-04-19  3:16       ` Alexis
  0 siblings, 2 replies; 11+ messages in thread
From: Eli Zaretskii @ 2024-04-18  8:35 UTC (permalink / raw)
  To: Alexis; +Cc: emacs-devel

> From: Alexis <flexibeast@gmail.com>
> Cc: emacs-devel@gnu.org
> Date: Thu, 18 Apr 2024 17:07:25 +1000
> 
> 
>   https://github.com/flexibeast/ebuku/issues/32
> 
> as there is already an extended discussion there about this issue, 
> which itself links to a previous issue and discussion:
> 
>   https://github.com/flexibeast/ebuku/issues/31
> 
> in which the user first reported an "Invalid string for collation" 
> issue. That issue was addressed, after some discussion, by setting 
> LC_ALL to the same value that the user had set LANG, 
> i.e. "zh_CN.UTF-8". That left us with issue 32, which is the one 
> i'm asking about here.

I don't think I understand the setting of LC_ALL part.  First, AFAIK
Windows programs generally ignore LC_* environment variables.  If you
read the Microsoft documentation of 'setlocale', here:

  https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale?view=msvc-170

you will not see any reference to environment variables there.  The
Windows 'setlocale' supports only LC_* _categories_ in direct calls to
the function, and doesn't consider the corresponding environment
variables.  The Emacs source code doesn't reference LC_* environment
variables on MS-Windows, either.  So how did the user set LC_ALL, and
why did it have any effect whatsoever on the issue?

Second, the user sets a UTF-8 locale, which as I wrote up-thread is
not a good idea on MS-Windows.  It could well cause failures in
invoking external programs from Emacs, if the arguments to those
programs include non-ASCII characters.  In general, on MS-Windows
Emacs can only safely invoke programs with non-ASCII characters in the
command-line arguments if those characters can be encoded by the
system codepage, in this case codepage-936 AFAIU.

Regarding the "invalid string for collation: Invalid argument" error:
how does ebuku determine the LOCALE argument with which it calls
string-collate-lessp?  It is important to understand what was the
locale with which w32_compare_strings was called in that case.

Finally, the issues with Windows-style file names with drive letters
and with file names that begin with "~" lead me to believe that
perhaps the underlying program 'buku' is not a native Windows program,
but a Cygwin or MSYS program, in which case there could be
incompatibilities both regarding file names and regarding handling of
non-ASCII characters (Cygwin and MSYS use UTF-8 by default, whereas
the native Windows build of Emacs does not).

> `buku` provides a command-line interface to an SQLite-based 
> database of Web bookmarks, allowing one to save, delete and search 
> for bookmarks, with each bookmark able to have a comment and tags 
> associated with it.
> 
> `Ebuku` is a package that provides an Emacs-based UI for buku. It 
> allows the user to add bookmarks, edit them, remove them, search 
> them etc. without actually leaving Emacs. It does so by running 
> `call-process` to call `buku` with the appropriate options, 
> receiving the resulting output in a buffer, then processing the 
> data in that buffer in order to present the user with the relevant 
> results.
> 
> ebuku.el has a function:
> 
> (defun ebuku--call-buku (args)
>   "Internal function for calling `buku' with list ARGS."
>   (unless ebuku-buku-path
>     (error "Couldn't find buku: check 'ebuku-buku-path'"))
>   (apply #'call-process
>          `(,ebuku-buku-path nil t nil
>                             "--np" "--nc" "--db"
>                             ,ebuku-database-path ,@args)))
> 
> which gets called in several places - e.g. 
> https://github.com/flexibeast/ebuku/blob/c854d128cba8576fe9693c19109b5deafb573e99/ebuku.el#L534 
> - to put the contents inside a temp buffer, which is then 'parsed' 
> for the information to be presented to the user.

You need to take a good look at whether non-ASCII characters are
passed to 'buku' in this case, and how the output from 'buku' is
decoded.  Also, ebuku-buku-path and ebuku-database-path should both be
quoted with shell-quote-argument (but I don't think this is a problem
in this case).  Can ARGS include whitespace or characters special for
the Windows shell? if so, each argument should be quoted with
shell-quote-argument as well.

How output is decoded when it is put into the temporary buffer is also
of interest -- what is the value of buffer-file-coding-system in the
temporary buffer after reading output, in the OP's case?
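
One way to collect that information from the reporter might look like
the following sketch.  The helper name `my/report-process-coding' is
hypothetical and not part of Ebuku.  It runs a program into a
temporary buffer and returns both `buffer-file-coding-system' and
`last-coding-system-used'; the latter is the variable Emacs sets to
the coding system it used for the most recent decoding, so it is
probably the more telling of the two here.

```elisp
;; Hypothetical diagnostic helper; not part of Ebuku.
(defun my/report-process-coding (program &rest args)
  "Run PROGRAM with ARGS and report which coding systems were in effect.
Return a plist with the exit status and the coding-system values."
  (with-temp-buffer
    (let ((status (apply #'call-process program nil t nil args)))
      (list :exit-status status
            ;; Coding system used to decode the process output.
            :decoded-with last-coding-system-used
            ;; May simply be the buffer default; included because
            ;; it is the value asked about above.
            :buffer-coding buffer-file-coding-system))))
```

The reporter could evaluate this with `ebuku-buku-path' and the same
arguments Ebuku passes, then paste the resulting plist into the issue.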

> In a comment from a couple of days ago, and after having noted in 
> a comment on issue 31:
> 
>   https://github.com/flexibeast/ebuku/issues/31#issuecomment-2053557703
> 
> that they'd set LANG on their system to "zh_CN.UTF-8", the user 
> wrote 
> (https://github.com/flexibeast/ebuku/issues/32#issuecomment-2058289816):
> 
> > I set the value with (set-language-environment "UTF-8").  I 
> > remember I set up this value because I don't want my files 
> > containing Chinese to be encoded by GBK encoding.

This is not a good idea, as I mentioned before.  Emacs on MS-Windows
cannot use UTF-8 when encoding command-line arguments for
sub-programs, it can only use the system codepage.  Using
set-language-environment as above will force Emacs to encode
command-line arguments in UTF-8, which could very well be the reason
for some of these problems.

> Then, in 
> https://github.com/flexibeast/ebuku/issues/32#issuecomment-2058498373, 
> i wrote:
> 
> > if i remember correctly, the default encoding used by Windows is 
> > UTF-16, not UTF-8. So i'm wondering if that's somehow being used 
> > to transfer data from the buku process to the Emacs process, 
> > regardless of the value of LANG and LC_ALL, and regardless of 
> > the encoding of the buku database itself?
> 
> to which the user responded:
> 
> > I think the Powershell will use UTF-16 to encode instead of 
> > UTF-8.
> 
> Is that correct?

No.

The issue is complicated by several factors and will take a long post
to explain.  The upshot is that for passing non-ASCII characters
safely to subprograms on their command lines, Emacs should use the
system codepage, not UTF-8 or anything else (and definitely not
UTF-16).  This might require some tricky juggling with coding-system
related settings when you call call-process, because
coding-system-for-write is used for both encoding of the command-line
arguments and of the stuff we send to the sub-program, so if they both
can include non-ASCII characters, some care is in order.  (By
contrast, coding-system-for-read can be always bound to UTF-8 to
decode the output correctly -- assuming 'buku' outputs UTF-8 encoded
text on MS-Windows.)
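
A minimal sketch of that arrangement, under two assumptions that
should be checked: that 'buku' really emits UTF-8, and that
`locale-coding-system' names the system codepage (cp936 in the OP's
report).  The function name `my/call-buku-utf8' is hypothetical.
Because `call-process' is given no input file here, nothing is sent to
the sub-program's standard input, so `coding-system-for-write' only
affects the command-line arguments in this sketch.

```elisp
;; Hypothetical wrapper for illustration; not Ebuku's actual code.
(defun my/call-buku-utf8 (program &rest args)
  "Call PROGRAM with ARGS and return its output decoded as UTF-8."
  (with-temp-buffer
    (let (;; Assumption: the sub-process writes UTF-8.
          (coding-system-for-read 'utf-8)
          ;; Assumption: this is the system codepage on MS-Windows.
          (coding-system-for-write locale-coding-system))
      (apply #'call-process program nil t nil args))
    (buffer-string)))
```

If input ever needs to be sent to the sub-program as well, the single
`coding-system-for-write' binding would cover both that input and the
arguments, which is the "tricky juggling" mentioned above.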

> Is that the case despite the user having 
> specified "zh_CN.UTF-8"? But if that's the case, why does removing 
> the CRAB emoji from text being operated on by string-match / 
> match-string make the issue disappear? Is it perhaps something to 
> do with
> the code point for the CRAB emoji being outside the BMP?

The more important question is: can CRAB emoji be safely encoded by
codepage 936, the system codepage of the OP?  If not, and if that
emoji can appear in the command-line arguments of a 'buku' invocation
(as opposed to in the text we write to or read from 'buku'), then this
character cannot be used at all with this package on MS-Windows.
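
Whether a given string survives encoding in a particular coding system
can be checked from Lisp.  The sketch below uses the built-in
`unencodable-char-position'; the wrapper name `my/unencodable-index'
is made up for this example.

```elisp
;; Hypothetical wrapper around a built-in; for illustration only.
(defun my/unencodable-index (string coding)
  "Return the index of the first char in STRING that CODING cannot encode.
Return nil if every character is encodable."
  (unencodable-char-position 0 (length string) coding nil string))

;; Plain ASCII is encodable, so this returns nil.
(my/unencodable-index "Rust" 'cp936)

;; The CRAB emoji is outside what cp936 covers, so this returns
;; a non-nil index.
(my/unencodable-index "Rust 🦀" 'cp936)
```

A package could run such a check on each argument before calling the
sub-program and signal a clear error instead of passing mangled text.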

(And please note that Emacs now has a native SQLite support, which
should make many of these complications simply disappear.)

As for why the problems disappear when the CRAB emoji is removed: as I
wrote elsewhere, that's probably because all the other characters are
plain ASCII, so all the encoding-related issues don't matter.

> > Suggest that you ask the user who reported that to show the 
> > actual output of the sub-process (e.g., by running the same 
> > command outside of Emacs and redirecting output to a file), and 
> > if the output looks correct, examine the Lisp code which 
> > processes that output, with an eye on how the text is decoded. 
> > For example, if the text from the sub-process is supposed to be 
> > UTF-8 encoded, your Lisp code should bind coding-system-for-read 
> > to 'utf-8', to make sure it is decoded correctly.
> 
> Thanks, i can certainly do that, modulo the issue of whether the 
> LANG and LC_ALL variables have any effect on data transferred 
> between the `buku` sub-process and Emacs.

They don't have any effect on Emacs on MS-Windows, that's for sure.
Whether they have effect on 'buku' depends on whether it's a native
MS-Windows program or Cygwin/MSYS program, and also on its code (a
program could potentially augment the MS 'setlocale' function with its
own code which looks at the LC_* environment variables, and does TRT
in the application code).

> But what should i do to handle the more general case of an arbitrary
> encoding? Do i need to have a defcustom, with 'reasonable defaults',
> that the user can set if necessary, which i use as the value to pass
> to coding-system-for-read?

That depends on what encoding does 'buku' expect on input and what
encoding does it use on output.  If it always uses UTF-8, you just
need to make sure Emacs uses UTF-8 when encoding and decoding text
passed to and from 'buku' (but note the caveat about encoding the
command-line arguments -- these _must_ be encoded in the system
codepage).  If, OTOH, the encoding used by 'buku' can be changed
dynamically, and Emacs cannot know what it is (for example, if it is
determined by the encoding of the text put in the SQL database by the
user), then a user option is in order.
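
If a user option does turn out to be needed, it could be as small as
the following sketch.  The names `my-ebuku-output-coding-system' and
`my-ebuku-call' are hypothetical, and the `ebuku' customization group
is assumed to exist in the package.  The command-line arguments are
deliberately left alone here, per the caveat above.

```elisp
;; Hypothetical option; name and default are illustrative.
(defcustom my-ebuku-output-coding-system 'utf-8
  "Coding system used to decode the output of buku."
  :type 'coding-system
  ;; Assumption: Ebuku defines a customization group named `ebuku'.
  :group 'ebuku)

;; Hypothetical caller showing where the option would be used.
(defun my-ebuku-call (program &rest args)
  "Call PROGRAM with ARGS, inserting decoded output in the current buffer."
  (let ((coding-system-for-read my-ebuku-output-coding-system))
    (apply #'call-process program nil t nil args)))
```

Defaulting to `utf-8' keeps the common case working without any
configuration, while users whose data is encoded differently can
change the option.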

> > Btw: using UTF-8 by default on MS-Windows is not a very good 
> > idea, even with Windows 11 where one can enable UTF-8 support 
> > (did they do it, btw?).  Windows still doesn't support UTF-8 
> > well, even after the improvements in Windows 11, so the above 
> > settings might very well cause trouble.  Suggest to ask the user 
> > to try the same recipe in "emacs -Q", and if the zh_CN.UTF-8 
> > stuff is set up outside Emacs, to try without it.
> 
> As i interpret their comments in the above discussions so far, 
> yes, they had themselves set LANG to "zh_CN.UTF-8" (and yes, as 
> described above, had definitely called `set-language-environment` 
> with "UTF-8").

NOT RECOMMENDED!




* Re: "args-out-of-range" error when using data from external process on Windows
  2024-04-18  8:35     ` Eli Zaretskii
@ 2024-04-18 11:20       ` Alexis
  2024-04-19  3:16       ` Alexis
  1 sibling, 0 replies; 11+ messages in thread
From: Alexis @ 2024-04-18 11:20 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel


Thanks again for your assistance!

As some additional context: i haven't actively used a Windows 
system in more than a decade - it was Windows 7 - and even then, i 
was running it in a VM in order to run some other software. i've 
also never used Windows outside of an "Australian English" 
context, and have never done any dev work on the Windows 
platform. So i've got only a minimal idea of how Windows does 
various things nowadays, and have never needed to become familiar 
with sysadmin-/dev-level Windows documentation. Until now. :-)

Specific responses inline below.

> I don't think I understand the setting of LC_ALL part.  First, 
> AFAIK Windows programs generally ignore LC_* environment 
> variables.  If you read the Microsoft documentation of 
> 'setlocale', here:
>
>   https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale?view=msvc-170
>
> you will not see any reference to environment variables there.

Thanks for this link; it gives me a good starting point to explore 
the Win docs on this issue.

> The Windows 'setlocale' supports only LC_* _categories_ in 
> direct calls to the function, and doesn't consider the 
> corresponding environment variables.  The Emacs source code 
> doesn't reference LC_* environment variables on MS-Windows, 
> either.  So how did the user set LC_ALL, and why did it have any 
> effect whatsoever on the issue?

They didn't say; all they wrote 
(https://github.com/flexibeast/ebuku/issues/31#issuecomment-2058171986) 
was:

> I ... changed my LC_ALL to zh_CN.UTF-8. Ebuku can find the db 
> now.

i'll ask them.

> Second, the user sets a UTF-8 locale, which as I wrote up-thread 
> is not a good idea on MS-Windows.  It could well cause failures 
> in invoking external programs from Emacs, if the arguments to 
> those programs include non-ASCII characters.  In general, on 
> MS-Windows Emacs can only safely invoke programs with non-ASCII 
> characters in the command-line arguments if those characters can 
> be encoded by the system codepage, in this case codepage-936 
> AFAIU.

Thanks, i'll add that to the information i pass back to the user 
on that GitHub issue.

> Regarding the "invalid string for collation: Invalid argument" 
> error: how does ebuku determine the LOCALE argument with which 
> it calls string-collate-lessp?  It is important to understand 
> what was the locale with which w32_compare_strings was called in 
> that case.

The single use of `string-collate-lessp` doesn't pass any LOCALE 
argument, as i just wanted it to use the user's current locale for 
sorting a given bookmark's tags into the appropriate 
lexicographical order.
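
For reference, a minimal sketch of that kind of sort, with a fallback
for the case where collation signals an error such as the "Invalid
string for collation" one from issue 31.  The function name
`my/sort-tags' is made up here, and falling back to `string-lessp' is
only one possible design choice, not what Ebuku currently does.

```elisp
;; Hypothetical helper; not Ebuku's actual code.
(defun my/sort-tags (tags)
  "Return a sorted copy of TAGS, preferring locale-aware collation.
If `string-collate-lessp' signals an error, fall back to `string-lessp'."
  (condition-case nil
      ;; No LOCALE argument: use the current locale's collation rules.
      (sort (copy-sequence tags) #'string-collate-lessp)
    ;; Collation failed; sort by plain character order instead.
    (error (sort (copy-sequence tags) #'string-lessp))))
```

Swallowing the error hides a misconfigured locale from the user, so
logging it with `message' before falling back may be preferable.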

> Finally, the issues with Windows-style file names with drive 
> letters and with file names that begin with "~" lead me to 
> believe that perhaps the underlying program 'buku' is not a 
> native Windows program, but a Cygwin or MSYS program, in which 
> case there could be incompatibilities both regarding file names 
> and regarding handling of non-ASCII characters (Cygwin and MSYS 
> use UTF-8 by default, whereas the native Windows build of Emacs 
> does not).

Sorry; i mentioned in my first email, but didn't reiterate in my 
second, that `buku` is Python-based.

> You need to take a good look at whether non-ASCII characters are 
> passed to 'buku' in this case, and how the output from 'buku' is 
> decoded.

👍

> Also, ebuku-buku-path and ebuku-database-path should both be 
> quoted with shell-quote-argument (but I don't think this is a 
> problem in this case). Can ARGS include whitespace or characters 
> special for the Windows shell? If so, each argument should be 
> quoted with shell-quote-argument as well.

Thanks, noted.

> How output is decoded when it is put into the temporary buffer 
> is also of interest -- what is the value of 
> buffer-file-coding-system in the temporary buffer after reading 
> output, in the OP's case?

*nod*

> Emacs on MS-Windows 
> cannot use UTF-8 when encoding command-line arguments for 
> sub-programs, it can only use the system codepage.  Using 
> set-language-environment as above will force Emacs to encode 
> command-line arguments in UTF-8, which could very well be the 
> reason for some of these problems.

Ah okay.

> No.
>
> The issue is complicated by several factors and will take a long 
> post to explain.  The upshot is that for passing non-ASCII 
> characters safely to subprograms on their command lines, Emacs 
> should use the system codepage, not UTF-8 or anything else (and 
> definitely not UTF-16).  This might require some tricky juggling 
> with coding-system related settings when you call call-process, 
> because coding-system-for-write is used for both encoding of the 
> command-line arguments and of the stuff we send to the 
> sub-program, so if they both can include non-ASCII characters, 
> some care is in order.  (By contrast, coding-system-for-read can 
> be always bound to UTF-8 to decode the output correctly -- 
> assuming 'buku' outputs UTF-8 encoded text on MS-Windows.)

That's very helpful, thank you.
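
A minimal sketch of the arrangement described above, under stated assumptions: the external program emits UTF-8, and on MS-Windows `locale-coding-system` names the system codepage (cp936 in the reporter's case). The function name `my/call-utf8-program` is hypothetical, and this is not Ebuku's actual code:

```elisp
(defun my/call-utf8-program (program &rest args)
  "Run PROGRAM with ARGS and return its output as a string.
Hypothetical helper for illustration only."
  (with-temp-buffer
    (let (;; Assumption: the program writes UTF-8 to stdout.
          (coding-system-for-read 'utf-8)
          ;; Assumption: `locale-coding-system' is the system codepage
          ;; on MS-Windows.  This binding also governs text sent to
          ;; the program, so it only suits calls that send none.
          (coding-system-for-write locale-coding-system))
      (apply #'call-process program nil t nil args))
    (buffer-string)))
```

Since `coding-system-for-write` covers both the command-line arguments and any text sent to the sub-program, a call that needs to send non-ASCII input as well would need a different arrangement.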

> The more important question is: can CRAB emoji be safely encoded 
> by codepage 936, the system codepage of the OP?  If not, and if 
> that emoji can appear in the command-line arguments of a 'buku' 
> invocation (as opposed to in the text we write to or read from 
> 'buku'), then this character cannot be used at all with this 
> package on MS-Windows.
>
> (And please note that Emacs now has a native SQLite support, 
> which should make many of these complications simply disappear.)

It would certainly make many things easier to just interact with 
the db directly. That said, doing so would involve a substantial 
rewrite, and i've got many things on my plate nowadays, including 
supporting disabled loved ones while having chronic health issues 
myself. But maybe i can open an issue requesting help to start and 
develop a branch doing such a rewrite. 

> As for why the problems disappear when the CRAB emoji is 
> removed: as I wrote elsewhere, that's probably because all the 
> other characters are plain ASCII, so all the encoding-related 
> issues don't matter.

*nod*
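
Whether a given argument can be encoded by the system codepage can be checked from Lisp. A minimal sketch, assuming the codepage is cp936 as on the reporter's system; `my/encodable-p` is a hypothetical name:

```elisp
(defun my/encodable-p (string coding-system)
  "Return non-nil if all of STRING can be encoded by CODING-SYSTEM.
Hypothetical helper for illustration only."
  ;; `unencodable-char-position' returns nil when nothing in the
  ;; given range is unencodable.
  (null (unencodable-char-position
         0 (length string) coding-system nil string)))

(my/encodable-p "Comprehensive Rust" 'cp936)
(my/encodable-p "Comprehensive Rust 🦀" 'cp936)
```

The first call covers the plain-ASCII case; the second should report that the string containing the CRAB emoji cannot be encoded.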

> They don't have any effect on Emacs on MS-Windows, that's for 
> sure.  Whether they have effect on 'buku' depends on whether 
> it's a native MS-Windows program or Cygwin/MSYS program, and 
> also on its code (a program could potentially augment the MS 
> 'setlocale' function with its own code which looks at the LC_* 
> environment variables, and does TRT in the application code).

*nod*

>> But what should i do to handle the more general case of an 
>> arbitrary encoding? Do i need to have a defcustom, with 
>> 'reasonable defaults', that the user can set if necessary, 
>> which i use as the value to pass to coding-system-for-read?
>
> That depends on what encoding does 'buku' expect on input and 
> what encoding does it use on output.  If it always uses UTF-8, 
> you just need to make sure Emacs uses UTF-8 when encoding and 
> decoding text passed to and from 'buku' (but note the caveat 
> about encoding the command-line arguments -- these _must_ be 
> encoded in the system codepage).  If, OTOH, the encoding used by 
> 'buku' can be changed dynamically, and Emacs cannot know what it 
> is (for example, if it is determined by the encoding of the text 
> put in the SQL database by the user), then a user option is in 
> order.

Great, thank you.
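
If a user option does turn out to be needed, it could look roughly like the sketch below. All the `my-bookmarks-...` names are hypothetical, and the UTF-8 default is an assumption about what 'buku' emits:

```elisp
(defgroup my-bookmarks nil
  "Hypothetical customization group for this sketch."
  :group 'external)

(defcustom my-bookmarks-output-coding-system 'utf-8
  "Coding system used to decode output from the external program."
  :type 'coding-system
  :group 'my-bookmarks)

(defun my-bookmarks-call (program &rest args)
  "Run PROGRAM with ARGS; decode its output per the user option."
  (with-temp-buffer
    ;; Bind the decoding just for this one call.
    (let ((coding-system-for-read my-bookmarks-output-coding-system))
      (apply #'call-process program nil t nil args))
    (buffer-string)))
```

A user whose database holds text in another encoding would then only need to customize the one option.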
 
>> As i interpret their comments in the above discussions so far, 
>> yes, they had themselves set LANG to "zh_CN.UTF-8" (and yes, as 
>> described above, had definitely used `set-language-environment` 
>> to select "UTF-8").
>
> NOT RECOMMENDED!

*chuckle* i'll be sure to pass this on. :-)

Thanks again!


Alexis.




* Re: "args-out-of-range" error when using data from external process on Windows
  2024-04-18  8:35     ` Eli Zaretskii
  2024-04-18 11:20       ` Alexis
@ 2024-04-19  3:16       ` Alexis
  2024-04-19  7:29         ` Eli Zaretskii
  1 sibling, 1 reply; 11+ messages in thread
From: Alexis @ 2024-04-19  3:16 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Zhang Songyu, emacs-devel


Hi again Eli,

i've taken your comments back to the user:

  https://github.com/flexibeast/ebuku/issues/32#issuecomment-2063682096

And they've responded with two comments:

  https://github.com/flexibeast/ebuku/issues/32#issuecomment-2064663151

  https://github.com/flexibeast/ebuku/issues/32#issuecomment-2064663151

in which they said:

> I set the LC_ALL and LANG variables by editing the Windows 
> environment variables.
> Because Emacs doesn't read these, I have removed the variables.

and noted that they were successfully able to add and search for 
the CRAB emoji by using buku directly on the command line:

> PowerShell displays the CRAB emoji fine. (I use Windows 
> Terminal.)

but also that:

> I copied the output into the scratch buffer; it's not displaying. 

More generally, they've noted:

> But right now my language environment will be "Chinese-GBK".  My 
> files will be encoded as GBK, which is not what I want.  I think 
> MS uses GBK encoding for Chinese out of some backward-compatibility 
> concern.  Can I somehow set Emacs to use UTF-8 for new file 
> encodings?

And in their second comment, wrote:

> Hi, I found the encoding settings on the Emacs China website.  I 
> added these lines to the early-init.el file:
> 
> (set-charset-priority 'unicode)
> (prefer-coding-system 'utf-8)
> (setq system-time-locale "C")
> 
> Right now, new files are saved as UTF-8, but the args-out-of-range 
> problem still persists.

i have subsequently responded:

  https://github.com/flexibeast/ebuku/issues/32#issuecomment-2065617662

in which i wrote: 
 
> setting those variables won't influence the encoding of the data 
> that Ebuku has to process. 
> 
> This is a very complex issue, so we need to control the various 
> factors involved.

> ...
>
> i understand that you don't want to use the GBK environment in 
> general

and explained how to start Emacs with `-Q` and load Ebuku 
manually, in order to test it in isolation.

However, given that:

* i don't have access to a Win machine;
* i've not actively used Win for more than a decade;
* i don't have any experience with Win at a dev level;
* i've never used Win in a non-English environment, or Emacs in a 
  non-UTF8 environment;

i'm feeling overwhelmed by the various factors here, and am 
struggling to work out the right questions to ask the user, and 
how to appropriately work with the answers. So i also wrote: 
 
> i think i'm going to have to ask you to interact directly with 
> Eli on the mailing list about this, as i'm finding it 
> difficult to be the messenger going back and forth, and it will 
> be much quicker if Eli can ask you questions directly, which you 
> can respond to directly. Hopefully that process will make it 
> clear what would need to be done by Ebuku in order to fix the 
> problem, in a non-GBK environment.

Thus, i've cc'd the user on this email, so that, if you're willing 
and able, you can engage with them directly (perhaps by asking 
them to do specific tests that don't involve Ebuku, but which will 
show what Ebuku needs to handle).


Alexis.




* Re: "args-out-of-range" error when using data from external process on Windows
  2024-04-19  3:16       ` Alexis
@ 2024-04-19  7:29         ` Eli Zaretskii
  2024-04-21  0:57           ` Alexis
  0 siblings, 1 reply; 11+ messages in thread
From: Eli Zaretskii @ 2024-04-19  7:29 UTC (permalink / raw)
  To: Alexis; +Cc: zsy9822, emacs-devel

> From: Alexis <flexibeast@gmail.com>
> Cc: Zhang Songyu <zsy9822@hotmail.com>, emacs-devel <emacs-devel@gnu.org>
> Date: Fri, 19 Apr 2024 13:16:58 +1000
> 
> 
> Hi again Eli,
> 
> i've taken your comments back to the user:
> 
>   https://github.com/flexibeast/ebuku/issues/32#issuecomment-2063682096
> 
> And they've responded with two comments:
> 
>   https://github.com/flexibeast/ebuku/issues/32#issuecomment-2064663151
> 
>   https://github.com/flexibeast/ebuku/issues/32#issuecomment-2064663151

I responded there.  There's no need to do this via the mailing list,
let's continue the discussion directly in the issue.




* Re: "args-out-of-range" error when using data from external process on Windows
  2024-04-19  7:29         ` Eli Zaretskii
@ 2024-04-21  0:57           ` Alexis
  0 siblings, 0 replies; 11+ messages in thread
From: Alexis @ 2024-04-21  0:57 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> I responded there.  There's no need to do this via the mailing 
> list, let's continue the discussion directly in the issue.

Thanks Eli, much appreciated.


Alexis.




