* "args-out-of-range" error when using data from external process on Windows
@ 2024-04-18  5:39 Alexis
  2024-04-18  6:01 ` Eli Zaretskii
  2024-04-18  6:05 ` Eli Zaretskii
  0 siblings, 2 replies; 11+ messages in thread
From: Alexis @ 2024-04-18  5:39 UTC (permalink / raw)
  To: emacs-devel

[Not currently subscribed to the list, so please cc me on replies.]

Hi all,

A user of my `Ebuku` package has reported an "args-out-of-range" error that i'm out of my depth trying to diagnose. Here's the GitHub issue:

https://github.com/flexibeast/ebuku/issues/32

i can't reproduce the issue on my own system:

* Gentoo + Emacs 29.3.
* LANG=en_AU.UTF-8
* The only set LC_* variables are:
  LC_MESSAGES=C
  LC_TIME=en_AU.UTF-8
* current-language-environment = "English"
* locale-coding-system = utf-8-unix

Their system:

* Windows 11, using Emacs 29.2 obtained via Scoop package manager; not using WSL
* LANG=zh_CN.UTF-8, LC_ALL=zh_CN.UTF-8
* current-language-environment: UTF-8
* locale-coding-system = cp936
* default-process-coding-system = '(utf-8-dos . utf-8-unix)

`Ebuku` uses `call-process` to call the Python-based `buku` bookmark database manager and present the resulting output in Emacs. buku stores data in an SQLite database.

https://github.com/jarun/buku/

The link:

https://google.github.io/comprehensive-rust/

in the buku database results in:

```
Debugger entered--Lisp error: (args-out-of-range "1884. Welcome to Comprehensive Rust 🦀 - Comprehens..." 15862 15893)
  match-string(1 "1884. Welcome to Comprehensive Rust 🦀 - Comprehensive Rust 🦀")
  ebuku--search-helper("--print" "[all]" "-1000" "")
  ebuku-show-all()
  ebuku()
  funcall-interactively(ebuku)1
  command-execute(ebuku record)
  execute-extended-command(nil "ebuku" "ebuku")
  funcall-interactively(execute-extended-command nil "ebuku" "ebuku")
  command-execute(execute-extended-command)
```

Once the Unicode CRAB emoji is removed, there's no issue.

The link:

https://coredumped.dev/2021/05/26/taking-org-roam-everywhere-with-logseq/

in the buku database results in:

```
Debugger entered--Lisp error: (args-out-of-range "2027. Taking org-roam everywhere with logseq • Core Dumped" 32318 32355)
  match-string(1 "2027. Taking org-roam everywhere with logseq • Cor...")
  (setq tags (match-string 1 line))
  (progn (string-match "^\\s-*[#] \\(.*\\)$" line) (setq tags (match-string 1 line)))
  [snip rest of traceback]
```

The user has confirmed that the buku database is UTF-8.

Does anyone have any suggestions about what might be happening? i presume my code is making some incorrect assumptions, or not doing some encoding stuff that it should be. i really want to get encoding and language support right, so even outside of this specific issue, general comments about things i need to fix in this regard would be most welcome. :-)

Alexis.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: "args-out-of-range" error when using data from external process on Windows
  2024-04-18  5:39 "args-out-of-range" error when using data from external process on Windows Alexis
@ 2024-04-18  6:01 ` Eli Zaretskii
  2024-04-18  7:07   ` Alexis
  2024-04-18  6:05 ` Eli Zaretskii
  1 sibling, 1 reply; 11+ messages in thread
From: Eli Zaretskii @ 2024-04-18  6:01 UTC (permalink / raw)
  To: Alexis; +Cc: emacs-devel

> From: Alexis <flexibeast@gmail.com>
> Date: Thu, 18 Apr 2024 15:39:10 +1000
>
> [Not currently subscribed to the list, so please cc me on replies.]
>
> Hi all,
>
> A user of my `Ebuku` package has reported an "args-out-of-range"
> error that i'm out of my depth trying to diagnose. Here's the GitHub
> issue:
>
> https://github.com/flexibeast/ebuku/issues/32
>
> i can't reproduce the issue on my own system:
>
> * Gentoo + Emacs 29.3.
> * LANG=en_AU.UTF-8
> * The only set LC_* variables are:
>   LC_MESSAGES=C
>   LC_TIME=en_AU.UTF-8
> * current-language-environment = "English"
> * locale-coding-system = utf-8-unix
>
> Their system:
>
> * Windows 11, using Emacs 29.2 obtained via Scoop package manager;
>   not using WSL
> * LANG=zh_CN.UTF-8, LC_ALL=zh_CN.UTF-8
> * current-language-environment: UTF-8
> * locale-coding-system = cp936
> * default-process-coding-system = '(utf-8-dos . utf-8-unix)
>
> `Ebuku` uses `call-process` to call the Python-based `buku` bookmark
> database manager and present the resulting output in Emacs. buku
> stores data in an SQLite database.
>
> https://github.com/jarun/buku/
>
> The link:
>
> https://google.github.io/comprehensive-rust/
>
> in the buku database results in:
>
> ```
> Debugger entered--Lisp error: (args-out-of-range "1884. Welcome to Comprehensive Rust 🦀 - Comprehens..." 15862 15893)
>   match-string(1 "1884. Welcome to Comprehensive Rust 🦀 - Comprehensive Rust 🦀")
>   ebuku--search-helper("--print" "[all]" "-1000" "")
>   ebuku-show-all()
>   ebuku()
>   funcall-interactively(ebuku)1
>   command-execute(ebuku record)
>   execute-extended-command(nil "ebuku" "ebuku")
>   funcall-interactively(execute-extended-command nil "ebuku" "ebuku")
>   command-execute(execute-extended-command)
> ```
>
> Once the Unicode CRAB emoji is removed, there's no issue.
>
> The link:
>
> https://coredumped.dev/2021/05/26/taking-org-roam-everywhere-with-logseq/
>
> in the buku database results in:
>
> ```
> Debugger entered--Lisp error: (args-out-of-range "2027. Taking org-roam everywhere with logseq • Core Dumped" 32318 32355)
>   match-string(1 "2027. Taking org-roam everywhere with logseq • Cor...")
>   (setq tags (match-string 1 line))
>   (progn (string-match "^\\s-*[#] \\(.*\\)$" line) (setq tags (match-string 1 line)))
>   [snip rest of traceback]
> ```
>
> The user has confirmed that the buku database is UTF-8.
>
> Does anyone have any suggestions about what might be happening?

Crystal ball says the package assumes UTF-8 encoding of the text from the sub-process, which is generally not what happens on Windows. Or maybe the package assumes that UTF-8 text from a sub-process will necessarily be decoded as UTF-8, which again can fail if the default coding-systems are not UTF-8 (which happens on Windows). The upshot is that the Lisp code expects some number of characters, but gets a different number of characters instead.

But this is all basically stabbing in the dark, since I have no idea what that package does and what the program whose output it reads does.

Suggest that you ask the user who reported that to show the actual output of the sub-process (e.g., by running the same command outside of Emacs and redirecting output to a file), and if the output looks correct, examine the Lisp code which processes that output, with an eye on how the text is decoded. For example, if the text from the sub-process is supposed to be UTF-8 encoded, your Lisp code should bind coding-system-for-read to 'utf-8', to make sure it is decoded correctly.

^ permalink raw reply	[flat|nested] 11+ messages in thread
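A minimal sketch of the binding suggested at the end of the message above, assuming the sub-process really does write UTF-8 to standard output. `my-call-process-utf-8` is a hypothetical helper name used only for illustration; it is not part of Ebuku:

```elisp
;; Hypothetical helper, for illustration only.
;; Assumption: PROGRAM emits UTF-8 on standard output.
(defun my-call-process-utf-8 (program &rest args)
  "Run PROGRAM with ARGS, inserting its output into the current buffer.
The output is decoded as UTF-8 regardless of the default coding systems."
  ;; `coding-system-for-read' overrides `default-process-coding-system'
  ;; for the duration of the call, so the locale defaults are bypassed.
  (let ((coding-system-for-read 'utf-8))
    (apply #'call-process program nil t nil args)))
```

It would typically be called inside `with-temp-buffer`, so that the decoded text can be parsed afterwards. Only decoding of the output is affected; how the command-line arguments are encoded is a separate question.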
* Re: "args-out-of-range" error when using data from external process on Windows
  2024-04-18  6:01 ` Eli Zaretskii
@ 2024-04-18  7:07   ` Alexis
  2024-04-18  8:35     ` Eli Zaretskii
  0 siblings, 1 reply; 11+ messages in thread
From: Alexis @ 2024-04-18  7:07 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> Crystal ball says the package assumes UTF-8 encoding of the text
> from the sub-process, which is generally not what happens on
> Windows. Or maybe the package assumes that UTF-8 text from a
> sub-process will necessarily be decoded as UTF-8, which again
> can fail if the default coding-systems are not UTF-8 (which
> happens on Windows). The upshot is that the Lisp code expects
> some number of characters, but gets a different number of
> characters instead.
>
> But this is all basically stabbing in the dark, since I have no
> idea what that package does and what the program whose output it
> reads does.

Hi Eli,

Thanks for your prompt reply.

Sorry for my email not being more descriptive and self-contained. i linked to the GitHub issue:

https://github.com/flexibeast/ebuku/issues/32

as there is already an extended discussion there about this issue, which itself links to a previous issue and discussion:

https://github.com/flexibeast/ebuku/issues/31

in which the user first reported an "Invalid string for collation" issue. That issue was addressed, after some discussion, by setting LC_ALL to the same value that the user had set LANG, i.e. "zh_CN.UTF-8". That left us with issue 32, which is the one i'm asking about here.

Some better background about the software involved:

`buku` provides a command-line interface to an SQLite-based database of Web bookmarks, allowing one to save, delete and search for bookmarks, with each bookmark able to have a comment and tags associated with it.

`Ebuku` is a package that provides an Emacs-based UI for buku. It allows the user to add bookmarks, edit them, remove them, search them etc. without actually leaving Emacs. It does so by running `call-process` to call `buku` with the appropriate options, receiving the resulting output in a buffer, then processing the data in that buffer in order to present the user with the relevant results.

ebuku.el has a function:

    (defun ebuku--call-buku (args)
      "Internal function for calling `buku' with list ARGS."
      (unless ebuku-buku-path
        (error "Couldn't find buku: check 'ebuku-buku-path'"))
      (apply #'call-process
             `(,ebuku-buku-path nil t nil
               "--np" "--nc"
               "--db" ,ebuku-database-path
               ,@args)))

which gets called in several places - e.g.

https://github.com/flexibeast/ebuku/blob/c854d128cba8576fe9693c19109b5deafb573e99/ebuku.el#L534

- to put the contents inside a temp buffer, which is then 'parsed' for the information to be presented to the user.

In a comment from a couple of days ago, and after having noted in a comment on issue 31:

https://github.com/flexibeast/ebuku/issues/31#issuecomment-2053557703

that they'd set LANG on their system to "zh_CN.UTF-8", the user wrote (https://github.com/flexibeast/ebuku/issues/32#issuecomment-2058289816):

> I set the value with (set-language-environment "UTF-8"). I
> remember I set up this value bacause I don't want my files
> containing Chinese to be encoded by GBK encoding.

Then, in https://github.com/flexibeast/ebuku/issues/32#issuecomment-2058498373, i wrote:

> if i remember correctly, the default encoding used by Windows is
> UTF-16, not UTF-8. So i'm wondering if that's somehow being used
> to transfer data from the buku process to the Emacs process,
> regardless of the value of LANG and LC_ALL, and regardless of
> the encoding of the buku database itself?

to which the user responded:

> I think the Powershell will use UTF-16 to encode instead of
> UTF-8.

Is that correct? Is that the case despite the user having specified "zh_CN.UTF-8"? But if that's the case, why does removing the CRAB emoji from text being operated on by string-match / match-string make the issue disappear? Is it perhaps something to do with the code point for the CRAB emoji being outside the BMP?

> Suggest that you ask the user who reported that to show the
> actual output of the sub-process (e.g., by running the same
> command outside of Emacs and redirecting output to a file), and
> if the output looks correct, examine the Lisp code which
> processes that output, with an eye on how the text is decoded.
> For example, if the text from the sub-process is supposed to be
> UTF-8 encoded, your Lisp code should bind coding-system-for-read
> to 'utf-8', to make sure it is decoded correctly.

Thanks, i can certainly do that, modulo the issue of whether the LANG and LC_ALL variables have any effect on data transferred between the `buku` sub-process and Emacs.

But what should i do to handle the more general case of an arbitrary encoding? Do i need to have a defcustom, with 'reasonable defaults', that the user can set if necessary, which i use as the value to pass to coding-system-for-read?

> Btw: using UTF-8 by default on MS-Windows is not a very good
> idea, even with Windows 11 where one can enable UTF-8 support
> (did they do it, btw?). Windows still doesn't support UTF-8
> well, even after the improvements in Windows 11, so the above
> settings might very well cause trouble. Suggest to ask the user
> to try the same recipe in "emacs -Q", and if the zh_CN.UTF-8
> stuff is set up outside Emacs, to try without it.

As i interpret their comments in the above discussions so far, yes, they had themselves set LANG to "zh_CN.UTF-8" (and yes, as described above, had definitely `set-language-environment` as "UTF-8").

i'll certainly take your suggestions back to the user.

Thanks again,

Alexis.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: "args-out-of-range" error when using data from external process on Windows
  2024-04-18  7:07   ` Alexis
@ 2024-04-18  8:35     ` Eli Zaretskii
  2024-04-18 11:20       ` Alexis
  2024-04-19  3:16       ` Alexis
  0 siblings, 2 replies; 11+ messages in thread
From: Eli Zaretskii @ 2024-04-18  8:35 UTC (permalink / raw)
  To: Alexis; +Cc: emacs-devel

> From: Alexis <flexibeast@gmail.com>
> Cc: emacs-devel@gnu.org
> Date: Thu, 18 Apr 2024 17:07:25 +1000
>
> https://github.com/flexibeast/ebuku/issues/32
>
> as there is already an extended discussion there about this issue,
> which itself links to a previous issue and discussion:
>
> https://github.com/flexibeast/ebuku/issues/31
>
> in which the user first reported an "Invalid string for collation"
> issue. That issue was addressed, after some discussion, by setting
> LC_ALL to the same value that the user had set LANG,
> i.e. "zh_CN.UTF-8". That left us with issue 32, which is the one
> i'm asking about here.

I don't think I understand the setting of LC_ALL part. First, AFAIK Windows programs generally ignore LC_* environment variables. If you read the Microsoft documentation of 'setlocale', here:

https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale?view=msvc-170

you will not see any reference to environment variables there. The Windows 'setlocale' supports only LC_* _categories_ in direct calls to the function, and doesn't consider the corresponding environment variables. The Emacs source code doesn't reference LC_* environment variables on MS-Windows, either. So how did the user set LC_ALL, and why did it have any effect whatsoever on the issue?

Second, the user sets a UTF-8 locale, which as I wrote up-thread is not a good idea on MS-Windows. It could well cause failures in invoking external programs from Emacs, if the arguments to those programs include non-ASCII characters. In general, on MS-Windows Emacs can only safely invoke programs with non-ASCII characters in the command-line arguments if those characters can be encoded by the system codepage, in this case codepage-936 AFAIU.

Regarding the "invalid string for collation: Invalid argument" error: how does ebuku determine the LOCALE argument with which it calls string-collate-lessp? It is important to understand what was the locale with which w32_compare_strings was called in that case.

Finally, the issues with Windows-style file names with drive letters and with file names that begin with "~" lead me to believe that perhaps the underlying program 'buku' is not a native Windows program, but a Cygwin or MSYS program, in which case there could be incompatibilities both regarding file names and regarding handling of non-ASCII characters (Cygwin and MSYS use UTF-8 by default, whereas the native Windows build of Emacs does not).

> `buku` provides a command-line interface to an SQLite-based
> database of Web bookmarks, allowing one to save, delete and search
> for bookmarks, with each bookmark able to have a comment and tags
> associated with it.
>
> `Ebuku` is a package that provides an Emacs-based UI for buku. It
> allows the user to add bookmarks, edit them, remove them, search
> them etc. without actually leaving Emacs. It does so by running
> `call-process` to call `buku` with the appropriate options,
> receiving the resulting output in a buffer, then processing the
> data in that buffer in order to present the user with the relevant
> results.
>
> ebuku.el has a function:
>
>     (defun ebuku--call-buku (args)
>       "Internal function for calling `buku' with list ARGS."
>       (unless ebuku-buku-path
>         (error "Couldn't find buku: check 'ebuku-buku-path'"))
>       (apply #'call-process
>              `(,ebuku-buku-path nil t nil
>                "--np" "--nc"
>                "--db" ,ebuku-database-path
>                ,@args)))
>
> which gets called in several places - e.g.
>
> https://github.com/flexibeast/ebuku/blob/c854d128cba8576fe9693c19109b5deafb573e99/ebuku.el#L534
>
> - to put the contents inside a temp buffer, which is then 'parsed'
> for the information to be presented to the user.

You need to take a good look at whether non-ASCII characters are passed to 'buku' in this case, and how the output from 'buku' is decoded.

Also, ebuku-buku-path and ebuku-database-path should both be quoted with shell-quote-argument (but I don't think this is a problem in this case). Can ARGS include whitespace or characters special for the Windows shell? if so, each argument should be quoted with shell-quote-argument as well.

How output is decoded when it is put into the temporary buffer is also of interest -- what is the value of buffer-file-coding-system in the temporary buffer after reading output, in the OP's case?

> In a comment from a couple of days ago, and after having noted in
> a comment on issue 31:
>
> https://github.com/flexibeast/ebuku/issues/31#issuecomment-2053557703
>
> that they'd set LANG on their system to "zh_CN.UTF-8", the user
> wrote
> (https://github.com/flexibeast/ebuku/issues/32#issuecomment-2058289816):
>
> > I set the value with (set-language-environment "UTF-8"). I
> > remember I set up this value bacause I don't want my files
> > containing Chinese to be encoded by GBK encoding.

This is not a good idea, as I mentioned before. Emacs on MS-Windows cannot use UTF-8 when encoding command-line arguments for sub-programs, it can only use the system codepage. Using set-language-environment as above will force Emacs to encode command-line arguments in UTF-8, which could very well be the reason for some of these problems.

> Then, in
> https://github.com/flexibeast/ebuku/issues/32#issuecomment-2058498373,
> i wrote:
>
> > if i remember correctly, the default encoding used by Windows is
> > UTF-16, not UTF-8. So i'm wondering if that's somehow being used
> > to transfer data from the buku process to the Emacs process,
> > regardless of the value of LANG and LC_ALL, and regardless of
> > the encoding of the buku database itself?
>
> to which the user responded:
>
> > I think the Powershell will use UTF-16 to encode instead of
> > UTF-8.
>
> Is that correct?

No.

The issue is complicated by several factors and will take a long post to explain. The upshot is that for passing non-ASCII characters safely to subprograms on their command lines, Emacs should use the system codepage, not UTF-8 or anything else (and definitely not UTF-16). This might require some tricky juggling with coding-system related settings when you call call-process, because coding-system-for-write is used for both encoding of the command-line arguments and of the stuff we send to the sub-program, so if they both can include non-ASCII characters, some care is in order. (By contrast, coding-system-for-read can be always bound to UTF-8 to decode the output correctly -- assuming 'buku' outputs UTF-8 encoded text on MS-Windows.)

> Is that the case despite the user having
> specified "zh_CN.UTF-8"? But if that's the case, why does removing
> the CRAB emoji from text being operated on by string-match /
> match-string make the issue disappear? Is it perhaps something to
> do with
> the code point for the CRAB emoji being outside the BMP?

The more important question is: can CRAB emoji be safely encoded by codepage 936, the system codepage of the OP? If not, and if that emoji can appear in the command-line arguments of a 'buku' invocation (as opposed to in the text we write to or read from 'buku'), then this character cannot be used at all with this package on MS-Windows.

(And please note that Emacs now has a native SQLite support, which should make many of these complications simply disappear.)

As for why the problems disappear when the CRAB emoji is removed: as I wrote elsewhere, that's probably because all the other characters are plain ASCII, so all the encoding-related issues don't matter.

> > Suggest that you ask the user who reported that to show the
> > actual output of the sub-process (e.g., by running the same
> > command outside of Emacs and redirecting output to a file), and
> > if the output looks correct, examine the Lisp code which
> > processes that output, with an eye on how the text is decoded.
> > For example, if the text from the sub-process is supposed to be
> > UTF-8 encoded, your Lisp code should bind coding-system-for-read
> > to 'utf-8', to make sure it is decoded correctly.
>
> Thanks, i can certainly do that, modulo the issue of whether the
> LANG and LC_ALL variables have any effect data transferred between
> the `buku` sub-process and Emacs.

They don't have any effect on Emacs on MS-Windows, that's for sure. Whether they have effect on 'buku' depends on whether it's a native MS-Windows program or Cygwin/MSYS program, and also on its code (a program could potentially augment the MS 'setlocale' function with its own code which looks at the LC_* environment variables, and does TRT in the application code).

> But what should i do to handle the more general case of an arbitrary
> encoding? Do i need to have a defcustom, with 'reasonable defaults',
> that the user can set if necessary, which i use as the value to pass
> to coding-system-for-read?

That depends on what encoding does 'buku' expect on input and what encoding does it use on output. If it always uses UTF-8, you just need to make sure Emacs uses UTF-8 when encoding and decoding text passed to and from 'buku' (but note the caveat about encoding the command-line arguments -- these _must_ be encoded in the system codepage). If, OTOH, the encoding used by 'buku' can be changed dynamically, and Emacs cannot know what it is (for example, if it is determined by the encoding of the text put in the SQL database by the user), then a user option is in order.

> > Btw: using UTF-8 by default on MS-Windows is not a very good
> > idea, even with Windows 11 where one can enable UTF-8 support
> > (did they do it, btw?). Windows still doesn't support UTF-8
> > well, even after the improvements in Windows 11, so the above
> > settings might very well cause trouble. Suggest to ask the user
> > to try the same recipe in "emacs -Q", and if the zh_CN.UTF-8
> > stuff is set up outside Emacs, to try without it.
>
> As i interpret their comments in the above discussions so far,
> yes, they had themselves set LANG to "zh_CN.UTF-8" (and yes, as
> described above, had definitely `set-language-environment` as
> "UTF-8".

NOT RECOMMENDED!

^ permalink raw reply	[flat|nested] 11+ messages in thread
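For the case discussed above where the output encoding cannot be known in advance, the user-option route could look roughly like the following. This is a sketch under stated assumptions: `my-buku-output-coding-system` and `my-buku-call` are hypothetical names, the default of `utf-8` is only a guess at a reasonable value, and none of this is taken from Ebuku itself:

```elisp
;; Hypothetical user option, for illustration only; not part of Ebuku.
(defcustom my-buku-output-coding-system 'utf-8
  "Coding system used to decode text read from the buku process.
Change this only if buku is known to emit something other than UTF-8."
  :type 'coding-system
  :group 'processes)

;; Hypothetical wrapper that honours the option above.
(defun my-buku-call (program &rest args)
  "Run PROGRAM with ARGS, decoding its output per the user option."
  (let ((coding-system-for-read my-buku-output-coding-system))
    (apply #'call-process program nil t nil args)))
```

The option covers decoding of output only. It deliberately leaves `coding-system-for-write` alone, since the message above notes that command-line arguments on MS-Windows must be encoded in the system codepage.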
* Re: "args-out-of-range" error when using data from external process on Windows
  2024-04-18  8:35     ` Eli Zaretskii
@ 2024-04-18 11:20       ` Alexis
  2024-04-19  3:16       ` Alexis
  1 sibling, 0 replies; 11+ messages in thread
From: Alexis @ 2024-04-18 11:20 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Thanks again for your assistance!

As some additional context: i haven't actively used a Windows system in more than a decade - it was Windows 7 - and even then, i was running it in a VM in order to run some other software. i've also never used Windows outside of an "Australian English" context, and have never done any dev work on the Windows platform. So i've got only a minimal idea of how Windows does various things nowadays, and have never needed to become familiar with sysadmin-/dev-level Windows documentation. Until now. :-)

Specific responses inline below.

> I don't think I understand the setting of LC_ALL part. First,
> AFAIK Windows programs generally ignore LC_* environment
> variables. If you read the Microsoft documentation of
> 'setlocale', here:
>
> https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale?view=msvc-170
>
> you will not see any reference to environment variables there.

Thanks for this link; it gives me a good starting point to explore the Win docs on this issue.

> The Windows 'setlocale' supports only LC_* _categories_ in
> direct calls to the function, and doesn't consider the
> corresponding environment variables. The Emacs source code
> doesn't reference LC_* environment variables on MS-Windows,
> either. So how did the user set LC_ALL, and why did it have any
> effect whatsoever on the issue?

They didn't say; all they wrote (https://github.com/flexibeast/ebuku/issues/31#issuecomment-2058171986) was:

> I ... changed my LC_ALL to zh_CN.UTF-8. Ebuku can find the db
> now.

i'll ask them.

> Second, the user sets a UTF-8 locale, which as I wrote up-thread
> is not a good idea on MS-Windows. It could well cause failures
> in invoking external programs from Emacs, if the arguments to
> those programs include non-ASCII characters. In general, on
> MS-Windows Emacs can only safely invoke programs with non-ASCII
> characters in the command-line arguments if those characters can
> be encoded by the system codepage, in this case codepage-936
> AFAIU.

Thanks, i'll add that to the information i pass back to the user on that GitHub issue.

> Regarding the "invalid string for collation: Invalid argument"
> error: how does ebuku determine the LOCALE argument with which
> it calls string-collate-lessp? It is important to understand
> what was the locale with which w32_compare_strings was called in
> that case.

The single use of `string-collate-lessp` doesn't pass any LOCALE argument, as i just wanted it to use the user's current locale for sorting a given bookmark's tags into the appropriate lexicographical order.

> Finally, the issues with Windows-style file names with drive
> letters and with file names that begin with "~" lead me to
> believe that perhaps the underlying program 'buku' is not a
> native Windows program, but a Cygwin or MSYS program, in which
> case there could be incompatibilities both regarding file names
> and regarding handling of non-ASCII characters (Cygwin and MSYS
> use UTF-8 by default, whereas the native Windows build of Emacs
> does not).

Sorry; i mentioned in my first email, but didn't reiterate in my second, that `buku` is Python-based.

> You need to take a good look at whether non-ASCII characters are
> passed to 'buku' in this case, and how the output from 'buku' is
> decoded.

👍

> Also, ebuku-buku-path and ebuku-database-path should both be
> quoted with shell-quote-argument (but I don't think this is a
> problem in this case). Can ARGS include whitespace or characters
> special for the Windows shell? if so, each argument should be
> quoted with shell-quote-argument as well.

Thanks, noted.

> How output is decoded when it is put into the temporary buffer
> is also of interest -- what is the value of
> buffer-file-coding-system in the temporary buffer after reading
> output, in the OP's case?

*nod*

> Emacs on MS-Windows
> cannot use UTF-8 when encoding command-line arguments for
> sub-programs, it can only use the system codepage. Using
> set-language-environment as above will force Emacs to encode
> command-line arguments in UTF-8, which could very well be the
> reason for some of these problems.

Ah okay.

> No.
>
> The issue is complicated by several factors and will take a long
> post to explain. The upshot is that for passing non-ASCII
> characters safely to subprograms on their command lines, Emacs
> should use the system codepage, not UTF-8 or anything else (and
> definitely not UTF-16). This might require some tricky juggling
> with coding-system related settings when you call call-process,
> because coding-system-for-write is used for both encoding of the
> command-line arguments and of the stuff we send to the
> sub-program, so if they both can include non-ASCII characters,
> some care is in order. (By contrast, coding-system-for-read can
> be always bound to UTF-8 to decode the output correctly --
> assuming 'buku' outputs UTF-8 encoded text on MS-Windows.)

That's very helpful, thank you.

> The more important question is: can CRAB emoji be safely encoded
> by codepage 936, the system codepage of the OP? If not, and if
> that emoji can appear in the command-line arguments of a 'buku'
> invocation (as opposed to in the text we write to or read from
> 'buku'), then this character cannot be used at all with this
> package on MS-Windows.
>
> (And please note that Emacs now has a native SQLite support,
> which should make many of these complications simply disappear.)

It would certainly make many things easier to just interact with the db directly. That said, doing so would involve a substantial rewrite, and i've got many things on my plate nowadays, including supporting disabled loved ones while having chronic health issues myself. But maybe i can open an issue requesting help to start and develop a branch doing such a rewrite.

> As for why the problems disappear when the CRAB emoji is
> removed: as I wrote elsewhere, that's probably because all the
> other characters are plain ASCII, so all the encoding-related
> issues don't matter.

*nod*

> They don't have any effect on Emacs on MS-Windows, that's for
> sure. Whether they have effect on 'buku' depends on whether
> it's a native MS-Windows program or Cygwin/MSYS program, and
> also on its code (a program could potentially augment the MS
> 'setlocale' function with its own code which looks at the LC_*
> environment variables, and does TRT in the application code).

*nod*

>> But what should i do to handle the more general case of an
>> arbitrary encoding? Do i need to have a defcustom, with
>> 'reasonable defaults', that the user can set if necessary,
>> which i use as the value to pass to coding-system-for-read?
>
> That depends on what encoding does 'buku' expect on input and
> what encoding does it use on output. If it always uses UTF-8,
> you just need to make sure Emacs uses UTF-8 when encoding and
> decoding text passed to and from 'buku' (but note the caveat
> about encoding the command-line arguments -- these _must_ be
> encoded in the system codepage). If, OTOH, the encoding used by
> 'buku' can be changed dynamically, and Emacs cannot know what it
> is (for example, if it is determined by the encoding of the text
> put in the SQL database by the user), then a user option is in
> order.

Great, thank you.

>> As i interpret their comments in the above discussions so far,
>> yes, they had themselves set LANG to "zh_CN.UTF-8" (and yes, as
>> described above, had definitely `set-language-environment` as
>> "UTF-8".
>
> NOT RECOMMENDED!

*chuckle* i'll be sure to pass this on. :-)

Thanks again!

Alexis.

^ permalink raw reply	[flat|nested] 11+ messages in thread
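The question quoted above, whether a given character can be safely encoded by the system codepage, can be checked from Lisp. The sketch below uses the standard `encode-coding-char`, which returns nil when a coding system cannot safely encode a character. `my-unencodable-chars` is a hypothetical helper written for illustration, and `cp936` is used only because it is the codepage named in this thread:

```elisp
;; Hypothetical helper, for illustration only.
(defun my-unencodable-chars (string coding-system)
  "Return the distinct characters of STRING that CODING-SYSTEM cannot encode."
  (let (bad)
    (dolist (ch (string-to-list string))
      ;; `encode-coding-char' yields nil for characters outside the
      ;; repertoire of CODING-SYSTEM; collect each such character once.
      (unless (or (memq ch bad)
                  (encode-coding-char ch coding-system))
        (push ch bad)))
    (nreverse bad)))

;; Example call with a title taken from the thread:
(my-unencodable-chars "Welcome to Comprehensive Rust 🦀" 'cp936)
```

An empty result means every character fits the codepage. A non-empty one lists the characters that would be at risk if they ever appeared in a command-line argument on MS-Windows.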
* Re: "args-out-of-range" error when using data from external process on Windows
From: Alexis @ 2024-04-19 3:16 UTC
To: Eli Zaretskii; +Cc: Zhang Songyu, emacs-devel

Hi again Eli,

i've taken your comments back to the user:

https://github.com/flexibeast/ebuku/issues/32#issuecomment-2063682096

And they've responded with two comments:

https://github.com/flexibeast/ebuku/issues/32#issuecomment-2064663151

https://github.com/flexibeast/ebuku/issues/32#issuecomment-2064663151

in which they said:

> I set the LC_ALL and LANG variable by editing Windows's
> environment variable.
> Because Emacs don't read these, I have removed the variables.

and noted that they were successfully able to add and search for the
CRAB emoji by using buku directly on the command line:

> The PowerShell display the CRAB emoji fine. ( I use Windows
> Terminal )

but also that:

> I copied the output into scratch buffer, it's not displaying.

More generally, they've noted:

> But right now my language environment will be "Chinese-GBK". My
> files will be encoded as GBK, which is not I desired. I think
> for some backward compatibility concern, MS use GBK encoding for
> Chinese. Can I somehow set the Emacs to use UTF-8 for new file
> encodings?

And in their second comment, wrote:

> Hi, I found the encoding setting from Emacs China website. I
> added these lines in the early-init.el file
>
> (set-charset-priority 'unicode)
> (prefer-coding-system 'utf-8)
> (setq system-time-locale "C")
>
> Right now, the new file would be saved with UTF-8 But the args
> out of range problems still persists.

i have subsequently responded:

https://github.com/flexibeast/ebuku/issues/32#issuecomment-2065617662

in which i wrote:

> setting those variables won't influence the encoding of the data
> that Ebuku has to process.
>
> This is a very complex issue, so we need to control the various
> factors involved.
> ...
>
> i understand that you don't want to use the GBK environment in
> general

and explained starting Emacs with `-Q`, and manually loading Ebuku,
to test Ebuku.

However, given that:

* i don't have access to a Win machine;
* i've not actively used Win for more than a decade;
* i don't have any experience with Win at a dev level;
* i've never used Win in a non-English environment, or Emacs in a
  non-UTF8 environment;

i'm feeling overwhelmed by the various factors here, and am struggling
to work out the right questions to ask the user, and how to
appropriately work with the answers. So i also wrote:

> i think i'm going to have to ask you to interact directly with
> Eli on the mailing list about this, as i'm finding it
> difficult to be the messenger going back and forth, and it will
> be much quicker if Eli can ask you questions directly, which you
> can respond to directly. Hopefully that process will make it
> clear what would need to be done by Ebuku in order to fix the
> problem, in a non-GBK environment.

Thus, i've cc'd the user on this email, so that, if you're willing and
able, you can engage with them directly (perhaps by asking them to do
specific tests that don't involve Ebuku, but which will show what Ebuku
needs to handle).


Alexis.

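The distinction drawn in the message above, between editor-wide preferences for new files and the encoding applied to data arriving from an external process, can be illustrated outside Emacs. The following is a minimal sketch in Python (chosen only because buku itself is Python-based), not Ebuku's code: the child process is a hypothetical stand-in for `buku --print`, and `read_child_output` is a name invented for this illustration. It assumes the decoding mismatch suspected in this thread; whether that is the actual cause was still being investigated.

```python
import subprocess
import sys

# Hypothetical stand-in for `buku --print`: a child process that writes
# UTF-8 encoded bytes (including the CRAB emoji, U+1F980) to its stdout.
CHILD = r"import sys; sys.stdout.buffer.write('1884. Comprehensive Rust \U0001F980\n'.encode('utf-8'))"


def read_child_output(encoding: str) -> str:
    """Run the child and decode its raw stdout with an explicitly chosen encoding."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,  # collect stdout as raw bytes, undecoded
        check=True,
    )
    # Decoding happens here, at the process boundary. Settings that only
    # affect how new files are saved play no part in this step.
    return result.stdout.decode(encoding, errors="replace")


if __name__ == "__main__":
    # !a prints an ASCII-safe representation, so this works on any console.
    print(f"decoded as utf-8: {read_child_output('utf-8')!a}")
    print(f"decoded as cp936: {read_child_output('cp936')!a}")
```

The same bytes give different text depending only on the encoding named at the point of reading. In Emacs the corresponding choice is the coding system in effect when `call-process` reads the output (for example via `coding-system-for-read`); whether binding it is the right fix for Ebuku is not established in this thread.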
* Re: "args-out-of-range" error when using data from external process on Windows
From: Eli Zaretskii @ 2024-04-19 7:29 UTC
To: Alexis; +Cc: zsy9822, emacs-devel

> From: Alexis <flexibeast@gmail.com>
> Cc: Zhang Songyu <zsy9822@hotmail.com>, emacs-devel <emacs-devel@gnu.org>
> Date: Fri, 19 Apr 2024 13:16:58 +1000
>
> Hi again Eli,
>
> i've taken your comments back to the user:
>
> https://github.com/flexibeast/ebuku/issues/32#issuecomment-2063682096
>
> And they've responded with two comments:
>
> https://github.com/flexibeast/ebuku/issues/32#issuecomment-2064663151
>
> https://github.com/flexibeast/ebuku/issues/32#issuecomment-2064663151

I responded there. There's no need to do this via the mailing list,
let's continue the discussion directly in the issue.

* Re: "args-out-of-range" error when using data from external process on Windows
From: Alexis @ 2024-04-21 0:57 UTC
To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> I responded there. There's no need to do this via the mailing
> list, let's continue the discussion directly in the issue.

Thanks Eli, much appreciated.


Alexis.

* Re: "args-out-of-range" error when using data from external process on Windows
From: Eli Zaretskii @ 2024-04-18 6:05 UTC
To: Alexis; +Cc: emacs-devel

> From: Alexis <flexibeast@gmail.com>
> Date: Thu, 18 Apr 2024 15:39:10 +1000
>
> * Their system: Windows 11, using Emacs 29.2
> * obtained via Scoop package manager; not using WSL
> * LANG=zh_CN.UTF-8, LC_ALL=zh_CN.UTF-8
> * current-language-environment: UTF-8 locale-coding-system = cp936
> * default-process-coding-system = '(utf-8-dos . utf-8-unix)

Btw: using UTF-8 by default on MS-Windows is not a very good idea,
even with Windows 11 where one can enable UTF-8 support (did they do
it, btw?). Windows still doesn't support UTF-8 well, even after the
improvements in Windows 11, so the above settings might very well cause
trouble.

I suggest asking the user to try the same recipe in "emacs -Q" and, if
the zh_CN.UTF-8 stuff is set up outside Emacs, to try without it.

* "args-out-of-range" error when using data from external process on Windows
From: Alexis @ 2024-04-18 6:11 UTC
To: emacs-devel

[Second attempt, after the formatting somehow got messed up by the
sending process on the first attempt, sorry ....]

[Not currently subscribed to the list, so please cc me on replies.]

Hi all,

A user of my `Ebuku` package has reported an "args-out-of-range" error
that i'm out of my depth trying to diagnose. Here's the GitHub issue:

https://github.com/flexibeast/ebuku/issues/32

i can't reproduce the issue on my own system:

* Gentoo + Emacs 29.3
* LANG=en_AU.UTF-8
* The only set LC_* variables are:
    LC_MESSAGES=C
    LC_TIME=en_AU.UTF-8
* current-language-environment: "English"
* locale-coding-system: utf-8-unix

Their system:

* Windows 11, using Emacs 29.2, obtained via Scoop package manager;
  not using WSL
* LANG=zh_CN.UTF-8, LC_ALL=zh_CN.UTF-8
* current-language-environment: UTF-8
* locale-coding-system: cp936
* default-process-coding-system: '(utf-8-dos . utf-8-unix)

`Ebuku` uses `call-process` to call the Python-based `buku` bookmark
database manager and present the resulting output in Emacs. buku stores
data in an SQLite database.

https://github.com/jarun/buku/

The link:

https://google.github.io/comprehensive-rust/

in the buku database results in:

```
Debugger entered--Lisp error: (args-out-of-range "1884. Welcome to Comprehensive Rust 🦀 - Comprehens..." 15862 15893)
  match-string(1 "1884. Welcome to Comprehensive Rust 🦀 - Comprehensive Rust 🦀")
  ebuku--search-helper("--print" "[all]" "-1000" "")
  ebuku-show-all()
  ebuku()
  funcall-interactively(ebuku)
  command-execute(ebuku record)
  execute-extended-command(nil "ebuku" "ebuku")
  funcall-interactively(execute-extended-command nil "ebuku" "ebuku")
  command-execute(execute-extended-command)
```

Once the Unicode CRAB emoji is removed, there's no issue.

The link:

https://coredumped.dev/2021/05/26/taking-org-roam-everywhere-with-logseq/

in the buku database results in:

```
Debugger entered--Lisp error: (args-out-of-range "2027. Taking org-roam everywhere with logseq • Core Dumped" 32318 32355)
  match-string(1 "2027. Taking org-roam everywhere with logseq • Cor...")
  (setq tags (match-string 1 line))
  (progn (string-match "^\\s-*[#] \\(.*\\)$" line) (setq tags (match-string 1 line)))
  [snip rest of traceback]
```

The user has confirmed that the buku database is UTF-8.

Does anyone have any suggestions about what might be happening? i
presume my code is making some incorrect assumptions, or not doing some
encoding stuff that it should be. i really want to get encoding and
language support right, so even outside of this specific issue, general
comments about things i need to fix in this regard would be most
welcome. :-)


Alexis.

* Re: "args-out-of-range" error when using data from external process on Windows
From: Eli Zaretskii @ 2024-04-18 7:08 UTC
To: Alexis; +Cc: emacs-devel

> From: Alexis <flexibeast@gmail.com>
> Date: Thu, 18 Apr 2024 16:11:02 +1000
>
> Once the Unicode CRAB emoji is removed, there's no issue.

Most probably because the rest of the text is plain ASCII, so decoding
doesn't come into play.

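The point about plain ASCII can be checked at the byte level, independently of Emacs. The sketch below is in Python and is only an illustration under an assumption: that the UTF-8 output is being decoded as cp936, the `locale-coding-system` reported for the user's machine. The thread does not confirm that this is what happens inside Ebuku. The titles are taken from the two tracebacks; `survives_wrong_decoding` is a name made up for this example.

```python
# Titles taken from the two tracebacks reported in this thread.
ASCII_TITLE = "1884. Welcome to Comprehensive Rust"
CRAB_TITLE = "1884. Welcome to Comprehensive Rust 🦀"
BULLET_TITLE = "2027. Taking org-roam everywhere with logseq • Core Dumped"


def survives_wrong_decoding(text: str, wrong: str = "cp936") -> bool:
    """Return True if the UTF-8 bytes of `text` decode back to `text` under `wrong`."""
    raw = text.encode("utf-8")
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
    decoded = raw.decode(wrong, errors="replace")
    return decoded == text


if __name__ == "__main__":
    for title in (ASCII_TITLE, CRAB_TITLE, BULLET_TITLE):
        # !a prints an ASCII-safe representation, so this works on any console.
        print(f"{title!a}: survives cp936 decoding = {survives_wrong_decoding(title)}")
```

ASCII bytes mean the same thing in UTF-8 and cp936, so the first title comes through unchanged. The multi-byte UTF-8 sequences for the CRAB emoji and the bullet do not, which is consistent with the error disappearing once the emoji is removed.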