* Grep Japanese characters
@ 2018-07-11 23:02 Tak Kunihiro
  2018-07-12 2:40 ` Eli Zaretskii
  0 siblings, 1 reply; 15+ messages in thread
From: Tak Kunihiro @ 2018-07-11 23:02 UTC
  To: help-gnu-emacs; +Cc: tkk

I want to grep Japanese string.  I can do it on Emacs for Mac but
cannot do it on Emacs for MS Windows.

I found that I can grep Japanese string using
c:/msys64/usr/bin/grep.exe on command prompt (outside of Emacs).
However, I cannot do it using c:/msys64/usr/bin/grep.exe on command
prompt by M-x shell (inside of Emacs).  I confirm that LC_ALL is set
to en_US.UTF-8 on both environments.

Can you give me a hint to grep Japanese string on Emacs for MS
Windows?

CMD> set
...
LC_ALL=en_US.UTF-8
...

CMD> c:/msys64/usr/bin/cat hello.txt
Hello is こんにちは in Japanese.
Hello is bonjour in French.

CMD> c:/msys64/usr/bin/grep.exe Hello hello.txt
Hello is こんにちは in Japanese.
Hello is bonjour in French.

CMD> c:/msys64/usr/bin/grep.exe "こんにちは" hello.txt
Hello is こんにちは in Japanese.

CMD> c:/emacs-26.1/bin/runemacs -Q

M-x shell

CMD> set
...
LC_ALL=en_US.UTF-8
...

CMD> c:/msys64/usr/bin/grep.exe Hello hello.txt
c:/msys64/usr/bin/grep.exe Hello hello.txt
Hello is こんにちは in Japanese.
Hello is bonjour in French.

CMD> c:/msys64/usr/bin/grep.exe "こんにちは" hello.txt
c:/msys64/usr/bin/grep.exe "こんにちは" hello.txt
* Re: Grep Japanese characters
  2018-07-11 23:02 Grep Japanese characters Tak Kunihiro
@ 2018-07-12 2:40 ` Eli Zaretskii
  2018-07-12 3:05 ` YUE Daian
  2018-07-12 5:05 ` Yuri Khan
  0 siblings, 2 replies; 15+ messages in thread
From: Eli Zaretskii @ 2018-07-12 2:40 UTC
  To: help-gnu-emacs

> Date: Thu, 12 Jul 2018 08:02:55 +0900 (JST)
> From: Tak Kunihiro <tkk@misasa.okayama-u.ac.jp>
> Cc: tkk@misasa.okayama-u.ac.jp
>
> I want to grep Japanese string.  I can do it on Emacs for Mac but
> cannot do it on Emacs for MS Windows.
>
> I found that I can grep Japanese string using
> c:/msys64/usr/bin/grep.exe on command prompt (outside of Emacs).
> However, I cannot do it using c:/msys64/usr/bin/grep.exe on command
> prompt by M-x shell (inside of Emacs).  I confirm that LC_ALL is set
> to en_US.UTF-8 on both environments.
>
> Can you give me a hint to grep Japanese string on Emacs for MS
> Windows?

You cannot pass UTF-8 encoded parameters to sub-programs on
MS-Windows.  You can only use the encoding of your system codepage.
Sorry, it's an MS-Windows limitation.
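The codepage limitation described above can be illustrated with a small Python sketch. It is an illustration only, not how Emacs or Windows actually pass arguments: the codepages chosen (cp1252 as a Western European system codepage, cp932 as the Japanese one) and the helper name `encodable_in` are assumptions made for the example. It shows that the search string has a byte representation in cp932 and in UTF-8, but none in cp1252.

```python
# Illustration only: can a string be represented in a given codepage?
# The codepage names and the helper name are assumptions for this sketch.

def encodable_in(text: str, codepage: str) -> bool:
    """Return True if every character of text exists in codepage."""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False

pattern = "こんにちは"

for codepage in ("cp1252", "cp932", "utf-8"):
    # Only ASCII is printed, so the demo does not depend on the terminal.
    print(codepage, encodable_in(pattern, codepage))
```

On a system whose codepage cannot represent the pattern, there is simply no byte sequence to hand to the sub-program, which matches the failure reported in the original post.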
* Re: Grep Japanese characters
  2018-07-12 2:40 ` Eli Zaretskii
@ 2018-07-12 3:05 ` YUE Daian
  2018-07-12 13:10 ` Tak Kunihiro
  2018-07-12 13:23 ` Eli Zaretskii
  2018-07-12 5:05 ` Yuri Khan
  1 sibling, 2 replies; 15+ messages in thread
From: YUE Daian @ 2018-07-12 3:05 UTC
  To: help-gnu-emacs

On 2018-07-12 05:40, Eli Zaretskii <eliz@gnu.org> wrote:
>> Date: Thu, 12 Jul 2018 08:02:55 +0900 (JST)
>> From: Tak Kunihiro <tkk@misasa.okayama-u.ac.jp>
>> Cc: tkk@misasa.okayama-u.ac.jp
>>
>> I want to grep Japanese string.  I can do it on Emacs for Mac but
>> cannot do it on Emacs for MS Windows.
>>
>> I found that I can grep Japanese string using
>> c:/msys64/usr/bin/grep.exe on command prompt (outside of Emacs).
>> However, I cannot do it using c:/msys64/usr/bin/grep.exe on command
>> prompt by M-x shell (inside of Emacs).  I confirm that LC_ALL is set
>> to en_US.UTF-8 on both environments.
>>
>> Can you give me a hint to grep Japanese string on Emacs for MS
>> Windows?
>
> You cannot pass UTF-8 encoded parameters to sub-programs on
> MS-Windows.  You can only use the encoding of your system codepage.
> Sorry, it's an MS-Windows limitation.

I remember Windows 10 has a beta option to use UTF-8 for the whole
system instead of your local encoding.

I do not know if this can help...
* Re: Grep Japanese characters
  2018-07-12 3:05 ` YUE Daian
@ 2018-07-12 13:10 ` Tak Kunihiro
  2018-07-12 13:37 ` Eli Zaretskii
  2018-07-13 2:58 ` YUE Daian
  2018-07-12 13:23 ` Eli Zaretskii
  1 sibling, 2 replies; 15+ messages in thread
From: Tak Kunihiro @ 2018-07-12 13:10 UTC
  To: YUE Daian; +Cc: help-gnu-emacs, tkk

>>> I want to grep Japanese string.  I can do it on Emacs for Mac but
>>> cannot do it on Emacs for MS Windows.
>>>
>>> I found that I can grep Japanese string using
>>> c:/msys64/usr/bin/grep.exe on command prompt (outside of Emacs).
>>> However, I cannot do it using c:/msys64/usr/bin/grep.exe on command
>>> prompt by M-x shell (inside of Emacs).  I confirm that LC_ALL is set
>>> to en_US.UTF-8 on both environments.
>>>
>>> Can you give me a hint to grep Japanese string on Emacs for MS
>>> Windows?
>>
>> You cannot pass UTF-8 encoded parameters to sub-programs on
>> MS-Windows.  You can only use the encoding of your system codepage.
>> Sorry, it's an MS-Windows limitation.
>
> I remember Windows 10 has a beta option to use UTF-8 for the whole
> system instead of your local encoding.

I have checked a box on a dialog by following steps and
confirmed that I can grep Japanese string.

Thank you for the hints!


Region & Language
-> Administrative language settings
-> Language for non-Unicode programs
-> Change system locale...
-> Beta: Use Unicode UTF-8 for worldwide language support
* Re: Grep Japanese characters
  2018-07-12 13:10 ` Tak Kunihiro
@ 2018-07-12 13:37 ` Eli Zaretskii
  2018-07-13 3:36 ` Tak Kunihiro
  2018-07-13 2:58 ` YUE Daian
  1 sibling, 1 reply; 15+ messages in thread
From: Eli Zaretskii @ 2018-07-12 13:37 UTC
  To: help-gnu-emacs

> From: Tak Kunihiro <homeros.misasa@gmail.com>
> Date: Thu, 12 Jul 2018 22:10:45 +0900
> Cc: help-gnu-emacs@gnu.org, tkk@misasa.okayama-u.ac.jp
>
> I have checked a box on a dialog by following steps and
> confirmed that I can grep Japanese string.
>
> Thank you for the hints!
>
>
> Region & Language
> -> Administrative language settings
> -> Language for non-Unicode programs
> -> Change system locale...
> -> Beta: Use Unicode UTF-8 for worldwide language support

Ah, good!  So Windows is finally moving in the right direction.

After you do the above, what values do the following
variables/functions yield, please?

  M-: w32-ansi-code-page RET
  M-: (w32-get-console-codepage) RET
  M-: (w32-get-console-output-codepage) RET

Also, if you invoke Emacs with the -nw command-line option, what do
you see when you then display the HELLO file ("C-h h")?
* Re: Grep Japanese characters
  2018-07-12 13:37 ` Eli Zaretskii
@ 2018-07-13 3:36 ` Tak Kunihiro
  2018-07-13 7:21 ` Eli Zaretskii
  0 siblings, 1 reply; 15+ messages in thread
From: Tak Kunihiro @ 2018-07-13 3:36 UTC
  To: Eli Zaretskii; +Cc: help-gnu-emacs, tkk

>> I have checked a box on a dialog by following steps and
>> confirmed that I can grep Japanese string.
>>
>> Region & Language
>> -> Administrative language settings
>> -> Language for non-Unicode programs
>> -> Change system locale...
>> -> Beta: Use Unicode UTF-8 for worldwide language support
>
> After you do the above, what values do the following
> variables/functions yield, please?
>
> M-: w32-ansi-code-page RET
> M-: (w32-get-console-codepage) RET
> M-: (w32-get-console-output-codepage) RET
>
> Also, if you invoke Emacs with the -nw command-line option, what do
> you see when you then display the HELLO file ("C-h h")?

I attach the response from Emacs with `Beta: Use Unicode UTF-8 for
worldwide language support' checked.

CMD> c:/emacs-26.1/bin/runemacs.exe -Q

M-: w32-ansi-code-page
--> 65001 (#o176751, #xfde9)

M-: (w32-get-console-codepage) RET
--> 65001 (#o176751, #xfde9)

M-: (w32-get-console-output-codepage) RET
--> 65001 (#o176751, #xfde9)

CMD> c:/emacs-26.1/bin/emacs.exe -nw -Q

M-x view-hello-file
--> Letters besides ascii are all bricks (tofu).
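The three numbers Emacs echoes for each form above are the same value in decimal, octal, and hexadecimal. A quick Python check follows, together with a small lookup table whose contents are assumptions chosen for illustration: only 65001 appears in this thread, while 932 and 1252 are the usual Japanese and Western European Windows codepages, and `codec_for` is a hypothetical helper name.

```python
# 65001 is the Windows codepage identifier for UTF-8.  Emacs echoes it
# as decimal, octal (#o...) and hexadecimal (#x...).
reported = 65001
assert reported == 0o176751 == 0xFDE9

# Hypothetical lookup table for this sketch: Windows codepage number ->
# name of the matching Python codec.
CODEPAGE_TO_CODEC = {
    932: "cp932",    # Japanese (Shift_JIS variant)
    1252: "cp1252",  # Western European
    65001: "utf-8",  # UTF-8
}

def codec_for(codepage: int) -> str:
    """Return the Python codec name for a Windows codepage number."""
    return CODEPAGE_TO_CODEC[codepage]

print(codec_for(reported))
```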
* Re: Grep Japanese characters
  2018-07-13 3:36 ` Tak Kunihiro
@ 2018-07-13 7:21 ` Eli Zaretskii
  2018-07-13 14:06 ` Filipp Gunbin
  0 siblings, 1 reply; 15+ messages in thread
From: Eli Zaretskii @ 2018-07-13 7:21 UTC
  To: help-gnu-emacs

> From: Tak Kunihiro <homeros.misasa@gmail.com>
> Cc: help-gnu-emacs@gnu.org
> Cc: tkk@misasa.okayama-u.ac.jp
> Date: Fri, 13 Jul 2018 12:36:46 +0900
>
> CMD> c:/emacs-26.1/bin/runemacs.exe -Q
>
> M-: w32-ansi-code-page
> --> 65001 (#o176751, #xfde9)
>
> M-: (w32-get-console-codepage) RET
> --> 65001 (#o176751, #xfde9)
>
> M-: (w32-get-console-output-codepage) RET
> --> 65001 (#o176751, #xfde9)
>
> CMD> c:/emacs-26.1/bin/emacs.exe -nw -Q
>
> M-x view-hello-file
> --> Letters besides ascii are all bricks (tofu).

Thanks.  That's expected, more or less.

The conclusion is that UTF-8 can be used as a locale's codeset
(good!), but sending UTF-8 text to the console still doesn't work well
(not so good).  So if people use this knob in Windows 10, they should
arrange for console input and output to be in some codepage other than
65001 (a.k.a. UTF-8).

It's possible that using a better font for the console, such as Lucida
Console, will at least allow you to show Latin, Cyrillic, and Greek
characters in HELLO, btw.  Not sure what will happen with characters
beyond the BMP, maybe Windows 10 has improved in that part as well.
* Re: Grep Japanese characters
  2018-07-13 7:21 ` Eli Zaretskii
@ 2018-07-13 14:06 ` Filipp Gunbin
  2018-07-13 14:36 ` Eli Zaretskii
  0 siblings, 1 reply; 15+ messages in thread
From: Filipp Gunbin @ 2018-07-13 14:06 UTC
  To: Eli Zaretskii; +Cc: help-gnu-emacs

On 13/07/2018 10:21 +0300, Eli Zaretskii wrote:

[..]
> The conclusion is that UTF-8 can be used as a locale's codeset
> (good!), but sending UTF-8 text to the console still doesn't work well
> (not so good).  So if people use this knob in Windows 10, they should
> arrange for console input and output to be in some codepage other than
> 65001 (a.k.a. UTF-8).
[..]

But in message <86pnzsbnvu.fsf@misasa.okayama-u.ac.jp> above it was
reported that grepping of these non-ascii chars worked from emacs, no?

And what does "using as locale's codeset" then mean in your message?

I'm not a Windows user (anymore) myself, it'd be just nice to know what
the situation is on Windows.

Thanks.
* Re: Grep Japanese characters
  2018-07-13 14:06 ` Filipp Gunbin
@ 2018-07-13 14:36 ` Eli Zaretskii
  2018-07-16 20:11 ` Filipp Gunbin
  0 siblings, 1 reply; 15+ messages in thread
From: Eli Zaretskii @ 2018-07-13 14:36 UTC
  To: help-gnu-emacs

> From: Filipp Gunbin <fgunbin@fastmail.fm>
> Cc: help-gnu-emacs@gnu.org
> Date: Fri, 13 Jul 2018 17:06:38 +0300
>
> > The conclusion is that UTF-8 can be used as a locale's codeset
> > (good!), but sending UTF-8 text to the console still doesn't work well
> > (not so good).  So if people use this knob in Windows 10, they should
> > arrange for console input and output to be in some codepage other than
> > 65001 (a.k.a. UTF-8).
> [..]
>
> But in message <86pnzsbnvu.fsf@misasa.okayama-u.ac.jp> above it was
> reported that grepping of these non-ascii chars worked from emacs, no?

When you grep from Emacs, the results of the search are not displayed
by the Windows console, they get read by Emacs and displayed by Emacs.
And (GUI) Emacs can display _any_ character supported by the fonts
installed on the system, regardless of the codepage.  But if people
run Grep from the shell prompt, they will see unreadable output, even
on Windows 10 with that setting in effect.

> And what does "using as locale's codeset" then mean in your message?

A locale's most general specification is ll_CC.ENC, where ll is the
language, CC is the country, and ENC is the encoding.  Example from
POSIX systems: pt_BR.UTF-8, for Brazilian variety of Portuguese with
UTF-8 encoding.  Example from Windows: French_Canada.1252 (where 1252
is the codepage used for encoding).  The ENC part is also known as
"codeset".

More about that, for Windows in particular, here:

  https://msdn.microsoft.com/en-us/library/x99tb11d.aspx

You will see that the MS doc still says UTF-8 is not supported as the
ENC part.
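The ll_CC.ENC layout explained above can be sketched as a small Python parser. This is a minimal illustration, not a full locale-name parser: the names `LocaleSpec` and `parse_locale` are made up for the example, and modifiers such as `@euro` are deliberately not handled.

```python
from typing import NamedTuple, Optional

class LocaleSpec(NamedTuple):
    """The three parts of a locale name of the form ll_CC.ENC."""
    language: str
    country: Optional[str]
    codeset: Optional[str]

def parse_locale(spec: str) -> LocaleSpec:
    """Split 'll_CC.ENC' into language, country, and codeset.

    Missing parts are returned as None, so 'C' parses as
    ('C', None, None).
    """
    name, _, codeset = spec.partition(".")
    language, _, country = name.partition("_")
    return LocaleSpec(language, country or None, codeset or None)

# The value of LC_ALL from the original post, and the Windows-style
# example from the message above.
print(parse_locale("en_US.UTF-8"))
print(parse_locale("French_Canada.1252"))
```

In both cases the last field is the part referred to above as the "codeset".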
* Re: Grep Japanese characters
  2018-07-13 14:36 ` Eli Zaretskii
@ 2018-07-16 20:11 ` Filipp Gunbin
  2018-07-17 2:29 ` Eli Zaretskii
  0 siblings, 1 reply; 15+ messages in thread
From: Filipp Gunbin @ 2018-07-16 20:11 UTC
  To: Eli Zaretskii; +Cc: help-gnu-emacs

On 13/07/2018 17:36 +0300, Eli Zaretskii wrote:

>> From: Filipp Gunbin <fgunbin@fastmail.fm>
>> Cc: help-gnu-emacs@gnu.org
>> Date: Fri, 13 Jul 2018 17:06:38 +0300
>>
>> > The conclusion is that UTF-8 can be used as a locale's codeset
>> > (good!), but sending UTF-8 text to the console still doesn't work well
>> > (not so good).  So if people use this knob in Windows 10, they should
>> > arrange for console input and output to be in some codepage other than
>> > 65001 (a.k.a. UTF-8).
>> [..]
>>
>> But in message <86pnzsbnvu.fsf@misasa.okayama-u.ac.jp> above it was
>> reported that grepping of these non-ascii chars worked from emacs, no?
>
> When you grep from Emacs, the results of the search are not displayed
> by the Windows console, they get read by Emacs and displayed by Emacs.
> And (GUI) Emacs can display _any_ character supported by the fonts
> installed on the system, regardless of the codepage.  But if people
> run Grep from the shell prompt, they will see unreadable output, even
> on Windows 10 with that setting in effect.
>
>> And what does "using as locale's codeset" then mean in your message?
>
> A locale's most general specification is ll_CC.ENC, where ll is the
> language, CC is the country, and ENC is the encoding.  Example from
> POSIX systems: pt_BR.UTF-8, for Brazilian variety of Portuguese with
> UTF-8 encoding.  Example from Windows: French_Canada.1252 (where 1252
> is the codepage used for encoding).  The ENC part is also known as
> "codeset".
>
> More about that, for Windows in particular, here:
>
>   https://msdn.microsoft.com/en-us/library/x99tb11d.aspx
>
> You will see that the MS doc still says UTF-8 is not supported as the
> ENC part.

Thanks.

I'm familiar with the locale concept, but was not sure what "codeset"
means.  I'm still a bit lost in this.

It seems that sending/receiving to/from subprocesses works with that
Win10 setting, that's why grepping from M-x shell started to work.
Output in graphical Emacs will work if the font is ok.

But the interactions with console confuse me, I guess I need to read
more on that before I am able to ask something meaningful.  In
particular, it's unclear to me why grep outputs Japanese correctly in
the OP (with LC_ALL=en_US.UTF-8), and you say that sending UTF-8 text to
console will not work.
* Re: Grep Japanese characters
  2018-07-16 20:11 ` Filipp Gunbin
@ 2018-07-17 2:29 ` Eli Zaretskii
  0 siblings, 0 replies; 15+ messages in thread
From: Eli Zaretskii @ 2018-07-17 2:29 UTC
  To: help-gnu-emacs

> From: Filipp Gunbin <fgunbin@fastmail.fm>
> Cc: help-gnu-emacs@gnu.org
> Date: Mon, 16 Jul 2018 23:11:36 +0300
>
> But the interactions with console confuse me, I guess I need to read
> more on that before I am able to ask something meaningful.  In
> particular, it's unclear to me why grep outputs Japanese correctly in
> the OP (with LC_ALL=en_US.UTF-8), and you say that sending UTF-8 text to
> console will not work.

Once again: what worked for OP didn't involve sending matches to the
console, the matches were sent to Emacs for display.  It's the display
on the console that doesn't work with UTF-8, as the OP confirmed in
response to my question about "emacs -nw".
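The distinction drawn above, between output read by the parent program and output painted by the console, can be sketched in Python. This is an illustration under assumptions, not what Emacs does internally: a child Python process stands in for grep, and the helper name `run_and_decode` is made up for the example. The parent receives raw bytes over a pipe and decodes them itself, so the console codepage never takes part.

```python
import subprocess
import sys

# Child program standing in for grep: it writes UTF-8 bytes to stdout.
# Escapes are used so that the command line itself stays pure ASCII.
CHILD_CODE = (
    "import sys; "
    "sys.stdout.buffer.write("
    "'Hello is \\u3053\\u3093\\u306b\\u3061\\u306f in Japanese.\\n'"
    ".encode('utf-8'))"
)

def run_and_decode(encoding: str = "utf-8") -> str:
    """Run the child, capture its stdout through a pipe, decode it here."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode(encoding)

if __name__ == "__main__":
    # ascii() keeps this demo independent of the terminal's own encoding,
    # which is exactly the part that is unreliable on the console.
    print(ascii(run_and_decode()))
```

The decoded string is correct inside the parent regardless of whether a terminal could display it, which is the situation of GUI Emacs reading grep's matches.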
* Re: Grep Japanese characters
  2018-07-12 13:10 ` Tak Kunihiro
  2018-07-12 13:37 ` Eli Zaretskii
@ 2018-07-13 2:58 ` YUE Daian
  1 sibling, 0 replies; 15+ messages in thread
From: YUE Daian @ 2018-07-13 2:58 UTC
  To: help-gnu-emacs

On 2018-07-12 22:10, Tak Kunihiro <homeros.misasa@gmail.com> wrote:
>>>> I want to grep Japanese string.  I can do it on Emacs for Mac but
>>>> cannot do it on Emacs for MS Windows.
>>>>
>>>> I found that I can grep Japanese string using
>>>> c:/msys64/usr/bin/grep.exe on command prompt (outside of Emacs).
>>>> However, I cannot do it using c:/msys64/usr/bin/grep.exe on command
>>>> prompt by M-x shell (inside of Emacs).  I confirm that LC_ALL is set
>>>> to en_US.UTF-8 on both environments.
>>>>
>>>> Can you give me a hint to grep Japanese string on Emacs for MS
>>>> Windows?
>>>
>>> You cannot pass UTF-8 encoded parameters to sub-programs on
>>> MS-Windows.  You can only use the encoding of your system codepage.
>>> Sorry, it's an MS-Windows limitation.
>>
>> I remember Windows 10 has a beta option to use UTF-8 for the whole
>> system instead of your local encoding.
>
> I have checked a box on a dialog by following steps and
> confirmed that I can grep Japanese string.
>
> Thank you for the hints!
>
>
> Region & Language
> -> Administrative language settings
> -> Language for non-Unicode programs
> -> Change system locale...
> -> Beta: Use Unicode UTF-8 for worldwide language support

I am glad it made sense.

Just a reminder: if you check this box, you may find that some local
software that uses a hard-coded encoding does not show characters
correctly.

I personally do not care because Windows to me is only a game center
anyway.
* Re: Grep Japanese characters
  2018-07-12 3:05 ` YUE Daian
  2018-07-12 13:10 ` Tak Kunihiro
@ 2018-07-12 13:23 ` Eli Zaretskii
  1 sibling, 0 replies; 15+ messages in thread
From: Eli Zaretskii @ 2018-07-12 13:23 UTC
  To: help-gnu-emacs

> From: YUE Daian <sheepduke@gmail.com>
> Date: Thu, 12 Jul 2018 11:05:28 +0800
>
> > You cannot pass UTF-8 encoded parameters to sub-programs on
> > MS-Windows.  You can only use the encoding of your system codepage.
> > Sorry, it's an MS-Windows limitation.
>
> I remember Windows 10 has a beta option to use UTF-8 for the whole
> system instead of your local encoding.

If Windows at some point allows using UTF-8 as the locale's codeset,
then invoking subprograms with UTF-8 encoded command-line arguments
will become possible in Emacs on Windows.  For now, the MSDN
documentation of the latest C runtime still says:

  The locale argument can take a locale name, a language string, a
  language string and country/region code, a code page, or a language
  string, country/region code, and code page.  The set of available
  locale names, languages, country/region codes, and code pages
  includes all those supported by the Windows NLS API except code pages
  that require more than two bytes per character, such as UTF-7 and
  UTF-8.  If you provide a code page value of UTF-7 or UTF-8, setlocale
  will fail, returning NULL.
* Re: Grep Japanese characters
  2018-07-12 2:40 ` Eli Zaretskii
  2018-07-12 3:05 ` YUE Daian
@ 2018-07-12 5:05 ` Yuri Khan
  2018-07-12 13:27 ` Eli Zaretskii
  1 sibling, 1 reply; 15+ messages in thread
From: Yuri Khan @ 2018-07-12 5:05 UTC
  To: Eli Zaretskii; +Cc: help-gnu-emacs

On Thu, Jul 12, 2018 at 9:41 AM Eli Zaretskii <eliz@gnu.org> wrote:

> You cannot pass UTF-8 encoded parameters to sub-programs on
> MS-Windows.  You can only use the encoding of your system codepage.
> Sorry, it's an MS-Windows limitation.

That’s not entirely accurate: using the CreateProcessW API, you could
pass UTF-16.  However, in order to make full use of arguments passed
that way, the sub-program needs to forgo the normal “int main(int,
char**)” signature and use “int wmain(int, wchar_t**)”, or to call
GetCommandLineW and parse the returned UTF-16 string.  A sub-program
that accepts arguments via the usual ‘main’ function will be limited
to characters that are representable in the current codepage.

I do not know whether MSYS binaries do that.
* Re: Grep Japanese characters
  2018-07-12 5:05 ` Yuri Khan
@ 2018-07-12 13:27 ` Eli Zaretskii
  0 siblings, 0 replies; 15+ messages in thread
From: Eli Zaretskii @ 2018-07-12 13:27 UTC
  To: help-gnu-emacs

> From: Yuri Khan <yurivkhan@gmail.com>
> Date: Thu, 12 Jul 2018 12:05:51 +0700
> Cc: help-gnu-emacs <help-gnu-emacs@gnu.org>
>
> On Thu, Jul 12, 2018 at 9:41 AM Eli Zaretskii <eliz@gnu.org> wrote:
>
> > You cannot pass UTF-8 encoded parameters to sub-programs on
> > MS-Windows.  You can only use the encoding of your system codepage.
> > Sorry, it's an MS-Windows limitation.
>
> That’s not entirely accurate: using the CreateProcessW API, you could
> pass UTF-16.  However, in order to make full use of arguments passed
> that way, the sub-program needs to forgo the normal “int main(int,
> char**)” signature and use “int wmain(int, wchar_t**)”, or to call
> GetCommandLineW and parse the returned UTF-16 string.  A sub-program
> that accepts arguments via the usual ‘main’ function will be limited
> to characters that are representable in the current codepage.

Not only does the sub-program need to use wmain instead of main, it
must also internally use the wchar_t data type instead of char for
text strings.  Ports of GNU and Unix software generally won't do that,
so passing UTF-16 encoded text to them is not really useful, with a
few very rare exceptions.
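What happens to an argument on its way into a `main`-style program can be approximated in Python. This is a rough model, not the actual Windows conversion: Python's `errors="replace"` stands in for the lossy wide-to-codepage conversion, cp1252 is an assumed system codepage, and both function names are hypothetical. The UTF-16 form, as it would travel through CreateProcessW and GetCommandLineW, keeps every character; the codepage form does not.

```python
def as_wide_argument(arg: str) -> bytes:
    """Model of the UTF-16 (wchar_t) form of a command-line argument."""
    return arg.encode("utf-16-le")

def as_narrow_argument(arg: str, codepage: str = "cp1252") -> bytes:
    """Model of the char form seen by a main()-style program.

    Characters missing from the codepage are replaced by '?', which is
    an approximation of the lossy conversion, not the exact algorithm.
    """
    return arg.encode(codepage, errors="replace")

pattern = "こんにちは"

wide = as_wide_argument(pattern)
narrow = as_narrow_argument(pattern)

print(wide.decode("utf-16-le") == pattern)  # the wide form round-trips
print(narrow)                                # b'?????'
```

A grep that receives `?????` as its pattern has nothing meaningful left to search for, which is why a program reading its arguments as char strings is bound by the current codepage.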