* bug#75207: 29.4; Path conversion from native codepage to UTF-8 fails when Windows is set by default to UTF-8
From: michal--- via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2024-12-30 12:12 UTC
To: 75207

Emacs generates gibberish UTF-8 characters during conversion from native
codepage to UTF-8 if experimental default UTF-8 codepage is set on
Windows.

In GNU Emacs 29.4 (build 2, x86_64-w64-mingw32) of 2024-07-05 built on
AVALON
Windowing system distributor 'Microsoft Corp.', version 10.0.22631
System Description: Microsoft Windows 10 Education (v10.0.2009.22631.4602)

Configured using:
 'configure --with-modules --without-dbus --with-native-compilation=aot
 --without-compress-install --with-sqlite3 --with-tree-sitter
 CFLAGS=-O2'

Configured features:
ACL GIF GMP GNUTLS HARFBUZZ JPEG JSON LCMS2 LIBXML2 MODULES NATIVE_COMP
NOTIFY W32NOTIFY PDUMPER PNG RSVG SOUND SQLITE3 THREADS TIFF
TOOLKIT_SCROLL_BARS TREE_SITTER WEBP XPM ZLIB

(NATIVE_COMP present but libgccjit not available)

Important settings:
  value of $LANG: ENG
  locale-coding-system: cp65001

Major mode: recentf-dialog

Minor modes in effect:
  global-company-mode: t
  company-mode: t
  nyan-mode: t
  fido-vertical-mode: t
  icomplete-vertical-mode: t
  icomplete-mode: t
  fido-mode: t
  global-display-line-numbers-mode: t
  display-line-numbers-mode: t
  recentf-mode: t
  global-display-fill-column-indicator-mode: t
  display-fill-column-indicator-mode: t
  tooltip-mode: t
  global-eldoc-mode: t
  show-paren-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  column-number-mode: t
  line-number-mode: t
  transient-mark-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t

Load-path shadows:
c:/Users/Michał/.emacs.d/elpa/transient-20241102.1229/transient hides c:/Program Files/Emacs/emacs-29.4/share/emacs/29.4/lisp/transient
c:/Users/Michał/.emacs.d/elpa/standard-themes-2.1.0/theme-loaddefs hides c:/Program Files/Emacs/emacs-29.4/share/emacs/29.4/lisp/theme-loaddefs

Features:
(shadow sort mail-extr emacsbug message yank-media puny dired
dired-loaddefs rfc822 mml mml-sec epa epg rfc6068 epg-config gnus-util
time-date mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev
gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums mm-util
mail-prsvr mail-utils eldoc-box high-theme company-oddmuse
company-keywords company-etags etags fileloop generator xref project
company-gtags company-dabbrev-code company-dabbrev company-files
company-clang company-capf company-cmake company-semantic
company-template company-bbdb company nyan-mode icomplete
display-line-numbers recentf tree-widget wid-edit easy-mmode
display-fill-column-indicator jai-mode derived compile
text-property-search comint ansi-osc ansi-color ring js c-ts-common
treesit imenu cc-mode cc-fonts cc-guess cc-menus cc-cmds cc-styles
cc-align cc-engine cc-vars cc-defs theme-switcher finder-inf
almost-mono-themes-autoloads auctex-autoloads tex-site
centered-window-autoloads cmake-mode-autoloads company-autoloads
dtrt-indent-autoloads editorconfig-autoloads eldoc-box-autoloads
erlang-autoloads exec-path-from-shell-autoloads go-mode-autoloads
gruber-darker-theme-autoloads haskell-mode-autoloads
highlight-symbol-autoloads latex-preview-pane-autoloads magit-autoloads
pcase magit-section-autoloads dash-autoloads markdown-mode-autoloads
merlin-autoloads multiple-cursors-autoloads nyan-mode-autoloads
powershell-autoloads projectile-autoloads rg-autoloads
rust-mode-autoloads slime-autoloads macrostep-autoloads
solarized-theme-autoloads standard-themes-autoloads swift-mode-autoloads
transient-autoloads tuareg-autoloads rx caml-autoloads wgrep-autoloads
white-sand-theme-autoloads with-editor-autoloads info compat-autoloads
yasnippet-autoloads zig-mode-autoloads reformatter-autoloads package
browse-url url url-proxy url-privacy url-expand url-methods url-history
url-cookie generate-lisp-file url-domsuf url-util mailcap url-handlers
url-parse auth-source cl-seq eieio eieio-core cl-macs password-cache
json subr-x map byte-opt gv bytecomp byte-compile url-vars cl-loaddefs
cl-lib rmc iso-transl tooltip cconv eldoc paren electric uniquify
ediff-hook vc-hooks lisp-float-type elisp-mode mwheel dos-w32 ls-lisp
disp-table term/w32-win w32-win w32-vars term/common-win tool-bar dnd
fontset image regexp-opt fringe tabulated-list replace newcomment
text-mode lisp-mode prog-mode register page tab-bar menu-bar
rfn-eshadow isearch easymenu timer select scroll-bar mouse jit-lock
font-lock syntax font-core term/tty-colors frame minibuffer nadvice seq
simple cl-generic indonesian philippine cham georgian utf-8-lang
misc-lang vietnamese tibetan thai tai-viet lao korean japanese eucjp-ms
cp51932 hebrew greek romanian slovak czech european ethiopic indian
cyrillic chinese composite emoji-zwj charscript charprop case-table
epa-hook jka-cmpr-hook help abbrev obarray oclosure cl-preloaded button
loaddefs theme-loaddefs faces cus-face macroexp files window
text-properties overlay sha1 md5 base64 format env code-pages mule
custom widget keymap hashtable-print-readable backquote threads
w32notify w32 lcms2 multi-tty make-network-process native-compile emacs)

Memory information:
((conses 16 185675 75051)
 (symbols 48 14661 7)
 (strings 32 55444 14585)
 (string-bytes 1 1821337)
 (vectors 16 27409)
 (vector-slots 8 520326 162526)
 (floats 8 83 1011)
 (intervals 56 494 150)
 (buffers 984 11))
* bug#75207: 29.4; Path conversion from native codepage to UTF-8 fails when Windows is set by default to UTF-8
From: Eli Zaretskii @ 2024-12-30 19:13 UTC
To: michal; +Cc: 75207

severity 75207 wishlist
thanks

> Date: Mon, 30 Dec 2024 12:12:02 +0000
> From: michal--- via "Bug reports for GNU Emacs,
>  the Swiss army knife of text editors" <bug-gnu-emacs@gnu.org>
>
> Emacs generates gibberish UTF-8 characters during conversion from native
> codepage to UTF-8 if experimental default UTF-8 codepage is set on
> Windows.

Please provide the minimum recipe for reproducing this, starting from
"emacs -Q".  What exactly did you convert, and how?  And what problems
did you see, exactly?

Also, what do the following commands produce inside "emacs -Q"?

  M-: (getenv "ENU") RET
  M-: current-locale-environment RET
  M-: w32-ansi-code-page RET
  M-: (default-value 'buffer-file-coding-system) RET

In general, the UTF-8 codepage on Windows is not (yet) supported.  In
particular, some functions we use in Emacs assume the system codepage
cannot be a multibyte encoding.  Also, invoking subprocesses on Windows
doesn't currently support anything but single-byte encoding of the
program's name and its command-line arguments, for boring technical
reasons.  For that reason, I don't recommend using the UTF-8 codepage,
and I don't recommend making UTF-8 the default encoding on MS-Windows.

That said, presenting a clear recipe could help us gradually improve
support for this, as Windows improves its part in parallel.

Thanks.
[parent not found: <003001db5d81$a8f144b0$fad3ce10$@0lock.xyz>]
* bug#75207: Fwd: bug#75207: 29.4; Path conversion from native codepage to UTF-8 fails when Windows is set by default to UTF-8
From: Michał Lach via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2025-01-03 11:49 UTC
To: 75207; +Cc: Eli Zaretskii

Forgot to CC the bug report mail.

> Begin forwarded message:
>
> From: <michal@0lock.xyz>
> Subject: RE: bug#75207: 29.4; Path conversion from native codepage to UTF-8 fails when Windows is set by default to UTF-8
> Date: 3 January 2025 at 02:48:53 CET
> To: "'Eli Zaretskii'" <eliz@gnu.org>
> Reply-To: <michal@0lock.xyz>
>
> M-: (getenv "ENU") -> nil
> M-: current-locale-environment -> "ENG"
> M-: w32-ansi-code-page -> 65001
> M-: (default-value 'buffer-file-coding-system) -> iso-latin-1-dos
>
>> That said, presenting a clear recipe could help us gradually improve
>> support for
>> this, as Windows improves its part in parallel.
>
> Here is the repro.
> 1. Add a directory to your "PATH" environment variable whose name contains
> a diacritic character (ł in my case, maybe it won't work for some)
> 2. M-: exec-path returns gibberish
>
> Here, "Michał" becomes "MichaÅ‚". You get a similar result if you call
> MultiByteToWideChar with the Windows-1252 codepage on a UTF-8 path.
>
> I've dug around and it looks like codepage_for_filenames (src/w32.c) at
> some point returns the Windows-1252 codepage.
> This is then passed to MultiByteToWideChar() and the scenario that I
> described above happens.
> I've checked this hypothesis with API Monitor and this is what actually
> happens. I can attach a trace if you find it useful.
* bug#75207: Fwd: bug#75207: 29.4; Path conversion from native codepage to UTF-8 fails when Windows is set by default to UTF-8
From: Eli Zaretskii @ 2025-01-03 13:23 UTC
To: Michał Lach; +Cc: 75207

> Date: Fri, 03 Jan 2025 11:49:34 +0000
> From: Michał Lach <michal@0lock.xyz>
> Cc: Eli Zaretskii <eliz@gnu.org>
>
> Forgot to CC the bug report mail.
>
> > Begin forwarded message:
> >
> > From: <michal@0lock.xyz>
> > Subject: RE: bug#75207: 29.4; Path conversion from native codepage to UTF-8 fails when Windows is set by default to UTF-8
> > Date: 3 January 2025 at 02:48:53 CET
> > To: "'Eli Zaretskii'" <eliz@gnu.org>
> > Reply-To: <michal@0lock.xyz>
> >
> > M-: (getenv "ENU") -> nil
> > M-: current-locale-environment -> "ENG"
> > M-: w32-ansi-code-page -> 65001
> > M-: (default-value 'buffer-file-coding-system) -> iso-latin-1-dos

OK.  I think I see the problem (and it is not specific to UTF-8
codepage), but just to be sure, please show some more values:

  M-: w32-multibyte-code-page RET
  M-: locale-coding-system RET
  M-: file-name-coding-system RET
  M-: default-file-name-coding-system RET

> > Here is the repro.
> > 1. Add a directory to your "PATH" environment variable whose name contains
> > a diacritic character (ł in my case, maybe it won't work for some)
> > 2. M-: exec-path returns gibberish
> >
> > Here, "Michał" becomes "MichaÅ‚". You get a similar result if you call
> > MultiByteToWideChar with the Windows-1252 codepage on a UTF-8 path.

We think that PATH is encoded in Windows-1252 codepage, and the
question is why and where do we err.  The above additional values I
ask about might help answer that question.

> > I've dug around and it looks like codepage_for_filenames (src/w32.c) at
> > some point returns the Windows-1252 codepage.
> > This is then passed to MultiByteToWideChar() and the scenario that I
> > described above happens.
> > I've checked this hypothesis with API Monitor and this is what actually
> > happens. I can attach a trace if you find it useful.

Not necessary for now, thanks.

If I send you a C-level patch, are you able to build Emacs after
patching it, preferably the master branch of our Git repository?
* bug#75207: Fwd: bug#75207: 29.4; Path conversion from native codepage to UTF-8 fails when Windows is set by default to UTF-8
From: michal--- via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2025-01-03 14:35 UTC
To: 'Eli Zaretskii'; +Cc: 75207

I've just built Emacs on somewhat new revision (577714e3fe) and cannot repro it there.
Tag emacs-29.1 does not build by default on Windows so I didn't check.

My theory is that maybe the codepage of the machine Emacs was built on influences this??
Or this has just been fixed on the latest version.
I debugged a bit and it looks like w32_ansi_code_page is set to 1252 at some point.

> OK. I think I see the problem (and it is not specific to UTF-8 codepage), but
> just to be sure, please show some more values:
>
> M-: w32-multibyte-code-page RET
> M-: locale-coding-system RET
> M-: file-name-coding-system RET
> M-: default-file-name-coding-system RET

M-: w32-multibyte-code-page -> 0
M-: locale-coding-system -> cp65001
M-: file-name-coding-system -> nil
M-: default-file-name-coding-system -> cp65001

> We think that PATH is encoded in Windows-1252 codepage, and the question
> is why and where do we err. The above additional values I ask about might
> help answer that question.

I can say for sure that it is not, API monitor trace confirms this as well as some basic Win32 programs.
getenv("PATH") returns proper string, respecting the active code page.

> If I send you a C-level patch, are you able to build Emacs after patching it,
> preferably the master branch of our Git repository?

Sure.
* bug#75207: Fwd: bug#75207: 29.4; Path conversion from native codepage to UTF-8 fails when Windows is set by default to UTF-8
From: Eli Zaretskii @ 2025-01-03 15:25 UTC
To: michal; +Cc: 75207

> Date: Fri, 03 Jan 2025 14:35:26 +0000
> From: michal@0lock.xyz
> Cc: 75207@debbugs.gnu.org
>
> I've just built Emacs on somewhat new revision (577714e3fe) and cannot repro it there.
> Tag emacs-29.1 does not build by default on Windows so I didn't check.
>
> My theory is that maybe the codepage of the machine Emacs was built on influences this??

Yes, it does, according to my reading of the code.  When we went from
unexec to pdumper builds, we introduced a bug whereby the relevant
variables are assigned values that come from the dump stage, and not
reinitialized after that.  If Emacs was dumped when the system
codepage was different, you will see problems when the dumped Emacs
starts with a different codepage, AFAICT.

As I said, this is not limited to UTF-8, so it is good we found this
problem.

> Or this has just been fixed on the latest version.

No, I don't think so.  I see the problem on the latest master branch.

> I debugged a bit and it looks like w32_ansi_code_page is set to 1252 at some point.

AFAICT, that happens when we load the pdumper file.

> M-: w32-multibyte-code-page -> 0
> M-: locale-coding-system -> cp65001
> M-: file-name-coding-system -> nil
> M-: default-file-name-coding-system -> cp65001

OK, I think this confirms my hypothesis.  I'll try to come up with a
patch, probably tomorrow.

> > We think that PATH is encoded in Windows-1252 codepage, and the question
> > is why and where do we err. The above additional values I ask about might
> > help answer that question.
>
> I can say for sure that it is not

When I say "we think", I mean Emacs thinks that, mistakenly.

> > If I send you a C-level patch, are you able to build Emacs after patching it,
> > preferably the master branch of our Git repository?
>
> Sure.

OK, but you'll need to build Emacs with a different system codepage to
see the effects of the fix.
* bug#75207: Fwd: bug#75207: 29.4; Path conversion from native codepage to UTF-8 fails when Windows is set by default to UTF-8
From: Eli Zaretskii @ 2025-01-04 9:30 UTC
To: michal; +Cc: 75207

> Cc: 75207@debbugs.gnu.org
> Date: Fri, 03 Jan 2025 17:25:31 +0200
> From: Eli Zaretskii <eliz@gnu.org>
>
> > I debugged a bit and it looks like w32_ansi_code_page is set to 1252 at some point.
>
> AFAICT, that happens when we load the pdumper file.
>
> > M-: w32-multibyte-code-page -> 0
> > M-: locale-coding-system -> cp65001
> > M-: file-name-coding-system -> nil
> > M-: default-file-name-coding-system -> cp65001
>
> OK, I think this confirms my hypothesis.  I'll try to come up with a
> patch, probably tomorrow.

The patch is below, and it is for the master branch of the Emacs Git
repository.

> > > If I send you a C-level patch, are you able to build Emacs after patching it,
> > > preferably the master branch of our Git repository?
> >
> > Sure.
>
> OK, but you'll need to build Emacs with a different system codepage to
> see the effects of the fix.

This still stands: to fully test the patch, please change your system
codepage after building Emacs and then start Emacs and see if
everything works as expected.

diff --git a/src/emacs.c b/src/emacs.c
index c1e0c9f..896f219 100644
--- a/src/emacs.c
+++ b/src/emacs.c
@@ -1419,7 +1419,18 @@ android_emacs_init (int argc, char **argv, char *dump_file)
 
 #ifdef HAVE_PDUMPER
   if (attempt_load_pdump)
-    initial_emacs_executable = load_pdump (argc, argv, dump_file);
+    {
+      initial_emacs_executable = load_pdump (argc, argv, dump_file);
+#ifdef WINDOWSNT
+      /* Reinitialize the codepage for file names, needed to decode
+         non-ASCII file names during startup.  This is needed because
+         loading the pdumper file above assigns to those variables values
+         from the dump stage, which might be incorrect, if dumping was done
+         on a different system.  */
+      if (dumped_with_pdumper_p ())
+        w32_init_file_name_codepage ();
+#endif
+    }
 #else
   ptrdiff_t bufsize;
   initial_emacs_executable = find_emacs_executable (argv[0], &bufsize);
diff --git a/src/w32.c b/src/w32.c
index a493991..deeca03 100644
--- a/src/w32.c
+++ b/src/w32.c
@@ -1685,6 +1685,19 @@ w32_init_file_name_codepage (void)
 {
   file_name_codepage = CP_ACP;
   w32_ansi_code_page = CP_ACP;
+#ifdef HAVE_PDUMPER
+  /* If we were dumped with pdumper, this function will be called after
+     loading the pdumper file, and needs to reset the following
+     variables that come from the dump stage, which could be on a
+     different system with different default codepages.  Then, the
+     correct value of w32-ansi-code-page will be assigned by
+     globals_of_w32fns, which is called from 'main'.  Until that call
+     happens, w32-ansi-code-page will have the value of CP_ACP, which
+     stands for the default ANSI codepage.  The other variables will be
+     computed by codepage_for_filenames below.  */
+  Vdefault_file_name_coding_system = Qnil;
+  Vfile_name_coding_system = Qnil;
+#endif
 }
 
 /* Produce a Windows ANSI codepage suitable for encoding file names.
* bug#75207: Fwd: bug#75207: 29.4; Path conversion from native codepage to UTF-8 fails when Windows is set by default to UTF-8
From: michal--- via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2025-01-04 17:37 UTC
To: 'Eli Zaretskii'; +Cc: 75207

> This still stands: to fully test the patch, please change your system codepage
> after building Emacs and then start Emacs and see if everything works as
> expected.

Done, looks like that fixed the issue :-).

Thank you for taking care of this and working on Emacs.
Godspeed.
* bug#75207: Fwd: bug#75207: 29.4; Path conversion from native codepage to UTF-8 fails when Windows is set by default to UTF-8
From: Eli Zaretskii @ 2025-01-05 5:58 UTC
To: michal; +Cc: 75207-done

> Date: Sat, 04 Jan 2025 17:37:34 +0000
> From: michal@0lock.xyz
> Cc: 75207@debbugs.gnu.org
>
> > This still stands: to fully test the patch, please change your system codepage
> > after building Emacs and then start Emacs and see if everything works as
> > expected.
>
> Done, looks like that fixed the issue :-).

Thanks for testing, I therefore installed the changes on the master
branch, and I'm closing this bug.