unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#75207: 29.4; Path conversion from native codepage to UTF-8 fails when Windows is set by default to UTF-8
@ 2024-12-30 12:12 michal--- via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2024-12-30 19:13 ` Eli Zaretskii
  0 siblings, 1 reply; 6+ messages in thread
From: michal--- via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2024-12-30 12:12 UTC (permalink / raw)
  To: 75207

Emacs generates gibberish characters when converting from the native
codepage to UTF-8 if the experimental default UTF-8 codepage is set on
Windows.

In GNU Emacs 29.4 (build 2, x86_64-w64-mingw32) of 2024-07-05 built on
 AVALON
Windowing system distributor 'Microsoft Corp.', version 10.0.22631
System Description: Microsoft Windows 10 Education (v10.0.2009.22631.4602)

Configured using:
 'configure --with-modules --without-dbus --with-native-compilation=aot
 --without-compress-install --with-sqlite3 --with-tree-sitter
 CFLAGS=-O2'

Configured features:
ACL GIF GMP GNUTLS HARFBUZZ JPEG JSON LCMS2 LIBXML2 MODULES NATIVE_COMP
NOTIFY W32NOTIFY PDUMPER PNG RSVG SOUND SQLITE3 THREADS TIFF
TOOLKIT_SCROLL_BARS TREE_SITTER WEBP XPM ZLIB

(NATIVE_COMP present but libgccjit not available)

Important settings:
  value of $LANG: ENG
  locale-coding-system: cp65001

Major mode: recentf-dialog

Minor modes in effect:
  global-company-mode: t
  company-mode: t
  nyan-mode: t
  fido-vertical-mode: t
  icomplete-vertical-mode: t
  icomplete-mode: t
  fido-mode: t
  global-display-line-numbers-mode: t
  display-line-numbers-mode: t
  recentf-mode: t
  global-display-fill-column-indicator-mode: t
  display-fill-column-indicator-mode: t
  tooltip-mode: t
  global-eldoc-mode: t
  show-paren-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  column-number-mode: t
  line-number-mode: t
  transient-mark-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t

Load-path shadows:
c:/Users/Michał/.emacs.d/elpa/transient-20241102.1229/transient hides c:/Program Files/Emacs/emacs-29.4/share/emacs/29.4/lisp/transient
c:/Users/Michał/.emacs.d/elpa/standard-themes-2.1.0/theme-loaddefs hides c:/Program Files/Emacs/emacs-29.4/share/emacs/29.4/lisp/theme-loaddefs

Features:
(shadow sort mail-extr emacsbug message yank-media puny dired
dired-loaddefs rfc822 mml mml-sec epa epg rfc6068 epg-config gnus-util
time-date mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev
gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums mm-util
mail-prsvr mail-utils eldoc-box high-theme company-oddmuse
company-keywords company-etags etags fileloop generator xref project
company-gtags company-dabbrev-code company-dabbrev company-files
company-clang company-capf company-cmake company-semantic
company-template company-bbdb company nyan-mode icomplete
display-line-numbers recentf tree-widget wid-edit easy-mmode
display-fill-column-indicator jai-mode derived compile
text-property-search comint ansi-osc ansi-color ring js c-ts-common
treesit imenu cc-mode cc-fonts cc-guess cc-menus cc-cmds cc-styles
cc-align cc-engine cc-vars cc-defs theme-switcher finder-inf
almost-mono-themes-autoloads auctex-autoloads tex-site
centered-window-autoloads cmake-mode-autoloads company-autoloads
dtrt-indent-autoloads editorconfig-autoloads eldoc-box-autoloads
erlang-autoloads exec-path-from-shell-autoloads go-mode-autoloads
gruber-darker-theme-autoloads haskell-mode-autoloads
highlight-symbol-autoloads latex-preview-pane-autoloads magit-autoloads
pcase magit-section-autoloads dash-autoloads markdown-mode-autoloads
merlin-autoloads multiple-cursors-autoloads nyan-mode-autoloads
powershell-autoloads projectile-autoloads rg-autoloads
rust-mode-autoloads slime-autoloads macrostep-autoloads
solarized-theme-autoloads standard-themes-autoloads swift-mode-autoloads
transient-autoloads tuareg-autoloads rx caml-autoloads wgrep-autoloads
white-sand-theme-autoloads with-editor-autoloads info compat-autoloads
yasnippet-autoloads zig-mode-autoloads reformatter-autoloads package
browse-url url url-proxy url-privacy url-expand url-methods url-history
url-cookie generate-lisp-file url-domsuf url-util mailcap url-handlers
url-parse auth-source cl-seq eieio eieio-core cl-macs password-cache
json subr-x map byte-opt gv bytecomp byte-compile url-vars cl-loaddefs
cl-lib rmc iso-transl tooltip cconv eldoc paren electric uniquify
ediff-hook vc-hooks lisp-float-type elisp-mode mwheel dos-w32 ls-lisp
disp-table term/w32-win w32-win w32-vars term/common-win tool-bar dnd
fontset image regexp-opt fringe tabulated-list replace newcomment
text-mode lisp-mode prog-mode register page tab-bar menu-bar rfn-eshadow
isearch easymenu timer select scroll-bar mouse jit-lock font-lock syntax
font-core term/tty-colors frame minibuffer nadvice seq simple cl-generic
indonesian philippine cham georgian utf-8-lang misc-lang vietnamese
tibetan thai tai-viet lao korean japanese eucjp-ms cp51932 hebrew greek
romanian slovak czech european ethiopic indian cyrillic chinese
composite emoji-zwj charscript charprop case-table epa-hook
jka-cmpr-hook help abbrev obarray oclosure cl-preloaded button loaddefs
theme-loaddefs faces cus-face macroexp files window text-properties
overlay sha1 md5 base64 format env code-pages mule custom widget keymap
hashtable-print-readable backquote threads w32notify w32 lcms2 multi-tty
make-network-process native-compile emacs)

Memory information:
((conses 16 185675 75051)
 (symbols 48 14661 7)
 (strings 32 55444 14585)
 (string-bytes 1 1821337)
 (vectors 16 27409)
 (vector-slots 8 520326 162526)
 (floats 8 83 1011)
 (intervals 56 494 150)
 (buffers 984 11))








* bug#75207: 29.4; Path conversion from native codepage to UTF-8 fails when Windows is set by default to UTF-8
  2024-12-30 12:12 bug#75207: 29.4; Path conversion from native codepage to UTF-8 fails when Windows is set by default to UTF-8 michal--- via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2024-12-30 19:13 ` Eli Zaretskii
       [not found]   ` <003001db5d81$a8f144b0$fad3ce10$@0lock.xyz>
  0 siblings, 1 reply; 6+ messages in thread
From: Eli Zaretskii @ 2024-12-30 19:13 UTC (permalink / raw)
  To: michal; +Cc: 75207

severity 75207 wishlist
thanks

> Date: Mon, 30 Dec 2024 12:12:02 +0000
> From: michal--- via "Bug reports for GNU Emacs,
>  the Swiss army knife of text editors" <bug-gnu-emacs@gnu.org>
> 
> Emacs generates gibberish characters when converting from the native
> codepage to UTF-8 if the experimental default UTF-8 codepage is set on
> Windows.

Please provide the minimum recipe for reproducing this, starting from
"emacs -Q".  What exactly did you convert, and how?  And what problems
did you see, exactly?  Also, what do the following commands produce
inside "emacs -Q"?

  M-: (getenv "ENU") RET
  M-: current-locale-environment RET
  M-: w32-ansi-code-page RET
  M-: (default-value 'buffer-file-coding-system) RET

In general, the UTF-8 codepage on Windows is not (yet) supported.  In
particular, some functions we use in Emacs assume the system codepage
cannot be a multibyte encoding.  Also, invoking subprocesses on
Windows doesn't currently support anything but single-byte encoding of
the program's name and its command-line arguments, for boring
technical reasons.  For that reason, I don't recommend using the UTF-8
codepage, and I don't recommend making UTF-8 the default encoding on
MS-Windows.
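
To illustrate the single-byte assumption mentioned above, here is a minimal Python sketch. It is not Emacs code; the sample name and the choice of cp1250 as the single-byte codepage are assumptions made for illustration. In a single-byte codepage the byte length of a string equals its character count, which no longer holds under UTF-8.

```python
# Illustration only: compare byte length with character count for a
# single-byte codepage (cp1250, chosen because it contains "ł") and UTF-8.
name = "Michał"

single_byte = name.encode("cp1250")   # one byte per character
multi_byte = name.encode("utf-8")     # "ł" takes two bytes

print(len(name), len(single_byte), len(multi_byte))

# Code that assumes "one byte == one character" sizes its buffers and
# indexes strings correctly for cp1250, but not for UTF-8.
assumption_holds_cp1250 = len(single_byte) == len(name)
assumption_holds_utf8 = len(multi_byte) == len(name)
```

Any buffer sizing or indexing that relies on the first equality would need to be revisited before a multibyte system codepage can work.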

That said, presenting a clear recipe could help us gradually improve
support for this, as Windows improves its part in parallel.

Thanks.






* bug#75207: Fwd: bug#75207: 29.4; Path conversion from native codepage to UTF-8 fails when Windows is set by default to UTF-8
       [not found]   ` <003001db5d81$a8f144b0$fad3ce10$@0lock.xyz>
@ 2025-01-03 11:49     ` Michał Lach via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2025-01-03 13:23       ` Eli Zaretskii
  0 siblings, 1 reply; 6+ messages in thread
From: Michał Lach via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2025-01-03 11:49 UTC (permalink / raw)
  To: 75207; +Cc: Eli Zaretskii

Forgot to CC the bug report mail.

> Begin forwarded message:
> 
> From: <michal@0lock.xyz>
> Subject: RE: bug#75207: 29.4; Path conversion from native codepage to UTF-8 fails when Windows is set by default to UTF-8
> Date: 3 January 2025 at 02:48:53 CET
> To: "'Eli Zaretskii'" <eliz@gnu.org>
> Reply-To: <michal@0lock.xyz>
> 
> M-: (getenv "ENU") -> nil
> M-: current-locale-environment -> "ENG"
> M-: w32-ansi-code-page -> 65001
> M-: (default-value 'buffer-file-coding-system) -> iso-latin-1-dos
> 
>> That said, presenting a clear recipe could help us gradually improve
>> support for this, as Windows improves its part in parallel.
> 
> Here is the repro.
> 1. Add a path containing a diacritic character to your "PATH"
> environment variable (ł in my case, maybe it won't work for some)
> 2. M-: exec-path returns gibberish
> 
> Here, "Michał" becomes "MichaÅ‚". You can get a similar result if you call
> MultiByteToWideChar with the Windows-1252 codepage on a UTF-8 path.
> 
> I dug around, and it looks like codepage_for_filenames (src/w32.c) at
> some point returns the Windows-1252 codepage.
> This is then passed to MultiByteToWideChar(), and the scenario that I
> described above happens.
> I've checked this hypothesis with API Monitor and this is what actually
> happens; I can attach a trace if you find it useful.
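
The garbling described above can be reproduced outside Emacs. The following is a minimal Python sketch, not the Emacs code path: it decodes the UTF-8 bytes of a path with the Windows-1252 codepage, which is the mismatch reported here. The sample path is an assumption based on the load-path entries in the original report.

```python
# Illustration only: reproduce the reported mojibake by decoding UTF-8
# bytes with the wrong (Windows-1252) codepage.
path = "c:/Users/Michał"

raw = path.encode("utf-8")        # bytes as stored under codepage 65001
garbled = raw.decode("cp1252")    # what a Windows-1252 conversion yields
# garbled is now "c:/Users/MichaÅ‚": the two bytes of "ł" (C5 82)
# were read as the two separate characters "Å" and "‚".

# As long as no byte was dropped, undoing the wrong step recovers the
# original string.
repaired = garbled.encode("cp1252").decode("utf-8")
```

The round trip at the end only shows that the bytes themselves are intact; the real fix is to use the correct codepage for the conversion in the first place.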








* bug#75207: Fwd: bug#75207: 29.4; Path conversion from native codepage to UTF-8 fails when Windows is set by default to UTF-8
  2025-01-03 11:49     ` bug#75207: Fwd: " Michał Lach via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2025-01-03 13:23       ` Eli Zaretskii
  2025-01-03 14:35         ` michal--- via Bug reports for GNU Emacs, the Swiss army knife of text editors
  0 siblings, 1 reply; 6+ messages in thread
From: Eli Zaretskii @ 2025-01-03 13:23 UTC (permalink / raw)
  To: Michał Lach; +Cc: 75207

> Date: Fri, 03 Jan 2025 11:49:34 +0000
> From: Michał Lach <michal@0lock.xyz>
> Cc: Eli Zaretskii <eliz@gnu.org>
> 
> Forgot to CC the bug report mail.
> 
> > Begin forwarded message:
> > 
> > From: <michal@0lock.xyz>
> > Subject: RE: bug#75207: 29.4; Path conversion from native codepage to UTF-8 fails when Windows is set by default to UTF-8
> > Date: 3 January 2025 at 02:48:53 CET
> > To: "'Eli Zaretskii'" <eliz@gnu.org>
> > Reply-To: <michal@0lock.xyz>
> > 
> > M-: (getenv "ENU") -> nil
> > M-: current-locale-environment -> "ENG"
> > M-: w32-ansi-code-page -> 65001
> > M-: (default-value 'buffer-file-coding-system) -> iso-latin-1-dos

OK.  I think I see the problem (and it is not specific to UTF-8
codepage), but just to be sure, please show some more values:

  M-: w32-multibyte-code-page RET
  M-: locale-coding-system RET
  M-: file-name-coding-system RET
  M-: default-file-name-coding-system RET

> > Here is the repro.
> > 1. Add a path containing a diacritic character to your "PATH"
> > environment variable (ł in my case, maybe it won't work for some)
> > 2. M-: exec-path returns gibberish
> > 
> > Here, "Michał" becomes "MichaÅ‚". You can get a similar result if you call
> > MultiByteToWideChar with the Windows-1252 codepage on a UTF-8 path.

We think that PATH is encoded in Windows-1252 codepage, and the
question is why and where do we err.  The above additional values I
ask about might help answer that question.

> > I dug around, and it looks like codepage_for_filenames (src/w32.c) at
> > some point returns the Windows-1252 codepage.
> > This is then passed to MultiByteToWideChar(), and the scenario that I
> > described above happens.
> > I've checked this hypothesis with API Monitor and this is what actually
> > happens; I can attach a trace if you find it useful.

Not necessary for now, thanks.

If I send you a C-level patch, are you able to build Emacs after
patching it, preferably the master branch of our Git repository?






* bug#75207: Fwd: bug#75207: 29.4; Path conversion from native codepage to UTF-8 fails when Windows is set by default to UTF-8
  2025-01-03 13:23       ` Eli Zaretskii
@ 2025-01-03 14:35         ` michal--- via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2025-01-03 15:25           ` Eli Zaretskii
  0 siblings, 1 reply; 6+ messages in thread
From: michal--- via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2025-01-03 14:35 UTC (permalink / raw)
  To: 'Eli Zaretskii'; +Cc: 75207

I've just built Emacs from a fairly recent revision (577714e3fe) and cannot reproduce it there.
The emacs-29.1 tag does not build out of the box on Windows, so I didn't check it.

My theory is that maybe the codepage of the machine Emacs was built on influences this?
Or this has just been fixed on the latest version.

I debugged a bit and it looks like w32_ansi_code_page is set to 1252 at some point.

> OK.  I think I see the problem (and it is not specific to UTF-8 codepage), but
> just to be sure, please show some more values:
> 
>   M-: w32-multibyte-code-page RET
>   M-: locale-coding-system RET
>   M-: file-name-coding-system RET
>   M-: default-file-name-coding-system RET
> 

M-: w32-multibyte-code-page -> 0
M-: locale-coding-system -> cp65001
M-: file-name-coding-system -> nil
M-: default-file-name-coding-system -> cp65001

> We think that PATH is encoded in Windows-1252 codepage, and the question
> is why and where do we err.  The above additional values I ask about might
> help answer that question.

I can say for sure that it is not: the API Monitor trace confirms this, as do some
basic Win32 programs.
getenv("PATH") returns the proper string, respecting the active code page.
 
> If I send you a C-level patch, are you able to build Emacs after patching it,
> preferably the master branch of our Git repository?

Sure.








* bug#75207: Fwd: bug#75207: 29.4; Path conversion from native codepage to UTF-8 fails when Windows is set by default to UTF-8
  2025-01-03 14:35         ` michal--- via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2025-01-03 15:25           ` Eli Zaretskii
  0 siblings, 0 replies; 6+ messages in thread
From: Eli Zaretskii @ 2025-01-03 15:25 UTC (permalink / raw)
  To: michal; +Cc: 75207

> Date: Fri, 03 Jan 2025 14:35:26 +0000
> From: michal@0lock.xyz
> Cc: 75207@debbugs.gnu.org
> 
> I've just built Emacs on somewhat new revision (577714e3fe) and cannot repro it there.
> Tag emacs-29.1 does not build by default on Windows so I didn't check.
> 
> My theory is that maybe the codepage of the machine Emacs was built on influences this??

Yes, it does, according to my reading of the code.  When we went from
unexec to pdumper builds, we introduced a bug whereby the relevant
variables are assigned values that come from the dump stage, and not
reinitialized after that.  If Emacs was dumped when the system
codepage was different, you will see problems when the dumped Emacs
starts with a different codepage, AFAICT.  As I said, this is not
limited to UTF-8, so it is good we found this problem.
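
The failure mode described above can be sketched in a few lines of Python. This is only an analogy, not Emacs code: a JSON string stands in for the pdumper file, and every function and key name here is hypothetical.

```python
import json

def dump_state(system_codepage):
    # "Dump stage": the build machine's codepage is baked into the image.
    return json.dumps({"code_page": system_codepage})

def start_without_reinit(image, system_codepage):
    # Buggy startup: trust whatever the image recorded.
    return json.loads(image)

def start_with_reinit(image, system_codepage):
    # Fixed startup: restore the image, then re-query the running system.
    state = json.loads(image)
    state["code_page"] = system_codepage
    return state

image = dump_state(system_codepage=1252)     # dumped under Windows-1252
buggy = start_without_reinit(image, system_codepage=65001)
fixed = start_with_reinit(image, system_codepage=65001)
print(buggy["code_page"], fixed["code_page"])  # prints: 1252 65001
```

The sketch also shows why the problem stays hidden when the build machine and the user's machine share a codepage: both startup variants then return the same value.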

> Or this has just been fixed on the latest version.

No, I don't think so.  I see the problem on the latest master branch.

> I debugged a bit and it looks like w32_ansi_code_page is set to 1252 at some point.

AFAICT, that happens when we load the pdumper file.

> M-: w32-multibyte-code-page -> 0
> M-: locale-coding-system -> cp65001
> M-: file-name-coding-system -> nil
> M-: default-file-name-coding-system -> cp65001

OK, I think this confirms my hypothesis.  I'll try to come up with a
patch, probably tomorrow.

> > We think that PATH is encoded in Windows-1252 codepage, and the question
> > is why and where do we err.  The above additional values I ask about might
> > help answer that question.
> 
> I can say for sure that it is not

When I say "we think", I mean Emacs thinks that, mistakenly.

> > If I send you a C-level patch, are you able to build Emacs after patching it,
> > preferably the master branch of our Git repository?
> 
> Sure.

OK, but you'll need to build Emacs with a different system codepage to
see the effects of the fix.





