unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
@ 2013-11-04 18:45 Glenn Morris
  2017-12-01  1:52 ` Glenn Morris
  0 siblings, 1 reply; 27+ messages in thread
From: Glenn Morris @ 2013-11-04 18:45 UTC (permalink / raw)
  To: 15803

Package: emacs
Version: 24.3

Split from http://debbugs.gnu.org/15260

Eli Zaretskii wrote:

> mule-cmds.el calls reset-language-environment, and language/english.el
> calls set-language-info-alist; both have the effect of resetting
> default-file-name-coding-system to latin-1 (!? an interesting
> "default" for a Unicode-era Emacs, perhaps Handa-san could comment why
> we still do that).

I know nothing about this, but eg glib defaults to utf-8, which seems
like a better default to me these days:

https://developer.gnome.org/glib/stable/glib-Character-Set-Conversion.html#file-name-encodings






* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2013-11-04 18:45 bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days? Glenn Morris
@ 2017-12-01  1:52 ` Glenn Morris
  2017-12-01  7:54   ` Eli Zaretskii
  0 siblings, 1 reply; 27+ messages in thread
From: Glenn Morris @ 2017-12-01  1:52 UTC (permalink / raw)
  To: 15803

Glenn Morris wrote:

>> mule-cmds.el calls reset-language-environment, and language/english.el
>> calls set-language-info-alist; both have the effect of resetting
>> default-file-name-coding-system to latin-1 (!? an interesting
>> "default" for a Unicode-era Emacs, perhaps Handa-san could comment why
>> we still do that).
>
> I know nothing about this, but eg glib defaults to utf-8, which seems
> like a better default to me these days:
>
> https://developer.gnome.org/glib/stable/glib-Character-Set-Conversion.html#file-name-encodings

... 4 years pass and latin-1 fails to make a comeback.

For some reason, I thought it was difficult to change the default to
utf-8 due to bootstrap ordering issues. This was probably prompted by
this comment in reset-language-environment:

  ;; On Darwin systems, this should be utf-8-unix, but when this file is loaded
  ;; that is not yet defined, so we set it in set-locale-environment instead.
  (setq default-file-name-coding-system 'iso-latin-1-unix)

But looking at it now, I cannot see what this comment is referring to.

If I change reset-language-environment so that it sets
default-file-name-coding-system (and default-sendmail-coding-system)
to 'utf-8, then a bootstrap works fine.

It looks like this stuff was all rewritten in Emacs 23.
Before that, there used to be international/utf-8.el,
which was indeed loaded after mule-cmds.
But since Emacs 23, mule-conf seems to define everything.
(But that rewrite seems to predate the above comment about Darwin...?)

So should the default finally be changed to utf-8?






* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2017-12-01  1:52 ` Glenn Morris
@ 2017-12-01  7:54   ` Eli Zaretskii
  2017-12-05  0:35     ` Glenn Morris
  0 siblings, 1 reply; 27+ messages in thread
From: Eli Zaretskii @ 2017-12-01  7:54 UTC (permalink / raw)
  To: Glenn Morris; +Cc: 15803

> From: Glenn Morris <rgm@gnu.org>
> Date: Thu, 30 Nov 2017 20:52:17 -0500
> 
> So should the default finally be changed to utf-8?

Perhaps on Posix systems, but not elsewhere.  And if we make the
change, we should make sure building Emacs in a non-ASCII directory
still works.

Btw, why does the default matter so much?  Once Emacs starts up
default-file-name-coding-system on GNU/Linux is set to UTF-8, if the
locale says so.  Is this just an aesthetic issue?
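
To make the locale-driven choice concrete, here is a minimal Python sketch of deriving a codeset from a POSIX locale string. It is only an illustration: the helper name `codeset_from_locale` and its `fallback` parameter are hypothetical, and Emacs's own logic in set-locale-environment is more involved than this.

```python
import codecs


def codeset_from_locale(locale_name, fallback="latin-1"):
    """Return the normalized codec name implied by a POSIX locale string.

    Hypothetical helper for illustration only; not how Emacs does it.
    """
    # Drop an optional "@modifier", then take what follows the first ".".
    base = locale_name.split("@", 1)[0]
    _, dot, codeset = base.partition(".")
    if not dot or not codeset:
        # No codeset in the locale name: fall back to the assumed default.
        codeset = fallback
    # codecs.lookup normalizes aliases such as "ISO-8859-1" and "UTF-8".
    return codecs.lookup(codeset).name


print(codeset_from_locale("en_US.UTF-8"))
print(codeset_from_locale("sv_SE.ISO-8859-1"))
```

With a locale such as en_US.UTF-8 the codeset is explicit, so the built-in default only matters when the locale names no codeset at all.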






* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2017-12-01  7:54   ` Eli Zaretskii
@ 2017-12-05  0:35     ` Glenn Morris
  2017-12-08  9:46       ` Eli Zaretskii
  2020-09-09 13:33       ` Stefan Kangas
  0 siblings, 2 replies; 27+ messages in thread
From: Glenn Morris @ 2017-12-05  0:35 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 15803

Eli Zaretskii wrote:

> Perhaps on Posix systems, but not elsewhere. 

I assume non-POSIX is newspeak for MS-Windows (native and DOS).

> And if we make the change, we should make sure building Emacs in a
> non-ASCII directory still works.

It works fine for me on G/L to have source, build, and install
directories be distinct non-ASCII directories. (Emacs works, that is,
but makeinfo 5.1 fails to find @include files in non-ASCII directories,
so I wonder how common such setups are.)


BTW, it feels very dated to me to have discussion of Windows 9X in the
Emacs manual section on file-name-coding.


diff --git i/doc/emacs/mule.texi w/doc/emacs/mule.texi
index 78f77cb..5fc44a6 100644
--- i/doc/emacs/mule.texi
+++ w/doc/emacs/mule.texi
@@ -1214,11 +1214,8 @@ system can encode.
 
   If @code{file-name-coding-system} is @code{nil}, Emacs uses a
 default coding system determined by the selected language environment,
-and stored in the @code{default-file-name-coding-system} variable.
-@c FIXME?  Is this correct?  What is the "default language environment"?
-In the default language environment, non-@acronym{ASCII} characters in
-file names are not encoded specially; they appear in the file system
-using the internal Emacs representation.
+and stored in the @code{default-file-name-coding-system} variable
+(normally UTF-8).
 
 @cindex file-name encoding, MS-Windows
 @vindex w32-unicode-filenames
diff --git i/lisp/international/mule-cmds.el w/lisp/international/mule-cmds.el
index 9d22d6e..192f0e9 100644
--- i/lisp/international/mule-cmds.el
+++ w/lisp/international/mule-cmds.el
@@ -1797,10 +1797,11 @@ The default status is as follows:
    'raw-text)
 
   (set-default-coding-systems nil)
-  (setq default-sendmail-coding-system 'iso-latin-1)
-  ;; On Darwin systems, this should be utf-8-unix, but when this file is loaded
-  ;; that is not yet defined, so we set it in set-locale-environment instead.
-  (setq default-file-name-coding-system 'iso-latin-1-unix)
+  (setq default-sendmail-coding-system 'utf-8)
+  (setq default-file-name-coding-system (if (memq system-type
+                                                  '(windows-nt ms-dos))
+                                            'iso-latin-1-unix
+                                          'utf-8-unix))
   ;; Preserve eol-type from existing default-process-coding-systems.
   ;; On non-unix-like systems in particular, these may have been set
   ;; carefully by the user, or by the startup code, to deal with the
@@ -1816,8 +1817,10 @@ The default status is as follows:
 	(input-coding
 	 (condition-case nil
 	     (coding-system-change-text-conversion
-	      (cdr default-process-coding-system) 'iso-latin-1)
-	   (coding-system-error 'iso-latin-1))))
+	      (cdr default-process-coding-system)
+	      (if (memq system-type '(windows-nt ms-dos)) 'iso-latin-1 'utf-8))
+	   (coding-system-error
+	    (if (memq system-type '(windows-nt ms-dos)) 'iso-latin-1 'utf-8)))))
     (setq default-process-coding-system
 	  (cons output-coding input-coding)))
 
diff --git i/lisp/mail/sendmail.el w/lisp/mail/sendmail.el
index cd80211..36fbb7d 100644
--- i/lisp/mail/sendmail.el
+++ w/lisp/mail/sendmail.el
@@ -993,7 +993,7 @@ but lower priority than the local value of `buffer-file-coding-system'.
 See also the function `select-message-coding-system'.")
 
 ;;;###autoload
-(defvar default-sendmail-coding-system 'iso-latin-1
+(defvar default-sendmail-coding-system 'utf-8
   "Default coding system for encoding the outgoing mail.
 This variable is used only when `sendmail-coding-system' is nil.
 
diff --git i/lisp/mh-e/mh-comp.el w/lisp/mh-e/mh-comp.el
index 98067ce..25118cd 100644
--- i/lisp/mh-e/mh-comp.el
+++ w/lisp/mh-e/mh-comp.el
@@ -304,6 +304,7 @@ message and scan line."
   (let ((draft-buffer (current-buffer))
         (file-name buffer-file-name)
         (config mh-previous-window-config)
+        ;; FIXME this is subtly different to select-message-coding-system.
         (coding-system-for-write
          (if (and (local-variable-p 'buffer-file-coding-system
                                     (current-buffer)) ;XEmacs needs two args
@@ -315,7 +316,7 @@ message and scan line."
            (or (and (boundp 'sendmail-coding-system) sendmail-coding-system)
                (and (default-boundp 'buffer-file-coding-system)
                     (default-value 'buffer-file-coding-system))
-               'iso-latin-1))))
+               'utf-8))))
     ;; Older versions of spost do not support -msgid and -mime.
     (unless mh-send-uses-spost-flag
       ;; Adding a Message-ID field looks good, makes it easier to search for






* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2017-12-05  0:35     ` Glenn Morris
@ 2017-12-08  9:46       ` Eli Zaretskii
  2017-12-12  1:38         ` Glenn Morris
  2020-09-09 13:33       ` Stefan Kangas
  1 sibling, 1 reply; 27+ messages in thread
From: Eli Zaretskii @ 2017-12-08  9:46 UTC (permalink / raw)
  To: Glenn Morris; +Cc: 15803

> From: Glenn Morris <rgm@gnu.org>
> Cc: 15803@debbugs.gnu.org
> Date: Mon, 04 Dec 2017 19:35:05 -0500
> 
> Eli Zaretskii wrote:
> 
> > Perhaps on Posix systems, but not elsewhere. 
> 
> I assume non-POSIX is newspeak for MS-Windows (native and DOS).

I didn't say "non-Posix"; you did.

MS-Windows is definitely not a Posix system, but whether it is the
only one, I don't know.  Are we sure all macOS/Darwin systems are
sufficiently Posix in this aspect?  AFAIR they use quite different
encoding methods for file names (canonical normalization etc.).

> > And if we make the change, we should make sure building Emacs in a
> > non-ASCII directory still works.
> 
> It works fine for me on G/L to have source, build, and install
> directories be distinct non-ASCII directories.

Was it in a UTF-8 locale or in a non-UTF-8 locale?  The latter is the
potentially problematic case, AFAIR.

> (Emacs works, that is,
> but makeinfo 5.1 fails to find @include files in non-ASCII directories,
> so I wonder how common such setups are.)

Building a release tarball doesn't require makeinfo.

> BTW, it feels very dated to me to have discussion of Windows 9X in the
> Emacs manual section on file-name-coding.

We still try to support it, and the aspects of file-name encoding
related to it are definitely non-trivial.  Everything described there
is in the code.

> diff --git i/doc/emacs/mule.texi w/doc/emacs/mule.texi
> index 78f77cb..5fc44a6 100644
> --- i/doc/emacs/mule.texi
> +++ w/doc/emacs/mule.texi
> @@ -1214,11 +1214,8 @@ system can encode.
>  
>    If @code{file-name-coding-system} is @code{nil}, Emacs uses a
>  default coding system determined by the selected language environment,
> -and stored in the @code{default-file-name-coding-system} variable.
> -@c FIXME?  Is this correct?  What is the "default language environment"?
> -In the default language environment, non-@acronym{ASCII} characters in
> -file names are not encoded specially; they appear in the file system
> -using the internal Emacs representation.
> +and stored in the @code{default-file-name-coding-system} variable
> +(normally UTF-8).

Not sure why you removed the sentence which had the FIXME comment.  Is
it in any way related to the issue at hand?

>  @cindex file-name encoding, MS-Windows
>  @vindex w32-unicode-filenames
> diff --git i/lisp/international/mule-cmds.el w/lisp/international/mule-cmds.el
> index 9d22d6e..192f0e9 100644
> --- i/lisp/international/mule-cmds.el
> +++ w/lisp/international/mule-cmds.el
> @@ -1797,10 +1797,11 @@ The default status is as follows:
>     'raw-text)
>  
>    (set-default-coding-systems nil)
> -  (setq default-sendmail-coding-system 'iso-latin-1)
> -  ;; On Darwin systems, this should be utf-8-unix, but when this file is loaded
> -  ;; that is not yet defined, so we set it in set-locale-environment instead.
> -  (setq default-file-name-coding-system 'iso-latin-1-unix)
> +  (setq default-sendmail-coding-system 'utf-8)
> +  (setq default-file-name-coding-system (if (memq system-type
> +                                                  '(windows-nt ms-dos))
> +                                            'iso-latin-1-unix
> +                                          'utf-8-unix))

Why are we changing sendmail-coding-system?  It has nothing to do with
file names, AFAIK.

>    ;; Preserve eol-type from existing default-process-coding-systems.
>    ;; On non-unix-like systems in particular, these may have been set
>    ;; carefully by the user, or by the startup code, to deal with the
> @@ -1816,8 +1817,10 @@ The default status is as follows:
>  	(input-coding
>  	 (condition-case nil
>  	     (coding-system-change-text-conversion
> -	      (cdr default-process-coding-system) 'iso-latin-1)
> -	   (coding-system-error 'iso-latin-1))))
> +	      (cdr default-process-coding-system)
> +	      (if (memq system-type '(windows-nt ms-dos)) 'iso-latin-1 'utf-8))
> +	   (coding-system-error
> +	    (if (memq system-type '(windows-nt ms-dos)) 'iso-latin-1 'utf-8)))))
>      (setq default-process-coding-system
>  	  (cons output-coding input-coding)))

And this changes the default encoding used to communicate with
sub-processes.  Why?  We never talked about a wholesale change of all
the defaults to UTF-8, that is a much more broad issue than just
encoding of file names.

> diff --git i/lisp/mh-e/mh-comp.el w/lisp/mh-e/mh-comp.el
> index 98067ce..25118cd 100644
> --- i/lisp/mh-e/mh-comp.el
> +++ w/lisp/mh-e/mh-comp.el
> @@ -304,6 +304,7 @@ message and scan line."
>    (let ((draft-buffer (current-buffer))
>          (file-name buffer-file-name)
>          (config mh-previous-window-config)
> +        ;; FIXME this is subtly different to select-message-coding-system.
>          (coding-system-for-write
>           (if (and (local-variable-p 'buffer-file-coding-system
>                                      (current-buffer)) ;XEmacs needs two args
> @@ -315,7 +316,7 @@ message and scan line."
>             (or (and (boundp 'sendmail-coding-system) sendmail-coding-system)
>                 (and (default-boundp 'buffer-file-coding-system)
>                      (default-value 'buffer-file-coding-system))
> -               'iso-latin-1))))
> +               'utf-8))))

Changes like that in MH-E should be communicated to the MH-E
developer; I'm not sure he is reading this list.

And you never answered my question about the rationale:

> Btw, why does the default matter so much?  Once Emacs starts up
> default-file-name-coding-system on GNU/Linux is set to UTF-8, if the
> locale says so.  Is this just an aesthetic issue?






* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2017-12-08  9:46       ` Eli Zaretskii
@ 2017-12-12  1:38         ` Glenn Morris
  2020-09-09 13:15           ` Lars Ingebrigtsen
  0 siblings, 1 reply; 27+ messages in thread
From: Glenn Morris @ 2017-12-12  1:38 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 15803

Eli Zaretskii wrote:

> Are we sure all macOS/Darwin systems are sufficiently Posix in this
> aspect?

Emacs on Darwin has been unconditionally using utf-8 for over a decade.
It's special-cased in mule-cmds, as visible in the diff I sent.

>> It works fine for me on G/L to have source, build, and install
>> directories be distinct non-ASCII directories.
>
> Was it in a UTF-8 locale or in a non-UTF-8 locale?  The latter is the
> potentially problematic case, AFAIR.

I had LANG=en_US.UTF-8. I've repeated with LANG=en_US. Still works.

>>    If @code{file-name-coding-system} is @code{nil}, Emacs uses a
>>  default coding system determined by the selected language environment,
>> -and stored in the @code{default-file-name-coding-system} variable.
>> -@c FIXME?  Is this correct?  What is the "default language environment"?
>> -In the default language environment, non-@acronym{ASCII} characters in
>> -file names are not encoded specially; they appear in the file system
>> -using the internal Emacs representation.
>> +and stored in the @code{default-file-name-coding-system} variable
>> +(normally UTF-8).
>
> Not sure why you removed the sentence which had the FIXME comment.  Is
> it in any way related to the issue at hand?

I wrote the FIXME comment. In 5 years, no-one has addressed it.
Defaulting to UTF-8 makes it no longer relevant, so it seems better to
remove it.

> Why are we changing sendmail-coding-system?  It has nothing to do with
> file names, AFAIK.

I'm changing all (3) things that currently default to latin-1 to default to
utf-8.

>> Btw, why does the default matter so much?  Once Emacs starts up
>> default-file-name-coding-system on GNU/Linux is set to UTF-8, if the
>> locale says so.  Is this just an aesthetic issue?

utf-8 is the sensible, "modern" (ie, non-ancient) default.
If there is no reason to use latin-1, Emacs should use utf-8.
I'm not claiming it's critical.

Take it or leave it, as you wish.






* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2017-12-12  1:38         ` Glenn Morris
@ 2020-09-09 13:15           ` Lars Ingebrigtsen
  2020-09-09 15:00             ` Eli Zaretskii
  0 siblings, 1 reply; 27+ messages in thread
From: Lars Ingebrigtsen @ 2020-09-09 13:15 UTC (permalink / raw)
  To: Glenn Morris; +Cc: 15803

Glenn Morris <rgm@gnu.org> writes:

> utf-8 is the sensible, "modern" (ie, non-ancient) default.
> If there is no reason to use latin-1, Emacs should use utf-8.
> I'm not claiming it's critical.
>
> Take it or leave it, as you wish.

That was the final message in the thread.  Glenn's patch from nearly
three years ago no longer applied, so I've respun it for Emacs 28 now
(included below).

Glenn's arguments make sense to me, but I'm not a domain expert here.
Does anybody object to applying this patch to Emacs 28?

diff --git a/doc/emacs/mule.texi b/doc/emacs/mule.texi
index 6eff0ca0d2..b78019020a 100644
--- a/doc/emacs/mule.texi
+++ b/doc/emacs/mule.texi
@@ -1215,11 +1215,8 @@ File Name Coding
 
   If @code{file-name-coding-system} is @code{nil}, Emacs uses a
 default coding system determined by the selected language environment,
-and stored in the @code{default-file-name-coding-system} variable.
-@c FIXME?  Is this correct?  What is the "default language environment"?
-In the default language environment, non-@acronym{ASCII} characters in
-file names are not encoded specially; they appear in the file system
-using the internal Emacs representation.
+and stored in the @code{default-file-name-coding-system} variable
+(normally UTF-8).
 
 @cindex file-name encoding, MS-Windows
 @vindex w32-unicode-filenames
diff --git a/lisp/international/mule-cmds.el b/lisp/international/mule-cmds.el
index ccc8ac9f9e..e3155dfc52 100644
--- a/lisp/international/mule-cmds.el
+++ b/lisp/international/mule-cmds.el
@@ -1799,13 +1799,11 @@ reset-language-environment
    'raw-text)
 
   (set-default-coding-systems nil)
-  (setq default-sendmail-coding-system 'iso-latin-1)
-  ;; On Darwin systems, this should be utf-8-unix, but when this file is loaded
-  ;; that is not yet defined, so we set it in set-locale-environment instead.
-  ;; [Actually, it seems to work fine to use utf-8-unix here, and not just
-  ;; on Darwin.  The previous comment seems to be outdated?
-  ;; See patch at https://debbugs.gnu.org/15803 ]
-  (setq default-file-name-coding-system 'iso-latin-1-unix)
+  (setq default-sendmail-coding-system 'utf-8)
+  (setq default-file-name-coding-system (if (memq system-type
+                                                  '(windows-nt ms-dos))
+                                            'iso-latin-1-unix
+                                          'utf-8-unix))
   ;; Preserve eol-type from existing default-process-coding-systems.
   ;; On non-unix-like systems in particular, these may have been set
   ;; carefully by the user, or by the startup code, to deal with the
@@ -1821,8 +1819,10 @@ reset-language-environment
 	(input-coding
 	 (condition-case nil
 	     (coding-system-change-text-conversion
-	      (cdr default-process-coding-system) 'iso-latin-1)
-	   (coding-system-error 'iso-latin-1))))
+	      (cdr default-process-coding-system)
+	      (if (memq system-type '(windows-nt ms-dos)) 'iso-latin-1 'utf-8))
+	   (coding-system-error
+	    (if (memq system-type '(windows-nt ms-dos)) 'iso-latin-1 'utf-8)))))
     (setq default-process-coding-system
 	  (cons output-coding input-coding)))
 
diff --git a/lisp/mail/sendmail.el b/lisp/mail/sendmail.el
index dd6eecbfd0..7610939e57 100644
--- a/lisp/mail/sendmail.el
+++ b/lisp/mail/sendmail.el
@@ -975,7 +975,7 @@ sendmail-coding-system
 See also the function `select-message-coding-system'.")
 
 ;;;###autoload
-(defvar default-sendmail-coding-system 'iso-latin-1
+(defvar default-sendmail-coding-system 'utf-8
   "Default coding system for encoding the outgoing mail.
 This variable is used only when `sendmail-coding-system' is nil.
 
diff --git a/lisp/mh-e/mh-comp.el b/lisp/mh-e/mh-comp.el
index f7e30bfbb3..8a69adbb75 100644
--- a/lisp/mh-e/mh-comp.el
+++ b/lisp/mh-e/mh-comp.el
@@ -305,6 +305,7 @@ mh-send-letter
   (let ((draft-buffer (current-buffer))
         (file-name buffer-file-name)
         (config mh-previous-window-config)
+        ;; FIXME this is subtly different to select-message-coding-system.
         (coding-system-for-write
          (if (fboundp 'select-message-coding-system)
              (select-message-coding-system) ; Emacs has this since at least 21.1
@@ -318,7 +319,7 @@ mh-send-letter
              (or (and (boundp 'sendmail-coding-system) sendmail-coding-system)
                  (and (default-boundp 'buffer-file-coding-system)
                       (default-value 'buffer-file-coding-system))
-                 'iso-latin-1)))))
+                 'utf-8)))))
     ;; Older versions of spost do not support -msgid and -mime.
     (unless mh-send-uses-spost-flag
       ;; Adding a Message-ID field looks good, makes it easier to search for

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2017-12-05  0:35     ` Glenn Morris
  2017-12-08  9:46       ` Eli Zaretskii
@ 2020-09-09 13:33       ` Stefan Kangas
  2020-09-09 15:09         ` Eli Zaretskii
  1 sibling, 1 reply; 27+ messages in thread
From: Stefan Kangas @ 2020-09-09 13:33 UTC (permalink / raw)
  To: Glenn Morris; +Cc: 15803

Glenn Morris <rgm@gnu.org> writes:

> BTW, it feels very dated to me to have discussion of Windows 9X in the
> Emacs manual section on file-name-coding.

Agreed.  Could we move this discussion to the MS Windows FAQ instead?






* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2020-09-09 13:15           ` Lars Ingebrigtsen
@ 2020-09-09 15:00             ` Eli Zaretskii
  2020-09-10 13:07               ` Lars Ingebrigtsen
  0 siblings, 1 reply; 27+ messages in thread
From: Eli Zaretskii @ 2020-09-09 15:00 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: rgm, 15803

> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: Eli Zaretskii <eliz@gnu.org>,  15803@debbugs.gnu.org
> Date: Wed, 09 Sep 2020 15:15:09 +0200
> 
> Glenn's arguments make sense to me, but I'm not a domain expert here.
> Does anybody object to applying this patch to Emacs 28?

Please try building Emacs from a pristine tarball or a clean
repository in a directory with non-ASCII characters, under a
non-UTF-8, non-C locale.  If that works, I think this is good to go.

Thanks.






* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2020-09-09 13:33       ` Stefan Kangas
@ 2020-09-09 15:09         ` Eli Zaretskii
  0 siblings, 0 replies; 27+ messages in thread
From: Eli Zaretskii @ 2020-09-09 15:09 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: rgm, 15803

> From: Stefan Kangas <stefan@marxist.se>
> Date: Wed, 9 Sep 2020 06:33:11 -0700
> Cc: Eli Zaretskii <eliz@gnu.org>, 15803@debbugs.gnu.org
> 
> Glenn Morris <rgm@gnu.org> writes:
> 
> > BTW, it feels very dated to me to have discussion of Windows 9X in the
> > Emacs manual section on file-name-coding.
> 
> Agreed.  Could we move this discussion to the MS Windows FAQ instead?

I don't think the FAQ is the right place for this information.  So no,
please don't move it to the FAQ.

But we could move this to the MS-Windows appendix, leaving a
cross-reference where the text is now.






* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2020-09-09 15:00             ` Eli Zaretskii
@ 2020-09-10 13:07               ` Lars Ingebrigtsen
  2020-09-10 14:39                 ` Eli Zaretskii
  0 siblings, 1 reply; 27+ messages in thread
From: Lars Ingebrigtsen @ 2020-09-10 13:07 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: rgm, 15803

[-- Attachment #1: Type: text/plain, Size: 737 bytes --]

Eli Zaretskii <eliz@gnu.org> writes:

> Please try building Emacs from a pristine tarball or a clean
> repository in a directory with non-ASCII characters, under a
> non-UTF-8, non-C locale.  If that works, I think this is good to go.

All the tools under Linux are so utf-8-focused these days...  let's
see...  I first, under a utf-8 locale created the directory "émacs",
then converted it to 8859-1:

[larsi@stories ~/src/emacs]$ convmv --notest -f UTF-8 -t ISO-8859-1 émacs 
mv "./émacs"	"./�macs"

Which ls displays, funnily enough, as:

-rw-r--r--  1 larsi larsi    0 Sep 10 14:50 ''$'\351''macs'

Then I did

export LANG=sv_SE.ISO-8859-1

and now the ls says the file is:


[-- Attachment #2: Type: image/png, Size: 5999 bytes --]

[-- Attachment #3: Type: text/plain, Size: 1236 bytes --]


And then I build Emacs there, and it seems to work fine.  Then I apply
the patch and say "make":

Loading /home/larsi/src/emacs/�*macs/lisp/subdirs.el (source)...
>>Error occurred processing ../lisp/international/mule-cmds.el: File is missing (("Opening input file" "No such file or directory" "/home/larsi/src/emacs/�*macs/lisp/international/mule-cmds.el"))
make[2]: *** [Makefile:279: ../lisp/international/mule-cmds.elc] Error 1
make[1]: *** [Makefile:784: ../lisp/international/mule-cmds.elc] Error 2
make[1]: Leaving directory '/home/larsi/src/emacs/�*macs/src'

So that fails pretty much immediately...

OK, let's try a make bootstrap...

And now building Emacs works fine.  So it seems like a make bootstrap is
necessary after applying the patch.

And starting Emacs works fine.

But "make check" fails miserably:

make[3]: *** [Makefile:165: src/eval-tests.elc] Error 1
  ELC      src/font-tests.elc
>>Error occurred processing src/fileio-tests.el: File is missing (("Doing chmod" "No such file or directory" "/home/larsi/src/emacs/\301\203*macs/test/src/fileio-tests.elc7HRcu0"))

So...

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no


* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2020-09-10 13:07               ` Lars Ingebrigtsen
@ 2020-09-10 14:39                 ` Eli Zaretskii
  2020-09-11 10:55                   ` Lars Ingebrigtsen
  0 siblings, 1 reply; 27+ messages in thread
From: Eli Zaretskii @ 2020-09-10 14:39 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: rgm, 15803

> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: rgm@gnu.org,  15803@debbugs.gnu.org
> Date: Thu, 10 Sep 2020 15:07:12 +0200
> 
> > Please try building Emacs from a pristine tarball or a clean
> > repository in a directory with non-ASCII characters, under a
> > non-UTF-8, non-C locale.  If that works, I think this is good to go.
> 
> All the tools under Linux are so utf-8-focused these days...  let's
> see...  I first, under a utf-8 locale created the directory "émacs",
> then converted it to 8859-1:

No, please create the directory with non-ASCII name _after_ switching
the locale to Latin-1.

> And then I build Emacs there, and it seems to work fine.  Then I apply
> the patch and say "make":
> 
> Loading /home/larsi/src/emacs/�*macs/lisp/subdirs.el (source)...
> >>Error occurred processing ../lisp/international/mule-cmds.el: File is missing (("Opening input file" "No such file or directory" "/home/larsi/src/emacs/�*macs/lisp/international/mule-cmds.el"))
> make[2]: *** [Makefile:279: ../lisp/international/mule-cmds.elc] Error 1
> make[1]: *** [Makefile:784: ../lisp/international/mule-cmds.elc] Error 2
> make[1]: Leaving directory '/home/larsi/src/emacs/�*macs/src'
> 
> So that fails pretty much immediately...
> 
> OK, let's try a make bootstrap...
> 
> And now building Emacs works fine.  So it seems like a make bootstrap is
> necessary after applying the patch.
> 
> And starting Emacs works fine.
> 
> But "make check" fails miserably:
> 
> make[3]: *** [Makefile:165: src/eval-tests.elc] Error 1
>   ELC      src/font-tests.elc
> >>Error occurred processing src/fileio-tests.el: File is missing (("Doing chmod" "No such file or directory" "/home/larsi/src/emacs/\301\203*macs/test/src/fileio-tests.elc7HRcu0"))
> 
> So...

This all happens because the directory name doesn't correspond to the
locale.  You need to create the directory in the 8859-1 locale.






* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2020-09-10 14:39                 ` Eli Zaretskii
@ 2020-09-11 10:55                   ` Lars Ingebrigtsen
  2020-09-11 11:05                     ` Eli Zaretskii
  0 siblings, 1 reply; 27+ messages in thread
From: Lars Ingebrigtsen @ 2020-09-11 10:55 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: rgm, 15803

Eli Zaretskii <eliz@gnu.org> writes:

>> All the tools under Linux are so utf-8-focused these days...  let's
>> see...  I first, under a utf-8 locale created the directory "émacs",
>> then converted it to 8859-1:
>
> No, please create the directory with non-ASCII name _after_ switching
> the locale to Latin-1.

Shouldn't the result be the same?  I.e., a directory with an iso-8859-1 name?
The reason I did it in this convoluted way was just that I couldn't
convince my system to make an 8859 name even after changing the locale.
That is, when I typed Alt-gr ' e, my terminal still sent over two bytes
(i.e., in utf-8) instead of a single-byte é.

But I think I know why "make check" was failing:

[larsi@stories ~/src/emacs/trunk]$ echo $LANG
sv_SE.ISO-8859-1
[larsi@stories ~/src/emacs/trunk]$ echo $LANG
en_US.UTF-8

The tests that were failing all talked about "chmod" and stuff, so I'm
guessing they were from a sub shell, and my system is apparently forcing
all new shells to use UTF-8...  And that was because I set the variables
in .bashrc.  I've now made them be 8859 also in sub-shells, but
unfortunately that doesn't help (it was a long shot, anyway -- these
aren't interactive shells, so .bashrc shouldn't be consulted).

make check:

>>Error occurred processing lisp/eshell/eshell-tests.el: File is missing (("Doing chmod" "No such file or directory" "/home/larsi/src/emacs/f\303\263o/test/lisp/eshell/eshell-tests.elcgtybBC"))

This time around, the directory is "fóo" (in latin-1), and it looks like
Emacs is trying to find the utf-8 version of the file name.
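
The mismatch can be checked by hand from the escaped path in the error message. The following Python sketch is only an illustration (the helper `describe` is a hypothetical name): it decodes the reported bytes and compares them with the Latin-1 bytes of the directory name.

```python
# Path fragment copied from the "make check" error, written with octal escapes.
reported = b"f\303\263o"

# The name of the directory as created on disk, encoded in Latin-1.
on_disk = "f\u00f3o".encode("latin-1")


def describe(raw):
    """Return the bytes as space-separated octal numbers."""
    return " ".join(format(byte, "o") for byte in raw)


print(describe(reported))         # bytes Emacs asked the file system for
print(describe(on_disk))          # bytes that are actually on disk
print(reported.decode("utf-8"))   # the reported bytes are valid UTF-8
```

The reported bytes decode as UTF-8 to the same three characters, but they are not the byte sequence stored in the directory entry, which is consistent with the "No such file or directory" error.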

So it looks like the patch set has problems, and needs further fixes.
(Or "make check" has some problems here, since Emacs otherwise seems to
work fine.)

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2020-09-11 10:55                   ` Lars Ingebrigtsen
@ 2020-09-11 11:05                     ` Eli Zaretskii
  2020-09-11 11:27                       ` Lars Ingebrigtsen
  0 siblings, 1 reply; 27+ messages in thread
From: Eli Zaretskii @ 2020-09-11 11:05 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: rgm, 15803

> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: rgm@gnu.org,  15803@debbugs.gnu.org
> Date: Fri, 11 Sep 2020 12:55:55 +0200
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> All the tools under Linux are so utf-8-focused these days...  let's
> >> see...  I first, under a utf-8 locale created the directory "émacs",
> >> then converted it to 8859-1:
> >
> > No, please create the directory with non-ASCII name _after_ switching
> > the locale to Latin-1.
> 
> Shouldn't the result be the same?  I.e., a directory with an iso-8859-1 name?

No, because the Linux file I/O APIs are encoding-agnostic, they will
(AFAIK) create the directory with a name that is the exact byte stream
that you type at the mkdir command (or at the Emacs make-directory).
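
A rough illustration of that point in Python (a sketch for this
thread, not anything from the Emacs sources): the same visible name
"émacs" corresponds to two different byte strings depending on what
the terminal sends, and a byte-oriented file system would treat them
as two distinct names.

```python
# The directory name as the user sees it; \u00e9 is "é".
name = "\u00e9macs"

utf8_name = name.encode("utf-8")      # what a UTF-8 terminal sends
latin1_name = name.encode("latin-1")  # what a Latin-1 terminal sends

print(utf8_name)                         # b'\xc3\xa9macs'
print(latin1_name)                       # b'\xe9macs'
print(len(utf8_name), len(latin1_name))  # 6 5
```

Switching the locale does not change which of these byte strings
reaches mkdir; only the program producing the bytes does.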

> The reason I did it in this convoluted way was just that I couldn't
> convince my system to make an 8859 name even after changing the locale.
> That is, when I typed Alt-gr ' e, my terminal still sent over two bytes
> (i.e., in utf-8) instead of a single-byte é.

Try doing this in Emacs, and use one of the Latin input methods if the
keyboard doesn't cooperate.

> But I think I know why "make check" was failing:
> 
> [larsi@stories ~/src/emacs/trunk]$ echo $LANG
> sv_SE.ISO-8859-1
> [larsi@stories ~/src/emacs/trunk]$ echo $LANG
> en_US.UTF-8

I don't understand this: 2 identical commands one after the other
yield different results?

> The tests that were failing all talked about "chmod" and stuff, so I'm
> guessing they were from a sub shell, and my system is apparently forcing
> all new shells to use UTF-8...

Really?  So there's no way to change the locale to something
non UTF-8?

> make check:
> 
> >>Error occurred processing lisp/eshell/eshell-tests.el: File is missing (("Doing chmod" "No such file or directory" "/home/larsi/src/emacs/f\303\263o/test/lisp/eshell/eshell-tests.elcgtybBC"))
> 
> This time over, the directory is "fóo" (in latin-1), and that looks like
> Emacs is trying to find the utf-8 version of the file name.

If that's the case, then we lack ENCODE_FILE (or more generally don't
encode a file name) somewhere.

> So it looks like the patch set has problems, and needs further fixes.
> (Or "make check" has some problems here, since Emacs otherwise seems to
> work fine.)

We could also just install the changes and wait for bug reports, on
the assumption that the problems you see aren't real.  Your call.






* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2020-09-11 11:05                     ` Eli Zaretskii
@ 2020-09-11 11:27                       ` Lars Ingebrigtsen
  2020-09-11 12:24                         ` Eli Zaretskii
  0 siblings, 1 reply; 27+ messages in thread
From: Lars Ingebrigtsen @ 2020-09-11 11:27 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: rgm, 15803

Eli Zaretskii <eliz@gnu.org> writes:

>> But I think I know why "make check" was failing:
>> 
>> [larsi@stories ~/src/emacs/trunk]$ echo $LANG
>> sv_SE.ISO-8859-1
>> [larsi@stories ~/src/emacs/trunk]$ echo $LANG
>> en_US.UTF-8
>
> I don't understand this: 2 identical commands one after the other
> yield different results?

Sorry, there was a "bash" started in between there.

>> This time over, the directory is "fóo" (in latin-1), and that looks like
>> Emacs is trying to find the utf-8 version of the file name.
>
> If that's the case, then we lack ENCODE_FILE (or more generally don't
> encode a file name) somewhere.

After instrumenting bytecomp (i.e., adding a bunch of messages), I see
what function is actually failing.  With this in byte-compile-file:

                  (message "foo2: %S" (prin1-to-string tempfile))
		  (unless (= temp-modes desired-modes)
		    (set-file-modes tempfile desired-modes 'nofollow))
                  (message "foo1: %S" (prin1-to-string tempfile))

I get this output:

make[1]: Entering directory '/home/larsi/src/emacs/f�o/test'
  ELC      lisp/eshell/eshell-tests.elc
foo2: "#(\"/home/larsi/src/emacs/fóo/test/lisp/eshell/eshell-tests.elcnjDFYY\" 0 65 (charset iso-8859-1))"
>>Error occurred processing lisp/eshell/eshell-tests.el: File is missing (("Doing chmod" "No such file or directory" "/home/larsi/src/emacs/f\303\263o/test/lisp/eshell/eshell-tests.elcnjDFYY"))
make[1]: *** [Makefile:165: lisp/eshell/eshell-tests.elc] Error 1

So it's created a tempfile, tagged with the correct charset (I had no
idea that that's how it worked), but decoded, and then set-file-modes
interprets that as a UTF-8 file name.

So...  it's a bug in set-file-modes?  Hm, nope, write-region has the
same problem.

That weird file name (decoded and tagged with a charset text parameter)
comes from make-temp-file -- everything seems to be OK before that.
target-file is:

foo: "\"/home/larsi/src/emacs/f\\363o/test/lisp/eshell/eshell-tests.elc\""

which seems to be correct, but

		       (tempfile
			(make-temp-file (expand-file-name target-file)))

is

"#(\"/home/larsi/src/emacs/fóo/test/lisp/eshell/eshell-tests.elcnjDFYY\" 0 65 (charset iso-8859-1))"

and then things fail.  Which makes me wonder why building Emacs at all
works if it's such a fundamental problem...  Just to check whether my
system is switching the LANG back to utf-8:

          (message "foo: %S" (getenv "LC_ALL"))

in byte-compile-file says

foo: "sv_SE.ISO-8859-1"

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2020-09-11 11:27                       ` Lars Ingebrigtsen
@ 2020-09-11 12:24                         ` Eli Zaretskii
  2020-09-11 12:33                           ` Lars Ingebrigtsen
  2020-09-11 12:39                           ` Lars Ingebrigtsen
  0 siblings, 2 replies; 27+ messages in thread
From: Eli Zaretskii @ 2020-09-11 12:24 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: rgm, 15803

> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: rgm@gnu.org,  15803@debbugs.gnu.org
> Date: Fri, 11 Sep 2020 13:27:28 +0200
> 
> make[1]: Entering directory '/home/larsi/src/emacs/f�o/test'
>   ELC      lisp/eshell/eshell-tests.elc
> foo2: "#(\"/home/larsi/src/emacs/fóo/test/lisp/eshell/eshell-tests.elcnjDFYY\" 0 65 (charset iso-8859-1))"
> >>Error occurred processing lisp/eshell/eshell-tests.el: File is missing (("Doing chmod" "No such file or directory" "/home/larsi/src/emacs/f\303\263o/test/lisp/eshell/eshell-tests.elcnjDFYY"))
> make[1]: *** [Makefile:165: lisp/eshell/eshell-tests.elc] Error 1
> 
> So it's created a tempfile, tagged with the correct charset (I had no
> idea that that's how it worked), but decoded, and then set-file-modes
> interprets that as a UTF-8 file name.
> 
> So...  it's a bug in set-file-modes?  Hm, nope, write-region has the
> same problem.

There be dragons ;-)

The problematic aspect of debugging these problems is that what you
see is not always what's there, due to display and decoding/encoding
operations by both Emacs and the display software you have on your
system (which drives the terminal).

In particular, strings inside Emacs are always in UTF-8-compatible
encoding, so the fact you get UTF-8 in *Messages* doesn't prove
anything.  What we need is to find 2 types of possible problems:

  . raw bytes from Latin-1 encoding inside Emacs buffers or strings
    that are supposed to be decoded
  . UTF-8 encoded (instead of Latin-1 encoded) characters passed to
    libc functions

So if you found that the problem reveals itself in set-file-modes,
let's see what happens there.  The relevant code is this:

  char *fname = SSDATA (ENCODE_FILE (absname));
  mode_t imode = XFIXNUM (mode) & 07777;
  if (fchmodat (AT_FDCWD, fname, imode, nofollow) != 0)
    report_file_error ("Doing chmod", absname);

Please either run this under GDB, or add printf's, to show the byte
sequences of 'absname' and of 'fname'.  The former should be in UTF-8
(so you should see 0xC3 and 0xB3 for the ó character), the latter
should be in Latin-1 (so you should see 0xF3 for the same letter).
This should give us some hints wrt where to look for the cause of the
problem.
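
To make the expected byte sequences concrete, here is a small Python
model of the check (a sketch only: encode_file below is a hypothetical
stand-in for what ENCODE_FILE is expected to do, not the real macro,
and the on-disk name is assumed to be Latin-1):

```python
def encode_file(name: str, file_name_coding: str) -> bytes:
    """Hypothetical stand-in for ENCODE_FILE: encode a decoded file
    name with the coding system used for file names."""
    return name.encode(file_name_coding)

absname = "/home/larsi/src/emacs/f\u00f3o/test"  # decoded name in Emacs
on_disk = b"/home/larsi/src/emacs/f\xf3o/test"   # assumed Latin-1 on disk

good = encode_file(absname, "latin-1")
bad = encode_file(absname, "utf-8")

print(good == on_disk)      # True:  0xF3 reaches fchmodat
print(bad == on_disk)       # False: 0xC3 0xB3 reaches fchmodat
print(b"\xc3\xb3" in bad)   # True
```

If 'fname' looks like the second case, the name is being encoded with
the wrong coding system (or not encoded at all) before the libc call.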

> That weird file name (decoded and tagged with a charset text parameter)
> comes from make-temp-file -- everything seems to be OK before that.
> target-file is:
> 
> foo: "\"/home/larsi/src/emacs/f\\363o/test/lisp/eshell/eshell-tests.elc\""
> 
> which seems to be correct,

Where does the "foo:" printout come from?  I wouldn't expect to see
Latin-1 encoded strings inside Emacs, not normally anyway.

> but
> 
> 		       (tempfile
> 			(make-temp-file (expand-file-name target-file)))
> 
> is
> 
> "#(\"/home/larsi/src/emacs/fóo/test/lisp/eshell/eshell-tests.elcnjDFYY\" 0 65 (charset iso-8859-1))"

I see nothing wrong here: this is how decoding works in Emacs.  And
again, how did you produce this string?  As I explained above, the
details of how you display these strings matter in this case.






* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2020-09-11 12:24                         ` Eli Zaretskii
@ 2020-09-11 12:33                           ` Lars Ingebrigtsen
  2020-09-11 12:41                             ` Eli Zaretskii
  2020-09-11 12:39                           ` Lars Ingebrigtsen
  1 sibling, 1 reply; 27+ messages in thread
From: Lars Ingebrigtsen @ 2020-09-11 12:33 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: rgm, 15803

Eli Zaretskii <eliz@gnu.org> writes:

> So if you found that the problem reveals itself in set-file-modes,
> let's see what happens there.  The relevant code is this:

Yeah, I don't think that function is the problem in itself, but I don't
know where the problem originates either.

>> foo: "\"/home/larsi/src/emacs/f\\363o/test/lisp/eshell/eshell-tests.elc\""
>> 
>> which seems to be correct,
>
> Where does the "foo:" printout come from?  I wouldn't expect to see
> Latin-1 encoded strings inside Emacs, not normally anyway.

I just added a bunch of

          (message "foo: %S" variable)

here and there in byte-compile-file to watch how the passed-in string is
transformed. 

>> 		       (tempfile
>> 			(make-temp-file (expand-file-name target-file)))
>> 
>> is
>> 
>> "#(\"/home/larsi/src/emacs/fóo/test/lisp/eshell/eshell-tests.elcnjDFYY\" 0 65 (charset iso-8859-1))"
>
> I see nothing wrong here: this is how decoding works in Emacs.  And
> again, how did you produce this string?  As I explained above, the
> details of how you display these strings matter in this case.

Same way as above.

The file name is in the "f\\363o/test" form until make-temp-file, and
then it turns into a different string with a text property.  But I don't
know how much this is an artefact of how Emacs prints these things and
how much it's actually, er...  actual.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2020-09-11 12:24                         ` Eli Zaretskii
  2020-09-11 12:33                           ` Lars Ingebrigtsen
@ 2020-09-11 12:39                           ` Lars Ingebrigtsen
  2020-09-11 12:45                             ` Eli Zaretskii
  1 sibling, 1 reply; 27+ messages in thread
From: Lars Ingebrigtsen @ 2020-09-11 12:39 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: rgm, 15803

[-- Attachment #1: Type: text/plain, Size: 513 bytes --]

Another confusing data point.  If I say "make" in the test directory, I
get:

foo 1: "\"/home/larsi/src/emacs/f\\363o/test/lisp/eshell/eshell-tests.elc\""
foo 2: "#(\"/home/larsi/src/emacs/fóo/test/lisp/eshell/eshell-tests.elcGvbK3T\" 0 65 (charset iso-8859-1))"

If I just say "make" in the main directory, I get this:

foo 1: "\"/home/larsi/src/emacs/f�o/lisp/dos-w32.elc\""
foo 2: "\"/home/larsi/src/emacs/fóo/lisp/dos-w32.elcXgukAl\""

Or, if that doesn't survive emailing, here's an image:


[-- Attachment #2: Type: image/png, Size: 12699 bytes --]

[-- Attachment #3: Type: text/plain, Size: 1125 bytes --]


Note -- no text properties, and not represented as "f\363o".

*scratches head*

So is this a problem with how ert calls the byte compiler after all?

This is with

diff --git a/lisp/emacs-lisp/bytecomp.el b/lisp/emacs-lisp/bytecomp.el
index 966990bac9..07448033ac 100644
--- a/lisp/emacs-lisp/bytecomp.el
+++ b/lisp/emacs-lisp/bytecomp.el
@@ -1990,6 +1990,7 @@ byte-compile-file
 	(with-current-buffer output-buffer
 	  (goto-char (point-max))
 	  (insert "\n")			; aaah, unix.
+          (message "foo 1: %S" (prin1-to-string (expand-file-name target-file)))
 	  (if (file-writable-p target-file)
 	      ;; We must disable any code conversion here.
 	      (progn
@@ -2007,6 +2008,7 @@ byte-compile-file
 			(cons (lambda () (ignore-errors
 					   (delete-file tempfile)))
 			      kill-emacs-hook)))
+		  (message "foo 2: %S" (prin1-to-string tempfile))
 		  (unless (= temp-modes desired-modes)
 		    (set-file-modes tempfile desired-modes 'nofollow))
 		  (write-region (point-min) (point-max) tempfile nil 1)


-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no



* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2020-09-11 12:33                           ` Lars Ingebrigtsen
@ 2020-09-11 12:41                             ` Eli Zaretskii
  2020-09-11 14:18                               ` Lars Ingebrigtsen
  0 siblings, 1 reply; 27+ messages in thread
From: Eli Zaretskii @ 2020-09-11 12:41 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: rgm, 15803

> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: rgm@gnu.org,  15803@debbugs.gnu.org
> Date: Fri, 11 Sep 2020 14:33:08 +0200
> 
> The file name is in the "f\\363o/test" form until make-temp-file

That shouldn't happen.  It probably means we lack a DECODE_FILE
somewhere.  File names inside Emacs should always be decoded into
UTF-8.
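
As an analogy only (this is Python, not Emacs internals): a name read
from disk has to be decoded with the coding system it was written in.
Decoding Latin-1 bytes as Latin-1 yields the character ó; treating
them as UTF-8 leaves an undecodable byte behind, which is roughly
what a printed "f\363o" suggests.

```python
raw = b"f\xf3o"  # Latin-1 bytes as stored on disk (\363 in octal)

# Decoded with the right coding system: a proper character string.
decoded = raw.decode("latin-1")
print(decoded == "f\u00f3o")  # True

# Decoded with the wrong one: the byte cannot be interpreted.  Python's
# surrogateescape handler keeps it as a placeholder so it round-trips.
undecoded = raw.decode("utf-8", errors="surrogateescape")
print(ascii(undecoded))       # 'f\udcf3o'
print(undecoded.encode("utf-8", errors="surrogateescape") == raw)  # True
```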

> and
> then it turns into a different string with a text property.  But I don't
> know how much this is an artefact of how Emacs prints these things and
> how much it's actually, er...  actual.

The only way to know is to add printf's or look in GDB.






* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2020-09-11 12:39                           ` Lars Ingebrigtsen
@ 2020-09-11 12:45                             ` Eli Zaretskii
  0 siblings, 0 replies; 27+ messages in thread
From: Eli Zaretskii @ 2020-09-11 12:45 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: rgm, 15803

> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: rgm@gnu.org,  15803@debbugs.gnu.org
> Date: Fri, 11 Sep 2020 14:39:07 +0200
> 
> So is this a problem with how ert calls the byte compiler after all?

I don't think so, but I'm not sure.  It could be some shenanigans of
expand-file-name, for example: it has its own ideas for when to
produce a unibyte string and when a multibyte string.

Again, the fact that "foo 1" displays a unibyte undecoded file name
sounds wrong to me.  Is target-file also a unibyte Latin-1 string?






* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2020-09-11 12:41                             ` Eli Zaretskii
@ 2020-09-11 14:18                               ` Lars Ingebrigtsen
  2020-09-11 14:27                                 ` Lars Ingebrigtsen
  0 siblings, 1 reply; 27+ messages in thread
From: Lars Ingebrigtsen @ 2020-09-11 14:18 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: rgm, 15803

I'm just poking around to see what's different between the way the files
are compiled in the test directory and the lisp directory, because they
should either both fail or not.

So here's how "make" in test does it:

EMACSLOADPATH= LC_ALL=C EMACS_TEST_DIRECTORY=/home/larsi/src/emacs/f�o/test  "../src/emacs" --module-assertions --no-init-file --no-site-file --no-site-lisp -L ":."  --batch -f batch-byte-compile lisp/eshell/eshell-tests.el

Here's how "make" in Lisp does it:

EMACSLOADPATH= '../src/emacs' -batch --no-site-file --no-site-lisp --eval '(setq load-prefer-newer t)'  -f batch-byte-compile emacs-lisp/bytecomp.el

And, indeed, if I remove "LC_ALL=C" from the line, then this compiles
successfully.
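
The reason a stray LC_ALL=C matters is that, under POSIX rules, LC_ALL
overrides both LC_CTYPE and LANG.  A minimal Python sketch of that
lookup order (effective_ctype is a hypothetical helper written for
this note, and the environments are made up; it models only the
precedence, not what Emacs does with the result):

```python
def effective_ctype(env: dict) -> str:
    """Return the locale name that governs character encoding,
    following POSIX precedence: LC_ALL, then LC_CTYPE, then LANG."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:  # unset or empty variables are skipped
            return value
    return "C"  # common fallback; the default is implementation-defined

shell_env = {"LANG": "sv_SE.ISO-8859-1"}  # assumed interactive shell
test_env = dict(shell_env, LC_ALL="C")    # what the test command line adds

print(effective_ctype(shell_env))  # sv_SE.ISO-8859-1
print(effective_ctype(test_env))   # C
```

So whatever LANG says in the shell, the Emacs started with LC_ALL=C
runs in the C locale.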

*phew*

Hm...  in fact, everything compiles successfully without LC_ALL?

However, when the tests run (in the latin-1 environment) 11 test files fail:

SUMMARY OF TEST RESULTS
-----------------------
Files examined: 305
Ran 4200 tests, 4097 results as expected, 29 unexpected, 74 skipped
1 files did not contain any tests:
  src/emacs-module-tests.log
11 files contained unexpected results:
  src/regex-emacs-tests.log
  lisp/vc/vc-bzr-tests.log
  lisp/vc/diff-mode-tests.log
  lisp/time-stamp-tests.log
  lisp/net/shr-tests.log
  lisp/gnus/mml-sec-tests.log
  lisp/epg-tests.log
  lisp/emacs-lisp/package-tests.log
  lisp/emacs-lisp/faceup-tests/faceup-test-files.log
  lisp/cedet/semantic-utest-ia.log
  lib-src/emacsclient-tests.log

As a comparison, removing the LC_ALL in a utf-8 environment (with a
pure-ascii path) gives me:

SUMMARY OF TEST RESULTS
-----------------------
Files examined: 305
Ran 4231 tests, 4150 results as expected, 6 unexpected, 75 skipped
6 files contained unexpected results:
  src/emacs-module-tests.log
  src/callint-tests.log
  lisp/vc/vc-bzr-tests.log
  lisp/subr-tests.log
  lisp/files-tests.log
  lisp/emacs-lisp/gv-tests.log

The bzr test fails because of the brz/bzr thing, but the LC_ALL is
apparently needed for the other five things.

So: In conclusion, I think Glenn's patch needs more work before
applying.  :-)  But at least we now know that it breaks, and why (well,
for some of it).

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no







* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2020-09-11 14:18                               ` Lars Ingebrigtsen
@ 2020-09-11 14:27                                 ` Lars Ingebrigtsen
  2020-09-11 14:46                                   ` Eli Zaretskii
  0 siblings, 1 reply; 27+ messages in thread
From: Lars Ingebrigtsen @ 2020-09-11 14:27 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: rgm, 15803

Lars Ingebrigtsen <larsi@gnus.org> writes:

> And, indeed, if I remove "LC_ALL=C" from the line, then this compiles
> successfully.

Oh, wow.  Apparently nobody is using non-ASCII in their Emacs paths?  I
just did a "mv trunk góo" on my laptop (UTF-8 environment), nothing
altered from out-of-the-box on Debian bullseye, and make check:

>>Error occurred processing lisp/emacs-lisp/regexp-opt-tests.el: File is missing (("Doing chmod" "No such file or directory" "/home/larsi/src/emacs/g\303\203\302\263o/test/lisp/emacs-lisp/regexp-opt-tests.elc15Rc5M"))
make[3]: *** [Makefile:165: lisp/emacs-lisp/regexp-opt-tests.elc] Error 1

for all the files.

So the LC_ALL=C thing in the compilation phase is just...  wrong?
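
One way to read the bytes in that error (an interpretation, not
something confirmed in this thread): \303\203\302\263 is what results
when the UTF-8 bytes of ó are mistaken for Latin-1 characters and then
encoded as UTF-8 a second time.  A short Python check:

```python
utf8_once = "\u00f3".encode("utf-8")  # ó as stored on disk: b'\xc3\xb3'

# Misread those two bytes as two Latin-1 characters, then encode again.
utf8_twice = utf8_once.decode("latin-1").encode("utf-8")

print(utf8_twice)                                  # b'\xc3\x83\xc2\xb3'
print(" ".join(f"\\{b:03o}" for b in utf8_twice))  # \303 \203 \302 \263
```

The four bytes match the escaped name in the message, so the path
looks double-encoded rather than merely encoded in the wrong charset.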

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2020-09-11 14:27                                 ` Lars Ingebrigtsen
@ 2020-09-11 14:46                                   ` Eli Zaretskii
  2020-09-11 14:54                                     ` Lars Ingebrigtsen
  0 siblings, 1 reply; 27+ messages in thread
From: Eli Zaretskii @ 2020-09-11 14:46 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: rgm, 15803

> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: rgm@gnu.org,  15803@debbugs.gnu.org
> Date: Fri, 11 Sep 2020 16:27:30 +0200
> 
> >>Error occurred processing lisp/emacs-lisp/regexp-opt-tests.el: File is missing (("Doing chmod" "No such file or directory" "/home/larsi/src/emacs/g\303\203\302\263o/test/lisp/emacs-lisp/regexp-opt-tests.elc15Rc5M"))
> make[3]: *** [Makefile:165: lisp/emacs-lisp/regexp-opt-tests.elc] Error 1
> 
> for all the files.
> 
> So the LC_ALL=C thing in the compilation phase is just...  wrong?

It's probably not TRT when the directory is non-ASCII.  But note that
you can say

   make check TEST_LOCALE=<whatever>

Does it help to use the locale you have set?

"git log -L" indicates that the default setting of TEST_LOCALE=C was
introduced in commit 4874f0b.  It would be interesting to see what the
tests mentioned in the log message of that commit yield if the locale
is not C.






* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2020-09-11 14:46                                   ` Eli Zaretskii
@ 2020-09-11 14:54                                     ` Lars Ingebrigtsen
  2020-09-11 15:11                                       ` Eli Zaretskii
  0 siblings, 1 reply; 27+ messages in thread
From: Lars Ingebrigtsen @ 2020-09-11 14:54 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: rgm, 15803

Eli Zaretskii <eliz@gnu.org> writes:

> It's probably not TRT when the directory is non-ASCII.

Sure.

> But note that you can say
>
>    make check TEST_LOCALE=<whatever>
>
> Does it help to use the locale you have set?

That allows the files to be compiled, but some tests fail:

Files examined: 305
Ran 4241 tests, 4197 results as expected, 5 unexpected, 39 skipped
5 files contained unexpected results:
  src/emacs-module-tests.log
  src/callint-tests.log
  lisp/subr-tests.log
  lisp/net/tramp-archive-tests.log
  lisp/emacs-lisp/gv-tests.log

> "git log -L" indicates that the default setting of TEST_LOCALE=C was
> introduced in commit 4874f0b.  It would be interesting to see what the
> tests mentioned in the log message of that commit yield if the locale
> is not C.

Hm...  seems like that commit just made it optional.  Looks like the
LC_ALL=C has been there from the very beginning, which means that in all
these years, nobody has tried "make check" with non-ASCII chars in their
paths.  :-)

commit d221e7808c01fdc9234734f95ecf49e902085ddd
Author:     Christian Ohler <ohler@gnu.org>
AuthorDate: Thu Jan 13 03:08:24 2011 +1100
Commit:     Christian Ohler <ohler@gnu.org>
CommitDate: Thu Jan 13 03:08:24 2011 +1100

    Add ERT, a tool for automated testing in Emacs Lisp.
    
    * Makefile.in, configure.in, doc/misc/Makefile.in, doc/misc/makefile.w32-in:
    Add ERT.  Make "make check" run tests in test/automated.
    
    * doc/misc/ert.texi, lisp/emacs-lisp/ert.el, lisp/emacs-lisp/ert-x.el:
    New files.
    
    * test/automated: New directory.

diff --git a/test/automated/Makefile.in b/test/automated/Makefile.in
--- /dev/null
+++ b/test/automated/Makefile.in
@@ -0,0 +47,2 @@
+# The actual Emacs command run in the targets below.
+emacs = EMACSLOADPATH=$(lispsrc):$(test) LC_ALL=C $(EMACS) $(EMACSOPT)


-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2020-09-11 14:54                                     ` Lars Ingebrigtsen
@ 2020-09-11 15:11                                       ` Eli Zaretskii
  2020-09-12  8:47                                         ` Michael Albinus
  2020-09-12 11:21                                         ` Lars Ingebrigtsen
  0 siblings, 2 replies; 27+ messages in thread
From: Eli Zaretskii @ 2020-09-11 15:11 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: rgm, 15803

> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: rgm@gnu.org,  15803@debbugs.gnu.org
> Date: Fri, 11 Sep 2020 16:54:46 +0200
> 
> >    make check TEST_LOCALE=<whatever>
> >
> > Does it help to use the locale you have set?
> 
> That allows the files to be compiled, but some tests fail:
> 
> Files examined: 305
> Ran 4241 tests, 4197 results as expected, 5 unexpected, 39 skipped
> 5 files contained unexpected results:
>   src/emacs-module-tests.log
>   src/callint-tests.log
>   lisp/subr-tests.log
>   lisp/net/tramp-archive-tests.log
>   lisp/emacs-lisp/gv-tests.log

Maybe these tests expect some special locale.  For example,
emacs-module-tests could expect UTF-8, since we don't support
non-UTF-8 strings in modules.

Anyway, I think if this is down to a couple of tests, we can install
the changes, as the problems they uncover are elsewhere.






* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2020-09-11 15:11                                       ` Eli Zaretskii
@ 2020-09-12  8:47                                         ` Michael Albinus
  2020-09-12 11:21                                         ` Lars Ingebrigtsen
  1 sibling, 0 replies; 27+ messages in thread
From: Michael Albinus @ 2020-09-12  8:47 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: rgm, Lars Ingebrigtsen, 15803

Eli Zaretskii <eliz@gnu.org> writes:

>> That allows the files to be compiled, but some tests fail:
>>
>> Files examined: 305
>> Ran 4241 tests, 4197 results as expected, 5 unexpected, 39 skipped
>> 5 files contained unexpected results:
>>   src/emacs-module-tests.log
>>   src/callint-tests.log
>>   lisp/subr-tests.log
>>   lisp/net/tramp-archive-tests.log
>>   lisp/emacs-lisp/gv-tests.log
>
> Maybe these tests expect some special locale.  For example,
> emacs-module-tests could expect UTF-8, since we don't support
> non-UTF-8 strings in modules.

UTF-8 is also required for tramp-archive-tests, IIRC (not checked actually).

> Anyway, I think if this is down to a couple of tests, we can install
> the changes, as the problems they uncover are elsewhere.

Agreed. If needed, I could adapt tramp-archive-tests. I cannot speak for
the other tests.

Best regards, Michael.






* bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days?
  2020-09-11 15:11                                       ` Eli Zaretskii
  2020-09-12  8:47                                         ` Michael Albinus
@ 2020-09-12 11:21                                         ` Lars Ingebrigtsen
  1 sibling, 0 replies; 27+ messages in thread
From: Lars Ingebrigtsen @ 2020-09-12 11:21 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: rgm, 15803

Eli Zaretskii <eliz@gnu.org> writes:

> Maybe these tests expect some special locale.  For example,
> emacs-module-tests could expect UTF-8, since we don't support
> non-UTF-8 strings in modules.
>
> Anyway, I think if this is down to a couple of tests, we can install
> the changes, as the problems they uncover are elsewhere.

Yeah, that's true -- since "make check" has seemingly never worked well
with a non-ASCII path, then the patch doesn't really regress anything
much (although the number of tests that fail with non-ASCII paths
increases).

OK, I'll apply the patch (after test-compiling on a couple systems), and
open a new bug report for the non-ASCII path/"make check" thing.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






end of thread, other threads:[~2020-09-12 11:21 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
2013-11-04 18:45 bug#15803: default-file-name-coding-system: utf-8 better than latin-1 these days? Glenn Morris
2017-12-01  1:52 ` Glenn Morris
2017-12-01  7:54   ` Eli Zaretskii
2017-12-05  0:35     ` Glenn Morris
2017-12-08  9:46       ` Eli Zaretskii
2017-12-12  1:38         ` Glenn Morris
2020-09-09 13:15           ` Lars Ingebrigtsen
2020-09-09 15:00             ` Eli Zaretskii
2020-09-10 13:07               ` Lars Ingebrigtsen
2020-09-10 14:39                 ` Eli Zaretskii
2020-09-11 10:55                   ` Lars Ingebrigtsen
2020-09-11 11:05                     ` Eli Zaretskii
2020-09-11 11:27                       ` Lars Ingebrigtsen
2020-09-11 12:24                         ` Eli Zaretskii
2020-09-11 12:33                           ` Lars Ingebrigtsen
2020-09-11 12:41                             ` Eli Zaretskii
2020-09-11 14:18                               ` Lars Ingebrigtsen
2020-09-11 14:27                                 ` Lars Ingebrigtsen
2020-09-11 14:46                                   ` Eli Zaretskii
2020-09-11 14:54                                     ` Lars Ingebrigtsen
2020-09-11 15:11                                       ` Eli Zaretskii
2020-09-12  8:47                                         ` Michael Albinus
2020-09-12 11:21                                         ` Lars Ingebrigtsen
2020-09-11 12:39                           ` Lars Ingebrigtsen
2020-09-11 12:45                             ` Eli Zaretskii
2020-09-09 13:33       ` Stefan Kangas
2020-09-09 15:09         ` Eli Zaretskii
